
Advanced Mathematical Statistics I

(Estimation)

Bing-Yi JING

Math, HKUST
Email: majing@ust.hk

February 21, 2018


Contents

1 Probability Models
  1.1 Motivations
  1.2 Models
  1.3 Special families of distributions

2 Principles of Data Reduction
  2.1 Sufficiency
  2.2 Factorization Theorem
  2.3 Minimal Sufficiency
  2.4 Minimal sufficiency for special families
    2.4.1 Minimal Sufficiency for Exponential Family of Full Rank
    2.4.2 Minimal Sufficiency for Location Family
  2.5 Ancillary Statistics
  2.6 Complete Statistics
  2.7 Basu Theorem
  2.8 Exercises

3 Unbiased Estimation
  3.1 Optimality Criterion
  3.2 Mean square error (MSE)
  3.3 UMVUE
  3.4 Searching UMVUE's by sufficiency and completeness
  3.5 Steps to find UMVUE
  3.6 UMVUE in Normal Linear Models
  3.7 UMVUE may fail for small n
  3.8 Exercises

4 The Method of Maximum Likelihood
  4.1 Maximum Likelihood Estimate (MLE)
  4.2 Invariance property of MLE
    4.2.1 g(·) is a 1-to-1 mapping
    4.2.2 g(·) is not 1-to-1
  4.3 Examples
  4.4 Numerical solution to likelihood equations
  4.5 Certain "Issues" with MLE
    4.5.1 MLE may not be unique
    4.5.2 MLE may not be asymptotically normal
    4.5.3 MLE can be inconsistent
  4.6 Exercises

5 Efficient Estimation
  5.1 Information inequality (or Cramer-Rao lower bound)
  5.2 Efficiency, sub-efficiency, and super-efficiency
  5.3 Asymptotic efficiency
    5.3.1 Asymptotic Cramer-Rao lower bound
  5.4 Consistency of the MLE
  5.5 MLE is asymptotically efficient
  5.6 Some comparisons about the UMVUE and MLE's

6 Transformation
  6.1 Why transformations?
  6.2 Box-Cox transform
  6.3 Review of most common transformations
  6.4 Preliminaries
  6.5 The δ-method
    6.5.1 Examples
  6.6 Variance-stabilizing transformations
    6.6.1 An example: the sample correlation coefficient
  6.7 Exercises

7 Quantiles
  7.1 Definition
  7.2 Consistency of sample quantiles
  7.3 Asymptotic normality of the sample quantiles
  7.4 Bahadur representation
  7.5 A more general definition of quantiles
    7.5.1 Alternative definition of quantiles
  7.6 Median or Quantile (Robust) Regression
  7.7 Exercises

8 M-Estimation
  8.1 Definition
  8.2 Examples of M-estimators
  8.3 Consistency and asymptotic normality of M-estimates
  8.4 Asymptotic Relative Efficiency (ARE)
  8.5 Robustness vs. Efficiency
  8.6 Generalized Estimating Equations (GEE)
  8.7 Exercises

9 U-Statistics
  9.1 U-statistics
  9.2 Examples
  9.3 Hoeffding decomposition
    9.3.1 Asymptotic Normality
  9.4 Jackknife variance estimates for U-statistics

10 Jackknife
  10.1 The traditional approach
  10.2 Jackknife
    10.2.1 Definition
    10.2.2 Heuristic justification
    10.2.3 The examples revisited
  10.3 Application of the Jackknife to U- and V-statistics
    10.3.1 Reduce bias for V-statistics by jackknife
    10.3.2 Use the jackknife to estimate variance for U-statistics
  10.4 Application of the jackknife to the smooth function of means models
  10.5 Jackknife to linear regression models
  10.6 Failure of the jackknife
    10.6.1 An example
    10.6.2 Main results
    10.6.3 Delete-d jackknife
    10.6.4 A useful lemma
  10.7 Application of jackknife in a real example
  10.8 A quick summary
  10.9 Exercises

11 Bootstrap
  11.1 The bootstrap (or plug-in, or substitution method)
  11.2 Monte Carlo simulations
  11.3 Application of the bootstrap to quantiles
  11.4 Bootstrap approximations to d.f.'s
    11.4.1 First-order accurate or consistent
  11.5 Bootstrap variance approximations for functions of means
  11.6 Bootstrap confidence intervals
  11.7 Exercises

12 Bayes Rules and Empirical Bayes
  12.1 Introduction
  12.2 Bayes estimator
  12.3 Properties of Bayes Estimators
  12.4 Empirical Bayes
  12.5 Empirical Bayes and the James-Stein Estimator
  12.6 Using SURE to find Estimators
  12.7 James-Stein Estimator, shrinking to a non-zero point

Chapter 1

Probability Models

Statistical inference is based on probability models, which, broadly speaking, can be classified into

• Parametric models

• Non-parametric models

1.1 Motivations

Given a set of data,

X_i = (X_{i1}, X_{i2}, ..., X_{ip}), i = 1, ..., n,

we wish to extract useful information about the underlying population. For instance, we may collect information on n people consisting of

(Age, Gender, Occupation, Income, Address, Health, ...)

The data can be written in an n × p matrix:

X = ( X_11  X_12  ...  X_1p
       .     .          .
       .     .          .
      X_n1  X_n2  ...  X_np )

Some of the questions might include:

• what is the distribution of each of the p features?

• what are the relationships between the features?

• can we predict one of the components from the rest?

• can we explain the variations?

To answer these questions, we need first to introduce probability models.

2
1.2 Models

Parametric models

Parametric models are families of distribution functions (d.f.'s) indexed by a finite-dimensional parameter θ:

F = {F_θ : θ ∈ Θ},

where Θ ⊂ R^d is the parameter space.

Examples

1. A random sample:

   X_i ~ N(µ, σ²) i.i.d., i = 1, ..., n,

   where θ = (µ, σ²).

2. Linear regression:

   Y_i ~ N(α + βᵀx_i, σ²) independently, i = 1, ..., n,

   ⇐⇒

   Y_i = α + βᵀx_i + ε_i,  ε_i ~ N(0, σ²), i = 1, ..., n,

   where θ = (α, β_1, ..., β_p, σ²).

Pros and Cons

• simple, easy to understand and interpret (once a model is chosen, one only needs to estimate the parameter)

• prone to model mis-specification (not robust)

Non-parametric models

Non-parametric models are families of distributions which cannot be characterised by a finite-dimensional parameter.

Examples

1. A random sample:

   the X_i's are i.i.d. with EX_i² < ∞.

2. Linear regression:

   Y_i = α + βx_i + ε_i, i = 1, ..., n,

   where the ε_i are i.i.d. For instance,

   • E(ε_i) = 0 and Var(ε_i) = σ², or
   • median(ε_i) = 0 and IQR(ε_i) = σ².

3. Non-linear regression:

   Y_i = g_1(x_{i1}) + ··· + g_p(x_{ip}) + ε_i,  ε_i ~ (0, σ²).

Pros and Cons

• more flexible, but harder to interpret

• prone to overfitting

Semi-parametric models

These are mixtures of parametric and nonparametric models. They can be thought of as including parametric and nonparametric models as special cases; on the other hand, they can also be regarded as nonparametric models.

e.g. Cox's model: h(t|x) = λ(t) e^{βᵀx}.

Which model to use?

These models all have their advantages and disadvantages, and are useful in different situations. Exactly which model to use depends purely on the problem at hand. Semi-parametric models, such as the Cox model, sit between the two extremes:

parametric —— semi-parametric —— non-parametric

We will study parametric models first before moving on to nonparametric models.

1.3 Special families of distributions

Exponential family

Definition 1.1 (Exponential family)

1. F is a k-parameter exponential family (written as Exp(k)) if

   f_θ(x) = exp{η(θ) · T(x) − B(θ)} h(x),

   where η(θ) = (η_1(θ), ..., η_k(θ)) and T(x) = (T_1(x), ..., T_k(x)).

   Note that B(θ) = ln ∫ exp{∑_{j=1}^k η_j(θ) T_j(x)} h(x) dx, since ∫ f_θ(x) dx = 1.
2. Canonical family: reparametrizing η_j = η_j(θ), we get the canonical family

   f_η(x) = exp{η · T(x) − A(η)} h(x),

   where η = (η_1, ..., η_k) is called the natural parameter.

3. The natural parameter space

   Ξ = { η = (η_1, ..., η_k) : ∫ exp{∑_{j=1}^k η_j T_j(x)} h(x) dx < ∞ }

   is convex. But the parameter space

   Θ = {θ : (η_1(θ), ..., η_k(θ)) ∈ Ξ}

   is not necessarily convex.

4. A canonical family is of full rank if

   • it is minimal (i.e., neither the T's nor the η's satisfy a linear constraint), and
   • Ξ contains a k-dimensional rectangle.

Examples of exponential families

Examples of exponential families include the Gaussian, Gamma, Binomial, Poisson, etc. Examples of non-exponential families include the Student-t, Uniform(0, θ), Cauchy, and double-exponential families.

Example 1.1 Normal family N(µ, σ²):

f_θ(x) = (√(2π) σ)^{−n} e^{−∑_{i=1}^n (x_i − µ)²/(2σ²)} = C(θ) exp{ (µ/σ²) ∑_{i=1}^n x_i − (1/(2σ²)) ∑_{i=1}^n x_i² }.

So θ = (µ, σ²), k = 2,

η(θ) = (η_1, η_2) = ( µ/σ², −1/(2σ²) )

and

T(x) = (T_1(x), T_2(x)) = ( ∑_{i=1}^n x_i, ∑_{i=1}^n x_i² ).

The family is of full rank, since Ξ is a lower half-space of R², which contains a 2-dim rectangle.

Example 1.2 Let X_1, ..., X_n ~ N(θ, θ²). Then

f_θ(x) = (√(2π) |θ|)^{−n} e^{−∑_{i=1}^n (x_i − θ)²/(2θ²)} = C(θ) exp{ (1/θ) ∑_{i=1}^n x_i − (1/(2θ²)) ∑_{i=1}^n x_i² }.

So k = 2,

η(θ) = (η_1, η_2) = ( 1/θ, −1/(2θ²) )

and

T(x) = (T_1(x), T_2(x)) = ( ∑_{i=1}^n x_i, ∑_{i=1}^n x_i² ).

The family is not of full rank, since Ξ is a curve in R², which does NOT contain a 2-dim rectangle.

The exponential family is closed under independent, marginal, or conditional operations

Theorem 1.1 If X, Y ∈ Exp(k) and X ⊥ Y (independent), then (X, Y) ∈ Exp.

Theorem 1.2 If X_1, ..., X_n ~ Exp(k) i.i.d., then (X_1, ..., X_n) ∈ Exp(k), since

f_θ(x) = ∏_{i=1}^n f_θ(x_i) = exp{ η(θ) · [∑_{i=1}^n T(x_i)] − nB(θ) } ∏_{i=1}^n h(x_i).

The proofs are trivial and omitted here.

T(X) also lies in an exponential family

Theorem 1.3 If X ∈ Exp(k), so is T(X) = (T_1(X), ..., T_k(X)):

f_T(t) = exp{η(θ) · t − B(θ)} h_0(t).

Proof. For simplicity, we only prove this for discrete r.v.'s. For t = (t_1, ..., t_k),

P(T = t) = ∑_{T(x)=t} P(X = x)
         = ∑_{T(x)=t} exp{η(θ) · T(x) − B(θ)} h(x)
         = exp{η(θ) · t − B(θ)} ∑_{T(x)=t} h(x)
         = exp{η(θ) · t − B(θ)} h_0(t),

where h_0(t) = ∑_{T(x)=t} h(x).

Calculating Moments of Exponential Families

Theorem 1.4 Let X ~ f_η(x) = exp{η · T(x) − A(η)} h(x). Then the moment-generating function (m.g.f.) of T(X) is

M(s) := E e^{s·T(X)} = exp[A(s + η) − A(η)].

Proof. Note that

M(s) = E e^{s·T(X)} = ∫ e^{s·T(x)} f_η(x) dx = ∫ e^{(s+η)·T(x) − A(η)} h(x) dx
     = e^{A(s+η) − A(η)} ∫ e^{(s+η)·T(x) − A(s+η)} h(x) dx
     = e^{A(s+η) − A(η)},

since the last integrand is a p.d.f. So the cumulant generating function (c.g.f.) is C(s) := log M(s) = A(s + η) − A(η). Hence

E(T(X)) = C′(s)|_{s=0} = A′(η),    Var(T(X)) = C″(s)|_{s=0} = A″(η).

It is clear that an exponential family requires the existence of all moments (very thin-tailed distributions). This excludes cases like the Student-t, Cauchy, etc.
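Theorem 1.4 can be checked numerically. Below is a minimal sketch for the Poisson(λ) family written in canonical form (η = log λ, T(x) = x, A(η) = e^η, h(x) = 1/x!); the function names and the particular values of λ and s are illustrative choices, not from the notes.

```python
import math

# Poisson(lam) in canonical exponential-family form:
#   f_eta(x) = exp{eta*x - A(eta)} h(x), eta = log(lam), A(eta) = e^eta, h(x) = 1/x!.
# Theorem 1.4: M(s) = exp[A(s + eta) - A(eta)] = exp(lam*(e^s - 1)),
# with E T = A'(eta) = lam and Var T = A''(eta) = lam.

def poisson_mgf_direct(s, lam, terms=100):
    """E[e^{sX}] computed by summing exp(s*x) against the Poisson pmf."""
    total, p = 0.0, math.exp(-lam)      # p = P(X = 0)
    for x in range(terms):
        total += math.exp(s * x) * p
        p *= lam / (x + 1)              # P(X = x+1) = P(X = x) * lam/(x+1)
    return total

def poisson_mgf_theorem(s, lam):
    eta = math.log(lam)
    A = math.exp                        # A(eta) = e^eta for the Poisson family
    return math.exp(A(s + eta) - A(eta))

lam, s = 3.0, 0.4
assert abs(poisson_mgf_direct(s, lam) - poisson_mgf_theorem(s, lam)) < 1e-9

# Moments from the c.g.f. C(s) = A(s + eta) - A(eta), by numerical differentiation:
h = 1e-5
C = lambda u: math.log(poisson_mgf_theorem(u, lam))
mean = (C(h) - C(-h)) / (2 * h)              # C'(0)  = A'(eta)  = lam
var = (C(h) - 2 * C(0.0) + C(-h)) / h ** 2   # C''(0) = A''(eta) = lam
assert abs(mean - lam) < 1e-3 and abs(var - lam) < 1e-2
```

Both the directly summed m.g.f. and the moments recovered from the c.g.f. agree with the theorem's closed form.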

Location-scale family

Definition 1.2 Let X ~ F(x) be any c.d.f. Then we define:

Location family: {F(x − a) : a ∈ R}   (i.e., X + a ~ F(x − a));

Scale family: {F(x/b) : b > 0}   (i.e., bX ~ F(x/b));

Location-scale family: {F((x − a)/b) : a ∈ R, b > 0}   (i.e., a + bX ~ F((x − a)/b));

where a is called the location parameter and b > 0 the scale parameter.

Let f(x) be the p.d.f. corresponding to F(x). Then we have:

Location family: {f(x − a) : a ∈ R};

Scale family: {(1/b) f(x/b) : b > 0};

Location-scale family: {(1/b) f((x − a)/b) : a ∈ R, b > 0}.

Note that a and b are NOT necessarily the mean and standard deviation, which may not even exist, e.g., for F = Cauchy.
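The rule a + bX ~ F((x − a)/b) is easy to check empirically. A minimal sketch, taking F to be the standard normal c.d.f. (the values of a, b and the evaluation points are illustrative):

```python
import math
import random

# Check the location-scale rule: if X ~ F, then P(a + bX <= x) = F((x - a)/b).
# Here F is the standard normal c.d.f., expressed through math.erf.

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

random.seed(0)
a, b, n = 2.0, 3.0, 200_000
sample = [a + b * random.gauss(0.0, 1.0) for _ in range(n)]   # draws of a + bX

for x in (-1.0, 2.0, 5.0):
    empirical = sum(s <= x for s in sample) / n    # empirical P(a + bX <= x)
    theoretical = std_normal_cdf((x - a) / b)      # F((x - a)/b)
    assert abs(empirical - theoretical) < 0.01
```

At x = a the empirical c.d.f. is close to F(0) = 1/2, as the rule predicts.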

Lemma 1.1 If U_1, ..., U_n ~ Unif(0, 1), then

U_(k) ~ Beta(k, n − k + 1), where 1 ≤ k ≤ n,

with EU_(k) = k/(n + 1).

Example 1.3 It follows easily from the above that

1. If X_1, ..., X_n ~ U(0, θ) (a scale family), then X_i/θ ~ Unif(0, 1), and hence

   EX_(k) = kθ/(n + 1).

2. If Y_1, ..., Y_n ~ U(θ, θ + 1) (a location family), then Y_i − θ ~ Unif(0, 1), and hence

   EY_(k) = θ + k/(n + 1).
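The order-statistic mean in Example 1.3 can be confirmed by simulation. A small Monte Carlo sketch (the choices of θ, n, k and the replication count are illustrative):

```python
import random

# Monte Carlo check of Example 1.3(1): for X_1, ..., X_n ~ U(0, theta),
# the k-th order statistic satisfies E X_(k) = k*theta/(n + 1).

random.seed(1)
theta, n, k, reps = 5.0, 10, 3, 100_000

total = 0.0
for _ in range(reps):
    sample = sorted(random.uniform(0.0, theta) for _ in range(n))
    total += sample[k - 1]            # X_(k), 1-indexed

mc_mean = total / reps
exact = k * theta / (n + 1)           # = 3*5/11
assert abs(mc_mean - exact) < 0.02
```

The same scheme, with `theta + random.random()` draws, verifies Example 1.3(2) as well.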

Chapter 2

Principles of Data Reduction

Given a parametric family

X := (X_1, ..., X_n) ~ F_θ i.i.d.,

we wish to find an "optimal" or "near-optimal" estimator of θ (or, more generally, of g(θ)). There are several key concepts.

• Sufficiency principle
  One can reduce X to a "sufficient" statistic T(X) without losing information about θ. That is, T(X) contains all the information about θ. (If we cannot reduce it any further, we get a minimal sufficient statistic.)

• Ancillary statistics
  These contain no information about θ. Properly chosen, they prove very useful in (conditional) statistical inference. This is the opposite of sufficiency.

• Basu's Theorem
  Sufficient and ancillary statistics are independent for a complete family.

2.1 Sufficiency

Definition 2.1 T (X) is sufficient for θ if the distribution of X conditional on T (X) does
not depend on θ.

(That is, once T is known, the data X contains no more information about θ.)

Example 2.1

1. T(X) = X is always sufficient.

   Proof. P(X = x | T = t) = P(X = x | X = t) = I{x = t} does not depend on θ.

2. T = (X_(1), ..., X_(n)) is sufficient if F_θ is continuous (i.e., no ties a.s.).

   Proof. Now

   P(X = x | T = t) = P((X_1, ..., X_n) = (x_1, ..., x_n) | (X_(1), ..., X_(n)) = (t_1, ..., t_n))
                    = (1/n!) I{ {x_1, ..., x_n} = {t_1, ..., t_n} },

   not depending on θ.

   (Continuity of the d.f. ensures no ties. One can remove this assumption by the Factorization Theorem later.)

3. T = ∑_{i=1}^n X_i is sufficient if X ~ Bernoulli(p).

   Proof. T ~ Bin(n, p).

   • If ∑_{i=1}^n x_i ≠ t, then P(X = x | T(X) = t) = 0.
   • If ∑_{i=1}^n x_i = t, then

     P(X = x | T(X) = t) = P(X = x, T(X) = t) / P(∑_{i=1}^n X_i = t)          (1.1)
                         = P(X = x) / P(∑_{i=1}^n X_i = t)                     (1.2)
                           (as {X = x} ⊂ {T(X) = t})                           (1.3)
                         = ∏_{i=1}^n P(X_i = x_i) / P(∑_{i=1}^n X_i = t)
                         = p^{∑ x_i} (1 − p)^{n − ∑ x_i} / [ C(n, t) p^t (1 − p)^{n−t} ]
                         = 1 / C(n, t).

   Hence, T is sufficient.
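The Bernoulli calculation can be checked by simulation: the conditional distribution of X given T = t should put mass 1/C(n, t) on each arrangement, for every p. A minimal sketch (the helper name, sample sizes and the two values of p are illustrative choices):

```python
import math
import random

# For X_1, ..., X_n ~ Bernoulli(p), estimate P(X = x | sum X_i = t) and check
# that it equals 1/C(n, t) regardless of p (sufficiency of T = sum X_i).

def cond_prob(p, n, t, x, reps=200_000, seed=0):
    """Monte Carlo estimate of P(X = x | T = t)."""
    rng = random.Random(seed)
    hits_t, hits_x = 0, 0
    for _ in range(reps):
        sample = tuple(1 if rng.random() < p else 0 for _ in range(n))
        if sum(sample) == t:
            hits_t += 1
            hits_x += (sample == x)
    return hits_x / hits_t

n, t = 5, 2
x = (1, 1, 0, 0, 0)                    # one arrangement with sum = t
target = 1 / math.comb(n, t)           # = 1/10
for p in (0.3, 0.7):
    assert abs(cond_prob(p, n, t, x) - target) < 0.01
```

The estimated conditional probability stays near 1/10 whether p = 0.3 or p = 0.7: once T is known, the data carry no further information about p.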

2.2 Factorization Theorem

An easy way to find sufficient statistics is offered by the Factorization Theorem.

Theorem 2.1 (Factorization Theorem) T(X) is sufficient for θ iff

f_θ(x) = g(T(x), θ) h(x).

(Therefore, sufficiency makes the joint p.d.f., or likelihood, as simple as possible.)

Proof. We prove it for the discrete case only. Note that P(X = x | T = t) = P(X = x, T(X) = t) / P(T(X) = t).

• Suppose f_θ(x) = P(X = x) = g(T(x), θ) h(x).

  If T(x) ≠ t, then P(X = x | T(X) = t) = 0.

  If T(x) = t, then, similarly to (1.1)–(1.3), P(X = x | T(X) = t) becomes

  P(X = x) / P(T(X) = t) = g(t, θ) h(x) / [ g(t, θ) ∑_{T(y)=t} h(y) ] = h(x) / ∑_{T(y)=t} h(y),

  which is free of θ. Hence T is sufficient.

• Suppose T(X) is sufficient. Then

  f_θ(x) = P(X = x) = P(X = x, T(X) = T(x))   (as {X = x} ⊂ {T(X) = T(x)})
         = P(T(X) = T(x)) P(X = x | T(X) = T(x))
         =: g(T(x), θ) h(x),

  where sufficiency implies that h(x) is free of θ.

Example 2.2 Let f_θ(x) = ∏_{i=1}^n f_θ(x_i).

1. If X ~ Bernoulli(p), then

   f_θ(x) = ∏_{i=1}^n p^{x_i} (1 − p)^{1−x_i} = p^{∑ x_i} (1 − p)^{n − ∑ x_i}.

   So T = ∑_{i=1}^n X_i is sufficient.

2. If X ~ U[0, θ], then

   f_θ(x) = ∏_{i=1}^n θ^{−1} I{0 ≤ x_i ≤ θ} = θ^{−n} I{0 ≤ x_(1) ≤ x_(n) ≤ θ} = θ^{−n} I{0 ≤ x_(1)} I{x_(n) ≤ θ}.

   So T = X_(n) is sufficient.

2.3 Minimal Sufficiency

Minimal sufficient statistics provide the maximal reduction of the data.

Definition 2.2 T (X) is minimal sufficient if, given any other sufficient statistic S(X),
T (X) is a function of S(X).

Theorem 2.2 Let T = l(S).

1. If T is sufficient, then so is S.

2. If l is 1-1, then T is (minimal) sufficient iff S is (minimal) sufficient.

Proof.

1. f_θ(x) = g_θ(T(x)) h(x) = g_θ(l(S(x))) h(x), so S(X) is sufficient (by the Factorization Theorem).

2. This follows from (1) and the definition.

Examples

Example 2.3 Clearly, (X_(1), ..., X_(n)) is sufficient, so the original data (X_1, ..., X_n) is never minimal sufficient (when n ≥ 2).

Example 2.4 X ~ U(θ, θ + 1), i.e., f_θ(x) = I{θ < x < θ + 1}. Then

f_θ(x) = ∏_{i=1}^n I{θ < x_i < θ + 1} = I{θ < x_(1) ≤ x_(n) < θ + 1} = I{x_(n) − 1 < θ < x_(1)}.      (3.4)

By the Factorization Theorem, T = (X_(1), X_(n)) is sufficient.

To show minimal sufficiency, suppose that S(X) is also sufficient. By the Factorization Theorem,

f_θ(x) = g(S(x), θ) h(x).

We only consider those x such that h(x) > 0, so all conclusions hold a.s. From (3.4),

x_(1) = sup{θ : f_θ(x) > 0} = sup{θ : g(S(x), θ) > 0},
x_(n) = inf{θ : f_θ(x) > 0} + 1 = inf{θ : g(S(x), θ) > 0} + 1.

So T(x) = (x_(1), x_(n)) is a function of S(x). Hence, T(X) is minimal sufficient.

An equivalent minimal sufficient statistic is T′ = (R, M), where

R = X_(n) − X_(1)  and  M = (X_(n) + X_(1))/2

are the range and mid-range, respectively.

A simple rule to find minimal sufficient statistics

Sufficient statistics can be obtained easily by the Factorization Theorem. One can then apply the following result to verify minimal sufficiency.

Theorem 2.3 Suppose

1. T(X) is sufficient;

2. for all x, y: f_θ(x)/f_θ(y) is free of θ =⇒ T(x) = T(y).

Then T(X) is minimal sufficient.

Proof. Let T̃(X) be any sufficient statistic. For each x ∈ X, we have a pair (T̃(x), T(x)). To show that T(x) is a function of T̃(x), it suffices to prove that

T̃(x) = T̃(y) =⇒ T(x) = T(y).      (3.5)

By the Factorization Theorem, f_θ(x) = g̃_θ(T̃(x)) h̃(x). Now if T̃(x) = T̃(y), then

f_θ(x)/f_θ(y) = g̃_θ(T̃(x)) h̃(x) / [ g̃_θ(T̃(y)) h̃(y) ] = h̃(x)/h̃(y),

which is free of θ. By assumption, we get T(x) = T(y), which proves (3.5). So T(X) is minimal sufficient.

Examples

Example 2.5 Find minimal sufficient statistics for the following cases:

1. X ~ U(θ, θ + 1);

2. X ~ U(0, θ);

3. X ~ U(0, θ) with θ ≥ 1.

Solution.

1. First, f_θ(x) = I{x_(n) − 1 < θ < x_(1)}, so T(x) = (x_(1), x_(n)) is sufficient by the Factorization Theorem. If

   r(θ) ≡ f_θ(x)/f_θ(y) = I{x_(n) − 1 < θ < x_(1)} / I{y_(n) − 1 < θ < y_(1)}

   is free of θ, we must have T(x) = T(y). Thus, T = (X_(1), X_(n)) is minimal sufficient.

   (For example, if x_(1) > y_(1) and x_(n) < y_(n), then r(θ) takes the values 0, 1 and ∞ depending on where θ lies, so r(θ) is not free of θ.)

2. First, f_θ(x) = θ^{−1} I{0 < x < θ} for a single observation, so

   f_θ(x) = θ^{−n} I(0 < x_(1) ≤ x_(n) < θ) = θ^{−n} I(0 < x_(1)) I(x_(n) < θ).

   Hence T(x) = x_(n) is sufficient. Now if

   r(θ) ≡ f_θ(x)/f_θ(y) = I(0 < x_(1)) I(x_(n) < θ) / [ I(0 < y_(1)) I(y_(n) < θ) ]

   is free of θ, we must have T(x) = T(y). Thus T = X_(n) is minimal sufficient.

3. First, the p.d.f. of X is f_θ(x) = θ^{−1} I{0 < x < θ} I{θ > 1}. Hence,

   f_θ(x) = θ^{−n} I(0 < x_(1) ≤ x_(n) < θ) I{θ > 1} = θ^{−n} I(0 < x_(1)) I(T̃(x) < θ),

   where T̃(x) = max{x_(n), 1}. Clearly, T̃(X) is sufficient. Now if

   r(θ) ≡ f_θ(x)/f_θ(y) = [ I(x_(1) > 0) I(T̃(x) < θ) ] / [ I(y_(1) > 0) I(T̃(y) < θ) ]

   is free of θ, we must have T̃(x) = T̃(y). Thus, T̃(X) = max{X_(n), 1} is minimal sufficient.

It is intuitively clear why T̃(X) = max{X_(n), 1}, rather than X_(n), is a minimal sufficient statistic: if X_(n) < 1, say X_(n) = 1/2, then 1/2 would not be a good estimate of θ since we know θ ≥ 1; a better choice would be 1. Of course, if X_(n) ≥ 1, then X_(n) would be a very good estimate of θ.

2.4 Minimal sufficiency for special families

2.4.1 Minimal Sufficiency for Exponential Family of Full Rank

For full-rank exponential families, finding minimal sufficient statistics is very easy.

Theorem 2.4 Suppose P is an exponential family of full rank with p.d.f.'s given by

f_θ(x) = exp{ ∑_{j=1}^k η_j(θ) T_j(x) − B(θ) } h(x).

Then T(X) = (T_1(X), ..., T_k(X)) is minimal sufficient for θ.

Proof. Sufficiency follows from the Factorization Theorem. To prove minimality, denote a_j ≡ a_j(x, y) = T_j(x) − T_j(y), 1 ≤ j ≤ k. Then

f_θ(x)/f_θ(y) = exp{∑_{j=1}^k η_j(θ) T_j(x)} h(x) / [ exp{∑_{j=1}^k η_j(θ) T_j(y)} h(y) ] = exp{ ∑_{j=1}^k a_j η_j(θ) } · h(x)/h(y)

is free of θ iff ∑_{j=1}^k a_j η_j(θ) is free of θ, i.e.,

∑_{j=1}^k a_j η_j(θ) = C(x, y).

Fix θ_0 ∈ Θ; then ∑_{j=1}^k a_j η_j(θ_0) = C(x, y). Hence

0 = ∑_{j=1}^k a_j [η_j(θ) − η_j(θ_0)] = a · η̃(θ),      (4.6)

where a = (a_1, ..., a_k)′ and η̃(θ) = (η_1(θ) − η_1(θ_0), ..., η_k(θ) − η_k(θ_0))′.

Since f_θ(x) is of full rank k, there exist θ_1, ..., θ_k such that

η̃(θ_1), ..., η̃(θ_k)

are linearly independent. It then follows from (4.6) that a = 0. That is, T_i(x) = T_i(y) for all 1 ≤ i ≤ k.

Examples

Example 2.6 Let X ~ N(µ, σ²). Find a minimal sufficient statistic for θ = (µ, σ²).

Solution. f_θ(x) = (√(2π) σ)^{−1} e^{−(x−µ)²/(2σ²)}. Therefore,

f_θ(x) = (√(2π) σ)^{−n} e^{−∑_{i=1}^n (x_i − µ)²/(2σ²)} = C(θ) exp{ (µ/σ²) ∑_{i=1}^n x_i − (1/(2σ²)) ∑_{i=1}^n x_i² }.

So k = 2,

η(θ) = (η_1, η_2) = ( µ/σ², −1/(2σ²) )

and

T(x) = (T_1(x), T_2(x)) = ( ∑_{i=1}^n x_i, ∑_{i=1}^n x_i² ).

Clearly, the natural parameter space Ξ is a half-space of R², which contains a 2-dim rectangle. By Theorem 2.4, T(X) is a minimal sufficient statistic.

Example 2.7 If X_1, ..., X_n ~ N(θ, θ²), find a minimal sufficient statistic for θ.

Solution. Recall that

f_θ(x) = (√(2π) |θ|)^{−n} e^{−∑_{i=1}^n (x_i − θ)²/(2θ²)} = C(θ) exp{ (1/θ) ∑_{i=1}^n x_i − (1/(2θ²)) ∑_{i=1}^n x_i² },

with sufficient statistic T(x) = (T_1(x), T_2(x)) = ( ∑_{i=1}^n x_i, ∑_{i=1}^n x_i² ). Since this is a curved exponential family (not of full rank), one cannot use Theorem 2.4 to find the minimal sufficient statistic. However, we can use the simple rule given earlier. Now

r(θ) ≡ f_θ(x)/f_θ(y) = exp{ (1/θ)(T_1(x) − T_1(y)) − (1/(2θ²))(T_2(x) − T_2(y)) }

is free of θ iff

(1/θ)(T_1(x) − T_1(y)) − (1/(2θ²))(T_2(x) − T_2(y)) = C(x, y) = 0,

where the last equality is obtained by letting θ → ∞ on both sides. Again, since θ is arbitrary, we get (by taking θ = 1 and θ = 2)

T_1(x) − T_1(y) = 0,  T_2(x) − T_2(y) = 0.

Therefore, T(X) = ( ∑_{i=1}^n X_i, ∑_{i=1}^n X_i² ) is a minimal sufficient statistic.

2.4.2 Minimal Sufficiency for Location Family

If X = (X_1, ..., X_n) ~ f(x − θ), then

f_θ(x) = ∏_{i=1}^n f(x_i − θ) = ∏_{i=1}^n f(x_(i) − θ).

So T(X) = (X_(1), ..., X_(n)) is sufficient. Further reduction may or may not be possible.

Example 2.8 (Further reduction is possible)

1. If X ~ N(µ, 1) with p.d.f. φ(x − µ), then X̄ is minimal sufficient (also an exponential family).

2. If X ~ exp(θ, 1) with p.d.f. f(x − θ) = e^{−(x−θ)} I(x > θ), then X_(1) is minimal sufficient. (Similar proof to the U(0, θ) example.)

3. If X ~ U(θ, θ + 1), then (X_(1), X_(n)) is minimal sufficient (shown earlier).

 
In some other cases, T(X) = (X_(1), ..., X_(n)) is the maximal reduction (i.e., it is minimal sufficient).

Example 2.9 (Further reduction is impossible) T(X) = (X_(1), ..., X_(n)) is minimal sufficient if X ~ f_θ(x) = f(x − θ), where

1. (Double exponential) f(x) = (1/2) e^{−|x|} (all moments exist, and f_θ(x) is not differentiable at x = θ);

2. (Cauchy) f(x) = 1/[π(1 + x²)] (the mean does not exist);

3. (Logistic) f(x) = e^{−x}/(1 + e^{−x})² (µ = EX_1 exists).

Note that f(−x) = f(x) in all cases.

Proof. We only prove the double-exponential case here. Now

r(θ) = f_θ(y)/f_θ(x) = exp{ ∑_{i=1}^n |x_i − θ| } / exp{ ∑_{i=1}^n |y_i − θ| }.

If r(θ) is free of θ, then

lim_{θ→∞} r(θ) = exp{ ∑_{i=1}^n (θ − x_i) } / exp{ ∑_{i=1}^n (θ − y_i) } = exp{ ∑_{i=1}^n (y_i − x_i) } =: A,
lim_{θ→−∞} r(θ) = exp{ ∑_{i=1}^n (x_i − θ) } / exp{ ∑_{i=1}^n (y_i − θ) } = exp{ ∑_{i=1}^n (x_i − y_i) } = A^{−1}.

So A = 1/A; hence A = 1 (as A > 0). That is,

r(θ) = exp{ ∑_{i=1}^n |x_i − θ| } / exp{ ∑_{i=1}^n |y_i − θ| } = r(∞) = r(−∞) = 1.

Hence,

g(θ) := ∑_{i=1}^n |x_i − θ| = ∑_{i=1}^n |y_i − θ|.

The value minimizing g(θ) is the sample median, so the two data sets x's and y's have the same medians. Consider two cases:

• n = 2m + 1 is odd.
  By minimizing w.r.t. θ, we see that the x's and y's have the same sample medians, i.e.,

  x_(m+1) = y_(m+1).

  Therefore, we can remove this common value by considering g(θ) − |x_(m+1) − θ| = g(θ) − |y_(m+1) − θ|. Once this is done, an even number of observations remain. Thus, it suffices to consider the case where n is even, as we do below.

• n = 2m is even.
  Any value in [x_(m), x_(m+1)] is a median of the x's, which minimizes g(θ) w.r.t. θ; any value in [y_(m), y_(m+1)] is a median of the y's, which also minimizes g(θ). These two intervals must coincide, so x_(m) = y_(m) and x_(m+1) = y_(m+1). Removing these two points again leaves an even number of points. Continuing in this way, we get x_(k) = y_(k) for all k = 1, ..., n.

Remarks.

• For the double-exponential d.f., the sample median is a better estimator than the sample mean. In fact, as will be seen in later chapters, the sample median is the MLE, and hence asymptotically efficient.
• For the Cauchy d.f., the sample mean is a bad choice, but the sample median is OK, even though it is not the asymptotically efficient estimator (the MLE).
• For the logistic d.f., we will find out later. Again, the sample median is better than the sample mean, even though the optimal estimator is neither.

2.5 Ancillary Statistics

Definition 2.3 S(X) is ancillary if its distribution does not depend on θ.

Example 2.10

1. If X_1, ..., X_n ~ F(x − θ) (a location family), the range R = X_(n) − X_(1) is ancillary.

   Proof. Z_i = X_i − θ ~ F(x), free of θ. Then R = Z_(n) − Z_(1) is ancillary.

2. If X_1, ..., X_n ~ F(x/σ) (a scale family), then (X_1/X_n, ..., X_{n−1}/X_n) is ancillary.

   Proof. Z_i = X_i/σ ~ F(x), free of σ. Then (X_1/X_n, ..., X_{n−1}/X_n) = (Z_1/Z_n, ..., Z_{n−1}/Z_n) is ancillary.

Ancillary statistics are useful in conditional inference and also in hypothesis testing.
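The proof of part 1 is a coupling argument: writing X_i = θ + Z_i with the same Z_i, the range does not change with θ at all. A minimal sketch of that coupling (the normal choice of F, seed and sizes are illustrative):

```python
import random

# Ancillarity of the range in a location family, shown by coupling:
# with a fixed seed the same Z_i are generated, X_i = theta + Z_i, and the
# range X_(n) - X_(1) = Z_(n) - Z_(1) is identical for every theta.

def range_sample(theta, n=10, reps=20_000, seed=3):
    rng = random.Random(seed)          # fixed seed: identical Z_i across calls
    out = []
    for _ in range(reps):
        xs = [theta + rng.gauss(0.0, 1.0) for _ in range(n)]
        out.append(max(xs) - min(xs))
    return out

r0 = range_sample(theta=0.0)
r5 = range_sample(theta=5.0)
assert all(abs(a - b) < 1e-9 for a, b in zip(r0, r5))   # ranges agree draw by draw
```

Since every draw of the range agrees for θ = 0 and θ = 5, its distribution cannot depend on θ.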

2.6 Complete Statistics

Definition 2.4 Let {f_θ(t)} be a family of p.d.f.'s for a statistic T.

1. T is complete for θ if

   E_θ g(T) = 0 for all θ =⇒ g(T) = 0 a.s.

2. T is boundedly complete if the previous statement holds for all bounded g.

Clearly, completeness implies bounded completeness, but not vice versa.

Examples of complete families

Example 2.11 If X ~ Bernoulli(p), then T = ∑_{i=1}^n X_i is complete.

Proof. T ~ Bin(n, p). Suppose E_p g(T) = 0 for all p, i.e.,

E_p g(T) = ∑_{t=0}^n C(n, t) p^t q^{n−t} g(t) = q^n ∑_{t=0}^n C(n, t) r^t g(t) = 0, where r = p/q > 0.

A polynomial in r that vanishes for all r > 0 must have all coefficients zero, so g(t) = 0 for t = 0, 1, ..., n. That is, g(T) = 0 a.s.

Example 2.12 If X ~ U(0, θ), then T = X_(n) is complete.

Proof. F(x) = (x/θ) I(0 < x < θ) + I(x ≥ θ) and P(T ≤ t) = F(t)^n. If E_θ g(T) = 0 for all θ > 0, then

∫_0^θ g(t) n t^{n−1} θ^{−n} dt = 0  =⇒  ∫_0^θ g(t) t^{n−1} dt = 0.

Differentiating w.r.t. θ gives g(θ) θ^{n−1} = 0, i.e., g(θ) = 0 for all θ > 0.

Examples of Non-Complete Families

Completeness is a rather strong requirement. Sufficiency does not imply completeness.

Example 2.13 Let X_1, ..., X_n ~ Bin(1, p). Then ( ∑_{i=1}^{n−1} X_i, X_n ) is not complete.

Proof. For simplicity, take n = 2. Clearly, E(X_1 − X_2) = 0. However,

P(X_1 − X_2 = 0) = P(X_1 = 0, X_2 = 0) + P(X_1 = 1, X_2 = 1) < 1,

so g(T) = X_1 − X_2 is not 0 a.s. Hence (X_1, X_2) is not complete. The general case n ≥ 2 can be shown similarly.

In fact, minimal sufficiency does not imply completeness either.

Example 2.14 Given X_1, ..., X_n ~ U(θ, θ + 1), T = (X_(1), X_(n)) is minimal sufficient, but not complete.

Proof. If U_1, ..., U_n ~ Unif(0, 1), then U_(k) ~ Beta(k, n − k + 1) for 1 ≤ k ≤ n, so EU_(k) = k/(n + 1).

Now X_k − θ = U_k ~ U(0, 1), so EX_(k) = θ + EU_(k) = θ + k/(n + 1). In particular, E(X_(1)) = θ + 1/(n + 1) and E(X_(n)) = θ + n/(n + 1). Therefore

E[ X_(n) − X_(1) − (n − 1)/(n + 1) ] = 0,

but X_(n) − X_(1) − (n − 1)/(n + 1) ≠ 0 a.s. So T is not complete.

However, a complete and sufficient statistic is minimal sufficient.
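Example 2.14 can be illustrated numerically: the function g(T) = X_(n) − X_(1) − (n−1)/(n+1) has mean zero for any θ, yet is plainly not zero a.s. A small Monte Carlo sketch (θ, n and the replication count are illustrative choices):

```python
import random

# For X_i ~ U(theta, theta + 1), the statistic
#   g(T) = X_(n) - X_(1) - (n-1)/(n+1)
# has E g(T) = 0 for every theta, but g(T) is not identically 0:
# T = (X_(1), X_(n)) is therefore not complete.

rng = random.Random(4)
theta, n, reps = 2.5, 8, 100_000

g_vals = []
for _ in range(reps):
    xs = [theta + rng.random() for _ in range(n)]
    g_vals.append(max(xs) - min(xs) - (n - 1) / (n + 1))   # g(T)

mean_g = sum(g_vals) / reps
assert abs(mean_g) < 0.005                   # E g(T) = 0 ...
assert max(abs(v) for v in g_vals) > 0.1     # ... yet g(T) is not 0 a.s.
```

Repeating with a different θ gives the same near-zero mean, which is exactly the failure of completeness exploited in the proof.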

The exponential family of full rank is complete

Theorem 2.5 If X ~ Exp(k) of full rank:

f_η(x) = exp{ ∑_{j=1}^k η_j T_j(x) − A(η) } h(x),

then T(X) = (T_1(X), ..., T_k(X)) is complete (and minimal sufficient).

Proof. Suppose E_η[g(T)] = 0 for all η. Recalling that f_T(t) = exp{t·η − A(η)} h_0(t) for t = (t_1, ..., t_k) and η = (η_1, ..., η_k), we have

E_η[g(T)] = ∫ g(t) f_T(t) dt = ∫ g(t) exp{t·η − A(η)} h_0(t) dt = ∫ g(t) exp{t·η − A(η)} dλ(t) = 0,

where λ(A) = ∫_A h_0(t) dt is a measure on (R^k, B^k). Let η_0 be an interior point of Ξ. Then

∫ g⁺(t) exp{t·η − A(η)} dλ = ∫ g⁻(t) exp{t·η − A(η)} dλ      (6.7)

for all η ∈ N_ε(η_0), where N_ε(η_0) = {η : ||η − η_0|| < ε} for some ε > 0. In particular,

∫ g⁺(t) exp{t·η_0 − A(η_0)} dλ = ∫ g⁻(t) exp{t·η_0 − A(η_0)} dλ =: C ≥ 0.

If C = 0, then g⁺(t) = g⁻(t) = 0 a.e. λ, i.e., g(t) = 0 a.e. λ. If C > 0, then C^{−1} g⁺(t) exp{t·η_0 − A(η_0)} and C^{−1} g⁻(t) exp{t·η_0 − A(η_0)} are p.d.f.'s with respect to λ. However, (6.7) implies that

∫ exp{t·(η − η_0)} · g⁺(t) exp{t·η_0} / [∫ g⁺(s) exp{s·η_0} dλ(s)] dλ(t) = ∫ exp{t·(η − η_0)} · g⁻(t) exp{t·η_0} / [∫ g⁻(s) exp{s·η_0} dλ(s)] dλ(t)

for all η ∈ N_ε(η_0). That is, the m.g.f.'s of the two p.d.f.'s

g⁺(t) exp{t·η_0} / ∫ g⁺(s) exp{s·η_0} dλ(s)  and  g⁻(t) exp{t·η_0} / ∫ g⁻(s) exp{s·η_0} dλ(s)

agree in a neighborhood of 0. This implies that g⁺(t) = g⁻(t) a.e. λ, i.e., g(t) = 0 a.e. λ. That is, T is complete.

2.7 Basu Theorem

There is an interesting relationship between minimal sufficient and ancillary statistics. Sufficient statistics contain all the information about θ, while ancillary statistics contain no information about θ. It is natural to ask whether (minimal) sufficient statistics and ancillary statistics are independent. This does not hold in general unless bounded completeness is imposed.

Example 2.15 For X_1, ..., X_n ∼ U(θ, θ + 1), a minimal sufficient statistic is (X_(1), X_(n)), or equivalently (X_(n) − X_(1), X_(n) + X_(1)), while X_(n) − X_(1) (the range of a location family) is ancillary. Clearly, the minimal sufficient and ancillary statistics are not independent here.

Theorem 2.6 (Basu’s Theorem) If T (X) is bounded complete and sufficient and V is
ancillary, then T (X) ⊥ V .

Proof. It suffices to show that

    g(T) := P(V ∈ B|T) − P(V ∈ B) = 0 a.s., for any Borel set B.      (7.8)

Note that g(T) is free of θ, since T is sufficient and V is ancillary. It is also bounded. Clearly,

    E_θ g(T) = E[E(I{V ∈ B}|T)] − P(V ∈ B) = P(V ∈ B) − P(V ∈ B) = 0.

Since T is bounded complete, we have g(T) = 0 a.s., hence (7.8) holds.

Examples

Example 2.16 Given X_1, ..., X_n ∼ N(µ, σ²), X̄ and S² = (n − 1)^{−1} Σ_{i=1}^n (X_i − X̄)² are independent.

Proof. Fix σ², and treat µ as the only parameter. Then X̄ is complete and sufficient for µ. On the other hand, S² is ancillary for µ. The result follows from Basu's theorem.

Alternative proof. Let Y_i = X_i/σ, and define Ȳ and S_Y² correspondingly. Note that Y_i ∼ N(µ/σ, 1) =: N(µ̃, 1). Treating µ̃ as our new parameter, we can apply Basu's theorem as above.

Example 2.17 Given X_1, ..., X_n ∼ f_θ(x) = e^{−(x−θ)} I{x > θ}, it is easy to show that X_(1) is complete and sufficient. By Basu's theorem, X_(1) ⊥ g(X) for any ancillary statistic g(X), e.g., g(X) = S², or X_(n) − X_(1), or X_2 − X_1, etc. For example,

    e^{X_(1)} − cos X_(1)   and   e^{S²} sin²(X_(n) − X_(1)) + log(1 + |X_2 − X_1|)

are independent.

In this case, Basu's theorem allows us to deduce the independence of two statistics without finding their joint distribution, which is typically very messy to do.
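A quick simulation (an illustration only, using the standard library; the values µ = 2, σ = 1.5 and the sample sizes are arbitrary choices) showing that X̄ and S² from Example 2.16 are uncorrelated across repeated samples, as their independence predicts:

```python
import random
from statistics import mean, variance

random.seed(0)
n, reps = 10, 20000
xbars, s2s = [], []
for _ in range(reps):
    x = [random.gauss(2.0, 1.5) for _ in range(n)]   # N(mu=2, sigma=1.5)
    xb = mean(x)
    xbars.append(xb)
    s2s.append(sum((xi - xb) ** 2 for xi in x) / (n - 1))   # S^2

mx, ms = mean(xbars), mean(s2s)
cov = mean((a - mx) * (b - ms) for a, b in zip(xbars, s2s))
corr = cov / (variance(xbars) * variance(s2s)) ** 0.5
print(abs(corr) < 0.05)   # True: sample correlation ~ 0, consistent with Basu
```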

2.8 Exercises

1. Let X1 , ..., Xn ∼iid Fθ . Find a non-trivial sufficient statistic for θ, where

(a) f(x) = x^{α−1} e^{−x/β} / (Γ(α)β^α), x > 0, θ = (α, β).
(b) f(x) = σ^{−1} e^{−(x−µ)/σ} for x > µ, θ = (µ, σ).
(c) F_θ = Uniform(θ − 1/2, θ + 1/2).

2. Let X1 , . . . , Xn be independent r.v.’s with densities fXi (x) = eiθ−x , for x ≥ iθ.
Prove that T = mini {Xi /i} is a sufficient statistic for θ.

3. For each of the following distributions let X_1, ..., X_n be a random sample. Find a minimal sufficient statistic.

(a) f(x) = (2π)^{−1/2} e^{−(x−θ)²/2} (normal)
(b) f(x) = P(X = x) = e^{−θ} θ^x / x!, for x = 0, 1, ... (Poisson)
(c) f(x) = e^{−(x−θ)}, for x > θ (location exponential)
(d) f(x) = c(µ, σ) exp{−(x − µ)⁴/σ⁴}, where σ > 0
(e) f(x) = π^{−1} [1 + (x − θ)²]^{−1} (Cauchy)
(f) f_θ(x) = C(α) x^α (1 − x)^{1−α}, 0 < x < 1, where α > 0 and C(α) is a constant depending only on α (Beta(α, α) distribution)
4. Let X_1, ..., X_n ∼iid F_θ. Then T = (X_1, ..., X_{n−1}) is not sufficient for θ. (T only uses part of the data, and is wasting information.)

5. Let T = (X_1, ..., X_n), where X_1, ..., X_n ∼iid F_θ. Then T is sufficient, but not complete.

6. Let X_1, ..., X_n be i.i.d. random variables from U(a, b), where a < b. Show that Z_i := (X_(i) − X_(1))/(X_(n) − X_(1)), i = 2, ..., n − 1, are independent of (X_(1), X_(n)) for any a and b.

7. X_1, ..., X_n are i.i.d. r.v.'s with pdf f(x) = θ^{−1} e^{−(x−a)/θ} I_{(a,∞)}(x). Show that

(a) Σ_{i=1}^n (X_i − X_(1)) and X_(1) are independent;
(b) Z_i := (X_(n) − X_(i))/(X_(n) − X_(n−1)), i = 1, 2, ..., n − 2, are independent of (X_(1), Σ_{i=1}^n (X_i − X_(1))).

8. Let X be a discrete r.v. with p.d.f.

    f_θ(x) = θ                 if x = 0,
           = (1 − θ)² θ^{x−1}  if x = 1, 2, ...,
           = 0                 otherwise,

where θ ∈ (0, 1). Show that X is boundedly complete, but not complete.

Chapter 3

Unbiased Estimation

Given X_1, ..., X_n ∼iid F_θ, we wish to find an "optimal" estimator T = T(X_1, ..., X_n) for θ, or more generally for g(θ). Unfortunately, a globally optimal estimator typically does not exist in the class of all estimators. But it is possible to find an optimal one in a subclass of estimators. In this chapter, we shall focus on the class of all unbiased estimators. Bear in mind that this may not always be a wise choice, e.g., in the high-dimensional case, as will be seen in later chapters. As a matter of fact, variable selection methods thrive on biased estimation by striking the right balance between bias and variance.

3.1 Optimality Criterion

There are many different optimality criteria for choosing an estimator.

• Maximum likelihood estimator (MLE): the most probable or likely value. Of course, this relies on a given probability model.

• Mean squared error. This is model-free.

• Bayesian estimator.

• ......

3.2 Mean square error (MSE)

A simple example below will show that an optimal estimator does not exist in the class of all estimators. There are many criteria for "optimality". Here we use the minimization of the Mean Squared Error (MSE):

    MSE[T] := E[T − g(θ)]² = Var(T) + bias²[T],

where bias[T] = ET − g(θ). T is MSE-optimal if it has the smallest MSE for all θ ∈ Θ.

Example 3.1 Let X1 , · · · , Xn be iid with EXi = µ and V ar(Xi ) = 1. Take three estima-
tors of µ:
T1 = X1 , T2 = 3, T3 = X̄.
Clearly,
1
M SE(T1 ) = 1, M SE(T2 ) = (3 − µ)2 , M SE(T3 ) = .
n
We can plot their MSEs against µ.

• T1 is clearly out, since it is never better than T3 (and strictly worse once n ≥ 2). Note that T1 is not sufficient.

• Neither T2 nor T3 dominates the other for all µ, i.e., there is no uniformly MSE-optimal estimator between the two.

• If we have no idea about µ, T3 might be preferred, as T2 is just a blind guess. (Frequentist view.)

• If we are pretty sure that µ is near 3, we might choose T2 over T3.

• If we have some vague knowledge that µ is near 3, we might take both estimators
into account, e.g., a linear combination of both estimators (Bayesian estimator):

T4 = λT2 + (1 − λ)T3 .
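The comparison above is easy to check by simulation; a small sketch (the sample size n = 10, the number of replications, and the µ grid are arbitrary choices):

```python
import random
from statistics import mean

random.seed(1)

def mse(estimator, mu, n=10, reps=20000):
    """Monte Carlo MSE of an estimator of mu, with Var(X_i) = 1."""
    return mean((estimator([random.gauss(mu, 1.0) for _ in range(n)]) - mu) ** 2
                for _ in range(reps))

for mu in (0.0, 3.0):
    m1 = mse(lambda x: x[0], mu)   # T1 = X_1:   MSE ~ 1 for every mu
    m2 = mse(lambda x: 3.0, mu)    # T2 = 3:     MSE = (3 - mu)^2
    m3 = mse(mean, mu)             # T3 = X-bar: MSE ~ 1/n
    print(mu, round(m1, 2), round(m2, 2), round(m3, 2))
```

T2 wins at µ = 3 but loses badly at µ = 0, exactly the non-dominance observed above.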

3.3 UMVUE

Definition

The problem arises because the class of all estimators under consideration (with second moments) is too big. One way to get around this is to focus on a smaller class of estimators, namely all unbiased estimators (with second moments):

    U{g(θ)} = the class of all unbiased estimators of g(θ).

• T (X) is unbiased for g(θ) if ET (X) = g(θ).


Then M SE(T ) = V ar(T ).

• T (X) is uniformly minimum variance unbiased estimator (UMVUE) for g(θ)


if it is unbiased and has the smallest variance for all θ.

Characterization of a UMVUE.

Theorem 3.1

1. δ(X) is a UMVUE iff

(a) Eδ(X) = g(θ);
(b) Cov(δ, U) = E[δ(X)U(X)] = 0, ∀U ∈ U{0} and ∀θ ∈ Θ.

2. Suppose T is sufficient. Then η(T) is a UMVUE iff

(a) Eη(T) = g(θ);
(b) Cov[η(T), U(T)] = E[η(T)U(T)] = 0, ∀θ ∈ Θ and ∀U(T) ∈ Ũ{0} = {h(T) : Eh(T) = 0}.

Proof. We shall only prove the first part. The second part can be shown similarly.

Necessity. Suppose that δ is a UMVUE for g(θ). Fix U ∈ U{0} and θ ∈ Θ. Let
δ1 = δ + λU . Clearly Eδ1 = g(θ), and so

V ar(δ1 ) = V ar(δ) + 2λCov(U, δ) + λ2 V ar(U ) ≥ V ar(δ),

that is,
2λCov(U, δ) + λ2 V ar(U ) ≥ 0,
for all λ. The quadratic in λ has two roots λ1 = 0 and λ2 = −2Cov(U, δ)/V ar(U ), and
hence takes on negative values unless λ1 = λ2 , that is, Cov(U, δ) = 0.

Sufficiency. Suppose that Eδ = g(θ) and E_θ(δU) = 0 for all U ∈ U{0}. For any δ_1 such that E(δ_1) = g(θ), we have δ_1 − δ ∈ U{0}. Hence,

    Cov(δ, δ_1 − δ) = Cov(δ, δ_1) − Var(δ) = 0.

Hence, by the Cauchy-Schwarz inequality,

    Var(δ) = Cov(δ, δ_1) ≤ Var^{1/2}(δ) Var^{1/2}(δ_1).

That is, Var(δ) ≤ Var(δ_1).

Three conditions to check when finding a UMVUE for g(θ)

• E[U (X)] = 0.

• E[δ(X)U (X)] = 0.

• E[δ(X)] = g(θ).

Or

• E[U (T )] = 0.

• E[η(T )U (T )] = 0.

• E[η(T )] = g(θ).

UMVUE may or may not exist

Example 3.2 X ∼ Bin(n, p). Then U{g(p)} is nonempty (i.e., g(p) is U-estimable) iff g(p) is a polynomial in p of degree m, where 0 ≤ m ≤ n.

Proof. η(X) is an unbiased estimator of g(p) iff g(p) = Eη(X) = Σ_{k=0}^n C(n,k) η(k) p^k (1 − p)^{n−k}, i.e., iff g(p) is a polynomial in p of degree at most n.

Thus, g(p) = 1/p and g(p) = p/(1 − p) are not U-estimable.

Uniqueness of UMVUE, if it exists

Theorem 3.2 If δ is a UMVUE for g(θ), it is unique.

Proof. If δ1 and δ2 are two UMVUEs, then δ1 −δ2 ∈ U{0}. By Characterization Theorem,
we have E[δ1 (δ1 − δ2 )] = 0 and E[δ2 (δ1 − δ2 )] = 0. So

E[(δ1 − δ2 )2 ] = E[δ1 (δ1 − δ2 )] − E[δ2 (δ1 − δ2 )] = 0,

which implies δ1 = δ2 a.s.

3.4 Searching UMVUE’s by sufficiency and completeness

If T is complete and sufficient (C-S), finding UMVUEs is simple.

Lehmann-Scheffe Theorem

Theorem 3.3 If T is C-S, and Eθ [η(T )] = g(θ), then η(T ) is the UMVUE for g(θ).

Proof. Since T is complete, E_θ[U(T)] = 0 implies U(T) = 0 a.s. Then Ũ{0} = {0}, and the result follows from the UMVUE Characterization Theorem. Uniqueness follows as in Theorem 3.2.

Rao-Blackwell Theorem

Given a sufficient statistic, we can always improve it by conditioning.

Theorem 3.4 (Rao-Blackwell): Suppose T is sufficient, and δ(X) is an estimator of g(θ). Define η(T) = E(δ(X)|T). Then

1. Eη(T) = Eδ (hence they have the same bias).

2. Var(η(T)) ≤ Var(δ).

3. MSE(η(T)) ≤ MSE(δ), where equality holds iff P(η(T) = δ) = 1.

(Hence, by conditioning on T, we don't alter the bias, but may reduce the variance.)

Proof. By sufficiency, η(T) = E(δ(X)|T) is free of θ, and hence is an estimator of g(θ).

1. Eη(T) = E[E(δ|T)] = Eδ. Therefore, bias(δ) = bias(η(T)).

2. This follows from parts 1 and 3, together with MSE(Y) = bias²(Y) + Var(Y).

3. Now

M SE(δ) = E(δ − g(θ))2 = E ([δ − η] + [η − g(θ)])2


= E[η − g(θ)]2 + E[δ − η]2 + 2C
= M SE(η) + E[δ − η]2 + 2C,

where

C = E ([δ − η][η − g(θ)]) = EE ([δ − η][η − g(θ)]|T ) = E {[η − g(θ)] (E[δ|T ] − η)} = 0.

Therefore, M SE(δ) ≥ M SE(η) and the equality holds iff δ = η(T ) a.s.

Theorem 3.5 If T is C-S and Eδ = g(θ), then η(T ) = E(δ|T ) is the UMVUE for g(θ).

Proof. This follows from Rao-Blackwell Theorem and Lehmann-Scheffe Theorem.

3.5 Steps to find UMVUE


• If complete and sufficient (C-S) statstics exists,

– Solving equations
– Conditioning by Rao-Blackwell Theorem.

• Otherwise, use Characterisation Theorem

Solving equations

Example 3.3 X_1, ..., X_n ∼ U(0, θ). Find a UMVUE for g(θ), where g is differentiable.

Solution. It is known that T = X_(n) is C-S, and

    f_T(t) = n θ^{−n} t^{n−1} I(0 < t < θ).

If Eη(T) = ∫_0^θ η(t) n θ^{−n} t^{n−1} dt = g(θ), then

    n ∫_0^θ η(t) t^{n−1} dt = θ^n g(θ).

Differentiating w.r.t. θ, we get

    n η(θ) θ^{n−1} = n θ^{n−1} g(θ) + θ^n g′(θ),

so η(θ) = g(θ) + (θ/n) g′(θ). Hence the UMVUE is

    η(T) = g(X_(n)) + (X_(n)/n) g′(X_(n)).

(a) If g(θ) = θ, then η(T) = ((n+1)/n) X_(n).
(b) If g(θ) = θ², then η(T) = X_(n)² + (2/n) X_(n)² = ((n+2)/n) X_(n)².

Remark 3.1 An MLE of g(θ) is g(X_(n)), with bias of order O(n^{−1}).
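A Monte Carlo sanity check of case (a) (θ = 5 and n = 8 are arbitrary choices): the UMVUE ((n+1)/n)X_(n) averages to θ, while the MLE X_(n) is biased downward with bias of order 1/n.

```python
import random
from statistics import mean

random.seed(2)
theta, n, reps = 5.0, 8, 40000
tmax = [max(random.uniform(0, theta) for _ in range(n)) for _ in range(reps)]

mle = mean(tmax)                              # E X_(n) = n*theta/(n+1): biased
umvue = mean((n + 1) / n * t for t in tmax)   # UMVUE of theta: averages to theta

print(round(mle, 2), round(umvue, 2))   # mle below 5, umvue close to 5
```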

Example 3.4 Let X_1, ..., X_n ∼ Poisson(λ). Find a UMVUE for g(λ), where g(λ) is infinitely differentiable.

Solution. T = Σ X_i is C-S, and T ∼ Poisson(nλ), that is, f_T(t) = e^{−nλ}(nλ)^t/t! for t = 0, 1, 2, .... Then

    Eη(T) = Σ_{k=0}^∞ η(k) e^{−nλ} (nλ)^k / k! = g(λ).

Then

    Σ_{k=0}^∞ η(k) n^k λ^k / k! = g(λ) e^{nλ} = ( Σ_{i=0}^∞ g^{(i)}(0) λ^i / i! )( Σ_{j=0}^∞ n^j λ^j / j! ) = Σ_{k=0}^∞ a_k λ^k,

where a_k = Σ_{i+j=k} g^{(i)}(0) n^j / (i! j!). Matching coefficients, η(k) n^k / k! = a_k, so η(k) = a_k k! / n^k. That is,

    η(T) = (a_T T!)/n^T = (T!/n^T) Σ_{i+j=T} g^{(i)}(0) n^j / (i! j!) = Σ_{i=0}^T C(T, i) g^{(i)}(0) / n^i.

We now give two special examples.

1. If g(λ) = λ^r, where r > 0 is an integer, then

    g^{(k)}(λ) = r(r−1)...(r−k+1) λ^{r−k} if 0 ≤ k ≤ r, and = 0 if k > r.

Then g^{(r)}(0) = r! and g^{(i)}(0) = 0 for i ≠ r. Therefore,

    η(T) = (T!/n^T) · n^{T−r} r!/((T−r)! r!) = T(T−1)...(T−r+1)/n^r if T ≥ r, and = 0 if T < r.

Remark: Note that η(T) ≈ (X̄)^r, the MLE.

2. If g(λ) = P(X_1 = k) = e^{−λ} λ^k / k! for some k = 0, 1, ..., we can use the above result. For instance, if k = 0, then g(λ) = e^{−λ}, so g^{(i)}(λ) = (−1)^i e^{−λ} and g^{(i)}(0) = (−1)^i. Therefore, a UMVUE of e^{−λ} is

    η(T) = Σ_{i=0}^T C(T, i) (−1)^i / n^i = (1 − 1/n)^T.

Similarly, if g(λ) = e^{−λ} λ^k / k!, then its UMVUE is

    η(T) = C(T, k) (1/n)^k (1 − 1/n)^{T−k} = P( Bin(T, p = 1/n) = k ).
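Both special cases can be verified numerically; a sketch using Knuth's Poisson sampler (the values λ = 1.3 and n = 6 are arbitrary choices):

```python
import random
from math import exp
from statistics import mean

def poisson(lam):
    # Knuth's sampler: count uniforms until their product drops below e^{-lam}
    L, k, p = exp(-lam), 0, 1.0
    while p > L:
        p *= random.random()
        k += 1
    return k - 1

random.seed(3)
lam, n, reps = 1.3, 6, 60000
Ts = [sum(poisson(lam) for _ in range(n)) for _ in range(reps)]

u_exp = mean((1 - 1 / n) ** T for T in Ts)      # UMVUE of e^{-lam}
u_sq = mean(T * (T - 1) / n**2 for T in Ts)     # UMVUE of lam^2 (case r = 2)

print(round(u_exp, 3), round(exp(-lam), 3))     # close to each other
print(round(u_sq, 3), round(lam**2, 3))         # close to each other
```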

Example 3.5 X1 , . . . , Xn ∼ N (µ, σ 2 ). Find a UMVUE for µ/σ 2 .

Solution. The following facts will be useful in this problem.

• X̄ and S² are independent, where S² = (n − 1)^{−1} Σ_{i=1}^n (X_i − X̄)².
• X̄ ∼ N(µ, σ²/n), and (n − 1)S²/σ² ∼ χ²_{n−1}, where the pdf of χ²_r is x^{r/2−1} e^{−x/2} / (2^{r/2} Γ(r/2)).
• Γ(α + 1) = αΓ(α).

It is known that T = (X̄, S²) is complete and sufficient. Let

    η(T) = C X̄ / S².

Then Eη(T) = Cµ E(1/S²). Now, taking r = n − 1, we have

    E( σ²/((n−1)S²) ) = E( 1/χ²_{n−1} )
                      = ∫_0^∞ (1/x) · x^{r/2−1} e^{−x/2} / (2^{r/2} Γ(r/2)) dx
                      = [2^{(r−2)/2} Γ((r−2)/2) / (2^{r/2} Γ(r/2))] ∫_0^∞ x^{(r−2)/2−1} e^{−x/2} / (2^{(r−2)/2} Γ((r−2)/2)) dx
                      = (1/2) · Γ((n−3)/2) / Γ((n−1)/2)
                      = 1/(n−3),

since Γ((n−1)/2) = ((n−3)/2) Γ((n−3)/2). Then

    Eη(T) = Cµ E(1/S²) = Cµ · (n−1)/((n−3)σ²) = µ/σ²,

if C = (n−3)/(n−1). That is, ((n−3)/(n−1)) X̄/S² is a UMVUE of µ/σ² (for n ≥ 4).

Remark: Note that the MLE is X̄/S².
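A numerical check that (n−3)/(n−1) is the right constant (the values µ = 2, σ = 1.5, n = 12 are arbitrary; n ≥ 4 is needed for E(1/S²) to be finite):

```python
import random
from statistics import mean, variance

random.seed(4)
mu, sigma, n, reps = 2.0, 1.5, 12, 40000
vals = []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    s2 = variance(x)                               # S^2 with divisor n - 1
    vals.append((n - 3) / (n - 1) * mean(x) / s2)  # candidate UMVUE of mu/sigma^2

print(round(mean(vals), 3), round(mu / sigma**2, 3))   # both close to 0.889
```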

Conditioning by Rao-Blackwell Theorem

Example 3.6 Let X_1, ..., X_n ∼ Poisson(λ). Find a UMVUE for

    g(λ) = P(X_1 = k) = e^{−λ} λ^k / k!, for any k = 0, 1, 2, ....

Solution. Let δ = I{X_1 = k}. Clearly, Eδ = g(λ). Now T = Σ X_i is C-S, and T ∼ Poisson(nλ). Then

    η(t) = E[δ | T = t] = P(X_1 = k | T = t)
         = P(X_1 = k, T = t) / P(T = t)
         = P(X_1 = k) P(Σ_{i=2}^n X_i = t − k) / P(Σ_{i=1}^n X_i = t)
         = [e^{−λ} λ^k / k!] · [e^{−(n−1)λ} ((n−1)λ)^{t−k} / (t−k)!] / [e^{−nλ} (nλ)^t / t!]
         = C(t, k) (1/n)^k (1 − 1/n)^{t−k}.

So a UMVUE for g(λ) is

    η(T) = C(T, k) (1/n)^k (1 − 1/n)^{T−k}.

Remark 3.2

• The above derivation is immediate from the following fact: if X ∼ Poisson(λ_1) and Y ∼ Poisson(λ_2) are independent, then X | X + Y = t ∼ Bin{t, λ_1/(λ_1 + λ_2)}.

• By comparison, the MLE for the above problem is e^{−X̄} (X̄)^k / k!.
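The Rao-Blackwell improvement is visible in simulation: the crude estimator δ = I{X_1 = k} and its conditioned version η(T) have the same mean, but η(T) has a much smaller variance (a sketch; the values λ = 2, n = 5, k = 1 are arbitrary):

```python
import random
from math import comb, exp
from statistics import mean, variance

def poisson(lam):
    # Knuth's sampler
    L, k, p = exp(-lam), 0, 1.0
    while p > L:
        p *= random.random()
        k += 1
    return k - 1

random.seed(5)
lam, n, k, reps = 2.0, 5, 1, 30000
raw, rb = [], []
for _ in range(reps):
    x = [poisson(lam) for _ in range(n)]
    T = sum(x)
    raw.append(1.0 if x[0] == k else 0.0)                       # delta = I{X_1 = k}
    rb.append(comb(T, k) * (1 / n)**k * (1 - 1 / n)**(T - k))   # eta(T) = E(delta|T)

target = exp(-lam) * lam**k                                     # P(X_1 = 1)
print(round(mean(raw), 3), round(mean(rb), 3), round(target, 3))
print(variance(raw) > variance(rb))   # True: conditioning reduced the variance
```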

Using Characterisation Theorem

When a C-S statistic is not available, or not easily available, we cannot use the Lehmann-Scheffe Theorem to find UMVUEs. Then we can try the Characterization Theorem.

Example 3.7 Let X1 , . . . , Xn ∼ U (0, θ), where θ > 1. Find a UMVUE of θ.

Solution. It is known that T = X_(n) is sufficient. But it is not complete. To show this, note that F_T(t) = F^n(t) = t^n/θ^n, and so

    f_T(t) = (n t^{n−1} / θ^n) I(0 < t < θ).

If E_θ g(T) = 0 for all θ > 1, then

    0 = ∫_0^θ t^{n−1} g(t) dt = ∫_0^1 t^{n−1} g(t) dt + ∫_1^θ t^{n−1} g(t) dt.

Differentiating w.r.t. θ, we get g(θ) = 0 for any θ > 1, and

    ∫_0^1 t^{n−1} g(t) dt = 0.

This certainly has nonzero solutions g(t). Hence, T is not complete.

It is known that T̃ = X_(n) ∨ 1 is minimal sufficient. However, it may be troublesome to check completeness. Perhaps it is easier to use the Characterization Theorem to find a UMVUE in this case.

(1) E[U(T)] = 0 implies (simply take g(t) = U(t) above) that

    U(θ) = 0 for any θ > 1, and ∫_0^1 t^{n−1} U(t) dt = 0.

(2) For E[η(T)U(T)] = 0 we need, using U(θ) = 0 for θ > 1 from (1),

    ∫_0^θ t^{n−1} U(t) η(t) dt = ∫_0^1 t^{n−1} U(t) η(t) dt = 0.

(3) Next we need to find η(T) such that Eη(T) = θ. We use two approaches.

Approach 1. Clearly, (1) and (2) are both satisfied by

    η(t) = c   if 0 < t ≤ 1,
         = bt  if t > 1.

Then from Eη(T) = θ, we get

    θ = Eη(T) = E[η(T) I(0 < T ≤ 1)] + E[η(T) I(T > 1)]
      = ∫_0^1 c f_T(t) dt + ∫_1^θ bt f_T(t) dt
      = c P(T ≤ 1) + ∫_1^θ bt · n t^{n−1}/θ^n dt
      = c/θ^n + (bn/(n+1)) (θ^{n+1} − 1)/θ^n
      = (bn/(n+1)) θ + (c − bn/(n+1)) / θ^n,

where we used the fact P(T ≤ 1) = F^n(1) = θ^{−n}. Therefore, we must have

    bn/(n+1) = 1,   c − bn/(n+1) = 0.

So b = (n+1)/n and c = 1. That is,

    η(T) = 1                   if 0 < X_(n) ≤ 1,
         = (1 + n^{−1}) X_(n)  if X_(n) > 1.

Approach 2. Clearly, we can choose

    η(t) = c     if 0 < t ≤ 1,
         = η(t)  if t > 1,

i.e., constant on (0, 1] (this guarantees (1) and (2)) and unspecified for t > 1. From Eη(T) = θ, we get

    θ = Eη(T) = E[c I(0 < T ≤ 1)] + E[η(T) I(T > 1)]
      = c P(T ≤ 1) + ∫_1^θ η(t) f_T(t) dt
      = c/θ^n + ∫_1^θ η(t) n t^{n−1}/θ^n dt,

as P(T ≤ 1) = F^n(1) = θ^{−n}. That is,

    θ^{n+1} = c + ∫_1^θ η(t) n t^{n−1} dt.      (5.1)

Differentiating w.r.t. θ, we get

    (n+1) θ^n = n η(θ) θ^{n−1}.

Therefore, η(θ) = ((n+1)/n) θ for θ > 1. Putting this back into (5.1), we get c = 1. Hence,

    η(T) = 1          if 0 < T ≤ 1,
         = (n+1)T/n   if T > 1.

Remark 3.3 If x_(n) ≤ 1, we estimate θ by 1. If x_(n) > 1, we just carry on as usual.

Remark 3.4 T_0 = max{X_(n), 1} is minimal sufficient. Clearly, 0 < X_(n) ≤ 1 ⟺ T_0 = 1, and X_(n) > 1 ⟺ T_0 > 1. Then,

    η(T) = 1                if T_0 = 1,
         = (1 + n^{−1}) T_0 if T_0 > 1.

Therefore, η(T) is a function of T_0, as can be expected.
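Unbiasedness of this piecewise estimator can be confirmed by simulation, including for θ close to 1 where the constant piece matters (n = 8 and the θ values are arbitrary choices):

```python
import random
from statistics import mean

random.seed(6)
n, reps = 8, 40000

def eta(t):
    # UMVUE of Example 3.7: 1 on {X_(n) <= 1}, (1 + 1/n) X_(n) otherwise
    return 1.0 if t <= 1 else (1 + 1 / n) * t

results = {}
for theta in (1.2, 3.0):
    results[theta] = mean(eta(max(random.uniform(0, theta) for _ in range(n)))
                          for _ in range(reps))
    print(theta, round(results[theta], 3))   # average close to theta
```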

Summary on how to find UMVUE

1. Is g(θ) U-estimable? If not, that is the end of the story. If so, move on to the next step.

2. Find a (minimal) sufficient statistic T . Verify whether T is complete or not.

• If T is complete, then we can use Lehmann-Scheffe Theorem.


• If T is NOT complete, then we can use the UMVUE Characterization Theorem.

Difficulties with UMVUE’s

• Unbiased estimates may not exist.

• Even if UMVUE’s exist, they are usually quite difficult to work out.

• The UMVUE may be inadmissible.

T is inadmissible if there exists another estimator S such that R(θ, S) ≤ R(θ, T) for all θ, with strict inequality for some θ.

In modern statistics, we sometimes deliberately introduce bias into estimators in order to obtain a smaller variance, e.g., the James-Stein estimator.

• Unbiasedness is not transformation-invariant: if θ̂ is unbiased for θ, g(θ̂) may still be biased for g(θ).

3.6 UMVUE in Normal Linear Models

We have considered UMVUE for i.i.d. r.v.’s. We now consider independent r.v.’s.

Example 3.8 Simple regression model. Suppose

    X_i = ξ_i + ε_i = α + βt_i + ε_i, i = 1, ..., n,

where the ε_i ∼ N(0, σ²) are independent. Equivalently, X_i ∼ N(ξ_i, σ²), i = 1, ..., n, where ξ_i = α + βt_i. Find UMVUEs for θ = (α, β, σ²).

Solution. The joint pdf is

    f(x) = Π_{i=1}^n (2πσ²)^{−1/2} exp( −(x_i − α − βt_i)²/(2σ²) )
         = C(θ) exp( −Σ_{i=1}^n x_i²/(2σ²) + α Σ_{i=1}^n x_i/σ² + β Σ_{i=1}^n t_i x_i/σ² ).

Hence, T = (Σ_{i=1}^n x_i, Σ_{i=1}^n t_i x_i, Σ_{i=1}^n x_i²) is C-S for θ̃ = (α/σ², β/σ², 1/σ²), or equivalently for θ = (α, β, σ²) (a 1-to-1 correspondence).

Since the LSEs of α and β are unbiased and are functions of T, they must be UMVUEs. Similarly, the UMVUE of σ² is given by σ̂² = SSE/(n − 2).

Normal Linear Models

Rather than dealing with every single case, here we present a unified treatment for the independent case, in particular the "Normal Linear Model":

    X_i ∼ N(ξ_i, σ²), i = 1, ..., n,

or equivalently

    X ∼ N(ξ, σ²I),

where the X_i's are independent and ξ = (ξ_1, ..., ξ_n) ∈ L_s, an s-dimensional linear subspace of R^n (s < n). Let

    Y = (Y_1, ..., Y_n) = (X_1, ..., X_n)C = XC,

where C is an orthogonal transformation. Denote η_i = E(Y_i); then η = ξC, or ηC^T = ξ, and

    Y ∼ N(η, σ²I).

Choose C so that the first s columns of C^T span L_s; then clearly

    η_{s+1} = ... = η_n = 0.

As a result, we have

    f_Y(y) = Π_{i=1}^n (2πσ²)^{−1/2} exp( −(y_i − η_i)²/(2σ²) )
           = C_1(θ) exp( −Σ_{i=1}^s (y_i − η_i)²/(2σ²) − Σ_{i=s+1}^n y_i²/(2σ²) )
           = C_1(θ) exp( −Σ_{i=1}^n y_i²/(2σ²) + Σ_{i=1}^s η_i y_i/σ² ).

Hence, T = (Y_1, ..., Y_s, Σ_{i=1}^n Y_i²), or equivalently T̃ = (Y_1, ..., Y_s, Σ_{i=s+1}^n Y_i²), is C-S.

Theorem 3.6 The UMVUEs of Σ_{i=1}^s λ_i η_i and σ² are, respectively,

    Σ_{i=1}^s λ_i Y_i,  and  σ̂² = Σ_{i=s+1}^n Y_i² / (n − s).

Proof. They are unbiased and functions of T , hence UMVUE by L-S theorem.

Theorem 3.7 The UMVUE of Σ_{i=1}^n λ_i ξ_i is Σ_{i=1}^n λ_i ξ̂_i, where

    ξ̂ = (ξ̂_1, ..., ξ̂_n) = arg min_{ξ ∈ L_s} Σ_{i=1}^n (X_i − ξ_i)²

is the LSE of ξ.

Proof. From Σ_{i=1}^n (X_i − ξ_i)² = Σ_{i=1}^n (Y_i − η_i)² = Σ_{i=1}^s (Y_i − η_i)² + Σ_{i=s+1}^n Y_i², we get

    η̂ = (Y_1, ..., Y_s, 0, ..., 0) = ξ̂C.

Thus ξ̂ = η̂C^T, and hence Σ_{i=1}^n λ_i ξ̂_i is a function of Y_1, ..., Y_s. Moreover, η = E η̂ = E(ξ̂C) = ξC, so E ξ̂ = ξ, and thus E(Σ_{i=1}^n λ_i ξ̂_i) = Σ_{i=1}^n λ_i ξ_i. By the Lehmann-Scheffe Theorem, it is the UMVUE.

Example 3.9 One-way layout. Suppose

    X_ij ∼ N(ξ_i, σ²), j = 1, ..., n_i; i = 1, ..., s,

all independent; equivalently,

    X_11, ..., X_1n_1 ∼ N(ξ_1, σ²); ......; X_s1, ..., X_sn_s ∼ N(ξ_s, σ²).

Find UMVUEs for ξ_i and σ².

Solution. Let X_i· be the average of the ith treatment. Note

    Σ_{i=1}^s Σ_{j=1}^{n_i} (X_ij − ξ_i)² = Σ_{i=1}^s Σ_{j=1}^{n_i} (X_ij − X_i·)² + Σ_{i=1}^s n_i (X_i· − ξ_i)².

So the LSEs, and hence the UMVUEs, of the ξ_i's are ξ̂_i = X_i·. And the UMVUE of σ² is

    σ̂² = Σ_{i=1}^s Σ_{j=1}^{n_i} (X_ij − X_i·)² / (n − s),  where n = Σ_{i=1}^s n_i.

3.7 UMVUE may fail for small n

Let X_1, ..., X_n ∼ Poisson(λ). Then T = Σ X_i ∼ Poisson(nλ), and is complete and sufficient. To find a UMVUE for e^{−aλ}, we set Eη(T) = e^{−aλ}, i.e.,

    Σ_{t=0}^∞ η(t) (nλ)^t / t! = e^{(n−a)λ} = Σ_{t=0}^∞ (n − a)^t λ^t / t!,

so η(T) = (1 − a/n)^T is the UMVUE of e^{−aλ}.

• Big problems for small n. Suppose a = 3 and n = 1; then η(T) = (−2)^{X_1}, which goes 1, −2, 4, −8, ... as X_1 goes 0, 1, 2, 3, .... This is absurd: an estimate of a quantity in (0, 1) oscillates between negative and positive values.

• The above problem disappears for large n, e.g., as soon as n > a.

• The MLE is simply e^{−aX̄}.
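The n = 1, a = 3 pathology in code (λ = 0.4 is an arbitrary choice): the estimator (−2)^X does average to e^{−3λ}, yet every single realization is an absurd estimate of a number in (0, 1).

```python
import random
from math import exp
from statistics import mean

def poisson(lam):
    # Knuth's sampler
    L, k, p = exp(-lam), 0, 1.0
    while p > L:
        p *= random.random()
        k += 1
    return k - 1

random.seed(7)
lam, reps = 0.4, 200000
vals = [(-2.0) ** poisson(lam) for _ in range(reps)]  # UMVUE of e^{-3*lam}, n=1, a=3

print(round(mean(vals), 2), round(exp(-3 * lam), 2))  # averages agree
print(min(vals) < 0 < 1 < max(vals))                  # True: single values are absurd
```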

3.8 Exercises
1. Let T = (X_1, ..., X_n), where X_1, ..., X_n ∼iid f(x) and n ≥ 2. Show that T is sufficient, but not complete.

2. Let T_i be a UMVUE of θ_i, i = 1, ..., k. Then

(a) T = Σ_{i=1}^k c_i T_i is a UMVUE of θ = Σ_{i=1}^k c_i θ_i, for constants c_i;
(b) T = Π_{i=1}^k T_i is a UMVUE of θ = ET = E(Π_{i=1}^k T_i).

3. Let X_1, ..., X_n ∼iid N(µ, σ²). Find a UMVUE for µ²/σ.

4. X_1, ..., X_n ∼iid Poisson(λ). Find a UMVUE for e^{−tλ}, for t > 0.

5. X_1, ..., X_n ∼iid U(θ_1, θ_2), where θ_1 < θ_2. Find a UMVUE for (θ_1 + θ_2)/2.

6. Let X1 , . . . , Xn be iid Bin(1, p), where p ∈ (0, 1). Find the UMVUE of

(a) p^m, where m is a positive integer ≤ n;
(b) P(X_1 + ... + X_m = k), where m and k are positive integers ≤ n;
(c) P(X_1 + ... + X_{n−1} > X_n).
7. X_1, ..., X_n ∼iid f_θ(x) = e^{−(x−θ)}, x > θ. Find a UMVUE of θ^r, r > 0.

8. Let X_1, ..., X_n ∼ Bin(k, p). Find a UMVUE for g(p) = P(X_1 = 1) = kp(1 − p)^{k−1}.
9. X_1, ..., X_n ∼iid N(0, σ²). We wish to find a UMVUE of σ².

(a) Find a complete and sufficient statistic for σ².
(b) Find C_1, C_2 (which may depend on n) such that W_1 = C_1 (X̄)² and W_2 = C_2 Σ_{i=1}^n (X_i − X̄)² are unbiased for σ².
(c) Consider W = aW_1 + (1 − a)W_2, where a is a constant. Find the a which minimizes Var(W). Show that Var(W) ≤ Var(W_i), i = 1, 2.
(d) Find a UMVUE of σ².

Chapter 4

The Method of Maximum Likelihood

Likelihood-based methods lie at the heart of statistics. They have been widely used in practice, more so than you perhaps realise. Some methods, e.g., least squares estimation (LSE), logistic regression, etc., which are seemingly unrelated to likelihood, can be cast into the likelihood framework. Others, e.g., the Bayesian methods to be studied later, can be loosely regarded as extensions of likelihood methods.

The idea of the maximum likelihood estimate (MLE), first introduced by R.A. Fisher, is simple and yet very powerful. It enjoys many nice properties, such as invariance, asymptotic efficiency, and existence. Of course, like every method, it also has drawbacks, e.g., strong model assumptions and non-robustness.

4.1 Maximum Likelihood Estimate (MLE)

Example 4.1 Let X ∼ Pθ , where θ = 0 or 1. Suppose

Pθ (X = x) x=0 x=1 x=2 x=3


θ=0 0.5 0.1 0.1 0.3
θ=1 0.2 0.3 0.2 0.3

Let θ̂ be an estimate of θ.

• If X = 0, P_0(0) > P_1(0), so θ = 0 is more likely. So we take θ̂ = 0.
• Similarly, if X = 1 or 2, take θ̂ = 1.
• If X = 3, then P_0(3) = P_1(3), so take θ̂ = 0 or 1.

Thus we obtain an MLE of θ:

    θ̂ = 0 if X = 0 or 3;   θ̂ = 1 if X = 1, 2, or 3.

Note immediately that θ̂ may not be unique: when X = 3, either θ̂ = 0 or θ̂ = 1 will do.

Definition 4.1 Suppose X ∼ f_θ(x), where θ ∈ Θ ⊆ R^k. Let Θ̄ be the closure of Θ.

• Given X = x, define

– the likelihood function: L(θ) ≡ L(θ, x) = f_θ(x);
– the log-likelihood function: l(θ) ≡ l(θ, x) = log L(θ).

• Given x, a maximum likelihood estimator (MLE) of θ is θ̂ = θ̂(x) satisfying

    L(θ̂(x)) = sup_{θ∈Θ} L(θ), or equivalently l(θ̂(x)) = sup_{θ∈Θ} l(θ),

i.e.,

    θ̂ = arg sup_{θ∈Θ} L(θ) = arg sup_{θ∈Θ} l(θ).

Such a θ̂ = θ̂(X) always exists in Θ̄ (though not necessarily in Θ), and is called an MLE.

We note that an MLE may not be unique (unlike a UMVUE). Furthermore, if L(θ) is differentiable, possible candidates for MLEs satisfy

    L′(θ) = 0 ⟺ l′(θ) = 0.      (1.1)

These are called the likelihood and log-likelihood equations, respectively.

4.2 Invariance property of MLE

One very attractive property of the MLE is its invariance. To be more precise, let η = g(θ), where g(·) is a function from R^k to R^d. The invariance property of the MLE then says: if θ̂ is an MLE of θ, then g(θ̂) is an MLE of g(θ).

4.2.1 g(·) is a 1-to-1 mapping

If g(·) is 1-to-1, estimating θ is equivalent to estimating η = g(θ) as a new parameter.

Theorem 4.1 Let X ∼ fθ (x) and η = g(θ) where g(·) is 1-to-1. If θ̂ is an MLE for θ,
then g(θ̂) is an MLE for g(θ).

Proof. Since g(·) is 1-to-1, g^{−1}(·) exists. Also,

    L(θ) = f_θ(x) = f_{g^{−1}(η)}(x) = L_1(η),

where L_1 is the likelihood function of η. Hence L(θ̂) = L_1(g(θ̂)). Since θ̂ maximizes L(θ), g(θ̂) maximizes L_1(η). That is, g(θ̂) is an MLE of η.

4.2.2 g(·) is not 1-to-1

If g(·) is NOT 1-to-1, η = g(θ) is no longer a parameter, since once knowing η, we could
have more than one θ’s, say, θ1 and θ2 , such that η = g(θ1 ) = g(θ2 ), i.e., we may not be
able to completely specify θ in the parameter space Θ (non-identifiable). This is bad since
we don’t know whether X ∼ fθ1 (x) or X ∼ fθ2 (x).

For fixed η, let Θ_η := {θ ∈ Θ : g(θ) = η}; for example, we could have Θ_η = {θ_1, θ_2}. As we are not sure which θ to pick from Θ_η, it seems reasonable to choose the most probable one, i.e., the value of θ in Θ_η that maximizes the likelihood L(θ) = f_θ(x). This leads to the so-called induced likelihood.

Definition 4.2 Suppose g(·): Θ → Λ ⊆ R^p, and let Θ_η := {θ ∈ Θ : g(θ) = η}.

• The induced likelihood function for η = g(θ) is defined as

    L̃(η) = sup_{θ∈Θ: g(θ)=η} L(θ).

• For fixed x, let η̂ = η̂(x) satisfy L̃(η̂) = sup_{η∈g(Θ)} L̃(η), i.e.,

    η̂ = arg sup_{η∈g(Θ)} L̃(η).

Such an η̂ always exists in the closure of g(Θ) (not necessarily in g(Θ) itself), and is still called a maximum likelihood estimator (MLE) of η = g(θ).

Theorem 4.2 (MLE Invariance Theorem) Let X ∼ f_θ(x) and η = g(θ). If θ̂ is an MLE for θ, then g(θ̂) is an MLE for η = g(θ).

Proof. It suffices to show that L̃(η̂) = L̃(g(θ̂)). By definition, we have

    L̃(η̂) = sup_{η∈g(Θ)} L̃(η) = sup_η sup_{θ∈Θ: g(θ)=η} L(θ) = sup_{θ∈Θ} L(θ) = L(θ̂).

On the other hand, from the definition of L̃(η), we have

    L̃(g(θ̂)) = sup_{θ∈Θ: g(θ)=g(θ̂)} L(θ) = L(θ̂).

4.3 Examples

Example 4.2 Let X_1, ..., X_n ∼ Bin(1, p) with p ∈ (0, 1). Find an MLE for p².

Solution. Note that

    L(p) = Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i} = p^{Σx_i} (1 − p)^{n−Σx_i} = p^{nx̄} (1 − p)^{n(1−x̄)},
    l(p) = n[x̄ log p + (1 − x̄) log(1 − p)].

Setting

    l′(p) = n( x̄/p − (1 − x̄)/(1 − p) ) = n(x̄ − p)/(p(1 − p)) = 0,

we get p̂ = x̄. Since

    l″(p) = −n( x̄/p² + (1 − x̄)/(1 − p)² ) < 0,

l(p) is a concave function. Then p̂ = x̄ is an MLE.

Thus, an MLE of p is p̂ = X̄. By the invariance theorem, an MLE of p² is (X̄)².

Remark 4.1 (a) If X̄ = 0 or 1, then p̂ = 0 or 1 ∉ Θ = (0, 1). (b) Note that in this example, the MLE X̄ is also the UMVUE.

Example 4.3 Let X_1, ..., X_n ∼ Poisson(λ) with λ > 0. Find an MLE for λ.

Solution. Note that

    L(λ) = e^{−nλ} λ^{nx̄} / (x_1! ... x_n!),  and  l(λ) = −nλ + (nx̄) log λ − log(x_1! ... x_n!).

• If x̄ > 0, we set l′(λ) = −n + nx̄/λ = 0 to get λ̂ = x̄. Since l″(λ) = −nx̄/λ² < 0, l(λ) is concave. Hence, λ̂ = x̄ is an MLE.

• If x̄ = 0, then L(λ) = e^{−nλ}, which attains its supremum at λ̂ = 0 = x̄.

Combining these results, an MLE of λ is thus λ̂ = X̄.

Remark 4.2 (i) If X̄ = 0, then λ̂ = 0 ∉ Θ = (0, ∞). (ii) Note that in this example, the MLE X̄ is also the UMVUE.

Example 4.4 Let X_1, ..., X_n ∼ N(µ, σ²) with µ ∈ R and σ² > 0, where n ≥ 2. Find an MLE for θ = (µ, σ²).

Solution. Note

    L(θ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (x_i − µ)² ),

and

    l(θ) = −(1/(2σ²)) Σ_{i=1}^n (x_i − µ)² − (n/2) log σ² − (n/2) log(2π).

The log-likelihood equations become

    l′(θ) = ( l′_µ(θ), l′_{σ²}(θ) ) = ( (1/σ²) Σ_{i=1}^n (x_i − µ),  (1/(2σ⁴)) Σ_{i=1}^n (x_i − µ)² − n/(2σ²) ) = (0, 0).

Therefore, we get θ̂ = (µ̂, σ̂²), where

    µ̂ = x̄,  and  σ̂² = n^{−1} Σ_{i=1}^n (x_i − x̄)².

It can be shown that l″(θ̂) is negative definite (exercise). Hence, the MLE of θ is

    µ̂ = X̄,  and  σ̂² = n^{−1} Σ_{i=1}^n (X_i − X̄)².

Example 4.5 Let X_1, ..., X_n ∼ N(µ, σ²) with µ ≥ 0 and σ² > 0, where n ≥ 2. Find an MLE for θ = (µ, σ²).

Solution. From the last example, the log-likelihood equations

    (1/σ²) Σ_{i=1}^n (x_i − µ) = 0,  and  (1/(2σ⁴)) Σ_{i=1}^n (x_i − µ)² − n/(2σ²) = 0

result in

    µ̂ = x̄,  and  σ̂² = n^{−1} Σ_{i=1}^n (x_i − x̄)².

Case I. If x̄ ≥ 0, then θ̂ ∈ Θ := {(µ, σ²) : µ ≥ 0, σ² > 0}. Hence, θ̂ is an MLE.

Case II. If x̄ < 0, then θ̂ ∉ Θ. But as in the last example,

    l(θ) = l(µ, σ²) = −(1/(2σ²)) Σ_{i=1}^n (x_i − µ)² − (n/2) log σ² − (n/2) log(2π)
         = −(1/(2σ²)) Σ_{i=1}^n (x_i − x̄)² − (n/(2σ²)) (x̄ − µ)² − (n/2) log σ² − (n/2) log(2π).

For each (fixed) σ², l(θ) attains its maximum over µ ≥ 0 at µ = 0, since x̄ < 0. Then

    l(0, σ²) = −(1/(2σ²)) Σ_{i=1}^n (x_i − x̄)² − (n/(2σ²)) x̄² − (n/2) log σ² − (n/2) log(2π),

which attains its maximum at σ² = n^{−1} Σ x_i². Therefore, an MLE of θ is

    µ̂ = 0,  σ̂² = n^{−1} Σ x_i².

In summary, an MLE of θ is

    θ̂ = (0, n^{−1} Σ X_i²)            if X̄ < 0,
       = (X̄, n^{−1} Σ (X_i − X̄)²)    if X̄ ≥ 0.

4.4 Numerical solution to likelihood equations

Typically, there are no explicit solutions to the likelihood equations, and numerical methods have to be used to compute MLEs. We start with an example.

Example. Let X_1, ..., X_n ∼ Cauchy(θ). Find an MLE of θ.

Solution. The pdf of Cauchy(θ) is f_θ(x) = π^{−1}[1 + (x − θ)²]^{−1}. Therefore,

    L(θ) = π^{−n} Π_{i=1}^n [1 + (x_i − θ)²]^{−1},

and

    l(θ) = −n log π − Σ_{i=1}^n log(1 + (x_i − θ)²).

Setting

    l′(θ) = Σ_{i=1}^n 2(x_i − θ)/(1 + (x_i − θ)²) = 0.

An MLE θ̂ cannot be solved for explicitly, but can be obtained by a numerical method, e.g., Newton-Raphson.

Newton's algorithm.

Note that θ̂ →_p θ, and n^{−1} l(θ), n^{−1} l′(θ) and n^{−1} l″(θ) are all sample means of i.i.d. r.v.'s. Thus

    0 = n^{−1} l′(θ̂) ≈ n^{−1} l′(θ) + (θ̂ − θ) n^{−1} l″(θ) ≈ n^{−1} l′(θ) + (θ̂ − θ) n^{−1} E l″(θ).

Then

    θ̂ ≈ θ − l′(θ)[l″(θ)]^{−1} ≈ θ − l′(θ)[E l″(θ)]^{−1}.

• Newton-Raphson algorithm: θ_{j+1} = θ_j − l′(θ_j)[l″(θ_j)]^{−1}, j = 0, 1, 2, ....

• Scoring method: θ_{j+1} = θ_j − l′(θ_j)[E l″(θ_j)]^{−1}, j = 0, 1, 2, ....
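A minimal Newton-Raphson implementation for the Cauchy location likelihood (an illustrative sketch: starting from the sample median and the stopping rule below are conventional choices, not prescribed by the notes):

```python
import math
import random
from statistics import median

def derivs(theta, x):
    # l'(theta) and l''(theta) for l(theta) = -sum log(1 + (x_i - theta)^2)
    lp = sum(2 * (xi - theta) / (1 + (xi - theta) ** 2) for xi in x)
    lpp = sum(2 * ((xi - theta) ** 2 - 1) / (1 + (xi - theta) ** 2) ** 2 for xi in x)
    return lp, lpp

def newton_raphson(x, theta0, tol=1e-10, max_iter=100):
    # theta_{j+1} = theta_j - l'(theta_j) / l''(theta_j)
    theta = theta0
    for _ in range(max_iter):
        lp, lpp = derivs(theta, x)
        step = lp / lpp
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Cauchy(theta) sample via the inverse CDF: theta + tan(pi (U - 1/2))
random.seed(8)
theta_true, n = 3.0, 500
x = [theta_true + math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

theta_hat = newton_raphson(x, median(x))   # start from the (robust) sample median
print(round(theta_hat, 2))                 # close to theta_true = 3
```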

4.5 Certain “Issues” with MLE

4.5.1 MLE may not be unique

Example. Let X1 , ..., Xn ∼ U (θ − 1/2, θ + 1/2) with θ ∈ R. Find an MLE of θ.

Solution. Note

    L(θ) = Π_{i=1}^n I(θ − 1/2 < x_i < θ + 1/2)
         = I(θ − 1/2 < x_(1) ≤ x_(n) < θ + 1/2)
         = I(x_(n) − 1/2 < θ < x_(1) + 1/2).

By plotting L(θ) against θ, we see that any θ̂ ∈ (x_(n) − 1/2, x_(1) + 1/2) is an MLE. So there are infinitely many MLEs; the length of this interval is x_(1) + 1/2 − (x_(n) − 1/2) = 1 − (x_(n) − x_(1)) → 0 as n → ∞.

4.5.2 MLE may not be asymptotically normal

Example. Let X_1, ..., X_n ∼ U(0, θ). Find an MLE for θ.

Solution. The likelihood function is

    L(θ, x) = θ^{−n} I(0 ≤ x_(1) ≤ x_(n) ≤ θ).

Drawing a graph of L(θ, x) as a function of θ, we see that an MLE for θ is θ̂ = X_(n). Now, for 0 ≤ t ≤ n,

    P(n(θ − θ̂)/θ ≤ t) = P(θ̂ ≥ θ − tθ/n)
                       = 1 − P(X_(n) ≤ θ − tθ/n)
                       = 1 − [P(X_1 ≤ θ − tθ/n)]^n
                       = 1 − (1 − t/n)^n
                       → 1 − e^{−t} as n → ∞.

Hence, θ̂ is asymptotically Exp(1) (after normalization), instead of normal.

4.5.3 MLE can be inconsistent

{Tn} is a consistent estimator of g(θ) if Tn →p g(θ), i.e.,

    limₙ→∞ P(|Tn − g(θ)| > ε) = 0,   for any ε > 0.

For example, let X₁, ..., Xn ∼ F, µ = EX₁, X̄ = n⁻¹ Σᵢ Xᵢ. If E|X₁| < ∞, then by the WLLN, we get X̄ →p µ.

When there are many nuisance parameters, then MLE’s can behave very badly even for
large samples, as shown below.

Example. We have p independent random samples

    X₁₁, ..., X₁ₙ ∼ N(µ₁, σ²),
    ......
    Xp₁, ..., Xpₙ ∼ N(µp, σ²).

Let θ = (µ₁, ..., µp, σ²). Equivalently,

    Xᵢ = (X₁ᵢ, ..., Xpᵢ) ∼ MN(µ, σ²I), where µ = (µ₁, ..., µp), and i = 1, ..., n.

Show that

1. an MLE of θ is θ̂, where

       µ̂ᵢ = n⁻¹ Σⱼ₌₁ⁿ xᵢⱼ = x̄ᵢ· ,   σ̂² = (pn)⁻¹ Σᵢ₌₁ᵖ Σⱼ₌₁ⁿ (xᵢⱼ − x̄ᵢ·)²;

2. if p is fixed and n → ∞, then θ̂ is consistent for θ;

3. if n is fixed and p → ∞, then θ̂ is NOT consistent for θ.

Proof.

1. The log-likelihood function is

       l(θ) = log ∏ᵢ₌₁ᵖ ∏ⱼ₌₁ⁿ fθ(xᵢⱼ)
            = log{ (2πσ²)^(−pn/2) exp[ −(1/2) Σᵢ Σⱼ (xᵢⱼ − µᵢ)²/σ² ] }
            = −(pn/2) log(2π) − (pn/2) log(σ²) − (1/2σ²) Σᵢ₌₁ᵖ Σⱼ₌₁ⁿ (xᵢⱼ − µᵢ)².

   The likelihood equations are

       ∂l(θ)/∂µᵢ = (1/σ²) Σⱼ₌₁ⁿ (xᵢⱼ − µᵢ) = 0,   and   ∂l(θ)/∂σ² = −pn/(2σ²) + (1/2σ⁴) Σᵢ₌₁ᵖ Σⱼ₌₁ⁿ (xᵢⱼ − µᵢ)² = 0,

   whose solution is θ̂ as required. Note

       σ̂² = (σ²/pn) Σᵢ₌₁ᵖ Σⱼ₌₁ⁿ [(xᵢⱼ − x̄ᵢ·)/σ]² = (σ²/pn) Σᵢ₌₁ᵖ χ²ₙ₋₁ = (σ²/pn) χ²ₚ₍ₙ₋₁₎.

   Thus,

       E σ̂² = ((n − 1)/n) σ² = σ² − σ²/n,   Bias(σ̂²) = −σ²/n,   Var(σ̂²) = 2(n − 1)σ⁴/(pn²).

2. If p is fixed and n → ∞, µ̂ᵢ is consistent for µᵢ by the WLLN. By Chebyshev's inequality,

       P(|σ̂² − σ²| > ε) ≤ ε⁻² E(σ̂² − σ²)²
                        = ε⁻² [Bias²(σ̂²) + Var(σ̂²)]
                        = ε⁻² [σ⁴/n² + Var(σ̂²)]
                        → 0.

   That is, σ̂² is consistent for σ².

3. If n is fixed, p → ∞, then

• µ̂ᵢ is not consistent for µᵢ, since E µ̂ᵢ = µᵢ but Var(µ̂ᵢ) = σ²/n > 0 stays fixed;

• σ̂² is NOT consistent for σ², since Bias(σ̂²) = −σ²/n ≠ 0, even though Var(σ̂²) → 0.

The third part is often referred to as the "small n, large p" problem in the literature (p ≫ n). There are several ways to overcome this difficulty.

• Assume that both n, p → ∞, but n/p → 0.

• Assume sparsity: many µᵢ's are zero. We can impose an L₁ penalty, e.g., as in the Lasso.
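The inconsistency in part 3 is easy to see numerically. The following sketch (our own simulation; µᵢ = 0 is chosen only for convenience) keeps n = 5 fixed and lets p grow; σ̂² settles at ((n − 1)/n)σ², not at σ².

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 5, 4.0                 # n fixed and small; p will be large

def sigma2_mle(p):
    x = rng.normal(0.0, np.sqrt(sigma2), size=(p, n))   # mu_i = 0 for all i
    xbar = x.mean(axis=1, keepdims=True)                # group means
    return ((x - xbar) ** 2).mean()                     # (pn)^{-1} sum of squares

est = sigma2_mle(p=200_000)
limit = (n - 1) / n * sigma2       # = 3.2, the a.s. limit as p -> infinity
```

The estimate concentrates around 3.2 rather than the true value 4.0, no matter how large p is.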

4.6 Exercises

1. Let X1 , ..., Xn ∼iid fθ (x). Find an MLE of θ if

(a) fθ (x) = θ−1 , where x = 1, ..., θ and θ is an integer between 1 and θ0


(b) fθ (x) = e−(x−θ) , where x ≥ θ and θ > 0.
(c) fθ (x) = θ(1 − x)θ−1 I(0 < x < 1), where θ > 1.
(d) fθ (x) is the pdf of N (θ, θ2 ), where θ ∈ R.

2. Let X ∼ fθ (x) and T = T (X) is sufficient. Show that if an MLE exists, it is a


function of T but it may not be sufficient for θ.

3. Let X be a sample from Fθ , θ ∈ R. Suppose that Fθ ’s have pdf’s fθ and that


{x : fθ (x) > 0} does not depend on θ. Assume further that an unbiased estimator
θ̂ of θ attains the Cramer-Rao lower bound and the conditions in the Cramer-Rao
theorem hold for θ̂. Show that

(a). θ̂ is the UMVUE of θ.


(b). θ̂ is also the unique MLE of θ.

4. X1 , · · · , Xn are i.i.d. r.v.s with fθ (x) = (θ + 1)xθ , 0 ≤ x ≤ 1. (i.e. Beta(θ + 1, 1).)

(1) Find the MLE of θ.


(2) Find the asymptotic distribution of the MLE.

Chapter 5

Efficient Estimation

5.1 Information inequality (or Cramer-Rao lower bound)

Let X ∼ fθ(x), and l(θ) = log fθ(X), where θ ∈ Rᵏ. Write l′(θ) = ∂l(θ)/∂θ as a 1×k row vector. Define the Fisher information about θ contained in X by

    IX(θ) = E[ l′(θ)τ l′(θ) ]   (a k×k matrix).


Theorem 5.1 (Cramer-Rao (C-R) lower bound) Let Eδ(X) = g(θ). Then

    Var(δ(X)) ≥ g′(θ) [IX(θ)]⁻¹ g′(θ)τ,

where g′(θ) is 1×k, and the equality holds iff δ(X) and l′(θ) are linearly dependent, provided

    ∂/∂θ ∫ h(x) fθ(x) dx = ∫ h(x) ∂fθ(x)/∂θ dx,   for θ ∈ Θ,

for h(x) = 1 and h(x) = δ(x).

Proof

(a) Lemma. For any r.v. δ and Y = (Y1 , ..., Yk ), we have

V ar(δ) ≥ Cov(δ, Y )[V ar(Y )]−1 [Cov(δ, Y )]τ , (1.1)

where the equality holds iff δ and Y are linearly dependent. Here,

Cov(δ, Y ) = (Cov(δ, Y1 ), ......, Cov(δ, Yk ))1×k ,


V ar(Y ) = (Cov(Yi , Yj ))k×k .

Proof. For any λ = (λ₁, ..., λk) ∈ Rᵏ, we have

    0 ≤ Var(δ − λYτ) = Cov(δ − λYτ, δ − λYτ)
                     = Var(δ) − 2Cov(δ, Y)λτ + λVar(Y)λτ
                     =: g(λ).

Setting g′(λ) = 2λVar(Y) − 2Cov(δ, Y) = 0, we get the stationary point λ₀ = Cov(δ, Y)[Var(Y)]⁻¹. Thus,

    g(λ₀) = Var(δ) − Cov(δ, Y)[Var(Y)]⁻¹[Cov(δ, Y)]τ ≥ 0,

where equality holds iff 0 = g(λ₀) = Var(δ − λ₀Yτ), i.e., iff δ − λ₀Yτ = C, a constant.
(b) Take

    Y = l′(θ) = ∂/∂θ log fθ(X) = [∂fθ(X)/∂θ] / fθ(X).

Differentiating (w.r.t. θ)

    1 = ∫ fθ(x) dx,   and   g(θ) = E δ(X) = ∫ δ(x) fθ(x) dx,

we get

    0 = ∫ ∂fθ(x)/∂θ dx = ∫ l′(θ) fθ(x) dx = E l′(θ) = EY,

    g′(θ) = ∂/∂θ E δ(X) = ∫ δ(x) ∂fθ(x)/∂θ dx = ∫ δ(x) l′(θ) fθ(x) dx
          = E[δ(X) l′(θ)] = Cov(δ(X), l′(θ)) = Cov(δ(X), Y),

    IX(θ) = E[ l′(θ)τ l′(θ) ] = E(Yτ Y) = Var(Y).

Applying (1.1), we get

    Var(δ) ≥ g′(θ) [IX(θ)]⁻¹ [g′(θ)]τ,

where the equality holds iff δ − λ₀ l′(θ)τ = C, i.e., iff δ(X) − Eδ(X) = λ₀ l′(θ)τ.

Some properties of the Fisher information matrix

1. If X ⊥ Y, then I(X,Y) (θ) = IX (θ) + IY (θ).


2. If X = {X₁, ..., Xn} ∼ i.i.d. fθ(x), then IX(θ) = Σᵢ₌₁ⁿ IXᵢ(θ) = n IX₁(θ).

3. IX(θ) = −E l″(θ), provided ∫ fθ(x) dx can be differentiated twice under the integral sign, for θ ∈ Θ.
   Proof. We shall only prove the case k = 1. Differentiating 1 = ∫ fθ(x) dx w.r.t. θ, we get

       0 = ∫ l′(θ) fθ(x) dx = E l′(θ).

   Differentiating this again w.r.t. θ gives

       0 = ∫ l″(θ) fθ(x) dx + ∫ [l′(θ)]² fθ(x) dx = E l″(θ) + E[l′(θ)]².

4. Let θ = h(η), where h : Rᵐ → Rᵏ with m, k ≥ 1. Then

       IX(η) = h′(η)τ IX(θ) h′(η).

   Proof. From the definition,

       IX(η) = E[ (∂l(θ)/∂η)τ (∂l(θ)/∂η) ]
             = E[ (l′(θ)h′(η))τ (l′(θ)h′(η)) ]
             = h′(η)τ E[ l′(θ)τ l′(θ) ] h′(η)
             = h′(η)τ IX(θ) h′(η).
5. Equivalently, if θ = θ(η) maps Rᵐ to Rᵏ (m, k ≥ 1) and θ(·) is differentiable, then the Fisher information that X contains about η is

       IX(η) = θ′(η)τ IX(θ) θ′(η),

   by the same argument as in 4 with h replaced by θ(·).

6. If θ = h(η) is 1-1 (from Rᵏ to Rᵏ), then the Cramer-Rao lower bound remains the same (i.e., it is reparametrization invariant).

   Proof. Let X₁, ..., Xn ∼ fθ(x), with θ = h(η) 1-1. Now Eδ(X) = g(θ) = g(h(η)) =: r(η). By the C-R lower bound theorem,

       Var(δ(X)) ≥ g′(θ) [IX(θ)]⁻¹ g′(θ)τ =: A,

   and also

       Var(δ(X)) ≥ r′(η) [IX(η)]⁻¹ r′(η)τ =: B.

   We shall show that A = B. By the chain rule,

       r′(η) = ∂g(θ(η))/∂η = g′(θ) h′(η),

   where the k×k Jacobian h′(η) has full rank. From this and (4) above, we get

       B = g′(θ)h′(η) [h′(η)τ IX(θ) h′(η)]⁻¹ (g′(θ)h′(η))τ
         = g′(θ)h′(η) h′(η)⁻¹ [IX(θ)]⁻¹ (h′(η)τ)⁻¹ h′(η)τ g′(θ)τ
         = g′(θ) [IX(θ)]⁻¹ g′(θ)τ = A.

7. If θ is a scalar (i.e., k = 1), then the Cramer-Rao lower bound becomes

       Var(δ(X)) ≥ [g′(θ)]² / IX(θ),   where IX(θ) = E[l′(θ)]² = −E l″(θ).

5.2 Efficiency, sub-efficiency, and super-efficiency

An unbiased estimator δ(X) is

• efficient if Var(δ) = [IX(θ)]⁻¹;

• sub-efficient if Var(δ) > [IX(θ)]⁻¹;

• super-efficient if Var(δ) < [IX(θ)]⁻¹.

We have used the Cramer-Rao lower bound as a benchmark to define efficiency.

Examples of efficient and sub-efficient estimators

A necessary and sufficient condition for reaching the Cramer-Rao lower bound is that δ(X) and l′(θ) are linearly dependent, i.e., δ(X) − Eδ(X) = λ₀ (l′(θ))τ.

UMVUE may or may not be efficient.

Example 5.1 Let X1 , ..., Xn be iid from the Bin(1, p). Show that the Cramer-Rao lower
bound can be reached in estimating p but not for p2 .

Proof. First fp(x) = p^(Σᵢ xᵢ) (1 − p)^(n − Σᵢ xᵢ), xᵢ = 0, 1. So l(p) = log fp(x) = (Σᵢ xᵢ) log p + (n − Σᵢ xᵢ) log(1 − p). Then

    l′(p) = (Σᵢ xᵢ)/p − (n − Σᵢ xᵢ)/(1 − p) = n(x̄ − p)/[p(1 − p)],   xᵢ = 0, 1.   (2.2)

Therefore,

    IX(p) = E[l′(p)]² = n E(X₁ − p)²/[p²(1 − p)²] = n/[p(1 − p)].

If Eδ(X) = g(p), then by the Cramer-Rao inequality,

    Var(δ(X)) ≥ [g′(p)]²/IX(p) = [g′(p)]² p(1 − p)/n.
From (2.2), the equality only holds for g(p) = c0 + c1 p, a linear function of p.

1. If g(p) = p, then Var(δ(X)) ≥ p(1 − p)/n, which is indeed attained by δ(X) = X̄ (the UMVUE of p).

2. If g(p) = p², then Var(δ(X)) ≥ 4p³(1 − p)/n =: CRLB.
   Since this lower bound can never be reached, we are interested in how big the gap is. Let T = Σᵢ Xᵢ and take η(T) = T(T − 1)/[n(n − 1)] (the UMVUE of p²). After some tedious algebra, we get

       Var(η(T)) = E[T²(T − 1)²]/[n²(n − 1)²] − p⁴ = 4p³(1 − p)/n + 2p²(1 − p)²/[n(n − 1)] = CRLB + O(n⁻²).

   Note: To find Var(η(T)), one can use the m.g.f. of T, Mn(t) = (q + peᵗ)ⁿ. So

       Mn′(t) = n(q + peᵗ)ⁿ⁻¹ peᵗ = npeᵗ Mn₋₁(t),
       Mn″(t) = npeᵗ Mn₋₁(t) + npeᵗ M′n₋₁(t) = npeᵗ Mn₋₁(t) + n(n − 1)p²e²ᵗ Mn₋₂(t),
       ......

   Setting t = 0 in the successive derivatives gives the first four moments of T.
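Both claims can be checked by simulation (a sketch with our own choice of n, p, and replication count): the variance of X̄ matches the bound p(1 − p)/n, while the UMVUE of p² sits strictly above 4p³(1 − p)/n, by roughly 2p²(1 − p)²/[n(n − 1)].

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 50, 0.3, 200_000
T = rng.binomial(n, p, size=reps)              # T = sum of the X_i

# g(p) = p: X-bar attains the CRLB p(1-p)/n
var_xbar = (T / n).var()
crlb_p = p * (1 - p) / n

# g(p) = p^2: the UMVUE T(T-1)/(n(n-1)) exceeds the CRLB 4p^3(1-p)/n
eta = T * (T - 1) / (n * (n - 1))
var_eta = eta.var()
crlb_p2 = 4 * p ** 3 * (1 - p) / n
gap = 2 * p ** 2 * (1 - p) ** 2 / (n * (n - 1))   # the O(n^{-2}) excess
```

The simulated Var(η(T)) reproduces CRLB + gap, so the shortfall from the bound is only of order n⁻².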

Examples of super-efficient estimator

On regularity conditions

Example 5.2 Let X1 , . . . , Xn ∼ U (0, θ). Show that

1. the assumption in the Cramer-Rao theorem does not hold
(the range of the pdf depends on the unknown parameter).
2. the information inequality does not apply to the UMVUE of θ.

Solution.

1. Use Leibnitz’s rule, we have


∂ ∂ θ h(x)
Z Z
h(x)fθ (x)dx = dx
∂θ ∂θ 0 θ
Z θ
h(θ) ∂ 1
= + h(x) dx
θ 0 ∂θ θ
h(θ) ∂
Z
= + h(x) fθ (x)dx
θ ∂θ

Z
6 = h(x) fθ (x)dx
∂θ
unless h(θ)/θ = 0 for all θ. Therefore, the assumption fails here.

2. Here fθ(x) = 1/θ, 0 < x < θ. So ∂/∂θ log fθ(x) = −1/θ, and

       IX(θ) = n E[∂/∂θ log fθ(X)]² = n ∫₀^θ (1/θ²)(1/θ) dx = n/θ².

   By the Cramer-Rao theorem, if W were any unbiased estimator satisfying the regularity conditions, then

       Var(W) ≥ θ²/n.

   Take θ̂ = (n + 1)X(n)/n (the UMVUE). Recall fX(n)(x) = (nxⁿ⁻¹/θⁿ) I(0 < x < θ), so

       E X(n)² = ∫₀^θ nxⁿ⁺¹/θⁿ dx = nθ²/(n + 2),

       Var(θ̂) = ((n + 1)/n)² E X(n)² − θ² = ((n + 1)/n)² · nθ²/(n + 2) − θ² = θ²/[n(n + 2)]
               < θ²/n = [IX(θ)]⁻¹.

   Note that Var(θ̂) is of size O(n⁻²), as opposed to the usual size O(n⁻¹).

Remark 5.1 In this example, the UMVUE has variance below the formal C-R lower bound, which is a good thing. In fact, the variance of θ̂ is of order n⁻², much smaller than the usual n⁻¹. More precisely, n(θ − X(n))/θ →d Exp(1); since n(θ̂ − X(n))/θ = X(n)/θ →p 1, Slutsky's theorem gives n(θ − θ̂)/θ →d Exp(1) − 1. Either way, θ̂ is asymptotically a (shifted) exponential, not asymptotically normal.

Proof. For any y ≥ 0, we have

    P(n(θ − X(n)) ≤ y) = P(X(n) ≥ θ − y/n)
                       = 1 − P(X(n) ≤ θ − y/n)
                       = 1 − [P(X₁ ≤ θ − y/n)]ⁿ
                       = 1 − [1 − y/(nθ)]ⁿ.

Taking y = tθ, we get

    P(n(θ − X(n))/θ ≤ t) = 1 − (1 − t/n)ⁿ → 1 − e⁻ᵗ   as n → ∞,

that is, n(θ − X(n))/θ →d Exp(1).

5.3 Asymptotic efficiency

5.3.1 Asymptotic Cramer-Rao lower bound

The Cramer-Rao theorem (under appropriate regularity conditions) states

    Varθ(δn) ≥ ([Eδn]′θ)² / [nIX₁(θ)]   ⟺   Varθ(√n δn) ≥ ([Eδn]′θ)² / IX₁(θ),   ∀θ ∈ Θ.   (3.3)

The above inequality holds for fixed n. What happens as n → ∞?

Assume

    √n[δn − g(θ)] →d N(0, v(θ)),   v(θ) > 0,   (3.4)

which holds if bias(δn) = Eδn − g(θ) = o(n⁻¹/²) (i.e., δn is asymptotically unbiased) and Var(√n δn) → v(θ). Letting n → ∞ in (3.3), we would get

    v(θ) ≥ [g′(θ)]² / I(θ)   for ALL θ ∈ Θ.   (3.5)

However, (3.5) may not hold for ALL θ ∈ Θ, even under nice regularity conditions. (An example where the regularity conditions fail is Unif(0, θ).)

Example 5.3 (Hodges) Let X₁, ..., Xn ∼ N(θ, 1), θ ∈ R. Then IX₁(θ) = 1. The MLE of θ is θ̂ = X̄ ∼ N(θ, 1/n). Now let us define

    θ̃ = X̄    if |X̄| ≥ n⁻¹/⁴,
       = tX̄   if |X̄| < n⁻¹/⁴,

where we choose 0 ≤ t < 1. Show that

    √n(θ̃ − θ) →d N(0, t²)   if θ = 0,
                  N(0, 1)    if θ ≠ 0.

Remark 5.2 θ̃ shrinks X̄ around 0 to tX̄. Hence it is more efficient at 0 if the true
parameter is 0. The trouble is that we don’t know where the true θ is.

Solution.

Note that θ̃ = X̄In + tX̄(1 − In), where In = I(|X̄| ≥ n⁻¹/⁴) = I(|√n X̄| ≥ n¹/⁴).

(1) If θ = 0, then √n X̄ ∼ N(0, 1). Also, In →p 0, since P(In ≥ ε) = P(|√n X̄| ≥ n¹/⁴) = P(|N(0, 1)| ≥ n¹/⁴) → 0. Hence, by Slutsky's theorem, we have

    √n(θ̃ − θ) = √n X̄ In + t√n X̄(1 − In) →d N(0, t²).

(2) If θ ≠ 0, then √n(X̄ − θ) ∼ N(0, 1). Also, √n(1 − In) →p 0, since

    P(√n(1 − In) ≥ ε) = P(In = 0) = P(|√n X̄| < n¹/⁴)
                      = P(−n¹/⁴ < √n X̄ < n¹/⁴)
                      = P(−√n θ − n¹/⁴ < √n(X̄ − θ) < n¹/⁴ − √n θ)
                      = Φ(−√n θ + n¹/⁴) − Φ(−√n θ − n¹/⁴)
                      → 0.

Hence, by Slutsky's theorem, we have

    √n(θ̃ − θ) = √n(X̄ − θ) + (t − 1)√n X̄(1 − In) →d N(0, 1).

Combining the above two cases, we get

    √n(θ̃ − θ) →d N(0, t²)   if θ = 0,
                  N(0, 1)    if θ ≠ 0.

In the case θ = 0, the asymptotic Cramer-Rao bound (3.5) does not hold, since t² < 1 = [I(θ)]⁻¹.
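A Monte Carlo sketch of the two limits (our own parameters; X̄ is drawn directly from its N(θ, 1/n) distribution rather than by averaging n observations):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps, t = 10_000, 50_000, 0.5

def scaled_mse(theta):
    xbar = rng.normal(theta, 1 / np.sqrt(n), size=reps)   # X-bar ~ N(theta, 1/n)
    shrink = np.abs(xbar) < n ** (-0.25)
    theta_tilde = np.where(shrink, t * xbar, xbar)        # Hodges estimator
    return (n * (theta_tilde - theta) ** 2).mean()        # ~ asymptotic variance

v0 = scaled_mse(0.0)   # ~ t^2 = 0.25 < 1 = [I(theta)]^{-1}: super-efficient at 0
v1 = scaled_mse(1.0)   # ~ 1, the usual efficient variance
```

The scaled mean squared error is about 0.25 at θ = 0 and about 1 at θ = 1, reproducing the two limiting variances.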

Remark 5.3

• The set of all points violating (3.5) is called points of superefficiency. Supereffi-
ciency in fact is a very good property.

• Even for the normal family (the best possible family), (3.5) may not hold. However,
it will hold for a more restricted class (i.e., “regular” estimators).

• The example can be easily extended to have a countable number of super-efficiency points. For instance, we can define, for k = 0, ±1, ±2, ...,

      θ̂ = X̄                 if |X̄ − k| ≥ n⁻¹/⁴,
        = k + t_k(X̄ − k)     if |X̄ − k| < n⁻¹/⁴,

  where we choose 0 ≤ t_k < 1 for all k. Then it can be shown that this estimator is super-efficient at every k = 0, ±1, ±2, ....

• Although (3.5) is not always true, however, it holds a.e., i.e., the points of super-
efficiency has Lebesgue measure 0. See the next theorem (Le Cam (1953), Bahadur
(1964)).

Theorem 5.2 If δn = δn (X1 , . . . , Xn ) is any estimator satisfying (3.4), then v(θ) satisfies
(3.5) except on a set of Lebesgue measure 0, given the following conditions hold:

1. The set A = {x : fθ (x) > 0} is independent of θ. [i.e., they have common support,
and the boundary of Θ does not depend on θ.]

2. For every x ∈ A, ∂ 2 fθ (x)/∂θ2 exists and is continuous in θ.

3. ∫ fθ(x) dx can be twice differentiated under the integral sign.

4. 0 < IX₁(θ) < ∞, where IX₁(θ) = E[∂ log fθ(X₁)/∂θ]².

5. For any θ₀ ∈ Θ, there exist c > 0 and a function M(x) (both may depend on θ₀) such that for all x ∈ A and |θ − θ₀| < c, we have

       |∂² log fθ(x)/∂θ²| ≤ M(x),   and   Eθ₀[M(X₁)] < ∞.

5.4 Consistency of the MLE

Assume throughout the section:

(A1) X = (X1 , · · · , Xn ) ∼iid f (x|θ) with respect to µ with common support.


(A2) The true parameter value θ0 is an interior point of the parameter space Ω.

Theorem 5.3 For any fixed θ 6= θ0 , we have


Pθ0 [L(θ0 |X) > L(θ|X), for large n] = Pθ0 [l(θ0 ) > l(θ), for large n] = 1.

Proof. By the SLLN and Jensen's inequality, we have

    (1/n)[l(θ) − l(θ₀)] = (1/n) Σᵢ log[f(Xᵢ|θ)/f(Xᵢ|θ₀)] →a.s. Eθ₀ log[f(X₁|θ)/f(X₁|θ₀)] < log Eθ₀[f(X₁|θ)/f(X₁|θ₀)] = 0.

For a single observation x₁, we need not have f(x₁|θ₀) > f(x₁|θ). However, for a large enough sample size n, we have L(θ₀) = ∏ᵢ₌₁ⁿ f(xᵢ|θ₀) > ∏ᵢ₌₁ⁿ f(xᵢ|θ) = L(θ) with high probability. We do not know θ₀, but we can determine the value θ̂ of θ which maximizes L(θ) = ∏ᵢ₌₁ⁿ f(xᵢ|θ). The above theorem suggests that, if fθ(x) varies smoothly with θ, the MLE θ̂ should typically be close to the true value θ₀, and hence be a reasonable estimator.

In fact, we illustrate this for the simple case in which Θ is finite, say, Θ = {θ₀, θ₁, ..., θk}. By the last theorem, for each i = 1, ..., k, we have

    Pθ₀[l(θ₀) > l(θᵢ), for large n] = 1.

Hence,

    Pθ₀[ l(θ₀) > max₁≤ᵢ≤ₖ l(θᵢ), for large n ] = 1,

so the MLE equals θ₀ for all large n, with probability one.
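Theorem 5.3 can be illustrated numerically. In the sketch below (our own setup: N(θ, 1) observations with θ₀ = 0 against a fixed alternative θ₁ = 0.5), the probability that l(θ₀) > l(θ₁) climbs toward 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)
theta0, theta1, reps = 0.0, 0.5, 20_000

def prob_true_wins(n):
    x = rng.normal(theta0, 1.0, size=(reps, n))
    # l(theta0) - l(theta1) for N(theta, 1) log-likelihoods
    diff = -0.5 * ((x - theta0) ** 2 - (x - theta1) ** 2).sum(axis=1)
    return (diff > 0).mean()

p_small = prob_true_wins(5)     # noticeably below 1
p_large = prob_true_wins(100)   # close to 1
```

Here the exact probability is Φ(0.25√n), so it is about 0.71 at n = 5 and about 0.994 at n = 100.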

5.5 MLE is asymptotically efficient

Asymptotic efficiency

Suppose that {δn} is a sequence of estimators of g(θ). Then typically,

1. (Consistency): δn →p g(θ).

2. (Asymptotic normality): √n[δn − g(θ)] →d N(0, v(θ)).

If the lower bound in (3.5) is attained, δn is called asymptotically efficient.

Definition. A sequence {δn} is said to be asymptotically efficient (AE) if

    √n[δn − g(θ)] →d N(0, [g′(θ)]²/I(θ)).

Two questions immediately arise:

• Do AE estimators exist? Under what conditions?

  AE estimators are not unique. If δn is AE, so is δn + Rn whenever √n Rn →p 0, since

      √n(δn + Rn − g(θ)) = √n(δn − g(θ)) + √n Rn →d N(0, [g′(θ)]²/I(θ)).

• How do we find such estimators if they exist?


AE estimators can be obtained by different methods, like MLE and Bayes.

MLE is asymptotically efficient

Denote x = (x1 , ..., xn ), l(θ) = log fθ (x), l0 (θ) = ∂ log fθ (x)/∂θ.

Theorem 5.4 Let F = {fθ : θ ∈ Θ}, where Θ is an open set in R. Assume that

1. For each θ ∈ Θ, l‴(θ) exists for all x, and there exists H(x) ≥ 0 (possibly depending on θ₀) such that for θ ∈ N(θ₀, ε) = {θ : |θ − θ₀| < ε},

       |l‴(θ)| ≤ H(x),   Eθ₀ H(X₁) < ∞.

2. For gθ(x) = fθ(x) or l‴(θ), we have ∂/∂θ ∫ gθ(x) dx = ∫ ∂gθ(x)/∂θ dx.

3. For each θ ∈ Θ, we have 0 < IX₁(θ) = Eθ[l′(θ)]² < ∞.

Let X₁, ..., Xn ∼ Fθ₀. Then with probability 1, the likelihood equation l′(θ) = 0 admits a sequence of solutions {θ̂n} satisfying

1. (Strong consistency) θ̂n →a.s. θ₀;

2. (Asymptotic efficiency) √n(θ̂n − θ₀) →d N(0, [IX₁(θ₀)]⁻¹).

Proof.

1. Let s(θ) =: s(X, θ) = l′(θ)/n. Then

       |s″(θ)| ≤ (1/n) Σᵢ₌₁ⁿ |∂³ log fθ(Xᵢ)/∂θ³| ≤ (1/n) Σᵢ₌₁ⁿ H(Xᵢ) =: H̄(X).

   By Taylor's expansion,

       s(θ) = s(θ₀) + s′(θ₀)(θ − θ₀) + (1/2)s″(ξ)(θ − θ₀)²
            = s(θ₀) + s′(θ₀)(θ − θ₀) + (1/2)H̄(X)η*(θ − θ₀)²,

   where |η*| = |s″(ξ)|/H̄(X) ≤ 1 by assumption. By the SLLN, we have, a.s.,

       s(θ₀) → E s(θ₀) = 0,
       s′(θ₀) → E s′(θ₀) = −IX₁(θ₀),
       H̄(X) → Eθ₀ H(X₁) < ∞.

Remark 5.4 Heuristically, for large n and θ = θ₀ ± ε, we have

    s(θ₀ ± ε) ≈ ∓IX₁(θ₀)ε + (1/2)E H(X₁)η*ε² ≈ ∓IX₁(θ₀)ε,

which is negative at θ₀ + ε and positive at θ₀ − ε. Thus s(θ) changes sign near θ₀, and hence must have at least one root there. By choosing ε smaller and smaller, we get a sequence of solutions converging to θ₀. Let us make this precise below.

Fix ε > 0. When n is large enough, say n ≥ nε (depending on ε), we have, a.s.,

    |s(θ₀)| ≤ ε²,
    |s′(θ₀) + IX₁(θ₀)| ≤ ε²,
    |H̄(X) − Eθ₀ H(X₁)| ≤ ε².

Consequently, if ε is chosen small enough, we have

    s(θ₀ + ε) = s(θ₀) + s′(θ₀)ε + (1/2)H̄(X)η*ε²
              ≤ ε² + (−IX₁(θ₀) + ε²)ε + (1/2)(Eθ₀ H(X₁) + ε²)ε²
              = −IX₁(θ₀)ε + ε² + (1/2)(Eθ₀ H(X₁) + ε²)ε² + ε³
              < 0.

Similarly, as n is large enough, we have, a.s.,

    s(θ₀ − ε) = s(θ₀) − s′(θ₀)ε + (1/2)H̄(X)η*ε² > 0.

Then, by continuity of s(θ), the interval [θ₀ − ε, θ₀ + ε] contains a solution of s(θ) = 0. In particular, it contains the solution

    θ̂n,ε = inf{θ : θ₀ − ε ≤ θ ≤ θ₀ + ε, and s(θ) = 0}.

We now show that θ̂n,ε is a proper random variable, that is, θ̂n,ε is measurable. Note that, for all t ≥ θ₀ − ε, we have

    {θ̂n,ε > t} = { inf over θ₀−ε ≤ θ ≤ t of s(θ) > 0 } = { inf over θ₀−ε ≤ θ ≤ t, θ rational, of s(θ) > 0 },

which is clearly a measurable set, since s(θ) is continuous in θ.

Take ε = 1/k for k ≥ 1. For each k ≥ 1, we define a sequence of r.v.'s by

    θ̂k = θ̂_{n_{1/k}, 1/k},   k = 1, 2, ....

Clearly, θ̂k → θ₀ with probability one.

2. For large n, we have seen that

       0 = s(θ̂n) = s(θ₀) + s′(θ₀)(θ̂n − θ₀) + (1/2)H̄(X)η*(θ̂n − θ₀)².

   Thus,

       √n s(θ₀) = √n(θ̂n − θ₀)[ −s′(θ₀) − (1/2)H̄(X)η*(θ̂n − θ₀) ].

   Since √n s(θ₀) →d N(0, I(θ₀)) by the CLT, and −s′(θ₀) − (1/2)H̄(X)η*(θ̂n − θ₀) → I(θ₀) a.s., it follows from Slutsky's theorem that

       √n(θ̂n − θ₀) = √n s(θ₀) / [ −s′(θ₀) − (1/2)H̄(X)η*(θ̂n − θ₀) ] →d N(0, I(θ₀))/I(θ₀) = N(0, I⁻¹(θ₀)).

Remark 5.5 To estimate g(θ) (g smooth), we can use g(θ̂n). Applying the continuous mapping theorem and the δ-method, we obtain

• Strong consistency: g(θ̂n) → g(θ₀) with probability 1.

• Asymptotic efficiency: √n( g(θ̂n) − g(θ₀) ) →d N(0, [g′(θ₀)]²/IX₁(θ₀)).

An example: asymptotic efficiency holds when the assumptions are violated

Example 5.4 Let X₁, ..., Xn ∼ iid fθ(x) = (1/2) exp{−|x − θ|} (the double exponential). Despite the violation of regularity conditions, the asymptotic version of the Cramer-Rao theorem still holds for the MLE.

Proof. First of all, for EVERY x, fθ(x) is not differentiable w.r.t. θ at θ = x. So the regularity conditions are not satisfied. Secondly,

    l(θ) = log ∏ᵢ₌₁ⁿ fθ(xᵢ) = log{ 2⁻ⁿ exp( −Σᵢ₌₁ⁿ |xᵢ − θ| ) } = C − Σᵢ₌₁ⁿ |xᵢ − θ|.

So an MLE of θ is θ̂ = arg maxθ∈R l(θ) = arg minθ∈R Σᵢ₌₁ⁿ |xᵢ − θ| = the sample median, as seen in an earlier chapter.

Next, the per-observation score is ∂/∂θ(−|x − θ|) = sgn(x − θ), defined except at θ = x, whose square is 1. Thus IX₁(θ) = 1, and θ̂ is asymptotically efficient if

    √n(θ̂ − θ) →d N(0, 1).

Let us assume that n is odd for the moment, so that θ̂ = m̂, the sample median. It is well known that

    √n(θ̂ − θ) ∼asymptotic N(0, [4f²(θ)]⁻¹).

In our current example, f(θ) = 1/2, so [4f²(θ)]⁻¹ = [4 × (1/4)]⁻¹ = 1. Hence, the asymptotic Cramer-Rao bound does hold.
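A simulation sketch of this conclusion (our own parameters; numpy's Laplace sampler with scale 1 matches the density ½e^{−|x−θ|}):

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 1.0, 201, 20_000   # odd n, so the median is the middle order statistic
x = rng.laplace(loc=theta, scale=1.0, size=(reps, n))
z = np.sqrt(n) * (np.median(x, axis=1) - theta)

v = (z ** 2).mean()   # should be near [4 f(theta)^2]^{-1} = 1 = [I_X1(theta)]^{-1}
```

The scaled squared error of the median settles near 1, the information bound, even though the likelihood is not differentiable everywhere.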

5.6 Some comparisons about the UMVUE and MLE’s

1. It can be shown that UMVUE’s are always consistent and MLE’s usually are. (Bickel
and Doksum, 1977., p133.)

2. In the exponential family, both estimates are functions of the natural sufficient statis-
tics Ti (X).

3. When n is large, both give essentially the same answer. In that case, use is governed
by ease of computation.

4. MLEs always exist while UMVUEs may not exist, and MLEs are usually easier to
calculate than UMVUE’s, due partially to the invariance property of MLE’s.

5. Neither MLE’s nor UMVUE’s are satisfactory in general if one takes a Bayesian or
minimax point of view. Nor are they necessarily robust.

Chapter 6

Transformation

In statistics, data transformation is the application of a deterministic mathematical func-


tion to each point in a data set - that is, each data point xi is replaced with the transformed
value zi = f (xi ). A transformation changes the shape of a distribution or relationship. It
has important applications.

Transforms are usually applied so that the data appear to more closely meet the
assumptions of a statistical inference procedure that is to be applied, or to improve the
interpretability or appearance of graphs.

In fact, many familiar measured scales are really transformed scales: decibels, pH
and the Richter scale of earthquake magnitude are all logarithmic. Furthermore, trans-
formations are needed because there is no guarantee that the world works on the scales it
happens to be measured on, such as music.

6.1 Why transformations?

• Convenience
  An example is standardisation: z = (x − a)/b. It puts all variables on the same scale. Standardisation makes no difference to the shape of a distribution.

• Reducing skewness
A (nearly) symmetric distribution is often easier to handle and interpret than a
skewed one. More specifically, a normal distribution is often regarded as ideal as it
is assumed by many statistical methods.

– To reduce right skewness, try √x, log x or 1/x.
– To reduce left skewness, take x2 , x3 , ....

• Equal spreads/variability (homoscedasticity)


A transformation may be used to produce approximately equal spreads, despite
marked variations in level, which again makes data easier to handle and interpret.
This is called homoscedasticity, as opposed to heteroscedasticity.

• Linear relationships

When looking at relationships between variables, it is often far easier to think about
patterns that are approximately linear than about patterns that are highly curved.
This is vitally important when using linear regression, which amounts to fitting such
patterns to data.
For example, a plot of logarithms of a series of values against time has the property
that periods with constant rates of change (growth or decline) plot as straight lines.

• Additive relationships
Relationships are often easier to analyse for an additive model

y = a + bx

rather than a multiplicative one

    y = a xᵇ.

Additivity is a vital issue in the analysis of variance.

• Easier to visualize.
For example, suppose we have a scatterplot in which the points are the countries of
the world, and the data values being plotted are the land area and population of
each country. If the plot is made using untransformed data (e.g. square kilometers
for area and the number of people for population), most of the countries would be
plotted in tight cluster of points in the lower left corner of the graph. The few coun-
tries with very large areas and/or populations would be spread thinly around most
of the graph’s area. Simply rescaling units (e.g., to thousand square kilometers, or
to millions of people) will not change this. However, following logarithmic transfor-
mations of both area and population, the points will be spread more uniformly in
the graph.

In practice, a transformation often works, serendipitously, to do several of these at


once, particularly to reduce skewness, to produce nearly equal spreads and to produce a
nearly linear or additive relationship. But this is not guaranteed.

6.2 Box-Cox transform
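The Box-Cox family is y(λ) = (x^λ − 1)/λ for λ ≠ 0 and y(0) = log x, with λ usually chosen to maximize the profile log-likelihood under a normality assumption, ll(λ) = −(n/2) log σ̂²(λ) + (λ − 1) Σᵢ log xᵢ. The code below is a minimal sketch of that recipe (the grid, sample size, and lognormal test data are our own choices, not part of the notes); for lognormal data the chosen λ should be near 0, i.e., the log transform.

```python
import numpy as np

def boxcox(x, lam):
    # Box-Cox family: (x^lam - 1)/lam for lam != 0, log x for lam = 0
    return np.log(x) if lam == 0 else (x ** lam - 1) / lam

def profile_loglik(x, lam):
    # profile log-likelihood under normality, additive constants dropped
    z = boxcox(x, lam)
    return -0.5 * len(x) * np.log(z.var()) + (lam - 1) * np.log(x).sum()

rng = np.random.default_rng(9)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)   # log x is exactly normal

grid = np.arange(-20, 41) / 20                      # lambda in [-1, 2], step 0.05
lam_hat = grid[np.argmax([profile_loglik(x, l) for l in grid])]
```

A grid search is used here for transparency; in practice one would maximize ll(λ) numerically (e.g., scipy's boxcox routines do this).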

6.3 Review of most common transformations

The most useful transformations are the reciprocal, logarithm, cube root, square root, and
square.

Reciprocal: 1/x, x ≠ 0

• This has a drastic effect on distribution shape.

• The reciprocal of a ratio may often be interpreted as easily as the ratio itself: e.g.

– population density (people per unit area) becomes area per person;

– persons per doctor becomes doctors per person;
– rates of erosion become time to erode a unit depth.

Logarithm log x, x > 0

• This has a major effect on distribution shape. It is commonly used for reducing right
skewness.
• Exponential growth or decline y = a exp(bx) is made linear by
ln y = ln a + bx.
Exponential function does not go through origin, as at x = 0, we have
y = a 6= 0.
• Power function y = a xᵇ is made linear by
  log y = log a + b log x.
  The power function for b > 0 goes through the origin: at x = 0 (b > 0), we have y = 0. This often makes physical or biological or economic sense.
• Consider ratios y = p/q ∈ (0, ∞) where p, q > 0. Examples are
– males/females;
– dependants/workers;
– downstream length/downvalley length
which are usually skewed, because there is a clear lower limit and no clear upper
limit. The logarithm
log y = log p/q = log p − log q ∈ (−∞, ∞),
which is likely to be more symmetrically distributed.

Square root x1/2 (x ≥ 0)

• The square root x1/2 has a moderate effect on distribution shape.


• It is weaker than the logarithm and the cube root.
• It is also used for reducing right skewness, and also has the advantage that it can be
applied to zero values.
• Note that the square root of an area has the units of a length. It is commonly applied
to counted data, especially if the values are mostly rather small.

Cube root x1/3

• The cube root x1/3 is a fairly strong transformation with a substantial effect on
distribution shape: it is weaker than the logarithm.
• It is also used for reducing right skewness, and has the advantage that it can be
applied to zero and negative values.
• Note that the cube root of a volume has the units of a length. It is commonly applied
to rainfall data.

Square x2

• The square x2 has a moderate effect on distribution shape and it could be used to
reduce left skewness.

• In practice, the main reason for using it is to fit a response by a quadratic function

y = a + bx + cx2 .
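The skewness-reducing effect of these transformations is easy to demonstrate (a sketch on simulated lognormal data, our own choice): the square root reduces the right skew, and the log essentially removes it.

```python
import numpy as np

def skewness(x):
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

rng = np.random.default_rng(10)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # strongly right-skewed

sk_raw = skewness(x)
sk_sqrt = skewness(np.sqrt(x))   # weaker transform: skew reduced
sk_log = skewness(np.log(x))     # log x ~ N(0, 1): skew near 0
```

The three skewness values decrease sharply from raw to square root to log, matching the ordering of transformation strength described above.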

(Diagnostic tools such as QQ plots and box plots can be used to check the effect of a transformation.)

6.4 Preliminaries

Many properties in statistics are preserved under smooth transformations.

Theorem 6.1 (Continuous Mapping Theorem) Let g : Rᵏ → R be continuous. Then

• Xn →a.s. X =⇒ g(Xn ) →a.s. g(X).

• Xn →p X =⇒ g(Xn ) →p g(X).

• Xn →d X =⇒ g(Xn ) →d g(X).

Now let us focus on the convergence in distribution (useful in statistical applications). We


will see that more could be said: if the limiting distribution of Xn is Normal (or Gamma,
Cauchy, etc), so is the limiting distribution of g(Xn ). That is, the form of the limiting
distribution (such as asymptotic normality) is also preserved under smooth transformation.

Theorem 6.2 (Slutsky’s Theorem) Let Xn →d X, Yn →p C. Then

• Xn + Yn →d X + C.

• Xn Yn →d CX.

• Xn /Yn →d X/C if C 6= 0.

Stochastic Order relations

Order relations can greatly simplify the writing of proofs. Let {Xn} and {Yn} be two sequences of r.v.'s.

(a) Xn = Op(1)
    ⟺ ∀ε > 0, ∃Mε and Nε such that P(|Xn| > Mε) < ε for all n ≥ Nε
       (equivalently, P(|Xn| ≤ Mε) ≥ 1 − ε);
    ⟺ ∀ε > 0, ∃Mε such that supₙ P(|Xn| > Mε) < ε;
    ⟺ lim over M→∞ of lim supₙ P(|Xn| > M) = 0 ("tightness").
We say that {Xn } is “stochastically bounded”, or “tight.” It means that no mass
will escape to ∞ with positive probability.

(b) Xn = Op (Yn ) if Xn /Yn = Op (1).

(c) Xn = op (1) if Xn →p 0.

(d) Xn = op (Yn ), if Xn /Yn = op (1).

Here are some useful relations.

Theorem 6.3

1. If Xn →d X, then Xn = Op (1).

2. If Xn = op (1) then Xn = Op (1).

3. Op (1)op (1) = op (1).

4. Op (1)Op (1) = Op (1).

5. Rn = op(Yn), Yn = Op(1) =⇒ Rn = op(1).

6. Rn = rn/√n, rn = Op(1) or op(1) =⇒ Rn = Op(n⁻¹/²) or op(n⁻¹/²), respectively.

Proof.

1. Xn →d X =⇒ |Xn| →d |X|. Note that the set of discontinuity points of F|X| is countable. For every ε > 0, we can always find a large Mε at which F|X| is continuous such that

       P(|Xn| > Mε) = 1 − F|Xn|(Mε) → 1 − F|X|(Mε) = P(|X| > Mε) < ε.

2. Xn = op(1) ⟺ Xn →p 0 =⇒ P(|Xn| > 1) → 0 =⇒ P(|Xn| > 1) < ε for all n ≥ Nε =⇒ Xn = Op(1).
   Alternatively, Xn = op(1) =⇒ ∀ε > 0: limₙ P(|Xn| > ε) = 0 =⇒

       lim over M→∞ of lim supₙ P(|Xn| > M) = 0.

3. Denote Wn = Op(1) and Vn = op(1). For any ε > 0 and δ > 0, there exists Mε such that for all large n, we have

       P(|Wn Vn| > δ) = P(|Wn Vn| > δ, |Wn| ≤ Mε) + P(|Wn Vn| > δ, |Wn| > Mε)
                      ≤ P(|Vn| > δ/Mε) + P(|Wn| > Mε)
                      ≤ ε/2 + ε/2 = ε.

4. Denote Wn = Op(1) and Un = Op(1). For any ε > 0, there exist M₁ and M₂ such that for all large n, we have P(|Wn| > M₁) ≤ ε/2 and P(|Un| ≥ M₂) ≤ ε/2. Thus,

       P(|Wn Un| > M₁M₂) ≤ P(|Wn| > M₁ or |Un| ≥ M₂)
                         ≤ P(|Wn| > M₁) + P(|Un| ≥ M₂)
                         ≤ ε/2 + ε/2 = ε.

5. Proof. For any ε > 0 and η > 0, note that

       Yn = Op(1) =⇒ ∃M such that lim supₙ P(|Yn| > M) < ε/2;
       Rn = op(Yn) =⇒ ∀δ > 0: limₙ P(|Rn/Yn| > δ) = 0.

   Hence, for all large n,

       P(|Rn| > η) ≤ P(|Rn| > η, |Yn| ≤ M) + P(|Yn| > M)
                   ≤ P(|Rn/Yn| > η/M) + P(|Yn| > M)
                   ≤ ε/2 + ε/2 = ε.

6. Rn/n⁻¹/² = √n Rn = rn = Op(1) or op(1) by assumption.

6.5 The δ-method

The δ-method is a very useful tool in proving limit theorems and Slutsky’s theorems are
extensively employed. We illustrate the idea with smooth function of means.

The univariate case

Theorem 6.4 Let X₁, ..., Xn be i.i.d. with µ = EX₁ and σ² = Var(X₁) ∈ (0, ∞), and let g′(µ) ≠ 0. Then

    g(X̄) →d N(g(µ), n⁻¹[g′(µ)]²σ²),   i.e.,   √n(g(X̄) − g(µ)) / (g′(µ)σ) →d N(0, 1).   (5.1)

Proof. From Young’s Taylor’s expansion,


g(X̄) = g(µ) + g 0 (µ)(X̄ − µ) + o(|X̄ − µ|)
Then,
√ √
n(g(X̄) − g(µ)) n(X̄ − µ) √
= + o(| n(X̄ − µ)|)
g 0 (µ)σ σ

n(X̄ − µ)
= + op (1)
√σ
(as n(X̄ − µ)/σ = Op (1))
−→d N (0, 1), (Slutsky’s Theorem).

The following generalization is straightforward.

Theorem 6.5 If Tn ∼ AN(µ, σn²) with σn → 0, and g′(µ) ≠ 0, then

    g(Tn) ∼ AN(g(µ), [g′(µ)]²σn²),   i.e.,   (g(Tn) − g(µ)) / (g′(µ)σn) →d N(0, 1).

Now consider the case that g is differentiable but g 0 (µ) = 0.

Theorem 6.6 Suppose Tn is AN(µ, σn²) with σn → 0. If g(·) is m times differentiable at µ with g⁽ʲ⁾(µ) = 0 for 1 ≤ j ≤ m − 1 and g⁽ᵐ⁾(µ) ≠ 0, then

    (g(Tn) − g(µ)) / ( (1/m!) g⁽ᵐ⁾(µ) σnᵐ ) →d [N(0, 1)]ᵐ.
m!

Proof. We shall prove this for m = 2 only. By Taylor's expansion (recall g′(µ) = 0),

    g(Tn) = g(µ) + g′(µ)(Tn − µ) + (g″(µ)/2)(Tn − µ)² + o[(Tn − µ)²]
          = g(µ) + (g″(µ)σn²/2) [(Tn − µ)/σn]² + op(σn²)   (as (Tn − µ)/σn = Op(1)).

Applying Slutsky's theorem, we get

    [g(Tn) − g(µ)] / (g″(µ)σn²/2) = [(Tn − µ)/σn]² + op(1) →d [N(0, 1)]².
2

The multivariate case

Theorem 6.7 Let X, X₁, ..., Xn be i.i.d. random k-vectors with µ = EX and Σ = Cov(X, X). Assume that g(·) : Rᵏ → R¹ is differentiable with ∇g(µ) ≠ 0, where ∇g(x) = (∂g(x)/∂x₁, ..., ∂g(x)/∂xk)τ. Then

    g(X̄) →d N(g(µ), vn),   where vn = n⁻¹ ∇g(µ)τ Σ ∇g(µ).

Proof. From Taylor's expansion, g(X̄) = g(µ) + ∇g(µ)τ(X̄ − µ) + R, hence

    √n[g(X̄) − g(µ)] = ∇g(µ)τ √n(X̄ − µ) + √n R.

By the multivariate CLT, we get √n(X̄ − µ) →d N(0, Σ). Furthermore, √n R = o(√n ||X̄ − µ||) = o(Op(1)) = op(1). Then apply Slutsky's theorem.

6.5.1 Examples

Example 6.1 Find the limiting d.f.’s of (X̄)2 , 1/X̄, S 2 , S, S/X̄.

Solution.

1. Take g(x) = x², so g′(x) = 2x.

   If µ ≠ 0, (X̄)² ∼ AN(µ², 4µ²σ²/n).  If µ = 0, n(X̄)²/σ² →d χ²₁.

2. Take g(x) = 1/x, so g′(x) = −1/x².

   If µ ≠ 0, 1/X̄ ∼ AN(1/µ, n⁻¹σ²/µ⁴).  If µ = 0, σ/(√n X̄) →d 1/N(0, 1).

3. Let Yi = Xi − µ. Take g(x, y) = y − x², then

       S² = (1/n) Σᵢ (Xi − X̄)² = (1/n) Σᵢ (Yi − Ȳ)² = (1/n) Σᵢ Yi² − (Ȳ)² = g(Ȳ, (1/n) Σᵢ Yi²).

   Now µ̃ = E(Ȳ, (1/n) Σᵢ Yi²) = (0, σ²), ∇g(µ̃) = (−2x, 1)^τ |µ̃ = (0, 1)^τ, and

       Σ = ( Var(Y)      Cov(Y, Y²) ) = ( σ²   µ3      )
           ( Cov(Y, Y²)  Var(Y²)    )   ( µ3   µ4 − σ⁴ )

   Then, g(µ̃) = σ², ∇^τ g(µ̃) Σ ∇g(µ̃) = µ4 − σ⁴. Therefore,

       S² →d N(σ², n⁻¹(µ4 − σ⁴)).


4. For S = √(S²) = h(S²), where h(x) = √x, so h′(x) = 1/(2√x). From the above,

       S →d N( h(σ²), n⁻¹[h′(σ²)]²(µ4 − σ⁴) ) = N( σ, (µ4 − σ⁴)/(4σ²n) ).

5. If µ ≠ 0, take g(x, y) = (y − x²)^{1/2}/x. Then

       S/X̄ = ( (1/n) Σᵢ Xi² − (X̄)² )^{1/2} / X̄ = g(X̄, (1/n) Σᵢ Xi²).

   After some routine but tedious calculation, we get

       S/X̄ →d N( σ/µ, (1/n)[ σ²µ2/µ⁴ − µ3/µ³ + (µ4 − µ2²)/(4µ2σ²) ] ).

   If µ = 0, we take Tn = √n X̄/S →d N(0, 1) := Z and g(t) = 1/t; then g(Tn) →d
   g(Z) (continuous mapping theorem), i.e.,

       S/(√n X̄) →d 1/N(0, 1).

6.6 Variance-stabilizing transformations

We often have
    Tn →d N(µ, σ²(µ)),
i.e., the variance changes with the mean.

We now seek some transformation g so that g(Tn) has stable variance. Since

    g(Tn) →d N( g(µ), [g′(µ)]²σ²(µ) ),

we would like to have

    [g′(µ)]²σ²(µ) = C²,  i.e.,  g′(µ) = C/σ(µ).

Thus

    g(µ) = ∫ C/σ(µ) dµ.

Example. X1, ..., Xn ∼ exp(λ), i.e., f(x) = λ⁻¹ exp{−x/λ}, x ≥ 0. Then,

    µ = λ,  σ² = λ² = µ².

Here, σ²(µ) = µ², hence (taking C = 1),

    g(µ) = ∫ 1/σ(µ) dµ = ∫ 1/µ dµ = ln µ.

In fact, if X1, ..., Xn ∼ Gamma(k, λ) with k known, then EX = kλ, Var(X) = kλ².
As for the exponential distribution, the ln transformation stabilizes the variance.
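A quick Monte Carlo sketch (illustrative only and not part of the original notes; sample sizes, the seed and the replication count are arbitrary choices) confirms the stabilization: the variance of ln X̄ for exponential data is close to 1/n whatever the value of λ.

```python
import math
import random
import statistics

random.seed(1)

def var_log_mean(lam, n, reps=4000):
    """Monte Carlo variance of ln(Xbar) for exponential samples with mean lam, size n."""
    vals = [math.log(statistics.fmean(random.expovariate(1.0 / lam) for _ in range(n)))
            for _ in range(reps)]
    return statistics.pvariance(vals)

# sigma^2(mu) = mu^2, so g = ln gives asymptotic variance 1/n, free of lam.
n = 100
v1, v2 = var_log_mean(1.0, n), var_log_mean(10.0, n)
print(v1, v2, 1 / n)
```

Both printed variances should be close to 1/n = 0.01, despite the hundredfold change in scale.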

Example. X1, ..., Xn ∼ Poisson(λ). The CLT implies X̄ ∼ AN(λ, λ/n), hence,

    1 − α ≈ P( Φ⁻¹(α/2) < √n(X̄ − λ)/√λ < Φ⁻¹(1 − α/2) ),

from which one can get an approximate confidence interval for λ.

Alternatively, we can use the transformation approach. Here, σ²(λ) = λ, hence,

    g(θ) = ∫ C/σ(θ) dθ = ∫ C/√θ dθ = 2C√θ.

Choose C = 1/2. Then we get g(θ) = √θ, hence

    √X̄ ∼ AN( √λ, 1/(4n) ),  or  2√n(√X̄ − √λ) ∼ AN(0, 1).

Hence, an approximate two-sided confidence interval for √λ at (1 − α)-level is

    (L, U) = √X̄ ± Φ⁻¹(1 − α/2)/(2√n),

and an approximate two-sided confidence interval for λ at (1 − α)-level is given by

    (L², U²) if L > 0,  or  (0, U²) if L ≤ 0.
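The coverage of this square-root interval is easy to check by simulation. The sketch below is added for illustration only (the Poisson sampler, λ, n and the replication count are arbitrary choices of this sketch; Knuth's multiplication method is one of several ways to draw Poisson variates).

```python
import math
import random
import statistics

random.seed(2)

def poisson_draw(lam):
    """Knuth's multiplication method; adequate for moderate lam (a choice of this sketch)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def coverage(lam, n, reps=3000):
    """Coverage of (L^2, U^2), (L, U) = sqrt(Xbar) -+ Phi^{-1}(0.975)/(2 sqrt(n))."""
    z = 1.959964  # Phi^{-1}(0.975)
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean(poisson_draw(lam) for _ in range(n))
        L = math.sqrt(xbar) - z / (2 * math.sqrt(n))
        U = math.sqrt(xbar) + z / (2 * math.sqrt(n))
        lo = L * L if L > 0 else 0.0
        hits += int(lo < lam < U * U)
    return hits / reps

cov = coverage(lam=4.0, n=50)
print(cov)
```

The printed empirical coverage should be in the vicinity of the nominal 0.95.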

6.6.1 An example: the sample correlation coefficient

Example 1. Let (X, Y), (X1, Y1), ..., (Xn, Yn) be i.i.d. random vectors. For simplicity, we
first assume that µx = µy = 0. Then the correlation coefficient is

    ρ = σxy/(σxσy) = Cov(X, Y)/√(V(X)V(Y)) = E(X − µx)(Y − µy)/√(V(X)V(Y)) = EXY/√(EX² EY²).

Correspondingly, we consider a simpler sample correlation coefficient given by

    ρ̂ = (1/n) Σᵢ XiYi / [ ((1/n) Σᵢ Xi²)^{1/2} ((1/n) Σᵢ Yi²)^{1/2} ].

(Note that this is different from the usual definition of ρ̂; see the next example.)

(i) Find the limiting d.f. of ρ̂.

(ii) Further assume (X, Y) is bivariate normal with pdf

    f(x, y) = 1/(2πσ1σ2√(1 − ρ²)) ·
              exp{ −[ (x − µx)²/σ1² − 2ρ(x − µx)(y − µy)/(σ1σ2) + (y − µy)²/σ2² ] / (2(1 − ρ²)) }.


(a) Show that ρ̂ →a.s. ρ, and √n(ρ̂ − ρ) →d N(0, (1 − ρ²)²). Further,

        √n(ρ̂ − ρ)/(1 − ρ̂²) →d N(0, 1).

(b) Find a variance stabilizing transformation (Fisher transformation).


(Hint: X and Y − βX are independent for β = σxy /σx2 , since (X, Y − βX) is
bivariate Normal with Cov(X, Y − βX) = EXY − βEX 2 = 0.)

Solution. (i) Let θ = (EX², EY², EXY) and θ̂ = ( (1/n) Σᵢ Xi², (1/n) Σᵢ Yi², (1/n) Σᵢ XiYi ), and

    g(z1, z2, z3) = z3/(z1 z2)^{1/2}.

Then, ρ = g(θ) and ρ̂ = g(θ̂). Note that E θ̂ = θ. Assuming that EX⁴ < ∞ and EY⁴ < ∞,
from the multivariate CLT, we have

    θ̂ ∼ AN(θ, Σ/n),

where Σ (3 × 3) is the covariance matrix of (X², Y², XY):

        ( V(X²)        Cov(X², Y²)   Cov(X², XY) )
    Σ = ( Cov(X², Y²)  V(Y²)         Cov(Y², XY) ).
        ( Cov(X², XY)  Cov(Y², XY)   V(XY)       )

Therefore, from the theorem in this section, we have

    ρ̂ = g(θ̂) ∼ AN( ρ, n⁻¹[∇g(θ)]^τ Σ ∇g(θ) ).

From

    [∇g(z)]^τ = ( ∂g(z)/∂z1, ∂g(z)/∂z2, ∂g(z)/∂z3 )
              = ( −z3/(2 z1^{3/2} z2^{1/2}), −z3/(2 z1^{1/2} z2^{3/2}), 1/(z1^{1/2} z2^{1/2}) ),

we have

    [∇g(θ)]^τ = ( −σxy/(2σx³σy), −σxy/(2σxσy³), 1/(σxσy) ).

(ii) (a) From θ̂ →a.s. θ, we get ρ̂ = g(θ̂) →a.s. g(θ) = ρ.

For simplicity, we shall also assume that σx² = σy² = 1. Then X ∼ N(0, 1) and
X² ∼ χ²₁, hence EX² = 1, V(X²) = 2, EX³ = 0, EX⁴ = 3. Similar results hold for Y as
well. Note that X and Y − βX are independent for β = σxy/σx² = ρ. Therefore

    E(X²Y²) = E(X²(Y − βX + βX)²)
            = EX²(Y − βX)² + 2βEX³(Y − βX) + β²EX⁴
            = EX² E(Y − βX)² + 2βEX³ E(Y − βX) + β²EX⁴
            = EX² E(Y − βX)² + β²EX⁴
            = (EY² − 2βEXY + β²EX²) + β²EX⁴
            = (1 − 2ρ² + ρ²) + 3ρ²
            = 1 + 2ρ²
    V(XY) = E(X²Y²) − (EXY)² = 1 + 2ρ² − ρ² = 1 + ρ²
    Cov(X², Y²) = E(X²Y²) − EX²EY² = 1 + 2ρ² − 1 = 2ρ²
    Cov(X², XY) = E(X³Y) − EX² EXY = EX³(Y − βX) + βEX⁴ − ρ
                = EX³ E(Y − βX) + 3ρ − ρ = 2ρ
    Cov(Y², XY) = 2ρ   (similar to above)

Hence,

        ( 2    2ρ²  2ρ     )       ( 1   ρ²  ρ          )
    Σ = ( 2ρ²  2    2ρ     )  = 2  ( ρ²  1   ρ          ).
        ( 2ρ   2ρ   1 + ρ² )       ( ρ   ρ   (1 + ρ²)/2 )

and [∇g(θ)]^τ can be simplified to

    [∇g(θ)]^τ = ( −σxy/(2σx³σy), −σxy/(2σxσy³), 1/(σxσy) ) = (−ρ/2, −ρ/2, 1) = (1/2)(−ρ, −ρ, 2).

Finally,

    [∇g(θ)]^τ Σ ∇g(θ) = (1/4)(−ρ, −ρ, 2) Σ (−ρ, −ρ, 2)^τ
                                         ( 1   ρ²  ρ          )
                      = (1/2)(−ρ, −ρ, 2) ( ρ²  1   ρ          ) (−ρ, −ρ, 2)^τ
                                         ( ρ   ρ   (1 + ρ²)/2 )
                      = (1/2)( ρ(1 − ρ²), ρ(1 − ρ²), 1 − ρ² ) (−ρ, −ρ, 2)^τ
                      = (1/2)( −2ρ²(1 − ρ²) + 2(1 − ρ²) )
                      = (1 − ρ²)².

Therefore, √n(ρ̂ − ρ) →d N(0, (1 − ρ²)²). This, ρ̂ →a.s. ρ and Slutsky's theorem imply

    √n(ρ̂ − ρ)/(1 − ρ̂²) →d N(0, 1).

(b) Let us find a variance stabilizing transform. With σ(t) = 1 − t²,

    h(t) = ∫ 1/σ(t) dt = ∫ 1/(1 − t²) dt = ∫ 1/((1 − t)(1 + t)) dt = (1/2) ln((1 + t)/(1 − t)).

Hence, √n(h(ρ̂) − h(ρ)) →d N(0, 1).
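As a numerical sanity check (an illustrative sketch added here; n, ρ, the seed and the replication count are arbitrary choices), the Monte Carlo variance of the Fisher transform h(ρ̂) stays close to 1/n for both ρ = 0 and ρ = 0.8, even though the variance of ρ̂ itself shrinks by (1 − ρ²)².

```python
import math
import random
import statistics

random.seed(3)

def fisher_z_var(rho, n, reps=3000):
    """Monte Carlo variance of h(rho_hat) = (1/2) ln((1+rho_hat)/(1-rho_hat))."""
    zs = []
    for _ in range(reps):
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [rho * x + math.sqrt(1 - rho * rho) * random.gauss(0, 1) for x in xs]
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
        r = num / den
        zs.append(0.5 * math.log((1 + r) / (1 - r)))
    return statistics.pvariance(zs)

n = 100
v0, v8 = fisher_z_var(0.0, n), fisher_z_var(0.8, n)
print(v0, v8, 1 / n)
```

Both printed variances should sit near 1/n, illustrating that h stabilizes the variance across ρ.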

Example 2. (Continuation of the last example.) Let (X, Y), (X1, Y1), ..., (Xn, Yn) be i.i.d.
random vectors. The general definition of the correlation coefficient is

    ρ = σxy/(σxσy) = Cov(X, Y)/√(V(X)V(Y)) = E(X − µx)(Y − µy)/√(V(X)V(Y)).

The sample correlation coefficient is

    ρ̂ = (1/n) Σᵢ (Xi − X̄)(Yi − Ȳ) / [ ((1/n) Σᵢ (Xi − X̄)²)^{1/2} ((1/n) Σᵢ (Yi − Ȳ)²)^{1/2} ].

Find the limiting d.f. of ρ̂.


Solution. Let θ = (EX, EY, EX², EY², EXY) and θ̂ = ( X̄, Ȳ, (1/n) Σᵢ Xi², (1/n) Σᵢ Yi², (1/n) Σᵢ XiYi )
and

    g(z1, ..., z5) = (z5 − z1z2) / ( (z3 − z1²)^{1/2} (z4 − z2²)^{1/2} ).

Then, ρ = g(θ) and ρ̂ = g(θ̂). Note that E θ̂ = θ. Assuming that EX⁴ < ∞ and EY⁴ < ∞,
from the multivariate CLT, we have

    θ̂ ∼ AN(θ, Σ/n),

where Σ (5 × 5) is the covariance matrix of (X, Y, X², Y², XY). (Why?) Therefore, from the
theorem in this section, we have

    ρ̂ = g(θ̂) ∼ AN( ρ, n⁻¹[∇g(θ)]^τ Σ ∇g(θ) ).

Some routine calculations yield

    [∇g(θ)]^τ = ( ρµx/σx² − µy/(σxσy), ρµy/σy² − µx/(σxσy), −ρ/(2σx²), −ρ/(2σy²), 1/(σxσy) ).

Finish the rest as an exercise.

6.7 Exercises

Below X, X1 , ..., Xn are i.i.d. r.v.’s with mean µ, variance σ 2 , and µk = E(X − µ)k .

1. Prove the equivalence of several definitions for Xn = Op (1) given in the notes.

2. (a) If Xn = Op (Zn ), and P (Yn = 0) = 0 for all n, then Xn Yn = Op (Yn Zn ).


(b) If Xn = Op (Zn ) and Yn = Op (Zn ), then Xn + Yn = Op (Zn ).
(c) If E|Xn | = O(an ) for a sequence an > 0, then Xn = Op (an ).
(d) If Xn →a.s. X, then supn |Xn | = Op (1).

3. Find the limiting d.f.’s of eX̄ , ln |X̄|, cos(X̄).

4. Let (X, Y ), (X1 , Y1 ), ..., (Xn , Yn ) be i.i.d. with mean µ = (µx , µy )τ and covariance
matrix Σ. Let r = µx /µy with µy 6= 0, and an estimate of r is r̂ = X̄/Ȳ . Investigate
the limiting behavior of r̂ (i.e. both consistency and asymptotic d.f.).
n

Ui )−1 . Show that
Y
5. Let Un ∼iid Uniform[0, 1], and Xn = ( n(Xn − e) →d N (0, e2 ).
i=1

6. If Xn ∼ Bin(n, p), then Xn ∼ AN(np, np(1 − p)). Find a variance stabilizing
   transformation.

Chapter 7

Quantiles

We are often interested in a special portion of a population distribution, other than the
centre. For instance, we might be interested in

• underweight new-born babies in medical studies;

• losses in financial companies (e.g., VaR) [risk management];

• the low and high water levels of a reservoir.

These are all examples of quantiles. The layout of the chapter is as follows:

• definitions of quantiles

• consistency

• asymptotic normality

• Bahadur theorem

• quantile regression

7.1 Definition

The pth quantile of a d.f. F is

ξp := F −1 (p) = inf{t : F (t) ≥ p}.

A plot of F −1 (p) reveals that

• F −1 jumps when F is flat. F −1 is flat when F jumps.

• In fact, F −1 (p) is a mirror image of F (t) along the line p = t.

Theorem 7.1 ∀p ∈ (0, 1), we have

1. F −1 (p) is non-decreasing and left continuous.

2. F −1 (F (x)) ≤ x, ∀x ∈ R.

3. F (ξp −) ≤ p ≤ F (ξp ).

4. ξp = F −1 (p) ≤ t ⇐⇒ p ≤ F (t).

5. If F is continuous, then F (F −1 (p)) = F (ξp ) = p.

Proof.

1. Monotonicity. If p1 < p2 , then {t : F (t) ≥ p2 } ⊂ {t : F (t) ≥ p1 }. Hence,

F −1 (p1 ) = inf{t : F (t) ≥ p1 } ≤ inf{t : F (t) ≥ p2 } = F −1 (p2 ).

Left-continuity. Let pn ↑ p.
=⇒ F⁻¹(pn) ≤ F⁻¹(p) and F⁻¹(pn) ↑ b, say.   (Monotonicity shown above)
=⇒ F⁻¹(pn) ≤ b ≤ F⁻¹(p) for all n.
=⇒ pn ≤ F(b) from (4) (proof given below).
=⇒ F(b) ≥ limn pn = p.
=⇒ b ∈ {t : F(t) ≥ p}, hence, b ≥ F⁻¹(p).
=⇒ limn F⁻¹(pn) = b = F⁻¹(p).

2. F −1 (F (x)) = inf{t : F (t) ≥ F (x)} ≤ x as x ∈ {t : F (t) ≥ F (x)}.

3. By definition of ξp, we have F(ξp − ε) < p ≤ F(ξp + ε) for every ε > 0. Then let ε → 0.

4. • If ξp := F −1 (p) ≤ t, then F (t) ≥ F [ξp ] ≥ p from (3).


• If p ≤ F (t), then F −1 (p) ≤ F −1 [F (t)] ≤ t from (2).

5. It follows from (3).

Theorem 7.2 (Quantile transformation.) F is a d.f., U ∼ U nif orm(0, 1). Then

X := F −1 (U ) ∼ F.

Proof. From Theorem 7.1 (4): P (F −1 (U ) ≤ x) = P (U ≤ F (x)) = F (x).

Theorem 7.2 forms the basis for sampling from a d.f. F, at least in principle. Feasibility
of this technique depends on either having F⁻¹ available in closed form, being able to
approximate it numerically, or using some other techniques.
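The exponential distribution is a case where F⁻¹ is available in closed form, which gives a two-line sampler. The following sketch is added for illustration (the rate, sample size and seed are arbitrary choices):

```python
import math
import random
import statistics

random.seed(4)

# Theorem 7.2 in action: for the exponential d.f. with rate lam,
# F(x) = 1 - e^{-lam x}, so F^{-1}(p) = -ln(1 - p)/lam, and
# F^{-1}(U) with U ~ Uniform(0, 1) has d.f. F.
lam = 2.0
sample = [-math.log(1 - random.random()) / lam for _ in range(20000)]
print(statistics.fmean(sample), statistics.median(sample))
```

The printed mean and median should be near 1/λ = 0.5 and ln 2/λ ≈ 0.347, the exponential's mean and median.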

7.2 Consistency of sample quantiles

Given X1 , ..., Xn ∼iid F , a natural estimate of ξp is the sample quantile of Fn :

ξˆp := Fn−1 (p) = inf{t : Fn (t) ≥ p}.

Theorem 7.3 If F(ξp − δ) < p < F(ξp + δ) for all δ > 0 (i.e., F is not flat at ξp), then
for all ε > 0 and n,

    P(|ξ̂p − ξp| > ε) ≤ 2C e^{−2nδε²},

where δε = min{F(ξp + ε) − p, p − F(ξp − ε)}, and C > 0 is an absolute constant.

Proof. Fix ε > 0. Recall F⁻¹(t) ≤ x ⇐⇒ t ≤ F(x). Hence,

    P(ξ̂p > ξp + ε) = P(Fn⁻¹(p) > ξp + ε) = P(p > Fn[ξp + ε])
                   = P(Fn[ξp + ε] − F[ξp + ε] < p − F[ξp + ε])
                   = P(Fn[ξp + ε] − F[ξp + ε] < −{F[ξp + ε] − p})
                   ≤ P(|Fn[ξp + ε] − F[ξp + ε]| > {F[ξp + ε] − p})
                   ≤ P(sup_x |Fn(x) − F(x)| > δε)
                   ≤ C e^{−2nδε²}    from the DKW inequality.

Similarly, P(ξ̂p < ξp − ε) ≤ C e^{−2nδε²}.

Corollary 7.1 Let F(ξp − ε) < p < F(ξp + ε) for any ε > 0. Then ξ̂p →a.s. ξp.

Proof. Σ_{n=1}^∞ P(|ξ̂p − ξp| > ε) ≤ Σ_{n=1}^∞ 2C e^{−2nδε²} < ∞, so the result follows
from the Borel–Cantelli lemma.

7.3 Asymptotic normality of the sample quantiles

Asymptotic normality can be proved in several different ways, e.g. Bahadur representation.
For simplicity, let m = F −1 (1/2) and m̂ = Fn−1 (1/2) be the population and sample
medians, respectively. General quantiles can be treated similarly.

Theorem 7.4 Assume f(x) = F′(x) is continuous near m and f(m) > 0. Then

    m̂ ∼asymp. N( m, 1/(4f²(m)n) ).


Proof. WLOG, assume n is odd. Let Yni = I{Xi ≤ m + x/√n}. Then Σᵢ Yni ∼
Bin(n, pn), where pn = P(Xi ≤ m + x/√n). Now

    P(√n(m̂ − m) ≤ x) = P( m̂ ≤ m + x/√n )
                     = P( Σᵢ I{Xi ≤ m + x/√n} ≥ (n + 1)/2 )
                     = P( Σᵢ Yni ≥ (n + 1)/2 )
                     = P( (Σᵢ Yni − npn)/√(npn(1 − pn)) ≥ ((n + 1)/2 − npn)/√(npn(1 − pn)) ).

Since F is differentiable (hence continuous) at m, we have F(m) = 1/2. So

    pn = F(m + x/√n) = 1/2 + (x/√n) f(m + δx/√n) → 1/2    for some δ ∈ (0, 1),


and √n(pn − 1/2) → xf(m). Thus,

    ((n + 1)/2 − npn)/√(npn(1 − pn)) = ( 1/(2√n) + √n(1/2 − pn) ) / √(pn(1 − pn))
                                     → −xf(m)/(1/2) = −2xf(m).

Applying the CLT to the triangular array Yni and Slutsky's theorem, we have

    P(√n(m̂ − m) ≤ x) → 1 − Φ(−2xf(m)) = Φ(2xf(m)),

which is the d.f. of N(0, 1/(4f²(m))).

7.4 Bahadur representation

The sample quantile ξ̂p is a nonlinear statistic (i.e., not an average of i.i.d. r.v.'s). However,
the Bahadur representation says that it can be approximated by a linear statistic, which is
much easier to handle. Heuristically, by Taylor's expansion,

    F(ξ̂p) − F(ξp) = f(ξp)(ξ̂p − ξp) + ...

Replacing F on the left by Fn, we get F(ξ̂p) − F(ξp) ≈ Fn(ξ̂p) − Fn(ξp) + ... = p − Fn(ξp) + ... as
|Fn(ξ̂p) − p| ≤ 1/n. Thus,

    ξ̂p − ξp = (p − Fn(ξp))/f(ξp) + Rn.

More precisely, we have

Theorem 7.5 (Bahadur representation) If f(ξp) = F′(ξp) > 0, then

    ξ̂p = ξp + (F(ξp) − Fn(ξp))/f(ξp) + op(n^{−1/2}).

Or equivalently,

    √n( ξ̂p − ξp ) = √n[F(ξp) − Fn(ξp)]/f(ξp) + op(1).

Proof. It is equivalent to show Xn − Yn = op(1), where

    Xn := √n(ξ̂p − ξp),   Yn := √n[F(ξp) − Fn(ξp)]/f(ξp).

The following lemma will be useful, whose proof will be given later.
The following lemma will be useful, whose proof will be given later.

Lemma 7.1 (Ghosh, 1971) If Xn = Op (1) and P (Xn ≤ t, Yn ≥ t + δ) + P (Xn ≥ t +


δ, Yn ≤ t) = o(1), for any fixed t ∈ R and δ > 0, then

Xn − Yn = op (1).

To apply the lemma, let t ∈ R and ξnt = ξp + t/√n. We have

    P(Xn ≤ t, Yn ≥ t + ε) = P( ξ̂p ≤ ξnt, Yn ≥ t + ε )
                          = P( Fn(ξ̂p) ≤ Fn(ξnt), Yn ≥ t + ε )
                          = P( F(ξnt) − Fn(ξnt) ≤ F(ξnt) − Fn(ξ̂p), Yn ≥ t + ε )
                          = P( √n[F(ξnt) − Fn(ξnt)]/f(ξp) ≤ √n[F(ξnt) − Fn(ξ̂p)]/f(ξp),
                               √n[F(ξp) − Fn(ξp)]/f(ξp) ≥ t + ε )
                          := P( Zn(t) ≤ Un(t), Zn(0) ≥ t + ε ),

where

    Zn(t) = √n[F(ξnt) − Fn(ξnt)]/f(ξp),   Un(t) = √n[F(ξnt) − Fn(ξ̂p)]/f(ξp).
We can show (later), as n → ∞,

    (a) Un(t) = t + o(1),   (b) Zn(t) − Zn(0) = op(1).

For any t ∈ R and ε > 0, from (a) and (b), we have

    P(Xn ≤ t, Yn ≥ t + ε) = P(Zn(t) ≤ Un(t), Zn(0) ≥ t + ε)
                          ≤ P(Zn(t) ≤ Un(t), Zn(0) ≥ t + ε, |Un(t) − t| ≤ ε/2)
                            + P(|Un(t) − t| ≥ ε/2)
                          ≤ P(|Zn(t) − Zn(0)| ≥ ε/2) + P(|Un(t) − t| ≥ ε/2)
                          → 0.

Proof of (a): First we prove the following.

(a1) |Fn(ξ̂p) − p| ≤ n⁻¹.

     Proof. Let ξ̂p = Fn⁻¹(p). Note that Fn(ξ̂p−) ≤ p ≤ Fn(ξ̂p). Thus
     |Fn(ξ̂p) − p| ≤ |Fn(ξ̂p) − Fn(ξ̂p−)| = n⁻¹.

(a2) √n[F(ξnt) − p]/f(ξp) → t, as ( F(ξp + t/√n) − F(ξp) ) / (t/√n) → f(ξp).

Then, from (a1) and (a2), we have, as n → ∞,

    Un(t) = √n[F(ξnt) − p]/f(ξp) + √n[p − Fn(ξ̂p)]/f(ξp) → t.

Proof of (b): Rewrite (for t > 0; the case t < 0 is similar)

    Zn(t) − Zn(0) = √n[F(ξnt) − Fn(ξnt)]/f(ξp) − √n[F(ξp) − Fn(ξp)]/f(ξp)
                  = √n[F(ξnt) − F(ξp)]/f(ξp) − √n[Fn(ξnt) − Fn(ξp)]/f(ξp)
                  = − Σᵢ ( I{ξp < Xi ≤ ξnt} − E I{ξp < Xi ≤ ξnt} ) / (√n f(ξp))
                  = − Σᵢ (Yni − EYni) / (√n f(ξp)),                          (4.1)

where Yni = I{ξp < Xi ≤ ξnt}. Note that EYni = P(ξp < X1 ≤ ξnt) = F(ξnt) − F(ξp) =
f(ξp)t/√n + o(n^{−1/2}) [see the proof of (a2) above]. Thus,

    Var( (1/√n) Σᵢ (Yni − EYni) ) = (1/n) Σᵢ Var(Yni) ≤ Var(Yn1)
                                  ≤ E(Yn1²) = EYn1 = f(ξp)t/√n + o(n^{−1/2})
                                  → 0,

which implies that the sum in (4.1) is op(1). Thus Zn(t) − Zn(0) = op(1).

Similarly, we can show that P(Xn ≥ t + ε, Yn ≤ t) → 0.

Applying Lemma 7.1, we get Xn − Yn = op (1).
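Numerically, the remainder in the representation is visibly negligible. The following sketch (an illustration added here; the uniform model, sample size and replication count are arbitrary choices) uses Uniform(0, 1) data with p = 1/2, so that ξp = 1/2 and f(ξp) = 1, and checks that √n times the remainder ξ̂p − ξp − (F(ξp) − Fn(ξp))/f(ξp) is small.

```python
import math
import random
import statistics

random.seed(5)

n, reps = 2001, 500
rem = []
for _ in range(reps):
    xs = sorted(random.random() for _ in range(n))
    xi_hat = xs[n // 2]                       # sample median (n odd)
    Fn_at_xi = sum(x <= 0.5 for x in xs) / n  # Fn(xi_p) with xi_p = 1/2
    linear = (0.5 - Fn_at_xi) / 1.0           # (F(xi_p) - Fn(xi_p))/f(xi_p), f = 1
    rem.append(math.sqrt(n) * (xi_hat - 0.5 - linear))
print(statistics.fmean(abs(r) for r in rem))
```

The printed average of |√n · remainder| should be far below the standard deviation 1/(2f(ξp)) = 0.5 of √n(ξ̂p − ξp) itself, consistent with the remainder being op(1).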

As a simple corollary of Bahadur representation, we have the following theorem,


whose proof is left as an exercise.

Theorem 7.6 Let F have positive derivatives at ξpj , where 0 < p1 < ... < pm < 1. Then
√ h  i
n ξˆp1 , ..., ξˆpm − (ξp1 , ..., ξpm ) −→d Nm (0, Σ),

where Σ = (σij)m×m with

    σij = pi(1 − pj) / ( f(ξpi) f(ξpj) ),   i ≤ j.

The theorem is useful in deriving the (joint) d.f. of the sample range, inter-quartile range,
etc.

Finally, we complete the proof of Lemma 7.1.

Proof of Lemma 7.1. Fix ε > 0 and any δ > 0. Since Xn = Op(1), there exist Mε and Nε
such that

    P(|Xn| > Mε) < ε,   ∀n ≥ Nε.

Divide the interval [−Mε, Mε] into m sub-intervals of length ≤ δ/2 each; e.g., we can
take m = 2[2Mε/(δ/2)]. Write [−Mε, Mε] ⊂ ∪_{k=1}^m [tk, tk+1] with tk+1 − tk ≤ δ/2.
Then for all n ≥ Nε, we have

    P(|Xn − Yn| ≥ δ) ≤ P(|Xn − Yn| ≥ δ, |Xn| ≤ Mε) + P(|Xn| > Mε)
                     ≤ P(|Xn − Yn| ≥ δ, Xn ∈ [−Mε, Mε]) + ε
                     ≤ Σ_{k=1}^m P(|Xn − Yn| ≥ δ, Xn ∈ [tk, tk+1]) + ε
                       (since Xn ∈ [tk, tk+1], a very small interval,
                        essentially we have fixed the value of Xn)
                     = Σ_{k=1}^m P(Xn − Yn ≥ δ, Xn ∈ [tk, tk+1])
                       + Σ_{k=1}^m P(Xn − Yn ≤ −δ, Xn ∈ [tk, tk+1]) + ε
                     := Σ_{k=1}^m Ak + Σ_{k=1}^m Bk + ε.

Now for every k and as n → ∞, we have (draw a diagram to illustrate)

Ak = P (Xn − Yn ≥ δ, Xn ∈ [tk , tk+1 ])


= P (Yn ≤ Xn − δ ≤ tk+1 − δ, Xn ∈ [tk , tk+1 ])
≤ P (Yn ≤ tk − δ/2, Xn ≥ tk )
= o(1),
Bk = P (Xn − Yn ≤ −δ, Xn ∈ [tk , tk+1 ])
= P (Yn ≥ Xn + δ ≥ tk + δ, Xn ∈ [tk , tk+1 ])
≤ P (Yn ≥ tk+1 + δ/2, Xn ≤ tk+1 )
= o(1).

Since ε is arbitrary, we have limn P(|Xn − Yn| ≥ δ) = 0.

7.5 A more general definition of quantiles

Population quantiles

The earlier definition of a quantile ξp := F −1 (p) = inf{t ∈ R : F (t) ≥ p} is uniquely


defined. It satisfies F (ξp −) ≤ p ≤ F (ξp ), which could be used as an alternative definition.

Definition 7.1 A p-th quantile is any value satisfying

F (ξp −) ≤ p ≤ F (ξp ).

For simplicity, we first consider medians (i.e. p = 1/2) and denote m = ξ1/2 , i.e.

F (m−) ≤ 1/2 ≤ F (m).

Theorem 7.7

1. Medians form a closed interval: m0 ≤ m ≤ m1 . (Hence, medians may not be unique)


(If m0 = m1 , the interval reduces to a single point.)
2. If F is continuous at m, then F (m) = 1/2.
3. We have
m = arg min E|X − c|,
c
i.e. for any c, we have
EF |X − m| ≤ EF |X − c|.

Proof.

1. • Medians form an interval.


Proof. Let a and b be two medians, and c ∈ (a, b). Then

F (c−) ≤ F (b−) ≤ 1/2 ≤ F (a) ≤ F (c).

Hence c is also a median.

• Medians are closed, i.e. m0 = inf{m : m is a median} and m1 = sup{m :
  m is a median} are also medians.
  Proof. WLOG, we only show that m0 is a median. Clearly, there exists a
  decreasing sequence of medians, denoted by {mk, k ≥ 1}, such that
  – mk ↓ m0;
  – F(mk−) ≤ 1/2 ≤ F(mk).
  Hence,
  – F(m0−) ≤ F(mk−) ≤ 1/2,
  – F(m0) = limk F(mk) ≥ 1/2 (by right-continuity of F).
  Similarly, we can show that m1 is a median too.

2. If FX is continuous at m, then 1/2 ≤ FX (m) = FX (m−) ≤ 1/2.

3. WLOG, assume that c > m. Then

   E|X − c| = ∫_{(−∞,c]} (c − x) dF(x) + ∫_{(c,∞)} (x − c) dF(x)
            = ∫_{(−∞,m]} (c − x) dF(x) + ∫_{(m,c]} (c − x) dF(x)
              + ∫_{(m,∞)} (x − c) dF(x) − ∫_{(m,c]} (x − c) dF(x)
            = ∫_{(−∞,m]} (c − m) dF(x) + ∫_{(−∞,m]} (m − x) dF(x)
              + ∫_{(m,∞)} (x − m) dF(x) + ∫_{(m,∞)} (m − c) dF(x)
              + 2 ∫_{(m,c]} (c − x) dF(x)
            = E|X − m| + (c − m)[2F(m) − 1] + 2 ∫_{(m,c]} (c − x) dF(x)
            ≥ E|X − m|,

   where the last inequality holds since 2F(m) − 1 ≥ 0 (as m is a median), and also
   the assumption that c > m.
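The minimizing property in part 3 holds for any d.f., in particular for an empirical one, which gives a quick numerical check. The following sketch is illustrative only (the exponential sample, grid and tolerances are arbitrary choices):

```python
import random
import statistics

random.seed(6)

# With F the empirical d.f. of a fixed sample, E_F|X - c| is the average
# absolute deviation; part 3 says it is minimized at a median of the sample.
xs = [random.expovariate(1.0) for _ in range(501)]
m = statistics.median(xs)

def avg_abs_dev(c):
    return statistics.fmean(abs(x - c) for x in xs)

grid = [i / 100 for i in range(0, 301)]  # candidate c's in [0, 3]
best = min(grid, key=avg_abs_dev)
print(m, best, avg_abs_dev(m) <= avg_abs_dev(best) + 1e-12)
```

No grid value of c beats the sample median, and the best grid point lands next to it; the objective is convex and piecewise linear, so the grid minimizer is within one grid step of m.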

Sample quantiles

Definition. Given a sample X1, ..., Xn from a continuous d.f. F, the sample median is
any value m̂ satisfying

    Fn(m̂−) ≤ 1/2 ≤ Fn(m̂),

where Fn is the empirical distribution function (e.d.f.):

    Fn(x) = n⁻¹ Σᵢ I(Xi ≤ x).

Theorem 7.8 We have

(1)
    m̂ = X((n+1)/2)                           if n is odd,
       = any value in [X(n/2), X(n/2+1)]      if n is even.

(2) We have

        m̂ = arg min_c EFn|X − c| = arg min_c Σᵢ |Xi − c|,

    i.e. for any c, we have

        Σᵢ |Xi − m̂| ≤ Σᵢ |Xi − c|.

(3) For n given points (Xi, Yi), i = 1, ..., n, the sum Σᵢ |Yi − bXi| is minimized by
    b = Z(k), where k satisfies Σ_{i=1}^k p(i) ≥ 1/2 and Σ_{i=k}^n p(i) ≥ 1/2 with
    pi = |Xi|/Σj |Xj|, and Z(1), ..., Z(n) are the order statistics of Y1/X1, ..., Yn/Xn,
    and p(i) is the weight corresponding to Z(i).

Proof.

(1) It suffices to show that m̂ satisfies nFn(m̂−) ≤ n/2 ≤ nFn(m̂), or equivalently

        LHS := Σᵢ I{Xi < m̂} ≤ n/2 ≤ Σᵢ I{Xi ≤ m̂} =: RHS.

    – If n is odd, then for m̂ = X((n+1)/2), we have

        RHS = #{Xi ≤ X((n+1)/2)} = (n + 1)/2 ≥ n/2,
        LHS = #{Xi < X((n+1)/2)} = (n + 1)/2 − 1 = (n − 1)/2 ≤ n/2.

    – If n is even, then for m̂ ∈ [X(n/2), X(n/2+1)], we have

        RHS = #{Xi ≤ m̂} = n/2 or n/2 + 1 ≥ n/2,
        LHS = #{Xi < m̂} = n/2 − 1 or n/2 ≤ n/2.

(2) Let X ∼ Fn(x); then EFn|X − c| = n⁻¹ Σᵢ |Xi − c|, and the claim follows from the
    L1-optimality of the median (Theorem 7.7, part 3) applied to Fn.

(3) Note that (for Xi ≠ 0)

        Σᵢ |Yi − Xi b| = Σᵢ |Xi| |Yi/Xi − b| = (Σj |Xj|) Σᵢ |Zi − b| pi = (Σj |Xj|) E_G|Z − b|,

    where Zi = Yi/Xi, pi = |Xi|/Σj |Xj|, and Z ∼ G(x) = Σᵢ pi I(Yi/Xi ≤ x). From (2),
    Σᵢ |Yi − Xi b| is minimized when b = Z(k), the median of G, where k satisfies
    Σ_{i=1}^{k−1} p(i) ≤ 1/2 ≤ Σ_{i=1}^k p(i). Note here p(i) corresponds to Z(i).
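Part (3) is a few lines of code: sort the ratios Yi/Xi and walk the cumulative weights p(i) until they reach 1/2. A sketch under the assumption Xi ≠ 0 (the data, function name and checks are arbitrary choices of this illustration):

```python
import random

random.seed(7)

def lad_slope(xs, ys):
    """Minimize sum_i |y_i - b x_i| by the weighted-median rule of Theorem 7.8(3)."""
    total = sum(abs(x) for x in xs)
    pairs = sorted((y / x, abs(x) / total) for x, y in zip(xs, ys))
    cum = 0.0
    for z, w in pairs:
        cum += w
        if cum >= 0.5:       # smallest k with sum_{i<=k} p_(i) >= 1/2;
            return z         # then sum_{i>=k} p_(i) >= 1/2 holds automatically
    return pairs[-1][0]

xs = [random.uniform(0.5, 2.0) for _ in range(201)]
ys = [3.0 * x + random.gauss(0, 1) for x in xs]
b = lad_slope(xs, ys)

def loss(c):
    return sum(abs(y - c * x) for x, y in zip(xs, ys))

# The returned slope should beat small perturbations of itself.
print(b, loss(b) <= min(loss(b - 0.01), loss(b + 0.01)))
```

Since the objective is convex and piecewise linear in b, local optimality at the weighted median is global optimality.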

7.5.1 Alternative definition of quantiles

Theorem 7.9 A p-th quantile of X is any value ξp such that (Lehmann, p58.)
FX (ξp −) ≤ p ≤ FX (ξp ).
Then

1. The set of quantiles forms a closed interval: ξp0 ≤ ξp ≤ ξp1 .


If ξp0 = ξp1 , then the interval reduces to a single point.
2. If F is continuous at ξp , then F (ξp ) = p.
3. pE(X − c)+ + (1 − p)E(X − c)− is minimized by any pth quantile ξp , i.e., for any c,
pE(X − c)+ + (1 − p)E(X − c)− ≥ pE(X − ξp )+ + (1 − p)E(X − ξp )− ,
i.e., ξp = arg minc {pE(X − c)+ + (1 − p)E(X − c)− }.
Remark 7.1 Take p = 0.05 for example: ξ0.05 minimizes the loss function 0.05 E(X −
c)+ + 0.95 E(X − c)−. Here we notice that it puts more weight (0.95) on the negative
loss E(X − c)−. This has some bearing on the value-at-risk (VaR).

Proof. We only prove the last claim, since the proofs of the first two are similar to those
for medians. WLOG, assume that c > ξp. Then

    pE(X − c)+ + (1 − p)E(X − c)−
      = p ∫ (x − c) I{x − c > 0} dF(x) + (1 − p) ∫ [−(x − c)] I{x − c ≤ 0} dF(x)
      = (1 − p) ∫_{(−∞,c]} (c − x) dF(x) + p ∫_{(c,∞)} (x − c) dF(x)
      = (1 − p) ∫_{(−∞,ξp]} (c − x) dF(x) + (1 − p) ∫_{(ξp,c]} (c − x) dF(x)
        + p ∫_{(ξp,∞)} (x − c) dF(x) − p ∫_{(ξp,c]} (x − c) dF(x)
      = (1 − p) ∫_{(−∞,ξp]} (c − ξp) dF(x) + (1 − p) ∫_{(−∞,ξp]} (ξp − x) dF(x)
        + p ∫_{(ξp,∞)} (x − ξp) dF(x) + p ∫_{(ξp,∞)} (ξp − c) dF(x)
        + (1 − p) ∫_{(ξp,c]} (c − x) dF(x) − p ∫_{(ξp,c]} (x − c) dF(x)
      = (1 − p) ∫_{(−∞,ξp]} (ξp − x) dF(x) + p ∫_{(ξp,∞)} (x − ξp) dF(x)
        + (1 − p) ∫_{(−∞,ξp]} (c − ξp) dF(x) + p ∫_{(ξp,∞)} (ξp − c) dF(x)
        + ∫_{(ξp,c]} (c − x) dF(x)
      = pE(X − ξp)+ + (1 − p)E(X − ξp)−
        + (c − ξp)[(1 − p)F(ξp) − p(1 − F(ξp))] + ∫_{(ξp,c]} (c − x) dF(x)
      = pE(X − ξp)+ + (1 − p)E(X − ξp)−
        + (c − ξp)[F(ξp) − p] + ∫_{(ξp,c]} (c − x) dF(x)
      ≥ pE(X − ξp)+ + (1 − p)E(X − ξp)−,

where the last inequality holds since F (ξp ) ≥ p (as ξp is a p-th quantile), and also the
assumption that c > ξp . The statement is proved.

Remark 7.2 As we have seen earlier, our quantiles may not be unique. In fact, we have
shown that it can be any value from the closed interval [a, b], where

a := inf{t : F (t) ≥ p}, b := sup{t : F (t) ≥ p}

Definition. Let X1, ..., Xn be a set of distinct real numbers; its order statistics are
X(1), ..., X(n). Then we define the sample p-th quantile to be

    ξ̂p = any value in [X([np]), X([np]+1)]   if np = [np] (i.e., np is an integer),
        = X([np]+1)                          if np ≠ [np] (i.e., np is NOT an integer).

(Compare this with the median where p = 1/2.)

Theorem 7.10 Let X1, ..., Xn be a set of distinct real numbers.

(1) ξ̂p is also a p-th quantile of Fn(x) = n⁻¹ Σᵢ I(Xi ≤ x).

(2) Any p-th quantile ξ̂p minimizes

        pEFn(X − c)+ + (1 − p)EFn(X − c)−
          = n⁻¹ [ Σ_{i: Xi ≥ c} p|Xi − c| + Σ_{i: Xi < c} (1 − p)|Xi − c| ]
          = n⁻¹ [ Σᵢ p(Xi − c)+ + Σᵢ (1 − p)(Xi − c)− ]
          = n⁻¹ Σᵢ [ p(Xi − c)+ + (1 − p)(Xi − c)− ].

(3) For n given points (Xi, Yi), i = 1, ..., n,

        b := arg min_c [ Σ_{i: Yi−cXi ≥ 0} p|Yi − cXi| + Σ_{i: Yi−cXi < 0} (1 − p)|Yi − cXi| ]

    is called the p-th quantile regression estimator of Yi = cXi + εi.

    If further Xi > 0, there is a simple algorithm to calculate b: it is the
    p-th sample quantile of the weighted empirical d.f. G(x) = Σᵢ pi I(Yi/Xi ≤ x), where
    pi = |Xi|/Σj |Xj|. The details are similar to the medians.
j

Proof.

(1) It suffices to show that nFn(ξ̂p−) ≤ np ≤ nFn(ξ̂p), i.e.,

        LHS := Σᵢ I{Xi < ξ̂p} ≤ np ≤ Σᵢ I{Xi ≤ ξ̂p} =: RHS.

    – If np is NOT an integer, then ξ̂p = X([np]+1), therefore,

        RHS = #{Xi ≤ X([np]+1)} = [np] + 1 ≥ np,
        LHS = #{Xi < X([np]+1)} = [np] ≤ np.

    – If np is an integer, then ξ̂p can be any value in the closed interval
      [X([np]), X([np]+1)] = [X(np), X(np+1)], therefore,

        RHS = #{Xi ≤ ξ̂p} = np or np + 1 ≥ np,
        LHS = #{Xi < ξ̂p} = np − 1 or np ≤ np.

(2) Let X ∼ Fn(x). Since

        pEFn(X − c)+ + (1 − p)EFn(X − c)−
          = n⁻¹ [ Σ_{i: Xi ≥ c} p|Xi − c| + Σ_{i: Xi < c} (1 − p)|Xi − c| ],

    by Theorem 7.9 (part 3) applied to Fn together with (1), this is minimized when c
    is a p-th quantile of Fn(x), i.e., a p-th quantile of X1, ..., Xn.

(3) Assume that Xi > 0 for all i. Then, letting Zi = Yi/Xi and pi = |Xi|/Σj |Xj|,
    we have

        Σ_{i: Yi−cXi ≥ 0} p|Yi − cXi| + Σ_{i: Yi−cXi < 0} (1 − p)|Yi − cXi|
          = (Σj |Xj|) [ Σ_{i: Yi/Xi−c ≥ 0} p|Yi/Xi − c| pi + Σ_{i: Yi/Xi−c < 0} (1 − p)|Yi/Xi − c| pi ]
          = (Σj |Xj|) [ Σ_{i: Zi−c ≥ 0} p|Zi − c| pi + Σ_{i: Zi−c < 0} (1 − p)|Zi − c| pi ].

    The rest is left as an exercise.
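Part (2) can be verified numerically. The sketch below (added for illustration; the normal sample, p, grid and tolerance are arbitrary choices) confirms that the sample p-th quantile does at least as well as every candidate c on a grid under the check loss.

```python
import random

random.seed(8)

def check_loss(c, xs, p):
    """n^{-1} sum_i [ p (X_i - c)^+ + (1 - p) (X_i - c)^- ]."""
    return sum(p * max(x - c, 0.0) + (1 - p) * max(c - x, 0.0) for x in xs) / len(xs)

p = 0.25
xs = [random.gauss(0, 1) for _ in range(400)]   # np = 100 is an integer here
xi_hat = sorted(xs)[int(len(xs) * p)]           # X_([np]+1), one valid p-th sample quantile

grid = [i / 50 for i in range(-150, 151)]       # candidate c's in [-3, 3]
ok = all(check_loss(xi_hat, xs, p) <= check_loss(c, xs, p) + 1e-12 for c in grid)
print(xi_hat, ok)
```

With np an integer, any point of [X(np), X(np+1)] would work equally well; the sketch just uses the right endpoint.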

7.6 Median or Quantile (Robust) Regression

Suppose that (xi, Yi) follow the model

    Yi = α + β^τ xi + εi,   i = 1, ..., n,

where α ∈ R, β ∈ R^k, and εi are errors satisfying Eεi = 0.

• The least squares estimates (LSE) of (α, β) solve

    min_{a,b} Σᵢ [yi − (a + b^τ xi)]².

• The least absolute deviation (LAD) estimates [or median regression] of (α, β) solve

    min_{a,b} Σᵢ |yi − (a + b^τ xi)|.

• The p-th quantile regression estimates of (α, β) solve

    min_{a,b} Σᵢ [ p (yi − (a + b^τ xi))+ + (1 − p) (yi − (a + b^τ xi))− ].

Remark 7.3

• LAD is more robust than the LSE. If the d.f. of ε is heavy-tailed, one can use LAD.
  There are many other robust estimates, balancing robustness and efficiency.

• LSE is easier to compute than LAD. Theorem 7.10 shows how to compute one un-
  known for LAD. For more than one unknown, we could update one coordinate at a
  time given all the others, and cycle through the coordinates.

Remark 7.4 It can be shown (see the chapter on M-functionals) that, the mean, the
median and the p-th quantile satisfy
µ = arg min E(X − c)2
c
m = arg min E|X − c|
c
ξp = arg min E(p(X − c)+ + (1 − p)(X − c)− ),
c

with respective influence functions given by


ψ(x; µ) = x
ψ(x; m) = I{x > 0} − I{x < 0}
ψ(x; ξp ) = pI{x > 0} − (1 − p)I{x < 0}
The above three loss functions, (X − c)², |X − c| and p(X − c)+ + (1 − p)(X − c)−, are all
convex functions. Furthermore, the first two are symmetric functions while the last one
is asymmetric. If we plot px+ + (1 − p)x− against x with p = 5% for instance, clearly
we penalize more if X moves further away toward the left tail than toward the right. This
makes sense since we are interested in the quantiles in the left tail (p = 5% e.g.).

7.7 Exercises
1. Suppose X ∼ F, and m = F⁻¹(1/2), µ = EX, σ² = Var(X). Show that
   |m − µ| ≤ σ.
   (Hint: use the L1-optimal property of the median.)

2. Let F(x) = |x|, which is convex and has derivatives everywhere except at 0. Denote its
   left derivative by D(x) := f(x−) = I{x > 0} − I{x ≤ 0}. Show that the one-term
   Taylor expansion satisfies

       |F(x + t) − F(x) − f(x−)t| = ||x + t| − |x| − D(x)t| ≤ 2|t| I{|x| ≤ |t|}.

Chapter 8

M-Estimation

Loosely speaking, M-estimation refers to any procedure which involves maximization
or minimization of certain quantities. For example, in practice, we might like to
minimize a loss function EF ρ(X, t), say, over a range of values for t. Many estimation
procedures in statistics are in the form of M-estimation. Other related procedures in the
literature are general estimating equations (GEE) and zero-estimation (Z-estimation).

8.1 Definition

An M-functional is usually defined by minimizing an expected loss function:

    T(F) = arg min_{t0} EF ρ(X, t0) = arg min_{t0} ∫ ρ(x, t0) dF(x),

which can be solved (under certain conditions) from

    (∂/∂t) ∫ ρ(x, t) dF(x) = ∫ ψ(x, t) dF(x) = 0,   where ψ(x, t) = ∂ρ(x, t)/∂t.

Definition: Let X1, . . . , Xn ∼ F, and let Fn be the empirical d.f. Define

    λF(t) = EF ψ(X, t) = ∫ ψ(x, t) dF(x).

(1). An M-functional T(F) w.r.t. ψ is a solution t0 to

    λF(t0) = 0.

(2). An M-estimator of T(F) is t̂ = T(Fn), i.e., the solution of

    λFn(t̂) = EFn ψ(X, t̂) = n⁻¹ Σᵢ ψ(Xi, t̂) = 0.

(3). If ψ(x, t) = ψ̃(x − t), T (F ) is called a location parameter.

Remark 8.1 Although the M -functional and M -estimators are usually derived by Min-
imization or Maximization of certain quantities EF ρ(X, t) (hence the name M ), we
actually define those items without any mention of EF ρ(X, t).

8.2 Examples of M-estimators

Example 1 (MLE). Let F = {Fθ(·) : θ ∈ Θ}. Assume that F′θ(x) = fθ(x) exists. Maximizing
the log-likelihood log fθ(x) is equivalent to minimizing ρ(x, θ) = − log fθ(x). Choose

    ψ(x, θ) = −(∂/∂θ) log fθ(x).

The resulting M-estimator T(Fn) is an MLE.

Example 2 (Location parameter estimation). Let M -functional T (F ) be a solution


of Z
EF ψ(X − t0 ) = ψ(x − t0 )dF (x) = 0,

and the corresponding M -estimators T (Fn ) to the location parameter. Different choices
of ψ lead to different estimators. Here are some examples.

1. The least squares estimate (LSE). If ρ(x) = x²/2, then ψ(x) = x, and the
   solution to EF ψ(X − t0) = EF(X − t0) = µF − t0 = 0 is t0 = µF. Therefore,

       T(F) = ∫ x dF(x) = EX,  the mean functional,

   and

       T(Fn) = ∫ x dFn(x) = X̄,  the sample mean.

2. The least absolute deviation (LAD) estimate. If ρ(x) = |x|, then

       ψ(x) = I{x > 0} − I{x < 0} =  −1 if x < 0,   0 if x = 0,   1 if x > 0.

   (In fact, ρ(x) is not differentiable at 0, but we still define ψ(0) = 0.)
Then we claim that any solution to

EF ψ(X − t0 ) = EF I{X − t0 > 0} − EF I{X − t0 < 0}


= P (X > t0 ) − P (X < t0 )
= P (X ≥ t0 ) − P (X ≤ t0 )
= 0

must be a median m.

Proof. We have P (X ≥ t0 ) = P (X ≤ t0 ), and hence, P (X < t0 ) = P (X > t0 ).


Therefore,

1 = P (X > t0 ) + P (X < t0 ) + P (X = t0 ) = 2P (X < t0 ) + P (X = t0 )


= P (X ≥ t0 ) + P (X ≤ t0 ) − P (X = t0 ) = 2P (X ≤ t0 ) − P (X = t0 ),

implying that P(X < t0) = (1/2)(1 − P(X = t0)) ≤ 1/2, and P(X ≤ t0) =
(1/2)(1 + P(X = t0)) ≥ 1/2. Therefore, t0 must be a median.

Therefore, T (F ) = m (median functional) and T (Fn ) = m̂ (sample median).

3. pth LAD estimator. More generally, we can take ρ(x) = |x|^p/p, where
   p ∈ [1, 2]; then

       ψ(x) = −|x|^{p−1} if x < 0,   0 if x = 0,   |x|^{p−1} if x > 0.

   The M-estimator T(Fn) is called the pth least absolute deviation (LAD)
   estimator, or the minimum Lp distance estimator. (Again, ρ(x) may not be
   differentiable at 0, but we still define ψ(0) = 0.)

4. A type of trimmed mean. Huber (1964) considers

       ρ(x) = x²/2 if |x| ≤ C,   C²/2 if |x| > C,

   with

       ψ(x) = x if |x| ≤ C,   0 if |x| > C,
            = x I{|x| ≤ C}.
Then the solution to

EF ψ(X − t0 ) = EF (X − t0 )I{|X − t0 | ≤ C}
= EF XI{|X − t0 | ≤ C} − t0 P (|X − t0 | ≤ C) = 0

satisfies
EF XI{|X − t0 | ≤ C}
t0 = = EF (X|{|X − t0 | ≤ C}) := T (F ).
P (|X − t0 | ≤ C)

Intuitively, t0 = T (F ) is the mean for the restricted range [t0 − C, t0 + C].


Now the solution t̂ to

EFn ψ(X − t̂) = EFn (X − t̂)I{|X − t̂| ≤ C}


n
1X
= (Xi − t̂)I{|Xi − t̂| ≤ C}
n i=1
n n
!
1 X X
= Xi I{|Xi − t̂| ≤ C} − t̂ I{|Xi − t̂| ≤ C}
n i=1 i=1
= 0

satisfies
Pn
i=1 Xi I{|Xi − t̂| ≤ C}  
t̂ = P n = EFn X|{|X − t̂| ≤ C} := T (Fn ).
i=1 I{|Xi − t̂| ≤ C}

Intuitively, t̂ = T(Fn) is the mean over the restricted range [t̂ − C, t̂ + C]. Notice that
both sides of the above equation involve t̂, but we can solve for t̂ iteratively with the
initial choice t̂0 being the sample mean or any other reasonable guess.
Clearly, T (Fn ) is a type of trimmed mean. It completely eliminates the influence
of the extreme values (outliers).

5. Huber's estimate (or winsorized mean). Huber (1964) considers

ρ(x) = x^2 /2 if |x| ≤ C, and ρ(x) = C|x| − C^2 /2 if |x| > C,

with

ψ(x) = C if x > C,  ψ(x) = x if |x| ≤ C,  ψ(x) = −C if x < −C;

that is, ψ(x) = x I{|x| ≤ C} + C sgn(x) I{|x| > C}.

Then the solution to

EF ψ(X − t0 ) = EF (X − t0 )I{|X − t0 | ≤ C} + C P (X − t0 > C) − C P (X − t0 < −C)
= EF X I{|X − t0 | ≤ C} − t0 P (|X − t0 | ≤ C) + C P (X − t0 > C) − C P (X − t0 < −C) = 0

satisfies

t0 = [ EF X I{|X − t0 | ≤ C} + C P (X − t0 > C) − C P (X − t0 < −C) ] / P (|X − t0 | ≤ C) := T (F ).

Intuitively, t0 = T (F ) is the mean for the restricted range [t0 − C, t0 + C], with the observations outside pulled in to the cutoffs t0 ∓ C.

Now the solution t̂ to

EFn ψ(X − t̂) = EFn [ (X − t̂)I{|X − t̂| ≤ C} + C sgn(X − t̂)I{|X − t̂| > C} ]
= (1/n) [ Σ_i Xi I{|Xi − t̂| ≤ C} − t̂ Σ_i I{|Xi − t̂| ≤ C} + C Σ_i I{Xi − t̂ > C} − C Σ_i I{Xi − t̂ < −C} ]
= 0

satisfies

t̂ = [ Σ_i Xi I{|Xi − t̂| ≤ C} + C Σ_i I{Xi − t̂ > C} − C Σ_i I{Xi − t̂ < −C} ] / Σ_i I{|Xi − t̂| ≤ C} := T (Fn ).
Here, T (Fn ) is a type of winsorized mean. It limits, but does not entirely eliminate,
the influence of outliers.
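A sketch of the corresponding fixed-point computation for Huber's ψ (again with hypothetical data and cutoff; the update t ← t + (1/n)Σψ(Xi − t) is one simple way to solve the empirical estimating equation, not the only one):

```python
import statistics

def huber_location(xs, C, max_iter=200, tol=1e-12):
    """Solve (1/n) * sum_i psi(X_i - t) = 0, psi(u) = max(-C, min(C, u))."""
    t = statistics.median(xs)  # robust starting point
    for _ in range(max_iter):
        step = sum(max(-C, min(C, x - t)) for x in xs) / len(xs)
        t += step
        if abs(step) < tol:
            break
    return t
```

For xs = [0, 1, 2, 100] and C = 1, the estimate stays near the bulk of the data (the outlier contributes at most C to the estimating equation), whereas the sample mean is 25.75.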
6. Hampel's estimate. Hampel (1964) considers an odd ψ (i.e., ψ(−x) = −ψ(x)) with, for x ≥ 0,

ψ(x) = x if 0 ≤ x ≤ a,
ψ(x) = a if a < x ≤ b,
ψ(x) = a(c − x)/(c − b) if b < x ≤ c,
ψ(x) = 0 if x > c.

Then T (Fn ) is called Hampel's estimate. It is a combination of the trimmed mean and the winsorized mean: it down-weights the influence of extreme values and completely removes the influence of the most extreme values.
7. Smoothed Hampel's estimate. Hampel (1964) considers an odd ψ (i.e., ψ(−x) = −ψ(x)) with

ψ(x) = sin(ax) if 0 ≤ x ≤ π/a, and ψ(x) = 0 if x > π/a.
Then T (Fn ) is called smoothed Hampel’s estimate.

8.3 Consistency and asymptotic normality of M -estimates

We consider the following cases separately.

(a) ψ(x, t) is monotone in t. (e.g. sample mean, median, Huber’s estimates.)

(b) ψ(x, t) is continuous in t and bounded. (e.g. Hampel’s estimates.)

Recall λF (t) = EF ψ(X, t) = ∫ ψ(x, t)dF (x) and λFn (t) = EFn ψ(X, t) = (1/n) Σ_i ψ(Xi , t).

Case I: ψ(x, t) is monotone in t

Theorem 8.1 (Consistency) Assume that

(a) ψ(x, t) is monotone (say, non-increasing) in t.

(b) Either t0 = T (F ) is an isolated root of λF (t) = 0, (or λF (t) changes sign uniquely
in a neighborhood of t0 ).

(c) {Tn } is a root of λFn (t) = 0, (or λFn (t) changes sign in a neighborhood of Tn ).

Then t0 is unique and Tn → t0 a.s.

Proof. Since ψ(x, t) is non-increasing in t, so are λF (t) and λFn (t).

If t0 is an isolated root of λF (t) = 0, then by monotonicity it is the unique root, and for any ε > 0,

λF (t0 + ε) < λF (t0 ) = 0 < λF (t0 − ε).

If λF (t) changes sign uniquely in a neighborhood of t0 , then for any ε > 0,

λF (t0 + ε) < 0 < λF (t0 − ε).

By SLLN, λFn (t) → λF (t) a.s. for each t. Then as n → ∞,

λFn (t0 + ε) < 0 < λFn (t0 − ε), a.s. (3.1)

Recall that λFn (t) is non-increasing in t. Then from assumption (c), we must have

t0 − ε ≤ Tn ≤ t0 + ε, a.s.

(Note that Tn does not depend on ε and may not be unique, but all such roots satisfy the above inequality.) Letting ε → 0, we see that Tn → t0 a.s.

Theorem 8.2 (Asymptotic Normality) Let t0 be an isolated root of λF (t) = EF ψ(X, t) =


0. Suppose that

(a) ψ(x, t) is monotone (say, non-increasing) in t.

(b) λF (t) is differentiable at t = t0 with λ0F (t0 ) 6= 0.

(c) Eψ 2 (X, t) < ∞ for t in a neighborhood of t0 and is continuous at t = t0 .

Suppose that {Tn } is a root of λFn (t) = 0, or λFn (t) changes sign in a neighborhood of Tn .
Then,

(i) Tn −→a.s. t0 ;
√ EF ψ 2 (X, t0 )
(ii) n (Tn − t0 ) →d N (0, σ 2 ), where σ 2 = .
[λ0F (t0 )]2

Proof. (i) This is guaranteed by Theorem 8.1.

(ii) Since ψ(x, t) non-increasing in t, so is λFn (t). Then we have

{λFn (t) < 0} ⊂ {Tn ≤ t} ⊂ {λFn (t) ≤ 0}. (3.2)

Thus, P (λFn (t) < 0) ≤ P (Tn ≤ t) ≤ P (λFn (t) ≤ 0). Note that

P ( √n (Tn − t0 )/σ ≤ x ) = P ( Tn ≤ t0 + σx/√n ) = P (Tn ≤ tnx ), where tnx = t0 + σx/√n .

Hence it suffices to show that

lim P (λFn (tnx ) < 0) = lim P (λFn (tnx ) ≤ 0) = Φ(x).


n n

Now

P (λFn (tnx ) ≤ 0) = P ( Σ_i ψ(Xi , tnx ) ≤ 0 )
= P ( Σ_i [ψ(Xi , tnx ) − Eψ(Xi , tnx )]/(√n σnx ) ≤ −√n Eψ(X1 , tnx )/σnx )
=: P ( n^{−1/2} Σ_i Yni ≤ −√n λF (tnx )/σnx ),

where σnx^2 = V ar(ψ(X1 , tnx )) and Yni = [ψ(Xi , tnx ) − Eψ(Xi , tnx )]/σnx . However, as n → ∞, we have
√n λF (tnx ) = √n [λF (tnx ) − λF (t0 )]
= √n (tnx − t0 ) · [λF (tnx ) − λF (t0 )]/(tnx − t0 )
= σx [λ'F (t0 ) + o(1)]
→ λ'F (t0 ) σx,

and, since EF ψ^2 (X, t) and EF ψ(X, t) = λF (t) are both continuous at t0 , we have
σnx^2 = EF ψ^2 (X, tnx ) − [EF ψ(X, tnx )]^2
= EF ψ^2 (X, t0 ) + o(1) − [EF ψ(X, t0 ) + o(1)]^2
= EF ψ^2 (X, t0 ) + o(1)
= σ^2 [λ'F (t0 )]^2 + o(1)
→ σ^2 [λ'F (t0 )]^2 .

Therefore, σnx → −λ'F (t0 )σ, since λ'F (t0 ) < 0 (λF is non-increasing with λ'F (t0 ) ≠ 0). Hence, as n → ∞,

−√n λF (tnx )/σnx −→ x.
Since Yni , 1 ≤ i ≤ n, are i.i.d. with mean 0 and variance 1, we may apply the CLT for double arrays (Theorem 8.6 in the Appendix below) with Bn^2 = nV ar(Yn1 ) = n; together with Slutsky's theorem, we get

lim P (λFn (tnx ) ≤ 0) = Φ(x).


n

By the following remark, we have limn P (λFn (tnx ) < 0) = Φ(x). The proof is complete.

Remark 8.2 There are some equivalent definitions of weak convergence, referred to as
Portmanteau Theorem:

(a) Xn =⇒ X or Xn →d X.

(b) lim inf n→∞ P (Xn ∈ G) ≥ P (X ∈ G) for all open sets G.

(c) lim supn→∞ P (Xn ∈ K) ≤ P (X ∈ K) for all closed sets K.

(d) limn→∞ P (Xn ∈ A) = P (X ∈ A) for all sets A with P (X ∈ ∂A) = 0, where ∂A is the
boundary of A.

(e) Eg(Xn ) → Eg(X) for all bounded continuous function g.

(f ) Eg(Xn ) → Eg(X) for all functions g of the form g(x) = h(x)I[a,b] (x), where h is
continuous on [a, b] and a, b ∈ C(F ).

(g) limn ψn (t) = ψ(t), where ψn (t) and ψ(t) are the c.f.s of Xn and X, respectively.

Remark 8.3 In the quantile example below, the estimating equation λFn (t) = 0 reduces
to Fn (t) = p, which may have no solution or an infinite number of solutions, since Fn (t)
has jumps of size n^{−1} at each observation. If λFn (t) = 0 has no solution, then Tn is defined to be a value at which λFn (t) changes sign.

Example 1. (Sample pth Quantile). Suppose F is continuous and 0 < p < 1. Let ξp be the unique root of F (t) = p, and let ξ̂p be a root of Fn (t) = p, or a point at which Fn (t) − p changes sign. Further assume that F is differentiable at ξp with F'(ξp ) = f (ξp ) > 0. Then

√n ( ξ̂p − ξp ) −→d N ( 0, p(1 − p)/f^2 (ξp ) ).

Proof. Take ψ(x, t) = ψ(x − t) (a slight abuse of notation), where

ψ(x) = −1 if x < 0,  ψ(0) = 0,  ψ(x) = p/(1 − p) if x > 0;

that is,

ψ(x) = [p/(1 − p)] I{x > 0} − I{x < 0} = [1/(1 − p)] ( p I{x > 0} − (1 − p) I{x < 0} ).

Then

λF (t) = EF ψ(X − t) = (−1)P (X − t < 0) + [p/(1 − p)] P (X − t > 0)
= [p/(1 − p)][1 − F (t)] − F (t)
= [p − F (t)]/(1 − p).

Clearly, λF (ξp ) = 0. Also,

Eψ^2 (X, ξp ) = (−1)^2 P (X − ξp < 0) + [p/(1 − p)]^2 P (X − ξp > 0)
= p + [p^2 /(1 − p)^2 ](1 − p) = p/(1 − p),

and

λ'F (ξp ) = −F'(ξp )/(1 − p) = −f (ξp )/(1 − p).
It is easy to check that all conditions in Theorem 8.2 are satisfied with t0 = ξp . Therefore, √n ( ξ̂p − ξp ) →d N (0, σ^2 ), where

σ^2 = Eψ^2 (X, ξp )/[λ'F (ξp )]^2 = [p/(1 − p)] / [f^2 (ξp )/(1 − p)^2 ] = p(1 − p)/f^2 (ξp ).

Remark 8.4 In this example, we integrate ψ to get

ρ(x) = ∫_0^x ψ(u)du = −x if x ≤ 0, and ρ(x) = [p/(1 − p)] x if x > 0;

that is,

ρ(x) = [1/(1 − p)] ( p x I{x > 0} − (1 − p) x I{x < 0} ) = [1/(1 − p)] ( p x^+ + (1 − p) x^− ).

Ignoring the constant factor 1/(1 − p), we see that the pth quantile estimate minimizes the expected weighted L1 loss EL(c), where

L(c) = p(X − c)^+ + (1 − p)(X − c)^− .

Take p = 5% for instance: if we plot L(c) against c, we see that we penalize X more heavily for falling below c than for exceeding it by the same amount. This makes sense since we are interested in a quantile in the left tail (p = 5%, e.g.).
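This characterization can be checked empirically: minimizing the empirical check loss over candidate values recovers a sample pth quantile. A minimal sketch (the data and function names are illustrative; since the empirical loss is piecewise linear, a minimizer can always be found among the data points):

```python
def check_loss(xs, c, p):
    """Empirical weighted L1 (check/pinball) loss at c."""
    return sum(p * max(x - c, 0.0) + (1 - p) * max(c - x, 0.0) for x in xs)

def quantile_by_loss(xs, p):
    """Scan the sample for a minimizer of the empirical check loss."""
    return min(xs, key=lambda c: check_loss(xs, c, p))
```

For xs = 1, ..., 10 and p = 0.3, every point of [3, 4] minimizes the loss, matching the (non-unique) empirical 0.3-quantile.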

Case II: ψ(x, t) is differentiable in t (e.g. MLE)

The next theorem trades monotonicity of ψ(x, ·) for smoothness restrictions. The method
has been most used for MLE.

Lemma 8.1 Let g(x, t) be continuous at t0 uniformly in x. Let F be a d.f. such that EF |g(X, t0 )| < ∞. Let {Xi } be i.i.d. with d.f. F and suppose that Tn −→p t0 . Then,

(1/n) Σ_{i=1}^n g(Xi , Tn ) −→p EF g(X, t0 ).

The convergence in probability can be replaced by a.s. convergence throughout.

Proof. Write

| (1/n) Σ_i g(Xi , Tn ) − EF g(X, t0 ) | ≤ | (1/n) Σ_i [g(Xi , Tn ) − g(Xi , t0 )] | + | (1/n) Σ_i g(Xi , t0 ) − EF g(X, t0 ) |.

The second term on the RHS −→p 0, while the first term is bounded by

(1/n) Σ_i |g(Xi , Tn ) − g(Xi , t0 )| −→p 0,

since |g(x, Tn ) − g(x, t0 )| →p 0 uniformly in x.

Case III: ψ(x, t) is continuous in t and bounded

Theorem 8.3 (Consistency) Assume that

(a) ψ(x, t) is continuous in t and bounded.


(b) t0 = T (F ) is an isolated root of λF (t) = 0, and λF (t) changes sign
uniquely in a neighborhood of t0 (so λF does not behave like y = x^2 near its root).

Then λFn (t) = 0 admits a strongly consistent estimation sequence Tn for t0 .

Proof. Since λF (t) is continuous in t, assumption (b) implies that ∃ε > 0 such that

λF (t0 + ε) < λF (t0 ) = 0 < λF (t0 − ε),

or
λF (t0 + ε) > λF (t0 ) = 0 > λF (t0 − ε).
For simplicity, we assume that first inequality holds. By the Strong Law of Large Numbers
(SLLN), λFn (t) → λF (t) a.s. for each t. Then as n → ∞,

λFn (t0 + ε) < 0 < λFn (t0 − ε), a.s. (3.3)

Since λFn (t) is also continuous in t, there exists a solution sequence, {Tn,ε }, of λFn (t) = 0 in the interval [t0 − ε, t0 + ε]. That is,

t0 − ε ≤ Tn,ε ≤ t0 + ε, a.s.

In particular, choosing ε = 1/n and Tn = Tn,1/n , we have

t0 − 1/n ≤ Tn = Tn,1/n ≤ t0 + 1/n, a.s.

Letting n → ∞, we get Tn → t0 a.s.

Remark 8.5 Note that λFn (t) = 0 in Theorem 8.3 may have multiple solutions. In that
case, one needs to identify a consistent solution sequence in practice. Therefore, for a
given solution sequence, e.g., one obtained by a specified algorithm, one needs to CHECK
if it is consistent or not.

Theorem 8.4 (Asymptotic Normality) Let t0 = T (F ) be an M -functional which is a


root of λF (t) = EF ψ(X, t) = 0. Suppose that

(a) ψ(x, t) is bounded and continuous in t.


(b) λF (t) is continuously differentiable at t0 with λ0F (t0 ) 6= 0.

Let {Tn } be a strongly consistent solution sequence of λFn (t) = 0 (The existence is estab-
lished in Theorem 8.3). Then,

√ Eψ 2 (X, t0 )
n (Tn − t0 ) →d N (0, σ 2 ), where σ 2 = V ar (φF (X1 )) = .
[λ0F (t0 )]2

Proof. We provide two methods.

• Method 1. This is a special case of the next theorem.

• Method 2. It suffices to show that T is d∞ -Hadamard differentiable at F with

φF (x) = −ψ(x, t0 )/λ'F (t0 ) = −ψ(x, T (F ))/λ'F (T (F )). (3.4)

The derivation of (3.4) will be given in the next theorem. Let us prove d∞ -Hadamard
differentiability next.
We now show Hadamard differentiability. For Gj = F + tj ∆j , we have

T'F (Gj − F ) = −λGj −F (T (F ))/λ'F (T (F )) = −tj λ∆j (T (F ))/λ'F (T (F )).

Then,

0 = λGj [T (Gj )] − λF [T (F )]
= λGj [T (Gj )] − λGj [T (F )] + λGj [T (F )] − λF [T (F )]
= [T (Gj ) − T (F )] Dj + tj λ∆j [T (F )],

where Dj := {λGj [T (Gj )] − λGj [T (F )]}/{T (Gj ) − T (F )} denotes the difference quotient. So

T (Gj ) − T (F ) = −tj λ∆j [T (F )]/Dj
= T'F (Gj − F ) + T'F (Gj − F ) [ λ'F (T (F ))/Dj − 1 ]
= T'F (Gj − F ) + tj λ∆j [T (F )] [ 1/λ'F (T (F )) − 1/Dj ]
:= T'F (Gj − F ) + tj λ∆j [T (F )] Ξj .

It suffices to show that |λ∆j [T (F )]| < C and Ξj → 0. The first relation is easy to prove; the second holds because Dj → λ'F (T (F )).

Case IV: ψ(x, t) is continuous in t (but not necessarily bounded)

Let ‖ · ‖V denote the variation norm, ‖h‖V = lima→−∞,b→∞ Va,b (h), where


Va,b (h) = sup Σ_{i=1}^k |h(xi ) − h(xi−1 )|,

the supremum being taken over all partitions a = x0 < ... < xk = b of [a, b].

Lemma 8.2 Let the function h be continuous with ‖h‖V < ∞ and the function K be right-continuous with ‖K‖∞ < ∞ and K(∞) = K(−∞) = 0. Then,

| ∫ h dK | ≤ ‖h‖V ‖K‖∞ .

Proof. Applying integration by parts, we get

∫ h dK = limx→∞ h(x)K(x) − limx→−∞ h(x)K(x) − ∫ K dh = − ∫ K dh,

using the fact that h(±∞) < ∞ (why? please check) and the assumption K(∞) = K(−∞) = 0. Hence

| ∫ h dK | = | ∫ K dh | ≤ ∫ |K(x)| d|h|(x) ≤ supx |K(x)| ∫ d|h|(x) ≤ ‖K‖∞ ‖h‖V .

For another proof, see Natanson (1961), page 232.

Theorem 8.5 (Asymptotic Normality) Let t0 = T (F ) be an M -functional which is


an isolated root of λF (t) = EF ψ(X, t) = 0. Suppose that

(a) ψ(x, t) is continuous in x and satisfies limt→t0 ‖ψ(·, t) − ψ(·, t0 )‖V = 0.


(b) λF (t) is differentiable at t0 with λ0F (t0 ) 6= 0.
(c) Eψ 2 (X, t0 ) < ∞.

Let {Tn } be a solution sequence of λFn (t) = 0 satisfying Tn → t0 . Then,

√ Eψ 2 (X, t0 )
n (Tn − t0 ) →d N (0, σ 2 ), where σ 2 = V ar (φF (X1 )) = .
[λ0F (t0 )]2

Proof. We shall use the von Mises differential approach here.

• Let us first find out φF (x). Let Ft = F + t(G − F ) and recall λF (t) = EF ψ(X, t) = ∫ ψ(x, t)dF (x). Then,

λFt (T (Ft )) = EFt ψ(X, T (Ft ))
= EF ψ(X, T (Ft )) + t[EG ψ(X, T (Ft )) − EF ψ(X, T (Ft ))]
= λF (T (Ft )) + t[λG (T (Ft )) − λF (T (Ft ))]
= 0.
Differentiating w.r.t. t, we get

λ'F (T (Ft )) · dT (Ft )/dt + [λG (T (Ft )) − λF (T (Ft ))] + t · d/dt [λG (T (Ft )) − λF (T (Ft ))] = 0.

Setting t = 0 and noting λF (t0 ) = 0, we get

λ'F (t0 ) · dT (Ft )/dt |t=0 + λG (t0 ) = 0.

Therefore,

T'F (G − F ) = dT (Ft )/dt |t=0 = −λG (t0 )/λ'F (t0 ) = −EG ψ(X, t0 )/λ'F (t0 ).

Taking G = δx , we get

φF (x) = T'F (δx − F ) = −ψ(x, t0 )/λ'F (t0 ).

• In order to deal with Rn = T (G) − T (F ) − T'F (G − F ), we need some explicit expression for T (G) − T (F ). Define

h[T (G)] = {λF (T (G)) − λF (t0 )}/{T (G) − t0 } if T (G) ≠ t0 , and h[T (G)] = λ'F (t0 ) if T (G) = t0 .

Then in both cases

T (G) − T (F ) = {λF (T (G)) − λF (t0 )}/h[T (G)].

It is easy to see that h(t) → h(t0 ) = λ'F (t0 ). Therefore,

Rn (G, F ) = {λF (T (G)) − λF (t0 )}/h[T (G)] + λG (t0 )/λ'F (t0 ).
As it turns out, this expression is not especially manageable. (Try this yourself.) However, if the denominator in the second term of Rn is replaced by h[T (G)], things become a lot easier. In other words, we can study

R̃n (G, F ) = T (G) − T (F ) − [λ'F (t0 )/h[T (G)]] T'F (G − F ) (3.5)
= {λF (T (G)) − λF (t0 )}/h[T (G)] + λG (t0 )/h[T (G)]
= {λF (T (G)) + λG (t0 )}/h[T (G)]
= − ∫ [ψ(x, T (G)) − ψ(x, t0 )] d[G(x) − F (x)] / h[T (G)].

We now specialize to G = Fn and T (G) = T (Fn ) = Tn . The assumption Tn −→p t0 yields that h[Tn ] −→p h[t0 ] = λ'F (t0 ). Check that Lemma 8.2 is applicable. We thus have

|R̃n (Fn , F )| = | ∫ [ψ(x, Tn ) − ψ(x, t0 )] d[Fn (x) − F (x)] | / |h[Tn ]|
≤ ‖ψ(·, Tn ) − ψ(·, t0 )‖V ‖Fn − F ‖∞ / |h[Tn ]|
= op (1) ‖Fn − F ‖∞ = op (n^{−1/2} ) [n^{1/2} ‖Fn − F ‖∞ ]
= op (n^{−1/2} ) Op (1)
(recall that E[n^{1/2} ‖Fn − F ‖∞ ]^s < ∞ for any s > 0 from before)
= op (n^{−1/2} ).

From this, (3.5) and h[Tn ] −→p λ'F (t0 ), we get

√n [Tn − t0 ] −→d N (0, σ^2 ), where σ^2 = V ar(φF (X1 )) = Eψ^2 (X, t0 )/[λ'F (t0 )]^2

(note that EφF (X) = 0 since λF (t0 ) = 0).

Remark 8.6 Consider M -estimation of a location parameter with ψ(x, t) = ψ(x−t)


(a slight abuse of notation). Then, the condition (a) of Theorem 8.5 is satisfied by

(a) least pth power estimates, i.e., ψ(x) = |x|p−1 sgn(x) (1 < p ≤ 2).
(b) Huber’s estimates
(c) Hampel’s (smoothed or non-smoothed) estimates.

Appendix: CLT for double arrays of r.v.’s

Consider a double array of r.v.’s:

X11 , X12 , ............, X1k1 ;

X21 , X22 , ............, X2k2 ;


........................
Xn1 , Xn2 , ............, Xnkn ;
For each n ≥ 1, there are kn r.v.’s in that row. Assume that kn → ∞. The case kn = n is
called a “triangular array”.

Write Fnj (x) = P (Xnj ≤ x), and


 
kn
X kn
X
µnj = EXnj , An = µnj , Bn2 = V ar  Xnj  .
j=1 j=1

The Lindeberg-Feller-Lévy Theorem for a double array of r.v.'s becomes

Theorem 8.6 Let {Xnj : 1 ≤ j ≤ kn ; n = 1, 2, ...} be a double array with independent


r.v.’s within rows. Then

(i) the uniform asymptotic negligibility condition

max1≤j≤kn P (|Xnj − µnj | > τ Bn ) → 0, n → ∞, for each τ > 0,

and

(ii) the asymptotic normality condition Σ_{j=1}^{kn} Xnj ∼asy N (An , Bn^2 )

together hold if and only if the following Lindeberg condition holds:

Bn^{−2} Σ_{j=1}^{kn} E(Xnj − µnj )^2 I{|Xnj − µnj | > τ Bn } → 0, n → ∞, for each τ > 0.

8.4 Asymptotic Relative Efficiency (ARE)

Definition. Let T1 and T2 be two (asymptotically unbiased) estimators of some parameter θ with asymptotic variances σ1^2 and σ2^2 , respectively. Then the asymptotic relative efficiency (ARE) of T1 with respect to T2 is defined by

ARE(T1 , T2 ) = σ2^2 /σ1^2 .

Comparison between m̂ and X̄.

Let X1 , ..., Xn ∼ F (x) with symmetric pdf f (x). Then µ = m. Here we could use either the sample mean X̄ or the sample median m̂ to estimate the center.

1). X̄ ∼asymp. N (µ, σ^2 /n).

2). m̂ ∼asymp. N (µ, [4f^2 (µ)]^{−1} /n).

Therefore,

ARE(m̂, X̄) = (σ^2 /n) / ( [4f^2 (µ)]^{−1} /n ) = 4σ^2 f^2 (µ).

1. Take f (x) = (2πσ^2 )^{−1/2} e^{−(x−µ)^2 /(2σ^2 )} (i.e., N (µ, σ^2 )). Then f^2 (µ) = 1/(2πσ^2 ). So

ARE(m̂, X̄) = 2/π ≈ 0.64.

2. If f (x) = e^{−x} /(1 + e^{−x} )^2 (logistic distribution), then µ = EX = 0 and σ^2 = var(X) = π^2 /3 (see the appendix below), while f (µ) = f (0) = 1/4. So

ARE(m̂, X̄) = 4σ^2 f^2 (µ) = π^2 /12 ≈ 0.82.


3. If f (x) = (1/2) e^{−|x|} (double exponential), then µ = 0 and σ^2 = ∫ x^2 f (x)dx = ∫_0^∞ x^2 e^{−x} dx = Γ(3) = 2! = 2, while f (0) = 1/2. So

ARE(m̂, X̄) = 4σ^2 f^2 (µ) = 8 × (1/4) = 2.

So it is better to use the median in this case.


Note that the MLE is the sample median in this case, which is known to be asymptotically efficient, so one cannot find an asymptotically more efficient estimator than the sample median.

4. If f (x) is the pdf of Cauchy distribution, then σ 2 = ∞. So
ARE(m̂, X̄) = ∞.

Remark: This example illustrates that as the tail of the pdf gets thicker, the median becomes relatively more efficient.
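The closed-form AREs above can be reproduced by plugging σ² and f(µ) into ARE(m̂, X̄) = 4σ²f²(µ) (a small numerical check; the helper name is ours):

```python
import math

def are_median_vs_mean(sigma2, f_at_mu):
    """ARE(median, mean) = 4 * sigma^2 * f(mu)^2 for a symmetric density f."""
    return 4.0 * sigma2 * f_at_mu ** 2

# Normal(mu, sigma^2): f(mu) = (2*pi*sigma^2)^(-1/2)  ->  2/pi ~ 0.64
are_normal = are_median_vs_mean(1.0, 1.0 / math.sqrt(2 * math.pi))
# Logistic: sigma^2 = pi^2/3, f(0) = 1/4              ->  pi^2/12 ~ 0.82
are_logistic = are_median_vs_mean(math.pi ** 2 / 3, 0.25)
# Double exponential: sigma^2 = 2, f(0) = 1/2         ->  2
are_laplace = are_median_vs_mean(2.0, 0.5)
```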

Appendix: variance of the logistic distribution

Logistic distribution. X ∼ f (x) = e^{−x} /(1 + e^{−x} )^2 . Find µ = EX and σ^2 = var(X).

Solution. Clearly, f (−x) = e^x /(1 + e^x )^2 = e^{−x} /(1 + e^{−x} )^2 = f (x), i.e., f (x) is even and symmetric around the origin. Thus µ = EX = 0. And

σ^2 = E(X − µ)^2 = EX^2 = ∫_{−∞}^∞ x^2 f (x)dx = 2 ∫_0^∞ x^2 f (x)dx
= 2 ∫_0^∞ x^2 e^x /(1 + e^x )^2 dx
= −2 ∫_0^∞ x^2 d( 1/(1 + e^x ) )
= −2 [ x^2 /(1 + e^x ) ]_0^∞ + 4 ∫_0^∞ x/(1 + e^x ) dx
= 4 ∫_0^∞ x e^{−x} /(1 + e^{−x} ) dx
= 4 ∫_0^∞ x e^{−x} ( 1 − e^{−x} + e^{−2x} − e^{−3x} + e^{−4x} − ··· ) dx
= 4 ∫_0^∞ x e^{−x} Σ_{m=0}^∞ (−1)^m e^{−mx} dx
= 4 Σ_{m=0}^∞ (−1)^m ∫_0^∞ x e^{−(m+1)x} dx (interchanging integration and summation)
= 4 Σ_{m=0}^∞ [ (−1)^m /(m + 1)^2 ] ∫_0^∞ y e^{−y} dy [taking y = (m + 1)x]
= 4 Σ_{m=0}^∞ (−1)^m /(m + 1)^2
= 4 ( 1 − 1/2^2 + 1/3^2 − 1/4^2 + 1/5^2 − 1/6^2 + ··· ).

But it is known that 1 + 1/2^2 + 1/3^2 + 1/4^2 + 1/5^2 + 1/6^2 + ··· = π^2 /6, so

2 ( 1/2^2 + 1/4^2 + 1/6^2 + 1/8^2 + ··· ) = (1/2) ( 1 + 1/2^2 + 1/3^2 + 1/4^2 + ··· ) = π^2 /12.

The difference of the above two relations gives

1 − 1/2^2 + 1/3^2 − 1/4^2 + 1/5^2 − 1/6^2 + ··· = π^2 /6 − π^2 /12 = π^2 /12.

Therefore, σ^2 = 4 × π^2 /12 = π^2 /3.
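The alternating series at the heart of this computation is easy to check numerically (a small sketch; the function name is ours):

```python
import math

def logistic_variance_series(terms):
    """Partial sum of 4 * sum_{m>=0} (-1)^m / (m+1)^2, converging to pi^2/3."""
    return 4.0 * sum((-1) ** m / (m + 1) ** 2 for m in range(terms))
```

With 10^4 terms the partial sum already agrees with π²/3 ≈ 3.2899 to better than 1e-6, since the alternating-series error is at most the first omitted term.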

8.5 Robustness v.s. Efficiency

Assuming that ψ(x, θ) = ψ(x − θ), then

σM^2 = Eψ^2 (X − θ0 )/[Eψ'(X − θ0 )]^2 =: A/B^2 .

Assume E|ψ(X − θ0 )| < ∞ and fθ0 (x)ψ(x − θ0 ) → 0 as |x| → ∞. Thus

B = ∫ ψ'(x − θ0 )fθ0 (x)dx = ∫ fθ0 (x)dψ(x − θ0 )
= [ fθ0 (x)ψ(x − θ0 ) ]_{−∞}^∞ − ∫ ψ(x − θ0 )f'θ0 (x)dx
= − ∫ ψ(x − θ0 )f'θ0 (x)dx
= −E ( ψ(X − θ0 ) f'θ0 (X)/fθ0 (X) )
= E ( ψ(X − θ0 ) ∂ ln fθ (X)/∂θ |θ=θ0 ),

since for the location family fθ (x) = f (x − θ) we have ∂ ln fθ (x)/∂θ |θ=θ0 = −f'θ0 (x)/fθ0 (x).
Therefore, by the Cauchy–Schwarz inequality,

B^2 ≤ E ( ψ^2 (X − θ0 ) ) E ( ∂ ln fθ0 (X)/∂θ )^2 = A I(θ),

where I(θ) is the Fisher information. Combining all these, we get

σM^2 = A/B^2 ≥ 1/I(θ),

where the equality holds iff

ψ(X − θ0 ) = a ∂ ln fθ0 (X)/∂θ for some constant a.

That is, M-estimators are not asymptotically efficient unless ψ is proportional to the score function, i.e., unless they are essentially MLE's.

8.6 Generalized Estimating Equations (GEE)

The method of GEE is a very powerful tool and general method of deriving point estima-
tors, which includes many other methods as special cases. We omit the details here.

8.7 Exercises

1. Huber’s M-estimate. Take ψ(x, θ) = ψ0 (x − θ), where

ψ0 (x) = −k, x < −k,


= x, |x| ≤ k,
= k, x > k,

Using the theorem given in the lecture, show that any solution sequence Tn of Σ_{i=1}^n ψ(Xi , Tn ) = 0 satisfies asymptotically

√n (Tn − θ) −→ N (0, σ^2 ), in distribution,

where

σ^2 = [ ∫_{θ−k}^{θ+k} (x − θ)^2 dF (x) + k^2 ∫_{−∞}^{θ−k} dF (x) + k^2 ∫_{θ+k}^∞ dF (x) ] / [ ∫_{θ−k}^{θ+k} dF (x) ]^2 .

2. Let X1 , . . . , Xn be i.i.d. from the Cauchy distribution with pdf given by


C
f (x) = .
1 + x2

(a) Show that E|X(k) |r < ∞ iff r < k < n − r + 1.


(Hint: the pdf of X(k) is fX(k) (x) = n!/[(k − 1)!(n − k)!] · [F (x)]^{k−1} [1 − F (x)]^{n−k} f (x).)
(b) Cauchy d.f. does not have expectations. However, from the above,
i. E|X(k) | < ∞ iff 2 ≤ k ≤ n − 1,
i.e., only the smallest/largest order statistics do not have finite first mo-
ments.
ii. EX(k)2 < ∞ iff 3 ≤ k ≤ n − 2,

i.e., only the smallest/largest two order statistics do not have finite second
moments.
(c) Check whether the trimmed mean obtained by removing the two smallest and two largest observations is a consistent estimate of the center of the d.f.
(Hint: find the mean/variance of the trimmed mean.)

Chapter 9

U -Statistics

Hoeffding (1948) introduced U -statistics and proved their central limit theorem. There
are several factors that are unique to the class of U -statistics.

• Many commonly used statistics are or can be approximated by U -statistics, so they


are much more applicable than it first appears.

• U -statistics are generalisations of sums of independent r.v.s, thus these developments


on the theory of U -statistics are strongly influenced by the classical theory on sums
of independent random variables.

• Many of the techniques developed for U -statistics are also essential in studying even
more general classes of statistics, such as the symmetrical statistics.

• The special dependence structure in the U -statistics allows one to employ the mar-
tingale properties extensively.

9.1 U -statistics

Definition. Let X1 , ..., Xn ∼i.i.d. F and let h(x1 , ..., xm ) be a symmetric function of its arguments. Consider

θ = T (F ) = EF h(X1 , ..., Xm ) = ∫ ... ∫ h(x1 , ..., xm )dF (x1 )...dF (xm ).

The U - and V -statistics of order m with kernel h are defined, respectively, by

Vn = T (Fn ) = n^{−m} Σ_{i1 =1}^n ... Σ_{im =1}^n h(Xi1 , ..., Xim ),

Un = (n choose m)^{−1} Σ_{1≤i1 <...<im ≤n} h(Xi1 , ..., Xim )
= [1/(n(n − 1)...(n − m + 1))] Σ_{i1 ≠...≠im} h(Xi1 , ..., Xim ).

Remark 9.1 U -statistics are always unbiased (“U ” = unbiased) while V -statistics (“V ” comes from “von Mises”, who studied this type of statistics first) are typically biased for m ≥ 2, since usually Eh(X1 , ..., X1 ) ≠ Eh(X1 , ..., Xm ) for the diagonal terms.

9.2 Examples

• If h(x) = x, the U-statistic

Un = x̄n = (x1 + · · · + xn )/n

is the sample mean.

• If h(x1 , x2 ) = |x1 − x2 |, the U-statistic

Un = [1/(n(n − 1))] Σ_{i≠j} |xi − xj |, n ≥ 2,

is the mean pairwise deviation (Gini's mean difference).

• If h(x1 , x2 ) = (x1 − x2 )^2 /2, the U-statistic is

Un = [1/(n − 1)] Σ_i (xi − x̄n )^2 , n ≥ 2,

the sample variance.

• The third k-statistic

k3,n = [n/((n − 1)(n − 2))] Σ_i (xi − x̄n )^3 ,

i.e., the unbiased estimator of the third cumulant (a measure of sample skewness) defined for n ≥ 3, is a U-statistic.
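These identifications are easy to verify numerically. A minimal sketch of a generic order-m U-statistic (data and names are illustrative), checked against the usual sample variance for the kernel h(x1, x2) = (x1 − x2)²/2:

```python
from itertools import combinations

def u_statistic(xs, h, m):
    """U-statistic of order m with symmetric kernel h."""
    combos = list(combinations(xs, m))
    return sum(h(*c) for c in combos) / len(combos)

xs = [1.0, 4.0, 2.0, 8.0, 5.0]
u_var = u_statistic(xs, lambda a, b: (a - b) ** 2 / 2, 2)
# u_var coincides with the sample variance sum((x - mean)^2) / (n - 1)
```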

9.3 Hoeffding-decomposition

Given T := f (X1 , ..., Xn ), we have the Hoeffding-decomposition (or projection, or ANOVA


decomposition):
T = Σ_{i=1}^n Ti + Σ_{i<j} Tij + Σ_{i<j<k} Tijk + · · · + T12...n ,

where

Ti = E(T |Xi ),
Tij = E(T |Xi , Xj ) − Ti − Tj ,
Tijk = E(T |Xi , Xj , Xk ) − E(T |Xi , Xj ) − E(T |Xi , Xk ) − E(T |Xj , Xk ) + E(T |Xi ) + E(T |Xj ) + E(T |Xk )
= E(T |Xi , Xj , Xk )
− {E(T |Xi , Xj ) − E(T |Xi ) − E(T |Xj )}
− {E(T |Xi , Xk ) − E(T |Xi ) − E(T |Xk )}
− {E(T |Xj , Xk ) − E(T |Xj ) − E(T |Xk )}
− E(T |Xi ) − E(T |Xj ) − E(T |Xk )
= E(T |Xi , Xj , Xk ) − (Tij + Tik + Tjk ) − (Ti + Tj + Tk ).
Clearly,

• Ti is the main effect of Xi on T ,

• Tij is the 2-way interaction effect of Xi and Xj on T , after removing the main effects.

• Tijk is the 3-way interaction effect of Xi , Xj , Xk on T , after removing the main and
two-way interaction effects.

Heuristic proof

For T = f (X1 , ..., Xn ), note E(T |X1 , ..., Xn ) = T .

• For n = 1,

T := f (X1 ) = E(T |X1 )

• For n = 2,

T := f (X1 , X2 )
= E(T |X1 ) + E(T |X2 ) + {E(T |X1 , X2 ) − E(T |X1 ) − E(T |X2 )}
2
X X
= E(T |Xi ) + [E(T |Xi , Xj ) − E(T |Xi ) − E(T |Xj )]
i=1 1≤i<j≤2

• For n = 3,

T := f (X1 , X2 , X3 )
= E(T |X1 ) + E(T |X2 ) + E(T |X3 )
+{E(T |X1 , X2 ) − E(T |X1 ) − E(T |X2 )}
+{E(T |X1 , X3 ) − E(T |X1 ) − E(T |X3 )}
+{E(T |X2 , X3 ) − E(T |X2 ) − E(T |X3 )}
+{E(T |X1 , X2 , X3 ) − E(T |X1 , X2 ) − E(T |X1 , X3 ) − E(T |X2 , X3 )
+E(T |X1 ) + E(T |X2 ) + E(T |X3 )}
3
X X
= E(T |Xi ) + [E(T |Xi , Xj ) − E(T |Xi ) − E(T |Xj )]
i=1 1≤i<j≤3
X
+ {E(T |Xi , Xj , Xk ) − E(T |Xi , Xj ) − E(T |Xi , Xk ) − E(T |Xj , Xk )
1≤i<j<k≤3
+E(T |Xi ) + E(T |Xj ) + E(T |Xk )}.

• The general n can be shown similarly.

Remark 9.2 Hoeffding-decomposition is an identity. Given T , its overall effect is


ET , giving
T = ET + R1 .
For R1 , the main effect (or contribution) from Xi is given by Ti , resulting in
n
X
T − ET = Ti + R2 .
i=1

For R2 , the main effect of 2-way interactions from (Xi , Xj ) is Ti,j , giving

T = ET + Σ_{i=1}^n Ti + Σ_{1≤i1 <i2 ≤n} Ti1 ,i2 + R3 .

Continuing on until all interaction effects are taken care of gives the above ANOVA decomposition.

Theorem 9.1 The Hoeffding-decomposition of a U -statistic of order 2 is

Un − θ = (2/n) Σ_{i=1}^n g(Xi ) + [2/(n(n − 1))] Σ_{i<j} ψ(Xi , Xj ),

where hij = h(Xi , Xj ), gi = g(Xi ) = E(hij | Xi ) − θ, ψ(Xi , Xj ) = hij − gi − gj − θ, and all the terms on the RHS are uncorrelated.

Proof. WLOG, assume that θ = 0. Then

T1 = E(Un |X1 ) = [2/(n(n − 1))] Σ_{i<j} E[h(Xi , Xj )|X1 ]
= [2/(n(n − 1))] (n − 1)E[h(X1 , Xj )|X1 ]
= (2/n) g1 ,

E(Un |X1 , X2 ) = [2/(n(n − 1))] Σ_{i<j} E[h(Xi , Xj )|X1 , X2 ]
= [2/(n(n − 1))] [h12 + (n − 2)g1 + (n − 2)g2 ].

Hence,

T12 = E(Un |X1 , X2 ) − E(Un |X1 ) − E(Un |X2 )
= [2/(n(n − 1))] {h12 + (n − 2)[g1 + g2 ] − (n − 1)[g1 + g2 ]}
= [2/(n(n − 1))] (h12 − g1 − g2 )
= [2/(n(n − 1))] ψ(X1 , X2 ).

To prove uncorrelatedness, we show a few representative cases. First we note that

E[ψ(X1 , X2 )|X1 ] = E[h(X1 , X2 ) − g(X1 ) − g(X2 )|X1 ] = g(X1 ) − g(X1 ) − Eg(X2 ) = 0.

Hence,

Eg(X1 )ψ(X1 , X2 ) = EE[g(X1 )ψ(X1 , X2 )|X1 ] = E{g(X1 )E[ψ(X1 , X2 )|X1 ]} = 0,

and

Eψ(X1 , X2 )ψ(X1 , X3 ) = EE[ψ(X1 , X2 )ψ(X1 , X3 )|X1 ]
= E{ E[ψ(X1 , X2 )|X1 ] E[ψ(X1 , X3 )|X1 ] } = 0.
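The decomposition can be verified numerically for a kernel whose projection is available in closed form. For h(x, y) = xy under a distribution with mean µ we have θ = µ², g(x) = µx − µ² and ψ(x, y) = (x − µ)(y − µ), and the identity Un − θ = (2/n)Σg(Xi) + [2/(n(n − 1))]Σ_{i<j}ψ(Xi, Xj) then holds exactly for any sample (a small sketch; data and names are illustrative):

```python
from itertools import combinations

def hoeffding_identity_gap(xs, mu):
    """|(Un - theta) - projection - remainder| for h(x, y) = x*y;
    should vanish up to rounding for any xs and mu."""
    n = len(xs)
    pairs = list(combinations(xs, 2))
    un = sum(a * b for a, b in pairs) / len(pairs)
    theta = mu ** 2
    proj = (2.0 / n) * sum(mu * x - mu ** 2 for x in xs)           # (2/n) sum g
    rem = (2.0 / (n * (n - 1))) * sum((a - mu) * (b - mu) for a, b in pairs)
    return abs((un - theta) - proj - rem)
```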

9.3.1 Asymptotic Normality

Theorem 9.2 If Eh^2 (X1 , X2 ) < ∞ and σg^2 > 0, then

√n (Un − θ)/(2σg ) −→d N (0, 1).

Proof. Note √n (Un − θ) = (2/√n) Σ_{i=1}^n g(Xi ) + √n Rn , where Rn = [2/(n(n − 1))] Σ_{i<j} ψ(Xi , Xj ). Now

V ar(√n Rn ) = n · [4/(n^2 (n − 1)^2 )] Σ_{i<j} V ar(ψ(Xi , Xj )) = [2/(n − 1)] V ar(ψ(X1 , X2 )) → 0,

implying √n Rn →p 0. Then apply the CLT and the Slutsky theorem.

9.4 Jackknife variance estimates for U -statistics

Note that V ar(Un ) = 4V ar[g(X1 )]/n + O(n^{−2} ), which is approximated by

V̂ar(Un ) ≈ (4/n) · (1/n) Σ_{i=1}^n [g(Xi )]^2 = (4/n^2 ) Σ_{i=1}^n g^2 (Xi ).

Here g(Xi ) = E{h(Xi , Xj )|Xi } − θ can be approximated by g(Xi ) ≈ (1/(n − 1)) Σ_{j≠i} hij − Un . Then an approximation to the variance of Un is

V̂ar(Un ) = (4/n^2 ) Σ_{i=1}^n [ (1/(n − 1)) Σ_{j≠i} hij − Un ]^2 .

This is almost the same as the jackknife variance estimator, Vd arJack (Un ), given in the
next theorem. In effect, Vd
arJack (Un ) estimates the variance of the dominating term in
the H-decomposition of Un .

Theorem 9.3 If Un is a U -statistics of order 2 with kernel h(x, y), then


 2
n n
4(n − 1) X 1 X
Vd
arJack (Un ) =  hij − Un  . (4.1)
n(n − 2)2 i=1 n − 1 j=1,i6=j

Furthermore, Vd
arJack (Un ) is a consistent estimator of V ar(Un ) in the sense that

Vd
arJack (Un )
−→p 1, as n → ∞. (4.2)
V ar(Un )

Proof. We only derive (4.1) below (the proof of (4.2) is more involved and hence omitted). From the definition of Un , we have

Un = [1/(n(n − 1))] Σ_{i≠j} hij ,

 
U_{n−1}^{(−k)} = [1/((n − 1)(n − 2))] [ Σ_{i≠j} hij − 2 Σ_{j≠k} hkj ],

U_{n−1}^{(·)} = (1/n) Σ_{k=1}^n U_{n−1}^{(−k)}
= [1/((n − 1)(n − 2))] [ Σ_{i≠j} hij − (2/n) Σ_{k=1}^n Σ_{j≠k} hkj ]
= [1/((n − 1)(n − 2))] [ Σ_{i≠j} hij − (2/n) Σ_{i≠j} hij ]
= [1/((n − 1)(n − 2))] [ Σ_{i≠j} hij − 2(n − 1)Un ],

so that

V̂arJack (Un ) = [(n − 1)/n] Σ_{k=1}^n ( U_{n−1}^{(−k)} − U_{n−1}^{(·)} )^2
= [(n − 1)/(n(n − 1)^2 (n − 2)^2 )] Σ_{k=1}^n [ 2 Σ_{j≠k} hkj − 2(n − 1)Un ]^2
= [4(n − 1)/(n(n − 2)^2 )] Σ_{k=1}^n [ (1/(n − 1)) Σ_{j≠k} hkj − Un ]^2 .
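The closed form (4.1) can be checked against the delete-one definition directly (a small sketch with an illustrative kernel; both computations should agree up to rounding error):

```python
from itertools import combinations

def u_stat2(xs, h):
    pairs = list(combinations(xs, 2))
    return sum(h(a, b) for a, b in pairs) / len(pairs)

def var_jack_formula(xs, h):
    """Closed form (4.1) for an order-2 U-statistic."""
    n = len(xs)
    un = u_stat2(xs, h)
    rows = [sum(h(xs[k], xs[j]) for j in range(n) if j != k) / (n - 1)
            for k in range(n)]
    return 4.0 * (n - 1) / (n * (n - 2) ** 2) * sum((r - un) ** 2 for r in rows)

def var_jack_direct(xs, h):
    """Delete-one definition: (n-1)/n * sum_k (U^(-k) - U^(.))^2."""
    n = len(xs)
    loo = [u_stat2(xs[:k] + xs[k + 1:], h) for k in range(n)]
    mean_loo = sum(loo) / n
    return (n - 1) / n * sum((u - mean_loo) ** 2 for u in loo)
```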

From the Slutsky theorem, we get

Theorem 9.4 If Eh^2 (X1 , X2 ) < ∞ and σg^2 > 0, then

√n (Un − θ)/Sn −→d N (0, 1),

where Sn^2 = n V̂arJack (Un ).

Chapter 10

Jackknife

The jackknife and bootstrap are two popular nonparametric methods in statistical inference. The first idea related to the bootstrap was due to von Mises, who used the plug-in principle in the 1930's (although it can be argued that Gauss invented the bootstrap in the 1800's). Then in the late 1940's Quenouille found a way of correcting the bias of estimators whose bias was known to be of the form a/n + b/n^2 + · · ·. The method is called the jackknife.

Both methods provide several advantages over the traditional parametric approach:
the methods are easy to describe and they apply to arbitrarily complicated situations;
distribution assumptions, such as normality, are never made.

10.1 The traditional approach

Let X = X1 , . . . , Xn ∼i.i.d. F (typically unknown). The parameter of interest is θ ≡ T (F ).


Let Tn ≡ T (X1 , . . . , Xn ) be an estimator of θ. For Tn , we might be interested in

1. its bias: Bias(Tn ) = E(Tn ) − θ (for correcting for bias of Tn ).

2. its variance: σT2n = V ar(Tn ) (for measuring the accuracy of Tn ).

3. its sampling distribution: P ((Tn − θ)/σTn ≤ x), or P ((Tn − θ)/σ̂Tn ≤ x)? (for con-
structing confidence intervals for θ).

Example 1. (Sample mean.) Let θ = µ ≡ EX1 , and Tn = X̄, then

1. Bias(X̄) = E(X̄) − µ = 0, unbiased.

2. V ar(X̄) = σ^2 /n, where σ^2 = V ar(X1 ). So an estimate of V ar(X̄) is

σ̂^2_{Tn} := V̂ar(X̄) = S̃^2 /n, where S̃^2 = [1/(n − 1)] Σ_{i=1}^n (Xi − X̄)^2 .

3. By the CLT and the Slutsky theorem, P ((Tn − θ)/σ̂Tn ≤ x) −→ Φ(x). So an approximate 1 − α C.I. for µ is X̄ ∓ zα/2 S̃/√n .
n

Example 2. Let θ = µ^2 , and Tn = (X̄)^2 . Then

1. Bias[(X̄)^2 ] = E[(X̄)^2 ] − µ^2 = V ar(X̄) + [E(X̄)]^2 − µ^2 = σ^2 /n. Hence, a bias-corrected estimator of µ^2 is given by

T̃n = Tn − S̃^2 /n = (X̄)^2 − S̃^2 /n,

which is in fact unbiased for µ^2 . (But its variance may increase as a result, so one needs to use the MSE criterion, e.g., to choose the better one.)

2. Let us find V ar(Tn ) next. Let Y = X − µ. Note that Tn = (X̄ − µ + µ)^2 = (Ȳ + µ)^2 = (Ȳ )^2 + 2µȲ + µ^2 . Hence

V ar(Tn ) = V ar[(Ȳ )^2 ] + 4µ Cov(Ȳ , (Ȳ )^2 ) + 4µ^2 V ar(Ȳ )
= E[(Ȳ )^4 ] − {E[(Ȳ )^2 ]}^2 + 4µ E[(Ȳ )^3 ] + 4µ^2 σ^2 /n
= ···

(Please finish the rest of the calculations.)

Alternatively, as we have seen, by using the transformation approach, the asymptotic variance of Tn is 4µ^2 σ^2 /n, which can be estimated by 4(X̄)^2 S̃^2 /n.

3. By the transformation approach, √n (Tn − µ^2 )/(2|µ|σ) →d N (0, 1), or √n (Tn − µ^2 )/(2|X̄|S̃) →d N (0, 1), which can then be used to construct a C.I. for µ^2 .

Exercise. Let (Xi , Yi ) be i.i.d. r.v.s. Let θ = ρ and Tn = ρ̂ be the population and sample
correlation coefficients, respectively. Repeat the last example.

The weakness of the traditional approaches

There are several inherent weaknesses associated with traditional approaches.

(1) The theoretical formula for bias, variance, etc can be very difficult or even impossible
to obtain. It often requires the data analysts to have strong mathematical and
statistical background.

(2) The theoretical formula may be too complicated to be useful.

(3) The formulas are problem-specific. For a new problem, we need to repeat the whole process all over again.

(4) The formulas are often valid only for large sample size n (e.g. CLT). The performance for small n can often be poor.

In the following two chapters, we shall introduce alternative methods such as boot-
strap and jackknife and other resampling methods to overcome the above-mentioned dif-
ficulties. Some of the advantages are listed below.

(1) The methods are highly automatic, easy to use.
(2) The methods require little mathematical background on the user’s part.
(3) The methods are general, not problem-specific.
(4) The methods have good performances for small sample size, as compared to some
asymptotic methods.

10.2 Jackknife

The jackknife preceded the bootstrap. It is simple to use. The original work on the
“delete-one” jackknife is due to Quenouille (1949) and Tukey (1958).

10.2.1 Definition

Definition. Suppose that θ̂ = T (Fn ) := T (X1 , ..., Xn ) estimates θ = T (F ). Let

T_{n−1}^{(−i)} ≡ T (Fn−1,i ) = T (X1 , ..., Xi−1 , Xi+1 , ..., Xn ),

where

Fn−1,i (x) ≡ [1/(n − 1)] Σ_{j≠i} I{Xj ≤ x} .

That is, T_{n−1}^{(−i)} is the estimator of T (F ) based on the data with Xi “deleted” or left out.

Remark 10.1 Tukey (1958) defined the i-th pseudo-value as

    T_i^{pseudo} = nTn − (n−1)T_{n−1}^{(−i)},

and conjectured that

• T_i^{pseudo} (1 ≤ i ≤ n) may be treated as i.i.d. r.v.'s, which has been verified in many
cases such as U-statistics and functions of means.

• T_i^{pseudo} has approximately the same variance as √n Tn.

We shall now use the pseudo-values T_i^{pseudo} to estimate bias, variance, etc. Let

    T_{n−1}^{(·)} = (1/n) Σ_{i=1}^n T_{n−1}^{(−i)}.

1. The jackknife estimate of θ = T(F) (bias-corrected) is

    Tn^{Jack} = (1/n) Σ_{i=1}^n T_i^{pseudo} = nTn − (n−1)T_{n−1}^{(·)}.

2. The jackknife bias estimate is

    b̂ias(Tn) ≡ Tn − Tn^{Jack} = (n−1)[T_{n−1}^{(·)} − Tn].

(By definition, Tn^{Jack} = Tn − b̂ias(Tn) should hopefully have smaller bias than Tn.)

3. From Remark 10.1, we estimate Var(√n Tn) by the sample variance based on the T_i^{pseudo}.
Hence, the jackknife estimate of Var(Tn) is

    V̂ar_Jack(Tn) = (1/n) · (1/(n−1)) Σ_{i=1}^n ( T_i^{pseudo} − (1/n) Σ_{i=1}^n T_i^{pseudo} )²
                 = ((n−1)/n) Σ_{i=1}^n ( T_{n−1}^{(−i)} − T_{n−1}^{(·)} )².

4. The jackknife can also be used to estimate d.f.’s. Details are omitted.

Remark 10.2 The delete-1 jackknife estimate of P(Tn ≤ t) can be taken as
P̂_Jack(Tn ≤ t) = (1/n) Σ_{i=1}^n I{T_{n−1}^{(−i)} ≤ t} or P̂_Jack(Tn ≤ t) = (1/n) Σ_{i=1}^n I{T_i^{pseudo} ≤ t}. This has poor
properties for an i.i.d. sample. But for dependent samples, one could use the delete-d jackknife
estimate, which will be mentioned later.
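The pseudo-value recipe above can be collected into one short routine. A minimal sketch (the function name `jackknife` and the example data are mine, not from the text):

```python
import numpy as np

def jackknife(x, stat):
    """Delete-1 jackknife of the statistic `stat` on sample x.

    Returns the leave-one-out values T_{n-1}^{(-i)}, the pseudo-values,
    the bias-corrected estimate Tn_Jack, the bias estimate, and the
    jackknife variance estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    tn = stat(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    pseudo = n * tn - (n - 1) * loo            # Tukey pseudo-values
    tn_jack = pseudo.mean()                    # = n*Tn - (n-1)*T^{(.)}
    bias_hat = (n - 1) * (loo.mean() - tn)     # jackknife bias estimate
    var_hat = (n - 1) / n * ((loo - loo.mean()) ** 2).sum()
    return loo, pseudo, tn_jack, bias_hat, var_hat

# For the sample mean the pseudo-values are the observations themselves,
# and the variance estimate is Se^2/n (Example 1 below works this out).
x = np.array([1.0, 2.0, 4.0, 7.0])
loo, pseudo, tn_jack, bias_hat, var_hat = jackknife(x, np.mean)
print(pseudo)   # [1. 2. 4. 7.]
```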

10.2.2 Heuristic justification

Suppose that we have the following expansion,

    E(Tn) = T(F) + a/n + b/n² + O(n⁻³).

Then,

    E[T_{n−1}^{(−i)}] = E[T_{n−1}^{(·)}] = T(F) + a/(n−1) + b/(n−1)² + O(n⁻³).

Hence

    E[Tn^{Jack}] = nE Tn − (n−1)E T_{n−1}^{(·)}
               = n[ T(F) + a/n + b/n² + O(n⁻³) ] − (n−1)[ T(F) + a/(n−1) + b/(n−1)² + O(n⁻³) ]
               = [ nT(F) + a + b/n + O(n⁻²) ] − [ (n−1)T(F) + a + b/(n−1) + O(n⁻²) ]
               = T(F) − b/(n(n−1)) + O(n⁻²).

So Tn^{Jack} has bias of order O(n⁻²), one order smaller than the bias of Tn (of order O(n⁻¹)).

One can repeat the jackknife again and again to keep on reducing the bias. But be
aware: the jackknife might introduce more variability. One could use the Mean Square
Error (MSE) to see if one- or more-step jackknife is worthwhile or not. Usually, the
jackknife is rarely used beyond one-step.

10.2.3 The examples revisited

Example 1. (Sample mean.) Let θ = T(F) = µ ≡ EX₁, and Tn = X̄. Then,

    T_{n−1}^{(−i)} = T(X₁, ..., X_{i−1}, X_{i+1}, ..., Xn) = (X₁ + ... + X_{i−1} + X_{i+1} + ... + Xn)/(n−1) = (nX̄ − Xi)/(n−1),

    T_{n−1}^{(·)} = (1/n) Σ_{i=1}^n T_{n−1}^{(−i)} = (nX̄ − X̄)/(n−1) = X̄.

So we have

1. The i-th pseudo-value is

    T_i^{pseudo} = nTn − (n−1)T_{n−1}^{(−i)} = nX̄ − (nX̄ − Xi) = Xi.

2. The jackknife estimate of θ = T(F) (bias-corrected) is

    Tn^{Jack} = (1/n) Σ_{i=1}^n T_i^{pseudo} = X̄.

3. The jackknife bias estimate is

    b̂ias(Tn) ≡ Tn − Tn^{Jack} = X̄ − X̄ = 0.

4. The jackknife estimate of Var(θ̂) is

    V̂ar_Jack(θ̂) = ((n−1)/n) Σ_{i=1}^n ( T_{n−1}^{(−i)} − T_{n−1}^{(·)} )² = ((n−1)/n) Σ_{i=1}^n ( (nX̄ − Xi)/(n−1) − X̄ )²
                = (1/(n(n−1))) Σ_{i=1}^n (Xi − X̄)² = Se²/n,

which is unbiased for σ²/n.

Example 2. Let θ = T(F) = µ², and Tn = T(Fn) = (X̄)². Then,

    T_{n−1}^{(−i)} = ( (nX̄ − Xi)/(n−1) )²,    T_{n−1}^{(·)} = (1/n) Σ_{i=1}^n T_{n−1}^{(−i)}.

The i-th pseudo-value is T_i^{pseudo} = nTn − (n−1)T_{n−1}^{(−i)}.

1. The jackknife bias estimate is

    b̂ias(Tn) ≡ (n−1)[ T_{n−1}^{(·)} − Tn ]
            = ((n−1)/n) Σ_{i=1}^n [ ( (nX̄ − Xi)/(n−1) )² − (X̄)² ]
            = ((n−1)/n) Σ_{i=1}^n ( (nX̄ − Xi)/(n−1) + X̄ ) ( (nX̄ − Xi)/(n−1) − X̄ )
            = ((n−1)/n) Σ_{i=1}^n ( ((2n−1)X̄ − Xi)/(n−1) ) ( (X̄ − Xi)/(n−1) )
            = (1/(n(n−1))) Σ_{i=1}^n (−Xi)(X̄ − Xi)        [since Σ_{i=1}^n (X̄ − Xi) = 0]
            = (1/(n(n−1))) Σ_{i=1}^n (X̄ − Xi)(X̄ − Xi)
            = (1/(n(n−1))) Σ_{i=1}^n (Xi − X̄)² = Se²/n.

2. The jackknife estimate of θ = T(F) (bias-corrected) is

    Tn^{Jack} ≡ Tn − b̂ias(Tn) = (X̄)² − Se²/n,

which is unbiased, as E Tn^{Jack} − µ² = E(X̄)² − E Se²/n − µ² = (µ² + σ²/n) − σ²/n − µ² = 0.
n n n
3. The jackknife estimate of V ar(θ̂) is left as an exercise.

4. As an exercise, compare the MSEs of Tn and TnJack .
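A quick numeric check of this example; the identity Tn^{Jack} = (X̄)² − Se²/n holds exactly for any sample (the data values below are arbitrary):

```python
import numpy as np

x = np.array([2.0, 3.5, 3.9, 5.2, 8.0, 9.1])   # any fixed data set
n = len(x)
tn = x.mean() ** 2                              # Tn = (Xbar)^2
loo = np.array([np.delete(x, i).mean() ** 2 for i in range(n)])
tn_jack = n * tn - (n - 1) * loo.mean()         # bias-corrected estimate
print(np.isclose(tn_jack, tn - x.var(ddof=1) / n))   # True
```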

Example 3. Let θ = T(F) = σ² = Var(X), and Tn = T(Fn) = (1/n) Σ_{i=1}^n (Xi − X̄)², which
is a biased estimator of σ². Show that the jackknife estimate of θ = T(F) (bias-corrected)
is
    Tn^{Jack} = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)²   (unbiased).

Proof. Exercise.

Remark: As we shall see, T (Fn ) is a V -statistic of order 2, whose jackknife estimator is


a U -statistic of order 2, hence unbiased.
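The claim in Example 3 can be checked numerically; jackknifing the biased variance estimator returns the unbiased one exactly (data values arbitrary):

```python
import numpy as np

def v_n(x):
    return np.mean((x - x.mean()) ** 2)     # biased estimator, divisor n

x = np.array([0.3, 1.7, 2.2, 5.0, 6.1])
n = len(x)
loo = np.array([v_n(np.delete(x, i)) for i in range(n)])
tn_jack = n * v_n(x) - (n - 1) * loo.mean()
print(np.isclose(tn_jack, x.var(ddof=1)))   # True: Tn_Jack = Se^2
```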

Example 4. Let Tn = X_(n) be the largest order statistic of a random sample X₁, ..., Xn.
(Tn is often used to estimate the boundary point of a d.f., such as Unif[0, θ].) Find its
jackknife estimator Tn^{Jack}. Compare their MSEs.

Proof. It is easy to find that Tn^{Jack} = ((2n−1)/n) X_(n) − ((n−1)/n) X_(n−1). Finding the MSEs is
more tedious.
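For Example 4 the closed form is easy to check (arbitrary data with a unique maximum):

```python
import numpy as np

x = np.array([0.2, 1.4, 3.1, 4.8, 9.5])
n = len(x)
loo = np.array([np.delete(x, i).max() for i in range(n)])
tn_jack = n * x.max() - (n - 1) * loo.mean()
s = np.sort(x)
closed_form = (2 * n - 1) / n * s[-1] - (n - 1) / n * s[-2]
print(np.isclose(tn_jack, closed_form))   # True
```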

10.3 Application of the Jackknife to U- and V-statistics

Definition. Let X₁, ..., Xn ∼ i.i.d. F and h(x₁, ..., x_m) be a symmetric function of its
arguments. Consider

    θ = T(F) = Eh(X₁, ..., X_m) = ∫···∫ h(x₁, ..., x_m) dF(x₁)···dF(x_m).

The U- and V-statistics of order m with kernel h are defined, respectively, by

    Vn = T(Fn) = (1/nᵐ) Σ_{i₁=1}^n ··· Σ_{i_m=1}^n h(X_{i₁}, ..., X_{i_m}),

    Un = (1/C(n,m)) Σ_{1≤i₁<...<i_m≤n} h(X_{i₁}, ..., X_{i_m})
       = (1/(n(n−1)···(n−m+1))) Σ_{i₁≠...≠i_m} h(X_{i₁}, ..., X_{i_m}).

Remark 10.3 U-statistics are always unbiased ("U" = unbiased) while V-statistics ("V"
comes from "von Mises", who studied this type of statistic first) are typically biased for
m ≥ 2, since usually Eh(X₁, ..., X₁) ≠ Eh(X₁, ..., X_m) for the diagonal terms.

10.3.1 Reduce bias for V -statistics by jackknife

Examples.

1. θ = µ = T (F ), here h(x) = x and m = 1. Then, Vn = T (Fn ) = X̄ and Un = X̄.

2. θ = µ² = T(F), here h(x, y) = xy and m = 2. Then, Vn = T(Fn) = (X̄)², and

    Un = (1/(n(n−1))) Σ_{i≠j} Xi Xj = (1/(n(n−1))) [ Σ_{i,j} Xi Xj − Σ_{i=j} Xi Xj ]
       = (1/(n(n−1))) [ ( Σ_{i=1}^n Xi )² − Σ_{i=1}^n Xi² ]
       = (1/(n(n−1))) [ n²X̄² − ( Σ_{i=1}^n (Xi − X̄)² + nX̄² ) ]
       = X̄² − (1/(n(n−1))) Σ_{i=1}^n (Xi − X̄)² := X̄² − Se²/n,

which is unbiased for µ².


3. θ = σ² = T(F) = Var_F(X) = ½ Var_F(X − Y) = ½ E_F(X − Y)², where X, Y ∼ F
and independent; hence h(x, y) = (x − y)²/2. Hence,

    Vn = Var_{Fn}(X) = ∫ ( x − ∫ x dFn(x) )² dFn(x) = (1/n) Σ_{i=1}^n (Xi − X̄)²,

    Un = (1/(n(n−1))) Σ_{i≠j} (Xi − Xj)²/2 = (1/(n(n−1))) Σ_{i,j} (Xi − Xj)²/2
       = (1/(n(n−1))) n² Vn = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)² := Se²,

which is unbiased for σ².

The above three examples are all special cases of the next theorem.

Theorem 10.1 If Vn is a V-statistic of order 2, then

    V_{n,Jack} = Un.

That is, jackknifing Vn (of order 2) leads to Un.

Proof. Recall that V_{n,Jack} = nVn − (n−1)V_{n−1}^{(·)}. We first calculate V_{n−1}^{(·)}. Denote
hij := h(Xi, Xj); then

    Vn = (1/n²) Σ_{i=1}^n Σ_{j=1}^n hij.

Hence, (easy to see from a diagram)

    V_{n−1}^{(−k)} = (1/(n−1)²) [ Σ_{i,j} hij − Σ_i hik − Σ_j hkj + hkk ]
                 = (1/(n−1)²) [ Σ_{i,j} hij − 2 Σ_j hkj + hkk ],

    V_{n−1}^{(·)} = (1/n) Σ_{k=1}^n V_{n−1}^{(−k)}
                = (1/(n−1)²) [ Σ_{i,j} hij − (2/n) Σ_k Σ_j hkj + (1/n) Σ_k hkk ]
                = (1/(n−1)²) [ ((n−2)/n) Σ_{i,j} hij + (1/n) Σ_k hkk ].

Therefore,

    V_{n,Jack} = nVn − (n−1)V_{n−1}^{(·)}
              = (1/n) Σ_{i,j} hij − (1/(n−1)) [ ((n−2)/n) Σ_{i,j} hij + (1/n) Σ_k hkk ]
              = (1/n)( 1 − (n−2)/(n−1) ) Σ_{i,j} hij − (1/(n(n−1))) Σ_k hkk
              = (1/(n(n−1))) [ Σ_{i,j} hij − Σ_{i=j} hij ]
              = (1/(n(n−1))) Σ_{i≠j} hij
              = Un.

Remark 10.4 The above theorem is NOT true if m > 2 in general.
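Theorem 10.1 can be verified numerically for a particular kernel, say h(x, y) = xy (the helper code is a sketch, not from the text):

```python
import numpy as np
from itertools import permutations

def h(a, b):
    return a * b                                 # kernel of order m = 2

x = np.array([1.0, 2.0, 4.0, 4.5, 7.0])
n = len(x)
v_stat = lambda y: np.mean([h(a, b) for a in y for b in y])   # V-statistic
u_n = np.mean([h(a, b) for a, b in permutations(x, 2)])       # U-statistic
loo = np.array([v_stat(np.delete(x, i)) for i in range(n)])
v_jack = n * v_stat(x) - (n - 1) * loo.mean()
print(np.isclose(v_jack, u_n))                   # True: V_n,Jack = U_n
```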

10.3.2 Use the jackknife to estimate variance for U-statistics

The jackknife variance estimate for U-statistics has an explicit expression.

Theorem 10.2 If Un is a U-statistic of order 2 with kernel h(x, y), then

    V̂ar_Jack(Un) = ( 4(n−1)/(n(n−2)²) ) Σ_{i=1}^n ( (1/(n−1)) Σ_{j=1, j≠i}^n hij − Un )².      (3.1)

Furthermore, V̂ar_Jack(Un) is a consistent estimator of Var(Un) in the sense that

    V̂ar_Jack(Un) / Var(Un) −→p 1,   as n → ∞.      (3.2)

Proof. We only derive (3.1) below (the proof of (3.2) is more involved and hence omitted).
From Theorem 10.1, we have

    Un = (1/(n(n−1))) Σ_{i≠j} hij,

    U_{n−1}^{(−k)} = (1/((n−1)(n−2))) [ Σ_{i≠j} hij − 2 Σ_{j≠k} hkj ],

    U_{n−1}^{(·)} = (1/n) Σ_{k=1}^n U_{n−1}^{(−k)}
                = (1/((n−1)(n−2))) [ Σ_{i≠j} hij − (2/n) Σ_{i≠j} hij ]
                = (1/((n−1)(n−2))) [ n(n−1)Un − 2(n−1)Un ] = Un.

Hence,

    V̂ar_Jack(Un) = ((n−1)/n) Σ_{k=1}^n ( U_{n−1}^{(−k)} − U_{n−1}^{(·)} )²
                = ( (n−1)/(n(n−1)²(n−2)²) ) Σ_{k=1}^n ( 2 Σ_{j≠k} hkj − 2(n−1)Un )²
                = ( 4(n−1)²(n−1)/(n(n−1)²(n−2)²) ) Σ_{k=1}^n ( (1/(n−1)) Σ_{j≠k} hkj − Un )²
                = ( 4(n−1)/(n(n−2)²) ) Σ_{k=1}^n ( (1/(n−1)) Σ_{j≠k} hkj − Un )².
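Formula (3.1) agrees exactly with the generic delete-1 jackknife variance of Un, which a short check confirms (kernel and data are arbitrary choices):

```python
import numpy as np
from itertools import permutations

def h(a, b):
    return 0.5 * (a - b) ** 2                    # kernel for sigma^2

def u_stat(y):
    return np.mean([h(a, b) for a, b in permutations(y, 2)])

x = np.array([0.5, 1.2, 2.0, 3.3, 4.1, 6.0])
n = len(x)
un = u_stat(x)
loo = np.array([u_stat(np.delete(x, i)) for i in range(n)])
v_generic = (n - 1) / n * ((loo - loo.mean()) ** 2).sum()
row = np.array([sum(h(x[i], x[j]) for j in range(n) if j != i) / (n - 1)
                for i in range(n)])
v_explicit = 4 * (n - 1) / (n * (n - 2) ** 2) * ((row - un) ** 2).sum()
print(np.isclose(v_generic, v_explicit))         # True
```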

Remark 10.5 Hoeffding decomposition

• The Hoeffding decomposition (or projection method, or ANOVA decomposition) is

    T − ET = Σ_{i=1}^n Ti + Σ_{1≤i₁<i₂≤n} T_{i₁,i₂} + ··· + T_{i₁,...,i_n},

where Ti = E(T − ET | Xi), T_{i₁,i₂} = E(T − ET | X_{i₁}, X_{i₂}) − T_{i₁} − T_{i₂}, etc.

• We decompose T into the main effects Ti, 2-way interactions, 3-way interactions,
etc. This is where the name "ANOVA decomposition" comes from.

• The Hoeffding decomposition is in fact an identity, which can be thought of as follows.
Given T, its overall effect is ET, and we write

    T = ET + R₁.

From R₁, we take the main effects or contributions from each Xi, given by Ti, resulting in

    T = ET + Σ_{i=1}^n Ti + R₂.

From R₂, we take the 2-way interactions from each pair (X_{i₁}, X_{i₂}) (but without the
main effects), given by T_{i₁,i₂}, resulting in

    T = ET + Σ_{i=1}^n Ti + Σ_{1≤i₁<i₂≤n} T_{i₁,i₂} + R₃.

We can continue until all the interaction effects have been taken care of, giving the
above ANOVA decomposition.

• Applying the Hoeffding decomposition to U-statistics, it is easy to check that

    Un − θ = (2/n) Σ_{i=1}^n g(Xi) + (2/(n(n−1))) Σ_{i<j} ψ(Xi, Xj),

where g(Xi) = E(hij | Xi) − θ and ψ(Xi, Xj) = hij − g(Xi) − g(Xj) + θ, and all the
terms on the RHS are uncorrelated.

Proof. WLOG, assume that θ = 0, and write g₁ = E[h(X₁, Xj)|X₁], g₂ = E[h(X₂, Xj)|X₂]. Then

    T₁ = E(Un|X₁) = (1/(n(n−1))) Σ_{i≠j} E[h(Xi, Xj)|X₁]
       = (1/(n(n−1))) · 2(n−1) E[h(X₁, Xj)|X₁] = (2/n) g₁,

    E(Un|X₁, X₂) = (1/(n(n−1))) Σ_{i≠j} E[h(Xi, Xj)|X₁, X₂]
                 = (1/(n(n−1))) ( 2h₁₂ + 2(n−2)g₁ + 2(n−2)g₂ ).

Hence,

    T₁₂ = E(Un|X₁, X₂) − E(Un|X₁) − E(Un|X₂)
        = (1/(n(n−1))) ( 2h₁₂ + 2(n−2)[g₁ + g₂] − 2(n−1)[g₁ + g₂] )
        = (2/(n(n−1))) ( h₁₂ − g₁ − g₂ )
        = (2/(n(n−1))) ψ(X₁, X₂).

Remark 10.6 From the above, we see that

    Var(Un) = 4Var[g(X₁)]/n + O(n⁻²),

which can be approximated by

    Ṽar(Un) ≃ (4/n) · (1/n) Σ_{i=1}^n [g(Xi)]² = (4/n²) Σ_{i=1}^n g²(Xi).

But the g(Xi) are not statistics, and can be further approximated by

    g(Xi) ≃ (1/(n−1)) Σ_{j=1, j≠i}^n hij − Un.

Combining the above, we get an approximation to the variance of Un:

    V̂ar(Un) = (4/n²) Σ_{i=1}^n ( (1/(n−1)) Σ_{j≠i} hij − Un )².

This is almost the same as the jackknife variance estimator V̂ar_Jack(Un). Therefore,
V̂ar_Jack(Un) estimates the variance of the dominating term in the H-decomposition
of Un.

Remark 10.7 Asymptotic properties such as the CLT for U -statistics, Berry-Esseen bounds,
and so on will be described in a later chapter.

10.4 Application of the jackknife to the smooth function of


means models

Let X, X₁, ..., Xn be i.i.d. k-dimensional random vectors with µ = EX and Σ = Cov(X). It is
known that

    g(X̄) −→d N( g(µ), σg²/n ),

where
    σg² = g′(µ)τ Σ g′(µ).

Here g(·) : Rᵏ → R¹ is assumed to be differentiable with g′(µ) ≠ 0, and g′(x) is the
gradient of g(x), given by

    g′(x)_{k×1} = ( ∂g(x)/∂x₁, ......, ∂g(x)/∂x_k )τ.

Let X̄ and Σ̂ = n⁻¹ Σ_{i=1}^n (Xi − X̄)(Xi − X̄)τ denote the sample mean and sample
covariance matrix. One obvious estimate of σg² is the plug-in (i.e., bootstrap) estimator:

    σ̂g² = g′(X̄)τ Σ̂ g′(X̄) −→a.s. g′(µ)τ Σ g′(µ) = σg².

However, the formula requires the user to carry out much analytical work, such as
calculating g′(X̄), etc. By comparison, the jackknife provides a very simple alternative, which
requires no differentiation at all.

Theorem 10.3 Suppose g′(x) is continuous in a neighborhood of µ and g′(µ) ≠ 0. Then the
jackknife variance estimator of g(X̄) is strongly consistent, i.e.,

    V̂ar_Jack[g(X̄)] / (σg²/n) = n V̂ar_Jack[g(X̄)] / σg² −→a.s. 1.

Proof. Let θ = g(µ) = T(F), and θ̂ = T(Fn) = g(X̄). The jackknife estimate of Var(θ̂)
satisfies

    n V̂ar_Jack[g(X̄)] = n V̂ar_Jack(Tn)
                     = (n−1) Σ_{i=1}^n ( T_{n−1}^{(−i)} − T_{n−1}^{(·)} )²
                     = (n−1) Σ_{i=1}^n ( g(X̄^{(−i)}) − (1/n) Σ_{i=1}^n g(X̄^{(−i)}) )²
                     = (n−1) Σ_{i=1}^n ( [g(X̄^{(−i)}) − g(X̄)] − (1/n) Σ_{i=1}^n [g(X̄^{(−i)}) − g(X̄)] )².
Now

    X̄^{(−i)} = (X₁ + ... + X_{i−1} + X_{i+1} + ... + Xn)/(n−1) = (nX̄ − Xi)/(n−1),

    X̄^{(−i)} − X̄ = ( nX̄ − Xi − (n−1)X̄ )/(n−1) = (X̄ − Xi)/(n−1),    (1/n) Σ_{i=1}^n ( X̄^{(−i)} − X̄ ) = 0.

Furthermore, by the mean-value theorem, we have

    g(X̄^{(−i)}) − g(X̄) = g′(ξi)τ ( X̄^{(−i)} − X̄ ) = g′(X̄)τ ( X̄^{(−i)} − X̄ ) + Ri,

where ξi is a point on the segment between X̄^{(−i)} and X̄, and

    Ri = [g′(ξi) − g′(X̄)]τ ( X̄^{(−i)} − X̄ ) = (1/(n−1)) [g′(ξi) − g′(X̄)]τ ( X̄ − Xi ).
Therefore,

    n V̂ar_Jack(Tn) = (n−1) Σ_{i=1}^n [ g′(X̄)τ ( X̄^{(−i)} − X̄ ) + (Ri − R̄) ]² := An + Bn + 2Cn,

where

    An = (n−1) Σ_{i=1}^n g′(X̄)τ ( X̄^{(−i)} − X̄ )( X̄^{(−i)} − X̄ )τ g′(X̄),

    Bn = (n−1) Σ_{i=1}^n (Ri − R̄)² ≤ (n−1) Σ_{i=1}^n Ri²,

    Cn² ≤ |An| |Bn|.

For An, since g′(x) is continuous at µ, we have

    An = (1/(n−1)) Σ_{i=1}^n g′(X̄)τ ( Xi − X̄ )( Xi − X̄ )τ g′(X̄)
       = (n/(n−1)) g′(X̄)τ Σ̂ g′(X̄)
       −→a.s. σg².

For Bn, let un = max_{1≤i≤n} ‖g′(ξi) − g′(X̄)‖. We have

    Bn ≤ (n−1) Σ_{i=1}^n Ri²
       = (1/(n−1)) Σ_{i=1}^n ( [g′(ξi) − g′(X̄)]τ ( X̄ − Xi ) )²
       ≤ (1/(n−1)) Σ_{i=1}^n ‖g′(ξi) − g′(X̄)‖² ‖X̄ − Xi‖²      (by the Cauchy–Schwarz inequality)
       ≤ ( un²/(n−1) ) Σ_{i=1}^n ‖X̄ − Xi‖²
       = (n/(n−1)) un² × (trace of Σ̂).

Note that (trace of Σ̂) −→a.s. (trace of Σ). Since ξi is between X̄^{(−i)} and X̄, we have

    max_{1≤i≤n} ‖ξi − X̄‖² ≤ max_{1≤i≤n} ‖X̄^{(−i)} − X̄‖² = (1/(n−1)²) max_{1≤i≤n} ‖Xi − X̄‖²
       ≤ (1/(n−1)²) Σ_{i=1}^n ‖Xi − X̄‖² = (n/(n−1)²) (trace of Σ̂) −→a.s. 0.

That is, max_{1≤i≤n} ‖ξi − X̄‖ −→a.s. 0. This, together with X̄ −→a.s. µ, implies that
un −→a.s. 0. Hence, Bn −→a.s. 0.

For Cn2 ≤ |An | |Bn |, since An −→a.s. σg2 and Bn −→a.s. 0, we have Cn −→a.s. 0.

Remark 10.8 Theorem 10.3 implies that [g(X̄) − g(µ)] / √( V̂ar_Jack[g(X̄)] ) −→d N(0, 1),
which could be used to construct a C.I. for g(µ).

Remark 10.9 From Theorem 10.3, V̂ar_Jack[g(X̄)] is consistent if g′(x) is continuous in
a neighborhood of µ. On the other hand, σ̂g² is consistent if g′(x) is continuous only at µ.
So the jackknife needs a stronger condition.
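A small sanity check of the machinery in the proof: for a linear g(x) = cτx the remainders Ri vanish, and one gets the exact relation V̂ar_Jack[g(X̄)] = (n/(n−1)) · σ̂g²/n (the vector c and the simulated data below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))                 # n = 12 draws of a 3-vector
c = np.array([1.0, -2.0, 0.5])
n = len(X)
g = lambda v: c @ v                          # linear g, so R_i = 0

loo = np.array([g(np.delete(X, i, axis=0).mean(axis=0)) for i in range(n)])
v_jack = (n - 1) / n * ((loo - loo.mean()) ** 2).sum()

Sigma_hat = np.cov(X, rowvar=False, ddof=0)  # (1/n) sum (X_i - Xbar)(X_i - Xbar)^T
sigma_g2 = c @ Sigma_hat @ c                 # plug-in sigma_hat_g^2
print(np.isclose(v_jack, (n / (n - 1)) * sigma_g2 / n))   # True
```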

10.5 Jackknife to linear regression models

Assume the following linear regression model:

    Yi = xiτ β + εi,   i = 1, ..., n,

where

    Yi = the i-th response,
    xi = p-vector of explanatory variables,
    β = p-vector of unknown parameters of interest,
    εi = independent random errors with E(εi) = 0 and Var(εi) = σi².

The model can also be rewritten as

    Y = Xβ + ε,

where Y = (Y₁, ..., Yn)τ, β = (β₀, ..., β_{p−1})τ, ε = (ε₁, ..., εn)τ, and X is the n × p design
matrix with rows x₁τ, ..., xnτ.

The least squares estimate (LSE) of β is

    β̂ = arg min_β ‖Y − Xβ‖² = arg min_β (Y − Xβ)τ(Y − Xβ) = (XτX)⁻¹XτY,

and
    Var(β̂) = (XτX)⁻¹ (XτW X) (XτX)⁻¹,

with W = diag(σ₁², ..., σn²).

Basic properties

• E β̂ = β (unbiased).

• If σi² = σ² for all i, then Var(β̂) = σ²(XτX)⁻¹. Furthermore, in this case, β̂ is the
best linear unbiased estimator (BLUE) of β.

• If εi ∼ i.i.d. N(0, σ²), then β̂ is also the MLE and UMVUE.

(1) The Jackknife to β

Miller (1974) first extended the jackknife to regression problems. Let β̂(i) be the LSE of β
based on the data (x₁τ, Y₁), ..., (x_{i−1}τ, Y_{i−1}), (x_{i+1}τ, Y_{i+1}), ..., (xnτ, Yn). That is,

    β̂(i) = ( X_{(−i)}τ X_{(−i)} )⁻¹ X_{(−i)}τ Y_{(−i)}
         = ( Σ_{j=1}^n xj xjτ − xi xiτ )⁻¹ ( Σ_{j=1}^n xj Yj − xi Yi )
         = (XτX − xi xiτ)⁻¹ (XτY − xi Yi).

Recall the Bartlett identity (also known as the Sherman–Morrison formula)

    (A + uvτ)⁻¹ = A⁻¹ − A⁻¹ u vτ A⁻¹ / (1 + vτA⁻¹u),

which is true since

    (A + uvτ) ( A⁻¹ − A⁻¹ u vτ A⁻¹/(1 + vτA⁻¹u) )
      = I + uvτA⁻¹ − uvτA⁻¹/(1 + vτA⁻¹u) − u(vτA⁻¹u)vτA⁻¹/(1 + vτA⁻¹u)
      = I + uvτA⁻¹ − uvτA⁻¹/(1 + vτA⁻¹u) − (vτA⁻¹u) uvτA⁻¹/(1 + vτA⁻¹u)
        (since vτA⁻¹u is a scalar)
      = I + uvτA⁻¹ − (1 + vτA⁻¹u) uvτA⁻¹/(1 + vτA⁻¹u)
      = I + uvτA⁻¹ − uvτA⁻¹ = I.

Letting

    hi = xiτ(XτX)⁻¹xi,   ri = Yi − xiτβ̂,   zi = (XτX)⁻¹xi,   bi = zi ri/(1 − hi),
we have

    β̂(i) = (XτX − xi xiτ)⁻¹ (XτY − xi Yi)
         = ( (XτX)⁻¹ + (XτX)⁻¹ xi xiτ (XτX)⁻¹ / (1 − xiτ(XτX)⁻¹xi) ) (XτY − xi Yi)
         = β̂ − (XτX)⁻¹ xi Yi + ( (XτX)⁻¹ xi xiτ / (1 − hi) ) ( β̂ − (XτX)⁻¹ xi Yi )
         = β̂ − ( (XτX)⁻¹ xi / (1 − hi) ) ( Yi(1 − hi) − xiτβ̂ + hi Yi )
         = β̂ − ( (XτX)⁻¹ xi / (1 − hi) ) ( Yi − xiτβ̂ )
         = β̂ − zi ri / (1 − hi)
         = β̂ − bi.
Hence,

    β̂_Jack = nβ̂ − (n−1)β̂(·) = nβ̂ − (n−1) ( β̂ − (1/n) Σ_{i=1}^n zi ri/(1 − hi) ) = β̂ + ((n−1)/n) Σ_{i=1}^n bi = β̂ + (n−1)b̄,

    V̂_Jack(β̂) = ((n−1)/n) Σ_{i=1}^n ( β̂(i) − β̂(·) )( β̂(i) − β̂(·) )τ = ((n−1)/n) Σ_{i=1}^n ( bi − b̄ )( bi − b̄ )τ.
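The identity β̂(i) = β̂ − bi can be confirmed against a brute-force refit (simulated design and response; the variable names mirror the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ Y                       # full-data LSE
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)    # leverages h_i
r = Y - X @ beta                               # residuals r_i
for i in range(n):
    b_i = XtX_inv @ X[i] * r[i] / (1 - h[i])   # b_i = z_i r_i/(1 - h_i)
    refit = np.linalg.lstsq(np.delete(X, i, axis=0),
                            np.delete(Y, i), rcond=None)[0]
    assert np.allclose(refit, beta - b_i)      # beta_hat_(i) = beta_hat - b_i
print("leave-one-out identity verified")
```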
n i=1 (i) n i=1

Remark 10.10

• Note that the "normal equations" from the LSE imply that

    Σ_{i=1}^n ri = Σ_{i=1}^n (Yi − xiτβ̂) = 0,

    Σ_{i=1}^n zi ri = Σ_{i=1}^n (XτX)⁻¹ xi [Yi − xiτβ̂] = (XτX)⁻¹ ( XτY − XτXβ̂ ) = β̂ − β̂ = 0.

However, in general, b̄ = (1/n) Σ_{i=1}^n zi ri/(1 − hi) ≠ 0. Hence, β̂_Jack ≠ β̂, although E β̂_Jack = E β̂
for fixed design points.

• Both β̂ and β̂_Jack are unbiased for β, but β̂_Jack has an extra unattractive term (n−1)b̄
(although it is pretty harmless). The same thing happens to linear combinations
of β. One way to get rid of the extra term is to use the "weighted jackknife" method.

• The extra term appears because of the unbalanced design points xi. Note that the
bigger the absolute value of xi is, the bigger its influence. To counterbalance this, one
can down-weight the influence of the large values in absolute value terms.

(2) The Jackknife to θ = g(β) = Cτβ

Similarly, if θ = Cτβ, then θ̂ = Cτβ̂, and it can be shown that

    θ̂_Jack = Cτ( β̂ + (n−1)b̄ ) = θ̂ + (n−1)Cτb̄,

    V̂_Jack(θ̂) = ((n−1)/n) Σ_{i=1}^n Cτ( bi − b̄ )( bi − b̄ )τ C.

Remark 10.11 In fact, for linear combinations θ = Cτβ, since the design points xi
are fixed, we do have
    E θ̂_Jack = E θ̂.
That is, θ̂_Jack is indeed an unbiased estimator of θ = Cτβ. Despite this, it is still not natural
to have b̄ around. Some proposals have been made to "correct" (maybe not necessary here)
this by using the weighted jackknife.

(3) The Weighted Jackknife to θ = g(β)

Miller (1974) showed that V̂_Jack(θ̂) is a consistent variance estimator of Var(θ̂) under fairly
weak conditions. However, θ̂_Jack may not be consistent. Recall that Cτβ̂ is unbiased for Cτβ,
but the jackknife estimator is biased. This is caused by the imbalance of the linear model,
i.e., the EYi = xiτβ are not the same. (Please check the statements here.)

It is hoped that a better jackknife estimator can be obtained by taking this
imbalance into account.

10.6 Failure of the jackknife

The jackknife is good, but it can fail miserably if the statistic θ̂ is not "smooth" (i.e., small
changes in the data can cause big changes in the statistic). One example is the sample
median (or quantile), whose influence curve IC(x; F, T) has a jump (i.e., is not smooth) near
the sample median. Note that the sample median depends on only one or two values in the
middle of the data.

10.6.1 An example

Consider the following n = 10 ordered values:

    10, 27, 31, 38, 40, 50, 52, 104, 146, 150.

Then, we get

    T_{n−1}^{(−i)} = 50 for 1 ≤ i ≤ 5,   and   T_{n−1}^{(−i)} = 40 for 6 ≤ i ≤ 10,

    T_{n−1}^{(·)} = 45,

    T_{n−1}^{(−i)} − T_{n−1}^{(·)} = 5 for 1 ≤ i ≤ 5,   and   −5 for 6 ≤ i ≤ 10.

Thus, V̂ar_Jack(Tn) tries to estimate the variance from just these two values, 40 and 50,
resulting in

    V̂ar_Jack(Tn) = ((10 − 1)/4) (50 − 40)² = 225.

Note that the estimate depends solely on the two points 40 and 50, which makes it highly
unstable. (It is impossible to estimate a variance from one data value, and nearly impossible
from two data values. The variance estimate should be based on "many" data points.)
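The numbers in this example are easy to reproduce:

```python
import numpy as np

x = np.array([10, 27, 31, 38, 40, 50, 52, 104, 146, 150], dtype=float)
n = len(x)
loo = np.array([np.median(np.delete(x, i)) for i in range(n)])
print(loo)           # five values of 50 followed by five values of 40
v_jack = (n - 1) / n * ((loo - loo.mean()) ** 2).sum()
print(v_jack)        # ((10-1)/4) * (50-40)^2 = 225
```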

The jackknife described so far deletes one observation at a time. If we try to delete
2 observations at a time in the above quantile example, the total number of ways one can
do this is C(10, 2) = 10 × 9/2 = 45. You can try to list the pseudo-values here. Clearly,
we have more distinct values now. More generally, one can do the delete-d jackknife, to be
discussed below.

10.6.2 Main results

Theorem 10.4 Let θ = F⁻¹(1/2) and Tn = Fn⁻¹(1/2), with n = 2m. Assume that f = F′
exists and is continuous at θ, and f(θ) > 0. Then

1. The jackknife variance estimate is

    V̂ar_Jack(Tn) = ((n−1)/4) ( X_(m+1) − X_(m) )².

(Hence, the jackknife estimates the variance from just two pseudo-values, not a very
smart idea.)

2. The jackknife variance estimate fails to be consistent, i.e.,

    V̂ar_Jack(Tn)/vn −→d [Exp(1)]² =d (χ₂²/2)² =d Weibull(1, 1/2),

where vn = 1/(4nf²(θ)).
(Hence, no matter how large n is, the ratio on the left does not converge to 1.)

Proof.

1. The order statistics are X_(1), ..., X_(m), X_(m+1), ..., X_(n). Now

    T_{n−1}^{(−i)} = X_(m+1) for 1 ≤ i ≤ m,   and   X_(m) for m+1 ≤ i ≤ n,

    T_{n−1}^{(·)} = ( mX_(m+1) + mX_(m) )/n = ( X_(m+1) + X_(m) )/2,

    T_{n−1}^{(−i)} − T_{n−1}^{(·)} = ( X_(m+1) − X_(m) )/2 for 1 ≤ i ≤ m,   and   −( X_(m+1) − X_(m) )/2 for m+1 ≤ i ≤ n.

Hence

    V̂ar_Jack(Tn) = ((n−1)/n) Σ_{i=1}^n ( T_{n−1}^{(−i)} − T_{n−1}^{(·)} )² = ((n−1)/4) ( X_(m+1) − X_(m) )².

2. The assumptions of the theorem imply that F has a local inverse at θ.
WLOG, we may assume that F has an inverse everywhere. By Lemma 10.1, we have

    Xk =: F⁻¹(1 − e^{−Yk}) := H(Yk) ∼ F,   1 ≤ k ≤ n,

where H(t) = F⁻¹(1 − e^{−t}), and

    Xk ∼ i.i.d. F,   Uk := F(Xk) ∼ Uniform[0, 1],   Yk := −ln[1 − Uk] ∼ Exp(1).

Since H(t) = F⁻¹(1 − e^{−t}) is increasing in t, from Lemma 10.1, we have

    X_(k) = F⁻¹(1 − e^{−Y_(k)}) := H(Y_(k)),   1 ≤ k ≤ n,

where Y_(1), ..., Y_(n) are the order statistics of Y₁, ..., Yn ∼ i.i.d. Exp(1).

Note that H′(t) = e^{−t}/f(H(t)). By (6.3), with n = 2m, we have Y_(m+1) − Y_(m) = Z_{m+1}/m = O_p(n⁻¹),
and

    EY_(m) = Σ_{k=1}^m 1/(n−k+1) = Σ_{j=m+1}^n 1/j = (C + ln n + εn) − (C + ln m + εm)
           = ln(n/m) + o(1) = ln 2 + o(1),

    Var(Y_(m)) = Σ_{k=1}^m 1/(n−k+1)² = Σ_{j=m+1}^n 1/j² = O(n⁻¹),

    Y_(m) = EY_(m) + O_p(n^{−1/2}) = ln 2 + o_p(1),

    H(EY_(m)) = F⁻¹(1 − e^{−EY_(m)}) = F⁻¹(1 − e^{−ln 2 + o(1)}) = F⁻¹(1/2 + o(1)) = θ + o(1),

    H′(EY_(m)) = e^{−EY_(m)}/f(H(EY_(m))) = e^{−ln 2 + o(1)}/f(θ + o(1)) = 1/(2f(θ)) + o(1).

By the mean value theorem,

    n[X_(m+1) − X_(m)] = n[H(Y_(m+1)) − H(Y_(m))]
                       = H′( Y_(m) + t[Y_(m+1) − Y_(m)] ) · n[Y_(m+1) − Y_(m)],   |t| ≤ 1,
                       = H′( ln 2 + o_p(1) ) · 2Z_{m+1}
                       −→d 2H′(ln 2) · Exp(1),

by Slutsky's theorem and the continuity of H′ near ln 2. Hence

    n V̂ar_Jack(Tn) = (n(n−1)/4) ( X_(m+1) − X_(m) )² −→d [H′(ln 2)]² [Exp(1)]² = [Exp(1)]²/(4f²(θ)),

and dividing by n·vn = 1/(4f²(θ)) gives the limit stated in the theorem.

10.6.3 Delete-d jackknife

Let θ̂ estimate θ. Let θ̂(s) denote θ̂ applied to the data with the subset s removed. Then the
delete-d jackknife variance estimator is

    V̂ar_Jack(Tn)[d] = ( (n−d)/d ) · (1/C(n,d)) Σ_s ( θ̂(s) − θ̂(s·) )²,

where Σ_s is the sum over all C(n,d) subsets s of size d removed from {X₁, ..., Xn} (so each
θ̂(s) is computed from the n − d retained observations), and θ̂(s·) = Σ_s θ̂(s)/C(n,d).


Theorem 10.5 Under the same conditions as Theorem 10.4, if √n/d → 0 and n − d → ∞, then

    V̂ar_Jack(Tn)[d] / vn −→p 1.
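A brute-force (enumerative) implementation of the delete-d estimator, feasible only when C(n, d) is small; for larger problems one samples subsets at random, as Remark 10.12 below suggests (the function name is mine):

```python
import numpy as np
from itertools import combinations

def var_jack_delete_d(x, stat, d):
    x = np.asarray(x, dtype=float)
    n = len(x)
    # recompute the statistic on every retained subset of size n - d
    vals = np.array([stat(x[list(keep)])
                     for keep in combinations(range(n), n - d)])
    return (n - d) / d * ((vals - vals.mean()) ** 2).mean()

x = np.array([10, 27, 31, 38, 40, 50, 52, 104, 146, 150], dtype=float)
print(var_jack_delete_d(x, np.median, 1))   # recovers the delete-1 value, 225
print(var_jack_delete_d(x, np.median, 3))   # averages over C(10,3) = 120 subsets
```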

Remark 10.12

1. The conditions require d to go to ∞, but not too fast. One such choice is d = n^{2/3}.

2. If n is large and √n < d < n, then C(n,d) is large. In this case, one can use random
sampling (or Monte-Carlo simulation) to approximate V̂ar_Jack(Tn)[d]. If this is too
much trouble, one might use the bootstrap in the first place.

3. The delete-d jackknife can be shown to be consistent for the median if √n/d → 0 and
n − d → ∞. Roughly speaking, it is preferable to choose a d such that √n < d < n
for delete-d jackknife estimation of standard errors.

4. d plays the role of a smoothing parameter, similar to that of the bandwidth in curve
estimation.

5. The delete-d jackknife can also be used to estimate d.f.'s, for t-statistics, for example.
But its convergence rate is slower than the bootstrap's for independent data.
For dependent data, though, the delete-d jackknife works beautifully, while the "naive"
bootstrap and the delete-1 jackknife both fail here.

10.6.4 A useful lemma

Lemma 10.1 Let Y₁, ..., Yn be i.i.d. from Exp(1), and Y_(1), ..., Y_(n) its order statistics.

1. (i) e^{−Yk} ∼ Uniform[0, 1]; (ii) Yk² ∼ Weibull(1, 1/2).

2. Let Zi := (n − i + 1)[Y_(i) − Y_(i−1)], 1 ≤ i ≤ n, where Y_(0) = 0. Then
Z₁, ..., Zn are i.i.d. from Exp(1).
Consequently, Y_(i) can be represented as

    Y_(i) = Z₁/n + ... + Zi/(n − i + 1) = Σ_{k=1}^i Zk/(n − k + 1),   1 ≤ i ≤ n.      (6.3)

3. Let U_(1), ..., U_(n) be order statistics from Uniform(0, 1). Let U_(0) = 0, U_(n+1) = 1.
Then,

    Vi := n[U_(i) − U_(i−1)] −→d Exp(1),   1 ≤ i ≤ n + 1.
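The spacings map in part 2 and the representation (6.3) invert each other, which is easy to check mechanically (simulated sample, arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
y_sorted = np.sort(rng.exponential(size=n))        # order statistics Y_(1..n)
gaps = np.diff(np.concatenate([[0.0], y_sorted]))  # Y_(i) - Y_(i-1), Y_(0) = 0
z = (n - np.arange(n)) * gaps                      # Z_i = (n-i+1)(Y_(i)-Y_(i-1))
reconstructed = np.cumsum(z / (n - np.arange(n)))  # formula (6.3)
print(np.allclose(reconstructed, y_sorted))        # True
```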

Proof.

1. We prove (ii) only. f_Y(y) = e^{−y} I{y > 0}. Let w = y², so y = √w and dy/dw = 1/(2√w). Hence,

    f_W(w) = f_Y(y)|dy/dw| = e^{−√w}/(2√w) = (1/2) w^{−1/2} e^{−w^{1/2}},   w > 0.

(Recall that Weibull(α, β) has density f(w) = αβ w^{β−1} e^{−αw^β}, where α > 0, β > 0,
w > 0.)
2. The joint pdf of (Y_(1), ..., Y_(n)) is

    f_{Y_(1),...,Y_(n)}(y₁, ..., yn) = n! f_Y(y₁)...f_Y(yn) = n! e^{−y₁}...e^{−yn},   where y₁ < ... < yn.

Let

    Z₁ = nY_(1),
    Z₂ = (n−1)[Y_(2) − Y_(1)],
    Z₃ = (n−2)[Y_(3) − Y_(2)],
    ... ......                                      (6.4)
    Zn = 1 × [Y_(n) − Y_(n−1)],

whose Jacobian is |dz/dy| = n!. Therefore, since z₁ + ... + zn = y₁ + ... + yn,

    f_{Z₁,...,Zn}(z₁, ..., zn) = f_{Y_(1),...,Y_(n)}(y₁, ..., yn) |dy/dz| = n! e^{−y₁−...−yn} (1/n!) = e^{−z₁−...−zn} = e^{−z₁}...e^{−zn},

which implies that Z₁, ..., Zn are i.i.d. Exp(1) r.v.'s.

The second part follows from inverting the transformation (6.4):

    Y_(1) = Z₁/n,
    Y_(2) = Z₁/n + Z₂/(n−1),
    ... ......
    Y_(n) = Z₁/n + Z₂/(n−1) + ...... + Zn/1.

3. The joint pdf of (U_(k), U_(k+1)) is

    f_{U_(k),U_(k+1)}(u, v) = ( n!/((k−1)!(n−k−1)!) ) u^{k−1} [1 − v]^{n−k−1},   0 ≤ u < v ≤ 1.

Let

    X = U_(k),   Y = n[U_(k+1) − U_(k)],

so

    U_(k) = X,   U_(k+1) = X + Y/n.

Therefore, the Jacobian being |dU/dY| = 1/n,

    f_{(X,Y)}(x, y) = f_{(U_(k),U_(k+1))}(u, v) |dU/dY|
                   = ( (n−1)!/((k−1)!(n−k−1)!) ) x^{k−1} [1 − y/n − x]^{n−k−1},   where 0 ≤ x ≤ 1 − y/n.

(Note that 0 ≤ u < v ≤ 1 ⟺ 0 ≤ x < x + y/n ≤ 1 ⟺ 0 ≤ x ≤ 1 − y/n.) Thus,

    f_Y(y) = ( (n−1)!/((k−1)!(n−k−1)!) ) ∫₀^{1−y/n} x^{k−1} [1 − y/n − x]^{n−k−1} dx
           = (1 − y/n)^{n−1} ( (n−1)!/((k−1)!(n−k−1)!) ) ∫₀¹ t^{k−1} (1 − t)^{n−k−1} dt,   where x = t(1 − y/n),
           = (1 − y/n)^{n−1}
           −→ e^{−y},   y > 0.

10.7 Application of jackknife in a real example

In the study of income shares, sometimes we need to estimate the poverty line (or low-
income cutoff point). Consider the model

    log Zi = γ₁ + γ₂ log Yi + Σ_{t=1}^m βt x_{it} + εi,   i = 1, ..., n,

where

    Zi = expenditure on "necessities",
    Yi = total income,
    x_{it} = urbanization category, family size, etc.,
    εi = independent errors (also assumed to be independent of all other covariates).

Let γ₀ be the overall proportion of income spent on "necessities". Then the poverty line
θ can be defined to be the solution of

    log[(γ₀ + 0.2)θ] = γ₁ + γ₂ log θ + Σ_{t=1}^m βt x_{0t}

for a particular set x₀₁, ..., x₀m. Our purpose is to estimate θ and construct a (1 − α)
confidence interval for θ.

Solution. First of all, it is reasonable to estimate γ₀ by its sample counterpart:

    γ̂₀ = Σ_{i=1}^n Zi / Σ_{i=1}^n Yi = Z̄/Ȳ.

On the other hand, γ₁, γ₂, and the βt's can be estimated by the LSE, denoted by γ̂₁, γ̂₂, β̂t.
As a consequence, θ can be estimated (implicitly) from

    log[(γ̂₀ + 0.2)θ̂] = γ̂₁ + γ̂₂ log θ̂ + Σ_{t=1}^m β̂t x_{0t},

which can be solved numerically by standard software.

Note that
    θ̂ = g( γ̂₀, γ̂₁, γ̂₂, β̂₁, ..., β̂m ) := g(X̄),

where
    Xi = (Zi, Yi, log Zi, log Yi, x_{i1}, ..., x_{im}, all other cross-product terms).

Therefore, we can express θ̂ as a very smooth function of multivariate sample means
of independent (not necessarily identically distributed) r.v.'s. Hence, θ̂ is asymptotically
normal. We need to get a variance estimate.

If one uses the traditional variance estimator, we have

    vn = ∇g(X̄)τ V̂ar(X̄) ∇g(X̄).

Since g is not explicit, it is very tedious to calculate ∇g(X̄) in this case. On the other
hand, the jackknife provides an easy way to get a consistent variance estimator:

• Step I: Run OLS.

• Step II: Apply the Newton algorithm to solve for θ̂(i).

• Step III: Use the jackknife estimate of variance.

10.8 A quick summary

1. The jackknife replaces theoretical derivations by numerical computations. Asymptotically,
it is equivalent to the traditional approach.

2. The jackknife variance estimator v̂_Jack is consistent for many types of statistics
under mild conditions, such as sample means, functions of sample means, U-, V-,
L-statistics, M-estimates, etc.

3. If the delete-1 jackknife fails, one can try the delete-d jackknife, or the bootstrap.

4. The jackknife can easily be extended to the multivariate case.

10.9 Exercises

1. Let X₁, . . . , Xn be i.i.d. with variance σ². Take θ̂ = (1/n) Σ_{i=1}^n (Xi − X̄)² as an
estimator of σ².
(a) Show that θ̂ has bias equal to −σ²/n.
(b) Find θ̂_J, the jackknife estimator of σ².
(c) Find v̂_J(θ̂), the jackknife estimator of Var(θ̂).

2. Suppose X₁, . . . , Xn are i.i.d. with mean µ. We take θ̂ = (X̄)³ as an estimator of µ³.
(a) Find θ̂_J, the jackknife estimator of µ³.
(b) Show that θ̂ = (X̄)³ is a V-statistic of order 3 with the kernel h(x, y, z) = xyz.
(c) Derive the U-statistic of order 3 with the kernel h(x, y, z) = xyz.
(d) From (a)-(c), check whether jackknifing the V-statistic in (b) produces the U-statistic in (c).

3. (Use S-Plus or otherwise to do this question.) The following data (Xi , Yi ) are English
scores of 6 people before and after attending an intensive study program.

(80, 90), (85, 78), (65, 75), (88, 92), (77, 86), (95,90).

(a) Calculate θ̂₁ = X̄/Ȳ and θ̂₂ = ρ̂ (the sample correlation coefficient).

(b) Find b̂ias(θ̂), θ̂_J and v̂_J(θ̂) for θ̂ = θ̂₁ and θ̂₂. Hence find 95% confidence
intervals for θ₁ = µ₁/µ₂ and θ₂ = ρ.
(c) Assume the simple linear regression model Yi = β₀ + β₁Xi + εi. The parameter
of interest here is θ₀ = −β₀/β₁. Take θ̂₀ = −β̂₀/β̂₁, where (β̂₀, β̂₁) is the LSE of (β₀, β₁).
Use the jackknife to find the variance estimator of θ̂₀ and hence find a 95% confidence
interval for θ₀.
(d) [Do this after the chapter on the bootstrap.] Repeat these calculations by
using the bootstrap method and compare the results of the two methods.
4. Show that for linear statistics, the jackknife and bootstrap estimates of bias are 0.
(Note that θ = T(F) = ∫ ψ dF is linear, with Tn = T(Fn) the corresponding linear
statistic.)
5. (a) Given n distinct data items, show that the probability that a given data item
does not appear in a bootstrap resample is en = (1 − 1/n)n → e−1 ' 0.368 as
n → ∞.
(b) Show that the probability that each of B bootstrap samples contains an item i
is (1 − en )B .
6. Let X₁, ..., Xn ∼ i.i.d. from F. Let θ = T(F) and Tn = T(Fn). Suppose that

    √n (Tn − θ) −→d N(0, σF²),

where σ_{Tn}² = E_F ψ_F²(X) = E IC(X; F, Tn)². An estimate of σ_{Tn}² can be given by

    σ̂_{Tn}² = (1/(n−1)) Σ_{i=1}^n IC(Xi; Fn, Tn)².

The factor n − 1 (instead of n) was substituted to preserve equivalence with the
classical formula for the sample mean. Check if this variance estimate is consistent
in the case of the sample median, Tn = Fn⁻¹(1/2).
7. Let θ = Var(X), and Tn = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)². Show that

    V̂ar_Jack = ( n²/((n−1)(n−2)²) ) ( µ̂₄ − µ̂₂² ).

Recall that the delta method gives V̂ar_delta = (1/n)(µ̂₄ − µ̂₂²).
8. Let X₁, ..., Xn ∼ i.i.d. from F. Let θ = µ² and Tn = (X̄)². Let µ̂_j be the j-th sample
central moment. Show that

(i) the jackknife variance estimator is

    V̂ar_Jack(Tn) = 4X̄²µ̂₂/(n−1) − 4X̄µ̂₃/(n−1)² + (µ̂₄ − µ̂₂²)/(n−1)³;

(ii) the bootstrap variance estimator is

    V̂ar_Boot(Tn) = 4X̄²µ̂₂/n + 4X̄µ̂₃/n² + ( µ̂₄ + (2n−3)µ̂₂² )/n³.

Remark: The asymptotic variance of Tn is 4µ²µ₂/n, whose plug-in estimate is 4X̄²µ̂₂/n.
Therefore, all three approaches are asymptotically equivalent in the sense that the
ratios go to 1 a.s.

9. Let X1, ..., Xn be i.i.d. from F. Let θ = T(F) and Tn = T(Fn). Suppose that

√n (Tn − θ) −→d N(0, σ²_F),

where σ²_Tn = E_F ψ²_F(X) = E IC(X, F, Tn)². An estimate of σ²_Tn can be given by

σ̂²_Tn = (1/n) Σ^n_{i=1} IC(Xi, Fn, Tn)².

Check whether this variance estimate is consistent in the case of the sample median,
Tn = Fn^{-1}(1/2).

10. Let g(x, t) be continuous at t0 uniformly in x. Let F be a distribution function for
which E_F|g(X, t0)| < ∞. Let {Xi} be i.i.d. with d.f. F and suppose that Tn −→p t0.
Then,

(1/n) Σ^n_{i=1} g(Xi, Tn) −→p E_F g(X, t0).

Further, the convergence in probability can be replaced by a.s. convergence throughout.

Chapter 10

Bootstrap

The bootstrap was first introduced by Efron (1979). It is a powerful nonparametric technique, like the jackknife; in a sense, the jackknife can be considered a first-order approximation to the bootstrap.

10.1 The bootstrap (or plug-in, or substitution method)

Suppose

• X1 , . . . , Xn ∼iid F (typically unknown),

• θ ≡ θ(F ) = T (F ) is the parameter of interest,

• An estimator of θ is
Tn ≡ Tn (X1 , . . . , Xn ) ≡ Tn (Fn ).

Again, we are interested in

• the bias of Tn: Bias(Tn) = E(Tn) − θ = E_F(T(X1, . . . , Xn)) − T(F).

• the variance of Tn: Var(Tn) = E(Tn − E(Tn))² = E_F(T(X1, . . . , Xn) − E_F T(X1, . . . , Xn))².

• the sampling distribution of

H(x) = P( (Tn − θ)/√Var(Tn) ≤ x ), or K(x) = P( (Tn − θ)/√V̂ar(Tn) ≤ x ).

Let X1*, ..., Xn* be i.i.d. from Fn. Then, the bootstrap approximations of the above quantities are

(1) B̂ias_Boot(Tn) = E_{Fn}[T(X1*, ..., Xn*) − T(Fn)].

(2) V̂ar_Boot(Tn) = E_{Fn}(T(X1*, ..., Xn*) − E_{Fn}T(X1*, ..., Xn*))².

(3) Ĥ_Boot(x) = P_{Fn}( (θ(Fn*) − θ(Fn))/σn(Fn) ≤ x ), and K̂_Boot(x) = P_{Fn}( (θ(Fn*) − θ(Fn))/σn(Fn*) ≤ x ).

Example. Let θ = μ² and Tn = (X̄)². Find the bootstrap estimate of the bias.

Solution. Let X1*, · · · , Xn* be i.i.d. from Fn. Then

E_{Fn}(X̄*)² = V_{Fn}(X̄*) + (E_{Fn}X̄*)²
= V_{Fn}(X1*)/n + (E_{Fn}X1*)²
= (1/n) · (1/n) Σ^n_{i=1} (Xi − X̄)² + X̄²
= Sn²/n + X̄².

Hence, the bootstrap bias estimate is

B̂ias_Boot[(X̄)²] = E_{Fn}(X̄*)² − X̄² = Sn²/n.

Recall that the true bias is

Bias[(X̄)²] = E_F(X̄)² − μ² = σ²/n + μ² − μ² = σ²/n.
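The closed form B̂ias_Boot = Sn²/n above can be verified against the definition by brute force: for a small sample one can enumerate all n^n equally likely resamples. A minimal sketch (the helper name and the data are my own):

```python
from itertools import product

def boot_bias_exact(xs):
    """Exact bootstrap bias estimate of Tn = (x̄)² by enumerating all n^n
    equally likely resamples drawn from the empirical distribution Fn."""
    n = len(xs)
    xbar = sum(xs) / n
    second_moment = 0.0
    for resample in product(xs, repeat=n):     # all n^n resamples
        m = sum(resample) / n
        second_moment += m * m
    second_moment /= n ** n                    # E_{Fn} (x̄*)²
    return second_moment - xbar * xbar

xs = [1.0, 2.0, 4.0, 7.0, 9.0]
n = len(xs)
xbar = sum(xs) / n
Sn2 = sum((x - xbar) ** 2 for x in xs) / n     # Sn² with divisor n
# boot_bias_exact(xs) agrees with the closed form Sn²/n from the example
```

For n = 5 this is only 5⁵ = 3125 resamples, so the enumeration is instant; the agreement with Sn²/n is exact up to floating-point error.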

Definition. Let η = E_F T(X1, . . . , Xn, F), where X1, . . . , Xn are i.i.d. from F. Its bootstrap approximation is

η̂_Boot = E_{Fn} T(X1*, ..., Xn*, Fn), where X1*, ..., Xn* are i.i.d. from Fn.

That is, the generalized functional η is approximated by the generalized statistical functional η̂_Boot.

10.2 Monte Carlo simulations

Generally speaking, there is no closed form expression for the bootstrap estimate η̂Boot .
On the other hand, for each k, we note
1
P (Xk∗ = Xj ) = , j = 1, · · · , n.
n
Hence, one can find the exact value η̂_Boot by exhausting all possible bootstrap samples (n^n altogether) and calculating

η̂_Boot = (1/n^n) Σ^{n^n}_{k=1} T(X1*(k), ..., Xn*(k), Fn).

If T is symmetric in X1, . . . , Xn, then the total number of distinct resamples is C(2n−1, n) (why?). If n = 10, then n^n = 10,000,000,000, a huge number indeed, while C(19, 10) = 92,378. Therefore, the exact calculation is not practical.

One alternative is to resort to Monte Carlo simulations:

1. Draw B independent random samples X*_k = {X1*(k), ..., Xn*(k)} from Fn, k = 1, ..., B.
(This can be done by drawing n values from {X1, . . . , Xn} with replacement.)

2. The Monte Carlo approximation to η̂Boot is
B
1 X
η̂M C = T (X1∗ (k), ..., Xn∗ (k), Fn ).
B k=1

By the SLLN, we have η̂M C → η̂Boot a.s. as B → ∞.
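Steps 1–2 can be sketched as follows (a minimal sketch; the function name and sample data are my own, and `stat` stands for any symmetric statistic T):

```python
import random

def bootstrap_mc(xs, stat, B=2000, seed=0):
    """Steps 1-2: draw B resamples from Fn (i.e. from the data with
    replacement) and return the B values T(X1*(k), ..., Xn*(k))."""
    rng = random.Random(seed)
    n = len(xs)
    return [stat([rng.choice(xs) for _ in range(n)]) for _ in range(B)]

xs = [1.2, 0.7, 3.1, 2.4, 1.9, 0.3, 2.8, 1.1]
reps = bootstrap_mc(xs, stat=lambda s: sum(s) / len(s))
eta_mc = sum(reps) / len(reps)                         # η̂_MC for η = E_{Fn} x̄*
var_mc = sum((r - eta_mc) ** 2 for r in reps) / len(reps)
# sanity check: for the mean, E_{Fn} x̄* = x̄ and Var_{Fn}(x̄*) = Sn²/n
```

Averaging any function of the replicates then approximates the corresponding bootstrap expectation; by the SLLN the approximation improves as B grows.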

10.3 Application of the bootstrap to quantile

Closed expressions

Example. Let X1, . . . , Xn be i.i.d. r.v.'s from d.f. F and let Fn be its e.d.f. Let

ξp = F^{-1}(p) = inf{t : F(t) ≥ p},
ξ̂p = Fn^{-1}(p) = inf{t : nFn(t) ≥ np} := X(r),

where r = ⌈np⌉ is the smallest integer ≥ np (namely [np] if np is an integer, and [np] + 1 otherwise). Find the bootstrap estimates of

(1) the mean μξ = E_F(ξ̂p), and the bias biasξ = E_F(ξ̂p − ξp);

(2) the variance σξ² = Var_F(ξ̂p), and the mean square error MSEξ = E_F(ξ̂p − ξp)²;

(3) the d.f.'s of the standardized and studentized sample quantiles

H(t) = P_F( √n(ξ̂p − ξp)/σξ ≤ t ), K(t) = P_F( √n(ξ̂p − ξp)/σ̂ξ ≤ t ),

where σξ² = σξ²(F) = Var_F(ξ̂p) and σ̂ξ² is a consistent estimate of σξ².

Solution. Let F(r)(x) = P(X(r) ≤ x). Then

dF(r)(x) = [n!/((r−1)! 1! (n−r)!)] F^{r−1}(x)[1 − F(x)]^{n−r} dF(x)
= r C(n, r) F^{r−1}(x)[1 − F(x)]^{n−r} dF(x).

Note that X(r) =d F^{-1}(U(r)), where U(r) is the rth order statistic from U(0, 1), with pdf given by

dG(r)(u) = r C(n, r) u^{r−1}[1 − u]^{n−r} du, 0 < u < 1.
Therefore,

E_F(ξ̂p − c)^m = E[X(r) − c]^m = ∫ (x − c)^m dF(r)(x)
= ∫^∞_{−∞} (x − c)^m r C(n, r) F^{r−1}(x)[1 − F(x)]^{n−r} dF(x)
= r C(n, r) ∫¹₀ [F^{-1}(u) − c]^m u^{r−1}[1 − u]^{n−r} du
= E[F^{-1}(U(r)) − c]^m.

For the bootstrap,

ξ̂p* = Fn*^{-1}(p) = inf{t : nFn*(t) ≥ np} := X*(r).

Therefore, the bootstrap estimate of E_F(ξ̂p − c)^m is given by

E_{Fn}(ξ̂p* − c)^m = E[X*(r) − c]^m
= r C(n, r) ∫¹₀ [Fn^{-1}(u) − c]^m u^{r−1}[1 − u]^{n−r} du
= r C(n, r) Σ^n_{k=1} ∫_{((k−1)/n, k/n]} [Fn^{-1}(u) − c]^m u^{r−1}[1 − u]^{n−r} du
= r C(n, r) Σ^n_{k=1} ∫_{((k−1)/n, k/n]} [X(k) − c]^m u^{r−1}[1 − u]^{n−r} du
= Σ^n_{k=1} w_nk [X(k) − c]^m,

where

w_nk = r C(n, r) ∫_{((k−1)/n, k/n]} u^{r−1}[1 − u]^{n−r} du, k = 1, ..., n.

(1) The bootstrap estimates of the mean and bias are

μ̂ξ = E_{Fn}(ξ̂p*) = Σ^n_{k=1} w_nk X(k),
b̂iasξ = B̂ias_Boot(ξ̂p) = E_{Fn}(ξ̂p* − X(r)) = Σ^n_{k=1} w_nk (X(k) − X(r)).

(2) The bootstrap estimates of the variance and MSE are

σ̂ξ² = v̂ar_Boot(ξ̂p) = Σ^n_{k=1} w_nk X(k)² − (Σ^n_{k=1} w_nk X(k))²,
M̂SE_Boot(ξ̂p) = E_{Fn}(ξ̂p* − ξ̂p)² = Σ^n_{k=1} w_nk [X(k) − X(r)]².
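The weights w_nk are exactly computable: the integrand r C(n,r) u^{r−1}(1−u)^{n−r} is the Beta(r, n−r+1) density, whose CDF at integer parameters equals a binomial tail probability. A sketch (function names and data are my own):

```python
from math import ceil, comb

def binom_tail(n, r, x):
    """I_x(r, n-r+1) = P(Bin(n, x) >= r): the regularized incomplete beta
    function at integer parameters, via the binomial tail identity."""
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(r, n + 1))

def quantile_weights(n, p):
    """Exact bootstrap weights w_nk for the sample p-quantile X_(r), r = ceil(np).

    The integrand r*C(n,r)*u^{r-1}*(1-u)^{n-r} is the Beta(r, n-r+1) density,
    so each integral over ((k-1)/n, k/n] is a difference of two Beta CDF values.
    """
    r = ceil(n * p)
    return [binom_tail(n, r, k / n) - binom_tail(n, r, (k - 1) / n)
            for k in range(1, n + 1)]

xs = sorted([3.1, 0.4, 1.7, 2.2, 5.0, 2.9, 0.9, 4.1, 1.2, 3.6])
n, p = len(xs), 0.5
w = quantile_weights(n, p)                           # w_n1, ..., w_nn
mean_boot = sum(wk * xk for wk, xk in zip(w, xs))    # E_{Fn} ξ̂_p*
var_boot = sum(wk * xk**2 for wk, xk in zip(w, xs)) - mean_boot**2
```

The weights sum to 1 (the differences telescope), so the bootstrap mean and variance of the sample quantile come out as weighted averages of the order statistics, as derived above.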

(3) First we note that

H(t) = P(ξ̂p ≤ s) = P(X(r) ≤ s), where s = ξp + tσξ/√n,
= P({ # of Xi ≤ s } ≥ r)
= Σ^n_{k=r} C(n, k) F^k(s)[1 − F(s)]^{n−k}.

Thus, its bootstrap approximation is given by

Ĥ_Boot(t) = P_{Fn}( √n(ξ̂p* − ξ̂p)/σξ(Fn) ≤ t ) = Σ^n_{k=r} C(n, k) Fn^k(ŝ)[1 − Fn(ŝ)]^{n−k},

where ŝ = ξ̂p + tσξ(Fn)/√n.
Comparisons with other methods

(a) The kernel estimate p(1 − p)/f̂²(ξ̂p), where f̂(x) = (1/(nh)) Σ^n_{i=1} K((x − Xi)/h).

(b) The bootstrap variance estimate σ̂²_{ξ,boot} = σξ²(Fn) = Σ^n_{k=1} w_nk X(k)² − (Σ^n_{k=1} w_nk X(k))².

(c) Recall that the delete-1 jackknife variance estimate is inconsistent.

For illustration, we take the bootstrap studentized approximation given by

K̂_Boot(t) = P_{Fn}( √n(ξ̂p* − ξ̂p)/σξ(Fn*) ≤ t ),

where

σξ²(Fn*) = Σ^n_{k=1} w*_nk X*(k)² − (Σ^n_{k=1} w*_nk X*(k))².

Note that there is no closed-form expression for K̂_Boot(t); Monte Carlo simulations have to be carried out here.

Remark 10.1 For more materials see Hall (1992), page 319.

Remark 10.2 Bootstrap approximations of the mean, bias, and variance are all weighted averages of the order statistics. Note that u^{r−1}[1 − u]^{n−r} is maximized at u = (r − 1)/(n − 1) ≈ p. Hence, all the bootstrap estimates place more weight on order statistics near θ̂ = Fn^{-1}(p) = X(r), and less weight on data far away from θ̂, thus producing a consistent estimator. By comparison, the jackknife variance estimator V̂ar_Jack(θ̂) only uses two values near θ̂ to approximate the variance, and fails to be consistent.

10.4 Bootstrap approximations to d.f.’s

Definition (k-th order consistency (or accuracy) of the bootstrap). Let Rn = R(X1, . . . , Xn, F) and Rn* = R(X1*, ..., Xn*, Fn). Let H(x) = P_F(Rn ≤ x) and Ĥ_B(x) = P_{Fn}(Rn* ≤ x). Then Ĥ_B(x) is said to be k-th order strongly or weakly consistent (or accurate) if

sup_x |Ĥ_B(x) − H(x)| = o(n^{-(k−1)/2}) or O(n^{-k/2}), a.s. or in probability.

In particular,

(a) Ĥ_B(x) is (first-order) consistent if sup_x |Ĥ_B(x) − H(x)| = o(1) or O(n^{-1/2}).

(b) Ĥ_B(x) is second-order accurate if sup_x |Ĥ_B(x) − H(x)| = o(n^{-1/2}) or O(n^{-1}).

(c) Ĥ_B(x) is third-order accurate if sup_x |Ĥ_B(x) − H(x)| = o(n^{-1}) or O(n^{-3/2}).

10.4.1 First-order accurate or consistent

Under certain regularity conditions, the bootstrap provides first-order accurate approxi-
mation in a number of familiar situations:

1. smooth functions of means

2. U -statistics

3. L-statistics (including quantiles)

4. symmetric statistics

Smooth functions of means

Theorem 10.1 Assume that σ² ∈ (0, ∞). Denote σ̂² = n^{-1} Σ^n_{i=1} (Xi − X̄)². Let Rn be

(a) √n(X̄ − μ)/σ, (b) √n(X̄ − μ), (c) √n(X̄ − μ)/σ̂.

Then Ĥ_B(x) is strongly consistent for H(x) in all three cases.

We need the following results in the proof.

Lemma 10.1 (Marcinkiewicz SLLN) If Y1, ..., Yn are i.i.d. r.v.'s with E|Y1|^δ < ∞ for some δ ∈ (0, 1), then

(1/n^{1/δ}) Σ^n_{i=1} |Yi| →a.s. 0.

Lemma 10.2 (Berry–Esseen bounds) If X, X1, ..., Xn are i.i.d. r.v.'s with E|X|³ < ∞, and σ̂² = n^{-1} Σ^n_{i=1} (Xi − X̄)², then

sup_t |P( √n(X̄ − μ)/(σ or σ̂) ≤ t ) − Φ(t)| ≤ (A/√n) · E|X|³/σ³.

Proof of Theorem 10.1.

(a) Let us look at the first case. By the CLT,

sup_x |H(x) − Φ(x)| → 0.

By the Berry–Esseen bounds, applied conditionally on the data (with X1* in the role of X and Fn in that of F, and taking μ = 0 without loss of generality),

sup_x |Ĥ_B(x) − Φ(x)| ≤ (A/√n) · E_{Fn}|X1*|³/(E_{Fn}|X1*|²)^{3/2} = A Σ^n_{i=1} |Xi|³/(n^{3/2} σ̂³) = (A/σ̂³) · (1/n^{1/δ}) Σ^n_{i=1} Yi,

where δ = 2/3 and Yi = |Xi|³, and clearly E|Yi|^δ = EXi² < ∞. Note that σ̂³ → σ³ a.s., and by the Marcinkiewicz SLLN, (1/n^{1/δ}) Σ Yi →a.s. 0; hence sup_x |Ĥ_B(x) − Φ(x)| →a.s. 0. Combining the two displays via the triangle inequality completes (a).

(b) By the CLT,

sup_x |H(x) − Φ(x/σ)| → 0.

By the Berry–Esseen bounds,

sup_x |Ĥ_B(x) − Φ(x/σ̂)| ≤ (A/√n) · E_{Fn}|X1*|³/(E_{Fn}|X1*|²)^{3/2} = A Σ^n_{i=1} |Xi|³/(n^{3/2} σ̂³) →a.s. 0,

similarly to the arguments in (a). Finally, the proof follows from

sup_x |Φ(x/σ̂) − Φ(x/σ)| →a.s. 0.

(c) The proof is very similar to that of (a), and hence omitted.

10.5 Bootstrap variance approximations for functions of means

Theorem 10.2 Let X, X1, . . . , Xn be i.i.d. r.v.'s with μ = EX and E‖X1‖² < ∞. Let θ = g(μ) and θ̂ = g(X̄). Assume that

(i) g is continuously differentiable in a neighborhood of μ and ∇g(μ) ≠ 0;

(ii) there exists a sequence of positive numbers {τn} such that lim inf_n τn > 0, τn = O(e^{n^q}) for some q ∈ (0, 1/2), and

max_{i1,...,in} |θ̂(X_{i1}, ..., X_{in}) − θ̂| / τn −→a.s. 0,

where the "max" is over all n-tuples of integers from {1, ..., n}.

Then,

V̂ar_Boot(θ̂)/vn := V̂ar_{Fn}(θ̂*)/(n^{-1} ∇g(μ)ᵀ Var(X) ∇g(μ)) −→a.s. 1.

We need to do some preparations for the main proof.

Definition: A sequence {Xn} is uniformly integrable (u.i.) if

lim_{C→∞} sup_n E[|Xn| I(|Xn| > C)] = 0.

Lemma 10.3

(a) If sup_n E|Xn|^{1+δ} < ∞ for some δ > 0, then {Xn} is u.i.

(b) Suppose Xn →d X. Then lim_n E|Xn|^r = E|X|^r ⇐⇒ {|Xn|^r} is u.i.

Lemma 10.4 (Bernstein inequality) Let Y, Y1, ..., Yn be i.i.d. r.v.'s such that P(|Y − EY| < C) = 1 for some C < ∞. Then for t > 0,

P(|Ȳ − EY| > t) ≤ 2 exp{ −(nt²/2)/(Var(Y) + Ct/3) }.
Lemma 10.5 For any r.v. Z ≥ 0, we have

Σ^∞_{i=1} P(Z ≥ i) ≤ EZ ≤ Σ^∞_{i=1} P(Z ≥ i) + 1.

Lemma 10.6 If E|X|^α < ∞ and X(1) ≤ ... ≤ X(n) are the order statistics, then

(|X(1)| + |X(n)|)/n^{1/α} −→a.s. 0.

Proof. Exercise.

Remark 10.3 For more details, see

• Singh, K. (1981). On the Asymptotic Accuracy of Efron's Bootstrap. Ann. Statist. 9(6), 1187-1195.

• Bickel, P.J. and Freedman, D.A. (1981). Some Asymptotic Theory for the Bootstrap. Ann. Statist. 9(6), 1196-1217.

10.6 Bootstrap confidence intervals

Some basic concepts

Definition: A confidence interval (C.I.), or more generally a confidence region (C.R.), for θ, Î(α), with nominal (or desired) level 1 − α, is consistent if

P(θ ∈ Î(α)) = (1 − α) + o(1), as n → ∞.

Furthermore, Î(α) is said to be kth-order accurate (asymptotically) if

P(θ ∈ Î(α)) = (1 − α) + O(n^{-k/2}), for k ≥ 1, as n → ∞.

(1) There are some close relationships between confidence (upper or lower) limits and coverage probability. Suppose that θ̂L(α) is an approximate (1 − α)-level confidence limit and

θ̂L(α) − θ̂_Exact(α) = Op(n^{-(k+1)/2}).

Then,

P(θ ∈ [θ̂L(α), ∞)) = P(θ ≥ θ̂L(α))
= P(θ ≥ θ̂_Exact(α) + Op(n^{-(k+1)/2}))
= P(θ̂_Exact(α) ≤ θ + Op(n^{-(k+1)/2}))
= P(θ̂ − (v̂^{1/2}/√n) K^{-1}(1 − α) ≤ θ + Op(n^{-(k+1)/2}))
= P(√n(θ̂ − θ)/v̂^{1/2} ≤ K^{-1}(1 − α) + Op(n^{-k/2}))
= (1 − α) + O(n^{-k/2}),

[assuming that the Op(n^{-k/2}) term makes a contribution of O(n^{-k/2}) to the coverage].

For instance, if we need to show that θ̂L(α) is first or second order accurate, then it suffices to show that

θ̂L(α) − θ̂_Exact(α) = Op(n^{-1}), or Op(n^{-3/2}), respectively.

(2) Some alternative expressions of the confidence limits: Let

H(x) = P( √n(θ̂ − θ)/v^{1/2} ≤ x ), K(x) = P( √n(θ̂ − θ)/v̂^{1/2} ≤ x ),

and their bootstrap approximations are given by

Ĥ(x) = P*( √n(θ̂* − θ̂)/v̂^{1/2} ≤ x ), K̂(x) = P*( √n(θ̂* − θ̂)/v̂*^{1/2} ≤ x ),

where P* denotes probability conditional on the data X1, . . . , Xn.

Normal and Bootstrap approximation to distributions

Theorem 10.3 Let g : R^d → R¹ with ∇g(x) continuous at μ = EX, ∇g(μ) ≠ 0, and E‖X‖² < ∞ (X a vector). Let Rn = √n(g(X̄) − g(μ)) and Rn* = √n(g(X̄*) − g(X̄)). Then

sup_x |P(Rn ≤ x) − P*(Rn* ≤ x)| −→a.s. 0. (6.1)

Outline of proof:

sup_x |P(Rn ≤ x) − Φ(x/σ)| → 0, σ² = ∇g(μ)ᵀ Σ ∇g(μ),

sup_x |P*(Rn* ≤ x) − Φ(x/σ̂)| ≤ A E*|X1* − X̄|³/(√n σ̂³), σ̂² = ∇g(X̄)ᵀ Σ̂ ∇g(X̄).

We thus have two approximations to P(Rn ≤ x):

1. Normal approximation: Φ(x/σ);

2. Bootstrap approximation: P*(Rn* ≤ x).

It turns out that the bootstrap approximation is better than the normal approximation.

Bootstrap confidence regions

Notations:
X1, · · · , Xn ∼ F, Xi ∈ R^d;
θ = θ(F) = T(F): parameter;
θ̂ = θ(Fn) = T(Fn): estimate;
v: the asymptotic variance of √n(θ̂ − θ);
v̂: estimate of v.

1-sided upper (1 − α) confidence interval:

Î_{1,U} = (−∞, θ̂U(α)] if P(θ ∈ Î_{1,U}) = 1 − α. (6.2)
1-sided lower (1 − α) confidence interval:

Î_{1,L} = [θ̂L(α), ∞) if P(θ ∈ Î_{1,L}) = 1 − α. (6.3)

2-sided (1 − α) confidence interval:

Î_2 = [θ̂L(α), θ̂U(α)] if P(θ ∈ Î_2) = 1 − α. (6.4)

Example 10.1 X1, · · · , Xn ∼ N(μ, σ²).

1-sided upper: (−∞, X̄ + t_{n−1}(α) σ̂/√n] (6.5)

1-sided lower: [X̄ − t_{n−1}(α) σ̂/√n, ∞) (6.6)

2-sided: X̄ ± t_{n−1}(α/2) σ̂/√n (6.7)

Define

K(t) = P( √n(θ̂ − θ)/v̂^{1/2} ≤ t ),
K̂_B(t) = P*( √n(θ̂* − θ̂)/v̂*^{1/2} ≤ t ).

The "exact" 1-sided upper confidence interval is

(−∞, θ̂ + K^{-1}(1 − α) v̂^{1/2}/√n]. [not usable if the distribution K is unknown] (6.8)

Normal approximation:

(−∞, θ̂ + Φ^{-1}(1 − α) v̂^{1/2}/√n]. (6.9)

1. Bootstrap-t (percentile-t): the 1-sided upper C.I. (approximating (6.8)) is

(−∞, θ̂ + K̂_B^{-1}(1 − α) v̂^{1/2}/√n]. (6.10)
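A Monte Carlo sketch of (6.10) for θ = mean, so that v̂ is the usual sample variance (the function name is mine; the data are the waiting-time data from Exercise 4 below):

```python
import random, statistics

def boot_t_upper(xs, alpha=0.05, B=2000, seed=1):
    """Percentile-t 1-sided upper limit θ̂ + K̂_B^{-1}(1-α)·v̂^{1/2}/√n
    for the mean, with v̂ the sample variance."""
    rng = random.Random(seed)
    n = len(xs)
    m, v = statistics.fmean(xs), statistics.variance(xs)
    roots = []
    for _ in range(B):
        bs = [rng.choice(xs) for _ in range(n)]
        mb, vb = statistics.fmean(bs), statistics.variance(bs)
        roots.append(n ** 0.5 * (mb - m) / vb ** 0.5)  # √n(θ̂* − θ̂)/v̂*^{1/2}
    roots.sort()
    k_inv = roots[int((1 - alpha) * B) - 1]            # K̂_B^{-1}(1 − α)
    return m + k_inv * (v / n) ** 0.5

xs = [12.1, 9.8, 10.5, 5.6, 8.2, 0.5, 6.8, 10.1, 17.2, 4.2]
upper = boot_t_upper(xs)     # approximate 95% upper confidence limit
```

Because each resample is studentized by its own v̂*, the quantile K̂_B^{-1} adapts to skewness in the data, which is the source of the method's second-order accuracy discussed later.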

2. Percentile method:

Ĝ_Per(x) = P*_{Fn}(θ̂* ≤ x), (6.11)

θ̂_{U,Percentile}(α) = Ĝ^{-1}_Per(1 − α),
θ̂_{L,Percentile}(α) = Ĝ^{-1}_Per(α),

where θ = θ(F), θ̂ = θ(Fn) (original sample), θ̂* = θ(Fn*) (bootstrap sample). Problem: the percentile limits involve no centering at θ̂.

3. Hybrid method:

Ĝ_Hybrid(x) = P*_{Fn}(θ̂* − θ̂ ≤ x), (6.12)

and, since G_H(x) = P(θ̂ − θ ≤ x) = P(θ ≥ θ̂ − x), (6.13)

θ̂_{U,Hybrid}(α) = θ̂ − Ĝ^{-1}_Hybrid(α),
θ̂_{L,Hybrid}(α) = θ̂ − Ĝ^{-1}_Hybrid(1 − α).
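The percentile (6.11) and hybrid (6.12) limits can be sketched together: both are read off the sorted bootstrap replicates θ̂*, and the two-sided hybrid interval works out to [2θ̂ − Ĝ^{-1}(1−α/2), 2θ̂ − Ĝ^{-1}(α/2)] with Ĝ the percentile CDF. A hedged sketch (names and data are mine):

```python
import random, statistics

def boot_sorted(xs, stat, B=2000, seed=2):
    """Sorted bootstrap replicates θ̂*(1) <= ... <= θ̂*(B)."""
    rng = random.Random(seed)
    n = len(xs)
    return sorted(stat([rng.choice(xs) for _ in range(n)]) for _ in range(B))

def percentile_ci(xs, stat, alpha=0.10, B=2000):
    """Two-sided percentile interval [Ĝ^{-1}(α/2), Ĝ^{-1}(1-α/2)]."""
    th = boot_sorted(xs, stat, B)
    return th[int(alpha / 2 * B)], th[int((1 - alpha / 2) * B) - 1]

def hybrid_ci(xs, stat, alpha=0.10, B=2000):
    """Two-sided hybrid interval: θ̂_L = θ̂ - Ĝ_Hybrid^{-1}(1-α/2), etc.,
    i.e. [2θ̂ - Ĝ^{-1}(1-α/2), 2θ̂ - Ĝ^{-1}(α/2)]."""
    th = boot_sorted(xs, stat, B)
    theta = stat(xs)
    lo, hi = th[int(alpha / 2 * B)], th[int((1 - alpha / 2) * B) - 1]
    return 2 * theta - hi, 2 * theta - lo

xs = [12.1, 9.8, 10.5, 5.6, 8.2, 0.5, 6.8, 10.1, 17.2, 4.2]
p_ci = percentile_ci(xs, statistics.fmean)   # percentile 90% CI for the mean
h_ci = hybrid_ci(xs, statistics.fmean)       # hybrid 90% CI for the mean
```

For a symmetric bootstrap distribution the two intervals nearly coincide; they differ when θ̂* is skewed, which is exactly when neither is second-order accurate.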

4. Bias Correction (BC) method:

θ̂_{BC,L}(α) = Ĝ^{-1}_Per(Φ(Zα + 2Ẑ0)), Zα = Φ^{-1}(α),
θ̂_{BC,U}(α) = Ĝ^{-1}_Per(Φ(Z_{1−α} + 2Ẑ0)), Z_{1−α} = Φ^{-1}(1 − α),

where Ẑ0 = Φ^{-1}(Ĝ_Per(θ̂)). Note that

P*(X̄* ≤ X̄) ≈ P(X̄ ≤ μ) ≈ 1/2, (6.14)

so Ẑ0 ≈ 0 in the symmetric case, and BC then reduces to the percentile method.

5. Accelerated Bias Correction (ABC) method:

θ̂_{ABC,L}(α) = Ĝ^{-1}_Per( Φ( Ẑ0 + (Φ^{-1}(α) + Ẑ0)/(1 − a(Φ^{-1}(α) + Ẑ0)) ) ),

where the acceleration constant

a = (1/6) Σ^n_{i=1} Di³ / (Σ^n_{i=1} Di²)^{3/2} (a 3rd-moment, i.e. skewness, correction),

Di = lim_{ε→0} [θ((1 − ε)Fn + εδ_{Xi}) − θ(Fn)]/ε (the empirical influence function).

Lemma 10.7 (Riemann–Lebesgue lemma) If g ∈ L¹ (g is Lebesgue-integrable), then

∫ e^{itx} g(x) dx → 0 as t → ∞. (6.15)

Sketch: for a step function g = Σ^M_{i=1} ci I_{(ai, bi)},

∫ cos(tx) g(x) dx = Σ^M_{i=1} ci ∫^{bi}_{ai} cos(tx) dx = Σ^M_{i=1} ci [sin(tx)/t]^{bi}_{ai} → 0 as t → ∞;

a general g ∈ L¹ (in particular continuous g) is approximated by step functions.

Accuracy of bootstrap confidence intervals

Let

H(t) = P( √n(θ̂ − θ)/V^{1/2} ≤ t ) (standardized),
K(t) = P( √n(θ̂ − θ)/V̂^{1/2} ≤ t ) (studentized).
The bootstrap approximations are

Ĥ(t) = P*( √n(θ̂* − θ̂)/V̂^{1/2} ≤ t ),
K̂(t) = P*( √n(θ̂* − θ̂)/V̂*^{1/2} ≤ t ).

Definition. The exact confidence limits are

θ̂_{L,Exact}(α) = θ̂ − (V̂^{1/2}/√n) κ^{-1}(1 − α),
θ̂_{U,Exact}(α) = θ̂ − (V̂^{1/2}/√n) κ^{-1}(α).

Indeed,

P(θ ≥ θ̂_{L,Exact}) = P( θ ≥ θ̂ − (V̂^{1/2}/√n) κ^{-1}(1 − α) ) = P( √n(θ̂ − θ)/V̂^{1/2} ≤ κ^{-1}(1 − α) ) = 1 − α, (6.16)

if we know κ (the distribution K).

10.6.1 Relationship between confidence limit (C.L.) and coverage probability

If

θ̂L(α) = θ̂_{L,Exact} + Op(n^{-(k+1)/2}) (stochastic expansion)
= θ̂_{L,Exact} + δ̂n,

then typically (this is not a proof),

P(θ ∈ [θ̂L, ∞)) = P(θ̂L ≤ θ) = P(θ̂_{L,Exact} + δ̂n ≤ θ)
= P( θ̂ − (V̂^{1/2}/√n) κ^{-1}(1 − α) + δ̂n ≤ θ )
= P( √n(θ̂ − θ)/V̂^{1/2} + √n δ̂n/V̂^{1/2} ≤ κ^{-1}(1 − α) )
= P( √n(θ̂ − θ)/V̂^{1/2} + Δn ≤ κ^{-1}(1 − α) ), with Δn := √n δ̂n/V̂^{1/2} = Op(n^{-k/2}),
= (1 − α) + O(n^{-k/2}). (not always true)

In particular, k = 1 gives first order accuracy, and k = 2 gives second order accuracy.

Lemma 10.8

sup_t |P(Xn + Yn ≤ t) − Φ(t)| ≤ sup_t |P(Xn ≤ t) − Φ(t)| + O(an) + P(|Yn| > an), for any an > 0. (6.17)

Proof. For the upper bound,

P(Xn + Yn ≤ t) − Φ(t) = P(Xn + Yn ≤ t, |Yn| ≤ an) + P(Xn + Yn ≤ t, |Yn| > an) − Φ(t)
≤ [P(Xn ≤ t + an) − Φ(t + an)] + P(|Yn| > an) + [Φ(t + an) − Φ(t)]
≤ sup_t |P(Xn ≤ t) − Φ(t)| + P(|Yn| > an) + φ(t′) an, for some t < t′ < t + an,

where φ(t′) an = O(an) since φ is bounded (cf. Edgeworth expansions). For the lower bound, write

1 − P(Xn + Yn ≤ t) = P(Xn + Yn > t, |Yn| > an) + P(Xn + Yn > t, |Yn| ≤ an) ≤ P(|Yn| > an) + P(Xn > t − an),

so that P(Xn + Yn ≤ t) − Φ(t) ≥ [P(Xn ≤ t − an) − Φ(t − an)] − P(|Yn| > an) − [Φ(t) − Φ(t − an)], and the rest is the same.

Taking an = c n^{-k/2}, it remains to check

P(|Δn| > c n^{-k/2}) = O(n^{-k/2}) (use Chebyshev's bound, etc.), (6.18)

which is what is needed to show the accuracy of the confidence limit.

Cornish–Fisher expansions. From the Edgeworth expansion

α = H(x) = Φ(x) + (1/√n) P1(x) φ(x) + (1/n) P2(x) φ(x) + · · ·, (6.19)

inverting gives

x ≡ H^{-1}(α) = Zα + (1/√n) P11(Zα) + · · ·, Zα = Φ^{-1}(α). (6.20)

Indeed, writing xα = Zα + Wn,

α = H(Zα + Wn) = Φ(Zα + Wn) + (1/√n) P1(Zα + Wn) φ(Zα + Wn) + · · ·
= α + Wn φ(Zα) + (1/√n) P1(Zα) φ(Zα) + · · ·,

and iterating gives Wn = −(1/√n) P1(Zα) + · · ·, i.e. P11(Zα) = −P1(Zα). (For the studentized distribution K, the same expansion holds with polynomials q1, q11 in place of P1, P11.)

1. Percentile-t method:

θ̂_{L,Per-t}(α) = θ̂ − (V̂^{1/2}/√n)[κ̂^{-1}(1 − α) − κ^{-1}(1 − α) + κ^{-1}(1 − α)]
= θ̂_{L,Exact} − (V̂^{1/2}/√n)[κ̂^{-1}(1 − α) − κ^{-1}(1 − α)].

Now

κ^{-1}(1 − α) = Z_{1−α} + (1/√n) q11(Z_{1−α}) + · · ·,
κ̂^{-1}(1 − α) = Z_{1−α} + (1/√n) q̂11(Z_{1−α}) + · · ·,
⇒ κ̂^{-1}(1 − α) − κ^{-1}(1 − α) = (1/√n)(q̂11 − q11)(Z_{1−α}) + · · · = Op(1/n),

since q̂11 − q11 = Op(n^{-1/2}). Hence θ̂_{L,Per-t} − θ̂_{L,Exact} = Op(n^{-3/2}).

Percentile-t: 2nd order accurate.

2. Normal approximation:

θ̂_{L,Nor}(α) = θ̂ − (V̂^{1/2}/√n)[Φ^{-1}(1 − α) − κ^{-1}(1 − α) + κ^{-1}(1 − α)]
= θ̂_{L,Exact} − (V̂^{1/2}/√n)[Φ^{-1}(1 − α) − κ^{-1}(1 − α)].

Now

κ^{-1}(1 − α) = Φ^{-1}(1 − α) + (1/√n) q11(Z_{1−α}) + · · ·,
⇒ Φ^{-1}(1 − α) − κ^{-1}(1 − α) = O(1/√n), so θ̂_{L,Nor} − θ̂_{L,Exact} = Op(1/n).

Normal approximation: 1st order accurate.

3. Percentile method:

θ̂_{L,Per}(α) = Ĝ^{-1}(α). (6.21)

Here

G(x) = P(θ̂ ≤ x) = P( √n(θ̂ − θ)/V^{1/2} ≤ √n(x − θ)/V^{1/2} ) = H( √n(x − θ)/V^{1/2} ),

so setting G(x) = α gives

√n(G^{-1}(α) − θ)/V^{1/2} = H^{-1}(α), (6.22)

i.e. G^{-1}(α) = θ + (V^{1/2}/√n) H^{-1}(α). Hence

Ĝ^{-1}(α) = θ̂ + (V̂^{1/2}/√n)[Ĥ^{-1}(α) + κ^{-1}(1 − α) − κ^{-1}(1 − α)]
= θ̂_{L,Exact} + (V̂^{1/2}/√n)[Ĥ^{-1}(α) + κ^{-1}(1 − α)],

and the bracketed term is typically not cancelled:

Ĥ^{-1}(α) = Zα + c1/√n, κ^{-1}(1 − α) = Z_{1−α} + c2/√n = −Zα + c2/√n, (6.23)

so the error is of order (V̂^{1/2}/√n) · Op(n^{-1/2}) = Op(1/n).

Percentile method: 1st order accurate.

4. Hybrid method (replace α by 1 − α):

θ̂_{L,Hyb}(α) = θ̂ − Ĝ^{-1}(1 − α), (6.24)

where now G(x) = P(θ̂ − θ ≤ x), so that √n G^{-1}(α)/V^{1/2} = H^{-1}(α). (6.25)

Thus

θ̂_{L,Hyb}(α) = θ̂ − (V̂^{1/2}/√n) Ĥ^{-1}(1 − α), Ĥ^{-1}(1 − α) = Z_{1−α} + c1/√n. (6.26)

Hybrid method: 1st order accurate.

5. Bias correction (exact only in special cases, e.g. a transformation location model). Recall

θ̂_{L,BC}(α) = Ĝ^{-1}_Per(Φ(Zα + 2Ẑ0)), Ẑ0 = Φ^{-1}(Ĝ(θ̂)). (6.27)

Now

Ĝ(θ̂) = Ĥ(0) = Φ(0) + (1/√n) P̂1(0) φ(0) + · · · = 1/2 + (1/√n) P̂1(0) φ(0) + · · · =: 1/2 + Wn,

so

Ẑ0 = Φ^{-1}(1/2 + Wn) ≈ Wn/φ(0) = (1/√n) P1(0) + · · ·,

and

Φ(Zα + 2P1(0)/√n + · · ·) = α + φ(Zα) · 2P1(0)/√n + · · ·.

Using Ĝ^{-1}(α) = θ̂ + (V̂^{1/2}/√n) Ĥ^{-1}(α),

θ̂_{L,BC}(α) = θ̂ + (V̂^{1/2}/√n)[Ĥ^{-1}(α + φ(Zα) · 2P1(0)/√n) + κ^{-1}(1 − α) − κ^{-1}(1 − α)]
= θ̂_{L,Exact} + (V̂^{1/2}/√n)[Ĥ^{-1}(α + φ(Zα) · 2P1(0)/√n) + κ^{-1}(1 − α)].

By a Cornish–Fisher expansion (the correction depends on the skewness γ),

Ĥ^{-1}(α + φ(Zα) · 2P1(0)/√n) = Zα + · · ·, while κ^{-1}(1 − α) = Z_{1−α} + c1/√n (Taylor expansion),

and the O(n^{-1/2}) terms typically do not cancel.

Bias correction method: 1st order accurate. [It can be improved by calibration (choosing α̂ → α).]
6. ABC:

Ĝ^{-1}( Φ( Ẑ0 + (Φ^{-1}(α) + Ẑ0)/(1 − a(Φ^{-1}(α) + Ẑ0)) ) ). (6.28)

Problem: Ẑ0 is unknown, and a depends on the 3rd moment; both must be estimated.

Edgeworth expansion:

1 − α = K(t) = Φ(t) + (1/√n) P̂1(t) φ(t) + · · ·. (6.29)

The empirical Edgeworth expansion is 2nd order accurate. It may not be better than the bootstrap, since the tail approximation may not be very accurate.

10.7 Exercises

1. Consider an artificial data set consisting of the 10 numbers
1, 2, 3.5, 4, 7, 7.3, 8.6, 12.4, 13.8, 18.1.
Define the α-trimmed mean by θ̂ = (1/(n − 2[nα])) Σ^{n−[nα]}_{i=[nα]+1} X_{n,i}, where X_{n,1} ≤ · · · ≤ X_{n,n} are the order statistics.

(a) Take α = 25%, calculate the bootstrap variance estimate v̂B (θ̂) by
Monte-Carlo simulations for B = 20, 100, 200, 500, 1000, 2000. Do they be-
come more stable as B gets larger?
(b) Find the jackknife variance estimate v̂J (θ̂).

2. (Generation of bivariate normal random variables.) Suppose that we would like


to generate a bivariate normal random vector (X, Y )T with mean (µx , µy )T and
covariance matrix !
σx2 ρσx σy
Σ= .
ρσx σy σy2
The following procedure could be used.

Step 1. Generate R1 and R2, where Ri ∼ N(0, 1) for i = 1, 2, and R1 and R2 are independent.

Step 2. Calculate

X = μx + σx R1, Y = μy + (σy/√(1 + c²))(R1 + cR2),

where c = √(ρ^{-2} − 1) (assuming ρ > 0).

Show that (X, Y)ᵀ has the required bivariate normal distribution.
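The construction is easy to check numerically: with 1 + c² = ρ^{-2}, the correlation of (R1, (R1 + cR2)/√(1 + c²)) is 1/√(1 + c²) = ρ. A sketch (helper names and the parameter values are mine; ρ > 0 is assumed so c is real):

```python
import random

def bivariate_normal(mu_x, mu_y, sd_x, sd_y, rho, size, seed=0):
    """Steps 1-2 of the exercise: build (X, Y) from independent N(0,1)'s."""
    rng = random.Random(seed)
    c = (rho ** -2 - 1) ** 0.5             # c = sqrt(rho^{-2} - 1), rho > 0
    pairs = []
    for _ in range(size):
        r1, r2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mu_x + sd_x * r1
        y = mu_y + sd_y * (r1 + c * r2) / (1 + c * c) ** 0.5
        pairs.append((x, y))
    return pairs

def sample_corr(pairs):
    """Plain sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / (sxx * syy) ** 0.5

pairs = bivariate_normal(1.0, -2.0, 1.0, 2.0, rho=0.6, size=20000)
# the sample correlation should be close to rho = 0.6
```

This checks the correlation only; verifying joint normality analytically is the point of the exercise.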

3. Show that for function of means, one-sided (lower) confidence interval by the hybrid
method is only first-order accurate.

4. The following are the numbers of minutes that a doctor kept 10 patients waiting
beyond their appointment times,

12.1, 9.8, 10.5, 5.6, 8.2, 0.5, 6.8, 10.1, 17.2, 4.2.

find a two-sided 90% confidence interval for the mean waiting time by the normal
approximation, percentile-t, percentile, ABC method.

5. The following data are the number of hours 10 students studied for a certain achieve-
ment test and their scores on the test.
Number of hours (x): 8, 6, 10, 9, 14, 6, 18, 10, 15, 24,
Score on the test (Y): 80, 62, 91, 77, 89, 65, 96, 85, 94, 91.

(i) Plot Y against x.


(ii) Fit a simple linear regression.
(iii) Find a two-sided 90% confidence interval for the slope parameter by
the normal approximation method and hybrid method.

6. "Are statistical tables obsolete?" In this exercise, we use Monte Carlo simulations to obtain quantiles from any given distribution, illustrated with the standard normal distribution Φ(x). Suppose we wish to find the 90% and 95% quantiles, i.e. q1 = Φ^{-1}(0.90) and q2 = Φ^{-1}(0.95).

(1) Generate B random numbers, X1, · · · , XB, from the standard normal distribution, where B can be arbitrarily big, say 10,000.

(2) Plot the histogram for the generated data. Is it bell-shaped?

(3) Rank the random numbers in increasing order, say, X_{B,1}, · · · , X_{B,B}. The Monte Carlo approximations for q1 and q2 are

q1^{MC} = X_{B,[B×0.90]}, q2^{MC} = X_{B,[B×0.95]}.

(4) Compare and comment on these Monte Carlo approximations and the exact values.

(5) Let X ∼ N(0, 1). Suppose that we are interested in calculating P(X ≥ 2), μ = EX and σ² = var(X). How do we use Monte Carlo simulations to approximate these? Compare your approximations with the exact ones.
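Steps (1) and (3) can be sketched in a few lines (the function name is mine; the reference values Φ^{-1}(0.90) ≈ 1.2816 and Φ^{-1}(0.95) ≈ 1.6449 are the standard table entries):

```python
import random

def mc_quantiles(ps, B=200_000, seed=0):
    """Generate B standard normal draws, sort them, and read off the
    [B*p]-th order statistics as Monte Carlo quantile approximations."""
    rng = random.Random(seed)
    draws = sorted(rng.gauss(0, 1) for _ in range(B))
    return [draws[int(B * p)] for p in ps]

q1_mc, q2_mc = mc_quantiles([0.90, 0.95])
# compare with the exact values 1.2816 and 1.6449
```

With B = 200,000 the sampling error of these order statistics is of order 10^{-3}, so both approximations agree with the table values to about two decimal places.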

Chapter 11

Bayes Rules and Empirical Bayes

11.1 Introduction

It is impossible to find estimators which minimize the risk R(θ, δ) uniformly at every value
of θ (see the example below) unless we restrict ourselves to a smaller class. We shall now
drop such restrictions, admitting all estimators into competition, but shall then have to
be satisfied with a weaker optimality property than uniformly minimum risk. We shall
look for estimators that make the risk function R(θ, δ) small in some overall sense. Two
such optimality properties will be considered:

• Average Risk Optimality (minimizing the weighted average risk);

• Minimax Optimality (minimizing the maximum risk)

We consider the first approach now while the second will be studied in the next chapter.

An illustrative example

Suppose we are interested in the average income μ of an area A. If we have no data for the area, a natural guess would be μ0, the average income from the last national census; whereas if we have a random sample X1, X2, ..., Xn from the area A, we may want to combine μ0 and X̄ = n^{-1} Σ^n_{i=1} Xi into an estimator, for instance,

μ̂ = (0.2)μ0 + (0.8)X̄.

For comparison, we calculate MSE(μ̂) = E(μ̂ − μ)² = Var(μ̂) + bias²(μ̂). Hence,

R(μ, μ0) = MSE(μ0) = (μ0 − μ)²,
R(μ, μ̂) = MSE(μ̂) = (0.04)(μ0 − μ)² + (0.64)σ²/n,
R(μ, X̄) = MSE(X̄) = σ²/n.

We can plot these risks as functions of µ.

Here are some observations:

1. Since µ is unknown, none of the estimators can be proclaimed as the best.

2. We can place a prior µ ∼ N (µ0 + δ, A).

(a) If δ ≈ 0 and A ≈ 0, M SE(µ0 ) will be the smallest.


(b) If µ is close to µ0 , then R(µ, µ̂) < R(µ, X̄) with the minimum relative risk

inf{R(µ, µ̂)/R(µ, X̄); µ ∈ R} = 0.64, when µ = µ0

3. If we use as our criteria the maximum (over µ) of the MSE (called the minimax
criteria), then X̄ is optimal (shown later).

11.2 Bayes estimator

Assume that

X|θ ∼ f (x|θ),
θ ∼ π.

Let δ = δ(X) be an estimator of g(θ). (Sometimes we use d ≈ δ (decision), or a (action)).


Define

• Loss function: l(θ, δ). Examples include

(a) L2 -Loss: l(θ, δ) = (δ(x) − g(θ))2


More generally, weighted L2 -Loss: l(θ, δ) = w(θ)(δ(x) − g(θ))2
(b) L1 -Loss: l(θ, δ) = |δ(x) − g(θ)|
(c) 0 − 1 Loss: l(θ, δ) = I{δ 6= g(θ)}

• Risk (or Frequentist risk, or a priori risk)


Z
R(θ, δ) ≡ E[l(θ, δ)|θ] ≡ Eθ l(θ, δ) = l(θ, δ(x))f (x|θ)dx

• Posterior risk:
Z
r(δ|x) ≡ E[l(θ, δ(x))|X = x] = l(θ, δ(x))f (θ|x)dθ

• Bayes risk (or average risk w.r.t. θ):


Z Z
r(π, δ) ≡ El(θ, δ) = Er(δ|X) = ER(θ, δ) = l(θ, δ(x))f (θ, x)dθdx.

Definition 11.1 An estimator δ* is called a Bayes estimator w.r.t. π if, for every estimator δ,

r(π, δ*) ≡ El(θ, δ*) ≤ El(θ, δ) ≡ r(π, δ).

Bayes estimators play a central role in Wald’s decision theory. One of its main results
is that in any given statistical problem, attention can be restricted to Bayes solutions and
suitable limits of Bayes solutions; given any other procedure δ, there exists a procedure δ 0
in this class such that R(θ, δ 0 ) ≤ R(θ, δ) for all θ. (In view of this result, it is not surprising
that Bayes estimators provide a tool for solving minimax problems, as will be seen in the
next chapter.)

Finding a Bayes estimator is, in principle, quite simple.

Theorem 11.1 If δ ∗ (x) minimizes the posterior risk

r(δ|x) ≡ E{l[θ, δ(x)]|X = x}, (a.s.),

then δ ∗ (X) is a Bayes estimator.

Proof. E{l[θ, δ(x)]|X = x} ≥ E{l[θ, δ ∗ (x)]|X = x} a.e. Then take expectation.

Corollary 11.1 For L2 -Loss: l(θ, δ) = (δ − g(θ))2 , then


R
∗ g(θ)f (x|θ)π(θ)dθ
δ (x) = E[g(θ)|x] = R .
f (x|θ)π(θ)dθ

More generally, for weighted L2 -Loss: l(θ, δ) = w(θ)(δ − g(θ))2 , then


R
∗ E[w(θ)g(θ)|x] w(θ)g(θ)f (x|θ)π(θ)dθ
δ (x) = = R
E[w(θ)|x] w(θ)f (x|θ)π(θ)dθ

Proof. For weighted L2 -loss,

r(δ|x) = E[w(θ)(δ − g(θ))2 |x] = δ 2 E[w(θ)|x] − 2δE[w(θ)g(θ)|x] + E[w(θ)g 2 (θ)|x].

Setting r0 (δ|x) = 0, we get the desired result.

Determination of Bayes estimates may be daunting; the use of conjugate priors can facilitate the calculations and reduce the computational complexity of δ*(x). Conjugate priors are natural parametric families of priors such that the posterior distributions also belong to the same family.

Examples

Example 11.1 (Normal mean with known variance)

Suppose

Xi |µ ∼ N (µ, σ̃02 ), i = 1, ..., n


µ ∼ N (M, A), (conjugate prior)

Denote x = (x1, ..., xn)ᵀ. The posterior pdf is

f(μ|x) = f(x, μ)/f(x) ∝ f(x, μ)
= (2πσ̃0²)^{-n/2} exp{−(1/(2σ̃0²)) Σ (xi − μ)²} × (2πA)^{-1/2} exp{−(1/(2A)) (μ − M)²}
∝ exp{−(n/(2σ̃0²)) (−2x̄μ + μ²)} · exp{−(1/(2A)) (μ² − 2Mμ)}
∝ exp{ −(1/2)(n/σ̃0² + 1/A) [μ² − 2μ (nx̄/σ̃0² + M/A)/(n/σ̃0² + 1/A)] }
∝ exp{ −(1/2)(1/σ0² + 1/A) [μ − (x̄/σ0² + M/A)/(1/σ0² + 1/A)]² }
∼ N(E[μ|x], var[μ|x]),

where σ0² = σ̃0²/n, and

E[μ|x] = (x̄/σ0² + M/A)/(1/σ0² + 1/A) = (Ax̄ + σ0²M)/(A + σ0²),
var[μ|x] = 1/(1/σ0² + 1/A) = Aσ0²/(A + σ0²).

Under L2-loss, the Bayes estimator is

δ* = E[μ|x] = (A/(A + σ0²)) x̄ + (σ0²/(A + σ0²)) M.

Remark 11.1

1. δ* = wx̄ + (1 − w)M, a weighted average of the standard estimator X̄ and the mean M of the prior distribution.

2. As n → ∞ with M and A fixed, δ* ≈ X̄ →p μ. Hence, the choice of prior is of little import when n is large.

3. When n is fixed, the choice of prior is important. For example,

(a) As A → 0, δ* →p M. (Very informative prior.)

(b) As A → ∞, δ* ≈ X̄. (Non-informative (flat) prior.)

4. δ* is a shrinkage estimator; e.g., if M = 0, then δ* = (A/(A + σ0²)) x̄ shrinks x̄ toward 0.

Remark 11.2 Inference can be based on the sufficient statistic z = x̄ only. That is,

z|μ ∼ N(μ, σ0²),
μ ∼ N(M, A) (conjugate prior).

It can be easily seen that, under L2-loss, the Bayes estimator is

δ* = E[μ|z] = (A/(A + σ0²)) z + (σ0²/(A + σ0²)) M.
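The conjugate update above is a two-line computation. A minimal sketch (the function name and data are mine):

```python
def normal_mean_posterior(xs, sigma2, M, A):
    """Posterior of μ for X_i | μ ~ N(μ, sigma2) with prior μ ~ N(M, A).

    With σ0² = σ²/n, returns
    E[μ|x] = (A·x̄ + σ0²·M)/(A + σ0²) and Var[μ|x] = A·σ0²/(A + σ0²).
    """
    n = len(xs)
    xbar = sum(xs) / n
    s0 = sigma2 / n                         # σ0² = σ̃0²/n
    post_mean = (A * xbar + s0 * M) / (A + s0)
    post_var = A * s0 / (A + s0)
    return post_mean, post_var

xs = [2.1, 1.4, 3.3, 2.6, 1.9]
m, v = normal_mean_posterior(xs, sigma2=1.0, M=0.0, A=4.0)
# m is the Bayes estimator under L2 loss: x̄ shrunk toward the prior mean M
```

Note how the posterior mean interpolates between x̄ and M with weight A/(A + σ0²), and the posterior variance is always smaller than both A and σ0².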

Example 11.2 (Normal variance with known mean)

Suppose

Xi |σ 2 ∼ N (0, σ 2 ),
1
τ ≡ 2 ∼ Gamma(α, β) (conjugate prior)
σ
where π(τ) = (β^α/Γ(α)) τ^{α−1} e^{−βτ}, with

Eτ = α/β, Eτ² = α(α + 1)/β²,
Eτ^{-1} = Eσ² = β/(α − 1), Eτ^{-2} = Eσ⁴ = β²/((α − 1)(α − 2)).
So the posterior distribution is

f(τ|x) ∝ f(x|τ) π(τ)
= (1/(2πσ²))^{n/2} exp(−Σ^n_{i=1} xi²/(2σ²)) × (β^α/Γ(α)) τ^{α−1} e^{−βτ}
∝ τ^{n/2} exp(−(τ/2) Σ^n_{i=1} xi²) τ^{α−1} e^{−βτ}
∝ τ^{n/2+α−1} exp{−τ((1/2) Σ^n_{i=1} xi² + β)}
∼ Gamma(n/2 + α, (1/2) Σ^n_{i=1} xi² + β).

Under L2-loss, the Bayes estimator of σ² = 1/τ is

δ* = E[τ^{-1}|x] = ((1/2) Σ^n_{i=1} xi² + β)/(n/2 + α − 1) = (Σ^n_{i=1} xi² + 2β)/(n + 2α − 2)
= (n/(n + 2α − 2)) (Σ^n_{i=1} xi²/n) + ((2α − 2)/(n + 2α − 2)) (β/(α − 1))
=: w Sn² + (1 − w) Eσ²,

which is a weighted average of the sample variance Sn² = Σ^n_{i=1} xi²/n and the prior variance Eσ² = β/(α − 1).

Example 11.3 (Normal variance with unknown mean)

Suppose

• Xi |(µ, σ 2 ) ∼iid N (µ, σ 2 )

• π(µ) = 1, or µ ∼ N (M, ∞), (improper uniform prior)

• τ≡ 1
σ2
∼ Gamma(α, β), i.e., π(τ ) ∝ τ a−1 e−bτ , (conjugate prior)

• µ⊥τ

Now

f(x|μ, τ) = (1/(2πσ²))^{n/2} exp(−Σ^n_{i=1} (xi − μ)²/(2σ²)) ∝ τ^{n/2} exp(−(τ/2) Σ^n_{i=1} (xi − μ)²).

So the posterior distribution is

f(μ, τ|x) ∝ f(x|μ, τ) π(μ) π(τ)
∝ τ^{n/2+α−1} exp( −(τ/2) [Σ^n_{i=1} (xi − x̄)² + n(x̄ − μ)² + 2β] ).

By integrating out μ, we get

f(τ|x) = ∫ f(μ, τ|x) dμ
∝ τ^{n/2+α−3/2} exp( −(τ/2) [Σ^n_{i=1} (xi − x̄)² + 2β] ) × ∫ τ^{1/2} exp(−τn(μ − x̄)²/2) dμ
∝ τ^{n/2+α−3/2} exp( −(τ/2) [Σ^n_{i=1} (xi − x̄)² + 2β] ) (the remaining integral is free of τ)
∼ Gamma( n/2 + α − 1/2, (1/2) Σ^n_{i=1} (xi − x̄)² + β ).

Under L2-loss, the Bayes estimator of σ² = 1/τ is

δ* = E(τ^{-1}|x) = ((1/2) Σ^n_{i=1} (xi − x̄)² + β)/(n/2 + α − 3/2) = (Σ^n_{i=1} (xi − x̄)² + 2β)/(n + 2α − 3)
= ((n − 1)/(n + 2α − 3)) (Σ^n_{i=1} (xi − x̄)²/(n − 1)) + ((2α − 2)/(n + 2α − 3)) (β/(α − 1))
=: w S² + (1 − w) Eσ²,

which is a weighted average of the sample variance S² = Σ^n_{i=1} (xi − x̄)²/(n − 1) and the prior variance Eσ² = β/(α − 1).

Example 11.4 (Binomial mean)

Suppose

X|p ∼ Binomial(n, p),


p ∼ Beta(a, b), (conjugate prior)

The joint pdf is

f(x, p) = C(n, x) [Γ(a + b)/(Γ(a)Γ(b))] p^{x+a−1} (1 − p)^{n−x+b−1},

and the posterior pdf is

f(p|x) = f(x, p)/f(x) ∝ f(x, p)
∝ p^{x+a−1} (1 − p)^{n−x+b−1}
∼ Beta(x + a, n − x + b).

Recall, for p ∼ Beta(a, b), we have

E(p) = a/(a + b) and var(p) = ab/((a + b)²(a + b + 1)).

So the Bayes estimator under L2-loss is

δ*(x) = E[p|x] = (a + x)/(a + b + n) = ((a + b)/(a + b + n)) (a/(a + b)) + (n/(a + b + n)) (x/n),

which is a weighted average of the standard estimate X/n and the prior mean a/(a + b).
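The update and the weighted-average identity in a few lines (a sketch; the names and parameter values are mine):

```python
def beta_binomial_posterior(x, n, a, b):
    """Posterior of p for X | p ~ Binomial(n, p) with prior p ~ Beta(a, b):
    p | x ~ Beta(x + a, n - x + b)."""
    a_post, b_post = x + a, n - x + b
    post_mean = a_post / (a_post + b_post)   # Bayes estimator under L2 loss
    return a_post, b_post, post_mean

a, b, n, x = 2, 2, 10, 7
_, _, pm = beta_binomial_posterior(x, n, a, b)
w = n / (a + b + n)
# pm equals w·(x/n) + (1-w)·a/(a+b): the weighted average in the text
```

Here a + b acts as a prior "sample size": the larger it is relative to n, the more the estimate is pulled toward the prior mean a/(a + b).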

Example 11.5 (Poisson)

Suppose

Xi |θ ∼ P oisson(θ), ,
θ ∼ Gamma(α, β), (conjugate prior)

The posterior distribution is


β
 
f (θ|x) = Gamma α + x̄,
1+β

Note that EX = var(X) = θ. Although L0 (θ, δ) = (θ − δ)2 is often preferred for the
estimation of a mean, some type of scaled squared error loss, e.g., Lk (θ, δ) = (θ − δ)2 /θk ,
may be more appropriate for the estimation of a variance. The Bayes estimator under
Lk -loss is (by taking w(θ) = θ−k in Corollary 11.1)

δk*(x) = E(θ^{1−k}|x)/E(θ^{-k}|x) = (x̄ + α − k) β/(β + 1)
= (β/(β + 1)) x̄ + (1/(β + 1)) (α − k)β, for α − k > 0,

which is a weighted average of sample mean x̄ and (α − k)β (the prior mean is αβ).

Note that the choice of loss function can have a large effect on the resulting Bayes
estimator.

Example 11.6 (Regression)

Consider generalized least squares regression

Y = Xβ + ,  ∼ N (0, V ).

1. The MLE of β is β̂ = (X T V −1 X)−1 X T V −1 Y := AY.


(In particular, if V = σ 2 I, then β̂ = (X T X)−1 X T Y .)

2. Assume we have prior β ∼ N (βP , P ), then

β̂Bayes = (X T V −1 X + P −1 )−1 (X T V −1 Y + P −1 βP ).

Remarks:

• This is simply a ridge estimator.


• If $V = \sigma^2 I$, $P = I$, and $\beta_P = 0$, we have $\hat\beta_{Bayes} = (X^T X + \sigma^2 I)^{-1} X^T Y$.

3. β̂Bayes satisfies β̂Bayes = βP +K(Y −XβP ), where K = (X T V −1 X +P −1 )−1 X T V −1 .

Proof. The likelihood of $\beta$ is
\[
l(\beta) = f(\varepsilon) = (2\pi)^{-n/2}|V|^{-1/2}\exp\{-\tfrac12 (Y-X\beta)^T V^{-1}(Y-X\beta)\}
\propto \exp\{-g(\beta)\},
\]
where $g(\beta) = \tfrac12 (Y-X\beta)^T V^{-1}(Y-X\beta)$.

1. The MLE of $\beta$ is
\[
\hat\beta = \arg\max_\beta l(\beta) = \arg\min_\beta g(\beta).
\]
That is, $\hat\beta$ satisfies $g'(\beta) = -X^T V^{-1}(Y - X\beta) = -X^T V^{-1} Y + X^T V^{-1} X\beta = 0$, resulting in
\[
\hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} Y := AY.
\]
It is easy to see
\[
E\hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} X\beta = \beta, \qquad
\mathrm{Var}(\hat\beta) = A\,\mathrm{Var}(Y)A^T = (X^T V^{-1} X)^{-1}.
\]

2. The posterior of $\beta$ is proportional to the joint density
\[
f(y,\beta) = f(y\mid\beta)\pi(\beta)
= (2\pi)^{-n/2}|V|^{-1/2}\exp\{-\tfrac12 (Y-X\beta)^T V^{-1}(Y-X\beta)\}
\]
\[
\times\ (2\pi)^{-p/2}|P|^{-1/2}\exp\{-\tfrac12 (\beta-\beta_P)^T P^{-1}(\beta-\beta_P)\}
\propto \exp\{-h(\beta)\},
\]
where $h(\beta) = \tfrac12 (Y-X\beta)^T V^{-1}(Y-X\beta) + \tfrac12 (\beta-\beta_P)^T P^{-1}(\beta-\beta_P)$.

The maximum a posteriori (MAP) estimate is
\[
\hat\beta_{Bayes} = \arg\max_\beta f(\beta\mid y) = \arg\min_\beta h(\beta).
\]

That is, $\hat\beta_{Bayes}$ satisfies
\[
0 = h'(\beta) = -X^T V^{-1}(Y - X\beta) + P^{-1}(\beta - \beta_P)
= (X^T V^{-1} X + P^{-1})\beta - (X^T V^{-1} Y + P^{-1}\beta_P),
\]
resulting in
\[
\hat\beta_{Bayes} = (X^T V^{-1} X + P^{-1})^{-1}(X^T V^{-1} Y + P^{-1}\beta_P).
\]

3. $\hat\beta_{Bayes}$ can be rewritten as
\[
\hat\beta_{Bayes} = (X^T V^{-1} X + P^{-1})^{-1}\big[X^T V^{-1}(Y - X\beta_P) + (X^T V^{-1} X + P^{-1})\beta_P\big]
\]
\[
= \beta_P + (X^T V^{-1} X + P^{-1})^{-1} X^T V^{-1}(Y - X\beta_P)
:= \beta_P + K(Y - X\beta_P).
\]

It is easy to see
\[
E\hat\beta_{Bayes} = \beta_P + K(X\beta - X\beta_P) = (I - KX)\beta_P + KX\beta,
\]
\[
\mathrm{Var}(\hat\beta_{Bayes}) = K\,\mathrm{Var}(Y)K^T = (X^T V^{-1} X + P^{-1})^{-1} X^T V^{-1} X (X^T V^{-1} X + P^{-1})^{-1}.
\]

Clearly, as with other Bayes estimators, $\hat\beta_{Bayes}$ is a linear combination of the prior estimate and the frequentist estimator. It is biased, but has reduced variance.
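The two algebraic forms of $\hat\beta_{Bayes}$ (the ridge-type form and the gain form $\beta_P + K(Y - X\beta_P)$) can be checked numerically. A sketch with simulated data; the design, prior, and noise settings are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, p = 30, 3
X = rng.normal(size=(n_obs, p))
beta_true = np.array([1.0, -2.0, 0.5])
V = np.eye(n_obs)                          # noise covariance
Y = X @ beta_true + rng.normal(size=n_obs)

beta_P = np.zeros(p)                       # prior mean
P = 10.0 * np.eye(p)                       # prior covariance

Vi = np.linalg.inv(V)
Pi = np.linalg.inv(P)

# Posterior mode / Bayes estimator
M = X.T @ Vi @ X + Pi
beta_bayes = np.linalg.solve(M, X.T @ Vi @ Y + Pi @ beta_P)

# Equivalent "gain" form: beta_P + K (Y - X beta_P)
K = np.linalg.solve(M, X.T @ Vi)
beta_gain = beta_P + K @ (Y - X @ beta_P)

assert np.allclose(beta_bayes, beta_gain)  # the two forms agree
```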
11.3 Properties of Bayes Estimators

Bayes estimators: balance between evidence and belief

In many cases, a Bayes estimator is a weighted average of a frequentist estimate $\hat\theta_{freq}$ and a prior estimate $\theta_{prior}$:
\[
\delta^* = w\hat\theta_{freq} + (1-w)\theta_{prior},
\]
where the weight $w$ strikes a balance between the evidence from the data and the prior belief.

1. As $n \to \infty$ with all hyper-parameters fixed, $\delta^*$ becomes essentially $\hat\theta_{freq}$. Hence, the choice of prior is of little import when $n$ is large.

2. When n is fixed, choice of prior is important. For example,

(a) As the prior $\pi(\cdot)$ becomes more concentrated (degenerate), we have $w \to 0$. Then $\delta^* \approx \theta_{prior}$.
(b) As the prior $\pi(\cdot)$ becomes non-informative, we have $w \to 1$. Then $\delta^* \approx \hat\theta_{freq}$.

Bayes estimators can also be regarded as shrinkage estimators. Without a prior, the usual estimator is $\hat\theta_{freq}$. Once we place a prior, the new estimator shrinks toward $\theta_{prior}$. More will be said about this in the discussion of the James-Stein estimator.

Are Bayes estimators unique?

• Bayes estimators may not necessarily be unique in general.

• However, under weak conditions, Bayes estimators are unique, e.g., if the loss function $L(\theta,\delta)$ is squared error loss or, more generally, strictly convex in $\delta$. The proof follows easily from that of Corollary 11.1 and is hence omitted here.

Can unbiased estimators be Bayes estimators w.r.t. a proper prior?

The answer is NO.

Theorem 11.2 No unbiased estimator δ(X) can be a Bayes solution under L2 -loss unless

r(π, δ) ≡ E[δ(X) − g(θ)]2 = 0. (E w.r.t. X and θ.)

Proof. Suppose δ(X) is a Bayes estimator and is unbiased for estimating g(θ). Then, we
have: (i) δ(X) = E[g(θ)|X], a.s. and (ii) E[δ(X)|θ] = g(θ) for all θ. Hence,

E[g(θ)δ(X)] = E{δ(X)E[g(θ)|X]} = E[δ 2 (X)]


E[g(θ)δ(X)] = E{g(θ)E[δ(X)|θ]} = E[g 2 (θ)].

It follows that E[δ(X) − g(θ)]2 = E[δ 2 (X)] + E[g 2 (θ)] − 2E[δ(X)g(θ)] = 0.
Example 11.7 Sample means are NOT Bayes estimators

Let $X_i$, $i = 1, ..., n$, be i.i.d. with $E(X_i) = \theta$ and $\mathrm{var}(X_i) = \sigma^2$ (independent of $\theta$). Then
\[
R(\theta, \bar X) = E(\bar X - \theta)^2 = \sigma^2/n.
\]
For any proper prior,
\[
r(\pi, \bar X) \equiv E\,R(\theta, \bar X) = \sigma^2/n \neq 0.
\]
By the above theorem, $\bar X$ is not a Bayes estimator.

Improper prior

In Example 11.1, $\bar X$ is the limit of the Bayes estimators as $b \to \infty$. As $b \to \infty$, the prior $\pi(\cdot)$ tends to Lebesgue measure, which is an improper, non-informative prior.

Since the Fisher information $I(\theta)$ is constant, this is actually the Jeffreys prior. It is easy to check that the posterior distribution calculated from this improper prior is a proper distribution as soon as one observation has been taken. This is not surprising: since $X$ is normally distributed about $\theta$ with variance 1, even a single observation gives a good idea of the position of $\theta$.

Connection between Bayes estimation, sufficiency, and likelihood

Let $X = (X_1, ..., X_n) \sim f(x\mid\theta)$ with sufficient statistic $T$. Its likelihood function is $L(\theta\mid x) = f(x\mid\theta)$. By the Factorization Theorem, given $T = t$, we have
\[
f(x\mid\theta) = L(\theta\mid x) = g(t\mid\theta)h(x).
\]

For any prior distribution π(θ), the posterior distribution is then

\[
\pi(\theta\mid x) = \frac{f(x\mid\theta)\pi(\theta)}{\int f(x\mid\theta)\pi(\theta)\,d\theta}
= \frac{g(t\mid\theta)\pi(\theta)}{\int g(t\mid\theta)\pi(\theta)\,d\theta} = \pi(\theta\mid t).
\]

That is, π(θ|x) depends on x only through t, and the posterior distribution of θ is the
same whether we compute it on the basis of x or of t. Since Bayesian measures are
computed from posterior distributions, we need only focus on functions of a minimal
sufficient statistic.

Limits of Bayes estimators
11.4 Empirical Bayes

Assume
\[
X_i\mid\theta \sim f(x\mid\theta), \quad i = 1, \dots, n, \qquad \theta\mid\gamma \sim \pi(\theta\mid\gamma),
\]
where $\gamma$ is called the hyper-parameter and needs to be specified.

• One can assign values to γ arbitrarily or based on one’s own judgements.

• Alternatively, it can be estimated by empirical Bayes (EB) estimators.

Procedure

1. First, we compute the marginal pdf of $X$:
\[
m(x\mid\gamma) = \int \prod_{i=1}^n f(x_i\mid\theta)\,\pi(\theta\mid\gamma)\,d\theta.
\]

2. Based on $m(x\mid\gamma)$, we obtain an estimate $\hat\gamma(x)$ of $\gamma$, e.g., by MLE or by the method of moments (MM). We then substitute $\hat\gamma(x)$ for $\gamma$ in $\pi(\theta\mid\gamma)$ and determine the estimator that minimizes the empirical posterior loss
\[
\hat\delta^{EB} = \arg\min_\delta \int L(\theta, \delta(x))\,\pi(\theta\mid x, \hat\gamma(x))\,d\theta, \tag{4.1}
\]
which is the empirical Bayes estimator.

3. An alternative definition is obtained by substituting γ̂(x) for γ in the Bayes es-


timator. Although, mathematically, this is equivalent to the definition given here
(see Problem 6.1), it is statistically more satisfying to define the empirical Bayes
estimator as minimizing the empirical posterior loss (4.1).

Empirical Bayes (EB) estimation is an effective technique of constructing estimators


that perform well under both Bayesian and frequentist criteria. One reason is that EB
estimators tend to be more robust against mis-specification of the prior distribution.
Examples

Example 11.8 (Continuation of Example 11.1, Normal mean with known variance)

We shall base our inference on the sufficient statistic $Z = \bar X$:
\[
Z\mid\mu \sim N(\mu, \sigma_0^2), \qquad \mu \sim N(0, A) \quad\text{(conjugate prior)}.
\]

The Bayes estimator of $\mu$ under $L_2$-loss is
\[
\hat\mu^{Bayes} = \Big(1 - \frac{\sigma_0^2}{A + \sigma_0^2}\Big) z. \tag{4.2}
\]

We now proceed to estimate the hyper-parameter A via EB.

First one can show that
\[
Z\mid A \sim N(0, \sigma_0^2 + A). \tag{4.3}
\]

Proof. We have
\[
f(z\mid A) = \int \frac{1}{(2\pi\sigma_0^2)^{1/2}} \exp\Big\{-\frac{(z-\mu)^2}{2\sigma_0^2}\Big\}\, \frac{1}{(2\pi A)^{1/2}} \exp\Big\{-\frac{\mu^2}{2A}\Big\}\,d\mu.
\]
Completing the square in $\mu$,
\[
\frac{(z-\mu)^2}{\sigma_0^2} + \frac{\mu^2}{A}
= \frac{\sigma_0^2 + A}{A\sigma_0^2}\Big(\mu - \frac{Az}{\sigma_0^2 + A}\Big)^2 + \frac{z^2}{\sigma_0^2 + A},
\]
so the integral over $\mu$ equals $\big(2\pi A\sigma_0^2/(\sigma_0^2+A)\big)^{1/2}$, and hence
\[
f(z\mid A) = \frac{1}{(2\pi(\sigma_0^2 + A))^{1/2}} \exp\Big\{-\frac{z^2}{2(\sigma_0^2 + A)}\Big\}.
\]
Method A: Empirical Bayes-MLE (EB-MLE)

The MLE of $\eta = \sigma_0^2 + A$ is
\[
\hat\eta = \max\{\sigma_0^2, z^2\}. \tag{4.4}
\]
Thus, an empirical Bayes estimator of $\mu$ by MLE is (see the proof below)
\[
\hat\mu^{EB}_{MLE} = \Big(1 - \frac{\sigma_0^2}{\hat\eta}\Big) z
= \Big(1 - \frac{\sigma_0^2}{\max\{\sigma_0^2, z^2\}}\Big) z
= \Big(1 - \frac{\sigma_0^2}{z^2}\Big)^{+} z,
\]
where $a^+ = \max(a, 0)$ denotes the positive part of $a$.

Proof of (4.4). Let $\eta = \sigma_0^2 + A$, so $\eta \in [\sigma_0^2, \infty)$. We have $L(\eta) \propto \eta^{-1/2}\exp\{-z^2/(2\eta)\}$, so $2l(\eta) = -\log\eta - z^2/\eta + \text{const}$, and
\[
l'(\eta) \propto -\frac{1}{\eta} + \frac{z^2}{\eta^2} = \frac{1}{\eta^2}(z^2 - \eta).
\]
The solution to $l'(\eta) = 0$ is $\eta_0 = z^2$.

1. If $z^2 \in [\sigma_0^2, \infty)$, i.e., $z^2 \ge \sigma_0^2$, the MLE is $\hat\eta = z^2$.

2. If $z^2 \notin [\sigma_0^2, \infty)$, i.e., $z^2 < \sigma_0^2$, the MLE is $\hat\eta = \sigma_0^2$ [$l(\eta)$ is decreasing on $[\sigma_0^2, \infty)$ since $l'(\eta) \le 0$ there].

Method B: Empirical Bayes-Method of Moments (EB-MM)

From (4.3), $EZ^2 = \mathrm{Var}(Z) = \sigma_0^2 + A$. So the moment estimator of $\sigma_0^2 + A$ is $\hat\eta = z^2$, giving
\[
\hat\mu^{EB}_{MM} = \Big(1 - \frac{\sigma_0^2}{\hat\eta}\Big) z = \Big(1 - \frac{\sigma_0^2}{z^2}\Big) z.
\]
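The two empirical Bayes estimators differ only in how the MLE truncates the shrinkage factor at zero; EB-MLE is exactly the positive-part version of EB-MM. A sketch with simulated values (the parameter settings are illustrative):

```python
import random

random.seed(1)
sigma0, A_true = 1.0, 4.0

# One draw of mu ~ N(0, A) and z = xbar | mu ~ N(mu, sigma0^2)
mu = random.gauss(0.0, A_true ** 0.5)
z = random.gauss(mu, sigma0)

# EB-MLE: eta_hat = max(sigma0^2, z^2)
eta_mle = max(sigma0 ** 2, z ** 2)
mu_eb_mle = (1 - sigma0 ** 2 / eta_mle) * z

# EB-MM: eta_hat = z^2 (its shrinkage factor may be negative)
mu_eb_mm = (1 - sigma0 ** 2 / z ** 2) * z

# EB-MLE equals the positive-part version of EB-MM
pos_part = max(1 - sigma0 ** 2 / z ** 2, 0.0) * z
assert abs(mu_eb_mle - pos_part) < 1e-12
```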
η̂ z
Example 11.9 (Empirical Bayes binomial)

Empirical Bayes estimation is best suited to large-scale inference, where there are many
problems that can be modeled simultaneously in a common way.

For example, suppose that there are K different groups of patients, where each group
has n patients. Each group is given a different treatment for the same illness, and in the
k-th group, we count Xk , k = 1, · · · , K, the number of successful treatments out of n.

Group 1 : X11 , · · · , X1n ∼ Ber(p1 ),


...... ................
Group K : XK1 , · · · , XKn ∼ Ber(pK ).

Let $X_k = \sum_{i=1}^n X_{ki} \sim \mathrm{Bin}(n, p_k)$, $k = 1, ..., K$. Since the groups receive different treatments, we expect different success rates; however, since we are treating the same illness, these rates should be somewhat related to each other. These considerations suggest the hierarchy

Xk |pk ∼ Bin(n, pk ),
pk ∼ Beta(a, b), k = 1, ..., K,

where the $K$ groups are tied together by the common prior distribution. As in Example 11.4, the Bayes estimator of $p_k$ under $L_2$-loss is
\[
\hat p_k^{Bayes} = E[p_k\mid x_k] = \frac{a + x_k}{a + b + n}
= \Big(\frac{a+b}{a+b+n}\Big)\frac{a}{a+b} + \Big(\frac{n}{a+b+n}\Big)\frac{x_k}{n}.
\]
If $a$ and $b$ are unknown, one can use the EB method. First we need to calculate the marginal distribution
\[
f(x\mid a, b) = \int_0^1 \cdots \int_0^1 \prod_{k=1}^K \Big\{\binom{n}{x_k} p_k^{x_k}(1-p_k)^{n-x_k} \times \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p_k^{a-1}(1-p_k)^{b-1}\Big\}\,dp_k
\]
\[
= \prod_{k=1}^K \Big\{\binom{n}{x_k} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^1 p_k^{x_k+a-1}(1-p_k)^{n-x_k+b-1}\,dp_k\Big\}
= \prod_{k=1}^K \binom{n}{x_k} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \frac{\Gamma(a+x_k)\Gamma(n-x_k+b)}{\Gamma(a+b+n)}.
\]

We then proceed with the MLEs of $a$ and $b$, denoted by $\hat a$ and $\hat b$; they do not have closed-form solutions and must be computed numerically. Therefore,
\[
\hat p_k^{EB} = \frac{\hat a + x_k}{\hat a + \hat b + n}
= \Big(\frac{\hat a + \hat b}{\hat a + \hat b + n}\Big)\frac{\hat a}{\hat a + \hat b} + \Big(\frac{n}{\hat a + \hat b + n}\Big)\frac{x_k}{n}.
\]
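The marginal likelihood above can be maximized numerically. The sketch below uses simulated group counts and a crude grid search over $(a, b)$ via `math.lgamma` (an illustration of the EB fit, not an efficient optimizer):

```python
import math
import random

random.seed(2)
K, n = 50, 20
a_true, b_true = 3.0, 7.0

# Simulate group success counts x_k from the Beta-Binomial hierarchy
x = []
for _ in range(K):
    pk = random.betavariate(a_true, b_true)
    x.append(sum(random.random() < pk for _ in range(n)))

def marginal_loglik(a, b):
    # log f(x | a, b), dropping the binomial coefficients (constant in a, b)
    c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b) - math.lgamma(a + b + n)
    return sum(c + math.lgamma(a + xk) + math.lgamma(n - xk + b) for xk in x)

# Crude grid search for the marginal MLE of (a, b)
grid = [0.5 + 0.5 * i for i in range(20)]        # 0.5, 1.0, ..., 10.0
a_hat, b_hat = max(((a, b) for a in grid for b in grid),
                   key=lambda ab: marginal_loglik(*ab))

# Plug-in empirical Bayes estimate for each group
p_eb = [(a_hat + xk) / (a_hat + b_hat + n) for xk in x]
assert all(0 < pe < 1 for pe in p_eb)
```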
11.5 Empirical Bayes and the James-Stein Estimator

Charles Stein shocked the statistical world in 1955 with his proof that MLE methods for
Gaussian models, in common use for more than a century, were inadmissible beyond simple
one- or two-dimensional situations. These methods are still in use, for good reasons, but
Stein-type estimators have pointed the way toward a radically different empirical Bayes
approach to high-dimensional statistical inference. Empirical Bayes ideas are powerful for
estimation, testing, and prediction. We begin here with their path-breaking appearance
in the James-Stein formulation.

Roughly speaking, many contemporary techniques in large-dimensional inference, such as the LASSO, can trace their roots back to James-Stein estimation.
Model

Suppose
\[
z_i\mid\mu_i \sim_{ind} N(\mu_i, \sigma_0^2), \quad i = 1, \dots, p, \qquad
\mu_i \sim_{ind} N(0, A) \quad\text{(conjugate prior)}.
\]
Write $\mu = (\mu_1, \dots, \mu_p)'$, $z = (z_1, \dots, z_p)'$, and let $I$ be the $p \times p$ identity matrix. Then
\[
z\mid\mu \sim N(\mu, \sigma_0^2 I), \qquad \mu \sim N(0, AI).
\]

The posterior distribution and the marginal distribution of $z$ are
\[
\mu\mid z \sim N(Bz, B\sigma_0^2 I), \quad\text{where } B = \frac{A}{A + \sigma_0^2},
\]
and
\[
z \sim N(0, (A + \sigma_0^2) I),
\]
which have been verified component-wise before.
Bayes estimator

So the Bayes estimator of $\mu$ is
\[
\hat\mu^{Bayes} = Bz = \Big(1 - \frac{\sigma_0^2}{A + \sigma_0^2}\Big) z. \tag{5.5}
\]

Theorem 11.3
\[
R(\mu, \hat\mu^{Bayes}) = p\sigma_0^2 B^2 + (1-B)^2\|\mu\|^2, \qquad
r(\pi, \hat\mu^{Bayes}) = p\sigma_0^2 B.
\]

Proof.
\[
R(\mu, \hat\mu^{Bayes}) = E_\mu\|\hat\mu^{Bayes} - \mu\|^2 = \mathrm{Var}(\hat\mu^{Bayes}) + \mathrm{Bias}^2 = pB^2\sigma_0^2 + (1-B)^2\|\mu\|^2,
\]
\[
r(\pi, \hat\mu^{Bayes}) = E\,R(\mu, \hat\mu^{Bayes}) = B^2 p\sigma_0^2 + (1-B)^2 E\|\mu\|^2 = B^2 p\sigma_0^2 + (1-B)^2 pA
\]
\[
= p\Big(\frac{A^2\sigma_0^2}{(A+\sigma_0^2)^2} + \frac{A\sigma_0^4}{(A+\sigma_0^2)^2}\Big)
= p\sigma_0^2\Big(\frac{A}{A+\sigma_0^2}\Big) = p\sigma_0^2 B.
\]

MLE

The obvious estimator of $\mu$ is the MLE $\hat\mu^{MLE} = z$. Its risk and Bayes risk are
\[
R(\mu, \hat\mu^{MLE}) = \sum_{i=1}^p E_\mu(z_i - \mu_i)^2 = \sum_{i=1}^p \mathrm{Var}_\mu(z_i) = p\sigma_0^2,
\qquad
r(\pi, \hat\mu^{MLE}) = E\,R(\mu, \hat\mu^{MLE}) = p\sigma_0^2.
\]
Empirical Bayes Estimation, James-Stein Estimation

Since $A$ is assumed unknown, we can use the empirical Bayes rules below. Note that
\[
\|z\|^2 = \sum_{i=1}^p z_i^2 \sim (A + \sigma_0^2)\chi_p^2.
\]
It can be shown that
\[
E\Big(\frac{A + \sigma_0^2}{\|z\|^2}\Big) = \frac{1}{p-2}. \tag{5.6}
\]
Thus, if we replace $\sigma_0^2/(A + \sigma_0^2)$ in (5.5) by the unbiased estimator $(p-2)\sigma_0^2/\|z\|^2$, we obtain
\[
\hat\mu^{JS} = \Big(1 - \frac{(p-2)\sigma_0^2}{\|z\|^2}\Big) z,
\]
the James-Stein estimator. It was discovered by Stein (1956) and later shown by James and Stein (1961) to have a smaller mean squared error than the MLE $z$ for all $\mu$. Its empirical Bayes derivation can be found in Efron and Morris (1972).

Proof of (5.6). If $Y \sim \mathrm{Gamma}(a, b)$, then
\[
EY^{-1} = \int_0^\infty y^{-1}\frac{b^a}{\Gamma(a)}\, y^{a-1}e^{-by}\,dy
= \frac{b\,\Gamma(a-1)}{\Gamma(a)}\int_0^\infty \frac{b^{a-1}}{\Gamma(a-1)}\, y^{(a-1)-1}e^{-by}\,dy
= \frac{b}{a-1}.
\]
Since $\|z\|^2/(A+\sigma_0^2) \sim \chi_p^2 = \mathrm{Gamma}(a, b)$ with $(a, b) = (p/2, 1/2)$, we have
\[
E\Big(\frac{A+\sigma_0^2}{\|z\|^2}\Big) = \frac{1/2}{p/2 - 1} = \frac{1}{p-2},
\quad\Longrightarrow\quad
E\Big(\frac{(p-2)\sigma_0^2}{\|z\|^2}\Big) = \frac{\sigma_0^2}{A+\sigma_0^2}.
\]
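Identity (5.6) can be checked by Monte Carlo under the marginal model $z_i \sim N(0, A+\sigma_0^2)$ (a sketch; the parameter values are illustrative):

```python
import random

random.seed(3)
p, A, sigma0 = 10, 2.0, 1.0
trials = 20000

total = 0.0
sd = (A + sigma0 ** 2) ** 0.5
for _ in range(trials):
    # z_i ~ N(0, A + sigma0^2) marginally
    norm2 = sum(random.gauss(0.0, sd) ** 2 for _ in range(p))
    total += (A + sigma0 ** 2) / norm2

estimate = total / trials
# E[(A + sigma0^2)/||z||^2] = 1/(p-2) = 0.125 here
assert abs(estimate - 1.0 / (p - 2)) < 0.01
```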

Since the James-Stein estimator (or any EB estimator) cannot attain as small a
Bayes risk as the Bayes estimator, it is of interest to compare their risks, which, in effect,
tell us the penalty we are paying for estimating A.
Stein Unbiased Risk Estimation (SURE)

Stein provided an unbiased estimate of the risk $R(\mu, \delta)$. SURE can be used not only to calculate risks but also (perhaps more importantly) in many other situations, e.g., to construct EB-SURE estimators or to choose tuning parameters (such as a soft-thresholding level).

Theorem 11.4 (SURE) Let $X \sim N_p(\mu, \sigma^2 I)$, and let the estimator $\delta$ be of the form
\[
\delta(x) = x - g(x),
\]
where $g(x) = x - \delta(x)$ is (weakly) differentiable. If $E\|g'(X)\|_1 \equiv \sum_{i=1}^p E|\partial g_i(X)/\partial X_i| < \infty$, then
\[
E_\mu\|\delta(X) - \mu\|^2 = E_\mu\|(X - \mu) - g(X)\|^2
= E_\mu\|X - \mu\|^2 + E_\mu\|g(X)\|^2 - 2E_\mu g(X)^T(X - \mu)
\]
\[
= p\sigma^2 + E_\mu\|g(X)\|^2 - 2\sum_{i=1}^p \mathrm{Cov}_\mu(g_i(X), X_i) \tag{5.7}
\]
\[
= p\sigma^2 + E_\mu\|g(X)\|^2 - 2\sigma^2\sum_{i=1}^p E_\mu \frac{\partial g_i}{\partial x_i}. \tag{5.8}
\]
(MSE = Var + Bias$^2$ − 2Cov.)

Hence, Stein's unbiased risk estimate is
\[
\mathrm{SURE}(\delta(x)) = p\sigma^2 + \|g(x)\|^2 - 2\sigma^2\sum_{i=1}^p \frac{\partial g_i}{\partial x_i}.
\]
Proof. It suffices to show the step from (5.7) to (5.8). For simplicity, we only prove it for $p = 2$. For $g(x) = (g_1(x), g_2(x))^T$, we have
\[
E_\mu g(X)^T(X - \mu) = E_\mu g_1(X_1, X_2)(X_1 - \mu_1) + E_\mu g_2(X_1, X_2)(X_2 - \mu_2).
\]
Let $\phi_\sigma(t) = (2\pi\sigma^2)^{-1/2}e^{-t^2/(2\sigma^2)}$ denote the $N(0, \sigma^2)$ density, and note that $t\,\phi_\sigma(t) = -\sigma^2\phi_\sigma'(t)$. Using integration by parts,
\[
E_\mu g_1(X_1, X_2)(X_1 - \mu_1)
= \int\!\!\int \phi_\sigma(x_1 - \mu_1)\phi_\sigma(x_2 - \mu_2)\, g_1(x_1, x_2)(x_1 - \mu_1)\,dx_1\,dx_2
\]
\[
= -\sigma^2 \int \phi_\sigma(x_2 - \mu_2)\Big(\int g_1(x_1, x_2)\,d\phi_\sigma(x_1 - \mu_1)\Big)\,dx_2
\]
\[
= \sigma^2 \int\!\!\int \frac{\partial g_1(x_1, x_2)}{\partial x_1}\,\phi_\sigma(x_1 - \mu_1)\phi_\sigma(x_2 - \mu_2)\,dx_1\,dx_2
= \sigma^2 E_\mu \frac{\partial g_1}{\partial x_1}.
\]
Similarly, $E_\mu g_2(X_1, X_2)(X_2 - \mu_2) = \sigma^2 E_\mu\, \partial g_2/\partial x_2$.
Theorem 11.5 For the JS estimator $\hat\mu^{JS} = \Big(1 - \dfrac{(p-2)\sigma_0^2}{\|z\|^2}\Big) z$, we have
\[
R(\mu, \hat\mu^{JS}) = p\sigma_0^2 - (p-2)^2\sigma_0^4\, E_\mu\Big(\frac{1}{\|z\|^2}\Big),
\qquad
\mathrm{SURE}(\hat\mu^{JS}) = p\sigma_0^2 - \frac{(p-2)^2\sigma_0^4}{\|z\|^2},
\]
\[
r(\pi, \hat\mu^{JS}) = p\sigma_0^2 - (p-2)^2\sigma_0^4\, E\Big(\frac{1}{\|z\|^2}\Big)
= p\sigma_0^2 - \frac{(p-2)\sigma_0^4}{\sigma_0^2 + A} \quad\text{(by (5.6))}
= p\,\frac{\sigma_0^2 A}{\sigma_0^2 + A} + \frac{2\sigma_0^4}{\sigma_0^2 + A}.
\]
Note that $E_\mu(\|z\|^{-2}) \neq E(\|z\|^{-2})$: the former is for fixed $\mu$, while the latter also averages over the prior.
Proof. Here $\hat\mu^{JS} = z - g(z)$ with $g(z) = (p-2)\sigma_0^2 z/\|z\|^2$. Hence,
\[
R(\mu, \hat\mu^{JS}) = p\sigma_0^2 + E_\mu\frac{(p-2)^2\sigma_0^4\|z\|^2}{\|z\|^4}
- 2\sigma_0^2\sum_{i=1}^p E_\mu\frac{\partial}{\partial z_i}\Big(\frac{(p-2)\sigma_0^2 z_i}{\|z\|^2}\Big)
\]
\[
= p\sigma_0^2 + (p-2)^2\sigma_0^4\, E_\mu\frac{1}{\|z\|^2}
- 2(p-2)\sigma_0^4\, E_\mu\sum_{i=1}^p \frac{\|z\|^2 - 2z_i^2}{\|z\|^4}
\]
\[
= p\sigma_0^2 + (p-2)^2\sigma_0^4\, E_\mu\frac{1}{\|z\|^2}
- 2(p-2)\sigma_0^4\, E_\mu\frac{p\|z\|^2 - 2\|z\|^2}{\|z\|^4}
= p\sigma_0^2 - (p-2)^2\sigma_0^4\, E_\mu\frac{1}{\|z\|^2}.
\]
And
\[
r(\pi, \hat\mu^{JS}) = E\,R(\mu, \hat\mu^{JS})
= p\sigma_0^2 - (p-2)^2\sigma_0^4\, E\Big(\frac{1}{\|z\|^2}\Big)
= p\sigma_0^2 - \frac{(p-2)\sigma_0^4}{\sigma_0^2 + A} \quad\text{(by (5.6))}
= p\,\frac{\sigma_0^2 A}{\sigma_0^2 + A} + \frac{2\sigma_0^4}{\sigma_0^2 + A}.
\]

Recall that $r(\pi, \hat\mu^{Bayes}) = p\,\dfrac{\sigma_0^2 A}{\sigma_0^2 + A}$. So the relative risk is
\[
\frac{r(\pi, \hat\mu^{JS})}{r(\pi, \hat\mu^{Bayes})}
= 1 + \frac{2\sigma_0^4/(\sigma_0^2 + A)}{p\,\sigma_0^2 A/(\sigma_0^2 + A)}
= 1 + \frac{2\sigma_0^2}{pA}.
\]
For $p = 10$ and $A = \sigma_0^2$, $r(\pi, \hat\mu^{JS})$ is only 20% greater than the true Bayes risk.
The shock the James-Stein estimator provided the statistical world didn't come from the above average-risk comparisons. Those are based on the zero-centric Bayesian model, where the MLE, which doesn't favor values of $\mu$ near 0, might be expected to be bested. The rude surprise came from the theorem proved by James and Stein in 1961:

Theorem 11.6 For $p \ge 3$, we have
\[
R(\mu, \hat\mu^{JS}) < R(\mu, \hat\mu^{MLE}),
\]
for every choice of $\mu$. (Note that $R(\mu, \hat\mu^{Bayes}) < R(\mu, \hat\mu^{JS})$ cannot hold for every $\mu$: the Bayes estimator wins in Bayes risk, but its frequentist risk grows with $\|\mu\|^2$.)

This result is frequentist rather than Bayesian: it implies the superiority of $\hat\mu^{JS}$ no matter what one's prior beliefs about $\mu$ may be. Since versions of $\hat\mu^{MLE}$ dominate popular statistical techniques such as linear regression, its apparent uniform inferiority was a cause for alarm. The fact that linear regression applications continue unabated reflects some virtues of $\hat\mu^{MLE}$, which will be discussed later.
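The dominance of the James-Stein estimator over the MLE can be seen in a small simulation, even for a mean vector far from 0 (a sketch; the settings are illustrative, and dominance holds in expectation, not per sample):

```python
import random

random.seed(4)
p, sigma0, trials = 10, 1.0, 2000
mu = [3.0] * p                                   # a mean vector far from 0

se_mle, se_js = 0.0, 0.0
for _ in range(trials):
    z = [random.gauss(m, sigma0) for m in mu]
    norm2 = sum(zi ** 2 for zi in z)
    shrink = 1 - (p - 2) * sigma0 ** 2 / norm2   # JS shrinkage factor
    js = [shrink * zi for zi in z]
    se_mle += sum((zi - m) ** 2 for zi, m in zip(z, mu))
    se_js += sum((ji - m) ** 2 for ji, m in zip(js, mu))

risk_mle = se_mle / trials                       # close to p * sigma0^2 = 10
risk_js = se_js / trials
assert risk_js < risk_mle                        # JS beats the MLE on average
```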

11.6 Using SURE to find Estimators

A standard application of SURE is to choose a parametric form for an estimator, and


then optimize the values of the parameters to minimize the risk estimate. This technique
has been applied in several settings. For example, a variant of the James-Stein estimator
can be derived by finding the optimal shrinkage estimator. The technique has also been
used by Donoho and Johnstone to determine the optimal shrinkage factor in a wavelet
denoising setting.

Recall the shrinkage family
\[
\hat\mu = \Big(1 - \frac{1}{A+1}\Big) z = z - g(z),
\]
where $g(z) = z/(A+1)$ and $\partial g_i(z)/\partial z_i = 1/(A+1)$ (this is the Bayes estimator when $\sigma^2 = 1$). From Theorem 11.4, we have
\[
\mathrm{SURE}(A) = p\sigma^2 + \|g(z)\|^2 - 2\sigma^2\sum_{i=1}^p \frac{\partial g_i}{\partial z_i}
= p\sigma^2 + \frac{\|z\|^2}{(A+1)^2} - \frac{2p\sigma^2}{A+1}.
\]

A SURE estimate of $A$ is
\[
\hat A = \arg\min_{A \ge 0} \mathrm{SURE}(A),
\]
which can be found by setting
\[
\frac{\partial\,\mathrm{SURE}(A)}{\partial A} = -\frac{2\|z\|^2}{(A+1)^3} + \frac{2p\sigma^2}{(A+1)^2} = 0.
\]
Thus, a SURE estimate of $\dfrac{1}{1+A}$ is $\dfrac{1}{1+\hat A} = \dfrac{p\sigma^2}{\|z\|^2}$. Hence, we obtain a SURE-type estimator of $\mu$ as
\[
\hat\mu^{SURE} = \Big(1 - \frac{p\sigma^2}{\|z\|^2}\Big) z.
\]
Note that this is very similar to the James-Stein estimator $\hat\mu^{JS}$.
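The SURE minimization over the shrinkage family can be verified numerically: the closed-form minimizer should match a grid search (a sketch with simulated data; $\sigma^2$ is treated as known):

```python
import random

random.seed(5)
p, sigma2 = 15, 1.0
z = [random.gauss(1.0, sigma2 ** 0.5) for _ in range(p)]
norm2 = sum(zi ** 2 for zi in z)

def sure(A):
    # SURE for the shrinkage family mu_hat = (1 - 1/(A+1)) z
    return p * sigma2 + norm2 / (A + 1) ** 2 - 2 * p * sigma2 / (A + 1)

# Closed-form minimizer over A >= 0: 1/(1 + A_hat) = p*sigma2/norm2
A_hat = max(norm2 / (p * sigma2) - 1, 0.0)

# Numerical check: A_hat does at least as well as any grid point
grid = [0.05 * i for i in range(200)]
A_grid = min(grid, key=sure)
assert sure(A_hat) <= sure(A_grid) + 1e-9
```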
11.7 James-Stein Estimator, shrinking to non-zero point

The above James-Stein estimator shrinks each observed value $z_i$ toward 0. We don't have to take 0 as the preferred shrinking point. Now suppose
\[
z_i\mid\mu_i \sim N(\mu_i, \sigma_0^2), \quad i = 1, \dots, p, \qquad
\mu_i \sim N(M, A) \quad\text{(conjugate prior)},
\]
the $(\mu_i, z_i)$ pairs being independent of each other. We can write this compactly as
\[
z\mid\mu \sim N(\mu, \sigma_0^2 I), \qquad \mu \sim N(M\mathbf 1, AI).
\]

The posterior distribution is
\[
\mu_i\mid z_i \sim N\big((1-B)M + Bz_i,\ B\sigma_0^2\big), \qquad [B = A/(A+\sigma_0^2)],
\]
and the marginal distribution of $z$ is
\[
z_i \sim N(M, A + \sigma_0^2).
\]

Proof. Let $\eta = \mu - M$. Then, similarly to the earlier derivation,
\[
f(z) = \int \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\Big\{-\frac{(z-\mu)^2}{2\sigma_0^2}\Big\}\,
\frac{1}{(2\pi A)^{1/2}}\exp\Big\{-\frac{(\mu-M)^2}{2A}\Big\}\,d\mu
\]
\[
= \int \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\Big\{-\frac{(z-M-\eta)^2}{2\sigma_0^2}\Big\}\,
\frac{1}{(2\pi A)^{1/2}}\exp\Big\{-\frac{\eta^2}{2A}\Big\}\,d\eta
= \frac{1}{(2\pi(A+\sigma_0^2))^{1/2}}\exp\Big\{-\frac{(z-M)^2}{2(A+\sigma_0^2)}\Big\},
\]
where the last step follows from the calculation in Example 11.8 applied to $z - M$.
Bayes estimator

The Bayes estimator of $\mu$ is
\[
\hat\mu^{Bayes} = E(\mu\mid z) = M + B(z - M)
= M + \Big(1 - \frac{\sigma_0^2}{A + \sigma_0^2}\Big)(z - M). \tag{7.9}
\]

Theorem 11.7
\[
R(\mu, \hat\mu^{Bayes}) = E_\mu\|\hat\mu^{Bayes} - \mu\|^2 = \mathrm{Var} + \mathrm{Bias}^2
= B^2 p\sigma_0^2 + (1-B)^2\|M\mathbf 1 - \mu\|^2,
\]
\[
r(\pi, \hat\mu^{Bayes}) = E\,R(\mu, \hat\mu^{Bayes}) = B^2 p\sigma_0^2 + (1-B)^2 E\|M\mathbf 1 - \mu\|^2
= B^2 p\sigma_0^2 + (1-B)^2 pA
\]
\[
= p\Big(\frac{A^2\sigma_0^2}{(A+\sigma_0^2)^2} + \frac{A\sigma_0^4}{(A+\sigma_0^2)^2}\Big)
= p\sigma_0^2\Big(\frac{A}{A+\sigma_0^2}\Big).
\]
James-Stein Estimation

Since $M$ and $A$ are unknown, we can use the empirical Bayes rules below. Note that
\[
\|z - \bar z\mathbf 1\|^2 = \sum_{i=1}^p (z_i - \bar z)^2 \sim (A + \sigma_0^2)\chi_{p-1}^2.
\]
Similarly to the previous derivations, we have
\[
E\Big(\frac{(p-3)\sigma_0^2}{\sum_{i=1}^p (z_i - \bar z)^2}\Big) = \frac{\sigma_0^2}{A + \sigma_0^2}. \tag{7.10}
\]
Thus, replacing $M$ in (7.9) by $\bar z$ and $\sigma_0^2/(A + \sigma_0^2)$ by the unbiased estimator $(p-3)\sigma_0^2/\sum_i (z_i - \bar z)^2$, we obtain
\[
\hat\mu^{JS} = \bar z\mathbf 1 + \Big(1 - \frac{(p-3)\sigma_0^2}{\sum_{i=1}^p (z_i - \bar z)^2}\Big)(z - \bar z\mathbf 1),
\]
the James-Stein estimator shrinking toward the grand mean $\bar z$.

Theorem 11.6 continues to hold, except now $p \ge 4$.

Theorem 11.8 For $p \ge 4$, we have
\[
R(\mu, \hat\mu^{JS}) < R(\mu, \hat\mu^{MLE}),
\]
for every choice of $\mu$.
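The grand-mean version can also be checked by simulation when the means cluster around a common nonzero value (a sketch; the settings are illustrative, and the comparison is in average risk):

```python
import random

random.seed(6)
p, sigma0, trials = 12, 1.0, 2000
mu = [random.gauss(5.0, 2.0 ** 0.5) for _ in range(p)]    # means clustered near 5

se_mle, se_js = 0.0, 0.0
for _ in range(trials):
    z = [random.gauss(m, sigma0) for m in mu]
    zbar = sum(z) / p
    s2 = sum((zi - zbar) ** 2 for zi in z)
    shrink = 1 - (p - 3) * sigma0 ** 2 / s2               # note p-3, not p-2
    js = [zbar + shrink * (zi - zbar) for zi in z]        # shrink toward zbar
    se_mle += sum((zi - m) ** 2 for zi, m in zip(z, mu))
    se_js += sum((ji - m) ** 2 for ji, m in zip(js, mu))

assert se_js / trials < se_mle / trials                   # JS beats MLE on average
```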