
4 Comparison of Estimators

4.1 Optimality theory

(1) Mean Squared Error (MSE)

$$R(\theta, T) = E\bigl(T(X) - \theta\bigr)^2$$

(2) Mean Absolute Error

$$E\,|T(X) - \theta|$$

If T ∼ N(θ, τ²) for some variance τ², then

$$E\,|T(X) - \theta| = \sqrt{\frac{2}{\pi}}\,\sqrt{R(\theta, T)}.$$

Relationship between MSE and bias and variance of an estimator:

$$R(\theta, T) = \mathrm{var}(T(X)) + b^2(\theta, T),$$

where b(θ, T) = E(T(X)) − θ is the bias.

Ex. (X1, . . . , Xn) ∼ N(µ, σ²), with

$$\hat\mu = \bar X, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2.$$

For the mean,

$$R(\theta, \bar X) = \mathrm{var}(\bar X) = \frac{\sigma^2}{n}.$$

For the variance estimator, note that nσ̂²/σ² ∼ χ² with n − 1 degrees of freedom, so

$$b(\theta, \hat\sigma^2) = \frac{\sigma^2}{n}\,E\!\left(\frac{n\hat\sigma^2}{\sigma^2}\right) - \sigma^2 = \frac{\sigma^2}{n}\,(n-1) - \sigma^2 = -\frac{\sigma^2}{n}$$

$$R(\theta, \hat\sigma^2) = \frac{\sigma^4}{n^2}\,\mathrm{var}\!\left(\frac{n\hat\sigma^2}{\sigma^2}\right) + \frac{\sigma^4}{n^2} = \frac{\sigma^4}{n^2}\,\bigl[2(n-1) + 1\bigr] = \frac{\sigma^4(2n-1)}{n^2}. \qquad \square$$
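A quick Monte Carlo check of these formulas (a minimal sketch, not part of the original notes; the sample size n and the values of µ and σ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 20, 1.0, 2.0          # assumed example values
reps = 200_000

X = rng.normal(mu, sigma, size=(reps, n))
sigma2_hat = X.var(axis=1)           # (1/n) * sum (Xi - Xbar)^2

bias = sigma2_hat.mean() - sigma**2
mse = np.mean((sigma2_hat - sigma**2) ** 2)

print(bias, -sigma**2 / n)                        # bias ~ -sigma^2/n
print(mse, sigma2_hat.var() + bias**2)            # MSE = variance + bias^2
print(mse, sigma**4 * (2 * n - 1) / n**2)         # MSE ~ sigma^4 (2n-1)/n^2
```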

Ex. Consider aX̄, 0 < a < 1.

$$R(\theta, a\bar X) = a^2\,\frac{\sigma^2}{n} + (a-1)^2 \mu^2$$

Then aX̄ is better than X̄ for µ ≈ 0. 
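A small numerical illustration of this comparison (a sketch; the shrinkage factor a, σ, n, and the grid of µ values are arbitrary assumptions):

```python
import numpy as np

n, sigma, a = 25, 1.0, 0.9                     # assumed example values
mus = np.linspace(0.0, 1.0, 5)

risk_xbar = sigma**2 / n                       # R(theta, Xbar), free of mu
risk_axbar = a**2 * sigma**2 / n + (a - 1) ** 2 * mus**2

for mu, r in zip(mus, risk_axbar):
    print(f"mu={mu:.2f}  R(aXbar)={r:.4f}  R(Xbar)={risk_xbar:.4f}")
# aXbar wins near mu = 0 and loses once (a-1)^2 mu^2 exceeds (1-a^2) sigma^2/n
```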

Inadmissible: we say the estimator S is inadmissible if there exists another estimator T

such that

R(θ, T ) ≤ R(θ, S), ∀θ,

with strict inequality holding for some θ.

* In general, no single estimator improves on all others in all respects.

* UMVUE (uniformly minimum variance unbiased estimate): an unbiased T such that

R(θ, T) = var(T) ≤ var(S) = R(θ, S), ∀θ and all unbiased S.

• Difficulty :

– (1) unbiased estimates may not exist.

– (2) even when UMVUEs exist, they may be inadmissible.

– (3) unbiasedness is not invariant under functional transformation.

Other criteria for comparing estimators:

* Bayes risk:

$$\int R(\theta, T)\,\pi(\theta)\,d\theta$$

* Worst-case risk :

$$\max_\theta R(\theta, T)$$

the estimator with minimum worst-case risk is called minimax.

4.2 UMVUEs

Ex. Consider two estimators of the standard deviation σ when the data are ∼ N(µ, σ²):

$$\hat\sigma = \sqrt{\frac{1}{n}\sum_i (X_i - \bar X)^2}, \qquad \tilde\sigma = \frac{1}{n}\sum_i |X_i - \bar X|$$

• aσ̂ is UMVUE with

$$a = \left[\frac{\sqrt{2}\,\Gamma\!\left(\tfrac{n}{2}\right)}{\sqrt{n}\,\Gamma\!\left(\tfrac{n-1}{2}\right)}\right]^{-1}$$

• cσ̃ is unbiased (but NOT UMVUE) with

$$c = \left[\sqrt{\frac{2(n-1)}{\pi n}}\right]^{-1}. \qquad \square$$
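As a sanity check, a small simulation sketch (σ, n, and the number of replications are arbitrary assumptions) confirming that aσ̂ and cσ̃ are unbiased for σ with the constants above, and that aσ̂ has the smaller variance:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)
n, sigma, reps = 10, 2.0, 200_000    # assumed example values

# constants a and c from the formulas above
a = 1.0 / (np.sqrt(2.0 / n) * np.exp(gammaln(n / 2) - gammaln((n - 1) / 2)))
c = 1.0 / np.sqrt(2 * (n - 1) / (np.pi * n))

X = rng.normal(0.0, sigma, size=(reps, n))
dev = X - X.mean(axis=1, keepdims=True)
sigma_hat = np.sqrt((dev**2).mean(axis=1))       # sqrt of (1/n) sum (Xi - Xbar)^2
sigma_tilde = np.abs(dev).mean(axis=1)           # (1/n) sum |Xi - Xbar|

print((a * sigma_hat).mean(), (c * sigma_tilde).mean(), sigma)  # both ~ sigma
print((a * sigma_hat).var(), (c * sigma_tilde).var())           # UMVUE has smaller variance
```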

Theorem. (Rao-Blackwell) Suppose T(X) is sufficient for θ and Eθ(|S(X)|) < ∞. Then the estimator

defined by

T ∗ (X) = E(S(X)|T (X))

has the nice property

$$E_\theta\bigl(T^*(X) - q(\theta)\bigr)^2 \le E_\theta\bigl(S(X) - q(\theta)\bigr)^2 \quad \forall\theta.$$

If varθ(S(X)) < ∞, the inequality is strict unless T*(X) = S(X) with probability one.

p.f. Since

$$b(\theta, T^*(X)) = b(\theta, S(X))$$

(because E[E(S(X)|T(X))] = E(S(X))), it suffices to show that

$$\mathrm{var}[E(S(X)|T(X))] \le \mathrm{var}[S(X)],$$

with equality iff E(S(X)|T(X)) = S(X); this follows from the decomposition var(S) = var(E(S|T)) + E(var(S|T)), a known result from Chapter 1. □

Complete: A statistic T is complete iff the only function g, defined on the range of T, satisfying

$$E_\theta[g(T)] = 0, \quad \forall\theta$$

is g ≡ 0.
Ex. (X1, . . . , Xn) ∼ Poisson(θ) ⇒ T = Σᵢ Xᵢ is sufficient for θ and T ∼ Poisson(nθ):

$$E_\theta(g(T)) = e^{-n\theta}\sum_{i=0}^{\infty}\frac{g(i)\,(n\theta)^i}{i!} = 0, \quad \forall\theta > 0 \;\Longleftrightarrow\; g(i) = 0, \;\forall i = 0, 1, 2, \ldots$$

So T is complete. □

Theorem. (Lehmann-Scheffé) If T (X) is a complete sufficient statistic and S(X) is an

unbiased estimate of q(θ), then T ∗ = E[S(X)|T (X)] is UMVUE of q(θ). If varθ (T∗ (X)) < ∞,

T ∗ is the unique UMVUE.

p.f. Since b(θ, T*) = b(θ, S), T* is also unbiased. By Rao-Blackwell,

$$\mathrm{var}(T^*) \le \mathrm{var}(S)$$

(strictly smaller unless S is equal to T ∗ ). We need to show that T ∗ does not depend on

which unbiased estimate S we start with. Suppose

$$T_1^* = g_1(T) \quad\text{and}\quad T_2^* = g_2(T)$$

are both unbiased and obtained by Rao-Blackwellization.

Then

E(g1 (T ) − g2 (T )) = 0.

By completeness g1 = g2 , so T ∗ = E(S|T ) cannot depend on S, and uniqueness follows from

the Rao-Blackwell theorem. 

How to find UMVUE: first find a complete sufficient statistic T , then either

(1) find h(T (X)) such that h(T ) is unbiased, or

(2) find an unbiased S, then E(S|T ) is UMVUE.

Ex. (hypergeometric) X = number of successes in a sample of size n drawn without replacement from a population of size N containing Nθ successes ⇒ X is sufficient and complete:

$$E_\theta(g(X)) = \sum_{k=0}^{n} \frac{\binom{N\theta}{k}\binom{N-N\theta}{n-k}}{\binom{N}{n}}\, g(k) = 0, \quad \forall\theta = 0, 1/N, \ldots, 1.$$

When θ = 0,

E0 (g(X)) = g(0) = 0 ⇒ g(0) = 0

When θ = 1/N,

$$E_\theta(g(X)) = \frac{\binom{1}{0}\binom{N-1}{n}}{\binom{N}{n}}\,g(0) + \frac{\binom{1}{1}\binom{N-1}{n-1}}{\binom{N}{n}}\,g(1) = 0 \;\Rightarrow\; g(1) = 0.$$

So, by induction,

$$g(0) = g(1) = \cdots = g(n) = 0,$$

hence g ≡ 0. Since Eθ(X/n) = θ, X/n is UMVUE. □

Theorem. Suppose {Pθ} belongs to a k-parameter exponential family, and suppose the range of C(θ) = (C1(θ), · · · , Ck(θ)) has a nonempty interior. Then

T (X) = (T1 (X), . . . , Tk (X))

is complete and sufficient.

Ex. (X1, . . . , Xn) ∼ N(µ, σ²):

$$s^2 = \frac{\sum_i (X_i - \bar X)^2}{n-1}$$

is unbiased for σ² and is a function of the complete sufficient statistics, so it is UMVUE for σ².

Ex. (X1 , . . . , Xn ) ∼ U(0, θ)

$$p(x, \theta) = \frac{1}{\theta^n} \quad\text{if } X_{(n)} < \theta,\; X_{(1)} > 0.$$

Recall that the sufficient statistic for θ is X(n), and the density of X(n) is

$$f_{X_{(n)}}(t) = \frac{n\,t^{n-1}}{\theta^n}, \quad 0 < t < \theta.$$

Note that, if

$$E\bigl(g(X_{(n)})\bigr) = \frac{n}{\theta^n}\int_0^\theta g(t)\,t^{n-1}\,dt = 0, \quad \forall\theta > 0,$$

then g(t) = 0.

So X(n) is also complete. Since

$$E(X_{(n)}) = \frac{n}{\theta^n}\int_0^\theta t^n\,dt = \frac{n\theta}{n+1},$$

we have that (n+1)/n · X(n) is UMVUE for θ. □
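A short simulation sketch of this example (θ, n, and the number of replications are arbitrary assumptions), comparing the MLE X(n) with the bias-corrected UMVUE:

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 10, 3.0, 200_000    # assumed example values

X = rng.uniform(0.0, theta, size=(reps, n))
mle = X.max(axis=1)                  # X_(n): biased low
umvue = (n + 1) / n * mle            # bias-corrected version

for name, est in [("X_(n)", mle), ("(n+1)/n X_(n)", umvue)]:
    bias = est.mean() - theta
    mse = np.mean((est - theta) ** 2)
    print(f"{name:>15}: bias={bias:+.4f}  MSE={mse:.5f}")
```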

Ex. X = (X1, . . . , Xn) ∼ Exp(λ),

$$f(x) = \lambda e^{-\lambda x}, \quad x > 0.$$

⇒ T = Σᵢ Xᵢ is sufficient and complete for λ. We want to estimate the quantity

$$P_\lambda(X_1 \le x) = 1 - e^{-\lambda x}.$$

Note that I(X1 ≤ x) is an unbiased estimate of 1 − e^{−λx}, and

$$E[I(X_1 \le x)\mid T = t] = P\Bigl(X_1 \le x \,\Big|\, \sum_{i=1}^n X_i = t\Bigr) = P\Bigl(\frac{X_1}{\sum_{i=1}^n X_i} \le \frac{x}{t} \,\Big|\, \sum_{i=1}^n X_i = t\Bigr)$$

Since X1 ∼ Γ(1, λ) and Σᵢ Xᵢ ∼ Γ(n, λ), the ratio X1/Σᵢ Xᵢ ∼ Beta(1, n−1) and is independent of Σᵢ Xᵢ. Therefore

$$E\Bigl[I(X_1 \le x)\,\Big|\, \sum_{i=1}^n X_i = t\Bigr] = \int_0^{x/t} \beta_{(1,\,n-1)}(u)\,du = \int_0^{x/t} (n-1)(1-u)^{n-2}\,du = \begin{cases} 1 - \bigl(1 - \tfrac{x}{t}\bigr)^{n-1}, & \text{if } t > x, \\[4pt] 1, & \text{otherwise.} \end{cases}$$

Substituting t = Σᵢ Xᵢ gives the estimator

$$T^*(X) = \begin{cases} 1 - \Bigl(1 - \dfrac{x}{\sum_{i=1}^n X_i}\Bigr)^{n-1}, & \text{if } \sum_{i=1}^n X_i > x, \\[6pt] 1, & \text{otherwise,} \end{cases}$$

which is UMVUE for Pλ(X1 ≤ x).
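The Rao-Blackwell improvement in this example can be checked numerically (a sketch; λ, n, x, and the number of replications are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, x, reps = 0.5, 8, 1.0, 200_000   # assumed example values

X = rng.exponential(1.0 / lam, size=(reps, n))   # numpy uses scale = 1/lambda
T = X.sum(axis=1)

naive = (X[:, 0] <= x).astype(float)             # unbiased but crude: I(X1 <= x)
umvue = np.where(T > x, 1.0 - (1.0 - x / T) ** (n - 1), 1.0)

truth = 1.0 - np.exp(-lam * x)
print(truth, naive.mean(), umvue.mean())   # both estimators are unbiased
print(naive.var(), umvue.var())            # the Rao-Blackwellized version has smaller variance
```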

4.3 Information Theory

* Regularity Assumptions :

(1) The set A = {x : p(x, θ) > 0} (support) does not depend on θ.

(2) ∀x ∈ A and θ ∈ Θ,

$$\frac{\partial}{\partial\theta}\log p(x, \theta)$$

exists and is finite.

(3) Eθ(|T|) < ∞, and differentiation and integration can be interchanged:

$$\frac{\partial}{\partial\theta}\int T(x)\,p(x, \theta)\,dx = \int T(x)\,\frac{\partial}{\partial\theta} p(x, \theta)\,dx.$$

* Exponential families satisfy the regularity conditions above.

* Score Function (for a single observation):

$$S(X, \theta) = \frac{\partial}{\partial\theta}\log p(X, \theta)$$

* Properties of the score function (under regularity assumptions): (1) Eθ(S(X, θ)) = 0.

p.f.

$$E_\theta\Bigl(\frac{\partial}{\partial\theta}\log p(x, \theta)\Bigr) = \int \frac{\partial}{\partial\theta}\log p(x, \theta)\; p(x, \theta)\,dx = \int \frac{\partial}{\partial\theta} p(x, \theta)\,dx = \frac{\partial}{\partial\theta}\int p(x, \theta)\,dx = \frac{\partial}{\partial\theta}\,1 = 0. \qquad \square$$

(2) $\mathrm{var}_\theta(S(X, \theta)) = -E\bigl[\frac{\partial^2}{\partial\theta^2}\log p(x, \theta)\bigr]$.

p.f. Since Eθ(S(X, θ)) = 0, varθ(S(X, θ)) = E[S(X, θ)]² = E[∂/∂θ log p(X, θ)]². The result follows by noting that

$$\begin{aligned}
-E\Bigl[\frac{\partial^2}{\partial\theta^2}\log p(x, \theta)\Bigr]
&= -\int \frac{\partial^2}{\partial\theta^2}\log p(x, \theta)\; p(x, \theta)\,dx \\
&= -\int \frac{\partial}{\partial\theta}\Bigl[\frac{\partial}{\partial\theta}\log p(x, \theta)\; p(x, \theta)\Bigr]dx + \int \frac{\partial}{\partial\theta}\log p(x, \theta)\;\frac{\partial}{\partial\theta}p(x, \theta)\,dx \\
&= -\frac{\partial}{\partial\theta}\int \frac{\partial}{\partial\theta}\log p(x, \theta)\; p(x, \theta)\,dx + \int \frac{\partial}{\partial\theta}\log p(x, \theta)\;\frac{\partial p(x, \theta)/\partial\theta}{p(x, \theta)}\;p(x, \theta)\,dx \\
&= 0 + \int \Bigl(\frac{\partial}{\partial\theta}\log p(x, \theta)\Bigr)^2 p(x, \theta)\,dx \\
&= E\Bigl(\frac{\partial}{\partial\theta}\log p(x, \theta)\Bigr)^2. \qquad \square
\end{aligned}$$

(3) covθ (S(X, θ), T (X)) = ψ ′ (θ) with ψ(θ) = Eθ (T (X)).

p.f. Since Eθ(S(X, θ)) = 0, covθ(S(X, θ), T(X)) = E[S(X, θ)T(X)], and

$$E[S(X, \theta)\,T(X)] = \int \frac{\partial}{\partial\theta}\log p(x, \theta)\; T(x)\,p(x, \theta)\,dx = \int \frac{\partial}{\partial\theta} p(x, \theta)\; T(x)\,dx = \frac{\partial}{\partial\theta}\int p(x, \theta)\,T(x)\,dx = \frac{\partial}{\partial\theta}E_\theta(T(X)) = \psi'(\theta). \qquad \square$$

* Fisher Information (for a single observation):

$$I(\theta) = E\Bigl(\frac{\partial}{\partial\theta}\log p(X, \theta)\Bigr)^2 = -E\Bigl(\frac{\partial^2}{\partial\theta^2}\log p(X, \theta)\Bigr) \quad\text{(from Property (2))}$$

Note that the Fisher information number satisfies 0 ≤ I(θ) ≤ ∞.

Ex. (X1, . . . , Xn) ∼ Poisson(θ):

$$\frac{\partial}{\partial\theta}\log p(x, \theta) = \frac{\sum_{i=1}^n x_i}{\theta} - n$$

$$I(\theta) = E\Bigl(\frac{\sum_{i=1}^n x_i}{\theta} - n\Bigr)^2 = \mathrm{var}\Bigl(\frac{\sum_{i=1}^n X_i}{\theta}\Bigr) = \frac{1}{\theta^2}\,n\theta = \frac{n}{\theta}.$$
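This can be verified by simulating the score and checking properties (1) and (2) numerically (a sketch; θ, n, and the number of replications are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 3.0, 5, 200_000     # assumed example values

X = rng.poisson(theta, size=(reps, n))
score = X.sum(axis=1) / theta - n    # d/dtheta log p(X, theta) for the full sample

print(score.mean())                  # ~ 0, property (1)
print(score.var(), n / theta)        # ~ n/theta, the Fisher information
```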

Theorem. (Information Inequality) Let T(X) be any statistic such that varθ(T(X)) < ∞, ∀θ, and let Eθ(T(X)) = ψ(θ). Under the regularity conditions, if 0 < I(θ) < ∞ and ψ is differentiable, then

$$\mathrm{var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}.$$


p.f. Let S = ∂/∂θ log p(X, θ) (the score). Then var(S) = I(θ) and ψ′(θ) = cov(T, S). The result follows from the Cauchy-Schwarz inequality, cov(T, S)² ≤ var(T) var(S). □

Corollary. For unbiased T(X):

$$\mathrm{var}(T(X)) \ge \frac{1}{I(\theta)}.$$

1/I(θ) is called the Cramér-Rao lower bound.

Corollary. For unbiased T*, if

$$\mathrm{var}(T^*) = \frac{1}{I(\theta)},$$

then T* is UMVUE.

In general, we have n observations and I(θ) denotes the Fisher information of the full sample. If I1(θ) is the Fisher information for a single observation, then

$$I(\theta) = n\, I_1(\theta).$$

Ex. (X1 , . . . , Xn ) ∼ N(µ, σ 2 )

$$\mathrm{var}(\bar X) = \frac{\sigma^2}{n} = \frac{1}{I(\theta)} = \frac{1}{n\,I_1(\theta)},$$

where I1(θ) = 1/σ², so X̄ attains the Cramér-Rao bound.

Theorem. If {Pθ : θ ∈ Θ} satisfies the regularity conditions and there exists an unbiased estimate T* of ψ(θ) achieving the information bound for every θ, then {Pθ} is a one-parameter exponential family with density p(x, θ) = exp[c(θ)T*(x) + d(θ) + s(x)] I_A(x). Conversely, if {Pθ} is a one-parameter exponential family and c(θ) has a continuous, non-vanishing derivative on Θ, then T(X) achieves the information bound and is a UMVUE of Eθ(T(X)).

4.4 Large Sample Theory

* Consistency: Tn(X1, . . . , Xn) is, with high probability, close to θ when n is large.


Mathematically, Tn is consistent for θ (Tn → θ in probability) iff ∀ε > 0, as n → ∞,

$$P\bigl[\,|T_n(X_1, \ldots, X_n) - \theta| \ge \varepsilon\,\bigr] \to 0.$$

Ex.

$$\hat m_j = \frac{1}{n}\sum_{i=1}^n X_i^j$$

is consistent for E(X^j) (by the law of large numbers).

Theorem. The method of moments estimator is consistent:

$$T_n = g(\hat m_1, \ldots, \hat m_r) \xrightarrow{p} g(m_1(\theta), \ldots, m_r(\theta)) = q(\theta).$$

* UMVUEs are consistent; MLEs usually are.

* Asymptotic Normality:

Tn is approximately normally distributed with mean µn(θ) and variance σn²(θ) iff

$$P(T_n(X) \le t) \approx \Phi\Bigl(\frac{t - \mu_n(\theta)}{\sigma_n(\theta)}\Bigr),$$

i.e., ∀z,

$$\lim_{n\to\infty} P\Bigl(\frac{T_n(X_1, \ldots, X_n) - \mu_n(\theta)}{\sigma_n(\theta)} \le z\Bigr) = \Phi(z).$$

* √n-consistent:

$$\sqrt{n}\,\bigl(\mu_n(\theta) - q(\theta)\bigr) \to 0, \qquad n\,\sigma_n^2(\theta) \to \sigma^2(\theta) > 0.$$

Ex. (X1 , . . . , Xn ) ∼ Bin(θ)

$$\sqrt{n}\,\bigl(q(\bar X_n) - q(\theta)\bigr) \xrightarrow{d} Z \sim N\bigl(0,\; [q'(\theta)]^2\,\theta(1-\theta)\bigr).$$

So we can take

$$\mu_n = q(\theta), \qquad \sigma_n^2 = \frac{[q'(\theta)]^2\,\theta(1-\theta)}{n}.$$
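A delta-method sanity check by simulation (a sketch; the choice q(θ) = θ(1−θ), together with θ, n, and the number of replications, is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 0.3, 200, 100_000        # assumed example values
q = lambda t: t * (1 - t)                 # illustrative choice of q
q_prime = lambda t: 1 - 2 * t

X = rng.binomial(1, theta, size=(reps, n))
Tn = q(X.mean(axis=1))

z = np.sqrt(n) * (Tn - q(theta))
print(z.var(), q_prime(theta) ** 2 * theta * (1 - theta))   # simulated vs delta-method variance
```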

Theorem. (a) Suppose that P = (P1 , . . . , Pk ) are the population frequencies for k categories.

Let Tn = h(P̂1 , . . . , P̂k ) with P̂i (i = 1, . . . , k) the sample frequencies. Then


$$\sqrt{n}\,\bigl(T_n - h(P_1, \ldots, P_k)\bigr) \xrightarrow{d} N(0, \sigma_h^2),$$

$$\sigma_h^2 = \sum_{i=1}^{k} P_i\Bigl(\frac{\partial}{\partial P_i}h(P)\Bigr)^2 - \Bigl[\sum_{i=1}^{k} P_i\,\frac{\partial}{\partial P_i}h(P)\Bigr]^2$$

(b) Suppose that m = (m1, . . . , mr) are the population moments. Let Tn = g(m̂1, . . . , m̂r) with m̂i (i = 1, . . . , r) the sample moments. Then

$$\sqrt{n}\,\bigl(T_n - g(m_1, \ldots, m_r)\bigr) \xrightarrow{d} N(0, \sigma_g^2),$$

$$\sigma_g^2 = \sum_{i=2}^{2r} b_i\, m_i - \Bigl[\sum_{i=1}^{r} m_i\,\frac{\partial}{\partial m_i} g(m)\Bigr]^2 = \mathrm{var}\Bigl(\sum_{i=1}^{r}\frac{\partial}{\partial m_i} g(m)\, X^i\Bigr), \qquad b_i = \sum_{\substack{j+k=i \\ 1\le j,\,k\le r}} \frac{\partial}{\partial m_j} g(m)\,\frac{\partial}{\partial m_k} g(m).$$

Ex. (Hardy-Weinberg Equilibrium) N1 = #(AA), N2 = #(Aa), N3 = #(aa), θ = P(A), P1 = P(AA) = θ², P2 = P(Aa) = 2θ(1 − θ), P3 = P(aa) = (1 − θ)².


$$T_1(X) = \sqrt{\frac{N_1}{n}}, \qquad T_2(X) = 1 - \sqrt{\frac{N_3}{n}}, \qquad T_3(X) = \frac{N_1}{n} + \frac{N_2}{2n} \quad\text{(MLE)}$$

$$h_1(P_1, P_2, P_3) = \sqrt{P_1}, \qquad h_2(P_1, P_2, P_3) = 1 - \sqrt{P_3}, \qquad h_3(P_1, P_2, P_3) = P_1 + \tfrac{1}{2}P_2$$
Writing Xi for the indicator of genotype category i in a single observation (so var(Xi) = Pi(1 − Pi) and cov(Xi, Xj) = −Pi Pj), the theorem gives

$$\sigma_1^2 = \Bigl(\frac{1}{2\sqrt{P_1}}\Bigr)^2 \mathrm{var}(X_1) = \frac{1}{4P_1}\,P_1(1-P_1) = \frac{1-P_1}{4} = \frac{1-\theta^2}{4}$$

$$\sigma_2^2 = \Bigl(-\frac{1}{2\sqrt{P_3}}\Bigr)^2 \mathrm{var}(X_3) = \frac{1}{4P_3}\,P_3(1-P_3) = \frac{1-P_3}{4} = \frac{1-(1-\theta)^2}{4}$$

$$\begin{aligned}
\sigma_3^2 &= 1^2\,\mathrm{var}(X_1) + \Bigl(\tfrac{1}{2}\Bigr)^2 \mathrm{var}(X_2) + 2\cdot 1\cdot\tfrac{1}{2}\,\mathrm{cov}(X_1, X_2) \\
&= P_1(1-P_1) + \tfrac{1}{4}\,P_2(1-P_2) - P_1 P_2 \\
&= \theta^2(1-\theta^2) + \tfrac{1}{4}\cdot 2\theta(1-\theta)\bigl[1 - 2\theta(1-\theta)\bigr] - \theta^2\cdot 2\theta(1-\theta) \\
&= -\frac{\theta^2}{2} + \frac{\theta}{2} = \frac{\theta(1-\theta)}{2}.
\end{aligned}$$
Ex. σ̂² = m̂₂ − m̂₁², i.e. g(m1, m2) = m2 − m1², with

$$\frac{\partial}{\partial m_1} g(m_1, m_2) = -2m_1, \qquad \frac{\partial}{\partial m_2} g(m_1, m_2) = 1.$$

$$\therefore\; \sigma_g^2 = \mathrm{var}(-2m_1\cdot X + 1\cdot X^2) = \mathrm{var}\bigl((X - m_1)^2\bigr) = E(X - m_1)^4 - \sigma^4 = m_4 - 4m_1 m_3 + 8m_1^2 m_2 - 4m_1^4 - m_2^2. \qquad \square$$

* Asymptotic Relative Efficiency (ARE): suppose

$$T_n^{(1)}:\; n\,\sigma_{n,1}^2 = \sigma_1^2, \qquad T_n^{(2)}:\; n\,\sigma_{n,2}^2 = \sigma_2^2.$$

The asymptotic relative efficiency of T⁽¹⁾ to T⁽²⁾ is

$$e(\theta, T^{(1)}, T^{(2)}) = \frac{\sigma_2^2}{\sigma_1^2}.$$

Ex. (HWE)

$$\sigma_1^2 = \frac{1-\theta^2}{4}, \qquad \sigma_2^2 = \frac{1-(1-\theta)^2}{4}, \qquad e(\theta, T_1, T_2) = \frac{1-(1-\theta)^2}{1-\theta^2}.$$

T1 is better when θ > 1/2, and T1 ≈ T2 when θ = 1/2.
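A short numeric sketch of this efficiency comparison (the grid of θ values is an arbitrary choice):

```python
import numpy as np

thetas = np.linspace(0.1, 0.9, 9)
sigma1_sq = (1 - thetas**2) / 4            # asymptotic variance of T1
sigma2_sq = (1 - (1 - thetas) ** 2) / 4    # asymptotic variance of T2
are = sigma2_sq / sigma1_sq                # e > 1  <=>  T1 is more efficient

for t, e in zip(thetas, are):
    print(f"theta={t:.1f}  e(T1,T2)={e:.3f}")
# the ratio crosses 1 exactly at theta = 1/2
```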

Does the asymptotic variance σ²(θ) satisfy

$$\sigma^2(\theta) \ge \frac{\{\psi'(\theta)\}^2}{I_1(\theta)}\;?$$

This generally holds.

Ex. Sample variance when X ∼ Poisson(θ):

$$\sigma^2(\theta) = E(X - \theta)^4 - \theta^2 = \theta + 3\theta^2 - \theta^2 = \theta(1 + 2\theta)$$

$$q'(\theta) = \frac{\partial}{\partial\theta}\,\theta = 1$$

$$I_1(\theta) = E\Bigl[\frac{\partial}{\partial\theta}\log\frac{\theta^X e^{-\theta}}{X!}\Bigr]^2 = E\Bigl[\frac{X}{\theta} - 1\Bigr]^2 = \frac{\mathrm{var}(X)}{\theta^2} = \frac{1}{\theta}$$

$$\therefore\;\text{Information bound} = \theta \le \theta(1 + 2\theta).$$
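A quick simulation of this comparison (a sketch; θ, n, and the number of replications are arbitrary assumptions), checking that the scaled variance of the sample variance exceeds the bound while X̄ attains it:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 2.0, 400, 50_000            # assumed example values

X = rng.poisson(theta, size=(reps, n))
s2 = X.var(axis=1)                           # sample variance with 1/n normalization

print(n * s2.var(), theta * (1 + 2 * theta))   # ~ theta(1 + 2*theta) > theta
print(n * X.mean(axis=1).var(), theta)         # Xbar attains the bound theta
```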

* Asymptotically efficient: σ²(θ) attains the information bound.

* MLEs are asymptotically efficient.

Ex. X ∼ Poisson(θ). The MLE for θ is X̄, and

$$\mathrm{var}(\bar X) = \frac{\theta}{n},$$

achieving the information bound. □

Ex. X ∼ Bin(θ):

$$I(\theta) = E\Bigl[\frac{\partial}{\partial\theta}\log\bigl(\theta^X(1-\theta)^{1-X}\bigr)\Bigr]^2 = E\Bigl[\frac{X}{\theta} - \frac{1-X}{1-\theta}\Bigr]^2 = E\Bigl[\frac{(X-\theta)^2}{\theta^2(1-\theta)^2}\Bigr] = \frac{1}{\theta(1-\theta)}$$

∴ q(X̄) is asymptotically efficient, since its asymptotic variance [q′(θ)]²θ(1 − θ) equals [q′(θ)]²/I(θ). □

*Proof of MLE’s asymptotic efficiency


Let ℓᵢ(θ) = ∂/∂θ log p(Xᵢ, θ), and let θ̂ be the MLE. Then

$$0 = \sum_i \ell_i(\hat\theta) \approx \sum_i \ell_i(\theta) + \Bigl[\sum_i \ell_i'(\theta^*)\Bigr](\hat\theta - \theta),$$

so

$$\sqrt{n}\,(\hat\theta - \theta) \approx \frac{-\frac{1}{\sqrt{n}}\sum_i \ell_i(\theta)}{\frac{1}{n}\sum_i \ell_i'(\theta^*)} \;\xrightarrow[\text{Slutsky}]{\text{CLT}}\; N\Bigl(0, \frac{1}{I_1(\theta)}\Bigr),$$

since (1/√n) Σᵢ ℓᵢ(θ) converges in distribution to N(0, I1(θ)) by the CLT, and (1/n) Σᵢ ℓᵢ′(θ*) converges in probability to −I1(θ). Hence the MLE is asymptotically normal and efficient.

(The MLE is not the only method guaranteeing asymptotic optimality: UMVUEs, one-step Newton approximations to the MLE, and Bayes estimates can also achieve it.)

Ex. q(θ) = θ(1 − θ)


$$\text{UMVUE} = \frac{n}{n-1}\,\bar X(1 - \bar X)$$

$$\sqrt{n}\,\bigl\{T_n - \theta(1-\theta)\bigr\} = \frac{n}{n-1}\,\sqrt{n}\,\Bigl\{\bar X_n(1-\bar X_n) - \frac{n-1}{n}\,\theta(1-\theta)\Bigr\}$$

has the same limit as

$$\sqrt{n}\,\bigl\{\bar X_n(1-\bar X_n) - \theta(1-\theta)\bigr\}.$$

∴ the UMVUE has the same asymptotic distribution as the MLE. □
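This agreement can be seen numerically (a sketch; θ, n, and the number of replications are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 0.3, 500, 100_000       # assumed example values
q = theta * (1 - theta)

X = rng.binomial(1, theta, size=(reps, n))
xbar = X.mean(axis=1)
mle = xbar * (1 - xbar)                  # plug-in (MLE) estimate of theta(1-theta)
umvue = n / (n - 1) * mle                # bias-corrected UMVUE

z_mle = np.sqrt(n) * (mle - q)
z_umvue = np.sqrt(n) * (umvue - q)
print(z_mle.var(), z_umvue.var(), (1 - 2 * theta) ** 2 * theta * (1 - theta))
# all three agree: the common N(0, [q'(theta)]^2 theta(1-theta)) limit
```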
