
Advanced Mathematical Statistics I

Dr. Bing-Yi JING

Math, HKUST
Email: majing@ust.hk

October 13, 2006


Contents

1 Probability Models 2
1.1 Exponential family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Canonical form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Natural parameter space is a convex set . . . . . . . . . . . . . . . 4
1.1.4 Exponential family of full rank . . . . . . . . . . . . . . . . . . . . 4
1.1.5 Exponential family is closed under independent sum, marginal and
conditional operations . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.6 T (X) is also in another exponential family . . . . . . . . . . . . . . 5
1.1.7 A useful property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.8 Examples of exponential family . . . . . . . . . . . . . . . . . . . . 5
1.2 Location-scale family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Principles of Data Reduction 7


2.1 Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Minimal Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Some simple rule to find minimal sufficient statistics . . . . . . . . . 11
2.2.2 Minimal Sufficiency for Exponential Family of Full Rank . . . . . . 14
2.2.3 Minimal Sufficiency for Location Family . . . . . . . . . . . . . . . 16
2.3 Complete Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Full rank exponential family is complete . . . . . . . . . . . . . . . 19
2.3.2 Examples of Complete Families . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Examples of Non-Complete Families . . . . . . . . . . . . . . . . . 21
2.4 Ancillary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Relationship between Minimal Sufficient, Complete and Ancillary Statistics 22
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Unbiased Estimation 25
3.1 Criteria of estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Do unbiased estimators always exist? . . . . . . . . . . . . . . . . . . . . . 26
3.3 Characterization of a UMVUE. . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Existence of UMVUE among unbiased estimators . . . . . . . . . . . . . . 29
3.5 Searching UMVUE’s by sufficiency and completeness . . . . . . . . . . . . 35
3.6 Some examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7 UMVUE in nonparametric families. . . . . . . . . . . . . . . . . . . . . . . 45
3.8 Failure of UMVUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.9 Procedures to find a UMVUE . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10 Information inequality (or Cramer-Rao lower bound) . . . . . . . . . . . . 46
3.10.1 Information inequality . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.10.2 Some properties of the Fisher information matrix . . . . . . . . . . 49
3.10.3 Some examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.10.4 A necessary and sufficient condition for reaching Cramer-Rao lower
bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.10.5 Information inequality in exponential family . . . . . . . . . . . . . 56
3.10.6 Information inequality in location-scale family . . . . . . . . . . . . 57
3.10.7 Regularity conditions must be checked . . . . . . . . . . . . . . . . 58
3.10.8 Relative Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.11 Some comparisons about the UMVUE and MLE’s . . . . . . . . . . . . . . 62
3.11.1 Difficulties with UMVUE’s . . . . . . . . . . . . . . . . . . . . . . . 62
3.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4 The Method of Maximum Likelihood 64


4.1 Maximum Likelihood Estimate (MLE) . . . . . . . . . . . . . . . . . . . . 64
4.2 Invariance property of MLE . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Some examples of MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 MLE’s in exponential families . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Numerical solution to likelihood equations . . . . . . . . . . . . . . . . . . 72
4.6 Problems with MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6.1 MLE may not be unique . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6.2 MLE may not be asymptotically normal . . . . . . . . . . . . . . . 74
4.6.3 MLE may be inadmissible . . . . . . . . . . . . . . . . . . . . . . . 74
4.6.4 MLE can be inconsistent . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5 Asymptotically Efficient Estimation 78


5.1 Efficiency vs. Super-Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Definition of Asymptotic Efficiency . . . . . . . . . . . . . . . . . . . . . . 81
5.3 MLE is Asymptotic Efficient . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Approach I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.2 Approach II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4 Some further examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.1 Some comparisons about the UMVUE and MLE’s . . . . . . . . . . 87

6 Other Methods of Estimations 88


6.1 Method of moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Bayesian Method of Estimation . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

Chapter 1

Probability Models

Broadly speaking, there are two types of models.

1. Parametric models: A family of distributions F = {Fθ : θ ∈ Θ} is said


to be a parametric family iff Θ ⊂ Rd for some fixed positive integer d. The set
Θ is called the parameter space and d is called its dimension.
(i.e. The form of distributions is known except for some fixed number of
unknown parameters.)
A parametric model refers to the assumption that the population distribution
is in a parametric family. A parametric model with very high dimension d is
not very useful.
A parametric family is identifiable iff θ1 ≠ θ2 implies that Fθ1 ≠ Fθ2. (If we
know a sample is from Fθ , then we can uniquely determine (or identify) θ.
Therefore, we can talk about estimating θ.)
e.g. Normal family: F = {N (µ, σ 2 ) : µ ∈ R, σ 2 > 0}. Here θ = (µ, σ 2 ).

2. Non-parametric models: not parametric models. Here, minimal assumptions are made regarding the form of distributions. e.g.
(a) All continuous d.f.’s.
(b) All symmetric d.f.’s.
(c) All d.f. with finite moments of order r > 0.
(d) All d.f. with p.d.f.’s.
(e) All d.f. continuous and symmetric.

There are also so called Semi-parametric models, which are mixtures of parametric
and nonparametric models (such as Cox’s model). They could be thought to include
parametric and nonparametric models as special cases. On the other hand, they can also
be regarded as nonparametric models.
In this course, we shall mostly concentrate on parametric models. Two very useful
parametric models will be introduced below:

(a). exponential family; (b). location-scale family.

1.1 Exponential family
1.1.1 Definition
A family F = {Fθ : θ ∈ Θ} of distributions is said to be a k-parameter exponential
family if the distributions Fθ have densities of the form (with respect to either Lebesgue or counting measure)

fθ(x) = exp{ Σ_{j=1}^k ηj(θ) Tj(x) − B(θ) } h(x),

with respect to some common measure µ. Here all the functions are real-valued and h(x) ≥ 0.

Remarks.
(1) Since ∫ fθ(x) dx = 1, we find that

B(θ) = ln [ ∫ exp{ Σ_{j=1}^k ηj(θ) Tj(x) } h(x) dx ].

Therefore, ηj (θ)’s are “free” parameters, while B(θ) depends on ηj ’s and hence
not a free “parameter”.

(2) Many common families are exponential families. These include the nor-
mal, gamma, beta distributions, and binomial, negative binomial etc. The
exponential families have many nice properties, as can be seen below.

(3) The above representation is not unique, since ηj Tj = (cηj )(Tj /c).

1.1.2 Canonical form


If we reparametrize ηj = ηj(θ), then we can rewrite the family in the canonical form

fη(x) = exp{ Σ_{j=1}^k ηj Tj(x) − A(η) } h(x).

Here η = (η1 , ..., ηk ) are called the natural parameters.

Remarks.
(1) Since ∫ fη(x) dx = 1, we find that

A(η) = ln [ ∫ exp{ Σ_{j=1}^k ηj Tj(x) } h(x) dx ].

(2) The above representation is not unique.

1.1.3 Natural parameter space is a convex set
The set Ξ = {η = (η1, ..., ηk) : ∫ exp{ Σ_{j=1}^k ηj Tj(x) } h(x) dx < ∞} is called the natural
parameter space of the canonical family. The set Θ = {θ : (η1 (θ), ..., ηk (θ)) ∈ Ξ} is
called the parameter space of the original exponential family. We now show that Ξ is
a convex set. However, Θ may not necessarily be a convex set.

Theorem 1.1.1 The natural parameter space of an exponential family is convex.

Proof. Let η and ξ be two points in the natural parameter space, and let 0 < λ < 1. By the convexity of the exponential function,

e^{λx+(1−λ)y} ≤ λ e^x + (1−λ) e^y.

Hence

∫ e^{λ ηT(x)′ + (1−λ) ξT(x)′} h(x) dx ≤ λ ∫ e^{ηT(x)′} h(x) dx + (1−λ) ∫ e^{ξT(x)′} h(x) dx < ∞,

which completes the proof.

Alternatively, we could use Hölder's inequality to get

∫ e^{λ ηT(x)′ + (1−λ) ξT(x)′} h(x) dx ≤ ( ∫ e^{ηT(x)′} h(x) dx )^λ ( ∫ e^{ξT(x)′} h(x) dx )^{1−λ} < ∞.

1.1.4 Exponential family of full rank


If the canonical family is minimal in the sense that neither the T ’s nor η’s satisfy a linear
constraint, and Ξ contains a k-dimensional rectangle (or a k-dimensional open set), then
the family is said to be of full rank.
(As can be seen later, exponential family of full rank implies minimal sufficiency,
completeness, attaining the Cramer-Rao lower bound in the information inequality, etc.)

1.1.5 Exponential family is closed under independent sum, marginal and conditional operations
Theorem 1.1.2 If X and Y are independent and they are from two k-parameter expo-
nential families, then (X, Y ) is again an exponential family.

Theorem 1.1.3 If X1, ..., Xn belong to a k-parameter exponential family, then the joint
density of (X1, ..., Xn) is again an exponential family of the form

fθ(x) = exp{ Σ_{j=1}^k ηj(θ) [ Σ_{i=1}^n Tj(xi) ] − nB(θ) } h(x1) h(x2) ... h(xn).

The proofs of these theorems are trivial and omitted here. (One can use m.g.f.
method.)

1.1.6 T (X) is also in another exponential family
Theorem 1.1.4 If X belongs to a k-parameter exponential family, then the distribution
of T = (T1(X), ..., Tk(X)) is again an exponential family of the form

fT(t) = fT(t1, ..., tk) = exp{ Σ_{j=1}^k ηj(θ) tj − B(θ) } h₀(t).

Proof. The proof can be shown using m.g.f.'s. Here we shall illustrate a
proof for discrete r.v.'s. In such a case,

P(T1 = t1, ..., Tk = tk) = Σ_{T1(x)=t1,...,Tk(x)=tk} P(X = x)
  = Σ_{T1(x)=t1,...,Tk(x)=tk} exp{ Σ_{i=1}^k ηi(θ) Ti(x) − B(θ) } h(x)
  = Σ_{T1(x)=t1,...,Tk(x)=tk} exp{ Σ_{i=1}^k ηi(θ) ti − B(θ) } h(x)
  = exp{ Σ_{i=1}^k ηi(θ) ti − B(θ) } [ Σ_{T1(x)=t1,...,Tk(x)=tk} h(x) ]
  = exp{ Σ_{i=1}^k ηi(θ) ti − B(θ) } h₀(t),

where t = (t1, ..., tk).

1.1.7 A useful property


There are simple formulas to calculate the m.g.f., the mean, and variance of an exponential
family.
Theorem 1.1.5 For any integrable function f and any η in the interior of Ξ, the integral
∫ f(x) exp{ Σ_i ηi Ti(x) } h(x) dx

is continuous and has derivatives of all orders with respect to the η’s, and these can be
obtained by differentiating under the integral sign.
Proof. See “Testing Statistical Hypotheses” by Lehmann (Chapter 2, Theorem 9)
and in Barndorff-Nielsen (1978, Section 7.1).
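
A small numerical sketch (not part of the original notes) of what this theorem buys us: differentiating under the integral sign yields the standard identities E[Tj] = ∂A/∂ηj and Cov(Ti, Tj) = ∂²A/∂ηi∂ηj for the canonical family. Below this is checked for the Poisson(λ) family, where T(x) = x, η = log λ and A(η) = e^η; the numbers (λ = 3) are chosen only for illustration.

```python
# Sketch: mean and variance of T from derivatives of A(eta), Poisson(lambda) case.
import numpy as np

lam = 3.0
eta = np.log(lam)
h = 1e-5
A = lambda e: np.exp(e)                                   # A(eta) for the Poisson family

mean_from_A = (A(eta + h) - A(eta - h)) / (2 * h)         # ~ A'(eta) = lambda
var_from_A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # ~ A''(eta) = lambda

rng = np.random.default_rng(0)
x = rng.poisson(lam, size=200_000)
print(mean_from_A, x.mean())   # both close to 3
print(var_from_A, x.var())     # both close to 3
```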

1.1.8 Examples of exponential family

e.g. Binomial Bin(n, p)

e.g. Normal family N (µ, σ 2 ).

e.g. Multivariate family.
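
As a worked sketch (not part of the original notes) of the first item, the Bin(n, p) pmf can be rewritten in the canonical form fη(x) = exp{ηT(x) − A(η)}h(x) with T(x) = x, η = log(p/(1−p)), A(η) = n log(1 + e^η) and h(x) = C(n, x). The check below simply verifies that this factorization reproduces the usual pmf; the values n = 10, p = 0.3 are arbitrary.

```python
# Sketch: Bin(n, p) written in canonical exponential-family form.
import math

def binom_pmf_expfam(x, n, p):
    eta = math.log(p / (1 - p))
    A = n * math.log(1 + math.exp(eta))
    return math.exp(eta * x - A) * math.comb(n, x)

def binom_pmf_direct(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
for x in range(n + 1):
    assert abs(binom_pmf_expfam(x, n, p) - binom_pmf_direct(x, n, p)) < 1e-12
print("exponential-family form matches the binomial pmf")
```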

1.2 Location-scale family
Theorem. Let f (x) be any p.d.f.. For any constants µ and σ > 0, denote θ = (µ, σ).
Then the function

gθ(x) = (1/σ) f((x − µ)/σ)
is also a p.d.f..

Definition. Let f (x) be any p.d.f.. Then we define

Location family: { f(x − µ) : µ ∈ R };

Scale family: { (1/σ) f(x/σ) : σ > 0 };

Location-scale family: { (1/σ) f((x − µ)/σ) : µ ∈ R, σ > 0 },

where µ is called the location parameter, σ the scale parameter.
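
A minimal numerical sketch (not from the notes) of the construction above: starting from an arbitrary base pdf f, the location-scale density gθ(x) = f((x − µ)/σ)/σ is formed and checked to still integrate to 1. The standard logistic base pdf and the (µ, σ) pairs are illustrative choices only.

```python
# Sketch: build a location-scale density from a base pdf and check it integrates to 1.
import numpy as np

def base_pdf(x):                                  # base pdf f: standard logistic
    return np.exp(-x) / (1 + np.exp(-x))**2

def location_scale_pdf(x, mu, sigma):             # g_theta(x) = f((x-mu)/sigma)/sigma
    return base_pdf((x - mu) / sigma) / sigma

xs = np.linspace(-60, 60, 200_001)
for mu, sigma in [(0.0, 1.0), (2.5, 0.5), (-1.0, 3.0)]:
    total = np.trapz(location_scale_pdf(xs, mu, sigma), xs)
    print(mu, sigma, round(total, 4))             # all close to 1
```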

Chapter 2

Principles of Data Reduction

2.1 Sufficiency
Suppose X1 , · · · , Xn ∼ Fθ and we are interested in estimating the parameter θ. We’d like
to reduce the original data set to a smaller one which captures all the information about θ
contained in the sample. In other words, the smaller data set is “sufficient” for estimating
θ.
Throughout the course, we shall use

X = (X1 , · · · , Xn ), x = (x1 , · · · , xn ),

to denote a random sample and its realization, respectively.

Definition: A statistic T(X) is sufficient for θ if the conditional distribution of X given T(X) does not depend on θ.

Example. It is obvious that T = X = {X1 , · · · , Xn } is always sufficient.

Example. Suppose that {X1 , . . . , Xn } is a random sample (i.i.d.) from a continuous


distribution. Show that the order statistics T = (X(1) , ..., X(n) ) is sufficient.
Proof. Continuity of d.f. ensures no ties in the observations a.s.. For any t = (t1 , ..., tn ),

P(X = x | T = t) = 0,    if {x1, ..., xn} ≠ {t1, ..., tn},
                 = 1/n!, if {x1, ..., xn} = {t1, ..., tn},

which does not depend on any parameters. So T is sufficient.

Remark. If {X1 , . . . , Xn } is a random sample from a discrete distribution, then the


order statistics T = (X(1) , ..., X(n) ) is sufficient. This can be shown by the Factorization
Theorem to be introduced later.
Example. Let X1, · · · , Xn ∼ Bernoulli(p). Show that T = Σ Xi is sufficient for p.
Proof. Note that T ∼ Bin(n, p). So

P(X = x | T = t) = 0,  if Σ xi ≠ t,
                 = P(X = x, Σ Xi = t) / P(Σ Xi = t),  if Σ xi = t.

If Σ xi = t,

LHS = P(X = x) / P(Σ Xi = t) = ∏_{i=1}^n P(Xi = xi) / P(Σ Xi = t)
    = p^{Σ xi} (1 − p)^{n − Σ xi} / [ C(n,t) p^t (1 − p)^{n−t} ] = 1/C(n,t),

which does not depend on p. So T is sufficient.
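
The following small simulation (not part of the original notes) illustrates the calculation above: conditionally on T = t, every arrangement of the sample has probability 1/C(n, t), whatever p is. The values n = 4, t = 2 and the grid of p's are arbitrary.

```python
# Sketch: empirical conditional distribution of X given T = sum(X) for Bernoulli samples.
from collections import Counter
import numpy as np

def conditional_freqs(p, n=4, t=2, reps=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, p, size=(reps, n))
    rows = x[x.sum(axis=1) == t]                  # keep only samples with T = t
    return Counter(tuple(r) for r in rows), len(rows)

for p in (0.2, 0.5, 0.8):
    freqs, m = conditional_freqs(p)
    probs = sorted(c / m for c in freqs.values())
    print(p, [round(q, 3) for q in probs])        # ~1/C(4,2) = 1/6 for every arrangement
```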

Example. Suppose X1 ∼ Poisson(µ) and X2 ∼ Poisson(λ) are independent. Assume that
λ = kµ for some known k. Show that T = X1 + X2 is sufficient for µ.
Proof. Use the fact that X1 | {T = t} ∼ Bin(t, p) with p = µ/(µ + λ). The rest is left as an exercise.
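
A quick simulation check (not from the notes) of the fact quoted in the proof: with λ = kµ, the conditional law of X1 given T = t is Bin(t, 1/(1 + k)), which involves only the known k and not µ. The constants µ = 2, k = 3, t = 6 are illustrative.

```python
# Sketch: X1 | X1 + X2 = t is Bin(t, 1/(1+k)) when X2 ~ Poisson(k*mu).
import numpy as np
from scipy import stats

mu, k, t = 2.0, 3.0, 6
rng = np.random.default_rng(1)
x1 = rng.poisson(mu, 500_000)
x2 = rng.poisson(k * mu, 500_000)
sel = x1[x1 + x2 == t]
emp = np.bincount(sel, minlength=t + 1) / len(sel)
theo = stats.binom.pmf(np.arange(t + 1), t, 1 / (1 + k))
print(np.round(emp, 3))
print(np.round(theo, 3))     # the two rows agree, and neither depends on mu
```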

The following result provides an easy way to find a sufficient statistic.

Factorization Theorem. Let X = {X1 , · · · , Xn } be iid with pdf fθ (x). Then a statistic
T (X) is a sufficient statistic for θ iff we can decompose the joint pdf into

fθ (x) = g (T (x), θ) h(x) .

Proof. A complete proof of the theorem is complicated (involving measure-theoretic
considerations; see Lehmann (1959)). However, the proof for the discrete case is relatively
easy and given below.

“⇐=”. Suppose fθ (x) = P (X = x) = g (T (x), θ) h (x).

P(X = x | T = t) = 0,  if T(x) ≠ t,
                 = P(X = x, T(X) = t) / P(T(X) = t),  if T(x) = t.

If T(x) = t, then

P(T(X) = t) = Σ_{T(x)=t} P(X = x) = Σ_{T(x)=t} g(t, θ) h(x) = g(t, θ) Σ_{T(x)=t} h(x).

So if T(x) = t,

P(X = x, T(X) = t) / P(T(X) = t) = P(X = x) / P(T(X) = t) = g(t, θ) h(x) / [ g(t, θ) Σ_{T(x)=t} h(x) ] = h(x) / Σ_{T(x)=t} h(x),

which does not depend on θ. So T (X) is sufficient.

“=⇒”. If T(X) is sufficient, then

fθ(x) = P(X = x) = P(X = x, T(X) = T(x))    (since {ω : X(ω) = x} ⊂ {ω : T(X(ω)) = T(x)})
      = P(T(X) = T(x)) P(X = x | T(X) = T(x))
      = g(T(x), θ) h(x), say.

The proof is complete.

Example. Assume X1 , · · · , Xn is a random sample from
(1) Bernoulli(p). (2) U [0, θ]. (3) N (µ, σ 2 ). (4) N (σ, σ 2 ).
Find sufficient statistics for the parameters involved.
Solution.

(1) T = Σ Xi;  (2) T = X(n);  (3) T = (Σ Xi, Σ Xi²);  (4) T = (Σ Xi, Σ Xi²).

The details are left as an exercise.

Remark 2.1.1 Clearly, it is much easier to use Factorization Theorem than to use the
direct approach.

2.2 Minimal Sufficiency
While sufficient statistics provide a reduction to the data, minimal sufficient statistics
provide the maximum reduction to the data.

Definition: A statistic T (X) is called a minimal sufficient statistic if, for any other
sufficient statistic S(X), T (X) is a function of S(X).

Minimal sufficient statistics exist under very mild conditions. e.g. see Bahadur (1959),
or Shao Jun (1999) for reference.
For two r.v.’s S, T , by S = T we mean that S = T a.s. w.r.t. P . (i.e. P (S = T ) = 1
for all P ∈ {Pθ , θ ∈ Θ}).

Theorem 2.2.1 .

(1). If T = h(S) and T is sufficient, then S is also sufficient.


(2). Let T = h(S) and h is 1-1. Then T is (minimal) sufficient iff S is
(minimal) sufficient.

Proof. (1). Since T is sufficient, the Factorization Theorem gives fθ(x) = gθ(T(x))k(x) = gθ(h(S(x)))k(x), which implies that S(X) is sufficient, by the Factorization Theorem again.

(2). The proof follows from (1) and the definition.

Let us first give two examples of minimal sufficiency using the definition.

Example. X1 , . . . , Xn ∼ U(θ, θ + 1). Find a minimal sufficient stat for θ.


Solution. Note that fθ(x) = I{θ < x < θ + 1}. Then

fθ(x) = ∏_{i=1}^n I{θ < xi < θ + 1} = I{θ < x(1) < x(n) < θ + 1} = I{x(n) − 1 < θ < x(1)}.    (2.1)
By the factorization theorem, T = (X(1), X(n)) is a sufficient statistic. Let us now show
that it is also minimal.
Suppose that S(X) is another sufficient statistic. Then by the factorization theorem,
we have
fθ (x) = g (S(x), θ) h (x) .
We shall only consider those x s.t. h(x) > 0, so the following conclusions hold a.s. It
follows easily from (2.1) that

x(1) = sup{θ : fθ (x) > 0} = sup{θ : g (S(x), θ) > 0},


x(n) = inf{θ : fθ (x) > 0} + 1 = inf{θ : g (S(x), θ) > 0} + 1.
That is, T(x) = (x(1), x(n)) is a function of S(x). By the definition, T(X) = (X(1), X(n))
is a minimal sufficient statistic.

Example. X1, . . . , Xn ∼ Bin(1, p). Find a minimal sufficient stat for p.

Solution. By the Factorization Theorem, T = Σ Xi is sufficient. Let us now show that
it is also minimal.
If T is not minimal, then by definition there exists a sufficient statistic S(X) such that
T(X) is NOT a function of S(X). (Recall that T is a function of S ⇐⇒ S(x) = S(y) implies
T(x) = T(y). Therefore, if T is NOT a function of S, then there exist sample points with
S(x) = S(y) but T(x) ≠ T(y).) In other words, there exists a function h (not
1-1) such that S = h(T) is sufficient. (Intuitively, this is clear: if T is NOT minimal
sufficient, we can always genuinely reduce further to some minimal S.) There must exist at
least two integers t1, t2 with 0 ≤ t1 ≠ t2 ≤ n such that h(t1) = h(t2) ≡ s. (For simplicity,
we assume that there are exactly two t's s.t. h(t) = s.) Therefore, A ≡ P(T = t1 | S = s)
does not depend on p since S is sufficient. However,

A = P(T = t1, S = s)/P(S = s) = P(T = t1)/P(h(T) = s)
  = P(T = t1) / [ P(T = t1) + P(T = t2) ]
  = C(n,t1) p^{t1} (1 − p)^{n−t1} / [ C(n,t1) p^{t1} (1 − p)^{n−t1} + C(n,t2) p^{t2} (1 − p)^{n−t2} ]
  = 1 / [ 1 + (C(n,t2)/C(n,t1)) p^{t2−t1} (1 − p)^{t1−t2} ],

which depends on p. This contradicts the previous conclusion. Therefore, T must
be minimal sufficient.

Remark 2.2.1 Since the above example is in an exponential family, there is a much simpler
way of finding a minimal sufficient statistic.

Remark 2.2.2 We use two different methods to find minimal sufficient statistics. In
both cases, we need to guess what it is, and then try to prove the case. It would be nice to
have some routine methods to find minimal sufficient statistics. We shall do this in the
next section. The uniform example will be looked at again in the next subsection (you will
see how easy it is in comparison with the approach here).

2.2.1 Some simple rule to find minimal sufficient statistics


Using the definition to find a minimal sufficient statistic is not an easy thing to do, as
seen in the last two examples. The next theorem provides an easier way to find a minimal
sufficient statistic.

Theorem 2.2.2 Let fθ (x) be the pdf of a sample X = (X1 , . . . , Xn ). Suppose that there
exists a function T (x) such that, for two sample points x and y, the ratio fθ (x)/fθ (y) is
free of θ (or does not depend on θ) iff T (x) = T (y). Then T (X) is a minimal sufficient
statistic for θ.

Proof. To simplify the proof, we only need to consider those x and θ where fθ (x) > 0.
So all conclusions hold a.s.

First we show that T (X) is a sufficient statistic. Let X be the sample space and
T = {t : t = T (x) for some x ∈ X } be the image of X under T (x). (A diagram here will
be helpful!) Define the partition sets induced by T (x) as At = {x : T (x) = t}. For each
At , choose and fix one element xt ∈ At . For any x ∈ X , xT (x) is the fixed element that is
in the same set At as x, where t = T (x). Since x and xT (x) are in the same set At (with
t = T (x)), we have T (x) = T (xT (x) ) = t, and hence it follows from the assumptions of
the theorem that fθ (x)/fθ (xT (x) ) is constant as a function of θ. Thus, we can define a
function on X by h(x) = fθ (x)/fθ (xT (x) ), which does not depend on θ. Define a function
on T by gθ (t) = fθ (xt ). Then it can be seen that

fθ(x) = fθ(x_{T(x)}) · [ fθ(x)/fθ(x_{T(x)}) ] = gθ(T(x)) h(x) = gθ(t) h(x).

By the factorization theorem, T (X) is a sufficient statistic for θ.

Next we show that T(X) is minimal. Let T′(X) be any other sufficient statistic. We
need to show that T(x) is a function of T′(x). Clearly, q : t′ = T′(x) → t = T(x) defines a
mapping as x runs through the sample space. To show that q is a function, it suffices to
prove that

T′(x) = T′(y) =⇒ T(x) = T(y).    (2.2)

By the Factorization Theorem, there exist functions g′ and h′ such that

fθ(x) = g′θ(T′(x)) h′(x).

Let x and y be any two sample points with T′(x) = T′(y). Then

fθ(x)/fθ(y) = g′θ(T′(x)) h′(x) / [ g′θ(T′(y)) h′(y) ] = h′(x)/h′(y).

Since this ratio does not depend on θ, the assumptions of the theorem imply that T(x) =
T(y). This proves (2.2). Thus, T(x) is a function of T′(x) and hence T(x) is minimal.

Remark 2.2.3 From the proof of the theorem, we see that the “if” part ensures that
T (X) is sufficient and “only if” part ensures T (X) is minimal.

Example. X1 , . . . , Xn ∼ U(θ, θ + 1). Find a minimal sufficient stat for θ.


Solution. It has been shown in the last subsection that

fθ (x) = I{x(n) − 1 < θ < x(1) }.

Then

r(θ) ≡ fθ(x)/fθ(y) = I{x(n) − 1 < θ < x(1)} / I{y(n) − 1 < θ < y(1)}.

Let T(x) = (x(1), x(n)). Clearly, if T(x) = T(y), then r(θ) is free of θ.
On the other hand, if T(x) ≠ T(y), then r(θ) will depend on θ. To give an example,
suppose that x(1) > y(1) and x(n) < y(n); then r(θ) could take the values 0 and 1, depending
on where θ is. That is, r(θ) is not free of θ. In other words, r(θ) is free of θ only if
T(x) = T(y).
Therefore, T = (X(1), X(n)) is a minimal sufficient statistic.

Example. Let X1 , . . . , Xn ∼ U (0, θ). Show that T = X(n) is minimal sufficient.


Proof. Left as an exercise.

Example. Let X1 , . . . , Xn ∼ U (0, θ), where θ > 1. Show that T = X(n) is sufficient but
not minimal sufficient.
Proof. First let us write down the pdf of X:

fθ(x) = (1/θ) I{0 < x < θ} I{θ > 1}.

The proof of sufficiency is easy by the Factorization Theorem. Let us now show it is
not minimal. Note that

r(θ) ≡ fθ(x)/fθ(y) = θ^{−n} I(0 < x(1) ≤ x(n) < θ) I{θ > 1} / [ θ^{−n} I(0 < y(1) ≤ y(n) < θ) I{θ > 1} ].    (2.3)

Clearly, if x(n) = y(n) , then the ratio r(θ) is free of θ. On the other hand, if r(θ) is free
of θ, x(n) may not be equal to y(n) . For instance, we could easily choose x(n) = a and
y(n) = b, where 0 < a < 1 and 0 < b < 1 but a ≠ b; then

r(θ) = I(x(1) > 0) I(x(n) < θ) I{θ > 1} / [ I(y(1) > 0) I(y(n) < θ) I{θ > 1} ]
     = I(x(1) > 0) I{θ > 1} / [ I(y(1) > 0) I{θ > 1} ] = I{θ > 1}/I{θ > 1} = 1,

which is free of θ. This implies that T = X(n) MAY NOT be minimal.

The reason that T = X(n) MAY NOT be minimal sufficient is that it does not take
the extra information about θ (namely, θ > 1) into account. A better estimator would
be T′(X) = max{X(n), 1}, which turns out to be minimal sufficient. To see why, noting
θ > 1, we can rewrite (2.3) as

r(θ) = I(x(1) > 0) I(x(n) < θ) I(1 < θ) / [ I(y(1) > 0) I(y(n) < θ) I(1 < θ) ]
     = I(x(1) > 0) I(max{x(n), 1} < θ) / [ I(y(1) > 0) I(max{y(n), 1} < θ) ]
     = I(x(1) > 0) I(T′(x) < θ) / [ I(y(1) > 0) I(T′(y) < θ) ].

Clearly, r(θ) is free of θ iff T′(x) = T′(y). Thus, T′(X) = max{X(n), 1} is minimal
sufficient.
Intuitively, it is quite clear why T 0 (X) = max{X(n) , 1} is a minimal sufficient statistic:
if the largest observation is smaller than 1 (1/2, say), clearly, it would not be wise to
choose X(n) = 1/2 as an estimator of θ since we already know that θ > 1. Instead, we
would be better off choosing 1 as an estimator of θ. Of course, if X(n) ≥ 1, then X(n)
would be a very good estimator of θ.

2.2.2 Minimal Sufficiency for Exponential Family of Full Rank
For an exponential family of full rank, finding minimal sufficient statistics is a relatively
easy matter.

Theorem 2.2.3 If P is in an exponential family of full rank with p.d.f.'s given by

fθ(x) = exp{ Σ_{j=1}^k ηj(θ) Tj(x) − A(η) } h(x),

then T(X) = (T1(X), ..., Tk(X)) is minimal sufficient for θ.

Proof. Sufficiency follows immediately from the Factorization Theorem. Let us prove
minimality next. Since the exponential family is of full rank, then there exists a θ0
such that η1 (θ) − η1 (θ0 ), ..., ηk (θ) − ηk (θ0 ) are linearly independent, for values of θ in a
neighborhood of θ0 . (Why?) Note that

fθ(x)/fθ(y) = [ fθ(x)/fθ0(x) ] / [ fθ(y)/fθ0(y) ] · fθ0(x)/fθ0(y)
            = exp{ Σ_{j=1}^k [ηj(θ) − ηj(θ0)] Tj(x) } / exp{ Σ_{j=1}^k [ηj(θ) − ηj(θ0)] Tj(y) } · fθ0(x)/fθ0(y)
            = exp{ Σ_{j=1}^k [ηj(θ) − ηj(θ0)] [Tj(x) − Tj(y)] } · fθ0(x)/fθ0(y).

By the necessary and sufficient condition for minimal sufficiency, if the above ratio is free
of θ, then we must have

Σ_{j=1}^k [ηj(θ) − ηj(θ0)] [Tj(x) − Tj(y)] = C(x, y),

where C (x, y) is a function of x and y only. In particular, if we take θ = θ0 , then


C (x, y) = 0. However, since ηj (θ) − ηj (θ0 ), j = 1, ..., k are linearly independent for θ in
a neighborhood of θ0 , we get

Tj (x) = Tj (y), j = 1, ..., k.

This completes our proof.

Alternative proof. Sufficiency follows immediately from the Factorization Theorem.


Let us prove minimality next. Denote aj ≡ aj(x, y) = Tj(x) − Tj(y), 1 ≤ j ≤ k. Then
note that

fθ(x)/fθ(y) = exp{ Σ_{j=1}^k ηj(θ) Tj(x) } h(x) / [ exp{ Σ_{j=1}^k ηj(θ) Tj(y) } h(y) ] = exp{ Σ_{j=1}^k aj ηj(θ) } h(x)/h(y)

is free of θ iff Σ_{j=1}^k aj ηj(θ) is free of θ. In other words,

Σ_{j=1}^k aj ηj(θ) = C(x, y).

Fix θ0 ∈ Θ; then Σ_{j=1}^k aj ηj(θ0) = C(x, y). Then

Σ_{j=1}^k aj [ηj(θ) − ηj(θ0)] = 0.    (2.4)

Let a = (a1, ..., ak) and η(θ) = (η1(θ), ..., ηk(θ)). Since the exponential family is of full rank,
there exist θ1, ..., θk such that η(θ1) − η(θ0), ..., η(θk) − η(θ0) form a basis of R^k, i.e. they
are linearly independent. But then (2.4) implies that a1 = a2 = ... = ak = 0, i.e. Ti(x) = Ti(y)
for all 1 ≤ i ≤ k.

Example. Let X1, . . . , Xn ∼ N(µ, σ²); find a minimal sufficient statistic for θ = (µ, σ²).

Solution. fθ(x) = (√(2π) σ)^{−1} e^{−(x−µ)²/(2σ²)}. Therefore,

fθ(x) = (√(2π) σ)^{−n} e^{− Σ_{i=1}^n (xi − µ)²/(2σ²)} = C(θ) exp{ (µ/σ²) Σ_{i=1}^n xi − (1/(2σ²)) Σ_{i=1}^n xi² }.

So k = 2,

η(θ) = (η1, η2) = ( µ/σ², −1/(2σ²) )

and

T(x) = (T1(x), T2(x)) = ( Σ_{i=1}^n xi, Σ_{i=1}^n xi² ).

It is easy to see that the natural parameter space Ξ is a half space in R2 , which certainly
contains a 2-dim rectangle. By the above theorem, T (X) is a minimal sufficient statistic.

Example. Let X1, . . . , Xn ∼ N(θ, θ²); find a minimal sufficient statistic for θ.

Solution. fθ(x) = (√(2π) |θ|)^{−1} e^{−(x−θ)²/(2θ²)}. Therefore,

fθ(x) = (√(2π) |θ|)^{−n} e^{− Σ_{i=1}^n (xi − θ)²/(2θ²)} = C(θ) exp{ (1/θ) Σ_{i=1}^n xi − (1/(2θ²)) Σ_{i=1}^n xi² }.

So k = 2,

η(θ) = (η1, η2) = ( 1/θ, −1/(2θ²) )

and

T(x) = (T1(x), T2(x)) = ( Σ_{i=1}^n xi, Σ_{i=1}^n xi² ).

Clearly, the natural parameter space Ξ is a curve in R2 , which certainly DOES NOT
contain a 2-dim rectangle. Therefore, one can not use the very last theorem to find the
minimal sufficient statistic. However, we can use the simple rule given earlier. Note that

r(θ) ≡ fθ(x)/fθ(y) = exp{ (1/θ)(T1(x) − T1(y)) − (1/(2θ²))(T2(x) − T2(y)) }

is free of θ iff

(1/θ)(T1(x) − T1(y)) − (1/(2θ²))(T2(x) − T2(y)) = C(x, y) = 0,
θ 2θ

where the last equality is obtained by letting θ → ∞ on both sides. Again, since θ is
arbitrary, we get (by taking θ = 1 and θ = 2)

T1 (x) − T1 (y) = 0, T2 (x) − T2 (y) = 0.


Therefore, T(X) = (Σ Xi, Σ Xi²) is a minimal sufficient statistic.

2.2.3 Minimal Sufficiency for Location Family


If X1 , . . . , Xn is a random sample from a location family with pdf f (x − θ), where f is
assumed to be known, then

fθ (x) = f (x1 − θ) · · · f (xn − θ) = f (x(1) − θ) · · · f (x(n) − θ).


Clearly, the order statistic T(X) = (X(1), ..., X(n)) is a sufficient statistic. (Recall that
we have considered the case where F (x) is assumed to be continuous. Here, it could be
discrete, where ties are allowed). However, this reduction only uses the i.i.d. assumption.
Neither the structure of fθ (x) nor the knowledge of f is utilized. In some cases, further
reduction is possible while, in other cases, the order statistic is already the maximum
reduction (i.e. minimal sufficiency).

Example. If X1 , . . . , Xn ∼ N (µ, 1) with pdf φ(x−µ), then clearly X̄ is minimal sufficient.


(Here, X̄ also belongs to an exponential family.)

Example. If X1 , . . . , Xn ∼ exp(θ, 1) with pdf f (x − θ) = ex−θ I(x > θ), then it can be
shown that X(1) is minimal sufficient. But if we have restriction that θ > 1, then X(1) is
no longer minimal sufficient.

Example. If X1, . . . , Xn ∼ U(θ, θ + 1), then it has been shown that (X(1), X(n)) is minimal
sufficient.

In all the above examples, sufficiency allows us to reduce the original n-dim data to
one or two-dim data. However, this is not possible for the following examples.

Example. If X1 , . . . , Xn ∼ Cauchy with pdf given by


fθ(x) = π^{−1} ( 1 + (x − θ)² )^{−1}.

Show that T(X) = (X(1), ..., X(n)) is minimal sufficient.
Proof. Now

r(θ) ≡ fθ(x)/fθ(y) = ∏_{k=1}^n (1 + (yk − θ)²)/(1 + (xk − θ)²) = ∏_{k=1}^n (1 + (y(k) − θ)²)/(1 + (x(k) − θ)²).

(1). If x(k) = y(k) for all k = 1, 2, ..., n, then clearly the ratio r(θ) is free of θ.

(2). If r(θ) is free of θ, then r(θ) = lim_{θ→∞} r(θ) = 1. So we have
∏_{k=1}^n (1 + (yk − θ)²) = ∏_{k=1}^n (1 + (xk − θ)²).

Since both sides are polynomials of θ of degree 2n, they must have the same roots as well.
That is,

{xk ± √(−1), k = 1, ..., n} = {yk ± √(−1), k = 1, ..., n},
which is equivalent to {xk , k = 1, ..., n} = {yk , k = 1, ..., n}, which is again equivalent to
x(k) = y(k) for all k = 1, 2, ..., n.
Hence T(X) = (X(1), ..., X(n)) is a minimal sufficient statistic.

Example. If X1 , . . . , Xn ∼ logistic distribution with pdf given by


fθ(x) = exp{−(x − θ)} / (1 + exp{−(x − θ)})² = f(x − θ),  where f(x) = e^{−x}/(1 + e^{−x})².

Show that T(X) = (X(1), ..., X(n)) is minimal sufficient.
Proof. Now

r(θ) ≡ fθ(x)/fθ(y) = exp{− Σ (xi − yi)} ∏_{k=1}^n [ (1 + exp{−(yk − θ)}) / (1 + exp{−(xk − θ)}) ]²
                   = exp{− Σ (xi − yi)} ∏_{k=1}^n [ (1 + exp{−yk} η) / (1 + exp{−xk} η) ]²,

where η = exp{θ} > 0.


(1). If x(k) = y(k) for all k = 1, 2, ..., n, then clearly the ratio r(θ) is free of θ.

(2). If r(θ) is free of θ, then r(θ) = lim_{θ→−∞} r(θ) = lim_{η→0} r(θ) = exp{− Σ (xi − yi)},
which results in

∏_{k=1}^n (1 + exp{−xk} η) = ∏_{k=1}^n (1 + exp{−yk} η),  for all η > 0.

Both sides are polynomials in η of degree n, which holds for all η > 0, so that coefficients
of the two polynomials are also the same. Therefore, the two polynomials are equal for
all values of η (real or imaginary). As a result, they must have the same roots as well.
That is,
{− exp{xk }, k = 1, ..., n} = {− exp{yk }, k = 1, ..., n}
which is equivalent to {xk , k = 1, ..., n} = {yk , k = 1, ..., n}, which is again equivalent to
x(k) = y(k) for all k = 1, 2, ..., n.
Hence T(X) = (X(1), ..., X(n)) is a minimal sufficient statistic.

Example. If X1, . . . , Xn ∼ double exponential with pdf given by

fθ(x) = (1/2) e^{−|x−θ|}.

Show that T(X) = (X(1), ..., X(n)) is minimal sufficient.
Proof.

r(θ) = fθ(y)/fθ(x) = exp{ Σ_{i=1}^n |xi − θ| } / exp{ Σ_{i=1}^n |yi − θ| }.
If r(θ) is free of θ, then
lim_{θ→∞} r(θ) = exp{ Σ_{i=1}^n (θ − xi) } / exp{ Σ_{i=1}^n (θ − yi) } = exp{ Σ_{i=1}^n (yi − xi) } = A,
lim_{θ→−∞} r(θ) = exp{ Σ_{i=1}^n (xi − θ) } / exp{ Σ_{i=1}^n (yi − θ) } = exp{ Σ_{i=1}^n (xi − yi) } = B.

Note that A = B, but also A = B^{−1}. So A = 1/A, hence A = ±1. But A has to be
positive; thus A = 1. That is,

r(θ) = exp{ Σ_{i=1}^n |xi − θ| } / exp{ Σ_{i=1}^n |yi − θ| } = r(∞) = r(−∞) = 1.

Hence,

g(θ) := Σ_{i=1}^n |xi − θ| = Σ_{i=1}^n |yi − θ|.

Note that the value which minimizes the above is the sample median. Clearly, the two
data sets X’s and Y ’s have the same median from the above equation. We consider two
cases:

(i) if n = 2m + 1 is odd, then minimizing w.r.t. θ, we see that x’s and y’s have
the same sample median. That is,

x(m+1) = y(m+1) .

Therefore, we can remove this common value by considering g(θ) − |x(m+1) −


θ| = g(θ) − |y(m+1) − θ|. Once this is done, we have an even number of
observations left. Thus, it suffices to consider the case where n is even, as we
shall do below.

(ii) if n = 2m is even, any value in the interval [x(m) , x(m+1) ] is a median of x’s,
which minimizes g(θ) w.r.t. θ. Similarly, any value in the interval [y(m) , y(m+1) ]
is a median of y’s, which minimizes g(θ) w.r.t. θ. These two sets must be the
same, thus x(m) = y(m) and x(m+1) = y(m+1) . Removing these two points,
we still have even number of points left. Continuing doing this, we see that
x(k) = y(k) for all k = 1, ..., n.

Remark 2.2.4 For point estimation problems, it is best to concentrate on minimal suf-
ficient statistics only since it provides the maximum data reduction.

2.3 Complete Statistics
Definition:

Let fθ (t) be a family of pdf’s for a statistic T . The family is called complete
if Eθ g(T ) = 0 for all θ implies Pθ (g(T ) = 0) = 1 for all θ. Then T is called a
complete statistic for θ.
T is said to be boundedly complete if the previous statement holds for all
bounded Borel functions g.

Clearly, completeness implies bounded completeness, but not vice versa.

2.3.1 Full rank exponential family is complete


Theorem 2.3.1 If F is in an exponential family of full rank with p.d.f.’s given by
fη(x) = exp{ Σ_{j=1}^k ηj Tj(x) − A(η) } h(x).

Then T (X) is complete and (minimal) sufficient.

Proof. We have shown that T(X) is minimal sufficient. We will show that it is also
complete. From Theorem 1.1.4, the distribution of T = (T1(X), ..., Tk(X)) is again an
exponential family of the form

fT(t) = fT(t1, ..., tk) = exp{ Σ_{j=1}^k ηj tj − A(η) } h₀(t),

where t = (t1, ..., tk). Now suppose that there exists a function g such that E[g(T)] = 0
for all η ∈ Ξ. Then by the definition of fη(x), we have

∫ g(t) exp{tη′ − A(η)} dλ(t) = 0,  for all η ∈ Ξ,

where λ(A) = ∫_A h₀(t) dt is a measure on (R^k, B^k). Let η0 be an interior point of Ξ. Then

∫ g⁺(t) exp{tη′ − A(η)} dλ = ∫ g⁻(t) exp{tη′ − A(η)} dλ  for all η ∈ N(η0),    (3.5)

where N(η0) = {η : ||η − η0|| < ε} for some ε > 0. In particular,

∫ g⁺(t) exp{tη0′ − A(η0)} dλ = ∫ g⁻(t) exp{tη0′ − A(η0)} dλ = C ≥ 0.

If C = 0, then g⁺(t) = g⁻(t) = 0 a.e. λ, i.e., g(t) = 0 a.e. λ. If C > 0, then
C^{−1} g⁺(t) exp{tη0′ − A(η0)} and C^{−1} g⁻(t) exp{tη0′ − A(η0)} are p.d.f.'s with respect to λ.
However, (3.5) implies that

∫ exp{t(η − η0)′} g⁺(t) exp{tη0′} dλ(t) / ∫ g⁺(t) exp{tη0′} dλ(t) = ∫ exp{t(η − η0)′} g⁻(t) exp{tη0′} dλ(t) / ∫ g⁻(t) exp{tη0′} dλ(t)

for all η ∈ N(η0). That is, the m.g.f.'s of the two p.d.f.'s g⁺(t) exp{tη0′} dλ(t) / ∫ g⁺(t) exp{tη0′} dλ(t)
and g⁻(t) exp{tη0′} dλ(t) / ∫ g⁻(t) exp{tη0′} dλ(t) are the same in a neighborhood of 0. This implies
that g⁺(t) = g⁻(t) a.e. λ. That is, g(t) = 0 a.e. λ. That is, T is complete.

Example. Let X1 , . . . , Xn be iid with geometric distribution

Pθ(X = x) = θ(1 − θ)^{x−1},  x = 1, 2, ...,  0 < θ < 1.

(1). Find a minimal sufficient and complete statistic for θ. (Note that this belongs to
an exponential family.)

(2). Find the family of distributions of T = Σ Xi. Check directly whether the family
is complete.

Example. Let X1, . . . , Xn be iid N(θ, θ^{1+ε}), where θ > 0 and ε ≠ −1. Show that

(1). the statistic T = (Σ Xi, Σ Xi²) is a complete minimal sufficient statistic for θ iff ε ≠ 0 and ε ≠ −1.

(2). when ε = −1, the statistic T1 = Σ Xi is a complete minimal sufficient statistic for θ.

(3). when ε = 0, the statistic T2 = Σ Xi² is a complete minimal sufficient statistic for θ.

Proof. Left as an exercise.

2.3.2 Examples of Complete Families

Example. Assume X1, . . . , Xn ∼ Bin(1, p). Then T = Σ Xi ∼ Bin(n, p) for 0 < p < 1.
Show that T is complete.
Proof. If Eθ g(T) = 0 for 0 < p < 1, then

Σ_{t=0}^n C(n,t) p^t q^{n−t} g(t) = q^n Σ_{t=0}^n C(n,t) r^t g(t) = 0,

where r = p/q with r > 0. A polynomial in r that vanishes for all r > 0 has all coefficients
equal to zero, so g(t) = 0 for t = 0, 1, 2, · · · , n. That is, P(g(T) = 0) = 1.


Hence T is complete.

Example. X1 , . . . , Xn is from U (0, θ). Show that T = X(n) is complete.


Proof. F(x) = (x/θ) I(0 < x < θ) + I(x ≥ θ), and P(T ≤ t) = F(t)^n. If Eθ g(T) = 0 for
θ > 0, then

∫_0^θ g(t) n t^{n−1} θ^{−n} dt = 0  =⇒  ∫_0^θ g(t) t^{n−1} dt = 0.

Differentiating w.r.t. θ, one gets g(θ) θ^{n−1} = 0, i.e., g(θ) = 0 for all θ > 0. Therefore, T is
complete.

2.3.3 Examples of Non-Complete Families
Completeness is a rather strong requirement, and most statistics are not complete. A
simple example is given below.

Example. If X1, ..., Xn ∼ F(x), which is absolutely continuous with p.d.f. f(x), then
X1 − X2 and (X1, Σ_{i=2}^n Xi) are not complete.
Solution. We shall only look at the case X1 − X2 . Clearly, E(X1 − X2 ) = 0. However,
P(X1 − X2 = 0) = ∫_{−∞}^{∞} P(X1 = X2 | X2 = x) f(x) dx = ∫_{−∞}^{∞} P(X1 = x) f(x) dx = 0.

Note that in the above example, X1 − X2 is not sufficient, while (X1, Σ_{i=2}^n Xi) may
be sufficient in many situations. The natural question is whether a minimal sufficient
statistic is complete. The answer is negative, as illustrated below.

Example. Show that the minimal sufficient statistic for U (θ, θ + 1) is not complete.
Solution. It has been shown that the minimal sufficient statistic is T = (X(1), X(n)).
Since F(x) = (x − θ) I(θ < x < θ + 1) + I(x ≥ θ + 1), we get P(X(n) ≤ x) = F(x)^n.
Therefore,

E(X(n)) = ∫ x dF^n(x) = ∫_θ^{θ+1} n x (x − θ)^{n−1} dx = n ∫_0^1 (y + θ) y^{n−1} dy = n/(n+1) + θ = (θ + 1) − 1/(n+1).

Similarly, we can get E(X(1)) = θ + 1/(n+1). Therefore,

E[ X(n) − X(1) − (n−1)/(n+1) ] = 0.

But X(n) − X(1) − (n−1)/(n+1) ≠ 0. So T is not complete.
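
The following small simulation (not part of the original notes) illustrates the computation above: g(T) = X(n) − X(1) − (n−1)/(n+1) has mean zero for every θ, yet it is not identically zero, which is exactly the failure of completeness. The values of n and θ are arbitrary.

```python
# Sketch: an unbiased estimator of zero that is not a.s. zero, for U(theta, theta+1).
import numpy as np

n, reps = 5, 400_000
rng = np.random.default_rng(2)
for theta in (-3.0, 0.0, 7.5):
    x = rng.uniform(theta, theta + 1, size=(reps, n))
    g = x.max(axis=1) - x.min(axis=1) - (n - 1) / (n + 1)
    print(theta, round(g.mean(), 4), round(g.std(), 3))   # mean ~ 0, but std > 0
```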

Remark: This example shows that minimal sufficiency does not imply completeness.

Remark: A complete and sufficient statistic is minimal sufficient, which was shown by
Lehmann and Scheffe (1950) and Bahadur (1957). Also see Shao (1999, p80).

Example. Show that T = (Xn, Σ_{i=1}^{n−1} Xi) for Bin(1, p) is not complete.
Solution. E[ Xn − Σ_{i=1}^{n−1} Xi /(n − 1) ] = 0. But Xn − Σ_{i=1}^{n−1} Xi /(n − 1) ≠ 0. So T is not
complete.

2.4 Ancillary Statistics
Definition: S(X) is called an ancillary statistic if its distribution does not depend on θ.

Example. X1 , . . . , Xn follows F (x − θ), the location parameter family. Show that the
range, R = X(n) − X(1) , is an ancillary statistic.
Proof. Let Zi = Xi − θ. Then Z1 , · · · , Zn follows F (x), free of θ. Then
P(R ≤ x) = P( (X(n) − θ) − (X(1) − θ) ≤ x ) = P( Z(n) − Z(1) ≤ x ),

which does not depend on θ.
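
A quick simulation sketch (not from the notes) of this ancillarity: in a location family the distribution of the range R = X(n) − X(1) is the same for every θ, so its summaries do not move with θ. Here N(θ, 1) is used as the location family; the θ values are arbitrary.

```python
# Sketch: the range is ancillary in a location family.
import numpy as np

n, reps = 10, 200_000
rng = np.random.default_rng(3)
for theta in (-5.0, 0.0, 12.0):
    x = rng.normal(theta, 1.0, size=(reps, n))
    r = x.max(axis=1) - x.min(axis=1)
    print(theta, round(r.mean(), 3), round(np.quantile(r, 0.9), 3))  # unchanged across theta
```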

Example. X1, . . . , Xn is from F(x/σ), the scale parameter family. Show that (X1/Xn, · · · , Xn−1/Xn)
is an ancillary statistic.
Proof. Let Zi = Xi /σ. Then Z1 , · · · , Zn follows F (x), free of θ. Then

P (X1 /Xn ≤ t1 , · · · , Xn−1 /Xn ≤ tn−1 ) = P (Z1 /Zn ≤ t1 , · · · , Zn−1 /Zn ≤ tn−1 ) ,

which does not depend on θ.

2.5 Relationship between Minimal Sufficient, Complete and Ancillary Statistics

Remark: An ancillary statistic may not be independent of a minimal sufficient statistic,


as shown in the next example.

Example. Suppose X1 , . . . , Xn ∼ U (θ, θ + 1).


Minimal sufficient statistic: (X(1), X(n)), or (X(n) − X(1), X(n) + X(1)).
Ancillary statistic: X(n) − X(1) . (Range of a location family, see an earlier example.)
Clearly, the minimal sufficient statistic and the ancillary statistic are not independent.

However, if T (X) is complete, then a minimal sufficient statistic will be independent


of every ancillary statistic, as seen from the next theorem.

Basu’s theorem: If T (X) is a bounded complete and sufficient statistic, then T (X) is
independent of every ancillary statistic.
Proof. Let T be a bounded complete and sufficient statistic, and V an ancillary statistic.
Let A, B be any two Borel sets. Then

P (V ∈ B) = EI{V ∈ B} ≡ C, some constant

P (V ∈ B|T ) = E(I{V ∈ B}|T ) is free of θ.


Let g(T ) = P (V ∈ B|T ) − P (V ∈ B) only depends on T , and it is bounded. Clearly,

Eθ g(T ) = EE(I{V ∈ B}|T ) − P (V ∈ B) = P (V ∈ B) − P (V ∈ B) = 0.

Since T is bounded complete, we have g(T ) = P (V ∈ B|T ) − P (V ∈ B) = 0 a.s., i.e.,

P (V ∈ B|T ) = P (V ∈ B) a.s. (5.6)

Therefore,

P (T ∈ A, V ∈ B) = E (I{T ∈ A}I{V ∈ B})


= E [E (I{T ∈ A}I{V ∈ B}|T )]
= E [I{T ∈ A}E (I{V ∈ B}|T )]
= E [I{T ∈ A}P (V ∈ B|T )]
= E [I{T ∈ A}P (V ∈ B)] from (5.6)
= E [I{T ∈ A}] P (V ∈ B)
= P (T ∈ A)P (V ∈ B).

Thus T and V are independent.

Remark. Basu’s Theorem is useful in that it allows us to deduce the independence of


two statistics without ever finding the joint distribution of the two statistics. It uses all
the concepts we studied so far: complete and minimal sufficient and ancillary statistics.

Example. Example 6.1.14 and Example 6.1.15 of Casella and Berger.

Example. Let X1 , . . . , Xn be a random sample from the pdf fθ (x) = e−(x−θ) I{x > θ}.

(a) Show that X(1) is complete and sufficient.


(b) Use Basu’s Theorem to show that X(1) and any ancillary statistics g(X)
are independent. Examples of g(X) include

g(X) = S 2 , or (X(n) − X(1) ), or X2 − X1 , etc.
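
A small simulation sketch (not part of the original notes) of part (b): for fθ(x) = e^{−(x−θ)}I{x > θ}, the statistic X(1) is complete and sufficient, so by Basu's Theorem it is independent of ancillary statistics such as the range; their sample correlation should therefore be near 0. The constants n and θ are illustrative.

```python
# Sketch: Basu's Theorem for the shifted exponential -- X(1) vs. the ancillary range.
import numpy as np

n, reps, theta = 8, 300_000, 4.0
rng = np.random.default_rng(4)
x = theta + rng.exponential(1.0, size=(reps, n))
t = x.min(axis=1)                       # X(1): complete and sufficient
v = x.max(axis=1) - x.min(axis=1)       # ancillary range
print(round(np.corrcoef(t, v)[0, 1], 4))   # approximately 0
```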

2.6 Exercises
1. Let X be one observation from N (0, σ 2 ). Is |X| a sufficient statistic?

2. Show that the Gamma distribution with parameters α and β, defined by

f(x) = [1/(Γ(α) β^α)] x^{α−1} e^{−x/β},  x > 0,

belongs to the exponential family. Suppose that X1 , · · · , Xn is a random sample


from this distribution. Find a sufficient statistic for (α, β).

3. Let X1 , . . . , Xn be i.i.d. r.v.’s with p.d.f.


f(x) = (1/σ) e^{−(x−µ)/σ}  for x > µ.
Find a two-dimensional sufficient statistic for (µ, σ).

4. Suppose that X1 , . . . , Xn ∼ Uniform(θ − 1/2, θ + 1/2). Find a sufficient statistic for θ.
5. Let X1 , . . . , Xn be independent r.v.’s with densities
fXi (x) = eiθ−x , for x ≥ iθ.
Prove that T = mini {Xi /i} is a sufficient statistic for θ.
6. For each of the following distributions let X1 , . . . , Xn be a random sample. Find a
minimal sufficient statistic.

(a). f(x) = (2π)^{−1/2} e^{−(x−θ)²/2} (normal)
(b). f(x) = P(X = x) = e^{−θ} θ^x/x!, for x = 0, 1, ... (Poisson)
(c). f(x) = e^{−(x−θ)}, for x > θ (location exponential)
(d). f(x) = c(µ, σ) exp{−(x − µ)⁴/σ⁴}, where σ > 0
(e). f(x) = π^{−1} [1 + (x − θ)²]^{−1} (Cauchy)

7. Suppose that X1 , . . . , Xn are discrete random variables with distribution function


Fθ . Show that T = (X1 , · · · , Xn−1 ) is not sufficient for θ. (Comment: in fact, any
statistic using only part of the data can not be sufficient for θ.)
8. Suppose that X1 , . . . , Xn are iid Poisson(λ).

(i). Show that T = Σ_{i=1}^n Xi is minimal sufficient for λ.
(ii). Show that U = (X1, Σ_{i=2}^n Xi) is sufficient for λ. Is U minimal
sufficient for λ? Explain why.

9. Let T = (X1 , . . . , Xn ) be i.i.d. r.v. from a continuous distribution with p.d.f.


f (x) and n ≥ 2. Show that although T is sufficient, it is not complete. (Hint:
E(X1 − X2 ) = 0.)
10. Let X1 , · · · , Xn be i.i.d. random variables having Beta(α, α) with density
fθ (x) = C(α)xα (1 − x)1−α , 0 < x < 1,
where α > 0 and C(α) is a constant depending only on α. Find a minimal sufficient
statistic for α.
11. Let X1 , · · · , Xn be i.i.d. random variables from U(a, b), where a < b. Show that
Zi := (X(i) − X(1))/(X(n) − X(1)), i = 2, ..., n − 1, are independent of (X(1), X(n))
for any a and b.
12. Let X1 , · · · , Xn be i.i.d. random variables from E(a, θ), i.e., f(x) = θ^{−1} e^{−(x−a)/θ} I_{(a,∞)}(x).

(a). Show that Σ_{i=1}^n (Xi − X(1)) and X(1) are independent for any (a, θ).
(b). Show that Zi := (X(n) − X(i))/(X(n) − X(n−1)), i = 1, 2, ..., n − 2, are
independent of (X(1), Σ_{i=1}^n (Xi − X(1))) for any a and θ.
13. Let X be a discrete r.v. with p.d.f.

fθ(x) = θ,  x = 0,
      = (1 − θ)² θ^{x−1},  x = 1, 2, ...,
      = 0,  otherwise,
where θ ∈ (0, 1). Show that X is boundedly complete, but not complete.

Chapter 3

Unbiased Estimation

3.1 Criteria of estimation


Suppose that we have a random sample X = {X1, · · · , Xn} from a family {Fθ : θ ∈ Θ} of
probability distributions. (θ could be a vector.) The purpose is to find an estimator
T = T(X1, · · · , Xn) for θ, or more generally for g(θ), that is "optimal" in a certain sense.
One criterion for choosing an "optimal" estimator is the Mean Squared Error (MSE),
defined by

MSE[T] = E[T − g(θ)]² = Var(T) + bias²[T],

where bias[T] = ET − g(θ). An estimator T is said to be optimal in the MSE sense if it
has the smallest MSE for all θ ∈ Θ. However, there is a difficulty with this criterion, as
illustrated in the next example.
Example. Let X1, · · · , Xn be iid with EXi = µ and Var(Xi) = σ². To estimate µ, we
consider three estimators

T1 = X1,  T2 = 3,  T3 = X̄.

It is easy to find

MSE(T1) = σ²,  MSE(T2) = (3 − µ)²,  MSE(T3) = σ²/n.

Plotting the MSE against µ, we see that there is no UNIFORM best estimator under the MSE
criterion.
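
A small numerical sketch (not part of the original notes) of the comparison above: the three MSE curves cross as µ varies, so none of T1, T2, T3 dominates uniformly. The choices σ = 2, n = 10 and the grid of µ values are arbitrary.

```python
# Sketch: the MSE curves of T1 = X1, T2 = 3, T3 = Xbar cross as mu varies.
import numpy as np

sigma, n = 2.0, 10
for mu in np.arange(0.0, 6.1, 1.0):
    mse1 = sigma**2                # T1 = X1
    mse2 = (3 - mu)**2             # T2 = 3
    mse3 = sigma**2 / n            # T3 = sample mean
    print(f"mu={mu:.0f}  MSE(T1)={mse1:.2f}  MSE(T2)={mse2:.2f}  MSE(T3)={mse3:.2f}")
# e.g. at mu = 3 the constant estimator T2 is best; away from 3 it can be the worst.
```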
The problem arises because the class of estimators is too large. Here we’ll restrict
attention to the class of all unbiased estimators. An estimator T (X) is called unbiased
for g(θ) if ET (X) = g(θ). Clearly in this situation, M SE(T ) = V ar(T ). An unbiased
estimator T (X) is called uniformly minimum variance unbiased estimator (UMVUE) for
g(θ) if it has the smallest variance for all θ.
Some questions immediately arise.
(1). Does there always exist an unbiased estimator for g(θ)?
(2). Amongst all unbiased estimators of g(θ) (assuming existence), is there a
UMVUE?
(3). If a UMVUE does exist, then how can we find it? Is it unique?

We shall try to answer these questions in this section.

3.2 Do unbiased estimators always exist?

Definition: g(θ) is said to be U -estimable if there exists a statistic η(X) such that
Eη(X) = g(θ).

Example. X ∼ Bin(n,p). g(p) is U -estimable iff it is a polynomial in p of degree m (an


integer), where 0 ≤ m ≤ n.
Proof. η(X) is an unbiased estimator iff

g(p) = Eη(X) = Σ_{k=0}^n η(k) C(n,k) p^k (1 − p)^{n−k},

i.e. g(p) is a polynomial in p of degree m, where 0 ≤ m ≤ n.



Therefore, if X ∼ Bin(n, p), then g(p) = p^{−1}, p/(1 − p) and √p are not U-estimable.
Of course, this can also be proved directly. We give one example.

Example. If X ∼ Bin(n,p), show that g(p) = p−1 is not U -estimable.


Proof. If g(p) = p^{−1} is U-estimable, then there exists η(X) such that

Eη(X) = Σ_{k=0}^n η(k) C(n,k) p^k (1 − p)^{n−k} = 1/p.

So

Σ_{k=0}^n η(k) C(n,k) p^{k+1} (1 − p)^{n−k} = 1.

But lim_{p→0} LHS = 0. So we can choose p small enough such that the above equality does
not hold. Therefore, g(p) = p−1 is not U -estimable.

3.3 Characterization of a UMVUE.


We’ve just seen that unbiased estimators may not exist for some parameters. However,
if we assume that there exist unbiased estimators, then the next question is whether a
UMVUE exists? If it does, is it unique?

The next theorem provides a necessary and sufficient condition for UMVUE.

UMVUE Characterization Theorem:

(1). Let δ(X) be an unbiased estimator of g(θ) with Eδ 2 (X) < ∞. δ(X) is a
UMVUE iff E[δ(X)U (X)] = 0 for any U ∈ U and all θ ∈ Θ, where U is the
set of all unbiased estimators of 0 with finite variances.
(2). Let T be a (minimal) sufficient statistic and η(T) be an unbiased estimator
of g(θ) with Eη²(T) < ∞, where η(·) is a Borel function. Then η(T) is a
UMVUE iff E[η(T)U(T)] = 0 for all θ ∈ Θ and any U ∈ Ũ, where Ũ is the
subset of U consisting of all unbiased estimators of 0 that are Borel functions of T.

Proof of the theorem. We shall only prove the first part. The second part can be
shown similarly.
(i) Necessity. Suppose that δ is a UMVUE for g(θ). Fix U ∈ U and θ ∈ Θ. Let
δ1 = δ + λU . Clearly Eδ1 = g(θ), and so
V ar(δ1 ) = V ar(δ) + 2λCov(U, δ) + λ2 V ar(U ) ≥ V ar(δ),
that is,
2λCov(U, δ) + λ2 V ar(U ) ≥ 0,
for all λ. The quadratic in λ has two roots λ1 = 0 and λ2 = −2Cov(U, δ)/V ar(U ), and
hence takes on negative values unless λ1 = λ2 , that is, Cov(U, δ) = 0.
(ii) Sufficiency. Suppose that Eθ (δU ) = 0 for all U ∈ U. For any δ1 such that
E(δ1 ) = g(θ), we have δ1 −δ ∈ U . Hence, Eθ [δ(δ1 −δ)] = 0. From this and Cauchy-Schwarz
inequality, we have V ar(δ) = Cov(δ, δ1 ) ≤ V ar1/2 (δ)V ar1/2 (δ1 ). That is, V ar(δ) ≤
V ar(δ1 ).
The proof is thus complete.

Remark. In other words, W is a UMVUE iff corr(W, U) = 0, where U is any unbiased
estimator of zero.

Remark. Part (2) of the theorem is more useful in practice. Here we only need to look at
functions of the minimal sufficient statistic, which is enough for point estimation purposes.

Remark. There are three important aspects in the application of the characterization
theorem to find a UMVUE.
(1). E[U (X)] = 0.
(2). E[δ(X)U (X)] = 0.
(3). E[δ(X)] = g(θ), where g(θ) is the parameter of interest.

Or
(1)’. E[U (T )] = 0.
(2)’. E[η(T )U (T )] = 0.
(3)’. E[η(T )] = g(θ), where g(θ) is the parameter of interest.

Remark. The theorem follows from the projection theorem of linear space theory if we
interpret E(δ − U )2 as the norm in L2 space and U forms a Hilbert space.

Remark. From the proof of the theorem, given an unbiased estimator δ, if there exists an
unbiased estimator of zero which is CORRELATED with δ, then we can always rescale it so
that it has negative correlation with δ and hence reduces the variance of the resulting
linear combination.

Remark. Note that U or Ũ is not empty, since zero itself is in it. On the other hand,
if zero is the ONLY element, i.e., Ũ = {0} or U = {0}, then the requirement E[δ(X)U(X)] = 0
is certainly satisfied. However, this is guaranteed if the family of distributions is complete.
This can be used to prove the Lehmann-Scheffe theorem introduced later.

A consequence of the UMVUE characterization theorem is

Theorem 3.3.1 Let Ti be a UMVUE of θi, i = 1, ..., k, where k is a fixed positive
integer. Then

(a) T = Σ_{i=1}^k ci Ti is a UMVUE of θ = Σ_{i=1}^k ci θi, for any constants c1, ..., ck.
(Therefore, the UMVUE is closed under linear operations. That is, all UMVUE's
form a linear space.)

(b) T = ∏_{i=1}^k Ti is a UMVUE of θ = ET = E( ∏_{i=1}^k Ti ). In particular, if the Ti's
are independent, then E( ∏_{i=1}^k Ti ) = ∏_{i=1}^k θi.

(Therefore, the UMVUE is also closed under the (independent) product operation.)

Proof. The proof will use the Characterization Theorem.


(a) Note that
à k ! k
X X
Cov (T, U ) = Cov ci Ti , U = ci Cov (Ti , U ) = 0.
i=1 i=1

(b) For simplicity, assume that k = 2. Since T2 is UMVUE, we have E(T2 U ) = 0 for
any U ∈ U. Thus, T2 U ∈ U by the Characterization Theorem, which in turn implies that
E(T1 T2 U ) = E[T1 (T2 U )] = 0 by the Characterization Theorem again.

Uniqueness Theorem: If δ is a UMVUE, then it is unique.

Proof. Let δ1 and δ2 are two UMVUE for g(θ). Then Eδ1 = Eδ2 and V ar(δ1 ) =
V ar(δ2 ) ≡ V (θ). Take δ = (δ1 + δ2 )/2. Then

V ar(δ) = [2V (θ) + 2ρV (θ)]/4 = V (θ)(1 + ρ)/2,

where ρ = Corr(δ1 , δ2 ). It follows that ρ = 1, otherwise V ar(δ) < V ar(δ1 ) which contra-
dicts with our assumption. Therefore,

V ar(δ1 − δ2 ) = [2V (θ) − 2ρV (θ)]/4 = 0.

So
P (δ1 − δ2 = E(δ1 − δ2 )) = P (δ1 − δ2 = 0) = 1.

Alternative proof using the Characterization Theorem. Suppose δ1 and δ2


are two UMVUEs, then δ1 − δ2 ∈ U . Then from the Characterization Theorem, we have
E[δ1 (δ1 − δ2 )] = 0 and E[δ2 (δ1 − δ2 )] = 0. So

E[(δ1 − δ2 )2 ] = E[δ1 (δ1 − δ2 )] − E[δ2 (δ1 − δ2 )] = 0,

which implies δ1 = δ2 a.s.

3.4 Existence of UMVUE among unbiased estimators

In the class of all unbiased estimators, there are three possibilities concerning the set of
UMVUE estimators.

Case 1. No nonconstant parameters have UMVUE.

Example. Let X be a sample of size 1 from U (θ − 1/2, θ + 1/2), where θ ∈ R. Show


that no nonconstant parameters have UMVUE.
Proof. We shall use the UMVUE characterization theorem (part (1) or (2)).

(a) From EU (X) = 0, we get


∫_{θ−1/2}^{θ+1/2} U(x) dx = 0.

Differentiating w.r.t. θ, we get

U (θ − 1/2) = U (θ + 1/2).

(From here, we see that U (θ) is a periodic function with period 1. Clearly, X is not
complete.)

(b). From E[δ(X)U (X)] = 0, we get


∫_{θ−1/2}^{θ+1/2} δ(x) U(x) dx = 0.

Differentiating w.r.t. θ, we get

δ(θ − 1/2)U (θ − 1/2) = δ(θ + 1/2)U (θ + 1/2).

Combining (a) and (b), we get

δ(θ − 1/2) = δ(θ + 1/2).

(c) Eδ(X) = g(θ), that is,


g(θ) = ∫_{θ−1/2}^{θ+1/2} δ(x) dx.

Differentiating w.r.t. θ, we get

g′(θ) = δ(θ + 1/2) − δ(θ − 1/2) = 0.

That is, g(θ) = C.

Example. Let X1 , . . . , Xn be a sample of size n from U (θ − 1/2, θ + 1/2), where θ ∈ R.
Can we show that no nonconstant parameters have UMVUE?
Proof. We shall use the UMVUE characterization theorem, part (2). It can be shown
that T = (X(1) , X(n) ) is a minimal sufficient statistic. Recall that if X1 , . . . , Xn ∼ F with
pdf f , then
f_{(X(r), X(s))}(x, y) = [ n! / ((r − 1)!(s − r − 1)!(n − s)!) ] (F(x))^{r−1} (F(y) − F(x))^{s−r−1} (1 − F(y))^{n−s} f(x) f(y),
where x < y.

In particular, if F = U(θ1, θ2) with θ1 = θ − 1/2 and θ2 = θ + 1/2, then F(x) = (x − θ1)/(θ2 − θ1)
for θ1 ≤ x ≤ θ2, and f(x) = I{θ1 ≤ x ≤ θ2}/(θ2 − θ1). Then taking r = 1, s = n, we get the joint pdf of
T = (X(1), X(n)):

f_{(X(1), X(n))}(x, y) = n(n − 1)(y − x)^{n−2},  where θ1 ≤ x < y ≤ θ2.

(a) From EU(T) = 0 = EU(X(1), X(n)) = ∫_{y=θ−1/2}^{θ+1/2} ∫_{x=θ−1/2}^{y} n(n−1)(y − x)^{n−2}/(θ2 − θ1)^n dx dy,
we get

∫_{y=θ−1/2}^{θ+1/2} ∫_{x=θ−1/2}^{y} (y − x)^{n−2} dx dy = 0.

(b). From E[η(T )U (T )] = 0, we get

(c) Eη(T ) = g(θ), that is,


g(θ) = .

A discrete version of the above example is the following.

Example. X1 , . . . , Xn ∼ Fθ (·), where P (X = θ − 1) = P (X = θ) = P (X = θ + 1) = 1/3


and θ ranges over the set of all integers. Then no nonconstant function of θ has a UMVUE.

Proof. Let us just do it for the case n = 1. We shall use the UMVUE characterization
theorem (part (1) or (2)).

(a) From 0 = Eθ U(X) = (1/3)[U(θ − 1) + U(θ) + U(θ + 1)] for any θ, we get

U (θ − 1) + U (θ) + U (θ + 1) = 0

U (θ) + U (θ + 1) + U (θ + 2) = 0
Differencing both, we get U (θ − 1) − U (θ + 2) = 0, i.e.,

U (θ) = U (θ + 3), for any θ.

So U (θ) is a periodic function with period 3. (Clearly, X is not complete.)

(b). From Eθ [δ(X)U (X)] = 0, similar to (a), we can get

δ(θ)U (θ) = δ(θ + 3)U (θ + 3).

Combining (a) and (b), we get

δ(θ) = δ(θ + 3).

(c) If Eδ(X) = g(θ), recalling that δ(·) is periodic with period 3, then

g(θ) = δ(θ − 1) + δ(θ) + δ(θ + 1) = δ(−1) + δ(0) + δ(1) = C.

Remark. It is interesting to note that both examples belong to the location family.

Case 2. Some but not all parameters have UMVUE.

Lemma. If δ0 is any unbiased estimator of g(θ), then the totality of unbiased estimators
is given by δ = δ0 − U , where U is any unbiased estimator of zero, that is, it satisfies

Eθ (U ) = 0 for all θ ∈ Θ.

The proof of this is obvious.

Remark: From the lemma, we can get

V ar(δ) = V ar(δ0 − U ) = E(δ0 − U )2 − g 2 (θ)

Therefore, minimizing V ar(δ) is equivalent to minimizing E(δ0 − U )2 .

Example. (Lehmann, p76). Let X take on values -1, 0, 1, .... with probabilities

P(X = −1) = p,  P(X = k) = p^k q²,  k = 0, 1, 2, ...

where 0 < p < 1, q = 1 − p. Consider the problem of estimating

(a) p and (b) q².

Simple unbiased estimators are, respectively,

δ1 = I{X = −1}, δ2 = I{X = 0}.

(Aside: What is a (minimal) sufficient statistic in this case? Is it complete?)


Note that V ar(δ) = V ar(δ0 − U ) = E(δ0 − U )2 − g 2 (θ). So V ar(δ) is minimized (w.r.t.
δ) iff E(δ0 − U )2 is minimized (w.r.t. U if δ0 is prefixed.) That is, given δ0 , we have

arg min_δ Var(δ) = δ0 − arg min_U E(δ0 − U)².

Now if EU = 0, then

U(−1) p + Σ_{k=0}^∞ U(k)(1 − p)² p^k = 0.

That is,

U (0) = 0
U (−1) − 2U (0) + U (1) = 0
U (k) − 2U (k + 1) + U (k + 2) = 0, k ≥ 0.

Then

U (0) = 0
U (1) = −U (−1)
U (2) = 2U (1) − U (0) = 2U (1)
...
U (k) = kU (1)
...

Denote a = U (1), then we’ve shown that

U (k) = ka, k = −1, 0, 1, 2, 3, ......

That is, we’ve shown that U is an unbiased estimator of zero iff

U (X) = aX.

From the earlier remark, we need to find a which minimizes


X
E(δi (X) − U )2 = pk [δi (k) − ak]2 = C2 (i)a2 − 2C1 (i)a + C0 (i)

where for i = 1, 2

X
C2 (i) = k 2 P (X = k)
k=−1

X
C1 (i) = kδi (k)P (X = k)
k=−1

X
C0 (i) = δi2 (k)P (X = k).
k=−1
P
Thus pk [δi (k) − ak]2 is minimized iff

a(i) = C1 (i)/C2 (i).

Simple algebra yields


à ∞
!
X
2 2 k
a(1) = −p/ p + q k p , a(2) = 0.
k=1

Since a(1) depends on p, so δ = δ1 − U = δ1 − a(1)X is only locally minimum variance


unbiased estimator (LMVUE) of p at p. In other words, a UMVUE does not exist.
On the other hand, a(2) does not depend on p, therefore, the estimator δ = δ2 − U =
δ2 − a(2)X = δ2 is a UMVUE of q 2 .

Example. Find the totality of g(p) which have UMVUE’s in the last example.
Solution. From the UMVUE Characterization Theorem, δ is UMVUE iff

Ep (δU ) = 0, Ep U = 0, for all p.

From the last example, we get

δ(k)U (k) = ak, U (k) = bk, k = −1, 0, 1, 2, ......

Clearly,

δ(k) = c if k 6= 0
= d if k = 0.

Therefore,

Eδ(X) = cP (X = 0) + dP (X 6= 0) = cq 2 + d(1 − q 2 )
= C1 + C2 q 2 .

33
Therefore, g(p) is a UMVUE iff it is of the form C1 + C2 q 2 .

Case 3. All parameters have UMVUE. (complete and sufficient).

In the first two cases, we have not used any other knowledge. If we have a complete
and sufficient statistic, then a UMVUE always exists among the unbiased estimators.
This will be studied later. We point out that, when a complete and sufficient statistic
is not available, it is usually very difficult to derive a UMVUE. In some cases, we can
use the UMVUE characterization theorem, if we know enough knowledge about unbiased
estimators of 0.

34
3.5 Searching UMVUE’s by sufficiency and complete-
ness
If we have a complete and sufficient statistic, then finding a UMVUE is a relatively simple
matter.

If we have a sufficient statistic, and an unbiased estimator of g(θ), then we can always
improve it by the following Rao-Blackwell Theorem.

Rao-Blackwell Theorem: Let X1 , . . . , Xn be iid r.v.’s with pdf fθ (x). Suppose T is


sufficient for θ. Let δ = δ(X1 , . . . , Xn ) be an estimator for g(θ). Assume MSE(δ) =
E[δ − g(θ)]2 < ∞. Define η(T ) = E(δ(X1 , . . . , Xn )|T ). Then

(i) η(T ) is also an estimator for g(θ).


(ii) MSE(η(T )) = E[η(T ) − g(θ)]2 ≤ E[δ − g(θ)]2 = MSE(δ), where equality
holds iff P (η(T ) = δ) = 1.
(iii) η(T ) and δ have the same bias. In particular, if one is unbiased, so is the
other.

Proof.

(1). By the definition of sufficiency, η(T ) is an estimator of g(θ).

(2).

M SE(δ) = E(δ − g(θ))2 = E ([δ − η] + [η − g(θ)])2


= E[δ − η]2 + E[η − g(θ)]2 + 2C
= M SE(η) + E[δ − η]2 + 2C,

where

C = E ([δ − η][η − g(θ)])


= EE ([δ − η][η − g(θ)]|T )
= E {[η − g(θ)] (E[δ|T ] − η)}
= 0.

Therefore, M SE(δ) ≥ M SE(η) and the equality holds iff δ = η(T ) a.s.

(3). Eη(T ) = EE(δ|T ) = Eδ. Therefore, bias(δ) = bias(η(T )).

Remark. Rao-Blackwell Theorem states that conditioning any unbiased estimator on a


sufficient statistic will result in an uniform improvement. So we only need to consider
statistics that are functions of a sufficient statistics in our search for UMVUE.

Remark. Rao-Blackwell Theorem can only improve an unbiased estimator, but it does
not guarantee that it is a UMVUE.

35
Now if we have a sufficient and complete statistic, then finding a UMVUE is a relatively
easy exercise.

Lehmann-Scheffe Theorem. Suppose that T is sufficient and complete (S-C). If Eθ [η(T )] =


g(θ), then η(T ) is the unique UMVUE for g(θ).
Proof. Method 1. In our search for UMVUE, we only need to look in the class of
estimators which are functions of T . Otherwise if δ(X) is an unbiased estimator δ(X) for
g(θ), but is not a function of T , then η(T ) = E [δ(X)|T ] is always as good as or better
than δ(X) in terms of variances.
Secondly, for any other unbiased estimator η1 (T ), then we have Eθ [η(T ) − η1 (T )] = 0,
which in turn implies that η(T ) = η1 (T ) a.s. for all θ ∈ Θ, since T is complete. Therefore,

varθ [η(T )] ≤ varθ [η1 (T )]

for all θ ∈ Θ. (In fact, equality holds all the time.) Therefore, η(T ) is a UMVUE.
The uniqueness follows from the Uniqueness Theorem given earlier, or can be implied
by the completeness as well.

Method 2.
As in the first method, we shall only restrict our attention to the class of estimators
which are functions of T . Since T is complete, then Eθ (U (T )) = 0 implies U (T ) = 0.
To apply the UMVUE characterization theorem (2), we see that Ue = {0}. Then the
necessary and sufficient condition is satisfied. Hence, η(T ) is a UMVUE. The proof of
uniqueness is as in the first proof.

The next theorem is also useful in finding a UMVUE.

Theorem. Let T be complete and sufficient. If δ is an unbiased estimator and T is


complete and sufficient, then η(T ) = E(δ|T ) is the unique UMVUE.

36
3.6 Some examples.
The above two theorems provide two different methods for finding a UMVUE. We shall
now give some examples to illustrate each method.

Method 1: Solving equations.

Example. X1 , . . . , Xn ∼ U (0, θ). Find a UMVUE for g(θ) where g(x) is differentiable.
Solution. It is known that T = X(n) is a complete and sufficient, and

fT (t) = nθ−n tn−1 I(0 < t < θ).



If Eη(T ) = 0 η(t)nθ−n tn−1 dt = g(θ), then
Z θ
n η(t)tn−1 dt = θn g(θ)
0

Differentiating w.r.t. θ, we get

nη(θ)θn−1 = nθn−1 g(θ) + θn g 0 (θ)

Then η(θ) = g(θ) + nθ g 0 (θ).

X(n) 0
η(T ) = g(X(n) ) + g (X(n) )
n
n+1
(a). If g(θ) = θ, then η(T ) = n
X(n) .
(b). If g(θ) = θ2 , then η(T ) 2
= X(n) + n2 X(n)
2
= n+2 2
n
X(n) .

Remark. Note that g(θ) can be estimated by g(X(n) ), and the bias is of order O(n−1 ).

Example. Let X1 , . . . , Xn be iid Poisson(λ), find a UMVUE for g(λ), where g(λ) is
infinitely differentiable.
P
Solution. T = Xi is complete and sufficient. And T ∼ P oisson(nλ), that is, fT (t) =
e−nλ (nλ)t /t! for t = 0, 1, 2, ... Then

X
Eη(T ) = η(k)e−nλ (nλ)k /k! = g(λ).
k=0

Then

Ã∞ ! ∞ 

X η(k)nk λk X g (i) (0)λi X nj λj X
= enλ g(λ) =  = ak λk ,
k=0 k! i=0 i! j=0 j! k=0

P g (i) (0)nj η(k)nk ak k!


where ak = i+j=k i!j!
. Therefore, k!
= ak . So η(k) = nk
. That is,
  Ã !
X g (i) (0)nj T
aT T ! T! X T g (i) (0)
η(T ) = T = T  = .
n n i+j=T i!j! i=0 i ni

37
(a). If g(λ) = λr , where r > 0 is an integer, then

g (k) (λ) = r(r − 1)...(r − k + 1)λr−k if 0 ≤ k ≤ r,


= 0 if k > r,

Then g (r) (0) = r! and 0 otherwise. Therefore,

T ! nT −r r! T (T − 1)...(T − r + 1)
η(T ) = T
= if T ≥ r,
n (T − r)!r! nr
= 0 if T < r.

Remark: Note that η(T ) ≈ (X̄)r , the MLE.

e−λ λk
(b). If g(λ) = P (X1 = k) = k!
, for any k = 0, 1, ..., then we can use the
above result.

For instance, if k = 0, g(λ) = e−λ . So g (i) (λ) = (−1)i e−λ and g (i) (0) = (−1)i .
Therefore, a UMVUE of e−λ is
T
à ! µ ¶T
X T (−1)i 1
η(T ) = .= 1− .
i=0 i ni n

Similarly, if g(λ) = e−λ λk /k!, then its UMVUE is


à ! µ ¶k µ ¶T −k
T 1 1
η(T ) = 1− = Bin(T, p = 1/n).
k n n

Remark: However, in this case, it is perhaps easier to tackle the problem


directly, as shown in Method 2.

Example. X1 , . . . , Xn ∼ N (µ, σ 2 ). Find a UMVUE for µ/σ 2 .


Solution. The following fact will be useful in this problem.
P
(a) X̄ and S 2 are independent, where S 2 = (n − 1)−1 ni=1 (Xi − X̄)2
(b) X̄ ∼ N (µ, σ 2 /n), and (n − 1)S 2 /σ 2 ∼ χ2n−1 .
r/2−1 e−x/2
(c) The pdf of χ2r is x2r/2 Γ(r/2) .
(d) Γ(α + 1) = αΓ(α)

It is known that T = (X̄, S 2 ) is complete and sufficient. Let


η(T ) = C .
S2

38
³ ´
1
Then Eη(T ) = CµE S2
. Now taking r = n − 1, we have
à ! à !
σ2 1
E = E
(n − 1)S 2 χ2n−1
Z ∞
1 xr/2−1 e−x/2
= dx
x 2r/2 Γ(r/2)
0
Z ∞
x(r−2)/2−1 e−x/2 2(r−2)/2 Γ((r − 2)/2)
= dx
0 2(r−2)/2 Γ((r − 2)/2) 2r/2 Γ((r − 2)/2 + 1)
1 Γ((n − 3)/2)
=
2 Γ((n − 1)/2)
1
= .
n−3
Then µ ¶
1 n−1 µ µ
Eη(T ) = CµE =C = 2,
S2 n−3σ 2 σ
n−3 n−3 X̄ µ
if C = n−1
. That is, n−1 S 2
is a UMVUE of σ2
.

Remark: Note the MLE = X̄/S 2 .

Example. Let X1 , . . . , Xn be a sample from a power series distribution, i.e.,


P (Xi = x) = a(x)θx /c(θ), x = 0, 1, 2, ......
with a known function a(x) ≥ 0 and θ > 0. Find a UMVUE for the parameter θr /[c(θ)]p ,
where r and p ≤ n are both nonnegative integers.

Remark. We first remark that many discrete distributions belong to this family. Ex-
amples include binomial, negative binomial, Poisson, etc. For instance, the Binomial
distribution can be written as
à ! à !à !x
n x n p
P (X = x) = p (1 − p)n−x = (1 − p)n , so θ = p/(1 − p).
x x 1−p
Note that C(θ) can be written as

X
c(θ) = a(x)θx .
x=0

Also note that the family of power series distribution belongs to an exponential family as
P (Xi = x) = ex log(θ) γ(x)/c(θ), x = 0, 1, 2, ......
P
Thus T = Xi is a minimal sufficient and complete statistic. Also, the distribution of T
is also in a power series family, i.e., for t ≥ 0,
X X
P (T = t) = P ( Xi = t) = P (X1 = x1 ) · · · P (Xn = xn )
P
xi =t
X
= a(x1 ) · · · a(xn )θt /[c(θ)]n
P
xi =t

= A(t, n)θt /[c(θ)]n ,

39
PP
where A(t, n) = xi =t a(x1 )......a(xn ), and satisfies

X ∞
X
1= P (T = t) = A(t, n)θt /[c(θ)]n .
t=0 t=0
P∞
That is, t=0 A(t, n)θt = [c(θ)]n . Therefore, A(t, n) is the coefficient of θt in the expansion
of [c(θ)]n .
Now for η(T ) to be a UMVUE of g(θ), we have

X
g(θ) = Eη(T ) = η(t)A(t, n)θt /[c(θ)]n .
t=0

That is,

X
η(t)A(t, n)θt = g(θ)[c(θ)]n .
t=0

In particular, if g(θ) = θ /[c(θ)]p , then


r


X ∞
X ∞
X
η(t)A(t, n)θt = θr [c(θ)]n−p = A(t, n − p)θt+r = A(t − r, n − p)θt .
t=0 t=0 t=r

Comparing the coefficients of order t on both sides, we get

η(t)A(t, n) = A(t − r, n − p) if t ≥ r,
0 if t < r.

That is,

A(t − r, n − p)
η(t) = if t ≥ r,
A(t, n)
0 if t < r.

Note that

(i) if p = 0, we get a UMVUE of θr .


(ii) if p = 1, we get a UMVUE of P (X1 = r), which is a(r)η(T ).

Method 2: Conditioning by using Rao-Blackwell theorem.

Example. Let X1 , . . . , Xn be iid Poisson(λ), find a UMVUE for

e−λ λk
g(λ) = P (X1 = k) = , for any k = 0, 1, 2, ........
k!

40
P
Solution. Let δ = I{X1 = k}. Clearly, Eδ = g(λ). It is obvious that T = Xi is
Complete and Sufficient, and T ∼ P oisson(nλ). Then

η(t) = E[δ|T = t] = P (X1 = k|T = t)


= P (X1 = k, T = t)/P (T = t)
à n
!, Ã n !
X X
= P X1 = k, Xi =t−k P Xi = t
i=2 i=1
e−λ λk /k! e −(n−1)λ
[(n − 1)λ]t−k /(t − k)!
=
e−nλ [nλ]t /t!
à ! µ ¶k µ ¶
t 1 1 t−k
= 1− .
k n n

So a UMVUE for g(λ) is


à ! µ ¶k µ ¶T −k
T 1 1
η(T ) = 1−
k n n

Remarks.

(1). Recall if Y ∼ Bin(m, p), then


(a) Y ∼approx N (mp, mpq) if m → ∞ and p is fixed.
(b) Y ∼approx P oisson(λ) if m → ∞ and limm→∞ mp = λ, for some fixed λ.

In the above example, η(T ) = P (Y = k), where Y ∼ Bin(T, n1 ). Now


limn→∞ T /n = E(T /n) = λ a.s. (since ET = nλ.) Therefore, as n → ∞,
−λ k
η(T ) ≈ e k!λ = P (X1 = k) = g(λ) for k = 0, 1, .... (e.g. If k = 0,
η(T ) = (1 − 1/n)T → [(1 − 1/n)n ]λ → e−λ = P (X1 = 0) a.s.

(2). The above derivation can be seen straightaway from the fact: If X ∼
P oisson(λ1 ) and Y ∼ P oisson(λ2 ) and independent, then X|X + Y = t ∼
Bin{t, λ1 /(λ1 + λ2 )}.

e−X̄ (X̄)k
(3). By comparison, the MLE for the above problem is k!
.

Example. Let X1 , . . . , Xn be iid Poisson(λ), find a UMVUE for g(λ) = λr , where r is a


positive integer.
P
Solution. We take r = 2 for example. T = Xi is complete and sufficient. Choose an
unbiased estimator δ(X) = X12 − X1 since Eδ(X) = EX12 − EX1 = λ2 . Then
³ ´
η(T ) = E [X12 − X1 ]|T

41
is a UMVUE of λ2 . Since X1 |T ∼ Bin(T, 1/n), we get

η(T ) = V ar(X1 |T ) + [E(X1 |T )]2 − E(X1 |T )


µ ¶
1 1 T2 T
= T 1− + 2 −
n n n n
µ ¶
T 1 T2
= − + 2
n n n
T (T − 1) X̄
= 2
= (X̄)2 − .
n n
(Note that the MLE is (X̄)2 .)

Example. Let X1 , . . . , Xn be iid Bin(k, p). (k = 1 corresponds to Bernoulli(p).) The


problem is to estimate the probability of exactly one success from a Bin(k, p), that is,
estimate
g(p) = Pp (X1 = 1) = kp(1 − p)k−1 .
(In particular, we can take either n = 1, or k = 1.)
P
Solution. T = Xi is complete and sufficient. Also T ∼ Bin(kn, p) Take δ = I{X1 =
1}, which is unbiased for g(p). So the unique UMVUE of g(p) is η(T ) = E(δ|T ), where,
for t ≥ 1, we have
X
η(t) = E(δ|T = t) = P (X1 = 1| Xi = t)
Pn
P (X1 = 1, i=1 Xi = t)
= P
P ( ni=1 Xi = t)
P
P (X1 = 1) P ( ni=2 Xi = t − 1)
= P
P ( ni=1 Xi = t)
³ ´ ³ ´
k k(n−1)
1
p(1 − p)k−1 t−1
pt−1 (1 − p)k(n−1)−(t−1)
= ³ ´
kn t
t
p (1 − p)kn−t
³ ´³ ´
k k(n−1)
1 t−1
= ³ ´
kn
. (Hyper-geometric d.f.)
t

And for t = 0, we have η(t) = 0.

Example. Does N (µ, µ), where µ > 0, belong to the exponential family? No. Find a
UMVUE for µ.

Method 3. Using Chacterization Theorem (when C-S stat is not avaiable)

Example. Let X1 , . . . , Xn ∼ U (0, θ), where θ > 1. Find a UMVUE of θ.


Solution. We can show easily that T = X(n) is sufficient. However, it is not complete.
To show this, note that FT (t) = F n (t) = tn /θn , and so
ntn−1
fT (t) = I(0 < t < θ).
θn

42
If Eθ (g(T )) = 0, then
Z θ Z 1 Z θ
0= tn−1 g(t)dt = tn−1 g(t)dt + tn−1 g(t)dt.
0 0 1

Differentiating w.r.t. θ, we get g(θ) = 0 for any θ > 1, and


Z 1
tn−1 g(t)dt = 0.
0

Clearly, it is easy to give an example that a nonzero g(t) satisfies this condition. This
implies that T is not complete. Therefore, we can not use Lehmann-Scheffe Theorem to
find a UMVUE.
(Another method to show that T = X(n) is not complete is as follows. Suppose it is
complete. Then T is complete and sufficient and therefore must be minimal sufficient (by
a theorem in Lehmann (19??)). But this is known to be false. Therefore, T = X(n) can
not be complete.)
We now try the characterization theorem (part 2) to find a UMVUE.
(1). E[U (T )] = 0 implies that (simply take g(t) = U (t) above)
Z 1
U (θ) = 0 for any θ > 1 and tn−1 U (t)dt = 0.
0

(2). If E[η(T )U (T )] = 0, then from (1), we get


Z θ Z 1
n−1
t U (t)η(t)dt = tn−1 U (t)η(t)dt = 0.
0 0

(3). Our purpose is to find η(T ) such that Eη(T ) = θ. We shall use two approaches.

Approach 1. It is easy to check that the constraints in (1) and (2) are satisfied by
the following choice

η(t) = c if 0 < t < 1


= bt if t > 1

Then from Eη(T ) = θ, we get

θ = Eη(T ) = E [η(T )I(0 < T < 1)] + E [η(T )I(T > 1)]
Z 1 Z θ
= cfT (t)dt + btfT (t)dt
0 1
Z
ntn−1 θ
= cP (T < 1) + dt bt
1 θn
c bn θn+1 − 1
= n+
θ n +Ã
1 θn !
bn bn 1
= θ+ c− ,
n+1 n + 1 θn
where we used the fact P (T < 1) = F n (1) = θ−n . Therefore, we have
bn bn
= 1, c− = 0.
n+1 n+1

43
So we get b = (n + 1)/n and c = 1. That is,

η(T ) = 1 if 0 < X(n) < 1


= (1 + n−1 )X(n) if X(n) > 1

Approach 2. Clearly, we can choose

η(t) = c if 0 < t < 1


= η(t) if t > 1 .

Here, we choose η(t) = c if 0 < t < 1 since it satisfies (1). Now from Eη(T ) = θ, we get

θ = Eη(T ) = E [cI(0 < T < 1)] + E [η(T )I(T > 1)]


Z 1 Z θ
= cfT (t)dt + η(t)fT (t)dt
0 1
Z θ
= cP (T < 1) + η(t)fT (t)dt
1
Z θ n−1
c nt
= n+ η(t) n dt.
θ 1 θ
as P (T < 1) = F n (1) = θ−n . That is,
Z θ
n+1
θ = c+ η(t)ntn−1 dt. (6.1)
1

Differentiating w.r.t. θ, we get

nη(θ)θn−1 = (n + 1)θn .
n+1
Therefore, η(θ) = n
θ for θ > 1. Putting this back into (6.1), we get c = 1. Hence, we
get

η(t) = 1 if 0 < t < 1


= (n + 1)t/n if t > 1

which is the same as that of Approach 1.

Remark: If all the observations are less than 1, then x(n) < 1, from the last example, we
simply take θ = 1 since we know θ > 1. So here we have used the information that θ > 1.
However, if we have x(n) ≥ 1, then we just carry on as usual since θ > 1 does not provide
any more useful information any more.

Remark: Note that the minimal and sufficient statistic here is T0 = max{X(n) , 1}. Why
isn’t the UMVUE η(T ) a function of T0 ?

44
3.7 UMVUE in nonparametric families.
See Lehmann, p101.

3.8 Failure of UMVUE


e.g. Let X ∼ P oisson(θ) and g(θ) = e−aθ , where a is a known constant. Clearly, X is
complete and sufficient. And from Eδ(X) = e−aθ , we get
X δ(x)θ x X (1 − a)x θ x
= e(1−a)θ =
x! x!
so δ(X) = (1 − a)X is a UMVUE of e−aθ . Suppose a = 3, then δ(X) = (−2)X . However,
this is a crazy estimator since it oscillates between negative and positive values. That is,
it goes from 0 to -2 to 4 to -8 etc as X goes from 0, 1, 2, .... This is because UMVUE
only possesses finite sample properties.
The above problem will disappear if the sample size increases. Suppose that X1 , . . . , Xn ∼
P
P oisson(λ). Then T = Xi is complete and sufficient and follows Poisson(nλ). To have
Eη(T ) = e−aθ , we need
X η(t)(nθ)t X (n − a)t θ t
(n−a)θ
=e =
t! t!
so η(T ) = (1 − a/n)T is a UMVUE of e−aθ . This is quite reasonable as soon as n > a.

Note that the MLE estimator in this case is simply e−aX̄ .

3.9 Procedures to find a UMVUE


1. Is it U -estimable? If not, it is the end of story. If so, move on to the next
step.
2. Find a (minimal) sufficient statistic T . (Its existence is guaranteed under
very mild conditions.)
3. Verify whether T is complete or not.
4. If T is complete, then we can use Lehmann-Scheffe Theorem.
5. If T is NOT complete, then we can use the UMVUE Characterization
Theorem.
6. One could also use the information inequality sometimes, which will be
studied a bit later.

45
3.10 Information inequality (or Cramer-Rao lower
bound)
3.10.1 Information inequality
Theorem. (Cramer-Rao lower bound) Let X = {X1 , . . . , Xn } be a sample (dependent
or independent) with a joint p.d.f. fθ (x), where θ ∈ Θ with Θ being an open set in
Rk . Suppose that Eδ(X) = g(θ), where g(θ) is a differentiable function of θ. Suppose
fθ (x) = fθ (x1 , ..., xn ) is differentiable as a function of θ and satisfies

∂ Z Z

h(x)fθ (x)dx = h(x) fθ (x)dx, for θ ∈ Θ,
∂θ ∂θ
for h(x) = 1 and h(x) = δ(x).

(1). We have
τ
V ar(δ(X)) ≥ (g 0 (θ))1×k [IX (θ)]−1 0
k×k (g (θ))k×1 , (10.2)

where
à !
∂ ∂
g 0 (θ) = g(θ), ......, g(θ)
∂θ1 ∂θk
"Ã !τ Ã ! #
∂ ∂
IX (θ) = E log fθ (X) log fθ (X)
∂θ k×1
∂θ 1×k
à " #" #!
∂ ∂
= E log fθ (X) log fθ (X) .
∂θi ∂θj k×k

Here, I(θ) is called the “Fisher information matrix” and is assumed to be


positive definite for all θ ∈ Θ.

(2). The equality in (1.1) holds iff there exists some function a(θ) such that


a(θ) [δ(x) − g(θ)] = log fθ (x).
∂θ

Proof. We shall only prove the univariate case k = 1. The general case will be given in
the Appendix at the end of this section.
R
(1). Differentiating 1 = fθ (x)dx w.r.t. θ, we get
à !
Z ∂ Z µZ ¶
∂ f (x)
∂θ θ ∂ ∂ ∂
E log fθ (X) = fθ (x)dx = fθ (x)dx = fθ (x)dx = 1=0
∂θ fθ (x) ∂θ ∂θ ∂θ

Therefore, Ã !2 Ã !
∂ ∂
E log fθ (X) = var log fθ (X) .
∂θ ∂θ

46
R
Differentiating g(θ) = E(δ(X)) = δ(x)fθ (x)dx, we get
Z Ã !
0 ∂
g (θ) = δ(x) fθ (x) dx
∂θ
Z Ã !

= δ(x) log fθ (x) fθ (x)dx
∂θ
" Ã !#

= E δ(X) log fθ (X)
∂θ
à !

= Cov δ(X), log fθ (X) .
∂θ

Therefore,
à ! à !
0 2 2 ∂ ∂
[g (θ)] = Cov δ(X), log fθ (X) ≤ V ar(δ(X)) V ar log fθ (X) .
∂θ ∂θ

This completes the proof of (1).



(2). From the above proof, we see that equality holds iff δ(X) − g(θ) and ∂θ
log fθ (x)
are proportional to each other. The proof is complete.

Remark 1. Note that in the Cramer-Rao theorem, the lower bound on the RHS does not
involve any estimators. In other words, we have a lower bound for the variance of all
unbiased estimators of g(θ). So the bound is uniform for all unbiased estimators.

Remark 2. When the Cramer-Rao lower bound is sharp, i.e., there exists an unbiased
estimator T of g(θ) whose variance reaches the lower bound, then that estimator is a
UMVUE. This provides a way to find UMVUE’s. However, this is not very effective as
very often Cramer-Rao lower bound may not be sharp. This means that UMVUE (if it
exists) may not reach the lower bound.

Remark 3. the Cramer-Rao lower bound is useful in assessing the performance of UMVUE’s.

Appendix: proof of the theorem for the general case.

(a). For any two r.v.’s δ and Y , by Cauchy-Swartze inequality,

Cov 2 (δ, Y ) ≤ V ar(δ)V ar(Y ),

or equivalently,

Cov 2 (δ, Y )
V ar(δ) ≥ = Cov(δ, Y )[V ar(Y )]−1 Cov(δ, Y ), (10.3)
V ar(Y )

where the equality holds iff δ and Y are linearly dependent.

Proof. For any λ ∈ R, we have

0 ≤ V ar(δ − λY ) = V ar(δ) − 2λCov(δ, Y ) + λ2 V ar(Y ) =: g(λ)

47
Setting g 0 (λ) = 2λV ar(Y ) − 2Cov(δ, Y ) = 0, we get the stationary point
Cov(δ, Y )
λ0 =
V ar(Y )
Thus,
Cov 2 (δ, Y )
g(λ0 ) = V ar(δ) − ≥0
V ar(Y )
where equality holds iff g(λ0 ) = 0 = V ar(δ − λ0 Y ) iff δ − λ0 Y = C. This proves (10.3).

(b). For any r.v. δ and random vector Y = (Y1 , ..., Yk ), we have a generalization of the
Cauchy-Swartze inequality as follows
V ar(δ) ≥ Cov(δ, Y )[V ar(Y )]−1 [Cov(δ, Y )]τ , (10.4)
where the equality holds iff δ and Y are linearly dependent. Here, we use the following
notation:
Cov(δ, Y ) = (Cov(δ, Y1 ), ......, Cov(δ, Yk ))1×k ,
V ar(Y ) = (Cov(Yi , Yj ))k×k .

Proof. For any λ = (λ1 , ..., λk ) ∈ Rk , we have


0 ≤ V ar(δ − λY τ ) = Cov(δ − λY τ , δ − λY τ )
= V ar(δ) − 2Cov(δ, Y λτ ) + Cov(λY τ , λY τ )
= V ar(δ) − 2Cov(δ, Y )λτ + λV ar(Y )λτ
=: g(λ)
Setting g 0 (λ) = 2λV ar(Y ) − 2Cov(δ, Y ) = 0, we get the stationary point
λ0 = Cov(δ, Y )[V ar(Y )]−1 .
Thus,
g(λ0 ) = V ar(δ) − Cov(δ, Y )[V ar(Y )]−1 [Cov(δ, Y )]τ ≥ 0
where equality holds iff g(λ0 ) = 0 = V ar(δ − λ0 Y τ ) iff δ − λ0 Y τ = C. This proves (10.4).

(c). Let

Y = log fθ (X)
∂θ
R R
As before, differentiating 1 = fθ (x)dx and g(θ) = E(δ(X)) = δ(x)fθ (x)dx w.r.t. θ, we
get
à !

0 = E log fθ (X) = EY,
∂θ
à !
0 ∂
g (θ) = cov δ(X), log fθ (X) = cov (δ(X), Y ) ,
∂θ
"Ã !τ Ã ! #
∂ ∂
IX (θ) = E log fθ (X) log fθ (X) = E(Y τ Y ) = V ar(Y ).
∂θ k×1
∂θ 1×k

48
Applying (10.4), we get

V ar(δ) ≥ Cov(δ, Y )[V ar(Y )]−1 [Cov(δ, Y )]τ = g 0 (θ)[IX (θ)]−1 [g 0 (θ)]τ .

Remark. The information inequality (10.4) has been extended in a number of directions.

(i) When the lower bound is not sharp, it can usually be improved by consid-
ering not only first derivatives of g(θ), but also higher order derivatives

1 ∂ i1 +...+is fθ (x)
ψi1 ,...,is = .
fθ (x) ∂θ1i1 ...∂θsis

Then we can show the following Bhattacharyya inequality:

V ar(δ) ≥ αK −1 (θ)ατ ,

where K is the covariance matrix of the given set of ψ, and α is the row matrix
with elements
∂ i1 +...+is g(θ)
∂θ1i1 ...∂θsis

(ii) When the regularity conditions are not satisfied, one could use difference
instead of derivatives. See the references given in Lehmann, page 130.

3.10.2 Some properties of the Fisher information matrix


If X ∼ fθ (x), we shall denote its Fisher information matrix is
"Ã !Ã !#
∂ ∂
IX (θ) = E log fθ (X) log fθ (X)
∂θτ ∂θ

Now let us list some properties of the Fisher information matrix.

(1). If X and Y are independent and have pdf fθ (x) and gθ (y), respectively,
then
I(X,Y) (θ) = IX (θ) + IY (θ).

(2). If X = {X1 , . . . , Xn } are i.i.d. with pdf fθ (x), then


n
X
IX (θ) = IXi (θ) = nIXi (θ).
i=1

(3). Suppose that X has joint pdf fθ (x), and fθ (x) is twice differentiable in θ,
and
∂ Z ∂ Z
∂2
fθ (x)dx = fθ (x)dx, for θ ∈ Θ.
∂θτ ∂θ ∂θ∂θτ

49
Then there is an alternative expression for IX (θ) (which is often easier to use)
à !
∂2
IX (θ) = −E log fθ (X)
∂θ∂θτ
à !
∂2
= − E log fθ (X) .
∂θi ∂θj k×k

Proof. We shall only prove the univariate case k = 1.


We have seen fromR the proof of Cramer-Rao lower bound theorem that, by
differentiating 1 = fθ (x)dx w.r.t. θ, we get
à Z ! à ! Z ∂
∂ ∂ f (x)
∂θ θ
E log fθ (X) = log fθ (x) fθ (x)dx = fθ (x)dx
∂θ ∂θ fθ (x)
Z µZ ¶
∂ ∂ ∂
= fθ (x)dx = fθ (x)dx = 1=0
∂θ ∂θ ∂θ
That is,
Z Ã !

log fθ (x) fθ (x)dx = 0
∂θ
Differentiating this again w.r.t. θ gives
Z Ã ! Z Ã !2
∂2 ∂
log f θ (x) f θ (x)dx + log fθ (x) fθ (x)dx = 0
∂θ2 ∂θ

That is,
à !2 Z à !2
∂ ∂
IX (θ) = E log fθ (X) = log fθ (x) fθ (x)dx
∂θ ∂θ
Z Ã !
∂2 ∂2
=− log f θ (x) f θ (x)dx = −E log fθ (X).
∂θ2 ∂θ2

(4). Let θ = θ(η) is from Rm to Rk , where m, k ≥ 1. Assume that θ(.) is


differentiable. Then the Fisher information that X contains about η is
à ! à !
∂θ ∂θτ
IX (η) = IX (θ) .
∂η τ m×k
∂η k×m

Proof. From the definition,


"Ã !τ Ã !#
∂ ∂
IX (η) = E log fθ (X) log fθ (X)
∂η ∂η
"ÃÃ !Ã !!τ ÃÃ !Ã !!#
∂ ∂θτ ∂ ∂θτ
= E log fθ (X) log fθ (X)
∂θ ∂η ∂θ ∂η
à ! "à !τ à !# à !
τ
∂θ ∂ ∂ ∂θ
= τ
E log fθ (X) log fθ (X)
∂η ∂θ ∂θ ∂η
à ! à !
∂θ ∂θτ
= I X (θ) .
∂η τ ∂η

50
(5). If θ = θ(η) is 1-1 (from Rk to Rk ), then it can be shown that the
Cramer-Rao lower bound remains the same. In other words, the C-R bound
is reparametrization invariant.
Proof. X1 , ..., Xn ∼ fθ (x), and θ = θ(η) is 1-1. Now Eδ(X) = g(θ) =
g(θ(η)) = l(η). By C-R lower bound theorem,
τ
V ar(δ(X)) ≥ (g 0 (θ))1×k [IX (θ)]−1 0
k×k (g (θ))k×1 =: A,

and also
τ
V ar(δ(X)) ≥ (l0 (η))1×k [IX (η)]−1 0
k×k (l (η))k×1 =: B.

We shall show that A = B. Note that


∂l(η) ∂l(η) ∂θτ ∂g(θ) ∂θτ ∂θτ
l0 (η) = = = = g 0 (θ) ,
∂η ∂θ ∂η ∂θ ∂η ∂η
∂θτ
and that ∂η
is k × k matrix of full rank. So from this and (4) above, we get
µ ¶µ ¶−1 µ ¶−1 µ ¶τ
∂θτ ∂θτ ∂θ ∂θτ ¡ ¢ ¡ ¢τ
B = g 0 (θ) [IX (θ)]−1 g 0 (θ) = g 0 (θ) [IX (θ)]−1 g 0 (θ) = A.
∂η ∂η ∂η τ ∂η

(6). If θ is a scalar (i.e., k = 1), then the Cramer-Rao lower bound becomes
à !2 à !
[g 0 (θ)]2 d d2
V ar(δ(X)) ≥ , where IX (θ) = E log fθ (X) = −E log fθ (X) .
IX (θ) dθ dθ2

3.10.3 Some examples


e.g. X1 , . . . , Xn ∼ Bin(1, p). Find the Cramer-Rao lower bound for g(p) = p, p2 . Check
if the lower bound can be reached or not.
Solution. First fp (x) = px (1 − p)1−x , x = 0, 1. So log fp (x) = x log p + (1 − x) log(1 − p),
x = 0, 1. Then
∂ x 1−x x−p
log fp (x) = − = , x = 0, 1.
∂p p 1−p p(1 − p)
Therefore,
E(X1 − p)2 1
IX1 (p) = = , x = 0, 1.
p2 (1 − p)2 p(1 − p)

(1). If Eδ(X) = g(p), then by the Cramer-Rao inequality,

[g 0 (p)]2 [g 0 (p)]2 p(1 − p)


V ar(δ(X)) ≥ = .
nIX1 (p) n

In particular, if g(p) = p, then V ar(δ(X)) ≥ p(1 − p)/n. However, V ar(X̄) =


p(1 − p)/n, so this lower bound is reached by choosing δ(X) = X̄. Therefore,
δ(X) = X̄ is a UMVUE of p.

51
3
(2). If g(p) = p2 , then V ar(δ(X)) ≥ 4p (1−p)
n
. We shall show that this lower
bound can never be reached. We know that η(T ) = T (T − 1)/(n(n − 1)) is
the UMVUE of p2 . After some tedious algebra, it can be shown that

E[T 2 (T − 1)2 ] 4 4p3 (1 − p) 2p2 (1 − p)2


V ar(η(T )) = − p = + .
[n2 (n − 1)2 ] n n(n − 1)

This completes the proof.

Aside: To find out V ar(η(T )), note that the m.g.f. of T is Mn (t) = (q + pe4)n .
So

Mn0 (t) = (q + pe4)n = n(q + pet )n−1 et = npet Mn−1 (t),


Mn00 (t) = net Mn−1 0
(t) + net Mn−1 (t),
= n(n − 1)e2t Mn−2 (t) + net Mn−1 (t),
Mn(3) (t) = ........
Mn(4) (t) = ........

And then setting t = 0, we get the first four moments of T .

e.g. X1 , . . . , Xn ∼ N (µ, 1). Find the Cramer-Rao lower bound for g(µ) = µ, µ2 . Check if
the lower bound can be reached√ or not.
2 ∂
Solution. Since fµ (x) = ( 2π)−1 e−(x−µ) /2 , so ∂µ log fµ (x) = x − µ. Therefore,
.............

52
3.10.4 A necessary and sufficient condition for reaching Cramer-
Rao lower bound
The following theorem follows easily from the proof of Cramer-Rao lower bound theorem.

Theorem. Let δ(X) be an estimator of g(θ) = E(δ(X)). Assume that the conditions
in the information inequality theorem hold for δ(X) and that Θ ∈ R. Then V ar(U (X))
attains the Cramer-Rao lower bound iff
d
a(θ)[δ(X) − g(θ)] = log fθ (x), a.s. fθ

for some function a(θ), θ ∈ Θ.

Example. Let X1 , ..., Xn be iid from the Bin(1, p). Show that the Cramer-Rao lower
bound can be reached in estimating p but not for p2 .
Proof. Note
³ P P ´ X X
xi
log fp (x) = log p (1 − p)n− xi
= xi log p + (n − xi ) log(1 − p).

So
d n
log fp (x) = [x̄ − p] = a(p)[δ(X) − p],
dp p(1 − p)
n
where a(p) = p(1−p) , δ(X) = X̄. However, it can NOT be expressed as a function
2
of a(p)[δ(X) − p ]. By the above necessary and sufficient condition, Cramer-Rao lower
bound is reached for X̄, and hence it is a UMVUE of θ. On the other hand, the lower
bound can not be reached for p2 .

53
Before we give the next example, we shall give the following preliminaries.
(1) Γ(p, λ) has pdf of the form
λp p−1 −λx
f (x) = x e I{x ≥ 0}. p > 0.
Γ(p)

(2) Γ(p1 , λ) + Γ(p2 , λ) = Γ(p1 + p2 , λ)


(3) For any r > −p, we have

Γ(p + r)
E[Γ(p, λ)]r = .
λr Γ(p)
p p
In particular, we have EΓ(p, λ) = λ
and V ar(Γ(p, λ)) = λ2
.

e.g. Suppose that X1 , ..., Xn are iid with pdf fθ (x) = λ2 xe−λx , λ > 0, x ≥ 0.
(1). Find a UMVUE of λ.
(2). Check if the Cramer-Rao lower bound can be achieved.
Solution. (1) We shall do this by two methods.
Method 1. (This method is crazy. Method 2 is much better.)
Note that X1 , ..., Xn ∼ Γ(2, λ). Therefore,
µ ¶
1 Γ(2 − 1)
E = = λ.
X1 λ−1 Γ(2)
P
Note that T = Xi ∼ Γ(2n, λ) is complete and sufficient (since it belongs to an expo-
P
nential family of full rank). To find its pdf, let Y = Xi − X1 and note that

fX1 ,Y (x1 , y) = fX1 (x1 )fY (y)


λ2 2−1 −λx1 λ2(n−1)
= x e . y 2(n−1)−1 e−λy
Γ(2) 1 Γ[2(n − 1)]
λ2n
= x1 y 2n−3 e−λ(x1 +y) .
Γ[2(n − 1)]

Let x1 = x1 , y = t − x1 , that is x1 = x1 , t = y + x1 . The jocobian |J| is 1.

fX1 ,T (x1 , t) = fX1 ,Y (x1 , y) |J| = fX1 ,Y (x1 , t − x1 )


λ2n
= x1 (t − x1 )2n−3 e−λt , where 0 ≤ x1 ≤ t.
Γ[2(n − 1)]
Therefore,
fX1 ,T (x1 , t)
fX1 |T (x1 |t) =
fT (t)
λ2n
x (t − x1 )2n−3 e−λt
Γ[2(n−1)] 1
= λ2n 2n−1 −λt
Γ[2n]
t e
µ ¶2n−3
x1 x1
= (2n − 1)(2n − 2) 2 1 − where 0 ≤ x1 ≤ t.
t t

54
Then
µ ¯ ¶
1 ¯¯
η(t) = E T =t
X1 ¯
Z
1
= fX |T (x1 |t)dx1
x1 1
Z t µ ¶
1 x1 x1 2n−3
= (2n − 1)(2n − 2) 2 1 − dx1
0 x1 t t
µ ¶ ¯t
1 x1 2n−2 ¯¯
= − (2n − 1) 1 − ¯
t t ¯
0
2n − 1
=
t
2n−1
Therefore, a UMVUE of λ is T
.
P
Method 2. Similarly to Method 1, we see that T = Xi ∼ Γ(2n, λ), and
µ ¶
1 Γ(2n − 1)
E = .
T λ−1 Γ(2n)
2n−1
Therefore, a UMVUE of λ is T
.

(2). Now let us check if the Cramer-Rao lower bound can be reached or not by the
UMVUE. Since f (x) = λ2 xe−λx , then

log f (x) = 2 log λ + log x − λx,

then
à !2

IX1 (λ) = E log fλ (X)
∂λ
µ ¶2
2
= E − X = E (X − EX)2
λ
2
= V ar(X) = 2 see the proliminaries earlier.
λ
Taking g(λ) = λ in the Cramer-Rao theorem, for any unbiased estimator δ(X), we get

1 λ2
V ar(δ(X)) ≥ = .
nIX1 (λ) 2n
2n−1
However, for η(T ) = T
(the UMVUE of λ), we have
µ ¶ µ ¶ ³ ´
2n − 1 1 λ2 λ2
var = (2n − 1)2 var = (2n − 1)2 ET −2 − (ET −1 )2 = > .
T T 2n − 2 2n
Therefore, the Cramer-Rao lower bound can not be achieved. On the other hand,
λ2
2n n−1
ef f (T ) = λ2
= →1 as n → ∞.
2n−2
n

55
3.10.5 Information inequality in exponential family
Lemma 3.10.1 For the exponential model

fX (x) = exp{T (x)η τ − B(η)}c(x).

we have the cumulant generating function of T is

κT (λ) = ln MT (λ) = B(λ + η) − B(η).

In particular,
ET = B 0 (η), V ar(T ) = B 00 (η).

Proof. The pdf of T is (from Chapter 1)

fT (t) = exp{tη τ − B(η)}c0 (t).

Hence,
Z
τ
κT (λ) = ln MT (λ) = ln EeλT = ln eλt exp{tη τ − B(η)}c0 (t)dt
Z
= ln exp{t(λ + η)τ − B(η + λ)}c0 (t)dt + [B(η + λ) − B(η)]
= ln 1 + [B(η + λ) − B(η)]
= [B(η + λ) − B(η)] .

Thus,

ET = κ0T (0) = B 0 (η + λ)|λ=0 = B 0 (η),


V ar(T ) = κ00T (0) = B 00 (η + λ)|λ=0 = B 00 (η).

The main theorem of this section is the following:

Theorem 3.10.1 Suppose that the distribution of X is from an exponential family {fη :
η ∈ Θ} of full rank, with pdf given by

fη (x) = exp{T (x)η τ − B(η)}c(x),

where Θ is an open subset of Rk .

(1) The regularity condition in the Cramer-Rao theorem is satisfied for any h
with E|h(X)| < ∞.
(2) If IX (η) is the Fisher information matrix for the natural parameter η, then
the variance-covariance matrix V ar(T ) = IX (η).
(3) Let ν = ET (X) and denote IX (ν) to be its Fisher information matrix,
then the variance-covariance matrix V ar(T ) = [IX (ν)]−1 .

Proof of the theorem.

(1). Omitted.

56
(2). The pdf under the natural parameter η is

fX (x) = exp{T (x)η τ − B(η)}c(x).

So
log fX (x) = T (x)η τ − B(η) + log [c(x)] .

log fX (x) = T (x) − B 0 (η).
∂η

From the last lemma, we know ∂η
B(η) = ET . So


log fX (x) = T (x) − ET (X).
∂η
Thus,
" #τ " #
∂ ∂
IX (η) = E log fX (X) log fX (X)
∂η ∂η
= E[T (X) − ET (X)]τ [T (X) − ET (X)] = V ar(T (X)).

(3). Let ν = ET = B 0 (η). Then ν = ψ(η) is 1-1 from Rk to Rk since by using the last
lemma, we get
∂ν
= B 00 (η) = V ar(T ) = IX (η) > 0.
∂η τ
Then from an earlier remark, we get the Fisher information matrix about η contained in
X is à ! à !
∂ν ∂ν τ
IX (η) = IX (ν) .
∂η τ ∂η
Then we have
à ! à !
∂ν ∂ν τ
IX (η) = IX (ν) = IX (η)IX (ν) (IX (η))τ = IX (η)IX (ν)IX (η).
∂η τ ∂η

That is, IX (ν) = IX (η)−1 = (V ar(T ))−1 .

3.10.6 Information inequality in location-scale family


³ ´
x−µ
Theorem. Let X = X1 , . . . , Xn be iid with pdf σ1 f σ
, where f (x) > 0 and f 0 (x) exists
for all x ∈ R, µ ∈ R and σ > 0.
(1). (Location family.) If σ = σ0 is known, then the Fisher information
about θ = µ contained in X1 , . . . , Xn is
n Z [f 0 (x)]2
IX (θ) = dx
σ02 f (x)

(2). (Scale family.) If µ = µ0 is known, then the Fisher information about


θ = σ contained in X1 , . . . , Xn is
n Z [xf 0 (x) + f (x)]2
IX (θ) = 2 dx
σ f (x)

57
(3). (Location-scale family.) Let θ = (µ, σ). Then the Fisher information
about θ contained in X1 , . . . , Xn is
 R 0 2 R 0 2

[f (x)]
n  f (x)
dx x [ff(x)]
(x)
dx
IX (θ) = 2 R [f 0 (x)]2 R [xf 0 (x)+f (x)]2 
σ x f (x) dx, f (x)
dx

Proof.
³ Let
´ us prove the first case only.
³ The´ rest is left as an exercice. Since fθ (x) =
1 x−µ x−µ
σ0
f σ0 , log fθ (x) = − log σ0 + log f σ0 ,
³ ´
x−µ
∂ −f 0 σ
log fθ (x) = ³ 0 ´
∂θ σ0 f x−µ
σ0

Therefore,
" #2

IX1 (θ) = E − log fθ (X1 )
∂θ
 ³ ´ 2
X1 −µ
f0 σ
= E − ³ 0 ´
X1 −µ
σ0 f σ0
³ ´
0 x−µ 2 µ ¶
1 Z [f σ0 ] x−µ
= ³ ´ f dx
σ03 [f x−µ ]2 σ0
σ0
1 Z [f 0 (y)]2
= dy
σ02 f (y)

e.g. Let X1 , . . . , Xn ∼ N (µ, σ 2 ). Then taking θ = (µ, σ). Here f (x) = φ(x) =
2
(2π)−1/2 e−x /2 , f 0 (x) = −xf (x) , applying the above theorem, part 3, the Fisher informa-
tion about θ contained in X1 , . . . , Xn is
 R 0 2 R 0 2

[f (x)]
n  f (x)
dx x [ff(x)]
(x)
dx
IX (θ) = R [f 0 (x)]2 R [xf 0 (x)+f (x)]2 
σ 2
x f (x) dx, f (x)
dx
à R R !
n x2 f (x)dx R x3 f (x)dx
= R 3
σ2 x f (x)dx (1 − x2 )2 f (x)dx
à !
n 1 0
= .
σ2 0 2

3.10.7 Regularity conditions must be checked


It is important that a key assumption in the Cramer-Rao theorem is the ability to inter-
change differentiation and integration. It has been shown that the assumption is satisfied
for the exponential family. In other cases, this assumption must be checked, or contradic-
tions such as the following will arise. Generally speaking, if the range of the pdf depends
on the parameter, the theorem will not be applicable.

e.g. Let X1 , . . . , Xn ∼ U (0, θ).


(1) Show that the assumption in the Cramer-Rao theorem does not hold.

58
(2) Show that the information inequality does not apply to the UMVUE of θ.

Solution. (1). Use Leibnitz’s rule, we have

∂ Z ∂ Z θ h(x)
h(x)fθ (x)dx = dx
∂θ ∂θ 0 θ
h(θ) Z θ ∂ 1
= + h(x) dx
θ 0 ∂θ θ
h(θ) Z ∂
= + h(x) fθ (x)dx
θ ∂θ
Z

6= h(x) fθ (x)dx
∂θ
unless h(θ)/θ = 0 for all θ. Therefore, the assumption fails here.

(2) Here fθ (x) = 1/θ, 0 < x < θ. So ∂θ
log fθ (x) = −1/θ, and
à !2 Z θ
∂ 11 n
IX (θ) = nE − fθ (X) =n dx =
∂θ 0 θ2 θ θ2

By Cramer-Rao theorem, we get, if W is any unbiased estimator, then

θ2
V ar(W ) ≥ .
n
nxn−1
Next let us take W = (n + 1)/nX(n) , the UMVUE of θ. Recall fX(n) (x) = θn
I(0 <
x < θ), then
Z θ
2 nxn+1 n 2
EX(n) = dx = θ
0 θn n+2
Therefore,
µ ¶ µ ¶2 µ ¶2
n+1 n+1 2 n+1
2 n 2 θ2
var X(n) = EX(n) −θ = θ − θ2 =
n n n n+2 n(n + 2)
θ2 1
< = .
n nI(θ)

Note that the V ar(X(n) ) is of size O(n−2 ), as opposed to the usuall size O(n−1 ).

Remark. In this example, the UMVUE has variance below the C-R lower bound. This
is because the regularity conditions in the C-R lower bound theorem are violated.

3.10.8 Relative Efficiency


If δ = δ(X) is an unbiased estimator of g(θ), we know from the Cramer-Rao theorem that

[g 0 (θ)]2
V ar(δ) ≥ .
IX (θ)

This leads to the following definition.

59
Definition: The efficiency of an unbiased estimator δ = δ(X) of g(θ) is defined by

[g 0 (θ)]2 /IX (θ)


ef f (δ) = .
V ar(δ)

For two unbiased estimators δ1 and δ2 , the relative efficiency of δ1 with respect to δ2 is
given by à !
δ1 ef f (δ1 ) V ar(δ2 )
ef f = = .
δ2 V ar(δ2 ) V ar(δ1 )

Lemma. If X1 , . . . , Xn ∼ F with V ar(X1 ) < ∞ and h(·) : Rk → R1 is four times


differentiable, then
³ ´
V ar(h(X̄)) = [h0 (µ)]V ar(X̄)[h0 (µ)]τ 1 + O(n−1 )

and
h(X̄) − h(µ)
q ∼asympt N (0, 1).
n−1 [h0 (µ)]V ar(X1 )[h0 (µ)]τ
Proof. From Taylor’s expansion,

h(X̄) = h(µ) + h0 (µ)(X̄ − µ)τ + Op (n−1 ).

Hence,
V ar(h(X̄)) = [h0 (µ)]V ar(X̄)[h0 (µ)]τ + O(n−2 ).
The proof of the asymptotical normality is omitted.

Remark. If T is complete and sufficient (and if there exists an unbiased estimator of


a parameter g(θ)), then UMVUE’s are of the form η(T ). Furthermore, if the data are
from an exponential family, then T is an average as well. So if η(·) is a smooth enough
function, we see from the above lemma that

lim ef f (η(T )) = 1, and η(T ) ∼asymp N (η(ET ), [η 0 (ET )]V ar(T )[η 0 (T )]τ ) .
n→∞

e.g. Let X1 , ..., Xn be iid from the Bin(1, p). Suppose that g(µ) = pq, where q = 1 − p.
P
Nota that T = Xi is complete and sufficient. An UMVUE of pq is

T (n − T ) n
η(T ) = = X̄(1 − X̄).
n(n − 1) n−1

(1) Show that η(T ) is not efficient. (i.e., it does not reach the Cramer-Rao
lower bound.)
(2) Show that limn→∞ ef f (η(T )) = 1.

60
Proof. (1) It is easy to see that
d 1 hX i
log fp (x) = Xi − np
dθ p(1 − p)
n
= (X̄ − p).
p(1 − p)
From the Cramer-Rao theorem, we get for any unbiased estimator δ(X) of g(p), we have
[g 0 (p)]2 [1 − 2p]2 pq
V ar(δ(X)) ≥ = .
IX (p) n
d
On the other hand, since dθ log fp (x) can not be expressed in the form of a(p)[δ(X) − pq],
then from the necessary and sufficient condition, the Cramer-Rao lower bound can not
be achieved for any unbiased estimator, in particular, for δ(X) = η(T ). That is,
[1 − 2p]2 pq/n
ef f (η(T )) = < 1.
V ar(η(T ))
(2). We now use the lemma to find V ar(η(T )). Taking g(x) = x − x2 , then g 0 (x) =
1 − 2x. Thus,
V ar(η(T )) = n−1 [g 0 (µ)]V ar(X1 )[g 0 (µ)]τ + O(n−2 )
= n−1 [1 − 2p]2 V ar(X1 ) + O(n−2 )
= n−1 [1 − 2p]2 pq + O(n−2 ).
Thus
[1 − 2p]2 pq/n ³ ´
lim ef f (η(T )) = n→∞
lim lim 1 + O(n−1 ) = 1.
= n→∞
n→∞ V ar(η(T ))
(Note that ef f (T ) is only 50% if n = 2.)

Remark. If T is asymptotically normal with the asymptotic variance reaching the


Cramer-Rao lower bound, then we call T to be asymptotically efficient. Therefore, in
exponential family, UMVUE’s are asymptotically efficient. However, if the data comes
from non-exponential family, then this may not be true, as seen from the next example.

Example. Let X1 , ..., Xn ∼ U (0, θ). Show that the UMVUE does not follow normal
distribution asymptotically.
Solution. It is known that a UMVUE of θ is θ̂ = n+1
n
X(n) . Next we’ll show that θ̂ is not
asymptotically normal. For any y ≥ 0, we have
P (n(θ − X(n) ) ≤ y) = P (X(n) ≥ θ − y/n)
= 1 − P (X(n) ≤ θ − y/n)
= 1 − [P (X1 ≤ θ − y/n)]n
µ ¶
y n
= 1− 1−

Taking y = tθ, we get


à ! µ ¶n
n(θ − X(n) ) t
P ≤t = 1− 1−
θ n
→ 1 − e−t as n → ∞.

61
By Slutsky’s theorem, we get

n(θ − θ̂)
∼approx Exp(λ = 1)
θ
Therefore, θ̂ is asymptotically Exp(λ = 1), but not asymptotically normal.

3.11 Some comparisons about the UMVUE and MLE’s


3.11.1 Difficulties with UMVUE’s
(a) Unbiased estimates may not exist.
(b) Even when UMVUE’s exist, they may be inadmissible. (T is said to be
inadmissible if there exists another estimator S such that R(θ, T ) > R(θ, S)
for all θ with inequality holding for some θ.)
(c) The properties of unbiasedness is not invariant under functional transfor-
mations; that is, θ̂ can be unbiased for θ, but g(θ̂) biased for g(θ).

3.12 Exercises
1. Let X1 , . . . , Xn be a random sample from N (µ, σ 2 ). Find a UMVUE for µ2 /σ.
iid Pn
2. Suppose that X1 , . . . , Xn ∼ Poisson(λ). We already know that T = i=1 Xi is
complete and sufficient. Find a UMVUE for e−2λ .
3. Let X1 , . . . , Xn be iid U (θ1 , θ2 ), where θ1 < θ2 . Find a UMVUE for (θ1 + θ2 )/2.
4. Let X1 , . . . , Xn be iid Bin(1, p), where p ∈ (0, 1).
(a) Find the UMVUE of pm , where m is a positive integer ≤ n.
(b) Find the UMVUE of P (X1 + ... + Xm = k), where m and k are positive integers
≤ n.
(c) Find the UMVUE of P (X1 + ... + Xn−1 > Xn ).
5. Let X1 , . . . , Xn be iid with pdf given by fθ (x) = e−(x−θ) , x > θ. Find a UMVUE
of θr , r > 0.
6. Let X1 , . . . , Xn be iid N (0, σ 2 ). We wish to find a UMVUE of σ 2 .
(1) Find a complete and sufficient statistic for σ 2 .
(2) Find C1 , C2 (which may depend on n) such that W1 = C1 (X̄)2 and W2 =
P
C2 ni=1 (Xi − X̄)2 are unbiased for σ 2 .
(3) Consider W = aW1 + (1 − a)W2 , where a is a constant. Find a which minimizes
var(W ). Show that var(W ) ≤ var(Wi ), i = 1, 2
(4) Find a UMVUE of σ 2 .

62
7. Suppose that X has pdf fθ (x), where θ ∈ Θ ∈ R. Further assume that fθ (x) is twice
differentiable in θ, and

∂2 Z Z
∂2
fθ (x)dx = fθ (x)dx, for θ ∈ Θ.
∂θ2 ∂θ2
Then show that à !
∂2
IX (θ) = −E log fθ (X)
∂θ2
³ ´
8. Let X1 , . . . , Xn be iid with pdf σ1 f x−µ
σ
, where f (x) > 0 and f 0 (x) exists for all
x ∈ R, and µ ∈ R, σ > 0. (This is called a location-scale family.) Let θ = (µ, σ).
Show that the Fisher information about θ in X1 , . . . , Xn is
 R 0 2 R 0 2

[f (x)]
n dx x [ff(x)] dx
I(θ) = 2  R f[f(x)
0 (x)]2
(x)
R [xf 0 (x)+f (x)]2 
σ x f (x) dx, f (x)
dx

9. Let X1 , . . . , Xn ∼ N (µ, σ 2 ). In the lecture, we have seen that the information matrix
about θ = (µ, σ) contained in X1 , . . . , Xn is
à !
n 1 0
IX (θ) = .
σ2 0 2

Using this or otherwise, find the information matrix about θ = (µ, σ 2 ) contained in
X1 , . . . , Xn .

10. Suppose that X1 , . . . , Xn is a random sample from a population with density fθ (x) =
P
2xθ exp(−θx2 ), for x > 0 and θ > 0. Let T = ni=1 Xi2 /n.
(1) Find ET and V ar(T ).
(2) Find the Fisher information IX (θ). Use Cramer-Rao information inequality to
see if T is a UMVUE for g(θ) = ET .

63
Chapter 4

The Method of Maximum Likelihood

4.1 Maximum Likelihood Estimate (MLE)


Example. Let X ∼ Pθ , where θ = θ1 or θ2 , where θ1 and θ2 are both known. The
observations follow the following distributions
P (X = x) x=0 x=1 x=2 x=3
θ = θ1 0.5 0.1 0.1 0.3
θ = θ2 0.2 0.3 0.2 0.3

If X = 0, then it is more plausible that it came from population 1, since Pθ1 (0) >
Pθ2 (0). Then we can estimate θ by θ1 . Note that if X = 3, then Pθ1 (3) = Pθ2 (3), then we
can take either θ = θ1 or θ2 . In general, we take the estimate of θ by

θ̂ = θ1 if X = 0, 3
= θ2 if X = 1, 2, 3

Note that θ̂ may not be unique, e.g., when X = 3, θ̂ = θ1 or θ2 . This is in fact a maximum
likelihood estimator (MLE).

Definition. Suppose that X has joint pdf fθ (x) where θ ∈ Θ ∈ Rk .

(1). Given X = x is observed, the functions of θ defined by

L(θ) ≡ L(θ) = fθ (x), and l(θ) ≡ l(θ) = log L(θ)

are called the likelihood function and log-likelihood function, respectively.


(2). For each sample point x, let θ̂ = θ̂(x) satisfy
³ ´ ³ ´
L θ̂(x) = sup L(θ), or equivalently l θ̂(x) = sup l(θ).
θ∈Θ θ∈Θ

i.e.,
θ̂ = arg sup L (θ(x)) = arg sup l (θ(x)) .
θ∈Θ θ∈Θ

If such a θ̂ = θ̂(X) exists, then it is called a maximum likelihood estimator


(MLE) of θ. Note that sometimes θ̂ may not be in Θ.

64
We make several remarks.

(a) If Θ is a finite set, then Θ̂ = Θ, then MLE can be obtained from finitely
many values L(θ), θ ∈ Θ.
(b) If L(θ) is differentiable in θ in the interior of Θ, then possible candidates
for MLE’s are the values of θ satisfying
∂ ∂
L(θ) = 0, or equivalently l(θ) = 0. (1.1)
∂θ ∂θ
The two equations in (1.1) are called the likelihood equation and log-
likelihood equation, respectively.
(d) If Θ is an infinite countable set, then one can look at the ratio L(θ)/L(θ−1)
or the difference L(θ) − L(θ − 1) to find an MLE. (This is the equivalent of
differentiation in (c).)
(e) MLE may not be unique (unlike UMVUE).

4.2 Invariance property of MLE


One very attractive property of the MLE is its invariance. Basically, it means that if θ̂ is
an MLE of θ, then g(θ̂) is an MLE of g(θ).
First, let X1 , ..., Xn be iid with pdf fθ (x), where θ ∈ Θ ∈ Rd . Recall that its likelihood
function is given by
n
Y
L(θ) = fθ (x) = fθ (xi ).
i=1

If g(·) is one to one, then we have the following simple theorem.

Theorem. If θ̂ is an MLE for θ, and η = g(θ) is a 1-1 function, then η̂ = g(θ̂) is an MLE
for η = g(θ).
Proof. Since g(·) is 1-1, then g −1 (·) exists. Also

L(θ, x) = L(g −1 (η), x) = L1 (η, x)

where L1 is the likelihood function of η. Setting the value of θ at θ̂, we get

L(θ̂, x) = L1 (g(θ̂), x) = L1 (η̂, x).

That is, if θ̂ is an MLE, both sides are maximized by θ̂. Hence, η̂ is an MLE of η. (WHY?)

If g(·) is not 1-1, then we introduce the so-called induced likelihood.

Definition. If g(·) is a function: Θ → Λ ∈ Rp , where 1 ≤ p ≤ d. Define the induced


likelihood function for the transformed parameter η = g(θ) by
e
L(η) = sup L(θ).
θ:g(θ)=η

The value η̂ maximizing over the closure of g(Θ) is called an MLE of η.

65
MLE Invariance Theorem. If θ̂ is an MLE of θ, then η̂ ≡ g(θ̂) is an MLE of η ≡ g(θ).
e
[In other words, we need to show L(η̂) e
= L(g(θ̂)).]

(draw a graph here to illustrate)

Proof. First,
e
L(η̂) e
= sup L(η) = sup sup L(θ) = sup L(θ) (draw a mapping) = L(θ̂).
η η θ:g(θ)=η θ

On the other hand,


e
L(g(θ̂)) = sup L(θ) = L(θ̂).
θ:g(θ)=g(θ̂)

e
Therefore, L(η̂) e
= L(g(θ̂)), as must be proved.

4.3 Some examples of MLE


Example. Let X1 , ..., Xn ∼ Bin(1, p) with p ∈ (0, 1). Find an MLE for p2 .
Solution. Method 1. Note that
Y P P
L(p) = npxi (1 − p)1−xi = p xi
(1 − p)n− xi

i=1
³X ´ ³ X ´
l(p) = xi log p + n − x (1 − p)
If Setting P P
∂ xi n − xi n(x̄ − p)
l(p) = − = = 0,
∂p p 1−p p(1 − p)
we get p̂ = x̄. Note that L(p) → 0 as p → 0 or 1. So an MLE for p is

p̂ = X̄.

Then (X̄)2 is an MLE for p2 .

Remark. If X̄ = 0 or 1, then p̂ = 0 or 1 6∈ Θ = (0, 1).

2
Method 2. P
First, Let η =
Pp P P

L(p) = p xi (1 − p)n− xi = η xi /2 (1 − η)n− xi = L1 (η, x), So
P
xi X √
l1 (η, x) = log η + (n − xi ) log(1 − eta)
2
Set P
∂ xi X 1
l1 (η, x) = − (n − xi ) √ √ =0
∂η 2η 2 η(1 − η)
Therefore, η̂ = (X̄)2 is an MLE for p2 .

66
Remark. Note that in this example, the MLE X̄ is also the UMVUE.

Example. Let X1 , ..., Xn ∼ P oisson(λ) with λ > 0. Find an MLE for λ.


Solution. Note that
e−nλ λnx̄
L(λ) = .
x1 !...xn !
l(λ) = −nλ + (nx̄) log λ − log(x1 !...xn !)
If x̄ > 0, we set
∂ nx̄
l(λ) = −n + = 0,
∂λ λ
we get λ̂ = x̄. And
¯ ¯
∂2 ¯ nx̄ ¯¯ n
¯
2
l(λ)¯
¯
= − 2 ¯ = − < 0, if x̄ > 0,
∂λ λ=X̄
λ λ=X̄ x̄

that is, λ̂ = x̄ will be a local maxima.


On the other hand, if x̄ = 0, (i.e., at the boundary of Θ), then l(λ) = −nλ, which
attains the maximum at λ = 0 = x̄.
Combining these results, an MLE of λ is thus

λ̂ = X̄.

Remark. If X̄ = 0, then λ̂ = 0 6∈ Θ = (0, ∞).

Remark. Note that in this example, the MLE X̄ is also the UMVUE.

Example. Let X1 , ..., Xn ∼ N (µ, σ 2 ) with µ ∈ R and σ 2 > 0, where n ≥ 2. Find an


MLE for θ = (µ, σ 2 ).
Solution. Ã !n ( )
n
1 1 X 2
L(θ) = √ exp − 2 (xi − µ) .
2πσ 2 2σ i=1
n
1 X n n
l(θ) = − 2
(xi − µ)2 − log σ 2 − log(2π)
2σ i=1 2 2
The log-likelihood equations become
n n
∂l(θ) 1 X ∂l(θ) 1 X n
= 2 (xi − µ) = 0, and 2
= 4
(xi − µ)2 − 2 = 0
∂µ σ i=1 ∂σ 2σ i=1 2σ

Therefore, we get θ̂ = (µ̂, σ̂ 2 ), where


n
X
µ̂ = x̄, and σ̂ 2 = n−1 (xi − x̄)2 .
i=1

Also note that


¯ Ã !¯ Ã !
1 Pn
∂ 2 l(θ) ¯¯ n
σ2 σ 4 i=1 (xi − µ)
¯
¯
n
σ̂ 2
0
¯ =− 1 Pn 1 Pn 2 n ¯ =− n
∂θ∂θτ ¯θ=θ̂ σ4 i=1 (xi − µ) σ6 i=1 (xi − µ) − 2σ 4
¯
θ=θ̂
0 2σ̂ 4

67
is negative definite when θ = θ̂. At the boundary, we notice that L(θ) → 0 as θ goes to
the boundary of Θ. Therefore, an MLE of θ is
n ³
X ´2
µ̂ = X̄, and σ̂ 2 = n−1 Xi − X̄ .
i=1

Example. Let X1 , ..., Xn ∼ N (µ, σ 2 ) with µ ≥ 0 and σ 2 > 0, where n ≥ 2. Find an MLE
for θ = (µ, σ 2 ).
(e.g. the body weight follows Normal d.f. but the mean must be positive.)

Solution. From the last question, the log-likelihood equations become


n n
1 X 1 X 1
(xi − µ) = 0, and (xi − µ)4 − 2 = 0.
σ 2 i=1 4
2σ i=1 2σ

Case I. If x̄ ≥ 0, then we get θ̂ = (µ̂, σ̂ 2 ), where


n
X
µ̂ = x̄, and 2
σ̂ = n −1
(xi − x̄)2 .
i=1

Since θ̂ is an MLE in a bigger space, it is also an MLE here.

Case II. If x̄ < 0, then the solution to the log-likelihood equations will not be in the
parameter space. But as in the last example,
n
1 X n n
l(θ) = l(µ, σ 2 ) = = − 2
(xi − µ)2 − log σ 2 − log(2π)
2σ i=1 2 2
n
1 X 2 n 2 n 2 n
= − (x i − x̄) − (x̄ − µ) − log σ − log(2π).
2σ 2 i=1 2σ 2 2 2

For fixed σ 2 , l(θ) attains its maximum when µ = 0 since x̄ ≤ 0. Then


n
1 X n n n
l(0, σ 2 ) = − 2
(xi − x̄)2 − 2 (x̄)2 − log σ 2 − log(2π),
2σ i=1 2σ 2 2
P 2
which attains its maximum when σ 2 = n−1 xi . Therefore, an MLE of θ is
X
µ̂ = 0, σ̂ 2 = n−1 x2i .

In summary, an MLE of θ is
X
θ̂ = (0, n−1 Xi2 ), if X̄ < 0
X
= (X̄, n−1 (Xi − X̄)2 ), if X̄ ≥ 0.

Example. (A genetics example). Assume that P (A) = p and P (a) = 1 − p. We have the
following table.

68
Type AA Aa aa
Prob p2 2p(1 − p) (1 − p)2
Freq x1 x2 x3

where x1 + x2 + x3 = n for a fixed n.


We know that (X1 , X2 , X3 ) ∼ M ultinomial (n, (p2 , 2p(1 − p), (1 − p)2 ))

L(p, x1 , x2 , x3 ) = C(n, x1 , x2 , x3 )p2x1 [2p(1 − p)]x2 [(1 − p)]2x3

l(p, x1 , x2 , x3 ) = log C(n, x1 , x2 , x3 )+x2 log 2+2x1 log p+x2 (log p+log(1−p))+2x3 log(1−p)
Setting
∂ 2x1 + x2 x2 + 2x3
l(p, x1 , x2 , x3 ) = − =0
∂p p 1−p
we get
2X1 + X2
p̂ = .
2n

We now show that θ̂ is also a UMVUE. Note

fp (x1 , x2 , x3 ) = L(p, x1 , x2 , x3 )
= C(n, x1 , x2 , x3 ) p2x1 [2p(1 − p)]x2 (1 − p)2x3
= h(n, x1 , x2 , x3 ) p2x1 +x2 (1 − p)x2 +2x3
= h(n, x1 , x2 , x3 ) p2x1 +x2 (1 − p)2n−(2x1 +x2 )
à !2x1 +x2
p
= (1 − p)2n h(n, x1 , x2 , x3 ).
1−p

By the factorization theorem, T = 2X1 + X2 is a sufficient statistic for p. It is also


p
complete as it belongs to an exponential family and 1−p ∈ (0, ∞).
2
Now that X1 ∼ Bin(n, p ) and X2 ∼ Bin(n, 2p(1 − p)), we have

ET = 2EX1 + EX2 = 2np2 + 2np(1 − p) = 2np.

That is, ET /(2n) = p. So (2X1 + X2 )/(2n) is also a UMVUE of p.

Example. (Fish-Tagging problem, or Capture-Recapture Experiment). Let X ∼ Hyper-


geometric distribution with pdf

Cnx CNr−x
−n
P (X = x) = , x = 0, 1, ..., min(r, n)
CNr

with known r, n and unknown N where N ≥ n + 1.


1). Find an estimator of θ = N by the method of moments.
2). Find a MLE of θ = N
Solution. 1). Equate the sample proportion of fish caught in the second catch to the
population proportion of the fish, we get
x r
= .
n N

69
Thus, an estimate by the method of moment is
rn
θ̂ = N̂ =
x
2). Note that

Cnx CNr−x r
−n /CN CNr−x r
−n /CN CNr−x r
−n CN −1
R = L(θ)/L(θ − 1) = = =
Cnx CNr−x r
−1−n /CN −1 CNr−x r
−1−n /CN −1 CNr CNr−x
−1−n
(N − n)!/[(r − x)!(N − n − r + x)!] [(N − 1)!/[r!(N − 1 − r)!]
=
(N − n − 1)!/[(r − x)!(N − n − r + x − 1)!] [N !/[r!(N − r)!]
(N − n)/(N − n − r + x)
=
N/(N − r)
(N − n)(N − r)
= .
(N − n − r + x)N

Now R ≥ 1 iff (N −n)(N −r) ≥ (N −n−r+x)N iff N 2 +nr−N r−nN ≥ N 2 −N n−rN +xN
iff nr ≥ xN iff N ≤ rn/x. Similarly, R < 1 iff N > rn/x.
1). If θ̂ = rn/x is an integer, then we see that L(θ̂) = L(θ̂ − 1), then an MLE of θ is

N̂ = rn/x, and N̂ = rn/x − 1.

2). If θ̂ = rn/x is not an integer, then

N̂ = rn/x, or N̂ = rn/x − 1.

depending on which one gives the maximum likelihood.

70
4.4 MLE’s in exponential families
Some of the examples given earlier belong to the exponential family. Here, we shall give
some general results regarding MLE in this family.

Suppose that X has a distribution from a natural exponential family of full rank

fη (x) = exp{T (x)η τ − A(η)} h(x),

where η ∈ Ψ.
∂A(η)
(1). Show that an MLE of η is η̂ = g −1 (T (X)), where g(η) = ∂η
, if T (x) is
∂A(η)
in the range of ∂η
.
(2). Show that T (X) is an MLE of E[T (X)]. (T (X) is in fact also a UMVUE
of E[T (X)].)

Solution. (1). The log-likelihood equation is

∂lx (η) ∂ log fη (x) ∂A(η)


= = T (x) − ≡ T (x) − g(η) = 0,
∂η ∂η ∂η

which has a unique solution T (x) = ∂A(η)


∂η
= g(η) since T (x) is in the range of ∂A(η)
∂η
. Note
that
∂ 2 lx (η) ∂ 2 A(η) ∂g(η)
τ
= − τ
=− = −var(T (X)),
∂η∂η ∂η∂η ∂η
which is negative definite. Therefore, the likelihood function is convex in η and the
function g(η) is one-to-one. Therefore, an MLE of the parameter η is

η̂ = g −1 (T (X)).
∂A(η)
(2). Since E[T (X)] = ∂η
= g(η), so from (1), an MLE of E[T (X)] is g[η̂] =
−1
g[g (T (X))] = T (X).

Suppose that X has a distribution from a general exponential family of full rank

fθ (x) = exp{T (x)η(θ)τ − B(θ)} h(x),

where θ ∈ Θ. If η −1 exists, then from the above result, we get an MLE of θ to be

θ̂ = η(η̂).

Of course, θ̂ is also the solution of the log-likelihood equation

∂lx (θ) ∂η(θ) ∂B(θ)


= [T (x)]τ − = 0.
∂θ ∂θ ∂θ

71
4.5 Numerical solution to likelihood equations
Typically, there are no explicit solutions to likelihood equations, and some numerical
methods have to be used to compute MLE’s. We start with a couple of examples.

Example. (Estimating the median of a Cauchy distribution). Let X1 , ..., Xn ∼ Cauchy(θ).


Find an MLE of θ.
Solution. The pdf of Cauchy(θ) is
1 1
fθ (x) = .
π 1 + (xi − θ)2

Therefore,
n
1 Y 1
L(θ) = .
π i=1 1 + (xi − θ)2
n

n
X
l(θ) = −n log π − log(1 + (xi − θ)2 ).
i=1

Setting
n
∂l(θ) X 2(xi − θ)
= 2
= 0.
∂θ i=1 1 + (xi − θ)

Here, an MLE θ̂ can not be solved explicitly, but can be solved by numerical method,
such as Newton-Raphson’s method.

Example. Let X1 , ..., Xn ∼ Γ(α, γ) with unknown α > 0 and γ > 0. Find a MLE of
θ = (α, γ).
Solution. The likelihood function is
Qn Pn
n
Y xα−1
i e−xi /γ ( i=1 xi )
α−1 −( i=1 xi )/γ
e
L(θ) = =
i=1 Γ(α)γ α Γ (α)γ nα
n

à n ! Ãn
!
X 1 X
l(θ) = (α − 1) log xi − xi − n log Γ(α) − (nα) log γ.
i=1 γ i=1

The likelihood equations become


n
∂l(θ) X Γ0 (α)
= log xi − n − n log γ = 0 (5.2)
∂α i=1 Γ(α)
n
∂l(θ) 1 X nα
= 2 xi − = 0. (5.3)
∂γ γ i=1 γ

Then from (5.3), we get γ̂ α̂ = x̄, or γ̂ = x̄/α̂. Substituting this into (5.2), we get
n
1X Γ0 (α)
log xi − − log(x̄) + log(α̂) = 0
n i=1 Γ(α)

In this case, the likelihood equation has no explicit solution, although it can be shown
that a solution exists and it is the unique MLE. We must resort to numerical method to
compute the MLE.

72
One commonly used numerical method to compute MLE’s is the Newton’s algorithm.
Note that typically, θ̂ → θ in probability, and also note that n−1 l(θ), n−1 l0 (θ) and n−1 l00 (θ)
are all sample means of i.i.d. r.v.’s. Thus

1 ∂l(θ̂) 1 ∂l(θ) 1 ∂ 2 l(θ)


0= ≈ + (θ̂ − θ) τ
n ∂θ n ∂θ n ∂θ∂θ
à !
1 ∂l(θ) 1 ∂ 2 l(θ)
≈ + (θ̂ − θ) E
n ∂θ n ∂θ∂θτ

Then
" #−1
∂l(θ) ∂ 2 l(θ)
θ̂ ≈ θ −
∂θ ∂θ∂θτ
" #−1
∂l(θ) ∂ 2 l(θ)
≈ θ− E
∂θ ∂θ∂θτ

If we take θ0 to be an initial value and suppose that ∂ 2 l(θj )/∂θ∂θτ is of full rank for every
θ ∈ Θ, then we have the following iteration
" #−1
∂l(θj ) ∂ 2 l(θj )
θj+1 = θj − , j = 0, 1, 2, ...,
∂θ ∂θ∂θτ

which is the Newton-Raphson’s algorithm. Or alternatively we have


" #−1
∂l(θj ) ∂ 2 l(θj )
θj+1 = θj − E , j = 0, 1, 2, ...,
∂θ ∂θ∂θτ

which is so called scoring method.

4.6 Problems with MLE


4.6.1 MLE may not be unique
Example. Let X1 , ..., Xn ∼ U (θ − 1/2, θ + 1/2) with θ ∈ R. Find a MLE of θ.
Solution. Note
n
Y
L(θ) = I(θ − 1/2 < xi < θ + 1/2)
i=1
= I(θ − 1/2 < x(1) < x(n) < θ + 1/2)
= I(x(n) − 1/2 < θ < x(1) + 1/2).

Here the likelihood equation approach is not applicable. However, by plotting L(θ) against
θ, it is easy to see that any statistic T satisfying x(n) − 1/2 < θ < x(1) + 1/2 is an MLE
of θ.

Remark. This example indicates that MLE’s may not be unique (may have infinite
number of MLE’s).

73
4.6.2 MLE may not be asymptotically normal
Example. Let X1 , ..., Xn ∼ U (0, θ). For an MLE for θ.
Solution.
1
L(θ, x) = n I(0 ≤ x(1) ≤ x(n) ≤ θ)
θ
Draw a graph of L(θ, x) as a function of θ, we see that an MLE for θ is
θ̂ = X(n) .
In this case, the MLE is not the same as the UMVUE, which is (n + 1)X(n) /n.
Next we’ll show that θ̂ is not asymptotically normal.
P (n(θ − θ̂)/θ ≤ t) = P (θ̂ ≥ θ − tθ/n)
= 1 − P (X(n) ≤ θ − tθ/n)
= 1 − [P (X1 ≤ θ − tθ/n)]n
µ ¶
t n
= 1− 1−
n
−t
→ 1−e as n → ∞.
That is, n(θ−
θ
θ̂)
is an asymptotically exponential distribution with mean 1. Therefore,
θ̂ can not be asymptotically normal. In addition, θ̂ is biased downward. However, the
−θ
bias(θ̂) = E θ̂ − θ = n+1 → 0 as n → ∞.

4.6.3 MLE may be inadmissible


Note that MLE maximizes the likelihood. Therefore, it may not have other good proper-
ties, such as the mean squared errors (MSE) used extensively in the search for UMVUE.

Example Let X ∼ Bin(1, p), where 1/3 < p < 2/3. (Restricted Bernoulli parameter.)
Find an MLE of p, and show that it is inadmissible by the MSE criterion.

Solution.
L(p, x) = px (1 − p)1−x I(1/3 < p < 2/3)
= p if x = 1
1−p if x = 0.

(Plot L(p, x) against p for x = 0, 1)

From this graph, we see that an MLE for p is


p̂ = 1/3 if x = 0

74
2/3 if x = 1
x+1
= , if x = 0, 1
3

Note that
µ ¶
x+1 p+1
E p̂ = E = 6= p, (which is biased)
3 3
var(x) p(1 − p)
var(p̂) = = .
9 9
Therefore,
µ ¶2
2 p(1 − p) p+1 1 − 3p(1 − p)
M SE(p̂) = var(p̂) + bias (p̂) = + −p = .
9 3 9

Now let us take another estimator p̂1 = 1/2. Then

M SE(p̂1 ) = (1/2 − p)2 .

Note that M SE(p̂1 ) equals 0, 1/36 and 1/36 when p̂1 = 1/2, 1/3, 2/3. On the other hand,
M SE(p̂) equals 1/9, 1/27, 1/27, which are all bigger than the corresponding values of
M SE(p̂1 ) at these points. Thus,

M SE(p̂1 ) < M SE(p̂), p ∈ [1/3, 2/3].

Here, MLE is worse than the trivial estimator δ(p̂1 ) = 1/2 under MSE criterion. In other
words, MLE is inadmissible under MSE loss function.

Remark: M SE(p̂1 ) and M SE(p̂) do cross each in (0, 1) although they don’t in [1/3, 2/3].

4.6.4 MLE can be inconsistent


Definition. A sequence of estimators {Tn} is a consistent estimator of g(θ) if
$$\lim_{n\to\infty} P\big(|T_n - g(\theta)| > \epsilon\big) = 0\qquad\text{for every } \epsilon > 0.$$
e.g. Let X1, . . . , Xn ∼ F, µ = EX1, X̄ = n⁻¹ Σi Xi. If E|X1| < ∞, then by the WLLN,
X̄ → µ in probability, that is, lim(n→∞) P(|X̄ − µ| > ε) = 0 for every ε > 0.

When there are many nuisance parameters, MLE's can behave very badly even for
large samples, as shown below (in this case, the MLE is unique).

Example. Suppose that we have p independent random samples
$$X_{11}, \cdots, X_{1r} \sim N(\mu_1, \sigma^2),\quad
  X_{21}, \cdots, X_{2r} \sim N(\mu_2, \sigma^2),\quad
  \ldots\ldots,\quad
  X_{p1}, \cdots, X_{pr} \sim N(\mu_p, \sigma^2).$$

1. Find an MLE of θ = (µ1 , · · · , µp , σ 2 ).
2. Show that if p is fixed, r → ∞, then σ̂ 2 is consistent for σ 2 .
3. Show that if r is fixed, p → ∞, then σ̂ 2 is NOT consistent for σ 2 .

Proof. 1. The log-likelihood function is
$$l(\theta) = \log\prod_{i=1}^p\prod_{j=1}^r f_\theta(x_{ij})
 = \log\left[(2\pi\sigma^2)^{-pr/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^p\sum_{j=1}^r (x_{ij}-\mu_i)^2\right\}\right]
 = -\frac{pr}{2}\log(2\pi) - \frac{pr}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^p\sum_{j=1}^r (x_{ij}-\mu_i)^2.$$

The log-likelihood equations are
$$\sum_{j=1}^r \frac{x_{ij}-\mu_i}{\sigma^2} = 0,\quad i = 1,\ldots,p,\qquad
 -\frac{pr}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^p\sum_{j=1}^r (x_{ij}-\mu_i)^2 = 0.$$

Therefore, an MLE of θ is
$$\hat\mu_i = r^{-1}\sum_{j=1}^r x_{ij} = \bar x_{i\cdot},\qquad
 \hat\sigma^2 = (pr)^{-1}\sum_{i=1}^p\sum_{j=1}^r (x_{ij}-\bar x_{i\cdot})^2.$$

Note that
$$\hat\sigma^2 = \frac{\sigma^2}{pr}\sum_{i=1}^p\sum_{j=1}^r\left(\frac{x_{ij}-\bar x_{i\cdot}}{\sigma}\right)^2
 \stackrel{d}{=} \frac{\sigma^2}{pr}\sum_{i=1}^p \chi^2_{r-1}
 \stackrel{d}{=} \frac{\sigma^2}{pr}\,\chi^2_{p(r-1)}.$$
Thus,
$$E\hat\sigma^2 = \frac{r-1}{r}\sigma^2,\qquad Var(\hat\sigma^2) = \frac{2(r-1)}{pr^2}\sigma^4.$$

2. If p is fixed and r → ∞, then by Chebyshev's inequality, we have
$$P(|\hat\sigma^2 - \sigma^2| > \epsilon) \le \epsilon^{-2}E(\hat\sigma^2-\sigma^2)^2
 = \epsilon^{-2}\left[\mathrm{Bias}^2(\hat\sigma^2) + Var(\hat\sigma^2)\right]
 = \epsilon^{-2}\left[\sigma^4/r^2 + Var(\hat\sigma^2)\right]
 \to 0.$$
That is, σ̂² is consistent for σ².

3. If r is fixed and p → ∞, clearly µ̂i is not consistent for µi (it is based on only r
observations). Furthermore, σ̂² is NOT consistent for σ² either: Bias(σ̂²) = −σ²/r does
not vanish, while var(σ̂²) → 0, so σ̂² → (r − 1)σ²/r ≠ σ² in probability.
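A quick simulation (again an illustration, not from the notes; the parameter values are arbitrary) makes the inconsistency visible: as p grows with r fixed, σ̂² settles at (r − 1)σ²/r rather than σ².

```python
import numpy as np

# Neyman-Scott setup: p groups of r observations, common variance sigma^2,
# one nuisance mean per group. The MLE of sigma^2 converges to (r-1)/r * sigma^2.
rng = np.random.default_rng(2)
sigma2, r = 4.0, 2

for p in (10, 1000, 100000):
    mu = rng.normal(size=p)                                   # arbitrary nuisance means
    x = rng.normal(loc=mu[:, None], scale=np.sqrt(sigma2), size=(p, r))
    sigma2_hat = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2)
    print(p, sigma2_hat)    # approaches (r-1)/r * sigma2 = 2.0, not 4.0
```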

4.7 Exercises
1. Let X1 , ..., Xn be iid with a pdf fθ (x). Find an MLE of θ in each of the following
cases.
(1) fθ(x) = θ⁻¹, where x = 1, ..., θ and θ is an integer between 1 and θ0.
(2) fθ (x) = e−(x−θ) , where x ≥ θ and θ > 0.
(3) fθ (x) = θ(1 − x)θ−1 I(0 < x < 1), where θ > 1.
(4) fθ (x) is the pdf of N (θ, θ2 ), where θ ∈ R.

2. Let X be a sample with a joint pdf fθ (x) and T = T (X) be a sufficient statistic for
θ. Show that if an MLE exists, it is a function of T but it may not be sufficient for
θ.

3. Let X be a sample from Fθ , θ ∈ R. Suppose that Fθ ’s have pdf’s fθ and that


{x : fθ (x) > 0} does not depend on θ. Assume further that an unbiased estimator
θ̂ of θ attains the Cramer-Rao lower bound and the conditions in the Cramer-Rao
theorem hold for θ̂. Show that

(a). θ̂ is the UMVUE of θ.


(b). θ̂ is also the unique MLE of θ.

4. Let X1, · · · , Xn be i.i.d. random variables with the density function

fθ (x) = (θ + 1)xθ , 0 ≤ x ≤ 1.

(1) Find the MLE of θ.


(2) Find the asymptotic variance of the MLE.

Chapter 5

Asymptotically Efficient Estimation

5.1 Efficiency vs. Super-Efficiency


Let X1, . . . , Xn ∼ Fθ. Suppose that {δn} is a sequence of estimators of g(θ). Let b(θ) =
Eδn − g(θ). From the Cramer-Rao theorem (assuming that all the regularity conditions
are satisfied), we have
$$Var_\theta(\delta_n) \ge \frac{[g'(\theta)+b'(\theta)]^2}{nI(\theta)},\qquad\text{for ALL } \theta\in\Theta.$$
In particular, if δn is an unbiased estimator of g(θ), i.e., b(θ) ≡ 0, then
$$Var_\theta(\delta_n) \ge \frac{[g'(\theta)]^2}{nI(\theta)}\qquad\text{for ALL } \theta\in\Theta. \tag{1.1}$$
If the lower bound in (1.1) is attained, then δn is the (unique) UMVUE. Clearly, δn is then the
most efficient (minimum-variance) estimator amongst all unbiased estimators. It must
be pointed out that very few UMVUE's actually attain the lower bound (see the necessary
and sufficient condition in the earlier notes).

So far, we have only discussed the C-R lower bound for a fixed sample size n. What happens
when n → ∞?
Now let us assume that δn = δn(X1, . . . , Xn) is asymptotically normal, say
$$\sqrt n\,[\delta_n - g(\theta)] \to_d N(0, v(\theta)),\qquad v(\theta) > 0. \tag{1.2}$$
Then δn is asymptotically unbiased with asymptotic variance v(θ)/n. It would then be
of interest to see whether the Cramer-Rao inequality still holds asymptotically, i.e.
$$\frac{v(\theta)}{n} \ge \frac{[g'(\theta)]^2}{nI(\theta)},\quad\text{or equivalently}\quad
  v(\theta) \ge \frac{[g'(\theta)]^2}{I(\theta)}\qquad\text{for ALL } \theta\in\Theta. \tag{1.3}$$
If the lower bound in (1.3) is attained uniformly for all θ ∈ Θ, then δn is called an asymptotically
efficient estimator.

However, the next example shows that this may not be the case. That is, the asymptotic
version of the Cramer-Rao inequality may not hold uniformly for ALL θ ∈ Θ, even
though all the conditions in the usual Cramer-Rao theorem are satisfied. (An example
where the regularity conditions are not satisfied was given by the Unif(0, θ) example in the
last chapter.)

Example (Hodges). Let X1, ..., Xn ∼ N(θ, 1), θ ∈ R. Then I_{X1}(θ) = 1. The MLE of θ is
θ̂ = X̄, which has distribution N(θ, 1/n). Now let us define
$$\tilde\theta = \begin{cases} \bar X & \text{if } |\bar X| \ge n^{-1/4},\\
                               t\bar X & \text{if } |\bar X| < n^{-1/4},\end{cases}$$
where we choose 0 ≤ t < 1. Find the asymptotic distribution of θ̃.

Remark: A couple of words about θ̃. We are interested in estimating the population
mean. If X̄ is not close to 0, we simply take the sample mean as the estimator. If X̄ is
very close to 0, we shrink it further towards 0. Thus, the resulting estimator should be
more efficient than the sample mean X̄ at θ = 0. Of course, we could use values other
than 0; the same phenomenon occurs there as well.

Solution. (1). If θ = 0, then we can write
$$\sqrt n\,\tilde\theta = \sqrt n\left[\bar X I(|\bar X| \ge n^{-1/4}) + t\bar X I(|\bar X| < n^{-1/4})\right]
 = \sqrt n\left[t\bar X + (1-t)\bar X I(|\bar X| \ge n^{-1/4})\right]
 = tY_n + (1-t)Y_n I(|Y_n| \ge n^{1/4}),$$
where Yn = √n X̄ ∼ N(0, 1), and hence tYn ∼ N(0, t²). Now let us look at the second
term Wn = Yn I(|Yn| ≥ n^{1/4}). By the Cauchy-Schwarz and Chebyshev inequalities,
$$(E|W_n|)^2 \le (EY_n^2)\,E\big[I^2(|Y_n| \ge n^{1/4})\big] = P(|Y_n| \ge n^{1/4}) \le E|Y_n|^2/n^{1/2} = n^{-1/2} \to 0,$$
which implies Wn →p 0. By Slutsky's theorem, we get
$$\sqrt n\,\tilde\theta \to_d N(0, t^2),\qquad\text{if } \theta = 0.$$

(2). If θ ≠ 0, then we can write
$$\sqrt n\,\tilde\theta = Y_n + (t-1)Y_n I(|Y_n| < n^{1/4}),$$
where again Yn = √n X̄, so that Yn − √n θ = √n(X̄ − θ) ∼ N(0, 1).
Now let us show that Yn I(|Yn| < n^{1/4}) →p 0. For any 0 < ε < n^{1/4},
$$P\big(|Y_n I\{|Y_n| < n^{1/4}\}| > \epsilon\big)
 \le P\big(n^{1/4} I\{|Y_n| < n^{1/4}\} > \epsilon\big)
 = P\big(I\{|Y_n| < n^{1/4}\} > \epsilon/n^{1/4}\big)
 = P\big(|Y_n| < n^{1/4}\big)$$
$$= P\big(-n^{1/4} < Y_n < n^{1/4}\big)
 = P\big(-\sqrt n\,\theta - n^{1/4} < Y_n - \sqrt n\,\theta < n^{1/4} - \sqrt n\,\theta\big)
 = \Phi\big(-\sqrt n\,\theta + n^{1/4}\big) - \Phi\big(-\sqrt n\,\theta - n^{1/4}\big)
 \to 0,\quad\text{as } n\to\infty.$$

Thus, Yn I{|Yn| < n^{1/4}} →p 0. By Slutsky's theorem, we get
$$\sqrt n\,(\tilde\theta - \theta) \to_d N(0, 1),\qquad\text{if } \theta \ne 0.$$

Combining the above two cases, we get
$$\sqrt n\,(\tilde\theta - \theta) \to_d \begin{cases} N(0, t^2) & \text{if } \theta = 0,\\
                                                      N(0, 1) & \text{if } \theta \ne 0.\end{cases}$$
In the case θ = 0, the asymptotic version of the Cramer-Rao inequality (1.3) does not hold, since
t² < 1 = [I(θ)]⁻¹.

Remark 1. The set of all points at which (1.3) is violated is called the set of points of
superefficiency. Superefficiency in fact is a very good property.
Remark 2. The normal family is the best family possible. Hodges's example
shows that, in order for the asymptotic version of the Cramer-Rao inequality to hold,
one must restrict attention to a more restricted class of estimators (the so-called
“regular” estimators).
Remark 3. The example can easily be extended to have a countable number
of superefficiency points. For instance, for k = 0, ±1, ±2, . . . , we can define
$$\hat\theta = \begin{cases} \bar X & \text{if } |\bar X - k| \ge n^{-1/4}\ \text{for all } k,\\
                             k + t_k(\bar X - k) & \text{if } |\bar X - k| < n^{-1/4}\ \text{for some } k,\end{cases}$$
where we choose 0 ≤ t_k < 1 for all k. Then it can be shown that this estimator
is super-efficient at every k = 0, ±1, ±2, . . . .
Remark 4. Even though (1.3) is not always true, it is “almost” true: the set
of points of superefficiency has Lebesgue measure 0, as is shown in the next
theorem (Le Cam (1953), Bahadur (1964)). The proof is taken from Shao (1999).

Theorem. Let X1, . . . , Xn be iid with common pdf fθ(x), where θ ∈ Θ ⊂ R and Θ is an
open set. Suppose the following regularity conditions hold:
1). The distributions Fθ of the Xi have common support, so that the set
A = {x : fθ(x) > 0} is independent of θ [i.e., the support does not depend on θ].
2). For every x ∈ A, the derivative ∂²fθ(x)/∂θ² exists and is continuous in θ.
3). The integral ∫ fθ(x) dx can be twice differentiated under the integral sign.
4). The Fisher information
$$I_{X_1}(\theta) = E\!\left(\frac{\partial \log f_\theta(X_1)}{\partial\theta}\right)^2$$
satisfies 0 < I_{X1}(θ) < ∞.
5). For any given θ0 ∈ Θ there exist a positive number c and a function M(x)
(both of which may depend on θ0) such that for all x ∈ A and |θ − θ0| < c, we
have
$$\left|\frac{\partial^2\log f_\theta(x)}{\partial\theta^2}\right| \le M(x),\qquad\text{and}\qquad E_{\theta_0}[M(X_1)] < \infty.$$

Under these conditions, if δn = δn (X1 , . . . , Xn ) is any estimator satisfying (1.2), then v(θ)
satisfies (1.3) except on a set of Lebesgue measure 0.

Proof. Omitted. See Shao (1999, p. 249).

5.2 Definition of Asymptotic Efficiency
Let X1, . . . , Xn ∼ Fθ, θ ∈ Θ. Suppose that {δn} is a sequence of estimators of g(θ). Then
typically,
1). {δn } is consistent, i.e.,

P (|δn − g(θ)| > ²) → 0, for any ² > 0, as n → ∞

2). {δn} is asymptotically normal, i.e., δn is approximately N(g(θ), v(θ)/n), or equivalently,
$$\sqrt n\,[\delta_n - g(\theta)] \to_d N(0, v(\theta)),\qquad v(\theta) > 0,$$
where v(θ) is the asymptotic variance.

Ideally, we would like to find estimators whose asymptotic variance v(θ) is
minimized uniformly over θ ∈ Θ. If such estimators exist, they are called asymptotically
efficient estimators.

Definition. A sequence {δn} is said to be asymptotically efficient if
$$\sqrt n\,[\delta_n - g(\theta)] \to_d N\!\left(0, \frac{[g'(\theta)]^2}{I(\theta)}\right).$$

Two questions immediately arise:


1). Do asymptotically efficient estimators exist? Under what
conditions?
2). How do we find such estimators if they exist?

Remark. Asymptotically efficient estimators, if they exist, need not be unique, so
different methods may all lead to asymptotically efficient estimators. If δn is asymptotically
efficient, so is δn + Rn, where √n Rn → 0 in probability, since by Slutsky's theorem,
$$\sqrt n\,(\delta_n + R_n - g(\theta)) = \sqrt n\,(\delta_n - g(\theta)) + \sqrt n\,R_n
 \to_d N\!\left(0, \frac{[g'(\theta)]^2}{I(\theta)}\right).$$

5.3 MLE is Asymptotic Efficient


5.3.1 Approach I
We shall show in this section that asymptotically efficient estimators exist subject to
quite general regularity conditions.

Let X1, . . . , Xn be iid with common pdf fθ(x), where θ ∈ Θ ⊂ R, and suppose the
following regularity conditions hold.
A1). The distributions Fθ of the Xi have common support, so that the set
A = {x : fθ(x) > 0} is independent of θ.
A2). The true parameter value θ0 ∈ Θ0 ⊂ Θ, where Θ0 is open.

Theorem. Under assumptions (A1)-(A2) (together with identifiability, i.e., Fθ ≠ Fθ0 for θ ≠ θ0),
for any fixed θ ≠ θ0,
$$P_{\theta_0}\big(L(\theta_0) > L(\theta)\big) \to 1,\qquad\text{as } n\to\infty.$$

Proof. L(θ0) > L(θ) iff L(θ0)/L(θ) > 1 iff
$$\frac{1}{n}\log\frac{L(\theta_0)}{L(\theta)} = \frac{1}{n}\sum_{i=1}^n \log\frac{f_{\theta_0}(X_i)}{f_\theta(X_i)} > 0.$$
Since − log x is strictly convex, Jensen's inequality gives
$$E_{\theta_0}\left[-\log\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\right]
 > -\log E_{\theta_0}\left[\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\right] \ge -\log 1 = 0,$$
the inequality being strict because the ratio fθ(X1)/fθ0(X1) is not a.s. constant when Fθ ≠ Fθ0.
By the WLLN,
$$\frac{1}{n}\sum_{i=1}^n \log\frac{f_{\theta_0}(X_i)}{f_\theta(X_i)}
 = -\frac{1}{n}\sum_{i=1}^n \log\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)}
 \to_p E_{\theta_0}\left[-\log\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\right] > 0,$$
which proves the claim.

Theorem. Assume that (A1)-(A2) hold. Further suppose that, for almost all x, fθ(x) is
differentiable w.r.t. θ in Θ0. Then with probability tending to 1 as n → ∞, the likelihood
equation
$$\frac{\partial}{\partial\theta}L_X(\theta) = \frac{\partial}{\partial\theta}\big[f_\theta(x_1)\cdots f_\theta(x_n)\big] = 0,$$
or equivalently
$$\frac{\partial}{\partial\theta}l_X(\theta) = \sum_{i=1}^n \frac{\partial}{\partial\theta}\log f_\theta(x_i) = 0,$$
has a root θ̂n = θ̂(x1, ..., xn) such that θ̂(X1, ..., Xn) tends to the true parameter value θ0
in probability.


Proof.

5.3.2 Approach II
Theorem. Let F = {fθ : θ ∈ Θ}, where Θ is an open set in R. Assume that:
1). For each θ0 ∈ Θ, ∂³ log fθ(x)/∂θ³ exists for all x; moreover, there exists a function
H(x) ≥ 0 (possibly depending on θ0) such that for θ ∈ N(θ0, ε) = {θ : |θ − θ0| < ε},
$$\left|\frac{\partial^3 \log f_\theta(x)}{\partial\theta^3}\right| \le H(x),\qquad E_{\theta_0} H(X_1) < \infty.$$

2). For gθ(x) = fθ(x) or gθ(x) = ∂ log fθ(x)/∂θ, we have
$$\frac{\partial}{\partial\theta}\int g_\theta(x)\,dx = \int \frac{\partial g_\theta(x)}{\partial\theta}\,dx.$$

3). For each θ ∈ Θ, we have
$$0 < I_{X_1}(\theta) = E_\theta\!\left(\frac{\partial\log f_\theta(X_1)}{\partial\theta}\right)^2 < \infty.$$

Let X1, ..., Xn ∼ Fθ0. Then with probability 1, the likelihood equation admits a sequence
of solutions {θ̂n} satisfying
(a). Strong consistency: θ̂n → θ0 with probability 1.
(b). Asymptotic efficiency: θ̂n is asymptotically N(θ0, [nI_{X1}(θ0)]⁻¹).

Proof. (a). Denote the score function by
$$s(X, \theta) = \frac{1}{n}\frac{\partial\log L(\theta)}{\partial\theta}
             = \frac{1}{n}\sum_{i=1}^n \frac{\partial\log f_\theta(X_i)}{\partial\theta}.$$

Then
$$s'_\theta(X,\theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2\log f_\theta(X_i)}{\partial\theta^2},\qquad
  s''_\theta(X,\theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^3\log f_\theta(X_i)}{\partial\theta^3}.$$

Note that
$$|s''_\theta(X,\theta)| \le \frac{1}{n}\sum_{i=1}^n\left|\frac{\partial^3\log f_\theta(X_i)}{\partial\theta^3}\right|
 \le \frac{1}{n}\sum_{i=1}^n H(X_i) =: \bar H(X).$$
By Taylor's expansion,
$$s(X,\theta) = s(X,\theta_0) + s'_{\theta_0}(X,\theta_0)(\theta - \theta_0) + \frac{1}{2}s''_{\theta}(X,\xi)(\theta-\theta_0)^2
             = s(X,\theta_0) + s'_{\theta_0}(X,\theta_0)(\theta - \theta_0) + \frac{1}{2}\bar H(X)\eta^*(\theta-\theta_0)^2,$$
where |η*| = |s''_θ(X, ξ)|/H̄(X) ≤ 1. By the SLLN, we have, with probability one,
$$s(X,\theta_0) \to E\,s(X,\theta_0) = 0,\qquad
  s'_\theta(X,\theta_0) \to E\,s'_\theta(X,\theta_0) = -I_{X_1}(\theta_0),\qquad
  \bar H(X) \to E_{\theta_0} H(X_1) < \infty.$$

Clearly, for ε > 0, we have with probability one,
$$s(X, \theta_0 \pm \epsilon) = s(X,\theta_0) + s'_\theta(X,\theta_0)(\pm\epsilon) + \frac{1}{2}\bar H(X)\eta^*\epsilon^2
 \approx \mp I_{X_1}(\theta_0)\epsilon + \frac{1}{2}E_{\theta_0}H(X_1)\,c\,\epsilon^2,\qquad |c| < 1.$$
In particular, we choose 0 < ε < I_{X1}(θ0)/E_{θ0}H(X1). Then for large enough n, we have,
with probability 1,
$$s(X,\theta_0+\epsilon) = s(X,\theta_0) + s'_{\theta_0}(X,\theta_0)\epsilon + \tfrac{1}{2}\bar H(X)\eta^*\epsilon^2
 \le -I_{X_1}(\theta_0)\epsilon + \tfrac{1}{2}E_{\theta_0}H(X_1)\epsilon^2 < 0,$$
$$s(X,\theta_0-\epsilon) = s(X,\theta_0) - s'_{\theta_0}(X,\theta_0)\epsilon + \tfrac{1}{2}\bar H(X)\eta^*\epsilon^2
 \ge I_{X_1}(\theta_0)\epsilon - \tfrac{1}{2}E_{\theta_0}H(X_1)\epsilon^2 > 0.$$
Therefore, by the continuity of s(X, θ), the interval [θ0 − ε, θ0 + ε] contains a solution of
the likelihood equation s(X, θ) = 0. In particular, it contains the solution
$$\hat\theta_{n\epsilon} = \inf\{\theta : \theta_0 - \epsilon \le \theta \le \theta_0 + \epsilon,\ \text{and}\ s(X,\theta) = 0\}.$$

Before going any further, we shall first show that θ̂_{nε} is a proper random variable, that
is, θ̂_{nε} is measurable. Note that, for all t ≥ θ0 − ε, we have
$$\{\hat\theta_{n\epsilon} > t\} = \left\{\inf_{\theta_0-\epsilon \le \theta \le t} s(X,\theta) > 0\right\}
 = \left\{\inf_{\theta_0-\epsilon \le \theta \le t,\ \theta\ \text{rational}} s(X,\theta) > 0\right\},$$
which is clearly a measurable set, since s(X, θ) is continuous in θ.

Take ² = 1/n for n ≥ 1. For each n ≥ 1, we define a sequence of r.v.’s by


θ̂n = θ̂n,1/n , n = 1, 2, ......

Clearly, θ̂n → θ0 with probability one. This proves (a).

(b). For large n, we have seen that
$$0 = s(X, \hat\theta_n) = s(X,\theta_0) + s'_\theta(X,\theta_0)(\hat\theta_n - \theta_0) + \frac{1}{2}\bar H(X)\eta^*(\hat\theta_n - \theta_0)^2.$$
Thus,
$$\sqrt n\,s(X,\theta_0) = \sqrt n\,(\hat\theta_n - \theta_0)\left(-s'_\theta(X,\theta_0) - \frac{1}{2}\bar H(X)\eta^*(\hat\theta_n - \theta_0)\right).$$

Since √n s(X, θ0) → N(0, I(θ0)) in distribution by the CLT, and −s'_θ(X, θ0) − ½H̄(X)η*(θ̂n −
θ0) → I(θ0) a.s., it follows from Slutsky's theorem that
$$\sqrt n\,(\hat\theta_n - \theta_0)
 = \frac{\sqrt n\,s(X,\theta_0)}{-s'_\theta(X,\theta_0) - \frac{1}{2}\bar H(X)\eta^*(\hat\theta_n - \theta_0)}
 \to_d \frac{N(0, I(\theta_0))}{I(\theta_0)} = N\!\left(0, I^{-1}(\theta_0)\right).$$
This completes our proof.

A couple of remarks about the conditions of the theorem are in order.
Remark 1. Condition 1 ensures that ∂ log fθ(x)/∂θ, for any x, has a Taylor
expansion (up to the third-order term) as a function of θ.
Remark 2. Condition 2 means that
$$\int f_\theta(x)\,dx \qquad\text{and}\qquad \int \frac{\partial\log f_\theta(x)}{\partial\theta}\,dx$$
can be differentiated with respect to θ under the integral sign. That is, the
integration and differentiation can be interchanged.
Remark 3. A sufficient condition for Condition 2 is the following:
for each θ ∈ Θ, there exist functions g(x), h(x), and H(x) (possibly depending
on θ) such that for θ′ ∈ N(θ, ε) = {θ′ : |θ′ − θ| < ε},
$$\left|\frac{\partial f_{\theta'}(x)}{\partial\theta'}\right| \le g(x),\qquad
  \left|\frac{\partial^2 f_{\theta'}(x)}{\partial\theta'^2}\right| \le h(x),\qquad
  \left|\frac{\partial^3 \log f_{\theta'}(x)}{\partial\theta'^3}\right| \le H(x)$$
hold for all x, and
$$\int g(x)\,dx < \infty,\qquad \int h(x)\,dx < \infty,\qquad E_\theta H(X_1) < \infty.$$

Remark 4. Condition 3 ensures that the variance of ∂ log fθ(X1)/∂θ is finite.

5.4 Some further examples


Example. (Karr, p215.) Let X1 , . . . , Xn ∼ U (θ − 1/2, θ + 1/2).

1). Show that any point in [X(n) − 1/2, X(1) + 1/2] is an MLE.
2). Check that regularity conditions in the asymptotic efficiency theorem are
not satisfied.
3). Show that both X(n) − 1/2 and X(1) + 1/2 are consistent estimators of θ.

Proof. 1) has been done before. 2) is true since the support of fθ(x) is [θ − 1/2, θ + 1/2],
which depends on the parameter θ. Let us check 3) now. We will only check X(n) − 1/2.
Note that, for 0 < ε < 1,
$$P\big(|X_{(n)} - 1/2 - \theta| > \epsilon\big) = P\big(\theta - X_{(n)} + 1/2 > \epsilon\big)
 = P\big(X_{(n)} < \theta + 1/2 - \epsilon\big)
 = \big[F_{X_1}(\theta + 1/2 - \epsilon)\big]^n
 = \big[(\theta + 1/2 - \epsilon) - (\theta - 1/2)\big]^n
 = (1-\epsilon)^n \to 0.$$
That is, X(n) − 1/2 is a consistent estimator of θ.

Remark. Since X(n) − 1/2 and X(1) + 1/2 are both consistent for θ, we have
(X(n) − 1/2) − (X(1) + 1/2) →p 0, that is, X(n) − X(1) →p 1.

Example. Let X1, . . . , Xn be iid r.v.'s with the double exponential pdf
$$f_\theta(x) = \frac{1}{2}\exp\{-|x-\theta|\},\qquad \theta \in R.$$
1). Find an MLE of θ.
2). Check that regularity conditions in the asymptotic efficiency theorem are
not satisfied.
3). Despite 2), the asymptotic version of the Cramer-Rao theorem still holds
for the MLE.

Proof. 1). First,
$$L(\theta) = \prod_{i=1}^n f_\theta(x_i) = \frac{1}{2^n}\exp\left\{-\sum_{i=1}^n |x_i - \theta|\right\}.$$

So an MLE of θ is
$$\hat\theta = \arg\max_{\theta\in R} L(\theta) = \arg\min_{\theta\in R}\sum_{i=1}^n |x_i - \theta|
 = \begin{cases} X_{(\frac{n+1}{2})} & \text{if } n \text{ is odd},\\[4pt]
   \text{any point in } [X_{(\frac{n}{2})}, X_{(\frac{n}{2}+1)}] & \text{if } n \text{ is even},\end{cases}$$
which is simply the sample median. To see this, we assume that n is even, ...........

(a graph to illustrate this point)

2). Notice that for EVERY x, fθ(x) is not differentiable w.r.t. θ at θ = x. So the
regularity conditions are not satisfied.

3). Let us assume for the moment that n is odd, so that θ̂ is the sample median m̂. It is
well known that
$$\sqrt n\,(\hat\theta - \theta) \sim_{\text{asymptotic}} N\big(0, [4f_\theta^2(\theta)]^{-1}\big),$$
where fθ(θ) is the density evaluated at its median. In our current example, fθ(θ) = 1/2, so
$$[4f_\theta^2(\theta)]^{-1} = [4 \times (1/4)]^{-1} = 1.$$

Therefore, the asymptotic Cramer-Rao bound does hold.
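A short simulation (illustrative only, not from the notes; the sample size and replication count are arbitrary) confirms that the scaled variance of the sample median is close to 1 for the double exponential model.

```python
import numpy as np

# Sample median of a double exponential sample: sqrt(n)*(median - theta)
# is approximately N(0, 1), matching the asymptotic bound 1/I(theta) = 1.
rng = np.random.default_rng(4)
n, reps, theta = 501, 10000, 0.0          # odd n, so the median is unique

x = rng.laplace(loc=theta, scale=1.0, size=(reps, n))
medians = np.median(x, axis=1)
print("n * var(median):", n * medians.var())   # should be close to 1
```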

5.4.1 Some comparisons about the UMVUE and MLE’s
1. It can be shown that UMVUE's are always consistent and MLE's usually are (Bickel
and Doksum, 1977, p. 133).

2. In the exponential family, both estimates are functions of the natural sufficient
statistics Ti (X).

3. When the sample size is large, both estimates give essentially the same answer. In
that case, use is governed by ease of computation.

4. MLE’s exist more frequently than UMVUE’s (in fact, in our definition, MLE always
exist although it may be outside the parameter space) and they are usually easier
to calculate than UMVUE’s. This is partially because of the invariance property of
MLE’s.

5. Neither MLE’s nor UMVUE’s are satisfactory in general if one takes a Bayesian or
minimax point of view. Nor are they necessarily robust.

Chapter 6

Other Methods of Estimations

We have studied UMVUE's and MLE's. There are other methods of estimation.

6.1 Method of moments


6.1.1 Methodology
The method of moments (MM) dates back to Karl Pearson in the late 1800s. It is
easy to use and is often a good place to start when other methods prove intractable.
A generalized version of MM, appropriately called the generalized method of moments
(GMM), has proved to be a powerful tool in econometrics.

Procedure: The method is rather easy to describe. If we have m unknown parameters,
we simply equate the first m sample moments to the corresponding population moments
(which typically depend on the m unknown parameters), and solve for the unknowns.

6.1.2 Examples
Example 1. Suppose that X1 , . . . , Xn are i.i.d. r.v.’s from N (µ, σ 2 ). Use the method of
moments to estimate µ and σ 2 .

Solution. The first two population moments are
$$EX = \mu,\qquad EX^2 = Var(X) + (EX)^2 = \sigma^2 + \mu^2.$$

The first two sample moments, on the other hand, are
$$m_1' = \bar X,\qquad m_2' = \frac{1}{n}\sum_{i=1}^n X_i^2.$$

Equate the population and sample moments:
$$\mu = \bar X,\qquad \sigma^2 + \mu^2 = \frac{1}{n}\sum_{i=1}^n X_i^2.$$

Solving for µ and σ² yields the method of moments estimators
$$\hat\mu = \bar X,\qquad
  \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - (\bar X)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$

Remark: In this example the simple method of moments produces the same estimator as
the MLE and almost the same estimator as the UMVUE.

Example 2. Suppose that X1 , . . . , Xn are i.i.d. r.v.’s from Binomial(k, p), where both
k and p are unknown. Use the method of moments to estimate k and p.

Solution. The first two population moments are
$$EX = kp,\qquad EX^2 = Var(X) + (EX)^2 = kp(1-p) + (kp)^2.$$
The first two sample moments, on the other hand, are
$$m_1' = \bar X,\qquad m_2' = \frac{1}{n}\sum_{i=1}^n X_i^2.$$

Equate the population and sample moments:
$$kp = \bar X, \tag{1.1}$$
$$kp(1-p) + (kp)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2. \tag{1.2}$$

Substituting (1.1) into (1.2), we get
$$\frac{1}{n}\sum_{i=1}^n X_i^2 = kp(1-p) + (kp)^2 = \bar X(1-p) + (\bar X)^2.$$

So
$$1 - p = \frac{n^{-1}\sum_{i=1}^n X_i^2 - (\bar X)^2}{\bar X} = \frac{n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}{\bar X}.$$
Hence, the moment estimator of p is
$$\hat p = 1 - \frac{n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}{\bar X}
         = \frac{\bar X - n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}{\bar X}.$$
Substituting this into (1.1), we get a moment estimator of k:
$$\hat k = \frac{\bar X}{\hat p}
 = \frac{\bar X}{\dfrac{\bar X - n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}{\bar X}}
 = \frac{(\bar X)^2}{\bar X - n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}.$$

Remark. Note that in this example the MM estimators are simple to compute. But they
may be negative, in which case they are unacceptable.
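The following sketch (not part of the notes; the true k, p and sample size are arbitrary choices) computes the two moment estimators and illustrates that k̂ need not be an integer and can even be negative when the sample variance exceeds the sample mean.

```python
import numpy as np

# Method-of-moments estimators for Binomial(k, p) with both k and p unknown.
rng = np.random.default_rng(5)
k_true, p_true, n = 20, 0.3, 500
x = rng.binomial(k_true, p_true, size=n)

xbar = x.mean()
s2 = np.mean((x - xbar) ** 2)          # n^{-1} * sum (X_i - Xbar)^2

p_hat = (xbar - s2) / xbar             # = 1 - s2 / xbar
k_hat = xbar ** 2 / (xbar - s2)        # may be non-integer, or negative if s2 > xbar

print("p_hat =", p_hat, " k_hat =", k_hat)
```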

6.2 Bayesian Method of Estimation
6.2.1 Methodology
Let X1, . . . , Xn be i.i.d. r.v.'s with a sampling distribution f(x|θ), indexed by θ. In the
Bayesian approach, the population parameter θ is treated as a r.v. with a prior distri-
bution, denoted by π(θ). This is a subjective distribution, based on the experimenter's
belief, and is formulated before the data are seen (hence the name prior distribution).
Once a sample is taken from the population indexed by θ, the prior distribution
is updated with this sample information to produce the so-called posterior distribu-
tion. The updating is done with the help of Bayes' rule, hence the name Bayesian
statistics. A Bayesian point estimator of θ can then be taken as the mean of the posterior
distribution.

Here is a summary of how to find the posterior distribution and point estimator of θ.
Note that f (x|θ) and π(θ) are supposed to be known here.

(a). The joint distribution of (x, θ) is:

f (x, θ) = f (x|θ)π(θ).

(b). The marginal distribution of x is:
$$m(x) = \int f(x|\theta)\pi(\theta)\,d\theta.$$

(c). The posterior distribution of θ is

f (x, θ) f (x|θ)π(θ)
π(θ|x) = = .
m(x) m(x)

(Note that this is still a random variable.)

(d). A Bayesian point estimator of θ is the posterior mean:
$$\hat\theta_B = E(\theta|x) = \int \theta\,\pi(\theta|x)\,d\theta.$$

6.2.2 Examples
Example 1. Let X1 , . . . , Xn be i.i.d. Bernoulli(p) r.v.’s. We assume that the prior
distribution on p is Beta(α, β). Find the posterior distribution of p and find a Bayesian
point estimator for p.
Solution. Let T = X1 + · · · + Xn ∼ Bin(n, p), and write q = 1 − p. Then
$$f(t|p) = \binom{n}{t} p^t q^{n-t},$$
and the prior p.d.f. of p is
$$\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{\alpha-1} q^{\beta-1}.$$

(a). The joint p.d.f. of T and p is
$$f(t, p) = f(t|p)\pi(p)
 = \left[\binom{n}{t} p^t q^{n-t}\right]\left[\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{\alpha-1} q^{\beta-1}\right]
 = \binom{n}{t}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{t+\alpha-1} q^{n-t+\beta-1}.$$

(b). The marginal p.d.f. of T is
$$m(t) = \int_0^1 f(t, p)\,dp
 = \binom{n}{t}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 p^{t+\alpha-1} q^{n-t+\beta-1}\,dp
 = \binom{n}{t}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\Gamma(t+\alpha)\Gamma(n-t+\beta)}{\Gamma(n+\alpha+\beta)}.$$

(c). The posterior p.d.f. of p (i.e. the conditional p.d.f. of p given T = t) is
$$\pi(p|t) = \frac{f(t,p)}{m(t)}
 = \frac{\binom{n}{t}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,p^{t+\alpha-1} q^{n-t+\beta-1}}
        {\binom{n}{t}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(t+\alpha)\Gamma(n-t+\beta)}{\Gamma(n+\alpha+\beta)}}
 = \frac{\Gamma(n+\alpha+\beta)}{\Gamma(t+\alpha)\Gamma(n-t+\beta)}\,p^{t+\alpha-1} q^{n-t+\beta-1},$$
which is the Beta(t + α, n − t + β) distribution.

(d). A Bayesian estimate of p is the posterior mean
$$\hat p_B = E(p|t) = \frac{t+\alpha}{n+\alpha+\beta}.$$

Remarks:
(i). We can rewrite p̂B as
$$\hat p_B = \frac{t+\alpha}{n+\alpha+\beta}
 = \left(\frac{n}{n+\alpha+\beta}\right)\frac{t}{n} + \left(\frac{\alpha+\beta}{n+\alpha+\beta}\right)\frac{\alpha}{\alpha+\beta}.$$
Thus, the Bayesian point estimator p̂B is a linear combination of the prior
mean α/(α + β) of p and the sample mean t/n. In particular, with no data (n = 0) we have
p̂B = α/(α + β), the prior mean; and as n → ∞, p̂B behaves like t/n, so lim_n p̂B = lim_n t/n = p.

(ii). Since 0 < p < 1, the choice of a Beta distribution seems reasonable. It
turns out that the Beta family is the conjugate family of the Binomial family.
Definition: Let F = {f(x|θ), θ ∈ Θ}. A class Π of priors is a conjugate family
if the posterior distributions remain in the same class Π, for all priors in Π and
all x ∈ X.
So in the last example, we started with a Beta prior and ended up with a Beta
posterior; the updating of the prior takes the form of updating its parameters.
We emphasize that although a conjugate family is convenient to deal with mathematically,
whether it is a reasonable choice depends on the experimenter.
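As a concrete illustration (not part of the notes; it assumes scipy is available, and the prior parameters and sample size are arbitrary), the conjugate Beta-Binomial update can be coded in a few lines:

```python
import numpy as np
from scipy import stats

# Beta-Binomial conjugate update: prior Beta(alpha, beta) plus t successes
# in n Bernoulli trials gives posterior Beta(t + alpha, n - t + beta).
rng = np.random.default_rng(6)
alpha, beta, p_true, n = 2.0, 2.0, 0.7, 50

x = rng.binomial(1, p_true, size=n)
t = x.sum()

post = stats.beta(t + alpha, n - t + beta)      # posterior distribution of p
p_bayes = (t + alpha) / (n + alpha + beta)      # posterior mean = Bayes estimate

print("t =", t, " posterior mean =", p_bayes, " check:", post.mean())
print("95% credible interval:", post.interval(0.95))
```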

Example 2. Let X ∼ N (θ, σ02 ), and suppose that the prior distribution on θ is N (µ0 , τ02 ).
(Here we assume that σ02 , µ0 and τ02 are all known). Find the posterior p.d.f. of θ, and a
Bayesian point estimate of θ.

Solution. Here we only give a sketch. It can be shown that the posterior distribution of
θ is N(E(θ|x), Var(θ|x)), where
$$E(\theta|x) = \frac{\tau_0^2}{\tau_0^2+\sigma_0^2}\,x + \frac{\sigma_0^2}{\tau_0^2+\sigma_0^2}\,\mu_0,\qquad
  Var(\theta|x) = \frac{\tau_0^2\sigma_0^2}{\tau_0^2+\sigma_0^2}.$$
Notice that the normal family is its own conjugate family. Clearly, the Bayesian estimator
E(θ|x) given above is again a linear combination of the prior mean µ0 and the observation x.
Also note that if τ0² → ∞ (i.e., the prior information is very vague), more weight is given to
the sample information. On the other hand, if τ0² ≪ σ0² (i.e., the prior is very precise),
more weight is given to the prior mean.
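A minimal sketch (not from the notes; all numerical values are arbitrary) checks the stated posterior mean and variance, both by the formulas and by a crude Monte Carlo conditioning on x:

```python
import numpy as np

# Normal-Normal posterior for one observation x ~ N(theta, sigma0^2)
# with prior theta ~ N(mu0, tau0^2).
sigma02, mu0, tau02 = 1.0, 0.0, 4.0
x = 2.5

post_mean = tau02 / (tau02 + sigma02) * x + sigma02 / (tau02 + sigma02) * mu0
post_var = tau02 * sigma02 / (tau02 + sigma02)
print("formula:", post_mean, post_var)                 # 2.0 and 0.8 here

# Monte Carlo check: simulate (theta, x) pairs and condition on x close to 2.5.
rng = np.random.default_rng(7)
theta = rng.normal(mu0, np.sqrt(tau02), size=2_000_000)
xs = rng.normal(theta, np.sqrt(sigma02))
keep = np.abs(xs - x) < 0.01
print("Monte Carlo:", theta[keep].mean(), theta[keep].var())
```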

Example 3. Let X be a r.v. from U nif (0, θ). Assume that the prior distribution of θ is
U (a, b) with 0 < a < b. A random sample of size 1 is taken from the population and its
value is x. Find the posterior p.d.f. of θ, and a Bayesian point estimate of θ.

Solution. Note that
$$f(x|\theta) = \frac{1}{\theta}\,I\{0 < x < \theta\},$$
and the prior p.d.f. of θ is
$$\pi(\theta) = \frac{1}{b-a}\,I\{a < \theta < b\}.$$

(a). The joint p.d.f. of X and θ is
$$f(x, \theta) = f(x|\theta)\pi(\theta) = \frac{1}{\theta(b-a)}\,I\{0 < x < \theta\}\,I\{a < \theta < b\}
 = \frac{1}{\theta(b-a)}\,I\{\max(x, a) < \theta < b\}.$$

(b). The marginal p.d.f. of X is (for 0 < x < b)
$$m(x) = \int_{-\infty}^{\infty} f(x,\theta)\,d\theta
 = \frac{1}{b-a}\int_{\max(x,a)}^{b}\frac{1}{\theta}\,d\theta
 = \frac{1}{b-a}\,(\ln\theta)\Big|_{\theta=\max(x,a)}^{b}
 = \frac{\ln b - \ln[\max(x,a)]}{b-a}.$$

(c). The posterior p.d.f. of θ (i.e. the conditional p.d.f. of θ given X = x) is
$$\pi(\theta|x) = \frac{f(x,\theta)}{m(x)}
 = \frac{\frac{1}{\theta(b-a)}\,I\{\max(x,a) < \theta < b\}}{\frac{\ln b - \ln[\max(x,a)]}{b-a}}
 = \frac{1}{\ln b - \ln[\max(x,a)]}\cdot\frac{I\{\max(x,a) < \theta < b\}}{\theta}.$$

(d). A Bayesian estimate of θ is the posterior mean
$$E(\theta|x) = \int_{-\infty}^{\infty}\theta\,\pi(\theta|x)\,d\theta
 = \frac{1}{\ln b - \ln[\max(x,a)]}\int_{\max(x,a)}^{b} d\theta
 = \frac{b - \max(x,a)}{\ln b - \ln[\max(x,a)]}.$$
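As a sanity check (illustrative only, not part of the notes; it assumes scipy is available and the values of a, b and x are arbitrary), one can verify the posterior normalization and the posterior mean numerically:

```python
import numpy as np
from scipy import integrate

# Numerical check of E(theta | x) = (b - max(x, a)) / (ln b - ln max(x, a)).
a, b, x = 1.0, 5.0, 2.0
lo = max(x, a)

posterior = lambda th: 1.0 / (th * (np.log(b) - np.log(lo)))     # pi(theta | x) on (lo, b)
norm, _ = integrate.quad(posterior, lo, b)                        # should be 1
mean, _ = integrate.quad(lambda th: th * posterior(th), lo, b)

print("normalization:", norm)
print("numerical mean:", mean, " formula:", (b - lo) / (np.log(b) - np.log(lo)))
```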
