Math, HKUST
Email: majing@ust.hk
1 Probability Models 2
1.1 Exponential family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Canonical form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Natural parameter space is a convex set . . . . . . . . . . . . . . . 4
1.1.4 Exponential family of full rank . . . . . . . . . . . . . . . . . . . . 4
1.1.5 Exponential family is closed under independent sum, marginal and
conditional operations . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.6 T (X) is also in another exponential family . . . . . . . . . . . . . . 5
1.1.7 A useful property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.8 Examples of exponential family . . . . . . . . . . . . . . . . . . . . 5
1.2 Location-scale family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Unbiased Estimation 25
3.1 Criteria of estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Do unbiased estimators always exist? . . . . . . . . . . . . . . . . . . . . . 26
3.3 Characterization of a UMVUE. . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Existence of UMVUE among unbiased estimators . . . . . . . . . . . . . . 29
3.5 Searching UMVUE’s by sufficiency and completeness . . . . . . . . . . . . 35
3.6 Some examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7 UMVUE in nonparametric families. . . . . . . . . . . . . . . . . . . . . . . 45
3.8 Failure of UMVUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.9 Procedures to find a UMVUE . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10 Information inequality (or Cramer-Rao lower bound) . . . . . . . . . . . . 46
3.10.1 Information inequality . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.10.2 Some properties of the Fisher information matrix . . . . . . . . . . 49
3.10.3 Some examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.10.4 A necessary and sufficient condition for reaching Cramer-Rao lower
bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.10.5 Information inequality in exponential family . . . . . . . . . . . . . 56
3.10.6 Information inequality in location-scale family . . . . . . . . . . . . 57
3.10.7 Regularity conditions must be checked . . . . . . . . . . . . . . . . 58
3.10.8 Relative Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.11 Some comparisons about the UMVUE and MLE’s . . . . . . . . . . . . . . 62
3.11.1 Difficulties with UMVUE’s . . . . . . . . . . . . . . . . . . . . . . . 62
3.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Chapter 1
Probability Models
There are also so-called semi-parametric models, which are mixtures of parametric
and nonparametric models (such as Cox's proportional hazards model). They can be thought of as including
parametric and nonparametric models as special cases; on the other hand, they can also
be regarded as nonparametric models.
In this course, we shall mostly concentrate on parametric models. Two very useful
parametric models will be introduced below:
1.1 Exponential family
1.1.1 Definition
A family F = {Fθ : θ ∈ Θ} of distributions is said to be a k-parameter exponential
family if the distributions Fθ have densities of the form (either with Lesbegue or counting
measures)
\[
f_\theta(x) = \exp\Big\{\sum_{j=1}^{k}\eta_j(\theta)\,T_j(x) - B(\theta)\Big\}\,h(x),
\]
with respect to some common measure µ. Here all the functions are real-valued, h(x) ≥ 0.
Remarks.
(1) Since \(\int f_\theta(x)\,dx = 1\), we find that
\[
B(\theta) = \ln\Big[\int \exp\Big\{\sum_{j=1}^{k}\eta_j(\theta)T_j(x)\Big\}\,h(x)\,dx\Big].
\]
Therefore, ηj (θ)’s are “free” parameters, while B(θ) depends on ηj ’s and hence
not a free “parameter”.
(2) Many common families are exponential families. These include the nor-
mal, gamma, beta distributions, and binomial, negative binomial etc. The
exponential families have many nice properties, as can be seen below.
(3) The above representation is not unique, since ηj Tj = (cηj )(Tj /c).
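As a quick added illustration of the definition (the normal family is one of the examples listed in remark (2)), the N(µ, σ²) density can be put in the k = 2 form above:

\[
f_\theta(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}
= \exp\Big\{\frac{\mu}{\sigma^2}\,x \;-\; \frac{1}{2\sigma^2}\,x^2 \;-\; \Big(\frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\Big)\Big\}\cdot 1,
\]

so with θ = (µ, σ) one may take η₁(θ) = µ/σ², T₁(x) = x, η₂(θ) = −1/(2σ²), T₂(x) = x², B(θ) = µ²/(2σ²) + ½log(2πσ²), and h(x) = 1.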
Remarks.
(1) Since \(\int f_\eta(x)\,dx = 1\), we find that
\[
A(\eta) = \ln\Big[\int \exp\Big\{\sum_{j=1}^{k}\eta_j T_j(x)\Big\}\,h(x)\,dx\Big].
\]
1.1.3 Natural parameter space is a convex set
The set
\[
\Xi = \Big\{\eta = (\eta_1,\dots,\eta_k) : \int \exp\Big\{\sum_{j=1}^{k}\eta_j T_j(x)\Big\}\,h(x)\,dx < \infty\Big\}
\]
is called the natural parameter space of the canonical family. The set \(\Theta = \{\theta : (\eta_1(\theta),\dots,\eta_k(\theta)) \in \Xi\}\) is
called the parameter space of the original exponential family. We now show that \(\Xi\) is
a convex set. However, \(\Theta\) may not necessarily be a convex set.
Proof. Let \(\eta\) and \(\xi\) be two points in the natural parameter space and let \(0 < \lambda < 1\). By convexity of the exponential function,
\[
e^{\lambda a + (1-\lambda)b} \le \lambda e^{a} + (1-\lambda)e^{b}.
\]
Therefore
\[
\int e^{[\lambda\eta + (1-\lambda)\xi]\,T(x)'}\,h(x)\,dx \le \lambda\int e^{\eta T(x)'}h(x)\,dx + (1-\lambda)\int e^{\xi T(x)'}h(x)\,dx < \infty,
\]
so \(\lambda\eta + (1-\lambda)\xi \in \Xi\), i.e., \(\Xi\) is convex.
Theorem 1.1.3 If X1 , ..., Xn belong to a k-parameter exponential family, then the joint
density of (X1 , ..., Xn ) is again an exponential family of the form
" n #
Xk X
fθ (x) = exp ηj (θ) Tj (xi ) − nB(θ) h(x1 )h(x2 )...h(xn ).
j=1 i=1
The proofs of these theorems are trivial and omitted here. (One can use m.g.f.
method.)
1.1.6 T (X) is also in another exponential family
Theorem 1.1.4 If X belongs to a k-parameter exponential family, then the distribution
of T = (T1 (X), ..., Tk (X)) is again an exponential family of the form
\[
f_T(t) = f_T(t_1,\dots,t_k) = \exp\Big\{\sum_{j=1}^{k}\eta_j(\theta)\,t_j - B(\theta)\Big\}\,h_0(t).
\]
Proof. The proof can be shown using the idea of m.g.f. Here we shall illustrate a
proof for the discrete r.v.. In such a case,
\[
\begin{aligned}
P(T_1=t_1,\dots,T_k=t_k) &= \sum_{T_1(x)=t_1,\dots,T_k(x)=t_k} P(X=x)\\
&= \sum_{T_1(x)=t_1,\dots,T_k(x)=t_k} \exp\Big\{\sum_{i=1}^{k}\eta_i(\theta)T_i(x) - B(\theta)\Big\}\,h(x)\\
&= \sum_{T_1(x)=t_1,\dots,T_k(x)=t_k} \exp\Big\{\sum_{i=1}^{k}\eta_i(\theta)t_i - B(\theta)\Big\}\,h(x)\\
&= \exp\Big\{\sum_{i=1}^{k}\eta_i(\theta)t_i - B(\theta)\Big\}\sum_{T_1(x)=t_1,\dots,T_k(x)=t_k} h(x)\\
&= \exp\Big\{\sum_{i=1}^{k}\eta_i(\theta)t_i - B(\theta)\Big\}\,h_0(t).
\end{aligned}
\]
is continuous and has derivatives of all orders with respect to the η’s, and these can be
obtained by differentiating under the integral sign.
Proof. See “Testing Statistical Hypotheses” by Lehmann (Chapter 2, Theorem 9)
and in Barndorff-Nielsen (1978, Section 7.1).
1.2 Location-scale family
Theorem. Let f (x) be any p.d.f.. For any constants µ and σ > 0, denote θ = (µ, σ).
Then the function
\[
g_\theta(x) = \frac{1}{\sigma}\,f\Big(\frac{x-\mu}{\sigma}\Big)
\]
is also a p.d.f..
Location family: \(\{f(x-\mu) : \mu \in \mathbb{R}\}\);
Scale family: \(\{\frac{1}{\sigma}f(\frac{x}{\sigma}) : \sigma > 0\}\);
Location-scale family: \(\{\frac{1}{\sigma}f(\frac{x-\mu}{\sigma}) : \mu \in \mathbb{R}, \sigma > 0\}\),
where \(\mu\) is called the location parameter and \(\sigma\) the scale parameter.
Chapter 2
2.1 Sufficiency
Suppose X1 , · · · , Xn ∼ Fθ and we are interested in estimating the parameter θ. We’d like
to reduce the original data set to a smaller one which captures all the information about θ
contained in the sample. In other words, the smaller data set is “sufficient” for estimating
θ.
Throughout the course, we shall use
X = (X1 , · · · , Xn ), x = (x1 , · · · , xn ) to denote the sample and a realized value of the sample, respectively.
If \(\sum x_i = t\),
\[
\mathrm{LHS} = \frac{P(X=x)}{P(\sum X_i = t)} = \frac{\prod_{i=1}^{n}P(X_i=x_i)}{P(\sum X_i = t)} = \frac{p^{\sum x_i}(1-p)^{n-\sum x_i}}{\binom{n}{t}p^{t}(1-p)^{n-t}} = \frac{1}{\binom{n}{t}},
\]
Factorization Theorem. Let X = (X1 , · · · , Xn ) be iid with pdf fθ (x). Then a statistic
T (X) is a sufficient statistic for θ iff we can decompose the joint pdf into
\[
f_\theta(x) = g(T(x),\theta)\,h(x)
\]
for some functions g and h.
Proof (discrete case, factorization implies sufficiency). Note that
\[
P(X=x \mid T(X)=t) = \begin{cases} 0, & \text{if } T(x)\neq t,\\ P(X=x, T(X)=t)/P(T(X)=t), & \text{if } T(x)=t.\end{cases}
\]
If T (x) = t, then
\[
P(T(X)=t) = \sum_{T(x)=t}P(X=x) = \sum_{T(x)=t}g(t,\theta)h(x) = g(t,\theta)\sum_{T(x)=t}h(x).
\]
So if T (x) = t,
\[
P(X=x \mid T(X)=t) = \frac{g(t,\theta)h(x)}{g(t,\theta)\sum_{T(y)=t}h(y)} = \frac{h(x)}{\sum_{T(y)=t}h(y)},
\]
which does not depend on θ, so T (X) is sufficient.
Example. Assume X1 , · · · , Xn is a random sample from
(1) Bernoulli(p). (2) U [0, θ]. (3) N (µ, σ 2 ). (4) N (σ, σ 2 ).
Find sufficient statistics for the parameters involved.
Solution.
(1) T = ∑ Xi; (2) T = X(n); (3) T = (∑ Xi, ∑ Xi²); (4) T = (∑ Xi, ∑ Xi²).
Remark 2.1.1 Clearly, it is much easier to use Factorization Theorem than to use the
direct approach.
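As a small illustrative check (not part of the original notes), the following Python sketch compares the conditional probability P(X = x | ∑ Xi = t) with 1/C(n, t) for two different values of p, mirroring the Bernoulli calculation above; the names n, t, p are chosen for this example only.

```python
# Sketch: for Bernoulli(p), P(X = x | sum(X) = t) should equal 1/C(n, t),
# independently of p, which is what sufficiency of T = sum(X_i) means.
from math import comb

n, t = 5, 2
x = (1, 1, 0, 0, 0)                          # one particular sample with sum t
for p in (0.2, 0.7):
    p_t = comb(n, t) * p**t * (1 - p)**(n - t)   # P(T = t)
    p_x = p**sum(x) * (1 - p)**(n - sum(x))      # P(X = x)
    print(p, p_x / p_t, 1 / comb(n, t))          # conditional prob vs 1/C(n,t)
```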
2.2 Minimal Sufficiency
While sufficient statistics provide a reduction to the data, minimal sufficient statistics
provide the maximum reduction to the data.
Definition: A statistic T (X) is called a minimal sufficient statistic if it is sufficient and, for any other
sufficient statistic S(X), T (X) is a function of S(X).
Minimal sufficient statistics exist under very mild conditions. e.g. see Bahadur (1959),
or Shao Jun (1999) for reference.
For two r.v.’s S, T , by S = T we mean that S = T a.s. w.r.t. P . (i.e. P (S = T ) = 1
for all P ∈ {Pθ , θ ∈ Θ}).
Theorem 2.2.1 .
Let us first give two examples of minimal sufficiency using the definition.
Example. X1 , . . . , Xn ∼Bin(1, p). Find a minimal sufficient stat for p.
Solution. By the Factorization Theorem, T = ∑ Xi is sufficient. Let us now show that
it is also minimal.
If T is not minimal, by definition, there exists a sufficient statistic S(X), such that
T (X) is NOT a function of S(X). (Recall that T is said to be a function of S ⇐⇒ S(x₁) = S(x₂)
implies T(x₁) = T(x₂). Therefore, if T is NOT a function of S, then there exist
sample points with equal S-values but different T-values.) In other words, there exists a function h (not
1-1) such that S = h(T ) is sufficient. (Intuitively, this is clear as, if T is NOT minimal
sufficient, we can always truly reduce further to some minimal S.) There must exist at
least two integers t1 , t2 with 0 ≤ t1 ≠ t2 ≤ n such that h(t1 ) = h(t2 ) ≡ s. (For simplicity,
we assume that there are exactly two t's s.t. h(t) = s.) Therefore, A ≡ P (T = t1 |S = s)
does not depend on p since S is sufficient. However,
\[
A = \frac{P(T=t_1)}{P(T=t_1)+P(T=t_2)} = \frac{\binom{n}{t_1}p^{t_1}(1-p)^{n-t_1}}{\binom{n}{t_1}p^{t_1}(1-p)^{n-t_1}+\binom{n}{t_2}p^{t_2}(1-p)^{n-t_2}},
\]
which depends on p. This contradicts the previous conclusion. Therefore, T must
be minimal sufficient.
Remark 2.2.1 Since the above example involves an exponential family, there is a much simpler
way of finding a minimal sufficient statistic.
Remark 2.2.2 We use two different methods to find minimal sufficient statistics. In
both cases, we need to guess what it is, and then try to prove the case. It would be nice to
have some routine methods to find minimal sufficient statistics. We shall do this in the
next section. The uniform example will be looked at again in the next subsection (you will
see how easy it is in comparison with the approach here).
Theorem 2.2.2 Let fθ (x) be the pdf of a sample X = (X1 , . . . , Xn ). Suppose that there
exists a function T (x) such that, for two sample points x and y, the ratio fθ (x)/fθ (y) is
free of θ (or does not depend on θ) iff T (x) = T (y). Then T (X) is a minimal sufficient
statistic for θ.
Proof. To simplify the proof, we only need to consider those x and θ where fθ (x) > 0.
So all conclusions hold a.s.
First we show that T (X) is a sufficient statistic. Let X be the sample space and
T = {t : t = T (x) for some x ∈ X } be the image of X under T (x). (A diagram here will
be helpful!) Define the partition sets induced by T (x) as At = {x : T (x) = t}. For each
At , choose and fix one element xt ∈ At . For any x ∈ X , xT (x) is the fixed element that is
in the same set At as x, where t = T (x). Since x and xT (x) are in the same set At (with
t = T (x)), we have T (x) = T (xT (x) ) = t, and hence it follows from the assumptions of
the theorem that fθ (x)/fθ (xT (x) ) is constant as a function of θ. Thus, we can define a
function on X by h(x) = fθ (x)/fθ (xT (x) ), which does not depend on θ. Define a function
on T by gθ (t) = fθ (xt ). Then it can be seen that
\[
f_\theta(x) = f_\theta(x_{T(x)})\,\frac{f_\theta(x)}{f_\theta(x_{T(x)})} = g_\theta(T(x))\,h(x) = g_\theta(t)\,h(x).
\]
Next we show that T (X) is minimal. Let T 0 (X) be any other sufficient statistic. We
need to show that T (x) is a function of T'(x). Clearly, q : t' = T'(x) → t = T (x) defines a
mapping as x runs through the sample space. To show that q is a function, it suffices to
prove that
\[
T'(x) = T'(y) \;\Longrightarrow\; T(x) = T(y). \tag{2.2}
\]
Let x and y be any two sample points with T'(x) = T'(y). Since T' is sufficient, the Factorization
Theorem gives fθ(x) = g'θ(T'(x))h'(x), so
\[
\frac{f_\theta(x)}{f_\theta(y)} = \frac{g'_\theta(T'(x))\,h'(x)}{g'_\theta(T'(y))\,h'(y)} = \frac{h'(x)}{h'(y)}.
\]
Since this ratio does not depend on θ, the assumptions of the theorem imply that T (x) =
T (y). This proves (2.2). Thus, T (x) is a function of T'(x) and hence T (x) is minimal.
Remark 2.2.3 From the proof of the theorem, we see that the “if” part ensures that
T (X) is sufficient and “only if” part ensures T (X) is minimal.
Example. Let X1 , . . . , Xn ∼ U (θ, θ + 1). Find a minimal sufficient statistic for θ.
Solution. The joint pdf is fθ(x) = I{x(n) − 1 < θ < x(1)}. Then
\[
r(\theta) \equiv \frac{f_\theta(x)}{f_\theta(y)} = \frac{I\{x_{(n)}-1 < \theta < x_{(1)}\}}{I\{y_{(n)}-1 < \theta < y_{(1)}\}}.
\]
Let T (x) = (x(1) , x(n) ). Clearly, if T (x) = T (y), then r(θ) is free of θ.
On the other hand, if T (x) ≠ T (y), then r(θ) will depend on θ. To give an example,
suppose that x(1) > y(1) and x(n) < y(n); then r(θ) could take the values 0 and 1, depending
on where θ is. That is, r(θ) is not free of θ. In other words, r(θ) is free of θ only if
T (x) = T (y).
Therefore, T = (X(1) , X(n) ) is a minimal sufficient statistic.
Example. Let X1 , . . . , Xn ∼ U (0, θ), where θ > 1. Show that T = X(n) is sufficient but
not minimal sufficient.
Proof. First let us write down the pdf of X:
\[
f_\theta(x) = \frac{1}{\theta}\,I\{0 < x < \theta\}\,I\{\theta > 1\}.
\]
The proof of sufficiency is easy by the Factorization Theorem. Let us now show it is
not minimal. Note
\[
r(\theta) \equiv \frac{f_\theta(x)}{f_\theta(y)} = \frac{\theta^{-n}\,I(0 < x_{(1)} \le x_{(n)} < \theta)\,I\{\theta > 1\}}{\theta^{-n}\,I(0 < y_{(1)} \le y_{(n)} < \theta)\,I\{\theta > 1\}}. \tag{2.3}
\]
Clearly, if x(n) = y(n) , then the ratio r(θ) is free of θ. On the other hand, if r(θ) is free
of θ, x(n) may not be equal to y(n). For instance, we could easily choose x(n) = a and
y(n) = b, where 0 < a < 1 and 0 < b < 1 but a ≠ b; then
\[
r(\theta) = \frac{I(x_{(1)}>0)\,I(x_{(n)}<\theta)\,I\{\theta>1\}}{I(y_{(1)}>0)\,I(y_{(n)}<\theta)\,I\{\theta>1\}} = \frac{I(x_{(1)}>0)\,I\{\theta>1\}}{I(y_{(1)}>0)\,I\{\theta>1\}} = \frac{I\{\theta>1\}}{I\{\theta>1\}} = 1.
\]
The reason that T = X(n) MAY NOT be minimal sufficient is that it does not take
the extra information about θ (namely θ > 1) into account. A better statistic would
be T'(X) = max{X(n) , 1}, which turns out to be minimal sufficient. To see why, noting
θ > 1, we can rewrite (2.3) as
\[
r(\theta) = \frac{I(x_{(1)}>0)\,I(x_{(n)}<\theta)\,I(1<\theta)}{I(y_{(1)}>0)\,I(y_{(n)}<\theta)\,I(1<\theta)}
= \frac{I(x_{(1)}>0)\,I(\max\{x_{(n)},1\}<\theta)}{I(y_{(1)}>0)\,I(\max\{y_{(n)},1\}<\theta)}
= \frac{I(x_{(1)}>0)\,I(T'(x)<\theta)}{I(y_{(1)}>0)\,I(T'(y)<\theta)}.
\]
Clearly, r(θ) is free of θ iff T'(x) = T'(y). Thus, T'(X) = max{X(n) , 1} is minimal
sufficient.
Intuitively, it is quite clear why T 0 (X) = max{X(n) , 1} is a minimal sufficient statistic:
if the largest observation is smaller than 1 (1/2, say), clearly, it would not be wise to
choose X(n) = 1/2 as an estimator of θ since we already know that θ > 1. Instead, we
would be better off choosing 1 as an estimator of θ. Of course, if X(n) ≥ 1, then X(n)
would be a very good estimator of θ.
2.2.2 Minimal Sufficiency for Exponential Family of Full Rank
For an exponential family of full rank, finding minimal sufficient statistics is a relatively
easy matter.
Proof. Sufficiency follows immediately from the Factorization Theorem. Let us prove
minimality next. Since the exponential family is of full rank, then there exists a θ0
such that η1 (θ) − η1 (θ0 ), ..., ηk (θ) − ηk (θ0 ) are linearly independent, for values of θ in a
neighborhood of θ0 . (Why?) Note that
By the necessary and sufficient condition for minimal sufficiency, if the above ratio is free
of θ, then we must have
\[
\sum_{j=1}^{k}\,[\eta_j(\theta) - \eta_j(\theta_0)]\,[T_j(x) - T_j(y)] = C(x, y),
\]
Fix θ0 ∈ Θ; then \(\sum_{j=1}^{k} a_j\eta_j(\theta_0) = C(x, y)\), where aj ≡ Tj(x) − Tj(y). Then
\[
\sum_{j=1}^{k} a_j\,[\eta_j(\theta) - \eta_j(\theta_0)] = 0. \tag{2.4}
\]
It is easy to see that the natural parameter space Ξ is a half space in R2 , which certainly
contains a 2-dim rectangle. By the above theorem, T (X) is a minimal sufficient statistic.
Clearly, the natural parameter space Ξ is a curve in R2 , which certainly DOES NOT
contain a 2-dim rectangle. Therefore, one can not use the very last theorem to find the
minimal sufficient statistic. However, we can use the simple rule given earlier. Note that
\[
r(\theta) \equiv \frac{f_\theta(x)}{f_\theta(y)} = \exp\Big\{\frac{1}{\theta}\big(T_1(x)-T_1(y)\big) - \frac{1}{2\theta^2}\big(T_2(x)-T_2(y)\big)\Big\}
\]
is free of θ iff
\[
\frac{1}{\theta}\big(T_1(x)-T_1(y)\big) - \frac{1}{2\theta^2}\big(T_2(x)-T_2(y)\big) = C(x, y) = 0,
\]
where the last equality is obtained by letting θ → ∞ on both sides. Again, since θ is
arbitrary, we get (by taking θ = 1 and θ = 2)
Example. If X1 , . . . , Xn ∼ exp(θ, 1) with pdf f (x − θ) = e^{−(x−θ)} I(x > θ), then it can be
shown that X(1) is minimal sufficient. But if we have the restriction that θ > 1, then X(1) is
no longer minimal sufficient.
Example. If X1 , . . . , Xn ∼ U (θ, θ + 1), then it has been shown that (X(1) , X(n) ) is minimal
sufficient.
In all the above examples, sufficiency allows us to reduce the original n-dim data to
one or two-dim data. However, this is not possible for the following examples.
Example (Cauchy location family). Let X1 , . . . , Xn have pdf fθ(x) = ∏_{i=1}^n [π(1 + (xi − θ)²)]^{-1}. We show that the order statistics T (X) = (X(1) , ..., X(n) ) are minimal sufficient. Consider r(θ) = fθ(x)/fθ(y).
(1). If x(k) = y(k) for all k = 1, 2, ..., n, then clearly the ratio r(θ) is free of θ.
(2). If r(θ) is free of θ, then r(θ) = lim_{θ→∞} r(θ) = 1. So we have
\[
\prod_{k=1}^{n}\big(1 + (y_k - \theta)^2\big) = \prod_{k=1}^{n}\big(1 + (x_k - \theta)^2\big).
\]
Since both sides are polynomials in θ of degree 2n, they must have the same roots as well.
That is,
\[
\{x_k \pm \sqrt{-1},\; k = 1, ..., n\} = \{y_k \pm \sqrt{-1},\; k = 1, ..., n\},
\]
which is equivalent to {xk , k = 1, ..., n} = {yk , k = 1, ..., n}, which is again equivalent to
x(k) = y(k) for all k = 1, 2, ..., n.
Hence T (X) = (X(1) , ..., X(n) ) is a minimal sufficient statistic.
Both sides are polynomials in η of degree n, which holds for all η > 0, so that coefficients
of the two polynomials are also the same. Therefore, the two polynomials are equal for
all values of η (real or imaginary). As a result, they must have the same roots as well.
That is,
{− exp{xk }, k = 1, ..., n} = {− exp{yk }, k = 1, ..., n}
which is equivalent to {xk , k = 1, ..., n} = {yk , k = 1, ..., n}, which is again equivalent to
x(k) = y(k) for all k = 1, 2, ..., n.
Hence T (X) = (X(1) , ..., X(n) ) is a minimal sufficient statistic.
Example (double-exponential/Laplace location family). Let X1 , . . . , Xn have pdf proportional to exp{−|x − θ|}.
Show that T (X) = (X(1) , ..., X(n) ) is minimal sufficient.
Proof.
\[
r(\theta) = \frac{f_\theta(y)}{f_\theta(x)} = \frac{\exp\{\sum_{i=1}^{n}|x_i - \theta|\}}{\exp\{\sum_{i=1}^{n}|y_i - \theta|\}}.
\]
If r(θ) is free of θ, then
\[
\lim_{\theta\to\infty} r(\theta) = \frac{\exp\{\sum_{i=1}^{n}(\theta - x_i)\}}{\exp\{\sum_{i=1}^{n}(\theta - y_i)\}} = \exp\Big\{\sum_{i=1}^{n}(y_i - x_i)\Big\} = A,
\qquad
\lim_{\theta\to-\infty} r(\theta) = \frac{\exp\{\sum_{i=1}^{n}(x_i - \theta)\}}{\exp\{\sum_{i=1}^{n}(y_i - \theta)\}} = \exp\Big\{\sum_{i=1}^{n}(x_i - y_i)\Big\} = B.
\]
Since B = 1/A and a constant r(θ) forces A = B, we get A = B = 1, so r(θ) ≡ 1. Hence,
\[
g(\theta) := \sum_{i=1}^{n}|x_i - \theta| = \sum_{i=1}^{n}|y_i - \theta|.
\]
Note that the value which minimizes the above is the sample median. Clearly, the two
data sets X’s and Y ’s have the same median from the above equation. We consider two
cases:
(i) if n = 2m + 1 is odd, then minimizing w.r.t. θ, we see that x’s and y’s have
the same sample median. That is,
x(m+1) = y(m+1) .
(ii) if n = 2m is even, any value in the interval [x(m) , x(m+1) ] is a median of x’s,
which minimizes g(θ) w.r.t. θ. Similarly, any value in the interval [y(m) , y(m+1) ]
is a median of y’s, which minimizes g(θ) w.r.t. θ. These two sets must be the
same, thus x(m) = y(m) and x(m+1) = y(m+1) . Removing these two points,
we still have an even number of points left. Continuing in this way, we see that
x(k) = y(k) for all k = 1, ..., n.
Remark 2.2.4 For point estimation problems, it is best to concentrate on minimal suf-
ficient statistics, since they provide the maximum data reduction.
2.3 Complete Statistics
Definition:
Let fθ (t) be a family of pdf's for a statistic T . The family is called complete
if, for every Borel function g, Eθ g(T ) = 0 for all θ implies Pθ (g(T ) = 0) = 1 for all θ. Then T is called a
complete statistic for θ.
T is said to be boundedly complete if the previous statement holds for all
bounded Borel functions g.
Proof. We have shown that T (X) is minimal sufficient. We will show that it is also
complete. From Theorem 1.1.4, the distribution of T = (T1 (X), ..., Tk (X)) is again an
exponential family of the form
Xk
fT (t) = fT (t1 , ..., tk ) = exp ηj tj − A(θ) h0 (t),
j=1
where t = (t1 , ..., tk ). Now suppose that there exists a function g such that E[g(T )] = 0
for all η ∈ Ξ. Then by the definition of f_η(x), we have
\[
\int g(t)\,\exp\{t\eta' - A(\eta)\}\,d\lambda(t) = 0, \qquad\text{for all } \eta\in\Xi,
\]
where \(\lambda(A) = \int_A h_0(t)\,dt\) is a measure on \((\mathbb{R}^k, \mathcal{B}^k)\). Let η0 be an interior point of Ξ. Then
\[
\int g^{+}(t)\,\exp\{t\eta' - A(\eta)\}\,d\lambda = \int g^{-}(t)\,\exp\{t\eta' - A(\eta)\}\,d\lambda \qquad\text{for all } \eta\in N(\eta_0). \tag{3.5}
\]
That is, the m.g.f.'s of the two p.d.f.'s
\[
\frac{g^{+}(t)\exp\{t\eta_0'\}}{\int g^{+}(t)\exp\{t\eta_0'\}\,d\lambda(t)}
\qquad\text{and}\qquad
\frac{g^{-}(t)\exp\{t\eta_0'\}}{\int g^{-}(t)\exp\{t\eta_0'\}\,d\lambda(t)}
\]
are the same in a neighborhood of 0. This implies that g⁺(t) = g⁻(t)
a.e. λ. That is, g(t) = 0 a.e. λ; that is, T is complete.
(1). Find a minimal sufficient and complete statistic for θ. (Note that this belongs to
an exponential family.)
(2). Find the family of distributions of T = ∑ Xi. Check directly whether the family
is complete.
Example. Let X1 , . . . , Xn be iid N(θ, θ^{1+ε}), where θ > 0 and ε is a known constant. Show that
(1). the statistic T = (∑ Xi, ∑ Xi²) is a complete minimal sufficient statistic for θ iff ε ≠ 0 and ε ≠ −1;
(2). when ε = −1, the statistic T1 = ∑ Xi is a complete minimal sufficient statistic for θ;
(3). when ε = 0, the statistic T2 = ∑ Xi² is a complete minimal sufficient statistic for θ.
2.3.3 Examples of Non-Complete Families
Completeness is a rather strong requirement, and most statistics are not complete. A
simple example is given below.
Example. Suppose X1 , ..., Xn ∼ F (x), which is absolutely continuous with p.d.f. f (x). Then
X1 − X2 and (X1 , ∑_{i=2}^{n-1} Xi) are not complete.
Solution. We shall only look at the case X1 − X2 . Clearly, E(X1 − X2 ) = 0. However,
\[
P(X_1 - X_2 = 0) = \int_{-\infty}^{\infty} P(X_1 = X_2 \mid X_2 = x)\,f(x)\,dx = \int_{-\infty}^{\infty} P(X_1 = x)\,f(x)\,dx = 0.
\]
Note that in the above example, X1 − X2 is not sufficient, while (X1 , ∑_{i=2}^{n-1} Xi) may
be sufficient in many situations. The natural question is: whether a minimal sufficient
statistic is complete. The answer is negative, as illustrated below.
Example. Show that the minimal sufficient statistic for U (θ, θ + 1) is not complete.
Solution. It has been shown that the minimal sufficient statistic is T = (X(1) , X(n) ).
Since F (x) = (x − θ)I(θ < x < θ + 1) + I(x ≥ θ + 1), we get P (X(n) ≤ x) = F (x)n .
Therefore,
\[
E(X_{(n)}) = \int x\,dF^n(x) = \int_{\theta}^{\theta+1} n x (x-\theta)^{n-1}\,dx = n\int_0^1 (y+\theta)\,y^{n-1}\,dy = \frac{n}{n+1} + \theta = (\theta+1) - \frac{1}{n+1}.
\]
Similarly, we can get \(E(X_{(1)}) = \theta + \frac{1}{n+1}\). Therefore,
\[
E\Big[X_{(n)} - X_{(1)} - \frac{n-1}{n+1}\Big] = 0.
\]
But \(X_{(n)} - X_{(1)} - \frac{n-1}{n+1} \neq 0\) with positive probability. So T is not complete.
Remark: This example shows that minimal sufficiency does not imply completeness.
Remark: A complete and sufficient statistic is minimal sufficient, which was shown by
Lehmann and Scheffe (1950) and Bahadur (1957). Also see Shao (1999, p80).
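The following Python sketch (added for illustration; the variable names n, theta, reps are assumptions of this example) checks by Monte Carlo that g(T) = X(n) − X(1) − (n−1)/(n+1) has mean approximately 0 for several values of θ while not being identically 0, which is exactly the failure of completeness shown above.

```python
# Sketch: unbiased estimator of zero based on T = (X_(1), X_(n)) under U(theta, theta+1).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000
for theta in (0.0, 2.5, -1.3):
    x = rng.uniform(theta, theta + 1, size=(reps, n))
    g = x.max(axis=1) - x.min(axis=1) - (n - 1) / (n + 1)
    print(theta, g.mean(), g.std())   # mean ~ 0 for every theta, but std > 0
```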
Example. Show that T = (Xn , ∑_{i=1}^{n-1} Xi) for Bin(1, p) is not complete.
Solution. E[Xn − ∑_{i=1}^{n-1} Xi/(n − 1)] = 0, but Xn − ∑_{i=1}^{n-1} Xi/(n − 1) ≠ 0. So T is not
complete.
2.4 Ancillary Statistics
Definition: S(X) is called an ancillary statistic if its distribution does not depend on θ.
Example. X1 , . . . , Xn follows F (x − θ), the location parameter family. Show that the
range, R = X(n) − X(1) , is an ancillary statistic.
Proof. Let Zi = Xi − θ. Then Z1 , · · · , Zn follows F (x), free of θ. Then
\[
P(R \le x) = P\big((X_{(n)}-\theta) - (X_{(1)}-\theta) \le x\big) = P\big(Z_{(n)} - Z_{(1)} \le x\big),
\]
which is free of θ.
Example. X1 , . . . , Xn is from F (x/σ), the scale parameter family. Show that (X1 /Xn , · · · , Xn−1 /Xn )
is an ancillary statistic.
Proof. Let Zi = Xi /σ. Then Z1 , · · · , Zn follows F (x), free of σ. Then
\[
P(X_1/X_n \le t_1, \dots, X_{n-1}/X_n \le t_{n-1}) = P(Z_1/Z_n \le t_1, \dots, Z_{n-1}/Z_n \le t_{n-1}),
\]
which is free of σ.
Basu’s theorem: If T (X) is a bounded complete and sufficient statistic, then T (X) is
independent of every ancillary statistic.
Proof. Let T be a boundedly complete and sufficient statistic, and V an ancillary statistic.
Let A, B be any two Borel sets. Since V is ancillary, p_B ≡ P(V ∈ B) does not depend on θ; since T is
sufficient, η_B(T) ≡ P(V ∈ B | T) does not depend on θ. Because E_θ[η_B(T) − p_B] = 0 for all θ and η_B is
bounded, bounded completeness gives η_B(T) = p_B a.s. Therefore,
\[
P(T \in A, V \in B) = E[I(T\in A)\,\eta_B(T)] = p_B\,P(T \in A) = P(T\in A)\,P(V\in B),
\]
so T (X) is independent of every ancillary statistic.
Example. Let X1 , . . . , Xn be a random sample from the pdf fθ (x) = e−(x−θ) I{x > θ}.
2.6 Exercises
1. Let X be one observation from N (0, σ 2 ). Is |X| a sufficient statistic?
4. Suppose that X1 , . . . , Xn ∼ Uniform(θ − 1/2, θ + 1/2). Find a sufficient statistic for θ.
5. Let X1 , . . . , Xn be independent r.v.’s with densities
fXi (x) = eiθ−x , for x ≥ iθ.
Prove that T = mini {Xi /i} is a sufficient statistic for θ.
6. For each of the following distributions, let X1 , . . . , Xn be a random sample. Find a
minimal sufficient statistic.
(a). f (x) = (2π)^{−1/2} e^{−(x−θ)²/2} (normal)
(b). f (x) = P (X = x) = e^{−θ} θ^x /x!, for x = 0, 1, ... (Poisson)
(c). f (x) = e^{−(x−θ)} , for x > θ (location exponential)
(d). f (x) = c(µ, σ) exp{−(x − µ)⁴/σ⁴}, where σ > 0
(e). f (x) = π^{−1} [1 + (x − θ)²]^{−1} (Cauchy)
Chapter 3
Unbiased Estimation
3.2 Do unbiased estimators always exist?
Definition: g(θ) is said to be U -estimable if there exists a statistic η(X) such that
Eη(X) = g(θ).
So à !
n
X n k+1
η(k) p (1 − p)n−k = 1.
k=0 k
But limp→0 LHS = 0. So we can choose p small enough such that the above equality does
not hold. Therefore, g(p) = p−1 is not U -estimable.
The next theorem provides a necessary and sufficient condition for UMVUE.
(1). Let δ(X) be an unbiased estimator of g(θ) with Eδ 2 (X) < ∞. δ(X) is a
UMVUE iff E[δ(X)U (X)] = 0 for any U ∈ U and all θ ∈ Θ, where U is the
set of all unbiased estimators of 0 with finite variances.
(2). Let T be a (minimal) sufficient statistic and η(T ) be an unbiased estimator
of g(θ) with Eη²(T ) < ∞, where η(·) is a Borel function. Then η(T ) is a
UMVUE iff E[η(T )U (T )] = 0 for all θ ∈ Θ and any U ∈ Ũ, where Ũ is the
subset of U consisting of all unbiased estimators of 0 that are Borel functions of T .
Proof of the theorem. We shall only prove the first part. The second part can be
shown similarly.
(i) Necessity. Suppose that δ is a UMVUE for g(θ). Fix U ∈ U and θ ∈ Θ. Let
δ1 = δ + λU . Clearly Eδ1 = g(θ), and so
V ar(δ1 ) = V ar(δ) + 2λCov(U, δ) + λ2 V ar(U ) ≥ V ar(δ),
that is,
2λCov(U, δ) + λ2 V ar(U ) ≥ 0,
for all λ. The quadratic in λ has two roots λ1 = 0 and λ2 = −2Cov(U, δ)/V ar(U ), and
hence takes on negative values unless λ1 = λ2 , that is, Cov(U, δ) = 0.
(ii) Sufficiency. Suppose that Eθ (δU ) = 0 for all U ∈ U. For any δ1 such that
E(δ1 ) = g(θ), we have δ1 −δ ∈ U . Hence, Eθ [δ(δ1 −δ)] = 0. From this and Cauchy-Schwarz
inequality, we have V ar(δ) = Cov(δ, δ1 ) ≤ V ar1/2 (δ)V ar1/2 (δ1 ). That is, V ar(δ) ≤
V ar(δ1 ).
The proof is thus complete.
Remark. Part (2) of the theorem is more useful in practice. Here we only need to look at
functions of the (minimal) sufficient statistic, which is enough for point estimation
purposes.
Remark. There are three important aspects in the application of the characterization
theorem to find a UMVUE.
(1). E[U (X)] = 0.
(2). E[δ(X)U (X)] = 0.
(3). E[δ(X)] = g(θ), where g(θ) is the parameter of interest.
Or
(1)’. E[U (T )] = 0.
(2)’. E[η(T )U (T )] = 0.
(3)’. E[η(T )] = g(θ), where g(θ) is the parameter of interest.
Remark. The theorem follows from the projection theorem of linear space theory if we
interpret E(δ − U )2 as the norm in L2 space and U forms a Hilbert space.
Remark. From the proof of the theorem, given an unbiased estimator δ, if there exists an
unbiased estimator of zero that is CORRELATED with δ, then we can always choose a multiple
of it having negative correlation with δ, and hence adding it reduces the variance
of the resulting linear combination.
Remark. Note that U or Ũ is not empty, since zero itself is in it. On the other hand,
if zero is the ONLY element, i.e., Ũ = {0} or U = {0}, then the requirement in the theorem is
certainly satisfied. In particular, this is guaranteed if the family of distributions (of T) is complete.
This can be used to prove the Lehmann-Scheffe theorem introduced later.
Theorem 3.3.1 Let Ti be a UMVUE of θi , i = 1, ..., k, where k is a fixed positive
integer. Then
(a) T = ∑_{i=1}^k ci Ti is a UMVUE of θ = ∑_{i=1}^k ci θi, for any constants c1 , ..., ck.
(Therefore, the class of UMVUE's is closed under linear operations; that is, all UMVUE's
form a linear space.)
(b) T = ∏_{i=1}^k Ti is a UMVUE of θ = ET = E(∏_{i=1}^k Ti). In particular, if the Ti's
are independent, then E(∏_{i=1}^k Ti) = ∏_{i=1}^k θi.
(b) For simplicity, assume that k = 2. Since T2 is UMVUE, we have E(T2 U ) = 0 for
any U ∈ U. Thus, T2 U ∈ U by the Characterization Theorem, which in turn implies that
E(T1 T2 U ) = E[T1 (T2 U )] = 0 by the Characterization Theorem again.
Theorem (Uniqueness). A UMVUE of g(θ), if it exists, is unique a.s.
Proof. Let δ1 and δ2 be two UMVUE's of g(θ). Then Eδ1 = Eδ2 and Var(δ1) =
Var(δ2) ≡ V (θ). Take δ = (δ1 + δ2)/2, which is also unbiased for g(θ). Then
\[
\mathrm{Var}(\delta) = \tfrac14\big[\mathrm{Var}(\delta_1) + \mathrm{Var}(\delta_2) + 2\,\mathrm{Cov}(\delta_1,\delta_2)\big] = \tfrac{1+\rho}{2}\,V(\theta),
\]
where ρ = Corr(δ1 , δ2 ). It follows that ρ = 1; otherwise Var(δ) < Var(δ1), which contradicts
the assumption that δ1 is a UMVUE. Therefore, Var(δ1 − δ2) = 2V(θ)(1 − ρ) = 0. So
\[
P\big(\delta_1 - \delta_2 = E(\delta_1 - \delta_2)\big) = P(\delta_1 - \delta_2 = 0) = 1.
\]
3.4 Existence of UMVUE among unbiased estimators
In the class of all unbiased estimators, there are three possibilities concerning the set of
UMVUE estimators.
U (θ − 1/2) = U (θ + 1/2).
(From here, we see that U (θ) is a periodic function with period 1. Clearly, X is not
complete.)
Example. Let X1 , . . . , Xn be a sample of size n from U (θ − 1/2, θ + 1/2), where θ ∈ R.
We show that no nonconstant function of θ has a UMVUE.
Proof. We shall use the UMVUE characterization theorem, part (2). It can be shown
that T = (X(1) , X(n) ) is a minimal sufficient statistic. Recall that if X1 , . . . , Xn ∼ F with
pdf f , then
\[
f_{(X_{(r)},X_{(s)})}(x,y) = \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\,(F(x))^{r-1}\,(F(y)-F(x))^{s-r-1}\,(1-F(y))^{n-s}\,f(x)f(y),
\]
(a) From
\[
EU(T) = 0 = EU(X_{(1)}, X_{(n)}) = \int_{y=\theta-1/2}^{\theta+1/2}\int_{x=\theta-1/2}^{y} U(x,y)\,\frac{n(n-1)(y-x)^{n-2}}{(\theta_2-\theta_1)^n}\,dx\,dy,
\]
we get
\[
\int_{y=\theta-1/2}^{\theta+1/2}\int_{x=\theta-1/2}^{y} U(x,y)\,(y-x)^{n-2}\,dx\,dy = 0.
\]
A discrete version of the above example is the following.
Proof. Let us just do it for the case n = 1. We shall use the UMVUE characterization
theorem (part (1) or (2)).
U (θ − 1) + U (θ) + U (θ + 1) = 0
U (θ) + U (θ + 1) + U (θ + 2) = 0
Subtracting the second equation from the first, we get U (θ − 1) − U (θ + 2) = 0, i.e.,
(c) If Eδ(X) = g(θ), recalling that δ(·) is periodic with period 3, then
Remark. It is interesting to note that both examples belong to the location family.
Case 2. Some but not all parameters have UMVUE.
Lemma. If δ0 is any unbiased estimator of g(θ), then the totality of unbiased estimators
is given by δ = δ0 − U , where U is any unbiased estimator of zero, that is, it satisfies
Eθ (U ) = 0 for all θ ∈ Θ.
Example. (Lehmann, p. 76). Let X take on the values −1, 0, 1, ... with probabilities
P (X = −1) = p, P (X = k) = p^k q², k = 0, 1, 2, ..., where q = 1 − p.
Now if EU = 0, then
\[
U(-1)\,p + \sum_{k=0}^{\infty} U(k)\,(1-p)^2 p^k = 0.
\]
That is,
U (0) = 0
U (−1) − 2U (0) + U (1) = 0
U (k) − 2U (k + 1) + U (k + 2) = 0, k ≥ 0.
Then
U (0) = 0
U (1) = −U (−1)
U (2) = 2U (1) − U (0) = 2U (1)
...
U (k) = kU (1)
...
Denote a = U (1), then we’ve shown that
U (X) = aX.
where, for i = 1, 2,
\[
C_2(i) = \sum_{k=-1}^{\infty} k^2\,P(X=k), \qquad
C_1(i) = \sum_{k=-1}^{\infty} k\,\delta_i(k)\,P(X=k), \qquad
C_0(i) = \sum_{k=-1}^{\infty} \delta_i^2(k)\,P(X=k).
\]
Thus \(\sum_k p^k\,[\delta_i(k) - ak]^2\) is minimized iff
Example. Find the totality of g(p) which have UMVUE’s in the last example.
Solution. From the UMVUE Characterization Theorem, δ is UMVUE iff
Clearly,
\[
\delta(k) = \begin{cases} c, & k \neq 0,\\ d, & k = 0.\end{cases}
\]
Therefore,
\[
E\delta(X) = c\,P(X\neq 0) + d\,P(X=0) = c\,(1-q^2) + d\,q^2 = C_1 + C_2 q^2.
\]
Therefore, g(p) has a UMVUE iff it is of the form C1 + C2 q².
In the first two cases, we have not used any other knowledge. If we have a complete
and sufficient statistic, then a UMVUE always exists among the unbiased estimators.
This will be studied later. We point out that, when a complete and sufficient statistic
is not available, it is usually very difficult to derive a UMVUE. In some cases, we can
use the UMVUE characterization theorem, provided we know enough about unbiased
estimators of 0.
3.5 Searching UMVUE's by sufficiency and completeness
If we have a complete and sufficient statistic, then finding a UMVUE is a relatively simple
matter.
If we have a sufficient statistic, and an unbiased estimator of g(θ), then we can always
improve it by the following Rao-Blackwell Theorem.
Proof. Let η(T) = E(δ | T), which is a statistic since T is sufficient, and Eη(T) = Eδ = g(θ). Then
\[
MSE(\delta) = E(\delta - g(\theta))^2 = E(\delta - \eta(T))^2 + E(\eta(T) - g(\theta))^2 = E(\delta - \eta(T))^2 + MSE(\eta),
\]
where the cross term vanishes because \(E\big[(\delta - \eta(T))(\eta(T) - g(\theta))\big] = E\big\{(\eta(T)-g(\theta))\,E[\delta - \eta(T)\mid T]\big\} = 0\).
Therefore, MSE(δ) ≥ MSE(η) and the equality holds iff δ = η(T ) a.s.
Remark. Rao-Blackwell Theorem can only improve an unbiased estimator, but it does
not guarantee that it is a UMVUE.
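To make the Rao-Blackwell improvement concrete, here is a small Monte Carlo sketch (added for illustration; the names n, p, reps are assumptions of this example) for Bernoulli(p): the crude unbiased estimator δ = X1 is replaced by E[X1 | ∑ Xi] = X̄, which has much smaller variance.

```python
# Sketch: Rao-Blackwellization of delta = X_1 for Bernoulli(p).
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 20, 0.3, 100_000
x = rng.binomial(1, p, size=(reps, n))
delta = x[:, 0]            # crude unbiased estimator X_1
eta = x.mean(axis=1)       # E[X_1 | sum X_i] = Xbar
print(delta.mean(), eta.mean())   # both ~ p (unbiasedness preserved)
print(delta.var(), eta.var())     # Var(eta) ~ p(1-p)/n, much less than p(1-p)
```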
Now if we have a sufficient and complete statistic, then finding a UMVUE is a relatively
easy exercise.
for all θ ∈ Θ. (In fact, equality holds all the time.) Therefore, η(T ) is a UMVUE.
The uniqueness follows from the Uniqueness Theorem given earlier, or can be implied
by the completeness as well.
Method 2.
As in the first method, we shall only restrict our attention to the class of estimators
which are functions of T . Since T is complete, then Eθ (U (T )) = 0 implies U (T ) = 0.
To apply the UMVUE characterization theorem (2), we see that Ũ = {0}. Then the
necessary and sufficient condition is satisfied. Hence, η(T ) is a UMVUE. The proof of
uniqueness is as in the first proof.
3.6 Some examples.
The above two theorems provide two different methods for finding a UMVUE. We shall
now give some examples to illustrate each method.
Example. X1 , . . . , Xn ∼ U (0, θ). Find a UMVUE for g(θ) where g(x) is differentiable.
Solution. It is known that T = X(n) is complete and sufficient, and
\[
\eta(T) = g(X_{(n)}) + \frac{X_{(n)}}{n}\,g'(X_{(n)}).
\]
(a). If g(θ) = θ, then \(\eta(T) = \frac{n+1}{n}X_{(n)}\).
(b). If g(θ) = θ², then \(\eta(T) = X_{(n)}^2 + \frac{2}{n}X_{(n)}^2 = \frac{n+2}{n}X_{(n)}^2\).
Remark. Note that g(θ) can be estimated by g(X(n) ), and the bias is of order O(n−1 ).
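A quick simulation sketch (added here; n, theta, reps are example choices, not from the notes) illustrates case (a): the plug-in estimator X(n) carries an O(1/n) bias, while the UMVUE (n+1)X(n)/n is essentially unbiased.

```python
# Sketch: bias of X_(n) versus the UMVUE (n+1)/n * X_(n) for U(0, theta).
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 10, 3.0, 200_000
xmax = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
print(xmax.mean() - theta)                    # bias ~ -theta/(n+1) = O(1/n)
print(((n + 1) / n * xmax).mean() - theta)    # bias ~ 0
```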
Example. Let X1 , . . . , Xn be iid Poisson(λ), find a UMVUE for g(λ), where g(λ) is
infinitely differentiable.
Solution. T = ∑ Xi is complete and sufficient, and T ∼ Poisson(nλ); that is, f_T(t) =
e^{−nλ}(nλ)^t/t! for t = 0, 1, 2, ... Then
\[
E\eta(T) = \sum_{k=0}^{\infty}\eta(k)\,e^{-n\lambda}\,(n\lambda)^k/k! = g(\lambda).
\]
Then
\[
\sum_{k=0}^{\infty}\frac{\eta(k)\,n^k\lambda^k}{k!} = e^{n\lambda}g(\lambda) = \Big(\sum_{i=0}^{\infty}\frac{g^{(i)}(0)\lambda^i}{i!}\Big)\Big(\sum_{j=0}^{\infty}\frac{n^j\lambda^j}{j!}\Big) = \sum_{k=0}^{\infty}a_k\lambda^k,
\]
so that \(\eta(k) = k!\,a_k/n^k\), where \(a_k = \sum_{i+j=k} g^{(i)}(0)\,n^j/(i!\,j!)\).
(a). If g(λ) = λ^r, where r > 0 is an integer, then
\[
\eta(T) = \frac{T!\,n^{T-r}}{n^{T}\,(T-r)!} = \frac{T(T-1)\cdots(T-r+1)}{n^r}\ \ \text{if } T \ge r, \qquad \eta(T) = 0\ \ \text{if } T < r.
\]
(b). If g(λ) = P(X1 = k) = e^{−λ}λ^k/k!, for any k = 0, 1, ..., then we can use the
above result.
For instance, if k = 0, g(λ) = e^{−λ}. So g^{(i)}(λ) = (−1)^i e^{−λ} and g^{(i)}(0) = (−1)^i.
Therefore, a UMVUE of e^{−λ} is
\[
\eta(T) = \sum_{i=0}^{T}\binom{T}{i}\frac{(-1)^i}{n^i} = \Big(1 - \frac{1}{n}\Big)^{T}.
\]
Example. Let X1 , . . . , Xn ∼ N(µ, σ²). Find a UMVUE of µ/σ².
Solution. T = (X̄, S²) is complete and sufficient. Consider an estimator of the form
\[
\eta(T) = C\,\frac{\bar X}{S^2}.
\]
Then Eη(T) = Cµ E(1/S²). Now taking r = n − 1, we have
à ! à !
σ2 1
E = E
(n − 1)S 2 χ2n−1
Z ∞
1 xr/2−1 e−x/2
= dx
x 2r/2 Γ(r/2)
0
Z ∞
x(r−2)/2−1 e−x/2 2(r−2)/2 Γ((r − 2)/2)
= dx
0 2(r−2)/2 Γ((r − 2)/2) 2r/2 Γ((r − 2)/2 + 1)
1 Γ((n − 3)/2)
=
2 Γ((n − 1)/2)
1
= .
n−3
Then
\[
E\eta(T) = C\mu\,E\Big(\frac{1}{S^2}\Big) = C\mu\,\frac{n-1}{(n-3)\sigma^2} = \frac{\mu}{\sigma^2},
\]
if C = (n − 3)/(n − 1). That is, \(\frac{n-3}{n-1}\,\frac{\bar X}{S^2}\) is a UMVUE of µ/σ².
Remark. We first remark that many discrete distributions belong to this family. Ex-
amples include binomial, negative binomial, Poisson, etc. For instance, the Binomial
distribution can be written as
à ! à !à !x
n x n p
P (X = x) = p (1 − p)n−x = (1 − p)n , so θ = p/(1 − p).
x x 1−p
Note that c(θ) can be written as
\[
c(\theta) = \sum_{x=0}^{\infty} a(x)\,\theta^x.
\]
Also note that the family of power series distributions belongs to an exponential family, since
\[
P(X_i = x) = e^{x\log\theta}\,\gamma(x)/c(\theta), \qquad x = 0, 1, 2, \dots
\]
Thus T = ∑ Xi is a minimal sufficient and complete statistic. Also, the distribution of T
is again in a power series family, i.e., for t ≥ 0,
\[
P(T=t) = P\Big(\sum X_i = t\Big) = \sum_{\sum x_i = t} P(X_1=x_1)\cdots P(X_n=x_n) = \sum_{\sum x_i = t} a(x_1)\cdots a(x_n)\,\theta^t/[c(\theta)]^n = A(t,n)\,\theta^t/[c(\theta)]^n,
\]
where \(A(t, n) = \sum_{\sum x_i = t} a(x_1)\cdots a(x_n)\), which satisfies
\[
1 = \sum_{t=0}^{\infty} P(T=t) = \sum_{t=0}^{\infty} A(t,n)\,\theta^t/[c(\theta)]^n.
\]
That is, \(\sum_{t=0}^{\infty} A(t,n)\theta^t = [c(\theta)]^n\). Therefore, A(t, n) is the coefficient of θ^t in the expansion
of [c(θ)]^n.
Now for η(T ) to be a UMVUE of g(θ), we have
\[
g(\theta) = E\eta(T) = \sum_{t=0}^{\infty}\eta(t)\,A(t,n)\,\theta^t/[c(\theta)]^n.
\]
That is,
\[
\sum_{t=0}^{\infty}\eta(t)\,A(t,n)\,\theta^t = g(\theta)\,[c(\theta)]^n.
\]
In particular, if g(θ) = θ^r/[c(θ)]^p for nonnegative integers r and p, then
\[
\sum_{t=0}^{\infty}\eta(t)\,A(t,n)\,\theta^t = \theta^r\,[c(\theta)]^{n-p} = \sum_{t=0}^{\infty}A(t, n-p)\,\theta^{t+r} = \sum_{t=r}^{\infty}A(t-r, n-p)\,\theta^t.
\]
Matching coefficients of θ^t gives
\[
\eta(t)\,A(t,n) = \begin{cases} A(t-r,\,n-p), & t \ge r,\\ 0, & t < r,\end{cases}
\qquad\text{that is,}\qquad
\eta(t) = \begin{cases} \dfrac{A(t-r,\,n-p)}{A(t,\,n)}, & t \ge r,\\[4pt] 0, & t < r.\end{cases}
\]
Note that
\[
g(\lambda) = P(X_1 = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad\text{for any } k = 0, 1, 2, \dots
\]
Solution. Let δ = I{X1 = k}. Clearly, Eδ = g(λ). It is obvious that T = ∑ Xi is
complete and sufficient, and T ∼ Poisson(nλ). Then
\[
\eta(T) = E[\delta \mid T] = P(X_1 = k \mid T) = \binom{T}{k}\Big(\frac{1}{n}\Big)^{k}\Big(1-\frac{1}{n}\Big)^{T-k}, \qquad T \ge k.
\]
Remarks.
(2). The above derivation can be seen straightaway from the fact: If X ∼
P oisson(λ1 ) and Y ∼ P oisson(λ2 ) and independent, then X|X + Y = t ∼
Bin{t, λ1 /(λ1 + λ2 )}.
(3). By comparison, the MLE for the above problem is e^{−X̄}(X̄)^k/k!.
is a UMVUE of λ2 . Since X1 |T ∼ Bin(T, 1/n), we get
Example. Does N (µ, µ), where µ > 0, belong to the exponential family? No. Find a
UMVUE for µ.
Example. Let X1 , . . . , Xn ∼ U (0, θ) with θ > 1 (as in the earlier example). Find a UMVUE of θ.
Solution. Consider T = X(n), with pdf f_T(t) = n t^{n-1}/θ^n, 0 < t < θ. If E_θ(g(T)) = 0, then
\[
0 = \int_0^{\theta} t^{n-1}g(t)\,dt = \int_0^{1} t^{n-1}g(t)\,dt + \int_1^{\theta} t^{n-1}g(t)\,dt.
\]
Clearly, it is easy to give an example that a nonzero g(t) satisfies this condition. This
implies that T is not complete. Therefore, we can not use Lehmann-Scheffe Theorem to
find a UMVUE.
(Another method to show that T = X(n) is not complete is as follows. Suppose it is
complete. Then T is complete and sufficient and therefore must be minimal sufficient (by
a theorem in Lehmann (19??)). But this is known to be false. Therefore, T = X(n) can
not be complete.)
We now try the characterization theorem (part 2) to find a UMVUE.
(1). E[U (T )] = 0 implies that (simply take g(t) = U (t) above)
\[
U(\theta) = 0 \ \text{ for any } \theta > 1, \qquad\text{and}\qquad \int_0^1 t^{n-1}U(t)\,dt = 0.
\]
(3). Our purpose is to find η(T ) such that Eη(T ) = θ. We shall use two approaches.
Approach 1. It is easy to check that the constraints in (1) and (2) are satisfied by
the following choice: η(t) = c for 0 < t < 1 and η(t) = bt for t ≥ 1, with constants b and c determined below. Then
\[
\begin{aligned}
\theta = E\eta(T) &= E[\eta(T)I(0<T<1)] + E[\eta(T)I(T>1)]\\
&= \int_0^1 c\,f_T(t)\,dt + \int_1^{\theta} bt\,f_T(t)\,dt
= c\,P(T<1) + \int_1^{\theta} bt\,\frac{nt^{n-1}}{\theta^n}\,dt\\
&= \frac{c}{\theta^n} + \frac{bn}{n+1}\,\frac{\theta^{n+1}-1}{\theta^n}
= \frac{bn}{n+1}\,\theta + \Big(c - \frac{bn}{n+1}\Big)\frac{1}{\theta^n},
\end{aligned}
\]
where we used the fact that P(T < 1) = F^n(1) = θ^{−n}. Therefore, we have
\[
\frac{bn}{n+1} = 1, \qquad c - \frac{bn}{n+1} = 0.
\]
So we get b = (n + 1)/n and c = 1. That is,
\[
\eta(T) = \begin{cases} 1, & 0 < T < 1,\\[2pt] \dfrac{n+1}{n}\,T, & T \ge 1.\end{cases}
\]
Approach 2. From Eη(T) = θ we need
\[
\int_0^{\theta}\eta(t)\,\frac{n t^{n-1}}{\theta^n}\,dt = \theta, \qquad\text{i.e.}\qquad \int_0^{\theta}\eta(t)\,n t^{n-1}\,dt = \theta^{n+1}. \tag{6.1}
\]
Here, we choose η(t) = c if 0 < t < 1, since this satisfies (1). Differentiating (6.1) with respect to θ, we get
\[
n\,\eta(\theta)\,\theta^{n-1} = (n+1)\,\theta^{n}.
\]
Therefore, η(θ) = \(\frac{n+1}{n}\theta\) for θ > 1. Putting this back into (6.1), we get c = 1. Hence, we
get the same η(T) as in Approach 1.
Remark: If all the observations are less than 1, then x(n) < 1, and from the last example we
simply estimate θ by 1 (its smallest possible value), since we know θ > 1. So here we have used the information that θ > 1.
However, if x(n) ≥ 1, then we just carry on as usual, since the constraint θ > 1 does not provide
any more useful information.
Remark: Note that the minimal sufficient statistic here is T0 = max{X(n) , 1}. Why
isn't the UMVUE η(T ) a function of T0 ?
3.7 UMVUE in nonparametric families.
See Lehmann, p101.
3.10 Information inequality (or Cramer-Rao lower bound)
3.10.1 Information inequality
Theorem. (Cramer-Rao lower bound) Let X = {X1 , . . . , Xn } be a sample (dependent
or independent) with a joint p.d.f. fθ (x), where θ ∈ Θ with Θ being an open set in
Rk . Suppose that Eδ(X) = g(θ), where g(θ) is a differentiable function of θ. Suppose
fθ (x) = fθ (x1 , ..., xn ) is differentiable as a function of θ and satisfies
\[
\frac{\partial}{\partial\theta}\int h(x)\,f_\theta(x)\,dx = \int h(x)\,\frac{\partial}{\partial\theta}f_\theta(x)\,dx, \qquad\text{for } \theta\in\Theta,
\]
for h(x) = 1 and h(x) = δ(x).
(1). We have
\[
\mathrm{Var}(\delta(X)) \ge (g'(\theta))_{1\times k}\,[I_X(\theta)]^{-1}_{k\times k}\,(g'(\theta))^{\tau}_{k\times 1}, \tag{10.2}
\]
where
\[
g'(\theta) = \Big(\frac{\partial}{\partial\theta_1}g(\theta), \dots, \frac{\partial}{\partial\theta_k}g(\theta)\Big),
\]
\[
I_X(\theta) = E\Big[\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big)^{\tau}_{k\times 1}\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big)_{1\times k}\Big]
= \Big(E\Big[\frac{\partial}{\partial\theta_i}\log f_\theta(X)\cdot\frac{\partial}{\partial\theta_j}\log f_\theta(X)\Big]\Big)_{k\times k}.
\]
(2). The equality in (10.2) holds iff there exists some function a(θ) such that
\[
a(\theta)\,[\delta(x) - g(\theta)] = \frac{\partial}{\partial\theta}\log f_\theta(x).
\]
Proof. We shall only prove the univariate case k = 1. The general case will be given in
the Appendix at the end of this section.
(1). Differentiating 1 = ∫ fθ(x)dx w.r.t. θ, we get
\[
E\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big) = \int \frac{\frac{\partial}{\partial\theta}f_\theta(x)}{f_\theta(x)}\,f_\theta(x)\,dx = \int \frac{\partial}{\partial\theta}f_\theta(x)\,dx = \frac{\partial}{\partial\theta}\int f_\theta(x)\,dx = \frac{\partial}{\partial\theta}1 = 0.
\]
Therefore,
\[
E\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big)^2 = \mathrm{var}\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big).
\]
Differentiating g(θ) = E(δ(X)) = ∫ δ(x)fθ(x)dx, we get
\[
\begin{aligned}
g'(\theta) &= \int \delta(x)\Big(\frac{\partial}{\partial\theta}f_\theta(x)\Big)dx
= \int \delta(x)\Big(\frac{\partial}{\partial\theta}\log f_\theta(x)\Big)f_\theta(x)\,dx\\
&= E\Big[\delta(X)\,\frac{\partial}{\partial\theta}\log f_\theta(X)\Big]
= \mathrm{Cov}\Big(\delta(X),\,\frac{\partial}{\partial\theta}\log f_\theta(X)\Big).
\end{aligned}
\]
Therefore,
\[
[g'(\theta)]^2 = \mathrm{Cov}^2\Big(\delta(X),\frac{\partial}{\partial\theta}\log f_\theta(X)\Big) \le \mathrm{Var}(\delta(X))\,\mathrm{Var}\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big).
\]
Remark 1. Note that in the Cramer-Rao theorem, the lower bound on the RHS does not
involve any estimators. In other words, we have a lower bound for the variance of all
unbiased estimators of g(θ). So the bound is uniform for all unbiased estimators.
Remark 2. When the Cramer-Rao lower bound is sharp, i.e., there exists an unbiased
estimator T of g(θ) whose variance reaches the lower bound, then that estimator is a
UMVUE. This provides a way to find UMVUE’s. However, this is not very effective as
very often Cramer-Rao lower bound may not be sharp. This means that UMVUE (if it
exists) may not reach the lower bound.
Remark 3. The Cramer-Rao lower bound is useful in assessing the performance of UMVUE's.
or equivalently,
\[
\mathrm{Var}(\delta) \ge \frac{\mathrm{Cov}^2(\delta, Y)}{\mathrm{Var}(Y)} = \mathrm{Cov}(\delta,Y)\,[\mathrm{Var}(Y)]^{-1}\,\mathrm{Cov}(\delta,Y), \tag{10.3}
\]
Consider g(λ) = Var(δ − λY) ≥ 0 as a quadratic in λ. Setting g'(λ) = 2λVar(Y) − 2Cov(δ, Y) = 0, we get the stationary point
\[
\lambda_0 = \frac{\mathrm{Cov}(\delta, Y)}{\mathrm{Var}(Y)}.
\]
Thus,
\[
g(\lambda_0) = \mathrm{Var}(\delta) - \frac{\mathrm{Cov}^2(\delta, Y)}{\mathrm{Var}(Y)} \ge 0,
\]
where equality holds iff g(λ0) = 0 = Var(δ − λ0 Y) iff δ − λ0 Y = C (a constant). This proves (10.3).
(b). For any r.v. δ and random vector Y = (Y1 , ..., Yk ), we have the following generalization of the
Cauchy-Schwarz inequality:
\[
\mathrm{Var}(\delta) \ge \mathrm{Cov}(\delta, Y)\,[\mathrm{Var}(Y)]^{-1}\,[\mathrm{Cov}(\delta, Y)]^{\tau}, \tag{10.4}
\]
where the equality holds iff δ and Y are linearly dependent. Here, we use the notation
\[
\mathrm{Cov}(\delta, Y) = \big(\mathrm{Cov}(\delta, Y_1), \dots, \mathrm{Cov}(\delta, Y_k)\big)_{1\times k}, \qquad
\mathrm{Var}(Y) = \big(\mathrm{Cov}(Y_i, Y_j)\big)_{k\times k}.
\]
(c). Let
\[
Y = \frac{\partial}{\partial\theta}\log f_\theta(X).
\]
As before, differentiating 1 = ∫ fθ(x)dx and g(θ) = E(δ(X)) = ∫ δ(x)fθ(x)dx w.r.t. θ, we get
\[
0 = E\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big) = EY, \qquad
g'(\theta) = \mathrm{Cov}\Big(\delta(X), \frac{\partial}{\partial\theta}\log f_\theta(X)\Big) = \mathrm{Cov}(\delta(X), Y),
\]
\[
I_X(\theta) = E\Big[\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big)^{\tau}_{k\times1}\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big)_{1\times k}\Big] = E(Y^{\tau}Y) = \mathrm{Var}(Y).
\]
Applying (10.4), we get
\[
\mathrm{Var}(\delta) \ge \mathrm{Cov}(\delta, Y)\,[\mathrm{Var}(Y)]^{-1}\,[\mathrm{Cov}(\delta, Y)]^{\tau} = g'(\theta)\,[I_X(\theta)]^{-1}\,[g'(\theta)]^{\tau}.
\]
Remark. The information inequality (10.4) has been extended in a number of directions.
(i) When the lower bound is not sharp, it can usually be improved by considering not only
first derivatives of g(θ), but also higher order derivatives
\[
\psi_{i_1,\dots,i_s} = \frac{1}{f_\theta(x)}\,\frac{\partial^{\,i_1+\dots+i_s} f_\theta(x)}{\partial\theta_1^{i_1}\cdots\partial\theta_s^{i_s}}.
\]
One then obtains
\[
\mathrm{Var}(\delta) \ge \alpha\,K^{-1}(\theta)\,\alpha^{\tau},
\]
where K is the covariance matrix of the given set of ψ's, and α is the row vector with elements
\[
\frac{\partial^{\,i_1+\dots+i_s} g(\theta)}{\partial\theta_1^{i_1}\cdots\partial\theta_s^{i_s}}.
\]
(ii) When the regularity conditions are not satisfied, one could use difference
instead of derivatives. See the references given in Lehmann, page 130.
(1). If X and Y are independent and have pdf fθ (x) and gθ (y), respectively,
then
I(X,Y) (θ) = IX (θ) + IY (θ).
(3). Suppose that X has joint pdf fθ (x), and fθ (x) is twice differentiable in θ,
and
\[
\frac{\partial}{\partial\theta^{\tau}}\int \frac{\partial}{\partial\theta}f_\theta(x)\,dx = \int \frac{\partial^2}{\partial\theta\,\partial\theta^{\tau}}f_\theta(x)\,dx, \qquad\text{for } \theta\in\Theta.
\]
Then there is an alternative expression for I_X(θ) (which is often easier to use):
\[
I_X(\theta) = -E\Big(\frac{\partial^2}{\partial\theta\,\partial\theta^{\tau}}\log f_\theta(X)\Big) = -\Big(E\,\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_\theta(X)\Big)_{k\times k}.
\]
That is, in the scalar case,
\[
I_X(\theta) = E\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big)^2 = \int\Big(\frac{\partial}{\partial\theta}\log f_\theta(x)\Big)^2 f_\theta(x)\,dx
= -\int\Big(\frac{\partial^2}{\partial\theta^2}\log f_\theta(x)\Big)f_\theta(x)\,dx = -E\,\frac{\partial^2}{\partial\theta^2}\log f_\theta(X).
\]
(5). If θ = θ(η) is 1-1 (from Rk to Rk ), then it can be shown that the
Cramer-Rao lower bound remains the same. In other words, the C-R bound
is reparametrization invariant.
Proof. X1 , ..., Xn ∼ fθ (x), and θ = θ(η) is 1-1. Now Eδ(X) = g(θ) =
g(θ(η)) = l(η). By C-R lower bound theorem,
\[
\mathrm{Var}(\delta(X)) \ge (g'(\theta))_{1\times k}\,[I_X(\theta)]^{-1}_{k\times k}\,(g'(\theta))^{\tau}_{k\times 1} =: A,
\]
and also
\[
\mathrm{Var}(\delta(X)) \ge (l'(\eta))_{1\times k}\,[I_X(\eta)]^{-1}_{k\times k}\,(l'(\eta))^{\tau}_{k\times 1} =: B.
\]
(6). If θ is a scalar (i.e., k = 1), then the Cramer-Rao lower bound becomes
\[
\mathrm{Var}(\delta(X)) \ge \frac{[g'(\theta)]^2}{I_X(\theta)}, \qquad\text{where } I_X(\theta) = E\Big(\frac{d}{d\theta}\log f_\theta(X)\Big)^2 = -E\Big(\frac{d^2}{d\theta^2}\log f_\theta(X)\Big).
\]
(2). If g(p) = p², then Var(δ(X)) ≥ 4p³(1 − p)/n. We shall show that this lower
bound can never be reached. We know that η(T ) = T (T − 1)/(n(n − 1)) is
the UMVUE of p². After some tedious algebra, it can be shown that its variance strictly exceeds this bound.
Aside: To find Var(η(T )), note that the m.g.f. of T is M_n(t) = (q + pe^t)^n. So
e.g. X1 , . . . , Xn ∼ N (µ, 1). Find the Cramer-Rao lower bound for g(µ) = µ and µ². Check whether
the lower bound can be reached or not.
Solution. Since f_µ(x) = (√(2π))^{−1} e^{−(x−µ)²/2}, we have ∂/∂µ log f_µ(x) = x − µ. Therefore,
.............
3.10.4 A necessary and sufficient condition for reaching the Cramer-Rao lower bound
The following theorem follows easily from the proof of Cramer-Rao lower bound theorem.
Theorem. Let δ(X) be an estimator of g(θ) = E(δ(X)). Assume that the conditions
in the information inequality theorem hold for δ(X) and that Θ ⊂ R. Then Var(δ(X))
attains the Cramer-Rao lower bound iff
\[
a(\theta)\,[\delta(x) - g(\theta)] = \frac{d}{d\theta}\log f_\theta(x), \qquad\text{a.s. } f_\theta,
\]
for some function a(θ), θ ∈ Θ.
Example. Let X1 , ..., Xn be iid from the Bin(1, p). Show that the Cramer-Rao lower
bound can be reached in estimating p but not for p2 .
Proof. Note
\[
\log f_p(x) = \log\Big(p^{\sum x_i}(1-p)^{n-\sum x_i}\Big) = \Big(\sum x_i\Big)\log p + \Big(n - \sum x_i\Big)\log(1-p).
\]
So
\[
\frac{d}{dp}\log f_p(x) = \frac{n}{p(1-p)}\,[\bar x - p] = a(p)\,[\delta(x) - p],
\]
where a(p) = n/(p(1 − p)) and δ(X) = X̄. However, it can NOT be expressed in the form
a(p)[δ(X) − p²]. By the above necessary and sufficient condition, the Cramer-Rao lower
bound is reached by X̄, and hence X̄ is a UMVUE of p. On the other hand, the lower
bound cannot be reached for p².
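The exact computation behind the p² claim can be checked numerically; the sketch below (illustrative only, with example values n and p) enumerates T ∼ Bin(n, p), confirms that η(T) = T(T−1)/(n(n−1)) is unbiased for p², and shows its variance exceeds the Cramer-Rao bound 4p³(1−p)/n.

```python
# Sketch: exact variance of the UMVUE of p^2 versus the Cramer-Rao bound.
from math import comb

n, p = 10, 0.3
pmf = [comb(n, t) * p**t * (1 - p)**(n - t) for t in range(n + 1)]
eta = [t * (t - 1) / (n * (n - 1)) for t in range(n + 1)]
mean = sum(e * w for e, w in zip(eta, pmf))
var = sum((e - mean) ** 2 * w for e, w in zip(eta, pmf))
print(mean, p**2)                       # unbiasedness: mean = p^2
print(var, 4 * p**3 * (1 - p) / n)      # variance strictly exceeds the bound
```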
Before we give the next example, we shall give the following preliminaries.
(1) Γ(p, λ) has pdf of the form
\[
f(x) = \frac{\lambda^p}{\Gamma(p)}\,x^{p-1}e^{-\lambda x}\,I\{x \ge 0\}, \qquad p > 0.
\]
Its moments are
\[
E[\Gamma(p,\lambda)]^r = \frac{\Gamma(p+r)}{\lambda^{r}\,\Gamma(p)}.
\]
In particular, we have EΓ(p, λ) = p/λ and Var(Γ(p, λ)) = p/λ².
e.g. Suppose that X1 , ..., Xn are iid with pdf fθ (x) = λ2 xe−λx , λ > 0, x ≥ 0.
(1). Find a UMVUE of λ.
(2). Check if the Cramer-Rao lower bound can be achieved.
Solution. (1) We shall do this by two methods.
Method 1. (This method is crazy. Method 2 is much better.)
Note that X1 , ..., Xn ∼ Γ(2, λ). Therefore,
\[
E\Big(\frac{1}{X_1}\Big) = \frac{\Gamma(2-1)}{\lambda^{-1}\Gamma(2)} = \lambda.
\]
Note that T = ∑ Xi ∼ Γ(2n, λ) is complete and sufficient (since it belongs to an exponential
family of full rank). To find the conditional pdf of X1 given T, let Y = ∑ Xi − X1 ∼ Γ(2n − 2, λ),
independent of X1; a direct calculation gives
\[
f_{X_1|T}(x_1|t) = (2n-1)(2n-2)\,\frac{x_1}{t^2}\Big(1-\frac{x_1}{t}\Big)^{2n-3}, \qquad 0 < x_1 < t.
\]
Then
\[
\begin{aligned}
\eta(t) = E\Big(\frac{1}{X_1}\,\Big|\,T=t\Big) &= \int \frac{1}{x_1}\,f_{X_1|T}(x_1|t)\,dx_1
= \int_0^t \frac{1}{x_1}\,(2n-1)(2n-2)\,\frac{x_1}{t^2}\Big(1-\frac{x_1}{t}\Big)^{2n-3}\,dx_1\\
&= -\frac{2n-1}{t}\Big(1-\frac{x_1}{t}\Big)^{2n-2}\Big|_0^{t} = \frac{2n-1}{t}.
\end{aligned}
\]
Therefore, a UMVUE of λ is (2n − 1)/T.
Method 2. Similarly to Method 1, we see that T = ∑ Xi ∼ Γ(2n, λ), and
\[
E\Big(\frac{1}{T}\Big) = \frac{\Gamma(2n-1)}{\lambda^{-1}\Gamma(2n)} = \frac{\lambda}{2n-1}.
\]
Therefore, a UMVUE of λ is (2n − 1)/T.
(2). Now let us check whether the Cramer-Rao lower bound can be reached by the
UMVUE. Since f_λ(x) = λ²xe^{−λx}, we have ∂/∂λ log f_λ(x) = 2/λ − x, so
\[
I_{X_1}(\lambda) = E\Big(\frac{\partial}{\partial\lambda}\log f_\lambda(X)\Big)^2 = E\Big(\frac{2}{\lambda} - X\Big)^2 = E(X - EX)^2 = \mathrm{Var}(X) = \frac{2}{\lambda^2}
\]
(see the preliminaries above).
Taking g(λ) = λ in the Cramer-Rao theorem, for any unbiased estimator δ(X) we get
\[
\mathrm{Var}(\delta(X)) \ge \frac{1}{n\,I_{X_1}(\lambda)} = \frac{\lambda^2}{2n}.
\]
However, for η(T ) = (2n − 1)/T (the UMVUE of λ), we have
\[
\mathrm{var}\Big(\frac{2n-1}{T}\Big) = (2n-1)^2\,\mathrm{var}\Big(\frac{1}{T}\Big) = (2n-1)^2\big(ET^{-2} - (ET^{-1})^2\big) = \frac{\lambda^2}{2n-2} > \frac{\lambda^2}{2n}.
\]
Therefore, the Cramer-Rao lower bound cannot be achieved. On the other hand,
\[
\mathrm{eff}(\eta(T)) = \frac{\lambda^2/(2n)}{\lambda^2/(2n-2)} = \frac{n-1}{n} \to 1 \qquad\text{as } n\to\infty.
\]
3.10.5 Information inequality in exponential family
Lemma 3.10.1 For the exponential model f_η(t) = exp{tη^τ − B(η)}c₀(t), the cumulant generating function of T is
κ_T(λ) = B(η + λ) − B(η). In particular,
\[
ET = B'(\eta), \qquad \mathrm{Var}(T) = B''(\eta).
\]
Proof. We have
\[
\begin{aligned}
\kappa_T(\lambda) = \ln M_T(\lambda) = \ln E e^{\lambda T^{\tau}}
&= \ln \int e^{\lambda t^{\tau}}\exp\{t\eta^{\tau} - B(\eta)\}\,c_0(t)\,dt\\
&= \ln \int \exp\{t(\lambda+\eta)^{\tau} - B(\eta+\lambda)\}\,c_0(t)\,dt + [B(\eta+\lambda) - B(\eta)]\\
&= \ln 1 + [B(\eta+\lambda) - B(\eta)] = B(\eta+\lambda) - B(\eta).
\end{aligned}
\]
Thus, differentiating at λ = 0 gives ET = B'(η) and Var(T) = B''(η).
Theorem 3.10.1 Suppose that the distribution of X is from an exponential family {fη :
η ∈ Θ} of full rank, with pdf given by
(1) The regularity condition in the Cramer-Rao theorem is satisfied for any h
with E|h(X)| < ∞.
(2) If IX (η) is the Fisher information matrix for the natural parameter η, then
the variance-covariance matrix V ar(T ) = IX (η).
(3) Let ν = ET (X) and denote IX (ν) to be its Fisher information matrix,
then the variance-covariance matrix V ar(T ) = [IX (ν)]−1 .
(1). Omitted.
(2). The pdf under the natural parameter η is f_X(x) = exp{T(x)η^τ − B(η)}c(x). So
\[
\log f_X(x) = T(x)\eta^{\tau} - B(\eta) + \log[c(x)], \qquad
\frac{\partial}{\partial\eta}\log f_X(x) = T(x) - B'(\eta).
\]
From the last lemma, we know ∂B(η)/∂η = ET. So
\[
\frac{\partial}{\partial\eta}\log f_X(x) = T(x) - ET(X).
\]
Thus,
\[
I_X(\eta) = E\Big[\frac{\partial}{\partial\eta}\log f_X(X)\Big]^{\tau}\Big[\frac{\partial}{\partial\eta}\log f_X(X)\Big]
= E[T(X)-ET(X)]^{\tau}[T(X)-ET(X)] = \mathrm{Var}(T(X)).
\]
(3). Let ν = ET = B'(η). Then ν = ψ(η) is 1-1 from R^k to R^k since, by the last lemma,
\[
\frac{\partial\nu}{\partial\eta^{\tau}} = B''(\eta) = \mathrm{Var}(T) = I_X(\eta) > 0.
\]
Then, from an earlier remark, the Fisher information matrix about η contained in X is
\[
I_X(\eta) = \Big(\frac{\partial\nu}{\partial\eta^{\tau}}\Big)\,I_X(\nu)\,\Big(\frac{\partial\nu^{\tau}}{\partial\eta}\Big)
= I_X(\eta)\,I_X(\nu)\,(I_X(\eta))^{\tau} = I_X(\eta)\,I_X(\nu)\,I_X(\eta).
\]
Cancelling the invertible matrix I_X(η) gives Var(T) = I_X(η) = [I_X(ν)]^{-1}.
(3). (Location-scale family.) Let θ = (µ, σ). Then the Fisher information
about θ contained in X1 , . . . , Xn is
\[
I_X(\theta) = \frac{n}{\sigma^2}
\begin{pmatrix}
\int \frac{[f'(x)]^2}{f(x)}\,dx & \int x\,\frac{[f'(x)]^2}{f(x)}\,dx\\[6pt]
\int x\,\frac{[f'(x)]^2}{f(x)}\,dx & \int \frac{[xf'(x)+f(x)]^2}{f(x)}\,dx
\end{pmatrix}.
\]
Proof. Let us prove the first case only; the rest is left as an exercise. Since
f_θ(x) = (1/σ₀) f((x − µ)/σ₀), we have log f_θ(x) = −log σ₀ + log f((x − µ)/σ₀), and
\[
\frac{\partial}{\partial\mu}\log f_\theta(x) = \frac{-f'\big(\frac{x-\mu}{\sigma_0}\big)}{\sigma_0\,f\big(\frac{x-\mu}{\sigma_0}\big)}.
\]
Therefore,
\[
I_{X_1}(\theta) = E\Big[\frac{\partial}{\partial\mu}\log f_\theta(X_1)\Big]^2
= \frac{1}{\sigma_0^{3}}\int \frac{\big[f'\big(\frac{x-\mu}{\sigma_0}\big)\big]^2}{\big[f\big(\frac{x-\mu}{\sigma_0}\big)\big]^2}\,f\Big(\frac{x-\mu}{\sigma_0}\Big)\,dx
= \frac{1}{\sigma_0^{2}}\int \frac{[f'(y)]^2}{f(y)}\,dy.
\]
e.g. Let X1 , . . . , Xn ∼ N (µ, σ²) and take θ = (µ, σ). Here f (x) = φ(x) = (2π)^{−1/2}e^{−x²/2}
and f'(x) = −xf (x). Applying the above theorem, part (3), the Fisher information
about θ contained in X1 , . . . , Xn is
\[
I_X(\theta) = \frac{n}{\sigma^2}
\begin{pmatrix}
\int x^2 f(x)\,dx & \int x^3 f(x)\,dx\\
\int x^3 f(x)\,dx & \int (1-x^2)^2 f(x)\,dx
\end{pmatrix}
= \frac{n}{\sigma^2}
\begin{pmatrix} 1 & 0\\ 0 & 2\end{pmatrix}.
\]
3.10.7 Regularity conditions must be checked
Example. Let X1 , . . . , Xn ∼ U (0, θ).
(1) Show that the differentiation-under-the-integral condition in the Cramer-Rao theorem fails.
(2) Show that the information inequality does not apply to the UMVUE of θ.
Solution. (1) For a single observation, fθ(x) = (1/θ)I(0 < x < θ), and for a generic function h,
\[
\frac{\partial}{\partial\theta}\int h(x)f_\theta(x)\,dx = \frac{\partial}{\partial\theta}\int_0^{\theta}\frac{h(x)}{\theta}\,dx
= \frac{h(\theta)}{\theta} + \int_0^{\theta}h(x)\,\frac{\partial}{\partial\theta}\frac{1}{\theta}\,dx
= \frac{h(\theta)}{\theta} + \int h(x)\,\frac{\partial}{\partial\theta}f_\theta(x)\,dx
\;\neq\; \int h(x)\,\frac{\partial}{\partial\theta}f_\theta(x)\,dx,
\]
unless h(θ)/θ = 0 for all θ. Therefore, the assumption fails here.
(2) Here fθ(x) = 1/θ, 0 < x < θ. So ∂/∂θ log fθ(x) = −1/θ, and
\[
I_X(\theta) = nE\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big)^2 = n\int_0^{\theta}\frac{1}{\theta^2}\,\frac{1}{\theta}\,dx = \frac{n}{\theta^2}.
\]
If the information inequality applied, any unbiased estimator W of θ would satisfy
\[
\mathrm{Var}(W) \ge \frac{\theta^2}{n}.
\]
Next let us take W = (n + 1)X(n)/n, the UMVUE of θ. Recall f_{X(n)}(x) = (nx^{n-1}/θ^n) I(0 < x < θ); then
\[
E X_{(n)}^2 = \int_0^{\theta}\frac{n x^{n+1}}{\theta^n}\,dx = \frac{n}{n+2}\,\theta^2.
\]
Therefore,
\[
\mathrm{var}\Big(\frac{n+1}{n}X_{(n)}\Big) = \Big(\frac{n+1}{n}\Big)^2 E X_{(n)}^2 - \theta^2
= \Big(\frac{n+1}{n}\Big)^2\frac{n}{n+2}\,\theta^2 - \theta^2 = \frac{\theta^2}{n(n+2)} < \frac{\theta^2}{n} = \frac{1}{I_X(\theta)}.
\]
Note that Var(X(n)) is of size O(n^{−2}), as opposed to the usual size O(n^{−1}).
Remark. In this example, the UMVUE has variance below the C-R lower bound. This
is because the regularity conditions in the C-R lower bound theorem are violated.
3.10.8 Relative Efficiency
Recall that, under the regularity conditions, any unbiased estimator δ of g(θ) satisfies
\[
\mathrm{Var}(\delta) \ge \frac{[g'(\theta)]^2}{I_X(\theta)}.
\]
Definition: The efficiency of an unbiased estimator δ = δ(X) of g(θ) is defined by
\[
\mathrm{eff}(\delta) = \frac{[g'(\theta)]^2/I_X(\theta)}{\mathrm{Var}(\delta)}.
\]
For two unbiased estimators δ1 and δ2, the relative efficiency of δ1 with respect to δ2 is given by
\[
\mathrm{eff}\Big(\frac{\delta_1}{\delta_2}\Big) = \frac{\mathrm{eff}(\delta_1)}{\mathrm{eff}(\delta_2)} = \frac{\mathrm{Var}(\delta_2)}{\mathrm{Var}(\delta_1)}.
\]
Lemma (delta method). Let µ = EX1 and suppose h is differentiable at µ. Then
\[
\mathrm{Var}(h(\bar X)) = n^{-1}[h'(\mu)]\,\mathrm{Var}(X_1)\,[h'(\mu)]^{\tau} + O(n^{-2}),
\]
and
\[
\frac{h(\bar X) - h(\mu)}{\sqrt{n^{-1}[h'(\mu)]\,\mathrm{Var}(X_1)\,[h'(\mu)]^{\tau}}} \;\sim_{\text{asympt}}\; N(0, 1).
\]
Proof. From Taylor's expansion, h(X̄) = h(µ) + [h'(µ)](X̄ − µ)^{\tau} + lower-order terms. Hence,
\[
\mathrm{Var}(h(\bar X)) = [h'(\mu)]\,\mathrm{Var}(\bar X)\,[h'(\mu)]^{\tau} + O(n^{-2}).
\]
The proof of the asymptotic normality is omitted.
In many such problems it follows that
\[
\lim_{n\to\infty}\mathrm{eff}(\eta(T)) = 1, \qquad\text{and}\qquad \eta(T) \sim_{\text{asymp}} N\big(\eta(ET),\,[\eta'(ET)]\,\mathrm{Var}(T)\,[\eta'(ET)]^{\tau}\big).
\]
e.g. Let X1 , ..., Xn be iid from the Bin(1, p). Suppose that g(p) = pq, where q = 1 − p.
Note that T = ∑ Xi is complete and sufficient. A UMVUE of pq is
\[
\eta(T) = \frac{T(n-T)}{n(n-1)} = \frac{n}{n-1}\,\bar X(1-\bar X).
\]
(1) Show that η(T ) is not efficient. (i.e., it does not reach the Cramer-Rao
lower bound.)
(2) Show that limn→∞ ef f (η(T )) = 1.
Proof. (1) It is easy to see that
\[
\frac{d}{dp}\log f_p(x) = \frac{1}{p(1-p)}\Big[\sum x_i - np\Big] = \frac{n}{p(1-p)}\,(\bar x - p).
\]
From the Cramer-Rao theorem, for any unbiased estimator δ(X) of g(p), we have
\[
\mathrm{Var}(\delta(X)) \ge \frac{[g'(p)]^2}{I_X(p)} = \frac{[1-2p]^2\,pq}{n}.
\]
On the other hand, since (d/dp) log f_p(x) cannot be expressed in the form a(p)[δ(x) − pq],
it follows from the necessary and sufficient condition that the Cramer-Rao lower bound cannot
be achieved by any unbiased estimator, in particular by δ(X) = η(T ). That is,
\[
\mathrm{eff}(\eta(T)) = \frac{[1-2p]^2\,pq/n}{\mathrm{Var}(\eta(T))} < 1.
\]
(2). We now use the lemma to find Var(η(T )). Taking h(x) = x − x², we have h'(x) = 1 − 2x. Thus,
\[
\mathrm{Var}(\eta(T)) = n^{-1}[h'(\mu)]\,\mathrm{Var}(X_1)\,[h'(\mu)]^{\tau} + O(n^{-2})
= n^{-1}[1-2p]^2\,\mathrm{Var}(X_1) + O(n^{-2}) = n^{-1}[1-2p]^2\,pq + O(n^{-2}).
\]
Thus
\[
\lim_{n\to\infty}\mathrm{eff}(\eta(T)) = \lim_{n\to\infty}\frac{[1-2p]^2\,pq/n}{\mathrm{Var}(\eta(T))} = \lim_{n\to\infty}\big(1 + O(n^{-1})\big) = 1.
\]
(Note that ef f (T ) is only 50% if n = 2.)
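The limiting behaviour can be checked exactly by enumeration; the sketch below (illustrative only, with an example value p = 0.3) computes Var(η(T)) for several n and the corresponding efficiency, which stays below 1 but approaches 1 as n grows.

```python
# Sketch: exact efficiency of eta(T) = T(n-T)/(n(n-1)) as a UMVUE of pq, T ~ Bin(n, p).
from math import comb

def eff(n, p):
    pmf = [comb(n, t) * p**t * (1 - p)**(n - t) for t in range(n + 1)]
    eta = [t * (n - t) / (n * (n - 1)) for t in range(n + 1)]
    m = sum(e * w for e, w in zip(eta, pmf))
    var = sum((e - m) ** 2 * w for e, w in zip(eta, pmf))
    return (1 - 2 * p) ** 2 * p * (1 - p) / n / var   # CR bound / exact variance

for n in (2, 5, 20, 100):
    print(n, eff(n, 0.3))     # below 1, increasing toward 1
```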
Example. Let X1 , ..., Xn ∼ U (0, θ). Show that the UMVUE of θ does not follow a normal
distribution asymptotically.
Solution. It is known that a UMVUE of θ is θ̂ = \(\frac{n+1}{n}X_{(n)}\). Next we'll show that θ̂ is not
asymptotically normal. For any y ≥ 0, we have
\[
P\big(n(\theta - X_{(n)}) \le y\big) = P(X_{(n)} \ge \theta - y/n) = 1 - P(X_{(n)} \le \theta - y/n)
= 1 - [P(X_1 \le \theta - y/n)]^n = 1 - \Big(1 - \frac{y}{n\theta}\Big)^{n} \to 1 - e^{-y/\theta}.
\]
That is, n(θ − X(n))/θ converges in distribution to Exp(λ = 1). Since n(θ − θ̂)/θ =
n(θ − X(n))/θ − X(n)/θ and X(n)/θ → 1, Slutsky's theorem gives
\[
\frac{n(\theta - \hat\theta)}{\theta} \;\sim_{\text{approx}}\; \mathrm{Exp}(\lambda = 1) - 1.
\]
Therefore, θ̂ has an asymptotically (shifted) exponential distribution, not an asymptotically normal one.
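A quick Monte Carlo sketch (illustrative only; n, theta, reps are example values) confirms the exponential limit used above: n(θ − X(n))/θ behaves like an Exp(1) variable for large n.

```python
# Sketch: check that n*(theta - X_(n))/theta is approximately Exp(1) for U(0, theta).
import numpy as np

rng = np.random.default_rng(4)
n, theta, reps = 200, 2.0, 100_000
xmax = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
z = n * (theta - xmax) / theta
print(z.mean(), z.var())              # both ~ 1 for an Exp(1) limit
print((z > 1.0).mean(), np.exp(-1))   # tail probability P(Z > 1) ~ e^{-1}
```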
3.12 Exercises
1. Let X1 , . . . , Xn be a random sample from N (µ, σ 2 ). Find a UMVUE for µ2 /σ.
2. Suppose that X1 , . . . , Xn are iid Poisson(λ). We already know that T = ∑_{i=1}^n Xi is
complete and sufficient. Find a UMVUE for e^{−2λ}.
3. Let X1 , . . . , Xn be iid U (θ1 , θ2 ), where θ1 < θ2 . Find a UMVUE for (θ1 + θ2 )/2.
4. Let X1 , . . . , Xn be iid Bin(1, p), where p ∈ (0, 1).
(a) Find the UMVUE of pm , where m is a positive integer ≤ n.
(b) Find the UMVUE of P (X1 + ... + Xm = k), where m and k are positive integers
≤ n.
(c) Find the UMVUE of P (X1 + ... + Xn−1 > Xn ).
5. Let X1 , . . . , Xn be iid with pdf given by fθ (x) = e−(x−θ) , x > θ. Find a UMVUE
of θr , r > 0.
6. Let X1 , . . . , Xn be iid N (0, σ 2 ). We wish to find a UMVUE of σ 2 .
(1) Find a complete and sufficient statistic for σ 2 .
(2) Find C1 , C2 (which may depend on n) such that W1 = C1 (X̄)² and W2 =
C2 ∑_{i=1}^n (Xi − X̄)² are unbiased for σ².
(3) Consider W = aW1 + (1 − a)W2 , where a is a constant. Find a which minimizes
var(W ). Show that var(W ) ≤ var(Wi ), i = 1, 2
(4) Find a UMVUE of σ 2 .
7. Suppose that X has pdf fθ (x), where θ ∈ Θ ⊂ R. Further assume that fθ (x) is twice
differentiable in θ, and
\[
\frac{\partial^2}{\partial\theta^2}\int f_\theta(x)\,dx = \int \frac{\partial^2}{\partial\theta^2}f_\theta(x)\,dx, \qquad\text{for } \theta\in\Theta.
\]
Then show that
\[
I_X(\theta) = -E\Big(\frac{\partial^2}{\partial\theta^2}\log f_\theta(X)\Big).
\]
8. Let X1 , . . . , Xn be iid with pdf (1/σ)f((x − µ)/σ), where f (x) > 0 and f'(x) exists for all
x ∈ R, and µ ∈ R, σ > 0. (This is called a location-scale family.) Let θ = (µ, σ).
Show that the Fisher information about θ in X1 , . . . , Xn is
\[
I(\theta) = \frac{n}{\sigma^2}
\begin{pmatrix}
\int \frac{[f'(x)]^2}{f(x)}\,dx & \int x\,\frac{[f'(x)]^2}{f(x)}\,dx\\[6pt]
\int x\,\frac{[f'(x)]^2}{f(x)}\,dx & \int \frac{[xf'(x)+f(x)]^2}{f(x)}\,dx
\end{pmatrix}.
\]
9. Let X1 , . . . , Xn ∼ N (µ, σ 2 ). In the lecture, we have seen that the information matrix
about θ = (µ, σ) contained in X1 , . . . , Xn is
à !
n 1 0
IX (θ) = .
σ2 0 2
Using this or otherwise, find the information matrix about θ = (µ, σ 2 ) contained in
X1 , . . . , Xn .
10. Suppose that X1 , . . . , Xn is a random sample from a population with density fθ (x) =
2xθ exp(−θx²), for x > 0 and θ > 0. Let T = ∑_{i=1}^n Xi²/n.
(1) Find ET and V ar(T ).
(2) Find the Fisher information IX (θ). Use Cramer-Rao information inequality to
see if T is a UMVUE for g(θ) = ET .
Chapter 4
If X = 0, then it is more plausible that it came from population 1, since Pθ1 (0) >
Pθ2 (0). Then we can estimate θ by θ1 . Note that if X = 3, then Pθ1 (3) = Pθ2 (3), then we
can take either θ = θ1 or θ2 . In general, we take the estimate of θ by
θ̂ = θ1 if X = 0, 3
= θ2 if X = 1, 2, 3
Note that θ̂ may not be unique, e.g., when X = 3, θ̂ = θ1 or θ2 . This is in fact a maximum
likelihood estimator (MLE).
i.e.,
\[
\hat\theta = \arg\sup_{\theta\in\Theta} L(\theta; x) = \arg\sup_{\theta\in\Theta} l(\theta; x).
\]
We make several remarks.
(a) If Θ is a finite set, then Θ̂ = Θ, then MLE can be obtained from finitely
many values L(θ), θ ∈ Θ.
(b) If L(θ) is differentiable in θ in the interior of Θ, then possible candidates
for MLE’s are the values of θ satisfying
\[
\frac{\partial}{\partial\theta}L(\theta) = 0, \qquad\text{or equivalently}\qquad \frac{\partial}{\partial\theta}l(\theta) = 0. \tag{1.1}
\]
The two equations in (1.1) are called the likelihood equation and log-
likelihood equation, respectively.
(d) If Θ is an infinite countable set, then one can look at the ratio L(θ)/L(θ−1)
or the difference L(θ) − L(θ − 1) to find an MLE. (This is the equivalent of
differentiation in (c).)
(e) MLE may not be unique (unlike UMVUE).
Theorem. If θ̂ is an MLE for θ, and η = g(θ) is a 1-1 function, then η̂ = g(θ̂) is an MLE
for η = g(θ).
Proof. Since g(·) is 1-1, then g −1 (·) exists. Also
That is, if θ̂ is an MLE, both sides are maximized by θ̂. Hence, η̂ is an MLE of η. (WHY?)
MLE Invariance Theorem. If θ̂ is an MLE of θ, then η̂ ≡ g(θ̂) is an MLE of η ≡ g(θ),
where the induced likelihood is L̃(η) = sup_{θ: g(θ)=η} L(θ).
[In other words, we need to show that η̂ maximizes L̃(η).]
Proof. First,
\[
\tilde L(g(\hat\theta)) = \sup_{\theta:\,g(\theta)=g(\hat\theta)} L(\theta) = L(\hat\theta) = \sup_{\theta} L(\theta)
= \sup_{\eta}\sup_{\theta:\,g(\theta)=\eta} L(\theta) = \sup_{\eta}\tilde L(\eta) \quad\text{(draw a mapping)}.
\]
Therefore, η̂ = g(θ̂) maximizes L̃(η), as was to be proved.
Example. Let X1 , ..., Xn be iid Bin(1, p). Find an MLE of p (and of p²).
Method 1. The likelihood is L(p) = ∏_{i=1}^n p^{x_i}(1 − p)^{1−x_i} = p^{∑x_i}(1 − p)^{n−∑x_i}, so
\[
l(p) = \Big(\sum x_i\Big)\log p + \Big(n - \sum x_i\Big)\log(1-p).
\]
Setting
\[
\frac{\partial}{\partial p}\,l(p) = \frac{\sum x_i}{p} - \frac{n-\sum x_i}{1-p} = \frac{n(\bar x - p)}{p(1-p)} = 0,
\]
we get p̂ = x̄. Note that L(p) → 0 as p → 0 or 1. So an MLE for p is p̂ = X̄.
Method 2. First, let η = p². Then
\[
L(p) = p^{\sum x_i}(1-p)^{n-\sum x_i} = \eta^{\sum x_i/2}\,(1-\sqrt{\eta})^{\,n-\sum x_i} = L_1(\eta; x),
\]
so
\[
l_1(\eta; x) = \frac{\sum x_i}{2}\log\eta + \Big(n-\sum x_i\Big)\log(1-\sqrt{\eta}).
\]
Setting
\[
\frac{\partial}{\partial\eta}\,l_1(\eta; x) = \frac{\sum x_i}{2\eta} - \Big(n-\sum x_i\Big)\frac{1}{2\sqrt{\eta}\,(1-\sqrt{\eta})} = 0,
\]
we get √η̂ = x̄. Therefore, η̂ = (X̄)² is an MLE for p².
Remark. Note that in this example, the MLE X̄ is also the UMVUE.
Example. Let X1 , . . . , Xn be iid Poisson(λ). A similar calculation gives the MLE λ̂ = X̄.
Remark. Note that in this example, the MLE X̄ is also the UMVUE.
is negative definite when θ = θ̂. At the boundary, we notice that L(θ) → 0 as θ goes to
the boundary of Θ. Therefore, an MLE of θ is
\[
\hat\mu = \bar X, \qquad \hat\sigma^2 = n^{-1}\sum_{i=1}^{n}\big(X_i - \bar X\big)^2.
\]
Example. Let X1 , ..., Xn ∼ N (µ, σ 2 ) with µ ≥ 0 and σ 2 > 0, where n ≥ 2. Find an MLE
for θ = (µ, σ 2 ).
(e.g. the body weight follows Normal d.f. but the mean must be positive.)
Case II. If x̄ < 0, then the solution to the log-likelihood equations will not be in the
parameter space. But as in the last example,
\[
l(\theta) = l(\mu,\sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 - \frac{n}{2}\log\sigma^2 - \frac{n}{2}\log(2\pi)
\]
\[
= -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\bar x)^2 - \frac{n}{2\sigma^2}(\bar x-\mu)^2 - \frac{n}{2}\log\sigma^2 - \frac{n}{2}\log(2\pi).
\]
In summary, an MLE of θ is
\[
\hat\theta = \Big(0,\ n^{-1}\sum X_i^2\Big) \quad \text{if } \bar X < 0,
\]
\[
\hat\theta = \Big(\bar X,\ n^{-1}\sum (X_i-\bar X)^2\Big) \quad \text{if } \bar X \ge 0.
\]
Example. (A genetics example). Assume that P (A) = p and P (a) = 1 − p. We have the
following table.
Type AA Aa aa
Prob p2 2p(1 − p) (1 − p)2
Freq x1 x2 x3
l(p, x1 , x2 , x3 ) = log C(n, x1 , x2 , x3 )+x2 log 2+2x1 log p+x2 (log p+log(1−p))+2x3 log(1−p)
Setting
\[
\frac{\partial}{\partial p} l(p, x_1, x_2, x_3) = \frac{2x_1+x_2}{p} - \frac{x_2+2x_3}{1-p} = 0,
\]
we get
\[
\hat p = \frac{2X_1+X_2}{2n}.
\]
\[
f_p(x_1,x_2,x_3) = L(p,x_1,x_2,x_3) = C(n,x_1,x_2,x_3)\, p^{2x_1}[2p(1-p)]^{x_2}(1-p)^{2x_3}
\]
\[
= h(n,x_1,x_2,x_3)\, p^{2x_1+x_2}(1-p)^{x_2+2x_3}
= h(n,x_1,x_2,x_3)\, p^{2x_1+x_2}(1-p)^{2n-(2x_1+x_2)}
\]
\[
= (1-p)^{2n}\left(\frac{p}{1-p}\right)^{2x_1+x_2} h(n,x_1,x_2,x_3).
\]
\[
P(X = x) = \frac{C_n^x\, C_{N-n}^{r-x}}{C_N^r}, \qquad x = 0, 1, ..., \min(r, n).
\]
Thus, an estimate by the method of moments is
\[
\hat\theta = \hat N = \frac{rn}{x}.
\]
2). Note that (writing θ = N)
\[
R = \frac{L(N)}{L(N-1)} = \frac{C_n^x C_{N-n}^{r-x}/C_N^r}{C_n^x C_{N-1-n}^{r-x}/C_{N-1}^r}
= \frac{C_{N-n}^{r-x}\, C_{N-1}^r}{C_N^r\, C_{N-1-n}^{r-x}}
\]
\[
= \frac{(N-n)!/[(r-x)!(N-n-r+x)!]}{(N-n-1)!/[(r-x)!(N-n-r+x-1)!]} \cdot \frac{(N-1)!/[r!(N-1-r)!]}{N!/[r!(N-r)!]}
\]
\[
= \frac{(N-n)/(N-n-r+x)}{N/(N-r)}
= \frac{(N-n)(N-r)}{(N-n-r+x)N}.
\]
Now R ≥ 1 iff (N − n)(N − r) ≥ (N − n − r + x)N iff N² + nr − Nr − nN ≥ N² − Nn − rN + xN iff nr ≥ xN iff N ≤ rn/x. Similarly, R < 1 iff N > rn/x.
1). If θ̂ = rn/x is an integer, then L(θ̂) = L(θ̂ − 1), so an MLE of θ is N̂ = rn/x or N̂ = rn/x − 1.
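(A small numerical check of this argument, with purely hypothetical counts chosen only for illustration: the log-likelihood in N is maximized at ⌊rn/x⌋, in agreement with the ratio calculation above.)

```python
import numpy as np
from scipy.special import gammaln

def log_binom(a, b):
    # log of the binomial coefficient C(a, b)
    return gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1)

n, r, x = 100, 80, 21                       # hypothetical: n marked, sample of r, x recaptures

def log_L(N):
    # C(n, x) does not depend on N, so it is omitted
    return log_binom(N - n, r - x) - log_binom(N, r)

Ns = np.arange(n + r - x, 2001)             # candidate population sizes N
N_hat = Ns[np.argmax(log_L(Ns))]
print("numerical MLE:", N_hat, "  floor(r*n/x):", (r * n) // x)
```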
4.4 MLE’s in exponential families
Some of the examples given earlier belong to the exponential family. Here, we shall give
some general results regarding MLE in this family.
Suppose that X has a distribution from a natural exponential family of full rank
where η ∈ Ψ.
(1). Show that an MLE of η is η̂ = g^{-1}(T(X)), where g(η) = ∂A(η)/∂η, if T(x) is in the range of ∂A(η)/∂η.
(2). Show that T (X) is an MLE of E[T (X)]. (T (X) is in fact also a UMVUE
of E[T (X)].)
Solution. (1). The log-likelihood is l(η) = ηᵀT(x) − A(η) + log h(x), so the likelihood equation is T(x) − ∂A(η)/∂η = 0, i.e., g(η̂) = T(x). Since A(η) is convex, this stationary point is a maximizer, and hence
\[
\hat\eta = g^{-1}(T(X)).
\]
(2). Since E[T(X)] = ∂A(η)/∂η = g(η), it follows from (1) and the invariance of MLE's that an MLE of E[T(X)] is g(η̂) = g(g^{-1}(T(X))) = T(X).
Suppose that X has a distribution from a general exponential family of full rank
θ̂ = η(η̂).
4.5 Numerical solution to likelihood equations
Typically, there are no explicit solutions to likelihood equations, and some numerical
methods have to be used to compute MLE’s. We start with a couple of examples.
Example. Let X1, ..., Xn be i.i.d. with the Cauchy density fθ(x) = 1/[π(1 + (x − θ)²)], θ ∈ R. Find an MLE of θ.
Solution. The likelihood function is
\[
L(\theta) = \frac{1}{\pi^n}\prod_{i=1}^n \frac{1}{1+(x_i-\theta)^2}.
\]
Therefore,
\[
l(\theta) = -n\log\pi - \sum_{i=1}^n \log\big(1+(x_i-\theta)^2\big).
\]
Setting
\[
\frac{\partial l(\theta)}{\partial\theta} = \sum_{i=1}^n \frac{2(x_i-\theta)}{1+(x_i-\theta)^2} = 0.
\]
Here, an MLE θ̂ cannot be obtained explicitly, but it can be computed by a numerical method such as the Newton-Raphson method.
Example. Let X1, ..., Xn ∼ Γ(α, γ) with unknown α > 0 and γ > 0. Find an MLE of θ = (α, γ).
Solution. The likelihood function is
\[
L(\theta) = \prod_{i=1}^n \frac{x_i^{\alpha-1} e^{-x_i/\gamma}}{\Gamma(\alpha)\gamma^{\alpha}}
= \frac{\left(\prod_{i=1}^n x_i\right)^{\alpha-1} e^{-\left(\sum_{i=1}^n x_i\right)/\gamma}}{\Gamma^n(\alpha)\,\gamma^{n\alpha}},
\]
\[
l(\theta) = (\alpha-1)\sum_{i=1}^n \log x_i - \frac{1}{\gamma}\sum_{i=1}^n x_i - n\log\Gamma(\alpha) - n\alpha\log\gamma.
\]
Setting the partial derivatives to zero gives the likelihood equations
\[
\frac{\partial l}{\partial\alpha} = \sum_{i=1}^n \log x_i - n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} - n\log\gamma = 0, \tag{5.2}
\]
\[
\frac{\partial l}{\partial\gamma} = \frac{1}{\gamma^2}\sum_{i=1}^n x_i - \frac{n\alpha}{\gamma} = 0. \tag{5.3}
\]
Then from (5.3), we get γ̂α̂ = x̄, or γ̂ = x̄/α̂. Substituting this into (5.2), we get
\[
\frac{1}{n}\sum_{i=1}^n \log x_i - \frac{\Gamma'(\hat\alpha)}{\Gamma(\hat\alpha)} - \log\bar x + \log\hat\alpha = 0.
\]
In this case, the likelihood equation has no explicit solution, although it can be shown that a solution exists and is the unique MLE. We must resort to a numerical method to compute the MLE.
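(One possible numerical sketch, with simulated data and arbitrary true parameters: the last equation can be written as log α − ψ(α) = log x̄ − n^{-1}Σ log xi, where ψ = Γ'/Γ is the digamma function, and solved by a one-dimensional root finder.)

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

# Solve log(alpha) - psi(alpha) = log(xbar) - mean(log x) for alpha, then gamma_hat = xbar/alpha.
rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=500)     # simulated data with alpha = 3, gamma = 2

rhs = np.log(x.mean()) - np.log(x).mean()         # > 0 by Jensen's inequality
alpha_hat = brentq(lambda a: np.log(a) - digamma(a) - rhs, 1e-8, 1e8)
gamma_hat = x.mean() / alpha_hat
print("alpha_hat =", alpha_hat, "  gamma_hat =", gamma_hat)
```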
One commonly used numerical method to compute MLE's is Newton's algorithm.
Note that typically θ̂ → θ in probability, and also note that n^{-1} l(θ), n^{-1} l'(θ) and n^{-1} l''(θ) are all sample means of i.i.d. r.v.'s. Thus, expanding the likelihood equation around θ,
\[
0 = \frac{\partial l(\hat\theta)}{\partial\theta} \approx \frac{\partial l(\theta)}{\partial\theta} + \frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^{\tau}}\,(\hat\theta - \theta).
\]
Then
\[
\hat\theta \approx \theta - \left[\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^{\tau}}\right]^{-1}\frac{\partial l(\theta)}{\partial\theta}
\approx \theta - \left[E\,\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^{\tau}}\right]^{-1}\frac{\partial l(\theta)}{\partial\theta}.
\]
If we take θ0 to be an initial value and suppose that ∂²l(θ)/∂θ∂θᵀ is of full rank for every θ ∈ Θ, then we have the following iteration:
\[
\theta_{j+1} = \theta_j - \left[\frac{\partial^2 l(\theta_j)}{\partial\theta\,\partial\theta^{\tau}}\right]^{-1}\frac{\partial l(\theta_j)}{\partial\theta}, \qquad j = 0, 1, 2, \dots
\]
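As an illustration, here is a minimal sketch (with simulated data and an arbitrary true θ) of this iteration for the Cauchy example above, where l'(θ) = Σ 2(xi − θ)/(1 + (xi − θ)²) and l''(θ) = Σ 2((xi − θ)² − 1)/(1 + (xi − θ)²)².

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_cauchy(200) + 5.0          # simulated Cauchy sample centered at theta = 5

def l_prime(t):
    u = x - t
    return np.sum(2 * u / (1 + u**2))

def l_double_prime(t):
    u = x - t
    return np.sum(2 * (u**2 - 1) / (1 + u**2)**2)

theta = np.median(x)                        # reasonable starting value theta_0
for _ in range(25):                         # theta_{j+1} = theta_j - l'(theta_j)/l''(theta_j)
    theta -= l_prime(theta) / l_double_prime(theta)
print("Newton-Raphson MLE:", theta)
```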
Here the likelihood equation approach is not applicable. However, by plotting L(θ) against θ, it is easy to see that any statistic T satisfying x(n) − 1/2 ≤ T ≤ x(1) + 1/2 is an MLE of θ.
Remark. This example indicates that MLE's may not be unique (there may be an infinite number of MLE's).
4.6.2 MLE may not be asymptotically normal
Example. Let X1, ..., Xn ∼ U(0, θ). Find an MLE of θ.
Solution.
\[
L(\theta, x) = \frac{1}{\theta^n}\, I\big(0 \le x_{(1)} \le x_{(n)} \le \theta\big).
\]
Drawing a graph of L(θ, x) as a function of θ, we see that an MLE for θ is
\[
\hat\theta = X_{(n)}.
\]
In this case, the MLE is not the same as the UMVUE, which is (n + 1)X(n) /n.
Next we’ll show that θ̂ is not asymptotically normal.
\[
P\big(n(\theta - \hat\theta)/\theta \le t\big) = P\big(\hat\theta \ge \theta - t\theta/n\big)
= 1 - P\big(X_{(n)} \le \theta - t\theta/n\big)
= 1 - \big[P(X_1 \le \theta - t\theta/n)\big]^n
\]
\[
= 1 - \Big(1 - \frac{t}{n}\Big)^n \to 1 - e^{-t} \quad \text{as } n \to \infty.
\]
That is, n(θ − θ̂)/θ has asymptotically an exponential distribution with mean 1. Therefore, θ̂ cannot be asymptotically normal. In addition, θ̂ is biased downward; however, bias(θ̂) = E θ̂ − θ = −θ/(n+1) → 0 as n → ∞.
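(A quick simulation sketch of this limit, with sample size and repetitions chosen arbitrarily:)

```python
import numpy as np

# Check that n*(theta - X_(n))/theta is approximately Exp(1) for large n.
rng = np.random.default_rng(3)
theta, n, reps = 2.0, 200, 20_000

x_max = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
z = n * (theta - x_max) / theta

print("mean of Z:", z.mean(), "   (Exp(1) mean = 1)")
print("P(Z <= 1):", (z <= 1).mean(), "   (Exp(1):", 1 - np.exp(-1.0), ")")
```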
Example Let X ∼ Bin(1, p), where 1/3 < p < 2/3. (Restricted Bernoulli parameter.)
Find an MLE of p, and show that it is inadmissible by the MSE criterion.
Solution.
\[
L(p, x) = p^x(1-p)^{1-x} I(1/3 < p < 2/3) =
\begin{cases}
p, & x = 1,\\
1-p, & x = 0.
\end{cases}
\]
Hence an MLE of p is
\[
\hat p =
\begin{cases}
2/3, & x = 1,\\
1/3, & x = 0,
\end{cases}
\qquad \text{i.e.,}\quad \hat p = \frac{x+1}{3}, \quad x = 0, 1.
\]
Note that
\[
E\hat p = E\Big(\frac{x+1}{3}\Big) = \frac{p+1}{3} \ne p \quad \text{(so } \hat p \text{ is biased)},
\qquad
\mathrm{var}(\hat p) = \frac{\mathrm{var}(x)}{9} = \frac{p(1-p)}{9}.
\]
Therefore,
\[
MSE(\hat p) = \mathrm{var}(\hat p) + \mathrm{bias}^2(\hat p) = \frac{p(1-p)}{9} + \Big(\frac{p+1}{3} - p\Big)^2 = \frac{1 - 3p(1-p)}{9}.
\]
Note that MSE(p̂1) equals 0, 1/36 and 1/36 at p = 1/2, 1/3, 2/3, respectively, where p̂1 ≡ 1/2 is the trivial estimator. On the other hand, MSE(p̂) equals 1/36, 1/27, 1/27 at these points, which are all bigger than the corresponding values of MSE(p̂1). In fact, MSE(p̂) − MSE(p̂1) = 1/36 − (2/3)(p − 1/2)² > 0 for all p ∈ (1/3, 2/3). Thus,
the MLE is worse than the trivial estimator p̂1 ≡ 1/2 under the MSE criterion. In other words, the MLE is inadmissible under the MSE loss function.
Remark: MSE(p̂1) and MSE(p̂) do cross each other in (0, 1), although they do not cross in [1/3, 2/3].
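(A small numerical check of this comparison on a grid over [1/3, 2/3]:)

```python
import numpy as np

p = np.linspace(1/3, 2/3, 101)
mse_mle  = (1 - 3 * p * (1 - p)) / 9     # MSE of the MLE (X+1)/3
mse_triv = (0.5 - p) ** 2                # MSE of the trivial estimator 1/2

print("MLE has larger MSE everywhere on [1/3, 2/3]:", bool(np.all(mse_mle >= mse_triv)))
print("values at p = 1/2, 1/3, 2/3:")
print("  MLE:    ", mse_mle[[50, 0, 100]])    # 1/36, 1/27, 1/27
print("  trivial:", mse_triv[[50, 0, 100]])   # 0,    1/36, 1/36
```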
When there are many nuisance parameters, MLE's can behave very badly even for large samples, as shown below (in this case, the MLE is unique).
Example (many nuisance parameters). Let Xij ∼ N(μi, σ²) be independent, i = 1, ..., p, j = 1, ..., r.
1. Find an MLE of θ = (μ1, · · · , μp, σ²).
2. Show that if p is fixed, r → ∞, then σ̂ 2 is consistent for σ 2 .
3. Show that if r is fixed, p → ∞, then σ̂ 2 is NOT consistent for σ 2 .
Therefore, an MLE of θ is
\[
\hat\mu_i = r^{-1}\sum_{j=1}^r x_{ij} = x_{i\cdot}, \qquad
\hat\sigma^2 = (pr)^{-1}\sum_{i=1}^p\sum_{j=1}^r (x_{ij} - x_{i\cdot})^2.
\]
Note
\[
\hat\sigma^2 = \frac{\sigma^2}{pr}\sum_{i=1}^p\sum_{j=1}^r \Big(\frac{x_{ij} - x_{i\cdot}}{\sigma}\Big)^2
= \frac{\sigma^2}{pr}\sum_{i=1}^p \chi^2_{r-1}
= \frac{\sigma^2}{pr}\,\chi^2_{p(r-1)}.
\]
Thus,
\[
E\hat\sigma^2 = \frac{r-1}{r}\,\sigma^2, \qquad \mathrm{Var}(\hat\sigma^2) = \frac{2(r-1)}{pr^2}\,\sigma^4.
\]
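(A simulation sketch of this inconsistency, with arbitrary parameter values: for r fixed and p large, σ̂² concentrates around (r − 1)σ²/r rather than σ².)

```python
import numpy as np

rng = np.random.default_rng(4)
p, r, sigma2 = 5000, 3, 4.0
mu = rng.normal(0.0, 10.0, size=(p, 1))                  # nuisance means mu_1, ..., mu_p
x = mu + rng.normal(0.0, np.sqrt(sigma2), size=(p, r))   # x_ij ~ N(mu_i, sigma^2)

xbar_i = x.mean(axis=1, keepdims=True)
sigma2_hat = np.mean((x - xbar_i) ** 2)                  # MLE (pr)^{-1} sum_ij (x_ij - x_i.)^2

print("sigma2_hat:", sigma2_hat, "  (r-1)/r * sigma2:", (r - 1) / r * sigma2)
```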
4.7 Exercises
1. Let X1 , ..., Xn be iid with a pdf fθ (x). Find an MLE of θ in each of the following
cases.
(1) fθ(x) = θ^{-1}, where x = 1, ..., θ and θ is an integer between 1 and θ0.
(2) fθ(x) = e^{-(x-θ)}, where x ≥ θ and θ > 0.
(3) fθ(x) = θ(1 − x)^{θ-1} I(0 < x < 1), where θ > 1.
(4) fθ(x) is the pdf of N(θ, θ²), where θ ∈ R.
2. Let X be a sample with a joint pdf fθ (x) and T = T (X) be a sufficient statistic for
θ. Show that if an MLE exists, it is a function of T but it may not be sufficient for
θ.
fθ(x) = (θ + 1)x^θ, 0 ≤ x ≤ 1.
Chapter 5
So far, we only discussed the C-R lower bound for fixed sample size n. What happens
when n → ∞?
Now let us assume that δn = δn (X1 , . . . , Xn ) is asymptotically normal, say that
\[
\sqrt{n}\,[\delta_n - g(\theta)] \to_d N(0, v(\theta)), \qquad v(\theta) > 0. \tag{1.2}
\]
However, the next example shows that this may not be the case. That is, the asymptotic version of the Cramer-Rao inequality may not hold uniformly for ALL θ ∈ Θ, even though all the conditions in the usual Cramer-Rao theorem are satisfied. (An example where the regularity conditions are not satisfied is given by the Unif(0, θ) example in the last chapter.)
Example (Hodges). Let X1 , ..., Xn ∼ N (θ, 1), θ ∈ R. Then IX1 (θ) = 1. The MLE of θ is
θ̂ = X̄, which has distribution N (θ, 1/n). Now let us define
\[
\tilde\theta =
\begin{cases}
\bar X, & |\bar X| \ge n^{-1/4},\\
t\bar X, & |\bar X| < n^{-1/4},
\end{cases}
\]
where t is a fixed constant with 0 ≤ t < 1.
Remark: A couple of words about θ̃. We are interested in estimating the population
mean. If X̄ is not close to 0, we simply take the sample mean as the estimator. If we
know that it is pretty close to 0, we can shrink it further to make it closer to 0. Thus, the
resulting estimator should be more efficient than the sample mean X̄ at 0. Of course, we can use values other than 0; the same thing will happen there too.
Combining the above two cases, we get
\[
\sqrt{n}\,\big(\tilde\theta - \theta\big) \to_d
\begin{cases}
N(0, t^2), & \theta = 0,\\
N(0, 1), & \theta \ne 0.
\end{cases}
\]
In the case of θ = 0, the usual asymptotic Cramer-Rao theorem does not hold, since
t2 < 1 = [I(θ)]−1 .
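(A simulation sketch of this phenomenon, with t and n chosen arbitrarily: n·E(θ̃ − θ)² is roughly t² at θ = 0 but roughly 1 away from 0.)

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps, t = 10_000, 50_000, 0.5

for theta in (0.0, 1.0):
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)         # X_bar ~ N(theta, 1/n)
    hodges = np.where(np.abs(xbar) >= n ** (-0.25), xbar, t * xbar)
    print("theta =", theta,
          "  n * E(theta_tilde - theta)^2 =", n * np.mean((hodges - theta) ** 2))
```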
Remark 1. The set of all points violating (1.3) is called the set of points of superefficiency. Superefficiency is in fact a very good property.
Remark 2. The normal family is as well-behaved a family as one could hope for; yet Hodges's example shows that, in order for the asymptotic version of the Cramer-Rao inequality to hold, one must restrict attention to a more restricted class of estimators (the so-called "regular" estimators).
Remark 3. The example can easily be extended to have a countable number of superefficiency points. For instance, we can define, for k = 0, ±1, ±2, ...,
\[
\hat\theta =
\begin{cases}
\bar X, & |\bar X - k| \ge n^{-1/4} \text{ for every } k,\\
k + t_k(\bar X - k), & |\bar X - k| < n^{-1/4},
\end{cases}
\]
where we choose 0 ≤ tk < 1 for all k. Then it can be shown that this estimator will be superefficient at all k = 0, ±1, ±2, ....
Remark 4. Even though (1.3) is not always true, it is almost true: the set of points of superefficiency has Lebesgue measure 0, as shown in the next theorem (Le Cam (1953), Bahadur (1964)). The proof is taken from Shao (1999).
Under these conditions, if δn = δn (X1 , . . . , Xn ) is any estimator satisfying (1.2), then v(θ)
satisfies (1.3) except on a set of Lebesgue measure 0.
5.2 Definition of Asymptotic Efficiency
Let X1, . . . , Xn ∼ Fθ, θ ∈ Θ. Suppose that {δn} is a sequence of estimators of g(θ). Then typically,
1). {δn } is consistent, i.e.,
Ideally, we would like to find estimators such that the asymptotic variance v(θ) is minimized uniformly over θ ∈ Θ. If such estimators exist, they are called Asymptotically Efficient Estimators.
Remark. Asymptotically efficient estimators, if they exist, need not be unique, so that different methods may all lead to asymptotically efficient estimators. If δn is asymptotically efficient, so is δn + Rn, where √n Rn → 0 in probability, since by Slutsky's theorem,
\[
\sqrt{n}\,\big(\delta_n + R_n - g(\theta)\big) = \sqrt{n}\,\big(\delta_n - g(\theta)\big) + \sqrt{n}\,R_n \to_d N\!\left(0, \frac{[g'(\theta)]^2}{I(\theta)}\right).
\]
Let X1, . . . , Xn be iid with common pdf fθ(x), where θ ∈ Θ ⊂ R, and suppose the following regularity conditions hold.
A1). The distributions Fθ of the Xi have common support, so that the set A = {x : fθ(x) > 0} is independent of θ.
A2). The true parameter value θ0 ∈ Θ0 ⊂ Θ, where Θ0 is open.
Theorem. Under assumptions (A0)-(A1), for any fixed θ ≠ θ0,
\[
P_{\theta_0}\big(L(\theta_0) > L(\theta)\big) \to 1 \quad \text{as } n \to \infty.
\]
Proof. L(θ0) > L(θ) iff L(θ0)/L(θ) > 1 iff
\[
\frac{1}{n}\log\frac{L(\theta_0)}{L(\theta)} = \frac{1}{n}\sum_{i=1}^n \log\frac{f_{\theta_0}(X_i)}{f_\theta(X_i)} > 0.
\]
However, since − log x is strictly convex, we can use Jensen’s inequality to get
\[
\frac{1}{n}\sum_{i=1}^n \log\frac{f_{\theta_0}(X_i)}{f_\theta(X_i)}
= -\frac{1}{n}\sum_{i=1}^n \log\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)}
> -\log\left(\frac{1}{n}\sum_{i=1}^n \frac{f_\theta(X_i)}{f_{\theta_0}(X_i)}\right)
\to -\log E_{\theta_0}\!\left(\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)}\right) = -\log 1 = 0, \quad \text{in probability.}
\]
has a root θ̂n = θ̂(x1 , ..., xn ) such that θ̂(X1 , ..., Xn ) tends to the true parameter value θ0
in probability.
5.3.2 Approach II
Theorem. Let F = {fθ : θ ∈ Θ}, where Θ is an open set in R. Assume that
1). For each θ0 ∈ Θ, ∂³ log fθ(x)/∂θ³ exists for all x, and there exists a function H(x) ≥ 0 (possibly depending on θ0) such that for θ ∈ N(θ0, ε) = {θ : |θ − θ0| < ε},
\[
\left|\frac{\partial^3 \log f_\theta(x)}{\partial\theta^3}\right| \le H(x), \qquad E_{\theta_0} H(X_1) < \infty.
\]
Let X1 , ..., Xn ∼ Fθ . Then with probability 1, the likelihood equations admit a sequence
of solutions {θ̂n } satisfying
Then
\[
s'(X, \theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \log f_\theta(X_i)}{\partial\theta^2}, \qquad
s''(X, \theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^3 \log f_\theta(X_i)}{\partial\theta^3}.
\]
Note that
\[
|s''(X, \theta)| \le \frac{1}{n}\sum_{i=1}^n \left|\frac{\partial^3 \log f_\theta(X_i)}{\partial\theta^3}\right| \le \frac{1}{n}\sum_{i=1}^n H(X_i) =: \bar H(X),
\]
where H̄(X) = n^{-1} Σ_{i=1}^n H(Xi). By Taylor's expansion,
\[
s(X, \theta) = s(X, \theta_0) + s'(X, \theta_0)(\theta - \theta_0) + \tfrac{1}{2}\, s''(X, \xi)(\theta - \theta_0)^2
= s(X, \theta_0) + s'(X, \theta_0)(\theta - \theta_0) + \tfrac{1}{2}\,\bar H(X)\,\eta^*(\theta - \theta_0)^2,
\]
where |η*| = |s''(X, ξ)|/H̄(X) ≤ 1. By the SLLN, we have, with probability one,
\[
s(X, \theta_0) \to E\,s(X, \theta_0) = 0, \qquad
s'(X, \theta_0) \to E\,s'(X, \theta_0) = -I_{X_1}(\theta_0), \qquad
\bar H(X) \to E_{\theta_0} H(X_1) < \infty.
\]
Clearly, for ε > 0, we have with probability one,
\[
s(X, \theta_0 \pm \epsilon) = s(X, \theta_0) + s'(X, \theta_0)(\pm\epsilon) + \tfrac{1}{2}\bar H(X)\,\eta^*(\pm\epsilon)^2
\approx \mp I_{X_1}(\theta_0)\,\epsilon + \tfrac{1}{2}EH(X_1)\,c\,\epsilon^2, \qquad |c| < 1.
\]
In particular, we choose 0 < ε < I_{X1}(θ0)/Eθ0 H(X1). Then for large enough n, we have, with probability 1,
\[
s(X, \theta_0 + \epsilon) = s(X, \theta_0) + s'(X, \theta_0)\epsilon + \tfrac{1}{2}\bar H(X)\eta^*\epsilon^2 \le -I_{X_1}(\theta_0)\epsilon + \tfrac{1}{2}E_{\theta_0}H(X_1)\epsilon^2 < 0,
\]
\[
s(X, \theta_0 - \epsilon) = s(X, \theta_0) - s'(X, \theta_0)\epsilon + \tfrac{1}{2}\bar H(X)\eta^*\epsilon^2 \ge I_{X_1}(\theta_0)\epsilon - \tfrac{1}{2}E_{\theta_0}H(X_1)\epsilon^2 > 0.
\]
Therefore, by the continuity of s(X, θ), the interval [θ0 − ε, θ0 + ε] contains a solution of the likelihood equation s(X, θ) = 0. In particular, it contains the solution
\[
\hat\theta_{n\epsilon} = \inf\{\theta : \theta_0 - \epsilon \le \theta \le \theta_0 + \epsilon, \text{ and } s(X, \theta) = 0\}.
\]
Before going any further, we shall first show that θ̂nε is a proper random variable, that is, θ̂nε is measurable. Note that, for all t ≥ θ0 − ε, we have
\[
\{\hat\theta_{n\epsilon} > t\} = \left\{\inf_{\theta_0 - \epsilon \le \theta \le t} s(X, \theta) > 0\right\}
= \left\{\inf_{\theta_0 - \epsilon \le \theta \le t,\ \theta \text{ rational}} s(X, \theta) > 0\right\},
\]
which is clearly a measurable set, since s(X, θ) is continuous in θ.
\[
\to_{asy} \frac{N\big(0, I(\theta_0)\big)}{I(\theta_0)} = N\big(0, I^{-1}(\theta_0)\big).
\]
This completes our proof.
Before we give the proof, we make a couple of remarks.
Remark 1. Condition 1 ensures that ∂ log fθ(x)/∂θ, for any x, has a Taylor expansion as a function of θ.
Remark 2. Condition 2 means that
\[
\int f_\theta(x)\,dx, \qquad \int \frac{\partial \log f_\theta(x)}{\partial\theta}\,dx
\]
can be differentiated with respect to θ under the integral sign. That is, the integration and differentiation can be interchanged.
Remark 3. A sufficient condition for Condition 2 is the following:
For each θ ∈ Θ, there exist functions g(x), h(x), and H(x) (possibly depending on θ) such that for θ' ∈ N(θ, ε) = {θ' : |θ' − θ| < ε},
\[
\left|\frac{\partial f_{\theta'}(x)}{\partial\theta'}\right| \le g(x), \qquad
\left|\frac{\partial^2 f_{\theta'}(x)}{\partial\theta'^2}\right| \le h(x), \qquad
\left|\frac{\partial^3 \log f_{\theta'}(x)}{\partial\theta'^3}\right| \le H(x),
\]
where ∫ g(x) dx < ∞, ∫ h(x) dx < ∞, and Eθ H(X1) < ∞.
Remark 4. Condition 3 ensures that the variance of ∂ log fθ(X)/∂θ is finite.
1). Show that any point in [X(n) − 1/2, X(1) + 1/2] is an MLE.
2). Check that regularity conditions in the asymptotic efficiency theorem are
not satisfied.
3). Show that both X(n) − 1/2 and X(1) + 1/2 are consistent estimators of θ.
Proof. 1) has been done before. 2) is true since the support of fθ(x) is [θ − 1/2, θ + 1/2], which depends on the parameter θ. Let us check 3) now. We'll only check X(n) − 1/2.
Note
\[
P\big(|X_{(n)} - 1/2 - \theta| > \epsilon\big) = P\big(\theta - X_{(n)} + 1/2 > \epsilon\big)
= P\big(X_{(n)} < \theta + 1/2 - \epsilon\big)
= \big[F_{X_1}(\theta + 1/2 - \epsilon)\big]^n
\]
\[
= \big[(\theta + 1/2 - \epsilon) - (\theta - 1/2)\big]^n = (1 - \epsilon)^n \to 0.
\]
Remark. Since X(n) − 1/2 and X(1) + 1/2 are both consistent for θ, we have (X(n) − 1/2) − (X(1) + 1/2) →p 0, that is, X(n) − X(1) →p 1.
So an MLE of θ is
\[
\hat\theta = \arg\max_{\theta\in\mathbb R} L(\theta) = \arg\min_{\theta\in\mathbb R}\sum_{i=1}^n |x_i - \theta| = X_{\left(\frac{n+1}{2}\right)} \quad \text{if } n \text{ is odd},
\]
which is simply the median. To see this, we assume that n is even, ...........
2). Notice that for EVERY x, fθ(x) is not differentiable w.r.t. θ at θ = x. So the regularity conditions are not satisfied.
3). Let us assume for the moment that n is odd, so that θ̂ = m̂, the sample median. It is well known that
\[
\sqrt{n}\,(\hat\theta - \theta) \sim_{asymptotic} N\big(0, [4f^2(\theta)]^{-1}\big).
\]
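(A quick numerical check, with simulated data, that the sample median minimizes Σ|xi − θ|:)

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.laplace(loc=3.0, scale=1.0, size=101)           # odd sample size, unique median

grid = np.linspace(x.min(), x.max(), 5001)
obj = np.abs(x[:, None] - grid[None, :]).sum(axis=0)    # sum_i |x_i - theta| on a grid
print("grid minimizer:", grid[np.argmin(obj)], "  sample median:", np.median(x))
```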
5.4.1 Some comparisons about the UMVUE and MLE’s
1. It can be shown that UMVUE's are always consistent and MLE's usually are (Bickel and Doksum, 1977, p. 133).
2. In the exponential family, both estimates are functions of the natural sufficient
statistics Ti (X).
3. When the sample size is large, both estimates give essentially the same answer. In
that case, use is governed by ease of computation.
4. MLE's exist more frequently than UMVUE's (in fact, by our definition, an MLE always exists, although it may be outside the parameter space), and they are usually easier to calculate than UMVUE's. This is partially because of the invariance property of MLE's.
5. Neither MLE’s nor UMVUE’s are satisfactory in general if one takes a Bayesian or
minimax point of view. Nor are they necessarily robust.
Chapter 6
6.1.2 Examples
Example 1. Suppose that X1 , . . . , Xn are i.i.d. r.v.’s from N (µ, σ 2 ). Use the method of
moments to estimate µ and σ 2 .
Solution. The first two population moments are
\[
EX = \mu, \qquad EX^2 = \mathrm{Var}(X) + (EX)^2 = \sigma^2 + \mu^2,
\]
and the corresponding sample moments are
\[
m_1' = \bar X, \qquad m_2' = \frac{1}{n}\sum_{i=1}^n X_i^2.
\]
Equating population and sample moments gives
\[
\mu = \bar X, \qquad \sigma^2 + \mu^2 = \frac{1}{n}\sum_{i=1}^n X_i^2.
\]
Solving for µ and σ 2 yields the method of moments estimators
\[
\hat\mu = \bar X, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - (\bar X)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.
\]
Remark: Here the method of moments produces the same estimators as the MLE, and almost the same as the UMVUE.
Example 2. Suppose that X1 , . . . , Xn are i.i.d. r.v.’s from Binomial(k, p), where both
k and p are unknown. Use the method of moments to estimate k and p.
Solution. The first two population moments are
\[
EX = kp, \qquad EX^2 = \mathrm{Var}(X) + (EX)^2 = kp(1-p) + (kp)^2,
\]
and the sample moments are
\[
m_1' = \bar X, \qquad m_2' = \frac{1}{n}\sum_{i=1}^n X_i^2.
\]
Equating them gives
\[
kp = \bar X, \tag{1.1}
\]
\[
kp(1-p) + (kp)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2. \tag{1.2}
\]
So
\[
1 - p = \frac{n^{-1}\sum_{i=1}^n X_i^2 - (\bar X)^2}{\bar X} = \frac{n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}{\bar X}.
\]
Hence, the moment estimator for p is
\[
\hat p = 1 - \frac{n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}{\bar X} = \frac{\bar X - n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}{\bar X}.
\]
Substituting this into (1.1), we get a moment estimator for k:
\[
\hat k = \frac{\bar X}{\hat p} = \frac{\bar X}{\big[\bar X - n^{-1}\sum_{i=1}^n (X_i - \bar X)^2\big]/\bar X} = \frac{(\bar X)^2}{\bar X - n^{-1}\sum_{i=1}^n (X_i - \bar X)^2}.
\]
Remark. Note that in this example, the MM estimators are simple to compute. But they may be negative, in which case they are unacceptable.
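(A sketch of these moment estimators on simulated Binomial(k, p) data; the true values below are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(7)
k_true, p_true, n = 10, 0.3, 2000
x = rng.binomial(k_true, p_true, size=n)

xbar = x.mean()
s2 = np.mean((x - xbar) ** 2)          # n^{-1} sum (x_i - xbar)^2

p_hat = (xbar - s2) / xbar
k_hat = xbar ** 2 / (xbar - s2)        # can be negative/unstable if s2 > xbar
print("p_hat =", p_hat, "  k_hat =", k_hat)
```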
6.2 Bayesian Method of Estimation
6.2.1 Methodology
Let X1 , . . . , Xn be i.i.d. r.v.’s with a sampling distribution f (x|θ), indexed by θ. In the
Bayesian approach, the population parameter θ is treated as a r.v. with a prior distri-
bution, denoted by π(θ). This is a subjective distribution, based on the experimenter’s
belief, and is formulated before the data are seen (hence the name prior distribution).
Once a sample is taken from the population indexed by θ, the prior distribution is updated with this sample information to produce the so-called posterior distribution. The updating is done with the help of Bayes' rule, hence the name Bayesian statistics. A Bayesian point estimator of θ can then be taken as the mean of the posterior distribution.
Here is a summary of how to find the posterior distribution and point estimator of θ.
Note that f (x|θ) and π(θ) are supposed to be known here.
The joint p.d.f. of X and θ is
\[
f(x, \theta) = f(x|\theta)\pi(\theta).
\]
The marginal p.d.f. of X is m(x) = ∫ f(x|θ)π(θ) dθ (a sum in the discrete case), and the posterior p.d.f. of θ is
\[
\pi(\theta|x) = \frac{f(x, \theta)}{m(x)} = \frac{f(x|\theta)\pi(\theta)}{m(x)}.
\]
6.2.2 Examples
Example 1. Let X1 , . . . , Xn be i.i.d. Bernoulli(p) r.v.’s. We assume that the prior
distribution on p is Beta(α, β). Find the posterior distribution of p and find a Bayesian
point estimator for p.
Solution. Let T = Σ_{i=1}^n Xi ∼ Bin(n, p), and write q = 1 − p. So
\[
f(t|p) = \binom{n}{t} p^t q^{n-t}.
\]
(a). The joint p.d.f. of T and θ is
\[
f(t, p) = f(t|p)\pi(p) = \left[\binom{n}{t} p^t q^{n-t}\right]\left[\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1} q^{\beta-1}\right]
= \binom{n}{t}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{t+\alpha-1} q^{n-t+\beta-1}.
\]
Remarks:
(i). We can rewrite p̂B as
\[
\hat p_B = \frac{t+\alpha}{n+\alpha+\beta} = \left(\frac{n}{n+\alpha+\beta}\right)\frac{t}{n} + \left(\frac{\alpha+\beta}{n+\alpha+\beta}\right)\frac{\alpha}{\alpha+\beta}.
\]
Thus, the Bayesian point estimator p̂B is a linear combination of the prior mean α/(α+β) of θ and the sample mean t/n. In particular, when n = 0 (no data), p̂B = α/(α+β), the prior mean; and as n → ∞, p̂B has the same limit as t/n, namely p.
(ii). Since 0 < p < 1, the choice of a Beta prior seems reasonable. It turns out that the Beta family is the conjugate family of the Binomial family.
Definition: Let F = {f(x|θ), θ ∈ Θ}. A class Π of priors is a conjugate family if the posterior distributions remain in the same class Π, for all priors in Π and all x ∈ X.
So in the last example, we start with a Beta prior and end up with a Beta posterior. The updating of the prior takes the form of updating its parameters. We emphasize that although a conjugate prior is mathematically convenient, whether it is a reasonable choice depends on the experimenter.
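(A minimal sketch of this conjugate updating, with hypothetical prior parameters and simulated data: a Beta(α, β) prior together with t successes in n Bernoulli trials gives a Beta(α + t, β + n − t) posterior, whose mean is the estimator p̂B above.)

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, beta, p_true, n = 2.0, 2.0, 0.7, 50

t = rng.binomial(n, p_true)                        # observed number of successes
post_alpha, post_beta = alpha + t, beta + n - t    # posterior is Beta(alpha + t, beta + n - t)
p_bayes = post_alpha / (post_alpha + post_beta)    # = (t + alpha)/(n + alpha + beta)

print("t =", t, " posterior Beta(", post_alpha, ",", post_beta, ")  p_B =", p_bayes,
      "  sample mean =", t / n)
```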
Example 2. Let X ∼ N (θ, σ02 ), and suppose that the prior distribution on θ is N (µ0 , τ02 ).
(Here we assume that σ02 , µ0 and τ02 are all known). Find the posterior p.d.f. of θ, and a
Bayesian point estimate of θ.
Solution. Here we only give a sketch. It can be shown that the posterior distribution of
θ is N (E(θ|x), V ar(θ|x)), where
\[
E(\theta|x) = \frac{\tau_0^2}{\tau_0^2 + \sigma_0^2}\, x + \frac{\sigma_0^2}{\tau_0^2 + \sigma_0^2}\, \mu_0, \qquad
\mathrm{Var}(\theta|x) = \frac{\tau_0^2\sigma_0^2}{\tau_0^2 + \sigma_0^2}.
\]
Notice that the normal family is its own conjugate family. Clearly, the Bayesian estimator
E(θ|x) given above is again a linear combination of the prior and sample means. Also
note that if τ02 → ∞ (i.e., the prior information is very vague), more weight is given to
the sample information. On the other hand, if τ02 << σ02 (i.e. the prior is very good),
more weight will be given to the prior mean.
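(A minimal sketch of this update for a single observation x; the numbers are arbitrary and only illustrate the effect of a vague versus a tight prior.)

```python
def normal_posterior(x, sigma0_sq, mu0, tau0_sq):
    # Posterior of theta given one observation x ~ N(theta, sigma0^2), prior theta ~ N(mu0, tau0^2)
    w = tau0_sq / (tau0_sq + sigma0_sq)              # weight on the observation
    post_mean = w * x + (1 - w) * mu0
    post_var = tau0_sq * sigma0_sq / (tau0_sq + sigma0_sq)
    return post_mean, post_var

print(normal_posterior(x=4.0, sigma0_sq=1.0, mu0=0.0, tau0_sq=100.0))  # vague prior -> near x
print(normal_posterior(x=4.0, sigma0_sq=1.0, mu0=0.0, tau0_sq=0.01))   # tight prior -> near mu0
```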
Example 3. Let X be a r.v. from U nif (0, θ). Assume that the prior distribution of θ is
U (a, b) with 0 < a < b. A random sample of size 1 is taken from the population and its
value is x. Find the posterior p.d.f. of θ, and a Bayesian point estimate of θ.