
SOME THEORY AND PRACTICE OF STATISTICS

by Howard G. Tucker

CHAPTER 9. POINT ESTIMATION

9.1 Convergence in Probability. The bases of point estimation have
already been laid out in previous chapters. In chapter 5 in particular, we
proved the law of large numbers and Bernoulli's theorem, which are results
about convergence in probability. From time to time we have also mentioned,
in an informal way, that a statistic, i.e., a known function of several observable
random variables, can be an unbiased estimate of some parameter. In this
chapter, we shall deal with questions about point estimation in a more
structured way. We shall introduce in this section the notion of convergence
in probability.
Definition. If $X, X_1, X_2, \ldots$ are random variables, we say that the
sequence $\{X_n\}$ converges in probability to $X$ if, for every $\varepsilon > 0$, the sequence
of numbers $\{P([|X_n - X| \geq \varepsilon])\}$ converges to $0$ as $n \to \infty$, i.e.,
$$P([|X_n - X| \geq \varepsilon]) \to 0 \text{ as } n \to \infty.$$
We frequently denote this by $X_n \to_P X$.


In section 5.2 we proved the law of large numbers. In our present language
this can be expressed as follows: if $\{X_n\}$ is a sequence of independent and
identically distributed random variables with finite common second moment,
then
$$\frac{1}{n}\sum_{i=1}^{n} X_i \to_P E(X_1) \text{ as } n \to \infty.$$
As a special corollary of the law of large numbers we obtained Bernoulli's
theorem, which in our present language states: if for each $n$, $S_n$ is $\mathrm{Bin}(n, p)$,
then $S_n / n \to_P p$ as $n \to \infty$.
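A small simulation makes Bernoulli's theorem concrete. The sketch below (Python with numpy; the values $p = 0.3$, $\varepsilon = 0.05$, the seed, the sample sizes, and the replication count are illustrative choices, not part of the text) estimates $P([|S_n/n - p| \geq \varepsilon])$ for increasing $n$ and shows it shrinking toward $0$.

```python
import numpy as np

# Monte Carlo illustration of Bernoulli's theorem: S_n / n ->_P p.
# p, epsilon, the sample sizes, and the number of replications are
# arbitrary illustrative choices.
rng = np.random.default_rng(0)
p, eps, reps = 0.3, 0.05, 10_000

for n in (10, 100, 1_000, 10_000):
    # S_n is Bin(n, p); draw `reps` independent copies of S_n.
    s_n = rng.binomial(n, p, size=reps)
    # Estimate P([|S_n/n - p| >= eps]) by the fraction of replications
    # in which the event occurs.
    prob = np.mean(np.abs(s_n / n - p) >= eps)
    print(f"n = {n:6d}   estimated P(|S_n/n - p| >= {eps}) = {prob:.4f}")
```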
Example: For $n = 2, 3, \ldots$, let $X_n$ be a random variable with absolutely
continuous distribution function with density given by
$$f_{X_n}(x) = \begin{cases} n - 1 & \text{if } 0 \leq x \leq \frac{1}{n} \\ \frac{1}{n-1} & \text{if } \frac{1}{n} < x \leq 1 \\ 0 & \text{if } x < 0 \text{ or } x > 1. \end{cases}$$

Let $\varepsilon > 0$ be arbitrary, with $\varepsilon < 1$. Then for $n > \frac{1}{\varepsilon}$, we have
$$0 \leq P([|X_n - 0| \geq \varepsilon]) = P([X_n \geq \varepsilon]) \leq P\left(\left[X_n \geq \tfrac{1}{n}\right]\right) = \frac{1}{n} \to 0 \text{ as } n \to \infty.$$
Thus, for this example, $X_n \to_P 0$.
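This convergence can also be checked numerically. Since the density above places mass $(n-1)/n$ on $[0, 1/n]$ and mass $1/n$ on $(1/n, 1]$, one can sample $X_n$ by choosing a piece with those probabilities and drawing uniformly within it. The sketch below (Python with numpy; $\varepsilon = 0.1$, the seed, and the sample sizes are illustrative choices) estimates $P([X_n \geq \varepsilon])$ and compares it with the bound $1/n$.

```python
import numpy as np

# Sample from f_{X_n}: uniform on [0, 1/n] with probability (n-1)/n,
# uniform on (1/n, 1] with probability 1/n.  Then estimate
# P([X_n >= eps]), which the text shows is at most 1/n for n > 1/eps.
rng = np.random.default_rng(1)
eps, reps = 0.1, 100_000

def sample_xn(n, size):
    low_piece = rng.uniform(0.0, 1.0 / n, size)    # the [0, 1/n] piece
    high_piece = rng.uniform(1.0 / n, 1.0, size)   # the (1/n, 1] piece
    use_low = rng.random(size) < (n - 1) / n       # pick a piece
    return np.where(use_low, low_piece, high_piece)

for n in (20, 100, 500):
    x = sample_xn(n, reps)
    print(f"n = {n:4d}   estimated P(X_n >= {eps}) = {np.mean(x >= eps):.4f}"
          f"   bound 1/n = {1 / n:.4f}")
```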


The following remark is an immediate consequence of the definition.
Remark. If $X, X_1, X_2, \ldots$ are random variables, the sequence $\{X_n\}$
converges in probability to $X$ if and only if the sequence of differences,
$\{X_n - X\}$, converges in probability to $0$.
The following theorem is beyond the scope of this course and is given as
a faith theorem. We have proved special cases of it in section 2.1.
Theorem 1. If $f : \mathbb{R}^n \to \mathbb{R}^1$ is a function which is continuous over
$\mathbb{R}^n$, and if $X_1, \ldots, X_n$ are random variables defined on the same probability
space, then $f(X_1, \ldots, X_n)$ is a random variable.
The following theorem can be stated more generally, namely over $\mathbb{R}^n$, but
the statement and proof given here for $n = 2$ are easily extended.
Theorem 2. If $f : \mathbb{R}^2 \to \mathbb{R}^1$ is a function which is continuous at the
point $(a, b) \in \mathbb{R}^2$, and if $\{X_n\}$ and $\{Y_n\}$ are sequences of random variables
defined on the same probability space such that $X_n \to_P a$ and $Y_n \to_P b$,
then the sequence of random variables $\{f(X_n, Y_n)\}$ converges in probability
to $f(a, b)$ as $n \to \infty$, i.e., $f(X_n, Y_n) \to_P f(a, b)$ as $n \to \infty$.
Proof: Since by hypothesis $f : \mathbb{R}^2 \to \mathbb{R}^1$ is continuous at $(a, b)$, then by
the definition of continuity, given $\varepsilon > 0$, there exists a $\delta > 0$ such that
$$[|X_n - a| < \delta] \cap [|Y_n - b| < \delta] \subset [|f(X_n, Y_n) - f(a, b)| < \varepsilon].$$
By the DeMorgan formula, we take complements of both sides to get
$$[|f(X_n, Y_n) - f(a, b)| \geq \varepsilon] \subset [|X_n - a| \geq \delta] \cup [|Y_n - b| \geq \delta].$$
Taking the probabilities of both sides, and using Boole's inequality, we obtain
$$0 \leq P([|f(X_n, Y_n) - f(a, b)| \geq \varepsilon]) \leq P([|X_n - a| \geq \delta]) + P([|Y_n - b| \geq \delta]).$$
Since, by hypothesis, $P([|X_n - a| \geq \delta]) \to 0$ and $P([|Y_n - b| \geq \delta]) \to 0$, it
follows that
$$P([|f(X_n, Y_n) - f(a, b)| \geq \varepsilon]) \to 0 \text{ as } n \to \infty,$$

i.e., $f(X_n, Y_n) \to_P f(a, b)$ as $n \to \infty$.
Definition. If $X$ is an observable random variable, we say it is an
unbiased estimate of the constant $\theta$ if $E(X) = \theta$.
The next definition appears to be a little odd, but this is the way a lot of
mathematical statisticians speak and write.
Definition. If $\{X_n\}$ is an infinite sequence of observable random variables,
then we say "$X_n$ is a consistent estimate of $\theta$" if $X_n \to_P \theta$ as $n \to \infty$.
There are many examples of unbiased estimates. If $X_n$ is $\mathrm{Bin}(n, p)$, then
$E(X_n) = np$, so $E(X_n / n) = p$. Thus $X_n / n$ is an unbiased estimate of $p$.
Also, by Bernoulli's theorem and the definition of convergence in probability
given above, $X_n / n$ is a consistent estimate of $p$.
Here is an example where we can obtain three different consistent estimates
of a particular parameter. Let $X$ be a random variable with absolutely
continuous distribution with density given by
$$f_X(x) = \begin{cases} 1 & \text{if } \theta < x < \theta + 1 \\ 0 & \text{otherwise.} \end{cases}$$
Let $X_1, X_2, \ldots$ denote a sequence of independent observations on $X$, i.e.,
$X_1, X_2, \ldots$ is a sequence of independent, identically distributed random
variables, each with the same distribution function as $X$, and let
$U_n = \min\{X_i : 1 \leq i \leq n\}$. Then for every $\varepsilon$ satisfying $0 < \varepsilon < 1$, we have
$$0 \leq P([|U_n - \theta| \geq \varepsilon]) = P([U_n \geq \theta + \varepsilon]) = P\left(\bigcap_{i=1}^{n} [X_i \geq \theta + \varepsilon]\right) = \prod_{i=1}^{n} P([X_i \geq \theta + \varepsilon]) = (1 - \varepsilon)^n \to 0 \text{ as } n \to \infty.$$

Thus $U_n \to_P \theta$ as $n \to \infty$, so $U_n$ is a consistent estimate of $\theta$. But if we also
define
$$V_n = \frac{X_1 + \cdots + X_n}{n} - \frac{1}{2},$$
we can show that $V_n$ is also a consistent estimate of $\theta$. Indeed, by the law of
large numbers, since $E(X) = \theta + \frac{1}{2}$,
$$P([|V_n - \theta| \geq \varepsilon]) = P\left(\left[\left|\frac{X_1 + \cdots + X_n}{n} - E(X)\right| \geq \varepsilon\right]\right) \to 0$$
as $n \to \infty$, for all $\varepsilon > 0$. Note also that if the observable random variable
$W_n$ is defined by
$$W_n = \max\{X_i : 1 \leq i \leq n\} - 1,$$
then one can easily prove that $W_n \to_P \theta$ as $n \to \infty$.
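A brief simulation puts the three estimates side by side. The sketch below (Python with numpy; the value $\theta = 2.5$, the seed, and the sample sizes are illustrative choices, not part of the text) draws observations uniform on $(\theta, \theta + 1)$ and prints $U_n$, $V_n$, and $W_n$ for increasing $n$.

```python
import numpy as np

# Compare the three consistent estimates of theta from this section:
#   U_n = min(X_1,...,X_n),  V_n = mean - 1/2,  W_n = max - 1.
# theta and the sample sizes are arbitrary illustrative choices.
rng = np.random.default_rng(2)
theta = 2.5

for n in (10, 100, 1_000, 10_000):
    x = rng.uniform(theta, theta + 1.0, size=n)
    u_n = x.min()
    v_n = x.mean() - 0.5
    w_n = x.max() - 1.0
    print(f"n = {n:6d}   U_n = {u_n:.4f}   V_n = {v_n:.4f}   W_n = {w_n:.4f}")
```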

EXERCISES

1. The definition of a function $f : \mathbb{R}^2 \to \mathbb{R}^1$ being continuous at a
point $(a, b) \in \mathbb{R}^2$ is usually given as: for every $\varepsilon > 0$, there exists a number
$\delta > 0$ such that $|f(x, y) - f(a, b)| < \varepsilon$ for all points $(x, y) \in \mathbb{R}^2$ that satisfy
$|x - a| < \delta$ and $|y - b| < \delta$. Prove that this justifies the following statement
in the proof of theorem 2: Since by hypothesis $f : \mathbb{R}^2 \to \mathbb{R}^1$ is continuous at
$(a, b)$, then by the definition of continuity, given $\varepsilon > 0$, there exists a $\delta > 0$
such that
$$[|X_n - a| < \delta] \cap [|Y_n - b| < \delta] \subset [|f(X_n, Y_n) - f(a, b)| < \varepsilon].$$

2. Prove: if $\{X_n\}$ and $\{Y_n\}$ are sequences of random variables, if $X_n \to_P a$
and if $Y_n \to_P b$, where $a$ and $b$ are constants, then $X_n + Y_n \to_P a + b$ and
$X_n Y_n \to_P ab$ as $n \to \infty$.
3. Let $\{U_n\}$ be a sequence of random variables, where $U_n$ is $N(0, \frac{1}{n})$,
$n = 1, 2, \ldots$. Prove that $U_n \to_P 0$ as $n \to \infty$. (Hint: Use Chebyshev's
inequality.)
4. Prove: if $\{V_n\}$ is a sequence of random variables, and if $V_n$ has the
$F_{n,2n}$ distribution, $n = 1, 2, \ldots$, then $V_n \to_P 1$.
5. Let $\{X_n\}$ be a sequence of independent, identically distributed random
variables with a common absolutely continuous distribution function whose
density is given by
$$f_{X_1}(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{if } x < 0, \end{cases}$$
where $\lambda > 0$ is a constant. Prove that
$$\frac{n}{X_1 + \cdots + X_n} \to_P \lambda \text{ as } n \to \infty.$$
6. Prove that $W_n \to_P \theta$ as $n \to \infty$, where $W_n$ is as defined at the end of
this section.
9.2 The Method of Moments. An advantage of having an estimate that
converges in probability to the value of a parameter of interest, i.e., an advantage
of a consistent estimate of the parameter, is that one is assured that the farther
out one goes in the sequence, the closer one gets to this parameter in
a certain sense. Especially if one does not have an unbiased estimate of the
parameter but does have a consistent estimate, then one can take comfort
in the fact that the larger $n$ is, the closer the estimate is to the value of
the parameter. A very useful method of finding consistent estimates is the
method of moments, which is developed in this section. We first need an
important lemma.
Lemma 1. If $\{X_n\}$ are independent and identically distributed random
variables with finite $2r$th moment, where $r$ is a positive integer, then
$$\frac{1}{n}\sum_{i=1}^{n} X_i^r \to_P E(X_1^r) \text{ as } n \to \infty.$$
Proof: We first observe that the $r$th powers, $\{X_n^r\}$, are independent and
identically distributed with common finite second moment. Hence by the law
of large numbers we obtain the conclusion.
Lemma 2. If $X, X_1, X_2, \ldots$ are random variables that are constant, i.e.,
there exist constants $a, a_1, a_2, \ldots$ such that $P([X = a]) = 1$ and $P([X_n = a_n]) = 1$
for all $n$, and if $a_n \to a$ as $n \to \infty$, then $X_n \to_P X = a$ as $n \to \infty$.
Proof: Let $\varepsilon > 0$ be arbitrary. Since by hypothesis $a_n \to a$ as $n \to \infty$,
there exists a positive integer $N$ such that $|a_n - a| < \varepsilon$ for all values of $n$
that satisfy $n > N$. Hence $P([|X_n - X| < \varepsilon]) = 1$ for all $n > N$, from which
it follows that $P([|X_n - X| \geq \varepsilon]) = 0$ for all $n > N$, i.e., $X_n \to_P X = a$ as
$n \to \infty$.
We shall consider the case of a sequence of independent observations
$X_1, X_2, \ldots$ on a random variable $X$. This means that each random variable
in the sequence has the same distribution function as does $X$ and is an
independent observation on the same population that $X$ is. To keep things simple,
suppose the distribution function of $X$ depends on two unknown parameters,
$\theta_1$ and $\theta_2$, and suppose we wish to estimate both of these parameters with
consistent estimates. Let us suppose that this sequence of random variables
has a finite common fourth moment. Then by the above lemma,
$$\frac{1}{n}\sum_{i=1}^{n} X_i^r \to_P E(X_1^r) \text{ as } n \to \infty \text{ for } r = 1, 2.$$
We also suppose that each of the first two moments of $X$, $m_1$ and $m_2$, is a
continuous function of the unknown parameters $(\theta_1, \theta_2)$. Call these functions
$m_1 = u(\theta_1, \theta_2)$ and $m_2 = v(\theta_1, \theta_2)$. Further we assume that a continuous
inverse of this mapping exists, namely, that one can solve these two equations
for $\theta_1$ and $\theta_2$ as continuous functions, $\theta_1 = g(m_1, m_2)$ and $\theta_2 = h(m_1, m_2)$,
of $m_1$ and $m_2$. Let us denote $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $V_n = \frac{1}{n}\sum_{i=1}^{n} X_i^2$. Then by
theorem 2 in section 9.1 and lemma 1 above,
$$g(\bar{X}_n, V_n) \to_P g(m_1, m_2) = \theta_1$$
and
$$h(\bar{X}_n, V_n) \to_P h(m_1, m_2) = \theta_2$$
as $n \to \infty$. This is essentially the method of moments. It should be realized
that this holds for any finite number of parameters, provided the required
common moments are finite.
Example 1. Let $X$ be a random variable with an absolutely continuous
distribution with density given by
$$f_X(x) = \begin{cases} Lx & \text{if } 0 < x < \sqrt{2/L} \\ 0 & \text{if } x \leq 0 \text{ or } x \geq \sqrt{2/L}, \end{cases}$$
where $L > 0$ is unknown, and let $X_1, X_2, \ldots$ be independent observations on
$X$. In order to find a consistent estimate of $L$, we compute
$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{0}^{\sqrt{2/L}} Lx^2\,dx = \frac{2^{3/2}}{3L^{1/2}}.$$
Solving for $L$ we obtain
$$L = \frac{8}{9(E(X))^2}.$$
Since $h(x) = \frac{8}{9x^2}$ is a continuous function of $x$, it follows that a consistent
estimate of $L$ is $\hat{L}_n = \dfrac{8}{9\bar{X}_n^2}$.
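As a numerical check of Example 1, one can sample from $f_X$ by inversion: the distribution function is $F(x) = Lx^2/2$ for $0 < x < \sqrt{2/L}$, so $X = \sqrt{2U/L}$ with $U$ uniform on $(0, 1)$. The sketch below (Python with numpy; the true value $L = 4$, the seed, and the sample sizes are illustrative choices) computes $\hat{L}_n = 8/(9\bar{X}_n^2)$ on simulated data.

```python
import numpy as np

# Method-of-moments check for Example 1: L_hat = 8 / (9 * Xbar^2).
# Sample X by inverse-CDF: F(x) = L x^2 / 2, so X = sqrt(2U / L).
# The true L and the sample sizes are arbitrary illustrative choices.
rng = np.random.default_rng(3)
L_true = 4.0

for n in (100, 10_000, 1_000_000):
    u = rng.random(n)
    x = np.sqrt(2.0 * u / L_true)
    l_hat = 8.0 / (9.0 * x.mean() ** 2)
    print(f"n = {n:8d}   L_hat = {l_hat:.4f}   (true L = {L_true})")
```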
Example 2. Consider a random variable $X$ with finite fourth moment
and unknown expectation $\mu$. Let $X_1, X_2, \ldots$ denote a sequence of independent
observations on $X$, i.e., $X_1, X_2, \ldots$ are independent observations on
the same population on which $X$ is an observation. Accordingly, $X_1, X_2, \ldots$
are independent and identically distributed random variables whose common
distribution is the same as the distribution of $X$. Now let us define $\bar{X}_n$ by
$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}$$
for every $n$. Then by the law of large numbers proved in section 5.2 and by the
definitions given in section 9.1, $\bar{X}_n \to_P \mu$, i.e., $\bar{X}_n$ is a consistent estimate of
$\mu$. From now on we shall use the notation $\bar{X}_n$ to denote the arithmetic mean
of the first $n$ independent observations on a random variable $X$. Likewise we
shall use in such a case the notation
$$\hat{\sigma}_n^2 = \frac{1}{n}\sum_{k=1}^{n} (X_k - \bar{X}_n)^2$$
and
$$s_n^2 = \frac{1}{n-1}\sum_{k=1}^{n} (X_k - \bar{X}_n)^2.$$
Theorem 1. If $X_1, X_2, \ldots$ are independent observations on a random
variable $X$ which has a finite fourth moment, then both $\hat{\sigma}_n^2$ and $s_n^2$ are
consistent estimates of $Var(X)$, but $\hat{\sigma}_n^2$ is not an unbiased estimate of $Var(X)$.
Proof: We first show that $\hat{\sigma}_n^2 = \frac{1}{n}\sum_{k=1}^{n} X_k^2 - \bar{X}_n^2$. Indeed,
$$\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i^2 - 2\bar{X}_n X_i + \bar{X}_n^2\right) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - 2\bar{X}_n^2 + \bar{X}_n^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}_n^2.$$
Now, we apply the law of large numbers and theorem 2 of section 9.1 and
lemma 1 above. Observing that the function $f(x, y) = y - x^2$ is continuous
in $(x, y)$, and since $Var(X) = E(X^2) - (E(X))^2$, we may conclude that
$\hat{\sigma}_n^2 \to_P Var(X)$ as $n \to \infty$. Since $s_n^2 = \frac{n}{n-1}\hat{\sigma}_n^2$ and since by lemma 2 the
constant random variable $\frac{n}{n-1} \to_P 1$, we again obtain by theorem 2 in section
9.1, by means of the continuous function $g(x, y) = xy$, that $s_n^2 \to_P Var(X)$.
Now we compute
$$E(\bar{X}_n^2) = \frac{1}{n^2} E\left(\sum_{i=1}^{n} X_i^2 + \sum_{j \neq k} X_j X_k\right) = \frac{1}{n^2}\left(n E(X^2) + \sum_{j \neq k} E(X_j)E(X_k)\right) = \frac{1}{n^2}\left(n E(X^2) + n(n-1)(E(X))^2\right) = \frac{1}{n} E(X^2) + \frac{n-1}{n}(E(X))^2.$$
Thus,
$$E(\hat{\sigma}_n^2) = \frac{n-1}{n}\left(E(X^2) - (E(X))^2\right) = \frac{n-1}{n} Var(X),$$
which implies that $\hat{\sigma}_n^2$ is not an unbiased estimate of $Var(X)$. However, since
$E(s_n^2) = \frac{n}{n-1} E(\hat{\sigma}_n^2)$, it follows that $s_n^2$ is an unbiased estimate of $Var(X)$.
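The bias identified in Theorem 1 is easy to see numerically. The sketch below (Python with numpy; the normal population with $Var(X) = 4$, the value $n = 5$, the seed, and the replication count are illustrative choices, since the theorem itself only requires a finite fourth moment) averages $\hat{\sigma}_n^2$ and $s_n^2$ over many samples and compares them with the theoretical values.

```python
import numpy as np

# Average sigma_hat_n^2 and s_n^2 over many samples to exhibit the bias
# E(sigma_hat_n^2) = (n-1)/n * Var(X).  The normal population, n, and
# the number of replications are arbitrary illustrative choices.
rng = np.random.default_rng(4)
n, reps, var_true = 5, 200_000, 4.0

samples = rng.normal(0.0, np.sqrt(var_true), size=(reps, n))
sigma_hat_sq = samples.var(axis=1, ddof=0)   # divides by n
s_sq = samples.var(axis=1, ddof=1)           # divides by n - 1

print(f"mean of sigma_hat^2 = {sigma_hat_sq.mean():.4f}"
      f"   (theory: {(n - 1) / n * var_true:.4f})")
print(f"mean of s^2         = {s_sq.mean():.4f}   (theory: {var_true:.4f})")
```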
Example 3. Let $X$ be a random variable with an absolutely continuous
distribution with density given by
$$f_X(x) = \begin{cases} \frac{1}{b-a} & \text{if } a < x < b \\ 0 & \text{if } x \leq a \text{ or } x \geq b, \end{cases}$$
where $a < b$ are unknown constants, and let $X_1, X_2, \ldots$ be independent
observations on $X$. We wish to find consistent estimates of $a$ and of $b$ by the
method of moments. We compute the first two moments of $X$ and obtain
$$E(X) = \frac{a+b}{2}, \qquad E(X^2) = \frac{1}{3}(a^2 + ab + b^2).$$
Solving for $a$ and $b$ in terms of $E(X)$ and $E(X^2)$ we obtain
$$a = E(X) - (3\,Var(X))^{1/2}, \quad \text{and} \quad b = E(X) + (3\,Var(X))^{1/2}.$$
In theorem 1 above we proved that $s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 \to_P Var(X)$. Since the
functions $g(x, y) = x - (3y)^{1/2}$ and $h(x, y) = x + (3y)^{1/2}$ are continuous
functions of $(x, y)$, we obtain consistent estimates $\hat{a}_n$ and $\hat{b}_n$ of $a$ and $b$
respectively by
$$\hat{a}_n = \bar{X}_n - (3s_n^2)^{1/2}$$
and
$$\hat{b}_n = \bar{X}_n + (3s_n^2)^{1/2}.$$
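The estimates of Example 3 can be tried on simulated data. The sketch below (Python with numpy; the true values $a = 1$, $b = 3$, the seed, and the sample sizes are illustrative choices) computes $\hat{a}_n = \bar{X}_n - (3s_n^2)^{1/2}$ and $\hat{b}_n = \bar{X}_n + (3s_n^2)^{1/2}$.

```python
import numpy as np

# Method-of-moments estimates for the uniform (a, b) density of Example 3:
#   a_hat = Xbar - sqrt(3 s^2),  b_hat = Xbar + sqrt(3 s^2).
# The true (a, b) and the sample sizes are arbitrary illustrative choices.
rng = np.random.default_rng(5)
a_true, b_true = 1.0, 3.0

for n in (50, 1_000, 100_000):
    x = rng.uniform(a_true, b_true, size=n)
    xbar, s_sq = x.mean(), x.var(ddof=1)
    a_hat = xbar - np.sqrt(3.0 * s_sq)
    b_hat = xbar + np.sqrt(3.0 * s_sq)
    print(f"n = {n:7d}   a_hat = {a_hat:.4f}   b_hat = {b_hat:.4f}")
```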

EXERCISES

1. All of the terms of the sequence $\{\frac{n+1}{n} : n = 1, 2, \ldots\}$ may be considered
as random variables. Prove that this sequence converges in probability to $1$.
2. Let $X$ be a random variable with discrete density given by
$$P([X = n]) = p\,e^{-2}\frac{2^n}{n!} + (1-p)\,e^{-3}\frac{3^n}{n!} \text{ for } n = 0, 1, 2, \ldots,$$
where $0 < p < 1$ is an unknown constant. If $X_1, X_2, \ldots$ are independent
observations on $X$, use the method of moments to find a consistent estimate
of $p$.
3. Let $X$ be a random variable with geometric distribution given by
$$P([X = n]) = (1-p)^{n-1} p \text{ for } n = 1, 2, \ldots,$$
where $0 < p < 1$ is an unknown constant. If $X_1, X_2, \ldots$ are independent
observations on $X$, use the method of moments to find a consistent estimate
of $p$.
4. Suppose that $X$ has an absolutely continuous distribution with density
given by
$$f_X(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha+1)\,\beta^{\alpha+1}}\, x^{\alpha} e^{-x/\beta} & \text{if } x > 0 \\ 0 & \text{if } x \leq 0, \end{cases}$$
where $\beta > 0$ and $\alpha > -1$ are unknown constants. If $X_1, X_2, \ldots$ are independent
observations on $X$, use the method of moments to find consistent
estimates of $\alpha$ and $\beta$.
5. Let $X$ have an absolutely continuous distribution with density given
by
$$f_X(x) = \begin{cases} (1-p-q)e^{-x} + 2pe^{-2x} + 3qe^{-3x} & \text{if } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$
where $p > 0$, $q > 0$, and $p + q < 1$ are unknown constants. If $X_1, X_2, \ldots$ are
independent observations on $X$, use the method of moments to find consistent
estimates of $p$ and $q$.

9.3 Maximum Likelihood Estimation. Given a set of observable
random variables whose joint distribution depends on one or more unknown
parameters, there is the problem of finding a function of these random variables
that "estimates" one or more of these parameters. This function is
called an estimate. What properties would we like such an estimate to have?
In general, we want the distribution of this estimate to be concentrated near
the true value of the parameter, whatever value this parameter might have.
We might also desire that this estimate be unbiased or that it be a consistent
estimate. A maximum likelihood estimate might have none of these
properties. However, it is sometimes easy or at least possible to obtain, so
it becomes a candidate, and from then on one tries to find out what nice
properties it has, or whether it can be altered to have certain nice properties.
This section is devoted only to discussing how to obtain these estimates.

In brief, a maximum likelihood estimate is a function $g$ of the observable
random variables $X$ that maximizes the probability of the value of $X$ that is
observed. If the random variables $X$ have a joint discrete density, a density
that is sometimes written as $f_X(x|\theta)$ in order to show its dependence on the
parameter $\theta$, a maximum likelihood estimate of $\theta$ is a function $\hat{\theta} = \hat{\theta}(X)$ that
maximizes $f_X(X|\theta)$. Note that $x$ is replaced by $X$; this is because $X$ is the
value of $X$ that you get, so $f_X(X|\theta)$ is the probability of getting the value
that you get. If this is difficult to understand, have patience, and wait for the
examples that follow. In the case where the joint distribution is absolutely
continuous and has density $f_X(x|\theta)$, then it is loosely said that $f_X(x|\theta)\,dx$
is the probability of $X$ taking a value in a small neighborhood of $x$, and so
the maximum likelihood estimate of $\theta$ in this case is that value of $\theta$ that
maximizes the probability of getting close to, or in a neighborhood of, the
value of $X$ that one observes, i.e., $\hat{\theta} = \hat{\theta}(X)$ in this case is the value of $\theta$
that maximizes the joint density. At this point it is perhaps best to consider
examples.
Example 1. The normal distribution. Suppose $X_1, \ldots, X_n$ is a sample of
independent observations on a random variable $X$ which is $N(\mu, \sigma^2)$, where
$\mu$ and $\sigma^2 > 0$ are unknown. We know that their joint density is
$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right\}.$$
Suppose that we are only interested in a maximum likelihood estimate,
$\hat{\mu}(X_1, \ldots, X_n)$, of $\mu$. We must therefore find the value of $\mu$ that maximizes
this joint density. This is equivalent to finding the value $\hat{\mu}(X_1, \ldots, X_n)$ of $\mu$
that minimizes $\sum_{i=1}^{n}(X_i - \mu)^2$. In section 8.3 we found that this value of $\mu$
as a function of $X_1, \ldots, X_n$ is
$$\hat{\mu}(X_1, \ldots, X_n) = \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
This value of $\mu$ maximizes the density no matter what the value of $\sigma^2$ is. So
suppose that, instead of looking only for the maximum likelihood estimate of $\mu$,
we are also interested in the value of $(\mu, \sigma^2)$ that maximizes the joint density. To
do this, let $\mu = \bar{X}_n$ and look for the value of $\sigma^2$ that maximizes the joint
density when $\mu = \bar{X}_n$. Taking the derivative with respect to $\sigma^2$ we find that
the joint density is maximized when
$$\sigma^2 = \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2.$$
In section 9.2 we proved that $\hat{\sigma}^2$ is a consistent but not an unbiased estimate
of $\sigma^2$, while $s^2$ is both unbiased and consistent.
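As a quick numerical sanity check on Example 1 (a sketch only; Python with numpy is assumed, and the true parameters $\mu = 1$, $\sigma^2 = 2$, the seed, and $n = 500$ are illustrative values not taken from the text), the closed-form maximizers $\bar{X}_n$ and $\hat{\sigma}^2$ should give a log-likelihood no smaller than nearby parameter values.

```python
import numpy as np

# Check that (Xbar, sigma_hat^2) maximizes the N(mu, sigma^2) log-likelihood.
# The true parameters and sample size are arbitrary illustrative choices.
rng = np.random.default_rng(6)
mu_true, var_true, n = 1.0, 2.0, 500
x = rng.normal(mu_true, np.sqrt(var_true), size=n)

def log_likelihood(mu, var):
    return -0.5 * n * np.log(2.0 * np.pi * var) - np.sum((x - mu) ** 2) / (2.0 * var)

mu_hat = x.mean()
var_hat = x.var(ddof=0)          # the MLE of sigma^2 divides by n, not n - 1
print("log-likelihood at the MLE:", log_likelihood(mu_hat, var_hat))
for dmu, dvar in [(0.1, 0.0), (0.0, 0.2), (-0.1, -0.2)]:
    print(f"  at (mu_hat{dmu:+.1f}, var_hat{dvar:+.1f}):",
          log_likelihood(mu_hat + dmu, var_hat + dvar))
```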
Example 2. The uniform distribution. Let $X_1, \ldots, X_n$ be a sample of
independent observations on a random variable $X$ that has the uniform
distribution over $[0, \theta]$, for some unknown constant $\theta > 0$. Thus the common
density of these random variables is
$$f_X(x|\theta) = \begin{cases} \frac{1}{\theta} & \text{if } 0 \leq x \leq \theta \\ 0 & \text{otherwise.} \end{cases}$$
Since the random variables are independent, their joint density is
$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n|\theta) = \begin{cases} \frac{1}{\theta^n} & \text{if } 0 \leq x_i \leq \theta, \ 1 \leq i \leq n \\ 0 & \text{otherwise.} \end{cases}$$
From this expression of the joint density, it is clear that if we take $\hat{\theta}$ such
that $\hat{\theta} < \max\{x_1, \ldots, x_n\}$, then
$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n|\hat{\theta}) = 0.$$
When this joint density is positive, the smaller one takes $\hat{\theta}$ the larger the density
becomes. The smallest value that $\hat{\theta}$ can be, then, is $\hat{\theta} = \max\{x_1, \ldots, x_n\}$.
Thus $\hat{\theta}(X_1, \ldots, X_n) = \max\{X_1, \ldots, X_n\}$ is a maximum likelihood estimate
of $\theta$. It is not difficult to prove that $\hat{\theta}(X_1, \ldots, X_n)$ is not an unbiased estimate
of $\theta$ but is a consistent estimate of $\theta$.
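The closing claims of Example 2, that $\max\{X_1, \ldots, X_n\}$ is a biased but consistent estimate of $\theta$, can be observed in a simulation. The sketch below (Python with numpy; $\theta = 5$, the seed, the sample sizes, and the replication count are illustrative choices) averages the estimate over many samples and estimates $P([|\hat{\theta} - \theta| \geq 0.1])$.

```python
import numpy as np

# The MLE max(X_1,...,X_n) for the uniform [0, theta] model:
# its average is below theta (biased) but approaches theta as n grows
# (consistent).  theta, the sample sizes, and the replication count are
# arbitrary illustrative choices.
rng = np.random.default_rng(7)
theta, reps = 5.0, 20_000

for n in (5, 50, 500):
    theta_hat = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
    prob = np.mean(np.abs(theta_hat - theta) >= 0.1)
    print(f"n = {n:4d}   mean of max = {theta_hat.mean():.4f}"
          f"   P(|max - theta| >= 0.1) = {prob:.4f}")
```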
Example 3. The hypergeometric distribution. In a lake there are $N$ fish,
where $N$ is unknown. The problem is to estimate the value of $N$. Someone
catches $r$ fish (all at the same time), marks each with a spot and returns
all of them to the lake. After a reasonable length of time, during which these
"tagged" fish are assumed to have distributed themselves "at random" in
the lake, someone catches $s$ fish (again, all at once). (Note: $r$ and $s$ are
considered to be fixed, predetermined constants.) Among these $s$ fish caught
there will be $X$ tagged fish, where $X$ is a random variable. The discrete
density of $X$ is given by
$$f_X(x|N) = \frac{\binom{r}{x}\binom{N-r}{s-x}}{\binom{N}{s}},$$
where $x$ is an integer that satisfies
$$\max\{0, s - N + r\} \leq x \leq \min\{s, r\},$$
and $f_X(x|N) = 0$ for all other values of $x$. The problem of finding a maximum
likelihood estimate of $N$ is to find that value $\hat{N} = \hat{N}(x)$ of $N$ for which
$f_X(x|N)$ is maximized. In order to accomplish this we consider the ratio
$$R(N) = \frac{f_X(x|N)}{f_X(x|N-1)}.$$
For those values of $N$ for which $R(N) > 1$ we know that $f_X(x|N) > f_X(x|N-1)$,
and for those values of $N$ for which $R(N) < 1$ we know that $f_X(x|N)$ is
a decreasing function of $N$. Using the formula for the density of $X$ we note
(after a certain amount of algebra) that $R(N) > 1$ if and only if $N < rs/x$,
and $R(N) < 1$ if and only if $N > rs/x$. We see that $f_X(x|N)$ reaches its
maximum value (as a function of $N$) when $N = [rs/x]$. Thus a maximum
likelihood estimate of $N$ is $\hat{N} = [rs/X]$. (This is a method used in wildlife
estimation and is usually referred to as the capture-recapture method.)
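A simulation of the capture-recapture scheme is easy to set up. The sketch below (Python with numpy; the values $N = 1000$, $r = 100$, $s = 150$, the seed, and the number of repetitions are illustrative choices) draws the number of tagged fish in the second catch from the hypergeometric distribution and applies $\hat{N} = [rs/X]$, guarding against the outcome $X = 0$, for which the estimate is undefined.

```python
import numpy as np

# Capture-recapture: X ~ hypergeometric(ngood=r, nbad=N-r, nsample=s),
# and the maximum likelihood estimate is N_hat = floor(r*s / X).
# N, r, s, and the repetition count are arbitrary illustrative choices.
rng = np.random.default_rng(8)
N, r, s, reps = 1_000, 100, 150, 10

for _ in range(reps):
    x = rng.hypergeometric(r, N - r, s)
    if x == 0:
        print("X = 0: the estimate rs/X is undefined for this outcome")
    else:
        print(f"X = {x:3d}   N_hat = {r * s // x}")
```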
Example 4. The general linear model. In the general linear model presented
in chapter 8, $Y = X\beta + Z$, we saw that the joint distribution of $Y$
was $N(X\beta, \sigma^2 I)$. Thus its joint density is
$$f_Y(y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)^t(y - X\beta)\right\}.$$
The problem is to find a maximum likelihood estimate of $\beta$ and of $\sigma^2$.
A value of $\beta$ that maximizes $f_Y(y)$, whatever the value of $\sigma^2$, is clearly
the value of $\beta$ that minimizes $(y - X\beta)^t(y - X\beta)$. We solved this problem
in section 8.1, finding that $\hat{\beta} = (X^t X)^{-1} X^t Y$ is the value of $\beta$ that
minimizes $(y - X\beta)^t(y - X\beta)$. Thus the maximum likelihood estimate of $\beta$ is
$\hat{\beta} = (X^t X)^{-1} X^t Y$. We have shown that this estimate is an unbiased estimate.
In order to find the joint maximum likelihood estimate of $(\beta, \sigma^2)$, we
substitute $\hat{\beta} = (X^t X)^{-1} X^t Y$ for $\beta$ in the formula for the joint density and
differentiate with respect to $\sigma^2$. Upon so doing, and solving for $\sigma^2$, we obtain
that the joint density is maximized when $\beta = \hat{\beta} = (X^t X)^{-1} X^t Y$ and $\sigma^2 = \hat{\sigma}^2$,
where
$$\hat{\sigma}^2 = \frac{1}{n} \| Y - X\hat{\beta} \|^2.$$
However, this value of $\sigma^2$ is not an unbiased estimate of $\sigma^2$; this was shown
in chapter 8.
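The formulas in Example 4 translate directly into a few lines of linear algebra. The sketch below (Python with numpy; the design matrix, $\beta$, $\sigma$, the seed, and the sample size are made-up illustrative values) computes $\hat{\beta} = (X^t X)^{-1} X^t Y$ and the maximum likelihood estimate $\hat{\sigma}^2 = \|Y - X\hat{\beta}\|^2 / n$ for simulated data.

```python
import numpy as np

# Maximum likelihood estimates in the general linear model Y = X beta + Z:
#   beta_hat = (X^t X)^{-1} X^t Y,   sigma_hat^2 = ||Y - X beta_hat||^2 / n.
# The design matrix, beta, and sigma below are made-up illustrative values.
rng = np.random.default_rng(9)
n, beta_true, sigma_true = 200, np.array([2.0, -1.0, 0.5]), 1.5

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ beta_true + rng.normal(0.0, sigma_true, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^t X)^{-1} X^t Y
resid = Y - X @ beta_hat
sigma_hat_sq = resid @ resid / n               # the (biased) MLE of sigma^2

print("beta_hat    =", beta_hat)
print("sigma_hat^2 =", sigma_hat_sq, "  (true sigma^2 =", sigma_true ** 2, ")")
```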

EXERCISES

1. Let $X$ be a random variable with a discrete distribution whose density
is
$$f_X(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \\ 0 & \text{otherwise,} \end{cases}$$
where $0 < p < 1$ is an unknown constant. Let $X_1, \ldots, X_n$ be independent
observations on $X$. Find the joint density of $X_1, \ldots, X_n$, and find the
maximum likelihood estimate of $p$.
2. Let $X_1, \ldots, X_n$ denote $n$ independent observations on a random variable
whose distribution is $P(\lambda)$, where $\lambda > 0$ is unknown. Find the joint
density of $X_1, \ldots, X_n$, and find the maximum likelihood estimate, $\hat{\lambda}$, of $\lambda$.
3. Let $X_1, \ldots, X_n$ be $n$ independent observations on a random variable
$X$ that has an absolutely continuous distribution with density given by
$$f_X(x|\theta) = \begin{cases} \theta(1-x)^{\theta-1} & \text{if } x \in (0, 1) \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta > 1$ is an unknown constant. Find the maximum likelihood estimate
of $\theta$.
4. Let $X_1, \ldots, X_n$ be $n$ independent observations on a random variable
$X$ with discrete distribution, whose density is
$$f_X(x|N) = \begin{cases} \frac{1}{N} & \text{if } 1 \leq x \leq N \\ 0 & \text{otherwise.} \end{cases}$$
Find the maximum likelihood estimate of $N$.


5. In example 2, prove that $\hat{\theta}(X_1, \ldots, X_n)$ is a consistent estimate of $\theta$.

