
MAXIMUM LIKELIHOOD ESTIMATION

and
CRAMÉR-RAO INEQUALITY

Fourth IIA - Penn State Astrostatistics School


July 22-29, 2013. VBO, Kavalur

Notes by Donald Richards, Penn State Univ.

notes revised and lecture delivered by
B. V. Rao, Chennai Math. Inst.

The maximum likelihood method

We seek a method which produces good estimators.

R. A. Fisher (1912), “On an absolute criterion for fitting frequency curves,” Messenger of Math. 41, 155–160. This was Fisher’s first mathematical paper, written while he was a final-year undergraduate in mathematics and mathematical physics at Gonville and Caius College, Cambridge University. It is not clear what motivated Fisher to study this subject; perhaps it was the influence of his tutor, the astronomer F. J. M. Stratton. Fisher’s paper started with a criticism of two methods of curve fitting: least squares and the method of moments.
X is a random variable and f (x; θ) is a statistical model for X .
Here θ is a parameter. X1 , . . . , Xn is a random sample from X . We
want to construct good estimators for θ using the random sample.
Here is an example from Protheroe et al., “Interpretation of cosmic ray composition - The path length distribution,” ApJ 247, 1981. X is the length of paths. X is modeled as an exponential variable with density,

f(x; θ) = θ⁻¹ exp(−x/θ), x > 0.

Under this model,


E(X) = ∫₀^∞ x f(x; θ) dx = θ.
Intuition suggests using X̄ to estimate θ. Indeed, X̄ is unbiased
and consistent.
Here is another example: the luminosity function (LF) for globular clusters in the Milky Way; van den Bergh’s normal model,

f(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²))

µ is the mean visual absolute magnitude and σ is the standard deviation of visual absolute magnitude for Galactic globulars. Again, X̄ is a good estimator for µ and S² is a good estimator for σ².
Choose a globular cluster at random; what is the chance that the
LF will be exactly -7.1 mag? Exactly -7.2 mag? For any
continuous random variable X , it is easy to see that,
P(X = c) = 0, whatever be the number c. Specifically suppose
that X ∼ N(µ = −6.9, σ² = 1.21), then P(X = −7.1) = 0, but

f(−7.1) = (1/(1.1√(2π))) exp(−(−7.1 + 6.9)²/(2(1.1)²)) ≈ 0.36
Fisher’s interpretation: in one simulation of the random variable X, the “likelihood” of observing the number −7.1 is 0.36. Similarly, f(−7.2) ≈ 0.35. In one simulation of X, the value x = −7.1 is about 2% more likely to be observed than the value x = −7.2.

In this model, x = −6.9 is the value which has the greatest, or


maximum likelihood, for it is where the probability density function
is at its maximum.

Do not confuse ‘likelihood of observing’ with ‘probability of


observing’. As mentioned earlier, the probability of observing any
given value is zero.
Return to a general model f(x; θ) and a random sample
X1 , . . . , Xn from this population. The joint probability density
function of the sample is

f (x1 ; θ)f (x2 ; θ) · · · f (xn ; θ)

Here the variables are the x’s, while θ is fixed. This is a probability
density function in the variables x1 , · · · , xn .

Fisher’s brilliant idea: choose that value of θ which makes the


observations most likely, that is, choose that value of θ which
makes the likelihood maximum. Reverse the roles of the x’s and θ.
Regard the x’s as fixed and θ as the variable.
When we do this and think of it as a function of θ it is called the
likelihood function. It tells us the likelihood of coming up with
the sample x1 , · · · , xn when the model is f (x, θ). This is denoted
simply as L(θ).

We define θ̂, the maximum likelihood estimator (MLE) of θ, as


that value of θ where L is maximized. Thus θ̂ is a function of the
observations x’s and hence of the sample X’s. Several questions arise now: does the maximum exist, is it unique, is it a good estimator, and so on. In the situations we consider, all these have affirmative answers, though in general the situation is not so nice.
Return to the example “ cosmic ray composition - The path length
distribution ...”. The length of paths, X , is exponential

f(x; θ) = θ⁻¹ exp(−x/θ), x > 0

Likelihood function is

L(θ) = f(x₁; θ) f(x₂; θ) ⋯ f(xₙ; θ)
     = θ⁻ⁿ exp(−(x₁ + ⋯ + xₙ)/θ) = θ⁻ⁿ exp(−n x̄/θ)

Maximizing L is equivalent to maximizing ln L:

ln L(θ) = −n ln(θ) − n x̄ θ⁻¹


(d/dθ) ln L(θ) = −n θ⁻¹ + n x̄ θ⁻²

(d²/dθ²) ln L(θ) = n θ⁻² − 2n x̄ θ⁻³
Solving the equation d ln L(θ)/dθ = 0, we get θ = x̄. Check that
d² ln L(θ)/dθ² < 0 at θ = x̄. Thus ln L(θ) is maximized at θ = x̄.
CONCLUSION: The MLE of θ is θ̂ = X̄ .
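
As a quick numerical check (not part of the original notes), here is a minimal Python sketch that maximizes ln L(θ) directly and compares the answer with x̄; the true θ, the sample size, and the use of numpy/scipy are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_true = 2.0                                # assumed true mean path length
x = rng.exponential(theta_true, size=500)       # simulated path lengths

def neg_log_lik(theta):
    # -ln L(theta) = n ln(theta) + n*xbar/theta
    return len(x) * np.log(theta) + x.sum() / theta

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, x.mean())                          # the two estimates agree up to numerical error
```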

Return to the LF for globular clusters; X ∼ N(µ, σ²). Assume σ is


known (1.1 mag, say). Likelihood function is

L(µ) = f(x₁; µ) f(x₂; µ) ⋯ f(xₙ; µ)
     = (2π)^(−n/2) σ⁻ⁿ exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²].
Maximizing ln L using calculus, we get µ̂ = X̄ .

Continuing with LF for globular clusters, suppose now that both µ


and σ² are unknown. We now have a likelihood function of two
variables,

L(µ, σ²) = f(x₁; µ, σ²) ⋯ f(xₙ; µ, σ²)
         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²]

ln L = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²

∂ ln L/∂µ = (1/σ²) Σᵢ₌₁ⁿ (xᵢ − µ)

∂ ln L/∂(σ²) = −n/(2σ²) + (1/(2(σ²)²)) Σᵢ₌₁ⁿ (xᵢ − µ)²

Solve the simultaneous equations for µ and σ²:

∂ ln L/∂µ = 0,   ∂ ln L/∂(σ²) = 0

We also verify, via the Hessian matrix of second partial derivatives, that ln L is maximized at the solution of these equations.
Conclusion: The MLEs are
µ̂ = X̄,   σ̂² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

µ̂ is unbiased: E(µ̂) = µ.
σ̂² is not unbiased: E(σ̂²) = ((n − 1)/n) σ² ≠ σ².
For this reason, we use (n/(n − 1)) σ̂² ≡ S².
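
A small simulation makes the bias factor (n − 1)/n visible; this is only a sketch, with illustrative values of µ, σ, and n that are not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = -6.9, 1.1, 10, 50_000      # illustrative values only

samples = rng.normal(mu, sigma, size=(reps, n))
sigma2_hat = samples.var(axis=1, ddof=0)        # MLE: divide by n
s2 = samples.var(axis=1, ddof=1)                # S^2: divide by n - 1

print(sigma2_hat.mean(), (n - 1) / n * sigma**2)  # E(sigma2_hat) is close to ((n-1)/n) sigma^2
print(s2.mean(), sigma**2)                        # E(S^2) is close to sigma^2
```
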
Calculus cannot always be used to find MLEs.

Return to the “cosmic ray composition” example.



f(x; θ) = exp(−(x − θ)), if x ≥ θ;   0, if x < θ

The likelihood function is

L(θ) = f(x₁; θ) ⋯ f(xₙ; θ)
     = exp[−Σᵢ₌₁ⁿ (xᵢ − θ)], if all xᵢ ≥ θ;   0, if any xᵢ < θ
Let x(1) be the smallest observation in the sample. “All xi ≥ θ” is
equivalent to “x(1) ≥ θ”. Thus

L(θ) = exp(−n(x̄ − θ)), if θ ≤ x(1);   0, if x(1) < θ

Since exp(−n(x̄ − θ)) is increasing in θ, L(θ) is largest at the largest permissible value of θ.
Conclusion: θ̂ = X(1).
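
Because calculus does not apply here, a brute-force look at L(θ) is a useful sanity check. The sketch below (with an assumed θ and n of my own choosing) evaluates ln L on a grid and confirms that the maximizer is essentially the sample minimum.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, n = 5.0, 200                          # assumed values
x = theta_true + rng.exponential(1.0, size=n)     # shifted-exponential sample

def log_lik(theta):
    # ln L(theta) = -sum(x - theta) if all x_i >= theta, else -infinity
    return -np.sum(x - theta) if np.all(x >= theta) else -np.inf

grid = np.linspace(x.min() - 1.0, x.min() + 1.0, 20001)
theta_grid = grid[np.argmax([log_lik(t) for t in grid])]
print(theta_grid, x.min())                        # both are (essentially) the sample minimum
```
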
General Properties of the MLE θ̂:
(a) θ̂ may not be unbiased. We often can remove this bias by
multiplying θ̂ by a constant.

(b) For many models, θ̂ is consistent.

(c) The Invariance Property: For many nice functions g , if θ̂ is the


MLE of θ, then g (θ̂) is the MLE of g (θ).

(d) The Asymptotic Property: For large n, θ̂ has an approximate


normal distribution with mean θ and variance 1/B where
B = n E[(∂ ln f(X; θ)/∂θ)²]

This asymptotic property can be used to develop confidence


intervals for θ.
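
For the exponential path-length model, B = n/θ², so an approximate 95% confidence interval is θ̂ ± 1.96 θ̂/√n (plugging θ̂ into B, one common choice). A minimal sketch, with simulated data and an assumed true θ:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(2.0, size=400)      # simulated data; true theta = 2.0 is an assumption

theta_hat = x.mean()                    # MLE of theta
se = theta_hat / np.sqrt(len(x))        # sqrt(1/B) with theta replaced by theta_hat
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(theta_hat, ci)                    # approximate 95% confidence interval for theta
```
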
The method of maximum likelihood works well when intuition fails
and no obvious estimator can be found. When an obvious
estimator exists the method of ML often will find it.

The method can be applied to many statistical problems:


regression analysis, analysis of variance, discriminant analysis,
hypothesis testing, principal components, etc.
The ML Method for Testing Hypotheses
Suppose that X ∼ N(µ, σ²), so that

f(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).
Using a random sample X1 , . . . , Xn , we wish to test H0 : µ = 3 vs.
Ha : µ ≠ 3.

Here the parameter space is the space of all permissible values of


the parameters

Ω = {(µ, σ) : −∞ < µ < ∞, σ > 0}.

The hypotheses H0 and Ha represent restrictions on the


parameters, so we are led to parameter subspaces

ω0 = {(µ, σ) : µ = 3, σ > 0};   ωa = {(µ, σ) : µ ≠ 3, σ > 0}.


L(µ, σ²) = f(x₁; µ, σ²) ⋯ f(xₙ; µ, σ²)
         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²]

Maximize L(µ, σ 2 ) over ω0 and ωa .

The likelihood ratio test statistic (LRT) is

λ = max_{ω0} L(µ, σ²) ÷ max_{ωa} L(µ, σ²) = max_{σ>0} L(3, σ²) ÷ max_{µ≠3, σ>0} L(µ, σ²).
Fact: 0 ≤ λ ≤ 1.
L(3, σ²) is maximized over ω0 at

σ² = (1/n) Σᵢ₌₁ⁿ (xᵢ − 3)²

max_{ω0} L(3, σ²) = L(3, (1/n) Σᵢ₌₁ⁿ (xᵢ − 3)²) = [n ÷ (2πe Σᵢ₌₁ⁿ (xᵢ − 3)²)]^(n/2).

L(µ, σ²) is maximized over ωa at

µ = x̄,   σ² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²

max_{ωa} L(µ, σ²) = L(x̄, (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²) = [n ÷ (2πe Σᵢ₌₁ⁿ (xᵢ − x̄)²)]^(n/2).

The likelihood ratio test statistic is


" #n/2 " #n/2
n n
λ= n ÷ n
2 2πe (Xi − X̄ )2
P P
2πe (Xi − 3)
i=1 i=1

n
" n
#n/2
X X
= (Xi − X̄ )2 ÷ (Xi − 3)2
i=1 i=1
λ is close to 1 iff X̄ is close to 3.

λ is close to 0 iff X̄ is far from 3.

This particular LRT statistic λ is equivalent to the t-statistic seen


earlier.

In this case, the ML method discovers the obvious test statistic.
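
The equivalence can be checked numerically. Since Σᵢ₌₁ⁿ (xᵢ − 3)² = Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − 3)², one gets λ = [1 + t²/(n − 1)]^(−n/2) with t = √n(X̄ − 3)/S, so λ is a decreasing function of |t|. A short Python sketch (simulated data; the sample values and µ₀ = 3 are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(3.4, 1.1, size=25)            # simulated sample; values are illustrative
n, mu0 = len(x), 3.0

ss_alt = np.sum((x - x.mean())**2)           # sum of squares about the sample mean
ss_null = np.sum((x - mu0)**2)               # sum of squares about mu0 = 3
lam = (ss_alt / ss_null) ** (n / 2)          # likelihood ratio statistic

t = np.sqrt(n) * (x.mean() - mu0) / x.std(ddof=1)
lam_from_t = (1 + t**2 / (n - 1)) ** (-n / 2)
print(lam, lam_from_t)                       # equal up to rounding: lambda is a monotone function of |t|
```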


Cramér-Rao inequality

Given two unbiased estimators, we prefer the one with smaller


variance. In our quest for unbiased estimators with minimum
possible variance, we need to know how small their variances can
be.

Suppose that X is a random variable modeled by the density


f (x; θ) where θ is a parameter.

The “support” of f is the region where f > 0. We assume that the


“support” of f does not depend on θ.

We have a random sample, X1 , . . . , Xn . Suppose that Y is an


unbiased estimator of θ.
The Cramér-Rao inequality: If Y is an unbiased estimator of θ based on a sample of size n, then the smallest possible value that Var(Y) can attain is 1/B, where
B = n E[(∂ ln f(X; θ)/∂θ)²] = −n E[∂² ln f(X; θ)/∂θ²].

This is the same B which appeared in our study of MLEs. Here the
expected value is taken under the model f (x, θ). Certain regularity
conditions need to hold for this to be true, but we shall not go into
the mathematical details.
To illustrate, let us consider the example: “... cosmic ray
composition - The path length distribution ...”. Here X is the
length of paths modeled by

f(x; θ) = θ⁻¹ exp(−x/θ), x > 0.

ln f(X; θ) = −ln θ − θ⁻¹ X;   (∂/∂θ) ln f(X; θ) = −θ⁻¹ + θ⁻² X.

(∂²/∂θ²) ln f(X; θ) = θ⁻² − 2θ⁻³ X;   E[(∂²/∂θ²) ln f(X; θ)] = −θ⁻².

The smallest possible value of Var(Y) is θ²/n. This is attained by


X̄ .

For this problem, X̄ is the best unbiased estimator of θ.
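
A quick Monte Carlo check of this bound, with an assumed θ and n that are not from the notes:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 50, 100_000            # assumed values for the check

xbars = rng.exponential(theta, size=(reps, n)).mean(axis=1)
print(xbars.var(), theta**2 / n)             # empirical Var(Xbar) vs the Cramer-Rao bound theta^2/n
```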


Suppose that Y is an unbiased estimator of the parameter θ. We
compare Var(Y ) with 1/B, the lower bound in the Cramér-Rao
inequality:
(1/B) ÷ Var(Y)

This number is called the efficiency of Y . Obviously,


0 ≤ efficiency ≤ 1.

If Y has 50% efficiency then about 1/0.5 = 2 times as many sample observations are needed for Y to perform as well as the minimum variance unbiased estimator (MVUE). The use of Y results in confidence intervals which generally are longer than those arising from the MVUE.
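
As an illustration of the definition (this example is mine, not from the notes): for the normal mean with known σ, X̄ is the MVUE with variance σ²/n, and the sample median is another unbiased estimator whose efficiency is about 2/π ≈ 0.64 for large n. A sketch estimating that efficiency by simulation:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 0.0, 1.0, 101, 50_000   # assumed values (odd n gives a unique median)

samples = rng.normal(mu, sigma, size=(reps, n))
medians = np.median(samples, axis=1)         # an alternative unbiased estimator of mu
eff = (sigma**2 / n) / medians.var()         # (Cramer-Rao bound) / Var(Y)
print(eff)                                   # close to 2/pi ~ 0.64
```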

If the MLE is unbiased then as n becomes large, its efficiency


increases to 1.
The Cramér-Rao inequality can be stated as follows. If Y is any
unbiased estimator of θ based on a sample of size n, then

Var(Y) ≥ 1/(nI),   where I = E[(∂ ln f(X; θ)/∂θ)²].

The quantity I is called the Fisher information number.
Under the same regularity conditions needed for the Cramér-Rao
inequality to hold, one has a nice desirable property of the MLE.

Let θ̂n be the MLE for θ based on a sample of size n. Then, for
large n,

√n (θ̂n − θ) ≈ N(0, 1/I).

In other words, θ̂ is consistent as well as asymptotically efficient.


Not only that, the above approximate distribution can be used to
get confidence intervals also.
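
To see that such approximate intervals behave as advertised, one can check their empirical coverage by simulation. Below is a minimal sketch for the exponential model, where I = 1/θ² and the estimated standard error of θ̂n is θ̂n/√n; the values of θ and n are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 2.0, 200, 20_000            # assumed values

theta_hats = rng.exponential(theta, size=(reps, n)).mean(axis=1)
se = theta_hats / np.sqrt(n)                 # estimated sqrt(1/(n I)) = theta_hat / sqrt(n)
covered = np.abs(theta_hats - theta) <= 1.96 * se
print(covered.mean())                        # empirical coverage of the 95% intervals, near 0.95
```
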
Dembo, Cover, and Thomas (1991) “Information-theoretic
inequalities,” IEEE Trans. Information Theory 37, 1501–1518,
provide a unified treatment of the Cramér-Rao inequality, the
Heisenberg uncertainty principle, entropy inequalities, Fisher
information, and many other inequalities in statistics, mathematics,
information theory, and physics.

In particular, they show that the Heisenberg uncertainty principle is


a consequence of the Cramér-Rao inequality and in a sense, they
are equivalent. This remarkable paper demonstrates that there is a
basic oneness among these various fields.
Several interesting questions arise.

For example, if we have found an estimator whose variance does not attain the Cramér-Rao bound, what should we do? Should we look for an estimator whose variance attains the bound?

Is there one such? If not, what is to be done? We are not going


into the details now.
