
MAXIMUM LIKELIHOOD ESTIMATION

and
CRAMÉR-RAO INEQUALITY

Fourth IIA - Penn State Astrostatistics School


July 22-29, 2013. VBO, Kavalur

Notes by Donald Richards, Penn State Univ.

notes revised and lecture delivered by
B. V. Rao, Chennai Math. Inst.

The maximum likelihood method

We seek a method which produces good estimators.

R. A. Fisher (1912), “On an absolute criterion for fitting frequency curves,” Messenger of Math. 41, 155–160. This was Fisher’s first mathematical paper, written while he was a final-year undergraduate in mathematics and mathematical physics at Gonville and Caius College, Cambridge University. It is not clear what motivated Fisher to study this subject; perhaps it was the influence of his tutor, the astronomer F. J. M. Stratton. Fisher’s paper started with a criticism of two methods of curve fitting: least squares and the method of moments.
X is a random variable and f (x; θ) is a statistical model for X .
Here θ is a parameter. X1 , . . . , Xn is a random sample from X . We
want to construct good estimators for θ using the random sample.
Here is an example from Protheroe et al., “Interpretation of cosmic ray composition - The path length distribution,” ApJ 247, 1981. X is the length of paths. X is modeled as an exponential variable with density,

f(x; θ) = θ⁻¹ exp(−x/θ), x > 0.

Under this model,


E(X) = ∫₀^∞ x f(x; θ) dx = θ.
Intuition suggests using X̄ to estimate θ. Indeed, X̄ is unbiased
and consistent.
Here is another example: the luminosity function (LF) for globular clusters in the Milky Way; van den Bergh’s normal model,

f(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²))

µ is the mean visual absolute magnitude and σ is the standard deviation of visual absolute magnitude for Galactic globulars. Again, X̄ is a good estimator for µ and S² is a good estimator for σ².
Choose a globular cluster at random; what is the chance that the
LF will be exactly -7.1 mag? Exactly -7.2 mag? For any
continuous random variable X , it is easy to see that,
P(X = c) = 0, whatever be the number c. Specifically suppose
that X ∼ N(µ = −6.9, σ² = 1.21), then P(X = −7.1) = 0, but

f(−7.1) = (1/(1.1√(2π))) exp(−(−7.1 + 6.9)²/(2(1.1)²)) ≈ 0.36
Fisher’s interpretation: in one simulation of the random variable X, the “likelihood” of observing the number −7.1 is 0.36. Similarly, f(−7.2) ≈ 0.35. In one simulation of X, the value x = −7.1 is about 2% more likely to be observed than the value x = −7.2.

In this model, x = −6.9 is the value which has the greatest, or


maximum likelihood, for it is where the probability density function
is at its maximum.

Do not confuse ‘likelihood of observing’ with ‘probability of


observing’. As mentioned earlier, the probability of observing any
given value is zero.
Return to a general model f(x; θ) and a random sample
X1 , . . . , Xn from this population. The joint probability density
function of the sample is

f (x1 ; θ)f (x2 ; θ) · · · f (xn ; θ)

Here the variables are the x’s, while θ is fixed. This is a probability
density function in the variables x1 , · · · , xn .

Fisher’s brilliant idea: choose that value of θ which makes the


observations most likely, that is, choose that value of θ which
makes the likelihood maximum. Reverse the roles of the x’s and θ.
Regard the x’s as fixed and θ as the variable.
When we do this and think of it as a function of θ it is called the
likelihood function. It tells us the likelihood of coming up with
the sample x1 , · · · , xn when the model is f (x, θ). This is denoted
simply as L(θ).

We define θ̂, the maximum likelihood estimator (MLE) of θ, as


that value of θ where L is maximized. Thus θ̂ is a function of the
observations x’s and hence of the sample X’s. Several questions arise now: does the maximum exist, is it unique, is it a good estimator, and so on. In the situations we consider, all these have affirmative answers, though in general the situation is not so nice.
Return to the example “ cosmic ray composition - The path length
distribution ...”. The length of paths, X , is exponential

f(x; θ) = θ⁻¹ exp(−x/θ), x > 0

Likelihood function is

L(θ) = f(x₁; θ) f(x₂; θ) ⋯ f(xₙ; θ)
     = θ⁻ⁿ exp(−(x₁ + ⋯ + xₙ)/θ) = θ⁻ⁿ exp(−n x̄/θ)

Maximizing L is equivalent to maximizing ln L:

ln L(θ) = −n ln(θ) − n x̄ θ⁻¹


(d/dθ) ln L(θ) = −n θ⁻¹ + n x̄ θ⁻²

(d²/dθ²) ln L(θ) = n θ⁻² − 2n x̄ θ⁻³
Solving the equation d ln L(θ)/dθ = 0, we get θ = x̄. Check that
d² ln L(θ)/dθ² < 0 at θ = x̄. Thus ln L(θ) is maximized at θ = x̄.
CONCLUSION: The MLE of θ is θ̂ = X̄ .
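
As a quick numerical check (not part of the original notes), here is a minimal Python sketch that maximizes ln L(θ) directly and compares the answer with x̄; the true θ, the sample size, and the use of numpy/scipy are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_true = 2.0                                # assumed true mean path length
x = rng.exponential(theta_true, size=500)       # simulated path lengths

def neg_log_lik(theta):
    # -ln L(theta) = n ln(theta) + n*xbar/theta
    return len(x) * np.log(theta) + x.sum() / theta

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, x.mean())                          # the two estimates agree up to numerical error
```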

Return to the LF for globular clusters; X ∼ N(µ, σ²). Assume σ is


known (1.1 mag, say). Likelihood function is

L(µ) = f(x₁; µ) f(x₂; µ) ⋯ f(xₙ; µ)
     = (2π)^(−n/2) σ⁻ⁿ exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²].
Maximizing ln L using calculus, we get µ̂ = X̄ .

Continuing with LF for globular clusters, suppose now that both µ


and σ² are unknown. We now have a likelihood function of two
variables,

L(µ, σ²) = f(x₁; µ, σ²) ⋯ f(xₙ; µ, σ²)
         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²]

ln L = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²

∂ ln L/∂µ = (1/σ²) Σᵢ₌₁ⁿ (xᵢ − µ)

∂ ln L/∂(σ²) = −n/(2σ²) + (1/(2(σ²)²)) Σᵢ₌₁ⁿ (xᵢ − µ)²

Solve the simultaneous equations for µ and σ²:

∂ ln L/∂µ = 0,   ∂ ln L/∂(σ²) = 0

We also verify, via the Hessian matrix of second partial derivatives, that ln L is maximized at the solution of these equations.
Conclusion: The MLEs are
µ̂ = X̄,   σ̂² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

µ̂ is unbiased: E(µ̂) = µ.
σ̂² is not unbiased: E(σ̂²) = ((n − 1)/n) σ² ≠ σ².
For this reason, we use (n/(n − 1)) σ̂² ≡ S².
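
A small simulation makes the bias factor (n − 1)/n visible; this is only a sketch, with illustrative values of µ, σ, and n that are not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = -6.9, 1.1, 10, 50_000      # illustrative values only

samples = rng.normal(mu, sigma, size=(reps, n))
sigma2_hat = samples.var(axis=1, ddof=0)        # MLE: divide by n
s2 = samples.var(axis=1, ddof=1)                # S^2: divide by n - 1

print(sigma2_hat.mean(), (n - 1) / n * sigma**2)  # E(sigma2_hat) is close to ((n-1)/n) sigma^2
print(s2.mean(), sigma**2)                        # E(S^2) is close to sigma^2
```
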
Calculus cannot always be used to find MLEs.

Return to the “cosmic ray composition” example.



f(x; θ) = exp(−(x − θ)), if x ≥ θ;   0, if x < θ

The likelihood function is

L(θ) = f(x₁; θ) ⋯ f(xₙ; θ)
     = exp[−Σᵢ₌₁ⁿ (xᵢ − θ)], if all xᵢ ≥ θ;   0, if any xᵢ < θ
Let x(1) be the smallest observation in the sample. “All xi ≥ θ” is
equivalent to “x(1) ≥ θ”. Thus

L(θ) = exp(−n(x̄ − θ)), if θ ≤ x(1);   0, if x(1) < θ

Since exp(−n(x̄ − θ)) is increasing in θ, L(θ) is largest at the largest permissible value of θ.
Conclusion: θ̂ = X(1).
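
Because calculus does not apply here, a brute-force look at L(θ) is a useful sanity check. The sketch below (with an assumed θ and n of my own choosing) evaluates ln L on a grid and confirms that the maximizer is essentially the sample minimum.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, n = 5.0, 200                          # assumed values
x = theta_true + rng.exponential(1.0, size=n)     # shifted-exponential sample

def log_lik(theta):
    # ln L(theta) = -sum(x - theta) if all x_i >= theta, else -infinity
    return -np.sum(x - theta) if np.all(x >= theta) else -np.inf

grid = np.linspace(x.min() - 1.0, x.min() + 1.0, 20001)
theta_grid = grid[np.argmax([log_lik(t) for t in grid])]
print(theta_grid, x.min())                        # both are (essentially) the sample minimum
```
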
General Properties of the MLE θ̂:
(a) θ̂ may not be unbiased. We often can remove this bias by
multiplying θ̂ by a constant.

(b) For many models, θ̂ is consistent.

(c) The Invariance Property: For many nice functions g , if θ̂ is the


MLE of θ, then g (θ̂) is the MLE of g (θ).

(d) The Asymptotic Property: For large n, θ̂ has an approximate


normal distribution with mean θ and variance 1/B where
B = n E[(∂ ln f(X; θ)/∂θ)²]

This asymptotic property can be used to develop confidence


intervals for θ.
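
For the exponential path-length model, B = n/θ², so an approximate 95% confidence interval is θ̂ ± 1.96 θ̂/√n (plugging θ̂ into B, one common choice). A minimal sketch, with simulated data and an assumed true θ:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(2.0, size=400)      # simulated data; true theta = 2.0 is an assumption

theta_hat = x.mean()                    # MLE of theta
se = theta_hat / np.sqrt(len(x))        # sqrt(1/B) with theta replaced by theta_hat
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(theta_hat, ci)                    # approximate 95% confidence interval for theta
```
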
The method of maximum likelihood works well when intuition fails
and no obvious estimator can be found. When an obvious
estimator exists the method of ML often will find it.

The method can be applied to many statistical problems:


regression analysis, analysis of variance, discriminant analysis,
hypothesis testing, principal components, etc.
The ML Method for Testing Hypotheses
Suppose that X ∼ N(µ, σ²), so that

f(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).
Using a random sample X1 , . . . , Xn , we wish to test H0 : µ = 3 vs.
Ha : µ ≠ 3.

Here the parameter space is the space of all permissible values of


the parameters

Ω = {(µ, σ) : −∞ < µ < ∞, σ > 0}.

The hypotheses H0 and Ha represent restrictions on the


parameters, so we are led to parameter subspaces

ω0 = {(µ, σ) : µ = 3, σ > 0};   ωa = {(µ, σ) : µ ≠ 3, σ > 0}.


L(µ, σ²) = f(x₁; µ, σ²) ⋯ f(xₙ; µ, σ²)
         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)²]

Maximize L(µ, σ 2 ) over ω0 and ωa .

The likelihood ratio test statistic (LRT) is

λ = max_{ω0} L(µ, σ²) ÷ max_{ωa} L(µ, σ²) = max_{σ>0} L(3, σ²) ÷ max_{µ≠3, σ>0} L(µ, σ²).
Fact: 0 ≤ λ ≤ 1.
L(3, σ²) is maximized over ω0 at

σ² = (1/n) Σᵢ₌₁ⁿ (xᵢ − 3)²

max_{ω0} L(3, σ²) = L(3, (1/n) Σᵢ₌₁ⁿ (xᵢ − 3)²) = [n ÷ (2πe Σᵢ₌₁ⁿ (xᵢ − 3)²)]^(n/2).

L(µ, σ²) is maximized over ωa at

µ = x̄,   σ² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²

max_{ωa} L(µ, σ²) = L(x̄, (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²) = [n ÷ (2πe Σᵢ₌₁ⁿ (xᵢ − x̄)²)]^(n/2).

The likelihood ratio test statistic is


" #n/2 " #n/2
n n
λ= n ÷ n
2 2πe (Xi − X̄ )2
P P
2πe (Xi − 3)
i=1 i=1

n
" n
#n/2
X X
= (Xi − X̄ )2 ÷ (Xi − 3)2
i=1 i=1
λ is close to 1 iff X̄ is close to 3.

λ is close to 0 iff X̄ is far from 3.

This particular LRT statistic λ is equivalent to the t-statistic seen


earlier.

In this case, the ML method discovers the obvious test statistic.
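
The equivalence can be checked numerically. Since Σᵢ₌₁ⁿ (xᵢ − 3)² = Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − 3)², one gets λ = [1 + t²/(n − 1)]^(−n/2) with t = √n(X̄ − 3)/S, so λ is a decreasing function of |t|. A short Python sketch (simulated data; the sample values and µ₀ = 3 are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(3.4, 1.1, size=25)            # simulated sample; values are illustrative
n, mu0 = len(x), 3.0

ss_alt = np.sum((x - x.mean())**2)           # sum of squares about the sample mean
ss_null = np.sum((x - mu0)**2)               # sum of squares about mu0 = 3
lam = (ss_alt / ss_null) ** (n / 2)          # likelihood ratio statistic

t = np.sqrt(n) * (x.mean() - mu0) / x.std(ddof=1)
lam_from_t = (1 + t**2 / (n - 1)) ** (-n / 2)
print(lam, lam_from_t)                       # equal up to rounding: lambda is a monotone function of |t|
```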


Cramér-Rao inequality

Given two unbiased estimators, we prefer the one with smaller


variance. In our quest for unbiased estimators with minimum
possible variance, we need to know how small their variances can
be.

Suppose that X is a random variable modeled by the density


f (x; θ) where θ is a parameter.

The “support” of f is the region where f > 0. We assume that the


“support” of f does not depend on θ.

We have a random sample, X1 , . . . , Xn . Suppose that Y is an


unbiased estimator of θ.
The Cramér-Rao inequality: If Y is an unbiased estimator of θ based on a sample of size n, then the smallest possible value that Var(Y) can attain is 1/B, where
B = n E[(∂ ln f(X; θ)/∂θ)²] = −n E[∂² ln f(X; θ)/∂θ²].

This is the same B which appeared in our study of MLEs. Here the
expected value is taken under the model f (x, θ). Certain regularity
conditions need to hold for this to be true, but we shall not go into
the mathematical details.
To illustrate, let us consider the example: “... cosmic ray
composition - The path length distribution ...”. Here X is the
length of paths modeled by

f(x; θ) = θ⁻¹ exp(−x/θ), x > 0.

ln f(X; θ) = −ln θ − θ⁻¹ X;   (∂/∂θ) ln f(X; θ) = −θ⁻¹ + θ⁻² X.

(∂²/∂θ²) ln f(X; θ) = θ⁻² − 2θ⁻³ X;   E[(∂²/∂θ²) ln f(X; θ)] = −θ⁻².

The smallest possible value of Var(Y) is θ²/n. This is attained by


X̄ .

For this problem, X̄ is the best unbiased estimator of θ.
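
A quick Monte Carlo check of this bound, with an assumed θ and n that are not from the notes:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 50, 100_000            # assumed values for the check

xbars = rng.exponential(theta, size=(reps, n)).mean(axis=1)
print(xbars.var(), theta**2 / n)             # empirical Var(Xbar) vs the Cramer-Rao bound theta^2/n
```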


Suppose that Y is an unbiased estimator of the parameter θ. We
compare Var(Y ) with 1/B, the lower bound in the Cramér-Rao
inequality:
(1/B) ÷ Var(Y)

This number is called the efficiency of Y . Obviously,


0 ≤ efficiency ≤ 1.

If Y has 50% efficiency then about 1/0.5 = 2 times as many sample observations are needed for Y to perform as well as the minimum variance unbiased estimator (MVUE). The use of Y results in confidence intervals which generally are longer than those arising from the MVUE.
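
As an illustration of the definition (this example is mine, not from the notes): for the normal mean with known σ, X̄ is the MVUE with variance σ²/n, and the sample median is another unbiased estimator whose efficiency is about 2/π ≈ 0.64 for large n. A sketch estimating that efficiency by simulation:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 0.0, 1.0, 101, 50_000   # assumed values (odd n gives a unique median)

samples = rng.normal(mu, sigma, size=(reps, n))
medians = np.median(samples, axis=1)         # an alternative unbiased estimator of mu
eff = (sigma**2 / n) / medians.var()         # (Cramer-Rao bound) / Var(Y)
print(eff)                                   # close to 2/pi ~ 0.64
```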

If the MLE is unbiased then as n becomes large, its efficiency


increases to 1.
The Cramér-Rao inequality can be stated as follows. If Y is any
unbiased estimator of θ based on a sample of size n, then

Var(Y) ≥ 1/(nI),   where I = E[(∂ ln f(X; θ)/∂θ)²].

The quantity I is called the Fisher information number.
Under the same regularity conditions needed for the Cramér-Rao
inequality to hold, one has a nice desirable property of the MLE.

Let θ̂n be the MLE for θ based on a sample of size n. Then, for
large n,

√n (θ̂n − θ) ≈ N(0, 1/I).

In other words, θ̂ is consistent as well as asymptotically efficient.


Not only that, the above approximate distribution can be used to
get confidence intervals also.
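
To see that such approximate intervals behave as advertised, one can check their empirical coverage by simulation. Below is a minimal sketch for the exponential model, where I = 1/θ² and the estimated standard error of θ̂n is θ̂n/√n; the values of θ and n are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 2.0, 200, 20_000            # assumed values

theta_hats = rng.exponential(theta, size=(reps, n)).mean(axis=1)
se = theta_hats / np.sqrt(n)                 # estimated sqrt(1/(n I)) = theta_hat / sqrt(n)
covered = np.abs(theta_hats - theta) <= 1.96 * se
print(covered.mean())                        # empirical coverage of the 95% intervals, near 0.95
```
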
Dembo, Cover, and Thomas (1991) “Information-theoretic
inequalities,” IEEE Trans. Information Theory 37, 1501–1518,
provide a unified treatment of the Cramér-Rao inequality, the
Heisenberg uncertainty principle, entropy inequalities, Fisher
information, and many other inequalities in statistics, mathematics,
information theory, and physics.

In particular, they show that the Heisenberg uncertainty principle is


a consequence of the Cramér-Rao inequality and in a sense, they
are equivalent. This remarkable paper demonstrates that there is a
basic oneness among these various fields.
Several interesting questions arise.

For example, if we have found an estimator whose variance does not attain the Cramér-Rao bound, what should we do? Should we look for an estimator whose variance attains the bound?

Is there one such? If not, what is to be done? We are not going


into the details now.
