
Maximum Likelihood Estimation

Econometrics
1st Semester
Academic Year: 2018/2019

Introduction

In the GMM approach the model imposes assumptions about a number of expectations (moments) that involve observable data and unknown coefficients, which are exploited in estimation.
We now consider an estimation approach that makes stronger assumptions, because it assumes knowledge of the entire distribution, not just of a number of its moments.
If the distribution of a variable yi conditional upon a number of variables xi is known, up to a small number of unknown coefficients, we can use this to estimate these unknown parameters by choosing them in such a way that the resulting distribution corresponds as well as possible to the observed data.
This is, loosely formulated, the method of maximum likelihood.

In certain applications and models, distributional assumptions like normality are commonly imposed because estimation strategies that do not require such assumptions are complex or unavailable.
If the distributional assumptions are correct, the maximum likelihood estimator is, under weak regularity conditions, consistent and asymptotically normal.
Moreover, it fully exploits the assumptions about the distribution, so that the estimator is asymptotically efficient.
In other words, alternative consistent estimators will have an asymptotic covariance matrix that is at least as large as that of the maximum likelihood estimator.

The starting point of MLE is the assumption that the distribution of an observed phenomenon (the endogenous variable) is known, except for a finite number of unknown parameters.
These parameters will be estimated by taking those values for them that give the observed values the highest probability, i.e. the highest likelihood.
The maximum likelihood method thus provides a means of estimating a set of parameters characterizing a distribution, if we know, or assume we know, the form of the distribution.
For example, we could characterize the distribution of a variable yi (for given xi) as normal with mean β1 + β2 xi and variance σ². This would represent the simple linear regression model with normal error terms.

Example
Suppose the success or failure of a field goal in football can be modeled with a Bernoulli(π) distribution. Let X = 0 if the field goal is a failure and X = 1 if the field goal is a success. Then the probability distribution for X is:

\[
f(x) = \pi^{x} (1-\pi)^{1-x},
\]

where π denotes the probability of success.


Suppose we would like to estimate π for a 40 yard field goal.
Let x1, ..., xn denote a random sample of field goal results at 40 yards.
Given the resulting data (x1, ..., xn), the “likelihood function” measures the likelihood of different values of π:

\[
L(\pi \mid x_1, \dots, x_n) = f(x_1) f(x_2) \cdots f(x_{n-1}) f(x_n)
= \prod_{i=1}^{n} f(x_i) = \prod_{i=1}^{n} \pi^{x_i} (1-\pi)^{1-x_i}
= \pi^{\sum x_i} (1-\pi)^{n - \sum x_i}
\]
Example

Suppose ∑ xi = 4 and n = 10. Then the following table can be formed:

π      L(π | x1, ..., xn)
0.20   0.000419
0.30   0.000953
0.35   0.001132
0.39   0.001192
0.40   0.001194
0.41   0.001192
0.50   0.000977

Plot of the likelihood function

[Figure: the likelihood function L(π | x1, ..., xn) plotted against π on [0, 1]; the curve peaks near π = 0.4.]
Example

Note that π = 0.4 is the “most likely” or “most plausible” value of π, since this maximizes the likelihood function.
Therefore, π̂ = 0.4 is the maximum likelihood estimate.
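
As a quick numerical check, the following sketch (assuming NumPy is available; the data summary ∑ xi = 4, n = 10 is taken from the table above) evaluates the likelihood π^Σxi (1 − π)^(n−Σxi) on a grid and locates its maximum.

```python
# A minimal sketch: evaluate the Bernoulli likelihood on a grid of values
# for pi and pick the value with the highest likelihood.
import numpy as np

sum_x, n = 4, 10                                  # 4 successes in 10 attempts
grid = np.linspace(0.01, 0.99, 981)               # candidate values of pi
likelihood = grid ** sum_x * (1 - grid) ** (n - sum_x)

pi_hat = grid[np.argmax(likelihood)]
print(pi_hat, likelihood.max())                   # about 0.4 and 0.001194
```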

MLE Procedure

In general, the maximum likelihood estimate can be found as follows (a symbolic sketch of these steps is given after the list):
1. Find the natural log of the likelihood function, i.e., ℓ(π | x1, ..., xn);
2. Take the derivative of ℓ(π | x1, ..., xn) with respect to π;
3. Set the derivative equal to 0 and solve for π to find the maximum likelihood estimate.
4. Note that the solution is the maximum of L(π | x1, ..., xn), provided certain “regularity” conditions hold.
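
The following sketch carries out these steps symbolically for the Bernoulli loglikelihood, assuming SymPy is available; the symbols s (the number of successes) and n (the sample size) are illustrative names.

```python
# A minimal symbolic sketch of steps 1-3 for the Bernoulli case.
import sympy as sp

pi, s, n = sp.symbols("pi s n", positive=True)

# Step 1: log of L(pi) = pi^s * (1 - pi)^(n - s)
loglik = s * sp.log(pi) + (n - s) * sp.log(1 - pi)

# Step 2: derivative of the loglikelihood with respect to pi
score = sp.diff(loglik, pi)

# Step 3: set the derivative to zero and solve for pi
pi_hat = sp.solve(sp.Eq(score, 0), pi)
print(pi_hat)   # [s/n], the sample proportion of successes
```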

Example

For the field goal example:

\[
\ell(\pi \mid x_1, \dots, x_n) = \log \left( \prod_{i=1}^{n} \pi^{x_i} (1-\pi)^{1-x_i} \right)
= \left( \sum_{i=1}^{n} x_i \right) \log(\pi) + \left( n - \sum_{i=1}^{n} x_i \right) \log(1-\pi)
\]

Example
Then the derivative of ℓ(π | x1, ..., xn) with respect to π is set equal to zero:

\[
\frac{d\,\ell(\pi \mid x_1, \dots, x_n)}{d\pi}
= \frac{\sum_{i=1}^{n} x_i}{\pi} - \frac{n - \sum_{i=1}^{n} x_i}{1-\pi} = 0
\;\Longleftrightarrow\;
\frac{\sum_{i=1}^{n} x_i}{\pi} = \frac{n - \sum_{i=1}^{n} x_i}{1-\pi}
\;\Longleftrightarrow\;
\hat{\pi} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

Therefore, the maximum likelihood estimate of π is the proportion of field goals made.
Example

Let x1, ..., xn be a random sample from a N(µ, σ²) distribution. Find the maximum likelihood estimates of µ and σ².
Remember that the normal density is:

\[
f(x_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}.
\]

Then the likelihood function is:

\[
L(\mu, \sigma^2 \mid x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}
= \frac{1}{(2\pi)^{n/2} \sigma^n} \, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i-\mu)^2}
\]

Example

Taking the log of the likelihood function produces:

\[
\ell(\mu, \sigma^2 \mid x_1, \dots, x_n) = -\frac{n}{2} \log(2\pi) - n \log(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i-\mu)^2
\]

Example

To find the maximum likelihood estimate of µ, take the derivative with respect to µ and set it equal to 0:

\[
\frac{\partial \ell(\mu, \sigma^2 \mid x_1, \dots, x_n)}{\partial \mu}
= \frac{1}{2\sigma^2} \sum_{i=1}^{n} 2 (x_i - \mu) = 0
\]

Solving for µ produces:

\[
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]

Example
To find the maximum likelihood estimate of σ, take the derivative with respect to σ and set it equal to 0:

\[
\frac{\partial \ell(\mu, \sigma^2 \mid x_1, \dots, x_n)}{\partial \sigma}
= -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2 = 0
\]

Solving for σ² produces:

\[
\frac{n}{\sigma} = \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2
\;\Longleftrightarrow\;
\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n},
\]

and

\[
\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}
\]
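
A short numerical check of these formulas, assuming NumPy is available; the simulated sample below is purely illustrative.

```python
# A minimal sketch: the ML estimates of mu and sigma^2 are the sample mean
# and the average squared deviation (dividing by n, not n - 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)      # hypothetical sample

mu_hat = x.mean()                                 # ML estimate of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()           # ML estimate of sigma^2

# np.var with ddof=0 (the default) uses the same n divisor as the MLE
print(mu_hat, sigma2_hat, np.var(x, ddof=0))
```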
General Properties

To define the MLE in a more general situation, suppose that interest lies in the conditional distribution of yi given xi.
Let the density or probability mass function be given by f(yi | xi; θ), where θ is a K-dimensional vector of unknown parameters, and assume that observations are mutually independent.
In this situation the joint density or probability mass function of the sample y1, ..., yn (conditional upon X = (x1, ..., xn)′) is given by

\[
f(y_1, \dots, y_n \mid X; \theta) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta).
\]

The likelihood function for the available sample is then given by

\[
L(\theta \mid y, X) = \prod_{i=1}^{n} L_i(\theta \mid y_i, x_i) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta),
\]

which is a function of θ.
General Properties

For several purposes it is convenient to employ the likelihood contributions, denoted by Li(θ | yi, xi), which reflect how much observation i contributes to the likelihood function.
The MLE θ̂ for θ is the solution to

\[
\max_{\theta} \log L(\theta) = \max_{\theta} \sum_{i=1}^{n} \log L_i(\theta),
\]

where log L(θ) is the loglikelihood function and, for simplicity, the other arguments have been dropped.

General Properties

The first order conditions for this problem imply that

\[
\left. \frac{\partial \log L(\theta)}{\partial \theta} \right|_{\hat{\theta}}
= \sum_{i=1}^{n} \left. \frac{\partial \log L_i(\theta)}{\partial \theta} \right|_{\hat{\theta}} = 0,
\]

where |θ̂ indicates that the expression is evaluated at θ = θ̂.
If the loglikelihood function is globally concave there is a unique global maximum and the maximum likelihood estimator is uniquely determined by these first order conditions.
In general, numerical optimization is required.
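
The sketch below illustrates numerical maximization of a loglikelihood, assuming NumPy and SciPy are available. The probit-style model, the simulated data and all variable names are purely illustrative; in practice you would plug in the loglikelihood of your own model.

```python
# A minimal sketch: maximize log L(theta) by minimizing -log L(theta).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([0.5, -1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)   # hypothetical binary data

def neg_loglik(theta):
    # negative loglikelihood of a probit model
    p = np.clip(norm.cdf(X @ theta), 1e-12, 1 - 1e-12)       # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_loglik, x0=np.zeros(k), method="BFGS")
theta_hat = result.x
print(theta_hat)
```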

General Properties

For notational convenience, denote the first derivatives of the individual loglikelihood contributions, also known as scores, as

\[
s_i(\theta) = \frac{\partial \log L_i(\theta)}{\partial \theta},
\]

so that the first order conditions can be stated as

\[
\sum_{i=1}^{n} s_i(\hat{\theta}) = 0.
\]

This says that the sample averages of the K scores, evaluated at the ML estimate θ̂, should be zero.

General Properties

Provided that the likelihood function is correctly specified, it can be shown under weak regularity conditions that the MLE:
1. is consistent for θ (plim θ̂ = θ);
2. is asymptotically efficient (that is, asymptotically the ML estimator has the "smallest" variance among all consistent asymptotically normal estimators);
3. is asymptotically normally distributed,

\[
\sqrt{n}\,(\hat{\theta} - \theta) \rightarrow N(0, V),
\]

where V is the asymptotic covariance matrix.

General Properties
The covariance matrix V is determined by the shape of the loglikelihood function and can be shown to equal

\[
V = \left( -E\left[ \frac{\partial^2 \log L_i(\theta)}{\partial \theta\, \partial \theta'} \right] \right)^{-1}.
\]

The term in brackets is (minus) the expected value of the matrix of second derivatives and reflects the curvature of the loglikelihood function.
Clearly, if the loglikelihood function is highly curved around its maximum, the second derivative is large, the variance is small and the ML estimator is relatively accurate. If the function is less curved, the variance will be larger.
The symmetric matrix

\[
I(\theta) = -E\left[ \frac{\partial^2 \log L_i(\theta)}{\partial \theta\, \partial \theta'} \right]
\]

is known as the (Fisher) information matrix.
General Properties

Given the asymptotic efficiency of the ML estimator, the inverse of the information matrix, I(θ)⁻¹, provides a lower bound on the asymptotic covariance matrix of any consistent asymptotically normal estimator for θ.
The MLE is asymptotically efficient because it attains this bound, often referred to as the Cramér–Rao lower bound.
In practice, the covariance matrix V can be estimated consistently by replacing the expectations operator by a sample average and replacing the unknown coefficients by the ML estimates. That is,

\[
\hat{V}_H = \left( -\frac{1}{N} \sum_{i=1}^{N} \left. \frac{\partial^2 \log L_i(\theta)}{\partial \theta\, \partial \theta'} \right|_{\hat{\theta}} \right)^{-1},
\]

where we take the derivatives first and then substitute the unknown θ by θ̂.
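
Continuing the hypothetical probit sketch from the numerical-optimization example above (so X, y, neg_loglik and theta_hat are assumed to be in scope), the following is one simple way to approximate V̂_H with a finite-difference Hessian; it is an illustration, not a production-grade implementation.

```python
# A minimal sketch of the Hessian-based covariance estimator V_H.
import numpy as np

def numerical_hessian(f, theta, eps=1e-5):
    """Central-difference Hessian of a scalar function f at theta."""
    k = theta.size
    H = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            ea, eb = np.zeros(k), np.zeros(k)
            ea[a], eb[b] = eps, eps
            H[a, b] = (f(theta + ea + eb) - f(theta + ea - eb)
                       - f(theta - ea + eb) + f(theta - ea - eb)) / (4 * eps ** 2)
    return H

# average loglikelihood (1/N) log L(theta); neg_loglik from the earlier sketch
avg_loglik = lambda th: -neg_loglik(th) / len(y)

H_hat = numerical_hessian(avg_loglik, theta_hat)
V_H = np.linalg.inv(-H_hat)                  # ( -(1/N) sum of Hessians )^{-1}
se_H = np.sqrt(np.diag(V_H) / len(y))        # standard errors of theta_hat
print(se_H)
```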

General Properties

If the likelihood function is correctly specified, it can be shown that the matrix

\[
J(\theta) \equiv E\left[ s_i(\theta)\, s_i(\theta)' \right],
\]

where si(θ) = ∂ log Li(θ)/∂θ, is identical to the information matrix I(θ).
Thus, V can also be estimated from the first order derivatives of the loglikelihood function as

\[
\hat{V}_G = \left( \frac{1}{N} \sum_{i=1}^{N} s_i(\hat{\theta})\, s_i(\hat{\theta})' \right)^{-1},
\]

where G reflects that this variance estimator uses the outer product of the gradients (first derivatives).
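
A corresponding sketch of V̂_G for the same hypothetical probit example (again assuming X, y and theta_hat from the earlier sketch are in scope); the individual scores are obtained by central differences purely for illustration.

```python
# A minimal sketch of the outer-product-of-gradients estimator V_G.
import numpy as np
from scipy.stats import norm

def loglik_obs(theta):
    # vector of individual contributions log L_i(theta) for the probit sketch
    p = np.clip(norm.cdf(X @ theta), 1e-12, 1 - 1e-12)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def score_matrix(theta, eps=1e-6):
    # N x K matrix whose i-th row is s_i(theta), by central differences
    k = theta.size
    S = np.zeros((len(y), k))
    for a in range(k):
        e = np.zeros(k)
        e[a] = eps
        S[:, a] = (loglik_obs(theta + e) - loglik_obs(theta - e)) / (2 * eps)
    return S

S = score_matrix(theta_hat)
V_G = np.linalg.inv(S.T @ S / len(y))        # ( (1/N) sum s_i s_i' )^{-1}
se_G = np.sqrt(np.diag(V_G) / len(y))
print(se_G)
```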

The Normal Linear Regression Model

Consider

\[
y_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim NID(0, \sigma^2),
\]

under the usual OLS assumptions. This imposes that (conditional upon the exogenous variables) yi is normal with mean xi′β and a constant variance σ².

The Normal Linear Regression Model

The loglikelihood is

\[
\log L(\beta, \sigma^2) = \sum_{i=1}^{N} \log L_i(\beta, \sigma^2)
= -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2} \sum_{i=1}^{N} \frac{(y_i - x_i'\beta)^2}{\sigma^2}.
\]

The Normal Linear Regression Model

The score vector is

\[
s_i(\beta, \sigma^2) =
\begin{pmatrix}
\dfrac{\partial \log L_i(\beta, \sigma^2)}{\partial \beta} \\[2mm]
\dfrac{\partial \log L_i(\beta, \sigma^2)}{\partial \sigma^2}
\end{pmatrix}
=
\begin{pmatrix}
\dfrac{(y_i - x_i'\beta)}{\sigma^2}\, x_i \\[2mm]
-\dfrac{1}{2\sigma^2} + \dfrac{1}{2} \dfrac{(y_i - x_i'\beta)^2}{\sigma^4}
\end{pmatrix}.
\]

The Normal Linear Regression Model
The MLEs β̂ and σ̂² satisfy the first order conditions

\[
\sum_{i=1}^{N} \frac{(y_i - x_i'\beta)}{\sigma^2}\, x_i = 0
\]

and

\[
-\frac{N}{2\sigma^2} + \frac{1}{2} \sum_{i=1}^{N} \frac{(y_i - x_i'\beta)^2}{\sigma^4} = 0.
\]

The solutions to these equations are

\[
\hat{\beta} = \left( \sum_{i=1}^{N} x_i x_i' \right)^{-1} \sum_{i=1}^{N} x_i y_i
\]

and

\[
\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - x_i'\hat{\beta} \right)^2.
\]

The estimator for the vector of slopes is identical to the OLS estimator, while σ̂² differs from OLS by dividing by N rather than N − K.
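
A short numerical illustration of this point, assuming NumPy is available; the simulated regression data are hypothetical.

```python
# A minimal sketch: ML slope estimates coincide with OLS, while the ML
# variance estimate divides by N instead of N - K.
import numpy as np

rng = np.random.default_rng(2)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(scale=2.0, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # identical to the OLS estimator
resid = y - X @ beta_hat

sigma2_ml = resid @ resid / N                    # ML estimate: divide by N
sigma2_ols = resid @ resid / (N - K)             # unbiased OLS estimate: divide by N - K
print(beta_hat, sigma2_ml, sigma2_ols)
```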
The Normal Linear Regression Model

The information matrix is defined as

\[
I(\beta, \sigma^2) = E\left[ s_i(\beta, \sigma^2)\, s_i(\beta, \sigma^2)' \right].
\]

Since for a normal distribution

\[
E\{\varepsilon_i\} = 0, \quad E\{\varepsilon_i^2\} = \sigma^2, \quad E\{\varepsilon_i^3\} = 0, \quad E\{\varepsilon_i^4\} = 3\sigma^4,
\]

it can be shown that

\[
I(\beta, \sigma^2) =
\begin{pmatrix}
\sigma^{-2} E\{x_i x_i'\} & 0 \\
0 & \dfrac{1}{2\sigma^4}
\end{pmatrix}.
\]

The Normal Linear Regression Model

Because this information matrix is block diagonal, its inverse is given by

\[
I(\beta, \sigma^2)^{-1} =
\begin{pmatrix}
\sigma^{2} \left( E\{x_i x_i'\} \right)^{-1} & 0 \\
0 & 2\sigma^4
\end{pmatrix}.
\]

From this, it follows that β̂ and σ̂² are asymptotically normal and mutually independent according to

\[
\sqrt{N}\,(\hat{\beta} - \beta) \rightarrow N\!\left( 0, \sigma^2 \left( E\{x_i x_i'\} \right)^{-1} \right)
\]
\[
\sqrt{N}\,(\hat{\sigma}^2 - \sigma^2) \rightarrow N(0, 2\sigma^4).
\]

Specification Tests

Based on MLE a large number of alternative tests can be constructed. Such tests are typically based on three different principles: the Wald, the likelihood ratio or the Lagrange Multiplier principles.
Consider again the general problem where we estimate a K-dimensional parameter vector θ by maximizing the loglikelihood function, i.e.,

\[
\max_{\theta} \log L(\theta) = \max_{\theta} \sum_{i=1}^{N} \log L_i(\theta).
\]

Suppose that we are interested in testing one or more linear restrictions on the parameter vector θ = (θ1, ..., θK)′.
These restrictions can be summarised as H0: Rθ = r for some fixed J-dimensional vector r, where R is a J × K matrix. It is assumed that the J rows of R are linearly independent, so that the restrictions are not in conflict with each other nor redundant.

Specification Tests
The three test principles can be summarised as follows:
Wald test. Estimate θ by ML and check whether the difference Rθ̂ − r is close to zero, using its asymptotic covariance matrix. This is the idea that underlies the well-known t- and F-tests.
Likelihood ratio test. Estimate the model twice: once without the restriction imposed (giving θ̂) and once with the null hypothesis imposed (giving the constrained ML estimator θ̃, where Rθ̃ = r), and check whether the difference in loglikelihood values log L(θ̂) − log L(θ̃) is significantly different from zero. This implies the comparison of an unrestricted and a restricted maximum of log L(θ).
Lagrange Multiplier test. Estimate the model with the restriction from the null hypothesis imposed (giving θ̃) and check whether the first order conditions from the general model are significantly violated. That is, check whether ∂ log L(θ)/∂θ |θ̃ is significantly different from zero.
Specification Tests

These three tests are asymptotically equivalent (i.e. the statistics have the same asymptotic distribution).
Computation of the test statistics is, however, substantially different.

Specification Tests
The Wald Test
The Wald test starts from the result that

\[
\sqrt{N}\,(\hat{\theta} - \theta) \rightarrow N(0, V),
\]

from which it follows that the J-dimensional vector Rθ̂ also has an asymptotic normal distribution, given by

\[
\sqrt{N}\,(R\hat{\theta} - R\theta) \rightarrow N(0, RVR').
\]

Under the null hypothesis Rθ equals the known vector r, so that we can construct a test statistic by forming the quadratic form

\[
\xi_W = N\,(R\hat{\theta} - r)' \left[ R \hat{V} R' \right]^{-1} (R\hat{\theta} - r),
\]

where V̂ is a consistent estimator of V.
Under the null hypothesis, this test statistic has a Chi-squared distribution with J degrees of freedom, so that large values of ξW lead us to reject the null hypothesis.
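
A minimal sketch of the Wald statistic, assuming NumPy and SciPy are available; theta_hat and V_hat stand for the ML estimate and a consistent estimate of V (e.g. V̂_H or V̂_G), and the numbers in the example call are illustrative.

```python
# xi_W = N (R theta_hat - r)' [R V_hat R']^{-1} (R theta_hat - r)
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, V_hat, R, r, N):
    d = R @ theta_hat - r
    xi_W = N * d @ np.linalg.solve(R @ V_hat @ R.T, d)
    J = R.shape[0]
    return xi_W, chi2.sf(xi_W, df=J)            # statistic and p-value

# example: test H0: theta_2 = 0 in a two-parameter model
xi_W, p = wald_test(theta_hat=np.array([0.8, -0.1]),
                    V_hat=np.array([[0.5, 0.1], [0.1, 0.4]]),
                    R=np.array([[0.0, 1.0]]),
                    r=np.array([0.0]),
                    N=500)
print(xi_W, p)
```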
Specification Tests
The Likelihood Ratio Test
The likelihood ratio test is even simpler to compute, provided the model is estimated with and without the restrictions imposed. This means that we have two different estimators: the unrestricted ML estimator θ̂ and the constrained ML estimator θ̃, obtained by maximising the loglikelihood function log L(θ) subject to the restrictions.
If log L(θ̂) − log L(θ̃) ≥ 0 is small, the consequences of imposing the restrictions are limited, suggesting that the restrictions are correct. If the difference is large, the restrictions are likely to be incorrect.
The LR test statistic is simply computed as

\[
\xi_{LR} = 2\left[ \log L(\hat{\theta}) - \log L(\tilde{\theta}) \right],
\]

which under the null hypothesis has a Chi-squared distribution with J degrees of freedom.
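
A corresponding sketch for the LR statistic, assuming SciPy is available; the two maximised loglikelihood values in the example call are illustrative placeholders.

```python
# xi_LR = 2 [ log L(theta_hat) - log L(theta_tilde) ], chi-squared(J) under H0
from scipy.stats import chi2

def lr_test(loglik_unrestricted, loglik_restricted, J):
    xi_LR = 2.0 * (loglik_unrestricted - loglik_restricted)
    return xi_LR, chi2.sf(xi_LR, df=J)

xi_LR, p = lr_test(loglik_unrestricted=-534.2, loglik_restricted=-537.9, J=2)
print(xi_LR, p)
```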
Specification Tests
The Lagrange Multiplier Test
Several tests in the literature are Lagrange Multiplier tests (LM tests).
Suppose the null hypothesis restricts some elements in the parameter vector θ to equal a set of given values.
Consider θ′ = (θ1′, θ2′), where the null hypothesis now says that θ2 = r and θ2 has dimension J.
The term "Lagrange Multiplier" comes from the fact that the test is implicitly based upon the value of the Lagrange multipliers in the constrained maximization problem.
The first order conditions of the Lagrangian,

\[
H(\theta, \lambda) = \sum_{i=1}^{N} \log L_i(\theta) - \lambda'(\theta_2 - r), \qquad (1)
\]

yield the constrained ML estimator θ̃ = (θ̃1′, θ̃2′)′ and λ̃.
Specification Tests

The Lagrange Multiplier Test

From the first order conditions of (1) it follows that

\[
\sum_{i=1}^{N} \left. \frac{\partial \log L_i(\theta)}{\partial \theta_1} \right|_{\tilde{\theta}}
= \sum_{i=1}^{N} s_{i1}(\tilde{\theta}) = 0
\]

and

\[
\tilde{\lambda} = \sum_{i=1}^{N} \left. \frac{\partial \log L_i(\theta)}{\partial \theta_2} \right|_{\tilde{\theta}}
= \sum_{i=1}^{N} s_{i2}(\tilde{\theta}), \qquad (2)
\]

where the score vector si(θ) is decomposed into subvectors si1(θ) and si2(θ), corresponding to θ1 and θ2, respectively.
The result in (2) shows that the vector of Lagrange Multipliers λ̃ equals the vector of first derivatives with respect to the restricted parameters θ2, evaluated at the constrained estimator θ̃.

Specification Tests

The Lagrange Multiplier Test

As the first derivatives are also known as scores, the LM test is also known as the score test.
To determine the appropriate test statistic, we exploit the result that the sample average N⁻¹λ̃ is asymptotically normal with covariance matrix

\[
V_\lambda = I_{22}(\theta) - I_{21}(\theta)\, I_{11}(\theta)^{-1}\, I_{12}(\theta), \qquad (3)
\]

where Ijk(θ) are blocks in the information matrix I(θ),

\[
I(\theta) =
\begin{pmatrix}
I_{11}(\theta) & I_{12}(\theta) \\
I_{21}(\theta) & I_{22}(\theta)
\end{pmatrix},
\]

where I22(θ) is of dimension J × J.

Specification Tests

The Lagrange Multiplier Test

Computationally, (3) is the inverse of the lower right J × J block of the inverse of I(θ).
The Lagrange multiplier test statistic can be derived as

\[
\xi_{LM} = N^{-1}\, \tilde{\lambda}'\, \hat{V}_\lambda^{-1}\, \tilde{\lambda},
\]

which under the null hypothesis has an asymptotic Chi-squared distribution with J degrees of freedom, where V̂λ denotes an estimate of Vλ based upon the information matrix evaluated at the constrained estimator θ̃.

Specification Tests
The Lagrange Multiplier Test
Computation of the LM test statistic is particularly attractive if the information matrix is estimated on the basis of the first derivatives of the loglikelihood function, as

\[
\hat{I}_G(\tilde{\theta}) = \frac{1}{N} \sum_{i=1}^{N} s_i(\tilde{\theta})\, s_i(\tilde{\theta})',
\]

i.e., the average outer product of the vectors of first derivatives, evaluated at the constrained ML estimates θ̃. Hence,

\[
\xi_{LM} = \left( \sum_{i=1}^{N} s_i(\tilde{\theta}) \right)'
\left( \sum_{i=1}^{N} s_i(\tilde{\theta})\, s_i(\tilde{\theta})' \right)^{-1}
\left( \sum_{i=1}^{N} s_i(\tilde{\theta}) \right).
\]

Note that the first K − J elements in the scores si(θ̃) sum to zero, because of the first order conditions. Nevertheless, these are important for computing the correct covariance matrix.
Specification Tests
The Lagrange Multiplier Test
Denote the N × K matrix of first derivatives as S, such that

\[
S =
\begin{pmatrix}
s_1(\tilde{\theta})' \\
s_2(\tilde{\theta})' \\
\vdots \\
s_N(\tilde{\theta})'
\end{pmatrix}.
\]

In the matrix S each row corresponds to an observation and each column corresponds to the derivative with respect to one of the parameters. Thus,

\[
\sum_{i=1}^{N} s_i(\tilde{\theta}) = S'\iota,
\]

where ι = (1, 1, 1, ..., 1)′ of dimension N. Moreover,

\[
\sum_{i=1}^{N} s_i(\tilde{\theta})\, s_i(\tilde{\theta})' = S'S.
\]
Specification Tests
The Lagrange Multiplier Test
Consider an auxiliary regression of a column of ones upon the columns of the matrix S.
From the standard expression for the OLS estimator, (S′S)⁻¹S′ι, we obtain the predicted values of this regression as S(S′S)⁻¹S′ι.
The explained sum of squares is therefore

\[
\iota' S (S'S)^{-1} S'S (S'S)^{-1} S'\iota = \iota' S (S'S)^{-1} S'\iota,
\]

while the total (uncentred) sum of squares of this regression is ι′ι.
Since the uncentred R² = ι′S(S′S)⁻¹S′ι / ι′ι, it follows that

\[
\xi_{LM} = N R^2.
\]

Note that this R² is the uncentred R² of an auxiliary regression of a vector of ones upon the score vectors.
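
A minimal sketch of the N·R² computation, assuming NumPy and SciPy are available. S stands for the N × K matrix of scores evaluated at the constrained estimate; here it is filled with random numbers purely to illustrate the arithmetic, not to produce a valid test.

```python
# LM statistic as N times the uncentred R^2 of a regression of ones on S.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
N, K, J = 200, 4, 2
S = rng.normal(size=(N, K))                   # placeholder for the score matrix
iota = np.ones(N)

b = np.linalg.solve(S.T @ S, S.T @ iota)      # OLS coefficients (S'S)^{-1} S' iota
fitted = S @ b

xi_LM = iota @ fitted                          # iota' S (S'S)^{-1} S' iota
R2_uncentred = (fitted @ fitted) / (iota @ iota)
print(xi_LM, N * R2_uncentred, chi2.sf(xi_LM, df=J))   # the first two coincide
```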
Specification Tests

The Lagrange Multiplier Test


Under the null hypothesis, the test statistic is asymptotically χ2
distributed with J degrees of freedom, where J is the number of
restrictions considered.
This is a computationally convenient test.

