An example
Suppose we sample a set of goods for quality and find 5 defective items
in a sample of 10. What is our estimate of the proportion of bad items in
the whole population?
Intuitively, of course, it is 50%. Formally, in a sample of size n the
probability of finding B bad items is
$$P = \frac{n!}{B!\,(n-B)!}\,\theta^{B}\,(1-\theta)^{\,n-B}$$

where $\theta$ is the proportion of bad items in the population. Setting the derivative with respect to $\theta$ to zero,

$$\frac{\partial P}{\partial \theta} = \frac{n!}{B!\,(n-B)!}\left[B\,\theta^{B-1}(1-\theta)^{\,n-B} - (n-B)\,\theta^{B}(1-\theta)^{\,n-B-1}\right] = 0$$

$$B\,\theta^{B-1}(1-\theta)^{\,n-B} = (n-B)\,\theta^{B}(1-\theta)^{\,n-B-1}$$

$$B(1-\theta) = (n-B)\,\theta$$

$$\hat\theta = B/n = 5/10 = 0.5$$
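As a quick numerical check (not part of the original notes), the following Python sketch evaluates this binomial log-likelihood on a grid of candidate values for the proportion and confirms that it peaks at B/n; the variable names are illustrative.

```python
# Minimal sketch: grid-search the binomial log-likelihood for B = 5 defectives
# out of n = 10 and confirm the maximum is at theta = B/n = 0.5.
import numpy as np

n, B = 10, 5
theta = np.linspace(0.01, 0.99, 99)

# Log-likelihood up to the constant n!/(B!(n-B)!), which does not depend on theta.
loglik = B * np.log(theta) + (n - B) * np.log(1 - theta)

print("ML estimate of theta:", round(theta[np.argmax(loglik)], 2))   # -> 0.5
```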
A general statement
Consider a sample (X_1, ..., X_n) drawn from a probability distribution
P(X | A), where A is a vector of parameters. If the Xs are independent, each with probability
density function P(X_i | A), the joint probability of the whole set is

$$P(X_1 \ldots X_n \mid A) = \prod_{i=1}^{n} P(X_i \mid A)$$

and the log-likelihood of the sample is

$$\log(L(A)) = \sum_{i=1}^{n} \log(P(X_i \mid A))$$
$$Y = f(X, \beta) + e, \qquad e \sim N(0, \Sigma)$$

$$L(\beta, \Sigma) = \frac{1}{(2\pi)^{0.5}\,|\Sigma|^{0.5}}\,\exp\!\left[-0.5\,(Y - f(X, \beta))'\,\Sigma^{-1}\,(Y - f(X, \beta))\right]$$
The score is the derivative of the log-likelihood,

$$\frac{\partial \log(L(\theta))}{\partial \theta} = S(\theta)$$

and the information matrix is

$$E\!\left[-\frac{\partial^2 \log(L(\theta))}{\partial \theta\,\partial \theta'}\right] = I(\theta)$$

If $\theta^*$ is another estimate of $\theta$ then

$$\mathrm{Var}(\theta^*) \ge I(\theta)^{-1}$$
Suppose the parameters split into two groups, so

$$L(\theta) = L(\theta_1, \theta_2)$$

Now suppose we knew $\theta_1$; then sometimes we can derive a
formula for the ML estimate of $\theta_2$, e.g.

$$\theta_2 = g(\theta_1)$$

Then we could write the likelihood function as

$$L(\theta_1, \theta_2) = L(\theta_1, g(\theta_1)) = L^*(\theta_1)$$

the concentrated likelihood function.
For example, for a regression model with scalar error variance $\sigma^2$, the log-likelihood (ignoring constants) is

$$\log L(\beta, \sigma^2) = -T\log(\sigma^2) - e'e/\sigma^2$$

where $e = Y - f(X, \beta)$. Setting the derivative with respect to $\sigma^2$ to zero,

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{T}{\sigma^2} + \frac{e'e}{(\sigma^2)^2} = 0$$

$$\hat\sigma^2 = e'e/T$$

so the concentrated likelihood is

$$L^*(\beta) = -T\log(e'e/T) - T$$
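A small Python sketch of this concentration step, assuming an ordinary linear regression with normal errors (the model and data are illustrative): $\sigma^2$ is replaced by $e'e/T$ inside the likelihood, and the resulting $L^*(\beta)$ is maximised at the least-squares estimate.

```python
# Sketch: concentrate sigma^2 out of the Gaussian regression likelihood and check
# that L*(b) = -T*log(e'e/T) - T peaks at the least-squares estimate of b.
import numpy as np

rng = np.random.default_rng(0)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=T)

def concentrated_loglik(b):
    e = y - X @ b
    return -T * np.log(e @ e / T) - T          # sigma^2 replaced by e'e/T

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("L* at the OLS estimate :", concentrated_loglik(b_ols))
print("L* at a nearby point   :", concentrated_loglik(b_ols + 0.05))  # lower, as expected
```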
For a time series sample (Y_1, ..., Y_T) the likelihood can be factorised sequentially:

$$\log(L(Y_1, Y_2, \ldots, Y_{T-1}, Y_T)) = \log(L(Y_T \mid Y_1, Y_2, \ldots, Y_{T-1})) + \log(L(Y_1, Y_2, \ldots, Y_{T-1}))$$

The first term is the conditional probability of $Y_T$ given all past values. We
can then condition the second term in the same way, and so on, to give

$$\log L = \sum_{j=0}^{T-2} \log(L(Y_{T-j} \mid Y_1, \ldots, Y_{T-j-1})) + \log(L(Y_1))$$

that is, a series of one-step-ahead prediction errors conditional on actual
lagged Y.
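To make the decomposition concrete, here is a hedged sketch for a Gaussian AR(1) (a model chosen purely for illustration): the log-likelihood, conditional on the first observation, is built from one-step-ahead prediction errors.

```python
# Sketch: prediction error decomposition for y_t = rho*y_{t-1} + e_t with normal errors,
# conditioning on y_1 for simplicity.
import numpy as np

def ar1_loglik(rho, sigma2, y):
    e = y[1:] - rho * y[:-1]                   # one-step-ahead prediction errors
    T = len(e)
    return -0.5 * (T * np.log(2 * np.pi * sigma2) + e @ e / sigma2)

rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# The likelihood should be higher at the true rho than at a wrong value.
print(ar1_loglik(0.7, 1.0, y), ">", ar1_loglik(0.2, 1.0, y))
```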
Testing hypotheses
If a restriction on a model is acceptable, this means that the reduction in
the likelihood value caused by imposing the restriction is not `significant'.
This gives us a very general basis for constructing hypothesis tests, but
to implement them we need some definite metric to judge the tests
against, i.e. what counts as significant.
Let $L_u$ be the unrestricted (maximum) likelihood value, $L_r$ the restricted value, and LR the gap between them. Expanding the restricted log-likelihood around the ML estimate $\hat\theta$:

$$\log(L(\theta_r)) = \log(L(\hat\theta)) + (\theta_r - \hat\theta)'\,\frac{\partial \log(L(\hat\theta))}{\partial\theta} + 0.5\,(\theta_r - \hat\theta)'\,\frac{\partial^2 \log(L(\hat\theta))}{\partial\theta\,\partial\theta'}\,(\theta_r - \hat\theta) + O(1)$$

and of course

$$\frac{\partial \log(L(\hat\theta))}{\partial\theta} = S(\hat\theta) = 0, \qquad -\frac{\partial^2 \log(L(\hat\theta))}{\partial\theta\,\partial\theta'} = I(\hat\theta)$$

So

$$(\hat\theta - \theta_r)'\, I(\hat\theta)\, (\hat\theta - \theta_r) \sim \chi^2(m)$$
And so

$$LR = 2\,[\log(L_u) - \log(L_r)] \sim \chi^2(m)$$

where m is the number of restrictions. The Lagrange multiplier (score) version of the test is

$$LM = S(\theta_r)'\,[I(\theta_r)]^{-1}\, S(\theta_r) \sim \chi^2(m)$$
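As an illustration of the LR form of the test (the model, data and 5% significance level are all assumed for the example), the sketch below fits an unrestricted and a restricted regression by ML and compares LR = 2[log(L_u) - log(L_r)] with a chi-squared critical value.

```python
# Sketch: likelihood ratio test of the restriction "coefficient on x2 = 0".
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
T = 300
x1, x2 = rng.normal(size=T), rng.normal(size=T)
y = 1.0 + 0.5 * x1 + rng.normal(size=T)        # x2 is truly irrelevant

def loglik(X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return -0.5 * T * np.log(e @ e / T)        # concentrated log-likelihood, up to constants

Xu = np.column_stack([np.ones(T), x1, x2])     # unrestricted model
Xr = np.column_stack([np.ones(T), x1])         # restricted model
LR = 2 * (loglik(Xu) - loglik(Xr))
print("LR =", LR, " 5% critical value chi2(1) =", chi2.ppf(0.95, 1))
```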
Now suppose

$$Y_t = f(X_t, \theta_1, \theta_2) + e_t$$

where we assume that the subset of parameters $\theta_1$ is fixed according to a
set of restrictions g = 0 (G is the derivative of this restriction).
Now

$$S(\theta_1) = \sigma^{-2}\, G'e$$

$$I(\theta_1) = \sigma^{-2}\, E(G'G)$$

$$LM = S(\theta_1)'\,[I(\theta_1)]^{-1}\,S(\theta_1) = \sigma^{-2}\, e'G\,[E(G'G)]^{-1}\,G'e$$
And if $E(G'G) = G'G$ and we use $\hat\sigma^2 = e'e/T$, then

$$LM = \frac{e'G\,(G'G)^{-1}\,G'e}{e'e/T}$$

which may be interpreted as TR² from a regression of e on G.
This is used in many tests for serial correlation, heteroskedasticity,
functional form, etc. Here e is the vector of actual errors (residuals) from the
restricted model and G contains the derivatives of the restrictions in the model.
For example, consider

$$Y = X\beta + u, \qquad u_t = \rho\, u_{t-1} + e_t$$

The restriction that $\rho = 0$ may be tested with an LM test as follows:
estimate the model without serial correlation and save the residuals u; then
estimate the auxiliary regression

$$u_t = X_t\gamma + \sum_{i=1}^{m} \rho_i\, u_{t-i} + v_t$$

Then TR² from this regression is an LM(m) test for serial correlation.
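The following sketch implements this TR² recipe for an ordinary linear regression (a Breusch-Godfrey-style test; the data-generating process and the lag length m are assumed purely for illustration).

```python
# Sketch: LM(m) test for serial correlation via T*R^2 from an auxiliary regression
# of the restricted-model residuals on X and m of their own lags.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
T, m = 400, 2
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):                      # AR(1) errors, so the test should reject
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(T), x])
uhat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # restricted-model residuals

# Auxiliary regression of uhat on X and m lagged residuals (first m obs dropped).
lags = np.column_stack([uhat[m - i - 1: T - i - 1] for i in range(m)])
Z = np.column_stack([X[m:], lags])
v = uhat[m:]
fit = Z @ np.linalg.lstsq(Z, v, rcond=None)[0]
R2 = 1 - np.sum((v - fit) ** 2) / np.sum((v - v.mean()) ** 2)
LM = (T - m) * R2                          # T*R^2 with the effective sample size
print("LM =", LM, " 5% critical value chi2(m) =", chi2.ppf(0.95, m))
```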
Numerical optimisation
In simple cases (e.g. OLS) we can calculate the maximum likelihood
estimates analytically. But in many cases we cannot, and then we resort to
numerical optimisation of the likelihood function.
This amounts to hill climbing in parameter space.
There are many algorithms, and many computer programmes implement
them for you.
It is useful to understand the broad steps of the procedure.
Derivative-based techniques climb towards the maximum of the likelihood, $L_u$, using the gradient and the second derivative of the log-likelihood, with steps of the form

$$\theta_{i+1} = \theta_i + \left[-\frac{\partial^2 \log L(\theta_i)}{\partial\theta\,\partial\theta'}\right]^{-1}\frac{\partial \log L(\theta_i)}{\partial\theta}$$
Derivative-free techniques do not use derivatives and so are
less efficient but more robust to extreme non-linearities, e.g. Powell or the
non-linear Simplex.
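A minimal sketch of numerical ML (the model, starting values and the BFGS choice are all illustrative): the negative log-likelihood is handed to a generic optimiser, which does the hill climbing; a derivative-free method such as Powell could be requested in the same way.

```python
# Sketch: maximise a Gaussian regression likelihood numerically with scipy.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
T = 300
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=T)

def neg_loglik(params):
    a, b, log_sigma2 = params                  # work with log(sigma^2) to keep it positive
    sigma2 = np.exp(log_sigma2)
    e = y - a - b * x
    return 0.5 * (T * np.log(2 * np.pi * sigma2) + e @ e / sigma2)

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")   # or method="Powell"
print("ML estimates (a, b, sigma^2):", res.x[0], res.x[1], np.exp(res.x[2]))
```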
For example, suppose

$$Y_t = X_t\beta + u_t$$

but we only observe certain limited information, e.g. z = 1 or 0 related to Y:

z = 1 if Y_t > 0
z = 0 if Y_t ≤ 0
then we can group the data into two groups and form a likelihood
function with the following form

$$L = \prod_{z=0} F(-X_t\beta)\;\prod_{z=1}\left[1 - F(-X_t\beta)\right]$$

where F is the cumulative distribution function of the error term (the normal CDF gives the probit model).
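A hedged sketch of this grouped-data likelihood for the probit case (F taken to be the standard normal CDF; the data are simulated for illustration):

```python
# Sketch: maximise the probit likelihood L = prod_{z=0} F(-X'b) * prod_{z=1} [1 - F(-X'b)].
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(5)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
z = (X @ np.array([0.3, 1.0]) + rng.normal(size=T) > 0).astype(float)

def neg_loglik(b):
    p0 = np.clip(norm.cdf(-X @ b), 1e-12, 1 - 1e-12)   # P(z = 0), clipped for safety
    return -(np.log(p0[z == 0]).sum() + np.log(1 - p0[z == 1]).sum())

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print("probit estimates:", res.x)
```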
Another example is a GARCH model:

$$Y_t = X_t\beta + e_t, \qquad e_t \sim N(0, h_t)$$

$$h_t = \alpha_0 + \alpha_1 h_{t-1} + \alpha_2 e_{t-1}^2$$

Then the likelihood function for this model is (ignoring constants)

$$\log(L(\beta, \alpha)) = -\sum_{t=1}^{T}\left[\log(h_t) + e_t^2 / h_t\right]$$
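A minimal sketch of the GARCH(1,1) recursion and the criterion above (the start-up value for h_1 and the residual series are assumed purely for illustration; constants are dropped as in the notes):

```python
# Sketch: build h_t = a0 + a1*h_{t-1} + a2*e_{t-1}^2 recursively and evaluate
# the log-likelihood criterion -sum(log(h_t) + e_t^2 / h_t).
import numpy as np

def garch_loglik(e, a0, a1, a2):
    T = len(e)
    h = np.empty(T)
    h[0] = e.var()                             # simple start-up value for h_1
    for t in range(1, T):
        h[t] = a0 + a1 * h[t - 1] + a2 * e[t - 1] ** 2
    return -np.sum(np.log(h) + e ** 2 / h)

rng = np.random.default_rng(6)
e = rng.normal(size=1000)                      # illustrative residual series
print(garch_loglik(e, a0=0.05, a1=0.90, a2=0.05))
```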
An alternative approach
Method of moments
A widely used technique in estimation is the Generalised Method of
Moments (GMM). This is an extension of the standard method of
moments.
The idea here is that if we have random drawings from an unknown
probability distribution then the sample statistics we calculate will
converge in probability to some constant. This constant will be a function
of the unknown parameters of the distribution. If we want to estimate k of
these parameters,
1 ,..., k
we compute k statistics (or moments) whose probability limits are known
functions of the parameters
m1 ,..., mk
These k moments are set equal to the function which generates the
moments and the function is inverted.
f ( m)
1
A simple example
Suppose the first moment (the mean) is generated by the
distribution $f(x \mid \theta_1)$. The observed moment from a sample of n
observations is

$$m_1 = (1/n)\sum_{i=1}^{n} x_i$$

Hence, in large samples, $m_1$ converges to the mean of $f(x \mid \theta_1)$, a known function of $\theta_1$, and

$$\hat\theta_1 = f^{-1}(m_1)$$
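A small illustration of the inversion step, using an exponential distribution chosen for the example (not from the notes): the mean of an Exponential(θ) variable is f(θ) = 1/θ, so the method-of-moments estimate is 1/m_1.

```python
# Sketch: method of moments for an exponential rate parameter via the sample mean.
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1 / 2.5, size=10_000)   # true rate theta = 2.5, mean = 1/theta
m1 = x.mean()                                     # observed first moment
print("theta_hat =", 1 / m1)                      # close to 2.5
```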
More generally, suppose

$$Y = f(\theta, X)$$

where $\theta$ is a vector of k parameters, and we have k conditions (or moments)
which should be met by the model:

$$E(g(Y, X \mid \theta)) = 0$$

Then we approximate E(g) with a sample measure and invert g:

$$\hat\theta = g^{-1}(Y, X, 0)$$
Examples
OLS
In OLS estimation we make the assumption that the regressors (Xs) are
orthogonal to the errors. Thus
$$E(X'e) = 0$$
The sample analogue for each xi is
$$(1/n)\sum_{t=1}^{n} x_{it}\, e_t = 0$$

and so the k sample moment conditions are just the usual OLS normal equations,
which deliver the OLS estimates.
Maximum likelihood estimation fits the same pattern. The log-likelihood of the sample is

$$\mathrm{Ln}(L) = \sum_i \mathrm{Ln}(f(y_i, x_i \mid \theta))$$

and this will be maximised when the following k first order conditions are
met:

$$E\left(\partial \ln(f(y, x \mid \theta)) / \partial\theta\right) = 0$$

This gives rise to the following k sample conditions:

$$(1/n)\sum_{i=1}^{n} \partial \ln(f(y_i, x_i \mid \theta)) / \partial\theta = 0$$
In general, with L moment conditions

$$E(m_j(\theta)) = 0, \qquad (1/n)\sum_{t=1}^{n} m_j(\theta) = 0, \qquad j = 1, \ldots, L$$

we choose $\theta$ to minimise

$$q(\theta) = m(\theta)'\, A\, m(\theta)$$
That is, a weighted sum of squares of the moments.
This gives a consistent estimator for any positive definite matrix A that is not a
function of $\theta$.
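A hedged sketch of this criterion for the OLS moment conditions above, with A set to the identity matrix (any positive definite choice works; the data and dimensions are illustrative). In this just-identified case the GMM estimate coincides with OLS.

```python
# Sketch: minimise q(beta) = m(beta)' A m(beta) with m(beta) = (1/n) X'(y - X*beta).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

A = np.eye(X.shape[1])                         # weighting matrix (identity)

def q(beta):
    m = X.T @ (y - X @ beta) / n               # sample moment conditions
    return m @ A @ m

res = minimize(q, x0=np.zeros(2), method="BFGS")
print("GMM estimate:", res.x)
print("OLS estimate:", np.linalg.lstsq(X, y, rcond=None)[0])   # coincide when just identified
```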
The optimal A
If any positive definite weighting matrix gives a consistent estimator, the different
choices clearly cannot all be equally efficient, so what is the optimal A?
Hansen (1982) established the basic properties of the optimal A and how
to construct the covariance of the parameter estimates.
The optimal A is simply the inverse of the covariance matrix of the moment conditions
(just as in GLS).
Thus the optimal weighting matrix is

$$A = \left[\mathrm{var}(n^{1/2}\, m(\theta))\right]^{-1}$$

The parameters which solve this criterion function then have the
following properties:

$$n^{1/2}(\hat\theta_{gmm} - \theta) \sim N(0, V_{gmm})$$

where $V_{gmm}$ is constructed from $\mathrm{var}(n^{1/2}\, m(\theta))$.
Conclusion
Both ML and GMM are very flexible estimation strategies.
They are equivalent ways of approaching the same problem in many instances.