15.1 KLIC
Y
Suppose a random sample is y = y1 ; ; :::; yn has unknown density f (y) = f (yi ):
Y
A model density g(y) = g(yi ):
How can we assess the “…t” of g as an approximation to f ?
One useful measure is the Kullback-Leibler information criterion (KLIC)
Z
f (y)
KLIC(f ; g) = f (y) log dy
g(y)
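As a minimal numerical sketch, assuming a true density $f = N(0,1)$ and a model density $g = N(0.5,\,1.5^2)$ (arbitrary illustrative choices), the KLIC can be computed by quadrature:

```python
# Minimal sketch: KLIC(f, g) = integral of f(y) log(f(y)/g(y)) dy, computed by quadrature.
# The true density f = N(0,1) and model density g = N(0.5, 1.5^2) are illustrative choices.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def klic(f_pdf, g_pdf, lo=-10.0, hi=10.0):
    """Numerically integrate f(y) * log(f(y)/g(y)) over [lo, hi]."""
    integrand = lambda y: f_pdf(y) * (np.log(f_pdf(y)) - np.log(g_pdf(y)))
    value, _ = quad(integrand, lo, hi)
    return value

f = norm(loc=0.0, scale=1.0).pdf      # true density
g = norm(loc=0.5, scale=1.5).pdf      # approximating model density
print(klic(f, g))                     # > 0; equals 0 only if g = f almost everywhere
```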
15.2 Estimation
Let the model density $g(y, \theta)$ depend on a parameter vector $\theta$. The negative log-likelihood function is

$$L(\theta) = -\sum_{i=1}^n \log g(y_i, \theta) = -\log g(y, \theta)$$

and the MLE is $\hat\theta = \mathrm{argmin}_\theta\, L(\theta)$. This is sometimes called a "quasi-MLE" when $g(y,\theta)$ is acknowledged to be an approximation rather than the truth.

Let the minimizer of $-E\log g(y,\theta)$ be written $\theta_0$ and called the pseudo-true value. This value also minimizes $\mathrm{KLIC}(f, g(\cdot,\theta))$. As the negative log-likelihood divided by $n$ is an estimator of $-E\log g(y,\theta)$, the MLE $\hat\theta$ converges in probability to $\theta_0$. That is,

$$\hat\theta \to_p \theta_0 = \mathrm{argmin}_\theta\, \mathrm{KLIC}(f, g(\cdot,\theta)).$$

Thus the QMLE estimates the best-fitting density, where "best" is measured in terms of the KLIC.
From conventional asymptotic theory, we know

$$\sqrt{n}\left(\hat\theta_{QMLE} - \theta_0\right) \to_d N(0, V)$$

$$V = Q^{-1}\,\Omega\, Q^{-1}$$

$$Q = -E\,\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y,\theta_0)$$

$$\Omega = E\left(\frac{\partial}{\partial\theta}\log g(y,\theta_0)\;\frac{\partial}{\partial\theta}\log g(y,\theta_0)'\right)$$

If the model is correctly specified ($g(y,\theta_0) = f(y)$), then $Q = \Omega$ (the information matrix equality). Otherwise $Q \ne \Omega$.
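A minimal sketch of how the sandwich variance $\hat V = \hat Q^{-1}\hat\Omega\hat Q^{-1}$ might be assembled from per-observation scores and Hessians; the array layout and the function name `sandwich_variance` are hypothetical.

```python
# Sketch of the sandwich (QMLE) variance estimate V = Q^{-1} Omega Q^{-1}.
# `scores` is an (n x k) array of per-observation scores d/dtheta log g(y_i, theta_hat);
# `hessians` is an (n x k x k) array of per-observation Hessians; both are assumed inputs.
import numpy as np

def sandwich_variance(scores: np.ndarray, hessians: np.ndarray) -> np.ndarray:
    n = scores.shape[0]
    Q = -hessians.mean(axis=0)            # Q-hat = minus the average Hessian
    Omega = scores.T @ scores / n         # Omega-hat = average outer product of scores
    Qinv = np.linalg.inv(Q)
    return Qinv @ Omega @ Qinv            # reduces to Q^{-1} when the model is correct

# This is the asymptotic variance of sqrt(n)(theta_hat - theta_0); divide by n for theta_hat.
```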
The KLIC of the fitted density $g(\cdot,\hat\theta)$ relative to $f$ is

$$\mathrm{KLIC}(f,\hat g) = C_f - E_{\tilde y}\log g(\tilde y,\hat\theta(y)),$$

where $\tilde y$ has density $f$, independent of $y$, and $C_f = \int f(y)\log f(y)\,dy$ does not depend on the model. The expected KLIC is the expectation over the observed values $y$:

$$E\left(\mathrm{KLIC}(f,\hat g)\right) = C_f - E_y E_{\tilde y}\log g(\tilde y,\hat\theta(y)) = C_f - E_{\tilde y}E_y\log g(y,\hat\theta(\tilde y)).$$

Writing $\tilde\theta = \hat\theta(\tilde y)$ for the estimate computed from the independent sample $\tilde y$, the expected KLIC equals $C_f + T$ where $T = -E\log g(y,\tilde\theta)$, so the goal is to estimate $T$.
Make a second-order Taylor expansion of $\log g(y,\tilde\theta)$ about $\hat\theta$:

$$\log g(y,\tilde\theta) \simeq \log g(y,\hat\theta) + \frac{\partial}{\partial\theta}\log g(y,\hat\theta)'\left(\tilde\theta-\hat\theta\right) + \frac12\left(\tilde\theta-\hat\theta\right)'\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y,\hat\theta)\left(\tilde\theta-\hat\theta\right).$$

The first term on the RHS is $-L(\hat\theta)$; the second is linear in the FOC, so only the third term remains. Writing

$$\hat Q = -n^{-1}\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y,\hat\theta),$$

$$\tilde\theta - \hat\theta = \left(\tilde\theta-\theta_0\right) - \left(\hat\theta-\theta_0\right),$$

we obtain

$$\log g(y,\tilde\theta) \simeq -L(\hat\theta) - \frac{n}{2}\left(\tilde\theta-\theta_0\right)'\hat Q\left(\tilde\theta-\theta_0\right) - \frac{n}{2}\left(\hat\theta-\theta_0\right)'\hat Q\left(\hat\theta-\theta_0\right) + n\left(\tilde\theta-\theta_0\right)'\hat Q\left(\hat\theta-\theta_0\right).$$
Now

$$\sqrt n\left(\hat\theta-\theta_0\right) \to_d Z_1 \sim N(0,V)$$

$$\sqrt n\left(\tilde\theta-\theta_0\right) \to_d Z_2 \sim N(0,V)$$

where $Z_1$ and $Z_2$ are independent (as $y$ and $\tilde y$ are independent samples), and $\hat Q \to_p Q$. Hence

$$\log g(y,\tilde\theta) \simeq -L(\hat\theta) - \frac12 Z_1'QZ_1 - \frac12 Z_2'QZ_2 + Z_1'QZ_2.$$
Taking expectations,

$$T = -E\log g(y,\tilde\theta) \simeq E\,L(\hat\theta) + E\left(\frac12 Z_1'QZ_1 + \frac12 Z_2'QZ_2 - Z_1'QZ_2\right) = E\,L(\hat\theta) + \mathrm{tr}\left(QV\right) = E\,L(\hat\theta) + \mathrm{tr}\left(Q^{-1}\Omega\right),$$

since $E\left(Z_1'QZ_1\right) = E\left(Z_2'QZ_2\right) = \mathrm{tr}(QV)$, $E\left(Z_1'QZ_2\right) = 0$ by independence, and $\mathrm{tr}(QV) = \mathrm{tr}\left(QQ^{-1}\Omega Q^{-1}\right) = \mathrm{tr}\left(Q^{-1}\Omega\right)$. An estimate of $T$ is therefore

$$\hat T = L(\hat\theta) + \widehat{\mathrm{tr}\left(Q^{-1}\Omega\right)}$$

where $\widehat{\mathrm{tr}\left(Q^{-1}\Omega\right)}$ is an estimate of $\mathrm{tr}\left(Q^{-1}\Omega\right)$.
15.5 AIC
When $g(y,\theta_0) = f(y)$ (the model is correctly specified) then $Q = \Omega$ (the information matrix equality). Hence

$$\mathrm{tr}\left(Q^{-1}\Omega\right) = k = \dim(\theta),$$

so

$$\hat T = L(\hat\theta) + k.$$

This is the Akaike Information Criterion (AIC). It is typically written as $2\hat T$:

$$AIC = 2L(\hat\theta) + 2k.$$

AIC is an estimate of the expected KLIC, based on the approximation that $g$ includes the correct model. Picking the model with the smallest AIC is picking the model with the smallest estimated KLIC. In this sense it is picking the best-fitting model.
15.6 TIC
Takeuchi (1976) proposed a robust version of the AIC, which is known as the Takeuchi Information Criterion (TIC):

$$TIC = 2L(\hat\theta) + 2\,\mathrm{tr}\left(\hat Q^{-1}\hat\Omega\right)$$

where

$$\hat Q = -\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y_i,\hat\theta)$$

$$\hat\Omega = \frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log g(y_i,\hat\theta)\;\frac{\partial}{\partial\theta}\log g(y_i,\hat\theta)'.$$
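A minimal sketch computing AIC and TIC for a deliberately misspecified Gaussian model, using the analytic scores and Hessian of the Gaussian log-density; the chi-square data-generating choice and sample size are assumptions for illustration.

```python
# Sketch: AIC and TIC for a Gaussian model N(mu, s2) fit by quasi-MLE to data whose
# true density is chi-square(3), so the model is misspecified (illustrative choice).
import numpy as np

rng = np.random.default_rng(0)
y = rng.chisquare(df=3, size=500)
n = y.size

mu, s2 = y.mean(), y.var()                     # Gaussian quasi-MLE of (mu, sigma^2)
u = y - mu

# per-observation scores of log g(y_i; theta) at the QMLE
scores = np.column_stack([u / s2, -0.5 / s2 + u**2 / (2 * s2**2)])
Omega = scores.T @ scores / n

# Q-hat = minus the average Hessian of log g(y_i; theta) at the QMLE
Q = np.array([[1.0 / s2,            np.mean(u) / s2**2],
              [np.mean(u) / s2**2, -0.5 / s2**2 + np.mean(u**2) / s2**3]])

L = 0.5 * n * np.log(2 * np.pi * s2) + np.sum(u**2) / (2 * s2)   # negative log-likelihood
aic = 2 * L + 2 * 2                                               # k = dim(theta) = 2
tic = 2 * L + 2 * np.trace(np.linalg.solve(Q, Omega))
print(aic, tic)   # the TIC penalty typically exceeds 2k for skewed, fat-tailed data
```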
The AIC is known to be asymptotically optimal in linear regression (we discuss this below), but in the general context I do not know of an optimality result. The desired optimality would be that if a model is selected by minimizing AIC (or TIC), then the fitted KLIC of this model is asymptotically equivalent to the KLIC of the infeasible best-fitting model.
Consider the linear projection model

$$y_i = X_i'\beta + e_i$$

$$E\left(X_i e_i\right) = 0.$$

AIC or TIC cannot be directly applied, as the density of $e_i$ is unspecified. However, the LS estimator is the same as the Gaussian MLE, so it is natural to calculate the AIC or TIC for the Gaussian quasi-MLE.
The Gaussian quasi-likelihood is

$$\log g_i(\theta) = -\frac12\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\left(y_i - X_i'\beta\right)^2$$

where $\theta = (\beta,\sigma^2)$ and $\sigma^2 = E\,e_i^2$. The MLE $\hat\theta = (\hat\beta,\hat\sigma^2)$ is LS. The pseudo-true value $\theta_0$ is the projection coefficient $\beta = \left(E\left(X_iX_i'\right)\right)^{-1}E\left(X_iy_i\right)$ together with $\sigma^2 = E\,e_i^2$. If $\beta$ is $k\times 1$ then the number of parameters is $k+1$.
The negative sample log-likelihood, evaluated at the MLE, is $L(\hat\theta) = \frac{n}{2}\log\left(2\pi\hat\sigma^2\right) + \frac{n}{2}$, so up to constants that are common across models,

$$AIC = n\log\hat\sigma^2 + 2(k+1).$$

Dividing by $n$ and dropping common terms, this is often written as

$$AIC = \log\hat\sigma^2 + \frac{2k}{n}.$$
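A minimal sketch, assuming a simulated nonlinear regression, of choosing a polynomial order by the $AIC = n\log\hat\sigma^2 + 2(k+1)$ formula above; the design and noise level are illustrative choices.

```python
# Sketch: choose among nested polynomial regressions by AIC = n log(sigma2_hat) + 2(k+1).
# The data-generating process below is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + 0.5 * rng.standard_normal(n)

def aic_for_order(p):
    X = np.vander(x, N=p + 1, increasing=True)      # 1, x, ..., x^p  (k = p+1 regressors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    return n * np.log(sigma2) + 2 * (X.shape[1] + 1)

orders = range(0, 8)
aics = [aic_for_order(p) for p in orders]
print(min(zip(aics, orders)))                        # order with the smallest estimated KLIC
```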
Also,

$$\frac{\partial}{\partial\beta}\log g(y_i,\theta) = \frac{1}{\sigma^2}X_i\left(y_i - X_i'\beta\right)$$

$$\frac{\partial}{\partial\sigma^2}\log g(y_i,\theta) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}\left(y_i - X_i'\beta\right)^2,$$

and

$$\frac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i,\theta) = -\frac{1}{\sigma^2}X_iX_i'$$

$$\frac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i,\theta) = -\frac{1}{\sigma^4}X_i\left(y_i - X_i'\beta\right)$$

$$\frac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i,\theta) = \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}\left(y_i - X_i'\beta\right)^2.$$

Evaluated at $\theta_0$,

$$\frac{\partial}{\partial\beta}\log g(y_i,\theta_0) = \frac{1}{\sigma^2}X_ie_i$$

$$\frac{\partial}{\partial\sigma^2}\log g(y_i,\theta_0) = \frac{1}{2\sigma^4}\left(e_i^2 - \sigma^2\right),$$

and

$$\frac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i,\theta_0) = -\frac{1}{\sigma^2}X_iX_i'$$

$$\frac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i,\theta_0) = -\frac{1}{\sigma^4}X_ie_i$$

$$\frac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i,\theta_0) = \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}e_i^2.$$

Thus

$$Q = -E\begin{bmatrix}\dfrac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i,\theta_0) & \dfrac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i,\theta_0)\\[6pt] \dfrac{\partial^2}{\partial\sigma^2\,\partial\beta'}\log g(y_i,\theta_0) & \dfrac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i,\theta_0)\end{bmatrix} = \sigma^{-2}\begin{bmatrix}E\left(X_iX_i'\right) & 0\\[3pt] 0 & \dfrac{1}{2\sigma^2}\end{bmatrix}$$
and

$$\Omega = E\begin{bmatrix}\dfrac{\partial}{\partial\beta}\log g(y_i,\theta_0)\,\dfrac{\partial}{\partial\beta}\log g(y_i,\theta_0)' & \dfrac{\partial}{\partial\beta}\log g(y_i,\theta_0)\,\dfrac{\partial}{\partial\sigma^2}\log g(y_i,\theta_0)\\[6pt] \dfrac{\partial}{\partial\sigma^2}\log g(y_i,\theta_0)\,\dfrac{\partial}{\partial\beta}\log g(y_i,\theta_0)' & \left(\dfrac{\partial}{\partial\sigma^2}\log g(y_i,\theta_0)\right)^2\end{bmatrix}$$

$$= \sigma^{-2}\begin{bmatrix}E\left(X_iX_i'\dfrac{e_i^2}{\sigma^2}\right) & \dfrac{1}{2\sigma^4}E\left(X_ie_i^3\right)\\[6pt] \dfrac{1}{2\sigma^4}E\left(X_i'e_i^3\right) & \dfrac{\kappa_4}{4\sigma^2}\end{bmatrix}$$

where

$$\kappa_4 = \mathrm{var}\left(\frac{e_i^2}{\sigma^2}\right) = \frac{E\left(e_i^2 - \sigma^2\right)^2}{\sigma^4} = \frac{E\,e_i^4 - \sigma^4}{\sigma^4}.$$

We see that $\Omega = Q$ if

$$E\left(\frac{e_i^2}{\sigma^2}\,\Big|\,X_i\right) = 1$$

$$E\left(X_ie_i^3\right) = 0$$

$$\kappa_4 = 2.$$
Let

$$\hat\kappa_4 = \frac{1}{n\hat\sigma^4}\sum_{i=1}^n\left(\hat e_i^2 - \hat\sigma^2\right)^2.$$

Then

$$TIC = n\log\hat\sigma^2 + 2\,\mathrm{tr}\left(\hat Q^{-1}\hat\Omega\right) = n\log\hat\sigma^2 + 2\left[\mathrm{tr}\left(\left(\sum_{i=1}^nX_iX_i'\right)^{-1}\sum_{i=1}^nX_iX_i'\frac{\hat e_i^2}{\hat\sigma^2}\right) + \frac{\hat\kappa_4}{2}\right] = n\log\hat\sigma^2 + \frac{2}{\hat\sigma^2}\sum_{i=1}^n h_i\hat e_i^2 + \hat\kappa_4$$

where $h_i = X_i'\left(X'X\right)^{-1}X_i$.
When the errors are close to homoskedastic and Gaussian, then $h_i$ and $\hat e_i^2$ will be approximately uncorrelated and $\hat\kappa_4$ will be close to 2, so the penalty will be close to

$$2\sum_{i=1}^n h_i + 2 = 2(k+1),$$

as for AIC. In this case TIC will be close to AIC. In applications, the differences will arise under heteroskedasticity and non-Gaussianity.
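A minimal sketch comparing the AIC penalty $2(k+1)$ with the TIC penalty $\tfrac{2}{\hat\sigma^2}\sum_i h_i\hat e_i^2 + \hat\kappa_4$ on a simulated heteroskedastic design; the regressors and error process are illustrative assumptions.

```python
# Sketch: AIC penalty 2(k+1) versus the regression TIC penalty
# (2/sigma2_hat) * sum_i h_i * ehat_i^2 + kappa4_hat, under heteroskedastic errors.
# The design and error process are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
e = (0.5 + np.abs(X[:, 1])) * rng.standard_normal(n)    # heteroskedastic errors
y = X @ np.ones(k) + e

beta = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ beta
sigma2 = ehat @ ehat / n
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)   # leverage values h_i
kappa4 = np.mean((ehat**2 - sigma2)**2) / sigma2**2

aic_penalty = 2 * (k + 1)
tic_penalty = 2 * np.sum(h * ehat**2) / sigma2 + kappa4
print(aic_penalty, tic_penalty)   # they differ when h_i and ehat_i^2 are related or kappa4 != 2
```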
The primary use of AIC and TIC is to compare models. As we change models, typically the residuals $\hat e_i$ do not change too much, so my guess is that the estimate $\hat\kappa_4$ will not change much. In this event, the TIC correction for the estimation of $\sigma^2$ will not matter much.
Let $\tilde\sigma^2$ denote a preliminary estimate of $\sigma^2$ that does not vary across the models under comparison. Then

$$\tilde\sigma^2\left(AIC - n\log\tilde\sigma^2 + n\right) = n\tilde\sigma^2\left(\log\frac{\hat\sigma^2}{\tilde\sigma^2} + 1\right) + 2\tilde\sigma^2 k \simeq n\tilde\sigma^2\,\frac{\hat\sigma^2}{\tilde\sigma^2} + 2\tilde\sigma^2 k = \hat e'\hat e + 2\tilde\sigma^2 k = C_k.$$

The approximation is $\log(1+a) \simeq a$ for small $a$. This is the Mallows criterion. Thus AIC is approximately equal to Mallows, and the approximation is close when $k/n$ is small.
Furthermore, this expression approximately equals

$$\hat e'\hat e\left(1 + \frac{2k}{n}\right) = S_k,$$

the cross-validation criterion. Thus $TIC \simeq CV$.
They are both asymptotically equivalent to a “Heteroskedastic-Robust Mallows Criterion”

$$\bar C_k = \hat e'\hat e + 2\sum_{i=1}^n h_i\hat e_i^2.$$
Consider the nonparametric regression model

$$y_i = g(X_i) + e_i = g_i + e_i$$

$$E\left(e_i\mid X_i\right) = 0$$

$$E\left(e_i^2\mid X_i\right) = \sigma^2.$$

Written as an $n\times 1$ vector,

$$y = g + e.$$

Li assumed that the $X_i$ are non-random, but his analysis can be re-interpreted by treating everything as conditional on the $X_i$.
Li considered estimators of the $n\times 1$ vector $g$ which are linear in $y$ and thus take the form

$$\hat g(h) = M(h)y,$$

where $M(h)$ is an $n\times n$ matrix indexed by $h\in H$. The Mallows criterion for selecting $h$ is

$$C(h) = \frac1n\left(y - \hat g(h)\right)'\left(y - \hat g(h)\right) + \frac{2\sigma^2}{n}\,\mathrm{tr}\,M(h).$$

The first term is the residual variance from model $h$; the second is the penalty. For series estimators, $\mathrm{tr}\,M(h) = k_h$. The Mallows-selected index $\hat h$ minimizes $C(h)$.
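A minimal sketch of Mallows selection over polynomial (series) fits, using $C(h) = n^{-1}\mathrm{SSR}(h) + 2\sigma^2 k_h/n$; the regression function, noise level, and the plug-in estimate of $\sigma^2$ from a deliberately large model are assumptions.

```python
# Sketch: Mallows selection of a polynomial series order h, with
# C(h) = SSR(h)/n + 2 * sigma2 * k_h / n, where k_h = tr(M(h)) for a projection fit.
# The data-generating process and the plug-in sigma2 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 1, size=n)
g = np.exp(x) * np.sin(5 * x)
y = g + 0.3 * rng.standard_normal(n)

def fit(order):
    X = np.vander(x, N=order + 1, increasing=True)
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return yhat, X.shape[1]                      # k_h = number of series terms

# preliminary sigma^2 from a deliberately large model
sigma2 = np.mean((y - fit(12)[0])**2)

def mallows(order):
    yhat, kh = fit(order)
    return np.mean((y - yhat)**2) + 2 * sigma2 * kh / n

orders = range(1, 11)
h_hat = min(orders, key=mallows)
print(h_hat, [round(mallows(p), 4) for p in orders])
```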
Since $y = e + g$, then $y - \hat g(h) = e + g - \hat g(h)$, so

$$C(h) = \frac1n\left(y - \hat g(h)\right)'\left(y - \hat g(h)\right) + \frac{2\sigma^2}{n}\,\mathrm{tr}\,M(h) = \frac1n e'e + L(h) + \frac2n e'\left(g - \hat g(h)\right) + \frac{2\sigma^2}{n}\,\mathrm{tr}\,M(h),$$

where $L(h) = \frac1n\left(g - \hat g(h)\right)'\left(g - \hat g(h)\right)$.
Moreover, since

$$\hat g(h) = M(h)y = M(h)g + M(h)e,$$

we have $e'\left(g - \hat g(h)\right) = e'b(h) - e'M(h)e$, where $b(h) = \left(I_n - M(h)\right)g$, so

$$C(h) = \frac1n e'e + L(h) + \frac2n e'b(h) + \frac2n\left(\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right).$$

As the first term does not involve $h$, it follows that $\hat h$ minimizes

$$C^*(h) = L(h) + \frac2n e'b(h) + \frac2n\left(\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right)$$

over $h\in H$. The idea is that the empirical criterion $C^*(h)$ equals the desired criterion $L(h)$ plus a stochastically small error.
We calculate that

$$L(h) = \frac1n\left(g - \hat g(h)\right)'\left(g - \hat g(h)\right) = \frac1n b(h)'b(h) - \frac2n b(h)'M(h)e + \frac1n e'M(h)'M(h)e$$

and

$$R(h) = E\left(L(h)\mid X\right) = \frac1n b(h)'b(h) + \frac1n E\left(e'M(h)'M(h)e\mid X\right) = \frac1n b(h)'b(h) + \frac{\sigma^2}{n}\,\mathrm{tr}\left(M(h)'M(h)\right).$$
Theorem 1 (Li, 1987). Let $\lambda_{\max}(A)$ denote the maximum eigenvalue of the matrix $A$, and suppose $\limsup_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty$. If for some positive integer $m$,

$$E\left(e_i^{4m}\mid X_i\right) \le \kappa < \infty$$

and

$$\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0, \qquad (1)$$

then

$$\frac{L(\hat h)}{\inf_{h\in H}L(h)} \to_p 1.$$
The proof uses moment bounds of the following form: if $E\left(e_i^{2s}\mid X_i\right) \le \kappa < \infty$, then for any conformable vector $b$ and matrix $A$ that are functions of $X$,

$$E\left(\left|b'e\right|^s \mid X\right) \le K_{1s}\left(b'b\right)^{s/2} \qquad (2)$$

and

$$E\left(\left|e'Ae - E\left(e'Ae\mid X\right)\right|^s \mid X\right) \le K_{2s}\left(\mathrm{tr}\left(A'A\right)\right)^{s/2} \qquad (3)$$

where $K_{1s}$ and $K_{2s}$ are constants depending only on $s$ and $\kappa$. The key step is to show

$$\sup_h \frac{\left|C^*(h) - L(h)\right|}{L(h)} \to_p 0. \qquad (4)$$
Let $h_0$ denote the minimizer of $L(h)$. Then

$$0 \le \frac{L(\hat h) - L(h_0)}{L(\hat h)}$$

$$= \frac{C^*(\hat h) - L(h_0)}{L(\hat h)} - \frac{C^*(\hat h) - L(\hat h)}{L(\hat h)}$$

$$= \frac{C^*(\hat h) - L(h_0)}{L(\hat h)} + o_p(1)$$

$$\le \frac{C^*(h_0) - L(h_0)}{L(\hat h)} + o_p(1)$$

$$\le \frac{\left|C^*(h_0) - L(h_0)\right|}{L(h_0)} + o_p(1)$$

$$= o_p(1).$$

This uses (4) twice, and the facts $L(h_0) \le L(\hat h)$ and $C^*(\hat h) \le C^*(h_0)$. This shows that

$$\frac{L(h_0)}{L(\hat h)} \to_p 1.$$
We also use the fact (established below) that

$$\sup_h \left|\frac{L(h)}{R(h)} - 1\right| \to_p 0 \qquad (5)$$

which says that $L(h)$ and $R(h)$ are asymptotically equivalent, and thus (4) is equivalent to

$$\sup_h \frac{\left|C^*(h) - L(h)\right|}{R(h)} \to_p 0. \qquad (6)$$
Now

$$\sup_h \frac{\left|C^*(h) - L(h)\right|}{R(h)} \le 2\sup_h \frac{\left|e'b(h)\right|}{nR(h)} + 2\sup_h \frac{\left|\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right|}{nR(h)}. \qquad (7)$$
By (2) with $s = 2m$,

$$E\left(\left|e'b(h)\right|^{2m} \mid X\right) \le K\left(b(h)'b(h)\right)^m.$$

Now recall

$$nR(h) = b(h)'b(h) + \sigma^2\,\mathrm{tr}\left(M(h)'M(h)\right). \qquad (8)$$

Thus $nR(h) \ge b(h)'b(h)$. Hence

$$E\left(\left|e'b(h)\right|^{2m} \mid X\right) \le K\left(b(h)'b(h)\right)^m \le K\left(nR(h)\right)^m.$$

Applying Markov's inequality, for any $\delta > 0$,

$$P\left(\sup_h \frac{\left|e'b(h)\right|}{nR(h)} > \delta \,\Big|\, X\right) \le \sum_{h\in H}\frac{E\left(\left|e'b(h)\right|^{2m}\mid X\right)}{\delta^{2m}\left(nR(h)\right)^{2m}} \le \sum_{h\in H}\frac{K\left(nR(h)\right)^m}{\delta^{2m}\left(nR(h)\right)^{2m}} = \frac{K}{\delta^{2m}}\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0.$$
Second, since $E\left(e'M(h)e \mid X\right) = \sigma^2\,\mathrm{tr}\,M(h)$, (3) with $s = 2m$ gives

$$E\left(\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|^{2m} \mid X\right) \le K\left(\mathrm{tr}\left(M(h)'M(h)\right)\right)^m \le K\,\frac{\left(nR(h)\right)^m}{\sigma^{2m}},$$

using $\sigma^2\,\mathrm{tr}\left(M(h)'M(h)\right) \le nR(h)$. Applying Markov's inequality,

$$P\left(\sup_h \frac{\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|}{nR(h)} > \delta \,\Big|\, X\right) \le \sum_{h\in H} P\left(\frac{\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|}{nR(h)} > \delta \,\Big|\, X\right)$$

$$\le \sum_{h\in H}\frac{E\left(\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|^{2m}\mid X\right)}{\delta^{2m}\left(nR(h)\right)^{2m}} \le \sum_{h\in H}\frac{\sigma^{-2m}K\left(nR(h)\right)^m}{\delta^{2m}\left(nR(h)\right)^{2m}} = \frac{K}{\left(\delta^2\sigma^2\right)^m}\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0.$$
For completeness, let us show (5). The demonstration is essentially the same as the above. We calculate

$$L(h) - R(h) = -\frac2n b(h)'M(h)e + \frac1n e'M(h)'M(h)e - \frac{\sigma^2}{n}\,\mathrm{tr}\left(M(h)'M(h)\right)$$

$$= -\frac2n b(h)'M(h)e + \frac1n\left(e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right).$$

Thus

$$\sup_h\frac{\left|L(h) - R(h)\right|}{R(h)} \le 2\sup_h\frac{\left|e'M(h)'b(h)\right|}{nR(h)} + 2\sup_h\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|}{nR(h)}.$$
By (2),

$$E\left(\left|e'M(h)'b(h)\right|^{2m}\mid X\right) \le K\left(b(h)'M(h)M(h)'b(h)\right)^m.$$

Letting

$$\bar\lambda_M = \limsup_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty,$$

we have $b(h)'M(h)M(h)'b(h) \le \bar\lambda_M^2\, b(h)'b(h)$, so

$$E\left(\left|e'M(h)'b(h)\right|^{2m}\mid X\right) \le K\left(b(h)'M(h)M(h)'b(h)\right)^m \le K\bar\lambda_M^{2m}\left(nR(h)\right)^m.$$

Thus, by Markov's inequality,

$$P\left(\sup_h\frac{\left|e'M(h)'b(h)\right|}{nR(h)} > \delta\,\Big|\,X\right) \le \sum_{h\in H}\frac{E\left(\left|e'M(h)'b(h)\right|^{2m}\mid X\right)}{\delta^{2m}\left(nR(h)\right)^{2m}} \le \sum_{h\in H}\frac{K\bar\lambda_M^{2m}\left(nR(h)\right)^m}{\delta^{2m}\left(nR(h)\right)^{2m}} = \frac{K\bar\lambda_M^{2m}}{\delta^{2m}}\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0.$$
Similarly, by (3),

$$E\left(\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|^{2m}\mid X\right) \le K\left(\mathrm{tr}\left(M(h)'M(h)M(h)'M(h)\right)\right)^m$$

$$\le K\bar\lambda_M^{2m}\left(\mathrm{tr}\left(M(h)'M(h)\right)\right)^m \le K\,\frac{\bar\lambda_M^{2m}}{\sigma^{2m}}\left(nR(h)\right)^m,$$

and thus

$$P\left(\sup_h\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|}{nR(h)} > \delta\,\Big|\,X\right) \le \sum_{h\in H}P\left(\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|}{nR(h)} > \delta\,\Big|\,X\right)$$

$$\le \sum_{h\in H}\frac{E\left(\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|^{2m}\mid X\right)}{\delta^{2m}\left(nR(h)\right)^{2m}} \le \sum_{h\in H}\frac{K\bar\lambda_M^{2m}\sigma^{-2m}\left(nR(h)\right)^m}{\delta^{2m}\left(nR(h)\right)^{2m}} = K\left(\frac{\bar\lambda_M^2}{\delta^2\sigma^2}\right)^m\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0.$$
We have shown

$$\sup_h\frac{\left|L(h) - R(h)\right|}{R(h)} \to_p 0,$$

which is (5).
15.13 Mallows Model Selection
Li’s Theorem 1 applies to a variety of linear estimators. Of particular interest is model selection
(e.g. series estimation).
Let's verify Li's conditions, which were

$$\limsup_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty,$$

$$E\left(e_i^{4m}\mid X_i\right) \le \kappa < \infty,$$

$$\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0. \qquad (9)$$

For selection among least-squares fits, $M(h)$ is a projection matrix, so $\lambda_{\max}\left(M(h)\right) = 1$ and the first condition is automatically satisfied.
The key is equation (9). Suppose that for sample size $n$ there are $N_n$ models. Let

$$\xi_n = \inf_{h\in H} nR(h)$$

and assume $\xi_n \to\infty$. A crude bound is

$$\sum_{h\in H}\left(nR(h)\right)^{-m} \le N_n\,\xi_n^{-m}.$$

If $N_n\,\xi_n^{-m} \to 0$ then (9) holds. Notice that by increasing $m$ we can allow for a larger $N_n$ (more models), but at the cost of a stronger moment bound.

The condition $\xi_n \to\infty$ says that for all finite models $h$ there is non-zero approximation error, so that $R(h)$ is non-zero. In contrast, if there is a finite dimensional model $h_0$ for which $b(h_0) = 0$, then $nR(h_0) = k_{h_0}\sigma^2$ does not diverge. In this case, Mallows (and AIC) are asymptotically sub-optimal.
We can improve this condition if we consider the case of selection among models of increasing size. Suppose that model $h$ has $k_h$ regressors, with $k_1 < k_2 < \cdots$, and for some $m \ge 2$,

$$\sum_{h=1}^{\infty} k_h^{-m} < \infty.$$

Now pick $B_n \to\infty$ so that $B_n\,\xi_n^{-m} \to 0$ (which is possible since $\xi_n\to\infty$). Then

$$\sum_{h=1}^{\infty}\left(nR(h)\right)^{-m} = \sum_{h=1}^{B_n}\left(nR(h)\right)^{-m} + \sum_{h=B_n+1}^{\infty}\left(nR(h)\right)^{-m} \le B_n\,\xi_n^{-m} + \sigma^{-2m}\sum_{h=B_n+1}^{\infty}k_h^{-m} \to 0$$

as required, using $nR(h) \ge \sigma^2\,\mathrm{tr}\left(M(h)'M(h)\right) = \sigma^2 k_h$.
where $|c| - p$ is the number of overidentifying restrictions and $r_n$ is a penalty sequence. For an AIC-like criterion, Andrews sets $r_n = 2$; for a BIC-like criterion, he sets $r_n = \log n$. The model selection rule picks the moment conditions $c$ which minimize $J_n(c)$. Assuming that a subset of the moments is incorrect, Andrews shows that the BIC-like rule asymptotically selects the correct subset.
Andrews and Lu (2001, JoE) extend the above analysis to the case of jointly picking the moments and the parameter vector (that is, imposing zero restrictions on the parameters). They show that the same criterion has similar properties: that it can asymptotically select the "correct" moments and "correct" zero restrictions.

Hong, Preston and Shum (ET, 2003) extend the analysis of the above papers to empirical likelihood. They show that this criterion has the same interpretation when $J_n(c)$ is replaced by the empirical likelihood.

These papers are an interesting first step, but they do not address the issue of GMM selection when the true model is potentially infinite dimensional and/or misspecified. That is, the analysis is not analogous to that of Li (1987) for the regression model.
In order to properly understand GMM selection, I believe we need to understand the behavior of GMM under misspecification. Hall and Inoue (2003, JoE) is one of the few contributions on GMM under misspecification. They did not investigate model selection.
Suppose that the model is

$$m(\theta) = E\,m_i(\theta) = 0$$

where $m_i$ is $\ell\times 1$ and $\theta$ is $k\times 1$. Assume $\ell > k$ (overidentification). The model is misspecified if there is no $\theta$ such that this moment condition holds. That is, for all $\theta$,

$$m(\theta) \ne 0.$$

The GMM criterion is

$$J_n(\theta) = n\,\bar m_n(\theta)'\,W_n\,\bar m_n(\theta).$$

If $W_n \to_p W$, then

$$n^{-1}J_n(\theta) \to_p m(\theta)'W\,m(\theta),$$

so the GMM estimator converges in probability to the pseudo-true value

$$\theta_0(W) = \mathrm{argmin}_\theta\; m(\theta)'W\,m(\theta).$$
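A minimal sketch, with a hypothetical two-moment model for exponential data, of computing the sample analog of the pseudo-true value $\theta_0(W) = \mathrm{argmin}_\theta\, m(\theta)'W m(\theta)$; the moment functions and weight matrix are illustrative choices.

```python
# Sketch: pseudo-true value theta_0(W) = argmin m(theta)' W m(theta) for a misspecified,
# over-identified model. The two moments below (mean and second moment of y tied to a
# single parameter) are a hypothetical example; sample means approximate the population m.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
y = rng.exponential(scale=2.0, size=5000)

def mbar(theta):
    # model: E(y) = theta and E(y^2) = 2*theta; both cannot hold for this data
    return np.array([np.mean(y) - theta, np.mean(y**2) - 2.0 * theta])

W = np.eye(2)                                   # weight matrix (identity for illustration)
crit = lambda theta: mbar(theta) @ W @ mbar(theta)
theta0 = minimize_scalar(crit, bounds=(0.0, 20.0), method='bounded').x
print(theta0)        # depends on W: a different W gives a different pseudo-true value
```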
Consider now a moment condition model which specifies

$$E\,m(y) = 0,$$

while under the true density $f$ the restriction fails:

$$E\,m(y) \ne 0.$$
Temporarily ignoring parameter estimation, we can ask: Which density $g(y)$ satisfying this moment condition is closest to $f(y)$ in the sense of minimizing KLIC? We can call this $g_0(y)$ the pseudo-true density.

The solution is nicely explained in Appendix A of Chen, Hong, and Shum (JoE, 2007). Recall
$$\mathrm{KLIC}(f,g) = \int f(y)\log\frac{f(y)}{g(y)}\,dy.$$
The problem is

$$\min_g\;\mathrm{KLIC}(f,g)$$

subject to

$$\int g(y)\,dy = 1$$

$$\int m(y)\,g(y)\,dy = 0.$$
The Lagrangian is

$$\int f(y)\log\frac{f(y)}{g(y)}\,dy + \mu\left(\int g(y)\,dy - 1\right) + \lambda'\int m(y)\,g(y)\,dy$$

with FOC (with respect to $g(y)$)

$$0 = -\frac{f(y)}{g(y)} + \mu + \lambda'm(y).$$
The pseudo-true density is

$$g_0(y) = \frac{f(y)}{1 + \lambda_0'm(y)},$$

with associated minimized KLIC

$$\mathrm{KLIC}(f,g_0) = \int f(y)\log\left(1 + \lambda_0'm(y)\right)dy = E\log\left(1 + \lambda_0'm(y)\right).$$

When the moment function $m(y,\theta)$ depends on a parameter $\theta$, the pseudo-true values $(\theta_0,\lambda_0)$ solve the saddle-point problem

$$\min_\theta\max_\lambda\; E\log\left(1 + \lambda'm(y,\theta)\right).$$
Theorem (Chen, Hong and Shum, JoE, 2007). If $|m(y,\theta)|$ is bounded, then the EL estimates $(\hat\theta,\hat\lambda)$ are $n^{-1/2}$-consistent for the pseudo-true values $(\theta_0,\lambda_0)$.

This gives a simple interpretation to the definition of KLIC under misspecification.

Schennach's paper leaves open the question: for what is $\hat\theta$ consistent? Is there a pseudo-true value? One possibility is that the pseudo-true value $\theta_n$ needs to be indexed by sample size. (This idea is used in Hal White's work.) Nevertheless, Schennach's theorem suggests that empirical likelihood is non-robust to misspecification.
15.17 Exponential Tilting
Instead of

$$\mathrm{KLIC}(f,g) = \int f(y)\log\frac{f(y)}{g(y)}\,dy,$$

consider the reverse distance

$$\mathrm{KLIC}(g,f) = \int g(y)\log\frac{g(y)}{f(y)}\,dy.$$

The problem is to minimize $\mathrm{KLIC}(g,f)$ over $g$ subject to

$$\int g(y)\,dy = 1$$

$$\int m(y)\,g(y)\,dy = 0.$$
The Lagrangian is

$$\int g(y)\log\frac{g(y)}{f(y)}\,dy - \mu\left(\int g(y)\,dy - 1\right) - \lambda'\int m(y)\,g(y)\,dy$$

with FOC

$$0 = \log\frac{g(y)}{f(y)} + 1 - \mu - \lambda'm(y).$$
Solving,

$$g(y) = f(y)\exp\left(-1+\mu\right)\exp\left(\lambda'm(y)\right).$$

Imposing $\int g(y)\,dy = 1$ we find

$$g(y) = \frac{f(y)\exp\left(\lambda'm(y)\right)}{\int f(y)\exp\left(\lambda'm(y)\right)dy}. \qquad (10)$$
Inserting this into $\mathrm{KLIC}(g,f)$ we find

$$\mathrm{KLIC}(g,f) = \int g(y)\log\left(\frac{\exp\left(\lambda'm(y)\right)}{\int f(y)\exp\left(\lambda'm(y)\right)dy}\right)dy$$

$$= \lambda'\int m(y)\,g(y)\,dy - \int g(y)\,dy\cdot\log\int f(y)\exp\left(\lambda'm(y)\right)dy$$

$$= -\log\int f(y)\exp\left(\lambda'm(y)\right)dy \qquad (11)$$

$$= -\log E\exp\left(\lambda'm(y)\right). \qquad (12)$$

The value of $\lambda$ which satisfies the moment constraint is

$$\lambda_0 = \mathrm{argmin}_\lambda\; E\exp\left(\lambda'm(y)\right). \qquad (13)$$

The pseudo-true density $g_0(y)$ is (10) with this $\lambda_0$, with associated minimized KLIC (11). This is the smallest possible KLIC(g,f) for moment condition models.
Notice: the $g_0$ which minimizes KLIC(g,f) and the $g_0$ which minimizes KLIC(f,g) are different.

In contrast to the EL case, the ET problem (13) does not restrict $\lambda$, and there are no “trouble spots”. Thus ET is more robust than EL. The pseudo-true $\lambda_0$ and $g_0$ are well defined under misspecification, unlike EL.
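A minimal sketch of the sample analog of (13): $\lambda_0$ is approximated by minimizing $n^{-1}\sum_i\exp(\lambda'm(y_i))$, and the tilted weights implement (10); the single moment $m(y) = y - \mu_0$ with a deliberately wrong $\mu_0$ is a hypothetical example of misspecification.

```python
# Sketch: exponential tilting with a single (misspecified) moment m(y) = y - mu0.
# lambda0 is approximated by minimizing the sample analog of E exp(lambda * m(y)),
# and the tilted weights are the sample analog of (10). mu0 is a hypothetical choice.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
y = rng.standard_normal(2000)
mu0 = 0.5                                   # wrong value: the true mean is 0, so E m(y) != 0
m = y - mu0

obj = lambda lam: np.mean(np.exp(lam * m))  # convex in lambda
lam0 = minimize_scalar(obj, bounds=(-5.0, 5.0), method='bounded').x

w = np.exp(lam0 * m)
w /= w.sum()                                # tilted weights: sample analog of g0(y)/f(y)
print(lam0, np.sum(w * m))                  # second value is ~0: the constraint holds under the tilt
```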
When the moment $m(y,\theta)$ depends on a parameter $\theta$, then the pseudo-true values $(\theta_0,\lambda_0)$ are the joint solution to the problem

$$\max_\theta\min_\lambda\; E\exp\left(\lambda'm(y,\theta)\right).$$
The sample analog of this problem chooses probabilities $(p_1,\ldots,p_n)$ and $\theta$ to minimize $\sum_{i=1}^n p_i\log p_i$ subject to

$$\sum_{i=1}^n p_i = 1$$

$$\sum_{i=1}^n p_i\,m(y_i,\theta) = 0.$$
First, we concentrate out the probabilities. For any $\theta$, the Lagrangian is

$$\sum_{i=1}^n p_i\log p_i - \mu\left(\sum_{i=1}^n p_i - 1\right) - \lambda'\sum_{i=1}^n p_i\,m(y_i,\theta)$$

with FOC

$$0 = \log\hat p_i + 1 - \mu - \lambda'm(y_i,\theta).$$

Solving and imposing the constraints,

$$\hat p_i(\theta) = \frac{\exp\left(\lambda'm(y_i,\theta)\right)}{\sum_{i=1}^n\exp\left(\lambda'm(y_i,\theta)\right)},$$

where $\lambda = \hat\lambda(\theta)$ is chosen so that $\sum_{i=1}^n\hat p_i(\theta)\,m(y_i,\theta) = 0$.
When $\lambda = 0$ then $\hat p_i = n^{-1}$, the same as EL. The concentrated "entropy" criterion is then

$$\sum_{i=1}^n\hat p_i(\theta)\log\hat p_i(\theta) = \hat\lambda(\theta)'\sum_{i=1}^n\hat p_i(\theta)\,m(y_i,\theta) - \log\left(\sum_{i=1}^n\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)\right) = -\log\left(\sum_{i=1}^n\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)\right),$$

using the moment constraint $\sum_{i=1}^n\hat p_i(\theta)\,m(y_i,\theta) = 0$.
The ETEL (exponentially tilted empirical likelihood) criterion is

$$ETEL(\theta) = \sum_{i=1}^n\log\hat p_i(\theta) = \hat\lambda(\theta)'\sum_{i=1}^n m(y_i,\theta) - n\log\left(\sum_{i=1}^n\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)\right)$$

where

$$\hat p_i(\theta) = \frac{\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)}{\sum_{i=1}^n\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)}$$

and

$$\hat\lambda(\theta) = \mathrm{argmin}_\lambda\;\sum_{i=1}^n\exp\left(\lambda'm(y_i,\theta)\right).$$
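A minimal sketch of the profile computation above: for each $\theta$, $\hat\lambda(\theta)$ minimizes $\sum_i\exp(\lambda'm(y_i,\theta))$, the $\hat p_i(\theta)$ follow, and $ETEL(\theta)$ is evaluated on a grid; the scalar moment $m(y,\theta) = y - \theta$ is a hypothetical just-identified example.

```python
# Sketch of the ETEL profile: lambda_hat(theta) minimizes sum_i exp(lambda * m_i(theta)),
# p_i(theta) are the tilted probabilities, and ETEL(theta) = sum_i log p_i(theta).
# The moment m(y, theta) = y - theta is a hypothetical just-identified example.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
y = rng.exponential(scale=1.5, size=1000)

def lambda_hat(theta):
    m = y - theta
    return minimize_scalar(lambda lam: np.sum(np.exp(lam * m)),
                           bounds=(-10.0, 10.0), method='bounded').x

def etel(theta):
    m = y - theta
    lam = lambda_hat(theta)
    logp = lam * m - np.log(np.sum(np.exp(lam * m)))   # log of p_i(theta)
    return np.sum(logp)

grid = np.linspace(0.5, 3.0, 51)
theta_hat = grid[np.argmax([etel(t) for t in grid])]
print(theta_hat)          # close to the sample mean of y in this just-identified case
```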
Under misspecification, the pseudo-true values $(\theta_0,\lambda_0)$ are generically well defined, and minimize a KLIC analog. Moreover,

$$\sqrt n\left(\hat\theta - \theta_0\right) \to_d N\left(0,\; G^{-1}\Omega\left(G^{-1}\right)'\right)$$

where $G = E\dfrac{\partial}{\partial\theta'}m(y,\theta_0)$ and $\Omega = E\left(m(y,\theta_0)\,m(y,\theta_0)'\right)$.