15.1 KLIC
Y
Suppose a random sample is y = y1 ; ; :::; yn has unknown density f (y) = f (yi ):
Y
A model density g(y) = g(yi ):
How can we assess the “…t” of g as an approximation to f ?
One useful measure is the Kullback-Leibler information criterion (KLIC)
Z
f (y)
KLIC(f ; g) = f (y) log dy
g(y)
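As a minimal numerical sketch, assuming a true density $f = N(0,1)$ and a model density $g = N(0.5,\,1.5^2)$ (arbitrary illustrative choices), the KLIC can be computed by quadrature:

```python
# Minimal sketch: KLIC(f, g) = integral of f(y) log(f(y)/g(y)) dy, computed by quadrature.
# The true density f = N(0,1) and model density g = N(0.5, 1.5^2) are illustrative choices.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def klic(f_pdf, g_pdf, lo=-10.0, hi=10.0):
    """Numerically integrate f(y) * log(f(y)/g(y)) over [lo, hi]."""
    integrand = lambda y: f_pdf(y) * (np.log(f_pdf(y)) - np.log(g_pdf(y)))
    value, _ = quad(integrand, lo, hi)
    return value

f = norm(loc=0.0, scale=1.0).pdf      # true density
g = norm(loc=0.5, scale=1.5).pdf      # approximating model density
print(klic(f, g))                     # > 0; equals 0 only if g = f almost everywhere
```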
15.2 Estimation
Let the model density $g(y, \theta)$ depend on a parameter vector $\theta$. The negative log-likelihood function is

$$L(\theta) = -\sum_{i=1}^n \log g(y_i, \theta) = -\log g(y, \theta)$$

and the MLE is $\hat\theta = \mathrm{argmin}_\theta\, L(\theta)$. This is sometimes called a "quasi-MLE" when $g(y,\theta)$ is acknowledged to be an approximation rather than the truth.

Let the minimizer of $-E\log g(y,\theta)$ be written $\theta_0$ and called the pseudo-true value. This value also minimizes $\mathrm{KLIC}(f, g(\cdot,\theta))$. As the negative log-likelihood divided by $n$ is an estimator of $-E\log g(y,\theta)$, the MLE $\hat\theta$ converges in probability to $\theta_0$. That is,

$$\hat\theta \to_p \theta_0 = \mathrm{argmin}_\theta\, \mathrm{KLIC}(f, g(\cdot,\theta)).$$

Thus the QMLE estimates the best-fitting density, where "best" is measured in terms of the KLIC.
From conventional asymptotic theory, we know

$$\sqrt{n}\left(\hat\theta_{QMLE} - \theta_0\right) \to_d N(0, V)$$

$$V = Q^{-1}\,\Omega\, Q^{-1}$$

$$Q = -E\,\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y,\theta_0)$$

$$\Omega = E\left(\frac{\partial}{\partial\theta}\log g(y,\theta_0)\;\frac{\partial}{\partial\theta}\log g(y,\theta_0)'\right)$$

If the model is correctly specified ($g(y,\theta_0) = f(y)$), then $Q = \Omega$ (the information matrix equality). Otherwise $Q \ne \Omega$.
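A minimal sketch of how the sandwich variance $\hat V = \hat Q^{-1}\hat\Omega\hat Q^{-1}$ might be assembled from per-observation scores and Hessians; the array layout and the function name `sandwich_variance` are hypothetical.

```python
# Sketch of the sandwich (QMLE) variance estimate V = Q^{-1} Omega Q^{-1}.
# `scores` is an (n x k) array of per-observation scores d/dtheta log g(y_i, theta_hat);
# `hessians` is an (n x k x k) array of per-observation Hessians; both are assumed inputs.
import numpy as np

def sandwich_variance(scores: np.ndarray, hessians: np.ndarray) -> np.ndarray:
    n = scores.shape[0]
    Q = -hessians.mean(axis=0)            # Q-hat = minus the average Hessian
    Omega = scores.T @ scores / n         # Omega-hat = average outer product of scores
    Qinv = np.linalg.inv(Q)
    return Qinv @ Omega @ Qinv            # reduces to Q^{-1} when the model is correct

# This is the asymptotic variance of sqrt(n)(theta_hat - theta_0); divide by n for theta_hat.
```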
The KLIC of the fitted density $g(\cdot,\hat\theta)$ relative to $f$ is

$$\mathrm{KLIC}(f,\hat g) = C_f - E_{\tilde y}\log g(\tilde y,\hat\theta(y)),$$

where $\tilde y$ has density $f$, independent of $y$, and $C_f = \int f(y)\log f(y)\,dy$ does not depend on the model. The expected KLIC is the expectation over the observed values $y$:

$$E\left(\mathrm{KLIC}(f,\hat g)\right) = C_f - E_y E_{\tilde y}\log g(\tilde y,\hat\theta(y)) = C_f - E_{\tilde y}E_y\log g(y,\hat\theta(\tilde y)).$$

Writing $\tilde\theta = \hat\theta(\tilde y)$ for the estimate computed from the independent sample $\tilde y$, the expected KLIC equals $C_f + T$ where $T = -E\log g(y,\tilde\theta)$, so the goal is to estimate $T$.
Make a second-order Taylor expansion of $\log g(y,\tilde\theta)$ about $\hat\theta$:

$$\log g(y,\tilde\theta) \simeq \log g(y,\hat\theta) + \frac{\partial}{\partial\theta}\log g(y,\hat\theta)'\left(\tilde\theta-\hat\theta\right) + \frac12\left(\tilde\theta-\hat\theta\right)'\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y,\hat\theta)\left(\tilde\theta-\hat\theta\right).$$

The first term on the RHS is $-L(\hat\theta)$; the second is linear in the FOC, so only the third term remains. Writing

$$\hat Q = -n^{-1}\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y,\hat\theta),$$

$$\tilde\theta - \hat\theta = \left(\tilde\theta-\theta_0\right) - \left(\hat\theta-\theta_0\right),$$

we obtain

$$\log g(y,\tilde\theta) \simeq -L(\hat\theta) - \frac{n}{2}\left(\tilde\theta-\theta_0\right)'\hat Q\left(\tilde\theta-\theta_0\right) - \frac{n}{2}\left(\hat\theta-\theta_0\right)'\hat Q\left(\hat\theta-\theta_0\right) + n\left(\tilde\theta-\theta_0\right)'\hat Q\left(\hat\theta-\theta_0\right).$$
Now

$$\sqrt n\left(\hat\theta-\theta_0\right) \to_d Z_1 \sim N(0,V)$$

$$\sqrt n\left(\tilde\theta-\theta_0\right) \to_d Z_2 \sim N(0,V)$$

where $Z_1$ and $Z_2$ are independent (as $y$ and $\tilde y$ are independent samples), and $\hat Q \to_p Q$. Hence

$$\log g(y,\tilde\theta) \simeq -L(\hat\theta) - \frac12 Z_1'QZ_1 - \frac12 Z_2'QZ_2 + Z_1'QZ_2.$$
Taking expectations,

$$T = -E\log g(y,\tilde\theta) \simeq E\,L(\hat\theta) + E\left(\frac12 Z_1'QZ_1 + \frac12 Z_2'QZ_2 - Z_1'QZ_2\right) = E\,L(\hat\theta) + \mathrm{tr}\left(QV\right) = E\,L(\hat\theta) + \mathrm{tr}\left(Q^{-1}\Omega\right),$$

since $E\left(Z_1'QZ_1\right) = E\left(Z_2'QZ_2\right) = \mathrm{tr}(QV)$, $E\left(Z_1'QZ_2\right) = 0$ by independence, and $\mathrm{tr}(QV) = \mathrm{tr}\left(QQ^{-1}\Omega Q^{-1}\right) = \mathrm{tr}\left(Q^{-1}\Omega\right)$. An estimate of $T$ is therefore

$$\hat T = L(\hat\theta) + \widehat{\mathrm{tr}\left(Q^{-1}\Omega\right)}$$

where $\widehat{\mathrm{tr}\left(Q^{-1}\Omega\right)}$ is an estimate of $\mathrm{tr}\left(Q^{-1}\Omega\right)$.
15.5 AIC
When $g(y,\theta_0) = f(y)$ (the model is correctly specified) then $Q = \Omega$ (the information matrix equality). Hence

$$\mathrm{tr}\left(Q^{-1}\Omega\right) = k = \dim(\theta),$$

so

$$\hat T = L(\hat\theta) + k.$$

This is the Akaike Information Criterion (AIC). It is typically written as $2\hat T$:

$$AIC = 2L(\hat\theta) + 2k.$$

AIC is an estimate of the expected KLIC, based on the approximation that $g$ includes the correct model. Picking the model with the smallest AIC is picking the model with the smallest estimated KLIC. In this sense it is picking the best-fitting model.
15.6 TIC
Takeuchi (1976) proposed a robust version of the AIC, which is known as the Takeuchi Information Criterion (TIC):

$$TIC = 2L(\hat\theta) + 2\,\mathrm{tr}\left(\hat Q^{-1}\hat\Omega\right)$$

where

$$\hat Q = -\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y_i,\hat\theta)$$

$$\hat\Omega = \frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log g(y_i,\hat\theta)\;\frac{\partial}{\partial\theta}\log g(y_i,\hat\theta)'.$$
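A minimal sketch computing AIC and TIC for a deliberately misspecified Gaussian model, using the analytic scores and Hessian of the Gaussian log-density; the chi-square data-generating choice and sample size are assumptions for illustration.

```python
# Sketch: AIC and TIC for a Gaussian model N(mu, s2) fit by quasi-MLE to data whose
# true density is chi-square(3), so the model is misspecified (illustrative choice).
import numpy as np

rng = np.random.default_rng(0)
y = rng.chisquare(df=3, size=500)
n = y.size

mu, s2 = y.mean(), y.var()                     # Gaussian quasi-MLE of (mu, sigma^2)
u = y - mu

# per-observation scores of log g(y_i; theta) at the QMLE
scores = np.column_stack([u / s2, -0.5 / s2 + u**2 / (2 * s2**2)])
Omega = scores.T @ scores / n

# Q-hat = minus the average Hessian of log g(y_i; theta) at the QMLE
Q = np.array([[1.0 / s2,            np.mean(u) / s2**2],
              [np.mean(u) / s2**2, -0.5 / s2**2 + np.mean(u**2) / s2**3]])

L = 0.5 * n * np.log(2 * np.pi * s2) + np.sum(u**2) / (2 * s2)   # negative log-likelihood
aic = 2 * L + 2 * 2                                               # k = dim(theta) = 2
tic = 2 * L + 2 * np.trace(np.linalg.solve(Q, Omega))
print(aic, tic)   # the TIC penalty typically exceeds 2k for skewed, fat-tailed data
```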
The AIC is known to be asymptotically optimal in linear regression (we discuss this below), but in the general context I do not know of an optimality result. The desired optimality would be that if a model is selected by minimizing AIC (or TIC), then the fitted KLIC of this model is asymptotically equivalent to the KLIC of the infeasible best-fitting model.
Consider the linear projection model

$$y_i = X_i'\beta + e_i$$

$$E\left(X_i e_i\right) = 0.$$

AIC or TIC cannot be directly applied, as the density of $e_i$ is unspecified. However, the LS estimator is the same as the Gaussian MLE, so it is natural to calculate the AIC or TIC for the Gaussian quasi-MLE.
The Gaussian quasi-likelihood is

$$\log g_i(\theta) = -\frac12\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\left(y_i - X_i'\beta\right)^2$$

where $\theta = (\beta,\sigma^2)$ and $\sigma^2 = E\,e_i^2$. The MLE $\hat\theta = (\hat\beta,\hat\sigma^2)$ is LS. The pseudo-true value $\theta_0$ is the projection coefficient $\beta = \left(E\left(X_iX_i'\right)\right)^{-1}E\left(X_iy_i\right)$ together with $\sigma^2 = E\,e_i^2$. If $\beta$ is $k\times 1$ then the number of parameters is $k+1$.
The negative sample log-likelihood, evaluated at the MLE, is $L(\hat\theta) = \frac{n}{2}\log\left(2\pi\hat\sigma^2\right) + \frac{n}{2}$, so up to constants that are common across models,

$$AIC = n\log\hat\sigma^2 + 2(k+1).$$

Dividing by $n$ and dropping common terms, this is often written as

$$AIC = \log\hat\sigma^2 + \frac{2k}{n}.$$
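A minimal sketch, assuming a simulated nonlinear regression, of choosing a polynomial order by the $AIC = n\log\hat\sigma^2 + 2(k+1)$ formula above; the design and noise level are illustrative choices.

```python
# Sketch: choose among nested polynomial regressions by AIC = n log(sigma2_hat) + 2(k+1).
# The data-generating process below is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + 0.5 * rng.standard_normal(n)

def aic_for_order(p):
    X = np.vander(x, N=p + 1, increasing=True)      # 1, x, ..., x^p  (k = p+1 regressors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    return n * np.log(sigma2) + 2 * (X.shape[1] + 1)

orders = range(0, 8)
aics = [aic_for_order(p) for p in orders]
print(min(zip(aics, orders)))                        # order with the smallest estimated KLIC
```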
Also,

$$\frac{\partial}{\partial\beta}\log g(y_i,\theta) = \frac{1}{\sigma^2}X_i\left(y_i - X_i'\beta\right)$$

$$\frac{\partial}{\partial\sigma^2}\log g(y_i,\theta) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}\left(y_i - X_i'\beta\right)^2,$$

and

$$\frac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i,\theta) = -\frac{1}{\sigma^2}X_iX_i'$$

$$\frac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i,\theta) = -\frac{1}{\sigma^4}X_i\left(y_i - X_i'\beta\right)$$

$$\frac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i,\theta) = \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}\left(y_i - X_i'\beta\right)^2.$$

Evaluated at $\theta_0$,

$$\frac{\partial}{\partial\beta}\log g(y_i,\theta_0) = \frac{1}{\sigma^2}X_ie_i$$

$$\frac{\partial}{\partial\sigma^2}\log g(y_i,\theta_0) = \frac{1}{2\sigma^4}\left(e_i^2 - \sigma^2\right),$$

and

$$\frac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i,\theta_0) = -\frac{1}{\sigma^2}X_iX_i'$$

$$\frac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i,\theta_0) = -\frac{1}{\sigma^4}X_ie_i$$

$$\frac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i,\theta_0) = \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}e_i^2.$$

Thus

$$Q = -E\begin{bmatrix}\dfrac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i,\theta_0) & \dfrac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i,\theta_0)\\[6pt] \dfrac{\partial^2}{\partial\sigma^2\,\partial\beta'}\log g(y_i,\theta_0) & \dfrac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i,\theta_0)\end{bmatrix} = \sigma^{-2}\begin{bmatrix}E\left(X_iX_i'\right) & 0\\[3pt] 0 & \dfrac{1}{2\sigma^2}\end{bmatrix}$$
and

$$\Omega = E\begin{bmatrix}\dfrac{\partial}{\partial\beta}\log g(y_i,\theta_0)\,\dfrac{\partial}{\partial\beta}\log g(y_i,\theta_0)' & \dfrac{\partial}{\partial\beta}\log g(y_i,\theta_0)\,\dfrac{\partial}{\partial\sigma^2}\log g(y_i,\theta_0)\\[6pt] \dfrac{\partial}{\partial\sigma^2}\log g(y_i,\theta_0)\,\dfrac{\partial}{\partial\beta}\log g(y_i,\theta_0)' & \left(\dfrac{\partial}{\partial\sigma^2}\log g(y_i,\theta_0)\right)^2\end{bmatrix}$$

$$= \sigma^{-2}\begin{bmatrix}E\left(X_iX_i'\dfrac{e_i^2}{\sigma^2}\right) & \dfrac{1}{2\sigma^4}E\left(X_ie_i^3\right)\\[6pt] \dfrac{1}{2\sigma^4}E\left(X_i'e_i^3\right) & \dfrac{\kappa_4}{4\sigma^2}\end{bmatrix}$$

where

$$\kappa_4 = \mathrm{var}\left(\frac{e_i^2}{\sigma^2}\right) = \frac{E\left(e_i^2 - \sigma^2\right)^2}{\sigma^4} = \frac{E\,e_i^4 - \sigma^4}{\sigma^4}.$$

We see that $\Omega = Q$ if

$$E\left(\frac{e_i^2}{\sigma^2}\,\Big|\,X_i\right) = 1$$

$$E\left(X_ie_i^3\right) = 0$$

$$\kappa_4 = 2.$$
Let

$$\hat\kappa_4 = \frac{1}{n\hat\sigma^4}\sum_{i=1}^n\left(\hat e_i^2 - \hat\sigma^2\right)^2.$$

Then

$$TIC = n\log\hat\sigma^2 + 2\,\mathrm{tr}\left(\hat Q^{-1}\hat\Omega\right) = n\log\hat\sigma^2 + 2\left[\mathrm{tr}\left(\left(\sum_{i=1}^nX_iX_i'\right)^{-1}\sum_{i=1}^nX_iX_i'\frac{\hat e_i^2}{\hat\sigma^2}\right) + \frac{\hat\kappa_4}{2}\right] = n\log\hat\sigma^2 + \frac{2}{\hat\sigma^2}\sum_{i=1}^n h_i\hat e_i^2 + \hat\kappa_4$$

where $h_i = X_i'\left(X'X\right)^{-1}X_i$.
When the errors are close to homoskedastic and Gaussian, then $h_i$ and $\hat e_i^2$ will be approximately uncorrelated and $\hat\kappa_4$ will be close to 2, so the penalty will be close to

$$2\sum_{i=1}^n h_i + 2 = 2(k+1),$$

as for AIC. In this case TIC will be close to AIC. In applications, the differences will arise under heteroskedasticity and non-Gaussianity.
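A minimal sketch comparing the AIC penalty $2(k+1)$ with the TIC penalty $\tfrac{2}{\hat\sigma^2}\sum_i h_i\hat e_i^2 + \hat\kappa_4$ on a simulated heteroskedastic design; the regressors and error process are illustrative assumptions.

```python
# Sketch: AIC penalty 2(k+1) versus the regression TIC penalty
# (2/sigma2_hat) * sum_i h_i * ehat_i^2 + kappa4_hat, under heteroskedastic errors.
# The design and error process are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
e = (0.5 + np.abs(X[:, 1])) * rng.standard_normal(n)    # heteroskedastic errors
y = X @ np.ones(k) + e

beta = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ beta
sigma2 = ehat @ ehat / n
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)   # leverage values h_i
kappa4 = np.mean((ehat**2 - sigma2)**2) / sigma2**2

aic_penalty = 2 * (k + 1)
tic_penalty = 2 * np.sum(h * ehat**2) / sigma2 + kappa4
print(aic_penalty, tic_penalty)   # they differ when h_i and ehat_i^2 are related or kappa4 != 2
```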
The primary use of AIC and TIC is to compare models. As we change models, typically the residuals $\hat e_i$ do not change too much, so my guess is that the estimate $\hat\kappa_4$ will not change much. In this event, the TIC correction for the estimation of $\sigma^2$ will not matter much.
Let $\tilde\sigma^2$ denote a preliminary estimate of $\sigma^2$ that does not vary across the models under comparison. Then

$$\tilde\sigma^2\left(AIC - n\log\tilde\sigma^2 + n\right) = n\tilde\sigma^2\left(\log\frac{\hat\sigma^2}{\tilde\sigma^2} + 1\right) + 2\tilde\sigma^2 k \simeq n\tilde\sigma^2\,\frac{\hat\sigma^2}{\tilde\sigma^2} + 2\tilde\sigma^2 k = \hat e'\hat e + 2\tilde\sigma^2 k = C_k.$$

The approximation is $\log(1+a) \simeq a$ for small $a$. This is the Mallows criterion. Thus AIC is approximately equal to Mallows, and the approximation is close when $k/n$ is small.
Furthermore, this expression approximately equals

$$\hat e'\hat e\left(1 + \frac{2k}{n}\right) = S_k,$$

the cross-validation criterion. Thus $TIC \simeq CV$.
They are both asymptotically equivalent to a “Heteroskedastic-Robust Mallows Criterion”

$$\bar C_k = \hat e'\hat e + 2\sum_{i=1}^n h_i\hat e_i^2.$$
Consider the nonparametric regression model

$$y_i = g(X_i) + e_i = g_i + e_i$$

$$E\left(e_i\mid X_i\right) = 0$$

$$E\left(e_i^2\mid X_i\right) = \sigma^2.$$

Written as an $n\times 1$ vector,

$$y = g + e.$$

Li assumed that the $X_i$ are non-random, but his analysis can be re-interpreted by treating everything as conditional on the $X_i$.
Li considered estimators of the $n\times 1$ vector $g$ which are linear in $y$ and thus take the form

$$\hat g(h) = M(h)y,$$

where $M(h)$ is an $n\times n$ matrix indexed by $h\in H$. The Mallows criterion for selecting $h$ is

$$C(h) = \frac1n\left(y - \hat g(h)\right)'\left(y - \hat g(h)\right) + \frac{2\sigma^2}{n}\,\mathrm{tr}\,M(h).$$

The first term is the residual variance from model $h$; the second is the penalty. For series estimators, $\mathrm{tr}\,M(h) = k_h$. The Mallows-selected index $\hat h$ minimizes $C(h)$.
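A minimal sketch of Mallows selection over polynomial (series) fits, using $C(h) = n^{-1}\mathrm{SSR}(h) + 2\sigma^2 k_h/n$; the regression function, noise level, and the plug-in estimate of $\sigma^2$ from a deliberately large model are assumptions.

```python
# Sketch: Mallows selection of a polynomial series order h, with
# C(h) = SSR(h)/n + 2 * sigma2 * k_h / n, where k_h = tr(M(h)) for a projection fit.
# The data-generating process and the plug-in sigma2 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 1, size=n)
g = np.exp(x) * np.sin(5 * x)
y = g + 0.3 * rng.standard_normal(n)

def fit(order):
    X = np.vander(x, N=order + 1, increasing=True)
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return yhat, X.shape[1]                      # k_h = number of series terms

# preliminary sigma^2 from a deliberately large model
sigma2 = np.mean((y - fit(12)[0])**2)

def mallows(order):
    yhat, kh = fit(order)
    return np.mean((y - yhat)**2) + 2 * sigma2 * kh / n

orders = range(1, 11)
h_hat = min(orders, key=mallows)
print(h_hat, [round(mallows(p), 4) for p in orders])
```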
Since $y = e + g$, then $y - \hat g(h) = e + g - \hat g(h)$, so

$$C(h) = \frac1n\left(y - \hat g(h)\right)'\left(y - \hat g(h)\right) + \frac{2\sigma^2}{n}\,\mathrm{tr}\,M(h) = \frac1n e'e + L(h) + \frac2n e'\left(g - \hat g(h)\right) + \frac{2\sigma^2}{n}\,\mathrm{tr}\,M(h),$$

where $L(h) = \frac1n\left(g - \hat g(h)\right)'\left(g - \hat g(h)\right)$.
Moreover, since

$$\hat g(h) = M(h)y = M(h)g + M(h)e,$$

we have $e'\left(g - \hat g(h)\right) = e'b(h) - e'M(h)e$, where $b(h) = \left(I_n - M(h)\right)g$, so

$$C(h) = \frac1n e'e + L(h) + \frac2n e'b(h) + \frac2n\left(\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right).$$

As the first term does not involve $h$, it follows that $\hat h$ minimizes

$$C^*(h) = L(h) + \frac2n e'b(h) + \frac2n\left(\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right)$$

over $h\in H$. The idea is that the empirical criterion $C^*(h)$ equals the desired criterion $L(h)$ plus a stochastically small error.
We calculate that

$$L(h) = \frac1n\left(g - \hat g(h)\right)'\left(g - \hat g(h)\right) = \frac1n b(h)'b(h) - \frac2n b(h)'M(h)e + \frac1n e'M(h)'M(h)e$$

and

$$R(h) = E\left(L(h)\mid X\right) = \frac1n b(h)'b(h) + \frac1n E\left(e'M(h)'M(h)e\mid X\right) = \frac1n b(h)'b(h) + \frac{\sigma^2}{n}\,\mathrm{tr}\left(M(h)'M(h)\right).$$
Theorem 1 (Li, 1987). Let $\lambda_{\max}(A)$ denote the maximum eigenvalue of the matrix $A$, and suppose $\limsup_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty$. If for some positive integer $m$,

$$E\left(e_i^{4m}\mid X_i\right) \le \kappa < \infty$$

and

$$\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0, \qquad (1)$$

then

$$\frac{L(\hat h)}{\inf_{h\in H}L(h)} \to_p 1.$$
The proof uses moment bounds of the following form: if $E\left(e_i^{2s}\mid X_i\right) \le \kappa < \infty$, then for any conformable vector $b$ and matrix $A$ that are functions of $X$,

$$E\left(\left|b'e\right|^s \mid X\right) \le K_{1s}\left(b'b\right)^{s/2} \qquad (2)$$

and

$$E\left(\left|e'Ae - E\left(e'Ae\mid X\right)\right|^s \mid X\right) \le K_{2s}\left(\mathrm{tr}\left(A'A\right)\right)^{s/2} \qquad (3)$$

where $K_{1s}$ and $K_{2s}$ are constants depending only on $s$ and $\kappa$. The key step is to show

$$\sup_h \frac{\left|C^*(h) - L(h)\right|}{L(h)} \to_p 0. \qquad (4)$$
Let $h_0$ denote the minimizer of $L(h)$. Then

$$0 \le \frac{L(\hat h) - L(h_0)}{L(\hat h)}$$

$$= \frac{C^*(\hat h) - L(h_0)}{L(\hat h)} - \frac{C^*(\hat h) - L(\hat h)}{L(\hat h)}$$

$$= \frac{C^*(\hat h) - L(h_0)}{L(\hat h)} + o_p(1)$$

$$\le \frac{C^*(h_0) - L(h_0)}{L(\hat h)} + o_p(1)$$

$$\le \frac{\left|C^*(h_0) - L(h_0)\right|}{L(h_0)} + o_p(1)$$

$$= o_p(1).$$

This uses (4) twice, and the facts $L(h_0) \le L(\hat h)$ and $C^*(\hat h) \le C^*(h_0)$. This shows that

$$\frac{L(h_0)}{L(\hat h)} \to_p 1.$$
We also use the fact (established below) that

$$\sup_h \left|\frac{L(h)}{R(h)} - 1\right| \to_p 0 \qquad (5)$$

which says that $L(h)$ and $R(h)$ are asymptotically equivalent, and thus (4) is equivalent to

$$\sup_h \frac{\left|C^*(h) - L(h)\right|}{R(h)} \to_p 0. \qquad (6)$$
Now

$$\sup_h \frac{\left|C^*(h) - L(h)\right|}{R(h)} \le 2\sup_h \frac{\left|e'b(h)\right|}{nR(h)} + 2\sup_h \frac{\left|\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right|}{nR(h)}. \qquad (7)$$
By (2) with $s = 2m$,

$$E\left(\left|e'b(h)\right|^{2m} \mid X\right) \le K\left(b(h)'b(h)\right)^m.$$

Now recall

$$nR(h) = b(h)'b(h) + \sigma^2\,\mathrm{tr}\left(M(h)'M(h)\right). \qquad (8)$$

Thus $nR(h) \ge b(h)'b(h)$. Hence

$$E\left(\left|e'b(h)\right|^{2m} \mid X\right) \le K\left(b(h)'b(h)\right)^m \le K\left(nR(h)\right)^m.$$

Applying Markov's inequality, for any $\delta > 0$,

$$P\left(\sup_h \frac{\left|e'b(h)\right|}{nR(h)} > \delta \,\Big|\, X\right) \le \sum_{h\in H}\frac{E\left(\left|e'b(h)\right|^{2m}\mid X\right)}{\delta^{2m}\left(nR(h)\right)^{2m}} \le \sum_{h\in H}\frac{K\left(nR(h)\right)^m}{\delta^{2m}\left(nR(h)\right)^{2m}} = \frac{K}{\delta^{2m}}\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0.$$
Second, since $E\left(e'M(h)e \mid X\right) = \sigma^2\,\mathrm{tr}\,M(h)$, (3) with $s = 2m$ gives

$$E\left(\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|^{2m} \mid X\right) \le K\left(\mathrm{tr}\left(M(h)'M(h)\right)\right)^m \le K\,\frac{\left(nR(h)\right)^m}{\sigma^{2m}},$$

using $\sigma^2\,\mathrm{tr}\left(M(h)'M(h)\right) \le nR(h)$. Applying Markov's inequality,

$$P\left(\sup_h \frac{\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|}{nR(h)} > \delta \,\Big|\, X\right) \le \sum_{h\in H} P\left(\frac{\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|}{nR(h)} > \delta \,\Big|\, X\right)$$

$$\le \sum_{h\in H}\frac{E\left(\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|^{2m}\mid X\right)}{\delta^{2m}\left(nR(h)\right)^{2m}} \le \sum_{h\in H}\frac{\sigma^{-2m}K\left(nR(h)\right)^m}{\delta^{2m}\left(nR(h)\right)^{2m}} = \frac{K}{\left(\delta^2\sigma^2\right)^m}\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0.$$
For completeness, let us show (5). The demonstration is essentially the same as the above. We calculate

$$L(h) - R(h) = -\frac2n b(h)'M(h)e + \frac1n e'M(h)'M(h)e - \frac{\sigma^2}{n}\,\mathrm{tr}\left(M(h)'M(h)\right)$$

$$= -\frac2n b(h)'M(h)e + \frac1n\left(e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right).$$

Thus

$$\sup_h\frac{\left|L(h) - R(h)\right|}{R(h)} \le 2\sup_h\frac{\left|e'M(h)'b(h)\right|}{nR(h)} + 2\sup_h\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|}{nR(h)}.$$
By (2),

$$E\left(\left|e'M(h)'b(h)\right|^{2m}\mid X\right) \le K\left(b(h)'M(h)M(h)'b(h)\right)^m.$$

Letting

$$\bar\lambda_M = \limsup_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty,$$

we have $b(h)'M(h)M(h)'b(h) \le \bar\lambda_M^2\, b(h)'b(h)$, so

$$E\left(\left|e'M(h)'b(h)\right|^{2m}\mid X\right) \le K\left(b(h)'M(h)M(h)'b(h)\right)^m \le K\bar\lambda_M^{2m}\left(nR(h)\right)^m.$$

Thus, by Markov's inequality,

$$P\left(\sup_h\frac{\left|e'M(h)'b(h)\right|}{nR(h)} > \delta\,\Big|\,X\right) \le \sum_{h\in H}\frac{E\left(\left|e'M(h)'b(h)\right|^{2m}\mid X\right)}{\delta^{2m}\left(nR(h)\right)^{2m}} \le \sum_{h\in H}\frac{K\bar\lambda_M^{2m}\left(nR(h)\right)^m}{\delta^{2m}\left(nR(h)\right)^{2m}} = \frac{K\bar\lambda_M^{2m}}{\delta^{2m}}\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0.$$
Similarly, by (3),

$$E\left(\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|^{2m}\mid X\right) \le K\left(\mathrm{tr}\left(M(h)'M(h)M(h)'M(h)\right)\right)^m$$

$$\le K\bar\lambda_M^{2m}\left(\mathrm{tr}\left(M(h)'M(h)\right)\right)^m \le K\,\frac{\bar\lambda_M^{2m}}{\sigma^{2m}}\left(nR(h)\right)^m,$$

and thus

$$P\left(\sup_h\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|}{nR(h)} > \delta\,\Big|\,X\right) \le \sum_{h\in H}P\left(\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|}{nR(h)} > \delta\,\Big|\,X\right)$$

$$\le \sum_{h\in H}\frac{E\left(\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e\mid X\right)\right|^{2m}\mid X\right)}{\delta^{2m}\left(nR(h)\right)^{2m}} \le \sum_{h\in H}\frac{K\bar\lambda_M^{2m}\sigma^{-2m}\left(nR(h)\right)^m}{\delta^{2m}\left(nR(h)\right)^{2m}} = K\left(\frac{\bar\lambda_M^2}{\delta^2\sigma^2}\right)^m\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0.$$
We have shown

$$\sup_h\frac{\left|L(h) - R(h)\right|}{R(h)} \to_p 0,$$

which is (5).
15.13 Mallows Model Selection
Li’s Theorem 1 applies to a variety of linear estimators. Of particular interest is model selection
(e.g. series estimation).
Let's verify Li's conditions, which were

$$\limsup_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty,$$

$$E\left(e_i^{4m}\mid X_i\right) \le \kappa < \infty,$$

$$\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0. \qquad (9)$$

For selection among least-squares fits, $M(h)$ is a projection matrix, so $\lambda_{\max}\left(M(h)\right) = 1$ and the first condition is automatically satisfied.
The key is equation (9). Suppose that for sample size $n$ there are $N_n$ models. Let

$$\xi_n = \inf_{h\in H} nR(h)$$

and assume $\xi_n \to\infty$. A crude bound is

$$\sum_{h\in H}\left(nR(h)\right)^{-m} \le N_n\,\xi_n^{-m}.$$

If $N_n\,\xi_n^{-m} \to 0$ then (9) holds. Notice that by increasing $m$ we can allow for a larger $N_n$ (more models), but at the cost of a stronger moment bound.

The condition $\xi_n \to\infty$ says that for all finite models $h$ there is non-zero approximation error, so that $R(h)$ is non-zero. In contrast, if there is a finite dimensional model $h_0$ for which $b(h_0) = 0$, then $nR(h_0) = k_{h_0}\sigma^2$ does not diverge. In this case, Mallows (and AIC) are asymptotically sub-optimal.
We can improve this condition if we consider the case of selection among models of increasing size. Suppose that model $h$ has $k_h$ regressors, with $k_1 < k_2 < \cdots$, and for some $m \ge 2$,

$$\sum_{h=1}^{\infty} k_h^{-m} < \infty.$$

Now pick $B_n \to\infty$ so that $B_n\,\xi_n^{-m} \to 0$ (which is possible since $\xi_n\to\infty$). Then

$$\sum_{h=1}^{\infty}\left(nR(h)\right)^{-m} = \sum_{h=1}^{B_n}\left(nR(h)\right)^{-m} + \sum_{h=B_n+1}^{\infty}\left(nR(h)\right)^{-m} \le B_n\,\xi_n^{-m} + \sigma^{-2m}\sum_{h=B_n+1}^{\infty}k_h^{-m} \to 0$$

as required, using $nR(h) \ge \sigma^2\,\mathrm{tr}\left(M(h)'M(h)\right) = \sigma^2 k_h$.
where $|c| - p$ is the number of overidentifying restrictions and $r_n$ is a penalty sequence. For an AIC-like criterion, Andrews sets $r_n = 2$; for a BIC-like criterion, he sets $r_n = \log n$. The model selection rule picks the moment conditions $c$ which minimize $J_n(c)$. Assuming that a subset of the moments is incorrect, Andrews shows that the BIC-like rule asymptotically selects the correct subset.
Andrews and Lu (2001, JoE) extend the above analysis to the case of jointly picking the moments and the parameter vector (that is, imposing zero restrictions on the parameters). They show that the same criterion has similar properties: that it can asymptotically select the "correct" moments and "correct" zero restrictions.

Hong, Preston and Shum (ET, 2003) extend the analysis of the above papers to empirical likelihood. They show that this criterion has the same interpretation when $J_n(c)$ is replaced by the empirical likelihood.

These papers are an interesting first step, but they do not address the issue of GMM selection when the true model is potentially infinite dimensional and/or misspecified. That is, the analysis is not analogous to that of Li (1987) for the regression model.
In order to properly understand GMM selection, I believe we need to understand the behavior of GMM under misspecification. Hall and Inoue (2003, JoE) is one of the few contributions on GMM under misspecification. They did not investigate model selection.
Suppose that the model is

$$m(\theta) = E\,m_i(\theta) = 0$$

where $m_i$ is $\ell\times 1$ and $\theta$ is $k\times 1$. Assume $\ell > k$ (overidentification). The model is misspecified if there is no $\theta$ such that this moment condition holds. That is, for all $\theta$,

$$m(\theta) \ne 0.$$

The GMM criterion is

$$J_n(\theta) = n\,\bar m_n(\theta)'\,W_n\,\bar m_n(\theta).$$

If $W_n \to_p W$, then

$$n^{-1}J_n(\theta) \to_p m(\theta)'W\,m(\theta),$$

so the GMM estimator converges in probability to the pseudo-true value

$$\theta_0(W) = \mathrm{argmin}_\theta\; m(\theta)'W\,m(\theta).$$
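A minimal sketch, with a hypothetical two-moment model for exponential data, of computing the sample analog of the pseudo-true value $\theta_0(W) = \mathrm{argmin}_\theta\, m(\theta)'W m(\theta)$; the moment functions and weight matrix are illustrative choices.

```python
# Sketch: pseudo-true value theta_0(W) = argmin m(theta)' W m(theta) for a misspecified,
# over-identified model. The two moments below (mean and second moment of y tied to a
# single parameter) are a hypothetical example; sample means approximate the population m.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
y = rng.exponential(scale=2.0, size=5000)

def mbar(theta):
    # model: E(y) = theta and E(y^2) = 2*theta; both cannot hold for this data
    return np.array([np.mean(y) - theta, np.mean(y**2) - 2.0 * theta])

W = np.eye(2)                                   # weight matrix (identity for illustration)
crit = lambda theta: mbar(theta) @ W @ mbar(theta)
theta0 = minimize_scalar(crit, bounds=(0.0, 20.0), method='bounded').x
print(theta0)        # depends on W: a different W gives a different pseudo-true value
```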
Consider now a moment condition model which specifies

$$E\,m(y) = 0,$$

while under the true density $f$ the restriction fails:

$$E\,m(y) \ne 0.$$
Temporarily ignoring parameter estimation, we can ask: Which density $g(y)$ satisfying this moment condition is closest to $f(y)$ in the sense of minimizing KLIC? We can call this $g_0(y)$ the pseudo-true density.

The solution is nicely explained in Appendix A of Chen, Hong, and Shum (JoE, 2007). Recall
$$\mathrm{KLIC}(f,g) = \int f(y)\log\frac{f(y)}{g(y)}\,dy.$$
The problem is

$$\min_g\;\mathrm{KLIC}(f,g)$$

subject to

$$\int g(y)\,dy = 1$$

$$\int m(y)\,g(y)\,dy = 0.$$
The Lagrangian is

$$\int f(y)\log\frac{f(y)}{g(y)}\,dy + \mu\left(\int g(y)\,dy - 1\right) + \lambda'\int m(y)\,g(y)\,dy$$

with FOC (with respect to $g(y)$)

$$0 = -\frac{f(y)}{g(y)} + \mu + \lambda'm(y).$$
The pseudo-true density is

$$g_0(y) = \frac{f(y)}{1 + \lambda_0'm(y)},$$

with associated minimized KLIC

$$\mathrm{KLIC}(f,g_0) = \int f(y)\log\left(1 + \lambda_0'm(y)\right)dy = E\log\left(1 + \lambda_0'm(y)\right).$$

When the moment function $m(y,\theta)$ depends on a parameter $\theta$, the pseudo-true values $(\theta_0,\lambda_0)$ solve the saddle-point problem

$$\min_\theta\max_\lambda\; E\log\left(1 + \lambda'm(y,\theta)\right).$$
Theorem (Chen, Hong and Shum, JoE, 2007). If $|m(y,\theta)|$ is bounded, then the EL estimates $(\hat\theta,\hat\lambda)$ are $n^{-1/2}$-consistent for the pseudo-true values $(\theta_0,\lambda_0)$.

This gives a simple interpretation to the definition of KLIC under misspecification.

Schennach's paper leaves open the question: for what is $\hat\theta$ consistent? Is there a pseudo-true value? One possibility is that the pseudo-true value $\theta_n$ needs to be indexed by sample size. (This idea is used in Hal White's work.) Nevertheless, Schennach's theorem suggests that empirical likelihood is non-robust to misspecification.
15.17 Exponential Tilting
Instead of

$$\mathrm{KLIC}(f,g) = \int f(y)\log\frac{f(y)}{g(y)}\,dy,$$

consider the reverse distance

$$\mathrm{KLIC}(g,f) = \int g(y)\log\frac{g(y)}{f(y)}\,dy.$$

The problem is to minimize $\mathrm{KLIC}(g,f)$ over $g$ subject to

$$\int g(y)\,dy = 1$$

$$\int m(y)\,g(y)\,dy = 0.$$
The Lagrangian is

$$\int g(y)\log\frac{g(y)}{f(y)}\,dy - \mu\left(\int g(y)\,dy - 1\right) - \lambda'\int m(y)\,g(y)\,dy$$

with FOC

$$0 = \log\frac{g(y)}{f(y)} + 1 - \mu - \lambda'm(y).$$
Solving,

$$g(y) = f(y)\exp\left(-1+\mu\right)\exp\left(\lambda'm(y)\right).$$

Imposing $\int g(y)\,dy = 1$ we find

$$g(y) = \frac{f(y)\exp\left(\lambda'm(y)\right)}{\int f(y)\exp\left(\lambda'm(y)\right)dy}. \qquad (10)$$
Inserting this into $\mathrm{KLIC}(g,f)$ we find

$$\mathrm{KLIC}(g,f) = \int g(y)\log\left(\frac{\exp\left(\lambda'm(y)\right)}{\int f(y)\exp\left(\lambda'm(y)\right)dy}\right)dy$$

$$= \lambda'\int m(y)\,g(y)\,dy - \int g(y)\,dy\cdot\log\int f(y)\exp\left(\lambda'm(y)\right)dy$$

$$= -\log\int f(y)\exp\left(\lambda'm(y)\right)dy \qquad (11)$$

$$= -\log E\exp\left(\lambda'm(y)\right). \qquad (12)$$

The value of $\lambda$ which satisfies the moment constraint is

$$\lambda_0 = \mathrm{argmin}_\lambda\; E\exp\left(\lambda'm(y)\right). \qquad (13)$$

The pseudo-true density $g_0(y)$ is (10) with this $\lambda_0$, with associated minimized KLIC (11). This is the smallest possible KLIC(g,f) for moment condition models.
Notice: the $g_0$ which minimizes KLIC(g,f) and the $g_0$ which minimizes KLIC(f,g) are different.

In contrast to the EL case, the ET problem (13) does not restrict $\lambda$, and there are no “trouble spots”. Thus ET is more robust than EL. The pseudo-true $\lambda_0$ and $g_0$ are well defined under misspecification, unlike EL.
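A minimal sketch of the sample analog of (13): $\lambda_0$ is approximated by minimizing $n^{-1}\sum_i\exp(\lambda'm(y_i))$, and the tilted weights implement (10); the single moment $m(y) = y - \mu_0$ with a deliberately wrong $\mu_0$ is a hypothetical example of misspecification.

```python
# Sketch: exponential tilting with a single (misspecified) moment m(y) = y - mu0.
# lambda0 is approximated by minimizing the sample analog of E exp(lambda * m(y)),
# and the tilted weights are the sample analog of (10). mu0 is a hypothetical choice.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
y = rng.standard_normal(2000)
mu0 = 0.5                                   # wrong value: the true mean is 0, so E m(y) != 0
m = y - mu0

obj = lambda lam: np.mean(np.exp(lam * m))  # convex in lambda
lam0 = minimize_scalar(obj, bounds=(-5.0, 5.0), method='bounded').x

w = np.exp(lam0 * m)
w /= w.sum()                                # tilted weights: sample analog of g0(y)/f(y)
print(lam0, np.sum(w * m))                  # second value is ~0: the constraint holds under the tilt
```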
When the moment $m(y,\theta)$ depends on a parameter $\theta$, then the pseudo-true values $(\theta_0,\lambda_0)$ are the joint solution to the problem

$$\max_\theta\min_\lambda\; E\exp\left(\lambda'm(y,\theta)\right).$$
The sample analog of this problem chooses probabilities $(p_1,\ldots,p_n)$ and $\theta$ to minimize $\sum_{i=1}^n p_i\log p_i$ subject to

$$\sum_{i=1}^n p_i = 1$$

$$\sum_{i=1}^n p_i\,m(y_i,\theta) = 0.$$
First, we concentrate out the probabilities. For any $\theta$, the Lagrangian is

$$\sum_{i=1}^n p_i\log p_i - \mu\left(\sum_{i=1}^n p_i - 1\right) - \lambda'\sum_{i=1}^n p_i\,m(y_i,\theta)$$

with FOC

$$0 = \log\hat p_i + 1 - \mu - \lambda'm(y_i,\theta).$$

Solving and imposing the constraints,

$$\hat p_i(\theta) = \frac{\exp\left(\lambda'm(y_i,\theta)\right)}{\sum_{i=1}^n\exp\left(\lambda'm(y_i,\theta)\right)},$$

where $\lambda = \hat\lambda(\theta)$ is chosen so that $\sum_{i=1}^n\hat p_i(\theta)\,m(y_i,\theta) = 0$.
When $\lambda = 0$ then $\hat p_i = n^{-1}$, the same as EL. The concentrated "entropy" criterion is then

$$\sum_{i=1}^n\hat p_i(\theta)\log\hat p_i(\theta) = \hat\lambda(\theta)'\sum_{i=1}^n\hat p_i(\theta)\,m(y_i,\theta) - \log\left(\sum_{i=1}^n\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)\right) = -\log\left(\sum_{i=1}^n\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)\right),$$

using the moment constraint $\sum_{i=1}^n\hat p_i(\theta)\,m(y_i,\theta) = 0$.
The ETEL (exponentially tilted empirical likelihood) criterion is

$$ETEL(\theta) = \sum_{i=1}^n\log\hat p_i(\theta) = \hat\lambda(\theta)'\sum_{i=1}^n m(y_i,\theta) - n\log\left(\sum_{i=1}^n\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)\right)$$

where

$$\hat p_i(\theta) = \frac{\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)}{\sum_{i=1}^n\exp\left(\hat\lambda(\theta)'m(y_i,\theta)\right)}$$

and

$$\hat\lambda(\theta) = \mathrm{argmin}_\lambda\;\sum_{i=1}^n\exp\left(\lambda'm(y_i,\theta)\right).$$
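A minimal sketch of the profile computation above: for each $\theta$, $\hat\lambda(\theta)$ minimizes $\sum_i\exp(\lambda'm(y_i,\theta))$, the $\hat p_i(\theta)$ follow, and $ETEL(\theta)$ is evaluated on a grid; the scalar moment $m(y,\theta) = y - \theta$ is a hypothetical just-identified example.

```python
# Sketch of the ETEL profile: lambda_hat(theta) minimizes sum_i exp(lambda * m_i(theta)),
# p_i(theta) are the tilted probabilities, and ETEL(theta) = sum_i log p_i(theta).
# The moment m(y, theta) = y - theta is a hypothetical just-identified example.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
y = rng.exponential(scale=1.5, size=1000)

def lambda_hat(theta):
    m = y - theta
    return minimize_scalar(lambda lam: np.sum(np.exp(lam * m)),
                           bounds=(-10.0, 10.0), method='bounded').x

def etel(theta):
    m = y - theta
    lam = lambda_hat(theta)
    logp = lam * m - np.log(np.sum(np.exp(lam * m)))   # log of p_i(theta)
    return np.sum(logp)

grid = np.linspace(0.5, 3.0, 51)
theta_hat = grid[np.argmax([etel(t) for t in grid])]
print(theta_hat)          # close to the sample mean of y in this just-identified case
```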
Under misspecification, the pseudo-true values $(\theta_0,\lambda_0)$ are generically well defined, and minimize a KLIC analog. Moreover,

$$\sqrt n\left(\hat\theta - \theta_0\right) \to_d N\left(0,\; G^{-1}\Omega\left(G^{-1}\right)'\right)$$

where $G = E\dfrac{\partial}{\partial\theta'}m(y,\theta_0)$ and $\Omega = E\left(m(y,\theta_0)\,m(y,\theta_0)'\right)$.