The Mallows criterion for the averaging weight is
$$C(\mathbf{w}) = \mathbf{w}'\hat{\mathbf{e}}'\hat{\mathbf{e}}\,\mathbf{w} + 2\hat\sigma^2\,\mathbf{w}'K = \begin{pmatrix} 1-w & w \end{pmatrix}\begin{pmatrix} y'y & y'\hat e \\ \hat e'y & \hat e'\hat e \end{pmatrix}\begin{pmatrix} 1-w \\ w \end{pmatrix} + 2\hat\sigma^2 wk$$
$$= (1-w)^2 y'y + \left(w^2 + 2w(1-w)\right)\hat e'\hat e + 2\hat\sigma^2 wk$$
(using $y'\hat e = \hat e'\hat e$)
$$= (1-w)^2\left(y'y - \hat e'\hat e\right) + \hat e'\hat e + 2\hat\sigma^2 wk.$$
The first-order condition is
$$\frac{d}{dw}C(w) = -2(1-w)\left(y'y - \hat e'\hat e\right) + 2\hat\sigma^2 k = 0$$
with solution
$$\hat w = 1 - \frac{k}{F}$$
where
$$F = \frac{y'y - \hat e'\hat e}{\hat\sigma^2}$$
is the Wald statistic for $\beta = 0$. Imposing the constraint $\hat w \in [0,1]$ we obtain
$$\hat w = \begin{cases} 1 - \dfrac{k}{F} & F \ge k \\ 0 & F < k. \end{cases}$$
The corresponding averaging estimator is
$$\hat w\,\hat\beta = \left(1 - \frac{k}{F}\right)_+ \hat\beta.$$
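As a numerical sketch, the following Python code computes $\hat w = \left(1 - k/F\right)_+$ and the corresponding averaging estimator from regression data. The function name and the treatment of $\hat\sigma^2$ as a supplied plug-in value are my own illustrative assumptions, not part of the notes.

```python
import numpy as np

def mallows_average(y, X, sigma2):
    """Mallows averaging between the zero (restricted) and OLS (unrestricted)
    estimators: returns the weight w_hat = (1 - k/F)_+ and w_hat * beta_ols."""
    n, k = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e_hat = y - X @ beta_ols
    F = (y @ y - e_hat @ e_hat) / sigma2     # Wald statistic for beta = 0
    w_hat = 1.0 - k / F if F > k else 0.0    # imposes w_hat in [0, 1]
    return w_hat, w_hat * beta_ols

# Illustrative use with simulated data and known error variance:
rng = np.random.default_rng(0)
n, k = 100, 5
X = rng.standard_normal((n, k))
y = X @ (0.1 * np.ones(k)) + rng.standard_normal(n)
print(mallows_average(y, X, sigma2=1.0))
```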
17.2 Loss and Risk
A great reference is Theory of Point Estimation, 2nd Edition, by Lehmann and Casella.
Let $\hat\theta$ be an estimator of $\theta$, a $k \times 1$ parameter vector. Suppose $\hat\theta$ is an (asymptotic) sufficient statistic for $\theta$, so that any other estimator can be written as a function of $\hat\theta$. We call $\hat\theta$ the "usual" estimator.

Suppose that $\sqrt{n}\left(\hat\theta - \theta\right) \to_d N(0, V)$. Thus, approximately,
$$\hat\theta \sim_a N(\theta, V_n)$$
where $V_n = n^{-1}V$. Most of Stein-type theory is developed for the exact distribution case; it carries over to the asymptotic setting as an approximation. From now on we will assume that $\hat\theta$ has an exact normal distribution and that $V_n = V$ is known. (Equivalently, we can rewrite the statistical problem as local to $\theta$ using the "Limits of Experiments" theory.)

Is $\hat\theta$ the best estimator of $\theta$, in the sense of minimizing the risk (expected loss)?
The risk of an estimator $\tilde\theta$ under weighted squared error loss is
$$R(\theta, \tilde\theta, W) = E\left[\left(\tilde\theta - \theta\right)'W\left(\tilde\theta - \theta\right)\right] = \mathrm{tr}\left(W\,E\left[\left(\tilde\theta - \theta\right)\left(\tilde\theta - \theta\right)'\right]\right).$$
For the usual estimator, taking $W = V^{-1}$,
$$R(\theta, \hat\theta, V^{-1}) = \mathrm{tr}\left(V^{-1}E\left[\left(\hat\theta - \theta\right)\left(\hat\theta - \theta\right)'\right]\right) = \mathrm{tr}\left(V^{-1}V\right) = k.$$
If $W \ne V^{-1}$ then
$$R(\theta, \hat\theta, W) = \mathrm{tr}\left(W\,E\left[\left(\hat\theta - \theta\right)\left(\hat\theta - \theta\right)'\right]\right) = \mathrm{tr}(WV).$$
In contrast, the trivial estimator $\tilde\theta = 0$ has risk
$$R(\theta, 0, W) = \theta'W\theta.$$
Thus $\tilde\theta = 0$ has smaller risk than $\hat\theta$ when $\theta'W\theta < \mathrm{tr}(WV)$, and larger risk when $\theta'W\theta > \mathrm{tr}(WV)$.
Neither $\hat\theta$ nor $\tilde\theta = 0$ is "better" in the sense of having uniformly smaller risk! It is not enough to ask that one estimator have smaller risk than another, as in general the risk is a function depending on unknowns.
As another example, take the simple averaging (or shrinkage) estimator $\tilde\theta = w\hat\theta$ for fixed $w \in [0,1]$. Since $\tilde\theta - \theta = w\left(\hat\theta - \theta\right) - (1-w)\theta$, and the cross term vanishes because $E\left(\hat\theta - \theta\right) = 0$,
$$R(\theta, \tilde\theta, W) = w^2 R(\theta, \hat\theta, W) + (1-w)^2\,\theta'W\theta = w^2\,\mathrm{tr}(WV) + (1-w)^2\,\theta'W\theta.$$
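The trade-off is easy to see numerically. A minimal sketch, assuming the illustrative choices $W = V = I_k$ and a small grid of parameter values (none of which are from the notes):

```python
import numpy as np

# Risk of theta_tilde = w * theta_hat under weighted squared-error loss:
# R = w^2 tr(WV) + (1 - w)^2 theta'W theta  (as derived above).
k = 5
V = np.eye(k)                # illustrative covariance
W = np.eye(k)                # illustrative weight matrix
trWV = np.trace(W @ V)

for w in (0.0, 0.5, 1.0):
    for scale in (0.0, 1.0, 3.0):
        theta = (scale / np.sqrt(k)) * np.ones(k)     # ||theta|| = scale
        risk = w**2 * trWV + (1 - w)**2 * (theta @ W @ theta)
        print(f"w={w:.1f}  ||theta||={scale:.1f}  risk={risk:.2f}")
# w = 1 has constant risk tr(WV) = k; any w < 1 does better near theta = 0,
# but its risk grows without bound as ||theta|| grows: no uniform ranking.
```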
We can ask: suppose we use $\tilde\theta$ to estimate $\theta$; what is the worst case? How badly can this estimator do? This is measured by the maximum risk
$$\bar R(\tilde\theta) = \sup_\theta R(\theta, \tilde\theta, W).$$
For example, for the usual estimator, $R(\theta, \hat\theta, W) = \mathrm{tr}(WV)$ for all $\theta$, so
$$\bar R(\hat\theta) = \mathrm{tr}(WV).$$
For the averaging estimator with $w < 1$, in contrast, $\bar R(w\hat\theta) = \sup_\theta\left[w^2\,\mathrm{tr}(WV) + (1-w)^2\,\theta'W\theta\right] = \infty$. The latter is an example of an estimator with unbounded risk. To guard against extreme worst cases, it seems sensible to avoid estimators with unbounded risk.
The minimum value of the maximum risk $\bar R(\tilde\theta)$ across all estimators $\tilde\theta = t(\hat\theta)$ is $\mathrm{tr}(WV)$, attained by the usual estimator $\hat\theta$: the usual estimator is minimax.

We therefore consider shrinkage estimators of the form
$$\hat\theta^* = \left(1 - h\left(\hat\theta'W\hat\theta\right)\right)\hat\theta$$
for some shrinkage function $h(q)$. This class includes selection (pretest) estimators, which take $h(q) = \mathbf{1}\left\{q \le a\right\}$, where $a = 2k$ for Mallows selection, and $a$ is the critical value from a chi-square distribution for a pretest estimator.
We now calculate the risk of $\hat\theta^*$. Note
$$\hat\theta^* - \theta = \left(\hat\theta - \theta\right) - h\left(\hat\theta'W\hat\theta\right)\hat\theta.$$
Thus
$$\left(\hat\theta^* - \theta\right)'W\left(\hat\theta^* - \theta\right) = \left(\hat\theta - \theta\right)'W\left(\hat\theta - \theta\right) + h\left(\hat\theta'W\hat\theta\right)^2\hat\theta'W\hat\theta - 2h\left(\hat\theta'W\hat\theta\right)\hat\theta'W\left(\hat\theta - \theta\right).$$
Taking expectations,
$$R(\theta, \hat\theta^*, W) = \mathrm{tr}(WV) + E\left[h\left(\hat\theta'W\hat\theta\right)^2\hat\theta'W\hat\theta\right] - 2E\left[h\left(\hat\theta'W\hat\theta\right)\hat\theta'W\left(\hat\theta - \theta\right)\right].$$
The key tool is Stein's Lemma: if $\hat\theta \sim N(\theta, V)$ and $\eta(x): \mathbb{R}^k \to \mathbb{R}^k$ is absolutely continuous, then
$$E\left[\eta(\hat\theta)'\left(\hat\theta - \theta\right)\right] = E\left[\mathrm{tr}\left(\frac{\partial}{\partial x}\eta(\hat\theta)'\,V\right)\right].$$
Proof: Let $\phi(x)$ be the density of $N(0, V)$,
$$\phi(x) = \frac{1}{(2\pi)^{k/2}\left(\det V\right)^{1/2}}\exp\left(-\frac{1}{2}x'V^{-1}x\right),$$
which satisfies
$$\frac{\partial}{\partial x}\phi(x) = -V^{-1}x\,\phi(x)$$
and
$$\frac{\partial}{\partial x}\phi(x - \theta) = -V^{-1}(x - \theta)\,\phi(x - \theta).$$
By multivariate integration by parts,
$$E\left[\eta(\hat\theta)'\left(\hat\theta - \theta\right)\right] = \int \eta(x)'V\,V^{-1}(x - \theta)\,\phi(x - \theta)\,dx = -\int \eta(x)'V\,\frac{\partial}{\partial x}\phi(x - \theta)\,dx = \int \mathrm{tr}\left(\frac{\partial}{\partial x}\eta(x)'\,V\right)\phi(x - \theta)\,dx = E\left[\mathrm{tr}\left(\frac{\partial}{\partial x}\eta(\hat\theta)'\,V\right)\right],$$
as stated.
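Stein's Lemma can be checked by simulation. A minimal sketch, with the smooth function $\eta(x) = x/(1 + x'x)$ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
theta = np.array([0.5, -1.0, 0.0, 2.0])
A = rng.standard_normal((k, k))
V = A @ A.T + np.eye(k)                  # an arbitrary positive-definite V
L = np.linalg.cholesky(V)

def eta(x):                              # illustrative smooth eta
    return x / (1.0 + x @ x)

def d_eta(x):                            # Jacobian of eta (symmetric here)
    q = 1.0 + x @ x
    return np.eye(k) / q - 2.0 * np.outer(x, x) / q**2

lhs = rhs = 0.0
R = 100_000
for _ in range(R):
    x = theta + L @ rng.standard_normal(k)    # draw x ~ N(theta, V)
    lhs += eta(x) @ (x - theta)               # E[eta(x)'(x - theta)]
    rhs += np.trace(d_eta(x) @ V)             # E[tr(d eta(x)' V)]
print(lhs / R, rhs / R)   # the two averages agree up to Monte Carlo error
```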
Let $\eta(x)' = h\left(x'Wx\right)x'W$, for which
$$\frac{\partial}{\partial x}\eta(x)' = h\left(x'Wx\right)W + 2Wxx'W\,h'\left(x'Wx\right)$$
and
$$\mathrm{tr}\left(\frac{\partial}{\partial x}\eta(x)'\,V\right) = \mathrm{tr}(WV)\,h\left(x'Wx\right) + 2\,x'WVWx\,h'\left(x'Wx\right).$$
Then by Stein's Lemma,
$$E\left[h\left(\hat\theta'W\hat\theta\right)\hat\theta'W\left(\hat\theta - \theta\right)\right] = \mathrm{tr}(WV)\,E\left[h\left(\hat\theta'W\hat\theta\right)\right] + 2E\left[\left(\hat\theta'WVW\hat\theta\right)h'\left(\hat\theta'W\hat\theta\right)\right].$$
Substituting this into the risk expression and using the alternative expression $h(q) = c(q)/q$ yields a formula for $R(\theta, \hat\theta^*, W)$ in terms of $c$, displayed below.

We are trying to find cases where $R(\theta, \hat\theta^*, W) < R(\theta, \hat\theta, W)$. This requires the term in the expectation to be negative.
We now explore some special cases. Take first the default weight matrix $W = V^{-1}$, and write
$$R(\theta, \hat\theta^*) = R(\theta, \hat\theta^*, V^{-1}).$$
Then
$$R(\theta, \hat\theta^*) = k + E\left[c\left(\hat\theta'V^{-1}\hat\theta\right)\frac{c\left(\hat\theta'V^{-1}\hat\theta\right) - 2(k-2)}{\hat\theta'V^{-1}\hat\theta} - 4c'\left(\hat\theta'V^{-1}\hat\theta\right)\right].$$
Theorem 1 For any absolutely continuous and non-decreasing function $c(q)$ such that
$$0 < c(q) < 2(k-2), \qquad (1)$$
then
$$R(\theta, \hat\theta^*) < R(\theta, \hat\theta):$$
the risk of $\hat\theta^*$ is strictly less than the risk of $\hat\theta$. This inequality holds for all values of the parameter $\theta$.

Note: Condition (1) can only hold if $k > 2$. (Since $k$, the dimension of $\theta$, is an integer, this means $k \ge 3$.)
Proof. Let
$$g(q) = \frac{c(q)\left(c(q) - 2(k-2)\right)}{q} - 4c'(q).$$
For all $q \ge 0$, $g(q) < 0$ by the assumptions. Thus $E\,g(q) < 0$ for any non-negative random variable $q$. Setting $q_k = \hat\theta'V^{-1}\hat\theta$, a non-central chi-square random variable with $k$ degrees of freedom and non-centrality parameter $\lambda = \theta'V^{-1}\theta$, we obtain
$$R(\theta, \hat\theta^*) = k + E\,g(q_k) < k = R(\theta, \hat\theta).$$

An important special case takes $c(q) = c$, a constant, which yields the James-Stein estimator
$$\hat\theta^* = \left(1 - \frac{c}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta. \qquad (2)$$
Theorem 2 If $\hat\theta \sim N(\theta, V)$, $k > 2$, and $0 < c < 2(k-2)$, then for the estimator (2),
$$R(\theta, \hat\theta^*) < R(\theta, \hat\theta):$$
the risk of the James-Stein estimator is strictly less than that of the usual estimator. This inequality holds for all values of the parameter $\theta$.

Since the risk is quadratic in $c$, we can also see that the risk is minimized by setting $c = k - 2$. This yields the classic form of the James-Stein estimator
$$\hat\theta_{JS} = \left(1 - \frac{k-2}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta.$$
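A quick Monte Carlo illustration of Theorem 2's risk reduction, under illustrative assumptions ($V = I_k$, $k = 6$, $c = k - 2$, and one particular $\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
k, R = 6, 100_000
theta = np.full(k, 0.5)

x = theta + rng.standard_normal((R, k))     # R draws of theta_hat ~ N(theta, I_k)
q = np.sum(x**2, axis=1)                    # theta_hat' V^{-1} theta_hat
js = (1.0 - (k - 2) / q)[:, None] * x       # classic James-Stein, c = k - 2

risk_usual = np.mean(np.sum((x - theta)**2, axis=1))   # should be close to k
risk_js = np.mean(np.sum((js - theta)**2, axis=1))
print(risk_usual, risk_js)                  # expect risk_js < risk_usual
```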
When $\hat\theta'V^{-1}\hat\theta < c$, the shrinkage weight $1 - c/\hat\theta'V^{-1}\hat\theta$ is negative: the James-Stein estimator over-shrinks, and flips the sign of $\hat\theta^*$ relative to $\hat\theta$. This is corrected by using the positive-part version
$$\hat\theta^+ = \left(1 - \frac{c}{\hat\theta'V^{-1}\hat\theta}\right)_+\hat\theta = \begin{cases} \left(1 - \dfrac{c}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta & \hat\theta'V^{-1}\hat\theta \ge c \\ 0 & \text{else,} \end{cases}$$
or, in terms of the shrinkage function,
$$h(q) = \begin{cases} \dfrac{c}{q} & q \ge c \\ 1 & q < c. \end{cases}$$
In general, the positive-part version of
$$\hat\theta^* = \left(1 - h\left(\hat\theta'W\hat\theta\right)\right)\hat\theta$$
is
$$\hat\theta^+ = \left(1 - h\left(\hat\theta'W\hat\theta\right)\right)_+\hat\theta.$$

Theorem. For any shrinkage estimator of this form, $R(\theta, \hat\theta^+) < R(\theta, \hat\theta^*)$. The proof is a bit technical, so we will skip it.
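Although we skip the proof, simulation evidence is easy to produce. A sketch under illustrative assumptions ($V = I_k$, $c = k - 2$, and a small $\theta$ so that over-shrinkage actually occurs):

```python
import numpy as np

rng = np.random.default_rng(1)
k, R = 6, 100_000
c = k - 2
theta = np.full(k, 0.2)                     # small theta: sign flips are likely

x = theta + rng.standard_normal((R, k))
q = np.sum(x**2, axis=1)
w = 1.0 - c / q
js = w[:, None] * x                         # plain James-Stein
jsp = np.clip(w, 0.0, None)[:, None] * x    # positive-part version

print(np.mean(np.sum((js - theta)**2, axis=1)),
      np.mean(np.sum((jsp - theta)**2, axis=1)))  # positive part has lower risk
```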
Returning to the general weight matrix $W$, the same calculation gives
$$R(\theta, \hat\theta^*, W) = \mathrm{tr}(WV) + E\left[\frac{c\left(\hat\theta'W\hat\theta\right)}{\hat\theta'W\hat\theta}\left(c\left(\hat\theta'W\hat\theta\right) - 2\,\mathrm{tr}(WV) + 4\,\frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\right) - 4\,\frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\,c'\left(\hat\theta'W\hat\theta\right)\right].$$
Using a result about eigenvalues and setting $h = W^{1/2}\hat\theta$,
$$0 \le \frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta} \le \max_\theta \frac{\theta'WVW\theta}{\theta'W\theta} = \max_h \frac{h'W^{1/2}VW^{1/2}h}{h'h} = \lambda_{\max}\left(W^{1/2}VW^{1/2}\right) = \lambda_{\max}(WV).$$
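This bound is easy to verify numerically. A sketch with randomly generated positive-definite $W$ and $V$ (illustrative choices only):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 5
A, B = rng.standard_normal((k, k)), rng.standard_normal((k, k))
V = A @ A.T + np.eye(k)                     # positive-definite V
W = B @ B.T + np.eye(k)                     # positive-definite W

# Eigenvalues of WV are real and positive (WV is similar to W^{1/2} V W^{1/2}).
lam_max = np.max(np.linalg.eigvals(W @ V).real)

ratios = [(t @ W @ V @ W @ t) / (t @ W @ t)
          for t in rng.standard_normal((1000, k))]
print(max(ratios), lam_max)                 # max ratio never exceeds lam_max
```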
Thus if $c'(q) \ge 0$,
$$R(\theta, \hat\theta^*, W) \le \mathrm{tr}(WV) + E\left[\frac{c\left(\hat\theta'W\hat\theta\right)}{\hat\theta'W\hat\theta}\left(c\left(\hat\theta'W\hat\theta\right) - 2\,\mathrm{tr}(WV) + 4\,\lambda_{\max}(WV)\right)\right] < \mathrm{tr}(WV),$$
where the final inequality holds when
$$0 < c(q) < 2\left(\mathrm{tr}(WV) - 2\,\lambda_{\max}(WV)\right). \qquad (3)$$
When $W = V^{-1}$, the upper bound in (3) is $2(k-2)$, so this is the same as for the default weight matrix.
Theorem 3 For any absolutely continuous and non-decreasing function $c(q)$ such that (3) holds,
$$R(\theta, \hat\theta^*, W) < R(\theta, \hat\theta, W).$$

The same construction can be used to shrink towards a restricted estimator rather than towards zero: let $\hat\theta$ be the unrestricted estimator (e.g. the long regression) and $\tilde\theta$ the restricted estimator (e.g. the short regression).
The classic form is
$$\hat\theta^* = \hat\theta - \left(\frac{r-2}{\left(\hat\theta - \tilde\theta\right)'\hat V^{-1}\left(\hat\theta - \tilde\theta\right)}\right)_1\left(\hat\theta - \tilde\theta\right)$$
where $(a)_1 = \min(a, 1)$, $\hat V$ is the covariance matrix for $\hat\theta - \tilde\theta$, and $r$ is the number of restrictions (imposed in moving from $\hat\theta$ to $\tilde\theta$).
This estimator shrinks $\hat\theta$ towards $\tilde\theta$, with the degree of shrinkage depending on the magnitude of $\hat\theta - \tilde\theta$.
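A minimal sketch of this estimator in code, assuming (as discussed above) that $\hat V$ estimates the covariance matrix of $\hat\theta - \tilde\theta$; the function name and interface are my own:

```python
import numpy as np

def shrink_toward(theta_hat, theta_tilde, V_d, r):
    """Shrink theta_hat toward theta_tilde, with r > 2 restrictions and
    V_d the (estimated) covariance matrix of theta_hat - theta_tilde."""
    d = theta_hat - theta_tilde
    D = d @ np.linalg.solve(V_d, d)    # (theta_hat - theta_tilde)' V_d^{-1} (...)
    w = min((r - 2) / D, 1.0)          # the truncation (a)_1 = min(a, 1)
    return theta_hat - w * d           # equals theta_tilde when w = 1
```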
This approach works for nested models, so that $\left(\hat\theta - \tilde\theta\right)'\hat V^{-1}\left(\hat\theta - \tilde\theta\right)$ is approximately (non-central) chi-square. It is unclear how to extend the idea to non-nested models, where $\left(\hat\theta - \tilde\theta\right)'\hat V^{-1}\left(\hat\theta - \tilde\theta\right)$ is not chi-square.
17.10 Inference
We discussed shrinkage estimation. Model averaging, selection, and shrinkage estimators have non-standard, non-normal distributions. Standard errors, testing, and confidence intervals need development.