
17 Shrinkage

17.1 Mallows Averaging and Shrinkage


Suppose there are two models or estimators of $g = E(y \mid X)$:

(1) $\hat g_0 = 0$

(2) $\hat g_1 = X\hat\beta$

Given weights $(1-w)$ and $w$, an averaging estimator is $\hat g = w X\hat\beta$.
The Mallows criterion is
$$
\begin{aligned}
C(w) &= \mathbf{w}'\hat E'\hat E\,\mathbf{w} + 2\hat\sigma^2\,\mathbf{w}'K \\
&= \begin{pmatrix} 1-w & w \end{pmatrix}
\begin{pmatrix} y'y & y'\hat e \\ \hat e'y & \hat e'\hat e \end{pmatrix}
\begin{pmatrix} 1-w \\ w \end{pmatrix} + 2\hat\sigma^2 w k \\
&= (1-w)^2\, y'y + \left(w^2 + 2w(1-w)\right)\hat e'\hat e + 2\hat\sigma^2 w k \\
&= (1-w)^2\left(y'y - \hat e'\hat e\right) + \hat e'\hat e + 2\hat\sigma^2 w k
\end{aligned}
$$
where $\mathbf{w} = (1-w, w)'$, $\hat E = (\hat e_0, \hat e_1)$ stacks the residual vectors $\hat e_0 = y$ and $\hat e_1 = \hat e = y - X\hat\beta$ of the two models, $K = (0, k)'$ collects their parameter counts, and the third line uses $y'\hat e = \hat e'\hat e$ (since $X'\hat e = 0$).

The FOC for minimization is
$$
\frac{d}{dw} C(w) = -2(1-w)\left(y'y - \hat e'\hat e\right) + 2\hat\sigma^2 k = 0
$$
with solution
$$
\hat w = 1 - \frac{k}{F}
$$
where
$$
F = \frac{y'y - \hat e'\hat e}{\hat\sigma^2}
$$
is the Wald statistic for $\beta = 0$. Imposing the constraint $\hat w \in [0,1]$ we obtain
$$
\hat w = \begin{cases} 1 - \dfrac{k}{F} & F \geq k \\[6pt] 0 & F < k \end{cases}
$$

The Mallows averaging estimator thus equals
$$
\tilde\beta = \hat\beta\left(1 - \frac{k}{F}\right)_+
$$
where $(a)_+ = a$ if $a \geq 0$, and $0$ otherwise.


This is a Stein-type shrinkage estimator.
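
The rule is simple to compute from standard regression output. Below is a minimal numerical sketch (a Python illustration on simulated data; the sample size, dimension, and coefficient values are arbitrary assumptions, not values from the notes) of the truncated Mallows weight and the resulting averaging estimator.

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
X = rng.normal(size=(n, k))
y = X @ np.full(k, 0.2) + rng.normal(size=n)       # illustrative data-generating process

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # unrestricted least squares
e_hat = y - X @ beta_hat                           # residuals
sigma2_hat = e_hat @ e_hat / (n - k)               # variance estimate used in C(w)

F = (y @ y - e_hat @ e_hat) / sigma2_hat           # Wald statistic for beta = 0
w_hat = max(1.0 - k / F, 0.0)                      # Mallows weight, truncated to [0, 1]
beta_avg = w_hat * beta_hat                        # Stein-type shrinkage of beta_hat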

17.2 Loss and Risk
A great reference is Theory of Point Estimation, 2nd Edition, by Lehmann and Casella.
Let $\hat\theta$ be an estimator for $\theta$, $k \times 1$. Suppose $\hat\theta$ is an (asymptotically) sufficient statistic for $\theta$, so that
any other estimator can be written as a function of $\hat\theta$. We call $\hat\theta$ the "usual" estimator.

Suppose that $\sqrt{n}\left(\hat\theta - \theta\right) \to_d N(0, V)$. Thus, approximately,
$$
\hat\theta \overset{a}{\sim} N(\theta, V_n)
$$
where $V_n = n^{-1}V$. Most Stein-type theory is developed for the exact distribution case; it
carries over to the asymptotic setting as an approximation. From now on we will assume that $\hat\theta$ has an
exact normal distribution, and that $V_n = V$ is known. (Equivalently, we can rewrite the statistical
problem as local to $\theta$ using the "Limits of Experiments" theory.)
Is $\hat\theta$ the best estimator for $\theta$, in the sense of minimizing the risk (expected loss)?
The risk of $\tilde\theta$ under weighted squared error loss is
$$
R(\theta, \tilde\theta, W) = E\left[(\tilde\theta - \theta)' W (\tilde\theta - \theta)\right]
= \operatorname{tr}\left(W\, E\left[(\tilde\theta - \theta)(\tilde\theta - \theta)'\right]\right)
$$

A convenient choice for the weight matrix is $W = V^{-1}$. Then
$$
R(\theta, \hat\theta, V^{-1}) = \operatorname{tr}\left(V^{-1} E\left[(\hat\theta - \theta)(\hat\theta - \theta)'\right]\right)
= \operatorname{tr}\left(V^{-1} V\right) = k.
$$

If $W \neq V^{-1}$ then
$$
R(\theta, \hat\theta, W) = \operatorname{tr}\left(W E\left[(\hat\theta - \theta)(\hat\theta - \theta)'\right]\right)
= \operatorname{tr}(WV),
$$
which depends on $WV$.


Again, we want to know whether the risk of another feasible estimator can be smaller than $\operatorname{tr}(WV)$.
Take the simple (or silly) estimator $\tilde\theta = 0$. This has risk
$$
R(\theta, 0, W) = \theta' W \theta.
$$
Thus $\tilde\theta = 0$ has smaller risk than $\hat\theta$ when $\theta'W\theta < \operatorname{tr}(WV)$, and larger risk when $\theta'W\theta > \operatorname{tr}(WV)$.
Neither $\hat\theta$ nor $\tilde\theta = 0$ is "better" in the sense of having (uniformly) smaller risk! It is not enough to
ask whether one estimator has smaller risk than another, as in general the risk is a function depending
on unknowns.
As another example, take the simple averaging (or shrinkage) estimator
$$
\tilde\theta = w\hat\theta
$$
where $w$ is a fixed constant. Since
$$
\tilde\theta - \theta = w\left(\hat\theta - \theta\right) - (1-w)\theta,
$$
we can calculate that
$$
R(\theta, \tilde\theta, W) = w^2 R(\theta, \hat\theta, W) + (1-w)^2\, \theta'W\theta
= w^2 \operatorname{tr}(WV) + (1-w)^2\, \theta'W\theta.
$$

This is minimized by setting
$$
w = \frac{\theta'W\theta}{\operatorname{tr}(WV) + \theta'W\theta},
$$
which is strictly in $(0,1)$. [This is illustrative, and does not suggest an empirical rule for selecting
$w$, since the optimal $w$ depends on the unknown $\theta$.]
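
As a purely numerical illustration of the formula above, a short sketch (the covariance matrix and parameter values are made up, since the optimal $w$ is infeasible in practice):

import numpy as np

V = np.diag([1.0, 0.5, 0.25])          # assumed known covariance of theta_hat
W = np.linalg.inv(V)                   # weight matrix
theta = np.array([0.3, -0.2, 0.1])     # a hypothetical true parameter value

def risk_fixed_w(w):
    # R(theta, w * theta_hat, W) = w^2 tr(WV) + (1 - w)^2 theta' W theta
    return w**2 * np.trace(W @ V) + (1 - w)**2 * theta @ W @ theta

w_opt = (theta @ W @ theta) / (np.trace(W @ V) + theta @ W @ theta)
print(w_opt, risk_fixed_w(w_opt), risk_fixed_w(1.0))   # the optimal w beats w = 1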

17.3 Admissible and Minimax Estimators


For reference.
To compare the risk functions of two estimators, we have the following concepts.

Definition 1 $\hat\theta$ weakly dominates $\tilde\theta$ if $R(\theta, \hat\theta) \leq R(\theta, \tilde\theta)$ for all $\theta$.

Definition 2 $\hat\theta$ dominates $\tilde\theta$ if $R(\theta, \hat\theta) \leq R(\theta, \tilde\theta)$ for all $\theta$, and $R(\theta, \hat\theta) < R(\theta, \tilde\theta)$ for at least one $\theta$.

Clearly, we should prefer an estimator if it dominates the other.

Definition 3 An estimator is admissible if it is not dominated by another estimator. An estimator
is inadmissible if it is dominated by another estimator.

Admissibility is a desirable property for an estimator.


If the risk functions of two estimators cross, then neither dominates the other. How do we
compare these two estimators?
One approach is to calculate the worst-case scenario. Specifically, we define the maximum risk
of an estimator $\tilde\theta$ as
$$
\bar R(\tilde\theta) = \sup_{\theta} R\left(\theta, \tilde\theta\right).
$$
We can think of it this way: suppose we use $\tilde\theta$ to estimate $\theta$. Then what is the worst case; how badly can this
estimator do?

For example, for the usual estimator, $R\left(\theta, \hat\theta, W\right) = \operatorname{tr}(WV)$ for all $\theta$, so
$$
\bar R(\hat\theta) = \operatorname{tr}(WV),
$$
while for the silly estimator $\tilde\theta = 0$,
$$
\bar R(0) = \sup_{\theta}\, \theta'W\theta = \infty.
$$
The latter is an example of an estimator with unbounded risk. To guard against extreme worst
cases, it seems sensible to avoid estimators with unbounded risk.
The minimum value of the maximum risk $\bar R(\bar\theta)$ across all estimators $\bar\theta = \bar\theta(\hat\theta)$ is
$$
\inf_{\bar\theta} \bar R(\bar\theta) = \inf_{\bar\theta} \sup_{\theta} R(\theta, \bar\theta),
$$
where the infimum is taken over all estimators.

Definition 4 An estimator $\tilde\theta$ of $\theta$ which minimizes the maximum risk,
$$
\inf_{\bar\theta} \sup_{\theta} R(\theta, \bar\theta) = \sup_{\theta} R(\theta, \tilde\theta),
$$
is called a minimax estimator.

It is desirable for an estimator to be minimax, again as protection against the worst-case
scenario.
There is no general rule for determining the minimax bound. However, in the case $\hat\theta \sim N(\theta, I_k)$,
it is known that $\hat\theta$ is minimax for $\theta$.

17.4 Shrinkage Estimators


Suppose $\hat\theta \sim N(\theta, V)$.
A general form for a shrinkage estimator for $\theta$ is
$$
\hat\theta^* = \left(1 - h(\hat\theta' W \hat\theta)\right)\hat\theta
$$
where $h : [0, \infty) \to [0, \infty)$. Sometimes this is written as
$$
\hat\theta^* = \left(1 - \frac{c(\hat\theta' W \hat\theta)}{\hat\theta' W \hat\theta}\right)\hat\theta
$$
where $c(q) = q\, h(q)$.


This notation includes the James-Stein estimator, pretest estimators, selection estimators, and
the model averaging estimator of Section 17.1. Pretest and selection estimators take the form
$$
h(q) = 1(q < a)
$$
where $a = 2k$ for Mallows selection, and $a$ is the critical value from a chi-square distribution for a
pretest estimator.
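
To make the common structure concrete, here is a small sketch of the map $\hat\theta \mapsto (1 - h(\hat\theta'W\hat\theta))\hat\theta$ for a few choices of $h$; the dimension $k = 4$ and the 5% pretest level are illustrative assumptions, not values from the notes.

import numpy as np
from scipy.stats import chi2

def shrink(theta_hat, W, h):
    # theta* = (1 - h(theta_hat' W theta_hat)) * theta_hat
    q = theta_hat @ W @ theta_hat
    return (1.0 - h(q)) * theta_hat

k = 4
h_selection = lambda q: 1.0 if q < 2 * k else 0.0              # Mallows selection rule
h_pretest   = lambda q: 1.0 if q < chi2.ppf(0.95, k) else 0.0  # pretest at an assumed 5% level
h_js        = lambda q: (k - 2) / q                            # James-Stein form (c = k - 2)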
We now calculate the risk of $\hat\theta^*$. Note that
$$
\hat\theta^* - \theta = \hat\theta - \theta - h(\hat\theta' W \hat\theta)\,\hat\theta.
$$
Thus
$$
(\hat\theta^* - \theta)' W (\hat\theta^* - \theta) = (\hat\theta - \theta)' W (\hat\theta - \theta) + h(\hat\theta' W \hat\theta)^2\, \hat\theta' W \hat\theta - 2\, h(\hat\theta' W \hat\theta)\, \hat\theta' W (\hat\theta - \theta).
$$
Taking expectations,
$$
R(\theta, \hat\theta^*, W) = \operatorname{tr}(WV) + E\left[h(\hat\theta' W \hat\theta)^2\, \hat\theta' W \hat\theta\right] - 2\, E\left[h(\hat\theta' W \hat\theta)\, \hat\theta' W (\hat\theta - \theta)\right].
$$

To simplify the second expectation when h is continuous we use:

Lemma 1 (Stein's Lemma) If $\eta(\cdot) : \mathbb{R}^k \to \mathbb{R}^k$ is absolutely continuous and $\hat\theta \sim N(\theta, V)$, then
$$
E\left[\eta(\hat\theta)'\left(\hat\theta - \theta\right)\right] = E\left[\operatorname{tr}\left(\frac{\partial}{\partial\theta}\eta(\hat\theta)'\, V\right)\right].
$$

Proof: Let
$$
\phi(x) = \frac{1}{(2\pi)^{k/2}\det(V)^{1/2}} \exp\left(-\frac{1}{2}\, x' V^{-1} x\right)
$$
denote the $N(0, V)$ density. Then
$$
\frac{\partial}{\partial x}\phi(x) = -V^{-1} x\, \phi(x)
$$
and
$$
\frac{\partial}{\partial x}\phi(x - \theta) = -V^{-1}(x - \theta)\, \phi(x - \theta).
$$
By multivariate integration by parts,
$$
\begin{aligned}
E\left[\eta(\hat\theta)'\left(\hat\theta - \theta\right)\right] &= \int \eta(x)'\, V\, V^{-1}(x - \theta)\, \phi(x - \theta)\, dx \\
&= -\int \eta(x)'\, V\, \frac{\partial}{\partial x}\phi(x - \theta)\, dx \\
&= \int \operatorname{tr}\left(\frac{\partial}{\partial x}\eta(x)'\, V\right) \phi(x - \theta)\, dx \\
&= E\left[\operatorname{tr}\left(\frac{\partial}{\partial\theta}\eta(\hat\theta)'\, V\right)\right]
\end{aligned}
$$
as stated.
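
A quick Monte Carlo check of the lemma, using an arbitrary smooth map $\eta(x) = x^3$ (applied componentwise) and a diagonal $V$ chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(3)
k, reps = 3, 200000
theta = np.array([1.0, -0.5, 0.25])
V = np.diag([1.0, 2.0, 0.5])
L = np.linalg.cholesky(V)

theta_hat = theta + rng.normal(size=(reps, k)) @ L.T   # draws of theta_hat ~ N(theta, V)

# Left side: E[eta(theta_hat)'(theta_hat - theta)] with eta(x) = x^3 componentwise
lhs = np.mean(np.sum(theta_hat**3 * (theta_hat - theta), axis=1))
# Right side: E[tr((d/d theta) eta(theta_hat)' V)] = E[sum_j 3 theta_hat_j^2 V_jj] for diagonal V
rhs = np.mean(np.sum(3 * theta_hat**2 * np.diag(V), axis=1))
print(lhs, rhs)                                        # the two averages should be close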
Let $\eta(\theta)' = h(\theta'W\theta)\,\theta'W$, for which
$$
\frac{\partial}{\partial\theta}\eta(\theta)' = h(\theta'W\theta)\, W + 2\, W\theta\theta'W\, h'(\theta'W\theta)
$$
and
$$
\operatorname{tr}\left(\frac{\partial}{\partial\theta}\eta(\theta)'\, V\right) = \operatorname{tr}(WV)\, h(\theta'W\theta) + 2\, \theta'WVW\theta\, h'(\theta'W\theta).
$$
Then by Stein's Lemma,
$$
E\left[h(\hat\theta'W\hat\theta)\,\hat\theta'W(\hat\theta - \theta)\right] = \operatorname{tr}(WV)\, E\left[h(\hat\theta'W\hat\theta)\right] + 2\, E\left[(\hat\theta'WVW\hat\theta)\, h'(\hat\theta'W\hat\theta)\right].
$$

Applying this to the risk calculation, we obtain


Theorem.
$$
\begin{aligned}
R(\theta, \hat\theta^*, W) &= \operatorname{tr}(WV) + E\left[h(\hat\theta'W\hat\theta)^2\, \hat\theta'W\hat\theta\right] - 2\operatorname{tr}(WV)\, E\left[h(\hat\theta'W\hat\theta)\right] - 4\, E\left[(\hat\theta'WVW\hat\theta)\, h'(\hat\theta'W\hat\theta)\right] \\
&= \operatorname{tr}(WV) + E\left[\frac{c(\hat\theta'W\hat\theta)\left(c(\hat\theta'W\hat\theta) - 2\operatorname{tr}(WV) + 4\,\dfrac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\right)}{\hat\theta'W\hat\theta} - 4\,\frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\, c'(\hat\theta'W\hat\theta)\right]
\end{aligned}
$$
where the final equality uses the alternative expression $h(q) = c(q)/q$.
We are trying to find cases where $R(\theta, \hat\theta^*, W) < R(\theta, \hat\theta, W)$. This requires the term in the
expectation to be negative.
We now explore some special cases.

17.5 Default Weight Matrix


Set
$$
W = V^{-1}
$$
and write
$$
R(\theta, \hat\theta^*) = R(\theta, \hat\theta^*, V^{-1}).
$$
Then
$$
R(\theta, \hat\theta^*) = k + E\left[\frac{c(\hat\theta'V^{-1}\hat\theta)\left(c(\hat\theta'V^{-1}\hat\theta) - 2k + 4\right)}{\hat\theta'V^{-1}\hat\theta} - 4\, c'(\hat\theta'V^{-1}\hat\theta)\right].
$$

Theorem 1 For any absolutely continuous and non-decreasing function $c(q)$ such that
$$
0 < c(q) < 2(k - 2), \qquad (1)
$$
then
$$
R(\theta, \hat\theta^*) < R(\theta, \hat\theta);
$$
the risk of $\hat\theta^*$ is strictly less than the risk of $\hat\theta$. This inequality holds for all values of the parameter
$\theta$.

Note: Condition (1) can only hold if $k > 2$. (Since $k$, the dimension of $\theta$, is an integer, this
means $k \geq 3$.)
Proof. Let
$$
g(q) = \frac{c(q)\left(c(q) - 2(k - 2)\right)}{q} - 4\, c'(q).
$$
For all $q \geq 0$, $g(q) < 0$ by the assumptions. Thus $E\left[g(q)\right] < 0$ for any non-negative random variable $q$.
Setting $q_k = \hat\theta'V^{-1}\hat\theta$,
$$
R(\theta, \hat\theta^*) = k + E\left[g(q_k)\right] < k = R(\theta, \hat\theta),
$$
which proves the result.

It is also useful to note that
$$
q_k = \hat\theta'V^{-1}\hat\theta \sim \chi^2_k(\lambda),
$$
a non-central chi-square random variable with $k$ degrees of freedom and non-centrality parameter
$$
\lambda = \theta'V^{-1}\theta.
$$
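
A short simulation check of this distributional claim (all parameter values below are arbitrary illustrations); it compares the sample mean of $q_k$ with the non-central chi-square mean $k + \lambda$:

import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(4)
k, reps = 5, 100000
theta = np.array([0.5, -1.0, 0.0, 0.25, 0.75])
V = np.diag([1.0, 0.5, 2.0, 1.0, 0.25])
lam = theta @ np.linalg.solve(V, theta)               # non-centrality parameter lambda

L = np.linalg.cholesky(V)
theta_hat = theta + rng.normal(size=(reps, k)) @ L.T  # theta_hat ~ N(theta, V)
q_k = np.einsum('ij,jk,ik->i', theta_hat, np.linalg.inv(V), theta_hat)

print(np.mean(q_k), ncx2.mean(k, lam))                # both should be close to k + lambda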

17.6 James-Stein Estimator


Set $c(q) = c$, a constant. This is the James-Stein estimator
$$
\hat\theta^* = \left(1 - \frac{c}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta. \qquad (2)
$$

Theorem 2 If $\hat\theta \sim N(\theta, V)$, $k > 2$, and $0 < c < 2(k - 2)$, then for (2),
$$
R(\theta, \hat\theta^*) < R(\theta, \hat\theta);
$$
the risk of the James-Stein estimator is strictly less than that of the usual estimator. This inequality holds
for all values of the parameter $\theta$.

Since the risk is quadratic in $c$, we can also see that the risk is minimized by setting $c = k - 2$.
This yields the classic form of the James-Stein estimator
$$
\hat\theta^* = \left(1 - \frac{k - 2}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta.
$$
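
Theorem 2 is straightforward to verify by simulation. The sketch below (assuming $V = I_k$ and arbitrary illustrative choices of $k$ and $\theta$) compares the Monte Carlo risk of the usual estimator with that of the classic James-Stein estimator.

import numpy as np

rng = np.random.default_rng(1)
k, reps = 6, 20000
theta = np.full(k, 0.5)                                # an arbitrary true parameter

theta_hat = theta + rng.normal(size=(reps, k))         # theta_hat ~ N(theta, I_k)
q = np.sum(theta_hat**2, axis=1)                       # theta_hat' V^{-1} theta_hat with V = I_k
theta_js = (1 - (k - 2) / q)[:, None] * theta_hat      # classic James-Stein, c = k - 2

risk_usual = np.mean(np.sum((theta_hat - theta)**2, axis=1))   # approximately k
risk_js = np.mean(np.sum((theta_js - theta)**2, axis=1))       # smaller, for every theta
print(risk_usual, risk_js)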

17.7 Positive-Part James-Stein


If $\hat\theta'V^{-1}\hat\theta < c$ then
$$
1 - \frac{c}{\hat\theta'V^{-1}\hat\theta} < 0
$$
and the James-Stein estimator over-shrinks, and flips the sign of $\hat\theta^*$ relative to $\hat\theta$. This is corrected
by using the positive-part version
$$
\hat\theta^+ = \left(1 - \frac{c}{\hat\theta'V^{-1}\hat\theta}\right)_+ \hat\theta
= \begin{cases} \left(1 - \dfrac{c}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta & \hat\theta'V^{-1}\hat\theta \geq c \\[8pt] 0 & \text{else} \end{cases}
$$

This bears some resemblance to selection estimators.
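
A minimal sketch of the positive-part rule as a reusable function (the default $c = k - 2$ below is an assumption carried over from the previous section):

import numpy as np

def james_stein_plus(theta_hat, V, c=None):
    # Positive-part James-Stein: shrink toward zero but never flip the sign.
    k = theta_hat.shape[0]
    c = (k - 2) if c is None else c
    q = theta_hat @ np.linalg.solve(V, theta_hat)      # theta_hat' V^{-1} theta_hat
    return max(1.0 - c / q, 0.0) * theta_hat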


The positive-part estimator takes the shrinkage form with
$$
c(q) = \begin{cases} c & q \geq c \\ q & q < c \end{cases}
$$
or
$$
h(q) = \begin{cases} \dfrac{c}{q} & q \geq c \\[6pt] 1 & q < c \end{cases}
$$
In general the positive-part version of
$$
\hat\theta^* = \left(1 - h(\hat\theta'W\hat\theta)\right)\hat\theta
$$
is
$$
\hat\theta^+ = \left(1 - h(\hat\theta'W\hat\theta)\right)_+ \hat\theta.
$$

Theorem. For any shrinkage estimator, $R(\theta, \hat\theta^+) < R(\theta, \hat\theta^*)$.
The proof is a bit technical, so we will skip it.

17.8 General Weight Matrix


Recall that for general $c(q)$ and weight $W$ we had
$$
R(\theta, \hat\theta^*, W) = \operatorname{tr}(WV) + E\left[\frac{c(\hat\theta'W\hat\theta)\left(c(\hat\theta'W\hat\theta) - 2\operatorname{tr}(WV) + 4\,\dfrac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\right)}{\hat\theta'W\hat\theta} - 4\,\frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\, c'(\hat\theta'W\hat\theta)\right].
$$
Using a result about eigenvalues and setting $h = W^{1/2}\hat\theta$,
$$
0 \leq \frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}
\leq \max_{\theta}\frac{\theta'WVW\theta}{\theta'W\theta}
= \max_{h}\frac{h'W^{1/2}VW^{1/2}h}{h'h}
= \lambda_{\max}\left(W^{1/2}VW^{1/2}\right)
= \lambda_{\max}(WV).
$$

Thus if $c'(q) \geq 0$,
$$
R(\theta, \hat\theta^*, W) \leq \operatorname{tr}(WV) + E\left[\frac{c(\hat\theta'W\hat\theta)\left(c(\hat\theta'W\hat\theta) - 2\operatorname{tr}(WV) + 4\,\lambda_{\max}(WV)\right)}{\hat\theta'W\hat\theta}\right] < \operatorname{tr}(WV),
$$
the final inequality holding if
$$
0 < c(q) < 2\left(\operatorname{tr}(WV) - 2\,\lambda_{\max}(WV)\right). \qquad (3)
$$
When $W = V^{-1}$, the upper bound is $2(k - 2)$, so this is the same as for the default weight matrix.

Theorem 3 For any absolutely continuous and non-decreasing function $c(q)$ such that (3) holds,
then
$$
R(\theta, \hat\theta^*, W) < R(\theta, \hat\theta, W);
$$
the risk of $\hat\theta^*$ is strictly less than the risk of $\hat\theta$.
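
The eigenvalue bound used above is easy to check numerically. The sketch below draws arbitrary positive definite $W$ and $V$ (purely illustrative) and verifies that the ratio $\theta'WVW\theta/\theta'W\theta$ never exceeds $\lambda_{\max}(WV)$.

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4)); V = A @ A.T + np.eye(4)   # an arbitrary positive definite V
B = rng.normal(size=(4, 4)); W = B @ B.T + np.eye(4)   # an arbitrary positive definite W

lam_max = np.max(np.linalg.eigvals(W @ V).real)        # lambda_max(WV)
ratios = [(t @ W @ V @ W @ t) / (t @ W @ t)
          for t in rng.normal(size=(1000, 4))]
print(max(ratios) <= lam_max + 1e-8)                   # expected: True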

17.9 Shrinkage Towards Restrictions


The classic James-Stein estimator shrinks towards the zero vector. More generally, shrinkage
can be towards restricted estimators, or towards linear or non-linear subspaces.
These estimators take the form
$$
\hat\theta^* = \hat\theta - h\left((\hat\theta - \tilde\theta)'W(\hat\theta - \tilde\theta)\right)\left(\hat\theta - \tilde\theta\right)
$$
where $\hat\theta$ is the unrestricted estimator (e.g. the long regression) and $\tilde\theta$ is the restricted estimator
(e.g. the short regression).
The classic form is
$$
\hat\theta^* = \hat\theta - \left(\frac{r - 2}{(\hat\theta - \tilde\theta)'\hat V^{-1}(\hat\theta - \tilde\theta)}\right)_1 \left(\hat\theta - \tilde\theta\right)
$$
where $(a)_1 = \min(a, 1)$, $\hat V$ is the covariance matrix for $\hat\theta$, and $r$ is the number of restrictions (imposed by
the restriction from $\hat\theta$ to $\tilde\theta$).
This estimator shrinks $\hat\theta$ towards $\tilde\theta$, with the degree of shrinkage depending on the magnitude
of $\hat\theta - \tilde\theta$.
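
A small sketch of this rule as a function (the name and arguments are illustrative; it assumes nested models with $r > 2$ restrictions and a supplied covariance estimate):

import numpy as np

def shrink_toward(theta_unres, theta_res, V_hat, r):
    # theta* = theta_hat - min((r - 2) / D, 1) * (theta_hat - theta_tilde),
    # where D = (theta_hat - theta_tilde)' V_hat^{-1} (theta_hat - theta_tilde)
    d = theta_unres - theta_res
    dist = d @ np.linalg.solve(V_hat, d)
    weight = min((r - 2) / dist, 1.0)      # truncated at 1 to avoid over-shrinking
    return theta_unres - weight * d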
This approach works for nested models, so that $(\hat\theta - \tilde\theta)'\hat V^{-1}(\hat\theta - \tilde\theta)$ is approximately (non-central)
chi-square.
It is unclear how to extend the idea to non-nested models, where $(\hat\theta - \tilde\theta)'\hat V^{-1}(\hat\theta - \tilde\theta)$ is not
chi-square.

17.10 Inference
We discussed shrinkage estimation.
Model averaging, selection, and shrinkage estimators have non-standard, non-normal distributions.
Standard errors, testing, and confidence intervals need development.

