
17 Shrinkage

17.1 Mallows Averaging and Shrinkage


Suppose there are two models or estimators of $g = E(y \mid X)$:

(1) $\hat g_0 = 0$

(2) $\hat g_1 = X\hat\beta$

Given weights $(1-w)$ and $w$, an averaging estimator is $\hat g = w X\hat\beta$.
The Mallows criterion is
$$
\begin{aligned}
C(w) &= \mathbf{w}'\hat E'\hat E\,\mathbf{w} + 2\hat\sigma^2\,\mathbf{w}'K \\
&= \begin{pmatrix} 1-w & w \end{pmatrix}
\begin{pmatrix} y'y & y'\hat e \\ \hat e'y & \hat e'\hat e \end{pmatrix}
\begin{pmatrix} 1-w \\ w \end{pmatrix} + 2\hat\sigma^2 w k \\
&= (1-w)^2\, y'y + \left(w^2 + 2w(1-w)\right)\hat e'\hat e + 2\hat\sigma^2 w k \\
&= (1-w)^2\left(y'y - \hat e'\hat e\right) + \hat e'\hat e + 2\hat\sigma^2 w k
\end{aligned}
$$
where $\mathbf{w} = (1-w, w)'$, $\hat E = (\hat e_0, \hat e_1)$ stacks the residual vectors $\hat e_0 = y$ and $\hat e_1 = \hat e = y - X\hat\beta$ of the two models, $K = (0, k)'$ collects their parameter counts, and the third line uses $y'\hat e = \hat e'\hat e$ (since $X'\hat e = 0$).

The FOC for minimization is
$$
\frac{d}{dw} C(w) = -2(1-w)\left(y'y - \hat e'\hat e\right) + 2\hat\sigma^2 k = 0
$$
with solution
$$
\hat w = 1 - \frac{k}{F}
$$
where
$$
F = \frac{y'y - \hat e'\hat e}{\hat\sigma^2}
$$
is the Wald statistic for $\beta = 0$. Imposing the constraint $\hat w \in [0,1]$ we obtain
$$
\hat w = \begin{cases} 1 - \dfrac{k}{F} & F \geq k \\[6pt] 0 & F < k \end{cases}
$$

The Mallows averaging estimator thus equals
$$
\tilde\beta = \hat\beta\left(1 - \frac{k}{F}\right)_+
$$
where $(a)_+ = a$ if $a \geq 0$, and $0$ otherwise.


This is a Stein-type shrinkage estimator.
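
The rule is simple to compute from standard regression output. Below is a minimal numerical sketch (a Python illustration on simulated data; the sample size, dimension, and coefficient values are arbitrary assumptions, not values from the notes) of the truncated Mallows weight and the resulting averaging estimator.

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
X = rng.normal(size=(n, k))
y = X @ np.full(k, 0.2) + rng.normal(size=n)       # illustrative data-generating process

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # unrestricted least squares
e_hat = y - X @ beta_hat                           # residuals
sigma2_hat = e_hat @ e_hat / (n - k)               # variance estimate used in C(w)

F = (y @ y - e_hat @ e_hat) / sigma2_hat           # Wald statistic for beta = 0
w_hat = max(1.0 - k / F, 0.0)                      # Mallows weight, truncated to [0, 1]
beta_avg = w_hat * beta_hat                        # Stein-type shrinkage of beta_hat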

17.2 Loss and Risk
A great reference is Theory of Point Estimation, 2nd Edition, by Lehmann and Casella.
Let $\hat\theta$ be an estimator for $\theta$, $k \times 1$. Suppose $\hat\theta$ is an (asymptotically) sufficient statistic for $\theta$, so that
any other estimator can be written as a function of $\hat\theta$. We call $\hat\theta$ the "usual" estimator.

Suppose that $\sqrt{n}\left(\hat\theta - \theta\right) \to_d N(0, V)$. Thus, approximately,
$$
\hat\theta \overset{a}{\sim} N(\theta, V_n)
$$
where $V_n = n^{-1}V$. Most Stein-type theory is developed for the exact distribution case; it
carries over to the asymptotic setting as an approximation. From now on we will assume that $\hat\theta$ has an
exact normal distribution, and that $V_n = V$ is known. (Equivalently, we can rewrite the statistical
problem as local to $\theta$ using the "Limits of Experiments" theory.)
Is $\hat\theta$ the best estimator for $\theta$, in the sense of minimizing the risk (expected loss)?
The risk of $\tilde\theta$ under weighted squared error loss is
$$
R(\theta, \tilde\theta, W) = E\left[(\tilde\theta - \theta)' W (\tilde\theta - \theta)\right]
= \operatorname{tr}\left(W\, E\left[(\tilde\theta - \theta)(\tilde\theta - \theta)'\right]\right)
$$

A convenient choice for the weight matrix is $W = V^{-1}$. Then
$$
R(\theta, \hat\theta, V^{-1}) = \operatorname{tr}\left(V^{-1} E\left[(\hat\theta - \theta)(\hat\theta - \theta)'\right]\right)
= \operatorname{tr}\left(V^{-1} V\right) = k.
$$

If $W \neq V^{-1}$ then
$$
R(\theta, \hat\theta, W) = \operatorname{tr}\left(W E\left[(\hat\theta - \theta)(\hat\theta - \theta)'\right]\right)
= \operatorname{tr}(WV),
$$
which depends on $WV$.


Again, we want to know whether the risk of another feasible estimator can be smaller than $\operatorname{tr}(WV)$.
Take the simple (or silly) estimator $\tilde\theta = 0$. This has risk
$$
R(\theta, 0, W) = \theta' W \theta.
$$
Thus $\tilde\theta = 0$ has smaller risk than $\hat\theta$ when $\theta'W\theta < \operatorname{tr}(WV)$, and larger risk when $\theta'W\theta > \operatorname{tr}(WV)$.
Neither $\hat\theta$ nor $\tilde\theta = 0$ is "better" in the sense of having (uniformly) smaller risk! It is not enough to
ask whether one estimator has smaller risk than another, as in general the risk is a function depending
on unknowns.
As another example, take the simple averaging (or shrinkage) estimator
$$
\tilde\theta = w\hat\theta
$$
where $w$ is a fixed constant. Since
$$
\tilde\theta - \theta = w\left(\hat\theta - \theta\right) - (1-w)\theta,
$$
we can calculate that
$$
R(\theta, \tilde\theta, W) = w^2 R(\theta, \hat\theta, W) + (1-w)^2\, \theta'W\theta
= w^2 \operatorname{tr}(WV) + (1-w)^2\, \theta'W\theta.
$$

This is minimized by setting
$$
w = \frac{\theta'W\theta}{\operatorname{tr}(WV) + \theta'W\theta},
$$
which is strictly in $(0,1)$. [This is illustrative, and does not suggest an empirical rule for selecting
$w$, since the optimal $w$ depends on the unknown $\theta$.]
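
As a purely numerical illustration of the formula above, a short sketch (the covariance matrix and parameter values are made up, since the optimal $w$ is infeasible in practice):

import numpy as np

V = np.diag([1.0, 0.5, 0.25])          # assumed known covariance of theta_hat
W = np.linalg.inv(V)                   # weight matrix
theta = np.array([0.3, -0.2, 0.1])     # a hypothetical true parameter value

def risk_fixed_w(w):
    # R(theta, w * theta_hat, W) = w^2 tr(WV) + (1 - w)^2 theta' W theta
    return w**2 * np.trace(W @ V) + (1 - w)**2 * theta @ W @ theta

w_opt = (theta @ W @ theta) / (np.trace(W @ V) + theta @ W @ theta)
print(w_opt, risk_fixed_w(w_opt), risk_fixed_w(1.0))   # the optimal w beats w = 1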

17.3 Admissible and Minimax Estimators


For reference.
To compare the risk functions of two estimators, we have the following concepts.

Definition 1 $\hat\theta$ weakly dominates $\tilde\theta$ if $R(\theta, \hat\theta) \leq R(\theta, \tilde\theta)$ for all $\theta$.

Definition 2 $\hat\theta$ dominates $\tilde\theta$ if $R(\theta, \hat\theta) \leq R(\theta, \tilde\theta)$ for all $\theta$, and $R(\theta, \hat\theta) < R(\theta, \tilde\theta)$ for at least one $\theta$.

Clearly, we should prefer an estimator if it dominates the other.

Definition 3 An estimator is admissible if it is not dominated by another estimator. An estimator
is inadmissible if it is dominated by another estimator.

Admissibility is a desirable property for an estimator.


If the risk functions of two estimators cross, then neither dominates the other. How do we
compare these two estimators?
One approach is to calculate the worst-case scenario. Specifically, we define the maximum risk
of an estimator $\tilde\theta$ as
$$
\bar R(\tilde\theta) = \sup_{\theta} R\left(\theta, \tilde\theta\right).
$$
We can think of it this way: suppose we use $\tilde\theta$ to estimate $\theta$. Then what is the worst case; how badly can this
estimator do?

For example, for the usual estimator, $R\left(\theta, \hat\theta, W\right) = \operatorname{tr}(WV)$ for all $\theta$, so
$$
\bar R(\hat\theta) = \operatorname{tr}(WV),
$$
while for the silly estimator $\tilde\theta = 0$,
$$
\bar R(0) = \sup_{\theta}\, \theta'W\theta = \infty.
$$
The latter is an example of an estimator with unbounded risk. To guard against extreme worst
cases, it seems sensible to avoid estimators with unbounded risk.
The minimum value of the maximum risk $\bar R(\bar\theta)$ across all estimators $\bar\theta = \bar\theta(\hat\theta)$ is
$$
\inf_{\bar\theta} \bar R(\bar\theta) = \inf_{\bar\theta} \sup_{\theta} R(\theta, \bar\theta),
$$
where the infimum is taken over all estimators.

Definition 4 An estimator $\tilde\theta$ of $\theta$ which minimizes the maximum risk,
$$
\inf_{\bar\theta} \sup_{\theta} R(\theta, \bar\theta) = \sup_{\theta} R(\theta, \tilde\theta),
$$
is called a minimax estimator.

It is desirable for an estimator to be minimax, again as protection against the worst-case
scenario.
There is no general rule for determining the minimax bound. However, in the case $\hat\theta \sim N(\theta, I_k)$,
it is known that $\hat\theta$ is minimax for $\theta$.

17.4 Shrinkage Estimators


Suppose $\hat\theta \sim N(\theta, V)$.
A general form for a shrinkage estimator for $\theta$ is
$$
\hat\theta^* = \left(1 - h(\hat\theta' W \hat\theta)\right)\hat\theta
$$
where $h : [0, \infty) \to [0, \infty)$. Sometimes this is written as
$$
\hat\theta^* = \left(1 - \frac{c(\hat\theta' W \hat\theta)}{\hat\theta' W \hat\theta}\right)\hat\theta
$$
where $c(q) = q\, h(q)$.


This notation includes the James-Stein estimator, pretest estimators, selection estimators, and
the model averaging estimator of Section 17.1. Pretest and selection estimators take the form
$$
h(q) = 1(q < a)
$$
where $a = 2k$ for Mallows selection, and $a$ is the critical value from a chi-square distribution for a
pretest estimator.
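
To make the common structure concrete, here is a small sketch of the map $\hat\theta \mapsto (1 - h(\hat\theta'W\hat\theta))\hat\theta$ for a few choices of $h$; the dimension $k = 4$ and the 5% pretest level are illustrative assumptions, not values from the notes.

import numpy as np
from scipy.stats import chi2

def shrink(theta_hat, W, h):
    # theta* = (1 - h(theta_hat' W theta_hat)) * theta_hat
    q = theta_hat @ W @ theta_hat
    return (1.0 - h(q)) * theta_hat

k = 4
h_selection = lambda q: 1.0 if q < 2 * k else 0.0              # Mallows selection rule
h_pretest   = lambda q: 1.0 if q < chi2.ppf(0.95, k) else 0.0  # pretest at an assumed 5% level
h_js        = lambda q: (k - 2) / q                            # James-Stein form (c = k - 2)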
We now calculate the risk of $\hat\theta^*$. Note that
$$
\hat\theta^* - \theta = \hat\theta - \theta - h(\hat\theta' W \hat\theta)\,\hat\theta.
$$
Thus
$$
(\hat\theta^* - \theta)' W (\hat\theta^* - \theta) = (\hat\theta - \theta)' W (\hat\theta - \theta) + h(\hat\theta' W \hat\theta)^2\, \hat\theta' W \hat\theta - 2\, h(\hat\theta' W \hat\theta)\, \hat\theta' W (\hat\theta - \theta).
$$
Taking expectations,
$$
R(\theta, \hat\theta^*, W) = \operatorname{tr}(WV) + E\left[h(\hat\theta' W \hat\theta)^2\, \hat\theta' W \hat\theta\right] - 2\, E\left[h(\hat\theta' W \hat\theta)\, \hat\theta' W (\hat\theta - \theta)\right].
$$

To simplify the second expectation when h is continuous we use:

Lemma 1 (Stein's Lemma) If $\eta(\cdot) : \mathbb{R}^k \to \mathbb{R}^k$ is absolutely continuous and $\hat\theta \sim N(\theta, V)$, then
$$
E\left[\eta(\hat\theta)'\left(\hat\theta - \theta\right)\right] = E\left[\operatorname{tr}\left(\frac{\partial}{\partial\theta}\eta(\hat\theta)'\, V\right)\right].
$$

Proof: Let
$$
\phi(x) = \frac{1}{(2\pi)^{k/2}\det(V)^{1/2}} \exp\left(-\frac{1}{2}\, x' V^{-1} x\right)
$$
denote the $N(0, V)$ density. Then
$$
\frac{\partial}{\partial x}\phi(x) = -V^{-1} x\, \phi(x)
$$
and
$$
\frac{\partial}{\partial x}\phi(x - \theta) = -V^{-1}(x - \theta)\, \phi(x - \theta).
$$
By multivariate integration by parts,
$$
\begin{aligned}
E\left[\eta(\hat\theta)'\left(\hat\theta - \theta\right)\right] &= \int \eta(x)'\, V\, V^{-1}(x - \theta)\, \phi(x - \theta)\, dx \\
&= -\int \eta(x)'\, V\, \frac{\partial}{\partial x}\phi(x - \theta)\, dx \\
&= \int \operatorname{tr}\left(\frac{\partial}{\partial x}\eta(x)'\, V\right) \phi(x - \theta)\, dx \\
&= E\left[\operatorname{tr}\left(\frac{\partial}{\partial\theta}\eta(\hat\theta)'\, V\right)\right]
\end{aligned}
$$
as stated.
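
A quick Monte Carlo check of the lemma, using an arbitrary smooth map $\eta(x) = x^3$ (applied componentwise) and a diagonal $V$ chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(3)
k, reps = 3, 200000
theta = np.array([1.0, -0.5, 0.25])
V = np.diag([1.0, 2.0, 0.5])
L = np.linalg.cholesky(V)

theta_hat = theta + rng.normal(size=(reps, k)) @ L.T   # draws of theta_hat ~ N(theta, V)

# Left side: E[eta(theta_hat)'(theta_hat - theta)] with eta(x) = x^3 componentwise
lhs = np.mean(np.sum(theta_hat**3 * (theta_hat - theta), axis=1))
# Right side: E[tr((d/d theta) eta(theta_hat)' V)] = E[sum_j 3 theta_hat_j^2 V_jj] for diagonal V
rhs = np.mean(np.sum(3 * theta_hat**2 * np.diag(V), axis=1))
print(lhs, rhs)                                        # the two averages should be close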
Let $\eta(\theta)' = h(\theta'W\theta)\,\theta'W$, for which
$$
\frac{\partial}{\partial\theta}\eta(\theta)' = h(\theta'W\theta)\, W + 2\, W\theta\theta'W\, h'(\theta'W\theta)
$$
and
$$
\operatorname{tr}\left(\frac{\partial}{\partial\theta}\eta(\theta)'\, V\right) = \operatorname{tr}(WV)\, h(\theta'W\theta) + 2\, \theta'WVW\theta\, h'(\theta'W\theta).
$$
Then by Stein's Lemma,
$$
E\left[h(\hat\theta'W\hat\theta)\,\hat\theta'W(\hat\theta - \theta)\right] = \operatorname{tr}(WV)\, E\left[h(\hat\theta'W\hat\theta)\right] + 2\, E\left[(\hat\theta'WVW\hat\theta)\, h'(\hat\theta'W\hat\theta)\right].
$$

Applying this to the risk calculation, we obtain


Theorem.
$$
\begin{aligned}
R(\theta, \hat\theta^*, W) &= \operatorname{tr}(WV) + E\left[h(\hat\theta'W\hat\theta)^2\, \hat\theta'W\hat\theta\right] - 2\operatorname{tr}(WV)\, E\left[h(\hat\theta'W\hat\theta)\right] - 4\, E\left[(\hat\theta'WVW\hat\theta)\, h'(\hat\theta'W\hat\theta)\right] \\
&= \operatorname{tr}(WV) + E\left[\frac{c(\hat\theta'W\hat\theta)\left(c(\hat\theta'W\hat\theta) - 2\operatorname{tr}(WV) + 4\,\dfrac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\right)}{\hat\theta'W\hat\theta} - 4\,\frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\, c'(\hat\theta'W\hat\theta)\right]
\end{aligned}
$$
where the final equality uses the alternative expression $h(q) = c(q)/q$.
We are trying to find cases where $R(\theta, \hat\theta^*, W) < R(\theta, \hat\theta, W)$. This requires the term in the
expectation to be negative.
We now explore some special cases.

17.5 Default Weight Matrix


Set
$$
W = V^{-1}
$$
and write
$$
R(\theta, \hat\theta^*) = R(\theta, \hat\theta^*, V^{-1}).
$$
Then
$$
R(\theta, \hat\theta^*) = k + E\left[\frac{c(\hat\theta'V^{-1}\hat\theta)\left(c(\hat\theta'V^{-1}\hat\theta) - 2k + 4\right)}{\hat\theta'V^{-1}\hat\theta} - 4\, c'(\hat\theta'V^{-1}\hat\theta)\right].
$$

Theorem 1 For any absolutely continuous and non-decreasing function $c(q)$ such that
$$
0 < c(q) < 2(k - 2), \qquad (1)
$$
then
$$
R(\theta, \hat\theta^*) < R(\theta, \hat\theta);
$$
the risk of $\hat\theta^*$ is strictly less than the risk of $\hat\theta$. This inequality holds for all values of the parameter
$\theta$.

Note: Condition (1) can only hold if $k > 2$. (Since $k$, the dimension of $\theta$, is an integer, this
means $k \geq 3$.)
Proof. Let
$$
g(q) = \frac{c(q)\left(c(q) - 2(k - 2)\right)}{q} - 4\, c'(q).
$$
For all $q \geq 0$, $g(q) < 0$ by the assumptions. Thus $E\left[g(q)\right] < 0$ for any non-negative random variable $q$.
Setting $q_k = \hat\theta'V^{-1}\hat\theta$,
$$
R(\theta, \hat\theta^*) = k + E\left[g(q_k)\right] < k = R(\theta, \hat\theta),
$$
which proves the result.

It is also useful to note that
$$
q_k = \hat\theta'V^{-1}\hat\theta \sim \chi^2_k(\lambda),
$$
a non-central chi-square random variable with $k$ degrees of freedom and non-centrality parameter
$$
\lambda = \theta'V^{-1}\theta.
$$
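
A short simulation check of this distributional claim (all parameter values below are arbitrary illustrations); it compares the sample mean of $q_k$ with the non-central chi-square mean $k + \lambda$:

import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(4)
k, reps = 5, 100000
theta = np.array([0.5, -1.0, 0.0, 0.25, 0.75])
V = np.diag([1.0, 0.5, 2.0, 1.0, 0.25])
lam = theta @ np.linalg.solve(V, theta)               # non-centrality parameter lambda

L = np.linalg.cholesky(V)
theta_hat = theta + rng.normal(size=(reps, k)) @ L.T  # theta_hat ~ N(theta, V)
q_k = np.einsum('ij,jk,ik->i', theta_hat, np.linalg.inv(V), theta_hat)

print(np.mean(q_k), ncx2.mean(k, lam))                # both should be close to k + lambda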

17.6 James-Stein Estimator


Set $c(q) = c$, a constant. This is the James-Stein estimator
$$
\hat\theta^* = \left(1 - \frac{c}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta. \qquad (2)
$$

Theorem 2 If $\hat\theta \sim N(\theta, V)$, $k > 2$, and $0 < c < 2(k - 2)$, then for (2),
$$
R(\theta, \hat\theta^*) < R(\theta, \hat\theta);
$$
the risk of the James-Stein estimator is strictly less than that of the usual estimator. This inequality holds
for all values of the parameter $\theta$.

Since the risk is quadratic in $c$, we can also see that the risk is minimized by setting $c = k - 2$.
This yields the classic form of the James-Stein estimator
$$
\hat\theta^* = \left(1 - \frac{k - 2}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta.
$$
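
Theorem 2 is straightforward to verify by simulation. The sketch below (assuming $V = I_k$ and arbitrary illustrative choices of $k$ and $\theta$) compares the Monte Carlo risk of the usual estimator with that of the classic James-Stein estimator.

import numpy as np

rng = np.random.default_rng(1)
k, reps = 6, 20000
theta = np.full(k, 0.5)                                # an arbitrary true parameter

theta_hat = theta + rng.normal(size=(reps, k))         # theta_hat ~ N(theta, I_k)
q = np.sum(theta_hat**2, axis=1)                       # theta_hat' V^{-1} theta_hat with V = I_k
theta_js = (1 - (k - 2) / q)[:, None] * theta_hat      # classic James-Stein, c = k - 2

risk_usual = np.mean(np.sum((theta_hat - theta)**2, axis=1))   # approximately k
risk_js = np.mean(np.sum((theta_js - theta)**2, axis=1))       # smaller, for every theta
print(risk_usual, risk_js)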

17.7 Positive-Part James-Stein


If $\hat\theta'V^{-1}\hat\theta < c$ then
$$
1 - \frac{c}{\hat\theta'V^{-1}\hat\theta} < 0
$$
and the James-Stein estimator over-shrinks, and flips the sign of $\hat\theta^*$ relative to $\hat\theta$. This is corrected
by using the positive-part version
$$
\hat\theta^+ = \left(1 - \frac{c}{\hat\theta'V^{-1}\hat\theta}\right)_+ \hat\theta
= \begin{cases} \left(1 - \dfrac{c}{\hat\theta'V^{-1}\hat\theta}\right)\hat\theta & \hat\theta'V^{-1}\hat\theta \geq c \\[8pt] 0 & \text{else} \end{cases}
$$

This bears some resemblance to selection estimators.
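
A minimal sketch of the positive-part rule as a reusable function (the default $c = k - 2$ below is an assumption carried over from the previous section):

import numpy as np

def james_stein_plus(theta_hat, V, c=None):
    # Positive-part James-Stein: shrink toward zero but never flip the sign.
    k = theta_hat.shape[0]
    c = (k - 2) if c is None else c
    q = theta_hat @ np.linalg.solve(V, theta_hat)      # theta_hat' V^{-1} theta_hat
    return max(1.0 - c / q, 0.0) * theta_hat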


The positive-part estimator takes the shrinkage form with
$$
c(q) = \begin{cases} c & q \geq c \\ q & q < c \end{cases}
$$
or
$$
h(q) = \begin{cases} \dfrac{c}{q} & q \geq c \\[6pt] 1 & q < c \end{cases}
$$
In general the positive-part version of
$$
\hat\theta^* = \left(1 - h(\hat\theta'W\hat\theta)\right)\hat\theta
$$
is
$$
\hat\theta^+ = \left(1 - h(\hat\theta'W\hat\theta)\right)_+ \hat\theta.
$$

Theorem. For any shrinkage estimator, $R(\theta, \hat\theta^+) < R(\theta, \hat\theta^*)$.
The proof is a bit technical, so we will skip it.

17.8 General Weight Matrix


Recall that for general $c(q)$ and weight $W$ we had
$$
R(\theta, \hat\theta^*, W) = \operatorname{tr}(WV) + E\left[\frac{c(\hat\theta'W\hat\theta)\left(c(\hat\theta'W\hat\theta) - 2\operatorname{tr}(WV) + 4\,\dfrac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\right)}{\hat\theta'W\hat\theta} - 4\,\frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}\, c'(\hat\theta'W\hat\theta)\right].
$$
Using a result about eigenvalues and setting $h = W^{1/2}\hat\theta$,
$$
0 \leq \frac{\hat\theta'WVW\hat\theta}{\hat\theta'W\hat\theta}
\leq \max_{\theta}\frac{\theta'WVW\theta}{\theta'W\theta}
= \max_{h}\frac{h'W^{1/2}VW^{1/2}h}{h'h}
= \lambda_{\max}\left(W^{1/2}VW^{1/2}\right)
= \lambda_{\max}(WV).
$$

Thus if $c'(q) \geq 0$,
$$
R(\theta, \hat\theta^*, W) \leq \operatorname{tr}(WV) + E\left[\frac{c(\hat\theta'W\hat\theta)\left(c(\hat\theta'W\hat\theta) - 2\operatorname{tr}(WV) + 4\,\lambda_{\max}(WV)\right)}{\hat\theta'W\hat\theta}\right] < \operatorname{tr}(WV),
$$
the final inequality holding if
$$
0 < c(q) < 2\left(\operatorname{tr}(WV) - 2\,\lambda_{\max}(WV)\right). \qquad (3)
$$
When $W = V^{-1}$, the upper bound is $2(k - 2)$, so this is the same as for the default weight matrix.

Theorem 3 For any absolutely continuous and non-decreasing function $c(q)$ such that (3) holds,
then
$$
R(\theta, \hat\theta^*, W) < R(\theta, \hat\theta, W);
$$
the risk of $\hat\theta^*$ is strictly less than the risk of $\hat\theta$.
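
The eigenvalue bound used above is easy to check numerically. The sketch below draws arbitrary positive definite $W$ and $V$ (purely illustrative) and verifies that the ratio $\theta'WVW\theta/\theta'W\theta$ never exceeds $\lambda_{\max}(WV)$.

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4)); V = A @ A.T + np.eye(4)   # an arbitrary positive definite V
B = rng.normal(size=(4, 4)); W = B @ B.T + np.eye(4)   # an arbitrary positive definite W

lam_max = np.max(np.linalg.eigvals(W @ V).real)        # lambda_max(WV)
ratios = [(t @ W @ V @ W @ t) / (t @ W @ t)
          for t in rng.normal(size=(1000, 4))]
print(max(ratios) <= lam_max + 1e-8)                   # expected: True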

17.9 Shrinkage Towards Restrictions


The classic James-Stein estimator shrinks towards the zero vector. More generally, shrinkage
can be towards restricted estimators, or towards linear or non-linear subspaces.
These estimators take the form
$$
\hat\theta^* = \hat\theta - h\left((\hat\theta - \tilde\theta)'W(\hat\theta - \tilde\theta)\right)\left(\hat\theta - \tilde\theta\right)
$$
where $\hat\theta$ is the unrestricted estimator (e.g. the long regression) and $\tilde\theta$ is the restricted estimator
(e.g. the short regression).
The classic form is
$$
\hat\theta^* = \hat\theta - \left(\frac{r - 2}{(\hat\theta - \tilde\theta)'\hat V^{-1}(\hat\theta - \tilde\theta)}\right)_1 \left(\hat\theta - \tilde\theta\right)
$$
where $(a)_1 = \min(a, 1)$, $\hat V$ is the covariance matrix for $\hat\theta$, and $r$ is the number of restrictions (imposed by
the restriction from $\hat\theta$ to $\tilde\theta$).
This estimator shrinks $\hat\theta$ towards $\tilde\theta$, with the degree of shrinkage depending on the magnitude
of $\hat\theta - \tilde\theta$.
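
A small sketch of this rule as a function (the name and arguments are illustrative; it assumes nested models with $r > 2$ restrictions and a supplied covariance estimate):

import numpy as np

def shrink_toward(theta_unres, theta_res, V_hat, r):
    # theta* = theta_hat - min((r - 2) / D, 1) * (theta_hat - theta_tilde),
    # where D = (theta_hat - theta_tilde)' V_hat^{-1} (theta_hat - theta_tilde)
    d = theta_unres - theta_res
    dist = d @ np.linalg.solve(V_hat, d)
    weight = min((r - 2) / dist, 1.0)      # truncated at 1 to avoid over-shrinking
    return theta_unres - weight * d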
This approach works for nested models, so that $(\hat\theta - \tilde\theta)'\hat V^{-1}(\hat\theta - \tilde\theta)$ is approximately (non-central)
chi-square.
It is unclear how to extend the idea to non-nested models, where $(\hat\theta - \tilde\theta)'\hat V^{-1}(\hat\theta - \tilde\theta)$ is not
chi-square.

17.10 Inference
We discussed shrinkage estimation.
Model averaging, selection, and shrinkage estimators have non-standard, non-normal distributions.
Standard errors, testing, and confidence intervals need development.

