
Hájek Projection and U-Statistics

Bryan S. Graham, UC - Berkeley & NBER

October 1, 2018

Hájek projection
One important method of deriving the asymptotic distribution of a sequence of statistics, say $U_N$, is to show that the sequence is asymptotically equivalent to a second sequence, say $U_N^{*}$, the large sample properties of which are well-understood (cf., van der Vaart, 1998, Chapter 11). The asymptotic properties of sums of independent random variables, appropriately scaled, are especially well-understood (since Central Limit Theorems (CLTs) are generally applicable to them). With this observation in mind, let $X_1, X_2, \ldots, X_N$ be independent $K \times 1$ random vectors. Let $\mathcal{L}$ be the linear subspace consisting of all functions of the form

$$
\sum_{i=1}^{N} g_i(X_i) \tag{1}
$$

for $g_i : \mathbb{R}^K \to \mathbb{R}$ arbitrary with $E[g_i(X_i)^2] < \infty$ for $i = 1, \ldots, N$.
Next let $Y$ be an arbitrary random variable with finite variance, but unknown distribution. We can use the Projection Theorem to approximate the statistic $Y$ with one composed of a sum of independent random functions. Such a sum, by appeal to a CLT, may be well-described by a normal distribution. If the projection is also a very good approximation of $Y$, then the hope is that $Y$ may be well-described by a normal distribution as well.
The projection of $Y$ onto $\mathcal{L}$, called the Hájek Projection, equals

$$
\Pi(Y \mid \mathcal{L}) = \sum_{i=1}^{N} E[Y \mid X_i] - (N-1) E[Y]. \tag{2}
$$

To verify (2) it suffices to check the necessary and sufficient orthogonality condition of the Projection Theorem. Before doing so, it is helpful to observe that, for $j \neq i$,

$$
E\left[ E[Y \mid X_i] \mid X_j \right] = E\left[ E[Y \mid X_i] \right] = E[Y], \tag{3}
$$

due to independence of $X_i$ and $X_j$ and the law of iterated expectations. In contrast, if $j = i$, then

$$
E\left[ E[Y \mid X_i] \mid X_i \right] = E[Y \mid X_i]. \tag{4}
$$

The orthogonality condition to verify, for $U = Y - \Pi(Y \mid \mathcal{L})$, is

$$
\begin{aligned}
0 &= E\left[ U \left( \sum_{j=1}^{N} g_j(X_j) \right) \right] \\
  &= \sum_{j=1}^{N} E\left[ U g_j(X_j) \right] \\
  &= \sum_{j=1}^{N} E\left[ E[U \mid X_j]\, g_j(X_j) \right].
\end{aligned}
$$

Next observe that, using (3) and (4),

$$
\begin{aligned}
E[U \mid X_j] &= E[Y \mid X_j] - \sum_{i=1}^{N} E\left[ E[Y \mid X_i] \mid X_j \right] + (N-1) E[Y] \\
&= E[Y \mid X_j] - E[Y \mid X_j] - (N-1) E[Y] + (N-1) E[Y] \\
&= 0,
\end{aligned}
$$

for $j = 1, \ldots, N$. Hence the required orthogonality condition holds when $\Pi(Y \mid \mathcal{L})$ is of the form asserted in (2).
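The verification above can be checked numerically on a toy example. Below is a minimal Python sketch, assuming a hypothetical two-variable design (my own choices, not from the notes): $X_1, X_2$ i.i.d. Bernoulli(1/2) and $Y = X_1 X_2 + X_1$. It builds the projection (2) by exact enumeration over the four outcomes and confirms the orthogonality condition for several arbitrary functions $g_j$.

```python
from itertools import product

# Toy numerical check of the Hajek projection (2) and its orthogonality
# property. The design (N = 2 independent Bernoulli(1/2) draws and
# Y = X1*X2 + X1) is a hypothetical choice; all expectations are exact
# sums over the four outcomes.
N = 2
outcomes = list(product([0, 1], repeat=N))
prob = {w: 0.25 for w in outcomes}

def Y(w):
    x1, x2 = w
    return x1 * x2 + x1

EY = sum(prob[w] * Y(w) for w in outcomes)

def cond_EY(i, x):
    # E[Y | X_i = x] by exact enumeration
    num = sum(prob[w] * Y(w) for w in outcomes if w[i] == x)
    den = sum(prob[w] for w in outcomes if w[i] == x)
    return num / den

def Pi(w):
    # Hajek projection (2): sum_i E[Y | X_i] - (N - 1) E[Y]
    return sum(cond_EY(i, w[i]) for i in range(N)) - (N - 1) * EY

# Orthogonality: E[(Y - Pi(Y|L)) g_j(X_j)] = 0 for arbitrary g_j
for j in range(N):
    for g in (lambda x: x, lambda x: 3 * x - 7, lambda x: x ** 2):
        assert abs(sum(prob[w] * (Y(w) - Pi(w)) * g(w[j]) for w in outcomes)) < 1e-12
```

In this example the projection residual works out to $(X_1 - 1/2)(X_2 - 1/2)$, so the orthogonality holds exactly rather than merely up to simulation error.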
If, in addition to independence, we have that (i) $\{X_i\}_{i=1}^{N}$ are identically distributed and (ii) $Y = h(X_1, \ldots, X_N)$ is a permutation symmetric function of $\{X_i\}_{i=1}^{N}$, then

$$
E[Y \mid X_i = x] = E[Y \mid X_1 = x] = E[h(x, X_2, \ldots, X_N)] \stackrel{def}{=} \bar{h}_1(x)
$$

for all $i = 1, \ldots, N$. Since $\bar{h}_1(x)$ does not depend on $i$ it follows that (2) simplifies, in this

© Bryan S. Graham, 2018

case, to

$$
\Pi(Y \mid \mathcal{L}) = \sum_{i=1}^{N} \bar{h}_1(X_i) - (N-1) E[Y]. \tag{5}
$$

Projection (5) is important, as we shall see later, for the analysis of U-Statistics.
A fact that will be helpful in what follows is that the means of $Y$ and its projection coincide:

$$
E[\Pi(Y \mid \mathcal{L})] = \sum_{i=1}^{N} E\left[ E[Y \mid X_i] \right] - (N-1) E[Y] = N E[Y] - (N-1) E[Y] = E[Y].
$$

Note also that $\Pi(Y \mid 1) = E[Y]$.

Hájek projection and large sample theory


Next let $\{Y_N\}$ be a sequence of statistics indexed by the sample size and $\mathcal{L}_N$ a corresponding sequence of linear subspaces of form (1). The goal is to use the limiting distribution of $\sqrt{N}\left( \Pi(Y_N \mid \mathcal{L}_N) - \Pi(Y_N \mid 1) \right)$ to approximate that of $\sqrt{N}\left( Y_N - \Pi(Y_N \mid 1) \right)$. Such an approach will be (asymptotically) valid if these two statistics converge in mean square (to one another). It is attractive because in the main cases of interest the asymptotic sampling distribution of $\sqrt{N}\left( \Pi(Y_N \mid \mathcal{L}_N) - \Pi(Y_N \mid 1) \right)$ is straightforward to derive, whereas that of $\sqrt{N}\left( Y_N - \Pi(Y_N \mid 1) \right)$ may be ex ante non-obvious.
The "Analysis of Variance" decomposition for projections gives, after some re-arrangement,

$$
N \left\| Y_N - \Pi(Y_N \mid \mathcal{L}_N) \right\|^2 = N \left\| Y_N - \Pi(Y_N \mid 1) \right\|^2 - N \left\| \Pi(Y_N \mid \mathcal{L}_N) - \Pi(Y_N \mid 1) \right\|^2 .
$$

Or, invoking the covariance inner product, the fact that $\Pi(Y \mid 1) = E[Y]$, and the definition of variance,

$$
N E\left[ \left( Y_N - \Pi(Y_N \mid \mathcal{L}_N) \right)^2 \right] = N V(Y_N) - N V\left( \Pi(Y_N \mid \mathcal{L}_N) \right). \tag{6}
$$

Hence if the limits of $N V(Y_N)$ and $N V(\Pi(Y_N \mid \mathcal{L}_N))$ coincide as $N \to \infty$ we have that $\sqrt{N}\left( Y_N - \Pi(Y_N \mid \mathcal{L}_N) \right)$ converges in mean square to zero. This means that $\sqrt{N} Y_N$ and $\sqrt{N}\, \Pi(Y_N \mid \mathcal{L}_N)$ will have identical limit distributions.
In the next section we apply the above ideas to study the asymptotic properties of U-
Statistics.


U-Statistics
Let $\{X_i\}_{i=1}^{N}$ be a simple random sample from some population of interest. Let $h(X_{i_1}, \ldots, X_{i_m})$ be a symmetric kernel function. The assumption of symmetry is without loss of generality since we can always replace $h(X_{i_1}, \ldots, X_{i_m})$ with its average across permutations. A U-statistic is an average of the kernel $h(X_{i_1}, \ldots, X_{i_m})$ over all possible $m$-tuples of observations in the sample:

$$
U_N = \binom{N}{m}^{-1} \sum_{i \in C_{m,N}} h(X_{i_1}, \ldots, X_{i_m}),
$$

where $C_{m,N}$ denotes the set of all unique combinations of indices of size $m$ drawn from the set $\{1, 2, \ldots, N\}$.
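As a concrete illustration (my own, not from the notes), a U-statistic can be computed by brute force with `itertools.combinations`. With the kernel $h(x, y) = (x - y)^2/2$ and $m = 2$, $U_N$ reproduces the familiar unbiased sample variance.

```python
from itertools import combinations

def u_statistic(sample, h, m):
    """Average of the symmetric kernel h over all size-m combinations."""
    combs = list(combinations(sample, m))
    return sum(h(*c) for c in combs) / len(combs)

# Example kernel: h(x, y) = (x - y)^2 / 2, whose order m = 2 U-statistic
# equals the unbiased (ddof = 1) sample variance.
data = [1.0, 4.0, 2.0, 8.0, 5.0]
var_u = u_statistic(data, lambda x, y: (x - y) ** 2 / 2, 2)
# var_u equals the ddof=1 sample variance of data (here 7.5)
```

Enumerating all $\binom{N}{m}$ combinations is only feasible for small $N$; the point here is the definition, not computational efficiency.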
The parameter of interest is

$$
\theta = E[U_N] = E[h(X_1, \ldots, X_m)],
$$

where the expectation is over $m$ independent random draws from the target population.

Variance of $U_N$

For $s = 1, \ldots, m$ let

$$
\bar{h}_s(x_1, \ldots, x_s) = E[h(x_1, \ldots, x_s, X_{s+1}, \ldots, X_m)]
$$

be the average over the last $m - s$ elements of $h(\cdot)$ holding the first $s$ elements fixed. Note that since $X_{i_k}$ is independent of $X_{i_l}$ for all $k \neq l$ we have

$$
E\left[ h(X_1, \ldots, X_s, X_{s+1}, \ldots, X_m) \mid (X_1, \ldots, X_s) = (x_1, \ldots, x_s) \right] = E[h(x_1, \ldots, x_s, X_{s+1}, \ldots, X_m)].
$$

It is also useful to observe that

$$
E\left[ \bar{h}_s(X_1, \ldots, X_s) \right] = E[h(X_1, \ldots, X_m)] = \theta.
$$

The variance of $U_N$ has a special structure. Define, for $s = 1, \ldots, m$,

$$
\delta_s^2 = V\left( \bar{h}_s(X_1, \ldots, X_s) \right).
$$


Applying the variance operator to $U_N$ yields

$$
\begin{aligned}
V(U_N) &= V\left( \binom{N}{m}^{-1} \sum_{i \in C_{m,N}} h(X_{i_1}, \ldots, X_{i_m}) \right) \\
&= \binom{N}{m}^{-2} \sum_{i \in C_{m,N}} \sum_{j \in C_{m,N}} C\left( h(X_{i_1}, \ldots, X_{i_m}),\, h(X_{j_1}, \ldots, X_{j_m}) \right). 
\end{aligned} \tag{7}
$$

The form of the covariances in (7) depends on the number of indices in common. Let $s$ be the number of indices in common in $X_{i_1}, \ldots, X_{i_m}$ and $X_{j_1}, \ldots, X_{j_m}$:

$$
C\left( h(X_{i_1}, \ldots, X_{i_m}),\, h(X_{j_1}, \ldots, X_{j_m}) \right)
= E\left[ \left( h(X_1, \ldots, X_s, X_{s+1}, \ldots, X_m) - \theta \right) \left( h(X_1, \ldots, X_s, X_{s+1}', \ldots, X_m') - \theta \right) \right]. \tag{8}
$$

Conditional on $X_1, \ldots, X_s$ the two terms in (8) are independent so that, using the Law of Iterated Expectations,

$$
C\left( h(X_{i_1}, \ldots, X_{i_m}),\, h(X_{j_1}, \ldots, X_{j_m}) \right)
= E\left[ \left( \bar{h}_s(X_1, \ldots, X_s) - \theta \right) \left( \bar{h}_s(X_1, \ldots, X_s) - \theta \right) \right] = \delta_s^2 .
$$

Using the same argument yields

$$
C\left( \bar{h}_s(X_1, \ldots, X_s),\, h(X_1, \ldots, X_m) \right) = \delta_s^2 .
$$

By the Cauchy-Schwarz Inequality we have

$$
\frac{C\left( \bar{h}_s(X_1, \ldots, X_s),\, h(X_1, \ldots, X_m) \right)}{\delta_s \delta_m} \leq 1,
$$

so that $\delta_s^2 \leq \delta_s \delta_m$, and hence

$$
\delta_s^2 \leq \delta_m^2 .
$$

Continuing with this type of reasoning we get the weak ordering

$$
\delta_1^2 \leq \delta_2^2 \leq \cdots \leq \delta_m^2 .
$$

In what follows we will assume that $\delta_m^2 < \infty$.
To use these results to get an expression for $V(U_N)$ begin by observing that the number of

pairs of $m$-tuples $(i_1, \ldots, i_m)$ and $(j_1, \ldots, j_m)$ having exactly $s$ elements in common is

$$
\binom{N}{m} \binom{m}{s} \binom{N-m}{m-s}.
$$

This follows since $\binom{N}{m}$ equals the number of ways of choosing $(i_1, \ldots, i_m)$ from the set $\{1, \ldots, N\}$. For each unique $m$-tuple there are $\binom{m}{s}$ ways of choosing a subset of size $s$ from it. Having fixed the $s$ indices in common there are then $\binom{N-m}{m-s}$ ways of choosing the $m - s$ non-common elements of $(j_1, \ldots, j_m)$ from the $N - m$ integers not already present in $(i_1, \ldots, i_m)$.
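This counting argument can be verified by brute force for small cases. The Python sketch below ($N = 8$ and $m = 3$ are arbitrary small choices of mine) enumerates all ordered pairs of $m$-element subsets and checks the count against the product of binomial coefficients.

```python
from itertools import combinations
from math import comb

# Brute-force check of the counting argument (N = 8, m = 3 are arbitrary
# small values): among all ordered pairs of m-element subsets of
# {1, ..., N}, the number sharing exactly s elements should equal
# comb(N, m) * comb(m, s) * comb(N - m, m - s).
N, m = 8, 3
subsets = list(combinations(range(N), m))
for s in range(m + 1):
    count = sum(1 for a in subsets for b in subsets
                if len(set(a) & set(b)) == s)
    assert count == comb(N, m) * comb(m, s) * comb(N - m, m - s)
```

Summing the right-hand side over $s = 0, \ldots, m$ recovers $\binom{N}{m}^2$, the total number of ordered pairs, by Vandermonde's identity.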
We therefore have

$$
\begin{aligned}
V(U_N) &= \binom{N}{m}^{-2} \sum_{s=0}^{m} \binom{N}{m} \binom{m}{s} \binom{N-m}{m-s} \delta_s^2 \\
&= \binom{N}{m}^{-1} \sum_{s=1}^{m} \binom{m}{s} \binom{N-m}{m-s} \delta_s^2 \\
&= \sum_{s=1}^{m} \frac{m!^2}{s!\,(m-s)!^2} \, \frac{(N-m)(N-m-1) \cdots (N-2m+s+1)}{N(N-1) \cdots (N-m+1)} \, \delta_s^2 .
\end{aligned} \tag{9}
$$

To understand this expression note that each covariance in (7) above corresponds to a pair of index sets with $s = 0, \ldots, m$ elements in common. The coefficients on the $\delta_s^2$ in (9) count the covariances with $s$ elements in common. Also note that $\delta_0^2 = 0$.
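Formula (9) can be checked exactly for a small case. The Python sketch below uses the hypothetical kernel $h(x, y) = xy$ with $X_i$ i.i.d. Bernoulli($p$) (my own choices, not from the notes), computes $V(U_N)$ for $m = 2$ by exact enumeration over all $2^N$ outcomes, and compares it with (9), which for $m = 2$ reads $V(U_N) = \left[ 2(N-2)\delta_1^2 + \delta_2^2 \right] / \binom{N}{2}$.

```python
from itertools import product
from math import comb

# Exact check of the variance formula (9) for m = 2, using the hypothetical
# kernel h(x, y) = x*y with X_i iid Bernoulli(p). For m = 2, (9) reads
#   V(U_N) = [2 (N - 2) delta_1^2 + delta_2^2] / comb(N, 2).
N, p = 6, 0.3

def prob(w):
    k = sum(w)
    return p ** k * (1 - p) ** (N - k)

def U(w):
    # U_N with kernel h(x, y) = x*y
    return sum(w[i] * w[j] for i in range(N) for j in range(i + 1, N)) / comb(N, 2)

EU = sum(prob(w) * U(w) for w in product([0, 1], repeat=N))
EU2 = sum(prob(w) * U(w) ** 2 for w in product([0, 1], repeat=N))
var_exact = EU2 - EU ** 2

# For this kernel: hbar_1(x) = p*x, so delta_1^2 = p^3 (1 - p);
# delta_2^2 = V(X*Y) = p^2 - p^4.
d1, d2 = p ** 3 * (1 - p), p ** 2 - p ** 4
var_formula = (2 * (N - 2) * d1 + d2) / comb(N, 2)
assert abs(var_exact - var_formula) < 1e-12
```

The same enumeration also confirms $E[U_N] = \theta = p^2$, the unbiasedness noted above.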
The coefficient on $\delta_1^2$ is

$$
\frac{m!^2}{1!\,(m-1)!^2} \, \frac{\overbrace{(N-m)(N-m-1) \cdots (N-2m+2)}^{m-1 \text{ terms}}}{N(N-1) \cdots (N-m+1)}
= m^2 \, \frac{(N-m)(N-m-1) \cdots (N-2m+2)}{\underbrace{N(N-1) \cdots (N-m+1)}_{m \text{ terms}}}
\simeq \frac{m^2}{N}.
$$

The coefficient on $\delta_2^2$ is $O(N^{-2})$ etc. We therefore have

$$
V(U_N) = \frac{m^2}{N} \delta_1^2 + O\left( N^{-2} \right)
$$

and also that $V\left( \sqrt{N}(U_N - \theta) \right) \to m^2 \delta_1^2$ as $N \to \infty$.
If $\delta_1 = 0$ we say that $U_N$ is a degenerate U-Statistic with degeneracy of order 1. I will not consider the properties of degenerate U-Statistics here.


First projection of $U_N$

The arguments outlined so far provide expressions for the mean and variance of $U_N$. To conduct inference we need an asymptotic normality result. While $U_N$ is constructed from independently and identically distributed random variables, not all elements of its summand are independent of one another, so we cannot apply a standard central limit theorem (CLT) directly.
To show asymptotic normality of $\sqrt{N}(U_N - \theta)$ we will therefore proceed in the way used to motivate our introduction of the Hájek projection earlier. Let $\mathcal{L}_N$ be the linear subspace consisting of all functions of the form

$$
\sum_{i=1}^{N} g_i(X_i) \tag{10}
$$

for $g_i : \mathbb{R}^K \to \mathbb{R}$ arbitrary with $E[g_i(X_i)^2] < \infty$ for $i = 1, \ldots, N$. The Hájek Projection of $U_N$ onto $\mathcal{L}_N$ equals, from (2) above,

$$
\Pi(U_N \mid \mathcal{L}_N) = \sum_{i=1}^{N} E[U_N \mid X_i] - (N-1) E[U_N]. \tag{11}
$$

To simplify the argument assume that $m = 2$. The $L_2$ projection of $U_N$ onto just the first observation $X_1$ is

$$
\begin{aligned}
E[U_N \mid X_1] &= \binom{N}{2}^{-1} \sum_{i<j} E\left[ h(X_i, X_j) \mid X_1 \right] \\
&= \binom{N}{2}^{-1} (N-1)\, \bar{h}_1(X_1) + \binom{N}{2}^{-1} \left( \binom{N}{2} - (N-1) \right) \theta \\
&= \frac{2}{N} \left( \bar{h}_1(X_1) - \theta \right) + \theta.
\end{aligned} \tag{12}
$$

The second equality follows because $E[h(X_i, X_j) \mid X_1] = \bar{h}_1(X_1)$ if either $i$ or $j$ equals 1 (which occurs $N - 1$ times). In all other cases, by random sampling, $E[h(X_i, X_j) \mid X_1] = E[h(X_i, X_j)] = \theta$ (which occurs $\binom{N}{2} - (N-1)$ times). Substituting (12) into (11) yields

$$
\Pi(U_N - \theta \mid \mathcal{L}_N) = \frac{2}{N} \sum_{i=1}^{N} \left( \bar{h}_1(X_i) - \theta \right).
$$
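The conditional expectation (12) can likewise be verified exactly for a small case. The sketch below again assumes the hypothetical kernel $h(x, y) = xy$ with $X_i$ i.i.d. Bernoulli($p$), for which $\bar{h}_1(x) = px$ and $\theta = p^2$ (my own illustrative choices).

```python
from itertools import product
from math import comb

# Exact check of the conditional expectation (12) for m = 2, with the
# hypothetical kernel h(x, y) = x*y and X_i iid Bernoulli(p), for which
# hbar_1(x) = p*x and theta = p^2.
N, p = 5, 0.4
theta = p ** 2

def U(x):
    # U_N with kernel h(x, y) = x*y
    return sum(x[i] * x[j] for i in range(N) for j in range(i + 1, N)) / comb(N, 2)

for x1 in (0, 1):
    # E[U_N | X_1 = x1] by summing over the remaining N - 1 draws
    cond = sum(U((x1,) + rest) * p ** sum(rest) * (1 - p) ** (N - 1 - sum(rest))
               for rest in product([0, 1], repeat=N - 1))
    assert abs(cond - ((2 / N) * (p * x1 - theta) + theta)) < 1e-12
```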

For the general $m \geq 2$ case a similar calculation gives

$$
\Pi(U_N - \theta \mid \mathcal{L}_N) = \frac{m}{N} \sum_{i=1}^{N} \left( \bar{h}_1(X_i) - \theta \right).
$$



Since $\Pi(U_N - \theta \mid \mathcal{L}_N)$ is a sum of i.i.d. random variables with $V\left( \bar{h}_1(X_i) - \theta \right) = \delta_1^2$, a CLT gives

$$
\sqrt{N}\, \Pi(U_N - \theta \mid \mathcal{L}_N) \overset{D}{\to} \mathcal{N}\left( 0, m^2 \delta_1^2 \right).
$$

Our (combinatoric) variance calculations gave

$$
V\left( \sqrt{N}(U_N - \theta) \right) \to m^2 \delta_1^2
$$

as $N \to \infty$. Therefore $N V(U_N) - N V\left( \Pi(U_N \mid \mathcal{L}_N) \right) \to 0$ as $N \to \infty$, in turn implying that $\sqrt{N}(U_N - \theta)$ converges in mean square to $\sqrt{N}\, \Pi(U_N - \theta \mid \mathcal{L}_N)$ and hence that

$$
\sqrt{N}(U_N - \theta) \overset{D}{\to} \mathcal{N}\left( 0, m^2 \delta_1^2 \right)
$$

as needed.
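A small Monte Carlo experiment (my own illustration, not a proof, again with the hypothetical kernel $h(x, y) = xy$ and Bernoulli(1/2) draws) suggests the mean-square equivalence numerically: the scaled difference between $\sqrt{N}(U_N - \theta)$ and $\sqrt{N}\,\Pi(U_N - \theta \mid \mathcal{L}_N)$ is negligible relative to the limiting variance.

```python
import random

# Monte Carlo illustration of the mean-square equivalence, with the
# hypothetical kernel h(x, y) = x*y and X_i iid Bernoulli(1/2):
# compare sqrt(N)(U_N - theta) with sqrt(N) * Pi(U_N - theta | L_N),
# where Pi(U_N - theta | L_N) = (2/N) sum_i (p*X_i - theta).
random.seed(0)
N, reps, p = 400, 2000, 0.5
theta = p ** 2
msq_diff = 0.0  # estimate of E[N (U_N - theta - Pi)^2]
var_u = 0.0     # estimate of N * V(U_N)
for _ in range(reps):
    x = [1 if random.random() < p else 0 for _ in range(N)]
    s = sum(x)
    u = (s * s - s) / (N * (N - 1))  # U_N for h(x, y) = x*y; x_i^2 = x_i for binary data
    proj = (2 / N) * sum(p * v - theta for v in x)
    msq_diff += N * (u - theta - proj) ** 2 / reps
    var_u += N * (u - theta) ** 2 / reps
assert msq_diff < 0.05 * var_u  # projection error is negligible at this N
```

Here `var_u` should be close to the limit $m^2 \delta_1^2 = 4 p^3(1-p) = 0.25$, while the mean-square projection error is of order $1/N$.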

Bibliographic notes
Hoeffding (1948) developed the basic theory of U-Statistics. van der Vaart (1998, Chapters 11 & 12) is a standard, and highly recommended, textbook reference. The exposition in these notes draws, in part, on unpublished lecture notes by Ferguson (2005).

References
Ferguson, T. S. (2005). U-statistics. University of California - Los Angeles.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19(3), 293–325.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.
