$$\int w(x^2, \dots, x^d)\,dx^2\cdots dx^d = 1.$$
as the one in this paper but a different and, possibly, complicated bias. We also conjecture that a classical backfitting estimator would not have the oracle property. Nonetheless, we do not argue here that our procedure outperforms classical backfitting, in the sense of minimizing an optimality criterion such as the asymptotic mean-square error. However, our procedure has the advantages of a complete asymptotic distribution theory and the oracle property.
The remainder of this paper is organized as follows. Section 2 provides an informal description of the two-stage estimator. The main results are presented in Section 3. Section 4 discusses the selection of bandwidths. Section 5 presents the results of a small simulation study, and Section 6 presents concluding comments. The proofs of theorems are in Section 7. Throughout the paper, subscripts index observations and superscripts denote components of vectors. Thus, $X_i$ is the $i$th observation of $X$, $X^j$ is the $j$th component of $X$, and $X_i^j$ is the $i$th observation of the $j$th component.
$$\int_{-1}^{1} m_j(v)\,dv = 0, \qquad j = 1, \dots, d.$$
Assume that each $m_j$ can be represented as a series in basis functions $\{p_k\}$,
$$(2.3)\qquad m_j(x^j) = \sum_{k=1}^{\infty}\theta_{jk}\,p_k(x^j),$$
for each $j = 1, \dots, d$, each $x^j \in [-1, 1]$ and suitable coefficients $\{\theta_{jk}\}$. For any positive integer $K$, define
$$P^K(x) = [1,\ p_1(x^1), \dots, p_K(x^1),\ p_1(x^2), \dots, p_K(x^2),\ \dots,\ p_1(x^d), \dots, p_K(x^d)]'.$$
To obtain the first-stage estimators of the $m_j$'s, let $\{Y_i, X_i : i = 1, \dots, n\}$ be a random sample of $(Y, X)$. Let $\hat\theta_{nK}$ be a solution to
$$\underset{\theta \in \Theta_K}{\text{minimize}}:\quad S_{nK}(\theta) = n^{-1}\sum_{i=1}^{n}\{Y_i - F[P^K(X_i)'\theta]\}^2,$$
where $\Theta_K \subset \mathbb{R}^{Kd+1}$ is a compact parameter set. The series estimator of $\mu + m(x)$ is
$$\tilde\mu + \tilde m(x) = P^K(x)'\hat\theta_{nK},$$
where $\tilde\mu$ is the first component of $\hat\theta_{nK}$. The estimator of $m_j(x^j)$ for any $j = 1, \dots, d$ and any $x^j \in [-1, 1]$ is the product of $[p_1(x^j), \dots, p_K(x^j)]$ with the appropriate components of $\hat\theta_{nK}$.
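As a concrete illustration of this first-stage computation, the following sketch (ours, not the authors'; the paper's own experiments were coded in GAUSS) fits the series estimator by nonlinear least squares with a cubic B-spline basis and a logistic link. The helper names `spline_basis` and `first_stage` are hypothetical, the centering $\int m_j = 0$ is ignored for brevity, and an unconstrained optimizer stands in for the minimization over the compact set $\Theta_K$.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import least_squares

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def spline_basis(x, K):
    """K cubic B-spline basis functions on [-1, 1] evaluated at x (needs K >= 4)."""
    degree = 3
    interior = np.linspace(-1.0, 1.0, K - degree + 1)[1:-1]
    knots = np.r_[[-1.0] * (degree + 1), interior, [1.0] * (degree + 1)]
    return BSpline(knots, np.eye(K), degree)(x)   # shape (len(x), K)

def first_stage(Y, X, K, F=logistic):
    """Solve: minimize over theta of n^{-1} sum_i {Y_i - F[P_K(X_i)' theta]}^2.

    Returns theta_hat and a callable m_tilde(j, x) giving the series
    estimate of m_j at the points x."""
    n, d = X.shape
    # stacked basis vector P_K(x): leading 1 for mu, then K functions per component
    P = np.hstack([np.ones((n, 1))] + [spline_basis(X[:, j], K) for j in range(d)])
    theta = least_squares(lambda th: Y - F(P @ th), x0=np.zeros(1 + K * d)).x
    def m_tilde(j, x):
        return spline_basis(np.atleast_1d(x), K) @ theta[1 + j * K : 1 + (j + 1) * K]
    return theta, m_tilde
```

The first component of `theta` estimates $\mu$; the remaining blocks of $K$ coefficients estimate the $\theta_{jk}$ of (2.3).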
To obtain the second-stage estimator of (say) $m_1(x^1)$, let $\tilde X_i$ denote the $i$th observation of $\tilde X \equiv (X^2, \dots, X^d)$. Define $\tilde m_{-1}(\tilde X_i) = \tilde m_2(X_i^2) + \cdots + \tilde m_d(X_i^d)$, where $X_i^j$ is the $i$th observation of the $j$th component of $X$ and $\tilde m_j$ is the series estimator of $m_j$. Let $K$ be a probability density function on $[-1, 1]$, and define $K_h(v) = K(v/h)$ for any real, positive constant $h$. Conditions that $K$ and $h$ must satisfy are given in Section 3. These include $h \to 0$ at an appropriate rate as $n \to \infty$. Define,
for $j = 0, 1, 2$,
$$S_{nj}'(x^1, h) = -\frac{2}{nh}\sum_{i=1}^{n}\{Y_i - F[\tilde\mu + \tilde m_1(x^1) + \tilde m_{-1}(\tilde X_i)]\}\,F'[\tilde\mu + \tilde m_1(x^1) + \tilde m_{-1}(\tilde X_i)]\Big(\frac{X_i^1 - x^1}{h}\Big)^j K_h(x^1 - X_i^1)$$
and
$$S_{nj}''(x^1, h) = \frac{2}{nh}\sum_{i=1}^{n}F'[\tilde\mu + \tilde m_1(x^1) + \tilde m_{-1}(\tilde X_i)]^2\Big(\frac{X_i^1 - x^1}{h}\Big)^j K_h(x^1 - X_i^1).$$
The second-stage (local-linear) estimator of $m_1(x^1)$ is
$$(2.4)\qquad \hat m_1(x^1) = \tilde m_1(x^1) - \frac{S_{n0}'(x^1, h)\,S_{n2}''(x^1, h) - S_{n1}'(x^1, h)\,S_{n1}''(x^1, h)}{S_{n0}''(x^1, h)\,S_{n2}''(x^1, h) - S_{n1}''(x^1, h)^2}.$$
The estimator (2.4) can be understood intuitively as follows. If $\tilde\mu$ and $\tilde m_{-1}$ were the true values of $\mu$ and $m_{-1}$, the local linear estimator of $m_1(x^1)$ would minimize
$$\sum_{i=1}^{n}\{Y_i - F[\tilde\mu + b_0 + b_1(X_i^1 - x^1) + \tilde m_{-1}(\tilde X_i)]\}^2 K_h(x^1 - X_i^1)$$
over $(b_0, b_1)$; (2.4) is one Newton step toward this minimum, starting from the first-stage value $\tilde m_1(x^1)$.
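Continuing the sketch above (again ours, and hedged: the paper's (2.4) is the local-linear update, while for brevity this code shows the simpler local-constant version, i.e. the ratio $S_{n0}'/S_{n0}''$ with $j = 0$ only):

```python
import numpy as np

def second_stage_lc(x1, Y, X, mu_t, m1_t, m_minus1, h, F, Fprime):
    """One-step local-constant update of m_1 at the point x1.

    m1_t      : first-stage value m~_1(x1) (scalar)
    m_minus1  : array of m~_2(X_i^2) + ... + m~_d(X_i^d), length n
    Returns m^_1(x1) = m~_1(x1) - S'_{n0}(x1, h) / S''_{n0}(x1, h)."""
    u = (x1 - X[:, 0]) / h
    kern = np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)
    idx = mu_t + m1_t + m_minus1                    # index at first-stage values
    fp = Fprime(idx)
    S1 = -2.0 * np.sum((Y - F(idx)) * fp * kern)    # kernel-weighted score
    S2 = 2.0 * np.sum(fp**2 * kern)                 # Gauss-Newton curvature
    return m1_t - S1 / S2                           # common 1/(nh) factors cancel
```

Because only the ratio $S'/S''$ enters, the common $2/(nh)$ normalization can be dropped, as the final comment notes.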
3. Main results. This section has three parts. Section 3.1 states the assumptions that are used to prove the main results. Section 3.2 states the results. The main results are the $n^{-2/5}$-consistency and asymptotic normality of the estimators of the $m_j$'s. Section 3.3 describes the weighted estimator.
The following additional notation is used. For any matrix $A$, define the norm $\|A\| = [\mathrm{trace}(A'A)]^{1/2}$. Define $U = Y - F[\mu + m(X)]$, $V(x) = \mathrm{Var}(U \mid X = x)$, $Q_K = E\{F'[\mu + m(X)]^2 P^K(X)P^K(X)'\}$ and $\Sigma_K = Q_K^{-1}E\{F'[\mu + m(X)]^2 V(X) P^K(X)P^K(X)'\}Q_K^{-1}$ whenever the latter quantity exists. $Q_K$ and $\Sigma_K$ are $d(K) \times d(K)$ positive semidefinite matrices, where $d(K) = Kd + 1$. Let $\lambda_{K,\min}$ denote the smallest eigenvalue of $Q_K$. Let $Q_{K,ij}$ denote the $(i, j)$ element of $Q_K$. Define $\zeta_K = \sup_{x \in \mathcal{X}}\|P^K(x)\|$, where $\mathcal{X}$ denotes the support of $X$. Let $\{\theta_{jk}\}$ be the coefficients of the series expansion (2.3). For each $K$ define $\theta_{K0} = (\mu, \theta_{11}, \dots, \theta_{1K}, \theta_{21}, \dots, \theta_{2K}, \dots, \theta_{d1}, \dots, \theta_{dK})'$.
3.1. Assumptions. The main results are obtained under the following assumptions.
(iv) There are constants $c_V > 0$ and $C_V < \infty$ such that $c_V \le \mathrm{Var}(U \mid X = x) \le C_V$ for all $x \in \mathcal{X}$.
(v) There is a constant $C_U < \infty$ such that $E|U|^j \le C_U^{j-2}\,j!\,E(U^2) < \infty$ for all $j \ge 2$.
ASSUMPTION A3. (i) There is a constant $C_m < \infty$ such that $|m_j(v)| \le C_m$ for each $j = 1, \dots, d$ and all $v \in [-1, 1]$.
(ii) Each function $m_j$ is twice continuously differentiable on $[-1, 1]$.
(iii) There are constants $C_{F1} < \infty$, $c_{F2} > 0$ and $C_{F2} < \infty$ such that $|F(v)| \le C_{F1}$ and $c_{F2} \le F'(v) \le C_{F2}$ for all $v \in [\mu - C_m d, \mu + C_m d]$.
(iv) $F$ is twice continuously differentiable on $[\mu - C_m d, \mu + C_m d]$.
(v) There is a constant $C_{F3} < \infty$ such that $|F''(v_2) - F''(v_1)| \le C_{F3}|v_2 - v_1|$ for all $v_2, v_1 \in [\mu - C_m d, \mu + C_m d]$.
ASSUMPTION A4. (i) There are constants $C_Q < \infty$ and $c_\lambda > 0$ such that $|Q_{K,ij}| \le C_Q$ and $\lambda_{K,\min} \ge c_\lambda$ for all $K$ and all $i, j = 1, \dots, d(K)$.
(ii) The largest eigenvalue of $\Sigma_K$ is bounded for all $K$.
to insure that the bias of our estimator converges to zero sufficiently rapidly. Assumption A2(v) restricts the thickness of the tails of the distribution of $U$ and is used to prove consistency of the first-stage estimator. Assumption A3 defines the sense in which $F$ and the $m_j$'s must be smooth. Assumption A3(iii) is needed for identification. Assumption A4 insures the existence and nonsingularity of the covariance matrix of the asymptotic form of the first-stage estimator. This is analogous to assuming that the information matrix is positive definite in parametric maximum likelihood estimation. Assumption A4(i) implies Assumption A4(ii) if $U$ is homoskedastic. Assumptions A5(iii) and A5(iv) bound the magnitudes of the basis functions and insure that the errors in the series approximations to the $m_j$'s converge to zero sufficiently rapidly as $K \to \infty$. These assumptions are satisfied by spline and (for periodic functions) Fourier bases. Assumption A6 states the rates at which $K \to \infty$ and $h \to 0$ as $n \to \infty$. The assumed rate of convergence of $h$ is well known to be asymptotically optimal for one-dimensional kernel mean-regression when the conditional mean function is twice continuously differentiable. The required rate for $K$ insures that the asymptotic bias and variance of the first-stage estimator are sufficiently small to achieve an $n^{-2/5}$ rate of convergence in the second stage. The $L^2$ rate of convergence of a series estimator of $m_j$ is maximized by setting $K \propto n^{1/5}$, which is slower than the rates permitted by Assumption A6(i) [Newey (1997)]. Thus, Assumption A6(i) requires the first-stage estimator to be undersmoothed. Undersmoothing is needed to insure sufficiently rapid convergence of the bias of the first-stage estimator. We show that the first-order performance of our second-stage estimator does not depend on the choice of $K$ if Assumption A6(i) is satisfied. See Theorems 2 and 3. Optimizing the choice of $K$ would require a rather complicated higher-order theory and is beyond the scope of this paper, which is restricted to first-order asymptotics.
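To make the undersmoothing requirement concrete, here is a back-of-the-envelope calculation (ours, using only the twice-differentiability of Assumption A3(ii); Assumption A6 itself is not reproduced in full above). A series estimator with $K$ terms per component has approximation bias of order $K^{-2}$, while the second-stage kernel step with $h \propto n^{-1/5}$ resolves fluctuations of order $(nh)^{-1/2}$:
$$(nh)^{-1/2}\Big|_{h \propto n^{-1/5}} \sim n^{-2/5}, \qquad K^{-2} = o\big(n^{-2/5}\big) \iff K/n^{1/5} \to \infty.$$
Thus the first-stage bias is negligible at the second-stage $n^{-2/5}$ scale exactly when $K$ grows faster than the $L^2$-optimal rate $n^{1/5}$, which is the sense in which Assumption A6(i) undersmooths.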
3.2. Theorems. This section states two theorems that give the main results of the paper. Theorem 1 gives the asymptotic behavior of the first-stage series estimator under Assumptions A1-A6(i). Theorem 2 gives the properties of the second-stage estimator. For $i = 1, \dots, n$, define $U_i = Y_i - F[\mu + m(X_i)]$ and $b_{K0}(x) = \mu + m(x) - P^K(x)'\theta_{K0}$. Let $\|v\|$ denote the Euclidean norm of any finite-dimensional vector $v$.
Define
$$D_0(x^1) = 2\int F'[\mu + m_1(x^1) + m_{-1}(\tilde x)]^2 f_X(x^1, \tilde x)\,d\tilde x,$$
$$D_1(x^1) = 2\int F'[\mu + m_1(x^1) + m_{-1}(\tilde x)]^2\,\frac{\partial f_X(x^1, \tilde x)}{\partial x^1}\,d\tilde x,$$
$$A_K = \int_{-1}^{1}v^2K(v)\,dv, \qquad B_K = \int_{-1}^{1}K(v)^2\,dv,$$
$$V_1(x^1) = 4B_K C_h^{-1}D_0(x^1)^{-2}\int V(x^1, \tilde x)\,F'[\mu + m_1(x^1) + m_{-1}(\tilde x)]^2 f_X(x^1, \tilde x)\,d\tilde x$$
and
$$\beta_{1,LC}(x^1) = 2C_h^2A_KD_0(x^1)^{-1}\int g_{LC}(x^1, \tilde x)\,F'[\mu + m_1(x^1) + m_{-1}(\tilde x)]f_X(x^1, \tilde x)\,d\tilde x,$$
where $f_X$ denotes the density of $X$, $C_h$ is the constant in $h = C_hn^{-1/5}$ and $g_{LC}$ is the local-constant bias function.
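For concreteness (our addition), the kernel constants $A_K$ and $B_K$ are simple rational numbers for the biweight kernel $K(v) = \frac{15}{16}(1 - v^2)^2I(|v| \le 1)$ used in the Monte Carlo experiments of Section 5:
$$A_K = \frac{15}{16}\int_{-1}^{1}v^2(1 - v^2)^2\,dv = \frac{15}{16}\cdot\frac{16}{105} = \frac{1}{7}, \qquad B_K = \Big(\frac{15}{16}\Big)^2\int_{-1}^{1}(1 - v^2)^4\,dv = \frac{225}{256}\cdot\frac{256}{315} = \frac{5}{7}.$$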
For $j = 0, 1, 2$ and a weight function $w$, define
$$S_{nj}'(x^1, m, w) = -\frac{2}{nh}\sum_{i=1}^{n}w(x^1, \tilde X_i)\{Y_i - F[\tilde\mu + \tilde m_1(x^1) + \tilde m_{-1}(\tilde X_i)]\}\,F'[\tilde\mu + \tilde m_1(x^1) + \tilde m_{-1}(\tilde X_i)]\Big(\frac{X_i^1 - x^1}{h}\Big)^j K_h(x^1 - X_i^1)$$
and
$$S_{nj}''(x^1, m, w) = \frac{2}{nh}\sum_{i=1}^{n}w(x^1, \tilde X_i)F'[\tilde\mu + \tilde m_1(x^1) + \tilde m_{-1}(\tilde X_i)]^2\Big(\frac{X_i^1 - x^1}{h}\Big)^j K_h(x^1 - X_i^1)$$
for each $x^1 \in [-1, 1]$. Arguments identical to those used to prove Theorem 2 show that the variance of the asymptotic distribution of the resulting local-linear or local-constant estimator of $m_1(x^1)$ is
$$V_1(x^1, w) = B_KC_h^{-1}\,\frac{\int w(x^1, \tilde x)^2V(x^1, \tilde x)F'[\mu + m_1(x^1) + m_{-1}(\tilde x)]^2f_X(x^1, \tilde x)\,d\tilde x}{\big\{\int w(x^1, \tilde x)F'[\mu + m_1(x^1) + m_{-1}(\tilde x)]^2f_X(x^1, \tilde x)\,d\tilde x\big\}^2},$$
where the weight function $w$ satisfies $\int w(x^2, \dots, x^d)\,dx^2\cdots dx^d = 1$.
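Taking the displayed variance at face value (it is our reconstruction of a garbled formula, so the constant factors should not be trusted), a Cauchy-Schwarz argument shows which weight minimizes it. Write $g(\tilde x) = F'[\mu + m_1(x^1) + m_{-1}(\tilde x)]^2f_X(x^1, \tilde x)$ and $V = V(x^1, \tilde x)$; then
$$\Big(\int wg\Big)^2 = \Big(\int\big(w\sqrt{Vg}\big)\sqrt{g/V}\Big)^2 \le \Big(\int w^2Vg\Big)\Big(\int \frac{g}{V}\Big) \;\Longrightarrow\; \frac{\int w^2Vg}{(\int wg)^2} \ge \Big(\int \frac{g}{V}\Big)^{-1},$$
with equality when $w \propto 1/V$. So, up to normalization, weighting inversely to the conditional variance minimizes the asymptotic variance of the weighted second-stage estimator.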
4. Bandwidth selection. Write the bandwidth used to estimate $m_j$ as $h_j = C_{hj}n^{-1/5}$. The estimated bandwidths solve
$$(4.2)\qquad \underset{C_{h1},\dots,C_{hd}}{\text{minimize}}:\ \mathrm{PLS}(h) = n^{-1}\sum_{i=1}^{n}\{Y_i - F[\tilde\mu + \hat m(X_i)]\}^2 + 2K(0)\,n^{-1}\sum_{i=1}^{n}\{F'[\tilde\mu + \hat m(X_i)]^2\hat V(X_i)\}\sum_{j=1}^{d}[n^{4/5}C_{hj}\hat D_j(X_i)]^{-1},$$
where $\hat V$ and $\hat D_j$ are consistent estimators of $V$ and $D_j$, and the $C_{hj}$'s are restricted to a compact, positive interval that excludes 0. We argue that the difference between $\mathrm{PLS}(h)$ and the average squared error $\mathrm{ASE}(h)$, apart from a term that does not depend on $h$, is asymptotically negligible and, therefore, that the solution to (4.2) estimates the bandwidths that minimize ASE. A proof of this result only requires additional
smoothness conditions on $F$ and more restrictive assumptions on $K$. The proof can be carried out by making arguments similar to those used in the proof of Theorem 2 but with a higher-order stochastic expansion for $\hat m - m$. Here, we provide only
a heuristic outline. For this purpose, note that
$$n^{-1}\sum_{i=1}^{n}U_i^2 + \mathrm{ASE}(h) - \mathrm{PLS}(h) = 2n^{-1}\sum_{i=1}^{n}\{F[\tilde\mu + \hat m(X_i)] - F[\mu + m(X_i)]\}U_i - 2K(0)\,n^{-1}\sum_{i=1}^{n}F'[\mu + m(X_i)]^2V(X_i)\sum_{j=1}^{d}[n^{4/5}C_{hj}\hat D_j(X_i)]^{-1} + 2K(0)\,n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{d}F'[\mu + m(X_i)]^2\,\mathrm{Var}(U_i \mid X_i)\,[n^{4/5}C_{hj}D_{0j}(X_i)]^{-1} + \text{smaller-order terms},$$
where $D_{0j}$ denotes the probability limit of $\hat D_j$ and $f_{X^j}$ is the probability density function of $X^j$. Now, by standard kernel smoothing arguments, $\hat D_j(x) \to_p D_{0j}(x^j)$. In addition, $\hat V(X_i)$ is consistent for $\mathrm{Var}(U_i \mid X_i)$, so the two penalty terms on the right-hand side cancel asymptotically; the first term can likewise be shown to be negligible, which establishes the desired result.
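The criterion (4.2) is simple to code once first-stage and second-stage routines exist. In the sketch below (ours; `m_hat`, `V_hat` and `D_hat` are hypothetical user-supplied callables standing in for the second-stage fit and for consistent estimates of $V$ and $D_j$), a bounded quasi-Newton search over $(C_{h1}, \dots, C_{hd})$ replaces whatever optimizer the authors used:

```python
import numpy as np
from scipy.optimize import minimize

def pls(C, Y, X, mu_t, m_hat, F, Fprime, V_hat, D_hat, K0):
    """Penalized least squares (4.2). m_hat(X, C) returns the second-stage
    fitted additive index computed with bandwidths h_j = C[j] * n^{-1/5};
    K0 is the kernel evaluated at zero, K(0)."""
    n, d = X.shape
    idx = mu_t + m_hat(X, C)
    fit = np.mean((Y - F(idx)) ** 2)
    pen = sum(np.mean(Fprime(idx) ** 2 * V_hat(X) / (n ** 0.8 * C[j] * D_hat(j, X)))
              for j in range(d))
    return fit + 2.0 * K0 * pen

def select_bandwidths(Y, X, mu_t, m_hat, F, Fprime, V_hat, D_hat, K0):
    d = X.shape[1]
    res = minimize(pls, x0=np.ones(d),
                   args=(Y, X, mu_t, m_hat, F, Fprime, V_hat, D_hat, K0),
                   bounds=[(0.1, 10.0)] * d)   # compact interval excluding 0
    return res.x * len(Y) ** (-0.2)            # h_j = C_hj * n^{-1/5}
```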
5. Monte Carlo experiments. This section presents the results of a small set
of Monte Carlo experiments that compare the finite-sample performances of the
two-stage estimator, the estimator of LH and the infeasible oracle estimator in the binary logit model
$$P(Y = 1 \mid X = x) = L\Big[f_1(x^1) + f_2(x^2) + \sum_{j=3}^{d}x^j\Big],$$
where $L$ is the logistic distribution function.
In all of the experiments,
$$f_1(x) = \sin(\pi x) \qquad\text{and}\qquad f_2(x) = \Phi(3x),$$
where $\Phi$ is the standard normal distribution function. The components of $X$ are independently distributed as $U[-1, 1]$. Estimation is carried out under the assumption that the additive components have two (but not necessarily more) continuous derivatives. Under this assumption, the two-stage estimator has the rate of convergence $n^{-2/5}$. The LH estimator has this rate of convergence if $d = 2$ but not if $d = 5$.
B-splines were used for the first stage of the two-stage estimator. The kernel used for the second stage and for the LH estimator is
$$K(v) = \tfrac{15}{16}(1 - v^2)^2I(|v| \le 1).$$
Experiments were carried out using both local-constant and local-linear estimators in the second stage of the two-stage method. There were 1000 Monte Carlo replications per experiment with the two-stage estimator but only 500 replications with the LH estimator because of the very long computing times it entails. The experiments were carried out in GAUSS using GAUSS random number generators.
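For reference, one replication of this design can be generated in a few lines (our sketch; the original code was written in GAUSS, and the logistic link reflects our reading of the partially garbled model display above):

```python
import numpy as np
from scipy.stats import norm

def simulate(n, d, rng):
    """Draw one sample from the binary-response additive design of Section 5."""
    X = rng.uniform(-1.0, 1.0, size=(n, d))    # X^j i.i.d. U[-1, 1]
    index = (np.sin(np.pi * X[:, 0])           # f_1(x) = sin(pi * x)
             + norm.cdf(3.0 * X[:, 1])         # f_2(x) = Phi(3x)
             + X[:, 2:].sum(axis=1))           # linear terms for j >= 3
    p = 1.0 / (1.0 + np.exp(-index))           # logistic link
    Y = (rng.uniform(size=n) < p).astype(float)
    return Y, X

Y, X = simulate(n=500, d=5, rng=np.random.default_rng(0))
```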
The results of the experiments are summarized in Table 1, which shows the empirical integrated mean-square errors (EIMSEs) of the estimators at the values of the tuning parameters that minimize the EIMSEs. Lengthy computing times precluded using data-based methods for selecting tuning parameters in the experiments. The EIMSEs of the local-constant and local-linear two-stage estimates of $f_1$ are considerably smaller than the EIMSEs of the LH estimator. The EIMSEs of the local-constant and LH estimators of $f_2$ are approximately equal, whereas the local-linear estimator of $f_2$ has a larger EIMSE. There is little difference between the EIMSEs of the two-stage local-linear and infeasible oracle estimators. This result is consistent with the oracle property of the two-stage estimator.
TABLE 1
Results of Monte Carlo experiments*

                                                          Empirical IMSE
Estimator                                K1  K2   h1   h2    f1      f2
d = 2
LH                                                0.9  0.9   0.116   0.015
Two-stage with local-constant smoothing  2   2    0.4  0.9   0.052   0.015
Two-stage with local-linear smoothing    4   2    0.5  1.4   0.052   0.023
Infeasible oracle estimator                       0.6  1.7   0.056   0.021
d = 5
LH                                                1.0  1.0   0.145   0.019
Two-stage with local-constant smoothing  2   2    0.4  0.9   0.060   0.018
Two-stage with local-linear smoothing    2   2    0.6  1.3   0.057   0.029
Infeasible oracle estimator                       0.6  2.0   0.057   0.023

*In the two-stage estimator, K_j and h_j (j = 1, 2) are the series length and bandwidth used to estimate f_j. In the LH estimator, h_j (j = 1, 2) is the bandwidth used to estimate f_j. The values of K1, K2, h1 and h2 minimize the IMSEs of the estimates.
7.1. Theorem 1. This section begins with lemmas that are used to prove
Theorem 1.
PROOF. Write
$$S_{nK}(\theta) = n^{-1}\sum_{i=1}^{n}Y_i^2 - 2S_{nK1}(\theta) + S_{nK2}(\theta),$$
where
$$S_{nK1}(\theta) = n^{-1}\sum_{i=1}^{n}Y_iF[P^K(X_i)'\theta]$$
and
$$S_{nK2}(\theta) = n^{-1}\sum_{i=1}^{n}F[P^K(X_i)'\theta]^2.$$
It suffices to prove that
$$P\Big[\sup_{\theta \in \Theta_K}|S_{nKj}(\theta) - ES_{nKj}(\theta)| > \varepsilon\Big] \le C\exp(-an\varepsilon^2)$$
for any $\varepsilon > 0$, some $C < \infty$, some $a > 0$ and all sufficiently large $n$. The proof is given only for $j = 1$. Similar arguments apply when $j = 2$.
Define $\bar S_{nK1}(\theta) = S_{nK1}(\theta) - ES_{nK1}(\theta)$. Divide $\Theta_K$ into hypercubes of edge-length $\ell$, so that $\Theta_K \subset [-C_\Theta, C_\Theta]^{d(K)}$ is covered by the $M = (2C_\Theta/\ell)^{d(K)}$ cubes $\Theta_K^{(1)}, \dots, \Theta_K^{(M)}$ thus created. Let $\theta_{Kj}$ be the point at the center of $\Theta_K^{(j)}$. The maximum distance between $\theta_{Kj}$ and any other point in $\Theta_K^{(j)}$ is $r = d(K)^{1/2}\ell/2$, and $M = \exp\{d(K)[\log(C_\Theta/r) + (1/2)\log d(K)]\}$.
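The expression for $M$ is simple algebra; for the reader's convenience (our addition), substituting $\ell = 2r/d(K)^{1/2}$ into $M = (2C_\Theta/\ell)^{d(K)}$ gives
$$M = \Big(\frac{2C_\Theta}{\ell}\Big)^{d(K)} = \Big(\frac{C_\Theta\,d(K)^{1/2}}{r}\Big)^{d(K)} = \exp\Big\{d(K)\Big[\log\frac{C_\Theta}{r} + \tfrac{1}{2}\log d(K)\Big]\Big\}.$$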
Therefore, for all sufficiently large $K$ and, therefore, $n$,
$$P\Big[\sup_{\theta\in\Theta_K}|\bar S_{nK1}(\theta)| > \varepsilon\Big] \le \sum_{j=1}^{M}P[|\bar S_{nK1}(\theta_{Kj})| > \varepsilon/2] + P\Big[C_{F2}\zeta_Kr\Big(n^{-1}\sum_{i=1}^{n}|Y_i| + C_{F1}\Big) > \varepsilon/2\Big].$$
Choose $r = \zeta_K^{-2}$. Then $\varepsilon/2 - C_{F2}\zeta_Kr[C_{F1} + E(|Y|)] > \varepsilon/4$ for all sufficiently large $K$. Moreover,
$$P\Big[C_{F2}\zeta_Kr\Big(n^{-1}\sum_{i=1}^{n}|Y_i| + C_{F1}\Big) > \varepsilon/2\Big] \le P\Big[C_{F2}\zeta_Krn^{-1}\sum_{i=1}^{n}(|Y_i| - E|Y_i|) > \varepsilon/4\Big] \le 2\exp(-a_1n^{\gamma}\varepsilon^2),$$
where $\gamma = 4/15 + \nu$. It follows that $P_n \le 4\exp(-an^{\gamma}\varepsilon^2)$ for a suitable $a > 0$ and all sufficiently large $n$. □
Define
$$S_K(\theta) = E[S_{nK}(\theta)]$$
and
$$\theta_K = \arg\min_{\theta\in\Theta_K}S_K(\theta).$$
LEMMA 2. For any $\eta > 0$, $S_K(\hat\theta_{nK}) - S_K(\theta_K) < \eta$ almost surely for all sufficiently large $n$.

PROOF. For each $K$, let $\mathcal{N}_K \subset \mathbb{R}^{d(K)}$ be an open set containing $\theta_K$. Let $\tilde{\mathcal{N}}_K$ denote the complement of $\mathcal{N}_K$ in $\mathbb{R}^{d(K)}$. Define $T_K = \tilde{\mathcal{N}}_K \cap \Theta_K$. Then $T_K \subset \mathbb{R}^{d(K)}$ is compact. Define
$$\eta_K = \min_{\theta\in T_K}S_K(\theta) - S_K(\theta_K).$$
Let $A_n$ be the event $|S_{nK}(\theta) - S_K(\theta)| < \eta_K/2$ for all $\theta \in \Theta_K$. Then
$$A_n \Rightarrow S_K(\hat\theta_{nK}) < S_{nK}(\hat\theta_{nK}) + \eta_K/2$$
and
$$A_n \Rightarrow S_{nK}(\hat\theta_{nK}) \le S_{nK}(\theta_K) < S_K(\theta_K) + \eta_K/2.$$
Therefore,
$$A_n \Rightarrow S_K(\hat\theta_{nK}) - S_K(\theta_K) < \eta_K.$$
So $A_n \Rightarrow \hat\theta_{nK} \in \mathcal{N}_K$. Since $\mathcal{N}_K$ is arbitrary, the result follows from Lemma 1 and Theorem 1.3.4 of Serfling [(1980), page 10]. □
Define
$$S_{K0}(\theta) = E\{F'[\mu + m(X)]^2[\mu + m(X) - P^K(X)'\theta]^2\}.$$
Then
$$\theta_{K0} = \arg\min_{\theta\in\Theta_K}S_{K0}(\theta).$$
Define $Z_{Ki} = F'[\mu + m(X_i)]P^K(X_i)$, so that $E(Z_{Ki}Z_{Ki}') = Q_K$. Let $Z_{Ki}^k$ $[k = 1, \dots, d(K)]$ denote the $k$th component of $Z_{Ki}$. Let $Z_K$ denote the $n \times d(K)$ matrix whose $(i, k)$ element is $Z_{Ki}^k$.
LEMMA 4. $\|\hat Q_K - Q_K\|^2 = O_p(K^2/n)$, where $\hat Q_K = n^{-1}Z_K'Z_K$.
Let $U_n = (U_1, \dots, U_n)'$ and let $\gamma_n$ be any sequence of $d(K) \times 1$ vectors with $\|\gamma_n\| = 1$. The proofs of Lemmas 4-6 proceed by bounding second moments of quadratic forms such as $\|\gamma_n'Q_K^{-1}Z_K'U_n/n\|^2$ componentwise over the $d(K)$ coordinates.
To prove the remaining parts of the theorem, observe that $\hat\theta_{nK}$ satisfies the first-order condition $\partial S_{nK}(\hat\theta_{nK})/\partial\theta = 0$ almost surely for all sufficiently large $n$. Define $M_i = \mu + m(X_i)$ and $\Delta M_i = P^K(X_i)'\hat\theta_{nK} - M_i = P^K(X_i)'(\hat\theta_{nK} - \theta_{K0}) - b_{K0}(X_i)$. Then a Taylor series expansion yields
$$n^{-1}\sum_{i=1}^{n}Z_{Ki}U_i - (\hat Q_K + R_{n1})(\hat\theta_{nK} - \theta_{K0}) + n^{-1}\sum_{i=1}^{n}F'(M_i)Z_{Ki}b_{K0}(X_i) + R_{n2} = 0,$$
where
$$R_{n1} = -n^{-1}\sum_{i=1}^{n}\big\{[3F''(\bar M_i)F'(\bar M_i) + F''(M_i)F''(\bar M_i)\Delta M_i + F''(\bar{\bar M}_i)F'(\bar{\bar M}_i)\Delta M_i]\Delta M_i - [F''(M_i)F'(M_i) - F''(\bar M_i)F'(\bar M_i) + F''(\bar M_i)F''(\bar M_i)b_{K0}(X_i)]b_{K0}(X_i)\big\}P^K(X_i)P^K(X_i)'$$
and $\bar M_i$ and $\bar{\bar M}_i$ are points between $P^K(X_i)'\hat\theta_{nK}$ and $M_i$. $R_{n2}$ is the analogous remainder formed from products of $F''$ terms, evaluated at such mean values, with $b_{K0}(X_i)P^K(X_i)$. Moreover,
$$\Big\|n^{-1}\sum_{i=1}^{n}U_iF''(\bar M_i)P^K(X_i)P^K(X_i)'\Big\|^2 = O_p(K^2/n).$$
Then
$$\|(\hat Q_K + R_{n1})^{-1}R_{n1}\gamma_n\|^2 = \mathrm{Trace}\{\gamma_n'R_{n1}'(\hat Q_K + R_{n1})^{-2}R_{n1}\gamma_n\} = O_p(\|R_{n1}\|^2).$$
If
$\gamma_n = n^{-1}\sum_{i=1}^{n}F'(M_i)^2P^K(X_i)b_{K0}(X_i) + R_{n2}$, then applying Lemma 6 and using the result $\|Q_K^{-1}R_{n2}\| = O_p(K^{-2})$ yields
$$\Big\|[(\hat Q_K + R_{n1})^{-1} - Q_K^{-1}]\Big[n^{-1}\sum_{i=1}^{n}F'(M_i)Z_{Ki}b_{K0}(X_i) + R_{n2}\Big]\Big\|^2 = O_p(\|\hat\theta_{nK} - \theta_{K0}\|^2/K + 1/K^5).$$
It follows from these results that
$$\hat\theta_{nK} - \theta_{K0} = n^{-1}Q_K^{-1}\sum_{i=1}^{n}F'[\mu + m(X_i)]P^K(X_i)U_i + n^{-1}Q_K^{-1}\sum_{i=1}^{n}F'[\mu + m(X_i)]P^K(X_i)b_{K0}(X_i) + R_n,$$
where $\|R_n\| = O_p(K^{3/2}/n + n^{-1/2})$. Part (d) of the theorem now follows from Lemma 4. Part (b) follows by applying Lemmas 5 and 6 to part (d). Part (c) follows from part (b) and Assumption A5(iii). □
7.2. Theorem 2. This section begins with lemmas that are used to prove Theorem 2. For any $\tilde x = (x^2, \dots, x^d) \in [-1, 1]^{d-1}$, set $m_{-1}(\tilde x) = m_2(x^2) + \cdots + m_d(x^d)$ and $\tilde b_{K0}(\tilde x) = \mu + m_{-1}(\tilde x) - \tilde P^K(\tilde x)'\theta_{K0}$, where $\tilde P^K(\tilde x)$ denotes the components of $P^K$ corresponding to $x^2, \dots, x^d$. For $j = 0, 1$, define
$$H_{nj1}(x^1) = (nh)^{-1/2}\sum_{i=1}^{n}F'[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]^2\Big(\frac{X_i^1 - x^1}{h}\Big)^j K_h(x^1 - X_i^1)\,\delta_{n1}(\tilde X_i),$$
$$H_{nj2}(x^1) = (nh)^{-1/2}\sum_{i=1}^{n}F'[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]^2\Big(\frac{X_i^1 - x^1}{h}\Big)^j K_h(x^1 - X_i^1)\,\delta_{n2}(\tilde X_i)$$
and
$$H_{nj3}(x^1) = (nh)^{-1/2}\sum_{i=1}^{n}F'[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]^2\Big(\frac{X_i^1 - x^1}{h}\Big)^j K_h(x^1 - X_i^1)\,\tilde b_{K0}(\tilde X_i),$$
where $\delta_{n1}$ and $\delta_{n2}$ are, respectively, the stochastic and bias components of the first-stage expansion of Theorem 1.
PROOF. The proof is given only for $j = 0$. Similar arguments apply for $j = 1$. First consider $H_{n01}(x^1)$. We can write
$$H_{n01}(x^1) = \sum_{j=1}^{n}a_j(x^1)U_j,$$
where
$$a_j(x^1) = n^{-3/2}h^{-1/2}\sum_{i=1}^{n}F'[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]^2K_h(x^1 - X_i^1)\,\tilde P^K(\tilde X_i)'Q_K^{-1}F'[\mu + m(X_j)]P^K(X_j).$$
Then arguments similar to those used to prove Lemma 1 show that $a_j(x^1) = (h/n)^{1/2}[\bar c(x^1) + r_{n1}]'Q_K^{-1}F'[\mu + m(X_j)]P^K(X_j)$, where $r_{n1}$ is uncorrelated with the $U_j$'s and $\|r_{n1}\| = O[(\log n)/(nh)^{1/2}]$ uniformly over $x^1 \in [-1, 1]$ almost surely. Moreover, for each $x^1 \in [-1, 1]$, the components of $\bar c(x^1)$ are the Fourier coefficients of a function that is bounded uniformly over $x^1$. Therefore,
$$(7.2)\qquad \|\bar c(x^1)\| \le M$$
for some finite constant $M$ and all $K = 1, \dots, \infty$. It follows from (7.2) and the Cauchy-Schwarz inequality that $\sum_{j=1}^{n}a_j(x^1)^2$ is bounded in probability uniformly over $x^1 \in [-1, 1]$. This and $\|r_{n1}\| = O[(\log n)/(nh)^{1/2}]$ establish the conclusion of the lemma for $j = 0$, $k = 1$.
We now prove the lemma for $j = 0$, $k = 2$. We can write
$$H_{n02}(x^1) = (nh)^{-1/2}\sum_{i=1}^{n}F'[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]^2K_h(x^1 - X_i^1)\,\tilde P^K(\tilde X_i)'B_n,$$
where
$$B_n = n^{-1}Q_K^{-1}\sum_{j=1}^{n}F'[\mu + m(X_j)]^2P^K(X_j)b_{K0}(X_j).$$
Arguments like those used to prove Lemma 6 show that $E\|B_n\|^2 = O(K^{-4})$. Therefore,
$$\sup_{|x^1|\le 1}|H_{n02}(x^1)| = \sup_{|x^1|\le 1}(nh)^{-1/2}\sum_{i=1}^{n}K_h(x^1 - X_i^1)\cdot O_p(K^{-2}) = O_p(n^{1/2}h^{1/2}K^{-2}) = o_p(1).$$
For the proof with $j = 0$, $k = 3$, note that
$$\sup_{|x^1|\le 1}|H_{n03}(x^1)| = \sup_{|x^1|\le 1}(nh)^{-1/2}\sum_{i=1}^{n}K_h(x^1 - X_i^1)\cdot O_p(K^{-2}) = o_p(1).\ \square$$
Define
$$\delta_{n3}(x^1, \tilde X_i) = n^{-1}P^K(x^1, \tilde X_i)'Q_K^{-1}\sum_{j=1}^{n}F'[\mu + m(X_j)]P^K(X_j)U_j$$
and
$$\delta_{n4}(x^1, \tilde X_i) = n^{-1}P^K(x^1, \tilde X_i)'Q_K^{-1}\sum_{j=1}^{n}F'[\mu + m(X_j)]^2P^K(X_j)b_{K0}(X_j).$$
For $j = 0, 1$, define
$$L_{nj3}(x^1) = (nh)^{-1/2}\sum_{i=1}^{n}F'[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]^2\Big(\frac{X_i^1 - x^1}{h}\Big)^jK_h(x^1 - X_i^1)\,\delta_{n3}(x^1, \tilde X_i),$$
$$L_{nj4}(x^1) = (nh)^{-1/2}\sum_{i=1}^{n}\{F[\mu + m_1(X_i^1) + m_{-1}(\tilde X_i)] - F[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]\}F''[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]\Big(\frac{X_i^1 - x^1}{h}\Big)^jK_h(x^1 - X_i^1)\,\delta_{n3}(x^1, \tilde X_i)$$
and define $L_{nj5}(x^1)$ and $L_{nj6}(x^1)$ analogously with $\delta_{n3}$ replaced by $\delta_{n4}$ and $\tilde b_{K0}$, respectively.
PROOF. The proof is given only for $j = 0$. The arguments are similar for $j = 1$. By Theorem 1, $\delta_{n4}(x^1, \tilde X_i)$ is the asymptotic bias component of the stochastic expansion of $P^K(x^1, \tilde X_i)'(\hat\theta_{nK} - \theta_{K0})$ and is $O_p(K^{-3/2})$ uniformly over $(x^1, \tilde X_i) \in [-1, 1]^d$. This fact and standard bounds on
$$\sup_{|x^1|\le 1}\sum_{i=1}^{n}|U_i|K_h(x^1 - X_i^1) \qquad\text{and}\qquad \sup_{|x^1|\le 1}\sum_{i=1}^{n}K_h(x^1 - X_i^1)$$
establish the conclusion of the lemma for $j = 0$, $k = 2, 5$. For $j = 0$, $k = 3, 6$, proceed similarly using
$$\sup_{x}|b_{K0}(x)| = O(K^{-2}).$$
For $j = 0$, $k = 4$, one can use arguments similar to those made for $H_{n01}(x^1)$ in the proof of Lemma 7. It remains to consider $L_{n01}(x^1) = D_n(x^1)'B_n$, where the components of $D_n(x^1)$ are kernel-weighted sums involving the basis functions $p_r(x^1)$ and $p_r(X_i^\ell)$ for $0 \le r \le K$ and $2 \le \ell \le d$. These expressions can be bounded uniformly over $|x^1| \le 1$ by terms that are $O_p[(\log n)^{1/2}|p_r(x^1)|]$ and $O_p[(\log n)^{1/2}]$, respectively. This, together with $E\|B_n\|^2 = O(K^{-4})$, gives $\sup_{|x^1|\le 1}|L_{n01}(x^1)| = o_p(1)$, completing the proof. □
PROOF. Only (a) is proved. The proof of (b) is similar. For each $i = 1, \dots, n$, let $m^*(x^1, \tilde X_i)$ and $m^{**}(x^1, \tilde X_i)$ denote quantities that are between $\tilde\mu + \tilde m_1(x^1) + \tilde m_{-1}(\tilde X_i)$ and $\mu + m_1(x^1) + m_{-1}(\tilde X_i)$; their values may differ in different uses. A Taylor expansion gives
$$(nh)^{1/2}S_{n0}'(x^1, \tilde m) = (nh)^{1/2}S_{n0}'(x^1, m) + \sum_{j=1}^{4}J_{nj}(x^1)$$
uniformly over $|x^1| \le 1 - h$. Now consider $J_{n3}(x^1)$. It follows from Assumption A3(v) that
$$|J_{n3}(x^1)| \le (nh)^{-1/2}\sum_{i=1}^{n}F'[\mu + m_1(x^1) + m_{-1}(\tilde X_i)]^2K_h(x^1 - X_i^1)\cdot O\Big(\sup_{v}|\tilde m(v) - m(v)|^2\Big) = o_p(1)$$
uniformly over $|x^1| \le 1 - h$. □
It remains to analyze the integral in (7.3), whose integrand involves $\hat m_1(x^1) - m_1(x^1)$. By replacing the integrand with the expansion of Theorem 2(a), one obtains a U-statistic in the $U_i$ conditional on $X_1, \dots, X_n$. This U-statistic has vanishing conditional variance. □
REFERENCES

BOSQ, D. (1998). Nonparametric Statistics for Stochastic Processes: Estimation and Prediction, 2nd ed. Lecture Notes in Statist. 110. Springer, Berlin.
BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J. Amer. Statist. Assoc. 80 580-619.
BUJA, A., HASTIE, T. and TIBSHIRANI, R. J. (1989). Linear smoothers and additive models (with discussion). Ann. Statist. 17 453-555.
CHEN, R., HÄRDLE, W., LINTON, O. B. and SEVERANCE-LOSSIN, E. (1996). Nonparametric estimation of additive separable regression models. In Statistical Theory and Computational Aspects of Smoothing (W. Härdle and M. Schimek, eds.) 247-253. Physica, Heidelberg.
FAN, J. and CHEN, J. (1999). One-step local quasi-likelihood estimation. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 927-943.
FAN, J. and GIJBELS, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
FAN, J., HÄRDLE, W. and MAMMEN, E. (1998). Direct estimation of low-dimensional components in additive models. Ann. Statist. 26 943-971.
HASTIE, T. J. and TIBSHIRANI, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.
HOROWITZ, J. L., KLEMELÄ, J. and MAMMEN, E. (2002). Optimal estimation in additive regression models. Working paper, Institut für Angewandte Mathematik, Ruprecht-Karls-Universität, Heidelberg, Germany.
LINTON, O. B. and HÄRDLE, W. (1996). Estimating additive regression models with known links. Biometrika 83 529-540.
LINTON, O. B. and NIELSEN, J. P. (1995). A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika 82 93-100.