Annal Horowitz Mammen 2004

TheAnnals of Statistics
2004, Vol. 32, No. 6, 2412-2443

DOI 10.1214/009053604000000814
? Instituteof MathematicalStatistics,2004
NONPARAMETRIC ESTIMATION OF AN ADDITIVE MODEL

WITH A LINK FUNCTION
BY JOELL. HOROWITZ1AND ENNO MAMMEN2

Northwestern University and Universitdt Mannheim
This paper describes an estimator of the additive components of
a nonparametricadditive model with a known link function. When the
additive components are twice continuously differentiable,the estimatoris
asymptoticallynormallydistributedwith a rateof convergencein probability
of n-2/5. This is true regardlessof the (finite) dimensionof the explanatory
variable.Thus, in contrastto the existing asymptoticallynormal estimator,
the new estimatorhas no curse of dimensionality.Moreover,the estimatorhas
an oracleproperty.The asymptoticdistributionof each additivecomponentis
the same as it would be if the othercomponentswere known with certainty.
1. Introduction. This paper is concerned with nonparametric estimation of

the functions m , ..., md in the model
(1.1) Y = F[it + ml(X1) +... + md(Xd)] + U,
where Xi (j = 1, ..., d) is the jth component of the random vector X E IRdfor

some finite d > 2, F is a known function, , is an unknown constant, m , ..., md
are unknown functions and U is an unobserved random variable satisfying
E(UIX = x) = 0 for almost every x. Estimation is based on an i.i.d. random
sample {Yi, Xi: i = 1,..., n} of (Y, X). We describe an estimator of the additive
components m, ..., md that converges in probability pointwise at the rate n-2/5
when F and the mj's are twice continuously differentiable and the second
derivative of F is sufficiently smooth. In contrast to previous estimators, only two
derivatives are needed regardless of the dimension of X, so asymptotically there is
no curse of dimensionality. Moreover, the estimators derived here have an oracle
property. Specifically, the centered, scaled estimator of each additive component is
asymptotically normally distributed with the same mean and variance that it would
have if the other components were known.
Received July 2002; revised March2004.

1Supported in part by NSF Grants SES-99-10925 and SES-03-52675, the Alexander
von Humboldt Foundationand Deutsche ForschungsgemeinschaftSonderforschungsbereich373,
"Quantifikationund SimulationOkonomischerProzesse."
2Supportedin partby Deutsche ForschungsgemeinschaftMA 1026/6-2.
AMS2000 subjectclassifications.Primary62G08; secondary62G20.
Keywords and phrases. Additive models, multivariatecurve estimation, nonparametricregres-
sion, kernelestimates,orthogonalseries estimator.
2412
This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2413
Linton and Hardle (1996) (hereinafterLH) developed an estimator of the

additivecomponentsof (1.1) that is based on marginalintegration.The marginal
integration method is discussed in more detail below. The estimator of LH
converges at the rate n-2/5 and is asymptotically normally distributed,but it
requiresthe mj's to have an increasing numberof derivativesas the dimension
of X increases. Thus, it suffers from the curse of dimensionality.Our estimator
avoids this problem.
There is a large body of researchon estimationof (1.1) when F is the identity
function so that Y = / + ml(X1) + *.. + md(Xd) + U. Stone (1985, 1986)
showed that n-2/5 is the optimal L2 rate of convergence of an estimator of
the mj's when they aretwice continuouslydifferentiable.Stone (1994) andNewey
(1997) describe spline estimatorswhose L2 rate of convergenceis n-2/5, but the
pointwise rates of convergence and asymptotic distributionsof spline and other
series estimatorsremain unknown. Breiman and Friedman(1985), Buja, Hastie
andTibshirani(1989), Hastie andTibshirani(1990), OpsomerandRuppert(1997),
Mammen, Linton and Nielsen (1999) and Opsomer (2000) have investigatedthe
propertiesof backfittingprocedures.Mammen, Linton and Nielsen (1999) give
conditionsunderwhich a smoothbackfittingestimatorof the mj 's convergesat the
pointwise rate n-2/5 when these functions are twice continuouslydifferentiable.
The estimator is asymptotically normally distributedand avoids the curse of
dimensionality,but extendingit to models in which F is not the identityfunction
appearsto be quite difficult.Horowitz,Klemeli and Mammen(2002) (hereinafter
HKM) discuss optimalitypropertiesof a variety of estimatorsfor nonparametric
additivemodels withoutlink functions.
Tj0stheim and Auestad (1994), Linton and Nielsen (1995), Chen, Hardle,
Linton and Severance-Lossin(1996) and Fan, Hardle and Mammen (1998) have
investigatedthe propertiesof marginalintegrationestimatorsfor the case in which
F is the identityfunction.These estimatorsare based on the observationthatwhen
F is the identityfunction,then m (x1), say, is given up to an additiveconstantby
(1.2) f E(YIX = x)w(x2 ..., xd)dx2...dxd,

where w is a nonnegativefunction satisfying
w(x2,...xd) dx2...dxd = 1.
Therefore, m (x1) can be estimated up to an additive constant by replacing

E(YIX = x) in (1.2) with a nonparametricestimator.Linton and Nielsen (1995),
Chen, Hirdle, Lintonand Severance-Lossin(1996) and Fan, Hardleand Mammen
(1998) have given conditions under which a variety of estimatorsbased on the
marginalintegrationidea converge at rate n-2/5 and are asymptoticallynormal.
The latter two estimators have the oracle property. That is, the asymptotic
distributionof the estimatorof each additivecomponentis the same as it would be

2414 J. L. HOROWITZAND E. MAMMEN
if the other componentswere known. LH extend marginalintegrationto the case

in which F is not the identity function. However,marginalintegrationestimators
have a curse of dimensionality:the smoothness of the mj's must increase as the
dimension of X increases to achieve n-2/5 convergence. The reason for this is
that estimatingE(YIX = x) requirescarryingout a d-dimensionalnonparametric
regression.If d is large and the mj's are only twice differentiable,then the bias of
the resulting estimator of E(YIX = x) converges to zero too slowly as n -> oo
to estimate the mj's with an n-2/5 rate. For example, the estimator of Fan,
Hardle and Mammen (1998), which imposes the weakest smoothness conditions
of any existing marginalintegrationestimator,requiresmore than two derivatives
if d >5.
This paper describes a two-stage estimation procedurethat does not require
a d-dimensional nonparametricregression and, thereby, avoids the curse of
dimensionality.In the first stage, nonlinearleast squaresis used to obtain a series
approximationto each mj. The first-stageprocedureimposes the additivestructure
of (1.1) and yields estimates of the mj's that have smaller asymptotic biases
than do estimatorsbased on marginalintegrationor otherproceduresthat require
d-dimensional nonparametricestimation. The first-stageestimates are inputs to
the second stage. The second-stageestimate of, say, ml is obtainedby taking one
Newton step from the first-stageestimate towarda local linear estimate. In large
samples, the second-stage estimator has a structuresimilar to that of a local
linear estimator, so deriving its pointwise rate of convergence and asymptotic
distributionis relativelyeasy. The main results of this papercan also be obtained
by using a local constant estimate in the second stage, and the results of Monte
Carlo experimentsdescribedin Section 5 show that a local constantestimatorhas
better finite-sampleperformanceunder some conditions. However, a local linear
estimatorhas betterboundarybehaviorand better ability to adaptto nonuniform
designs, among otherdesirableproperties[Fanand Gijbels (1996)].
Our approachdiffers from typical two-stage estimation,which aims at estimat-
ing one unknownparameteror function [e.g., Fan andChen (1999)]. In this setting,
a consistent estimatoris obtainedin the first stage and is updatedin the second,
possibly by taking a Newton step towardthe optimumof an appropriateobjective
function. In contrast,in our setting, there are several unknownfunctions but we
updatethe estimatorof only one. It is essential thatthe first-stageestimatorsof the
other functions have negligible bias. The variancesof these estimatorsmust also
convergeto zero but can have relatively slow rates. We show that asymptotically,
the estimationerrorof the otherfunctionsdoes not appearin the updatedestimator
of the functionof interest.
HKM use a two-stage estimationapproachthat is similarto the one used here,
but HKM do not consider models with link functions, and they use backfitting
for the first-stageestimator.Derivationof the propertiesof a backfittingestimator
for a model with a link function appearsto be very complicated.We conjecture
that a classical backfitting estimator would have the same asymptotic variance

ADDITIVEREGRESSIONMODELWITH A LINK FUNCTION 2415
as the one in this paper but a different and, possibly, complicated bias. We
also conjecture that a classical backfittingestimator would not have the oracle
property.Nonetheless, we do not argue here that our procedure outperforms
classical backfitting,in the sense of minimizing an optimality criterion such as
the asymptoticmean-squareerror.However,our procedurehas the advantagesof
a complete asymptoticdistributiontheory and the oracle property.
The remainderof this paper is organized as follows. Section 2 provides an
informaldescriptionof the two-stage estimator.The main results are presentedin
Section 3. Section 4 discusses the selection of bandwidths.Section 5 presentsthe
results of a small simulationstudy,and Section 6 presentsconcluding comments.
The proofs of theorems are in Section 7. Throughoutthe paper,subscriptsindex
observationsand superscriptsdenote componentsof vectors. Thus, Xi is the ith
observationof X, XJ is the jth componentof X, and Xj is the ith observationof
the jth component.
2. Informal description of the estimator. Assume that the supportof X is

X - [-1, ]d, and normalize ml, ..., m so that
m(v)dv=
mj , j=l,...,d.
For any x E Rd define m(x) ml(x ( l) + . + md(xd), where xj is the jth

component of x. Let {Pk k = 1, 2, ...} denote a basis for smooth functions
on [- 1, 1]. A precise definitionof "smooth"andconditionsthatthe basis functions
must satisfy are given in Section 3. These conditionsinclude
(2.1) pk(v)dv =0,
(2.2) {' f=k,

f -(2)pj(v)pk(v)dv=
O
0, otherwise,
and
00
(2.3) mj(xj)= E Ojkpk(xj),

k=l
for each j = 1, ..., d, each xJ E [0, 1] and suitable coefficients {0jk}. For any
positive integerK, define
PK(x)= [1, P (x 1) ... .PK(X ), P1( x2), .. PK(X2), (, I(d), . . . PK(Xd)]
. P1
Then for 0K E IRKd+l, PK(X)'OKis a series approximation to gt + m(x). Section 3

gives conditions that K must satisfy. These requirethat K -- oo at an appropriate
rate as n --> oo.

To obtain the first-stage estimators of the mj's, let {Yi, Xi: i = 1, ..., n be
a random sample of (Y, X). Let OnK be a solution to
n
minimize: Sn,() - n-1 {Yi - F[PK(Xi)0]}2,
~OE~~OK i=1
where KcC RKd+l is a compact parameter set. The series estimator of /z + m(x)
is
, + mh(x)= PK(X)'nK,
where /2 is the first component of OnK. The estimator of mj(xj) for any
j = 1,..., d and any xJ E [0, 1] is the product of [pl(xJ)... , PK(XJ)] with the
appropriate components of OK.
To obtain the second-stage estimator of (say) ml(xl), let Xi denote the ith
observation of X (X2, .. ., Xd). Define m_l (Xi) = m2(X2) + + mhd(X'/),
where XJ is the ith observation of the jth component of X and thj is the series
estimator of mj. Let K be a probability density function on [-1, 1], and define
Kh(v) = K(v/h) for any real, positive constant h. Conditions that K and h must
satisfy are given in Section 3. These include h -+ 0 at an appropriate rate as
n -- oo. Define
n
lm) = --2
Sj (x1, {Yi - F[f + ml (x1) + m_l (Xi)]}
i=l
x F'[ - xl)j Kh(x - XI)

+-i l (xl) + m_- (Xi)](X
for j = 0, 1 and
n
Sn/jl (xl, m) = 2 E F'[/ + ml (xl) +m- l (Xi)]2(Xl - xl)j Kh(xl - XI)

i=l
n
-2 {Yi -F[/f + l C(x1)+- I_ (X)]}
i=1
x F"[ +mrtl (xi) +m-_l(X)](X1 -xl)jKh(x1 - X)
for j = 0, 1, 2. The second-stage estimator of m 1(xl) is
es aS th)oSno (x)
. nd (x) n 1 Sh)
^-sSng2e(Xim ,
)-So si(X)
(2.4) mi (x)l th) () (xl,h) r) - 1 - SIn, (x, '
2
The second-stage estimators of m2(x2),..., md(xd) are obtained similarly. Sec-

tion 3.3 describes a weighted version of this estimator that minimizes the asymp-
totic variance of n2/5 [I 1(x) - m(x 1)]. However, due to interactions between the
weight function and the bias, the weighted estimator does not necessarily minimize
the asymptotic mean-square error.

ADDITIVEREGRESSION
MODELWITHA LINKFUNCTION 2417
The estimator (2.4) can be understood intuitively as follows. If A and mt_ were
the true values of ,t and m_ 1, the local linear estimator of m 1(x1) would minimize
n
Snl (X, bo, bl) = {Yi -F[g + bo + bl(Xl -x1)

i=1
(2.5)
+ m_l(Xi)]}2Kh(xl -XI).
Moreover, S jl(x1, m) = aSnl(x1, bo, bl)/abj (j = 0, 1) evaluated at b= hm(x1)

and bl = 0. S"1 (x , m) gives the second derivatives of S, 1(x1, bo, bl) evaluated
at the same point. The estimator (2.4) is the result of taking one Newton step from
the starting values bo = mh1(x1), bl = 0 toward the minimum of the right-hand side
of (2.5).
Section 3 gives conditions under which mh1(x1) - m 1(x ) = Op(n2/5) and
n2/5 [[m(xl) - ml(xl)] is asymptotically normally distributed for any finite d
when F and the mj's are twice continuously differentiable.
3. Main results. This section has three parts. Section 3.1 states the assump-
tions that are used to prove the main results. Section 3.2 states the results. The
main results are the n-2/5-consistency and asymptotic normality of the mj's. Sec-
tion 3.3 describes the weighted estimator.
The following additional notation is used. For any matrix A, define the norm
IA I = [trace(A'A)]1/2. Define U = Y - F[/A + m(X)], V(x) = Var(UIX = x),
QK = E{F'[/1 + m(X)]2 PK(X)PK (X)'}, and IK = QK E{F'[j + m(X)]2V(X) x
PK(X)PK(X)'}Q -1 whenever the latter quantity exists. QK and IK are d(K) x d(K)
positive semidefinite matrices, where d(K) = Kd + 1. Let XK,min denote the
smallest eigenvalue of QK. Let QK,ij denote the (i, j) element of QK. Define
K= Supxx IIPK(x)ll.Let {0jk} be the coefficients of the series expansion (2.3).
For each K define
OK = (A/, 01 , I,01DK, 021 . . . 2K, ... , Odl, OdK)'.
3.1. Assumptions. The main results are obtained under the following assump-
tions.
ASSUMPTION Al. The data, {(Yi,Xi):i = l,...,n}, are an i.i.d. random

sample from the distribution of (Y, X), and E(Y X = x) = F[,/ + m(x)] for almost
every x e X _ [-1, 1]d
ASSUMPTIONA2. (i) The support of X is X.

(ii) The distribution of X is absolutely continuous with respect to Lebesgue
measure.
(iii) The probability density function of X is bounded, bounded away from zero
and twice continuously differentiable on X.

(iv) There are constants cv > 0 and Cv < oo such that cv < Var(UIX =
x) < Cv for all x E X.
(v) Thereis a constantCu < oo such thatEIUI/ < C 2j!E(U2) < oo for all
j>2.
ASSUMPTION
A3. (i) There is a constant Cm < oc such that Jmj(v)I < Cm
for each j = 1,...,d and all v E [-1, 1].
(ii) Each functionmj is twice continuouslydifferentiableon [-1, 1].
(iii) There are constants CF1 < oo, CF2 > 0, and CF2 < oo such that
F(v) < CF1 and CF2 < F'(v) < CF2 for all v E [ut - Cmd, ,L + Cmd].
(iv) F is twice continuously differentiable on [u/ - Cmd, /t + Cmd].
(v) There is a constant CF3 < oo such that IF"(v2) - F"(vl)I < CF3 I2 - l I
for all v2, vl E [/t - Cmd, iu + Cmd].
ASSUMPTION A4. (i) There are constants CQ < oo and cx > 0 such that
IQK,ij[ < CQ and K,min> C) for all K and all i, j = 1,..., d().
(ii) The largest eigenvalue of "K is bounded for all K.
ASSUMPTIONA5. (i) The functions {Pk} satisfy (2.1) and (2.2).

(ii) There is a constant cK > 0 such that (K > CK for all sufficiently large K.
(iii) ,K= O(K1/2) as K -> oo.
(iv) There are a constant Co < oo and vectors OKO E 0, - [-Co, Co]d(K) such
that supx Exl[ + m(x) - PK(X)'OKl = O(K -2) as K -> X.
(v) For each K, OK is an interior point of )K.
ASSUMPTION A6. (i) K = CKn4/15+v for some constant CK satisfying

0 < CK < oo and some v satisfying 0 < v < 1/30.
(ii) h = Chn-1/5 for some constant Ch satisfying 0 < Ch < oo.
ASSUMPTIONA7. The function K is a bounded, continuous probability

density functionon [-1, 1] and is symmetricabout0.
The assumptionthatthe supportof X is [-1, 1]d entails no loss of generalityas

it can always be satisfiedby carryingout monotone increasingtransformationsof
the componentsof X, even if theirsupportbeforetransformationis unbounded.For
practicalcomputations,it suffices to transformthe empiricalsupportto [-1, 1]d.
Assumption A2 precludes the possibility of treating discrete covariates with
our method, though they can be handled inelegantly by conditioning on them.
Another possibility is to develop a version of our estimator for a partially
linear generalized additive model in which discrete covariates are included in
the parametric(linear) term. However, this extension is beyond the scope of the
present paper.Differentiabilityof the density of X [AssumptionA2(iii)] is used

to insure that the bias of our estimator converges to zero sufficiently rapidly.
AssumptionA2(v) restrictsthe thicknessof the tails of the distributionof U and is
used to prove consistency of the first-stageestimator.AssumptionA3 defines the
sense in which F and the mj's must be smooth. Assumption A3(iii) is needed
for identification.Assumption A4 insures the existence and nonsingularityof
the covariancematrix of the asymptoticform of the first-stageestimator.This is
analogousto assumingthatthe informationmatrixis positive definitein parametric
maximum likelihood estimation. Assumption A4(i) implies Assumption A4(ii)
if U is homoskedastic. Assumptions A5(iii) and A5(iv) bound the magnitudes
of the basis functions and insure that the errorsin the series approximationsto
the m j's converge to zero sufficiently rapidly as K -- oo. These assumptions are
satisfied by spline and (for periodic functions) Fourier bases. Assumption A6
states the rates at which K -- oo and h -- 0 as n -> oo. The assumed rate of
convergenceof h is well known to be asymptoticallyoptimalfor one-dimensional
kernelmean-regressionwhen the conditionalmean functionis twice continuously
differentiable.The requiredratefor K insuresthatthe asymptoticbias andvariance
of the first-stage estimator are sufficiently small to achieve an n-2/5 rate of
convergencein the second stage. The L2 rateof convergenceof a series estimator
of mj is maximizedby settingK ocn1/5, which is slowerthanthe ratespermittedby
AssumptionA6(i) [Newey (1997)]. Thus,AssumptionA6(i) requiresthe first-stage
estimatorto be undersmoothed.Undersmoothingis needed to insure sufficiently
rapid convergenceof the bias of the first-stageestimator.We show that the first-
orderperformanceof our second-stage estimatordoes not depend on the choice
of K if AssumptionA6(i) is satisfied.See Theorems2 and 3. Optimizingthe choice
of K would require a rathercomplicated higher-ordertheory and is beyond the
scope of this paper, which is restricted to first-order asymptotics.
3.2. Theorems. This section states two theorems that give the main results
of the paper. Theorem 1 gives the asymptotic behavior of the first-stage series
estimatorunder Assumptions A1-A6(i). Theorem 2 gives the propertiesof the
second-stage estimator. For i = 1, ..., n, define Ui = Yi - F[Az + m(Xi)] and
bKo(x) = z + m(x) - P (x)'OKo.Let II v 11denote the Euclidean norm of any finite-
dimensionalvector v.
THEOREM1. Let Assumptions A1-A6(i) hold. Then:
(a) limn -oo 0lnK - K1 = 0 almost surely,

KO
(b) 0nK - K = Op(K/2/n1/2 + K-2), and
(c) suPxcx rIm(x)-m(x) = Op(K/n1/2 + K-3/2).
In addition:
(d) 0nK -
OKO = nQK i=l F'[L +m(Xi)]PK(Xi)Ui + n 1QK1 x
n=l F'[/ + m(Xi)]2PK(Xi)bK(Xi) R = Op(3/2/n
+ Rn, where IIR + n-1/2).

Now let fx denote the probabilitydensity functionof X. For j = 0, 1, define

n
S l (xl, m) =-2 - F[u + ml (xl) + m_l (Xi)]}
j{Yi
i=1
x F'[z + ml (x1) + m- (.Xi)](X -x)j Kh(x1 - X1).

Also define
=
Do( F'[+ml(x)=2
F[ mxl) +m_l(.c)]2 fx(xl,)d/x,
+ + ml(X)]2[Ofx(x1 ~x)/Oxl]d.,
DI(X1) = 2 F'[, ml(xl)
AK = v2K(v)dv,
J-
BK= K(v)2dv,
1
g(xl,x) = F"[,L +ml(xl) +m_ l(x)]m (xl)

+ F'[,L +ml(xl)+ m_l(x)]ml'(xl),
1(xl) = 2C2AKDo(x )-1
x +ml(xl) +m_l(x)]fx(xl,x)dx
fg(xl,x)F'[l
and
Vl(x1) = BKCh-D(x1)-2
x x)F'[+m(xl)+mjl( ) ]2fx(xIx dx.

fvar(Ulx
The next theoremgives the asymptoticpropertiesof the second-stageestimator.
THEOREM2. Let Assumptions A1-A6 hold. Then:
(a) m(x 1) - m (x1) = [nhDo (x 1] -S01(, m) + [D1 (x)/Do(x1)] x

-
S'n (xl1m)} + op(n-2/5) uniformly over Ix11 < I h and ml(x1) - ml(x1) =
Op[(logn)1/2n-2/5] uniformly over Ix'l < 1.
(b) n2/5[ml
(x )- ml(xl)] N[BPl(x l), Vl(xl)].
(c) If j 1, then n2/5[l((x1) - ml(xl)] and n2/5[mj(x) - are
mj(xj)]
asymptoticallyindependentlynormallydistributed.
Theorem2(a) implies thatasymptotically,n2/5[m1(x) - m (x 1)] is not affected

by random sampling errorsin the first-stageestimator.In fact, the second-stage

estimatorof m l(x1) has the same asymptotic distributionthat it would have if

m2,..., md were known and local-linear estimation were used to estimate m 1(x1)
directly. In this sense, our estimatorhas an oracle property.Parts (b) and (c) of
Theorem 2 imply that the estimators of m l(xl),..., md(xd) are asymptotically
independentlydistributed.
It is also possible to use a local-constantestimator in the second stage. The
resultingsecond-stageestimatoris
- m).
m,LLC(Xl) Sl(xl)-S'0o(x,l)/SnO1(x
The following modificationof Theorem2, which we state withoutproof, gives the

asymptoticpropertiesof the local-constantsecond-stageestimator.Define
gLC(x, X) = (a2/a2){ F[ml ( + x1) + m-i(x)]

- F[ml(x) + m-l(x)]}fx(( + x, x)lr=o
and
Pl,Lc(xl) = 2Ch2AKDo(xl)-1
x fgLc(x +
x)F'[tm l(x l)+m- (X)]fx(x x)dx.
THEOREM3. Let Assumptions A1-A6 hold. Then
(a) mrl,Lc(xl)-ml(xl)=-[nhDo(xl)]-lS'ol(x, m)+op(n-2/5)uniformly

over Ixl < 1 - h and mh (xl) - ml(x) = Op [(logn)1/2n-2/5] uniformly over
I 1.
xlJ
-)-ml 1)
(b) n2/5[l,LC( (xl)] N[P1,Lc(xl), V1(x1)].
-
(c) If j - 1, then n2/5h1,LC( X1) ml(xl)] and n2/ [mhj,Lc(xj) - mj(x)]
are asymptotically independently normally distributed.
Vl(x1) and Bl(x1) and fi,LC(X1) can be estimated consistently by replacing

unknown population parameterswith consistent estimators. Section 4 gives a
method for estimating the derivatives of m that are in the expressions for
Bi(x1) and /B1,LC(xl). As is usual in nonparametricestimation, reasonably
precise bias estimation is possible only by making assumptions that amount
to undersmoothing. One way of doing this is to assume that the second
derivative of ml satisfies a Lipschitz condition. Alternatively, one can set
h = Chn-Y for 1/5 < y < 1. Then n(- )/2[mi1(x1) - ml(xl)] N[0, V (xl)],
and n( -y)I/2[C1LC(X1)- m(x )] N[O, V(x1)].

3.3. A weighted estimator. A weighted estimator can be obtained by replacing

S'j (x1, m) and S 'j(x1, m) in (2.5) with
n
S'ij(x1, m, w) =-2 E w(xl, Xi){Yi - F[it +- ml (x) + m_ (Xi)]}

i=1
x F'[/ + m (xl) + _l (Xi)](X1 - xl)j Kh(xl - X1)

and
S1'j1(x1,m, w)
n
=2 w(x,X i)F'[, + mlh(xl) + m_l (Xi)]2(Xl-xl)j Kh(x -X)
i=l
n
-2 w(x1l, Xi){Yi -F[ + ml(x l)+ m-l(Xii)]}
i=l
x F"[fL + ml (x1) + m_l(Xi)(X- xl)j Kh(xl- Xi)

for j = 0, 1, 2, where w is a nonnegative weight function that is assumed for the
moment to be nonstochastic. It is convenient to normalize w so that
fw(x x)F'[ +ml(xl)+m l(X)]2fx(xl x)dx =
for each x1 E [-1, 1]. Arguments identical to those used to prove Theorem 2 show
that the variance of the asymptotic distribution of the resulting local-linear or local-
constant estimator of m i (x1) is
V1(x1, w) =0.25BKCh1 fw (x1,C)2Var(U Ixl )
x F'[1 + ml (xl)+ m_ (x)]2fx(xl x) dx.

It follows from Lemma 1 of Fan, Hardle and Mammen (1998) that V(xl, w) is
minimized by setting w(x1, x)2 o 1/ Var(U Ix1, x), thereby yielding
Vl(x ,w)= O.25BKhl() -1 F'[ + ml (xl) + m( ) fx(x x)dx,
where
D2(xl)= fVar(Ulxl,i)-lF'[ +ml(xl) +m_l(x)]2fx(x ) d.

In an application, it suffices to replace the variance-minimizing weight function
with a consistent estimator. For example, F'[JL + ml(xl) + m_l(x)] can be
estimated from the first estimation stage, Var(Ulxl, ) can be estimated by
applying a nonparametric regression to the squared residuals of the first-stage
estimate and kernel methods can be used to estimate fx (x , x).

The minimum-varianceestimator is not a minimum asymptotic mean-square

error estimator unless undersmoothingis used to remove the asymptotic bias
of 1. This is because weighting affects the bias when the latteris nonnegligible.
The weight function that minimizes the asymptotic mean-squareerror is the
solution to an integral equation and does not have a closed-form analytic
representation.
4. Bandwidth selection. This section presentsa plug-in and a penalizedleast

squares(PLS) methodfor choosing h in applications.We begin with a description
of the plug-in method. This method estimates the value of h that minimizes the
asymptoticintegratedmean-squareerror(AIMSE) of n2/5[m (x) - ml(xl)] for
j = 1, ... , d. We discuss only local-linear estimation, but similar results hold for
local-constant estimation. The AIMSE of n2/5 (MIm - m) is defined as
AIMSEl =n4S w(x )[I(x ) + V1(x1)] dx

-1
where w(.) is a nonnegativeweight function that integratesto 1. We also define
the integratedsquarederror(ISE) as
ISE -n45 / w(x')[ml(xl) -ml(x)]2 dx.

-1
We define the asymptoticallyoptimal bandwidthfor estimatingm as Chln-1/5,

where Ch minimizes AIMSE1.Let
1(X1) = Bl(X )/ C2 and V1(x l) = ChVI(xl).
Then
-1- Cf1 ww(xl)V1(x1)dx -1/5
(4.1) Chl -
4 fl1 w(xl),(xl)2 dx1_
The results for the plug-in method rely on the following two theorems.Theo-
rem4 shows thatthe differencebetweenthe ISE andAIMSEis asymptoticallyneg-
ligible. Theorem5 gives a method for estimatingthe first and second derivatives
of mj. Let G() denote the eth derivativeof any e-times differentiablefunction G.
THEOREM 4. Let Assumptions A1-A6 hold. Then for a continuous weight

function w(.) and as n - oo, AIMSE = ISE1 +op(1).
THEOREM5. Let Assumptions A1-A6 hold. Let L be a twice differentiable

probability density function on [-1, 1], and let gn : n = 1, 2,.... be a sequence
of strictly positive real numbers satisfying gn - 0 and g2n4/5 (logn)- - oo as
n -- oo. For = 1,2 define
m^x1) = g- L()[(x1 -v)/gn]ml()dv.

1 -n

Then as n -- oo andfor t = 1, 2,
sup Imth)(xl)--m(e)(x') =Op(1)

Ixll<l
A plug-in estimator of Chl can now be obtained by replacing unknown
population quantities on the right-hand side of (4.1) with consistent estimators.
Theorem 5 provides consistent estimators of the required derivatives of mi.
Estimators of the conditional variance of U and of fx can be obtained by using
standard kernel methods.
We now describe the PLS method. This method simultaneously estimates the
bandwidths for second-stage estimation of all of the functions mi (j = 1, ..., d).
Let hj = Chjn-1/5 be the bandwidth for rhj. Then the PLS method selects
the Chj's that minimize an estimate of the average squared error (ASE),
n
ASE(h) = n-1 {F[l + mr(Xi)] - F[/z + m(Xi)]}2
i=l
where h = (Chln-1/5,... Chdn-l/5). Specifically, the PLS method selects
the Chj's to
n
minimize: PLS (h) = n-1 {Yi - F [ji + (Xi)]}2
Chl,...,Chd
n
(4.2) + 2K(O)n-1 I{F'[L + (Xi)2V(Xi)}
i=1
d
x E[n45Ch Dj(X)]-1,
j=1
where the Chj's are restricted to a compact, positive interval that excludes 0,
jx^ 1 rhKj(X/ -x)F'[i+-

Dj(xJ) = Khj (X
xi)F'[,1 + r(X/)]2
(Xi
J 1=1
and
--1
V(x)= LKhl(XI -x) ... Khd(XI-xd)
-i=l
n
XE Khl (X - xl).. Khd(Xi - xd){yi - F
F[ + m(Xi)]}2.
i=1
The bandwidths used for V may be different from those used for m because V is
a full-dimensional nonparametric estimator. We now argue that the difference
n
n-1 U2 + ASE(h) -PLS(h)
i=l

is asymptotically negligible and, therefore, that the solution to (4.2) estimates the
bandwidths that minimize ASE. A proof of this result only requires additional
smoothness conditions on F and more restrictive assumptions on K. The proof can
be carried out by making arguments similar to those used in the proof of Theorem 2
but with a higher-order stochastic expansion for h - m. Here, we provide only
a heuristic outline. For this purpose, note that
n
n- l Ui2 + ASE(h) - PLS(i)
i=1
n
= 2n-1 E{F[r/ + m (Xi)] - F[/z + m(Xi)]}Ui
i=l
n d
- 2K(O)n-1 F'[/u + m(Xi)]2V(Xi) [n4/ Chj D(Xi)]-
i=l j=1
We now approximate F[? + m (Xi)] - F[i/ + m(Xi)] by a linear expansion in

m - m and replace m - m with the stochastic approximation of Theorem 2(a).
(A rigorous argument would require a higher-order expansion of mh- m.) Thus,
F[ui + m (Xi)] - F[/-L + m(Xi)] is approximated by a linear form in Ui. Dropping
higher-order terms leads to an approximation of
n
22
{F[A + m (Xi)] - F[Iu + m(Xi)]}Ui
i=1
that is a U statistic in Ui. The off-diagonal terms of the U statistic can be shown
to be of higher order and, therefore, asymptotically negligible. Thus, we get
- J{F[r + m(Xi)] - F[Iz + m(Xi)]}Ui

i=l
n d
2
, -
F'[t m + (Xi )]2 Var(Ui Xi) n4/SC hDoj(Xi ),
i=l j=1
where
Doj(xj) = 2E{F'[Lu + m(Xi)]2 Xi = xJ}fxj(xi)
and fxj is the probability density function of XJ. Now by standard kernel smooth-
ing arguments, Doj (x) Dj(xj). In addition, it is clear that V(Xi) V(Ui IXi),
which establishes the desired result.
5. Monte Carlo experiments. This section presents the results of a small set
of Monte Carlo experiments that compare the finite-sample performances of the
two-stage estimator, the estimator of LH and the infeasible oracle estimator in

which all additivecomponentsbut one are known. The oracle estimatorcannotbe

used in applicationsbutprovidesa benchmarkagainstwhich ourfeasible estimator
can be compared.The infeasible oracle estimatorwas calculatedby solving (2.5).
Experimentswere carriedout with d = 2 andd = 5. The samplesize is n = 500.
The experimentswith d = 2 consist of estimating fl and f2 in the binary logit
model
P(Y = IX =x) = L[fi(x) + f2(x2)],
where L is the cumulativelogistic distributionfunction
L(v) = ev/[1 + ev], -oo < v < oo.
The experimentswith d = 5 consist of estimating fl and f2 in the binary logit
model
5
P(Y=1IX=x) =L fi(x')+f2(x2)+Exj
j=3
In all of the experiments,
fl(x) = sin(rx) and f2(x) = c(3x),
where c is the standardnormal distributionfunction. The components of X
are independentlydistributedas U[-1, 1]. Estimation is carried out under the
assumption that the additive components have two (but not necessarily more)
continuousderivatives.Underthis assumption,the two-stageestimatorhas the rate
of convergencen-2/5. The LH estimatorhas this rate of convergenceif d = 2 but
not if d = 5.
B-splines were used for the first stage of the two-stage estimator.The kernel
used for the second stage and for the LH estimatoris
K(v) = 15( - v2)2I(Ivl < 1).
Experimentswere carriedout using both local-constantand local-linearestimators
in the second stage of the two-stage method. There were 1000 Monte Carlo
replicationsper experimentwith the two-stage estimatorbut only 500 replications
with the LH estimatorbecause of the very long computing times it entails. The
experimentswere carriedout in GAUSS using GAUSS randomnumbergenerators.
The results of the experiments are summarized in Table 1, which shows
the empirical integratedmean-squareerrors (EIMSEs) of the estimators at the
values of the tuning parametersthat minimize the EIMSEs. Lengthy computing
times precluded using data-based methods for selecting tuning parametersin
the experiments. The EIMSEs of the local-constant and local-linear two-stage
estimates of fi are considerablysmaller than the EIMSEs of the LH estimator.
The EIMSEs of the local-constant and LH estimators of f2 are approximately
equal whereas the local-linearestimatorof f2 has a largerEIMSE. There is little
differencebetween the EIMSEs of the two-stage local-linearand infeasible oracle

TABLE 1
Results of Monte Carlo experiments*
Empirical IMSE
Estimator K1l K2 h h2 f f
d=2
LH 0.9 0.9 0.116 0.015
Two-stage with 2 2 0.4 0.9 0.052 0.015
local-constant
smoothing
Two-stagewith 4 2 0.5 1.4 0.052 0.023
local-linear
smoothing
Infeasibleoracle 0.6 1.7 0.056 0.021
estimator
d=5
LH 1.0 1.0 0.145 0.019
Two-stagewith 2 2 0.4 0.9 0.060 0.018
local-constant
smoothing
Two-stagewith 2 2 0.6 1.3 0.057 0.029
local-linear
smoothing
Infeasibleoracle 0.6 2.0 0.057 0.023
estimator
*In the two-stage estimator,Kj and hj (j = 1, 2) are the series length and bandwidth
used to estimate fj. In the LH estimator, hj (j = 1, 2) is the bandwidthused to
estimate fj. The values of K1, K2, h and h2 minimize the IMSEs of the estimates.
estimators. This result is consistent with the oracle property of the two-stage
estimator.
6. Conclusions. This paper has described an estimator of the additive

components of a nonparametric additive model with a known link function.
The approach is very general and may be applicable to a wide variety of other
models. The estimator is asymptotically normally distributed and has a pointwise
rate of convergence in probability of n-2/5 when the unknown functions are
twice continuously differentiable, regardless of the dimension of the explanatory
variable X. In contrast, achieving the rate of convergence n-2/5 with the only
other currently available estimator for this model requires the additive components
to have an increasing number of derivatives as the dimension of X increases.
In addition, the new estimator has an oracle property: the asymptotic distribution
of the estimator of each additive component is the same as it would be if the other
components were known.

7. Proofs of theorems. Assumptions A 1-A7 hold throughout this section.
7.1. Theorem 1. This section begins with lemmas that are used to prove
Theorem 1.
LEMMA 1. There are constants a > 0 and C < oo such that
P sup ISnK(0)-ESnk(0)I > < Cexp(-na82)

-_0EK
for any sufficiently small s > 0 and all sufficiently large n.
PROOF. Write
n
SnK(O) = n-1 Yi - 2SnKl(0) + SnK2(0),
i=1
where
n
SnKl (0) =n-1 Yi F[PK(Xi)'0]
i=1
and
n
SnK2(0) = n-1 F[PK(Xi)O]2.
i=1
It suffices to prove that
P sup ISnKj(0)- ESnkjW()I > < Cexp(-na2) (j = 1,2)

-OEOK
for any s > 0, some C < oo and all sufficiently large n. The proof is given only
for j = 1. Similar arguments apply when j = 2.
Define SnK (0) = SnlK(0) - ESnKl (0). Divide OK into hypercubes of edge-
length f. Let 1),..., ?(M) denote the M = (2Co/f)d(K) cubes thus created.
Let OKj be the point at the center of eK(). The maximum distance between OKj
and any other point in O) is r = d()1/2/2 , and M = exp{d(K)[log(Co/r) +
(1/2) log d(K)]}. Now
SUp ISnK1(0)I >

-OE--1 -
C U sup ISnKl(0)I
(j)
>?
-8~0G j=1 e_oEK
Therefore,
M -
Pn P sup ISn,K(0)l > E < P sup ISnKl(0)l >?

-OK - j= - O(j)
-

MODELWITHA LINKFUNCTION
ADDITIVEREGRESSION 2429
Now for 0 E (j)

),
ISnKl(0) < ISnKl(OKj)l + ISnKl(0) - SnKl (0j)
< \SnK (OKj)I +CF2rKr (n- E Yi +CF1)
for all sufficiently large K and, therefore, n. Therefore, for all sufficiently large n,
P sup ISnKl(0)l>8
C/2 >
> /2] + P CF2ZKr n- .
< P[ISnil(0Kj)l Yl +CF1
Choose r = -2. Then s/2 - CF2'Kr[CF1 + E(IYI)] > s/4 for all sufficiently
large K. Moreover,
n
P CF2iKr n-1E IYII+ CF1 > s/2
n
< P CF2'Krn-I
)-(IY/I- EIYI) > s/4
< 2exp(-al n2r2)

for some constant al > 0 and all sufficiently large K by Bernstein's inequality
[Bosq (1998), page 22]. Also by Berstein's inequality, there is a constant a2 > 0
such that
P[ISnKl(OKj)I > 8/2] < 2exp(-a2ns2)

for all n, K and j. Therefore,
Pn < 2[M exp(-a2ns2) + exp(-aln82)]
< 2exp{-a2nE2(2 + 2dCKnY[log(Co/r) + - log(2CKd) + y logn]}
+ 2exp(-ainE2),
where y = 4/15 + v. It follows that Pn < 4exp(-an?2) for a suitable a > 0 and
all sufficiently large n. D
Define
SK(O)=E[SnK ()]
and
OK= arg min SK().
OE?K

LEMMA 2. For any tr > 0, SK(OnK) - SK(OK) < i almost surely for all
sufficiently large n.
PROOF. For each K, let KAcC IRd(K)be an open set containing OK.Let eK
denote the complement of WK in OK. Define TK = NK,n OK. Then TK C Rd(K) is
compact.Define
= min SK(0) - SK(K).
Let An be the event ISnK(0) - SK(0)I < r/2 for all 0 E OK. Then
An = SK(OnK)< SnK(OK)
+ r/2
and
An = SnK,() < SK,(K) + /2.

But SnK(OnK)< Sn, (K) by definition, so
An =: S (On) < SnK(OK)+ r1/2.
Therefore,
An S( ((K) ) < ( SK
SK =+
+ K(OnK) - SK (OK) < .
So An => OnKE J K. Since VK is arbitrary, the result follows from Lemma 1 and
Theorem 1.3.4 of Serfling [(1980), page 10]. D
Define
bk(x) = f+ + m(x) - PK(x)'OK

and
SKo(0) = E{Y - F[PK(X)' + bK (X)]}2.
Then
0K = arg min SK0o(0).
0 CK
LEMMA 3. For any r>> 0, SKo(OK)- SKo(OKO)

< rfor all sufficiently large n.
PROOF. Observe that ISK(0) - SKo(0) - 0 as n -- oo uniformly over 0 E OK

because bK () --- 0 for almost every x E X. For each K, let CX C IRd(K) be an open
set containing OKO.Define TK= xK(nO,. Then TK C Id(K) is compact. Define
t7 min SKo(0) - SKo(Ko).

O6T7

By choosing a sufficiently small cAK, j can be made arbitrarily small. Choose n

and, therefore, K large enough that ISK(0) - SKo(O)l < r7/2 for all 0 E @. Now
proceed as in the proof of Lemma 2. D
Define ZKi= F'[-t + m(Xi)]PK(Xi) QKi=1 ZK

n 1E Zi . Then Q =
ZKi.ZKIT and QK =
EQK. Let Zki [k = 1,..., d(K)] denote the kth component of ZKi. Let ZK denote
the n x d(K) matrix whose (i, k) element is Zk.
LEMMA 4. QK - QK 12 = Op(K2/n).
PROOF. Let Qij denote the (i, j) element of QK. Then

d(K) d(K) n 2
- -Z
El0lK QK 12 = E E .KiZK Qkj
k=l j=l i=l
d(K) d(K) n n
- E En-2 -
E E En , , kiZiKi zkz
ZK KE K Q2j)
Qkj
k=l j=l i= e=1
d(K) d(K) n d(K) d(K)
-=E En E2L(Zki)2 (zJi)2 E E

k=l j=l i=1 k=l j=l
-d(K) d(K)
<n- E E(Z i)2 (Z/i)2 = O(K/n).
-k=l j=l _1
The lemma now follows from Markov's inequality. D
Define Yn = I(XK,min> c,/2), where I is the indicator function. Let U =

(Ui,..., Un)'.
LEMMA5. YnlQK 1ZU/nl = Op(K1/2/n1/2) as n -oo.
PROOF. For any x e X,
n-2E(yni Q / Z,U 1 21X =x) = n -2nE(U'ZK k ZKUIX =x)
= n-2E[Trace(ZK QK ZKUU)iX =x]
n-2 ynCv Trace(Q- ZKZK)
n-lCvynd(K) < CK/n
for some constant C < oo. Therefore, YnllQK/2Z'U/nll = Op(K/ /n1/2) by

Markov's inequality. Now
Ynl -= ZU/n 2(Z U/n)]12.

l 2 (U
Yn ZK /n) QK Q

~
Define -1/2
Define Q,
Z'UQK Z / n. Let Ij,1 ?7d) andq,1,
..., .q. 1, denote the eigenval-
ues and eigenvectorso f Ql . Let 1lmax= max(ij1,..., id(,)). The spectraldecom-
positiono f QK
YVOCIVI
Q- V
Q-'
gives QK = dI(K)
f=1 rce~
r/qtq', so
d(K)
YnII
Q_ Z/I Unl
Yn11 j/ 112= nY tq'
/IC~~~=
d(K)
( Yn 7max : L 'qe q' < Yn

7maxl = Op(K/n).
?= 1
Define
Bn= QKin- 1 F'[g + m(Xi)]ZKibKO(Xi).

i-i
LEMMA 6. IIBnII= O(K 2) with probability approaching 1 as n -- oX.
PROOF. Let ~ be the n x 1 vector whose ith component is F'[ft +

m(Xi)lbKo(Xi). Then B, = QK1Z 4/n, and ynJIB,112 , -2 -2Zt
Therefore, by the same arguments used to prove Lemma 5, y,nIB,112 <
Cn- y,n'' = ynO(K 4). The lemma follows from the fact that P(yn = 1) -- 1
as n - o. D
PROOFOF THEOREM1. To prove part (a), write
SKo(OnK)- SKo(OK) = LSKO(OnK)- SK(OnK)]? [SK(OnK)- SK (OK)]

(7.1)
+ [SK(OK)- SKo(K)] ? [SKo(K)- SKo(OK)].
Given any q > 0, it follows from Lemmas2 and 3 and uniformconvergenceof SK

to SKO that each term on the right-hand side of (7.1) is less than ij/4 almost surely
for all sufficiently large n. Therefore SKO(OnK) - SKO(OK) < i, almost surely for
all sufficiently large n. It follows that - OK I -> 0 almost surely as n -- o0
---K
because OK uniquely minimizes SK. Part (a) follows because uniqueness of the
series representation of each function mj implies that 11OK- O0K 0 as n -- cxD. -
To prove the remaining parts of the theorem, observe that 0nK satisfies the first-
order condition aSfK(OnK) /a0 = 0 almost surely for all sufficiently large n. Define
Mi = JL + m(Xi) and A Mi = PK(Xi)'OnK - Mi = PK(Xi)'(OnK - OKo) - bKo(Xi).
Then a Taylor series expansion yields
n n
n
1LZKI(Ui -(Q^K?R)(OnK-OKo)?+ n LF'(Mi)ZKibo(Xi)+Rn2 = 0,
i 1 i=i

almost surely for all sufficientlylarge n. Rn is definedby

n
Rnl = n-1 {-Ui F"(M i)- U[F ( i)- F"(Mi)]
i=1
- [3F"((Mi)F'(Mi) ?
+ F"(Mi)F"(Mli)AMi
+ ?
F"(1Mi)F'F(Mii ) (Mi) 2]AMi
- [F"(Mi)F'(Mi)- F"(AMi)F'(AMi)
+ F"(Mi)F"(Mi)bKo(Xi)]bKo(Xi)}
x PK(Xi)PK(Xi)',
where Mi and Mi are points between PK(Xi)'OnK and Mi. Rn2 is defined by
n
Rn2 = -n-1 {UiF"(Mi) + Ui[F"(Mi) - F"(Mi)]

i=l
+ [F" (Mi)F (Mi) - 1F"(Mi)F'(Mi)]bKo(Xi)
-F" (3Mi)F"(M,ri)bKo(X
P(Xi)2}P(X)bK(Xi)
Now let ~ denote either QK-Z'U/n or Q [n-1 En F'(Mi)2PK(Xi) x

bKo(Xi) + Rn2]. Note that
~~~~~~~~n 2
'
n-1 UiF"(Mi)PK(Xi)PK(Xi) = Op(K2/n).
i=l
Then
Ynl[(QK + Rnl)-1- QK-]QK112
= YnIl(QK + Rnl)-lRnlill2
= Trace{y,n['Rnl(QK + Rnl) 2Rnl4]}
= Op(ll''Rnl 112)
= Op + [P(x)'(nK - ,o0)]2 dx + sup lbo(x)|l2}

(')Op{K2/n xeX
=Op(~'l)Op(K2/n + KIlK -OKI + K-3).
Setting =- -1Z'U /n and applying Lemma 5 yields [(QK + Rn)-1 - QK ] x

Z'Ul/n 12 = Op[K3/n + (K2/n)llI0K - 0o112 + l/(nK2)]. If =- QK[n-1 x

2434 J. L. HOROWITZ AND E. MAMMEN
1=lF'(Mi)2 P(Xi)bKo(Xi) + Rn2], then applying Lemma 6 and using the result
IIQ,-Rn2 I =Op(K-2) yields
- -- 2
n-
[(QK + R1n)-1 - Q ] n-1 F'(Mi)ZKibKO(Xi) + Rn2
- i =1
= + 1/K5).
Op(lIOnK- OKO12/K
It follows from these resultsthat
n
OnK- OKO= n-l Q 1 F'[,U + m(Xi)]PK(Xi)Ui
i=1
n
+- n-QK E Fl[L+ + m (Xi)]P i)bK(Xi) + Rn,
i=l
where IIRnII= Op(K3/2/n + n-1/2). Part (d) of the theorem now follows from
Lemma4. Part(b) follows by applyingLemmas5 and6 to part(d). Part(c) follows
from part(b) and AssumptionA5(iii). D
7.2. Theorem 2. This section begins with lemmas that are used to prove
Theorem 2. For any = (x2,...,xd) E [-1, 1]d-1, set m_l(x) = m2(x2) +
? + md(xd), and bKo(x) = u + m-1() - PK(x)OKo,where
PK(x) = [1, 0,..., , p1(x2) ..., P (2)

.... , (Xd), ... PK(X)]
and
= (Aq, 21...
,KO 0 029 0,\.. 02 , O,
... ... OdK)
In other words, P and 0KOare obtained by replacing pl (x), ..., PK(xd) with zeros
in PK and 011, ...,1K with zeros in 0KO.Also define
n
ni() = n-I PK()Y Q , F'[/t + m(Xj)]PK(Xj)Uj
j=1
and
n
n2(X) = n-PK(x) QK F'[u + m(Xj)]2PK(Xj)b,o(Xj).
j=1
Forxl E [-1, 1] and for j =0, 1 define

n
= (nh)-1/2 F'[ + m (x ) + m_1 (.i)]2(X] - x l)j
Hnjl(xl) ~
i=l
x Kh(x --X])(nl(Xi),

n
Hnj2(xl) =(nh)-l/2 F'[yI + ml(xl)+m-l(xi)]2(Xi -x )j

i=l
x Kh(x - Xi)8n2(Xi)
and
n
Hnj3(xl) = -(nh)-1/2 yF'[iL + mml(xl) + m_(Xi)]2(Xl -x)j

i=l
x Kh(X -
Xi)bKO(Xi).
Let V(x) = Var(UIX = x).
LEMMA 7. For j = 0, 1 and k = 1,2,3, Hnjk(xl) = Op(1) as n - oo

uniformly over x1 [-1, 1].
PROOF. The proof is given only for j = 0. Similar arguments apply for j = 1.
First consider Hnoi (x1). We can write
n
Hnol(xl) =) aj (xl)Uj,
j=1
where
n
aj(xl) n-3/2h-1/2 F'[tL + ml(x1) + m_(i)]2
i=l
x Kh(x - X) PK(Xi) Q F[ + m(Xj)]P(Xj)

n
=n-3/2h- 1/2 Kh(x - X1)Ai(xl
i=l
Define
a(xl)= F[l + ml(xl) + m-(x)]2(x)fx( x)dx
Then arguments similar to those used to prove Lemma 1 show that aj(x1) =
(h/n)l/2[c (x ) + rnl] QK F'[tu + m(Xj)]PK(Xj), where rn is uncorrelated with
the Uj's and lrnll = O[(logn)/(nh)1/2] uniformly over x E [-1, 1] almost
surely. Moreover, for each x1 E [-1, 1], the components of a (x) are the Fourier
coefficients of a function that is bounded uniformly over x1. Therefore,
(7.2) sup a(x)'a(x ) < M

Ixll<1

for some finite constant M and all K = 1,..., oo. It follows from (7.2) and the
Cauchy-Schwarz inequality that
n 2
a(x1)'(h/n)1/2 X UjQ 1F'[t + m(Xj)]PK(Xj)

j=l
n 2
<M (h/n)1/2 UUj QKF'[F +m(Xj)]PK(Xj)
j=l
But
n 2
E (h/n)'/2 Uj Q-F'[L+ m(Xj)]PK(Xj) = (h),
j=l
so it follows from Markov's inequality that
n
a(x)'(h/n) 1/2 Uj Q- F'[t +m(Xj)]PK(Xj) = Op(h 12)
j=l
uniformly over xl E [-1, 1]. This and lrnll= O[(logn)/(nh)1/2] establish the
conclusion of the lemma for j = 0, k = 1.
We now prove the lemma for j = 0, k = 2. We can write
n
Hno2(x) = (nh)-1/2 F'[I + ml (x) + m_ (Xi)]2Kh(x -X)P(Xi)'Bn,
i=l
where
n
Bn = n-Q1 1 F'[/1 + m(Xj)]2PK(Xj)bKo(Xj).
j=1
Arguments like those used to prove Lemma 6 show that ElIBn112= O(K-4).
Therefore,
n
sup IHno2(xl)I = sup (nh)- 1/2L Kh(x1 - X)') . Op(K-2)
Ixll<1 i=1
Ixll<1
= Op(n1/21/2K -2)
=
Op(l).
For the proof with j = 0, k = 3, note that
n
sup |Hno3(x)l = sup (nh) 1/2 Kh(x' - X) Op(K 2)
lx'11<1 Ix <1 i-=
= 0p(l). V

Let PK(x ,X) = [1, p(x ), ., pK(X), p(X2),.., PK(X2),...,pl(Xd),

.. PK(X)] and bK(x', Xi) = zt + ml(x1) + m_l (Xi) - PK(x1, Xi)OKo. De-
fine
n
8n3(x1, Xi) = n- PK(x1, Xi)'QK E F'[iU + m(Xj)]PK(Xj)Uj

j=1
and
n
8n4(I1, Xi) = n- PK(x1, Xi)'QK- F'[cI + m(Xj)]2P (Xj)bo(Xj).
j=1
Also, for j = 0, 1 define

n
Lnjl (xl) = (nh)-1/2 Ui F"[u + ml (xl) + m-_ (Xi)](X - x)j

i=1
x Kh (1 - XJ)n3(x1, Xi),
n
Lnj2(xl) = (nh)-1/2 ~ Ui F"[tU + ml (xl) + m-l (Xi)](X -x xl)j

i=l
x Kh(x-1 XJ) n4(xl., i)

n
Lnj3(xl) = -(nh)-1/2 Ui F"[ + ml(xl) + m-_ (Xi)](X - xl)j

i=1
x Kh(X1 - XJ)bK(x1, Xi),

n
Lnj4(x1) = (nh)-1/2
" {F[t + ml (XJ) + m-_1(Xi)]
i=l
-F[,U +ml (xl) + m (X)]}
x F"[,a + ml (x) + m- (Xi)](X1 -xl)j
x Kh(xl - Xl)8n3(x 1, Xi),
n
Lnjs(xl) = (nh)-1/2 j{F[/z + ml (X1) + m-l (Xi)]

i=l
-F[,U + ml (xl) + m_l(Xi)]}

x F"[U +ml (xl) + m-l (Xi)](XJ -xl)j
x Kh (X1 - Xl1)n4(x1, Xi)

and
n
Lnj6(xl) =-(nh)-1/2 Z{F[ +mml (Xl) + m-l(Xi)]

i=l
-F[, +ml(xl) +m-l(Xi)]}

x F"[,U +ml (x) + (Xi)](X-
m-l -xl)
x Kh(xl- XJ)bKo(xl, Xi).
LEMMA 8. As n - oo, Lnjk(x1) = Op(l) uniformly over x1 E [-1, 1] for

each j =O, , k = 1,...,6.
PROOF. The proof is given only for j = 0. The arguments are similar for
j = 1. By Theorem 1, Sn4(x, Xi) is the asymptotic bias component of the
stochastic expansion of PK(xl, Xi)(OnK- OKO)and is Op(K-3/2) uniformly over
(x1, Xi) E [-1, l]d. This fact and standard bounds on
n
sup IUi IKh(xl - XI)
Ixll <i=l
and
n
sup LKh(xl-Xl)
Ixlll i=l
establish the conclusion of the lemma for j = 0, k = 2, 5. For j = 0, k = 3, 6,
proceed similarly using
sup Ib,K(x)l O(K-2)
Ixll<1
For j = 0, k = 4, one can use arguments similar to those made for Hnol (x1) in the
proof of Lemma 7. It remains to consider Lno (x 1) = Dn (x )Bn, where
n
Dn(xl) = (nh)- /2 Ui F [Rl + ml (xl) + m-l (Xi)]Kh (x l-x- ) PXK(x

1,i)
i=l
and
n
Bn = n-Q l F'[/L + m(Xj)]2PK (Xj)Uj

j=l
Now, El Bn 112= O(Kn-1), and Dn(x ) contains elements of the form

n
(nh)-l/2pr(xl) Ui F"[t + ml (x1) + m_l(Xi)]Kh(xl - X)

i=l

MODELWITHA LINKFUNCTION
ADDITIVEREGRESSION 2439
and
n
(nh)-l/2 E UiPr (X)F"[t, + ml(xl) + m-l(Xi)lKh(x - X)

i=l
for 0 < r < K, 2 < e < d. These expressions can be bounded uniformly over
1
Ix|l 1 by terms that are Op[(logn)l/2lpr(xl)l] and Op[(logn)l/2], respectively.
This gives
sup IIDn(xl)112= Op(Klogn).

Ix11<1
Therefore,
sup ILnoi(x) 12< sup IIDn(x)11211Bn112

=Op(1).
Ixlix<lXl<l C
LEMMA9. Thefollowing hold uniformly over Ixl < 1 -h:
(nh)-1Sno1(x1, mt) = Do(x1) + op(1),
(nh)-1Sn21(x1, t) = AKh2Do(xl)[ + Op(l)]
and
(nh)-'Sll (xl, m) = h2AKDl(xl)[ + op(1)].
PROOF. This follows from Theorem l(c) and standard bounds on

n
sup L U['(Xl - xl)sKh(x -XJ)

\x<1 i=I
forr=0, 1, s=0, 1,2. D
Define Aml (xl) = ml (xl) - ml (xl), Am_l(x) = i - /z + m_r(X) - m-1 (x)

and Am(x1, ) = Aml(xl) + Am_l(x).
LEMMA 10. Thefollowing hold uniformly over Ixll < 1 - h:

(a) (nh)-l/2S0ol(xl, -m)=(nh)-l/2Snol(xl,m) +(nh)l/2Do(x')Aml(xl)+
Op(1),
(b) (nh)-l/2Snll(xl, m) = (nh)-l/2Snl(xl, m) + Op(1).
PROOF. Only (a) is proved. The proof of (b) is similar. For each i=
1,..., n, let m*(x ,Xi) and m**(x1, Xi) denote quantities that are between
i+ iml(xl)+ m_l(Xi) and / + ml(xl) + m_l(Xi). The values of m*(xl, Xi)

and m**(x1, Xi) may be differentin differentuses. A Taylorseries expansionand

Theorem 1(c) give
4
-
(nh)-r/2Snol(x1, m) (nh)-r2S'o01 (x, m) + Jnj (x)
j=1
+n(nh) -/20 sup IIAm(x1' )ll3

-(X 1,)eX
4
= (nh)-/2S 1(x, m) + Jnjm(x) + Op(l)
j=1
uniformly over Ixll < 1 - h, where

n
Jnl(xl) = 2(nh)-1/2 Y FI'[L + ml(xl) + m_l(Xi)]2Kh(xl- Xl)Aml(xl),
i=l1
n
Jn2(l1) = 2(nh)- 1/2 F4'[ + ml(x) +m_l(Xi)]2
i=l
x Kh(x -XJl)Am_l (Xi),

n
Jn3(xl) = -2(nh)-1/2 {Yi - F[t + ml (xl) + m- (Xi)]}
i=1
x F"[m*(x, Xi)]Kh(x -XX)Am(xl,Xi),

n
Jn4(Xl) = 2(nh)-l/2 ) F'[f + ml(xl) + m-l(Xi)]
i=l
x {F"[m*(xl, Xi)] + 2F"'[m**(xl, Xi)]}

x Kh(x1 - [m(x1, i)]2.
It follows from Theorem 1(d) and Lemma 7 that Jn2(X1) = 2 3= HnOk(x1) +

Op(l) = Op(l) uniformly over Ixll < 1 - h. In addition, it follows from
Theorem 1(c) that for some constant C < oo,
n -
Jn4(xl) < C(nh)- /2 Kh(xl -Xl) sup l/Am(xl,x)112

i=1 -(x 1,)E X
=Op (nh)1/2 sup IIAm(xl,x)ll2 = op(l)

(x 1,x)EX

ADDITIVE REGRESSIONMODEL WITHA LINK FUNCTION 2441
uniformly over Ix1I < 1 - h. Now consider J43(X1).- It follows from Assump-
tion A3(v) that
n
i=1
x Kh(X' - XJ)Am(x1, Xj)

? Op (nh)1/2 SUPIAM(Xl,i)12
6-
L:Lnok(X I)+0Op (nh) I/2SUPIAM(XI'_i)j2 ?oP(l)

k=I xXE
uniformlyover I1x1I< 1 - h. Therefore,Jn3(X1) = op(l) uniformlyby Lemma 8
and Theorem 1(c), and
uniformlyover Ijx1j 1<I - h.

Now consider J, I(x 1). Set
n
1) 1= ( (nh
in =1/2 F[t+mI( )+MI(k)2K(I-X)
It follows from Theorem 2.37 of Pollard (1984) that Jnl(xl) - E[Jn1(xl1] =

o(logn) almost surely as n --* oo. In addition, E[(nh) 11 l~(Xl)] =
D (x 1) + 0 (h2). Therefore,
JnI(x1) = (nh)1/2D0(x1)Aml(xl) ? 0[lognAmi(xl)]
(nh)1/2D0(x1)Am1(xl) + op(l)
=
11< 1 - h. El
uniformlyover I1x
PROOFOF THEOREm2. By the definitionof 6'I (x 1),
(73)I(x1 Im(x1) mI )
Part (a) follows by applying Lemmas 9 and 10 to the right-handside of (7.3).

Define

2442 J. L. HOROWITZ AND E. MAMMEN
Methods identical to those used to establish asymptotic normality of local-linear

estimators show that E(n2/5w) = B1+ o(l), Var(n2/5w) = Vi(xl) + o(l) and
n2/5 [m (x1) - ml (xl)] is asymptotically normal, which proves part (b). D
PROOFOF THEOREM4. It follows from Theorem 2(a) that
n4/5 f w(Xl)[-l(xl) ml (xl)]2 dx Op(1).

1-h<lxll<1
Now consider
n4/5 [ w(x^)[tl(xl) -ml(x1)]2dx1.

\X11<1-h
By replacing the integrand with the expansion of Theorem 2(a), one obtains
a U-statistic in Ui conditional on X1, ...,Xn. This U-statistic has vanishing
conditional variance. 1]
PROOF OF THEOREM5. Use Theorem 2(a) to replace mhl with ml in the

expression for m ). The result now follows from standard methods for bounding
kernel estimators. D
REFERENCES
BOsQ, D. (1998). NonparametricStatisticsfor Stochastic Procesess. Estimation and Prediction,
2nd ed. LectureNotes in Statist. 110. Springer,Berlin.
BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal transformationsfor multiple
regression and correlation (with discussion). J. Amer. Statist. Assoc. 80 580-619.
BUJA, A., HASTIE, T. and TIBSHIRANI, R. J. (1989). Linear smoothers and additive models (with
discussion).Ann. Statist. 17 453-555.
CHEN, R., HARDLE, W., LINTON, O. B. and SEVERANCE-LOSSIN,E. (1996). Nonparametric esti-
mation of additive separable regression models. In Statistical Theory and Computational
Aspects of Smoothing (W. Hardle and M. Schimek, eds.) 247-253. Physica, Heidelberg.
FAN, J. and CHEN, J. (1999). One-step local quasi-likelihood estimation. J. R. Stat. Soc. Ser. B Stat.
Methodol.61 927-943.
FAN, J. and GIJBELS,I. (1996). Local PolynomialModelling and Its Applications. Chapmanand
Hall, London.
FAN,J., HARDLE,W. and MAMMEN,E. (1998). Direct estimationof low-dimensionalcomponents
in additive models. Ann. Statist. 26 943-971.
HASTIE, T. J. and TIBSHIRANI, R. J. (1990). Generalized Additive Models. Chapman and Hall,
London.
HOROWITZ, J. L., KLEMELA, J. and MAMMEN, E. (2002). Optimal estimation in additive
regression models. Working paper, Institut fir Angewandte Mathematik, Ruprecht-Karls-
Univertsitit,Heidelberg,Germany.
LINTON, O. B. and HARDLE, W. (1996). Estimating additive regression models with known links.
Biometrika83 529-540.
LINTON,O. B. and NIELSEN,J. P. (1995). A kernelmethodof estimatingstructurednonparametric
regression based on marginal integration. Biometrika 82 93-100.

MAMMEN,E., LINTON,O. B. and NIELSEN,J. P. (1999). The existence and asymptoticproperties

of backfittingprojectionalgorithmunderweak conditions.Ann. Statist.27 1443-1490.
NEWEY, W. K. (1997). Convergence rates and asymptotic normality for series estimators.
J. Econometrics79 147-168.
OPSOMER, J. D. (2000). Asymptotic propertiesof backfittingestimators.J. MultivariateAnal. 73
166-179.
OPSOMER,J. D. and RUPPERT, D. (1997). Fitting a bivariateadditive model by local polynomial
regression.Ann. Statist.25 186-211.
POLLARD,D. (1984). Convergenceof StochasticProcesses. Springer,New York.
SERFLING, R. J. (1980). ApproximationTheoremsof MathematicalStatistics.Wiley, New York.
STONE,C. J. (1985). Additiveregressionand othernonparametricmodels. Ann. Statist. 13 689-705.
STONE,C. J. (1986). The dimensionality reduction principle for generalized additive models.
Ann. Statist. 14 590-606.
STONE,C. J. (1994). The use of polynomialsplines andtheirtensorproductsin multivariatefunction
estimation(with discussion).Ann. Statist.22 118-184.
TJ0STHEIM,D. and AUESTAD,B. H. (1994). Nonparametricidentificationof nonlineartime series:
Projections.J. Amer.Statist.Assoc. 89 1398-1409.
DEPARTMENTOF ECONOMICS FAKULTATFUR VOLKSWIRTSCHAFTSLEHRE
NORTHWESTERNUNIVERSITY UNIVERSITATMANNHEIM
EVANSTON, ILLINOIS 60208-2600 VERFUGUNGSGEBAUDEL7, 3-5
USA 68131 MANNHEIM
E-MAIL:joel-horowitz@northwestem.edu GERMANY
E-MAIL:emammen@rumms.uni-mannheim.de


Annal Horowitz Mammen 2004

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Annal Horowitz Mammen 2004

Uploaded by

Copyright:

Available Formats

TheAnnals of Statistics

2004, Vol. 32, No. 6, 2412-2443

NONPARAMETRIC ESTIMATION OF AN ADDITIVE MODEL

BY JOELL. HOROWITZ1AND ENNO MAMMEN2

1. Introduction. This paper is concerned with nonparametric estimation of

(1.1) Y = F[it + ml(X1) +... + md(Xd)] + U,

where Xi (j = 1, ..., d) is the jth component of the random vector X E IRdfor

Received July 2002; revised March2004.

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

Linton and Hardle (1996) (hereinafterLH) developed an estimator of the

(1.2) f E(YIX = x)w(x2 ..., xd)dx2...dxd,

Therefore, m (x1) can be estimated up to an additive constant by replacing

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

if the other componentswere known. LH extend marginalintegrationto the case

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

2. Informal description of the estimator. Assume that the supportof X is

For any x E Rd define m(x) ml(x ( l) + . + md(xd), where xj is the jth

(2.1) pk(v)dv =0,

(2.2) {' f=k,

(2.3) mj(xj)= E Ojkpk(xj),

Then for 0K E IRKd+l, PK(X)'OKis a series approximation to gt + m(x). Section 3

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

x F'[ - xl)j Kh(x - XI)

Sn/jl (xl, m) = 2 E F'[/ + ml (xl) +m- l (Xi)]2(Xl - xl)j Kh(xl - XI)

x F"[ +mrtl (xi) +m-_l(X)](X1 -xl)jKh(x1 - X)

for j = 0, 1, 2. The second-stage estimator of m 1(xl) is

The second-stage estimators of m2(x2),..., md(xd) are obtained similarly. Sec-

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

Snl (X, bo, bl) = {Yi -F[g + bo + bl(Xl -x1)

Moreover, S jl(x1, m) = aSnl(x1, bo, bl)/abj (j = 0, 1) evaluated at b= hm(x1)

OK = (A/, 01 , I,01DK, 021 . . . 2K, ... , Odl, OdK)'.

ASSUMPTION Al. The data, {(Yi,Xi):i = l,...,n}, are an i.i.d. random

ASSUMPTIONA2. (i) The support of X is X.

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

ASSUMPTIONA5. (i) The functions {Pk} satisfy (2.1) and (2.2).

ASSUMPTION A6. (i) K = CKn4/15+v for some constant CK satisfying

ASSUMPTIONA7. The function K is a bounded, continuous probability

The assumptionthatthe supportof X is [-1, 1]d entails no loss of generalityas

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

THEOREM1. Let Assumptions A1-A6(i) hold. Then:

(a) limn -oo 0lnK - K1 = 0 almost surely,

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

Now let fx denote the probabilitydensity functionof X. For j = 0, 1, define

x F'[z + ml (x1) + m- (.Xi)](X -x)j Kh(x1 - X1).

g(xl,x) = F"[,L +ml(xl) +m_ l(x)]m (xl)

x x)F'[+m(xl)+mjl( ) ]2fx(xIx dx.

THEOREM2. Let Assumptions A1-A6 hold. Then:

(a) m(x 1) - m (x1) = [nhDo (x 1] -S01(, m) + [D1 (x)/Do(x1)] x

Theorem2(a) implies thatasymptotically,n2/5[m1(x) - m (x 1)] is not affected

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

estimatorof m l(x1) has the same asymptotic distributionthat it would have if

The following modificationof Theorem2, which we state withoutproof, gives the

gLC(x, X) = (a2/a2){ F[ml ( + x1) + m-i(x)]

THEOREM3. Let Assumptions A1-A6 hold. Then

(a) mrl,Lc(xl)-ml(xl)=-[nhDo(xl)]-lS'ol(x, m)+op(n-2/5)uniformly

Vl(x1) and Bl(x1) and fi,LC(X1) can be estimated consistently by replacing

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM

3.3. A weighted estimator. A weighted estimator can be obtained by replacing

S'ij(x1, m, w) =-2 E w(xl, Xi){Yi - F[it +- ml (x) + m_ (Xi)]}

x F'[/ + m (xl) + _l (Xi)](X1 - xl)j Kh(xl - X1)

x F"[fL + ml (x1) + m_l(Xi)(X- xl)j Kh(xl- Xi)

fw(x x)F'[ +ml(xl)+m l(X)]2fx(xl x)dx =

V1(x1, w) =0.25BKCh1 fw (x1,C)2Var(U Ixl )