You are on page 1of 32

TheAnnals of Statistics

2004, Vol. 32, No. 6, 2412-2443


DOI 10.1214/009053604000000814
? Instituteof MathematicalStatistics,2004

NONPARAMETRIC ESTIMATION OF AN ADDITIVE MODEL


WITH A LINK FUNCTION

BY JOELL. HOROWITZ1AND ENNO MAMMEN2


Northwestern University and Universitdt Mannheim
This paper describes an estimator of the additive components of
a nonparametricadditive model with a known link function. When the
additive components are twice continuously differentiable,the estimatoris
asymptoticallynormallydistributedwith a rateof convergencein probability
of n-2/5. This is true regardlessof the (finite) dimensionof the explanatory
variable.Thus, in contrastto the existing asymptoticallynormal estimator,
the new estimatorhas no curse of dimensionality.Moreover,the estimatorhas
an oracleproperty.The asymptoticdistributionof each additivecomponentis
the same as it would be if the othercomponentswere known with certainty.

1. Introduction. This paper is concerned with nonparametric estimation of


the functions m , ..., md in the model

(1.1) Y = F[it + ml(X1) +... + md(Xd)] + U,

where Xi (j = 1, ..., d) is the jth component of the random vector X E IRdfor


some finite d > 2, F is a known function, , is an unknown constant, m , ..., md
are unknown functions and U is an unobserved random variable satisfying
E(UIX = x) = 0 for almost every x. Estimation is based on an i.i.d. random
sample {Yi, Xi: i = 1,..., n} of (Y, X). We describe an estimator of the additive
components m, ..., md that converges in probability pointwise at the rate n-2/5
when F and the mj's are twice continuously differentiable and the second
derivative of F is sufficiently smooth. In contrast to previous estimators, only two
derivatives are needed regardless of the dimension of X, so asymptotically there is
no curse of dimensionality. Moreover, the estimators derived here have an oracle
property. Specifically, the centered, scaled estimator of each additive component is
asymptotically normally distributed with the same mean and variance that it would
have if the other components were known.

Received July 2002; revised March2004.


1Supported in part by NSF Grants SES-99-10925 and SES-03-52675, the Alexander
von Humboldt Foundationand Deutsche ForschungsgemeinschaftSonderforschungsbereich373,
"Quantifikationund SimulationOkonomischerProzesse."
2Supportedin partby Deutsche ForschungsgemeinschaftMA 1026/6-2.
AMS2000 subjectclassifications.Primary62G08; secondary62G20.
Keywords and phrases. Additive models, multivariatecurve estimation, nonparametricregres-
sion, kernelestimates,orthogonalseries estimator.
2412

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2413

Linton and Hardle (1996) (hereinafterLH) developed an estimator of the


additivecomponentsof (1.1) that is based on marginalintegration.The marginal
integration method is discussed in more detail below. The estimator of LH
converges at the rate n-2/5 and is asymptotically normally distributed,but it
requiresthe mj's to have an increasing numberof derivativesas the dimension
of X increases. Thus, it suffers from the curse of dimensionality.Our estimator
avoids this problem.
There is a large body of researchon estimationof (1.1) when F is the identity
function so that Y = / + ml(X1) + *.. + md(Xd) + U. Stone (1985, 1986)
showed that n-2/5 is the optimal L2 rate of convergence of an estimator of
the mj's when they aretwice continuouslydifferentiable.Stone (1994) andNewey
(1997) describe spline estimatorswhose L2 rate of convergenceis n-2/5, but the
pointwise rates of convergence and asymptotic distributionsof spline and other
series estimatorsremain unknown. Breiman and Friedman(1985), Buja, Hastie
andTibshirani(1989), Hastie andTibshirani(1990), OpsomerandRuppert(1997),
Mammen, Linton and Nielsen (1999) and Opsomer (2000) have investigatedthe
propertiesof backfittingprocedures.Mammen, Linton and Nielsen (1999) give
conditionsunderwhich a smoothbackfittingestimatorof the mj 's convergesat the
pointwise rate n-2/5 when these functions are twice continuouslydifferentiable.
The estimator is asymptotically normally distributedand avoids the curse of
dimensionality,but extendingit to models in which F is not the identityfunction
appearsto be quite difficult.Horowitz,Klemeli and Mammen(2002) (hereinafter
HKM) discuss optimalitypropertiesof a variety of estimatorsfor nonparametric
additivemodels withoutlink functions.
Tj0stheim and Auestad (1994), Linton and Nielsen (1995), Chen, Hardle,
Linton and Severance-Lossin(1996) and Fan, Hardle and Mammen (1998) have
investigatedthe propertiesof marginalintegrationestimatorsfor the case in which
F is the identityfunction.These estimatorsare based on the observationthatwhen
F is the identityfunction,then m (x1), say, is given up to an additiveconstantby

(1.2) f E(YIX = x)w(x2 ..., xd)dx2...dxd,


where w is a nonnegativefunction satisfying

w(x2,...xd) dx2...dxd = 1.

Therefore, m (x1) can be estimated up to an additive constant by replacing


E(YIX = x) in (1.2) with a nonparametricestimator.Linton and Nielsen (1995),
Chen, Hirdle, Lintonand Severance-Lossin(1996) and Fan, Hardleand Mammen
(1998) have given conditions under which a variety of estimatorsbased on the
marginalintegrationidea converge at rate n-2/5 and are asymptoticallynormal.
The latter two estimators have the oracle property. That is, the asymptotic
distributionof the estimatorof each additivecomponentis the same as it would be

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2414 J. L. HOROWITZAND E. MAMMEN

if the other componentswere known. LH extend marginalintegrationto the case


in which F is not the identity function. However,marginalintegrationestimators
have a curse of dimensionality:the smoothness of the mj's must increase as the
dimension of X increases to achieve n-2/5 convergence. The reason for this is
that estimatingE(YIX = x) requirescarryingout a d-dimensionalnonparametric
regression.If d is large and the mj's are only twice differentiable,then the bias of
the resulting estimator of E(YIX = x) converges to zero too slowly as n -> oo
to estimate the mj's with an n-2/5 rate. For example, the estimator of Fan,
Hardle and Mammen (1998), which imposes the weakest smoothness conditions
of any existing marginalintegrationestimator,requiresmore than two derivatives
if d >5.
This paper describes a two-stage estimation procedurethat does not require
a d-dimensional nonparametricregression and, thereby, avoids the curse of
dimensionality.In the first stage, nonlinearleast squaresis used to obtain a series
approximationto each mj. The first-stageprocedureimposes the additivestructure
of (1.1) and yields estimates of the mj's that have smaller asymptotic biases
than do estimatorsbased on marginalintegrationor otherproceduresthat require
d-dimensional nonparametricestimation. The first-stageestimates are inputs to
the second stage. The second-stageestimate of, say, ml is obtainedby taking one
Newton step from the first-stageestimate towarda local linear estimate. In large
samples, the second-stage estimator has a structuresimilar to that of a local
linear estimator, so deriving its pointwise rate of convergence and asymptotic
distributionis relativelyeasy. The main results of this papercan also be obtained
by using a local constant estimate in the second stage, and the results of Monte
Carlo experimentsdescribedin Section 5 show that a local constantestimatorhas
better finite-sampleperformanceunder some conditions. However, a local linear
estimatorhas betterboundarybehaviorand better ability to adaptto nonuniform
designs, among otherdesirableproperties[Fanand Gijbels (1996)].
Our approachdiffers from typical two-stage estimation,which aims at estimat-
ing one unknownparameteror function [e.g., Fan andChen (1999)]. In this setting,
a consistent estimatoris obtainedin the first stage and is updatedin the second,
possibly by taking a Newton step towardthe optimumof an appropriateobjective
function. In contrast,in our setting, there are several unknownfunctions but we
updatethe estimatorof only one. It is essential thatthe first-stageestimatorsof the
other functions have negligible bias. The variancesof these estimatorsmust also
convergeto zero but can have relatively slow rates. We show that asymptotically,
the estimationerrorof the otherfunctionsdoes not appearin the updatedestimator
of the functionof interest.
HKM use a two-stage estimationapproachthat is similarto the one used here,
but HKM do not consider models with link functions, and they use backfitting
for the first-stageestimator.Derivationof the propertiesof a backfittingestimator
for a model with a link function appearsto be very complicated.We conjecture
that a classical backfitting estimator would have the same asymptotic variance

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITH A LINK FUNCTION 2415

as the one in this paper but a different and, possibly, complicated bias. We
also conjecture that a classical backfittingestimator would not have the oracle
property.Nonetheless, we do not argue here that our procedure outperforms
classical backfitting,in the sense of minimizing an optimality criterion such as
the asymptoticmean-squareerror.However,our procedurehas the advantagesof
a complete asymptoticdistributiontheory and the oracle property.
The remainderof this paper is organized as follows. Section 2 provides an
informaldescriptionof the two-stage estimator.The main results are presentedin
Section 3. Section 4 discusses the selection of bandwidths.Section 5 presentsthe
results of a small simulationstudy,and Section 6 presentsconcluding comments.
The proofs of theorems are in Section 7. Throughoutthe paper,subscriptsindex
observationsand superscriptsdenote componentsof vectors. Thus, Xi is the ith
observationof X, XJ is the jth componentof X, and Xj is the ith observationof
the jth component.

2. Informal description of the estimator. Assume that the supportof X is


X - [-1, ]d, and normalize ml, ..., m so that

m(v)dv=
mj , j=l,...,d.

For any x E Rd define m(x) ml(x ( l) + . + md(xd), where xj is the jth


component of x. Let {Pk k = 1, 2, ...} denote a basis for smooth functions
on [- 1, 1]. A precise definitionof "smooth"andconditionsthatthe basis functions
must satisfy are given in Section 3. These conditionsinclude

(2.1) pk(v)dv =0,

(2.2) {' f=k,


f -(2)pj(v)pk(v)dv=
O
0, otherwise,
and
00

(2.3) mj(xj)= E Ojkpk(xj),


k=l

for each j = 1, ..., d, each xJ E [0, 1] and suitable coefficients {0jk}. For any
positive integerK, define
PK(x)= [1, P (x 1) ... .PK(X ), P1( x2), .. PK(X2), (, I(d), . . . PK(Xd)]
. P1

Then for 0K E IRKd+l, PK(X)'OKis a series approximation to gt + m(x). Section 3


gives conditions that K must satisfy. These requirethat K -- oo at an appropriate
rate as n --> oo.

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2416 J. L. HOROWITZAND E. MAMMEN

To obtain the first-stage estimators of the mj's, let {Yi, Xi: i = 1, ..., n be
a random sample of (Y, X). Let OnK be a solution to
n
minimize: Sn,() - n-1 {Yi - F[PK(Xi)0]}2,
~OE~~OK i=1

where KcC RKd+l is a compact parameter set. The series estimator of /z + m(x)
is
, + mh(x)= PK(X)'nK,
where /2 is the first component of OnK. The estimator of mj(xj) for any
j = 1,..., d and any xJ E [0, 1] is the product of [pl(xJ)... , PK(XJ)] with the
appropriate components of OK.
To obtain the second-stage estimator of (say) ml(xl), let Xi denote the ith
observation of X (X2, .. ., Xd). Define m_l (Xi) = m2(X2) + + mhd(X'/),
where XJ is the ith observation of the jth component of X and thj is the series
estimator of mj. Let K be a probability density function on [-1, 1], and define
Kh(v) = K(v/h) for any real, positive constant h. Conditions that K and h must
satisfy are given in Section 3. These include h -+ 0 at an appropriate rate as
n -- oo. Define
n
lm) = --2
Sj (x1, {Yi - F[f + ml (x1) + m_l (Xi)]}
i=l

x F'[ - xl)j Kh(x - XI)


+-i l (xl) + m_- (Xi)](X
for j = 0, 1 and
n

Sn/jl (xl, m) = 2 E F'[/ + ml (xl) +m- l (Xi)]2(Xl - xl)j Kh(xl - XI)


i=l
n
-2 {Yi -F[/f + l C(x1)+- I_ (X)]}
i=1

x F"[ +mrtl (xi) +m-_l(X)](X1 -xl)jKh(x1 - X)

for j = 0, 1, 2. The second-stage estimator of m 1(xl) is

es aS th)oSno (x)
. nd (x) n 1 Sh)
^-sSng2e(Xim ,
)-So si(X)
(2.4) mi (x)l th) () (xl,h) r) - 1 - SIn, (x, '
2

The second-stage estimators of m2(x2),..., md(xd) are obtained similarly. Sec-


tion 3.3 describes a weighted version of this estimator that minimizes the asymp-
totic variance of n2/5 [I 1(x) - m(x 1)]. However, due to interactions between the
weight function and the bias, the weighted estimator does not necessarily minimize
the asymptotic mean-square error.

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSION
MODELWITHA LINKFUNCTION 2417

The estimator (2.4) can be understood intuitively as follows. If A and mt_ were
the true values of ,t and m_ 1, the local linear estimator of m 1(x1) would minimize
n

Snl (X, bo, bl) = {Yi -F[g + bo + bl(Xl -x1)


i=1
(2.5)
+ m_l(Xi)]}2Kh(xl -XI).

Moreover, S jl(x1, m) = aSnl(x1, bo, bl)/abj (j = 0, 1) evaluated at b= hm(x1)


and bl = 0. S"1 (x , m) gives the second derivatives of S, 1(x1, bo, bl) evaluated
at the same point. The estimator (2.4) is the result of taking one Newton step from
the starting values bo = mh1(x1), bl = 0 toward the minimum of the right-hand side
of (2.5).
Section 3 gives conditions under which mh1(x1) - m 1(x ) = Op(n2/5) and
n2/5 [[m(xl) - ml(xl)] is asymptotically normally distributed for any finite d
when F and the mj's are twice continuously differentiable.

3. Main results. This section has three parts. Section 3.1 states the assump-
tions that are used to prove the main results. Section 3.2 states the results. The
main results are the n-2/5-consistency and asymptotic normality of the mj's. Sec-
tion 3.3 describes the weighted estimator.
The following additional notation is used. For any matrix A, define the norm
IA I = [trace(A'A)]1/2. Define U = Y - F[/A + m(X)], V(x) = Var(UIX = x),
QK = E{F'[/1 + m(X)]2 PK(X)PK (X)'}, and IK = QK E{F'[j + m(X)]2V(X) x
PK(X)PK(X)'}Q -1 whenever the latter quantity exists. QK and IK are d(K) x d(K)
positive semidefinite matrices, where d(K) = Kd + 1. Let XK,min denote the
smallest eigenvalue of QK. Let QK,ij denote the (i, j) element of QK. Define
K= Supxx IIPK(x)ll.Let {0jk} be the coefficients of the series expansion (2.3).
For each K define

OK = (A/, 01 , I,01DK, 021 . . . 2K, ... , Odl, OdK)'.

3.1. Assumptions. The main results are obtained under the following assump-
tions.

ASSUMPTION Al. The data, {(Yi,Xi):i = l,...,n}, are an i.i.d. random


sample from the distribution of (Y, X), and E(Y X = x) = F[,/ + m(x)] for almost
every x e X _ [-1, 1]d

ASSUMPTIONA2. (i) The support of X is X.


(ii) The distribution of X is absolutely continuous with respect to Lebesgue
measure.
(iii) The probability density function of X is bounded, bounded away from zero
and twice continuously differentiable on X.

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2418 J. L. HOROWITZAND E. MAMMEN

(iv) There are constants cv > 0 and Cv < oo such that cv < Var(UIX =
x) < Cv for all x E X.
(v) Thereis a constantCu < oo such thatEIUI/ < C 2j!E(U2) < oo for all
j>2.

ASSUMPTION
A3. (i) There is a constant Cm < oc such that Jmj(v)I < Cm
for each j = 1,...,d and all v E [-1, 1].
(ii) Each functionmj is twice continuouslydifferentiableon [-1, 1].
(iii) There are constants CF1 < oo, CF2 > 0, and CF2 < oo such that
F(v) < CF1 and CF2 < F'(v) < CF2 for all v E [ut - Cmd, ,L + Cmd].
(iv) F is twice continuously differentiable on [u/ - Cmd, /t + Cmd].
(v) There is a constant CF3 < oo such that IF"(v2) - F"(vl)I < CF3 I2 - l I
for all v2, vl E [/t - Cmd, iu + Cmd].

ASSUMPTION A4. (i) There are constants CQ < oo and cx > 0 such that
IQK,ij[ < CQ and K,min> C) for all K and all i, j = 1,..., d().
(ii) The largest eigenvalue of "K is bounded for all K.

ASSUMPTIONA5. (i) The functions {Pk} satisfy (2.1) and (2.2).


(ii) There is a constant cK > 0 such that (K > CK for all sufficiently large K.
(iii) ,K= O(K1/2) as K -> oo.
(iv) There are a constant Co < oo and vectors OKO E 0, - [-Co, Co]d(K) such
that supx Exl[ + m(x) - PK(X)'OKl = O(K -2) as K -> X.
(v) For each K, OK is an interior point of )K.

ASSUMPTION A6. (i) K = CKn4/15+v for some constant CK satisfying


0 < CK < oo and some v satisfying 0 < v < 1/30.
(ii) h = Chn-1/5 for some constant Ch satisfying 0 < Ch < oo.

ASSUMPTIONA7. The function K is a bounded, continuous probability


density functionon [-1, 1] and is symmetricabout0.

The assumptionthatthe supportof X is [-1, 1]d entails no loss of generalityas


it can always be satisfiedby carryingout monotone increasingtransformationsof
the componentsof X, even if theirsupportbeforetransformationis unbounded.For
practicalcomputations,it suffices to transformthe empiricalsupportto [-1, 1]d.
Assumption A2 precludes the possibility of treating discrete covariates with
our method, though they can be handled inelegantly by conditioning on them.
Another possibility is to develop a version of our estimator for a partially
linear generalized additive model in which discrete covariates are included in
the parametric(linear) term. However, this extension is beyond the scope of the
present paper.Differentiabilityof the density of X [AssumptionA2(iii)] is used

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITH A LINK FUNCTION 2419

to insure that the bias of our estimator converges to zero sufficiently rapidly.
AssumptionA2(v) restrictsthe thicknessof the tails of the distributionof U and is
used to prove consistency of the first-stageestimator.AssumptionA3 defines the
sense in which F and the mj's must be smooth. Assumption A3(iii) is needed
for identification.Assumption A4 insures the existence and nonsingularityof
the covariancematrix of the asymptoticform of the first-stageestimator.This is
analogousto assumingthatthe informationmatrixis positive definitein parametric
maximum likelihood estimation. Assumption A4(i) implies Assumption A4(ii)
if U is homoskedastic. Assumptions A5(iii) and A5(iv) bound the magnitudes
of the basis functions and insure that the errorsin the series approximationsto
the m j's converge to zero sufficiently rapidly as K -- oo. These assumptions are
satisfied by spline and (for periodic functions) Fourier bases. Assumption A6
states the rates at which K -- oo and h -- 0 as n -> oo. The assumed rate of
convergenceof h is well known to be asymptoticallyoptimalfor one-dimensional
kernelmean-regressionwhen the conditionalmean functionis twice continuously
differentiable.The requiredratefor K insuresthatthe asymptoticbias andvariance
of the first-stage estimator are sufficiently small to achieve an n-2/5 rate of
convergencein the second stage. The L2 rateof convergenceof a series estimator
of mj is maximizedby settingK ocn1/5, which is slowerthanthe ratespermittedby
AssumptionA6(i) [Newey (1997)]. Thus,AssumptionA6(i) requiresthe first-stage
estimatorto be undersmoothed.Undersmoothingis needed to insure sufficiently
rapid convergenceof the bias of the first-stageestimator.We show that the first-
orderperformanceof our second-stage estimatordoes not depend on the choice
of K if AssumptionA6(i) is satisfied.See Theorems2 and 3. Optimizingthe choice
of K would require a rathercomplicated higher-ordertheory and is beyond the
scope of this paper, which is restricted to first-order asymptotics.

3.2. Theorems. This section states two theorems that give the main results
of the paper. Theorem 1 gives the asymptotic behavior of the first-stage series
estimatorunder Assumptions A1-A6(i). Theorem 2 gives the propertiesof the
second-stage estimator. For i = 1, ..., n, define Ui = Yi - F[Az + m(Xi)] and
bKo(x) = z + m(x) - P (x)'OKo.Let II v 11denote the Euclidean norm of any finite-
dimensionalvector v.

THEOREM1. Let Assumptions A1-A6(i) hold. Then:

(a) limn -oo 0lnK - K1 = 0 almost surely,


KO
(b) 0nK - K = Op(K/2/n1/2 + K-2), and
(c) suPxcx rIm(x)-m(x) = Op(K/n1/2 + K-3/2).
In addition:
(d) 0nK -
OKO = nQK i=l F'[L +m(Xi)]PK(Xi)Ui + n 1QK1 x
n=l F'[/ + m(Xi)]2PK(Xi)bK(Xi) R = Op(3/2/n
+ Rn, where IIR + n-1/2).

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2420 J. L. HOROWITZAND E. MAMMEN

Now let fx denote the probabilitydensity functionof X. For j = 0, 1, define


n
S l (xl, m) =-2 - F[u + ml (xl) + m_l (Xi)]}
j{Yi
i=1

x F'[z + ml (x1) + m- (.Xi)](X -x)j Kh(x1 - X1).


Also define

=
Do( F'[+ml(x)=2
F[ mxl) +m_l(.c)]2 fx(xl,)d/x,
+ + ml(X)]2[Ofx(x1 ~x)/Oxl]d.,
DI(X1) = 2 F'[, ml(xl)

AK = v2K(v)dv,
J-

BK= K(v)2dv,
1

g(xl,x) = F"[,L +ml(xl) +m_ l(x)]m (xl)


+ F'[,L +ml(xl)+ m_l(x)]ml'(xl),
1(xl) = 2C2AKDo(x )-1

x +ml(xl) +m_l(x)]fx(xl,x)dx
fg(xl,x)F'[l
and
Vl(x1) = BKCh-D(x1)-2

x x)F'[+m(xl)+mjl( ) ]2fx(xIx dx.


fvar(Ulx
The next theoremgives the asymptoticpropertiesof the second-stageestimator.

THEOREM2. Let Assumptions A1-A6 hold. Then:

(a) m(x 1) - m (x1) = [nhDo (x 1] -S01(, m) + [D1 (x)/Do(x1)] x


-
S'n (xl1m)} + op(n-2/5) uniformly over Ix11 < I h and ml(x1) - ml(x1) =
Op[(logn)1/2n-2/5] uniformly over Ix'l < 1.
(b) n2/5[ml
(x )- ml(xl)] N[BPl(x l), Vl(xl)].
(c) If j 1, then n2/5[l((x1) - ml(xl)] and n2/5[mj(x) - are
mj(xj)]
asymptoticallyindependentlynormallydistributed.

Theorem2(a) implies thatasymptotically,n2/5[m1(x) - m (x 1)] is not affected


by random sampling errorsin the first-stageestimator.In fact, the second-stage

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2421

estimatorof m l(x1) has the same asymptotic distributionthat it would have if


m2,..., md were known and local-linear estimation were used to estimate m 1(x1)
directly. In this sense, our estimatorhas an oracle property.Parts (b) and (c) of
Theorem 2 imply that the estimators of m l(xl),..., md(xd) are asymptotically
independentlydistributed.
It is also possible to use a local-constantestimator in the second stage. The
resultingsecond-stageestimatoris
- m).
m,LLC(Xl) Sl(xl)-S'0o(x,l)/SnO1(x

The following modificationof Theorem2, which we state withoutproof, gives the


asymptoticpropertiesof the local-constantsecond-stageestimator.Define

gLC(x, X) = (a2/a2){ F[ml ( + x1) + m-i(x)]


- F[ml(x) + m-l(x)]}fx(( + x, x)lr=o

and

Pl,Lc(xl) = 2Ch2AKDo(xl)-1

x fgLc(x +
x)F'[tm l(x l)+m- (X)]fx(x x)dx.

THEOREM3. Let Assumptions A1-A6 hold. Then

(a) mrl,Lc(xl)-ml(xl)=-[nhDo(xl)]-lS'ol(x, m)+op(n-2/5)uniformly


over Ixl < 1 - h and mh (xl) - ml(x) = Op [(logn)1/2n-2/5] uniformly over
I 1.
xlJ
-)-ml 1)
(b) n2/5[l,LC( (xl)] N[P1,Lc(xl), V1(x1)].
-
(c) If j - 1, then n2/5h1,LC( X1) ml(xl)] and n2/ [mhj,Lc(xj) - mj(x)]
are asymptotically independently normally distributed.

Vl(x1) and Bl(x1) and fi,LC(X1) can be estimated consistently by replacing


unknown population parameterswith consistent estimators. Section 4 gives a
method for estimating the derivatives of m that are in the expressions for
Bi(x1) and /B1,LC(xl). As is usual in nonparametricestimation, reasonably
precise bias estimation is possible only by making assumptions that amount
to undersmoothing. One way of doing this is to assume that the second
derivative of ml satisfies a Lipschitz condition. Alternatively, one can set
h = Chn-Y for 1/5 < y < 1. Then n(- )/2[mi1(x1) - ml(xl)] N[0, V (xl)],
and n( -y)I/2[C1LC(X1)- m(x )] N[O, V(x1)].

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2422 J. L. HOROWITZAND E. MAMMEN

3.3. A weighted estimator. A weighted estimator can be obtained by replacing


S'j (x1, m) and S 'j(x1, m) in (2.5) with
n

S'ij(x1, m, w) =-2 E w(xl, Xi){Yi - F[it +- ml (x) + m_ (Xi)]}


i=1

x F'[/ + m (xl) + _l (Xi)](X1 - xl)j Kh(xl - X1)


and

S1'j1(x1,m, w)
n
=2 w(x,X i)F'[, + mlh(xl) + m_l (Xi)]2(Xl-xl)j Kh(x -X)
i=l
n
-2 w(x1l, Xi){Yi -F[ + ml(x l)+ m-l(Xii)]}
i=l

x F"[fL + ml (x1) + m_l(Xi)(X- xl)j Kh(xl- Xi)


for j = 0, 1, 2, where w is a nonnegative weight function that is assumed for the
moment to be nonstochastic. It is convenient to normalize w so that

fw(x x)F'[ +ml(xl)+m l(X)]2fx(xl x)dx =

for each x1 E [-1, 1]. Arguments identical to those used to prove Theorem 2 show
that the variance of the asymptotic distribution of the resulting local-linear or local-
constant estimator of m i (x1) is

V1(x1, w) =0.25BKCh1 fw (x1,C)2Var(U Ixl )

x F'[1 + ml (xl)+ m_ (x)]2fx(xl x) dx.


It follows from Lemma 1 of Fan, Hardle and Mammen (1998) that V(xl, w) is
minimized by setting w(x1, x)2 o 1/ Var(U Ix1, x), thereby yielding

Vl(x ,w)= O.25BKhl() -1 F'[ + ml (xl) + m( ) fx(x x)dx,

where

D2(xl)= fVar(Ulxl,i)-lF'[ +ml(xl) +m_l(x)]2fx(x ) d.


In an application, it suffices to replace the variance-minimizing weight function
with a consistent estimator. For example, F'[JL + ml(xl) + m_l(x)] can be
estimated from the first estimation stage, Var(Ulxl, ) can be estimated by
applying a nonparametric regression to the squared residuals of the first-stage
estimate and kernel methods can be used to estimate fx (x , x).

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2423

The minimum-varianceestimator is not a minimum asymptotic mean-square


error estimator unless undersmoothingis used to remove the asymptotic bias
of 1. This is because weighting affects the bias when the latteris nonnegligible.
The weight function that minimizes the asymptotic mean-squareerror is the
solution to an integral equation and does not have a closed-form analytic
representation.

4. Bandwidth selection. This section presentsa plug-in and a penalizedleast


squares(PLS) methodfor choosing h in applications.We begin with a description
of the plug-in method. This method estimates the value of h that minimizes the
asymptoticintegratedmean-squareerror(AIMSE) of n2/5[m (x) - ml(xl)] for
j = 1, ... , d. We discuss only local-linear estimation, but similar results hold for
local-constant estimation. The AIMSE of n2/5 (MIm - m) is defined as

AIMSEl =n4S w(x )[I(x ) + V1(x1)] dx


-1
where w(.) is a nonnegativeweight function that integratesto 1. We also define
the integratedsquarederror(ISE) as

ISE -n45 / w(x')[ml(xl) -ml(x)]2 dx.


-1

We define the asymptoticallyoptimal bandwidthfor estimatingm as Chln-1/5,


where Ch minimizes AIMSE1.Let
1(X1) = Bl(X )/ C2 and V1(x l) = ChVI(xl).
Then
-1- Cf1 ww(xl)V1(x1)dx -1/5
(4.1) Chl -
4 fl1 w(xl),(xl)2 dx1_
The results for the plug-in method rely on the following two theorems.Theo-
rem4 shows thatthe differencebetweenthe ISE andAIMSEis asymptoticallyneg-
ligible. Theorem5 gives a method for estimatingthe first and second derivatives
of mj. Let G() denote the eth derivativeof any e-times differentiablefunction G.

THEOREM 4. Let Assumptions A1-A6 hold. Then for a continuous weight


function w(.) and as n - oo, AIMSE = ISE1 +op(1).

THEOREM5. Let Assumptions A1-A6 hold. Let L be a twice differentiable


probability density function on [-1, 1], and let gn : n = 1, 2,.... be a sequence
of strictly positive real numbers satisfying gn - 0 and g2n4/5 (logn)- - oo as
n -- oo. For = 1,2 define

m^x1) = g- L()[(x1 -v)/gn]ml()dv.


1 -n

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2424 J. L. HOROWITZAND E. MAMMEN

Then as n -- oo andfor t = 1, 2,

sup Imth)(xl)--m(e)(x') =Op(1)


Ixll<l
A plug-in estimator of Chl can now be obtained by replacing unknown
population quantities on the right-hand side of (4.1) with consistent estimators.
Theorem 5 provides consistent estimators of the required derivatives of mi.
Estimators of the conditional variance of U and of fx can be obtained by using
standard kernel methods.
We now describe the PLS method. This method simultaneously estimates the
bandwidths for second-stage estimation of all of the functions mi (j = 1, ..., d).
Let hj = Chjn-1/5 be the bandwidth for rhj. Then the PLS method selects
the Chj's that minimize an estimate of the average squared error (ASE),
n
ASE(h) = n-1 {F[l + mr(Xi)] - F[/z + m(Xi)]}2
i=l
where h = (Chln-1/5,... Chdn-l/5). Specifically, the PLS method selects
the Chj's to
n
minimize: PLS (h) = n-1 {Yi - F [ji + (Xi)]}2
Chl,...,Chd

n
(4.2) + 2K(O)n-1 I{F'[L + (Xi)2V(Xi)}
i=1
d
x E[n45Ch Dj(X)]-1,
j=1
where the Chj's are restricted to a compact, positive interval that excludes 0,

jx^ 1 rhKj(X/ -x)F'[i+-


Dj(xJ) = Khj (X
xi)F'[,1 + r(X/)]2
(Xi
J 1=1
and
--1
V(x)= LKhl(XI -x) ... Khd(XI-xd)
-i=l
n
XE Khl (X - xl).. Khd(Xi - xd){yi - F
F[ + m(Xi)]}2.
i=1
The bandwidths used for V may be different from those used for m because V is
a full-dimensional nonparametric estimator. We now argue that the difference
n
n-1 U2 + ASE(h) -PLS(h)
i=l

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2425

is asymptotically negligible and, therefore, that the solution to (4.2) estimates the
bandwidths that minimize ASE. A proof of this result only requires additional
smoothness conditions on F and more restrictive assumptions on K. The proof can
be carried out by making arguments similar to those used in the proof of Theorem 2
but with a higher-order stochastic expansion for h - m. Here, we provide only
a heuristic outline. For this purpose, note that
n
n- l Ui2 + ASE(h) - PLS(i)
i=1
n
= 2n-1 E{F[r/ + m (Xi)] - F[/z + m(Xi)]}Ui
i=l
n d
- 2K(O)n-1 F'[/u + m(Xi)]2V(Xi) [n4/ Chj D(Xi)]-
i=l j=1

We now approximate F[? + m (Xi)] - F[i/ + m(Xi)] by a linear expansion in


m - m and replace m - m with the stochastic approximation of Theorem 2(a).
(A rigorous argument would require a higher-order expansion of mh- m.) Thus,
F[ui + m (Xi)] - F[/-L + m(Xi)] is approximated by a linear form in Ui. Dropping
higher-order terms leads to an approximation of
n
22
{F[A + m (Xi)] - F[Iu + m(Xi)]}Ui
i=1
that is a U statistic in Ui. The off-diagonal terms of the U statistic can be shown
to be of higher order and, therefore, asymptotically negligible. Thus, we get

- J{F[r + m(Xi)] - F[Iz + m(Xi)]}Ui


i=l

n d
2
, -
F'[t m + (Xi )]2 Var(Ui Xi) n4/SC hDoj(Xi ),
i=l j=1
where

Doj(xj) = 2E{F'[Lu + m(Xi)]2 Xi = xJ}fxj(xi)

and fxj is the probability density function of XJ. Now by standard kernel smooth-
ing arguments, Doj (x) Dj(xj). In addition, it is clear that V(Xi) V(Ui IXi),
which establishes the desired result.

5. Monte Carlo experiments. This section presents the results of a small set
of Monte Carlo experiments that compare the finite-sample performances of the
two-stage estimator, the estimator of LH and the infeasible oracle estimator in

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2426 J. L. HOROWITZAND E. MAMMEN

which all additivecomponentsbut one are known. The oracle estimatorcannotbe


used in applicationsbutprovidesa benchmarkagainstwhich ourfeasible estimator
can be compared.The infeasible oracle estimatorwas calculatedby solving (2.5).
Experimentswere carriedout with d = 2 andd = 5. The samplesize is n = 500.
The experimentswith d = 2 consist of estimating fl and f2 in the binary logit
model
P(Y = IX =x) = L[fi(x) + f2(x2)],
where L is the cumulativelogistic distributionfunction
L(v) = ev/[1 + ev], -oo < v < oo.
The experimentswith d = 5 consist of estimating fl and f2 in the binary logit
model
5

P(Y=1IX=x) =L fi(x')+f2(x2)+Exj
j=3
In all of the experiments,
fl(x) = sin(rx) and f2(x) = c(3x),
where c is the standardnormal distributionfunction. The components of X
are independentlydistributedas U[-1, 1]. Estimation is carried out under the
assumption that the additive components have two (but not necessarily more)
continuousderivatives.Underthis assumption,the two-stageestimatorhas the rate
of convergencen-2/5. The LH estimatorhas this rate of convergenceif d = 2 but
not if d = 5.
B-splines were used for the first stage of the two-stage estimator.The kernel
used for the second stage and for the LH estimatoris
K(v) = 15( - v2)2I(Ivl < 1).
Experimentswere carriedout using both local-constantand local-linearestimators
in the second stage of the two-stage method. There were 1000 Monte Carlo
replicationsper experimentwith the two-stage estimatorbut only 500 replications
with the LH estimatorbecause of the very long computing times it entails. The
experimentswere carriedout in GAUSS using GAUSS randomnumbergenerators.
The results of the experiments are summarized in Table 1, which shows
the empirical integratedmean-squareerrors (EIMSEs) of the estimators at the
values of the tuning parametersthat minimize the EIMSEs. Lengthy computing
times precluded using data-based methods for selecting tuning parametersin
the experiments. The EIMSEs of the local-constant and local-linear two-stage
estimates of fi are considerablysmaller than the EIMSEs of the LH estimator.
The EIMSEs of the local-constant and LH estimators of f2 are approximately
equal whereas the local-linearestimatorof f2 has a largerEIMSE. There is little
differencebetween the EIMSEs of the two-stage local-linearand infeasible oracle

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2427

TABLE 1
Results of Monte Carlo experiments*

Empirical IMSE
Estimator K1l K2 h h2 f f

d=2
LH 0.9 0.9 0.116 0.015
Two-stage with 2 2 0.4 0.9 0.052 0.015
local-constant
smoothing
Two-stagewith 4 2 0.5 1.4 0.052 0.023
local-linear
smoothing
Infeasibleoracle 0.6 1.7 0.056 0.021
estimator

d=5
LH 1.0 1.0 0.145 0.019
Two-stagewith 2 2 0.4 0.9 0.060 0.018
local-constant
smoothing
Two-stagewith 2 2 0.6 1.3 0.057 0.029
local-linear
smoothing
Infeasibleoracle 0.6 2.0 0.057 0.023
estimator

*In the two-stage estimator,Kj and hj (j = 1, 2) are the series length and bandwidth
used to estimate fj. In the LH estimator, hj (j = 1, 2) is the bandwidthused to
estimate fj. The values of K1, K2, h and h2 minimize the IMSEs of the estimates.

estimators. This result is consistent with the oracle property of the two-stage
estimator.

6. Conclusions. This paper has described an estimator of the additive


components of a nonparametric additive model with a known link function.
The approach is very general and may be applicable to a wide variety of other
models. The estimator is asymptotically normally distributed and has a pointwise
rate of convergence in probability of n-2/5 when the unknown functions are
twice continuously differentiable, regardless of the dimension of the explanatory
variable X. In contrast, achieving the rate of convergence n-2/5 with the only
other currently available estimator for this model requires the additive components
to have an increasing number of derivatives as the dimension of X increases.
In addition, the new estimator has an oracle property: the asymptotic distribution
of the estimator of each additive component is the same as it would be if the other
components were known.

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2428 J. L. HOROWITZAND E. MAMMEN

7. Proofs of theorems. Assumptions A 1-A7 hold throughout this section.

7.1. Theorem 1. This section begins with lemmas that are used to prove
Theorem 1.

LEMMA 1. There are constants a > 0 and C < oo such that

P sup ISnK(0)-ESnk(0)I > < Cexp(-na82)


-_0EK

for any sufficiently small s > 0 and all sufficiently large n.

PROOF. Write
n
SnK(O) = n-1 Yi - 2SnKl(0) + SnK2(0),
i=1
where
n
SnKl (0) =n-1 Yi F[PK(Xi)'0]
i=1
and
n
SnK2(0) = n-1 F[PK(Xi)O]2.
i=1
It suffices to prove that

P sup ISnKj(0)- ESnkjW()I > < Cexp(-na2) (j = 1,2)


-OEOK

for any s > 0, some C < oo and all sufficiently large n. The proof is given only
for j = 1. Similar arguments apply when j = 2.
Define SnK (0) = SnlK(0) - ESnKl (0). Divide OK into hypercubes of edge-
length f. Let 1),..., ?(M) denote the M = (2Co/f)d(K) cubes thus created.
Let OKj be the point at the center of eK(). The maximum distance between OKj
and any other point in O) is r = d()1/2/2 , and M = exp{d(K)[log(Co/r) +
(1/2) log d(K)]}. Now

SUp ISnK1(0)I >


-OE--1 -
C U sup ISnKl(0)I
(j)
>?
-8~0G j=1 e_oEK

Therefore,
M -

Pn P sup ISn,K(0)l > E < P sup ISnKl(0)l >?


-OK - j= - O(j)
-

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
MODELWITHA LINKFUNCTION
ADDITIVEREGRESSION 2429

Now for 0 E (j)


),
ISnKl(0) < ISnKl(OKj)l + ISnKl(0) - SnKl (0j)

< \SnK (OKj)I +CF2rKr (n- E Yi +CF1)

for all sufficiently large K and, therefore, n. Therefore, for all sufficiently large n,

P sup ISnKl(0)l>8

C/2 >
> /2] + P CF2ZKr n- .
< P[ISnil(0Kj)l Yl +CF1
Choose r = -2. Then s/2 - CF2'Kr[CF1 + E(IYI)] > s/4 for all sufficiently
large K. Moreover,
n
P CF2iKr n-1E IYII+ CF1 > s/2

n
< P CF2'Krn-I
)-(IY/I- EIYI) > s/4

< 2exp(-al n2r2)


for some constant al > 0 and all sufficiently large K by Bernstein's inequality
[Bosq (1998), page 22]. Also by Berstein's inequality, there is a constant a2 > 0
such that

P[ISnKl(OKj)I > 8/2] < 2exp(-a2ns2)


for all n, K and j. Therefore,
Pn < 2[M exp(-a2ns2) + exp(-aln82)]
< 2exp{-a2nE2(2 + 2dCKnY[log(Co/r) + - log(2CKd) + y logn]}

+ 2exp(-ainE2),
where y = 4/15 + v. It follows that Pn < 4exp(-an?2) for a suitable a > 0 and
all sufficiently large n. D

Define
SK(O)=E[SnK ()]
and
OK= arg min SK().
OE?K

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2430 J. L. HOROWITZAND E. MAMMEN

LEMMA 2. For any tr > 0, SK(OnK) - SK(OK) < i almost surely for all
sufficiently large n.

PROOF. For each K, let KAcC IRd(K)be an open set containing OK.Let eK
denote the complement of WK in OK. Define TK = NK,n OK. Then TK C Rd(K) is
compact.Define
= min SK(0) - SK(K).

Let An be the event ISnK(0) - SK(0)I < r/2 for all 0 E OK. Then

An = SK(OnK)< SnK(OK)
+ r/2
and

An = SnK,() < SK,(K) + /2.


But SnK(OnK)< Sn, (K) by definition, so

An =: S (On) < SnK(OK)+ r1/2.

Therefore,
An S( ((K) ) < ( SK
SK =+
+ K(OnK) - SK (OK) < .

So An => OnKE J K. Since VK is arbitrary, the result follows from Lemma 1 and
Theorem 1.3.4 of Serfling [(1980), page 10]. D

Define

bk(x) = f+ + m(x) - PK(x)'OK


and

SKo(0) = E{Y - F[PK(X)' + bK (X)]}2.

Then
0K = arg min SK0o(0).
0 CK

LEMMA 3. For any r>> 0, SKo(OK)- SKo(OKO)


< rfor all sufficiently large n.

PROOF. Observe that ISK(0) - SKo(0) - 0 as n -- oo uniformly over 0 E OK


because bK () --- 0 for almost every x E X. For each K, let CX C IRd(K) be an open
set containing OKO.Define TK= xK(nO,. Then TK C Id(K) is compact. Define

t7 min SKo(0) - SKo(Ko).


O6T7

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2431

By choosing a sufficiently small cAK, j can be made arbitrarily small. Choose n


and, therefore, K large enough that ISK(0) - SKo(O)l < r7/2 for all 0 E @. Now
proceed as in the proof of Lemma 2. D

Define ZKi= F'[-t + m(Xi)]PK(Xi) QKi=1 ZK


n 1E Zi . Then Q =
ZKi.ZKIT and QK =

EQK. Let Zki [k = 1,..., d(K)] denote the kth component of ZKi. Let ZK denote
the n x d(K) matrix whose (i, k) element is Zk.

LEMMA 4. QK - QK 12 = Op(K2/n).

PROOF. Let Qij denote the (i, j) element of QK. Then


d(K) d(K) n 2
- -Z
El0lK QK 12 = E E .KiZK Qkj
k=l j=l i=l
d(K) d(K) n n
- E En-2 -
E E En , , kiZiKi zkz
ZK KE K Q2j)
Qkj
k=l j=l i= e=1
d(K) d(K) n d(K) d(K)

-=E En E2L(Zki)2 (zJi)2 E E


k=l j=l i=1 k=l j=l
-d(K) d(K)
<n- E E(Z i)2 (Z/i)2 = O(K/n).
-k=l j=l _1
The lemma now follows from Markov's inequality. D

Define Yn = I(XK,min> c,/2), where I is the indicator function. Let U =


(Ui,..., Un)'.

LEMMA5. YnlQK 1ZU/nl = Op(K1/2/n1/2) as n -oo.

PROOF. For any x e X,

n-2E(yni Q / Z,U 1 21X =x) = n -2nE(U'ZK k ZKUIX =x)

= n-2E[Trace(ZK QK ZKUU)iX =x]

n-2 ynCv Trace(Q- ZKZK)

n-lCvynd(K) < CK/n

for some constant C < oo. Therefore, YnllQK/2Z'U/nll = Op(K/ /n1/2) by


Markov's inequality. Now

Ynl -= ZU/n 2(Z U/n)]12.


l 2 (U
Yn ZK /n) QK Q

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2432 J. L. HOROWITZAND E. MAMMEN
~
Define -1/2
Define Q,
Z'UQK Z / n. Let Ij,1 ?7d) andq,1,
..., .q. 1, denote the eigenval-
ues and eigenvectorso f Ql . Let 1lmax= max(ij1,..., id(,)). The spectraldecom-
positiono f QK
YVOCIVI
Q- V
Q-'
gives QK = dI(K)
f=1 rce~
r/qtq', so

d(K)
YnII
Q_ Z/I Unl
Yn11 j/ 112= nY tq'
/IC~~~=
d(K)

( Yn 7max : L 'qe q' < Yn


7maxl = Op(K/n).
?= 1

Define

Bn= QKin- 1 F'[g + m(Xi)]ZKibKO(Xi).


i-i

LEMMA 6. IIBnII= O(K 2) with probability approaching 1 as n -- oX.

PROOF. Let ~ be the n x 1 vector whose ith component is F'[ft +


m(Xi)lbKo(Xi). Then B, = QK1Z 4/n, and ynJIB,112 , -2 -2Zt
Therefore, by the same arguments used to prove Lemma 5, y,nIB,112 <
Cn- y,n'' = ynO(K 4). The lemma follows from the fact that P(yn = 1) -- 1
as n - o. D

PROOFOF THEOREM1. To prove part (a), write

SKo(OnK)- SKo(OK) = LSKO(OnK)- SK(OnK)]? [SK(OnK)- SK (OK)]


(7.1)
+ [SK(OK)- SKo(K)] ? [SKo(K)- SKo(OK)].

Given any q > 0, it follows from Lemmas2 and 3 and uniformconvergenceof SK


to SKO that each term on the right-hand side of (7.1) is less than ij/4 almost surely
for all sufficiently large n. Therefore SKO(OnK) - SKO(OK) < i, almost surely for
all sufficiently large n. It follows that - OK I -> 0 almost surely as n -- o0
---K
because OK uniquely minimizes SK. Part (a) follows because uniqueness of the
series representation of each function mj implies that 11OK- O0K 0 as n -- cxD. -

To prove the remaining parts of the theorem, observe that 0nK satisfies the first-
order condition aSfK(OnK) /a0 = 0 almost surely for all sufficiently large n. Define
Mi = JL + m(Xi) and A Mi = PK(Xi)'OnK - Mi = PK(Xi)'(OnK - OKo) - bKo(Xi).
Then a Taylor series expansion yields
n n
n
1LZKI(Ui -(Q^K?R)(OnK-OKo)?+ n LF'(Mi)ZKibo(Xi)+Rn2 = 0,
i 1 i=i

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITH A LINK FUNCTION 2433

almost surely for all sufficientlylarge n. Rn is definedby


n
Rnl = n-1 {-Ui F"(M i)- U[F ( i)- F"(Mi)]
i=1

- [3F"((Mi)F'(Mi) ?
+ F"(Mi)F"(Mli)AMi

+ ?
F"(1Mi)F'F(Mii ) (Mi) 2]AMi

- [F"(Mi)F'(Mi)- F"(AMi)F'(AMi)

+ F"(Mi)F"(Mi)bKo(Xi)]bKo(Xi)}
x PK(Xi)PK(Xi)',

where Mi and Mi are points between PK(Xi)'OnK and Mi. Rn2 is defined by
n

Rn2 = -n-1 {UiF"(Mi) + Ui[F"(Mi) - F"(Mi)]


i=l

+ [F" (Mi)F (Mi) - 1F"(Mi)F'(Mi)]bKo(Xi)

-F" (3Mi)F"(M,ri)bKo(X
P(Xi)2}P(X)bK(Xi)

Now let ~ denote either QK-Z'U/n or Q [n-1 En F'(Mi)2PK(Xi) x


bKo(Xi) + Rn2]. Note that

~~~~~~~~n 2
'
n-1 UiF"(Mi)PK(Xi)PK(Xi) = Op(K2/n).
i=l
Then

Ynl[(QK + Rnl)-1- QK-]QK112

= YnIl(QK + Rnl)-lRnlill2
= Trace{y,n['Rnl(QK + Rnl) 2Rnl4]}
= Op(ll''Rnl 112)

= Op + [P(x)'(nK - ,o0)]2 dx + sup lbo(x)|l2}


(')Op{K2/n xeX

=Op(~'l)Op(K2/n + KIlK -OKI + K-3).

Setting =- -1Z'U /n and applying Lemma 5 yields [(QK + Rn)-1 - QK ] x


Z'Ul/n 12 = Op[K3/n + (K2/n)llI0K - 0o112 + l/(nK2)]. If =- QK[n-1 x

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2434 J. L. HOROWITZ AND E. MAMMEN

1=lF'(Mi)2 P(Xi)bKo(Xi) + Rn2], then applying Lemma 6 and using the result
IIQ,-Rn2 I =Op(K-2) yields
- -- 2
n-
[(QK + R1n)-1 - Q ] n-1 F'(Mi)ZKibKO(Xi) + Rn2
- i =1

= + 1/K5).
Op(lIOnK- OKO12/K
It follows from these resultsthat
n
OnK- OKO= n-l Q 1 F'[,U + m(Xi)]PK(Xi)Ui
i=1
n
+- n-QK E Fl[L+ + m (Xi)]P i)bK(Xi) + Rn,
i=l

where IIRnII= Op(K3/2/n + n-1/2). Part (d) of the theorem now follows from
Lemma4. Part(b) follows by applyingLemmas5 and6 to part(d). Part(c) follows
from part(b) and AssumptionA5(iii). D

7.2. Theorem 2. This section begins with lemmas that are used to prove
Theorem 2. For any = (x2,...,xd) E [-1, 1]d-1, set m_l(x) = m2(x2) +
? + md(xd), and bKo(x) = u + m-1() - PK(x)OKo,where

PK(x) = [1, 0,..., , p1(x2) ..., P (2)


.... , (Xd), ... PK(X)]
and
= (Aq, 21...
,KO 0 029 0,\.. 02 , O,
... ... OdK)
In other words, P and 0KOare obtained by replacing pl (x), ..., PK(xd) with zeros
in PK and 011, ...,1K with zeros in 0KO.Also define
n
ni() = n-I PK()Y Q , F'[/t + m(Xj)]PK(Xj)Uj
j=1
and
n
n2(X) = n-PK(x) QK F'[u + m(Xj)]2PK(Xj)b,o(Xj).
j=1

Forxl E [-1, 1] and for j =0, 1 define


n
= (nh)-1/2 F'[ + m (x ) + m_1 (.i)]2(X] - x l)j
Hnjl(xl) ~
i=l

x Kh(x --X])(nl(Xi),

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2435
n

Hnj2(xl) =(nh)-l/2 F'[yI + ml(xl)+m-l(xi)]2(Xi -x )j


i=l

x Kh(x - Xi)8n2(Xi)

and
n

Hnj3(xl) = -(nh)-1/2 yF'[iL + mml(xl) + m_(Xi)]2(Xl -x)j


i=l

x Kh(X -
Xi)bKO(Xi).

Let V(x) = Var(UIX = x).

LEMMA 7. For j = 0, 1 and k = 1,2,3, Hnjk(xl) = Op(1) as n - oo


uniformly over x1 [-1, 1].

PROOF. The proof is given only for j = 0. Similar arguments apply for j = 1.
First consider Hnoi (x1). We can write
n
Hnol(xl) =) aj (xl)Uj,
j=1

where
n
aj(xl) n-3/2h-1/2 F'[tL + ml(x1) + m_(i)]2
i=l

x Kh(x - X) PK(Xi) Q F[ + m(Xj)]P(Xj)


n
=n-3/2h- 1/2 Kh(x - X1)Ai(xl
i=l

Define

a(xl)= F[l + ml(xl) + m-(x)]2(x)fx( x)dx

Then arguments similar to those used to prove Lemma 1 show that aj(x1) =
(h/n)l/2[c (x ) + rnl] QK F'[tu + m(Xj)]PK(Xj), where rn is uncorrelated with
the Uj's and lrnll = O[(logn)/(nh)1/2] uniformly over x E [-1, 1] almost
surely. Moreover, for each x1 E [-1, 1], the components of a (x) are the Fourier
coefficients of a function that is bounded uniformly over x1. Therefore,

(7.2) sup a(x)'a(x ) < M


Ixll<1

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2436 J. L. HOROWITZAND E. MAMMEN

for some finite constant M and all K = 1,..., oo. It follows from (7.2) and the
Cauchy-Schwarz inequality that
n 2

a(x1)'(h/n)1/2 X UjQ 1F'[t + m(Xj)]PK(Xj)


j=l
n 2
<M (h/n)1/2 UUj QKF'[F +m(Xj)]PK(Xj)
j=l
But
n 2
E (h/n)'/2 Uj Q-F'[L+ m(Xj)]PK(Xj) = (h),
j=l
so it follows from Markov's inequality that
n
a(x)'(h/n) 1/2 Uj Q- F'[t +m(Xj)]PK(Xj) = Op(h 12)
j=l

uniformly over xl E [-1, 1]. This and lrnll= O[(logn)/(nh)1/2] establish the
conclusion of the lemma for j = 0, k = 1.
We now prove the lemma for j = 0, k = 2. We can write
n
Hno2(x) = (nh)-1/2 F'[I + ml (x) + m_ (Xi)]2Kh(x -X)P(Xi)'Bn,
i=l
where
n
Bn = n-Q1 1 F'[/1 + m(Xj)]2PK(Xj)bKo(Xj).
j=1

Arguments like those used to prove Lemma 6 show that ElIBn112= O(K-4).
Therefore,
n
sup IHno2(xl)I = sup (nh)- 1/2L Kh(x1 - X)') . Op(K-2)
Ixll<1 i=1
Ixll<1
= Op(n1/21/2K -2)
=
Op(l).
For the proof with j = 0, k = 3, note that
n
sup |Hno3(x)l = sup (nh) 1/2 Kh(x' - X) Op(K 2)
lx'11<1 Ix <1 i-=
= 0p(l). V

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2437

Let PK(x ,X) = [1, p(x ), ., pK(X), p(X2),.., PK(X2),...,pl(Xd),


.. PK(X)] and bK(x', Xi) = zt + ml(x1) + m_l (Xi) - PK(x1, Xi)OKo. De-
fine
n

8n3(x1, Xi) = n- PK(x1, Xi)'QK E F'[iU + m(Xj)]PK(Xj)Uj


j=1

and
n
8n4(I1, Xi) = n- PK(x1, Xi)'QK- F'[cI + m(Xj)]2P (Xj)bo(Xj).
j=1

Also, for j = 0, 1 define


n

Lnjl (xl) = (nh)-1/2 Ui F"[u + ml (xl) + m-_ (Xi)](X - x)j


i=1

x Kh (1 - XJ)n3(x1, Xi),
n

Lnj2(xl) = (nh)-1/2 ~ Ui F"[tU + ml (xl) + m-l (Xi)](X -x xl)j


i=l

x Kh(x-1 XJ) n4(xl., i)


n

Lnj3(xl) = -(nh)-1/2 Ui F"[ + ml(xl) + m-_ (Xi)](X - xl)j


i=1

x Kh(X1 - XJ)bK(x1, Xi),


n

Lnj4(x1) = (nh)-1/2
" {F[t + ml (XJ) + m-_1(Xi)]
i=l
-F[,U +ml (xl) + m (X)]}
x F"[,a + ml (x) + m- (Xi)](X1 -xl)j
x Kh(xl - Xl)8n3(x 1, Xi),
n

Lnjs(xl) = (nh)-1/2 j{F[/z + ml (X1) + m-l (Xi)]


i=l

-F[,U + ml (xl) + m_l(Xi)]}


x F"[U +ml (xl) + m-l (Xi)](XJ -xl)j

x Kh (X1 - Xl1)n4(x1, Xi)

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2438 J. L. HOROWITZAND E. MAMMEN

and
n

Lnj6(xl) =-(nh)-1/2 Z{F[ +mml (Xl) + m-l(Xi)]


i=l

-F[, +ml(xl) +m-l(Xi)]}


x F"[,U +ml (x) + (Xi)](X-
m-l -xl)
x Kh(xl- XJ)bKo(xl, Xi).

LEMMA 8. As n - oo, Lnjk(x1) = Op(l) uniformly over x1 E [-1, 1] for


each j =O, , k = 1,...,6.

PROOF. The proof is given only for j = 0. The arguments are similar for
j = 1. By Theorem 1, Sn4(x, Xi) is the asymptotic bias component of the
stochastic expansion of PK(xl, Xi)(OnK- OKO)and is Op(K-3/2) uniformly over
(x1, Xi) E [-1, l]d. This fact and standard bounds on
n
sup IUi IKh(xl - XI)
Ixll <i=l
and
n

sup LKh(xl-Xl)
Ixlll i=l
establish the conclusion of the lemma for j = 0, k = 2, 5. For j = 0, k = 3, 6,
proceed similarly using
sup Ib,K(x)l O(K-2)
Ixll<1

For j = 0, k = 4, one can use arguments similar to those made for Hnol (x1) in the
proof of Lemma 7. It remains to consider Lno (x 1) = Dn (x )Bn, where
n

Dn(xl) = (nh)- /2 Ui F [Rl + ml (xl) + m-l (Xi)]Kh (x l-x- ) PXK(x


1,i)
i=l

and
n

Bn = n-Q l F'[/L + m(Xj)]2PK (Xj)Uj


j=l

Now, El Bn 112= O(Kn-1), and Dn(x ) contains elements of the form


n

(nh)-l/2pr(xl) Ui F"[t + ml (x1) + m_l(Xi)]Kh(xl - X)


i=l

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
MODELWITHA LINKFUNCTION
ADDITIVEREGRESSION 2439

and
n

(nh)-l/2 E UiPr (X)F"[t, + ml(xl) + m-l(Xi)lKh(x - X)


i=l

for 0 < r < K, 2 < e < d. These expressions can be bounded uniformly over
1
Ix|l 1 by terms that are Op[(logn)l/2lpr(xl)l] and Op[(logn)l/2], respectively.
This gives

sup IIDn(xl)112= Op(Klogn).


Ix11<1
Therefore,

sup ILnoi(x) 12< sup IIDn(x)11211Bn112


=Op(1).
Ixlix<lXl<l C

LEMMA9. Thefollowing hold uniformly over Ixl < 1 -h:

(nh)-1Sno1(x1, mt) = Do(x1) + op(1),

(nh)-1Sn21(x1, t) = AKh2Do(xl)[ + Op(l)]

and

(nh)-'Sll (xl, m) = h2AKDl(xl)[ + op(1)].

PROOF. This follows from Theorem l(c) and standard bounds on


n

sup L U['(Xl - xl)sKh(x -XJ)


\x<1 i=I

forr=0, 1, s=0, 1,2. D

Define Aml (xl) = ml (xl) - ml (xl), Am_l(x) = i - /z + m_r(X) - m-1 (x)


and Am(x1, ) = Aml(xl) + Am_l(x).

LEMMA 10. Thefollowing hold uniformly over Ixll < 1 - h:


(a) (nh)-l/2S0ol(xl, -m)=(nh)-l/2Snol(xl,m) +(nh)l/2Do(x')Aml(xl)+
Op(1),
(b) (nh)-l/2Snll(xl, m) = (nh)-l/2Snl(xl, m) + Op(1).

PROOF. Only (a) is proved. The proof of (b) is similar. For each i=
1,..., n, let m*(x ,Xi) and m**(x1, Xi) denote quantities that are between
i+ iml(xl)+ m_l(Xi) and / + ml(xl) + m_l(Xi). The values of m*(xl, Xi)

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2440 J. L. HOROWITZAND E. MAMMEN

and m**(x1, Xi) may be differentin differentuses. A Taylorseries expansionand


Theorem 1(c) give

4
-
(nh)-r/2Snol(x1, m) (nh)-r2S'o01 (x, m) + Jnj (x)
j=1

+n(nh) -/20 sup IIAm(x1' )ll3


-(X 1,)eX
4
= (nh)-/2S 1(x, m) + Jnjm(x) + Op(l)
j=1

uniformly over Ixll < 1 - h, where


n
Jnl(xl) = 2(nh)-1/2 Y FI'[L + ml(xl) + m_l(Xi)]2Kh(xl- Xl)Aml(xl),
i=l1
n
Jn2(l1) = 2(nh)- 1/2 F4'[ + ml(x) +m_l(Xi)]2
i=l

x Kh(x -XJl)Am_l (Xi),


n
Jn3(xl) = -2(nh)-1/2 {Yi - F[t + ml (xl) + m- (Xi)]}
i=1

x F"[m*(x, Xi)]Kh(x -XX)Am(xl,Xi),


n
Jn4(Xl) = 2(nh)-l/2 ) F'[f + ml(xl) + m-l(Xi)]
i=l

x {F"[m*(xl, Xi)] + 2F"'[m**(xl, Xi)]}


x Kh(x1 - [m(x1, i)]2.

It follows from Theorem 1(d) and Lemma 7 that Jn2(X1) = 2 3= HnOk(x1) +


Op(l) = Op(l) uniformly over Ixll < 1 - h. In addition, it follows from
Theorem 1(c) that for some constant C < oo,

n -

Jn4(xl) < C(nh)- /2 Kh(xl -Xl) sup l/Am(xl,x)112


i=1 -(x 1,)E X

=Op (nh)1/2 sup IIAm(xl,x)ll2 = op(l)


(x 1,x)EX

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVE REGRESSIONMODEL WITHA LINK FUNCTION 2441

uniformly over Ix1I < 1 - h. Now consider J43(X1).- It follows from Assump-
tion A3(v) that
n

i=1

x Kh(X' - XJ)Am(x1, Xj)


? Op (nh)1/2 SUPIAM(Xl,i)12

6-

L:Lnok(X I)+0Op (nh) I/2SUPIAM(XI'_i)j2 ?oP(l)


k=I xXE
uniformlyover I1x1I< 1 - h. Therefore,Jn3(X1) = op(l) uniformlyby Lemma 8
and Theorem 1(c), and

uniformlyover Ijx1j 1<I - h.


Now consider J, I(x 1). Set
n

1) 1= ( (nh
in =1/2 F[t+mI( )+MI(k)2K(I-X)

It follows from Theorem 2.37 of Pollard (1984) that Jnl(xl) - E[Jn1(xl1] =


o(logn) almost surely as n --* oo. In addition, E[(nh) 11 l~(Xl)] =
D (x 1) + 0 (h2). Therefore,
JnI(x1) = (nh)1/2D0(x1)Aml(xl) ? 0[lognAmi(xl)]
(nh)1/2D0(x1)Am1(xl) + op(l)
=

11< 1 - h. El
uniformlyover I1x

PROOFOF THEOREm2. By the definitionof 6'I (x 1),

(73)I(x1 Im(x1) mI )

Part (a) follows by applying Lemmas 9 and 10 to the right-handside of (7.3).


Define

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
2442 J. L. HOROWITZ AND E. MAMMEN

Methods identical to those used to establish asymptotic normality of local-linear


estimators show that E(n2/5w) = B1+ o(l), Var(n2/5w) = Vi(xl) + o(l) and
n2/5 [m (x1) - ml (xl)] is asymptotically normal, which proves part (b). D

PROOFOF THEOREM4. It follows from Theorem 2(a) that

n4/5 f w(Xl)[-l(xl) ml (xl)]2 dx Op(1).


1-h<lxll<1
Now consider

n4/5 [ w(x^)[tl(xl) -ml(x1)]2dx1.


\X11<1-h

By replacing the integrand with the expansion of Theorem 2(a), one obtains
a U-statistic in Ui conditional on X1, ...,Xn. This U-statistic has vanishing
conditional variance. 1]

PROOF OF THEOREM5. Use Theorem 2(a) to replace mhl with ml in the


expression for m ). The result now follows from standard methods for bounding
kernel estimators. D

REFERENCES
BOsQ, D. (1998). NonparametricStatisticsfor Stochastic Procesess. Estimation and Prediction,
2nd ed. LectureNotes in Statist. 110. Springer,Berlin.
BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal transformationsfor multiple
regression and correlation (with discussion). J. Amer. Statist. Assoc. 80 580-619.
BUJA, A., HASTIE, T. and TIBSHIRANI, R. J. (1989). Linear smoothers and additive models (with
discussion).Ann. Statist. 17 453-555.
CHEN, R., HARDLE, W., LINTON, O. B. and SEVERANCE-LOSSIN,E. (1996). Nonparametric esti-
mation of additive separable regression models. In Statistical Theory and Computational
Aspects of Smoothing (W. Hardle and M. Schimek, eds.) 247-253. Physica, Heidelberg.
FAN, J. and CHEN, J. (1999). One-step local quasi-likelihood estimation. J. R. Stat. Soc. Ser. B Stat.
Methodol.61 927-943.
FAN, J. and GIJBELS,I. (1996). Local PolynomialModelling and Its Applications. Chapmanand
Hall, London.
FAN,J., HARDLE,W. and MAMMEN,E. (1998). Direct estimationof low-dimensionalcomponents
in additive models. Ann. Statist. 26 943-971.
HASTIE, T. J. and TIBSHIRANI, R. J. (1990). Generalized Additive Models. Chapman and Hall,
London.
HOROWITZ, J. L., KLEMELA, J. and MAMMEN, E. (2002). Optimal estimation in additive
regression models. Working paper, Institut fir Angewandte Mathematik, Ruprecht-Karls-
Univertsitit,Heidelberg,Germany.
LINTON, O. B. and HARDLE, W. (1996). Estimating additive regression models with known links.
Biometrika83 529-540.
LINTON,O. B. and NIELSEN,J. P. (1995). A kernelmethodof estimatingstructurednonparametric
regression based on marginal integration. Biometrika 82 93-100.

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions
ADDITIVEREGRESSIONMODELWITHA LINK FUNCTION 2443

MAMMEN,E., LINTON,O. B. and NIELSEN,J. P. (1999). The existence and asymptoticproperties


of backfittingprojectionalgorithmunderweak conditions.Ann. Statist.27 1443-1490.
NEWEY, W. K. (1997). Convergence rates and asymptotic normality for series estimators.
J. Econometrics79 147-168.
OPSOMER, J. D. (2000). Asymptotic propertiesof backfittingestimators.J. MultivariateAnal. 73
166-179.
OPSOMER,J. D. and RUPPERT, D. (1997). Fitting a bivariateadditive model by local polynomial
regression.Ann. Statist.25 186-211.
POLLARD,D. (1984). Convergenceof StochasticProcesses. Springer,New York.
SERFLING, R. J. (1980). ApproximationTheoremsof MathematicalStatistics.Wiley, New York.
STONE,C. J. (1985). Additiveregressionand othernonparametricmodels. Ann. Statist. 13 689-705.
STONE,C. J. (1986). The dimensionality reduction principle for generalized additive models.
Ann. Statist. 14 590-606.
STONE,C. J. (1994). The use of polynomialsplines andtheirtensorproductsin multivariatefunction
estimation(with discussion).Ann. Statist.22 118-184.
TJ0STHEIM,D. and AUESTAD,B. H. (1994). Nonparametricidentificationof nonlineartime series:
Projections.J. Amer.Statist.Assoc. 89 1398-1409.
DEPARTMENTOF ECONOMICS FAKULTATFUR VOLKSWIRTSCHAFTSLEHRE
NORTHWESTERNUNIVERSITY UNIVERSITATMANNHEIM
EVANSTON, ILLINOIS 60208-2600 VERFUGUNGSGEBAUDEL7, 3-5
USA 68131 MANNHEIM
E-MAIL:joel-horowitz@northwestem.edu GERMANY
E-MAIL:emammen@rumms.uni-mannheim.de

This content downloaded from 128.164.240.105 on Tue, 7 Jan 2014 11:33:28 AM


All use subject to JSTOR Terms and Conditions

You might also like