…0 for all x. Condition (2.31) states that A being positive definite is sufficient for a_1^0, …, a_q^0 to be finite. (Note that (2.31) is a sufficient condition; a weaker necessary and sufficient condition for a_1^0, …, a_q^0 to all be positive and finite can be found in Li and Zhou (2005).) We give an intuitive proof of (2.31) here and relegate a more rigorous proof to Exercise 2.2. First, note that A being positive definite implies that a_s^0 < ∞ for all s, for otherwise we would have χ = ∞. Next, we have a_s^0 > 0 for all s, for otherwise the variance term κ^q/(a_1^0 ⋯ a_q^0) = ∞. Thus, we must have 0 < a_s^0 < ∞ for all s.

Theorem 2.3 only covers the case whereby all components of x are relevant. As we discussed in Section 2.2.1, when q = 2 and g(x) = g(x_1) for all (x_1, x_2) ∈ R^2 (x_2 is an irrelevant regressor), then (2.28) does not hold. However, we can show that the cross-validation method still leads to optimal smoothing parameter selection, as the optimal smoothing in this case should have the property that ĥ_1 = O_p(n^{-1/5}) and ĥ_2 → ∞, as we demonstrate in Section 2.2.4.

2.2.3 AIC_c

An alternative bandwidth selection method having impressive finite-sample properties was proposed by Hurvich, Simonoff and Tsai (1998). Their approach is based on an improved Akaike information criterion (AIC; see Akaike (1974)). Hurvich et al.'s criterion provides an approximately unbiased estimate of the expected Kullback-Leibler information for nonparametric models. Akaike's original information criterion was designed for parametric models, while Hurvich et al.'s approach is valid for estimators that can be written as linear combinations of the outcome, hence is directly applicable to a wide range of nonparametric estimators. Their criterion is given by

AIC_c = ln(σ̂^2) + [1 + tr(H)/n] / [1 − (tr(H) + 2)/n],

where

σ̂^2 = (1/n) Σ_{i=1}^n [Y_i − ĝ(X_i)]^2 = Y′(I − H)′(I − H)Y/n,

with ĝ(X_i) being a nonparametric estimator and H an n × n weight matrix (i.e., the matrix of kernel weights).
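To make the criterion concrete, the following minimal sketch (Python with NumPy; the function names, the product Gaussian kernel, and the simulated data are our own illustrative choices, not part of the original treatment) computes AIC_c for the local constant estimator and minimizes it over a crude bandwidth grid:

```python
import numpy as np

def lc_weight_matrix(X, h):
    """n x n local constant (Nadaraya-Watson) smoother matrix H with
    H[i, j] = K_h(X_i, X_j) / sum_l K_h(X_i, X_l), using a product
    Gaussian kernel with bandwidth vector h (normalizing constants
    cancel in the ratio)."""
    u = (X[:, None, :] - X[None, :, :]) / h        # (n, n, q) scaled differences
    K = np.exp(-0.5 * u ** 2).prod(axis=2)         # product kernel weights
    return K / K.sum(axis=1, keepdims=True)        # rows sum to one

def aic_c(X, Y, h):
    """Hurvich, Simonoff and Tsai's (1998) corrected AIC:
    ln(sigma2_hat) + (1 + tr(H)/n) / (1 - (tr(H) + 2)/n)."""
    n = len(Y)
    H = lc_weight_matrix(X, h)
    resid = Y - H @ Y                              # (I - H)Y
    sigma2 = resid @ resid / n                     # Y'(I-H)'(I-H)Y / n
    tr_H = np.trace(H)
    return np.log(sigma2) + (1 + tr_H / n) / (1 - (tr_H + 2) / n)

# Choose h by minimizing AIC_c over a crude grid (one regressor here).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(100)
grid = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
h_opt = min(grid, key=lambda h: aic_c(X, Y, np.array([h])))
```

In practice one would use a numerical optimizer rather than a grid; the grid keeps the sketch transparent.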
The (i, j)th element of H is given by H_ij = K_h(X_i, X_j)/Σ_{l=1}^n K_h(X_i, X_l), where K_h(X_i, X_j) = Π_{s=1}^q h_s^{-1} k((X_{is} − X_{js})/h_s).

Through simulation experiments, Hurvich et al. (1998) show that the AIC_c bandwidth selector performs quite well compared with the plug-in method (when it is available) and with a number of generalized cross-validation methods (Craven and Wahba (1979)). While there is no rigorous theoretical result available for the optimality of the AIC_c selector, Hurvich et al. conjecture that the AIC_c selector shares the same asymptotic optimality properties as the least squares cross-validation method outlined in Section 2.2.2. Indeed, simulation results reported in Li and Racine (2004a) show that, for small samples, AIC_c tends to perform better than the least squares cross-validation method, while for large samples there is no appreciable difference between the two approaches.

2.2.4 The Presence of Irrelevant Regressors

We now consider the case wherein (i.e., allow for the possibility that) some of the regressors are irrelevant. Without loss of generality, we assume that only the first q_1 components of x are relevant in the sense defined below, and we write x = (x̄, x̃), where x̄ collects the relevant components x_1, …, x_{q_1} and x̃ the remaining (irrelevant) components. We will show that the cross-validated smoothing parameters satisfy ĥ_s → ∞ for s = q_1 + 1, …, q. Thus, asymptotically, cross-validation is capable of automatically removing irrelevant variables. In other words, cross-validation in conjunction with the local constant kernel estimator is capable of automatic dimensionality reduction when some of the regressors are in fact irrelevant.

The cross-validation objective function is the same as that defined in Section 2.2.2 except that now Y_i = ḡ(X̄_i) + u_i with E(u_i|X̄_i) = 0; that is, the conditional mean function now depends only on the relevant regressors X̄_i.

The analogue of the function χ defined in Section 2.2.2 is now modified to (see Exercise 2.5)

χ̄(a_1, …, a_{q_1}) = ∫ [Σ_{s=1}^{q_1} a_s^2 B̄_s(x̄)]^2 f̄(x̄) dx̄ + κ^{q_1} (a_1 ⋯ a_{q_1})^{-1} ∫ σ̄^2(x̄) dx̄,   (2.35)

where σ̄^2(x̄) = E(u^2|X̄ = x̄), and the “bar” notation refers to functions involving only the first q_1 components of x.
Here f̄ (f) denotes the marginal density of X̄ (X), and B̄_s is defined in the same way as B_s except that it is only a function of x̄. Note that the irrelevant components do not appear in the definition of χ̄.

As before, let ā_1^0, …, ā_{q_1}^0 denote the values that minimize χ̄ subject to each of them being nonnegative. We require that

for s = 1, …, q_1, each ā_s^0 is uniquely defined and is finite.   (2.36)

Since we should allow the smoothing parameters for the irrelevant variables to diverge as n → ∞, the conditions used in Section 2.2.2 cannot be imposed here. We impose the following conventional restrictions on the bandwidth and kernel function. Let 0 < ε < 1/(q + 4), and assume that

n^{ε − 1/(q+4)} ≤ min(h_1, …, h_q), max(h_1, …, h_q) ≤ n^C for some C > 0; the kernel k is a symmetric, compactly supported, Hölder-continuous PDF; and k(0) > k(δ) for all δ > 0.   (2.37)

If ε is arbitrarily small, the above conditions on h_1, …, h_q are in essence that the fastest rate at which an h_s may go to zero is bounded by n^{-a} (for some a > 0), and the fastest rate at which an h_s (for an irrelevant variable) may go to infinity is bounded by n^b (for some b > 0). With these conditions we obtain the following result.

Theorem 2.4. Under conditions (2.32), (2.36), and (2.37), letting ĥ_1, …, ĥ_q denote the smoothing parameters that minimize CV_lc, we have

n^{1/(q_1+4)} ĥ_s → ā_s^0 in probability, for 1 ≤ s ≤ q_1,   (2.38)
P(ĥ_s > C) → 1 for q_1 + 1 ≤ s ≤ q and all C > 0.   (2.39)

Theorem 2.4 states that the smoothing parameters for the irrelevant components diverge to infinity in probability; therefore, all of the irrelevant variables can be (asymptotically) automatically smoothed out, while the smoothing parameters for the relevant variables continue to inherit the same optimality properties that would prevail if the irrelevant variables did not exist.

Here we mention that (2.32) is a sufficient, not a necessary, condition for Theorem 2.4 to hold.
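Theorem 2.4's automatic smoothing-out of irrelevant regressors is easy to see numerically. The sketch below (our own illustrative simulation, not from the text) evaluates the leave-one-out cross-validation objective for a local constant estimator when the second regressor is irrelevant by construction; the objective falls as h_2 grows, so cross-validation prefers to smooth X_2 out:

```python
import numpy as np

def loo_cv(X, Y, h):
    """Least squares leave-one-out cross-validation objective for the
    local constant estimator with a product Gaussian kernel."""
    u = (X[:, None, :] - X[None, :, :]) / h
    K = np.exp(-0.5 * u ** 2).prod(axis=2)
    np.fill_diagonal(K, 0.0)                       # leave observation i out
    g_loo = (K @ Y) / K.sum(axis=1)
    return np.mean((Y - g_loo) ** 2)

rng = np.random.default_rng(42)
n = 200
X = rng.uniform(0, 1, size=(n, 2))                 # X_2 is irrelevant by construction
Y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.standard_normal(n)

h1 = 0.1                                           # a reasonable bandwidth for X_1
cv_small_h2 = loo_cv(X, Y, np.array([h1, 0.01]))   # undersmooth the irrelevant X_2
cv_large_h2 = loo_cv(X, Y, np.array([h1, 50.0]))   # smooth X_2 out entirely

# Over a grid, cross-validation picks a large h_2, i.e., it smooths X_2 out.
h2_grid = [0.05, 0.2, 1.0, 5.0, 50.0]
h2_best = min(h2_grid, key=lambda h2: loo_cv(X, Y, np.array([h1, h2])))
```

A tiny h_2 inflates the variance of the leave-one-out fit (the factor R above exceeds one), while a huge h_2 makes the second kernel factor flat, effectively deleting X_2.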
Simulation results reported in Hall et al. (2006) suggest that the conclusion of Theorem 2.4 still holds true if (2.32) is replaced by the condition “conditional on X̄, Y is independent of X̃,” although a rigorous proof of this seems quite challenging and remains a topic for future research.

The complicated proof of Theorem 2.4 is given in Hall et al. (2006). Here we present some intuitive reasons to expect the results of Theorem 2.4 to hold. We call the terms of order η_1 = Σ_{s=1}^{q_1} h_s^2 the leading bias term, and the terms of order η_2 = (n h_1 ⋯ h_{q_1})^{-1} the leading variance term. Hall et al. (2006) show that the presence of irrelevant variables does not contribute to the leading bias term in CV_lc because, by (2.32), the irrelevant components cancel in the ratio ĝ(x) = m̂(x)/f̂(x) (to first order, E[m̂(x)]/E[f̂(x)] does not depend on the irrelevant components). Therefore, the leading bias squared term of order O(η_1^2) behaves as though the irrelevant regressors were absent. The leading variance term, by contrast, is inflated by a factor R(x̄, h_{q_1+1}, …, h_q) ≥ 1 for all choices of h_{q_1+1}, …, h_q. Obviously, R → 1 as h_s → ∞ (q_1 + 1 ≤ s ≤ q). In fact we have the following result:

Result 2.1. The only values of h_{q_1+1}, …, h_q for which R(x̄, h_{q_1+1}, …, h_q) = 1 for some x̄ in the support of the weight function are h_s = ∞ for q_1 + 1 ≤ s ≤ q.

Heuristically, R can be written as R = E[Z_n^2]/(E[Z_n])^2 ≥ 1, with equality only when Z_n is degenerate; in the definition of Z_n, setting h_s = ∞ for all s > q_1 gives Z_n = k(0)^{q−q_1} and var(Z_n) = 0, so that R = 1 in this and only this case.

Therefore, to minimize (2.40) (the leading term of CV_lc(h_1, …, h_q)), noting that both the squared bias and the variance terms are positive, we must have R → 1 as n → ∞, which implies that the smoothing parameters for the irrelevant components must diverge to infinity. Thus, the irrelevant components are asymptotically smoothed out.
Setting R = 1 in (2.40) leads to (2.35), which in turn implies that the smoothing parameters for the relevant variables retain the optimality property given in Theorem 2.4.

If ĝ is computed using the asymptotically optimal deterministic smoothing parameters, then it is easy to show that

√(n h̄_1 ⋯ h̄_{q_1}) [ ĝ(x) − ḡ(x̄) − Σ_{s=1}^{q_1} h̄_s^2 B̄_s(x̄) ] → N(0, Ω̄(x̄))   (2.41)

in distribution, where

Ω̄(x̄) = κ^{q_1} σ̄^2(x̄)/f̄(x̄),   (2.42)

with σ̄^2(x̄) = E(u^2|X̄ = x̄).

The next theorem shows that when the cross-validated smoothing parameters are used rather than the asymptotically optimal deterministic ones, the asymptotic normality result given in (2.41) still holds.

Theorem 2.5. Under the same conditions found in Theorem 2.4, (2.41) remains true if ĝ(x) is computed using the smoothing parameters chosen by cross-validation, i.e.,

√(n ĥ_1 ⋯ ĥ_{q_1}) [ ĝ(x) − ḡ(x̄) − Σ_{s=1}^{q_1} ĥ_s^2 B̄_s(x̄) ] → N(0, κ^{q_1} σ̄^2(x̄)/f̄(x̄)).

The proof of Theorem 2.5 is given in Hall et al. (2006). It can be understood intuitively as follows. First, note that as ĥ_s → ∞ for s = q_1 + 1, …, q, we have ĝ(x; ĥ_1, …, ĥ_q) = ĝ(x̄; ĥ_1, …, ĥ_{q_1}) + (s.o.), where ĝ(x̄; h_1, …, h_{q_1}) is a kernel estimator of ḡ(x̄) using only the relevant regressors. Next, given the fact that ĥ_s − h_s^0 = o_p(h_s^0) (s = 1, …, q_1), we expect that ĝ(x̄; ĥ_1, …, ĥ_{q_1}) = ĝ(x̄; h_1^0, …, h_{q_1}^0) + o_p(Σ_{s=1}^{q_1} (h_s^0)^2). Theorem 2.5 follows from this result and (2.41).

Note that condition (2.32) is quite strong. It not only assumes that X̃ is independent of Y, but also requires that X̃ be independent of X̄. Ideally, one would like to relax this condition to either (i) E[Y|X̄, X̃] = E[Y|X̄] almost surely, or (ii) conditional on X̄, Y is independent of X̃. Hall et al. (2006) conjecture that Theorem 2.4 remains true when (2.32) is relaxed and replaced by condition (i) or (ii) above. The simulation results reported in Hall et al. provide some evidence in support of this conjecture.
2.2.5 Some Further Results on Cross-Validation

Racine and Li (2004) also established the rate of convergence of ĥ_s/h_s^0 − 1, which is given by

(ĥ_s − h_s^0)/h_s^0 = O_p(n^{-α/(q+4)}),   (2.43)

where α = min{q/2, 2}. When q = 1, we get (ĥ − h^0)/h^0 = O_p(n^{-1/10}), which coincides with the result obtained by Härdle et al. (1988), among others. Equation (2.43) implies that ĥ_s converges to the nonstochastic optimal smoothing parameter h_s^0 in probability.

Equation (2.43) shows that (ĥ_s − h_s^0)/h_s^0 = O_p(n^{-q/(2(q+4))}) for q ≤ 4, and (ĥ_s − h_s^0)/h_s^0 = O_p(n^{-2/(q+4)}) for q > 4. The reason for having two different expressions for (ĥ_s − h_s^0)/h_s^0 (depending on whether or not q > 4) is that, in the higher order expansions of the cross-validation function CV_lc(h_1, …, h_q), we have terms of order O_p(Σ_{s=1}^q h_s^4) and a second order degenerate U-statistic of the form

[n(n − 1)]^{-1} Σ_i Σ_{j≠i} u_i u_j K_h(X_i, X_j),

where E(u_i|X_i) = 0, which has an order of O_p((n (h_1 ⋯ h_q)^{1/2})^{-1}); see Appendix A for a general treatment of degenerate U-statistics. When q ≤ 4, the O_p((n (h_1 ⋯ h_q)^{1/2})^{-1}) term dominates the O_p(Σ_s h_s^4) term since h_s ~ O_p(h_s^0) = O_p(n^{-1/(q+4)}), while when q ≥ 5, the O_p(Σ_s h_s^4) term becomes the dominating term. Therefore, the rate of convergence differs depending on whether q ≤ 4 or q ≥ 5 (see Racine and Li (2004) for a detailed proof).

2.3 Uniform Rates of Convergence

Proceeding along lines similar to those used when deriving the uniform almost sure rate for the density estimator discussed in Section 1.12 of Chapter 1, we can establish the uniform almost sure convergence rate of ĝ(x) to g(x) for x ∈ S, where S is a compact subset of R^q that excludes the boundary region of the support of X.

Condition 2.1. Assume that

(i) f(x) is differentiable and g(x) is twice differentiable, and the derivative functions all satisfy the Lipschitz condition |m(x) − m(v)| ≤ C|x − v| for some C > 0 (m(·) = g_s(·) or f_s(·));

(ii) σ^2(x) = E(u^2|x) is a continuous function, and inf_{x∈S} f(x) ≥ δ > 0;

(iii) the kernel k(·)
is symmetric, bounded, and has compact support (say k(v) = 0 for |v| > 1). Defining H_l(v) = |v|^l k(v), we assume that |H_l(v) − H_l(u)| ≤ C_2|v − u| for all 0 ≤ l ≤ 3.

Theorem 2.6. Under Condition 2.1, we have

sup_{x∈S} |ĝ(x) − g(x)| = O( Σ_{s=1}^q h_s^2 + (ln n/(n h_1 ⋯ h_q))^{1/2} ) almost surely.   (2.44)

The proof of (2.44) is quite similar to the proof of Theorem 1.4 of Chapter 1. Writing ĝ(x) − g(x) = m̂_1(x)/f̂(x), where m̂_1(x) = [ĝ(x) − g(x)] f̂(x), one can show that

sup_{x∈S} |m̂_1(x)| = O( Σ_{s=1}^q h_s^2 + (ln n/(n h_1 ⋯ h_q))^{1/2} ) a.s.,

and Theorem 1.4 implies that inf_{x∈S} f̂(x) ≥ δ′ > 0 a.s., where δ′ is a positive constant. Hence,

sup_{x∈S} |ĝ(x) − g(x)| = sup_{x∈S} |m̂_1(x)/f̂(x)| ≤ sup_{x∈S} |m̂_1(x)| / inf_{x∈S} f̂(x) = O( Σ_{s=1}^q h_s^2 + (ln n/(n h_1 ⋯ h_q))^{1/2} ) a.s.

Having examined the local constant kernel estimator in detail, we now turn to another popular method, the local polynomial approach.

2.4 Local Linear Kernel Estimation

Though the local constant estimator is the workhorse of kernel regression, it is not without its idiosyncrasies. In particular, it has potentially large bias when estimating a regression function near the boundary of support. The local linear estimator proposed by Stone (1977) and Cleveland (1979), on the other hand, shares many of the properties of the local constant approach (e.g., their variances are identical), while it is one of the best known approaches for boundary correction, as its bias is not a function of the design density f(x); see Fan (1992), Fan (1993), and Fan and Gijbels (1992). For an extensive treatment of the local linear estimator, see the excellent monograph by Fan and Gijbels (1996).

We again consider a regression model of the form

Y_i = g(X_i) + u_i, i = 1, …, n.   (2.45)

Recall that the local constant estimator of g(x) is given by

ĝ(x) = Σ_i Y_i K((X_i − x)/h) / Σ_i K((X_i − x)/h),   (2.46)

which can also be obtained as the solution for a in the following minimization problem:

min_a Σ_{i=1}^n (Y_i − a)^2 K((X_i − x)/h).   (2.47)

Letting â = â(x) be the solution that minimizes (2.47), one can easily see that â = ĝ(x) as defined in (2.46). Note, however, that (2.47) uses a constant a to approximate g(x) (or Y) in the neighborhood of x, since (2.47) only uses a local (X_i's close to x) average of the Y_i's to estimate g(x).
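The equivalence of (2.46) and (2.47) is easy to verify numerically: the minimizer of the kernel-weighted sum of squares with a constant fit is just the kernel-weighted average. A small sketch (the names and simulated data are our own):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=50)
Y = X ** 2 + 0.1 * rng.standard_normal(50)

x0, h = 0.5, 0.2
K = np.exp(-0.5 * ((X - x0) / h) ** 2)     # Gaussian kernel weights

# (2.46): the local constant (Nadaraya-Watson) estimator as a weighted average.
g_nw = np.sum(K * Y) / np.sum(K)

# (2.47): the minimizer of sum_i (Y_i - a)^2 K_i, i.e., weighted least squares
# of Y on a constant; solved here as a standard least squares problem.
w = np.sqrt(K)
a_hat = np.linalg.lstsq(w[:, None], w * Y, rcond=None)[0][0]
```

The two numbers agree to floating-point precision, since the weighted least squares solution for a constant regressor has the closed form Σ K_i Y_i / Σ K_i.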
For this reason, ĝ(x) defined in (2.46) is called a “local constant” kernel estimator. However, one could instead use a “local linear” (or higher order polynomial) estimator to estimate g(x). One feature of a local linear method is that it automatically provides an estimator for g^{(1)}(x) ≡ ∂g(x)/∂x, the first order derivative of g(x), though the local constant partial derivative estimator is also well studied; see, for example, Vinod and Ullah (1988), Rilstone and Ullah (1989), Härdle and Stoker (1989) and Pagan and Ullah (1999, Section 4.2) for further details.

The local linear method is based on the following minimization problem:

min_{a,b} Σ_{i=1}^n [Y_i − a − (X_i − x)′b]^2 K((X_i − x)/h).   (2.48)

Let â = â(x) and b̂ = b̂(x) be the solutions to (2.48). We will show in this section that â(x) is a consistent estimator of g(x) and b̂(x) is a consistent estimator of g^{(1)}(x) = ∂g(x)/∂x (both b̂ and g^{(1)}(x) are q × 1 vectors). The local linear approach given in (2.48) is easily understood because it is analogous to a “local least squares” estimator; hence the slope estimator b̂ estimates the local slope g^{(1)}(x).

Let δ = δ(x) = (a(x), (b(x))′)′, let Y be the n × 1 vector having ith component Y_i, let 𝒳 be an n × (1 + q) matrix with ith row (1, (X_i − x)′), and let K(x) be an n × n diagonal matrix having ith diagonal element K((X_i − x)/h). Then in vector-matrix notation, (2.48) can be written as

min_δ (Y − 𝒳δ)′ K(x) (Y − 𝒳δ).   (2.49)

Equation (2.49) is a standard generalized least squares (GLS) problem. Let δ̂ = (â, b̂′)′ be the solution of (2.49). Then using the standard GLS formula we have

δ̂(x) = (𝒳′K(x)𝒳)^{-1} 𝒳′K(x)Y.   (2.50)

The following condition is needed to establish the consistency and asymptotic normality of δ̂(x) given in (2.50).

Condition 2.2.

(i) (X_i, Y_i) are i.i.d.; g(x), f(x) and σ^2(x) = E(u^2|x) are twice differentiable.

(ii) K is a bounded second order kernel.

(iii) As n → ∞, n h_1 ⋯ h_q → ∞ and n h_1 ⋯ h_q Σ_{s=1}^q h_s^6 → 0.

The next theorem establishes the asymptotic normality of δ̂(x).
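A minimal sketch of the GLS formula (2.50) in code (Python/NumPy; the function name and the product Gaussian kernel are our own choices). With noiseless linear data the estimator reproduces the intercept and slopes exactly, whatever the bandwidth, illustrating the zero-bias property of the local linear fit for linear g:

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Local linear estimator at x via the GLS formula (2.50):
    delta_hat = (Z'K(x)Z)^{-1} Z'K(x)Y with ith design row (1, (X_i - x)').
    Returns (a_hat, b_hat), estimating g(x) and its gradient."""
    u = (X - x) / h
    K = np.exp(-0.5 * u ** 2).prod(axis=1)          # product Gaussian weights
    Z = np.column_stack([np.ones(len(Y)), X - x])   # ith row (1, (X_i - x)')
    ZtK = Z.T * K                                    # Z'K(x)
    delta = np.linalg.solve(ZtK @ Z, ZtK @ Y)
    return delta[0], delta[1:]

# With noiseless linear data the local linear fit is exact at any point.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(80, 2))
Y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
a_hat, b_hat = local_linear(np.array([0.4, 0.6]), X, Y, np.array([0.3, 0.3]))
```

Here g(0.4, 0.6) = 1 + 0.8 − 0.6 = 1.2, and the gradient is (2, −1)′ everywhere.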
Theorem 2.7. Recall that κ = ∫k(v)^2 dv and κ_2 = ∫k(v)v^2 dv, define κ_{22} = ∫v^2 k(v)^2 dv, and let

B(x) = ( (κ_2/2) Σ_{s=1}^q h_s^2 g_{ss}(x), 0′ )′,   Σ_δ(x) = diag( κ^q σ^2(x)/f(x), (κ^{q−1} κ_{22}/κ_2^2) σ^2(x) f(x)^{-1} I_q ),

where D_h is a q × q diagonal matrix with sth diagonal element h_s, and I_q is an identity matrix of dimension q. Then under Condition 2.2 we have

√(n h_1 ⋯ h_q) [ ( â(x) − g(x), (D_h(b̂(x) − g^{(1)}(x)))′ )′ − B(x) ] → N(0, Σ_δ(x)).   (2.51)

The proof of Theorem 2.7 is given in Section 2.7.2.

Note that Theorem 2.7 implies the following different rates of convergence:

√(n h_1 ⋯ h_q) [ â(x) − g(x) − (κ_2/2) Σ_{s=1}^q h_s^2 g_{ss}(x) ] → N( 0, κ^q σ^2(x)/f(x) ),   (2.52)

and

√(n h_1 ⋯ h_q h_s^2) [ b̂_s(x) − g_s(x) ] → N( 0, (κ^{q−1} κ_{22}/κ_2^2) σ^2(x)/f(x) ) for s = 1, …, q,   (2.53)

where g_s(x) = ∂g(x)/∂x_s is the sth component of g^{(1)}(x).

Let ĝ(x) = â(x) and ĝ_s(x) = b̂_s(x) denote the local linear kernel estimates of g(x) and g_s(x), respectively. Then ĝ(x) − g(x) = O_p(η_1 + η_2^{1/2}) and ĝ_s(x) − g_s(x) = O_p(η_1 + η_2^{1/2} h_s^{-1}), where η_1 = Σ_{s=1}^q h_s^2 and η_2 = (n h_1 ⋯ h_q)^{-1}. This is a standard result, i.e., the convergence rate for derivative estimation is slower than that for regression function estimation. For estimating an lth order derivative of g(x), the variance part of the rate deteriorates by a further factor of h^{-l} (with a common bandwidth h). We discuss higher order derivative estimators in Section 2.5.

Consistent estimation of g(x) requires that h_s → 0 (s = 1, …, q) and n h_1 ⋯ h_q → ∞ in order for the bias and the variance to both converge to zero, while consistent estimation of g_s(x) requires an even stronger condition, namely
n h_1 ⋯ h_q h_s^2 → ∞ (s = 1, …, q), in order for the variance term of the derivative estimator to converge to zero.

Note that when the underlying regression function is in fact linear in x (i.e., g(x) = a_0 + x′a_1, where the scalar a_0 and the q × 1 vector a_1 are constants), the local linear estimator exhibits the property that the bias term is zero for any value of (h_1, …, h_q) (see Exercise 2.10). Thus, the local linear estimator is unbiased when g(x) is linear in x. When the bias is zero we could allow h_s → ∞ (s = 1, …, q), and it is easy to show in this instance that the local linear estimator collapses to ĝ(x) = â_0 + x′â_1, where â_0 and â_1 are the ordinary least squares estimators of a_0 and a_1, respectively (see Exercise 2.6). Thus, the local linear estimator nests the least squares estimator as a special case if one allows h_s to assume sufficiently large values for all s = 1, …, q. We come back to this point when we discuss cross-validated smoothing parameter selection for the local linear estimator in the next section, since it has important implications for both bandwidth selection and rates of convergence.

2.4.1 Local Linear Bandwidth Selection: Least Squares Cross-Validation

Let ĝ_{-i}(X_i) denote the leave-one-out local linear estimator. That is, let (â_{-i}, b̂_{-i}) be the solution for (a, b) in the following minimization problem:

min_{a,b} Σ_{j≠i} [Y_j − a − (X_j − X_i)′b]^2 K((X_j − X_i)/h),

and define ĝ_{-i}(X_i) = â_{-i}, the leave-one-out local linear kernel estimate of g(X_i). The local linear cross-validation approach to bandwidth selection chooses the h_s's that minimize

CV_ll(h_1, …, h_q) = n^{-1} Σ_{i=1}^n [Y_i − ĝ_{-i}(X_i)]^2 M(X_i),   (2.54)

where M(·) is a weight function.
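A direct, if brute-force, implementation of the leave-one-out objective (2.54) (with M(·) ≡ 1; the function names and simulated data are our own illustrative choices):

```python
import numpy as np

def cv_ll(X, Y, h):
    """Leave-one-out local linear cross-validation objective (2.54),
    taking M(.) = 1: for each i, fit a local linear regression at X_i
    using all observations except i, then average the squared errors."""
    n = len(Y)
    sse = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        Xi, Yi = X[mask], Y[mask]
        u = (Xi - X[i]) / h
        K = np.exp(-0.5 * u ** 2).prod(axis=1)         # Gaussian product kernel
        Z = np.column_stack([np.ones(n - 1), Xi - X[i]])
        ZtK = Z.T * K
        a = np.linalg.solve(ZtK @ Z, ZtK @ Yi)[0]      # leave-one-out fit at X_i
        sse += (Y[i] - a) ** 2
    return sse / n

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(120, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.standard_normal(120)

grid = [0.03, 0.06, 0.12, 0.25, 0.5]
h_cv = min(grid, key=lambda h: cv_ll(X, Y, np.array([h])))
```

The O(n^2) loop is transparent rather than fast; production implementations vectorize the leave-one-out fits or exploit the hat-matrix shortcut.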
Note that the objective function (2.54) involves only the estimation of g(·) and not its derivatives; therefore we only need the conditions on h_1, …, h_q stated in (2.74). That is, we do not need the stronger conditions on the smoothing parameters that ensure consistent estimation of the derivatives of g(·).

For fixed x ∈ R^q, we have derived the asymptotic bias and variance terms of ĝ(x) = â(x) of the local linear estimator, and (2.52) implies that

E{[ĝ(x) − g(x)]^2} = [ (κ_2/2) Σ_{s=1}^q h_s^2 g_{ss}(x) ]^2 + κ^q σ^2(x)/(n h_1 ⋯ h_q f(x)) + o(η_1^2 + η_2).   (2.55)

As we discussed in Section 2.2.2, the cross-validation objective function is asymptotically equivalent to ∫ E{[ĝ(x) − g(x)]^2} f(x)M(x) dx. Using (2.55) one might conjecture that the leading term of CV_ll would be

CV_{ll,0} = ∫ E{[ĝ(x) − g(x)]^2} f(x)M(x) dx
          = ∫ [ (κ_2/2) Σ_{s=1}^q h_s^2 g_{ss}(x) ]^2 f(x)M(x) dx + κ^q (n h_1 ⋯ h_q)^{-1} ∫ σ^2(x)M(x) dx + o(η_1^2 + η_2),   (2.56)

and this conjecture turns out to be correct, as the leading term of CV_ll is indeed given by (2.56) (see Li and Racine (2004a) for a detailed proof).

Following the approach presented in Section 2.2.2, we express the leading term of (2.56) as n^{-4/(q+4)} χ_ll(a_1, …, a_q), where the a_s's are defined by h_s = a_s n^{-1/(q+4)} (s = 1, …, q) and

χ_ll(a_1, …, a_q) = ∫ [ (κ_2/2) Σ_{s=1}^q a_s^2 g_{ss}(x) ]^2 f(x)M(x) dx + κ^q (a_1 ⋯ a_q)^{-1} ∫ σ^2(x)M(x) dx.   (2.57)

Let a_1^0, …, a_q^0 denote the values of a_1, …, a_q that minimize χ_ll, and assume that

each a_s^0 is uniquely defined and is positive and finite.   (2.58)

Assumption (2.58) rules out the case for which a_s^0 = ∞, which implies that g_{ss}(x) cannot be a zero function. That is, we explicitly rule out cases for which g(x) is linear in any of its components x_s.

Letting the ĥ_s's denote the values of the h_s's that minimize (2.54), Li and Racine (2004a) showed that

n^{1/(q+4)} ĥ_s → a_s^0 in probability for s = 1, …, q.   (2.59)
Equation (2.59) states that the local linear cross-validation smoothing parameters converge to the optimal smoothing parameters, and that the rate of convergence of the resulting local linear estimator is the same as in the local constant cross-validation case when the underlying regression function is not linear in any of its components.

When g(x) is linear in some component of x, say x_s, then (2.58) does not hold. In this case we also need to modify the conditions imposed on the smoothing parameters, as was done in Section 2.2.4, so that we allow the smoothing parameters corresponding to the regressors entering g(x) linearly to diverge to infinity. We would then expect local linear least squares cross-validation to have the ability to select a large value of h_s when g(x) is linear in x_s, while selecting relatively small values of h_s for regressors that enter nonlinearly. The simulation results reported in Li and Racine (2004a) show that this is indeed the case, though we are not aware of any work that provides a rigorous theoretical proof of this result.

Specifically, let us consider the case for which q = 2 and where the underlying regression function g(x_1, x_2) = ḡ(x_1) + x_2 is a partially linear function (i.e., linear in x_2). In this case, local linear cross-validation will tend to select a very large value of ĥ_2 (ĥ_2 → ∞), while ĥ_1 = O_p(n^{-1/5}) (rather than O_p(n^{-1/6})). Thus, local linear cross-validation would suggest that g(x) is linear in x_2.

2.5 Local Polynomial Regression (General pth Order)

2.5.1 The Univariate Case

As was the case when we derived the local linear kernel estimator, we can use a higher order local polynomial estimator to estimate g(x). When x is univariate, a pth order local polynomial kernel estimator is based on minimizing the following objective function:

Σ_{i=1}^n [ Y_i − b_0 − b_1(X_i − x) − b_2(X_i − x)^2 − ⋯ − b_p(X_i − x)^p ]^2 K((X_i − x)/h).   (2.60)
Let b̂_l denote the values of b_l (l = 0, 1, …, p) that minimize (2.60). Then b̂_0 estimates g(x), and l! b̂_l estimates g^{(l)}(x), where g^{(l)}(x) = d^l g(x)/dx^l is the lth order derivative of g(x) (l = 1, …, p). Once again, (2.60) is a standard weighted least squares regression problem. Letting ξ_i = (1, (X_i − x), …, (X_i − x)^p)′, the vector b̂ = (b̂_0, b̂_1, …, b̂_p)′ is given by

b̂ = [ Σ_i ξ_i ξ_i′ K((X_i − x)/h) ]^{-1} Σ_i ξ_i K((X_i − x)/h) Y_i.   (2.61)

Ruppert and Wand (1994) study the leading conditional bias and conditional variance of b̂_0, which we summarize in the following theorem.

Theorem 2.8. Let ĝ(x) = b̂_0(x), and let 𝒵_n = {X_1, …, X_n}. If p is odd,

E[ĝ(x)|𝒵_n] − g(x) = h^{p+1} c_{1,p} [ g^{(p+1)}(x)/(p+1)! ] + o(h^{p+1}),   (2.62)

and if p is even,

E[ĝ(x)|𝒵_n] − g(x) = h^{p+2} c_{2,p} [ g^{(p+2)}(x)/(p+2)! + g^{(p+1)}(x) f^{(1)}(x)/((p+1)! f(x)) ] + o(h^{p+2});   (2.63)

in either case,

var(ĝ(x)|𝒵_n) = c_{3,p} σ^2(x)/(n h f(x)) + o((nh)^{-1}),   (2.64)

where the c_{j,p} (j = 1, 2, 3, 4) are constants defined in Ruppert and Wand (1994).

Theorem 2.8 shows that the leading conditional bias term depends upon whether p is odd or even. By a Taylor series expansion argument, we know that when considering |X_i − x| ≤ h, the remainder term from a pth order polynomial expansion should be of order O(h^{p+1}), so the result for odd p is quite easy to understand. When p is even, p + 1 is odd; hence the h^{p+1} term is associated with ∫K(v)v^l dv for l odd, and this term is zero because K(·) is an even function. Therefore, the h^{p+1} term disappears, while the remainder term becomes O(h^{p+2}). Since p is either odd or even, we see that the bias term is always an even power of h. This is similar to the case where one uses higher order kernel functions based upon a symmetric kernel function (an even function) for the local constant estimator, where the bias is always an even power of h.

Summarizing, conditional on 𝒵_n, we have

When p is odd: ĝ(x) − g(x) = O_p( h^{p+1} + (nh)^{-1/2} ),
When p is even: ĝ(x) − g(x) = O_p( h^{p+2} + (nh)^{-1/2} ).   (2.65)

Let ĝ^{(l)}(x) = l! b̂_l(x) denote the estimator of g^{(l)}(x) based on a pth order local polynomial fit (l ≤ p).
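Equation (2.61) is a one-line weighted least squares computation. The sketch below (our own, assuming a Gaussian kernel) returns the rescaled coefficients l!·b̂_l as estimates of g(x), g′(x), …, g^{(p)}(x); with a noiseless cubic and p = 3, the polynomial fit, and hence the derivative estimates, are exact:

```python
import numpy as np
from math import factorial

def local_poly(x, X, Y, h, p):
    """pth order local polynomial fit at a scalar point x, eq. (2.61):
    weighted least squares of Y on (1, (X_i-x), ..., (X_i-x)^p) with
    kernel weights K((X_i-x)/h).  Since b_l estimates g^(l)(x)/l!, we
    return l! * b_l for l = 0, ..., p."""
    K = np.exp(-0.5 * ((X - x) / h) ** 2)               # Gaussian kernel
    Z = np.vander(X - x, N=p + 1, increasing=True)      # columns (X_i - x)^l
    ZtK = Z.T * K
    b = np.linalg.solve(ZtK @ Z, ZtK @ Y)
    return b * np.array([factorial(l) for l in range(p + 1)])

# Noiseless cubic: the p = 3 fit recovers g, g', g'', g''' at x0 = 0.5,
# i.e., 0.125, 0.75, 3.0, and 6.0 for g(x) = x^3.
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=60)
Y = X ** 3
derivs = local_poly(0.5, X, Y, h=0.3, p=3)
```

This exactness is just the Taylor argument in the text: a cubic is its own third order expansion, so the weighted least squares coefficients are the Taylor coefficients g^{(l)}(x)/l! regardless of the kernel weights.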
The following theorem states the leading bias and variance of ĝ^{(l)}(x).

Theorem 2.9. When p − l is odd,

E[ĝ^{(l)}(x)|𝒵_n] − g^{(l)}(x) = h^{p+1−l} c_{1,p,l} [ g^{(p+1)}(x)/(p+1)! ] + o(h^{p+1−l}),

and when p − l is even,

E[ĝ^{(l)}(x)|𝒵_n] − g^{(l)}(x) = h^{p+2−l} c_{2,p,l} [ g^{(p+2)}(x)/(p+2)! + g^{(p+1)}(x) f^{(1)}(x)/((p+1)! f(x)) ] + o(h^{p+2−l}).

In either case,

var(ĝ^{(l)}(x)|𝒵_n) = c_{3,p,l} σ^2(x)/(n h^{1+2l} f(x)) + o((n h^{1+2l})^{-1}),

where the c_{j,p,l} (j = 1, 2, 3, 4) are constants defined in Ruppert and Wand (1994).

Note that f^{(1)}(x)/f(x) does not appear in the conditional bias term when p − l is odd. Ruppert and Wand (1994) also show that when p − l is odd, the bias at the boundary is of the same order as that for points in the interior. Thus, the appealing boundary behavior of local polynomial mean estimation extends to derivative estimation when p − l is odd. However, when p − l is even, the bias at the boundary is larger than in the interior, and the bias can also be large at points where f^{(1)}(x) is discontinuous. For these reasons, we recommend setting p − l to be odd when estimating g^{(l)}(x).

Theorem 2.9 implies that, conditional on 𝒵_n,

When p − l is odd: ĝ^{(l)}(x) − g^{(l)}(x) = O_p( h^{p+1−l} + (n h^{1+2l})^{-1/2} ),
When p − l is even: ĝ^{(l)}(x) − g^{(l)}(x) = O_p( h^{p+2−l} + (n h^{1+2l})^{-1/2} ).   (2.66)

Using Liapunov's CLT, we can also establish the asymptotic normality of ĝ(x) and ĝ^{(l)}(x). We postpone our discussion of asymptotic normality results until the next section, in which we discuss general multivariate local polynomial regression.

2.5.2 The Multivariate Case

A general pth order local polynomial estimator for the multivariate regressor case is notationally more cumbersome. The general multivariate case is considered by Masry (1996a, 1996b), who develops some carefully considered notation and establishes the uniform almost sure convergence rate and the pointwise asymptotic normality result for the local polynomial estimator of g(x) and its derivatives up to order p.
Borrowing from Masry (1996b), we introduce the following notation:

r = (r_1, …, r_q), r! = r_1! × ⋯ × r_q!, |r| = Σ_{s=1}^q r_s,   (2.67)

x^r = x_1^{r_1} × ⋯ × x_q^{r_q},   (2.68)

(D^r g)(x) = ∂^{|r|} g(x)/∂x_1^{r_1} ⋯ ∂x_q^{r_q}.   (2.69)

Using this notation, and assuming that g(x) has derivatives of total order p + 1 at a point x, we can approximate g(z) locally using a multivariate polynomial of total order p given by

g(z) ≈ Σ_{0≤|r|≤p} (1/r!) (D^r g)(x) (z − x)^r.   (2.70)

Define the multivariate weighted least squares objective function

Σ_{i=1}^n [ Y_i − Σ_{0≤|r|≤p} b_r(x) (X_i − x)^r ]^2 K((X_i − x)/h).   (2.71)

Minimizing (2.71) with respect to each b_r gives an estimate b̂_r(x), and by (2.70) we know that r! b̂_r(x) estimates (D^r g)(x). For a general treatment of multivariate bandwidth selection for the local linear kernel estimator, see Yang and Tschernig (1999).

2.5.3 Asymptotic Normality of Local Polynomial Estimators

Local pth order polynomial regression can be used to estimate the derivatives of g(x) up to order p. We use N_l to denote the number of distinct lth order derivatives of g(x). For example, N_0 = 1, N_1 = q (q different first order derivatives), and N_2 = q(q+1)/2 (q second order own derivatives and q(q−1)/2 second order cross derivatives). The general formula is

N_l = (l + q − 1)! / (l! (q − 1)!).

There are N_l distinct lth order derivatives of g(x), and we use ∇^{(l)} g(x) to denote the N_l × 1 vector of these lth order derivatives, arranged in lexicographic order. For example, when l = 2 we have the N_2 = q(q+1)/2 distinct second order derivatives, and ∇^{(2)} g(x) stacks ∂^2 g(x)/∂x_1^2, ∂^2 g(x)/∂x_1∂x_2, …, ∂^2 g(x)/∂x_q^2 in that order.

Masry (1996b) uses the following condition to establish the asymptotic normality result for the local polynomial estimator.

Condition 2.3.

(i) g(x) has continuous derivatives of total order p + 1.

(ii) k(v) is a bounded second order kernel function having compact support.

(iii) For expositional simplicity, letting h_1 = ⋯ = h_q = h, assume that h = O(n^{-1/(q+2p+2)}).

Theorem 2.10.
Under Condition 2.3, we have for 0 ≤ l ≤ p

√(n h^{q+2l}) [ ∇^{(l)} ĝ(x) − ∇^{(l)} g(x) − h^{p+1−l} A_l m_{p+1,l}(x) ] → N(0, V_l)

at continuous points of σ^2(x) and f(x) with f(x) > 0, where the definitions of m_{p+1,l}(x), A_l and V_l are given in Section 2.7.

Proof. See Theorem 5 of Masry (1996a).

If we compare Theorem 2.10 with Theorems 2.8 and 2.9, we observe that Theorem 2.10 does not yield differing bias expressions depending on whether p − l is odd or even. It can be shown that when q = 1 (the single regressor case), the bias expression of Theorem 2.10 matches that of Theorems 2.8 and 2.9 for the case when p − l is odd. When p − l is even, the leading bias term given in Theorem 2.10 is in fact zero. Thus, the leading nonzero bias term will be of the next order, O(h^{p+2−l}), when p − l is even. We verify this for the case of p = 1 (local linear) in Section 2.7, where we show that A_{1,2}(x) = 0, and so the nonvanishing leading bias term will be of order O(h^{p+2−l}) = O(h^2) rather than O(h^{p+1−l}) = O(h) (p = l = 1). This matches the result of Theorem 2.7; hence the results given in Theorem 2.10 are consistent with those given in Theorems 2.8 and 2.9.

Assuming that p − l is odd, we have

∇^{(l)} ĝ(x) − ∇^{(l)} g(x) = O_p( h^{p+1−l} + (n h^{q+2l})^{-1/2} ).

Masry (1996b) established the uniform almost sure convergence rate of the local polynomial estimator, and we present his results in the next theorem.

Theorem 2.11. For each 0 ≤ l ≤ p,

sup_{x∈S} | ∇^{(l)} ĝ(x) − ∇^{(l)} g(x) | = O( h^{p+1−l} + (ln n/(n h^{q+2l}))^{1/2} ) almost surely.

For p ≥ 2, the bias term is similar to that arising from the use of a higher order kernel function for the local constant estimator. For l = 1, Theorem 2.11 provides the uniform almost sure rate for first order derivative estimators,

sup_{x∈S} | ∇^{(1)} ĝ(x) − ∇^{(1)} g(x) | = O( h^p + (ln n/(n h^{q+2}))^{1/2} ) a.s.

Similarly, one can obtain the uniform almost sure rate for higher order derivative estimators from Theorem 2.11.

Throughout this chapter we have assumed that the data {X_i, Y_i} are observed without error. In practice, however, data may be contaminated
or measured with error. There is a rich literature on how to handle measurement error in nonparametric settings. Though it is beyond the scope of this book to cover this vast literature, we direct the interested reader to Fan and Truong (1993), Carroll and Hall (2004), and Carroll, Maca and Ruppert (1999) and the references therein.

2.6 Applications

2.6.1 Prestige Data

We consider the following example using data from Fox's (2002) car library in R (R Development Core Team (2006)). The dataset consists of 102 observations, each corresponding to a particular occupation. The dependent variable is the prestige of Canadian occupations, measured by the Pineo-Porter prestige score for occupation, taken from a social survey conducted in the mid-1960s. The explanatory variable is average income for each occupation measured in 1971 dollars. Figure 2.1 plots the data and five local linear regression estimates, each differing in its window width: undersmoothed, oversmoothed, Ruppert, Sheather and Wand's (1995) direct plug-in, Hurvich et al.'s (1998) corrected AIC (AIC_c), and cross-validation (CV) (Li and Racine (2004a)). A second order Gaussian kernel was used throughout.

The oversmoothed local linear estimate in Figure 2.1 is globally linear, and in fact is exactly the simple linear least squares regression of Y on X, as expected, while the AIC_c and CV criteria appear to provide the most reasonable fit to this data.

2.6.2 Adolescent Growth

In Section 1.13.3 we noted that abnormal adolescent growth can provide an early warning that a child has a medical problem. We used data from the population of healthy U.S. children obtained from the Centers for Disease Control and Prevention's (CDC) National Health and Nutrition Examination Survey to model the joint distribution of height and weight by sex.
[Figure 2.1 about here. Four panels plot prestige against income (K) with the undersmoothed, oversmoothed, plug-in, and AIC_c & CV local linear fits. Caption: Local linear kernel estimates with varying window widths (note that the AIC_c and CV bandwidths are almost identical, hence the two curves appear as one in the lower-left figure).]

We now consider the regression function underlying the relationship between height and age for males and females, using the mixed data local linear estimator to be introduced in Chapter 4 to model “stature-for-age means.” The local linear estimator was used with bandwidths selected by Hurvich et al.'s (1998) AIC_c criterion using a second order Gaussian kernel. The bandwidth for age is 7.63, while that for sex is 0. Figure 2.2 presents mean height by age and sex. It is interesting to note that mean height is virtually indistinguishable by sex until roughly 10-12 years of age, and then diverges significantly. This behavior would be particularly difficult to model in a parametric setting without resorting to sample splitting, while the nonparametric approach readily reveals this structure.

[Figure 2.2 about here: height (cm) against age (years). Caption: Stature-for-age means.]

2.6.3 Inflation Forecasting and Money Growth

The conventional wisdom prevalent in monetary circles is that money growth should have predictive power for forecasting inflation. However, empirical results overwhelmingly suggest just the opposite, i.e., that money growth has no predictive power for forecasting inflation whatsoever. Moreover, this finding has been robust to changes in the sample period and econometric methodology (see Leeper and Roush (2003) and Stock and Watson (1999)). This inconsistency between theory and empirics is a well-known puzzle in macroeconomics.
Much of the empirical work in this area has focused on forecasts obtained from linear vector autoregressive (VAR) models of the form

X_t = A + B(L) X_{t-1} + e_t,

where X_t = (pi_t, Z_t)', pi_t is the inflation rate at time t, and Z_t is the growth rate of a monetary aggregate. Bachmeier, Leelahanon and Li (forthcoming) revisit this issue from a nonparametric perspective using monthly data for the period January 1959 through May 2002. The monetary aggregates used are M1, M2, and M3, and the corresponding M1, M2, and M3 Divisia monetary services indexes, while inflation is measured with the consumer price index.

Using data from Bachmeier et al. (forthcoming), we first consider parametric forecasts of inflation obtained from a bivariate VAR model of inflation and money growth, which are made using a recursive estimation procedure with an increasing window of data; each forecast was made based on a model estimated using all data available through the date that forecast would have been made. Forecasts were made for each model and forecast horizon (January 1994 through April 2002). The mean squared prediction error (MSPE) was calculated for each model for forecast horizons of 1, 6, and 12 months. SIC-optimal lag length selection was employed, resulting in two lags of inflation and money growth, so that the forecasting equation of interest is

pi_t = alpha_0 + alpha_1 pi_{t-1} + alpha_2 pi_{t-2} + alpha_3 Dm_{t-1} + alpha_4 Dm_{t-2} + eps_t,

where m_t can be any one of the six monetary aggregate measures (Dm_t = m_t - m_{t-1}).

Table 2.1: Relative parametric MSPE (including/excluding each monetary aggregate with 1 lag)

Horizon     M1    M2    M3    M1D   M2D   M3D
1 Month     1.05  1.02  1.05  1.00  1.00  1.00
6 Months    0.90  1.00  1.11  1.00  1.04  1.06
12 Months   0.98  …     1.09  1.00  1.02  1.04

Table 2.2: Relative parametric MSPE (including/excluding each monetary aggregate with 2 lags)

Horizon     M1    M2    M3    M1D   M2D   M3D
1 Month     1.04  0.93  1.04  1.00  1.00  1.00
6 Months    0.91  1.00  1.10  0.99  1.04  1.07
12 Months   0.92  0.94  1.06  1.00  1.01  1.05
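The recursive out-of-sample exercise behind these tables can be sketched as follows. This is only an illustration on simulated series standing in for the actual inflation and money growth data (the lag structure is the two-lag specification above, and all variable names are hypothetical); in the simulated process money growth is irrelevant by construction, so the MSPE ratio should hover near one.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 300
m = rng.normal(size=T)                       # hypothetical money growth series
pi = np.zeros(T)                             # hypothetical inflation series
for t in range(2, T):
    pi[t] = 0.4 * pi[t-1] + 0.2 * pi[t-2] + rng.normal(scale=0.5)

def mspe(include_money):
    """One-step-ahead MSPE from an expanding (recursive) estimation window."""
    errs = []
    for t in range(200, T - 1):
        cols = [np.ones(t - 2), pi[1:t-1], pi[0:t-2]]
        if include_money:
            cols += [m[1:t-1], m[0:t-2]]
        X = np.column_stack(cols)            # regressors for targets pi[2..t-1]
        y = pi[2:t]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        x_new = [1.0, pi[t-1], pi[t-2]] + ([m[t-1], m[t-2]] if include_money else [])
        errs.append((pi[t] - np.array(x_new) @ beta) ** 2)
    return np.mean(errs)

ratio = mspe(True) / mspe(False)             # < 1 would favor including money
```

A ratio below 1.00 corresponds to the "improves the forecast" entries in Tables 2.1 and 2.2.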
For the sake of robustness, we report results for 1 and 2 lags of the monetary aggregate. Tables 2.1 and 2.2 report the ratio of the MSPE of the VAR model with each monetary aggregate included to that with each monetary aggregate excluded, for the given horizon, for 1 and 2 lags of the monetary aggregate respectively; values less than 1.00 imply that including the monetary aggregate improves the forecast for a given forecast horizon.

Tables 2.1 and 2.2 reveal no apparent systematic gain from including money as a predictor. With 1 lag of the monetary aggregate included, 9/18 forecasts worsen, 3/18 improve, and 6/18 are unchanged; with 2 lags, 8/18 worsen, 5/18 improve, and 5/18 are unchanged. Overall, the case for using money growth as an inflation predictor can be seen to be rather weak in a parametric setting, as inclusion of the monetary aggregate leads to less efficient forecasts on balance.

Table 2.3: Relative MSPE of the nonparametric AR(2)/parametric AR(2)

Horizon     Relative MSPE
1 Month     0.98
6 Months    1.01
12 Months   0.88

Next we compare a parametric AR(2) model's forecasts with those from a local constant nonparametric AR(2) model of the form

pi_t = g(pi_{t-1}, pi_{t-2}) + eps_t,

where g(·) is unknown. We conduct leave-one-out cross-validation on the estimation sample, and then use the cross-validated bandwidths to generate our out-of-sample forecasts. Table 2.3 presents the relative MSPE of the nonparametric AR(2) model versus that of the parametric AR(2) model.

Table 2.3 reveals that the cross-validated local constant AR(2) model generates predictions that are comparable to those from the parametric model for the 1 and 6 month horizons and are substantially improved for the 12 month horizon. This suggests misspecification of the parametric model, since the nonparametric model is less efficient than a correctly specified parametric model.
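The leave-one-out cross-validation step mentioned above can be sketched in a few lines for the local constant (Nadaraya-Watson) estimator. This is a minimal illustration on simulated cross-sectional data with hypothetical names; the applications in the text use time series and more careful numerical optimization than this simple grid search.

```python
import numpy as np

def cv_score(h, x, y):
    """Leave-one-out CV criterion for the local constant estimator."""
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)                  # Gaussian kernel weights
    np.fill_diagonal(K, 0.0)                 # leave observation i out
    g_loo = K @ y / K.sum(axis=1)            # leave-one-out fitted values
    return np.mean((y - g_loo) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 150)
y = x**2 + rng.normal(0, 0.2, 150)

grid = np.linspace(0.05, 2.0, 40)
scores = [cv_score(h, x, y) for h in grid]
h_cv = grid[int(np.argmin(scores))]          # cross-validated bandwidth
```

The minimizer `h_cv` balances squared bias (large h) against variance (small h), exactly the trade-off that the CV objective function formalizes.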
One then wonders whether the conclusion about the forecast inefficacy of monetary aggregates may in fact be an artifact of parametric misspecification. We therefore consider a local constant nonparametric model of the form

pi_t = g(pi_{t-1}, pi_{t-2}, Dm_{t-1}, Dm_{t-2}) + eps_t,   (2.73)

where g(·) is again unknown and we again use leave-one-out cross-validation for our bandwidth selector.*

*Note that Bachmeier et al. (forthcoming) use a more complicated two-step cross-validation procedure. Therefore, the results reported here differ slightly from those reported in Bachmeier et al. (forthcoming) due to the different bandwidth selection procedures being used.

Tables 2.4 and 2.5 present the relative MSPE of the nonparametric model which utilizes the monetary aggregate to the nonparametric model which excludes the monetary aggregate.

Table 2.4: Relative nonparametric MSPE (including/excluding monetary aggregate with 1 lag)

Horizon     M1    M2    M3    M1D   M2D   M3D
1 Month     1.07  1.00  0.85  0.92  0.80  0.89
6 Months    0.96  0.95  1.15  1.00  1.00  1.00
12 Months   0.96  0.84  0.98  1.00  1.00  1.00

Table 2.5: Relative nonparametric MSPE (including/excluding monetary aggregate with 2 lags)

Horizon     M1    M2    M3    M1D   M2D   M3D
1 Month     1.07  0.98  1.06  1.00  1.00  0.88
6 Months    1.00  0.96  1.07  1.00  1.00  …
12 Months   0.99  0.83  0.98  1.00  0.97  …

These tables reveal that with 1 lag of the monetary aggregate included, 2/18 forecasts worsen (in the parametric setting we had 9/18), 9/18 improve (in the parametric setting we had 3/18), and 7/18 are unchanged (in the parametric setting we had 6/18); with 2 lags, 3/18 worsen, 8/18 improve, and 7/18 are unchanged. In a nonparametric setting, the case for using money growth as an indicator variable for inflation is in fact quite favorable, as inclusion of the monetary aggregate leads to more accurate forecasts on balance. Thus, the robust nonparametric method suggests that money growth affects inflation nonlinearly.
The earlier puzzle appears therefore to be an artifact arising from the use of a misspecified linear relationship.

2.7 Proofs

In this section we provide some omitted proofs for earlier material in this chapter.

2.7.1 Derivation of (2.24)

Since we only consider the case in which all regressors are relevant, we assume that

(h_1, ..., h_q) ∈ H_n,   (2.74)

where H_n is a set of admissible bandwidth vectors defined in terms of a positive sequence eta_n that converges to zero at a rate slower than the inverse of any polynomial in n and a sequence of constants tau_n that diverges to infinity. Equation (2.74) basically requires that n h_1 ... h_q → ∞ and that h_s → 0 (s = 1, ..., q). Let S denote the support of w. We also assume that

g(·), f(·) and sigma²(·) have two continuous derivatives; w is continuous, nonnegative and has compact support; f(·) is bounded away from zero for x ∈ S.   (2.75)

Using the definition of CV_lc(h) from Section 2.2.2 together with Y_i = g(X_i) + u_i, and writing g_i = g(X_i), ghat_{-i,i} = ghat_{-i}(X_i), and M_i = M(X_i), we have

CV_lc(h) = n^{-1} Σ_{i=1}^n (g_i - ghat_{-i,i})² M_i + 2 n^{-1} Σ_{i=1}^n (g_i - ghat_{-i,i}) u_i M_i + n^{-1} Σ_{i=1}^n u_i² M_i.   (2.76)

The third term on the right-hand side of (2.76) does not depend on (h_1, ..., h_q). It can be shown that the second term has an order smaller than that of the first term (see Exercise 2.3). So, asymptotically, minimizing CV_lc(h) is equivalent to minimizing n^{-1} Σ_i (g_i - ghat_{-i,i})² M_i, the first term on the right-hand side of (2.76).

We next analyze the leading term of CV_lc. Note that ghat_{-i,i} - g_i = (mhat_{1i} + uhat_i) fhat_{-i,i}^{-1} + (s.o.), where mhat_{1i} = (n-1)^{-1} Σ_{j≠i} (g_j - g_i) K_{h,ij}. Define

fhat_{-i,i} = (n-1)^{-1} Σ_{j≠i} K_{h,ij}   and   uhat_i = (n-1)^{-1} Σ_{j≠i} u_j K_{h,ij},

where K_{h,ij} = Π_{s=1}^q h_s^{-1} k((X_{is} - X_{js})/h_s). Then (mhat_{1i} + uhat_i) f_i^{-1} is the leading component of ghat_{-i,i} - g_i. Therefore, the leading term of CV_lc(h) is

CV_lc0(h) = n^{-1} Σ_{i=1}^n (mhat_{1i} + uhat_i)² f_i^{-2} M_i.

It can further be shown that CV_lc0(h) = E[CV_lc0(h)] + (s.o.). Hence, E[CV_lc0(h)] is the leading term of CV_lc(h). Now,

E[CV_lc0(h_1, ..., h_q)] = E[(mhat_{1i} + uhat_i)² f_i^{-2} M_i] = E[mhat_{1i}² f_i^{-2} M_i] + E[uhat_i² f_i^{-2} M_i],   (2.77)

where we have used E[mhat_{1i} uhat_i f_i^{-2} M_i] = 0 (since E(u_i | {X_j}_{j=1}^n) = 0). Then

E[mhat_{1i}² f_i^{-2} | X_i] = {f_i^{-1} E[(g_j - g_i) K_{h,ij} | X_i]}² + O((n h_1 ... h_q)^{-1} eta_n)
 = {f_i^{-1} ∫ [g(X_i + h v) - g(X_i)] f(X_i + h v) K(v) dv}² + O(m_n eta_n)
 = {Σ_{s=1}^q B_s(X_i) h_s²}² + O(eta_n³ + m_n eta_n),   (2.78)

where B_s(·) is the same as that defined in (2.8), m_n = (n h_1 ... h_q)^{-1}, and eta_n = Σ_{s=1}^q h_s². Using (2.78) we have

E[mhat_{1i}² f_i^{-2} M_i] = ∫ {Σ_{s=1}^q B_s(x) h_s²}² f(x) M(x) dx + O(eta_n³ + m_n eta_n).   (2.79)

Next,

E[uhat_i² f_i^{-2} M_i] = (n-1)^{-1} E{f_i^{-2} M_i E[u_j² K_{h,ij}² | X_i]}
 = (n h_1 ... h_q)^{-1} E{f_i^{-2} M_i [kappa^q f(X_i) sigma²(X_i) + O(eta_n)]}
 = kappa^q (n h_1 ... h_q)^{-1} [∫ sigma²(x) M(x) dx + O(eta_n)].   (2.80)

Substituting (2.79) and (2.80) into (2.77) we obtain

E[CV_lc0(h_1, ..., h_q)] = ∫ {Σ_{s=1}^q B_s(x) h_s²}² f(x) M(x) dx + kappa^q (n h_1 ... h_q)^{-1} ∫ sigma²(x) M(x) dx + O(eta_n³ + m_n eta_n)
 = n^{-4/(q+4)} [chi(a_1, ..., a_q) + o(1)],   (2.81)

where the a_s are defined via h_s = a_s n^{-1/(q+4)} (s = 1, ..., q) and

chi(a_1, ..., a_q) = ∫ {Σ_{s=1}^q B_s(x) a_s²}² f(x) M(x) dx + kappa^q (a_1 ... a_q)^{-1} ∫ sigma²(x) M(x) dx.   (2.82)

2.7.2 Proof of Theorem 2.7

We sketch an outline of the proof of Theorem 2.7, while leaving certain details as exercises. One problem arising from working with (2.50) is that

Σ_{i=1}^n K_{h,ix} (1, (X_i - x)')' (1, (X_i - x)')

is an (asymptotically) singular matrix and thus is not invertible. We could rewrite (2.50) in an equivalent form by inserting an identity matrix I_{q+1} = G_n^{-1} G_n in the middle of (2.50), where G_n is a (q+1) × (q+1) diagonal matrix whose first diagonal element is 1 and whose (s+1)th diagonal element is h_s^{-2}, giving us (applying B^{-1} C = (G_n B)^{-1} G_n C)

deltahat(x) = [Σ_{i=1}^n K_{h,ix} G_n (1, (X_i - x)')' (1, (X_i - x)')]^{-1} Σ_{i=1}^n K_{h,ix} G_n (1, (X_i - x)')' Y_i.   (2.83)

The advantage of using (2.83) in the proof is that n^{-1} Σ_i K_{h,ix} G_n (1, (X_i - x)')' (1, (X_i - x)') converges in probability to a nonsingular matrix. Therefore, we can analyze the denominator and numerator of (2.83) separately and thus greatly simplify the analysis.
Using the Taylor expansion

g(X_i) = g(x) + (X_i - x)' g^{(1)}(x) + (X_i - x)' g^{(2)}(x) (X_i - x)/2 + R_n(x, X_i)
 = (1, (X_i - x)') delta(x) + (X_i - x)' g^{(2)}(x) (X_i - x)/2 + R_n(x, X_i),

where R_n(x, X_i) is the remainder term, we have

deltahat(x) = [n^{-1} Σ_i K_{h,ix} G_n (1, (X_i - x)')' (1, (X_i - x)')]^{-1} {n^{-1} Σ_i K_{h,ix} G_n (1, (X_i - x)')' [(1, (X_i - x)') delta(x) + (X_i - x)' g^{(2)}(x)(X_i - x)/2 + u_i + R_n(x, X_i)]}
 = delta(x) + [A_x^1]^{-1} [A_x^2 + A_x^3] + (s.o.),   (2.84)

where

A_x^1 = n^{-1} Σ_i K_{h,ix} G_n (1, (X_i - x)')' (1, (X_i - x)'),
A_x^3 = n^{-1} Σ_i K_{h,ix} G_n (1, (X_i - x)')' u_i,   (2.85)

A_x^2 = (2n)^{-1} Σ_i K_{h,ix} G_n (1, (X_i - x)')' (X_i - x)' g^{(2)}(x) (X_i - x),   (2.86)

and the smaller order (s.o.) term comes from [A_x^1]^{-1} n^{-1} Σ_i K_{h,ix} G_n (1, (X_i - x)')' R_n(x, X_i), which has an order smaller than [A_x^1]^{-1} A_x^2. Exercise 2.7 shows that A_x^1 = Q + o_p(1), where

Q = ( f(x)   0' ; 0   kappa_2 f(x) I_q ).   (2.87)

Using the partitioned inverse, we get

Q^{-1} = ( 1/f(x)   0' ; 0   [kappa_2 f(x)]^{-1} I_q ).

Next, rewrite (2.84) as

D(n) (deltahat(x) - delta(x)) = D(n) [A_x^1]^{-1} [A_x^2 + A_x^3] + (s.o.),   (2.88)

and define the diagonal matrices

D(n) = diag((n h_1 ... h_q)^{1/2}, (n h_1 ... h_q)^{1/2} D_h),   D_h = diag(h_1, ..., h_q),

and R = diag(Q^{-1}). The proof of Theorem 2.7 follows if we can show the following:

(i) D(n) [A_x^1]^{-1} [A_x^2 + A_x^3] = D(n) Q^{-1} [A_x^2 + A_x^3] + o_p(1);
(ii) D(n) Q^{-1} [A_x^2 + A_x^3] = R D(n) [A_x^2 + A_x^3] + o_p(1);
(iii) D(n) A_x^2 = ( (n h_1 ... h_q)^{1/2} f(x) (kappa_2/2) Σ_{s=1}^q h_s² g_{ss}(x), 0' )' + o_p(1);
(iv) D(n) A_x^3 → N(0, V) in distribution.

The proofs of (i)-(iv) are given in the four lemmas below. Statements (i)-(iv) imply that

D(n) [deltahat(x) - delta(x)] - ( (n h_1 ... h_q)^{1/2} (kappa_2/2) Σ_{s=1}^q h_s² g_{ss}(x), 0' )'
 = R D(n) [A_x^2 + A_x^3] - ( (n h_1 ... h_q)^{1/2} (kappa_2/2) Σ_{s=1}^q h_s² g_{ss}(x), 0' )' + o_p(1)
 = R D(n) A_x^3 + o_p(1) → R N(0, V) = N(0, Sigma) in distribution,

where Sigma = R V R is the same Sigma as that given in Theorem 2.7.

Lemma 2.1. D(n) [A_x^1]^{-1} [A_x^2 + A_x^3] = D(n) Q^{-1} [A_x^2 + A_x^3] + o_p(1).

Proof. Writing D(n) [A_x^1]^{-1} [A_x^2 + A_x^3] = D(n) Q^{-1} [A_x^2 + A_x^3] + D(n) [(A_x^1)^{-1} - Q^{-1}] [A_x^2 + A_x^3], it is sufficient to show that

D(n) [(A_x^1)^{-1} - Q^{-1}] [A_x^2 + A_x^3] = o_p(1).   (2.89)

From A_x^1 = Q + o_p(1) (see Exercise 2.7) we know that

(A_x^1)^{-1} = Q^{-1} + o_p(1).   (2.90)

Let J = (A_x^1)^{-1} - Q^{-1} with blocks J_{ts} (t, s = 1, 2). Equation (2.90) implies that J_{ts} = o_p(1) for t, s = 1, 2. To prove (2.89), we need a sharp (i.e., fast) rate for J_{12}. Below we show that J_{12} = O_p(eta_n). Define G = (A_{x,11}^1 - A_{x,12}^1 (A_{x,22}^1)^{-1} A_{x,21}^1)^{-1}. Using the partitioned inverse and the results of Exercise 2.7, we obtain

(A_x^1)^{-1} = ( G   -G A_{x,12}^1 (A_{x,22}^1)^{-1} ; ·   · ) = ( 1/f(x) + o_p(1)   O_p(eta_n) ; O_p(eta_n)   [kappa_2 f(x)]^{-1} I_q + o_p(1) ),   (2.91)

because A_{x,12}^1 = O_p(eta_n) by Exercise 2.7 (ii). Equation (2.91) leads to

(A_x^1)^{-1} - Q^{-1} = ( o_p(1)   O_p(eta_n) ; O_p(eta_n)   o_p(1) ),   (2.92)

which proves that J_{12} = O_p(eta_n). Using (2.92) we immediately have D(n) [(A_x^1)^{-1} - Q^{-1}] [A_x^2 + A_x^3] = o_p(1), because (n h_1 ... h_q)^{1/2} O_p(eta_n) applied to the components of A_x^2 gives O_p((n h_1 ... h_q)^{1/2} eta_n³) = o_p(1), and applied to the components of A_x^3 gives O_p(eta_n) = o_p(1), by Exercises 2.8 and 2.9. This proves (2.89). □

Lemma 2.2. D(n) Q^{-1} [A_x^2 + A_x^3] = R D(n) [A_x^2 + A_x^3] + o_p(1).

Proof. Noting that R is the diagonal matrix with R = diag(Q^{-1}), we have D(n) Q^{-1} [A_x^2 + A_x^3] = R D(n) [A_x^2 + A_x^3] + o_p(1), because the terms associated with the off-diagonal elements of Q^{-1} are all o_p(1). □

Lemma 2.3. D(n) A_x^2 = ( (n h_1 ... h_q)^{1/2} f(x) (kappa_2/2) Σ_{s=1}^q h_s² g_{ss}(x), 0' )' + o_p(1).

Proof. By Exercise 2.9, we have

(n h_1 ... h_q)^{1/2} A_{x,1}^2 = (n h_1 ... h_q)^{1/2} f(x) (kappa_2/2) Σ_{s=1}^q h_s² g_{ss}(x) + o_p(1),

and (n h_1 ... h_q)^{1/2} D_h A_{x,2}^2 = o_p(1). Hence the result follows. □

Lemma 2.4. D(n) A_x^3 → N(0, V) in distribution, where

V = ( kappa^q f(x) sigma²(x)   0' ; 0   kappa^{q-1} kappa_{22} f(x) sigma²(x) I_q ).

Proof. By Exercise 2.9,

var((n h_1 ... h_q)^{1/2} A_{x,1}^3) = kappa^q f(x) sigma²(x) + o(1),
var((n h_1 ... h_q)^{1/2} D_h A_{x,2}^3) = kappa^{q-1} kappa_{22} f(x) sigma²(x) I_q + o(1),

and cov((n h_1 ... h_q)^{1/2} A_{x,1}^3, (n h_1 ... h_q)^{1/2} D_h A_{x,2}^3) = o(1). Hence, var(D(n) A_x^3) = V + o(1). Also note that A_x^3 has mean zero. Thus D(n) A_x^3 → N(0, V) by the Liapunov CLT. □

2.7.3 Definitions of A_{l,p+1} and V_l Used in Theorem 2.10

We use the notation introduced in (2.67) to (2.69) to define

mu_j = ∫ v^j k(v) dv,   nu_j = ∫ v^j k²(v) dv.   (2.93)

Define

M = ( M_00  M_01  ...  M_0p ; M_10  M_11  ...  M_1p ; ... ; M_p0  M_p1  ...  M_pp ),
Gamma = ( Gamma_00  ...  Gamma_0p ; ... ; Gamma_p0  ...  Gamma_pp ),   (2.94)

where M_{ts} and Gamma_{ts} are N_t × N_s dimensional matrices.
For example, M_{ij} = mu_{(i)} mu'_{(j)}, where mu_{(i)} and mu_{(j)} are of dimension N_i × 1 and N_j × 1, respectively, and stack the kernel moments in the lexicographic order, e.g.,

mu_{(1)} = ( ∫ v_1 K(v) dv, ..., ∫ v_q K(v) dv )'.   (2.95)

Similarly, Gamma_{ij} can be obtained from M_{ij} with K(·) replaced by K²(·).

For two m × 1 vectors C_1 and C_2, we use C_1 ⊙ C_2 to denote the m × 1 vector whose jth element is C_{1j} C_{2j} (j = 1, ..., m), i.e., ⊙ denotes the element-by-element product.

The V_l matrix introduced in Theorem 2.10 is defined via the following N × N matrix V (N = Σ_{j=0}^p N_j):

V = M^{-1} Gamma M^{-1}.

Then V_l is the N_l × N_l dimensional matrix consisting of the rows and columns of V running from Σ_{j=0}^{l-1} N_j + 1 to Σ_{j=0}^{l} N_j. For example, V_0 is the first diagonal element of V; V_1 consists of rows and columns 2 to N_1 + 1 = q + 1 of V; V_2 consists of rows and columns N_0 + N_1 + 1 = q + 2 to N_0 + N_1 + N_2 = 1 + q + q(q+1)/2 of V, etc.

We can easily check that when p = 1 (the local linear case), we get

M = ( 1   0' ; 0   kappa_2 I_q ),   Gamma = ( kappa^q   0' ; 0   kappa^{q-1} kappa_{22} I_q ),   (2.96)

V = M^{-1} Gamma M^{-1} = ( kappa^q   0' ; 0   (kappa^{q-1} kappa_{22}/kappa_2²) I_q ),

where kappa_2 = ∫ v² k(v) dv, kappa = ∫ k²(v) dv, and kappa_{22} = ∫ v² k²(v) dv. Thus, V_0 = kappa^q and V_1 = (kappa^{q-1} kappa_{22}/kappa_2²) I_q. The variance [sigma²(x)/f(x)] V_l exactly matches the result of Theorem 2.7 for l = 0 and l = 1.

Define an N_j × 1 vector A_{(j)} whose elements are of the form 1/(r_1! × r_2! × ... × r_q!) for multi-indices r with |r| = j, arranged in the lexicographic order. For example, A_{(2)} has elements equal to 1/2 for terms with r_s = 2 for a single s, and equal to 1 for the cross-product terms.

Then define the N × N_{p+1} matrix

A = ( A_0' , A_1' , ..., A_p' )',   A_j = M_{j,p+1} diag(A_{(p+1)}),   (2.98)

where each block A_j combines the kernel moment matrix M_{j,p+1} with the factorial weights in A_{(p+1)}.

There are N_{p+1} distinct derivatives of the form (1/r!)(D^r g)(x) of total order p + 1; we put them in an N_{p+1} × 1 vector using the lexicographic order, and we denote this vector by m_{(p+1)}(x). For example, when p = 1 (the local linear case), m_{(2)}(x) collects the second order derivatives g_{st}(x) = ∂²g(x)/∂x_s ∂x_t (s ≤ t) in the lexicographic order.   (2.99)

Then the A_{l,p+1} introduced in Theorem 2.10 are defined via the following equation (l = 0, ..., p):

( A_{0,p+1} ; A_{1,p+1} ; ... ; A_{p,p+1} ) = M^{-1} A m_{(p+1)}(x),   (2.100)

where M and A are defined in (2.94) and (2.98), respectively, and m_{(p+1)}(x) is defined immediately following (2.98).

When p = 1 (the local linear case), using (2.96), (2.99), and (2.98), we have

B ≡ M^{-1} A m_{(2)}(x) = ( (kappa_2/2) Σ_{s=1}^q g_{ss}(x) ; 0 ).   (2.101)

Using the above A_{0,2} and A_{1,2} in Theorem 2.10 leads to results identical to those in Theorem 2.7, as of course they should.

2.8 Exercises

Exercise 2.1. Prove (2.10).
Hint: Note that E[mhat_1(x) - f(x) Σ_{s=1}^q B_s(x) h_s²]² = {E[mhat_1(x)] - f(x) Σ_{s=1}^q B_s(x) h_s²}² + var(mhat_1(x)), then apply (2.8) and (2.9).

Exercise 2.2.
(i) Consider (2.82) for q = 2:

chi(a_1, a_2) = ∫ { [a_1² B_1(x) + a_2² B_2(x)]² + kappa² (a_1 a_2)^{-1} sigma²(x)/f(x) } f(x) M(x) dx.

Derive the explicit expressions of a_1^0 and a_2^0 that minimize chi(a_1, a_2).
(ii) For the general multivariate case, chi can be written as (see (2.82))

chi(a_1, ..., a_q) = z' A z + C_0 (a_1 ... a_q)^{-1},   z = (a_1², ..., a_q²)',

where A is a q × q positive semidefinite matrix and C_0 > 0 is a constant. Show that when A is positive definite, then inf chi > 0. This will imply that a_1^0, ..., a_q^0 are all positive and finite.
Hint for (ii): This follows the same analysis as that used in Exercise 1.10.

Exercise 2.3. Show that the second term on the right-hand side of (2.76) has an order smaller than that of the first term.
Hint: Write n^{-1} Σ_i (g_i - ghat_{-i,i}) u_i M_i = -n^{-1} Σ_i u_i (mhat_{1i} + uhat_i) f_i^{-1} M_i + (s.o.). Then show that the second moment of this leading term is O(n^{-1}(eta_n² + m_n)).

Exercise 2.4. Suppose one does not use the leave-one-out estimator in the cross-validation procedure.
(i) Show that, in this case, CV(h) approaches its lower bound (zero) as h → 0, i.e., lim_{h→0} CV(h) = 0. That is, h → 0 minimizes CV(h).
(ii) What is the resulting estimator if h is arbitrarily close to zero, roughly speaking, ghat(X_i; h = 0)? We know that h → 0 too fast violates the condition n h_1 ... h_q → ∞ as n → ∞. The variance of such an estimator goes to infinity.

Exercise 2.5. This problem explains R(xbar, h_{q1+1}, ..., h_q) defined in (2.35). Write ghat(x) - g(x) = [ghat(x) - g(x)] fhat(x)/fhat(x) ≡ mhat(x)/fhat(x),
where mhat(x) = mhat_1(x) + mhat_2(x), with mhat_1(x) = n^{-1} Σ_{i=1}^n (g(X_i) - g(x)) K_{h,xi} and mhat_2(x) = n^{-1} Σ_{i=1}^n u_i K_{h,xi}. Write K_{h,xi} = Kbar_{h,xi} Ktil_{h,xi}, where Kbar_{h,xi} = Π_{s=1}^{q1} h_s^{-1} k((X_{is} - x_s)/h_s) and Ktil_{h,xi} = Π_{s=q1+1}^{q} h_s^{-1} k((X_{is} - x_s)/h_s), and where the bar and tilde refer to relevant and irrelevant variables, respectively. Define etabar_n = Σ_{s=1}^{q1} h_s² and mbar_n = (n h_1 ... h_{q1})^{-1/2}.

(i) Show that E[fhat(x)] = fbar(xbar) E[Ktil_{h,xi}] + O(etabar_n), and that var(fhat(x)) = o(1). Therefore,

fhat(x) = fbar(xbar) E[Ktil_{h,xi}] + o_p(1).   (2.102)

Note that (2.102) implies the following:

ghat(x) - g(x) = mhat(x)/fhat(x) = mhat(x)/{fbar(xbar) E[Ktil_{h,xi}]} + (s.o.) ≡ A_n(x) + (s.o.),

where A_n(x) = mhat(x)/{fbar(xbar) E[Ktil_{h,xi}]} is the leading term of ghat(x) - g(x).

(ii) Show that

E[mhat_1(x)] = {Σ_{s=1}^{q1} Bbar_s(xbar) h_s²} fbar(xbar) E[Ktil_{h,xi}] + o((etabar_n + mbar_n) E[Ktil_{h,xi}]).   (2.103)

(iii) Show that

E[mhat_2(x)²] = n^{-1} sigmabar²(xbar) fbar(xbar) (h_1 ... h_{q1})^{-1} kappa^{q1} E[Ktil²_{h,xi}] (1 + o(1)).   (2.104)

(iv) Show that (2.102) to (2.104) imply that

E[A_n(x)²] = [Σ_{s=1}^{q1} Bbar_s(xbar) h_s²]² + (n h_1 ... h_{q1})^{-1} kappa^{q1} [sigmabar²(xbar)/fbar(xbar)] R(xbar, h_{q1+1}, ..., h_q) + (s.o.),   (2.105)

where R(·) is defined in (2.35).

(v) Using (2.105) and CV ≈ ∫ E[(ghat(x) - g(x))²] f(x) M(x) dx, we have

CV ≈ ∫ { [Σ_{s=1}^{q1} Bbar_s(xbar) h_s²]² + kappa^{q1} sigmabar²(xbar) R(xbar, h_{q1+1}, ..., h_q) / (n h_1 ... h_{q1} fbar(xbar)) } f(x) M(x) dx.   (2.106)

Show that h_s → ∞ (s = q1+1, ..., q) minimizes (2.106), which results in R(·) = 1. Using R = 1, show that (2.106) leads to (2.35). Note that f(x) = fbar(xbar) ftil(xtil) by (2.32).

(vi) Give an intuitive proof of Theorem 2.4 based on the above results.

Exercise 2.6.
(i) Show that if one uses h_s = ∞ for all s = 1, ..., q, then the local linear estimator ghat(x) is identical to the least squares estimator alphahat_0 + x' alphahat_1, where alphahat_0 and alphahat_1 are the least squares estimators of alpha_0 and alpha_1 based on Y_i = alpha_0 + X_i' alpha_1 + error. (Note that we do not necessarily assume that the true regression function g(x) is linear in x for this problem.)
(ii) Show that when g(x) = alpha_0 + x' alpha_1 is linear in x, then the local linear estimator is unbiased (note that you should not assume h_s = ∞ in this problem).

Exercise 2.7. Let A_{x,ts}^1 (t, s = 1, 2) be defined as in Section 2.7. Prove the following results:
(i) A_{x,11}^1 = f(x) + o_p(1).
(ii) A_{x,12}^1 = O_p(eta_n) = o_p(1).
(iii) A_{x,21}^1 = O_p(eta_n) = o_p(1).
(iv) A_{x,22}^1 = kappa_2 f(x) I_q + o_p(1).

Note that (i)-(iv) imply that A_x^1 = Q + o_p(1), where Q is defined in (2.87).
and among the n independent random draws X_1^d, ..., X_n^d, there are on average n p(x^d) = O(n) observations that assume the value x^d. Therefore, our estimator phat(x^d) is expected to converge to p(x^d) at the parametric rate of O_p((n p(x^d))^{-1/2}) = O_p(n^{-1/2}). The finite support assumption implies that we have only finitely many parameters {p(x^d), x^d ∈ S^d} to be estimated. This is essentially a parametric model (since the number of parameters does not increase as the sample size increases); therefore, we naturally obtain the √n-rate, i.e., the standard parametric rate of convergence. There is also a large literature in statistics that deals with the case in which the number of discrete cells grows as the sample size increases, with the number of observations in each cell remaining finite even as n → ∞. This is the so-called sparse asymptotics framework, and we refer the interested reader to Simonoff (1996). We do not pursue this case further in this book.

3.2 Regression with Discrete Regressors

Consider a nonparametric regression model having only discrete regressors given by

Y_i = g(X_i^d) + u_i,   (3.5)

where X_i^d ∈ S^d is i.i.d., u_i has zero mean, and E(u_i² | X_i^d = x^d) = sigma²(x^d). We allow the error process to be conditionally heteroskedastic and of unknown form. For any x^d ∈ S^d, we estimate g(x^d) by

ghat(x^d) = [n^{-1} Σ_{i=1}^n Y_i 1(X_i^d = x^d)] / phat(x^d),   (3.6)

where phat(x^d) is defined in (3.2). One can easily show that

ghat(x^d) - g(x^d) = O_p(n^{-1/2}),   (3.7)

and also that

√n (ghat(x^d) - g(x^d)) → N(0, sigma²(x^d)/p(x^d))   (3.8)

in distribution, by observing that Σ_{i=1}^n [g(X_i^d) - g(x^d)] 1(X_i^d = x^d) = 0. The proofs of (3.7) and (3.8) are fairly straightforward and are left to the reader as an exercise (see Exercise 3.1).

3.3 Estimation with Mixed Data: The Frequency Approach

We now turn our attention to the mixed discrete and continuous variables case. We use X_i^c to denote a continuous variable, and therefore write X_i = (X_i^c, X_i^d) ∈ R^q × S^d. We define f(x) = f(x^c, x^d) to be the joint PDF of (X_i^c, X_i^d).
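The frequency estimators just described — the cell probability phat(x^d) and the cell-mean regression estimator (3.6) — amount to simple counting and cell averaging. A minimal sketch on simulated data (all function names and the data-generating process are our own):

```python
import numpy as np
from collections import Counter

def freq_prob(xd_sample, cell):
    """Frequency estimator: p-hat(x^d) = (1/n) sum 1(X_i^d = x^d)."""
    rows = [tuple(row) for row in xd_sample]
    return Counter(rows)[tuple(cell)] / len(rows)

def freq_regression(xd_sample, y, cell):
    """Cell-mean estimator (3.6): average of Y_i over the cell x^d."""
    mask = np.all(np.asarray(xd_sample) == np.asarray(cell), axis=1)
    return y[mask].mean()

rng = np.random.default_rng(2)
xd = rng.integers(0, 2, size=(1000, 2))          # two binary regressors
y = 1.0 + 2.0 * xd[:, 0] - xd[:, 1] + rng.normal(0, 0.1, 1000)

p00 = freq_prob(xd, (0, 0))                      # close to the true value 0.25
g11 = freq_regression(xd, y, (1, 1))             # close to g(1,1) = 1 + 2 - 1 = 2
```

Both quantities converge at the parametric √n rate, in line with (3.7) and (3.8) — provided, of course, that each cell contains enough observations.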
3.3.1 Density Estimation with Mixed Data

Suppose that we have independent observations X_1, X_2, ..., X_n drawn from f(·). A nonparametric kernel estimator of f(x) is given by

fhat(x) = n^{-1} Σ_{i=1}^n W_h(X_i^c, x^c) 1(X_i^d = x^d),   (3.9)

where W_h(X_i^c, x^c) = Π_{s=1}^q h_s^{-1} w((X_{is}^c - x_s^c)/h_s), with w(·) being a standard univariate second order kernel function satisfying (1.10). We state the rate of convergence and the asymptotic distribution of fhat(x) in the next theorem.

Theorem 3.1. For all x^d ∈ S^d, assume that f(x^c, x^d) satisfies the same conditions as f(x^c) given in Theorem 1.3. Also assume that, as n → ∞, h_s → 0 (s = 1, ..., q), (n h_1 ... h_q)^{1/2} Σ_{s=1}^q h_s^4 → 0, and n h_1 ... h_q → ∞. Then

(i) fhat(x) - f(x) = O_p(Σ_{s=1}^q h_s² + (n h_1 ... h_q)^{-1/2}).
(ii) (n h_1 ... h_q)^{1/2} (fhat(x) - f(x) - (kappa_2/2) Σ_{s=1}^q f_{ss}(x) h_s²) converges in distribution to N(0, kappa^q f(x)), where f_{ss}(x) = ∂²f(x^c, x^d)/∂(x_s^c)² is the second order derivative of f(·, x^d) with respect to x_s^c, and where kappa_2 = ∫ v² w(v) dv and kappa = ∫ w²(v) dv.

The proof of Theorem 3.1 is given in Section 3.5. From Theorem 3.1 we see that the rate of convergence of fhat(x) for the mixed variable case is the same as in the case involving only the subset of q purely continuous variables. This follows simply because estimation with purely discrete variables has an O_p(n^{-1/2}) rate of convergence, which is faster than the rate of convergence for the q purely continuous variables case. Hence, the rate of convergence of fhat(x) for the mixed variables case is determined by the rate of convergence arising from the presence of the subset of q continuous variables, the slower of the two rates.

3.3.2 Regression with Mixed Data

The above frequency method can also be used to estimate regression functions with mixed discrete and continuous regressors. For x = (x^c, x^d) ∈ R^q × S^d, we estimate g(x) = E(Y | X = x) by

ghat(x) = [n^{-1} Σ_{i=1}^n Y_i W_h(X_i^c, x^c) 1(X_i^d = x^d)] / fhat(x),   (3.10)

where fhat(x) is defined in (3.9). The following theorem gives the asymptotic distribution of ghat(x).

Theorem 3.2.
For any fixed x^d ∈ S^d, assume that g(x^c, x^d) satisfies the same conditions as g(x^c) given in Theorem 2.2, and also assume that, as n → ∞, h_s → 0 (s = 1, ..., q), n h_1 ... h_q → ∞ and (n h_1 ... h_q)^{1/2} Σ_{s=1}^q h_s^4 → 0. Then we have

√(n h_1 ... h_q) ( ghat(x) - g(x) - Σ_{s=1}^q B_s(x) h_s² ) → N(0, kappa^q sigma²(x)/f(x))

in distribution, where B_s(x) = (kappa_2/2){2 f_s(x) g_s(x) + f(x) g_{ss}(x)}/f(x) is the same as that defined in Theorem 2.2 except that now x = (x^c, x^d), and f_s(x), g_s(x) and g_{ss}(x) are the first and second order derivatives of f and g with respect to x_s^c, where again kappa_2 = ∫ v² w(v) dv and kappa = ∫ w²(v) dv.

The proof is similar to the proof of (2.14) (the case with purely continuous variables only), and is left as an exercise (see Exercise 3.2). From Theorem 3.2 we once again see that the rate of convergence is solely determined by the rate associated with the continuous variables.

Following the analysis in Chapter 2, one can again choose the smoothing parameters (the h_s's) by the cross-validation (CV) method, the generalized CV method, or the plug-in method (when a closed form expression is available). In order for the nonparametric frequency estimation method to yield reliable results in the mixed variables case, one needs a reasonable amount of data in each discrete cell. Should the number of discrete cells be large relative to the sample size, nonparametric frequency methods may, as expected, be quite unreliable.
3.4 Some Cautionary Remarks on Frequency Methods

In this chapter we have shown that, theoretically, there is no problem whatsoever with adding discrete variables (with finite support) to a nonparametric estimation framework, because the nonparametric frequency method (with only discrete variables) has an O_p(n^{-1/2}) rate of convergence, which is faster than the rate of convergence associated with the subset of q purely continuous variables. However, nonparametric frequency methods will clearly be useful only when one has a large sample and when the discrete variables assume a limited number of values, i.e., in situations where the number of discrete cells is much smaller than the sample size. For many economic datasets, however, one often encounters cases for which the number of discrete cells is close to or even larger than the sample size. Obviously, the sample splitting frequency estimator cannot be used in such situations.

To further illustrate this point, let M denote the number of cells associated with the discrete variables. In even simple applications, M can be in the hundreds and even tens of thousands. For example, if one has data which includes a person's sex (0/1), industry of occupation (say, 1-20), and religious affiliation (say, 1-5), then the simple presence of three discrete variables gives rise to 2 × 20 × 5 = 200 cells. The average number of observations (the effective sample size) in each cell is ess = n/M. Since M is a fixed finite number (no matter how large), asymptotically we have n/M = O(n). Therefore, the effective sample size is of order n. Even if M = 1,000,000, n/M is still of order O(n). For a univariate continuous variable, however, the effective sample size used in estimation is of order nh = O(n^{4/5}) (if h = O(n^{-1/5})). Asymptotically, cn^{4/5} is of smaller order than n/M for any positive constants c and M.
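The cell-count arithmetic above is easy to reproduce. In the sketch below the sample size n is a hypothetical choice of ours (the text fixes only the number of cells, M = 200):

```python
# Effective sample sizes for the sex x industry x religion example:
# sex (2 values) x industry (20 values) x religious affiliation (5 values).
M = 2 * 20 * 5                       # 200 discrete cells
n = 10_000                           # hypothetical sample size
ess_discrete = n / M                 # average observations per cell: 50
ess_continuous = n ** (4 / 5)        # order n*h with h = O(n^(-1/5)): ~1585
```

For this n both effective sample sizes are workable, but shrinking n or enlarging M quickly starves the discrete cells while leaving the continuous-variable effective sample size of order n^{4/5} comparatively healthy — the point of the n = 500 example that follows.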
Therefore, asymptotically the effective sample size for discrete variables is larger than that for continuous variables. Consequently, nonparametric estimation with discrete variables has a faster rate of convergence than that with continuous variables.

In finite-sample applications, however, the story can be quite different. The √n rate for the discrete variable case can give a misleading impression regarding its usefulness in small sample applications. Let us consider some artificial data having a sample size n = 500 with five discrete variables, and further suppose that the five discrete variables assume 2, 2, 5, 5, and 5 different values, respectively. In this example, the number of discrete cells arising from the presence of these five discrete variables is 2 × 2 × 5 × 5 × 5 = 500. Therefore, the average number of observations (the effective sample size) in each cell is n/500 = 500/500 = 1, which is too small to yield any meaningful nonparametric estimates (were one to use the frequency method, that is). However, n = 500 is certainly large enough to ensure accurate nonparametric estimation with one continuous variable. If q = 1 (r = 0, i.e., no discrete variables), then the effective sample size used in nonparametric estimation with a continuous variable is n^{4/5} = 500^{4/5} ≈ 144, which is much larger than the effective sample size of 1 in the above discrete variable example (with the number of cells equal to 500).

In Chapter 4, we discuss an alternative approach in which we also smooth the discrete variables in a particular manner first suggested by Aitchison and Aitken (1976). We extend Aitchison and Aitken's approach to cover nonparametric regression and conditional density estimation with mixed discrete and continuous variables.
We also present several empirical applications where we show that such smoothing approaches can provide superior out-of-sample predictions when compared with some commonly used parametric methods.

3.5 Proofs

3.5.1 Proof of Theorem 3.1

Proof. For what follows, we use the shorthand notation

f(x^c + hv, x^d) = f(x_1^c + h_1 v_1, ..., x_q^c + h_q v_q, x^d).

By the same argument used in the proof of Theorem 1.3, we have

E[fhat(x)] = E[W_h(X_i^c, x^c) 1(X_i^d = x^d)]
 = Σ_{z^d ∈ S^d} ∫ Π_{s=1}^q h_s^{-1} w((z_s^c - x_s^c)/h_s) 1(z^d = x^d) f(z^c, z^d) dz^c
 = ∫ f(x^c + hv, x^d) W(v) dv
 = f(x^c, x^d) + (kappa_2/2) Σ_{s=1}^q f_{ss}(x) h_s² + O(Σ_{s=1}^q h_s^4),

where W(v) = Π_{s=1}^q w(v_s).

Next, by the same argument as in the proof of Theorem 1.4, we have

n var(fhat(x)) = var(W_h(X_i^c, x^c) 1(X_i^d = x^d))
 = E[W_h(X_i^c, x^c)² 1(X_i^d = x^d)] - {E[W_h(X_i^c, x^c) 1(X_i^d = x^d)]}²
 = (h_1 ... h_q)^{-1} ∫ W²(v) f(x^c + hv, x^d) dv - O(1)
 = (h_1 ... h_q)^{-1} [kappa^q f(x) + o(1)]

(since ∫ w²(v) dv < ∞ and [∫ w(v) dv]² = 1), so that var(fhat(x)) = O((n h_1 ... h_q)^{-1}).

The above results show that bias(fhat(x)) = O(Σ_{s=1}^q h_s²) and that var(fhat(x)) = O((n h_1 ... h_q)^{-1}). Hence, MSE(fhat(x)) = O((Σ_{s=1}^q h_s²)² + (n h_1 ... h_q)^{-1}), which implies Theorem 3.1 (i).

Part (ii) follows by the same argument given above and the application of Liapunov's CLT (see the proof of Theorem 1.3), which proves Theorem 3.1 (ii). □

3.6 Exercises

Exercise 3.1. Prove (3.7) and (3.8).
Hint: Write

ghat(x^d) - g(x^d) = [ghat(x^d) - g(x^d)] phat(x^d)/phat(x^d) ≡ mhat(x^d)/phat(x^d).

Now, using Y_i = g(X_i^d) + u_i and Σ_i [g(X_i^d) - g(x^d)] 1(X_i^d = x^d) = 0, we have

mhat(x^d) = n^{-1} Σ_{i=1}^n [g(X_i^d) + u_i - g(x^d)] 1(X_i^d = x^d) = n^{-1} Σ_{i=1}^n u_i 1(X_i^d = x^d),   (3.11)

because Σ_i [g(X_i^d) - g(x^d)] 1(X_i^d = x^d) = 0.

From (3.11), it is easy to see that E[mhat(x^d)] = 0 and E[mhat(x^d)²] = O(n^{-1}), which implies mhat(x^d) = O_p(n^{-1/2}). Also, phat(x^d) = p(x^d) + O_p(n^{-1/2}). Together we have ghat(x^d) - g(x^d) = O_p(n^{-1/2}).

Using (3.11), it is easy to see that E[(√n mhat(x^d))²] = sigma²(x^d) p(x^d). Hence, √n mhat(x^d) → N(0, sigma²(x^d) p(x^d)) by the Lindeberg-Levy CLT. Hence,

√n (ghat(x^d) - g(x^d)) = √n mhat(x^d)/phat(x^d) → (1/p(x^d)) N(0, sigma²(x^d) p(x^d)) = N(0, sigma²(x^d)/p(x^d)).

Exercise 3.2. Prove Theorem 3.2.
Hint: Similar to the proof of (2.14), write

ghat(x) - g(x) = [ghat(x) - g(x)] fhat(x)/fhat(x) ≡ mhat(x)/fhat(x).
Write mhat(x) = mhat_1(x) + mhat_2(x), where mhat_1(x) and mhat_2(x) are defined in the same way as mhat_1(x) and mhat_2(x) of Chapter 1. Show that

mhat_1(x) = (kappa_2/2) Σ_{s=1}^q [f(x) g_{ss}(x) + 2 f_s(x) g_s(x)] h_s² + o_p(Σ_{s=1}^q h_s² + (n h_1 ... h_q)^{-1/2}),

and that

(n h_1 ... h_q)^{1/2} mhat_2(x) → N(0, kappa^q sigma²(x) f(x))

in distribution. These results and fhat(x) = f(x) + o_p(1) lead to Theorem 3.2.

Chapter 4

Kernel Estimation with Mixed Data

In Chapter 3 we discussed the traditional frequency-based approach to nonparametric estimation in the presence of discrete variables, and also outlined the pros and cons associated with such approaches. We now turn our attention to an alternative nonparametric approach which can be deployed when confronted with discrete variables where, instead of using the conventional frequency method described in Chapter 3, we elect to smooth the discrete variables in a particular fashion.

By smoothing the discrete variables when estimating probability and regression functions rather than adopting the traditional cell-based frequency approach, we can dramatically extend the reach of nonparametric methods. From a statistical point of view, smoothing discrete variables may introduce some bias; however, it also reduces the finite-sample variance, resulting in a reduction in the finite-sample MSE of the nonparametric estimator relative to a frequency-based estimator.

A particularly noteworthy feature of this approach is developed in Section 4.5, where the smoothing method is used in conjunction with a data-driven method of bandwidth selection, such as cross-validation, to automatically remove "irrelevant variables" (a specific definition of irrelevant variables is given in Section 4.5). This is an important result, as we have observed that irrelevant variables appear surprisingly often in practice. When irrelevant variables exist, the conventional frequency method outlined in Chapter 3 still splits the sample into many discrete cells, including those cells arising from the presence of the irrelevant variables.
In contrast, the smoothing-cross-validation method can (asymptotically) automatically remove these irrelevant variables, thereby reducing the dimension of the nonparametric model to that associated with the relevant variables only. In such cases, impressive efficiency gains often arise from smoothing the discrete variables. Sections 4.1, 4.2 and 4.3 discuss the discrete-variable-only case, and Sections 4.4 and 4.5 cover the general mixed discrete and continuous variable case.

4.1 Smooth Estimation of Joint Distributions with Discrete Data

We now consider the kernel estimation of a probability function defined over discrete data X_i^d ∈ S^d, the support of X^d. As before, we use x_s^d and X_{is}^d to denote the sth component of x^d and X_i^d (s = 1, ..., r), respectively. Following Aitchison and Aitken (1976), for x_s^d, X_{is}^d ∈ {0, 1, ..., c_s - 1}, we define a discrete univariate kernel function

l(X_{is}^d, x_s^d, lambda_s) = 1 - lambda_s   if X_{is}^d = x_s^d,
l(X_{is}^d, x_s^d, lambda_s) = lambda_s/(c_s - 1)   if X_{is}^d ≠ x_s^d.   (4.1)

Note that when lambda_s = 0, l(X_{is}^d, x_s^d, 0) = 1(X_{is}^d = x_s^d) becomes an indicator function, and if lambda_s = (c_s - 1)/c_s, then l(X_{is}^d, x_s^d, lambda_s) = 1/c_s is a constant for all values of X_{is}^d and x_s^d. The range of lambda_s is [0, (c_s - 1)/c_s].

For multivariate data we use a standard product kernel function given by

L(X_i^d, x^d, lambda) = Π_{s=1}^r l(X_{is}^d, x_s^d, lambda_s) = Π_{s=1}^r (1 - lambda_s)^{1 - N_{is}(x)} [lambda_s/(c_s - 1)]^{N_{is}(x)},   (4.2)

where N_{is}(x) = 1(X_{is}^d ≠ x_s^d) is an indicator function which equals 1 if X_{is}^d ≠ x_s^d, and 0 otherwise. For a given value of the vector lambda = (lambda_1, ..., lambda_r), we estimate p(x^d) by

phat(x^d) = n^{-1} Σ_{i=1}^n L(X_i^d, x^d, lambda).   (4.3)
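A minimal sketch of the Aitchison-Aitken kernel (4.1) and the smoothed probability estimator (4.3), on simulated data with hypothetical names. Because l(·, ·, lambda_s) sums to one over the c_s categories, the estimated probabilities sum to one over all cells, and setting lambda = 0 recovers the frequency estimator:

```python
import numpy as np

def aa_kernel(Xd, xd, lam, c):
    """Aitchison-Aitken kernel (4.1): 1-lambda if X = x, lambda/(c-1) otherwise."""
    return np.where(Xd == xd, 1.0 - lam, lam / (c - 1))

def smoothed_prob(sample, cell, lams, cats):
    """Smoothed estimator (4.3): (1/n) sum_i prod_s l(X_is, x_s, lambda_s)."""
    weights = np.ones(len(sample))
    for s in range(sample.shape[1]):
        weights *= aa_kernel(sample[:, s], cell[s], lams[s], cats[s])
    return weights.mean()

rng = np.random.default_rng(3)
sample = rng.integers(0, 3, size=(400, 2))       # two regressors, each with c = 3 values

p = smoothed_prob(sample, (1, 2), lams=(0.1, 0.1), cats=(3, 3))
total = sum(smoothed_prob(sample, (a, b), (0.1, 0.1), (3, 3))
            for a in range(3) for b in range(3))  # sums to 1 over all 9 cells
```

With lams = (0.0, 0.0) this reproduces the frequency estimator exactly; larger lambdas borrow more information from neighboring cells, trading a little bias for lower variance.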
Theorem 4.3 states that the smoothing parameters associated with uniformly distributed variables will not converge to zero; rather, they have a high probability of assuming their upper extreme values, so that the estimated probability function satisfies the uniform distribution condition for s = r₁ + 1, ..., r and is more efficient than an estimator that does not impose this restriction. It is difficult to determine the exact value of δ. The simulation results reported in Ouyang et al. suggest that δ is around 0.6 for a wide range of data generating processes.

4.2 Smooth Regression with Discrete Data

We consider the regression model

    Y_i = g(X_i^d) + u_i,

where g(·) is an unknown function, X_i^d ∈ S^d, E(u_i|X_i^d) = 0 and E(u_i²|X_i^d) = σ²(X_i^d) is of unknown form. Although the kernel function defined in (4.2) can be used to estimate g(x^d), we suggest using the following simple alternative kernel function in regression settings:

    l(X_{is}^d, x_s^d, λ_s) = 1     when X_{is}^d = x_s^d,
                              λ_s   otherwise.              (4.12)

When λ_s = 0, l(X_{is}^d, x_s^d, 0) becomes an indicator function, and when λ_s = 1, l(X_{is}^d, x_s^d, 1) ≡ 1 is a uniform weight function. Therefore, the range of λ_s is [0, 1]. Note that the kernel weight function defined in (4.12) does not add up to 1 and therefore would be inappropriate for estimating a probability function. However, this does not affect the nonparametric estimator ĝ(x) defined in (4.14) below, because the kernel function appears in both the numerator and the denominator of (4.14), and obviously the kernel function can therefore be multiplied by any nonzero constant leaving the definition of ĝ(x) intact. By using (4.12), the product kernel function is given by

    L(X_i^d, x^d, λ) = Π_{s=1}^r λ_s^{N_{is}(x)},      (4.13)

where N_{is}(x) = 1(X_{is}^d ≠ x_s^d) (the indicator function). We estimate g(x^d) by

    ĝ(x^d) = [n^{−1} Σ_{i=1}^n Y_i L(X_i^d, x^d, λ)] / p̂(x^d),      (4.14)

where p̂(x^d) = n^{−1} Σ_{i=1}^n L(X_i^d, x^d, λ). When λ_s = 0 for all s = 1, ..., r, our estimator collapses to the frequency estimator given in (3.6). Under the assumption that λ_s → 0 as n → ∞ (s = 1, ..., r), one can easily show that ĝ(x^d) − g(x^d) = O_p(Σ_{s=1}^r λ_s + n^{−1/2}).
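A sketch of the regression kernel (4.12) and the resulting local constant estimator (4.14), again in NumPy with illustrative names of my own choosing:

```python
import numpy as np

def l_reg(Xs, xs, lam):
    """Kernel (4.12): 1 if X_s == x_s, lambda_s otherwise (lambda_s in [0, 1])."""
    return np.where(Xs == xs, 1.0, lam)

def g_hat(x, X, Y, lams):
    """Local constant estimator (4.14) with discrete regressors only.
    The unnormalised kernel is fine here: constants cancel in the ratio."""
    w = np.ones(len(Y))
    for s in range(X.shape[1]):
        w *= l_reg(X[:, s], x[s], lams[s])
    return np.sum(w * Y) / np.sum(w)
```

At λ = 0 this is the cell (frequency) mean; at λ = 1 every regressor is smoothed out and the estimator reduces to the overall sample mean of Y.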
For example, as was the case for the proof of the purely continuous variable estimator discussed in Chapter 2, we could write ĝ(x^d) − g(x^d) = [ĝ(x^d) − g(x^d) p̂(x^d)]/p̂(x^d) ≡ m̂(x^d)/p̂(x^d). One can show that E[m̂(x^d)] = O(Σ_{s=1}^r λ_s) and var[m̂(x^d)] = O(n^{−1}). These imply that E[m̂(x^d)²] = O((Σ_{s=1}^r λ_s)² + n^{−1}), which in turn implies that m̂(x^d) = O_p(Σ_{s=1}^r λ_s + n^{−1/2}), while it can easily be shown that p̂(x^d) = p(x^d) + O_p(Σ_{s=1}^r λ_s + n^{−1/2}) by the same arguments that lead to (4.5). Hence, we have

    ĝ(x^d) − g(x^d) = m̂(x^d)/p̂(x^d) = O_p(Σ_{s=1}^r λ_s + n^{−1/2}).

We now discuss the use of cross-validation for selecting the λ_s's. We choose λ₁, ..., λ_r to minimize the least squares cross-validation function

    CV(λ) = (1/n) Σ_{i=1}^n [Y_i − ĝ_{−i}(X_i^d)]²,      (4.16)

where

    ĝ_{−i}(X_i^d) = [(n − 1)^{−1} Σ_{j≠i} Y_j L(X_j^d, X_i^d, λ)] / p̂_{−i}(X_i^d)      (4.17)

is the leave-one-out kernel estimator of g(X_i^d), and p̂_{−i}(X_i^d) is the leave-one-out estimator of p(X_i^d). We use λ̂₁, ..., λ̂_r to denote the cross-validation choices of λ₁, ..., λ_r that minimize (4.16).

In this section we consider only the case for which all of the components of X^d are relevant, and we postpone the discussion of potentially irrelevant regressors until the next section. Also, in Section 4.5 we discuss the more general case with mixed categorical and continuous regressors. We start with the following assumption.

Assumption 4.1. (i) E(Y²|X^d) is finite. (ii) The only values of (λ₁, ..., λ_r) that make

    Σ_{x^d∈S^d} p(x^d) { Σ_{z^d∈S^d} [g(z^d) − g(x^d)] L(z^d, x^d, λ) p(z^d) }² = 0

are λ_s = 0 for all s = 1, ..., r.

Assumption 4.1 (ii) implies that g(x^d) is not a constant function of any component x_s^d ∈ S_s^d (see Exercise 4.11), and it also implies that λ̂_s = o_p(1) for all s.

The next theorem establishes the convergence rate of λ̂_s.

Theorem 4.4. Under Assumption 4.1 we have λ̂_s = O_p(n^{−1}) for all s = 1, ..., r.

Theorem 4.4 is proved in Li, Ouyang and Racine (2006). Exercise 4.7 asks the reader to prove that Theorem 4.4 holds for the simple case for which x^d is univariate (hints are also provided).

Theorem 4.4 shows that λ̂_s converges to zero at the rate O_p(n^{−1}). With this fast rate of convergence it is easy to establish the asymptotic distribution of ĝ(x^d), as the next theorem shows.

Theorem 4.5.
Under Assumption 4.1, we have

    √n (ĝ(x^d) − g(x^d)) / √(Ω̂(x^d)) → N(0, 1) in distribution,

where Ω̂(x^d) = σ̂²(x^d)/p̂(x^d), σ̂²(x^d) = [n^{−1} Σ_{i=1}^n (Y_i − ĝ(X_i^d))² L(X_i^d, x^d, λ̂)]/p̂(x^d) is a consistent estimator of σ²(x^d), and p̂(x^d) = n^{−1} Σ_{i=1}^n 1(X_i^d = x^d).

Proof. Recall that the frequency estimator g̃(x^d) can be obtained from ĝ(x^d) with λ_s = 0 for all s = 1, ..., r. Using λ̂_s = O_p(n^{−1}), it is easy to see that ĝ(x^d) = g̃(x^d) + O_p(n^{−1}). Therefore, √n(ĝ(x^d) − g(x^d)) = √n(g̃(x^d) − g(x^d)) + O_p(n^{−1/2}) → N(0, σ²(x^d)/p(x^d)) by (3.8). Finally, Exercise 4.8 shows that σ̂²(x^d) = σ²(x^d) + o_p(1). Theorem 4.5 follows directly. □

In the next section we discuss the case in which a subset of regressors is irrelevant.

4.3 Kernel Regression with Discrete Regressors: The Irrelevant Regressor Case

In this section we allow for the possibility that some of the regressors are in fact irrelevant in the sense that they are independent of the dependent variable. Without loss of generality, we assume that the first r₁ (1 ≤ r₁ < r) components of X^d are relevant, while the remaining r₂ = r − r₁ components of X^d are irrelevant. Let X̄_i^d denote the r₁-dimensional relevant components of X_i^d and let X̃_i^d denote the r₂-dimensional irrelevant components. Similar to the approach taken in Li et al. (2006), we assume that

    (Y_i, X̄_i^d) and X̃_i^d are independent of each other.      (4.18)

The kernel estimator of g(x^d) as well as the definition of the CV objective function are the same as those given in Section 4.2. We still use (λ̂₁, ..., λ̂_r) to denote the CV-selected smoothing parameters. In Theorem 4.6 below we show that (i) the smoothing parameters associated with the relevant regressors converge to zero at the rate of n^{−1/2}, which differs from the rate of convergence when no irrelevant components exist, and (ii) the smoothing parameters associated with the irrelevant components will not converge to zero, but have a high probability of converging to their upper extreme values, so that these irrelevant regressors are smoothed out with high probability.
Furthermore, there exists a positive probability that these smoothing parameters do not converge to their upper extreme values even as n → ∞. Mirroring the notation used for x̄^d and x̃^d, we shall use λ̄_s to denote the smoothing parameters associated with the relevant regressors, and λ̃_s for the irrelevant ones. Thus, we have λ̄_s = λ_s for s = 1, ..., r₁, and λ̃_s = λ_{r₁+s} for s = 1, ..., r₂ (r₂ = r − r₁). Also, using p̄(·) and p̃(·) to denote the marginal probability functions of X̄^d and X̃^d, respectively, then by (4.18) we know that p(x^d) = p̄(x̄^d) p̃(x̃^d), and using S̄^d to denote the support of x̄^d, Assumption 4.1 is modified as follows.

Assumption 4.2. The only values of λ₁, ..., λ_{r₁} that make

    Σ_{x̄^d∈S̄^d} p̄(x̄^d) { Σ_{z̄^d∈S̄^d} [ḡ(z̄^d) − ḡ(x̄^d)] L̄(z̄^d, x̄^d, λ̄) p̄(z̄^d) }² = 0

are λ_s = 0 for all s = 1, ..., r₁.

Assumption 4.2 ensures that the CV-selected smoothing parameters associated with the relevant regressors will converge to zero, while we do not impose any assumption on the smoothing parameters associated with the irrelevant regressors except that they take values in the unit interval [0, 1].

The next theorem provides the asymptotic behavior of the CV-selected smoothing parameters.

Theorem 4.6. Assume that r₁ ≥ 1 and r₂ ≥ 1 (with r = r₁ + r₂ ≥ 2). Then under Assumption 4.2, we have

(i) λ̂_s = O_p(n^{−1/2}) for s = 1, ..., r₁;
(ii) lim_{n→∞} P(λ̂_s = 1, s = r₁ + 1, ..., r) ≥ α for some α ∈ (0, 1).

The proof of Theorem 4.6 can be found in Li et al. (2006).

Theorem 4.6 states that the smoothing parameters associated with the relevant regressors all converge to zero at the rate of n^{−1/2}, while the smoothing parameters for the irrelevant regressors have a positive probability of assuming their upper bound value of 1. That is, there is a positive probability that the irrelevant regressors will be smoothed out. It is difficult to determine the exact value of α for the general case.
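The automatic removal of irrelevant regressors can be illustrated numerically. Below is a minimal sketch of the leave-one-out objective (4.16) built from the kernel (4.12), minimized by brute-force grid search rather than the numerical optimizers one would use in practice; the function names and the simulated design are assumptions of this illustration.

```python
import numpy as np
from itertools import product

def cv_objective(lams, X, Y):
    """CV(lambda) of (4.16): average squared leave-one-out prediction
    error under the product kernel built from (4.12)."""
    n = len(Y)
    W = np.ones((n, n))                 # n x n kernel weight matrix
    for s in range(X.shape[1]):
        eq = X[:, s][:, None] == X[:, s][None, :]
        W *= np.where(eq, 1.0, lams[s])
    np.fill_diagonal(W, 0.0)            # leave-one-out
    g_loo = (W @ Y) / W.sum(axis=1)
    return np.mean((Y - g_loo) ** 2)

def cv_select(X, Y, grid):
    """Grid-search minimiser of CV(lambda) over grid values in [0, 1]."""
    r = X.shape[1]
    best = min(product(grid, repeat=r), key=lambda l: cv_objective(l, X, Y))
    return np.array(best)
```

With Y depending only on the first regressor, the selected λ for that regressor is driven toward zero, while large λ values for the irrelevant regressor lower the objective by pooling observations across its cells, mimicking Theorems 4.4 and 4.6.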
The simulation results reported in Li et al. (2006) show that there is usually about a 60% chance that λ̂_s takes the upper extreme value 1 and a 40% chance that λ̂_s takes values between 0 and 1, for s = r₁ + 1, ..., r.

Note that when x_s^d is an irrelevant regressor, the asymptotic behavior of λ̂_s is difficult to characterize in further detail, since λ̂_s does not converge to zero in this case. The asymptotic distribution of the resulting ĝ(x) will also be difficult to obtain. When one obtains a relatively large value for λ̂_s and suspects that this may reflect the fact that x_s^d is an irrelevant regressor, one may perform a formal test of the null hypothesis that x_s^d is indeed an irrelevant regressor, say, using the bootstrap-based test suggested by Racine et al. (forthcoming) (see Chapter 12). If one fails to reject the null hypothesis, one may remove this regressor from the model. In this way only relevant regressors are expected to remain in the model, and one can recompute the cross-validated bandwidths using only those regressors deemed to be relevant. Then the results of Theorem 4.4 and Theorem 4.5 apply, as only relevant regressors are expected to remain in the model.

We now proceed to the general case of smooth nonparametric estimation with mixed discrete and continuous data.

4.4 Regression with Mixed Data: Relevant Regressors

4.4.1 Smooth Estimation with Mixed Data

In this section we consider a nonparametric regression model where a subset of regressors are categorical while the remaining ones are continuous. As in Chapter 3, we use X_i^d to denote an r × 1 vector of regressors that assume discrete values and let X_i^c ∈ R^q denote the remaining continuous regressors. We again let X_{is}^d denote the sth component of X_i^d. We assume that X_{is}^d assumes c_s ≥ 2 different values, i.e., that X_{is}^d ∈ {0, 1, ..., c_s − 1} for s = 1, ..., r, and define X_i = (X_i^c, X_i^d). The nonparametric regression model is given by

    Y_i = g(X_i) + u_i      (4.19)

with E(u_i|X_i) = 0.
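For model (4.19) the estimator developed below combines a continuous product kernel with the discrete kernel of Section 4.2. Here is a minimal NumPy sketch, assuming Gaussian continuous kernels (any symmetric nonnegative w works) and illustrative function names of my own:

```python
import numpy as np

def kernel_mixed(x_c, x_d, Xc, Xd, h, lam):
    """Product kernel: Gaussian for the q continuous components,
    the (4.12)-type weight for the r discrete ones.  Normalising
    constants are dropped; they cancel in the local constant ratio."""
    Wc = np.exp(-0.5 * ((Xc - x_c) / h) ** 2).prod(axis=1)
    Wd = np.where(Xd == x_d, 1.0, lam).prod(axis=1)
    return Wc * Wd

def g_hat_mixed(x_c, x_d, Xc, Xd, Y, h, lam):
    """Local constant estimator: sum_i Y_i K_i / sum_i K_i."""
    w = kernel_mixed(x_c, x_d, Xc, Xd, h, lam)
    return np.sum(w * Y) / np.sum(w)
```

With λ = 0 the discrete part acts as an exact cell indicator, so the estimator is an ordinary continuous kernel regression run within each discrete cell.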
We use f(x) = f(x^c, x^d) to denote the joint PDF of (X_i^c, X_i^d).

For the discrete variables X_i^d, we first consider the case for which the variables possess no natural ordering. The extension to the general case whereby some of the discrete regressors have a natural ordering is discussed at the end of this section.

For x^c = (x_1^c, ..., x_q^c), define

    W_h(x^c, X_i^c) = Π_{s=1}^q h_s^{−1} w((x_s^c − X_{is}^c)/h_s),

where w is a symmetric, nonnegative univariate kernel function and 0 < h_s < ∞ is the smoothing parameter for x_s^c, and for x^d = (x_1^d, ..., x_r^d) define

    L(x^d, X_i^d, λ) = Π_{s=1}^r λ_s^{N_{is}(x)},

where N_{is}(x) = 1(X_{is}^d ≠ x_s^d) is an indicator function which equals one when X_{is}^d ≠ x_s^d, and zero otherwise. The smoothing parameter for x_s^d is λ_s ∈ [0, 1]. With the product kernel K_γ(x, X_i) = W_h(x^c, X_i^c) L(x^d, X_i^d, λ) and m̂(x) defined as in the purely continuous case, one can show that

    E[m̂(x)] = Σ_{s=1}^q B_s(x) f(x) h_s² + Σ_{s=1}^r B_{2s}(x) f(x) λ_s + o(η₁),      (4.21)

    var[m̂(x)] = (κ^q σ²(x) f(x))/(n h₁⋯h_q) [1 + O(η₁)],      (4.22)

    f̂(x) = f(x) + O_p(η₁ + η₂),      (4.23)

where

    B_s(x) = (κ₂/2) [g_ss(x) + 2 g_s(x) f_s(x) f(x)^{−1}],      (4.24)

    B_{2s}(x) = Σ_{z^d∈S^d} 1_s(z^d, x^d) [g(x^c, z^d) − g(x)] f(x^c, z^d) f(x)^{−1},

κ = ∫ w(v)² dv, η₁ = Σ_{s=1}^q h_s² + Σ_{s=1}^r λ_s, η₂ = (n h₁⋯h_q)^{−1/2}, and 1_s(z^d, x^d) takes the value 1 if z^d and x^d differ only in their sth component, and 0 otherwise.

Equations (4.21) and (4.22) are equivalent to E[m̂(x)²] = O(η₁² + η₂²), which implies that m̂(x) = O_p(η₁ + η₂) and that

    ĝ(x) − g(x) = m̂(x)/f̂(x) = O_p(η₁ + η₂).

If, in addition to η₁ = o(1) and η₂ = o(1), one also assumes that (n h₁⋯h_q)^{1/2} η₁ = O(1), then using (4.21) to (4.23) one can establish the asymptotic normal distribution of ĝ(x) using Liapunov's CLT:

    (n h₁⋯h_q)^{1/2} ( ĝ(x) − g(x) − Σ_{s=1}^q h_s² B_s(x) − Σ_{s=1}^r λ_s B_{2s}(x) ) → N(0, κ^q σ²(x)/f(x)).      (4.25)

4.4.2 The Cross-Validation Method

Least squares cross-validation selects h₁, ..., h_q,
λ₁, ..., λ_r to minimize the following cross-validation function:

    CV(h, λ) = Σ_{i=1}^n (Y_i − ĝ_{−i}(X_i))² M(X_i^c),      (4.26)

where ĝ_{−i}(X_i) = Σ_{j≠i} Y_j K_γ(X_i, X_j) / Σ_{j≠i} K_γ(X_i, X_j) is the leave-one-out kernel estimator of g(X_i), and 0 ≤ M(·) ≤ 1 is a weight function which serves to avoid difficulties caused by dividing by zero or by the slow convergence rate induced by boundary effects.

We assume that

    (h₁, ..., h_q, λ₁, ..., λ_r) ∈ [0, η_n]^{q+r} and n h₁⋯h_q ≥ t_n,      (4.27)

where η_n is a positive sequence that converges to zero at a rate slower than the inverse of any polynomial in n, and t_n is a sequence of constants that diverges to infinity. Let S denote the support of M(·). We also assume that

    g(·) and f(·) have two continuous derivatives; M(·) is continuous, nonnegative and has compact support; f(·) is bounded away from zero for x = (x^c, x^d) ∈ S.      (4.28)

It can be shown that the leading term of CV is CV₀, given by (ignoring the terms unrelated to (h, λ))

    CV₀(h, λ) = ∫ { [ Σ_{s=1}^q h_s² B_s(x) + Σ_{s=1}^r λ_s B_{2s}(x) ]² + (κ^q σ²(x))/(n h₁⋯h_q f(x)) } f(x) M(x^c) dx.      (4.29)

The relationship between CV and MSE[ĝ(x)] can also be used to derive (4.29). In Chapter 2 we argued that the leading term of CV is related to the pointwise MSE by

    CV₀(h, λ) ≈ ∫ MSE[ĝ(x)] f(x) M(x^c) dx.      (4.30)

Equation (4.25) implies that

    MSE[ĝ(x)] ≈ [ Σ_{s=1}^q h_s² B_s(x) + Σ_{s=1}^r λ_s B_{2s}(x) ]² + (κ^q σ²(x))/(n h₁⋯h_q f(x)).

Substituting this into (4.30), and ignoring the terms unrelated to (h, λ) and the smaller order terms, we obtain CV₀(h, λ) = n^{−4/(q+4)} χ_{a,b}, where

    χ_{a,b} = ∫ { [ Σ_{s=1}^q a_s² B_s(x) + Σ_{s=1}^r b_s B_{2s}(x) ]² + (κ^q σ²(x))/(a₁⋯a_q f(x)) } f(x) M(x^c) dx,      (4.31)

where a_s = n^{1/(q+4)} h_s and b_s = n^{2/(q+4)} λ_s. Equation (4.31) indeed gives the leading term of CV, as stated in (4.29).

Let a₁⁰, ..., a_q⁰, b₁⁰, ..., b_r⁰ denote values that minimize χ subject to each of them being nonnegative. We exclude the case for which some a_s⁰ or b_s⁰ are infinite, and require that

    the a_s⁰'s and b_s⁰'s are uniquely defined, and each of them is finite.      (4.32)

This implies that 0 < a_s⁰ < ∞ and 0 ≤ b_s⁰ < ∞ for all s.

Theorem 4.7. Under (4.27), (4.28) and (4.32), n^{1/(q+4)} ĥ_s → a_s⁰ and n^{2/(q+4)} λ̂_s → b_s⁰ in probability.

The proof of Theorem 4.7 can be found in Hall et al. (2006).
Theorem 4.7 follows from (4.33) and the fact that

    CV(h, λ) − ∫ σ²(x) M(x^c) f(x) dx = CV₀(h, λ) + (s.o.)

uniformly in (h, λ) ∈ [0, η_n]^{q+r}, where η_n > 0 and η_n → 0 as n → ∞. Intuitively, one should expect that the ĥ_s's and λ̂_s's that minimize CV(·, ·) have the same asymptotic behavior as the h_s⁰'s and λ_s⁰'s that minimize CV₀(·, ·), the leading term of CV(h, λ). Therefore, one expects that ĥ_s = h_s⁰ + (s.o.) and λ̂_s = λ_s⁰ + (s.o.).

4.5 Regression with Mixed Data: Irrelevant Regressors

In this section we consider the case in which (i.e., allow for the possibility that) some of the regressors may be irrelevant. Without loss of generality, we assume that only the first q₁ components of x^c and the first r₁ components of x^d are "relevant" regressors in the sense defined below. For integers 0 ≤ q₁ ≤ q and 0 ≤ r₁ ≤ r, we show below that the CV-selected ĥ_s → ∞ for s = q₁ + 1, ..., q and λ̂_s → 1 for s = r₁ + 1, ..., r. Thus, asymptotically, cross-validation can automatically remove irrelevant variables.

The cross-validation objective function is the same as that defined in Section 4.4 except that now Y_i = ḡ(X̄_i) + u_i with E(u_i|X̄_i) = 0; that is, the conditional mean function now depends only on the relevant regressors X̄. The definitions of m̂_i, m̂_{−i} and f̂_i all remain unchanged except that one needs to replace g_i = g(X_i) by ḡ_i = ḡ(X̄_i) wherever it appears.

The modified analogue of the function χ defined in Section 4.4 is

    χ̄(a₁, ..., a_{q₁}, b₁, ..., b_{r₁}) = ∫ { [ Σ_{s=1}^{q₁} a_s² B̄_s(x̄) + Σ_{s=1}^{r₁} b_s B̄_{2s}(x̄) ]² + (κ^{q₁} σ̄²(x̄))/(a₁⋯a_{q₁} f̄(x̄)) } f̄(x̄) M̄(x̄^c) dx̄,

where the bar (tilde) notation refers to functions of vectors involving only the relevant (irrelevant) regressors x̄ (x̃); f̄ (f̃) is the joint density of x̄ (x̃ = (x̃^c, x̃^d)); and ḡ_ss (f̄_ss) are second order derivatives of ḡ (f̄) with respect to x̄_s^c. As before, let a₁⁰, ..., a_{q₁}⁰, b₁⁰, ..., b_{r₁}⁰ denote values that minimize χ̄ subject to each of them being nonnegative.
We require that the a_s⁰'s and b_s⁰'s are uniquely defined, and that each is finite.

Note that the irrelevant components do not appear in the definition of χ̄. This is because the irrelevant components cancel each other in the ratio E[m̂(x)]/E[f̂(x)], which we denote by g*(x) = E[m̂(x)]/E[f̂(x)]. For the relevant variables x̄ we require that

    ∫ [ḡ*(x̄) − ḡ(x̄)]² f̄(x̄) M̄(x̄^c) dx̄,      (4.38)

now interpreted as a function of h₁, ..., h_{q₁}, λ₁, ..., λ_{r₁}, vanishes if and only if all of these smoothing parameters vanish.

Since we must allow the smoothing parameters for irrelevant variables to diverge from zero, we can no longer impose the requirement that h_s = o(1) and λ_s = o(1) for all s. Rather, we impose the following conventional conditions on the bandwidths and kernels:

    min(h₁, ..., h_q) ≥ n^{−C} and max(h₁, ..., h_q) ≤ n^C for some C > 0; the kernel w is a symmetric, compactly supported, Hölder-continuous PDF; and w(0) > w(δ) for all δ > 0.      (4.39)

These conditions are quite weak: in addition, we require only that the fastest rate at which h_s goes to zero cannot exceed n^{−C}, and the fastest rate at which h_s (for irrelevant variables) goes to infinity cannot exceed n^C. With these conditions we obtain the following result.

Theorem 4.8. Under conditions (4.28), (4.34), (4.38) and (4.39), letting ĥ₁, ..., ĥ_q, λ̂₁, ..., λ̂_r denote the smoothing parameters that minimize CV, we have

(i) n^{1/(q₁+4)} ĥ_s → a_s⁰ in probability for 1 ≤ s ≤ q₁;
(ii) P(ĥ_s > C) → 1 for q₁ + 1 ≤ s ≤ q, for all C > 0;
(iii) n^{2/(q₁+4)} λ̂_s → b_s⁰ in probability for 1 ≤ s ≤ r₁;
(iv) λ̂_s → 1 in probability for r₁ + 1 ≤ s ≤ r.

Since ĥ_s → ∞ for s = q₁ + 1, ..., q and λ̂_s → 1 for s = r₁ + 1, ..., r, we have χ(ĥ₁, ..., ĥ_q, λ̂₁, ..., λ̂_r) = χ̄(ĥ₁, ..., ĥ_{q₁}, λ̂₁, ..., λ̂_{r₁}) + (s.o.). Next, from the fact that ĥ_s − h_s⁰ = o_p(h_s⁰) (s = 1, ..., q₁) and λ̂_s − λ_s⁰ = o_p(λ_s⁰) (s = 1, ..., r₁), we have that χ̄(ĥ₁, ..., ĥ_{q₁}, λ̂₁, ..., λ̂_{r₁}) = χ̄(a₁⁰, ..., a_{q₁}⁰, b₁⁰, ..., b_{r₁}⁰) + (s.o.). Theorem 4.9 then follows.

4.5.1 Ordered Discrete Variables

Up to now we have limited our attention to the case for which the discrete variables are unordered.
If, however, some of the discrete variables are ordered (i.e., are "ordinal categorical variables"), then we should use a kernel function which is capable of reflecting the fact that these variables are ordered. Assuming that x_s^d can take on c_s different ordered values, {0, 1, ..., c_s − 1}, Aitchison and Aitken (1976, p. 29) suggested using the kernel function given by

    l(x_s^d, v_s^d, λ_s) = (c_s choose j) λ_s^j (1 − λ_s)^{c_s − j}   when |x_s^d − v_s^d| = j   (0 ≤ j ≤ c_s).

When c_s ≥ 3, this weight function is deficient, since there is no value of λ_s that can make l(x_s^d, v_s^d, λ_s) equal to a constant function. Thus, even if x_s^d turned out to be an irrelevant regressor, it could never be smoothed out if one were to use this kernel. Therefore, we suggest the use of the following alternative kernel function:

    l(x_s^d, v_s^d, λ_s) = λ_s^{|x_s^d − v_s^d|}.      (4.43)

The range of λ_s is [0, 1]. Again, when λ_s assumes its extreme upper bound value (λ_s = 1), we see that l(x_s^d, v_s^d, 1) = 1 for all values of x_s^d, v_s^d ∈ {0, 1, ..., c_s − 1}, and in this case x_s^d is completely smoothed out of the resulting estimate. It can be easily shown that when some of the discrete variables are ordered, then all of the results established above continue to hold provided one uses the kernel function given by (4.43).

Of course, if an ordered discrete variable assumes a large number of different values, one might simply treat it as if it were a continuous variable. In practice this may well lead to estimation results similar to those arising were one to treat the variable as an ordered discrete variable and use the kernel function given by (4.43).

Bierens (1983, 1987) and Ahmad and Cerrito (1994) also consider smoothing both the discrete and continuous variables for the nonparametric estimation of a regression function. But they do not study the fundamental issue of data-driven selection of smoothing parameters. As we have shown in this section, it is important to use automatic data-driven methods to select smoothing parameters.
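The suggested ordered kernel (4.43) is one line of code; setting λ_s = 0 yields an indicator while λ_s = 1 yields a constant weight, so an irrelevant ordered regressor can be smoothed out. A small NumPy sketch (names mine):

```python
import numpy as np

def l_ordered(Xs, xs, lam):
    """Ordered-categorical kernel (4.43): lambda_s ** |X_s - x_s|.
    Weight decays geometrically with the ordered distance |X_s - x_s|."""
    return lam ** np.abs(Xs - xs)
```

Unlike the binomial-weight kernel of Aitchison and Aitken, this kernel reaches a constant at its upper bound for any number of categories.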
Moreover, least squares cross-validation has the attractive property that it can automatically (asymptotically) remove irrelevant explanatory variables. In nonparametric settings with mixed discrete and continuous variables, least squares cross-validation stands alone and appears to be without any obvious peers.

4.6 Applications

4.6.1 Food-Away-from-Home Expenditure

Prior to the 1980s, the value added in China's food sector that was accounted for by food-away-from-home (FAFH) consumption was negligible. Household production of most meals used grain, raw vegetables, and meat produced at home, purchased from state-run food stores, or purchased directly from farmers. Mirroring China's rapid income growth after 1980, Chinese consumers increasingly ate more meals in restaurants, cafeterias and dining halls. The FAFH share of total food expenditure steadily increased from 5.03% in 1992 to 14.70% in 2000. In 2000, per capita annual FAFH expenditure reached 288 yuan, with total FAFH expenditure of 182 billion yuan in urban China ($15.9 billion US dollars). In the application below, it will be seen that nonparametric methods suggest that FAFH expenditure growth in China is expected to continue, due to the rising middle class, rapid urbanization, and to its relatively low FAFH share compared with the United States (40.3% in 2001, Economic Research Service, USDA), Canada (35.6% in 2001, Statistics Canada), and other developed countries (see Jensen and Yen (1996)). This stands in contrast to conclusions obtained from a popular linear model that has been used to model such expenditure.

The data are obtained from urban household surveys conducted by the State Statistical Bureau of the People's Republic of China in 1992 and 1998. The dependent variable is household per capita FAFH expenditure (in 1992 yuan).
The explanatory variables include household per capita income (in 1992 yuan), household size, education level of the household head (seven categories¹), age of the household head, a 0-1 dummy variable indicating that the household is in a large city, and a gender dummy for the household head. The total sample size in this study is 3,459 for 1992 and 3,359 for 1998. Min, Fang and Li (2004) also estimated a popular linear model to be used for comparison purposes. They presented some graphs that show the relationship between FAFH expenditure and income (holding all other regressors fixed at their sample medians). Both the 1992 and 1998 nonparametric FAFH expenditure curves show that FAFH consumption stops increasing after a high income level has been attained. This is quite different from the parametric results, where the linear model predicts higher FAFH consumption for higher income households. To check which results appear to provide a better description of the data, Min et al. computed goodness-of-fit (R²) for both the parametric and nonparametric models. The R² from the linear models for the 1992 and 1998 data are 0.128 and 0.170, while the R² for the nonparametric models are 0.382 and 0.348 for 1992 and 1998, respectively.

Table 4.1 gives the average FAFH expenditure for different income intervals, along with the predicted average values using the parametric (ordinary least squares, OLS) and nonparametric models. High income households indeed spent less on FAFH than those households with slightly lower income.

¹1. below elementary school; 2. elementary school; 3. middle school; 4. high school; 5. middle-level specialized training; 6. two-year college; 7. bachelor's degree or above.
In 1992, households with income exceeding 5,000 yuan spent about 11 yuan less on FAFH than those with income between 3,000 and 5,000 yuan. In 1998, households with income exceeding 9,000 yuan spent 52 yuan less on FAFH than those with income between 4,000 and 9,000 yuan. We observe that for low and middle income levels, both the parametric and the nonparametric models predict the average FAFH expenditure well. However, for high income levels, the parametric model appears to give misleading predictions, while the nonparametric method performs much better in this case.

Table 4.2 reports estimated mean and median income elasticities. For the nonparametric model, the elasticity η_i for household i is computed via η_i = (∂ĝ(x_i)/∂x_{1i}) (x_{1i}/ĝ(x_i)), where x_{1i} is the income of the ith household, while for the linear model η_i = β̂₁ x_{1i}/ŷ_i, with β̂₁ the estimated income coefficient. Table 4.2 reveals some interesting phenomena. Comparing the elasticities of 1992 and 1998, the nonparametric results show that both the mean and median income elasticities have increased for both large city and middle-small city households. In contrast, the parametric model shows only that the mean elasticity for middle-small cities has increased, while the mean elasticity for large cities and the median elasticity for both large and middle-small cities fell from 1992 to 1998. The conflicting estimates are likely due to the misspecification of the linear model. As we discussed earlier, the linear model overestimates FAFH expenditure for high income households due to its imposing an incorrect linear income (trend) component on FAFH consumption. The misspecified linear model also overestimates the elasticities for high income families, leading to the erroneous prediction that the median elasticity has decreased from 1992 to 1998.

4.6.2 Modeling Strike Volume

We consider a panel of annual observations used to model strike volume for 18 Organization for Economic Cooperation and Development (OECD) countries.
The data consist of observations on the level of strike volume (days lost due to industrial disputes per 1,000 wage salary earners) and their regressors in 18 OECD countries from 1951 to 1985. The average level and variance of strike volume vary substantially across countries. The data distribution also features a long right tail and several large values of volume. We make use of the following regressors: 1) country code; 2) year; 3) strike volume; 4) unemployment; 5) inflation; 6) parliamentary representation of social democratic and labor parties; 7) a time-invariant measure of union centralization. The data are publicly available (see StatLib, http://lib.stat.cmu.edu). As one country had incomplete data, we analyze only the 17 countries having complete records.

Table 4.1: Average FAFH expenditure by income category

    1992
    Income      Data Ave.   OLS Ave.   NP Ave.
    Below 3K    82.78       82.12      83.26
    3K-5K       149.1       152.0      148.0
    Above 5K    143.9

    1998
    Income      Data Ave.   OLS Ave.   NP Ave.
    Below 4K    119.4       120.0
    4K-9K       257.0       250.0      254.0
    Above 9K    205.2                  282.9

Table 4.2: Mean and median income elasticities (η_i denotes the income elasticity of household i)

    Parametric            Large City          Middle-Small City
    Results               1992      1998      1992      1998
    Mean                  0.87      0.81      1.274     1.351
    Median                0.848     0.832     1.074     0.994
    % of η_i > 1          18.5%     13.9%     69.6%     48.8%

    Nonparametric         Large City          Middle-Small City
    Results               1992      1998      1992      1998
    Mean                  0.626     0.826     0.900     0.947
    Median                0.751     0.851     0.938     0.960
    % of η_i > 1          35.0%     39.9%     45.9%     45.8%

These data were analyzed by Western (1996), who considered a linear panel data model with country-specific fixed effects and a time trend.

Table 4.3: Regressor standard deviations and bandwidths for the proposed method for the training data, n₁ = 561

             1              2            3      4       5           6
    σ        9.53           2.84         4.75   13.31   0.31        NA
    ĥ/λ̂     102565846.11   4800821.61   5.84   30.56   40852803    0.12

Key: 1) year; 2) unemployment; 3) inflation; 4) parliamentary representation; 5) union centralization; 6) country code.
We consider a nonparametric model which treats country code as categorical and the remaining regressors as continuous. To assess each model's performance we estimate each on data for the period 1951-1983, and then evaluate each based on its out-of-sample predictive performance for the period 1984-1985, again using predictive squared error as our criterion. We begin by considering the cross-validated bandwidths, which are summarized in Table 4.3.

As can be seen from Table 4.3, the continuous regressors year, unemployment, and union centralization are effectively smoothed out of the resulting nonparametric estimate, having bandwidths that are orders of magnitude larger than the respective regressor's standard deviation. This suggests that the continuous regressors inflation and parliamentary representation and the discrete regressor country code are relevant in the sense outlined earlier.

We next compare the out-of-sample forecasting performance of each model. The relative predictive MSE of the parametric panel model (all variables enter the model linearly) versus the NP CV approach is 1.38. We note that varying the forecast horizon has little impact on the relative predictive performance. We have also tried a parametric model with interaction terms (quadratic in the continuous regressors), and the resulting out-of-sample predictive MSE is larger than even the linear model's MSE prediction (the ratio of its predictive MSE over NP CV's is 1.44), while a parsimonious parametric model having only a constant, unemployment, and inflation has a predictive MSE 15% larger than that of the NP CV approach (the relative predictive MSE is 1.15), suggesting that a linear parametric specification is inappropriate.

4.7 Exercises

Exercise 4.1.
Derive the MSE of the nonsmooth estimator of p(x) given by

    p̃(x) = (1/n) Σ_{i=1}^n 1(X_i = x),

where 1(·) is an indicator function defined by 1(X = x) = 1 if X = x, and 0 otherwise, and where X_i ∈ S = {0, 1, ..., c − 1} is a discrete random variable having finite support.

Exercise 4.2. Consider the kernel estimator of p(x) given by

    p̂(x) = (1/n) Σ_{i=1}^n L(X_i, x, λ),

where x is a scalar and L(·) is a kernel function defined by

    L(X, x, λ) = 1 − λ       if X = x,
                 λ/(c − 1)   otherwise,

where λ ∈ [0, (c − 1)/c].

(i) Derive the bias of this estimator, then demonstrate that if X has a (discrete) uniform distribution, so that p(x) = 1/c for all x, then the kernel estimator p̂(x) is unbiased for any admissible value of λ.

(ii) Show that

    var(p̂(x)) = [p(x)(1 − p(x))/n] (1 − cλ/(c − 1))².

Demonstrate that if X has a (discrete) uniform distribution, so that p(x) = 1/c for all x, then the variance of p̂(x) is zero when λ = (c − 1)/c, its upper bound.

(iii) Given your results above, is it possible for the kernel estimator p̂(x) to have an MSE of zero when the underlying distribution is uniform? What are the conditions that λ must satisfy for this to occur? Demonstrate your result.

Exercise 4.3.

(i) Show that E[p̂(x^d)] and var[p̂(x^d)] are indeed given by (4.5) with B_s(x^d) = (c_s − 1)^{−1} Σ_{z^d∈S^d} 1_s(z^d, x^d) p(z^d) − p(x^d), where 1_s(z^d, x^d) is defined in (4.24).

(ii) Show that E[m̂(x^d)²] = O((Σ_{s=1}^r λ_s)² + n^{−1}), where m̂(x^d) is defined above (4.15). Note that O((Σ_{s=1}^r λ_s)²) = O(Σ_{s=1}^r λ_s²).

Exercise 4.4. Ω is an r × r matrix with its (s, t)th element given by

    Ω_{s,t} = Σ_{x^d∈S^d} [p_{1,s}(x^d) − p(x^d)] [p_{1,t}(x^d) − p(x^d)].

Note that p_{1,s}(x^d) is also a probability measure, since p_{1,s}(x^d) is the average probability over the c_s − 1 values z^d ∈ S^d that differ from x^d only in the sth component, i.e., p_{1,s}(x^d) = (c_s − 1)^{−1} Σ_{z^d∈S^d} 1_s(z^d, x^d) p(z^d). Consider the simple case of r = 2. Then

    λ'Ωλ = Σ_{x^d∈S^d} { λ₁ [p(x^d) − p_{1,1}(x^d)] + λ₂ [p(x^d) − p_{1,2}(x^d)] }².

Show that Ω is positive definite if and only if there exist x^d, z^d ∈ S^d such that p(x^d) ≠ p_{1,1}(x^d) and p(z^d) ≠ p_{1,2}(z^d).

Exercise 4.5.
Using (4.3), we can write the least squares cross-validation function for the probability estimator p̂(x^d) as

    CV_p(λ) = Σ_{x^d∈S^d} p̂(x^d)² − (2/n) Σ_{i=1}^n p̂_{−i}(X_i^d),      (4.45)

where p̂_{−i}(·) is the leave-one-out estimator. Let CV_p(λ) be defined in (4.45) and assume that x^d is univariate, taking values in {0, 1, ..., c − 1}. Show that (λ is a scalar)

    E(CV_p(λ)) = D₁λ² − D₂ λ n^{−1} + O(λ³ + λ² n^{−1}) + terms independent of λ,      (4.46)

where the D_j's are (positive) constants. Derive explicit expressions for the D_j's. Note that minimizing (4.46) over λ leads to λ⁰ = D₂/(2D₁n) = O(n^{−1}).

Exercise 4.6.

(i) Prove Theorem 4.2 (i). Hint: Use Theorem 4.1 and the result in (3.3).

(ii) Prove Theorem 4.2 (ii). Hint: Use Theorem 4.1 and the result in (3.4).

Exercise 4.7. Prove Theorem 4.4 for the case when x^d is univariate.

Hint: When x^d is univariate, L_{ij} = 1_{ij,0} + λ 1_{ij,1}, where 1_{ij,0} = 1(X_j^d = X_i^d) and 1_{ij,1} = 1(X_j^d ≠ X_i^d). Write n^{−1} Σ_j Y_j L_{ij} = n^{−1} Σ_j Y_j [1_{ij,0} + λ 1_{ij,1}] = B̂_{i,0} + λ B̂_{i,1}, and write the leave-one-out estimator in terms of these quantities. One can then expand CV(λ) in a power series in λ, i.e.,

    CV(λ) = A₁λ² + A_{2n} λ n^{−1} + o_p(λ² + λ n^{−1}) + (terms not related to λ),

where A₁ > 0 is a positive constant, and A_{2n} is an O_p(1) random variable.

Exercise 4.8. In the proof of Theorem 4.5 we have used σ̂²(x^d) = σ²(x^d) + o_p(1). Prove this result.

Hint: Write σ̂²(x^d) − σ²(x^d) = [σ̂²(x^d) − σ²(x^d)] p̂(x^d)/p̂(x^d). Use λ̂ = O_p(n^{−1}). Show that σ̂²(x^d) p̂(x^d) = n^{−1} Σ_{i=1}^n u_i² 1(X_i^d = x^d) + O_p(n^{−1}) → σ²(x^d) p(x^d) by a law of large numbers argument.

Exercise 4.9. Derive (4.21) through (4.23).

Exercise 4.10. The a_s⁰'s and b_s⁰'s are defined as the unique finite values that minimize χ defined in (4.31). Consider the case of q = 1 and r = 1. Solve for a₁⁰ and b₁⁰ by minimizing (4.31).

Hint: Note that χ(a₁, b₁) = C₁a₁⁴ + C₂b₁² − C₃a₁²b₁ + C₀/a₁, where C_j > 0 for j = 0, 1, 2. This can be written as χ(a₁, b₁) = C₂[b₁ − A₁a₁²]² + A₂a₁⁴ + C₀/a₁, where A₁ = C₃/(2C₂), and where A₂ = C₁ − C₃²/(4C₂). Thus, we have b₁⁰ = max{0, A₁(a₁⁰)²} (since b₁⁰ is nonnegative) and a₁⁰ = [C₀/(4A₂)]^{1/5} if A₁ > 0, and a₁⁰ = [C₀/(4C₁)]^{1/5} if A₁ ≤ 0.

Exercise 4.11.
Considering the simple case for which r = 1, show that Assumption 4.1 holds true if and only if g(x^d) is not a constant function (when x^d is univariate).

Hint: Assumption 4.1 becomes, for all x^d ∈ S^d (x^d is univariate since r = 1),

    Σ_{z^d∈S^d} { [g(z^d) − g(x^d)] [1(z^d = x^d) + λ 1(z^d ≠ x^d)] }² = 0,

which is equivalent to λ² Σ_{z^d∈S^d} [g(z^d) − g(x^d)]² 1(z^d ≠ x^d) = 0 because [g(z^d) − g(x^d)] 1(z^d = x^d) = 0. However, since g(x^d) is not a constant function, we know that Σ_{z^d∈S^d} [g(z^d) − g(x^d)]² 1(z^d ≠ x^d) > 0. Thus, we must have λ = 0. Thus, Assumption 4.1 holds true if and only if g(x^d) is not a constant function when x^d is univariate.

Chapter 5

Conditional Density Estimation

Conditional PDFs form the backbone of most popular statistical methods in use today, though they are not often modeled directly in a parametric framework. They have perhaps received even less attention in a kernel framework. Nevertheless, as will be seen, they are extremely useful for a range of tasks including modeling count data or predicting consumer choice (see Cameron and Trivedi (1998) for a thorough treatment of count data models). In this chapter we discuss the nonparametric estimation of conditional PDFs. Throughout this chapter, we emphasize the practically relevant mixed discrete and continuous data case. Given the theoretical and practical advantages associated with smoothing discrete variables in a particular manner versus using a frequency-based approach, which we covered in Chapters 3 and 4, we shall proceed directly with smoothing both the discrete and continuous variables for what follows. Note that we may refer to conditioning variables as regressors even though we are estimating a conditional density rather than a regression function.

5.1 Conditional Density Estimation: Relevant Variables

Let f(·) and μ(·) denote the joint and marginal densities of (X, Y) and X, respectively.
For what follows we shall refer to Y as the dependent variable (i.e., Y is explained), and to X as covariates (i.e., X is explanatory). We use f̂ and μ̂ to denote kernel estimators thereof, and we estimate the conditional density g(y|x) = f(x,y)/μ(x) by

ĝ(y|x) = f̂(x,y)/μ̂(x).  (5.1)

We begin by considering the case for which Y is a univariate continuous random variable, and then we discuss the case for which Y is a univariate discrete variable. We treat the multivariate Y case in Section 5.4. Our estimators of f(·) and μ(·) are given by

f̂(x,y) = n⁻¹ Σᵢ₌₁ⁿ K_γ(x, Xᵢ) k_{h₀}(y, Yᵢ),  (5.2)

μ̂(x) = n⁻¹ Σᵢ₌₁ⁿ K_γ(x, Xᵢ),  (5.3)

where h₀ is the smoothing parameter associated with Y, γ = (h, λ), and K_γ(x, Xᵢ) = W_h(x^c, Xᵢ^c) L_λ(x^d, Xᵢ^d), with

W_h(x^c, Xᵢ^c) = Π_{s=1}^q h_s⁻¹ w((x_s^c − X_{is}^c)/h_s),
L_λ(x^d, Xᵢ^d) = Π_{s=1}^r λ_s^{N_{is}(x)} (1 − λ_s)^{1−N_{is}(x)},

where N_{is}(x) = 1(X_{is}^d ≠ x_s^d) is an indicator function that equals one when X_{is}^d ≠ x_s^d, zero otherwise, and k_{h₀}(y, Yᵢ) = h₀⁻¹ k((y − Yᵢ)/h₀).

As we emphasized in previous chapters, data-driven methods of bandwidth selection such as cross-validation are required in applied settings. Below we discuss two different cross-validation methods. The first is based on minimizing a weighted integrated squared difference between ĝ(y|x) and g(y|x), and the second is a likelihood approach. As will be seen, the first method has some desirable optimality properties but can be computationally burdensome for large samples. The second method is computationally less costly but can lead to inconsistent estimates if the distribution of the continuous variables has fat tails.

5.2 Conditional Density Bandwidth Selection

5.2.1 Least Squares Cross-Validation: Relevant Variables

As was the case for unconditional density estimation, when estimating conditional densities with kernel methods we can pursue a least squares cross-validation approach to selecting the smoothing parameters.
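Before turning to bandwidth selection, it may help to see (5.1)–(5.3) in executable form. The following is a minimal sketch for a single continuous covariate with Gaussian kernels; the function names are hypothetical, and the mixed-data case would simply multiply the discrete kernel L_λ into the X weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)          # so g(y|x) is the N(x, 1) density

def gk(u):
    """Standard normal (Gaussian) kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def cond_density(y, x, Y, X, h0, h):
    """Local-constant g^(y|x) = f^(x,y)/mu^(x), as in (5.1)-(5.3),
    for one continuous covariate."""
    Kx = gk((X - x) / h) / h        # kernel weights on X, K_gamma(x, X_i)
    Ky = gk((Y - y) / h0) / h0      # kernel on Y, k_h0(y, Y_i)
    return float(np.mean(Kx * Ky) / np.mean(Kx))

g00 = cond_density(0.0, 0.0, Y, X, h0=0.25, h=0.25)
# true value is the N(0,1) density at 0, roughly 0.3989
```

The estimate integrates to one in y by construction (up to the kernel's own normalization), which is what distinguishes it from a regression fit.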
We consider the following criterion based upon a weighted integrated square error (∫dx = Σ_{x^d} ∫dx^c):

I_n = ∫ [ĝ(y|x) − g(y|x)]² μ(x) M(x^c) dx dy = I_{1n} − 2I_{2n} + I_{3n},  (5.4)

where M(·) is a weight function,

I_{1n} = ∫ ĝ(y|x)² μ(x) M(x^c) dx dy,
I_{2n} = ∫ ĝ(y|x) f(x,y) M(x^c) dx dy,

and where I_{3n} = ∫ g(y|x)² μ(x) M(x^c) dx dy does not depend on the smoothing parameters used to compute f̂ and μ̂. Observe that

I_{1n} = ∫ [f̂(x,y)²/μ̂(x)²] μ(x) M(x^c) dx dy = E_X[Ĝ(X)M(X^c)/μ̂(X)²],

where the expectation is with respect to X, not to the random observations {Xᵢ,Yᵢ}ᵢ₌₁ⁿ that are used to compute ĝ(·) and μ̂(·), Ĝ(x) = ∫ f̂(x,y)² dy, and where Ĝ(·) can be written as

Ĝ(x) = n⁻² Σᵢ Σⱼ K_γ(x, Xᵢ) K_γ(x, Xⱼ) ∫ k_{h₀}(y, Yᵢ) k_{h₀}(y, Yⱼ) dy.

Similarly, I_{2n} can be written as E_Z[ĝ(Y|X)M(X^c)] = E_Z[f̂(Y,X)M(X^c)/μ̂(X)], where the expectation is with respect to Z = (Y,X). Therefore, the following cross-validation approximations, Î_{1n} and Î_{2n}, for I_{1n} and I_{2n}, respectively, are adopted:

Î_{1n} = n⁻¹ Σᵢ Ĝ₋ᵢ(Xᵢ)M(Xᵢ^c)/μ̂₋ᵢ(Xᵢ)²,
Î_{2n} = n⁻¹ Σᵢ f̂₋ᵢ(Xᵢ,Yᵢ)M(Xᵢ^c)/μ̂₋ᵢ(Xᵢ),

where the subscript −i denotes leave-one-out estimators, for example,

Ĝ₋ᵢ(x) = (n−1)⁻² Σ_{j≠i} Σ_{l≠i} K_γ(x, Xⱼ) K_γ(x, X_l) ∫ k_{h₀}(y, Yⱼ) k_{h₀}(y, Y_l) dy.

Thus, our cross-validation objective function is given by

CV(h₀, h, λ) = Î_{1n}(h₀, h, λ) − 2Î_{2n}(h₀, h, λ),

where (h, λ) = (h₁,...,h_q, λ₁,...,λ_r). Along the lines of the analysis presented in Section 4.4.2, again using μ̂(x)⁻¹ = μ(x)⁻¹ + [μ(x) − μ̂(x)]/[μ(x)μ̂(x)] to handle the presence of the random denominator, it is easy to see that the leading term of ĝ(y|x) − g(y|x) is [ĝ(y|x) − g(y|x)]μ̂(x)/μ(x) = [f̂(x,y) − μ̂(x)g(y|x)]/μ(x). Hall et al. (2004) have further shown that the leading term of CV is

CV₀ = ∫ E{[f̂(x,y) − μ̂(x)g(y|x)]²} M(x^c) μ(x)⁻¹ dx dy.  (5.5)

Letting β_s = λ_s/[(1−λ_s)(c_s−1)], and letting f_{00}(x,y) and f_{ss}(x,y) denote the second derivatives of f(x^c,x^d,y) with respect to y and x_s^c, respectively, we may write

E{f̂(x,y)} = f(x,y) + Σ_{s=1}^r β_s Σ_{z^d} 1_s(z^d, x^d)[f(x^c, z^d, y) − f(x,y)] + (κ₂/2)[h₀² f_{00}(x,y) + Σ_{s=1}^q h_s² f_{ss}(x,y)] + o(η),  (5.6)
where 1_s(z^d, x^d) is defined in (4.24) and η = h₀² + Σ_{s=1}^q h_s² + Σ_{s=1}^r λ_s, and

E{μ̂(x)} = μ(x) + Σ_{s=1}^r β_s Σ_{z^d} 1_s(z^d, x^d)[μ(x^c, z^d) − μ(x)] + (κ₂/2) Σ_{s=1}^q h_s² μ_{ss}(x) + o(η).  (5.7)

Therefore, (5.6) and (5.7) lead to

E{f̂(x,y) − μ̂(x)g(y|x)} = Σ_{s=1}^r β_s Σ_{z^d} 1_s(z^d, x^d)[f(x^c, z^d, y) − μ(x^c, z^d)g(y|x)] + (κ₂/2){h₀² f_{00}(x,y) + Σ_{s=1}^q h_s²[f_{ss}(x,y) − μ_{ss}(x)g(y|x)]} + o(η).  (5.8)

Note that

var[f̂(x,y) − μ̂(x)g(y|x)] = n⁻¹ var[K_γ(x, Xᵢ)(k_{h₀}(y, Yᵢ) − g(y|x))](1 + o(1)) = κ^{q+1} f(x,y) ν_n + o(ν_n),  (5.9)

where ν_n = (nh₀h₁⋯h_q)⁻¹ and κ = ∫w(v)²dv. Combining (5.8) and (5.9) we can see that

CV₀(h₀, h, λ) = n^{−4/(5+q)} χ₀(a₀, a, b),  (5.10)

where

χ₀(a₀, a, b) = ∫ { Σ_{s=1}^r b_s (c_s−1)⁻¹ Σ_{z^d} 1_s(z^d, x^d)[f(x^c, z^d, y) − μ(x^c, z^d)g(y|x)] + (κ₂/2)[a₀² f_{00}(x,y) + Σ_{s=1}^q a_s²(f_{ss}(x,y) − μ_{ss}(x)g(y|x))] }² M(x^c) μ(x)⁻¹ dx dy + κ^{q+1}(a₀a₁⋯a_q)⁻¹ ∫ g(y|x) M(x^c) dx dy,  (5.11)

and where the a_s's and b_s's are defined via h_s = a_s n^{−1/(5+q)} and λ_s = b_s n^{−2/(5+q)}, respectively.

Letting a₀⁰, a₁⁰,...,a_q⁰, b₁⁰,...,b_r⁰ denote those values that minimize χ₀, we assume that the a_s⁰'s and b_s⁰'s are uniquely determined and finite. Let ĥ₀, ĥ₁,...,ĥ_q, λ̂₁,...,λ̂_r denote the values of h₀,...,h_q, λ₁,...,λ_r that minimize CV. Define h_s⁰ = a_s⁰ n^{−1/(5+q)} and λ_s⁰ = b_s⁰ n^{−2/(5+q)}. Then, given that CV = CV₀ + (s.o.), we expect that ĥ_s = h_s⁰ + (s.o.) and λ̂_s = λ_s⁰ + (s.o.), which is indeed true, as the next theorem shows.

Theorem 5.1.

n^{1/(5+q)} ĥ_s → a_s⁰ for s = 0, 1,...,q,
n^{2/(5+q)} λ̂_s → b_s⁰ for s = 1,...,r,  (5.12)
n^{4/(5+q)} inf CV → inf χ₀,

all in probability. The proof of (5.12) is given in Hall et al. (2004).

Using Theorem 5.1, we can derive the asymptotic distribution of ĝ(y|x). We postpone this result until Section 5.3, where we discuss a more general case that allows for the existence of irrelevant covariates.

5.2.2 Maximum Likelihood Cross-Validation: Relevant Variables

The least squares cross-validation approach discussed in Section 5.2.1 has some desirable optimality properties, which were given in Theorem 5.1. However, least squares cross-validation in the context of conditional PDFs is computationally costly, particularly when the sample size is large. This is because the objective function CV (i.e., Î_{1n} − 2Î_{2n}) involves three summations.
When the sample size is small, one can realize substantial speed improvements by storing temporary weight matrices in computer memory rather than recomputing their components when computing CV, but for even moderate sample sizes, the memory required for storage of these weight matrices will quickly overwhelm the memory capacity of most computers. This is a classical computational trade-off: memory versus speed. In contrast, likelihood cross-validation has a substantial advantage from a computational perspective, as we show below.

The likelihood cross-validation method involves choosing h₀, h₁,...,h_q, λ₁,...,λ_r by maximizing the log-likelihood function

L = Σᵢ₌₁ⁿ ln ĝ₋ᵢ(Yᵢ|Xᵢ),  (5.13)

where ĝ₋ᵢ(Yᵢ|Xᵢ) = f̂₋ᵢ(Xᵢ,Yᵢ)/μ̂₋ᵢ(Xᵢ), and f̂₋ᵢ(Xᵢ,Yᵢ) and μ̂₋ᵢ(Xᵢ) are the leave-one-out kernel estimators of f(Xᵢ,Yᵢ) and μ(Xᵢ), respectively. The objective function (5.13) involves one less summation than that for least squares cross-validation, and is therefore less computationally burdensome. One problem with likelihood cross-validation arises when the continuous variables are drawn from fat-tailed distributions. In such cases, likelihood cross-validation tends to oversmooth the data, and may lead to inconsistent estimation results (see Hall (1987a, 1987b)).

Two practical methods can be used to prevent this from happening. First, one can compare the likelihood cross-validated smoothing parameters with some ad hoc formula (say, h_s = σ_s n^{−1/(5+q)}, with σ_s the sample standard deviation of the sth continuous variable), and if the two sets of smoothing parameters are comparable, then it is unlikely that the cross-validation method has oversmoothed. When likelihood cross-validation oversmooths the data, there are two possibilities: either it leads to inconsistent estimation, or the related variable is irrelevant.
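The single double-sum structure of (5.13) can be sketched as follows for one continuous covariate with Gaussian kernels (a minimal sketch; function names are hypothetical, and the grid search stands in for a proper numerical optimizer):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=n)
Y = X + 0.5 * rng.normal(size=n)

def loo_likelihood_cv(h0, h, Y, X):
    """Leave-one-out log-likelihood L = sum_i ln g^_{-i}(Y_i|X_i) of
    (5.13) for the local-constant estimator; larger is better."""
    phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    Kx = phi((X[:, None] - X[None, :]) / h) / h    # n x n weights on X
    Ky = phi((Y[:, None] - Y[None, :]) / h0) / h0  # n x n weights on Y
    np.fill_diagonal(Kx, 0.0)                      # drop own observation
    f_i = (Kx * Ky).sum(axis=1) / (n - 1)          # f^_{-i}(X_i, Y_i)
    mu_i = Kx.sum(axis=1) / (n - 1)                # mu^_{-i}(X_i)
    return float(np.sum(np.log(f_i / mu_i)))

grid = [0.1, 0.2, 0.4, 0.8]
cv = {b: loo_likelihood_cv(b, b, Y, X) for b in grid}
b_best = max(cv, key=cv.get)
```

A severely undersmoothed bandwidth is heavily penalized because isolated points receive near-zero leave-one-out density, which is exactly the mechanism that also makes the criterion vulnerable to fat tails.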
We note that, in the latter case, oversmoothing is in fact highly desirable. To distinguish between these two cases, one can therefore use a second method that compares the out-of-sample predictions based on likelihood cross-validation with those based upon a parametric method. If the nonparametric method behaves better than or is comparable to its parametric counterpart on the hold-out data, this indicates that likelihood cross-validation has probably not fallen victim to the potential inconsistency problem. Of course, should the least squares cross-validation method be computationally feasible, the least squares cross-validated smoothing parameters should also be computed, and they can then be used to judge whether likelihood cross-validation has produced unreasonably large smoothing parameters.

5.3 Conditional Density Estimation: Irrelevant Variables

We now consider the case for which some of the covariates may be irrelevant. Using notation defined in Section 4.5, for integers 0 ≤ q₁ ≤ q and 0 ≤ r₁ ≤ r, we assume that only the first q₁ components of x^c and the first r₁ components of x^d are relevant covariates, and we let x̄ denote the vector of relevant components of x (with marginal density μ̄).

Theorem 5.2. Under the conditions given in Hall et al. (2004), n^{1/(q₁+5)} ĥ_s → a_s⁰ in probability for s = 0, 1,...,q₁ and n^{2/(q₁+5)} λ̂_s → b_s⁰ in probability for s = 1,...,r₁, while ĥ_s → ∞ for s = q₁+1,...,q and λ̂_s converges to the upper extremity of its range for s = r₁+1,...,r; moreover, n^{4/(q₁+5)} inf CV converges in probability to the infimum of the corresponding limit function defined in terms of the relevant components only.

The proof of Theorem 5.2 is given by Hall et al. (2004). Like Theorem 4.8, Theorem 5.2 states that cross-validated smoothing parameter selection is asymptotically capable of removing irrelevant conditioning variables. The following theorem provides the asymptotic normal distribution of ĝ(y|x).

Theorem 5.3. Letting ĝ(y|x) be computed using the cross-validation smoothing parameters ĥ₀,...,ĥ_q, λ̂₁,...,λ̂_r, then

(nĥ₀ĥ₁⋯ĥ_{q₁})^{1/2} [ĝ(y|x) − g(y|x̄) − Σ_{s=0}^{q₁} ĥ_s² B_{1s}(x̄,y) − Σ_{s=1}^{r₁} λ̂_s B_{2s}(x̄,y)] → N(0, Ω(x̄,y))

in distribution, where the B_{1s}(x̄,y)'s and B_{2s}(x̄,y)'s are the bias terms associated with the relevant continuous and discrete components, respectively (the former involving second derivatives of g and the derivative ratio μ̄_s/μ̄, the latter involving sums over neighboring cells of x̄^d), and Ω(x̄,y) = κ^{q₁+1} g(y|x̄)/μ̄(x̄).

So far we have assumed that Y is a continuous random variable. Now we turn our attention to the case for which Y is discrete.
If Y assumes c₀ different values, then we replace the continuous kernel k_{h₀}(y, Yᵢ) = h₀⁻¹ k((y − Yᵢ)/h₀) by the categorical kernel l(y, Yᵢ^d, λ₀) = λ₀^{Nᵢ(y)}(1 − λ₀)^{1−Nᵢ(y)}, where Nᵢ(y) = 1(Yᵢ^d ≠ y). Theorem 5.2 can be modified by replacing q₁ + 5 with q₁ + 4, and by replacing the condition on n^{1/(q₁+5)} ĥ₀ with n^{2/(q₁+4)} λ̂₀ → b₀⁰, where b₀⁰ is defined in a way similar to the b_s⁰ for s = 1,...,r. Theorem 5.3 also requires minor modification (see Exercise 5.1).

Theorems 5.2 and 5.3 are quite powerful results, particularly for data containing a mix of discrete and continuous variables, having the potential to extend the reach of nonparametric methods. In Section 5.5 we show, via several empirical applications, that "irrelevant" variables occur quite often in various datasets and that cross-validated conditional density estimates can outperform some commonly used parametric methods in terms of their out-of-sample predictions, even in situations in which the number of discrete cells is comparable to the sample size.

5.4 The Multivariate Dependent Variables Case

In this section we consider the estimation of the conditional PDF of Y given X when Y is also a general multivariate vector. Let Z = (X,Y), and we shall also write Z = (Z^c, Z^d), where Z^d consists of r discrete variables, and where Z^c ∈ R^q represents the continuous components. For expositional simplicity, we first consider the case where Z^d is a vector of nominal variables (i.e., having no natural ordering) with Z^d ∈ Π_{s=1}^r {0,1,...,c_s − 1}. We write Y = (Y^c, Y^d), X = (X^c, X^d), and assume that Y^c contains the first q₁ continuous components of Z^c while Y^d contains the first r₁ discrete components of Z^d. Thus, Y^c ∈ R^{q₁}, Y^d ∈ Π_{s=1}^{r₁} {0,1,...,c_s − 1}, X^c ∈ R^{q−q₁}, and, similarly, X^d ∈ Π_{s=r₁+1}^{r} {0,1,...,c_s − 1}.

As before, let f(z) = f(y,x) denote the joint PDF of (Y,X),¹ μ(x) the marginal PDF of X, and g(y|x) = f(y,x)/μ(x) the conditional PDF of Y given X.

¹Depending on the situation, we sometimes may write the joint density as f(z^c, z^d) rather than f(y,x), where z^c = (y^c, x^c) and z^d = (y^d, x^d).
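The two flavors of discrete kernel used in this chapter can be sketched directly. A minimal sketch follows (function names hypothetical): the unordered form is the variation-style kernel used in (5.2) and for discrete Y above, while the ordered (binomial) form anticipates the Aitchison–Aitken weight introduced in Section 5.4.1:

```python
from math import comb

def unordered_kernel(z, zi, lam):
    """Variation-style unordered kernel lam**N * (1-lam)**(1-N), with
    N = 1(z_i != z); lam = 0 reproduces the frequency (cell-indicator)
    estimator, while larger lam smooths across categories."""
    N = int(zi != z)
    return (lam**N) * ((1.0 - lam)**(1 - N))

def ordered_kernel(z, zi, lam, c):
    """Aitchison-Aitken ordered kernel choose(c, j) lam^j (1-lam)^(c-j)
    with j = |z_i - z|: the weight decays with the distance between
    (ordered) categories rather than treating all mismatches alike."""
    j = abs(zi - z)
    return comb(c, j) * (lam**j) * ((1.0 - lam)**(c - j))
```

The key design difference: the unordered kernel assigns one common weight to every mismatched category, whereas the ordered kernel respects the natural ordering by downweighting distant categories more.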
We use Z_{is}^d to denote the sth component of Zᵢ^d. As in earlier chapters, for Z_{is}^d, z_s^d ∈ {0,1,...,c_s − 1}, we define a univariate kernel function l(Z_{is}^d, z_s^d, λ_s) = 1 − λ_s if Z_{is}^d = z_s^d, and l(Z_{is}^d, z_s^d, λ_s) = λ_s/(c_s − 1) if Z_{is}^d ≠ z_s^d. The product kernel is given by L_{λ,Zᵢ^d,z^d} = Π_{s=1}^r l(Z_{is}^d, z_s^d, λ_s). Letting z_s^c denote the sth component of z^c, w(·) denote a univariate kernel function, and W(·) denote a product kernel function for Z^c, we write W_{h,Zᵢ^c,z^c} = Π_{s=1}^q h_s⁻¹ w((Z_{is}^c − z_s^c)/h_s). To avoid introducing too much notation, we shall use the same notation L(·) and W(·) to denote the product kernels for Y^d and Y^c, i.e.,

L_{λ,Yᵢ^d,y^d} = Π_{s=1}^{r₁} l(Y_{is}^d, y_s^d, λ_s) and W_{h,Yᵢ^c,y^c} = Π_{s=1}^{q₁} h_s⁻¹ w((Y_{is}^c − y_s^c)/h_s).

Similarly, we define

L_{λ,Xᵢ^d,x^d} = Π_{s=r₁+1}^{r} l(X_{is}^d, x_s^d, λ_s) and W_{h,Xᵢ^c,x^c} = Π_{s=q₁+1}^{q} h_s⁻¹ w((X_{is}^c − x_s^c)/h_s).

We estimate f(z) and μ(x) by

f̂(z) = f̂(y,x) = n⁻¹ Σᵢ K_{γ,Zᵢ,z} and μ̂(x) = n⁻¹ Σᵢ K_{γ,Xᵢ,x},  (5.18)

where K_{γ,Zᵢ,z} = L_{λ,Zᵢ^d,z^d} W_{h,Zᵢ^c,z^c} and K_{γ,Xᵢ,x} = L_{λ,Xᵢ^d,x^d} W_{h,Xᵢ^c,x^c}. Therefore, we estimate g(y|x) = f(y,x)/μ(x) by

ĝ(y|x) = f̂(y,x)/μ̂(x).  (5.19)

We choose the smoothing parameters by cross-validation methods which minimize a sample analogue of a weighted integrated square error I_n, where (∫dz = Σ_{z^d} ∫dz^c)

I_n = ∫ [ĝ(y|x) − g(y|x)]² μ(x) dz = I_{1n} − 2I_{2n} + I_{3n},

I_{1n} = ∫ ĝ(y|x)² μ(x) dz, I_{2n} = ∫ ĝ(y|x) g(y|x) μ(x) dz, and I_{3n} = ∫ g(y|x)² μ(x) dz. As before, I_{3n} is independent of (h, λ). Therefore, minimizing I_n over (h, λ) is equivalent to minimizing I_{1n} − 2I_{2n}.

Define the leave-one-out estimators of f(Xᵢ,Yᵢ) and μ(Xᵢ) by

f̂₋ᵢ(Zᵢ) = (n−1)⁻¹ Σ_{j≠i} K_{γ,Zⱼ,Zᵢ} and μ̂₋ᵢ(Xᵢ) = (n−1)⁻¹ Σ_{j≠i} K_{γ,Xⱼ,Xᵢ}.  (5.20)

Also define Ĝ₋ᵢ(Xᵢ) to be

Ĝ₋ᵢ(Xᵢ) = (n−1)⁻² Σ_{j≠i} Σ_{l≠i} K_{γ,Xⱼ,Xᵢ} K_{γ,X_l,Xᵢ} K̄^{(2)}_{γ,Yⱼ,Y_l},  (5.21)

where K̄^{(2)}_{γ,Yⱼ,Y_l} = ∫ K_{γ,Yⱼ,y} K_{γ,Y_l,y} dy denotes the convolution kernel. Then, by arguments similar to those used in Section 5.1, Racine et al. (2004) show that a consistent estimator of I_{1n} − 2I_{2n} is given by

CV(h,λ) = n⁻¹ Σᵢ [Ĝ₋ᵢ(Xᵢ)/μ̂₋ᵢ(Xᵢ)² − 2 f̂₋ᵢ(Xᵢ,Yᵢ)/μ̂₋ᵢ(Xᵢ)],  (5.22)

where f̂₋ᵢ(Xᵢ,Yᵢ), μ̂₋ᵢ(Xᵢ), and Ĝ₋ᵢ(Xᵢ) are defined in (5.20) and (5.21). We then choose (h,λ) = (h₁,...,h_q, λ₁,...,λ_r) to minimize the objective function CV(h,λ) defined in (5.22). Let (ĥ, λ̂) denote the cross-validation choices of (h,λ).
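For Gaussian kernels the convolution kernel in (5.21) is available in closed form (the integral of two normal densities is a N(0, 2h²) density evaluated at the difference), which makes (5.22) directly computable. A minimal univariate sketch, with hypothetical function names:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = rng.normal(size=n)
Y = X + 0.5 * rng.normal(size=n)

def lscv(h, hy, Y, X):
    """Least squares CV objective (5.22): n^-1 sum_i [ G_{-i}(X_i)/mu_{-i}^2
    - 2 f_{-i}(X_i, Y_i)/mu_{-i} ], using the closed-form Gaussian
    convolution kernel for the Y direction."""
    phi = lambda u, s: np.exp(-0.5 * (u / s)**2) / (s * np.sqrt(2.0 * np.pi))
    Kx = phi(X[:, None] - X[None, :], h)                  # K_{gamma,X_j,X_i}
    np.fill_diagonal(Kx, 0.0)                             # leave-one-out
    Ky = phi(Y[:, None] - Y[None, :], hy)
    K2 = phi(Y[:, None] - Y[None, :], hy * np.sqrt(2.0))  # convolution kernel
    mu = Kx.sum(axis=1) / (n - 1)                         # mu_{-i}(X_i)
    f = (Kx * Ky).sum(axis=1) / (n - 1)                   # f_{-i}(X_i, Y_i)
    G = np.einsum('ij,il,jl->i', Kx, Kx, K2) / (n - 1)**2 # G_{-i}(X_i)
    return float(np.mean(G / mu**2 - 2.0 * f / mu))

cv_mid = lscv(0.3, 0.2, Y, X)
cv_big = lscv(3.0, 3.0, Y, X)
```

The triple summation hidden in the `einsum` call is exactly the computational burden the text describes: it scales as n³, versus n² for the likelihood criterion (5.13).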
Since we know that irrelevant independent variables will be asymptotically smoothed out, in the analysis below we will only consider the case where the independent variables are all relevant. Racine et al. (2004) show that ĥ_s/h_s⁰ →p 1 and λ̂_s/λ_s⁰ →p 1, where h_s⁰ = c_{s,h} n^{−1/(4+q)} for s = 1,...,q, and λ_s⁰ = c_{s,λ} n^{−2/(4+q)} for s = 1,...,r, where the c_{s,h}'s and c_{s,λ}'s are some constants; (h⁰, λ⁰) are the nonstochastic optimal smoothing parameters that minimize the leading term of the cross-validation function (5.22).

Theorem 5.4, proved in Racine et al. (2004), establishes the rate of convergence of (ĥ, λ̂) to (h⁰, λ⁰).

Theorem 5.4. Under assumptions given in Racine et al. (2004), we have ĥ_s/h_s⁰ − 1 = O_p(n^{−α/(4+q)}) for s = 1,...,q, and λ̂_s/λ_s⁰ − 1 = O_p(n^{−β}) for s = 1,...,r, where α = min{2, q/2} and β = min{1/2, 4/(4+q)}.

Given Theorem 5.4, one can further show the following:

Theorem 5.5. Let B_{1s}(z), s = 1,...,q, denote the bias terms arising from the continuous components (for s = 1,...,q₁ these involve (1/2)κ₂ times the second derivative g_{ss}(y|x), while for s = q₁+1,...,q they also involve the derivative ratio μ_s(x)/μ(x)); let B_{2s}(z), s = 1,...,r, denote the corresponding discrete-component bias terms (sums over neighboring cells of z^d); and let Ω(z) = κ^q g(y|x)/μ(x). Then

(nh₁⋯h_q)^{1/2} [ĝ(y|x) − g(y|x) − Σ_{s=1}^q h_s² B_{1s}(z) − Σ_{s=1}^r λ_s B_{2s}(z)] → N(0, Ω(z))

in distribution. The proof of Theorem 5.5 is given in Section 5.4.2.

We now discuss how to extend the above analysis to also cover the case for which Z^d contains ordered categorical variables.

5.4.1 The General Categorical Data Case

For ordered categorical variables, Aitchison and Aitken (1976, p.
29) suggest using the kernel weight function given by

l(Y_{is}^d, y_s^d, λ_s) = C(c_s, j) λ_s^j (1 − λ_s)^{c_s − j} when |Y_{is}^d − y_s^d| = j,  (5.24)

for 0 ≤ j ≤ c_s, where C(c_s, j) denotes the binomial coefficient, so that the weight assigned to a category declines with its distance from y_s^d.

5.5 Applications

5.5.1 A Nonparametric Analysis of Corruption

Political corruption (or the perception thereof) plays a central role in political theory and in the theory of economic development. Whether or not there is convergence in corruption across countries over time, or whether similarities in the distribution of corruption persist over time, remains a subject of ongoing debate. The Corruption Perception Index (CPI)² ranks countries in terms of perceived corruption. The CPI lies in the range 0–10, with 10 indicating the absence of corruption. In the spirit of the analysis of McAdam and Rummel (2004), who considered a panel of 40 countries having full records for the years 1995–2002, we create a balanced panel of 45 countries having full records for the nine-year period 1996–2004. For this panel of n = 405 observations, we estimate the conditional PDF of the CPI conditional upon year using the approach discussed in this chapter. The estimated conditional PDF is plotted in Figure 5.1.

²http://www.transparency.org/cpi

[Figure 5.1: The conditional PDF of the Corruption Perception Index for 1996–2004, n = 405.]

We treat year as an ordered categorical variable, and the LSCV bandwidths are ĥ_CPI = 0.224 and λ̂_year = 1.00 (the upper bound value for λ_year). The cross-validated value of λ̂_year suggests that pooling of the data for all years is appropriate, i.e., that there has been no appreciable change in the distribution over time. This finding supports the persistence hypothesis (i.e., distributional stability across time), while the shape of the estimated PDF is consistent with multiple equilibria (i.e., multiple modes). These results are consistent with those of McAdam and Rummel (2004, p.
509), who report that "our findings lend support to concerns expressed in the theoretical literature - namely, that corruption can be highly persistent, and characterized by multiple equilibria."

5.5.2 Extramarital Affairs Data

In a widely cited paper, Fair (1978) proposes a "theory of extramarital affairs" and considers the allocation of an individual's time among work and two types of leisure activities, time spent with one's spouse and time spent with one's "paramour." The unique dataset is taken from two magazine surveys, and Fair considers a parametric Tobit estimator for modeling the number of extramarital affairs per year.

Quantile estimates can be more robust and thus provide a viable alternative to the conditional mean regression function. In addition to being robust to the presence of censoring, a more comprehensive picture of the conditional distribution of a dependent variable can often be obtained by estimating a set of conditional quantiles rather than by simply presenting the conditional mean itself. For a complete treatment of quantile regression methods, we direct the reader to Koenker (2005).

In this chapter we begin by studying how to estimate a conditional CDF in nonparametric settings. We first discuss the estimation of a conditional CDF having only continuous covariates. The conditional quantile function that is often of direct interest can then be obtained by simply inverting the conditional CDF. Having considered the case with continuous data only, we then show that the results can be easily extended to the mixed data setting.

We assume that, as n → ∞, nh₁⋯h_q → ∞ and h_s → 0 for all s = 1,...,q. For s = 1,...,q, and for a generic function A(y,x), let A_s(y,x) = ∂A(y,x)/∂x_s, A_{ss}(y,x) = ∂²A(y,x)/∂x_s², A_0(y,x) = ∂A(y,x)/∂y, A_{00}(y,x) = ∂²A(y,x)/∂y², κ₂ = ∫w(v)v²dv, and κ = ∫w(v)²dv, so that κ^q = Π_{s=1}^q ∫w(v_s)²dv_s.

We write F̂(y|x) − F(y|x) = M̂(y,x)/μ̂(x), where M̂(y,x) = [F̂(y|x) − F(y|x)]μ̂(x). The next theorem
provides the asymptotic distribution of F̂(y|x).

Theorem 6.1. Under Assumptions 6.1 and 6.2, and also assuming that μ(x) > 0 and F(y|x) > 0, we have

(i) E[M̂(y,x)] = μ(x) Σ_{s=1}^q h_s² B_s(y,x) + o(Σ_s h_s²), where B_s(y,x) = (1/2)κ₂[F_{ss}(y|x) + 2μ_s(x)F_s(y|x)/μ(x)];

(ii) var[M̂(y,x)] = (nh₁⋯h_q)⁻¹ μ(x)² Σ_{y|x} + o((nh₁⋯h_q)⁻¹), where Σ_{y|x} = κ^q F(y|x)[1 − F(y|x)]/μ(x);

(iii) if (nh₁⋯h_q)^{1/2} Σ_{s=1}^q h_s² = o(1), then (nh₁⋯h_q)^{1/2}[F̂(y|x) − F(y|x)] → N(0, Σ_{y|x}) in distribution.

The proof of Theorem 6.1 is given in Section 6.9. By using

F̂(y|x) − F(y|x) = M̂(y,x)/μ̂(x) = M̂(y,x)/μ(x) + O_p((Σ_{s=1}^q h_s² + (nh₁⋯h_q)^{−1/2})²),

Theorem 6.1 implies that

MSE[F̂(y|x)] = [Σ_{s=1}^q h_s² B_s(y,x)]² + Σ_{y|x}/(nh₁⋯h_q) + (s.o.),  (6.2)

where (s.o.) denotes (omitted) smaller order terms. If we choose h₁,...,h_q to minimize a weighted integrated MSE given by ∫MSE[F̂(y|x)]s(y,x)dxdy, where s(y,x) is a nonnegative weight function, then from (6.2) it is easy to show that the optimal smoothing parameters that minimize the integrated MSE satisfy h_s ~ n^{−1/(4+q)} for s = 1,...,q. However, there does not exist a closed form expression for the optimal h_s's. Though one can compute plug-in h_s's, it is computationally challenging, to say the least. Li and Racine (forthcoming) recommend using a least squares cross-validation method of bandwidth selection designed for the estimation of conditional PDFs. We showed in Chapter 5 that, for conditional PDF estimation, optimal smoothing yields bandwidths of the form h₀ ~ h_s ~ n^{−1/(5+q)} for all s = 1,...,q. Note that the exponential factor is −1/(5+q) because the dimension of (y,x) is q + 1, since we assumed that Y is a continuous variable. We can then multiply the conditional PDF LSCV smoothing parameters by the factor n^{1/(5+q)−1/(4+q)} to obtain the correct optimal rate for the conditional CDF smoothing parameters. The simulations reported in Li and Racine indicate that this approach performs well.
6.2 Estimating a Conditional CDF with Continuous Covariates Smoothing the Dependent Variable

In this section we discuss an alternative estimator that also smooths the dependent variable Y, as we did for the unconditional CDF estimator developed in Section 1.4. That is, we can also estimate F(y|x) by

F̃(y|x) = [n⁻¹ Σᵢ G((y − Yᵢ)/h₀) K_h(Xᵢ, x)] / μ̂(x),  (6.3)

where G(·) is a kernel CDF chosen by the researcher, say, the standard normal CDF, and h₀ is the smoothing parameter associated with Y. In addition to Assumption 6.1, we also need to assume that, as n → ∞, h₀ → 0. Defining M̃(y,x) = [F̃(y|x) − F(y|x)]μ̂(x), then F̃(y|x) − F(y|x) = M̃(y,x)/μ̂(x). We have the following result.

Theorem 6.2. Define B₀(y,x) = (1/2)κ₂F_{00}(y|x) and let M(y,x) = κ^q F_0(y|x)/μ(x). Note that Σ_{y|x} and B_s(y,x) are the same as those defined in Theorem 6.1. Letting |h|² = h₀² + Σ_{s=1}^q h_s², then under conditions similar to those given in Theorem 6.1, we have

(i) E[M̃(y,x)] = μ(x) Σ_{s=0}^q h_s² B_s(y,x) + o(|h|²);

(ii) var[M̃(y,x)] = (nH_q)⁻¹ μ(x)² [Σ_{y|x} − h₀C_k M(y,x)] + o((nH_q)⁻¹), where H_q = h₁⋯h_q and C_k = 2∫G(v)w(v)v dv;

(iii) if (nH_q)^{1/2}|h|² = o(1), then (nH_q)^{1/2}[F̃(y|x) − F(y|x) − Σ_{s=0}^q h_s²B_s(y,x)] → N(0, Σ_{y|x}).

The proof of Theorem 6.2 is given in Section 6.9. Theorem 6.2 implies that

MSE[F̃(y|x)] = [Σ_{s=0}^q h_s² B_s(y,x)]² + (nH_q)⁻¹[Σ_{y|x} − h₀C_k M(y,x)] + (s.o.).  (6.4)

One may choose the smoothing parameters to minimize the leading term of a WIMSE of F̃(y|x) given by

WIMSE = ∫ { [Σ_{s=0}^q h_s²B_s(y,x)]² + (nH_q)⁻¹[Σ_{y|x} − h₀C_k M(y,x)] } s(y,x) dx dy,  (6.5)

where s(y,x) is a nonnegative weight function. Were one to adopt a plug-in method of bandwidth selection, one would first need to estimate B_s(y,x), Σ_{y|x} and M(y,x), which requires one to choose initial "pilot" smoothing parameters, while the accurate numerical computation of a (q+1)-dimensional integral can be extremely difficult. However, well developed automatic data-driven methods do exist for selecting the smoothing parameters for estimating a conditional PDF.
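A minimal numerical sketch of (6.3), taking G to be the standard normal CDF and using a Gaussian kernel on X (function names are hypothetical):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)          # so F(y|x) = Phi(y - x)

def smooth_cond_cdf(y, x, Y, X, h0, h):
    """F~(y|x) of (6.3): replace the indicator 1(Y_i <= y) by the smooth
    weight G((y - Y_i)/h0), with kernel weights on X in the denominator."""
    G = np.array([0.5 * (1.0 + erf((y - yi) / (h0 * sqrt(2.0)))) for yi in Y])
    K = np.exp(-0.5 * ((X - x) / h)**2)   # unnormalized Gaussian weights
    return float((G * K).sum() / K.sum())

F00 = smooth_cond_cdf(0.0, 0.0, Y, X, h0=0.2, h=0.3)   # true value is 0.5
```

By construction the estimate is monotone in y and lies in [0,1], which is what makes it directly invertible for quantile estimation later in the chapter.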
Given the lack of data-driven methods for conditional CDF estimation having desirable properties, we again suggest using the cross-validation methods that have been developed for conditional PDF estimation. As discussed in Chapter 5, for conditional PDFs optimal smoothing gives h₀ ~ h_s ~ n^{−1/(5+q)} for all s = 1,...,q. Below we show that optimal smoothing for conditional CDFs requires that h₀ = o(h_s) for s = 1,...,q and that h_s ~ n^{−1/(4+q)}. For the sake of clarity, we first consider the case of q = 1 and then turn to the general case.

Consider the case for which q = 1. From (6.4) we know that the weighted MSE is given by

WIMSE = ∫ MSE[F̃(y|x)]s(y,x)dydx = A₁h₀⁴ + A₂h₀²h₁² + A₃h₁⁴ + A₄(nh₁)⁻¹ − A₅h₀(nh₁)⁻¹ + (s.o.),  (6.6)

where the A_j's are some constants given by

A₁ = ∫B₀(y,x)²s(y,x)dydx, A₂ = 2∫B₀(y,x)B₁(y,x)s(y,x)dydx, A₃ = ∫B₁(y,x)²s(y,x)dydx, A₄ = ∫Σ_{y|x}s(y,x)dydx, and A₅ = C_k∫M(y,x)s(y,x)dydx.

All the A_j's, except A₂, are positive, while A₂ can be positive, negative, or zero. The first order conditions are

∂WIMSE/∂h₀ = 4A₁h₀³ + 2A₂h₀h₁² − A₅(nh₁)⁻¹ + (s.o.) = 0,  (6.7)
∂WIMSE/∂h₁ = 2A₂h₀²h₁ + 4A₃h₁³ − A₄n⁻¹h₁⁻² + (s.o.) = 0,  (6.8)

where (s.o.) denotes smaller order terms. From (6.7) and (6.8), we can easily see that h₀ and h₁ cannot have the same order. Furthermore, it can be shown that h₀ must have an order smaller than that of h₁.¹ Assuming that h₀ = c₀n^{−α} and h₁ = c₁n^{−β}, then we must have α > β. Then, from (6.8) we get β = 1/5, and substituting this into (6.7) we obtain α = 2/5. Thus, optimal smoothing requires that h₀ ~ n^{−2/5} and h₁ ~ n^{−1/5}.

For the general case of q > 1, by symmetry all h_s's should have the same order, say, h_s ~ n^{−β} for s = 1,...,q, and h₀ ~ n^{−α}. Then it is easy to show that β = 1/(4+q) and α = 2/(4+q). Thus, the optimal h_s ~ n^{−1/(4+q)} for s = 1,...,q, and h₀ ~ n^{−2/(4+q)}.

¹One can also try α = β, which will lead to β = 1/5 from (6.8) and β = 1/4 from (6.7), a contradiction. Similarly, assuming that α < β also leads to a contradiction. Therefore, we must have α > β.

Up to now we have focused on a local constant estimator of F(y|x). We can also use local linear methods to estimate F(y|x). As we discussed earlier, F(y|x) = E[1(Yᵢ ≤ y)|Xᵢ = x] is the conditional mean of 1(Yᵢ ≤ y) given Xᵢ = x. Thus, we can also use the local linear method to estimate this conditional mean function by regressing 1(Yᵢ ≤ y) on (1, (Xᵢ − x)) using kernel weights. The resulting
Thus, we ean also use the local linear method to estimate this conditional mean function by re- agressing 1(¥4 < y) on (1,(X;—2)) using kernel weights. The resulting TOne can also try to we a =P, which wil lead co 8 = 1/5 from (68) and 9 = 1/A from (6.7, a contradiction, Similary, assuming that a < Balko leads to.a ‘outradition. Therefor, we must have a>62,_ ESTIMATING A CONDITIONAL CDF: SMOOTH Y 187 intercept estimator will be the local linear estimator of F(yle). Tt ean bbe shown that this estimator has a leading bias term given by (02/2) S>h2Fes(ule) (6.9) and a leading variance term of Xyj., which is the same as that of the local constant estimator given in Theorem 6.1. However, as pointed out bby Hall, Wolff and Yao (1999), two undesirable properties are associated with the local linear estimator: (i) it may not be monotone in Y’, and i) it is not be constrained to take values in (0,1) Motivated by the empirical likelihood approach, Hall et al. (1999) ‘and Cai (2002) introduce a weighted local constant estimator to aver- come problems arising from the use of a local linear approach. A ‘weighted local constant estimator is given by Dhppile) Kn XG, 2)1 < v) - Cee) Kn Xa) where pi(r), for ¢ = 1,...,n, denotes the weight functions of the data, Xiy.++)Xq and the design point 2 satisfying the properties pi(x) > 0, and Flyia) = (6.10) Des apeyKlXu2) (oan) Condition (6.11) is motivated by the local linear estimator, which en- sures thatthe estimation bis has the same order of magnitude in the Boundary and interior rogons. Tt also reduces the number of leading bias terms in the Taylor expansion, However, pi(a)s satisfying these conditions are not uniquely defined (since we have n parameters of pla), only q-+ 2 restrictions). Hall et al. and Cai suggest choosing the pi(2)’s by maximizing TT? 
Πᵢ₌₁ⁿ pᵢ(x) subject to these constraints. This leads to the following optimization problem:

max_{p₁(x),...,pₙ(x)} Σᵢ ln pᵢ(x), subject to (6.11), pᵢ(x) ≥ 0 and Σᵢ₌₁ⁿ pᵢ(x) = 1.  (6.12)

Letting γ be the Lagrange multiplier associated with condition (6.11), then (see the hints given in Exercise 6.5)

pᵢ(x) = 1 / [n(1 + γ′(Xᵢ − x)K_h(Xᵢ,x))],  (6.13)

where

γ = argmax_γ Σᵢ₌₁ⁿ ln(1 + γ′(Xᵢ − x)K_h(Xᵢ,x)).  (6.14)

Equation (6.14) does not have a closed form solution. Cai (2002) recommends using the Newton-Raphson scheme to find the root (solution) γ based on (6.14). Under regularity conditions similar to those given in Theorem 6.2, Cai establishes the asymptotic distribution of F̆(y|x), which is given by

(nh₁⋯h_q)^{1/2}[F̆(y|x) − F(y|x) − Σ_{s=1}^q h_s² B̆_s(y,x)] → N(0, Σ_{y|x}),  (6.15)

where B̆_s(y,x) = (1/2)κ₂F_{ss}(y|x).

Equation (6.15) shows that F̆(y|x) has the same (first order) asymptotic distribution as a local linear kernel estimator of F(y|x). It also has the additional advantages of being monotone in y and taking values only in [0,1]. A disadvantage is that numerical optimization routines are required for its computation. When X has bounded support, Cai (2002) further shows that F̆(y|x) has the same order of bias (Σ_{s=1}^q h_s²) and variance at the boundary of the support of X, i.e., this approach is immune to boundary effects in bounded support settings.

We observe that the above weighted local constant estimator has properties similar to those of the local linear estimator. One disadvantage mentioned above is that its computation requires nonlinear optimization, while the local constant and local linear estimators possess closed form solutions and can be readily computed. The problem arising with the local linear estimator is that the local weight function may assume negative values, hence the resulting estimator of F(y|x) may assume values lying outside the unit interval [0,1].
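A minimal univariate sketch of the Newton-Raphson computation that Cai (2002) recommends for (6.14), producing the weights (6.13) and verifying the constraints of (6.11)–(6.12); the function name is hypothetical and Gaussian weights are assumed:

```python
import numpy as np

def el_weights(X, x, h, tol=1e-10):
    """Weights p_i(x) = 1/[n(1 + g*(X_i - x)*K_i)] of (6.13), with the
    scalar multiplier g found by Newton-Raphson on the first-order
    condition sum_i d_i/(1 + g*d_i) = 0, where d_i = (X_i - x)*K_i."""
    K = np.exp(-0.5 * ((X - x) / h)**2)
    d = (X - x) * K
    n = len(X)
    g = 0.0
    for _ in range(100):
        r = d / (1.0 + g * d)
        step = r.sum() / (-(r**2).sum())   # f(g)/f'(g), f' = -sum r_i^2
        g -= step
        if abs(step) < tol:
            break
    return 1.0 / (n * (1.0 + g * d))

rng = np.random.default_rng(3)
X = rng.normal(size=200)
x0, h = 0.5, 0.4
p = el_weights(X, x0, h)
K = np.exp(-0.5 * ((X - x0) / h)**2)
```

A convenient by-product of the empirical likelihood structure is that once (6.11) holds, Σᵢpᵢ(x) = 1 follows automatically, so only the single multiplier γ needs to be solved for.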
Hansen (2004) proposes a modified local linear estimator of F(y|x) in which he replaces the negative weights (when they occur) by 0. This modification is asymptotically negligible, but it constrains the resulting local linear estimator of F(y|x) to assume values in [0,1], leading to a valid conditional CDF estimator. Hansen shows that the asymptotic distribution of his modified local linear estimator is first order equivalent to F̆(y|x) given in (6.15).

The nonparametric estimation of conditional CDFs has widespread potential application in economics. For example, such estimators have been used to identify and estimate nonadditive and nonseparable functions; see Matzkin (2003) and Altonji and Matzkin (2005).

Up to now we have focused on nonparametric estimation of a conditional CDF. When the dimension of the covariates is high, the curse of dimensionality may prevent accurate nonparametric estimation. Hall and Yao (2005) propose using dimension-reduction methods to approximate the conditional CDF. Specifically, they use F(Y|X′λ) to approximate the true conditional CDF F(Y|X), where X is of dimension q and λ is a q × 1 vector of unknown parameters. Their estimation method is similar in spirit to the single index model method (see Chapter 8). Since X′λ is a scalar (i.e., univariate), it involves one-dimensional nonparametric estimation only and hence is immune to the curse of dimensionality critique.

6.3 Nonparametric Estimation of Conditional Quantile Functions

Quantile regression methods are widely used by practitioners. For a comprehensive analysis of a range of parametric methods we direct the reader to Koenker (2005). One reason for the popularity of these methods is that they more fully characterize the conditional CDF of a variable of interest, in contrast to regression models that provide the conditional mean only.
The unconditional αth quantile of a CDF F(·) is defined as

q_α = inf{y : F(y) ≥ α} = F⁻¹(α),  (6.16)

where α ∈ (0,1). For example, let Y ~ N(0,1) and let F(·) be the associated standard normal CDF. Then q₀.₅ = 0 since F(0) = 0.5 and F(0 − ε) < 0.5 for any ε > 0. To determine a quantile, say, q₀.₉₅, we work with the inverse CDF. That is, since q₀.₉₅ = F⁻¹(0.95) and F(q₀.₉₅) = F(F⁻¹(0.95)) = 0.95, we obtain q₀.₉₅ = 1.645 because P[N(0,1) ≤ 1.645] = F(1.645) = 0.95. In general, if we let X denote a random variable with CDF F(·), then the method for finding q_α involves applying F to q_α = F⁻¹(α) to get F(q_α) = P[X ≤ q_α] = α, i.e., applying F to (6.16).

Again by way of example, suppose that X represents family income in a given year. Then q₀.₂₅ is that level of income such that 25% of all family incomes lie below it.

Often, it is a conditional rather than an unconditional quantile that is of interest. The conditional αth quantile is defined by (α ∈ (0,1))

q_α(x) = inf{y : F(y|x) ≥ α} = F⁻¹(α|x).  (6.17)

For example, suppose that X represents age and Y represents individual income. Then q₀.₂₅(x = 30) is that level of income for 30 year olds such that 25% of 30 year olds have incomes below it.

In practice, we can estimate the conditional quantile function q_α(x) by inverting an estimated conditional CDF. Using the estimators with a smooth function and an indicator function, respectively, for Y, we obtain

q̃_α(x) = inf{y : F̃(y|x) ≥ α} = F̃⁻¹(α|x),  (6.18)
q̂_α(x) = inf{y : F̂(y|x) ≥ α} = F̂⁻¹(α|x).  (6.19)

Because F̃(y|x) (F̂(y|x)) lies between zero and one and is monotone in y, q̃_α(x) (q̂_α(x)) always exists. Therefore, once one obtains F̃(y|x) (F̂(y|x)), it is trivial to compute q̃_α(x) (q̂_α(x)) using (6.18) ((6.19)).

We interpret (6.18) to mean that, for a given value of α and x, we solve for q̃_α(x) by choosing q_α to minimize the objective function

q̃_α(x) = argmin_{q_α} |α − F̃(q_α|x)|.
(6.20)

That is, the resulting value of q_α that minimizes (6.20) is q̃_α(x).

We assume that F(y|x) has a conditional PDF f(y|x), with f(y|x) continuous in x, and f(q_α(x)|x) > 0. Note that f(y|x) = F₀(y|x) since F₀(y|x) = ∂F(y|x)/∂y. The next two theorems give the asymptotic distributions of q̂_α(x) and q̃_α(x).

Theorem 6.3. For s = 1,...,q, define

B_{α,s}(x) = B_s(q_α(x), x) / f(q_α(x)|x),

where B_s(y,x) is defined in Theorem 6.1. Then under conditions similar to those given in Theorem 6.1, we have

(nh₁⋯h_q)^{1/2} [q̂_α(x) − q_α(x) − Σ_{s=1}^q h_s² B_{α,s}(x)] → N(0, V_α(x)),

where V_α(x) = α(1−α)κ^q / [f²(q_α(x)|x)μ(x)].

Theorem 6.4. Define B_{α,0}(x) = B₀(q_α(x), x)/f(q_α(x)|x), and let B_{α,s}(x) be the same as in Theorem 6.3 for s = 1,...,q. Then under conditions similar to those given in Theorem 6.2, we have

(nh₁⋯h_q)^{1/2} [q̃_α(x) − q_α(x) − Σ_{s=0}^q h_s² B_{α,s}(x)] → N(0, V_α(x)),

where V_α(x) is the same as defined in Theorem 6.3.

The proof of Theorem 6.3 is left as an exercise and the proof of Theorem 6.4 is given in Section 6.9.1. One can also estimate q_α(x) by inverting the weighted local constant estimator of F(y|x) given in (6.10), i.e.,

q̆_α(x) = inf{y : F̆(y|x) ≥ α} = F̆⁻¹(α|x).  (6.21)

Cai (2002) shows that

(nh₁⋯h_q)^{1/2} [q̆_α(x) − q_α(x) − Σ_{s=1}^q h_s² B̆_s(q_α(x), x)/f(q_α(x)|x)] → N(0, V_α(x)),  (6.22)

where the B̆_s(·,·)'s are defined in (6.15) and where V_α(x) is the same as that defined in Theorem 6.4.

6.4 The Check Function Approach

A popular alternative estimator for q_α(x) can be derived using the so-called check function; see Chaudhuri (1991), Chaudhuri, Doksum and Samarov (1997), Jones and Hall (1990), Yu and Jones (1997, 1998), Honda (2000), Chong and Peng (2002), Whang (2006), and the references therein. The check function gets its name from the shape of the underlying objective function.
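Both routes to a conditional quantile — inverting an estimated conditional CDF as in (6.19), and minimizing a kernel-weighted check loss ρ_α(z) = z[α − 1(z < 0)] as developed below — can be sketched numerically and compared. The following is a minimal sketch with hypothetical function names and Gaussian weights; the check-loss minimizer is found by scanning candidate values in {Yⱼ}, since the weighted check objective is piecewise linear with kinks at the data points:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = rng.uniform(-2, 2, size=n)
Y = 2.0 * X + rng.normal(scale=0.5, size=n)    # q_{0.5}(x) = 2x

def kernel_weights(X, x, h):
    return np.exp(-0.5 * ((X - x) / h)**2)

def quantile_by_inversion(alpha, x, Y, X, h):
    """q^_alpha(x) = inf{y : F^(y|x) >= alpha}, inverting the step CDF
    F^(y|x) = sum_i 1(Y_i <= y) K_i / sum_i K_i over the sorted Y_i."""
    K = kernel_weights(X, x, h)
    order = np.argsort(Y)
    F = np.cumsum(K[order]) / K.sum()
    idx = np.searchsorted(F, alpha)            # first index with F >= alpha
    return float(Y[order][min(idx, n - 1)])

def rho(z, alpha):
    """Check function rho_alpha(z) = z * (alpha - 1(z < 0))."""
    return z * (alpha - (z < 0.0))

def quantile_by_check(alpha, x, Y, X, h):
    """Minimize sum_j rho_alpha(Y_j - a) K_j over a, scanning a in {Y_j}."""
    K = kernel_weights(X, x, h)
    obj = [float(np.sum(K * rho(Y - a, alpha))) for a in Y]
    return float(Y[int(np.argmin(obj))])

q_inv = quantile_by_inversion(0.5, 1.0, Y, X, 0.25)
q_chk = quantile_by_check(0.5, 1.0, Y, X, 0.25)
```

Both routines locate the point where the weighted conditional CDF crosses α, so they coincide up to adjacent order statistics; this is the finite-sample counterpart of the first-order asymptotic equivalence discussed below.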
Note that L₂-norm objective functions such as those based on least squares methods yield U-shaped objective functions, while L₁-norm objective functions such as those based on least absolute deviations yield a V-shaped or "check-shaped" objective function.

The popular parametric linear quantile regression model (Koenker and Bassett (1978)) is of the form

Y_i = X_i'β + u_i,    (6.23)

and from (6.23) we obtain q_α(Y_i|X_i) = X_i'β + q_α(u_i|X_i). One cannot separately identify β and q_α(u_i|X_i), just as one cannot separately identify the intercept and E(u_i) in a least squares regression model (i.e., where we impose E(u_i) = 0). We can rewrite this model as

Y_i = q_α(Y_i|X_i) + v_{α,i} ≡ X_i'β_α + v_{α,i},    (6.24)

where v_{α,i} = u_i − q_α(u_i|X_i). By definition, q_α(v_{α,i}|X_i) = 0. Basically, we impose the condition that the conditional αth quantile of the error process is equal to zero.

It is well known that one can estimate β_α in (6.24) by minimizing the following objective function:

β̂_α = argmin_β Σ_{i=1}^n ρ_α(Y_i − X_i'β),    (6.25)

where ρ_α(z) = z[α − 1(z < 0)] is the check function; see Koenker and Bassett (1978) for details. Also see He and Zhu (2003) for a lack-of-fit test for parametric quantile regression models. Equation (6.25) can be shown to be equivalent to

min_β [ α Σ_{i: Y_i ≥ X_i'β} |Y_i − X_i'β| + (1 − α) Σ_{i: Y_i < X_i'β} |Y_i − X_i'β| ],    (6.26)

where we make use of the fact that −z = |z| for z < 0. Koenker and Bassett (1978) showed that

√n (β̂_α − β_α) → N(0, σ_α² (E(X_i X_i'))^{-1})

in distribution, where σ_α² = α(1 − α)/f_{v_α}(0)², and f_{v_α}(·) is the PDF of v_α.

This check function approach can also be adopted for the estimation of a nonparametric quantile regression model. We use a local constant method (or, alternatively, we could use a local linear method), and we choose a to minimize the following objective function:

min_a Σ_j ρ_α(Y_j − a) W_h(X_j, x),    (6.27)

where ρ_α(v) = v[α − 1(v < 0)]. Letting q̂_{α,c}(x) denote the value of a that minimizes (6.27), it can be shown that q̂_{α,c}(x) is a consistent estimator of q_α(x). The leading MSE of q̂_{α,c}(x) can be found in Jones and Hall (1990) and Yu and Jones (1997).
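Because ρ_α(·) is piecewise linear, a minimizer of the kernel-weighted objective function (6.27) is always attained at one of the observed Y_j, so the local constant check-function estimator can be computed by a simple search. Below is a minimal numpy sketch for a scalar covariate; the Gaussian kernel, bandwidth, and function names are our own choices, not the book's:

```python
import numpy as np

def rho(v, alpha):
    # check function rho_a(v) = v * [alpha - 1(v < 0)]
    return v * (alpha - (v < 0.0).astype(float))

def check_quantile(alpha, x, X, Y, h):
    """Local constant check-function estimator, cf. (6.27):
    minimize sum_j rho_a(Y_j - a) * K((X_j - x)/h) over a.
    A minimizer is attained at one of the observed Y_j."""
    w = np.exp(-0.5 * ((X - x) / h) ** 2)   # Gaussian kernel weights
    cand = np.unique(Y)                     # candidate values for a
    loss = [np.sum(w * rho(Y - a, alpha)) for a in cand]
    return float(cand[int(np.argmin(loss))])
```

Note that minimizing (6.27) in this way amounts to computing a kernel-weighted sample quantile of the Y_j.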
The leading bias and variance terms of q̂_{α,c}(x) are exactly the same as those for q̂_α(x) given in Theorem 6.3. Therefore, q̂_{α,c}(x) has the same asymptotic distribution as that of q̂_α(x) presented in Theorem 6.3.

One can also use a local linear method by replacing (6.27) with the following objective function:

min_{a,b} Σ_j ρ_α(Y_j − a − (X_j − x)'b) W_h(X_j, x),    (6.28)

where the minimizing value of a is the local linear estimator of q_α(x), and b estimates the derivative of q_α(x).

Let q̂_{α,ll}(x) denote the resulting estimator of q_α(x). The leading MSE term of q̂_{α,ll}(x) can be found in Fan, Hu and Truong (1994) and Yu and Jones (1997). The leading variance term is the same as that for the local constant estimator (hence identical to that given in Theorem 6.3), while the leading bias term is given by

(κ₂/2) Σ_{s=1}^q h_s² q_{α,ss}(x),    (6.29)

where q_{α,ss}(x) = ∂²q_α(x)/∂x_s².

6.5 Conditional CDF and Quantile Estimation with Mixed Discrete and Continuous Covariates

When X is a vector of mixed discrete and continuous variables, we need to replace the kernel function by a generalized product kernel appropriate for mixed data types. We first discuss estimating a conditional CDF without smoothing the dependent variable Y. We estimate F(y|x) by

F̂(y|x) = [n^{-1} Σ_{i=1}^n 1(Y_i ≤ y) K_γ(X_i, x)] / μ̂(x),    (6.30)

where μ̂(x) = n^{-1} Σ_{i=1}^n K_γ(X_i, x) is the kernel estimator of μ(x), K_γ(X_i, x) = W_h(X_i^c, x^c) L_λ(X_i^d, x^d), W_h(X_i^c, x^c) = Π_{s=1}^q h_s^{-1} w((x_s^c − X_{is}^c)/h_s), L_λ(X_i^d, x^d) = Π_{s=1}^r l(X_{is}^d, x_s^d, λ_s), and l(X_{is}^d, x_s^d, λ_s) = λ_s^{1(X_{is}^d ≠ x_s^d)}.

Should we elect to smooth the dependent variable Y, we obtain another estimator of F(y|x) given by

F̃(y|x) = [n^{-1} Σ_{i=1}^n G((y − Y_i)/h₀) K_γ(X_i, x)] / μ̂(x),    (6.31)

where G(·) is a CDF, say, the standard normal CDF, and h₀ is the smoothing parameter associated with Y. Define M̂(y,x) = [F̂(y|x) − F(y|x)] μ̂(x), and also let M̃(y,x) = [F̃(y|x) − F(y|x)] μ̂(x). The following two theorems give the asymptotic distributions of F̂(y|x) and F̃(y|x), respectively.

Theorem 6.5.
Under regularity conditions similar to those given in Theorem 6.1, but with the differentiability conditions imposed with respect to x^c (see Li and Racine (forthcoming) for details), we have

(i) Letting η_n = (|h|² + |λ|)² + (n h_1⋯h_q)^{-1}, then

MSE[M̂(y,x)] = { Σ_{s=1}^q h_s² B_{1s}(y,x) + Σ_{s=1}^r λ_s B_{2s}(y,x) }² + (n h_1⋯h_q)^{-1} V₀(y|x) + o(η_n),

where V₀(y|x) = κ^q μ(x) F(y|x)[1 − F(y|x)], B_{1s}(y,x) = (κ₂/2)[2 F_s(y|x) μ_s(x) + μ(x) F_{ss}(y|x)], and B_{2s}(y,x) = Σ_{z^d} 1_s(z^d, x^d)[F(y|x^c, z^d) μ(x^c, z^d) − F(y|x) μ(x)], where 1_s(·,·) is defined in (4.24).

(ii) Letting Σ_{y|x} = κ^q F(y|x)[1 − F(y|x)]/μ(x), then

(n h_1⋯h_q)^{1/2} [F̂(y|x) − F(y|x) − Σ_{s=1}^q h_s² B_{1s}(y,x)/μ(x) − Σ_{s=1}^r λ_s B_{2s}(y,x)/μ(x)] → N(0, Σ_{y|x}).

Theorem 6.6. Define |λ| = Σ_{s=1}^r λ_s, B_{10}(y|x) = (κ₂/2) F_yy(y|x), ν(y,x) = κ^q C_k F_y(y|x)/μ(x), and let Σ_{y|x}, B_{1s}(y,x), and B_{2s}(y,x) be the same as those defined in Theorem 6.5. Then under assumptions given in Li and Racine (forthcoming), we have

MSE[M̃(y,x)] = μ(x)² { h₀² B_{10}(y|x) + Σ_{s=1}^q h_s² B_{1s}(y,x)/μ(x) + Σ_{s=1}^r λ_s B_{2s}(y,x)/μ(x) }² + (n h_1⋯h_q)^{-1} μ(x)² [Σ_{y|x} − h₀ ν(y,x)] + O(η̃_n),

where η̃_n = [|h|² + |λ| + (n h_1⋯h_q)^{-1/2}](|h|² + |λ|), and

(n h_1⋯h_q)^{1/2} [F̃(y|x) − F(y|x) − h₀² B_{10}(y|x) − Σ_{s=1}^q h_s² B_{1s}(y,x)/μ(x) − Σ_{s=1}^r λ_s B_{2s}(y,x)/μ(x)] → N(0, Σ_{y|x}).

The proofs of Theorems 6.5 and 6.6 are given in Section 6.9.2. As discussed above, to obtain a quantile estimator we invert F̃(·|x): q̃_α(x) = F̃^{-1}(α|x). In practice we compute q̃_α(x) via

q̃_α(x) = argmin_{q_α} |α − F̃(q_α|x)|.

The asymptotic distribution of q̃_α(x) is given in the next theorem.

Theorem 6.7. Define B_{q,α}(x) = B_n(q_α(x)|x)/f(q_α(x)|x), where (with y = q_α(x))

B_n(y|x) = h₀² B_{10}(y|x) + Σ_{s=1}^q h_s² B_{1s}(y,x)/μ(x) + Σ_{s=1}^r λ_s B_{2s}(y,x)/μ(x)

is the leading bias term of F̃(y|x). Then under the same conditions as those used in Theorem 6.6, we have

(i) q̃_α(x) − q_α(x) → 0 in probability, and

(ii) (n h_1⋯h_q)^{1/2} [q̃_α(x) − q_α(x) − B_{q,α}(x)] → N(0, V_α(x)) in distribution, where V_α(x) = α(1 − α)κ^q/[f²(q_α(x)|x) μ(x)] = V(q_α(x)|x)/f²(q_α(x)|x) (since α = F(q_α(x)|x)).

The proof of Theorem 6.7 is similar to the proof of Theorem 6.4 and is therefore omitted here.
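The generalized product kernel K_γ(X_i, x) in (6.30) is straightforward to code. Below is a minimal sketch with Gaussian kernels for the continuous components and the λ_s^{1(X_is^d ≠ x_s^d)} kernel for the discrete components; the helper names and the illustrative data in the test are ours, not the book's:

```python
import numpy as np

def product_kernel(xc_i, xc, xd_i, xd, h, lam):
    """Generalized product kernel K_gamma(X_i, x) for mixed data:
    Gaussian factors w((x_s - X_is)/h_s)/h_s for the continuous components
    and lambda_s^{1(X_is != x_s)} for the discrete components."""
    cont = np.prod(np.exp(-0.5 * ((xc_i - xc) / h) ** 2) / (h * np.sqrt(2 * np.pi)))
    disc = np.prod(np.where(xd_i == xd, 1.0, lam))
    return float(cont * disc)

def cond_cdf_mixed(y, xc, xd, Xc, Xd, Y, h, lam):
    # local constant estimate of F(y|x), cf. (6.30), with an indicator for Y
    w = np.array([product_kernel(Xc[i], xc, Xd[i], xd, h, lam)
                  for i in range(len(Y))])
    return float(np.sum(w * (Y <= y)) / np.sum(w))
```

Setting lam close to 1 smooths a discrete covariate out entirely, while lam = 0 splits the sample by its discrete cells.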
In Exercise 6.4 the reader is asked to show that the optimal smoothing parameters for estimating F(y|x) should satisfy h_s ~ n^{-1/(4+q)} for s = 1,...,q, λ_s ~ n^{-2/(4+q)} for s = 1,...,r, and h₀ ~ n^{-2/(4+q)}. Unfortunately, to the best of our knowledge, there does not exist an automatic data-driven method for optimally selecting bandwidths when estimating a conditional CDF, in the sense that a weighted integrated MSE is minimized. Given the close relationship between the conditional PDF and the conditional CDF, Li and Racine (forthcoming) recommend using the least squares cross-validation method based on conditional PDF estimation when selecting bandwidths, and then using these bandwidths for estimating the conditional CDF F(y|x) and the conditional quantile function q_α(x).

Let ĥ_s and λ̂_s denote the values of h_s and λ_s selected by minimizing the PDF cross-validation function (see Chapter 5). Recall that in Chapter 5 we showed that the optimal bandwidths are of order ĥ_s ~ n^{-1/(q+5)} for s = 0,1,...,q, and λ̂_s ~ n^{-2/(q+5)} for s = 1,...,r (assuming that Y is a continuous variable). To obtain bandwidths that have the correct optimal rates for F(y|x), Li and Racine (forthcoming) recommend using ĥ_s n^{1/(q+5)} n^{-1/(q+4)} (s = 1,...,q), ĥ₀ n^{1/(q+5)} n^{-2/(q+4)}, and λ̂_s n^{2/(q+5)} n^{-2/(q+4)} (s = 1,...,r).

Theorems 6.6 and 6.7 hold true with the above (stochastic) bandwidths ĥ₀, ĥ_s, and λ̂_s.

6.6 A Small Monte Carlo Simulation Study

We briefly investigate the finite-sample performance of the conditional quantile estimators defined in (6.18) and (6.19), following Li and Racine (forthcoming). The DGP we consider is given by

Y_i = β₀ + β₁ X_{i1}^d + β₂ X_{i2}^d + β₃ sin(X_i^c) + ε_i,

where X_i^c ~ uniform[−2π, 2π], X_{i1}^d and X_{i2}^d are both binomial (the total number of discrete cells is c = 4), and ε_i ~ N(0, σ²) with σ = 0.50.

Table 6.1: MSE performance of the nonsmooth and smooth quantile estimators. (The rows compare bandwidth rules, ranging from the ad hoc rule 1.06σn^{-1/5} combined with an indicator function for Y through LSCV for all smoothing parameters; the final column reports MSE relative to the nonsmooth indicator estimator, declining from 1.00 for the indicator benchmark down to 0.41 and 0.38 for the LSCV-based rules.)
We consider the estimator given in (6.18) that treats Y using an indicator function, i.e., 1(y − Y_i ≥ 0), and the estimator given in (6.19) that uses a smooth function, G((y − Y_i)/h₀), where G(·) is the Gaussian CDF kernel. Regardless of the kernel function chosen for G((y − Y_i)/h₀), as h₀ → 0 we have G((y − Y_i)/h₀) → 1(y − Y_i ≥ 0); hence, to generate results for 1(y − Y_i ≥ 0) we simply set h₀ = 0 (or choose a sufficiently small h₀).

We conduct 1,000 Monte Carlo replications, each time computing the MSE of the quantile estimator with α = 0.50. We report the median relative MSE over the 1,000 replications, and investigate the relative performance of the nonsmooth and smooth quantile approaches based on a variety of bandwidth selectors. Results are summarized in Table 6.1, where numbers less than one indicate superior performance of the smooth quantile estimator (6.19).

From Table 6.1 we observe that by smoothing Y with the ad hoc formula h₀ = 1.06σ_y n^{-1/5}, MSE is reduced by 20% compared with the estimator that uses a nonsmooth indicator function for Y. Next, smoothing over the discrete covariates by LSCV reduces the MSE by another 20%. Finally, choosing the continuous covariate bandwidth and h₀ also by LSCV (rather than using the ad hoc formula) reduces the MSE by a further 25%. Using LSCV throughout reduces the MSE by half compared with ad hoc choices of h₀ or indicator functions for the dependent variable and the discrete covariates. Note that in the above experiment all covariates are relevant. In practice it is often the case that some covariates are in fact irrelevant (i.e., are independent of the dependent variable), and in such cases the efficiency gains achieved by using LSCV will be even larger, since the LSCV approach can automatically smooth out irrelevant covariates.

Nonparametric quantile regression with dependent data was considered by Cai (2002). See also Koenker and Xiao (2002), who discuss inference regarding the quantile regression process.
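The two estimators compared above differ only in how Y enters the estimated conditional CDF: an indicator 1(y − Y_i ≥ 0) for (6.18) versus G((y − Y_i)/h₀) for (6.19). A minimal numpy sketch for a single continuous covariate, including the grid inversion of the estimated CDF; function names and bandwidths are our own choices:

```python
import math
import numpy as np

def norm_cdf(u):
    # Gaussian CDF kernel G(.)
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def cond_cdf(y, x, X, Y, hx, h0=0.0):
    """Local constant estimate of F(y|x) for scalar X.
    h0 = 0 reproduces the indicator 1(y - Yi >= 0), as in (6.18);
    h0 > 0 smooths Y with G((y - Yi)/h0), as in (6.19)."""
    w = np.exp(-0.5 * ((X - x) / hx) ** 2)        # Gaussian kernel weights
    if h0 <= 0.0:
        g = (Y <= y).astype(float)
    else:
        g = np.array([norm_cdf((y - yi) / h0) for yi in Y])
    return float(np.sum(w * g) / np.sum(w))

def cond_quantile(alpha, x, X, Y, hx, h0=0.0):
    # invert the estimated CDF on the sorted Y grid:
    # q_a(x) = inf{y : F(y|x) >= alpha}, using monotonicity of F(.|x) in y
    grid = np.sort(Y)
    F = np.array([cond_cdf(y, x, X, Y, hx, h0) for y in grid])
    idx = int(np.searchsorted(F, alpha))
    return float(grid[min(idx, len(grid) - 1)])
```

For example, with Y_i = X_i + ε_i the estimated conditional median at x should track x itself, for both the indicator and the smoothed version.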
6.7 Nonparametric Estimation of Hazard Functions

By combining the results on conditional CDF estimation discussed above and the conditional PDF estimator discussed in Chapter 5, nonparametric estimators of hazard functions can easily be obtained. We first provide a formal definition of a hazard function.

Definition 6.1. Let T denote a random variable that stands for the time of exit from a state (e.g., no longer unemployed). The probability that a person who has occupied a state up to time t leaves it in the short time interval of length dt after t is P(t ≤ T < t + dt | T ≥ t). The hazard function is defined as

h(t) = lim_{dt→0} P(t ≤ T < t + dt | T ≥ t)/dt,

which is the instantaneous rate of exit.

Roughly speaking, the hazard function can be interpreted so that h(t)dt is the probability of exit from a state in a short time interval of length dt after t, conditional on the state still being occupied at t.

Hazard functions are often used to describe the probability of exiting from unemployed status (i.e., getting a job) in an immediate future time interval (say, next week), conditional on the individual being unemployed at present (up to this week). They are also often used to describe the probability of death of a patient in the next time period (say, next week) conditional on the patient being alive up to this week.

It can be shown that the hazard function can be written as (see the hint in Exercise 6.6)

h(t) = f_T(t)/(1 − F_T(t)) = −d ln(1 − F_T(t))/dt.    (6.32)

Hence, f_T(t) = h(t) exp(−∫₀ᵗ h(v)dv). 1 − F_T(t) is called the "survival function." This is because F_T(t) = P(T ≤ t) is the probability of exit (say, death) before t; hence 1 − F_T(t) = P(T > t) is the probability of survival beyond period t.

One commonly used parametric PDF in hazard and duration analysis is the Weibull distribution, whose PDF is given by

f_T(t; a, b) = a b t^{b−1} exp(−a t^b),  t > 0, a, b > 0.

The hazard function is given by

h(t) = a b t^{b−1}.

Hence the hazard function is strictly increasing if b > 1 and strictly decreasing if b < 1.
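The relationship h(t) = f_T(t)/(1 − F_T(t)) in (6.32) can be checked numerically against the Weibull hazard a b t^{b−1}. A small sketch of ours, using a kernel density for f_T and the empirical CDF for F_T (an unconditional illustration, not the book's conditional estimator):

```python
import numpy as np

def hazard(t, T, h):
    """Plug-in hazard estimate h(t) = f_T(t)/(1 - F_T(t)), cf. (6.32),
    with a Gaussian kernel density for f_T and the empirical CDF for F_T."""
    f = np.mean(np.exp(-0.5 * ((t - T) / h) ** 2)) / (h * np.sqrt(2.0 * np.pi))
    survival = np.mean(T > t)               # estimate of 1 - F_T(t)
    return float(f / max(survival, 1e-12))

# simulate Weibull durations: if E ~ Exp(1) then (E/a)^(1/b) has
# density a*b*t^(b-1)*exp(-a*t^b) and hazard a*b*t^(b-1)
rng = np.random.default_rng(0)
a, b = 1.0, 2.0
T = (rng.exponential(1.0, 20000) / a) ** (1.0 / b)
est = hazard(0.8, T, h=0.05)   # the true hazard at t = 0.8 is 1.6
```

With b = 2 the hazard is increasing in t, matching the monotonicity property noted above.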
When b = 1, the Weibull distribution reduces to the exponential distribution with constant hazard. The Weibull distribution is often used to model unemployment duration data.

Often interest lies in a hazard rate conditional on the outcome of a set of covariates x. The conditional hazard function is defined as

h(t|x) = f_T(t|x)/(1 − F_T(t|x)),  t > 0.    (6.33)

For example, T may be the time of death of cancer patients and x contains treatment type and individual characteristics. We assume that x contains q continuous and r discrete variables. A nonparametric estimator of h(t|x) can be readily obtained by replacing f_T(t|x) and F_T(t|x) by their nonparametric estimators,

ĥ(t|x) = f̂_T(t|x)/(1 − F̂_T(t|x)),    (6.34)

where f̂_T(t|x) and F̂_T(t|x) are defined in Chapter 5 and Section 6.5, respectively. From F̂_T(t|x) − F_T(t|x) = O_p(Σ_{s=1}^q h_s² + Σ_{s=1}^r λ_s + (n h_1⋯h_q)^{-1/2}), it is easy to see that the asymptotic distribution of ĥ(t|x) is the same as that of f̂_T(t|x)/a_{t,x}, where a_{t,x} = 1 − F_T(t|x). Hence, by Theorem 5.3 we immediately know that

(n h₀ h_1⋯h_q)^{1/2} a_{t,x} [ĥ(t|x) − h(t|x) − Σ_{s=0}^q h_s² B_{1s}(t,x) − Σ_{s=1}^r λ_s B_{2s}(t,x)] → N(0, Ω_{t,x}),    (6.35)

where B_{1s}(t,x) and B_{2s}(t,x), along with the asymptotic variance Ω_{t,x}, follow from Theorem 5.3.

In the discussion above we assume that the data is not censored. In applications such as follow-up surveys for patients having received a particular treatment, a patient may fail to respond to a survey before dying, and the data will thereby be censored. We discuss how to handle censored data in nonparametric and semiparametric settings later in the book.

Also, when one estimates a hazard function in high dimensional settings, the curse of dimensionality may preclude accurate nonparametric estimation.
In such ease, one may choose to insted estimate a semiparametric hazard model as proposed by Horowitz (1999), who considered semiparametric proportional hazard models, Sea also Lin- ton, Nielsen and van de Geer (2008) for kernel-based approsches to the estiznation of multiplicative and adative hazard model, 6.8 Applications 6.8.1 Boston Housing Data We consider the 1970s era Boston housing data that has been exten- sively analyzed by a number of authors. For what follows, we report on the application outlined in Li and Racine (fortheoming), This dataset contaius n = 506 observations, and the response variable ¥ isthe me- dian price of a house in a given area, Following Chaudhuri et al. (1997, p. 724), we focus on three important covariates: RM = average num- ber of rooms per house in the area (rounded to the nearest integer), LSTAT = percentage of the population having lower economic status in the area, and DIS = weighted distance to five Boston employment cen- ters. An interesting feature is thatthe data js right eensored at $50,000 (1970s housing prices), which makes this particularly well sulted to quantile methods, ‘We ist shuffle the data and create two independent samples of size ‘m1 = 400 and ng = 106, We then fit linear parametzic quantile model ‘and nonparametric quantile model using the estimation sample of size ‘mand generate the predicted median of ¥ based upon the covariates in ‘the independent hold-out data of size n. Finally, we compute the mean square prediction error defined as MSPE = ng! 3°%2(¥; — dos)? where qo(X;) is the predicted median generated from either the para- metric or nonparametric model, and Y; is the actual value of the re- sponse in the hold-out dataset. To deflect potential criticism that our5, APPLICATIONS an. 30 ] 25 20h 1s JoMsPE) T 10 05 00 04 06 08 10 12 14 16 18 Relative MSPE Figure 6.1: Density of the relative MSPE over the 100 random spits of the data. 
Values > 1 indicate superior out-of-sample performance of the kernel method for a given random split of the data (lower quartile = 1.03, median = 1.13, upper quartile = 1.20).

result reflects an unrepresentative split of the data, we repeat this process 100 times, each time computing the relative MSPE (i.e., the parametric MSPE divided by the nonparametric MSPE). For each split, we use the method of Hall et al. (2004) to compute data-dependent bandwidths. The median relative MSPE over the 100 splits of the data is 1.13 (lower quartile = 1.03, upper quartile = 1.20), indicating that the nonparametric approach is producing superior out-of-sample quantile estimates. Figure 6.1 presents a density estimate summarizing these results for the 100 random splits of the data. Figure 6.1 reveals that 76% of all sample splits (i.e., the area to the right of the vertical bar) had a relative efficiency greater than 1 or, equivalently, that in 76 out of 100 splits the nonparametric quantile model yields better predictions of the median housing price than the parametric quantile model. Given the small sample size and the fact that there exist three covariates, we feel this is a telling application for the nonparametric method. Of course, we are not suggesting that we will outperform an approximately correct parametric model. Rather, we only wish to suggest that we can often outperform common parametric specifications that can be found in the literature.

6.8.2 Adolescent Growth Charts

Growth charts are used by pediatricians and parents alike for comparing children's growth to a standard range. Height, weight, and body mass index (BMI) measurements are used to document a child's height and weight based on age in months. Measurements are compared to the standard or normal range for children of the same sex and age. Growth charts can provide an early warning that the child has a medical problem.
For instance, too rapid growth may indicate the presence of hydrocephalus (an accumulation of liquid within the cavity of the cranium), a brain tumor, or other conditions that cause macrocephaly (the condition of having an unusually large head; it differs from hydrocephalus because there is no increase in intracranial pressure), while too slow growth may indicate malformations of the brain, early fusion of sutures, or other problems. Insufficient gain in weight, height, or a combination thereof may indicate failure-to-thrive, chronic illness, neglect, or other problems.

For the application that follows, data is taken from the Centers for Disease Control and Prevention's (CDC) National Health and Nutrition Examination Survey. A variety of methods have been used for constructing growth curves, and results are typically presented as quantiles. For instance, the U.S. CDC uses a two-stage approach. In the first stage, empirical percentiles are smoothed by a variety of parametric and nonparametric procedures. To obtain corresponding percentiles and z-scores, the smoothed percentiles are then approximated using a modified least median of squares estimation procedure in the second stage.

We consider the direct application of mixed data quantile methods using the estimator defined in (6.19), and by way of example construct weight-for-age quantiles. We report the 25th, 50th, and 75th quantiles and plot those for males in Figure 6.2. Bandwidths were obtained via least squares cross-validation, with ĥ_weight = 1.22, ĥ_age = 8.11, and λ̂_sex = 0.12.

Figure 6.2 readily reveals many features present in this dataset, including heteroskedasticity that increases with age and asymmetry in the conditional CDF of weight.
6.8.3 Conditional Value at Risk

Financial instruments are subject to many types of risk, including interest rate risk, default risk, liquidity risk, and market risk.

Figure 6.2: Weight-for-age quantiles for males.

Value at risk (VaR) seeks to measure the latter; it is a single estimate of the amount by which one's position in an instrument could decline due to general market movements over a specific holding period. One can think of VaR as the maximal loss of a financial position during a given time period for a given probability; hence VaR is simply a measure of loss arising from an unlikely event under normal market conditions (Tsay (2002, p. 257)). For what follows, we report on the application outlined in Li and Racine (forthcoming).

Letting ΔV(l) denote the change in value of a financial instrument from time t to t + l, we define VaR over a time horizon l with probability p as

p = P[ΔV(l) ≤ VaR] = F_l(VaR),

where F_l(·) is the unknown CDF of ΔV(l). VaR assumes negative values when p is small since the holder of the instrument suffers a loss.¹

¹For a long position we are concerned with the lower tail of returns, i.e., a loss due to a fall in value. For a short position we are concerned with the upper tail, i.e., a loss due to a rise in value. In the latter case we simply model 1 − F_l(VaR) rather than F_l(VaR). Of course, in either case we are modeling the tail of a distribution.

Therefore, the probability that the holder will suffer a loss greater than or equal to VaR is p, and the probability that the maximal loss is VaR is 1 − p. The quantity

VaR_p = inf{VaR | F_l(VaR) ≥ p}

is the pth quantile of F_l; hence VaR_p is simply the pth quantile of the CDF F_l(VaR). The convention is to drop the subscript p and refer to, e.g., "the 5% VaR" (VaR_{0.05}). Obviously the CDF F_l(VaR) is unknown in practice and must be estimated. Conditional value at risk (CVaR) seeks to measure
when covariates are involved; it simply conditions one's estimate on a vector of covariates, X. Of course, the conditional CDF F_l(VaR|X) is again unknown and must be estimated. A variety of parametric methods can be found in the literature. In what follows, we compare results from five popular parametric approaches with the nonparametric quantile estimator defined in (6.19).

We consider data used in Tsay (2002) for IBM stock (daily log returns (%) from July 3, 1962, through December 31, 1998) plotted in Figure 6.3. Log returns correspond approximately to percentage changes in the value of a financial position, and are used throughout. The CVaR is computed from the quantile of the distribution of the return r_{t+1} conditional on the covariates X_{it} (i = 1, 2,...,5), which include:

Figure 6.3: Time plot of daily log returns of IBM stock from July 3, 1962, through December 31, 1998.

(iv) X_{4t}: an annual trend defined as (year of time t − 1961)/38, used to detect any trend behavior of extreme returns of IBM stock.

(v) X_{5t}: a volatility series based on a Gaussian GARCH(1,1) model for the mean corrected series, equaling σ_t, where σ_t² is the conditional variance of the GARCH(1,1) model.

Table 6.2 presents results reported in Tsay (2002, pp. 282, 295) along with the estimator defined in (6.19) for a variety of existing approaches to measuring CVaR for December 31, 1998. The explanatory variables assumed the values X_{1t} = 1, X_{2t} = 0, X_{3t} = 0, X_{4t} = 0.9737, and X_{5t} = 1.7966. Values are based on an assumed long position of $10 million; hence, if value at risk is 2%, then we obtain VaR = $10,000,000 × 0.02 = $200,000.

Observe that, depending on one's choice of parametric model, one can obtain estimates that differ by as much as 82% for the 5% CVaR and 63% for the 1% CVaR for this example. Of course, this variation arises due to model uncertainty.
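The conversion from a return quantile to a dollar VaR figure is simple arithmetic once the quantile is estimated. A hedged sketch on synthetic returns (not the IBM series; the names and parameters are ours):

```python
import numpy as np

def value_at_risk(returns_pct, p, position):
    """Unconditional VaR for a long position: the p-th quantile of the
    (log) return distribution, in %, scaled to a dollar amount."""
    q = np.quantile(returns_pct, p)   # p-th quantile of the return CDF
    return float(position * q / 100.0)

rng = np.random.default_rng(1)
r = rng.normal(0.0, 1.5, 9000)        # synthetic daily log returns (%)
var5 = value_at_risk(r, 0.05, 10_000_000)
# for N(0, 1.5) returns the 5% quantile is roughly -2.47%,
# i.e. a 5% VaR near -$247,000 (a loss) on a $10 million position
```

Replacing np.quantile with a conditional quantile estimator such as (6.19) evaluated at the covariate values turns this into a CVaR calculation.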
One might take a Bayesian approach to deal with this model uncertainty, averaging over models, or alternatively consider nonparametric methods that are robust to functional specification. The kernel quantile approach provides guidance, yielding sensible estimates that might be of value to practitioners. Another use for the nonparametric

Table 6.2: Conditional value at risk for a long position in IBM stock

Model                                  5% CVaR     1% CVaR
Inhomogeneous Poisson, GARCH(1,1)      $303,756    $407,425
Conditional normal, IGARCH(1,1)        $302,500    $426,500
AR(2)-GARCH(1,1)                       $287,700    $409,738
Student-t5 AR(2)-GARCH(1,1)            $283,520    $475,943
Extreme value                          $166,641    $304,969
LSCV CDF                               $258,727    $417,102

quantile estimator might be to evaluate the CDF nonparametrically at the values produced by the parametric estimators above, to assess which quantile the parametric values in fact correspond to, presuming of course that the parametric models are misspecified.

Interestingly, Tsay (2002) finds that two of the explanatory variables are irrelevant. The LSCV bandwidths are consistent with this finding. However, one additional covariate is also found to be irrelevant by LSCV; hence only one volatility measure is relevant.

6.8.4 Real Income in Italy, 1951-1998

We again consider the Italian GDP panel discussed previously in Section 1.13.5, a panel containing data on income in Italy for 21 regions for the period 1951-1998 (millions of lire, 1990 = base year). We consider the estimator defined in (6.19) and model the 25th, 50th, and 75th income quantiles, treating time as an ordered discrete regressor. Figure 6.4 plots the resulting quantile estimates.

Recall that Figure 1.9 presented in Chapter 1 revealed that the distribution of income evolved from a unimodal one in the early 1950s to a markedly bimodal one in the 1990s. This feature is clearly captured by the nonparametric quantile estimator, as can be seen in Figure 6.4.

6.8.5 Multivariate Y Conditional CDF Example: GDP
Growth and Population Growth Conditional on OECD Status

In Section 5.5.5 of Chapter 5 we modeled a multivariate Y conditional PDF with data-driven methods of bandwidth selection; the sample size

Figure 6.4: Income quantiles for Italian real GDP.

was n = 616, and Y consisted of the following variables: y₁, the growth rate of income per capita during each period; y₂, the annual rate of population growth during each period; x, OECD status (0/1). We now estimate the conditional CDF F(y₁, y₂|x) and present the resulting conditional CDF in Figure 6.5; that is, we plot F̂(y₁, y₂|X = 0) and F̂(y₁, y₂|X = 1), again using Hall et al.'s (2004) least squares cross-validation method.

Figure 6.5 presents a clear picture of this multivariate Y conditional CDF. A stochastic dominance relationship is evident in that OECD countries tend to have lower population growth rates and higher GDP growth rates than their non-OECD counterparts.

Figure 6.5: Multivariate Y conditional CDF with Y consisting of GDP and population growth rates and X being OECD status.

6.9 Proofs

6.9.1 Proofs of Theorems 6.1, 6.2, and 6.4

Proof of Theorem 6.1 (i).

E[F̂(y|x) μ̂(x)] = E[1(Y_i ≤ y) W_h(X_i, x)]
= E[W_h(X_i, x) E(1(Y_i ≤ y)|X_i)]
= E[W_h(X_i, x) F(y|X_i)]
= ∫ W_h(z, x) F(y|z) μ(z) dz
= ∫ W(v) F(y|x + hv) μ(x + hv) dv
= ∫ [μ(x) + Σ_s h_s v_s μ_s(x) + (1/2) Σ_s h_s² v_s² μ_ss(x)] × [F(y|x) + Σ_s h_s v_s F_s(y|x) + (1/2) Σ_s h_s² v_s² F_ss(y|x)] W(v) dv + o(|h|²)
= μ(x) F(y|x) + (κ₂/2) Σ_s h_s² [μ(x) F_ss(y|x) + 2 F_s(y|x) μ_s(x) + F(y|x) μ_ss(x)] + o(|h|²).    (6.36)

Similarly,

E[μ̂(x)] = E[W_h(X_i, x)] = μ(x) + (κ₂/2) Σ_s h_s² μ_ss(x) + o(|h|²).    (6.37)

By noting that E[M̂(y,x)] = E[F̂(y|x) μ̂(x)] − F(y|x) E[μ̂(x)], Theorem 6.1 (i) follows from (6.36) and (6.37). □

Proof of Theorem 6.1 (ii).

var(M̂(y,x)) = n^{-1} var[(1(Y_i ≤ y) − F(y|x)) W_h(X_i, x)]
= n^{-1} E{[1(Y_i ≤ y) − F(y|x)]² W_h(X_i, x)²}
+ O(n^{-1})
= n^{-1} E{ E([1(Y_i ≤ y) − F(y|x)]² | X_i) W_h(X_i, x)² } + O(n^{-1})
= n^{-1} E{[F(y|X_i) − 2F(y|x)F(y|X_i) + F(y|x)²] W_h(X_i, x)²} + O(n^{-1})
= n^{-1} ∫ μ(z)[F(y|z) − 2F(y|x)F(y|z) + F(y|x)²] W_h(z, x)² dz + O(n^{-1})
= (n h_1⋯h_q)^{-1} ∫ μ(x − hv)[F(y|x − hv) − 2F(y|x)F(y|x − hv) + F(y|x)²] W(v)² dv + O(n^{-1})
= (n h_1⋯h_q)^{-1} κ^q μ(x)[F(y|x) − F(y|x)²] + O((n h_1⋯h_q)^{-1}|h| + n^{-1}),

where κ = ∫ w(v)² dv. □

Proof of Theorem 6.1 (iii). Define B_n(y,x) = Σ_{s=1}^q h_s² B_s(y,x). Then

F̂(y|x) − F(y|x) − B_n(y,x) = [F̂(y|x) − F(y|x) − B_n(y,x)] μ̂(x)/μ(x) + o_p((n h_1⋯h_q)^{-1/2}) = A_n(y,x)/μ(x) + o_p((n h_1⋯h_q)^{-1/2}),

where A_n(y,x) = [F̂(y|x) − F(y|x) − B_n(y,x)] μ̂(x). By Theorem 6.1 (i) and (ii) we know that E[A_n(y,x)] = o(|h|²) and that var[A_n(y,x)] = (n h_1⋯h_q)^{-1} μ(x)² Σ_{y|x} + o((n h_1⋯h_q)^{-1}). By virtue of a CLT argument one can show that (n h_1⋯h_q)^{1/2} A_n(y,x) → N(0, μ(x)² Σ_{y|x}) in distribution. Hence,

(n h_1⋯h_q)^{1/2} [F̂(y|x) − F(y|x) − B_n(y,x)] = (n h_1⋯h_q)^{1/2} A_n(y,x)/μ(x) + o_p(1) → μ(x)^{-1} N(0, μ(x)² Σ_{y|x}) = N(0, Σ_{y|x}). □

Proof of Theorem 6.2 (i). Using Lemma 6.1 we have

E[M̃(y,x)] = E{[G((y − Y_i)/h₀) − F(y|x)] W_h(X_i, x)}
= E{[F(y|X_i) + (κ₂/2) h₀² F_yy(y|X_i) − F(y|x) + o(h₀²)] W_h(X_i, x)}
= ∫ [F(y|x + hv) − F(y|x)] μ(x + hv) W(v) dv + (κ₂/2) h₀² F_yy(y|x) μ(x) + o(h₀² + |h|²)
= (κ₂/2) h₀² F_yy(y|x) μ(x) + (κ₂/2) Σ_s h_s² [μ(x) F_ss(y|x) + 2 F_s(y|x) μ_s(x)] + o(h₀² + |h|²).    (6.38)

Theorem 6.2 (i) follows from (6.37) and (6.38). □

Proof of Theorem 6.2 (ii). Let H_q = h_1⋯h_q. By using Lemma 6.1 we have

var[M̃(y,x)] = n^{-1} var{[G((y − Y_i)/h₀) − F(y|x)] W_h(X_i, x)}
= n^{-1} E{[E(G²((y − Y_i)/h₀)|X_i) − 2F(y|x) E(G((y − Y_i)/h₀)|X_i) + F(y|x)²] W_h(X_i, x)²} + O(n^{-1})
= n^{-1} E{[F(y|X_i) − h₀ C_k F_y(y|X_i) − 2F(y|x)F(y|X_i) + F(y|x)² + O(h₀² + |h|²)] W_h(X_i, x)²} + O(n^{-1})
= n^{-1} ∫ μ(z)[F(y|z) − h₀ C_k F_y(y|z) − 2F(y|x)F(y|z) + F(y|x)²] W_h(z, x)² dz + O(n^{-1})
= (n H_q)^{-1} ∫ μ(x + hv)[F(y|x + hv) − h₀ C_k F_y(y|x + hv) − 2F(y|x)F(y|x + hv) + F(y|x)²] W(v)² dv + O(n^{-1})
= (n H_q)^{-1} κ^q μ(x)[F(y|x) − F(y|x)² − h₀ C_k F_y(y|x)] + O(|h|²(n H_q)^{-1} + n^{-1})
≡ (n H_q)^{-1} μ(x)² [Σ_{y|x} − h₀ ν(y,x)] + O(|h|²(n H_q)^{-1} + n^{-1}). □

Proof of Theorem 6.2 (iii). Part (iii) follows from (i) and (ii) and the Liapunov CLT. □

Lemma 6.1.
(i) E[G((y − Y_i)/h₀)|X_i] = F(y|X_i) + (κ₂/2) h₀² F_yy(y|X_i) + o(h₀²).
(ii) E[G²((y − Y_i)/h₀)|X_i] = F(y|X_i) − h₀ C_k F_y(y|X_i) + o(h₀), where C_k = 2∫ G(v) w(v) v dv.

Proof.
This is left as an exercise. □

Lemma 6.2. F̃(y + ε_n|x) − F̃(y|x) = f(y|x) ε_n + O_p(ε_n² + h₀²) + o_p(ε_n + (n h_1⋯h_q)^{-1/2}).

Proof. Let A_n(ε_n) = [F̃(y + ε_n|x) − F̃(y|x)] μ̂(x)/μ(x). Then F̃(y + ε_n|x) − F̃(y|x) = A_n(ε_n)[1 + o_p(1)]. By Lemma 6.1 we have

E[A_n(ε_n)] = μ(x)^{-1} E{ E[G((y + ε_n − Y_i)/h₀) − G((y − Y_i)/h₀) | X_i] W_h(X_i, x) }
= μ(x)^{-1} E{[F(y + ε_n|X_i) − F(y|X_i) + O(h₀²)] W_h(X_i, x)}
= μ(x)^{-1} E{[f(y|X_i) ε_n + O(ε_n²) + O(h₀²)] W_h(X_i, x)}
= f(y|x) ε_n + O(ε_n² + h₀²).

Similarly,

var[A_n(ε_n)] ≤ n^{-1} μ(x)^{-2} E{[G((y + ε_n − Y_i)/h₀) − G((y − Y_i)/h₀)]² W_h(X_i, x)²} = O(ε_n (n h_1⋯h_q)^{-1}).

Therefore,

F̃(y + ε_n|x) − F̃(y|x) = A_n(ε_n)[1 + o_p(1)] = f(y|x) ε_n + O_p(ε_n² + h₀²) + o_p(ε_n + (n h_1⋯h_q)^{-1/2}). □

Proof of Theorem 6.4. We know that F̃(y|x) → F(y|x) in probability by Theorem 6.2. It follows by Theorem 1 of Tucker (1967) that

sup_y |F̃(y|x) − F(y|x)| → 0 in probability    (6.39)

because F̃(y|x) is a conditional CDF. Since q_α(x) is unique, this implies that δ ≡ δ(ε) = min{α − F(q_α(x) − ε|x), F(q_α(x) + ε|x) − α} > 0. It is easy to see that

P[|q̃_α(x) − q_α(x)| > ε]
≤ P[sup_y |F̃(y|x) − F(y|x)| ≥ δ] → 0, which proves consistency. Next, observe that

P[q̃_α(x) − q_α(x) ≤ ε_n] = P[F̃(q_α(x) + ε_n|x) ≥ α] = P[F̃(q_α(x)|x) ≥ α − f(q_α(x)|x) ε_n + o_p(ε_n)]    (6.41)

by Lemma 6.2 and the assumption that h₀ = o((n h_1⋯h_q)^{-1/2}). Therefore, with ε_n = (n h_1⋯h_q)^{-1/2} v + B_{q,α}(x),

Q_n(v) ≡ P[(n h_1⋯h_q)^{1/2} (q̃_α(x) − q_α(x) − B_{q,α}(x)) ≤ v]
= P[(n h_1⋯h_q)^{1/2} σ^{-1}(q_α(x)|x) {F̃(q_α(x)|x) − α − B_n(q_α(x)|x)} ≥ −v] + o(1) → Φ(v)

by Theorem 6.2, where Φ(·) is the standard normal distribution function. □

6.9.2 Proofs of Theorems 6.5 and 6.6 (Mixed Covariates Case)

Theorem 6.5 is proved in Lemmas 6.3 and 6.4 below, while Theorem 6.6 is proved in Lemmas 6.5 and 6.6.

Lemma 6.3.
(i) E[M̂(y,x)] = Σ_{s=1}^q h_s² B_{1s}(y,x) + Σ_{s=1}^r λ_s B_{2s}(y,x) + o(|h|² + |λ|).
(ii) var(M̂(y,x)) = (n h_1⋯h_q)^{-1} κ^q μ(x) F(y|x)[1 − F(y|x)] + o((n h_1⋯h_q)^{-1}),
where B_{1s}(y,x) and B_{2s}(y,x) are defined in Theorem 6.5.

Proof. See Li and Racine (forthcoming). □

Lemma 6.4.

(n h_1⋯h_q)^{1/2} [F̂(y|x) − F(y|x) − Σ_{s=1}^q h_s² B_{1s}(y,x)/μ(x) − Σ_{s=1}^r λ_s B_{2s}(y,x)/μ(x)] → N(0, V),

where V = κ^q F(y|x)[1 − F(y|x)]/μ(x).

Proof. By noting that F̂(y|x) − F(y|x) = M̂(y,x)/μ̂(x) = [M̂(y,x)/μ(x)][1 + o_p(1)], Lemma 6.4 follows from Lemma 6.3 and the Liapunov CLT. □

Lemma 6.5.
(i) E[M̃(y,x)] = μ(x) h₀² B_{10}(y|x) + Σ_{s=1}^q h_s² B_{1s}(y,x) + Σ_{s=1}^r λ_s B_{2s}(y,x) + o(|h|² + |λ|).
(ii) var(M̃(y,x)) = (n h_1⋯h_q)^{-1} κ^q μ(x)[F(y|x) − F(y|x)² − h₀ C_k F_y(y|x)] + o((n h_1⋯h_q)^{-1}).

Proof. This is left as an exercise. □

Lemma 6.6.

(n h_1⋯h_q)^{1/2} [F̃(y|x) − F(y|x) − h₀² B_{10}(y|x) − Σ_{s=1}^q h_s² B_{1s}(y,x)/μ(x) − Σ_{s=1}^r λ_s B_{2s}(y,x)/μ(x)] → N(0, V),

where V = κ^q [F(y|x) − F(y|x)²]/μ(x).

Proof. By noting that F̃(y|x) − F(y|x) = M̃(y,x)/μ̂(x) = [M̃(y,x)/μ(x)][1 + o_p(1)], Lemma 6.6 follows from Lemma 6.5 and the Liapunov CLT. □

6.10 Exercises

Exercise 6.1. Prove Lemma 6.1.
Hint: Use the change of variable and integration by parts arguments of Section 1.4. Also note that 2∫ G(v) w(v) dv = ∫ dG²(v) = 1.

Exercise 6.2. Prove Lemma 6.5.

Exercise 6.3. Prove Theorem 6.3.
Hint: Mimic the proof of Theorem 6.4, but without introducing h₀.

Exercise 6.4. Show that, for the general q and r case, the optimal smoothing parameters that minimize IMSE[F̃(y|x)] require that h_s ~ n^{-1/(4+q)} for s = 1,...,q, λ_s ~ n^{-2/(4+q)} for s = 1,...,r, and h₀ ~ n^{-2/(4+q)}.
Hint: Use arguments similar to those used in the two paragraphs immediately following (6.6) and (6.7).

Exercise 6.5. Derive (6.13).
Hint: Write the Lagrange function as

L = Σ_{i=1}^n ln p_i(x) − η [Σ_{i=1}^n p_i(x) − 1] − n γ' Σ_{i=1}^n p_i(x)(X_i − x) K_h(X_i, x).

The first order conditions are

∂L/∂η = Σ_i p_i(x) − 1 = 0,    (6.42)
∂L/∂γ = Σ_i p_i(x)(X_i − x) K_h(X_i, x) = 0,    (6.43)
∂L/∂p_i(x) = 1/p_i(x) − η − n γ'(X_i − x) K_h(X_i, x) = 0.    (6.44)

Equation (6.44) leads to 1 = p_i(x)[η + n γ'(X_i − x) K_h(X_i, x)]. Summing this over i and using (6.42) and (6.43) gives

n = η Σ_i p_i(x) + n γ' Σ_i p_i(x)(X_i − x) K_h(X_i, x) = η.

Substituting η = n into (6.44), we get

p_i(x) = 1/{n[1 + γ'(X_i − x) K_h(X_i, x)]}.

This proves (6.13). Hence, one can choose γ to maximize the log-likelihood function given by

L(γ) = Σ_i ln p_i(x) = −Σ_i ln{n[1 + γ'(X_i − x) K_h(X_i, x)]}.

Exercise 6.6. (i) Derive (6.32).
Hint:
) = sn(Q/PT 29 = fot — Fr) (i) Show that fix(x) = A(a)exo(— [i Moldy) int: fro) bp T= Fy” = Inf — Fr(o)lho=5, ==Inlt = Fried, ” n(uydo = ‘ich implies that 1= F(t) = BHO, Substituting this into (i), we obtain (i).Part II Semiparametric MethodsChapter 7 Semiparametric Partially Linear Models In this chapter we discuss a relatively simple and popular semipara- ‘metric model, the semiparametric partially linear regression model; see Hidle, Liang and Gao (2000) for a thorough treatment of partially lin- ear models. Roughly speaking, a semiparametric model is one for which some components are parametric, while the remaining components are Teft unspecified. Those models therefore have a finite dimensional pa- rameter of interest, yet also contain some unknown functions which can be viewed as functions characterized by infinite dimensional parame- ters. ‘The partially linear model is one of the simplest semiparametric ‘models used in practice. We shall use this model to introduce semi- parametric models because its estimation is straightforward, involv- ‘ng nothing more than basic kernel estimation of regression functions along with least squares regression, This model also serves to illustrate ‘a number of somewhat subtle issues which arise in the estimation of semiparametric models in general. For example, the finite dimensional parameter (the parametric part of the model) can usually be estimated with a parametric y/-rate, though we usually need stronger regularity conditions along with stricter conditions on the smoothing parameters im order to achieve the y‘e-rate for the parametric part of the modelmm 7. SEMIPARAMETRIC PARTIALLY LINEAR MODELS 7.1 Partially Linear Models [A semiparameteic partially linear model is given by Y= X19 +9(Zi) ty Fs Booms (ra) where X; is a.p x 1 vector, 6 is a p 1 voctor of unknown parameters, hod Z; € RE. The functional fori of g(?) is not specified. 
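To see why the $g(Z_i)$ term cannot simply be ignored, note that ordinary least squares of $Y$ on $X$ alone is inconsistent for $\beta$ whenever $X$ is correlated with $Z$, since $g(Z)$ is then absorbed into an error that is correlated with the regressor. A small simulated illustration (the design below is hypothetical, chosen only for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
Z = rng.uniform(-1, 1, n)
X = Z + rng.normal(0, 1, n)          # X is correlated with Z
beta0 = 1.0
g = Z**3 + Z                          # the unknown g(Z); E[g(Z)] = 0 here
Y = X * beta0 + g + rng.normal(0, 0.5, n)

# Naive OLS of Y on X pushes g(Z) into the error term, which is
# correlated with X, so the slope estimate is biased:
beta_ols = (X @ Y) / (X @ X)
# plim beta_ols = beta0 + E[X g(Z)] / E[X^2] = 1 + (8/15)/(4/3) = 1.4
```

Under this design the probability limit is $1 + E[Xg(Z)]/E[X^2] = 1.4$ rather than $1$; the estimators developed in Section 7.2 remove this bias by first partialling $Z$ out of both $Y$ and $X$.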
The finite dimensional parameter $\beta$ constitutes the parametric part of the model and the unknown function $g(\cdot)$ the nonparametric part. The data are assumed to be i.i.d. with $E(u_i|X_i, Z_i) = 0$, and we allow for a conditionally heteroskedastic error process $E(u_i^2|x, z) = \sigma^2(x, z)$ of unknown form. We focus our discussion on how to obtain a $\sqrt{n}$-consistent estimator of $\beta$, as once this is done, an estimator of $g(\cdot)$ can be easily obtained.

7.1.1 Identification of $\beta$

Some identification conditions will be required in order to identify the parameter vector $\beta$. Observe that $X$ cannot contain a constant (i.e., $\beta$ cannot contain an intercept) because, if there were an intercept, say $\alpha$, it could not be identified separately from the unknown function $g(\cdot)$. That is, for any constant $c \neq 0$, observe that $\alpha + g(z) = [\alpha + c] + [g(z) - c] \equiv \alpha_{new} + g_{new}(z)$; thus the sum of the new intercept and the new $g(\cdot)$ function is observationally equivalent to the sum of the old ones in (7.1). Since the functional form of $g(\cdot)$ is not specified, this immediately tells us that an intercept term cannot be identified in a partially linear model. After we derive the asymptotic distribution of our semiparametric estimator of $\beta$ in Section 7.2, it will be apparent that the identification condition for $\beta$ becomes the requirement that $\Phi \equiv E\{[X - E(X|Z)][X - E(X|Z)]'\}$ be a positive definite matrix, implying that $X$ cannot contain a constant and that none of the components of $X$ can be a deterministic function of $Z$ for, otherwise, $X - E(X|Z) = 0$ for that component and $\Phi$ will be singular.

7.2 Robinson's Estimator

We first present an infeasible estimator of (7.1) to illustrate the mechanics involved in the estimation of $\beta$. Taking the expectation of (7.1) conditional on $Z_i$, we get
$E(Y_i|Z_i) = E(X_i|Z_i)'\beta + g(Z_i).$ (7.2)
Subtracting (7.2) from (7.1) yields
$Y_i - E(Y_i|Z_i) = (X_i - E(X_i|Z_i))'\beta + u_i.$ (7.3)
Defining the shorthand notation $\tilde Y_i = Y_i - E(Y_i|Z_i)$ and $\tilde X_i = X_i - E(X_i|Z_i)$, and applying the least squares method to (7.3), we obtain an estimator of $\beta$ given by
$\tilde\beta_{inf} = \left[\sum_{i=1}^n \tilde X_i \tilde X_i'\right]^{-1} \sum_{i=1}^n \tilde X_i \tilde Y_i.$ (7.4)
By the Lindeberg-Levy CLT we immediately have (see Exercise 7.1)
$\sqrt{n}(\tilde\beta_{inf} - \beta) \overset{d}{\to} N(0, \Phi^{-1}\Psi\Phi^{-1}),$ (7.5)
provided that $\Phi$ is positive definite, where $\Psi = E[\sigma^2(X_i, Z_i)\tilde X_i\tilde X_i']$ and $\Phi = E[\tilde X_i\tilde X_i']$.

The basic idea underlying this procedure is to first eliminate the unknown function $g(\cdot)$ by subtracting (7.2) from (7.1). Although the unknown function $g(\cdot)$ washes out in (7.3), two new unknown functions are introduced, namely $E(Y_i|Z_i)$ and $E(X_i|Z_i)$. Therefore, the above estimator $\tilde\beta_{inf}$ is not feasible because $E(Y_i|Z_i)$ and $E(X_i|Z_i)$ are unknown. However, we know that these conditional expectations can be consistently estimated using kernel methods, so we can replace the unknown conditional expectations that appear in $\tilde\beta_{inf}$ with their kernel estimators, thereby obtaining a feasible estimator of $\beta$. That is, we replace $\tilde Y_i = Y_i - E(Y_i|Z_i)$ and $\tilde X_i = X_i - E(X_i|Z_i)$ by $Y_i - \hat Y_i$ and $X_i - \hat X_i$, respectively, where
$\hat Y_i = \hat E(Y_i|Z_i) = \left[n^{-1}\sum_{j=1}^n Y_j K_h(Z_i, Z_j)\right]\big/\hat f(Z_i),$ (7.6)
$\hat X_i = \hat E(X_i|Z_i) = \left[n^{-1}\sum_{j=1}^n X_j K_h(Z_i, Z_j)\right]\big/\hat f(Z_i),$ (7.7)
and
$\hat f(Z_i) = n^{-1}\sum_{j=1}^n K_h(Z_i, Z_j),$ (7.8)
where $K_h(Z_i, Z_j) = \prod_{s=1}^q h_s^{-1}k\!\left((Z_{is} - Z_{js})/h_s\right)$.

The presence of the random denominator $\hat f(Z_i)$ can cause some technical difficulties when deriving the asymptotic distribution of the feasible estimator of $\beta$. We will consider two different approaches to handling the presence of the random denominator, one that uses a function that "trims out" observations for which the denominator is small, and another that uses a density-weighted method for getting rid of the random denominator altogether. We begin with a discussion of the trimming method.
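The two-step logic behind (7.4) and its feasible version, i.e. kernel-estimate $E(Y|Z)$ and $E(X|Z)$ as in (7.6)-(7.7), subtract, and run least squares on the residuals, can be sketched as follows (scalar $X$, Gaussian kernel; the data-generating design and rule-of-thumb bandwidth are illustrative assumptions, and trimming is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
Z = rng.uniform(-1, 1, n)
X = np.sin(np.pi * Z) + rng.normal(0, 1, n)   # E(X|Z) = sin(pi Z)
beta0 = 2.0
Y = X * beta0 + np.cos(np.pi * Z) + rng.normal(0, 0.5, n)   # g(Z) = cos(pi Z)

def nw(y, z, h):
    """Leave-one-out Nadaraya-Watson estimate of E(y|Z) at each sample point."""
    u = (z[:, None] - z[None, :]) / h
    K = np.exp(-0.5 * u**2)            # Gaussian kernel
    np.fill_diagonal(K, 0.0)           # leave observation i out of its own fit
    return K @ y / K.sum(axis=1)

h = 1.06 * Z.std() * n ** (-1 / 5)     # rule-of-thumb bandwidth (illustrative)
Y_t = Y - nw(Y, Z, h)                  # Y_i minus its kernel fit on Z
X_t = X - nw(X, Z, h)                  # X_i minus its kernel fit on Z
beta_hat = (X_t @ Y_t) / (X_t @ X_t)   # least squares on the residuals
```

With this design, `beta_hat` lands close to the true value 2 even though $g(\cdot)$ is never specified, which is the essence of the two-step estimator.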
Define a feasible estimator of $\beta$ given by
$\hat\beta = \left[\sum_{i=1}^n (X_i - \hat X_i)(X_i - \hat X_i)'\,1_i\right]^{-1}\sum_{i=1}^n (X_i - \hat X_i)(Y_i - \hat Y_i)\,1_i,$ (7.9)
where $1_i = \mathbf{1}(\hat f(Z_i) > b)$, which equals 1 if $\hat f(Z_i) > b$, 0 otherwise, and where the trimming parameter $b = b_n > 0$ satisfies $b_n \to 0$ as $n \to \infty$.

To derive the asymptotic distribution of $\hat\beta$, we first provide a definition and state some assumptions. We shall use $\mathcal{G}_\alpha^\nu$, where $\alpha > 0$ and $\nu \ge 2$ is an integer, to denote the class of smooth functions such that if $g \in \mathcal{G}_\alpha^\nu$, then $g$ is $\nu$ times differentiable; $g$ and its partial derivative functions (up to order $\nu$) all satisfy some Lipschitz-type conditions such as $|g(z') - g(z)| \le H_g(z)\|z' - z\|$, where $H_g(z)$ is a continuous function having finite $\alpha$th moment, and where $\|\cdot\|$ denotes the Euclidean norm, i.e., $\|z\| = [z'z]^{1/2}$.

Condition 7.1.
(i) $(Y_i, X_i, Z_i)$, $i = 1,2,\dots,n$ are i.i.d. observations, $Z_i$ admits a PDF $f \in \mathcal{G}_\infty^\nu$ (i.e., $f$ is bounded), $g(\cdot) \in \mathcal{G}_4^\nu$, and $E(X|z) \in \mathcal{G}_4^\nu$, where $\nu \ge 2$ is an integer.
(ii) $E(u|X, Z) = 0$, $E(u^2|x, z) = \sigma^2(x, z)$ is continuous in $x$ and $z$, and both $X$ and $u$ have finite fourth moments.
(iii) $K(\cdot)$ is a product kernel, the univariate kernel $k(\cdot)$ is a bounded $\nu$th order kernel, and $k(v) = O\!\left(1/(1 + |v|^{\nu+1})\right)$.
(iv) As $n \to \infty$, $n(h_1\cdots h_q)^2 b^4 \to \infty$ and $nb^{-4}\sum_{s=1}^q h_s^{2\nu} \to 0$.

Condition 7.1 (i) states a set of smoothness and moment conditions. The unknown functions $g(z)$ and $E(X|z)$ are assumed to be $\nu$th order differentiable. These, together with the $\nu$th order kernel of 7.1 (iii), ensure that the bias of the kernel estimator is of order $O(\sum_{s=1}^q h_s^\nu)$.

Condition 7.1 (iv) is used in Robinson (1988). One can ignore the trimming parameter $b$ in empirical applications as one can let $b \to 0$ at an extremely slow rate. Thus, Condition 7.1 (iv) is basically equivalent to
$\sqrt{n}\left[\sum_{s=1}^q h_s^\nu + (nh_1\cdots h_q)^{-1}\right] \to 0 \quad\text{as } n \to \infty.$
This condition is easy to understand: $O(\sum_{s=1}^q h_s^{2\nu} + (nh_1\cdots h_q)^{-1})$ is the order of the nonparametric MSE. The difference between the feasible estimator $\hat\beta$ and the infeasible estimator $\tilde\beta_{inf}$ is proportional to an average of squared nonparametric estimation errors.
Therefore, for $\hat\beta$ to be a $\sqrt{n}$-consistent estimator of $\beta$, one needs the squared estimation error terms to be smaller than $n^{-1/2}$, i.e., for $\sum_{s=1}^q h_s^{2\nu} + (nh_1\cdots h_q)^{-1}$ to be of order $o(n^{-1/2})$, resulting in Condition 7.1 (iv).

Theorem 7.1. Under Condition 7.1, we have
$\sqrt{n}(\hat\beta - \beta) \overset{d}{\to} N(0, \Phi^{-1}\Psi\Phi^{-1}),$ (7.10)
provided that $\Phi$ is positive definite, where $\Phi = E[\tilde X_i\tilde X_i']$, $\Psi = E[\sigma^2(X_i, Z_i)\tilde X_i\tilde X_i']$ and $\tilde X_i = X_i - E(X_i|Z_i)$.

The proof of Theorem 7.1 can be found in Robinson (1988). A consistent estimator of the asymptotic variance of $\hat\beta$ is given in Exercise 7.1.

Comparing Theorem 7.1 with the distributional result given in (7.5), we see that the feasible estimator $\hat\beta$ has exactly the same asymptotic distribution as that of the infeasible estimator $\tilde\beta_{inf}$. The intuition underlying this result is quite simple. If we ignore the trimming parameter $b$, then in essence $\hat\beta - \tilde\beta_{inf} = O_p(\sum_{s=1}^q h_s^{2\nu} + (nh_1\cdots h_q)^{-1})$, the MSE rate of the nonparametric estimator, which is $o_p(n^{-1/2})$ by Condition 7.1 (iv). Hence, $\sqrt{n}(\hat\beta - \tilde\beta_{inf}) = o_p(1)$, which implies that the two estimators have the same asymptotic distribution.

Suppose one uses a second order kernel, i.e., assume $\nu = 2$, and let $h_s \sim h$ for all $s$ (meaning that $(h_s - h)/h = o(1)$ or, equivalently, $h_s = h + o(h)$). Condition 7.1 (iv) then becomes $n^{1/2}[\sum_{s=1}^q h_s^4 + (nh_1\cdots h_q)^{-1}] \sim n^{1/2}[h^4 + (nh^q)^{-1}] = o(1)$, which requires that $q < 4$ (or that $q \le 3$ since $q$ is a positive integer, where $q$ is the dimension of $Z$). This is the condition invoked by Robinson (1988). Thus, Condition 7.1 (iv) requires one to use a higher order kernel if $q \ge 4$. However, Li (1996) showed that Condition 7.1 (iv) can be replaced by the weaker condition given below.

Condition 7.2. As $n \to \infty$, $nb^4(h_1\cdots h_q)^2\big(\sum_{s=1}^q h_s^2\big)^{-2} \to \infty$, $nb^2(h_1\cdots h_q) \to \infty$, and $nb^{-4}\sum_{s=1}^q h_s^{4\nu} \to 0$.

Li (1996) shows that
$\hat\beta - \tilde\beta_{inf} = O_p\!\left(\sum_{s=1}^q h_s^{2\nu} + (nh_1\cdots h_q)^{-1}\sum_{s=1}^q h_s^2 + (nh_1\cdots h_q)^{-1/2}n^{-1/2}\right).$
If this term is of order $o_p(n^{-1/2})$, we obtain Condition 7.2. A detailed explanation as to why the estimation error has this order rather than the more familiar order $O_p(\sum_{s=1}^q h_s^{2\nu} + (nh_1\cdots h_q)^{-1})$ can be found in Li (1996) (see also Exercise 7.5 for further explanation). Considering the case for which all the $h_s$'s have the same order ($h_s \sim h$ for all $s$), Condition 7.2 becomes $nh^{\max\{2q-4,\,q\}} \to \infty$ and $nh^{4\nu} \to 0$. If one uses a second order kernel (i.e., $\nu = 2$), this condition requires that $4\nu = 8 > \max(2q - 4, q)$, leading to the requirement that $q < 6$, or that $q \le 5$ since $q$ is a positive integer. Therefore, a nonnegative second order kernel can lead to $\sqrt{n}$-consistent estimation of $\beta$ as long as $q \le 5$.

The following corollary is proved in Li (1996).

Corollary 7.1. Under the same conditions given in Theorem 7.1, except that Condition 7.1 (iv) is replaced by Condition 7.2, Theorem 7.1 still holds true.

One undesirable feature of the semiparametric estimator described above is the use of a trimming function, which requires the researcher to choose a nuisance parameter, the trimming parameter $b$. However, one can instead use the density-weighted approach to avoid a random denominator in the kernel estimator, as follows. Multiplying (7.3) by $f_i = f(Z_i)$, we get
$(Y_i - E(Y_i|Z_i))f_i = (X_i - E(X_i|Z_i))'f_i\beta + f_iu_i.$ (7.11)
One can estimate $\beta$ using the least squares method by regressing $(Y_i - E(Y_i|Z_i))f_i$ on $(X_i - E(X_i|Z_i))f_i$. Letting $\tilde\beta_W$ denote the resulting infeasible estimator of $\beta$, then by the Lindeberg-Levy CLT (see Exercise 7.1) we know that $\tilde\beta_W$ is $\sqrt{n}$-consistent and asymptotically normal,
$\sqrt{n}(\tilde\beta_W - \beta) \overset{d}{\to} N(0, \Phi_f^{-1}\Psi_f\Phi_f^{-1}),$ (7.12)
where $\Phi_f = E[\tilde X_i\tilde X_i'f_i^2]$ and $\Psi_f = E[\sigma^2(X_i, Z_i)\tilde X_i\tilde X_i'f_i^4]$.

A feasible estimator of $\beta$ is obtained by replacing $(Y_i - E(Y_i|Z_i))f_i$ and $(X_i - E(X_i|Z_i))f_i$ with $(Y_i - \hat Y_i)\hat f_i$ and $(X_i - \hat X_i)\hat f_i$, where $\hat Y_i$, $\hat X_i$ and $\hat f_i$ are the kernel estimators of $E(Y_i|Z_i)$, $E(X_i|Z_i)$ and $f(Z_i)$ defined in (7.6) to (7.8). Because $\hat Y_i\hat f_i = n^{-1}\sum_j Y_jK_h(Z_i, Z_j)$ (and similarly $\hat X_i\hat f_i$) does not have a random denominator ($\hat f_i$), after removing the trimming parameter $b$, Condition 7.1 (iv) can be replaced by the following:

Condition 7.3. As $n \to \infty$, $n(h_1\cdots h_q)^2 \to \infty$, and $n\sum_{s=1}^q h_s^{4\nu} \to 0$,

while Condition 7.2 can be replaced by

Condition 7.4. As $n \to \infty$, $n\min\big\{(h_1\cdots h_q)^2\big(\sum_{s=1}^q h_s^2\big)^{-2},\; h_1\cdots h_q\big\} \to \infty$, and $n\sum_{s=1}^q h_s^{4\nu} \to 0$.
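The density-weighted quantities are plain kernel sums with no random denominator, which is the entire point of the construction: $(Y_i - \hat Y_i)\hat f_i = Y_i\hat f_i - n^{-1}\sum_j Y_jK_h(Z_i,Z_j)$ never divides by $\hat f$. A minimal sketch with a scalar $X$ (the simulated design and bandwidth rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
Z = rng.uniform(-1, 1, n)
X = Z**2 + rng.normal(0, 1, n)
beta0 = -1.5
Y = X * beta0 + np.sin(np.pi * Z) + rng.normal(0, 0.5, n)

h = 1.06 * Z.std() * n ** (-1 / 5)
u = (Z[:, None] - Z[None, :]) / h
K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian density kernel
np.fill_diagonal(K, 0.0)                       # leave-one-out sums

fhat = K.sum(axis=1) / ((n - 1) * h)           # kernel density estimate at Z_i
# (Y_i - Yhat_i) * fhat_i reduces to a kernel sum with no denominator:
Yw = Y * fhat - K @ Y / ((n - 1) * h)
Xw = X * fhat - K @ X / ((n - 1) * h)
beta_w = (Xw @ Yw) / (Xw @ Xw)                 # density-weighted least squares
```

Since the uniform design makes $f(z)$ constant, the weighting here only rescales the moments, and `beta_w` recovers the true slope of $-1.5$ up to sampling error.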
The next theorem gives the asymptotic distribution of the feasible density-weighted estimator.

Theorem 7.2. Letting $\hat\beta_W$ denote the feasible density-weighted estimator of $\beta$, then, under the same conditions as in Theorem 7.1, except that Condition 7.1 (iv) is replaced by either Condition 7.3 or Condition 7.4, we have
$\sqrt{n}(\hat\beta_W - \beta) \overset{d}{\to} N(0, \Phi_f^{-1}\Psi_f\Phi_f^{-1}),$
where $\Phi_f$ and $\Psi_f$ are defined in (7.12).

Theorem 7.2, under the weaker Condition 7.4, is proved in Li (1996). In Section 7.5 we provide a proof using the stronger Condition 7.3. As will be seen, the stronger Condition 7.3 yields a relatively simple proof. Theorem 7.2 states that the feasible estimator $\hat\beta_W$ has the same asymptotic distribution as the infeasible estimator $\tilde\beta_W$.

Note that the use of the specific weight function $f(Z_i)$ for avoiding the random denominator when estimating $\beta$ is not based on any efficiency arguments. In fact, when the error is conditionally homoskedastic, the unweighted estimator $\hat\beta$ can be shown to be semiparametrically efficient.

When the error is conditionally heteroskedastic, i.e., $E(u_i^2|X_i, Z_i) = \sigma^2(X_i, Z_i)$, one might be lured by analogy into thinking that an efficient estimator of $\beta$ could be obtained by choosing $w_i = 1/\sigma(X_i, Z_i)$ as a weight function. In general, however, this intuition is incorrect. This approach will not lead to efficient estimation of $\beta$ except in the special case for which the conditional variance is only a function of $Z_i$. That is, only when $E(u_i^2|X_i, Z_i) = \sigma^2(Z_i)$ will the choice of weight $1/\sigma(Z_i)$ lead to efficient estimation of $\beta$. Efficient estimation of $\beta$ in the general case is more complex and will be discussed in Section 7.4.

7.2.1 Estimation of the Nonparametric Component

From (7.2) we know that $g(Z_i) = E(Y_i - X_i'\beta|Z_i)$. Therefore, after obtaining a $\sqrt{n}$-consistent estimator of $\beta$ (i.e., $\hat\beta$), a consistent estimator of $g(z)$ is given by
$\hat g(z) = \frac{\sum_{j=1}^n (Y_j - X_j'\hat\beta)K_h(z, Z_j)}{\sum_{j=1}^n K_h(z, Z_j)},$ (7.13)
where $\hat\beta$ can be replaced by $\hat\beta_W$.
We know that the nonparametric kernel estimator has a convergence rate that is slower than the parametric $\sqrt{n}$-rate. Therefore, it is easy to see that, asymptotically, $\hat g(z)$ is equivalent to the following infeasible estimator that makes use of the true value of $\beta$:
$\tilde g(z) = \frac{\sum_{j=1}^n (Y_j - X_j'\beta)K_h(z, Z_j)}{\sum_{j=1}^n K_h(z, Z_j)}.$ (7.14)
The convergence rate and the asymptotic distribution of $\tilde g(z)$, from which one can immediately obtain the asymptotic distribution of $\hat g(z)$ (see Exercise 7.2), are discussed in Chapter 2.

Note that the choice of the $h_s$'s for estimating $g(z)$ can be quite different from those for estimating $\beta$. In order to obtain a $\sqrt{n}$-consistent estimator of $\beta$, a higher order kernel is needed if $q \ge 6$. However, when estimating $g(z)$, there is no need to use a higher order kernel regardless of the value of $q$. Therefore, one can always use a nonnegative second order kernel to estimate $g(z)$, and one could choose the smoothing parameters by least squares cross-validation (in estimating $g(z)$); i.e., one can always choose $h_1, \dots, h_q$ to minimize
$CV(h) = \sum_{i=1}^n \left[Y_i - X_i'\hat\beta - \hat g_{-i}(Z_i, h)\right]^2,$ (7.15)
where $\hat g_{-i}(Z_i, h)$ is the leave-one-out version of $\hat g(Z_i)$ as defined in (7.13) with $z$ replaced by $Z_i$ and $\sum_{j=1}^n$ replaced by $\sum_{j\neq i}$. Note that in (7.15) the dependent variable is $Y_i - X_i'\hat\beta$ rather than $Y_i$. Because $\hat\beta - \beta = O_p(n^{-1/2})$ has a faster rate of convergence than any nonparametric estimator, one can replace $\hat\beta$ in (7.15) with $\beta$ when studying the asymptotic behavior of the cross-validation selected $h_s$'s. The rate of convergence of the $h_s$'s is the same as discussed in Chapter 2.

One can also select $\beta$ and $h$ simultaneously by minimizing
$\sum_{i=1}^n \left[Y_i - X_i'\beta - \hat g_{-i}(Z_i, h)\right]^2.$
Under general conditions, including the use of a second order kernel, the cross-validation choice of $h_s$ will be of order $O_p(n^{-1/(4+q)})$. This order satisfies Conditions 7.1 (iv), 7.2, 7.3 and 7.4 if $q \le 3$. Therefore, when $q \le 3$, one can choose the $h_s$'s and $\beta$ simultaneously by minimizing $\sum_{i=1}^n [Y_i - X_i'\beta - \hat g_{-i}(Z_i, h)]^2$.
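The criterion (7.15) is straightforward to implement by grid search: for each candidate bandwidth, compute the leave-one-out kernel fit of $Y_i - X_i'\hat\beta$ on $Z_i$ and keep the minimizer. A sketch with a scalar $Z$ (the design is illustrative, and $\hat\beta$ is simply taken as given):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
Z = rng.uniform(-1, 1, n)
X = rng.normal(0, 1, n)
beta_hat = 1.0                                  # sqrt(n)-consistent estimate, taken as given
Y = X * 1.0 + np.sin(np.pi * Z) + rng.normal(0, 0.3, n)
V = Y - X * beta_hat                            # the dependent variable in (7.15)

def cv(h):
    """Least squares cross-validation criterion for a candidate bandwidth h."""
    u = (Z[:, None] - Z[None, :]) / h
    K = np.exp(-0.5 * u**2)
    np.fill_diagonal(K, 0.0)                    # leave-one-out fit at Z_i
    g_loo = K @ V / K.sum(axis=1)
    return np.mean((V - g_loo) ** 2)

grid = np.linspace(0.05, 1.0, 20)
h_cv = grid[np.argmin([cv(h) for h in grid])]   # CV-selected bandwidth
```

Because $\sin(\pi z)$ is quite curved, the criterion penalizes large bandwidths heavily and settles on a moderate value, consistent with the $O_p(n^{-1/(4+q)})$ rate.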
The resulting $h_s$'s will be of order $O_p(n^{-1/(4+q)})$, and the resulting $\hat\beta$ will be $\sqrt{n}$-consistent with the asymptotic variance given in Theorem 7.1. One can also use second order expansion results (in the $h$'s) in a partially linear model to select the smoothing parameters so as to minimize the estimation MSE up to second order; see Linton (1995).

The partially linear estimator has been applied in a range of settings. For example, Anglin and Gençay (1996) used it for the semiparametric modeling of hedonic price functions; Blundell, Duncan and Pendakur (1998) considered partially linear models for the semiparametric estimation of Engel curves; Engle, Granger, Rice and Weiss (1986) used a partially linear specification for estimating the relationship between weather and electricity sales; Stock (1989) considered the problem of predicting the mean effect of a change in the distribution of certain policy-related variables on a dependent variable in a partially linear framework; Adams, Berger and Sickles (1999) applied a partially linear specification to study the production frontier of the U.S. banking industry; and Yatchew and No (2001) estimated price and income elasticities allowing for endogeneity of prices in a partially linear model with price and age entering nonparametrically.

7.3 Andrews's MINPIN Method

Andrews (1994) provides a general framework for proving the $\sqrt{n}$-consistency and asymptotic normality of a wide class of semiparametric estimators. Andrews names the estimators MINPIN because they are estimators that MINimize a criterion function that may depend on Preliminary Infinite dimensional Nuisance parameter estimators. Andrews's method can be used to derive the asymptotic distribution of many semiparametric estimators, including an estimator of $\beta$ in a partially linear model. We briefly discuss this method below.
Let $\theta \in \Theta \subset \mathbb{R}^p$ denote a finite dimensional parameter, and let $\tau$ denote some infinite dimensional function. Further, let $\hat\tau$ be some preliminary nonparametric estimator of $\tau \in \mathcal{T}$, where $\mathcal{T}$ is a class of smooth functions the specifics of which depend on the particular semiparametric model being considered. We use $\theta_0$ and $\tau_0$ to denote the true parameter and the true unknown function. Let $\hat\theta$ denote an estimator of $\theta$ that is a solution to a minimization problem where the objective function depends on $\theta$ and $\hat\tau$, with $\hat\tau$ being a preliminary nonparametric estimator of $\tau_0$. Suppose that $\hat\theta$ is a consistent estimator of $\theta_0$ that solves a minimization problem resulting in the following first order condition:
$\sqrt{n}\,\bar m_n(\hat\theta, \hat\tau) = 0,$ (7.16)
where $\bar m_n(\theta, \tau) = n^{-1}\sum_{i=1}^n m(W_i, \theta, \tau)$.

We consider the case where $m(W_i, \theta, \tau)$ is differentiable with respect to $\theta$. If $\tau$ were finite dimensional, one could establish the asymptotic normality of $\hat\theta$ by expanding $\sqrt{n}\,\bar m_n(\hat\theta, \hat\tau)$ about $(\theta_0, \tau_0)$ using element by element mean value expansions. Since $\tau$ is infinite dimensional, however, a mean value expansion in $(\theta, \tau)$ is unavailable. Andrews (1994) suggests expanding $\sqrt{n}\,\bar m_n(\hat\theta, \hat\tau)$ about $\theta_0$ only and using the concept of stochastic equicontinuity (see Appendix A) to handle $\tau$. A mean value expansion in $\theta$ about $\theta_0$ yields
$0 = \sqrt{n}\,\bar m_n(\hat\theta, \hat\tau) = \sqrt{n}\,\bar m_n(\theta_0, \hat\tau) + \frac{\partial \bar m_n(\bar\theta, \hat\tau)}{\partial\theta'}\,\sqrt{n}(\hat\theta - \theta_0),$ (7.17)
where $\bar\theta$ lies on a line segment between $\hat\theta$ and $\theta_0$. Under some regularity assumptions, one can show that
$\frac{\partial \bar m_n(\bar\theta, \hat\tau)}{\partial\theta'} = n^{-1}\sum_{i=1}^n \frac{\partial m(W_i, \bar\theta, \hat\tau)}{\partial\theta'} \overset{p}{\to} E\left[\frac{\partial m(W_i, \theta_0, \tau_0)}{\partial\theta'}\right] \equiv M.$ (7.18)
Thus, provided that $M$ is nonsingular, from (7.17) and (7.18) we get
$\sqrt{n}(\hat\theta - \theta_0) = -\left(M^{-1} + o_p(1)\right)n^{-1/2}\sum_{i=1}^n m(W_i, \theta_0, \hat\tau).$
Therefore, if
$n^{-1/2}\sum_{i=1}^n m(W_i, \theta_0, \hat\tau) - n^{-1/2}\sum_{i=1}^n m(W_i, \theta_0, \tau_0) = o_p(1),$ (7.19)
then we will have
$\sqrt{n}(\hat\theta - \theta_0) = -M^{-1}n^{-1/2}\sum_{i=1}^n m(W_i, \theta_0, \tau_0) + o_p(1) \overset{d}{\to} N(0, M^{-1}SM^{-1})$ (7.20)
by the Lindeberg-Levy CLT, where $S = \mathrm{var}(m(W_i, \theta_0, \tau_0))$.

In practice, (7.19) can be difficult to verify for general semiparametric models that are more complicated than the simple partially linear model.
Andrews (1994) suggests using the concept of "stochastic equicontinuity" to establish (7.19). Stochastic equicontinuity can be used to establish (7.19) because, if $\rho(\hat\tau, \tau_0) \overset{p}{\to} 0$ as $n \to \infty$ ($\rho(\cdot,\cdot)$ is a pseudometric; see Definition A.32 in the appendix) and $\nu_n(\tau)$ is stochastically equicontinuous for $\tau \in \mathcal{A}$, where $\mathcal{A}$ is a bounded set that contains $\tau_0$ as an interior point, then
$|\nu_n(\hat\tau) - \nu_n(\tau_0)| \overset{p}{\to} 0.$ (7.21)
See Andrews (1994, eq. (2.10)) for a proof of (7.21). Recall that $\nu_n(\tau) \equiv \sqrt{n}\,\bar m_n(\theta_0, \tau)$. Thus, (7.19) holds if $\nu_n(\cdot)$ is stochastically equicontinuous.

The following assumptions can be used to establish the $\sqrt{n}$-normality result for $\hat\theta$. Suppose that the MINPIN estimator $\hat\theta$ solves the minimization problem $\hat\theta = \arg\inf_{\theta\in\Theta} d(\theta, \hat\tau)$ for some objective function $d(\theta, \hat\tau)$, and as a result the first order condition is $\bar m_n(\hat\theta, \hat\tau) = n^{-1}\sum_{i=1}^n m(W_i, \hat\theta, \hat\tau) = 0$. The population moment condition is $E[m(W_i, \theta_0, \tau_0)] = 0$. Defining $\nu_n(\tau) = \sqrt{n}\,\bar m_n(\theta_0, \tau)$, we make the following assumption.

Assumption 7.1. (Normality) Assume that
(i) $\hat\theta \overset{p}{\to} \theta_0 \in \Theta$, where $\Theta$ is a compact subset of $\mathbb{R}^p$.
(ii) $\hat\tau \overset{p}{\to} \tau_0 \in \mathcal{T}$.
(iii) $\nu_n(\tau_0) \overset{d}{\to} N(0, S)$.
(iv) $\{\nu_n(\cdot)\}$ is stochastically equicontinuous at $\tau_0$.
(v) $m(W, \theta, \tau)$ is twice continuously differentiable in $\theta \in \Theta$, while $n^{-1}\sum_i m(W_i, \theta, \tau) \overset{p}{\to} E[m(W, \theta, \tau)]$, $n^{-1}\sum_i (\partial/\partial\theta')m(W_i, \theta, \tau) \overset{p}{\to} E[(\partial/\partial\theta')m(W, \theta, \tau)]$, and $n^{-1}\sum_i (\partial^2/\partial\theta\,\partial\theta')m(W_i, \theta, \tau) \overset{p}{\to} E[(\partial^2/\partial\theta\,\partial\theta')m(W, \theta, \tau)]$, uniformly over $\Theta \times \mathcal{T}$ (i.e., a uniform weak law of large numbers holds).

Theorem 7.3. Under Assumption 7.1,
$\sqrt{n}(\hat\theta - \theta_0) \overset{d}{\to} N(0, M^{-1}SM^{-1}).$

Proof. The above theorem is a special case of Theorem 1 of Andrews (1994) and its proof is therefore omitted. In fact, Andrews does not assume i.i.d. data; rather, he allows for weakly dependent time series data and for nonidentically distributed data. We discuss weakly dependent data in Chapter 18. $\Box$

In practice, the stochastic equicontinuity condition (Assumption 7.1 (iv)) is the most difficult part to verify, especially for highly nonlinear semiparametric models. Andrews (1994) uses Theorem 7.3 to establish
the $\sqrt{n}$-normality result for a partially specified nonlinear model (see also Ai and McFadden (1997)). We discuss verification of Assumption 7.1 (iv) for partially linear models in Section 7.5. One maintained assumption above is that the objective function is smooth in its arguments; see Chen, Linton and van Keilegom (2003) on estimation of semiparametric models when the criterion function is not smooth.

7.4 Semiparametric Efficiency Bounds

7.4.1 The Conditionally Homoskedastic Error Case

In this section we discuss the (local) semiparametric efficiency bound for (7.1) and consider two approaches. We begin by deriving the (local) semiparametric lower bound under the assumption that the errors are conditionally homoskedastic, i.e., that $E(u_i^2|X_i, Z_i) = E(u_i^2) = \sigma^2$, where $u_i$ is normally distributed. This approach uses parametric maximum likelihood estimation and is easy to understand, and we follow the arguments found in Newey (1990b) and Rilstone (1993). The discussion here will be informal, while a rigorous approach can be found in Newey; also see Begun, Hall, Huang and Wellner (1983) and Tripathi and Severini (2001) for a general treatment of efficiency bounds of semiparametric models, and the related work of Pakes and Olley (1995).

Let $m(z, \delta)$ be a parametric submodel such that $m(z, \delta_0) = g(z)$ for some $\delta_0$, where $\delta$ is a $k \times 1$ vector of nuisance parameters ($k \ge 1$). The parameters of this parametric submodel are $\gamma = (\beta', \delta')'$, and the parameter vector of interest is $\beta$. For each parametric submodel there is a vector-valued score function $l_\gamma$, which one can partition with respect to the parameter of interest $\beta$ and the nuisance parameter $\delta$ so that $l_\gamma = (l_\beta', l_\delta')'$. The semiparametric bound can be interpreted as the supremum of the asymptotic covariance matrices over the parametric submodels.
Define the tangent set $\mathcal{A}$ for the nuisance function as the mean square closure of all $k \times 1$ linear combinations of $l_\delta$. Let $P(l_\beta|\mathcal{A})$ denote the projection of $l_\beta$ onto the space $\mathcal{A}$ and define $l_\beta^* = l_\beta - P(l_\beta|\mathcal{A})$. Then the semiparametric lower bound for the model is given by $V_\beta = \{E[l_\beta^* l_\beta^{*\prime}]\}^{-1}$.

The log-likelihood function corresponding to the parametric submodel is
$\mathcal{L}(\beta, \delta) = \text{constant} - \frac{1}{2\sigma^2}\sum_{i=1}^n \left[Y_i - X_i'\beta - m(Z_i, \delta)\right]^2.$
The score function evaluated at the true model is
$l_\beta = \frac{1}{\sigma^2}\,uX, \qquad l_\delta = \frac{1}{\sigma^2}\,u\,m^{(1)}(Z),$
where we used $\partial m(z, \delta)/\partial\delta|_{\delta=\delta_0} = m^{(1)}(z)$. Since $g(\cdot)$ is not specified, $m^{(1)}(z)$ is unrestricted. We obtain the tangent set $\mathcal{A}(\cdot)$, the mean square closure of all $k \times 1$ linear combinations of $l_\delta$, given by
$\mathcal{A} = \left\{u\,h(Z): h \text{ with } E\{[h(Z)]^2\} < \infty\right\}.$
The projection of $l_\beta$ on $\mathcal{A}(\cdot)$ is $(1/\sigma^2)u\,E(X|Z)$; therefore, we obtain the efficient score as
$l_\beta^* = l_\beta - P(l_\beta|\mathcal{A}(\cdot)) = (1/\sigma^2)\,u(X - E(X|Z)).$
Therefore, the semiparametric efficiency bound for (7.1) is
$V_\beta = \{E[l_\beta^* l_\beta^{*\prime}]\}^{-1} = \sigma^2\left(E\{[X - E(X|Z)][X - E(X|Z)]'\}\right)^{-1}.$ (7.22)
Thus, $\hat\beta$ defined in (7.9) is semiparametrically efficient in the sense that its asymptotic covariance matrix, which equals $\sigma^2\Phi^{-1}$ under conditional homoskedasticity, attains the semiparametric lower bound $V_\beta$ for the partially linear model.

We can also compare $V_\beta$ with the asymptotic variance of the least squares estimator of $\beta_0$ when (7.1) is a linear regression model, i.e., $g(z) = \alpha + z'\gamma$, where $\alpha$ is a scalar and $\gamma \in \mathbb{R}^q$:
$Y_i = X_i'\beta_0 + \alpha + Z_i'\gamma + u_i.$ (7.23)
Letting $\hat\beta_{ols}$ denote the ordinary least squares estimator of $\beta_0$ based on (7.23), it is easy to show that
$\sqrt{n}(\hat\beta_{ols} - \beta_0) \overset{d}{\to} N(0, V_{ols}),$ (7.24)
where $V_{ols} = \sigma^2\left(E\{[X_i - a - bZ_i][X_i - a - bZ_i]'\}\right)^{-1}$ (see Exercise 7.6), where $a$ and $b$ are constant matrices of dimensions $p \times 1$ and $p \times q$, respectively, and $a + bZ_i$ is the best linear projection of $X_i$ on a space that is linear in $Z_i$.

Now compare $V_\beta$ with $V_{ols}$. We see that if $E(X_i|Z_i) = a + bZ_i$ is a linear function of $Z_i$, then $V_\beta = V_{ols}$. However, if $E(X_i|Z_i)$ is not a linear function of $Z_i$, then $E\{[X_i - E(X_i|Z_i)][X_i - E(X_i|Z_i)]'\} - E\{[X_i - \delta(Z_i)][X_i - \delta(Z_i)]'\}$ is negative semidefinite for any function $\delta(Z_i)$ (by Theorem A.3). Thus, $V_\beta - V_{ols}$ is positive definite when $E(X|z)$ is not linear in $z$. In this latter case, the semiparametrically efficient estimator $\hat\beta$ is asymptotically less efficient than the parametric estimator $\hat\beta_{ols}$, which uses the extra information that $g(z) = \alpha + z'\gamma$.

7.4.2 The Conditionally Heteroskedastic Error Case

The derivation of the semiparametric efficiency bound for a partially linear model can be found in Chamberlain (1992) under the general conditionally heteroskedastic error case, while Ai and Chen (2003) consider efficient estimation for general semiparametric models, which include the partially linear model as a special case. The discussion below is based on Ai and Chen. The results presented in this section include a conditionally homoskedastic error model as a special case.

Consider the partially linear model
$Y_i = X_i'\beta_0 + g(Z_i) + u_i,$ (7.25)
where $E(u_i^2|X_i, Z_i) = \sigma^2(X_i, Z_i)$ is of unknown form.

In order to derive the semiparametric efficiency bound of $\beta_0$ allowing for conditionally heteroskedastic errors, we first assume that $\sigma^2(X_i, Z_i)$ is known, and then proceed to discuss how to obtain a feasible semiparametrically efficient estimator when $\sigma^2(x, z)$ is of unknown form. Ai and Chen (2003) show that $\beta_0$ can be efficiently estimated by minimizing the following objective function simultaneously (jointly) over $\beta$ and the unknown function $g$:
$\min_{\beta\in\mathcal{B},\,g\in\mathcal{G}} E\left\{\left[Y_i - X_i'\beta - g(Z_i)\right]^2/\sigma^2(X_i, Z_i)\right\},$ (7.26)
where $\mathcal{B}$ is a compact subset of $\mathbb{R}^p$ and $\mathcal{G}$ is a class of smooth functions. In practice, one works with the sample mean rather than with the population mean (7.26), thereby minimizing the following:
$\min_{\beta\in\mathcal{B},\,g\in\mathcal{G}} \frac{1}{n}\sum_{i=1}^n \left[Y_i - X_i'\beta - g(Z_i)\right]^2/\sigma^2(X_i, Z_i).$ (7.27)
The minimization can be done by first concentrating out the unknown function $g(\cdot)$. That is, we first treat $\beta$ as a fixed constant, and
by application of the calculus of variations to (7.26) over $g(\cdot)$, we obtain
$2E\left\{E\left[\left(Y_i - X_i'\beta - g(Z_i)\right)/\sigma^2(X_i, Z_i)\,\big|\,Z_i\right]\eta(Z_i)\right\} = 0,$ (7.28)
where $\eta(Z_i)$ is an arbitrary function of $Z_i$. Solving for $g(Z_i)$ in (7.28) leads to
$g(Z_i) = \left[E\!\left(\frac{Y_i}{\sigma_i^2}\Big|Z_i\right) - E\!\left(\frac{X_i}{\sigma_i^2}\Big|Z_i\right)'\beta\right]\Big/E\!\left(\frac{1}{\sigma_i^2}\Big|Z_i\right),$ (7.29)
where $\sigma_i^2 = \sigma^2(X_i, Z_i)$. Substituting (7.29) into (7.27), we get
$\min_{\beta\in\mathcal{B}} \frac{1}{n}\sum_{i=1}^n \left[Y_i - \frac{E(Y_i/\sigma_i^2|Z_i)}{E(1/\sigma_i^2|Z_i)} - \left(X_i - \frac{E(X_i/\sigma_i^2|Z_i)}{E(1/\sigma_i^2|Z_i)}\right)'\beta\right]^2\Big/\sigma_i^2.$ (7.30)
By using the shorthand notation $\tilde Y_i = Y_i - E(Y_i/\sigma_i^2|Z_i)/E(1/\sigma_i^2|Z_i)$ and $\tilde X_i = X_i - E(X_i/\sigma_i^2|Z_i)/E(1/\sigma_i^2|Z_i)$, (7.30) becomes $n^{-1}\sum_{i=1}^n (\tilde Y_i - \tilde X_i'\beta)^2/\sigma_i^2$. Minimizing this objective function over $\beta$ gives an (infeasible) efficient estimator of $\beta_0$:
$\tilde\beta_{eff} = \left[\sum_{i=1}^n \tilde X_i\tilde X_i'/\sigma_i^2\right]^{-1}\sum_{i=1}^n \tilde X_i\tilde Y_i/\sigma_i^2 = \beta_0 + \left[\sum_{i=1}^n \tilde X_i\tilde X_i'/\sigma_i^2\right]^{-1}\sum_{i=1}^n \tilde X_iu_i/\sigma_i^2.$ (7.31)
A standard law of large numbers along with the Lindeberg-Levy CLT leads to
$\sqrt{n}(\tilde\beta_{eff} - \beta_0) \overset{d}{\to} N(0, V_0) \quad\text{in distribution},$ (7.32)
where
$V_0 = \left\{E\left[\tilde X_i\tilde X_i'/\sigma_i^2\right]\right\}^{-1}.$
The variance in (7.32) is the semiparametric efficiency bound derived in Chamberlain (1992, p. 569, eq. (1.9)).

When $\sigma^2(X_i, Z_i) = \sigma^2(Z_i)$, (7.32) can be simplified to
$V_0 = \left(E\left\{[X_i - E(X_i|Z_i)][X_i - E(X_i|Z_i)]'/\sigma^2(Z_i)\right\}\right)^{-1}.$ (7.33)
When $\sigma_i^2 = \sigma^2$, a constant, (7.32) further reduces to the asymptotic variance of $\sqrt{n}(\hat\beta - \beta_0)$ covered by Theorem 7.1.

The above estimator $\tilde\beta_{eff}$ is not feasible. A feasible efficient estimator of $\beta_0$ can be obtained from $\tilde\beta_{eff}$ by replacing the unknown conditional expectations with their respective kernel estimators, e.g., by replacing $E[Y_i/\sigma_i^2|Z_i]$ with $\hat E[Y_i/\hat\sigma_i^2|Z_i] = \sum_j (Y_j/\hat\sigma_j^2)K_{h,ij}/\sum_j K_{h,ij}$, where $K_{h,ij} = K((Z_i - Z_j)/h)$, and by further replacing $\sigma_i^2$ with $\hat\sigma_i^2 = \sum_j \hat u_j^2K_{\eta,ij}/\sum_j K_{\eta,ij}$, where $K_{\eta,ij} = K((X_i - X_j)/h_x)K((Z_i - Z_j)/h_z)$, with $\hat u_j = [Y_j - \hat E(Y_j|Z_j)] - [X_j - \hat E(X_j|Z_j)]'\hat\beta$ being a consistent estimator of $u_j$ as defined in Exercise 7.1. Some additional regularity conditions, such as the density $f(x, z)$ being bounded away from zero on its support or, alternatively, the use of a trimming function, are needed to establish that the feasible estimator indeed has the same asymptotic distribution as the efficient infeasible estimator $\tilde\beta_{eff}$. We point out that an efficient semiparametric estimator of $\beta_0$ is quite complex.
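Treating $\sigma^2(X_i, Z_i)$ as known (an illustrative simplification; the feasible version replaces it and the conditional expectations with kernel estimates as just described), the construction in (7.29)-(7.31) reduces to two weighted kernel regressions followed by weighted least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
Z = rng.uniform(0.5, 1.5, n)
X = Z + rng.normal(0, 1, n)
sig2 = Z**2                                     # sigma^2(X,Z), known here by assumption
Y = X * 0.8 + np.log(Z) + np.sqrt(sig2) * rng.normal(0, 1, n)

def nw(y, z, h):
    """Leave-one-out Nadaraya-Watson estimate of E(y|Z) at each sample point."""
    u = (z[:, None] - z[None, :]) / h
    K = np.exp(-0.5 * u**2)
    np.fill_diagonal(K, 0.0)
    return K @ y / K.sum(axis=1)

h = 1.06 * Z.std() * n ** (-1 / 5)
w = 1.0 / sig2
# Y-tilde_i = Y_i - E(Y/sigma^2|Z_i)/E(1/sigma^2|Z_i), X-tilde analogously, as in (7.30):
Yt = Y - nw(Y * w, Z, h) / nw(w, Z, h)
Xt = X - nw(X * w, Z, h) / nw(w, Z, h)
beta_eff = np.sum(Xt * Yt * w) / np.sum(Xt * Xt * w)   # weighted LS, as in (7.31)
```

The simulated design, bandwidth rule, and function names here are hypothetical; the point is only to show that once $\sigma^2(\cdot)$ is in hand, the efficient estimator is computationally no harder than two extra kernel regressions.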
Such an estimator requires estimation of a nonparametric model with dimension $p + q$ (the dimension of $(X_i, Z_i)$), while the estimation of $\hat\beta$ and $\hat g(\cdot)$ involves only nonparametric estimation with dimension $q$. Therefore, the "curse of dimensionality" may prevent researchers from applying efficient estimation procedures to a partially linear model when the error is conditionally heteroskedastic.

In this chapter we only cover the case in which $(Y, X, Z)$ are all observed without error. For estimation of a semiparametric partially linear model with errors-in-variables, see Liang, Härdle and Carroll (1999) and Liang and Wang (2004).

7.5 Proofs

7.5.1 Proof of Theorem 7.2

Proof. We first provide a consistent estimator of the asymptotic variance of $\hat\beta_W$. It can be shown that
$\hat\Phi_f = n^{-1}\sum_i (X_i - \hat X_i)(X_i - \hat X_i)'\hat f_i^2 \quad\text{and}\quad \hat\Psi_f = n^{-1}\sum_i \hat u_i^2(X_i - \hat X_i)(X_i - \hat X_i)'\hat f_i^4$ (7.34)
are consistent estimators of $\Phi_f$ and $\Psi_f$, respectively, where $\hat u_i = (Y_i - \hat Y_i) - (X_i - \hat X_i)'\hat\beta_W$ is a consistent estimator of $u_i$.

Throughout, we use $\sum_i$ to denote $\sum_{i=1}^n$ and $\sum_{j\neq i}$ to denote $\sum_{j=1, j\neq i}^n$. From $Y_i = X_i'\beta_0 + g_i + u_i$, we get $\hat Y_i = \hat X_i'\beta_0 + \hat g_i + \hat u_i$, where $\hat A_i = \big[(n-1)^{-1}\sum_{j\neq i}A_jK_h(Z_i, Z_j)\big]/\hat f_i$ ($A = Y, X, g, u$). Define $S_{A,B} = n^{-1}\sum_i A_iB_i'\hat f_i^2$ and $S_A = S_{A,A}$. Using $Y_i - \hat Y_i = (X_i - \hat X_i)'\beta_0 + g_i - \hat g_i + u_i - \hat u_i$, we get
$\hat\beta_W = \beta_0 + S_{X-\hat X}^{-1}\left[S_{X-\hat X,\,g-\hat g} + S_{X-\hat X,\,u-\hat u}\right].$ (7.35)
Hence, we have
$\sqrt{n}(\hat\beta_W - \beta_0) = S_{X-\hat X}^{-1}\,\sqrt{n}\left[S_{X-\hat X,\,g-\hat g} + S_{X-\hat X,\,u-\hat u}\right].$ (7.36)
In Propositions 7.1 to 7.4 below, we show that
(i) $\sqrt{n}\,S_{X-\hat X,\,g-\hat g} = o_p(1)$,
(ii) $S_{X-\hat X} \overset{p}{\to} \Phi_f$,
(iii) $\sqrt{n}\,S_{X-\hat X,\,\hat u} = o_p(1)$, and
(iv) $\sqrt{n}\,S_{X-\hat X,\,u} = \sqrt{n}\,S_{\tilde Xf,\,u} + o_p(1) \overset{d}{\to} N(0, \Psi_f)$.
Using (i)-(iv), from (7.36) we obtain
$\sqrt{n}(\hat\beta_W - \beta_0) = \left[\Phi_f^{-1} + o_p(1)\right]\left\{o_p(1) + \sqrt{n}\,S_{\tilde Xf,\,u} + o_p(1)\right\} \overset{d}{\to} N(0, \Phi_f^{-1}\Psi_f\Phi_f^{-1}).$ (7.37)
This completes the proof of Theorem 7.2. $\Box$

In the lemmas presented below, we assume that the support of $(X_i, Z_i)$ is a bounded set to simplify the proof. All results for $g_i = g(Z_i)$ also hold true for $\xi_i = \xi(Z_i) = E(X_i|Z_i)$.
Letting ¢ = 14 oF tw, then (2%), (2.2) and 02(X,, Z) = Elef|X Z) ae all bounded functions ‘with bounded derivatives up to order v, though the support of us need not be bounded. ‘We will use the shorthand notation E,(A) = B(A|Xy, Z,) and Kiy = K((Z— Z)/h) below. ‘Lemmas 7.1 to 7.5 below hold true if one replaces g with €. The only difference between g and ¢ is that g isa sealar function, while € is of dimension r x 1. The proofs below hold true for each component of € and, therefore, hold true for the vector function € since 7 is finite. ‘Also, we assume that fy ===: = hy = A for notational simplicity Alternatively, we can interpret the results of O(h#) as O (S74, hd) and O(h#) as O{h ..-h) to obtain the results that correspond to the case where we do not imnpose the condition that all ofthe fs are equal ‘Also, in order to simplify the proof, we use the leavo-one-out estima- tors for X; and Yi in the proof below, Note that Theorem 7.2 remains valid without using leave-one-out kernel estimators. Lemma 7-1. Let my = 9(Z) or ms = €(Z). Then Ex[(ms — m1) Kaa] = O(h"). Proof. Note that 9(2) has bounded derivatives, It follows by using the ‘Taylor expansion and a change of variables argument. o Lemma 7.2. (9 Seong = Op (HY + HP(nh®)) (rm; = 95 o mi = Ee20 1 SEMIPARAMETRIC PARTIALLY LINEAR MODELS (8) Let eg = wy o7 145 then 8, = Op ((nh®)~)) Proof of (i). (we ignore the difference between (n— 1)~! and n=!) E [[Sin-ml] = re [lois — my? #3] =E [i —m)* = wee (ons = mm) Knaus ~ m1) Ki] ® ¥ [Plem-miPktal a + YX BEE llm—m) Kral shies x Ey [Oj — mm) Kings] =n {0 (h2h-*) 4-20 (h*)} = 0 ((nne-2y* +2) by Lemma 7.1 7 Proof of (ii). Eluj-md -E [ef] = LY Klas Krak] an =n? DE [ekza a =n IB [01(Xi, ZK] = Cn "B (Kia) = 0 (nh?) Lemma 7.8. Spamyjci = % (2) (me = 9 oF ms = 6)zsproors 2a. mm faet + Seix-mypelF-D = Seraomfes + (60) DB [lin mr Rese] = ne [mn {[umna’} = my)? fPo2(%, 20) f8] << Cn'8 [lin ~ m)*f3] =n = nol) =oln-) by Lemma 7.2. 7 Hence, Sion—myfi anjap + (60.) =o (n-¥?) 
$\Box$

Lemma 7.4. $S_{(\hat m - m)\hat f,\,\hat\epsilon\hat f} = O_p\left(h^\nu(nh^q)^{-1/2} + h(nh^q)^{-1}\right) = o_p\left(n^{-1/2}\right)$.

Proof. By the Cauchy-Schwarz inequality,

$$\left|S_{(\hat m - m)\hat f,\,\hat\epsilon\hat f}\right| \le \left\{S_{(\hat m - m)\hat f}\; S_{\hat\epsilon\hat f}\right\}^{1/2} = \left\{O_p\left(h^{2\nu} + h^2(nh^q)^{-1}\right)O_p\left((nh^q)^{-1}\right)\right\}^{1/2}$$
$$= O_p\left(h^\nu(nh^q)^{-1/2} + h(nh^q)^{-1}\right) = o_p\left(n^{-1/2}\right). \ \Box$$

Lemma 7.5. (i) $S_{\hat u\hat f} = O_p\left((nh^q)^{-1}\right)$, (ii) $S_{\hat v\hat f} = O_p\left((nh^q)^{-1}\right)$, (iii) $S_{\tilde X\hat f} = \Phi + o_p(1)$, (iv) $S_{\tilde X\hat f,\,\tilde u\hat f} = O_p\left(n^{-1/2}\right)$, (v) $S_{\tilde X\hat f,\,\tilde g\hat f} = o_p\left(n^{-1/2}\right)$.

The proofs of (i) and (ii) are the same as that of Lemma 7.2(ii).

Proof of (v). $S_{\tilde X\hat f,\,\tilde g\hat f} = S_{\tilde X\hat f,\,(g - \hat g)\hat f} + (\text{s.o.})$, and by the Cauchy-Schwarz inequality,

$$\left|S_{\tilde X\hat f,\,(g - \hat g)\hat f}\right| \le \left\{S_{\tilde X\hat f}\; S_{(g - \hat g)\hat f}\right\}^{1/2} = o_p\left(n^{-1/2}\right)$$

by Theorem 7.1, where $\hat g(z)$ is defined in (7.14). From $\hat\beta - \beta_0 = O_p(n^{-1/2})$, one can show that $\hat g(z) - g(z) = O_p\left((nh_1\cdots h_q)^{-1/2} + \sum_{s=1}^q h_s^\nu\right)$. $\Box$

7.6 Exercises

Exercise 7.5. (i) Show that the result of Lemma 7.4 can be sharpened to

$$S_{(\hat m - m)\hat f,\,\hat\epsilon\hat f} = O_p\left(\left(nh^{q/2}\right)^{-1} + n^{-1/2}h^\nu\right) = o_p\left(n^{-1/2}\right).$$

(ii) Show that the result of Lemma 7.5(i), $S_{\hat u\hat f} = O_p\left((nh^q)^{-1}\right)$, can be sharpened to $S_{\hat u\hat f} = O_p\left(n^{-1}h^{-q/2}\right)$.

Hint: Do not use the Cauchy-Schwarz inequality; rather, show that $E\left[S_{(\hat m - m)\hat f,\,\hat\epsilon\hat f}^2\right] = O\left(\left(nh^{q/2}\right)^{-2} + n^{-1}h^{2\nu}\right)$ for (i), and $E\left[S_{\hat u\hat f}^2\right] = O\left(n^{-2}h^{-q}\right)$ for (ii). Note that with the above sharpened rates, Condition 7.3 can be replaced by the weaker Condition 7.4.

Exercise 7.6. Prove (7.21). Hint: Use the Frisch-Waugh-Lovell Theorem (see Davidson and MacKinnon (1993, pp. 19-24)). Define $M_z = I_n - P_z(P_z'P_z)^{-1}P_z'$, where $P_z$ is an $n \times (q+1)$ matrix with $i$th row given by $(1, Z_i')$. Project the matrix form of (7.24) on $M_z$ to wipe out $\alpha + Z_i'\gamma$, then use standard law of large numbers and CLT arguments as we did in the proof of Exercise 7.1.

Exercise 7.7. (i) Assume that $E(u_i^2|X_i, Z_i) = \sigma^2(Z_i)$. Using arguments similar to those we used when deriving (7.38), show that the semiparametric efficiency bound of $\beta$ is given by (7.33).

(ii) When $E(u_i^2|X_i, Z_i) = \sigma^2(X_i, Z_i)$, is $V_0 - V_{0,e}$ positive definite? Hint: No calculation is needed for answering (ii). A simple logical argument is sufficient.

Chapter 8

Semiparametric Single Index Models

In this chapter we consider another popular semiparametric model, the so-called semiparametric single index model. This model has been widely used by applied econometricians and has been deployed in a variety of settings.
A semiparametric single index model is of the form

$$Y = g(X'\beta_0) + u, \qquad (8.1)$$

where $Y$ is the dependent variable, $X \in \mathbb{R}^q$ is the vector of explanatory variables, $\beta_0$ is the $q \times 1$ vector of unknown parameters, and $u$ is the error satisfying $E(u|X) = 0$. The term $x'\beta_0$ is called a "single index" because it is a scalar (a single index) even though $x$ is a vector. The functional form of $g(\cdot)$ is unknown to the researcher. This model is semiparametric in nature since the functional form of the linear index is specified, while $g(\cdot)$ is left unspecified.

Semiparametric single index models arise naturally in binary choice settings, and so for illustrative purposes we first discuss this popular example. In a binary choice model, if one is willing to accept the parametric linear index governing choices but unwilling to specify the unknown distribution of the error term, then one arrives at a semiparametric single index model. Specifically, when considering the relationship between a binary dependent variable ($Y$) and some other covariates ($X$), one might model this relationship as follows:

$$Y = \mathbf{1}\left(Y^* > 0\right), \qquad Y^* = \alpha + X'\beta + \epsilon, \qquad (8.2)$$

where $Y^*$ is a latent variable. Note that here $\epsilon = Y^* - E(Y^*|X)$, which differs from $u = Y - E(Y|X)$ defined in (8.1) because $Y \neq Y^*$.

For example, $Y$ could represent a labor force participation decision: if $Y$ equals 1, an individual participates in the labor market, while if $Y$ equals 0, the individual does not. The explanatory variables $X$ contain a set of economic factors that could influence the participation decision, such as age, marital status, education, work history, and number of children. Model (8.2) assumes a linear parametric link function between the decision regarding whether to participate in the labor market ($Y$) and the explanatory variables $X$. We are mainly interested in estimating $\beta$, which reflects the impact of changes in $X$ on the probability of participating in the labor market.
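The latent-variable formulation above is straightforward to simulate. The sketch below draws data from $Y = \mathbf{1}(\alpha + X'\beta + \epsilon > 0)$; the normal error, the parameter values, and the function name are our illustrative choices, not part of the model itself.

```python
import numpy as np

def simulate_binary_choice(n, alpha, beta, rng):
    """Draw (Y, X) from the latent-variable model
    Y = 1(alpha + X'beta + eps > 0), with eps ~ N(0, 1).
    Illustrative only: the error law is one admissible choice."""
    q = len(beta)
    X = rng.normal(size=(n, q))
    eps = rng.normal(size=n)
    ystar = alpha + X @ beta + eps        # latent variable Y*
    Y = (ystar > 0).astype(float)         # observed binary outcome
    return Y, X

rng = np.random.default_rng(0)
Y, X = simulate_binary_choice(5000, alpha=0.2, beta=np.array([1.0, -0.5]), rng=rng)
```

Data generated this way are used below to illustrate the various estimators of $\beta$.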
Parametric methods for estimating $\beta$ require one to specify the (unknown) distribution of the error term $\epsilon$. A popular assumption is that $\epsilon$ has a normal distribution, i.e., $\epsilon \sim N(0, \sigma^2)$. It can be shown that $\beta$ and $\sigma^2$ cannot be jointly identified without additional identification conditions (see Maddala (1986) for further details). For instance, if we assume that $\sigma = 1$, then $\beta$ is identified, and we can use maximum likelihood estimation to estimate $\beta$. However, if the error term does not possess a normal distribution, then the parametric approach will in general produce inconsistent estimates, i.e., of $P(Y = 1|x) = E(Y|x)$; see Exercise 8.1. To see this, let $F_\epsilon(\cdot)$ denote the true CDF of $\epsilon$. From (8.2) we have

$$E(Y|x) = \sum_y y\,P(y|x) = 1 \times P(Y = 1|x) + 0 \times P(Y = 0|x) = P(Y = 1|x)$$
$$= P\left(\alpha + x'\beta + \epsilon > 0\right) = P\left(\epsilon > -(\alpha + x'\beta)\right) = 1 - F_\epsilon\left(-(\alpha + x'\beta)\right) \equiv m(\alpha + x'\beta),$$

where $F_\epsilon(\cdot)$ is the CDF of $\epsilon$. Note that if $\epsilon$ has a symmetric distribution, then $F_\epsilon(\alpha + x'\beta) = 1 - F_\epsilon(-(\alpha + x'\beta))$, and we have $m(\cdot) = F_\epsilon(\cdot)$ in this case. For example, if $\epsilon \sim N(0, 1)$ ($\sigma = 1$), then (8.2) becomes a Probit model:

$$E(Y|x) = P(Y = 1|x) = \Phi(\alpha + x'\beta), \qquad (8.3)$$

where $\Phi(\cdot)$ is the CDF of a standard normal variable. If, on the other hand, $\epsilon$ has a (symmetric) logistic distribution, then (8.2) results in a Logit model of the form

$$E(Y|x) = \frac{\exp(\alpha + x'\beta)}{1 + \exp(\alpha + x'\beta)}. \qquad (8.4)$$

From (8.3) and (8.4), we see that different distributional assumptions for $\epsilon$ lead to different functional forms for the conditional probability of $Y = 1$. Hence, consistent parametric estimation of $P(Y = 1|x) = E(Y|x)$ requires the correct distributional specification of $\epsilon$. A semiparametric single index model therefore avoids the problem of error distribution misspecification. Moreover, the semiparametric single index model (8.1) is more general than the binary choice model since we do not require that the dependent variable necessarily be binary in nature.
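To make the dependence on the error law concrete, the snippet below evaluates the Probit link (8.3) and the Logit link (8.4) at the same index values $v = \alpha + x'\beta$: both are CDFs, both equal one half at $v = 0$, yet they imply different choice probabilities elsewhere. A minimal sketch using only the standard library.

```python
import math

def probit_prob(index):
    # (8.3): Phi(v), the standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

def logit_prob(index):
    # (8.4): exp(v) / (1 + exp(v)), written in a numerically stable form
    return 1.0 / (1.0 + math.exp(-index))

for v in (-2.0, 0.0, 2.0):
    print(v, probit_prob(v), logit_prob(v))
```

At $v = 2$ the Probit model assigns probability about 0.977 while the Logit model assigns about 0.881, which is exactly the kind of divergence a misspecified parametric link produces.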
As we will see in Section 8.1, the location parameter $\alpha$ cannot be identified when the functional form of $g(\cdot)$ is unknown, which is why we write our semiparametric model (8.1) only as a function of $X'\beta_0$. We discuss this and other identification conditions in Section 8.1. Note that model (8.1) implies that $E(Y|x) = g(x'\beta_0)$; hence $y$ depends on $x$ only through the linear combination $x'\beta_0$, and the relationship is characterized by the link function $g(\cdot)$.

We would like to emphasize here that, like the partially linear model, the semiparametric single index model is an alternative approach designed to mitigate effects arising from the curse of dimensionality. Also, we emphasize that $Y$ can be continuous or discrete, i.e., there is no reason to restrict $Y$ to be a binary variable.

8.1 Identification Conditions

For the semiparametric single index model, we have $E(Y|x) = g(x'\beta_0)$. Ichimura (1993), Manski (1988) and Horowitz (1998, pp. 14-20) provide excellent intuitive explanations of the identifiability conditions underlying semiparametric single index models (i.e., the set of conditions under which the unknown parameter vector $\beta_0$ and the unknown function $g(\cdot)$ can be sensibly estimated). We briefly discuss these conditions and then summarize them in a proposition.

First, $g(\cdot)$ cannot be a constant function, for otherwise it is obvious that $\beta_0$ is not identified. Next, as is the case for linear regression, the different components of $x$ cannot have a perfect linear relationship (perfect multicollinearity). Another restriction is that $x$ must contain at least one continuous random variable. Intuitively, this can be understood via the following line of reasoning. If $x$ contains only discrete variables, say some 0-1 dummy variables, then the support of $x$ is finite, and so is the support of the scalar variable $v = x'\beta$ for any vector $\beta$. Obviously, then, there will exist an infinite number of functions $g(\cdot)$ which differ in terms of their $\beta$ vectors such that $g(x'\beta) = E(Y|x)$.
This is so simply because $E(Y|x) = g(x'\beta)$ imposes only a finite number of restrictions on the unknown function $g(\cdot)$; therefore, there exist an infinite number of different choices of $g(\cdot)$ and $\beta$ that satisfy the finite set of restrictions imposed by $E(Y|x) = g(x'\beta)$. See Horowitz (1998) for a specific example and illustration of this point.

Also, $x$ cannot contain a constant. That is, $g(\cdot)$ cannot contain a location parameter, and $\beta_0$ is identifiable only up to a scale. This follows because, for any nonzero constants $a_1$ and $a_2$ and for any $g(\cdot)$ function and fixed vector $\beta$, we can always find another function, say $g_2(\cdot)$, defined by $g_2(a_1 + a_2 x'\beta) = g(x'\beta)$. Therefore, $\beta_0$ is not identified without some location and scale restrictions (normalizations). One popular normalization is that $x$ does not contain a constant, the so-called location normalization. For the so-called scale normalization, one approach is to assume that the vector $\beta_0$ has unit length, i.e., $\|\beta_0\| = 1$, where $\|\beta\| = \left(\sum_{s=1}^q \beta_s^2\right)^{1/2}$ is the Euclidean norm (length) of $\beta$. Another popular scale normalization is to assume that the first component of $x$ has a unit coefficient, and this first component is assumed to be a continuous variable.

We summarize the above conditions in the following proposition.

Proposition 8.1. (Identification of a Single Index Model). For the semiparametric single index model (8.1), identification of $\beta_0$ and $g(\cdot)$ requires that

(i) $x$ should not contain a constant (intercept), and $x$ must contain at least one continuous variable. Moreover, $\|\beta_0\| = 1$.

(ii) $g(\cdot)$ is differentiable and is not a constant function on the support of $x'\beta_0$.

(iii) For the discrete components of $x$, varying the values of the discrete variables will not divide the support of $x'\beta_0$ into disjoint subsets.

We have already discussed how, when $x$ contains only discrete variables, $\beta_0$ and $g(\cdot)$ are not identified. However, when $g(\cdot)$ is assumed to be an increasing function, then one can obtain identified bounds on the components of $\beta_0$.
For a detailed discussion of how to characterize the bounds when all components of $x$ are discrete, see Horowitz (1998, pp. 17-20).

8.2 Estimation

8.2.1 Ichimura's Method

In this section we review the semiparametric estimation method proposed by Ichimura (1993). If the functional form of $g(\cdot)$ were known, (8.1) would become a standard nonlinear regression model, and we could use the nonlinear least squares method to estimate $\beta_0$ by minimizing

$$\sum_{i=1}^n \left[Y_i - g(X_i'\beta)\right]^2 \qquad (8.5)$$

with respect to $\beta$. In the case of an unknown function $g(\cdot)$, we first need to estimate $g(\cdot)$. However, the kernel method does not estimate $g(X_i'\beta_0)$ directly because not only is $g(\cdot)$ unknown, but so too is $\beta_0$. Nevertheless, for a given value of $\beta$ we can estimate

$$G(X_i'\beta) \stackrel{\text{def}}{=} E(Y_i|X_i'\beta) = E\left[g(X_i'\beta_0)\,|\,X_i'\beta\right] \qquad (8.6)$$

by the kernel method, where the last equality follows from the fact that $E(u_i|X_i'\beta) = 0$ for all $\beta$ since $E(u_i|X_i) = 0$. Note that when $\beta = \beta_0$, $G(X_i'\beta_0) = g(X_i'\beta_0)$, while in general, $G(X_i'\beta) \neq g(X_i'\beta_0)$ if $\beta \neq \beta_0$. A leave-one-out nonparametric kernel estimator of $G(X_i'\beta)$ is given by

$$\hat G_{-i}(X_i'\beta) = \hat E_{-i}(Y_i|X_i'\beta) = \frac{\frac{1}{(n-1)h}\sum_{j\neq i} Y_j\,K\!\left(\frac{X_i'\beta - X_j'\beta}{h}\right)}{\hat p_{-i}(X_i'\beta)}, \qquad (8.7)$$

where $\hat p_{-i}(X_i'\beta) = \frac{1}{(n-1)h}\sum_{j\neq i} K\!\left(\frac{X_i'\beta - X_j'\beta}{h}\right)$. Ichimura (1993) suggests estimating $g(X_i'\beta)$ in (8.5) by $\hat G_{-i}(X_i'\beta)$ of (8.7) and choosing $\beta$ by (semiparametric) nonlinear least squares.

There is one technical problem, though, being that (8.7) has a random denominator, $\hat p_{-i}(X_i'\beta)$. Ichimura uses a trimming function to trim out small values of $\hat p_{-i}(X_i'\beta)$. Let $p(x'\beta)$ denote the PDF of $X'\beta$. Define $A_\delta$ and $A_h$ to be the following sets:

$$A_\delta = \left\{x : p(x'\beta) \ge \delta \text{ for all } \beta \in B\right\},$$

where $\delta > 0$ is a constant and $B$ is a compact subset of $\mathbb{R}^q$, and

$$A_h = \left\{x : \|x - x^*\| \le 2h \text{ for some } x^* \in A_\delta\right\}.$$

The set $A_\delta$ ensures that the denominator in (8.7) does not get too close to zero for $x \in A_\delta$. The set $A_h$ is slightly larger than $A_\delta$, but as $n \to \infty$, $h \to 0$ and $A_h$ shrinks to $A_\delta$.
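The leave-one-out estimator (8.7) can be coded in a few lines. The sketch below uses a Gaussian kernel for illustration (any second order kernel satisfying the assumptions stated below could be substituted) and guards the random denominator numerically rather than through trimming.

```python
import numpy as np

def loo_index_regression(y, X, beta, h):
    """Leave-one-out Nadaraya-Watson estimate of G(X_i'beta), a sketch
    of (8.7). Gaussian kernel chosen for illustration only."""
    v = X @ beta                           # scalar index X_i'beta
    u = (v[:, None] - v[None, :]) / h      # (v_i - v_j) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)               # leave-one-out: drop j = i
    num = K @ y
    den = K.sum(axis=1)
    return num / np.maximum(den, 1e-12)    # guard the random denominator
```

Both the numerator and denominator omit the $(n-1)^{-1}h^{-1}$ normalization since it cancels in the ratio.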
Ichimura (1993) suggests choosing $\beta$ by minimizing the following objective function:

$$S_n(\beta) = \sum_{i=1}^n \left[Y_i - \hat G_{-i}(X_i'\beta)\right]^2 w(X_i)\,\mathbf{1}(X_i \in A_h), \qquad (8.8)$$

where $\hat G_{-i}(X_i'\beta)$ is defined in (8.7), $w(X_i)$ is a nonnegative weight function, and $\mathbf{1}(\cdot)$ is the usual indicator function. That is, $\mathbf{1}(X_i \in A_h)$ is a trimming function which equals one if $X_i \in A_h$, zero otherwise. The trimming function ensures that the random denominator in the kernel estimator is positive with high probability so as to simplify the asymptotic analysis.

Let $\hat\beta$ denote the semiparametric estimator of $\beta_0$ obtained by minimizing (8.8). To derive the asymptotic distribution of $\hat\beta$, the following conditions are needed:

Assumption 8.1. The set $A_\delta$ is compact, and the weight function $w(\cdot)$ is bounded and positive on $A_\delta$. Define the set $D_z = \{z : z = x'\beta,\ \beta \in B,\ x \in A_\delta\}$. Letting $p(\cdot)$ denote the PDF of $z \in D_z$, $p(\cdot)$ is bounded below by a positive constant for all $z \in D_z$.

Assumption 8.2. $g(\cdot)$ and $p(\cdot)$ are three times differentiable with respect to $z = x'\beta$. The third derivatives are Lipschitz continuous uniformly over $B$ for all $z \in D_z$.

Assumption 8.3. The kernel function is a bounded second order kernel having bounded support, is twice differentiable, and its second derivative is Lipschitz continuous.

Assumption 8.4. $E|Y|^m < \infty$ for some $m \ge 3$. $\mathrm{var}(Y|x)$ is bounded and bounded away from zero for all $x \in A_\delta$. $\ln(n)/\left[nh^{3+6/(m-3)}\right] \to 0$ and $nh^8 \to 0$ as $n \to \infty$.

Ichimura (1993) has proven the following result:

Theorem 8.1. Under Assumptions 8.1 through 8.4,

$$\sqrt{n}\left(\hat\beta - \beta_0\right) \to N\left(0, \Omega_I\right) \text{ in distribution, where } \Omega_I = V^{-1}\Sigma V^{-1},$$

with

$$\Sigma = E\left\{w(X_i)\sigma^2(X_i)\left(g^{(1)}(X_i'\beta_0)\right)^2\left[X_i - E_A(X_i|X_i'\beta_0)\right]\left[X_i - E_A(X_i|X_i'\beta_0)\right]'\right\},$$

where $g^{(1)}(X_i'\beta_0) = \left[\partial g(v)/\partial v\right]_{v = X_i'\beta_0}$, $E_A(X_i|v) = E(X_i|X_i'\beta_0 = v)$ with $X_i$ having the distribution of $X_i$ conditional on $X_i \in A_\delta$, and

$$V = E\left\{w(X_i)\left(g^{(1)}(X_i'\beta_0)\right)^2\left[X_i - E_A(X_i|X_i'\beta_0)\right]\left[X_i - E_A(X_i|X_i'\beta_0)\right]'\right\}.$$

A consistent estimator of $\Omega_I$ is given by $\hat\Omega_I = \hat V^{-1}\hat\Sigma\hat V^{-1}$, where

$$\hat\Sigma = \frac{1}{n}\sum_i w(X_i)\hat u_i^2\left(\hat G^{(1)}_{-i}(X_i'\hat\beta)\right)^2\left[X_i - \hat E(X_i|X_i'\hat\beta)\right]\left[X_i - \hat E(X_i|X_i'\hat\beta)\right]',$$

$$\hat V = \frac{1}{n}\sum_i w(X_i)\left(\hat G^{(1)}_{-i}(X_i'\hat\beta)\right)^2\left[X_i - \hat E(X_i|X_i'\hat\beta)\right]\left[X_i - \hat E(X_i|X_i'\hat\beta)\right]',$$

with $\hat u_i = Y_i - \hat G_{-i}(X_i'\hat\beta)$, $\hat G^{(1)}_{-i}(X_i'\hat\beta) = \left[\partial\hat G_{-i}(X_i'\beta)/\partial(X_i'\beta)\right]_{\beta = \hat\beta}$, where $\hat G_{-i}(X_i'\beta)$ is defined in (8.7), and

$$\hat E(X_i|X_i'\hat\beta) = \frac{\sum_{j\neq i} X_j\,K\!\left(\left(X_i'\hat\beta - X_j'\hat\beta\right)/h\right)}{\sum_{j\neq i} K\!\left(\left(X_i'\hat\beta - X_j'\hat\beta\right)/h\right)}.$$
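A minimal version of Ichimura's criterion (8.8), with $w \equiv 1$, no trimming, and the scale normalization $\|\beta\| = 1$ imposed by parameterizing $\beta$ through an angle (so $q = 2$), can be sketched as follows. The grid search stands in for a proper numerical optimizer, and all data-generating choices (the link $g(v) = \sin(v)$, the noise level, the bandwidth) are ours.

```python
import numpy as np

def ichimura_objective(theta, y, X, h):
    """S_n(beta)/n from (8.8) with w = 1 and no trimming;
    beta = (cos theta, sin theta) imposes ||beta|| = 1."""
    beta = np.array([np.cos(theta), np.sin(theta)])
    v = X @ beta
    u = (v[:, None] - v[None, :]) / h
    K = np.exp(-0.5 * u**2)                 # Gaussian kernel (illustrative)
    np.fill_diagonal(K, 0.0)                # leave-one-out as in (8.7)
    G = (K @ y) / np.maximum(K.sum(axis=1), 1e-12)
    return np.mean((y - G) ** 2)

# toy data: g(v) = sin(v), beta0 at angle pi/4, mild noise
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 2))
theta0 = np.pi / 4
v0 = X @ np.array([np.cos(theta0), np.sin(theta0)])
y = np.sin(v0) + 0.1 * rng.normal(size=n)

grid = np.linspace(0.0, np.pi, 61)
theta_hat = grid[np.argmin([ichimura_objective(t, y, X, h=0.3) for t in grid])]
```

Away from the true angle the index mixes in noise-like variation, so the leave-one-out fit degrades and the objective rises; the minimizer should therefore sit near $\theta_0 = \pi/4$.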
The proof of Theorem 8.1 is quite technical, and we refer readers to Ichimura (1993). Horowitz (1998) provides an excellent heuristic outline for proving Theorem 8.1 using only the familiar Taylor series methods, a standard law of large numbers, and Lindeberg-Levy CLT arguments.

When $E(u_i^2|X_i) = \sigma^2$ is a constant (i.e., when $u = Y - g(X'\beta_0)$ has conditionally homoskedastic errors), it can be shown that the optimal choice of $w(X_i)$ is $w(X_i) = 1$, and in this case $\hat\beta$ is semiparametrically efficient in the sense that its asymptotic variance attains the semiparametric variance lower bound (conditional on $X \in A_\delta$).

In general, however, $E(u_i^2|X_i) = \sigma^2(X_i)$, and the semiparametrically efficient estimator of $\beta_0$ has a complicated structure. If one assumes that $E(u_i^2|X_i) = \sigma^2(X_i'\beta_0)$, that is, that the conditional variance depends only on the single index, then the choice of $w(X_i) = 1/\sigma^2(X_i'\beta_0)$ can lead to a semiparametrically efficient estimator for $\beta_0$. However, this choice is infeasible since, in practice, $\sigma^2(X_i'\beta_0)$ is unknown. One could therefore adopt a two-step procedure as follows. Suppose that the conditional variance is a function of $X_i'\beta_0$ only. In the first step, use $w(X_i) = 1$ to obtain a $\sqrt{n}$-consistent estimator of $\beta_0$, say $\hat\beta_1$; then, using $\hat u_i = Y_i - \hat g(X_i'\hat\beta_1)$, one can obtain a consistent nonparametric estimator of $\sigma^2(X_i'\beta_0)$, say $\hat\sigma^2(X_i'\hat\beta_1) = \widehat{\mathrm{var}}(\hat u_i|X_i'\hat\beta_1)$. In the second step, choose $w(X_i) = 1/\hat\sigma^2(X_i'\hat\beta_1)$ to estimate $\beta_0$ again. Provided that $\hat\sigma^2(v) - \sigma^2(v)$ converges to zero at a particular rate uniformly over $v \in D_v$ ($D_v$ is the support of $X_i'\beta_0$), the resulting second step estimator will be semiparametrically efficient.

For what follows below, we ignore the trimming set $A_h$ and the weight function $w(\cdot)$, and also assume that the minimization over $\beta$ is, in fact, done over a shrinking set $B_n = \left\{\beta : \|\beta - \beta_0\| \le c\,n^{-1/2}\right\}$ for some $c > 0$. This type of argument is used by Härdle, Hall and Ichimura (1993).
The requirement that $\beta$ lie in a set with $\|\beta - \beta_0\| = O(n^{-1/2})$ may appear to be an overly strong assumption; however, given Ichimura's (1993) result, we know that $\hat\beta$ is a $\sqrt{n}$-consistent estimator of $\beta_0$, and we can look for a minimum of $S(\beta)$ for $\beta$ within $O(n^{-1/2})$ distance from $\beta_0$. With this assumption, together with the assumption that $h \in H_n = \left\{h : C_1 n^{-1/5} \le h \le C_2 n^{-1/5}\right\}$ for some $C_2 > C_1 > 0$, an alternative proof can establish the result of Theorem 8.1 as follows. First, it can be shown that the nonparametric sum of squared residuals can be written as (we set $w_i = 1$ and omit the trimming indicator function for notational simplicity)

$$\hat S_n(\beta) = \frac{1}{n}\sum_{i=1}^n \left(Y_i - \hat G_{-i}(X_i'\beta)\right)^2 = S(\beta) + T_n + o_p(1), \qquad (8.9)$$

where $S(\beta) = n^{-1}\sum_i \left(Y_i - G(X_i'\beta)\right)^2$, where

$$T_n = \frac{1}{n}\sum_{i=1}^n \left[\hat G_{-i}(X_i'\beta_0) - g(X_i'\beta_0)\right]^2$$

is a term that is independent of $\beta$, and where $o_p(1)$ denotes terms of order $o_p(1)$ uniformly in $\beta \in B_n$. The proof of (8.9) can be found in Härdle et al. (1993), who in fact consider a more general setting whereby they choose $\beta$ and $h$ simultaneously by the cross-validation method.

Since $T_n$ does not depend on $\beta$, minimizing (8.9) with respect to $\beta$ is asymptotically equivalent to minimizing $S(\beta)$. Letting $\bar\beta$ denote the value of $\beta$ that minimizes $S(\beta)$, then, using Taylor expansion arguments, we can easily show that (see Section 8.12) $\bar\beta$ satisfies the following first order condition:

$$W_n\left(\bar\beta - \beta_0\right) = V_n + (\text{s.o.}), \qquad (8.10)$$

where

$$W_n = \frac{1}{n}\sum_{i=1}^n \left[g^{(1)}(X_i'\beta_0)\right]^2\left[X_i - E(X_i|X_i'\beta_0)\right]\left[X_i - E(X_i|X_i'\beta_0)\right]'$$

and

$$V_n = \frac{1}{n}\sum_{i=1}^n g^{(1)}(X_i'\beta_0)\left[X_i - E(X_i|X_i'\beta_0)\right]u_i.$$

Hence, by a standard law of large numbers and CLT argument, we have

$$\sqrt{n}\left(\bar\beta - \beta_0\right) = W_n^{-1}\,\sqrt{n}\,V_n + o_p(1) \to N(0, \Omega), \qquad (8.11)$$

where $\Omega$ is the same as $\Omega_I$ except that $w(X_i)$ is replaced by 1.

Up to now we have restricted ourselves to the normalization rule $\|\beta\| = 1$. We could also use a different normalization rule if desired. Instead of assuming that $\beta$ has unit length, we can assume that the first component of $\beta$ is one. Assume that the first component of $x$ is a continuous variable, and write $X_i = (X_{i1}, \tilde X_i')'$, where $\tilde X_i$ is $X_i$ without its first component.
Therefore, a higher order kernel ($\nu > 2$) is needed when $q > 2$. In the case of a single index model we need $q \ge 2$ since, otherwise, $\beta$ is not identified. Powell et al. (1989) prove the following result:

$$\sqrt{n}\left(\hat\delta - \delta\right) \to N\left(0, \Omega_{PSS}\right), \qquad (8.18)$$

where $\Omega_{PSS} = 4E\left[\sigma^2(X)f^{(1)}(X)f^{(1)}(X)'\right] + 4\,\mathrm{var}\left(f(X)g^{(1)}(X)\right)$. A normalized vector can be obtained via $\hat\delta/\|\hat\delta\|$. The proof of (8.18) requires the use of U-statistic decompositions for variable kernel functions (some of these are collected in Appendix A).

In Chapter 2 we discussed using local polynomial methods to estimate an unknown conditional mean function and its derivatives. Since $\hat\delta$ given in (8.15) is based on the derivative of a local constant kernel estimator of $E(Y|x)$, we could also use the local polynomial method to estimate $\partial E(Y|x)/\partial x$. This approach is considered by Li, Lu and Ullah (2003). Letting $\hat g^{(1)}(X_i)$ denote the kernel estimator of $g^{(1)}(X_i)$ obtained from an $m$th order local polynomial regression, Li, Lu and Ullah suggest using

$$\hat\delta_{LLU} = \frac{1}{n}\sum_{i=1}^n \hat g^{(1)}(X_i) \qquad (8.19)$$

to estimate $\bar\delta = E\left[g^{(1)}(X)\right]$. $\hat\delta_{LLU}$ estimates $\bar\delta$ directly without assuming that $f(x) = 0$ at the boundary of the support of $X$. In this sense, it is similar to the estimator defined in (8.16). However, the price of not imposing the condition $f(x) = 0$ at the boundary of its support is that, in Li, Lu and Ullah, one usually has to assume that the support of $X$ is a compact set and that the density $f(x)$ is bounded below by a positive constant on the support of $X$; i.e., their conditions rule out the case of unbounded support. In the case of unbounded support, one needs to introduce a trimming function to trim out small values of the random denominator. With the assumptions of bounded support and the density being bounded away from zero on its support, one can avoid the use of a trimming function. Li, Lu and Ullah use the uniform convergence rate results of Masry (1996a) to handle the random denominator.
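The average derivative idea can be sketched compactly. Below is a density-weighted estimator in the spirit of Powell et al. (1989), $\hat\delta = -(2/n)\sum_i Y_i\,\nabla\hat f_{-i}(X_i)$, using a product Gaussian kernel, for which $\nabla K(u) = -u\,K(u)$. A second order kernel is used here purely for illustration (with $q = 2$ the kernel-order requirement above is at its boundary), and the toy data are our own.

```python
import numpy as np

def density_weighted_ade(y, X, h):
    """Density-weighted average derivative in the spirit of Powell et al.
    (1989): delta_hat = -(2/n) sum_i y_i * grad f_hat_{-i}(X_i),
    with a product Gaussian kernel (grad K(u) = -u K(u))."""
    n, q = X.shape
    U = (X[:, None, :] - X[None, :, :]) / h              # u_ij = (X_i - X_j)/h
    K = np.exp(-0.5 * (U ** 2).sum(axis=2)) / (2 * np.pi) ** (q / 2)
    np.fill_diagonal(K, 0.0)                             # leave-one-out
    gradK = -U * K[:, :, None]                           # grad K(u_ij)
    gradf = gradK.sum(axis=1) / ((n - 1) * h ** (q + 1)) # grad f_hat_{-i}(X_i)
    return -2.0 * (y[:, None] * gradf).mean(axis=0)

# toy check: for linear g the direction of delta is the direction of beta0
rng = np.random.default_rng(2)
X = rng.normal(size=(800, 2))
beta0 = np.array([1.0, 2.0])
delta_hat = density_weighted_ade(X @ beta0, X, h=0.5)
```

Only the direction of $\hat\delta$ is meaningful for the index model, which is why the normalization $\hat\delta/\|\hat\delta\|$ is applied before comparing with $\beta_0$.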
Under some smoothness and moment conditions similar to those used in Powell et al. (1989), but with their boundary assumption replaced by compact support and the density being bounded away from zero on its support, the use of a second order kernel, and bandwidth conditions of the form $nh^{2(m+1)} \to 0$ and $nh^{2q}/\ln(n) \to \infty$ as $n \to \infty$, where $m$ is the order of the polynomial in the local polynomial estimation and $q$ is the dimension of $x$, Li, Lu and Ullah establish the following result:

$$\sqrt{n}\left(\hat\delta_{LLU} - \bar\delta\right) \to N\left(0, \Sigma + \mathrm{var}\left(g^{(1)}(X)\right)\right), \qquad (8.20)$$

where $\Sigma = E\left[\sigma^2(X)f^{(1)}(X)f^{(1)}(X)'/f^2(X)\right]$ and $\bar\delta = E\left[g^{(1)}(X)\right]$.

The reason why $\hat\delta$ and $\hat\delta_{LLU}$ differ in their variances is that $\hat\delta$ estimates $\delta = E\left[f(X)g^{(1)}(X)\right]$, while $\hat\delta_{LLU}$ estimates $\bar\delta = E\left[g^{(1)}(X)\right]$. While neither $\hat\delta$ nor $\hat\delta_{LLU}$ given above is normalized to have unit length, should normalization be applied, the variances of the normalized vectors obtained from $\hat\delta$ and $\hat\delta_{LLU}$ would be identical. This is what one should expect given the result of Newey (1994a), who shows that the asymptotic variance of a $\sqrt{n}$-consistent estimator in a semiparametric model should not depend on the specific nonparametric estimation method used. Indeed, Newey also shows that the asymptotic variance of an average derivative estimator remains the same should one use nonparametric series methods rather than kernel methods. We discuss series methods in Chapter 15.

The proof of (8.20) is similar to the proof of (8.18) in that one uses the U-statistic decomposition with variable kernels developed in Powell et al. (1989) and makes extensive use of results contained in Masry (1996a). Rather than repeating the detailed proofs contained in Li, Lu and Ullah (2003) here, we provide a brief outline of the proof of (8.20) intended to provide readers with an intuitive understanding. Defining $\bar\delta_n = n^{-1}\sum_{i=1}^n g^{(1)}(X_i)$, we write

$$\sqrt{n}\left(\hat\delta_{LLU} - \bar\delta\right) = \sqrt{n}\left(\hat\delta_{LLU} - \bar\delta_n\right) + \sqrt{n}\left(\bar\delta_n - \bar\delta\right).$$

Note that $\sqrt{n}\left(\bar\delta_n - \bar\delta\right) \to N\left(0, \mathrm{var}\left(g^{(1)}(X)\right)\right)$ by the Lindeberg-Levy CLT. It can be shown that $\sqrt{n}\left(\hat\delta_{LLU} - \bar\delta_n\right) \to N(0, \Sigma)$ in distribution.
Finally, the two terms are asymptotically independent. Hence,

$$\sqrt{n}\left(\hat\delta_{LLU} - \bar\delta\right) \to N\left(0, \Sigma + \mathrm{var}\left(g^{(1)}(X)\right)\right).$$

Hristache, Juditsky and Spokoiny (2001) propose an iterative procedure that improves upon the original (noniterative) average derivative estimator. Their idea is to use prior information about the vector $\beta$ to improve the quality of the gradient estimate by extending a weighting kernel in the direction of small directional derivatives, and they demonstrate that the whole procedure requires at most $2\log(n)$ iterations. The resulting estimator is $\sqrt{n}$-consistent under relatively mild assumptions.

8.3.2 Estimation of $g(\cdot)$

We use $\hat\beta_n$ to denote a generic $\sqrt{n}$-consistent estimator of $\beta_0$ (i.e., it could be $\hat\beta$, $\bar\beta$, $\hat\delta$, $\hat\delta_{LLU}$, or $\tilde\delta$ defined earlier). With $\hat\beta_n$, we can estimate $E(Y|x) = g(x'\beta_0)$ by

$$\hat g(x'\hat\beta_n) = \frac{\sum_{j=1}^n Y_j\,K\!\left(\frac{x'\hat\beta_n - X_j'\hat\beta_n}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x'\hat\beta_n - X_j'\hat\beta_n}{h}\right)}. \qquad (8.21)$$

Since $\hat\beta_n - \beta_0 = O_p(n^{-1/2})$, which converges to zero faster than standard nonparametric estimators, the asymptotic distribution of $\hat g(x'\hat\beta_n)$ is the same as in the case with $\hat\beta_n$ replaced by $\beta_0$. Therefore, it is covered by Theorem 2.2 of Chapter 2 with $q = 1$ (since $v = x'\beta_0$ is a scalar). Thus, we have:

Corollary 8.1. Assume that $\hat\beta_n - \beta_0 = O_p(n^{-1/2})$; then under conditions similar to those given in Theorem 2.2, we have

$$\sqrt{nh}\left(\hat g(x'\hat\beta_n) - g(x'\beta_0) - B(x'\beta_0)\right) \to N\left(0, \kappa\,\sigma^2(x'\beta_0)/f(x'\beta_0)\right), \qquad (8.22)$$

where $\kappa = \int K^2(v)\,dv$ and $B(x'\beta_0)$ is defined immediately following (2.8).

The direct average derivative estimation method discussed above is applicable only when $x$ is a $q$-vector of continuous variables, since the derivative with respect to discrete variables is not defined. Horowitz and Härdle (1996) discuss how direct (noniterative) estimation can be generalized to cases for which some components of $x$ are discrete. Horowitz (1998) provides an excellent overview of this method, and we refer readers to the detailed discussions contained in Horowitz and Härdle (1996) and Horowitz (1998, pp. 41-48).
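Given any $\sqrt{n}$-consistent $\hat\beta_n$, the second-stage fit (8.21) is just a one-dimensional kernel regression of $Y$ on the fitted index. A sketch (Gaussian kernel, no trimming, bandwidth chosen by hand):

```python
import numpy as np

def index_fit(x_new, y, X, beta_hat, h):
    """Estimate g(x'beta0) via the one-dimensional Nadaraya-Watson
    regression of Y on the fitted index X'beta_hat, as in (8.21).
    Gaussian kernel for illustration."""
    v = X @ beta_hat                           # in-sample index values
    v_new = np.atleast_2d(x_new) @ beta_hat    # evaluation index values
    u = (v_new[:, None] - v[None, :]) / h
    K = np.exp(-0.5 * u**2)
    return (K @ y) / np.maximum(K.sum(axis=1), 1e-12)
```

Because the regression runs on the scalar index rather than on $x$ itself, this step is free of the curse of dimensionality regardless of $q$.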
The advantage of the direct average derivative estimation method is that one can estimate $\beta_0$ and $g(x'\beta_0)$ directly without using nonlinear iteration procedures. This computational simplicity is particularly attractive in large-sample settings; however, it gives rise to a potential finite-sample problem. The direct estimators all involve multidimensional nonparametric estimation in the initial estimation stage, and we know that nonparametric estimators suffer from the curse of dimensionality. In the second stage, the multidimensional nonparametric estimator is averaged over all sample points, resulting in a consistent estimator of $\beta_0$. Therefore, asymptotically, the curse of dimensionality problem disappears because the second stage estimator has a parametric rate of convergence, and the dimension of $x$ does not affect the rate of convergence of the average derivative estimator obtained at the second stage. However, in finite-sample applications, inaccurate estimation in the first stage may affect the accuracy of the second stage estimator unless the sample size is sufficiently large. Therefore, with relatively large sample sizes, the direct estimator is more attractive due to its computational ease, while in small-sample settings the iterative method of Ichimura (1993) may be more appealing, as it avoids having to conduct high-dimensional nonparametric estimation.

Carroll, Fan, Gijbels and Wand (1997) proposed a method related both to the material in Chapter 7 (i.e., partially linear models) and in this chapter (i.e., single index models). In particular, Carroll et al. considered the problem of estimating a general partially linear single index model which contains both a partially linear model and a single index model as special cases.
8.4 Bandwidth Selection

8.4.1 Bandwidth Selection for Ichimura's Method

Ichimura's (1993) approach estimates $\beta_0$ by the (iterative) nonlinear least squares method, where the unknown function $g(X_i'\beta)$ is replaced by the nonparametric kernel estimator $\hat G_{-i}(X_i'\beta)$ defined in (8.7). The smoothing parameter is chosen to satisfy the conditions $nh^8 \to 0$ and $\ln(n)/\left[nh^{3+6/(m-3)}\right] \to 0$ as $n \to \infty$, where $m \ge 3$ is a positive integer whose specific value depends on the existence of a number of finite moments of $Y$ along with the smoothness of the unknown function $g(\cdot)$. The range of permissible smoothing parameters allows for optimal smoothing, i.e., $h = O(n^{-1/5})$. Therefore, Härdle et al. (1993) suggest selecting $h$ and $\beta$ simultaneously by the nonlinear least squares cross-validation method of Ichimura. Specifically, they suggest choosing $\beta$ and $h$ simultaneously by minimizing

$$\widehat{CV}(\beta, h) = \sum_{i=1}^n \left[Y_i - \hat G_{-i}(X_i'\beta, h)\right]^2\mathbf{1}(X_i \in A_\delta), \qquad (8.23)$$

where $\hat G_{-i}(X_i'\beta, h) = \hat G_{-i}(X_i'\beta)$ is defined in (8.7), and $A_\delta$ is the trimming set introduced earlier.

Under some regularity conditions, including $Y$ having finite moments of any order, the use of a second order kernel function, the unknown functions $g(x'\beta)$ and $p(x'\beta)$ being twice differentiable, $f(\cdot)$ being bounded away from zero on $A_\delta$, and the assumptions that $\beta \in B_n = \left\{\beta : \|\beta - \beta_0\| \le C_3 n^{-1/2}\right\}$ and $h \in H_n = \left[C_1 n^{-1/5}, C_2 n^{-1/5}\right]$, where $C_1$, $C_2$ and $C_3$ are three positive constants with $C_2 > C_1$, Härdle et al. (1993) show that $\widehat{CV}(\beta, h)$ has the following decomposition:

$$\widehat{CV}(\beta, h) = M(\beta) + T(h) + \{\text{terms of smaller order than } T(h) \text{ and not dependent on } \beta\}$$
$$\quad + \{\text{terms of smaller order than either } M(\beta) \text{ or } T(h)\}, \qquad (8.24)$$

where

$$M(\beta) = \sum_i \left[Y_i - g(X_i'\beta)\right]^2\mathbf{1}(X_i \in A_\delta), \qquad T(h) = \sum_i \left[\hat G_{-i}(X_i'\beta_0, h) - g(X_i'\beta_0)\right]^2\mathbf{1}(X_i \in A_\delta),$$

and where $\hat G_{-i}(X_i'\beta_0, h)$ is defined in (8.7) with $\beta$ replaced by $\beta_0$. Therefore, minimizing $\widehat{CV}(\beta, h)$ simultaneously over both $(\beta, h) \in B_n \times H_n$ is equivalent to first minimizing $M(\beta)$ over $\beta \in B_n$ and then minimizing $T(h)$ over $h \in H_n$.

Let $(\hat\beta, \hat h)$ denote the estimators that minimize (8.23).
The asymptotic distribution of $\hat\beta$ is given by Theorem 8.1. For $\hat h$, with the use of a second order kernel function, it is easy to show that the leading term of the nonstochastic objective function $E[T(h)]$ equals $A_1h^4 + A_2(nh)^{-1}$ by the standard MSE calculation for a nonparametric estimator discussed in Chapter 2, where $A_1$ and $A_2$ are two positive constants. Hence, $h_0 = \left[A_2/(4A_1)\right]^{1/5}n^{-1/5} = O(n^{-1/5})$ minimizes $A_1h^4 + A_2(nh)^{-1}$. Härdle et al. (1993) show that $\hat h/h_0 \to 1$ in probability.

We now briefly compare the regularity conditions used in Ichimura (1993) with those in Härdle et al. (1993). The regularity conditions used to prove Theorem 8.1 and to establish (8.24) are somewhat different. For example, the conditions used in Theorem 8.1 require a higher order kernel, while for deriving (8.11), Härdle et al. use a second order kernel with an optimal smoothing parameter $h = O(n^{-1/5})$. In (8.24), the minimization is done within a restricted shrinking set $(\beta, h) \in B_n \times H_n$, so that $\beta - \beta_0 = O(n^{-1/2})$. This condition makes the kernel estimation bias smaller than it would be under the conditions of Theorem 8.1; thus one can establish (8.11) without resorting to higher order kernels. Another difference in regularity conditions is that a stronger moment condition, namely $Y$ having finite moments of any order, is used to obtain (8.11). This is because Härdle et al. need to establish that the smaller order terms in (8.11) hold uniformly in $B_n \times H_n$. Demonstrating uniform rates of convergence usually requires stronger moment conditions, while in Theorem 8.1, $h$ is treated as being nonstochastic and minimization is done with respect to $\beta$ only; therefore, a weaker moment condition suffices.

8.4.2 Bandwidth Selection with Direct Estimation Methods

For the direct average derivative estimation method, the estimation of $\beta_0$ involves the $q$-dimensional multivariate nonparametric estimation of the first order derivatives. Härdle and Tsybakov (1993) suggest choosing the smoothing parameters $a_1, \ldots, a_q$
that minimize the MSE of $\hat\delta$, that is, choosing $h$ to minimize $E\|\hat\delta - \delta\|^2$. Härdle and Tsybakov show that the asymptotically optimal bandwidth is of the form (for all $s = 1, \ldots, q$)

$$a_s = c_s\,n^{-2/(2\nu + q + 2)},$$

where $c_s$ is a constant, $\nu$ is the order of the kernel, and $q$ is the dimension of $x$. Powell and Stoker (1996) provide a method for estimating $c_s$, while Horowitz (1998, pp. 50-52) considers selecting $a_s$ based on bootstrap resampling.

Having selected $a_s$ optimally, one can then obtain an average derivative estimator of $\beta_0$. Using $\hat\beta_n$ to denote a generic estimator, one then estimates $E(Y|x) = g(x'\beta_0)$ by $\hat g(x'\hat\beta_n, h) = \hat g(x'\hat\beta_n)$ defined in (8.21). The smoothing parameter $h$ associated with the scalar index $x'\hat\beta_n$ can be selected by least squares cross-validation, i.e., by choosing $h$ to minimize $\sum_{i=1}^n \left[Y_i - \hat g_{-i}(X_i'\hat\beta_n, h)\right]^2$. Under some regularity conditions, the cross-validated selection of $h$ is of order $O_p(n^{-1/5})$.

One can also combine a partially linear model and a single index model to obtain a "partially linear single index model" of the form $E(Y|X, Z) = X'\alpha + g(Z'\beta)$. For the estimation of a partially linear single index model see Carroll et al. (1997), Xia, Tong and Li (1999), and Liang and Wang (2005).

8.5 Klein and Spady's Estimator

When the single index model is derived from the binary choice model (8.2), and under the assumption that $\epsilon_i$ and $X_i$ are independent, Klein and Spady (1993) suggested estimating $\beta$ by maximum likelihood methods. The estimated log-likelihood function is

$$\mathcal{L}(\beta) = \sum_{i=1}^n \left\{(1 - Y_i)\ln\left[1 - \hat g_{-i}(X_i'\beta)\right] + Y_i\ln\left[\hat g_{-i}(X_i'\beta)\right]\right\}, \qquad (8.25)$$

where $\hat g_{-i}(X_i'\beta)$ is defined in (8.7). Maximizing (8.25) with respect to $\beta$ leads to the semiparametric maximum likelihood estimator of $\beta$, say $\hat\beta_{KS}$, proposed by Klein and Spady. As with Ichimura's (1993) estimator, the maximization must be performed numerically by solving the first order conditions obtained from (8.25).
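The estimated log-likelihood (8.25) differs from Ichimura's criterion only in the objective: the same leave-one-out index regression supplies $\hat p_i = \hat g_{-i}(X_i'\beta)$. A sketch without trimming follows; clipping the estimated probabilities away from 0 and 1 is our own numerical expedient, not part of the original procedure.

```python
import numpy as np

def klein_spady_loglik(beta, y, X, h, eps=1e-6):
    """Estimated log-likelihood (8.25): sum over i of
    (1 - Y_i) ln(1 - p_i) + Y_i ln(p_i), where p_i is the leave-one-out
    kernel estimate of P(Y = 1 | X_i'beta). Gaussian kernel, no trimming."""
    v = X @ beta
    u = (v[:, None] - v[None, :]) / h
    K = np.exp(-0.5 * u**2)
    np.fill_diagonal(K, 0.0)                 # leave-one-out as in (8.7)
    p = (K @ y) / np.maximum(K.sum(axis=1), 1e-12)
    p = np.clip(p, eps, 1 - eps)             # keep the logarithms finite
    return np.sum((1 - y) * np.log(1 - p) + y * np.log(p))
```

Maximizing this criterion over $\beta$ (subject to a normalization such as $\|\beta\| = 1$) yields $\hat\beta_{KS}$; the likelihood at the true index direction should exceed that at an uninformative direction.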
Under some regularity conditions, including the introduction of a trimming function to trim out observations near the boundary of the support of $X_i$ and the use of higher order kernels, Klein and Spady (1993) showed that $\hat\beta_{KS}$ is $\sqrt{n}$-consistent and has an asymptotic normal distribution given by

$$\sqrt{n}\left(\hat\beta_{KS} - \beta_0\right) \to N\left(0, \Omega_{KS}^{-1}\right), \qquad (8.26)$$

where

$$\Omega_{KS} = E\left[\frac{1}{P(1 - P)}\left(\frac{\partial P}{\partial\beta}\right)\left(\frac{\partial P}{\partial\beta}\right)'\right]$$

and $P = P(\epsilon_i \le x'\beta) = F_{\epsilon|x}(x'\beta)$, where $F_{\epsilon|x}(\cdot)$ is the CDF of $\epsilon_i$ conditional on $X_i = x$.

Klein and Spady (1993) also showed that their proposed estimator is semiparametrically efficient in the sense that the asymptotic variance of their estimator attains the semiparametric efficiency bound.

We now compare the asymptotic variance of the semiparametric estimator, $V_{KS} = \Omega_{KS}^{-1}$, with its parametric counterpart. The parametric model has two additional parameters, $\eta = (\alpha, \sigma)'$. Partitioning the parametric information matrix conformably with $(\beta, \eta)$, the asymptotic variance for the parametric estimator $\sqrt{n}(\hat\beta_{par} - \beta)$ is

$$V_{par} = \left(\Omega_{\beta\beta} - \Omega_{\beta\eta}\Omega_{\eta\eta}^{-1}\Omega_{\eta\beta}\right)^{-1}.$$

Comparing this with $V_{KS}$, one can show that (e.g., Pagan and Ullah (1999, p. 278)) if $E(X_i|X_i'\beta) = c_0 + c_1(X_i'\beta)$, where $c_0$ and $c_1$ are two $q \times 1$ vectors of constants, then $V_{KS} = V_{par}$. That is, the semiparametric estimator is asymptotically as efficient as the parametric nonlinear least squares estimator based on the known true functional form of $g(\cdot)$ when $E(X_i|X_i'\beta)$ is linear in $X_i'\beta$ ("first order efficiency"). This is similar to the partially linear model case. However, when $E(X_i|X_i'\beta)$ is not a linear function of $X_i'\beta$, it can be shown that $V_{KS} - V_{par}$ is positive definite. Therefore, the semiparametric estimator is asymptotically less efficient than the parametric nonlinear least squares estimator based on the true function $g(\cdot)$: the asymptotic variance difference $V_{KS} - V_{par}$ is positive definite unless $E(X_i|X_i'\beta) = c_0 + c_1(X_i'\beta)$. Moreover, in this case the result cannot be improved, since $V_{KS}$ already attains the semiparametric efficiency bound.
The efficiency loss of the semiparametric model relative to a parametric estimator arises because the functional form of g(·) (or, equivalently, of F_{ε|x}(·)) is unknown. Of course, in practice the true functional form of g(·) is typically unknown, in which case the semiparametric estimator is robust to functional form misspecification of g(·).

8.6 Lewbel's Estimator

Lewbel (2000) considers the following binary choice model:

    Y_i = 1( v_i + X_i'β + ε_i > 0 ),

where v_i is a (special) continuous regressor whose coefficient is normalized to one and X_i is of dimension q. Let f(v|x) denote the conditional density of v_i given X_i, and let F_ε(ε|v, x) denote the conditional CDF of ε given (v_i, X_i). Under the condition that F_ε(ε|v, x) = F_ε(ε|x), that is, conditional on x, ε is independent of the special regressor v, and that E(X_i ε_i) = 0, Lewbel shows that

    β = [ E(X_i X_i') ]^{-1} E(X_i Ỹ_i),    (8.27)

where Ỹ_i = [Y_i − 1(v_i > 0)] / f(v_i|X_i).

Equation (8.27) suggests that one can estimate β by regressing Ỹ_i on X_i. Ỹ_i involves the unknown quantity f(v_i|X_i), which can be consistently estimated using, say, the nonparametric kernel methods discussed in Chapter 5. Letting β̂ denote the resulting feasible estimator of β, Lewbel (2000) establishes the √n-normality of his proposed estimator.

Lewbel (2000) further extends his result to the case where ε_i and X_i are correlated, so that E(ε_i X_i) ≠ 0. Assuming that there exists a p-vector of instrumental variables Z_i such that E(ε_i Z_i) = 0, E(Z_i X_i') is nonsingular, and F_{ε,x}(ε, x|v, z) = F_{ε,x}(ε, x|z), where F_{ε,x}(ε, x|·) denotes the distribution function of (ε, x) conditional on the data ·, Lewbel shows that

    β = [ E(Z_i X_i') ]^{-1} E(Z_i Ỹ_i).

Hence one can estimate β by replacing the above expectations with sample means and by replacing f(v_i|Z_i) with a consistent estimator thereof.

The above approach can be extended to cover the ordered response model defined as

    Y_i = Σ_{j=0}^{J-1} j · 1( a_j < v_i + X_i'β + ε_i ≤ a_{j+1} ),    (8.28)

where a₀ = −∞ and a_J = +∞.
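To make (8.27) concrete, here is a minimal simulation sketch. For simplicity, v is drawn independently of X with a known density, so the true f(v|x) is plugged in where a kernel estimate would appear in practice; the finite (uniform) support of v is an assumption of the example and introduces a small truncation bias relative to Lewbel's large-support condition.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
beta = np.array([0.5, 1.0])                 # (intercept, slope) to be recovered
X = np.column_stack([np.ones(n), rng.normal(size=n)])
v = rng.uniform(-5.0, 5.0, size=n)          # "special" regressor, coefficient one
eps = rng.logistic(size=n)                  # independent of X, so E[X eps] = 0
Y = (v + X @ beta + eps > 0).astype(float)

# Lewbel's transformed outcome: (Y - 1(v > 0)) / f(v | X).
# Here v is independent of X with known density 1/10 on (-5, 5);
# in practice f(v|X) would be replaced by a kernel estimate.
f_v = np.full(n, 1.0 / 10.0)
Y_tilde = (Y - (v > 0)) / f_v

# beta_hat = [n^{-1} sum X X']^{-1} n^{-1} sum X Y_tilde, per (8.27)
beta_hat = np.linalg.solve(X.T @ X / n, X.T @ Y_tilde / n)
```

The transformed outcome Ỹ is very noisy (it takes values of magnitude 1/f(v|X)), which is why a fairly large n is used in this sketch.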
The response variable Y_i takes values in {0, 1, ..., J−1}, and Y_i = j if v_i + X_i'β + ε_i lies between a_j and a_{j+1}. Let X_{i1} = 1 (the intercept) and, without loss of generality, let a₁ = 0 (since otherwise one can redefine a_j as a_j − β₁). Let Y_{ij} = 1(Y_i ≥ j) for j = 1, ..., J−1, and define A = [E(X_i X_i')]^{-1}. Also, let A_j equal the jth row of A. Lewbel (2000) then shows that each threshold a_j and the slope vector β can be written as linear combinations, via the rows of A, of the moment vectors

    E( X_i [Y_{ij} − 1(v_i > 0)] / f(v_i|X_i) ),   j = 1, ..., J−1

(his equations (8.29) and (8.30)). From (8.29) and (8.30), one can easily obtain feasible estimators for the a_j's and β's, i.e., by replacing the expectations with sample means and by replacing f(v_i|X_i) with a consistent estimator. Lewbel (2000) establishes the asymptotic normality results for the resulting estimators. Lewbel further shows that his results can be extended to deal with multinomial choices, partially linear latent variable models, and threshold and censored regression models.

8.7 Manski's Maximum Score Estimator

Manski's (1975) maximum score¹ estimator involves choosing β to maximize the following objective function:

    S_n(β) = (1/n) Σ_{i=1}^n [ Y_i 1(X_i'β ≥ 0) + (1 − Y_i) 1(X_i'β < 0) ].    (8.31)

This estimator seeks to maximize the number of correct predictions. For Y_i = 1, the ith term equals one if X_i'β ≥ 0 and zero if X_i'β < 0: a correct prediction gets weight one, and an incorrect prediction gets weight zero. Similarly, for Y_i = 0, the ith term equals one if X_i'β < 0 and zero otherwise, so again a correct prediction gets weight one and an incorrect prediction gets weight zero. Manski (1975) demonstrates strong consistency of β̂ assuming that median(Y_i|X_i) = X_i'β (or median(ε_i|X_i) = 0), that the first component of β is one, and that the first component of X_i is a continuous variable. Kim and Pollard (1990) show that the maximum score estimator has a convergence rate of n^{-1/3}, not the usual n^{-1/2}. Because the objective function is discontinuous, the standard Taylor series expansion method of asymptotic theory cannot be applied to the maximum score estimator.
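A minimal sketch of the sample objective (8.31), under the normalization that the coefficient on the first (continuous) regressor is one; because the objective is a step function, gradient-based optimizers do not apply, so a grid search is used here. The simulated design and grid are assumptions made for the example.

```python
import numpy as np

def score(beta2, x, y):
    # sample maximum-score objective (8.31): fraction of correct sign
    # predictions, with the coefficient on x1 normalized to one
    index = x[:, 0] + beta2 * x[:, 1]
    return np.mean(y * (index >= 0) + (1 - y) * (index < 0))

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=(n, 2))
y = (x[:, 0] + 0.5 * x[:, 1] + rng.logistic(size=n) > 0).astype(float)  # median(eps|x) = 0

grid = np.linspace(-1.0, 2.0, 301)
scores = np.array([score(b, x, y) for b in grid])
beta2_hat = grid[int(np.argmax(scores))]
```

The flat, piecewise-constant shape of `scores` over the grid illustrates why the estimator converges at only the cube-root rate and why its limit distribution is nonstandard.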
Kim and Pollard show that the limiting distribution of n^{1/3} (β̂_m-score − β) is that of the maximizer of a multidimensional Brownian motion with quadratic drift. This asymptotic distribution is quite complex and therefore quite inconvenient for applied inference. Manski and Thompson (1986) propose using a bootstrap procedure to approximate the asymptotic distribution of β̂_m-score. This bootstrap method is easy to implement, and the simulations reported in Manski and Thompson show that their proposed bootstrap method works well in finite-sample applications.

¹Note that Manski uses the term "score" as in "keeping score," e.g., the score of a baseball game, and not the statistical score function (i.e., the sum of gradients of a log-likelihood function).

8.8 Horowitz's Smoothed Maximum Score Estimator

Though Manski (1975) demonstrates that his maximum score estimator is consistent under fairly weak distributional assumptions, it has a slow rate of convergence and a complicated asymptotic distribution. Horowitz (1992) proposes a modified maximum score estimator obtained by maximizing a smoothed version of Manski's score function, which has a rate approaching n^{-1/2} depending on the strength of certain smoothness assumptions. In essence, the problems associated with Manski's approach are due mainly to the lack of continuity of the indicator function used in (8.31). Horowitz proposes replacing the indicator function 1(A) with a smooth, continuously differentiable function that retains the essential features of the indicator function. Horowitz proposes estimating β by maximizing the following smooth objective function:

    S_n(β) = (1/n) Σ_{i=1}^n (2Y_i − 1) G(X_i'β / h),    (8.32)

where G(·) is a p-times continuously differentiable CDF, h = h_n > 0, and h → 0 as n → ∞. It is easy to see that G(X_i'β/h) → 1(X_i'β > 0) as h → 0. For example, one can choose G(z) = ∫_{−∞}^z k(v) dv, where k(·) is a (p − 1)-times differentiable kernel function.
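The smoothed objective (8.32) can be sketched as follows; the logistic CDF plays the role of G(·) here (an assumption — any sufficiently smooth CDF-like function works), and the fixed h = 0.2 is chosen for illustration rather than by the rate rule h = (c/n)^{1/(2p+1)} discussed below.

```python
import numpy as np

def smoothed_score(beta2, x, y, h):
    # Horowitz's objective (8.32): the indicator 1(index > 0) is replaced
    # by a smooth CDF G(index / h); here G is the logistic CDF
    index = x[:, 0] + beta2 * x[:, 1]
    G = 1.0 / (1.0 + np.exp(-index / h))
    return np.mean((2 * y - 1) * G)

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=(n, 2))
y = (x[:, 0] + 0.5 * x[:, 1] + rng.logistic(size=n) > 0).astype(float)

grid = np.linspace(-1.0, 2.0, 301)
vals = np.array([smoothed_score(b, x, y, h=0.2) for b in grid])
beta2_hat = grid[int(np.argmax(vals))]
```

Unlike the raw maximum score objective, this function is smooth in β, so in applications derivative-based optimizers can be used in place of the grid search.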
Under some regularity assumptions, including smoothness conditions on the conditional PDF of y given x'β, Horowitz shows that his proposed smoothed maximum score estimator satisfies β̂_sm-score − β = O_p((nh)^{-1/2}) and has an asymptotic normal distribution. If h = (c/n)^{1/(2p+1)} for some 0 < c < ∞, then β̂_sm-score − β has a rate of convergence equal to n^{-p/(2p+1)}, which can be made arbitrarily close to n^{-1/2} for sufficiently large p.

8.9 Han's Maximum Rank Correlation Estimator

Rather than maximizing the score function given in (8.31), Han (1987) considers maximizing the rank correlation between the binary outcome Y_i and the index function X_i'β. This is achieved by maximizing

    G_n(β) = [n(n−1)/2]^{-1} Σ_{i<j} [ 1(Y_i > Y_j) 1(X_i'β > X_j'β) + 1(Y_i < Y_j) 1(X_i'β < X_j'β) ],    (8.33)

with the summation being taken over all n(n−1)/2 combinations of the distinct elements {i, j}. A simple principle motivates this estimator. The monotonicity of F and the independence of the ε_i's and the X_i's ensure that

    P(Y_i > Y_j | X_i, X_j) ≥ P(Y_i < Y_j | X_i, X_j)   whenever   X_i'β₀ ≥ X_j'β₀.

That is, when X_i'β₀ > X_j'β₀, more likely than not Y_i > Y_j. Han (1987) shows that E[G_n(β)] is maximized at β = β₀, the true value of β. Han further establishes the strong consistency of his proposed maximum rank correlation (MRC) estimator, but did not provide the asymptotic distribution of the MRC estimator. The main difficulty in determining the limiting distribution of the MRC estimator is the nonsmooth nature of the objective function G_n(β). Note that G_n(β) is a second order U-statistic (or U-process, indexed by β). Using a U-statistic decomposition and a uniform bound for degenerate U-processes, Sherman (1993) shows that the MRC estimator is √n-consistent and has an asymptotic normal distribution.
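A sketch of the objective (8.33) on simulated data. Summing the single agreement indicator over all ordered pairs (i, j) counts each of the two terms in (8.33) exactly once per unordered pair, so one boolean matrix suffices; the design and grid search are assumptions made for the example.

```python
import numpy as np

def rank_corr(beta2, x, y):
    # Han's MRC objective over ordered pairs (i, j): the pair (i, j) with
    # Y_i > Y_j and index_i > index_j covers the first term of (8.33),
    # and the reversed pair (j, i) covers the second; ties contribute zero
    index = x[:, 0] + beta2 * x[:, 1]
    gy = y[:, None] > y[None, :]             # gy[i, j] = 1(Y_i > Y_j)
    gi = index[:, None] > index[None, :]     # gi[i, j] = 1(X_i'b > X_j'b)
    return np.mean(gy & gi)

rng = np.random.default_rng(8)
n = 800
x = rng.normal(size=(n, 2))
y = (x[:, 0] + 0.5 * x[:, 1] + rng.logistic(size=n) > 0).astype(float)

grid = np.linspace(-1.0, 2.0, 121)
vals = np.array([rank_corr(b, x, y) for b in grid])
beta2_hat = grid[int(np.argmax(vals))]
```

The O(n²) pairwise structure of the objective is exactly the second order U-process character noted above.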
Up to now we have considered a semiparametric binary choice model in which the distribution of the error term is modeled nonparametrically and the linear index x'β is the parametric part of the model. Matzkin (1992) considers a more general binary choice model without imposing any parametric structure on either the systematic function of the observed exogenous variables or on the distribution of the random error term. Matzkin provides identification conditions and demonstrates consistency of her nonparametric maximum likelihood estimator.

8.10 Multinomial Discrete Choice Models

Pagan and Ullah (1999, pp. 296-299) provide an excellent overview of the semiparametric estimation of multinomial discrete choice models. For what follows, consider the case where an individual faces J (J ≥ 2) choices. Define Y_ij = 1 if individual i selects alternative j (j = 1, ..., J) and Y_ij = 0 otherwise. Let F_ij = P(Y_ij = 1|X_i) = E(Y_ij|X_i); then the multiple choice equation is

    Y_ij = F_ij + ε_ij,    (8.34)

and the log-likelihood function is

    L = Σ_{i=1}^n Σ_{j=1}^J Y_ij ln F_ij.    (8.35)

Parametric approaches specify a functional form for F_ij. For example, to obtain a multinomial Logit model one sets

    F_ij = exp(X_ij'β) / Σ_{l=1}^J exp(X_il'β).

The semiparametric approach sets F_ij = E(Y_ij|X_i) = E(Y_ij|v_{i1}, ..., v_{iJ}) = g(v_{i1}, ..., v_{iJ}), where the functional form of g(·) is unknown, and where v_ij = X_ij'β_j. The estimation procedure is similar to the semiparametric single index model case discussed earlier. Ichimura and Lee (1991) extend Ichimura's (1993) method to the multi-index case and derive the asymptotic distribution of their semiparametric least squares estimator. Lee (1995) proposes using semiparametric maximum likelihood methods to estimate the multi-index model (8.35). Ai (1997) considers a general semiparametric maximum likelihood approach which covers many semiparametric models, such as multi-index models and partially linear models, as special cases.
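The multinomial Logit probabilities above can be computed with a numerically stabilized softmax; the index values below are arbitrary illustrative numbers, not taken from the text.

```python
import numpy as np

def multinomial_logit_probs(V):
    # V[i, j] = X_ij' beta, the index for individual i and alternative j;
    # returns F[i, j] = exp(V_ij) / sum_l exp(V_il)
    V = V - V.max(axis=1, keepdims=True)   # stabilize the exponentials
    E = np.exp(V)
    return E / E.sum(axis=1, keepdims=True)

V = np.array([[0.2, 1.0, -0.5],    # individual 1: alternative 2 most attractive
              [0.0, 0.0, 0.0]])    # individual 2: indifferent across alternatives
F = multinomial_logit_probs(V)
```

Subtracting the row maximum before exponentiating leaves the probabilities unchanged while preventing overflow for large index values.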
We now turn to Ai's general approach.

8.11 Ai's Semiparametric Maximum Likelihood Approach

Ai (1997) considers a general semiparametric maximum likelihood estimation approach. Let q(Y|X, θ₀, g₀) denote the conditional PDF of Y given X, where θ₀ is a finite dimensional parameter (the parametric part) and g₀(·) is an infinite dimensional unknown function. Ai further assumes that the conditional density satisfies an index restriction, i.e., that there exist some known functions v₁(z, θ), v₂(x, θ) such that

    q(y|x, θ₀, g₀) = J(z, θ₀) f( v₁(z, θ₀) | v₂(x, θ₀), θ₀ ),    (8.36)

where f(·|·, θ) is the conditional density of v₁ given v₂ for arbitrary θ, and J(z, θ) is a known Jacobian of the transformation from y to v₁(z, θ).

Model (8.36) contains many familiar semiparametric models as special cases. For example, consider a partially specified regression model η₁(Y) = X₁'θ₁ + η₂(X₂'θ₂ + X₃) + u, where η₁(·) and η₂(·) are functions of unknown form, and u is independent of x = (X₁, X₂, X₃) and has an unknown density η₃(·). Here θ = (θ₁, θ₂) is the parametric part of the model, and η = (η₁, η₂, η₃) is the nonparametric part of the model. The conditional density of y given x in this example is

    f(v₁|v₂) = η₃[ η₁(y) − X₁'θ₁ − η₂(X₂'θ₂ + X₃) ] |η₁^{(1)}(y)|,    (8.37)

which is also the conditional density of v₁ = Y given v₂ = (X₁'θ₁, X₂'θ₂ + X₃), where η₁^{(1)}(y) denotes the derivative of η₁(y) with respect to y, and |η₁^{(1)}(y)| is the Jacobian.

For the partially linear multi-index model considered in Ichimura and Lee (1991), η₁(Y) = Y (hence the Jacobian J = 1), v₁ = Y − X₁'θ₁, and v₂ = X₂'θ₂, and for this case (8.37) becomes (with Jacobian J = 1)

    f(v₁|v₂) = η₃( Y − X₁'θ₁ − η₂(X₂'θ₂) ),    (8.38)

where η₃(·) is the density of u = Y − X₁'θ₁ − η₂(X₂'θ₂), and η₂(X₂'θ₂) = E[Y − X₁'θ₁ | X₂'θ₂]. If one sets θ₁ = 0 in (8.38), it collapses to Ichimura's (1993) single index model discussed earlier in this chapter.
For the partially linear model considered in Chapter 7, η₁(Y) = Y, v₁ = Y − X₁'θ₁, θ₂ = 0, v₂ = X₃, and (8.37) becomes (J = 1)

    f(v₁|v₂) = η₃( Y − X₁'θ₁ − η₂(X₃) ),

where η₃(·) is the density of u = Y − X₁'θ₁ − η₂(X₃) and η₂(X₃) = E[Y − X₁'θ₁ | X₃].

We now turn to the issue of estimating this class of models. Define m(z, θ, f) = ∂ ln[q(y|x, θ, f)]/∂θ. If the functional form of f were known, we could estimate θ by solving for θ from the score equation

    S_n(θ) = Σ_{i=1}^n m(Z_i, θ, f) = 0.    (8.39)

The √n-normality of the resulting θ̃ follows from standard law of large numbers and CLT arguments. When f is unknown, Ai (1997) suggests estimating it using nonparametric kernel methods. Letting f₁(v(z, θ), θ) and f₂(v₂(x, θ), θ) denote the joint density of v = (v₁, v₂) and the marginal density of v₂, respectively, we have

    f( v₁(z, θ) | v₂(x, θ), θ ) = f₁(v(z, θ), θ) / f₂(v₂(x, θ), θ).

Substituting this into m(z, θ, f) = ∂ ln[q(y|x, θ, f)]/∂θ gives m(z, θ, f) = m₁(z, θ, f₁, f₂) − m₂(z, θ, f₁, f₂), where

    m₁(z, θ, f₁, f₂) = [∂f₁(v(z, θ), θ)/∂θ] / f₁(v(z, θ), θ) + ∂ ln|J(z, θ)|/∂θ

and

    m₂(z, θ, f₁, f₂) = [∂f₂(v₂(x, θ), θ)/∂θ] / f₂(v₂(x, θ), θ).

Letting q₁ and q₂ denote the dimensions of v₁ and v₂, respectively, a kernel estimator of f_i(θ) = f( v₁(Z_i, θ) | v₂(X_i, θ), θ ) is given by

    f̂( v₁(Z_i, θ) | v₂(X_i, θ), θ ) = f̂₁(v(Z_i, θ), θ) / f̂₂(v₂(X_i, θ), θ),    (8.40)

where

    f̂₁(v(Z_i, θ), θ) = [ (n−1) h^{q₁+q₂} ]^{-1} Σ_{j≠i} K( (v(Z_j, θ) − v(Z_i, θ))/h )    (8.41)

and

    f̂₂(v₂(X_i, θ), θ) = [ (n−1) h^{q₂} ]^{-1} Σ_{j≠i} K( (v₂(X_j, θ) − v₂(X_i, θ))/h ),    (8.42)

with K(·) a product kernel, K(u) = Π_s k(u_s).

Finally, one estimates θ₀, say by θ̂, by solving the following first order condition:

    Ŝ_n(θ) = Σ_{i=1}^n m(Z_i, θ, f̂(θ)) = 0.    (8.43)

For the single index model example, the Jacobian J = 1, θ₂ = 0, v₁ = Y, and v₂ = X'β, so that f̂₁ and f̂₂ are, respectively, kernel estimators of the joint density of (Y, X'β) and the marginal density of X'β. Substituting these into m = m₁ − m₂, we get exactly the same first order condition as for Klein and Spady's (1993) estimator;¹² we leave the verification of this as an exercise (see Exercise 8.4).
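The ratio estimator (8.40)–(8.42) can be sketched for scalar v₁ and v₂ as follows. Gaussian kernels, fixed bandwidths, and the particular simulated joint distribution are assumptions made for the example, and the leave-one-out sums and the dependence on θ are suppressed for clarity.

```python
import numpy as np

def cond_density(v1_grid, v2_0, v1, v2, h1, h2):
    # f_hat(v1 | v2 = v2_0) = f1_hat(v1, v2_0) / f2_hat(v2_0): the joint and
    # marginal kernel density estimators of (8.41)-(8.42), then their ratio
    k = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    w2 = k((v2 - v2_0) / h2)                       # kernel weights in v2
    f2 = w2.mean() / h2                            # marginal density at v2_0
    f1 = np.array([(k((v1 - g) / h1) * w2).mean() / (h1 * h2) for g in v1_grid])
    return f1 / f2

rng = np.random.default_rng(4)
n = 5000
v2 = rng.normal(size=n)
v1 = 0.5 * v2 + rng.normal(size=n)                 # so v1 | v2 ~ N(0.5 v2, 1)
grid = np.linspace(-3.0, 3.0, 61)
fhat = cond_density(grid, v2_0=1.0, v1=v1, v2=v2, h1=0.3, h2=0.3)
```

With v2_0 = 1 the true conditional density is N(0.5, 1), so the estimate should be a smooth unimodal curve peaking near 0.5 and integrating to roughly one over the grid.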
Ai (1997) shows that the feasible estimator θ̂ has the same asymptotic distribution as the infeasible estimator θ̃, and that θ̂ is semiparametrically efficient in the sense that the inverse of its asymptotic variance equals the semiparametric efficiency bound for this type of semiparametric model.

8.12 A Sketch of the Proof of Theorem 8.1

We first derive (8.45). Note that the leave-one-out estimator decomposes as

    ĝ_{-i}(X_i'β) = Ê[g(X_i'β₀)|X_i'β] + Ê(u_i|X_i'β)
                  = E[g(X_i'β₀)|X_i'β] + Ê(u_i|X_i'β) + o_p(n^{-1/2}),    (8.44)

where Ê(·|X_i'β) denotes the leave-one-out kernel regression on the index (the negligibility of the difference Ê[g(X_i'β₀)|X_i'β] − E[g(X_i'β₀)|X_i'β] is the subject of Exercises 8.2 and 8.3).

¹²For expositional simplicity, we omitted the trimming functions; see Ai (1997, p. 938) for the details on introducing trimming functions.

Substituting (8.44) into S(β) = Σ_{i=1}^n [g(X_i'β₀) + u_i − ĝ_{-i}(X_i'β)]², we obtain

    S(β) = (β₀ − β)' W₀ (β₀ − β) − 2 V₀'(β₀ − β) + Σ_{i=1}^n u_i² + o_p(1),    (8.45)

where W₀ and V₀ are defined in (8.10). The first order condition of (8.45) yields (8.10).

It can be shown that the leading terms can be obtained by replacing Ê[g(X_i'β₀)|X_i'β] with E[g(X_i'β₀)|X_i'β] in (8.9). We have, uniformly in β ∈ B_n,

    S_{1n}(β) = Σ_{i=1}^n { g(X_i'β₀) − E[g(X_i'β₀)|X_i'β] }²
              + 2 Σ_{i=1}^n u_i { g(X_i'β₀) − E[g(X_i'β₀)|X_i'β] }
              + {terms independent of β} + o_p(1).    (8.46)

Since β ∈ B_n, we have |β − β₀| = O(n^{-1/2}). Using a Taylor expansion of g(X_i'β₀) around β₀ = β (used twice below), we have

    g(X_i'β₀) − E[g(X_i'β₀)|X_i'β]
        = g^{(1)}(X_i'β) { X_i − E(X_i|X_i'β) }' (β₀ − β) + O_p(n^{-1}).    (8.47)

Substituting (8.47) into (8.46), we get

    S_{1n}(β) = (β₀ − β)' [ Σ_{i=1}^n (g_i^{(1)})² v_i v_i' ] (β₀ − β)
              + 2 Σ_{i=1}^n u_i g_i^{(1)} v_i' (β₀ − β)
              + {terms independent of β} + o_p(1),    (8.48)

where g_i^{(1)} = g^{(1)}(X_i'β) and v_i = X_i − E(X_i|X_i'β).

Minimizing S_{1n}(β) with respect to β, and ignoring the terms that are independent of β and the o_p term, we obtain the first order condition

    2 [ Σ_{i=1}^n (g_i^{(1)})² v_i v_i' ] (β̂ − β₀) = 2 Σ_{i=1}^n u_i g_i^{(1)} v_i,    (8.49)
which leads to

    √n (β̂ − β₀) = [ n^{-1} Σ_{i=1}^n (g_i^{(1)})² v_i v_i' ]^{-1} n^{-1/2} Σ_{i=1}^n u_i g_i^{(1)} v_i
                 = [ n^{-1} Σ_{i=1}^n (g_{0i}^{(1)})² v_{0i} v_{0i}' ]^{-1} n^{-1/2} Σ_{i=1}^n u_i g_{0i}^{(1)} v_{0i} + o_p(1),    (8.50)

where g_{0i}^{(1)} = g^{(1)}(X_i'β₀) and v_{0i} = X_i − E(X_i|X_i'β₀), and the second equality in (8.50) uses the fact that β̂ − β₀ = O_p(n^{-1/2}).

A standard law of large numbers and the Lindeberg-Levy CLT argument leads to

    √n (β̂ − β₀) → N(0, V₀) in distribution,    (8.51)

where V₀ = Ω₀^{-1} Σ₀ Ω₀^{-1}, with Ω₀ = E[ (g_{0i}^{(1)})² v_{0i} v_{0i}' ] and Σ₀ = E[ σ²(X_i) (g_{0i}^{(1)})² v_{0i} v_{0i}' ]. Equation (8.51) matches the result of Theorem 8.1 if one sets w(X_i) = 1 there.

8.13 Applications

8.13.1 Modeling Response to Direct Marketing Catalog Mailings

Direct marketing is often used to target individuals who, on the basis of observable characteristics such as demographics and their past purchase decisions, are most likely to be repeat customers. For example, one might think of mailing catalogs only to those who are highly likely to be repeat customers or who most "closely resemble" repeat customers.³

The success or failure of direct marketing, however, hinges directly upon the ability to identify those consumers who are most likely to make a purchase.

Racine (2002) considers an industry-standard database obtained from the Direct Marketing Association (DMA)⁴ that contains data on a reproduction gift catalog company, "an upscale gift business that mails general and specialized catalogs to its customer base several times each year." The base time period covers December 1971 through June 1992. Data collected by the DMA includes orders, purchases in each of fourteen product groups, time of purchase, and purchasing methods. A three month gap then occurs in the data, after which customers in the existing database were sent at least one catalog in early Fall 1992. Then from September 1992 through December 1992 the database was updated.

³Bult and Wansbeek (1995), in a profit maximization framework, point out that in fact one might want to do the opposite, thereby saving costs by avoiding repeated
This provides a setting in which models can be constructed for the base time period and then evaluated on the later time period. A random subset of 4,500 individuals from the first time period was created, and various methods for predicting the likelihood of a consumer purchase were undertaken, followed by evaluation of the forecast accuracy for the independent hold-out sample consisting of 1,500 randomly selected individuals drawn from the later time period.

Parametric index models (Logit, Probit) and semiparametric index models (Ichimura (1993); Ichimura and Lee (1991)) were fitted to the estimation sample of size n₁ = 4,500 and evaluated on the hold-out sample of size n₂ = 1,500.

Data Description

We have two independent estimation and evaluation datasets of sizes n₁ = 4,500 and n₂ = 1,500, respectively, having one record per customer. We restrict our attention to one product group and thereby select the middle of the 14 product groups, group 8. The variables involved in the study are listed below, while their properties are summarized in Tables 8.1 and 8.2.

(i) Response: whether or not a purchase was made

mailings to those who are in fact highly likely to make a purchase. Regardless of the objective, it is the ability to identify those most likely to make a purchase that has proven problematic in the past.

⁴This database contains customer buying history for about 100,000 customers of nationally known catalog and nonprofit database marketing businesses.
Table 8.1: Summary of the estimation dataset (n₁ = 4,500)

    Variable         |  Mean | Std Dev | Min | Max
    Response         |  0.09 |  0.28   |  0  |   1
    LTDFallOrders    |  1.36 |  1.38   |  0  |  15
    LastPurchSeason  |  1.62 |  0.53   | -1  |   2
    Orders4YrsAgo    |  0.26 |  0.55   |  0  |   5
    LTDPurchGrp8     |  0.09 |  0.31   |  0  |   4
    DateLastPurch    | 37.31 | 27.34   |  0  | 117

Table 8.2: Summary of the evaluation dataset (n₂ = 1,500)

    Variable         |  Mean | Std Dev | Min | Max
    Response         |  0.08 |  0.27   |  0  |   1
    LTDFallOrders    |  1.32 |  1.38   |  0  |
    LastPurchSeason  |  1.63 |  0.51   | -1  |   2
    Orders4YrsAgo    |  0.25 |  0.52   |  0  |
    LTDPurchGrp8     |  0.08 |  0.29   |  0  |
    DateLastPurch    | 36.44 | 26.95   |  0  | 116

(ii) LTDFallOrders: life-to-date Fall orders

(iii) LastPurchSeason: the last season in which a purchase was made⁵

(iv) Orders4YrsAgo: orders made in the latest four years

(v) LTDPurchGrp8: life-to-date purchases in product group 8

(vi) DateLastPurch: the date of the last purchase⁶

Each model was evaluated in terms of its out-of-sample performance based upon the measure of McFadden, Puig and Kirschner (1977),⁷

⁵This is recoded in the database as 1 if the purchase was made in January through June, 2 if the purchase was made in July through December, and -1 if no purchase was made.

⁶12/71 was recorded as 0, 1/72 as 1, and so on.

⁷p₁₁ + p₂₂ − p₁₂ − p₂₁, where p_ij is the ijth entry in the 2×2 confusion matrix expressed as a fraction of the sum of all entries. A "confusion matrix" is simply a
tabulation of the actual outcomes versus those predicted by a model. The diagonal elements contain correctly predicted outcomes while the off-diagonal ones contain incorrectly predicted (confused) outcomes.

Table 8.3: Confusion matrix and classification rates for the Logit model

                        | Predicted Nonpurchase | Predicted Purchase
    Actual Nonpurchase  |         1378          |         5
    Actual Purchase     |          108          |         9

    Predictive performance                     84.93%
    Overall correct classification rate        92.47%
    Correct nonpurchase classification rate    99.64%
    Correct purchase classification rate        7.69%

Table 8.4: Confusion matrix and classification rates for the semiparametric index model

                        | Predicted Nonpurchase | Predicted Purchase
    Actual Nonpurchase  |         1361          |        22
    Actual Purchase     |           75          |        42

    Predictive performance                     87.07%
    Overall correct classification rate        93.53%
    Correct nonpurchase classification rate    98.41%
    Correct purchase classification rate       35.90%

along with the correct purchase classification rate.⁸ Results for the Logit model⁹ are presented in the form of a confusion matrix given in Table 8.3, while those for the semiparametric index model are given in Table 8.4.

The semiparametric single index model delivers better predictive performance for the hold-out data than the parametric Logit model. Note also that, although the parametric models appear to perform fairly well in terms of McFadden et al.'s (1977) measure, they yield poor predictions for the class of persons who actually make a purchase.

⁸The fraction of correctly predicted purchases among those who actually made a purchase.

⁹Results for the Probit model are not as good as those for the Logit model and are omitted for space considerations.

8.14 Exercises

Exercise 8.1. (i) Suppose that Y is a {0,1} binary variable. Show that P(Y = 1|x) = E(Y|x).

(ii) If Y is a binary variable taking values in {1,2}, is P(Y = 1|x) still equal to E(Y|x) or not?

Exercise 8.2.
In deriving (8.51), for the second term on the right-hand side of (8.50), we used the facts that

(i) n^{-1} Σ_i u_i Ê(u_i|X_i'β) = terms independent of β + o_p(n^{-1}), and that

(ii) n^{-1} Σ_i u_i { Ê[g(X_i'β₀)|X_i'β] − E[g(X_i'β₀)|X_i'β] } = o_p(n^{-1}),

where

    Ê(u_i|X_i'β) = (nh)^{-1} Σ_{j≠i} u_j K( (X_i − X_j)'β/h ) / f̂(X_i'β),

    Ê[g(X_i'β₀)|X_i'β] = (nh)^{-1} Σ_{j≠i} g(X_j'β₀) K( (X_i − X_j)'β/h ) / f̂(X_i'β),

and

    f̂(X_i'β) = (nh)^{-1} Σ_{j≠i} K( (X_i − X_j)'β/h ).

Prove (i) and (ii).

Hint: For (i), apply a Taylor series expansion at β = β₀ and use β̂ − β₀ = O_p(n^{-1/2}). The first term in the Taylor expansion will be independent of β. The second term is a second order U-statistic, and is of order o_p(n^{-1}) by the H-decomposition.

Exercise 8.3. Prove that

    n^{-1} Σ_{i=1}^n [ g(X_i'β₀) − Ê(g(X_i'β₀)|X_i'β) ]²
      = n^{-1} Σ_{i=1}^n [ g(X_i'β₀) − E(g(X_i'β₀)|X_i'β) ]²
      + terms independent of β + o_p(n^{-1}).

Hint: Write

    Ê(g(X_i'β₀)|X_i'β) = E(g(X_i'β₀)|X_i'β)
      + [ Ê(g(X_i'β₀)|X_i'β) − E(g(X_i'β₀)|X_i'β) ] + Ê(u_i|X_i'β),

where Ê(g(X_i'β₀)|X_i'β) and Ê(u_i|X_i'β) are defined in Section 8.12.

Exercise 8.4. Verify that when applying Ai's (1997) general approach to the single index model with binary variable y, Ai's first order condition coincides with that of Klein and Spady (1993) as given in (8.52).

Hint: Klein and Spady's (1993) estimator solves the first order condition (see Pagan and Ullah (1999, p. 283)) Σ_{i=1}^n m̂_i(β) = 0, where

    m̂_i(β) = [ ∂ĝ(X_i'β)/∂β ] { ĝ(X_i'β)[1 − ĝ(X_i'β)] }^{-1} [ Y_i − ĝ(X_i'β) ],    (8.52)

with ĝ(X_i'β) the leave-one-out kernel estimator of E(Y_i|X_i'β).

Chapter 9

Additive and Smooth (Varying) Coefficient Semiparametric Models

In this chapter we consider some popular semiparametric regression models that appear in the literature. The rationale for using these, rather than fully nonparametric, models is to reduce the dimension of the nonparametric component in order to mitigate the curse of dimensionality.
Of course, these models are liable to the same critique of functional misspecification as are fully parametric models, but they have proven to be extremely popular in applied settings and tend to be simpler to interpret than fully nonparametric models.

9.1 An Additive Model

We first consider the semiparametric additive model

    Y_i = c₀ + g₁(Z_{1i}) + g₂(Z_{2i}) + ... + g_q(Z_{qi}) + u_i,    (9.1)

where c₀ is a scalar parameter, the Z_{si}'s are all univariate continuous variables, and the g_s(·) (s = 1, ..., q) are unknown smooth functions. The observations {Y_i, Z_{1i}, ..., Z_{qi}}_{i=1}^n are presumed to be i.i.d.

For kernel-based methods, two approaches are commonly used for estimating an additive model: the "backfitting" method (see Buja, Hastie and Tibshirani (1989) and Hastie and Tibshirani (1990)) and the "marginal integration" method independently proposed by Linton and Nielsen (1995), Newey (1994), and Tjøstheim and Auestad (1994); see also Chen, Härdle, Linton and Severance-Lossin (1996) and Linton (1997, 2000). Due to its iterative nature, the backfitting method is more difficult to analyze than the marginal integration method, hence we begin with a discussion of the relatively simple marginal integration method.

9.1.1 The Marginal Integration Method

First, note that for any constant c, [g₁(z₁) + c] + [g₂(z₂) − c] = g₁(z₁) + g₂(z₂). Therefore, we need some identification conditions for the g_s(·) functions to be identified. With kernel methods it is convenient to impose the condition E[g_s(Z_{si})] = 0 so that the individual components g_s(·) are identified (s = 1, ..., q). This also leads to E(Y_i) = c₀.

Let Z̄_{αi} = (Z_{1i}, ..., Z_{α−1,i}, Z_{α+1,i}, ..., Z_{qi}). That is, Z̄_{αi} is obtained from (Z_{1i}, Z_{2i}, ..., Z_{qi}) by removing Z_{αi}. Define Ḡ_α(Z̄_{αi}) = g₁(Z_{1i}) + ... + g_{α−1}(Z_{α−1,i}) + g_{α+1}(Z_{α+1,i}) + ... + g_q(Z_{qi}). With this notation, (9.1) can be written as

    Y_i = c₀ + g_α(Z_{αi}) + Ḡ_α(Z̄_{αi}) + u_i.    (9.2)

Define ξ(z_α, Z̄_{αj}) = E(Y_j | Z_{αj} = z_α, Z̄_{αj}). Applying E[· | Z_{αj} = z_α,
Z̄_{αj}] to both sides of (9.2), we get

    ξ(z_α, Z̄_{αj}) = c₀ + g_α(z_α) + Ḡ_α(Z̄_{αj}).    (9.3)

Define m_α(z_α) = E[ξ(z_α, Z̄_{αj})], and note that the marginal expectation is with respect to Z̄_{αj}. Applying this expectation (with respect to Z̄_{αj}) to (9.3), we obtain

    m_α(z_α) = c₀ + g_α(z_α),    (9.4)

where we have used E[Ḡ_α(Z̄_{αj})] = 0. From (9.4), we get

    g_α(z_α) = m_α(z_α) − E[m_α(Z_{αj})].    (9.5)

The above approach is infeasible. However, a feasible estimator can be obtained by replacing expectations by sample means, marginal expectation (integration) by marginal averaging, and conditional mean functions by local linear kernel estimators. Specifically, a feasible counterpart of (9.5) is

    ĝ_α(z_α) = m̂_α(z_α) − n^{-1} Σ_{j=1}^n m̂_α(Z_{αj}),    (9.6)

where

    m̂_α(z_α) = n^{-1} Σ_{j=1}^n ξ̂(z_α, Z̄_{αj}),    (9.7)

and ξ̂(z_α, Z̄_{αj}) = â is the solution for a in the following minimization problem:

    min_{a,b} Σ_{t=1}^n [ Y_t − a − (Z_{αt} − z_α) b ]² k_{h_α}(Z_{αt} − z_α) K_{h̄}(Z̄_{αt} − Z̄_{αj}).

Note that ξ̂(z_α, Z̄_{αj}) is a kernel estimator of E[Y_t | Z_{αt} = z_α, Z̄_{αt} = Z̄_{αj}] which applies the local linear method only to the component z_α, and the local constant method to z̄_α. Fan, Härdle and Mammen (1998) assume that k(·) is a univariate second order kernel, and K(·) is a νth order product kernel, where ν is a positive integer satisfying ν > (q − 1)/2. Therefore, a higher order kernel is needed (ν > 2) if q ≥ 4; see Section 1.11 regarding the construction of higher order kernel functions. The following theorem gives the asymptotic distribution of m̂_α(z_α); for expositional simplicity, we assume that h₁ = ... = h_{α−1} = h_{α+1} = ... = h_q = h̄. Fan et al. prove the following result.

Theorem 9.1. Under the conditions given in Fan et al. (1998), and also assuming that n h_α h̄^{q−1}/ln n → ∞, √(n h_α) h̄^ν → 0, h_α → 0 and h̄ → 0, then

    √(n h_α) { m̂_α(z_α) − c₀ − g_α(z_α) − (h_α²/2) κ₂ g_α^{(2)}(z_α) } → N(0, v_α(z_α))

in distribution, where g_α^{(2)}(·) is the second order derivative of g_α(·), κ₂ = ∫ u² k(u) du, and v_α(z_α) is a variance term depending on κ = ∫ k(v)² dv, the conditional variance σ²(z) = var(Y|Z = z), and the densities f(·) and f̄_α(·) (see Fan et al. (1998) for its exact form).

Theorem 9.1 shows that ĝ_α(z_α) achieves the one-dimensional optimal rate of convergence.

A disadvantage of the marginal integration method is that it is computationally costly. One has to estimate E[Y_t | Z_{αt} = Z_{αi}, Z̄_{αt} = Z̄_{αj}] for all i, j = 1, ..., n, which entails a computational burden of order n², whereas the computational cost of estimating a nonadditive model is only of order n. In the next section we discuss a computationally efficient method for estimating additive models.

9.1.2 A Computationally Efficient Oracle Estimator

The marginal integration method discussed above is computationally costly as it requires one to estimate ξ̂(Z_{αi}, Z̄_{αj}) for all i, j = 1, ..., n. Kim, Linton and Hengartner (1999) propose an alternative method which reduces the estimation time to order n. Consider the nonparametric regression of Y on Z_α:

    E(Y | Z_α = z_α) = c₀ + g_α(z_α) + Σ_{s≠α} E[g_s(Z_s) | Z_α = z_α],    (9.8)

which shows that E(Y | Z_α = z_α) is a biased estimator of c₀ + g_α(z_α) due to the presence of Σ_{s≠α} E[g_s(Z_s) | Z_α = z_α]. Kim et al. suggest choosing an instrumental variable w_α(z) such that

    E[w_α(Z) | Z_α = z_α] = 1   and   E[w_α(Z) g_s(Z_s) | Z_α = z_α] = 0 for s ≠ α.    (9.9)

Then one would obtain E[w_α(Z_i) Y_i | Z_{αi} = z_α] = c₀ + g_α(z_α). It is easy to check that w_α(z) = f_α(z_α) f̄_α(z̄_α)/f(z_α, z̄_α) is such a function, where f̄_α(z̄_α) is the joint PDF of Z̄_α = (Z₁, ..., Z_{α−1}, Z_{α+1}, ..., Z_q). In fact, for any random variable ξ_i, we have

    E[w_α(Z_i) ξ_i | Z_{αi} = z_α] = ∫ ξ w_α(z_α, z̄_α) f(z̄_α|z_α) dz̄_α
        = ∫ ξ [ f_α(z_α) f̄_α(z̄_α)/f(z_α, z̄_α) ] [ f(z_α, z̄_α)/f_α(z_α) ] dz̄_α
        = ∫ ξ f̄_α(z̄_α) dz̄_α,    (9.10)

which is exactly the marginal integration of ξ over the z̄_α components. Replacing ξ in (9.10) by ξ = 1, we obtain E[w_α(Z_i)|Z_{αi}] = ∫ f̄_α(z̄_α) dz̄_α = 1. Also, if we replace ξ by g_s(Z_{si}) for s ≠ α, then E[w_α(Z_i) g_s(Z_{si})|Z_{αi}] = ∫ g_s(z_s) f̄_α(z̄_α) dz̄_α = E[g_s(Z_{si})] = 0. Thus, (9.9) holds true.
It is easy to see that

    g̃_α(z_α) = [ Σ_{i=1}^n k_{h_α}(Z_{αi} − z_α) ŵ_α(Z_i) Y_i ] / [ Σ_{i=1}^n k_{h_α}(Z_{αi} − z_α) ]    (9.11)

is a consistent estimator of E[w_α(Z)Y | Z_α = z_α] = c₀ + g_α(z_α), where ŵ_α(z) = f̂_α(z_α) f̂̄_α(z̄_α)/f̂(z_α, z̄_α). Note that g̃_α(z_α) can be interpreted as a one-dimensional standard local constant estimator obtained by regressing the adjusted outcome Y_{αi}^† = Y_i f̂_α(Z_{αi}) f̂̄_α(Z̄_{αi})/f̂(Z_{αi}, Z̄_{αi}) on Z_{αi}.

Since g̃_α(z_α) estimates c₀ + g_α(z_α), one can naturally estimate g_α(z_α) by ĝ_α(z_α) = g̃_α(z_α) − n^{-1} Σ_{j=1}^n g̃_α(Z_{αj}). The leading bias, variance, and asymptotic normal distribution of ĝ_α(z_α) are given in Theorem 1 of Kim et al. (1999). Kim et al. show that the asymptotic variance of ĝ_α(z_α) has an extra term compared with the asymptotic variance of the marginal integration estimator of g_α(z_α). Therefore, ĝ_α(z_α) is less efficient than the marginal integration estimator. Kim et al. further propose an efficient oracle estimator, which we discuss below.

An oracle (efficient) estimator for g_α(z_α) is defined as an estimator that has the same leading bias and variance terms as in the infeasible case where all other additive functions are known. Define Y_i^{α,or} = Y_i − Σ_{s≠α} g_s(Z_{si}) − c₀, and consider the univariate kernel estimator

    ĝ_α^{or}(z_α) = [ Σ_{i=1}^n k((Z_{αi} − z_α)/h_α) Y_i^{α,or} ] / [ Σ_{i=1}^n k((Z_{αi} − z_α)/h_α) ],    (9.12)

with h_α = c₁ n^{-1/5} (c₁ > 0 a constant). Then, using results derived in Chapter 2, we know that (using a second order kernel function)

    √(n h_α) { ĝ_α^{or}(z_α) − g_α(z_α) − h_α² B_α(z_α) } → N(0, V_α(z_α))    (9.13)

in distribution, where

    B_α(z_α) = κ₂ [ g_α^{(2)}(z_α)/2 + g_α^{(1)}(z_α) f_α^{(1)}(z_α)/f_α(z_α) ],

κ₂ = ∫ v² k(v) dv, and V_α(z_α) = κ σ_α²(z_α)/f_α(z_α), where κ = ∫ k(v)² dv and σ_α²(z_α) = var(Y^{α,or} | Z_α = z_α).
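A sketch of the oracle estimator (9.12), treating the other additive component as known and subtracting it from the outcome; the simulated data, the Gaussian kernel, and the rule-of-thumb bandwidth of order n^{-1/5} are assumptions made for the example.

```python
import numpy as np

def nw(z_eval, z, target, h):
    # univariate local constant (Nadaraya-Watson) smoother, as in (9.12)
    K = np.exp(-0.5 * ((z_eval[:, None] - z[None, :]) / h) ** 2)
    return K @ target / K.sum(axis=1)

rng = np.random.default_rng(9)
n = 500
Z = rng.uniform(-2, 2, size=(n, 2))
c0 = 1.0
g1 = np.sin(Z[:, 0])              # E[g1] = 0 on U(-2, 2)
g2 = Z[:, 1] ** 2 - 4.0 / 3.0     # centered so that E[g2] = 0 on U(-2, 2)
Y = c0 + g1 + g2 + 0.2 * rng.normal(size=n)

# oracle outcome for component 1: subtract the *known* other component and c0
Y_or = Y - g2 - c0
grid = np.linspace(-1.5, 1.5, 31)
h1 = 1.06 * Z[:, 0].std() * n ** (-1 / 5)   # rule-of-thumb h_alpha ~ n^{-1/5}
g1_oracle = nw(grid, Z[:, 0], Y_or, h1)
```

The two-step feasible estimator discussed next simply replaces the known g₂ and c₀ here with first-stage estimates computed at a smaller-order bandwidth.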
given above is not feasible because cp and the ge(zx)s are unknown for s # a. We can replace g4(z) by 4a(2s) and ey by Y. Thus, we can use Yar? = Y= S F(Z ha) + (q— IP me co, where to replace Yai = ¥i— Disa Ga(Zni) Fulenhe)= pdr (752) fay, as) cm is a consistent estimetor of Blus(Z)¥ 12s Leg Letting 92° be defined as was ge!" in (9.12) but with Yor being replaced by Y2-**?, Kim et al. (1990) showed that go" has the same leading bias and variance terms as gf"; therefore, it ns ‘the same asymptotic distribution. Kim etal. assume that hq = an—1/® for some a > 0 and that hp = o(n~¥/5). Note that here one chooses ‘hg to be of smaller order than hq. They show that if a local linear estimator is used, then Vita {42%*(z0) ~ gaa) ~ Zh2a1? (za) Fala)} (0, Valea) (0.15) in distribution, where the definition of Va(za) appears directly below (018). ‘Equation (9.15) shows that the feasible estimator g2°(zq) has the same first order efficiency as for the case where all other g.(2.)'s are known (for 8 # 0) oy + Gal2s), 59.1. AN ADDITIVE MODEL. 239 Kim et al. (1999) further show that a wild bootstrap procedure can be used to better approximate the fnite-sample distribution of ‘Ge"(-). Define an estimated additive function Badal #i Ras ho) = ¥ + 3 Galza5 has he), ost where Ga(zaj/osfo) = Gala) is the initial additive function esti- tmator of ga(a}- Then 1 is estimated by tx = ¥; ~ Gadel Zi haha) (i =1,...,1). Define the centered residual by ts = diy —n—! 9G ‘The bootstrap error uy is generated by the two-point wild bootstrap method, ie., uj = [(1+-¥5)/2}i with probability r = (1+-¥5)/(2v3) and uf = [(1— VB)/2}is with probability 1 —r, Next generate adil Zi Thay he) + ul, ‘where in the above another bandwidth hg is used. Kin etal. show that hig should have an order larger than hg for the bootstrap method to work. Kim et al. suggest choosing fhe ~n¥/, hy = o(ha)s and letting Fig assume values in [n“¥/>+¥,n~4] for some 0 < § < 1/5. One can ¢ the bootstrap sample {7,¥:}?.. 
to compute $\hat g_\alpha^{or,*}(z_\alpha)$, where $\hat g_\alpha^{or,*}(z_\alpha)$ is the same as $\hat g_\alpha^{or,f}(z_\alpha)$ except that one replaces $Y_i$ by $Y_i^*$ wherever it occurs. Kim et al. show that one can repeatedly generate, say, $B$ bootstrap estimates of $\hat g_\alpha^{or,*}(z_\alpha)$ and use their empirical distribution to approximate the finite-sample distribution of $\hat g_\alpha^{or,f}(z_\alpha)$.

9.1.3 The Ordinary Backfitting Method

By taking the conditional expectation of (9.1), conditional on $Z_{\alpha i} = z_\alpha$, we get ($\alpha = 1, \ldots, q$)

$$g_\alpha(z_\alpha) = E(Y_i \mid Z_{\alpha i} = z_\alpha) - c_0 - \sum_{s\neq\alpha} E[g_s(Z_{si}) \mid Z_{\alpha i} = z_\alpha]. \qquad (9.16)$$

Equation (9.16) suggests that an iterative procedure is appropriate. Let $\hat g_\alpha^{[0]}(z_\alpha)$ be some initial estimators for $g_\alpha(z_\alpha)$, say, $\hat g_\alpha^{[0]}(z_\alpha) = 0$, or let $\hat g_\alpha^{[0]}(z_\alpha)$ be the marginal integration estimator of $g_\alpha(z_\alpha)$. Also, let $\hat c_0 = \bar Y = n^{-1}\sum_{i=1}^n Y_i$. Then the iterative procedure is given as follows. For $\alpha = 1, \ldots, q$, and for $l = 1, 2, 3, \ldots$, compute the step-$l$ estimator $\hat g_\alpha^{[l]}(z_\alpha)$ by

$$\hat g_\alpha^{[l]}(z_\alpha) = \hat E(Y_i \mid Z_{\alpha i} = z_\alpha) - \hat c_0 - \sum_{s\neq\alpha} \hat E\left[\hat g_s^{[l-1]}(Z_{si}) \mid Z_{\alpha i} = z_\alpha\right], \qquad (9.17)$$

where, for a random variable $A_i$, $\hat E[A_i \mid Z_{\alpha i} = z_\alpha]$ is a (univariate) kernel estimator of $E(A_i \mid Z_{\alpha i} = z_\alpha)$. $\hat E[A_i \mid Z_{\alpha i} = z_\alpha]$ could be the local constant estimator given by $\sum_{i=1}^n A_i k_{i,h_\alpha}/\sum_{i=1}^n k_{i,h_\alpha}$, where $k_{i,h_\alpha} = k((Z_{\alpha i} - z_\alpha)/h_\alpha)$, or it could be a local linear estimator of $E(A_i \mid Z_{\alpha i} = z_\alpha)$.

Iteration ends when a prespecified convergence criterion is reached, say, when $\sum_{i=1}^n \left(\hat g_\alpha^{[l]}(Z_{\alpha i}) - \hat g_\alpha^{[l-1]}(Z_{\alpha i})\right)^2 / \sum_{i=1}^n \left(\hat g_\alpha^{[l-1]}(Z_{\alpha i})\right)^2$ is less than some small number, say, $10^{-4}$.

By using a local linear smoother in the above procedure, Opsomer and Ruppert (1998) and Opsomer (2000) showed that the estimation bias is of order $O\!\left(\sum_{s=1}^q h_s^2\right)$, and the leading variance term of $\hat g_\alpha(z_\alpha)$ is given by

$$\mathrm{var}\left(\hat g_\alpha(z_\alpha)\right) = \frac{\sigma^2 \int k^2(v)\,dv}{n h_\alpha f_\alpha(z_\alpha)}\left(1 + o(1)\right),$$

where $\sigma^2 = E(u_i^2 \mid Z_i) = E(u_i^2)$ (assuming conditionally homoskedastic errors).

We provide a more detailed discussion of the wild bootstrap method in Chapter 2.

9.1.4 The Smoothed Backfitting Method

Mammen, Linton and Nielsen (1999) propose a smoothed backfitting procedure for estimating the additive model (9.1). The idea is to project $Y$ (or $E(Y|Z)$) on the space of additive functions.
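Before developing smoothed backfitting, the ordinary backfitting iteration (9.17) of Section 9.1.3 can be sketched in code. This is a minimal illustration, not the book's implementation: it assumes a Gaussian kernel and a common bandwidth per regressor, and it exploits the fact that a local constant smoother is linear in its response, so smoothing the partial residual $Y_i - \hat c_0 - \sum_{s\neq\alpha}\hat g_s(Z_{si})$ on $Z_\alpha$ reproduces the update in (9.17). The function names are ours.

```python
import numpy as np

def nw_smooth(zq, z, y, h):
    # Local constant (Nadaraya-Watson) estimate of E[y | Z = zq],
    # Gaussian kernel with bandwidth h.
    k = np.exp(-0.5 * ((zq[:, None] - z[None, :]) / h) ** 2)
    return k @ y / np.maximum(k.sum(axis=1), 1e-300)

def backfit(Z, Y, h, max_iter=100, tol=1e-4):
    # Ordinary backfitting for Y = c0 + sum_a g_a(Z_a) + u.
    # Each pass smooths the partial residual Y - c0 - sum_{s != a} g_s
    # on Z_a; by linearity of the smoother this equals the update
    # E[Y|Z_a] - c0 - sum_{s != a} E[g_s(Z_s)|Z_a] of (9.17).
    n, q = Z.shape
    c0 = Y.mean()
    g = np.zeros((n, q))                 # g[:, a] holds g_a at the Z_{a,i}
    for _ in range(max_iter):
        g_old = g.copy()
        for a in range(q):
            partial = Y - c0 - g.sum(axis=1) + g[:, a]
            g[:, a] = nw_smooth(Z[:, a], Z[:, a], partial, h[a])
            g[:, a] -= g[:, a].mean()    # impose E[g_a] = 0
        if np.sum((g - g_old) ** 2) <= tol * max(np.sum(g_old ** 2), 1e-12):
            break                        # relative-change stopping rule
    return c0, g
```

The stopping rule mirrors the relative-change criterion in the text (a threshold of $10^{-4}$).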
Let $\hat g(z) = \sum_{i=1}^n Y_i K_h(Z_i, z)/\sum_{i=1}^n K_h(Z_i, z)$ and $\hat f(z) = n^{-1}\sum_{i=1}^n K_h(Z_i, z)$ denote the multidimensional kernel estimators of $E(Y|Z = z)$ and $f(z)$, respectively, where $z = (z_1, \ldots, z_q)$. The local constant smoothed backfitting estimators of $g_1(z_1), \ldots, g_q(z_q)$ are defined as those $\hat g_1, \ldots, \hat g_q$ that minimize the objective function

$$\int \left[\hat g(z) - g(z)\right]^2 \hat f(z)\,dz, \qquad (9.18)$$

where the minimization runs over all functions $g(z) = c_0 + \sum_{\alpha=1}^q g_\alpha(z_\alpha)$ with $\int g_\alpha(z_\alpha)\hat f_\alpha(z_\alpha)\,dz_\alpha = 0$, where $\hat f_\alpha(z_\alpha) = \int \hat f(z)\,dz_{\bar\alpha}$ is the kernel estimator of the marginal PDF $f_\alpha(z_\alpha)$, and $z_{\bar\alpha} = (z_1, \ldots, z_{\alpha-1}, z_{\alpha+1}, \ldots, z_q)$.

The solution to (9.18) is characterized by the following system of equations ($\alpha = 1, \ldots, q$; $\bar Y = n^{-1}\sum_{i=1}^n Y_i$):

$$\hat g_\alpha(z_\alpha) = \int \hat g(z)\frac{\hat f(z)}{\hat f_\alpha(z_\alpha)}\,dz_{\bar\alpha} - \sum_{s\neq\alpha}\int \hat g_s(z_s)\frac{\hat f(z)}{\hat f_\alpha(z_\alpha)}\,dz_{\bar\alpha} - \bar Y, \qquad (9.19)$$

$$0 = \int \hat g_\alpha(z_\alpha)\hat f_\alpha(z_\alpha)\,dz_\alpha. \qquad (9.20)$$

Note that

$$\int \hat g(z)\frac{\hat f(z)}{\hat f_\alpha(z_\alpha)}\,dz_{\bar\alpha} = \frac{\sum_{i=1}^n Y_i\, k_{h_\alpha}(z_\alpha, Z_{\alpha i})}{\sum_{i=1}^n k_{h_\alpha}(z_\alpha, Z_{\alpha i})} \equiv \tilde g_\alpha(z_\alpha), \qquad (9.21)$$

because $\int \prod_{s\neq\alpha} h_s^{-1} k((z_s - Z_{si})/h_s)\,dz_{\bar\alpha} = 1$, where $\tilde g_\alpha(z_\alpha)$ is the local constant estimator of $E(Y_i \mid Z_{\alpha i} = z_\alpha)$. Also, using the additive structure of $\hat g_s(z_s)$ and the fact that $\int h_s^{-1} k((z_s - Z_{si})/h_s)\,dz_s = 1$, the $(q-1)$-dimensional integration $\int\cdot\,dz_{\bar\alpha}$ can be reduced to one-dimensional integration, hence

$$\int \hat g_s(z_s)\frac{\hat f(z)}{\hat f_\alpha(z_\alpha)}\,dz_{\bar\alpha} = \int \hat g_s(z_s)\frac{\hat f_{\alpha,s}(z_\alpha, z_s)}{\hat f_\alpha(z_\alpha)}\,dz_s, \qquad (9.22)$$

where $\hat f_{\alpha,s}(z_\alpha, z_s) = n^{-1}(h_\alpha h_s)^{-1}\sum_{i=1}^n k((Z_{\alpha i} - z_\alpha)/h_\alpha)\, k((Z_{si} - z_s)/h_s)$ is the kernel estimator of the two-dimensional marginal PDF of $(Z_{\alpha i}, Z_{si})$.

Equations (9.19), (9.21), and (9.22) lead to the following iterative procedure:

$$\hat g_\alpha^{[l]}(z_\alpha) = \tilde g_\alpha(z_\alpha) - \sum_{s<\alpha}\int \hat g_s^{[l]}(z_s)\frac{\hat f_{\alpha,s}(z_\alpha, z_s)}{\hat f_\alpha(z_\alpha)}\,dz_s - \sum_{s>\alpha}\int \hat g_s^{[l-1]}(z_s)\frac{\hat f_{\alpha,s}(z_\alpha, z_s)}{\hat f_\alpha(z_\alpha)}\,dz_s - \bar Y,$$

where $l = 1, 2, \ldots$ denotes the number of iterations.

Mammen et al. (1999) derived the asymptotic distribution of $\hat g_\alpha(z_\alpha)$. To present their result we first define a constant $b_0$ and univariate functions $b_\alpha(z_\alpha)$ (with $\int b_\alpha(z_\alpha)\,dz_\alpha = 0$, $\alpha = 1, \ldots, q$) as the minimizers of

$$\min_{b_0,\, b_1(\cdot), \ldots, b_q(\cdot)} \int \left[\beta(z) - b_0 - \sum_{\alpha=1}^q b_\alpha(z_\alpha)\right]^2 f(z)\,dz \qquad (9.23)$$

for a given function $\beta(z)$.

Under regularity conditions given in Mammen et al. (1999) (see also Nielsen and Sperlich (2005)), it can be shown that

$$(n h_\alpha)^{1/2}\left[\hat g_\alpha(z_\alpha) - g_\alpha(z_\alpha) - h_\alpha^2 b_\alpha(z_\alpha)\right] \to N(0, v_\alpha(z_\alpha)) \qquad (9.24)$$

in distribution, where $b_\alpha$ is defined in (9.23) with $\beta(z)$ given by

$$\beta(z) = \kappa_2 \sum_{\alpha=1}^q \left[\frac{1}{2}g_\alpha''(z_\alpha) + g_\alpha'(z_\alpha)\frac{\partial f(z)/\partial z_\alpha}{f(z)}\right],$$

and $v_\alpha(z_\alpha) = \kappa\,\sigma_\alpha^2(z_\alpha)/f_\alpha(z_\alpha)$ ($\kappa = \int k^2(v)\,dv$, $\kappa_2 = \int v^2 k(v)\,dv$). Finally, one can estimate $g(z) = c_0 + \sum_{\alpha=1}^q g_\alpha(z_\alpha)$ by

$$\hat g(z) = \bar Y + \sum_{\alpha=1}^q \hat g_\alpha(z_\alpha). \qquad (9.25)$$

Moreover, Mammen et al. (1999) show that the different components of $\hat g_\alpha(z_\alpha)$ are asymptotically independent; therefore, the asymptotic distribution of $\hat g(z)$ follows readily from (9.24) (see Exercise 9.1). Mammen et al. also discuss using boundary corrected kernels when $Z$ has compact support.

Mammen et al. (1999) also consider using a local linear estimator of $g_\alpha(z_\alpha)$, so that the objective function (9.18) becomes

$$\int \sum_{i=1}^n \left[Y_i - g_0 - \sum_{\alpha=1}^q g_\alpha(z_\alpha) - \sum_{\alpha=1}^q g_\alpha^{(1)}(z_\alpha)(Z_{\alpha i} - z_\alpha)\right]^2 K_h(z, Z_i)\,dz, \qquad (9.26)$$

where the minimization is over $g_0$, all $g_\alpha(\cdot)$'s (with $\int g_\alpha(z_\alpha)\hat f_\alpha(z_\alpha)\,dz_\alpha = 0$), and also over all $g_\alpha^{(1)}(\cdot)$'s, where $g_\alpha^{(1)}(\cdot)$ plays the role of the first derivative of $g_\alpha(\cdot)$.

Mammen et al. (1999) (see also Mammen and Park (2005)) show that the $\hat g_\alpha$'s and $\hat g_\alpha^{(1)}$'s satisfy the following system of equations:

$$\begin{pmatrix}\hat g_\alpha(z_\alpha)\\[2pt] \hat g_\alpha^{(1)}(z_\alpha)\end{pmatrix} = \begin{pmatrix}\tilde g_\alpha(z_\alpha)\\[2pt] \tilde g_\alpha^{(1)}(z_\alpha)\end{pmatrix} - \sum_{s\neq\alpha}\hat M_\alpha(z_\alpha)^{-1}\int \hat S_{\alpha,s}(z_\alpha, z_s)\begin{pmatrix}\hat g_s(z_s)\\[2pt] \hat g_s^{(1)}(z_s)\end{pmatrix}dz_s - \begin{pmatrix}\hat g_0\\[2pt] 0\end{pmatrix}, \qquad (9.27)$$

with

$$\hat g_0 = \bar Y - \sum_{s=1}^q \int \hat g_s(z_s)\hat f_s(z_s)\,dz_s - \sum_{s=1}^q \int \hat g_s^{(1)}(z_s)\hat f_s^{(1)}(z_s)\,dz_s,$$

$$\hat M_\alpha(z_\alpha) = n^{-1}\sum_{i=1}^n k_{h_\alpha}(z_\alpha, Z_{\alpha i})\begin{pmatrix}1 & Z_{\alpha i} - z_\alpha\\ Z_{\alpha i} - z_\alpha & (Z_{\alpha i} - z_\alpha)^2\end{pmatrix}, \qquad (9.28)$$

$$\hat S_{\alpha,s}(z_\alpha, z_s) = n^{-1}\sum_{i=1}^n k_{h_\alpha}(z_\alpha, Z_{\alpha i})\, k_{h_s}(z_s, Z_{si})\begin{pmatrix}1 & Z_{si} - z_s\\ Z_{\alpha i} - z_\alpha & (Z_{\alpha i} - z_\alpha)(Z_{si} - z_s)\end{pmatrix}, \qquad (9.29)$$

where $\hat f_\alpha^{(1)}(z_\alpha) = n^{-1}\sum_{i=1}^n k_{h_\alpha}(z_\alpha, Z_{\alpha i})(Z_{\alpha i} - z_\alpha)$ and $\hat f_\alpha(z_\alpha) = n^{-1}\sum_{i=1}^n k_{h_\alpha}(z_\alpha, Z_{\alpha i})$. Also, $\tilde g_\alpha(z_\alpha)$ and $\tilde g_\alpha^{(1)}(z_\alpha)$ are the local linear fits obtained by regressing $Y_i$ on $Z_{\alpha i}$; that is, they minimize the objective function

$$\sum_{i=1}^n \left[Y_i - \tilde g_\alpha(z_\alpha) - \tilde g_\alpha^{(1)}(z_\alpha)(Z_{\alpha i} - z_\alpha)\right]^2 k_{h_\alpha}(z_\alpha, Z_{\alpha i}). \qquad (9.30)$$

The definitions of $\hat g_0$, $\hat g_1(z_1), \ldots, \hat g_q(z_q)$, and $\hat g_1^{(1)}(z_1), \ldots, \hat g_q^{(1)}(z_q)$ can be made unique by imposing the normalization condition

$$\int \hat g_\alpha(z_\alpha)\hat f_\alpha(z_\alpha)\,dz_\alpha + \int \hat g_\alpha^{(1)}(z_\alpha)\hat f_\alpha^{(1)}(z_\alpha)\,dz_\alpha = 0. \qquad (9.31)$$

The smoothed backfitting estimates are obtained via the iterative application of (9.27) where, when the left-hand side is the iteration step $[l]$, the right-hand side quantities are either $[l]$ or $[l-1]$, depending on whether $s < \alpha$ or $s > \alpha$. Note that (9.31) yields
$\hat g_0 = n^{-1}\sum_{i=1}^n Y_i = \bar Y$.

The asymptotic distribution of $\hat g_\alpha(z_\alpha)$ is derived by Mammen et al. (1999). Below we state a simple version given by Nielsen and Sperlich (2005):

$$\sqrt{n h_\alpha}\left[\hat g_\alpha(z_\alpha) - g_\alpha(z_\alpha) - \frac{\kappa_2}{2}h_\alpha^2 \mu_\alpha(z_\alpha)\right] \to N(0, V_\alpha(z_\alpha)) \qquad (9.32)$$

in distribution, where $\mu_\alpha(z_\alpha) = g_\alpha''(z_\alpha) - \int g_\alpha''(z_\alpha) f_\alpha(z_\alpha)\,dz_\alpha$, $V_\alpha(z_\alpha) = \kappa\,\sigma^2(z_\alpha)/f_\alpha(z_\alpha)$, and $\kappa = \int k^2(v)\,dv$, $\kappa_2 = \int v^2 k(v)\,dv$.

One can also present the joint distribution using matrices:

$$\begin{pmatrix}\sqrt{n h_1}\left[\hat g_1(z_1) - g_1(z_1) - \frac{\kappa_2}{2}h_1^2\mu_1(z_1)\right]\\ \vdots\\ \sqrt{n h_q}\left[\hat g_q(z_q) - g_q(z_q) - \frac{\kappa_2}{2}h_q^2\mu_q(z_q)\right]\end{pmatrix} \to N\!\left(0, \mathrm{diag}(V_\alpha(z_\alpha))\right), \qquad (9.33)$$

where $0 = (0, \ldots, 0)'$ is a $q \times 1$ vector of zeros, and $\mathrm{diag}(V_\alpha(z_\alpha))$ is a $q \times q$ diagonal matrix with the $\alpha$th diagonal element equal to $V_\alpha(z_\alpha)$ ($\alpha = 1, \ldots, q$) and all off-diagonal elements equal to zero. Equation (9.33) shows that the different components of $\hat g_\alpha(z_\alpha)$ are asymptotically independent of one another.

Nielsen and Sperlich (2005) suggest using leave-one-out least squares cross-validation to select the smoothing parameters, while Mammen and Park (2005) recommend minimizing the penalized sum of squared residuals, i.e., they recommend choosing $h_1, \ldots, h_q$ to minimize the objective function

$$PLS(h) = RSS(h)\left[1 + \frac{2k(0)}{n}\sum_{\alpha=1}^q h_\alpha^{-1}\right], \qquad (9.34)$$

where

$$RSS(h) = n^{-1}\sum_{i=1}^n\left[Y_i - \hat g_0 - \sum_{\alpha=1}^q \hat g_\alpha(Z_{\alpha i})\right]^2. \qquad (9.35)$$

In this chapter we do not discuss kernel estimation of an additive model with interaction terms, or derivative estimation in generalized additive models. However, readers can find a discussion of these issues in Sperlich, Tjøstheim and Yang (2002) and Yang, Sperlich and Härdle (2003). Also, for efficient and fast spline-backfitted kernel smoothing of additive regression models, see Wang and Yang (2005).

9.1.5 Additive Models with Link Functions

A more general additive model is an additive model with a known link function, given by

$$E(Y_i \mid Z_i = z) = G\left[c_0 + \sum_{\alpha=1}^q g_\alpha(z_\alpha)\right], \qquad (9.36)$$

where $G(\cdot)$ is a known link function. When $G(\cdot)$ is the identity function, (9.36) collapses to (9.1). In practice, $G(\cdot)$ could be the exponential or the logistic function, etc.

Linton and Härdle (1996) propose a marginal integration approach to estimate (9.36).
Let $m(z) = E(Y|Z = z)$ and let $M(\cdot) = G^{-1}(\cdot)$; then (9.36) can be written as

$$M[m(z)] = c_0 + \sum_{\alpha=1}^q g_\alpha(z_\alpha). \qquad (9.37)$$

Define

$$\tilde g_\alpha(z_\alpha) = \int M[m(z)]\, f_{\bar\alpha}(z_{\bar\alpha})\,dz_{\bar\alpha}, \qquad (9.38)$$

where $z_{\bar\alpha} = (z_1, \ldots, z_{\alpha-1}, z_{\alpha+1}, \ldots, z_q)$ and $f_{\bar\alpha}(z_{\bar\alpha})$ is the PDF of $Z_{\bar\alpha i}$. From (9.38) we know that $\tilde g_\alpha$ differs from $g_\alpha$ only by an additive constant.

Linton and Härdle (1996) suggest estimating $\tilde g_\alpha$ based on the multidimensional local constant kernel estimator

$$\hat m(z_\alpha, z_{\bar\alpha}) = \frac{\sum_{i=1}^n Y_i\, k_{h_\alpha}(z_\alpha, Z_{\alpha i})\, W_{\bar h}(z_{\bar\alpha}, Z_{\bar\alpha i})}{\sum_{i=1}^n k_{h_\alpha}(z_\alpha, Z_{\alpha i})\, W_{\bar h}(z_{\bar\alpha}, Z_{\bar\alpha i})}, \qquad (9.39)$$

where $k(\cdot)$ is a univariate second order kernel function, and where $W_{\bar h}(z_{\bar\alpha}, Z_{\bar\alpha i}) = \prod_{s\neq\alpha}\bar h_s^{-1} w((z_s - Z_{si})/\bar h_s)$ with $w(\cdot)$ being a kernel function of order $d \geq 2$ (a higher order kernel when $q > 2$).

Then $\tilde g_\alpha(z_\alpha)$ is estimated by the sample analogue of (9.38) given by

$$\hat{\tilde g}_\alpha(z_\alpha) = n^{-1}\sum_{i=1}^n M[\hat m(z_\alpha, Z_{\bar\alpha i})]. \qquad (9.40)$$

When $M$ is the identity function, $\hat{\tilde g}_\alpha(z_\alpha)$ is linear in the $Y_i$'s. In general, however, $\hat{\tilde g}_\alpha(z_\alpha)$ is a nonlinear function of the $Y_i$'s. The constant term $c_0$ is estimated by

$$\hat c_0 = n^{-1}\sum_{i=1}^n M[\hat m(Z_i)], \qquad (9.41)$$

and $g_\alpha(z_\alpha)$ is estimated by

$$\hat g_\alpha(z_\alpha) = \hat{\tilde g}_\alpha(z_\alpha) - \hat c_0. \qquad (9.42)$$

The final estimate of $m(z)$ is given by ($G = M^{-1}$)

$$\hat m(z) = G\left[\hat c_0 + \sum_{\alpha=1}^q \hat g_\alpha(z_\alpha)\right]. \qquad (9.43)$$

Under some regularity conditions, Linton and Härdle (1996) prove the following result:

$$\sqrt{n h_\alpha}\left[\hat{\tilde g}_\alpha(z_\alpha) - \tilde g_\alpha(z_\alpha) - h_\alpha^2 b_\alpha(z_\alpha)\right] \to N(0, V_\alpha(z_\alpha)) \qquad (9.44)$$

in distribution, where

$$b_\alpha(z_\alpha) = \frac{\kappa_2}{2}\int M'[m(z)]\left[2\,\frac{\partial m(z)}{\partial z_\alpha}\,\frac{\partial f(z)/\partial z_\alpha}{f(z)} + \frac{\partial^2 m(z)}{\partial z_\alpha^2}\right] f_{\bar\alpha}(z_{\bar\alpha})\,dz_{\bar\alpha},$$

$$V_\alpha(z_\alpha) = \kappa\int \left(M'[m(z)]\, f_{\bar\alpha}(z_{\bar\alpha})\right)^2 \frac{\sigma^2(z)}{f(z)}\,dz_{\bar\alpha}.$$

A consistent estimator of $V_\alpha(z_\alpha)$ is given by

$$\hat V_\alpha(z_\alpha) = n^{-1}\sum_{i=1}^n \hat w_i^2\,\hat u_i^2,$$

where

$$\hat w_i = n^{-1}\sum_{j=1}^n M'[\hat m(z_\alpha, Z_{\bar\alpha j})]\,\frac{k_{h_\alpha}(z_\alpha, Z_{\alpha i})\, W_{\bar h}(Z_{\bar\alpha j}, Z_{\bar\alpha i})}{\hat f(z_\alpha, Z_{\bar\alpha j})},$$

$\hat f$ is the kernel density estimator appearing in the denominator of (9.39), and $\hat u_i = Y_i - \hat m(Z_i)$.

As observed by Horowitz and Mammen (2004), one undesirable property of the above marginal integration based estimator is that a higher order kernel must be used for $w(\cdot)$, with the order of the kernel depending on $q$ (i.e., the larger is $q$, the higher the order of $w(\cdot)$). This arises because, at the initial stage, one has to estimate the $q$-dimensional function $m(z)$ nonparametrically (i.e., the additive structure is not imposed at the initial stage).
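For $q = 2$ (where a second order kernel suffices for $w(\cdot)$), the Linton and Härdle estimator (9.39)-(9.42) can be sketched as follows. This is a minimal illustration under our own assumptions: Gaussian kernels throughout, a logistic link supplied by the caller, and the constant recovered by centering $\hat{\tilde g}_\alpha$ rather than via (9.41); the function names are ours.

```python
import numpy as np

def mi_link_g(Y, za, zb, h, H, M):
    # Marginal-integration estimate of g_a in
    # E[Y|z] = G[c0 + g_a(z_a) + g_b(z_b)] with q = 2:
    # a bivariate local constant fit m_hat as in (9.39), then
    # g_tilde_a(z_a) = n^{-1} sum_j M[m_hat(z_a, Z_b_j)] as in (9.40),
    # where M = G^{-1}.
    def m_hat(z1, z2):
        k = (np.exp(-0.5 * ((z1 - za) / h) ** 2)
             * np.exp(-0.5 * ((z2 - zb) / H) ** 2))
        return k @ Y / max(k.sum(), 1e-300)
    # evaluate g_tilde_a at the observed design points Z_{a,i}
    g_tilde = np.array([np.mean([M(m_hat(z1, z2)) for z2 in zb])
                        for z1 in za])
    # E[g_a] = 0 implies E[g_tilde_a] = c0, so center to recover g_a
    # (a simplification of the estimator (9.41) in the text).
    c0_hat = g_tilde.mean()
    return g_tilde - c0_hat, c0_hat
```

When $M$ is the identity this reduces to ordinary marginal integration; for a logistic link the caller should clip $\hat m$ away from 0 and 1 before applying $M$.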
Horowitz and Mammen suggest using series methods at the initial stage when estimating $m(z)$, then using a kernel local linear method to estimate the individual functions $g_\alpha(\cdot)$. We discuss Horowitz and Mammen's approach in Chapter 15, where we introduce nonparametric series methods. Horowitz and Lee (2005) further extend Horowitz and Mammen's result to additive quantile regression models.

Horowitz (2001) generalizes the above result to the case in which the functional form of the link function $G(\cdot)$ is unknown.

9.2 An Additive Partially Linear Model

While the additive model outlined above has the advantage of not suffering from the curse of dimensionality, it is, however, a very restrictive model since it does not allow for the presence of interaction terms. An additive partially linear model can avoid this problem and maintain the "one-dimensional" nonparametric rate of convergence. Consider a model of the form

$$Y_i = c_0 + X_i'\beta + g_1(Z_{1i}) + \cdots + g_q(Z_{qi}) + u_i, \qquad (9.45)$$

where $X_i$ is a $p \times 1$ vector of random variables, $\beta$ is a $p \times 1$ vector of unknown parameters, $X_i$ can contain interaction terms involving $(Z_{1i}, \ldots, Z_{qi})$, $c_0$ is a scalar parameter, the $Z_{\alpha i}$'s are all univariate continuous variables, and the $g_\alpha(\cdot)$'s ($\alpha = 1, \ldots, q$) are unknown smooth functions. The observations $\{Y_i, X_i', Z_{1i}, \ldots, Z_{qi}\}_{i=1}^n$ are i.i.d. We impose the condition $E[g_\alpha(Z_{\alpha i})] = 0$ for $\alpha = 1, \ldots, q$ in order to identify the individual components $g_1(\cdot), \ldots, g_q(\cdot)$.

Defining $v_i = X_i - E(X_i \mid Z_i)$, we say that $X_i$ is not a deterministic function of $(Z_{1i}, \ldots, Z_{qi})$ if

$$E[v_i v_i'] \text{ is positive definite.} \qquad (9.46)$$

Under (9.46), one can use the method discussed in Chapter 7 to obtain a $\sqrt n$-consistent estimator of $\beta$, say $\hat\beta$. One can then estimate the additive functions by rewriting the model as

$$Y_i - X_i'\hat\beta = c_0 + \sum_{\alpha=1}^q g_\alpha(Z_{\alpha i}) + e_i,$$

where $e_i = u_i + X_i'(\beta - \hat\beta)$.
Since $\hat\beta - \beta = O_p(n^{-1/2})$ has a faster rate of convergence than the nonparametric additive function estimators, the resulting nonparametric estimator of $g_\alpha(z_\alpha)$ has the same asymptotic properties as when $\beta$ is known. Thus, it has the same asymptotic distribution as in the nonparametric additive model discussed in Section 9.1 (without the linear component). We formally summarize this estimator below.

Define $v_i = X_i - E(X_i \mid Z_i)$. When $E(v_i v_i')$ is positive definite, one can use the following simple two-step method to estimate this model:

(i) First, ignore the additive structure of the model and estimate $\beta$ using Robinson's (1988) approach, i.e., regress $Y_i - \hat E(Y_i \mid Z_i)$ on $X_i - \hat E(X_i \mid Z_i)$ to obtain a semiparametric estimator of $\beta$, say $\hat\beta$, where $\hat E(Y_i \mid Z_i)$ and $\hat E(X_i \mid Z_i)$ are the kernel estimators of $E(Y_i \mid Z_i)$ and $E(X_i \mid Z_i)$, respectively.

(ii) Next, rewrite the model as $Y_i - X_i'\hat\beta = g_1(Z_{1i}) + \cdots + g_q(Z_{qi}) + e_i$ and estimate the additive functions $g_\alpha(\cdot)$ as discussed in Section 9.1.

The asymptotic distribution of $\hat\beta$ is discussed in Chapter 7, and the asymptotic distribution of the resulting estimator of $g_\alpha(z_\alpha)$, say $\hat g_\alpha(z_\alpha)$, is the same as that discussed in Section 9.1. This is because $\hat\beta - \beta = O_p(n^{-1/2})$, which is faster than the nonparametric rate of convergence; therefore, replacing $\beta$ by $\hat\beta$ does not affect the asymptotic distribution of $\hat g_\alpha(z_\alpha)$.

The advantage of the above procedure is that it is computationally simple. However, there is one problem with this approach. When $E(v_i v_i')$ is not positive definite, the procedure fails to estimate $\beta$. Consider the simple case where $q = 2$, and consider a model of the form

$$Y_i = \beta_0 + (Z_{1i}Z_{2i})\beta + g_1(Z_{1i}) + g_2(Z_{2i}) + u_i, \qquad (9.47)$$

where we have $X_i = Z_{1i}Z_{2i}$. In this case, $X_i$ is a deterministic function of $Z_i$ and we have $v_i = X_i - E(X_i \mid Z_i) = X_i - X_i = 0$. That is, the above procedure cannot apply in such cases.
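The two-step procedure (i)-(ii) above can be sketched as follows, presuming condition (9.46) holds. This is a minimal illustration under our own assumptions (product Gaussian kernel, function names ours); step (ii) simply forms the adjusted response $Y_i - X_i'\hat\beta$, which would then be fed to any additive-model estimator from Section 9.1.

```python
import numpy as np

def nw_multi(Z, y, h):
    # Local constant estimate of E[y|Z] at the sample points,
    # product Gaussian kernel with one bandwidth per column of Z.
    u = (Z[:, None, :] - Z[None, :, :]) / h
    k = np.exp(-0.5 * (u ** 2).sum(axis=2))
    return k @ y / np.maximum(k.sum(axis=1), 1e-300)

def robinson_step(Y, X, Z, h):
    # Step (i): Robinson's (1988) estimator of beta -- OLS of
    # Y - E_hat(Y|Z) on X - E_hat(X|Z).
    Yt = Y - nw_multi(Z, Y, h)
    Xt = X - np.column_stack([nw_multi(Z, X[:, j], h)
                              for j in range(X.shape[1])])
    beta, *_ = np.linalg.lstsq(Xt, Yt, rcond=None)
    # Step (ii): the adjusted response Y - X'beta_hat is what one
    # would pass to an additive-model estimator of the g_a's.
    return beta, Y - X @ beta
```

Note that if $X_i$ were a deterministic function of $Z_i$, the matrix `Xt` above would be (numerically) zero and the least squares step would be degenerate, illustrating why (9.46) is needed.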
However, (9.46) is a strong assumption as it rules out cases for which $X_i$ is a deterministic (but nonadditive) function of $(Z_{1i}, \ldots, Z_{qi})$. For example, when $q = 2$ we may wish to let $X_i$ contain interaction terms such as $X_i = Z_{1i}Z_{2i}$, but then (9.46) fails to hold.

Fan et al. (1998) and Fan and Li (2003) proposed an alternative $\sqrt n$-consistent estimator of $\beta$ based on marginal integration methods. Their method has the advantage that it does not rely upon condition (9.46). As discussed earlier, however, the marginal integration method is computationally costly. Schick (1996) and Manzan and Zerom (2005) suggest using a computationally efficient method for estimating $\beta$. We describe Manzan and Zerom's procedure below.

9.2.1 A Simple Two-Step Method

For any random variable (vector) $\xi_i$, define

$$\xi_{\alpha i} = E[w_\alpha(Z_i)\xi_i \mid Z_{\alpha i}] = \int E[\xi_i \mid Z_{\alpha i}, z_{\bar\alpha}]\, f_{\bar\alpha}(z_{\bar\alpha})\,dz_{\bar\alpha},$$

where $w_\alpha(z) = f_\alpha(z_\alpha) f_{\bar\alpha}(z_{\bar\alpha})/f(z_\alpha, z_{\bar\alpha})$, $\alpha = 1, \ldots, q$, and define the projection of $\xi_i$ onto the additive functional space by

$$E_A(\xi_i) = \sum_{\alpha=1}^q \xi_{\alpha i}. \qquad (9.48)$$

Applying the projection (9.48) to (9.45) we obtain

$$E_A(Y_i) = E_A(X_i)'\beta + \sum_{\alpha=1}^q g_\alpha(Z_{\alpha i}) + q c_0. \qquad (9.49)$$

Subtracting (9.49) from (9.45) we get

$$Y_i - E_A(Y_i) = (X_i - E_A(X_i))'\beta + u_i - (q-1)c_0. \qquad (9.50)$$

We can estimate $\beta$ by applying least squares methods (the constant term is absorbed by including an intercept in the regression). However, we must first obtain consistent estimators for $E_A(Y_i)$ and $E_A(X_i)$. By the results given in Section 9.1 we know that they can be consistently estimated by

$$\hat E_A(Y_i) = \sum_{\alpha=1}^q \tilde Y_{\alpha i} \quad \text{and} \quad \hat E_A(X_i) = \sum_{\alpha=1}^q \tilde X_{\alpha i}, \qquad (9.51)$$

where, for a generic variable $\xi_j$,

$$\tilde \xi_{\alpha i} = \frac{1}{n}\sum_{j=1}^n \frac{1}{h_\alpha}\, k\!\left(\frac{Z_{\alpha j} - Z_{\alpha i}}{h_\alpha}\right) \frac{\hat f_{\bar\alpha}(Z_{\bar\alpha j})}{\hat f(Z_j)}\,\xi_j, \qquad (9.52)$$

and, with $\xi_j = Y_j$ or $\xi_j = X_j$ in (9.52), we obtain $\hat E_A(Y_i)$ and $\hat E_A(X_i)$, respectively.

Therefore, a feasible estimator of $\beta$ is given by

$$\hat\beta = S_{X - \hat E_A(X)}^{-1}\, S_{X - \hat E_A(X),\, Y - \hat E_A(Y)}, \qquad (9.53)$$

where $S_{C,D} = n^{-1}\sum_{i=1}^n C_i D_i'$ and $S_C \equiv S_{C,C}$. The asymptotic distribution of $\hat\beta$ is given in Manzan and Zerom (2005) and is as follows.

Theorem 9.2.
Under the same regularity conditions given in Manzan and Zerom (2005), $\sqrt n(\hat\beta - \beta) \to N(0, \Sigma)$ in distribution, where $\Sigma = \Phi^{-1}\Omega\Phi^{-1}$, $\Phi = E[\eta_i\eta_i']$, $\eta_i = X_i - E_A(X_i)$, and $\Omega = E[u_i^2\eta_i\eta_i']$.

A consistent estimator of $\Sigma$ is given by $\hat\Sigma = \hat\Phi^{-1}\hat\Omega\hat\Phi^{-1}$, where $\hat\Omega = n^{-1}\sum_{i=1}^n \hat u_i^2[X_i - \hat E_A(X_i)][X_i - \hat E_A(X_i)]'$, $\hat\Phi = n^{-1}\sum_{i=1}^n [X_i - \hat E_A(X_i)][X_i - \hat E_A(X_i)]'$, and $\hat u_i = Y_i - \hat E_A(Y_i) - [X_i - \hat E_A(X_i)]'\hat\beta$.

When the error is conditionally homoskedastic, $\hat\beta$ is semiparametrically efficient in the sense that its asymptotic variance reaches the semiparametric efficiency bound for this model (see Chamberlain (1992)).

Positive definiteness of $\Phi$ is an identification condition for $\beta$. It allows for $X_i$ to be a deterministic function of $(Z_{1i}, \ldots, Z_{qi})$ provided it is not an additive function of the $Z_{\alpha i}$'s. More specifically, consider the simple case of $q = 2$ in (9.47) with $X_i = Z_{1i}Z_{2i}$, given by

$$Y_i = \beta_0 + (Z_{1i}Z_{2i})\beta + g_1(Z_{1i}) + g_2(Z_{2i}) + u_i. \qquad (9.54)$$

Model (9.54) is immune to the curse of dimensionality since it only involves one-dimensional nonparametric functions $g_\alpha(\cdot)$ ($\alpha = 1, 2$). Also, it is more general than an additive model that does not have interaction terms (i.e., (9.45) allows interaction terms to enter the model parametrically).

Continuing with this simple case ($q = 2$), we next estimate the individual nonparametric components $g_\alpha(z_\alpha)$. Given the $\sqrt n$-consistent estimator $\hat\beta$, we can rewrite (9.45) as

$$Y_i - X_i'\hat\beta = \beta_0 + \sum_{\alpha=1}^q g_\alpha(Z_{\alpha i}) + \epsilon_i, \qquad (9.55)$$

where $\epsilon_i = u_i + X_i'(\beta - \hat\beta)$. Equation (9.55) is essentially an additive regression model with $Y_i - X_i'\hat\beta$ as the new dependent variable and $u_i + X_i'(\beta - \hat\beta)$ as the new (composite) error. Hence, one can estimate $g_\alpha(z_\alpha)$ by marginally integrating a local linear estimator as discussed in (9.6) and (9.7), with $Y_i$ replaced by $Y_i - X_i'\hat\beta$. Let $\hat g_\alpha(z_\alpha)$ denote the resulting estimator of $g_\alpha(z_\alpha)$. From Theorem 9.1 and the fact that $\hat\beta - \beta = O_p(n^{-1/2})$, it is easy to see that the asymptotic distribution of $\hat g_\alpha(z_\alpha)$ is the same as that of $\tilde g_\alpha(z_\alpha)$, which was given in Theorem 9.1.
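The key point of (9.50)-(9.53), that $\beta$ remains identified when $X_i = Z_{1i}Z_{2i}$, can be illustrated in code. The sketch below is not Manzan and Zerom's exact estimator: it replaces the weighted projection (9.52) with an ordinary backfitting additive fit as a stand-in for $\hat E_A(\cdot)$, uses Gaussian kernels, and its function names are ours.

```python
import numpy as np

def additive_proj(Z, y, h, iters=25):
    # Backfitting additive fit of y on the columns of Z, used here as a
    # simple stand-in for the additive projection E_A(.) of (9.48)
    # (the exact kernel weighting (9.52) differs).  Returns fitted
    # values c0 + sum_a m_a(Z_{a,i}).
    n, q = Z.shape
    g = np.zeros((n, q)); c0 = y.mean()
    for _ in range(iters):
        for a in range(q):
            r = y - c0 - g.sum(axis=1) + g[:, a]
            d = (Z[:, a][:, None] - Z[:, a][None, :]) / h
            k = np.exp(-0.5 * d ** 2)
            g[:, a] = k @ r / np.maximum(k.sum(axis=1), 1e-300)
            g[:, a] -= g[:, a].mean()
    return c0 + g.sum(axis=1)

def beta_additive_pl(Y, X, Z, h):
    # OLS of Y - E_A_hat(Y) on X - E_A_hat(X), cf. (9.50) and (9.53).
    # Works even when X is a deterministic but NON-additive function
    # of Z, e.g. X = Z1 * Z2, because the additive projection cannot
    # absorb the interaction.
    Yr = Y - additive_proj(Z, Y, h)
    Xr = X - np.column_stack([additive_proj(Z, X[:, j], h)
                              for j in range(X.shape[1])])
    beta, *_ = np.linalg.lstsq(Xr, Yr, rcond=None)
    return beta
```

In the pure interaction case with independent, centered regressors, the best additive approximation of $Z_{1i}Z_{2i}$ is zero, so the residualized $X$ retains full variation and the least squares step is well posed, exactly the situation in which the Robinson-based procedure of the previous section breaks down.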
The intercept term $\beta_0$ can be $\sqrt n$-consistently estimated by $\hat\beta_0 = \bar Y - \bar X'\hat\beta$, where $\bar Y = n^{-1}\sum_{i=1}^n Y_i$ and $\bar X = n^{-1}\sum_{i=1}^n X_i$. The conditional mean function $E[Y_i \mid X_i = x, Z_{1i} = z_1, \ldots, Z_{qi} = z_q]$ can be estimated by $\hat\beta_0 + x'\hat\beta + \sum_{\alpha=1}^q \hat g_\alpha(z_\alpha)$, and the error can be estimated by $\hat u_i = Y_i - \hat\beta_0 - X_i'\hat\beta - \sum_{\alpha=1}^q \hat g_\alpha(Z_{\alpha i})$.

9.3 A Semiparametric Varying (Smooth) Coefficient Model

In the previous section we discussed a partially linear model of the form

$$Y_i = \alpha(Z_i) + X_i'\beta + u_i, \qquad (9.56)$$

where $\alpha(\cdot)$ is an unknown function and $\beta$ is a $p \times 1$ vector of unknown parameters. In this section we consider a more general semiparametric regression model, the so-called semiparametric smooth coefficient model, which nests the partially linear model as a special case. The smooth coefficient model is given by

$$Y_i = \alpha(Z_i) + X_i'\beta(Z_i) + u_i. \qquad (9.57)$$
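A common way to estimate smooth coefficient models such as (9.57) is local least squares: at each point $z$, solve a kernel-weighted least squares problem, giving $\hat\beta(z) = \left[\sum_i X_i X_i' K_h(Z_i - z)\right]^{-1}\sum_i X_i Y_i K_h(Z_i - z)$. The sketch below is our own minimal illustration (it is not necessarily the estimator developed later in this section); it assumes a scalar $Z$, a Gaussian kernel, and that a column of ones is included in $X$ so that its coefficient plays the role of $\alpha(z)$.

```python
import numpy as np

def smooth_coef(z_eval, Y, X, Z, h):
    # Local constant least squares estimate of beta(z) in
    # Y_i = X_i' beta(Z_i) + u_i:
    #   beta_hat(z) = [sum_i X_i X_i' K_h(Z_i - z)]^{-1}
    #                  sum_i X_i Y_i K_h(Z_i - z).
    out = np.empty((len(z_eval), X.shape[1]))
    for j, z in enumerate(z_eval):
        k = np.exp(-0.5 * ((Z - z) / h) ** 2)   # Gaussian kernel weights
        Xw = X * k[:, None]
        out[j] = np.linalg.solve(Xw.T @ X, Xw.T @ Y)
    return out
```

When $\beta(z)$ is constant in $z$ this collapses toward the partially linear model (9.56), which is the sense in which (9.57) nests it.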