Nonparametric Econometrics: Theory and Practice

Qi Li and Jeffrey Scott Racine

Princeton University Press
Princeton and Oxford

Copyright © 2007 by Princeton University Press

Published by Princeton University Press, 41 William Street, Princeton, New Jersey 08540
In the United Kingdom: Princeton University Press, 3 Market Place, Woodstock, Oxfordshire OX20 1SY

All Rights Reserved

Library of Congress Cataloging-in-Publication Data
Li, Qi, 1956–
Nonparametric econometrics: theory and practice / Qi Li & Jeffrey S. Racine.
p. cm. Includes index.
ISBN-13: 978-0-691-12161-1 (cloth : alk. paper)
ISBN-10: 0-691-12161-3 (cloth : alk. paper)
1. Econometrics. 2. Nonparametric statistics. I. Racine, Jeffrey S., 1962–. II. Title.
LCCN 2006044989

British Library Cataloging-in-Publication Data is available

The publisher would like to acknowledge the authors of this volume for providing the camera-ready copy from which this book was printed.

Printed on acid-free paper.

Printed in the United States of America


Contents

Preface

I Nonparametric Kernel Methods

1 Density Estimation
   1.1 Univariate Density Estimation
   1.2 Univariate Bandwidth Selection: Rule-of-Thumb and Plug-In Methods
   1.3 Univariate Bandwidth Selection: Cross-Validation Methods
      1.3.1 Least Squares Cross-Validation
      1.3.2 Likelihood Cross-Validation
      1.3.3 An Illustration of Data-Driven Bandwidth Selection
   1.4 Univariate CDF Estimation
   1.5 Univariate CDF Bandwidth Selection: Cross-Validation Methods
   1.6 Multivariate Density Estimation
   1.7 Multivariate Bandwidth Selection: Rule-of-Thumb and Plug-In Methods
   1.8 Multivariate Bandwidth Selection: Cross-Validation Methods
      1.8.1 Least Squares Cross-Validation
      1.8.2 Likelihood Cross-Validation
   1.9 Asymptotic Normality of Density Estimators
   1.10 Uniform Rates of Convergence
   1.11 Higher Order Kernel Functions
   1.12 Proof of Theorem 1.4 (Uniform Almost Sure Convergence)
   1.13 Applications
      1.13.1 Female Wage Inequality
      1.13.2 Unemployment Rates and City Size
      1.13.3 Adolescent Growth
      1.13.4 Old Faithful Geyser Data
      1.13.5 Evolution of Real Income Distribution in Italy, 1951–1998
   1.14 Exercises

2 Regression
   2.1 Local Constant Kernel Estimation
      2.1.1 Intuition Underlying the Local Constant Kernel Estimator
   2.2 Local Constant Bandwidth Selection
      2.2.1 Rule-of-Thumb and Plug-In Methods
      2.2.2 Least Squares Cross-Validation
      2.2.3 AIC
      2.2.4 The Presence of Irrelevant Regressors
      2.2.5 Some Further Results on Cross-Validation
   2.3 Uniform Rates of Convergence
   2.4 Local Linear Kernel Estimation
      2.4.1 Local Linear Bandwidth Selection: Least Squares Cross-Validation
   2.5 Local Polynomial Regression (General pth Order)
      2.5.1 The Univariate Case
      2.5.2 The Multivariate Case
      2.5.3 Asymptotic Normality of Local Polynomial Estimators
   2.6 Applications
      2.6.1 Prestige Data
      2.6.2 Adolescent Growth
      2.6.3 Inflation Forecasting and Money Growth
   2.7 Proofs
      2.7.1 Derivation of (2.24)
      2.7.2 Proof of Theorem 2.7
      2.7.3 Definitions of A_{1n} and V_{1n} Used in Theorem 2.10
   2.8 Exercises

3 Frequency Estimation with Mixed Data
   3.1 Probability Function Estimation with Discrete Data
   3.2 Regression with Discrete Regressors
   3.3 Estimation with Mixed Data: The Frequency Approach
      3.3.1 Density Estimation with Mixed Data
      3.3.2 Regression with Mixed Data
   3.4 Some Cautionary Remarks on Frequency Methods
   3.5 Proofs
      3.5.1 Proof of Theorem 3.1
   3.6 Exercises

4 Kernel Estimation with Mixed Data
   4.1 Smooth Estimation of Joint Distributions with Discrete Data
   4.2 Smooth Regression with Discrete Data
   4.3 Kernel Regression with Discrete Regressors: The Irrelevant Regressor Case
   4.4 Regression with Mixed Data: Relevant Regressors
      4.4.1 Smooth Estimation with Mixed Data
      4.4.2 The Cross-Validation Method
   4.5 Regression with Mixed Data: Irrelevant Regressors
      4.5.1 Ordered Discrete Variables
   4.6 Applications
      4.6.1 Food-Away-from-Home Expenditure
      4.6.2 Modeling Strike Volume
   4.7 Exercises

5 Conditional Density Estimation
   5.1 Conditional Density Estimation: Relevant Variables
   5.2 Conditional Density Bandwidth Selection
      5.2.1 Least Squares Cross-Validation: Relevant Variables
      5.2.2 Maximum Likelihood Cross-Validation: Relevant Variables
   5.3 Conditional Density Estimation: Irrelevant Variables
   5.4 The Multivariate Dependent Variables Case
      5.4.1 The General Categorical Data Case
      5.4.2 Proof of Theorem 5.5
   5.5 Applications
      5.5.1 A Nonparametric Analysis of Corruption
      5.5.2 Extramarital Affairs Data
      5.5.3 Married Female Labor Force Participation
      5.5.4 Labor Productivity
      5.5.5 Multivariate Y Conditional Density Example: GDP Growth and Population Growth Conditional on OECD Status
   5.6 Exercises

6 Conditional CDF and Quantile Estimation
   6.1 Estimating a Conditional CDF with Continuous Covariates without Smoothing the Dependent Variable
   6.2 Estimating a Conditional CDF with Continuous Covariates Smoothing the Dependent Variable
   6.3 Nonparametric Estimation of Conditional Quantile Functions
   6.4 The Check Function Approach
   6.5 Conditional CDF and Quantile Estimation with Mixed Discrete and Continuous Covariates
   6.6 A Small Monte Carlo Simulation Study
   6.7 Nonparametric Estimation of Hazard Functions
   6.8 Applications
      6.8.1 Boston Housing Data
      6.8.2 Adolescent Growth Charts
      6.8.3 Conditional Value at Risk
      6.8.4 Real Income in Italy, 1951–1998
      6.8.5 Multivariate Y Conditional CDF Example: GDP Growth and Population Growth Conditional on OECD Status
   6.9 Proofs
      6.9.1 Proofs of Theorems 6.1, 6.2, and 6.4
      6.9.2 Proofs of Theorems 6.5 and 6.6 (Mixed Covariates Case)
   6.10 Exercises

II Semiparametric Methods

7 Semiparametric Partially Linear Models
   7.1 Partially Linear Models
      7.1.1 Identification of β
   7.2 Robinson's Estimator
      7.2.1 Estimation of the Nonparametric Component
   7.3 Andrews's MINPIN Method
   7.4 Semiparametric Efficiency Bounds
      7.4.1 The Conditionally Homoskedastic Error Case
      7.4.2 The Conditionally Heteroskedastic Error Case
   7.5 Proofs
      7.5.1 Proof of Theorem 7.2
      7.5.2 Verifying Theorem 7.3 for a Partially Linear Model
   7.6 Exercises

8 Semiparametric Single Index Models
   8.1 Identification Conditions
   8.2 Estimation
      8.2.1 Ichimura's Method
   8.3 Direct Semiparametric Estimators for β
      8.3.1 Average Derivative Estimators
      8.3.2 Estimation of g(·)
   8.4 Bandwidth Selection
      8.4.1 Bandwidth Selection for Ichimura's Method
      8.4.2 Bandwidth Selection with Direct Estimation Methods
   8.5 Klein and Spady's Estimator
   8.6 Lewbel's Estimator
   8.7 Manski's Maximum Score Estimator
   8.8 Horowitz's Smoothed Maximum Score Estimator
   8.9 Han's Maximum Rank Estimator
   8.10 Multinomial Discrete Choice Models
   8.11 Ai's Semiparametric Maximum Likelihood Approach
   8.12 A Sketch of the Proof of Theorem 8.1
   8.13 Applications
      8.13.1 Modeling Response to Direct Marketing Catalog Mailings
   8.14 Exercises

9 Additive and Smooth (Varying) Coefficient Semiparametric Models
   9.1 An Additive Model
      9.1.1 The Marginal Integration Method
      9.1.2 A Computationally Efficient Oracle Estimator
      9.1.3 The Ordinary Backfitting Method
      9.1.4 The Smoothed Backfitting Method
      9.1.5 Additive Models with Link Functions
   9.2 An Additive Partially Linear Model
      9.2.1 A Simple Two-Step Method
   9.3 A Semiparametric Varying (Smooth) Coefficient Model
      9.3.1 A Local Constant Estimator of the Smooth Coefficient Function
      9.3.2 A Local Linear Estimator of the Smooth Coefficient Function
      9.3.3 Testing for a Parametric Smooth Coefficient Model
      9.3.4 Partially Linear Smooth Coefficient Models
      9.3.5 Proof of Theorem 9.3
   9.4 Exercises

10 Selectivity Models
   10.1 Semiparametric Type-2 Tobit Models
   10.2 Estimation of a Semiparametric Type-2 Tobit Model
      10.2.1 Gallant and Nychka's Estimator
      10.2.2 Estimation of the Intercept in Selection Models
   10.3 Semiparametric Type-3 Tobit Models
      10.3.1 Econometric Preliminaries
      10.3.2 Alternative Estimation Methods
   10.4 Das, Newey and Vella's Nonparametric Selection Model
   10.5 Exercises

11 Censored Models
   11.1 Parametric Censored Models
   11.2 Semiparametric Censored Regression Models
   11.3 Semiparametric Censored Regression Models with Nonparametric Heteroskedasticity
   11.4 The Univariate Kaplan-Meier CDF Estimator
   11.5 The Multivariate Kaplan-Meier CDF Estimator
      11.5.1 Nonparametric Regression Models with Random Censoring
   11.6 Nonparametric Censored Regression
      11.6.1 Lewbel and Linton's Approach
      11.6.2 Chen, Dahl and Khan's Approach
   11.7 Exercises

III Consistent Model Specification Tests

12 Model Specification Tests
   12.1 A Simple Consistent Test for Parametric Regression Functional Form
      12.1.1 A Consistent Test for Correct Parametric Functional Form
      12.1.2 Mixed Data
   12.2 Testing for Equality of PDFs
   12.3 More Tests Related to Regression Functions
      12.3.1 Härdle and Mammen's Test for a Parametric Regression Model
      12.3.2 An Adaptive and Rate Optimal Test
      12.3.3 A Test for a Parametric Single Index Model
      12.3.4 A Nonparametric Omitted Variables Test
      12.3.5 Testing the Significance of Categorical Variables
   12.4 Tests Related to PDFs
      12.4.1 Testing Independence between Two Random Variables
      12.4.2 A Test for a Parametric PDF
      12.4.3 A Kernel Test for Conditional Parametric Distributions
   12.5 Applications
      12.5.1 Growth Convergence Clubs
   12.6 Proofs
      12.6.1 Proof of Theorem 12.1
      12.6.2 Proof of Theorem 12.2
      12.6.3 Proof of Theorem 12.5
      12.6.4 Proof of Theorem 12.9
   12.7 Exercises

13 Nonsmoothing Tests
   13.1 Testing for Parametric Regression Functional Form
   13.2 Testing for Equality of PDFs
   13.3 A Nonparametric Significance Test
   13.4 Andrews's Test for Conditional CDFs
   13.5 Hong's Tests for Serial Dependence
   13.6 More on Nonsmoothing Tests
   13.7 Proofs
      13.7.1 Proof of Theorem 13.1
   13.8 Exercises

IV Nonparametric Nearest Neighbor and Series Methods

14 K-Nearest Neighbor Methods
   14.1 Density Estimation: The Univariate Case
   14.2 Regression Function Estimation
   14.3 A Local Linear k-nn Estimator
   14.4 Cross-Validation with Local Constant k-nn Estimation
   14.5 Cross-Validation with Local Linear k-nn Estimation
   14.6 Estimation of Semiparametric Models with k-nn Methods
   14.7 Model Specification Tests with k-nn Methods
      14.7.1 A Bootstrap Test
   14.8 Using Different k for Different Components of x
   14.9 Proofs
      14.9.1 Proof of Theorem 14.1
      14.9.2 Proof of Theorem 14.5
      14.9.3 Proof of Theorem 14.10
   14.10 Exercises

15 Nonparametric Series Methods
   15.1 Estimating Regression Functions
      15.1.1 Convergence Rates
   15.2 Selection of the Series Term K
      15.2.1 Asymptotic Normality
   15.3 A Partially Linear Model
      15.3.1 An Additive Partially Linear Model
      15.3.2 Selection of Nonlinear Additive Components
      15.3.3 Estimating an Additive Model with a Known Link Function
   15.4 Estimation of Partially Linear Varying Coefficient Models
      15.4.1 Testing for Correct Parametric Regression Functional Form
      15.4.2 A Consistent Test for an Additive Partially Linear Model
   15.5 Other Series-Based Tests
   15.6 Proofs
      15.6.1 Proof of Theorem 15.1
      15.6.2 Proof of Theorem 15.3
      15.6.3 Proof of Theorem 15.6
      15.6.4 Proof of Theorem 15.9
      15.6.5 Proof of Theorem 15.10
   15.7 Exercises

V Time Series, Simultaneous Equation, and Panel Data Models

16 Instrumental Variables and Efficient Estimation of Semiparametric Models
   16.1 A Partially Linear Model with Endogenous Regressors in the Parametric Part
   16.2 A Varying Coefficient Model with Endogenous Regressors in the Parametric Part
   16.3 Ai and Chen's Efficient Estimator with Conditional Moment Restrictions
      16.3.1 Estimation Procedures
      16.3.2 Asymptotic Normality for β̂
      16.3.3 A Partially Linear Model with the Endogenous Regressors in the Nonparametric Part
   16.4 Proof of Equation (16.16)
   16.5 Exercises

17 Endogeneity in Nonparametric Regression Models
   17.1 A Nonparametric Model
   17.2 A Triangular Simultaneous Equation Model
   17.3 Newey and Powell's Series Based Estimator
   17.4 Hall and Horowitz's Kernel-Based Estimator
   17.5 Darolles, Florens and Renault's Estimator
   17.6 Exercises

18 Weakly Dependent Data
   18.1 Density Estimation with Dependent Data
      18.1.1 Uniform Almost Sure Rate of Convergence
   18.2 Regression Models with Dependent Data
      18.2.1 The Martingale Difference Error Case
      18.2.2 The Autocorrelated Error Case
      18.2.3 One-Step-Ahead Forecasting
      18.2.4 d-Step-Ahead Forecasting
      18.2.5 Estimation of Nonparametric Impulse Response Functions
   18.3 Semiparametric Models with Dependent Data
      18.3.1 A Partially Linear Model with Dependent Data
      18.3.2 Additive Regression Models
      18.3.3 Varying Coefficient Models with Dependent Data
   18.4 Testing for Serial Correlation in Semiparametric Models
      18.4.1 The Test Statistic and Its Asymptotic Distribution
      18.4.2 Testing Zero First Order Serial Correlation
   18.5 Model Specification Tests with Dependent Data
      18.5.1 A Kernel Test for Correct Parametric Regression Functional Form
      18.5.2 Nonparametric Significance Tests
   18.6 Nonsmoothing Tests for Regression Functional Form
   18.7 Testing Parametric Predictive Models
      18.7.1 In-Sample Testing of Conditional CDFs
      18.7.2 Out-of-Sample Testing of Conditional CDFs
   18.8 Applications
      18.8.1 Forecasting Short-Term Interest Rates
   18.9 Nonparametric Estimation with Nonstationary Data
   18.10 Proofs
      18.10.1 Proof of Equation (18.9)
      18.10.2 Proof of Theorem 18.2
   18.11 Exercises

19 Panel Data Models
   19.1 Nonparametric Estimation of Panel Data Models: Ignoring the Variance Structure
   19.2 Wang's Efficient Nonparametric Panel Data Estimator
   19.3 A Partially Linear Model with Random Effects
   19.4 Nonparametric Panel Data Models with Fixed Effects
      19.4.1 The Error Variance Structure Is Known
      19.4.2 The Error Variance Structure Is Unknown
   19.5 A Partially Linear Model with Fixed Effects
   19.6 Semiparametric Instrumental Variable Estimators
      19.6.1 An Infeasible Estimator
      19.6.2 The Choice of Instruments
      19.6.3 A Feasible Estimator
   19.7 Testing for Serial Correlation and for Individual Effects in Semiparametric Models
   19.8 Series Estimation of Panel Data Models
      19.8.1 Additive Effects
      19.8.2 Alternative Formulation of Fixed Effects
   19.9 Nonlinear Panel Data Models
      19.9.1 Censored Panel Data Models
      19.9.2 Discrete Choice Panel Data Models
   19.10 Proofs
      19.10.1 Proof of Theorem 19.1
      19.10.2 Leading MSE Calculation of Wang's Estimator
   19.11 Exercises

20 Topics in Applied Nonparametric Estimation
   20.1 Nonparametric Methods in Continuous-Time Models
      20.1.1 Nonparametric Estimation of Continuous-Time Models
      20.1.2 Nonparametric Tests for Continuous-Time Models
      20.1.3 Aït-Sahalia's Test
      20.1.4 Hong and Li's Test
      20.1.5 Proofs
   20.2 Nonparametric Estimation of Average Treatment Effects
      20.2.1 The Model
      20.2.2 An Application: Assessing the Efficacy of Right Heart Catheterization
   20.3 Nonparametric Estimation of Auction Models
      20.3.1 Estimation of First Price Auction Models
      20.3.2 Conditionally Independent Private Information Auctions
   20.4 Copula-Based Semiparametric Estimation of Multivariate Distributions
      20.4.1 Some Background on Copula Functions
      20.4.2 Semiparametric Copula-Based Multivariate Distributions
      20.4.3 A Two-Step Estimation Procedure
      20.4.4 A One-Step Efficient Estimation Procedure
      20.4.5 Testing Parametric Functional Forms of a Copula
   20.5 A Semiparametric Transformation Model
   20.6 Exercises

A Background Statistical Concepts
   A.1 Probability, Measure, and Measurable Space
   A.2 Metric, Norm, and Functional Spaces
   A.3 Limits and Modes of Convergence
      A.3.1 Limit Supremum and Limit Infimum
      A.3.2 Modes of Convergence
   A.4 Inequalities, Laws of Large Numbers, and Central Limit Theorems
   A.5 Exercises

Bibliography
Author Index
Subject Index


Preface

Throughout this book, the term "nonparametric" is used to refer to statistical techniques that do not require a researcher to specify a functional form for an object being estimated. Rather than assuming that the functional form of an object is known up to a few (finite) unknown parameters, we substitute less restrictive assumptions such as smoothness (differentiability) and moment restrictions for the objects being studied. For example, when we are interested in estimating the income distribution of a region, instead of assuming that the density function lies in a parametric family such as the normal or log-normal family, we assume only that the density function is twice (or three times) differentiable. Of course, if one possesses prior knowledge (some have called this "divine insight") about the functional form of the object of interest, then one will always do better by using parametric techniques. However, in practice such functional forms are rarely (if ever) known, and the unforgiving consequences of parametric misspecification are well known and are not repeated here.
Since nonparametric techniques make fewer assumptions about the object being estimated than do parametric techniques, nonparametric estimators tend to be slower to converge to the objects being studied than correctly specified parametric estimators. In addition, unlike their parametric counterparts, the convergence rate is typically inversely related to the number of variables (covariates) involved, which is sometimes referred to as the "curse of dimensionality." However, it is often surprising how, even for moderate datasets, nonparametric approaches can reveal structure in the data which might be missed were one to employ common parametric functional specifications. Nonparametric methods are therefore best suited to situations in which (i) one knows little about the functional form of the object being estimated, (ii) the number of variables (covariates) is small, and (iii) the researcher has a reasonably large data set. Points (ii) and (iii) are closely related because, in nonparametric settings, whether or not one has a sufficiently large sample depends on how many covariates are present. Silverman (1986, see Table 4.2, p. 94) provides an excellent illustration of the relationship between the sample size and the covariate dimension required to obtain accurate nonparametric estimates.

We use the term "semiparametric" to refer to statistical techniques that do not require a researcher to specify a parametric functional form for some part of an object being estimated but do require parametric assumptions for the remaining part(s).

As noted above, the nonparametric methods covered in this text offer the advantage of imposing less restrictive assumptions on functional forms (e.g., regression or conditional probability functions) as compared to, say, commonly used parametric models. However, alternative approaches may be obtained by relaxing restrictive assumptions in a conventional parametric setting. One such approach, taken by Manski (2003) and his collaborators, considers probability or regression models in which some parameters are not identified. Instead of imposing overly strong assumptions to identify the parameters, it is often possible to find bounds for the permissible range of these parameters. When the bound is relatively tight, i.e., when the permissible range is quite narrow, one can almost identify these parameters. This exciting line of inquiry, however, is beyond the scope of this text, so we refer the interested reader to the excellent monograph by Manski (2003); see also recent work by Manski and Tamer (2002), Imbens and Manski (2004), Honoré and Tamer (2006) and the references therein.

Nonparametric and semiparametric methods have attracted a great deal of attention from statisticians in the past few decades, as evidenced by the vast array of texts written by statisticians including Prakasa Rao (1983), Devroye and Györfi (1985), Silverman (1986), Scott (1992), Bickel, Klaassen, Ritov and Wellner (1993), Wand and Jones (1995), Fan and Gijbels (1996), Simonoff (1996), Azzalini and Bowman (1997), Hart (1997), Efromovich (1999), Eubank (1999), Ruppert, Carroll and Wand (2003), and Fan and Yao (2005). However, the number of texts tailored to the needs of applied econometricians is relatively scarce, Härdle (1990), Horowitz (1998), Pagan and Ullah (1999), Yatchew (2003), and Härdle, Müller, Sperlich and Werwatz (2004) being those of which we are currently aware.
In addition, the majority of existing texts operate from the presumption that the underlying data is strictly continuous in nature, while more often than not economists deal with categorical (nominal and ordinal) data in applied settings. The conventional frequency-based nonparametric approach to dealing with the presence of discrete variables is acknowledged to be unsatisfactory. Building upon Aitchison and Aitken's (1976) seminal work on smoothing discrete covariates, we recently proposed a number of novel nonparametric approaches; see, e.g., Li and Racine (2003), Hall, Racine and Li (2004), Racine and Li (2004), Li and Racine (2004a), Racine, Li and Zhu (2004), Ouyang, Li and Racine (2006), Hall, Li and Racine (2006), Racine, Hart and Li (forthcoming), Li and Racine (forthcoming), and Hsiao, Li and Racine (forthcoming) for recent work in this area. In this text we emphasize nonparametric techniques suited to the rich array of data types (continuous, nominal, and ordinal) encountered by an applied economist within one coherent framework.

Another defining feature of this text is its emphasis on the properties of nonparametric estimators in the presence of potentially irrelevant variables. Existing treatments of kernel methods, in particular bandwidth selection methods, presume that all variables are relevant. For example, existing treatments of plug-in or cross-validation methods presume that all covariates in a regression model are in fact relevant, i.e., that all covariates help explain variation in the outcome (i.e., the dependent variable). When this is not the case, however, existing results such as rates of convergence and the behavior of bandwidths no longer hold; see, e.g., Hall et al. (2004), Hall et al. (2006), Racine and Li (2004), and Li and Racine (2004a). We feel that this is an extremely important aspect of sound nonparametric estimation which must be appreciated by practitioners if they are to wield these tools wisely.

This book is aimed at students enrolled in a graduate course in nonparametric and semiparametric methods who are interested in application areas such as economics and other social sciences. Ideal prerequisites would include a course in mathematical statistics and a course in parametric econometrics at the level of, say, Greene (2003) or Wooldridge (2002). We also intend for this text to serve as a valuable reference for a much wider audience, including applied researchers and those who wish to familiarize themselves with the subject area.

The five parts of this text are organized as follows. The first part covers nonparametric estimation of density and regression functions with independent data, with emphasis being placed on mixed discrete and continuous data types. The second part deals with various semiparametric models, again with independent data, including partially linear models, single index models, additive models, varying coefficient models, censored models and sample selection models. The third part deals with an array of consistent model specification tests. The fourth part examines nearest neighbor and series methods. The fifth part considers kernel estimation of instrumental variable models, simultaneous equation models, and panel data models and extends results from previous chapters to the weakly dependent data setting.

Rigorous proofs are provided for most results in Part I, while outlines of proofs are provided for many results in Parts II, IV, and V.
Background statistical concepts are presented in an appendix.

An R package (R Development Core Team (2006)) is available and can be obtained directly from http://www.r-project.org that implements a number of the methods discussed in Parts I and II, and some of those discussed in Parts III, IV, and V. It also contains some datasets used in the book, and contains a function that allows the reader to easily implement new kernel-based tests and kernel-based estimators.

Exercises appear at the end of each chapter, and detailed hints are provided for many of the problems. Students who wish to master the material are encouraged to work out as many problems as possible. Because some of the hints may render the questions almost trivial, we strongly encourage students who wish to master the techniques to work on the problems without first consulting the hints.

We are deeply indebted to so many people who have provided guidance, inspiration, or have laid the foundations that have made this book possible. It would be impossible to list them all. However, we ask each of you who have in one way or another contributed to this project to indulge us and enjoy a personal sense of accomplishment at its completion.

This being said, we would like to thank the staff at Princeton University Press, namely, Peter Dougherty, Seth Ditchik, Terri O'Prey, and Carole Schwager, for their eye to detail and professional guidance through this process.

We would also like to express our deep gratitude to numerous granting agencies for their generous financial support that funded research which forms the heart of this book. In particular, Li would like to acknowledge support from the Social Sciences and Humanities Research Council of Canada (SSHRC), the Natural Sciences and Engineering Research Council of Canada (NSERC), the Texas A&M University Private Enterprise Research Center, and the Bush School Program in Economics. Racine would like to acknowledge support from SSHRC, NSERC, the Center for Policy Research at Syracuse University, and the National Science Foundation (NSF) in the United States of America.

We would additionally like to thank graduate students at McMaster University, Syracuse University, Texas A&M University, the University of California San Diego, the University of Guelph, the University of South Florida, and York University, who, as involuntary subjects, provided valuable feedback on early drafts of this manuscript.

We would furthermore like to thank numerous coauthors for their many and varied contributions. We would especially like to thank Peter Hall, whose collaborations on kernel methods with mixed data types and, in particular, whose singular contributions to the theoretical foundations for kernel methods with irrelevant variables have brought much of this work to fruition.

Many people have provided feedback that has deepened our understanding and enhanced this book. In particular, we would like to acknowledge Chunrong Ai, Zongwu Cai, Xiaohong Chen, David Giles, Yanqin Fan, Jianhua Huang, Yiguo Sun, and Lijian Yang.

On a slightly more personal note, Racine would like to express his deep-felt affection and personal indebtedness to Aman Ullah, who not only baptized him in nonparametric statistics, but also guided his thesis and remains an ongoing source of inspiration.

Finally, Li would like to dedicate this book to his wife, Zhenjuan Liu, his daughter, Kathy, and son, Kevin, without whom this project might not have materialized. Li would also like to dedicate this book to his parents with love and gratitude.
Racine would like to dedicate this book to the memory of his father, who passed away on November 22, 2005, and who has been a guiding light and will remain an eternal source of inspiration. Racine would also like to dedicate this book to his wife, Jennifer, and son, Adam, who continue to enrich his life beyond their ken.


Part I
Nonparametric Kernel Methods


Chapter 1
Density Estimation

The estimation of probability density functions (PDFs) and cumulative distribution functions (CDFs) are cornerstones of applied data analysis in the social sciences. Testing for the equality of two distributions (or moments thereof) is perhaps the most basic test in all of applied data analysis. Economists, for instance, devote a great deal of attention to the study of income distributions and how they vary across regions and over time. Though the PDF and CDF are often the objects of direct interest, their estimation also serves as an important building block for other objects being modeled, such as a conditional mean (i.e., a "regression function"), which may be directly modeled using nonparametric or semiparametric methods (a conditional mean is a function of a conditional PDF, which is itself a ratio of unconditional PDFs). After mastering the principles underlying the nonparametric estimation of a PDF, the nonparametric estimation of the workhorse of applied data analysis, the conditional mean function considered in Chapter 2, progresses in a fairly straightforward manner. Careful study of the approaches developed in Chapter 1 will be most helpful for understanding material presented in later chapters.

We begin with the estimation of a univariate PDF in Sections 1.1 through 1.3, turn to the estimation of a univariate CDF in Sections 1.4 and 1.5, and then move on to the more general multivariate setting in Sections 1.6 through 1.8. Asymptotic normality, uniform rates of convergence, and bias reduction methods appear in Sections 1.9 through 1.12. Numerous illustrative applications appear in Section 1.13, while theoretical and applied exercises can be found in Section 1.14.

We now proceed with a discussion of how to estimate the PDF $f_X(x)$ of a random variable $X$. For notational simplicity we drop the subscript $X$ and simply use $f(x)$ to denote the PDF of $X$. Some of the treatments of the kernel estimation of a PDF discussed in this chapter are drawn from the two excellent monographs by Silverman (1986) and Scott (1992).

1.1 Univariate Density Estimation

To best appreciate why one might consider using nonparametric methods to estimate a PDF, we begin with an illustrative example, the parametric estimation of a PDF.

Example 1.1. Suppose $X_1, X_2, \ldots, X_n$ represent independent and identically distributed (i.i.d.) draws from a normal distribution with mean $\mu$ and variance $\sigma^2$. We wish to estimate the normal PDF $f(x)$. By assumption, $f(x)$ has a known parametric functional form (i.e., univariate normal) given by $f(x) = (2\pi\sigma^2)^{-1/2}\exp\left[-(x-\mu)^2/(2\sigma^2)\right]$, where the mean $\mu = E(X)$ and variance $\sigma^2 = E[(X - E(X))^2] = \mathrm{var}(X)$ are the only unknown parameters to be estimated. One could estimate $\mu$ and $\sigma^2$ by the method of maximum likelihood as follows. Under the i.i.d. assumption, the joint PDF of $(X_1,\ldots,X_n)$ is simply the product of the univariate PDFs, which may be written as

$$f(X_1,\ldots,X_n;\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(X_i-\mu)^2}{2\sigma^2}\right].$$

Conditional upon the observed sample and taking the logarithm, this gives us the log-likelihood function

$$\mathcal{L}(\mu,\sigma^2) = \ln f(X_1,\ldots,X_n;\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2.$$

The method of maximum likelihood proceeds by choosing those parameters that make it most likely that we observed the sample at hand given our distributional assumption. Thus, the likelihood function (or a monotonic transformation thereof, e.g., $\ln$) expresses the plausibility of different values of $\mu$ and $\sigma^2$ given the observed sample. We then maximize the likelihood function with respect to these two unknown parameters.

The necessary first order conditions for a maximization of the log-likelihood function are $\partial\mathcal{L}(\mu,\sigma^2)/\partial\mu = 0$ and $\partial\mathcal{L}(\mu,\sigma^2)/\partial\sigma^2 = 0$. Solving these first order conditions for the two unknown parameters $\mu$ and $\sigma^2$ yields

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i \quad\text{and}\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu)^2.$$

$\hat\mu$ and $\hat\sigma^2$ above are the maximum likelihood estimators of $\mu$ and $\sigma^2$, respectively, and the resulting estimator of $f(x)$ is

$$\hat f(x) = \frac{1}{\sqrt{2\pi\hat\sigma^2}}\exp\left[-\frac{(x-\hat\mu)^2}{2\hat\sigma^2}\right].$$
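To make Example 1.1 concrete, the short R fragment below (R being the language of the software mentioned in the preface) computes the maximum likelihood estimates and the fitted normal density. It is our own illustrative sketch: the simulated data, sample size, and object names are assumptions made purely for the example and are not taken from the text.

## Example 1.1 sketch: ML estimation of a normal density (illustrative only)
set.seed(42)
n <- 500
x <- rnorm(n, mean = 1, sd = 2)        # stand-in for the observed X_1,...,X_n

mu.hat     <- mean(x)                  # ML estimator of mu
sigma2.hat <- mean((x - mu.hat)^2)     # ML estimator of sigma^2 (divides by n, not n-1)

x.grid <- seq(min(x), max(x), length.out = 200)
f.hat  <- dnorm(x.grid, mean = mu.hat, sd = sqrt(sigma2.hat))  # parametric estimate of f(x)
plot(x.grid, f.hat, type = "l", xlab = "x", ylab = "density")

The entire exercise stands or falls with the normality assumption, which is precisely the point made next.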
The "Achilles heel" of any parametric approach is of course the requirement that, prior to estimation, the analyst must specify the exact parametric functional form for the object being estimated. Upon reflection, the parametric approach is somewhat circular since we initially set out to estimate an unknown density but must first assume that the density is in fact known (up to a handful of unknown parameters, of course). Having based our estimate on the assumption that the density is a member of a known parametric family, we must then naturally confront the possibility that the parametric model is "misspecified," i.e., not consistent with the population from which the data was drawn. For instance, by assuming that $X$ is drawn from a normally distributed population in the above example, we in fact impose a number of potentially quite restrictive assumptions: symmetry, unimodality, monotonic decline away from the mode, and so on. If the true density were in fact asymmetric, or possessed multiple modes, or was nonmonotonic away from the mode, then the presumption of normality would yield a misleading estimate of the underlying density.

The nonparametric alternative studied in this chapter avoids specifying a parametric family altogether. We estimate $f(x)$ by the kernel estimator

$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{X_i - x}{h}\right), \qquad (1.8)$$

where $h = h(n)$ is a smoothing parameter (the "bandwidth") and $k(\cdot)$ is a kernel function satisfying

$$\text{(i) } k(v) = k(-v) \ge 0, \qquad \text{(ii) } \int k(v)\,dv = 1, \qquad \text{(iii) } \int v^2 k(v)\,dv = \kappa_2 > 0, \qquad (1.10)$$

with $k(\cdot)$ bounded. Provided that $h \to 0$ and $nh \to \infty$ as $n \to \infty$, $\hat f(x)$ is a consistent estimator of $f(x)$. Note that the symmetry condition (i) implies that $\int v k(v)\,dv = 0$. By consistency, we mean that $\hat f(x) \to f(x)$ in probability (convergence in probability is defined in Appendix A). Note that $k(\cdot)$ defined in (1.10) is a (symmetric) PDF. For recent work on kernel methods with asymmetric kernels, see Abadir and Lawford (2004).

To define various modes of convergence, we first introduce the concept of the "Euclidean norm" ("Euclidean length") of a vector. Given a $q\times 1$ vector $z = (z_1, z_2, \ldots, z_q)' \in \mathbb{R}^q$, we use $\|z\|$ to denote the Euclidean length of $z$, which is defined by

$$\|z\| = (z'z)^{1/2} = \left(z_1^2 + z_2^2 + \cdots + z_q^2\right)^{1/2}.$$

When $q = 1$ (a scalar), $\|z\|$ is simply the absolute value of $z$.

In the appendix we discuss the notation $O(\cdot)$ ("big Oh") and $o(\cdot)$ ("small oh"). Let $a_n$ be a nonstochastic sequence. We say that $a_n = O(n^{\alpha})$ if $|a_n| \le C n^{\alpha}$ for all $n$ sufficiently large, where $\alpha$ and $C\,(>0)$ are constants. Similarly, we say that $a_n = o(n^{\alpha})$ if $a_n/n^{\alpha} \to 0$ as $n \to \infty$.

We are now ready to prove the MSE consistency of $\hat f(x)$.

Theorem 1.1. Let $X_1,\ldots,X_n$ denote i.i.d. observations having a three-times differentiable PDF $f(x)$, and let $f^{(s)}(x)$ denote the $s$th order derivative of $f(x)$ ($s = 1,2,3$). Let $x$ be an interior point in the support of $X$, and let $\hat f(x)$ be that defined in (1.8). Assume that the kernel function $k(\cdot)$ is bounded and satisfies (1.10). Also, as $n \to \infty$, $h \to 0$ and $nh \to \infty$. Then

$$\mathrm{MSE}\left(\hat f(x)\right) = \frac{h^4}{4}\left[\kappa_2 f^{(2)}(x)\right]^2 + \frac{\kappa f(x)}{nh} + o\!\left(h^4 + (nh)^{-1}\right) = O\!\left(h^4 + (nh)^{-1}\right), \qquad (1.11)$$

where $\kappa_2 = \int v^2 k(v)\,dv$ and $\kappa = \int k^2(v)\,dv$.
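Before turning to the proof, it may help to see the estimator in (1.8) written out in code. The following R sketch is our own illustration (the simulated data, the bandwidth value, and the function name are assumptions for the example, not values taken from the text); the R package mentioned in the preface provides a full implementation of the methods of this chapter.

## Rosenblatt-Parzen kernel density estimator with a second order Gaussian kernel:
## f.hat(x) = (nh)^{-1} sum_i k((X_i - x)/h).  Illustrative sketch only.
kde <- function(x.eval, x.obs, h) {
  sapply(x.eval, function(x0) mean(dnorm((x.obs - x0) / h)) / h)
}

set.seed(123)
x.obs  <- rnorm(500)
x.grid <- seq(-4, 4, length.out = 200)
f.hat  <- kde(x.grid, x.obs, h = 0.3)    # h chosen arbitrarily here; see Sections 1.2-1.3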
Proof of Theorem 1.1. Write

$$\mathrm{MSE}\left(\hat f(x)\right) = E\left\{\left[\hat f(x) - f(x)\right]^2\right\} = \mathrm{var}\left(\hat f(x)\right) + \left[\mathrm{bias}\left(\hat f(x)\right)\right]^2.$$

We will evaluate the bias$(\hat f(x))$ and var$(\hat f(x))$ terms separately.

For the bias calculation we will need to use the Taylor expansion formula. For a univariate function $g(x)$ that is $m$ times differentiable, we have

$$g(x) = g(x_0) + g^{(1)}(x_0)(x-x_0) + \frac{1}{2!}g^{(2)}(x_0)(x-x_0)^2 + \cdots + \frac{1}{(m-1)!}g^{(m-1)}(x_0)(x-x_0)^{m-1} + \frac{1}{m!}g^{(m)}(\xi)(x-x_0)^m,$$

where $g^{(s)}(x_0) = \left.\frac{d^s g(x)}{dx^s}\right|_{x=x_0}$, and $\xi$ lies between $x$ and $x_0$.

The bias term is given by

$$\mathrm{bias}\left(\hat f(x)\right) = E\left[\hat f(x)\right] - f(x)$$
$$= E\left[\frac{1}{h}k\!\left(\frac{X_i - x}{h}\right)\right] - f(x) \qquad\text{(by identical distribution)}$$
$$= \frac{1}{h}\int f(x_1)\,k\!\left(\frac{x_1 - x}{h}\right)dx_1 - f(x)$$
$$= \int f(x+hv)k(v)\,dv - f(x) \qquad\text{(change of variable, } x_1 - x = hv\text{)}$$
$$= \int \left\{f(x) + f^{(1)}(x)hv + \tfrac{1}{2}f^{(2)}(x)h^2v^2 + O(h^3)\right\}k(v)\,dv - f(x)$$
$$= \frac{h^2}{2}f^{(2)}(x)\int v^2k(v)\,dv + O(h^3), \qquad (1.12)$$

by (1.10), where the $O(h^3)$ term comes from

$$\frac{1}{3!}h^3\left|\int v^3 f^{(3)}(\tilde x)k(v)\,dv\right| \le C h^3\left[\int |v|^3 k(v)\,dv\right] = O(h^3),$$

where $C$ is a positive constant and $\tilde x$ lies between $x$ and $x + hv$. Note that in the above derivation we assume that $f(x)$ is three-times differentiable. We can weaken this condition to $f(x)$ being twice differentiable, resulting in (the $O(h^3)$ becomes $o(h^2)$; see Exercise 1.5)

$$\mathrm{bias}\left(\hat f(x)\right) = E\left[\hat f(x)\right] - f(x) = \frac{h^2}{2}f^{(2)}(x)\int v^2k(v)\,dv + o(h^2). \qquad (1.13)$$

Next we consider the variance term. Observe that

$$\mathrm{var}\left(\hat f(x)\right) = \mathrm{var}\left[\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{X_i - x}{h}\right)\right]$$
$$= \frac{1}{n^2h^2}\sum_{i=1}^n \mathrm{var}\left[k\!\left(\frac{X_i - x}{h}\right)\right] \qquad\text{(by independence)}$$
$$= \frac{1}{nh^2}\left\{E\left[k^2\!\left(\frac{X_i - x}{h}\right)\right] - \left[E\,k\!\left(\frac{X_i - x}{h}\right)\right]^2\right\}$$
$$= \frac{1}{nh^2}\left\{h\int f(x+hv)k^2(v)\,dv - \left[h\int f(x+hv)k(v)\,dv\right]^2\right\}$$
$$= \frac{1}{nh}\left\{\int\left[f(x) + O(hv)\right]k^2(v)\,dv + O(h)\right\}$$
$$= \frac{1}{nh}\left\{f(x)\kappa + O(h)\right\} = \frac{f(x)\kappa}{nh} + O\!\left(\frac{1}{n}\right), \qquad (1.14)$$

where $\kappa = \int k^2(v)\,dv$. Equations (1.12) and (1.14) complete the proof of Theorem 1.1.

Theorem 1.1 implies that (by Theorem A.7 of Appendix A)

$$\hat f(x) - f(x) = O_p\!\left(h^2 + (nh)^{-1/2}\right) = o_p(1).$$

By choosing $h = cn^{-1/\alpha}$ for some $c > 0$ and $\alpha > 1$, the conditions required for consistent estimation of $f(x)$, $h \to 0$ and $nh \to \infty$, are clearly satisfied. The overriding question is what values of $c$ and $\alpha$ should be used in practice. As can be seen, for a given sample size $n$, if $h$ is small, the resulting estimator will have a small bias but a large variance. On the other hand, if $h$ is large, then the resulting estimator will have a small variance but a large bias. To minimize $\mathrm{MSE}(\hat f(x))$, one should balance the squared bias and the variance terms. The optimal choice of $h$ (in the sense that $\mathrm{MSE}(\hat f(x))$ is minimized) should satisfy $d\,\mathrm{MSE}(\hat f(x))/dh = 0$. By using (1.11), it is easy to show that the optimal $h$ that minimizes the leading term of $\mathrm{MSE}(\hat f(x))$ is given by

$$h_{\mathrm{opt}} = c(x)\,n^{-1/5}, \qquad (1.15)$$

where $c(x) = \left\{\kappa f(x)/\left[\kappa_2 f^{(2)}(x)\right]^2\right\}^{1/5}$.

$\mathrm{MSE}(\hat f(x))$ is clearly a "pointwise" property, and by using this as the basis for bandwidth selection we are obtaining a bandwidth that is optimal when estimating a density at a point $x$. Examining $c(x)$ in (1.15), we can see that a bandwidth which is optimal for estimation at a point $x$ located in the tail of a distribution will differ from that which is optimal for estimation at a point located at, say, the mode. Suppose that we are interested not in tailoring the bandwidth to the pointwise estimation of $f(x)$ but instead in tailoring the bandwidth globally for all points $x$, that is, for all $x$ in the support of $f(\cdot)$ (the support of $X$ is defined as the set of points of $x$ for which $f(x) > 0$, i.e., $\{x : f(x) > 0\}$). In this case we can choose $h$ optimally by minimizing the "integrated MSE" (IMSE) of $\hat f(x)$. Using (1.11) we have

$$\mathrm{IMSE}(\hat f) = \int E\left\{\left[\hat f(x) - f(x)\right]^2\right\}dx = \frac{h^4\kappa_2^2}{4}\int\left[f^{(2)}(x)\right]^2dx + \frac{\kappa}{nh} + o\!\left(h^4 + (nh)^{-1}\right). \qquad (1.16)$$

Again letting $h_{\mathrm{opt}}$ denote the optimal smoothing parameter that minimizes the leading terms of (1.16), we use simple calculus to get

$$h_{\mathrm{opt}} = c_0\,n^{-1/5}, \qquad (1.17)$$

where $c_0 = \kappa^{1/5}\kappa_2^{-2/5}\left\{\int\left[f^{(2)}(x)\right]^2dx\right\}^{-1/5} > 0$ is a positive constant.
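The bias-variance trade-off that drives (1.15)-(1.17) is easy to see by simulation. The R sketch below is our own illustration (the N(0,1) design, the sample size, and the bandwidth grid are arbitrary choices, not values appearing in the text): it approximates the squared bias, the variance, and the MSE of $\hat f(0)$ by Monte Carlo for several bandwidths.

## Monte Carlo approximation of MSE(f.hat(x0)) at x0 = 0, X ~ N(0,1), Gaussian kernel.
set.seed(1)
n <- 200; x0 <- 0; R <- 1000
h.grid <- c(0.05, 0.1, 0.2, 0.4, 0.8)

mse <- sapply(h.grid, function(h) {
  f.hat <- replicate(R, {
    x <- rnorm(n)
    mean(dnorm((x - x0) / h)) / h          # kernel estimate of f(0)
  })
  bias2    <- (mean(f.hat) - dnorm(x0))^2
  variance <- var(f.hat)
  c(bias2 = bias2, variance = variance, mse = bias2 + variance)
})
round(t(mse), 5)

As h increases the squared bias rises and the variance falls, with the smallest MSE attained at an intermediate bandwidth, just as (1.15) predicts.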
Note that if $f^{(2)}(x) = 0$ for (almost) all $x$, then $c_0$ is not well defined. For example, if $X$ is, say, uniformly distributed over its support, then $f^{(s)}(x) = 0$ for all $x$ and for all $s \ge 1$, and (1.17) is not defined in this case. It can be shown that in this case (i.e., when $X$ is uniformly distributed), $h_{\mathrm{opt}}$ will have a different rate of convergence, equal to $n^{-1/2}$; see the related discussion in Section 1.8.1 and Exercise 1.16.

An interesting extension of the above results can be found in Zinde-Walsh (2005), who examines the asymptotic process for the kernel density estimator by means of generalized functions and generalized random processes and presents novel results for characterizing the behavior of kernel density estimators when the density does not exist, i.e., when the density does not exist as a locally summable function.

1.2 Univariate Bandwidth Selection: Rule-of-Thumb and Plug-In Methods

Equation (1.17) reveals that the optimal smoothing parameter depends on the integrated second derivative of the unknown density through $c_0$. In practice, one might choose an initial "pilot value" of $h$ to estimate $\int\left[f^{(2)}(x)\right]^2dx$ nonparametrically, and then use this value to obtain $h_{\mathrm{opt}}$ using (1.17). Such approaches are known as "plug-in methods" for obvious reasons. One popular way of choosing the initial $h$, suggested by Silverman (1986), is to assume that $f(x)$ belongs to a parametric family of distributions, and then to compute $h$ using (1.17). For example, if $f(x)$ is a normal PDF with variance $\sigma^2$, then $\int\left[f^{(2)}(x)\right]^2dx = 3/(8\pi^{1/2}\sigma^5)$. If a standard normal kernel is used, then using (1.17) we get the pilot estimate

$$h = (4\pi)^{-1/10}\left[3/(8\pi^{1/2})\right]^{-1/5}\sigma\,n^{-1/5} \approx 1.06\,\sigma\,n^{-1/5}, \qquad (1.18)$$

which is then plugged into $\int[\hat f^{(2)}(x)]^2dx$, which then may be used to obtain $h_{\mathrm{opt}}$ using (1.17). A clearly undesirable property of the plug-in method is that it is not fully automatic, because one needs to choose an initial value of $h$ to estimate $\int[f^{(2)}(x)]^2dx$ (see Marron, Jones and Sheather (1996) and also Loader (1999) for further discussion).

Often, practitioners will use (1.18) itself for the bandwidth. This is known as the "normal reference rule-of-thumb" approach, since it is the optimal bandwidth for a particular family of distributions, in this case the normal family. Should the underlying distribution be close to a normal distribution, then this will provide good results, and for exploratory purposes it is certainly computationally attractive. In practice, $\sigma$ is replaced by the sample standard deviation of $\{X_i\}_{i=1}^n$, while Silverman (1986, p. 47) advocates using a more robust measure of spread which replaces $\sigma$ with $A$, an "adaptive" measure of spread given by

$$A = \min(\text{standard deviation},\ \text{interquartile range}/1.34).$$
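A minimal R sketch of the normal reference rule-of-thumb (1.18), using Silverman's adaptive spread A, is given below. This is our own illustration; for comparison, base R's bw.nrd() implements the 1.06 constant and bw.nrd0() a closely related rule with the constant 0.9.

## Normal reference rule-of-thumb: h = 1.06 * A * n^{-1/5}, A = min(sd, IQR/1.34).
rot.bw <- function(x) {
  A <- min(sd(x), IQR(x) / 1.34)
  1.06 * A * length(x)^(-1/5)
}

set.seed(123)
x <- rnorm(500)
rot.bw(x)          # compare with bw.nrd(x) and bw.nrd0(x)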
We now turn our attention to a discussion of a number of fully automatic or "data-driven" methods for selecting $h$ that are tailored to the sample at hand.

1.3 Univariate Bandwidth Selection: Cross-Validation Methods

In both theoretical and practical settings, nonparametric kernel estimation has been established as relatively insensitive to the choice of kernel function. However, the same cannot be said for bandwidth selection. Different bandwidths can generate radically differing impressions of the underlying distribution. If kernel methods are used simply for "exploratory" purposes, then one might undersmooth the density by choosing a small value of $h$ and let the eye do any remaining smoothing. Alternatively, one might choose a range of values for $h$ and plot the resulting estimates. However, for sound analysis and inference, a principle having some known optimality properties must be adopted. One can think of choosing the bandwidth as being analogous to choosing the number of terms in a series approximation: the more terms one includes in the approximation, the more flexible the resulting model becomes, while the smaller the bandwidth of a kernel estimator, the more flexible it becomes. However, increasing flexibility (reducing potential bias) leads to increased variability (increasing potential variance). Seen in this light, one naturally appreciates how a number of the methods discussed below are motivated by the need to balance the squared bias and variance of the resulting estimate.

1.3.1 Least Squares Cross-Validation

Least squares cross-validation is a fully automatic data-driven method of selecting the smoothing parameter $h$, originally proposed by Rudemo (1982), Stone (1984) and Bowman (1984) (see also Silverman (1986, pp. 48-51)). This method is based on the principle of selecting a bandwidth that minimizes the integrated squared error of the resulting estimate, that is, it provides an optimal bandwidth tailored to all $x$ in the support of $f(x)$.

The integrated squared difference between $\hat f$ and $f$ is

$$\int\left[\hat f(x) - f(x)\right]^2dx = \int \hat f(x)^2\,dx - 2\int \hat f(x)f(x)\,dx + \int f(x)^2\,dx. \qquad (1.19)$$

As the third term on the right-hand side of (1.19) is unrelated to $h$, choosing $h$ to minimize (1.19) is therefore equivalent to minimizing

$$\int \hat f(x)^2\,dx - 2\int \hat f(x)f(x)\,dx \qquad (1.20)$$

with respect to $h$. In the second term, $\int \hat f(x)f(x)\,dx$ can be written as $E_X[\hat f(X)]$, where $E_X(\cdot)$ denotes expectation with respect to $X$ and not with respect to the random observations $\{X_i\}_{i=1}^n$ used for computing $\hat f(\cdot)$. Therefore, we may estimate $E_X[\hat f(X)]$ by $n^{-1}\sum_{i=1}^n \hat f_{-i}(X_i)$ (i.e., replacing $E_X$ by its sample mean), where

$$\hat f_{-i}(X_i) = \frac{1}{(n-1)h}\sum_{j\ne i} k\!\left(\frac{X_j - X_i}{h}\right) \qquad (1.21)$$

is the leave-one-out kernel estimator of $f(X_i)$. (Footnote: here we emphasize that it is important to use the leave-one-out kernel estimator for computing $E_X(\cdot)$ above. This is because the expectations operator presumes that the $X$ and the $X_i$'s are independent of one another. Without using the leave-one-out estimator, the cross-validation method will break down; see Exercise 1.6 (i).) Finally, we estimate the first term $\int \hat f(x)^2\,dx$ by

$$\frac{1}{n^2h}\sum_{i=1}^n\sum_{j=1}^n \bar k\!\left(\frac{X_i - X_j}{h}\right), \qquad (1.22)$$

where $\bar k(v) = \int k(u)k(v-u)\,du$ is the twofold convolution kernel derived from $k(\cdot)$. If $k(v) = \exp(-v^2/2)/\sqrt{2\pi}$, a standard normal kernel, then $\bar k(v) = \exp(-v^2/4)/\sqrt{4\pi}$, a normal kernel (i.e., normal PDF) with mean zero and variance two, which follows since two independent $N(0,1)$ random variables sum to a $N(0,2)$ random variable.
and hop should be the same, Tet fi denote the value of h that minimizes CV(A). Given that CVj(B) = CVjo-+ (8.0), where (¢.0.) denotes smelier order terms (than CV) and terms uarelated to h, it ean be shown that h = 1° +op(h®), or, equivalently, chat us Aw _h So" is Intuitively, (1.25) is easy to understand because CV;(h) = CVjo\ + (s.0,), thus asymptotically an h that minimizes OVs(h) should be ~ 1 +0 in probability. 18 1._DRNSITY ESTIMATION close to an h that minimizes CVjo(t); therefore, we expect that and 1 will be elose to each other in the sense of (1.25). Hardle, Hall and Marron (1988) showed that (i — A)/A® = O,(n-¥"), which indeed converges to zero (in probability) but at an extremely slow rate We again underscore the need to use the leave-one-out Kernel esti- ‘ator when constructing CV; as given in (1.23). If instead one were to use the standard kernel estimator, east squares cross-validation will break down, yielding h = 0. Exercise 1.6 shows that: if one does not use the leave-one-out kernel estimator when estimating f(X;), then (0 minimizes the objective function, which of course violates the ency condition that nh ~ 00 as n + 90. Horo we implicitly impose the restriction that. f(z) is not a zero funetion, which rules out the ease for which f() is a uniform PDE. In fact this condition can be relaxed, Stone (1984) showed that, as long, as f(x) is bounded, then the least squares cross-validation method will, select h optimally in the sense that Slf(eh) ~ fla))? de inks [[7(eh) ~ fle)? de where j(2,h) denotes the kemel estimator of f(z) with cross-validation selected fi, and f(2,h) isthe kernel estimator with a generic h. Obvi- ously, the ratio defined in (1.26) should be greater than or equal to one for any n. Therefore, Stone's (1984) result states that, asymptotically, cross-validated smoothing parameter selection is optimal in the sense of minimizing the estimation integrated square error. In Bxercise 1.16, sve further discuss the intuition underlying why h +O even when f(2) is a uniform PDP. + 1 almost surely, (1.26) 1.3.2 Likelihood Cross-Validation Likelihood cross-validation is another antomatic data-driven method for sclecting the smoothing parameter f. This appronch yields a den- sity estimate which has an entropy theoretic interpretation, since the estimate will be close to the actual density in a Kullback-Leibler sense This approach was proposed by Duin (1976). Likelihood cross-validation chooses h to maximize the (leave-one- cout) log likelihood function given by =InL TEx, 14, UNIVARIATE CDF FSTIMATION 19 where /-i(Xi) is the leave-oue-out kernel estimator of f(X:) defined in (1.21) The main problem with likelinood cross-validation is that it is severely affected by the tail behavior of f(z) and can lead to in- consistent results for fat tailed distributions when using popular kerne! functions (sce Hall (19874, 1987). For this reason the likelihood cross- validation method has elicited litle interest inthe statistical literature However the ikehood cross-validation method may work well for a range of standard distributions (ie., thin tailed). We consider the performance of likelihood cross-validation in Section 1.3.3, when we Compare the impact of diferent bandwidth selection methods on the resulting density estimate, and in Section 1.13, where we consider em pirical applications. 1.3.3 An Mlustration of Data-Driven Bandwidth Selection, Figure 1.1 presents kernel estimates constructed from n = 500 observa- tions drawn from a simulated bimodal distribution. 
1.3.2 Likelihood Cross-Validation

Likelihood cross-validation is another automatic data-driven method for selecting the smoothing parameter $h$. This approach yields a density estimate which has an entropy theoretic interpretation, since the estimate will be close to the actual density in a Kullback-Leibler sense. This approach was proposed by Duin (1976). Likelihood cross-validation chooses $h$ to maximize the (leave-one-out) log likelihood function given by

$$\mathcal{L} = \ln L = \sum_{i=1}^n \ln \hat f_{-i}(X_i),$$

where $\hat f_{-i}(X_i)$ is the leave-one-out kernel estimator of $f(X_i)$ defined in (1.21). The main problem with likelihood cross-validation is that it is severely affected by the tail behavior of $f(x)$ and can lead to inconsistent results for fat tailed distributions when using popular kernel functions (see Hall (1987a, 1987b)). For this reason the likelihood cross-validation method has elicited little interest in the statistical literature. However, the likelihood cross-validation method may work well for a range of standard distributions (i.e., thin tailed). We consider the performance of likelihood cross-validation in Section 1.3.3, where we compare the impact of different bandwidth selection methods on the resulting density estimate, and in Section 1.13, where we consider empirical applications.

1.3.3 An Illustration of Data-Driven Bandwidth Selection

Figure 1.1 presents kernel estimates constructed from $n = 500$ observations drawn from a simulated bimodal distribution. The second order Gaussian (normal) kernel was used throughout, and least squares cross-validation was used to select the bandwidth for the estimate appearing in the upper left plot of the figure, with $\hat h_{\mathrm{lscv}} = 0.19$. We also plot the estimate based on the normal reference rule-of-thumb ($\hat h_{\mathrm{rot}} = 0.34$) along with an undersmoothed estimate ($1/5 \times \hat h_{\mathrm{lscv}}$) and an oversmoothed estimate ($5 \times \hat h_{\mathrm{lscv}}$). (Footnote: likelihood cross-validation yielded a bandwidth of $\hat h_{\mathrm{lcv}} = 0.15$, which results in a density estimate virtually identical to that based upon least squares cross-validation for this dataset.)

[Figure 1.1: Univariate kernel estimates of a mixture of normals using least squares cross-validation, the normal reference rule-of-thumb, undersmoothing, and oversmoothing (n = 500). The correct parametric data generating process appears as the solid line, the kernel estimate as the dashed line.]

Figure 1.1 reveals that least squares cross-validation appears to yield a reasonable density estimate for this data, while the reference rule-of-thumb is inappropriate as it oversmooths somewhat. Extreme oversmoothing can lead to a unimodal estimate which completely obscures the true bimodal nature of the underlying distribution. Also, undersmoothing leads to too many false modes. See Exercise 1.17 for an empirical application that investigates the effects of under- and oversmoothing on the resulting density estimate.

1.4 Univariate CDF Estimation

In Section 1.1 we introduced the empirical CDF estimator $F_n(x)$ given in (1.2), while Exercise 1.4 shows that it is a $\sqrt{n}$-consistent estimator of $F(x)$. However, this empirical CDF $F_n(x)$ is not smooth, as it jumps by $1/n$ at each sample realization point. One can, however, obtain a smoothed estimate of $F(x)$ by integrating $\hat f(x)$. Define

$$\hat F(x) = \int_{-\infty}^x \hat f(v)\,dv = \frac{1}{n}\sum_{i=1}^n G\!\left(\frac{x - X_i}{h}\right), \qquad (1.27)$$

where $G(x) = \int_{-\infty}^x k(v)\,dv$ is a CDF (which follows directly because $k(\cdot)$ is a PDF; see (1.10)). The next theorem provides the MSE of $\hat F(x)$.

Theorem 1.2. Under the conditions given in Bowman, Hall and Prvan (1998), in particular, assuming that $F(x)$ is twice continuously differentiable, that $k(v) = dG(v)/dv$ is bounded, symmetric, and compactly supported, that $d^2F(x)/dx^2$ is Hölder-continuous, and that $0 < h \le Cn^{-\epsilon}$ for some $0 < \epsilon < 1$ and $C > 0$, we have

$$\mathrm{MSE}\left(\hat F(x)\right) = \frac{1}{n}F(x)\left[1 - F(x)\right] - \frac{\alpha_0 h}{n}f(x) + \frac{\kappa_2^2}{4}h^4\left[F^{(2)}(x)\right]^2 + o\!\left(\frac{h}{n} + h^4\right),$$

where $\alpha_0 = 2\int vG(v)k(v)\,dv$.

Proof. Consider first the mean of $\hat F(x)$. We have

$$E\left[\hat F(x)\right] = E\left[G\!\left(\frac{x - X_i}{h}\right)\right] = \int G\!\left(\frac{x - v}{h}\right)f(v)\,dv = \int F(x - hu)k(u)\,du = F(x) + \frac{\kappa_2}{2}h^2F^{(2)}(x) + o(h^2). \qquad (1.28)$$

We first used integration by parts to obtain $k(u)$, and then used the Taylor expansion, since $\int u^m k(u)\,du$ is usually finite. For example, if $k(u)$ has bounded support, or $k(u)$ is a standard normal kernel function, then $\int u^m k(u)\,du$ is finite for any $m > 0$. Similarly,

$$E\left[G^2\!\left(\frac{x - X_i}{h}\right)\right] = \int G^2\!\left(\frac{x - v}{h}\right)f(v)\,dv = \int F(x - hu)\,d\!\left[G^2(u)\right] = 2\int\left[F(x) - huf(x) + O(h^2u^2)\right]G(u)k(u)\,du = F(x) - \alpha_0 h f(x) + O(h^2), \qquad (1.29)$$

where $\alpha_0 = 2\int vG(v)k(v)\,dv$, and where we have used the fact that $2\int G(v)k(v)\,dv = \int dG^2(v) = G^2(\infty) - G^2(-\infty) = 1$, because $G(\cdot)$ is a (user-specified) CDF kernel function.

From (1.28) we have $\mathrm{bias}[\hat F(x)] = (1/2)\kappa_2 h^2 F^{(2)}(x) + o(h^2)$, and from (1.28) and (1.29) we have

$$\mathrm{var}\left(\hat F(x)\right) = \frac{1}{n}\left\{E\left[G^2\!\left(\frac{x - X_i}{h}\right)\right] - \left[E\,G\!\left(\frac{x - X_i}{h}\right)\right]^2\right\} = \frac{1}{n}\left\{F(x)\left[1 - F(x)\right] - \alpha_0 h f(x)\right\} + o\!\left(\frac{h}{n}\right),$$

which together give the MSE expression stated in the theorem.

1.6 Multivariate Density Estimation

Let $X_1,\ldots,X_n$ be i.i.d. $q$-vectors ($q \ge 1$) having a common PDF $f(x) = f(x_1, x_2, \ldots, x_q)$. Let $X_{is}$ denote the $s$th component of $X_i$ ($s = 1,\ldots,q$). Using a "product kernel function" constructed from the product of univariate kernel functions, we estimate the PDF $f(x)$ by

$$\hat f(x) = \frac{1}{n h_1\cdots h_q}\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right), \qquad (1.35)$$

where $K\!\left(\frac{X_i - x}{h}\right) = k\!\left(\frac{X_{i1} - x_1}{h_1}\right)\times\cdots\times k\!\left(\frac{X_{iq} - x_q}{h_q}\right)$, and where $k(\cdot)$ is a univariate kernel function satisfying (1.10).
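A sketch of the product kernel estimator (1.35), with a Gaussian kernel in each dimension and a user-supplied bandwidth vector, is given below. This is our own illustration; the two-dimensional simulated design, the bandwidths, and the function name are assumptions made for the example.

## Multivariate product-kernel density estimator:
## f.hat(x) = (n h_1...h_q)^{-1} sum_i prod_s k((X_is - x_s)/h_s), Gaussian k.
kde.prod <- function(x.eval, X, h) {       # x.eval: q-vector; X: n x q matrix; h: q-vector
  u <- sweep(X, 2, x.eval, "-")            # X_is - x_s
  u <- sweep(u, 2, h, "/")                 # (X_is - x_s)/h_s
  mean(apply(dnorm(u), 1, prod)) / prod(h)
}

set.seed(123)
n <- 500
X <- cbind(rnorm(n), rnorm(n, sd = 2))     # q = 2 illustrative data
kde.prod(c(0, 0), X, h = c(0.4, 0.8))      # estimate of f(0, 0)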
The proof of MSE consistency of $\hat f(x)$ is similar to the univariate case; in particular, one can show that

$$\mathrm{bias}\left(\hat f(x)\right) = \frac{\kappa_2}{2}\sum_{s=1}^q h_s^2 f_{ss}(x) + o\!\left(\sum_{s=1}^q h_s^2\right), \qquad (1.36)$$

where $f_{ss}(x)$ is the second order derivative of $f(x)$ with respect to $x_s$ and $\kappa_2 = \int v^2k(v)\,dv$, and one can also show that

$$\mathrm{var}\left(\hat f(x)\right) = \frac{\kappa^q f(x)}{n h_1\cdots h_q} + o\!\left(\left(n h_1\cdots h_q\right)^{-1}\right), \qquad (1.37)$$

where $\kappa = \int k^2(v)\,dv$. The proofs of (1.36) and (1.37), which are similar to the univariate $X$ case, are left as an exercise (see Exercise 1.11). Summarizing, we obtain the result

$$\mathrm{MSE}\left(\hat f(x)\right) = \left[\mathrm{bias}\left(\hat f(x)\right)\right]^2 + \mathrm{var}\left(\hat f(x)\right) = O\!\left(\left(\sum_{s=1}^q h_s^2\right)^2 + \left(n h_1\cdots h_q\right)^{-1}\right).$$

Hence, if as $n \to \infty$, $\max_{1\le s\le q} h_s \to 0$ and $n h_1\cdots h_q \to \infty$, then we have $\hat f(x) \to f(x)$ in MSE, which implies that $\hat f(x) \to f(x)$ in probability.

As we saw in the univariate case, the optimal smoothing parameters $h_s$ should balance the squared bias and variance terms, i.e., $h_s^4 = O\!\left((n h_1\cdots h_q)^{-1}\right)$ for all $s$. Thus, we have $h_s = c_s n^{-1/(q+4)}$ for some positive constant $c_s$ ($s = 1,\ldots,q$). The cross-validation methods discussed in Section 1.3 can be easily generalized to the multivariate data setting, and we can show that least squares cross-validation can optimally select the $h_s$'s in the sense outlined in Section 1.3 (see Section 1.8 below).

We briefly remark on the independence assumption invoked for the proofs presented above. Our assumption was that the data is independent across the $i$ index. Note that no restrictions were placed on the $s$ index for each component $X_{is}$ ($s = 1,\ldots,q$). The product kernel is used simply for convenience, and it certainly does not require that the $X_{is}$'s are independent across the $s$ index. In other words, the multivariate kernel density estimator (1.35) is capable of capturing general dependence among the different components of $X_i$. Furthermore, we shall relax the "independence across observations" assumption in Chapter 18, and will see that all of the results developed above carry over to the weakly dependent data setting.

1.7 Multivariate Bandwidth Selection: Rule-of-Thumb and Plug-In Methods

In Section 1.2 we discussed the use of the so-called normal reference rule-of-thumb and plug-in methods in a univariate setting. The generalization of the univariate normal reference rule-of-thumb to a multivariate setting is straightforward. Letting $q$ be the dimension of $X_i$, one can choose $h_s = c_s X_{s,\mathrm{sd}}\, n^{-1/(4+q)}$ for $s = 1,\ldots,q$, where $X_{s,\mathrm{sd}}$ is the sample standard deviation of $\{X_{is}\}_{i=1}^n$ and $c_s$ is a positive constant. In practice one still faces the problem of how to choose $c_s$. The choice of $c_s = 1.06$ for all $s = 1,\ldots,q$ is computationally attractive; however, this selection treats the different $X_{is}$'s symmetrically. In practice, should the joint PDF change rapidly in one dimension (say in $x_1$) but change slowly in another (say in $x_2$), then one should select a relatively small value of $c_1$ (hence a small $h_1$) and a relatively large value for $c_2$ ($h_2$). Unlike the cross-validation methods that we will discuss shortly, rule-of-thumb methods do not offer this flexibility.

For plug-in methods, on the other hand, the leading (squared) bias and variance terms of $\hat f(x)$ must be estimated, and then $h_1,\ldots,h_q$ must be chosen to minimize the leading MSE term of $\hat f(x)$. However, the leading MSE term of $\hat f(x)$ involves the unknown $f(x)$ and its partial derivative functions, and pilot bandwidths must be selected for each variable in order to estimate these unknown functions. How to best select the initial pilot smoothing parameters can be tricky in high-dimensional settings, and plug-in methods are not widely used in applied settings to the best of our knowledge, nor would we counsel their use other than for exploratory data analysis.
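For completeness, a one-function R sketch of the multivariate normal reference rule just described, h_s = 1.06 sd(X_s) n^{-1/(4+q)}, appears below (our own illustration; the constant 1.06 corresponds to the choice c_s = 1.06 discussed above, and the data are simulated for the example).

## Multivariate normal reference rule-of-thumb bandwidths, one per column of X.
rot.bw.mv <- function(X) {
  n <- nrow(X); q <- ncol(X)
  1.06 * apply(X, 2, sd) * n^(-1/(4 + q))
}

set.seed(123)
X <- cbind(rnorm(500), rnorm(500, sd = 2))
rot.bw.mv(X)       # a vector (h_1, h_2)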
1.8 Multivariate Bandwidth Selection: Cross-Validation Methods

1.8.1 Least Squares Cross-Validation

The univariate least squares cross-validation method discussed in Section 1.3.1 can be readily generalized to the multivariate density estimation setting. Replacing the univariate kernel function in (1.23) by a multivariate product kernel, the cross-validation objective function now becomes

$$CV_f(h_1,\ldots,h_q) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \bar K_h(X_i,X_j) - \frac{2}{n(n-1)}\sum_{i=1}^n\sum_{j\ne i} K_h(X_i,X_j), \qquad (1.38)$$

where

$$\bar K_h(X_i,X_j) = \prod_{s=1}^q h_s^{-1}\,\bar k\!\left(\frac{X_{is} - X_{js}}{h_s}\right), \qquad K_h(X_i,X_j) = \prod_{s=1}^q h_s^{-1}\,k\!\left(\frac{X_{is} - X_{js}}{h_s}\right),$$

and $\bar k(v)$ is the twofold convolution kernel based upon $k(\cdot)$, where $k(\cdot)$ is a univariate kernel function satisfying (1.10). Exercise 1.12 shows that the leading term of $CV_f(h_1,\ldots,h_q)$ is given by (ignoring a term unrelated to the $h_s$'s)

$$CV_{f0}(h_1,\ldots,h_q) = \int\left[\sum_{s=1}^q B_s(x)h_s^2\right]^2dx + \frac{\kappa^q}{n h_1\cdots h_q}, \qquad (1.39)$$

where $B_s(x) = (\kappa_2/2)f_{ss}(x)$. Defining $a_s$ via $h_s = a_s n^{-1/(q+4)}$ ($s = 1,\ldots,q$), we have

$$CV_{f0}(h_1,\ldots,h_q) = n^{-4/(q+4)}\,\chi_f(a_1,\ldots,a_q), \qquad (1.40)$$

where

$$\chi_f(a_1,\ldots,a_q) = \int\left[\sum_{s=1}^q a_s^2 B_s(x)\right]^2dx + \frac{\kappa^q}{a_1\cdots a_q}. \qquad (1.41)$$

Let the $a_s^0$'s be the values of the $a_s$'s that minimize $\chi_f(a_1,\ldots,a_q)$. Under the same conditions used in the univariate case and, in addition, assuming that $f_{ss}(x)$ is not a zero function for all $s$, Li and Zhou (2005) show that each $a_s^0$ is uniquely defined, positive, and finite (see Exercise 1.10). Let $h_1^0,\ldots,h_q^0$ denote the values of $h_1,\ldots,h_q$ that minimize $CV_{f0}$. Then from (1.40) we know that $h_s^0 = a_s^0 n^{-1/(q+4)} = O\!\left(n^{-1/(q+4)}\right)$.

Exercise 1.12 shows that $CV_{f0}$ is also the leading term of $E(CV_f)$. Therefore, the nonstochastic smoothing parameters $h_s^0$ can be interpreted as optimal smoothing parameters that minimize the leading term of the IMSE. Let $\hat h_1,\ldots,\hat h_q$ denote the values of $h_1,\ldots,h_q$ that minimize $CV_f$. Using the fact that $CV_f = CV_{f0} + (\mathrm{s.o.})$, we can show that $\hat h_s = h_s^0 + o_p(h_s^0)$. Thus, we have

$$\frac{\hat h_s}{h_s^0} \to 1 \text{ in probability, for } s = 1,\ldots,q. \qquad (1.42)$$

Therefore, smoothing parameters selected via cross-validation have the same asymptotic optimality properties as the nonstochastic optimal smoothing parameters.

Note that if $f_{ss}(x) = 0$ almost everywhere (a.e.) for some $s$, then $B_s = 0$ and the above result does not hold. Stone (1984) shows that the cross-validation method still selects $h_1,\ldots,h_q$ optimally in the sense that the integrated estimation square error is minimized; see also Ouyang et al. (2006) for a more detailed discussion of this case.

1.8.2 Likelihood Cross-Validation

Likelihood cross-validation for multivariate models follows directly via (multivariate) maximization of the likelihood function outlined in Section 1.3.2, hence we do not go into further details here. However, we do point out that, though straightforward to implement, it suffers from the same defects outlined for the univariate case in the presence of fat tailed distributions (i.e., it has a tendency to oversmooth in such situations).
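The multivariate criterion (1.38) can be minimized numerically over (h_1, ..., h_q). The R sketch below is our own illustration (Gaussian product kernel, so the twofold convolution kernel is the N(0,2) density; the simulated design, starting values, and use of optim with box constraints are all choices made for the example, not prescriptions from the text).

## Multivariate least squares cross-validation, Gaussian product kernel.
cv.ls.mv <- function(h, X) {
  n <- nrow(X); q <- ncol(X)
  term1 <- term2 <- matrix(1, n, n)
  for (s in 1:q) {
    d <- outer(X[, s], X[, s], "-") / h[s]
    term1 <- term1 * dnorm(d, sd = sqrt(2)) / h[s]   # convolution kernel factor
    term2 <- term2 * dnorm(d) / h[s]                 # kernel factor
  }
  diag(term2) <- 0                                   # leave-one-out double sum
  sum(term1) / n^2 - 2 * sum(term2) / (n * (n - 1))
}

set.seed(123)
X <- cbind(rnorm(300), rnorm(300, sd = 2))
fit <- optim(par = c(0.5, 0.5), fn = cv.ls.mv, X = X,
             method = "L-BFGS-B", lower = 1e-3)
fit$par            # cross-validated (h_1, h_2)

The O(n^2 q) cost of each evaluation is the practical price of cross-validation; for large n one typically restarts the optimizer from several starting values to guard against local minima.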
q-vectors with ite PDF f()) pon ofthe spor of. arene he 0 frase, mhy ..hg + 00, and (hy ...hg) Sy AS + 0, then Vi Ty [i 10) -F Sms] NOx f(a). Proof. Using (1.86) and (1.37), one can easly show tha a [io -f@)-% Ensue] has asymptotic mean zero and asymptotic variance x#f(cr), Le, Vinha Fg | f(z) - fa) - 2 sEHmt] =Vaiohy [fl2) - § (f(@))] + Viigo [e (7) -1@)- 3 Sun] =Vilis---Ty [f(@) - B (f@)] 7 +0 (vane) (by (1.36) = Down eg yt 2)-a((82)))-«0 =) Zn +00) 4. (0,647), oot. istry estan by Lispunov’s CLT, provided we can verify that Liapunov’s CLT con- dition (A.21) holds, where Kine Xin 7 12 (Xin) : nel) (KER) Les by (1.37). Pagan and Ullah (1999, p. 40) show that (A.21) holds under the condition given in Theorem 1.3, The condition that f k(v)**4 do < 0 for some 6 > 0 used in Pagan and Ullah is implied by our assumption that k(v) is nonnegative and bounded, and that J k(v) dy = 1, because J kw} dv < Cf k(w) dv = C is finite, where C = supyexs k(v)!*4 a DovarZni) = F(@) +01) 1.10 Uniform Rates of Convergence {Upto now we have demonstrated only the case of pointwise and IMSE consistency (which implies consistency in probability). Tn this section we generalize pointwise consistency in order to obtain e stronger “uni- form consistency” result. We will prove that nonparametric kernel e= timators are uniformly almost surely consistent and derive their wni- form almast sure rate of convergence. Almost sure convergence implies convergence in probability; however, the converse is uot true, Le, con- ergence in probability may not imply convergence almost surely; nee Serfling (1980) for specie examples. ‘We have already established pointwise conssteney for an interior point inthe support of X. However, it turn out that popular kernel Functions such as (19) say not lead to consistent estimation of #(2) ‘when 2 sat the boundary of its suppor, hence we need to exclude the boundary ranges when considering the uniform convergence rate, This Fighlights an important aspect of kerel estimation in genera, and a ‘number of kernel estimators introduced in later sections are motivated by the desire to mitigate such “boundary efloct.” We frst show that ‘when 2 is at (ot near) the boundary of its support, f(e) may not be a consistent estimator of f(a), ‘Consider the case where X is univariate having bounded support. For simplicity we assume that X € [0,1]. The pointwise consistency romult f(a) ~ f(2) = og(t) obtained earlier requires that le in the 2 a a interior ofits support. Exercise 1.13 shows that, for # at the boundary of its support, MSE(/(2)) may not be o(1). Therefore, some modifications ray be needed to consistently estimate f(x) for at the boundary ofits support. Typical modifications include the use of boundary kernels or data reflection (soe Gasser and Mille (1979), Hall and Webrly (1991), and Scott (1992, pp. 148-149)). By way of example, consider the case where 2 lies on its lowermost boundary, ie., z = 0, hence (0) = (nh) 1S, K((Xi — 0)/h). Exercise 1.13 shows that for this case, E[f(0)) = f(0)/2 + O(h). Therefore, bias|(0)] = EL{(0)] ~ £(0) = =F(0)/2 + O(h), which will not converge to zero if f(0) # 0 (when F(0) > 0). In the literature, various boundary kernels are proposed to overcome the boundary (bias) problem. For example, asimple boundary corrected kernel is given by (assuming that X « [0,1)) We (GE) / [jy Rw) do if-e-e fo.) Ryley) = 4 (BE) ifz€[h1—h] WR (GE) / [SP Ro) de kz (L—hy I], (1s) where K() isa second order kerne! satisfying (1.10). 
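Because the display above is easy to misread, a small numerical sketch of one plausible reading of the truncated-and-renormalized boundary kernel in (1.43) may help: near a boundary of [0,1] the ordinary kernel weight is divided by the kernel mass that remains inside the support. The Epanechnikov kernel on [-1,1], the closed-form mass function, and the uniform data below are our illustrative choices, not the book's.

```python
import numpy as np

def epan(v):
    """Second order Epanechnikov kernel supported on [-1, 1] (illustrative choice of k)."""
    return np.where(np.abs(v) <= 1.0, 0.75 * (1.0 - v ** 2), 0.0)

def epan_mass_below(c):
    """Closed form of the integral of epan over (-1, c]."""
    c = min(max(c, -1.0), 1.0)
    return 0.75 * (c - c ** 3 / 3.0 + 2.0 / 3.0)

def boundary_corrected_kde(x, X, h):
    """Density estimate on [0,1]: near a boundary the kernel is renormalized by
    the mass it retains inside the support, restoring asymptotic unbiasedness."""
    if x < h:                                   # left boundary region [0, h)
        norm = epan_mass_below(x / h)
    elif x > 1.0 - h:                           # right boundary region (1-h, 1]
        norm = 1.0 - epan_mass_below((x - 1.0) / h)
    else:                                       # interior: ordinary kernel
        norm = 1.0
    return epan((x - X) / h).sum() / (len(X) * h * norm)

rng = np.random.default_rng(1)
X = rng.uniform(size=5000)                      # Uniform[0,1] data, so f(0) = 1
print(boundary_corrected_kde(0.0, X, 0.1))      # close to 1; roughly 0.5 without the correction
```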
Now, we estimate F(a) by fla)= 2 lexi, (a where ky, Xi) is define in (1.49) xerebe 1.14 shows thatthe above Boundary comet heme! saccessflly overcomes the boundary prob- vem ‘We now stablsh th uniform almost sure convergence rate of f(2)~ J(e) or 2 € 8, where & is a bounded set excluding the boundary range ofthe support of X. In the above example, when th support of is [Ol wo ean choose S = [eld for arbitealy small postive ¢ (0- 5 > 0, we have ; uit sup|fee —10| = o( ue me +n) almost surely. 2 eee 1,_DENSITY ESTIMATION [A detailed proof of Theorem 1.4 is given in Section 1.12. Since almost sure convergence implies convergence in probability, the uniform rate also holds in probability, ie., under the same condi- ‘ions as in Theorem 1.4, we have fl | = Gini)? sup \@ - 1@)| =O (ei ‘Using the results of (1.36) and (1.37), we can establish the following ‘uniform MSE rate. ‘Theorem 1.5. Assuming that f(x) is twice differentiable with bounded second derivatives, then we have supe { [fe — so} -0 (Ss (ahs a) Proof. This follows from (1.86) and (1.37), by noting that supses f(2) ‘and suppes [fus(2)| aF6 both finite (8=1,..-+9). a Note that althongh convergence in MSE implies convergence in probability, one cannot derive the uniform convergence rate in probar bility from ‘Theorem 1.5. This is because feu [fee -sa)"} 4 xp (7) — Fe)» and » [ap|fe -s0)|> 4 Agr [ere] >4 ‘The sup and the E() or the P() operators do not: commute with one sot. ‘Cheng (1997) proposes alternative (Jocal linear) density estimators that achieve automatic boundary corrections and enjoy some typical ‘optimality properties. Cheng also suggests a data-based bandwidth se- Jector (in the spirit of plug-in rules), and demonstrates that the band- width selector is very efficent regardless of whether there are non smooth boundaries in the support of the density. LIL, HIGHER ORDER KERNEL FUNCTIONS 3 1.11 Higher Order Kernel Functions Recall that decreasing the bandwidth h lowers the bias of « kernel estimator but increases its variance. Higher order kernel functions are devices used for bias reduction which are also capable of reducing the MSE of the resulting estimator. Many popular kernel functions such as the one defined in (1.10) are called “second order” kernels. The order of a kernel, v (v > 0), is defined as the order of the first nonzero moment. For example, if [ wk(u) du = 0, but f'u2k(u) du # 0, then K() is said to be a second order kernel (v= 2}. A general vth order kernel (v > 2 is an integer) must therefore satisfy the following conditions: (0 f Hu)au=1, Co fukin, eto, 48 (ii) [erro =n #0. Obviously, when v = 2, (1.45) collapses to (1.10). If one replaces the second order kernel in f(z) of (1.35) by a vth order kernel function, then as was the case when using a second order kernel, under the assumption that f(z) is vth order differentiable, and assuming that the he's all have the same order of magnitude, one ean show that : bins (ft) 0 (5) 4s) i = var (f(2)) = 0 ((uhy hg) (1.47) (see Exercise 1.15). Hence, we have MSE (f(2)) =O (& HY + (ny. so) (1.48) Ae) - F(z) = Op (e« + (nh ~w) ‘Thus, by using a vth higher order kernel function (v > 2), one can. 
reduce the order of the bias of f(x) from O (S74_, h2) to O(DL Ay), a 1,_DENSITY ESTIMATION and the optimal value of h, may once again be obtained by balancing the squared bias and the variance, giving fy = O (n~¥/@**0), while the rate of convergence is now f(x) — f(z) = Op(n-"/2"*9), Assuming that f(e) is differentiable up to any finite order, then one can choose to be sufciently large, and the resulting rate can be made arbitrarily close to Op(n"/2). Note, however, that for v > 2, no nonnegative ker- nel exists that satisfies (1.45) This moans that, necessarily, we have to assign nogative weights to some range of the data which implies that fone may get negative density estimates, clearly an undesirable side- cfoct. Furthermore, in finit-sample applications nonnegative second order kernels have often been found to yield more stable estimation results than their higher order counterparts. Therefore, higher order kernel fonctions are mainly used for theoretical purposes; for example, to achieve a yi-rate of convergence for some finite dimensional param- eter in a semiparametric model, one often has to use high order kernel functions (see Chapter 7 for such an example). Higher order kernel fonctions are quite easy to construct. Assum- jing that k(u) is symmetric around zero,® i.e. k(u) = k(—u), then J12"*h(a) du = 0 for all postive integers m. By way of example, ‘in order to construct a simple fourth order kernel (i.e., v }), one could begin with, say, a second order kernel such as the standard nor- zal kernel, set up a polynomial in te argument, and solve for the roots of the polynomial subject to the desired moment constraints, For ex- ample, letting (u) = (2n)-'/? exp(—u?/2) be a second order Gaussian kernel, we could begin with the polynomial k(u) = (a+ bu?) ®(u), (1.49) where a and b aro two constants which must satisfy the require rents of a fourth order kernel. Letting hu) satisfy (145) with » = 4 (J wk(u)du = 0 for 1 = 1,3 because k(u) is an even function), we therefore only requite J &(u)du = 1 and fu?k(u) du 0. From these two restrictions, one can easily obtain the result a = 8/2 and b = -1/2. or rondors requiring come higher order kernel functions, we provide a few examples based on the second order Gaussian and Bpanechnikov kernels, perhaps the two most popular kernels in applied nonparamet~ ric estimation. As noted, the fourth order univariate Gaussian kernel Typically, only symmetric Iernel functions are wied in practice, though sc Abadi and Lasetord (2004 for rent work involving optimal asymmetic keel. 4.12, PROOF OF THEOREM 14 Ea is given by the formula. 3 _ 12) exp(-u?/2) k(u) ( a7 3% ) nae cH while the sixth order univariate Gaussian kernel is given by ay = (55,2 4 Ayr) @PC0?/2) H0)= (5 ” “a') vir The second order univariate Epanechnikov kernel is the optimal kernel based on a calculus of variations solution to minimizing the IMSE of the kere estimator (see Serfing (1980, pp. 40-43)). The univariate second order Epanechnikov kernel is given by the formula 2.(-fuw) itwcs wa f HeO- be) te? 2, the kernels indeed assign negative weights which can rel in ngative density estimates, not thatable entre or tlated work involving exact mean integrated squared ero for higher order kernel inthe coutet of univasiate kernel density estima tion, see Hansen (2005). Also, for related work using iterative meth- oats to cxtiate unnaormaton keel dense, on Yang and Marron (1999) and Yang (2000). 
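As a quick sanity check on the higher order Gaussian kernels given above, one can verify the moment conditions in (1.45) numerically. The sketch below is ours; the integration grid and the simple Riemann sum are arbitrary implementation choices.

```python
import numpy as np

phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)          # standard normal PDF
k2 = lambda u: phi(u)                                                # second order Gaussian kernel
k4 = lambda u: (1.5 - 0.5 * u ** 2) * phi(u)                         # fourth order Gaussian kernel
k6 = lambda u: (15/8 - (5/4) * u ** 2 + (1/8) * u ** 4) * phi(u)     # sixth order Gaussian kernel

u = np.linspace(-10.0, 10.0, 200001)
du = u[1] - u[0]
for name, k in [("k2", k2), ("k4", k4), ("k6", k6)]:
    moments = [(u ** j * k(u)).sum() * du for j in range(0, 7, 2)]   # moments 0, 2, 4, 6
    print(name, np.round(moments, 4))
# k4 integrates to one, has a zero second moment and a nonzero fourth moment;
# k6 additionally has a zero fourth moment and a nonzero sixth moment.
```

Plotting k4 or k6 also makes the negative-weight issue discussed above visible: both kernels dip below zero away from the origin.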
1.12 Proof of Theorem 1.4 (Uniform Almost Sure Convergence) ‘The proof below is based on the arguments presented in Masry (19960), who establishes uniform almost sure rates for local polynomial regres. sion with weakly dependent (a-mixing) data; see Chapter 18 for fur- ther details on weakly dependent processes, Since the bias of the kernel wo. srry esrmsarion, 1.00 0.80 |-Fourth Onder Sixth Order 0.60 oa ku) 0.20 0.00 0.20 - 25-20-15 -10-05 0.0 05 10 15 20 25 Figure 1.2: Epanechnikov kernels of varying order. density estimator is of order O(Xf.., 42) and the variance is of order O((nhy....hg)~), it is easy to show that: the optimal rate of conver- gence requires that all hy should be of the same order of magnitude. ‘Therefore, for expositional simplicity, we will make the simplifying as- sumption that ys hga he ‘This will not affect the optimal rate of convergence, but it simplifies the derivation tremendously. We emphasize that, in practice, one should al- ways allow hy (s = 1,-..,8) to differ from each other, which is of course always permitted when using fully data-driven methods of bandwidth selection such as cross-validation. Only for the theoretical analysis that immediately follows do we assume that all smoothing parameters are the same, Proof. Let Wy = Wa(2) = |fl2) — f(2)|. To prove that the ran- dom variable ‘W, is of order (7) almost surely (as.), we can show that D& P (\Wa/n| > 1) is finite (for some > 0). Then by the Borel-Cantelli lemma (see Lemma A.7 in Appendix A), we know that W, = O(n) as. Hero, the supremum operator complicates the proof because S is an uncountable set. Letting Ly, denote a countable set, 112. PROOF OF THEOREM Lt a then we have P (myx Wala) >n) < (of La) mpxPCWQ(e) >) (1.50) But in our case, x € S is uncountable and we cannot simply use an inequality lke (1.50) to bound P (supes Wa(2) > n) However, since Sis « bounded sot we can partition S into countable subsets with the volume of each subset being as small as necessory. ‘Then P(supzcs|Wa(2)| > 7) ean be transformed into a problem like P(maxzety [Wa(2)] >) and the inequality of (1.50) can be used to hhandle this term, Using this idea we prove Theorem 1-4 below. We write |e - s)| = 7-2 (7) +B (fe) - 10) <|f@-E(f@)|+|B (fe) -s@)| We prove Theorem 14 by showing that swe (f@) -s@|=00), (1.51) and that Gain)? sup |fle) -B(Fe)| -o( ae ) almost surely. (1.52) We first prove (1.51). Because the compact set is in the interior of its support, by a change-of-variables argument, we have, for x ¢ S, B (fle) - se) = f Hle-+ ney (e) do — fle) =n J vfO(@)oK(v) de 20 [von(yinscr? -0(0) uniformly in 2 € S. Thus, we have proved (1.51). ‘We now turn to the proof of (1.52). Since & is compact (closed and bounded), it ean be covered by a finite number Ln = L(n) of (g- dimensional) cubes Jy = Tens with centers zxq and length fy (It = 8 1, DENSITY ESTIMATION 1,.-+,L(n)). We know that Ly = constant/(In)* because S is compact, which gives fy = constant/Z4/, We write sep |) —P (7) |= spas, sup Fe) —B (A) i) xe 5 dio 8, Vien + 22m, feu) ~B (Fern)| + gio ge, | (Henn) - 2 F)] = +Q2+Qs. Note that @, does not depend on x, 50 supzesrs, does not appear in the definition of Qo. ‘We first, consider Qo. where Zaz = (mht) (K(X, 11> 0, we have Sle) 8 (A(@) = DiZnes p[K((X; — 2)/f)]}. Bor any Pian > 9) =P, << P(Walzin) > 17 or Wa(wan) > 1, ---,0r Waltr(npn) > 7] £P(Walcan) >) +P Wat) > 0) + +P (Walteenyn) >) s (n) sup P [1Wa(2)] >a. 
(2.53) 2% Wola I >a] Since K() is bounded, and letting Ay = sup, |K(2)), we have Vans] < 2Ax/(n*) for all i = 1,...,n. Define dy = (nh l(a)?” hen AalZna] © 2Aalin(n)/(ahs)} 1/2 forall i = 1....r for m sufficiently large.® Using the inequality exp(z) < 1+2-+2? for |x| < 1/2, wwe have exp(-tAnZn4) < 1+ AeZna + A2ZE,- Hence, Bloxp(+nZqsI] $1+ ME [225] Sexp (B(MZa)], (54) hile for the second inequality we used ERR). E(w) /(4A) wil load to [Aa Znal S 1/2: Later on ‘we will show that, in order £9 obtain the optimal rae for Qo, one needs to choose Dem (nnn), 112, PROOF OF THEOREM 14 30 By the Markov inequality (see Lemama A.23 with 4(2) = exp(az)) wwe know that mx ng cB), sy, as >| <[Speu>d] +0 [aul << Blexo(n Yk Zn] + Blosr(—An Sy Za exPAntin) (by (1.55), a = Anjo =n) < Peden) [xe (axees,)| (by (1.54) 2exp(—Ann) [exp (Aarg/(nhY))], - (1.58) Using (1.55) we have P((Wa(2)] > a] -| Sen, where we used B [ZA4] 5 (nh?) *E [K(X — 2)/A)] < Ann?) + 0 (2) Because the last bound in (1.56) is independent of z, it is also the uniform bound, ie, 2 sph Wa(2)|>n)< ex (Ain 8). (sr) 8 ry We want ta have 9 +O as fast as posible, and at the samo trna we need yf) + 00 at a rate which ensures that (1.57) is summable.” We can choose Aan = Ca In(n), oF Ay = Ce In(n)/n. Finding the fastest rate for which n — 0 is equivalent to finding the fastest rate for which Xn —+ co, We aso need the order of Ayn & A2/ (nh), or Inn) = A8/ (mht) TA sequence Fs Ho be ummble f| TE, <0 7 1._DENSITY ESTIMATION ‘Thus, we simply nood to maximize the order of Ay —+ co subject to XZ S (nh) In(n). Doing so, we get An = [fr inn)" andy = Can()/An = Cr n(n (eh)? ) (1.58 Using (1.58) we get —dnn/2-+ Aad2/ (nh) = Calan) + Ay la(n) =-aln(n), where a = Cy ~ Ap. Substituting this into (1.57) and then into (1.53) gives us P[Q2 > m] < 2L(n)/n°. (1.59) By choosing Cy suificiently large, we can obtain the result that E(n)/n* is summable by properly choosing the order of L(n), ie. P(|Qa/'m| > 1) < 4D%_ L(n)/n* < co. Therefore, by the Borel-Cantell lemma we kuow that =O) =O (n(n) snes. (1.60) We now consider Q: and Qs. Recall that ||- || denotes the usual Euclidean norm of a vector. By the Lipschitz condition on K(-), we know that sup |K((Xi—2)/h)— K(Xi~ eq) /A)| 0, there exists a positive integer mg such that Jay ~ | < ¢ for all n > no. Hint: For (ii) use the result from (j) along with Theorem A.3 of Appendix A. Hint: For (ii) argue by contradiction. Exercise 1.4. Let F(z) be defined as in (1.2). {) Show that MSE[F,(z)] = O(n) (note that this implies that Fale) ~ F(e) = O(0-¥/2) by Theorem A.7 of Appendix A. 0 1,_DENSITY ESTIMATION i) Prove that, VailPs(2) ~ F(a)) 4 NO. F(@)(0 ~ Fla). Hint: Fest show that B[F,(2)] = (2) and var(Fy(z)) = F(2)(0— F(a)). Then use the Lindeberg-Levy CLT. Bxercise 1.5. Prove (118) under the assumption that f(=) has @ continuous second order derivative at. Hints Use the dominated convergence theorem given by Lemma |A26 in Appendix A Bxereise 1.6. Write hy ~ & (9%) and by = & (G22). Using 1? = (n(n 1))-*~ (n2(n— 1), we obtain from (1.23) Ott REE aE meek BO eta Do Dalia Bh) + O09) sa + In + Op(h(nh)), (1.63) =i where Jn = [n(n — JAI Df; Dfyalhy — 2k] and «= [IP (v) do = K). (i) Show that B(J,) = Bo+ Bik! +O(h®), where Bo =~ J f(2)* de, and By = (43/4){ LF (2)? dz}, (ii) Accept the fact that J ‘asymptotically, minimizi (Jp) + smaller order terms. So, CV4{h) is equivalent to minimizing I(h) © (nh)~'e + BJ). Obtain that f which minimizes I(h). (ui) Asoume that (0) > &(u) for al hich i usually tue fr kernel cetimation). 
Ihwe do not use the leaveone-ut estimator, then ve would ingtead have the objetive function V(A) * (n)“te— JE(0):+ Un). Show that h =O mininaes V(i), which obviously violates the Tegument that nh = co as n'~> co, This hows hat we must use the lave-one out etinator when constructing ov. 1.14, EXERCISES a (Gv) In deriving (1.63) we used Ag & (n(n — 1h)? > FRX — Xj) /h) = Opin), 7 iA Prove this result (¥) Using the U-statistic H-decomposition given in Lemma A.15 of * Appendix A, show that Jy = E(Jn) + Op(h!/%(nh)-b4n-¥8) + terms unrelated to h. Therefore, we indeed have Jy = E(Jn) + (s.0.) (for a given value off). Hints: Note that K() is also a nonnegative, symmetric PDF, ie. J Ro) do = 1, J v*K(v) do = 0 whens isan odd positive integer. Sena) (52) “fae jae) fla) =n f (Me) +04 (oi /%ayh? +0 (sa/4F9 (ah + OM)” de Effsa| = we (“1% of (2522) Men sten)an den ah f Hef 10) +0 (a/a.r%aye +0 + (wa/A FO (x)n} de + (0). (ii) Note that £(0) = [#%(v) dv > 0, (il) Show that h~+ 0 produces a value of the objective function V(h) = —oo. Ths, a = 0 minimizes V(). 2 1, DENSITY ESTIMATION (iv) Show that [Jnl] = E(4n) = O(n), then apply Theorem A.7. (©) Using the Usstatistic H-decomposition (again, see Appendix A), show that the last two terms in the H-decomposition are of or- Ger Op(n-¥/2H4) (plus terms unrelated to h) and Op( (nh) ), respectively. Exercise 1.7. Derive (1.27), Le, show that [fortenn Se) Hint: Use fv) = (nh)? Dk (42) ‘and do a change of variable (aj—v)/h =t and da; = hdv. Exercise 1.8. (i) Discuss the relationship between the kernel and empirical CDF estimators, Le., P(x) and Fa(2) =n? DE, 1% <2). Discuss whether or not one can use h = 0 in F(x) defined in (1.27), ie. can one let h 0 arbitrarily fast in P(x)? (ii) P(e) and Fy(2) have the same asymptotic distribution. What is the advantage of using F(z) over Fy(2)? Which estimator do you txpert to have smaller fnite-scrple MSE? Explain. Exereise 1.9. Derive (1.33). Hint: Write 1s(e) = 1(X; <2) and G,,2, = G((e— Xj)/h), then we PL ED [Ee Gen] mia x [1e(@) — Graal} de = tp [Btu ~Go0)"} de BEE [EXE (MMe) ~ Gaal}? de = CV +0¥ B[CVe(A)] = + Ld, EXERCISES 83 then show that OM = (ny? {efra = F)de~ exno1n")} and cys i! a wl {fra ~F(a))de+n* [cscrae} Exercise 1.10. Define a qq matrix A with its (f,s)th element given by Ace = (2/2) J Be(2) B(x) de. (i) Show that A is positive semidefinite, (i) Show that if A is positive definite, then the a's defined in (1.41) aro all uniquely determined, positive, and fnie. [A necessary condition for A to be positive definite is that ful) is not a zero function for all 8 = 1,...44. Hint: () Note that for any @ x 1 vector 2 = (siy...5%4)f that Az = S(Xdar Bale) z5}* dx > 0, then xy = 2Az + n/a and let 28 denote values of 2, that minimize xy. It is easy to argue that co > Inf .ny X > 0. This implies that 29 > 0 for alls, The fact that ‘Ais positive definite implies that 2 < oo forall s, Finally, 2 is uniquely determined by a result given in Li and Zhou (2005). Therefore, a2 = 2) is uniquely determined, postive and finite for all = 1)... Note that A being positive definite is a suficient condition, Li and ‘Zhou (2005) provide a weaker necessary and sufficient condition for this result Exercise 1-11, Prove (1.36) and (1.87). Hint: For a multivariate Taylor expansion, we have f(z0 + 2) S(20)+ Dhar fol0)(s—40)+(1/2) Dy Dd fae (@) (Gea) (29 — 22yy); 2's on the line segment between 7 and 2 Exercise 1.12. For the multivariate case, we have thy hg (m= 1)? Dy Dogs [Kn (Xi, X4) - Bn Xs, XG) OV (ays eg) = San Op (nha. 
.thg)*) where Jn st 1, DENSITY ESTIMATION (@) Show that E(Ja) = f (Sts Bola)h2]? adr + 0 (2 |. where the definition of ,(z) is given in Section 1.8. (ii) Use the U-statistic H-decomposition to show that (ignoring the ‘term unrelated to the H,’s) Jn = B(Jn) + Op (" (& 2) ) 4p ((ha a)? (ha ---ha)*) Note that (i) and (i) together imply that OV, = So Buh + Mlb hg)! + op (18 +m) where ma = Yofay hE and m = (rhy hg)! ‘int: Using H-decomposition, show that the second moments ofthe ‘second and third terms are of order O(n-'/2ng) and O ((hi .-./i)n)» respectively Exercise 1.19. Assuming that X € [0,1] and f(0) > 0, show that E(f(0)] = f(0)/2+O(h) so that f(0) is a biased estimator of f(0) even asymptotically. Hint: f(0) = (nh) PM — O)/h), and BLF(O)] = R*BIACXs/2)] = f * se)tles/ ds 1 = [" reermerae = 70) [oto = st0y72 Exercise 1.14. With the boundary-corrected kernel defined in (1.43), tnd with f(x) defined in (1.44) and with the support of X being [0,11 show that for z € [0,A] at the boundary region, we have Elf(2)] (x) + O(h). Explicitly state the conditions that you need to derive ‘this result. Therefore, bias[(z)] = Oh) + 0 as n -» 00, and the boundary- corrected kernel restores the asymptotic unbiasedness for f(z) for = at the boundary region. 1.14. EXERCISES 5 Hint: One can write 2 = ah with 0 < a <1. One can assume that [f(2) ~ f(2)| < Cla — al for all x, © [0,1], where C is a positive constant, Then ant f= bE) BHA [aaa aA flak + wh) (used 2/h = a and 2 = al Pe eyae lon +h (sed 2h de =ah) Mw) Gay [F2.x0) ay] = f(0) + O18). Exercise 1.15. With a vth order kernel, prove (1.46) and (1.47) for the univariate x case (i.e., q=1). Exercise 1.16, Intuitively, one might think that when f(z) is a unic form density, say on (0,1), then one can choose a nonsbrinking value of h to estimate f(x) for some x ¢ [0,1] (i.e. h does not go to zero as ‘n+ 00). This intuition is correct when s is an interior point of [0,1] However, at (or near) the boundary of [0,1], estimation bias will not 0 to zero oven for uniform f(a). (i) Show that ifh does not converge to 0 as n —+ 00, then fife, h)— ‘F(2)/? da will not go to zero, where f(e) isthe uniform PDF. (G) Show that if h + 0 a8 m+ 00, then flfle,A) ~ f(a)]8dr —+ 0 as n— co, where f(z) is the uniform PDF. (j) and (ii) above explain why the cross-validated selection of A, fi, must converge to zero as 00, and why one does not need the condition that f@)(x) is not a zero function. Of course when f(z) is a ‘niform PDF, A will no longer have the usual order (n-") astend it ‘has an order equal to n="/8 since the bias now is of order A rather than i Exercise 1.17. Consider the Italian income data from Section 1.13.5. For the two samples of size n = 21 for the years 1951 and 1998, compute the density estimates using the reference rule-of-thumb in (1.17) pre- suming an underlying normal distribution. How many times larger than this would the bandwidth have to be to remove the bimodal feature present in the 1998 sample? Next, compute the density estimates using least squares cross-validation, Presuming that those bandwidths repre- sent the “optimal” bandwidths, how much larger would the bandwidth for 1998 have to be to produce an apparently unimodal distribution? Finally, compare your least: squares cross-validated density estimates ‘with a naive histogram. Do your estimates appear to be sensible, i., do they reflect features you believe are in fact present in the data? 
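Readers who would like to see the boundary effect of Exercise 1.13 rather than derive it can run the short simulation below (our construction; the sample size, bandwidth, and replication count are arbitrary). An ordinary Gaussian-kernel estimate of a Uniform[0,1] density evaluated at x = 0 settles near f(0)/2 = 0.5 instead of f(0) = 1.

```python
import numpy as np

rng = np.random.default_rng(123)
n, h, reps = 1000, 0.05, 200
phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

estimates = []
for _ in range(reps):
    X = rng.uniform(size=n)                          # Uniform[0,1], so f(0) = 1
    estimates.append(phi((0.0 - X) / h).sum() / (n * h))
print(np.mean(estimates))                            # roughly 0.5, i.e. f(0)/2
```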
Chapter 2 Regression [Regression analysis is perhaps the most widely used fool in all of ap- plied data analysis. Regression methods model the expected (average) behavior of a dependent variable typically denoted by-Y given a vee- tor of covariates typically denoted by X (which are often commonly referred to as “rogressors’ or “explanatory variables”). In other words, regression analysis is designed to address questions such as, What is the expected wage of a black female college graduate working inthe trans- portation sector? Furthermore, practitioners are often interested in how ‘the dependent variable responds to a change in one or more of the covariates (Sresponse”) and whether tis response differs signiicantly from zoro (“test of significance”). We first briefly outline parametric regression, and then quickly move on to the study of nonparametric regression. By far the most popular parametric regression model isthe linear regression model given by Y= Pot XiP tu, om lym, 1) where X; € RY, and f is aq x 1 vector of unknown parameters, while a more general nonlinear regression model is given by i= alXi,8) + us, om (2.2) where g(.,.) has a known funetional form with again being a vector of unknown parameters. For example, we might posit a model of the form g(z,) = exp(2'8). Having specified the functional form of the regression model up to a finite number of unknown parameters, meth- ods such as ordinary least squares or nonlinear least squares could then. 58 2. REGRESSION be used to estimate the unknown parameter vector in (2.1) or (2.2) respectively. ‘As was the case when modeling a PDF with parametric methods (coe Chapter 1), in practice the true regression functional form is rarely if ever known. Since parametric methodls require the user to specify the ‘exact parametric form of the model prior to estimation, one must con front. the likelihood that the presumed model may not be consistent with the data generating process (DGP), and in practice one must address issues that arise should the parametric regression model be severely misspecified. For valid inference, one must correctly specify not only the conditional mean function, but also the heteroskedastic- ity and serial correlation functions. The unforgiving consequences of estimation and inference based on incorrectly specified models are well cstablished, and include inconsistent parameter estimates and invalid inference, As noted in Chapter 1, one could of course test whether the presumed parametzic model is correct, but rejection of one’s parametric, ‘model typically offers nothing in the way of alternative models. That is, rejection of the presumed model does not thereby produce a correctly specified model. Nonparametric regression models do not require practitioners to ‘make fanctional form presumptions about the underlying DGP. Rather than presume one knows the exact functional form of the object be ing estimated, one instead presumos the object exists and satisfies some regularity conditions such as smoothness (differentiability) and moment conditions, We again point out that this does not, however, come with- out a cost. 
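Since the chapter repeatedly contrasts kernel methods with parametric fits of the form (2.1) and (2.2), a minimal sketch of how one would estimate these two models on simulated data may be useful. The data generating process, sample size, and starting values below are our illustrative choices; the exponential specification simply mirrors the example g(x, beta) = exp(x'beta) mentioned above.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0.0, 2.0, size=(n, 2))
Y = np.exp(0.5 + X @ np.array([0.3, -0.4])) + 0.1 * rng.normal(size=n)

# Linear model (2.1): Y_i = beta_0 + X_i'beta + u_i, estimated by ordinary least squares.
Z = np.column_stack([np.ones(n), X])
beta_ols, *_ = np.linalg.lstsq(Z, Y, rcond=None)

# Nonlinear model (2.2) with g(x, beta) = exp(x'beta), estimated by nonlinear least squares.
def g(x, b0, b1, b2):
    return np.exp(b0 + b1 * x[:, 0] + b2 * x[:, 1])

beta_nls, _ = curve_fit(g, X, Y, p0=[0.0, 0.0, 0.0])
print(beta_ols, beta_nls)   # the NLS fit recovers (0.5, 0.3, -0.4); the OLS plane can
                            # only approximate the curved conditional mean
```

Both fits require the researcher to commit to a functional form before seeing how well it matches the data, which is precisely the cost that nonparametric regression avoids.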
By imposing less structure on the problem, nonparametric ‘methods require more data to achieve the same degree of precision as ‘correctly specified parametric model, However, if one suspects that 1 parametric model is misspecified to some degree, and the sample at hhand is not too small to meaningfully apply nonparametric techniques, practitioners might instead wish to cousider nonparametric regression, methods, ‘We begin by considering the nonparametric regression model Yes g(Xi) tu, 85 tym. 3) ‘We assume for the moment that the sample realizations (¥;, Xi) are iid, though we relax this assumption in Chapter 18. The functional form g() is of course unknown. If g(-) is a smooth function, then we can estimate g(-) nouparametrically using kernel methods. We will interpret 2. REGRESSION Ea .9{z) as the conditional mean of ¥ given X =, ie., g(z) = EYL 21] due to the following result, ‘Theorem 2.1. Let G denote the class of Borel measurable (or contin ous) functions having finite second moment (see Appendix A for the def- inition of Borel measurable functions). Assume that g(z) = E(Y|X = 22) belongs to G, and that E(Y?) is finite. Then B(Y|X) is the optimal predictor of ¥ given X, in the following MSE sense: {IY —r(X)P} > BEY — B(Y|X)?} for all r(-) € 9, or, equivalently, BB) BU — nO} Proof. First we observe that g(2) = B(Y|2) is a function of 2, Next let fye(ty) f(e) and fyje(yiz) denote the joint PDF of (Y,X), the marginal PDF of X, and the conditional PDF of ¥|X, respectively. From fyje(2,u) = fya(0su)/ fla), we have {IY — EQ"|X))?}. BOK =2)= f viyetala) dy = Paster af 2), (24y which is obviously a function of x. Now, for any function r(x), we have E{Iy — r(X))9 {IY — BQ |X) + B(Y1X) F(X?) BY ~ BOYLAY?} + BIBOYLX) ~ r(20)?} + 2B{((Y ~ BOY X)EWX) — 7X) = E{Y ~ ELX)P} + BALE |X) — 2)? 2E(V -EV/X)P}, where wo used the fact that EAI — E(Y|XIEO|X) ~ r(X))} = ELLY — BY LX)IfolX) — r(X)]) =Efl(X) (X)] x BY ~E(Y|X)|x]} =0 by the law of iterated expectations (see Appendix A, Lemma A.11). © ‘Theorem 2.1 says that E(Y[X) is the optimal function of X for predicting ¥ in the sense that the prediction MSE will be minimized in « 2, REGRESSION the class of all Borel measurable (or continuous) functions r(x). Thus swe will interpret g(x) in (2.3) 88 E(Y x) In light of (2.4) itis clear that ‘once one knows how to estimate fy,x(z,y) and f() using the methods presented in Chapter 1, one can easily obtain an estimator for g(z) = B(V'|2). This leads directly to the “local constant” kernel estimator, originally proposed by Nadaraya (1965) and Watson (1964), which is often referred to simply as the “Nadaraya-Watson kernel estimator.” 2.1 Local Constant Kernel Estimation ‘We considered the estimation of fy») and f(2) in detail im Chapter 1, and so the only additional step required for estimating a conditional ean is integeation with respect to to obtain fufya(2,v) dy. Note that we can estimate f yfy,(29) dy by replacing the unknown PDF Jya(t,y) With its kernel estimate yielding J yfyx(2,u) dy, where Syaltsy) = mee (% *) ‘ 52) . vnhee (52) = (gst)... b (AAS) and where ho isthe smoothing parameter associafed with Y. Thus we have fsivalond0 Spee oe A)/«C 5) =a (change of variables: (y - ¥)/ho =») ~ ak A) ‘where we used f k(x) do = 1 and f vk(v) dv = 0. ‘Based on (24) and (2.5) we thereby estimate E(Y 2) = o() by 2) [i+ nKeotods (25) _ Lvbyatevaddy _ Da WK (52) 6) fe) kee) 24. 
LOCAL CONSTANT KERNEL ESTIMATION a which is simply a weighted average of ¥; because we can rewrite (2.6) ie) = 3%, where wi = K (Se) / Dj, K (22) is the weight attached to ¥5 Note that the weight# are nonnegative énd sum to one. Along the lines of the analysis given in Chapter 1, one can show thet Gee) - g(x) = Op (© E+ (ra so) . the proof of which is similar to the proof of fiz) - fla) = 0, (5 + (nh ny) : which was presented in Section 1.6 of Chapter 1. An easy way to estab- lish the consistency of 9() is to handle its numerator and denominator separately. First we write (o(2) - oe) Fe) — raz) a)’ (2) — 9(2) (2.7) ‘where v(x) = (G(«) — o()) fla). Using ¥; = (Xi) + us, we have a(x) = sin (e) + sha(e), where in (@) = (hy. hy) Yon = 9(x)) K and sng(z) = (ra Bg)? SK Using the notation r(x + hv) = r(ar + hivr,---,2q + gry) and using ‘the same arguments as in the derivation of (1.36), we ean easily show a 2. REGRESSION at h(a] = Cao scha) fe fen) ~ a2 (2 = [ Hees mofole+ m0) ~ oe, Kav ) ae = ZYON {L2fo(was(a) + F(a)gale)]} +0 (e 2) te =r) Smtn+0(S), “ x i where xq = fv°h(v) dv, Ba(2) = P{2fa(x)ga(x) + f(#)gu0(z)}/ fle), Te Glee) a ahd ted et dato ie) ‘with respect to 24, respectively (r= g or r= f). Also, var (via (x)) = (nb. ..1)ar [x nen (> =*)] 3 mn lloras} -{e [is -aerx(=2)]}’) = (nh...) (m sha) / ‘Sle + he) x [gla + hv) — 9(2))? Kw)? dv = (a0. ) =0(0. not), i 2.9) where (3.0.) denotes terms of smaller order. Equations (2.8) and (2.9) lead to fine) = $0) YBa )IE+ Oy (ni! + n7nh!”), 2.10) 24. LOCAL CONSTANT KERNEL ESTIMATION a where m = (mi -..tg)—! and my = So4_; HF (see Exercise 2.1). Next, wwe observe that Ejria(2)] = 0, and that Bf ale)? } = (ang... AE [aw @)] = (nti. B [Peron & *)| = (ry hg) {row [reso (Sox8)} = (nh nari +0 (‘on wide), ea where M(x) = nEf(z)0%(2), «= [ R(o)®dv and o°(2) = B(ud|X; = 2). Equations (2.10) and (2.11) lead to la) = Als) +e) =O,(me nk) aa) ‘Therefore given that g(e) ~ g(x) = vn(x)/f(2) and that fle) = ‘F(2)-+0)(1), we just proved the following result (assuming that f(2) > 0): : ~0,(— mo) 60) 92) 0 (eS eos) ~0, (Ses hy) ‘») 21) From (2.13) it is easy to soo thet if each bandwidth (2) has the same order of magnitude, then the optimal choice of hy that minimizes MSE(g(=)] is hy ~ n-/"+9, and the resulting MSE is therefore of onder O(n-#/0*4), ‘The asymptotic normality of g(z) can also be established at this stage by using Liapunov's CLT. Theorem 2.2. Under the assumption that = ison interior point, o(2) and f(2) are three-times continuously diferentioble, and f(2) > 0, then asm — 00, hy —+ 0 (for alls), mhy ...hg —> 00 and (mfx... hig) 4g AS 0, we have Vay hg (a —9(2)— PE) AN Ona) ile), es (2.14) 6 where By(z) is defined immediately following (2.8). Proof. (wht «--hg)!/ing(2) has mean zero and (2.11) shows that it has ‘asymptotic variance 1(z). Using Linpunov's CLT, one can easily show that Vi Fal) N (0,0(2)) (215) Combining (2.10) and (2.16), we have Vata hg (00 7 Swatovte) = al hgiha() + op(1) 2N 0,2) 16) ‘ocanse (rhs -+Fg)* 2! ~ (ah +b)? 
Shay 89 = oft) From (2.16), and also noting that /(2) = f(2) + op(1), we immedi- ately obtain Vali he (ae = (2) - dea) fal Teg (van) — Shas HBL) A) Ft) 4 oN, =N (0,n90%(2)/S(@ 4 FN Mey = N (O.nto%(e)ise ) (2.17) ‘This completes the proof of Theorem 2.2 o 2.1.1 Intuition Underlying the Local Constant Kernel Estimator In order to master nonparametzie kernel regression methods, one should not only master basic skills such as deriving rates of convergence, but in addition seek to cultivate an intuitive understanding of why kernel ‘methods work as they do. Fortunately, the intuition underlying Q(x) % g(x) is really quite simple. The basic idea is that g(z) is simply a “local average,” whicl (CONSTANT KERNEL ESTIMATION 6 can pethaps best be explained in a simple one regressor setting (X; € R) employing a uniform kernel. In this case (fy = h) we see that Dixie enl/2) Lixcaten Xo +) Lineage! average of g(Xi)'s + average of u's with [X;~ a] < h} 7 o(a) +0= (2), because [g(X:)—g(2)] = O(h) = oft) for X;—2| < h by the assumption ‘that g(z) has a bounded derivative atx. In fact, given-the symmetry of the kernel function, we have n—? jx, jeq(g(Xi) ~ o(2)) = Op (A?) ‘The condition nh —+ 26 ensures that aeymptoticaly, infinitely many observations appear in each interval with length A (or 2h, Le. within distance h either side of the point X = at which we evaluate our estimate of g(z)). This is because the number of intervals is of the onder O(1/A)- On average, we would expect to have n/O(1/h) = Olnh) observations in an interval with length h. Azzalini and Bowman (1997) refer to mh as the “local sample size.” which is a nice way of thinking about things, a this i the sample relevant for estimating the regression function ata fixed point 2. Consistency therefore requires that the size ofthe “local sample” must increase (nh ~ oc) with the overall sample size (rm co) while at the same time the interval width must shrink to zero in the limit (h —» 0). ‘The average of the t's in each interval converges to their population mean value B(us) = 0 (in probability) by a law of large numbers argument because nh > co as n —* 00. We therefore see that, for the uniform kernel, (2) uses a simple local average of the 1's (based upon those X; Iying close to 2) to estimate at). : ‘Summarizing, we estimate the conditional mean function by locally averaging those values of the dependent variable which are “close” in terms of the values taken on by the regressors. The amount of local, information used to construct the average is controlled by a window width, By controlling the amount of local iaformation used to construct the estimate (the “Iocal sample size") and allowing the amount of local 6 2, REGRESSION averaging to become more informative os the sample size increases, ‘vile also shrinking the neighborhood in which the averaging oecurs, we ‘an ensure that our estimates are consistent under standard regularity conditions. ‘The estimator g(e) is often called the “local constant” estimator of ‘4(2), though one could just as easily use local linear/potynomial inet ts to estimate (2), and for a more in-depth treatment of local poly- tomial methods we refer the reader to Fan and Gijbel's (1996) well ‘writen monograph. Local linear estimators also automatically provide Jin estimator of response, Le. the derivative of g(z), while » pth ordex focal polynornial method can estimate derivatives up to order p. 
We cuss local linear and local polynomial methods in Section 2.4 be: Jow, but Brst we address a fundamental aspect of nonparametric local constant regression, namely, bandwidth selection. 2.2 Local Constant Bandwidth Selection In this section we discuss various approaches to smoothing, parameter selection for the estimation of the unknown regression function g(°) by the local constant estimator given by (2.6). We discuss three different methods of smoothing parameter selection, (i) rule-of-thumb and plug jn methods, (ii) the least squares cross-validation method, and (i) & modifiod AIC procedure. 2.2.1 Rule-ofThumb and Plug-In Methods When using a second order Kernel, it ean be shown that the opti- ral smoothing parameter should be of order O(n~'#9)), A popalar Tuloof thumb procedure is therefore to choose he by cyan) (on 1.---a), where cy isa constant and X,uq i the sample standard deviation of (Xis}f-y- In practice, cy is often chosen to be 1 or some ther constant close t 1, The argument underlying this rule-of-thumb fe volated to the oo called novmal reference rule for density estimation discumsed in Chapter 1. Rulesof-thumb are attractive to practitioners tpecause they are simple, computationally speaking. However, one dis- ‘edvantage of such approaches is that that they treat all components (covariates) of »syzmmetricaly. In practice, «regression function g(z) thay change slowly with respect to one component, say 21, but quite rapidly with respect to another component, sy 2. In this case, one 22. LOCAL CONSTANT BANDWIDTH SELECTION or ought to use a relatively large smoothing parameter for 2, and a rel atively small smoothing parameter for 22. Clearly the rule-of-thumb rethod lacks this fexibilty ‘An alternative method isthe so-called plug-in method which is usu- ally based on minimizing a “weighted integrated MSE" (WIMSE) of the form J Elg(x) — 9(2)}*v(2) da, where the expectation is taken with respect to the random sample {X,Y}, and where v(2) is @ nonneg- ative weight function which ensures thet asymptotically, the WIMSE is finite ‘The leading bias and variance terms of (2) are derived in (2.14). ‘Thus, the leading term of the WIMSE J Elg(2) ~9()Pu(2) dz is given by @ = a (2) wse~ {{ eer va gi pe woe) eon) Note that the use of a weighting function here is important since fo7(x)f(z)"! dx does not exist for many well-known distributions (eg. the normal distribution) By defining ay via hy = ayn™¥/(@+), (2.18) becomes (2.18) WISE = 9M yu (04,. 504) (219) where rela, wove [{ [Een] sate G vom (2.20) We let af,...,09 denote the values of a1,...,d) that minimize Xo(ai,---149). If one of the a! is zero, we must have a? = co for some t # s. This implies that hy = 00. If Jy = co, we readily see that k((Xie ~ 21)/Iu) = K(0) becomes a constant, and this constant cancels in the mumerator and the denominator of j(e) so that (x) is then unrelated to. (if hy = 00). In this section we assume that all the components in + are relevant regressors and hence we rule out the case 68 for which oP (Fy) is infinity and assume that? Each a! is uniquely defined, positive, and finite (221) Letting the Al's denote the stoothing parameters that minimize (2.18), we have Na nMarle? for s=tyend (222) Equation (2.22) implies that A2 = O(n-¥/(@+4), We observe that 9 depends on the unknown funetions o(-), f(), and their derivatives (because B.(z) depends on these functions}. Obtaining explicit expres- sions for the a when 1 0 for all z. 
Condition (2.81) states that A being positive definite is suf- ficient for 29,...,29 to be finite. We give an intuitive proof of (2.31) Note that (BS) Ka sient condition. A (wosk) necessary and suficent ‘condition for 2,...,28 t0 all be positive and fnite can be found in Li and Zhou (2005), 2 2. REGRESSION here and relegate a more rigorous proof to Exerese 2.2. First, note that 1A being positive definite implies that 2? < oo forall s, for otherwise ‘we will have yz = 00. Next, we have 2] > 0 for all s, otherwise we have Co/(e}.., 28)! = oo. Thus, we must have 0 < 22 < oo for all sehen ‘Theorera 2.3 only covers the case whereby all components of are relevant, As we discussed in Section 22.1, when q= 2 and g(a) = (a) for all (x1,22) € R? (ez is an irrelevant regressor), then (2.28) does 104 hold, However, we can show that the cross-validation method Sail leads to optimal smoothing parameter selection, as the optimal Smoothing in this case should have the property that fy = Op (n~™/) ‘and hy + 00, as we demonstrate in Section 2.2.4. 2.2.3 AIC. ‘An alternative bandwidth selection method having impressive finite- sample properties was proposed by Hurvich, Simonoff and Tsai (1998). ‘Their approach is based on an improved Akaike information criterion (AIC; see Alike (1974)). Hurvich et al's exiterion provides an approx: imately unbiased estimate of expected Kullback-Leibler information for nonparametric models. Akaike’s original information criterion was de- signed for parametric models, while Hurvich et al.'s approach is valid for estimators that can be written as linear combinations of the out- ‘come, hence is directly applicable to a wide range of nonparametric estimators. Their criterion is given by 1+ tr) /n i") + eC) + AIC. = where 1 SS aixo}? = YC - HY ~ DY/n vith 9(%:) being « nonparametric estimator and H being an n x 7 Weigh function (he. the matrix of kernel weights) with ie (jj) tlement ven by Hy = Kru / Dh Kast Be gth((Kia— Xie) “Tinigh simulation experiments, Hurvich et a. (1998) show that the AIG. bandwidth sclector performs quite well ompased with the plugin method (hen iti evailable) and with a numberof generalized 2.2, LOCAL CONSTANT BANDWIDTH JECTION 8 ‘cross-validation methods (Craven and Wahba (1979)). While there is ‘no rigorous theoretical result available for the optimality of the AIC. so- lector, Hurvich et al. conjecture that the AIC, selector shares the same asymptotic optimality properties as the least. squares cross-validation ‘method outlined in Section 2.2.2, Indeed, simulation results reported. in Li and Racine (2004a) show that, for small samples, AIC, tends to perform better than the least squares cross-validation method, whi for large samples there is no appreciable difference between the two approaches. 2.2.4 The Presence of Irrelevant Regressors ‘We now consider the case wherein (i.c., allow for the possibility that) some of the regressors are irrelevant. Without loss of generality, we assume that only the first ¢1 components of 2 are relevant in the sense defined below. For integers 1 00 for 8 = qi +1,---44- Thus, asymptotically, cross-validation is capable of automatically removing irrelovant variables. In other words, cross-validation in conjunction with the local constant kernel estimar tor is capable of automatic dimensionality reduction when some of the regressors are in fact irrelevant. 
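To make the preceding point concrete, the sketch below minimizes the leave-one-out least squares criterion CV(h) = n^{-1} sum_i (Y_i - g_hat_{-i}(X_i))^2 for the local constant estimator over a two-dimensional bandwidth grid. Everything here is our illustrative construction: the Gaussian product kernel, the simulated design in which the second regressor is irrelevant, and the grids. In runs of this experiment the selected bandwidth for the irrelevant regressor tends to sit at or near the top of its grid, consistent with the claim above, while the bandwidth for the relevant regressor stays small.

```python
import numpy as np
from itertools import product

def cv_lc(h, X, Y):
    """Leave-one-out least squares cross-validation criterion for the local
    constant estimator with a product Gaussian kernel."""
    u = (X[:, None, :] - X[None, :, :]) / h
    K = np.exp(-0.5 * u ** 2).prod(axis=2)
    np.fill_diagonal(K, 0.0)                        # leave observation i out
    g_loo = (K @ Y) / (K.sum(axis=1) + 1e-12)       # small ridge avoids 0/0 for tiny bandwidths
    return np.mean((Y - g_loo) ** 2)

# Simulated example: X1 is relevant, X2 is irrelevant noise.
rng = np.random.default_rng(3)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
Y = np.sin(np.pi * X[:, 0]) + 0.2 * rng.normal(size=n)

grid1 = np.linspace(0.05, 1.0, 20)                  # candidate h1 values (relevant regressor)
grid2 = np.linspace(0.1, 10.0, 20)                  # candidate h2 values (irrelevant regressor)
h1, h2 = min(product(grid1, grid2), key=lambda h: cv_lc(np.array(h), X, Y))
print(h1, h2)                                       # h2 tends to be pushed toward the top of its grid
```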
"The cross-validation objective function is the same as that defined in Section 2.2.2 except that now Yj = gi) + u with E(uiXe) = 0, tat is the conditional mean function now depends only on the relevant, regressors X. "The analogue of the function x defined in Section 2.2.2 is now mod ified to (see Exercise 2.5) senesta) =f [= nerd] Heyman ae (aya yar (2.35) where 12) = [ Heqess2dM CB tqanee eager Be and the “bar” uotation refers to fimetions involving only the frst a, Components of #. 7 (J) is the marginal density of X (X) and B, is defined in the same way as B except that it is only a function of X. ‘Nove thatthe ietlevant components do not appear in the definition of ‘As before, let af,...,08, denote values that, minimize X subject to cach of them being nonnegative, We require Fors= esas cach of is uniquely defined and is finite, (2.36) Since we should allow the smoothing parameters for irrelevant vari ables to diverge from zero as n — 00, conditions used in Section 2.7 ‘cannot be imposed here. We impose the following conventional restric tions on the bandwidth and kernel function. Define 1 ho) FT 41 ii (he, 1)- 22. LOCAL CONSTANT BANDWIDTH SELECTION % Let 0 <€<1/(g-+4). Assume that WES Hay Sn min(hayesshq) > max(hiy..5fy) < n© for some C > 0; the kernel k is a symmetric, compactly supported, Hélder-continuous PDF; ‘k(0) > K(6) for all 5 > 0. (231) If ¢ is arbitrarily small, the above conditions om his...shg are in essence that mhy....ig, > 00, ht---f + O and hy.-.hy + 0 a8 00, and that the fastest rate at which hy goes to zero cannot exceed n“* (for some a > 0) and that the fastest rate at which he (lor irrelevant variables) goes to oo is contained by n? (For some b > 0). With these conditions we obtain the following result ‘Dhoorem 2.4. Under conditions (2.92), (2.96), and (2.9), letting fasyes-1 hq denote the smoothing parameters that minimize CVjc, then lO, 08 in a [0+ of in prolly, for <7 < 5 39 Ply > ©) 1 for + 1S 8<4 and alt C >0. ‘Theorem 2.4 states that the smoothing parameters for the irelevant components diverge to infinity in probability, therefore, all of the irrel- evant variables ean be (asymptotically) automatically smoothed out ‘while the smoothing parameters for the relovant variables continue to inherit the same optimality properties that would prevail ifthe irele- vant variables did not exis. Hore we mention that (2.92) is a sufficient, not a necessary, condi- tion for Theorem 2.4 ta hold. Simlation results reported in Hall et al (2006) suzgest that the conclusion of Theorem 2.4 still holds true if (2.82) is replaced by the condition “conditional on X, Y' is indepen- dent of X,” although a rigorous proof for this seems quite challenging and remains a topic for future research, ‘The complicated proof of Theorem 2.4 is given in Hall et al. (2006) ere we present some intuitive reasons to expect the results of Theorem 2.4 to hold. We call the terms of order mm = SO h? the leading bias term, and the terms of order 1 = (ny...) the leading variance term. Hall et al. (2006) show that the presence of izelevant vaviables does not contribute to the leading bias term in CVic because, by (2.32), the irrelevant components cancel in the ratio fig(#) = Bjrn(z)]/E\ (2) ‘Therefore, for the leading bias squared terms of order O(rg), the has w 2. REGRESSION (a +1 <8 1 for all choices of Ag-a2s-+» sf. Obvi- ously, R—+ 1 a5 hg 00 (91 +1 < # <4). Infact we have the following result: Result 2.1. 
The only values for which R(3, hy 41y-+-yhq) = 1 for some Be suppw are hy = 00 for +158 0 so that R = E[Z2]/(E(Zn|? > 1, unless in the definition of Zn, all hh, = 00, in which case Zy = K(O)!-® and var(Z,) = 00 that R= 1 in this and only this ense. “Therefore, to minimize (240) (the leading term of CVi(hay +++ ».))+ noting that both the squared bias and the variance terms are positive, ‘we mst have R — 1 as m+ 6, which implies that we must have ‘the smoothing parameters for irelevant components diverging to infin- ity. Thus, the irrelevant components are asymptotically smoothed out. Using = 1 in (2.40) leads to (2.35), which in turn implies that the smoothing parauueters for the relevant variablos retain their optimality property as given in Theorem 24 Tf § is computed using the asymptotically optimal deterministic 22, LOCAL CONSTANT BANDWIDTH SELECT smoothing parameters, then itis easy to show that (ohh ng (i (2) -S mers") > N0,95(2)) (241) in distribution, where Ze2)at2) 7} dual) +2) (2.42) 23 rr Fe) with 03(2) = B ‘Tho next theorem shows that when the cross-validated smoothing parameters are used rather than the asymptotically optimal determin- istic ones, the asymptotic normality result given in (2.41) still holds. ‘Theorem 2.5. Under the same conditions found in Theorem 2.4, (241) remains true if ae) is computed using the smoothing pararn- eters chosen by cross-validation, i.., nti fg, (10 =H) - a) SN One /F(3)) ‘Tho proof of Theorem 2.5 is given in Hall etal. (2006). It can be understood intuitively as follows: First, note that as hy — 00 for s = aa 1.19, we have 9(e; fn, ---s fg) = GCF hiy---sfgs)+ (3.0,), where G(2,hi,...,hq,) is @ kernel estimator of 9(%) using only the relevant regressors, Next, given the fact that hy ~h = o(h2) (8 = 1,-...41), ‘we expect that 9(2, hay... hg) = G(2,Rt,... AB.) + op (SH, (49)"). ‘Theorem 2.5 follows from this result and (2.41). ‘Note that the condition (2.82) is quite strong, It not only assumes ‘that, X is independent of ¥, but also requires that X is indepen- dent of X. Ideally, one would like to relax this condition to either ( EIVLX, X) = BLY] almost surely, or to (i) conditional on X, Y is independent of . Hall et al. (2006) conjecture that ‘Theorem 2.4 remains true when (2.82) is relaxed and replaced by condition (3) or (ji) above. The simulation results reported in Hall et al. provide some evidence in support ofthis conjecture. 8 2. REGRESSION 2.2.5 Some Further Results on Cross-Validation Racine and Li (2004) also established the rate of convergence of hy/h2— 1, which is given by Bea abe 10, (woe) (243) ‘where a = min{o/2,2}- When q = 1, we get (fg A9)/H9 = Op (n#), ‘which coincides with the result obtained hy Handle ct al. (1988), among thers. Equation (2.43) implies that h converges to the nonstochastic optimal sraoothing parameter h? in probability Equation (2.48) shows that (R,~AY)/A9 = Op (n-#/ 0) for q < 4, and (ha — A9)/12 = Op (n- 2/490) for q> 4. ‘The reason for having, two different expressions for (fi,—A9)/h? (depending on whether or not > 4) is that, in the higher order expansions of the cross-validation function CVie(hi,-.. Rg), we have terms of order O, (Sf, hf) and a second order degenerate U-statistic of the form (nin = 1) SOY wes Knl Xe X5) Tia where B(1|X;) = 0, which has an order Op (n(lh1-..fg)!/2)~"); see [Appendix A for a general treatment of degenerate U-statistis. When @ £ 4, the Oy (n(hi-.-hg)!/?)-!) 
term dominates the Op (Rf) term since hy ~ Op (49) = Op (n-¥/4"9)), while when q 2 5, the Op (Hi term becomes the dominating term. Therefore, the rate of convergence differs depending on whether q <4 or q > 5 (see Racine and Li (2004) for a detailed proof) 2.3 Uniform Rates of Convergence Proceeding along lines similar to those used when deriving the uniform almost sure rate for the density estimator discussed in Section 1.12 of Chapter 1, we can establish the uniform almost sure convergence rate ‘of g(2) to glz) for 2 € S, where S is a compact subset of RY that excludes the boundary range of the support of X. Condition 2.1. Assume that 24, LOCAL LINEAR KERNEL ESTIMATION n () flz) is differentiable and g(x) is twice differentiable, and that the “erivative functions all satisfy the Lipschitz condition |m(z) ~ m(v)| S Cle —v] for some C >0 (m(-) = gu) or fa). GH) 0°(2) = E(ufl2) és a continuous function, and infzes (2) 2 5 > 0. (Gii) The kernel k(.) is symmetric, bounded and has compact support (say k(w) = 0 for jo) > 1). Defining Hi(w) = |v|!K(v), we assume that [Zh(v) — Hi(u)| < Calu =| for all 0 <1< 3. ‘Dheorem 2.6. Under Condition 2.1, we have sup |g(x) ~ g(x)| =O ( Zs =x) almost surely. a (2.44) The proof of (2.44) is quite similar to the proof of Theorem 1 of Chapter 1. By writing (2) ~ g(e) = ri(z)/f(2), where ra(z) = (G(x) — g(x) f(x), one can show that mtn = (S68 Coon) ‘Theorem 1.4 imply that infzeg f(x) > 6” > 0 as., where é is a positive aoe supla(e) ~ 92] = sup a(=)/e)| < sup (2)/ inf A) ob By soph eh =0 (conven notte) as. Having examined the local constant kernel estimator in detail, we now turn to another popular method, the local polynomial approach. 2.4 Local Linear Kernel Estimation ‘Though the local constant estimator is the workhorse of kernel regres- sion, it is not without its idiosyncrasies. In particular, it has potentially large bias when estimating a regression fnction near the boundary 80 2. REGRESSION of support. The local linear estimator proposed by Stone (1977) and Cleveland (1979), on the other hand, shares many of the properties of, the local constant approach (e.g, their variances are identical), while it is one of the best known approaches for boundary correction as its bias is not a function of the design density f(r); see Fan (1992), Fan (1993), and Fan and Gijbels (1992). For an extensive treatment of the local linear estimator see the excellent monograph by Fan and Gijbels (1996) We again consider a regression model of the form Y=aX)+y, 5 (2.45) retaliate aero chee “sme (2) «() ‘which ean also be obtained as the solution of « in the following mini- ization problem: G(z) + (2.46) (247) Letting @ = (2) be the solution that minimizes (2.47), one can easily see that @ = 9(2) as dofined in (2.46). Note, however, that (2.47) uses a constant a to approximate g(x) (or Y) in the neighborhood of z, since (2.47) only uses a local (X's close to 2) average of ¥j’s to estimate g(). For this reason, g(r) de- fined in (2.46) is called a “local constant” kernel estimator. However, ‘one could instead use @ “local linear” (or higher order polynomial) es- ‘imator to ostimate g(). 
One feature of a local linear method is that it automatically provides an estimator for g(?)(e) “ ag(x)/Az, the first ‘order derivative of g(z), though the local constant partial derivative ‘estimator is well-studied; see, for example, Vinod and Ullah (1988), Rilstone and Ullah (1989), Hardle and Stoker (1989) and Pagan and Ullah (1999, Section 4.2) for further details ‘The local linear method is based on the following minimization problem: rap 05 teat 2 (4-2)? K & =) (24s) 24. LOCAL EAR KEI a Let 4 = a(x) and b = b(z) be the solutions to (2.48). We will show in this section that a(2) is a consistent estimator of g(2) and b(e) is a consistent estimator of g(!)(zx) = d9(:r)/Ax (both b and g\)(2) are qx1 ‘vectors). The local linear approach given in (2.48) is easily understood because itis analogous to a “local least squares” estimator. Hence the slope estimator b estimates the local slope g')(z). Let 5 = (2) = (a(z), (6(x))'), let Y be the n x 1 vector having, ith component Yj, lot X be an m X (Iq) matrix with the ith row being (1,(X;~2)), and let K(x) be an n x m diagonal matrix having ith diagonal element K (4§=*), Then in vector-natrix notation, (2.48) can be written as BBO ~ XK ~ 28), (249) Equation (2.49) isa standard generalized least squares (GLS) prob- (6,8 be the solution of (2.49). Then using the standard (AKA AK )Y 22) (x1. Oo0- | 2) (gb a) (250) ‘The following condition is needed to establish the consistency and asymptotie normality of 6(z) given in (2.60). Condition 2.2. ( (Xjp¥Yor ave kid, g(a) and f(e) and o%2) = Elufla) are tuice differentiable, (ti) K is a bounded second order kernel (ill) Asm 00, nhy hy DB, HE 00 and nhs. Ihy 2g AE 0. ‘The next theorem establishes the asymptotic normality of &(). 2 __2 REGRESSION ‘Theorem 2.7. Recall that «= [kv)®dv, wo = [A(v}e dv, define kag = f vP(0)Pdo, and let pe= (ME yam mp,)* = (SPOT), ° ( 0, e-nnosyyldsta)) where Dy is aX q diagonal matriz with the sth diagonal element given by hg, and where I, is an identity matrix of dimension g. Then under Condition 2.2 we have (0) (ie) ate) - (F# EYL) 400.5). sn ‘he proof of Theorem 2. given in Setin 27.2 Note that Theor 27 pies the following dierent rates of com vergence (aes Ihg)¥? jae 92) 2 4 wot(z) de.) (0.52), (252) nd (ohh sedate] (0, eet) for 8 =3y-..5% (258) where gs(2) = Ag(2)/0z, is the s** component of g(c) Let (2) = A(z) and 9,(2) — b,(2) denote the loesl-tinar ker nel estimates of g(x) and ge(x), respectively. Then 9(z) — g(z) = 0p (a+!) ane Gl) — 94(@) = Op (m+ mi!"hs") (ma = Daa Ms mm = (hy....hg)"2). This is a standard result, i., that the conver- dence Fate for derivative estimation is slower than that for regression function estimation, For estimating the lth order derivative of g(x), the rate of convergence will be Op (m+ nt!? (Cte n3t)-¥), We discuss higher order derivative estimators in Section 25. Consistent estimation of g(x) requires that hy +0 (9 = 1,---49) ‘and nhy-shy —» 20 in order for the bias and the variance to both converge to zero, while consistent estimation of g(x) requires an even 24, LOCAL LINBAR KERNEL ESTIMATION 88 stronger condition, namely, ny... 
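Before working through the asymptotics, a minimal numerical sketch of this weighted least squares problem may be helpful: at a point x one simply regresses Y on an intercept and (X_i - x) with kernel weights. The Gaussian product kernel, the simulated sine-curve data, and the function name are our illustrative choices; the formal matrix solution follows in the text.

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Local linear fit at x: kernel-weighted least squares of Y on (1, X_i - x).
    Returns (a_hat, b_hat), estimating g(x) and the gradient of g at x."""
    n, q = X.shape
    Z = np.column_stack([np.ones(n), X - x])              # n x (1+q) local design matrix
    w = np.exp(-0.5 * ((X - x) / h) ** 2).prod(axis=1)    # product Gaussian kernel weights
    delta = np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (w * Y))
    return delta[0], delta[1:]

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(400, 1))
Y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=400)
a, b = local_linear(np.array([1.0]), X, Y, np.array([0.3]))
print(a, b)    # roughly sin(1) ~ 0.84 for the level and cos(1) ~ 0.54 for the slope
```

Note that the slope component comes free of charge: the same weighted regression that delivers the level estimate also delivers a derivative estimate.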
fly 321 A3 + 00, in order for the variance terinof the derivative estimator to converge to zero Note that when the underlying regression function isin fact linene ina (Le, g(2) = a9-+2'a3, where the scalar ag and q x 1 vector ay are constants), then the local linear estimator exhibits the property that the bias tern is zero for any value of (h1,.--yfg) (ee Exercise 2.10) ‘Thus, the local linear estimator becomes unbiased when g(2) is linear in.z, When the bias i zero we could allow hg = co (8 = 1,..-y4), and it is easy to show in this instance that the local linear estimator collapses to a(x) = +2/a1, where dg and 6 are the ordinary Teast squares ‘estimators of ay and an, respectively (see Exercise 2.6). Thus, the local linear estimator nests the least squares estimator as a special case if ‘one allows for hy to assume sufficiently large values for all s = 1)... 9 We come back to this poiut when we discuss cross-validated smoothing parameter selection for the local linear estimator in the next section, since this has important implications for both bandwidth selection and rates of convergence. 2.4.1 Local Linear Bandwidth Selection: Least Squares Cross-Validation Let 9-4(Xi) denote the leave-one-out local linear estimator. That is, let (€4,5,) be the solution of (a,8) in the following minimization prob- lem: min 2 [%- wh ae were (5:82) = Ty (AM). the ay Teeveone-ut fol lnnt kernal timate of ¢(X)- ‘The local linear cross-validation approach to bandwidth selection chooses those h,'s whieh mininize Vis ---sha) = min * YT DRE MOG), 58) where M(.) is a weight function. Note that the objective function (2.54) only involves estimation of g(:) and not its derivatives, therefore we only need the conditions on hy,...shg stated in (2.74), that is, we do not ‘need the stronger conditions on the smoothing parameters that ensure ‘consistent estimation of the derivatives of g(). st 2, REGRESSION For fixed 2 € RY, we have derived the asymptotic bias and variance terms of (x) = 4(2) of the local linear estimator, while (2.52) implies that. Bla(2) - a(2)h Souter taney M2) ong +m) By Fa) (255) [As we discussed in Section 2.2.2, the cross-validation objective fune- tion is asymptotically equivalent to f Elg(z) —a(a)?/(2)M(a) dz. Us ing (2.55) one might conjecture that the leading term of CV would be Oviaa~ J Bige)— aCe @MC de s fF xt fo%(a)M (2) dz = [3 Senco] NeyM yas + SLE ce + olrf-+ 228) and this coneture tus out o be corsct asthe nding ter of CV is indeed given by (2.58) ( Lt and Racine (2004 for a dtaled poo!) ‘oy following the aporouch presented Ia Sertion 22, we express the leading tern of (2.56) a3 n-4(0"4)yy(a1,... aq), here the a's are defined by hy = an“ VO ( jy-++4g)) and rulersovay)= f [sXe] Fla)M(2) de (sn) +H, [tom@e Let a8)... 09, denote those values of a1... that minimize xu, and assume that Each a; is uniquely defined and is positive and finite, (2.88) ‘Assumption (2.58) rules out the case for which a2, = oo, which implies that guo(z) cannot be a zero function. That is, we explicitly rule out cases for which g(2) is linear in any of its components zy Letting the h,'s denote those values of the h,’s that minimize (2.54), Li and Racine (2004a) showed that nll@tDh,, + a), in probability for s “4 (2.59) 25. 
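To make the local linear estimator (2.50) and the leave-one-out cross-validation criterion (2.54) concrete, here is a small R sketch (an illustration of ours, not code from the text). It assumes a Gaussian kernel, takes the weight function $M(\cdot)$ to be identically one, and selects the bandwidth by a simple grid search rather than numerical optimization.

# Local linear estimator: weighted least squares of Y on (1, X_i - x).
loc.linear <- function(X, y, x, h) {
  X <- as.matrix(X)
  Z <- sweep(X, 2, x, "-")                                  # X_i - x
  w <- apply(dnorm(sweep(Z, 2, h, "/")), 1, prod)           # kernel weights
  b <- lm.wfit(cbind(1, Z), y, w)$coefficients
  list(ghat = b[1], grad = b[-1])  # intercept = fit, slopes = derivative estimates
}

# Leave-one-out least squares cross-validation objective, as in (2.54),
# with the weight function M(.) taken to be identically one.
cv.ll <- function(h, X, y) {
  X <- as.matrix(X)
  n <- length(y)
  e <- sapply(seq_len(n), function(i) {
    y[i] - loc.linear(X[-i, , drop = FALSE], y[-i], X[i, ], h)$ghat
  })
  mean(e^2)
}

# Example: choose h over a grid for a single regressor
set.seed(1)
n  <- 100
x1 <- runif(n)
y  <- x1^2 + rnorm(n, sd = 0.2)
h.grid <- seq(0.05, 0.5, by = 0.05)
h.cv   <- h.grid[which.min(sapply(h.grid, cv.ll, X = x1, y = y))]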
LOCAL POLYNOMIAL REGRESSION (GENERAL PTH ORDER) Equation (2.50) states that the local linear cross-validation smooth ing parameters converge to the optimal smoothing parasneters, and that the rate of convergence of the resulting local linear estimator is, ‘the same as the local constant eross-validation case when the underlying regression function is not linear in any of its components ‘When 9(2) is linear in some component of, says, then (2.58) docs not hold. Ia this ease we also need to modify the conditions imposed on the smoothing parameters as was done in Section 2.24 so that we allow those smoothing parameters corresponding to those regressors entering (2) linearly to diverge to infinity. Then we would expect that local linear least squares cross-validation should have the ability to select a large value of hy when g(z) is linear in 2, while selecting relatively small values of he fr regressors that enter nonlinearly, The simulation results reported in Li and Racine (2004a) show that this is indeed the case, though we are not aware any work that provides a rigorous theoretical proof of this result. Specifically, et us consider the case for which q = 2.and where the underlying regression function g(#1,22) = @(21) + 20 is a par tially linear fanction (e.g. linear in 22). In this case, local linear cross validation will tend to select a very large value of fig (liz -» 00), while hy = O,(n-¥®) (rather than Op (n-¥)). Thus, local linear cross validation would suggest that g() is Haeae in 2 2.5 Local Polynomial Regression (General pth Order) 2.5.1 The Univariate Case "As was the case when we derived the local Linear kernel estimator, we ‘can use a higher order local polynomial estimator to estimate 9(2) ‘When z is univariate, a pth order local polynomial kernel estimator is based on minimizing the following objective function: des Bo = By( Ry — 8) 20 = OpXy = FY? xx (% *) (2.60) Let by denote the values of by (I tad) 2 1,...p) that minimize (2.60). _2._hcression ‘Then fy estimates g(z), and ty estimates g(a), where of(z) = d@'g(z)/de! is the Ith order derivative of g(x) (1 =1,..-,P). Once again, (2.60) is a standard weighted least squares regression problom. Letting & = (1,(Xj —)-+-y (Xj — 2) then be is "] i [pe ce)| Eame a) (2.61) Ruppert and Wand (1994) study the leading conditional bias and conditional variance of ty, which we summarize in the following theo- ‘Theorem 2.8. Let g(r) = bo(x), and let Zn = (Xi}Par Ip is odd, B9(2)|2n) — fe) = 1 eay [a | o(ntt), (2.62) and if p is even, Bg@)/2a] —9(2) = wong mee] Fee +! egg [22] Lora, @ ae [= 2) | eo (2.63) puree — feswo*@)] (2 vweraeo2n)= [SFB] +0(F), eon tuhere Gig (7 = 1:2,8,4) are some constants defined in Ruppert and Wond (1994) ‘Theorem 2.8 shows that the leading conditional bias term depends ‘upon whether pis odd or even, By a Taylor series expansion argumnent, ‘we know that when considering |X; ~2| < h, the remainder term from a pth order polynomial expansion should be of order O(4?"), so the result for odd p is quite easy to understand. When p is even, (p +1) is odd; hence the term h?*! is associated with [ K(v}u' de for ! odd, and 23.0 \L. POLYNOMIAL REGRESSION (GENERAL PTHLORDER) 87 this term is zeto becanse K(p) is an even function, Therefore, the h?*? term disappears, while the remainder term becomes O(?*?). Since p Js either odd or even, wo see that the bias term is an even power of h. 
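A short R sketch of the univariate pth order local polynomial fit in (2.60)-(2.61) follows (again an illustration of ours, using a Gaussian kernel); it returns $\hat b_0$ as the estimate of $g(x)$ and $l!\,\hat b_l$ as the estimate of $g^{(l)}(x)$.

# p-th order local polynomial fit at a point x, one regressor, as in (2.60).
loc.poly <- function(xdat, y, x, h, p = 3) {
  Z <- outer(xdat - x, 0:p, "^")               # columns (X_i - x)^l, l = 0,...,p
  w <- dnorm((xdat - x) / h)
  b <- lm.wfit(Z, y, w)$coefficients           # b_0, b_1, ..., b_p
  list(ghat = b[1],                            # estimates g(x)
       deriv = b[-1] * factorial(1:p))         # l! b_l estimates g^(l)(x)
}

# Example: estimate g(0.5) and g'(0.5) for g(x) = sin(2*pi*x) with a cubic fit
set.seed(7)
n    <- 300
xdat <- runif(n)
y    <- sin(2 * pi * xdat) + rnorm(n, sd = 0.2)
out  <- loc.poly(xdat, y, x = 0.5, h = 0.1, p = 3)
out$ghat        # true value is 0
out$deriv[1]    # true value is -2*pi

In keeping with Theorem 2.9 below, one would ordinarily choose p so that p - l is odd for the derivative order l of primary interest.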
‘This is similar to the ease where one uses higher order kernel functions ‘based upon a symmetric kernel function (an even function) for the local constant estimator, where the bias is always an even power of h. Summarizing, conditional on 2%, we have When p is odd 9(c) ~ o(z) = 0, (H?** + (nh)*) 0, (n#" + (nay?) (265) When pis even (2) ~ (2) Let g((e) = tie) denote the estimator af g((z) based on a rth order local polynomial fit (! < p). The following theorem states the leading bias and variance of 3 (2). ‘Theorem 2.9. When pI is odd, eat B [91%] - (2) = erty [ When p—1 is even, E [91%] 9) = wil a | BI x J Ky (v)dv I} In either case var (4(2)|4) = (aa) +o((nnet)-), where cj1y (5 =1,2,3,4) are constants defined in Ruppert and Wand (1994). Note that f(")(2)/ f(z) does not appear in the conditional bias term if p—1 = is odd. Ruppert and Wand (1994) also show that when p—is, ‘odd, the bias at the boundary is of the same order as that for points on 88 2. REGRESSION the interior. Thus, the appealing boundary behavior of local polynomial ‘mean estimation extends to derivative estimation when p —1 is odd However, when p— 1 is oven, the bias at the boundary is larger than in the interior, and the bias can also be large at points where f(r) is discontinuous. For these reasons, we recommend strietly setting p—U to be odd when estimating 9((z) ‘Theorem 2.9 implies that, conditional on %p, When p—1is odd 9 (2) — (2) = Op (1M-+ (nney-¥?) When pis even (a) — (a) =O, (10-42 + (2) 47) (3.56) Using Liapunov’'s CLT, we can also establish the asymptotic nor- ality of g(z) and g(” (2). We postpone our discussion of asymptotic normality results until the next section, in which we discuss general ‘multivariate local polynomial regression. 2.5.2. The Multivariate Case ‘A general pth order local polynomial estimator for the multivariate regressor case is more cumbersome notationally speaking. The general multivariate ease is considered by Masry (1996a, 19962), who develops some carefully considered notation and establishes the uniform almost sure convergence rate and the pointwise asymptotic normality result for the local polynomial estimator of g(z) and its derivatives up to order p. Borrowing from Masry (19968), we introduce the following notation: (vom) enlace #= Sry 267) (2.63) 00) ~ 5 es (260) Using this notation, and assuming that g(x) has derivatives of total order p+1 at a point 2, we can approximate g(=) locally using a LOCAL POLYNOMIAL REGRESSION (GENERAL PTH ORDER) _ 89 ‘multivariate polynomial of total order p given by 2)= DAW lvlivaale— 2) (2.70) oSep Define a multivariate weighted least squares function by {- 3 seas afr ‘Minimizing (2.71) with respect to each by gives an estimate by(z), and by (2.70), we know that rlb(x) estimates (D"g)(r) so that (D°G)(x) = riby(z), For a gencral treatment of multivariate bandwidth sclection for the local linear kernel estimator, see Yang and Tschernig (1999). 2.5.8 Asymptotic Normality of Local Polytiomial Estimators Local pth order polynomial regression can be used to estimate the etivatives of g(z) up to order p. We use .N; to denote the mumber of distinct [th order derivatives of g(x). For example, No = 1, Ni =a (q citferent frst order derivatives), and Na = qlq+1)/2 (q second own derivatives and q(g~ 1)/2 eross second order derivatives). The general formula is 7 ‘There are Mj distinct Ith order derivatives of g(x), and we use ‘Vig(2) to denote the Mj x 1 vector of these Ith order derivatives, using @ lexicographic order. 
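For reference, $N_l$ is the number of ways of choosing $l$ partial derivatives from the $q$ coordinates with repetition allowed, i.e.,
$$N_l = \binom{q + l - 1}{l} = \frac{(q + l - 1)!}{l!\,(q - 1)!},$$
which reproduces $N_0 = 1$, $N_1 = q$ and $N_2 = q(q+1)/2$.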
For example, when 1 = 2, we have the No = (q+ 1)/2 distinct second order derivatives, and V@ g(r) is de- fined by vy! Ta eae ‘Masry (19960) uses the following condition to establish the asymp- totic normality result for the local polynomial estimator. (9%) = % Condition 2.3. (@ 9(2) has continuous derivatives of total onder (p+ 1)- (ii) k(v) i @ bounded second order kernel function having compact ‘support. (iis) For expositional simplicity, letting hy = +++ = hg = h, assume that h = O (nr-Mie#2), ‘Theorem 2.10. Under Condition 2.5, we have for 0 <1 p (nar) [9 — We) —Aegntore“™] 4, ma), (°55") ‘at continuous points of 0°(2) and f(x) with f(z) > 0, where the defi- nitions of mprite), Ar and V; are given in Section 2.7 Proof. See Theorem 5 of Masry (1996a). o IE we compare Theorem 2.10 with Theorems 2.8 and 2.9, we observe that Theorem 2.10 does not yield differing bias expressions depending ‘on whether p~1 is odd or even. It can be shown that when q= 1 (the single regressor case), the bias expression of Theorem 2.10 matches that cof Theorems 2.8 and 2.9 for the ease when p — 1 is odd. When p—1 is, even, the leading bias term given in Theorem 2.10 isin fact zero. Thus, the leading nonzero bias term will be of the next order with O(hP-!#?) ‘when p~ Lis even. We verify this for the case of p= 1 (local Tinear) in Section 2.7, where we show that Aipya(z) = Aia(z) = 0, and so the nonvanishing leading bins term will be of order O(h?*2-!) = 0 (h?) rather than O(h?+"~) = O(h) (p= I= 1). This matches the result of ‘Theorem 2.7, hence the results given in Theorem 2.10 are consistent with those given in Theorems 2.8 and 2.9. “Assuming that p~ [is odd, we have se (2) — 0 (120-4 + oase2h), x (2) = 0p (H+ (se %)-12) ‘Masry (19968) established the uniform almost sure convergence rate of the local polynomial estimator, and we present his results in the next theorem, 25, LOCAL POLYNOMIAL REGRESSION (GENERAL PTH ORDER) 91 Theorem 2.11. For each SU

2, the bias term is similar to that arising from the use of a higher order kernel function for the local constant estimator. For $l = 1$, Theorem 2.11 provides the uniform almost sure rate for first order derivative estimators,
$$\sup_{x \in S} \left| \hat g_s(x) - g_s(x) \right| = O\left( \left( \frac{\ln n}{n h^{q+2}} \right)^{1/2} + h^{p} \right), \quad s = 1, \ldots, q, \quad \text{almost surely.}$$
Similarly, one can obtain the uniform almost sure rate for higher order derivative estimators from Theorem 2.11.

Throughout this chapter we assume that the data $\{X_i, Y_i\}_{i=1}^n$ are observed without error. In practice, however, data may be contaminated or measured with error. There is a rich literature on how to handle measurement error in nonparametric settings. Though it is beyond the scope of this book to cover this vast literature, we direct the interested reader to Fan and Truong (1993), Carroll and Hall (2004), and Carroll, Maca and Ruppert (1999) and the references therein.

2.6 Applications

2.6.1 Prestige Data

We consider the following example using data from Fox's (2002) car library in R (R Development Core Team (2006)). The dataset consists of 102 observations, each corresponding to a particular occupation. The dependent variable is the prestige of Canadian occupations, measured by the Pineo-Porter prestige score for occupation taken from a social survey conducted in the mid-1960s. The explanatory variable is average income for each occupation measured in 1971 dollars. Figure 2.1 plots the data and five local linear regression estimates, each differing in their window widths, the window widths being undersmoothed, oversmoothed, Ruppert, Sheather and Wand's (1995) direct plug-in, Hurvich et al.'s (1998) corrected AIC (AIC_c), and cross-validation (CV) (Li and Racine (2004a)). A second order Gaussian kernel was used throughout.

[Figure 2.1: Local linear kernel estimates with varying window widths; panels plot prestige against income (K) for the various bandwidth choices (note that the AIC_c and CV bandwidths are almost identical, hence the two curves appear as one in the lower-left figure).]

The oversmoothed local linear estimate in Figure 2.1 is globally linear, and in fact is exactly the simple linear least squares regression of Y on X as expected, while the AIC_c and CV criteria appear to provide the most reasonable fit to this data.

2.6.2 Adolescent Growth

In Section 1.13.3 we noted that abnormal adolescent growth can provide an early warning that a child has a medical problem. We used data from the population of healthy U.S. children obtained from the Centers for Disease Control and Prevention's (CDC) National Health and Nutrition Examination Survey to model the joint distribution of height and weight by sex.

We now consider the regression function underlying the relationship between height and age for males and females, using the mixed data local linear estimator to be introduced shortly in Chapter 4 to model "stature for age means." The local linear estimator was used with bandwidths selected by Hurvich et al.'s (1998) AIC_c criterion using a second order Gaussian kernel. The bandwidth for age is 7.63, while that for sex is 0. Figure 2.2 presents mean height by age and sex. It is interesting to note that mean height is virtually indistinguishable by sex until roughly 10-12 years of age, and then diverges significantly.
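A fit of this kind can be produced with the np package along the following lines. This is a sketch of ours rather than the code used for Figure 2.2: it assumes the np package is installed and that a data frame cdc (a placeholder name) holds the survey variables height, age and sex, with sex stored as a factor so that npregbw() applies a discrete kernel to it.

library(np)
# cdc: placeholder data frame with columns height (cm), age (years), sex (factor)
bw  <- npregbw(height ~ age + sex, data = cdc,
               regtype = "ll", bwmethod = "cv.aic")
summary(bw)               # reports one bandwidth for age and one for sex
fit <- npreg(bws = bw)

# Predicted stature-for-age curves by sex
age.grid <- seq(2, 20, by = 0.5)
newdat   <- expand.grid(age = age.grid, sex = levels(cdc$sex))
newdat$height.hat <- predict(fit, newdata = newdat)
# A bandwidth of (essentially) zero on sex means the fit is computed
# separately for males and females, as in the frequency approach of Chapter 3.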
This behavior would be particularly difficult to model in a parametzie setting without re- sorting to sample splitting, while the nonparametric approach readily reveals this structure, ¢ 2.6.3 Inflation Forecasting and Money Growth ‘The conventional wisdom prevalent in monetary circles is that money growth should have predictive power for forecasting inflation. How- o 2. REGRESSION Height (em) Age (years) Figure 2.2: Stature-for-age means. ver, empirical results overwhelmingly suggest just the opposite, ie. that money growth has no predictive power for forecasting inflation ‘whatsoever. Moreover, this finding has been robust to changes in the sample period and econometric methodology (see Leeper and Roush (2008) and Stock and Watson (1999)). This inconsistency between the- ‘ory and empires isa wellknown puzzle in macroeconomics. Much of the empirical work in this area has focused on forecasts obtained from linear vector autoregressive (VAR) models ofthe form Xe at BLK tery where X; = (11, %)/, me is the inflation rate at time f, and Z; is the growth rate of a monetary aggregate. Bachmeier, Leolahanon and Li (forthcoming) revisit this issue from ‘a nonparametric perspective using monthly data for the period January 1959 through May 2002. The monetary aggregates used are M1, M2, land M3, and the corresponding M1, M2, and M3 Divisia monetary services index, while inflation is measured with the consumer price index. Using data from Bachmeier et al. (forthcoming), we first consider parametric forecasts of inflation obtained from a bivariate VAR model of inflation and money growth, which are made using a recursive es- ‘timation procedure with an increasing window of data; each forecast 2.6. APPLICATIONS 5 ‘Table 2:1: Relative parametric MSPE (including/excluding each mon- tary aggregate with 1 lag) Torin MTN WMD ND WD 1 Moni 1.05102 1.05 1.00 —1.00 ~ 1.00 6 Months 0.90 100 111 100 104 1.06 Months 0.98 109 1.00 1.02 104 Table 2.2: Relative parametric MSPE (Including/excluding each mon- ‘etary aggregate with 2 lags) “Horizon MMM MD M2D_MaD_ “TMonth—LOf 093 Lot 1001.00 1.00 6 Months 0.91 1.00 110 0.99 1.04 107 12 Months 0.92 0.94 1.06 1.00 1.01 1.05 was made based on a model estimated using all data available through ‘the date that forecast would have been made. Forecasts were made for each model and forecast horizon (January 1994 through April 2002) ‘The mean squared prediction error (MSPE) was calculated for each ‘model for forecast horizons of ¢ = 1,6, and 12 months. SIC-optimal lg length selection was employed resulting in two lags of inflation and ‘money growth, so that the forecasting equation of interest is MES AQ OTe HORT g-1 bagAmeg + OAM 1 HEE, where m, can be any one of the six monetary aggregnte measures (Arm = me—me-1). For the sake of robustness, we report results for 1 and 2 lags ofthe monetary aggregste, and Tables 2.1 and 2.2 report the ratio of MSPE of the VAR model with each monetary aggregate {nchuded to that with each monetary aggregate excluded forthe given horizon for 1 and 2 lags of the monetary aggregate respectively; values Jess than 1.00 imply that including the monetary aggrogate improves the forecast for n given forecast horizon ‘Tables 2.1 and 2.2 reveal no apparent systematic gain from inelud- Ing money as a predictor. With 1 lag of the monetary aggregate in- cluded, 9/18 forecasts worsen, 3/18 improve, and 6/18 are unchanged; with 2 lags, 8/18 worsen, 5/18 improve, and 5/18 are unchanged. 
Over- all, the case for using money growth as an inflation predictor can be 96 2. REGRESSION ‘Table 2.3: Relative MSPE of the nonparametric AR(2)/parametric ARQ) Horton 7 Month 098 6 Months 1.01 12 Months 0.88 seen to be rather weak in @ parametric setting as inclusion of the mon etary aggregate leads to less eflicent forecasts on balance. ‘Next we compare a parametric AR(2) mode's forecasts with that for a local constant nonparametric AR(2) model of the form me = lt Ret) +E “where g(-) is unknown, We conduct leave-one-out cross-validation on the estimation sample, and then use cross-validated bandwidths to generate our out-of-sample forecasts. Table 2.3 presents the relative MSPE of the nonparametric AR(2) model with that for the parametric AR(2) model ‘Table 2.3 reveals that the cross-validated local constant AR(2) ‘model is generating predictions that are comparable to that from the parametsic model for the 1 and 6 month horizon and are substantially improved for the 12 month horizon. This suggests misspecification of ‘the parametric model since the nonparametric model is less efficient than a correctly specified parametric model. One then wonders whether the conclusion about the forecast inelficacy of monetary aggregates ‘may in fact be an artifact of parametric misspecification. We therefore consider a local constant nonparametric model of the form, m= alten es py Armee, Amina) +e (273) where 9(-) is again unknown and we again use leave-one-out cross- ‘validation for our bandwidth celector. "Tables 2.4 and 2.5 present the relative MSPE of the nonparamet- ric model which utilizes the monetary aggregate to the nonparametric Note that Bachmoser et al orthcomng) use a more complicated two-step erss- slidation procedure. Therefore, the results reported here differ slighty from those ‘ported in Bachmcier etal. (forthcoming) due tothe diffrent bandwidth selection procedures being used, 27. PROOFS so ‘Table 2.4: Relative nonparametric MSPE (ineluding/excluding mone- tary aggregate with 1 lag) “Horizon MiNi MID_W2D— NBD Month 107 1.00085 0.92 0.80 0.89 Months 0.96 0.95 1.15 1.00 1.00 1.00, 12 Months 0.96 0.84 0.98 1.00 1.00 1.00 ‘Table 2.5; Relative nonparametric MSPE (including/excluding mone- tary aggregate with 2 lags) “Horzon_MT_M2_N3_ MiDW2D_WsD_ TMosth 107 0.98 1.06 1.00 1.00 0.88 6 Months 1.00 0.96 1.07 1.00 1.00 _12 Months 0.99 0.83 0.98 1.00 _0.97 model which excludes the monetary aggregate, These tables reveal that with 1 lag of the monetary aggrogste included, 2/18 foreeasts worsen (in the parametric setting we had 9/18), 9/18 improve (in the paramet- ric setting we had 3/18), and 7/18 are unchanged (in the parametzic setting we had 6/18); with 2 lags, 3/18 worsen, 8/18 izaprove, and 7/18 are unchanged, In a nonparametric setting, the case for using money srowth as an indicator variable for inflaton isin fact quite favorable as inclusion of the monetary aggregate leads to more accurate forecasts on balance ‘Thus, the robust nonparametric method suggest that money growth affects inflation nonlinearly. The early puzzle appears therefore to be an artifact arising from the use of « misspecified linear relationship. 2.7 Proofs In this section we provide some omitted proofs for easlier material in this chapter. 
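Before turning to the proofs, the recursive out-of-sample comparison of Section 2.6.3 can be sketched as follows. This is a stylized illustration of ours rather than the procedure of Bachmeier et al.: infl is a placeholder name for the inflation series, the bandwidth h is simply passed in by the user (the study re-selects cross-validated bandwidths within each estimation sample), and loc.const() is the local constant helper sketched in Section 2.4.

# Recursive one-step-ahead forecasts: parametric AR(2) versus local constant
# AR(2) for an inflation series infl (placeholder name); forecasts begin at
# observation n0 + 1 and the relative MSPE is returned, as in Table 2.3.
# h: bandwidth vector for the two lags (a common scalar also works).
mspe.ratio <- function(infl, n0, h) {
  nobs  <- length(infl)
  e.par <- e.np <- numeric(0)
  for (t in (n0 + 2):(nobs - 1)) {
    y <- infl[3:t]                                  # estimation sample up to t
    X <- cbind(infl[2:(t - 1)], infl[1:(t - 2)])    # first and second lags
    b <- coef(lm(y ~ X))                            # parametric AR(2)
    f.par <- b[1] + b[2] * infl[t] + b[3] * infl[t - 1]
    f.np  <- loc.const(X, y, x = c(infl[t], infl[t - 1]), h = h)
    e.par <- c(e.par, infl[t + 1] - f.par)
    e.np  <- c(e.np,  infl[t + 1] - f.np)
  }
  mean(e.np^2) / mean(e.par^2)   # values below one favour the nonparametric model
}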
eee 2.7.1 Derivation of (2.24) Since we only consider the caso in which all regressors are relevant, We assume therefore that (Fay tg) © Hen = {(ety--- fg) neooo sta) € [rh ta} where my is @ positive sequence that converges to zero at a rate slower than the inverse of any polynomial in n, and ty is a sequence of con- stants that diverges to infinity. Equation (2.74) basically requires that rahy hg —» 90 and that Ay > 0 ($= 1)-..)9). Tet $ denote the support of w. We also assume that, (274) and nha ..-hg 9(3), £0) and o? have two continuous derivatives; w is con- tinaons, nonnegative and has compact support; f(-) is bounded aviay from zero for x € S. (2.75) Using (2.28) with 9 = 9(%2), g = 9-«(Xi), and Mi = M(%), we have ESD i a Mi +20 uae = Ms +o dM om CVillin- +a) (2.76) ‘The third term on the right-hand side of (2.76) does not depend on (fis.--yfg) Tecan be shown that the second term has an order suualler than the first term (see Exercise 2.3). So, asymptotically, minimizing CVie(h) is equivalent to minimizing 7 [ai — di?Mi, the first term on the right-hand side of (2.76). ‘We next analyze the leading term of Vie. Note that (i — ofl fe Haile iy fit (60), where thy = (Gi — 99) i- Define fgg = (nL Ka — 9h ia 2. PROOFS 2 and =I Suki, tA where Kina = []far hy H((2i 2) fhe). Then =r +g Ps, bins — fig ic = (miss + a) fe" is the leading component of d — . Therefore, the leading term of CVie(A) is CVigo(h) = 2 [ras + ‘af; 2M. Te can be farthor shown that CVizo(h) = E{CVico(H] + (6.0.). Hence, B[CVico(h)] is the leading term of CVi.(h). Nov, E[CVicoas-+- he] = Bf [lens + tea)? $4] Ma (a gPM) +B RPM], (2.77) where we have used Elinyotaif; Mi] = 0 (since E(usl{X3}7-1) = 0). Then E fair 71%] = E [GG — 9) Knsslor — 9) Kea fo *1Xi) FE (G5 — 9) KR fe) = {Bll ~ Ka gled SY? +0((oh. Aor) 1 NK) wav) = (pag [loose te) —aeonaeys(xe+ hn) eo} +O(mm) fz {Sacre} +0(98 + mm), 78) where By(2) the sane as that defined in (28), m = (ng, )“% and = Using (2.78) we have B [mise 7M] = J {S5ecove} rene de +0 (03 +m) (2.79) 100 2. REGRESSI Next, E fp fy 7M] = B fp? MaE [|X] } =n TE (2M [KR IX) = (nhy hg) {seta [ 106 + hoot + HHO) as} = (hy. fhg)* x BA PM [s4/(Xi)o"(Xd) + OCm)]} = tg, [Pemmete+ (mm (280) Substituting (2.79) and (2.80) into (2.77) we obtain ElCVio(his shall =f {= aucnsh SoM (a) ax + hohe [remeyar so(i+m) = nl) (ay, 5) + 0 (ne) (2.81) where the a,s are defined via hy = agn—¥/(@+4) (3 = 1,...,q) and xem) =f {Erin} S@)M(@) de wf arg + [eeme)ae (282) 2.7.2. Proof of Theorem 2.7 ‘We sketch an outline of the proof of Theorem 2.7, while leaving certain details as exercises. ‘One problem atsing from working with (2.50) is that ED mice) (y,4 ,) (K-21) is an (asymptotically) singular matrix and thus is not invertible. We ‘could rewrite (2.50) in an equivalent form by inserting an identity 27, PROOFS 101 matrix Jger = G5'Gq in the middle of (2.50), and where Gy = a, i? is nal matrix wit sth di (1 82). 0! vee dg asi wth he th gs nal element given by h;?, giving us (applying B-'C = BGG, = (GnB)"'G,0) average (yt 2) ai »| x DKiX2)Gn Ge 2 we - [= #002) (po 2(— «)) = »)| TRH (peach, ay) (2.88) ‘The advantage of using (2.88) in the proof is that FXi2) (pact, 2) (Xi -2Y) converges in probability to ® nonsingular matrix. ‘Therefore, we can analyze the denominator and numerator of (2.88) separately and thus greatly simplify the analysis. 
Using @ Taylor expansion given by (Xe) = g(a) + (Xi ~ 2)'g (7) + (Xe — YG (a)(Xs — 2/2 + Rel, Xi) = (1s (Ki = 2)/)8lz) + (Ki — 2) (Xs ~ 2)/2+ Real, Xe)o oz 2, REGRESSION whore Ry (2, Xi the remainder term, we have a L 1 : [: Dive (op - 2) Oa(Xi= 2) | x {25K (0,22) 0+ «if 1 1, (%-2) eee [: kv (opin, nyo ae - | 1 1 ft {E35 (2,2) x [oxi —2)0e)Xs 9/2404 Roles X0]} = h(a) + [AM] [AP + AB} + (5.0), (284) 4 where Ae ET Kis (Deady #35 Di (54-0) 2D Kae (os a)® oe ‘and the smaller order (s.o.) term comes from (Xj 2)! ( ppt ay)* 289) 1 )g(a)(Xi—a), (2.86) (AMPLY. KiniaRen(t, 8) which has an order smaller than [4!]-1A, Exercise 2.7 shows that A™ = Q-+ op(1), where =( 4) o 2 (765, eater) Using the partitioned inverse, we got at A/ fle), 0 = (hS Fie, titer)” 27. PROOFS 105 Next, rewrite (2.84) as D(n) (6) -8(@)) = D(n) [Al] [4* + AB] + (8.0), (2.88) ‘and define diagonal matrices adi = UW F(2), 0 reaenta)= (Es yoaren) va (“eee and er nnottesters) ‘The proof of Thoorem 2.1 follows if we can show the following: ()_D(n\[Aba}-'[42* + A¥4) = D(n\Q- A + AP] + op), (i) D(njQ>*|A** + A™] = RD(n)[A?* + A] + o9(2), Gai) Dinars = (co a) Ma/2)402) Sa soley) + op(1), (iv) D{n)A%*N(0, V) in distribution, ‘The proofs of (i)-(iv) are given in four lemmas below. Statements (2)-(iv) imply that D(a) [ie ~a{z) (ome Fert aa) = RD(n)[A%* 4 424] — (om ug)4??(a/2) Dhan a) 0 + o,(1) = RD(n) A + og(1)+RN(0,V) + op(1) —+ (0,2) in distribution, where © = RV Ris the same ¥ as that given in Theorem 27. Lemma 2.1, D(n)[A'|-"[4%* +A%4] = D(n)Q-'[A**+ 484] +09(1). Proof. Wiiting 2 Dea ay WaP= + AB] = DnyQryars + AB] + D(n)|(A» "(A 4+ 42], it is sufficient to show that Din) io =@] [2% +457] = 00). 2.89) 108 2_ REGRESSION From Al = Q + op(1) (see Exercise 2.7) we know that (aly = Q7 + offi), (2.90) Let # (als) @7 ) Equation (2.90) implies that Jjy = op(t) for &,j = 1,2. To prove (2.89), we need a sharp (i.e., fast) rate for Jig. Below we show that. Jy2 = Op(m). Define G = (ALF ~ Agf(At?)-'A}S) —. Using the partitioned in- verse and the results of Bxercise 2.7, we obtain eye (es c a (e) '), a aa Et ( (U/ f@)g + Op(or), Opt) ) S N= FO) Fo? + (0s Tal afl] + op(0))" (291) because Alf’ = Op(na) by Exercise 2.7 (i). Bquation (2.91) leads to stir -a*=(39h 94). am which proves that Ji2 = Op()- Using (2.92) we immediately have yoy (Ca) 99] [#4 A = bd” ( py) O(n), Optra) 149°) , (435) «(Ce 2) CE) + )] = o(t) because (hs ---fg)"?Op(m) AB = Op (ta g)*n8) = 096) and (aba. 2Op tm) AY* = (a-ha)! 0p (ao ih)™™2) = 0, (nf) by Exercises 2.8 and 2.9. "This proves (2.88). a 27. PROOFS 105 Lemma 2.2. Din)Q~ [4% + 4°] = RD(n) [42 + 44] + o4 (1). Proof. Noting that R i have diagonal matrix with R = ding(Q>), we D(njQu [A + A] = RD(n) [A™ + A™] + op(1) because the terms associated with the off-diagonal clement of Q~* are all op(1) (ly. )!/*g AP = afr TgO pa) = 09(1) and (nuhy ..hg)"/*heAt* = Op( hs) = op(1). a Lemma oe Din)Ae= = (er Ione oe eacuintasi)) + o9(1). Proof. By Exercise 2.9, we have (00h eg) AY* = a Tae C0) Soule) ott, and (7 hg)" Dy ABE = af FsOp (18!) = op) Hence nae (_ (th. hy MAAR Donate= (cir ghana) = (om he) Plra/2) ste) Thasetei) + opt) a Lemma 2.4, D(n)A**+N(0,V) in distribution, where ve (ae Pa snafteoa)) 106 2 Prof, By Exercise 29, ar (nhs. A34) = wtf (ao%a) +00), var (nh. chy!®D,A}*) = wesley + of) and cov ( (rt hg) 2AP*, (rb tg)" DyAB*) = 0(0). Hence, var (D(n)A%) = V +o(1). Also note that A has mean zero. ‘Thus D(n)A%* 4, N(0,V) by the Liapunoy CLT. 
a 2.7.8 Definitions of Aipj1 and Vj Used in Theorem 2.10 ‘We use the notation introduced in (2.67) to (2.69) to define w= feKede, y= fem wa. (2.93) ‘Moo Mos --- May Teo Toa s+ Tag Mio Mia ip Tie Tao Tay M : Myo Mpa os Mya Foo Ton os To “2.4 where Mys and Pig are N, x Ny dimensional matrices. For example, Mia = say X Heys Where ja) And ja) axe of dimension My x 1 and {Nie 1 and are defined from yi withthe lexicographic order ce dv’ fi ee wak(o)dv ntak(e) de Hay = + Hay = ; (2.95) pK (0) dy LK (0) de Similarly, Tj can be obtained from M,; with K(-) replaced by For two m x 1 vectors Ci and Cz we use C; @ C2 to denote an sm x1 vector with its jth element being Cy C25 (= 1,---ym), he ® dlnotes clement by element product. ‘The Vj matrix introduced in Theorem 2.10 is defined via the follow- ing NN matrix V (N = 5? Nj): v=MorM 27, PROO! sot ‘Then Vj is an N; x N; dimensional matrix, consisting of rows and columns from S!} Nj +1 to Tho Nj. For example, Vp is the first diagonal element of V. V; consists of rows and columns from 2 to Ni + 1 =q41of V. Ve consists of rows and columns from No+Ni+1 = 942 to No +N +No=1+9+G(q+1)/2 of V, ete ‘We can easily check that when p = 1 (the local linear case), we get u=(g on) P= (F than) 0m) Vem"rM 7 @ onal) : where 2 = [vk(v) dv, «= [2(v) do, and ze = fu%#2(v) dv. Thus, Vo = wt and Vi = (n2gg/n3)fy. The variance [o°(2)/f(x)]Vi exactly ‘matches the result of Theorem 27 for 1=0 and I = 1 Define an Nj x1 veetor fi.) whose elements vary with different forms cof Afri) = 1/(ry1% ral x =" X re), F = ly im the lexicographic order. For example 2; is an Np x 1 veetor given by wy tect ta aoe deal eon 1 where tm 8 vetor of ones of dimension m ‘Then define Moos a Mages AY a | 2p) 2 Bion (298) An (7) @ Myre, "There are Npy1 distinct derivatives like (1/;!)(D¥g)(2) of total order +1: we put them in an Npi1 x1 vector using the lexicographic os 2. REGRESSION ‘order, and we denote this vector by mjy,)(2). For example, when p = 1 (the local linear case), we have mea (2) = (2.99) ‘Then Aap introduced in Theorem 2.10 is defined via (€=0,....p) the following equation: Aopen Aig BY ML Ampan(2 (2.100) Ap where M and A are defined in (2.94) and (2.98), respectively, and ‘mp (2) is defined immediately following (2.98) ‘When p= 1 (the local linear ease), using (2.96), (2.99), and (2.98) wwe have B= MAme(x) = Cassie) i (2). (2.101) Using the above Ap and Ap in Theorem 2.10 leads to results identical to those in Theorem 2.7, as of course they should. 2.8 Exercises Exercise 2.1, Prove (2.10) Hint: Note that Eljiu(z) — fle) Df, Bola)h2]? = (Bhina(z) — S(2) Dy Bala) KET}? + var(vin (2), then apply (2.8) and (2.9) Exercise 2.2. (8) Consider (2.82) fur y = 2: x(anan) = If eBi(2) + oBBa(a)]? + 2 axon f(z) } Sa) M(a) de Derive the explicit expressions of of and of that minimize x(a1sa2) 28, EXERCISES 109 (Gi) For the general multivariate case, yw can be written as (sce (2.80) (21) 20)-005%)) = Ae + alan 205000549) me where A is a 4 xq postive semidefinite matrix and Cp > 0 is a constant, Show that when A is positive definite, then inf yp > 0. This will Imply that of,...,o9 ar all postive and finive Hint for (i: This follows the same analysis as those used in Ex cercise 1.10, Exercise 2.3. Show that 11 Slo) 2K) =o, (1m) Hint: Write m2 wig —4. 1 wi(96— 9-3) fail fee (s.0.). Then show that the second moment of this leading term is O(n 08 +m): Exercise 2.4. Suppose one does not use the leave-one-estimator in the cross-validation procedure. 
(i) Show that, inthis ease, CV(h) approaches its lower bound (zer0) ash 0, Le, limos CV(R) = 0. That is, A> 0 minimizes vin). (ii) What is the resulting estimator if his arbitrarily close to zero, roughly say, o(Xih = 0)? We know that A+ 0 too fast violates the condition my... fg» o9 as n— 00, ‘The variance of such an estimator yoes to infinity Exercise 2.5. This problem explains R(@,h+1y.--yfg) defined in (a(x) ~ AIF eV Fla) = rr(x)/F(2), (2.89). Write 9(a) - g(a) where sia) = a(x) + thal), v(2) = n*S8 (g, ~ 8)) Knot and ta(e) = m7 DE wKnas. Write Kyat = Racine Raat (Mig X)/hs)s and Kise = [Tg a B((Xis — Xa)/he), where the bar and tide refer to relevant and ielevant variables, 0. spectively Define ma = EE AG and may = (Ma. igs)? mo, REGRESSION (@) Show that Elf(o)] = FEVEKp ao] + Olram)s and that vax(fle)) = Om) = o(0)- Therefore, Fle) = FayBLR ral + of) 2.102) Note that (2.102) izapies the following: ce) — (2) = ME) — AO) __ 5 (6.0,) = An(a) + (80. He) — 900) = FE = Fapptiega] 1 0 Ante) + 2) where Aq(#) = ri(2}/f(@)E(Kn2)] isthe lending term of (2) ~ a2) (8) Show that Effin (2)] = DO AIB. Ce) FER ns) + (mq E(Ks,ei)), and that var(tin(2)) = Om mn [BRrai)!?) = oma (Era) ‘Therefore, Elite] {Sorensoyronet.th +( (ba +m) [e.n)]"). e108) (i) Show that Eba(e)*| = mn 2*(2) 2B [KF] + oC) 2108) (iv) Show that (2.102) to (2.108) imply at BlAnC@)?] = [wm Secon aE sss) + (00 (2.108) where R() is defined by (289). 28, EXERCISES m (@) Using (2.105) and ove / Elp(2) — 9(@))*f(@)M(a) dr ~ E[AR(2)), we have ov~f {[E=«n] : [an Ia] Re haere ovhd)} Heed (2.106) Show that hy ~+ 90 (8 = 91+ 15g) minimizes (2.106). ‘Tis will result in R() = 1. Using R'= 1 show that (2.106) leads to (2.88). Note that f(a) ~ F(2)F(@) by (2.32). ° ~ Give an intuitive proof of Theorem 24 based on the above rests Exercise 2.6, (i) Show that if one uses ite = oo for all » = 1,...,9, then the local linear estimator (2) is identical to the least squares estimator 0 +2G1, where dg and dy are the least squares estimators of ao and a based on ¥j = ap + Xai + error. (Note that we do not necessarily assume that the true regression function g() i linear in for this problem.) (Gi) Show when g(2) = a9 +.2/ax is linear in z, then the local linear estimator is unbiased (note that you should not assume hh, = 00 in this problem). Exercise 2.7, Let AL, (= 1,2) be defined as in Section 2.7. Prove the following results () AY = £(2) + op(1). (i) AUP = Opin) = of). (Gli) ADF = 62 (2) + o9(2). iv) ARF = raf (a)Tg + op(1). 2 2. REGRESSION Note that (i)-(iv) imply that = 2), 0 2 0, and among the n independent random draws X{...., Xi, there are on average np(2*) = O(n) observations that assume the value +. Therefore, our estimator f(x) is expected to converge to p(2) at the parametsic rate of Op ((np(e!))-¥/2) = O, (n-¥2). ‘The nite support assumption implies that we have only finitely many parameters (0(24), 24 © Sto be estimated. This is essentially a parametric model (since the number of parameters does not increase as sample size in- creases); therefore, we naturally obtain the J-ate, ie. the standard parametric rate of convergence. There is also a large literature in statis tics that deals with the case in which the number of discrete cells grows 1 the sample size increases, with the number of observation in each cel ‘being finite even as n —» oo. ‘This is the so-called sparse asymptotics framework, and we refer the interested reader to Simonoff (1996). We 4o not pursue this case further in this book. 
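As a quick illustration (ours, not from the text), the frequency estimator of p(x^d) described above is simply the sample proportion of observations falling in each discrete cell:

# Frequency (cell proportion) estimator of a discrete probability function
set.seed(3)
n  <- 1000
xd <- data.frame(x1 = sample(0:1, n, replace = TRUE),   # binary variable
                 x2 = sample(0:2, n, replace = TRUE))   # three categories
cells <- interaction(xd)            # one factor level per discrete cell
p.hat <- table(cells) / n           # p.hat(x^d) for every cell x^d
p.hat["1.2"]                        # estimate of P(X1 = 1, X2 = 2); true value 1/6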
118 3._ FREQUENCY ESTIMATION WITH MIXED DATA 3.2 Regression with Discrete Regressors Consider a nonparametric regression model having only disereteregres- sors given by Yea aX!) tu (35) where Xf € Su isi.id, with zero mean and E(uZ|Xf = 2) = 0%(e4) We allow the error process to be conditionally heteroskedastic and of unknown form. For any 24 € S4, we estimate g(e4) by Yaunt = 24)/ple"), G6) where (2) is defined in (3.2), ‘One can easily show that ile!) - oe") rn), en and also that vii (gla) ~ al2*)) 4 3 (0,0%2*/0@) 8) by observing that 3-7, [9(X#) - 9(z*)] 1(X# = 24) = 0. The proofs of (3.7) and (3.8) are fairly straightforward and are left: to the reader as an exercise (see Exercise 3.1) 3.3 Estimation with Mixed Data: The Frequency Approach We now turn our attention to the mixed discrete and continuons vari ables case. We use XF to denote a continuous variable, and therefore awrite X; = (Xf, X#) € Rx S#, Wo define f(x) = f(2*,24) to be the joint PDF of (X,Xi). 3.8.1 Density Estimation with Mixed Data Suppose that we have independent observations Xi, Xa... Xn drawn from f()). A nonparametric kernel estimator of f(z) is given by f= Lat 2008 = (3.9) 33._PREQUENCY ESTIMATION WITH MIXED DATA, 119 where Wi(X@,2°) = [fy hy2w((XS,~29)/by), with w(:) being a stan- dard univariate second order kernel function satisfying (1.10), We state the rate of convergence and the asymptotic distribution of f(x) in the next theorem ‘Theorem 3.1. For all 2! € S#, assume that (2°24) satisfies the same conditions as f(2*) given in Theorem 1.3. Also assume that, as + 00, hy + 0 (8 = Lynsey), (Bt «+ Ihg)(Shaa he) + 0, and fh shy» 00. Then (@) Fle) ~ $a) = Op (Thy B+ (hy. Ag)-?). (i) (mh. (F(@) — Fl) ~ (2/2) Chg Fala) h2) converges in distribution to N(0,x%f(x)), where fya(x) = 8f(x*, 24) /(a6)* 48 the second order derivative of f+) with respect to, and where = [u{v) dv and «= f w2(o) dv ‘The proof of Theorem 3.1 is given in Section 8.5. From Theorem 3:1 we se that the rate of convergence of f(2) for the mised variable case isthe same as the case involving only the subset of q purely con timuous variables. This follows simply because estimation with purely discrete variables has an Op (n/2) rate of convergence, which is faster than the rate of convergence for the q purely continnons variables cas. ence, the rate of convergence of f(x) for the mixed variables case is determined by the rate of convergence arising from the preserce of the subset of ¢ continuous variables, the slower ofthe two rates 3.3.2. Regression with Mixed Data ‘The above frequency method can also be used to estimate regres- sion functions with mixed discrete and continuous rogressors. For = (25,24) RY x S4, we estimate g(z) = E(Y|X = x) by Dumesenead=x/7@), 10) ‘where j(z) is defined in (3.9). The following theorem gives the asymp- totic distribution of (2). ‘Theorem 8.2. For any fized x" ¢ 4, assuming that g(a", 2) satisfies the same conditions as g(2®) given in Theorem 2.2, and also assum- ing that, asm —+ 00, hy + 0 (8 = Tyssesg), Mhi...Ry —+ 00 and 120 3,_PREQUENCY ESTIMATION WITH MIXED DATA anf os hg)(Sa Hf) =O, then we have Vik Tg (ae ~ af) — Eacont) 4.N (0,n107(2)/F(@)) » where By(2) = (12/2){2fo(a)ga() + (2)9es(s)}/F(2) isthe same as that defined in Theorem 2.2 except that now z = (2,24), and ga(2) (ula) and gualz) ov the frst and second order derivatives of 9 (5) {hith respect to-26, where again k2 = [ v*o(v) du and x = J w%(v) do. 
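A minimal R sketch of the frequency-based regression estimators in (3.6) and (3.10) is given below (our own illustration, with a single continuous regressor for simplicity): the discrete-only estimator is just the within-cell sample mean of Y, and the mixed-data estimator applies a continuous kernel within each discrete cell.

# (3.6): frequency estimator with discrete regressors is the within-cell mean of Y
freq.reg <- function(xd, y, x0) {
  idx <- apply(as.matrix(xd), 1, function(r) all(r == x0))
  mean(y[idx])
}

# (3.10): mixed data -- continuous kernel on x^c, cell indicator on x^d
freq.reg.mixed <- function(xc, xd, y, xc0, xd0, h) {
  idx <- apply(as.matrix(xd), 1, function(r) all(r == xd0))
  w   <- dnorm((xc - xc0) / h) * idx
  sum(w * y) / sum(w)
}

# Example: one continuous and one binary regressor
set.seed(5)
n  <- 500
xc <- runif(n)
xd <- rbinom(n, 1, 0.5)
y  <- sin(2 * pi * xc) + xd + rnorm(n, sd = 0.2)
freq.reg(xd, y, x0 = 1)                                    # cell mean of Y for xd = 1
freq.reg.mixed(xc, xd, y, xc0 = 0.25, xd0 = 1, h = 0.05)   # approx 1 + 1 = 2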
The proof is similar to the proof of (2.14) (the case with purely continuous variables only), and is left as an exercise (see Exercise 3.2).

From Theorem 3.2 we once again see that the rate of convergence is solely determined by the rate associated with the continuous variables.

Following the analysis in Chapter 2, one can again choose the smoothing parameters (the $h_s$'s) by the cross-validation (CV) method, generalized CV method, or plug-in method (when a closed form expression is available). In order for the nonparametric frequency estimation method to yield reliable results in the mixed variables case, one needs a reasonable amount of data in each discrete cell. Should the number of discrete cells be large relative to the sample size, nonparametric frequency methods may, as expected, be quite unreliable.

3.4 Some Cautionary Remarks on Frequency Methods

In this chapter we have shown that, theoretically, there is no problem whatsoever with adding discrete variables (with finite support) to a nonparametric estimation framework, because the nonparametric frequency method (with only discrete variables) has an $O_p\left(n^{-1/2}\right)$ rate of convergence, which is faster than the rate of convergence associated with the subset of $q$ purely continuous variables. However, nonparametric frequency methods will clearly be useful only when one has a large sample size and when the discrete variables assume a limited number of values, i.e., in situations where the number of discrete cells is much smaller than the sample size. For many economic datasets, however, one often encounters cases for which the number of discrete cells is close to or even larger than the sample size. Obviously, the sample splitting frequency estimator cannot be used in such situations.

To further illustrate this point, let $M$ denote the number of cells associated with the discrete variables. In even simple applications, $M$ can be in the hundreds and even tens of thousands. For example, if one has data which includes a person's sex (0/1), industry of occupation (say, 1-20), and religious affiliation (say, 1-5), then the simple presence of three discrete variables gives rise to $2 \times 20 \times 5 = 200$ cells. The average number of observations (the effective sample size) in each cell is $ess = n/M$. Since $M$ is a fixed finite number (no matter how large), asymptotically we have $n/M = O(n)$. Therefore, the effective sample size is of order $n$. Even if $M = 1{,}000{,}000$, $n/M$ is still of order $O(n)$. For a univariate continuous variable, however, the effective sample size used in estimation is of order $nh = O\left(n^{4/5}\right)$ (if $h = O\left(n^{-1/5}\right)$). Asymptotically, $c\,n^{4/5}$ is of smaller order than $n/M$ for any positive constants $c$ and $M$. Therefore, asymptotically the effective sample size for discrete variables is larger than that for continuous variables. Consequently, nonparametric estimation with discrete variables has a faster rate of convergence than that with continuous variables.

In finite-sample applications, however, the story can be quite different. The $\sqrt{n}$ rate for the discrete variable case can provide a misleading impression regarding its usefulness in small sample applications. Let us consider some artificial data having a sample size $n = 500$ with five discrete variables, and further suppose that the five discrete variables assume 2, 2, 5, 5, and 5 different values, respectively. In this example, the number of discrete cells arising from the presence of these five discrete variables is $2 \times 2 \times 5 \times 5 \times 5 = 500$.
Therefore, the av- «rage number of observations (the effective sample size) in cach cell is 12/500 = 500/500 = 1, which is too small to yield any meaningful non- parametric estimates (were one to use the fRequeney method, that is) However, n = 500 is certainly large enough to ensure accurate nonpara- ‘metric estimation with one continuous variable. Ifq = 1 (r= 0, ie.,n0 discrete variables), then the effective sample size used in nonparametric estimation with a continuous variable is n‘/S = 5008/® ~ 144, which is much larger than the effective sample size of 1 in the above discrete variable example (with the number of cells equal to 500) In Chapter 4, we discuss an alternative approach where we also smooth the discrete variables in a particular manner frst suggested by Aitchison and Aitken (1976). We extend Aitchison and Aitken's ap- proach to cover nonparametric regression and conditional density esti- ‘mation with mixed discrete and continuous variables. We also present several empirical applications where we show that such smoothing ap- we 3, FREQUENCY ESTIMATION WITH MIXED DATA proaches can also provide superior out-of-sample predictions when com pared with some commonly used parametric methods. 3.5 Proofs 3.5.1 Proof of Theorem 3.1 roo For what follows, we use the following shorthand notation: F(a + hoya) = Fleh + hava. G+ hve) By the same argument used in the proof of Theorem 1.3, we have BUF) = EDWa XS, 2")1XF = 24) = trnhyy? Sf seetotin (AG*) ate = ehaer = (tg | Hetetw (SGX) ae = [re=hnetio de = Hae 24) + (na/2) So fesNN +O & ) : & a Next, by the same argument as in the proof of Theorem 1.4, we 86. EXERCISES 1 have safle) = var Win 2 af = 24) = (nh. A fe [w A #1 = ») “(e()or-4) = (ong) {z, [sei aw (SG) sete an5 ; : 3 J reat = ) acta} ohare fweyde<0(h.-¥9} (nha shag) {n f(z) + Olfey Ira) (cine f" wey ao [f voy a0]' =). ‘The above results show that bias(f(2)) = O(DE var(f(x)) = O ((nhy ...ig)~). Hence, MSE( f(x) (hy ooh)“, which implies Pheorern 3.1 i) Part (i follows by the same argument given above, and the applica- tion of Liapunov's CLT (see the proof of Theorem 1.3) proves Theorem 3.1 (ii), a 1 W2) and that (Leta BB)? + 3.6 Exercises Exercise 8.1, Prove (8.7) and (3.8). Hint: Write (a) — gx4) = (ale — ae") He") (2%) = mala") faa"). ey 3,_FREQUENCY ESTIMATION WITH MIXED DATA Now (¥ = (Xf) +14). sinfat) = 0d Slo XE) + us — fe AVLCXE = 24) Duaxt ==, Gu) pecause Dilg(X#) — g(e"\JUXe = 24) = 0. rom (3.11), i i easy to see that Bhe(ad)] = 0 and Biia(at)"] = O(n), which implies a(x") = Op(n-¥?). ay 22) = plat) + Op(n-¥/2). Togetner we have gle) ~ (24) = Opln-"?). Using (311), itis easy to see that B{lvi(a4))2) = nto? 24)p(at). Hence, n¥in(24) + N(0,02(24)p(e4)) by the Lindebers-Levy CLT. Hence, VatGle) — o(24)) = vine) (Bla) 4 (1/pla*)N (0.0% nle) (0,0%(a4)/r(e")) Exercise 8.2. Prove Theorem 32. Tint: Similar to the proof of (2.14), write az) — 9a) = (a) F2)/f(w) = m(z)/f(e). Write n(x) = siy(z) + ra(2), where inn() and tna(z) are defined the same ways as rhy(x) and tia(e) of Chapter 1. Show that sing (x) = (62/2) SUF feel) + 2Fo()90C@)N +02 + (nh 1"), (ry «+ hg)*/2rn() > N(O, 8107(a) f(z) in distribution. ‘Those results ane j(#) = jz) + op(1) lead to Theorem 32. and that Chapter 4 Kernel Estimation with Mixed Data In Chapter 3 we discussed the traditional frequency-based approach to ‘nonperametric estimation in the presence of discrete variables, and also ‘outlined the pros and cons associated with such approaches. 
We now ‘turn our attention to an alternative nonparametric approach which can be deployed when confronted with discrote variables where, instead of ‘using the conventional frequency method described in Chapter 3, we clect to smooth the discrete variables in a particular fashion, By smoothing the discrete variables when estimating probability ‘and regression functions rather than adopting the traditional cell-based frequency approsch, we can dramatically extend the reach of nonpara- metric methods. From a statistical point of view, smoothing discret variables may introduce some bias; however, it also reduces the finite- somple variance resulting in a reduction in the finite-sample MSE of the nonparametric estimator relative to a frequency-based estimator. A particularly noteworthy feature of this approach is developed in Section 4.5, where the smoothing method is used in conjunction with, ‘data-driven method of bandwidth selection, such as cross-validation, to automatically remove “irrelevant variables” (a specific definition of irrelevant variables is given in Section 4.5). This is an important re- sult as we have observed that irrelevant variables appear surprisingly often in practice. When irrelevant variables exist, the conventional fre- queney method outlined in Chapter 3 still splits the sample into many diserete cells including those cells arising from the presence of the relevant variables. In contrast, the smoothing-eross-validation method. 26 4,_ KERNEL ESTIMATION WITH MIXED DATA ccan (asymptotically) automatically remove these irrelevant variables, thereby reducing the dimension of the nonparametric model to that associated with the relevant variables only. In such cases, impressive efficiency gains often arise by smoothing the discrete variables. Sections 41, 4.2 and 4.3 discuss the discrete-variable-only case, and Sections 4.4 and 4.5 cover the general mixed discrete and continuous variable case, 4.1 Smooth Estimation of Joint Distributions with Discrete Data ‘We now consider the kernel estimation of a probability function defined, over discrete data X# € S*, the support of X%, As before we use 2 and Xé to denote the sth component of 24 and X# (i = bt respectively. Following Aitchison and Aitken (1976), for xf, X¢ € {O.1,--s¢s — 1}, we define a diserete univariate kernel function 1-Ay if Xf {7c ad. Wet Ae) (4a) Note that when Ay = 0, (X4,28,0) = (Xf = xf) becomes an indicator function, and if Ay = (cx ~ 1)/es then 1 (X22, 4) = 1/e, is a constant for all values of Xf, and 24. The range of Xy is (0, (es = 1)fes] For multivariate data we use 8 standard product kernel function sven by 1(X$,2,2) Tees. a4.a0 (4.2) Tle.te- yya(1 — a4) ME where Nig(2) = (ig #2) $8 an Indicator function, which equals Lif Xf #28, and O otherwise. For a given value of the vector A= (Agy..,2r) we estimate p (24) (et) = zee (3) 4.4, SMOOTH ESTIMATION OF DISCRETE JOINT DISTRIBUTIONS 197, Note that the weight function [(.,.,.) defined in (4.1) adds up to one, which ensures that p (24) defined in (4.3) is a proper probability mea- sure, ie Dptesef (24) = 1. Note also that if As = 0 for all 1 <8 ) for s mn (ii) For 8 = 11+ Ay...47; littneP [Se = 4] > 6 for some 0 < bet “Theorem 4.3 is proved in Ouyang otal. (2006). 
Theorem 4.3 states that the smoothing parameters associated with uniformly distributed variables will not converge to zero; rather, they have a high probability ‘of assuming their upper extreme values s0 thet the estimated proba- bility function satistes the wniform distribution condition of (4) for fit 1,...yr and is more efficient than an estimator that does not pose this restriction, It is dificult to determine the exact value of 6. The simulation results reported in Ouyang, etal. suggest that 6 is round 0.6 for a wide range of data generating processes. 4.2 Smooth Regression with Discrete Data Sid cot oo are i vizo(xf) +0 where 9(:) is an unknown function, Xf € 54, E(w|Xf) = 0 and E(uflX#) = 02(X¢) is of unkown form. Although the kernel fanc- tion defined in (4.2) can be used to estimate g (24), we suggest using the following simple alternative kernel function in regression settings: 1 when Xf = af todehn= (5, eae (412) |. 1X4, 2,0) becomes an indicator function, and when ds = 1, 1064,28,1) 211s a uniform weight function. ‘Therefore, the range of Ay i (0,1). Note that the kernel weight function defined in (4.12) does not add up to 1 and therefore would he inappropriate for estimating a probability function, However, this does not affect the nonparametric estimator 9(z) defined in (4.14) below, because the ker- nel funetion appears in both the numerator and the denominator of 192 4. KERNEL ESTIMATION WITH MIXED DATA, (4.14), and obviously the kemel function can therefore be multiplied by any nonzero constant leaving the definition of @(2) intact. By using (4-12), the product kernel function is given by 1xdsat.x) = TP, (423) t where Nu(a) = 1(X¢ Aart) (the indicator function), We estimate 9 (24) by _ ESR bg 8,4) 4 (ot) = (ans) where p (24) =n 1(X#,a4, 9). When Ag =O for all 8 = 4.057 Sur eatimator collapses tothe Frequency estimator given in (8.6), Under the assumption that A, —+ 0s n — 00 (= 1h.-.,r), one can easily show that § (2°) — g(a) = Op (Sitar As + 07/2). For ex: Simple, 28 was the ease for the proof ofthe purely continuous variable festimator discussed in Chapter 2, we could waite 9 (24) — 9 (24) = {5 ) — 9 1B (x4) /9 (et) = mm (04) /P (e4). One can show that Bhn(a4)] = Oy As) and varlin(x4)] = O(n-1), These imply ‘that E (1 @) (Spa AZ+n-4), which in tun implies that sn 28) = Op (Stay Aen), while it can easily be shown that pla!) = p (2!) +0, (Tey As +74?) by the same arguments that ead to (4.5). Hence, we have = Op (Srey +n) = platy off) San) da We now discuss the use of cross-validation for selecting the Ay's. Wedd ory 0 mini tesa am ona YS fe -ae]’, (4.16) bead 1 Xd jax = DEPT GY) (a7) Pa) 133 Js the leave-one-out kernel estimator of g(X{), and p(X‘) is the leave- one-out estimator of p(X#). We use Ai,..., Ay to denote the cross- ‘validation choices of \y,..-,A, that minimize (4.16), In this section we consider only the case for which all of the com- ponents of X are relevant, and we postpone the discussion of poten- ‘ally irrelevant regressors until the next section. Also, in Section 4.5 we discuss the more general case with mixed categorical and continuous rogressors. We start with the following assumption, Assumption 4.1. (i) BOPIX#) is finite, (4) The only values of (iy... 
Ax) that make x wf X vlole*) sees tan} wee hese are Ay =0 for all 8=1,...,7, Assumption 4.1 (i) implies that g(2) is not @ constant function for any component 2, € Dy (sco Exercise 4.11) and it also implies that, Je = op (1) for all ‘The next theorem establishes the convergence rate of 41, ‘Theorem 4.4. Under Assumption 4.1 we have = 0,(n7) for all = 1,...57 Theorem 44 is proved in Li, Ouyang andl Racine (2006). Bxercise 4.7 asks the reader to prove that ‘Theorem 4.4 holds forthe simple case {or which 2 is univariate (hints are also provided) ‘Theorem A shows that \y converges to zero at the rate Op(n7}) With this fast rate of convergence itis easy to establish the asymptotic distribution of (x), as the next theorem shows. ‘Theorem 4.5. Under Assumption 4.1, we have Vii (9 (24) — 9 (24) / V2 4 [9 + N(O,1) in distribution, 18 — 9 (XA))PLCCF,2,X)/0(2) i 0 consistent 4) Sblxf = 24) rer 4. KERNEL ESTIMATION WITH MIXED DATA Proof. Recall that the frequency estimator g (2) can be obtained from (24) with Ay = 0 for all s = 1,....r. Using Ay = p(n), it is ‘easy to see that §(cx) = g(a) + Op(n-2). Therefore, y/7i(9(r) — g(=)) = YalG(2) —9(z)) +Op(n-Y?) & IV (0,08 (24) /p (24) by (8.8). Finally, Exercise 4.8 shows that 6? (e#) = 0? (24) +0p(1). Theorem 4.5 follows direct o In the next section we discuss the case in which a subset of regressors is irrelovant. 4.3 Kernel Regression with Discrete Regressors: The Irrelevant Regressor Case In this section we allow for the possibility that some of the regressors are in fact ivelevant in the sense that ehey are independent of the de- pendent variable. Without loss of generality we assume that the frst f(t <5) components of X¢ are relevant, while the remain- jng mp = 7 —ri components of X# are irrelevant. Let Xf denote the rredimensional relevant components of Xi and let X¥ denote the r=- ‘mensional irrelevant components, Similar to the approach taken in Lit al, (2006), we assume that (¥,£4) and X¢ are independent of each other. (418) ‘The kernel estimator of g(2*) as well as the definition of the CV objective fanetion are the sae as that sven in Section 4.2, Westil se (1...) Ar) to denote the CV-selected smoothing parameters. In The- Grom 4. below we show that (i the smoothing parameters associated with the relevant regressors converge to zero at the rate of n’ 1/2, which Alles from the nate of convergence when n irrelevant components exist an (i) the smoothing parameters associated withthe ieevant Components will not converge to zo, but have a high probability of Converging to their upper extreme values, 0 that these irrelevant 1e- Gresors are mmihe out with high probabibity. Furthermore, there Ekta n postive probability that these smoothing parameters do not Converge to their upper extreme values even a8 —> oo. Mirroring the notation used for 74 and 24, we shall use \ to denote the smoothing poremeters associated with the relevant regressors, and Sor the ielevont ones. Thus, we have Xy = 2y for 9 = 2yyrt, and Jem Ane fr 8 Tyeyte (te = =r) Alo, wing 9) and BO) 43,_KERNEL REGRESSION: IRRELEVANT DISCRETE REGRESSORS 135, to denote the marginal probability functions of X and X, respectively, then by (4.18) we know that p(z4) = p(2*)p(#4), and using $* to denote the support of 2%, Assumption 4.1 is modified as follows Assumption 4.2. The only values of Mts..+.An that make i seb DY pe) { D vee) - ot24) votes) ° are Ay = 0 for all s=1,...5%. 
Assumption 4.2 ensures that the CV-selected smoothing parame- ters associated with the relevant regressors will converge to 2620, while vwe do not impose any assumption on the smoothing parameters asso- ciated with the irrelevant regressors except that they tale values in the unit interval (0,1). ae ‘The next theorem provides the asymptotic behavior of the CV- selected smoothing parameters. ‘Theorem 4.6. Assume that ry > 1 and ry >1 (with r =r +r2 > 2). ‘Then under Assumption 4.2, we have = Op(n-¥) for s=1,....75 slim, P (nga = tyes de 1) 2 @ for some a € (0,1)° ‘The proof of Theorem 4.6 can be found in Li ot al. (2006). ‘Theorem 46 states thatthe smoothing parameters associated with the relevant regressors all converge to zero at the rate of n-/, while the smoothing parameters for the irelevat regressors have a positive probability of assuming their upper bound value of 1 That is, there sa positive probability thatthe irelevantregressors wll be smoothed out. tis difficult to determine the exact value of forthe general case. The simulation results reported in Li etal (2006) show that there is usually about « 60% chance that Ay takes the upper extreme value {and a 40% ‘hance that Ay takes ves betwoon O and 1, for 811 -FIyses9r Note that ‘when 2f is an irelevant regresor, the asymptotic be- havior of Ay is dificult to characterize in further dotail since Ay does not converge to 2ero in this ease. The arymptotic distribution of the resulting g(x) will also be difficult to obtain. When one obtains a rel- atively lage value for A, and suspocts that this may reflect the fact, 136 4. KERNEL ESTIMATION WITH MIXED DATA that zis an irelovant regrestor, one may perform a formal test of the uull hypothesis that 2f is indeod an irrelevant regreseor, say, US- Ing the bootstrap based test suggested by Racine otal. (Forthcoming) (cee Chapter 12). If one fails to reject the mull hypothesis, one may "remove this regresor from the model. In this way only relevant regres sors are expected to remain in the model, and one can recompute the cross validated bandwidths using only those regressors deemed to be relevant. Then the results of Theorem 4.4 end Theorem 4.5 apply as to irclovant regresors are expected to remain in the snodel ‘We now proceed to the general case of smooth nonparametsc esti ration with mixed discrete and continuous data 4.4 Regression with Mixed Data: Relevant Regressors 4.4.1 Smooth Estimation with Mixed Data In this section we consider a nonparametric regression model where a subset of regresirs are categorical while the remaining are continuous. ‘he in Chapter 3, we use Xf to denote an r x 1 vector of regressors that assume discrete values and let XP € R? denote the remaining continuous regrestors. We again let Xf denote the sth component of Xe We assume that X% assumes ¢, > 2 diferent values, ie, that Ye (0.1--.sey— I} for #= 1,94, and define X= (X48). ‘The nonparametric regression model is given by Yes) tu (419) with B(uX:) = 0. We use f(x) = f(e*,2%) to denote the joint PDF of (XS, X28). For the discrete variables X7, we first consider the case for which the variables possess no natural ordering. ‘The extension to the general cease whereby some of the diserete regressors have a natural ordering is discussed at the end of this section. = (fy), define Wala? X8) Taehg where w is symmetric, nonnegative univariate kernel function. O < hy < 00 is the smoothing parameter for 2{, and for 24 = (¢f,...,2#) 44. 
REGRESSION WITH MIXED DATA: RELEVANT REGRESSORS 187 define 7 at, X80) = TT Ne, where Ni(e) = 1 (X4 # 22) isan indicator function which equals one when X¢ + cf, and zero otherwise. The smoothing parameter for xf is0 B(x) f(e)A3 +S Bas(0)f(e)rv+ 00m), (4.21) vane) = MELE + O(m)], at (42 F@) = fe) + Opn +i”), (4.23) [aeol) + 24a(e) fale) F(2)-"] XY (#424) [ote 24) — ae] 129 F00), * dee na fvietdo, m=D+ Dan mn = (mh Pig), and 138 4. KERNEL ESTIMATION WITH MIXED DATA 14 (24,28) takes the value 1 if 24 and 2 difer only in their sth come ponent, 0 otherwise ‘Equations (4.21) and (4.22) are equivalent to Efi(z?"] = 0 (n3 +m), which implies that ri(2) = Op (m +n) and that in(a) _ Oplmm+ Flay > Le) oo) (m+ ni”) Tf in addition to my = (1) and mp = ofl), one also assumes thnt (hy sh)! y f= a(t, then using (4.21) t0 (428) one can establish sre coyanotie normal distsibution of f(2) using Liapunov’s CLT: 13 H(z) - oz) = ah -Fq | (@) ~ a(x) - Srowtene-S anion SN (,n80%2)/F(e))- (4.28) 4.4.2. The Cross-Validation Method “Least squares cross-validation selects fy... gy Aty---»Ar t0 mininize ‘the following cross-validation function: OVehA) =O — GAH) MX), (4.26) wine §-a(Xs) = Dy ¥io(Xiy Xf Cis Re (Ne i) isthe leave-one- ac Reracl estimator of (Ki) and 0 = Mf() < 1 isa weight function Athich serves to avoid dificultis caused by dividing by zero ot by the Tow convergence rate induced by boundary effect "We assume that (hay cos gy AtseoeyAn) © [Osm}®*7) and nly +g my (4.27) where 1) = tis positive sequence that converges to zero at a rate Mlower than the inverse of any polynomial in m, and f, is a sequence of constants that diverges to infinity. Let S denote the support of M(:). We also assume that ‘9(:) and f(-) have two continuous derivatives; M(:) is eon- Hinuous, nonnegative and has compact support; f(-) is Dounded away from zero for 2 = (2°24) € 5S: (428) It can be shown that the leading term of CV, is CVjo, given by (ignoring the terms unrelated to (h, \)) ovat) = f {[pasom aver fla) SIL Jaca (429) The relationship between CV and MSE[g(x)] can also be used to derive (4.29). In Choptar 2 we argued that the leading term of CV is related to the pointwise MSE by oveyD / MSEIg(z)] (2) M (2) da (4.30) Equation (4.25) implies that “ 7 2 MSEUa(@)] ~ [= Bulan + Yr] + 807(2) fa) (aha fg) Substituting this into (4.30), and ignoring the terms unrelated to (h,) ‘and the smaller order terms, we obtain CV,(a,b) = n-¥$0)y,, where a= ¥ f { [Saw + = acm] gitarta) ert ance aes (asi) where ay = nip, and 6, = ni/(+9),, Equation (4.81) indeed ssves the leading term of CV, as stated in (4.29). Let a®,...,a8,8f,... 09 denote values that: minimize x» subject to ‘each of them being nonnegative. We exclude the case for which some af or BY are infinite, and require that ‘Tho o's and A's are niqnely defined, and each of them is nie. (432) ‘This implies that 0 < a? sy im probably. ‘The proof of Theorem 4.7 can be found in Hall et al. (2006). The- orem 4.7 follows from (4.33) and the fact that Ovatha) — f o%(a)M(e) de = CV(0,2) + (60) uniformly in (24,28) € w S4,(h,A) € [Ora], where tig > 0 and ‘My 1.48 n= 00. Tatuitively, one should expect that the f,’s and Jy’s that minimize CV-(,-) have the same asymptotic behavior as the fit’s and 0's that rminimize CVs), the leading term of CVs(h,). Therefore, one ex- pects that hy = Ad + (.0,) and y= 28 + (5.0) 4.5 Regression with Mixed Data: Irrelevant Regressors In this section we consider the case in which (i.e. allow for the pos- sibility that) some of the regressors may be irrelevant. 
Without loss of generality, we assume that only the first g: components of X* and the first r1 components of X“ are “relevant” regressors in the sense defined below. For integers 0 < qi, 00 for s= q+ 1,...,q and Ay > 1 for =r 1,...,7. ‘Lins, asymptotically, cross-validation can automatically remove irrel evant variables. ‘The eross-validation objective function is the same as that defined in Section 44 except that now ¥i = 9(Xi) + wi with E(ulXe) = 0, that is, the conditional mean function depends only on the relevant regressors £. The definitions of mi, rt; and f; all remain unchanged G(x) = (4.35) (4.36) ua 4. KERNEL ESTIMATION WITH MIXED DATA ‘except that one needs to replace g: = 9(X) by di ‘The modified analogue of the function x defined in Section 44 is (Xe) wherever it lay gqs tse s br) -r/ {(Sadaeren foe = aa)} Jee @ + eaSa (ate) rhtentoge) 10 15 / S22 foe me DJ Fein repahane (08 a sees) dean Oe ‘and the bar (tilde) notation refers to functions of vectors involving ool, ‘the relevant (irrelevant) regressors #, (2). F (f) is the joint density of 4) (2 = (38,24) and yy (Jus) are second order derivatives of m (f) with respect to 25, = J, WAs before, let af...) Bf,-.-,09, denote values that minimize subject to each of them being nonnegative. We require that the a's ‘nd 89's are uniquely defined, and that each is Bite "Note that the irelevant components do not appenr in the definition of g, This is bocause the irelevant components cancel each other in the tatio Blin(z)]/ELF(e)), which we denote by fit) = Blyn(a)}/EL/ (2) For the relevant variables & we require that & [.laele) serra anaes, (428) 23 cow interpreted as a function of hy,.--,/gy Aty-++Any vanishes if and oply if all of these smoothing parameters vanish. Since we should allow the smoothing parameters for irrclevant vari- ables to diverge from zero, we can no longer impose the requirement that hy = o(1) and A, = o(1) for all s. Rather, we impose the following conventional conditions on the bandwidth and kernels. Define w= (fle) Hf sawn Let 0 n-© and max(fi,--.,hg) < n© for some C > 0; the kernel K is a symmetric, compactly supported, Hélder-continuous PDF; (0) > (6) for all 6 > 0. (439) If cis arbitrarily small, the above conditions on hi, hg are basi cally that hy -igg 205i x ofgg + 0 and fi. /tg + 0.85 11 00, In addition, we require that the fastest rate at which fy goes to zero cannot exceed nC and the fastest rate at which h, (fr itrelevant vari ables) goes to 60 cannot excend n® With these conditions we obtain the following result. ‘Theorem 4.8, Under conditions (4.28), (4.34), (4.38), (4-39), and Letting fy. ig, tyes Ae denote the smoothing parameters that min- imize Vp, then nM(O+D}, a8 in probability for 1 < 8 < as Phe > ©) 1 for q:+1<0 0; n2(O9}, — 2 in probability, for 1 <8 00 for 6 = 01 + Tyeenyd and as Ay + 1 for # = ny +1y.--,F, Wo have Bay vey hgy Mtv eces Se) = OCB ye oes fg Atyoo osc) +(60,) Next, from the fact that hy — A = op(h2) (¢ = 1,.--,gx) and Ay = 22 09(A9) (8 = Te.ey7t), we have that 92, fa,-.-shgis Ais ja) GZ, AY,... AO, AY,.--,A8,) + (8.0.). Theorem 4.9 then follows. 4.5.1 Ordered Discrete Variables Up to now we limited our attention to the case for which the disereto variables are unordered. If, however, some of the discrete variables are Ordered (ie, ate “ordinal categorical variables"), then we should use a ‘eee! 
function which is capable of reflecting the fact that these vari- ables are ordered, Assuuaing that 2# can tako on cy different ordered vals, {0,1,.--y¢1— 1}, Aitchison and Aitken (1976, p. 29) suggested sing the Kernlfantion given by Hod) = (5) al. Ay? 45. APPLICATIONS | = 5 (0 < ¢ 3, this weight func- tion is deficient since there is no value of J, that can snake I(r, v.24) equal to a constant function. Thos, even if 2f turned out to be anit relevant regressor, it eould never be smoothed out if one were to use this kernel. Therefore, we suggest the use of the following alternative lernel faction: Ua of, A) = AsFovAl (4.43) ‘The range of A, is [0,1]. Again, when A, assumes its extreme upper bound value (Ay = 1), we see that {(e, vf, 1) = 1 for all values off, vf € (0,4,-.-4ce~1}, and in this case 24 is completely smoothed out of tho resulting estimate. Tt can be easily shown that when some of the diserete variables are ordered, then all ofthe results established above continue to hold provided one uses the kernel function given by (448). ~ OF course, if an ordered discrete variable assumes a large umber of diferent values, one might simply treat it as ft were a continuous variable. In practice this may well lead to estimation results similar to those arising were one to treat the varinble as an ordered discrete variable and use the kernel function givea by (443) Bierens (1983, 1987) and Ahmad and Cerrito (1904) also consider smoothing both the discrete and continuous variables for the nonpara- ‘metric estimation of a regression function. But they do not study the fundamental issue of data-driven solection of smoothing parameters. ‘As we have shown in this section, it is important to use automatic data-iriven methods to select smoothing parameters. Moreover, least, squares cross-validation has the attractive property that it cam auto- ‘matically (asviaptotically) remove irrelevant explanatory vasiabls. In nonparametric settings with mixed discrete and continuous variables, least squares cross-validation stands alone and appears to be without any obvious peers. 4.6 Applications 4.6.1 Food-Away-from-Home Expenditure Prior to the 1980s, the value added in China's food sector that was accounted for by food-away-from-home (FAFH) consumption was neg- ligible. Household production of most meals used grain, raw vegetables, 16 4,_ KERNEL ESTIMATION WITH MIXED DATA and meat produced at home, purchased from state-run food stores, for purchased directly from farmers. Mirroring China's rapid income growth after 1980, Chinose consumers increasingly ate more meals in restaurants, cafeterias and dining halls. The FAFH share of total food expenditure has steadily increased from 5.03% in 1992 to 14.70% at 2000, In 2000, per capita annual FAFH expenditure reached 288 yuan with total FAPH expenditure of 182 billion yuan in urban China ($15.9 billion US dollars). In the application below, it will be seen that non- parametric mothods suggest that the FAFH expenditure growth in China is expected to continue, due to the rising middle class, rapid ‘urbanization, and to its relatively low FAFH share compared with the United States (40.3% in 2001, Economic Research Service, USDA), Canada (35.6% in 2001, Statistic Canada), and other developed coun- ‘tries (see Jensen and Yen (1996)). This stands in contrast to conclusions ‘obtained from a popular linear model that has been used to model such expenditure. 
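Before turning to the data, it may help to sketch the estimator used in this application: the mixed-data local constant estimator of Section 4.4, combining a Gaussian kernel for the continuous regressors, the unordered discrete kernel, and the ordered kernel (4.43) for ordinal regressors such as education. The snippet below is a minimal illustration on simulated data; the variable names, data generating process, and hand-picked smoothing parameters are our own assumptions (in practice the smoothing parameters would be selected by the cross-validation method of Section 4.4.2).

```python
import numpy as np

def mixed_kernel_weights(Xc, Xu, Xo, xc, xu, xo, h, lam_u, lam_o):
    """Generalized product kernel at the point (xc, xu, xo).

    Xc : (n, q) continuous regressors, Gaussian kernel with bandwidths h
    Xu : (n, r1) unordered discrete regressors, kernel lam_u**1(Xu != xu)
    Xo : (n, r2) ordered discrete regressors, kernel (4.43): lam_o**|Xo - xo|
    """
    wc = np.prod(np.exp(-0.5 * ((Xc - xc) / h) ** 2) / (np.sqrt(2 * np.pi) * h), axis=1)
    wu = np.prod(np.where(Xu != xu, lam_u, 1.0), axis=1)
    wo = np.prod(lam_o ** np.abs(Xo - xo), axis=1)
    return wc * wu * wo

def local_constant(Xc, Xu, Xo, y, xc, xu, xo, h, lam_u, lam_o):
    """Local constant (Nadaraya-Watson) estimate of g(x) with mixed regressors."""
    w = mixed_kernel_weights(Xc, Xu, Xo, xc, xu, xo, h, lam_u, lam_o)
    return w @ y / w.sum()

# Toy illustration with one regressor of each type (hypothetical data).
rng = np.random.default_rng(1)
n = 300
income = rng.uniform(0, 10, (n, 1))          # continuous
city = rng.integers(0, 2, (n, 1))            # unordered dummy
educ = rng.integers(0, 7, (n, 1))            # ordered, seven categories
y = 2 + np.log1p(income[:, 0]) + 0.5 * city[:, 0] + 0.2 * educ[:, 0] \
    + rng.normal(0.0, 0.3, n)

print(local_constant(income, city, educ, y,
                     xc=np.array([5.0]), xu=np.array([1]), xo=np.array([4]),
                     h=np.array([0.8]), lam_u=0.3, lam_o=0.3))
```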
The data are obtained from urban household surveys conducted by the State Statistical Bureau of the People's Republic of China in 1992 and 1998. The dependent variable is household per capita FAFH expenditure (in 1992 yuan). The explanatory variables include household per capita income (in 1992 yuan), household size, education level of the household head (seven categories: 1. below elementary school, 2. elementary school, 3. middle school, 4. high school, 5. middle level specialized training, 6. two-year college, 7. bachelor degree or above), age of the household head, a 0-1 dummy variable indicating that the household is in a large city, and a gender dummy for the household head. The total sample size is 3,459 for 1992 and 3,359 for 1998. Min, Fang and Li (2004) also estimated a popular linear model to be used for comparison purposes. They presented graphs showing the relationship between FAFH expenditure and income (holding all other regressors fixed at their sample medians). Both the 1992 and 1998 nonparametric FAFH expenditure curves show that FAFH consumption stops increasing after a high income level has been attained. This is quite different from the parametric results, where the linear model predicts higher FAFH consumption for higher income households. To check which results appear to provide a better description of the data, Min et al. computed goodness-of-fit ($R^2$) measures for both the parametric and nonparametric models. The $R^2$ from the linear models for the 1992 and 1998 data are 0.128 and 0.170, while the $R^2$ for the nonparametric models are 0.382 and 0.348 for 1992 and 1998, respectively.

Table 4.1 gives the average FAFH expenditure for different income intervals, along with the predicted average values from the parametric (ordinary least squares, OLS) and nonparametric models. High income households indeed spent less on FAFH than households with slightly lower income. In 1992, households with income exceeding 5,000 yuan spent about 11 yuan less on FAFH than those with income between 3,000 and 5,000 yuan. In 1998, households with income exceeding 9,000 yuan spent 52 yuan less on FAFH than those with income between 4,000 and 9,000 yuan. We observe that for low and middle income levels, both the parametric and the nonparametric models predict the average FAFH expenditure well. However, for high income levels the parametric model appears to give misleading predictions, while the nonparametric method performs much better in this case.

Table 4.2 reports estimated mean and median income elasticities. For the nonparametric model, the elasticity for household $i$ is computed as $\eta_i = [\partial \hat g(x_i)/\partial x_{1i}]\, x_{1i}/\hat g(x_i)$, where $x_{1i}$ is the income of the $i$th household, while for the linear model the derivative $\partial \hat g(x_i)/\partial x_{1i}$ is simply the estimated income coefficient. Table 4.2 reveals some interesting phenomena. Comparing the elasticities of 1992 and 1998, the nonparametric results show that both the mean and median income elasticities have increased for both large city and middle-small city households. In contrast, the parametric model shows only that the mean elasticity for middle-small cities has increased, while the mean elasticity for large cities and the median elasticity for both large and middle-small cities fell from 1992 to 1998. The conflicting estimates are likely due to the misspecification of the linear model. As we discussed earlier, the linear model overestimates FAFH expenditure for high income households due to its imposing an incorrect linear income (trend) component on FAFH consumption.
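The nonparametric elasticities reported in Table 4.2 require the derivative of the kernel fit with respect to income. A simple way to obtain it is by a two-sided finite difference, as in the sketch below; the example uses a single simulated income regressor and a hand-picked bandwidth purely for illustration (in the application the derivative is taken with respect to the income component of the full mixed-data fit).

```python
import numpy as np

def nw_fit(x_eval, x, y, h):
    """Local constant (Nadaraya-Watson) fit evaluated at the points x_eval."""
    u = (x_eval[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    return (w @ y) / w.sum(axis=1)

def income_elasticity(x, y, h, eps=1e-2):
    """eta_i = [dg_hat(x_i)/dx] * x_i / g_hat(x_i), derivative by a two-sided difference."""
    g = nw_fit(x, x, y, h)
    dg = (nw_fit(x + eps, x, y, h) - nw_fit(x - eps, x, y, h)) / (2 * eps)
    return dg * x / g

# Hypothetical Engel-type curve: expenditure flattens out at high incomes.
rng = np.random.default_rng(2)
income = rng.uniform(1, 10, 400)
expend = 50 * np.log1p(income) + rng.normal(0, 5, 400)

eta = income_elasticity(income, expend, h=0.6)
print("mean elasticity:", eta.mean(), "median elasticity:", np.median(eta))
```

In this simulated example the estimated elasticities decline with income, mirroring the flattening of the expenditure curve discussed above.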
The misspecified linear model also overestimates the elasticities for high income families, leading tothe erroneous pre- diction that the median elasticity has decreased from 1982 to 1998, 4.6.2. Modeling Strike Volume We consider a panel of annual observations used to model strike vol- ume for 18 Organization for Beonomic Cooperation and Development (OBCD) countries. The data consist of observations on the level of strike volume (days lost due to industrial disputes per 1,000 wage salary earners) and their regressors in 18 OBCD countries from 1951 to us ‘Table 4.1: Average FAFH expenditure by income category 1992 Tncome Data Ave, OLS Ave. NP Ave. “Below 3K 8278 ——«2.12—C«3.26 3K-5K 1491520 48.0 AboveSK 143.9 LT. 1998 “Tncome Data Ave. OLS Ave._ NP Ave, “Bdow 4K 11940 «20D AK-9K 257.0 250.0 «254.0 Above 9K 2052 MLL 282.9 ‘Table 4.2: Mean and median income elasticities (n; denotes income lasticity of household i) Parametic Large Gity— Results Toe 199819921998 Mean O87 O81 1.274 1.351 Median 0.848 0.832 1.074 0.994 Yoofm>1 18.5% 13.9% 69.6% 48.8% Nonparametric Large City Results “i992” 199819921908 Mean 0.626 0.826 0.900 0.947 Median 0.751 0.851 0.938 0.960 Wofn>L 35.0% 39.9% 45.9% 45.8% 1085, The average level and variance of strike volume vary substantially ‘across countries. The data distribution also features a long right tail ‘and several large values of volume, We make use of the following re- ‘gressors: 1) country code; 2) year; 2) strike volume; 4) unemployment; 5) inflation; 6) parliamentary representation of social democratic and labor parties; 7) a time-invariant measure of union centralization. The data are publicly available (see StatLib, http:/ /lib.stat.cmu.edu). As ‘one country had incomplete data we analyze only the 17 countries hav- ing complete records. "These data were analyzed by Western (1996), who considered a APPLICATIONS. uo ‘Table 4.3: Regressor standard deviations and bandwidths for the pro- posed method for the training data, ny = 561 1 2 3 5 6 a 9.53 2.84 475 1331 031 NA f/X_102565846.11_4800821.61 5.84 30.56 40852803 0.12 Key: 1) year; 2) unemployment; 3) ination; 4) parliamentary repre- sentation; 5) union centralization; 6) country code. linear panel data model with country-specific fixed effects and a time trend. We consider a nonparametric model which treats country code ‘as categorical and the remaining regressors as continuous. To assess each model’s performance we estimate each on data for the period 1951-1983, and then evaluate each based on its out-of-sample predictive performance for the period 1984-1985, again using predictive squared error as our criterion, We begin by considering the cross-validated band- widths, which are summarized in Table 4.3 Ascan be seen from Table 4.3, the continuous regressors year, uneu- ployment, and union centralization are effectively smoothed out of the resulting nonparametric estimate, having bandwidths that age orders ‘of magnitude larger than the respective regeessor's standard deviation, ‘This suggests that the continuous regressors inflation and parliamen- tary representation and the discrete regressor country code are relevant in the senso outlined es We next compare the out-of-sample forecasting performance of each ‘model. The relative predictive MSE of the parametric panel model (all ‘variables enter the model linearly) versus the NP CV approach is 1.38. ‘We note that varying the forecast horizon has little impact on the 1elative predictive performance. 
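The mechanics of this out-of-sample comparison are straightforward. The sketch below (simulated data standing in for the strike-volume panel; bandwidths fixed by hand rather than cross-validated; all names ours) fits a linear benchmark and a mixed-kernel local constant estimator on a training sample and compares predictive mean squared errors on a hold-out sample.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for the strike-volume exercise: one continuous regressor,
# one categorical "country" regressor, and a response that is nonlinear in the former.
n = 600
xc = rng.normal(0, 1, n)                 # e.g. inflation
cid = rng.integers(0, 17, n)             # e.g. country code
y = np.sin(2 * xc) + 0.3 * cid + rng.normal(0, 0.3, n)

train, test = np.arange(400), np.arange(400, n)

# Linear benchmark: regress y on (1, xc, cid) by ordinary least squares.
Z = np.column_stack([np.ones(n), xc, cid])
beta, *_ = np.linalg.lstsq(Z[train], y[train], rcond=None)
mse_lin = np.mean((y[test] - Z[test] @ beta) ** 2)

# Mixed-kernel local constant estimator with hand-picked (h, lambda);
# in practice these would be cross-validated as in Section 4.4.2.
def nw_mixed(x0, c0, xc_tr, cid_tr, y_tr, h=0.3, lam=0.1):
    w = np.exp(-0.5 * ((xc_tr - x0) / h) ** 2) * np.where(cid_tr == c0, 1.0, lam)
    return w @ y_tr / w.sum()

pred = np.array([nw_mixed(x0, c0, xc[train], cid[train], y[train])
                 for x0, c0 in zip(xc[test], cid[test])])
mse_np = np.mean((y[test] - pred) ** 2)

print("out-of-sample predictive MSE ratio (linear / kernel):", mse_lin / mse_np)
```

Because the simulated response is nonlinear in the continuous regressor, the ratio typically exceeds one, in the same spirit as the comparison reported above; the numbers are not meant to reproduce those in the text.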
We have also tried a parametric model ‘with interaction terms (quadratic in the continuous regressors) and the resulting out-of-sample predictive MSE is larger than even the Iinear model's MSE prediction (the ratio of its predictive MSE over NP CV's is 1.44), while a parsimonious parametric model having only a constant, ‘unemployment, and inflation has a predicted MSE 15% larger than ‘that of the NP CV approach (the relative predictive MSE is 1.15), suggesting that a linear parametzic specification is inappropriate, 190 4._ KERNEL ESTIMATION WITH MIXED DATA 4.7 Exercises Exercise 4.1. Derive the MSB of the nonsmoothing estimator of p(z) given by : aoe Be) = DUK =>), ‘where 1() isan indicator function defined by { #X=2 UX=2)= 19 otherwise, and where X; € S = {0,1,...,¢— 1} is a discrete random variable having finite support. Bxercise 4.2. Consider the kernel estimator of (2) given by Ha) = EY 1%, i where 2 is a scalar and L(:) is a kernel function defined by 1d HX A(e-1) otherwise, xan) { where d € [0, (e = 1)/¢} (i) Derive the bias of this estimator, then demonstrate that if X has a (diserete) uniform distribution, so that p(2) = 1/c for all X, then the kernel estimator f(x) is unbiased for any admissible value of 2. (8) Show that 5a) — MAP) (y_y 2)" (Ga) = MA=PEED ( agin) Demonstrate that if X has a (discrete) uniform distribution, 0 ‘that p(r) = 1/e for all X, hen ie vaslance of (x) is zero when = (e— 1)/e, its upper bound. (iii) Given your results above, is it possible for the kernel estimator {c) to have an MSB of zero when the underlying distribution is ‘uniform? What are the conditions that A must satisfy for this to ‘occur? Demonstrate your result 47. EXERCISES 151 Bxercise 4.3. (0 Show chat Elp (et) and var (x4) ato indo given by (4.5) swith Boy = Sou = yds (24,24) p (24) — pe), where 1, (24,24) is desined in (424) (il) Show that B(mh (2)*) = O (S73 +n-+), where rh (x4) is defined above (4.15). Note that O (S22. Ao)*) =O (2.4 8). Exercise 4.4. 0, is an r xr matrix with its (6,¢)th element given by E Lr) =n] 4) - (4) Erules) wn Note that pr.s(2) is also a probability measure-since p1,. (2%) is ‘the average probability over the c, — 1 zs € St that differ from 24 only in the sth component, ie, pts (24) = hy Degyeg P(28, 244). Consider the simple ease of r= 2. Then Matern =P fo) -mss (#)] + m2 @)]F- ‘Show that » is positive definite if and only if there exists.x#, 24 € 54 such that p(24) pia (24) and thet p (24) # pia (24) Exercise 4.5. Using (4.3), we have in Ef ('- wey L EE te where where 8) = Ses Baba "Ths we have x iy rE [22-2055] - aE LLM (449) wey OM) 152 4 KERNEL ESTIMATION WITH MIXED DATA Let OVp(A) be defined in (4.45) and assume that 24 is univariate taking values in {0,2,...,¢~1). Show that (Ais a scalar) E(CV9(0)) = DiA* = Dad $0 (A? 4 An) + terms independent of , (446) ‘where the D's are (positive) constants. Desive explicit expressions for the D's Note that minimizing (4.46) over 2 leads to 3 O(n"). Exercise 4. *Da/(2D1) = (@ Prove Theorem 4.2 (i). Hint: Use Theorem 4.1 and the result in (3.3) Prove Theorem 4.2 (i) Hint; Use Theorem 4.1 and the result in (3.4) Exercise 4.7. Prove Theorem 4.4 for the ease when a is univariate. Hint: When 2 is univariate, Bij = Luj-o + Mao, where dy = al, Write BuAlX$) =n Dy yilLay-o+ Mayo] = Bio + Avia» Then 1 -1[,_ ee 2) Pa) ~ Bo [afi] +00 One can expand CVs() in a power series of A, Le. OV = AyX? + Ann! + ofan? + X2) + (terms not related to ), where Ay > 0 is a positive constant, and Amy is an O,(1) random variable Exercise 4.8. 
In the proof of Theorem 4.5 we have used 4? (a4) = 0 (2°) + o9(1). Prove this result. Hint: Write 6 (et) — 0? (c4) 2 (24) )p (x4) /9 (at). Use Ay = Op(n), Show that 4 (as) p (x4) =n DP, wAXE 2) +Op(n7) 2s 02 (24) p(x4) by a law of large numbers argument. Exercise 4.9, Derive (4.21) through (4.23). 47. EXERCISES 153 Bxercise 4.10. The al's and sare defined asthe unique finite values ‘hat minimize x defined in (1.31). Consider the case of q = 1 and r = 1. Solve for a? and U? by minimizing (431) Hint: Note that x(ay,bs) = Cyaf + Cabf ~ Cyabs + Co/ay, where G, > 0 for j =0,1,2. This ean be written as (ai, b1) = Cp [by — Arai]? + Anat + Cp/ar, where Ay = C3/(2Ca), and where Az = Ci — C3/(4C2) > 0. Thus, we have 8 = max{0, Ay(a{)?} (since 82 i nonnegative) and a? = [Co/(4A2))¥* if Ay > 0, and af = [Co/(4C)) if Ar <0, Exercise 4.11. Considering the simple case for which r = 1, show that Assumption 4.1 holds true if and only if g(z4) is not a constant function (when x4 is univariate). Hint: Assumption 4.1 becomes, for all 24 € S# (x* is univariate since r= 1), Dysese {{ala) ~ o(e4)] [aa = 24) + Jae 4 oy] Y? = 0, which is equivalent t0 1? weselate®) — o(e8)4(e4 A 24) = 0 because [g(24) — o(=*)1(24 = 2) = 0. However, since (24) is not a constant function, we know that 3.4294[9(24)—g(24)|?1(x* ¢ 24) > 0. ‘Thus, we must have A= 0. Thus, Assumption 4.1 holds tru if and only if g(#*) is not a constant function when «4 is univariate. Chapter 5 Conditional Density Estimation Conditional PDFs form the backbone of most populat statistical meth- ods in use today, though they are not often modeled directly in a para- ‘metric framework. They have perhaps received even less attention in ‘a kernel framework. Nevertheless, as will he seen, they are extremely ‘useful for @ range of tasks including modeling count data or predict- jing consumer choice (see Cameron and ‘Trivedi (1998) for a thorough treatment of count data models). In this chapter we discuss the non- parametsic estimation of conditional PDFs. Throughout this, chapter, ‘ve emphasize the practically relevant mixed discrete and continuous data case. Given the theoretical and practical advantages associated with smoothing discrete variables in a particular manner versus using. ‘a frequency-based approach, which we covered in Chapters $ and 4, we shall proceed directly with smoothing both the discrete and continuous ‘variables for what follows. Note that we may refer to conditioning vati- ables as regressors even though we are estimating a conditional density rather than a regression function 5.1 Conditional Density Estimation: Relevant Variables Let f()) and y(-) denote the joint and marginal densities of (X,Y) and X, respectively. For what follows we shall refer to Y as a dependent variable (i.c., ¥ is explained), and to X as covariates (ie., X is the explanatory variable). We use f and ji to denote kernel estimators 156 5, CONDITIONAL DENSITY ESTIMATION thereof, and we estimate the conditional density gfyl2) = f(a,u)/m(a) by d(yle) = Fle, w)/Ale). on) We begin by considering the case for which ¥ is a univariate con- ‘tinuous random variable, and then we discuss the ease for which Y is ‘univariate discrete variable, We treat the multivariate ¥ case in Section 5.4. 
Our estimators of f(-) and y() are given by Fle) = 0 Oye, X)hi Fs (6.2) t ia) = OK (Xp 63) whore hy is the smoothing parameter associated with Y, 7 = (A), and K,y(e, Xi) = Wala’ XLA XE), eye TY by (xe Wale’ X9) Thi( 728), pet, X49) Tastee Dea —Ay*, Ng(a) = 1X} # xf) is an indicator function that equals one when Xd ¢ af, zero otherwise, and kyg(y.¥,) = hy" k((y ~¥)/ho) ‘As we emphasized in previous chapters, data-driven methods of pondwidth selection such as crose-vaidation are required in applied rltings. Below we discuss to differnt cross-validation methods. The rot i based on minimizing a weighted integrated squared difference betwoen (yz) and gly), and the second isa likelihood approach. As trill be seen, the frst method has some desirable optimality properties Jput can be computationally burdensome for large samples, The seo- tnd method is computationally less costly but can load to inconsistent Catimates if the distribution for the continuous variables has fat tails. 52, CONDITIONAL DENSITY BANDWIDTH SELECTION ast 5.2 Conditional Density Bandwidth Selection 5.2.1 Least Squares Cross-Validation: Relevant. Variables ‘As was the cate for unconditional density estimation, when estimat- ing conditional densities with kernl methods we could pursue last squares cros-valiation approach to selecting smoothing parameter. ‘We consider the following criterion based upon @ weighted integrated square error (de = Sues fd"), 15B = J (aula) ~ ou\a))* nM (2 dey on Ian ~2lan + Tos where M() i a weight faneton, tin = f aula)ateya (a) dedy, tag = f auladfles wae" dey, and where Inn = J 9°(ylz)u(z)M(z*) dady does not depend on the smoothing parameters used to compute f and ji. Observe that [ee t= f toy Oar ae =e [2 (aye 5M a). where the expectation is with respect to X, not, to the random obser vations {Xi.Y;}fLy that are used to compute GC) and yi), Gle) = J fle.y)%dy, and where G(.) is defined as Gla) = FSS SS Kelas Xue Xa) f welt Ys) an Ya ee Similarly, Faq can be written as Ezla(¥1X)M(X°)] = Bal FY, X)M(X°)/i(X)], where the expectation is with respect to Z = (¥,X). Therefore, the following cross-validation approximations, fin and faq, for Tin and Tony respectively, are adopted: 188 5, CONDITIONAL DENSITY ESTIMATION whore the subscript —i denotes leave-one-out estimators, for example, cd = Gp, B,D, Hae RIK) x J eC. %a el 8a "Thus, our cross-validation objective function is given by CVs, d) = Han (ho,s 9) ~ 2fanl hoy hy A)» where (FX) = (ay .++5hgy Ais -+ +1 Ar) ‘long the lines of the analysis presented in Section 4.4.2, again sing fife) * = ple)! + (u(e) ~ A(x) (2) A@)] to bande the pres- tence of the random denominator, it is easy to see that the lading term of (yz) — a(ulz) is [o(ulz) ~ avie\lale)/ute) = (flew) — ji(z)g{ yee) /1(2). Hall otal. (2004) have further shown that the leading term of CV, is ovp= [ {8 [few - Flertve)] Mayne? } doa (5) Letting pp = Aa/{(1 ~ da)(Cs — 1)}, and letting foo(#*,4,u) and Jule! y) denote the second derivative of f(2!,24,y) with respect ioly and 25 respectively, we may write effew}=D {te - aagtceah { J Teco} MQ) x flat — ha,2y — hoo). deylo =seeEngs Yuttatsitvta)- sea} + wah foo(n,u) + Ena 92 foa(2,u) + 0(02)+ 2 at 68) 52. 
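The estimator in (5.1)–(5.3) is simple to code. The sketch below evaluates $\hat g(y|x) = \hat f(x,y)/\hat\mu(x)$ for one continuous and one unordered discrete conditioning variable with Gaussian kernels for the continuous components; the data, bandwidths, and function names are illustrative assumptions, and in practice the smoothing parameters would be chosen by the cross-validation methods discussed next.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def cond_density(y0, xc0, xd0, y, xc, xd, h0, h1, lam):
    """g_hat(y0|x0) = f_hat(x0, y0) / mu_hat(x0), as in (5.1)-(5.3), with one
    continuous and one unordered discrete conditioning variable."""
    kx = gauss((xc - xc0) / h1) / h1 * np.where(xd == xd0, 1.0, lam)
    ky = gauss((y - y0) / h0) / h0
    return np.mean(kx * ky) / np.mean(kx)

# Toy data: the conditional distribution of y shifts with both covariates.
rng = np.random.default_rng(4)
n = 1000
xc = rng.uniform(-1, 1, n)
xd = rng.integers(0, 2, n)
y = xc + 0.5 * xd + rng.normal(0, 0.5, n)

# Conditional density of y at a few points, given xc = 0.5 and xd = 1
# (the truth here is a normal density with mean 1.0 and standard deviation 0.5).
grid = np.linspace(-1, 3, 5)
print([round(cond_density(g, 0.5, 1, y, xc, xd, h0=0.2, h1=0.2, lam=0.2), 3) for g in grid])
```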
CONDITIONAL DENSITY BANDWIDTH SELECTION 150 where 1,(v4,24) is defined in (4.24) and m = 75_y As + Cfo A, and Efala)} = DP =v) {ie = aanees} x {/ Il rea} M(o)f (28 —hasa4y— how) det dea ~rt0)+ af 1 Due ed ules 4) - veo} a + de So Mnale) + ofr) 2 (2) ‘Therefore, (5.6) and (5.7) lead to B{Flasu) — Hlaatvla)} =[Io-ana {rein ~ Het ir ) een} ++ Jpatb fotos) + a Som { aa) — (58) Note that. tar [Kn(es Xs) Chal ¥) ~ alla] = 0B [(¥a(2, Xba ln YD] +06) = F(,y)m +0(0) var (f(a;y) - Alaatvl2)) (5.0) where m = (raha. .-hg)~ Combining (5.8) and (5.9) we can see that that CV soho, h, A) = 2-44) x9 (ap, a, 6) (5.10) 160 ssITY ESTIMATION where vio ( x {orate} J eaebfoalw) 1) where the a,'s and 6,’s are defined via hy = no-Wieta, and Ay = yn 2/049), respectively. Letting a, a2...» 0%, Wf... BR denote those values that minimize ‘yar we assume that the o's and b's are uniquely determined and that wee finite. Let Af, sAB,X2,.--,28 denote the values of fa»... that minimize CVgo. ‘Then ho ~ n¥(@*9a and X$ ~ on 2etsheN, ino, given that CV, = OV,o + (8.0), we expect that hy = AE + (6-0) Joa be 9 + (a0), whichis indood trae as the next theorem shows. ‘Theorem 5.1. MMe), + a for 8=0,1,-- 2G), 8 for 8 = Ayooesti (6.12) n8ll+9 inf CV, ~+ in xy in probability ‘The proof of (5.12) is given in Hall et al. (2004). ‘Using Theorem 5.1, we can derive the asymptotic distribution of da{viz). We postpone this result nil Section 5.3 where wo discuss @ rare general ease that allows for the existence of irelevant, covariates. 5.2.2 Maxiinum Likelihood Cross-Validation: Relevant ‘Variables “The least squares cross-validation approach discussed in Section 5.2.1 jhas some desirable optimality properties, which were given in Theorem Su, However, least squares cros-validation in the context of condi- Yional PDFs is computationally costly, particularly when the sample 52. CONDITIONAL DENSITY BANDWIDTH SELECTION 01 siz i large. This is because the objective function CV, (i. fig) ine volves three summations. When the sample size i small, one ean realize substantial spotd improvements by storing temporary weight matrices dn computer memory rather than recompiiting thelr components when computing CV,, but for even moderate sample sizes, the memory re shied for storage of these weight matsices wll quickly overwhelm the tmemory capacity of most computers. This isa classical computational trade-off memory vermis speed In contrast, however, Hkeliiood eros validation has » substantial advantage from s computational perspec tive, a8 we show below "The lkelihood cross-validation method involves choosing fh ‘gy Aty +++1 Ar by maximizing the log-likelihood function £= Ying KlXo, (6.3) where 9-i(ViIXi) = fi(%i,¥i)/rh (Xi), and #4(%i, Yo) and rh4(%) are the leave-one-out kernel estimators of f(%s,¥:) and j(X:), respec- tively. The objective function (5.13) involves one less surmmation than ‘that for least: squares cross-validation, and is therefore less compu- tationally burdensome. One problem with likelihood cross-validation arises when the continuous variables are drawn from fat tailed distri- Imutions. In such eases, likelihood cross-validation tends to oversmooth ‘tho data, and may Iead to inconsistent estimation results (Gee Hall (19874, 19878). 
Two practical methods can be used to prevent this from happen- ing, First, one can compare the likelihood eross-validatory sinoothing parameters with some ad hoc formula (say, fy = tagn—™/¥8), and if the two sets of smoothing parameters are comparable, then it is un- likely that the cross-validation method has oversmoothed. When like- Tihood cross-validation oversmooths the data, there are two possbill- tos: either it leads to inconsistent estimation, or the related variable is ireleyant. We note that, in the latter caso, oversmoothing is in fact highly desirable, To distinguish between these two eases, one can there- fore use a second method that compares the out-of-sample predictions ‘based on likelihood cross-validation with those based upon a paramet- rie method. If the nonparametric method behaves better than or is comparable to its parametrie counterpart on the hold-out data, this indicates that likelihood cross-validation has probably not fallen vie- ‘tim to the potential inconsistency problem. Of course, should the least § 162 5. CONDITIONAL DENSITY ESTIMATION squares cross-validation method be computationally feasible, the least ‘squares cross-validated smoothing parameters should also be computed ‘and they then can be used to judge whether likelihood cross-validation ‘bas produced unreasonably large smoothing parameters. 5.3 Conditional Density Estimation: Irrelevant Variables ‘We now consider the ease for which some ofthe covariates may be irrel- evant, Using notation defined in Section 4.5, for integers 0 < aida < @ nd 0 < rity ©) +1 for) +1.< 9 0; 2495, 228 for <3 rai 1 KBE} fon stcg srs (49) int CV, 2) B int. ~~ a6) The proof of Theorem 5.2 is given by Hall et al, (2004), Like Theo- rem 4.8, Theorem 5.2 states that cross-validated smoothing parameter selection is asymptotically eapable of removing irrelevant condition- ing variables. The following theorem provides the asymptotic normal distribution of ay). ‘Theorem 5.8. Letting j(y|z) be computed using the cross-validation smoothing parameters hoy. .,fbgs Mtye-s Sey then (wha hy)” (aio ~ ata) — So Bulan ~ S> Ba, vs) +N (0,03(8,y)) in distribution, an where era eat eel Ho Bua) = poate) + gro he {GE — Fer aca} ny eo Bales) = SF total) {anes 04 - 22 oop} Zu) = we Ma(yl2)/A(@). 164 5, CONDITIONAL DENSITY ESTIMATION So far we have assumed that ¥ is a continuous random vari fable. Now we tum our attention to the ease for which Y is dis crete. If ¥ assumes cp different values, then we replace the contin- ‘uous kernel kng(y,¥;) = higtk((y — ¥i)/ho) by the categorical kernel Uys Vig Ao) = AY — Ag), where Ni(y) = 10% # y)- Theorem 5.2 can be modified by replacing g1 + 5 with q1 + 4, and by replac- ing n'V(0r+5)ig — of with n®/'™*4 Jo — 2B, where Of is defined in a ‘way similar to 8 for s = 1,...,7. Theorem 5.3 also requires minor modification (eee Exercise 5.1) ‘Theorems 5.2 and 5.3 are quite powerful results, particulaely for data containing a mix of discrete and continuous data, having the po- tential to extend the reach of nonparametric methods. 
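The computational appeal of likelihood cross-validation is easy to see in code. The sketch below evaluates the leave-one-out log-likelihood of (5.13) for a scalar continuous $(Y,X)$ pair and maximizes it over a crude bandwidth grid; a single pass over observations (one summation fewer than the least squares criterion) is what keeps it cheap. The data generating process, grid, and names are our own illustrative choices.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def likelihood_cv(h0, h1, y, x):
    """Leave-one-out log-likelihood (5.13) for the conditional density g(y|x)."""
    n = len(y)
    ll = 0.0
    for i in range(n):
        kx = gauss((np.delete(x, i) - x[i]) / h1) / h1
        ky = gauss((np.delete(y, i) - y[i]) / h0) / h0
        f_i = np.mean(kx * ky)      # leave-one-out f_hat(x_i, y_i)
        mu_i = np.mean(kx)          # leave-one-out mu_hat(x_i)
        ll += np.log(max(f_i / mu_i, 1e-300))
    return ll

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)

grid = [0.05, 0.1, 0.2, 0.4, 0.8]
best = max(((h0, h1) for h0 in grid for h1 in grid),
           key=lambda hh: likelihood_cv(hh[0], hh[1], y, x))
print("likelihood cross-validated (h0, h1):", best)
```

As cautioned above, with fat-tailed data this criterion can oversmooth, so the selected bandwidths should be checked against a rule-of-thumb value or out-of-sample predictions.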
In Section 5.5 ‘we show, via soveral empirical applications, that “irrelevant” variables ‘occur quite often for various datasets and that erose-validated condi- tional density estimates can outperforin some commonly used para- metric methods in terms of their out-of-sample predictions, even for situations in which the number of discrete cells is comparable to the sample size 5.4 The Multivariate Dependent Variables Case In this scotion we consider the estimation of the conditional PDF of Y given X when ¥ is also # general multivariate veetor. Let Z = (X,Y), ‘and we shall also write Z = (Z°,Z#), where Z# consists of r dis- crote variables, and where Z° € RY represents the continuous compo- nents. For expositional simplicity, we first consider the case where Z# is a vector of nominal variables (i.., having no natural ordering) with 2 © TT4{0,1,-..,¢ — 1}. We write ¥ = (¥¢,¥4), X = (X5.X4), ‘nd assume that Y° contains the first g, continuous components of Z while Y# contains the first r, discrete components of 24. Thus, Yo € Re, YE © TT (0)1,.-56e — 1}, X° © RE. Similarly, Xe Tyas, = 1 [As before, let f(2) = f(y.) denote the joint PDF of (¥, X),! s(x) the marginal PDF of X, and g(yjz) = f(y,2)/u(z) the conditional PDF of ¥ given X "Depending on the situation, we sometimes may write the joint density as $e,24) rater than J{y,2), where 2° = (9",2*) and 2 54. THE MULTIVARIATE DEPENDENT VARIABLES CASE 165, We use Zi, to denote the sth component of Zf. As in Section Ba, for Zi, Z4, € (0,1,--.4¢5 — 1}, we define a univariate kernel function U(2§,24,,ds) = 1— Ay if 24 = Zi, and 28, 24.,.,) = Dulles 1) if 24 # ZZ, The product kernel is given by Ly o Tier HZ, 25s A)- Letting 2 denote the sth component of 2, w()) denote a univari- ate kernel function, and W() denote a product kernel function for Z, wwe write Wa, Zhe). ‘To avoid introducing too much notation, we shall use the same notation L(-) and W(:) to denote the product kernel for ¥# and Y°, ion Day eae = TI HOVE, Yih, Ae) and Way wee = TT hg (CVE Y)/h) Similarly, me define La, xp xg = TE CCS, Kf Anse) and Wreaxsoxg TIP hah ((XG — X50) / gyre ‘We estimate f(z) and p(x) by Flare) = fle) = 290K x and ale) LY Kain (618) where Kye = Lyssa ‘Therefore, we estimate g(yj2) 0 and Ky. xp2 = Dax atWha xg ae (x,u)/m(z) by f(o.w) Ale) * (6.19) alule) We choose the smoothing parameters by erose- validation methods ‘which minimize a sample analogue of a weighted Integrated square exo Iny where (J de = Su [ de) In = flav) ~ alyle) ye) dz = Tin ~ 2lan + Too, Jan = flitule) (2) de, ton = Foto) o(ol2) ula), and Ign = flg(yle) u(x) dz. As before Ign, is independent of (h, d). ‘Therefore, minimizing Jy over (2,2) is equivalent to minimiaing Lig ~ Zan Define the lesve-one-out estimators for f(X,¥) and f(X) to be Lae =2 YO Kaa, aud id AX =E YO Kx, (6.20) the 168 5, CONDITIONAL DENSITY ESTIMATION Also define G_)(Xi) to be G(X) = DY keith, 62 aia whore KP, = Dys J Koy nai “Then by arguments similar to those used in Section 5.1, Racine ct al, (2004) show that a cousistent estimator of fig ~ Zay is given by F10%,%1) Hk) evtnay# 2S Cad, 25 629 where f(X4YD, A(X, and G(r) axe defined in (6.20) and G21) “We then choose (h, A) = (It1,.-+slig)Aty-++5Ar) to minimize the ob- jective function CV (RA) defined in (8.22) Let (h,X) denote the exo ‘Midation chooes of (8). 
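For comparison, the least squares criterion has the $\hat I_{1n} - 2\hat I_{2n}$ structure of (5.22). The sketch below specializes it to a single continuous $Y$ and a single continuous $X$ with Gaussian kernels (so that the inner integral over $y$ has a closed form) and omits the weight function $M(\cdot)$; these simplifications, the data, and the grid are our own assumptions. The extra double sum inside $\hat G_{-i}(X_i)$ is what makes this criterion noticeably more expensive than likelihood cross-validation.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def cv_ls_cond_density(h0, h1, y, x):
    """Least squares CV for g(y|x): sample analogue of I_1n - 2*I_2n,
    specialized to scalar continuous y and x, Gaussian kernels, no weight function."""
    n = len(y)
    Kx = gauss((x[:, None] - x[None, :]) / h1) / h1               # K_h1(X_i, X_j)
    Ky = gauss((y[:, None] - y[None, :]) / h0) / h0               # k_h0(Y_i, Y_j)
    # Closed form for the integral of k_h0(y, Y_j) k_h0(y, Y_k) over y (Gaussian kernels):
    Cy = gauss((y[:, None] - y[None, :]) / (np.sqrt(2) * h0)) / (np.sqrt(2) * h0)
    cv = 0.0
    for i in range(n):
        idx = np.flatnonzero(np.arange(n) != i)
        kx = Kx[i, idx]
        mu_i = kx.mean()                                          # mu_hat_{-i}(X_i)
        f_i = np.mean(kx * Ky[i, idx])                            # f_hat_{-i}(X_i, Y_i)
        G_i = kx @ Cy[np.ix_(idx, idx)] @ kx / (n - 1) ** 2       # G_hat_{-i}(X_i)
        cv += G_i / mu_i ** 2 - 2.0 * f_i / mu_i
    return cv / n

rng = np.random.default_rng(6)
n = 150
x = rng.uniform(-1, 1, n)
y = x ** 2 + rng.normal(0, 0.25, n)

grid = [0.05, 0.1, 0.2, 0.4]
best = min(((h0, h1) for h0 in grid for h1 in grid),
           key=lambda hh: cv_ls_cond_density(hh[0], hh[1], y, x))
print("least squares cross-validated (h0, h1):", best)
```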
Since we know that irrelevant independent MEnables will be agymptotically smoothed out, in the analysis below ‘We wll only consider the case where the independent variables ar all Tetovant, Racine et al (2008) show thot ha/A@ % 1 and 4y/X2 % 1, where RY = ean" 449 for 8 = 1, and A2 = eggn“2/(4+9) fora = Iy...y% here eye and ea, are some constants (HO, \®) are the nonstochastic optimal smoothing parame- ters chat minimize the leading term of the cross-validation fanetion (623) ‘Theorem 5.4, proved in Racine et al. (2004), establishes the rate of convergence of (f,) to (h9,2°).. ‘Thoorem 5.4. Under assumptions given in Racine et al. (2004), we re Tits nae Ogineel0) for ¢ = Tyee, and we have Ke af = Opt?) for s = Lyeenst where @ = minf2,4/2} and B= min (1/24/44 +a} Given Theorem 5.4, one can further show the following: ‘Theorem 5.5. Define Bas(2) = (1/2)mafosl yb) /a(2) 54, THE MULTIVARIATE DEPENDENT VARIAL cas for 8 =1,...44y, and let Bis(z) = (1/2), for 8 = y+ ly-..y9. Also, define Y no4 9/09 se Ho(x)g(ylz))/ ul) Ba = eal for s=1esesty, 1 Bas(2) D bet 212 v4.04) — ove) nla wD] ma) ot and Q(z) = n%g(yle)/u(x). Then fors=ry+ (hg [aun = alole) - OH Bue) -O Arnis] converges to W (0,0(e) in stdin. ‘The proof of Theorem 5.5 is given in Section 5.4.2. ‘We now discuss how to extend the above analysis to also cover the ‘case for which Z¢ contains ordered categorical variables. 5.4.1 The General Categorical Data Case . For ordered categorical variables, Aitchison and Aitken (1978, p. 29) suggest using the kernel weight funetion given by Laat Yh 0) = (xa =A) when [Yi —¥f (6.24) for 0 < t rnto| } EM et slv.2)) = ¥ O,ntotale)/alz) (6.8) 5 5.5 Applications 5.5.1 A Nonparametric Analysis of Corruption Political corcuption (or the perception thereof) plays a central role in political theory and in the theory of economic development, Whether ‘or not there is convergence in corruption across countries over timo, or whether similarities in the distribution of corruption persist overtime remains a subject of ongoing debate. ‘The Corruption Perception Index (CPI)? ranks countries in terms of perceived corruption. ‘The CPT lies in the range 0-10 with 10 indicating the absenco of corruption. In the spirit of analysis done by McAdam and Rummel (2004), who considered a pane! of 40 countries having full records forthe years 1995-2002, we erate a balanoed panel of 45 counties having fall records for the nine year period 1996-2004. For this panel of n = 405 observation, we estimate the conditional PDF of CPI conditional upon year using the approach discussed in this chapter. ‘The estimated conditional PDF is plotied in Figur 6.1 Wo trent year as an ordered categorical variable, and the LSCV bandwidths are hort = 0.224 and Aye = 1.00 (the upper bound value for Ayer). The crose-validated value of Spar suggests that pooling of the data forall years is appropriate, ie, that there has been no.ap- preciable change in the distribution over time, This finding supports http. tranaparencyore/cpi 1m 5. CONDITIONAL DENSITY ESTIMA} A(CPIYear) 0.25 0.20 05, 0.0 0.05 0.00 Figure 5.1: The conditional PDF of the Corruption Perception Index for 1996-2004, n = 405. the persistence hypothesis (i.e., distributional stability across time), ‘while the shape of the estimated PDF is consistent with multiple equi- libria (Je., multiple modes). These results are consistent with those of McAdam and Rummel (2004, p. 
509), who report that “our find- {ngs lend support to concerns expressed in the theoretical literabure - namely, that corruption can be highly persistent, and characterized by multiple equilibria.” 5.5.2 Extramarital Affairs Data In a widely cited paper, Fair (1978) proposes a “theory of extranmari- tal affairs” and considers the allocation of an individual's time among ‘work and two types of leisure activities, time spent with one’s spouse ‘and time spent with one’s “paramour.” The unique dataset is taken {from two magazino surveys, and Fair considers a parametric Tobit e=- timator for modeling the number of extramarital affairs per year. This 55. APPLICATIONS 1 be more robust and thus provide a viable alternative to the conditional ‘mean regression function. In addition to being robust to the presence of ‘censoring, a more comprehensive picture of the conditional distribution ‘of a dependent variable can often be obtained by estimating a set of conditional quantiles rather than by simply presenting the conditional ‘mean itself. For a complete treatment of quantile regression methods, wwe direct the reader to Koenker (2005). In this chapter we begin by studying how to estimate a conditional CDP in nonparametric settings. We first discuss the estimation of a conditional CDF heving only continuous covariates. The conditional ‘quantile function that is often of direct interest can be obtained rectly from the conditional CDF, and is obtained by simply inverting the cosiditional CDF. Having considered the case with continuous data, only, we then show that the results can be easily extonded to the mixed 00, nhy...hy > 00 and hy +0 for all 8 = 1,..-44. For § = Tyecagy let Aslu,2) = Aly, 2)/Oz4 and Aus(ynz) = PAly,2)/0x2, Aoly.2) = OACy,2)/0y, Aca(wn2) = 04%Xy,2)/0y?, a = f w(u)utdy, and x= f w(o)Pdv, where we? = [Ty f w(vs)2dey = Two. i a ‘We write F(yle)—P(y|2) = N(y,2)/ A(x), where 1(y, 2 =Flyl2)}i(2). The next theorem. Flue) = Fle) the asymptotic distribution of, ‘Theorem 6.1. Under Assumptions 6.1 and 6.2, and also assuming that p(2) > 0 and F(yl2) > 0, we have oa. ESTINAN A. CONDITIONAL CDF: NONSMOOTH Y_ 185 (@) ELM (y2)] = ule) [hy HEBa(vs)] + (Sta hd), where Buy.) = (1/2)¥2[Frolule) + 2e(2)Fe(yle)/(2)], (i) vas{ (9, 2)] = (hy)! (2)PByet0 (nhs Bg), where “P(yle)(l — Fyle)]/a(a). hg)" D2 HE = oft), then (maby... ha)?? [Pete Fore.) 4 N(O,2 ye): i Bye (ai) (oy, ‘The proof of Theorem 6.1 is given in Section 6.9. By using Poole) — Fyle) = Mt(y,2)/ii(x) = M(y2)/ule) + Op (Se + (rhe. wr) , ‘Theorem 6.1 implies that MSE[(viz)] = is MBAlya | + ms +(s0), (62) where (6.0) denotes (omitted) smaller order ters. If'we choose hi, .. fg to minimize a weighted integrated MSE given, by [MSE[F(yja}}s(y,-2)dedy, where s(y,2) is a nonnegative weight function, then from (6.2) its easy to show thatthe optimal smoothing parameters that minimize the inagrated MSE should be hy ~ 11/440 for 8 = 1,.--,4- However, there doesnot exist closed form expression for the optimal hy’. Though one ean compute plug-in hs, iti com- putationally challenging to sey the least. Li and Racine (forthcoming) Fecommend using a least squares crose-validation method of bandwideh felection designod forthe estimation of conditional PDFS. We showed in Chapter 6 that, for conditional PDF estimation, optimal smoothing yields bandwidths of the form fo ~ hy ~ 1/649 for alls = 1,90 Note that the exponential factor is ~1/{6 + ) becanse the dimension of (yz) is q+ 1 since we assumed that ¥ is a continuous variable. 
We can also multiply the conditional LSCV smoothing parameters by the factor n(®+#)/(4+9) to obtain the correct optimal rate for the condi- tional CDF optimal smoothing parameters, ‘The simulations reported in Li and Racine indleate that this approech performs wel 181 6. CONDITIONAL CDF AND QUANTILE ESTIMATION 6.2 Estimating a Conditional CDF with Continuous Covariates Smoothing the Dependent Variable In this section we discuss an alternative estimator that also smooths the dependent variable Y as we did for the unconditional CDF estimator developed in Section 1.4. That is, we can also estimate F(yle) by 6 (at) F(ule) = Ft (6.3) ite) is a kernel CDF chosen by the researcher, say, the standard normal CDF, and hy is the smocthing parameter associsted with ¥ Tn addition to Assumption 6.1, we also need to assume that as n = 00, ho = 0. Defining M(y,2) = Fols)] Ala), then F(ulz) — Pole) = Mt(y,2)/Ale). We have the following result. Theorem 6.2. Define Bo(y,:2) = (1/2)%aFoo(yla) and let M(y,2) = 4 Fo(ylx)/u(e). Note that Syjz and By(x) are the same as those defined in Theorem 6.1. Letting \hl? = h3 + Sot, hE = Sot ght, then under conditions similar to those given in Theorem 6.1, we have () BLM (y,2)] = w(2) Dono MEBs(y, 2) + ofA). (4) vasl ty, 2] = (MFT) H(z)? Bye — RoCKMY,2)] + 0 ((nHQ)™), where Hy = hy...hy and Cy =2 [ G(v)w(vode. (48) If (0a Ihg)? Dg (1), then (ne t)* [Foi ~ Flu) ~ > Baly.2)] 5 NOB). ‘The proof of Theorem 6.2 is given in Section 6.9. ‘Theorem 6.2 implies that hgh M(y, 2) Ty hg Hee (6) 62,_ ESTIMATING A CONDITIONAL CDP: SMOOTH ¥_ 185 ‘One may choose the smoothing parameters to minimize the leading term of a WIMSE of F(y\r) given by WIMSE = cf {is TBA, | = aC Fae Gh aad (65) whore J dedy = Seeep | ded. Were one to adopt a plug-in method of bandwidth selection, one would first estimate By(y,), Dyje and 2(y,2), which requires one to choose initial “plot” smoothing parameters, while thp ageurate numer- ical computation of a q+ 1 dimensional integration can be extremely dificult. However, well developed automatic data-driven methods do exist for selecting the smoothing parameters for estimating a condl- tional PDP. Given the lack of dato-driven methods for conditional CDF estimation having desirable properties, we again suggest using the cross-validation methods that have been developed for conditional PDF estimation. As discussed in Chapter 5 for conditional PDFS, op- timal smoothing gives hy ~ hy ~ n+ for all $= 1,...,g. Below wwe show that optimal smoothing for conditional CDPs recites that hy = off) for 5 = 1,--.99 and that hy ~ n/), For the sake of clarity, we Brst consider the case of g = 1 and then turn to the general Consider the case for which ¢ = 1. From (6.4) we know that the ‘weighted MSB is given by wise = [ MsElF(yie)is(v.2)dude = Ahi + Arhgh} + Aaht + Aa(nhn)! = Adho(nky) + O(n), (68) where = 1-+-+ (B+ 8)(nh)~!, and the Ay's ate some constants 180 )P AND QUANTILE ESTIMATION given by, Aa= f Boti2yPolnz)dnde, Boly.2) Bry, 2)s(y,2)dyde, dam f Biluyotu2)dvde, As Ae Syea(yz)dyde, and Ry 2)s(y, r)dyde. All the Aj’s, except Ay, are positive, while Ay can be positive, negative ‘The fist order conditions are SUE = AA + 2Art — Adinhy) + (00) on PROSE — aAyighs + 4Aabt — Asn)! + (20) (08) where (s.0,) denotes smaller order terms. From (6.7) and (6,8), we can easily see that h and fy eannot have the same order. Furthermore, it can be shown that Ay must have an order smaller than that of fy. Assuming that ho = ogn-® and hy eqn, then we must have a > 8. 
Then, from (68) we get 8 ~ 1/5, and substituting this into (6.7) we obtain a = 2/5. Thus, optimal smoothing requires that fi ~/n-2/° and hy ~ m2. For the general ease of @ > 1, by symmetry all h,’s should have the same order, say; fhy ~ m~? for § = Ty... and fa ~ n~®. Then itis easy to show that = 1/(4 +9) and a =2/(4+q). Thus, the optimal fy M49) for 5 = 1,...,g, and ho ~ n2/(4+4), Up to now we have focused on a local constant estimator of Piy|2) We can also use local linear methods to estimate F(y|z). As we dis cussed earlier, F(y|r) = E[2(%i < y)|Xi = a] is the conditional mean function of 1(; = y) conditional on X; =<. Thus, we ean also use the local linear method to estimate this conditional mean function by re- agressing 1(¥4 < y) on (1,(X;—2)) using kernel weights. The resulting TOne can also try to we a =P, which wil lead co 8 = 1/5 from (68) and 9 = 1/A from (6.7, a contradiction, Similary, assuming that a < Balko leads to.a ‘outradition. Therefor, we must have a> 62,_ ESTIMATING A CONDITIONAL CDF: SMOOTH Y 187 intercept estimator will be the local linear estimator of F(yle). Tt ean bbe shown that this estimator has a leading bias term given by (02/2) S>h2Fes(ule) (6.9) and a leading variance term of Xyj., which is the same as that of the local constant estimator given in Theorem 6.1. However, as pointed out bby Hall, Wolff and Yao (1999), two undesirable properties are associated with the local linear estimator: (i) it may not be monotone in Y’, and i) it is not be constrained to take values in (0,1) Motivated by the empirical likelihood approach, Hall et al. (1999) ‘and Cai (2002) introduce a weighted local constant estimator to aver- come problems arising from the use of a local linear approach. A ‘weighted local constant estimator is given by Dhppile) Kn XG, 2)1 < v) - Cee) Kn Xa) where pi(r), for ¢ = 1,...,n, denotes the weight functions of the data, Xiy.++)Xq and the design point 2 satisfying the properties pi(x) > 0, and Flyia) = (6.10) Des apeyKlXu2) (oan) Condition (6.11) is motivated by the local linear estimator, which en- sures thatthe estimation bis has the same order of magnitude in the Boundary and interior rogons. Tt also reduces the number of leading bias terms in the Taylor expansion, However, pi(a)s satisfying these conditions are not uniquely defined (since we have n parameters of pla), only q-+ 2 restrictions). Hall et al. and Cai suggest choosing the pi(2)’s by maximizing TT? p(c) subject to these constraints, This leads tothe following optnization problem: ‘gaan Doln p(c), subject to (6.11), pila) 2 0 and S22 pila) = a (6.12) Letting 7 be the Lagrangian multiplier associated with condition (6.11), then (see the hints given in Exercise 6.5) Ae) 7 ADR an 188 6 where angmax St inl +-4(Xi = 2) KA(%2))- (61a) Equation (6.14) does not have a closed form solution. Cai (2002) recommends using the Newton-Raphson scheme to find the root (soli- tion) of 7 based on (6.14). Under regularity conditions similar to those given in Theorem 6.2, Cai establishes the asymptotic distribution of F(ylz), which is given by (nh) | Pla) — Fluke) — So A2Balys2)Fea(vle)| & N(O,Z ye), (6.18) where By (ys) = (1/2)e2F (ul) ‘Eaton (6.18) shows that F(y|x) has the same (fnst order) ssymp- totic distribution as a local linear kernel estimator of F(y|z). It also has the additional advantage of being monotone in Y and taking values only in (0,1). A disndvantage is that numerical optimization routines are required for its computation. 
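For reference, the simpler local linear estimator of $F(y|x)$ described before (6.9), obtained by a kernel-weighted least squares regression of $\mathbf{1}(Y_i \le y)$ on $(1, X_i - x)$, takes only a few lines, although as noted it is not guaranteed to be monotone in $y$ or to remain inside $[0,1]$. The sketch below uses simulated data and a hand-picked bandwidth; the names and design are ours.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def loclin_cdf(y0, x0, y, x, h):
    """Local linear estimate of F(y0|x0): kernel-weighted least squares regression
    of 1(Y_i <= y0) on (1, X_i - x0); the intercept estimates F(y0|x0)."""
    d = (y <= y0).astype(float)
    w = np.sqrt(gauss((x - x0) / h))
    Z = np.column_stack([np.ones_like(x), x - x0])
    beta, *_ = np.linalg.lstsq(w[:, None] * Z, w * d, rcond=None)
    return beta[0]

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(-1, 1, n)
y = x + rng.normal(0, 0.5, n)

# Under this design F(0.5 | x = 0) = Phi(1), roughly 0.841.
print(loclin_cdf(0.5, 0.0, y, x, h=0.2))
```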
When X has bounded support, Cai (2002) further shows that $\tilde F(y|x)$ has the same order of bias ($\sum_{s=1}^q h_s^2$) and variance at the boundary of the support of X, i.e., this approach is immune to boundary effects in bounded support settings.

We observe that the above weighted local constant estimator has properties similar to those of the local linear estimator. One disadvantage mentioned above is that its computation requires nonlinear optimization procedures, while the local constant and local linear estimators possess closed form solutions and can be readily computed. The problem arising with the local linear estimator is that the local weight function may assume negative values, hence the resulting estimator of F(y|x) may assume values lying outside the unit interval [0,1]. Hansen (2004) proposes a modified local linear estimator of F(y|x) in which he replaces the negative weight (when it occurs) by 0. This modification is asymptotically negligible, but it constrains the resulting local linear estimator of F(y|x) to assume values in [0,1], leading to a valid conditional CDF estimator. Hansen shows that the asymptotic distribution of his modified local linear estimator is first order equivalent to $\tilde F(y|x)$ given in (6.15).

The nonparametric estimation of a conditional CDF has widespread potential application in economics. For example, such estimators have been used to identify and estimate nonadditive and nonseparable functions; see Matzkin (2003) and Altonji and Matzkin (2005).

Up to now we have focused on nonparametric estimation of a conditional CDF. When the dimension of the covariates is high, the curse of dimensionality may prevent accurate nonparametric estimation. Hall and Yao (2005) propose using dimension reducing methods to approximate the conditional CDF. Specifically, they use F(Y|X'λ) to approximate the true conditional CDF F(Y|X), where X is of dimension q and λ is a q × 1 vector of unknown parameters. Their estimation method is similar in spirit to the single index model method (see Chapter 8). Since X'λ is a scalar (i.e., univariate), it involves one-dimensional nonparametric estimation only and hence is immune to the curse of dimensionality critique.

6.3 Nonparametric Estimation of Conditional Quantile Functions

Quantile regression methods are widely used by practitioners. For a comprehensive analysis of a range of parametric methods we direct the reader to Koenker (2005). One reason for the popularity of these methods is that they more fully characterize the conditional CDF of a variable of interest, in contrast to regression models that provide the conditional mean only.

The unconditional αth quantile of a CDF F(·) is defined as

$$q_\alpha = \inf\{y : F(y) \ge \alpha\} = F^{-1}(\alpha), \qquad (6.16)$$

where $\alpha \in (0,1)$.

For example, let Y ~ N(0,1) and let F(·) be the associated standard normal CDF. Then $q_{.5} = 0$ since F(0) = 0.5 and F(0 − ε) < 0.5 for any ε > 0. To determine a quantile, say, $q_{.95}$, we work with the inverse CDF. That is, since $q_{.95} = F^{-1}(.95)$ and $F(q_{.95}) = F(F^{-1}(.95)) = 0.95$, we obtain $q_{.95} = 1.645$ because P[N(0,1) ≤ 1.645] = F(1.645) = 0.95. In general, if we let X denote a random variable with CDF F(·), then the method for finding $q_\alpha$ involves applying F to $q_\alpha = F^{-1}(\alpha)$ to get $F(q_\alpha) = P[X \le q_\alpha] = \alpha$, i.e., applying F to (6.16).

Again by way of example, suppose that X represents family income in a given year.
Then qzs is that level of income such that 25% of all family incomes lie below i Often, it is a conditional rather than unconditional quantile that is of interest. The conditional ath quantile is defined by (a € (0,1)) da(2) = inf{y: Fyfe) > a) = FYala). (17) For example, suppose that X represents age and ¥ represents in- dividual incomes. Then 425(2 = 30) is that level of income for 90 year olds such that 25% of 30 year olds have incomes below it. In practice, we can estimate the conditional quantile function ga(2) by inverting an estimated conditional CDF function, Using the estima tore with a smooth function and an indicator function respectively for Y,, we would obtain Goa) =inf{y : Pula) 2 a} = F'(alz), (18) da(a) = inf{y: Flyjz) > a) = F(ale), (6.19) Because F(ylx) (P(y[2) lies between zero and one and is monotone iY, do(2) (Ga(2)) always exists. Therefore, once one obtains F(y\2) (Piz), it is teivial to compute da(2) (da(2)) using (6.18) (6.19). ‘We interpret (6.18) to mean that, for a given value of @ and 2, we solve for ga(x) by choosing qq to minimize the objective function Gal) = argmin|os— PCale). (6.20) ‘That is, the resulting value of q that minimizes (6.20) ts da(«) We assume that F(yjz) has a conditional PDF (viz), f(vl2) continuous ia, and f(aa(2)|2) > 0. Note that f(ylz) = Foyle) since Falls) = OF (via) /Oy. The next two theorems give the asymptotic distribution of a(x) and (2) ‘Theorem 6.3. For s=1y...54, define __Balts2) Boalt) = Fg cays" where By(y,2) is defined in Theorem 6.1. Then under conditions sim- ‘lar to those given in Theorem 6.1, we have (ra --thg)"?? |a(@) — galt) — S>HEBas(2)| 4 NCO, Val2)), = 64. THE CH FUNCTION APPROACH a1 where Vaz) = a(1 — a)n*/ [f?(ga(2)|2)H(da())] ‘Theorem 6.4. Define Bao(2) = Fo(ylz)/n(2) and Ba(2) i the same 1s in Theorem 6.3 for s = 1,....g. Then under conditions similar to those given in Theorem 6.2, we have (uly hg)*? | dat) ~ gat) — > HBa(2)| 4 N(O,Vale)), 14) and Va(a) are the same as defined im Theorem. ‘The proof of Theorem 6.3 is left as an exercise and the proof of ‘Theorem 6.4 is given in Section 6.9.1. One can also estimate qa() by inverting the weighted local constant, catimator of F(y/z) given in (6.10), Le., ea a(x) = inf{y : P(y|x) > a} = F*(ale), (6.21) Cai (2002) shows that (rah... hg)"?? |e) — dale) — S7 AE Badal), 2)/ 4 Gal) |2)| 4.N(0,Vat=)), (6.22) where the B,(.,)'s are defined in (6.15) and where Va(z) isthe same 4s that defined in Theorem 6.4 6.4 The Check Function Approach ‘A popular alternative estimator for ga(2) can be derived using the s0- called check function; see Chandhuri (1991), Cheudhuri, Doksum and Samarov (1997), Jones and Hall (1990), Yu and Jones (1997, 1998), Honda (2000), Chong and Peng (2002), Whang (2006), and the refer ‘ences therein, The check function gets its name from the shape of the underlying objective function. Note that L-norm objective functions suchas those based on least squares methoxls yield U-shaped objective functions, while Lr-norm objective functions such as those based on 2 6, CONDITIONAL CDF AND QUANTILE ESTIMATION least absolute deviations yield a V-shaped or “check-shaped” objective funetion. ‘The popular parametric linear quantile regression model (Koenker ‘and Bassett (1978)) is of the form Y= XiG+ uy, (623) and from (6.23) we obtain qa(Vi|%:) = X18 + ga(tu|Xi)- One eannot separately identify and gq(t|Xi, Just as one cannot separately iden- tify the intercept and E(u) in a least squares regression model (i.e. where we impose Eu) = 0). 
We can rewrite this model as ¥; = dal VIXi) + Ya = XYBa + Yass (6.24) where vag = th ~ da(tulXi). By definition ga(vas|Xi) = 0. Basically, ‘we impose the condition that the conditional ath quantile of the error process is equal to zero It is well known that one can estimate Ja in (6.24) by minimizing the following objective function: Ba = argmain > pal ~ XB), (6.25) here pa(2) = 1(z < 0)], the check function; see Koenker and Bassett (1978) for details. Also, see He and Zin (2008) for a lack-of fit ‘est for parameteic quantile regression models. Equation (6.25) can be shown to be equivalent to wom D M-xl+0-a) Dm -xah ; vats veo (6.26) where we make use of the fact that —2 = |z| for <0. Koenker and Bassett (1978) showed that Via = Ba) % N (0,08(6(%6X))1) (1 — a) foa(0), and fos () isthe PDF of vo ‘This check funetion approach can also be adopted for th estimation of a nonparametric quantile regression model. We use « local constant, ‘method (or, alternatively, we could use a local linear method), and we choose « to minimize the following objective function: min 95 pal¥j — a) W(X), (6.27) vla.— 1(v <0). By letting goc(c) denote the value of 4 that minimizes (6.27), it can be shown that da(2) is @ consistent tstimator for qa(2) ‘The leading MSE of dao(2) can be found in Jones and Hall (1990) and Yu and Jones (1997). The leading bias and variance terms of fns(2) ate exactly the same as those for Ga(e) given in Theorem 6.3 Therefore, ga,c(7) hes the same asymptotic distribution as that of Ga (ct) presented in Theorem 63 One can also use a local linear method by replacing (6.27) by the following objective function: min pal) 0 — (Xj —2)'8)Wi(X}, =), (6.28) where the minimizer a is the local linear estimator of ga(x), and b ‘estimates the derivative of ga(). Let da.u(2) denote the resulting estimator of ga(). The leading MSE term of qai(2) can be found in Fan, Hu and Truong (1994) and ‘Yu and Jones (1097). The leading variance term i the same as that for the local constant estimator (hence identical to that given in Theorem 6.3), while the leading bias term is given by isda) = pra (6.29) m where ga,se(t) = Fda (#)/Ox%. 6.5 Conditional CDF and Quantile Estimation with Mixed Discrete and Continuous Covariates When X isa vector of mixed discrete and continous variables, we need to replace the kernel function by a generalized product kernel appropri- ate for mixed data types. We first discuss estimating a conditional CDF 194 6, CONDITIONAL CDF AND QUANTILE nIMATION without smoothing the dependent variable ¥. We estimate F(yla) by . nS 1% < y)Ky(Xe,) Fs) EEE M s Die), ica where fi(z) = n7'K,(X;,22) is the kernel estimator of (2), Ky(Xi.2) = WAKE 2°) XE. 24), Wa(Xf,2") = [Ty hts = #2)/ha)y AXE et) = [Ts XE, 2 Ay), and (CE, X,) = ACCS = a) LXE A xf) Should we elect to smooth the dependent variable Y, we obtain another estimator of F(ylz) given by "1G Folz) 7 (6.31) where G() is a CDF, say, the standard normal CDF, and hg is the smoothing parameter associated with Y, Define M(y,2) = [F(u/x) — F(yla)|A(z) and also let M(y,2) = [Plul2) — Flui2) |e). The following two theorems give the asymptotic distributions of P(y|x) and F(y2), respectively. ‘Theorem 6.5. Under regularity conditions similar to those given in ‘Theorem 6.1 but with the differentiabtity condition changed with respect to x* (see Li and Racine (forthcoming) for details), we have (1) Letting ty =| (nh hg) RP then MSE[A7(y,2)] = {Seim.00 +9 >» Balle | + Volyla)(nri won +f), where Busy2) = (/2)ra12Flvledua(2) + H(2)Fe(ula)] ond Basta) = wa)? 
Yo ale 24) [Fle ale 24) aoe = Flule)ute)], here 1o(u4s24) is defined in (4.24). 55. CONDITIONAL CDF AND QUANTILE ESTIMATION: MIXED DATA 195 (i) Letting Yj = Fy) ~ Flylel/nte), then (nbs. na | Pot = Flula) - NButie) Patio * N(0, Eis) ‘Theorem 6.6. Define |A] = Ti, As, Bro(vle) = (1/2)s2Feo(ulz), Ay, 2) = w*Folyla)/u(a), and let Syje, Brs(ysa) and Broly,22) be the same as those defined in Theorem 6.5. Then under assumptions given fn Li and Racine (fortheoming), we have MSE{XZ(y,2)] = n(2) {Sra +See} = a He)? Bye ~ hoy, ©)] c Te hg + Olt where tha = [Rl + [Al + (ry bg) “2(URIP + [AI), and F (ule) — Plule) — S943 Baaly.z) — y vant * N(0,34j0)- ‘The proofs of Theorems 6.5 and 6.6 are given in Section 6.9.2. As discussed above, to obtain a quantile estimator, invert F(.): Gale) = Fal). In practice we compute Ga(z) via (hash)! daft) = arg ja ~ F(a ‘The asymptotic distribution of ga(z) is given in the next theorem. ‘Theorem 6.7. Define By.alt) = Bn(ga(2)|2)/f(da(x) |x), where (y = Go(2)) Bayle) = [Xo heBie(ylz) + Ls AsB2a(yl2)] #9 the leading bias term of F(ylz). ‘Then under the same conditions as those used in Theorem 6.6, we have 196 6. CONDITIONAL CDF AND QUANTILE ESTIMATION (@ Gal) — dal) in probability, and (3) (hy «-ha) Eda) ~ a2) ~ Bral)] —» N(O,Valxe)) in dis- tribution, where Vale) = a(l ~ ade! {f*(da(2)|x)(dae))] = V(qa(e ie) S"(dae)z) (since «= F(da(e)|e)) ‘The proof of ‘Theorem 6.7 is similar to the proof of Theorem 6.4 and is therefore omitted here. In Exercise 6.4 the reader is asked to show that the optimal smooth- ing parameters for estimating F(y|zr) should satisfy hy ~ n~¥/+9) for 8 Lyssa dy 2/60) for s = 1,.-.,7, and ho ~ 272/49). Unfor- ‘omately, to the best of our knowledige, there doos not exist an automatic data-driven method for optimally selecting bandwidths when estimat- ing a conditional CDP in the sense that a weighted integrated MSE is ‘minimized. Given the close relationship between the conditional PDF ‘and the conditional CDF, Li and Racine (forthcoming) recommend us- ing the least squares cross-validation method based on conditional PDF cstimation when selecting bandwidths and using them for estimating the conditional CDF function (y|x) and the conditional quantile func- tion Ga() Let hy and Xe denote the values of hy and As selected by mini- 1izing the PDF cross-validation function (soe Chapter 5). Recall that in Chapter 5 we showed that the optimal bandwidths are of order fig ~ n-UO649 for 5 = 0,1,-..59, and Ay ~ 2-240) for § = 1,..-57 (assuming that Y is a continuous variable). To obtain bandwidths that have the correct optimal rates for F(y|x), Li and Racine (fortheoming) nF (8 = 1...) recommend using fig = Roni #s, fy and 1, = Jona-7 (8 = 1,...,7) ‘Theorems 6.6 and 6.7 hold true with the above (stochastic) band- widths fo, hey Avo 6.6 A Small Monte Carlo Simulation Study ‘We briefly investigate the finite-sample performance of the conditional quantile estimator defined in (6.18) and (6.19), following Li and Racine (forthcoming). The DGP we consider is given by Y= By + XA + BX + Hsin(XP) +6, where Xf ~ uniform|—2r, 27}, Xf and Xf are both binomial (the total number of discrete cells is c = 4), and ¢; ~ N(0,07) with o = 0.50 and 66._A SMALL MONTE CARLO SIMULATION STUDY. at ‘Table 6.1: MSE performance of the nonsmooth and smooth quantile estimators Da) hy = ho MSE 1.06¢ nH Tadieator 00 L060.n-™5 9 L060yn-/ 0.41 106en# SCV 1.06a4n-/* 0.38, LscV SCV. —LSCV. 
We consider the estimator given in (6.18) that treats $Y$ using an indicator function, i.e., $\mathbf{1}(y - Y_i \ge 0)$, and the estimator given in (6.19) that uses a smooth function, $G((y - Y_i)/h_0)$, where $G(\cdot)$ is the Gaussian CDF kernel. Regardless of the kernel function chosen for $G((y - Y_i)/h_0)$, as $h_0 \to 0$, $G((y - Y_i)/h_0) \to \mathbf{1}(y - Y_i \ge 0)$; hence to generate results for $\mathbf{1}(y - Y_i \ge 0)$ we simply set $h_0 = 0$ (or, in practice, choose a sufficiently small $h_0$).

We conduct 1,000 Monte Carlo replications, each time computing the MSE of the quantile estimator with $\alpha = 0.50$. We report the median relative MSE over the 1,000 replications, and investigate the relative performance of the nonsmooth and smooth quantile approaches based on a variety of bandwidth selectors. Results are summarized in Table 6.1, where numbers less than one indicate superior performance of the smooth quantile estimator (6.19).

From Table 6.1 we observe that by smoothing $Y$ with an ad hoc formula $h_0 = 1.06\hat\sigma_y n^{-1/5}$, MSE is reduced by 20% compared with the estimator that uses a nonsmooth indicator function for $Y$. Next, smoothing over the discrete covariates by LSCV reduces the MSE by another 20%. Finally, choosing $h_x$ and $h_0$ also by LSCV (rather than using the ad hoc formula) reduces the MSE by a further 25%. Using LSCV reduces the MSE by half compared with ad hoc choices of $h_0$ or indicator functions for the dependent variable and the discrete covariates. Note that in the above experiment, all covariates are relevant. In practice it is often the case that some covariates are in fact irrelevant (i.e., are independent of the dependent variable), and in such cases the efficiency gains achieved by using LSCV will be even larger, since the LSCV approach can automatically smooth out irrelevant covariates.

Nonparametric quantile regression with dependent data was considered by Cai (2002). See also Koenker and Xiao (2002), who discuss inference regarding the quantile regression process.

6.7 Nonparametric Estimation of Hazard Functions

By combining the results on conditional CDF estimation discussed above and the conditional PDF estimator discussed in Chapter 5, nonparametric estimators of hazard functions can easily be obtained. We first provide a formal definition of a hazard function.

Definition 6.1. Let $T$ denote a random variable that stands for the time of exit from a state (e.g., no longer unemployed). The probability that a person who, having occupied a state up to time $t$, leaves it in the short time interval of length $dt$ after $t$ is $P(t \le T < t + dt \mid T \ge t)$. The hazard function is defined as
\[
h(t) = \lim_{dt \to 0} \frac{P(t \le T < t + dt \mid T \ge t)}{dt},
\]
which is the instantaneous rate of exit. Roughly speaking, the hazard function can be interpreted so that $h(t)dt$ is the probability of exit from a state in a short time interval of length $dt$ after $t$, conditional on the state still being occupied at $t$.

Hazard functions are often used to describe the probability of exiting from unemployed status (i.e., getting a job) in an immediate future time interval (say, next week), conditional on the individual being unemployed at present (up to this week). They are also often used to describe the probability of death of a patient in the next time period (say, next week) conditional on the patient being alive up to this week.

It can be shown that the hazard function can be written as (see the hint in Exercise 6.6)
\[
h(t) = \frac{f_T(t)}{1 - F_T(t)} = -\frac{d \ln(1 - F_T(t))}{dt}. \tag{6.32}
\]
Hence, $f_T(t) = h(t)\exp\left(-\int_0^t h(v)\,dv\right)$.
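As a quick numerical illustration of (6.32) and of the identity $f_T(t) = h(t)\exp(-\int_0^t h(v)dv)$, the following sketch (Python; the exponential lifetime distribution is chosen purely for illustration) checks both relations:

```python
import numpy as np
from scipy.stats import expon
from scipy.integrate import quad

lam = 0.5                      # exponential rate; true hazard is h(t) = lam for all t
t = np.linspace(0.1, 5, 10)

# hazard via h(t) = f(t) / (1 - F(t)), cf. (6.32)
h = expon.pdf(t, scale=1/lam) / expon.sf(t, scale=1/lam)
print(np.allclose(h, lam))     # True: the exponential has a constant hazard

# recover the density via f(t) = h(t) * exp(-integral_0^t h(v) dv)
f_rec = [lam * np.exp(-quad(lambda v: lam, 0, ti)[0]) for ti in t]
print(np.allclose(f_rec, expon.pdf(t, scale=1/lam)))   # True
```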
1 — Fr(t) is called the “Survival function.” This is because F(t) = PUT © #) is the probability of exit (bay, death) before t, hence 1 — Fr() = PP > t) is the probablity of survival beyond period t 67. NONPARAMETRIC ESTIMATION OF H. (RD FUNCTIONS __199 One commonly used parametric PDF in hazard and duration anal- ysis is the Weibull distribution, whose PDF is given by Sr (tia,b) = alt’ exp(—at"), ¢ > 0,0,0 >0. ‘The hazard function is given by A(z) = bat Hence the hazard function is strictly increasing if b > 1 and strictly decreasing if 6 < 1. When b = 1, the Weibull distribution reduces to ‘the exponential distr ‘with constant hazard. The Weibull distri- bution is often used to model unemployment data. Often interest lies in a hazard rate conditional on the outcome of & set of variables zx. The conditional hazard function is defined as fo(tx) = Fr(tls)’ Ate) (>0. = (633) For example, T' may be the time of death of cancer patients and 2 contains treatment type and individual characteristics. We assume that x contains q continuous and r discrete variables. Nonparametric estimation of h(t/2) can be readily obtained by replacing fr(tlz) and Fo(t{2) by their nonparametric estimators, fr(tle) Alte) 1 Ft (6.34) where fr(t\2) and Pr(t}z) are defined in Chapter 5 and Section 6.5, respectively. From Fr(tlx) = Fr(tiz) = Op( 4h? + DL Ae + (nhy ...tg)"}, it is easy to see that the asymptotic distribution of ‘A(t{z) is the same as that of fr(tlz)/ay2, where az = 1 — Fr(tlz). Honce, by Theorem 5.3 we immediately know that (um, oar eC) ag [Sau 2m +O Boalt »| } an (0 “2 : (638) 200 6. CONDITIONAL CDF AND QUANTILE ESTIMATION where By(t2) and Bot) ate defined in ‘Theorem 53 Th the decussion above we assume that the data is not censored In applications sucha fllon-ap surveys fr patients having recived a pasticular treatment, a patient may fal to respond to a survey before fying and the data wl thereby be censored. We will discuss how to handle censored data in nonparametric and semiparametric settings in Chapter 1 ‘Alo, when one estimates a hazard function in high dimensional settings, the curse of dimensionality may preclude accurate nonpare- mesic estination. In such ease, one may choose to insted estimate a semiparametric hazard model as proposed by Horowitz (1999), who considered semiparametric proportional hazard models, Sea also Lin- ton, Nielsen and van de Geer (2008) for kernel-based approsches to the estiznation of multiplicative and adative hazard model, 6.8 Applications 6.8.1 Boston Housing Data We consider the 1970s era Boston housing data that has been exten- sively analyzed by a number of authors. For what follows, we report on the application outlined in Li and Racine (fortheoming), This dataset contaius n = 506 observations, and the response variable ¥ isthe me- dian price of a house in a given area, Following Chaudhuri et al. (1997, p. 724), we focus on three important covariates: RM = average num- ber of rooms per house in the area (rounded to the nearest integer), LSTAT = percentage of the population having lower economic status in the area, and DIS = weighted distance to five Boston employment cen- ters. 
An interesting feature is that the data is right censored at $50,000 (1970s housing prices), which makes it particularly well suited to quantile methods.

We first shuffle the data and create two independent samples of size $n_1 = 400$ and $n_2 = 106$. We then fit a linear parametric quantile model and a nonparametric quantile model using the estimation sample of size $n_1$ and generate the predicted median of $Y$ based upon the covariates in the independent hold-out data of size $n_2$. Finally, we compute the mean square prediction error defined as $\text{MSPE} = n_2^{-1}\sum_{i=1}^{n_2}(Y_i - \hat q_{0.5}(X_i))^2$, where $\hat q_{0.5}(X_i)$ is the predicted median generated from either the parametric or nonparametric model, and $Y_i$ is the actual value of the response in the hold-out dataset. To deflect potential criticism that our result reflects an unrepresentative split of the data, we repeat this process 100 times, each time computing the relative MSPE (i.e., the parametric MSPE divided by the nonparametric MSPE). For each split, we use the method of Hall et al. (2004) to compute data-dependent bandwidths. The median relative MSPE over the 100 splits of the data is 1.13 (lower quartile = 1.03, upper quartile = 1.20), indicating that the nonparametric approach is producing superior out-of-sample quantile estimates.

[Figure 6.1: Density of the relative MSPE over the 100 random splits of the data. Values > 1 indicate superior out-of-sample performance of the kernel method for a given random split of the data (lower quartile = 1.03, median = 1.13, upper quartile = 1.20). Horizontal axis: Relative MSPE; vertical axis: f(MSPE).]

Figure 6.1 presents a density estimate summarizing these results for the 100 random splits of the data. Figure 6.1 reveals that 76% of all sample splits (i.e., the area to the right of the vertical bar) had a relative efficiency greater than 1 or, equivalently, that in 76 out of 100 splits the nonparametric quantile model yields better predictions of the median housing price than the parametric quantile model. Given the small sample size and the fact that there are three covariates, we feel this is a telling application for the nonparametric method. Of course, we are not suggesting that we will outperform an approximately correct parametric model. Rather, we only wish to suggest that we can often outperform common parametric specifications that can be found in the literature.
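The split-sample comparison described above is straightforward to script. The sketch below shows only the evaluation loop; `fit_parametric_median` and `fit_kernel_median` are hypothetical placeholders for whichever median-regression estimators one uses (they do not refer to any particular package), and the arrays `Y`, `X` are assumed to be loaded already.

```python
import numpy as np

def relative_mspe(Y, X, fit_parametric_median, fit_kernel_median,
                  n_splits=100, n_train=400, seed=0):
    """For each random split: fit both median-regression models on the training
    sample, predict the median on the hold-out sample, and record the ratio
    parametric MSPE / nonparametric MSPE (values > 1 favour the kernel model)."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    ratios = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        tr, ho = idx[:n_train], idx[n_train:]
        pred_par = fit_parametric_median(X[tr], Y[tr], X[ho])
        pred_np  = fit_kernel_median(X[tr], Y[tr], X[ho])
        mspe_par = np.mean((Y[ho] - pred_par) ** 2)
        mspe_np  = np.mean((Y[ho] - pred_np) ** 2)
        ratios.append(mspe_par / mspe_np)
    return np.array(ratios)

# usage (with user-supplied estimators):
# r = relative_mspe(Y, X, fit_parametric_median, fit_kernel_median)
# print(np.median(r), np.mean(r > 1))   # median relative MSPE, share of splits > 1
```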
For the application that follows, data is taken from the Center for Disease Control (CDC) and Prevention's National Health and Nutri tion Examination Survey. A variety of methods have been sed for constructing growth curves, and results are typically presented as quan tiles. For instance, the U.S. Center for Disease Control and Prevention use a two stage approach. In the first stage, empizieal percentiles are smoothed by a variety of parametric and nonparametiic procedures. To ‘obtain corresponding percentiles and = scores, the smoothed percentiles are ten approximated using a modified least median of squares estima- tion procedure in the second stage. We consider the direct application of mixed data quantile methods using the estimator defined in (6.19), and by way of example construct weight-for-age quantiles. We report the 25th, 50th, and 75th quantiles and plot those for males in Figure 6.2, Bandwidth were obtained via least squaes cross-validation, with fngight = 1.22; hage = 8.11, and Suge = 0.12 Figure 6.2 readily reveals many features present in this dataset, including heteroskedastcity that increases with age and asymmetry in ‘the conditional CDF of weight. 6.8.3 Conditional Value at Risk Financial instruments are subject to many types of risk, including in- terest risk, default risk, liquidity risk, and market risk, Value at risk 68_APPLICATIONS 203 90 so} ™ 60 50 40 30 20 10 ‘Weight (kz) 2 6 8 0 2 M 1 18 2 Age (Years) Figure 6.2: Weight-for-age quantiles for males. (VaR) sooks to measure the latter; itis single estimate of the amount by which one’s position in an instrument could decline due to general, ‘market movements over a specific holding period. One can think of VaR ‘as the maximal loss of a financial position during a given time period for a given probability; hence VaR is simply a measure of loss arising from an unlikely event under normal market conditions (Tsay (2002, p. 257)). For what follows, we report on the application outlined in Li and Racine (forthcoming). Letting AV(l) denote the change in value of a financial instrument from time ¢ to t+1, we define VaR over a time horizon { with probability pas p=PIAV() < VaR] = F(VaR), where Fi(VaR) is the unknown CDF of AV(1). VaR assumes negative values when pis small since the holder ofthe instrament suffers a loss.” "Therefore, the probability that the holder will suffer a loss greater than or equal to VaR is p, and the probability that the maximal loss is VaR. “For a long positon we ave concommod with the lowar tll of returns, i. a loss duo to afl alu. Fora short postion we are concornod with the upper teil, ‘lace de to ise in value. Inthe lator cas we simply model 1 ~ F(VaR) rather ‘than F(VaR). OF eaurse, in either cace wa are modeling the fait ofa distribution. 208 6. CONDITIONAL CDF AND QUANTILE ESTIMATION is 1—p. The quantity ‘VaR, = inf{VaR|Fi(VaR) > p} is the pth quantile of Fi(VaR); hence VaR, is simply the pth quantile of the CDF Fi(VaR). The convention is to drop the subscript p and refer to, e.g, “the 5% VaR” (VaRoos)- Obviously the CDF Fi(VaR) is unknown in practice and must be estimated, Conditional value at risk (CVaR) secks to measure VaR. when co- variates are involved, which simply conditions one’s estimate on a veo- tor of covariates, X. Of course, the conditional CDF Fi(VaR|X) is again unknown and must be estimated. 
A variety of parametric methods can be found in the literature, In what follows, we compare results from five popular parametric approsches with the nonparametric quantile estimator defined in (6.19) ‘We consider data used in Tsay (2002) for IBM stock (7, daily log returns (%) from July 3, 1962, through December 31, 1998) plotted in Figure 6.3. Log returns correspond approximately to percentage ‘changes in the value of a financial position, and are used throughout. ‘The CVaR computed from the quantile of the distribution of res1 eon- 2.5%, i= 1,2,...,5): 68 APPLICATIONS 205 Daily log returns ‘0 1000 2000 3000 4000 5000 6000 7000 8000 9000, Observation Figure 6.8: Time plot of daily log returns of IBM stock from July 3, 1962, through December 31, 1998. (Gv) Xe: an annual trend defined as (year of time £-1961)/38, used to detect any trend behavior of extreme returns of IBM stock. (v) Xse: a volatility series based on a Gaussian GARCH(1,1) model for the mean corrected series, equaling o¢ where @? is the condi- tional variance of the GARCH(1,1) model. ‘Table 6.2 presents results reported in Tsay (2002, pp. 282, 295) along with the estimator defined in (6.19) for a varity of existing ap- proaches to measuring CVaR for December 31, 1998. The explanatory vriales assumed values of Xionso = 1, Xzpim = 0, Xsoiv0 = 0, Xai ~ 0.9737, and Xsoiso ~ 1.7966. Values are based on an as- sumed long postion of $10 milion, hence i value at risk is 2%, then tre obtain VaR = $10,000,000 > 0.02 = 8200, 000 Observe that, depending on one's choice of parametric model, one can obtain estimates that difor by as auch 82% for 5% CVaR and 63% for 1% for this example, OF course, this variation arises due ¢o model tncertainty. One might take a Bayesian approach to deal with this tncertainty, averaging over models, or alternatively consider nonpars- mnetric methods that are robust to functional specification. The kernel nantile approach provides guidance yielding sensible estimates that thight be of value to practitioners. Another use forthe nonparametric 206 6. CONDITIONAL CDF AND QUANTILE ESTIMATION ‘Table 6.2: Conditional value at risk for @ long position in IBM stock Model 5% CVaR_1% CVaR “Tnhomogencous Poisson, GARCH(I,i) $303,756 — $407,425 Conditional normal, IGARCH(1,1)/ $302,500 $426,500 AR(2)-GARCH(1,1) $287,700 $409,738 Student-ts AR(2)-GARCH(1,1) $283,520 $475,943 Extreme value $166,641 $304,969 LSCV CDF $258,727 $417,102 quantile estimator might be to evaluate the CDF nonperametrically at the values produced by the parametric estimators above to assess which quantile the parametric values in fact correspond to, presuming of course that the parametric models are misspecifie. Interestingly, Tsay (2002) finds that Xy_ and Xoq are irrelevant ex- planatory variables. The LSCV bandwidths are consistent with this. However, Xo, is also found to be irrelevant; hence only one volatility measure is relevant. 6.8.4 Real Income in Italy, 1951-1998 ‘We again consider the Italian GDP panel discussed previously in See- tion 1.13.5, a panel containing data for income in Ftaly for 21 regions {or the period 1951-1998 (millions of lire, 1990—base). We consider the estimator defined in (6.19) and model the 25th, 50th, and 75th income ‘quantiles treating time as an ordered discrete regressor. Figure 6.4 plots the resulting quantile estimates. Recall that Figure 1.9 presented in Chapter 1 revealed that the distribution of income evolved from a unimodal one in the early 1950s toa markedly bimodal one in the 1990s. 
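The arithmetic that turns an estimated return quantile into a dollar VaR figure is simple, and a small sketch may help. The code below (Python) applies the smoothed CDF estimator of (6.19), here used unconditionally, to a simulated series of daily log returns and scales the resulting 5% quantile by the position size; the simulated returns, the rule-of-thumb bandwidth, and the resulting numbers are purely illustrative and are not the Table 6.2 estimates.

```python
import numpy as np
from scipy.stats import norm

def smoothed_quantile(returns, p, h0):
    """Invert the smoothed (unconditional) CDF F(y) = n^-1 sum_i G((y - r_i)/h0)."""
    grid = np.linspace(returns.min(), returns.max(), 1000)
    F = norm.cdf((grid[:, None] - returns[None, :]) / h0).mean(axis=1)
    return grid[np.searchsorted(F, p)]

rng = np.random.default_rng(2)
returns = rng.standard_t(df=5, size=9000) * 1.2     # fake daily log returns, in %
h0 = 1.06 * returns.std() * len(returns) ** (-0.2)  # ad hoc rule-of-thumb bandwidth

position = 10_000_000                               # $10 million long position
q05 = smoothed_quantile(returns, 0.05, h0)          # e.g. roughly -2.5 (%)
VaR_5pct = position * abs(q05) / 100                # log return ~ percentage change
print(round(q05, 2), round(VaR_5pct))
```

For a short position one would model the upper tail instead, i.e., use the $1-p$ quantile, as noted in the footnote above.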
This feature is clearly eaptured by the nonparametric quantile estimator, as ean be seen in Figure 6.4, 6.8.5 Multivariate ¥ Conditional CDF Example: GDP. Growth and Population Growth Conditional on OECD Status In Section 5.5.5 of Chapter § we modeled a multivariate Y conditional PDF with data-driven methods of bandwidth selection, the sample size 68._ APPLICATIONS Ea Real GDP per eapita 0 11950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000 ‘Year Figure 6.4: Income quantiles for Italian real GDP. ‘was n= 616, and ¥ consisted ofthe following variables: yy, the growth Tate of income per capita during each period; y», the annual rote of pepulation growth during each period 2, OECD status (0/1), We now fstimate the conditional CDF ‘F(yy.yale) and preset the resulting Conditional PDP in Figute 5.3, that i, we plot (n,spiX ~ 0) and F(yi,ya\X = 1) again using Hall et al.’s (2004) least squares cross- validation method! Figure 65 presents a lear pictare ofthis multivariate ¥ conditional CDF. A stochastic dominance relationship is evident in that OBCD counties tend to have lower population growth rates and higher GDP ‘romth rates than their non-OECD counterparts 208 6. CONDITIONAL CDF AND QUANTILB ESTIMATION P(yrsyalei) 1.00 0.80 060 oo 0.20 0.00 cpp, Population Growth (92) Figure 6.5: Multivariate ¥ conditional CDF with ¥ consisting of GDP ‘and population growth rates and X being OECD status. 9. PROOFS. ei 200 6.9 Proofs 6.9.1 Proofs of Theorems 6.1, 6.2, and 6.4 Proof of Theorem 6.1 (i) ELF (ulx)A(@)| = BL < v) Wal Xo) EIWa(Xi,2)E(U(% < y)/X)) = BWM, 2) F(W|X)) = inet f aw (252) Folens = fies myFtale + hey oydo = / [ue + Dalen 3 Emer] x [ro + Some, FED Foluleyhatiowve] Wo) dv + ofl) = wa) Fue) + Fea 73 [aa Fave) +2(2) Folula) + Flulz)iso(2)] + o(|W). (6.38) Shmilaely, Ela(e)] = BWa(Xe,2e)] = (ee) + (1/2)e2 D> AF teal) +0(|h|*). (6.37) By noting that E(M(y,2)] = BlF(ulz)j(@)] ~ F(yle)Bla(2)], The- orem 6.1 (i) follows from (6.36) and (6.37). o 200 6, CONDITIONAL CDF AND QUANTILE ESTIMATION Proof of Theorem 6.1 (i) var( f(y, 2)) = n~varl(4(¥; < 2) ~ Pula) WA(%2)] HELLY y) — Fple))*1Wa(Xn 2)? + OUD} E(B (UY < 9) ~ Flyla))*1X]Wa(Xe2)*} + OO) HEALY) ~ 2F (ole) Ful) + Fula)? Wa(Xiv2)?) + O(n) sat J H2)LP (ul) — 2F (ula) (ule) + Peule)"|Wile,2)Pde +0(n) = (ohne)? fue — he) a — he) ~ 2Fye) Fa — ho) + F(yle)*|W (v)%dv + O(n) = n(n hg) '2)(P (ula) ~ Flvla)*| $0 ((nhy..-hg) NA +n7), where «= fu(v)%de. o Proof of Theorem 6.4 (ii). Define By(y,2) = Sty KEBa(y,2). Then (F(ul2) — Flute) — Ba(y.2)] = (Fula) ~ F(vlx) — Balan )hi(a) a) F(ul2) — Fluke) ~ Ba(u2)]A(2)/ulz) + o9(2) (v,2)/n(x) + op(1), where A(y,2) = (F(ylz) — F(yle) - Bn(y,2)]A(z). By Theorem 6.1 () and (i) we know that E[A(y,2)) = O(ji/) and that var{A(y,2)] = (Oh fg) Hz) *var(y 2) + OC) (hn = (ra oof) “M). By virtue of a CLT argument one ean show that (rhy --tg)!?A(y, 2) + NV (0,p(2)?82) in distribution. Hence, (0 --)"[Pluls) ~ P(ule) ~ Ba(ara)] = (bey)? Aly. 2) f(a) + op(1) 4 playtn (0, u(z)Zy2) = NO, Eye). a 9. PROOFS. au Proof of Theorem 6.2 (i). Using Lemma 6.1 we have woo (2 "ne J ~2{ni6( ue) ix Wav) =e ae +228 ila] Walxe)} +08 = [te [ey + 98 = [ules my Leie-+ hw) Wales )de + off) +H ralate +h] Woodo + 088) = le) Flue) + 2h (2) oll) +e Sinton (ule) + "EPC ) Sls) +o) (6.38) ‘Theorem 6:2 (i follows fom (6.37) and (6.38), a Proof of Theorem 6.2 (ii). Lot Hy = hy ...hg. By using Lemma 6.1 we 212 6. 
CONDITIONAL CDF AND QUANTILE ESTIMATION have var(Mf(y,2) ~ Flul2)Alz)] = rhar (6 (4%) —F - ofe [e (4) ] —2F(ule)E le ¢ & *) ix] + Fajerwacsia)} +001) = BAIF(UIX:) — hoCeFatyl%) — 2FUIA)F(V) + Fula? + O((APYIWR(X2)} + O(n) = fHle)F C2) ~ hocaotuls) + 2F(0l) F(a) + Pluie) Wa (=, 2)d2 + (6.0.) = (ott) f+ he), + ho) — hoCa Ful) + 2F (yee + hv) F(ykz) + Flylx)?]W(v)de + O07) = (nff.)*w(a)eF(y\2) ~ Fv) ~ PoCan(a) Fatal) FOC RP (nH) +n) 7 & (ng) *u(2)?Byje — hoy, 2)] + O(|h2(nHy)* +0). ) axe] a Pf of Tiare 62 (8). Past (i) follows fom) and (1) and the Liapunov CLT. Lemma 6.1. (0) 8 [6 (taht) |e] = PULX:) + (1/2)wahB Foal uLXD) + 0(08)- (8) [? (2) [aa] = FV: —hoCeFo(ulX:) +olho), where Ch = 2 G(v)w(v)udo, Proof. This is lft as an exercise. o Lemma 6.2. fu + ale) — Fula) = Flvle)en + Op(h8) + open + (ah... 69._PROOFS 213 Proof. Let An(én) = (F(y + énl) ~ Plule)|ACe)/u(e). Then Py + nl) — F(yl2) = An(eq){1 + op(1)]- By Lemma 6.1 we have ws -*% KYAXa 2) vraear=e{efo( 5%) -6 (4) x] A?) = BLP Wala) ~ Fvia) + O08) (Xz) B{[s(ulXiden + OG) + O(NG)My(Xe-2)} (ulXiden/wln) + O16, +18). Similars, var[An(6n)] € mney ref [e(te)-« (GP AK, »} = Olen(nhy fy). Therefore, Fy + eal) — Flu|) = Aalen)[ + op(1)] = Flule)en + ORB) + open + (maa feg)™™) a Proof of Theorem 6.4. We know that F(ylz) —> P(ulx) in probability by Theorem 6.2 1 follows by ‘Theorem 1 of Tucker (1967) that sup P(e) — F(a] ~ 0 im probity (639) because F(yl2) is a conditional CDF. Since qa(2) is unique, this implies that 6 = 6(€) = minfa~ F(aa() — ex), F(aa(z) + el) ~ a} > 0. Tt is easy to see that P\da(2) — ga(x)| > el
\[
\le P\left[\sup_y \big|\tilde F(y|x) - F(y|x)\big| \ge \delta\right] \to 0
\]
by (6.39); hence $\tilde q_\alpha(x) \to q_\alpha(x)$ in probability. For the asymptotic distribution, let $\epsilon_n = v\,(nh_1\cdots h_q)^{-1/2}$ and write
\[
Q_n(v) \equiv P\left[(nh_1\cdots h_q)^{1/2}\Big\{\tilde q_\alpha(x) - q_\alpha(x) - \sum_{s=1}^q h_s^2 B_{\alpha s}(x)\Big\} \le v\right]
= P\left[\tilde F\Big(q_\alpha(x) + \sum_{s=1}^q h_s^2 B_{\alpha s}(x) + \epsilon_n \,\Big|\, x\Big) \ge
a] ~P [Plaa(2)l2) > ~S(aal)kz)en +o] (6.1) by Lemma 6.2 and the assumption that fg = of (ny. Fy)"/3). There fore, Qn(o) ~ Pllinhs «.-hg)*0*(aal2)|2) {Plao(2)|2) ~~ Bu(aa(e)l2)} > -v] ~ 0), where ®(-) is the standard normal distribution, a as Qalv) = 6.9.2 Proofs of Theorems 6.5 and 6.6 (Mixed Covariates Case) ‘Theorem 6.5 is proved in Lemmas 6.8 and 6.4 below, while Theorem 6.6 is proved in Lemmas 6.5 and 6.6, Lemma 6.3. () ELM, 2)] = w(x) Bialyt) + mlz) Dear AsBas(vz)+ o(|A| + IAP). (i) var(M(y,2)) = xtu(x)F(yla)[1 — F(ylz)], where Brs(y,2) and Bag(y, x) are defined in Theorem 6.5. Proof. See Li and Racine (forthcoming). o emma 6.4 (hy ha)¥ [POyla) ~ Fale) — So Bia(os2) — Yo r-Bosty.2) > N,V), ee aH where V = et F(et — Pla/a) 6.10. EXERCISES 215 Proof. By noting that F(y\z)—F(ylz) = M(y,2)/A(2) = M(y,2)/u(2) “op(1), Lerama 6.4 follows from Lemma 6.3 and the Liapunov CLT. ‘Lemma 6.5. 4 ) ELA (y,2)) = we) Lh NEBr) + wl) Dn AvBaslys2) + olla +14?) (i) vac, 2)) = n(n) (2) x[F(uie) ~ Fivla)® ~ hoCeFo(v2))- Proof. This is left as an exercise. a Lemma 6.6. (nha. fg) 7LF (ula) — Flue) ~ SiR 2) — SY AvBalar z)] 4xon, fe ‘where V = x81 (ulx) ~ Fiula)?V/n(2). Proof. By noting that Fluke) ~ Plvia) = M(y.2)/ A(z) = My, 2)/u(2) + op(1), Lemma 6.6 follows from Lemma 6.5 and the Liapunov CLT. a 6.10 Exercises Exercise 6.1, Prove Lemma 6.1. Hint: Use the change of variable and integration by parts arguments Section 1.4. Also note that 2 f G(w)w(x)dv = fdG%(x) = Bxereise 6.2. Prove Lemma 6.5 Exercise 6.8. Prove Theorem 6.3 ‘Hint: Mimic the proof of Theorem 6.4 but without introdueing Ao 216 6. CONDITIONAL CDF AND QUANTILE ESTIMATION Exercise 6.4, Show that for the general g and r case the optimal smoothing parameters that minimize IMSE[F(y|2) require that hy ~ Te) for 8 Ayseaydy Ay we NAMED for 8 = Lyeeeyty andy ~ wo2keta Hint; Use arguments similar to those used in the two paragraphs immediately following (6.6) and (6.7). Exereise 6.5. Derive (6.13). Hint: Write the Lagrange function as £=Yimpile)—n [= a) ~ | =D — DKK). ‘The first order conditions are FE -Sple)-1%0, (6.2) FE = Lomleyxi KA (Ne2) So, (643) of -—t (Xi—2)KulXu2) 20, (64) Spa) ~ me) Equation (6.43) leads to 1 = np,(x)[n-+7(Xi-2)Ko(Xiya))- Summing this over i and using (6.42) and (6.43) gives na nS open +a Sola) Xe — a) KA(Xez) = Substituting 1 = nto (644), we get (2 =9) (ey ee Pi) = Sa — Ee This proves (6.5). Hence, one can choose 7 Wy umsimize the log-likelihood function given by £09) = Solase) =" nt 440 — 2) KA) 6.40, BXERCISES 2 Exercise 6.6. ( Derive (6.32), Hint: fl 1 w= gm, (Se) = sn(Q/PT 29 = fot — Fr) (i) Show that fix(x) = A(a)exo(— [i Moldy) int: fro) bp T= Fy” = Inf — Fr(o)lho=5, ==Inlt = Fried, ” n(uydo = ‘ich implies that 1= F(t) = BHO, Substituting this into (i), we obtain (i). Part II Semiparametric Methods Chapter 7 Semiparametric Partially Linear Models In this chapter we discuss a relatively simple and popular semipara- ‘metric model, the semiparametric partially linear regression model; see Hidle, Liang and Gao (2000) for a thorough treatment of partially lin- ear models. Roughly speaking, a semiparametric model is one for which some components are parametric, while the remaining components are Teft unspecified. Those models therefore have a finite dimensional pa- rameter of interest, yet also contain some unknown functions which can be viewed as functions characterized by infinite dimensional parame- ters. ‘The partially linear model is one of the simplest semiparametric ‘models used in practice. 
We shall use this model to introduce semiparametric models because its estimation is straightforward, involving nothing more than basic kernel estimation of regression functions along with least squares regression. This model also serves to illustrate a number of somewhat subtle issues which arise in the estimation of semiparametric models in general. For example, the finite dimensional parameter (the parametric part of the model) can usually be estimated at the parametric $\sqrt{n}$-rate, though we usually need stronger regularity conditions along with stricter conditions on the smoothing parameters in order to achieve the $\sqrt{n}$-rate for the parametric part of the model.

7.1 Partially Linear Models

A semiparametric partially linear model is given by
\[
Y_i = X_i'\beta + g(Z_i) + u_i, \qquad i = 1,\ldots,n, \tag{7.1}
\]
where $X_i$ is a $p \times 1$ vector, $\beta$ is a $p \times 1$ vector of unknown parameters, and $Z_i \in \mathbb{R}^q$. The functional form of $g(\cdot)$ is not specified. The finite dimensional parameter $\beta$ constitutes the parametric part of the model and the unknown function $g(\cdot)$ the nonparametric part. The data are assumed to be i.i.d. with $E(u_i|X_i,Z_i) = 0$, and we allow for a conditionally heteroskedastic error process $E(u_i^2|x,z) = \sigma^2(x,z)$ of unknown form. We focus our discussion on how to obtain a $\sqrt{n}$-consistent estimator of $\beta$, as once this is done, an estimator of $g(\cdot)$ can be easily obtained.

7.1.1 Identification of $\beta$

Some identification conditions will be required in order to identify the parameter vector $\beta$. Observe that $X$ cannot contain a constant (i.e., $\beta$ cannot contain an intercept) because, if there were an intercept, say $\alpha$, it could not be identified separately from the unknown function $g(\cdot)$. That is, for any constant $c \neq 0$, observe that $\alpha + g(z) \equiv (\alpha + c) + (g(z) - c) \equiv \alpha_{\text{new}} + g_{\text{new}}(z)$; thus the sum of the new intercept and the new $g(\cdot)$ function is observationally equivalent to the sum of the old ones in (7.1). Since the functional form of $g(\cdot)$ is not specified, this immediately tells us that an intercept term cannot be identified in a partially linear model. After we derive the asymptotic distribution of our semiparametric estimator of $\beta$ in Section 7.2, it will be apparent that the identification condition for $\beta$ becomes the requirement that $\Phi \equiv E\{[X - E(X|Z)][X - E(X|Z)]'\}$ be a positive definite matrix, implying that $X$ cannot contain a constant and that none of the components of $X$ can be a deterministic function of $Z$ for, otherwise, $X - E(X|Z) = 0$ for that component and $\Phi$ will be singular.

7.2 Robinson's Estimator

We first present an infeasible estimator of $\beta$ in (7.1) to illustrate the mechanics involved in the estimation of $\beta$. Taking the expectation of (7.1) conditional on $Z_i$, we get
\[
E(Y_i|Z_i) = E(X_i|Z_i)'\beta + g(Z_i). \tag{7.2}
\]
Subtracting (7.2) from (7.1) yields
\[
Y_i - E(Y_i|Z_i) = (X_i - E(X_i|Z_i))'\beta + u_i. \tag{7.3}
\]
Defining the shorthand notation $\tilde Y_i = Y_i - E(Y_i|Z_i)$ and $\tilde X_i = X_i - E(X_i|Z_i)$, and applying the least squares method to (7.3), we obtain an estimator of $\beta$ given by
\[
\hat\beta_{\text{inf}} = \Big[\sum_{i=1}^n \tilde X_i \tilde X_i'\Big]^{-1} \sum_{i=1}^n \tilde X_i \tilde Y_i. \tag{7.4}
\]
By the Lindeberg-Levy CLT we immediately have (see Exercise 7.1)
\[
\sqrt{n}\,(\hat\beta_{\text{inf}} - \beta) \xrightarrow{d} N\!\left(0, \Phi^{-1}\Psi\Phi^{-1}\right), \tag{7.5}
\]
provided that $\Phi$ is positive definite, where $\Psi = E[\sigma^2(X_i,Z_i)\tilde X_i \tilde X_i']$ and $\Phi = E[\tilde X_i \tilde X_i']$.

The basic idea underlying this procedure is to first eliminate the unknown function $g(\cdot)$ by subtracting (7.2) from (7.1). Although the unknown function $g(\cdot)$ washes out in (7.3), two new unknown functions are introduced, namely $E(Y_i|Z_i)$ and $E(X_i|Z_i)$. Therefore, the above estimator $\hat\beta_{\text{inf}}$ is not feasible because $E(Y_i|Z_i)$ and $E(X_i|Z_i)$ are unknown.
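To fix ideas, the following minimal simulation (Python; the data-generating design and all names are illustrative assumptions, not from the text) computes the infeasible estimator (7.4) using the true conditional means, which are available only because the data are simulated. The feasible version discussed next replaces these conditional means with kernel estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
Z = rng.uniform(0, 1, n)
X = Z**2 + rng.standard_normal(n)           # so E(X|Z) = Z**2 by construction
beta, g = 1.5, lambda z: np.sin(2 * np.pi * z)
Y = beta * X + g(Z) + rng.standard_normal(n)

# true conditional means (known here only because we simulated the data)
EX_Z = Z**2
EY_Z = beta * EX_Z + g(Z)

# double-residual (infeasible) estimator of beta, cf. (7.3)-(7.4), scalar-X case
X_t = X - EX_Z
Y_t = Y - EY_Z
beta_inf = np.sum(X_t * Y_t) / np.sum(X_t**2)
print(beta_inf)    # close to 1.5; note g() has been differenced out entirely
```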
However, we kuow that these conditional expectations can be consis- tently estimated using kernel methods, so we can replace the unknown conditional expectations that appear in Buy with their kernel estima- tors, thereby obteining a feasible estimator of 8. That is, we replace Yi = ¥i— BI) and X, = X,— (WZ) by Yi — ¥; and Xi ~ Xs, respectively, where aR) Le SY KAZI, — (8) X=bZ) LO x Kale 2)/HEade (7.7) and = FZ) =" YO Ka(Zi 25), (78) where Kx(Zi,Z)) = [Ty aste( ltt) at 11 SEMIPARAMETRIC PARTIALLY LINEAR MODELS ‘The presence of the random denominator f(Z;) can cause some technical dilficultios when deriving the asymptotic distribution of the feasible estimator of 9. Wo will consider two different approaches to handling the presence of the random denominator, one that uses a function that “trims out” observations for which the denominator is small, and another that uses a density-weighted method for getting rid of the random denominator altogether. We begin with a discussion of the trimming method. Define a foasible estimator of # given by fem where 1; = 1(/(Z:) > 6), which equals 1 if f(Z:) > b, 0 otherwise, ‘and where the trimming parameter b= By > 0 and satisfies by +0 a5 ‘To derive the asymptotic distribution of 8, we first provide a defi- nition and state some assumptions. We shall use GZ, where a > 0 and 1 > 2 is an integer, to denote the class of smooth functions such that if'g € Gg, thon g is v times differentiable; g and its partial derivative functions (up to order v) all satisfy some Lipschitz-type conditions such as |9(2) ~ 9(2)| S Ha(2)|2/~2]], where (2) is @ continuous function having finite ath moment, and where | || denotes Euclidean norm, ie. tll = ~OIa x —a'} Dw-kyH- Kn, (79) (i) (Xi), i = 1)2,.-1m are tid. observations, Z; admits a PDP f G2 (Le, fis bounded), g(:) € Gf, and B(X|z) € GS, ‘where v > 2 is an integer. (ii) B(ulX, Z) = 0, B(u*|x, 2) = o7(x,z) is continuous in 2, both X and u have finite fourth moments. (G83) (0) ia a product korncl, the univariate kernel k(.) is @ bounded th order kernel, and k(v) = O (1/(0 + lvll"*"). (i) As n+ 00, (hi .-hg)?B + 00, mb4 Dy AY 0. Condition 7.1 (i) states a sot of smoothness and moment conditions, ‘The unknown functions g(z) and B(X|z) are assumed to be vth order 7.2. ROBINSON'S ESTIMATOR 225 differentiable. These, together with a th order kernel of 7.1 (i), ensure ‘hat the bias of the kernel estimator is of order O(S-t_, hi). Condition 7.1 (iv) is used in Robinson (1988). One can ignore the trimming parameter b in empirical applications as one can let b—+ 0 at ‘an extremely slow rate. Thus, Condition 7.1 (iv) is basically equivalent to Vm | STAR + (ahr ig)?| + 0.05 2-00. This condition is casy to understand; O (S72. A? + (nhs ..-fg)~*) is the order of the nonparametric MSE. The difference between the feasible estimator A and the infeasible estimator Agr is proportional to ‘an average of squared nonparametric estimation errors. Therefore, for to bea yii-consistent estimator of 8, one needs the squared estimation error terms to be smaller than n-!¥2, ie, for Shy AB + (mht...) to be of order o(n-¥/), resulting in Condition 71 (iv). - ‘Theorem 7.1. Under Condition 7.1, we have vita 8) 4. 0,0-wa"), (710) provided that ® is postive definite, where © = BUTE, w= Blo%(Xi, ZR!) and X= Xi — E(Xi|Z. ‘The proof of Theorem 7.1 can be found in Robinson (1988). A ‘consistent estimator of the asymptotic variance of fis given in Exercise 1. 
Comparing Theorem 7.1 with the distributional result given in (75), we see thatthe feasible etimator J has exactly the same asymap- totie distribution as that of the infeasible estimator Age. The intuition underlying this result is quite simple. If we ignore the trimming pa- rameter b, then in essence J — fine = Op (Say 2” + (nh ...hg)~"), the MSE rate of the nonparametric estimator, which is op(nt-¥/2) by Condition 7.1 (iv). Hence, /7i(3— At) — op(1), which implice that the two estimators have the same asymptotic distribution. Suppose one uses a second order kernel, ie., assume v = 2. With hg ~ h,! Condition 7.1 (iv) becomes n¥? [>t A + (nha ..-Ag)!] ~ nif2 [hl + (nh8)"}] = o(1), which requires that q <4 (or that ¢ <3 Ths ~ h means that (hy — Bh = oft) or, equivalently, hy = h-+-ofh) 226 11, SEMIPARAMETRIC PARTIALLY LINEAR MODELS since ¢ is a positive integer, where q is the dimension of Z). This is the condition invoked by Robinson (1988). ‘Thus, Condition 7.1 (iv) requires one to use a higher order kernel if q > 4. However, Li (1996) showed that Condition 7.1 (iv) can be replaced by a weaker condition given below. ny) Condition 7.2. As n — 00, awe hae nB~*(y..bg)?/( 20, b(t oI) +00, and nb SE hi — 0 Li (1996) shows that A Ane =O, (Se Taha) Gah, no") ; If this term is of order op(n-2/2), we obtain Condition 7.2. A detailed txplantion as to my be entiation ero ba this do ater than the more familiar order Op (So, M2” + (mh... .4g)~2) can be found in 11006) (ee alo Exeter further explain). Considering the ear for whieh al he have fhe same onder (he ~ A for alls), thea Condition 7.2 bocomes max {h2"~4,h#} —+ a0 and nh‘ —» 0. If one uses a second order kernel (i 2), this condition requires that 4v = § > max(2q — 4,q), leading to the requirement Ting 6 or Unt q 2 5 since g a's postive integer, Toefl, nomzagtive soond tds Kernel enn led to y/conttntetination of @ as long as q <5. “This lowing corllary s proved in i (2886). Corollary 7.1. Under the same conditions given in Theorem 7.1, ex- cept that Condition 7.1 (iv) is replaced by Condition 7.2, then Theorem 71 still holds true. One undesirable feature of the semiparametric estimator described above is the use of a trimming function which requires the researcher to choose a nuisance parameter, the trimming parameter b. However, ‘one can instead use the density-weighted approach to avoid @ random ‘denominator in the kernel estimator as follows. ‘Multiplying (7.3) by fi = (Zi), we get (% ~ BOIZ))Fi = (Xi - (GIZA + hi au) One can estimate using the least squares method by restessing (%/-E(¥IZ,)) fron (X,-B(%i12,) fi Letting Bar denote the resulting, 7.2._ROBINSON ESTIMATOR {infeasible estimator of 8, then by the Lindeberg-Levy CLT (see Exercise 7.1) we know that Bury is Viv-consistent and asymptotically normal, VF (Bay 8) 49 (0,850, 05! where &) = E[X 142] and Vy = Blo%(X,, 2) ~E(X|Z)). ‘A fenibleeatiator offs obtained by repa and (X; — E(X,|Z)).fi with (¥; — Yi) fi and (Xi (Hh -EMIZ)) fF. Xi), where Yi, X and f; are the kernel estimators of B(¥i|Z:), B(%|Z)) and (Xi) de- fined in (7.6) to (7.8). Because Vifi =n! Dy ¥iKy(ZiyZ;) (and Xf.) does not have a random denominator (f.), after removing the tximming parameter b, Condition 7.1 (iv) ean be replaced by the following: Condition 7.3. Asn 00, n(hs---hg)* + 00, and n 2, hie — 0, while Condition 7.2 can be replaced by Condition 7.4. As n> co, ini {em rad S HD 1} 00, and nhs hi 0. ‘The next theorem gives the asymptotic distribution of the feasible density-weightod estimator, ‘Theorem 7.2. 
Lelting Ay denote the feasible density weighted estima- tor of, then, under the sume conditions as in Theorem 7, except that Condition 7.1 (0) i replaced by either Condition 7.9 or Condition 7-4, wwe have Vals — 8) 4 NO, 879/85"), there ®y ond Wy are defined in (7.12). ‘Theorem 7.2, under the weaker Condition 7.4, is proved in Li (1996). In Section 7.5 we provide a proof using the stronger Condition 7.3. As wil be sen, the stronger Condition 7.3 yields a relatively simple proof ‘Theorem 7.2 states that the feasible estimator Ay has the same symptotic distribution as the infeasible estimator Ae. 2 7. SEMIPARAME! (IC PARTIALLY LINEAR MODELS Note that the use of the specific weight function (Zi) for avoiding the random denominator when estimating (is not based on auy eff cioncy arguments. In fact, when the error is conditionally homoskedas- tic, the unweighted estimator 3 can be shown to be semiparametrically offcient. ‘When the error is conditionally heteroskedastic, ie., F(u?1Xi, 2) = oAX;,Z), one might be Tured by analogy into thinking that an cffcient estimator of 8 eould be obtained by choosing w, = L/o(XisZ:) as a weight function. In general, however, this intuition is incorrect. ‘This approach will not lead to efficient estimation of except ia the special case for which the conditional variance is only a function of 2%, Thats, only when B(a#|X;, Z) = 02(Za) will the choice of weight 1/o(Z) lead to efficient estimation of #. Efficient estimation of 3 in the gencral case is more complex and will be discussed in Section 74. 7.2.1 Estimation of the Nonparametric Component From (7.2) we know that g(Z,) = E(¥i ~ X{A1%). Therefore, ater obtaining a y/-consistent estimator of 8 (Le, 8), a consistent estimator of g(z) is given by Jua¥s — XY) ol=23) Smee) fed a2) = where J can be replaced by 8. We know that the nonparametric kernel ‘estimator has a convergence rte tha is slover than the parametric y/n- rate, Therefore, i is easy to se that, asymptotically, (2) is equivalent to the following infeasible estimator that; makes use of the true value of 8: Than = XY KvleZ) Dp Kn) ‘The convergence rate and the asymptotic distribution of (2), from ‘which one can immediately obtain the asymptotic distribution of (=) (sco Exercise 7.2), is discussed in Chapter 2 Note that the choice of fh,'s for estimating g(=) can be quite dif- ferent from those for estimating 9. In order to obtain a y-consistent estimator of 8, a higher order kernel is needed if q > 6. However, when estimating 9(2), there is no need to use a higher order keruel regardless of the value of g. Therefore, one can always use a nonnegative second a2) (714) 1.2. ROBINSON'S ESTIMATOR 29 order kernel to estimate 9(2), and one could choose the smoothing pa rameters by least squares cross-validation (in estimating g(z)); i.e., one cean always choose hy... /tg to minimize 5%)» (7a) where §-«(Zivh) = 9-4(Z)) as defined in (7.13) with 2 replaced by Zi, and 3 replaced by D9 jy Note that in (7.15) the dependent variable is ¥i~X{9 rather than ¥. Because f'— 8 = Op (n-¥/2) has a faster rate of convergence than any nonparametzic estimator, one can replace @ in (7.15) with f to study the asymptotic behavior of the eros-validation selected fis. The rate of convergenes ofthe hys isthe same as discussed in Chapter 2 One can also select and h simultaneously by mi Yk ~ x19 - 44, my- Under general conditions, including the use of a second order kernel, the cross-validation. 
choico of h will be of order Op (n-¥/"¥)), This order satisfies Conditions 7.1 (iv), 7.2, 7.8 and 7.4 if q < 3, There: fore, when q <3, ove can choose the fy’s and 3 simultaneously by minimizing 7, Yi —X/8 — 9(Zi,h)]?. The resulting hy's will be of order Op (n 7614), and the resulting A will be /Fe-consistent having asymptotic variance given in Theorem 7.1. One can also use second or- der expansion results (in the h’s) in a partially linear model to select the smoothing parameters so a8 to minimize the estimation MSE up to second order; se Linton (1095). ‘The partially linear estimator has been applied in @ range of set- tings. For example, Anglin and Geneay (1996) used it for the semipara- retrie modeling of hedonic price factions; Blundell, Duncan and Pen dalsur (1998) considered partially linear models for the semiparametric estimation of Engel eurves; Bngle, Granger, Rice and Weiss (1986) used 4 partially linear specification for estimating the relationship between ‘weather and electricity sales; Stock (1989) considered the problem of predicting the mean elect of a change in the distribution of certain policy-related variables on a dependent variable in « partially linear framework; Adams, Berger and Sickles (1999) apply a partially linear 20 7, SEMIPARAMETRIC PARTIALLY LINRAR MODELS specification to study the production frontier of the U.S. banking indus- try; and Yatchew and No (2001) estimated price and income elasticities allowing for endogeneity of prices in a partially linear model with price ‘and age entering nonparametricaly. 7.3. Andrews’s MINPIN Method ‘Andrews (1904) provides a general framework for proving the i- consistency and asymptotic normality of a wide class of semiparametric estimators. Andrews names the estimators MINPIN because they are estimators that MINimize a criterion function that may depend on Preliminary Infinite dimensional Nuisance parameter estimators. An- drews's method can be used to derive the asymptotic distribution of ‘many semiparametric estimators, including an estimator of in a par- tially linear model. We briefly discuss this method below. Tet 0 € © C RP denote a finite dimensional parameter, and let ‘r denote some infinite dimensional function, Further, let 7 be some preliminary nonparametric estimator of r © 71, where 7 is a class cof smooth functions the specifics of which depend on the particular semiparametric model being considered. We use @ and mp to denote the true parameter and the true unknown function. Let @ denote an festimator of @ that is a solution to a minimization problem where the objective finetion depends on @ and #, with being a preliminary nonparametric estimator of 7y. Suppose that d is a consistent estimator ‘of dp that solves a minimization problem resulting in the following first ‘order condition: viii (8,7) where sig (0, 2) = 2? Sym 0, 4) ‘We consider the ease where mn(W;, 0,7) is differentiabe with respect. to 0. It were finite dimensional, one could establish the asymptotic normality of 6 by expanding y/rifin(6,*) about (0,70) using element by element mean value expansions. Since ris infinite dimensional, how- ver, @ mean value expansion in (0,7) is unavailable. Andrews (1994) Suggests expanding yi, #) about t only and using the concept of stochastic equicontinuity (see Appendix A) to handle +. A mean value ‘expansion in 6 about @y yields (7.16) (1) = Vit (5,7) = Vn (ay #) + etn (8,4) vez (9~ 00)» an) 1.3. ANDREWS'S MINPIN METHOD 2a ‘where @ lies on a line sogment between 8 and 8p. 
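In clean notation, the first order condition and the mean value expansion referred to in (7.16) and (7.17) read
\[
\bar m_n(\hat\theta,\hat\tau) = n^{-1}\sum_{i=1}^n m(W_i,\hat\theta,\hat\tau) = 0,
\]
and, expanding $\sqrt{n}\,\bar m_n(\hat\theta,\hat\tau)$ about $\theta_0$,
\[
0 = \sqrt{n}\,\bar m_n(\hat\theta,\hat\tau)
  = \sqrt{n}\,\bar m_n(\theta_0,\hat\tau)
  + \frac{\partial \bar m_n(\bar\theta,\hat\tau)}{\partial \theta'}\,\sqrt{n}\,(\hat\theta - \theta_0),
\]
with $\bar\theta$ on the line segment joining $\hat\theta$ and $\theta_0$.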
Under some regularity assumption, one can show that Hn8.8) =D Am wd.) Be [rege 4673] [Ferro] =M. ‘Thus, provided that Mf is nonsingular, from (7-17) and (7.18) we get Vii (0) = (M~* + 0p(2)) 7 > em(Wfo,F). (7.18) Therefore, if nV mn(Wis do, #) — nV? > ra(Wi, 80,70) = op(1), (7.19) then we wil have Yri(d 85) = Mn? > m(W,00, 70) +09(1) NCO, MSI (7.20) by the Lindeberg-Levy CLT, where $ = var(m(W89,70)) In practice, (7.19) can be difficult to verify for general semipara- tetric models that are more complicated than the simple partially linear model. Androws (1994) suggests using the concept of “stochas- tic equicontinuity” to establish (7.19). Stochastic equicontinuty can be used to establish (7.19) because if (7,70) — 0 (p(-,") i a psoudometc; see Definition A.82 in the appendix) as n+ 90 and vq(7) is stochastic equicontinuous for r € A, where A isa bounded set that contains 7) as fn interior point, then Weal) — (ra)| 5 0, (7-21) See Androws (1004, eq. (210)) for a proof of (7:21). Recall that u(r) © yritin(Oo,7). Thus, (7.19) holds if va(-) is stochastically equicontinuous. 200 17, SENIPARAMETRIC PARTIALLY LINEAR MODELS ‘The following assumptions can be used to establish the fr normality result for 6 Suppose that the MINPIN estimator solves the minimization problem of 9 = infgce d(@, 7) for some objective function d(@,#), and as ‘a result the first order condition is may(8,4) = n-4 S>sm(Wi, 8,4) = 0. ‘The population moment condition is Ehm(Wi,89,70)] = 0. Defining iplt) = 1 ig(p,7), we make the following assumption Assumption 7.1. (Normality) Assume that (i) 6% 0 €@, © is a compact subset of BR”. (i) #3 meg. (iti) Jriva(70) + N(0,S). (iv) (un J} a stochastically equicontinuous af ro (v) (0,7) is twice continuously differentiable in 8 € ©, while 2D (W267) > Elm 0,7, nD (0/00)n( 18.0.2) 2 Blm(WB, ry and m2 (0/40) 0,7) 2 E(0/08)m% 0,7 uniformly over @ xG (ice., uniform weak law of larye numbers) ‘Theorem 7.8. Under Assumption 7.1, vii (@-) 4 0,ar-ts Proof. The above theorem is a special case of Theorem 1 of Andrews (1994) and is therefore omitted. In fact, Andrews does not assume i.d, data; rather, he allows for wealdy dependent time series data. and allows for nouidentically distributed data. We discuss weakly dependent data in Chapter 18, a In practice, the stochastic equicontinuity condition (Assumption 7.1 (iv)) is the most diffieult part to verify, especially for highly nonlinear semiparametric models. Andrews (1994) uses Theorem 7.3 to establish 7.4. SEMIPARAMETRIC EFFICIENCY BOUNDS ‘the yi-normality result for a partially specified nonlinear model (see also Ai and McFadden (1997). We discuss verification of Assumption 7.1 (iv) for partially linear models in Section 75. One maintained assumption above is that the objective fanetion is smooth in its arguments; see Chen, Linton and van Keilegom (2003) ‘on estimation of semiparametric models when the criterion function is, not smooth, 7.4 Semiparametric Efficiency Bounds 7.4.1 The Conditionally Homoskedastic Error Case In this section we discuss the (local) semiparametric efficiency bound for (7.1) and consider twa approaches. We begin by deriving the (local) semiparametric lower bound under the assumption that the errors are conditionally homoskedastic, ie., that E(u|X;, Z;) = E(u), where uj is normally distributed. This approach uses parametric maximum likeli- hood estimation and is easy to understand, and we follow the arguments found in Newey (19906) and Rilstone (1993). 
The discussion here will be informal, while e rigorous approach can be found in Newey; also see Begun, Hall, Huang and Wellner (1983) and ‘Tripathi and Severina (2001) for a general treatment of efficiency bounds of semiparametric ‘models, and the related work of Pakes and Oley (1095). Let m(z, 8) be a parametric submodel such that m(z, 9) = 9(2) for some fy, where 5 is a kx 1 vector of nuisance parameters (k > 4). ‘The parameters of this parametric submodel are y = (1,6), and ‘the parameter vector of interest is y. For each parametric submodel there is a vector-valued score function fy, which one can partition with respect to the parameter of interest ) and the nuisance param ter 6 80 that , = (Uyyl5). The semiparametric bound can be inter- preted as the supremum of the covariance matrices over the para- metric submodels. Define the tangent set J for the nuisance func- tion as the mean square closure of all £ x1 linear combinations of 1s, Let Plla|) denote the projection of fg onto the space J and define = Pllo|7}. Then the semiparametric lower bound for the model is given by Yo = {UGH} ‘The log-likelihood function corresponding to the parametric sub- Ga 2st 11. SEMIPARAMETRIC PARITALLY LINBAR MODELS model is wv £(8,8) = Constant — ‘The score function when evaluated at the true model is| (8) a () in (eseatiay) ; where we used g(z, 49)/Az = g!")(z). Since g(7) is not specified, 92) cannot be identified either. We obtain the tangent set A(-), the mean square closure of all k x 1 linear combinations of ls, given by A= (ule): with B{(AZ)P} < eo}. ‘The projection ofp on AC) i (1/a)uE(X|2), therefore, we obtain the efficient score as 5 8 ty —E(pIA()) = (1/o)u(X — B(X|2))- ‘Therefore, the semiparametric eficiency bound for (7-1) is Vp = (BUBIB II! = o* (EI — OX - I, (722) E(X|Z). “Thus, 6 defined in (7.9) is semiparametricaly efficent in the sense ‘that the asymptotic covariance matrix © equals to the semiparametric lower bound Vj for the partially linear model ‘We can also compare Vg with the asymptotic variance of the least squares estimator of fy when (7-1) is a linear regression model, i.e., g(2) = a+ 27, ais a scalar and 7 € RE, ue = XI tat Zi hus. (7.23) Letting Asx denote the ordinary least squares estimator of 3p based on (7.23), then it is easy to show that Val Bats ~ Bo) N(0, Vass (72a) where Vous = 0 E{[X; — a — bz%I[Xs ~ a - BZ} * (see Exercise 7.6), where a and b are constant matrices of dimensions p x 1 and p x 4, - spectively, and a-+b2; is the best linear projection of X; on aspace that is linear in Zi. Now compare Va with Vay. We see that if B(X|Z) = 74. SEMIPARAMETRIC EFFICIENCY BOUND: 236 +2; isa linear function in Z;, then Vie = Ve. However, if (Xi Zi) is not a linear function in Z), then EB {[X; — B(X,|Z)][X¢— B(%|Za)]"} — E ([Xi— 9(%)][Xi — 9(Z)]"} is negative semidenite for any function (Za) (by Theorem A.3).,Thus, Vi ~ Vote i8 positive definite when E(%z) is not linear in z. In this latter ease, the semiparametrically ef ficient estimator @p is asymptotically less efficient than the parametric estimator pis, which uses the extra information that g(z) =a + 2'7. 74.2. The Conditionally Heteroskedastic Error Case ‘The detivation of the semiparametric efficiency bound for a partially linear model can be found in Chamberlain (1992) under the general eon- itionally heteroskedastic error ease, while Ai and Chen (2008) consider efficient estimation for general semiparametric models which includes ‘the partially linear model as a special ease. 
The discussion below is based on Ai and Chen. The results presented in this section include a conditionally homoskedastic error model as a special case, Consider the partially linear model = Xft (Btu i where E(u?| Xi, Z)) = 0°(Xs, Z) is of unknown form, In order to derive the semiparametric efficiency bound of al- lowing for conditionally heteroskedastic errors, we first assume that 0°(X;, Z) is known, and then proceed to discuss how to obtain a feasi- ble semiparamettically efficient estimator when 42(2,2) is of unknown form. Ai and Chen (2003) show that Jy can be efficiently estimated bby minimizing the following objective function simultaneously (jointly) ‘over fp and the unknown funetion g: aciag PAM — Xiu + o(Z)/07(, 2}, (7.26) yee (7.25) where B is a compact subset in RY and G isa class of smooth functions In practice, one works with the sample mean rather than with the population mean (7.26), thereby minimizing the following: sciheg 22 — Mla + a1 2i)/0°(%s 20 (an ‘The minimization ean be done by first concentrating out the un- mown function f(-). That is, we fist treat yas a fixed constant, and 26 1. SEMIPARAMETRIC PARTIALLY LINEAR MODELS by application of calculus of variations to (7.26) over g:), we obtain 2B {B [(¥i — X{ + (Z:))/o*( Xs, 4)1 Zi] (Zi)} where 5(Z) is a arbitrary function of Z. Solving for g(Z) in (7.28) Teads to v0 iaylee)-*Ge)e om where 0? = 0%(X:, %)- Substituting (7.29) into (7.27), we get y y EG) «PG ah jo2. (230) 7 E (414) (14) By wing the shorthand notation = ¥— E(/of)/B(0/of\2) an A = Xi ~ BUX/o7)/E(A/o? 2), (7.30) bacon DL Po)? /03- ‘Minimizing this objective function over fy gives an roi) sen estimator of fy: (728) = DS xatll] Trove? =6+ fe ae] Yixia/o?. [A standard law of lange numbers slong with the Lindeberg-Levy CUP Teas to Vii (Bar — 0) +N (0,¥e) im distribution, (7.1) Ge am where = [XAi/o2) “aff 114. SEMIPARAMETRIC EFFICIENCY BOUNDS Ed Equation (7.82) is the semiparametric efficiency bound deri Chamberlain (1992, p. 569, ea. 1.9) When 0°(X;,Z)) = 0°(Z;), (7.82) cam be simplified to Vo = B IX: — BOM ZA)] IM — BOG) /o(Z)} (7.33) When o? = o?, constant, (7:32) further reduces tothe asymptotic variance of y(3~ fo) covered by Theorem 71 ‘The above estimator Byq is not feasible. A feasible efficient es- timator of J can be obtained from 3g by replacing the unknown conditional expectations with their respective kernel estimators, eg. by replacing B[¥/o?|2i] with Eli /o#|2i] = Dy¥5/9} Keyl Dy Kus where Ki; = K((Z — Z;)/h), and by further replacing o? with 6? Lj HFRnag/ Dj Kass where Knyy = K(X — X5)/he)K (Ze ~ 25)/he) with ii; = B(Y/|Z,) — B(%|Z,)'f being a consistent estimator of uw de- fined in Exercise 7.1. Some additional zegularity conditions, such as the donsity f(v, 2) being bounded away from zero in its support or, alterna- tively the use of a trimming funetion, are needed to establish that the feasible estimator indeed has the same asymptotic distribution as the ‘efficient infeasible estimator Jer. We point out that an efficient semi- parametric estimator of fy is quite complex. It requires estimation of a nonparametric model with dimension p-+q (the dimension of (Xi, Zi) while the estimation of 6 and g() involves only nonparametric esti tation with dimension q, Therefore, the “curse of dimensionality” may prevent researchers from applying efcient estimation procedures to a partially inear model when the error is conditionally heteroskedastic. In this chapter we only cover the case in which (¥,X,2) are all observed without error. 
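To make the feasible weighted estimator described above concrete, here is a minimal numerical sketch. It assumes, as in the simplification leading to (7.33), that the conditional error variance depends only on Z; it uses Nadaraya-Watson smoothing with a product Gaussian kernel and a single fixed bandwidth, and it omits the trimming safeguards discussed above. It is an illustration of the two-step idea, not the estimator's full implementation, and all function names are ours.

```python
import numpy as np

def nw(targets, Z, bw):
    """Nadaraya-Watson fits of E(target | Z_i) at the sample points, product
    Gaussian kernel with a common bandwidth bw; targets may be (n,) or (n, p)."""
    D = (Z[:, None, :] - Z[None, :, :]) / bw
    W = np.exp(-0.5 * (D ** 2).sum(axis=2))
    W /= W.sum(axis=1, keepdims=True)
    return W @ targets

def weighted_plr(Y, X, Z, bw=0.5):
    """Two-step weighted estimator of beta0 in Y = X'beta0 + g(Z) + u under the
    simplifying assumption that E(u^2 | X, Z) = sigma^2(Z), cf. (7.33)."""
    # Step 1: unweighted double-residual estimator, used only to obtain residuals.
    Xt, Yt = X - nw(X, Z, bw), Y - nw(Y, Z, bw)
    b0 = np.linalg.solve(Xt.T @ Xt, Xt.T @ Yt)
    u2 = (Yt - Xt @ b0) ** 2
    # Step 2: estimate sigma^2(Z) by smoothing the squared residuals on Z, then
    # re-estimate beta0 with 1/sigma^2 weights inside and outside the smoother.
    w = 1.0 / np.clip(nw(u2, Z, bw), 1e-3, None)   # guard against tiny variances
    w_bar = nw(w, Z, bw)
    Xw = X - nw(X * w[:, None], Z, bw) / w_bar[:, None]
    Yw = Y - nw(Y * w, Z, bw) / w_bar
    A = (Xw * w[:, None]).T @ Xw
    return np.linalg.solve(A, (Xw * w[:, None]).T @ Yw)
```

In applications one would of course replace the fixed bandwidth with a data-driven choice and add the trimming or boundedness conditions required by the theory.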
For estimation of a semiparametric partially linear model with errors-in-variables, see Liang, Hiirdle and Carroll (1999) and Liang and Wang, (2004) 238 17, SEMIPARAMETRIC PARTIALLY LINEAR MODELS 7.5 Proofs 7.5.1 Proof of Theorem 7.2 Proof. We first provide a consistent estimator of the asymptotic vari- ance of fy. It ean be shown that ae DOG - XK ~ Ky? ana #25 |e (7.34) (Xe — R(X — Ky FF] fare consistent estimators of @ and W respectively, where i i) — (Xi — X)'By is a consistent estimator of uy. Throughout, we use F, to denote DE, and Syyy; to denote ii _, From ¥i = Xioo + 91 + ui, we get Yj = XiBp + gy + a, where As = Dip Akash (A = Y, X, gy w). Define Suge = WTS AABifis and Sapa = Sap Using % — Yi = (Xi — XB + gi ~ Geb us ~ te Wo Bet (ui - Bp So ayy e-syhcr—1nf = PO+ Se 4 See-ayftgraee-ayp> (735) Hence, we have VilEr- B= SE yy VRS ayterreny (790) In Propositions 7.1 to 7.4 below, we show that © VRS fief = 0s (8) Syp 2%, ) YRS aja = Olt), and (8) VRSie_aypug = VAS opt + Op(1) * NCO, 1S. PROOFS. 200 Using ()-(), from (7.6) we obtain Vay ~ 8) = Sea e-nig jeu-a)h cau Se oP Senne Ft Se-avfah Seaport} = [y+ op) Hoplt) +[vRSprup + op(0] + 06(0)} 407N0,9) 73) ‘This completes the proof of Theorem 7.2. a Tn the lemmas presented below, we assume that the support of (X%;, 2) isa bounded sotto simplify the proof. All results for gs = g(Zi) also hold true for & = £(Z:) = E(X,|Z). Letting ¢ = 14 oF tw, then (2%), (2.2) and 02(X,, Z) = Elef|X Z) ae all bounded functions ‘with bounded derivatives up to order v, though the support of us need not be bounded. ‘We will use the shorthand notation E,(A) = B(A|Xy, Z,) and Kiy = K((Z— Z)/h) below. ‘Lemmas 7.1 to 7.5 below hold true if one replaces g with €. The only difference between g and ¢ is that g isa sealar function, while € is of dimension r x 1. The proofs below hold true for each component of € and, therefore, hold true for the vector function € since 7 is finite. ‘Also, we assume that fy ===: = hy = A for notational simplicity Alternatively, we can interpret the results of O(h#) as O (S74, hd) and O(h#) as O{h ..-h) to obtain the results that correspond to the case where we do not imnpose the condition that all ofthe fs are equal ‘Also, in order to simplify the proof, we use the leavo-one-out estima- tors for X; and Yi in the proof below, Note that Theorem 7.2 remains valid without using leave-one-out kernel estimators. Lemma 7-1. Let my = 9(Z) or ms = €(Z). Then Ex[(ms — m1) Kaa] = O(h"). Proof. Note that 9(2) has bounded derivatives, It follows by using the ‘Taylor expansion and a change of variables argument. o Lemma 7.2. (9 Seong = Op (HY + HP(nh®)) (rm; = 95 o mi = Ee 20 1 SEMIPARAMETRIC PARTIALLY LINEAR MODELS (8) Let eg = wy o7 145 then 8, = Op ((nh®)~)) Proof of (i). (we ignore the difference between (n— 1)~! and n=!) E [[Sin-ml] = re [lois — my? #3] =E [i —m)* = wee (ons = mm) Knaus ~ m1) Ki] ® ¥ [Plem-miPktal a + YX BEE llm—m) Kral shies x Ey [Oj — mm) Kings] =n {0 (h2h-*) 4-20 (h*)} = 0 ((nne-2y* +2) by Lemma 7.1 7 Proof of (ii). Eluj-md -E [ef] = LY Klas Krak] an =n? DE [ekza a =n IB [01(Xi, ZK] = Cn "B (Kia) = 0 (nh?) Lemma 7.8. Spamyjci = % (2) (me = 9 oF ms = 6) zsproors 2a. mm faet + Seix-mypelF-D = Seraomfes + (60) DB [lin mr Rese] = ne [mn {[umna’} = my)? fPo2(%, 20) f8] << Cn'8 [lin ~ m)*f3] =n = nol) =oln-) by Lemma 7.2. 7 Hence, Sion—myfi anjap + (60.) =o (n-¥?) o Lemma 7-4. Sia-myfap = Oo (inhi)! + WY(ak®) 2) = op (n-¥) Proof. By the Cauchy inequality, [Sinner] = {[Soe-ml Sil} {0p (h2 (nk?) 
-* + A) O, ((nb8)2)}7 > (mene) none) o-). s,| n Lemma 7.5. 1 Spag= Op (rte), (8) Syi.ag = On ((nbw?y*), (8) 8,307 = Onl), (0) Sapg = Op (m0), (0) Sup = op). ‘The proof of (i) and (it) are the same as () Proof of () Supor = Susag* Sup = Sypaz + (0-09) 2{ [Sail = -oufoon aaron Bae NOV) by Theorem 7.1, §(z) is defined in (7.14). From 8 ~ fi Op(n-¥/9), one can show that (2) — (2) = Op(n-¥) Op ( (tha ig)? + Sh hE) 16, EXERCISES 2 Exercise 7.5. (i) Show that the result, of Lemma 7.4 can be sharpened to Socavteh = On (Mnb#!2)“1 nV PH") = op (1), (i) Show that the result of Lemma 7.5 (i Spey = On ON?) can be sharpened to Hint: Do not use Cauchy's inequality, rather, show that els, fap hag) = OCC tH) tH) for (i), and E [52, 4] = O(n-*h-*) for (i) Note that with the above sharpened rate, Condition 7.8 can be replaced by the weaker Condition 7.4 Exercise 7.6. Prove (7.21). Hint: Use’ the Frisch-Wangh-Lovell Theorem (soe Davidson and MacKinnon (1993, pp. 19-24)). Define Mz = In—P.(PLP,)~"Pt, where Pz is an m x (q+ 1) matrix with ith row given by (1, 2). Project the matrix form of (7.24) on Mg to wipe ont a + Z(7, then use a standard law of large numbers and CLT arguments as we did in the proof of Exercise 7.1. Exercise 7. (@) Assume that B(u2)X, Z2) = 0°(Z4). Using arguments similar to those we used when deriving (7.38), show thatthe semiparametric effcieney bound aff is given by (7.88). (ii) When E(u?|Xi, 2)) = 07(X:2:), is Vo — Vo,re positive definite? Hint: No calculation is needed for answering (ii). A simple logical argument is sufficient. Chapter 8 Semiparametric Single Index Models In this section we consider another popular semiparametric model, the so-called semiparametric single index model. This model has been widely used by applied econometricians and has been deployed in a variety of settings. 'A semiparametzic single index model is of the form ¥ =9(X'fo) +n, (s1) where Y is the dependent variable, X ¢ IR is the vector of explanatory variables, 6 is the q x 1 veetor of unknown parameters, and u is the error satisfying E(u|X) = 0. The term 2 is called a “single index” because it is a scalar (a single index) even though z is a vector. The functional form of g(-) is unknown to the researcher. This model is ‘semiparametric in nature since the functional form of the linear index is specified, while g(.) is left unspecified. ‘Semiparametric single index models arise naturally in binary choice settings, and so for illustrative purposes we first discuss this popular ex- ample. In a binary choice model, if one is willing to accept the paramet- ric linear index governing choices but unwilling to specify the unknown distribution of the error term, then one arrives at a semiparametric, single index model. Specifically, when considering the relationship be- tween a binary dependent variable (Y) and some other covariates (X), ‘one might model this relationship as follows (8.2) 250 8. SEMIPARAMETRI (DEX MODELS ‘where Y"* isa latent variable. Note that here ¢= ¥*—B(Y*|X), which differs from w= ¥ ~ E(Y'|X) defined in (8.1) bocanse ¥ # ¥* For example, ¥ could representa labor force participation decision where if ¥ equals 1, an individual participates in the labor market, while f ¥ equals 0, the individual does not. ‘The explanatory variables X contain a set of economic factors that could influence the partic pation decison, such as age, marital status, education, work history, and nuzaber of children. 
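Before turning to estimation, a small simulation may make the latent-variable formulation (8.2) concrete. The parameter values, the two regressors, and the normal error below are arbitrary choices made purely for illustration; switching the error distribution (e.g., to a logistic draw) changes the implied link function, which is precisely the issue taken up next.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
alpha, beta = 0.25, np.array([1.0, -0.5])      # illustrative parameter values

X = rng.normal(size=(n, 2))
eps = rng.normal(size=n)                        # try rng.logistic(size=n) for a Logit DGP

# Latent-variable binary choice model (8.2): Y = 1(alpha + X'beta + eps > 0).
Y = (alpha + X @ beta + eps > 0).astype(float)

# Y depends on X only through the scalar index X'beta; with standard normal
# errors, E(Y|x) is the standard normal CDF evaluated at alpha + x'beta.
print("sample participation rate:", Y.mean())
```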
Model (8.2) assumes a linear parametric link function between the decision regarding whether to participate in the labor market (Y) and the explanatory variables X. We are mainly interested in estimating β, which reflects the impact of changes in X on the probability of participating in the labor market.

Parametric methods for estimating β require one to specify the (unknown) distribution of the error term ε. A popular assumption is that ε has a normal distribution, i.e., ε ~ N(0, σ²). It can be shown that β and σ² cannot be jointly identified without additional identification conditions (see Maddala (1983) for further details). For instance, if we assume that σ = 1, then β is identified, and we can use maximum likelihood estimation to estimate β. However, if the error term does not possess a normal distribution, then the parametric approach will in general produce inconsistent estimates of P(Y = 1|x) = E(Y|x); see Exercise 8.1. To see this, let F_ε(·) denote the true CDF of ε. From (8.2) we have

E(Y|x) = Σ_y y P(Y = y|x) = 1 × P(Y = 1|x) + 0 × P(Y = 0|x)
       = P(Y = 1|x)
       = P(α + x'β + ε > 0)
       = P(ε > −(α + x'β))
       = 1 − F_ε(−(α + x'β))
       ≡ m(α + x'β),

where F_ε(·) is the CDF of ε. Note that if ε has a symmetric distribution, then F_ε(α + x'β) = 1 − F_ε(−(α + x'β)), and we have m(·) = F_ε(·) in this case. For example, if ε ~ N(0, 1) (σ = 1), then (8.2) becomes a Probit model:

E(Y|x) = P(Y = 1|x) = Φ(α + x'β),   (8.3)

where Φ(·) is the CDF of a standard normal variable. If, on the other hand, ε has a symmetric logistic distribution, then (8.2) results in a Logit model of the form

E(Y|x) = exp(α + x'β) / [1 + exp(α + x'β)].   (8.4)

From (8.3) and (8.4), we see that different distributional assumptions for ε lead to different functional forms for the conditional probability that Y = 1. Hence, consistent parametric estimation of P(Y = 1|x) = E(Y|x) requires the correct distributional specification of ε. A semiparametric single index model therefore avoids the problem of error distribution misspecification. Moreover, the semiparametric single index model (8.1) is more general than the binary choice model since we do not require that the dependent variable necessarily be binary in nature. As we will see in Section 8.1, the location parameter α cannot be identified when the functional form of g(·) is unknown, which is why we write our semiparametric model (8.1) only as a function of X'β₀. We discuss this and other identification conditions in Section 8.1. Note that model (8.1) implies that E(Y|x) = g(x'β₀); hence Y depends on x only through the linear combination x'β₀, and the relationship is characterized by the link function g(·).

We would like to emphasize here that, like the partially linear model, the semiparametric single index model is an alternative approach designed to mitigate the effects arising from the curse of dimensionality. Also, we emphasize that Y can be continuous or discrete, i.e., there is no reason to restrict Y to be a binary variable.

8.1 Identification Conditions

For the semiparametric single index model, we have E(Y|x) = g(x'β₀). Ichimura (1993), Manski (1988) and Horowitz (1998, pp. 14-20) provide excellent intuitive explanations of the identifiability conditions underlying semiparametric single index models (i.e., the set of conditions under which the unknown parameter vector β₀ and the unknown function g(·) can be sensibly estimated). We briefly discuss these conditions and then summarize them in a proposition.

First, g(·) cannot be a constant function, for otherwise it is obvious that β₀ is not identified.
Next, as isthe case for linear regression, the different components of # cannot have a perfect linear relationship (perfect multicollineatity), Another resteition i that x contain atleast ‘one continuous random variable. Intuitively this ean be understood via the following ine of reasoning. If contains only discrete variables, soy some 0-1 dummy variables, then the support of zis finite, and soi the support of the scalar variable v = 2/8 for any vector of f. Obviously then there will exst an infinite number of fanctions g(:) which differ in terms of their 9 vectors such that (2/8) = B(V2). This isso simply becanse E(Y'|2) = g(2’8) imposes only a finite number of restrictions on the unknovn function g(-), therefore, there exist an infinite umber of different choices of g() and 8 that satisfy the finite set of restrictions imposed by B(¥|z) = 9(2’8). See Horowitz (1998) for a specific exam ple and illustration of this point. Also, 2 cannot contain a constant. ‘That is, cannot contain & location parameter, and iy is identifiable only up to @ scale. This follows because, for any nonzero constants a1 and ag and for any g(-) funetion and fxed vector 3, we can always find ‘another function, say’ g2(), defined by go(ai +ax2"8) = 9(x’A). There- fore, 6 is not identified without some location and seale restrictions (normalizations). One popular normalization is that + does not contain ‘a constant, the so-called location normalization. For the so-called scale normalization, one approach is to assume that the vector 8 has unit length, Le, Il] = 1, where [|] = (Sf. 67)" isthe Euclidean norm (length) of 8. Another popular scaie normalization is to assume that the first component of x has a unit cosflcient, and this fist component is assumed to be a continuous variable. ‘We suinmarize the above conditions inthe following proposition. Proposition 8.1. (Identification of a Single Index Model). For the semiparametric single index model (8.1), identification of Bp and g(:) requires that (@ & should not contain a constant (intercept), and x must contain at least one continuous variable. Moreover, |\6oll = 1. (48) g() is differentiable and is not a constant function on the support of 2! (iii) For the discrete components of x, varying the values of the dis- crete variables will not divide the support of 2/6 into disjoint subsets sTIMATION 258 We have already discussed how, when 2 contains only discrete vati- ables, ¥p and g(-) are not identified. However, when 9( to be an increasing function, then one can obtain identified bounds on ‘the components of 8. For a detailed discussion of how to characterize the bounds when all components of = are discrete, see Horowitz (1998, pp. 17-20). 8.2 Estimation 8.2.1 Ichimura's Method In this section we review the semiparametric estimation method pro- posed by Ichimura (1993) If the fmctional form of g(:) were known, (8.1) would become a standard nonlinear regression model, and we could use the nonlinear Teast squares method to estimate jp by minimizing. DM - Xa? (8.5) a with respect to 8. Tn the case of an unknown function 9(-), we first need to estimate 4). However, the kernel method does not estimate 9(X/6) directly because not only is 4(:) unknown, but so too is dy. Nevertheless, for a given value of we can estimate G(X1B) F BOHIXIB) = BlolX1) X15] 66) by the kernel method, where the last equality follows from the fat that B(u|X{9) = 0 for all 3 since Eu Note that when 8 = fp, G(X/) = 9(X/%), while in general, G(X18) # g(X1%0) if 8 fo. 
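A minimal sketch may help fix ideas before the formal development: for a given candidate β (normalized to have unit length), the unknown link is profiled out by a leave-one-out kernel regression of Y on the index X'β, and β is then chosen to minimize the resulting sum of squared residuals, exactly the construction developed in (8.7)-(8.8) below. The Gaussian kernel, the fixed bandwidth, the two-regressor trigonometric parameterization of the unit-length β, the crude grid search, and the omission of the trimming function are all illustrative simplifications, not part of Ichimura's method as stated.

```python
import numpy as np

def loo_index_fit(index, Y, h):
    """Leave-one-out Nadaraya-Watson estimate of E(Y | X'beta) at each observed
    index value, Gaussian kernel with bandwidth h."""
    D = (index[:, None] - index[None, :]) / h
    K = np.exp(-0.5 * D ** 2)
    np.fill_diagonal(K, 0.0)                        # leave observation i out
    return K @ Y / np.maximum(K.sum(axis=1), 1e-12)

def sls_objective(theta, X, Y, h):
    """Profiled sum of squared residuals: g(.) is replaced by the kernel fit.
    beta = (cos(theta), sin(theta)) imposes the unit-length normalization."""
    beta = np.array([np.cos(theta), np.sin(theta)])
    return np.sum((Y - loo_index_fit(X @ beta, Y, h)) ** 2)

def sls_estimate(X, Y, h=0.2, ngrid=200):
    """Crude grid search over the direction of beta (two regressors only)."""
    grid = np.linspace(0.0, np.pi, ngrid)
    theta = min(grid, key=lambda t: sls_objective(t, X, Y, h))
    return np.array([np.cos(theta), np.sin(theta)])
```

In practice a proper numerical optimizer and a trimming device are used, and the bandwidth can be chosen jointly with β by cross-validation, as discussed later in this chapter.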
A leave-one-out nonparametric kernel estimator of G(X{B) is given by (ay Sega Fe (A) GA(Xi8) = BAKIXIB) = + 87) where p-(X18) = (MAY Thay K (( Tehimura (1099) suggests estimating o(X fb) in (8.5) by @-«(X1B) of (8.7) and choosing by (semiparametric) nonlinear Teast squares. 254 6. SEMIPARAMETRIC SINGLE INDEX MODELS ‘There is one technical problem though, being that (8.7) has a ran ‘dom denominator, Le., A(X!8) = (nh)? Daa sau K (X58 ~ X{8)/ Tehirnara uses a trimming fanction to trim out small values of P(X{8). Let p(a') denote the PDF of X73. Define As and Ay to be the following sets: As = (a: pla’) 2 6, for all BB} where 5 > 0 is a constant, B is a compact subset in RY, and Aq = (2: |e ~2"\| £2 for some 2° € As } ‘Phe set As ensures that the denominator in (8.7) does not get too lose to zero for 2 € As. The Ay set is slightly larger than As, but as nn 00, h—+0 and Ay shrinks to Ag. Ichimura (1993) siggests choosing @ by minimizing the following objective function: si) => (XA wiXICG EAD), 8) where G(X) is defined in (87) tl Xi nonnegative weg ane ‘tion, and 1(-) is the usual indicator function. That is, 1(X; € Ap) is @ trimming function which equals one if X; € Ap, ero otherwise “The trimming fanetion ensures that te random denominator in the ern eainator i postive with high probability so as to simplify the asymptotic analysis Tat 3 denote the semiparametric estimator ofp obtained from min- mining (88). To derive the asymptotic distribution of 8, the following conditions axe needed: ‘Assumption 8.1. The set As is compact, and the weight function w(:) is bounded and positive on As. Define the set Dz = (2: 2~= 23,9 € Byz € As}. Letting p(-) denote the PDP of 2 € De, p(-) és bounded below by a positive constant for all = € De, “Assumption 8.2. 9(°) and pl) are three times differentiable with re ‘pect to z = 2/8, The third derivatives are Lipschitz continuous wni- Jformly over B for all z € Ds. ‘Assumption 8.3. The kernel function is a bounded second order emel having bounded support, is twice differentiable, and its second derivative is Lipschit: continuous. 52 ESTIMATION 285 Assumption 8.4. E)Y"| < oo for some m > 3. cov(¥ jz) is bounded ‘and bounded away from zero for all x € As.q In(h)/[nb3*/™—3)] — g and nh® +0 as n— co, Tehimura (1998) has pfoven the following result: ‘Theorem 8.1. Under Assumptions 8.1 through 8.4, Vin(B - 8) + N(0,01) in distribution, where Q7 = V—1EV—1, and where 2 = B{w(Xi)o°(x) (f)” (Xi BalKlXIH) x (Xi E(%iLX460))'}, with gf? = [Og(v)/2v\lyaxys, BalXilv) = E(Xe6o= v) with xa having the distribution of Xi conditional on Xj € As, and v= [wexo (a?) Batctatan) (Bacay ‘A consistent estimator of Q is given by o . nore 7 =n Soo X,)G (X18) Xi BAX) X — ERX. SK} BGA) (NBN XB) (Xe B(NGLNGB)) = Yi ~ G(X{A), (XB) = [00-«(X40)/0B)|pjp 9-s(2{) 1s defined in (8.1), BX XYBY = Dy Xs (Ki ~ XA) Dy K(X ~ HB). ‘The proof of Theorem 8.1 is quite technical, aud we refer readers to Ichimura (1993). Horowitz (1998) provides an excellent heuristic outline for proving Theorem 8.1, using only the familiar Taylor series methods, a standard law of large numbers, and the Lindoberg-Levy CLT arguments. ‘When B(uf]X;) = o? 
is a constant (ie., when w= ¥—9(X/4) has conditionally homoskedastie errors), it can be shown that the optimal choice of w(X;) is w(X,) = 1, and in this case is semiparemerically efficient in the sense that is the semiparametric variance lower bound (conditional on X € As) In general, however, B(u?|X;) = 03(X;), and the semiparametri- cally efficient estimator off has a complicated structure. Ione assumes 256 ‘8. SEMIPARAMETRIC SINGLE INDEX MODELS ‘that B(u2|X;) = o%(X%6o), that is, that the conditional variance de- pends only on the single index, then the choice of w(X;) = 1/0(X{#) fan lead to a semiparametrically efficient estimator for Jp. However, this choice is infeasible since, in practice, 0%(X{5) is unknown, One could therefore adopt a two-step procedure as follows. Suppose that ‘the conditional variance isa fanetion of Xp only. In this ease, for the first step, use u(Xi) = 1 to obtain a yi-consistent estimator of fo, say fi, then using iy = ¥i = 9(Xffa) one ean obtain a consistent non- parametric estimator of o%(X/6y), say 6%(X{0) = var(tis|Xio). In the ‘second step choose w(X,) = 1/6%(X{Ho) to estimates again. Provided ‘that 62(v) — 02(v) converges to zero at a particular rate uniformly over 1) € Dy (Deis the support of X/6o), the reslting second step estimator ‘i will be semiparametsically efficient. For what follows below, we ignore the trimming set Ag, the weight fonetion w(), and also assume that the minimization over 8 is, in fact, done over a shrinking st B, = (8: ||9 ~ fol| 0. This type of arsument is used by Hide, Hall and Ichimura (1983). ‘The requirement that f Tie in a set with BoB = 0 (n~/2) may appear to be an overly strong assumptions how- ever, given Ichimmara’s (1993) result, we know that 3 is a yfi-consistent estimator of 8, and we can look at a minimum of $(A) for 8 with © (n-¥/) distance from f- With this assumption, together with the assumption that A € Hn = {hi Cin-¥S < h< Con} for some (Cy > C1 > 0, then an alternative proof can establish the result of The- forem 8.1 as follows. First, it can be shown that the nonparametric sum fof squared residuals can be written as (we omit w; = I and omit the trimming indicator function for notational simplicity) 3,09) = 2.0K - xia)’ =S(—)+T+or(0, 69) where $(8) = nD 4 ~ G(X{B)), where To = D> [G-s6{60) — 9 X400)] is a term that is independent of @, and where op(1) denotes terms of ‘order o,(1) uniformly in 8 © By. The proof of (8.9) can be found in Hirdle et al. (1993), who in fact consider a more general setting where they choose and h simultaneously by cross-validation method. 52._ ESTIMATION 2st In Section 8.12 we show that (3) = O(1). Therefore, minimizing (8.9) with respect to (is asymptotically equivalent to minimizing $({). Letting 8 denote the value of # that minimizes $(9) then, using Taylor expansion arguments, we,can easily show that (sec Section 8.12) 3 satisfies the following first order condition: Wo (3 — fy) = Vo + (s.0.), (8.10) where Wo = Som [/extan)]” Es — Beat] [A BORIS] and Vo = You f(X4) [Xe — EOGIX{Bo)) ‘Hence, by a standard law of large numbers and CLT argument, we have YAR ~ By) = (Wo/ny-*n-¥/7V% + 0f(1) 4 (0,%), (8-11) where Mis the same as 2y except that w(X;) is replaced by 1 ‘Up to now we have restricted ourselves to the normalization rule, I[b|| = 1. We could also use a different normalization rule if desired. Instead of assuming that has unit length, we can assume that the first component of 7 is one. 
Assume that the first component of 2 is a continuous vatiable, and write X, = (Xi.X)', where X; is X, '1 9/2. Therefore, a higher order kernel (v > 2) is needed since q > 2. In the case of a single index model we need q > 2 since, otherwise, 3 is not identified. Powell et al. (1989) prove the following result vad — 4) + N(0,Prws), (8.18) where Qpxs = AB|o%(X) F(X) f (XY -+Avar( F(X)gP(X)). A nor malized vector can be obtained via 4/6). ‘The proof of (8.18) requires the use of U-statistic decompositions for variable kernel functions (some of these are collected in Appendix A). Im Chapter 2.we discussed using local polynomial methods to esti- imate an unknown conditional mean function and its derivatives. Sine B given in (8.15) is based on the derivative of a local constant. Kernel tstimator of E(¥|2), we could also use the local polynomial method to estimate JE(Y |2)/Ox, This approach is considered by 13, Lu and Ullah (2008). Letting @)(Xi) denote the kernel estimator of g(!)(Xi) obtained from an mth order local polynomial regression, Li, Lat and lish suggest using = AH) (819) to estimate = Ela(X)). Aw estimates 9 directly without assuming that y(2) = 0 at the boundary of tho support of In this sense, i is sinlar to B defined in (8.19) However, the price of not imposing the condition (2) = 0 at the boundary ofits support is that, an Li, [ey and Ullah, ove usually bas to assume that tho support of X isa compact set and that the density f(x) is bounded below by a positive tonstant atthe support of X, ie, thelr conditions ule out the ease of 84, DIRECT SEMIPARAMETRIC ESTIMATORS FOR 20 ‘unbounded support. In the case of unbounded support, one needs to introduce a trimming function to trim out small values ofthe random denominator. With the assumptions of bounded support and the den sity being bounded away from zero in its support, one can avoid the use of a trimming funetion. Li, Lu and Ua use the uniform conver- gence rate result of Masry (1996a) to handle the random denominator. Under some smoothness and moment conditions similar to those used in Powell et al (1989), but with their boundary assumption replaced by compact support and the density being bounded avvay from zero in its support, the use ofa second order kernel, and with n 7%. 2" —»0 and nay. 04) 5-2 €2/ n(n) —» 00 as nt -» 00, where mis the order of polynomial inthe local polynomial estimation and i the dimension of z, Li, Lu and Ullah establish the following result: Vii(Bee = 8) +N (0,8-+ var (o(X))), (8.20) where = Efo*(X) F(X) F(X) /f2(X)] and 8 = Elg"(X)] ‘The reason why 4 and few differ in their variances is because estimates 5 = E[/(X)gO(X)], while Ayre estimates 9 = F [g!(X)] While neither § nor Bye given above is normalized to have unit length, should normalization be applied, the variances of the normalized vec: tors obtained from 3 and Ayye would be identical. This is what one should expect given the result of Newey (19948), who shows that the asymptotic variance of a /7-consistent estimator in a semijaramet- rie model should not depend on the specific nonparametric estimation method used. Indeed, Newey also shows that the asymptotic variance of an average derivative estimator remains the same should one use nonparametric sores methods rather than kernel methods. We discuss series methods in Chapter 15 ‘Tho proof of (6.20) 8 similar to the proof of (8.18) in that one uses the U-statistie decomposition with variable keraels developed in Powell et al. (1989) and makes extensive use of results contained in Masry (1996a). 
Rather than repeating detailed proofs contained in Li, Lu and Ullah (2003) here, we provide a brief outline of the proof of (8.20) intended to provide readers with an intuitive understanding Defining 8 =n“! SO" gl!)(X)), we write vin (8.~8) = vi (8A) + vn (8-9) Note that vit (3—3) + W (0, var (9((X))) by the Lindeberg-Levy 252 8, SEMIPARAMETRIC SINGLE INDEX MODELS CUD. It ean be shown that vii (8.—3) 4 1 (0,2) in distribution. Finally the two terms are asymptotically independent. Hence, Fi (i-8) = (40 (9"C0)) Hiristache, Juditsky and Spokoiny (2001) propose an iterative proce- dure that improves upon the original (noniterative) average derivative estimator. Their idea is to use prior information about the vector @ for improving the quality of the gradient estimate by extending a weight- ing kernel in the direction of small directional derivatives, and they demonstrate that the whole procedure requires at’ most 2log(n) iter- ations. The resulting estimator is /n-consistent under relatively mild assumptions. 8.3.2 Estimation of g(’) “We use f, to denote a generic y/i-consistent estimator of f or 5 (Le., it could be 8, 8, Bay Ba, oF 5 defined earlier). With Py, we can estimate E(Y |e) = g(afo) by sa oan DK (& ay Be) Since A.—fo = Op(n¥/2), which converges to zero faster than stan dard nonparametric estimators, the asymptotic distribution of 9(2/) is the same as the case with fn being replaced by J. Therefore, itis covered by Theorem 2.2 of Chapter 2 with q = 1 (since v = 2/fy is a scalar). Thus, we have G(2'Bn) Corollary 8.1. Assume that Ba — = Op(n-!/?) and under conditions similar to those given in Theorem 2.2, we have Ya (522) ~ aa!) — B(x!) £1V (0, 0(a)/ Fle e)) (622) ‘where B(x!) is defined immediately following (2.8). ‘The direct average derivative estimation method discussed above is applicable only when 2 is a q-vector of continuous variables since the derivative with respect to discrote variables is not defined. Horowitz and 84, BANDWIDTH SELECTION 263 Hirde (1996) discuss how direct (noniterative) estimation can be gen- eralized to cases for which some components of x are discrete. Horowita (1998) provides an excellent overview ofthis method, and we refer read- ers to the detailed discussions contained in Horowite and Hardle (1996) ‘andl Horowitz (1998, pp. 41-48). ‘The advantage of the direct average derivative estimation method is that one can estimate fy and g(x’6y) dizectly without using nonlin- car iteration procedures. This computational simplicity is particularly attractive in large-sample settings; however, it gives rise to a potential finite-sample problem. The direct estimators all involve multidimen- sional nonparametric estimation in the inital estimation stage, and we know that nonparametric estimators suffer from the curse of dimen- sonality. Inthe second stage, the multidimensional nonparametric es- timator is averaged overall sample points, resulting in a consistent estimator of i. Therefore, asymptotically, the eurse of dimensionality problem disappears because the second stage estimate ins a paramet- erate of convergence and the dimension of z does uot affect the rate of convergonce ofthe average derivative estimator obtained at the second stage. However, in finite-sample applications, inaccurate esti- ‘mation in the first stage may affoct the accuracy of the second stage cstimator unless the sample size is sufficiently large. Therefore, with relatively large sample sizes, the direct estimator is more attractive due to its computational ease. 
However, in small-sammple settings, the iterative method of Ichimura (1998) may be more appealing as it avoids having to conduct high-dimensional nonparametric estimation Carrol, Fan, Gijbels and Wand (1987) proposed a method related both to the material in Chapter 7 (ie. partially inear models) and in this chapter (ie. single index models). In particular, Carroll etal considered the problem of estimating a general partially Iincar single index model which contains both a partially linear model and a single Index model as special cases. 8.4 Bandwidth Selection 8.4.1 Bandwidth Selection for Ichimura’s Method Ichimura’s (1993) approach estimates by the (iterative) nonlinear Teast squares method, where the unknown function 9(X14) is zeplaced by the nonparametric kemel estimator §(X/) defined in (8.7). The smoothing parameter is chosen to satisfy the conditions nh® — 0, and 264 5. SEMIPARAMETRIC SINGLE INDEX MODELS In(h)/ [nh™#/-D) 0 as nm + co, whore » > 3 is a positive in- teger whose specific value depends on the existence of a number of finite moments of Y along with smoothness of the unknown function {o(). The range of permissible smoothing parameters allows for optimal Smoothing, ie, h= O (n-¥!). Therefore, Hardle etal. (1993) suggest Selecting h and f simultaneously by the nonlinear least squares cross Yalidation method of Ichimura. Specifically, they suggest choosing, 9 ‘and h simultaneously by minimizing (9,8) = [Wi ~ G-349,0)] UKE As), 23) where Gi(X!8,h) = @-4(X{9) is defined in (8.7), and Ag is the trim- ming set introduced enti. Under some regularity conditions including ¥ having finite mo- iments of any order, the use of @ second order kernel function, the ‘unknown functions g(2’8) and p(2’8) being twice differentiable, f() ‘being bounded away from zero on As, and the assumption that 6 By = (8: [8 — Po\ Con ¥?}, where fr € Hy = [Cr¥!®,Can-¥), and where Co, Cz > C) ate three positive constants, then Hirdle et a (1993) show that. (8, h) has the following decomposition: M(9,h) = M(B) +T(0)+ {terms of smaller order than ‘T(h) and not dependent on 3} + {terms of smaller order than either (8) or T(H) ), (824) where M(8) = So [%i— 9X1)? 10% € As), T(h) = [Gaxthe) ~ a°XLA)] and where @s(X/i) is defined in (8.7) with 8 being replaced by fo “Therefore, minimizing M(f,) simultaneously over both (3,h) € BH, is equivalent to first minimiring M(3) over @ € By and then rninimizing (A) over b € Hy Tet (3, h) denote the estimators that minimize (8.23). The asymp- tot distribution of 8 is given by Theorem 8.1. For h, with the se of ‘second order kernel faction, itis easy to show thet the leading tern of the nonstochastie objective function E[Z()] equals Aih*+ Aa(nh)-* by 84, BANDWIDTH SELECTION the standard MSE ealeulation of a nonparametric estimator discussed in Chapter 2, where A and Ap are two positive constants, Hence, hy = [Ay/(442)}¥ nM = O(n-Y) minimizes Arh + Aninh), Hiidle etal (1993) show that f/f —» 1 in probability We now briefly compare the regulasity conditions used in Ichimura (1903) with those in Hardle tal. (1993). The regularity conditions used to prove Theorem 8.1 and to establish (8.24) are somewhat different. For example, the conditions used in Theorem 8.1 equite a higher order kernel, while for deriving (811), Hale etal use» second order kerne} with an optimal smoothing parameter hk = O (n-¥/8), In (8.24), the minimization is done within a restricted shrinking set (3,1) € By x Hn s0 that 8 ~ 6 = O (n-¥?). This condition makes the kernel estima. 
tion bias smaller than it would be under the conditions of Theorem 8.1, thus one can establish (8.11) without resorting to higher order kernels, Another difference in regularity conditions is hat a stronger moment condition, namely, Y having moments of any order, fs used to obtain (8.11). This is because Hirde et al. nod to establish that the small order terms in (8.11) hold uniformly in By x Hy. Demonstrating tni- form rates of convergence usually requires stronger moment conditions, while, in Theorem 8.1, fis treated as being nonstochastic, and mini. ization is done with respeet to (only, therefore, a weaker moment condition suffices, 8.4.2 Bandwidth Selection with Direct Estimat Methods For the dinect average derivative estimation method, the estination of fy involves the qcimensonal multivariate nonparametic estima. tion ofthe fist onder derivatives. Hidle and Teybakor (1993) suggest choosing the smoothing parameters ay... at minimize the MSE of 5, that i, choosing h to ininize Eiji ~ 4). Hardlo and ‘Teybakor show that the asymptotically optimal bandwidth is of the form (for all s= 1-9) ag = en 220), where c, is a constant, v is the order of kernel, and q is the dimension of z, Powell and Stoker (1996) provide a method for estimating ¢,, while Horowitz (1998, pp. 50-52) considers selecting a, based on bootstrap resampling, 8. SEMIPARAMETRIC SINGLE INDEX MODELS laying selected a, optimally, oe can then obtain an average derivae tive estimator of 8. Using By to denote a generic estimator, one then estimates E(¥ |) = 9(2'6o) by 9(2 Anh) = 9(x’x) defined in (8.7). ‘The smoothing parameter h associated with the sealar index fy can be selected by least squares cross-validation, i.e, by choosing to anne imize S22 1%; 9-s(X1Bq,/))?. Under some regularity conditions, the ‘crose-validated selection of h is of order Op (n°) ‘One can also combine a pastially linear model and a single index model to obtain a “partially linear single index model” of the form E(Y|X,Z) = X'a + 9(Z'B). For the estimation of « partially linear single index model see Carroll et al, (1997), Xia, Tong and Li (1999), and Liang and Wang (2005). 8.5 Klein and Spady’s Estimator ‘When the single index model is derived from the binary choice model (8.2), and under the assumption that ; and X; are independent, Klein and Spady (1993) suggested estimating 8 by maximum likelihood meth ods, The estimated log-likelihood function is £68) = S21 = Yn — (XFS) + LK NwaXIA], (825) where 6(X{) is defined in (8.7). Maximizing (8:25) with expect to 8 Teads to the semiparametric maximum likelihood estimator of, sy ies, proposed by Klein and Spady. Like Ichimura's (1998) estimator, ‘malmization must be performed numerically by solving the fst order condition obtained from (8.25). ‘Under some regularity conditions, inching introducing a trimming fanotion to trim out observations near the boundary ofthe support of IXy-and the use of higher order kernels, Klein and Spey (1998) showed that Axs is V/m-consistent and has an asymptotic normal distribution given by Valine — 8) + NO, Mes), (35 (i) ral) and P = P(e < 2/8) = Fug(2/8), where Fy) is the CDF of « cond tional on Xi = 2. where OKs = 86. LEWBEL'S ESTIMATOR 207 Kicin and Spady (1993) also showed that their proposed estimator {s semiparametrically efficient in the sense that the asymptotic variance of their estimator attains the semiparametric efficiency bound. ‘We compare the asymptotic variance of the semiparametric es timator, Qs, with the parametsie counterpart Qe. 
‘The paramet- ric model has two additional parameters, n = (,7n)". Partitioning ‘he parameter + into 7 = (ofA, then the asymptotic variance for n?(Bute - 8) is Vous (2:-2,@7*%5) ‘Comparing this with Vizl, one can show that (e.., Pagan and Ullah (1999, p. 278)) if BUGLA{B) = o0 + €1(X49), where @ and cy are two q x 1 vector of constants, then Vig = Veil or, equivalently, Vics = Vat ‘That is, the semiparametric estimator is asymptotically as efficient as the parametric nonlinear least squares estimator based on the known true functional form of g(-) when E(Xi)X/) is linéar in X{3 (rst order efficiency”). This is similar to the partially inear model case However, when E(X,|X/8) is not a linear function of X/9, it can be shown that Vics ~ Vie i positive definite, Therefore, the semiparamet- rie estimator is asymptotically les efficient compared with the para- ‘metric nonlinear Teast squares estimator based on the true function ‘(The asymptotic variance Vics ~ Vas is postive definite unless B(X|X{8) = ep + e1(XJ8). Moreover, in this ease the result cannot be improved since Vk already attains the semiparametric efficiency bound. The efficiency loss of the semiparametric model compared with 2 parametric nonlinear lest squares estimator arises because the Func- tional form of g() (or, equivalently, Fy,()) is unknown. Of course, in practice the true functionsl form 9(-) it typically unknown, and in this caso, the semiparametric estimator is robust to functional form misspecification of 9(-) 8.6 Lewbel’s Estimator Lewbel (2000) considers the following binary choice model Vix Uo + XiB-+e: > 0), ‘where ty Is @ (special) continuous regressor whose coeff ized to be one and X; is of dimension q. Let f(wjx) denote the con ‘ional density of v; given X;, and let F.(elv,2) denote the conditional 268 _& SEMIPARAMETRIC SINGLE INDEX MODELS CDF of e given (vj, Xz). Under the condition that F(elv.2) = F(z), that is, conditional on <, ¢ is independent of the special regressor v, and that BiXse] = 0, then Lewbel shows that = (BOGxD) EK, (8.27) where Yi = [i= 104 > Sais uation (8.26) suggests that ono ean estimate f by regressing ¥% con X.Y involves the unknown quantity f(o1X;), whieh can be con- Sistently estimated using, say, the nonparametric kernel method dis- cused in Chapter 5. Letting 6 denote the resulting fesible estimator ‘of 8, Lewbel (2000) establishes the y/#-normality result of his proposed estimator of 8. TLewbel (2000) further extends his result to the ease where & and X; are uncorrelated so that B(&X;) 4 0. Assuming that there exists _p-vector of instrumental variables Z; such that B(eZ;) = 0, B(Z:X{) is nonsingular, and Fas(€,zl0,2) = Fer(y2l2), where Fes(e2)-) denotes the distribution fanetion of (¢,2) conditional on the data -, Lewbel stows that waxpe 2) Hence one can estimate 8 ty replacing the above expectations with sample means and by replacing f(wi|Zi) with a consistent estimator there “he above approach can be extended wo cover the ordered response saodel defined se 8 Y= Vig < wt XiG+6 Sas), (8.28) where dp = —00 and ay = +00. The response variable Yj takes values in {0,1,...,J—1}, and ¥i = j if y-+X,B +4; lies between ay and aj. Lot Xi; — 1 (the intercept) and without loss of generality let 6 (since otherwise one can redefine ay as a; ~ fr). Let Yip = 1(%; 2 J) for = 1y..,J—1, and define A = [E(X.X2]-!. Also, let Aj equal the jul row Of A. The Lewbel (2000) shows that et ane (XZ ME EO), ja, ,J-1,and (829) sty, : A ~-ae(x EES 70), 4. 
(8.30)

From (8.29) and (8.30), one can easily obtain feasible estimators for the αⱼ's and βⱼ's, i.e., by replacing the expectations with sample means and by replacing f(vᵢ|Xᵢ) with a consistent estimator. Lewbel (2000) establishes the asymptotic normality results for the resulting estimators. Lewbel further shows that his results can be extended to deal with multinomial choices, partially linear latent variable models, and threshold and censored regression models.

8.7 Manski's Maximum Score Estimator

Manski's (1975) maximum score estimator involves choosing β to maximize the following objective function:

S_n(β) = n⁻¹ Σᵢ [Yᵢ 1(Xᵢ'β ≥ 0) + (1 − Yᵢ) 1(Xᵢ'β < 0)].   (8.31)

This estimator seeks to maximize the number of correct predictions. For Yᵢ = 1 the ith term equals one if Xᵢ'β ≥ 0 and zero if Xᵢ'β < 0, so a correct prediction receives weight one and an incorrect prediction receives weight zero; similarly, for Yᵢ = 0 the ith term equals one if Xᵢ'β < 0 and zero if Xᵢ'β ≥ 0. Manski (1975) demonstrates strong consistency of β̂ assuming that median(Yᵢ|Xᵢ) = Xᵢ'β (or median(εᵢ|Xᵢ) = 0), that the first component of β is one, and that the first component of x is a continuous variable. Kim and Pollard (1990) show that the maximum score estimator has a convergence rate of n⁻¹ᐟ³, not the usual n⁻¹ᐟ². Because the objective function is discontinuous, the standard Taylor series expansion method of asymptotic theory cannot be applied to the maximum score estimator. Kim and Pollard show that the limiting distribution of n¹ᐟ³(β̂_score − β) is that of the maximizer of a multidimensional Brownian motion with quadratic drift. This asymptotic distribution is quite complex and therefore quite inconvenient for applied inference. Manski and Thompson (1986) propose using a bootstrap procedure to approximate the asymptotic distribution of β̂_score. This bootstrap method is easy to implement, and the simulations reported in Manski and Thompson show that their proposed bootstrap method works well in finite-sample applications.

(Note that Manski uses the term "score" as in "keeping score," e.g., the score of a baseball game, and not the statistical score function, i.e., the sum of gradients of a log-likelihood function.)

8.8 Horowitz's Smoothed Maximum Score Estimator

Though Manski (1975) demonstrates that his maximum score estimator is consistent under fairly weak distributional assumptions, it has a slow rate of convergence and a complicated asymptotic distribution. Horowitz (1992) proposes a modified maximum score estimator obtained by maximizing a smoothed version of Manski's score function, which converges at a rate approaching n⁻¹ᐟ² depending on the strength of certain smoothness assumptions. In essence, the problems associated with Manski's approach are due mainly to the lack of continuity of the indicator function used in (8.31). Horowitz proposes replacing the indicator function 1(A) with a twice continuously differentiable function that retains the essential features of the indicator function, and estimating β by maximizing the following smooth objective function:

S_n(β) = n⁻¹ Σᵢ (2Yᵢ − 1) G(Xᵢ'β / h),   (8.32)

which, up to a term that does not depend on β, replaces the indicator function in (8.31) with the smooth function G(·/h), where G(·) is a p-times continuously differentiable CDF, h = hₙ > 0 and h → 0 as n → ∞. It is easy to see that G(Xᵢ'β/h) → 1(Xᵢ'β > 0) as h → 0. For example, one can choose G(z) = ∫₋∞ᶻ k(v) dv, where k(·) is a (p − 2)-times differentiable kernel function.
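For concreteness, a brief sketch of the two objective functions follows. The logistic CDF standing in for G(·), the fixed bandwidth, the scale normalization that fixes the first coefficient at one, and the grid search are illustrative choices only; as h shrinks, the smoothed objective approaches the unsmoothed score.

```python
import numpy as np

def score(beta, X, Y):
    """Manski's objective (8.31): fraction of correctly predicted signs."""
    pred_pos = (X @ beta >= 0).astype(float)
    return np.mean(Y * pred_pos + (1 - Y) * (1 - pred_pos))

def smoothed_score(beta, X, Y, h):
    """Horowitz's smoothed objective (8.32); the logistic CDF stands in for
    G(.), and G((X'beta)/h) -> 1(X'beta > 0) as h -> 0."""
    G = 1.0 / (1.0 + np.exp(-(X @ beta) / h))
    return np.mean((2 * Y - 1) * G)

def max_score_grid(X, Y, h=0.1, ngrid=400, smooth=False):
    """Grid search under the scale normalization beta = (1, b), i.e., the
    coefficient on the first (continuous) regressor is fixed at one."""
    bgrid = np.linspace(-5.0, 5.0, ngrid)
    if smooth:
        vals = [smoothed_score(np.array([1.0, b]), X, Y, h) for b in bgrid]
    else:
        vals = [score(np.array([1.0, b]), X, Y) for b in bgrid]
    return np.array([1.0, bgrid[int(np.argmax(vals))]])
```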
Under some regularity assumptions, including smoothness conditions on the conditional PDF of Y given x'β, Horowitz shows that his proposed smoothed maximum score estimator, say β̂_sm-score, satisfies β̂_sm-score − β = O_p((nh)⁻¹ᐟ²) and has an asymptotic normal distribution. If h = (c/n)¹ᐟ⁽²ᵖ⁺¹⁾ for some 0 < c < ∞, then β̂_sm-score − β has a rate of convergence equal to n⁻ᵖᐟ⁽²ᵖ⁺¹⁾, which can be made arbitrarily close to n⁻¹ᐟ² for sufficiently large p.

8.9 Han's Maximum Rank Estimator

Rather than maximizing the score function given in (8.31), Han (1987) considers maximizing the rank correlation between the binary outcome Yᵢ and the index function Xᵢ'β. This is achieved by maximizing

G_n(β) = [n(n − 1)/2]⁻¹ Σ_{i<j} [1(Yᵢ > Yⱼ) 1(Xᵢ'β > Xⱼ'β) + 1(Yⱼ > Yᵢ) 1(Xⱼ'β > Xᵢ'β)],   (8.33)

with the summation being taken over all n(n − 1)/2 combinations of the distinct elements {i, j} (a small numerical sketch of this objective appears at the end of this section). A simple principle motivates this estimator. The monotonicity of F and the independence of the εᵢ's and the Xᵢ's ensure that

P(Yᵢ > Yⱼ | Xᵢ, Xⱼ) ≥ P(Yᵢ < Yⱼ | Xᵢ, Xⱼ)  whenever  Xᵢ'β₀ ≥ Xⱼ'β₀.

That is, when Xᵢ'β₀ ≥ Xⱼ'β₀, it is more likely than not that Yᵢ > Yⱼ. Han (1987) shows that E[G_n(β)] is maximized at β = β₀, the true value of β. Han further establishes the strong consistency of his proposed maximum rank correlation (MRC) estimator, but does not provide its asymptotic distribution. The main difficulty in determining the limiting distribution of the MRC estimator is the nonsmooth nature of the objective function G_n(β). Note that G_n(β) is a second order U-statistic (or U-process, indexed by β). Using a U-statistic decomposition and a uniform bound for degenerate U-processes, Sherman (1993) shows that the MRC estimator is √n-consistent and has an asymptotic normal distribution.

Up to now we have considered a semiparametric binary choice model in which the distribution of the error term is modeled nonparametrically and the linear index x'β is the parametric part of the model. Matzkin (1992) considers a more general binary choice model without imposing any parametric structure on either the systematic function of the observed exogenous variables or on the distribution of the random error term. Matzkin provides identification conditions and demonstrates consistency of her nonparametric maximum likelihood estimator.

8.10 Multinomial Discrete Choice Models

Pagan and Ullah (1999, pp. 296-299) provide an excellent overview of the semiparametric estimation of multinomial discrete choice models. For what follows, consider the case where an individual faces J (J > 2) choices. Define Yᵢⱼ = 1 if individual i selects alternative j (j = 1, ..., J) and Yᵢⱼ = 0 otherwise. Let Fᵢⱼ = P(Yᵢⱼ = 1|Xᵢ) = E(Yᵢⱼ|Xᵢ); then the multiple choice equation is

Yᵢⱼ = Fᵢⱼ + εᵢⱼ,   (8.34)

and the likelihood function is

L = Πᵢ Πⱼ Fᵢⱼ^{Yᵢⱼ}.   (8.35)

Parametric approaches specify a functional form for Fᵢⱼ. For example, to obtain a multinomial Logit model one sets

Fᵢⱼ = exp(Xᵢⱼ'β) / Σ_{l=1}^{J} exp(Xᵢₗ'β).

The semiparametric approach sets Fᵢⱼ = E(Yᵢⱼ|Xᵢ) = E(Yᵢⱼ|vᵢ₁, ..., vᵢⱼ) = g(vᵢ₁, ..., vᵢⱼ), where the functional form of g(·) is unknown, and where vᵢⱼ = Xᵢⱼ'βⱼ. The estimation procedure is similar to the semiparametric single index model case discussed earlier. Ichimura and Lee (1991) extend Ichimura's (1993) method to the multi-index case and derive the asymptotic distribution of their semiparametric least squares estimator. Lee (1995) proposes using semiparametric maximum likelihood methods to estimate the multi-index model (8.35).
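Returning to Han's maximum rank correlation objective (8.33), the sketch promised above is given here. The pairwise comparisons are written with NumPy broadcasting; maximization over a normalized β (e.g., with the first coefficient fixed at one, by grid search or a derivative-free optimizer, since the objective is a step function) is left to the reader.

```python
import numpy as np

def rank_correlation(beta, X, Y):
    """Han's maximum rank correlation objective (8.33): over the n(n-1)/2
    unordered pairs {i, j}, the fraction for which the outcomes Y and the
    index values X'beta are ordered the same way."""
    v = X @ beta
    agree = (Y[:, None] > Y[None, :]) & (v[:, None] > v[None, :])
    upper = np.triu(np.ones(agree.shape, dtype=bool), k=1)   # pairs with i < j
    n = len(Y)
    return (agree | agree.T)[upper].sum() / (n * (n - 1) / 2)
```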
Ai (1997) eon- ‘Sites a general semiparametric maximurs likelihood approach which overs many semiparametric models such as multiindex models and Sastialy Iinear models as special cases. We now turn to A's general approach, 8.11 Ai’s Semiparametric Maximum Likelihood Approach ‘Ai (1907) considers a general semiparametric maximum likelihood es. timation approach, Let. q(¥|X,@0, 90) denote the conditional PDF of Y¥ given X, where 0p is a finite dimensional parameter (the parametric part), and go() isan infinite dimensional unknown function. Ai further resumes that the conditional density satisfies an index restriction, Le., ‘hat there exist some known fursctions vy(=,@), v2(2,0) such that (1X. 40, 90) = 1(ZsOo)$ (01(Z Oo) aX, 80)s00), (8-86) where f(js8) is the conditional density of v: given va for arbitrary @, tind J(,0) is a known Jacobian of the transformation from v(#6) to wy sui. Ars [PARAMETRIC MAXIMUM LIKELIHOOD APPROACH 273 Model (836) contains many familiar semiparametric models as spe- cial cases. For example, consider a partially specified regression model ‘m(W") = X40 + (482+ X5) +14, where m) and 72(-) are fanetions cof unknown form, and w ip independent of x = (Xi, Xz, X3) and has aan unknown density m()- 6 = (61,@2) is the parametric part of the model, and 1 = (m,22y7) 38 the nonperamotric part of the model, ‘The conditional density of y given « in this example is (1,1) = ns fl) ~ X461 — ol X302 + Xs)] iP Mw)l, (837) which is also the conditional density of oy = Y given vz = (X41, X302+ Xo), whore n{(y) denotes the derivative of m(y) with respect toy, and In? (9) is the Jacobian, For tho pastially linear multi-index model considered in. Ichiemara and Lee (1991), m(¥) = (hence Jacobian J = 1), v1 = ¥ — X40, ‘and vy ~ XJ6, and for this ease (8.37) becomes (with Jacobian = 1) Floilva) = ns (¥ — X18 — m(X462)) + (8.38) where ma() is the density of w= ¥ ~ X46) ~ na(X40), and nal X40 BIY — X{8\|X{8]. If one uses #, = 0 in (8.38), it collapses to Iehimura’s (1903) single index model (see Chapter 8). For the partially linear model considered in Chapter 7, m(¥) = Y, ur = ¥ — X41, 82 =0, v2 = Xz, and (8.37) becomes (J = 1) Seaton) = ns (¥ — X48 — m2(%) 7 where m(:) is the density of uw = ¥ — X{@) —ma(Xs) and ma(Xa) = BLY ~ X{oi1Xs} We now turn to the issue of estimating this class of models. Define m(z,8, f) = Analy, 6, f)]/20. IF the functional form of f is known, wwe could estimate @ by solving for @ from the score equation Sn(8) (8.39) ‘The yA-normality result of @ follows from standard laws of large numn- bber and CLT arguments. When f is unknown, Ai (1997) suggests ex ‘mating 1 using nonparametric kernel methods. Letting fi(u(2,0),0) and fa(va(2,4),9) denote the joint density of v and the marginal density of up, respectively, we have Sa (v(20)s6) Fox, Oll2,0)0) = Fa 4 8. SEMIPARAMETRIC SINGLE INDEX MODELS ‘Substituting this into m(z,9, f) = Olnla(ylx,8, f)]/28 gives m(z,9, f) = ma(2,9, fr, fa)ma(z,9, fis fa), where m(2,8) fis fa) = filv(=,6), 6] faleal=,4),9) 4 thle din|J(2,6)| @ 2) pyva(2.0).61 alex) 8} — files), and 1 male fob) = Fete DATE OA Letting q and gz denote the dimension of v and v, respectively, then a kernel estimator of f.(0) = f(vi(Z:@)|va(Xs,0),0) is given by A(v1(Zi.9) v2 Xe, 9),8) = (8.40) where (8.41) and - j 1 e flor) =O Kin ( (642) with product kernels defined by 1 (22M) Toe w(t (m=) ‘Then an estimator of f,(8) = fi(8) = F(v1(Z0)lua(Xi,0).6) is given. 
SF (w(Zi,0),0) FOF eX 00) $.12,_A SKETCH OF THE PROOF OF THEOREM 8.1 28 Finally, one estimates 0, say 0, by solving the following first order condition: $,(0) = Ym (2,8, AO) =0 (8.43) For the single index model example, the Jacobian = 1, Oxo = 0, and 1 =u, Xa is empty. We write Xz = 2 and # = @ so that ‘Then we have faa = filtis) = Wes X10) = and f= fleas) = fle) = Ks ( Substituting these into + mis, fis, fos) = ma (Zin, fss fad wwe got exactly the same first order condition as for Klein and Spady’s (1993) estimator}? we leave the verification of this as an exercise (see Exercise 8.4). ‘Ai (1997) shows that the feasible estimator has the same asymp- totic distribution as the infeasible estimator 8, and that 8 is semi- parametrically efficent in the sense that the inverse ofits asfmptotic variance equals the semiparametric efficiency hound for this type of semiparametric mode. 8.12 A Sketch of the Proof of Theorem 8.1 We first derive (8.45). Note that G(X?) 9X4) at = X48) ~ G(XJ9) = of X48) — B [a(X1)|X19] (X{Bo) — 9X58) — 9 (X{BVB [X{(H — 9)1X{8] + O(n) (X48) [Xs — ECGX48)] (8 — 8) +0 (x) - Elg(X{60)|X{8]. By Taylor expansion of 1 and using fy — B= O(n-¥), we have =a (3.44) liety, we omitted the trimming functions; soo Ai (1997, 938) for the details on introducing tsimmang Functions. Por expositional 216. 8, SEMIPARAMETRIC SINGLE INDEX MODELS Substituting (8.44) into $(3) = SR, lol Xf50) + wi — GOXH0))?, wwe obtain 'S(8) = (Pa ~ BY Wo(Bo ~ 8) ~ 2Vo( ~ 8) + Sut + 0(1), (8:45) oi where Wo and Vp ate defined by (8.10), The first order condition of (8.45) yields (8.10) __ It can be shown that the lending terms can be obtained by replacing B(V]XiB)] with Blo(XY)]X/6] in (8.9). We have, uniformly in 3 € By Sin(B) = 3 (00x) E [ol X480)1X18]}? +a ow {0Ci6) ~B [a LAO 1XG5]} ++ {terms independent of 9}+ op (02). (848) Since @ € By, we have [3-6] = O(n-¥/), Using a Taylor expansion of g(X!Ba) at Ay = 8 (use it twice below), we have X40) — Elo X185)LX18] = a XiPo) ~ (X18) = of (X{BELXIIX{8) Pa ~ 8) + Opln™4) =o (XL) {X} — E(XIX{B)} (Bo — 8) + Op (r° (ean Substituting (8.47) into (8.46) we get Sin(B) =o — 8" [: Eula] (-A) +22 D wal f(G - 8) 4 {terms independent of 6} ay(n-%),(B8) where gf? = g(X{B) and vy = Xi ~ E(%1 X49), ‘Minimizing Siq(8) with respect to 6, and ignoring the term which is independent of 3 and the op(n~1) term, we obtain the first order condition 214 ~ 5 (oP) wat 25 Soa (8.49) 88. APPLICATIONS mm which leads to Vali — A) = [: y (ow Fabel = E D (oy vor FE wal no +o, 7 (8.50) where gf) = o(X{), 019 = Xi—E(G1X 4), and the second equality in (8.50) uses the fact that § — A = Op(n-¥/?), A standard law of large numbers and the Lindebers-Levy OLT ar- sgumnent leads to Vii(8 ~ 8) —+ N(0, Vo) in distribution, (8.51) cee iE {(09)*aovo}. r= {61x (6) wag} ane i = E(Xi| Xj). Equation (8.51) matches the result of Theorem 8.1 if one replaces w(X;) = 1 there. to = 8.13 Applications 8.13.1 Modeling Response to Direct Marketing Catalog Mailings Direct marketing is often used to target individuals who, on the basis of observable characteristics such as demographics and their past purchase cisions, are most likely to be repeat customers, For example, one ‘might think of mailing catalogs only to those who are highly likely to be repeat customers or who most “closely resemble” repeat customers.* Bult end Wansbesk (1905), in a profit maximisation framework, point out that 1 fact one might wank to da the opposite, thereby saving costs by avoiding repeated 8 8. 
SEMIPARAMETRIC SINGLE INDEX MODELS ‘The success or failure of direct marketing, however, hinges directly upon the ability to identify those consumers who are most likely to make a purchase. Racine (2002) considers an industry-standard database obtained from the Direct Marketing Association (DMA)* that contains data on f reproduction gift catalog company, “an upscale gift business that mails general and specialized catalogs to its customer base several times each year.” The base time period covers December 1971 through June 1992, Data collected by the DMA includes orders, purchases in each of fourteen product groups, time of purchase, and purchasing methods. ‘Then a three month gap occurs in the data after which customers in the existing database are sent at least one catalog in early Fall 1992. ‘Then from September 1992 through December 1992 the database was updated. This provides a setting in which models can be constructed {or the base time period and then evaluated on the later time period. A random subset of 4,500 individuals from the first time period was cre- ated, and various methods for predicting the likelihood of a consumer purchase was undertaken followed by evaluation of the forecast accu- recy for the independent hold-out sample consisting of 1,500 randomly selected individuals drawn from the later time period. Parametric index models (Logit, Probit) and semiparametric index ‘models (Ichimura (1993); Ichimura and Lee (1991)) were fitted to the estimation sample of size my = 4,500 and evaluated on the hold-out sample of size nz = 1,500, Data Deseription We have two independent estimation and evaluation datasets of sizes my = 4,500 and nz = 1,500 respectively having one record per cus- tomer. We restrict our attention to one product group and thereby select the middle of the 14 product groups, group 8. The variables involved in the study are listed below, while their properties are sum- marized in Tables 8.1 and 8.2. (@) Response: whether or not a purchase was made ‘mollings to thowe who in fact ave hight Hikely to make purchase, Regardles of the {bjectve it the ability to identify those mostly to make w purchase that bas proven prablemati in the past “UThis database contains cusiomer buying history for about 100,000 customers of rationally known catalog and nonprofit databaso markoting businesses, 5.18. APPLICATIONS ‘Table 8.1: Summary of the estimation dataset (r1 = 4,500) Mean | Std Dev | Min | Max ; ooof o2sf 0, 4 LUDFallOrders | 136) 138) 0} 15 LastPurchSeason | 162] 0.53) -1] 2 Orders#¥rshgo | 026) 0.55) 0] 5 LrDPurehGrps | 009| 031) 0} 4 DateLastPurch | 9731| 2734] 0] 17 ‘Table 8.2: Summary of the evaluation dataset (nz = 1,500) Variable ‘Response Dos; a7] a LDPallOrderss | 132] 1.38] 0 LastPurchSeason| 163) 051) -1] 2 0 0 oO Orders4¥rsAgo | 0.25) 0.52 LTDPurchGrp8 | 0.08] 0.29 DateLastPurch | 36.44 | 26.95 116 (ji) LTDFallOrders:life-to-date Fall orders (Gi) LastP urehSeason: the last season in which « purchase was made (iv) Orderst¥esAgo: orders made in the latest five yoars (0) LTDPurchGrps: lif-to-date purchases (vi) DateLastPureh: the date of the last purchase? ‘Each model was evalnated in terms of its out-of-sample performance based upon the measure of McFadden, Puig. and Kirschner (1977)? This ie recoded ta the database ae 1 if tho purchoso was made in January through June, 2 if the purchase was made in July through December, and 1 if 0 purchase was mad “19/71 was veorded as 0, 1/72 as 1 and so on. 
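The out-of-sample evaluation just described can be summarized in a few lines of code. The following sketch computes the 2×2 confusion matrix, the correct classification rates reported in Tables 8.3 and 8.4 below, and the McFadden, Puig and Kirschner (1977) measure defined in the footnote that follows. The 0.5 cutoff, the matrix orientation (rows = actual, columns = predicted), and the argument names are illustrative assumptions; p_hat would hold a fitted model's predicted purchase probabilities on the hold-out sample.

```python
import numpy as np

def classification_summary(y_actual, p_hat, cutoff=0.5):
    """Confusion matrix, classification rates, and the McFadden, Puig and
    Kirschner (1977) measure p11 + p22 - p12 - p21, with the matrix entries
    expressed as fractions of the sample size."""
    y_pred = (p_hat >= cutoff).astype(int)
    cm = np.zeros((2, 2))
    for a in (0, 1):
        for p in (0, 1):
            cm[a, p] = np.sum((y_actual == a) & (y_pred == p))
    n = cm.sum()
    frac = cm / n
    return {
        "confusion_matrix": cm,
        "mcfadden_puig_kirschner": frac[0, 0] + frac[1, 1] - frac[0, 1] - frac[1, 0],
        "overall_correct": (cm[0, 0] + cm[1, 1]) / n,
        "correct_nonpurchase": cm[0, 0] / max(cm[0, :].sum(), 1.0),
        "correct_purchase": cm[1, 1] / max(cm[1, :].sum(), 1.0),
    }
```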
pus + pia —ph ~ pla where py i the ijth entry in the 2x 2 confusion mateix expressed a a fraction ofthe sum ofall entries. A "confusion matrix” is simply & 20 8. SEMIPARAMETRIC SINGLE INDEX MODELS ‘Table 8.3: Confusion matrix and classification rates for the Logit model Predicted Nonpurchase — Predicted Purchase KewalNouparchass———SCSTB i ‘Actual Purchase 108 9 ‘Predictive Performance BL5% Overall correct classification rate 92.47% Correct nonpurchase classification rate 99.64% 7.60% ‘Table 84: Confusion matrix and classification rates for the semipars- metric index model Predicted Nonpurchase Predicted Purchase “Ketiat Noupurchase SSO 2 Actual Purchase % a2 ~Prodictive Performance 93.20% Overall correct classification rate 98.53% Correct nonpurchase classification rate 98.41% “Correct purchase classification rate 35.00% long with the correct purchase classification rate.® Results for the Logit model? are presented in the form of a confusion matrix given in ‘Table 8.3 while those for the semiparametric index model are given in Table 8.4. “Phe semiparametric single index model delivers better predictive performance for the hold-ont data than the parametric Logit model. Note also that, although the parametric models appear to perform faitly ‘well in terms of McFadden et al.’s (1977) measure, they yield poor predictions for the class of persons who actually make a purchase, Jabulaton ofthe actual outcome vermis those prodicted ty a sodel, The diagonal ‘ements contain correctly peedited onteomes while the off-diagonal ones contain Incorrectly predicted (confused) outcomes. Sine faction of coretly predicted purchases among those who actually made purchase TRecults for the Probit model are not as good those for the Logit model and ace omitted far space considertions. 5.4, EXERCISES 2s 8.14 Exercises Exercise 8.1. () Suppose that Y is & {0,1} binary variable. Show that P(Y = Ale) = F(Y 2). (ji) ITY is o binary variable taking values in {1,2}, is POY = 1[2) still equal to E(Y’|x) or not? Exercise 8.2. In deriving (8.51), for the second term on the right-hand, side of (8.51), we used the fact that (i) 23, u82(u|X18) = terms independent of 8+09(n-), and that Gi) $y ue {Blo 140)] ~ Ble X1A)1XG6]} = op (n~), where Bus X48) = (mh) Sus (Xe ~ Xy)'8/AY AX), ia Bul AB) = (mn)? pK (Xe — X9)'8/H)/ AAS), and a (X48) = (mh)! 7 K(X ~ X5)'B/h)- A Prove () and (i) Hint: Por (i), apply & Taylor series expansion at @ = Ah, and use 8 ~ & = Op(n™¥/), ‘The first term in the Taylor expansion will be independent of 8, The second term is a second order U-staistc, and is of order op (n~¥) by using the H-decomposition Bxercise 8.8, Prov that 2S foci) - Box]? = 7 1 YX loa) — Blot x1)1x/9)]” + terms independent of 6 + op(n™"), Hint: Wei BOING (a{ X40) X18) + Bll XB) = E(9(X{A0)1X{9) + [Blo( io] X18) ~ ECo(X1H)1XI8)] + B(ulX48), were B(g(X/40)|XJ8) and BlesLX{A) are defined in Section 8.12 se 8. SEMIPARAMETRIC SINGLE INDEX MODELS Exercise 8.4. Verify that when applying Ai's (1997) general approach to the single index model with binary variable y, Ai's first order condition coincides with that of Klein and Spady (1993) as given im (8.52). Tint: Klein and Spady's (1993) estimator solves the first order con- dition (see Pagan and Ullah (1999, p. 
283)) S72; mhy(@) = 0, where sin(o) = ate fh acxioyy* AM py, — ata] (652) en act = Bera = BE ( with 9(X¢ (Yi X{8) = ma(S ae Chapter 9 Additive and Smooth (Varying) Coefficient Semiparametric Models In this chapter we consider some popular semiparametric regression ‘models that appear in the literature. The rationale for using these, rather than fully nonparametric models, is to reduce the dimension of the nonparametric component in order to mitigate the curse of di- mensionality. Of course, these models are lisble to the same critique of functional misspecification as are fully parametrie models, but they have proven to be extremely popular in applied settings and tend to be simpler to interpret than fully nonparametric models, 9.1 An Additive Model We first consider the semiparametric additive model = 9+ (Zs) + 90(Zas) +--+ (Ze) +a (1) where cy is a scalar parameter, the Zy’s are all univariate continuous variables, and gi(:) (U= 1,-..,4) are tinknown smooth functions. The observations {¥,, Zsi,-++5 Zhu: are presumed to be ii.d For kernel-based methods, two approaches are commonly used for estimating an additive model: the “backfitting” method (see Buja, Hastic and Tibshirani (1989) and Hastie and Tibshirani (1990) and ‘the “marginal integration” method independently proposed by Linton 288 9._ ADDITIVE AND SMOOTH COEFFICIENT MODELS ‘and Nielsen (1995), Newey (19948), and Tjostheim and Auestad (1994); see also Chen, Hirdle, Linton and Severance-Lossin (1996), and Lin- ton (1997, 2000). Due to its iterative nature, the backfitting method is ‘more difficult to analyze than the marginal integration method, hence ‘we begin with a discussion of the relatively simple marginal integration ‘method. 9.1.1 The Marginal Integration Method First, note that for any constant ¢, [gu(e) + 6 + lola) — d= ‘01(21)+92(2). Therefore, we need some identification conditions for the ‘n() functions to be identified. With kemel methods it is convenient to impose the condition Elm(Z)] = 0 so that the individual components .i(-) ate identified (= 1,..-,9)- This also leads to E(Yi) =v, Let Za = (Zana --» Zot Zesih--»» Zi)» That is, Zqi is obtained from (Zs, Zais. +5 Zq) by removing Zqi. Define Ga(2a) = 91(21) +++ ga1(2a-1) + Savi(aaa) + +--+ Gg(2q)- With this notation, (9.1) can be written as XY + Ga(Zas) + Ga(Za¢) + t4- (9.2) Define Ela Zas) = E(|Zag = 20, Za3)> and also let (a, Zas) E(X)|Zoj = 2a, Zaj)- Applying Bl-|Zaj = 2a. Zaj] to both sides of (9.2), we get bala, Zas) = 60+ Gala) + Ga(Zas)- (9.3) Define ma(2a) = El€(a,Zaj)), and note that the marginal expectation is with respect to Zaj- Applying the expectation (with respect to Zi) to (9.3), we obtain Maza) where we have used B[Gy(Za)] + dole), (04) From (9.4), we get Ga(2a) = Ma(2a) ~ ElmalZos)]- (95) ‘The above approach is infeasible. However, a feasible estimator can bbe obtained by replacing expectations by sample means, marginal ex- pectation (integration) by marginal averaging, and conditional mean functions by local linear kernel estimators. Specifically, a feasible coun- terpart of (9.5) is Gala) = tala) — (9.6) 9.1. AN ADDITIVE MODEL 285 where fa(Za,Zaj)s (9.7) ard (2, Za) i the sldton to ain the folowing minimization prob- tem, ap ple Note that da(za, Zaj) 6 a kernel estimator of ELYs|Zaj = Za, Zajl, ‘which applies the local linear method only to the component =, and the local constant method to 2. 
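Before turning to the asymptotic properties of this estimator, a small computational sketch may help fix ideas. The sketch below (Python/NumPy, with illustrative function names and a user-supplied bandwidth vector h) uses a local constant fit in every component rather than the local-linear-in-z_a fit of (9.7); it is a minimal illustration of the marginal integration idea, not the exact estimator analyzed in the text.

import numpy as np

def gauss(u):
    """Standard normal (Gaussian) kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nw_mean(y, Z, z0, h):
    """Local constant (Nadaraya-Watson) estimate of E[Y | Z = z0] using a
    product Gaussian kernel with bandwidth vector h (one entry per column of Z)."""
    K = np.prod(gauss((Z - z0) / h), axis=1)
    return np.sum(K * y) / np.sum(K)

def marginal_integration(y, Z, a, h):
    """Marginal integration estimate of the additive component g_a,
    evaluated at the sample points Z_{ai}.

    For each evaluation point the fitted regression surface is averaged over
    the sample values of the remaining covariates, and the component is then
    recentred so that it averages to zero (the identification condition
    E[g_a(Z_a)] = 0)."""
    n, _ = Z.shape
    m_hat = np.empty(n)
    for i in range(n):                      # evaluation point z_a = Z[i, a]
        fits = np.empty(n)
        for j in range(n):                  # average over the sample Z_{-a, j}
            z0 = Z[j].copy()
            z0[a] = Z[i, a]                 # hold z_a fixed, keep Z_{-a} at its jth value
            fits[j] = nw_mean(y, Z, z0, h)
        m_hat[i] = fits.mean()
    return m_hat - m_hat.mean()             # remove the constant c_0

The double loop makes explicit the n-squared conditional-mean evaluations that drive the computational cost discussed below.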
Fan, Hardle and Mammen (1998) assume that k() is a univariate second order kernel, and K(:) is a vth order product kernel, where 1 isa positive integer satisfying v > (q—1)/2. Therefore, a higher order kernel is needed (v > 2) if q > 4; sce Section 1.11 regarding the con- struction of higher order kernel functions. The following theorem gives the asymptotic distribution of ja(éa), and for expositional simplicity, wwe assume that Ay = ~~~ = Aga = hast =o = hy = ha Fan otal prove the following result. (Zot — 2a)b)* kg (Zat ~ 2a) Kng(Zat — Za): Theorem 9.1. Under the conditions given in Fan et al. (1998), and also assuming that nha? In — 00, BEJI2 +O, hy» O and that ag 0, then wl al) ~ dala) ~e0~ FAR se) + 88) = N(0,0alea)) schere gC) is the second onier derivative of gal)» ma = fuPk(u)d, a(za) = fala) " Heady] of and 0 (sn) = #0): ‘Theorem 9.1 shows that ga(a) reaches the one-dimensional optimal rate of convergence. A disadvantage of the marginal integration method is that itis com- putationally costly. One has to estimate B[Yi|Zat = Zain Zat = Zay] for 286 9, ADDITIVE AND SMOOTH COEFFICIENT MODFLS all i,j = 1,..-.n, which has a computational order involving n? cal- culations, whereas the computational cost of estimating a nonadditive ‘model is only of arder n. In the next section we discuss a computation- ally efficient method for estimating additive models. 9.1.2 A Computationally Efficient Oracle Estimator ‘The marginal integration method discussed above is computationally costly as it requires one to estimate 9( Zoi, Znj) for 4,5 yn. Kim, Linton and Hengartner (1999) propose an alternative method which reduces the estimation time to order n. Consider the nonparametric regression of Y on Za, B(Y Za = 2a) = 60+ Galea) +S Floa(Za)\Za= Fal, (98) me which shows that E(Y|Zq = 2a) is a biased estimator of cy +ga(%a) due to the presence of 57,4, Elge(Zs)|Za = Za}. Kim et al. suggest choosing an instrumental variable wa(z) such that Elwa(Z)|Za = za] = 1 and Elwa(Z)9e(Xs}|Za = Za] = 0 for s # a. (9.9) ‘Then one would obtain Blwa(Zi)¥ilZai = 2a] = C0 + Ga(2a)- Tt is easy to check that wa(2) = fa(%)fa(%a)/ F(a, 2a) i8 such a function, where fala) is the joint PDP of 2a = (21ys0ey20-ty nets --0%4)- fact, for any random variable €, we have lgmeleel = f €ra(e) Gea deg = [Pee Slee ‘F(as%) fala) = f Salads (010) ‘which is exactly the marginal integration of € over the sq compo- nents. Replacing € in (0.10) by € = 1 we obtain Blwy(Z)[Zai) = J falza)dzq = 1. Also, if we replace € by gs(Zu) for s # a, then Bhwg(Zi)on(Za) Zoe) = J deli) falZa)d2a = Elas(Zi)) = 0. Thus, (@.9) holds true. 91. AN ADDITIVE MODEL, 2st It is easy to see that, oo ae Zui) ¥% (9.1) is a consstont estimator of Eluts(Z)¥IZa = Za) = oy + gala), whore wta(2) = fa) fala) fava) Note that “g(2q) can be interpreted as a one-dimensional standard local constant estimator obtained by regressing the adjusted Yaj on 2a where Vag = Y; fal) fa Zas)/ Ila Zn) Since (2a) estimates 0 + 9q(Zq), naturally one can estimate fa(2q) by Jala) — Fala) + 1377 FalSay)- he leading bias, variance and asymptotic normal distribution of fa(Zq) ate given in Theorem Lof Kim etal. (1909). Kim et al show that the asymptotic variance of f(2a) has an extra term compared with the asymptotic variance of the marginal integration estimator of gq(za). Therefore, Ga(%q) is less efficient than the marginal integration estimator. Kim et al. farther propose an ficient oracle sstimator which we discuss below. 
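Before describing the oracle refinement, here is a minimal sketch of the reweighting idea behind (9.11), again in Python/NumPy with illustrative names and a given bandwidth vector h. All unknown densities in the weight w_a(z) = f_a(z_a) f_{-a}(z_{-a}) / f(z) are replaced by product-kernel estimates, the adjusted responses are then smoothed against Z_a alone, and subtracting the sample average removes the constant c_0. This illustrates the general idea rather than the exact estimator of Kim et al. (1999).

import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(Z, z0, h):
    """Product Gaussian kernel density estimate at the point z0."""
    return np.mean(np.prod(gauss((Z - z0) / h) / h, axis=1))

def instrument_weighted_component(y, Z, a, h):
    """Instrumental-variable style estimate of g_a at the sample points.

    Each response is reweighted by w_a(Z_i) = f_a(Z_ai) f_{-a}(Z_{-a,i}) / f(Z_i)
    with kernel density estimates in place of the unknown densities; a single
    one-dimensional local constant regression on Z_a then estimates c_0 + g_a.
    h must be a NumPy array of bandwidths, one per column of Z."""
    n, q = Z.shape
    rest = [j for j in range(q) if j != a]
    f_joint = np.array([kde(Z, Z[i], h) for i in range(n)])
    f_a = np.array([kde(Z[:, [a]], Z[i, [a]], h[[a]]) for i in range(n)])
    f_rest = np.array([kde(Z[:, rest], Z[i, rest], h[rest]) for i in range(n)])
    y_adj = y * f_a * f_rest / f_joint              # adjusted responses
    g_tilde = np.empty(n)
    for i in range(n):                              # 1-D Nadaraya-Watson on Z_a
        K = gauss((Z[:, a] - Z[i, a]) / h[a])
        g_tilde[i] = np.sum(K * y_adj) / np.sum(K)
    return g_tilde - g_tilde.mean()                 # remove the estimate of c_0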
An oracle (efficient) estimstor for ga(2q) {8 defined as an esti- rator that has the same leading bis and variance terms ae in the ral when all other additive functions are known. Define Yom" = = Enda 9e(Zi) ~ eo, and consider a univariate kernel estimator of tae) che GeO, a Eak( Baa) = + (9.12) with hg = 6in-¥/9 (6; > 0 is a constant). Then using results derived in Chapter 2, we know that (sing, second order kernel faction) Vika {89 S0) ~ (za) ~ NBBa(n)} + N(O,Valea)) (0.18) in distribution, where (2) = no [al?"(=n) + af (2a) 14 (ea) / fala) and where Va(a) = 72 (5a)/ fala) (03 (én) = var(Y Ya = 2a). 2s 9. ADDITIVE AND SMOOTH CORFFICIENT MODELS Instead of using a local constant estimator, @ local linear estimator ‘can be used to estimate ga(2a). That is, letting d and 5 be the values of a and b that minimize the objective Function Se(e ) [ra 0 Wear then G is the local linear oracle estimate of ga{za), and its asymptotic disteibution is the same as that given in Chapter 2, which is basically the same as (9.13) except thatthe leading bia term should be changed to (1/2)n208 (2}/ fala). ‘The estimator 92"(z.) given above is not feasible because cp and the ge(zx)s are unknown for s # a. We can replace g4(z) by 4a(2s) and ey by Y. Thus, we can use Yar? = Y= S F(Z ha) + (q— IP me co, where to replace Yai = ¥i— Disa Ga(Zni) Fulenhe)= pdr (752) fay, as) cm is a consistent estimetor of Blus(Z)¥ 12s Leg Letting 92° be defined as was ge!" in (9.12) but with Yor being replaced by Y2-**?, Kim et al. (1990) showed that go" has the same leading bias and variance terms as gf"; therefore, it ns ‘the same asymptotic distribution. Kim etal. assume that hq = an—1/® for some a > 0 and that hp = o(n~¥/5). Note that here one chooses ‘hg to be of smaller order than hq. They show that if a local linear estimator is used, then Vita {42%*(z0) ~ gaa) ~ Zh2a1? (za) Fala)} (0, Valea) (0.15) in distribution, where the definition of Va(za) appears directly below (018). ‘Equation (9.15) shows that the feasible estimator g2°(zq) has the same first order efficiency as for the case where all other g.(2.)'s are known (for 8 # 0) oy + Gal2s), 5 9.1. AN ADDITIVE MODEL. 239 Kim et al. (1999) further show that a wild bootstrap procedure can be used to better approximate the fnite-sample distribution of ‘Ge"(-). Define an estimated additive function Badal #i Ras ho) = ¥ + 3 Galza5 has he), ost where Ga(zaj/osfo) = Gala) is the initial additive function esti- tmator of ga(a}- Then 1 is estimated by tx = ¥; ~ Gadel Zi haha) (i =1,...,1). Define the centered residual by ts = diy —n—! 9G ‘The bootstrap error uy is generated by the two-point wild bootstrap method, ie., uj = [(1+-¥5)/2}i with probability r = (1+-¥5)/(2v3) and uf = [(1— VB)/2}is with probability 1 —r, Next generate adil Zi Thay he) + ul, ‘where in the above another bandwidth hg is used. Kin etal. show that hig should have an order larger than hg for the bootstrap method to work. Kim et al. suggest choosing fhe ~n¥/, hy = o(ha)s and letting Fig assume values in [n“¥/>+¥,n~4] for some 0 < § < 1/5. One can ¢ the bootstrap sample {7,¥:}?.. to compute Gz" (z,), where G°"*"(zq) is the same as 92""!*(zq) except that one replaces ¥; by Y¥7* wherever it occurs. Kim et al. show that one eau repeatedly gener- ate, say, B bootstrap estimates of 92'°"""(z_) and use their empirical distribution to approximate the finite-sample distribution of g2*“"*(z). 
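The two-point wild bootstrap step itself is straightforward to code. The sketch below generates one set of bootstrap responses from fitted additive values and residuals; note that, for the multiplier to have mean zero and second and third moments equal to one, the smaller value (1 - sqrt(5))/2 must be drawn with probability r = (1 + sqrt(5))/(2 sqrt(5)). The function name and the use of NumPy's default generator are illustrative, and the oversmoothed bandwidth used when constructing the fitted values is taken as given.

import numpy as np

def wild_bootstrap_responses(fitted, resid, rng):
    """One set of bootstrap responses Y* = fitted + u* from the two-point
    (Mammen) wild bootstrap.  The centred residual is multiplied by
    (1 - sqrt(5))/2 with probability r = (1 + sqrt(5))/(2 sqrt(5)) and by
    (1 + sqrt(5))/2 otherwise, so E[v] = 0 and E[v^2] = E[v^3] = 1."""
    u = resid - resid.mean()                        # centred residuals
    r = (1.0 + np.sqrt(5.0)) / (2.0 * np.sqrt(5.0))
    lo, hi = (1.0 - np.sqrt(5.0)) / 2.0, (1.0 + np.sqrt(5.0)) / 2.0
    v = np.where(rng.random(u.shape[0]) < r, lo, hi)
    return fitted + v * u

# Usage sketch: given fitted additive values g_add and residuals y - g_add,
# draw B bootstrap samples and re-estimate g_a on each, then use the empirical
# distribution of the bootstrap estimates.
# rng = np.random.default_rng(0)
# y_star = [wild_bootstrap_responses(g_add, y - g_add, rng) for _ in range(399)]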
yy 9.1.3 The Ordinary Backfitting Method By taking the conditional expectation of (9.1), conditional on Zat = fay we get (0 = 1y.-.50) co YE lga(Zsi)|Zat = Za). (9.16) ms alZa) = B(VilZos = a) Equation (9.16) suggests that an iterative procedure is appropriate. Let, a5\a) be some initial estimators for ga(a), say, dl (zq) = 0, or let gfl(2q) be the marginal integration estimator of ga(zq). Also, 7We provide more detailed discussion of the wild bootstrap method in Chapter 2, 9. ADDITIVE AND SMOOTH COEFFICIENT MODELS 1 Soh, ¥i. Then the iterative procedure is given as follows. For sq, and for | = 1,2,3,..., compute the step 4l(z0) by et 19(29) = BUY 2as = 20) — G0 — JE [BZ = x0] NZ) Za (a7) where, for a random variable Ay, B[4\|Zai = a] is the (univariate) ernol estimator of E(Ai|Zou ~ 2a). BldilZec = Za) could be the local constant estimator given by So" Ajhayay/ Sihay Kauss Where Iojnay = b((Zas ~ %)/ha)s OF it could be a local linear estimator of E(Ai|Zai = #0): TMeretion ends when a prespecified convergence criterion is reached, say, when S22 (G4! Zes) — 4°" Zo)?/ Tien (Ger (Zon)? is less than some small number, say, 10-4 ‘By using a local linear smoother in the above procedure, Opsomer and Ruppert (1998) and Opsomer (2000) showed that estimation bias is of order O(S7#_, hd), and the leading variance term of Ga(Za) is given by (n= J M(o)de) we alee) = apts +0(s4-)) E(u#|Z;) = E(u?) (assuming conditionally homoskedastic 9.1.4 The Smoothed Backfitting Method Mammen, Linton and Nielsen (1999) propose « smoothed backfitting procedure for estimating the additive mode! (9.1). The idea is to project ¥ (or E(Y'2)) on the space of additive funetions. Let 9(2) = SOE YK Ziy2)/ Pa Kainz) and fle) =m? Kyl Zi?) de ‘ote the multidimensional Kernel estimators of E(YTZ ~ =) and f(2), respectively, where z = (21,.--,44)- The local constant smoothed back: fitting estimators of gy(2),---+9q(%) ate defined as those gy...» Jq that minimize the objective function [i902)-a2) aula Feld, (048) 8.._AN ADDITIVE MODEL. 2 where the minimization runs overall functions g(z) = ¢9+322. ga(Za) with | 9a(a)fa(#a)déa = 0, where fa(za) = f f(2}dag is the kernel es- timator of the marginal PDF fa(za), 2q = (21y++-12a-ty Zasty +++ 29)- The solution to (9.18);is characterized by the following system of ‘equations (a =1,...,4; ¥ =n! 2, i): ated feta. fe balea)= fer f toy dfa (a9) 0 f tulsa) falco (9.20) Note that 7 Yiklen Zed) 2 [oof Ft Bala), (021) because {Tada hyh( (zs ~ Z)/fe}d2q = 1, where ja(zq) is the local constant estimator of E(¥i|Zai = 2a). Also, using the additive structure of Ga(Zq) and the fact that J hz"k((ee —Zai)/he)dz, = 1, the q —1 dimensional integration f dq can be reduced to double integration, hence where faws(2as2) = (haha)! Sts (Za — #n)/ ah (Zui ~ 20) (Ie) is the Kernel estimator of the two-dimensional marginal PDF of (zs, 7.) ‘auations (9.19), (9.21), and (9.22) lead to the following iterative procedure: AE 20) = Glen) - Ff ao slanted, -% fies an where = 1,2,..., denote the number of iterations. Mammen et al. (1999) derived the asymptotic distribution of Ga(a)- To present their result we frst define @ constant &y and uni- 202 9. ADDITIVE AND SMOOTH COBFFICIENT MODELS variate functions Ba(2a) (with [ Sa(za)dza = 0, a) ae gin [f(10) = Bo len == Boo] (9.23) +a) by for a given function (2). ‘Under regularity conditions given in Mammen et al. 
(1999) (see also ‘Nielsen and Sperlich (2005)), it can be shown that (ha}"/*lia(Za) ~ daa) — A2Ba(20)] N(O,ta(%a)), (9:24) where fis defined in (9.23) with (2) defined by Alz)= + Foe)» and va(za) = 67a(Fa)/ fala) (i Finally, one ean estimate g(- J Rwyv2dv, w= f B(v)dv), 0+ Dai dalZa) by G2) =F + 7 Gal20): (9.25) Moreover, Mammen et al. (1999) show that different components of Ga(2a) ate asymptotically independent; therefore, the asymptotic dis- ttibution of §(2) follows readily from (9.24) (see Exercise 9.1). Mam- men et al. also discuss using boundary corrected kernels when Z has compact support. ‘Mammen et al. (1999) also consider using a local linear estimator of ga(a), 80 that the objective function (9.18) becomes J hx ft — Ydalen) — YGalea) ea Zai)) Kale, Zde, QM aa ‘where the minimization is over the additive components Go, all j2a)’s (with f da(2a) fo(2a)d2a = 0), and also over all Ga(za)'s, where Oa(2a) Ste Bet ors dervtive of 9). 9.4. AN ADDITIVE MODEL Ea ‘Mammen et al, (1999) (see also Mammen and Park (2005) show that the g's and d's satisfy the following equation: (3) =- (0) + iS) Nala) ¥ J Sen) (553) dem with f= FY =D f doled falcata D> [bleed Bean) (o27) a 1 2-2 ; alia) = Yh lasZa (15, ite) (28) Scales 2a) = 0") Bg (#0, Zoi) ha (Zar Za) a - (ax me ete) - os) : on I8ea) = 21 hn en Zi) Lat ~ Za), and Falta) Vk, Zai)- Also, Ga(Za) aid Ja(za) are the local linear fits obtained by regressing ¥; on Zoi; that is, they minimize the objective function b ‘The definition of Gy dx(<1),---Go(), and 6i(21),.--,4q(%) can be rade unique by imposing the normalization condition I Gala) ~ Bal a)(Zai~ta)) baal 2Zs)- (930) aleaVfalsaddia + J BalcefRlca)dea = 0. (931) 8. ADDITIVE AND SMOOTH CORFFICIENT MODELS ‘The smooth backfitting estimates are obtained via the iterative ap- plication of (9.27) whete, when the left-hand side isthe iteration step Ut), the rigt-hand quantities are either [t+ 1} or (0), depending on whether s a. Note that (9.31) yields é = 0? 21 Yi The asymptotic distribution of ja(+a) i derived by Mammen etal (1999). Below we state a simple version given by Nielsen and Spelich (2005). Via (daa) = al) ~ Yo — 2 tua(ea)] N(0,Val2a))» (9:32) were va { alcadhna(2u 0) alo) tea, tala) = (ea) ~ faa) fale Vola) = 08 (2a) falta) / (do, a= f Hoptae, bution using matrices, 2. One can also present the joint ee = oi(a1) = 1 — RE Fea (1) ) SN (O,ding(Via(=a))) > il a) ~ 920) ~ Ya — BG Halal (9.33) where 0 = (0,...,0) is aq % 1 vector of zeros, and diag(Va(eq)) is a qx q diagonal matrix with the ath diagonal element equal to Va(a) (@=1,...,4) and all off diagonal elements being zero. Equation (9.33) shows that the different components of f(a) are asymptotically independent of one another. Nielsen and Sperlich (2005) suggest using leave-one-out least squares cross-validation to select the smoothing parameters, while ‘Masnmen and Park (2005) recommend minimizing the penalized sum of squared residuals, Le., they recommend choosing fy... fy to minimize the objective function PLS(h) = RSS(h) [ By aco : (9.34) 9.1_AN ADDITIVE MODEL. 295, where RSS(h) y [i-a-Zacen] (9.35) a S In this chapter we do hot discuss kernel estimation of an additive model with interaction terms, or derivative estimation in generalized auditive models. However, eaders can find a discussion ofthese issues in Sperich, Tjostheim and Yang (2002) and Yang, Sperlich and Hirde (2008). Als, for efficient and fast spline-backfitted kernel smoothing of additive regression models, see Wang and Yang (2005). 
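For completeness, the ordinary backfitting iteration of Section 9.1.3 is also easy to code. The sketch below uses univariate local constant smoothers and the common formulation in which each pass smooths the current partial residuals against one covariate at a time; the function names, the bandwidth vector h, and the stopping rule are illustrative assumptions rather than a prescription from the text.

import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nw_smooth(x, y, h):
    """Univariate local constant smoother, fitted values at the sample points."""
    K = gauss((x[:, None] - x[None, :]) / h)        # K[i, j] = k((x_i - x_j)/h)
    return (K @ y) / K.sum(axis=1)

def backfit(y, Z, h, tol=1e-4, max_iter=200):
    """Classical backfitting for Y = c_0 + sum_a g_a(Z_a) + u.

    Each pass smooths the partial residuals against one covariate at a time
    and recentres the component so that it averages to zero; iteration stops
    when the relative change in the fitted components falls below tol."""
    n, q = Z.shape
    c0 = y.mean()
    g = np.zeros((n, q))                            # g[:, a] holds g_a(Z_ai)
    for _ in range(max_iter):
        g_old = g.copy()
        for a in range(q):
            partial = y - c0 - g.sum(axis=1) + g[:, a]
            g[:, a] = nw_smooth(Z[:, a], partial, h[a])
            g[:, a] -= g[:, a].mean()               # identification: E[g_a(Z_a)] = 0
        change = np.sum((g - g_old) ** 2) / max(np.sum(g_old ** 2), 1e-12)
        if change < tol:
            break
    return c0, g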
9.1.5 Additive Models with Link Functions ‘A more general additive funetion is an additive model with a known link funetion given by vino [0+ Soot] =m “> 36) where G(,) isa known link funetion, When G(:) is the identity function, (9.36) collapses to (9.1). In practice, G() could be the exponential or logistic funetion, te Linton and Hirdle (1996) propose a marginal integration approach to estimate (9.36). Let m(z) = B(Y|z) and let M() = G-1(); then (9.36) can be waitten as . Mim(2)] = 60 +) ga(2a): (9.37) Deline alta) = f Mima Malad (9.38) here 2q = (21,---1 20-1) fastee-- 4) Sala) i the PDF of 2. rom (9.38) we know that ga difers from ga only by an additive constant Linton and Hardle (1996) suggest estimating ¢q based on the multi- dimensional local constant keruel estimator nD Viking (2a Zai) Wha (Za WTS ig (an Zot Vig an Za) * where k() is a univariate second order kernel function, and where Wha) = [4a hs w((2e — Zi)/He) with w(-) being a kernel fune- tion of order dq 1 (a higher order kemel when q > 2). 1(2a 70) (9.39) 206 9, ADDITIVE AND SMOOTH CORFFICIENT MODELS ‘Then Ga(%a) is estimated by the sample analogue of (9.38) given by Gaza) = * > MU (2a Zeta) (0.40) ‘When M is the identity function, then $q(2a) is linear in the ¥7's. In general, however, Ga(a) is @ nonlinear function of the ¥'s. The constant term cy is estimated by BU Leela, (9.41) and ga(éa) i estimated by Gaza) = Gala) — Go (9.42) ‘The final estimate for m(2) is given by (G = M~) tn(z) = aoa +a (9.43) ‘Under some regularity conditions, Linton and Hiirdle (1996) prove ‘the following results, Vita [Bu — Galea) ~ Mtta(%n)] SNO.Valza)), (9-44) where ala) = (0/2) {2 (c0) f MOM fa eta 29x) [ Mme) LO (ctea}. Valea) =m f (MOfaten)? Reds A consistent estimator of Va(za) is given by B= Dwie, 92._AN ADDITIVE PARTIALLY LINEAR MODEL 207 where f= nS MOM i Zed 2a Za) +t alco Zan) = eba2a = Za) Wa (a Zas) and where i, = ¥;— m(Z4), ‘As observed by Horowitz and Mammen (2004), one undesirable property of the above marginal integration based estimator is that a higher order kernel must be used for w(-), with the order of the kernel depending on q (i.e, the larger is q, the higher the order of w()). This arises because, atthe initial stage, one has to estimate a q-dimensional function m(2) nonparametrically (ic., the additive structure is not im- posed at tho initial stage). Horowitz and Mammen suggést using series ‘methods at the inital stage when estimating m(z), then using a ker- nel local linear method to estimate the individual functions ga(:)- ‘iscuss Horowitz and Mammen’s approach in Chapter 15, where we it~ ‘troduce nonparametric series methods. Horowitz and Lee (2005) further extend Horowitz and Mammen’s result to additive quantile regression ‘models Horowitz (2001) generalizes the above result to the ease in which the functional form of the link function G() is unknown, 9.2 An Additive Partially Linear Model While the additive model ontlined above has the advantage of not suf- fering from the curse of dimensionality, itis, however, a very restrictive model since it does not allow for the presence of interaction terms. 
An additive partially linear model can avoid this problem and maintain ‘he “one-dimensional” nonparametric rate of convergence, Consider @ ‘model of the form Y= D+KB+ aZu) +e baylZe) tu, (045) where X; is a p x 1 vector of random variables, 6 1) 8 0 1% 1 veetor of unknown parameters, X; ean contain interaction terms involving (Z3.++, Zs), 6 is @ sealae parameter, the Z's are all nie variate continuous variables, and the ga('s (a = I,...,@) are unknown 208 9. ADDITIVE AND SMOOTH COFFFICIENT MODELS smooth functions. The observations {¥i,X!, Z1is---+ Za} We impose the condition Bga(Zas)] = 0 for a = 1,-..,4 in order to identify the individual components gx(),-»94(?) Defining », = Xi — E(XilZi), we say tha X ie not a deterministic fanetion of (Z1y-+-124) if Ejei{] is positive definite, (0.46) Under (9.46), one can use the method discussed in Chapter 6 to obtain a y/i-consistent estimator of @, say. One then can estimate the additive functions by rewriting the model as Yi-xi3 Vi ga(Zai) + eis where e¢ = uj + X{(9—B). Since J — 8 = Op(n-¥/9) has a faster rate of convergence than the nonparametric additive function estimators, the resulting nonparametric estimator of ga(q) has the same asymptotic properties as when @ is known. Thus, they have the same asymptotic distribution as the nonparametric additive model discussed in Section 9.1 (without the linear components) We formally summarize this estimator below. Define 0; = Xj — (XZ). When E(v;v,) is positive definite, one can use the following simple two-step method to estimate this model: (i) First, ignore the additive structure of the model and estimate using Robinson's (1988) approach, ie. regress ¥; ~ BZ) on X,—F(X;lZ;) to obtain a semiparametric estimator of 8, say 3, ‘where B(¥i|Z,) and B(X;|Z:) are the kernel estimators of E(¥i|Z:) and BU%|Z;), respectively. (ii) Next, rewrite the model as ¥i - X13 = gy (Zi) +--+ g9(Zqd) +6 ‘and estimate the additive functions g(-) as discussed in Section ol. ‘The asymptotic distribution of @ is discussed in Chapter 7, and the asymptotic distribution of the resulting estimator of ga(za)s 54¥ fala) is the same as that discussed in Section 9.1, This is because BB = O,(n~"!), which is faster than the nonparametric rate of convergence; therefore, replacing 8 by 9 does not affect the asymptotic distribution of Qa(%a) 92._AN ADDITIVE PARTIALLY LINEAR MODEL ‘The advantage of the above procedure is that it is computationally simple. However, there is one problem with this approach. When Evi!) is not positive definite, the proceduse fails to estimate f. Consider the simple ease where q= 2, and consider a model of the form Ye= Bo + (ZZ) + 91(Zu) + 92(Za1) + uy (9.47) where we have X; = Z3,Zy. In this ease, X; is a deterministic function of Z, and we have «4 = Xi — B(X |Z) = Xi Xj = 0. That is, the above procedure cannot apply in such eases. However, (9.16) isa strong assumption asi rules out eases for which X; is a deterministic (but nonadditive) function of (Z34,...5Zqi)- For example, when q = 2 we may wish to let X; contain interaction terms such as X; = ZsZ2i, but then (9.46) fails to hold Fan ot al. (1998) and Fan and Li (2008) proposed an alternative _Yi-consistent estimator of 8 based on marginal integration methods. ‘Their method has the advantage that it does not rely pon condition (9.46). As discussed carlier, the marginal integration method is compu- tationally costly: Schick (1096) and Mangan and Zerom (2005) suggest using # computationally efficient method for estimating 3. 
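When E[v_i v_i'] is positive definite, step (i) of the simple two-step procedure above is a standard Robinson-type partialling-out regression. A minimal sketch in Python/NumPy is given below, with illustrative function names and a given bandwidth vector h; leave-one-out fitting and trimming, which are typically used in practice, are omitted.

import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def robinson_beta(y, X, Z, h):
    """Step (i): estimate beta in Y = c_0 + X'beta + sum_a g_a(Z_a) + u by
    regressing Y - Ehat(Y|Z) on X - Ehat(X|Z), where the conditional means
    are multivariate local constant kernel fits.  X is (n, p), Z is (n, q)."""
    D = (Z[:, None, :] - Z[None, :, :]) / h         # (n, n, q) scaled differences
    K = np.prod(gauss(D), axis=2)                   # product kernel weights
    W = K / K.sum(axis=1, keepdims=True)            # smoother matrix, rows sum to 1
    y_tilde = y - W @ y                             # Y_i - Ehat(Y | Z_i)
    X_tilde = X - W @ X                             # X_i - Ehat(X | Z_i)
    beta, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
    return beta

# Step (ii): treat y - X @ beta as the response of a purely additive model and
# estimate g_1, ..., g_q with any of the additive estimators sketched earlier
# (e.g. backfitting or marginal integration).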
We describe Manzan and Zerom's procedure below. 9.2.1 A Simple Two-Step Method For any random variable (vector) &i, define Eq = Elg(Zi)|Zad] SBUEZ = 2]fa(ea)dea, where wal2) = falZa)fa(2a}/ f(Zas%a)s @ = 1,...,4, and define the projection of om the additive functional space by Ea(G) = D0 bos. (9.48) Ss Applying the projection (9-48) to (9.45) we obtain Ba(¥) = BalX'8-+ So 9lZa)- (0.49) Subtracting (9.49) from (9.45) we get ¥i— Bai) = (Xi — Ba(X))'8 + (9.50) ‘We can estimate # by applying least squares methods. However, ‘we must first obtain consistent estimators for E4(¥i) and Ea(Xi)- By 00, 9 ADDITIVE AND SMOOTH COB: \CIENT MODELS the results given in Section 9.1 we know that they can be consistently estimated by Bah) > Yos and Ba(Xi) = > Xan (9.51) where 1s Zoj— Zoi _falZas) we 54 ( A ey sg (9.52) and, with & = ¥j or & = Xs in (9.52), we obtain B4(¥) and Ba(X,), respectively "Therefore, a feasible estimator off is given by B= S25 og Sx-Baany-BA0) (953) where So,p =n“! IE, G:D!, Se = Se,c. The asymptotie disteibution of @ is given in Manzen and Zerom (2005) and is as follows. ‘Theorem 9.2. Under the same regularity conditions given in Mansan ‘and Zerom (2005), ValB — 8) + (0,3) in distribution, where 5 = 106, @ = Elnni], m = Xi — Ea(X;), and Q = EluPninl _ A consistent estimator of © is given by B= 6106-1 where $ = ADE IN ~ BAAN ~ BAKO), © = ALE, aK — Bal XIX — Ea(A and ty = ¥~ BAC) — (%— BACKS. When the error is conditionally homoskedaste, 9 is semiparaanelat cally fficient in the sense that its asymptotic variance reaches the seu parametric efficiency bound for this model (see Chamberlain (1992)). Positive definiteness of © is an identification condition for g. Te allows for X, to be a deterministic function of (Zss.--. Zs) provided 1t iv not an additive function of the Zac's. More specifically, cousider the simple eas of q = 2in (247) with X; = ZisZax given by Yim Bo + (ZisZ0i)B + 91(Zse) + gal Zae) + us (9.54) ‘Model (9.54) is immune to the curso of dimensionality since it only involves a one-dimensional nonparametric function ga(-) (a = 1,2). 93._A SEMIPARAMETRIC (SMOOTH) COEFFICIENT MODEL an Also, itis more general than an additive model that does not have in- teraction terms (i.c., (9.45) allows interaction terms to enter the model parametrically) Continuing with this-simple case (q = 2), we next estimate the individual nonparametric components gq(2q). Given the Y-consistent estimator 3, we can rewrite (9.45) as = X48 = Bot DY galZai) + (9.55) where 6 = us +X1(3—8) Equation (9.55) is essentially an additive regression model with — X/8 as the new dependent variable and [ai + X{(3 — 8)] as the new (composite) error. Hence, one can estimate ga(zq) by marginally integrating a local linear model as discussed in (9.6) and (9.7), with ¥; replaced by ¥; ~ X{9. Let daza) denote the resulting estimator of {94(%a). From Theorem 9.1 and the fact that ~ 8 = O,(n-¥/2), i is ‘easy to see that the asymptotic distribution of Ga(za) is the same as that of Ja(e), which was given in Theorem 9.1. . The intercept term y can be Yi-consistently estimated by By = ¥ — XG, where ¥ Ypand X = nt", Xz. The con ditional mean funct Dy = Aye Zqi = 4] can de estimated by 6 +2’3-+ 54.1 da(Za), and the error can be estimated by tis = Yi - Bo — X43 — V5 Gal Zan). 9.3. 
A Semiparametric Varying (Smooth) Coefficient Model

In the previous section we discussed a partially linear model of the form
$$Y_i = \alpha(Z_i) + X_i'\beta + u_i, \qquad (9.56)$$
where $\alpha(\cdot)$ is an unknown function and $\beta$ is an $r \times 1$ vector of unknown parameters. In this section we consider a more general semiparametric regression model, the so-called semiparametric smooth coefficient model, which nests the partially linear model as a special case. The smooth coefficient model is given by
$$Y_i = \alpha(Z_i) + X_i'\beta(Z_i) + u_i, \qquad (9.57)$$
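A natural starting point for estimating (9.57) is a kernel-weighted least squares regression of Y on (1, X') with weights centered at z, which delivers pointwise estimates of alpha(z) and beta(z). The sketch below illustrates this local constant idea with assumed function names and a given bandwidth vector h; it is only an illustrative sketch, and the estimators actually developed for this model are presented in the remainder of the section.

import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def smooth_coefficient_fit(y, X, Z, z0, h):
    """Local constant estimate of (alpha(z0), beta(z0)) in the smooth
    coefficient model Y = alpha(Z) + X'beta(Z) + u: a kernel-weighted least
    squares regression of Y on (1, X') with weights K_h(Z_i - z0).
    X is (n, p), Z is (n, d), z0 is a length-d point, h a length-d bandwidth."""
    n = y.shape[0]
    Xt = np.column_stack([np.ones(n), X])           # augment with an intercept
    K = np.prod(gauss((Z - z0) / h), axis=1)        # product kernel weights
    A = Xt.T @ (K[:, None] * Xt)
    b = Xt.T @ (K * y)
    coef = np.linalg.solve(A, b)
    return coef[0], coef[1:]                        # alpha(z0), beta(z0)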
