Previous Volumes
Volume 21: Modelling and Evaluating Treatment Effects in Econometrics – Edited by Daniel L.
Millimet, Jeffrey A. Smith and Edward Vytlacil
Volume 22: Econometrics and Risk Management – Edited by Jean-Pierre Fouque, Thomas B.
Fomby and Knut Solna
Volume 23: Bayesian Econometrics – Edited by Siddhartha Chib, Gary Koop, Bill Griffiths and
Dek Terrell
Volume 24: Measurement Error: Consequences, Applications and Solutions – Edited by Jane
Binner, David Edgerton and Thomas Elger
Volume 25: Nonparametric Econometric Methods – Edited by Qi Li and Jeffrey S. Racine
Volume 26: Maximum Simulated Likelihood Methods and Applications – Edited by R. Carter Hill
and William Greene
Volume 27A: Missing Data Methods: Cross-Sectional Methods and Applications – Edited by David
M. Drukker
Volume 27B: Missing Data Methods: Time-Series Methods and Applications – Edited by David M.
Drukker
Volume 28: DSGE Models in Macroeconomics: Estimation, Evaluation and New Developments –
Edited by Nathan Balke, Fabio Canova, Fabio Milani and Mark Wynne
Volume 29: Essays in Honor of Jerry Hausman – Edited by Badi H. Baltagi, Whitney Newey, Hal
White and R. Carter Hill
Volume 30: 30th Anniversary Edition – Edited by Dek Terrell and Daniel Millimet
Volume 31: Structural Econometric Models – Edited by Eugene Choo and Matthew Shum
Volume 32: VAR Models in Macroeconomics – New Developments and Applications: Essays in
Honor of Christopher A. Sims – Edited by Thomas B. Fomby, Lutz Kilian and Anthony
Murphy
Volume 33: Essays in Honor of Peter C. B. Phillips – Edited by Thomas B. Fomby, Yoosoon Chang
and Joon Y. Park
Volume 34: Bayesian Model Comparison – Edited by Ivan Jeliazkov and Dale J. Poirier
Volume 35: Dynamic Factor Models – Edited by Eric Hillebrand and Siem Jan Koopman
Volume 36: Essays in Honor of Aman Ullah – Edited by Gloria Gonzalez-Rivera, R. Carter Hill
and Tae-Hwy Lee
Volume 37: Spatial Econometrics – Edited by Badi H. Baltagi, James P. LeSage, and R. Kelley Pace
Volume 38: Regression Discontinuity Designs: Theory and Applications – Edited by Matias D.
Cattaneo and Juan Carlos Escanciano
ADVANCES IN ECONOMETRICS VOLUME 39
THE ECONOMETRICS OF
COMPLEX SURVEY DATA:
THEORY AND APPLICATIONS
EDITED BY
KIM P. HUYNH
Bank of Canada, Canada
DAVID T. JACHO-CHÁVEZ
Emory University, USA
GAUTAM TRIPATHI
University of Luxembourg, Luxembourg
ISOQAR certified Management System, awarded to Emerald for adherence to Environmental standard ISO 14001:2004.
CONTENTS

LIST OF CONTRIBUTORS vii

INTRODUCTION ix

PART I
SAMPLING DESIGN

PART II
VARIANCE ESTIMATION

PART III
ESTIMATION AND INFERENCE

PART IV
APPLICATIONS IN BUSINESS,
HOUSEHOLD, AND CRIME SURVEYS

INDEX 315
INTRODUCTION
SAMPLING DESIGN
“Can Internet Match High Quality Traditional Surveys? Comparing the Health and
Retirement Study and Its Online Version” by Marco Angrisani, Brian Finley and
Arie Kapteyn revisits the question of the comparability of online and more traditional
interview modes by studying differences across Internet-based, face-to-face and
phone-based surveys. The authors find little evidence of mode effects across a
variety of outcomes, providing support for Internet-based surveys.
“Effectiveness of Stratified Random Sampling for Payment Card Acceptance
and Usage” by Christopher S. Henry and Tamás Ilyés uses the universe of merchant
cash registers in Hungary to assess the effect of stratified random sampling on
estimates of payment card acceptance and usage. The authors compare county,
industry and store-size stratifications, mimicking the usual stratification criteria
for standard merchant surveys, which lets them quantify the effect on estimates
of card acceptance for different sample sizes.
VARIANCE ESTIMATION
“Wild Bootstrap Randomization Inference for Few Treated Clusters” by James
G. MacKinnon and Matthew D. Webb proposes a bootstrap-based alternative to
randomization inference, which mitigates problems of over- or under-rejection in
t tests in pure treatment or difference-in-differences settings when the number of
clusters is very small.
“Variance Estimation for Survey-weighted Data Using Bootstrap Resampling
Methods: 2013 Methods-of-Payment Survey Questionnaire” by Heng Chen and Q.
Rallye Shen proposes a bootstrap-resampling method to estimate variability when
sampling units are selected through an approximate stratified two-stage sampling
design. Their proposed method allows for randomness from both the sampling
design and the raking procedure.
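The resampling idea can be illustrated in a few lines. This is a generic sketch of bootstrap variance estimation under stratified cluster sampling, not the authors' estimator: all names and the toy data are hypothetical, and the real procedure must also replicate the raking step inside each bootstrap iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(strata, B=500):
    """Bootstrap SE of a weighted mean under stratified cluster sampling.

    `strata` maps a stratum label to a list of clusters; each cluster is a
    list of (weight, value) pairs. Within every stratum, clusters are
    resampled with replacement, mimicking the first sampling stage.
    """
    estimates = []
    for _ in range(B):
        num, den = 0.0, 0.0
        for clusters in strata.values():
            # Resample as many clusters as were originally drawn in the stratum.
            idx = rng.integers(0, len(clusters), size=len(clusters))
            for i in idx:
                for w, y in clusters[i]:
                    num += w * y
                    den += w
        estimates.append(num / den)
    return float(np.std(estimates, ddof=1))

# Toy data: two strata, each with two clusters of (weight, value) pairs.
strata = {
    "urban": [[(1.0, 1), (1.2, 0)], [(0.8, 1), (1.1, 1)]],
    "rural": [[(2.0, 0), (1.5, 1)], [(1.7, 0), (2.2, 0)]],
}
se = bootstrap_se(strata)
```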
“Model Selection Tests for Complex Survey Samples” by Iraj Rahmani and Jeffrey
M. Wooldridge extends Vuong’s model selection test (“Likelihood Ratio Tests for
Model Selection and Non-Nested Hypotheses,” Econometrica, 1989) to allow for
complex survey samples. By using an M-estimation setting, their test applies to
general estimation problems including linear and nonlinear least squares, Poisson
regression and fractional response models. With cluster samples and panel data,
they show how to combine the weighted objective function with a cluster-robust
variance estimator, thereby expanding the scope of their test.
“Inference in Conditional Moment Restriction Models When There is Selection
Due to Stratification” by Antonio Cosma, Andreï V. Kostyrka and Gautam
Tripathi shows how to use a smoothed empirical likelihood approach to conduct
efficient semiparametric inference in models characterized as conditional moment
equalities when data are collected by variable probability sampling.
“Nonparametric Kernel Regression Using Complex Survey Data” by Luc Clair
derives the asymptotic properties of a design-based nonparametric kernel-based
regression estimator under a combined inference framework involving multivariate
mixed data. It also proposes a least squares cross-validation procedure for selecting
the bandwidth for both continuous and discrete variables. Simulation results show
that the estimator is consistent and that efficiency gains can be achieved by weight-
ing observations by the inverse of their inclusion probabilities if the sampling is
endogenous.
“Nearest Neighbor Imputation for General Parameter Estimation in Survey
Sampling” by Shu Yang and Jae Kwang Kim studies the asymptotic properties
of the nearest neighbor imputation estimator of population parameters.
ABSTRACT
We examine sample characteristics and elicited survey measures of two stud-
ies, the Health and Retirement Study (HRS), where interviews are done
either in person or by phone, and the Understanding America Study (UAS),
where surveys are completed online and a replica of the HRS core ques-
tionnaire is administered. By considering variables in various domains, our
investigation provides a comprehensive assessment of how Internet data col-
lection compares to more traditional interview modes. We document clear
demographic differences between the UAS and HRS samples in terms of
age and education. Yet, sample weights correct for these discrepancies and
allow one to satisfactorily match population benchmarks as far as key socio-
demographic variables are concerned. Comparison of a variety of survey
outcomes with population targets shows a strikingly good fit for both the HRS
and the UAS. Outcome distributions in the HRS are only marginally closer
to population targets than outcome distributions in the UAS. These patterns
arise regardless of which variables are used to construct post-stratification
weights in the UAS, confirming the robustness of these results. We find little
evidence of mode effects when comparing the subjective measures of self-
reported health and life satisfaction across interview modes. Specifically, we
do not observe very clear primacy or recency effects for either health or life
satisfaction. We do observe a significant social desirability effect, driven by
the presence of an interviewer, as far as life satisfaction is concerned. By
and large, our results suggest that Internet surveys can match high-quality
traditional surveys.
Keywords: Online survey; survey methods; weighting; survey mode
effects; face-to-face interviews; online interviews
1. INTRODUCTION
The collection of high-quality data on households and individuals tends to be
labor intensive, costly and slow. When adopting traditional survey modes like
face-to-face or telephone interviewing, typically several years elapse from the
moment a survey is designed to final data availability. The Internet, with its
promise of real-time results and looming ubiquity, provides a tempting alterna-
tive for faster and more cost-effective data collection. Online surveys, however,
differ from more traditional surveys in several respects which may affect both
sample representativeness and data quality.
First, Internet coverage is still not entirely pervasive, especially among more
economically disadvantaged groups and the elderly. Data from the Pew Research
Center showed that only 51% of Americans aged 65 or older had a home broadband
connection in 2016, while the fraction of home broadband owners was about 77%
among 18–29 year olds. Likewise, home broadband coverage was only 53% for
Americans with incomes of less than $30,000 and 93% for those with incomes
of $75,000 or greater. As a result, the representativeness of online surveys may
be jeopardized (Schonlau et al., 2009). Telephone surveys, however, face similar
difficulties with the widespread adoption of voice mail and cell phones (Blumberg,
Luke, & Cynamon, 2004). Second, even with complete coverage of the population,
individual characteristics are bound to influence the likelihood of completing an
online survey versus a face-to-face or phone survey, thereby introducing relevant
selectivity issues and nonresponse biases that may vary by interview mode (Couper,
2011). Third, mode effects need to be considered, as the same question may be
answered differently in person, by phone or over the Internet (Schwarz & Sudman,
1992). Face-to-face and phone interviews leave more room for clarification and
offer more control of who is actually answering the questionnaire. On the other
hand, Web surveys offer more privacy and could thereby encourage more accurate
and honest reporting on personal and sensitive matters, while the presence of
an interviewer in face-to-face and telephone interviews may induce interviewer
effects.
Chang and Krosnick (2009) compare sample representativeness and data quality
of Internet-based surveys and phone-based surveys. They conclude that as long as
Internet data are collected from a probability-based sample, these exhibit higher
accuracy than data collected by phone. Their study, however, is limited to a rather
specific topic, namely politics.
In view of the existing literature, the contribution of this paper is twofold. First,
we revisit the question of comparability of online and more traditional interview
modes by studying differences across Internet-based, face-to-face and phone-based
surveys. Second, we focus on a diverse set of outcomes, ranging from home
ownership and labor force status to self-reported health and life satisfaction. The
aforementioned sources of differences between Web surveys and face-to-face or
telephone surveys may affect each of these outcomes in different ways. Thus,
by considering variables in various domains, our investigation provides a more
robust and comprehensive assessment of how Internet data collection compares to
more traditional interview modes. Moreover, while our analysis is performed at
a time when Internet coverage has increased substantially in the population, we
focus (because of data availability and comparability issues) on the subgroup of
individuals aged 55 and older. Within this segment of the population, barriers
to adoption of new technology may still imply significant selectivity issues and
limit study generalizability, as recently pointed out by Remillard et al. (2014). The
extent to which Internet data are comparable to data collected with more traditional
interview modes is then of particular scientific interest in research concerned with
this subpopulation.
with a high school degree, zip codes with a relatively high proportion of females
with a high school degree receive a higher probability of being sampled. The SIS
is implemented iteratively. That is, after selecting a zip code, the distributions of
demographics in the UAS are updated according to the expected contribution of this
zip code toward the panel’s representativeness; updated measures of desirability
are computed and new sampling probabilities for all other zip codes are defined.
This procedure provides a list of zip codes to be sampled. For each zip code in this
list, addresses are then drawn in a simple random sample from the USPS database.
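The iterative logic can be sketched schematically. This is a stylized illustration, not the actual UAS algorithm: the desirability measure used here (the projection of each zip code's demographic mix onto the panel's current shortfall relative to the target) and all variable names are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_zip_codes(zip_shares, target, n_draws):
    """Iteratively draw zip codes, favoring those that reduce the gap
    between the panel's current demographic mix and the target mix.

    zip_shares: (n_zips, k) array of demographic shares for each zip code.
    target:     (k,) array of target population shares.
    """
    panel = np.zeros_like(target)          # running demographic totals of the panel
    drawn, available = [], list(range(len(zip_shares)))
    for _ in range(n_draws):
        mix = panel / panel.sum() if panel.sum() > 0 else panel
        shortfall = np.clip(target - mix, 0, None)     # under-represented groups
        # Hypothetical desirability: how much a zip code helps the shortfall.
        desirability = zip_shares[available] @ shortfall + 1e-9
        p = desirability / desirability.sum()
        pick = rng.choice(len(available), p=p)
        zi = available.pop(pick)                       # drawn without replacement
        drawn.append(zi)
        panel = panel + zip_shares[zi]                 # update before next draw
    return drawn

zip_shares = rng.dirichlet(np.ones(3), size=50)        # 50 zip codes, 3 groups
target = np.array([0.5, 0.3, 0.2])
chosen = select_zip_codes(zip_shares, target, n_draws=10)
```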
In the UAS, sample weights are survey-specific. They are provided with each
UAS survey and are meant to make each survey data set representative of the
reference US population with respect to a predefined set of sociodemographic
variables. Sample weights are constructed in two steps. In a first step, a base weight
is created to account for unequal probabilities of sampling zip codes produced
by the SIS algorithm and to reflect the probability of a household being sampled,
conditional on its zip code being sampled. In a second step, final post-stratification
weights are generated to correct for differential nonresponse rates and to bring the
final survey sample in line with the reference population as far as the distribution
of key variables of interest is concerned.
More precisely, to compute the base weight, the unit of analysis is a zip code. A
logit model is estimated for the probability that a zip code is sampled as a function
of its characteristics, namely census region, urbanicity and population size, as
well as its sex, race, age, marital status and education composition. Estimation is
carried out on an ACS file that contains five-year average characteristics at the zip
code level, with urbanicity derived from the 2010 Urban Area to ZIP Code Tabulation
Area Relationship File of the US Census Bureau and merged to this file. The outcome
of this logit model is an estimate of the marginal probability of a zip code being
sampled, which, because of the implementation of the SIS algorithm, is not known
ex ante. Denote by w1b the inverse of the logit-estimated probability of sampling
each zip code.
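The first-step construction can be sketched as follows. The data are simulated and the Newton-Raphson fitter is a generic stand-in for whatever logit routine was actually used; the covariates are hypothetical placeholders for the zip-code characteristics named above.

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Newton-Raphson fit of a logit model: P(sampled) = 1/(1 + exp(-X @ b))."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)                      # logit weights for the Hessian
        grad = X.T @ (y - p)
        hess = (X * W[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(2)
n = 2000
# Hypothetical zip-code characteristics (e.g., urbanicity, share female, log pop.)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([-1.0, 0.8, -0.5, 0.3])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logit(X, y)
p_hat = 1 / (1 + np.exp(-X @ beta_hat))
w1b = 1.0 / p_hat[y == 1]    # inverse estimated sampling probability of sampled zips
```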
Next, for each sampled zip code, the ratio of the number of households in the
zip code to the number of sampled households within the zip code, denoted by w2b
is computed. For the first recruitment batch, which is a simple random sample of
addresses from the US population and does not use the SIS algorithm, it is assumed
(without loss of generality) that w1b = w2b = 1 instead.
The base weight is a zip code-level weight defined as

wb = a · w1b · w2b,

where a is a correction factor such that the sum of the base weights is equal to the
number of all selected households (if all of them respond). This number is equal
to the size of the first recruitment batch (10,000) and to the number of sampled zip
10 MARCO ANGRISANI ET AL.
codes times 40 (the number of sampled households within each drawn zip code) for
all subsequent recruitment batches. Hence, the correction factors take two values,
one for the first recruitment batch and one for all subsequent recruitment batches.
UAS members are assigned a base weight, computed as described above,
depending on the zip code where they reside at the time of recruitment.
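A minimal sketch of this construction, assuming the base weight multiplies the two first-step factors w1b and w2b and that the correction factor a simply rescales their sum to the number of selected households; the numbers are made up for illustration.

```python
import numpy as np

def base_weights(w1b, w2b, n_selected):
    """Zip-code-level base weight: product of the two first-step factors,
    rescaled by a single correction factor `a` so that the weights sum to
    the number of selected households."""
    raw = np.asarray(w1b) * np.asarray(w2b)
    a = n_selected / raw.sum()               # correction factor
    return a * raw

w1b = np.array([2.5, 1.8, 4.0, 3.2])         # inverse zip sampling probabilities
w2b = np.array([120.0, 95.0, 60.0, 80.0])    # households / sampled households
wb = base_weights(w1b, w2b, n_selected=160)  # e.g., 4 zips x 40 households each
```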
The post-stratification weights in the second step are generated by a rak-
ing algorithm that, starting from the base weight, compares, iteratively adjusts,
and eventually matches relative frequencies in the target population with relative
weighted frequencies in the survey sample for the following one- and two-way
marginal distributions: race, gender × age, gender × education, household size ×
total household income, census regions and urbanicity. The benchmark distribu-
tions against which UAS surveys are weighted are derived from the CPS Annual
Social and Economic Supplement administered in March of each year.
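Raking is iterative proportional fitting; the paragraph above can be sketched generically with two hypothetical margins (say, sex and education) rather than the full set of one- and two-way margins the UAS uses.

```python
import numpy as np

def rake(weights, groups, targets, iters=50):
    """Iterative proportional fitting: repeatedly rescale weights so that
    weighted category shares match each set of marginal targets in turn.

    groups:  list of integer arrays, one per margin (category of each unit).
    targets: list of arrays of target shares, one per margin.
    """
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(iters):
        for g, t in zip(groups, targets):
            shares = np.bincount(g, weights=w, minlength=len(t)) / w.sum()
            w *= (t / shares)[g]     # scale every category toward its target
    return w

rng = np.random.default_rng(3)
n = 1000
sex = rng.integers(0, 2, n)          # hypothetical margin 1
educ = rng.integers(0, 3, n)         # hypothetical margin 2
w = rake(np.ones(n), [sex, educ],
         [np.array([0.45, 0.55]), np.array([0.5, 0.3, 0.2])])
```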
Post-stratification weights are trimmed to limit variability and improve the effi-
ciency of estimators using the weights. This is performed using the general weight
trimming and redistribution procedure described by Valliant, Dever, and Kreuter
(2013). More precisely, indicating by wi,raking , the raking weight for respondent i
and with w raking the sample average of raking weights within the survey sample;
the procedure involves
(1) Setting the lower and upper bounds on weights equal to L = 0.25w raking and
U = 4w raking , respectively6 ;
(2) Resetting any weights smaller than the lower bound to L and any weights
greater than the upper bound to U :
⎧
⎨L wi,raking ≤ L
wi,trim = wi,raking L < wi,raking < U
⎩
U wi,raking ≥ U
c
(3) Computing the amount of weight lost by trimming as wlost = N i=1 (wi,raking −
wi,trim ) and distributing it evenly among the respondents whose weights are
not trimmed.
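The three steps can be written directly. This is a sketch with made-up weights; note that the lost weight can be negative when the lower bound binds more than the upper one.

```python
import numpy as np

def trim_weights(w):
    """One trimming pass: cap weights at 0.25x and 4x the mean and spread
    the total weight gained or lost by capping evenly over the respondents
    whose weights were not trimmed, so the overall sum is preserved."""
    w = np.asarray(w, dtype=float)
    L, U = 0.25 * w.mean(), 4.0 * w.mean()   # step (1): bounds
    trimmed = np.clip(w, L, U)               # step (2): reset to L or U
    lost = (w - trimmed).sum()               # step (3): weight lost by trimming
    free = (w > L) & (w < U)                 # respondents left untrimmed
    trimmed[free] += lost / free.sum()       # redistribute evenly
    return trimmed

w = np.array([0.1, 1, 1, 1, 1, 1, 1, 1, 1, 12.0])  # mean 2.01: L=0.5025, U=8.04
wt = trim_weights(w)                                # sum preserved at 20.1
```

Because the redistribution can push weights past the bounds or off the raked margins, this single pass is not the end of the story, which is why raking and trimming are iterated as described next.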
While raking weights can match population distributions of selected variables,
trimmed weights typically do not. Therefore, the raking algorithm and the trimming
procedure are iterated until post-stratification weights are obtained that respect the
weight bounds and align sample and population distributions of selected variables.
The final post-stratification weight for each survey respondent, wi,post , is the
weight generated by applying the raking/trimming procedure just described to the
base weight.
It should be noted that in the UAS weighting procedure, there is no explicit
nonresponse adjustment to the base weights. Rather, it is the post-stratification
factor that is meant to correct for differential nonresponse across survey invitees.
A similar approach is adopted by the HRS. In the HRS, post-stratification of base
weights is performed in the first wave to adjust for nonparticipation in the study
and create a “baseline weight.” In subsequent waves, a post-stratification factor
applied to baseline weights corrects for wave-specific nonresponse.
Moreover, while the UAS uses the CPS to establish population benchmarks for
post-stratification, the HRS considered in this study relies on the ACS for post-
stratification. Population controls for CPS weights are derived from the census and
the ACS. For the purposes of our exercise, we will consider weighted CPS measures
as population targets to which UAS and HRS survey outcomes are compared.
Owing to the CPS’s close correspondence with the ACS, we do not expect that this
should particularly favor one survey over the other.
Table 1 shows the distributions of basic demographic variables for the unweighted
and weighted HRS and UAS samples. The first column reports the target distribu-
tions in the US population of individuals aged 55 and older. These are computed
using the 2015 CPS and its provided sample weights. As mentioned above, while
the UAS relies on the CPS to obtain population benchmarks for post-stratification,
the HRS uses the ACS. Yet, since the CPS itself weights to match the census and
the ACS, it is plausible to assume that both the UAS and HRS are weighted to align
their samples to essentially the same population. The choice of referring to the year
2015 for population benchmarks is due to the fact that our comparison exercise
uses data from the 2014 HRS wave and from the first HRS wave in the UAS, which,
Table 1. Distributions of basic demographic variables. Columns: CPS target,
HRS unweighted, HRS weighted, UAS unweighted, UAS weighted.

Gender
Male 0.462 0.425 0.461 0.491 0.462
Female 0.538 0.575 0.539 0.509 0.538
Mean abs. diff – 0.036 0.000 0.030 0.000
Race/Ethnicity
White 0.751 0.641 0.777 0.835 0.751
Black 0.098 0.191 0.100 0.067 0.098
Other 0.060 0.032 0.035 0.060 0.060
Hispanic 0.090 0.136 0.088 0.038 0.090
Mean abs. diff. – 0.069 0.013 0.042 0.000
Age
55–64 0.466 0.405 0.467 0.571 0.466
65–74 0.313 0.281 0.309 0.327 0.313
75–84 0.156 0.236 0.161 0.087 0.156
85+ 0.065 0.078 0.064 0.015 0.065
Mean abs. diff. – 0.046 0.003 0.059 0.000
Education
HS or less 0.461 0.521 0.453 0.267 0.252
Some college 0.163 0.191 0.194 0.255 0.240
Assoc. coll. degree 0.088 0.060 0.067 0.131 0.107
Bachelor 0.171 0.140 0.171 0.191 0.232
Postgrad 0.117 0.089 0.115 0.156 0.169
Mean abs. diff. – 0.035 0.013 0.078 0.084
Household income
<$30k 0.306 0.388 0.313 0.287 0.268
[$30k, $60k) 0.301 0.262 0.248 0.299 0.284
[$60k, $100k) 0.205 0.169 0.189 0.243 0.255
$100k+ 0.189 0.181 0.250 0.171 0.194
Mean abs. diff. – 0.041 0.034 0.019 0.027
although based on the 2014 HRS questionnaire, was completed between the years
2015 and 2017.
It is worth restating that the UAS final weights allow sample distributions
to match the population distributions of gender, race/ethnicity, age, education,
household income, household composition and location (i.e., census region and
urbanicity). The HRS adopts a more parsimonious model, where final weights align
the sample to the population along the dimensions of gender, age of respondent
Table 2. Home ownership.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
Own 0.803 0.790 0.803 0.779 0.805 0.760 0.760 0.770 0.758 0.784 0.775
Does not own 0.197 0.210 0.197 0.221 0.195 0.240 0.240 0.230 0.242 0.216 0.225
Mean abs. diff. – 0.013 0.000 0.025 0.002 0.043 0.043 0.033 0.045 0.019 0.028
Notes: wgh0, default UAS weights using gender, race, age, education, income, household size, census
region, and urbanicity; wgh1, as wgh0 with finer age brackets; wgh2, as wgh0 without education;
wgh3, as wgh0 without income; wgh4, as wgh0 without education and income; wgh5, as wgh0 without
census region and urbanicity.
The first comparison exercise concerns home ownership. This is a rather objec-
tive measure, and as a result, we expect it to be relatively more affected by
coverage/representativeness biases than by interview mode. In the population of
adults aged 55 and older, 80% are home owners and 20% are renters. Within the
HRS, phone interviewees are more likely to own a home than in-person intervie-
wees by a statistically significant margin (p-value: 0.006). This difference is not
statistically significant any longer (p-value: 0.106) when we limit the sample to
respondents younger than 80, among whom mode assignment is random. Thus,
there seems to be no evidence of a mode effect for home ownership: the observed
differences between in-person and phone interviewees likely stem from the
different age composition of these two groups.
In the UAS, the unweighted home ownership rate is 80% and ranges from 76% to
78% when weights are applied. When post-stratification is not based on education
(wgh2 and wgh4), the weighted home ownership rate is closer to its population
benchmark as well as to the population-level figures inferred from the HRS.
Across the various weighting schemes, the mean absolute difference between
the UAS and the CPS ranges from 1.9 to 4.5 percentage points, while the
unweighted mean is right on the mark (mean absolute difference equal to 0.2
percentage points). When comparing the HRS (pooling the phone and in-person
samples) and the CPS, the mean absolute difference is 1.3 percentage points.
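The comparison statistic used throughout these tables is straightforward to compute. A sketch with invented survey data; the benchmark shares below are the home-ownership figures quoted above (19.7% renters, 80.3% owners).

```python
import numpy as np

def mean_abs_diff(values, weights, benchmark):
    """Weighted category shares and their mean absolute difference from a
    benchmark distribution (the comparison statistic used in the tables)."""
    shares = np.bincount(values, weights=weights, minlength=len(benchmark))
    shares = shares / shares.sum()
    return shares, np.abs(shares - np.asarray(benchmark)).mean()

# Hypothetical home-ownership indicator (1 = owns) and survey weights.
own = np.array([1, 1, 0, 1, 0, 1, 1, 1])
w = np.array([1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.3, 0.7])
benchmark = np.array([0.197, 0.803])     # CPS shares: renters, owners
shares, mad = mean_abs_diff(own, w, benchmark)
```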
In Table 3, we focus on another arguably objective measure, namely health
insurance coverage. As can be seen, there exist some differences within the HRS.
The fraction of insured individuals is 94.2% among those interviewed by phone
and 95.7% among those interviewed in person. While this difference is statisti-
cally significant for the entire sample (p-value: 0.002), it is not among respondents
younger than 80 (p-value: 0.172). Again, this pattern suggests differences in age
Table 3. Health insurance coverage.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
No 0.052 0.050 0.058 0.043 0.049 0.050 0.050 0.045 0.051 0.042 0.046
Yes 0.948 0.950 0.942 0.957 0.951 0.950 0.950 0.955 0.949 0.958 0.954
Mean abs. diff. – 0.002 0.006 0.009 0.003 0.002 0.002 0.007 0.001 0.010 0.006
Notes: see Table 2.
Table 4. Retirement status.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
No 0.529 0.535 0.584 0.492 0.597 0.561 0.551 0.557 0.557 0.554 0.557
Yes 0.471 0.465 0.416 0.508 0.403 0.439 0.449 0.443 0.443 0.446 0.443
Mean abs. diff. – 0.007 0.055 0.037 0.068 0.032 0.022 0.028 0.028 0.025 0.028
Notes: see Table 2.
composition between the two samples, as all respondents 80 and over are eligi-
ble for Medicare. For the UAS, the unweighted fraction of insured individuals
is 95%, as in the reference population, and does not change appreciably when
weights are applied. It increases slightly to 95.8% when the weighting scheme does
not use education and income (wgh4), but the difference with the CPS remains
small and under no weighting scheme is the difference between the UAS and CPS
distributions statistically significant.
Questions about retirement status are bound to be subject to personal interpre-
tation and answers to them may also be affected by social desirability. In Table
4, we observe apparently sizeable, but statistically insignificant differences in the
proportion of retirees across surveys. The unweighted UAS proportion of retirees
is significantly different from the CPS and HRS proportions (p-value: 0.000 in
both cases), but tests for differences in the proportion retired between all pairs of
the CPS, full HRS, and UAS weightings have p-values between 0.1 and 0.4. It
should be noted that, for this specific outcome, differences may also stem from
the type of questions administered to respondents to elicit labor force status and
the type of recoding used by each study. Specifically, we rely on the “major labor
force status” recode of the CPS, which is based on answers to a series of labor
force items in the main questionnaire. For the HRS and the UAS, we adopt the
RAND-HRS indicator of retirement, which is based on a question where respon-
dents can select more than one employment status at once (e.g., working part-time
and retired).
Not surprisingly, the fraction of retired individuals is higher among HRS respon-
dents interviewed in person than among those interviewed by phone. This plausibly
reflects the fact that the former group is, on average, four years older (average age
is 67 for the phone interview group and 71 for the face-to-face interview group).
Overall, even after weighting, the HRS seems to somewhat underrepresent retirees,
with a proportion of 46.5% compared to 47.1% in the reference population. In con-
trast, the unweighted proportion of retired individuals in the UAS is substantially
(and statistically significantly) lower than in the CPS, at 40.3%. Such a difference
is likely driven by representativeness/selection bias. Individuals who
answer online surveys tend to be younger, better educated, and more attached to
the labor force. When default weights are applied, the fraction of retirees in the
UAS increases by 3 percentage points. Interestingly, the UAS weighting with the
closest-to-target proportion of retired individuals (44.9%) and the lowest mean
absolute difference with the CPS (0.022) is achieved when finer age brackets are
used (wgh1), which better correct for the underrepresentation of seniors. Yet,
differences across weighting schemes are rather modest.
In Table 5, we compare the distribution of individual earnings across surveys.
The proportion of individuals with earnings below $25,000 per year is larger in
the HRS than in the CPS by a statistically significant margin (p-value: 0.001)
and apparently more sizeable among those who are administered a face-to-face
interview. Conversely, the fraction of high earners (above $75,000) in the HRS is
0.4 percentage points higher than in the CPS, but this difference is not significant
(p-value: 0.261).
Table 5. Individual earnings.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
[0–$25k) 0.726 0.745 0.715 0.771 0.695 0.747 0.756 0.736 0.754 0.718 0.746
[$25k–$50k) 0.118 0.098 0.105 0.092 0.125 0.109 0.108 0.108 0.110 0.113 0.111
[$50k–$75k) 0.070 0.067 0.077 0.058 0.070 0.051 0.049 0.053 0.051 0.058 0.052
[$75k–$100k) 0.036 0.035 0.037 0.033 0.045 0.043 0.040 0.046 0.041 0.049 0.040
$100k+ 0.050 0.055 0.065 0.046 0.064 0.049 0.047 0.057 0.045 0.062 0.052
Mean abs. diff. – 0.010 0.010 0.018 0.012 0.011 0.014 0.011 0.013 0.010 0.011
Notes: see Table 2.
The UAS slightly underrepresents low earners, and the observed difference with
the CPS is statistically significant (p-value: 0.006). When weights are applied, the
fraction of individuals with earnings below $25,000 is closer to its population
benchmark, with borderline-significant differences from the CPS for only some
weights (p-values range from 0.051 to 0.605). The fraction of workers with
earnings above $75,000 is 2.4 percentage points larger in the UAS relative to the
CPS, and the difference is statistically significant (p-value: 0.002). With weights,
this gap varies between virtually 0 and 2.5 percentage points and only the differ-
ence using wgh4 is statistically significant at the 5% level (p-value 0.013). When
comparing various sets of weights in the UAS, we observe only minor differences
among them in terms of mean absolute difference from the reference population.
Only slightly larger deviations are shown when the set of raking factors features
finer age brackets (wgh1) and does not include household income (wgh3). In gen-
eral, the earnings distribution in both the UAS and the HRS matches the one in the
CPS very closely. The mean absolute difference is about 1 percentage point in the
HRS and in the UAS, both before and after weighting.
Next, we examine two subjective outcomes, that is, self-reported health and life
satisfaction. For both of them, we expect mode effects to be more apparent (we
analyze mode effects for these two measures in more detail in the next section).
Table 6 reports the distribution of self-reported health. All three surveys ask their
respondents to rate their health on a five-point scale, “excellent” (1), “very good”
(2), “good” (3), “fair” (4) and “poor” (5). As mentioned above, though, while in
the UAS and in the HRS all respondents answer about their own health, in the CPS
the household respondent reports about his/her own health as well as that of other
household members. In Table 6, we rely on all household members’ health status
reports in the CPS. The distribution of health status in the CPS remains virtually
Table 6. Self-reported health.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
Excellent 0.151 0.094 0.097 0.091 0.108 0.102 0.098 0.111 0.099 0.116 0.102
Very good 0.274 0.321 0.328 0.315 0.354 0.327 0.334 0.353 0.328 0.363 0.333
Good 0.329 0.328 0.328 0.328 0.327 0.339 0.335 0.337 0.342 0.335 0.345
Fair 0.171 0.191 0.183 0.198 0.171 0.187 0.188 0.157 0.186 0.148 0.177
Poor 0.076 0.067 0.066 0.068 0.041 0.045 0.045 0.043 0.045 0.039 0.043
Mean abs. diff. – 0.027 0.026 0.027 0.032 0.032 0.034 0.035 0.033 0.038 0.033
Notes: see Table 2.
unchanged when we only use health status referring to the household respondent
(see Table A1 in Appendix).
The first thing to notice is the absence of any difference between the measures
elicited by the HRS via telephone and in-person interview. The only sizeable and
marginally significant deviation is observed for the fraction of individuals reporting
fair health (p-value: 0.041). No deviation is remotely significant when limiting the
sample to respondents younger than 80, though. Mode effects, then, do not seem
to be present for this outcome.
It is common practice in the health economics literature to classify individuals
into two groups, one in poor and fair health and another in good, very good and
excellent health. Compared to the CPS, the HRS somewhat overrepresents indi-
viduals in fair and poor health. This fraction is 1.1 percentage points higher in the
HRS and the difference is statistically significant (p-value: 0.029). The unweighted
UAS distribution shows an underrepresentation of individuals in excellent health,
an overrepresentation of those in very good health, and an underrepresentation of
those in poor health relative to the CPS. These deviations from population bench-
marks are not corrected by sample weights, regardless of the post-stratification
scheme adopted. When using the aforementioned binary health indicator, the
unweighted proportion of individuals in fair and poor health in the UAS falls
short of the CPS benchmark by 3.4 percentage points. This difference is statis-
tically significant (p-value: 0.001). The gap is reduced to 1.4 percentage points
(not statistically significant, with p-value: 0.356) when default UAS weights are
applied. Overall, the HRS and UAS perform similarly in terms of their ability to
match the population distribution of self-reported health after weighting. Specifically,
the mean absolute difference relative to the CPS is 0.027 for the HRS and 0.032
for the UAS with default weights (wgh0).
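The two summary statistics used in this comparison, the binary fair/poor indicator and the mean absolute difference across the five categories, can be computed directly from the proportions in Table 6. A minimal sketch in Python, using the CPS benchmark and the HRS full-sample column (variable names are ours):

```python
# Category proportions from Table 6 (CPS benchmark and HRS full sample).
cps = {"excellent": 0.151, "very good": 0.274, "good": 0.329,
       "fair": 0.171, "poor": 0.076}
hrs = {"excellent": 0.094, "very good": 0.321, "good": 0.328,
       "fair": 0.191, "poor": 0.067}

# Binary indicator: gap in the share of respondents in fair or poor health.
gap = (hrs["fair"] + hrs["poor"]) - (cps["fair"] + cps["poor"])

# Mean absolute difference across the five response categories,
# the fit measure reported in the last row of Table 6.
mean_abs_diff = sum(abs(hrs[k] - cps[k]) for k in cps) / len(cps)

print(round(gap, 3), round(mean_abs_diff, 3))  # 0.011 0.027
```

The same calculation with the default-weight UAS column (wgh0) gives the 0.032 reported in the text.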
The HRS and UAS respondents are asked about their life satisfaction. Answers
are on a five-point scale, from “completely satisfied” (1) to “not at all satisfied”
(5). There is no analogous question in the CPS instrument, so we do not have a
population benchmark for this outcome. The fraction of HRS respondents reporting
complete satisfaction is 2.8 percentage points higher for the in-person than the
phone interview and the difference is significant (p-value: 0.001). In contrast,
those interviewed by phone tend to express more moderate judgments. The fraction
of those stating that they are somewhat satisfied with their life is 2.1 percentage
points higher when the questionnaire is administered over the phone and, again,
the difference is statistically significant (p-value: 0.027). Observed differences in
the fractions of individuals who are not very and not at all satisfied with their
life are very small in magnitude and not statistically significant. When restricting
attention to respondents younger than 80, the only significant difference is for the
somewhat satisfied group (p-value: 0.04). We will delve more into potential mode
effects for self-reported life satisfaction in the next section and shed some light
Can Internet Match High-quality Traditional Surveys? 21
on the extent to which the different age composition of these two groups of HRS
respondents may contribute to these results.

Table 7. Life Satisfaction.
HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
Completely 0.220 0.205 0.232 0.136 0.151 0.151 0.146 0.148 0.145 0.155
Very 0.464 0.466 0.462 0.491 0.473 0.475 0.489 0.469 0.493 0.483
Somewhat 0.265 0.277 0.256 0.304 0.301 0.300 0.300 0.307 0.296 0.292
Not very 0.040 0.041 0.039 0.059 0.061 0.060 0.058 0.063 0.059 0.057
Not at all 0.012 0.012 0.011 0.010 0.014 0.014 0.008 0.013 0.007 0.012
See Table 2.
In the UAS, unweighted and weighted life-satisfaction distributions are remark-
ably similar, regardless of which set of post-stratification weights is considered.
UAS respondents are significantly less likely to report complete life satisfaction
(p-values are 0.000 for all weighting schemes) and more inclined to state that they
are somewhat (p-values range from 0.001 to 0.084 across different weights) or
not very (p-values from 0.001 to 0.022) satisfied with their life. Based on this,
it is not surprising to see that the UAS distribution is relatively closer to the one
obtained from the HRS phone interview. Even so, differences between these two
distributions are very pronounced. We construct a binary variable taking the value
1 for completely or very satisfied and 0 otherwise. With default weights (wgh0),
we estimate that the fraction of individuals with a positive outlook on their life
(i.e., this indicator equal to 1) is 6 percentage points lower in the UAS than in the
HRS and find that this difference is statistically significant (p-value: 0.002).
Compared to face-to-face and online surveys, phone interviews lack visual aids.
This is an important characteristic to account for when studying response qual-
ity across different interview modes. When respondents are asked to pick one
category from a list – for example, in rating their health and life satisfaction on
five-point scales like those adopted by the HRS instrument – there are two well-
known response effects: a primacy effect (a tendency to pick the first response
category) and a recency effect (a tendency to pick the last response category).
Importantly, primacy and recency effects show age gradients. As discussed by
Knauper (1999), older respondents are more likely to choose the last category
[Figure: Predicted Mean Health Status by Age and Survey Mode. Panels: UAS, HRS-Phone, HRS-in-Person, CPS; x-axis: age groups 55–79; y-axis: 3–3.5.]
(recency), while younger respondents are more likely to choose the first category
(primacy). A possible explanation for this phenomenon is the decline of
memory as people age.
In view of this and the order of response categories in the health status and
life-satisfaction questions (whose distributions are reported in Tables 6 and 7,
respectively), we may expect a sharper decline (or a less steep increase) in health
and life satisfaction with age in auditory mode (i.e., by phone) than over the
Internet or in person (since in the latter case the HRS uses show cards).
carries out interviews both by phone and in person. The preferred mode of interview
is face-to-face for the first and last months of a household’s time in the rotating
panel, while the interview mode defaults to telephone during the intervening three
months. The CPS data do not include a variable indicating whether an interview
was conducted in person or over the phone. There is also no indication that show
cards are used during face-to-face interviews, making these more akin to an
HRS phone interview than to an in-person interview.
A difference between the UAS and the other surveys is the absence of an inter-
viewer. Interactions between interviewers and respondents may generate social
desirability effects. As a result, the mere presence of an interviewer would
most likely lead to higher levels of self-reported health and life satisfaction in face-
to-face and phone interviews, compared to Internet surveys (Chang and Krosnick,
2009).
[Fig. 2 plot area: panels (a) and (b), each showing UAS, HRS-Phone, HRS-in-Person, and CPS by age groups 55–79; y-axis: 0–0.2.]
Fig. 2. (a) Predicted Probability of Choosing the First Option (Excellent Health) by Age
and Survey Mode and (b) Predicted Probability of Choosing the Last Option (Poor Health)
by Age and Survey Mode.
[Figure: Predicted Mean Life Satisfaction by Age and Survey Mode. Panels: UAS, HRS-Phone, HRS-in-Person; x-axis: age groups 55–79; y-axis: 3.5–4.5.]
[Fig. 4 plot area: panels (a) (y-axis: 0.1–0.4) and (b) (y-axis: 0–0.05), each showing UAS, HRS-Phone, and HRS-in-Person by age groups 55–79.]
Fig. 4. (a) Predicted Probability of Choosing the First Option (Completely Satisfied) by
Age and Survey Mode and (b) Predicted Probability of Choosing the Last Option (Not at
All Satisfied) by Age and Survey Mode.
6. CONCLUSIONS
We have documented some clear demographic differences between the UAS and
HRS. Compared to the US population aged 55 and older, the UAS has relatively
fewer respondents at older ages, while the HRS overrepresents older age groups.
The UAS underrepresents individuals with high school or less, while the HRS
underrepresents the higher education strata. In general, sample weights correct for
these discrepancies and allow one to satisfactorily match population benchmarks as
far as key sociodemographic variables are concerned. For instance, acknowledging
the significant underrepresentation of individuals with high school or less, the
default UAS weights are post-stratified on the interaction of gender and education,
thereby aligning sample distributions of education by gender with their population
counterparts.
Comparison of a variety of survey outcomes with population targets taken from
the CPS shows a strikingly good fit for both the HRS and the UAS. Outcome
distributions in the HRS are marginally closer to those in the CPS than outcome
distributions in the UAS. These patterns arise for the most part regardless of which
variables are used to construct post-stratification weights in the UAS, confirming
the robustness of these results.
We find little evidence of mode effects when comparing the subjective measures
of self-reported health and life satisfaction across interview modes. Specifically,
we do not observe very clear primacy or recency effects for either health or life sat-
isfaction. We do find a significant social desirability effect, driven by the presence
of an interviewer, as far as life satisfaction is concerned.
While relatively simple and merely descriptive, the analyses in this study offer a
comprehensive comparison of surveys administered by different interview modes
both in terms of sample representativeness and data quality. They also provide
rather consistent empirical evidence which leads us to answer the question asked
in the title of this paper with a tentative “Yes.”
NOTES
1. The fact sheet can be found at http://www.pewinternet.org/fact-sheet/internet-
broadband/.
2. Alattar, Messel and Rogofsky (2018) offer a comprehensive overview of the UAS.
3. Details on UAS sample recruitment can be found at https://uasdata.usc.edu/index.php.
4. The SIS algorithm is implemented to recruit all UAS respondents, except those
belonging to two special purpose samples, namely Native Americans and Los Angeles
County residents with young children, for whom different sampling procedures are adopted.
Because of their specific sampling procedures, these two groups receive zero weight.
5. The variable urbanicity takes three mutually exclusive values indicating whether the
area of residence of a respondent is rural, mixed, or urban.
6. While these values are arbitrary, they are in line with those described in the literature
and followed by other surveys (Battaglia et al., 2009).
7. A maximum of 50 iterations are allowed. If an exact alignment respecting the weight
bounds cannot be achieved, the trimmed weights will ensure the exact match between survey
and population relative frequencies, but may take values outside the interval defined by the
pre-specified lower and upper bounds.
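As an illustration of the raking-with-bounds procedure this note describes, the sketch below implements iterative proportional fitting with weights clipped to pre-specified bounds after each pass. The margins, bounds, and variable names are hypothetical assumptions for the example, not the actual UAS targets:

```python
import numpy as np

def rake(X, targets, max_iter=50, lo=0.25, hi=4.0):
    """Adjust weights so the weighted margin of each binary-coded variable
    matches its population target, clipping weights to [lo, hi] after each
    pass (the 50-iteration cap mirrors Note 7; the bounds here are made up)."""
    w = np.ones(X.shape[0])
    for _ in range(max_iter):
        for j, t in enumerate(targets):      # one pass over all margins
            cur = w[X[:, j] == 1].sum() / w.sum()
            w[X[:, j] == 1] *= t / cur       # scale the cell to hit the target
            w = np.clip(w, lo, hi)           # enforce the weight bounds
    return w / w.mean()                      # normalize; margins are unchanged

# Toy example: two binary margins (say, female and college) with targets.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))
w = rake(X, targets=[0.52, 0.30])
```

Note that this sketch prioritizes the bounds, leaving a residual gap when a target is unreachable; the procedure in Note 7 instead prioritizes an exact match, letting trimmed weights fall outside the bounds.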
8. A complete description of the UAS weighting procedure can be found at
https://uasdata.usc.edu/addons/documentation/UAS%20Weighting%20Procedures.pdf.
9. Referring to the years 2014 and 2016 for population benchmarks does not change
the results of the analysis.
10. We are not aware of significant differences in survey non-response rates between
phone and in-person interview mode in the HRS.
11. In the HRS and UAS, respondents can report whether they are owners, renters, or
“other.” The latter option is not available in the CPS. The proportion of respondents falling
in the “other” category is minimal in both HRS and UAS and should not appreciably affect
the comparison.
12. The graphs based on regressions without controls look very similar and are not
reported here.
REFERENCES
Alattar, L., Messel, M., & Rogofsky, D. (2018). An introduction to the Understanding America Study
Internet panel. Social Security Bulletin, 78(2), 13–28.
Angrisani, M., Kapteyn, A., Meijer, E., & Saw, H. W. (2014). Recruiting an additional sample for an
existing panel. Paper presented at the Panel Survey Methods Workshop, Ann Arbor, MI.
Battaglia, M. P., Izrael, D., Hoaglin, D. C., & Frankel, M. R. (2009). Practical considerations in raking
survey data. Survey Practice, 2009 (June). Retrieved from http://surveypractice.org/2009/06/
29/raking-survey-data/.
Blumberg, S. J., Luke, J. V., & Cynamon, M. L. (2004). Has cord-cutting cut into random-digit-
dialed health surveys? The prevalence and impact of wireless substitution. In S. B. Cohen &
J. M. Lepkowski (Eds.), Proceedings of the 8th conference on health survey research methods.
National Center for Health Statistics, Hyattsville, MD.
Bugliari, D. et al. (2016). RAND HRS data documentation, version P. RAND Corporation, Center
for the Study of Aging, Santa Monica, CA. Retrieved from http://hrsonline.isr.umich.edu/
modules/meta/rand/index.html.
Chang, L., & Krosnick, J. A. (2009). National surveys via RDD telephone interviewing versus the
Internet: Comparing sample representativeness and response quality. Public Opinion Quarterly,
73(4), 641–678.
Couper, M. (2011). The future of modes of data collection. Public Opinion Quarterly, 75(5), 889–908.
Groves, R. M., & Heeringa, S. G. (2006). Responsive design for household surveys: tools for actively
controlling survey errors and costs. Journal of the Royal Statistical Society, Series A Statistics
in Society, 169(3), 439–457.
Hays, R. D., Liu, H., & Kapteyn, A. (2015). Use of Internet panels to conduct surveys. Behavior
Research Methods, 47(3), 685–690.
Knauper, B. (1999). The impact of age and education on response order effects in attitude measurement.
Public Opinion Quarterly, 63(3), 347–370.
Meijer, E. (2014). Effective sample size metric for sequential importance sampling. Mimeo, USC-
CESR, Los Angeles.
Remillard, M. L., Mazor, K. M., Cutrona, S. L., Gurwitz, J. H., & Tjia, J. (2014). Systematic review
of the use of online questionnaires of older adults. Journal of the American Geriatric Society,
62, 696–705.
Rivers, D. (2013). Comment. Journal of Survey Statistics and Methodology, 1, 111–117.
Schonlau, M., van Soest, A., Kapteyn, A., & Couper, M. (2009). Selection bias in Web surveys and
the use of propensity scores. Sociological Methods & Research, 37(3), 291–318.
Schwarz, N., & Sudman, S., (Eds.). (1992). Context effects in social and psychological research.
New York, NY: Springer-Verlag.
Tourangeau, R., Brick, J. M., Lohr, S., & Li, J. (2016). Adaptive and responsive survey design and
assessment. Journal of the Royal Statistical Society, Series A Statistics in Society, 180(1),
203–223.
Valliant, R., Dever, J. A., & Kreuter, F. (2013). Practical tools for designing and weighting survey
samples. New York, NY: Springer.
Wagner, J., West, B. T., Kirgis, N., Lepkowski, J. M., Axinn, W. G., & Kruger Ndiaye, S. (2012).
Use of paradata in a responsive design framework to manage a field data collection. Journal of
Official Statistics, 28(4), 477–499.
APPENDIX
Table A1. Self-reported Health With CPS Limited to HH Respondents.
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
Excellent 0.149 0.094 0.097 0.091 0.108 0.102 0.098 0.111 0.099 0.116 0.102
Very good 0.277 0.321 0.328 0.315 0.354 0.327 0.334 0.353 0.328 0.363 0.333
Good 0.330 0.328 0.328 0.328 0.327 0.339 0.335 0.337 0.342 0.335 0.345
Fair 0.172 0.191 0.183 0.198 0.171 0.187 0.188 0.157 0.186 0.148 0.177
Poor 0.071 0.067 0.066 0.068 0.041 0.045 0.045 0.043 0.045 0.039 0.043
Mean abs. diff. – 0.025 0.024 0.026 0.030 0.030 0.031 0.033 0.031 0.036 0.030
See Table 2
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 0 0 0 0 0 0 0 0 0 0
Proportion missing 0 0 0 0 0 0 0 0 0 0 0
Weighted proportion missing 0 0 0 0 0 0 0 0 0 0 0
Notes: wgh0, default UAS weights using gender, race, age, education, income, household size, census
region, and urbanicity; wgh1, as wgh0 with finer age brackets; wgh2, as wgh0 without education, wgh3,
as wgh0 without income; wgh4, as wgh0 without education and income; wgh5, as wgh0 without census
region and urbanicity.
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 0 0 0 27 27 27 27 27 27 27
Proportion missing 0 0 0 0 0.015 0.015 0.015 0.015 0.015 0.015 0.015
Weighted proportion missing 0 0 0 0 0.015 0.011 0.010 0.009 0.011 0.009 0.010
See Table 2.
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 41 20 21 18 18 18 18 18 18 18
Proportion missing 0 0.002 0.003 0.002 0.010 0.010 0.010 0.010 0.010 0.010 0.010
Weighted proportion missing 0 0.002 0.002 0.002 0.010 0.009 0.009 0.007 0.009 0.007 0.007
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 0 0 0 0 0 0 0 0 0 0
Proportion missing 0 0 0 0 0 0 0 0 0 0 0
Weighted proportion missing 0 0 0 0 0 0 0 0 0 0 0
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 14 6 8 3 3 3 3 3 3 3
Proportion missing 0 0.001 0.001 0.001 0.002 0.002 0.002 0.002 0.002 0.002 0.002
Weighted proportion missing 0 0.001 0.001 0.001 0.002 0.001 0.001 0.001 0.001 0.001 0.001
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 931 706 225 3 3 3 3 3 3 3
Proportion missing 0 0.056 0.092 0.025 0.002 0.002 0.002 0.002 0.002 0.002 0.002
Weighted proportion missing 0 0.050 0.084 0.020 0.002 0.001 0.001 0.001 0.001 0.001 0.001
Additional Figures for Self-reported Health (with the CPS Split by Household
Respondent Status)
[Fig. A1 plot area: five panels by age groups 55–79 (y-axis: 3–3.5), including CPS-HH-Resp and CPS-Non-HH-Resp.]
Fig. A1. Predicted Mean Health Status by Age and Survey Mode, and CPS HH Respondent
Status.
[Fig. A2 plot area: five panels by age groups 55–79 (y-axis: 0–0.2), including CPS-HH-Resp and CPS-Non-HH-Resp.]
Fig. A2. Predicted Probability of Choosing the First Option (Excellent Health) by Age,
Survey Mode, and CPS HH Respondent Status.
[Fig. A3 plot area: five panels by age groups 55–79 (y-axis: 0–0.2), including CPS-HH-Resp and CPS-Non-HH-Resp.]
Fig. A3. Predicted Probability of Choosing the Last Option (Poor Health) by Age, Survey
Mode, and CPS HH Respondent Status.
EFFECTIVENESS OF STRATIFIED
RANDOM SAMPLING FOR PAYMENT
CARD ACCEPTANCE AND USAGE
ABSTRACT
For central banks that study the use of cash, acceptance of card payments
is an important factor. Surveys to measure levels of card acceptance and
the costs of payments can be complicated and expensive. In this paper, we
exploit a novel data set from Hungary to see the effect of stratified random
sampling on estimates of payment card acceptance and usage. Using the
Online Cashier Registry, a database linking the universe of merchant cash
registers in Hungary, we create merchant and transaction level data sets.
We compare county (geographic), industry and store size stratifications to
simulate the usual stratification criteria for merchant surveys and see the
effect on estimates of card acceptance for different sample sizes. Further,
we estimate logistic regression models of card acceptance/usage to see how
stratification biases estimates of key determinants of card acceptance/usage.
36 CHRISTOPHER S. HENRY AND TAMÁS ILYÉS
1. INTRODUCTION
Central banks around the world, as issuers of bank notes, have a keen interest in
understanding the use of cash. While there has undoubtedly been a shift toward
the use of electronic methods of payment at the point-of-sale, cash is still widely
used across many countries, see e.g., Bagnall et al. (2016). In addition, many
countries observe that the demand for cash as measured by the value of bank notes
in circulation has been growing in line with – and in some cases faster than – the rate
of growth of the economy. Of course, different countries have different experiences.
At one end of the spectrum, countries such as Sweden have recently been exploring
whether to issue a central bank digital currency, due to rapidly declining cash demand.
By contrast, Hungary is a particularly cash-intensive country, with over 80% of
transactions by volume conducted in cash.
A key factor influencing use of cash at the point-of-sale is whether or not card
payments – debit and credit – are accepted by the merchant. Payments are a two-
sided market: consumers are more likely to adopt and use cards when acceptance is
widespread; similarly, merchants are more likely to want to accept card payments
when there are many consumers that desire to pay with cards. See, for example,
the discussion in Fung et al. (2017).
Card acceptance also has implications not just for the use of cash but also for
the amount of cash held by consumers. For example, Huynh et al. (2014) show that
cash holdings increase when consumers move from an area of high card acceptance
to an area with low card acceptance. Consumers must hold more cash because of
the higher probability of encountering a situation where cards are not accepted;
holding cash ensures that they can still complete a transaction that would otherwise
not take place.
Finally, card acceptance is intimately related to the cost of payments. A recent
study by the Bank of Canada (Kosse et al., 2017) measured that the cost to
accept various methods of payment amounted to 0.8% of GDP; a similar result
of 1% of GDP was found in a study across 13 Euro area countries including
Hungary (Schmiedel et al., 2013). Accepting card payments comes with fees
such as interchange, terminal fees, etc., which the merchant must trade-off with the
labor costs of processing cash, the opportunity cost of missing a card payment, etc.
Effectiveness of Stratified Random Sampling 37
To measure the level of card acceptance as well as the cost for accepting various
forms of payments, central banks around the world have conducted merchant or
retailer surveys, see Kosse et al. (2017); European Commission (2015); Norges
Bank (2014); Stewart et al. (2014); Jonker (2013); Schmiedel et al. (2013);
Danmarks Nationalbank (2012); Segendorf and Jansson (2012); Arango and
Taylor (2008).
Merchant studies, however, can be difficult and expensive endeavors. Recruit-
ing businesses/merchants to participate is no easy task, and in practice sample sizes
are often low; half of the studies shown in Figure 1 of Kosse et al. (2017) were of
size N = 200 or below. Additional challenges of merchant surveys include coverage
of small and medium-sized businesses, which may not belong to an official registry;
accounting for businesses with franchises or multiple locations; and the high costs
of conducting survey interviews to obtain quality data.
In this paper, we exploit a novel administrative data set from Hungary, which
allows us to validate the approach and results of merchant surveys with respect
to measuring card acceptance. This rich data set is known as the Online Cashier
Registry (OCR) and captures the universe of retail transactions in Hungary via a
linking of cash registers/payment terminals to the centralized tax authority.
Specifically, our analysis first focuses on estimating card acceptance using dif-
ferent stratification variables that are commonly used in practice for merchant
surveys. This allows us to see how stratification impacts point and variance esti-
mates of acceptance and provides guidance on which stratification variables may
be most effective in practice, given the constraint of small sample sizes. Our results
suggest that having full coverage with respect to geography and the size of the store
would produce the most reliable estimates. Further, we quantify the uncertainty
in card acceptance estimates that may be present for particularly small sample
sizes.
Further, we conduct logistic regression analysis on the entire OCR database and
stratified subsamples in order to see the bias in point estimates for key determinants
of card acceptance and card usage. Our models are motivated by the payment
survey literature, and we confirm results from the literature that store size is highly
correlated with card acceptance, and transaction size with card usage.
Our work is situated within an exciting research area that is a nexus between
survey statistics and data science. For example, Rojas et al. (2017) investigate
how various sampling techniques can be used to explore and visualize so-called
“big data” sets. Chu et al. (forthcoming) use the additional structure of
survey weights to aid in estimating functional principal components of a large
and complex price data set used for constructing the consumer price index in
the United Kingdom. In a slightly different vein, Lohr and Raghunathan (2017)
review methods for combining survey data with large administrative data sets,
including record linkage and imputation. They further highlight the opportunity
to use administrative data sets in the survey design stage, as well as for assessing
nonresponse bias and mitigating the need for follow-up. The interplay between
survey statistics and data science will become increasingly relevant against the
background of declining survey response rates as well as competition for sponsor
resources from large administrative data sets, see, for example, the discussion by
Miller (2017).
The paper is organized as follows. In Section 2, we describe the data set and con-
text of the Hungarian payments system. In Section 3, we outline our methodology.
Section 4 presents and discusses results and Section 5 concludes.
2. DATA DESCRIPTION
2.1. Hungarian Payments Landscape
The Hungarian payment system can be considered cash-oriented. The level of cash
in circulation is higher than the European average, and the share of electronic trans-
actions in retail payment situations is fairly low. This notwithstanding, Hungarian
households have good access to electronic infrastructure; 83% of households have
a payment account and 80% have a payment card. Despite a 15 to 20% increase in
electronic payments over the last few years, the vast majority of transactions are
still conducted in cash.
Under Decree No. 2013/48 (XI. 15.) NGM, the Ministry for National Economy
mandated the installation of online cash registers directly linked to the tax authority.
The replacement of cash registers was implemented as part of a gradual process
at the end of 2014; subject to certain conditions, taxpayers were permitted to use
traditional cash registers until January 1, 2015.
The scope of the online cash register system has been expanding signifi-
cantly since the adoption of the decree. Initially, the regulation covered retail
trade turnover primarily; however, from January 1, 2017, its provisions became
applicable to a substantial part of the service sectors as well (e.g., taxi services,
hospitality/catering and automotive repair services).
The OCR database contains data from over 200,000 cash registers, totaling 7
billion transactions. The median transaction was about 1,000 HUF (≈ $4 USD).
The OCR reveals that conventional payment surveys tend to underestimate the
amount of cash payments. For example, a 2014 Hungarian survey estimated that
84% of the volume of payments were conducted using cash, whereas this figure is
almost 90% in the OCR.
For our analysis, we utilize two data sets derived from the OCR: a merchant-level
data set and a transaction-level data set, both covering 2015 and 2016.
Here we describe the key variables used in our analysis. See Table 1 for a
summary.
Table 1. Distribution of Key Variables (N = number of merchants; % = share accepting card payments).

County N % County N %
Mobile shops 5,294 16.0 Jász-Nagykun 4,442 29.0
Bács-Kiskun 30,851 37.1 Komárom-Esztergom 2,518 18.0
Baranya 5,519 26.4 Mozgóbolt (mobile shop) 15,982 26.3
Békés 8,189 18.7 Nógrád 5,746 20.7
Borsod-Abaúj 5,543 17.5 Pest 7,936 14.2
Budapest 8,552 21.7 Somogy 5,634 20.8
Csongrád 6,517 26.2 Szabolcs-Szatmár 3,469 22.8
Fejér 5,548 28.6 Tolna 4,004 21.1
Győr-Moson-Sopron 7,656 24.4 Vas 6,106 28.3
Hajdú-Bihar 7,796 24.0 Veszprém 5,037 25.9
Heves 4,732 25.1

Industry N % Size N %
1 7,562 16.8 Small 65,373 6.4
4 96,152 29.5 Medium 76,312 32.5
5 36,483 18.4 Large 15,386 74.2
6 3,545 27.3
7 3,064 26.8
8 3,067 29.2
9 7,198 19.2

Note: This table shows the distribution of key variables used in our study based on the merchant-level
data set derived from the OCR database. For store size, small businesses are defined as those with 2–15
million HUF in annual turnover, medium are businesses between 15 and 150 million HUF, and large are
businesses with over 150 million HUF.
Note: Industry codes follow the TEAOR08 classification; these codes correspond to European NACE Rev. 2.
3. METHODOLOGY
Our analysis consists of two components which we describe below.
Due to the importance of card acceptance for understanding payment choice and
cash usage, we first use the merchant-level aggregated version of the OCR to
estimate card acceptance. We proceed in the following manner.
First we draw a random stratified sample from the merchant-level data set. From
the sample, we estimate the proportion of businesses accepting card payments. We
also calculate the standard error and a 95% confidence interval. Finally, we repeat
these calculations for 1,000 replications. Estimates are calculated using Stata's
svy command, which accounts for the strata, uses stratum-level inclusion probabilities
as weights, and applies finite population corrections.
To draw the stratified samples, we consider three target sample sizes: 0.1%,
0.2% and 1%. These sizes reflect the fact that, in practice, merchant surveys often
face a constraint of small sample sizes. Stratification is performed on the
three variables defined in Section 2.3, and we select the given proportion (0.1%,
0.2% and 1%) of units from each stratum.
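The sampling and estimation loop can be sketched as follows. This is a simplified, unofficial stand-in for Stata's svy: proportion, run on simulated data; the frame size, number of strata, and acceptance rate are invented for the example, not taken from the OCR:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy merchant-level frame: a stratum label and a card-acceptance indicator.
N = 150_000
strata = rng.integers(0, 20, size=N)              # e.g., 20 counties
accept = (rng.random(N) < 0.257).astype(float)    # ~25.7% true acceptance

def stratified_estimate(strata, y, frac=0.001):
    """Draw `frac` of the units in each stratum; return the stratified
    estimate of the proportion and its FPC-adjusted standard error."""
    N_total = len(y)
    est, var = 0.0, 0.0
    for h in np.unique(strata):
        idx = np.flatnonzero(strata == h)
        N_h = len(idx)
        n_h = max(2, round(frac * N_h))           # at least 2 for a variance
        s = rng.choice(idx, size=n_h, replace=False)
        est += (N_h / N_total) * y[s].mean()
        fpc = (N_h - n_h) / N_h                   # finite population correction
        var += (N_h / N_total) ** 2 * fpc * y[s].var(ddof=1) / n_h
    return est, var ** 0.5

p_hat, se = stratified_estimate(strata, accept, frac=0.001)
ci_length = 2 * 1.96 * se                         # length of the 95% CI
```

Repeating the draw 1,000 times and averaging p_hat, the deviation from the true value, se, and ci_length yields quantities analogous to those reported in Table 3.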
Choice of the three stratification variables is motivated by those used in practice
for merchant surveys, see e.g., Kosse et al. (2017). The purpose of stratification
in general is to reduce variance estimates by finding variables which are highly
correlated with card acceptance (see again Table 1). Of course, we are also limited
by what is available in the OCR.
To estimate the logistic regression models of card acceptance and usage, we take
a similar approach for drawing stratified random samples. However, since we are
interested in understanding the bias of the estimates for key explanatory variables,
we fix the sample sizes at 1% of the data from each strata of the merchant-level data
set. For the card usage model, the sampling ratio is 0.01% from the transaction-
level data set. For both models, we perform 10 replications and compute average
point estimates. Explanatory variables are included based on what is available in
the OCR combined with a review of the payments literature; see Appendix for a
more detailed explanation. Coefficient estimates are not produced using any survey
weights.
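A minimal, self-contained version of this second exercise, with simulated data standing in for the OCR (the covariate, coefficients, and strata are our assumptions; as in the text, no survey weights enter the fit):

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Unweighted logistic regression fitted by Newton-Raphson."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend an intercept
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        H = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian of log-lik.
        b += np.linalg.solve(H, X.T @ (y - p))  # Newton step
    return b

rng = np.random.default_rng(1)
N = 100_000
log_size = rng.normal(size=N)                   # stand-in for (log) store size
strata = rng.integers(0, 20, size=N)            # strata unrelated to outcome
p_true = 1 / (1 + np.exp(-(-1.2 + 0.8 * log_size)))
accept = (rng.random(N) < p_true).astype(float)

b_full = fit_logit(log_size, accept)            # fit on the full frame

# 1% stratified subsample: keep 1% of the units within each stratum.
keep = np.concatenate([
    rng.choice(np.flatnonzero(strata == h),
               size=max(2, int(0.01 * (strata == h).sum())), replace=False)
    for h in range(20)
])
b_sub = fit_logit(log_size[keep], accept[keep])
```

Averaging b_sub over repeated subsample draws, and comparing with b_full, shows how far subsample point estimates drift from the full-frame estimates.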
4. RESULTS
4.1. Estimates of Card Acceptance
Table 3 shows the results from the first component of our analysis. From the
1,000 replications, we report the average estimates of card acceptance, percent
bias when compared with the true value of acceptance (0.2573804), the average
standard error of the estimate and the average length of a 95% confidence interval.
Each stratification approach provides essentially unbiased estimates of card
acceptance, even for the smallest sample size considered (0.1% sample; ≈ N=160);
the bias is below 1% when compared with the true value. That being said, size
stratification leads to the most biased estimates. County and size stratification
underestimate card acceptance in small sample sizes, whereas industry stratification
provides an overestimate. The main issue with small sample sizes – which
has implications for practical merchant surveys – is the precision of the estimates.
For small sample sizes, the average length of a confidence interval for both county
and industry strata is about 13.5 percentage points; size stratification provided a
relatively more precise estimate.

Table 3. Estimates of Card Acceptance.

                  County     Size      Industry
0.1% sample
  Mean            0.25625    0.25604   0.25831
  Percent bias    −0.439     −0.519    0.362
  SE              0.034      0.031     0.035
  CI length       0.134      0.122     0.136
0.2% sample
  Mean            0.25732    0.25693   0.25682
  Percent bias    −0.025     −0.175    −0.220
  SE              0.015      0.014     0.015
  CI length       0.060      0.054     0.061
1% sample
  Mean            0.25749    0.25704   0.25721
  Percent bias    0.044      −0.134    −0.067
  SE              0.011      0.010     0.011
  CI length       0.043      0.038     0.043

Note: This table shows estimates of card acceptance for different stratification variables and sample
sizes. Stratification is performed on county, industry and store size variables. From 1,000 replications,
we report the average point estimate and its percent bias from the true value estimated on the full
merchant-level data set derived from the OCR. We also report the average standard error of the estimate,
along with the average length of a 95% confidence interval.

Effectiveness of Stratified Random Sampling 43
Increasing the sample size from 0.1% to 0.2% leads to a noticeable increase
in the precision of the estimates, and there is not much additional improvement
from increasing the sample size from 0.2% to 1%. For larger sample sizes, county
stratification provides the least biased estimates, whereas store size estimates are
most precise. This is driven by the high correlation between card acceptance and
store size; see again Table 1.
There are some key lessons to draw from these results for actual merchant
surveys. A combination of geographic and merchant size coverage would likely
provide the most reliable estimates of card acceptance, reducing both bias and
variance. Further, very small sample sizes could produce unreliable estimates, but
even a small increase in the sample size can increase precision without having to
increase cost significantly.
4.2. Logistic Regression Models

Now we discuss results from logistic regression models of card acceptance and
usage; see the Appendix for details on model selection and full results.
4.2.1. Acceptance
Our analysis focuses on three types of explanatory variables: county effects, industry
effects, and store size effects for the entire sector. Due to the high number of
control variables, we discuss only selected results; see Table 4.
In general, for the 88 parameter estimates, the strata based on the size of the store
provide the best estimates: in 52 of the 88 cases, the average of the estimates for the
10 samples lies inside the 95% confidence interval.
The model based on the entire database clearly shows that the most important
factor, in line with the literature, is the size of the store, which we characterize
by annual revenue. Store sizes follow a lognormal distribution, which means that
under the county- and industry-based strata, there is a low probability that the
largest retailers will be included in the sample. Without the biggest retailers,
for which county and industry effects are small compared to the size effects,
the estimates for these variables will in general be biased: the county and industry
strata overestimate these effects.
Since the functional relation between size and acceptance is nonlinear and non-monotone,
the county and industry strata do not provide good fits for the higher-order
polynomials, nor for most of the size-related variables (the share of different-sized
transactions and the volume of transactions). In the case of industry effects,
the size-based strata underperform the industry-based strata: the industry distribution
of the database is heavily concentrated in retail services, and the other
categories have only a very small share. The three categories of size effects (direct
annual revenue, share of different-sized transactions and volume of transactions) are
in general better estimated by size strata.

Note to Table 4: This table shows selected point estimates from a logistic regression model with card
acceptance as the dependent variable; results from the full model are shown in the Appendix. In the first
column, we show estimates from the model estimated on the full merchant-level data set derived from the
OCR. The last three columns show the average estimates from ten 1% stratified subsamples, where the
stratification variable is indicated in the column heading.
In conclusion, the most efficient stratification is random stratified sampling based
on store size categories rather than on geographical or industry classification. The
main causes are the importance of annual revenue over county and industry effects
in card acceptance decisions and the complex functional relation between the two.
When stratification is not also based on store size, the model overestimates the
other effects.

Note to Table 5: This table shows selected point estimates from a logistic regression model with card
usage as the dependent variable; results from the full model are shown in Appendix A. In the first column,
we show estimates from the model estimated on the full transaction-level data set derived from the OCR.
The last three columns show the average estimates from ten 0.01% stratified subsamples, where the
stratification variable is indicated in the column heading.
4.2.2. Usage
In the case of modeling payment card use, the above approach does not lead to
good estimates. The model estimated on the full data set clearly shows that the
single most important factor is the transaction value and its higher order orthogonal
polynomials. We find there is no single best method among these three types of
stratification; see Table 5. Because of the extremely small standard errors calculated
from nearly 5 billion transactions, all subsample estimates are on average outside
of the 95% confidence intervals. Based on the average estimates, the county
strata provide the closest estimates for most variables. The reason for this is
the much greater county effect in card use decisions compared to card acceptance.
However, as opposed to what we observed in the card-acceptance model, there is
no difference in the direction of the biases across the three types of stratification:
all three models on average overestimate the county and transaction size effects
and underestimate the industry effects.
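The higher-order orthogonal polynomials of transaction value referred to above can be constructed by orthogonalizing the raw powers. The sketch below uses a QR decomposition, in the spirit of R's poly(); this is an assumed construction for illustration, not necessarily the authors' exact method:

```python
import numpy as np

def orthogonal_polynomials(x, degree):
    """Return columns spanning x, x^2, ..., x^degree, made mutually
    orthonormal and orthogonal to the constant (QR decomposition of
    the Vandermonde matrix)."""
    x = np.asarray(x, dtype=float)
    raw = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    q, _ = np.linalg.qr(raw)
    return q[:, 1:]  # drop the constant column

# Orthogonalized regressors avoid the extreme collinearity of raw powers,
# which matters for a strongly skewed variable such as transaction value.
values = np.exp(np.linspace(0.0, 5.0, 200))  # toy, lognormal-like spread
P = orthogonal_polynomials(values, 3)
```

Because the columns are orthonormal and orthogonal to the intercept, the coefficient on each polynomial term can be interpreted separately, as in Tables 4 and 5.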
The main reason that the above stratifications provide poor results is the lognormality
of the transaction size distribution. By not basing the stratification on transaction
size, all three subsampling schemes produce systematically biased estimates.
5. CONCLUSIONS
In our paper, we analyzed payment card acceptance and payment card use decisions
in retail situations using the comprehensive OCR database and random stratified
subsamples. We compare county, industry and store size strata with the
true population estimates to simulate the usual stratification criteria for merchant
surveys.
Our results from estimating levels of card acceptance suggest that having full
coverage with respect to geography and the size of the store produces the most reli-
able estimates. Further, we quantify the uncertainty in card acceptance estimates
that may be present for particularly small sample sizes, finding that increasing
the sample size from around N = 160 to N = 300 can reduce the length of confi-
dence intervals by half. These findings have important practical implications for
merchant surveys.
Further, we estimated logistic regression models of both card acceptance and
usage. In our models, we controlled for several factors relevant in payment instrument
adoption and use, but we focused on the performance of the three types of
stratification variables in estimating the exact county, industry and size effects for
the entire population. In our comparison, we created 10 random stratified subsamples
of 1% of the merchant database and 10 random stratified subsamples of 0.01%
of the transaction database.
In the card-acceptance model, the stratification based on the store sizes provided
the best estimates. In card acceptance decisions, the most important factor is the
size of the store and other size-related variables. However, as store sizes follow
a skewed distribution, with small samples there is a high probability that the
subsample does not contain enough big stores; the average effects of size will then
be underestimated, while the county and industry effects will be overestimated.
In the card usage model, we evaluate the same subsample types and show that
county stratification provides the best results. However, because strata were not
created on transaction sizes, all three of the analyzed subsampling schemes exhibit
systematic biases. In conclusion, in card-acceptance models the store size
stratification provides good estimates; however, the same stratification cannot
be used to effectively estimate variable effects in a card usage model.
ACKNOWLEDGEMENTS
We thank the Magyar Nemzeti Bank, in particular Lóránt Varga and Gábor Sin, for
facilitating access to the data. Tamás Ilyés is no longer an employee of the Magyar
Nemzeti Bank. We also thank Jean-Louis Combes, Pierre Lesuisse and participants
of the Doctoral Seminar at the University of Auvergne School of Economics. We
are grateful to the editors and referees for their helpful comments. The views
expressed in this paper are those of the authors. No responsibility for them should
be attributed to the Bank of Canada or the Magyar Nemzeti Bank. All remaining
errors are the responsibility of the authors.
REFERENCES
Arango, C., & Taylor, V. (2008). Merchant acceptance, costs, and perceptions of retail payments: A
Canadian survey. Bank of Canada discussion paper, 2008-12.
Arango, C., Huynh, K. P., & Sabetti, L. (2015). Consumer payment choice: Merchant card acceptance
versus pricing incentives. Journal of Banking and Finance, 55(C), 130–141.
Bagnall, J., Bounie, D., Huynh, K. P., Kosse, A., Schmidt, T., Schuh, S., et al. (2016). Consumer cash
usage: A cross country comparison with payment diary survey data. International Journal of
Central Banking, 12(4), 1–61.
Baxter, W. F. (1983). Bank interchange of transactional paper: Legal and economic perspectives.
Journal of Law and Economics, 26(3), 541–588.
Bolt, W. (2008). Consumer choice and merchant acceptance of payment media. DNB Working Papers,
No. 197.
Bolt, W., Jonker, N., & Renselaar, C. (2010). Incentives at the counter: An empirical analysis of
surcharging card payments and payment behaviour in the Netherlands. Journal of Banking and
Finance, 34, 1738–1744.
Borzekowski, R., Kiser, E. K., & Ahmed, S. (2008). Consumers’ use of debit cards: Patterns,
preferences, and price response. Journal of Money, Credit and Banking, 40, 149–172.
Briglevics, T., & Schuh, S. (2014). This is what’s in your wallet …and how you use it. ECB Working
Paper Series, No. 1684.
Chu, B. M., Huynh, K. P., Jacho-Chávez, D. T., & Kryvtsov, O. (2018). On the evolution of the UK
price distributions. Annals of Applied Statistics.
Cohen, M., & Rysman, M. (2013). Payment choice with consumer panel data. Federal Reserve Bank
of Boston Working Paper, No. 13-6.
APPENDIX
This appendix reviews the literature on payment card acceptance and usage, which
justifies the explanatory variables included in our models. In addition, we provide
a description of the variables not discussed in the main text and the full results for
the logistic regression models.
the results presented, neither Klee’s nor Wolman and Wang’s database shows a
non-monotonic relationship between cash use and transaction value on the values
examined by the authors. Building on these empirical results, several theoretical models
have been constructed to explain the relationship between transaction value and
the card usage rate. Briglevics and Schuh (2014) used US payment diary data,
while Huynh et al. (2014) relied on Canadian and Austrian data to construct their
respective decision models. Both models estimate card usage patterns that are
monotonic and concave in transaction value. While Briglevics and Schuh (2014)
described payment instrument choice as a dynamic optimization problem, Huynh
et al. (2014) supplement the Baumol–Tobin model.
Despite the use of receipt-level data, our database differs significantly from
the two studies analyzing transaction data and from the surveys built on payment
diaries in several regards. The database of online cash registers provides national
coverage, and the vast majority of merchants are subject to the relevant regulation.
Accordingly, compared to the studies mentioned above, we were able to distin-
guish between far more merchants both in terms of size and type. On the other
hand, due to the anonymization, we had little data on the customers of the stores.
County identifiers were of limited use, as there is scant variance across the counties
with regard to the main demographic aspects; consequently, as opposed to Klee
(2008), there is insufficient variance to add a consumer characteristics proxy.
However, as opposed to the payment diaries, there is significantly more informa-
tion available on payment location; moreover, due to the statutory obligations, the
reliability of the data is presumably better.
relationship, given the limited number of explanatory variables, the final models
include the turnover’s log and its square.
• Temporal attributes of the store: Not only the annualized turnover of the
stores, but also the turnover’s monthly and weekly distribution can be estab-
lished based on the dates indicated on the receipts. Accordingly, in our analysis,
we also studied the effect of the weekly turnover structure on card acceptance.
For the most part during the two years under review, the decree on Sunday store
closure was in effect in the retail sector. Family-owned stores represented the
main exceptions. Consequently, Sunday opening hours can be used as a proxy
for ownership. Since the correspondence is imperfect, this variable is included
in conjunction with the TEÁOR variable in the models. In this way, we can
separate the effects of individual sectorial exceptions from the attributes of the
owner.
Since a store’s closure on Mondays and Tuesdays proved to have significant
explanatory power in our analysis, this serves as a control variable
in the rest of the models. These attributes are linked to special stores, e.g.,
museum gift shops and sample stores, where the business is not considered to be
an independent financial unit.
• Network attributes: A large part of the retail sector operates in the form of a
network; in other words, numerous outlets are operated by a single legal entity.
According to our hypothesis, the fact that a store is part of a chain affects
card acceptance decisions in two ways. In some networks, every member belongs
to the same category: either all of them accept card payments or none does. In such
networks, card acceptance is presumably based on a network-level decision;
therefore, the decision situation itself may differ from that of independent stores.
By contrast, in networks where, according to the observations, card acceptance
is based on the independent decision of the store, the decision situation is
determined by the store’s unique characteristics. Therefore, the models included
dummy variables for the three types of stores: independent store, independent
decision, network decision. Moreover, in the case of network stores, we also
included the network’s total turnover and the number of stores included in the
network. According to the cross-sectional analyses, the correlation is nonlinear;
therefore, we also include the squared terms in the regressions.
• Item number: The database includes the number of products purchased under
each receipt. This allowed us to use the total item number of the store as another
approach to the size variable and to introduce average and maximum item numbers.
The average and the maximum item number presumably correlate strongly with the
payment time and, as such, are used as proxy variables for the latter. We used
average payment value as a control variable in several cases; however, this
variable correlates extremely strongly with the decomposition of the turnover
by value and with the proportions of the ranges.
Annual revenue 1st order orthogonal polynomial 59.296 59.525 59.503 63.279
Annual revenue 2nd order orthogonal polynomial −81.847 −80.740 −84.464 −78.947
Annual revenue 3rd order orthogonal polynomial −24.489 −24.085 −24.322 −21.949
variables capture the ease of cash payment, which presumably correlates with
payment time and as such, it can be considered a cost variable.
• Store attributes: Although the model constructed for card acceptance con-
tained numerous variables, due to space limitations, we can only include the
most important ones in this part of the study. As regards store attributes, most
models include the log and square of annualized turnover and the aggregate
form of the activity.
• County data: In the card-acceptance model, county effects did not correlate
significantly with the county’s level of development, but a correlation can be
observed for card usage in the raw data. We estimated county effects in two
steps: the main regression includes only the county dummy variables, while in
the second step, we focus on the correlation between the coefficients and the
main sociodemographic data of individual counties.
• Temporal data: The database contains data for a 2-year period, which reflect
significant monthly and weekly seasonality. Since a sufficient amount of data
was available, we included yearly and monthly dummy variables and dummies
pertaining to the days of the month and the days of the week.
• Inverse Mills ratio: As card usage and card acceptance mutually affect each
other, our model reflects a significant degree of selection bias.
In order to remove the bias, we also included the inverse Mills ratio computed
from the probit version of the model constructed for card acceptance. The
Heckman selection thus performed reduces estimation uncertainty, especially
in the case of the affiliated store data.
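The inverse Mills ratio correction described above can be computed directly from the fitted probit index. A minimal sketch, assuming the selection-equation (card-acceptance) probit coefficients have already been estimated:

```python
import numpy as np
from scipy.stats import norm

def inverse_mills_ratio(probit_index):
    """lambda(z) = phi(z) / Phi(z), evaluated at the fitted probit index
    z = x'gamma from the selection (card-acceptance) equation. The result
    is added as a regressor in the outcome (card usage) equation, as in
    the second step of a Heckman selection correction."""
    z = np.asarray(probit_index, dtype=float)
    return norm.pdf(z) / norm.cdf(z)

imr = inverse_mills_ratio(np.array([-1.0, 0.0, 1.0]))
```

The ratio is large for observations that were unlikely to be selected and shrinks toward zero as the probability of selection rises, which is what allows it to absorb the selection bias.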
Transaction value 1st order orthogonal polynomial −184.446 −157.517 −167.243 −176.342
Transaction value 2nd order orthogonal polynomial −206.922 −189.216 −194.939 −200.644
Transaction value 3rd order orthogonal polynomial −76.017 −68.811 −71.832 −72.792
PART II
VARIANCE ESTIMATION
WILD BOOTSTRAP RANDOMIZATION
INFERENCE FOR FEW TREATED
CLUSTERS
ABSTRACT
When there are few treated clusters in a pure treatment or difference-in-
differences setting, t tests based on a cluster-robust variance estimator can
severely over-reject. Although procedures based on the wild cluster bootstrap
often work well when the number of treated clusters is not too small, they
can either over-reject or under-reject seriously when it is. In a previous
paper, we showed that procedures based on randomization inference (RI)
can work well in such cases. However, RI can be impractical when the
number of possible randomizations is small. We propose a bootstrap-based
alternative to RI, which mitigates the discrete nature of RI p values in the
few-clusters case. We also compare it to two other procedures. None of them
works perfectly when the number of clusters is very small, but they can work
surprisingly well.
Keywords: Clustered data; panel data; CRVE; wild cluster bootstrap;
difference-in-differences; kernel-smoothed p value
62 JAMES G. MACKINNON AND MATTHEW D. WEBB
1. INTRODUCTION
During the past decade or two, it has become common for empirical work in many
areas of economics to involve models where the error terms are allowed to be corre-
lated within clusters. Much of this work employs difference-in-differences (DiD)
estimators, where the data set has both a time and a cross-section dimension, and
clustering is typically at the cross-section level (say, by state or province). Cameron
and Miller (2015) provides a comprehensive survey of econometric methods for
cluster-robust inference.
Despite considerable progress in the development of suitable econometric meth-
ods over the past decade, it can still be a challenge to make reliable inferences.
Doing so is particularly challenging in the DiD context when there are very few
treated clusters. Past research, including Conley and Taber (2011), has shown that
inference based on cluster-robust test statistics can greatly over-reject in this case.
MacKinnon and Webb (2017b) explains why this happens and why the wild cluster
bootstrap (WCB) of Cameron, Gelbach, and Miller (2008) does not solve the prob-
lem; for a less technical discussion, see also MacKinnon and Webb (2017a). When
there are very few treated clusters, the restricted WCB often severely under-rejects,
and the unrestricted WCB often severely over-rejects.
One potentially attractive way to obtain tests with accurate size when there
are few treated clusters is to use randomization inference (RI). This approach
involves comparing estimates based on the clusters that were actually treated with
estimates based on control clusters that were not treated. Several authors have
recently investigated this approach; see Conley and Taber (2011), Canay, Romano,
and Shaikh (2017), Ferman and Pinto (2019), and MacKinnon and Webb (2018a).
RI procedures necessarily rely on strong assumptions about how similar the
control clusters are to the treated clusters. MacKinnon and Webb (2018a) shows
that for RI procedures which use coefficient estimates, like the one of Conley
and Taber (2011), these assumptions almost always fail to hold when the treated
clusters have either more or fewer observations than the control clusters. As a con-
sequence, the procedure may over-reject or under-reject quite noticeably when the
treated clusters are substantially smaller or larger than the controls. MacKinnon
and Webb (2018a) suggests that more reliable inferences can often be obtained by
basing RI on t statistics rather than coefficient estimates. However, such proce-
dures can involve noticeable power loss relative to the ones based on coefficient
estimates.
In Section 2, we briefly discuss conventional asymptotic procedures for infer-
ence with clustered errors. In Section 2.1, we then explain how the WCB works. In
Section 3, we introduce RI and discuss two variants of it, one based on coefficient
estimates which is quite similar to what was proposed in Conley and Taber (2011)
and the other based on t statistics proposed in MacKinnon and Webb (2018a).
2. CLUSTER-ROBUST INFERENCE
A linear regression model with clustered errors may be written as

$$
y \equiv \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix}
= X\beta + \epsilon
\equiv \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_G \end{bmatrix}\beta
+ \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_G \end{bmatrix},
\qquad (1)
$$

where each of the G clusters, indexed by g, has $N_g$ observations. The matrix X and
the vectors y and $\epsilon$ have $N = \sum_{g=1}^{G} N_g$ rows, X has k columns, and the parameter
vector $\beta$ has k rows. Each subvector $\epsilon_g$ is assumed to have covariance matrix $\Omega_g$
and to be uncorrelated with every other subvector. The covariance matrix of the
entire error vector is block diagonal with diagonal blocks the $\Omega_g$. Ordinary least
squares (OLS) estimation of Eq. (1) yields estimates $\hat\beta$ and residuals $\hat\epsilon$.
Because the elements of the $\epsilon_g$ are in general neither independent nor identically
distributed, both classical OLS and heteroskedasticity-robust standard errors for
$\hat\beta$ are invalid. Instead, inference is typically based on a cluster-robust variance
estimator (CRVE), the most widely used of which is

$$
\frac{G(N-1)}{(G-1)(N-k)}\,(X^{\top}X)^{-1}
\left(\sum_{g=1}^{G} X_g^{\top}\hat\epsilon_g \hat\epsilon_g^{\top} X_g\right)
(X^{\top}X)^{-1},
\qquad (3)
$$

where $\hat\epsilon_g$ is the subvector of $\hat\epsilon$ that corresponds to cluster g. This is the estimator
that is used when the cluster command is invoked in Stata.1 Consistent with
the results of Bester, Conley, and Hansen (2011), it is common to assume that the
t statistics follow a t(G − 1) distribution; this is what Stata does by default.
It is not obvious that using t statistics based on the CRVE (3) is valid asymp-
totically. The proof requires technical assumptions about the distributions of the
errors and the regressors and how the number of clusters and their sizes change as
the sample size tends to infinity; see Djogbenou, MacKinnon, and Nielsen (2018).
Nevertheless, test statistics based on (3) seem to yield reliable inferences when
the number of clusters is large and there is not too much heterogeneity across
clusters. In particular, the number of observations per cluster must not vary too
much; see Carter, Schnepel, and Steigerwald (2017) and MacKinnon and Webb
(2017b). However, t statistics based on (3) tend to over-reject severely when the
parameter of interest is the coefficient on a treatment dummy and there are very few
treated clusters; see Conley and Taber (2011) and MacKinnon and Webb (2017b).
Rejection frequencies can be over 75% when all the treated observations belong
to the same cluster.
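The CRVE-based t test just described can be sketched as follows. This is an illustration of the standard CV1 estimator with Stata's finite-sample correction and the t(G − 1) reference distribution; the toy data are hypothetical, not from the paper:

```python
import numpy as np
from scipy.stats import t as student_t

def crve_ttest(y, X, cluster_ids, j):
    """t statistic for beta_j = 0 using the CV1 cluster-robust variance
    estimator, with a p value from the t(G - 1) distribution."""
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((k, k))
    groups = np.unique(cluster_ids)
    for g in groups:                       # sum of X_g' e_g e_g' X_g over clusters
        score = X[cluster_ids == g].T @ resid[cluster_ids == g]
        meat += np.outer(score, score)
    G = len(groups)
    c = G * (N - 1) / ((G - 1) * (N - k))  # Stata's small-sample correction
    V = c * XtX_inv @ meat @ XtX_inv
    t_stat = beta[j] / np.sqrt(V[j, j])
    p_val = 2 * student_t.sf(abs(t_stat), df=G - 1)
    return t_stat, p_val

# Toy example: 10 clusters of 30 observations, with cluster effects, no true slope.
rng = np.random.default_rng(1)
ids = np.repeat(np.arange(10), 30)
X = np.column_stack([np.ones(300), rng.standard_normal(300)])
y = rng.standard_normal(10)[ids] + rng.standard_normal(300)
t_stat, p_val = crve_ttest(y, X, ids, j=1)
```

The severe over-rejection discussed in the text arises when the regressor of interest is a treatment dummy that is nonzero in only one or two clusters, not in well-balanced designs like this toy one.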
In this paper, we are primarily concerned with the DiD model, which is often
appropriate for studies that use individual data in which there is variation in
treatment across both clusters (or groups) and time periods. We can write such
a model as
$$ y_{igt} = \beta_1 + \beta_2\,\mathrm{GT}_g + \beta_3\,\mathrm{PT}_t + \beta_4\,\mathrm{TREAT}_{gt} + \epsilon_{igt}, \qquad (4) $$
$$ i = 1, \ldots, N_g, \quad g = 1, \ldots, G, \quad t = 1, \ldots, T, $$
where i indexes individuals, g indexes groups, and t indexes time periods. Here
GTg is a “group treated” dummy that equals 1 if group g is treated in any time
period, PTt is a “period treated” dummy that equals 1 if any group is treated in time
period t, and TREATgt is a dummy that equals 1 if an observation is actually treated.
2.1. The Wild Cluster Bootstrap
The WCB was proposed in Cameron et al. (2008) as a method for reliable inference
in cases with a small number of clusters, and its asymptotic validity is proved
in Djogbenou et al. (2018). A different, but less effective, bootstrap procedure
for cluster-robust inference, often referred to as the “pairs cluster bootstrap,” was
previously suggested in Bertrand, Duflo, and Mullainathan (2004); see MacKinnon
and Webb (2017a). The WCB was studied extensively in MacKinnon and Webb
(2017b) for the cases of unbalanced clusters and/or few treated clusters. Because
we will be proposing a new procedure that is closely related to the WCB in Section
4, we review how the latter works.
Without loss of generality, we consider how to test the hypothesis that β4 , the
DiD coefficient in Eq. (4), is zero. Then the (restricted) WCB works as follows:
(1) Estimate Eq. (4) by OLS.
(2) Calculate $\hat t_4$, the t statistic for β4 = 0, using the square root of the 4th diagonal
element of (3) as a cluster-robust standard error.
(3) Reestimate the model (4) subject to the restriction that β4 = 0, so as to obtain
restricted residuals $\tilde\epsilon$ and restricted estimates $\tilde\beta$.
(4) For each of B bootstrap replications, indexed by b, generate a new set of
bootstrap dependent variables $y^{*b}_{igt}$ using the bootstrap DGP

$$ y^{*b}_{igt} = \tilde\beta_1 + \tilde\beta_2\,\mathrm{GT}_g + \tilde\beta_3\,\mathrm{PT}_t + \tilde\epsilon_{igt}\, v_g^{*b}. \qquad (5) $$

Here $y^{*b}_{igt}$ is an element of the vector $y^{*b}$ of observations on the bootstrap dependent
variable, $\mathrm{GT}_g$ and $\mathrm{PT}_t$ are taken from the corresponding row of X, and
$v_g^{*b}$ is an auxiliary random variable that follows the Rademacher distribution;
see Davidson and Flachaire (2008). It takes the values 1 and −1 with equal
probability.3
(5) For each bootstrap replication, estimate regression (4) using $y^{*b}$ as the regressand.
Calculate $t_4^{*b}$, the bootstrap t statistic for β4 = 0, using the square root
of the 4th diagonal element of (3), with bootstrap residuals replacing the OLS
residuals, as the standard error.
(6) Calculate the symmetric bootstrap p value as

$$ \hat p_s^* = \frac{1}{B} \sum_{b=1}^{B} I\bigl(|t_4^{*b}| > |\hat t_4|\bigr), \qquad (6) $$
where I( · ) denotes the indicator function. Eq. (6) assumes that, under the
null hypothesis, the distribution of t4 is symmetric around zero. Alternatively,
one can use a slightly more complicated formula to calculate an equal-tail
bootstrap p value.
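Steps (1)–(6) can be sketched compactly. The block below tests β_j = 0 in a generic clustered regression; `crve_se` stands in for the square root of the jth diagonal element of the CRVE (3), and the toy data at the end are hypothetical, not the authors' design:

```python
import numpy as np

def crve_se(X, resid, cluster_ids, j):
    """Cluster-robust (CV1) standard error of coefficient j."""
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for g in np.unique(cluster_ids):
        score = X[cluster_ids == g].T @ resid[cluster_ids == g]
        meat += np.outer(score, score)
    G = len(np.unique(cluster_ids))
    c = G * (N - 1) / ((G - 1) * (N - k))
    return np.sqrt((c * XtX_inv @ meat @ XtX_inv)[j, j])

def wcr_pvalue(y, X, cluster_ids, j, B=399, seed=0):
    """Restricted wild cluster (WCR) bootstrap p value for H0: beta_j = 0."""
    rng = np.random.default_rng(seed)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                 # step 1: OLS
    t_hat = beta[j] / crve_se(X, y - X @ beta, cluster_ids, j)  # step 2
    Xr = np.delete(X, j, axis=1)                                # step 3: impose
    beta_r = np.linalg.lstsq(Xr, y, rcond=None)[0]              #   beta_j = 0
    resid_r = y - Xr @ beta_r
    groups = np.unique(cluster_ids)
    pos = np.searchsorted(groups, cluster_ids)  # map rows to their clusters
    exceed = 0
    for _ in range(B):
        v = rng.choice([-1.0, 1.0], size=len(groups))           # step 4:
        y_star = Xr @ beta_r + resid_r * v[pos]                 #   Rademacher draws
        b_star = np.linalg.lstsq(X, y_star, rcond=None)[0]      # step 5
        t_star = b_star[j] / crve_se(X, y_star - X @ b_star, cluster_ids, j)
        exceed += abs(t_star) > abs(t_hat)
    return exceed / B                                           # step 6: Eq. (6)

# Toy data: 8 clusters of 25 observations, 2 "treated" clusters, no true effect.
rng = np.random.default_rng(3)
ids = np.repeat(np.arange(8), 25)
d = (ids < 2).astype(float)
X = np.column_stack([np.ones(200), d])
y = rng.standard_normal(8)[ids] + rng.standard_normal(200)
p = wcr_pvalue(y, X, ids, j=1, B=99)
```

Note that the Rademacher draw is one per cluster, not one per observation; that is what preserves the within-cluster correlation structure in the bootstrap samples.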
The procedure just described is known as the restricted wild cluster, or WCR,
bootstrap, because the bootstrap DGP (5) uses restricted parameter estimates and
restricted residuals.4 We could instead use unrestricted estimates and residuals in
step 4 and calculate bootstrap t statistics for the hypothesis that β4 = β̂4 in step 5.
This yields the unrestricted wild cluster, or WCU, bootstrap.
MacKinnon and Webb (2017b) explains why the WCB fails when the number of
treated clusters is small. The WCR bootstrap, which imposes the null hypothesis,
leads to severe under-rejection. In contrast, the WCU bootstrap, which does not
impose the null hypothesis, leads to severe over-rejection. When just one cluster
is treated, it over-rejects at almost the same rate as using CRVE t statistics with
the t(G − 1) distribution.
The poor performance of WCR and WCU when there are few treated clus-
ters is a consequence of the fact that the bootstrap DGP attempts to replicate the
within-cluster correlations of the errors using residuals that have very odd prop-
erties. MacKinnon and Webb (2018b) therefore suggest using the ordinary wild
bootstrap instead, and Djogbenou et al. (2018) prove that combining the ordinary
wild bootstrap for the model (1) with the CRVE (3) leads to asymptotically valid
inference. When clusters are sufficiently homogeneous, this procedure can work
well even when the number of treated clusters is small.
3. RANDOMIZATION INFERENCE
RI, first proposed in Fisher (1935), is a procedure for performing exact tests in
the context of experiments. The idea is to compare an observed test statistic τ̂
with an empirical distribution of test statistics τj∗ for j = 1, . . . , S generated by re-
randomizing the assignment of treatment across experimental units. To compute
each of the τj∗ , we use the actual outcomes while pretending that certain non-treated
experimental units were treated. If τ̂ is in the tails of the empirical distribution of
the τj∗ , then this is evidence against the null hypothesis of no treatment effect.
Randomization tests are valid only when the distribution of the test statistic
is invariant to the realization of the re-randomizations across permutations of
assigned treatments; see Lehmann and Romano (2008) and Imbens and Rubin
(2015). Whether this key assumption is true in the context of policy changes such
as those typically studied in the DiD literature is debatable. Any endogeneity in
the way policies are implemented over jurisdictions and time would presumably
cast doubt on the assumption.
When treatment is randomly assigned at the individual level, the invariance of
the distribution of the test statistic to re-randomization follows naturally. However,
if treatment assignment is instead at the group level, as is always the case for DiD
models like (4), then the extent of unbalancedness can determine how close the
distribution is to being invariant.
It is obvious that the proportion of treated observations matters for $\hat\beta_4$ in
(4) and its cluster-robust standard error. Let $\bar d = \bigl(\sum_{g=1}^{G_1} N_g\bigr)/N$ denote this
proportion. When clusters are balanced, the value of $\bar d$ will be constant across
re-randomizations. However, when clusters are unbalanced, $\bar d$ may vary considerably
across re-randomizations. This implies that the distributions of $\hat\beta_4$ may also
vary substantially. RI may not work well in such cases.
MacKinnon and Webb (2018a) studies two RI procedures. One uses β̂4 in (4)
as τ̂ , and the other uses the cluster-robust t statistic that corresponds to β̂4 . The
former procedure, which we refer to as RI-β, is quite similar to a procedure
proposed in Conley and Taber (2011). It is only valid, even in large samples,
if re-randomizing does not change the distribution of the β̂4j∗. The latter
procedure, which we refer to as RI-t, is evidently valid in large samples whenever the
68 JAMES G. MACKINNON AND MATTHEW D. WEBB
the single “treated” group in all re-randomizations would start treatment in 1978.
If G1 = 2 and treatment began in 1978 and 1982, then, for each re-randomization,
one group would begin treatment in 1978 and the other in 1982. In our simulations,
we ordered both the actually treated groups and the controls by size. Thus if, for
example, treatment began in 1978 for group 3 and in 1982 for group 11, and
N3 > N11 , then treatment would begin in 1978 for the larger control group and in
1982 for the smaller one. We also experimented with assigning treatment years at
random and found that doing so made very little difference.
The most natural way to calculate an RI p value is probably to use the equivalent
of Eq. (6). As before, S denotes the number of repetitions, which would be G0
when G1 = 1 and the minimum of C(G, G1) − 1 and B when G1 > 1, where B is a
user-specified target number of replications. Then the analog of Eq. (6) is
p̂1∗ = (1/S) ∑_{j=1}^{S} I(|τj∗| > |τ̂|).  (7)
This makes sense if we are testing the null hypothesis that β4 = 0 and expect τj∗ to
be symmetrically distributed around zero. If we were instead testing the one-sided
null hypothesis that β4 ≤ 0, we would want to remove the absolute value signs.
Eq. (7) is not the only way to compute an RI p value for a point null hypothesis,
however. A widely used alternative is
p̂2∗ = (1/(S + 1)) (1 + ∑_{j=1}^{S} I(|τj∗| > |τ̂|)).  (8)
It is easy to see that the difference between p̂1∗ and p̂2∗ is O(1/S), so that they tend
to the same value as S → ∞. There is evidently no problem if S is large, but the
two p values can yield quite different inferences when S is small. The analogous
issue should rarely arise for bootstrap tests, because the investigator can almost
always choose B (the number of bootstrap samples, which plays the same role as
S here) in such a way that Eqs. (7) and (8) yield the same inferences. This will
happen whenever α(B + 1) is an integer, where α is the level of the test. That is
why it is common to see B = 99, B = 999, and so on.
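The difference between Eqs. (7) and (8) is easy to see in code. A direct transcription, with `tau_hat` and `tau_star` standing for the observed and re-randomized statistics:

```python
import numpy as np

def ri_p_values(tau_hat, tau_star):
    """Return (p1, p2) from Eqs. (7) and (8) for a two-sided test."""
    tau_star = np.asarray(tau_star, dtype=float)
    S = tau_star.size
    R = int(np.sum(np.abs(tau_star) > np.abs(tau_hat)))  # exceedances
    p1 = R / S                    # Eq. (7)
    p2 = (1 + R) / (S + 1)        # Eq. (8)
    return p1, p2

# With S = 9 (e.g., one treated among G = 10 clusters) and no exceedances,
# p1 = 0 while p2 = 0.1: the interval p value problem.
p1, p2 = ri_p_values(3.5, [0.2, -1.1, 0.8, -0.5, 1.9, -2.0, 0.3, 1.2, -0.7])
print(p1, p2)  # → 0.0 0.1
```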
With RI, however, we generally cannot choose S such that α(S + 1) is an integer.
When it is not an integer, we could in principle use any p value between p̂1∗ and p̂2∗. Thus,
two by, in effect, flipping a coin. This means that two different researchers using
the same data set will randomly obtain different p values.
Formally, the WBRI procedure for generating the tj∗ and tj∗b statistics is as
follows:
(1) Estimate Eq. (4) by OLS and calculate t for the coefficient of interest using
CRVE standard errors.
(2) Obtain S test statistics tj∗ by re-randomization, as in Section 3.
(3) Estimate a restricted version of Eq. (4) with β4 = 0 and retain the restricted
estimates β̃ and residuals ũ.
(4) For the original test statistic and each of the S possible re-randomizations,
indexed by j = 0, . . . , S, construct B bootstrap samples indexed by b, say
yj∗b , using the restricted WCB procedure discussed in Section 2.1. For each
bootstrap sample, estimate Eq. (4) using yj∗b and calculate a bootstrap t statistic
tj∗b based on CRVE standard errors.5
(5) Use one of Eq. (7) or Eq. (8) to calculate a p value for t based on the
(B + 1)(S + 1) − 1 bootstrap and randomized test statistics.
Since every possible set of G1 clusters is “treated” in the bootstrap samples,
the number of bootstrap test statistics is B × C(G, G1) = B(S + 1). In addition, there
are C(G, G1) − 1 = S statistics based on the original sample. Thus, the total number
of test statistics is B(S + 1) + S = (B + 1)(S + 1) − 1. We suggest choosing B
so that this number is at least 1,000.
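The counting rule above, and the choice of B implied by it, can be verified directly. The function names here are ours, not the authors':

```python
from math import ceil, comb

def wbri_counts(G, G1, B):
    """Counts of statistics in the WBRI procedure for G clusters, G1 treated."""
    S = comb(G, G1) - 1              # re-randomizations of the treated set
    n_boot = B * (S + 1)             # B bootstrap statistics per assignment
    total = (B + 1) * (S + 1) - 1    # bootstrap plus randomization statistics
    return S, n_boot, total

def min_B(G, G1, target=1000):
    """Smallest B such that (B + 1)(S + 1) - 1 >= target."""
    S = comb(G, G1) - 1
    return ceil((target + 1) / (S + 1)) - 1

# One treated cluster among G = 15: S = 14, and B = 66 already gives
# (66 + 1)(14 + 1) - 1 = 1,004 >= 1,000 total test statistics.
print(wbri_counts(15, 1, 66))  # → (14, 990, 1004)
print(min_B(15, 1))            # → 66
```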
The number of possible bootstrap DGPs is only 2^G if one uses the Rademacher
distribution. Therefore, when G is small, it is better to use an alternative bootstrap
distribution, such as the 6-point distribution suggested in Webb (2014), for which
the number of possible bootstrap DGPs is 6^G.
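Webb's 6-point distribution puts probability 1/6 on each of ±√(1/2), ±1, and ±√(3/2). A sketch of drawing cluster-level auxiliary weights (function name ours):

```python
import numpy as np

SIX_POINT = np.array([-(1.5) ** 0.5, -1.0, -(0.5) ** 0.5,
                      (0.5) ** 0.5, 1.0, (1.5) ** 0.5])

def draw_weights(G, dist, rng):
    """One auxiliary weight per cluster, for the wild cluster bootstrap."""
    if dist == "rademacher":
        return rng.choice(np.array([-1.0, 1.0]), size=G)
    return rng.choice(SIX_POINT, size=G)  # Webb's 6-point distribution

rng = np.random.default_rng(0)
v = draw_weights(12, "six_point", rng)
# With G = 12: 2**12 = 4,096 Rademacher DGPs versus 6**12 = 2,176,782,336
print(v.shape)  # → (12,)
```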
In general, it makes sense to use the WBRI procedure only when the RI-t
procedure does not provide enough tj∗ for the interval p value problem to be
negligible. As a rule of thumb, we suggest using WBRI when G1 = 1 and G < 300,
or G1 = 2 and G < 30, or G1 = 3 and G < 15. Code for this procedure is available
from the authors.6
5. ALTERNATIVE PROCEDURES
In this section, we briefly discuss two very different procedures that can be used
instead of WBRI. Their performance will be compared with that of the latter in the
simulation experiments of Section 6.
Racine and MacKinnon (2007a) suggested a way to solve the interval p value
problem in the context of bootstrap tests. For those tests, the problem only
arises if computation is so expensive that making α(B + 1) an integer for all test
levels α of interest is infeasible. But since the problem arises quite frequently for
randomization tests, their procedure may be useful in this context.
Wild Bootstrap Randomization Inference for Few Treated Clusters 73
Recall the example of Canadian provinces given in Section 3.1. Suppose the
treated province has a more extreme outcome than any of the others, so that R = 0.
In the strict context of RI, all we can say is that the p value is between 0, according
to Eq. (7), and 0.10, according to Eq. (8). In saying this, however, we have made
no use of the actual values of t and the tj∗ . Only the location of |t| in the sorted list
affects either p value. If the outcome for the treated province differed a lot from
the outcomes for the other nine provinces, that is, if |t| were much larger than any
of the |tj∗ |, then the evidence against the null hypothesis would seem to be quite
strong. On the other hand, if |t| were just slightly larger than the largest of the |tj∗ |,
the evidence against the null would seem to be rather weak. But neither of the RI
p values takes this into account.
The procedure of Racine and MacKinnon (2007a) does take the values of the
actual and re-randomized test statistics into account. It is based on the smoothed
p value
p̂h = 1 − F̂h(t) = 1 − (1/S) ∑_{j=1}^{S} K(tj∗, t, h),  (9)
where F̂h (t) is a kernel-smoothed CDF of the tj∗ evaluated at the actual test statistic
t. When t is much more extreme than any of the tj∗ , it will surely lie in the far tail
of the CDF, and p̂h will be very small. On the other hand, when t is near one or
more of the tj∗ , p̂h is unlikely to be very small.
This procedure requires the choice of a kernel function K( · ) and a bandwidth
h. Because p̂h is an estimated probability rather than an estimated density, K( · )
must be a cumulative kernel. A natural choice is the cumulative standard normal
CDF. The choice of h is more difficult, and it matters a lot when S is small.
Based largely on simulation evidence, Racine and MacKinnon (2007a) suggested
choosing h = scS^{−4/9}, where s is the standard deviation of the tj∗, and the values of
c are 2.418, 1.575 and 1.3167 for α = 0.01, α = 0.05 and α = 0.10, respectively.7
Thus the bandwidth h should be larger the more variable are the tj∗ and the smaller
is the level of the test. The latter makes sense, because values of tj∗ will be scarcer
near more extreme values of t.
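A minimal sketch of the smoothed p value in Eq. (9), using the cumulative normal kernel and the bandwidth rule just described; the function names are ours, and exactly how s is estimated and the absolute values applied follows our reading of the text:

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def smoothed_p(t, t_star, c=1.575):
    """Kernel-smoothed p value, Eq. (9), applied to |t| and the |t*_j|."""
    t_star = np.asarray(t_star, dtype=float)
    S = t_star.size
    s = t_star.std(ddof=1)            # s estimated from the t*_j themselves
    h = s * c * S ** (-4.0 / 9.0)     # bandwidth rule h = s c S^(-4/9)
    a = np.abs(t_star)
    F_hat = np.mean([phi((abs(t) - aj) / h) for aj in a])
    return 1.0 - F_hat

t_star = [0.2, -1.1, 0.8, -0.5, 1.9, -2.0, 0.3, 1.2, -0.7]
print(smoothed_p(3.5, t_star))   # far beyond every |t*_j|: very small p
print(smoothed_p(1.0, t_star))   # inside the distribution: moderate p
```

Unlike Eqs. (7) and (8), the result depends on how far |t| lies from the |tj∗|, not just on its rank.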
The kernel smoothing procedure of Racine and MacKinnon (2007a) can evi-
dently be used with coefficients as well as t statistics, and we consider both methods
in the next section. Note that we estimate s from the tj∗ but then apply the procedure
to |t| and the |tj∗ |, whereas Racine and MacKinnon (2007a) considered upper tail
tests. Despite this difference, their smoothing procedure generally performed best
for tests at the 0.05 level when using the value of c recommended for that level,
namely, 1.575. All of the results reported in Section 6 use that value.
A radically different approach, which was studied in Donald and Lang (2007),
is to collapse the original, individual data to the cluster level. Instead of N
observations, the regression uses just G of them. Precisely how this works depends
on the model. Consider the simple case in which

yg = γ ιNg + β xg + ug,  g = 1, . . . , G,  (10)

where each of the subscripted vectors corresponds to a single cluster and has Ng
observations, and the vector ιNg contains Ng ones. If we take the averages of each
of the vectors here, we obtain ȳg = ι′Ng yg/Ng, x̄g = ι′Ng xg/Ng, and ūg = ι′Ng ug/Ng.
This allows us to write
ȳ = γ ιG + β x̄ + ū, (11)
where the G-vectors ȳ, x̄ and ū have typical elements ȳg , x̄g and ūg , respectively.
Since all the variables in regression (11) are cluster means, we refer to it as a
“cluster-means regression,” or CMR.
Donald and Lang (2007) argue that the ordinary t statistic for β = 0 in the
cluster-means regression (11) will be approximately distributed as t(G − 2) if
two restrictive but not unreasonable assumptions are satisfied. The first is that all
clusters are the same size, so that Ng = m for all g, with all of the ug having the
same covariance matrices. The second is either that the original error terms are
normally distributed or that m is sufficiently large, so that a central limit theorem
applies to the elements of ū.
The advantage of collapsing individual data to the cluster level, as in (11), is
that we no longer have to estimate a CRVE. Because of the first assumption, we
do not even have to use heteroskedasticity-robust standard errors. This allows us
to make inferences about β when just one cluster is treated. In that case, only
one element of x̄ is nonzero, but we can still make valid inferences because all G
observations are used to estimate the variance of the error terms.
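A sketch of this cluster-means regression with the t(G − 2) comparison. The function name is ours; this is an illustration under the simple model above, not the authors' code:

```python
import numpy as np

def cluster_means_regression(y, x, g):
    """Collapse to cluster means and run OLS with ordinary standard errors,
    as in Eq. (11); compare the returned t statistic with t(G - 2)."""
    groups = np.unique(g)
    G = groups.size
    ybar = np.array([y[g == gg].mean() for gg in groups])
    xbar = np.array([x[g == gg].mean() for gg in groups])
    X = np.column_stack([np.ones(G), xbar])
    coef, *_ = np.linalg.lstsq(X, ybar, rcond=None)
    resid = ybar - X @ coef
    s2 = resid @ resid / (G - 2)          # error variance, G - 2 df
    cov = s2 * np.linalg.inv(X.T @ X)
    return coef[1], coef[1] / np.sqrt(cov[1, 1])

# Example: 10 clusters of 20 observations, one treated cluster
rng = np.random.default_rng(3)
g = np.repeat(np.arange(10), 20)
x = (g == 0).astype(float)            # all observations in cluster 0 treated
y = 2.0 * x + rng.standard_normal(200)
beta_hat, t_stat = cluster_means_regression(y, x, g)
```

Even with only one treated cluster, all G collapsed observations contribute to the error-variance estimate, which is why the procedure remains feasible.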
6. SIMULATION EXPERIMENTS
In this section, we report the results of some simulation experiments designed to
assess the performance of WBRI and the procedures discussed in Section 5. The
model is very simple. It is essentially Eq. (4), but without any group dummies.
This model can also be thought of as Eq. (10) with time dummies instead of the
constant term. To make inference a bit more difficult, the error terms follow a
lognormal distribution. The group dummies are omitted because the error terms
have constant intra-cluster correlations of 0.05 (prior to being exponentiated), and
group dummies would soak up all of this correlation.
In the experiments that we report, there are G clusters, each with 100 observa-
tions divided evenly among 10 time periods. When a cluster is treated, treatment
is always for 5 of the 10 periods. Because all clusters are the same size, and the
number of treated observations per treated cluster is always the same, RI would
work perfectly if it were not for the interval p value problem. If we relaxed either
of these assumptions, of course, it would not work perfectly, even when G is large;
see MacKinnon and Webb (2018a).
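The error process described above can be sketched as follows. The parameter values come from the text; generating the equicorrelated normals from a common cluster shock plus idiosyncratic noise is our assumption about the implementation:

```python
import numpy as np

def simulate_cluster_errors(G, Ng=100, rho=0.05, rng=None):
    """Lognormal errors with constant intra-cluster correlation rho,
    imposed on the underlying normals before exponentiation."""
    if rng is None:
        rng = np.random.default_rng()
    # Equicorrelated normals: common cluster shock plus idiosyncratic noise
    common = rng.standard_normal((G, 1)) * np.sqrt(rho)
    idio = rng.standard_normal((G, Ng)) * np.sqrt(1.0 - rho)
    return np.exp(common + idio)

u = simulate_cluster_errors(G=10, rng=np.random.default_rng(7))
print(u.shape)  # → (10, 100)
```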
Figure 2 shows rejection frequencies for tests at the 0.05 level for three pro-
cedures (RI-t using p̂1, RI-t using p̂2, and WBRI-t using p̂2) for 56 different
experiments, each with 400,000 replications.8 The number of clusters varies from
5 to 60, and only one cluster is treated. For any value of G, this is the case for
which the interval p value problem is most severe, because S = G − 1 is small
unless there are many clusters. The number of bootstraps per randomization is
always chosen so that (B + 1)(S + 1) ≥ 1,000.
One striking feature of Figure 2 is that rejection frequencies for the two RI
procedures are almost exactly what theory predicts; see Figure 1. When G = 20,
40, and 60, the two RI p values yield precisely the same outcomes, as they must.
In every other case, however, p̂1∗ = R/S rejects more often than p̂2∗ = (R + 1)/(S + 1).
In Figure 2, the WBRI rejection frequencies are almost always between the two
RI rejection frequencies (although this is not true for G = 19 and G = 20), and
they are always quite close to 5% except when G is very small. This is what we
would like to see. However, it must be remembered that the figure deals with a
very special case in which all clusters are the same size and the error terms are
homoskedastic. The WBRI procedure cannot be expected to work any better than
the RI-t procedure when the treated clusters are smaller or larger than the untreated
clusters, or when their error terms have different variances.
In the experiments of Figure 2, we used the Rademacher distribution for G ≥ 19
and the 6-point distribution for G ≤ 18. This accounts for the sharp jump between
18 and 19. Rejection frequencies for small values of G would have been much
larger if we had used Rademacher, while those for large values of G would have
been noticeably smaller if we had used 6-point. It is not clear why WBRI tends
to under-reject for tests with G1 = 1 (but not for tests with G1 = 2; see below)
when the 6-point distribution is used. As MacKinnon and Webb (2017b, 2018b)
showed, OLS residuals have strange properties when just one cluster is treated. We
speculate that these cause the choice of the wild bootstrap auxiliary distribution to
be unusually important in this case.
Figure 3 shows rejection frequencies for tests at the 0.05 level for five pro-
cedures. For readability, the vertical axis has been subjected to a square root
transformation, and the conventional RI procedures have been dropped. The results
for WBRI-t are the same ones shown in Figure 2. Results for WBRI-β are also
shown, and it is evident that WBRI-β always rejects more often than WBRI-t.
The difference is quite substantial for small values of G, and WBRI-t is clearly
preferred.
In Figure 3, the two tests based on kernel-smoothed p values work remarkably
well for larger values of G (say G > 25), but they over-reject quite severely for
really small values. The over-rejection is more severe for RI-t than for RI-β. All
reported results are for c = 1.575, the value suggested in Racine and MacKinnon
(2007a) for tests at the 0.05 level. When the larger value of 2.418 (suggested for
tests at the 0.01 level) was used instead, all rejection frequencies were noticeably
lower.
The test based on the CMR (11) and the t(G − 2) distribution works remarkably
well even for very small values of G. It over-rejects slightly when G is small and
under-rejects slightly when G is large. It would have performed even better if the
errors had been normally rather than lognormally distributed. Since this test is very
easy to perform (it requires neither randomization nor the bootstrap), one might
well feel, on the basis of these results, that there is no point worrying about the
more complicated procedures based on individual data. However, this test does
have one serious limitation. As we will see below, it can be seriously lacking in
power.
All the experiments reported so far have just one treated group. This is generally
the most difficult case. In Figure 4, we show results for several tests with G1 = 2.
For these tests, the values of G vary from 5 to 20, and the values of S consequently
vary from 9 to 189. The CMR works extremely well for all values of G. WBRI-t
(using the 6-point distribution for G ≤ 13 and the Rademacher distribution for
G ≥ 14) under-rejects slightly for very small values of G but works very well
whenever G ≥ 11. Smoothed RI-t over-rejects for very small values of G but works
very well for G ≥ 9. However, the two procedures that are based on coefficients
instead of t statistics do not work particularly well.
The results presented so far may seem to suggest that the cluster-means regres-
sion is the most reliable, as well as the easiest, way to make inferences. However,
this approach has one serious shortcoming. When the value of the treatment vari-
able is not constant within groups, aggregation to the group level can seriously
reduce power.
Figure 5 presents results for a case with G = 10, G1 = 2 and S = 44, where the
value of β varies from 0 (the null hypothesis) to 1. The most striking result is that
tests based on the CMR (11) are much less powerful than the other tests. As noted
earlier, in all of our experiments there are 10 “years,” only 5 of which are treated.
Every cluster has 100 observations, 10 for each “year.” Therefore, the regressor
x̄g either takes the value 0 (when cluster g is not treated) or the value 0.5 (when
half the observations in cluster g are treated). Not surprisingly, this results in very
substantial power loss.9 Of course, if all the observations in every treated cluster
were treated, this power loss would not occur. Additional experiments suggest that,
when all “years” are treated, tests based on regression (11) have excellent power.
Some of the other results in Figure 5 are also interesting. The two procedures
based on t statistics, WBRI-t and smoothed RI-t, have power functions that are
essentially identical. In contrast, the two procedures based on coefficients are
noticeably more powerful than the ones based on t statistics. This is consistent
with results for RI tests in MacKinnon and Webb (2018a), and it makes sense,
because the tests based on coefficients do not have to estimate standard errors. The
somewhat higher power of WBRI-β relative to smoothed RI-β can probably be
attributed to its somewhat larger size (it rejects 5.92% of the time at the 5% level,
versus 4.49%).
It is important to remember that all the procedures we have discussed are very
sensitive to the assumption that the clusters are homogeneous. When that assump-
tion is violated, no RI procedure can be expected to perform well, even when G
and G1 are large. Since MacKinnon and Webb (2018a) documents the mediocre
performance of RI tests for a number of cases where cluster sizes vary, there is
no need to perform similar experiments here. In general, RI tests tend to over-
reject when the treated clusters are relatively small and under-reject when they are
relatively large.
In Figure 6, we investigate the effects of a particular type of heteroskedastic-
ity which was not studied in MacKinnon and Webb (2018a). Instead of the error
terms being homoskedastic, their standard deviation is twice as large for treated
observations as for untreated ones. Whether this is a realistic specification is debat-
able, although it does not seem unreasonable that some treatments could affect the
second moment of the outcome as well as the first.
In both panels of Figure 6, G varies between 5 and 20, as in Figures 3 and
4. In the left panel, G1 = 1, and in the right panel, G1 = 2. It is evident that
no method yields reliable inferences. The results for G1 = 2 are generally better
than for G1 = 1, but they are far from satisfactory. Moreover, the performance
of the cluster-means regression and of the two methods based on RI-β actually
deteriorates as G increases when G1 = 2.
7. EMPIRICAL EXAMPLE
In this section, we consider an empirical example from Decarolis (2014). Part of
the analysis deals with how the introduction of first price auctions (FPA) in Italy
                                Model 1           Model 2

Panel A
β̂                               12.18             6.14
t Statistic                     14.86             7.82
PA-year clustering (CI)         (9.54, 14.81)     (3.55, 8.72)
PA clustering (CI)              (10.42, 13.94)    (4.45, 7.82)
CMR p value                     0.0203            0.6698
Conley–Taber (CI)               (10, 16)          (5, 8)
RI-β p values                   (0.000, 0.063)    (0.133, 0.188)
Smoothed RI-β p value           0.0000            0.0885
RI-t p values                   (0.000, 0.063)    (0.067, 0.125)
Smoothed RI-t p value           0.0000            0.0716
WBRI-t p value                  0.0000            0.0799
WBRI-β p value                  0.0000            0.0595
N                               1,262             1,262
G                               15                15

Panel B
β̂                               8.71              5.69
t Statistic                     19.22             8.34
PA-year clustering (CI)         (6.55, 10.85)     (3.19, 8.18)
PA clustering (CI)              (7.75, 9.66)      (4.25, 7.12)
CMR p value                     0.0041            0.4684
Conley–Taber (CI)               (7, 14)           (4, 8)
RI-β p values                   (0.000, 0.056)    (0.118, 0.187)
Smoothed RI-β p value           0.0004            0.1046
RI-t p values                   (0.000, 0.056)    (0.058, 0.111)
Smoothed RI-t p value           0.0000            0.0570
WBRI-t p value                  0.0014            0.0181
WBRI-β p value                  0.0000            0.0446
N                               1,355             1,355
G                               18                18

Regressors
Fiscal efficiency               Yes               No
PA-specific time trends         No                Yes

Notes: Entries of the form (0.000, 0.067) represent the p value pairs (p̂1∗, p̂2∗). WBRI p values are
obtained with B = 700 for Panel A and B = 600 for Panel B, ensuring that B × C(G, 1) > 10,000 for both
panels.
WBRI p value using the same two samples and two models. We do this clustering
only by PA. As expected, the RI-β p values are identical to the RI-t p values
for Model 1, because there is only one treated cluster; see MacKinnon and Webb
(2018a) for details.11 The four RI p value intervals for Model 1 contain 0.05, while
the four RI p value intervals for Model 2 contain 0.10. In the former case, this
makes it impossible to tell whether we should reject or not reject at the 0.05 level.
In the latter case, we evidently cannot reject at the 0.05 level, but it is impossible
to tell whether we should reject or not reject at the 0.10 level.
The WBRI-t p values shown in the table are obtained with B = 700 for Panel
A and B = 600 for Panel B. This means that there are 701 × C(15, 1) − 1 = 10,514 and
601 × C(18, 1) − 1 = 10,817 bootstrap/randomization t statistics, respectively. Under
Model 1, we find WBRI-t p values that are very close to p̂1∗ and highly signif-
icant. Under Model 2, we again find that the WBRI-t p value is very close to
p̂1∗ for the municipality sample, but below p̂1∗ for the county sample. Except for
Model 2 using the county sample, the smoothed RI-t p values are very similar to
the WBRI-t ones. The WBRI-β p values are in general similar to both the WBRI-t
values and the smoothed RI-β p values. Interestingly, for Model 2 using both
samples, the WBRI-β p values are below p̂1∗ .
We also consider an aggregation procedure, which we call cluster-means regres-
sion, or CMR, that is similar to one suggested in Donald and Lang (2007). This
procedure yields sensible results for Model 1 for both samples. However, for
Model 2, it yields much larger p values than any of the other procedures. This is
probably a consequence of the fact that Model 2 contains both a DiD term for just
one cluster and a time trend for only that cluster, which does not fit easily
into the aggregation framework of Eq. (11).
The evidence against the null hypothesis is probably even stronger than these
results suggest. In MacKinnon and Webb (2018a), we showed that RI procedures
tend to under-reject when the treated clusters are unusually large. Since the only
treated cluster is either the Municipality or the County of Turin, and each of those
is the largest cluster in its sample, we would expect all forms of RI p value to be
biased upwards. Thus the fact that the WBRI-t test rejects at the 0.001 level for
Model 1 for both data sets and at either the 0.05 or 0.10 level for Model 2 suggests
that there is quite strong evidence against the null hypothesis.
8. CONCLUSION
We introduce a bootstrap-based modification of RI which can solve the problem
of interval p values when there are few possible randomizations, a problem that
often arises when there are very few treated groups. This procedure, which we
call WBRI, is easiest to understand as a modified version of the WCB. Like the
WCB, it generates a large number of bootstrap samples and uses them to compute
bootstrap test statistics. However, unlike the WCB, only some of the bootstrap test
statistics are testing the actual null hypothesis. Most of them are testing fictional
null hypotheses obtained by re-randomizing the treatment.
The WBRI procedure can be used to generate as many bootstrap test statistics
as desired by making B large enough. Thus, it can solve the problem of interval
p values. However, it shares some of the properties of RI procedures, which
perform conventional RI based on either coefficients or cluster-robust t statistics;
see MacKinnon and Webb (2018a). In particular, like RI-β and RI-t, WBRI-β
and WBRI-t can be expected to over-reject (or under-reject) when the treated
clusters are smaller (or larger) than the control clusters. This tendency is greater for
WBRI-β than for WBRI-t. Thus, we cannot expect WBRI procedures to yield
reliable inferences in every case.
We also consider two other procedures. One of them applies the kernel-
smoothed p value approach of Racine and MacKinnon (2007a) to RI. This method
seems to perform very similarly to WBRI in many cases. The other, based on
Donald and Lang (2007), aggregates individual data to the cluster level and uses
the t distribution with degrees of freedom equal to the number of clusters minus 2.
This cluster-means regression approach can work remarkably well in some cases,
but it can be seriously lacking in power when not all observations within treated
clusters are treated.
ACKNOWLEDGEMENTS
The WBRI procedure discussed in this paper was originally proposed in a working
paper circulated as “Randomization Inference for Difference-in-Differences with
Few Treated Clusters.” However, a revised version of that paper no longer discusses
the WBRI procedure. We are grateful to Jeffrey Wooldridge, seminar participants
at the Complex Survey Data conference on October 19–20, 2017 and at New York
Camp Econometrics XIII on April 6–8, 2018, and two anonymous referees for
helpful comments. This research was supported, in part, by grants from the Social
Sciences and Humanities Research Council of Canada. Joshua Roxborough and
Oladapo Odumosu provided excellent research assistance.
NOTES
1. One of the earliest CRVEs was suggested in Liang and Zeger (1986). Alternatives
to (3) have been proposed in Bell and McCaffrey (2002) and Imbens and Kolesár (2016),
among others.
2. Of course, even when G1 is not small, the matrices Ng⁻¹Xg′ûgûg′Xg in (3) do not
estimate the corresponding matrices Ng⁻¹Xg′ΩgXg in (2) consistently, because the former
matrices necessarily have rank 1. But the summation in the middle of expression (3), appro-
priately normalized, does consistently estimate the matrix X′ΩX, appropriately normalized.
See Djogbenou et al. (2018) for details.
3. Because vg∗b takes the same value for all observations within each group, we would
not want to use the Rademacher distribution if G were smaller than about 12; see Webb
(2014), which proposes an alternative for such cases.
4. For more details on how to implement the wild cluster bootstrap in Stata at minimal
computational cost, see Roodman, MacKinnon, Nielsen, and Webb (2019).
5. Note that, in this procedure, B denotes the number of bootstrap samples per re-
randomization. The total number of bootstrap samples is B(S + 1). It might seem tempting
to use the same B bootstrap samples for every re-randomization. However, this would create
dependence among the S different test statistics that depend on each bootstrap sample. This
sort of dependence should be avoided.
6. Code for the WBRI procedure can be found at https://sites.google.com/site/matthewdwebb/code.
7. There is a typo on page 5,955 of Racine and MacKinnon (2007a) which causes the
optimal values of c for α = 0.01 and α = 0.10 to be reversed. That this is incorrect can be
seen from Figure 6 of that paper.
8. WBRI would have rejected slightly more often if we had used p̂1 instead of p̂2 ; the
difference in rejection frequencies was almost always less than 0.001.
9. We note that Donald and Lang (2007) did not suggest using Eq. (11) for DiD models
in the way that we have used it here.
10. Following the original paper, confidence intervals for the CT procedure are rounded
to the nearest integer values.
11. We should not expect them to be the same for Model 2, however, because there are
two variables that need to be randomized, the DiD variable and the trend-treatment variable.
REFERENCES
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case
studies: Estimating the effect of California’s tobacco control program. Journal of the American
Statistical Association, 105(490), 493–505.
Bell, R. M., & McCaffrey, D. F. (2002). Bias reduction in standard errors for linear regression with
multi-stage samples. Survey Methodology, 28(2), 169–181.
Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-
differences estimates? The Quarterly Journal of Economics, 119(1), 249–275.
Bester, C. A., Conley, T. G., & Hansen, C. B. (2011). Inference with dependent data using cluster
covariance estimators. Journal of Econometrics, 165(2), 137–151.
Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference
with clustered errors. The Review of Economics and Statistics, 90(3), 414–427.
Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster robust inference. Journal of
Human Resources, 50(2), 317–372.
Canay, I. A., Romano, J. P., & Shaikh, A. M. (2017). Randomization tests under an approximate
symmetry assumption. Econometrica, 85(3), 1013–1030.
Carter, A. V., Schnepel, K. T., & Steigerwald, D. G. (2017). Asymptotic behavior of a t test robust to
cluster heterogeneity. Review of Economics and Statistics, 99(4), 698–709.
Conley, T. G., & Taber, C. R. (2011). Inference with “Difference in Differences” with a small number
of policy changes. The Review of Economics and Statistics, 93(1), 113–125.
VARIANCE ESTIMATION FOR
SURVEY-WEIGHTED DATA USING
BOOTSTRAP RESAMPLING METHODS:
2013 METHODS-OF-PAYMENT SURVEY
QUESTIONNAIRE
ABSTRACT
88 HENG CHEN AND Q. RALLYE SHEN
1. INTRODUCTION
Variance estimates are crucial for building confidence intervals to assess disper-
sion, and for implementing statistical inferences to test various hypotheses. In
general, survey variance estimates depend on the specific weighting procedure,
not just on the numerical values of the weights; variance estimates that disregard
the weighting procedure are often biased. Hence, an unbiased estimation method
must incorporate two sources of randomness from the weighting procedure: (1)
from the sampling design, which, in our case, is measured by the selection proba-
bility design weights induced by complicated sampling and (2) from the calibration
procedure, which involves adjusting the sample counts to match the population
counts through calibrated weights. If we ignore either source of randomness, the
variance estimates will be incorrect.
To account for the randomness from the sampling design, it is important to under-
stand design-based inference. While the units in the population as well as
their characteristics are assumed fixed, the randomness in the design-based statis-
tics comes only from randomization performed at the sample selection stage. The
design-based distributions are obtained by enumerating all samples possible under
a given design scheme and associating the numeric values of the statistics of interest
with the probabilities of the samples they are based on.
As for the randomness from the calibration procedure, adjusting design weights
would make final weights depend on the particular calibration method, in which
incorporation of population level information can lead to statistically more accurate
estimates and better inference. Such modifications will affect variances of weighted
estimates, because calibrated weights are functions of the sampling design, which
introduces randomness from the sample selection stage. In contrast to the non-
random design weights, calibrated weights are usually random (Lu & Gelman,
2003).
This paper discusses variance estimations of the weighted means and pro-
portions used in Henry, Huynh, and Shen (2015), whose sampling design is an
approximate stratified two-stage sampling,1 and the classical raking (or iterative
proportional fitting (IPF)) procedure is chosen for calibration. In the stratified two-
stage sampling, the population is divided into nonoverlapping strata. From each
stratum, a sample of primary sampling units (PSUs) is taken with replacement,
and from each PSU, samples of ultimate units are taken. Stratification can improve
the efficiency of estimates, while allowing for straightforward statistical analysis
within the strata. PSUs form clusters and allow the user to reduce costs in situa-
tions where it is impossible or impractical to obtain the complete list of ultimate
observation units.
Besides the sampling design, this paper considers variance estimation of raking
ratio estimators used in Vincent (2015). Two types of raking ratio estimators are considered.
This paper considers two forms of raking ratio estimator, the classical esti-
mator obtained by the application of IPF as well as an estimator, which may be
interpreted as a maximum likelihood estimator within a certain framework. The
GREG estimator is also considered as a benchmark for comparison.4 Consider the
estimation of a population total TY of a survey variable Y taking values yi for units
i in a population U :
T_Y = Σ_{i∈U} y_i,

which may be estimated by the weighted sum T̂_Yω = Σ_{i∈s} ω_i y_i over the sample s,
where ω_i is a given weight, referred to here as the initial weight. This weight
may be the Horvitz–Thompson (H–T) weight ω_i = 1/π_i, where π_i is the selection
probability of unit i. Note that ω_i is fixed (nonrandom) and known before the
survey is conducted. It is usually computed
(nonrandom) and known before the survey is conducted. It is usually computed
as the ratio between the census stratum count and the service-agreement targeted
count.
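As a concrete illustration of the H–T weight ω_i = 1/π_i, the following sketch computes a weighted total; the numbers are toy values, not survey data:

```python
def ht_total(y, pi):
    """Horvitz-Thompson estimate of a population total: each sampled
    value y_i is weighted by omega_i = 1 / pi_i, the inverse of the
    unit's selection probability."""
    return sum(yi / p for yi, p in zip(y, pi))

# Three sampled units with selection probabilities 0.1, 0.2 and 0.5.
est = ht_total([3.0, 5.0, 2.0], [0.1, 0.2, 0.5])  # 30 + 25 + 4 = 59
```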
The classical raking adjustment makes use of information on the population counts
over the categories of two or more categorical auxiliary variables. This type of
adjustment is used in the weighting and calibration procedure. For example, sup-
pose three sets of post-strata are used for calibration, and let xi denote the vector
of indicator variables for these categories:

x_i′ = (δ_{1··i}, …, δ_{A··i}, δ_{·1·i}, …, δ_{·B·i}, δ_{··1i}, …, δ_{··Ci}),

where A, B and C denote the numbers of categories of the three auxiliary variables,
δ_{a··i} = 1 if unit i is in category a of the first auxiliary variable and 0 otherwise,
δ_{·b·i} = 1 if unit i is in category b of the second auxiliary variable and 0 otherwise,
and so on. The population total TX of this vector thus contains the population counts
in each of the (marginal) categories for each of the three auxiliary variables. It is
assumed that TX is given and that xi is known for i ∈ s.
The classical raking adjustment involves iterative modifications of the initial
weights, ωi , in a multiplicative way to adjust the weights wi with the aim of
satisfying the calibration equations:
Σ_{i∈s} w_i x_i = T_X.
The multiplicative adjustment depends only upon the cell in the contingency
table formed by the auxiliary variables; that is, we may write w_i = ω_i h(x_i), where
the multiplicative adjustment factor h(x_i) is fixed for all units with common values
of the auxiliary variables. Let N̂_ω(x) and N̂_w(x) denote the weighted estimates
of the population counts in the cell of the table defined by x, using the weights
ω_i and w_i, respectively. Then we may write N̂_w(x) = h(x)N̂_ω(x), where IPF
makes use of standard post-stratification.5 The usual iterative modification of the
weights involves IPF. Ireland and Kullback (1968) demonstrate that this method
converges to a solution which minimizes
Σ_x N̂_w(x) log(N̂_w(x)/N̂_ω(x)),

subject to the calibration equations, where the sum is over all cells defined by x.
This objective function may alternatively be expressed as
Σ_{i∈s} w_i log(w_i/ω_i),
that is, under convergence of the iterative algorithm, the wi minimizes the above
function, subject to solving the calibration equations. Yet another way to express
this objective function is
Σ_{i∈s} ω_i G_M(w_i/ω_i),
where G_M(u) = u log(u) − u + 1 is the multiplicative distance measure considered
by Deville, Sarndal, and Sautory (1993), and, in the calibration equations, the
total T_X is taken as a given constant. Using the standard Lagrange
multiplier method for the constrained minimization, the multiplicative adjustment
factors may be expressed as
w_i = ω_i F_M(x_i′λ̂),

where F_M(u) = g_M^{−1}(u) denotes the inverse function of g_M(u) = dG_M(u)/du and
λ̂ is the Lagrange multiplier, which solves the calibration equations. It follows
from the definition of G_M(u) above that g_M(u) = log(u) and F_M(u) = exp(u). Hence λ̂
solves

Σ_{i∈s} ω_i exp(x_i′λ̂) x_i = Σ_{i∈U} x_i.
The estimator T̂_{Y,Rak} = Σ_{i∈s} w_i y_i may be used if y_i is a scalar or a vector, since
the scalar weight, w_i, does not depend upon y_i.
The maximum likelihood raking estimator instead minimizes an objective function
that is proportional to minus the log likelihood in the case of simple random
sampling with replacement. Equivalently, the objective function may be expressed,
summed over sample units, as
−Σ_{i∈s} ω_i log(w_i/ω_i),
or as
Σ_{i∈s} ω_i G_{ML}(w_i/ω_i),
where GML (u) = u − 1 − log(u).
One of the major difficulties for the empirical likelihood inferences under gen-
eral unequal probability sampling designs is to obtain an informative empirical
likelihood function for the given sample. The likelihood depends necessarily on
the sampling design, and a complete specification of the joint probability function
of the sample is usually not feasible. Because of this difficulty, Chen and
Sitter (1999) propose maximizing the pseudo empirical log-likelihood

n Σ_{h=1}^{L} W_h Σ_{i∈S_h} (ω_i / Σ_{j∈S_h} ω_j) log(w_{hi}),

where W_h = N_h/N, N_h is the stratum size and N = Σ_{h=1}^{L} N_h.
A third distance measure is

Σ_{i∈s} ω_i G_{LM}(w_i/ω_i),

where G_{LM}(u) = (1/2)(u − 1)². This leads to the generalized regression estimator
T̂_{Y,GREG} = Σ_{i∈s} w_i y_i = T̂_{Yπ} + (T_X − T̂_{Xπ})′B̂_s,

where T̂_{Yπ} = Σ_{i∈s} (1/π_i) y_i and T̂_{Xπ} = Σ_{i∈s} (1/π_i) x_i denote the H–T estimators of
T_Y and the vector T_X, and

B̂_s = (Σ_{i∈s} ω_i x_i x_i′)^{−1} Σ_{i∈s} ω_i x_i y_i.
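For intuition, here is the GREG estimator specialized to a single scalar auxiliary variable, so that B̂_s reduces to a ratio rather than a matrix expression. The data values are illustrative:

```python
def greg_total(y, x, pi, TX):
    """GREG estimator with one scalar auxiliary variable:
    T_hat = T_Y_pi + (T_X - T_X_pi) * B_s,
    where B_s = (sum w x^2)^(-1) * sum w x y and w_i = 1/pi_i."""
    w = [1.0 / p for p in pi]
    TY_pi = sum(wi * yi for wi, yi in zip(w, y))      # H-T total of y
    TX_pi = sum(wi * xi for wi, xi in zip(w, x))      # H-T total of x
    Bs = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) / \
         sum(wi * xi * xi for wi, xi in zip(w, x))
    return TY_pi + (TX - TX_pi) * Bs

# When y is exactly proportional to x, GREG recovers the exact total:
# y = 2x and TX = 10 give T_hat = 2 * 10 = 20.
est = greg_total(y=[2.0, 4.0, 6.0], x=[1.0, 2.0, 3.0], pi=[0.5, 0.5, 0.5], TX=10.0)
```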
3. VARIANCE ESTIMATION
We consider estimating the asymptotic variance of the converged estimator, i.e., the
estimator T̂Y Rak , where the weights wi solve the constrained optimization problem.
This asymptotic variance is assumed to be a sufficiently close approximation to the
asymptotic variance of the estimator obtained after the finite number of iterations
used in practice.
We assume that in large samples, λ̂ converges to a value λ. Deville and Sarndal
(1992) assume that λ = 0, but this property is based upon the assumption that the
estimator of TX obtained by applying the initial weights ωi is consistent. This
assumption will often be false in the case of nonresponse and we prefer not to
make this assumption.
We allow the function F (·) to be general and not necessarily equal to FM (·).
We first expand the adjusted weight w_i = ω_i F(x_i′λ̂) about λ to obtain

w_i ≈ ω_i [F_i + f_i x_i′(λ̂ − λ)],

where F_i = F(x_i′λ) and f_i = F′(x_i′λ), and hence

λ̂ − λ ≈ (Σ_{i∈s} ω_i f_i x_i x_i′)^{−1} (T_X − Σ_{i∈s} ω_i F_i x_i).
We are assuming here the first matrix is non-singular. It may be necessary to
drop redundant variables from xi to achieve this. For example, in the three-way
case above, each of the sums of the indicator variables δa..i , δ.b.i and δ..ci across a,
b and c, respectively, equals to 1 and it is natural to drop two of these indicators
to avoid singularity.
where w_i = ω_i F_i and B = (Σ_{i∈s} ω_i f_i x_i x_i′)^{−1} Σ_{i∈s} ω_i f_i y_i x_i. We assume that B
converges to a finite limit β in the asymptotic framework. It follows from this and
the other approximation assumptions above, in particular that the initial weights
ω_i are fixed, that the (normalized) asymptotic distribution of B[T_X − Σ_{i∈s} ω_i F_i x_i]
is the same as that of β[T_X − Σ_{i∈s} ω_i F_i x_i]. Hence, the (normalized) asymptotic
variance of T̂_{Y,Rak} is the same as that of Σ_{i∈s} z_i, where

z_i = w_i (y_i − βx_i).
where e_{hjk} = y_{hjk} − x_{hjk}B̂ are the estimated residuals and B̂ is the estimator of
the multiple regression coefficient, which may be constructed using either design
(Deville & Sarndal, 1992) or raking weights. If nonresponse exists, we should use

z_{hj} = Σ_k w_{hjk} e_{hjk},    (3)

where w_{hjk} = ω_{hjk} F̂_{hjk}, e_{hjk} = y_{hjk} − x_{hjk}B̂ and
B̂ = (Σ_{i∈s} ω_i f̂_i x_i x_i′)^{−1} Σ_{i∈s} ω_i f̂_i y_i x_i.
V̂(T̂_{Y,Rak}) = Σ_{t=1}^{T} c_t (T̂^{(t)}_{Y,Rak} − T̂_{Y,Rak})²,
where ct is a constant which depends on the replication method. There are three
main resampling methods: balanced repeated replication (BRR), the jackknife and
the bootstrap (Rust & Rao, 1996). Each involves creating multiple replicates of
the data set by repeatedly sampling from the original sample.
We focus on the bootstrap resampling method.6 We do not use BRR because it is
suitable only for a stratified clustered sampling design where nh = 2 for all
strata. The main reason that we choose the bootstrap over the jackknife is
that the traditional delete-1 jackknife variance estimator is inconsistent for
non-smooth functions (e.g., sample quantiles). The consistent delete-d jackknife
method requires a nontrivial specification for d, where there is a complicated
interplay between the smoothness of the estimate and the parameter d.7 The bootstrap,
on the other hand, will generally work for these non-smooth estimates, as discussed
in Ghosh, Parr, Singh, and Babu (1984). Besides the major advantage of the boot-
strap over the jackknife for non-smooth estimates, we prefer the bootstrap for two
other reasons: (1) less computational burden: as pointed out in Kolenikov (2010),
the number of replications required by the delete-d jackknife increases notably
with d, especially when applied to list-based establishment surveys; (2) better
approximation of distributions: the bootstrap can be used for estimating distributions and constructing more
accurate one-sided confidence intervals, while the jackknife is typically only used
for estimating variances.
When survey data are released for public use, confidentiality of the respondents
must be protected: geographic information is provided in a coarse form, incomes
are top-coded, small racial groups are conglomerated, etc. Variance estimation via
linearization requires that stratum and PSU identifiers h and j are known for each
observation. If the data provider decides that releasing strata and PSU information
poses the risk that individual subjects could be identified, alternative variance
estimation methods must be used.
To overcome the above limitation, instead of recreating the sample in each
replicate, we implement the more practical method of generating replicate weights.
The construction of the replicate weights wi(t) involves first taking the initial weights
ωi . From these, a set of initial replication weights ωi(t) , t = 1, . . . , T , is constructed
according to the replication method and the sampling scheme. Next the raking
adjustment method is applied to each of these T sets of weights separately. This
generates the required weights wi(t) . This approach can be applied to a wide class
of adjustment methods including classical raking, “maximum likelihood” raking
and GREG. These replicate weights protect the privacy of sampling units and
have the advantage of incorporating strata information as well as adjustments for
nonresponse and noncoverage.
To construct the tth replicate bootstrap sample under stratified two-stage
sampling, we follow the steps below:
• Step 1: Take a simple random sample with replacement of nh units from the
original data in stratum h, repeating independently across strata;
• Step 2: Modify the design weights as in the rescaling bootstrap from Rao, Wu,
and Yue (1992) by applying the following formula (before the raking procedure):

ω^{(t)}_{hjk} = [1 − m_h^{1/2}(n_h − 1)^{−1/2} + m_h^{1/2}(n_h − 1)^{−1/2}(n_h/m_h) m^{(t)}_{hj}] ω_{hjk},    (4)

where m^{(t)}_{hj} is the bootstrap frequency of unit hj, that is, the number of times
PSU hj was used in forming the tth bootstrap replicate;
• Step 3: Implement the raking procedure to obtain the replicate weight w^{(t)}_{hjk}. Notice
that the weight adjustment takes place after the internal scaling of Step 2;
• Step 4: Estimate the parameter of interest, T̂^{(t)}_{Y,Rak}. Repeat T times and estimate the
variance using

V̂_B(T̂_{Y,Rak}) = (1/T) Σ_{t=1}^{T} (T̂^{(t)}_{Y,Rak} − T̂_{Y,Rak})².    (5)
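Steps 1, 2 and 4 can be sketched as follows for the design-weight part; Step 3's raking adjustment would be applied to each replicate separately and is omitted here. The stratum layout and T are illustrative. With the common choice m_h = n_h − 1, the bracket in (4) simplifies to (n_h/m_h) m^{(t)}_{hj}:

```python
import random

def rwy_bootstrap_weights(strata, T=500, seed=1):
    """Rao-Wu-Yue rescaling bootstrap replicate design weights.

    strata: dict mapping stratum h -> {psu j: design weight of PSU j}.
    For each replicate, m_h = n_h - 1 PSUs are resampled with replacement
    from the n_h PSUs of stratum h (n_h >= 2), and each PSU weight is
    rescaled as
      w = [1 - sqrt(m_h/(n_h-1)) + sqrt(m_h/(n_h-1)) * (n_h/m_h) * m_hj] * w0,
    where m_hj counts how often PSU hj was drawn.
    Returns a list of T dicts {(h, j): replicate weight}.
    """
    rng = random.Random(seed)
    reps = []
    for _ in range(T):
        rep = {}
        for h, psus in strata.items():
            ids = list(psus)
            nh = len(ids)
            mh = nh - 1
            draws = [rng.choice(ids) for _ in range(mh)]   # resample PSUs
            a = (mh / (nh - 1)) ** 0.5                     # = 1 when m_h = n_h - 1
            for j in ids:
                freq = draws.count(j)                      # bootstrap frequency m_hj
                rep[(h, j)] = (1 - a + a * (nh / mh) * freq) * psus[j]
        reps.append(rep)
    return reps

# One stratum with three PSUs, each with design weight 2.0.
strata = {"h1": {"p1": 2.0, "p2": 2.0, "p3": 2.0}}
reps = rwy_bootstrap_weights(strata, T=1000)
```

A useful check on the rescaling: within every replicate the weights still sum to the stratum's total design weight, and across replicates each PSU's weight averages back to its design weight.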
We draw units from the 2013 MOP such that the distribution of the strata size
vector n follows the multinomial (n; π_1N_1/Σ_{h=1}^{L} π_hN_h, …, π_LN_L/Σ_{h=1}^{L} π_hN_h); then
we simulate the response y within each stratum using the Poisson distribution,
given the “true” value. A multiplicative nonresponse model is taken into account
in this study. The assignment of nonresponse probabilities φ_{hj} to each selected
cluster of the population takes into account characteristics of the clusters, such as
region, age and gender. Once the sample is drawn, in order to obtain the subset
of respondents, Bernoulli distributions with parameter φhj , for h = 1, . . . , H and
j = 1, . . . , Jh , are used to generate an indicator variable Ihj that takes value 1 if
cluster (hj ) responds and value 0 if cluster (hj ) does not respond. The multiplicative
nonresponse model is
Pr(nonresponse) = φ_{hj} = 0.24^{I_{Ontario}} · 1.5^{I_{under 35}} · 1.4^{I_{female}}.
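Drawing the cluster-level response indicators from the model above can be sketched as follows. The cluster ids and probabilities are toy values, and phi is used here as the response probability, i.e. one minus the nonresponse probability:

```python
import random

def simulate_response(clusters, phi, seed=0):
    """Draw cluster-level response indicators I_hj ~ Bernoulli(phi_hj),
    where phi_hj is the response probability of cluster (h, j)."""
    rng = random.Random(seed)
    return {hj: 1 if rng.random() < phi[hj] else 0 for hj in clusters}

# Toy clusters; phi would be 1 - Pr(nonresponse) from the multiplicative model.
clusters = [("h1", 1), ("h1", 2), ("h2", 1)]
phi = {("h1", 1): 1.0, ("h1", 2): 0.0, ("h2", 1): 0.5}
I = simulate_response(clusters, phi)
```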
For each simulation, we compute the weighted estimates, which we call T̂Y Rak:r .
In addition, we compute five different variance estimates:
• SRS: This variance estimation treats the sample as obtained by simple random
sampling, rather than the stratified two-stage sampling. Although not based
on a realistic model, this is a commonly used approximation because it is so
simple to compute. With the weights wi treated as constants, the SRS estimated
variance is
V̂_{SRS}(T̂_{Y,Rak}) = Σ_{i=1}^{n} w_i² var(y_i)
              = Σ_{i=1}^{n} w_i² var(y_1)
              = (Σ_{i=1}^{n} w_i²) Σ_{i=1}^{n} w_i (y_i − T̂_{Y,Rak})².    (6)
Notice that (6) ignores both the sampling design and weighting procedure.
• No raking: This estimate is based on the linearization variance estimate of
(1). Notice that it is equivalent to treating raked weights as inverse-probability
sampling weights as in Lu and Gelman (2003). Thus, it does not take into
account the effect of the raking procedure on the variance.
• Full response: This estimate is based on (2); however, such computation
assumes λ = 0. As discussed in Section 3, this assumption will often be false
in the case of nonresponse.
The true variance is computed as

V_{TRUE} = (1/R) Σ_{r=1}^{R} (T̂_{Y,Rak:r} − Ê[T̂_{Y,Rak}])²,
where T̂Y Rak:r is the value of T̂Y Rak for sample r, and the expectation of the point
estimator T̂Y Rak is estimated by
Ê[T̂_{Y,Rak}] = (1/R) Σ_{r=1}^{R} T̂_{Y,Rak:r}.
We compare the true variance to the average variance estimates from the above
five different methods. The expectation of a specific variance estimator V̂ T̂Y Rak
for T̂Y Rak is estimated by
Ê[V̂(T̂_{Y,Rak})] = (1/R) Σ_{r=1}^{R} V̂_r(T̂_{Y,Rak:r}),

where V̂_r(T̂_{Y,Rak:r}) is the value of V̂(T̂_{Y,Rak}) for sample r.
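The Monte Carlo quantities above amount to a population-style variance over the R simulated point estimates, e.g.:

```python
def mc_true_variance(estimates):
    """Monte Carlo 'true' variance: V_TRUE = (1/R) * sum_r (T_r - E_hat)^2,
    where E_hat is the average of the R simulated point estimates."""
    R = len(estimates)
    e_hat = sum(estimates) / R
    return sum((t - e_hat) ** 2 for t in estimates) / R

# Three hypothetical replications with estimates 9, 10 and 11.
v = mc_true_variance([9.0, 10.0, 11.0])
```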
We calculate the variances of estimators for population totals, population means
and subpopulation means of various demographic strata.8 The tables show results
for the variables cash on hand and tng_credit_year. The first is the amount of
cash the respondent has in his or her wallet, purse or pockets when completing
the survey. The second, tng_credit_year, is a binary variable indicating whether
the respondent has used the contactless feature of a credit card in the past year.
Taking the weighted total of this variable, we find an estimate of the total number
of people in the population that used the feature; taking the weighted mean, we
obtain an estimate of the proportion of the population that has used it.
Tables 1–3 show variance computations using different variance estimation
methods. The approximation based on the simple random sampling (SRS)
Note: All columns have been divided by the value of VTRUE . Total number of contactless credit adopters
is estimated by the weighted total of the binary variable tng credit year. Notice that SRS is defined
in (6), which ignores both the sampling design and weighting procedure and assumes the sample as
simple random sampling. No raking refers to (1), which does not take into account the effect of the
raking procedure on the variance. Full response and nonresponse are linearization variance estimates
based on (2) and (3), respectively. Bootstrap is the resampling method based on the algorithm outlined
in Section 3.2.
Note: All columns have been divided by the value of VTRUE. The proportion of contactless credit
adopters is estimated by the weighted mean of the binary variable tng credit year. Remaining
definitions are as in the previous note.
Table 3. Simulation: Mean of Cash on Hand and tng Credit Year by Regions.

                      ————————— Linearization —————————    Resampling
              SRS    No Raking   Full Response  Nonresponse  Bootstrap
Cash on hand (mean)
  Atlantic    0.55     1.73          1.13          1.09        1.09
  Quebec      0.67     1.43          1.21          1.03        1.03
  Ontario     0.82     1.62          1.10          1.05        1.06
  Prairies    0.84     1.78          1.06          1.06        1.06
  BC          0.91     1.65          1.31          1.01        1.01
tng credit year (proportion)
  Atlantic    0.63     1.47          1.21          1.10        1.09
  Quebec      0.70     1.36          1.15          1.12        1.14
  Ontario     0.91     1.27          1.16          1.09        1.07
  Prairies    0.82     1.26          1.23          1.02        1.04
  BC          0.75     1.30          1.17          1.06        1.05
Note: All columns have been divided by the value of VTRUE. The proportion of contactless credit
adopters is estimated by the weighted mean of the binary variable tng credit year. Remaining
definitions are as in the previous note.
as well as the resampling bootstrap estimates are always smaller.9 This is because the
raking ratio estimator makes use of the correlation between the variables used for
weighting and the outcome variable of interest, and thus produces an efficiency gain
over an estimator that does not exploit this correlation (Graham, 2011; Chaudhuri
& Renault, 2017). Following Chaudhuri, Handcock, and Rendall (2008), the
raking used in this paper is a two-step procedure, in which Step 1 generates w_i
using auxiliary information that does not involve the parameters of interest, and
Step 2 then re-weights the model parameters from the first step.
5. SUMMARY
If variances for weighted estimates are computed without considering the raking
procedure, the resulting confidence intervals will tend to be conservative. We
therefore produce bootstrap replicate raking weights in Stata and use these to
estimate the variances of weighted estimates from the 2013 MOP survey SQ.
ACKNOWLEDGMENTS
We are grateful to the AiE Editors, Gautam Tripathi, Kim P. Huynh, David T.
Jacho-Chavez and two anonymous referees for their insightful comments which
have led to the current much improved paper. We thank Geoffrey Dunbar, Shelley
Edwards, Ben Fung, Kim P. Huynh, May Liu, Sasha Rozhnov and Kyle Vincent
for their useful comments and encouragement. Maren Hansen provided excellent
writing assistance. We also thank Statistics Canada for providing access to the
2011 National Household Survey and the 2012 Canadian Internet Usage Survey.
The views of this paper are those of the authors and do not represent the views of
the Bank of Canada.
NOTES
1. For a very complex survey design, exact accounting for all its features is extremely
cumbersome. Hence, approximations are often made to yield a usable estimation formula.
2. Besides population means, Hellerstein and Imbens (1999) also consider population
variances and covariances.
3. In practice, it is unlikely that matching on a few population moments will lead to
an artificial population with exactly the same distribution as the target population without
nonresponse. However, as more and more moments are matched, the artificial distribution
will get close to the target distribution. In particular, it may be possible to obtain enough of
a resemblance between the artificial distribution and the target distribution with only a few
matched moments. However, using too many restrictions may compromise the large sample
results that are used for inference, while too few may leave the estimated distribution too
far from the target distribution.
4. When ωi = 1/n, the criterion functions from the classical and maximum likelihood
raking and GREG estimators belong to the Cressie–Read family of statistical discrepancies.
We thank referees for pointing this out.
5. Since the post-stratification method adjusts every cell of a multi-way table, it can
result in cells with zero or very small counts. In contrast, the raking method adjusts only
the marginals, or the low-level interactions.
6. Statistics Canada uses bootstrap procedures extensively. For example, the bootstrap
replicate weights method is used in the CIUS to estimate the coefficients of variation.
REFERENCES
Angrisani, M., Foster, K., & Hitczenko, M. (2015). The 2013 survey of consumer payment choice:
Technical appendix. Research Data Report 15-5. Federal Reserve Bank of Boston.
Arango, C., & Welte, A. (2012). The Bank of Canada’s 2009 Methods-of-Payment survey: Methodology
and key results (No. 2012-6). Bank of Canada discussion paper.
Chaudhuri, S., & Renault, E. (2017). Score tests in GMM: Why use implied probabilities? Working
paper.
Chaudhuri, S., Handcock, M. S., & Rendall, M. S. (2008). Generalized linear models incorporating
population level information: An empirical-likelihood-based approach. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 70(2), 311–328.
Chen, J., & Qin, J. (1993). Empirical likelihood estimation for finite populations and the effective
usage of auxiliary information. Biometrika, 80(1), 107–116.
Chen, J., & Sitter, R. R. (1999). A pseudo empirical likelihood approach to the effective use of auxiliary
information in complex surveys. Statistica Sinica, 9, 385–406.
Deville, J. C., & Sarndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the
American Statistical Association, 87(418), 376–382.
Deville, J. C., Sarndal, C. E., & Sautory, O. (1993). Generalized raking procedures in survey sampling.
Journal of the American Statistical Association, 88(423), 1013–1020.
Ghosh, M., Parr, W. C., Singh, K., & Babu, G. J. (1984). A note on bootstrapping the sample median.
The Annals of Statistics, 12(3), 1130–1135.
Graham, B. S. (2011). Efficiency bounds for missing data models with semiparametric restrictions.
Econometrica, 79(2), 437–452.
Hellerstein, J. K., & Imbens, G. W. (1999). Imposing moment restrictions from auxiliary data by
weighting. Review of Economics and Statistics, 81(1), 1–14.
Henry, C. S., Huynh, K. P., & Shen, Q. R. (2015). 2013 Methods-of-Payment survey results (No.
2015-4). Bank of Canada discussion paper.
Ireland, C. T., & Kullback, S. (1968). Contingency tables with given marginals. Biometrika, 55(1),
179–188.
Kalton, G., & Flores-Cervantes, I. (2003). Weighting methods. Journal of Official Statistics, 19(2), 81.
Kolenikov, S. (2010). Resampling variance estimation for complex survey data. The Stata Journal,
10(2), 165–199.
Kolenikov, S. (2014). Calibrating survey data using iterative proportional fitting (raking). The Stata
Journal, 14(1), 22–59.
Lu, H., & Gelman, A. (2003). A method for estimating design-based sampling variances for surveys
with weighting, poststratification, and raking. Journal of Official Statistics, 19(2), 133.
McCarthy, P. J., & Snowden, C. B. (1985). The bootstrap and finite population sampling. Vital and
Health Statistics. Series 2, Data Evaluation and Methods Research, 95, 1–23.
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika,
75(2), 237–249.
Owen, A. (1990). Empirical likelihood ratio confidence regions. The Annals of Statistics, 18, 90–120.
Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of
Statistics, 22, 300–325.
Rao, J. N. K., Wu, C. F. J., & Yue, K. (1992). Some recent work on resampling methods for complex
surveys. Survey Methodology, 18(2), 209–217.
Rust, K. F., & Rao, J. N. K. (1996). Variance estimation for complex surveys using replication
techniques. Statistical Methods in Medical Research, 5(3), 283–310.
Vincent, K. (2015). 2013 Methods-of-Payment survey: Sample calibration analysis. Bank of Canada.
PART III
ESTIMATION AND INFERENCE
MODEL-SELECTION TESTS FOR
COMPLEX SURVEY SAMPLES
ABSTRACT
We extend Vuong’s (1989) model-selection statistic to allow for complex sur-
vey samples. As a further extension, we use an M-estimation setting so that
the tests apply to general estimation problems – such as linear and non-
linear least squares, Poisson regression and fractional response models, to
name just a few – and not only to maximum likelihood settings. With strati-
fied sampling, we show how the difference in objective functions should be
weighted in order to obtain a suitable test statistic. Interestingly, the weights
are needed in computing the model-selection statistic even in cases where
stratification is appropriately exogenous, in which case the usual unweighted
estimators for the parameters are consistent. With cluster samples and panel
data, we show how to combine the weighted objective function with a cluster-
robust variance estimator in order to expand the scope of the model-selection
tests. A small simulation study shows that the weighted test is promising.
Keywords: Survey sampling; weighted estimation; cluster sampling;
nonnested models; model-selection test; M-estimation
110 IRAJ RAHMANI AND JEFFREY M. WOOLDRIDGE
1. INTRODUCTION
models, but the nature of the adjustment to the test statistics is different from what
we address here.
An important feature of our approach, which also extends Vuong’s (1989) orig-
inal framework, is that we study the model-selection problem in the context of
general M-estimation. This generality allows us to explicitly cover situations where
only a specific feature of a distribution is correctly specified, such as the condi-
tional mean. For example, because of its robustness for estimating the parameters
of a conditional mean – see Gourieroux, Monfort, and Trognon (1984) – Poisson
regression is commonly used in situations where little if any interest lies in the rest
of the distribution. However, we may have two competing models of the condi-
tional mean. While one could use nonlinear least squares estimation, for efficiency
reasons, the Poisson quasi-MLE is often preferred. We can apply Vuong’s (1989)
approach in this situation to obtain a statistic that does not take a stand on the
distribution; it is purely a test of the conditional mean function. By contrast, if we
use the same mean function – say, an exponential function with the same covari-
ates – and we use two different quasi-MLEs, such as the Poisson and Geometric,
then the Vuong approach is a test of which distribution fits better. Understanding
the distinction between a conditional mean test and a full test of the conditional
distribution is something easily described in our setting.
The remainder of the paper is organized as follows. In Section 5.2, we define the
estimation problems that effectively define the two nonnested competing models.
Section 5.3 shows how to modify the Vuong (1989) statistic to accommodate
standard stratified (SS) sampling in the context of general M-estimation. We start
with SS and variable probability (VP) sampling because they are widely used in
practice, and it is then clear what role weighting plays in more complex sampling
designs. Section 5.4 shows how to modify the estimated variance to account for
clustering in a multistage design. In Section 5.5, we extend the model-selection
test to panel data models with standard stratification design. In Section 5.6, we
discuss how weighting is desirable even under what is typically called “exogenous”
stratification. Section 5.7 provides several examples, and Section 5.8 contains a
small simulation study. Section 5.9 contains a brief conclusion.
has a unique solution – which is required for identification. In standard settings, the
solution is often denoted θ o , and θ o is assumed to index the quantity of interest, such
as the parameters in a conditional mean function E (Y |X). In a conditional MLE
setting, where q(W, θ) = −log[f(Y|X; θ)] is the negative of the log likelihood, θ_o
is the vector of parameters indexing the conditional density of Y given X. There are
many other applications where W is partitioned as W = (X, Y), where X and Y are,
respectively, K and L dimensional vectors with L + K = M. We are particularly
interested in the case where q (·) is the negative of a quasi-log-likelihood function
in the linear exponential family.
Rather than estimate the parameters using a single objective function – underlying
which is a parametric model of some feature of a distribution D(W) or, more
likely, a conditional distribution – we suppose we have two estimation methods,
represented by the objective functions q1(W, θ1) and q2(W, θ2), where the param-
eter vectors may have different dimensions. We need to make precise the sense in
which these competing models and estimation methods are nonnested. Let θ ∗1 and
θ∗2 be the unique solutions to the population problems

$$\min_{\theta_g \in \Theta_g} E\left[q_g(W, \theta_g)\right], \qquad g = 1, 2.$$

These solutions are often called the "pseudo-true values" or "quasi-true values."
The null hypothesis is that the models evaluated at the pseudo-true values fit equally
well on average, where fit is measured by the mean of the objective functions in
the population. Precisely,
$$H_0: E\left[q_1(W, \theta_1^*)\right] = E\left[q_2(W, \theta_2^*)\right]. \tag{3}$$
In the maximum likelihood setting, (3) states that the KLIC distances of the two
models to the true model are the same. In a regression context using nonlinear
least squares, the null is that the population sums of squared residuals are the same;
equivalently, the two models provide functions that have the same mean squared
error relative to the true conditional mean.
Condition (3) can hold for nested as well as nonnested models, but the nature of
our approach requires us to focus on the latter. The reason is that we supplement
Model-Selection Tests for Complex Survey Samples 113
(3) with the assumption that the objective functions, evaluated at the pseudo-true
values, differ with positive probability:
$$P\left[q_1(W, \theta_1^*) \neq q_2(W, \theta_2^*)\right] > 0. \tag{4}$$
The requirement in (4) means that the two functions q1 (W, θ ∗1 ) and q2 (W, θ ∗2 ) must
differ for a nontrivial set of outcomes on the support of W. If (4) does not hold
then the variance of q1 (W, θ ∗1 ) − q2 (W, θ ∗2 ) is 0, and that will invalidate the Vuong
(1989) approach taken in the paper.
The combination of (3) and (4) effectively rules out nested models, where one
model is obtained as a special case of the other and the same objective func-
tion – such as the negative of the log-likelihood function – is used. Then, the
only way that (3) can be true is when the more general model collapses to the
restricted version; otherwise E[q1(W, θ∗1)] > E[q2(W, θ∗2)]. In other words, for
nested models, we cannot have (3) and (4) both be true. The two conditions (3)
and (4) rule out other forms of degeneracies. For example, assume we have a ran-
dom variable Y and would like to model E(Y |X) as a function of the explanatory
variables X, a 1 × K vector. We specify two competing models and we estimate
both by nonlinear least squares. Specifically, q1(W, θ1) = (Y − α1 − Xβ1)² and
q2(W, θ2) = [Y − exp(α2 + Xβ2)]². If E(Y |X) actually depends on X, then (4)
generally holds. However, if Y is mean independent of X, so that E(Y |X) = E(Y ),
then the two models are simply different parameterizations of a constant condi-
tional mean, and (4) fails. As we will see, this failure causes the standard normal
limiting distribution for the Vuong-type statistic to break down. Incidentally, in this
example, provided the models satisfy (4), the Vuong test with random sampling
would reduce to comparing the R-squareds from the two least squares regressions.
We require no additional assumptions – for example, neither homoskedasticity nor
normality – for the test to be valid.
The nature of the alternative is inherently one sided, as we wish to determine
whether one model can be rejected in favor of the other. Because we have defined
the optimization problem to be a minimization problem, the alternative that model
one – technically, model one combined with whatever objective function we choose
– fits better in the population is
$$H_A^{q_1}: E\left[q_1(W, \theta_1^*)\right] < E\left[q_2(W, \theta_2^*)\right],$$

and the alternative that model two fits better in the population is

$$H_A^{q_2}: E\left[q_1(W, \theta_1^*)\right] > E\left[q_2(W, \theta_2^*)\right].$$
114 IRAJ RAHMANI AND JEFFREY M. WOOLDRIDGE
In Vuong’s setup, these functions are the negative of a log-likelihood function, but
we can apply these alternatives much more generally. Naturally, if either HAq1 or
HAq2 holds then the nondegeneracy condition (4) must hold.
Under random sampling, the M-estimator θ̂g solves the sample problem

$$\min_{\theta_g \in \Theta_g} N^{-1} \sum_{i=1}^{N} q_g(W_i, \theta_g).$$
In this section, we are interested in cases where the population has been divided
into J strata and then the resulting sample is not necessarily representative of the
population.
Under SS sampling, the weighted M-estimator θ̂g minimizes

$$\frac{1}{N} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g), \tag{6}$$

which can also be written as

$$\frac{1}{N} \sum_{i=1}^{N} \frac{Q_{j_i}}{H_{j_i}} q_g(W_i, \theta_g), \tag{7}$$
where Qji /Hji is the sampling weight for observation i. Thus, the estimator rep-
resented in (7) is obtained by weighting the objective function by the sampling
weight for each observation i. As discussed in Wooldridge (2001), the represen-
tation in (5) is the form used to obtain asymptotic properties of the weighted
M-estimator.
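To make the role of the weights concrete, here is a minimal numpy sketch of the weighted estimator in (7), using a Poisson quasi-log likelihood with an exponential mean as the objective. The population, the outcome-based strata, and the parameter values are all invented for illustration; only the weighting logic mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population; stratification is on the outcome (Y = 0 versus Y > 0).
pop_x = rng.normal(size=100_000)
pop_y = rng.poisson(np.exp(0.5 + 0.3 * pop_x))
pop_j = (pop_y == 0).astype(int)
Q = np.bincount(pop_j) / pop_j.size        # population stratum shares Q_j

# Standard stratified sample: 400 units from each stratum, so H_1 = H_2 = 0.5.
idx = np.concatenate([rng.choice(np.flatnonzero(pop_j == j), 400, replace=False)
                      for j in (0, 1)])
y, x, j = pop_y[idx], pop_x[idx], pop_j[idx]
H = np.array([0.5, 0.5])
w = (Q / H)[j]                             # sampling weights Q_{j_i} / H_{j_i}

# Weighted Poisson quasi-MLE with exponential mean, solved by Newton's method;
# this minimizes the weighted objective (7): sum_i (Q_{j_i}/H_{j_i}) q(W_i, theta).
X = np.column_stack([np.ones(y.size), x])
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)
    beta += np.linalg.solve(X.T @ (w[:, None] * mu[:, None] * X),
                            X.T @ (w * (y - mu)))
```

Because the weights undo the outcome-based stratification, beta should be close to the population values (0.5, 0.3) even though zeros are heavily oversampled.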
We first show that when the models are nonnested in the sense of (4), a properly
standardized version of the objective function has a limiting distribution that
does not depend on the limiting distribution of √N(θ̂g − θ∗g), provided θ̂g is
√N-consistent, which is standard. Wooldridge (2001, Theorem 3.2) contains sufficient
conditions. The proof relies on fairly standard asymptotics and so we show only
its main features.
Theorem 3.1. Assume that for g ∈ {1, 2},
1. {Wij : i = 1, . . . , Nj ; j = 1, . . . , J } satisfies the SS sampling scheme with
Nj /N → aj > 0, j = 1, . . . , J .
2. Θg is compact.
3. The objective function E[qg(W, θg)] has a unique minimum on Θg at θ∗g.
4. θ∗g ∈ int Θg.
5. For each w ∈ W, qg(w, ·) is continuous on Θg and twice continuously
differentiable on int Θg.
6. For all θg ∈ Θg, |qg(w, θg)| ≤ b(w), |∂qg(w, θg)/∂θgk| ≤ b(w), and
|∂²qg(w, θg)/∂θgk∂θgm| ≤ b(w) for a function b(w) with E[b(W)] < ∞.
7. E[∇θqg(W, θ∗g)′∇θqg(W, θ∗g)] < ∞ and E[∇θqg(W, θ∗g)] = 0.
Then

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) + o_p(1).$$

Proof. For each stratum j, a mean value expansion about θ∗g gives

$$\frac{1}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \frac{1}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) + \left[\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \ddot{\theta}_g^{\,j})\right] (\hat{\theta}_g - \theta_g^{*}),$$
where θ̈gʲ is a mean value between θ̂g and θ∗g. Because θ̂g →p θ∗g, θ̈gʲ →p θ∗g. Now,
by a corollary of the uniform law of large numbers – for example, Wooldridge
(2010, Lemma 12.1) –

$$\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \ddot{\theta}_g^{\,j}) = \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*}) + o_p(1).$$
As shown in Wooldridge (2001, Theorem 3.2), the assumptions ensure that
√N(θ̂g − θ∗g) = Op(1), and so

$$\left[\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \ddot{\theta}_g^{\,j})\right] \sqrt{N}(\hat{\theta}_g - \theta_g^{*}) = \left[\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right] \sqrt{N}(\hat{\theta}_g - \theta_g^{*}) + o_p(1).$$
Therefore, if we multiply by √N Qj and sum across j, we get

$$\sqrt{N} \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \sqrt{N} \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) \tag{8}$$
$$\qquad + \left[\sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right] \sqrt{N}(\hat{\theta}_g - \theta_g^{*}).$$

Moreover, by the law of large numbers applied within each stratum,

$$\operatorname*{plim}_{N \to \infty} \sum_{j=1}^{J} Q_j \left(\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right) = \sum_{j=1}^{J} Q_j \left(\operatorname*{plim}_{N_j \to \infty} \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right)$$
$$= \sum_{j=1}^{J} Q_j E\left[\nabla_{\theta} q_g(W_i, \theta_g^{*}) \,\middle|\, W_i \in \mathcal{W}_j\right] = E\left[\nabla_{\theta} q_g(W_i, \theta_g^{*})\right] = 0, \tag{9}$$
where the last equality holds from the population first order condition for θ ∗g .
Therefore,

$$\sum_{j=1}^{J} Q_j \left(\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right) = o_p(1),$$

and since √N(θ̂g − θ∗g) = Op(1), the second term in (8) is op(1). We have shown

$$\sqrt{N} \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \sqrt{N} \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) + o_p(1),$$

or, equivalently (since Hj = Nj/N),

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) + o_p(1).$$
We can use Theorem 3.1 to construct a simple test statistic that allows us
to discriminate between two competing models. Let r(w, θ1, θ2) ≡ q1(w, θ1) −
q2(w, θ2) be the difference in the two objective functions evaluated at w ∈ W and
generic values of the parameters. The null hypothesis is

$$H_0: E\left[r(W, \theta_1^*, \theta_2^*)\right] = 0.$$
By the assumption that the models are nonnested, V[r(W, θ∗1, θ∗2)] > 0. Applied
to both estimation problems, and under the null hypothesis, Theorem 3.1 implies

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r(W_{ij}, \theta_1^*, \theta_2^*) + o_p(1). \tag{10}$$
Eq. (10) is the key result in the paper. It extends Vuong (1989) by allowing for strat-
ified sampling and for general objective functions – not just the log likelihood.
Because the right hand side of (10) does not depend on the limiting distributions
of √N(θ̂g − θ∗g), its distribution is easy to study by again applying the results in
Wooldridge (2001) on SS sampling.
Then

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2) \stackrel{d}{\longrightarrow} \mathrm{Normal}(0, \eta^2),$$

where

$$\eta^2 = \sum_{j=1}^{J} \frac{Q_j^2}{H_j} V\left[r(W, \theta_1^*, \theta_2^*) \,\middle|\, W \in \mathcal{W}_j\right].$$
Proof. From Theorem 3.1 and the asymptotic equivalence lemma, we must
argue that

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r(W_{ij}, \theta_1^*, \theta_2^*) \stackrel{d}{\longrightarrow} \mathrm{Normal}(0, \eta^2).$$

But this holds from Wooldridge (2001, Theorem 3.2). Namely, we apply
the asymptotic variance formula for a weighted objective function under SS
sampling to the sequence {Rij : i = 1, . . . , Nj ; j = 1, . . . , J}, where
Rij ≡ r(Wij , θ∗1, θ∗2).
Define the within-stratum sample means

$$\bar{R}_j = N_j^{-1} \sum_{i=1}^{N_j} r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2). \tag{11}$$

A consistent estimator of η² is then

$$\hat{\eta}^2 = \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} \left[r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2) - \bar{R}_j\right]^2. \tag{12}$$
Then we treat the R̂i as data obtained from an SS sampling scheme, so we simply
need to specify the stratum for each observation, ji , and the weight, Qji /Hji . For
example, in Stata, one applies the “svyset” command – to specify the stratum identifier
and weights – and then runs the regression of R̂i on a constant. The usual t statistic on
the constant is the model-selection test statistic. The sign of the constant indicates
which model fits better.
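Where survey software is unavailable, the statistic can also be computed directly from (11) and (12). The following numpy sketch is a hypothetical implementation of that recipe; since the problems are minimizations and r = q1 − q2, a large negative value favors model one and a large positive value favors model two.

```python
import numpy as np

def t_ms(R_hat, strata, Q, H):
    """Weighted Vuong-type statistic under SS sampling.

    R_hat:  per-unit differences in objective functions, r(W_i, theta1_hat, theta2_hat)
    strata: integer stratum label j_i for each unit
    Q, H:   population and sample stratum shares (assumed known)
    """
    N = R_hat.size
    w = Q[strata] / H[strata]
    num = (w * R_hat).sum() / np.sqrt(N)      # numerator from (10)
    eta2 = 0.0
    for j in np.unique(strata):               # variance estimator (12)
        Rj = R_hat[strata == j]
        eta2 += (Q[j] ** 2 / H[j] ** 2) * ((Rj - Rj.mean()) ** 2).sum()
    eta2 /= N
    return num / np.sqrt(eta2)
```

With Q = H (a representative sample), the statistic collapses to an ordinary t statistic on the mean of R̂i, which matches the regression-on-a-constant recipe.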
In using tMS as a model-selection test, it is important to understand its prop-
erties compared to an approach where weighting is not used but SS sampling
has been employed. Because of the nature of the null and alternative, it is not
true that the weighted version of the test will always reject the null more often
than the unweighted version of the test. To see this, consider what happens when,
say, model one is correctly specified with parameters θ o1 . Then, generally, we
need to use the weights to consistently estimate θo1; the unweighted estimator
converges to some other quantity, say θ+1. For model two, we can write the proba-
bility limits as θ∗2 and θ+2 for the weighted and unweighted problems, respectively.
Now, there is no guarantee that q1(Wi, θo1) is further from E[q2(Wi, θ∗2)], on
average, than q1(Wi, θ+1) is from E[q2(Wi, θ+2)], and so the test based on the
unweighted objective function may reject more often in favor of model one than
the weighted version of the test. This turns out not to be a good thing, for two
reasons. First, even though the unweighted estimator might point to the correct
model/estimation method, the estimator of θo1 is generally inconsistent. In other
words, the unweighted approach may choose the correct model but with parameter
estimators that are essentially useless! In fact, for computing quantities of interest,
such as average partial effects, there is no telling whether it would be better to use
model one with inconsistent parameter estimators or model two, with estimators
converging to θ+2.
A second important shortcoming of the unweighted test is that it may systemat-
ically opt for model two when model one is correctly specified. And the problem
would generally be worse as the sample size grows. This cannot happen with the
weighted version of the test, provided we have chosen our model and objective
function in a way that generates consistent estimators under correct specification
of the feature of interest. The reason is that, if model one is correctly specified and
the objective function is chosen appropriately, E[q1(W, θo1)] < E[q2(W, θ∗2)]. In
other words, the weighted test is a consistent test for choosing the true model
when one of the models is correctly specified. With the unweighted test, it could
easily be that E[q2(W, θ+2)] < E[q1(W, θ+1)], in which case the unweighted test
will systematically select the wrong model. And this will happen with probability
approaching one as the sample size grows. We will see this phenomenon in the
simulations in Section 8.
An analogy that does not require thinking about weighting versus not weighting
might be helpful. Assume random sampling, as in the original Vuong (1989)
setup. Now suppose we specify two nonnested conditional mean models,
m1 (x1 , θ 1 ) and m2 (x1 , θ 2 ), and model one is correctly specified. If we use an
objective function that identifies conditional means – say, the squared residual
function – then the Vuong test will detect that model one is correct with probability
approaching one. Suppose we use another objective function, such as the least
absolute deviations (LAD). In general, neither model one nor model two is correctly
specified for the conditional median. Consequently, using LAD in the Vuong
statistic has essentially unknown properties. It could incorrectly choose in favor
of model two because model two is closest to the conditional median. But it could
also frequently reject model two in favor of model one; in fact, nothing says
the rejection frequency could not be higher than when using the squared residual
objective function. In other words, using the wrong objective function may actually
lead to a more powerful test. The problem is that when this occurs, it is essentially
a fluke. Moreover, the LAD estimators are not generally consistent for conditional
mean parameters, so it is difficult to see how choosing model one helps: we
have the correct model but inconsistent estimators of its parameters.
When the competing models contain different numbers of parameters, the finite
sample performance of tMS may suffer. As in Vuong (1989), we can penalize the
objective functions for the number of parameters. Since we are minimizing the
objective function, we add a penalty that is a function of the number of parameters.
The resulting statistic is
$$\tilde{t}_{MS} = \frac{N^{-1/2} \sum_{i=1}^{N} (Q_{j_i}/H_{j_i})\, r(W_i, \hat{\theta}_1, \hat{\theta}_2) + N^{-1/2}\left[K(P_1) - K(P_2)\right]}{\hat{\eta}}, \tag{14}$$
where P1 and P2 are the number of parameters in the different models and K (·)
is the penalty function. For example, K(P ) = P gives the Akaike (1973) crite-
rion and K(P ) = (P /2) log (N) gives the Schwarz (1978) criterion. In both cases,
N −1/2 K (P ) → 0 for fixed P , and so the penalty does not affect the asymptotic
distribution of the test statistic: t˜MS and tMS have the same asymptotic distribu-
tions under H0 . The statistic t˜MS has the feature of penalizing models that are
not parsimonious in the number of parameters. One could instead simply add
N −1/2 [K(P1 ) − K(P2 )] to tMS , which means the penalty would not be divided by
η̂. Again, the resulting statistic is asymptotically equivalent. In what follows, we
drop the penalty function for notational convenience.
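As a sketch, the penalized statistic in (14) can be computed as follows. The per-unit differences, weights, and η̂ are assumed to have been obtained already; the two penalty choices are the Akaike and Schwarz functions named in the text.

```python
import numpy as np

def t_ms_penalized(R_hat, weights, eta_hat, P1, P2, penalty="bic"):
    """Penalized model-selection statistic (14).

    Since the objective functions are minimized, penalizing model g means
    adding K(P_g) for its own parameter count; 'aic' uses K(P) = P and
    'bic' uses K(P) = (P/2) * log(N).
    """
    N = R_hat.size
    K = (lambda P: P) if penalty == "aic" else (lambda P: 0.5 * P * np.log(N))
    num = (weights * R_hat).sum() / np.sqrt(N) + (K(P1) - K(P2)) / np.sqrt(N)
    return num / eta_hat
```

Because the penalty is scaled by N^{-1/2}, it vanishes asymptotically for fixed P1 and P2, so the penalized and unpenalized statistics share the same limiting distribution.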
When observations in the strata are difficult to identify prior to sampling or when
collecting information on the variable determining stratification is cheap relative
to the cost of collecting the remaining information, VP sampling is convenient.
In VP sampling, a unit is first drawn at random from the population. If the unit
falls into stratum j , it is kept with probability pj . For example, if we define
stratification in terms of individual income, we draw a person randomly from
the population, determine the person's income, and then keep that person with
a probability that depends on the income class, as set by the researcher. As discussed
in Wooldridge (1999), consistent estimation of the population parameters generally
requires weighting the objective function by the inverse of the probability of being
kept in the sample. With J strata, these probabilities are {pj : j = 1, . . . , J }. It is
straightforward to show the analog of Theorem 3.1 carries over, and so, under the
null hypothesis that the models are nonnested and fit the population equally well,
estimation of the parameters does not affect the limiting distribution. This leads to
the test statistic
$$\frac{N^{-1/2} \sum_{i=1}^{N} p_{j_i}^{-1}\, r(W_i, \hat{\theta}_1, \hat{\theta}_2)}{\left[N^{-1} \sum_{i=1}^{N} p_{j_i}^{-2}\, r(W_i, \hat{\theta}_1, \hat{\theta}_2)^2\right]^{1/2}}, \tag{15}$$
where again ji is the stratum for observation i. (Remember that under VP sampling,
we do not always observe a draw from the population; this statistic necessarily
depends only on the draws we keep.)
One way that the denominator of (15) differs from that of (13) is that the
within-stratum means are not removed in (15). Wooldridge (1999) shows that if
the known sampling probabilities, pj , are replaced with the observed frequencies,
then it is proper to remove the means, R̄j , in (15). Using the sample frequencies
means that we know how many times each stratum was drawn – call this Mj . Then
p̂j = Nj /Mj , where Nj is the kept number of draws in stratum j (and which we
always observe). We replace pj in (15) with p̂j and then replace r(Wi , θ̂ 1 , θ̂ 2 ) with
r(Wi , θ̂ 1 , θ̂ 2 ) − R̄j in the denominator for all i in stratum j . In many cases, the
number of times each stratum was drawn is not available, and so one must use
the pj rather than the p̂j . If one uses the p̂j directly in (15), then the statistic is
conservative in the sense that, asymptotically, its size will be less than the nominal
size (because the estimated standard deviation is systematically too large).
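The VP statistic in (15), together with the demeaned variant used when the sampling frequencies are estimated, can be sketched as follows; the inputs are hypothetical.

```python
import numpy as np

def t_vp(R_hat, strata, p, demean=False):
    """VP-sampling statistic (15).

    With known retention probabilities p_j, within-stratum means are NOT
    removed in the denominator; when estimated frequencies p_hat are used
    instead, set demean=True to remove them (per Wooldridge, 1999).
    """
    N = R_hat.size
    w = 1.0 / p[strata]                       # inverse-probability weights
    num = (w * R_hat).sum() / np.sqrt(N)
    if demean:
        means = np.array([R_hat[strata == j].mean() for j in range(p.size)])
        centered = R_hat - means[strata]
        denom2 = (w ** 2 * centered ** 2).sum() / N
    else:
        denom2 = (w ** 2 * R_hat ** 2).sum() / N
    return num / np.sqrt(denom2)
```

Removing the means can only shrink the denominator, which is why using the known p_j without demeaning yields a conservative statistic.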
$$v_{sc} = \frac{C_s}{N_s} \cdot \frac{M_{sc}}{K_{sc}},$$

where we require information on the number of clusters in the population and the
number of units per cluster. The weighted objective function is

$$\sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, q_g\!\left(W_{scm}, \theta_g\right).$$
As before, r(Wscm, θ1, θ2) ≡ q1(Wscm, θ1) − q2(Wscm, θ2) is the difference between
the two objective functions for each unit m, in cluster c, in stratum s. In other words,

$$\frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r\!\left(W_{scm}, \hat{\theta}_1, \hat{\theta}_2\right) = \frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r\!\left(W_{scm}, \theta_1^*, \theta_2^*\right) + o_p(1). \tag{16}$$
It follows that

$$\frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r\!\left(W_{scm}, \hat{\theta}_1, \hat{\theta}_2\right) \stackrel{d}{\longrightarrow} N(0, \xi^2).$$
A consistent estimator of ξ² is

$$\hat{\xi}^2 = \frac{1}{N} \Bigg\{ \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}^2\, r_{scm}\!\left(\hat{\theta}_1, \hat{\theta}_2\right)^2 + \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} \sum_{m' \neq m} v_{sc}^2\, r_{scm}\!\left(\hat{\theta}_1, \hat{\theta}_2\right) r_{scm'}\!\left(\hat{\theta}_1, \hat{\theta}_2\right)$$
$$\qquad - \sum_{s=1}^{S} \frac{1}{N_s} \Bigg[\sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r_{scm}\!\left(\hat{\theta}_1, \hat{\theta}_2\right)\Bigg]^2 \Bigg\}. \tag{17}$$
The first term in (17) would be a consistent estimator of the variance under simple
random sampling. The second term accounts for within-cluster correlation, and
the third term properly subtracts off the within-strata means. Typically, the second
term is positive, reflecting the positive correlation within cluster. The third term,
without the minus sign, is always nonnegative. Therefore, the second and third
terms tend to work in opposite directions. In any case, the resulting test statistic,
$$\frac{N^{-1/2} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r\!\left(W_{scm}, \hat{\theta}_1, \hat{\theta}_2\right)}{\hat{\xi}}, \tag{18}$$

has an asymptotic standard normal distribution under the null hypothesis.
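A direct, hypothetical implementation of the three-term estimator in (17); the stratum labels, cluster labels, and weights are assumed given, and the code follows the first/second/third-term decomposition discussed in the text.

```python
import numpy as np

def xi2_hat(r, stratum, cluster, v):
    """Three-term variance estimator (17) for stratified cluster samples.

    r:       unit-level differences in objective functions
    stratum: stratum label s for each unit
    cluster: cluster label c for each unit (unique within a stratum)
    v:       unit-level weights v_sc
    """
    N = r.size
    total = (v ** 2 * r ** 2).sum()                     # term 1: SRS variance
    for s in np.unique(stratum):
        in_s = stratum == s
        Ns = np.unique(cluster[in_s]).size              # clusters drawn in stratum s
        for c in np.unique(cluster[in_s]):
            rc = (v * r)[in_s & (cluster == c)]
            total += rc.sum() ** 2 - (rc ** 2).sum()    # term 2: within-cluster cross products
        total -= ((v * r)[in_s].sum()) ** 2 / Ns        # term 3: remove within-stratum means
    return total / N
```

As a design check, with one stratum and equal weights the formula reduces to the usual sum of squared demeaned cluster totals, divided by the number of units.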
Model-selection tests in panel data models with complex sampling designs are
similar to the tests in the cross-sectional cases, but in using standard software
we must make sure to account for serial correlation in the difference in objective
functions when using a pooled estimation method. Here we cover the case where
stratified sampling is done in an initial time period, as is very common. Conse-
quently, the sampling weights, Qj /Hj for the strata j = 1, . . . , J , do not change
over time.
When a probability density function for the joint distribution
D(Yi1 , . . . , YiT |Xi1 , . . . , XiT ) is fully specified, the methods in Sections
5.3 and 5.4 apply directly: the objective function is the joint log likelihood
conditional on (Xi1 , . . . , XiT ).
For many reasons, one often wants to compare models estimated using pooled
methods. Pooled estimation methods are computationally simpler, often much
more so. More importantly, we are often interested in a feature of D(Yit |Xit )
or even D(Yit |Xi1 , . . . , XiT ), and we do not wish to take a stand on how the
{Yit : t = 1, . . . , T } are related to each other. For example, we might be interested
in estimating E(Yit |Xit ) or E(Yit |Xi1 , . . . , XiT ) using pooled quasi-MLE in the
linear exponential family. Such an approach is robust to other distributional mis-
specification and to arbitrary serial correlation. Therefore, any model-selection
statistic should be robust to arbitrary serial dependence, too.
As an example, suppose we use pooled nonlinear least squares to estimate two
models of the conditional mean. The difference in objective functions at time t,
evaluated at the pseudo-true values, is
$$\left[Y_{it} - m_1(X_{it}, \theta_1^*)\right]^2 - \left[Y_{it} - m_2(X_{it}, \theta_2^*)\right]^2.$$
There are essentially no interesting cases where this difference would be serially
uncorrelated over time. We would have to assume that {(Xit , Yit ) : t = 1, . . . , T } is
an independent sequence, and this is very unlikely in a panel data setting.
In models with unobserved heterogeneity, say Ci , we can take a correlated
random effects approach (as in Wooldridge, 2010, Section 13.9) and propose a
model for
$$D(C_i \mid X_{i1}, \ldots, X_{iT}) = D(C_i \mid \bar{X}_i).$$
Then, if we assume strict exogeneity of {Xit } conditional on Ci ,
$$q_g(W_i, \theta_g) = \sum_{t=1}^{T} q_{gt}(W_{it}, \theta_g), \qquad g = 1, 2,$$
where θ∗1 and θ∗2 are the pseudo-true values. A sufficient but not necessary condition
is that the models fit equally well for each t:

$$E\left[q_{1t}(W_{it}, \theta_1^*)\right] = E\left[q_{2t}(W_{it}, \theta_2^*)\right], \qquad t = 1, \ldots, T. \tag{20}$$

The difference in objective functions for unit i in stratum j is then

$$r(W_{ij}, \theta_1^*, \theta_2^*) = \sum_{t=1}^{T} \left[q_{1t}(W_{itj}, \theta_1^*) - q_{2t}(W_{itj}, \theta_2^*)\right].$$
That the variance estimator accounts for serial dependence can be seen by writing
out the within-stratum quantity

$$\frac{1}{N_j} \sum_{i=1}^{N_j} \left[r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2) - \bar{R}_j\right]^2$$

and noting that it includes cross products between time periods t and s, t ≠ s.
Standard software can be tricked into computing the model-selection statistic
by specifying the strata, j , the sampling weights, Qji /Hji , and specifying each
cross-sectional unit i as a cluster. As is well known – see, for example, Arellano
(1987) – the form of the robust variance estimator for small-T panel data estimators
is the same as for cluster correlation.
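To illustrate the trick numerically, the following numpy sketch (with hypothetical inputs) collapses the T per-period differences for each cross-sectional unit into a single Ri = Σt rit and then applies the stratified formula; summing over t before computing the variance is exactly what clustering by unit accomplishes, and it is what makes the statistic robust to arbitrary serial correlation.

```python
import numpy as np

def panel_t_ms(r_it, unit, strata_of_unit, w_of_unit):
    """Pooled panel model-selection statistic with unit-level clustering.

    r_it:           stacked per-period differences q1t - q2t
    unit:           cross-sectional unit id for each row of r_it
    strata_of_unit: stratum label for each unit (stratification in period 1)
    w_of_unit:      sampling weight Q_j/H_j for each unit
    """
    units = np.unique(unit)
    R = np.array([r_it[unit == i].sum() for i in units])  # R_i = sum over t
    w = w_of_unit[units]
    strata = strata_of_unit[units]
    N = R.size
    num = (w * R).sum() / np.sqrt(N)
    eta2 = 0.0
    for j in np.unique(strata):
        Rj = (w * R)[strata == j]                         # weights constant within stratum
        eta2 += ((Rj - Rj.mean()) ** 2).sum()
    eta2 /= N
    return num / np.sqrt(eta2)
```

Because the weights are constant within a stratum, demeaning w·Ri within strata is equivalent to the (Qj/Hj)²-weighted formula in (12) applied to the unit-level sums.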
6. EXOGENOUS STRATIFICATION
In most applications, we partition W as W = (X, Y) and we are interested in some
feature of the distribution of Y given X. If the feature is correctly specified, and
we choose a suitable objective function, then the population (true) value of the
parameters, θ o , solves
$$\min_{\theta \in \Theta} E\left[q(W, \theta) \mid X = x\right]$$
for all x ∈ X , the support of X. For example, in the case of estimating the condi-
tional mean E(Y |X), one suitable choice of the objective function is the squared
residual function:
$$q(W, \theta) = \left[Y - m(X, \theta)\right]^2.$$
When the conditional mean is correctly specified, that is,
E (Y |X = x) = m(x, θ o ), x ∈ X ,
it is easily shown – see, for example, Wooldridge (2010, Chapter 12) – that θ o
solves
$$\min_{\theta \in \Theta} E\left\{\left[Y - m(X, \theta)\right]^2 \,\middle|\, X = x\right\}$$
for all x.
Now consider a situation where stratification is based entirely on X, and so
{Xj : j = 1, . . . , J } represents the mutually exclusive and exhaustive strata. Then,
as discussed in Wooldridge (1999, 2001), θo solves

$$\min_{\theta \in \Theta} E\left[q(W, \theta) \mid X \in \mathcal{X}_j\right]$$

for each j. Wooldridge (1999, 2001) shows that this feature of θo implies that the
unweighted M-estimator is generally consistent for θ o .
Given consistency of the unweighted estimator under correct specification, it
may be tempting to ignore stratification when it is based on X and to simply apply
Vuong’s (1989) statistic in the M-estimation context. But the unweighted statistic
does not achieve our objectives because the null hypothesis is that each model is
misspecified. Even when stratification is based on X, we need to use the weights
to uncover θ ∗1 and θ ∗2 . In particular, under the null hypothesis of interest, θ ∗g does
not generally solve
$$\min_{\theta_g \in \Theta_g} E\left[q_g(W, \theta_g) \mid X = x\right]$$
for all x ∈ X , and therefore the unweighted estimator is inconsistent for θ ∗g . Our
goal is to compare the models in the population, and the weighted estimator always
consistently estimates θ ∗g under the null and alternatives – including if model g
is correctly specified – whether or not stratification is based on X, Y, or both. To
summarize, this observation argues in favor of weighting for both estimation and
model selection.
7. EXAMPLES
The previous framework has many applications. Here we describe a few that are
not completely standard.
Many software packages, such as Stata, allow for estimation of binary and frac-
tional response models with survey sampling. After the estimates θ̂ 1 and θ̂ 2 have
been obtained, compute
Example 7.3 (Nonlinear Least Squares) Suppose we want to model E (Yi |Xi )
using functions m1 (Xi , θ 1 ) and m2 (Xi , θ 2 ). Let θ̂ 1 and θ̂ 2 be the weighted nonlinear
least squares estimators, obtained using the sampling weights (if appropriate). For
each unit i, the difference in objective functions is
$$\hat{R}_i = \left[Y_i - m_1(X_i, \hat{\theta}_1)\right]^2 - \left[Y_i - m_2(X_i, \hat{\theta}_2)\right]^2 = \hat{U}_{i1}^2 - \hat{U}_{i2}^2.$$

Then run the regression

$$\hat{U}_{i1}^2 - \hat{U}_{i2}^2 \ \text{on}\ 1, \qquad i = 1, \ldots, N,$$
using weights, if necessary, and adjusting the standard error of the constant for
the sampling scheme. For example, if Yi ≥ 0, possibly taking on the value zero, it
is somewhat common to start with a linear model estimated by OLS. That can be
compared with an exponential model estimated by nonlinear least squares.
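Under random sampling, the linear-versus-exponential comparison in this example can be sketched as follows; the data-generating process, parameter values, and Gauss-Newton solver are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the true conditional mean is exponential, so model two
# (exponential mean, fit by NLS) should beat model one (linear mean, fit by OLS).
N = 1000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = np.exp(0.2 + 0.4 * x) + rng.normal(0.0, 0.3, size=N)

# Model one: linear mean, estimated by OLS.
u1 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Model two: exponential mean, estimated by NLS (Gauss-Newton iterations).
theta = np.zeros(2)
for _ in range(100):
    mu = np.exp(X @ theta)
    theta += np.linalg.lstsq(mu[:, None] * X, y - mu, rcond=None)[0]
u2 = y - np.exp(X @ theta)

# Per-unit difference in objective functions and its t statistic.
R_hat = u1 ** 2 - u2 ** 2
t = np.sqrt(N) * R_hat.mean() / R_hat.std(ddof=1)
```

A large positive t rejects in favor of the exponential model, since R̂i > 0 on average means the linear model leaves larger squared residuals.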
to the truth. Just as importantly, we will see that using the unweighted statistic can
be very misleading.
In the first set of simulations, we create a population of 100,000 units, where
the outcome variable, Y , follows a Poisson distribution conditional on a set of
covariates. We consider two different conditional mean functions. In the first case,
we generate five covariates, all normally distributed, such that
so that the true model includes X3 but excludes X4 and X5 . We call this model one.
We chose the parameter values so that the test does not choose the correct model
with probability one but still has substantial power in its direction. As competing
models, we replace X3 first with X4 (model two) and then with X5 (model three):
$$X_4 = X_3 + R_4, \qquad X_5 = X_3 + R_5,$$
where R4 and R5 are independent Normal(0, 1/9) random variables. Models two
and three fit equally well using any objective function that suitably identifies a con-
ditional mean function, such as a quasi-log likelihood from the linear exponential
family.
In order to study the performance of the test in a quasi-MLE framework, we
also generated the conditional distribution of Y to be exponential in the population,
with the same mean function (21). We must emphasize that we are still using the
Poisson log-likelihood function, so the estimator, in this case, is properly called
QMLE. As in the Poisson case, there is no guarantee that the weighted estimator
will choose the correct model one more frequently than the unweighted test. But
we know it will not systematically choose an incorrect conditional mean over a
correct one.
Rather than drawing a random sample, we stratify the sample on the basis of
Y . In particular, for the Poisson distribution, we take samples of 1, 000 from the
stratum with Y = 0 and 1, 000 from the stratum with Y > 0. In the population,
P(Y = 0) = 0.19061. Therefore, we oversample the stratum with Y = 0. There are
only two strata, and the sampling weights are Q1 = 0.38122 and Q2 = 1.61878. For
the exponential distribution, we choose the strata so that the population frequencies
are about 0.727 and 0.273, using a cutoff Y ≤ 2. Therefore, we oversample units
with larger outcomes, rather than small ones.
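The weight arithmetic for this design is easy to verify; this snippet assumes only the numbers quoted in the text.

```python
import numpy as np

# From the text: P(Y = 0) = 0.19061 in the population, and each stratum
# contributes 1,000 of the 2,000 sampled units, so H_1 = H_2 = 0.5.
Q = np.array([0.19061, 1 - 0.19061])   # population shares (Y = 0, Y > 0)
H = np.array([0.5, 0.5])               # sample shares
weights = Q / H
print(weights)                          # -> [0.38122 1.61878], matching the text
```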
For each draw, we use both the weighted Vuong statistic and the unweighted
statistic, both based on the Poisson log-likelihood function with exponential con-
ditional mean. We test each model against the other two, so we have six outcomes.
Table 1 reports rejection frequencies obtained from the simulations using 1, 000
replications. Here the alternative is that model mi is better than mj , i ≠ j , or in
short mi > mj .
As can be seen in Table 1, when the population distribution is Poisson, the
weighted test does a better job than the unweighted test in detecting that model
one provides the best fit – because it is the true model. For model one versus model
two, the rejection in favor of model one is almost 10 percentage points higher (78.2% versus
68.9%). Similarly, the weighted test does better in choosing between models one
and three (73.0% versus 65.8%). Neither test ever incorrectly chooses model two
or model three over model one. Remember, though, that the estimates of the
parameters using the unweighted estimator are inconsistent because stratification
is based on Y .
Both the weighted and unweighted tests have rejection frequencies close to 0.05
when comparing the two incorrect models, model two and model three, which cor-
responds to the notion that both models are wrong but fit equally well. The weighted
statistic does find a few more “false positives” than the unweighted test, but there
are only 1,000 replications. Overall, the weighted test seems clearly preferred, and
we must use the weights for consistent parameter estimation, anyway.
When the true conditional distribution is exponential, both tests have a tougher
time choosing model one. This is possibly because, for the exponential distri-
bution, the variance is the square of the mean, and so there is much more variation
in the outcome Y than when the conditional distribution is Poisson. Plus, in the
exponential case, we oversample large outcomes rather than small ones (although
the weighted version of the test accounts for that). Overall, the weighted test does
somewhat better. For example, it correctly chooses model one over model two
26.4% of the time compared with 22.9% for the unweighted test.
As a second conditional mean specification, we use
where X1 and X2 have the same normal distributions and X3 ∼ Uniform(1, 3).
Now we are primarily interested in the ability of the test to detect functional form
misspecification. As before, the correct conditional mean function is labeled model
one. The alternative models are
Model two ignores X3 entirely; given the simulation findings in Table 1, we would
expect the test to do well in choosing model one. Model three misspecifies the
functional form in X3 . Because a quadratic can mimic the reciprocal function, we expect
the test to have a more difficult time telling apart models one and three. In the Pois-
son population, we still oversample units with Y = 0. The strata in the exponential
case are defined by Y ≤ 4 and Y > 4.
The findings in Table 2 for the Poisson distribution are very interesting and
highlight the danger of using the unweighted version of the test. The weighted
test is fairly successfully distinguishing between models one and two: it correctly
rejects the null in favor of model one 52.5% of the time, and never chooses model
two. By contrast, the unweighted test never correctly picks model one. It even picks
model two 33.3% of the time. That means that a researcher is much more likely to
think that the correct model is quadratic in X1 and X2 and entirely excludes X3 .
Both the weighted and unweighted tests have very little ability to tell the dif-
ference between models one and three. This is unlikely to be a bad thing because
quantities of interest – such as elasticities, semi-elasticities and average partial
effects – are probably pretty similar across the two models. The weighted test
shows a clear preference for model three over model two, and this is a good thing:
model three is certainly closer to the true model. By contrast, the unweighted test
incorrectly shows a clear preference for model two over model three.
Both tests are completely ineffective for selecting among the three models
when the data are generated from the exponential distribution. As before, this
probably arises from the large variance in an exponential distribution and possibly
the oversampling of large outcomes from the population.
9. CONCLUSION
We have extended Vuong’s (1989) model-selection test in several useful directions. First, we allow
for general M-estimation rather than maximum likelihood estimation. Second,
we allow for complex survey samples rather than assuming random sampling
from a population. Third, we allow panel data applications combined with survey
sampling.
The key to obtaining computationally simple tests is contained in Theorem
3.1, which shows that when the models are appropriately nonnested and they fit
equally well, the limiting distribution of the standardized difference in objective
functions is nondegenerate and does not depend on the limiting distributions of
the estimators themselves. This means we can apply standard asymptotic variance
estimators for stratified samples, cluster samples and combinations of these directly
to the differences in the unit-specific objective functions.
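As an illustration of how directly Theorem 3.1 can be used, here is a minimal Python sketch (ours, not the authors' code) of a Vuong-type statistic for clustered survey data: the weighted unit-level differences in objective-function values are formed first, and their mean is then studentized with an ordinary cluster-robust variance estimator. All data below are simulated placeholders.

```python
import numpy as np

def vuong_survey_test(ll1, ll2, w, cl):
    """Vuong-type statistic from weighted per-unit objective-function
    differences; the variance of the mean difference is cluster-robust."""
    d = w * (ll1 - ll2)                  # weighted unit-level differences
    n = d.size
    dbar = d.mean()
    resid = d - dbar
    # cluster-robust variance of dbar: sum of squared cluster totals / n^2
    v = sum(resid[cl == c].sum() ** 2 for c in np.unique(cl)) / n**2
    return dbar / np.sqrt(v)             # approx. N(0,1) when the models fit equally well

rng = np.random.default_rng(0)
n = 400
ll1 = rng.normal(0.1, 1.0, n)            # per-unit objective values, model 1
ll2 = rng.normal(0.0, 1.0, n)            # per-unit objective values, model 2
w = rng.uniform(0.5, 2.0, n)             # sampling weights
cl = rng.integers(0, 40, n)              # cluster labels
t_stat = vuong_survey_test(ll1, ll2, w, cl)
```

Because the limiting distribution does not depend on the estimators themselves, no derivative or sandwich terms for the fitted parameters appear anywhere in the sketch.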
Section 5.7 contains just a couple of examples that show how the results can
be applied to problems that are explicitly quasi-MLE in nature, including popular
fractional response models and models for nonnegative responses.
For the most part, the simulation results in Section 5.8 are promising. In
addition to providing consistent estimators of the pseudo-true values, weighting the
objective function generally allows us to better choose the best fitting model in
cases where the best fitting model is the true model or the best fitting model is
“close” to the true model. In one case, the unweighted test systematically selects
the worst of the three models while almost never choosing the correct model.
134 IRAJ RAHMANI AND JEFFREY M. WOOLDRIDGE
More simulations could be informative. For example, seeing what happens when
stratification is based on X is something we did not do.
There are several interesting directions for future research. First, it would be
helpful to study the finite-sample properties of the version of the test statistic that
penalizes the number of parameters – see (14). Second, our setup can be extended
to the case where the goodness-of-fit functions are not the same as the objective
functions used to obtain the θ̂_g. For example, in a Tobit model, one might maximize
the log likelihood but then want to make comparisons based on the conditional
mean, in which case we might want to compare a sum of squared residuals from a
Tobit to that from, say, an exponential mean estimated using Poisson QMLE. The
analog of Theorem 3.1 will not be as simple, but such extensions seem worthwhile.
Finally, as suggested by a reviewer, rather than relying on standard first-order
asymptotics, one could possibly bootstrap the test statistic. Given the nature of
the null hypothesis and the complex survey sampling, this poses an interesting
challenge for the future.
REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In
Proceedings of the 2nd international symposium on information theory, Budapest (pp. 267–
281).
Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin
of Economics and Statistics, 49, 431–434.
Bhattacharya, D. (2005). Asymptotic inference from multi-stage samples. Journal of Econometrics,
126, 145–171.
Cox, D. R. (1961). Tests of separate families of hypotheses. In L. M. LeCam, J. Neyman, & E. L. Scott
(Eds.), Proceedings of the 4th Berkeley symposium on mathematical statistics and probability
(Vol. 1, pp. 105–123). Berkeley: University of California Press.
Cox, D. R. (1962). Further results on tests of separate families of hypotheses. Journal of the Royal
Statistical Society, Series B, 24, 406–424.
Davidson, R., & MacKinnon, J. G. (1981). Several tests for model specification in the presence of
alternative hypotheses. Econometrica, 49, 781–793.
Findley, D. F. (1990). Making difficult model comparisons. Mimeo, U.S. Bureau of the Census.
Findley, D. F. (1991). Convergence of finite multistep predictors from incorrect models and its role in
model selection. Note di Matematica, XI, 145–155.
Findley, D. F., & Wei, C. Z. (1993). Moment bound for deriving time series CLT’s and model selection
procedures. Statistica Sinica, 3, 453–480.
Gourieroux, C., Monfort, A., & Trognon, A. (1984). Pseudo-maximum likelihood methods:
Applications to Poisson models. Econometrica, 52, 701–720.
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. In
R. F. Engle & D. L. McFadden (Eds.), Handbook of econometrics (Vol. IV, pp. 2111–2245).
Amsterdam: North-Holland Publishing.
Rahmani, I. (2018). Asymptotic inference of M-estimator from multistage samples with variable
probability in the final stage. Working paper.
Model-Selection Tests for Complex Survey Samples 135
Rivers, D., & Vuong, Q. (2002). Model selection tests for nonlinear dynamic models. The Econometrics
Journal, 5, 1–39.
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Vuong, Q. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica,
57, 307–333.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
Wooldridge, J. M. (1990). An encompassing approach to conditional mean tests with applications to
testing nonnested hypotheses. Journal of Econometrics, 45, 331–350.
Wooldridge, J. M. (1999). Asymptotic properties of weighted M-estimators for variable probability
samples. Econometrica, 67, 1385–1406.
Wooldridge, J. M. (2001). Asymptotic properties of weighted M-estimators for standard stratified
samples. Econometric Theory, 17, 451–470.
Wooldridge, J. M. (2010). Econometric analysis of cross-section and panel data (2nd ed.). Cambridge,
MA: MIT Press.
INFERENCE IN CONDITIONAL
MOMENT RESTRICTION MODELS
WHEN THERE IS SELECTION DUE TO
STRATIFICATION
ABSTRACT
We show how to use a smoothed empirical likelihood approach to conduct
efficient semiparametric inference in models characterized as conditional
moment equalities when data are collected by variable probability sampling.
Results from a simulation experiment suggest that the smoothed empirical
likelihood based estimator can estimate the model parameters very well in
small to moderately sized stratified samples.
Keywords: Conditional moment models; smoothed empirical likelihood;
stratification; variable probability sampling; endogenous and exogenous
stratification; generalized method of moments
138 ANTONIO COSMA ET AL.
1. INTRODUCTION
The gold standard for collecting data, at least for the ease of doing subsequent
statistical analysis, is simple random sampling, whereby each observation in the
“target” population, namely, the population of interest, has an equal chance of
being chosen. Consequently, the probability distribution of the chosen observation,
regarded as belonging to a “realized” population, is the same as the probability
distribution of an observation in the target population, which facilitates statistical
analysis.
However, when estimating or testing economic relationships, economists often
discover that the data they plan to use are not drawn from the target population they
wish to study. Instead, the observations are found to be sampled from a related
but different population. Sometimes this is done deliberately to make the sample
more informative. For example, when studying the impact of welfare legislation,
it is desirable to oversample minorities and low-income families. Similarly, if we
want to examine the effect of disability laws on demand for public transportation,
it makes sense to oversample households with disabled members. At other times,
a distinction between the target and realized populations can be created
unintentionally. For example, in sampling the duration of unemployment at a randomly
chosen time, economists are more likely to observe longer unemployment spells
than shorter ones. Using a data set to answer questions for which it was not
originally designed (a typical situation in economics, where data are often costly to
collect) may also lead to such a situation (Newey, 1993, p. 419). For instance, if the
reason for collecting data is to estimate mean income for an underlying population,
oversampling low income and undersampling high income families can improve
the precision of estimators. However, at some later stage, this income data can be
used by another researcher as the dependent variable in a regression model
without realizing that the original sample was drawn from a distribution other than the
target population.
Whatever its cause, if the distinction between the target and realized
populations is not taken into account when analyzing the data, statistical inference can
be seriously off the mark. This phenomenon is commonly called selection bias.
Cf. Heckman (1976, 1979) and Manski (1989, 1995) for a classic exposition of
the selection problem.
In this paper, we describe an efficient semiparametric approach for conducting
inference in conditional moment restriction models when data are collected by a
variable probability (VP) sampling scheme such that the observations from the
target population have unequal chances of being chosen. In other words, we show
how to efficiently deal with the selection bias caused by the sampling scheme used
to collect the data, because the sampling scheme induces a probability distribution
on the realized population which differs from the target distribution for which
inference is to be made.
The remainder of the paper is organized as follows. In Section 2, we describe
the conditional moment restriction model and the VP sampling scheme. Section 3
discusses how to do inference using the smoothed empirical likelihood approach,
and finite sample properties of the proposed estimator are examined in Section 4.
Section 5 concludes the paper. Related technical details are in the appendices.
2. THE MODEL
2.1. Conditional Moment Equalities
Let Z* := (Y*, X*) be a (dim(Y*) + dim(X*)) × 1 random (column) vector that denotes an
observation from the target population, where Y* is the vector of endogenous
variables and X* the vector of exogenous variables. Assume that

H_0 : ∃ θ* ∈ R^{dim(θ*)} s.t. E_{P*_{Y*|X*}}[g(Z*, θ*) | X*] = 0, P*_{X*}-a.s. (2.1)
P(Z ∈ B) := Σ_{l=1}^{L} (p_l/b*) ∫_B 1_{C_l}(z) dP*(z), B ∈ B(R^{dim(Z*)}), (2.2)

where B(R^{dim(Z*)}) is the Borel sigma-field of R^{dim(Z*)}, b* := Σ_{l=1}^{L} p_l Q*_l, and
Q*_l := P*(Z* ∈ C_l) > 0 denotes the probability that a randomly chosen observation
from the target population lies in the lth stratum.
Since Q*_l represents the probability mass of the lth stratum in the target
population, the Q*_l ’s are popularly called “aggregate shares.” The aggregate shares,
which add up to one, i.e., Σ_{l=1}^{L} Q*_l = 1, are unknown parameters of interest to
be estimated along with the structural parameter θ*. The parameter b* also has
a practical interpretation, namely, it is the probability that an observation drawn
from the target population during the sampling process is ultimately retained in
the sample.
Selection Due to Stratification 141
It is immediate from (2.2) that the density of P, with respect to any measure
on B(R^{dim(Z*)}) that dominates P*, is given by

dP(z) := Σ_{l=1}^{L} (p_l/b*) 1_{C_l}(z) dP*(z) = (b(z)/b*) dP*(z), z ∈ R^{dim(Z*)}, (2.3)
where b(z) := Σ_{l=1}^{L} p_l 1_{C_l}(z). Following Imbens and Lancaster (1996, p. 296),
b(·)/b* is referred to as a bias function because it determines the selection bias
due to stratified sampling, i.e., the extent to which P differs from P ∗ . For instance,
it is easy to see that if the sampling probabilities p1 , . . . , pL are all equal, then
there is no selection bias, i.e., P = P ∗ , because b( · )/b∗ = 1 irrespective of the
values taken by the aggregate shares.
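To make the bias function concrete, the following small simulation (our illustration, not from the chapter) applies VP sampling to a standard normal target with two strata: draws with Z ≤ 0 are always retained (p₁ = 1), while draws with Z > 0 are retained with probability p₂ = 0.25, so that b* = p₁Q*₁ + p₂Q*₂ = 0.625.

```python
import numpy as np

rng = np.random.default_rng(42)

p1, p2 = 1.0, 0.25                        # stratum retention probabilities
z_star = rng.normal(size=200_000)         # draws from the target population
b_z = np.where(z_star <= 0, p1, p2)       # b(z) = sum_l p_l 1_{C_l}(z)
kept = z_star[rng.uniform(size=z_star.size) < b_z]

b_star = 0.5 * p1 + 0.5 * p2              # both aggregate shares equal 1/2
acc_rate = kept.size / z_star.size        # fraction retained, approx. b* = 0.625
realized_mean = kept.mean()               # negative: the Z <= 0 stratum is oversampled
```

The retained draws come from the realized, not the target, distribution: their mean is roughly −0.48 even though the target mean is zero, which is exactly the tilt dP = (b(z)/b*) dP*.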
The marginal density of X is given by

dP_X(x) := ∫_{y ∈ R^{dim(Y*)}} dP(y, x) (x ∈ R^{dim(X*)})
= ∫_{y ∈ R^{dim(Y*)}} (b(y, x)/b*) dP*_{Y*|X*=x}(y) dP*_{X*}(x) ((2.3))
= (γ*(x)/b*) dP*_{X*}(x), (2.4)
where γ*(x) := E_{P*_{Y*|X*}}[b(Y*, x) | X* = x]. Throughout the paper, we maintain the
assumption that γ* > 0 on supp(X*).⁴ Under this condition, the probability
distributions P_X and P*_{X*} are mutually absolutely continuous, which we denote by
writing P*_{X*} ≪ P_X ≪ P*_{X*}.
Since supp(Y, X) = supp(Y*, X*) and γ* > 0 on supp(X*), the conditional
density of Y|X is given by

dP_{Y|X=x}(y) := dP(y, x)/dP_X(x) = (b(y, x)/γ*(x)) dP*_{Y*|X*=x}(y), (y, x) ∈ supp(Y*) × supp(X*). (2.5)
By (2.5), dP_{Y|X=x}(y) = dP*_{Y*|X*=x}(y) if and only if b(y, x) = γ*(x) for all
(y, x) ∈ supp(Y*) × supp(X*). However, as discussed subsequently, the condition
b(y, x) = γ*(x) holds only in a special case. Therefore, in general,
dP_{Y|X} ≠ dP*_{Y*|X*}. Consequently, estimating (2.1) using the realized sample
without accounting for the fact that it was obtained by stratified sampling, i.e., ignoring
stratification, will generally not lead to a consistent estimator of θ*.
2.3. Identification
In contrast to some other stratified sampling schemes (cf. Tripathi, 2011b, Sections 3.1
and 4.1), identification, i.e., uniqueness, of θ* cannot be lost because of VP
sampling. To see this, begin by recalling that the assumption that γ* > 0 on supp(X*)
implies that the distributions PX and PX∗ ∗ are mutually absolutely continuous.
Hence,
(2.1) ⇐⇒ E_{P*_{Y*|X*}}[g(Y*, x, θ*) | X* = x] = 0 for P*_{X*}-a.a. x ∈ supp(X*)
⇐⇒ γ*(x) E_{P_{Y|X}}[g(Y, x, θ*)/b(Y, x) | X = x] = 0 for P*_{X*}-a.a. x ∈ supp(X*) ((2.5))
⇐⇒ P*_{X*}{x ∈ supp(X*) : E_{P_{Y|X}}[g(Y, x, θ*)/b(Y, x) | X = x] ≠ 0} = 0 (γ* > 0)
⇐⇒ P_X{x ∈ supp(X*) : E_{P_{Y|X}}[g(Y, x, θ*)/b(Y, x) | X = x] ≠ 0} = 0 (P*_{X*} ≪ P_X ≪ P*_{X*})
⇐⇒ E_{P_{Y|X}}[g(Y, x, θ*)/b(Y, x) | X = x] = 0 for P_X-a.a. x ∈ supp(X*). (2.6)
Since b(Z) does not depend on θ ∗ , the equivalence in (2.6) reveals that θ ∗ in
(2.1) is uniquely defined if and only if θ ∗ in EPY |X [g(Z, θ ∗ )/b(Z)|X] = 0 (PX -a.s.)
is uniquely defined. That is, any condition that leads to the identification of θ ∗
in (2.1) will also ensure identification of θ ∗ in the right hand side of (2.6) and
vice-versa. To illustrate this, assume that the columns of the partial derivative
∂_θ E_{P*_{Y*|X*}}[g(Z*, θ*) | X*] are linearly independent P*_{X*}-a.s. As shown in Cosma,
Kostyrka, and Tripathi (2018), this condition is sufficient to ensure that θ ∗ is locally
identified.⁵ However, since b does not depend on θ (which implies that γ* does
not depend on θ), we have that

∂_θ E_{P*_{Y*|X*}}[g(Z*, θ*) | X* = x] = γ*(x) ∂_θ E_{P_{Y|X}}[g(Z, θ*)/b(Z) | X = x], x ∈ supp(X*). ((2.5))
Therefore, since γ* > 0 on supp(X*), the columns of ∂_θ E_{P*_{Y*|X*}}[g(Z*, θ*) | X*]
are linearly independent P*_{X*}-a.s. if and only if the columns of
∂_θ E_{P_{Y|X}}[g(Z, θ*)/b(Z) | X] are linearly independent P_X-a.s. (because P_X and P*_{X*}
are mutually absolutely continuous).
Since identification of θ ∗ cannot be lost because of VP sampling, for the
remainder of the paper we maintain that θ ∗ is identified.
supp(Y*, X*) = ∪_{j=1}^{J} ∪_{m=1}^{M} (A_j × B_m) if both Y* and X* are stratified;
supp(Y*, X*) = ∪_{j=1}^{J} (A_j × supp(X*)) if only Y* is stratified;
supp(Y*, X*) = ∪_{m=1}^{M} (supp(Y*) × B_m) if only X* is stratified.

If only Y* is stratified, then supp(Z*) = ∪_{l=1}^{L} C_l with L = J and C_l = A_l × supp(X*).
Consequently, for (y, x) ∈ supp(Y*) × supp(X*),

b(y, x) = Σ_{l=1}^{J} p_l 1_{A_l × supp(X*)}(y, x) = Σ_{l=1}^{J} p_l 1_{A_l}(y) =: b_endog(y).
144 ANTONIO COSMA ET AL.
endogenous stratification =⇒ dP_{Y|X=x}(y) = (b_endog(y)/γ*_endog(x)) dP*_{Y*|X*=x}(y), (2.7)

where γ*_endog(x) := E_{P*_{Y*|X*}}[b_endog(Y*) | X* = x].
If only X* is stratified, then supp(Z*) = ∪_{l=1}^{L} C_l with L = M and C_l =
supp(Y*) × B_l. Consequently, for (y, x) ∈ supp(Y*) × supp(X*),

b(y, x) = Σ_{l=1}^{M} p_l 1_{supp(Y*) × B_l}(y, x) = Σ_{l=1}^{M} p_l 1_{B_l}(x) =: b_exog(x),

which implies that γ*_exog(x) := E_{P*_{Y*|X*}}[b_exog(X*) | X* = x] = b_exog(x). Hence,
by (2.5),

exogenous stratification =⇒ dP_{Y|X=x}(y) = dP*_{Y*|X*=x}(y). (2.8)
Example 2.1: (Linear regression with exogenous regressors) Consider the linear
regression model Y* = X̃*'θ* + ε*, where X̃* := (1, X*)'. Assume that the regressors
are exogenous with respect to the model error in the target population, i.e.,
E_{P*_{Y*|X*}}[ε* | X*] = 0 P*_{X*}-a.s.
Suppose that only Y ∗ is stratified. If we ignore the fact that the data were
collected by VP sampling and simply regress the observed Y on the observed X
and the constant regressor, then θ ∗ cannot be consistently estimated by the least
squares (LS) estimator. Indeed, letting θ̂LS denote the LS estimator obtained by
regressing Y on X̃ := (1, X), we have that
plim_{n→∞} θ̂_LS = plim_{n→∞} (n⁻¹ Σ_{j=1}^{n} X̃_j X̃_j')⁻¹ (n⁻¹ Σ_{j=1}^{n} X̃_j Y_j)
= (E_{P_X} X̃ X̃')⁻¹ (E_{P_X} X̃ μ(X)), (2.9)

where μ(X) := E_{P_{Y|X}}[Y|X]. But, by (2.4), E_{P_X} X̃ X̃' = E_{P*_{X*}}[γ*_endog(X*) X̃* X̃*']/b*, and
similarly E_{P_X} X̃ μ(X) = E_{P*_{X*}}[γ*_endog(X*) X̃* μ(X*)]/b*. Hence,

plim_{n→∞} θ̂_LS = (E_{P*_{X*}}[(γ*_endog(X*)/b*) X̃* X̃*'])⁻¹ E_{P*_{X*}}[(γ*_endog(X*)/b*) X̃* μ(X*)] ((2.9) & (2.4)) (2.10)
= (E_{P*_{X*}}[γ*_endog(X*) X̃* X̃*'])⁻¹ E_{P*_{X*}}[γ*_endog(X*) X̃* (X̃*'θ* + (1/γ*_endog(X*)) E_{P*_{Y*|X*}}[ε* b_endog(Y*) | X*])]
= θ* + (E_{P*_{X*}}[γ*_endog(X*) X̃* X̃*'])⁻¹ (E_{P*}[X̃* ε* b_endog(Y*)])
≠ θ*,

because E_{P*_{Y*|X*}}[ε* | X*] = 0 (P*_{X*}-a.s.) does not imply that E_{P*}[X̃* ε* b_endog(Y*)] = 0.
If, however, stratification is exogenous, then

μ(x) = E_{P_{Y|X}}[Y | X = x] = E_{P*_{Y*|X*}}[Y* | X* = x] ((2.8)) (x ∈ supp(X*))
= E_{P*_{Y*|X*}}[X̃*'θ* + ε* | X* = x]
= x̃'θ*. (2.11)
Hence, ignoring exogenous stratification does not affect the consistency of θ̂_LS
because, by (2.9) and (2.11), plim_{n→∞} θ̂_LS = (E_{P_X} X̃ X̃')⁻¹ E_{P_X}[X̃ X̃'θ*] = θ*.
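Both halves of this conclusion are easy to see in a short Monte Carlo (our illustration; the lognormal regressor and the particular two-stratum retention rules below are hypothetical choices, not the chapter's design):

```python
import numpy as np

rng = np.random.default_rng(1)

def vp_sample(n_target, keep_prob):
    """Draw from the target population Y* = 1 + X* + eps*, then retain each
    observation with probability keep_prob(y, x) (variable probability sampling)."""
    x = np.exp(rng.normal(size=n_target))        # log X* ~ N(0, 1)
    y = 1.0 + x + rng.normal(size=n_target)      # beta0* = beta1* = 1
    keep = rng.uniform(size=n_target) < keep_prob(y, x)
    return y[keep], x[keep]

def ols(y, x):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]  # (intercept, slope)

# endogenous: retention depends on Y*, so E[Y|X] shifts in the realized population
y_en, x_en = vp_sample(200_000, lambda y, x: np.where(y > 2.0, 0.2, 1.0))
# exogenous: retention depends on X* only, so E[Y|X] is unchanged
y_ex, x_ex = vp_sample(200_000, lambda y, x: np.where(x > 1.0, 0.2, 1.0))

coef_endog = ols(y_en, x_en)   # intercept pulled well below its target value of 1
coef_exog = ols(y_ex, x_ex)    # close to (1, 1)
```

Under this particular endogenous rule the distortion shows up mainly in the intercept; under the exogenous rule both LS coefficients remain consistent, exactly as (2.11) predicts.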
3. INFERENCE
3.1. Related Literature and Our Contribution
There is a large literature on estimation and testing models using data collected by
various types of stratified sampling schemes; cf. the papers cited at the beginning
of Section 2.2, and the references therein. In this section, we briefly describe only
some of the works that consider VP sampling.
Earlier papers in the literature on estimating models with conditioning variables
assume that PY∗∗ |X∗ is known up to a finite dimensional parameter; only PX∗ ∗ is left
completely unspecified. For example, a well-known application of VP sampling
can be found in Hausman and Wise (1981). Imbens and Lancaster (1996) extend the
maximum likelihood approach of Hausman and Wise to a moment-based
methodology that allows for VP sampling, mixed-response variables and stratification on
exogenous covariates. Regression under VP sampling and a parametric PY∗∗ |X∗ has
also been investigated. For example, Jewell (1985) and Quesenberry and Jewell
(1986) propose iterative estimators of regression coefficients under VP sampling
without imposing normality or independence, though they do not provide any
asymptotic theory for their estimators.
The papers described above impose strong conditions on the distribution of
Y ∗ |X ∗ . Exceptions include Wooldridge (1999) and Tripathi (2011b), who leave
both P*_{Y*|X*} and P*_{X*} completely unspecified. Wooldridge provides asymptotic
theory for M-estimation under VP sampling for a model defined in terms of a set of
dP(z) := (‖z‖ / E_{P*}‖Z*‖) dP*(z), z ∈ R^{dim(Z*)},

where ‖·‖ is the Euclidean norm. That is, length-biased sampling can be expressed
as (2.3) with b(z) := ‖z‖ and b* := E_{P*}‖Z*‖. Therefore, with only minor
notational changes, the results obtained in this paper can be extended to length-biased
sampling as well.
Length-biased sampling has been extensively studied for the parametric case,
i.e., where dP* is specified up to a finite dimensional parameter, cf., e.g., Patil
and Rao (1977, 1978), Bickel, Klaassen, Ritov, and Wellner (1993, Section 4.4)
and Owen (2001, Chapter 6). As far as a nonparametric treatment of length-biased
sampling is concerned, Vardi (1982) deals with the case when P ∗ is unknown.
Vardi assumes that both P ∗ and P can be sampled with positive probability.
Using two independent samples (one each from P ∗ and P ), he shows how to
construct the nonparametric maximum likelihood estimators (NPMLE) of P ∗ and
P and also obtains their asymptotic distributions. Vardi (1985) and Gill, Vardi, and
Wellner (1988) provide conditions for the existence and uniqueness of the NPMLE
of P ∗ in a general setup when more than two independent samples from F ∗ and F
are available. These papers concentrate on the distributions P ∗ and P ; there are no
other parameters to estimate. Qin (1993) uses the empirical likelihood approach
to construct a nonparametric likelihood ratio confidence interval for θ ∗ := EP ∗ Z ∗ ,
i.e., a just-identified unconditional moment equality, using an independent sample
from P ∗ and P . El-Barmi and Rothmann (1998) generalize Qin’s treatment to
handle models with overidentified unconditional moment restrictions of the form
EP ∗ g(Z, θ ∗ ) = 0. They also obtain efficient estimators of P ∗ and P . However, they
do not consider the testing of overidentifying restrictions.
The efficiency bounds for estimating θ ∗ and related functionals have been derived
in Severini and Tripathi (2013, Section 14.3). In this section, we describe some
of these bounds and discuss their salient features. Construction of estimators that
achieve these bounds is considered in the next section.
For the remainder of the paper, let ρ1 (Z, θ ) := g(Z, θ )/b(Z). Since the right
hand side of (2.6) is a conditional moment equality with respect to the realized con-
ditional distribution PY |X , the efficiency bound for θ ∗ follows from Chamberlain
(1987). Namely, the efficiency bound for estimating θ* is given by¹⁰

l.b.(θ*) := (E_{P_X}[D'(X) V₁⁻¹(X) D(X)])⁻¹, (3.1)

where D(X) := ∂_θ E_{P_{Y|X}}[ρ₁(Z, θ*) | X] and V₁(X) := E_{P_{Y|X}}[ρ₁(Z, θ*) ρ₁'(Z, θ*) | X].
The efficiency bound in (3.1), given as a functional of the realized distribution
P , can be used to determine whether an estimator of θ ∗ is semiparametrically effi-
cient by comparing its asymptotic variance with l.b.(θ ∗ ). However, as the moment
condition model (2.1) is specified in terms of the target distribution P ∗ , in order to
answer questions such as how the efficiency bound for θ ∗ changes if stratification is
purely endogenous (or purely exogenous) or if the error term in a regression model
is conditionally homoskedastic in the target population, it is helpful to rewrite (3.1)
in terms of P ∗ . To do so, observe that, by (2.5), we have
D(x) = (1/γ*(x)) ∂_θ E_{P*_{Y*|X*}}[g(Z*, θ*) | X* = x],
V₁(x) = (1/γ*(x)) E_{P*_{Y*|X*}}[g(Z*, θ*) g'(Z*, θ*)/b(Y*, x) | X* = x], x ∈ supp(X*). (3.2)
Hence, by (2.4) and (3.2), the efficiency bound in (3.1) can be written as

l.b.(θ*) = b* (E_{P*_{X*}}[(∂_θ E[g(Z*, θ*)|X*])' (E[g(Z*, θ*) g'(Z*, θ*)/b(Z*)|X*])⁻¹ (∂_θ E[g(Z*, θ*)|X*])])⁻¹, (3.3)
l.b.(θ* in Example 2.1) = b* (E_{P*_{X*}}[X̃* X̃*'/V_b(X*)])⁻¹ ((3.3))
= (E_{P_X}[X̃ X̃'/(γ*²(X) V₁(X))])⁻¹, ((2.4), (2.5)) (3.4)

where V_b(X*) := E_{P*_{Y*|X*}}[ε*²/b(Y*, X*) | X*].
If stratification is endogenous, then

l.b.(θ* in Example 2.1)|endog. strat. = (E_{P_X}[X̃ X̃'/(γ*_endog²(X) V_{1,endog}(X))])⁻¹.

The weighted GMM estimator under exogenous stratification is

θ̂_GMM,exog := (Σ_{j=1}^{n} X̃_j X̃_j'/b_exog(X_j))⁻¹ (Σ_{j=1}^{n} X̃_j Y_j/b_exog(X_j)).
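A sketch (ours) of this estimator in Python; the two-stratum retention schedule b_exog and the data-generating process are illustrative assumptions, and the estimator simply reweights each observation by 1/b_exog(X_j):

```python
import numpy as np

rng = np.random.default_rng(7)

# target population: Y* = 1 + X* + eps*, with exogenous stratification on X*
x = np.exp(rng.normal(size=100_000))
y = 1.0 + x + rng.normal(size=x.size)
b_exog = np.where(x > 1.0, 0.2, 1.0)          # stratum retention probabilities
keep = rng.uniform(size=x.size) < b_exog
xs, ys, bs = x[keep], y[keep], b_exog[keep]

# weighted estimator: solves sum_j Xtil_j (Y_j - Xtil_j' theta) / b_exog(X_j) = 0
Xt = np.column_stack([np.ones_like(xs), xs])
A = (Xt / bs[:, None]).T @ Xt
c = (Xt / bs[:, None]).T @ ys
theta_gmm = np.linalg.solve(A, c)             # approx. (1, 1)
```

Reweighting by the inverse retention probability undoes the tilt b_exog(x)/b* that VP sampling puts on the realized distribution of X.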
Since the aggregate shares add up to one, it suffices to determine the efficiency
bound for estimating Q*_{−L} := (Q*₁, ..., Q*_{L−1})' ∈ (0, 1)^{L−1}. The aggregate
shares are identified in the realized population by the moment condition

E_P[(s(Z) − Q*_{−L})/b(Z)] = 0, (3.6)

where s(Z) := (1_{C₁}(Z), ..., 1_{C_{L−1}}(Z))' is (L−1) × 1. The moment conditions in (3.6)
modify accordingly if stratification is endogenous or exogenous, namely,

endog. strat. =⇒ E_{P_Y}[(s_endog(Y) − Q*_{−L})/b_endog(Y)] = 0, with s_endog(Y) := (1_{A₁}(Y), ..., 1_{A_{J−1}}(Y))';
exog. strat. =⇒ E_{P_X}[(s_exog(X) − Q*_{−L})/b_exog(X)] = 0, with s_exog(X) := (1_{B₁}(X), ..., 1_{B_{M−1}}(X))'. (3.7)
Let ρ₂(Z, Q*_{−L}) := (s(Z) − Q*_{−L})/b(Z), and let Σ₁₂(X) := E_{P_{Y|X}}[ρ₁(Z, θ*) ρ₂'(Z, Q*_{−L}) | X]
be the conditional (on X) covariance between ρ₁(Z, θ*) and ρ₂(Z, Q*_{−L}).
Then, under (2.1), the efficiency bound for estimating Q*_{−L} is given by

l.b.(Q*_{−L}) := b*² [var_P(ρ₂(Z, Q*_{−L})) − E_{P_X}[Σ₁₂'(X) V₁⁻¹(X) Σ₁₂(X)]
+ (E_{P_X}[Σ₁₂'(X) V₁⁻¹(X) D(X)]) (l.b.(θ*)) (E_{P_X}[D'(X) V₁⁻¹(X) Σ₁₂(X)])], (3.8)
statistics for testing H0 and parametric restrictions on θ ∗ that do not require pre-
liminary estimation of any variance terms. Moreover, the resulting estimation and
testing procedures are invariant to normalizations of H0 . Simulation results pre-
sented in the aforementioned papers suggest that the SEL-based approach can
work very well in finite samples.
The advantages of the SEL approach described above extend to the case when
the observations are collected by VP sampling. Furthermore, it leads to a unified
approach of estimating and testing models using stratified samples, which should
appeal to applied economists and practitioners in the field. Therefore, we now
demonstrate how to use the SEL approach to construct asymptotically efficient
estimators, i.e., estimators with asymptotic variance equal to the efficiency bounds
in Section 3.2.
If the focus is on efficient estimation of θ ∗ alone, then the equivalence in
(2.6) reveals that replacing the moment function in Kitamura, Tripathi, and Ahn
(Eq. (2.1)) with ρ1 (Z, θ ∗ ) will deliver an asymptotically efficient estimator of θ ∗ .
But what about Q*_{−L}? Although, by (3.6), the aggregate shares Q*_{−L} =
E_P[s(Z)/b(Z)]/E_P[1/b(Z)] can be simply estimated by their sample analogs, this estimator will
not be efficient because it does not take (2.1) into account; cf. the discussion after
(3.8). To construct an estimator of Q∗−L that accounts for (2.1), we have to jointly
estimate θ ∗ and Q∗−L , which we do using the SEL approach.
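For concreteness, the (inefficient) sample-analog estimator of an aggregate share mentioned above can be sketched as follows (our toy example with two strata on a scalar Z and known retention probabilities):

```python
import numpy as np

rng = np.random.default_rng(3)

p1, p2 = 1.0, 0.25                        # known stratum retention probabilities
z_star = rng.normal(size=200_000)         # target population; true shares are (1/2, 1/2)
b_all = np.where(z_star <= 0, p1, p2)
z = z_star[rng.uniform(size=b_all.size) < b_all]   # realized VP sample

b_z = np.where(z <= 0, p1, p2)            # b(Z_j) for the retained observations
s_z = (z <= 0).astype(float)              # s(Z_j) = 1_{C_1}(Z_j)
q1_hat = np.mean(s_z / b_z) / np.mean(1.0 / b_z)   # sample analog implied by (3.6)
```

Even though stratum one makes up 80% of the realized sample, the inverse-probability ratio recovers its target-population share of one half; the SEL estimator additionally exploits (2.1) to reduce variance.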
For the remainder of the paper, assume that we have independent observations
Z₁, ..., Z_n collected by VP sampling. Hence, these are i.i.d. draws from the
realized density dP in (2.3). Our estimation approach relies on a smoothed version of
empirical likelihood. This smoothing, or localization, is carried out using positive
kernel weights

w_ij := K_{b_n}(X_i − X_j) / Σ_{k=1}^{n} K_{b_n}(X_i − X_k), i, j = 1, ..., n,

where K is a second order kernel, K_{b_n}(·) := K(·/b_n), and b_n is the bandwidth.
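As an illustration (ours), the weight matrix can be computed in a vectorized way; the Gaussian kernel below is one admissible second-order choice, and the bandwidth is arbitrary:

```python
import numpy as np

def sel_weights(x, bn):
    """Row-normalized kernel weights w_ij = K_bn(x_i - x_j) / sum_k K_bn(x_i - x_k)."""
    u = (x[:, None] - x[None, :]) / bn
    k = np.exp(-0.5 * u**2)               # Gaussian K, a second-order kernel
    return k / k.sum(axis=1, keepdims=True)

x = np.linspace(0.0, 1.0, 50)
w = sel_weights(x, bn=0.1)                # each row of w sums to one
```

Row i of the matrix localizes the empirical likelihood around X_i, which is what turns the conditional moment restriction into n local unconditional ones.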
For i, j = 1, . . . , n, let pij denote the probability mass placed at (Xi , Zj ) by
a discrete distribution with support (X1 , . . . , Xn ) × (Z1 , . . . , Zn ). The collection
of probabilities (p_ij), i, j = 1, ..., n, can be thought of as a set of nuisance parameters that
include the empirical distribution of the data. Using the kernel weights (w_ij) and the
distribution (p_ij), construct the smoothed log-likelihood Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log p_ij.
Then, given (θ , Q−L ), concentrate out (pij ) by solving the following optimization
problem:
max_{(p_ij)} Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log p_ij

s.t. p_ij ≥ 0 for i, j = 1, ..., n, Σ_{i=1}^{n} Σ_{j=1}^{n} p_ij = 1, (3.9)

Σ_{j=1}^{n} ρ₁(Z_j, θ) p_{1j} = 0, ..., Σ_{j=1}^{n} ρ₁(Z_j, θ) p_{nj} = 0, Σ_{i=1}^{n} Σ_{j=1}^{n} ρ₂(Z_j, Q_{−L}) p_ij = 0.
If the convex hulls of {ρ₁(Z₁, θ), ..., ρ₁(Z_n, θ)} and {ρ₂(Z₁, Q_{−L}), ...,
ρ₂(Z_n, Q_{−L})} contain the origin, then (3.9) can be solved by using Lagrange
multipliers. In this case, it can be verified that the solution to (3.9) is given by

p̂_ij(θ, Q_{−L}) := (1/n) · w_ij/(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L})), i, j = 1, ..., n,

where the multipliers λ₁, ..., λ_n, μ solve

Σ_{j=1}^{n} w_ij ρ₁(Z_j, θ)/(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L})) = 0, i = 1, ..., n,
(3.10)
Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij ρ₂(Z_j, Q_{−L})/(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L})) = 0.
Substituting p̂_ij(θ, Q_{−L}) into the smoothed log-likelihood yields

SEL(θ, Q_{−L}) := Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log p̂_ij(θ, Q_{−L})
= Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log[(w_ij/n)/(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L}))] (3.11)
= Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log(w_ij/n)
− Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L})).
Furthermore,¹³

(λ₁, ..., λ_n, μ) = argmax_{λ̃₁,...,λ̃_n,μ̃} Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log(1 + λ̃_i'ρ₁(Z_j, θ) + μ̃'ρ₂(Z_j, Q_{−L})). (3.12)
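For intuition, here is a toy version (ours) of the inner maximization for a single i, a scalar moment function, and no μ term; it uses safeguarded Newton steps and is only a sketch of what a full implementation of (3.12) would do:

```python
import numpy as np

def inner_lambda(w, g, iters=50):
    """Maximize sum_j w_j log(1 + lam * g_j) over scalar lam (strictly concave)."""
    lam = 0.0
    for _ in range(iters):
        d = 1.0 + lam * g
        grad = np.sum(w * g / d)
        hess = -np.sum(w * g**2 / d**2)
        step = -grad / hess
        # halve the step until every 1 + lam * g_j stays strictly positive
        while np.any(1.0 + (lam + step) * g <= 1e-8):
            step *= 0.5
        lam += step
    return lam

rng = np.random.default_rng(5)
g = rng.normal(0.2, 1.0, size=200)     # stand-in for localized moment values rho_1(Z_j, theta)
w = np.full(g.size, 1.0 / g.size)      # uniform weights stand in for w_ij
lam = inner_lambda(w, g)
foc = np.sum(w * g / (1.0 + lam * g))  # first-order condition, approx. zero
```

The positivity safeguard mirrors the convex-hull condition above: the implied probabilities w_j/(n(1 + λg_j)) must stay positive for the solution to exist.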
Therefore, the estimators of θ* and Q*_{−L} are defined to be

(θ̂, Q̂_{−L}) := argmax_{(θ, Q_{−L})} SELT(θ, Q_{−L}), (3.13)

where

SELT(θ, Q_{−L}) := − max_{λ̃₁,...,λ̃_n,μ̃} Σ_{i=1}^{n} T_{i,n} Σ_{j=1}^{n} w_ij log(1 + λ̃_i'ρ₁(Z_j, θ) + μ̃'ρ₂(Z_j, Q_{−L})) (3.14)
= − max_{μ̃} Σ_{i=1}^{n} T_{i,n} max_{λ̃_i} Σ_{j=1}^{n} w_ij log(1 + λ̃_i'ρ₁(Z_j, θ) + μ̃'ρ₂(Z_j, Q_{−L})).
The trimming indicator T_{i,n} := 1(ĥ(X_i) ≥ b_n^τ), where ĥ(X_i) := Σ_{j=1}^{n} K_{b_n}(X_i −
X_j)/(n b_n^{dim(X)}) and τ ∈ (0, 1) is a trimming parameter, is incorporated in (3.14) to
deal with the “denominator problem,” namely, the instability of the local empirical
log-likelihood Σ_{j=1}^{n} w_ij log(1 + λ̃_i'ρ₁(Z_j, θ) + μ̃'ρ₂(Z_j, Q_{−L})) caused by the
density of the conditioning variables becoming too small in the tails. Since T_{i,n} →p 1
as n → ∞, this trimming scheme ensures that asymptotically no data are lost.
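A small sketch (ours) of the trimming rule; the Gaussian kernel, the bandwidth b_n = 0.05, and τ = 0.9 are illustrative choices for 500 standard normal draws:

```python
import numpy as np

def trimming_indicator(x, bn, tau):
    """T_in = 1{h_hat(X_i) >= bn**tau}, with h_hat a kernel density estimate."""
    u = (x[:, None] - x[None, :]) / bn
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    h_hat = k.sum(axis=1) / (x.size * bn)    # h_hat(X_i); dim(X) = 1 here
    return (h_hat >= bn**tau).astype(int)

rng = np.random.default_rng(11)
x = rng.normal(size=500)
T = trimming_indicator(x, bn=0.05, tau=0.9)
kept_share = T.mean()    # most points kept; only low-density tail points trimmed
```

As the bandwidth shrinks with n, the threshold b_n^τ falls toward zero, so any fixed point with positive density is eventually kept, which is the sense in which no data are lost asymptotically.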
Following Kitamura, Tripathi, and Ahn, it can be shown that, under some
regularity conditions, θ̂ and Q̂−L are consistent, asymptotically normal and
asymptotically efficient, i.e., their asymptotic variances match the efficiency
bounds.
3.4. Testing
It can be shown that, under some regularity conditions, LR →d χ²_{dim(R)} as n → ∞
whenever H̃₀ is true. This result can be used to obtain the critical values for LR. Although
a Wald statistic can also be constructed, it is less attractive than LR because the
latter is internally studentized. As in parametric situations, LR can be inverted to
obtain asymptotically valid confidence intervals. A nice property of confidence
intervals based on LR is that they are invariant to nonsingular transformations of
the moment conditions. They also automatically satisfy natural range restrictions.
Since inference based on θ̂ is sensible only if (2.1) is true, it is important to
devise a test for H0 against the alternative that it is false. As we are dealing with
conditional moment restrictions, any specification test which first converts (2.1)
into a finite set of unconditional moment restrictions will not be consistent for test-
ing H0 . However, using the equivalence in (2.6), a consistent test of H0 is easily
obtained by replacing moment function in Tripathi and Kitamura (2003, Equa-
tion 1.1) with ρ1 (Z, θ ∗ ). Note that since (3.6) just identifies the aggregate shares,
testing the specification of (2.1) and (3.6) jointly is equivalent to testing (2.1).
4. SIMULATION STUDY
We now examine the finite sample behavior of the LS, GMM and SEL estimators
to illustrate the effects of estimating a simple linear regression model specified for
the target population, when data are collected by VP sampling and stratification is
either endogenous or exogenous. Code for the simulations is written in R, and the
SEL estimator of the model parameters and aggregate shares defined in (3.13) is
implemented using the algorithm in Owen (2013); cf. Appendix B for details.
4.1. Design
We consider the design in Kitamura et al. (Section 5), which has been used earlier
by Cragg (1983) and Newey (1993). The model to be estimated is
Y* = β₀* + β₁* X* + ε*,

where E_{P*_{Y*|X*}}[ε* | X*] = 0 P*_{X*}-a.s., θ* := (β₀*, β₁*)' = (1, 1)', and (ε*, log X*) ∼
NIID(0, 1). We consider two specifications for the skedastic function in the target
population.
4.2. Discussion
500.¹⁵ In contrast, in both designs, the LS and GMM estimators under exogenous
stratification are practically unbiased even when n = 50. Under exogenous strati-
fication, the LS estimator has smaller sampling variance than the GMM estimator
for each sample size. However, this finding can be mathematically justified only for
homoskedastic designs (recall from Example 3.2 that the LS estimator is
asymptotically efficient when stratification is exogenous and the error term in the regression
model is conditionally homoskedastic in the target population). Indeed, as shown
in Appendix A, cf. Example A.1, counterexamples can be constructed to show that
in heteroskedastic designs, the LS estimator can have higher sampling variance
than the GMM estimator when stratification is exogenous.¹⁶ Under endogenous
stratification, the GMM estimator of the slope coefficient exhibits some bias (≈ 2–4%
in both designs) when n = 50, but the bias is very close to zero when n = 500.
the two is most pronounced when n = 500; e.g., irrespective of the stratification
scheme, the RMSE of the GMM estimator is at least 65% larger than the RMSE
of the SEL estimator.
In the homoskedastic design, the SEL estimator exhibits some bias under both
endogenous and exogenous stratification when n = 50, but its bias is close to zero
for n = 500. However, its RMSE is larger than that of the GMM estimator even
when n = 500. This finding, which corroborates the simulation results
in Kitamura et al. (p. 1682), is likely due to the fact that the SEL estimator inter-
nally estimates the skedastic function nonparametrically to achieve semiparametric
efficiency and is thus unable to take advantage of conditional homoskedasticity in
small samples.
Tables 4 and 5 reveal that the GMM estimator of Q∗1 is consistent whether
stratification is endogenous or exogenous. It exhibits some upward bias (≈ 1–2%)
in both designs and for both types of stratification when n = 50, but the bias is very
close to zero when n = 500.¹⁷ In both designs, the RMSE of the SEL estimator of
Q∗1 is always slightly larger than the RMSE of the GMM estimator under
endogenous stratification, implying that in small samples there appears to be no
efficiency gain in estimating Q∗1 jointly with the model parameters. As can be seen
from Tables 4 and 5, the increase in the RMSE of Q̂1 is due to its bias, because
RMSE ≈ SE whenever the bias is small. This becomes clear on comparing the bias
of Q̂1 under endogenous and exogenous stratification: the latter is always larger.
The higher bias of Q̂1 under exogenous stratification is likely a design effect.
5. CONCLUSION
ACKNOWLEDGMENTS
We thank two anonymous referees and seminar participants at the 2017 “Econo-
metrics of Complex Survey Data: Theory and Applications” workshop organized
by the Bank of Canada, Ottawa, Canada, for helpful comments. The simulation
experiments reported in this paper were carried out using the HPC facilities of the
University of Luxembourg (Varrette et al., 2014, http://hpc.uni.lu).
NOTES
1. If X∗ is constant P∗X∗-a.s., then there is no conditioning and (2.1) reduces to a system
of unconditional moment equalities. These models are studied in Tripathi (2011a,b).
2. Similar notation, but without the “∗” superscript, applies to the random variables and
probability measures in the realized population.
3. Cf. Severini and Tripathi (2013, Appendix H) for a short proof of (2.2).
4. A sufficient condition for this is that P∗Y∗|X∗((Y∗, x) ∈ Cl | X∗ = x) > 0 for each l and
x ∈ supp(X∗).
5. The same condition leads to global identification of θ ∗ whenever g(Z ∗ , θ ∗ ) is linear
in θ ∗ .
6. In the econometrics literature, stratification based on a finite set of response variables
is often referred to as choice-based sampling.
7. Unless mentioned otherwise, it is assumed throughout the paper that both Y ∗ and X∗
are stratified.
8. Tripathi (2011b) shows that in unconditional moment restriction models even
exogenous stratification cannot be ignored.
9. Hence, for the LS estimator in Example 2.1, one can say that it is inconsistent because
of selection bias due to endogenous stratification, whereas exogenous stratification does not
lead to any selection bias.
10. The abbreviation “l.b.” stands for “lower bound,” because the efficiency bound is the
greatest lower bound for the asymptotic variance of any n^{1/2}-consistent regular estimator.
11. Namely, M1 ≤L M2 for symmetric matrices M1 , M2 means that M1 − M2 is negative
semidefinite.
12. The estimator θ̂GMM,endog is an example of an inverse probability weighted (IPW)
estimator, which uses the weights 1/bendog (Y1 ), . . . , 1/bendog (Yn ) to correct the selection
bias due to stratification by downward weighting the strata that are oversampled and upward
weighting the strata that are undersampled.
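The IPW correction in this note can be written as a one-line reweighting of the sample moment conditions. A minimal Python sketch of an IPW sample mean (our own function name; b is the known sampling-probability function evaluated at each observation):

```python
import numpy as np

def ipw_mean(z, b):
    """Inverse probability weighted estimate of a target-population mean:
    each observation is weighted by 1/b(Y_i), downweighting oversampled
    strata and upweighting undersampled ones, then normalized."""
    w = 1.0 / b
    return np.sum(w * z) / np.sum(w)
```

With constant b the weights cancel and the estimator reduces to the ordinary sample mean, which is a quick sanity check.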
13. To see this, compare the first order conditions for (3.12) with (3.10).
14. The SEL estimator is implemented with Ti,n := 1. To the best of our knowledge, how
to choose an optimal data driven bandwidth for the SEL estimator remains an open problem.
Consequently, we naively chose the bandwidth by repeating the simulation experiment on
a grid of bandwidths and picking the one that minimized the average (across the simulation
replications) RMSE of the SEL estimator of β1∗ . The naively chosen bandwidth, labeled cn ,
is reported in Tables 2–5. For the sake of comparison, we also report the SEL estimator when
the bandwidth is chosen using Silverman’s rule of thumb, namely, bn = 1.06 sd(X) n^{−1/5}.
Since sd(X) depends on the data, the bn reported in the tables is the value averaged across
the simulations.
15. This is even more so for the LS estimator of the intercept because, under endogenous
stratification, the bias of the LS estimator of β0∗ is ≈ 18% (resp. ≈ 41%) in magnitude for
the heteroskedastic (resp. homoskedastic) design even when n = 500. For the remainder
of this section, however, we only discuss the simulation results for the slope coefficient
because it can be interpreted as an average partial effect. Results for the intercept, which is
a pure level effect, are qualitatively very similar.
16. It is shown in Appendix A, cf. (A.1), that asvar(n^{1/2}(θ̂GMM,exog − θ∗)) −
asvar(n^{1/2}(θ̂LS − θ∗)) = A + B holds under exogenous stratification, where the matrix A is
positive semidefinite and the matrix B is negative semidefinite. Therefore, in general, it is
not clear which estimator has smaller asymptotic variance. However, since B = 0 under con-
ditional homoskedasticity, cf. (A.4), asvar(n^{1/2}(θ̂LS − θ∗)) ≤L asvar(n^{1/2}(θ̂GMM,exog − θ∗))
holds under exogenous stratification and conditional homoskedasticity. Alternatively, under
conditional homoskedasticity, the Gauss–Markov theorem implies the same result because
θ̂GMM,exog and θ̂LS are both linear and unbiased when stratification is exogenous.
17. In Tables 4 and 5, the results under exogenous stratification are almost identical
for the heteroskedastic and homoskedastic designs because P ∗ (X∗ ∈ B1 ) is not affected by
conditional heteroskedasticity in Y ∗ (cf. Table 1).
18. In our setup, all the components of Z are continuous random variables, so that ties
in the data occur with probability (P ) zero.
19. This is a simplified but working version of the code we actually used. The complete
code is available from GitHub at https://github.com/Fifis/SELshares.
REFERENCES
Bhattacharya, D. (2005). Asymptotic inference from multi-stage samples. Journal of Econometrics,
126, 145–171.
Bhattacharya, D. (2007). Inference on inequality from household survey data. Journal of Econometrics,
137, 674–707.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive estimation for
semiparametric models. Baltimore, MD: Johns Hopkins University Press.
Bickel, P. J., & Ritov, Y. (1991). Large sample theory of estimation in biased sampling regression
models. Annals of Statistics, 19, 797–816.
Butler, J. S. (2000). Efficiency results of MLE and GMM estimation with sampling weights. Journal
of Econometrics, 96, 25–37.
Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions.
Journal of Econometrics, 34, 305–334.
Cosma, A., Kostyrka, A. V., & Tripathi, G. (2018). Smoothed empirical likelihood based inference
with missing endogenous variables. In progress.
Cosslett, S. R. (1981a). Efficient estimation of discrete choice models. In C. F. Manski &
D. McFadden (Eds.), Structural analysis of discrete data with econometric applications (pp.
51–111). Cambridge, MA: MIT Press.
164 ANTONIO COSMA ET AL.
Cosslett, S. R. (1981b). Maximum likelihood estimation for choice-based samples. Econometrica, 49,
1289–1316.
Cosslett, S. R. (1991). Efficient estimation from endogenously stratified samples with prior
information on marginal probabilities. Manuscript. Retrieved from economics.sbs.ohio-
state.edu/scosslett/papers/cbsample1.pdf.
Cosslett, S. R. (1993). Estimation from endogenously stratified samples. In G. Maddala, C. Rao, &
H. Vinod (Eds.), Handbook of statistics (Vol. 11, pp. 1–43). Amsterdam: Elsevier.
Cragg, J. G. (1983). More efficient estimation in the presence of heteroscedasticity of unknown form.
Econometrica, 51, 751–764.
Deaton, A. (1997). The analysis of household surveys. Baltimore, MD: Johns Hopkins University Press.
DeMets, D., & Halperin, M. (1977). Estimation of a simple regression coefficient in samples arising
from a subsampling procedure. Biometrics, 33, 47–56.
El-Barmi, H., & Rothmann, M. (1998). Nonparametric estimation in selection biased models in the
presence of estimating equations. Nonparametric Statistics, 9, 381–399.
Gill, R. D., Vardi, Y., & Wellner, J. A. (1988). Large sample theory of empirical distributions in biased
sampling models. Annals of Statistics, 16, 1069–1112.
Hausman, J. A., & Wise, D. A. (1981). Stratification on endogenous variables and estimation: The Gary
income maintenance experiment. In C. F. Manski & D. McFadden (Eds.), Structural analysis
of discrete data with econometric applications (pp. 365–391). Cambridge, MA: MIT Press.
Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and
limited dependent variables and a simple estimator for such models. Annals of Economic and
Social Measurement, 5, 475–492.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–161.
Hirose, Y. (2007). M-Estimators in semi-parametric multi-sample models. Manuscript. Retrieved from
sms.victoria.ac.nz/foswiki/pub/Main/ResearchReportSeries/mscs08-05.pdf.
Hirose, Y., & Lee, A. J. (2008). Semi-parametric efficiency bounds for regression models under gen-
eralised case-control sampling: The profile likelihood approach. Annals of the Institute of
Statistical Mathematics, 62, 1023–1052.
Holt, D., Smith, T. M. F., & Winter, P. D. (1980). Regression analysis of data from complex surveys.
Journal of the Royal Statistical Society, Series A, 143, 474–487.
Imbens, G. W. (1992). An efficient method of moments estimator for discrete choice models with
choice-based sampling. Econometrica, 60, 1187–1214.
Imbens, G. W., & Lancaster, T. (1996). Efficient estimation and stratified sampling. Journal of
Econometrics, 74, 289–318.
Jewell, N. P. (1985). Least squares regression with data arising from stratified samples of the dependent
variable. Biometrika, 72, 11–21.
Kalbfleisch, J. D., & Lawless, J. F. (1988). Estimation of reliability in field-performance studies (with
discussion). Technometrics, 30, 365–388.
Kitamura, Y., Tripathi, G., & Ahn, H. (2004). Empirical likelihood based inference in conditional
moment restriction models. Econometrica, 72, 1667–1714.
Manski, C. F. (1989). Anatomy of the selection problem. Journal of Human Resources, 24, 343–360.
Manski, C. F. (1995). Identification problems in the social sciences. Cambridge, MA, USA: Harvard
University Press.
Manski, C. F., & Lerman, S. R. (1977). The estimation of choice probabilities from choice based
samples. Econometrica, 45, 1977–1988.
Manski, C. F., & McFadden, D. (1981). Alternative estimators and sample design for discrete choice
analysis. In C. F. Manski & D. McFadden (Eds.), Structural analysis of discrete data with
econometric applications (pp. 2–50). Cambridge, MA: MIT Press.
where

asvar(n^{1/2}(θ̂GMM,exog − θ∗)) − asvar(n^{1/2}(θ̂LS − θ∗))
= (EPX[X̃X̃′/bexog(X)])^{−1} { EPX[X̃X̃′V1,exog(X)/b²exog(X)]
− EPX[X̃X̃′/bexog(X)] (EPX X̃X̃′)^{−1} (EPX[X̃X̃′V1,exog(X)]) (EPX X̃X̃′)^{−1} EPX[X̃X̃′/bexog(X)] }
× (EPX[X̃X̃′/bexog(X)])^{−1}.

Next, letting a1 := X̃ V^{1/2}_{1,exog}(X)/bexog(X) and a2 := (EPX X̃X̃′)^{−1} X̃/V^{1/2}_{1,exog}(X),
we have

asvar(n^{1/2}(θ̂GMM,exog − θ∗)) − asvar(n^{1/2}(θ̂LS − θ∗)) = A + B,

where

A := (EPX[X̃X̃′/bexog(X)])^{−1} [EPX a1a1′ − (EPX a1a2′)(EPX a2a2′)^{−1}(EPX a2a1′)] (EPX[X̃X̃′/bexog(X)])^{−1}

and

B := (EPX[X̃X̃′/bexog(X)])^{−1} (EPX a1a2′) [(EPX a2a2′)^{−1} − (EPX[X̃X̃′V1,exog(X)])] (EPX a2a1′) (EPX[X̃X̃′/bexog(X)])^{−1}.
By (2.8), var_P∗(Y∗|X∗ = x) = var_P(Y|X = x) for x ∈ supp(X∗).
Hence, asvar(n^{1/2}(θ̂LS − θ∗)) ≤L asvar(n^{1/2}(θ̂GMM,exog − θ∗)) holds under exogenous
stratification and conditional homoskedasticity.
However, as demonstrated in the next example, this result may not hold under
conditional heteroskedasticity.
p1 1B1 (d) + p2 1B2 (d) = p2 because d > 0. Then, it can be verified that
Consequently,

asvar(n^{1/2}(β̂1,LS − β1∗)) = 1.5 > asvar(n^{1/2}(β̂1,GMM − β1∗)) = 1.
This shows that the LS estimator may be asymptotically inefficient compared to the
GMM estimator under conditional heteroskedasticity and exogenous stratification.
APPENDIX B: COMPUTATION
In this appendix, we describe how the SEL estimator was implemented by adapt-
ing the code of Owen (2017). The R function cemplik in Owen (2017) was
originally written for count random variables and allows for ties in the data. Let
Zj := (Yj, Xj) be i.i.d. draws from the realized density dP and assume that each
of the n distinct values of Zj is taken by cj draws, so that the total sample size is
N := Σ_{j=1}^n cj. If we impose on the data the vector of unconditional moment
equalities EP m(Z, θ) = 0, then Owen (2017, p. 2) shows that the empirical
loglikelihood (EL), as a function of θ, and modulo constants not depending on θ,
is obtained by solving (in our notation)

− max_{λ̃} Σ_{j=1}^n cj log(1 + λ̃′m(Zj, θ)). (B.1)
Note how in (B.1) the original sample size N has disappeared, and only the
number n of distinct values of Zj remains. The function cemplik asks for
m := (m(Z1, θ), . . . , m(Zn, θ)) and a vector c := (c1, . . . , cn) as inputs, and delivers
three outputs:

(1) The EL for a given value of θ, computed at the vector λdim(m)×1 of Lagrange
multipliers that maximize (B.1), i.e.,

ELm(θ; c, λ) := − Σ_{j=1}^n cj log(1 + λ′m(Zj, θ)).

(2) The implied probabilities

pj := (cj/N) · 1/(1 + λ′m(Zj, θ)), j = 1, . . . , n.
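For concreteness, the inner maximization in (B.1) for a single scalar moment can be solved by finding the root of its first-order condition. The sketch below is our own Python analog of what cemplik computes (the function name is ours; it assumes the mj take both signs so the maximizer is interior):

```python
import numpy as np
from scipy.optimize import brentq

def weighted_el(m, c):
    """Inner problem of (B.1) for scalar moments: maximize
    sum_j c_j*log(1 + lam*m_j) over lam subject to 1 + lam*m_j > 0.
    Returns the EL value (minus the maximum), lam, and the implied
    probabilities p_j = (c_j/N)/(1 + lam*m_j) with N = sum_j c_j."""
    eps = 1e-10
    lo = -1.0 / m.max() + eps            # keep every 1 + lam*m_j positive
    hi = -1.0 / m.min() - eps
    foc = lambda lam: np.sum(c * m / (1.0 + lam * m))  # first order condition
    lam = brentq(foc, lo, hi)            # foc is decreasing: unique root
    el = -np.sum(c * np.log(1.0 + lam * m))
    p = (c / c.sum()) / (1.0 + lam * m)
    return el, lam, p
```

At the solution the implied probabilities sum to one and satisfy the sample moment equality Σj pj mj = 0, which is a quick correctness check.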
We now describe how to compute SELT(θ) when only the conditional moment
restriction EPY|X[ρ1(Z, θ)|X] = 0 is imposed on the data. In the following, we do
not deal with ties in the data.¹⁸ Instead, we take advantage of the formal resem-
blance of the optimization problem in (B.1) with the one that leads to the smoothed
EL. Indeed, obtaining SELT(θ) only under EPY|X[ρ1(Z, θ)|X] = 0 is equivalent to
solving (3.14) with ρ2 := 0, i.e.,

SELT(θ)|ρ2=0 := − max_{λ̃1,...,λ̃n} Σ_{i=1}^n Ti,n Σ_{j=1}^n wij log(1 + λ̃i ρ1(Zj, θ)). (B.2)
From the first order conditions, it is clear that the maximizers in (B.2) can be
recovered as solutions to n independent maximization problems, namely,

λi := argmax_{λ̃i} Σ_{j=1}^n wij log(1 + λ̃i ρ1(Zj, θ)), i = 1, . . . , n. (B.3)
The elements of c in (B.1) are not constrained to be integers but are only required
to be positive. Hence, comparing (B.1) with (B.3), each λi can be obtained by
invoking cemplik n times with ci := (wi1, . . . , win) as the weights and m replaced
with ρ1 := (ρ1(Z1, θ), . . . , ρ1(Zn, θ)). Consequently,

SELT(θ)|ρ2=0 = Σ_{i=1}^n Ti,n ELρ1(θ; ci, λi) (B.4)

with ELρ1(θ; ci, λ) := − Σ_{j=1}^n wij log(1 + λ ρ1(Zj, θ)). The R commands used to
implement (B.4) are as follows. Let rho1 denote (ρ1(Z1, θ), . . . , ρ1(Zn, θ)),
sel.weights be the n × n matrix whose elements are the kernel weights wij,
and trim the trimming vector (T1,n, . . . , Tn,n). Then, SELT(θ)|ρ2=0 is obtained with the
following code:

emplik.list = apply(sel.weights, MARGIN = 1, function(w) cemplik(rho1, w))
SEL = trim %*% unlist(lapply(emplik.list, '[[', 1))
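These R commands can be transcribed into Python for illustration (function names are ours; the sketch assumes the scalar residuals ρ1 change sign within the sample so the inner maximizers in (B.3) are interior):

```python
import numpy as np
from scipy.optimize import brentq

def inner_el(rho1, w):
    """Solve (B.3) for one i: maximize sum_j w_ij*log(1 + lam*rho1_j),
    returning EL_rho1(theta; c_i, lam_i) with its leading minus sign."""
    eps = 1e-10
    foc = lambda lam: np.sum(w * rho1 / (1.0 + lam * rho1))
    lam = brentq(foc, -1.0 / rho1.max() + eps, -1.0 / rho1.min() - eps)
    return -np.sum(w * np.log(1.0 + lam * rho1))

def sel_objective(rho1, sel_weights, trim):
    """SEL_T(theta)|_{rho2=0} as in (B.4): trimmed sum over i of the
    row-wise EL values, one row of kernel weights w_ij per observation."""
    els = np.array([inner_el(rho1, w) for w in sel_weights])
    return float(trim @ els)
```

Here sel_weights plays the role of the n × n kernel-weight matrix and trim that of (T1,n, . . . , Tn,n).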
Finally, we show how to impose a conditional and an unconditional moment
restriction on the data, i.e., compute the objective function SELT(θ, Q−L) defined
max_{λ̃i} Σ_{j=1}^n wij log(1 + λ̃i ρ1(Zj, θ) + μ̄′ρ2(Zj, Q−L)), i = 1, . . . , n. (B.5)
NONPARAMETRIC KERNEL REGRESSION USING COMPLEX SURVEY DATA

Luc Clair
Department of Economics, University of Winnipeg, Canada
ABSTRACT
Applied econometric analysis is often performed using data collected from
large-scale surveys. These surveys use complex sampling plans in order
to reduce costs and increase the estimation efficiency for subgroups of the
population. These sampling plans result in unequal inclusion probabilities
across units in the population. The purpose of this paper is to derive the
asymptotic properties of a design-based nonparametric regression estima-
tor under a combined inference framework. The nonparametric regression
estimator considered is the local constant estimator. This work contributes
to the literature in two ways. First, it derives the asymptotic properties
for the multivariate mixed-data case, including the asymptotic normality of
the estimator. Second, I use least squares cross-validation for selecting the
bandwidths for both continuous and discrete variables. I run Monte Carlo
simulations designed to assess the finite-sample performance of the design-
based local constant estimator versus the traditional local constant estimator
for three sampling methods, namely, simple random sampling, exogenous
stratification and endogenous stratification. Simulation results show that the
174 LUC CLAIR
1. INTRODUCTION
Nonparametric methods for estimating conditional mean functions have emerged
as viable alternatives to standard parametric methods. However, the discussion of
these methods in a complex survey setting has been kept to the survey statistics
literature despite their applications in economics. There are two reasons why these
methods should appeal to economists. Firstly, in applied economic analysis, the
regression functional form is rarely known up to a parametric specification. Economic
theory provides arguments for the inclusion or exclusion of variables in a model
but generally does not specify the functional form of the conditional mean function
(Yatchew, 1998). Consistent estimation using parametric methods requires that the
researcher perfectly specifies the functional form of the conditional mean function
prior to estimation (Li & Racine, 2007). In practice, though, one cannot be certain
that the correct parametric model has been chosen. Alternatively, nonparametric
estimators do not rely on functional form assumptions and are therefore free of
functional form misspecification. They simply assume that the conditional mean function exists
and that it follows certain regularity conditions, such as smoothness.
Secondly, many large-scale surveys use complex sampling plans in order to
reduce costs and increase the estimation efficiency for subgroups of the popula-
tion (Lohr, 2010). As in Binder and Roberts (2009), the term complex sampling
plan refers to any sampling method other than simple random sampling (SRS) that
results in the population units having nonuniform probabilities of being selected
into the sample. The data sets derived from these surveys typically offer a broader
range of variables (e.g., income, health, education, demographic variables, etc.)
and are easier to access than administrative data sets, making them popular among
economic researchers. The complex sampling methods used in these surveys dis-
proportionately sample subgroups of the population, leading to a sample that is
systematically unrepresentative of the finite population from which it is drawn. In
this case, finite population descriptive statistics cannot be consistently estimated by
their respective sample analogs. In order to consistently estimate the finite popula-
tion statistics, design-based estimators that weight each observation by the inverse
of the unit’s probability of being selected in the sample must be used. Solon, Haider,
and Wooldridge (2013) described a sample of units with unequal probabilities of
Nonparametric Kernel Regression Using Complex Survey Data 175
y = g(x) + u, (1)
where g(x) = E(y|x) is the unknown quantity of interest and u is the error term.
In a survey statistics framework, it is presumed that a finite population U of size
N is generated based on realizations of the random variables (y, x) with a joint
probability distribution f (y, x) (Buskirk & Lohr, 2005; Harms & Duchesne, 2010).
From the finite population, a sample S of size n is selected based on a complex
sampling plan. Each individual j in the finite population U has a probability πj of
being included in S and the probability of being selected depends on the sampling
methods that are implemented. The number of population units a given sample
unit j represents is then given by the weight variable wj = πj−1 .
If data for all N observations in the finite population were available, g(x) could
be estimated using the local constant estimator:
ĝU(x) = [Σ_{j=1}^N yj Kγ,jx] / [Σ_{j=1}^N Kγ,jx], (2)
ĝ(x) = [Σ_{i=1}^n πi^{−1} yi Kγ,ix] / [Σ_{i=1}^n πi^{−1} Kγ,ix]. (4)
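In code, Eq. (4) is just a ratio of inverse-probability-weighted kernel sums. A minimal sketch of ours for a single continuous regressor with a Gaussian kernel (the function name is an assumption, not from the paper):

```python
import numpy as np

def ht_local_constant(x0, xs, ys, pi, h):
    """Design-based local constant estimator of g(x0), Eq. (4): kernel
    weights are scaled by the inverse inclusion probabilities 1/pi_i."""
    k = np.exp(-0.5 * ((xs - x0) / h) ** 2)  # Gaussian kernel (constants cancel)
    w = k / pi                               # inverse probability weighting
    return np.sum(w * ys) / np.sum(w)
```

When every πi is equal, the weights cancel and the estimator reduces to the ordinary unweighted local constant estimator computed on the sample.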
2. MODEL-ASSISTED NONPARAMETRIC
REGRESSION ESTIMATOR
Let U = {1, . . . , N } denote a finite population of N units. For each
j ∈ U , the outcome variable yj and auxiliary variables xj = (xjd , xjc ) =
(xjd1 , . . . , xjdr , xjc1 , . . . , xjcq ) are realizations of the random variables (y, x), where
(y, x) follow a joint distribution f (y, x). xj is a (q + r) × 1 vector where the super-
scripts d and c denote that the variable is discrete or continuous, respectively. I
use xjct to denote the tth component of xjc and xjdt for the tth component of xjd and
assume that xjdt takes ct ≥ 2 different values in Dt = {0, 1, . . . , ct − 1}, t = 1, . . . , r.
Next, a sample S of size ns is drawn based on a complex sampling plan pN (·), where
pN(S) is the probability of drawing the sample S. The sampling rate is Q = ns/N,
with first order inclusion probabilities πj = Pr(j ∈ S) = Σ_{S: j∈S} pN(S) and second
order inclusion probabilities πij = Pr(j, i ∈ S) = Σ_{S: i,j∈S} pN(S). The variable ns
may be fixed (as in SRS) or random; however, I do not specify a sampling plan.
The first and second order probabilities are the probabilities of obtaining the unit
j and units j and i, respectively, while sampling from the population according
to the complex sampling design. The goal is to estimate the conditional mean
function g(x) from Eq. (1).
If data were available for every i ∈ U then g(x) in (1) could be estimated using the
local constant estimator. The local constant estimator was proposed by Nadaraya
(1964) and Watson (1964), who wanted to estimate conditional mean functions as a
locally weighted average, using a kernel as a weighting function. The mathematical
definition of E(y|x) is
E(y|x) = ∫ y f(y|x) dy = ∫ y [f(y, x)/f(x)] dy, (5)

where f(y|x) is the conditional density of y given x, f(y, x) is the joint density of
y and x and f (x) = f (x c , x d ) is the joint probability density function of (x c , x d ).
Nadaraya (1964) and Watson (1964) proposed substituting f(y, x) and f(x) by
their kernel density estimates. For the discrete regressors xtd, t = 1, . . . , r, a vari-
ation on Aitchison and Aitken’s (1976) kernel function can be used (for scalar xd)
or embedded in a product kernel (for multivariate xd). This function is defined by
l(xitd, xjdt, λ) = { 1 if xitd = xjdt; λt otherwise }, (6)
where λt ∈ [0, 1] is the smoothing parameter for xtd . When λt = 0, the above kernel
function becomes an indicator function, and when λt = 1, it is a constant function
and the (irrelevant) variable gets smoothed out. Here, a match between xitd and xtd
determines the value of the discrete kernel function (Sánchez-Borrego et al., 2014;
Opsomer and Miller, 2005). The product kernel function for a vector of discrete
variables is defined as
L(xid, xd, λ) = Π_{t=1}^r λt^{1−1(xitd = xtd)},
where 1( · ) is the indicator function, which takes a value of 1 if the logical argument
in the brackets is true and 0 otherwise.
Using k to denote a symmetric, univariate density function, the product kernel
for continuous variables is defined by

Wh(xc, xic) = Π_{t=1}^q (1/ht) k((xitc − xtc)/ht),
where 0 < ht < ∞ is the smoothing parameter for continuous variable xtc . The
shape of W depends on the choice of kernel function and the bandwidth. The
distance between xitc and xt is the traditional Euclidean distance (Sánchez-Borrego
et al., 2014). Defining γ = (h, λ), the multivariate mixed-data product kernel is
given by Kγ ,ix = Wh (x c , xic )L(xid , x d , λ). The local constant estimator is derived
by substituting the kernel density estimators f˜(x, y) and f˜(x) for f (x, y) and
f (x) = f (x d , x c ), respectively, in Eq. (5). After analytic integration, with a bit of
algebra, the local constant estimator can then be written as
ĝU(x) = [Σ_{i∈U} yi Kγ,ix] / [Σ_{i∈U} Kγ,ix]. (7)
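The mixed-data kernel Kγ,ix used above multiplies the continuous product kernel Wh by the discrete kernel L. A small sketch of ours (Gaussian k; the function name is an assumption):

```python
import numpy as np

def mixed_product_kernel(xc_i, xd_i, xc, xd, h, lam):
    """K_{gamma,ix} = W_h(x^c, x_i^c) * L(x_i^d, x^d, lam): Gaussian product
    kernel over the q continuous components and the discrete kernel of
    Eq. (6) over the r discrete components."""
    W = np.prod(np.exp(-0.5 * ((xc_i - xc) / h) ** 2)
                / (np.sqrt(2.0 * np.pi) * h))    # prod_t k(.)/h_t
    L = np.prod(np.where(xd_i == xd, 1.0, lam))  # lam_t on each mismatch
    return W * L
```

With λt = 0 the discrete kernel acts as an indicator (any mismatch zeroes the weight); with λt = 1 the discrete component is smoothed out entirely.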
The benefit of using this method over parametric regression techniques is that
it does not require the practitioner to specify the exact functional form of
Under SRS, the estimator proposed by Sánchez-Borrego et al. (2014) becomes the
traditional nonparametric estimator from Eq. (3).
There are three methods of inference when deriving the asymptotic properties
of estimators in survey sampling, namely, model-based inference, pure design-
based inference and combined inference. Model-based inference assumes the data
(y, x) are generated based on a given model and that the inclusion probabilities are
uninformative. Using this mode of inference relies heavily on model specification
as it is assumed that the model represents all units of the population (Lohr, 2010).
Naturally, this increases the appeal of nonparametric methods, such as the local
constant described above, which makes no assumptions about the functional form
of the model (other than smoothness and existence). However, if the sampling is
endogenous, model-based estimators will not take this into account and results
will be inconsistent. If one takes a model-based approach, then the estimator in
Eq. (3) is appropriate.
In pure design-based settings, inference depends on the probability distribu-
tion induced by the sampling design and not the probability distribution from an
underlying model. Inferences drawn using a design-based approach typically refer
to a particular finite population of interest and usually ignore any model structure
in the corresponding superpopulation. Expectations are taken with respect to the
sampling scheme. Therefore, asymptotic results depend on the sample size, the
sampling design, and the bandwidths h and λ (Buskirk and Lohr, 2005). Sánchez-
Borrego et al. (2014) adopted a design-based setting for deriving the asymptotic
properties for their model-assisted local constant estimator in Eq. (8). Under the
assumption of i.i.d. data, the authors showed that the estimator is asymptotically
design-unbiased and design-consistent with probability one.
Assumption 2.1.
Denote X as the compact support of x. Then, g(x), f(x) and σ²(x) = E(ui²|xi)
are second order differentiable in X. Letting As(x) and Ass(x) denote the first
and second order derivatives of any function A w.r.t. xs, gss(x)²f(x) > 0
for all s = 1, . . . , q (Li and Racine, 2007).
Assumption 2.2.
The kernel function k(·): R → R is symmetric with k(v) ≥ 0 for all v ∈ R, and
bounded⁴ by a finite constant z so that k(v) ≤ z. k(·) is m times differentiable with
∫ k(v)v² dv < ∞. k(·) is a second order kernel; define κ2 = ∫ v²k(v)dv and
κ = ∫ k²(v)dv (Li and Racine, 2007).
Assumption 2.3.
(hk,1, . . . , hk,q, λk,1, . . . , λk,r) ∈ [0, ηk]^{q+r} lies in a shrinking set, where ηk = ηNk is
a positive sequence that converges to zero at a rate slower than the inverse of
any polynomial in Nk, and Nk hk,1 · · · hk,q ≥ tNk with tNk → ∞ as k → ∞ (Li and
Racine, 2007).
Assumption 2.4.
The sampling plan is designed such that:

1. The sampling rate is such that limk ns,k/Nk = Q ∈ [0, 1).
2. For all N, minj∈U πj ≥ π∗ > 0, with probability one, mini,j∈U πij ≥ π∗∗ > 0, and

lim sup_{k→∞} ns,k max_{j,i∈U: j≠i} |πij − πi πj| < ∞.

3. Δk(nk, Nk, Pk) = nk Nk^{−2} Σ_{j∈Uk} (πj^{−1} − 1) = O(1).
fˆ(x) = N^{−1} Σ_{i∈S} πi^{−1} Kγ,ix = N^{−1} Σ_{i=1}^N πi^{−1} 1(i ∈ S) Kγ,ix. (9)
Next, take the expectation of (9) using the combined inference method.
EC(fˆ(x)|x) = Eξ{EP(fˆ(x)|π)|x} = Eξ[EP(N^{−1} Σ_{i∈S} πi^{−1} Kγ,ix | π) | x]
= Eξ[EP(N^{−1} Σ_{i=1}^N πi^{−1} 1(i ∈ S) Kγ,ix | π) | x]
= Eξ[N^{−1} Σ_{i=1}^N πi^{−1} EP(1(i ∈ S)|π) Kγ,ix | x]
= Eξ[N^{−1} Σ_{i=1}^N πi^{−1} πi Kγ,ix | x]
= Eξ[N^{−1} Σ_{i=1}^N Kγ,ix | x]
= Σ_{td∈D} ∫_{Rq} [Π_{s=1}^q hs^{−1} w((ts − xs)/hs)] [Π_{s=1}^r λs^{1(tsd ≠ xsd)}] f(tc, td) dtc
= ∫_{Rq} [Π_{s=1}^q w(vs)] f(xc + hv, xd) dv
+ Σ_{s=1}^r λs Σ_{td∈D} 1(td, xd) ∫_{Rq} [Π_{s=1}^q w(vs)] f(xc + hv, td) dv
= f(x) + (κ2/2) Σ_{s=1}^q hs² fss(x) + Σ_{s=1}^r Σ_{td∈D} 1(td, xd) f(xc, td) λs
+ o(Σ_{s=1}^q hs² + Σ_{s=1}^r λs)
= Eξ(f̃(x)|x). (10)
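The design-expectation step in (10), EP(1(i ∈ S)|π) = πi, is what makes fˆ(x) design-unbiased for the full-population estimate f̃(x). This can be checked by simulation; the Poisson sampling design and all constants below are illustrative choices of ours, not the chapter's:

```python
import numpy as np

rng = np.random.default_rng(1)

# One fixed finite population drawn from the superpopulation model.
N = 2000
x = rng.standard_normal(N)

# Informative design: units with x > 0 are twice as likely to be sampled.
pi = np.where(x > 0.0, 0.4, 0.2)

x0, h = 0.0, 0.3
K = np.exp(-0.5 * ((x - x0) / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
f_tilde = K.mean()                       # full-population estimate f~(x0)

# Average the design-based estimator (9) over repeated Poisson samples.
reps = 3000
est = np.empty(reps)
for r in range(reps):
    s = rng.uniform(size=N) < pi         # inclusion indicators
    est[r] = np.sum(K[s] / pi[s]) / N    # f^(x0) from Eq. (9)

mc_mean = est.mean()                     # should be close to f_tilde
```

Despite the informative design, the Monte Carlo average of fˆ(x0) matches f̃(x0) closely, while the unweighted sample average of K would not.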
ĝ(x) − g(x) = m̂(x)/fˆ(x), (11)

where m̂(x) = (ĝ(x) − g(x))fˆ(x). Using the equation for the regression model
with additive errors (1), m̂(x) can be written as

m̂(x) = N^{−1} Σ_{i=1}^N πi^{−1} 1(i ∈ S)[g(xi) − g(x)]Kγ,ix
+ N^{−1} Σ_{i=1}^N πi^{−1} 1(i ∈ S) ui Kγ,ix,
where the definition of m̂1 (x) and m̂2 (x) should be evident. In Section A.1 of
the appendix, I show that the leading term of the expectation of m̂1 (x) under the
combined framework is
EC[m̂1(x)|x] = (κ2/2) Σ_{s=1}^q hs² [gss(x)f(x) + 2gs(x)fs(x)]
+ Σ_{s=1}^r Σ_{td∈D} 1(td, xd)[g(xc, td) − g(xc, xd)]f(xc, td) λs
+ o(Σ_{s=1}^q hs² + Σ_{s=1}^r λs), (12)
that the variance of the model-assisted estimator under the combined framework is
equal to the variance under the model framework multiplied by a correction factor:
varC{[(ĝ(x) − g(x))|π]|x} = (N^{−2} n Σ_{i=1}^N (wi − 1) + n/N) · κ^q σ²(x)/(n h1 · · · hq f(x))
+ O((N h1 · · · hq)^{−1} (Σ_{s=1}^q hs² + Σ_{s=1}^r λs)), (13)
where κ^q σ²(x)/(n h1 · · · hq f(x)) is the leading term of varξ(g̃(x)). Note that
under SRS, the correction factor equals one and Eq. (13) reduces to the variance of
g̃(x) evaluated over the sample data. Combining these two results proves Theorem
2.1, which is an extension of Theorem 1 in Harms and Duchesne (2010):
Theorem 2.1. If Assumptions 2.1–2.4 are satisfied, then the conditional point-
wise MSE of the model-assisted local constant estimator ĝ(x) under the
combined inference mode is given by
MSE(ĝ(x)) = [f^{−1}(x) ((κ2/2) Σ_{s=1}^q hs² (gss(x) + 2gs(x)fs(x))
+ Σ_{s=1}^r Σ_{td∈D} 1(td, xd)(g(xc, td) − g(xc, xd))f(xc, td) λs)]²
+ (Δ + Q) κ^q σ²(x) f^{−1}(x)/(N h1 · · · hq)
+ op((N h1 · · · hq)^{−1} (Σ_{s=1}^q hs² + Σ_{s=1}^r λs) + (Σ_{s=1}^q hs² + Σ_{s=1}^r λs)²),

where Δ = N^{−2} n Σ_{i=1}^N (wi − 1) and Q = n/N.
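The correction factor (Δ + Q) can be computed directly from the design weights, and under SRS it equals one, which is easy to verify numerically (the function name is ours, and Δ is our label for the first term of the factor):

```python
import numpy as np

def correction_factor(pi, n):
    """Delta + Q with Delta = N^{-2} * n * sum_i (w_i - 1), w_i = 1/pi_i,
    and Q = n/N, as in Theorem 2.1."""
    N = pi.size
    w = 1.0 / pi
    delta = n * np.sum(w - 1.0) / N ** 2
    return delta + n / N
```

Under SRS every πi = n/N, so Δ + Q = (1 − n/N) + n/N = 1 and the MSE reduces to its i.i.d. form; a design with unequal inclusion probabilities (holding n = Σi πi) pushes the factor above one.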
Theorem 2.2. Under the assumption that x is an interior point and Assumptions
2.1–2.4 are satisfied, the asymptotic normality of ĝ(x) is given by

√(N h1 · · · hq) (ĝ(x) − g(x) − Σ_{s=1}^q B1s(x)hs² − Σ_{s=1}^r B2s(x)λs)
→d N(0, (Δ + Q) κ^q σ²(x)/f(x)), (14)

where B1s(x) = (κ2/2)[gss(x) + 2gs(x)fs(x)] and B2s(x) = Σ_{td∈D} 1(td, xd)[g(xc, td) − g(xc, xd)]f(xc, td).
Theorem 2.2 can be used to compute confidence intervals for ĝ(x) under the
combined inference framework. Let N̂ = Σ_{i=1}^n πi^{−1} and

σ̂²(x) = [Σ_{i=1}^n (yi − ĝ(xi))² πi^{−1} Kγ,ix] / [Σ_{i=1}^n πi^{−1} Kγ,ix],
3. BANDWIDTH SELECTION
A critical component to any nonparametric regression technique is the choice
of the smoothing parameters (h, λ). Selecting the smoothing parameters for the
q continuous variables creates a trade-off between the bias and variance of the
estimator. Large values of hs will oversmooth the underlying density and increase
the bias while reducing the variance. Conversely, small hs will undersmooth the
underlying density shrinking the bias but increasing the variability of the estimator.
For the univariate continuous variable case, Harms and Duchesne (2010) used
a modified plug-in method for selecting the bandwidth according to the MSE
criterion in the combined inference mode. The optimal bandwidth in that case was
equal to that of the i.i.d. case multiplied by a correction factor equal to (Δ + Q).
This method is not applicable to the multivariate case and, therefore, not applicable
for the present model. In their simulations, Sánchez-Borrego et al. (2014) used a
plug-in method for the bandwidth in which they selected three values for h and
five values for λ. In addition, they used survey cross-validation to choose among
the 15 possible combinations of the fixed values for h and λ.
It is widely accepted that data-driven methods for selecting the bandwidths
in a nonparametric kernel regression setting are required for proper inference
and analysis. I propose using LSCV, a data-driven method for selecting (h, λ) =
(h1 , . . . , hq , λ1 , . . . , λr ). LSCV chooses (h, λ) to minimize the following cross-
validation function:
where g−i (xi ) = j ∈S,j =i πj−1 yj Kγ , ij/( j ∈S,j =i πj−1 Kγ , ij ) is the leave-one-
q
out kernel estimator of g(xi ) and Kij = j =i k((xisc − xjcs )/ hs ) rs=1 λαs with α =
1(xis = xj s ) is equal to 1 if xis = xj s and zero otherwise (Li and Racine, 2007). 0 <
M(xi ) < 1 is a weight function which serves to avoid difficulties caused by dividing
by zero. Using the leave-one-out kernel estimator helps to avoid
a computational
issue encountered when optimizing according to i∈S (ûi )2 = i∈S (yi − ĝ(xi ))2 .
By letting $h_s\to 0$, $\hat g(x_i)$ can be made arbitrarily close to $y_i$ for any $i\in S$; hence $\sum_{i\in S}(\hat u_i)^2$ can be made very small as $h\to 0$. At the same time, $\mathrm{MSE}_C$ remains greater than zero for all values of $h$ (Opsomer and Miller, 2005). Replacing $\hat g(x_i)$ with the delete-one estimator ensures that the difference $y_i-\hat g_{-i}(x_i)$ does not go to zero as $h\to 0$. The analysis that follows requires the following assumption about the weight function $M(x_i)$:
Assumption 3.1.
M(·) is continuous, nonnegative and has a compact support S.
In Section A.3 of the appendix, I show that, if we ignore the terms unrelated to $(h,\lambda)$, the leading term of $\mathrm{CV}(h,\lambda)$ is given by $E[\mathrm{CV}_0(h,\lambda)]$:
$$E[\mathrm{CV}_0(h,\lambda)] = \int\left\{\left[\sum_{s=1}^{q}h_s^2B_s(x)+\sum_{s=1}^{r}\lambda_sD_s(x)/f(x)\right]^2+\frac{(1+Q)\,\kappa^q\sigma^2(x)}{Nh_1\cdots h_q\,f(x)}\right\}M(x)\,dx. \tag{16}$$
Therefore, the leading term of the LSCV criterion is the same as for an i.i.d. sample except for the correction factor $(1+Q)$ on the second term on the right-hand side of Eq. (16).
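The LSCV procedure can be sketched by minimizing the weighted leave-one-out criterion over a bandwidth grid. The snippet below is a hedged Python illustration with a single continuous regressor, a Gaussian kernel, and made-up inclusion probabilities; the chapter's actual implementation uses continuous optimization via R's np package:

```python
import numpy as np

def loo_cv(h, x, y, pi):
    """Design-weighted least-squares cross-validation criterion
    sum_i pi_i^{-1} (y_i - ghat_{-i}(x_i))^2 with a leave-one-out
    local constant estimator (single continuous regressor)."""
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)          # Gaussian kernel; constants cancel
    np.fill_diagonal(K, 0.0)         # leave-one-out: drop j = i
    W = K / pi[None, :]              # pi_j^{-1} K_{gamma,ij}
    ghat_loo = (W @ y) / W.sum(axis=1)
    return np.sum((y - ghat_loo) ** 2 / pi)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 300)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, 300)
pi = np.full(300, 0.1)
grid = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
h_star = min(grid, key=lambda h: loo_cv(h, x, y, pi))
```

Very large bandwidths oversmooth the sinusoidal mean and inflate the criterion, so the grid search settles on an intermediate value.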
188 LUC CLAIR
I consider four DGPs for g(x), which are outlined in Table 1. The first DGP,
g1 (X), was considered in Sánchez-Borrego et al. (2014) and is simply linear
in the continuous variable, x1c . In DGP two, the relationship between y and x1c
is quadratic introducing a further degree of smoothness. Population three, also
known as bump, was considered by Harms and Duchesne (2010) and Sánchez-
Borrego et al. (2014). This function produces a noticeable bump at x1c = 0.5.
Population four is the most complex function considered. The Härdle function
is characterized by a peak at x1c = 0.63, a valley at x1c = 0.91, and a saddle point
at x1c = 0.79. The error term for each population is assumed to be normally
distributed with mean of zero and standard deviation σ .
(3) Next, I draw a sample of size n based on three different sampling methods: SRS without replacement, stratification based on $x_1^c$, and stratification based on $y_i$, $i = 1, 2, 3, 4$. For stratified samples, the population was divided into three strata of varying size, with unequally sized samples being drawn from each stratum. Within-stratum sampling was performed by SRS without replacement. Table 2 displays the strata borders and the sample size drawn from each stratum.
(4) Using the sample data, I estimate g(x) using nonparametric and paramet-
ric estimation methods. The nonparametric estimators I consider are the
design-based local constant estimator and the traditional local constant
estimator, denoted by WLC and LC, respectively:
$$\mathrm{WLC}:\ \hat g(x) = \frac{\sum_{i=1}^{n}\pi_i^{-1}y_iK_{\gamma,ix}}{\sum_{i=1}^{n}\pi_i^{-1}K_{\gamma,ix}}, \qquad \mathrm{LC}:\ \tilde g(x) = \frac{\sum_{i=1}^{n}y_iK_{\gamma,ix}}{\sum_{i=1}^{n}K_{\gamma,ix}}.$$
Both WLC and LC are computed using the Gaussian kernel for the continuous variable $x_1^c$ and the variant of the Aitchison and Aitken (1976) kernel in (6) for the discrete variables $x_1^d$ and $x_2^d$. The bandwidths for WLC are computed using the LSCV method discussed in Section 3. The bandwidths for LC are computed using the LSCV method outlined in Li and Racine (2007, p. 69). The WLC estimator and LSCV method were implemented in R using the np package; the code is available from the author upon request. Note that I consider only the relevant-data case: I expect the bandwidths $\lambda_r < 1$, $r = 1, 2$, for $x_1^d$ and $x_2^d$. I let $h$ denote the bandwidth for $x_1^c$. For parametric estimation, I assume $g(x) = g(\beta, x) = \beta_1 x_1^c + \beta_2 x_1^d + \beta_3 x_2^d$. Therefore, OLS and WLS will be correctly specified in the estimation of $g_1(x)$ but misspecified in the estimation of $g_2(x)$ to $g_4(x)$.
(5) Finally, I compute the sample mean squared error
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat h(x_i)-g(x_i)\right)^2,$$
where $\hat h(x)$ denotes one of the estimators under comparison.
Tables 3–5 report the median MSE values from the Monte Carlo simulations
for all DGPs. To save space, I only report the results for n = 250 and n =
1,000. The values in brackets below the reported MSEs are the median absolute deviations (MAD) of the MSEs, a robust measure of the variability of the MSE (Andersen, 2008). The MAD is calculated as $\mathrm{median}\{|\mathrm{MSE}_m[\hat h(x)]-\mathrm{median}\{\mathrm{MSE}_m[\hat h(x)]\}|\}$ with $m = 1, \ldots, 1{,}000$, where $\hat h(x)$ is one of WLC, LC, WLS or OLS. Compared to the standard deviation, the MAD is more resilient to outliers.
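The median MSE and its MAD, as reported in the tables, can be computed as follows (a minimal sketch with made-up Monte Carlo MSE draws):

```python
import numpy as np

def median_mse_and_mad(mse_draws):
    """Median MSE across Monte Carlo replications and its median
    absolute deviation, the robust spread measure used in the tables."""
    mse_draws = np.asarray(mse_draws, dtype=float)
    med = np.median(mse_draws)
    mad = np.median(np.abs(mse_draws - med))
    return med, mad

# Five hypothetical MSE draws; the outlier 0.50 barely moves the MAD.
med, mad = median_mse_and_mad([0.10, 0.12, 0.11, 0.50, 0.09])
```

Here the median is 0.11 and the MAD is 0.01; the standard deviation, by contrast, would be dominated by the single outlying draw, which is exactly why the MAD is used.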
Table 3 displays the results from SRS. Not surprisingly, the median MSEs are
equal for the WLC and LC estimators, as well as the MSEs for the WLS and OLS
estimators. Under SRS, the inclusion probabilities are equal for all individuals and
WLC reduces to LC. As Harms and Duchesne (2010) point out in their simulations,
the results from SRS act as a benchmark for other sampling plans. Keeping n
constant, as σ increases, so too does the median MSE of each estimator. In order
for WLC to be a consistent estimator, the MSE needs to decrease as the sample size
increases. Keeping σ constant and increasing n, the MSE decreases for all DGPs.
This provides evidence that the estimator is consistent. As the functions increase in
the degree of complexity, the MSE also increases. The MADs for both the weighted and unweighted nonparametric estimators are likewise equal for all combinations of n and σ. Note that when the DGP is linear, the correctly specified parametric models report lower median MSEs than the nonparametric estimators.
However, when the parametric model is misspecified, the nonparametric estimators
perform better based on the sample MSEs.
Nonparametric Kernel Regression Using Complex Survey Data 191
Table 3. Median MSE for WLC, LC, WLS and OLS Under Simple Random
Sampling.
DGP n σ WLC LC WLS OLS
The results from stratification on the outcome variable are presented in Table 4.
Here, weighting by inclusion probabilities clearly shows an improvement as the
median MSE is smaller for WLC than it is for LC for all combinations of DGP,
n, and σ . By not accounting for endogenous sampling and unequal inclusion
probabilities, the traditional local constant performs worse than the weighted local
constant. Again, increasing the level of noise in the model reduces the efficiency
of each estimator. WLC remains consistent as the median MSE decreases and the
Table 4. Median MSE for WLC, LC, WLS and OLS Under Endogenous
Stratification.
DGP n σ WLC LC WLS OLS
MAD decreases as n increases. The MADs suggest that WLC is more variable than LC for small sample sizes: when n = 250, MAD(WLC) is greater than MAD(LC) for all DGPs. As the sample size increases, however, MSE(WLC) becomes less variable than MSE(LC). Overall, these results show that WLC dominates LC under endogenous stratification. WLS regression performs best under the linear DGP; however, as the DGP becomes more complex, WLC outperforms both (misspecified)
Table 5. Median MSE for WLC, LC, WLS and OLS Under Exogenous
Stratification.
DGP n σ WLC LC WLS OLS
parametric estimators. In fact, as the DGPs become more complex, the parametric
estimators become inconsistent.
Table 5 displays the results from stratification on the continuous predictor variable $x_1^c$. Since $x_1^c$ is included in the model, the sampling scheme is exogenous. These results mimic those from SRS in Table 3. The two nonparametric estimators report nearly identical median MSEs, with the small differences explained by simulation error. These results indicate that the sample design had no effect on the
estimation of g(x).
4.2. Bandwidths
Tables 6–8 report the median bandwidths for WLC and LC under SRS, exogenous
stratification, and endogenous stratification, respectively. The superscripts w and
u denote the bandwidths for WLC and LC, respectively. Under SRS, WLC and
LC select the same bandwidths (see Table 6). Similar results are found under
exogenous stratification in Table 7. Under endogenous stratification, WLC selects smaller bandwidths than LC. This suggests that the design-based estimator chooses a lower degree of smoothing. Under all three sampling plans, both WLC and LC
choose bandwidths for discrete variables x1d and x2d that are below 1. This is
encouraging as both variables are relevant in each of the models considered.
These results show that there is no loss in efficiency from using a design-based estimator when it is not required for consistent estimation of g(x). Under SRS, the weighted estimator reduces to the traditional nonparametric regression estimator and the results are equivalent, regardless of the underlying DGP. Comparing across sampling schemes, WLC is most efficient under SRS and least efficient under endogenous stratification.
5. APPLICATION
The application considered here is an extension of the example in Harms and
Duchesne (2010). The authors used data from the 2000 cycle of the Survey of
Labour and Income Dynamics (SLID) to estimate the relationship between age and
LMD for approximately 58,000 individuals (Statistics Canada, 2013). The purpose
of the SLID is to understand the economic well-being of Canadians, collecting data
on the primary source of income, education and demographic backgrounds of its
participants. The sampling scheme is based on a stratified, multistage design that
uses probability sampling. The result is unequal sampling weights for individuals
in the sample. The weights not only represent the sampling plan but also account
for nonresponse and are calibrated to meet certain benchmark criteria.
The application presented in this paper differs in two ways from Harms and
Duchesne (2010). First, in order to help reduce heteroskedasticity in the model,
the outcome variable I consider is LMD as a percent of age over 18. Looking only
at LMD versus age could lead to heteroskedastic errors as the variance of LMD for
older individuals is likely to be higher compared to younger individuals. Second,
I extend the model to include a discrete variable “Gender,” which takes on two
values, male or female.
I estimate the model using both the model-assisted local constant estimator, WLC, and the traditional local constant estimator, LC. In both cases, the Gaussian kernel was used for the continuous variable "Age" and the variant of the Aitchison and Aitken (1976) kernel was used for the discrete variable "Gender." The bandwidths for the WLC estimator were computed using the cross-validation method described in Section 3. The bandwidths for the LC estimator were computed using the method outlined in Li and Racine (2007, Chapter 2). The gray and black
lines in Figure 1 represent the weighted regression results for males and females,
respectively. It is clear that these two curves are pulled closer to observations
which represent a greater number of individuals in the population compared to
the unweighted estimates (the dashed lines in Figure 1). Results also show that females spend a smaller percentage of time in the labor force than men, as the black lines lie below the gray lines. The bandwidths for gender are 0.0313 and 0.0287 for LC and WLC, respectively; both are less than 1, indicating that gender is a relevant predictor of LMD.
Fig. 1. WLC and LC estimates of LMD as a percent of age over 18, by age, for males and females, with confidence intervals (CI) for the WLC estimates.
6. CONCLUSION
ACKNOWLEDGMENTS
I am grateful for the input and guidance I received from Dr. Jeff Racine, Dr. Jerry
Hurley, Dr. Phil DeCicca, Dr. Arthur Sweetman, and Dr. Michael Veall. Further-
more, I would like to thank participants at various seminars and conferences for
their feedback.
REFERENCES
Aitchison, J., & Aitken, C. G. G. (1976). Multivariate binary discrimination by the Kernel method.
Biometrika, 63, 413–420.
Andersen, R. (2008). Modern methods for robust regression. Thousand Oaks, CA: SAGE Publications, Inc.
Bellhouse, D., & Stafford, J. (1999). Density estimation from complex surveys. Statistica Sinica, 9,
407–424.
Binder, D. A., & Roberts, G. (2009). Design- and model-based inference for model parameters. Handbook of Statistics, 29B, 33–54.
Bravo, F., Huynh, K. P., & Jacho-Chávez, D. T. (2011). Average derivative estimation with miss-
ing responses. In D. M. Drukker (Ed.), Missing data methods: Cross-sectional methods and
applications (1st ed., Vol. 27A, pp. 129–154). Bingley: Emerald Group Publishing Limited.
Breidt, F. J., & Opsomer, J. D. (2000). Local polynomial regression estimators in survey sampling.
The Annals of Statistics, 28, 1026–1053.
Buskirk, T. D., & Lohr, S. L. (2005). Asymptotic properties of Kernel density estimation with complex
survey data. Journal of Statistical Planning and Inference, 128, 165.
Harms, T., & Duchesne, P. (2010). On kernel nonparametric regression designed for complex survey
data. Metrika, 72, 111–138.
Hausman, J., & Wise, D. (1981). Stratification on endogenous variables and estimation: The Gary
income maintenance experiment. In The analysis of discrete economic data. Cambridge, MA:
MIT Press.
Isaki, C., & Fuller, W. (1982). Survey design under the superpopulation model. Journal of the American
Statistical Association, 77(377), 89–96.
Li, Q., & Racine, J. S. (2004). Cross-validated local linear regression. Statistica Sinica, 14, 485–512.
Li, Q., & Racine, J. S. (2007). Nonparametric econometrics. Princeton, NJ: Princeton University Press.
Lohr, S. L. (2010). Sampling: Design and analysis (2nd ed.). Boston, MA: Brooks/Cole.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and Its Applications, 9, 141–142.
Opsomer, J. D., & Miller, C. P. (2005). Selecting the amount of smoothing in nonparametric regression
estimation for complex surveys. Journal of Nonparametric Statistics, 17(5), 593–611.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International
Statistics Review, 61, 317–337.
Sánchez-Borrego, I., Opsomer, J. D., Rueda, M., & Arcos, A. (2014). Nonparametric estimation with
mixed data types in survey sampling. Revista Matemática Complutense, 27, 685–700.
Särndal, C.-E., Swensson, B., & Wretman, J. (1992). Model-assisted survey sampling. New York, NY: Springer.
Solon, G., Haider, S. J., & Wooldridge, J. (2013). What are we weighting for? NBER Working Paper No. 18859. Cambridge, MA: National Bureau of Economic Research.
Statistics Canada. (2013). Survey of labour and income dynamics (SLID). http://www23.statcan.gc.ca/
imdb/p2SV.pl?Function=getSurvey&SDDS=3889.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā, Series A, 26, 359–372.
Yatchew, A. (1998). Nonparametric regression techniques in economics. Journal of Economic
Literature, 36(2), 699–721.
APPENDIX A. PROOFS
A.1. Proof of Theorem 2.1
$$
\begin{aligned}
E_C(\hat m_1(x)\mid x) &= E_\xi\left\{E_P(\hat m_1(x)\mid\pi)\mid x\right\} \\
&= E_\xi\left\{E_P\left(N^{-1}\sum_{i=1}^{N}\pi_i^{-1}\mathbf{1}(i\in S)[g(x_i)-g(x)]K_{\gamma,ix}\;\middle|\;\pi\right)\middle|\;x\right\} \\
&= E_\xi\left\{N^{-1}\sum_{i=1}^{N}[g(x_i)-g(x)]K_{\gamma,ix}\right\} \\
&= \sum_{t^d\in D}\int_{\mathbb{R}^q}[g(t)-g(x)]f(t)K_{h,tx}\,dt^c \\
&= \frac{\kappa_2}{2}\sum_{s=1}^{q}h_s^2\left[g_{ss}(x)f(x)+2g_s(x)f_s(x)\right] \\
&\quad+\sum_{s=1}^{r}\lambda_s\sum_{t^d\in D}\mathbf{1}_s(t^d,x^d)\left[g(x^c,t^d)-g(x^c,x^d)\right]f(x^c,t^d)+o_p\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right) \\
&= \sum_{s=1}^{q}h_s^2B_s(x)f(x)+\sum_{s=1}^{r}D_s\lambda_s+o_p\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right), \tag{19}
\end{aligned}
$$
where $B_s(x)=\kappa_2\left[g_{ss}(x)+2g_s(x)f_s(x)/f(x)\right]/2$, $D_s=\sum_{t^d\in D}\mathbf{1}_s(t^d,x^d)\left[g(x^c,t^d)-g(x^c,x^d)\right]f(x^c,t^d)$, $\mathbf{1}_s(t^d,x^d)$ indicates that $t^d$ and $x^d$ differ only in the $s$th component, and $K_{h,tx}=\prod_{s=1}^{q}h_s^{-1}k((t_s-x_s)/h_s)\prod_{s=1}^{r}\lambda_s^{\mathbf{1}(t_s^d\neq x_s^d)}$. By using the
Taylor expansion method from Särndal, Swensson, and Wretman (1992), Harms and Duchesne (2010) derived the following result:
$$\hat g(x)-g(x) \approx \hat f^{-1}(x)\,\frac{1}{N}\sum_{i=1}^{N}\pi_i^{-1}\mathbf{1}(i\in S)K_{\gamma,ix}u_i. \tag{20}$$
$$
\begin{aligned}
\mathrm{Avar}_P\{\hat g(x)-g(x)\mid\pi\} &= \mathrm{var}_P\!\left(\hat f^{-1}(x)\,\frac{1}{N}\sum_{i=1}^{N}\pi_i^{-1}\mathbf{1}(i\in S)K_{\gamma,ix}u_i\right) \\
&= \frac{1}{N^2}\hat f^{-2}(x)\sum_i\sum_j\mathrm{cov}_P\!\left(\mathbf{1}(i\in S),\mathbf{1}(j\in S)\right)\frac{K_{\gamma,ix}K_{\gamma,jx}u_iu_j}{\pi_i\pi_j} \\
&= \frac{1}{N^2}\hat f^{-2}(x)\sum_i\sum_j\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}K_{\gamma,ix}K_{\gamma,jx}u_iu_j. \tag{21}
\end{aligned}
$$
Replacing $\hat f(x)$ with the expression $\hat f(x)=f(x)+o_p(1)$, the expression for $\mathrm{Avar}_P$ becomes
$$\mathrm{Avar}_P\{\hat g(x)-g(x)\mid\pi\} = \frac{1}{N^2}f^{-2}(x)\sum_i\sum_j\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}K_{\gamma,ix}K_{\gamma,jx}u_iu_j. \tag{22}$$
The variance under the combined framework is then derived using the following decomposition:
$$\mathrm{var}_C\{\hat g(x)-g(x)\mid x\} = \mathrm{var}_\xi\{E_P[\hat g(x)-g(x)\mid\pi]\mid x\}+E_\xi\{\mathrm{var}_P[\hat g(x)-g(x)\mid\pi]\mid x\}. \tag{23}$$
Using a similar derivation as for the bias of $\hat f(x)$ and $\hat m(x)$, the first term in (23) involves the traditional expression for the variance of the local constant estimator:
$$
\begin{aligned}
\frac{1}{N}\mathrm{var}_\xi\left([g(t)-g(x)]K_{h,tx}\right) &= \frac{1}{N}\left[\sum_{t^d\in D}\int_{\mathbb{R}^q}[g(t)-g(x)]^2f(t^c,t^d)K_{h,tx}^2\,dt^c-O\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right] \\
&= O\!\left((Nh_1\cdots h_q)^{-1}\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right). \tag{24}
\end{aligned}
$$
$$
\begin{aligned}
E_\xi\!\left[\left(\frac{1}{N}\sum_{i=1}^{N}u_iK_{\gamma,ix}\right)^{\!2}\right] &= \frac{1}{N}E\!\left[\sigma^2(t)K_{h,tx}^2\right] \\
&= \frac{1}{N}\sum_{t^d\in D}\int_{\mathbb{R}^q}\sigma^2(t)f(t^c,t^d)K_{h,tx}^2\,dt^c \\
&= \frac{1}{N}\left[\int_{\mathbb{R}^q}\sigma^2(x^c+hv,x^d)f(x^c+hv,x^d)\prod_{s=1}^{q}h_s^{-1}k^2(v_s)\,dv_s+O\!\left(\sum_{s=1}^{r}\lambda_s\right)\right] \\
&= \frac{\kappa^q\sigma^2(x)f(x)}{Nh_1\cdots h_q}+O\!\left((Nh_1\cdots h_q)^{-1}\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right). \tag{25}
\end{aligned}
$$
Combining (24), (25), and $\mathrm{var}_\xi\{[\tilde g(x)-g(x)]\mid x\}=[f(x)]^{-2}\mathrm{var}_\xi(\tilde m(x))$, the first term in Eq. (23) is
$$\mathrm{var}_\xi\{[\tilde g(x)-g(x)]\mid x\} = \frac{\kappa^q\sigma^2(x)}{Nh_1\cdots h_q\,f(x)}+o\!\left((Nh_1\cdots h_q)^{-1}\right). \tag{26}$$
Plugging the result from (22) into the second term in Eq. (23) gives
$$E_\xi\left[\mathrm{Avar}_P\{\hat g(x)-g(x)\mid\pi\}\right] = \frac{Q\,\kappa^q\sigma^2(x)}{Nh_1\cdots h_q\,f(x)}+o\!\left((Nh_1\cdots h_q)^{-1}\right), \tag{28}$$
where the second term in the intermediate expression (27) is a zero mean function. To get the expression for $\mathrm{var}_C\{(\hat g(x)-g(x))\mid x\}$, simply sum (26) and (28):
$$\mathrm{var}_C\{(\hat g(x)-g(x))\mid x\} = \frac{(1+Q)\,\kappa^q\sigma^2(x)}{Nh_1\cdots h_q\,f(x)}+o\!\left((Nh_1\cdots h_q)^{-1}\right). \tag{29}$$
In order to prove the asymptotic normality of $\hat g(x)$, I make use of the Lyapunov double array central limit theorem (see the statistical appendix in Li and Racine, 2007):
$$
\begin{aligned}
&\sqrt{Nh_1\cdots h_q}\left(\hat g(x)-g(x)-\sum_{s=1}^{q}B_s(x)h_s^2-\sum_{s=1}^{r}D_s(x)\lambda_s\right) \\
&\quad\equiv \sqrt{Nh_1\cdots h_q}\,\frac{\left[\hat g(x)-g(x)-\sum_{s=1}^{q}B_s(x)h_s^2-\sum_{s=1}^{r}D_s(x)\lambda_s\right]\hat f(x)}{\hat f(x)} \\
&\quad= \sqrt{Nh_1\cdots h_q}\,\frac{\hat m(x)-\sum_{s=1}^{q}B_s(x)h_s^2\hat f(x)-\sum_{s=1}^{r}D_s(x)\lambda_s\hat f(x)}{\hat f(x)} \\
&\quad= \sqrt{Nh_1\cdots h_q}\,\frac{\hat m(x)-E(\hat m(x))}{\hat f(x)}+O\!\left(\sqrt{Nh_1\cdots h_q}\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right) \\
&\quad= \sqrt{Nh_1\cdots h_q}\,\frac{\hat m(x)-E(\hat m(x))}{\hat f(x)}+o(1) \\
&\quad= \frac{1}{f(x)}\sum_{i=1}^{N}Z_{N,i}+o(1), \tag{30}
\end{aligned}
$$
where $Z_{N,i}=(\sqrt{Nh_1\cdots h_q})^{-1}\left[\pi_i^{-1}\mathbf{1}(i\in S)(y_i-g(x))K_{\gamma,ix}-E\!\left(\pi_i^{-1}\mathbf{1}(i\in S)(y_i-g(x))K_{\gamma,ix}\right)\right]$ and $\hat f(x)=f(x)+o_p(1)$. Next, take the expectation of the absolute value of $Z_{N,i}$ raised to the power of $2+\delta$, where $\delta>0$ is some constant:
$$E|Z_{N,i}|^{2+\delta} = (\sqrt{Nh_1\cdots h_q})^{-(2+\delta)}E\left|\pi_i^{-1}\mathbf{1}(i\in S)(y_i-g(x))K_{\gamma,ix}-E\!\left(\pi_i^{-1}\mathbf{1}(i\in S)(y_i-g(x))K_{\gamma,ix}\right)\right|^{2+\delta},$$
and
$$\frac{1}{f(x)}\sum_{i=1}^{N}Z_{N,i}\xrightarrow{d}N\!\left(0,\,(1+Q)\kappa^q\sigma^2(x)/f(x)\right).$$
Write $g(x_j)=g(x_j)+g(x_i)-g(x_i)=g(x_i)+R_{ij}$ and plug this into the regression model $y_j=g(x_j)+u_j$:
$$y_j=g(x_i)+R_{ij}+u_j.$$
Using the definition of the modified kernel density estimator $\hat f(x^c,x^d)$, we can rewrite Eq. (32) as
$$\hat g_{-i}(x_i)=g(x_i)+\hat f_i^{-1}\,\frac{1}{N}\sum_{j\neq i}\pi_j^{-1}\mathbf{1}(j\in S)K_{h,ij}(R_{ij}+u_j), \tag{33}$$
where $\hat f_i=N^{-1}\sum_{j\neq i}\pi_j^{-1}\mathbf{1}(j\in S)K_{h,ij}$. The definition of $\mathrm{CV}(h,\lambda)$ can now be written as
$$\mathrm{CV}(h,\lambda)=N^{-1}\sum_{i\in S}\pi_i^{-1}\left(g(x_i)+u_i-\hat g_{-i}(x_i)\right)^2M(x_i). \tag{34}$$
The third term in the expansion of Eq. (34) does not depend on $(h,\lambda)$, and the second term has an order smaller than the first term. So, asymptotically, minimizing $\mathrm{CV}(h,\lambda)$ is equivalent to minimizing $\sum_i[g(x_i)-\hat g_{-i}(x_i)]^2M(x_i)$. Define
$$
\begin{aligned}
\mathrm{CV}_0(h,\lambda) &= N^{-1}\sum_{i\in S}\pi_i^{-1}\left[g(x_i)-\hat g_{-i}(x_i)\right]^2M(x_i) \\
&= N^{-1}\sum_{i\in S}\pi_i^{-1}\left[\frac{1}{N}\sum_{j\neq i}\pi_j^{-1}\mathbf{1}(j\in S)K_{h,ij}(R_{ij}+u_j)\hat f_i^{-1}\right]^2M(x_i) \\
&= N^{-1}\sum_{i\in S}\pi_i^{-1}\left\{\left[\frac{1}{N}\sum_{j\neq i}\pi_j^{-1}K_{h,ij}(g(x_j)-g(x_i))+\frac{1}{N}\sum_{j\neq i}\pi_j^{-1}K_{h,ij}u_j\right]\hat f_i^{-1}\right\}^2M(x_i). \tag{35}
\end{aligned}
$$
Again, using the definition $\hat f(x)=f(x)+o_p(1)$, write $\mathrm{CV}_0(h,\lambda)$ as
$$
\begin{aligned}
\mathrm{CV}_0(h,\lambda) &= \frac{1}{N}\sum_{i\in S}\pi_i^{-1}(m_{1i}+m_{2i})^2f_i^{-2}M(x_i)+(\mathrm{s.o.}) \\
&= \frac{1}{N}\left(\sum_{i\in S}\pi_i^{-1}m_{1i}^2f_i^{-2}M(x_i)+\sum_{i\in S}\pi_i^{-1}m_{2i}^2f_i^{-2}M(x_i)+2\sum_{i\in S}\pi_i^{-1}m_{1i}m_{2i}f_i^{-2}M(x_i)\right), \tag{36}
\end{aligned}
$$
where $m_{1i}=N^{-1}\sum_{j\neq i}\pi_j^{-1}K_{h,ij}(g(x_j)-g(x_i))$, $m_{2i}=N^{-1}\sum_{j\neq i}\pi_j^{-1}K_{h,ij}u_j$, and s.o. denotes smaller order terms. The leading term of $\mathrm{CV}(h,\lambda)$ is $\mathrm{CV}_0(h,\lambda)=E[\mathrm{CV}_0(h,\lambda)]+(\mathrm{s.o.})$.
Because $E_C(m_{1i}m_{2i}f_i^{-2}M(x_i))=0$, the leading term is
$$E[\mathrm{CV}_0(h,\lambda)]=E\!\left[m_{1i}^2f_i^{-2}M(x_i)\right]+E\!\left[m_{2i}^2f_i^{-2}M(x_i)\right]+(\mathrm{s.o.}). \tag{37}$$
Looking at the first term in Eq. (37),
$$E_C\!\left[m_{1i}^2f_i^{-2}M(x_i)\right]=\frac{1}{N^2}E_C\!\left[\left(\sum_{j\neq i}\pi_j^{-1}\mathbf{1}(j\in S)K_{h,ij}(g(x_j)-g(x_i))\right)^{\!2}f_i^{-2}M(x_i)\right], \tag{38}$$
where $v^d$ is a placeholder term and it is assumed that the data are identically distributed. The second term arising in (38) is $O\!\left((Nh_1\cdots h_q)^{-1}\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right)$:
$$
\begin{aligned}
&N^{-1}E_\xi\!\left[\sum_{j\neq i}\pi_j^{-1}K_{h,ij}^2(g(x_j)-g(x_i))^2f_i^{-2}\right] \\
&\quad= N^{-1}f_i^{-2}\sum_{j\neq i}\pi_j^{-1}\sum_{x^d\in D}\int K_{h,ij}^2(g(x_j)-g(x_i))^2\,dx_j^c \\
&\quad= O\!\left((Nh_1\cdots h_q)^{-1}\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right). \tag{41}
\end{aligned}
$$
Next, solve for the second term on the right-hand side of (36):
$$
\begin{aligned}
&= \left(N^2\prod_{s=1}^{q}h_s\right)^{\!-1}\sum_{j\neq i}\pi_j^{-1}\sum_{x^d\in D}\int\kappa^q\sigma^2(x)M(x)\,dx^c \\
&\quad+O\!\left(\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)^{\!3}+(N^2h_1\cdots h_q)^{-1}\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right).
\end{aligned}
$$
Minimizing $\mathrm{CV}(h,\lambda)$ is equivalent to minimizing $\mathrm{CV}_1(h,\lambda)$ because $N^{-1}\sum_iu_i^2$ does not depend on $h_1,\ldots,h_q$.
NEAREST NEIGHBOR IMPUTATION
FOR GENERAL PARAMETER
ESTIMATION IN SURVEY SAMPLING
ABSTRACT
Nearest neighbor imputation has a long tradition for handling item nonre-
sponse in survey sampling. In this article, we study the asymptotic properties
of the nearest neighbor imputation estimator for general population param-
eters, including population means, proportions and quantiles. For variance
estimation, we propose novel replication variance estimation, which is
asymptotically valid and straightforward to implement. The main idea is to
construct replicates of the estimator directly based on its asymptotically lin-
ear terms, instead of individual records of variables. The simulation results
show that nearest neighbor imputation and the proposed variance estimation
provide valid inferences for general population parameters.
Keywords: Bahadur representation; bootstrap; jackknife variance
estimation; matching; missing at random; quantile estimation
210 SHU YANG AND JAE KWANG KIM
1. INTRODUCTION
In survey sampling, nearest neighbor imputation is popular for dealing with item
nonresponse. In nearest neighbor imputation, for each unit with missing data,
the nearest neighbor is identified among respondents based on the vector of fully
observed covariates and then is used as a donor for hot deck imputation (Little &
Rubin, 2002). Although nearest neighbor imputation has a long history of applica-
tion, there are relatively few papers on investigating its statistical properties. Sande
(1979) used nearest neighbor imputation in business surveys. Lee and Särndal
(1994) studied different methods of nearest neighbor imputation by simulation.
Chen and Shao (2000, 2001) developed asymptotic properties for the nearest neigh-
bor imputation estimator of population means. Shao and Wang (2008) proposed
methods for constructing confidence intervals for population means and quan-
tiles with nearest neighbor imputation. Kim et al. (2011) applied nearest neighbor
imputation for the US Census long form data. However, most of these studies
focused on mean estimation or a one-dimensional covariate in the context of a
simple random sample, which is restrictive both theoretically and practically.
In the empirical economics literature, nearest neighbor imputation (also known
as matching) has been widely used in evaluation research for adjusting the dis-
tribution of covariates among different treatment groups; see Stuart (2010) for a
survey of matching estimators. Abadie and Imbens (2006, 2008, 2011, 2012, 2016)
systematically studied the asymptotic properties of the matching estimators for the
average treatment effects with a finite number of matches. In particular, Abadie
and Imbens (2006, 2012) derived the asymptotic distribution for the matching
estimators that match directly on the covariates using a martingale representation.
Abadie and Imbens (2016) and Yang et al. (2016) further showed that the match-
ing estimators that match on the estimated propensity score are consistent and
asymptotically normal. However, these studies are restricted to mean estimation
and non-survey data.
Empirical researchers are often interested in various finite population quantities,
such as the population means, proportions and quantiles, to name a few (Francisco
and Fuller, 1991; Wu and Sitter, 2001; Berger and Skinner, 2003). Some corre-
sponding sample estimators should be treated differently than others. For example,
estimators of population quantiles involve nondifferentiable functions of estimated
quantities. Moreover, there is often more than one covariate available to facilitate
nearest neighbor imputation for survey data. The current framework of nearest
neighbor imputation does not fully cover inferences in these settings.
In this article, we provide a framework of nearest neighbor imputation for gen-
eral parameter estimation in survey sampling. In general, the nearest neighbor
imputation estimator is not root-n consistent (Abadie and Imbens, 2006), where n
is the sample size. Based on a scalar matching variable summarizing all covariates
information, we show that nearest neighbor imputation can provide consistent esti-
mators for a fairly general class of parameters. If the matching variable is chosen
to be the mean function of the outcome given the covariates, our method resem-
bles predictive mean matching imputation (Rubin, 1986; Little, 1988; Heitjan
and Little, 1991). However, unlike predictive mean matching imputation, nearest
neighbor imputation does not require the mean function be correctly specified. Its
consistency only requires the matching variable satisfy certain Lipschitz continuity
conditions; see Section 3 for details.
The asymptotic results suggest that variance estimation can proceed based on a
large sample approximation to the normal distribution but requires additional esti-
mation for the variance function of the outcome given the covariates. To avoid such
complication, we consider replication variance estimation (Rust and Rao, 1996;
Wolter, 2007; Mashreghi et al., 2016), which has gained popularity in practice
because of its intuitive appeal. Intrinsically, the nearest neighbor imputation esti-
mator with fixed number of matches is not smooth. The lack of smoothness makes
the conventional replication methods invalid for variance estimation (Abadie and
Imbens, 2008). This is because the conventional replication method distorts the
distribution of the number of times each unit is used as a match, ki . We provide a
heuristic illustration using an unrealistic but insightful example. Suppose that in a sample of size 2n, Sequence 1 consists of the first n observations and Sequence 2 consists of the last n observations. Further, suppose that each observation in Sequence 1 matches to exactly one observation in Sequence 2, so that the distribution of ki is degenerate at 1. On the other hand, under the conventional bootstrap, ki*, the number of times each unit is used as a match in the bootstrap sample, has a different distribution from ki. Therefore, the conventional
bootstrap fails to preserve the distribution of ki . If the number of matches increases
with the sample size, such as in the “kernel matching” estimators of Heckman et al.
(1998), both ki and ki∗ are infinite in the original and conventional bootstrapping
samples, and therefore the conventional bootstrap works in this setting. To address
the non-smoothness due to the fixed number of matches, subsampling (Politis et al.,
1999) and m out of n bootstrap (Bickel et al., 2012) can be used; however, their con-
sistency relies critically on the choice of the size for subsampling. Unfortunately,
there is no clear guidance on how to choose these values in practice. Alternatively,
Otsu and Rai (2016) proposed a wild bootstrap method for the matching estimator
based on the full vector of covariates in the context of non-survey data. Adusumilli
(2017) developed a novel bootstrap procedure for the matching estimator based on
the estimated propensity score, built on the notion of “potential errors.” His simu-
lation study also demonstrated the superior performance of the bootstrap method
relative to using the asymptotic distribution for inference.
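The two-sequence thought experiment above can be simulated directly. In the sketch below (Python, synthetic data; not the authors' code), each nonrespondent has a unique nearest donor, so the original match counts ki are degenerate at 1, while resampling the donors with replacement — as a conventional bootstrap would — produces a visibly different distribution of match counts:

```python
import numpy as np

def match_counts(x_resp, x_miss):
    """k_i: number of times respondent i (a position in x_resp) is the
    nearest-neighbor donor for the nonrespondents in x_miss."""
    idx = np.abs(x_miss[:, None] - x_resp[None, :]).argmin(axis=1)
    return np.bincount(idx, minlength=len(x_resp))

rng = np.random.default_rng(2)
n = 1000
x_resp = np.arange(n, dtype=float)   # Sequence 2: respondents (donors)
x_miss = x_resp + 0.1                # Sequence 1: each matches a unique donor

k = match_counts(x_resp, x_miss)     # degenerate: every donor used exactly once

# Conventional bootstrap: resample the donors with replacement and rematch.
boot = rng.choice(n, size=n, replace=True)
k_star = match_counts(x_resp[boot], x_miss)

var_k, var_k_star = k.var(), k_star.var()
```

The variance of `k` is exactly zero, while the variance of `k_star` is strictly positive: the bootstrap fails to preserve the degenerate distribution of the match counts, which is the source of its invalidity here.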
2. BASIC SETUP
Let FN = {(xi , yi , δi ) : i = 1, . . . , N } denote a finite population of size N , where
xi is a p-dimensional vector of covariates, which is always observed, yi is the
outcome that is subject to missingness, and δi is the response indicator of yi , i.e.,
δi = 1 if yi is observed and δi = 0 if it is missing. The δi ’s are defined throughout
the finite population, as in Shao and Steel (1999) and Kim et al. (2006). We
assume that FN is a random sample from a superpopulation model ζ , and N
is known. Our objective is to estimate the finite population parameter defined through $\mu_g = N^{-1}\sum_{i=1}^{N}g(y_i)$ for some known $g(\cdot)$, or $\xi_N = \inf\{\xi: S_N(\xi)\geq 0\}$, where $S_N(\xi) = N^{-1}\sum_{i=1}^{N}s(y_i-\xi)$ and $s(\cdot)$ is a univariate real function. These parameters are fairly general and cover many parameters of interest in survey sampling. For example, for $g(y)=y$, $\mu_g \equiv N^{-1}\sum_{i=1}^{N}y_i$ is the population mean of $y$; for $g(y)=I(y<c)$ with some constant $c$, $\mu_g \equiv N^{-1}\sum_{i=1}^{N}I(y_i<c)$ is the population proportion of $y$ less than $c$; and for $s(y_i-\xi)=I(y_i-\xi\leq 0)-\alpha$, $\xi_N$ is the population $\alpha$th quantile.
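These definitions can be computed directly for a small synthetic population; the Python sketch below (hypothetical data) mirrors them, with the quantile taken as the smallest value at which $S_N(\xi)$ crosses zero:

```python
import numpy as np

def mu_g(y, g):
    """Finite-population parameter mu_g = N^{-1} * sum_i g(y_i)."""
    return float(np.mean(g(np.asarray(y))))

def xi_N(y, alpha):
    """Population alpha-quantile xi_N = inf{xi : S_N(xi) >= 0}, where
    S_N(xi) = N^{-1} * sum_i [I(y_i - xi <= 0) - alpha]."""
    y = np.sort(np.asarray(y))
    S = np.arange(1, len(y) + 1) / len(y) - alpha  # S_N at each order statistic
    return float(y[np.argmax(S >= 0)])

y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
mean = mu_g(y, lambda t: t)                        # population mean
prop = mu_g(y, lambda t: (t < 4).astype(float))    # proportion below c = 4
med = xi_N(y, 0.5)                                 # population median
```

For this toy population the mean is 3.875, the proportion below 4 is 0.5, and the median is 3.0 (the smallest order statistic at which $S_N$ reaches zero).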
Let A denote an index set of the sample selected by a probability sampling design. Let $I_i$ be the sampling indicator function, i.e., $I_i=1$ if unit $i$ is selected into the sample and $I_i=0$ otherwise. The sample size is $n=\sum_{i=1}^{N}I_i$. Suppose that $\pi_i$, the first-order inclusion probability of unit $i$, is positive and known throughout the sample. If $y_i$ were fully observed throughout the sample, the sample estimators of $\mu_g$ and $\xi_N$ would be $\hat\mu_g = N^{-1}\sum_{i\in A}\pi_i^{-1}g(y_i)$ and $\hat\xi = \inf\{\xi: \hat S_N(\xi)\geq 0\}$, with $\hat S_N(\xi) = \hat N^{-1}\sum_{i\in A}\pi_i^{-1}s(y_i-\xi)$, where $\hat N = \sum_{i\in A}\pi_i^{-1}$ is an estimator for $N$. Even with a known $N$, it is necessary to use $\hat N$; we articulate this point in Example 3.
We make the following assumption for the missing data process.
Assumption 1.
(Missing at random and positivity) The missing data process satisfies $P(\delta=1\mid x,y) = P(\delta=1\mid x)$, denoted by $p(x)$. With probability 1, $p(x) > \epsilon$ for a constant $\epsilon > 0$.
Step 1. For each unit $i$ with $\delta_i=0$, find the nearest neighbor from the respondents, i.e., the unit minimizing the distance between $x_i$ and $x_j$ over $j\in A_R \equiv \{j\in A: \delta_j=1\}$. Let $i(1)$ be the index of its nearest neighbor, which satisfies $d(x_{i(1)},x_i)\leq d(x_j,x_i)$ for all $j\in A_R$.
Step 2. The nearest neighbor imputation estimators of $\mu_g$ and $\xi_N$ are computed by
$$\hat\mu_{g,\mathrm{NNI}} = \frac{1}{N}\sum_{i\in A}\frac{1}{\pi_i}\left\{\delta_ig(y_i)+(1-\delta_i)g(y_{i(1)})\right\}, \tag{1}$$
$$\hat S_{\mathrm{NNI}}(\xi) = \frac{1}{\hat N}\sum_{i\in A}\frac{1}{\pi_i}\left\{\delta_is(y_i-\xi)+(1-\delta_i)s(y_{i(1)}-\xi)\right\}. \tag{2}$$
In (1) and (2), the imputed values are real observations obtained from the current
sample.
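The two-step procedure can be sketched compactly in Python. This is a hedged illustration, not the authors' code: it uses a one-dimensional $x$, toy data, and substitutes $\hat N=\sum_i\pi_i^{-1}$ for the known $N$ in the mean estimator as a simplification:

```python
import numpy as np

def nni_estimates(x, y, delta, pi, g=lambda t: t, alpha=0.5):
    """Nearest neighbor imputation in the spirit of (1)-(2): each
    nonrespondent borrows the y-value of the respondent with the closest x.
    (N-hat = sum 1/pi_i is used in place of the known N throughout,
    a simplification of the estimators in the text.)"""
    x, y, delta, pi = map(np.asarray, (x, y, delta, pi))
    resp = np.where(delta == 1)[0]
    donor = resp[np.abs(x[:, None] - x[resp][None, :]).argmin(axis=1)]
    y_imp = np.where(delta == 1, y, y[donor])   # donors are real observations
    N_hat = np.sum(1.0 / pi)
    mu_hat = np.sum(g(y_imp) / pi) / N_hat
    order = np.argsort(y_imp)                   # smallest xi with S_NNI >= 0
    S = np.cumsum(1.0 / pi[order]) / N_hat - alpha
    xi_hat = float(y_imp[order][np.argmax(S >= 0)])
    return float(mu_hat), xi_hat

x = [0.1, 0.2, 0.3, 0.8, 0.9]
y = [1.0, 2.0, float("nan"), 8.0, 9.0]   # y missing where delta = 0
delta = [1, 1, 0, 1, 1]
pi = [0.5] * 5
mu_hat, xi_hat = nni_estimates(x, y, delta, pi)
```

In this toy sample the nonrespondent at x = 0.3 borrows y = 2.0 from its nearest respondent at x = 0.2, giving an imputed mean of 4.4 and an imputed median of 2.0.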
3. MAIN RESULTS
For asymptotic inference, we use the framework of Isaki and Fuller (1982), where
the asymptotic properties of estimators are established under a fixed sequence of
populations and a corresponding sequence of random samples. Specifically, let a
sequence of nested finite populations be given by FN1 ⊂ FN2 ⊂ FN3 ⊂ · · · . Also,
let a sequence of samples of sizes {nt : t = 1, 2, 3, . . .} be constructed from the
sequence of populations with an increasing sample size n1 < n2 < n3 < · · · . For
the ease of exposition, we omit the dependence of Nt and nt on t. Denote EP ( · )
and var P ( · ) to be the expectation and the variance under the sampling design,
respectively. We impose the following regularity conditions on the sampling
design.
Assumption 2.
(1) There exist positive constants $C_1$ and $C_2$ such that $C_1\leq Nn^{-1}\pi_i\leq C_2$ for $i=1,\ldots,N$; (2) the sampling fraction is negligible, i.e., $nN^{-1}=o(1)$; (3) the sequence of Horvitz–Thompson estimators $\hat\mu_{g,\mathrm{HT}} = N^{-1}\sum_{i\in A}\pi_i^{-1}g(y_i)$ satisfies $\mathrm{var}_P(\hat\mu_{g,\mathrm{HT}})=O(n^{-1})$ and $\{\mathrm{var}_P(\hat\mu_{g,\mathrm{HT}})\}^{-1/2}(\hat\mu_{g,\mathrm{HT}}-\mu_g)\mid \mathcal{F}_N\to N(0,1)$ in distribution, as $n\to\infty$.
The nearest neighbor imputation estimator of $\mu_g$ can equivalently be written as
$$\hat\mu_{g,\mathrm{NNI}} = \frac{1}{N}\sum_{i\in A}\frac{\delta_i}{\pi_i}(1+k_i)g(y_i), \tag{3}$$
with
$$k_i = \sum_{j\in A}\frac{\pi_i}{\pi_j}(1-\delta_j)d_{ij}. \tag{4}$$
Under simple random sampling, $k_i=\sum_{j\in A}(1-\delta_j)d_{ij}$ is the number of times that unit $i$ is used as the nearest neighbor for nonrespondents.
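The equivalence between the direct form (1) and the weighted-donor representation in (3)–(4) can be checked numerically; the Python sketch below uses toy data, a one-dimensional $x$, and the identity function for $g$:

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
y = np.array([1.0, 2.0, np.nan, 8.0, 9.0])   # y missing where delta = 0
delta = np.array([1, 1, 0, 1, 1])
pi = np.array([0.5, 0.25, 0.5, 0.25, 0.5])
N = 20

resp = np.where(delta == 1)[0]
donor = resp[np.abs(x[:, None] - x[resp][None, :]).argmin(axis=1)]

# Direct form (1): impute each nonrespondent with its donor's value.
y_imp = np.where(delta == 1, y, y[donor])
mu_direct = np.sum(y_imp / pi) / N

# Weighted-donor form (3)-(4): k_i = sum_j (pi_i/pi_j)(1 - delta_j) d_ij.
k = np.zeros(len(x))
for j in np.where(delta == 0)[0]:
    k[donor[j]] += pi[donor[j]] / pi[j]
mu_donor = np.sum(delta * (1 + k) * np.nan_to_num(y) / pi) / N
```

Both forms give the same estimate (3.2 on this toy data): each donor's weight in (3) absorbs the inclusion weights of the nonrespondents it serves in (1).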
We first study the asymptotic properties of μ̂g,NNI . Let μg (x) ≡ E{g(y) | x} and
σg2 (x) ≡ var{g(y) | x}, where the expectation and variance are taken with respect
to the superpopulation model. We use the following decomposition:
where
1 1
DN = n1/2 μg (xi ) + δi (1 + ki ){g(yi ) − μg (xi ) − μg , (6)
N i∈A πi
and
n1/2 1
BN = (1 − δi ){μg (xi(1) ) − μg (xi )}. (7)
N i∈A πi
The difference μ_g(x_{i(1)}) − μ_g(x_i) accounts for the matching discrepancy, and B_N contributes to the asymptotic bias of the matching estimator. In general, if x is p-dimensional, Abadie and Imbens (2006) showed that d(x_{i(1)}, x_i) = O_P(n^{-1/p}). Therefore, for nearest neighbor imputation with p ≥ 2, the asymptotic bias is B_N = O_P(n^{1/2 - 1/p}), which is not o_P(1). Abadie and Imbens (2011) proposed a bias adjustment using a nonparametric estimator \hat{\mu}_g(x) that renders matching estimators n^{1/2}-consistent. This approach, however, may not be convenient for general parameter estimation.
To address the matching discrepancy due to a non-scalar x, we propose
an alternative method. We first summarize the covariate information into a scalar
matching variable m = m(x) and then apply nearest neighbor imputation based on
this matching variable. For simplicity of notation, we may suppress the dependence
of m on x if there is no ambiguity. Let f1 (m) and f0 (m) be the conditional density
of m given δ = 1 and δ = 0, respectively. We assume the superpopulation model ζ
and the matching variable m satisfy the following assumption.
Assumption 3.
(1) The matching variable m has a compact and convex support, with density bounded and bounded away from zero. Suppose that there exist constants C_{1L} and C_{1U} such that C_{1L} ≤ f_1(m)/f_0(m) ≤ C_{1U}; (2) μ_g(x) and μ_s(ξ, x) ≡ E{s(y − ξ) | x} satisfy a Lipschitz continuity condition: there exists a constant C_2 such that |μ_g(x_i) − μ_g(x_j)| < C_2 |m_i − m_j| and |μ_s(ξ, x_i) − μ_s(ξ, x_j)| < C_2 |m_i − m_j| for any i and j; (3) there exists δ > 0 such that E{|g(y)|^{2+δ} | x} is uniformly bounded for any x, and E{|s(y − ξ)|^{2+δ} | x} is uniformly bounded for any x and ξ in the neighborhood of ξ_N.
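For illustration, one concrete choice of the scalar matching variable m(x) (an assumption of ours, not the paper's prescription) is a working linear predictor fitted on respondents, after which matching proceeds on m alone:

```python
import numpy as np

def scalar_match_impute(X, y, delta):
    """Collapse p-dimensional covariates into a scalar matching variable
    m(x) -- here a least-squares predictor fitted on respondents -- and
    run nearest neighbor imputation on m."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z[delta == 1], y[delta == 1], rcond=None)
    m = Z @ beta                                  # scalar m_i = m(x_i)
    resp = np.flatnonzero(delta == 1)
    donor = resp[np.argmin(np.abs(m[:, None] - m[resp][None, :]), axis=1)]
    return np.where(delta == 1, y, y[donor])
```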
with

V_{g\mu} = \lim_{n \to \infty} \frac{n}{N^2} E \left[ \mathrm{var}_P \left\{ \sum_{i \in A} \frac{\mu_g(x_i)}{\pi_i} \right\} \right],

V_{ge} = \lim_{n \to \infty} \frac{n}{N^2} \sum_{i=1}^{N} E \left[ \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\}^2 \sigma_g^2(x_i) \right],
n^{1/2} (\hat{\xi}_{\mathrm{NNI}} - \xi_N) = -n^{1/2} S'(\xi_N)^{-1} \{ \hat{S}_{\mathrm{NNI}}(\xi_N) - S_N(\xi_N) \} + o_P(1),   (9)
\mathrm{var}\{\hat{S}_{\mathrm{NNI}}(\xi_N)\} = \lim_{n \to \infty} \frac{n}{N^2} E \left[ \mathrm{var}_P \left\{ \sum_{i \in A} \frac{E\{s(y_i - \xi_N) \mid x_i\}}{\pi_i} \right\} \right]
+ \lim_{n \to \infty} \frac{n}{N^2} \sum_{i=1}^{N} E \left[ \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\}^2 \mathrm{var}\{s(y_i - \xi_N) \mid x_i\} \right],   (11)
Example 1. (Quantile estimation) The estimating function for the αth quantile is s(y_i − ξ) = I(y_i − ξ ≤ 0) − α, and the population estimating equation is S_{α,N}(ξ) = F_N(ξ) − α, where F_N(ξ) = N^{-1} \sum_{i=1}^{N} I(y_i ≤ ξ). The nearest neighbor imputation estimator \hat{\xi}_{α,\mathrm{NNI}} is defined as
\hat{V}_{\mathrm{rep}}(\hat{\mu}_g) = \sum_{k=1}^{L} c_k ( \hat{\mu}_g^{(k)} - \hat{\mu}_g )^2,   (13)
where L is the number of replicates, c_k is the kth replication factor and \hat{\mu}_g^{(k)} is the kth replicate of \hat{\mu}_g. For \hat{\mu}_g = \sum_{i \in A} \omega_i g(y_i), we can write the replicate of \hat{\mu}_g as \hat{\mu}_g^{(k)} = \sum_{i \in A} \omega_i^{(k)} g(y_i), where \omega_i^{(k)} is the replication weight that accounts for the complex sampling design. The replicates are constructed such that E_P\{\hat{V}_{\mathrm{rep}}(\hat{\mu}_g)\} = \mathrm{var}_P(\hat{\mu}_g)\{1 + o(1)\}.
which is essentially the sampling variance of ψ̂HT . This suggests that we can treat
{ψi : i ∈ A} as pseudo observations in applying the replication variance estimator.
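For instance, instantiating (13) with a delete-one jackknife under simple random sampling (c_k = (n − 1)/n, with the kth replicate dropping unit k; one standard choice among many):

```python
import numpy as np

def jackknife_variance(vals, w):
    """Replication variance (13) with delete-one jackknife replicates of
    the weighted mean mu_hat = sum(w*vals)/sum(w); c_k = (n-1)/n."""
    n = len(vals)
    mu_hat = np.sum(w * vals) / np.sum(w)
    reps = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k      # replicate weights zero out unit k
        reps[k] = np.sum(w[keep] * vals[keep]) / np.sum(w[keep])
    return (n - 1) / n * np.sum((reps - mu_hat) ** 2)
```

For the unweighted sample mean this reproduces the textbook estimator s²/n.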
Otsu and Rai (2017) used a similar idea to develop a wild bootstrap technique for the matching estimators for the average treatment effects. To be specific, we construct replicates of \hat{\psi}_{\mathrm{HT}} as follows: \hat{\psi}_{\mathrm{HT}}^{(k)} = \sum_{i \in A} \omega_i^{(k)} \psi_i. The replication variance estimator of \hat{\psi}_{\mathrm{HT}} is obtained by applying \hat{V}_{\mathrm{rep}}(\cdot) in (13) to the above replicates \hat{\psi}_{\mathrm{HT}}^{(k)}. It follows that E\{\hat{V}_{\mathrm{rep}}(\hat{\psi}_{\mathrm{HT}})\} = \mathrm{var}(\hat{\psi}_{\mathrm{HT}} - \mu_\psi)\{1 + o(1)\} = \mathrm{var}(\hat{\mu}_{g,\mathrm{NNI}} - \mu_g)\{1 + o(1)\}. Because the pseudo observations ψ_i involve the unknown μ_g(x), we use a nonparametric estimator \hat{\mu}_g(x). Concretely, we adopt sieve estimators (Geman and Hwang, 1982; Chen, 2007), which include power series estimators as examples; see the Appendices for details.
In summary, the new replication variance estimation for μ̂g,NNI proceeds as
follows:
\hat{\mu}_{g,\mathrm{NNI}}^{(k)} = \sum_{i \in A} \omega_i^{(k)} \left[ \hat{\mu}_g(x_i) + \delta_i (1 + k_i) \{ g(y_i) - \hat{\mu}_g(x_i) \} \right],   (14)

\hat{S}_{\mathrm{NNI}}^{(k)}(\hat{\xi}_{\mathrm{NNI}}) = \sum_{i \in A} \omega_i^{(k)} \left[ \hat{\mu}_s(\hat{\xi}_{\mathrm{NNI}}, x_i) + \delta_i (1 + k_i) \{ s(y_i - \hat{\xi}_{\mathrm{NNI}}) - \hat{\mu}_s(\hat{\xi}_{\mathrm{NNI}}, x_i) \} \right].   (16)
Step 3. Apply V̂rep ( · ) in (13) for the above replicates to obtain the variance
estimator of ŜNNI (ξ̂NNI ), denoted as V̂rep {ŜNNI (ξ̂NNI )}.
Step 4. Obtain the kernel-based derivative estimator \hat{S}'(\hat{\xi}_{\mathrm{NNI}}), where \hat{S}'(\xi) is defined in (15).
Step 5. Calculate the variance estimator of \hat{\xi}_{\mathrm{NNI}} as \hat{S}'(\hat{\xi}_{\mathrm{NNI}})^{-2} \hat{V}_{\mathrm{rep}}\{\hat{S}_{\mathrm{NNI}}(\hat{\xi}_{\mathrm{NNI}})\}.
For illustration, we continue with Example 3.
\hat{F}_{\mathrm{NNI}}^{(k)}(\hat{\xi}_{α,\mathrm{NNI}}) = \sum_{i \in A} \omega_i^{(k)} \left[ \hat{F}(\hat{\xi}_{α,\mathrm{NNI}}) + \delta_i (1 + k_i) \{ I(y_i ≤ \hat{\xi}_{α,\mathrm{NNI}}) - \hat{F}(\hat{\xi}_{α,\mathrm{NNI}}) \} \right].
Apply V̂rep ( · ) in (13) for the above replicates to obtain the replication variance
estimator of F̂NNI (ξ̂α,NNI ), denoted as V̂rep {F̂NNI (ξ̂α,NNI )}. Calculate the variance
estimator of ξ̂α,NNI as fˆ(ξ̂α,NNI )−2 V̂rep {F̂NNI (ξ̂α,NNI )}.
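For this quantile example, the point estimate and Steps 3 to 5 can be sketched under simple random sampling with delete-one jackknife replicates (a simplification of ours: these replicates are plain delete-one means of I(y ≤ ξ̂) and omit the μ̂_s adjustment terms the paper carries):

```python
import numpy as np

def nni_quantile(y_imp, alpha):
    """xi_hat solving F_hat(xi) - alpha ~ 0: the smallest imputed value
    at which the empirical CDF of the imputed data reaches alpha."""
    ys = np.sort(y_imp)
    cdf = np.arange(1, len(ys) + 1) / len(ys)
    return ys[np.searchsorted(cdf, alpha)]

def quantile_variance(y_imp, xi_hat, h=None):
    """f_hat(xi_hat)^{-2} * V_rep{F(xi_hat)} with delete-one jackknife
    replicates and a Gaussian-kernel density estimate."""
    n = len(y_imp)
    z = (y_imp <= xi_hat).astype(float)
    F_hat = z.mean()
    reps = (z.sum() - z) / (n - 1)            # delete-one replicates of F
    v_F = (n - 1) / n * np.sum((reps - F_hat) ** 2)
    if h is None:
        h = 1.5 * n ** (-0.2)                 # bandwidth used in Section 5
    f_hat = np.mean(np.exp(-0.5 * ((xi_hat - y_imp) / h) ** 2)) / (h * np.sqrt(2 * np.pi))
    return v_F / f_hat ** 2
```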
5. SIMULATION STUDY
In this section, we investigate the finite-sample performance of the proposed replication method for variance estimation and confidence interval construction, comparing it with conventional competitors.
For generating finite populations of size N = 50,000: first, let x_{1i}, x_{2i} and x_{3i} be generated independently from Uniform[0, 1], and x_{4i}, x_{5i}, x_{6i} and e_i be generated independently from N(0, 1); then, let y_i be generated under six mechanisms: (P1) y_i = −1 + x_{1i} + x_{2i} + e_i, (P2) y_i = −1.5 + x_{1i} + x_{2i} + x_{3i} + x_{4i} + e_i, (P3) y_i = −1.5 + x_{1i} + ··· + x_{6i} + e_i, (P4) y_i = −1 + x_{1i} + x_{2i} + x_{1i}^2 + x_{2i}^2 − 2/3 + e_i, (P5) y_i = −1.5 + x_{1i} + x_{2i} + x_{3i} + x_{4i} + x_{1i}^2 + x_{2i}^2 − 2/3 + e_i and (P6) y_i =
\hat{\mu}_{\mathrm{NNI}}^{(k)} = \sum_{i=1}^{n} \omega_i^{(k)} \left[ \hat{\mu}(x_i) + \delta_i (1 + k_i) \{ y_i - \hat{\mu}(x_i) \} \right],

\hat{\eta}_{\mathrm{NNI}}^{(k)} = \sum_{i=1}^{n} \omega_i^{(k)} \left[ \hat{\mu}_\eta(x_i) + \delta_i (1 + k_i) \{ I(y_i < c) - \hat{\mu}_\eta(x_i) \} \right],

\hat{\xi}_{\mathrm{NNI}}^{(k)}(\hat{\xi}_{\mathrm{NNI}}) = \hat{f}(\hat{\xi}_{\mathrm{NNI}})^{-2} \sum_{i=1}^{n} \omega_i^{(k)} \left[ \hat{\mu}_s(\hat{\xi}_{\mathrm{NNI}}, x_i) + \delta_i (1 + k_i) \{ I(y_i ≤ \hat{\xi}_{\mathrm{NNI}}) - \hat{\mu}_s(\hat{\xi}_{\mathrm{NNI}}, x_i) \} \right],
where \hat{\mu}_\eta(x), \hat{\mu}_s(\xi, x) and \hat{f}(\xi) are nonparametric estimators of μ_η(x) = P(y < c | x), μ_s(ξ, x) = P(y < ξ | x) and f(ξ), respectively. These are obtained by kernel regression using a Gaussian kernel with bandwidth h = 1.5 n^{-1/5}. We note that k_i
is the number of times that yi is selected to impute the missing values of y based
on the original data and therefore is kept the same across replicated data sets. The
variance estimators are compared in terms of empirical coverage rate and relative
bias, {E(V̂I ) − V }/V , where V is the true variance estimated from Monte Carlo
samples.
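The kernel regression step can be sketched as a plain Nadaraya-Watson estimator (our minimal version):

```python
import numpy as np

def kernel_regression(x0, x, y, h):
    """Nadaraya-Watson estimate of E(y | x) at points x0 with a Gaussian
    kernel, as used for the mu-hat estimators (h = 1.5 * n**(-1/5))."""
    x0, x, y = np.asarray(x0), np.asarray(x), np.asarray(y)
    K = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (K @ y) / K.sum(axis=1)
```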
Tables 1 and 2 present the simulation results under simple random sampling and probability proportional to size sampling, respectively, based on 2,000 Monte
Carlo samples. Under both sampling designs, the nearest neighbor imputation
estimator has small biases for all parameters μ, η and ξ , under (P1)–(P3) with m(x)
correctly specified for the mean function and (P4)–(P6) with m(x) misspecified for
the mean function. For variance estimation, as expected, the conventional jackknife
variance estimator is severely biased, indicating that the lack of smoothness of
the matching estimator needs to be taken into account in variance estimation. In
contrast, the proposed jackknife variance estimators provide satisfactory results
under both sampling designs and for all parameters. The relative biases are small and the empirical coverage rates are close to the nominal 95% level. Overall, the simulation results suggest that the proposed replication
variance estimation works reasonably well under the settings we considered.
6. CONCLUDING REMARKS
We focus on inference for general population parameters when the outcome is missing at random in survey data, using nearest neighbor imputation, a hot-deck type of imputation. The advantage of hot deck imputation over mean, ratio and regression imputation is that it provides not only asymptotically valid mean estimators but also valid distribution and quantile estimators. This article establishes asymptotic properties of
the nearest neighbor imputation estimators based on a scalar variable summarizing
all covariate information. Because of the non-smooth nature of nearest neighbor
[Tables 1 and 2 report the Bias, SE, relative bias (RB) and coverage rate (CR) of the proposed jackknife (Prop JK) and conventional jackknife (Conv JK) variance estimators, with m(x) correctly specified (c) or misspecified (m).]
however, their asymptotic properties are underdeveloped (Lenis et al., 2017). The
proposed methodology here can be easily generalized to investigate the asymptotic
properties of propensity score matching estimators with survey weights.
Our methodology and theoretical results for nearest neighbor imputation rep-
resent an important building block for future developments. Such developments
can follow three lines. First, extending the current theory to non-negligible sam-
pling fractions is possible; see, e.g., Mashreghi et al. (2014). For non-negligible
sampling fraction, note that

\mathrm{var}(\hat{\mu}_{g,\mathrm{NNI}} - \mu_g) = \mathrm{var}(\hat{\psi}_{\mathrm{HT}} - \mu_\psi) + \mathrm{var}(\mu_\psi - \mu_g) + o(n^{-1}),

and \mathrm{var}(\mu_\psi - \mu_g) = O(N^{-1}). Thus, we can add a model-based estimator of \mathrm{var}(\mu_\psi - \mu_g) to the replication variance estimator for \mathrm{var}(\hat{\psi}_{\mathrm{HT}} - \mu_\psi).
Second, instead of choosing the nearest neighbor as a donor for missing items, we
can consider fractional imputation (Kim and Fuller, 2004; Yang et al., 2013; Kim
and Yang, 2014; Yang and Kim, 2016) using K (K > 1) nearest neighbors. Third,
writing yi = xi Ri and using the fact that xi is always observed, we can apply near-
est neighbor imputation only to impute Ri , which can be called nearest neighbor
ratio imputation.
ACKNOWLEDGMENTS
Dr. Yang is partially supported by NSF grant DMS 1811245, NCI grant P01
CA142538, and Ralph E. Powe Junior Faculty Enhancement Award from Oak
Ridge Associated Universities. Dr. Kim is partially supported by NSF grant MMS
1733572.
REFERENCES
Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average
treatment effects. Econometrica, 74, 235–267.
Abadie, A., & Imbens, G. W. (2008). On the failure of the bootstrap for matching estimators.
Econometrica, 76, 1537–1557.
Abadie, A., & Imbens, G. W. (2011). Bias-corrected matching estimators for average treatment effects.
Journal of Business & Economic Statistics, 29, 1–11.
Abadie, A., & Imbens, G. W. (2012). A martingale representation for matching estimators. Journal of
the American Statistical Association, 107, 833–843.
Abadie, A., & Imbens, G. W. (2016). Matching on the estimated propensity score. Econometrica, 84,
781–807.
Adusumilli, K. (2017). Bootstrap inference for propensity score matching. Retrieved from
https://economics.sas.upenn.edu/events/bootstrap-inference-propensity-score-matching.
Berger, Y. G., & Skinner, C. J. (2003). Variance estimation for a low income proportion. Journal of the
Royal Statistical Society: Series C, 52, 457–468.
Bickel, P. J., Götze, F., & van Zwet, W. R. (2012). Resampling fewer than n observations: Gains, losses,
and remedies for losses. In Selected works of Willem van Zwet (pp. 267–297). New York, NY:
Springer.
Chen, J., & Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics,
16, 113–131.
Chen, J., & Shao, J. (2001). Jackknife variance estimation for nearest-neighbor imputation. Journal of
the American Statistical Association, 96, 260–269.
Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of
Econometrics, 6, 5549–5632.
Deville, J. C. (1999). Variance estimation for complex statistics and estimators: Linearization and
residual techniques. Survey Methodology, 25, 193–204.
Ding, P., & Li, F. (2018). Causal inference: A missing data perspective. Statistical Science, 33, 214–237.
Francisco, C. A., & Fuller, W. A. (1991). Quantile estimation with a complex survey design. Annals of
Statistics, 19, 454–469.
Fuller, W. A. (2009). Sampling Statistics. Hoboken, NJ: John Wiley & Sons.
Geman, S., & Hwang, C.-R. (1982). Nonparametric maximum likelihood estimation by the method of
sieves. Annals of Statistics, 10, 401–414.
Heckman, J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing selection bias using
experimental data. Econometrica, 66, 1017–1098.
Heitjan, D. F., & Little, R. J. (1991). Multiple imputation for the fatal accident reporting system.
Applied Statistics, 40, 13–29.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using
the estimated propensity score. Econometrica, 71, 1161–1189.
Ichimura, H., & Linton, O. B. (2005). Asymptotic expansions for some semiparametric program eval-
uation estimators. In D. Andrews & J. Stock (Eds.), Identification and inference in econometric
models: Essays in honor of Thomas J. Rothenberg. Cambridge: Cambridge University Press.
Isaki, C. T., & Fuller, W. A. (1982). Survey design under the regression superpopulation model. Journal
of the American Statistical Association, 77, 89–96.
Kim, J. K., & Fuller, W. (2004). Fractional hot deck imputation. Biometrika, 91, 559–578.
Kim, J. K., Fuller, W. A., Bell, W. R., et al. (2011). Variance estimation for nearest neighbor imputation
for US Census long form data. The Annals of Applied Statistics, 5, 824–842.
Kim, J. K., Navarro, A., & Fuller, W. A. (2006). Replication variance estimation for two-phase stratified
sampling. Journal of the American Statistical Association, 101, 312–320.
Kim, J. K., & Yang, S. (2014). Fractional hot deck imputation for robust inference under item
nonresponse in survey sampling. Survey Methodology, 40, 211–230.
Lee, H., & Särndal, C. E. (1994). Experiments with variance estimation from survey data with imputed
values. Journal of Official Statistics, 10, 231–243.
Lenis, D., Nguyen, T. Q., Dong, N., & Stuart, E. A. (2017). It’s all about balance: Propensity score
matching in the context of complex survey data. Biostatistics. doi:10.1093/biostatistics/kxx063.
Li, Q., & Racine, J. S. (2007). Nonparametric econometrics: Theory and practice. Princeton, NJ:
Princeton University Press.
Little, R. J. (1988). Missing-data adjustments in large surveys. Journal of Business & Economic
Statistics, 6, 287–296.
Little, R. J., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: John Wiley
& Sons.
Mashreghi, Z., Haziza, D., & Léger, C. (2016). A survey of bootstrap methods in finite population
sampling. Statistics Surveys, 10, 1–52.
Mashreghi, Z., Léger, C., & Haziza, D. (2014). Bootstrap methods for imputed data from regression,
ratio and hot-deck imputation. Canadian Journal of Statistics, 42, 142–167.
Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of
Econometrics, 79, 147–168.
Otsu, T., & Rai, Y. (2017). Bootstrap inference of matching estimators for average treatment effects.
Journal of the American Statistical Association, 112, 1720–1732.
Politis, D. N., Romano, J. P., & Wolf, M. (1999). Subsampling. New York, NY: Springer-Verlag.
Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple
imputations. Journal of Business & Economic Statistics, 4, 87–94.
Rust, K. F., & Rao, J. N. K. (1996). Variance estimation for complex surveys using replication
techniques. Statistical Methods in Medical Research, 5, 283–310.
Sande, I. G. (1979). A personal view of hot deck imputation procedures. Survey Methodology, 5,
238–258.
Serfling, R. J. (1980). Approximation theorems of mathematical statistics. Hoboken, NJ: John Wiley
& Sons.
Shao, J., & Steel, P. (1999). Variance estimation for survey data with composite imputation and
nonnegligible sampling fractions. Journal of the American Statistical Association, 94, 254–265.
Shao, J., & Wang, H. (2008). Confidence intervals based on survey data with nearest neighbor
imputation. Statistica Sinica, 18, 281–297.
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical
Science, 25, 1–21.
Wolter, K. (2007). Introduction to variance estimation (2nd ed.). New York, NY: Springer.
Wu, C., & Sitter, R. R. (2001). A model-calibration approach to using complete auxiliary information
from survey data. Journal of the American Statistical Association, 96, 185–193.
Yang, S., Imbens, G. W., Cui, Z., Faries, D. E., & Kadziola, Z. (2016). Propensity score match-
ing and subclassification in observational studies with multi-level treatments. Biometrics, 72,
1055–1065.
Yang, S., & Kim, J. K. (2016). Fractional imputation in survey sampling: A comparative review.
Statistical Science, 31, 415–432.
Yang, S., Kim, J. K., & Shin, D. W. (2013). Imputation methods for quantile estimation under missing
at random. Statistics and its Interface, 6, 369–377.
APPENDICES
The Appendices include proofs of Theorems 1–3 and additional technical details.
B_N = \frac{n^{1/2}}{N} \sum_{i \in A} \frac{1}{\pi_i} (1 - \delta_i) \{ \mu_g(x_{i(1)}) - \mu_g(x_i) \}
\le C_2 \frac{n^{1/2}}{N} \sum_{i \in A} \frac{1}{\pi_i} (1 - \delta_i) | m_{i(1)} - m_i | = o_P(1),
D_N = \frac{n^{1/2}}{N} \left[ \sum_{i \in A} \frac{1}{\pi_i} \{ \mu_{g,i} + \delta_i (1 + k_i) e_i \} - \sum_{i=1}^{N} g(y_i) \right]
= \frac{n^{1/2}}{N} \sum_{i=1}^{N} \left( \frac{I_i}{\pi_i} - 1 \right) \mu_{g,i} + \frac{n^{1/2}}{N} \sum_{i=1}^{N} \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\} e_i,   (A.2)
and we can verify that the covariance of the two terms in (A.2) is zero. Thus,
\mathrm{var}(D_N) = \mathrm{var}\left\{ \frac{n^{1/2}}{N} \sum_{i=1}^{N} \left( \frac{I_i}{\pi_i} - 1 \right) \mu_{g,i} \right\} + \mathrm{var}\left[ \frac{n^{1/2}}{N} \sum_{i=1}^{N} \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\} e_i \right].
It remains to show that V_{ge} = O(1). To do this, the key is to show that the
moments of ki are bounded. Under Assumption 2, it is easy to verify that
and

\sup_{\xi \in I_s} N^{-1} \sum_{i=1}^{N} | s(y_i - \xi_N - N^{-\alpha} \xi) - s(y_i - \xi_N) | = O_P(N^{-\alpha}),
Assumption B.1 (5) holds with probability one under suitable assumptions on
the probability mechanism generating the yi ’s and on the function s( · ), and there-
fore it is justifiable. Under Assumption B.1, by the standard arguments from the
theory on M-estimators (Serfling, 1980), ξ̂NNI is consistent for ξN . We further make
the following assumption.
Assumption B.2.
The nearest neighbor imputation estimator ξ̂NNI is root-n consistent for ξN .
Now, we give proof for Theorem 2. Under Assumptions B.1 and B.2, we can
write
Following a derivation similar to that in the proof of Theorem 1, it is easy to show that
\mathrm{var}\{\hat{S}_N(\xi)\} = \lim_{n \to \infty} \frac{n}{N^2} E \left[ \mathrm{var}_P \left\{ \sum_{i \in A} \frac{E\{s(y_i - \xi) \mid x_i\}}{\pi_i} \right\} \right]
+ \lim_{n \to \infty} \frac{n}{N^2} \sum_{i=1}^{N} E \left[ \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\}^2 \mathrm{var}\{s(y_i - \xi) \mid x_i\} \right].
Assumption C.1 states conditions on the smoothness and tail behavior of the
kernel functions. Popular kernel functions, including Epanechnikov, Gaussian and
triangle kernels, satisfy the required conditions.
We consider continuous g(y) and power series estimation for μg (x) with K terms
in the series, where K increases with n. Let p be the dimension of X. Consider a
sequence of power functions
where W is a diagonal matrix with the ith diagonal element πi−1 , and (P T W P )−
denotes a generalized inverse of a matrix P T W P .
Suppose the following assumption holds for establishing the fast convergence
rate of μ̂g (x) in (D.2).
Assumption D.1.
1. The support of x is a Cartesian product of compact intervals;
2. μg (x) is s-times continuously differentiable at x with s/p > 1;
3. the number of series K = O(nν ) with 0 < ν < 1/3.
Lemma D.1. Under Assumption D.1, the power series estimator \hat{\mu}_g(x) in (D.2) satisfies \sup_x |\hat{\mu}_g(x) - \mu_g(x)| = O_P\{ (K^3/n)^{1/2} + K^{1 - s/p} \} = o_P(1).
Denote p_\xi(x) = E\{I(y ≤ \xi) \mid x\} and \mathrm{logit}(a) = \{1 + \exp(-a)\}^{-1}. The series logit estimator for p_\xi(x) can be obtained as

\hat{\pi}_K = \arg\max_{\pi} \sum_{i \in A} \pi_i^{-1} \left( I(y_i - \xi \le 0) \log \mathrm{logit}\{ p^K(x_i)^{\mathrm{T}} \pi \} + I(y_i - \xi > 0) \log\left[ 1 - \mathrm{logit}\{ p^K(x_i)^{\mathrm{T}} \pi \} \right] \right).
Suppose that the following assumption holds for establishing the fast convergence
rate of the series logit estimator p̂ξ (x) in (D.3).
Assumption D.2.
1. The support of x is a Cartesian product of compact intervals;
2. pξ (x) is s times continuously differentiable with s/p ≥ 3;
3. pξ (x) is bounded away from zero and one on the support of x;
4. the density of x is bounded away from zero on the support of x;
5. the number of series K = O(nν ) with ν < 1.
\hat{\mu}_{g,\mathrm{NNI}}^{*} = \sum_{i \in A} \omega_i^{*} \left[ \hat{\mu}_g(x_i) + \delta_i (1 + k_i) \{ g(y_i) - \hat{\mu}_g(x_i) \} \right] u_i
= \sum_{i \in A} \omega_i^{*} \left[ \mu_g(x_i) + \delta_i (1 + k_i) \{ g(y_i) - \mu_g(x_i) \} \right] u_i
+ \sum_{i \in A} \omega_i^{*} \{ (1 - \delta_i) + \delta_i k_i \} \{ \hat{\mu}_g(x_i) - \mu_g(x_i) \} u_i
= \sum_{i \in A} \omega_i^{*} \psi_i u_i + R_N^{*},   (E.1)

where R_N^{*} = \sum_{i \in A} \omega_i^{*} \{ (1 - \delta_i) + \delta_i k_i \} \{ \hat{\mu}_g(x_i) - \mu_g(x_i) \} u_i.
We now show E^{*}\{ (n^{1/2} R_N^{*})^2 \} \to 0 in probability. We write

E^{*}\{ (n^{1/2} R_N^{*})^2 \}
= nN\, E^{*}\{ (\omega_1^{*} u_1)^2 \} \frac{1}{N} \sum_{i \in A} \{ (1 - \delta_i) + \delta_i k_i \}^2 \{ \hat{\mu}_g(x_i) - \mu_g(x_i) \}^2
+ 2nN(N - 1)\, E^{*}( \omega_1^{*} \omega_2^{*} u_1 u_2 ) \frac{1}{N(N - 1)} \sum_{i \ne j \in A} \{ (1 - \delta_i) + \delta_i k_i \}
ABSTRACT
We survey banks to construct national estimates of total noncash payments
by type, payments fraud and related information. The survey is designed to
create aggregate total estimates of all payments in the United States using
data from responses returned by a representative, random sample. In 2016,
the number of questions in the survey doubled compared with the previous
survey, raising serious concerns about nonparticipation by smaller banks. To obtain
sufficient response data for all questions from smaller banks, we adminis-
tered a modified survey design which, in addition to randomly sampling
banks, also randomly assigned one of several survey forms, subsets of the
full survey. This case study illustrates that while several other factors influ-
enced response outcomes, the approach helped ensure sufficient response
for smaller banks. Using such an approach may be especially important in
an optional-participation survey, when reducing costs to respondents may
affect success, or when imputation of unplanned missing items is already
needed for estimation. While a variety of factors affected the outcome, we
find that the planned missing data approach improved response outcomes
238 GEOFFREY R. GERDES AND XUEMEI LIU
for smaller banks. The planned missing item design should be considered
as a way of reducing survey burden or increasing unit-level and item-level
responses for individual respondents without reducing the full set of survey
items collected.
Keywords: Business survey; responder burden; planned missing data; split
questionnaire; multiform design; imputation
JEL classifications: C83
The planned missing data design might lower unit response rates for smaller banks, but with an expected dividend of higher total response counts by item. To obtain sufficient response data for all items, the shorter
forms, which included some full-coverage items and some partial-coverage items,
were designed as complementary subsets of the full survey form. The shorter forms
were administered so that all partial-coverage items were presented to an equal number of randomly selected banks. This approach fielded the full survey
form to the largest banks, and three sets of partial form variants to smaller banks
that contained either 2/3, 1/2 or 1/3 of the partial-coverage items, allowing the
length of the surveys to decline with bank size. The resulting missing patterns are influenced (1) by randomized planned missing items determined by the form sent to the respondent and (2) by unplanned missing items in the returned surveys.
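The complementary-subsets idea can be sketched as follows (the three-block split and the item pool are our illustrative assumptions; the fielded forms additionally varied with bank size):

```python
import random

def build_forms(partial_items, seed=0):
    """Split the partial-coverage items into 3 complementary blocks and
    build the 2/3-length form variants; each item lands in exactly one
    block and hence on exactly two of the three 2/3 forms."""
    rng = random.Random(seed)
    items = list(partial_items)
    rng.shuffle(items)
    blocks = [items[b::3] for b in range(3)]
    two_thirds_forms = [blocks[b] + blocks[(b + 1) % 3] for b in range(3)]
    return blocks, two_thirds_forms
```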
By design, the approach introduces planned missing data items, but a non-
mandatory survey involving a lengthy survey form already introduces unplanned
missing data items, known also as item nonresponse. Aggregate estimates we pro-
duce from the survey data are subject to various adding up constraints, which also
apply at the level of the unit response. Complete case analysis wastes information and would violate adding-up and other logical constraints. More generally, missing items of both types introduce problems for analysis methods
requiring a rectangular data set. But modern imputation methods make it possible
to obtain a rectangular data set in the presence of missing data, retain all of the
collected information and produce estimates with desirable statistical properties
(Little and Rubin, 2002).
Our planned missing data design adapts split questionnaire and missing-by-design approaches from the literature. Multiple-matrix sampling
involves fielding samples of questions to test subjects. The idea appears to have
originated from problems of establishing normative univariate distributions for
broad sets of examination questions in educational testing and does not require
imputation (Lord, 1962). Inter alia, Raghunathan and Grizzle (1995) extended
the multiple-matrix idea to a split questionnaire survey design, which imposes
restrictions on the assignment of items to sampled subjects to retain the ability
to estimate population quantities. The three-form design of Graham, Hofer, and
MacKinnon (1996) has similar goals but imposes a particular structure on the
overlap of full-coverage and partial-coverage items. Both latter papers employ
imputation techniques to handle missing data, as we do.
Business surveys have a number of features that distinguish them from sur-
veys of individuals (Snijkers et al., 2013). A business survey collects data about
the business and not the individual responding to the survey. Political and social
considerations generally substitute for psychological ones. Recruitment involves obtaining the support of senior managers, of internal experts with access to the information, and of individuals assigned to fill out parts of the survey or to serve as a response contact. Further, once a participating business is recruited, obtaining
responses to the survey items involves the participant incurring varying levels of
cost, with paid staff consulting records, performing database queries and some-
times engaging third-party providers. Data quality remains a concern with the
collection of information that is objectively verifiable but complex to obtain.
Finally, business heterogeneity, which may cause behavior to vary by, for example, size and type, may lead to different treatments across classes.
This case study illustrates that our approach involved an increase in the sample size, which could reduce unit response rates; such rates often serve as a proxy for survey quality, potentially raising concerns of nonresponse bias. If that were our focus, we would direct such concerns to the smaller size strata, because larger bank strata return surveys at a higher rate. In any case, concerns about minimum data set size, as we have, and other factors that could contribute to nonsampling error in a complex business survey may justifiably dominate concerns about nonresponse bias (see, e.g., Lineback and Thompson (2010)). The disproportional influence of larger bank strata on the total estimates also means that misreported data or the loss of a very large bank response potentially overshadows concerns of low response rates in small bank strata.
The remainder of the paper is organized as follows. First, the standard approach
to our survey design is discussed. Second, we discuss the challenge we faced in
2016 by reviewing the 2013 survey outcome and considering the impact of the
growth in the number of items. Third, we discuss our implementation of a planned
missing data design for the 2016 survey. Fourth, we compare the outcome of the
2016 survey to the 2013 survey, and conclude the paper.
a covariate for estimation can improve precision over, for example, drawing a
simple random sample and constructing a probability estimate (Cochran, 1953).
Data on type and size, as measured by checkable deposits (CHKD) and money
market deposits (MMDA) for the population of banks, are available from reports
filed with the Federal Reserve. The population of US banks that process payments, which includes commercial banks (CMB), savings institutions (SAV) and credit unions (CUS), is large, with well over 10,000 banks in 2013 and 2016, and has a highly skewed size distribution.
Because most of the payments are made from CHKD and MMDA, they are
highly correlated with the payment volumes of interest, making them valuable
measures of bank size for stratification as well as potential covariates for estimation
purposes. To take advantage of the correlations and to account for the skewness,
the bank population is stratified into subpopulations by type and size, and separate
samples are drawn from each, with the sampling rate declining with size.
A bank's type has a meaningful relationship with how it is regulated, the type of business it conducts and how it reports its information. Another variable, STRATVAR, is defined to be equal to the sum of CHKD and MMDA for CMB and SAV, and equal to CHKD for CUS due to reporting differences. Stratification first by
type, and then by size within type, using STRATVAR improves the precision of
estimates for a given sample size. There are different procedures recommended in
the literature for choosing an optimal cutoff between a take-all and a collection
of take-some strata in a skewed population, for example, Hidiroglou (1986) or
Hansen, Hurwitz, and Madow (1953). For the take-some strata, the boundaries were chosen using the cum √f method of Dalenius and Hodges (1959) for a fixed number of strata, and the sample was allocated following Neyman (1934).
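A minimal sketch of these two steps (the binning and data layout are our assumptions):

```python
import numpy as np

def cum_sqrt_f_boundaries(sizes, n_strata, n_bins=50):
    """Dalenius-Hodges rule: cut where the cumulative sqrt of bin
    frequencies is equalized across the take-some strata."""
    freq, edges = np.histogram(sizes, bins=n_bins)
    csf = np.cumsum(np.sqrt(freq))
    targets = [csf[-1] * (h + 1) / n_strata for h in range(n_strata - 1)]
    return [edges[np.searchsorted(csf, t) + 1] for t in targets]

def neyman_allocation(N_h, S_h, n_total):
    """Neyman (1934): stratum sample sizes proportional to N_h * S_h."""
    w = np.asarray(N_h, dtype=float) * np.asarray(S_h, dtype=float)
    return n_total * w / w.sum()
```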
The framework of the sample selection procedure is a representative, random
sample using an auxiliary measure of size among the population from administra-
tive data, where we stratify the population, draw separate samples and construct
estimates from the sample responses within the strata. To obtain aggregate esti-
mates of volumes for the population, we used separate ratio estimators. We took
advantage of the high correlation between the universally available size covariate,
CHKD, with the various volumes measured in the study. Standard ratio estimators
for a population are used to “blow up” the sample data to the population estimate
with a covariate available on the size of banks in the population. Liu, Gerdes, and
Parke (2009) discusses the stratification and estimation approach we used in more
detail.
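A sketch of the separate ratio estimator (the per-stratum data layout is our assumption):

```python
import numpy as np

def separate_ratio_estimate(strata):
    """Separate ratio estimator: per stratum, blow up the sample ratio of
    payment volume y to the size covariate x (e.g. CHKD) by the known
    population covariate total X_h, then sum over strata."""
    return sum(np.sum(y) / np.sum(x) * X_h for y, x, X_h in strata)
```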
The rectangular data set used for estimation is imputed using an iterative EM algorithm approach that estimates the covariances between reported items in the presence of missing items while simultaneously imputing the missing items. Imputed data replace missing items and are treated as reported data. Standard errors that
account for imputation model error are calculated following a multiple imputation
strategy that uses a large number of replicate data sets with imputed data augmented
with random draws from the imputation model error distribution. Gerdes and Liu
(2010) discusses our approach to imputation and estimation in more detail.
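The combining step is standard multiple imputation; assuming it follows Rubin's rules for a scalar estimate (our reading of the multiple imputation strategy described above, not a detail stated here), the total variance is:

```python
import numpy as np

def mi_total_variance(estimates, within_vars):
    """Rubin's rules for M imputation replicates:
    T = mean within-variance + (1 + 1/M) * between-imputation variance."""
    q = np.asarray(estimates, dtype=float)
    w = np.asarray(within_vars, dtype=float)
    M = len(q)
    B = np.sum((q - q.mean()) ** 2) / (M - 1)
    return w.mean() + (1.0 + 1.0 / M) * B
```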
3. THE CHALLENGE
Technological developments, evolving market structure and expanding payments
research and policy interests led to several changes in the survey design. In par-
ticular, the 2016 form included 502 payment volume items, roughly twice the 253 items in the 2013 form. This large increase led us to try
a planned missing data design to address anticipated nonresponse due to survey
fatigue issues. Past experience, policy goals and survey form structure consid-
erations led to a prioritization of survey items into a set that would receive full
coverage in the survey, meaning they would be asked of all subjects, and to a set of
the remaining survey items, which would receive only partial coverage, meaning
that they would be asked of some but not all subjects. All partial-coverage survey
items would be distributed such that they would have an equal chance of being
presented to and answered by a subject.
The rise in the length of the survey form between 2013 and 2016 raised seri-
ous concerns that smaller banks, already exhibiting low response rates in 2013,
would not be willing to participate in sufficient numbers. Generally, the risk of
nonparticipation because of survey length is high and declines with bank size. Past
surveys have not exceeded a 55% unit response rate, and the unit response rate of
the 2013 survey had declined to 44%. To limit losses in the 2016 survey, we tried
a survey design that sacrifices potential gains at the intensive participation
margin (answering more survey items) in favor of reducing potential losses at the
extensive margin (the decision not to participate at any level in this
nonmandatory survey).
While our surveys have grown in length at each repetition, the unit-level
response rate to our 2013 survey had dropped markedly, to 44% overall, com-
pared with unit-level response rates in 2010 or earlier that ranged from 54 to
56%. While many factors could have been involved, some of the response rate
decline likely was attributable to the increased length and complexity. Evidence
from respondent feedback, such as complaints about an increasing burden placed
on bank staff by other, mandatory surveys, suggested that the survey environment
had also become increasingly challenging.2 Recognizing the importance of accom-
modating respondent needs in a nonmandatory survey, we entered the planning
stage for the 2016 collection of survey data with a recognition that greater effort
and adaptability would be needed to sustain the same response rate as 2013.
Improving Response Quality with Planned Missing Data 243
In the face of these challenges, the scope of information and the total number
of items included in the survey also grew. The number of payment volume items
increased by 249, growing from 253 in the 2013 survey to 502 during 2016. The
number of intersecting items between the two surveys is 205, with 297 new items
for 2016 and 48 expired items from 2013. The survey is not mandatory, and the
102% increase in the number of items in the survey led us to look for a relatively
radical adjustment to the survey design for 2016 which would offset the increased
burden and the anticipated drop off in response that might result, especially for
smaller banks.
In addition to the sheer increase in the number of items requested, we made a
major change to the survey reference period. Surveys prior to 2013 were designed
to collect prospective data during one or two months after banks registered their
participation, and banks were notified of the survey content before the reference period
so that they could prepare systems and staff to compile the requested information
while the measured activity was taking place. On the conjecture that the balance
of participating banks could provide information retrospectively from the previous
calendar year, given advances in electronic record keeping and retrieval capabil-
ities, the 2016 survey collected, for the first time, data for the previous calendar
year (2015).3 This new approach to the survey reference period was anticipated to
also have an effect on response rates, especially on those of the smaller banks.
The survey data have always exhibited complex patterns of item nonresponse.
Many items, however, are part of a logical structure in the form of subtotals
adding up to totals. An example for a volume of debit card payments is depicted
in Figure 1. Collecting data in this form is often necessary because of the subject
matter: it enumerates the components of totals for clarity, and it accommodates
cases where, for example, some respondents cannot easily report the subtotal
details or, conversely, have access to a subtotal but not all components of the
total.
Fig. 1. An Example of a Set of Logical “Adding Up” Relationships Between Items in the
Surveys. Note: Debit card must be the sum of Card-present and Card-not-present. Likewise,
Card-present must be the sum of Signature-authenticated, PIN-authenticated and Other.
Imputing and enforcing logic at the response level allows the exploitation
of within-stratum covariances and within-response logical constraints. A final
covariance matrix is obtained through an approach based on an iterative, E–M
algorithm-based imputation method which, under the assumption that values are
missing at random, produces a maximum likelihood estimate of the covariance
matrix in the presence of missing data and simultaneously produces imputed
data (Little and Rubin, 2002). Precision is measured through multiple
imputation estimates of the ratio estimator standard errors, which account for
the model-based parameter estimation errors.
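A simplified sketch of the iterative imputation idea follows, assuming a multivariate normal model: missing entries are repeatedly replaced by their conditional expectations given the observed entries under the current mean/covariance estimate. Full EM would also add a conditional-variance correction when updating the covariance, so this illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np

# Iterative conditional-mean imputation in the spirit of the E-M
# approach in the text (illustration only; real EM also corrects the
# covariance update for the conditional variance of imputed entries).

def iterative_impute(X, n_iter=50):
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # start from the column means of the observed values
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # E[missing | observed] under the current normal estimate
            X[i, m] = mu[m] + S[np.ix_(m, o)] @ np.linalg.solve(
                S[np.ix_(o, o)], X[i, o] - mu[o])
    return X

# Hypothetical two-column data: the second column is roughly twice the
# first, and its last value is missing.
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, np.nan]])
X_imp = iterative_impute(X)
print(round(float(X_imp[3, 1]), 1))  # → 8.2, extrapolating the linear pattern
```

The multiple imputation step in the text would then repeat this with random draws added from the model error distribution, producing replicate data sets for the standard-error calculation.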
To set the stage and explain the potential benefit of the planned missing data
design, it is helpful to discuss the response outcome of the 2013 survey first. To
simplify the discussion and to prepare for comparisons with the results of the
planned missing data design in 2016, we adjusted the 2013 strata boundaries to
represent the same population proportions as 2016. The one exception to this
proportional matching is that the stratum for the largest banks of each type is
constructed to include the top 50 CMB, the top 25 SVG and the top 25 CUS
in both 2013 and 2016. For convenience, we label CMB size strata 11–18, with
bank size, measured by STRATVAR, increasing with the label and, similarly, label
SVG strata 21–26 and CUS 31–38.
Figure 2 shows the unit and item response rates achieved for different strata in
the 2013 survey. Within each bank type, strata are displayed with size category
increasing from left to right. For example, stratum 11 contains the smallest CMB
and stratum 18 contains the largest CMB. The unit response rate within stratum
18 was about 85%. The item response rate is the total number of items returned
in proportion to the total number of items presented to all sampled banks, which,
for stratum 18, was approximately 65%. Smaller strata display much lower unit
response rates, and within each bank type, a rise in the unit and item response rates
is evident as bank size increases. The increase with bank size, with a few notable
exceptions, is close to being monotonic and is most pronounced for CUS.
The overall item response rate in the 2013 survey was about 30%, 14 percentage
points lower than the unit response rate of 44% (Table 1). In the smallest bank strata, unit-
level and item-level response rates dipped to very low levels. The lowest unit and
item-level response rates were in stratum 31, reaching 20% and 13%, respectively.
Relatively low unit response and low item response in small bank strata led to
sample sizes too small to estimate without combining strata.
[Figure 2 here: bar chart of unit (Unit) and item (Item) response rates, in percent, by stratum 11–38.]
Fig. 2. Response Rates by Stratum for 2013 are Calculated Two Different Ways.
Note: The unit response rate (Unit) is the number of responding banks (those that returned
a survey form) in proportion to the number of sampled banks. The item response rate (Item)
is the total number of items returned in proportion to the total number of items presented
to all sampled banks.
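The two rates defined in the note can be computed directly; the per-bank records below are made up. The example also shows why the item rate sits below the unit rate: even responding banks leave some items blank.

```python
# Unit and item response rates as defined in the note. Each sampled bank
# has the number of items presented and the number returned
# (nonrespondents return none). Counts are hypothetical.

banks = [
    {"responded": True,  "presented": 100, "returned": 80},
    {"responded": True,  "presented": 100, "returned": 40},
    {"responded": False, "presented": 100, "returned": 0},
    {"responded": False, "presented": 100, "returned": 0},
]

# unit rate: responding banks / sampled banks
unit_rate = sum(b["responded"] for b in banks) / len(banks)
# item rate: items returned / items presented to all sampled banks
item_rate = sum(b["returned"] for b in banks) / sum(b["presented"] for b in banks)

print(f"unit: {unit_rate:.0%}, item: {item_rate:.0%}")  # → unit: 50%, item: 30%
```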
doing so, we would have to double our sample size to present all items to the same
number of survey subjects. In a simple scenario of a survey of sample size n, the
example of shortening the survey just described would be equivalent to defining
the first half of the survey as survey form 1 and the second half as survey form 2.
If n banks receive survey form 1 and another n banks receive survey form 2, then
each survey item is posed to n banks.
This applied study requires an approach tailored to the specific survey form,
related to the three-form and split-questionnaire designs proposed in the literature.
The 2016 survey form is naturally divided into the following nine categories:
bank profile, check payments and check returns, automated clearing house (ACH)
payments and ACH returns, wire payments, debit and prepaid card payments, credit
card payments, cash deposits and withdrawals, alternative payment methods, and
unauthorized third-party fraudulent payments. In past studies, we imputed the data
Table 1. 2013 Population, Sample and Response Counts by Bank Type and Stratum.

Stratum    Boundary     N        n       n/N    Resp.   Unit    Item
CMB
  18       (top 50)     50       50      1.00   42      0.84    0.64
  17       8,100,000    298      298     1.00   152     0.51    0.35
  16       593,000      273      217     0.79   115     0.53    0.34
  15       301,000      373      208     0.56   100     0.48    0.30
  14       186,200      685      186     0.27   78      0.42    0.28
  13       116,250      966      186     0.19   73      0.39    0.25
  12       70,700       1,318    231     0.18   78      0.34    0.22
  11       35,955       1,529    153     0.10   53      0.35    0.24
  Subtotal              5,492    1,529   0.28   691     0.45    0.30
SVG
  26       (top 25)     25       25      1.00   16      0.64    0.56
  25       1,900,000    61       61      1.00   32      0.52    0.36
  24       468,000      120      90      0.75   48      0.53    0.41
  23       173,000      156      57      0.37   25      0.44    0.31
  22       86,600       183      37      0.20   19      0.51    0.30
  21       44,400       344      59      0.17   27      0.46    0.34
  Subtotal              889      329     0.37   167     0.51    0.37
CUS
  38       (top 25)     25       24      0.96   15      0.63    0.49
  37       580,000      50       47      0.94   34      0.72    0.55
  36       307,000      155      138     0.89   73      0.53    0.35
  35       142,100      201      134     0.67   62      0.46    0.34
  34       80,000       279      123     0.44   53      0.43    0.27
  33       44,450       492      113     0.23   30      0.27    0.16
  32       20,700       915      119     0.13   30      0.25    0.17
  31       8,638        3,192    130     0.04   26      0.20    0.13
  Subtotal              5,309    828     0.16   323     0.39    0.26
Total                   11,690   2,686   0.23   1,181   0.44    0.30

Note: Boundary is the lower stratum boundary of the size variable; N is the population count, n the sample count, Resp. the number of responding banks, and Unit and Item the unit and item response rates.
in blocks along these lines to save on computational complexity and, thus, time,
and because increasing the set of potentially correlated data outside of each block
appeared to have limited value relative to within-block information. With this in
mind, the categories formed natural blocks for dividing up the surveys as well.
Table 3. Number of Full-Coverage and Partial-Coverage Items by Section.

Section       Full-coverage   Partial-coverage
Profile       14              0
Check         18              72
ACH           20              64
Wire          4               48
Debit card    18              52
Credit card   19              36
Cash          39              22
Alt pymts     8               16
Fraud         52              0
Total         192             310
presented across a subset. Within strata with blocked surveys, we wanted to
achieve as much balance as possible, that is, to minimize the range of survey
form lengths across the companion versions administered to equal-sized
subsamples within a survey stratum. In addition, it was important to
limit the total number of survey versions to keep the complexity and potential
confusion of administering multiple versions to a minimum.
Table 3 contains the number of full-coverage and partial-coverage items by
section in the final survey form.
All items in the fraud section were designated full-coverage, in consideration
of policy priorities. The remaining sections had different numbers of survey items,
which made the balance consideration important and influenced which blocks
could appear together. After a decision to group the cash and alternative payments
sections together, which made the groups more even in total count and the
combination calculations easier, six distinct blocks of survey items were
defined. While correlations between blocks were not of primary importance
to us, it seemed reasonable to try to pair the sections off in at least one of the
survey form versions in each stratum, subject to practical constraints.
First, we consider the problem of dividing the survey form such that each
respondent gets 1/3 of the survey items to be allocated. In our case, the total
number of survey items to be allocated is 353 less the survey items from bank
profile (6) and fraud (18), or 329. Each sampled bank would expect to receive just
under 110 of these survey items. Now, consider the case of defining just three
of these questionnaires. This is easily done by just placing two sections in each
version of the survey form. This satisfies the simplicity objective but is only able
to produce correlations between three pairs, which means each of the six sections
is paired off only once. Now, there are 15 distinct pairs of sections in a set
of 6. It is apparent, therefore, that 15 versions of the survey would be required
in order to achieve a complete set of pairs, and each pair would exist in only one
of the 15 versions.
Second, we consider the problem of dividing the survey form such that each
respondent gets 1/2 of the survey items to be allocated. Again, there are six
sections. The simplest solution, of course, is to define two versions, as noted
above, by putting three sections on each. This would allow each section to be
correlated with two other sections. An alternative would be to define four versions,
where each section would appear in two of the surveys. Since each survey would
have two other sections, each section could be correlated with only four others,
leaving three pairs unmatched. Going further and defining six versions with three
sections each would mean that each section would be paired with each of the other
sections at least once, but the pairs would not work out evenly; three pairs would
occur twice. For total pairwise balance, 15 versions of this set of survey form
would need to be fielded.
Third, we consider the problem of dividing the survey form such that each
respondent gets 2/3 of the survey items to be allocated. In this case, each bank
would expect to receive four out of six sections. This can be achieved by fielding
three surveys. Because each section would appear with three others, there would
be 18 section pairings across the survey versions; since only 15 distinct pairs
exist, 3 pairs would occur twice. As with the other fractions, pairing off the
sections evenly would require 15
different surveys.
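The pair-coverage arithmetic for the 2/3 scheme can be checked in a few lines. The particular grouping of sections into versions below is illustrative, not the design actually fielded.

```python
from itertools import combinations
from collections import Counter

# Six section blocks, a 2/3 scheme with three versions of four sections
# each (every section appears in exactly two versions). The assignment
# below is one illustrative possibility.

sections = ["Check", "ACH", "Wire", "Debit", "Credit", "Cash+Alt"]
versions = [
    sections[2:],                  # omits Check, ACH
    sections[:2] + sections[4:],   # omits Wire, Debit
    sections[:4],                  # omits Credit, Cash+Alt
]

# count how often each pair of sections shares a version
pair_counts = Counter(p for v in versions for p in combinations(sorted(v), 2))

print(len(list(combinations(sections, 2))))            # → 15 distinct pairs exist
print(sum(pair_counts.values()))                       # → 18 pairings across versions
print(sum(1 for c in pair_counts.values() if c == 2))  # → 3 pairs occur twice
```

This reproduces the counts in the text: 18 pairings across three four-section versions, covering the 15 distinct pairs with 3 duplicates.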
The complexity of managing multiple versions of the surveys led to a decision
that came close to choosing the minimal number of survey versions. We decided to
field only three versions each of the 1/3 and 2/3 fractional survey form schemes. In
the case of the 1/2 fractional survey form scheme, we chose to field four versions.
These choices meant that correlations among certain sections would not be possible
in some strata, which particularly affected the smallest bank strata with the 1/3
scheme, minimally for the banks with the 1/2 scheme, and not at all for the 2/3
and 1 schemes.
Decisions about which combinations of section pairings would be chosen were
based on an attempt to minimize the difference between the longest and shortest
surveys in each stratum, which was determined by making an exhaustive list of all
possible combinations and choosing our preference among them. The final choice
is shown in Table 4. The 2/3 scheme had a range of 17 survey items, the 1/2
scheme had a range of 10 survey items and the 1/3 scheme had a range of 15
survey items.
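The exhaustive balance search can be sketched as follows for a two-version 1/2 split, using the partial-coverage counts from Table 3 with Cash and Alt pymts grouped (22 + 16 = 38). The fielded design used more versions and additional constraints, so its ranges differ from this toy search.

```python
from itertools import combinations

# Choose the split of six section blocks into two survey versions that
# minimizes the range (longest form minus shortest form). Section
# lengths are the partial-coverage item counts from Table 3.

lengths = {"Check": 72, "ACH": 64, "Wire": 48, "Debit": 52,
           "Credit": 36, "Cash+Alt": 38}

def version_range(split):
    totals = [sum(lengths[s] for s in v) for v in split]
    return max(totals) - min(totals)

sections = sorted(lengths)
# enumerate every split into two triples and keep the most balanced one
best = min(
    ((tuple(trio), tuple(s for s in sections if s not in trio))
     for trio in combinations(sections, 3)),
    key=version_range,
)
print(best, version_range(best))  # smallest achievable range here is 2
```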
Table 5 shows how a total sample of 3,797 banks was allocated to strata
and survey form versions. Notice that, in a few cases, the number of sampled
banks within a stratum differs by 1, because of indivisibility.
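The even split with indivisibility can be sketched with a largest-remainder rule; this matches, for example, strata 23 and 22 in Table 5, though the order in which the leftover units are assigned to versions is arbitrary.

```python
# Split a stratum's sample of n banks as evenly as possible across k
# survey form versions; counts differ by at most one when k does not
# divide n. (Which versions get the extra unit is an assumption here.)

def allocate(n, k):
    base, rem = divmod(n, k)
    return [base + 1 if i < rem else base for i in range(k)]

print(allocate(98, 4))   # → [25, 25, 24, 24]  (cf. stratum 23, v5–v8)
print(allocate(123, 3))  # → [41, 41, 41]      (cf. stratum 22, v9–v11)
```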
Table 4. Survey Form Versions Used for Each Partial-Coverage Scheme.
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
Profile Profile Profile Profile Profile Profile Profile Profile Profile Profile Profile
Table 5. Original 2016 Sample Counts by Bank Type, Stratum and Survey Form
Version.
Type Stratum n v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
SAV 25 75 75 0 0 0 0 0 0 0 0 0 0
24 102 1 34 33 34 0 0 0 0 0 0 0
23 98 0 0 0 0 25 25 24 24 0 0 0
22 123 0 0 0 0 0 0 0 0 41 41 41
21 102 0 0 0 0 0 0 0 0 34 34 34
Subtotal 500 76 34 33 34 25 25 24 24 75 75 75
CUS 37 63 63 0 0 0 0 0 0 0 0 0 0
36 130 0 43 43 44 0 0 0 0 0 0 0
35 140 0 47 46 47 0 0 0 0 0 0 0
34 163 0 0 0 0 40 41 41 41 0 0 0
33 150 0 0 0 0 38 37 38 37 0 0 0
32 237 0 0 0 0 0 0 0 0 79 79 79
31 216 0 0 0 0 0 0 0 0 72 72 72
Subtotal 1, 099 63 90 89 91 78 78 79 78 151 151 151
Total 3, 797 439 284 282 286 248 247 248 247 505 507 507
5. RESULTS
Once the survey design is determined, including survey content, sample construc-
tion and so on, a substantial amount of additional work goes into implementing the
data collection. This includes the design and development of the web-based inter-
face for survey participants, detailed recruiting strategies and response follow-up
to encourage a more complete response and to validate the quality of the provided
information. These activities, which are at the core of the success or failure of the
survey, are documented elsewhere. While the effort level, the survey environment,
the ability to properly communicate and define survey content that matches well
with participant understanding, and luck are inseparable and, perhaps, far more
important than the adjustments to the survey we describe here, it is worthwhile to
explore the outcome of the data collection we designed.
Table 6 shows the count of unit-level responses collectively for each stratum.
The column labeled n provides the total returned survey count, which equals
the potential number of high-priority items returned. Overall, we obtained 1,383
responses, exceeding the total goal of 1,215 by nearly 14%. Comparing unit
response counts by type of bank: for CMB we obtained 869 responses, exceeding
the goal of 700 by 24%; for SAV we obtained 198 responses, falling short of the
goal of 212 by 7%; and for CUS we obtained 316, exceeding the goal of 303 by
4%.
In order to obtain these response counts, we expanded the sample size from the
approximately 2,700 sample size used in past studies to approximately 3,800, an
increase of over 40%. In Table 7, we show the resulting response rates by stratum
and survey form version. Overall, the response rate was 36%, down 8 percentage points
Table 6. Response Counts by Bank Type, Stratum and Survey Form Version and
Original Strata.
Type Stratum n v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
SAV 25 40 40 0 0 0 0 0 0 0 0 0 0
24 45 0 15 16 14 0 0 0 0 0 0 0
23 36 0 0 0 0 8 9 9 10 0 0 0
22 41 0 0 0 0 0 0 0 0 16 16 9
21 36 0 0 0 0 0 0 0 0 9 11 16
Subtotal 198 40 15 16 14 8 9 9 10 25 27 25
CUS 37 31 31 0 0 0 0 0 0 0 0 0 0
36 51 0 17 14 20 0 0 0 0 0 0 0
35 52 0 16 18 18 0 0 0 0 0 0 0
34 46 0 0 0 0 12 12 14 8 0 0 0
33 41 0 0 0 0 13 10 10 8 0 0 0
32 48 0 0 0 0 0 0 0 0 20 10 18
31 47 0 0 0 0 0 0 0 0 14 14 19
Subtotal 316 31 33 32 38 25 22 24 16 34 24 37
or a relative decline of slightly less than 19% from the 44% response rate achieved
in the previous survey. Note that the special attention group of banks, which
expanded to over 300 compared with 100 in the previous survey, achieved an overall response rate
of roughly 50%. This group did not receive shortened versions of the surveys.
Response rates for the top 100 were similar to response rates for that group in the
past.
The response outcome for 2016, including the item response rates, is shown in
Figure 3 and Table 8. The overall item response rate is 19%, down considerably
from the 2013 item response rate of 30%. The mean item count in 2016 was
717, compared with a mean item count of 805 in 2013. That drop, all else equal,
could be interpreted as a bad outcome on its own, but that would ignore other
important factors. In particular, considering the increase in the number of items,
Table 7. Unit Response Rate (%) by Bank Type, Stratum and Survey Form
Version.
Type Stratum n v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
CMB 17 49 49
16 35 30 38 39
15 46 43 54 40
14 36 36 35 31 44
13 40 29 47 35 50
12 38 37 35 41
11 35 33 39 33
Subtotal 39 49 37 46 39 32 41 32 47 35 37 37
SAV 25 53 53
24 44 44 48 41
23 37 32 36 38 42
22 33 39 39 22
21 35 26 32 47
Subtotal 40 53 44 48 41 32 36 38 42 33 36 33
CUS 37 49 49
36 39 40 33 45
35 37 34 39 38
34 28 30 29 34 20
33 27 34 27 26 22
32 20 25 13 23
31 22 19 19 26
Subtotal 29 49 37 36 42 32 28 30 21 23 16 25
Total 37 49 38 43 40 32 36 32 38 31 30 33
[Figure 3 here: bar chart of unit (Unit) and item (Item) response rates, in percent, by stratum 11–38.]
Fig. 3. Response Rates by Stratum for 2016 Calculated Two Different Ways.
Note: The unit response rate is the number of responding banks (those that returned a
survey form) in proportion to the number of sampled banks. The item response rate is the
total number of items returned in proportion to the total number of items presented to all
sampled banks.
the 2016 survey returned nearly 360 thousand total response items, compared with
166 thousand total response items in 2013. The number of separate responses in
2016 was 1,383 compared with 1,181 in 2013, an increase of 17%. Having more
responses means that more items can be imputed, taking advantage of correlations
with other response items.
The survey items in the 2016 survey were a superset of the survey items in the
2013 survey. With so many new, untested survey items in 2016, it is revealing
to examine the outcomes for the 205 payment volume items that were the same
across both surveys. Figure 4 demonstrates that the 2016 survey approach
improved the number of items supplied to the imputation and estimation routines
for many strata, particularly for smaller CMB strata. The shorter surveys, contain-
ing 1/3 or 1/2 of the partial-coverage survey items, combined with increases in the
sampling rate, tended to do well for CMB and SAV, while doing relatively better
only for the very smallest CUS strata. Conversely, for the strata with the largest banks,
Table 8. 2016 Population, Sample and Response Counts by Bank Type and Stratum.

Stratum    Boundary     N        n       n/N    Resp.   Unit    Item
CMB
  18       (top 50)     50       50      1.00   40      0.80    0.53
  17       10,900,000   264      264     1.00   110     0.42    0.22
  16       799,500      247      237     0.96   88      0.37    0.19
  15       388,000      337      237     0.70   105     0.44    0.23
  14       232,000      618      308     0.50   114     0.37    0.19
  13       139,754      872      289     0.33   118     0.41    0.21
  12       83,909       1,190    444     0.37   167     0.38    0.18
  11       41,980       1,382    356     0.26   128     0.36    0.19
  Subtotal              4,960    2,185   0.44   870     0.40    0.21
SVG
  26       (top 25)     25       24      0.96   18      0.75    0.43
  25       1,650,000    48       48      1.00   21      0.44    0.22
  24       497,000      102      102     1.00   46      0.45    0.23
  23       195,000      132      104     0.79   42      0.40    0.23
  22       100,500      155      116     0.75   36      0.31    0.17
  21       46,300       292      96      0.33   34      0.35    0.22
  Subtotal              754      490     0.65   197     0.40    0.22
CUS
  38       (top 25)     25       25      1.00   14      0.56    0.34
  37       730,000      47       46      0.98   22      0.48    0.24
  36       365,000      137      126     0.92   47      0.37    0.19
  35       185,000      174      143     0.82   52      0.36    0.18
  34       105,500      240      147     0.61   34      0.23    0.09
  33       58,000       399      167     0.42   50      0.30    0.13
  32       26,680       690      201     0.29   47      0.23    0.11
  31       11,190       3,144    242     0.08   50      0.21    0.09
  Subtotal              4,856    1,097   0.23   316     0.29    0.14
Total                   10,570   3,772   0.36   1,383   0.37    0.19

Note: Boundary is the lower stratum boundary of the size variable; N is the population count, n the sample count, Resp. the number of responding banks, and Unit and Item the unit and item response rates.
where the survey form length was reduced less, or not at all, and where increases
in already high sampling rates were unavailable, the outcome was very different.
Except for the very large institutions, response levels dropped, indicating that the
survey length was an important factor.
[Figure 4 here: bar chart of mean item responses by stratum, 2013 vs. 2016.]
Fig. 4. Mean Number of Item Responses, 205 Comparable Payment Volume Items in the
Surveys, by Stratum and Survey Year.
ACKNOWLEDGMENT
Opinions are the authors’ alone and do not necessarily reflect those of the Board
of Governors, the Federal Reserve System, or its staff. We acknowledge David
Jacho-Chávez, Editor of the Advances in Econometrics, two anonymous ref-
erees, and participants at the Bank of Canada conference associated with this
volume for comments and suggestions. We thank Lauren Clark, Daniel Nikolic,
Justin Skillman, and Alexander Spitz for assistance during different stages of
the research. We also thank Michael Argento and Thomas Welander of the
Global Concepts Office of McKinsey and Company for support during survey
design and data collection. Further information about the surveys is available
at http://www.federalreserve.gov/paymentsystems/fr-payments-study.htm. Any
errors or omissions are the responsibility of the authors.
NOTES
1. For reports and data, visit the payment study website at www.federalreserve.gov/
paymentsystems/fr-payments-study.htm.
2. The 2013 survey contained roughly twice the number of conceptually different survey
items compared with the 2010 survey. The total number of payment volume items between
2010 and 2013 grew only slightly, however, as we reduced the number of distinct items
by half by changing the survey from covering 2 months (March and April) with separately
reported volumes from each month, to 1 month (March).
3. Sampled banks were asked either to retrieve annual 2015 data from their internal
records (preferred) or, in a limited number of cases, if unavailable, to estimate their annual
2015 volumes using data from a different reference period.
REFERENCES
Cochran, William G. (1953). Sampling Techniques. 2nd ed. New York, NY: John Wiley & Sons.
Dalenius, Tore and Joseph L. Hodges Jr. (1959). “Minimum Variance Stratification”. In: American
Statistical Association Journal 54.285, pp. 88–101.
Gerdes, Geoffrey R. and May X. Liu (2010). “Estimating Technology Adoption and Aggregate Volumes
from U.S. Payments Surveys in the Presence of Complex Item Nonresponse”. In Proceed-
ings of the Survey Research Methods Section. Joint Statistical Meetings. American Statistical
Association. URL: https://ww2.amstat.org/sections/srms/Proceedings/allyearsf.html.
Graham, John W., Scott M. Hofer, and David P. MacKinnon (1996). “Maximizing the Usefulness of
Data Obtained with Planned Missing Value Patterns: An Application of Maximum Likelihood
Procedures”. In: Multivariate Behavioral Research 31.2, pp. 197–218.
Hansen, Morris H., William N. Hurwitz, and William G. Madow (1953). Sample Survey Methods and
Theory. New York, NY: John Wiley & Sons.
Hidiroglou, Michele A. (1986). “The Construction of Self-Representing Stratum of Large Units in
Survey Design”. In: The American Statistician 40.1, pp. 27–31.
Lineback, Joanna Fane and Katherine J. Thompson (2010). “Conducting Nonresponse Bias Analysis
for Business Surveys”. In: Proceedings of the Government Statistics Section. Joint Statistical
Meetings. American Statistical Association.
Little, Roderick J. A. and Donald B. Rubin (2002). Statistical Analysis with Missing Data. 2nd ed.
New York, NY: John Wiley & Sons.
Liu, May X., Geoffrey R. Gerdes, and Darrel W. Parke (2009). “Sample Design and Estimation
of Volumes and Trends in the Use of Paper Checks and Electronic Payment Methods in
the United States”. In: Proceedings of the Survey Research Methods Section. Joint Statisti-
cal Meetings. American Statistical Association. URL: https://ww2.amstat.org/sections/srms/
Proceedings/allyearsf.html.
Lord, Frederick M. (1962). “Estimating Norms by Item-Sampling”. In: Educational and Psychological
Measurement 22.2, pp. 259–.
Neyman, Jerzy (1934). “On the Two Different Aspects of the Representative Method: The Method of
Stratified Sampling and the Method of Purposeful Selection”. In: Journal of the Royal Statistical
Society 97.4, pp. 558–625.
Raghunathan, Trivellore E. and James E. Grizzle (1995). “A Split Questionnaire Survey Design”. In:
Journal of the American Statistical Association 90.429, pp. 54–63.
Snijkers, Ger et al. (2013). Designing and Conducting Business Surveys. Hoboken, NJ: John
Wiley & Sons.
DOES SELECTIVE CRIME REPORTING
INFLUENCE OUR ABILITY TO
DETECT RACIAL DISCRIMINATION
IN THE NYPD’S STOP-AND-FRISK
PROGRAM?
ABSTRACT
Prior analyses of racial bias in New York City’s Stop-and-Frisk pro-
gram implicitly assumed that potential bias of police officers did not vary
by crime type and that their decision of which type of crime to report as
the basis for the stop did not exhibit any bias. In this paper, we first extend
the hit rates model to consider crime type heterogeneity in racial bias and
police officer decisions of reported crime type. Second, we reevaluate the
program while accounting for heterogeneity in bias along crime types and
for the sample selection which may arise from conditioning on crime type. We
present evidence that differences in biases across crime types are substantial
and specification tests support incorporating corrections for selective crime
260 STEVEN F. LEHRER AND LOUIS-PIERRE LEPAGE
reporting. However, the main findings on racial bias do not differ sharply
once accounting for this choice-based selection.
Keywords: Misclassification; selective reporting; racial discrimination; hit
rates test; selection correction; criminal activity
1. INTRODUCTION
In many administrative and survey data sets, researchers must confront the chal-
lenge that non-sampling errors due to deliberate bias in providing a response may
distort the analyses. Much research has investigated this issue in survey research
(see Bound, Brown, & Mathiowetz, 2001 for a survey of the literature), and the
presence of measurement error has been shown to potentially cause biased and
inconsistent parameter estimates, thereby leading to erroneous conclusions to var-
ious degrees in statistical and economic analyses. Different methods are needed
to treat measurement error in survey data since errors can arise from different
sources. For example, they might arise from coding errors by surveyors, or survey
participants may choose to not provide truthful responses.
A specific form of measurement error arises with qualitative data resulting
in misclassification. Misclassification occurs when observations are placed erro-
neously in a different group or category. Within administrative data sources, this
erroneous information is provided not from survey responses, but rather in how the
records are generated and maintained. Just as certain sampling issues may influence
those being surveyed, similar issues may affect those who create and maintain
administrative records. For example, individuals preparing entries in administrative records
may rely on rules of thumb in a bid to minimize the burden of completing the
underlying forms accurately.1 These errors in classification not only affect sum-
mary statistics on sample proportions but may influence analyses that investigate
heterogeneous behavioral relationships across these groups or categories.2 In many
settings, economic theory would suggest that we should expect heterogeneity
across these groups or categories3 which may be policy-relevant and could be
completely masked when investigating data on the full sample.
We illustrate the importance of considering the consequences of misclassifica-
tion that can arise from using rules of thumb to determine categories by reevaluating
if there is racial discrimination in New York City’s infamous Stop-and-Frisk pro-
gram. Under this program, officers can stop and frisk anyone they believe has
committed, is committing or might commit a crime. These policing practices often
disproportionately target minorities, generating significant controversy. Details on
each of over five million stops occurring in New York City between 2003 and 2014
have been collected and are frequently analyzed by researchers. However, these
Selective Crime Reporting 261
records may also feature misclassification error in the type of crime reported as the
basis for the stop, a source of bias that has been implicitly ignored in prior analyses
of this large administrative data set.
Advocacy groups have long criticized the NYPD’s Stop-and-Frisk program4
and have even suggested that it has effectively turned some neighborhoods – usu-
ally poor and nonwhite ones – into ‘occupied territories’ rife with unnecessary,
tense interactions between neighborhood residents and the police. More generally,
across all neighborhoods, surveys document that the American law enforcement
community has come under increasing scrutiny and criticism and that levels of
trust in the police have plummeted.5 Proponents of the NYPD Stop-and-
Frisk program such as former NYPD commissioner Ray Kelly claim it has saved
over 7,000 lives and played a key role in the city’s decrease in crime over the past
years. Opponents of the program claim it constitutes a violation of freedom and
provides a means for officers to engage in racial profiling.
These claims are based in part on unconditional summary statistics such as the
fact that the overwhelming majority of those targeted by the program (consistently
around 85% of all stops in each year) are minorities. Advocacy groups have used unconditional summary statistics to suggest evidence of racial discrimination at almost every stage of the criminal justice system.6
Testing for discrimination in the NYPD Stop-and-Frisk program can be chal-
lenging since an analysis of disparate impact alone does not constitute evidence
of discrimination. Knowles, Persico, and Todd (2001) (henceforth KPT) propose
a hit rates test that relies on the assumption that police officers try to maximize
successful searches. This test compares the productivity of stops across different
racial groups. Stops that result from discrimination alone rather than from reasonable suspicion should be less likely to lead to arrests or summonses, lowering the likelihood of those outcomes for that racial minority.7 This test has been
applied in prior research evaluating the NYPD Stop-and-Frisk program. Coviello
and Persico (2015) find no evidence of discrimination against African-Americans
in the aggregate sample of all recorded crime types in the whole city over 10 years
but, along with Goel, Rao, and Shroff (2016), do find evidence of discrimination
against African-Americans when restricting the sample to only stops relating to
the possession of a concealed weapon.8
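The unconditional version of such a comparison can be sketched as a two-sample test on arrest rates per stop. The counts below are hypothetical and for illustration only; the paper's tests additionally condition on precinct and time fixed effects.

```python
import numpy as np
from scipy import stats

def hit_rate_gap(arrests_a, stops_a, arrests_b, stops_b):
    """Difference in 'hit rates' (arrests per stop) between two groups,
    with a pooled two-sample z-test of equality. This is only the raw,
    unconditional comparison underlying the KPT-style test."""
    p_a, p_b = arrests_a / stops_a, arrests_b / stops_b
    pool = (arrests_a + arrests_b) / (stops_a + stops_b)
    se = np.sqrt(pool * (1 - pool) * (1 / stops_a + 1 / stops_b))
    z = (p_a - p_b) / se
    return p_a - p_b, 2 * stats.norm.sf(abs(z))  # gap, two-sided p-value

# Hypothetical counts, not figures from the paper:
gap, pval = hit_rate_gap(3_000, 100_000, 2_500, 50_000)
```

A negative gap with a small p-value would indicate that stops of group A are systematically less productive, which the hit rates logic interprets as evidence consistent with bias.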
We add to the existing evidence on the importance of accounting for potential
heterogeneity in bias across different types of crimes by showing its potential
importance in theory and confirming it empirically. African-Americans constitute
the overwhelming majority of suspects arrested for crimes related to drugs or
possession of a weapon, two types of crime related to the long-lasting nationwide
War on Drugs. Similarly to the NYPD’s Stop-and-Frisk program, the War on
Drugs has also been suggested to be discriminatory given disparate impacts for
African-Americans,9 motivating our particular focus on these crimes.
262 STEVEN F. LEHRER AND LOUIS-PIERRE LEPAGE
Prior research using Stop-and-Frisk data has also treated the reported crime
classifications as being exogenous. However, these classifications are selected by
individual police officers at the time when they complete the mandated forms
indicating why they stopped a given suspect. If individual police officers perceive a specific race as the perpetrators of certain types of crime,10 these
classifications might be subject to unconscious bias. In other words, conditioning
on those reported crime categories leads to endogenous stratification, which is well
known to lead to biased estimates.
Prior analyses of the Stop-and-Frisk program did not consider the decision we focus on, which is taken by the officer who, at the time of choosing whom to stop, must also define the basis for the stop. While we illustrate the issue of analyzing effects
on subgroups that may be misclassified with Stop-and-Frisk data, we should stress
that these issues are becoming increasingly prevalent in survey data.11 Further,
misclassification has been shown to have consequences for econometric estimates
(Bollinger & David, 1997, 2001), sometimes changing the overall conclusions.
Generally, the consequence of misclassification depends on how it occurred. In
this setting, it is not simply a measurement error problem but reflects a choice-
based sample. If misclassification occurs to the same degree across racial groups
and policing outcomes, then it is random and no bias should arise. Conversely,
if the degree of subgroup misclassification differs between racial groups, then misclassification is nonrandom. Since much research analyzes impacts on
subgroups, we argue that there is likely going to be an increase in misclassification
of race and ethnicity variables as time progresses given changes in the mean-
ing of these variables to survey respondents. As such, the methods we describe
could be applied to various other contexts with both survey and administrative
data sets.
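The distinction between random and nonrandom misclassification can be illustrated with a stylized simulation (all parameters below are invented for illustration): when the crime label is wrong at the same rate for both races and independently of outcomes, the subgroup arrest gap is recovered; when officers disproportionately apply the label to one race, the estimated gap is distorted.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 400_000
black = rng.random(n) < 0.5
drugs = rng.random(n) < 0.4                     # true basis of the stop
p_arrest = np.where(black & drugs, 0.05, 0.08)  # true gap for drug stops: -3pp
arrest = rng.random(n) < p_arrest

def gap(label):
    """Black-white arrest-rate difference within the labelled subgroup."""
    return arrest[label & black].mean() - arrest[label & ~black].mean()

# (a) Random misclassification: 20% of true drug stops lose the label,
#     independently of race and outcome; the -3pp gap is still recovered.
gap_random = gap(drugs & (rng.random(n) >= 0.2))

# (b) Differential misclassification: 20% of black non-drug stops are
#     labelled as drug stops anyway; the measured gap attenuates toward zero.
gap_differential = gap(drugs | (black & ~drugs & (rng.random(n) < 0.2)))
```

In this stylized case the differential mislabelling pulls the measured gap from roughly −3.0 toward −2.3 percentage points, which is the direction of bias the selection correction described later is designed to undo.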
In this paper, we first extend the model underlying the hit rates test to account
for the potential that police officer bias depends on both the type of crime and
the race of the suspect, thereby influencing the type of crime reported as the basis
for the stop. This motivates investigating racial bias across different crime types, which can reconcile the evidence in Coviello and Persico (2015) of no bias in the full sample of stops. Further, it also motivates the need to correct the estimates
to account for potential bias in the reporting of crime categories. Second, we
reevaluate whether there is evidence of racial discrimination in New York City’s
Stop-and-Frisk program by modeling the selection process of crime categories as
a polychotomous choice. In effect, we implement a sample-selection correction to
explicitly account for the relative impact that being African-American may have
on the difference in the likelihood of being stopped for certain types of crime when
conducting the hit rates tests. To the best of our knowledge, this study presents the
first use of a polychotomous selection model to estimate whether there is evidence
of racial discrimination in the economics of crime literature.12
Our main empirical results indicate that with or without correcting for selective
crime categorization, there is robust and conclusive evidence of discrimina-
tion. After accounting for both crime-dependent bias and selective classification,
African-Americans on average are 2.874 percentage points less likely than whites to be arrested when stopped for crimes classified under the US War on Drugs, a relative difference of approximately 50%. Contrasting estimates from hit rates tests that account for and that ignore selective reporting of crime categories provides evidence that the correction can be important both economically and statistically, and Hausman tests provide further evidence that this correction is statistically important.
The remainder of the paper is organized as follows. In the next section,
we discuss how the Stop-and-Frisk program is implemented in New York and
briefly describe the data set. To motivate our empirical tests, Section 3 presents a
theoretical model that extends KPT to consider crime-dependent payoffs to stop-
ping suspects which can differ by racial group. Section 4 describes the two-step
empirical strategy and discusses identification of the selection correction terms.
The empirical results are presented and discussed in Section 5. A final section
summarizes the main findings and concludes.
(2) a person stopped is frisked or frisked and searched, (3) a person is arrested
or (4) a person stopped refuses to identify themselves. This potential sampling
issue is noted in Coviello and Persico (2015) who discuss whether one should
restrict the analysis to only stops that legally must be recorded. They conclude
that imposing this restriction would require the implausible assumption that at the
time of choosing whom to stop, the officer could distinguish whether the stop will
develop into one that has to be recorded or not. The standard UF-250 form requires
officers to document, among other things, the time, date, place and precinct where
the stop occurred; the name, address, age, gender, race and physical description of
the person stopped; factors that caused the officer to reasonably suspect the person
stopped; the suspected crime that gave rise to the stop; the duration of the stop;
whether the person stopped was frisked, searched or arrested; and the name, shield
number and command of the officer who performed the stop. Each police officer
must submit all completed UF-250 forms to the desk officer in the precinct where
the stop occurred so the stated factual basis of the “stop” for legal sufficiency can
be reviewed. The data are then analyzed for quality assurance and a restricted
version that suppresses the shield number and other identifying information is
made publicly available.
The primary data used in this paper comprises all 5,028,789 recorded stops
in the Stop-and-Frisk program between 2003 and 2014. For each stop, we are
provided with all of the characteristics and outcomes listed on the UF-250 forms
that are permissible in the restricted data set. In total, over 40% of observed stops
in our sample did not have to be reported by law.15 We first restrict our sample
to stops involving only Caucasians or African-Americans and for which crime
categories are recorded on the UF-250 form, yielding 2,649,300 observations.
Crime classification was not reported for 2003 and inconsistently reported in both
2004 and 2005, reducing the number of observations for these earlier years.16
Summary statistics are presented in Table 1. Approximately 84% of this sample
is African-American and the vast majority of suspects are male. Among the 13
categories of crimes reported in Table 1, which account for over 95% of recorded
stops, possession of a weapon (27.67%), robbery (17.30%), trespassing (11.84%)
and drugs (11.11%) are the most commonly listed as the basis for a stop. As shown
in the table, we pool these crime categories into four main classifications.17 Drugs
and weapon crimes are pooled together since they represent felonies linked to
the US War on Drugs and account for nearly one-third of the stops. Similarly,
we pool other major crimes by whether they are either economic in motivation
(i.e., trespassing, burglary, grand larceny and grand larceny auto) or violent (i.e.,
assault, robbery, murder and rape). The final category consists of less severe
offenses which include petit larceny, graffiti and criminal misconduct. Crimes of
an economic nature account for over half of the stops, while there are fewer stops
associated with either violent or minor crimes.18
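The pooling described above amounts to a many-to-one mapping from the 13 reported categories to four groups; the names below are our shorthand labels, not the codes on the UF-250 form.

```python
# Pooling the 13 reported crime categories into the four classifications
# used in the analysis, following the grouping described in the text.
POOLED = {
    "war_on_drugs": ["drugs", "weapon"],
    "other_economic": ["trespassing", "burglary",
                       "grand_larceny", "grand_larceny_auto"],
    "violent": ["assault", "robbery", "murder", "rape"],
    "minor": ["petit_larceny", "graffiti", "criminal_misconduct"],
}
crime_to_group = {crime: group
                  for group, crimes in POOLED.items() for crime in crimes}
```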
Table 1. Summary Statistics.
Outcomes
Arrest rate×100 5.49 (22.78)
Summons×100 6.14 (24.01)
Demographics
Suspect is black 84.55 (36.15)
Suspect is male 92.84 (25.77)
Age of suspect 28.34 (12.39)
Suspect is youth 54.13 (49.83)
Suspect is tall 23.61 (42.47)
Suspect has heavy build 8.52 (27.91)
Stops
Number made at night 60.00 (48.99)
Mandated stops 58.52 (49.27)
Crimes
War on drugs 38.88 (48.75)
Drugs 11.21 (31.55)
Weapon 27.66 (44.73)
Other economic crimes 35.33 (47.80)
Trespassing 12.11 (32.63)
Burglary 9.25 (28.98)
Grand larceny 4.49 (20.70)
Grand larceny auto 9.48 (29.30)
Violent crimes 20.76 (40.57)
Assault 3.42 (18.17)
Robbery 17.21 (37.75)
Murder 0.05 (2.20)
Rape 0.10 (3.19)
Minor offenses 5.02 (21.83)
Petit larceny 2.62 (15.98)
Graffiti 1.16 (10.70)
Criminal misconduct 1.24 (11.06)
Observations 2,399,717
Standard deviations in parentheses. Youth is the fraction of suspects aged below 25. Tall is the fraction
of suspects 6 ft or taller. Heavy build is the fraction of suspects classified as heavy build by the NYPD.
Night refers to the fraction of stops performed between 7 p.m. and 6 a.m. Other economic crimes refers to
nonviolent crimes including trespassing, burglary, grand larceny and grand larceny auto. Violent refers
to violent crimes including rape, murder and assault. Minor refers to minor crimes and includes petit
larceny, graffiti and criminal misconduct.
Standard deviations in parentheses. The last column presents the p-values from tests where the null
hypothesis is that of equality in the probability of arrest between African-Americans and whites.
burglary and criminal misconduct. In contrast, over 92% of all stops categorized as
weapons, trespassing or murder involve an African-American suspect. The remain-
ing columns provide summary information on the percentage of stops that resulted
in an arrest by race and a test of whether there is a significant difference in these
proportions between races. In nearly every crime category, African-Americans
have on average lower rates of arrest relative to white suspects. Results from tests
of equality in arrest rates between groups are presented in the last column and
indicate that these differences are statistically significant at the 5% level in every
category with the exception of stops for murder, rape and robbery. Most striking is
that the rate of arrest for weapon possession stops is 75% higher for white suspects
relative to African-Americans, despite the fact that almost 19 of every 20 stops for
this category involve a black suspect.
3. THEORY
We begin by extending the model that underlies the hit rates test first developed
in KPT to directly consider the Supreme Court's holding in Terry v. Ohio that
the scope of any resulting police search has to be narrowly tailored to match the
original reason for the stop.19 Since the crime type must be reported and police
officers may associate certain racial groups with particular crime types, they may
be more likely to be biased for stops related to those classes of infractions.20 We
incorporate this feature within an economic model that describes a pedestrian’s
and a police officer’s behavior.
3.1. Pedestrians
$$V_i^{r,o}(\omega) \equiv \int V_i^{r,o}(\phi_1, \ldots, \phi_n, c_1, \ldots, c_n, \omega)\, dF^{r,o}\!\left(\phi_1, \ldots, \phi_n \,\Big|\, \phi_i = \max_{1 \le k \le n} \phi_k\right)$$
where $V_i^{r,o}(\omega)$ is the crime-dependent crime rate for group (r, o), that is, the objective probability that an individual of group (r, o) is guilty of crime type i.
There is a mass M of police officers who, after having been exogenously allocated to a given precinct, each receive a type p ∼ U[0, 1].21 An officer of type p has a search capacity $E_p$ and receives payoff $\pi_p^{r,i}$ from stopping a suspect of race r for crime i. We denote by D(p, i) the additional benefit that a racially biased officer of type p gains from stopping a suspect of race A for crime type i.22
Assumption 1. $\pi_p^{W,i} = \pi_p^W = 1$ (normalized) and $\pi_p^{A,i} = \pi_p^W + D(p,i)$ for $i = 1, \ldots, n$.
Let $E_p(r,o,i)$ be the number of stops for group (r, o) and crime i from an officer of type p, and let $E(r,o,i) = M \int_0^1 E_p(r,o,i)\,dp$ be the total number of stops for group (r, o) and crime i. Defining
$$W(p,r,o,i) = P(\text{Guilty of crime } i \mid r, o) = V_i^{r,o}(E(r,o,i)),$$
an officer chooses to stop an agent from group (r, o) if $\pi_p^{r,i}\, W(p,r,o,i) - s_p \ge 0$, where $s_p$ is the cost of performing a stop for the officer.
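The stopping condition makes the role of the bias term concrete: an officer stops whenever the expected payoff clears the stop cost, so a biased officer tolerates a lower guilt probability for race-A suspects. A small numerical sketch (the values D = 0.25 and s_p = 0.5 are arbitrary illustrations, not model calibrations):

```python
# Officer p stops a suspect of race r for crime i iff pi^{r,i} * W - s_p >= 0,
# so a biased officer (D(p, i) > 0) accepts a lower guilt probability W for
# race-A suspects, which in equilibrium lowers their measured hit rate.
def stop_decision(pi, W, s_p):
    return pi * W - s_p >= 0

s_p = 0.5                          # cost of a stop (illustrative)
threshold_white = s_p / 1.0        # pi^W normalized to 1
threshold_A = s_p / (1.0 + 0.25)   # pi^A = 1 + D(p, i), with D = 0.25 here

guilt_prob = 0.45                  # between the two thresholds
white_stopped = stop_decision(1.0, guilt_prob, s_p)
a_stopped = stop_decision(1.25, guilt_prob, s_p)
```

At a guilt probability of 0.45 the unbiased payoff does not justify a stop, but the biased payoff does; marginal stops of race-A suspects are therefore less productive, which is exactly what the hit rates test detects.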
4. EMPIRICAL STRATEGY
The traditional hit rates test examines whether there is a racial difference in the
percentage of stops that result in an arrest. When running regressions on subgroups defined by crime classification, this requires that the subgroups be exogenous, as assumed in both Coviello and Persico (2015) and Goel et al. (2016). This involves estimating an equation for whether a stop s, involving a suspect stopped for crime type c at time t in precinct p, resulted in an arrest. Formally,
$$\ln L(\gamma) = \sum_{s=0}^{S} \sum_{c=0}^{C} 1\{y_s = c\}\, \ln \frac{\exp(x_s \gamma_c)}{\sum_{k=0}^{C} \exp(x_s \gamma_k)} \qquad (2)$$
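Eq. (2) is the standard multinomial-logit log-likelihood and can be evaluated directly. A minimal NumPy sketch with hypothetical data (the variable names are ours, not the paper's):

```python
import numpy as np

def mnl_loglik(gamma, X, y):
    """Multinomial-logit log-likelihood of Eq. (2).
    gamma: (C+1, K) category coefficients, X: (S, K) stop characteristics,
    y: (S,) reported crime category in {0, ..., C}."""
    scores = X @ gamma.T                         # (S, C+1) linear indices
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_p[np.arange(len(y)), y].sum()

# Hypothetical example: 4 pooled crime categories, 3 covariates.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 4, size=1000)
ll_null = mnl_loglik(np.zeros((4, 3)), X, y)  # null model: equal shares
```

With all coefficients at zero each category gets probability 1/4, so the null log-likelihood is exactly −S ln 4, a useful sanity check before maximizing over γ.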
month prior to the current stop. With increased length, the exogeneity assumption
requires that, while the recent history of policing in a precinct informs the decision
to stop individuals for certain types of crime, it is not correlated to omitted factors
that determine arrests. This alternative assumption is presumably less restrictive
for longer lagged periods, but the assumption itself remains untestable.
Using estimates of Eq. (2), Bourguignon et al. (2007) provide formulas to construct C − 1 selection correction terms that are captured in the vector $\lambda_c(\cdot)$. Adding this vector of selectivity correction terms to Eq. (3) generates our estimating equation
Using weighted least squares for each crime category allows us to obtain unbi-
ased and consistent estimation of the coefficients.26 A nice feature of this estimator
is that it has been shown to perform well in correcting selection bias even in set-
tings where the restrictive independence of irrelevant alternatives assumption of
the multinomial logit model is violated. Last, to conduct inference, we use boot-
strapped standard errors to explicitly account for the two-step estimation procedure.
We use the strategy proposed in Bourguignon et al. (2007) to estimate the two-step model since it not only relaxes the restriction in Dubin and McFadden (1984) that all correlation coefficients add up to zero, but they also present Monte Carlo evidence indicating superior performance of their estimator relative to Dubin and McFadden (1984), Lee (1983) and Dahl (2002). For completeness, Bourguignon et al. (2007) differ from Lee (1983) in that the latter estimates a single selectivity effect for all choices as opposed to the C − 1 selection terms for the C choices we consider. The Bourguignon et al. (2007) approach is less restrictive, since Lee (1983) requires equal covariances between the unobservables in the arrest rate equation and the unobservables which determine the crime categories, but it comes with the computational cost of estimating additional parameters. Dahl (2002) differs from Bourguignon et al. (2007) in the functional form used to construct the selectivity correction terms.
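As a concrete illustration of the second step, the sketch below builds Dubin and McFadden (1984)-type correction terms from first-stage predicted probabilities. We show this variant because its functional form is compact; the paper uses the Bourguignon et al. (2007) formulas, which differ, so this is a sketch of the general mechanics rather than the paper's exact estimator.

```python
import numpy as np

def dmf_correction_terms(P, chosen):
    """Dubin-McFadden-type selection-correction regressors for the outcome
    equation of category `chosen`: for each alternative j != chosen, the term
    P_j * ln(P_j) / (1 - P_j) + ln(P_chosen), built from first-stage
    multinomial-logit probabilities P of shape (S, C). Scale factors are
    absorbed into the second-stage coefficients on these regressors."""
    ln_pc = np.log(P[:, chosen])
    cols = [P[:, j] * np.log(P[:, j]) / (1.0 - P[:, j]) + ln_pc
            for j in range(P.shape[1]) if j != chosen]
    return np.column_stack(cols)  # (S, C-1) extra regressors for Eq. (3)

# Sanity check: with equal predicted probabilities across 4 categories,
# every term collapses to (4/3) * ln(0.25).
P = np.full((5, 4), 0.25)
lam = dmf_correction_terms(P, chosen=0)
```

In the paper's procedure, analogues of these C − 1 columns are appended to the arrest-rate regression for each crime category, and the standard errors are then bootstrapped to account for the generated regressors.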
Examining estimates of β3 from Eq. (3) also provides insight since a posi-
tive (negative) coefficient estimate indicates higher (lower) arrest rates for those
stopped in this classification relative to a randomly chosen suspect who was stopped.
If at least one of the C − 1 estimates of β3 enters in a statistically significant manner,
then there is suggestive evidence of selection. To formally examine if selectiv-
ity correction leads to statistically significantly different estimates, we conduct
Hausman tests.27
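For a single coefficient, the Hausman comparison reduces to a scalar chi-square(1) statistic. A sketch under the textbook assumption that the uncorrected estimator is efficient under the null; all numbers are invented for illustration:

```python
from scipy import stats

def hausman_scalar(b_base, se_base, b_corrected, se_corrected):
    """One-parameter Hausman-style test that a coefficient (here, on black)
    is the same across the uncorrected model (efficient under H0) and the
    selection-corrected model (consistent under the alternative).
    Returns the chi-square(1) statistic and its p-value."""
    stat = (b_corrected - b_base) ** 2 / (se_corrected ** 2 - se_base ** 2)
    return stat, stats.chi2.sf(stat, df=1)

# Invented numbers for illustration only:
stat, pval = hausman_scalar(b_base=-2.5, se_base=0.5,
                            b_corrected=-2.9, se_corrected=0.7)
```

A small p-value would indicate that the corrected and uncorrected coefficients differ by more than sampling noise alone can explain, i.e., that the selection correction matters.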
The consequence of misclassification for the analysis depends on how it
occurred. The direction of bias depends on the correlation between unobservables
in the outcome and selection equations. If police officers are more likely to
misclassify suspects who are at high perceived risk of having committed a crime, we would expect that ignoring the selection correction would underestimate the effect
of racial differences. After all, when a decision to make a stop occurs rapidly,
police officers are more likely to use any implicit bias based on the suspect’s
characteristics when choosing crime classifications. In other words, where police
officers are less certain of the exact crime at the time they make the stop and they
are relying on criminal offender profiling to select the crime category, we would
underestimate the effect of racial discrimination.28
5. RESULTS
We first present estimates of the hit rates test across different types of crime in
Table 3. The columns of the table differ based on the level of fixed effects that are
included. An important point from Coviello and Persico (2015) is that, as shown
in the first row of Table 3, once accounting for time and precinct fixed effects in
columns 6 and 7, there is no evidence of discrimination when pooling all crimes
together. Further, the inclusion of precinct fixed effects leads to an approximate
50% reduction in the magnitude of the effect of race on arrest rates. Rows 2–4 present results by subgroups of different crime classifications as previously defined and show that the pooled result conflates vastly different effects into one, creating the false appearance of no arrest differential. The results indicate that African-
Americans are significantly less likely to be arrested when stopped for crimes
related to the War on Drugs but significantly more likely to be arrested when
stopped for other economic crimes with little estimated differential for violent or
minor crimes. This is consistent with the conjecture that potential police officer
bias differs by crime type which leads to inefficient policing. Further, we note that
adding extra pedestrian and stop characteristics to Eq. (1) as shown in column 7 has little effect on the results.
Table A2 and Figure A1 in the appendix also show that the estimated arrest dif-
ferential for War on Drugs crimes is present in every borough in the city (though
there is important heterogeneity) and has increased consistently over the period
considered in our sample. Table A1 shows that, in the case of summonses, the outcome differentials by type of crime are more reflective of the aggregate regression, as it is estimated that African-Americans are less likely to be issued a summons when stopped for any crime group.
Next, we investigate whether these results may be partly influenced by the
endogenous decision of police officers of which type of crime to report. To examine
if there is selective classification of crime type in the NYPD Stop-and-Frisk pro-
gram, Table 4 presents estimates of β1 from Eqs. (1) and (3) for each crime
category defined in Table 1. These specifications include additional pedestrian and
stop characteristics which may also define the selective classification of stops. As
Table 3. Estimates of the Hit Rates Test on Arrests, Overall and by Crime Type.
Model             OLS        OLS        OLS        FE         FE         FE         FE
                  (1)        (2)        (3)        (4)        (5)        (6)        (7)
Black
  All crimes      −0.386∗∗∗  −0.386∗∗∗  −0.386     0.164∗∗∗   0.142∗∗∗   0.142      0.221
                  (0.0416)   (0.0416)   (0.485)    (0.0540)   (0.0540)   (0.211)    (0.204)
  War on drugs    −3.726∗∗∗  −3.713∗∗∗  −3.713∗∗∗  −2.584∗∗∗  −2.597∗∗∗  −2.597∗∗∗  −2.463∗∗∗
                  (0.102)    (0.102)    (0.635)    (0.119)    (0.119)    (0.498)    (0.497)
  Other economic  1.696∗∗∗   1.704∗∗∗   1.704∗∗∗   1.794∗∗∗   1.771∗∗∗   1.771∗∗    1.821∗∗∗
                  (0.0521)   (0.0522)   (0.495)    (0.0715)   (0.0715)   (0.219)    (0.222)
  Violent         −3.726∗∗∗  −1.689∗∗∗  −1.689∗∗∗  −0.00628   −0.0483    −0.0483    −0.0267
                  (0.102)    (0.106)    (0.578)    (0.130)    (0.130)    (0.273)    (0.267)
  Minor           0.265      0.264      0.264      0.342∗     0.353∗     0.353      0.0595
                  (0.163)    (0.163)    (0.966)    (0.207)    (0.207)    (0.494)    (0.511)
Clustered SE      No         No         Yes        No         No         Yes        Yes
Time FE           No         Yes        Yes        No         Yes        Yes        Yes
Precinct FE       No         No         No         Yes        Yes        Yes        Yes
Extra controls    No         No         No         No         No         No         Yes
Standard errors are presented in parentheses. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. The dependent variable is the
probability of being arrested conditional on being stopped and is multiplied by 100. Extra controls
refer to the inclusion of indicators for gender, youth, suspect height and build as well as the time of day
as defined in Table 1. Note: OLS refers to ordinary least squares estimator and FE refers to a precinct
fixed effects estimator.
shown in Table A3, the number of stops differs across racial groups and other
pedestrian characteristics that are also likely correlated with race.
Estimates of the selection-correction model in the second column are noticeably different in economic significance from estimates using the standard hit rates test presented in column 1. While African-Americans are statistically less likely at the
1% level to be arrested when stopped for War on Drugs related crimes irrespective
of whether crime categories are exogenous or a behavioral choice, the estimated
coefficient is roughly 15% larger than that which ignores selectivity. For other
crime categories, the difference between the two methods is also important both
in magnitude and statistical significance. Correcting for selective crime classification leads to a 20% reduction in the magnitude of the race coefficient for economic crimes and large changes in magnitude for violent and minor crimes.
While our adjusted estimates do not alter the overall conclusion of racial dis-
crimination for War on Drugs crimes, the estimates obtained from the two-stage
procedure do suggest that there is non-negligible sample selection. The last column
in Table 4 reports the p-values from Hausman specification tests of the equality
of the estimated coefficient on black between estimates of Eqs. (1) and (3). For
War on Drugs, we observe that the p-values from the Hausman tests are less than
The dependent variable is the probability of being arrested conditional on being stopped and is multiplied
by 100. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. Robust standard errors in parentheses for column 1, bootstrapped (1000
repetitions) and reweighted standard errors in parentheses for column 2. Specifications additionally
include fixed effects for precincts and years as well as indicators for gender, youth, height, build and
time of day. Column 3 presents the p-values from Hausman specification tests where the null hypothesis
is that the estimated coefficient on black is the same across models from columns 1 and 2.
0.01, indicating that we can safely reject the assumption that crime categories are
exogenous. Similarly, we can clearly reject that crime categories do not reflect a
behavioral choice for major economic and violent crimes and for minor crimes
at the 6% level. The results provide evidence that the choice of crime that offi-
cers report as the basis for individual stops generates endogenous stratification.
Table A4 applies the two-stage correction in the case of summons and finds that we
can reject the hypothesis that sample selection is negligible for all crime categories.
Marginal effect estimates from the first-stage crime classification selection are presented in Table A5. Each of the variables used to identify the selection correction terms in Eq. (3) is individually and jointly statistically significant with a plausible
sign and magnitude. A somewhat striking finding is that African-Americans are
statistically significantly more likely to have their stop categorized as a War on
Drugs crime. Estimates of Eq. (2) also find that blacks are significantly less likely
to have their stop categorized as other crime types. Since the categories underlying
War on Drugs crimes can be viewed as representing police officer speculation that
a suspect is either hiding a weapon or drugs as opposed to having committed a
robbery or trespassing, it is likely that they are easier to use to justify a stop.
Thus, when an officer decides to instantly make a stop based on the suspect's characteristics, the use of this crime classification may also partially reflect implicit bias. It is therefore not surprising that the estimated effect of racial discrimination on arrest rates for War on Drugs crimes increases once the selection correction is used.
The dependent variable is an aggregate of crime types which takes the value 1 for crimes related to the War on Drugs, 2 for major nonviolent crimes, 3 for violent crimes and 4 for minor crimes. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. Standard errors in parentheses. Other economic refers to economic crimes including trespassing, burglary, robbery, grand larceny and grand larceny auto. Violent refers to violent crimes including rape, murder and assault. Minor refers to minor crimes and includes petit larceny, graffiti and criminal misconduct. Lag crime variables are defined as the proportion of stops that involved crimes of that type in the day before the stop in the same precinct. The exclusion restrictions p-value refers to a joint test of significance for the four exclusion restrictions.
Last, we conducted a series of robustness checks, shown in Table A5, to investigate how the results of the sample-selection correction procedure vary under alternative assumptions. We find that using different lagged values of the share of stops related to each crime category as the exclusion restriction, which improves the plausibility of the exogeneity assumption, has little effect on the conclusions. We also find that excluding first-stage fixed effects from the correction does alter the estimates of the correction but does not change the conclusion that there is a large arrest differential between racial groups for War on Drugs crimes.
6. CONCLUSION
The NYC Stop-and-Frisk program often plays a prominent role in debates surrounding racial profiling. Analyses of these data that condition on reported crime type are necessary to uncover heterogeneity in bias but may lead to biased estimates
ACKNOWLEDGMENT
We are grateful to Decio Coviello, Maxwell Pak and Rosina Rodriguez Olivera
for their helpful comments and suggestions on this project. We also thank Victor
Aguiar and other participants at the Econometrics of Complex Survey Data: Theory
and Applications conference for additional comments. NYC Stop-and-Frisk data
is publicly available at https://nycopendata.socrata.com as well as at the ICPSR
website at the University of Michigan. Lehrer thanks SSHRC for research support.
We are responsible for all errors.
NOTES
1. A large literature has documented evidence that police officers engage in implicit
bias and employ rules of thumb (see Fridell & Lim, 2016 for a recent survey). In recent
work, James (2017) provides evidence that the association between African-Americans and
weapons is stronger when officers have less sleep. The New York City Police Department
(NYPD) is well aware of this literature. In December 2014, the NYPD announced that they
would retrain a significant portion of their police force regarding implicit bias. However, a
2017 Newsweek investigation found that no officer had received such training to that date.
2. Rauscher, Johnson, Cho, and Walk (2008) present a meta-analysis of studies that
measured the race specific validity of survey questions about self-reported mammography
use against documented sources, such as medical and billing records. They found that
the specificity of survey questions that measure mammography use is lower among black
women than white women.
3. In work assuming categories are measured accurately, Lehrer, Pohl, and Song (2016)
use a simple static labor supply model to motivate why treatment effects from a welfare
program that changes work incentives should vary across demographic groups and over the
earnings distribution to motivate their empirical tests.
4. This program has also been targeted by legal action, including high-profile class action lawsuits such as Floyd, et al. v. City of New York, et al.
5. This finding is not unique to the United States, see Bradford, Jackson, and Stanko
(2009) for evidence on trends in the United Kingdom.
6. Studies that explore racial discrimination in the economics literature at other stages
of the criminal justice system make clear that these may be nondiscriminatory and one
needs to account for racial differences in crime prevalence. Yet, even after taking this
feature into account, there is evidence of racial prejudice at other stages. See Rehavi and
Starr (2014); Abrams, Bertrand, and Mullainathan (2012); Anwar, Bayer, and Hjalmarsson
(2012); Bushway and Gelbach (2010); Alesina and La Ferrara (2014) and Anwar and
Fang (2015) for studies looking at prejudice in prosecution, bail-setting, sentencing, prison
releases as well as in judges and juries.
7. In other words, police officers who stop more members of a certain racial group would
not be racially biased if these stops are productive and lead to arrests or summons. In addi-
tion, this test can account for the empirical features related to the geographic concentration
of crime across neighborhoods.
8. For completeness, other research evaluating Stop-and-Frisk include Gelman, Fagan,
and Kiss (2007) and Ridgeway (2012) who each use a different subset of the data
employed in the Coviello and Persico (2015) study. Lehrer and Lepage (2018) also use the
Stop-and-Frisk data to test for discrimination against Arabs and find evidence consistent
with racial profiling in periods of high terrorism threat. Regarding the use of non-lethal
force, Fryer (2016) finds that blacks are over 50% more likely than whites to have force
used against them when stopped.
9. See Nunn (2002) for further details.
10. See, for example, Fridell (2008).
11. For example, research has found high rates of misclassification in categorical vari-
ables such as education (Black, Sanders, & Taylor, 2003), labor market status (Poterba &
Summers, 1995) and disability status (Benitez-Silva, Buchinsky, Man Chan, Cheidvasser, &
Rust, 2004; Kreider & Pepper, 2008), among other measures.
12. This is surprising since as Bushway, Johnson, and Slocum (2007) note, issues of
selection bias pervade criminological research but econometric corrections have often been
misapplied.
13. This decision creates a narrow exception to the Fourth Amendment’s probable cause
and warrant requirements, permitting a police officer to briefly stop a citizen, question them
and frisk them to ascertain whether they possess a weapon that could endanger the officer.
By reasonable suspicion, a police officer “must be able to point to specific and articulable
facts” and is not permitted to base their decision on “inchoate and unparticularized suspicion
or [a] ‘hunch.’”
14. See the 2000 report by the US Commission on Civil Rights (2000) for detailed
information on policing in New York.
15. The fraction of mandated stops is higher for African-Americans (60% versus 47%),
which likely reflects higher rates of frisking, summonses and force used. Our main results
are robust to restricting the analysis only to those stops which had to be legally reported.
Following Coviello and Persico (2015), we do not use this as a sampling restriction since
it would condition on ex post information. The external validity of the results relies on the
plausible (yet untestable) assumption that the sample is representative of all stops in the city.
If police officers underreport racially sensitive stops, this would underestimate the number
of unproductive stops and our results would constitute a lower bound. This is consistent with
the data since the estimated arrest differential is larger when restricting to this subsample.
16. We also exclude observations related to crimes other than those reported
in Table 1 (less than 5% of all crimes) since we group crimes within categories in our
analysis. Including these crimes as an additional classification did not change the main
results but led to substantial computational costs.
17. Our results do not depend on these categories; the estimates are quantitatively and
qualitatively similar whether we group only War on Drugs crimes and leave the other
crime types ungrouped, or group violent and minor crimes into two categories while
leaving the rest ungrouped. We selected these classifications ex ante and as such present
them as the main results.
18. The low rate of stops for minor crimes may reflect an effort to justify the stops by
being “too ambitious” in stating suspected crimes to signal a stronger rationale for the stop.
Alternatively, police officers are likely to simply put more weight on serious offenses.
19. The KPT test was adapted to the Stop-and-Frisk setting in Coviello and Persico
(2015), whose model we extend. Our extension is also inspired by Anwar and Fang (2006).
Selective Crime Reporting 279
Note that the KPT test has faced criticism in Dharmapala and Ross (2004) and Gelman et al.
(2007), among others. These critiques focus on allowing police officers to consider varying
degrees of severity across types of crime, allowing for the fact that officers frequently do not
observe potential offenders or accounting for racial and neighborhood heterogeneity in the
probability of guilt. We pursue a similar line of inquiry in further investigating heterogeneity
along different types of crime, which prior work did not consider within the KPT framework.
20. While we do not make the distinction, racial bias is likely to be unconscious in
nature, particularly when police officers have limited time to decide whether or not to stop
a pedestrian on the street. Smith, Makarios, and Alpert (2006) provide evidence that police
officers can develop unconscious biases along observable characteristics such as gender
which affect their propensity to be suspicious of a member of that group under different
circumstances.
21. The assignment of police officers across precincts is considered further in Coviello
and Persico (2015) and is beyond the scope of this analysis.
22. This is the same channel through which racial bias enters the original hit rates model
of KPT. Our model differs by allowing this additional benefit to vary by crime type and
therefore to affect the officer’s decision to stop pedestrians differently across the race and
crime dimensions.
23. There is also the possibility that police officers lie about the type of crime forming the
basis of the stop. This could go in both directions, to show effort or to not appear prejudiced.
Ultimately, this may lead to bias even when adjusting for selection in crime classification
if officers consciously misreport the causes of their stops across different crime categories,
but the situation would be worse without the selection correction.
24. The intuition and mechanics behind the approach we use are proposed in Bourguignon,
Fournier, and Gurgand (2007) and parallel the seminal Heckman (1979) two-step estimator.
25. Note that the inclusion of these fixed effects leads to the well-known incidental
parameter problem, but since the ratio of the number of observations to the number of
parameters is very high in our application, issues of bias should be limited. On
the other hand, by ignoring the fixed effects, the interpretation of the coefficients of the
outcome equation would be unclear since a different set of coefficients enters the first
stage and outcome equation. Thus, our preferred estimates given the large sample size for
each precinct and year include fixed effects. For completeness, we present estimates using
both approaches in the results section. The use of a conditional fixed effects estimator is
computationally infeasible in our setting.
26. See Bourguignon et al. (2007) for further details on constructing the selectivity
correction terms and weights, which account for potential heteroskedasticity present in the
model due to selectivity.
27. Bootstrapped Hausman tests would be preferable since they relax the assumption
that ordinary least squares (OLS) is fully efficient under the null, but they are
computationally infeasible in our setting.
28. Whenever making statistical corrections for selection bias or endogeneity, there is
always the risk that the cure may be worse than the disease (see Bound, Jaeger, & Baker,
1995). That said, the literature on policing (e.g., James, 2017, and the references therein)
suggests that implicit bias may be higher for weapon crimes, which is accounted for by our
correction and consistent with our results. As such, it appears unlikely that these differences
in categories of crime would be solely due to chance and our selection correction appears
to be operating in the desired direction.
29. As Williams (2008) points out, some leading advocates of this change in the United
States were white women married to African-American men who found that their children
were almost always classified as black by those who collected statistical data or tabulated
persons by race.
REFERENCES
Abrams, D. S., Bertrand, M., & Mullainathan, S. (2012). Do judges vary in their treatment of race?
Journal of Legal Studies, 41(2), 347–383.
Alesina, A., & La Ferrara, E. (2014). A test for racial bias in capital punishment. The American
Economic Review, 104(11), 3397–3433.
Anwar, S., Bayer, P., & Hjalmarsson, R. (2012). The impact of jury race in criminal trials. Quarterly
Journal of Economics, 127(2), 1017–1055.
Anwar, S., & Fang, H. (2006). An alternative test of racial prejudice in motor vehicle searches: Theory
and evidence. The American Economic Review, 96(1), 127–151.
Anwar, S., & Fang, H. (2015). Testing for racial prejudice in the parole board release process: Theory
and evidence. Journal of Legal Studies, 44(1), 1–37.
Benitez-Silva, H., Buchinsky, M., Man Chan, H., Cheidvasser, S., & Rust, J. (2004). How large is the
bias in self-reported disability? Journal of Applied Econometrics, 19(6), 649–670.
Black, D., Sanders, S., & Taylor, L. (2003). Measurement of higher education in the census and current
population survey. Journal of the American Statistical Association, 98(463), 545–554.
Bollinger, C. R., & David, M. H. (1997). Modeling discrete choice with response error: Food stamp
participation. Journal of the American Statistical Association, 92(439), 827–835.
Bollinger, C. R., & David, M. H. (2001). Estimation with response error and nonresponse: Food-stamp
participation in the SIPP. Journal of Business & Economic Statistics, 19(2), 129–141.
Bound, J., Brown, C., & Mathiowetz, N. (2001). Measurement error in survey data. In Handbook of
econometrics (Vol. 5, pp. 3705–3843). Amsterdam: Elsevier.
Bound, J., Jaeger, D. A., & Baker, R. M. (1995). Problems with instrumental variables estimation
when the correlation between the instruments and the endogenous explanatory variable is weak.
Journal of the American Statistical Association, 90(430), 443–450.
Bourguignon, F., Fournier, M., & Gurgand, M. (2007). Selection bias corrections based on the
multinomial logit model: Monte Carlo comparisons. Journal of Economic Surveys, 21(1),
174–205.
Bradford, B., Jackson, J., & Stanko, E. (2009). Contact and confidence: Revisiting the impact of public
encounters with the police. Policing and Society, 19(1), 20–46.
Brigham, J. C. (1971). Racial stereotypes, attitudes, and evaluations of and behavioral intentions toward
Negroes and whites. Sociometry, 34(3), 360–380.
Bushway, S. D., & Gelbach, J. B. (2010). Testing for racial discrimination in bail setting using
nonparametric estimation of a parametric model. New Haven, CT: Mimeo; Yale Law School.
Bushway, S. D., Johnson, B. D., & Slocum, L. A. (2007). Is the magic still there? The relevance of
the Heckman two-step correction for selection bias in criminology. Journal of Quantitative
Criminology, 23(2), 151–178.
Correll, J., Park, B., Judd, C. M., & Wittenbrink, B. (2002). The police officer’s dilemma: Using
ethnicity to disambiguate potentially threatening individuals. Journal of Personality and Social
Psychology, 83(6), 1314.
Coviello, D., & Persico, N. (2015). An economic analysis of black-white disparities in NYPD’s
Stop-and-Frisk program. Journal of Legal Studies, 44(2), 315–360.
Dahl, G. B. (2002). Mobility and the return to education: Testing a Roy model with multiple markets.
Econometrica, 70(6), 2367–2420.
Devine, P. G., & Elliot, A. J. (1995). Are racial stereotypes really fading? The Princeton trilogy revisited.
Personality and Social Psychology Bulletin, 21(11), 1139–1150.
Dharmapala, D., & Ross, S. L. (2004). Racial bias in motor vehicle searches: Additional theory and
evidence. Contributions to Economic Analysis & Policy, 3(1). Article 12, 1–21.
Dubin, J. A., & McFadden, D. L. (1984). An econometric analysis of residential electric appliance
holdings and consumption. Econometrica, 52(2), 345–362.
Fridell, L. A. (2008). Racially biased policing: The law enforcement response to the implicit Black-
Crime Association. In M. J. Lynch, E. B. Patterson & K. K. Childs (Eds.), Racial divide: Racial
and ethnic bias in the criminal justice system (pp. 39–59). Monsey, NY: Criminal Justice
Press.
Fridell, L., & Lim, H. (2016). Assessing the racial aspects of police force using the implicit- and
counter-bias perspectives. Journal of Criminal Justice, 44, 36–48.
Fryer Jr, R. G. (2016). An empirical analysis of racial differences in police use of force. NBER Working
Paper No. 22399.
Gelman, A., Fagan, J., & Kiss, A. (2007). An analysis of the New York City police department’s
Stop-and-Frisk policy in the context of claims of racial bias. Journal of the American Statistical
Association, 102(479), 813–823.
Goel, S., Rao, J. M., & Shroff, R. (2016). Precinct or prejudice? Understanding racial disparities in
New York City’s Stop-and-Frisk policy. The Annals of Applied Statistics, 10(1), 365–394.
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46(6), 1251–1271.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161.
James, L. (2017). The stability of implicit racial bias in police officers. Police Quarterly, 21(1), 30–52.
Knowles, J., Persico, N., & Todd, P. (2001). Racial bias in motor vehicle searches: Theory and evidence.
Journal of Political Economy, 109(1), 203–229.
Kreider, B., & Pepper, J. V. (2011). Identification of expected outcomes in a data error mixing model
with multiplicative mean independence. Journal of Business & Economic Statistics, 29(1),
49–60.
Lee, L. F. (1983). Generalized econometric models with selectivity. Econometrica, 51(2), 507–512.
Lehrer, S. F., & Lepage, L. (2018). How do NYPD officers respond to general and specific terror
threats? Ann Arbor, MI: Mimeo; University of Michigan.
Lehrer, S. F., Pohl, R. V., & Song, K. (2016). Targeting policies: Multiple testing and distributional
treatment effects. NBER Working Paper No. 22950.
Nunn, K. B. (2002). Race, crime and the pool of surplus criminality: Or why the war on drugs was a
war on blacks. Journal of Gender, Race & Justice, 6, 381–445.
Poterba, J. M., & Summers, L. H. (1995). Unemployment benefits and labor market transitions: A
multinomial logit model with errors in classification. The Review of Economics and Statistics,
77(2), 207–216.
Rauscher, G. H., Johnson, T. P., Cho, Y. I., & Walk, J. A. (2008). Accuracy of self-reported cancer-
screening histories: A meta-analysis. Cancer Epidemiology and Prevention Biomarkers, 17(4),
748–757.
Rehavi, M. M., & Starr, S. (2014). Racial disparities in federal criminal sentences. Journal of Political
Economy, 122(6), 1320–1354.
Ridgeway, G. (2007). Analysis of racial disparities in the New York police department’s stop, question,
and frisk practices. RAND Technical Report #534.
Schmeidler, D. (1973). Equilibrium points of nonatomic games. Journal of Statistical Physics, 7(4),
295–300.
Sit, R. (2017). Since Eric Garner’s death, not one NYPD officer has received implicit bias training,
despite what the mayor says. Newsweek. Available at http://www.newsweek.com/eric-garner-
erica-nypd-implicit-bias-bill-de-blasio765165. [Date Published: 12/29/2017] [Date Accessed:
02/21/2018].
Smith, M. R., Makarios, M., & Alpert, G. P. (2006). Differential suspicion: Theory specification and
gender effects in the traffic stop context. Justice Quarterly, 23(2), 271–295.
U.S. Commission on Civil Rights (2000). Chapter 5: Stop, question, and frisk. Police practices and
civil rights in New York City. Available online at http://www.usccr.gov/pubs/nypolice/main.htm.
Williams, K. M. (2008). Mark one or more: Civil rights in multiracial America. Ann Arbor, MI:
University of Michigan Press.
APPENDIX
The estimated coefficients correspond to the coefficient on black from a set of
regressions of the probability of being arrested (multiplied by 100) conditional
on being stopped on a dummy for black as well as precinct indicators, estimated
separately for each year. The standard errors are clustered at the precinct level.
The dashed lines represent the pointwise 95% confidence interval.

Model           OLS        OLS        OLS       FE          FE          FE          FE
                1          2          3         4           5           6           7
Black
All crimes      0.056      0.0794∗    0.0794    −1.760∗∗∗   −1.746∗∗∗   −1.746∗∗∗   −1.545∗∗∗
                (0.0427)   (0.0428)   (0.379)   (0.0527)    (0.0527)    (0.330)     (0.322)
War on Drugs    −0.922∗∗∗  −0.886∗∗∗  −0.886    −2.468∗∗∗   −2.449∗∗∗   −2.449∗∗∗   −2.104∗∗∗

Standard errors are presented in parentheses. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. The dependent variable is the
probability of being issued a summons conditional on being stopped and is multiplied by 100. Extra
controls refer to the inclusion of indicators for gender, youth, suspect height and build as well as
time of day as defined in Table 1. OLS refers to ordinary least squares estimator and FE refers to a
precinct fixed effects estimator.
Table A2. Estimates of the Hit Rates Test on Arrests, by Crime Type and
Borough.
Model OLS OLS OLS FE FE FE FE
1 2 3 4 5 6 7
Black
Manhattan −4.559∗∗∗ −4.521∗∗∗ −4.521∗∗ −3.544∗∗∗ −3.471∗∗∗ −3.471∗∗ −3.299∗
(0.233) (0.233) (2.088) (0.251) (0.251) (1.589) (1.604)
Bronx −1.985∗∗∗ −2.039∗∗∗ −2.039∗∗ −1.905∗∗∗ −1.944∗∗∗ −1.944∗∗ −1.574∗∗
(0.278) (0.278) (0.782) (0.295) (0.295) (0.744) (0.634)
Brooklyn −3.109∗∗∗ −3.115∗∗∗ −3.115∗∗∗ −1.582∗∗∗ −1.631∗∗∗ −1.631∗∗∗ −1.493∗∗∗
(0.168) (0.168) (0.578) (0.187) (0.187) (0.494) (0.489)
Queens −3.456∗∗∗ −3.549∗∗∗ −3.549∗∗∗ −3.313∗∗∗ −3.338∗∗∗ −3.338∗∗∗ −3.250∗∗∗
(0.254) (0.255) (0.997) (0.307) (0.308) (0.956) (0.934)
Staten Island −1.204∗∗∗ −0.918∗∗∗ −0.918 −2.721∗∗∗ −2.536∗∗∗ −2.536∗∗ −2.645∗∗
(0.283) (0.288) (1.154) (0.373) (0.373) (0.583) (0.383)
Clustered SE No No Yes No No Yes Yes
Time FE No Yes Yes No Yes Yes Yes
Precinct FE No No No Yes Yes Yes Yes
Extra controls No No No No No No Yes
Standard errors are presented in parentheses. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. The dependent variable is the
probability of being arrested conditional on being stopped and is multiplied by 100. Extra controls
refer to the inclusion of indicators for gender, youth, suspect height and build as well as the time of
day as defined in Table 1. OLS refers to ordinary least squares estimator and FE refers to a precinct
fixed effects estimator.
Standard deviations in parentheses. The last column presents the p-values from tests where the null
hypothesis is that of equality in the probability of stop for the four crime categories.
The dependent variable is the probability of being issued a summons conditional on being stopped and
is multiplied by 100. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. Robust standard errors in parentheses for column 1, boot-
strapped (1,000 repetitions) and reweighted standard errors in parentheses for column 2. Specifications
additionally include fixed effects for precincts and years as well as indicators for gender, youth, height,
build and time of day. Column 3 presents the p-values from Hausman specification tests where the null
hypothesis is that the estimated coefficient on black is the same across models from columns 1 and 2.
The dependent variable is the probability of being arrested conditional on being stopped and is multiplied
by 100. Robust standard errors in parentheses. Specifications additionally include fixed effects for
precincts and years as well as indicators for gender, youth, height, build and time of day.
SURVEY EVIDENCE ON BLACK
MARKET LIQUOR IN COLOMBIA
Gustavo J. Canavire-Bacarreza (Inter-American Development Bank, USA),
Alexander L. Lundberg (Department of Economics, West Virginia University, USA), and
Alejandra Montoya-Agudelo (School of Economics and Finance, Universidad EAFIT, Colombia)
ABSTRACT
1. INTRODUCTION
Illegal goods are a pervasive and understudied feature of modern economies.
Many goods are smuggled across borders to avoid taxation. Others are coun-
terfeit, designed to mimic an existing brand or product. Accurate estimates of
gross illegal economic activity remain elusive since market participants have a
clear incentive to conceal their transactions from authorities. To provide context,
however, the Organisation for Economic Co-operation and Development estimates
fake goods account for almost 3% of global imports, or 500 billion US dollars per
year (OECD, 2016). That figure does not include smuggled goods, nor does it
include purely domestic markets.
Although countries often coordinate trade policy through international agree-
ments, law enforcement is largely a domestic policy. Colombia, like many Latin
American countries, has grappled with the influx of illegal goods ever since
colonization. In 2014, the Colombian government adopted a novel approach to
enforcement when it commissioned a unique national survey to gather informa-
tion on black market liquor. Interviewers offered citizens money in exchange for
their most recently purchased bottle of liquor. Samples were sent to a laboratory for
testing, which confirmed whether a bottle was authentic, adulterated or contraband
(smuggled).1
The results of the survey confirm the importance of the black market for alcohol
in Colombia. Over 20% of the observations in the sample were confirmed to
be either contraband or adulterated. Figure 1 displays the percentage of illegal
purchases broken down by Colombian department.2
Different illegal goods carry different welfare implications. In two seminal arti-
cles, Grossman and Shapiro (1988a,b) classify fake goods according to whether
they are deceptive. If a good is deceptive, consumers believe it to be authentic
but incur a disutility from its inferior quality. If a good is not deceptive, con-
sumers can benefit from the option to buy a low quality imitation at lesser cost
(cf. Higgins & Rubin, 1986). Although either type of counterfeiting has the poten-
tial to raise or lower social welfare, deceptive counterfeiting is typically more
worrisome. For example, an adulterated bottle of alcohol may contain danger-
ous chemicals. Adulterated liquor is likely deceptive because for most purchases,
consumers have no obvious reason to want a fake product (e.g., a “snob effect”).
Contraband goods have a more direct link to welfare. They benefit consumers
who pay a lower price but hurt governments who lose tax revenue. One of the most
interesting features of the Colombian survey is the joint examination of adulter-
ated and contraband liquor. The two types of illegal liquor, though not mutually
exclusive, appear to enter the market through different channels. According to
regression analysis, two factors predict whether a bottle is contraband. One is the
2. DATA
Accounting for survey design is necessary for proper inference when data are not
taken from a simple random sample (see, e.g., Jain, 2006; Lumley, 2011). Esti-
mates and standard errors must be weighted to account for differing probabilities
of sample selection among members of the population.
The Colombian data come from a complex survey with multiple stages of sam-
pling. In the first stage, Colombia is divided into six different strata called regions.
Each region h contains Nh municipalities, and each municipality i has an adult pop-
ulation Mh,i . In the next stage, nh municipalities are randomly selected within each
region. Lastly, mh,i individuals are sampled randomly from within each selected
municipality i. Therefore, municipalities are the primary sampling units (PSUs)
and individuals are secondary sampling units (SSUs). The final sample includes
986 observations.
Table 1 displays the sample information in comparison to population fig-
ures. The population data come from the Colombian National Administrative
Department of Statistics (DANE), which offers population projections based on
census data. The DANE data can be disaggregated by age and municipality to cal-
culate the total adult population by region and municipality. Individuals younger
than age 18 are excluded from the calculation since the law does not allow minors
to purchase alcoholic beverages (the age of majority is 18 in Colombia).
Sampling weights, or probability weights, and finite population correction terms
are created according to PSUs and SSUs (Lumley, 2011). The weights are
calculated by taking the inverse probability of being sampled at each stage. Defining
fh,1 ≡ nh /Nh as the probability for each municipality to be selected within a region,
and fh,i,2 ≡ mh,i /Mh,i the probability for each individual to be sampled inside a
municipality i within region h, the sampling weights are given by
wh,i = (1/fh,1) × (1/fh,i,2) = (Nh/nh) × (Mh,i/mh,i),    (1)
which results in 94 different values for the sample. One way to interpret the weights
is that each observation represents wh,i individuals in the population.
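As a concrete sketch, the two-stage weight in (1) is just the product of the two inverse sampling fractions. All of the counts below are hypothetical, not taken from the survey:

```python
# Two-stage sampling weight w_{h,i} = (1/f_{h,1}) x (1/f_{h,i,2})
#                                   = (N_h/n_h) x (M_{h,i}/m_{h,i}).
# All counts are hypothetical illustrations, not the actual survey figures.

def sampling_weight(N_h, n_h, M_hi, m_hi):
    """Inverse probability of selection across both sampling stages."""
    f1 = n_h / N_h      # stage 1: municipality i selected within region h
    f2 = m_hi / M_hi    # stage 2: individual sampled within municipality i
    return (1 / f1) * (1 / f2)

# A region with 40 municipalities of which 4 are sampled; one sampled
# municipality has 50,000 adults, of whom 25 are interviewed.
w = sampling_weight(N_h=40, n_h=4, M_hi=50_000, m_hi=25)
print(w)  # 20000.0: each respondent stands in for 20,000 adults
```

In practice these weights would be supplied to survey software (e.g., R's survey package or Stata's svyset) together with the PSU identifiers and FPC terms.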
Finite population correction (FPC) terms are useful when PSUs and SSUs are
sampled without replacement. This correction accounts for the reduction in uncer-
tainty when the sample includes a large fraction of the population (Lumley, 2011).
Ignoring FPCs will inflate standard errors unless the sample is sufficiently small
compared to the population. The FPC term for municipalities is calculated as
FPCh,1 = 1 − fh,1, and the corresponding term for individuals within a sampled
municipality is FPCh,i,2 = 1 − fh,i,2.
Lastly, given the differences in the relative sample size per region, we also
conduct a post-stratification adjustment to control for under- and over-represented
regions. We use the total adult population per region to conduct the following
adjustment (Levy & Lemeshow, 2013):

w∗h,i = wh,i × Mh / ( Σi∈h wh,i mh,i ),    (2)

where Mh = Σi Mh,i (summing over the Nh municipalities in region h), which is
precisely the fourth column in Table 1.
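The post-stratification step in (2) rescales the weights so that the weighted total in each region matches the census figure Mh. A minimal sketch, again with hypothetical numbers:

```python
# Post-stratification for region h: scale each weight so the weighted sample
# total equals the known adult population M_h. All figures are hypothetical.

def poststratify(weights, m, M_h):
    """weights[i] and m[i] are the weight and number of sampled individuals
    for sampled municipality i in region h."""
    estimated_pop = sum(w * mi for w, mi in zip(weights, m))  # sum of w*m over i in h
    return [w * M_h / estimated_pop for w in weights]

# Two sampled municipalities: the raw weights imply a regional population of
# 20000*25 + 8000*50 = 900,000, while the census total is 1,080,000, so every
# weight is scaled up by the factor 1.2.
adjusted = poststratify(weights=[20_000, 8_000], m=[25, 50], M_h=1_080_000)
print(adjusted)  # [24000.0, 9600.0]
```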
3. RESULTS
Table 2 presents the summary statistics for the sample of 986 total observations.
The n column reports the number of complete observations for each variable.
Contraband and Adulterated are the two dependent variables. Each is an indicator
set to one if the purchase was confirmed to be contraband or adulterated, respec-
tively. Self-reported Transaction Price (STP) is the purchase price of the bottle
in Colombian Pesos reported by the buyer, and Size is the size of the bottle in
milliliters. Receipt is a dummy indicating whether the buyer obtained a receipt for
the purchase.
Discounts Offered and Asked are indicators set to one if the buyer was offered
a discount or if he or she asked for one, respectively. The following 10 indicator
variables denote the type of alcohol. Estrato is a set of six dummies capturing the
country’s unique socioeconomic classification system. Colombia assigns numbers
(or zones) 1–6 to housing buildings of roughly ascending wealth status. The classi-
fications determine tax rates and public utility pricing. The upper zones pay higher
rates and effectively subsidize the lower zones. The classification is decided only
by the physical features of housing, but since wealthier persons can afford to live in
the higher strata, Estrato is a proxy for socioeconomic status. Lastly, weekly usage
is recorded as an ordinal variable, with categories of less than half a bottle, one half
bottle (375 mL), one bottle (750 mL), more than one bottle, or didn’t know/didn’t
respond.
Ideally, the purchase of illegal liquor would be couched in a random utility
model of discrete choice. However, buyers and sellers may not know they are
buying contraband or illegal liquor, and even if they do, assumptions on the distri-
bution of random utility error terms are difficult to justify. Consider the framework
adapted from Train (2009). For simplicity, take the type of liquor as given and
assume all parties know whether a bottle is legal or not. Buyers, sellers (retailers),
and importers (or manufacturers) all choose whether to trade a legal or an illegal
bottle. Respectively, they derive utility

Uxij = Vxij + εxij,   x ∈ {b, s, m},

where the i index refers to the individual, and j is an indicator set to one if the bottle
is illegal and zero otherwise. The Vxij terms capture observable elements of utility,
including price for all parties (and perhaps size of the bottle), and cost for buyers
and importers. The εxij terms capture the remaining, unobserved component of
utility.
Ultimately, an illegal sale occurs when each party prefers an illegal bottle to a
legal one. The probability of the sale is then
P(illegal sale | Vbij, Vsij, Vmij) = P( ∩z∈{b,s,m} { Vzi1 + εzi1 > Vzi0 + εzi0 } ).    (3)
Empirically, we estimate binary response models of the form P(yi = 1 | xi) = F(x′iβ),
where yi is a dummy variable equal to one if the bottle was illegal – either contraband
or adulterated depending on the model – and xi, the vector of covariates,
includes the log of STP , the log of bottle size, whether a receipt was provided,
whether a discount was offered by the seller, whether a discount was received, and
a set of dummies for the type of alcohol and the Estrato of the buyer. The distribu-
tion function F(·) follows the right-hand side of (3). Again, the distribution is a
priori unknowable. Results for probit, logit and complementary log–log (cloglog)
estimation are included in all tables for comparison.4 They should be interpreted
as approximations to the right-hand side of (3).
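For intuition on how the candidate models can differ, the three link functions can be compared directly. This is a generic illustration of the three CDFs, not the chapter's estimation code:

```python
import math

# The three candidate CDFs for a binary response model. Probit and logit are
# symmetric around 0.5; the complementary log-log is asymmetric, approaching
# one faster than it leaves zero.

def probit(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logit(z):
    return 1 / (1 + math.exp(-z))

def cloglog(z):
    return 1 - math.exp(-math.exp(z))

for z in (-2, 0, 2):
    print(f"z = {z:+d}: probit {probit(z):.3f}, "
          f"logit {logit(z):.3f}, cloglog {cloglog(z):.3f}")
```

At z = 0 the probit and logit both return 0.5, while the cloglog already exceeds 0.6, which is why the three approximations can diverge most for probabilities away from one half.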
The log of STP has a significant effect on the likelihood of both types of ille-
gal alcohol, but the effect is stronger for contraband purchases. Recall that xi
Note: Standard errors in parentheses account for complex survey design. n=882 for all specifications.
Dummies for additional types of alcohol are excluded for perfect collinearity. We compute average
marginal effects (AMEs) for all our continuous covariates and discrete difference from the base level
for dummy variables; both cases were evaluated using actual values of the sample units. ∗ p < 0.1, ∗∗
p < 0.05, ∗∗∗ p < 0.01.
4. MULTIPLE IMPUTATION
The data also include variables for salary and weekly alcohol consumption, as
reported by the buyer. Both variables are theoretically relevant because heavy and
wealthy consumers may have different incentives to find and buy (or avoid) illegal
goods compared to other consumers. While the survey has nearly complete data
for most questions, including salary and weekly use regressors in the estimation
drops the sample size by roughly 40%. Furthermore, the indicator Discount Asked
contains missing values, and the variable might also influence illegal sales. To
better account for missing data, this section presents estimation results derived
from multiple imputation via chained equations.
In general, missing data can arise in one of two ways. The first is “unit non-
response,” which means the targeted individual did not participate in the survey.
The classic approach to dealing with unit nonresponse is by weighting the sample
to better match known characteristics of the population (see Section II). The sec-
ond source of missing data is “item nonresponse,” which means an individual was
unable or unwilling to answer a given question or “item.” The typical approach to
handling item nonresponse is to impute values for the missing observations before
running any estimation.
If data are missing completely at random (MCAR), then imputation is a fairly
harmless, simple procedure. In some cases, the data may be missing at random
(MAR) after conditioning on other variables in the data set. The trickier case is
when data are missing not at random (MNAR). Salary is a classic example of
MNAR data because very rich or poor individuals may be less willing to reveal
their salary. While MCAR can be tested against MAR, in principle, there is no
way to test for MAR vs. MNAR because the data lack the needed information by
definition.
Applying Little’s test of MCAR to both weekly use and salary based on the
estimation covariates, the assumption of MCAR is strongly rejected, with p-values
equal to zero to three decimal places for both variables (Li, 2013; Little & Rubin,
2002). In theory, however, salary and weekly use may be MNAR rather than MAR,
because individuals with high salaries or heavy use may be less willing to
disclose those facts. Table 4 confirms such a story for salary. Socioeconomic class
is correlated with salary, and response rates are nearly monotonically decreasing
as Estrato increases. Individuals with a higher socioeconomic class are less likely
to disclose income.
No variable appears to have a close relationship to missingness in weekly use.
For that variable, the missing category was combined with an “unsure” response,
and it remains unclear whether missing values actually represent a refusal to answer
or a sincere uncertainty.
Multiple imputation via chained equations (MICE) is a useful tool for handling
missing data (White, Royston, & Wood, 2010). The procedure involves specifying
a conditional mean regression equation for each variable with missing observa-
tions, using those models to predict the missing values, and iterating to create
multiple imputed data sets. Lastly, estimation is conducted on each data set and
estimates are combined to account for the uncertainty introduced by the procedure.
The equations are “chained” because the predicted values from one conditional
mean model appear as covariates in the next. Typically, the process starts with the
variable containing the fewest missing values (of those with any missing values),
then moves to the next least missing variable, and so on. See Appendix B for a
complete discussion (cf. Van Buuren, 2012).
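The chaining described above can be sketched in a few lines. This is a deliberately simplified conditional-mean version with two incomplete variables and made-up data (the full procedure adds random draws, as in Appendix B); the variable names and data-generating process are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)                              # fully observed
y = 1.0 + 2.0 * x + rng.normal(size=n)              # ~10% missing
z = 0.5 - 1.0 * x + 0.5 * y + rng.normal(size=n)    # ~30% missing
y_miss = rng.random(n) < 0.1                        # least missing -> imputed first
z_miss = rng.random(n) < 0.3

def ols_predict(X, target, obs):
    """Fit OLS on the observed rows only, then predict for every row."""
    beta, *_ = np.linalg.lstsq(X[obs], target[obs], rcond=None)
    return X @ beta

yi = np.where(y_miss, np.nan, y)
zi = np.where(z_miss, np.nan, z)
yi[y_miss] = np.nanmean(yi)          # crude starting values
zi[z_miss] = np.nanmean(zi)
ones = np.ones(n)
for _ in range(10):
    # Impute y from (x, z); the predictions from this model...
    yi[y_miss] = ols_predict(np.column_stack([ones, x, zi]), yi, ~y_miss)[y_miss]
    # ...appear as covariates in the next model in the chain: z from (x, y).
    zi[z_miss] = ols_predict(np.column_stack([ones, x, yi]), zi, ~z_miss)[z_miss]
```

Each pass re-fits the conditional models using the latest imputations from the other equation, which is exactly the sense in which the equations are “chained.”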
In the data, weekly use is the least missing variable, with roughly 10% of
values missing. Since no variables appear to predict missingness in weekly use,
the conditional mean is modeled by an ordered logit in the multiple imputation
(though if the variable is in fact MNAR, then results will be slightly biased in the
imputation).
Salary, on the other hand, appears to be MNAR based on Table 4. One way
to handle the selection issue is through a traditional Heckman two-step estimator
(Galimard, Chevret, Protopopescu, & Resche-Rigon, 2016; Heckman, 1979).
Unfortunately, two-step estimates are weakly identified without an exclusion
restriction; ideally, the data would contain a variable that affects the decision
to report salary but not the level of salary itself. While the survey contains no
convincing candidate, Escanciano et al. (2016) show that higher-order terms in
the selection equation can mitigate the weak identification.8 In the sample, age
and age squared offer continuous measures satisfying the notion behind
Escanciano et al. (2016), who find that identification for two-step estimators
can be achieved under weaker conditions than standard methods require, without
exclusion restrictions or instruments. In particular, identification can generally be
obtained when nothing more than the second stage, linear in our case, is
parameterized.
Thus, including age and age squared in the selection equation offers a partial
remedy to weak identification. We constrain log(Size) to be equal to 1 in the
Survey Evidence on Black Market Liquor in Colombia 299
$$dU = \frac{\partial U}{\partial STP}\,dSTP + \frac{\partial U}{\partial Receipt}\,dReceipt = 0 \tag{5}$$
$$dU = \beta_{STP}\,dSTP + \beta_{Receipt}\,dReceipt = 0$$
Note: Standard errors in parentheses account for the complex survey design. n = 986 for all specifications.
Dummies for additional types of alcohol are excluded due to perfect collinearity. We compute average
marginal effects (AMEs) for all continuous covariates and discrete differences from the base level
for dummy variables; both are evaluated at the actual values of the sample units. ∗ p < 0.1, ∗∗
p < 0.05, ∗∗∗ p < 0.01.
Note: Confidence intervals are calculated using Krinsky and Robb’s method with 5,000 replications
and do not account for the complex survey design. n = 986. ASL is the achieved significance level. The
level of confidence is 95%. H0 : W T P ≤ 0, H1 : W T P > 0.
$$dSTP = -\frac{\beta_{Receipt}}{\beta_{STP}}\,dReceipt \tag{6}$$
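Eq. (6) says the willingness to pay for a receipt is the coefficient ratio, and the Krinsky–Robb method referenced in the table note simulates its confidence interval by drawing coefficients from their estimated sampling distribution. A sketch with hypothetical point estimates and covariance (the real values come from the fitted survey model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical estimates for (beta_STP, beta_Receipt) and their covariance.
beta_hat = np.array([-1.2, 0.8])
V = np.array([[0.04, 0.01],
              [0.01, 0.09]])

# Eq. (6): dSTP = -(beta_Receipt / beta_STP) dReceipt, so the implied
# price change per unit change in Receipt is the coefficient ratio.
wtp_point = -beta_hat[1] / beta_hat[0]

# Krinsky-Robb: draw coefficients from MVN(beta_hat, V), recompute the
# ratio for each draw, and take percentiles as the confidence interval.
draws = rng.multivariate_normal(beta_hat, V, size=5_000)
wtp_draws = -draws[:, 1] / draws[:, 0]
lo, hi = np.percentile(wtp_draws, [2.5, 97.5])
```

The percentile interval reflects the skewness of a ratio of normals, which a delta-method interval would miss.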
5. DISCUSSION
The results provide several insights on the sources of black market liquor in Colom-
bia. First, the absence of a receipt and the presence of a discount offered by the
seller both make a sale more likely to be contraband (although the receipt indicator
loses significance in the MICE estimation). Neither variable has any relationship
with adulterated sales. The emerging story suggests that sellers are complicit in the
302 GUSTAVO J. CANAVIRE-BACARREZA ET AL.
For 2014, the second largest contributor to Colombian departmental tax revenue
was liquor taxes, with a share of 17.1%, beaten only by beer at 28.2% (DNP, 2014).
During the same year, liquor taxes depended only on the size of the bottle and the
alcohol by volume (abv). Based on the survey, illegal bottles would have collected
52.5% more tax revenue than the legal collection calculated in the sample.15
This means that departments could have raised tax revenue by almost 9 percentage
points. However, the Colombian government introduced tax reform in 2016 that
modified liquor taxes in order to raise tax revenue. These reforms include an ad
valorem tax and a value-added tax, which could change the way illegal liquor
affects tax loss in the future.
6. CONCLUSION
Illegal alcohol is a large industry in Colombia. Over 20% of the sampled bottles in
the government survey are contraband or adulterated. According to distributional
maps of the country, the two types of illegal liquor appear to arrive through different
channels. Contraband liquor originates through the north of the country, along
traditional shipping routes, while adulterated liquor is relatively dispersed across
departments.
Regression analysis offers further insight into the underground economy. The
most interesting predictors of contraband liquor are the absence of a receipt and the
presence of a discount offered specifically by the seller. Those two results strongly
suggest that sellers are complicit in the contraband market. Furthermore, sellers
are more likely to offer adulterated products when consumers ask for a discount.
The results have important implications for law enforcement. The government
has an incentive to stop contraband sales because they involve a loss of tax rev-
enue. Although adulterated sales involve little loss of revenue, they may still be a
point of emphasis for law enforcement because they are harmful to consumers and
perhaps public health. Since sellers appear complicit in the contraband market,
sellers and importers from the north would be the most effective targets for author-
ities. Conversely, authorities may need to identify different sellers when targeting
adulterated sales, as they appear more dispersed throughout the country. Firms
with an incentive to protect their brand may also have an interest both in shutting
down the adulterated market and in contributing insight to enforcement policy.
ACKNOWLEDGMENTS
We thank two anonymous referees and participants of the 2017 Econometrics
of Complex Survey Design workshop at the Bank of Canada for their helpful
comments. We also thank Jacques-Emmanuel Galimard for sharing R code on
a multiple imputation procedure. Finally, we thank Universidad EAFIT and the
FND – Federación Nacional de Departamentos – for allowing us to use the data.
The opinions expressed in this publication are those of the authors and do not
necessarily reflect the views of the Inter-American Development Bank, its Board
of Directors, or the countries they represent.
REFERENCES
Bravo, F., Huynh, K. P., & Jacho-Chávez, D. T. (2011). Average derivative estimation with missing
responses. In David M. Drukker (Ed.), Missing data methods: Cross-sectional methods and
applications (Advances in Econometrics, Volume 27 Part 1) (pp. 129–154). Bingley: Emerald
Publishing.
Buuren, S. V., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45, 1–68.
DNP (2014). Desempeño fiscal de los departamentos y municipios 2014 [Fiscal performance of
departments and municipalities 2014]. Bogotá: Departamento Nacional de Planeación.
Escanciano, J. C., Jacho-Chávez, D. T., & Lewbel, A. (2016). Identification and estimation of
semiparametric two-step models. Quantitative Economics, 7, 561–589.
Galbraith, J. W., & Kaiserman, M. (1997). Taxation, smuggling and demand for cigarettes in Canada:
Evidence from time-series data. Journal of Health Economics, 16(3), 287–301.
Galimard, J. E., Chevret, S., Protopopescu, C., & Resche-Rigon, M. (2016). A multiple imputation
approach for MNAR mechanisms compatible with Heckman’s model. Statistics in Medicine,
35(17), 2907–2920.
Grossman, G. M., & Shapiro, C. (1988a). Counterfeit-product trade. American Economic Review,
78(1), 59–75.
Grossman, G. M., & Shapiro, C. (1988b). Foreign counterfeiting of status goods. Quarterly Journal
of Economics, 103(1), 79–100.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161.
Higgins, R. S., & Rubin, P. H. (1986). Counterfeit goods. Journal of Law & Economics, 29(2), 211–230.
Jain, A. K., & Hausman, R. E. (2006). Stratified multistage sampling. In S. Kotz, C. B. Read, N.
Balakrishnan, B. Vidakovic and N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences.
Hoboken, NJ: John Wiley & Sons.
Jeanty, P. W. (2007). WTPCIKR: Constructing Krinsky and Robb Confidence Intervals for Mean and
Median willingness to pay (WTP) using Stata. In Sixth North American Stata users’ group
meeting, Boston, August (pp. 13–14).
Jiongo, V. D., Haziza, D., & Duchesne, P. (2013). Controlling the bias of robust small-area estimators.
Biometrika, 100(4), 843–858.
Levy, P. S., & Lemeshow, S. (2013). Sampling of populations: Methods and applications. Hoboken,
NJ: John Wiley & Sons.
Li, C. (2013). Little’s test of missing completely at random. The Stata Journal, 13(4), 795–809.
Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data (Vol. 333). Hoboken, NJ:
John Wiley & Sons.
Lumley, T. (2011). Complex surveys: a guide to analysis using R (Vol. 565). Hoboken, NJ: John
Wiley & Sons.
Organization for Economic Cooperation and Development (2016). Global trade in fake goods worth
nearly half a trillion dollars a year. Retrieved from: http://www.oecd.org/industry/global-trade-
in-fake-goods-worth-nearly-half-a-trillion-dollars-a-year.htm.
Qian, Y. (2008). Impacts of entry by counterfeiters. The Quarterly Journal of Economics, 123(4),
1577–1609.
Quercioli, E., & Smith, L. (2015). The economics of counterfeiting. Econometrica, 83(3), 1211–1236.
Rao, J. N., & Molina, I. (2015). Small area estimation. Hoboken, NJ: John Wiley & Sons.
Rubin, D. B. (1988). An overview of multiple imputation. In Proceedings of the survey research
methods section of the American statistical association (pp. 79–84).
Thursby, J. G., & Thursby, M. C. (2000). Interstate cigarette bootlegging: extent, revenue losses, and
effects of federal intervention. National Tax Journal, 53, 59–77.
Train, K. E. (2009). Discrete choice methods with simulation. New York, NY: Cambridge University
Press.
Van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: Chapman and Hall/CRC.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: issues
and guidance for practice. Statistics in Medicine, 30(4), 377–399.
APPENDIX A: TABLES
Table A1. Average Marginal Effects for Contraband and Adulterated Liquor –
Levels
Contraband Adulterated
Note: Standard errors in parentheses account for the complex survey design. n = 882 for all specifications.
Dummies for additional types of alcohol are excluded due to perfect collinearity. We compute average
marginal effects (AMEs) for all continuous covariates and discrete differences from the base level
for dummy variables; both are evaluated at the actual values of the sample units. ∗ p < 0.1, ∗∗
p < 0.05, ∗∗∗ p < 0.01.
Table A2. Coefficients for Different Specifications in the Salary Imputation Model

                    Following Escanciano et al. (2016)    Exclusion restrictions
                    (1)         (2)         (3)           (4)          (5)
Salary equation:
Age                 0.007       0.007       0.008
                    (0.006)     (0.007)     (0.006)
Selection equation:
Age                 −0.061∗∗∗   −0.015      0.325         −0.020∗∗∗    −0.062∗∗∗
                    (0.026)     (0.127)     (0.593)       (0.004)      (0.025)
Age2                0.001       −0.001      −0.015                     0.001∗
                    (0.000)     (0.003)     (0.024)                    (0.000)
Age3                            0.000       0.000
                                (0.000)     (0.000)
Age4                                        −0.000
                                            (0.000)
Note: Standard errors in parentheses account for the complex survey design. ∗ p < 0.1, ∗∗ p < 0.05, ∗∗∗ p < 0.01. Both stages include the following regressors:
log(ST P ), log(Size), Discount Asked, Discount Offered, Receipt, types of alcohol dummies, Estrato dummies, Weekly dummies, region dummies, Sex
(dummy), Adulterated (dummy), Contraband (dummy), Weekly reported (dummy), Discount Asked reported (dummy), and a constant term.
Table A3. Average Marginal Effects for Multiple Imputation With Chained
Equations Using Linear Exclusion Restriction.
Contraband Adulterated
Note: Standard errors in parentheses account for the complex survey design. n = 986 for all specifications.
Dummies for additional types of alcohol are excluded due to perfect collinearity. We compute average
marginal effects (AMEs) for all continuous covariates and discrete differences from the base level
for dummy variables; both are evaluated at the actual values of the sample units. ∗ p < 0.1, ∗∗
p < 0.05, ∗∗∗ p < 0.01.
In our specific case, we consider three different models: logit, ordered logit, and
the two-step Heckman estimator. White et al. (2011) provide the baseline for the
first two models, while Galimard et al. (2016) provide the baseline for the final
case. These procedures assume that β follows a multivariate normal distribution,
which is a common approximation for multiple imputation procedures, including
the case of categorical variables (Van Buuren, 2012). Therefore, a random draw β ∗
can be calculated as follows:
1. After estimating the proposed imputation model with k regressors (taking into
account the constant term, W, and R), the vector of coefficients β̂ and the
associated covariance matrix V are obtained.
2. Approximating the posterior distribution of β by MVN(β̂, V):
• σ∗ is drawn as
$$\sigma^{*} = \hat{\sigma}\sqrt{\frac{n_{obs}-k}{g}}, \tag{B.1}$$
where σ̂ is the estimated root mean-squared error and g is a random draw
from a χ2 distribution on nobs − k degrees of freedom.
3. β ∗ is drawn as
$$\beta^{*} = \hat{\beta} + \frac{\sigma^{*}}{\hat{\sigma}}\,u_{1}V^{1/2} = \hat{\beta} + \sqrt{\frac{n_{obs}-k}{g}}\,u_{1}V^{1/2}, \tag{B.2}$$
where u1 is a vector of k independent draws from a standard normal distribution.
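Steps 2 and 3 can be sketched numerically. All model quantities below (coefficients, covariance, residual error) are hypothetical placeholders standing in for a fitted imputation model:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical fitted imputation model: k coefficients with covariance V,
# root mean-squared error sigma_hat, estimated on n_obs complete cases.
n_obs, k = 400, 3
beta_hat = np.array([0.5, -1.0, 2.0])
V = np.diag([0.01, 0.02, 0.005])
sigma_hat = 1.3

# (B.1): scale the residual s.d. by a chi-square draw on n_obs - k d.f.
g = rng.chisquare(n_obs - k)
sigma_star = sigma_hat * np.sqrt((n_obs - k) / g)

# (B.2): perturb beta_hat with u1 ~ N(0, I_k) through a square root of V
# (Cholesky factor here), scaled by sigma_star / sigma_hat.
u1 = rng.standard_normal(k)
V_half = np.linalg.cholesky(V)
beta_star = beta_hat + (sigma_star / sigma_hat) * (V_half @ u1)
```

Drawing (σ∗, β∗) rather than reusing (σ̂, β̂) is what propagates parameter uncertainty into the imputations.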
Binary variables
Logit is the most common alternative to impute binary variables. The procedure
consists of fitting the imputation model
$$\Pr(y_i = 1 \mid X_i) = \frac{1}{1+\exp(-X_i\beta)}, \tag{B.3}$$
which yields β ∗ . Predicted probabilities for the missing data are computed as
Pi∗ = [1 + exp( − Xi β ∗ )]−1 . Next, a vector u2 is generated with random draws
from a uniform distribution on (0, 1) for all units with yi missing. Finally, the
imputed values are given by 1 if u2i < Pi∗ and 0 otherwise.
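The binary draw step looks like this in code. The design matrix and posterior coefficient draw below are hypothetical; in the procedure above, β∗ comes from (B.1)–(B.2) and the rows are the units with y missing:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical rows of the design matrix for units with y missing,
# and a posterior coefficient draw beta_star (as in B.1-B.2).
X_miss = np.column_stack([np.ones(6), rng.normal(size=6)])
beta_star = np.array([0.2, 1.5])

# (B.3): predicted probabilities under the logit imputation model.
p_star = 1.0 / (1.0 + np.exp(-X_miss @ beta_star))

# Compare against uniform draws: impute 1 where u2 < p_star, else 0.
u2 = rng.random(6)
y_imputed = (u2 < p_star).astype(int)
```

Comparing probabilities to uniform draws, instead of rounding them, preserves the variability of the imputed variable.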
Ordered variables
This is a straightforward extension of the previous case considering an ordered
logit. After obtaining β ∗ , the predicted probabilities for each category k are
calculated as
$$P_{ik}^{*} = \Pr(y_i = k \mid X_i) = \frac{1}{1+\exp(-\kappa_k + X_i\beta^{*})} - \frac{1}{1+\exp(-\kappa_{k-1} + X_i\beta^{*})}, \tag{B.4}$$
where κk for k = 0, . . . , L are the different cut-points, which are also parameters of
the model; κ0 is defined as −∞ and κL as +∞. Next, denote by cik the cumulative
class membership probabilities $\sum_{j=1}^{k} P_{ij}^{*}$. The imputed values are given by
$$y_i^{*} = 1 + \sum_{k=1}^{L-1} \mathbb{1}(u_{2i} > c_{ik}), \tag{B.5}$$
where u2i is the component of row i of vector u2 , as defined for the case of binary
variables, and 1( · ) stands for the indicator function.
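The ordered draw (B.4)–(B.5) can be written compactly using the cumulative form of the logit probabilities. The linear indices and cut-points below are hypothetical, with L = 3 categories:

```python
import numpy as np

rng = np.random.default_rng(5)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear indices X_i beta_star for 8 units, and cut-points
# kappa_1, kappa_2 (kappa_0 = -inf and kappa_3 = +inf are implicit).
xb = rng.normal(size=8)
kappa = np.array([-1.0, 1.0])

# (B.4) in cumulative form: c_ik = Pr(y_i <= k | X_i) = logistic(kappa_k - xb_i);
# category probabilities are the successive differences of these c_ik.
c = logistic(kappa[None, :] - xb[:, None])      # shape (8, L-1)

# (B.5): the imputed category is 1 plus the number of cumulative
# probabilities that a uniform draw exceeds.
u2 = rng.random(8)
y_imputed = 1 + (u2[:, None] > c).sum(axis=1)   # values in {1, 2, 3}
```

Working with the cumulative probabilities directly avoids computing each category probability and then re-summing them.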
MNAR data
For the case of salary, we closely follow the procedure proposed by Galimard et
al. (2016), who conduct the Heckman two-step estimator, including a variance
correction step.
The Heckman two-step estimator is derived from
$$P(R_{yi} = 1 \mid X_{is}) = \Phi(X_{is}\beta_s), \qquad R_{yi}^{*} = X_{is}\beta_s + \epsilon_{is}, \tag{B.6}$$
$$E(y_i \mid X_i, X_{is}, R_{yi} = 1) = X_i\beta + \rho\sigma_{\epsilon}\lambda_i, \tag{B.7}$$
where Ryi is a binary variable indicating the missingness of salaries, and R∗yi is the
associated latent variable. R∗yi and yi are linked by a bivariate normal distribution
of their error terms ε and εs ; ρ is the correlation coefficient between ε and εs , and
λi = φ(Xis βs )/Φ(Xis βs ) is the well-known inverse Mills ratio, where φ( · )
and Φ( · ) are the standard normal density and cumulative distribution functions.
The first step to obtain imputation values for yi is the estimation of β̂s and λ̂i from
the selection Eq. (B.6), and of β̂, β̂λ and σ̂η from Eq. (B.7), which can be estimated
through ordinary least squares as yi = Xi β + λi βλ + ηi , where ηi ∼ N (0, ση2 ).
ση2∗ is computed as σ̂η2 (nobs − k)/g, analogous to (B.1), and σε2∗ is obtained through
the following variance correction:
$$\hat{\sigma}_{\epsilon}^{2*} = \frac{\sigma_{\eta}^{2*}}{\frac{1}{N}\sum_{i=1}^{N}\left[1-\hat{\rho}^{2}\,\hat{\lambda}_i\left(\hat{\lambda}_i + X_{is}\hat{\beta}_s\right)\right]} \tag{B.8}$$
The estimation of σε2∗ allows for the computation of random draws (β ∗ , βλ∗ ) =
(β̂, β̂λ ) + σ̂ε∗ u1 V1/2 . Finally, the imputation values for yi are derived as
$$y_i^{*} = X_i\beta^{*} + \lambda_i\beta_{\lambda}^{*} + \sigma_{\eta}^{*}z_i, \tag{B.9}$$
where zi are random draws from a standard normal distribution.
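The two-step logic in (B.6)–(B.7) can be sketched on simulated data. This is a simplified illustration, not Galimard et al.'s full procedure: the selection coefficients are treated as already estimated (here, set to their true hypothetical values), and the data-generating process is made up, with ρ = 0.5 so that the coefficient on the inverse Mills ratio should recover ρσε = 0.5.

```python
import numpy as np
from math import erf, sqrt, pi, exp

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):  # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

rng = np.random.default_rng(11)
n = 20_000

# Hypothetical DGP with correlated errors (rho = 0.5, sigma_eps = 1).
x = rng.normal(size=n)
e_s = rng.normal(size=n)                       # selection error
e = 0.5 * e_s + sqrt(0.75) * rng.normal(size=n)  # outcome error
R = (0.3 + 1.0 * x + e_s) > 0                  # response indicator, as in (B.6)
y = 1.0 + 2.0 * x + e                          # salary, observed only when R

# Step 1: with the selection coefficients beta_s taken as estimated,
# form the inverse Mills ratio lambda_i = phi(X_s b_s) / Phi(X_s b_s).
xb_s = 0.3 + 1.0 * x
lam = np.array([phi(v) / Phi(v) for v in xb_s])

# Step 2, as in (B.7): OLS of y on (1, x, lambda) over respondents only;
# the lambda coefficient estimates rho * sigma_eps.
Xmat = np.column_stack([np.ones(n), x, lam])[R]
coef, *_ = np.linalg.lstsq(Xmat, y[R], rcond=None)
```

Without an exclusion restriction, only the nonlinearity of λ in x separates the Mills-ratio term from the regressors, which is the weak-identification problem the higher-order age terms are meant to ease.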
The second and third steps of MICE are also conducted in multiple imputation.
Although the following lines summarize the main results, please refer to Rubin
(1988) for further details. Let θ̂l and Ul be the estimate and associated variance for
each imputed data set l, where l = 1, . . . , M. The final estimate of θ is
$$\bar{\theta} = \sum_{l=1}^{M}\frac{\hat{\theta}_{l}}{M}. \tag{B.10}$$
The variability associated with the estimate has two components: the average
within-imputation variance,
$$\bar{U} = \sum_{l=1}^{M}\frac{U_{l}}{M}, \tag{B.11}$$
and the between-imputation component,
$$B = \sum_{l=1}^{M}\frac{(\hat{\theta}_{l}-\bar{\theta})^{2}}{M-1}. \tag{B.12}$$
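Rubin's pooling rules (B.10)–(B.12) are a short computation. The estimates and variances below are hypothetical placeholders for M = 5 imputed data sets; the final line combines the two components into the standard total variance (a standard result not written out above):

```python
import numpy as np

# Hypothetical point estimates and variances from M = 5 imputed data sets.
theta = np.array([0.42, 0.45, 0.40, 0.47, 0.44])
U = np.array([0.010, 0.012, 0.009, 0.011, 0.010])
M = len(theta)

theta_bar = theta.mean()                         # (B.10) pooled estimate
U_bar = U.mean()                                 # (B.11) within-imputation variance
B = ((theta - theta_bar) ** 2).sum() / (M - 1)   # (B.12) between-imputation variance

# Standard Rubin combination: total variance with a finite-M correction.
T = U_bar + (1 + 1 / M) * B
```

The between-imputation term B is what carries the uncertainty introduced by the imputation itself into the final standard errors.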
Fig. C1. Residuals: (a) Probit, Contraband; (b) Probit, Adulterated; (c) Logit, Contraband;
(d) Logit, Adulterated; (e) Cloglog, Contraband and (f) Cloglog, Adulterated.