Previous Volumes
Volume 21: Modelling and Evaluating Treatment Effects in Econometrics – Edited by Daniel L.
Millimet, Jeffrey A. Smith and Edward Vytlacil
Volume 22: Econometrics and Risk Management – Edited by Jean-Pierre Fouque, Thomas B.
Fomby and Knut Solna
Volume 23: Bayesian Econometrics – Edited by Siddhartha Chib, Gary Koop, Bill Griffiths and
Dek Terrell
Volume 24: Measurement Error: Consequences, Applications and Solutions – Edited by Jane
Binner, David Edgerton and Thomas Elger
Volume 25: Nonparametric Econometric Methods – Edited by Qi Li and Jeffrey S. Racine
Volume 26: Maximum Simulated Likelihood Methods and Applications – Edited by R. Carter Hill
and William Greene
Volume 27A: Missing Data Methods: Cross-Sectional Methods and Applications – Edited by David
M. Drukker
Volume 27B: Missing Data Methods: Time-Series Methods and Applications – Edited by David M.
Drukker
Volume 28: DSGE Models in Macroeconomics: Estimation, Evaluation and New Developments –
Edited by Nathan Balke, Fabio Canova, Fabio Milani and Mark Wynne
Volume 29: Essays in Honor of Jerry Hausman – Edited by Badi H. Baltagi, Whitney Newey, Hal
White and R. Carter Hill
Volume 30: 30th Anniversary Edition – Edited by Dek Terrell and Daniel Millimet
Volume 31: Structural Econometric Models – Edited by Eugene Choo and Matthew Shum
Volume 32: VAR Models in Macroeconomics – New Developments and Applications: Essays in
Honor of Christopher A. Sims – Edited by Thomas B. Fomby, Lutz Kilian and Anthony
Murphy
Volume 33: Essays in Honor of Peter C. B. Phillips – Edited by Thomas B. Fomby, Yoosoon Chang
and Joon Y. Park
Volume 34: Bayesian Model Comparison – Edited by Ivan Jeliazkov and Dale J. Poirier
Volume 35: Dynamic Factor Models – Edited by Eric Hillebrand and Siem Jan Koopman
Volume 36: Essays in Honor of Aman Ullah – Edited by Gloria Gonzalez-Rivera, R. Carter Hill
and Tae-Hwy Lee
Volume 37: Spatial Econometrics – Edited by Badi H. Baltagi, James P. LeSage, and R. Kelley Pace
Volume 38: Regression Discontinuity Designs: Theory and Applications – Edited by Matias D.
Cattaneo and Juan Carlos Escanciano
ADVANCES IN ECONOMETRICS VOLUME 39
THE ECONOMETRICS OF
COMPLEX SURVEY DATA:
THEORY AND APPLICATIONS
EDITED BY
KIM P. HUYNH
Bank of Canada, Canada
DAVID T. JACHO-CHÁVEZ
Emory University, USA
GAUTAM TRIPATHI
University of Luxembourg, Luxembourg
ISOQAR certified Management System, awarded to Emerald for adherence to Environmental standard ISO 14001:2004.
CONTENTS

LIST OF CONTRIBUTORS vii

INTRODUCTION ix

PART I
SAMPLING DESIGN

PART II
VARIANCE ESTIMATION

PART III
ESTIMATION AND INFERENCE

PART IV
APPLICATIONS IN BUSINESS,
HOUSEHOLD, AND CRIME SURVEYS

INDEX 315
INTRODUCTION
SAMPLING DESIGN
“Can Internet Match High Quality Traditional Surveys? Comparing the Health and
Retirement Study and Its Online Version” by Marco Angrisani, Brian Finley and
Arie Kapteyn revisits the question of the comparability of online and more traditional
interview modes by studying differences across Internet-based, face-to-face and
phone-based surveys. The authors find little evidence of mode effects across a
variety of outcomes, providing support for Internet-based surveys.
“Effectiveness of Stratified Random Sampling for Payment Card Acceptance
and Usage” by Christopher S. Henry and Tamás Ilyés uses the universe of merchant
cash registers in Hungary to assess the effect of stratified random sampling on
estimates of payment card acceptance and usage. The authors compare county,
industry and store-size stratifications, mimicking the usual stratification criteria
for standard merchant surveys, which lets them quantify the effect on estimates
of card acceptance for different sample sizes.
VARIANCE ESTIMATION
“Wild Bootstrap Randomization Inference for Few Treated Clusters” by James
G. MacKinnon and Matthew D. Webb proposes a bootstrap-based alternative to
randomization inference, which mitigates problems of over- or under-rejection in
t tests in pure treatment or difference-in-differences settings when the number of
clusters is very small.
“Variance Estimation for Survey-weighted Data Using Bootstrap Resampling
Methods: 2013 Methods-of-Payment Survey Questionnaire” by Heng Chen and Q.
Rallye Shen proposes a bootstrap-resampling method to estimate variability when
sampling units are selected through an approximate stratified two-stage sampling
design. Their proposed method allows for randomness from both the sampling
design and the raking procedure.
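The resampling idea can be illustrated in a few lines. This is a generic sketch of bootstrap variance estimation under stratified cluster sampling, not the authors' estimator: all names and the toy data are hypothetical, and the real procedure must also replicate the raking step inside each bootstrap iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(strata, B=500):
    """Bootstrap SE of a weighted mean under stratified cluster sampling.

    `strata` maps a stratum label to a list of clusters; each cluster is a
    list of (weight, value) pairs. Within every stratum, clusters are
    resampled with replacement, mimicking the first sampling stage.
    """
    estimates = []
    for _ in range(B):
        num, den = 0.0, 0.0
        for clusters in strata.values():
            # Resample as many clusters as were originally drawn in the stratum.
            idx = rng.integers(0, len(clusters), size=len(clusters))
            for i in idx:
                for w, y in clusters[i]:
                    num += w * y
                    den += w
        estimates.append(num / den)
    return float(np.std(estimates, ddof=1))

# Toy data: two strata, each with two clusters of (weight, value) pairs.
strata = {
    "urban": [[(1.0, 1), (1.2, 0)], [(0.8, 1), (1.1, 1)]],
    "rural": [[(2.0, 0), (1.5, 1)], [(1.7, 0), (2.2, 0)]],
}
se = bootstrap_se(strata)
```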
“Model Selection Tests for Complex Survey Samples” by Iraj Rahmani and Jeffrey
M. Wooldridge extends Vuong’s model selection test (“Likelihood Ratio Tests for
Model Selection and Non-Nested Hypotheses,” Econometrica, 1989) to allow for
complex survey samples. By using an M-estimation setting, their test applies to
general estimation problems including linear and nonlinear least squares, Poisson
regression and fractional response models. With cluster samples and panel data,
they show how to combine the weighted objective function with a cluster-robust
variance estimator, thereby expanding the scope of their test.
“Inference in Conditional Moment Restriction Models When There is Selection
Due to Stratification” by Antonio Cosma, Andreï V. Kostyrka and Gautam
Tripathi shows how to use a smoothed empirical likelihood approach to conduct
efficient semiparametric inference in models characterized as conditional moment
equalities when data are collected by variable probability sampling.
“Nonparametric Kernel Regression Using Complex Survey Data” by Luc Clair
derives the asymptotic properties of a design-based nonparametric kernel-based
regression estimator under a combined inference framework involving multivariate
mixed data. It also proposes a least squares cross-validation procedure for selecting
the bandwidth for both continuous and discrete variables. Simulation results show
that the estimator is consistent and that efficiency gains can be achieved by weight-
ing observations by the inverse of their inclusion probabilities if the sampling is
endogenous.
“Nearest Neighbor Imputation for General Parameter Estimation in Survey
Sampling” by Shu Yang and Jae Kwang Kim studies the asymptotic properties
of the nearest neighbor imputation estimator of population parameters.
ABSTRACT
We examine sample characteristics and elicited survey measures of two stud-
ies, the Health and Retirement Study (HRS), where interviews are done
either in person or by phone, and the Understanding America Study (UAS),
where surveys are completed online and a replica of the HRS core ques-
tionnaire is administered. By considering variables in various domains, our
investigation provides a comprehensive assessment of how Internet data col-
lection compares to more traditional interview modes. We document clear
demographic differences between the UAS and HRS samples in terms of
age and education. Yet, sample weights correct for these discrepancies and
allow one to satisfactorily match population benchmarks as far as key socio-
demographic variables are concerned. Comparison of a variety of survey
outcomes with population targets shows a strikingly good fit for both the HRS
and the UAS. Outcome distributions in the HRS are only marginally closer
to population targets than outcome distributions in the UAS. These patterns
arise regardless of which variables are used to construct post-stratification
weights in the UAS, confirming the robustness of these results. We find little
evidence of mode effects when comparing the subjective measures of self-
reported health and life satisfaction across interview modes. Specifically, we
do not observe very clear primacy or recency effects for either health or life
satisfaction. We do observe a significant social desirability effect, driven by
the presence of an interviewer, as far as life satisfaction is concerned. By
and large, our results suggest that Internet surveys can match high-quality
traditional surveys.
Keywords: Online survey; survey methods; weighting; survey mode
effects; face-to-face interviews; online interviews
1. INTRODUCTION
The collection of high-quality data on households and individuals tends to be
labor intensive, costly and slow. When adopting traditional survey modes like
face-to-face or telephone interviewing, typically several years elapse from the
moment a survey is designed to final data availability. The Internet, with its
promise of real-time results and looming ubiquity, provides a tempting alterna-
tive for faster and more cost-effective data collection. Online surveys, however,
differ from more traditional surveys in several respects which may affect both
sample representativeness and data quality.
First, Internet coverage is still not entirely pervasive, especially among more
economically disadvantaged groups and the elderly. Data from the Pew Research
Center showed that only 51% of Americans aged 65 or older had a home broadband
connection in 2016, while the fraction of home broadband owners was about 77%
among 18–29 year olds. Likewise, home broadband coverage was only 53% for
Americans with incomes of less than $30,000 and 93% for those with incomes
of $75,000 or greater. As a result, the representativeness of online surveys may
be jeopardized (Schonlau et al., 2009). Telephone surveys, however, face similar
difficulties with the widespread adoption of voice mail and cell phones (Blumberg,
Luke, & Cynamon, 2004). Second, even with complete coverage of the population,
individual characteristics are bound to influence the likelihood of completing an
online survey versus a face-to-face or phone survey, thereby introducing relevant
selectivity issues and nonresponse biases that may vary by interview mode (Couper,
2011). Third, mode effects need to be considered, as the same question may be
answered differently in person, by phone or over the Internet (Schwarz & Sudman,
1992). Face-to-face and phone interviews leave more room for clarification and
offer more control of who is actually answering the questionnaire. On the other
hand, Web surveys offer more privacy and could thereby encourage more accurate
and honest reporting on personal and sensitive matters, while the presence of
an interviewer in face-to-face and telephone interviews may induce interviewer
effects.
Chang and Krosnick (2009) compare sample representativeness and data quality
of Internet-based surveys and phone-based surveys. They conclude that as long as
Internet data are collected from a probability-based sample, these exhibit higher
accuracy than data collected by phone. Their study, however, is limited to a rather
specific topic, namely politics.
In view of the existing literature, the contribution of this paper is twofold. First,
we revisit the question of comparability of online and more traditional interview
modes by studying differences across Internet-based, face-to-face and phone-based
surveys. Second, we focus on a diverse set of outcomes, ranging from home
ownership and labor force status to self-reported health and life satisfaction. The
aforementioned sources of differences between Web surveys and face-to-face or
telephone surveys may affect each of these outcomes in different ways. Thus,
by considering variables in various domains, our investigation provides a more
robust and comprehensive assessment of how Internet data collection compares to
more traditional interview modes. Moreover, while our analysis is performed at
a time when Internet coverage has increased substantially in the population, we
focus (because of data availability and comparability issues) on the subgroup of
individuals aged 55 and older. Within this segment of the population, barriers
to adoption of new technology may still imply significant selectivity issues and
limit study generalizability, as recently pointed out by Remillard et al. (2014). The
extent to which Internet data are comparable to data collected with more traditional
interview modes is then of particular scientific interest in research concerned with
this subpopulation.
with a high school degree, zip codes with a relatively high proportion of females
with a high school degree receive a higher probability of being sampled. The SIS
is implemented iteratively. That is, after selecting a zip code, the distributions of
demographics in the UAS are updated according to the expected contribution of this
zip code toward the panel’s representativeness; updated measures of desirability
are computed and new sampling probabilities for all other zip codes are defined.
This procedure provides a list of zip codes to be sampled. For each zip code in this
list, addresses are then drawn in a simple random sample from the USPS database.
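The iterative logic can be sketched schematically. This is a stylized illustration, not the actual UAS algorithm: the desirability measure used here (the projection of each zip code's demographic mix onto the panel's current shortfall relative to the target) and all variable names are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_zip_codes(zip_shares, target, n_draws):
    """Iteratively draw zip codes, favoring those that reduce the gap
    between the panel's current demographic mix and the target mix.

    zip_shares: (n_zips, k) array of demographic shares for each zip code.
    target:     (k,) array of target population shares.
    """
    panel = np.zeros_like(target)          # running demographic totals of the panel
    drawn, available = [], list(range(len(zip_shares)))
    for _ in range(n_draws):
        mix = panel / panel.sum() if panel.sum() > 0 else panel
        shortfall = np.clip(target - mix, 0, None)     # under-represented groups
        # Hypothetical desirability: how much a zip code helps the shortfall.
        desirability = zip_shares[available] @ shortfall + 1e-9
        p = desirability / desirability.sum()
        pick = rng.choice(len(available), p=p)
        zi = available.pop(pick)                       # drawn without replacement
        drawn.append(zi)
        panel = panel + zip_shares[zi]                 # update before next draw
    return drawn

zip_shares = rng.dirichlet(np.ones(3), size=50)        # 50 zip codes, 3 groups
target = np.array([0.5, 0.3, 0.2])
chosen = select_zip_codes(zip_shares, target, n_draws=10)
```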
In the UAS, sample weights are survey-specific. They are provided with each
UAS survey and are meant to make each survey data set representative of the
reference US population with respect to a predefined set of sociodemographic
variables. Sample weights are constructed in two steps. In a first step, a base weight
is created to account for unequal probabilities of sampling zip codes produced
by the SIS algorithm and to reflect the probability of a household being sampled,
conditional on its zip code being sampled. In a second step, final post-stratification
weights are generated to correct for differential nonresponse rates and to bring the
final survey sample in line with the reference population as far as the distribution
of key variables of interest is concerned.
More precisely, to compute the base weight, the unit of analysis is a zip code. A
logit model is estimated for the probability that a zip code is sampled as a function
of its characteristics, namely census region, urbanicity and population size, as
well as its sex, race, age, marital status and education composition. Estimation is
carried out on an ACS file that contains five-year average characteristics at the zip
code level, with urbanicity derived from the 2010 Urban Area to ZIP Code Tabulation
Area Relationship File of the US Census Bureau and merged to this file. The outcome
of this logit model is an estimate of the marginal probability of a zip code being
sampled, which, because of the implementation of the SIS algorithm, is not known
ex ante. Denote by w1b the inverse of the logit-estimated probability of sampling
each zip code.
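The first-step construction can be sketched as follows. The data are simulated and the Newton-Raphson fitter is a generic stand-in for whatever logit routine was actually used; the covariates are hypothetical placeholders for the zip-code characteristics named above.

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Newton-Raphson fit of a logit model: P(sampled) = 1/(1 + exp(-X @ b))."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)                      # logit weights for the Hessian
        grad = X.T @ (y - p)
        hess = (X * W[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(2)
n = 2000
# Hypothetical zip-code characteristics (e.g., urbanicity, share female, log pop.)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([-1.0, 0.8, -0.5, 0.3])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logit(X, y)
p_hat = 1 / (1 + np.exp(-X @ beta_hat))
w1b = 1.0 / p_hat[y == 1]    # inverse estimated sampling probability of sampled zips
```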
Next, for each sampled zip code, the ratio of the number of households in the
zip code to the number of sampled households within the zip code, denoted by w2b
is computed. For the first recruitment batch, which is a simple random sample of
addresses from the US population and does not use the SIS algorithm, it is assumed
(without loss of generality) that w1b = w2b = 1 instead.
The base weight is a zip code-level weight defined as

wb = a · w1b · w2b,

where a is a correction factor such that the sum of the base weights is equal to the
number of all selected households (if all of them respond). This number is equal
to the size of the first recruitment batch (10,000) and to the number of sampled zip
10 MARCO ANGRISANI ET AL.
codes times 40 (the number of sampled households within each drawn zip code) for
all subsequent recruitment batches. Hence, the correction factors take two values,
one for the first recruitment batch and one for all subsequent recruitment batches.
UAS members are assigned a base weight, computed as described above,
depending on the zip code where they reside at the time of recruitment.
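A minimal sketch of this construction, assuming the base weight multiplies the two first-step factors w1b and w2b and that the correction factor a simply rescales their sum to the number of selected households; the numbers are made up for illustration.

```python
import numpy as np

def base_weights(w1b, w2b, n_selected):
    """Zip-code-level base weight: product of the two first-step factors,
    rescaled by a single correction factor `a` so that the weights sum to
    the number of selected households."""
    raw = np.asarray(w1b) * np.asarray(w2b)
    a = n_selected / raw.sum()               # correction factor
    return a * raw

w1b = np.array([2.5, 1.8, 4.0, 3.2])         # inverse zip sampling probabilities
w2b = np.array([120.0, 95.0, 60.0, 80.0])    # households / sampled households
wb = base_weights(w1b, w2b, n_selected=160)  # e.g., 4 zips x 40 households each
```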
The post-stratification weights in the second step are generated by a rak-
ing algorithm that, starting from the base weight, compares, iteratively adjusts,
and eventually matches relative frequencies in the target population with relative
weighted frequencies in the survey sample for the following one- and two-way
marginal distributions: race, gender × age, gender × education, household size ×
total household income, census regions and urbanicity. The benchmark distribu-
tions against which UAS surveys are weighted are derived from the CPS Annual
Social and Economic Supplement administered in March of each year.
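Raking is iterative proportional fitting; the paragraph above can be sketched generically with two hypothetical margins (say, sex and education) rather than the full set of one- and two-way margins the UAS uses.

```python
import numpy as np

def rake(weights, groups, targets, iters=50):
    """Iterative proportional fitting: repeatedly rescale weights so that
    weighted category shares match each set of marginal targets in turn.

    groups:  list of integer arrays, one per margin (category of each unit).
    targets: list of arrays of target shares, one per margin.
    """
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(iters):
        for g, t in zip(groups, targets):
            shares = np.bincount(g, weights=w, minlength=len(t)) / w.sum()
            w *= (t / shares)[g]     # scale every category toward its target
    return w

rng = np.random.default_rng(3)
n = 1000
sex = rng.integers(0, 2, n)          # hypothetical margin 1
educ = rng.integers(0, 3, n)         # hypothetical margin 2
w = rake(np.ones(n), [sex, educ],
         [np.array([0.45, 0.55]), np.array([0.5, 0.3, 0.2])])
```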
Post-stratification weights are trimmed to limit variability and improve the effi-
ciency of estimators using the weights. This is performed using the general weight
trimming and redistribution procedure described by Valliant, Dever, and Kreuter
(2013). More precisely, indicating by wi,raking , the raking weight for respondent i
and with w raking the sample average of raking weights within the survey sample;
the procedure involves
(1) Setting the lower and upper bounds on weights equal to L = 0.25w raking and
U = 4w raking , respectively6 ;
(2) Resetting any weights smaller than the lower bound to L and any weights
greater than the upper bound to U :
⎧
⎨L wi,raking ≤ L
wi,trim = wi,raking L < wi,raking < U
⎩
U wi,raking ≥ U
c
(3) Computing the amount of weight lost by trimming as wlost = N i=1 (wi,raking −
wi,trim ) and distributing it evenly among the respondents whose weights are
not trimmed.
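The three steps can be written directly. This is a sketch with made-up weights; note that the lost weight can be negative when the lower bound binds more than the upper one.

```python
import numpy as np

def trim_weights(w):
    """One trimming pass: cap weights at 0.25x and 4x the mean and spread
    the total weight gained or lost by capping evenly over the respondents
    whose weights were not trimmed, so the overall sum is preserved."""
    w = np.asarray(w, dtype=float)
    L, U = 0.25 * w.mean(), 4.0 * w.mean()   # step (1): bounds
    trimmed = np.clip(w, L, U)               # step (2): reset to L or U
    lost = (w - trimmed).sum()               # step (3): weight lost by trimming
    free = (w > L) & (w < U)                 # respondents left untrimmed
    trimmed[free] += lost / free.sum()       # redistribute evenly
    return trimmed

w = np.array([0.1, 1, 1, 1, 1, 1, 1, 1, 1, 12.0])  # mean 2.01: L=0.5025, U=8.04
wt = trim_weights(w)                                # sum preserved at 20.1
```

Because the redistribution can push weights past the bounds or off the raked margins, this single pass is not the end of the story, which is why raking and trimming are iterated as described next.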
While raking weights can match population distributions of selected variables,
trimmed weights typically do not. Therefore, the raking algorithm and the trimming
procedure are iterated until post-stratification weights are obtained that respect the
weight bounds and align sample and population distributions of selected variables.
The final post-stratification weight for each survey respondent, wi,post , is the
weight generated by applying the raking/trimming procedure just described to the
base weight.
It should be noted that in the UAS weighting procedure, there is no explicit
nonresponse adjustment to the base weights. Rather, it is the post-stratification
factor that is meant to correct for differential nonresponse across survey invitees.
A similar approach is adopted by the HRS. In the HRS, post-stratification of base
weights is performed in the first wave to adjust for nonparticipation in the study
and create a “baseline weight.” In subsequent waves, a post-stratification factor
applied to baseline weights corrects for wave-specific nonresponse.
Moreover, while the UAS uses the CPS to establish population benchmarks for
post-stratification, the HRS considered in this study relies on the ACS for post-
stratification. Population controls for CPS weights are derived from the census and
the ACS. For the purposes of our exercise, we will consider weighted CPS measures
as population targets to which UAS and HRS survey outcomes are compared.
Owing to the CPS’s close correspondence with the ACS, we do not expect that this
should particularly favor one survey over the other.
Table 1 shows the distributions of basic demographic variables for the unweighted
and weighted HRS and UAS samples. The first column reports the target distribu-
tions in the US population of individuals aged 55 and older. These are computed
using the 2015 CPS and its provided sample weights. As mentioned above, while
the UAS relies on the CPS to obtain population benchmarks for post-stratification,
the HRS uses the ACS. Yet, since the CPS itself weights to match the census and
the ACS, it is plausible to assume that both the UAS and HRS are weighted to align
their samples to essentially the same population. The choice of referring to the year
2015 for population benchmarks is due to the fact that our comparison exercise
uses data from the 2014 HRS wave and from the first HRS wave in the UAS, which,
Table 1. Distributions of basic demographic variables. Columns: CPS target,
HRS unweighted, HRS weighted, UAS unweighted, UAS weighted.

Gender
Male 0.462 0.425 0.461 0.491 0.462
Female 0.538 0.575 0.539 0.509 0.538
Mean abs. diff – 0.036 0.000 0.030 0.000
Race/Ethnicity
White 0.751 0.641 0.777 0.835 0.751
Black 0.098 0.191 0.100 0.067 0.098
Other 0.060 0.032 0.035 0.060 0.060
Hispanic 0.090 0.136 0.088 0.038 0.090
Mean abs. diff. – 0.069 0.013 0.042 0.000
Age
55–64 0.466 0.405 0.467 0.571 0.466
65–74 0.313 0.281 0.309 0.327 0.313
75–84 0.156 0.236 0.161 0.087 0.156
85+ 0.065 0.078 0.064 0.015 0.065
Mean abs. diff. – 0.046 0.003 0.059 0.000
Education
HS or less 0.461 0.521 0.453 0.267 0.252
Some college 0.163 0.191 0.194 0.255 0.240
Assoc. coll. degree 0.088 0.060 0.067 0.131 0.107
Bachelor 0.171 0.140 0.171 0.191 0.232
Postgrad 0.117 0.089 0.115 0.156 0.169
Mean abs. diff. – 0.035 0.013 0.078 0.084
Household income
<$30k 0.306 0.388 0.313 0.287 0.268
[$30k, $60k) 0.301 0.262 0.248 0.299 0.284
[$60k, $100k) 0.205 0.169 0.189 0.243 0.255
$100k+ 0.189 0.181 0.250 0.171 0.194
Mean abs. diff. – 0.041 0.034 0.019 0.027
although based on the 2014 HRS questionnaire, was completed between the years
2015 and 2017.
It is worth restating that the UAS final weights allow sample distributions
to match the population distributions of gender, race/ethnicity, age, education,
household income, household composition and location (i.e., census region and
urbanicity). The HRS adopts a more parsimonious model, where final weights align
the sample to the population along the dimensions of gender, age of respondent
Table 2. Home ownership.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
Own 0.803 0.790 0.803 0.779 0.805 0.760 0.760 0.770 0.758 0.784 0.775
Does not own 0.197 0.210 0.197 0.221 0.195 0.240 0.240 0.230 0.242 0.216 0.225
Mean abs. diff. – 0.013 0.000 0.025 0.002 0.043 0.043 0.033 0.045 0.019 0.028
Notes: wgh0, default UAS weights using gender, race, age, education, income, household size, census
region, and urbanicity; wgh1, as wgh0 with finer age brackets; wgh2, as wgh0 without education;
wgh3, as wgh0 without income; wgh4, as wgh0 without education and income; wgh5, as wgh0 without
census region and urbanicity.
The first comparison exercise concerns home ownership. This is a rather objec-
tive measure, and as a result, we expect it to be relatively more affected by
coverage/representativeness biases than by interview mode. In the population of
adults aged 55 and older, 80% are home owners and 20% are renters. Within the
HRS, phone interviewees are more likely to own a home than in-person intervie-
wees by a statistically significant margin (p-value: 0.006). This difference is not
statistically significant any longer (p-value: 0.106) when we limit the sample to
respondents younger than 80, among whom mode assignment is random. Thus,
there seems to be no evidence of a mode effect for home ownership: the observed
differences between in-person and phone interviewees likely stem from the
different age composition of these two groups.
In the UAS, the unweighted home ownership rate is 80% and ranges from 76% to
78% when weights are applied. When post-stratification is not based on education
(wgh2 and wgh4), the weighted home ownership rate is closer to its population
benchmark as well as to the population-level figures inferred from the HRS.
Across the various weighting schemes, the mean absolute difference between
the UAS and the CPS ranges from 1.9 to 4.5 percentage points, while the
unweighted mean is right on the mark (mean absolute difference equal to 0.2
percentage points). When comparing the HRS (pooling the phone and in-person
samples) and the CPS, the mean absolute difference is 1.3 percentage points.
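The comparison statistic used throughout these tables is straightforward to compute. A sketch with invented survey data; the benchmark shares below are the home-ownership figures quoted above (19.7% renters, 80.3% owners).

```python
import numpy as np

def mean_abs_diff(values, weights, benchmark):
    """Weighted category shares and their mean absolute difference from a
    benchmark distribution (the comparison statistic used in the tables)."""
    shares = np.bincount(values, weights=weights, minlength=len(benchmark))
    shares = shares / shares.sum()
    return shares, np.abs(shares - np.asarray(benchmark)).mean()

# Hypothetical home-ownership indicator (1 = owns) and survey weights.
own = np.array([1, 1, 0, 1, 0, 1, 1, 1])
w = np.array([1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.3, 0.7])
benchmark = np.array([0.197, 0.803])     # CPS shares: renters, owners
shares, mad = mean_abs_diff(own, w, benchmark)
```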
In Table 3, we focus on another arguably objective measure, namely health
insurance coverage. As can be seen, there exist some differences within the HRS.
The fraction of insured individuals is 94.2% among those interviewed by phone
and 95.7% among those interviewed in person. While this difference is statisti-
cally significant for the entire sample (p-value: 0.002), it is not among respondents
younger than 80 (p-value: 0.172). Again, this pattern suggests differences in age
Table 3. Health insurance coverage.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
No 0.052 0.050 0.058 0.043 0.049 0.050 0.050 0.045 0.051 0.042 0.046
Yes 0.948 0.950 0.942 0.957 0.951 0.950 0.950 0.955 0.949 0.958 0.954
Mean abs. diff. – 0.002 0.006 0.009 0.003 0.002 0.002 0.007 0.001 0.010 0.006
Notes: see Table 2.
Table 4. Retirement status.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
No 0.529 0.535 0.584 0.492 0.597 0.561 0.551 0.557 0.557 0.554 0.557
Yes 0.471 0.465 0.416 0.508 0.403 0.439 0.449 0.443 0.443 0.446 0.443
Mean abs. diff. – 0.007 0.055 0.037 0.068 0.032 0.022 0.028 0.028 0.025 0.028
Notes: see Table 2.
composition between the two samples, as all respondents 80 and over are eligi-
ble for Medicare. For the UAS, the unweighted fraction of insured individuals
is 95%, as in the reference population, and does not change appreciably when
weights are applied. It increases slightly to 95.8% when the weighting scheme does
not use education and income (wgh4), but the difference with the CPS remains
small and under no weighting scheme is the difference between the UAS and CPS
distributions statistically significant.
Questions about retirement status are bound to be subject to personal interpre-
tation and answers to them may also be affected by social desirability. In Table
4, we observe apparently sizeable, but statistically insignificant differences in the
proportion of retirees across surveys. The unweighted UAS proportion of retirees
is significantly different from the CPS and HRS proportions (p-value: 0.000 in
both cases), but tests for differences in the proportion retired between all pairs of
the CPS, full HRS, and UAS weightings have p-values between 0.1 and 0.4. It
should be noted that, for this specific outcome, differences may also stem from
the type of questions administered to respondents to elicit labor force status and
the type of recoding used by each study. Specifically, we rely on the “major labor
force status” recode of the CPS, which is based on answers to a series of labor
force items in the main questionnaire. For the HRS and the UAS, we adopt the
RAND-HRS indicator of retirement, which is based on a question where respon-
dents can select more than one employment status at once (e.g., working part-time
and retired).
Not surprisingly, the fraction of retired individuals is higher among HRS respon-
dents interviewed in person than among those interviewed by phone. This plausibly
reflects the fact that the former group is, on average, four years older (average age
is 67 for the phone interview group and 71 for the face-to-face interview group).
Overall, even after weighting, the HRS seems to somewhat underrepresent retirees,
with a proportion of 46.5% compared to 47.1% in the reference population. In con-
trast, the unweighted proportion of retired individuals in the UAS is substantially
(and statistically significantly) lower than in the CPS, at 40.3%. Such a difference
is likely driven by representativeness/selection bias. Individuals who
answer online surveys tend to be younger, better educated, and more attached to
the labor force. When default weights are applied, the fraction of retirees in the
UAS increases by 3 percentage points. Interestingly, the UAS weighting with the
closest-to-target proportion of retired individuals (44.9%) and the lowest mean
absolute difference with the CPS (0.022) is achieved when finer age brackets are
used (wgh1), which better correct for the underrepresentation of seniors. Yet,
differences across weighting schemes are rather modest.
In Table 5, we compare the distribution of individual earnings across surveys.
The proportion of individuals with earnings below $25,000 per year is larger in
the HRS than in the CPS by a statistically significant margin (p-value: 0.001)
and apparently more sizeable among those who are administered a face-to-face
interview. Conversely, the fraction of high earners (above $75,000) in the HRS is
0.4 percentage points higher than in the CPS, but this difference is not significant
(p-value: 0.261).
Table 5. Individual earnings.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
[0–$25k) 0.726 0.745 0.715 0.771 0.695 0.747 0.756 0.736 0.754 0.718 0.746
[$25k–$50k) 0.118 0.098 0.105 0.092 0.125 0.109 0.108 0.108 0.110 0.113 0.111
[$50k–$75k) 0.070 0.067 0.077 0.058 0.070 0.051 0.049 0.053 0.051 0.058 0.052
[$75k–$100k) 0.036 0.035 0.037 0.033 0.045 0.043 0.040 0.046 0.041 0.049 0.040
$100k+ 0.050 0.055 0.065 0.046 0.064 0.049 0.047 0.057 0.045 0.062 0.052
Mean abs. diff. – 0.010 0.010 0.018 0.012 0.011 0.014 0.011 0.013 0.010 0.011
Notes: see Table 2.
The UAS slightly underrepresents low earners, and the observed difference with
the CPS is statistically significant (p-value: 0.006). When weights are applied, the
fraction of individuals with earnings below $25,000 is closer to its population
benchmark, with borderline-significant differences from the CPS for only some
weights (p-values range from 0.051 to 0.605). The fraction of workers with
earnings above $75,000 is 2.4 percentage points larger in the UAS relative to the
CPS, and the difference is statistically significant (p-value: 0.002). With weights,
this gap varies between virtually 0 and 2.5 percentage points and only the differ-
ence using wgh4 is statistically significant at the 5% level (p-value 0.013). When
comparing various sets of weights in the UAS, we observe only minor differences
among them in terms of mean absolute difference from the reference population.
Only slightly larger deviations are shown when the set of raking factors features
finer age brackets (wgh1) and does not include household income (wgh3). In gen-
eral, the earnings distribution in both the UAS and the HRS matches the one in the
CPS very closely. The mean absolute difference is about 1 percentage point in the
HRS and in the UAS, both before and after weighting.
Next, we examine two subjective outcomes, that is, self-reported health and life
satisfaction. For both of them, we expect mode effects to be more apparent (we
analyze mode effects for these two measures in more detail in the next section).
Table 6 reports the distribution of self-reported health. All three surveys ask their
respondents to rate their health on a five-point scale, “excellent” (1), “very good”
(2), “good” (3), “fair” (4) and “poor” (5). As mentioned above, though, while in
the UAS and in the HRS all respondents answer about their own health, in the CPS
the household respondent reports about his/her own health as well as that of other
household members. In Table 6, we rely on all household members’ health status
reports in the CPS. The distribution of health status in the CPS remains virtually
Table 6. Self-reported health.
CPS Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
Excellent 0.151 0.094 0.097 0.091 0.108 0.102 0.098 0.111 0.099 0.116 0.102
Very good 0.274 0.321 0.328 0.315 0.354 0.327 0.334 0.353 0.328 0.363 0.333
Good 0.329 0.328 0.328 0.328 0.327 0.339 0.335 0.337 0.342 0.335 0.345
Fair 0.171 0.191 0.183 0.198 0.171 0.187 0.188 0.157 0.186 0.148 0.177
Poor 0.076 0.067 0.066 0.068 0.041 0.045 0.045 0.043 0.045 0.039 0.043
Mean abs. diff. – 0.027 0.026 0.027 0.032 0.032 0.034 0.035 0.033 0.038 0.033
Notes: see Table 2.
unchanged when we only use health status referring to the household respondent
(see Table A1 in Appendix).
The first thing to notice is the absence of any difference between the measures
elicited by the HRS via telephone and in-person interview. The only sizeable and
marginally significant deviation is observed for the fraction of individuals reporting
fair health (p-value: 0.041). No deviation is remotely significant when limiting the
sample to respondents younger than 80, though. Mode effects, then, do not seem
to be present for this outcome.
It is common practice in the health economics literature to classify individuals
into two groups, one in poor and fair health and another in good, very good and
excellent health. Compared to the CPS, the HRS somewhat overrepresents indi-
viduals in fair and poor health. This fraction is 1.1 percentage points higher in the
HRS and the difference is statistically significant (p-value: 0.029). The unweighted
UAS distribution shows an underrepresentation of individuals in excellent health,
an overrepresentation of those in very good health, and an underrepresentation of
those in poor health relative to the CPS. These deviations from population bench-
marks are not corrected by sample weights, regardless of the post-stratification
scheme adopted. When using the aforementioned binary health indicator, the
unweighted proportion of individuals in fair and poor health in the UAS falls
short of the CPS benchmark by 3.4 percentage points. This difference is statis-
tically significant (p-value: 0.001). The gap is reduced to 1.4 percentage points
(not statistically significant, with p-value: 0.356) when default UAS weights are
applied. Overall, the HRS and UAS perform similarly in terms of their ability to
match the population distribution of self-reported health after weighting. Specifically,
the mean absolute difference relative to the CPS is 0.027 for the HRS and 0.032
for the UAS with default weights (wgh0).
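The two summary statistics used in this comparison, the binary fair/poor indicator and the mean absolute difference across the five categories, can be computed directly from the proportions in Table 6. A minimal sketch in Python, using the CPS benchmark and the HRS full-sample column (variable names are ours):

```python
# Category proportions from Table 6 (CPS benchmark and HRS full sample).
cps = {"excellent": 0.151, "very good": 0.274, "good": 0.329,
       "fair": 0.171, "poor": 0.076}
hrs = {"excellent": 0.094, "very good": 0.321, "good": 0.328,
       "fair": 0.191, "poor": 0.067}

# Binary indicator: gap in the share of respondents in fair or poor health.
gap = (hrs["fair"] + hrs["poor"]) - (cps["fair"] + cps["poor"])

# Mean absolute difference across the five response categories,
# the fit measure reported in the last row of Table 6.
mean_abs_diff = sum(abs(hrs[k] - cps[k]) for k in cps) / len(cps)

print(round(gap, 3), round(mean_abs_diff, 3))  # 0.011 0.027
```

The same calculation with the default-weight UAS column (wgh0) gives the 0.032 reported in the text.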
The HRS and UAS respondents are asked about their life satisfaction. Answers
are on a five-point scale, from “completely satisfied” (1) to “not at all satisfied”
(5). There is no analogous question in the CPS instrument, so we do not have a
population benchmark for this outcome. The fraction of HRS respondents reporting
complete satisfaction is 2.8 percentage points higher for the in-person than the
phone interview and the difference is significant (p-value: 0.001). In contrast,
those interviewed by phone tend to express more moderate judgments. The fraction
of those stating that they are somewhat satisfied with their life is 2.1 percentage
points higher when the questionnaire is administered over the phone and, again,
the difference is statistically significant (p-value: 0.027). Observed differences in
the fractions of individuals who are not very and not at all satisfied with their
life are very small in magnitude and not statistically significant. When restricting
attention to respondents younger than 80, the only significant difference is for the
somewhat satisfied group (p-value: 0.04). We will delve more into potential mode
effects for self-reported life satisfaction in the next section and shed some light
Can Internet Match High-quality Traditional Surveys? 21
on the extent to which the different age composition of these two groups of HRS
respondents may contribute to these results.

Table 7. Life Satisfaction.
HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
Completely 0.220 0.205 0.232 0.136 0.151 0.151 0.146 0.148 0.145 0.155
Very 0.464 0.466 0.462 0.491 0.473 0.475 0.489 0.469 0.493 0.483
Somewhat 0.265 0.277 0.256 0.304 0.301 0.300 0.300 0.307 0.296 0.292
Not very 0.040 0.041 0.039 0.059 0.061 0.060 0.058 0.063 0.059 0.057
Not at all 0.012 0.012 0.011 0.010 0.014 0.014 0.008 0.013 0.007 0.012
See Table 2.
In the UAS, unweighted and weighted life-satisfaction distributions are remark-
ably similar, regardless of which set of post-stratification weights is considered.
UAS respondents are significantly less likely to report complete life satisfaction
(p-values are 0.000 for all weighting schemes) and more inclined to state that they
are somewhat (p-values range from 0.001 to 0.084 across different weights) or
not very (p-values from 0.001 to 0.022) satisfied with their life. Based on this,
it is not surprising to see that the UAS distribution is relatively closer to the one
obtained from the HRS phone interview. Even so, differences between these two
distributions are very pronounced. We construct a binary variable taking the value
1 for completely or very satisfied and 0 otherwise. With default weights (wgh0),
we estimate that the fraction of individuals with a positive outlook on their life
(i.e., this indicator equal to 1) is 6 percentage points lower in the UAS than in the
HRS and find that this difference is statistically significant (p-value: 0.002).
Compared to face-to-face and online surveys, phone interviews lack visual aids.
This is an important characteristic to account for when studying response qual-
ity across different interview modes. When respondents are asked to pick one
category from a list – for example, in rating their health and life satisfaction on
five-point scales like those adopted by the HRS instrument – there are two well-
known response effects: a primacy effect (a tendency to pick the first response
category) and a recency effect (a tendency to pick the last response category).
Importantly, primacy and recency effects show age gradients. As discussed by
Knauper (1999), older respondents are more likely to choose the last category
[Figure: Predicted Mean Health Status by Age and Survey Mode. Panels: UAS, HRS-Phone, HRS-in-Person, CPS; x-axis: age groups 55–79; y-axis: 3–3.5.]
(recency), while younger respondents are more likely to choose the first category
(primacy). A possible explanation for this phenomenon is the decline of
memory as people age.
In view of this and the order of response categories in the health status and
life-satisfaction questions (whose distributions are reported in Tables 6 and 7,
respectively), we may expect a sharper decline (or a less steep increase) in health
and life satisfaction with age in auditory mode (i.e., by phone) than over the
Internet or in person (since in the latter case the HRS uses show cards).
carries out interviews both by phone and in person. The preferred mode of interview
is face-to-face for the first and last months of a household’s time in the rotating
panel, while the interview mode defaults to telephone during the intervening three
months. The CPS data do not include a variable indicating whether an interview
was conducted in person or over the phone. There is also no indication that show
cards are used during face-to-face interviews, making these more akin to an
HRS phone interview than to an in-person interview.
A difference between the UAS and the other surveys is the absence of an inter-
viewer. Interactions between interviewers and respondents may generate social
desirability effects. As a result, the mere presence of an interviewer would
most likely lead to higher levels of self-reported health and life satisfaction in face-
to-face and phone interviews, compared to Internet surveys (Chang and Krosnick,
2009).
[Fig. 2 plot area: panels (a) and (b), each showing UAS, HRS-Phone, HRS-in-Person, and CPS by age groups 55–79; y-axis: 0–0.2.]
Fig. 2. (a) Predicted Probability of Choosing the First Option (Excellent Health) by Age
and Survey Mode and (b) Predicted Probability of Choosing the Last Option (Poor Health)
by Age and Survey Mode.
[Figure: Predicted Mean Life Satisfaction by Age and Survey Mode. Panels: UAS, HRS-Phone, HRS-in-Person; x-axis: age groups 55–79; y-axis: 3.5–4.5.]
[Fig. 4 plot area: panels (a) (y-axis: 0.1–0.4) and (b) (y-axis: 0–0.05), each showing UAS, HRS-Phone, and HRS-in-Person by age groups 55–79.]
Fig. 4. (a) Predicted Probability of Choosing the First Option (Completely Satisfied) by
Age and Survey Mode and (b) Predicted Probability of Choosing the Last Option (Not at
All Satisfied) by Age and Survey Mode.
6. CONCLUSIONS
We have documented some clear demographic differences between the UAS and
HRS. Compared to the US population aged 55 and older, the UAS has relatively
fewer respondents at older ages, while the HRS overrepresents older age groups.
The UAS underrepresents individuals with high school or less, while the HRS
underrepresents the higher education strata. In general, sample weights correct for
these discrepancies and allow one to satisfactorily match population benchmarks as
far as key sociodemographic variables are concerned. For instance, acknowledging
the significant underrepresentation of individuals with high school or less, the
default UAS weights are post-stratified on the interaction of gender and education,
thereby aligning sample distributions of education by gender with their population
counterparts.
Comparison of a variety of survey outcomes with population targets taken from
the CPS shows a strikingly good fit for both the HRS and the UAS. Outcome
distributions in the HRS are marginally closer to those in the CPS than outcome
distributions in the UAS. These patterns arise for the most part regardless of which
variables are used to construct post-stratification weights in the UAS, confirming
the robustness of these results.
We find little evidence of mode effects when comparing the subjective measures
of self-reported health and life satisfaction across interview modes. Specifically,
we do not observe very clear primacy or recency effects for either health or life sat-
isfaction. We do find a significant social desirability effect, driven by the presence
of an interviewer, as far as life satisfaction is concerned.
While relatively simple and merely descriptive, the analyses in this study offer a
comprehensive comparison of surveys administered by different interview modes
both in terms of sample representativeness and data quality. They also provide
rather consistent empirical evidence which leads us to answer the question asked
in the title of this paper with a tentative “Yes.”
NOTES
1. The fact sheet can be found at http://www.pewinternet.org/fact-sheet/internet-
broadband/.
2. Alattar, Messel and Rogofsky (2018) offer a comprehensive overview of the UAS.
3. Details on UAS sample recruitment can be found at https://uasdata.usc.edu/index.php.
4. The SIS algorithm is implemented to recruit all UAS respondents, except those
belonging to two special purpose samples, namely Native Americans and Los Angeles
County residents with young children, for whom different sampling procedures are adopted.
Because of their specific sampling procedures, these two groups receive zero weight.
5. The variable urbanicity takes three mutually exclusive values indicating whether the
area of residence of a respondent is rural, mixed, or urban.
6. While these values are arbitrary, they are in line with those described in the literature
and followed by other surveys (Battaglia et al., 2009).
7. A maximum of 50 iterations are allowed. If an exact alignment respecting the weight
bounds cannot be achieved, the trimmed weights will ensure the exact match between survey
and population relative frequencies, but may take values outside the interval defined by the
pre-specified lower and upper bounds.
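As an illustration of the raking-with-bounds procedure this note describes, the sketch below implements iterative proportional fitting with weights clipped to pre-specified bounds after each pass. The margins, bounds, and variable names are hypothetical assumptions for the example, not the actual UAS targets:

```python
import numpy as np

def rake(X, targets, max_iter=50, lo=0.25, hi=4.0):
    """Adjust weights so the weighted margin of each binary-coded variable
    matches its population target, clipping weights to [lo, hi] after each
    pass (the 50-iteration cap mirrors Note 7; the bounds here are made up)."""
    w = np.ones(X.shape[0])
    for _ in range(max_iter):
        for j, t in enumerate(targets):      # one pass over all margins
            cur = w[X[:, j] == 1].sum() / w.sum()
            w[X[:, j] == 1] *= t / cur       # scale the cell to hit the target
            w = np.clip(w, lo, hi)           # enforce the weight bounds
    return w / w.mean()                      # normalize; margins are unchanged

# Toy example: two binary margins (say, female and college) with targets.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))
w = rake(X, targets=[0.52, 0.30])
```

Note that this sketch prioritizes the bounds, leaving a residual gap when a target is unreachable; the procedure in Note 7 instead prioritizes an exact match, letting trimmed weights fall outside the bounds.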
8. A complete description of the UAS weighting procedure can be found at
https://uasdata.usc.edu/addons/documentation/UAS%20Weighting%20Procedures.pdf.
9. Referring to the years 2014 and 2016 for population benchmarks does not change
the results of the analysis.
10. We are not aware of significant differences in survey non-response rates between
phone and in-person interview mode in the HRS.
11. In the HRS and UAS, respondents can report whether they are owners, renters, or
“other.” The latter option is not available in the CPS. The proportion of respondents falling
in the “other” category is minimal in both HRS and UAS and should not appreciably affect
the comparison.
12. The graphs based on regressions without controls look very similar and are not
reported here.
REFERENCES
Alattar, L., Messel, M., & Rogofsky, D. (2018). An introduction to the Understanding America Study
Internet panel. Social Security Bulletin, 78(2), 13–28.
Angrisani, M., Kapteyn, A., Meijer, E., & Saw, H. W. (2014). Recruiting an additional sample for an
existing panel. Paper presented at the Panel Survey Methods Workshop, Ann Arbor, MI.
Battaglia, M. P., Izrael, D., Hoaglin, D. C., & Frankel, M. R. (2009). Practical considerations in raking
survey data. Survey Practice, 2009 (June). Retrieved from http://surveypractice.org/2009/06/
29/raking-survey-data/.
Blumberg, S. J., Luke, J. V., & Cynamon, M. L. (2004). Has cord-cutting cut into random-digit-
dialed health surveys? The prevalence and impact of wireless substitution. In S. B. Cohen &
J. M. Lepkowski (Eds.), Proceedings of the 8th conference on health survey research methods.
National Center for Health Statistics, Hyattsville, MD.
Bugliari, D. et al. (2016). RAND HRS data documentation, version P. RAND Corporation, Center
for the Study of Aging, Santa Monica, CA. Retrieved from http://hrsonline.isr.umich.edu/
modules/meta/rand/index.html.
Chang, L., & Krosnick, J. A. (2009). National surveys via RDD telephone interviewing versus the
Internet: Comparing sample representativeness and response quality. Public Opinion Quarterly,
73(4), 641–678.
Couper, M. (2011). The future of modes of data collection. Public Opinion Quarterly, 75(5), 889–908.
Groves, R. M., & Heeringa, S. G. (2006). Responsive design for household surveys: tools for actively
controlling survey errors and costs. Journal of the Royal Statistical Society, Series A Statistics
in Society, 169(3), 439–457.
Hays, R. D., Liu, H., & Kapteyn, A. (2015). Use of Internet panels to conduct surveys. Behavior
Research Methods, 47(3), 685–690.
Knauper, B. (1999). The impact of age and education on response order effects in attitude measurement.
Public Opinion Quarterly, 63(3), 347–370.
Meijer, E. (2014). Effective sample size metric for sequential importance sampling. Mimeo, USC-
CESR, Los Angeles.
Remillard, M. L., Mazor, K. M., Cutrona, S. L., Gurwitz, J. H., & Tjia, J. (2014). Systematic review
of the use of online questionnaires of older adults. Journal of the American Geriatric Society,
62, 696–705.
Rivers, D. (2013). Comment. Journal of Survey Statistics and Methodology, 1, 111–117.
Schonlau, M., van Soest, A., Kapteyn, A., & Couper, M. (2009). Selection bias in Web surveys and
the use of propensity scores. Sociological Methods & Research, 37(3), 291–318.
Schwarz, N., & Sudman, S., (Eds.). (1992). Context effects in social and psychological research.
New York, NY: Springer-Verlag.
Tourangeau, R., Brick, J. M., Lohr, S., & Li, J. (2016). Adaptive and responsive survey design and
assessment. Journal of the Royal Statistical Society, Series A Statistics in Society, 180(1),
203–223.
Valliant, R., Dever, J. A., & Kreuter, F. (2013). Practical tools for designing and weighting survey
samples. New York, NY: Springer.
Wagner, J., West, B. T., Kirgis, N., Lepkowski, J. M., Axinn, W. G., & Kruger Ndiaye, S. (2012).
Use of paradata in a responsive design framework to manage a field data collection. Journal of
Official Statistics, 28(4), 477–499.
APPENDIX
Table A1. Self-reported Health With CPS Limited to HH Respondents.
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
Excellent 0.149 0.094 0.097 0.091 0.108 0.102 0.098 0.111 0.099 0.116 0.102
Very good 0.277 0.321 0.328 0.315 0.354 0.327 0.334 0.353 0.328 0.363 0.333
Good 0.330 0.328 0.328 0.328 0.327 0.339 0.335 0.337 0.342 0.335 0.345
Fair 0.172 0.191 0.183 0.198 0.171 0.187 0.188 0.157 0.186 0.148 0.177
Poor 0.071 0.067 0.066 0.068 0.041 0.045 0.045 0.043 0.045 0.039 0.043
Mean abs. diff. – 0.025 0.024 0.026 0.030 0.030 0.031 0.033 0.031 0.036 0.030
See Table 2
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 0 0 0 0 0 0 0 0 0 0
Proportion missing 0 0 0 0 0 0 0 0 0 0 0
Weighted proportion missing 0 0 0 0 0 0 0 0 0 0 0
Notes: wgh0, default UAS weights using gender, race, age, education, income, household size, census
region, and urbanicity; wgh1, as wgh0 with finer age brackets; wgh2, as wgh0 without education, wgh3,
as wgh0 without income; wgh4, as wgh0 without education and income; wgh5, as wgh0 without census
region and urbanicity.
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 0 0 0 27 27 27 27 27 27 27
Proportion missing 0 0 0 0 0.015 0.015 0.015 0.015 0.015 0.015 0.015
Weighted proportion missing 0 0 0 0 0.015 0.011 0.010 0.009 0.011 0.009 0.010
See Table 2.
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 41 20 21 18 18 18 18 18 18 18
Proportion missing 0 0.002 0.003 0.002 0.010 0.010 0.010 0.010 0.010 0.010 0.010
Weighted proportion missing 0 0.002 0.002 0.002 0.010 0.009 0.009 0.007 0.009 0.007 0.007
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 0 0 0 0 0 0 0 0 0 0
Proportion missing 0 0 0 0 0 0 0 0 0 0 0
Weighted proportion missing 0 0 0 0 0 0 0 0 0 0 0
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 14 6 8 3 3 3 3 3 3 3
Proportion missing 0 0.001 0.001 0.001 0.002 0.002 0.002 0.002 0.002 0.002 0.002
Weighted proportion missing 0 0.001 0.001 0.001 0.002 0.001 0.001 0.001 0.001 0.001 0.001
CPS HRS UAS
Full Phone Person unwgh wgh0 wgh1 wgh2 wgh3 wgh4 wgh5
N 37,795 16,751 7,654 9,097 1,852 1,852 1,852 1,852 1,852 1,852 1,852
# Missing 0 931 706 225 3 3 3 3 3 3 3
Proportion missing 0 0.056 0.092 0.025 0.002 0.002 0.002 0.002 0.002 0.002 0.002
Weighted proportion missing 0 0.050 0.084 0.020 0.002 0.001 0.001 0.001 0.001 0.001 0.001
Additional Figures for Self-reported Health (with the CPS Split by Household
Respondent Status)
[Fig. A1 plot area: five panels by age groups 55–79 (y-axis: 3–3.5), including CPS-HH-Resp and CPS-Non-HH-Resp.]
Fig. A1. Predicted Mean Health Status by Age and Survey Mode, and CPS HH Respondent
Status.
[Fig. A2 plot area: five panels by age groups 55–79 (y-axis: 0–0.2), including CPS-HH-Resp and CPS-Non-HH-Resp.]
Fig. A2. Predicted Probability of Choosing the First Option (Excellent Health) by Age,
Survey Mode, and CPS HH Respondent Status.
[Fig. A3 plot area: five panels by age groups 55–79 (y-axis: 0–0.2), including CPS-HH-Resp and CPS-Non-HH-Resp.]
Fig. A3. Predicted Probability of Choosing the Last Option (Poor Health) by Age, Survey
Mode, and CPS HH Respondent Status.
EFFECTIVENESS OF STRATIFIED
RANDOM SAMPLING FOR PAYMENT
CARD ACCEPTANCE AND USAGE
ABSTRACT
For central banks that study the use of cash, acceptance of card payments
is an important factor. Surveys to measure levels of card acceptance and
the costs of payments can be complicated and expensive. In this paper, we
exploit a novel data set from Hungary to see the effect of stratified random
sampling on estimates of payment card acceptance and usage. Using the
Online Cashier Registry, a database linking the universe of merchant cash
registers in Hungary, we create merchant and transaction level data sets.
We compare county (geographic), industry and store size stratifications to
simulate the usual stratification criteria for merchant surveys and see the
effect on estimates of card acceptance for different sample sizes. Further,
we estimate logistic regression models of card acceptance/usage to see how
stratification biases estimates of key determinants of card acceptance/usage.
36 CHRISTOPHER S. HENRY AND TAMÁS ILYÉS
1. INTRODUCTION
Central banks around the world, as issuers of bank notes, have a keen interest in
understanding the use of cash. While there has undoubtedly been a shift toward
the use of electronic methods of payment at the point-of-sale, cash is still widely
used across many countries, see e.g., Bagnall et al. (2016). In addition, many
countries observe that the demand for cash as measured by the value of bank notes
in circulation has been growing in line with – and in some cases faster than – the rate
of growth of the economy. Of course, different countries have different experiences.
At one end of the spectrum, countries such as Sweden have recently been exploring
whether to issue a central bank digital currency, due to rapidly declining cash demand.
By contrast, Hungary is a particularly cash-intensive country, with over 80% of
transactions by volume conducted in cash.
A key factor influencing use of cash at the point-of-sale is whether or not card
payments – debit and credit – are accepted by the merchant. Payments are a two-
sided market: consumers are more likely to adopt and use cards when acceptance is
widespread; similarly, merchants are more likely to want to accept card payments
when there are many consumers that desire to pay with cards. See, for example,
the discussion in Fung et al. (2017).
Card acceptance also has implications not just for the use of cash but also for
the amount of cash held by consumers. For example, Huynh et al. (2014) show that
cash holdings increase when consumers move from an area of high card acceptance
to an area with low card acceptance. Consumers must hold more cash because of
the higher probability of encountering a situation where cards are not accepted;
holding cash ensures that they can still complete a transaction that would otherwise
not take place.
Finally, card acceptance is intimately related to the cost of payments. A recent
study by the Bank of Canada (Kosse et al., 2017) measured that the cost to
accept various methods of payment amounted to 0.8% of GDP; a similar result
of 1% of GDP was found in a study across 13 Euro area countries including
Hungary (Schmiedel et al., 2013). Accepting card payments comes with fees
such as interchange, terminal fees, etc., which the merchant must trade-off with the
labor costs of processing cash, the opportunity cost of missing a card payment, etc.
Effectiveness of Stratified Random Sampling 37
To measure the level of card acceptance as well as the cost for accepting various
forms of payments, central banks around the world have conducted merchant or
retailer surveys, see Kosse et al. (2017); European Commission (2015); Norges
Bank (2014); Stewart et al. (2014); Jonker (2013); Schmiedel et al. (2013);
Danmarks Nationalbank (2012); Segendorf and Jansson (2012); Arango and
Taylor (2008).
Merchant studies, however, can be difficult and expensive endeavors. Recruit-
ing businesses/merchants to participate is no easy task, and in practice sample sizes
are often low; half of the studies shown in Figure 1 of Kosse et al. (2017) were of
size N = 200 or below. Additional challenges of merchant surveys include coverage
of small and medium-sized businesses, which may not belong to an official registry;
accounting for businesses with franchises or multiple locations; and the high costs
of conducting survey interviews to obtain quality data.
In this paper, we exploit a novel administrative data set from Hungary, which
allows us to validate the approach and results of merchant surveys with respect
to measuring card acceptance. This rich data set is known as the Online Cashier
Registry (OCR) and captures the universe of retail transactions in Hungary via a
linking of cash registers/payment terminals to the centralized tax authority.
Specifically, our analysis first focuses on estimating card acceptance using dif-
ferent stratification variables that are commonly used in practice for merchant
surveys. This allows us to see how stratification impacts point and variance esti-
mates of acceptance and provides guidance on which stratification variables may
be most effective in practice, given the constraint of small sample sizes. Our results
suggest that having full coverage with respect to geography and the size of the store
would produce the most reliable estimates. Further, we quantify the uncertainty
in card acceptance estimates that may be present for particularly small sample
sizes.
Further, we conduct logistic regression analysis on the entire OCR database and
stratified subsamples in order to see the bias in point estimates for key determinants
of card acceptance and card usage. Our models are motivated by the payment
survey literature, and we confirm results from the literature that store size is highly
correlated with card acceptance, and transaction size with card usage.
Our work is situated within an exciting research area that is a nexus between
survey statistics and data science. For example, Rojas et al. (2017) investigate
how various sampling techniques can be used to explore and visualize so-called
“big data” sets. Chu et al. (forthcoming) use the additional structure of
survey weights to aid in estimating functional principal components of a large
and complex price data set used for constructing the consumer price index in
the United Kingdom. In a slightly different vein, Lohr and Raghunathan (2017)
review methods for combining survey data with large administrative data sets,
including record linkage and imputation. They further highlight the opportunity
to use administrative data sets in the survey design stage, as well as for assessing
nonresponse bias and mitigating the need for follow-up. The interplay between
survey statistics and data science will become increasingly relevant against the
background of declining survey response rates as well as competition for sponsor
resources from large administrative data sets, see, for example, the discussion by
Miller (2017).
The paper is organized as follows. In Section 2, we describe the data set and con-
text of the Hungarian payments system. In Section 3, we outline our methodology.
Section 4 presents and discusses results and Section 5 concludes.
2. DATA DESCRIPTION
2.1. Hungarian Payments Landscape
The Hungarian payment system can be considered cash-oriented. The level of cash
in circulation is higher than the European average, and the share of electronic trans-
actions in retail payment situations is fairly low. This notwithstanding, Hungarian
households have good access to electronic infrastructure; 83% of households have
a payment account and 80% have a payment card. Despite a 15 to 20% increase in
electronic payments over the last few years, the vast majority of transactions are
still conducted in cash.
Under Decree No. 2013/48 (XI. 15.) NGM, the Ministry for National Economy
mandated the installation of online cash registers directly linked to the tax authority.
The replacement of cash registers was implemented as part of a gradual process
at the end of 2014; subject to certain conditions, taxpayers were permitted to use
traditional cash registers until January 1, 2015.
The scope of the online cash register system has been expanding signifi-
cantly since the adoption of the decree. Initially, the regulation covered retail
trade turnover primarily; however, from January 1, 2017, its provisions became
applicable to a substantial part of the service sectors as well (e.g., taxi services,
hospitality/catering and automotive repair services).
The OCR database contains data from over 200,000 cash registers, totaling 7
billion transactions. The median transaction was about 1,000 HUF (≈ $4 USD).
The OCR reveals that conventional payment surveys tend to underestimate the
amount of cash payments. For example, a 2014 Hungarian survey estimated that
84% of the volume of payments were conducted using cash, whereas this figure is
almost 90% in the OCR.
For our analysis, we utilize two data sets derived from the OCR: a merchant-level
data set and a transaction-level data set, both covering 2015 and 2016.
Here we describe the key variables used in our analysis. See Table 1 for a
summary.
Table 1. Distribution of Key Variables (N = number of merchants; % = share accepting card payments).

County N % County N %
Mobile shops 5,294 16.0 Jász-Nagykun 4,442 29.0
Bács-Kiskun 30,851 37.1 Komárom-Esztergom 2,518 18.0
Baranya 5,519 26.4 Mozgóbolt (mobile shop) 15,982 26.3
Békés 8,189 18.7 Nógrád 5,746 20.7
Borsod-Abaúj 5,543 17.5 Pest 7,936 14.2
Budapest 8,552 21.7 Somogy 5,634 20.8
Csongrád 6,517 26.2 Szabolcs-Szatmár 3,469 22.8
Fejér 5,548 28.6 Tolna 4,004 21.1
Győr-Moson-Sopron 7,656 24.4 Vas 6,106 28.3
Hajdú-Bihar 7,796 24.0 Veszprém 5,037 25.9
Heves 4,732 25.1

Industry N % Size N %
1 7,562 16.8 Small 65,373 6.4
4 96,152 29.5 Medium 76,312 32.5
5 36,483 18.4 Large 15,386 74.2
6 3,545 27.3
7 3,064 26.8
8 3,067 29.2
9 7,198 19.2

Note: This table shows the distribution of key variables used in our study based on the merchant-level
data set derived from the OCR database. For store size, small businesses are defined as those with 2–15
million HUF in annual turnover, medium are businesses between 15 and 150 million HUF, and large are
businesses with over 150 million HUF.
Note: Industry codes follow the TEAOR08 classification; these codes correspond to European NACE Rev. 2.
3. METHODOLOGY
Our analysis consists of two components which we describe below.
Due to the importance of card acceptance for understanding payment choice and
cash usage, we first use the merchant-level aggregated version of the OCR to
estimate card acceptance. We proceed in the following manner.
First we draw a random stratified sample from the merchant-level data set. From
the sample, we estimate the proportion of businesses accepting card payments. We
also calculate the standard error and a 95% confidence interval. Finally, we repeat
these calculations for 1,000 replications. Estimates are calculated using Stata's
svy command, which accounts for the strata, uses stratum-level inclusion probabilities
as weights, and applies finite population corrections.
To draw the stratified samples, we consider three target sample sizes: 0.1%,
0.2% and 1%. These sizes reflect the fact that, in practice, merchant surveys often
face a constraint of small sample sizes. Stratification is performed on the
three variables defined in Section 2.3, and we select the given proportion (0.1%,
0.2% and 1%) of units from each stratum.
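The sampling and estimation loop can be sketched as follows. This is a simplified, unofficial stand-in for Stata's svy: proportion, run on simulated data; the frame size, number of strata, and acceptance rate are invented for the example, not taken from the OCR:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy merchant-level frame: a stratum label and a card-acceptance indicator.
N = 150_000
strata = rng.integers(0, 20, size=N)              # e.g., 20 counties
accept = (rng.random(N) < 0.257).astype(float)    # ~25.7% true acceptance

def stratified_estimate(strata, y, frac=0.001):
    """Draw `frac` of the units in each stratum; return the stratified
    estimate of the proportion and its FPC-adjusted standard error."""
    N_total = len(y)
    est, var = 0.0, 0.0
    for h in np.unique(strata):
        idx = np.flatnonzero(strata == h)
        N_h = len(idx)
        n_h = max(2, round(frac * N_h))           # at least 2 for a variance
        s = rng.choice(idx, size=n_h, replace=False)
        est += (N_h / N_total) * y[s].mean()
        fpc = (N_h - n_h) / N_h                   # finite population correction
        var += (N_h / N_total) ** 2 * fpc * y[s].var(ddof=1) / n_h
    return est, var ** 0.5

p_hat, se = stratified_estimate(strata, accept, frac=0.001)
ci_length = 2 * 1.96 * se                         # length of the 95% CI
```

Repeating the draw 1,000 times and averaging p_hat, the deviation from the true value, se, and ci_length yields quantities analogous to those reported in Table 3.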
Choice of the three stratification variables is motivated by those used in practice
for merchant surveys, see e.g., Kosse et al. (2017). The purpose of stratification
in general is to reduce variance estimates by finding variables which are highly
correlated with card acceptance (see again Table 1). Of course, we are also limited
by what is available in the OCR.
To estimate the logistic regression models of card acceptance and usage, we take
a similar approach for drawing stratified random samples. However, since we are
interested in understanding the bias of the estimates for key explanatory variables,
we fix the sample sizes at 1% of the data from each strata of the merchant-level data
set. For the card usage model, the sampling ratio is 0.01% from the transaction-
level data set. For both models, we perform 10 replications and compute average
point estimates. Explanatory variables are included based on what is available in
the OCR combined with a review of the payments literature; see Appendix for a
more detailed explanation. Coefficient estimates are not produced using any survey
weights.
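A minimal, self-contained version of this second exercise, with simulated data standing in for the OCR (the covariate, coefficients, and strata are our assumptions; as in the text, no survey weights enter the fit):

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Unweighted logistic regression fitted by Newton-Raphson."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend an intercept
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        H = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian of log-lik.
        b += np.linalg.solve(H, X.T @ (y - p))  # Newton step
    return b

rng = np.random.default_rng(1)
N = 100_000
log_size = rng.normal(size=N)                   # stand-in for (log) store size
strata = rng.integers(0, 20, size=N)            # strata unrelated to outcome
p_true = 1 / (1 + np.exp(-(-1.2 + 0.8 * log_size)))
accept = (rng.random(N) < p_true).astype(float)

b_full = fit_logit(log_size, accept)            # fit on the full frame

# 1% stratified subsample: keep 1% of the units within each stratum.
keep = np.concatenate([
    rng.choice(np.flatnonzero(strata == h),
               size=max(2, int(0.01 * (strata == h).sum())), replace=False)
    for h in range(20)
])
b_sub = fit_logit(log_size[keep], accept[keep])
```

Averaging b_sub over repeated subsample draws, and comparing with b_full, shows how far subsample point estimates drift from the full-frame estimates.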
4. RESULTS
4.1. Estimates of Card Acceptance
Table 3 shows the results from the first component of our analysis. From the
1,000 replications, we report the average estimates of card acceptance, percent
bias when compared with the true value of acceptance (0.2573804), the average
standard error of the estimate and the average length of a 95% confidence interval.
Each stratification approach provides essentially unbiased estimates of card
acceptance, even for the smallest sample size considered (0.1% sample; ≈ N=160);
the bias is below 1% when compared with the true value. That being said, size
stratification leads to the most biased estimates. County and size stratification
underestimate card acceptance in small sample sizes, whereas industry stratification
provides an overestimate. The main issue with small sample sizes – which
has implications for practical merchant surveys – is the precision of the estimates.
For small sample sizes, the average length of a confidence interval for both county
and industry strata is about 13.5 percentage points; size stratification provided a
relatively more precise estimate.

Table 3. Estimates of Card Acceptance.

                  County     Size      Industry
0.1% sample
  Mean            0.25625    0.25604   0.25831
  Percent bias    −0.439     −0.519    0.362
  SE              0.034      0.031     0.035
  CI length       0.134      0.122     0.136
0.2% sample
  Mean            0.25732    0.25693   0.25682
  Percent bias    −0.025     −0.175    −0.220
  SE              0.015      0.014     0.015
  CI length       0.060      0.054     0.061
1% sample
  Mean            0.25749    0.25704   0.25721
  Percent bias    0.044      −0.134    −0.067
  SE              0.011      0.010     0.011
  CI length       0.043      0.038     0.043

Note: This table shows estimates of card acceptance for different stratification variables and sample
sizes. Stratification is performed on county, industry and store size variables. From 1,000 replications,
we report the average point estimate and its percent bias from the true value estimated on the full
merchant-level data set derived from the OCR. We also report the average standard error of the estimate,
along with the average length of a 95% confidence interval.

Effectiveness of Stratified Random Sampling 43
Increasing the sample size from 0.1% to 0.2% leads to a noticeable increase
in the precision of the estimates, and there is not much additional improvement
from increasing the sample size from 0.2% to 1%. For larger sample sizes, county
stratification provides the least biased estimates, whereas store size estimates are
most precise. This is driven by the high correlation between card acceptance and
store size; see again Table 1.
There are some key lessons to draw from these results for actual merchant
surveys. A combination of geographic and merchant size coverage would likely
provide the most reliable estimates of card acceptance, reducing both bias and
variance. Further, very small sample sizes could produce unreliable estimates, but
even a small increase in the sample size can increase precision without having to
increase cost significantly.
4.2. Logistic Regression Models

Now we discuss results from logistic regression models of card acceptance and
usage; see the Appendix for details on model selection and full results.
4.2.1. Acceptance
Our analysis focuses on three types of explanatory variables: county effects, industry
effects, and store size effects for the entire sector. Due to the high number of
control variables, we discuss only selected results; see Table 4.
In general, for the 88 parameter estimates, the strata based on the size of the store
provide the best estimates: in 52 of the 88 cases, the average of the estimates for the
10 samples lies inside the 95% confidence interval.
The model based on the entire database clearly shows that the most important
factor, in line with the literature, is the size of the store, which we characterize
by annual revenue. Store sizes follow a lognormal distribution, which means that
under the county- and industry-based strata, there is a low probability that the
largest retailers will be included in the sample. Without the biggest retailers,
for which county and industry effects are small compared to the size effects,
the estimates for these variables will in general be biased: the county and industry
strata overestimate these effects.
Since the functional relation between size and acceptance is nonlinear and non-monotone,
the county and industry strata do not provide good fits for the higher-order
polynomials, nor for most of the size-related variables (the share of different-sized
transactions and the volume of transactions). In the case of industry effects,
the size-based strata underperform the industry-based strata: the industry distribution
of the database is heavily concentrated in retail services, and the other
categories have only a very small share. The three categories of size effects (direct
annual revenue, share of different-sized transactions and volume of transactions) are
in general better estimated by size strata.

Note to Table 4: This table shows selected point estimates from a logistic regression model with card
acceptance as the dependent variable; results from the full model are shown in the Appendix. In the first
column, we show estimates from the model estimated on the full merchant-level data set derived from the
OCR. The last three columns show the average estimates from ten 1% stratified subsamples, where the
stratification variable is indicated in the column heading.
In conclusion, the most efficient stratification is random stratified sampling based
on store size categories rather than on geographical or industry classification. The
main causes are the importance of annual revenue over county and industry effects
in card acceptance decisions and the complex functional relation between the two.
When stratification is not also based on store size, the model overestimates the
other effects.

Note to Table 5: This table shows selected point estimates from a logistic regression model with card
usage as the dependent variable; results from the full model are shown in Appendix A. In the first column,
we show estimates from the model estimated on the full transaction-level data set derived from the OCR.
The last three columns show the average estimates from ten 0.01% stratified subsamples, where the
stratification variable is indicated in the column heading.
4.2.2. Usage
In the case of modeling payment card use, the above approach does not lead to
good estimates. The model estimated on the full data set clearly shows that the
single most important factor is the transaction value and its higher order orthogonal
polynomials. We find there is no single best method among these three types of
stratification; see Table 5. Because of the extremely small standard errors calculated
from nearly 5 billion transactions, all subsample estimates are on average outside
of the 95% confidence intervals. Based on the average estimates, the county
strata provide the closest estimates for most variables. The reason for this is
the much greater county effect in card use decisions compared to card acceptance.
However, as opposed to what we observed in the card-acceptance model, there is
no difference in the direction of the biases across the three types of stratification:
all three models on average overestimate the county and transaction size effects
and underestimate the industry effects.
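The higher-order orthogonal polynomials of transaction value referred to above can be constructed by orthogonalizing the raw powers. The sketch below uses a QR decomposition, in the spirit of R's poly(); this is an assumed construction for illustration, not necessarily the authors' exact method:

```python
import numpy as np

def orthogonal_polynomials(x, degree):
    """Return columns spanning x, x^2, ..., x^degree, made mutually
    orthonormal and orthogonal to the constant (QR decomposition of
    the Vandermonde matrix)."""
    x = np.asarray(x, dtype=float)
    raw = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    q, _ = np.linalg.qr(raw)
    return q[:, 1:]  # drop the constant column

# Orthogonalized regressors avoid the extreme collinearity of raw powers,
# which matters for a strongly skewed variable such as transaction value.
values = np.exp(np.linspace(0.0, 5.0, 200))  # toy, lognormal-like spread
P = orthogonal_polynomials(values, 3)
```

Because the columns are orthonormal and orthogonal to the intercept, the coefficient on each polynomial term can be interpreted separately, as in Tables 4 and 5.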
The main reason that the above stratifications provide poor results is the lognormality
of the transaction size distribution. By not basing the stratification on transaction
size, all three subsampling schemes produce systematically biased estimates.
5. CONCLUSIONS
In our paper, we analyzed payment card acceptance and payment card use decisions
in retail situations using the comprehensive OCR database and random stratified
subsamples. We compare county, industry and store size strata with the
true population estimates to simulate the usual stratification criteria for merchant
surveys.
Our results from estimating levels of card acceptance suggest that having full
coverage with respect to geography and the size of the store produces the most reli-
able estimates. Further, we quantify the uncertainty in card acceptance estimates
that may be present for particularly small sample sizes, finding that increasing
the sample size from around N = 160 to N = 300 can reduce the length of confi-
dence intervals by half. These findings have important practical implications for
merchant surveys.
Further, we estimated logistic regression models of both card acceptance and
usage. In our models, we controlled for several factors relevant in payment instrument
adoption and use, but we focused on the performance of the three types of
stratification variables in estimating the exact county, industry and size effects for
the entire population. In our comparison, we created 10 random stratified subsamples
of 1% of the merchant database and 10 random stratified subsamples of 0.01%
of the transaction database.
In the card-acceptance model, the stratification based on the store sizes provided
the best estimates. In card acceptance decisions, the most important factor is the
size of the store and other size-related variables. However, as store sizes follow
a skewed distribution, with small samples there is a high probability that the
subsample does not contain enough big stores; the average effects of size will then
be underestimated, while the county and industry effects will be overestimated.
In the card usage model, we evaluate the same subsample types and show that
county stratification provides the best results. However, because strata were not
created on transaction sizes, all three of the analyzed subsampling schemes exhibit
systematic biases. In conclusion, in card-acceptance models the store size
stratification provides good estimates; however, the same stratification cannot
be used to effectively estimate variable effects in a card usage model.
ACKNOWLEDGEMENTS
We thank the Magyar Nemzeti Bank, in particular Lóránt Varga and Gábor Sin, for
facilitating access to the data. Tamás Ilyés is no longer an employee of the Magyar
Nemzeti Bank. We also thank Jean-Louis Combes, Pierre Lesuisse and participants
of the Doctoral Seminar at the University of Auvergne School of Economics. We
are grateful to the editors and referees for their helpful comments. The views
expressed in this paper are those of the authors. No responsibility for them should
be attributed to the Bank of Canada or the Magyar Nemzeti Bank. All remaining
errors are the responsibility of the authors.
REFERENCES
Arango, C., & Taylor, V. (2008). Merchant acceptance, costs, and perceptions of retail payments: A
Canadian survey. Bank of Canada discussion paper, 2008-12.
Arango, C., Huynh, K. P., & Sabetti, L. (2015). Consumer payment choice: Merchant card acceptance
versus pricing incentives. Journal of Banking and Finance, 55(C), 130–141.
Bagnall, J., Bounie, D., Huynh, K. P., Kosse, A., Schmidt, T., Schuh, S., et al. (2016). Consumer cash
usage: A cross country comparison with payment diary survey data. International Journal of
Central Banking, 12(4), 1–61.
Baxter, W. F. (1983). Bank interchange of transactional paper: Legal and economic perspectives.
Journal of Law and Economics, 26(3), 541–588.
Bolt, W. (2008). Consumer choice and merchant acceptance of payment media. DNB Working Papers,
No. 197.
Bolt, W., Jonker, N., & Renselaar, C. (2010). Incentives at the counter: An empirical analysis of
surcharging card payments and payment behaviour in the Netherlands. Journal of Banking and
Finance, 34, 1738–1744.
Borzekowski, R., Kiser, E. K., & Ahmed, S. (2008). Consumers’ use of debit cards: Patterns,
preferences, and price response. Journal of Money, Credit and Banking, 40, 149–172.
Briglevics, T., & Schuh, S. (2014). This is what’s in your wallet …and how you use it. ECB Working
Paper Series, No. 1684.
Chu, B. M., Huynh, K. P., Jacho-Chávez, D. T., & Kryvtsov, O. (2018). On the evolution of the UK
price distributions. Annals of Applied Statistics.
Cohen, M., & Rysman, M. (2013). Payment choice with consumer panel data. Federal Reserve Bank
of Boston Working Paper, No. 13-6.
APPENDIX
This appendix reviews the literature on payment card acceptance and usage, which
justifies the explanatory variables included in our models. In addition, we provide
a description of the variables not discussed in the main text and the full results for
the logistic regression models.
the results presented, neither Klee’s nor Wolman and Wang’s database shows a
non-monotonic relationship between cash use and transaction value on the values
examined by the authors. Building on these empirical results, several theoretical models
have been constructed to explain the relationship between transaction value and
the card usage rate. Briglevics and Schuh (2014) used US payment diary data,
while Huynh et al. (2014) relied on Canadian and Austrian data to construct their
respective decision models. Both models estimate card usage patterns that are
monotonic and concave in transaction value. While Briglevics and Schuh (2014)
described payment instrument choice as a dynamic optimization problem, Huynh
et al. (2014) supplement the Baumol–Tobin model.
Despite the use of receipt-level data, our database differs significantly from
the two studies analyzing transaction data and from the surveys built on payment
diaries in several regards. The database of online cash registers provides national
coverage, and the vast majority of merchants are subject to the relevant regulation.
Accordingly, compared to the studies mentioned above, we were able to distin-
guish between far more merchants both in terms of size and type. On the other
hand, due to the anonymization, we had little data on the customers of the stores.
County identifiers were of limited use, as there is scant variance across the counties
with regard to the main demographic aspects; consequently, as opposed to Klee
(2008), there is insufficient variance to add a consumer characteristics proxy.
However, as opposed to the payment diaries, there is significantly more informa-
tion available on payment location; moreover, due to the statutory obligations, the
reliability of the data is presumably better.
relationship, given the limited number of explanatory variables, the final models
include the turnover’s log and its square.
• Temporal attributes of the store: Not only the annualized turnover of the
stores, but also the turnover’s monthly and weekly distribution can be estab-
lished based on the dates indicated on the receipts. Accordingly, in our analysis,
we also studied the effect of the weekly turnover structure on card acceptance.
For the most part during the two years under review, the decree on Sunday store
closure was in effect in the retail sector. Family-owned stores represented the
main exceptions. Consequently, Sunday opening hours can be used as a proxy
for ownership. Since the correspondence is imperfect, this variable is included
in conjunction with the TEÁOR variable in the models. In this way, we can
separate the effects of individual sectorial exceptions from the attributes of the
owner.
Since a store’s closure on Mondays and Tuesdays proved to have significant
explanatory power in our analysis, this serves as a control variable
in the rest of the models. These attributes are linked to special stores, e.g.,
museum gift shops and sample stores, where the business is not considered to be
an independent financial unit.
• Network attributes: A large part of the retail sector operates in the form of a
network; in other words, numerous outlets are operated by a single legal entity.
According to our hypothesis, the fact that a store is part of a chain affects
card acceptance decisions in two ways. In some networks, every member belongs
to the same category: either all of them accept card payments or none does. In such
networks, card acceptance is presumably based on a network-level decision;
therefore, the decision situation itself may differ from that of independent stores.
By contrast, in networks where, according to the observations, card acceptance
is based on the independent decision of the store, the decision situation is
determined by the store’s unique characteristics. Therefore, the models included
dummy variables for the three types of stores: independent store, independent
decision, network decision. Moreover, in the case of network stores, we also
included the network’s total turnover and the number of stores included in the
network. According to the cross-sectional analyses, the correlation is nonlinear;
therefore, we also include the squared terms in the regressions.
• Item number: The database includes the number of products purchased under
each receipt. This allowed us to use the total item number of the store as another
approach to the size variable and to introduce average and maximum item numbers.
The average and the maximum item number presumably correlate strongly with the
payment time and, as such, are used as proxy variables for the latter. We used
average payment value as a control variable in several cases; however, this
variable correlates extremely strongly with the decomposition of the turnover
by value and with the proportions of the ranges.
Annual revenue 1st order orthogonal polynomial 59.296 59.525 59.503 63.279
Annual revenue 2nd order orthogonal polynomial −81.847 −80.740 −84.464 −78.947
Annual revenue 3rd order orthogonal polynomial −24.489 −24.085 −24.322 −21.949
variables capture the ease of cash payment, which presumably correlates with
payment time and as such, it can be considered a cost variable.
• Store attributes: Although the model constructed for card acceptance con-
tained numerous variables, due to space limitations, we can only include the
most important ones in this part of the study. As regards store attributes, most
models include the log and square of annualized turnover and the aggregate
form of the activity.
• County data: In the card-acceptance model, county effects did not correlate
significantly with the county’s level of development, but a correlation can be
observed for card usage in the raw data. We estimated county effects in two
steps: the main regression includes only the county dummy variables, while in
the second step, we focus on the correlation between the coefficients and the
main sociodemographic data of individual counties.
• Temporal data: The database contains data for a 2-year period, which reflect
significant monthly and weekly seasonality. Since a sufficient amount of data
was available, we included yearly and monthly dummy variables and dummies
pertaining to the days of the month and the days of the week.
• Inverse Mills ratio: As card usage and card acceptance mutually affect each
other, our model reflects a significant degree of selection bias.
In order to remove the bias, we also included the inverse Mills ratio computed
from the probit version of the model constructed for card acceptance. The
Heckman selection thus performed reduces estimation uncertainty, especially
in the case of the affiliated store data.
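The inverse Mills ratio correction described above can be computed directly from the fitted probit index. A minimal sketch, assuming the selection-equation (card-acceptance) probit coefficients have already been estimated:

```python
import numpy as np
from scipy.stats import norm

def inverse_mills_ratio(probit_index):
    """lambda(z) = phi(z) / Phi(z), evaluated at the fitted probit index
    z = x'gamma from the selection (card-acceptance) equation. The result
    is added as a regressor in the outcome (card usage) equation, as in
    the second step of a Heckman selection correction."""
    z = np.asarray(probit_index, dtype=float)
    return norm.pdf(z) / norm.cdf(z)

imr = inverse_mills_ratio(np.array([-1.0, 0.0, 1.0]))
```

The ratio is large for observations that were unlikely to be selected and shrinks toward zero as the probability of selection rises, which is what allows it to absorb the selection bias.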
Transaction value 1st order orthogonal polynomial −184.446 −157.517 −167.243 −176.342
Transaction value 2nd order orthogonal polynomial −206.922 −189.216 −194.939 −200.644
Transaction value 3rd order orthogonal polynomial −76.017 −68.811 −71.832 −72.792
PART II
VARIANCE ESTIMATION
WILD BOOTSTRAP RANDOMIZATION
INFERENCE FOR FEW TREATED
CLUSTERS
ABSTRACT
When there are few treated clusters in a pure treatment or difference-in-
differences setting, t tests based on a cluster-robust variance estimator can
severely over-reject. Although procedures based on the wild cluster bootstrap
often work well when the number of treated clusters is not too small, they
can either over-reject or under-reject seriously when it is. In a previous
paper, we showed that procedures based on randomization inference (RI)
can work well in such cases. However, RI can be impractical when the
number of possible randomizations is small. We propose a bootstrap-based
alternative to RI, which mitigates the discrete nature of RI p values in the
few-clusters case. We also compare it to two other procedures. None of them
works perfectly when the number of clusters is very small, but they can work
surprisingly well.
Keywords: Clustered data; panel data; CRVE; wild cluster bootstrap;
difference-in-differences; kernel-smoothed p value
62 JAMES G. MACKINNON AND MATTHEW D. WEBB
1. INTRODUCTION
During the past decade or two, it has become common for empirical work in many
areas of economics to involve models where the error terms are allowed to be corre-
lated within clusters. Much of this work employs difference-in-differences (DiD)
estimators, where the data set has both a time and a cross-section dimension, and
clustering is typically at the cross-section level (say, by state or province). Cameron
and Miller (2015) provides a comprehensive survey of econometric methods for
cluster-robust inference.
Despite considerable progress in the development of suitable econometric meth-
ods over the past decade, it can still be a challenge to make reliable inferences.
Doing so is particularly challenging in the DiD context when there are very few
treated clusters. Past research, including Conley and Taber (2011), has shown that
inference based on cluster-robust test statistics can greatly over-reject in this case.
MacKinnon and Webb (2017b) explains why this happens and why the wild cluster
bootstrap (WCB) of Cameron, Gelbach, and Miller (2008) does not solve the prob-
lem; for a less technical discussion, see also MacKinnon and Webb (2017a). When
there are very few treated clusters, the restricted WCB often severely under-rejects,
and the unrestricted WCB often severely over-rejects.
One potentially attractive way to obtain tests with accurate size when there
are few treated clusters is to use randomization inference (RI). This approach
involves comparing estimates based on the clusters that were actually treated with
estimates based on control clusters that were not treated. Several authors have
recently investigated this approach; see Conley and Taber (2011), Canay, Romano,
and Shaikh (2017), Ferman and Pinto (2019), and MacKinnon and Webb (2018a).
RI procedures necessarily rely on strong assumptions about how similar the
control clusters are to the treated clusters. MacKinnon and Webb (2018a) shows
that for RI procedures which use coefficient estimates, like the one of Conley
and Taber (2011), these assumptions almost always fail to hold when the treated
clusters have either more or fewer observations than the control clusters. As a con-
sequence, the procedure may over-reject or under-reject quite noticeably when the
treated clusters are substantially smaller or larger than the controls. MacKinnon
and Webb (2018a) suggests that more reliable inferences can often be obtained by
basing RI on t statistics rather than coefficient estimates. However, such proce-
dures can involve noticeable power loss relative to the ones based on coefficient
estimates.
In Section 2, we briefly discuss conventional asymptotic procedures for infer-
ence with clustered errors. In Section 2.1, we then explain how the WCB works. In
Section 3, we introduce RI and discuss two variants of it, one based on coefficient
estimates which is quite similar to what was proposed in Conley and Taber (2011)
and the other based on t statistics proposed in MacKinnon and Webb (2018a).
2. CLUSTER-ROBUST INFERENCE
A linear regression model with clustered errors may be written as

$$
y \equiv \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix}
= X\beta + \epsilon
\equiv \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_G \end{bmatrix}\beta
+ \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_G \end{bmatrix},
\qquad (1)
$$

where each of the G clusters, indexed by g, has $N_g$ observations. The matrix X and
the vectors y and $\epsilon$ have $N = \sum_{g=1}^{G} N_g$ rows, X has k columns, and the parameter
vector $\beta$ has k rows. Each subvector $\epsilon_g$ is assumed to have covariance matrix $\Omega_g$
and to be uncorrelated with every other subvector. The covariance matrix of the
entire error vector is block diagonal with diagonal blocks the $\Omega_g$. Ordinary least
squares (OLS) estimation of Eq. (1) yields estimates $\hat\beta$ and residuals $\hat\epsilon$.
Because the elements of the $\epsilon_g$ are in general neither independent nor identically
distributed, both classical OLS and heteroskedasticity-robust standard errors for
$\hat\beta$ are invalid. Instead, inference is typically based on a cluster-robust variance
estimator (CRVE), the most widely used of which is

$$
\frac{G(N-1)}{(G-1)(N-k)}\,(X^{\top}X)^{-1}
\left(\sum_{g=1}^{G} X_g^{\top}\hat\epsilon_g \hat\epsilon_g^{\top} X_g\right)
(X^{\top}X)^{-1},
\qquad (3)
$$

where $\hat\epsilon_g$ is the subvector of $\hat\epsilon$ that corresponds to cluster g. This is the estimator
that is used when the cluster command is invoked in Stata.1 Consistent with
the results of Bester, Conley, and Hansen (2011), it is common to assume that the
t statistics follow a t(G − 1) distribution; this is what Stata does by default.
It is not obvious that using t statistics based on the CRVE (3) is valid asymp-
totically. The proof requires technical assumptions about the distributions of the
errors and the regressors and how the number of clusters and their sizes change as
the sample size tends to infinity; see Djogbenou, MacKinnon, and Nielsen (2018).
Nevertheless, test statistics based on (3) seem to yield reliable inferences when
the number of clusters is large and there is not too much heterogeneity across
clusters. In particular, the number of observations per cluster must not vary too
much; see Carter, Schnepel, and Steigerwald (2017) and MacKinnon and Webb
(2017b). However, t statistics based on (3) tend to over-reject severely when the
parameter of interest is the coefficient on a treatment dummy and there are very few
treated clusters; see Conley and Taber (2011) and MacKinnon and Webb (2017b).
Rejection frequencies can be over 75% when all the treated observations belong
to the same cluster.
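The CRVE-based t test just described can be sketched as follows. This is an illustration of the standard CV1 estimator with Stata's finite-sample correction and the t(G − 1) reference distribution; the toy data are hypothetical, not from the paper:

```python
import numpy as np
from scipy.stats import t as student_t

def crve_ttest(y, X, cluster_ids, j):
    """t statistic for beta_j = 0 using the CV1 cluster-robust variance
    estimator, with a p value from the t(G - 1) distribution."""
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((k, k))
    groups = np.unique(cluster_ids)
    for g in groups:                       # sum of X_g' e_g e_g' X_g over clusters
        score = X[cluster_ids == g].T @ resid[cluster_ids == g]
        meat += np.outer(score, score)
    G = len(groups)
    c = G * (N - 1) / ((G - 1) * (N - k))  # Stata's small-sample correction
    V = c * XtX_inv @ meat @ XtX_inv
    t_stat = beta[j] / np.sqrt(V[j, j])
    p_val = 2 * student_t.sf(abs(t_stat), df=G - 1)
    return t_stat, p_val

# Toy example: 10 clusters of 30 observations, with cluster effects, no true slope.
rng = np.random.default_rng(1)
ids = np.repeat(np.arange(10), 30)
X = np.column_stack([np.ones(300), rng.standard_normal(300)])
y = rng.standard_normal(10)[ids] + rng.standard_normal(300)
t_stat, p_val = crve_ttest(y, X, ids, j=1)
```

The severe over-rejection discussed in the text arises when the regressor of interest is a treatment dummy that is nonzero in only one or two clusters, not in well-balanced designs like this toy one.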
In this paper, we are primarily concerned with the DiD model, which is often
appropriate for studies that use individual data in which there is variation in
treatment across both clusters (or groups) and time periods. We can write such
a model as
$$ y_{igt} = \beta_1 + \beta_2\,\mathrm{GT}_g + \beta_3\,\mathrm{PT}_t + \beta_4\,\mathrm{TREAT}_{gt} + \epsilon_{igt}, \qquad (4) $$
$$ i = 1, \ldots, N_g, \quad g = 1, \ldots, G, \quad t = 1, \ldots, T, $$
where i indexes individuals, g indexes groups, and t indexes time periods. Here
GTg is a “group treated” dummy that equals 1 if group g is treated in any time
period, PTt is a “period treated” dummy that equals 1 if any group is treated in time
period t, and TREATgt is a dummy that equals 1 if an observation is actually treated.
2.1. The Wild Cluster Bootstrap
The WCB was proposed in Cameron et al. (2008) as a method for reliable inference
in cases with a small number of clusters, and its asymptotic validity is proved
in Djogbenou et al. (2018). A different, but less effective, bootstrap procedure
for cluster-robust inference, often referred to as the “pairs cluster bootstrap,” was
previously suggested in Bertrand, Duflo, and Mullainathan (2004); see MacKinnon
and Webb (2017a). The WCB was studied extensively in MacKinnon and Webb
(2017b) for the cases of unbalanced clusters and/or few treated clusters. Because
we will be proposing a new procedure that is closely related to the WCB in Section
4, we review how the latter works.
Without loss of generality, we consider how to test the hypothesis that β4 , the
DiD coefficient in Eq. (4), is zero. Then the (restricted) WCB works as follows:
(1) Estimate Eq. (4) by OLS.
(2) Calculate $\hat t_4$, the t statistic for β4 = 0, using the square root of the 4th diagonal
element of (3) as a cluster-robust standard error.
(3) Reestimate the model (4) subject to the restriction that β4 = 0, so as to obtain
restricted residuals $\tilde\epsilon$ and restricted estimates $\tilde\beta$.
(4) For each of B bootstrap replications, indexed by b, generate a new set of
bootstrap dependent variables $y^{*b}_{igt}$ using the bootstrap DGP

$$ y^{*b}_{igt} = \tilde\beta_1 + \tilde\beta_2\,\mathrm{GT}_g + \tilde\beta_3\,\mathrm{PT}_t + \tilde\epsilon_{igt}\, v_g^{*b}. \qquad (5) $$

Here $y^{*b}_{igt}$ is an element of the vector $y^{*b}$ of observations on the bootstrap dependent
variable, $\mathrm{GT}_g$ and $\mathrm{PT}_t$ are taken from the corresponding row of X, and
$v_g^{*b}$ is an auxiliary random variable that follows the Rademacher distribution;
see Davidson and Flachaire (2008). It takes the values 1 and −1 with equal
probability.3
(5) For each bootstrap replication, estimate regression (4) using $y^{*b}$ as the regressand.
Calculate $t_4^{*b}$, the bootstrap t statistic for β4 = 0, using the square root
of the 4th diagonal element of (3), with bootstrap residuals replacing the OLS
residuals, as the standard error.
(6) Calculate the symmetric bootstrap p value as

$$ \hat p_s^* = \frac{1}{B} \sum_{b=1}^{B} I\bigl(|t_4^{*b}| > |\hat t_4|\bigr), \qquad (6) $$
where I( · ) denotes the indicator function. Eq. (6) assumes that, under the
null hypothesis, the distribution of t4 is symmetric around zero. Alternatively,
one can use a slightly more complicated formula to calculate an equal-tail
bootstrap p value.
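Steps (1)–(6) can be sketched compactly. The block below tests β_j = 0 in a generic clustered regression; `crve_se` stands in for the square root of the jth diagonal element of the CRVE (3), and the toy data at the end are hypothetical, not the authors' design:

```python
import numpy as np

def crve_se(X, resid, cluster_ids, j):
    """Cluster-robust (CV1) standard error of coefficient j."""
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((k, k))
    for g in np.unique(cluster_ids):
        score = X[cluster_ids == g].T @ resid[cluster_ids == g]
        meat += np.outer(score, score)
    G = len(np.unique(cluster_ids))
    c = G * (N - 1) / ((G - 1) * (N - k))
    return np.sqrt((c * XtX_inv @ meat @ XtX_inv)[j, j])

def wcr_pvalue(y, X, cluster_ids, j, B=399, seed=0):
    """Restricted wild cluster (WCR) bootstrap p value for H0: beta_j = 0."""
    rng = np.random.default_rng(seed)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                 # step 1: OLS
    t_hat = beta[j] / crve_se(X, y - X @ beta, cluster_ids, j)  # step 2
    Xr = np.delete(X, j, axis=1)                                # step 3: impose
    beta_r = np.linalg.lstsq(Xr, y, rcond=None)[0]              #   beta_j = 0
    resid_r = y - Xr @ beta_r
    groups = np.unique(cluster_ids)
    pos = np.searchsorted(groups, cluster_ids)  # map rows to their clusters
    exceed = 0
    for _ in range(B):
        v = rng.choice([-1.0, 1.0], size=len(groups))           # step 4:
        y_star = Xr @ beta_r + resid_r * v[pos]                 #   Rademacher draws
        b_star = np.linalg.lstsq(X, y_star, rcond=None)[0]      # step 5
        t_star = b_star[j] / crve_se(X, y_star - X @ b_star, cluster_ids, j)
        exceed += abs(t_star) > abs(t_hat)
    return exceed / B                                           # step 6: Eq. (6)

# Toy data: 8 clusters of 25 observations, 2 "treated" clusters, no true effect.
rng = np.random.default_rng(3)
ids = np.repeat(np.arange(8), 25)
d = (ids < 2).astype(float)
X = np.column_stack([np.ones(200), d])
y = rng.standard_normal(8)[ids] + rng.standard_normal(200)
p = wcr_pvalue(y, X, ids, j=1, B=99)
```

Note that the Rademacher draw is one per cluster, not one per observation; that is what preserves the within-cluster correlation structure in the bootstrap samples.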
The procedure just described is known as the restricted wild cluster, or WCR,
bootstrap, because the bootstrap DGP (5) uses restricted parameter estimates and
restricted residuals.4 We could instead use unrestricted estimates and residuals in
step 4 and calculate bootstrap t statistics for the hypothesis that β4 = β̂4 in step 5.
This yields the unrestricted wild cluster, or WCU, bootstrap.
MacKinnon and Webb (2017b) explains why the WCB fails when the number of
treated clusters is small. The WCR bootstrap, which imposes the null hypothesis,
leads to severe under-rejection. In contrast, the WCU bootstrap, which does not
impose the null hypothesis, leads to severe over-rejection. When just one cluster
is treated, it over-rejects at almost the same rate as using CRVE t statistics with
the t(G − 1) distribution.
The poor performance of WCR and WCU when there are few treated clus-
ters is a consequence of the fact that the bootstrap DGP attempts to replicate the
within-cluster correlations of the errors using residuals that have very odd prop-
erties. MacKinnon and Webb (2018b) therefore suggest using the ordinary wild
bootstrap instead, and Djogbenou et al. (2018) prove that combining the ordinary
wild bootstrap for the model (1) with the CRVE (3) leads to asymptotically valid
inference. When clusters are sufficiently homogeneous, this procedure can work
well even when the number of treated clusters is small.
3. RANDOMIZATION INFERENCE
RI, first proposed in Fisher (1935), is a procedure for performing exact tests in
the context of experiments. The idea is to compare an observed test statistic τ̂
with an empirical distribution of test statistics τj∗ for j = 1, . . . , S generated by re-
randomizing the assignment of treatment across experimental units. To compute
each of the τj∗ , we use the actual outcomes while pretending that certain non-treated
experimental units were treated. If τ̂ is in the tails of the empirical distribution of
the τj∗ , then this is evidence against the null hypothesis of no treatment effect.
Randomization tests are valid only when the distribution of the test statistic
is invariant to the realization of the re-randomizations across permutations of
assigned treatments; see Lehmann and Romano (2008) and Imbens and Rubin
(2015). Whether this key assumption is true in the context of policy changes such
as those typically studied in the DiD literature is debatable. Any endogeneity in
the way policies are implemented over jurisdictions and time would presumably
cast doubt on the assumption.
When treatment is randomly assigned at the individual level, the invariance of
the distribution of the test statistic to re-randomization follows naturally. However,
if treatment assignment is instead at the group level, as is always the case for DiD
models like (4), then the extent of unbalancedness can determine how close the
distribution is to being invariant.
It is obvious that the proportion of treated observations matters for $\hat\beta_4$ in
(4) and its cluster-robust standard error. Let $\bar d = \bigl(\sum_{g=1}^{G_1} N_g\bigr)/N$ denote this
proportion. When clusters are balanced, the value of $\bar d$ will be constant across
re-randomizations. However, when clusters are unbalanced, $\bar d$ may vary considerably
across re-randomizations. This implies that the distributions of $\hat\beta_4$ may also
vary substantially. RI may not work well in such cases.
MacKinnon and Webb (2018a) studies two RI procedures. One uses β̂4 in (4)
as τ̂ , and the other uses the cluster-robust t statistic that corresponds to β̂4 . The
former procedure, which we refer to as RI-β, is quite similar to a procedure
proposed in Conley and Taber (2011). It is only valid, even in large samples,
if re-randomizing does not change the distribution of the β̂4j∗. The latter
procedure, which we refer to as RI-t, is evidently valid in large samples whenever the
68 JAMES G. MACKINNON AND MATTHEW D. WEBB
the single “treated” group in all re-randomizations would start treatment in 1978.
If G1 = 2 and treatment began in 1978 and 1982, then, for each re-randomization,
one group would begin treatment in 1978 and the other in 1982. In our simulations,
we ordered both the actually treated groups and the controls by size. Thus if, for
example, treatment began in 1978 for group 3 and in 1982 for group 11, and
N3 > N11 , then treatment would begin in 1978 for the larger control group and in
1982 for the smaller one. We also experimented with assigning treatment years at
random and found that doing so made very little difference.
The most natural way to calculate an RI p value is probably to use the equivalent
of Eq. (6). As before, S denotes the number of repetitions, which would be G0
when G1 = 1 and the minimum of C(G, G1) − 1 and B when G1 > 1, where B is a
user-specified target number of replications. Then the analog of Eq. (6) is
p̂1∗ = (1/S) ∑_{j=1}^{S} I(|τj∗| > |τ̂|).  (7)
This makes sense if we are testing the null hypothesis that β4 = 0 and expect τj∗ to
be symmetrically distributed around zero. If we were instead testing the one-sided
null hypothesis that β4 ≤ 0, we would want to remove the absolute value signs.
Eq. (7) is not the only way to compute an RI p value for a point null hypothesis,
however. A widely used alternative is
p̂2∗ = (1/(S + 1)) (1 + ∑_{j=1}^{S} I(|τj∗| > |τ̂|)).  (8)
It is easy to see that the difference between p̂1∗ and p̂2∗ is O(1/S), so that they tend
to the same value as S → ∞. There is evidently no problem if S is large, but the
two p values can yield quite different inferences when S is small. The analogous
issue should rarely arise for bootstrap tests, because the investigator can almost
always choose B (the number of bootstrap samples, which plays the same role as
S here) in such a way that Eqs. (7) and (8) yield the same inferences. This will
happen whenever α(B + 1) is an integer, where α is the level of the test. That is
why it is common to see B = 99, B = 999, and so on.
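The difference between Eqs. (7) and (8) is easy to see in code. A direct transcription, with `tau_hat` and `tau_star` standing for the observed and re-randomized statistics:

```python
import numpy as np

def ri_p_values(tau_hat, tau_star):
    """Return (p1, p2) from Eqs. (7) and (8) for a two-sided test."""
    tau_star = np.asarray(tau_star, dtype=float)
    S = tau_star.size
    R = int(np.sum(np.abs(tau_star) > np.abs(tau_hat)))  # exceedances
    p1 = R / S                    # Eq. (7)
    p2 = (1 + R) / (S + 1)        # Eq. (8)
    return p1, p2

# With S = 9 (e.g., one treated among G = 10 clusters) and no exceedances,
# p1 = 0 while p2 = 0.1: the interval p value problem.
p1, p2 = ri_p_values(3.5, [0.2, -1.1, 0.8, -0.5, 1.9, -2.0, 0.3, 1.2, -0.7])
print(p1, p2)  # → 0.0 0.1
```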
With RI, however, we generally cannot choose S such that α(S + 1) is an integer.
When it is not an integer, we could in principle use any p value between p̂1∗ and p̂2∗. Thus,
two by, in effect, flipping a coin. This means that two different researchers using
the same data set will randomly obtain different p values.
Formally, the WBRI procedure for generating the tj∗ and tj∗b statistics is as
follows:
(1) Estimate Eq. (4) by OLS and calculate t for the coefficient of interest using
CRVE standard errors.
(2) Obtain S test statistics tj∗ by re-randomization, as in Section 3.
(3) Estimate a restricted version of Eq. (4) with β4 = 0 and retain the restricted
estimates β̃ and residuals ũ.
(4) For the original test statistic and each of the S possible re-randomizations,
indexed by j = 0, . . . , S, construct B bootstrap samples indexed by b, say
yj∗b , using the restricted WCB procedure discussed in Section 2.1. For each
bootstrap sample, estimate Eq. (4) using yj∗b and calculate a bootstrap t statistic
tj∗b based on CRVE standard errors.5
(5) Use one of Eq. (7) or Eq. (8) to calculate a p value for t based on the
(B + 1)(S + 1) − 1 bootstrap and randomized test statistics.
Since every possible set of G1 clusters is “treated” in the bootstrap samples,
the number of bootstrap test statistics is B × C(G, G1) = B(S + 1). In addition, there
are C(G, G1) − 1 = S statistics based on the original sample. Thus, the total number
of test statistics is B(S + 1) + S = (B + 1)(S + 1) − 1. We suggest choosing B
so that this number is at least 1,000.
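The counting rule above, and the choice of B implied by it, can be verified directly. The function names here are ours, not the authors':

```python
from math import ceil, comb

def wbri_counts(G, G1, B):
    """Counts of statistics in the WBRI procedure for G clusters, G1 treated."""
    S = comb(G, G1) - 1              # re-randomizations of the treated set
    n_boot = B * (S + 1)             # B bootstrap statistics per assignment
    total = (B + 1) * (S + 1) - 1    # bootstrap plus randomization statistics
    return S, n_boot, total

def min_B(G, G1, target=1000):
    """Smallest B such that (B + 1)(S + 1) - 1 >= target."""
    S = comb(G, G1) - 1
    return ceil((target + 1) / (S + 1)) - 1

# One treated cluster among G = 15: S = 14, and B = 66 already gives
# (66 + 1)(14 + 1) - 1 = 1,004 >= 1,000 total test statistics.
print(wbri_counts(15, 1, 66))  # → (14, 990, 1004)
print(min_B(15, 1))            # → 66
```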
The number of possible bootstrap DGPs is only 2^G if one uses the Rademacher
distribution. Therefore, when G is small, it is better to use an alternative bootstrap
distribution, such as the 6-point distribution suggested in Webb (2014), for which
the number of possible bootstrap DGPs is 6^G.
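Webb's 6-point distribution puts probability 1/6 on each of ±√(1/2), ±1, and ±√(3/2). A sketch of drawing cluster-level auxiliary weights (function name ours):

```python
import numpy as np

SIX_POINT = np.array([-(1.5) ** 0.5, -1.0, -(0.5) ** 0.5,
                      (0.5) ** 0.5, 1.0, (1.5) ** 0.5])

def draw_weights(G, dist, rng):
    """One auxiliary weight per cluster, for the wild cluster bootstrap."""
    if dist == "rademacher":
        return rng.choice(np.array([-1.0, 1.0]), size=G)
    return rng.choice(SIX_POINT, size=G)  # Webb's 6-point distribution

rng = np.random.default_rng(0)
v = draw_weights(12, "six_point", rng)
# With G = 12: 2**12 = 4,096 Rademacher DGPs versus 6**12 = 2,176,782,336
print(v.shape)  # → (12,)
```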
In general, it makes sense to use the WBRI procedure only when the RI-t
procedure does not provide enough tj∗ for the interval p value problem to be
negligible. As a rule of thumb, we suggest using WBRI when G1 = 1 and G < 300,
or G1 = 2 and G < 30, or G1 = 3 and G < 15. Code for this procedure is available
from the authors.6
5. ALTERNATIVE PROCEDURES
In this section, we briefly discuss two very different procedures that can be used
instead of WBRI. Their performance will be compared with that of the latter in the
simulation experiments of Section 6.
Racine and MacKinnon (2007a) suggested a way to solve the interval p value
problem in the context of bootstrap tests. For those tests, the problem only
arises if computation is so expensive that making α(B + 1) an integer for all test
levels α of interest is infeasible. But since the problem arises quite frequently for
randomization tests, their procedure may be useful in this context.
Wild Bootstrap Randomization Inference for Few Treated Clusters 73
Recall the example of Canadian provinces given in Section 3.1. Suppose the
treated province has a more extreme outcome than any of the others, so that R = 0.
In the strict context of RI, all we can say is that the p value is between 0, according
to Eq. (7), and 0.10, according to Eq. (8). In saying this, however, we have made
no use of the actual values of t and the tj∗ . Only the location of |t| in the sorted list
affects either p value. If the outcome for the treated province differed a lot from
the outcomes for the other nine provinces, that is, if |t| were much larger than any
of the |tj∗ |, then the evidence against the null hypothesis would seem to be quite
strong. On the other hand, if |t| were just slightly larger than the largest of the |tj∗ |,
the evidence against the null would seem to be rather weak. But neither of the RI
p values takes this into account.
The procedure of Racine and MacKinnon (2007a) does take the values of the
actual and re-randomized test statistics into account. It is based on the smoothed
p value
p̂h = 1 − F̂h(t) = 1 − (1/S) ∑_{j=1}^{S} K(tj∗, t, h),  (9)
where F̂h (t) is a kernel-smoothed CDF of the tj∗ evaluated at the actual test statistic
t. When t is much more extreme than any of the tj∗ , it will surely lie in the far tail
of the CDF, and p̂h will be very small. On the other hand, when t is near one or
more of the tj∗ , p̂h is unlikely to be very small.
This procedure requires the choice of a kernel function K( · ) and a bandwidth
h. Because p̂h is an estimated probability rather than an estimated density, K( · )
must be a cumulative kernel. A natural choice is the cumulative standard normal
CDF. The choice of h is more difficult, and it matters a lot when S is small.
Based largely on simulation evidence, Racine and MacKinnon (2007a) suggested
choosing h = scS^{−4/9}, where s is the standard deviation of the tj∗, and the values of
c are 2.418, 1.575 and 1.3167 for α = 0.01, α = 0.05 and α = 0.10, respectively.7
Thus the bandwidth h should be larger the more variable are the tj∗ and the smaller
is the level of the test. The latter makes sense, because values of tj∗ will be scarcer
near more extreme values of t.
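A minimal sketch of the smoothed p value in Eq. (9), using the cumulative normal kernel and the bandwidth rule just described; the function names are ours, and exactly how s is estimated and the absolute values applied follows our reading of the text:

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def smoothed_p(t, t_star, c=1.575):
    """Kernel-smoothed p value, Eq. (9), applied to |t| and the |t*_j|."""
    t_star = np.asarray(t_star, dtype=float)
    S = t_star.size
    s = t_star.std(ddof=1)            # s estimated from the t*_j themselves
    h = s * c * S ** (-4.0 / 9.0)     # bandwidth rule h = s c S^(-4/9)
    a = np.abs(t_star)
    F_hat = np.mean([phi((abs(t) - aj) / h) for aj in a])
    return 1.0 - F_hat

t_star = [0.2, -1.1, 0.8, -0.5, 1.9, -2.0, 0.3, 1.2, -0.7]
print(smoothed_p(3.5, t_star))   # far beyond every |t*_j|: very small p
print(smoothed_p(1.0, t_star))   # inside the distribution: moderate p
```

Unlike Eqs. (7) and (8), the result depends on how far |t| lies from the |tj∗|, not just on its rank.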
The kernel smoothing procedure of Racine and MacKinnon (2007a) can evi-
dently be used with coefficients as well as t statistics, and we consider both methods
in the next section. Note that we estimate s from the tj∗ but then apply the procedure
to |t| and the |tj∗ |, whereas Racine and MacKinnon (2007a) considered upper tail
tests. Despite this difference, their smoothing procedure generally performed best
for tests at the 0.05 level when using the value of c recommended for that level,
namely, 1.575. All of the results reported in Section 6 use that value.
A radically different approach, which was studied in Donald and Lang (2007),
is to collapse the original, individual data to the cluster level. Instead of N
observations, the regression uses just G of them. Precisely how this works depends
on the model. Consider the simple case in which

yg = γ ιNg + β xg + ug,  g = 1, . . . , G,  (10)

where each of the subscripted vectors corresponds to a single cluster and has Ng
observations, and the vector ιNg contains Ng ones. If we take the averages of each
of the vectors here, we obtain ȳg = ι′Ng yg/Ng, x̄g = ι′Ng xg/Ng, and ūg = ι′Ng ug/Ng.
This allows us to write
ȳ = γ ιG + β x̄ + ū, (11)
where the G-vectors ȳ, x̄ and ū have typical elements ȳg , x̄g and ūg , respectively.
Since all the variables in regression (11) are cluster means, we refer to it as a
“cluster-means regression,” or CMR.
Donald and Lang (2007) argue that the ordinary t statistic for β = 0 in the
cluster-means regression (11) will be approximately distributed as t(G − 2) if
two restrictive but not unreasonable assumptions are satisfied. The first is that all
clusters are the same size, so that Ng = m for all g, with all of the ug having the
same covariance matrices. The second is either that the original error terms are
normally distributed or that m is sufficiently large, so that a central limit theorem
applies to the elements of ū.
The advantage of collapsing individual data to the cluster level, as in (11), is
that we no longer have to estimate a CRVE. Because of the first assumption, we
do not even have to use heteroskedasticity-robust standard errors. This allows us
to make inferences about β when just one cluster is treated. In that case, only
one element of x̄ is nonzero, but we can still make valid inferences because all G
observations are used to estimate the variance of the error terms.
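A sketch of this cluster-means regression with the t(G − 2) comparison. The function name is ours; this is an illustration under the simple model above, not the authors' code:

```python
import numpy as np

def cluster_means_regression(y, x, g):
    """Collapse to cluster means and run OLS with ordinary standard errors,
    as in Eq. (11); compare the returned t statistic with t(G - 2)."""
    groups = np.unique(g)
    G = groups.size
    ybar = np.array([y[g == gg].mean() for gg in groups])
    xbar = np.array([x[g == gg].mean() for gg in groups])
    X = np.column_stack([np.ones(G), xbar])
    coef, *_ = np.linalg.lstsq(X, ybar, rcond=None)
    resid = ybar - X @ coef
    s2 = resid @ resid / (G - 2)          # error variance, G - 2 df
    cov = s2 * np.linalg.inv(X.T @ X)
    return coef[1], coef[1] / np.sqrt(cov[1, 1])

# Example: 10 clusters of 20 observations, one treated cluster
rng = np.random.default_rng(3)
g = np.repeat(np.arange(10), 20)
x = (g == 0).astype(float)            # all observations in cluster 0 treated
y = 2.0 * x + rng.standard_normal(200)
beta_hat, t_stat = cluster_means_regression(y, x, g)
```

Even with only one treated cluster, all G collapsed observations contribute to the error-variance estimate, which is why the procedure remains feasible.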
6. SIMULATION EXPERIMENTS
In this section, we report the results of some simulation experiments designed to
assess the performance of WBRI and the procedures discussed in Section 5. The
model is very simple. It is essentially Eq. (4), but without any group dummies.
This model can also be thought of as Eq. (10) with time dummies instead of the
constant term. To make inference a bit more difficult, the error terms follow a
lognormal distribution. The group dummies are omitted because the error terms
have constant intra-cluster correlations of 0.05 (prior to being exponentiated), and
group dummies would soak up all of this correlation.
In the experiments that we report, there are G clusters, each with 100 observa-
tions divided evenly among 10 time periods. When a cluster is treated, treatment
is always for 5 of the 10 periods. Because all clusters are the same size, and the
number of treated observations per treated cluster is always the same, RI would
work perfectly if it were not for the interval p value problem. If we relaxed either
of these assumptions, of course, it would not work perfectly, even when G is large;
see MacKinnon and Webb (2018a).
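The error process described above can be sketched as follows. The parameter values come from the text; generating the equicorrelated normals from a common cluster shock plus idiosyncratic noise is our assumption about the implementation:

```python
import numpy as np

def simulate_cluster_errors(G, Ng=100, rho=0.05, rng=None):
    """Lognormal errors with constant intra-cluster correlation rho,
    imposed on the underlying normals before exponentiation."""
    if rng is None:
        rng = np.random.default_rng()
    # Equicorrelated normals: common cluster shock plus idiosyncratic noise
    common = rng.standard_normal((G, 1)) * np.sqrt(rho)
    idio = rng.standard_normal((G, Ng)) * np.sqrt(1.0 - rho)
    return np.exp(common + idio)

u = simulate_cluster_errors(G=10, rng=np.random.default_rng(7))
print(u.shape)  # → (10, 100)
```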
Figure 2 shows rejection frequencies for tests at the 0.05 level for three pro-
cedures (RI-t using p̂1, RI-t using p̂2, and WBRI-t using p̂2) for 56 different
experiments, each with 400,000 replications.8 The number of clusters varies from
5 to 60, and only one cluster is treated. For any value of G, this is the case for
which the interval p value problem is most severe, because S = G − 1 is small
unless there are many clusters. The number of bootstraps per randomization is
always chosen so that (B + 1)(S + 1) ≥ 1,000.
One striking feature of Figure 2 is that rejection frequencies for the two RI
procedures are almost exactly what theory predicts; see Figure 1. When G = 20,
40, and 60, the two RI p values yield precisely the same outcomes, as they must.
In every other case, however, p̂1∗ = R/S rejects more often than p̂2∗ = (R + 1)/(S + 1).
In Figure 2, the WBRI rejection frequencies are almost always between the two
RI rejection frequencies (although this is not true for G = 19 and G = 20), and
they are always quite close to 5% except when G is very small. This is what we
would like to see. However, it must be remembered that the figure deals with a
very special case in which all clusters are the same size and the error terms are
homoskedastic. The WBRI procedure cannot be expected to work any better than
the RI-t procedure when the treated clusters are smaller or larger than the untreated
clusters, or when their error terms have different variances.
In the experiments of Figure 2, we used the Rademacher distribution for G ≥ 19
and the 6-point distribution for G ≤ 18. This accounts for the sharp jump between
18 and 19. Rejection frequencies for small values of G would have been much
larger if we had used Rademacher, while those for large values of G would have
been noticeably smaller if we had used 6-point. It is not clear why WBRI tends
to under-reject for tests with G1 = 1 (but not for tests with G1 = 2; see below)
when the 6-point distribution is used. As MacKinnon and Webb (2017b, 2018b)
showed, OLS residuals have strange properties when just one cluster is treated. We
speculate that these cause the choice of the wild bootstrap auxiliary distribution to
be unusually important in this case.
Figure 3 shows rejection frequencies for tests at the 0.05 level for five pro-
cedures. For readability, the vertical axis has been subjected to a square root
transformation, and the conventional RI procedures have been dropped. The results
for WBRI-t are the same ones shown in Figure 2. Results for WBRI-β are also
shown, and it is evident that WBRI-β always rejects more often than WBRI-t.
The difference is quite substantial for small values of G, and WBRI-t is clearly
preferred.
In Figure 3, the two tests based on kernel-smoothed p values work remarkably
well for larger values of G (say G > 25), but they over-reject quite severely for
really small values. The over-rejection is more severe for RI-t than for RI-β. All
reported results are for c = 1.575, the value suggested in Racine and MacKinnon
(2007a) for tests at the 0.05 level. When the larger value of 2.418 (suggested for
tests at the 0.01 level) was used instead, all rejection frequencies were noticeably
lower.
The test based on the CMR (11) and the t(G − 2) distribution works remarkably
well even for very small values of G. It over-rejects slightly when G is small and
under-rejects slightly when G is large. It would have performed even better if the
errors had been normally rather than lognormally distributed. Since this test is very
easy to perform (it requires neither randomization nor the bootstrap), one might
well feel, on the basis of these results, that there is no point worrying about the
more complicated procedures based on individual data. However, this test does
have one serious limitation. As we will see below, it can be seriously lacking in
power.
All the experiments reported so far have just one treated group. This is generally
the most difficult case. In Figure 4, we show results for several tests with G1 = 2.
For these tests, the values of G vary from 5 to 20, and the values of S consequently
vary from 9 to 189. The CMR works extremely well for all values of G. WBRI-t
(using the 6-point distribution for G ≤ 13 and the Rademacher distribution for
G ≥ 14) under-rejects slightly for very small values of G but works very well
whenever G ≥ 11. Smoothed RI-t over-rejects for very small values of G but works
very well for G ≥ 9. However, the two procedures that are based on coefficients
instead of t statistics do not work particularly well.
The results presented so far may seem to suggest that the cluster-means regres-
sion is the most reliable, as well as the easiest, way to make inferences. However,
this approach has one serious shortcoming. When the value of the treatment vari-
able is not constant within groups, aggregation to the group level can seriously
reduce power.
Figure 5 presents results for a case with G = 10, G1 = 2 and S = 44, where the
value of β varies from 0 (the null hypothesis) to 1. The most striking result is that
tests based on the CMR (11) are much less powerful than the other tests. As noted
earlier, in all of our experiments there are 10 “years,” only 5 of which are treated.
Every cluster has 100 observations, 10 for each “year.” Therefore, the regressor
x̄g either takes the value 0 (when cluster g is not treated) or the value 0.5 (when
half the observations in cluster g are treated). Not surprisingly, this results in very
substantial power loss.9 Of course, if all the observations in every treated cluster
were treated, this power loss would not occur. Additional experiments suggest that,
when all “years” are treated, tests based on regression (11) have excellent power.
Some of the other results in Figure 5 are also interesting. The two procedures
based on t statistics, WBRI-t and smoothed RI-t, have power functions that are
essentially identical. In contrast, the two procedures based on coefficients are
noticeably more powerful than the ones based on t statistics. This is consistent
with results for RI tests in MacKinnon and Webb (2018a), and it makes sense,
because the tests based on coefficients do not have to estimate standard errors. The
somewhat higher power of WBRI-β relative to smoothed RI-β can probably be
attributed to its somewhat larger size (it rejects 5.92% of the time at the 5% level,
versus 4.49%).
It is important to remember that all the procedures we have discussed are very
sensitive to the assumption that the clusters are homogeneous. When that assump-
tion is violated, no RI procedure can be expected to perform well, even when G
and G1 are large. Since MacKinnon and Webb (2018a) documents the mediocre
performance of RI tests for a number of cases where cluster sizes vary, there is
no need to perform similar experiments here. In general, RI tests tend to over-
reject when the treated clusters are relatively small and under-reject when they are
relatively large.
In Figure 6, we investigate the effects of a particular type of heteroskedastic-
ity which was not studied in MacKinnon and Webb (2018a). Instead of the error
terms being homoskedastic, their standard deviation is twice as large for treated
observations as for untreated ones. Whether this is a realistic specification is debat-
able, although it does not seem unreasonable that some treatments could affect the
second moment of the outcome as well as the first.
In both panels of Figure 6, G varies between 5 and 20, as in Figures 3 and
4. In the left panel, G1 = 1, and in the right panel, G1 = 2. It is evident that
no method yields reliable inferences. The results for G1 = 2 are generally better
than for G1 = 1, but they are far from satisfactory. Moreover, the performance
of the cluster-means regression and of the two methods based on RI-β actually
deteriorates as G increases when G1 = 2.
7. EMPIRICAL EXAMPLE
In this section, we consider an empirical example from Decarolis (2014). Part of
the analysis deals with how the introduction of first price auctions (FPA) in Italy
                                Model 1           Model 2

Panel A
β̂                               12.18             6.14
t Statistic                     14.86             7.82
PA-year clustering (CI)         (9.54, 14.81)     (3.55, 8.72)
PA clustering (CI)              (10.42, 13.94)    (4.45, 7.82)
CMR p value                     0.0203            0.6698
Conley–Taber (CI)               (10, 16)          (5, 8)
RI-β p values                   (0.000, 0.063)    (0.133, 0.188)
Smoothed RI-β p value           0.0000            0.0885
RI-t p values                   (0.000, 0.063)    (0.067, 0.125)
Smoothed RI-t p value           0.0000            0.0716
WBRI-t p value                  0.0000            0.0799
WBRI-β p value                  0.0000            0.0595
N                               1,262             1,262
G                               15                15

Panel B
β̂                               8.71              5.69
t Statistic                     19.22             8.34
PA-year clustering (CI)         (6.55, 10.85)     (3.19, 8.18)
PA clustering (CI)              (7.75, 9.66)      (4.25, 7.12)
CMR p value                     0.0041            0.4684
Conley–Taber (CI)               (7, 14)           (4, 8)
RI-β p values                   (0.000, 0.056)    (0.118, 0.187)
Smoothed RI-β p value           0.0004            0.1046
RI-t p values                   (0.000, 0.056)    (0.058, 0.111)
Smoothed RI-t p value           0.0000            0.0570
WBRI-t p value                  0.0014            0.0181
WBRI-β p value                  0.0000            0.0446
N                               1,355             1,355
G                               18                18

Regressors
Fiscal efficiency               Yes               No
PA-specific time trends         No                Yes

Notes: Entries of the form (0.000, 0.067) represent the p value pairs (p̂1∗, p̂2∗). WBRI p values are
obtained with B = 700 for Panel A and B = 600 for Panel B, ensuring that B × C(G, 1) > 10,000 for both
panels.
WBRI p value using the same two samples and two models. We do this clustering
only by PA. As expected, the RI-β p values are identical to the RI-t p values
for Model 1, because there is only one treated cluster; see MacKinnon and Webb
(2018a) for details.11 The four RI p value intervals for Model 1 contain 0.05, while
the four RI p value intervals for Model 2 contain 0.10. In the former case, this
makes it impossible to tell whether we should reject or not reject at the 0.05 level.
In the latter case, we evidently cannot reject at the 0.05 level, but it is impossible
to tell whether we should reject or not reject at the 0.10 level.
The WBRI-t p values shown in the table are obtained with B = 700 for Panel
A and B = 600 for Panel B. This means that there are 701 × C(15, 1) − 1 = 10,514 and
601 × C(18, 1) − 1 = 10,817 bootstrap/randomization t statistics, respectively. Under
Model 1, we find WBRI-t p values that are very close to p̂1∗ and highly signif-
icant. Under Model 2, we again find that the WBRI-t p value is very close to
p̂1∗ for the municipality sample, but below p̂1∗ for the county sample. Except for
Model 2 using the county sample, the smoothed RI-t p values are very similar to
the WBRI-t ones. The WBRI-β p values are in general similar to both the WBRI-t
values and the smoothed RI-β p values. Interestingly, for Model 2 using both
samples, the WBRI-β p values are below p̂1∗ .
We also consider an aggregation procedure, which we call cluster-means regres-
sion, or CMR, that is similar to one suggested in Donald and Lang (2007). This
procedure yields sensible results for Model 1 for both samples. However, for
Model 2, it yields much larger p values than any of the other procedures. This is
probably a consequence of the fact that Model 2 contains both a DiD term for just
one cluster and a time trend for only that cluster, which does not fit easily
into the aggregation framework of Eq. (11).
The evidence against the null hypothesis is probably even stronger than these
results suggest. In MacKinnon and Webb (2018a), we showed that RI procedures
tend to under-reject when the treated clusters are unusually large. Since the only
treated cluster is either the Municipality or the County of Turin, and each of those
is the largest cluster in its sample, we would expect all forms of RI p value to be
biased upwards. Thus the fact that the WBRI-t test rejects at the 0.001 level for
Model 1 for both data sets and at either the 0.05 or 0.10 level for Model 2 suggests
that there is quite strong evidence against the null hypothesis.
8. CONCLUSION
We introduce a bootstrap-based modification of RI which can solve the problem
of interval p values when there are few possible randomizations, a problem that
often arises when there are very few treated groups. This procedure, which we
call WBRI, is easiest to understand as a modified version of the WCB. Like the
WCB, it generates a large number of bootstrap samples and uses them to compute
bootstrap test statistics. However, unlike the WCB, only some of the bootstrap test
statistics are testing the actual null hypothesis. Most of them are testing fictional
null hypotheses obtained by re-randomizing the treatment.
The WBRI procedure can be used to generate as many bootstrap test statistics
as desired by making B large enough. Thus, it can solve the problem of interval
p values. However, it shares some of the properties of RI procedures, which
perform conventional RI based on either coefficients or cluster-robust t statistics;
see MacKinnon and Webb (2018a). In particular, like RI-β and RI-t, WBRI-β
and WBRI-t can be expected to over-reject (or under-reject) when the treated
clusters are smaller (or larger) than the control clusters. This tendency is greater for
WBRI-β than for WBRI-t. Thus, we cannot expect WBRI procedures to yield
reliable inferences in every case.
We also consider two other procedures. One of them applies the kernel-
smoothed p value approach of Racine and MacKinnon (2007a) to RI. This method
seems to perform very similarly to WBRI in many cases. The other, based on
Donald and Lang (2007), aggregates individual data to the cluster level and uses
the t distribution with degrees of freedom equal to the number of clusters minus 2.
This cluster-means regression approach can work remarkably well in some cases,
but it can be seriously lacking in power when not all observations within treated
clusters are treated.
ACKNOWLEDGEMENTS
The WBRI procedure discussed in this paper was originally proposed in a working
paper circulated as “Randomization Inference for Difference-in-Differences with
Few Treated Clusters.” However, a revised version of that paper no longer discusses
the WBRI procedure. We are grateful to Jeffrey Wooldridge, seminar participants
at the Complex Survey Data conference on October 19–20, 2017 and at New York
Camp Econometrics XIII on April 6–8, 2018, and two anonymous referees for
helpful comments. This research was supported, in part, by grants from the Social
Sciences and Humanities Research Council of Canada. Joshua Roxborough and
Oladapo Odumosu provided excellent research assistance.
NOTES
1. One of the earliest CRVEs was suggested in Liang and Zeger (1986). Alternatives
to (3) have been proposed in Bell and McCaffrey (2002) and Imbens and Kolesár (2016),
among others.
2. Of course, even when G1 is not small, the matrices Ng⁻¹Xg′ûgûg′Xg in (3) do not
estimate the corresponding matrices Ng⁻¹Xg′ΩgXg in (2) consistently, because the former
matrices necessarily have rank 1. But the summation in the middle of expression (3), appro-
priately normalized, does consistently estimate the matrix X′ΩX, appropriately normalized.
See Djogbenou et al. (2018) for details.
3. Because vg∗b takes the same value for all observations within each group, we would
not want to use the Rademacher distribution if G were smaller than about 12; see Webb
(2014), which proposes an alternative for such cases.
4. For more details on how to implement the wild cluster bootstrap in Stata at minimal
computational cost, see Roodman, MacKinnon, Nielsen, and Webb (2019).
5. Note that, in this procedure, B denotes the number of bootstrap samples per re-
randomization. The total number of bootstrap samples is B(S + 1). It might seem tempting
to use the same B bootstrap samples for every re-randomization. However, this would create
dependence among the S different test statistics that depend on each bootstrap sample. This
sort of dependence should be avoided.
6. Code for the WBRI procedure can be found at https://sites.google.com/site/matthewdwebb/code.
7. There is a typo on page 5,955 of Racine and MacKinnon (2007a) which causes the
optimal values of c for α = 0.01 and α = 0.10 to be reversed. That this is incorrect can be
seen from Figure 6 of that paper.
8. WBRI would have rejected slightly more often if we had used p̂1 instead of p̂2 ; the
difference in rejection frequencies was almost always less than 0.001.
9. We note that Donald and Lang (2007) did not suggest using Eq. (11) for DiD models
in the way that we have used it here.
10. Following the original paper, confidence intervals for the CT procedure are rounded
to the nearest integer values.
11. We should not expect them to be the same for Model 2, however, because there are
two variables that need to be randomized, the DiD variable and the trend-treatment variable.
REFERENCES
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case
studies: Estimating the effect of California’s tobacco control program. Journal of the American
Statistical Association, 105(490), 493–505.
Bell, R. M., & McCaffrey, D. F. (2002). Bias reduction in standard errors for linear regression with
multi-stage samples. Survey Methodology, 28(2), 169–181.
Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-
differences estimates? The Quarterly Journal of Economics, 119(1), 249–275.
Bester, C. A., Conley, T. G., & Hansen, C. B. (2011). Inference with dependent data using cluster
covariance estimators. Journal of Econometrics, 165(2), 137–151.
Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference
with clustered errors. The Review of Economics and Statistics, 90(3), 414–427.
Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster robust inference. Journal of
Human Resources, 50(2), 317–372.
Canay, I. A., Romano, J. P., & Shaikh, A. M. (2017). Randomization tests under an approximate
symmetry assumption. Econometrica, 85(3), 1013–1030.
Carter, A. V., Schnepel, K. T., & Steigerwald, D. G. (2017). Asymptotic behavior of a t test robust to
cluster heterogeneity. Review of Economics and Statistics, 99(4), 698–709.
Conley, T. G., & Taber, C. R. (2011). Inference with “Difference in Differences” with a small number
of policy changes. The Review of Economics and Statistics, 93(1), 113–125.
VARIANCE ESTIMATION FOR
SURVEY-WEIGHTED DATA USING
BOOTSTRAP RESAMPLING METHODS:
2013 METHODS-OF-PAYMENT SURVEY
QUESTIONNAIRE
ABSTRACT
88 HENG CHEN AND Q. RALLYE SHEN
1. INTRODUCTION
Variance estimates are crucial for building confidence intervals to assess disper-
sion, and for implementing statistical inferences to test various hypotheses. In
general, survey variance estimates depend on the specific weighting procedure,
not just on the numerical values of the weights; variance estimates that disregard
the weighting procedure are often biased. Hence, an unbiased estimation method
must incorporate two sources of randomness from the weighting procedure: (1)
from the sampling design, which, in our case, is measured by the selection proba-
bility design weights induced by complicated sampling and (2) from the calibration
procedure, which involves adjusting the sample counts to match the population
counts through calibrated weights. If we ignore either source of randomness, the
variance estimates will be incorrect.
To account for the randomness from the sampling design, it is important to under-
stand design-based inference. While the units in the population as well as
their characteristics are assumed fixed, the randomness in the design-based statis-
tics comes only from randomization performed at the sample selection stage. The
design-based distributions are obtained by enumerating all samples possible under
a given design scheme and associating the numeric values of the statistics of interest
with the probabilities of the samples they are based on.
As for the randomness from the calibration procedure, adjusting design weights
would make final weights depend on the particular calibration method, in which
incorporation of population level information can lead to statistically more accurate
estimates and better inference. Such modifications will affect variances of weighted
estimates, because calibrated weights are functions of the sampling design, which
introduces randomness from the sample selection stage. In contrast to the non-
random design weights, calibrated weights are usually random (Lu & Gelman,
2003).
This paper discusses variance estimations of the weighted means and pro-
portions used in Henry, Huynh, and Shen (2015), whose sampling design is an
approximate stratified two-stage sampling,1 and the classical raking (or iterative
proportional fitting (IPF)) procedure is chosen for calibration. In the stratified two-
stage sampling, the population is divided into nonoverlapping strata. From each
stratum, a sample of primary sampling units (PSUs) is taken with replacement,
and from each PSU, samples of ultimate units are taken. Stratification can improve
the efficiency of estimates, while allowing for straightforward statistical analysis
within the strata. PSUs form clusters and allow the user to reduce costs in situa-
tions where it is impossible or impractical to obtain the complete list of ultimate
observation units.
Besides the sampling design, this paper considers variance estimation of raking
ratio estimators used in Vincent (2015). Two types of raking ratio estimators are considered.
This paper considers two forms of raking ratio estimator, the classical esti-
mator obtained by the application of IPF as well as an estimator, which may be
interpreted as a maximum likelihood estimator within a certain framework. The
GREG estimator is also considered as a benchmark for comparison.4 Consider the
estimation of a population total TY of a survey variable Y taking values yi for units
i in a population U :
T_Y = Σ_{i∈U} y_i,

which may be estimated by the weighted sum T̂_Yω = Σ_{i∈s} ω_i y_i over the sample s,
where ω_i is a given weight, referred to here as the initial weight. This weight
may be the Horvitz–Thompson (H–T) weight ω_i = 1/π_i, where π_i is the selection
probability of unit i. Note that ω_i is fixed (nonrandom) and known before the
survey is conducted. It is usually computed
(nonrandom) and known before the survey is conducted. It is usually computed
as the ratio between the census stratum count and the service-agreement targeted
count.
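As a concrete illustration of the H–T weight ω_i = 1/π_i, the following sketch computes a weighted total; the numbers are toy values, not survey data:

```python
def ht_total(y, pi):
    """Horvitz-Thompson estimate of a population total: each sampled
    value y_i is weighted by omega_i = 1 / pi_i, the inverse of the
    unit's selection probability."""
    return sum(yi / p for yi, p in zip(y, pi))

# Three sampled units with selection probabilities 0.1, 0.2 and 0.5.
est = ht_total([3.0, 5.0, 2.0], [0.1, 0.2, 0.5])  # 30 + 25 + 4 = 59
```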
The classical raking adjustment makes use of information on the population counts
over the categories of two or more categorical auxiliary variables. This type of
adjustment is used in the weighting and calibration procedure. For example, sup-
pose three sets of post-strata are used for calibration, and let xi denote the vector
of indicator variables for these categories:

x_i′ = (δ_{1··i}, …, δ_{A··i}, δ_{·1·i}, …, δ_{·B·i}, δ_{··1i}, …, δ_{··Ci}),

where A, B and C denote the numbers of categories of the three auxiliary variables,
δ_{a··i} = 1 if unit i is in category a of the first auxiliary variable and 0 otherwise,
δ_{·b·i} = 1 if unit i is in category b of the second auxiliary variable and 0 otherwise,
and so on. The population total TX of this vector thus contains the population counts
in each of the (marginal) categories for each of the three auxiliary variables. It is
assumed that TX is given and that xi is known for i ∈ s.
The classical raking adjustment involves iterative modifications of the initial
weights, ωi , in a multiplicative way to adjust the weights wi with the aim of
satisfying the calibration equations:
Σ_{i∈s} w_i x_i = T_X.
The multiplicative adjustment depends only upon the cell in the contingency
table formed by the auxiliary variables; that is, we may write w_i = ω_i h(x_i), where
the multiplicative adjustment factor h(x_i) is fixed for all units with common values
of the auxiliary variables. Let N̂_ω(x) and N̂_w(x) denote the weighted estimates
of the population counts in the cell of the table defined by x, using the weights
ω_i and w_i, respectively. Then we may write N̂_w(x) = h(x)N̂_ω(x), where IPF
makes use of standard post-stratification.5 The usual iterative modification of the
weights involves IPF. Ireland and Kullback (1968) demonstrate that this method
converges to a solution which minimizes
Σ_x N̂_w(x) log(N̂_w(x)/N̂_ω(x)),

subject to the calibration equations, where the sum is over all cells defined by x.
This objective function may alternatively be expressed as
Σ_{i∈s} w_i log(w_i/ω_i),
that is, under convergence of the iterative algorithm, the wi minimizes the above
function, subject to solving the calibration equations. Yet another way to express
this objective function is
Σ_{i∈s} ω_i G_M(w_i/ω_i),
where G_M(u) = u log(u) − u + 1 is the multiplicative distance measure considered
by Deville, Sarndal, and Sautory (1993), and, in the calibration equations, the
total T_X is taken as a given constant. Using the standard Lagrange
multiplier method for the constrained minimization, the multiplicative adjustment
factors may be expressed as
w_i = ω_i F_M(x_i′λ̂),

where F_M(u) = g_M^{−1}(u) denotes the inverse function of g_M(u) = dG_M(u)/du and
λ̂ is the Lagrange multiplier, which solves the calibration equations. It follows
from the definition of G_M(u) above that g_M(u) = log(u) and F_M(u) = exp(u). Hence λ̂
solves

Σ_{i∈s} ω_i exp(x_i′λ̂) x_i = Σ_{i∈U} x_i.
The estimator T̂_{Y,Rak} = Σ_{i∈s} w_i y_i may be used if y_i is a scalar or a vector, since
the scalar weight, w_i, does not depend upon y_i.
The maximum likelihood raking estimator instead minimizes an objective function
that is proportional to minus the log likelihood in the case of simple random
sampling with replacement. Equivalently, the objective function may be expressed,
summed over sample units, as
−Σ_{i∈s} ω_i log(w_i/ω_i),
or as
Σ_{i∈s} ω_i G_{ML}(w_i/ω_i),
where GML (u) = u − 1 − log(u).
One of the major difficulties for the empirical likelihood inferences under gen-
eral unequal probability sampling designs is to obtain an informative empirical
likelihood function for the given sample. The likelihood depends necessarily on
the sampling design, and a complete specification of the joint probability function
of the sample is usually not feasible. Because of this difficulty, Chen and
Sitter (1999) propose maximizing the pseudo empirical log-likelihood

n Σ_{h=1}^{L} W_h Σ_{i∈S_h} (ω_i / Σ_{j∈S_h} ω_j) log(w_{hi}),

where W_h = N_h/N, N_h is the stratum size and N = Σ_{h=1}^{L} N_h.
A third distance measure is

Σ_{i∈s} ω_i G_{LM}(w_i/ω_i),

where G_{LM}(u) = (1/2)(u − 1)². This leads to the generalized regression estimator
T̂_{Y,GREG} = Σ_{i∈s} w_i y_i = T̂_{Yπ} + (T_X − T̂_{Xπ})′B̂_s,

where T̂_{Yπ} = Σ_{i∈s} (1/π_i) y_i and T̂_{Xπ} = Σ_{i∈s} (1/π_i) x_i denote the H–T estimators of
T_Y and the vector T_X, and

B̂_s = (Σ_{i∈s} ω_i x_i x_i′)^{−1} Σ_{i∈s} ω_i x_i y_i.
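For intuition, here is the GREG estimator specialized to a single scalar auxiliary variable, so that B̂_s reduces to a ratio rather than a matrix expression. The data values are illustrative:

```python
def greg_total(y, x, pi, TX):
    """GREG estimator with one scalar auxiliary variable:
    T_hat = T_Y_pi + (T_X - T_X_pi) * B_s,
    where B_s = (sum w x^2)^(-1) * sum w x y and w_i = 1/pi_i."""
    w = [1.0 / p for p in pi]
    TY_pi = sum(wi * yi for wi, yi in zip(w, y))      # H-T total of y
    TX_pi = sum(wi * xi for wi, xi in zip(w, x))      # H-T total of x
    Bs = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) / \
         sum(wi * xi * xi for wi, xi in zip(w, x))
    return TY_pi + (TX - TX_pi) * Bs

# When y is exactly proportional to x, GREG recovers the exact total:
# y = 2x and TX = 10 give T_hat = 2 * 10 = 20.
est = greg_total(y=[2.0, 4.0, 6.0], x=[1.0, 2.0, 3.0], pi=[0.5, 0.5, 0.5], TX=10.0)
```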
3. VARIANCE ESTIMATION
We consider estimating the asymptotic variance of the converged estimator, i.e., the
estimator T̂Y Rak , where the weights wi solve the constrained optimization problem.
This asymptotic variance is assumed to be a sufficiently close approximation to the
asymptotic variance of the estimator obtained after the finite number of iterations
used in practice.
We assume that in large samples, λ̂ converges to a value λ. Deville and Sarndal
(1992) assume that λ = 0, but this property is based upon the assumption that the
estimator of TX obtained by applying the initial weights ωi is consistent. This
assumption will often be false in the case of nonresponse and we prefer not to
make this assumption.
We allow the function F (·) to be general and not necessarily equal to FM (·).
We first expand the adjusted weight w_i = ω_i F(x_i′λ̂) about λ to obtain

w_i ≈ ω_i [F_i + f_i x_i′(λ̂ − λ)],

where F_i = F(x_i′λ) and f_i = F′(x_i′λ), and hence

λ̂ − λ ≈ (Σ_{i∈s} ω_i f_i x_i x_i′)^{−1} (T_X − Σ_{i∈s} ω_i F_i x_i).
We are assuming here the first matrix is non-singular. It may be necessary to
drop redundant variables from xi to achieve this. For example, in the three-way
case above, each of the sums of the indicator variables δa..i , δ.b.i and δ..ci across a,
b and c, respectively, equals to 1 and it is natural to drop two of these indicators
to avoid singularity.
where w_i = ω_i F_i and B = (Σ_{i∈s} ω_i f_i x_i x_i′)^{−1} Σ_{i∈s} ω_i f_i y_i x_i. We assume that B
converges to a finite limit β in the asymptotic framework. It follows from this and
the other approximation assumptions above, in particular that the initial weights
ω_i are fixed, that the (normalized) asymptotic distribution of B[T_X − Σ_{i∈s} ω_i F_i x_i]
is the same as that of β[T_X − Σ_{i∈s} ω_i F_i x_i]. Hence, the (normalized) asymptotic
variance of T̂_{Y,Rak} is the same as that of Σ_{i∈s} z_i, where

z_i = w_i (y_i − βx_i).
where e_{hjk} = y_{hjk} − x_{hjk}B̂ are the estimated residuals and B̂ is the estimator of
the multiple regression coefficient, which may be constructed using either design
(Deville & Sarndal, 1992) or raking weights. If nonresponse exists, we should use

z_{hj} = Σ_k w_{hjk} e_{hjk},    (3)

where w_{hjk} = ω_{hjk} F̂_{hjk}, e_{hjk} = y_{hjk} − x_{hjk}B̂ and
B̂ = (Σ_{i∈s} ω_i f̂_i x_i x_i′)^{−1} Σ_{i∈s} ω_i f̂_i y_i x_i.
V̂(T̂_{Y,Rak}) = Σ_{t=1}^{T} c_t (T̂^{(t)}_{Y,Rak} − T̂_{Y,Rak})²,
where ct is a constant which depends on the replication method. There are three
main resampling methods: balanced repeated replication (BRR), the jackknife and
the bootstrap (Rust & Rao, 1996). Each involves creating multiple replicates of
the data set by repeatedly sampling from the original sample.
We focus on the bootstrap resampling method.6 We do not use BRR because it is
suitable only for a stratified clustered sampling design where nh = 2 for all
strata. The main reason that we choose the bootstrap over the jackknife is
that the traditional delete-1 jackknife variance estimator is inconsistent for
non-smooth functions (e.g., sample quantiles). The consistent delete-d jackknife
method requires a nontrivial specification for d, where there is a complicated
interplay between the smoothness of the estimate and the parameter d.7 The bootstrap,
on the other hand, will generally work for these non-smooth estimates, as discussed
in Ghosh, Parr, Singh, and Babu (1984). Besides the major advantage of the boot-
strap over the jackknife for non-smooth estimates, we prefer the bootstrap for two
other reasons: (1) less computational burden: as pointed out in Kolenikov (2010),
the number of replications required by the delete-d jackknife increases notably
with d, especially when applied to list-based establishment surveys; (2) better
approximation of distributions: the bootstrap can be used for estimating distributions and constructing more
accurate one-sided confidence intervals, while the jackknife is typically only used
for estimating variances.
When survey data are released for public use, confidentiality of the respondents
must be protected: geographic information is provided in a coarse form, incomes
are top-coded, small racial groups are conglomerated, etc. Variance estimation via
linearization requires that stratum and PSU identifiers h and j are known for each
observation. If the data provider decides that releasing strata and PSU information
poses the risk that individual subjects could be identified, alternative variance
estimation methods must be used.
To overcome the above limitation, instead of recreating the sample in each
replicate, we implement the more practical method of generating replicate weights.
The construction of the replicate weights wi(t) involves first taking the initial weights
ωi . From these, a set of initial replication weights ωi(t) , t = 1, . . . , T , is constructed
according to the replication method and the sampling scheme. Next the raking
adjustment method is applied to each of these T sets of weights separately. This
generates the required weights wi(t) . This approach can be applied to a wide class
of adjustment methods including classical raking, “maximum likelihood” raking
and GREG. These replicate weights protect the privacy of sampling units and
have the advantage of incorporating strata information as well as adjustments for
nonresponse and noncoverage.
To construct the tth replicate bootstrap sample under stratified two-stage
sampling, we follow the steps below:
• Step 1: Take a simple random sample with replacement of nh units from the
original data in stratum h, repeating independently across strata;
• Step 2: Modify the design weights as in the rescaling bootstrap from Rao, Wu,
and Yue (1992) by applying the following formula (before the raking procedure):

ω^{(t)}_{hjk} = [1 − m_h^{1/2}(n_h − 1)^{−1/2} + m_h^{1/2}(n_h − 1)^{−1/2}(n_h/m_h) m^{(t)}_{hj}] ω_{hjk},    (4)

where m^{(t)}_{hj} is the bootstrap frequency of unit hj, that is, the number of times
PSU hj was used in forming the tth bootstrap replicate;
• Step 3: Implement the raking procedure to obtain the replicate weight w^{(t)}_{hjk}. Notice
that the weight adjustment takes place after the internal scaling of Step 2;
• Step 4: Estimate the parameter of interest, T̂^{(t)}_{Y,Rak}. Repeat T times and estimate the
variance using

V̂_B(T̂_{Y,Rak}) = (1/T) Σ_{t=1}^{T} (T̂^{(t)}_{Y,Rak} − T̂_{Y,Rak})².    (5)
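Steps 1, 2 and 4 can be sketched as follows for the design-weight part; Step 3's raking adjustment would be applied to each replicate separately and is omitted here. The stratum layout and T are illustrative. With the common choice m_h = n_h − 1, the bracket in (4) simplifies to (n_h/m_h) m^{(t)}_{hj}:

```python
import random

def rwy_bootstrap_weights(strata, T=500, seed=1):
    """Rao-Wu-Yue rescaling bootstrap replicate design weights.

    strata: dict mapping stratum h -> {psu j: design weight of PSU j}.
    For each replicate, m_h = n_h - 1 PSUs are resampled with replacement
    from the n_h PSUs of stratum h (n_h >= 2), and each PSU weight is
    rescaled as
      w = [1 - sqrt(m_h/(n_h-1)) + sqrt(m_h/(n_h-1)) * (n_h/m_h) * m_hj] * w0,
    where m_hj counts how often PSU hj was drawn.
    Returns a list of T dicts {(h, j): replicate weight}.
    """
    rng = random.Random(seed)
    reps = []
    for _ in range(T):
        rep = {}
        for h, psus in strata.items():
            ids = list(psus)
            nh = len(ids)
            mh = nh - 1
            draws = [rng.choice(ids) for _ in range(mh)]   # resample PSUs
            a = (mh / (nh - 1)) ** 0.5                     # = 1 when m_h = n_h - 1
            for j in ids:
                freq = draws.count(j)                      # bootstrap frequency m_hj
                rep[(h, j)] = (1 - a + a * (nh / mh) * freq) * psus[j]
        reps.append(rep)
    return reps

# One stratum with three PSUs, each with design weight 2.0.
strata = {"h1": {"p1": 2.0, "p2": 2.0, "p3": 2.0}}
reps = rwy_bootstrap_weights(strata, T=1000)
```

A useful check on the rescaling: within every replicate the weights still sum to the stratum's total design weight, and across replicates each PSU's weight averages back to its design weight.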
We draw units from the 2013 MOP such that the distribution of the strata size
vector n follows the multinomial (n; π_1N_1/Σ_{h=1}^{L} π_hN_h, …, π_LN_L/Σ_{h=1}^{L} π_hN_h); then
we simulate the response y within each stratum using the Poisson distribution,
given the “true” value. A multiplicative nonresponse model is taken into account
in this study. The assignment of nonresponse probabilities φ_{hj} to each selected
cluster of the population takes into account characteristics of the clusters, such as
region, age and gender. Once the sample is drawn, in order to obtain the subset
of respondents, Bernoulli distributions with parameter φhj , for h = 1, . . . , H and
j = 1, . . . , Jh , are used to generate an indicator variable Ihj that takes value 1 if
cluster (hj ) responds and value 0 if cluster (hj ) does not respond. The multiplicative
nonresponse model is
Pr(nonresponse) = φ_{hj} = 0.24^{I_{Ontario}} · 1.5^{I_{under 35}} · 1.4^{I_{female}}.
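Drawing the cluster-level response indicators from the model above can be sketched as follows. The cluster ids and probabilities are toy values, and phi is used here as the response probability, i.e. one minus the nonresponse probability:

```python
import random

def simulate_response(clusters, phi, seed=0):
    """Draw cluster-level response indicators I_hj ~ Bernoulli(phi_hj),
    where phi_hj is the response probability of cluster (h, j)."""
    rng = random.Random(seed)
    return {hj: 1 if rng.random() < phi[hj] else 0 for hj in clusters}

# Toy clusters; phi would be 1 - Pr(nonresponse) from the multiplicative model.
clusters = [("h1", 1), ("h1", 2), ("h2", 1)]
phi = {("h1", 1): 1.0, ("h1", 2): 0.0, ("h2", 1): 0.5}
I = simulate_response(clusters, phi)
```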
For each simulation, we compute the weighted estimates, which we call T̂Y Rak:r .
In addition, we compute five different variance estimates:
• SRS: This variance estimation treats the sample as obtained by simple random
sampling, rather than the stratified two-stage sampling. Although not based
on a realistic model, this is a commonly used approximation because it is so
simple to compute. With the weights wi treated as constants, the SRS estimated
variance is
V̂_{SRS}(T̂_{Y,Rak}) = Σ_{i=1}^{n} w_i² var(y_i)
              = Σ_{i=1}^{n} w_i² var(y_1)
              = (Σ_{i=1}^{n} w_i²) Σ_{i=1}^{n} w_i (y_i − T̂_{Y,Rak})².    (6)
Notice that (6) ignores both the sampling design and weighting procedure.
• No raking: This estimate is based on the linearization variance estimate of
(1). Notice that it is equivalent to treating raked weights as inverse-probability
sampling weights as in Lu and Gelman (2003). Thus, it does not take into
account the effect of the raking procedure on the variance.
• Full response: This estimate is based on (2); however, such computation
assumes λ = 0. As discussed in Section 3, this assumption will often be false
in the case of nonresponse.
The true variance is computed as

V_{TRUE} = (1/R) Σ_{r=1}^{R} (T̂_{Y,Rak:r} − Ê[T̂_{Y,Rak}])²,
where T̂Y Rak:r is the value of T̂Y Rak for sample r, and the expectation of the point
estimator T̂Y Rak is estimated by
Ê[T̂_{Y,Rak}] = (1/R) Σ_{r=1}^{R} T̂_{Y,Rak:r}.
We compare the true variance to the average variance estimates from the above
five different methods. The expectation of a specific variance estimator V̂ T̂Y Rak
for T̂Y Rak is estimated by
Ê[V̂(T̂_{Y,Rak})] = (1/R) Σ_{r=1}^{R} V̂_r(T̂_{Y,Rak:r}),

where V̂_r(T̂_{Y,Rak:r}) is the value of V̂(T̂_{Y,Rak}) for sample r.
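The Monte Carlo quantities above amount to a population-style variance over the R simulated point estimates, e.g.:

```python
def mc_true_variance(estimates):
    """Monte Carlo 'true' variance: V_TRUE = (1/R) * sum_r (T_r - E_hat)^2,
    where E_hat is the average of the R simulated point estimates."""
    R = len(estimates)
    e_hat = sum(estimates) / R
    return sum((t - e_hat) ** 2 for t in estimates) / R

# Three hypothetical replications with estimates 9, 10 and 11.
v = mc_true_variance([9.0, 10.0, 11.0])
```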
We calculate the variances of estimators for population totals, population means
and subpopulation means of various demographic strata.8 The tables show results
for the variables cash on hand and tng_credit_year. The first is the amount of
cash the respondent has in his or her wallet, purse or pockets when completing
the survey. The second, tng_credit_year, is a binary variable indicating whether
the respondent has used the contactless feature of a credit card in the past year.
Taking the weighted total of this variable, we find an estimate of the total number
of people in the population that used the feature; taking the weighted mean, we
obtain an estimate of the proportion of the population that has used it.
Tables 1–3 show variance computations using different variance estimation
methods. The approximation based on the simple random sampling (SRS)
Note: All columns have been divided by the value of VTRUE . Total number of contactless credit adopters
is estimated by the weighted total of the binary variable tng credit year. Notice that SRS is defined
in (6), which ignores both the sampling design and weighting procedure and assumes the sample as
simple random sampling. No raking refers to (1), which does not take into account the effect of the
raking procedure on the variance. Full response and nonresponse are linearization variance estimates
based on (2) and (3), respectively. Bootstrap is the resampling method based on the algorithm outlined
in Section 3.2.
Note: All columns have been divided by the value of VTRUE. The proportion of contactless credit
adopters is estimated by the weighted mean of the binary variable tng credit year. Remaining
definitions are as in the previous note.
Table 3. Simulation: Mean of Cash on Hand and tng Credit Year by Regions.

                      ————————— Linearization —————————    Resampling
              SRS    No Raking   Full Response  Nonresponse  Bootstrap
Cash on hand (mean)
  Atlantic    0.55     1.73          1.13          1.09        1.09
  Quebec      0.67     1.43          1.21          1.03        1.03
  Ontario     0.82     1.62          1.10          1.05        1.06
  Prairies    0.84     1.78          1.06          1.06        1.06
  BC          0.91     1.65          1.31          1.01        1.01
tng credit year (proportion)
  Atlantic    0.63     1.47          1.21          1.10        1.09
  Quebec      0.70     1.36          1.15          1.12        1.14
  Ontario     0.91     1.27          1.16          1.09        1.07
  Prairies    0.82     1.26          1.23          1.02        1.04
  BC          0.75     1.30          1.17          1.06        1.05
Note: All columns have been divided by the value of VTRUE. The proportion of contactless credit
adopters is estimated by the weighted mean of the binary variable tng credit year. Remaining
definitions are as in the previous note.
as well as the resampling bootstrap estimates are always smaller.9 This is because the
raking ratio estimator makes use of the correlation between the variables used for
weighting and the outcome variable of interest, and thus produces an efficiency gain
over an estimator that does not exploit this correlation (Graham, 2011; Chaudhuri
& Renault, 2017). Following Chaudhuri, Handcock, and Rendall (2008), the
raking used in this paper is a two-step procedure, in which Step 1 generates w_i
using auxiliary information that does not involve the parameters of interest, and
Step 2 then re-weights the model parameters from the first step.
5. SUMMARY
If variances for weighted estimates are computed without considering the raking
procedure, the resulting confidence intervals will tend to be conservative. We
therefore produce bootstrap replicate raking weights in Stata and use these to
estimate the variances of weighted estimates from the 2013 MOP survey SQ.
ACKNOWLEDGMENTS
We are grateful to the AiE Editors, Gautam Tripathi, Kim P. Huynh, David T.
Jacho-Chavez and two anonymous referees for their insightful comments which
have led to the current much improved paper. We thank Geoffrey Dunbar, Shelley
Edwards, Ben Fung, Kim P. Huynh, May Liu, Sasha Rozhnov and Kyle Vincent
for their useful comments and encouragement. Maren Hansen provided excellent
writing assistance. We also thank Statistics Canada for providing access to the
2011 National Household Survey and the 2012 Canadian Internet Usage Survey.
The views of this paper are those of the authors and do not represent the views of
the Bank of Canada.
NOTES
1. For a very complex survey design, exact accounting for all its features is extremely
cumbersome. Hence, approximations are often made to yield a usable estimation formula.
2. Besides population means, Hellerstein and Imbens (1999) also consider population
variances and covariances.
3. In practice, it is unlikely that matching on a few population moments will lead to
an artificial population with exactly the same distribution as the target population without
nonresponse. However, as more and more moments are matched, the artificial distribution
will get close to the target distribution. In particular, it may be possible to obtain enough of
a resemblance between the artificial distribution and the target distribution with only a few
matched moments. However, using too many restrictions may compromise the large sample
results that are used for inference, while too few may leave the estimated distribution too
far from the target distribution.
4. When ωi = 1/n, the criterion functions from the classical and maximum likelihood
raking and GREG estimators belong to the Cressie–Read family of statistical discrepancies.
We thank referees for pointing this out.
5. Since the post-stratification method adjusts every cell of a multi-way table, it can
result in cells with zero or very small counts. In contrast, the raking method adjusts only
the marginals, or the low-level interactions.
6. Statistics Canada uses bootstrap procedures extensively. For example, the bootstrap
replicate weights method is used in the CIUS to estimate the coefficients of variation.
REFERENCES
Angrisani, M., Foster, K., & Hitczenko, M. (2015). The 2013 survey of consumer payment choice:
Technical appendix. Research Data Report 15-5. Federal Reserve Bank of Boston.
Arango, C., & Welte, A. (2012). The Bank of Canada’s 2009 Methods-of-Payment survey: Methodology
and key results (No. 2012-6). Bank of Canada discussion paper.
Chaudhuri, S., & Renault, E. (2017). Score tests in GMM: Why use implied probabilities? Working
paper.
Chaudhuri, S., Handcock, M. S., & Rendall, M. S. (2008). Generalized linear models incorporating
population level information: An empirical-likelihood-based approach. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 70(2), 311–328.
Chen, J., & Qin, J. (1993). Empirical likelihood estimation for finite populations and the effective
usage of auxiliary information. Biometrika, 80(1), 107–116.
Chen, J., & Sitter, R. R. (1999). A pseudo empirical likelihood approach to the effective use of auxiliary
information in complex surveys. Statistica Sinica, 9, 385–406.
Deville, J. C., & Sarndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the
American Statistical Association, 87(418), 376–382.
Deville, J. C., Sarndal, C. E., & Sautory, O. (1993). Generalized raking procedures in survey sampling.
Journal of the American Statistical Association, 88(423), 1013–1020.
Ghosh, M., Parr, W. C., Singh, K., & Babu, G. J. (1984). A note on bootstrapping the sample median.
The Annals of Statistics, 12(3), 1130–1135.
Graham, B. S. (2011). Efficiency bounds for missing data models with semiparametric restrictions.
Econometrica, 79(2), 437–452.
Hellerstein, J. K., & Imbens, G. W. (1999). Imposing moment restrictions from auxiliary data by
weighting. Review of Economics and Statistics, 81(1), 1–14.
Henry, C. S., Huynh, K. P., & Shen, Q. R. (2015). 2013 Methods-of-Payment survey results (No.
2015-4). Bank of Canada discussion paper.
Ireland, C. T., & Kullback, S. (1968). Contingency tables with given marginals. Biometrika, 55(1),
179–188.
Kalton, G., & Flores-Cervantes, I. (2003). Weighting methods. Journal of Official Statistics, 19(2), 81.
Kolenikov, S. (2010). Resampling variance estimation for complex survey data. The Stata Journal,
10(2), 165–199.
Kolenikov, S. (2014). Calibrating survey data using iterative proportional fitting (raking). The Stata
Journal, 14(1), 22–59.
Lu, H., & Gelman, A. (2003). A method for estimating design-based sampling variances for surveys
with weighting, poststratification, and raking. Journal of Official Statistics, 19(2), 133.
McCarthy, P. J., & Snowden, C. B. (1985). The bootstrap and finite population sampling. Vital and
Health Statistics. Series 2, Data Evaluation and Methods Research, 95, 1–23.
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika,
75(2), 237–249.
Owen, A. (1990). Empirical likelihood ratio confidence regions. The Annals of Statistics, 18, 90–120.
Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of
Statistics, 22, 300–325.
Rao, J. N. K., Wu, C. F. J., & Yue, K. (1992). Some recent work on resampling methods for complex
surveys. Survey Methodology, 18(2), 209–217.
Rust, K. F., & Rao, J. N. K. (1996). Variance estimation for complex surveys using replication
techniques. Statistical Methods in Medical Research, 5(3), 283–310.
Vincent, K. (2015). 2013 Methods-of-Payment survey: Sample calibration analysis. Bank of Canada.
PART III
ESTIMATION AND INFERENCE
MODEL-SELECTION TESTS FOR
COMPLEX SURVEY SAMPLES
ABSTRACT
We extend Vuong’s (1989) model-selection statistic to allow for complex sur-
vey samples. As a further extension, we use an M-estimation setting so that
the tests apply to general estimation problems – such as linear and non-
linear least squares, Poisson regression and fractional response models, to
name just a few – and not only to maximum likelihood settings. With strati-
fied sampling, we show how the difference in objective functions should be
weighted in order to obtain a suitable test statistic. Interestingly, the weights
are needed in computing the model-selection statistic even in cases where
stratification is appropriately exogenous, in which case the usual unweighted
estimators for the parameters are consistent. With cluster samples and panel
data, we show how to combine the weighted objective function with a cluster-
robust variance estimator in order to expand the scope of the model-selection
tests. A small simulation study shows that the weighted test is promising.
Keywords: Survey sampling; weighted estimation; cluster sampling;
nonnested models; model-selection test; M-estimation
110 IRAJ RAHMANI AND JEFFREY M. WOOLDRIDGE
1. INTRODUCTION
models, but the nature of the adjustment to the test statistics is different from what
we address here.
An important feature of our approach, which also extends Vuong’s (1989) orig-
inal framework, is that we study the model-selection problem in the context of
general M-estimation. This generality allows us to explicitly cover situations where
only a specific feature of a distribution is correctly specified, such as the condi-
tional mean. For example, because of its robustness for estimating the parameters
of a conditional mean – see Gourieroux, Monfort, and Trognon (1984) – Poisson
regression is commonly used in situations where little if any interest lies in the rest
of the distribution. However, we may have two competing models of the condi-
tional mean. While one could use nonlinear least squares estimation, for efficiency
reasons, the Poisson quasi-MLE is often preferred. We can apply Vuong’s (1989)
approach in this situation to obtain a statistic that does not take a stand on the
distribution; it is purely a test of the conditional mean function. By contrast, if we
use the same mean function – say, an exponential function with the same covari-
ates – and we use two different quasi-MLEs, such as the Poisson and Geometric,
then the Vuong approach is a test of which distribution fits better. Understanding
the distinction between a conditional mean test and a full test of the conditional
distribution is something easily described in our setting.
The remainder of the paper is organized as follows. In Section 5.2, we define the
estimation problems that effectively define the two nonnested competing models.
Section 5.3 shows how to modify the Vuong (1989) statistic to accommodate
standard stratified (SS) sampling in the context of general M-estimation. We start
with SS and variable probability (VP) sampling because they are widely used in
practice, and it is then clear what role weighting plays in more complex sampling
designs. Section 5.4 shows how to modify the estimated variance to account for
clustering in a multistage design. In Section 5.5, we extend the model-selection
test to panel data models with standard stratification design. In Section 5.6, we
discuss how weighting is desirable even under what is typically called “exogenous”
stratification. Section 5.7 provides several examples, and Section 5.8 contains a
small simulation study. Section 5.9 contains a brief conclusion.
has a unique solution – which is required for identification. In standard settings, the
solution is often denoted θ o , and θ o is assumed to index the quantity of interest, such
as the parameters in a conditional mean function E (Y |X). In a conditional MLE
setting, where q(W, θ) = −log[f(Y|X; θ)] is the negative of the log likelihood, θ_o
is the vector of parameters indexing the conditional density of Y given X. There are
many other applications where W is partitioned as W = (X, Y), where X and Y are,
respectively, K and L dimensional vectors with L + K = M. We are particularly
interested in the case where q (·) is the negative of a quasi-log-likelihood function
in the linear exponential family.
Rather than estimate the parameters using a single objective function – underlying
which is a parametric model of some feature of a distribution D(W) or, more
likely, a conditional distribution – we suppose we have two estimation methods,
represented by the objective functions q1(W, θ1) and q2(W, θ2), where the param-
eter vectors may have different dimensions. We need to make precise the sense in
which these competing models and estimation methods are nonnested. Let θ ∗1 and
θ∗2 be the unique solutions to the population problems

$$\min_{\theta_g \in \Theta_g} E\left[q_g(W, \theta_g)\right], \qquad g = 1, 2.$$

These solutions are often called the "pseudo-true values" or "quasi-true values."
The null hypothesis is that the models evaluated at the pseudo-true values fit equally
well on average, where fit is measured by the mean of the objective functions in
the population. Precisely,
$$H_0: E\left[q_1(W, \theta_1^*)\right] = E\left[q_2(W, \theta_2^*)\right]. \tag{3}$$
In the maximum likelihood setting, (3) states that the KLIC distances of the two
models to the true model are the same. In a regression context using nonlinear
least squares, the null is that the population sums of squared residuals are the same;
equivalently, the two models provide functions that have the same mean squared
error relative to the true conditional mean.
Condition (3) can hold for nested as well as nonnested models, but the nature of
our approach requires us to focus on the latter. The reason is that we supplement
Model-Selection Tests for Complex Survey Samples 113
(3) with the assumption that the objective functions, evaluated at the pseudo-true
values, differ with positive probability:
$$P\left[q_1(W, \theta_1^*) \neq q_2(W, \theta_2^*)\right] > 0. \tag{4}$$
The requirement in (4) means that the two functions q1 (W, θ ∗1 ) and q2 (W, θ ∗2 ) must
differ for a nontrivial set of outcomes on the support of W. If (4) does not hold
then the variance of q1 (W, θ ∗1 ) − q2 (W, θ ∗2 ) is 0, and that will invalidate the Vuong
(1989) approach taken in the paper.
The combination of (3) and (4) effectively rules out nested models, where one
model is obtained as a special case of the other and the same objective func-
tion – such as the negative of the log-likelihood function – is used. Then, the
only way that (3) can be true is when the more general model collapses to the
restricted version; otherwise E[q1(W, θ∗1)] > E[q2(W, θ∗2)]. In other words, for
nested models, we cannot have (3) and (4) both be true. The two conditions (3)
and (4) rule out other forms of degeneracies. For example, assume we have a ran-
dom variable Y and would like to model E(Y |X) as a function of the explanatory
variables X, a 1 × K vector. We specify two competing models and we estimate
both by nonlinear least squares. Specifically, q1(W, θ1) = (Y − α1 − Xβ1)² and
q2(W, θ2) = [Y − exp(α2 + Xβ2)]². If E(Y |X) actually depends on X, then (4)
generally holds. However, if Y is mean independent of X, so that E(Y |X) = E(Y ),
then the two models are simply different parameterizations of a constant condi-
tional mean, and (4) fails. As we will see, this failure causes the standard normal
limiting distribution for the Vuong-type statistic to break down. Incidentally, in this
example, provided the models satisfy (4), the Vuong test with random sampling
would reduce to comparing the R-squareds from the two least squares regressions.
We require no additional assumptions – for example, neither homoskedasticity nor
normality – for the test to be valid.
The nature of the alternative is inherently one sided, as we wish to determine
whether one model can be rejected in favor of the other. Because we have defined
the optimization problem to be a minimization problem, the alternative that model
one – technically, model one combined with whatever objective function we choose
– fits better in the population is
$$H_A^{q_1}: E\left[q_1(W, \theta_1^*)\right] < E\left[q_2(W, \theta_2^*)\right],$$

and the alternative that model two fits better in the population is

$$H_A^{q_2}: E\left[q_1(W, \theta_1^*)\right] > E\left[q_2(W, \theta_2^*)\right].$$
114 IRAJ RAHMANI AND JEFFREY M. WOOLDRIDGE
In Vuong’s setup, these functions are the negative of a log-likelihood function, but
we can apply these alternatives much more generally. Naturally, if either HAq1 or
HAq2 holds then the nondegeneracy condition (4) must hold.
Under random sampling, the M-estimator θ̂g solves the sample problem

$$\min_{\theta_g \in \Theta_g} N^{-1} \sum_{i=1}^{N} q_g(W_i, \theta_g).$$
In this section, we are interested in cases where the population has been divided
into J strata and then the resulting sample is not necessarily representative of the
population.
Under SS sampling, the weighted M-estimator θ̂g minimizes

$$\frac{1}{N} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g), \tag{6}$$

which can also be written as

$$\frac{1}{N} \sum_{i=1}^{N} \frac{Q_{j_i}}{H_{j_i}} q_g(W_i, \theta_g), \tag{7}$$
where Qji /Hji is the sampling weight for observation i. Thus, the estimator rep-
resented in (7) is obtained by weighting the objective function by the sampling
weight for each observation i. As discussed in Wooldridge (2001), the represen-
tation in (5) is the form used to obtain asymptotic properties of the weighted
M-estimator.
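To make the role of the weights concrete, here is a minimal numpy sketch of the weighted estimator in (7), using a Poisson quasi-log likelihood with an exponential mean as the objective. The population, the outcome-based strata, and the parameter values are all invented for illustration; only the weighting logic mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population; stratification is on the outcome (Y = 0 versus Y > 0).
pop_x = rng.normal(size=100_000)
pop_y = rng.poisson(np.exp(0.5 + 0.3 * pop_x))
pop_j = (pop_y == 0).astype(int)
Q = np.bincount(pop_j) / pop_j.size        # population stratum shares Q_j

# Standard stratified sample: 400 units from each stratum, so H_1 = H_2 = 0.5.
idx = np.concatenate([rng.choice(np.flatnonzero(pop_j == j), 400, replace=False)
                      for j in (0, 1)])
y, x, j = pop_y[idx], pop_x[idx], pop_j[idx]
H = np.array([0.5, 0.5])
w = (Q / H)[j]                             # sampling weights Q_{j_i} / H_{j_i}

# Weighted Poisson quasi-MLE with exponential mean, solved by Newton's method;
# this minimizes the weighted objective (7): sum_i (Q_{j_i}/H_{j_i}) q(W_i, theta).
X = np.column_stack([np.ones(y.size), x])
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)
    beta += np.linalg.solve(X.T @ (w[:, None] * mu[:, None] * X),
                            X.T @ (w * (y - mu)))
```

Because the weights undo the outcome-based stratification, beta should be close to the population values (0.5, 0.3) even though zeros are heavily oversampled.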
We first show that when the models are nonnested in the sense of (4), a properly
standardized version of the objective function has a limiting distribution that
does not depend on the limiting distribution of √N(θ̂g − θ∗g), provided θ̂g is
√N-consistent, which is standard. Wooldridge (2001, Theorem 3.2) contains sufficient
conditions. The proof relies on fairly standard asymptotics and so we show only
its main features.
Theorem 3.1. Assume that for g ∈ {1, 2},
1. {Wij : i = 1, . . . , Nj ; j = 1, . . . , J } satisfies the SS sampling scheme with
Nj /N → aj > 0, j = 1, . . . , J .
2. Θg is compact.
3. The objective function E[qg(W, θg)] has a unique minimum on Θg at θ∗g.
4. θ∗g ∈ int Θg.
5. For each w ∈ W, qg(w, ·) is continuous on Θg and twice continuously
differentiable on int Θg.
6. For all θg ∈ Θg, |qg(w, θg)| ≤ b(w), |∂qg(w, θg)/∂θgk| ≤ b(w), and
|∂²qg(w, θg)/∂θgk∂θgm| ≤ b(w) for a function b(w) with E[b(W)] < ∞.
7. E[∇θqg(W, θ∗g)′∇θqg(W, θ∗g)] < ∞ and E[∇θqg(W, θ∗g)] = 0.
Then

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) + o_p(1).$$

Proof. For each stratum j, a mean value expansion about θ∗g gives

$$\frac{1}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \frac{1}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) + \left[\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \ddot{\theta}_g^{\,j})\right] (\hat{\theta}_g - \theta_g^{*}),$$
where θ̈gʲ is a mean value between θ̂g and θ∗g. Because θ̂g →p θ∗g, θ̈gʲ →p θ∗g. Now,
by a corollary of the uniform law of large numbers – for example, Wooldridge
(2010, Lemma 12.1) –

$$\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \ddot{\theta}_g^{\,j}) = \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*}) + o_p(1).$$
As shown in Wooldridge (2001, Theorem 3.2), the assumptions ensure that
√N(θ̂g − θ∗g) = Op(1), and so

$$\left[\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \ddot{\theta}_g^{\,j})\right] \sqrt{N}(\hat{\theta}_g - \theta_g^{*}) = \left[\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right] \sqrt{N}(\hat{\theta}_g - \theta_g^{*}) + o_p(1).$$
Therefore, if we multiply by √N Qj and sum across j, we get

$$\sqrt{N} \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \sqrt{N} \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) \tag{8}$$
$$\qquad + \left[\sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right] \sqrt{N}(\hat{\theta}_g - \theta_g^{*}).$$

Moreover, by the law of large numbers applied within each stratum,

$$\operatorname*{plim}_{N \to \infty} \sum_{j=1}^{J} Q_j \left(\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right) = \sum_{j=1}^{J} Q_j \left(\operatorname*{plim}_{N_j \to \infty} \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right)$$
$$= \sum_{j=1}^{J} Q_j E\left[\nabla_{\theta} q_g(W_i, \theta_g^{*}) \,\middle|\, W_i \in \mathcal{W}_j\right] = E\left[\nabla_{\theta} q_g(W_i, \theta_g^{*})\right] = 0, \tag{9}$$
where the last equality holds from the population first order condition for θ ∗g .
Therefore,

$$\sum_{j=1}^{J} Q_j \left(\frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_{\theta} q_g(W_{ij}, \theta_g^{*})\right) = o_p(1),$$

and since √N(θ̂g − θ∗g) = Op(1), the second term in (8) is op(1). We have shown

$$\sqrt{N} \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \sqrt{N} \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) + o_p(1),$$

or, equivalently (since Hj = Nj/N),

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat{\theta}_g) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^{*}) + o_p(1).$$
We can use Theorem 3.1 to construct a simple test statistic that allows us
to discriminate between two competing models. Let r(w, θ1, θ2) ≡ q1(w, θ1) −
q2(w, θ2) be the difference in the two objective functions evaluated at w ∈ W and
generic values of the parameters. The null hypothesis is

$$H_0: E\left[r(W, \theta_1^*, \theta_2^*)\right] = 0.$$
By the assumption that the models are nonnested, V[r(W, θ∗1, θ∗2)] > 0. Applied
to both estimation problems, and under the null hypothesis, Theorem 3.1 implies

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r(W_{ij}, \theta_1^*, \theta_2^*) + o_p(1). \tag{10}$$
Eq. (10) is the key result in the paper. It extends Vuong (1989) by allowing for strat-
ified sampling and for general objective functions – not just the log likelihood.
Because the right hand side of (10) does not depend on the limiting distributions
of √N(θ̂g − θ∗g), its distribution is easy to study by again applying the results in
Wooldridge (2001) on SS sampling.
Then

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2) \stackrel{d}{\longrightarrow} \mathrm{Normal}(0, \eta^2),$$

where

$$\eta^2 = \sum_{j=1}^{J} \frac{Q_j^2}{H_j} V\left[r(W, \theta_1^*, \theta_2^*) \,\middle|\, W \in \mathcal{W}_j\right].$$
Proof. From Theorem 3.1 and the asymptotic equivalence lemma, we must
argue that

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r(W_{ij}, \theta_1^*, \theta_2^*) \stackrel{d}{\longrightarrow} \mathrm{Normal}(0, \eta^2).$$

But this holds from Wooldridge (2001, Theorem 3.2). Namely, we apply
the asymptotic variance formula for a weighted objective function under SS
sampling to the sequence {Rij : i = 1, . . . , Nj ; j = 1, . . . , J}, where
Rij ≡ r(Wij , θ∗1, θ∗2).
Define the within-stratum sample means

$$\bar{R}_j = N_j^{-1} \sum_{i=1}^{N_j} r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2). \tag{11}$$

A consistent estimator of η² is then

$$\hat{\eta}^2 = \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} \left[r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2) - \bar{R}_j\right]^2. \tag{12}$$
Then we treat the R̂i as data obtained from an SS sampling scheme, so we simply
need to specify the stratum for each observation, ji , and the weight, Qji /Hji . For
example, in Stata, one applies the “svyset” command – to specify the stratum identifier
and weights – and then runs the regression of R̂i on a constant. The usual t statistic on
the constant is the model-selection test statistic. The sign of the constant indicates
which model fits better.
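Where survey software is unavailable, the statistic can also be computed directly from (11) and (12). The following numpy sketch is a hypothetical implementation of that recipe; since the problems are minimizations and r = q1 − q2, a large negative value favors model one and a large positive value favors model two.

```python
import numpy as np

def t_ms(R_hat, strata, Q, H):
    """Weighted Vuong-type statistic under SS sampling.

    R_hat:  per-unit differences in objective functions, r(W_i, theta1_hat, theta2_hat)
    strata: integer stratum label j_i for each unit
    Q, H:   population and sample stratum shares (assumed known)
    """
    N = R_hat.size
    w = Q[strata] / H[strata]
    num = (w * R_hat).sum() / np.sqrt(N)      # numerator from (10)
    eta2 = 0.0
    for j in np.unique(strata):               # variance estimator (12)
        Rj = R_hat[strata == j]
        eta2 += (Q[j] ** 2 / H[j] ** 2) * ((Rj - Rj.mean()) ** 2).sum()
    eta2 /= N
    return num / np.sqrt(eta2)
```

With Q = H (a representative sample), the statistic collapses to an ordinary t statistic on the mean of R̂i, which matches the regression-on-a-constant recipe.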
In using tMS as a model-selection test, it is important to understand its prop-
erties compared to an approach where weighting is not used but SS sampling
has been employed. Because of the nature of the null and alternative, it is not
true that the weighted version of the test will always reject the null more often
than the unweighted version of the test. To see this, consider what happens when,
say, model one is correctly specified with parameters θ o1 . Then, generally, we
need to use the weights to consistently estimate θo1; the unweighted estimator
converges to some other quantity, say θ+1. For model two, we can write the proba-
bility limits as θ∗2 and θ+2 for the weighted and unweighted problems, respectively.
Now, there is no guarantee that q1(Wi, θo1) is further from E[q2(Wi, θ∗2)], on
average, than q1(Wi, θ+1) is from E[q2(Wi, θ+2)], and so the test based on the
unweighted objective function may reject more often in favor of model one than
the weighted version of the test. This turns out not to be a good thing, for two
reasons. First, even though the unweighted estimator might point to the correct
model/estimation method, the estimator of θo1 is generally inconsistent. In other
words, the unweighted approach may choose the correct model but with parameter
estimators that are essentially useless! In fact, for computing quantities of interest,
such as average partial effects, there is no telling whether it would be better to use
model one with inconsistent parameter estimators or model two, with estimators
converging to θ+2.
A second important shortcoming of the unweighted test is that it may systemat-
ically opt for model two when model one is correctly specified. And the problem
would generally be worse as the sample size grows. This cannot happen with the
weighted version of the test, provided we have chosen our model and objective
function in a way that generates consistent estimators under correct specification
of the feature of interest. The reason is that, if model one is correctly specified and
the objective function is chosen appropriately, E[q1(W, θo1)] < E[q2(W, θ∗2)]. In
other words, the weighted test is a consistent test for choosing the true model
when one of the models is correctly specified. With the unweighted test, it could
easily be that E[q2(W, θ+2)] < E[q1(W, θ+1)], in which case the unweighted test
will systematically select the wrong model. And this will happen with probability
approaching one as the sample size grows. We will see this phenomenon in the
simulations in Section 8.
An analogy that does not require thinking about weighting versus not weighting
might be helpful. Assume random sampling, as in the original Vuong (1989)
setup. Now suppose we specify two nonnested conditional mean models,
m1 (x1 , θ 1 ) and m2 (x1 , θ 2 ), and model one is correctly specified. If we use an
objective function that identifies conditional means – say, the squared residual
function – then the Vuong test will detect that model one is correct with probability
approaching one. Suppose we use another objective function, such as the least
absolute deviations (LAD). In general, neither model one nor model two is correctly
specified for the conditional median. Consequently, using LAD in the Vuong
statistic has essentially unknown properties. It could incorrectly choose in favor
of model two because model two is closest to the conditional median. But it could
also frequently reject model two in favor of model one; in fact, nothing says
the rejection frequency could not be higher than when using the squared residual
objective function. In other words, using the wrong objective function may actually
lead to a more powerful test. The problem is that when this occurs, it is essentially
a fluke. Moreover, the LAD estimators are not generally consistent for conditional
mean parameters, so it is difficult to see how choosing model one helps: we
have the correct model but inconsistent estimators of its parameters.
When the competing models contain different numbers of parameters, the finite
sample performance of tMS may suffer. As in Vuong (1989), we can penalize the
objective functions for the number of parameters. Since we are minimizing the
objective function, we add a penalty that is a function of the number of parameters.
The resulting statistic is
$$\tilde{t}_{MS} = \frac{N^{-1/2} \sum_{i=1}^{N} (Q_{j_i}/H_{j_i})\, r(W_i, \hat{\theta}_1, \hat{\theta}_2) + N^{-1/2}\left[K(P_1) - K(P_2)\right]}{\hat{\eta}}, \tag{14}$$
where P1 and P2 are the number of parameters in the different models and K (·)
is the penalty function. For example, K(P ) = P gives the Akaike (1973) crite-
rion and K(P ) = (P /2) log (N) gives the Schwarz (1978) criterion. In both cases,
N −1/2 K (P ) → 0 for fixed P , and so the penalty does not affect the asymptotic
distribution of the test statistic: t˜MS and tMS have the same asymptotic distribu-
tions under H0 . The statistic t˜MS has the feature of penalizing models that are
not parsimonious in the number of parameters. One could instead simply add
N −1/2 [K(P1 ) − K(P2 )] to tMS , which means the penalty would not be divided by
η̂. Again, the resulting statistic is asymptotically equivalent. In what follows, we
drop the penalty function for notational convenience.
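As a sketch, the penalized statistic in (14) can be computed as follows. The per-unit differences, weights, and η̂ are assumed to have been obtained already; the two penalty choices are the Akaike and Schwarz functions named in the text.

```python
import numpy as np

def t_ms_penalized(R_hat, weights, eta_hat, P1, P2, penalty="bic"):
    """Penalized model-selection statistic (14).

    Since the objective functions are minimized, penalizing model g means
    adding K(P_g) for its own parameter count; 'aic' uses K(P) = P and
    'bic' uses K(P) = (P/2) * log(N).
    """
    N = R_hat.size
    K = (lambda P: P) if penalty == "aic" else (lambda P: 0.5 * P * np.log(N))
    num = (weights * R_hat).sum() / np.sqrt(N) + (K(P1) - K(P2)) / np.sqrt(N)
    return num / eta_hat
```

Because the penalty is scaled by N^{-1/2}, it vanishes asymptotically for fixed P1 and P2, so the penalized and unpenalized statistics share the same limiting distribution.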
When observations in the strata are difficult to identify prior to sampling or when
collecting information on the variable determining stratification is cheap relative
to the cost of collecting the remaining information, VP sampling is convenient.
In VP sampling, a unit is first drawn at random from the population. If the unit
falls into stratum j , it is kept with probability pj . For example, if we define
stratification in terms of individual income, we draw a person randomly from
the population, determine the person's income, and then keep that person with
a probability that depends on the income class, as set by the researcher. As discussed
in Wooldridge (1999), consistent estimation of the population parameters generally
requires weighting the objective function by the inverse of the probability of being
kept in the sample. With J strata, these probabilities are {pj : j = 1, . . . , J }. It is
straightforward to show the analog of Theorem 3.1 carries over, and so, under the
null hypothesis that the models are nonnested and fit the population equally well,
estimation of the parameters does not affect the limiting distribution. This leads to
the test statistic
$$\frac{N^{-1/2} \sum_{i=1}^{N} p_{j_i}^{-1}\, r(W_i, \hat{\theta}_1, \hat{\theta}_2)}{\left[N^{-1} \sum_{i=1}^{N} p_{j_i}^{-2}\, r(W_i, \hat{\theta}_1, \hat{\theta}_2)^2\right]^{1/2}}, \tag{15}$$
where again ji is the stratum for observation i. (Remember that under VP sampling,
we do not always observe a draw from the population; this statistic necessarily
depends only on the draws we keep.)
One way that the denominator of (15) differs from that of (13) is that the
within-stratum means are not removed in (15). Wooldridge (1999) shows that if
the known sampling probabilities, pj , are replaced with the observed frequencies,
then it is proper to remove the means, R̄j , in (15). Using the sample frequencies
means that we know how many times each stratum was drawn – call this Mj . Then
p̂j = Nj /Mj , where Nj is the kept number of draws in stratum j (and which we
always observe). We replace pj in (15) with p̂j and then replace r(Wi , θ̂ 1 , θ̂ 2 ) with
r(Wi , θ̂ 1 , θ̂ 2 ) − R̄j in the denominator for all i in stratum j . In many cases, the
number of times each stratum was drawn is not available, and so one must use
the pj rather than the p̂j . If one uses the p̂j directly in (15), then the statistic is
conservative in the sense that, asymptotically, its size will be less than the nominal
size (because the estimated standard deviation is systematically too large).
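The VP statistic in (15), together with the demeaned variant used when the sampling frequencies are estimated, can be sketched as follows; the inputs are hypothetical.

```python
import numpy as np

def t_vp(R_hat, strata, p, demean=False):
    """VP-sampling statistic (15).

    With known retention probabilities p_j, within-stratum means are NOT
    removed in the denominator; when estimated frequencies p_hat are used
    instead, set demean=True to remove them (per Wooldridge, 1999).
    """
    N = R_hat.size
    w = 1.0 / p[strata]                       # inverse-probability weights
    num = (w * R_hat).sum() / np.sqrt(N)
    if demean:
        means = np.array([R_hat[strata == j].mean() for j in range(p.size)])
        centered = R_hat - means[strata]
        denom2 = (w ** 2 * centered ** 2).sum() / N
    else:
        denom2 = (w ** 2 * R_hat ** 2).sum() / N
    return num / np.sqrt(denom2)
```

Removing the means can only shrink the denominator, which is why using the known p_j without demeaning yields a conservative statistic.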
$$v_{sc} = \frac{C_s}{N_s} \cdot \frac{M_{sc}}{K_{sc}},$$

where we require information on the number of clusters in the population and the
number of units per cluster. The weighted objective function is

$$\sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, q_g\!\left(W_{scm}, \theta_g\right).$$
As before, r(Wscm, θ1, θ2) ≡ q1(Wscm, θ1) − q2(Wscm, θ2) is the difference between
the two objective functions for each unit m, in cluster c, in stratum s. In other words,

$$\frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r\!\left(W_{scm}, \hat{\theta}_1, \hat{\theta}_2\right) = \frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r\!\left(W_{scm}, \theta_1^*, \theta_2^*\right) + o_p(1). \tag{16}$$
It follows that

$$\frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r\!\left(W_{scm}, \hat{\theta}_1, \hat{\theta}_2\right) \stackrel{d}{\longrightarrow} N(0, \xi^2).$$
A consistent estimator of ξ² is

$$\hat{\xi}^2 = \frac{1}{N} \Bigg\{ \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}^2\, r_{scm}\!\left(\hat{\theta}_1, \hat{\theta}_2\right)^2 + \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} \sum_{m' \neq m} v_{sc}^2\, r_{scm}\!\left(\hat{\theta}_1, \hat{\theta}_2\right) r_{scm'}\!\left(\hat{\theta}_1, \hat{\theta}_2\right)$$
$$\qquad - \sum_{s=1}^{S} \frac{1}{N_s} \Bigg[\sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r_{scm}\!\left(\hat{\theta}_1, \hat{\theta}_2\right)\Bigg]^2 \Bigg\}. \tag{17}$$
The first term in (17) would be a consistent estimator of the variance under simple
random sampling. The second term accounts for within-cluster correlation, and
the third term properly subtracts off the within-strata means. Typically, the second
term is positive, reflecting the positive correlation within cluster. The third term,
without the minus sign, is always nonnegative. Therefore, the second and third
terms tend to work in opposite directions. In any case, the resulting test statistic,
$$\frac{N^{-1/2} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r\!\left(W_{scm}, \hat{\theta}_1, \hat{\theta}_2\right)}{\hat{\xi}}, \tag{18}$$

has an asymptotic standard normal distribution under the null hypothesis.
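A direct, hypothetical implementation of the three-term estimator in (17); the stratum labels, cluster labels, and weights are assumed given, and the code follows the first/second/third-term decomposition discussed in the text.

```python
import numpy as np

def xi2_hat(r, stratum, cluster, v):
    """Three-term variance estimator (17) for stratified cluster samples.

    r:       unit-level differences in objective functions
    stratum: stratum label s for each unit
    cluster: cluster label c for each unit (unique within a stratum)
    v:       unit-level weights v_sc
    """
    N = r.size
    total = (v ** 2 * r ** 2).sum()                     # term 1: SRS variance
    for s in np.unique(stratum):
        in_s = stratum == s
        Ns = np.unique(cluster[in_s]).size              # clusters drawn in stratum s
        for c in np.unique(cluster[in_s]):
            rc = (v * r)[in_s & (cluster == c)]
            total += rc.sum() ** 2 - (rc ** 2).sum()    # term 2: within-cluster cross products
        total -= ((v * r)[in_s].sum()) ** 2 / Ns        # term 3: remove within-stratum means
    return total / N
```

As a design check, with one stratum and equal weights the formula reduces to the usual sum of squared demeaned cluster totals, divided by the number of units.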
Model-selection tests in panel data models with complex sampling designs are
similar to the tests in the cross-sectional cases, but in using standard software
we must make sure to account for serial correlation in the difference in objective
functions when using a pooled estimation method. Here we cover the case where
stratified sampling is done in an initial time period, as is very common. Conse-
quently, the sampling weights, Qj /Hj for the strata j = 1, . . . , J , do not change
over time.
When a probability density function for the joint distribution
D(Yi1 , . . . , YiT |Xi1 , . . . , XiT ) is fully specified, the methods in Sections
5.3 and 5.4 apply directly: the objective function is the joint log likelihood
conditional on (Xi1 , . . . , XiT ).
For many reasons, one often wants to compare models estimated using pooled
methods. Pooled estimation methods are computationally simpler, often much
more so. More importantly, we are often interested in a feature of D(Yit |Xit )
or even D(Yit |Xi1 , . . . , XiT ), and we do not wish to take a stand on how the
{Yit : t = 1, . . . , T } are related to each other. For example, we might be interested
in estimating E(Yit |Xit ) or E(Yit |Xi1 , . . . , XiT ) using pooled quasi-MLE in the
linear exponential family. Such an approach is robust to other distributional mis-
specification and to arbitrary serial correlation. Therefore, any model-selection
statistic should be robust to arbitrary serial dependence, too.
As an example, suppose we use pooled nonlinear least squares to estimate two
models of the conditional mean. The difference in objective functions at time t,
evaluated at the pseudo-true values, is
$$\left[Y_{it} - m_1(X_{it}, \theta_1^*)\right]^2 - \left[Y_{it} - m_2(X_{it}, \theta_2^*)\right]^2.$$
There are essentially no interesting cases where this difference would be serially
uncorrelated over time. We would have to assume that {(Xit , Yit ) : t = 1, . . . , T } is
an independent sequence, and this is very unlikely in a panel data setting.
In models with unobserved heterogeneity, say Ci , we can take a correlated
random effects approach (as in Wooldridge, 2010, Section 13.9) and propose a
model for
$$D(C_i \mid X_{i1}, \ldots, X_{iT}) = D(C_i \mid \bar{X}_i).$$
Then, if we assume strict exogeneity of {Xit } conditional on Ci ,
$$q_g(W_i, \theta_g) = \sum_{t=1}^{T} q_{gt}(W_{it}, \theta_g), \qquad g = 1, 2,$$
where θ∗1 and θ∗2 are the pseudo-true values. A sufficient but not necessary condition
is that the models fit equally well for each t:

$$E\left[q_{1t}(W_{it}, \theta_1^*)\right] = E\left[q_{2t}(W_{it}, \theta_2^*)\right], \qquad t = 1, \ldots, T. \tag{20}$$

The difference in objective functions for unit i in stratum j is then

$$r(W_{ij}, \theta_1^*, \theta_2^*) = \sum_{t=1}^{T} \left[q_{1t}(W_{itj}, \theta_1^*) - q_{2t}(W_{itj}, \theta_2^*)\right].$$
That the variance estimator accounts for serial dependence can be seen by writing
out the within-stratum quantity

$$\frac{1}{N_j} \sum_{i=1}^{N_j} \left[r(W_{ij}, \hat{\theta}_1, \hat{\theta}_2) - \bar{R}_j\right]^2$$

and noting that it includes cross products between time periods t and s, t ≠ s.
Standard software can be tricked into computing the model-selection statistic
by specifying the strata, j , the sampling weights, Qji /Hji , and specifying each
cross-sectional unit i as a cluster. As is well known – see, for example, Arellano
(1987) – the form of the robust variance estimator for small-T panel data estimators
is the same as for cluster correlation.
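To illustrate the trick numerically, the following numpy sketch (with hypothetical inputs) collapses the T per-period differences for each cross-sectional unit into a single Ri = Σt rit and then applies the stratified formula; summing over t before computing the variance is exactly what clustering by unit accomplishes, and it is what makes the statistic robust to arbitrary serial correlation.

```python
import numpy as np

def panel_t_ms(r_it, unit, strata_of_unit, w_of_unit):
    """Pooled panel model-selection statistic with unit-level clustering.

    r_it:           stacked per-period differences q1t - q2t
    unit:           cross-sectional unit id for each row of r_it
    strata_of_unit: stratum label for each unit (stratification in period 1)
    w_of_unit:      sampling weight Q_j/H_j for each unit
    """
    units = np.unique(unit)
    R = np.array([r_it[unit == i].sum() for i in units])  # R_i = sum over t
    w = w_of_unit[units]
    strata = strata_of_unit[units]
    N = R.size
    num = (w * R).sum() / np.sqrt(N)
    eta2 = 0.0
    for j in np.unique(strata):
        Rj = (w * R)[strata == j]                         # weights constant within stratum
        eta2 += ((Rj - Rj.mean()) ** 2).sum()
    eta2 /= N
    return num / np.sqrt(eta2)
```

Because the weights are constant within a stratum, demeaning w·Ri within strata is equivalent to the (Qj/Hj)²-weighted formula in (12) applied to the unit-level sums.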
6. EXOGENOUS STRATIFICATION
In most applications, we partition W as W = (X, Y) and we are interested in some
feature of the distribution of Y given X. If the feature is correctly specified, and
we choose a suitable objective function, then the population (true) value of the
parameters, θ o , solves
$$\min_{\theta \in \Theta} E\left[q(W, \theta) \mid X = x\right]$$
for all x ∈ X , the support of X. For example, in the case of estimating the condi-
tional mean E(Y |X), one suitable choice of the objective function is the squared
residual function:
$$q(W, \theta) = \left[Y - m(X, \theta)\right]^2.$$
When the conditional mean is correctly specified, that is,
E (Y |X = x) = m(x, θ o ), x ∈ X ,
it is easily shown – see, for example, Wooldridge (2010, Chapter 12) – that θ o
solves
$$\min_{\theta \in \Theta} E\left\{\left[Y - m(X, \theta)\right]^2 \,\middle|\, X = x\right\}$$
for all x.
Now consider a situation where stratification is based entirely on X, and so
{Xj : j = 1, . . . , J } represents the mutually exclusive and exhaustive strata. Then,
as discussed in Wooldridge (1999, 2001), θo solves

$$\min_{\theta \in \Theta} E\left[q(W, \theta) \mid X \in \mathcal{X}_j\right]$$

for each j. Wooldridge (1999, 2001) shows that this feature of θo implies that the
unweighted M-estimator is generally consistent for θ o .
Given consistency of the unweighted estimator under correct specification, it
may be tempting to ignore stratification when it is based on X and to simply apply
Vuong’s (1989) statistic in the M-estimation context. But the unweighted statistic
does not achieve our objectives because the null hypothesis is that each model is
misspecified. Even when stratification is based on X, we need to use the weights
to uncover θ ∗1 and θ ∗2 . In particular, under the null hypothesis of interest, θ ∗g does
not generally solve
$$\min_{\theta_g \in \Theta_g} E\left[q_g(W, \theta_g) \mid X = x\right]$$
for all x ∈ X , and therefore the unweighted estimator is inconsistent for θ ∗g . Our
goal is to compare the models in the population, and the weighted estimator always
consistently estimates θ ∗g under the null and alternatives – including if model g
is correctly specified – whether or not stratification is based on X, Y, or both. To
summarize, this observation argues in favor of weighting for both estimation and
model selection.
7. EXAMPLES
The previous framework has many applications. Here we describe a few that are
not completely standard.
Many software packages, such as Stata, allow for estimation of binary and frac-
tional response models with survey sampling. After the estimates θ̂ 1 and θ̂ 2 have
been obtained, compute
Example 7.3 (Nonlinear Least Squares) Suppose we want to model E (Yi |Xi )
using functions m1 (Xi , θ 1 ) and m2 (Xi , θ 2 ). Let θ̂ 1 and θ̂ 2 be the weighted nonlinear
least squares estimators, obtained using the sampling weights (if appropriate). For
each unit i, the difference in objective functions is
$$\hat{R}_i = \left[Y_i - m_1(X_i, \hat{\theta}_1)\right]^2 - \left[Y_i - m_2(X_i, \hat{\theta}_2)\right]^2 = \hat{U}_{i1}^2 - \hat{U}_{i2}^2.$$

Then run the regression

$$\hat{U}_{i1}^2 - \hat{U}_{i2}^2 \ \text{on}\ 1, \qquad i = 1, \ldots, N,$$
using weights, if necessary, and adjusting the standard error of the constant for
the sampling scheme. For example, if Yi ≥ 0, possibly taking on the value zero, it
is somewhat common to start with a linear model estimated by OLS. That can be
compared with an exponential model estimated by nonlinear least squares.
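Under random sampling, the linear-versus-exponential comparison in this example can be sketched as follows; the data-generating process, parameter values, and Gauss-Newton solver are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the true conditional mean is exponential, so model two
# (exponential mean, fit by NLS) should beat model one (linear mean, fit by OLS).
N = 1000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = np.exp(0.2 + 0.4 * x) + rng.normal(0.0, 0.3, size=N)

# Model one: linear mean, estimated by OLS.
u1 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Model two: exponential mean, estimated by NLS (Gauss-Newton iterations).
theta = np.zeros(2)
for _ in range(100):
    mu = np.exp(X @ theta)
    theta += np.linalg.lstsq(mu[:, None] * X, y - mu, rcond=None)[0]
u2 = y - np.exp(X @ theta)

# Per-unit difference in objective functions and its t statistic.
R_hat = u1 ** 2 - u2 ** 2
t = np.sqrt(N) * R_hat.mean() / R_hat.std(ddof=1)
```

A large positive t rejects in favor of the exponential model, since R̂i > 0 on average means the linear model leaves larger squared residuals.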
to the truth. Just as importantly, we will see that using the unweighted statistic can
be very misleading.
In the first set of simulations, we create a population of 100,000 units, where
the outcome variable, Y , follows a Poisson distribution conditional on a set of
covariates. We consider two different conditional mean functions. In the first case,
we generate five covariates, all normally distributed, such that
so that the true model includes X3 but excludes X4 and X5 . We call this model one.
We chose the parameter values so that the test does not choose the correct model
with probability one but still has substantial power in its direction. As competing
models, we replace X3 first with X4 (model two) and then with X5 (model three):
$$X_4 = X_3 + R_4, \qquad X_5 = X_3 + R_5,$$
where R4 and R5 are independent Normal(0, 1/9) random variables. Models two
and three fit equally well using any objective function that suitably identifies a con-
ditional mean function, such as a quasi-log likelihood from the linear exponential
family.
In order to study the performance of the test in a quasi-MLE framework, we
also generated the conditional distribution of Y to be exponential in the population,
with the same mean function (21). We must emphasize that we are still using the
Poisson log-likelihood function, so the estimator, in this case, is properly called
QMLE. As in the Poisson case, there is no guarantee that the weighted estimator
will choose the correct model one more frequently than the unweighted test. But
we know it will not systematically choose an incorrect conditional mean over a
correct one.
Rather than drawing a random sample, we stratify the sample on the basis of
Y . In particular, for the Poisson distribution, we take samples of 1, 000 from the
stratum with Y = 0 and 1, 000 from the stratum with Y > 0. In the population,
P(Y = 0) = 0.19061. Therefore, we oversample the stratum with Y = 0. There are
only two strata, and the sampling weights are Q1 = 0.38122 and Q2 = 1.61878. For
the exponential distribution, we choose the strata so that the population frequencies
are about 0.727 and 0.273, using a cutoff Y ≤ 2. Therefore, we oversample units
with larger outcomes, rather than small ones.
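The weight arithmetic for this design is easy to verify; this snippet assumes only the numbers quoted in the text.

```python
import numpy as np

# From the text: P(Y = 0) = 0.19061 in the population, and each stratum
# contributes 1,000 of the 2,000 sampled units, so H_1 = H_2 = 0.5.
Q = np.array([0.19061, 1 - 0.19061])   # population shares (Y = 0, Y > 0)
H = np.array([0.5, 0.5])               # sample shares
weights = Q / H
print(weights)                          # -> [0.38122 1.61878], matching the text
```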
For each draw, we use both the weighted Vuong statistic and the unweighted
statistic, both based on the Poisson log-likelihood function with exponential con-
ditional mean. We test each model against the other two, so we have six outcomes.
Table 1 reports rejection frequencies obtained from the simulations using 1, 000
replications. Here the alternative is that model mi is better than mj , i ≠ j , or in
short mi > mj .
As can be seen in Table 1, when the population distribution is Poisson, the
weighted test does a better job than the unweighted test in detecting that model
one provides the best fit – because it is the true model. For model one versus model
two, the rejection in favor of model one is almost 10 percentage points higher (78.2% versus
68.9%). Similarly, the weighted test does better in choosing between models one
and three (73.0% versus 65.8%). Neither test ever incorrectly chooses model two
or model three over model one. Remember, though, that the estimates of the
parameters using the unweighted estimator are inconsistent because stratification
is based on Y .
Both the weighted and unweighted tests have rejection frequencies close to 0.05
when comparing the two incorrect models, model two and model three, which cor-
responds to the notion that both models are wrong but fit equally well. The weighted
statistic does find a few more “false positives” than the unweighted test, but there
are only 1,000 replications. Overall, the weighted test seems clearly preferred, and
we must use the weights for consistent parameter estimation, anyway.
When the true conditional distribution is exponential, both tests have a tougher
time choosing model one. This is possibly because, for the exponential distri-
bution, the variance is the square of the mean, and so there is much more variation
in the outcome Y than when the conditional distribution is Poisson. Plus, in the
exponential case, we oversample large outcomes rather than small ones (although
the weighted version of the test accounts for that). Overall, the weighted test does
somewhat better. For example, it correctly chooses model one over model two
26.4% of the time compared with 22.9% for the unweighted test.
As a second conditional mean specification, we use
where X1 and X2 have the same normal distributions and X3 ∼ Uniform(1, 3).
Now we are primarily interested in the ability of the test to detect functional form
misspecification. As before, the correct conditional mean function is labeled model
one. The alternative models are
Model two ignores X3 entirely; given the simulation findings in Table 1, we would
expect the test to do well in choosing model one. Model three misspecifies the
functional form in X3 . Because a quadratic can mimic the reciprocal function, we expect
the test to have a more difficult time telling apart models one and three. In the Pois-
son population, we still oversample units with Y = 0. The strata in the exponential
case are defined by Y ≤ 4 and Y > 4.
The findings in Table 2 for the Poisson distribution are very interesting and
highlight the danger of using the unweighted version of the test. The weighted
test is fairly successfully distinguishing between models one and two: it correctly
rejects the null in favor of model one 52.5% of the time, and never chooses model
two. By contrast, the unweighted test never correctly picks model one. It even picks
model two 33.3% of the time. That means that a researcher is much more likely to
think that the correct model is quadratic in X1 and X2 and entirely excludes X3 .
Both the weighted and unweighted tests have very little ability to tell the dif-
ference between models one and three. This is unlikely to be a bad thing because
quantities of interest – such as elasticities, semi-elasticities and average partial
effects – are probably pretty similar across the two models. The weighted test
shows a clear preference for model three over model two, and this is a good thing:
model three is certainly closer to the true model. By contrast, the unweighted test
incorrectly shows a clear preference for model two over model three.
Both tests are completely ineffective for selecting among the three models
when the data are generated from the exponential distribution. As before, this
probably arises from the large variance in an exponential distribution and possibly
the oversampling of large outcomes from the population.
9. CONCLUSION
We have extended Vuong’s (1989) model-selection test in several useful directions. First, we allow
for general M-estimation rather than maximum likelihood estimation. Second,
we allow for complex survey samples rather than assuming random sampling
from a population. Third, we allow panel data applications combined with survey
sampling.
The key to obtaining computationally simple tests is contained in Theorem
3.1, which shows that when the models are appropriately nonnested and they fit
equally well, the limiting distribution of the standardized difference in objective
functions is nondegenerate and does not depend on the limiting distributions of
the estimators themselves. This means we can apply standard asymptotic variance
estimators for stratified samples, cluster samples and combinations of these directly
to the differences in the unit-specific objective functions.
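As an illustration of how directly Theorem 3.1 can be used, here is a minimal Python sketch (ours, not the authors' code) of a Vuong-type statistic for clustered survey data: the weighted unit-level differences in objective-function values are formed first, and their mean is then studentized with an ordinary cluster-robust variance estimator. All data below are simulated placeholders.

```python
import numpy as np

def vuong_survey_test(ll1, ll2, w, cl):
    """Vuong-type statistic from weighted per-unit objective-function
    differences; the variance of the mean difference is cluster-robust."""
    d = w * (ll1 - ll2)                  # weighted unit-level differences
    n = d.size
    dbar = d.mean()
    resid = d - dbar
    # cluster-robust variance of dbar: sum of squared cluster totals / n^2
    v = sum(resid[cl == c].sum() ** 2 for c in np.unique(cl)) / n**2
    return dbar / np.sqrt(v)             # approx. N(0,1) when the models fit equally well

rng = np.random.default_rng(0)
n = 400
ll1 = rng.normal(0.1, 1.0, n)            # per-unit objective values, model 1
ll2 = rng.normal(0.0, 1.0, n)            # per-unit objective values, model 2
w = rng.uniform(0.5, 2.0, n)             # sampling weights
cl = rng.integers(0, 40, n)              # cluster labels
t_stat = vuong_survey_test(ll1, ll2, w, cl)
```

Because the limiting distribution does not depend on the estimators themselves, no derivative or sandwich terms for the fitted parameters appear anywhere in the sketch.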
Section 5.7 contains just a couple of examples that show how the results can
be applied to problems that are explicitly quasi-MLE in nature, including popular
fractional response models and models for nonnegative responses.
For the most part, the simulation results in Section 5.8 are promising. In
addition to providing consistent estimators of the pseudo-true values, weighting the
objective function generally allows us to better choose the best fitting model in
cases where the best fitting model is the true model or the best fitting model is
“close” to the true model. In one case, the unweighted test systematically selects
the worst of the three models while almost never choosing the correct model.
134 IRAJ RAHMANI AND JEFFREY M. WOOLDRIDGE
More simulations could be informative. For example, seeing what happens when
stratification is based on X is something we did not do.
There are several interesting directions for future research. First, it would be
helpful to study the finite-sample properties of the version of the test statistic that
penalizes the number of parameters – see (14). Second, our setup can be extended
to the case where the goodness-of-fit functions are not the same as the objective
functions used to obtain the θ̂_g. For example, in a Tobit model, one might maximize
the log likelihood but then want to make comparisons based on the conditional
mean, in which case we might want to compare a sum of squared residuals from a
Tobit to that from, say, an exponential mean estimated using Poisson QMLE. The
analog of Theorem 3.1 will not be as simple, but such extensions seem worthwhile.
Finally, as suggested by a reviewer, rather than relying on standard first-order
asymptotics, one could possibly bootstrap the test statistic. Given the nature of
the null hypothesis and the complex survey sampling, this poses an interesting
challenge for the future.
REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In
Proceedings of the 2nd international symposium on information theory, Budapest (pp. 267–
281).
Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin
of Economics and Statistics, 49, 431–434.
Bhattacharya, D. (2005). Asymptotic inference from multi-stage samples. Journal of Econometrics,
126, 145–171.
Cox, D. R. (1961). Tests of separate families of hypotheses. In L. M. LeCam, J. Neyman, & E. L. Scott
(Eds.), Proceedings of the 4th Berkeley symposium on mathematical statistics and probability
(Vol. 1, pp. 105–123). Berkeley: University of California Press.
Cox, D. R. (1962). Further results on tests of separate families of hypotheses. Journal of the Royal
Statistical Society, Series B, 24, 406–424.
Davidson, R., & MacKinnon, J. G. (1981). Several tests for model specification in the presence of
alternative hypotheses. Econometrica, 49, 781–793.
Findley, D. F. (1990). Making difficult model comparisons. Mimeo, U.S. Bureau of the Census.
Findley, D. F. (1991). Convergence of finite multistep predictors from incorrect models and its role in
model selection. Note di Matematica, XI, 145–155.
Findley, D. F., & Wei, C. Z. (1993). Moment bound for deriving time series CLT’s and model selection
procedures. Statistica Sinica, 3, 453–480.
Gourieroux, C., Monfort, A., & Trognon, A. (1984). Pseudo-maximum likelihood methods:
Applications to Poisson models. Econometrica, 52, 701–720.
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. In
R. F. Engle & D. L. McFadden (Eds.), Handbook of econometrics (Vol. IV, pp. 2111–2245).
Amsterdam: North-Holland Publishing.
Rahmani, I. (2018). Asymptotic inference of M-estimator from multistage samples with variable
probability in the final stage. Working paper.
Model-Selection Tests for Complex Survey Samples 135
Rivers, D., & Vuong, Q. (2002). Model selection tests for nonlinear dynamic models. The Econometrics
Journal, 5, 1–39.
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Vuong, Q. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica,
57, 307–333.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
Wooldridge, J. M. (1990). An encompassing approach to conditional mean tests with applications to
testing nonnested hypotheses. Journal of Econometrics, 45, 331–350.
Wooldridge, J. M. (1999). Asymptotic properties of weighted M-estimators for variable probability
samples. Econometrica, 67, 1385–1406.
Wooldridge, J. M. (2001). Asymptotic properties of weighted M-estimators for standard stratified
samples. Econometric Theory, 17, 451–470.
Wooldridge, J. M. (2010). Econometric analysis of cross-section and panel data (2nd ed.). Cambridge,
MA: MIT Press.
INFERENCE IN CONDITIONAL
MOMENT RESTRICTION MODELS
WHEN THERE IS SELECTION DUE TO
STRATIFICATION
ABSTRACT
We show how to use a smoothed empirical likelihood approach to conduct
efficient semiparametric inference in models characterized as conditional
moment equalities when data are collected by variable probability sampling.
Results from a simulation experiment suggest that the smoothed empirical
likelihood based estimator can estimate the model parameters very well in
small to moderately sized stratified samples.
Keywords: Conditional moment models; smoothed empirical likelihood;
stratification; variable probability sampling; endogenous and exogenous
stratification; generalized method of moments
138 ANTONIO COSMA ET AL.
1. INTRODUCTION
The gold standard for collecting data, at least for the ease of doing subsequent
statistical analysis, is simple random sampling, whereby each observation in the
“target” population, namely, the population of interest, has an equal chance of
being chosen. Consequently, the probability distribution of the chosen observation,
regarded as belonging to a “realized” population, is the same as the probability
distribution of an observation in the target population, which facilitates statistical
analysis.
However, when estimating or testing economic relationships, economists often
discover that the data they plan to use are not drawn from the target population they
wish to study. Instead, the observations are found to be sampled from a related
but different population. Sometimes this is done deliberately to make the sample
more informative. For example, when studying the impact of welfare legislation,
it is desirable to oversample minorities and low-income families. Similarly, if we
want to examine the effect of disability laws on demand for public transportation,
it makes sense to oversample households with disabled members. At other times,
a distinction between the target and realized populations can be created
unintentionally. For example, in sampling the duration of unemployment at a randomly
chosen time, economists are more likely to observe longer unemployment spells
than shorter ones. Using a data set to answer questions for which it was not
originally designed (a typical situation in economics, where data are often costly to
collect) may also lead to such a situation (Newey, 1993, p. 419). For instance, if the
reason for collecting data is to estimate mean income for an underlying population,
oversampling low income and undersampling high income families can improve
the precision of estimators. However, at some later stage, this income data can be
used by another researcher as the dependent variable in a regression model
without realizing that the original sample was drawn from a distribution other than the
target population.
Whatever its cause, if the distinction between the target and realized
populations is not taken into account when analyzing the data, statistical inference can
be seriously off the mark. This phenomenon is commonly called selection bias.
Cf. Heckman (1976, 1979) and Manski (1989, 1995) for a classic exposition of
the selection problem.
In this paper, we describe an efficient semiparametric approach for conducting
inference in conditional moment restriction models when data are collected by a
variable probability (VP) sampling scheme such that the observations from the
target population have unequal chances of being chosen. In other words, we show
how to efficiently deal with the selection bias caused by the sampling scheme used
to collect the data, because the sampling scheme induces a probability distribution
on the realized population which differs from the target distribution for which
inference is to be made.
The remainder of the paper is organized as follows. In Section 2, we describe
the conditional moment restriction model and the VP sampling scheme. Section 3
discusses how to do inference using the smoothed empirical likelihood approach,
and finite sample properties of the proposed estimator are examined in Section 4.
Section 5 concludes the paper. Related technical details are in the appendices.
2. THE MODEL
2.1. Conditional Moment Equalities
Let Z* := (Y*, X*) be a (dim(Y*) + dim(X*)) × 1 random (column) vector that denotes an
observation from the target population, where Y* is the vector of endogenous
variables and X* the vector of exogenous variables. Assume that

H_0 : ∃ θ* ∈ R^{dim(θ*)} s.t. E_{P*_{Y*|X*}}[g(Z*, θ*) | X*] = 0, P*_{X*}-a.s. (2.1)
P(Z ∈ B) := Σ_{l=1}^{L} (p_l/b*) ∫_B 1_{C_l}(z) dP*(z), B ∈ B(R^{dim(Z*)}), (2.2)

where B(R^{dim(Z*)}) is the Borel sigma-field of R^{dim(Z*)}, b* := Σ_{l=1}^{L} p_l Q*_l, and
Q*_l := P*(Z* ∈ C_l) > 0 denotes the probability that a randomly chosen observation
from the target population lies in the lth stratum.
Since Q*_l represents the probability mass of the lth stratum in the target
population, the Q*_l ’s are popularly called “aggregate shares.” The aggregate shares,
which add up to one, i.e., Σ_{l=1}^{L} Q*_l = 1, are unknown parameters of interest to
be estimated along with the structural parameter θ*. The parameter b* also has
a practical interpretation, namely, it is the probability that an observation drawn
from the target population during the sampling process is ultimately retained in
the sample.
Selection Due to Stratification 141
It is immediate from (2.2) that the density of P, with respect to any measure
on B(R^{dim(Z*)}) that dominates P*, is given by

dP(z) := Σ_{l=1}^{L} (p_l/b*) 1_{C_l}(z) dP*(z) = (b(z)/b*) dP*(z), z ∈ R^{dim(Z*)}, (2.3)
where b(z) := Σ_{l=1}^{L} p_l 1_{C_l}(z). Following Imbens and Lancaster (1996, p. 296),
b(·)/b* is referred to as a bias function because it determines the selection bias
due to stratified sampling, i.e., the extent to which P differs from P ∗ . For instance,
it is easy to see that if the sampling probabilities p1 , . . . , pL are all equal, then
there is no selection bias, i.e., P = P ∗ , because b( · )/b∗ = 1 irrespective of the
values taken by the aggregate shares.
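To make the bias function concrete, the following small simulation (our illustration, not from the chapter) applies VP sampling to a standard normal target with two strata: draws with Z ≤ 0 are always retained (p₁ = 1), while draws with Z > 0 are retained with probability p₂ = 0.25, so that b* = p₁Q*₁ + p₂Q*₂ = 0.625.

```python
import numpy as np

rng = np.random.default_rng(42)

p1, p2 = 1.0, 0.25                        # stratum retention probabilities
z_star = rng.normal(size=200_000)         # draws from the target population
b_z = np.where(z_star <= 0, p1, p2)       # b(z) = sum_l p_l 1_{C_l}(z)
kept = z_star[rng.uniform(size=z_star.size) < b_z]

b_star = 0.5 * p1 + 0.5 * p2              # both aggregate shares equal 1/2
acc_rate = kept.size / z_star.size        # fraction retained, approx. b* = 0.625
realized_mean = kept.mean()               # negative: the Z <= 0 stratum is oversampled
```

The retained draws come from the realized, not the target, distribution: their mean is roughly −0.48 even though the target mean is zero, which is exactly the tilt dP = (b(z)/b*) dP*.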
The marginal density of X is given by

dP_X(x) := ∫_{y ∈ R^{dim(Y*)}} dP(y, x) (x ∈ R^{dim(X*)})
= ∫_{y ∈ R^{dim(Y*)}} (b(y, x)/b*) dP*_{Y*|X*=x}(y) dP*_{X*}(x) ((2.3))
= (γ*(x)/b*) dP*_{X*}(x), (2.4)
where γ*(x) := E_{P*_{Y*|X*}}[b(Y*, x) | X* = x]. Throughout the paper, we maintain the
assumption that γ* > 0 on supp(X*).⁴ Under this condition, the probability
distributions P_X and P*_{X*} are mutually absolutely continuous, which we denote by
writing P*_{X*} ≪ P_X ≪ P*_{X*}.
Since supp(Y, X) = supp(Y*, X*) and γ* > 0 on supp(X*), the conditional
density of Y|X is given by

dP_{Y|X=x}(y) := dP(y, x)/dP_X(x) = (b(y, x)/γ*(x)) dP*_{Y*|X*=x}(y), (y, x) ∈ supp(Y*) × supp(X*). (2.5)
By (2.5), dP_{Y|X=x}(y) = dP*_{Y*|X*=x}(y) if and only if b(y, x) = γ*(x) for all
(y, x) ∈ supp(Y*) × supp(X*). However, as discussed subsequently, the condition
b(y, x) = γ*(x) holds only in a special case. Therefore, in general,
dP_{Y|X} ≠ dP*_{Y*|X*}. Consequently, estimating (2.1) using the realized sample
without accounting for the fact that it was obtained by stratified sampling, i.e., ignoring
stratification, will generally not lead to a consistent estimator of θ*.
2.3. Identification
In contrast to some other stratified sampling schemes (cf. Tripathi, 2011b, Sections 3.1
and 4.1), identification, i.e., uniqueness, of θ* cannot be lost because of VP
sampling. To see this, begin by recalling that the assumption that γ* > 0 on supp(X*)
implies that the distributions PX and PX∗ ∗ are mutually absolutely continuous.
Hence,
(2.1) ⇐⇒ E_{P*_{Y*|X*}}[g(Y*, x, θ*) | X* = x] = 0 for P*_{X*}-a.a. x ∈ supp(X*)
⇐⇒ γ*(x) E_{P_{Y|X}}[g(Y, x, θ*)/b(Y, x) | X = x] = 0 for P*_{X*}-a.a. x ∈ supp(X*) ((2.5))
⇐⇒ P*_{X*}{x ∈ supp(X*) : E_{P_{Y|X}}[g(Y, x, θ*)/b(Y, x) | X = x] ≠ 0} = 0 (γ* > 0)
⇐⇒ P_X{x ∈ supp(X*) : E_{P_{Y|X}}[g(Y, x, θ*)/b(Y, x) | X = x] ≠ 0} = 0 (P*_{X*} ≪ P_X ≪ P*_{X*})
⇐⇒ E_{P_{Y|X}}[g(Y, x, θ*)/b(Y, x) | X = x] = 0 for P_X-a.a. x ∈ supp(X*). (2.6)
Since b(Z) does not depend on θ ∗ , the equivalence in (2.6) reveals that θ ∗ in
(2.1) is uniquely defined if and only if θ ∗ in EPY |X [g(Z, θ ∗ )/b(Z)|X] = 0 (PX -a.s.)
is uniquely defined. That is, any condition that leads to the identification of θ ∗
in (2.1) will also ensure identification of θ ∗ in the right hand side of (2.6) and
vice-versa. To illustrate this, assume that the columns of the partial derivative
∂_θ E_{P*_{Y*|X*}}[g(Z*, θ*) | X*] are linearly independent P*_{X*}-a.s. As shown in Cosma,
Kostyrka, and Tripathi (2018), this condition is sufficient to ensure that θ ∗ is locally
identified.⁵ However, since b does not depend on θ (which implies that γ* does
not depend on θ), we have that

∂_θ E_{P*_{Y*|X*}}[g(Z*, θ*) | X* = x] = γ*(x) ∂_θ E_{P_{Y|X}}[g(Z, θ*)/b(Z) | X = x], x ∈ supp(X*). ((2.5))
Therefore, since γ* > 0 on supp(X*), the columns of ∂_θ E_{P*_{Y*|X*}}[g(Z*, θ*) | X*]
are linearly independent P*_{X*}-a.s. if and only if the columns of
∂_θ E_{P_{Y|X}}[g(Z, θ*)/b(Z) | X] are linearly independent P_X-a.s. (because P_X and P*_{X*}
are mutually absolutely continuous).
Since identification of θ ∗ cannot be lost because of VP sampling, for the
remainder of the paper we maintain that θ ∗ is identified.
supp(Y*, X*) = ∪_{j=1}^{J} ∪_{m=1}^{M} (A_j × B_m) if both Y* and X* are stratified;
supp(Y*, X*) = ∪_{j=1}^{J} (A_j × supp(X*)) if only Y* is stratified;
supp(Y*, X*) = ∪_{m=1}^{M} (supp(Y*) × B_m) if only X* is stratified.

If only Y* is stratified, then supp(Z*) = ∪_{l=1}^{L} C_l with L = J and C_l = A_l × supp(X*).
Consequently, for (y, x) ∈ supp(Y*) × supp(X*),

b(y, x) = Σ_{l=1}^{J} p_l 1_{A_l × supp(X*)}(y, x) = Σ_{l=1}^{J} p_l 1_{A_l}(y) =: b_endog(y).
144 ANTONIO COSMA ET AL.
endogenous stratification =⇒ dP_{Y|X=x}(y) = (b_endog(y)/γ*_endog(x)) dP*_{Y*|X*=x}(y), (2.7)

where γ*_endog(x) := E_{P*_{Y*|X*}}[b_endog(Y*) | X* = x].
If only X* is stratified, then supp(Z*) = ∪_{l=1}^{L} C_l with L = M and C_l =
supp(Y*) × B_l. Consequently, for (y, x) ∈ supp(Y*) × supp(X*),

b(y, x) = Σ_{l=1}^{M} p_l 1_{supp(Y*) × B_l}(y, x) = Σ_{l=1}^{M} p_l 1_{B_l}(x) =: b_exog(x),

which implies that γ*_exog(x) := E_{P*_{Y*|X*}}[b_exog(X*) | X* = x] = b_exog(x). Hence,
by (2.5),

exogenous stratification =⇒ dP_{Y|X=x}(y) = dP*_{Y*|X*=x}(y). (2.8)
Example 2.1: (Linear regression with exogenous regressors) Consider the linear
regression model Y* = X̃*'θ* + ε*, where X̃* := (1, X*)'. Assume that the regressors
are exogenous with respect to the model error in the target population, i.e.,
E_{P*_{Y*|X*}}[ε* | X*] = 0 P*_{X*}-a.s.
Suppose that only Y ∗ is stratified. If we ignore the fact that the data were
collected by VP sampling and simply regress the observed Y on the observed X
and the constant regressor, then θ ∗ cannot be consistently estimated by the least
squares (LS) estimator. Indeed, letting θ̂LS denote the LS estimator obtained by
regressing Y on X̃ := (1, X), we have that
plim_{n→∞} θ̂_LS = plim_{n→∞} (n⁻¹ Σ_{j=1}^{n} X̃_j X̃_j')⁻¹ (n⁻¹ Σ_{j=1}^{n} X̃_j Y_j)
= (E_{P_X} X̃ X̃')⁻¹ (E_{P_X} X̃ μ(X)), (2.9)

where μ(X) := E_{P_{Y|X}}[Y|X]. But, by (2.4), E_{P_X} X̃ X̃' = E_{P*_{X*}}[γ*_endog(X*) X̃* X̃*']/b*, and
similarly E_{P_X} X̃ μ(X) = E_{P*_{X*}}[γ*_endog(X*) X̃* μ(X*)]/b*. Hence,

plim_{n→∞} θ̂_LS = (E_{P*_{X*}}[(γ*_endog(X*)/b*) X̃* X̃*'])⁻¹ E_{P*_{X*}}[(γ*_endog(X*)/b*) X̃* μ(X*)] ((2.9) & (2.4)) (2.10)
= (E_{P*_{X*}}[γ*_endog(X*) X̃* X̃*'])⁻¹ E_{P*_{X*}}[γ*_endog(X*) X̃* (X̃*'θ* + (1/γ*_endog(X*)) E_{P*_{Y*|X*}}[ε* b_endog(Y*) | X*])]
= θ* + (E_{P*_{X*}}[γ*_endog(X*) X̃* X̃*'])⁻¹ (E_{P*}[X̃* ε* b_endog(Y*)])
≠ θ*,

because E_{P*_{Y*|X*}}[ε* | X*] = 0 (P*_{X*}-a.s.) does not imply that E_{P*}[X̃* ε* b_endog(Y*)] = 0.
If, however, stratification is exogenous, then

μ(x) = E_{P_{Y|X}}[Y | X = x] = E_{P*_{Y*|X*}}[Y* | X* = x] ((2.8)) (x ∈ supp(X*))
= E_{P*_{Y*|X*}}[X̃*'θ* + ε* | X* = x]
= x̃'θ*. (2.11)
Hence, ignoring exogenous stratification does not affect the consistency of θ̂_LS
because, by (2.9) and (2.11), plim_{n→∞} θ̂_LS = (E_{P_X} X̃ X̃')⁻¹ E_{P_X}[X̃ X̃'θ*] = θ*.
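Both halves of this conclusion are easy to see in a short Monte Carlo (our illustration; the lognormal regressor and the particular two-stratum retention rules below are hypothetical choices, not the chapter's design):

```python
import numpy as np

rng = np.random.default_rng(1)

def vp_sample(n_target, keep_prob):
    """Draw from the target population Y* = 1 + X* + eps*, then retain each
    observation with probability keep_prob(y, x) (variable probability sampling)."""
    x = np.exp(rng.normal(size=n_target))        # log X* ~ N(0, 1)
    y = 1.0 + x + rng.normal(size=n_target)      # beta0* = beta1* = 1
    keep = rng.uniform(size=n_target) < keep_prob(y, x)
    return y[keep], x[keep]

def ols(y, x):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]  # (intercept, slope)

# endogenous: retention depends on Y*, so E[Y|X] shifts in the realized population
y_en, x_en = vp_sample(200_000, lambda y, x: np.where(y > 2.0, 0.2, 1.0))
# exogenous: retention depends on X* only, so E[Y|X] is unchanged
y_ex, x_ex = vp_sample(200_000, lambda y, x: np.where(x > 1.0, 0.2, 1.0))

coef_endog = ols(y_en, x_en)   # intercept pulled well below its target value of 1
coef_exog = ols(y_ex, x_ex)    # close to (1, 1)
```

Under this particular endogenous rule the distortion shows up mainly in the intercept; under the exogenous rule both LS coefficients remain consistent, exactly as (2.11) predicts.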
3. INFERENCE
3.1. Related Literature and Our Contribution
There is a large literature on estimation and testing models using data collected by
various types of stratified sampling schemes; cf. the papers cited at the beginning
of Section 2.2, and the references therein. In this section, we briefly describe only
some of the works that consider VP sampling.
Earlier papers in the literature on estimating models with conditioning variables
assume that PY∗∗ |X∗ is known up to a finite dimensional parameter; only PX∗ ∗ is left
completely unspecified. For example, a well-known application of VP sampling
can be found in Hausman and Wise (1981). Imbens and Lancaster (1996) extend the
maximum likelihood approach of Hausman and Wise to a moment-based
methodology that allows for VP sampling, mixed-response variables and stratification on
exogenous covariates. Regression under VP sampling and a parametric PY∗∗ |X∗ has
also been investigated. For example, Jewell (1985) and Quesenberry and Jewell
(1986) propose iterative estimators of regression coefficients under VP sampling
without imposing normality or independence, though they do not provide any
asymptotic theory for their estimators.
The papers described above impose strong conditions on the distribution of
Y ∗ |X ∗ . Exceptions include Wooldridge (1999) and Tripathi (2011b), who leave
both P*_{Y*|X*} and P*_{X*} completely unspecified. Wooldridge provides asymptotic
theory for M-estimation under VP sampling for a model defined in terms of a set of
dP(z) := (‖z‖ / E_{P*}‖Z*‖) dP*(z), z ∈ R^{dim(Z*)},

where ‖·‖ is the Euclidean norm. That is, length-biased sampling can be expressed
as (2.3) with b(z) := ‖z‖ and b* := E_{P*}‖Z*‖. Therefore, with only minor
notational changes, the results obtained in this paper can be extended to length-biased
sampling as well.
Length-biased sampling has been extensively studied for the parametric case,
i.e., where dP* is specified up to a finite dimensional parameter, cf., e.g., Patil
and Rao (1977, 1978), Bickel, Klaassen, Ritov, and Wellner (1993, Section 4.4)
and Owen (2001, Chapter 6). As far as a nonparametric treatment of length-biased
sampling is concerned, Vardi (1982) deals with the case when P ∗ is unknown.
Vardi assumes that both P ∗ and P can be sampled with positive probability.
Using two independent samples (one each from P ∗ and P ), he shows how to
construct the nonparametric maximum likelihood estimators (NPMLE) of P ∗ and
P and also obtains their asymptotic distributions. Vardi (1985) and Gill, Vardi, and
Wellner (1988) provide conditions for the existence and uniqueness of the NPMLE
of P ∗ in a general setup when more than two independent samples from F ∗ and F
are available. These papers concentrate on the distributions P ∗ and P ; there are no
other parameters to estimate. Qin (1993) uses the empirical likelihood approach
to construct a nonparametric likelihood ratio confidence interval for θ ∗ := EP ∗ Z ∗ ,
i.e., a just-identified unconditional moment equality, using an independent sample
from P ∗ and P . El-Barmi and Rothmann (1998) generalize Qin’s treatment to
handle models with overidentified unconditional moment restrictions of the form
EP ∗ g(Z, θ ∗ ) = 0. They also obtain efficient estimators of P ∗ and P . However, they
do not consider the testing of overidentifying restrictions.
The efficiency bounds for estimating θ ∗ and related functionals have been derived
in Severini and Tripathi (2013, Section 14.3). In this section, we describe some
of these bounds and discuss their salient features. Construction of estimators that
achieve these bounds is considered in the next section.
For the remainder of the paper, let ρ1 (Z, θ ) := g(Z, θ )/b(Z). Since the right
hand side of (2.6) is a conditional moment equality with respect to the realized con-
ditional distribution PY |X , the efficiency bound for θ ∗ follows from Chamberlain
(1987). Namely, the efficiency bound for estimating θ* is given by¹⁰

l.b.(θ*) := (E_{P_X}[D'(X) V₁⁻¹(X) D(X)])⁻¹, (3.1)

where D(X) := ∂_θ E_{P_{Y|X}}[ρ₁(Z, θ*) | X] and V₁(X) := E_{P_{Y|X}}[ρ₁(Z, θ*) ρ₁'(Z, θ*) | X].
The efficiency bound in (3.1), given as a functional of the realized distribution
P , can be used to determine whether an estimator of θ ∗ is semiparametrically effi-
cient by comparing its asymptotic variance with l.b.(θ ∗ ). However, as the moment
condition model (2.1) is specified in terms of the target distribution P ∗ , in order to
answer questions such as how the efficiency bound for θ ∗ changes if stratification is
purely endogenous (or purely exogenous) or if the error term in a regression model
is conditionally homoskedastic in the target population, it is helpful to rewrite (3.1)
in terms of P ∗ . To do so, observe that, by (2.5), we have
D(x) = (1/γ*(x)) ∂_θ E_{P*_{Y*|X*}}[g(Z*, θ*) | X* = x],
V₁(x) = (1/γ*(x)) E_{P*_{Y*|X*}}[g(Z*, θ*) g'(Z*, θ*)/b(Y*, x) | X* = x], x ∈ supp(X*). (3.2)
Hence, by (2.4) and (3.2), the efficiency bound in (3.1) can be written as

l.b.(θ*) = b* (E_{P*_{X*}}[(∂_θ E[g(Z*, θ*)|X*])' (E[g(Z*, θ*) g'(Z*, θ*)/b(Z*)|X*])⁻¹ (∂_θ E[g(Z*, θ*)|X*])])⁻¹, (3.3)
l.b.(θ* in Example 2.1) = b* (E_{P*_{X*}}[X̃* X̃*'/V_b(X*)])⁻¹ ((3.3))
= (E_{P_X}[X̃ X̃'/(γ*²(X) V₁(X))])⁻¹, ((2.4), (2.5)) (3.4)

where V_b(X*) := E_{P*_{Y*|X*}}[ε*²/b(Y*, X*) | X*].
If stratification is endogenous, then

l.b.(θ* in Example 2.1)|endog. strat. = (E_{P_X}[X̃ X̃'/(γ*_endog²(X) V_{1,endog}(X))])⁻¹.

The weighted GMM estimator under exogenous stratification is

θ̂_GMM,exog := (Σ_{j=1}^{n} X̃_j X̃_j'/b_exog(X_j))⁻¹ (Σ_{j=1}^{n} X̃_j Y_j/b_exog(X_j)).
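A sketch (ours) of this estimator in Python; the two-stratum retention schedule b_exog and the data-generating process are illustrative assumptions, and the estimator simply reweights each observation by 1/b_exog(X_j):

```python
import numpy as np

rng = np.random.default_rng(7)

# target population: Y* = 1 + X* + eps*, with exogenous stratification on X*
x = np.exp(rng.normal(size=100_000))
y = 1.0 + x + rng.normal(size=x.size)
b_exog = np.where(x > 1.0, 0.2, 1.0)          # stratum retention probabilities
keep = rng.uniform(size=x.size) < b_exog
xs, ys, bs = x[keep], y[keep], b_exog[keep]

# weighted estimator: solves sum_j Xtil_j (Y_j - Xtil_j' theta) / b_exog(X_j) = 0
Xt = np.column_stack([np.ones_like(xs), xs])
A = (Xt / bs[:, None]).T @ Xt
c = (Xt / bs[:, None]).T @ ys
theta_gmm = np.linalg.solve(A, c)             # approx. (1, 1)
```

Reweighting by the inverse retention probability undoes the tilt b_exog(x)/b* that VP sampling puts on the realized distribution of X.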
Since the aggregate shares add up to one, it suffices to determine the efficiency
bound for estimating Q*_{−L} := (Q*₁, ..., Q*_{L−1})' ∈ (0, 1)^{L−1}. The aggregate
shares are identified in the realized population by the moment condition

E_P[(s(Z) − Q*_{−L})/b(Z)] = 0, (3.6)

where s(Z) := (1_{C₁}(Z), ..., 1_{C_{L−1}}(Z))' is (L−1) × 1. The moment conditions in (3.6)
modify accordingly if stratification is endogenous or exogenous, namely,

endog. strat. =⇒ E_{P_Y}[(s_endog(Y) − Q*_{−L})/b_endog(Y)] = 0, with s_endog(Y) := (1_{A₁}(Y), ..., 1_{A_{J−1}}(Y))';
exog. strat. =⇒ E_{P_X}[(s_exog(X) − Q*_{−L})/b_exog(X)] = 0, with s_exog(X) := (1_{B₁}(X), ..., 1_{B_{M−1}}(X))'. (3.7)
Let ρ₂(Z, Q*_{−L}) := (s(Z) − Q*_{−L})/b(Z), and let Σ₁₂(X) := E_{P_{Y|X}}[ρ₁(Z, θ*) ρ₂'(Z, Q*_{−L}) | X]
be the conditional (on X) covariance between ρ₁(Z, θ*) and ρ₂(Z, Q*_{−L}).
Then, under (2.1), the efficiency bound for estimating Q*_{−L} is given by

l.b.(Q*_{−L}) := b*² [var_P(ρ₂(Z, Q*_{−L})) − E_{P_X}[Σ₁₂'(X) V₁⁻¹(X) Σ₁₂(X)]
+ (E_{P_X}[Σ₁₂'(X) V₁⁻¹(X) D(X)]) (l.b.(θ*)) (E_{P_X}[D'(X) V₁⁻¹(X) Σ₁₂(X)])], (3.8)
statistics for testing H0 and parametric restrictions on θ ∗ that do not require pre-
liminary estimation of any variance terms. Moreover, the resulting estimation and
testing procedures are invariant to normalizations of H0 . Simulation results pre-
sented in the aforementioned papers suggest that the SEL-based approach can
work very well in finite samples.
The advantages of the SEL approach described above extend to the case when
the observations are collected by VP sampling. Furthermore, it leads to a unified
approach of estimating and testing models using stratified samples, which should
appeal to applied economists and practitioners in the field. Therefore, we now
demonstrate how to use the SEL approach to construct asymptotically efficient
estimators, i.e., estimators with asymptotic variance equal to the efficiency bounds
in Section 3.2.
If the focus is on efficient estimation of θ ∗ alone, then the equivalence in
(2.6) reveals that replacing the moment function in Kitamura, Tripathi, and Ahn
(Eq. (2.1)) with ρ1 (Z, θ ∗ ) will deliver an asymptotically efficient estimator of θ ∗ .
But what about Q*_{−L}? Although, by (3.6), the aggregate shares Q*_{−L} =
E_P[s(Z)/b(Z)]/E_P[1/b(Z)] can be simply estimated by their sample analogs, this estimator will
not be efficient because it does not take (2.1) into account; cf. the discussion after
(3.8). To construct an estimator of Q∗−L that accounts for (2.1), we have to jointly
estimate θ ∗ and Q∗−L , which we do using the SEL approach.
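For concreteness, the (inefficient) sample-analog estimator of an aggregate share mentioned above can be sketched as follows (our toy example with two strata on a scalar Z and known retention probabilities):

```python
import numpy as np

rng = np.random.default_rng(3)

p1, p2 = 1.0, 0.25                        # known stratum retention probabilities
z_star = rng.normal(size=200_000)         # target population; true shares are (1/2, 1/2)
b_all = np.where(z_star <= 0, p1, p2)
z = z_star[rng.uniform(size=b_all.size) < b_all]   # realized VP sample

b_z = np.where(z <= 0, p1, p2)            # b(Z_j) for the retained observations
s_z = (z <= 0).astype(float)              # s(Z_j) = 1_{C_1}(Z_j)
q1_hat = np.mean(s_z / b_z) / np.mean(1.0 / b_z)   # sample analog implied by (3.6)
```

Even though stratum one makes up 80% of the realized sample, the inverse-probability ratio recovers its target-population share of one half; the SEL estimator additionally exploits (2.1) to reduce variance.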
For the remainder of the paper, assume that we have independent observations
Z₁, ..., Z_n collected by VP sampling. Hence, these are i.i.d. draws from the
realized density dP in (2.3). Our estimation approach relies on a smoothed version of
empirical likelihood. This smoothing, or localization, is carried out using positive
kernel weights

w_ij := K_{b_n}(X_i − X_j) / Σ_{k=1}^{n} K_{b_n}(X_i − X_k), i, j = 1, ..., n,

where K is a second order kernel, K_{b_n}(·) := K(·/b_n), and b_n is the bandwidth.
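As an illustration (ours), the weight matrix can be computed in a vectorized way; the Gaussian kernel below is one admissible second-order choice, and the bandwidth is arbitrary:

```python
import numpy as np

def sel_weights(x, bn):
    """Row-normalized kernel weights w_ij = K_bn(x_i - x_j) / sum_k K_bn(x_i - x_k)."""
    u = (x[:, None] - x[None, :]) / bn
    k = np.exp(-0.5 * u**2)               # Gaussian K, a second-order kernel
    return k / k.sum(axis=1, keepdims=True)

x = np.linspace(0.0, 1.0, 50)
w = sel_weights(x, bn=0.1)                # each row of w sums to one
```

Row i of the matrix localizes the empirical likelihood around X_i, which is what turns the conditional moment restriction into n local unconditional ones.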
For i, j = 1, . . . , n, let pij denote the probability mass placed at (Xi , Zj ) by
a discrete distribution with support (X1 , . . . , Xn ) × (Z1 , . . . , Zn ). The collection
of probabilities (p_ij), i, j = 1, ..., n, can be thought of as a set of nuisance parameters that
include the empirical distribution of the data. Using the kernel weights (w_ij) and the
distribution (p_ij), construct the smoothed log-likelihood Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log p_ij.
Then, given (θ , Q−L ), concentrate out (pij ) by solving the following optimization
problem:
max_{(p_ij)} Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log p_ij

s.t. p_ij ≥ 0 for i, j = 1, ..., n, Σ_{i=1}^{n} Σ_{j=1}^{n} p_ij = 1, (3.9)

Σ_{j=1}^{n} ρ₁(Z_j, θ) p_{1j} = 0, ..., Σ_{j=1}^{n} ρ₁(Z_j, θ) p_{nj} = 0, Σ_{i=1}^{n} Σ_{j=1}^{n} ρ₂(Z_j, Q_{−L}) p_ij = 0.
If the convex hulls of {ρ₁(Z₁, θ), ..., ρ₁(Z_n, θ)} and {ρ₂(Z₁, Q_{−L}), ...,
ρ₂(Z_n, Q_{−L})} contain the origin, then (3.9) can be solved by using Lagrange
multipliers. In this case, it can be verified that the solution to (3.9) is given by

p̂_ij(θ, Q_{−L}) := (1/n) · w_ij/(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L})), i, j = 1, ..., n,

where the multipliers λ₁, ..., λ_n, μ solve

Σ_{j=1}^{n} w_ij ρ₁(Z_j, θ)/(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L})) = 0, i = 1, ..., n,
(3.10)
Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij ρ₂(Z_j, Q_{−L})/(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L})) = 0.
Substituting p̂_ij(θ, Q_{−L}) into the smoothed log-likelihood yields

SEL(θ, Q_{−L}) := Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log p̂_ij(θ, Q_{−L})
= Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log[(w_ij/n)/(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L}))] (3.11)
= Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log(w_ij/n)
− Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log(1 + λ_i'ρ₁(Z_j, θ) + μ'ρ₂(Z_j, Q_{−L})).
Furthermore,¹³

(λ₁, ..., λ_n, μ) = argmax_{λ̃₁,...,λ̃_n,μ̃} Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij log(1 + λ̃_i'ρ₁(Z_j, θ) + μ̃'ρ₂(Z_j, Q_{−L})). (3.12)
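For intuition, here is a toy version (ours) of the inner maximization for a single i, a scalar moment function, and no μ term; it uses safeguarded Newton steps and is only a sketch of what a full implementation of (3.12) would do:

```python
import numpy as np

def inner_lambda(w, g, iters=50):
    """Maximize sum_j w_j log(1 + lam * g_j) over scalar lam (strictly concave)."""
    lam = 0.0
    for _ in range(iters):
        d = 1.0 + lam * g
        grad = np.sum(w * g / d)
        hess = -np.sum(w * g**2 / d**2)
        step = -grad / hess
        # halve the step until every 1 + lam * g_j stays strictly positive
        while np.any(1.0 + (lam + step) * g <= 1e-8):
            step *= 0.5
        lam += step
    return lam

rng = np.random.default_rng(5)
g = rng.normal(0.2, 1.0, size=200)     # stand-in for localized moment values rho_1(Z_j, theta)
w = np.full(g.size, 1.0 / g.size)      # uniform weights stand in for w_ij
lam = inner_lambda(w, g)
foc = np.sum(w * g / (1.0 + lam * g))  # first-order condition, approx. zero
```

The positivity safeguard mirrors the convex-hull condition above: the implied probabilities w_j/(n(1 + λg_j)) must stay positive for the solution to exist.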
Therefore, the estimators of θ* and Q*_{−L} are defined to be

(θ̂, Q̂_{−L}) := argmax_{(θ, Q_{−L})} SELT(θ, Q_{−L}), (3.13)

where

SELT(θ, Q_{−L}) := − max_{λ̃₁,...,λ̃_n,μ̃} Σ_{i=1}^{n} T_{i,n} Σ_{j=1}^{n} w_ij log(1 + λ̃_i'ρ₁(Z_j, θ) + μ̃'ρ₂(Z_j, Q_{−L})) (3.14)
= − max_{μ̃} Σ_{i=1}^{n} T_{i,n} max_{λ̃_i} Σ_{j=1}^{n} w_ij log(1 + λ̃_i'ρ₁(Z_j, θ) + μ̃'ρ₂(Z_j, Q_{−L})).
The trimming indicator T_{i,n} := 1(ĥ(X_i) ≥ b_n^τ), where ĥ(X_i) := Σ_{j=1}^{n} K_{b_n}(X_i −
X_j)/(n b_n^{dim(X)}) and τ ∈ (0, 1) is a trimming parameter, is incorporated in (3.14) to
deal with the “denominator problem,” namely, the instability of the local empirical
log-likelihood Σ_{j=1}^{n} w_ij log(1 + λ̃_i'ρ₁(Z_j, θ) + μ̃'ρ₂(Z_j, Q_{−L})) caused by the
density of the conditioning variables becoming too small in the tails. Since T_{i,n} →p 1
as n → ∞, this trimming scheme ensures that asymptotically no data are lost.
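A small sketch (ours) of the trimming rule; the Gaussian kernel, the bandwidth b_n = 0.05, and τ = 0.9 are illustrative choices for 500 standard normal draws:

```python
import numpy as np

def trimming_indicator(x, bn, tau):
    """T_in = 1{h_hat(X_i) >= bn**tau}, with h_hat a kernel density estimate."""
    u = (x[:, None] - x[None, :]) / bn
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    h_hat = k.sum(axis=1) / (x.size * bn)    # h_hat(X_i); dim(X) = 1 here
    return (h_hat >= bn**tau).astype(int)

rng = np.random.default_rng(11)
x = rng.normal(size=500)
T = trimming_indicator(x, bn=0.05, tau=0.9)
kept_share = T.mean()    # most points kept; only low-density tail points trimmed
```

As the bandwidth shrinks with n, the threshold b_n^τ falls toward zero, so any fixed point with positive density is eventually kept, which is the sense in which no data are lost asymptotically.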
Following Kitamura, Tripathi, and Ahn, it can be shown that, under some
regularity conditions, θ̂ and Q̂−L are consistent, asymptotically normal and
asymptotically efficient, i.e., their asymptotic variances match the efficiency
bounds.
3.4. Testing
It can be shown that, under some regularity conditions, LR →d χ²_{dim(R)} as n → ∞
whenever H̃₀ is true. This result can be used to obtain the critical values for LR. Although
a Wald statistic can also be constructed, it is less attractive than LR because the
latter is internally studentized. As in parametric situations, LR can be inverted to
obtain asymptotically valid confidence intervals. A nice property of confidence
intervals based on LR is that they are invariant to nonsingular transformations of
the moment conditions. They also automatically satisfy natural range restrictions.
Since inference based on θ̂ is sensible only if (2.1) is true, it is important to
devise a test for H0 against the alternative that it is false. As we are dealing with
conditional moment restrictions, any specification test which first converts (2.1)
into a finite set of unconditional moment restrictions will not be consistent for test-
ing H0 . However, using the equivalence in (2.6), a consistent test of H0 is easily
obtained by replacing moment function in Tripathi and Kitamura (2003, Equa-
tion 1.1) with ρ1 (Z, θ ∗ ). Note that since (3.6) just identifies the aggregate shares,
testing the specification of (2.1) and (3.6) jointly is equivalent to testing (2.1).
4. SIMULATION STUDY
We now examine the finite sample behavior of the LS, GMM and SEL estimators
to illustrate the effects of estimating a simple linear regression model specified for
the target population, when data are collected by VP sampling and stratification is
either endogenous or exogenous. Code for the simulations is written in R, and the
SEL estimator of the model parameters and aggregate shares defined in (3.13) is
implemented using the algorithm in Owen (2013); cf. Appendix B for details.
4.1. Design
We consider the design in Kitamura et al. (Section 5), which has been used earlier
by Cragg (1983) and Newey (1993). The model to be estimated is
Y* = β₀* + β₁* X* + ε*,

where E_{P*_{Y*|X*}}[ε* | X*] = 0 P*_{X*}-a.s., θ* := (β₀*, β₁*)' = (1, 1)', and (ε*, log X*) ∼
NIID(0, 1). We consider two specifications for the skedastic function in the target
population.
4.2. Discussion
500.¹⁵ In contrast, in both designs, the LS and GMM estimators under exogenous
stratification are practically unbiased even when n = 50. Under exogenous strati-
fication, the LS estimator has smaller sampling variance than the GMM estimator
for each sample size. However, this finding can be mathematically justified only for
homoskedastic designs (recall from Example 3.2 that the LS estimator is
asymptotically efficient when stratification is exogenous and the error term in the regression
model is conditionally homoskedastic in the target population). Indeed, as shown
in Appendix A, cf. Example A.1, counterexamples can be constructed to show that
in heteroskedastic designs, the LS estimator can have higher sampling variance
than the GMM estimator when stratification is exogenous.¹⁶ Under endogenous
stratification, the GMM estimator of the slope coefficient exhibits some bias (≈ 2–4%
in both designs) when n = 50, but the bias is very close to zero when n = 500.
the two is most pronounced when n = 500; e.g., irrespective of the stratification
scheme, the RMSE of the GMM estimator is at least 65% larger than the RMSE
of the SEL estimator.
In the homoskedastic design, the SEL estimator exhibits some bias under both
endogenous and exogenous stratification when n = 50, but its bias is close to zero
for n = 500. However, its RMSE is larger than that of the GMM estimator even
when n = 500. This finding, which corroborates the simulation results
in Kitamura et al. (p. 1682), is likely due to the fact that the SEL estimator inter-
nally estimates the skedastic function nonparametrically to achieve semiparametric
efficiency and is thus unable to take advantage of conditional homoskedasticity in
small samples.
Tables 4 and 5 reveal that the GMM estimator of Q∗1 is consistent whether
stratification is endogenous or exogenous. It exhibits some upward bias (≈ 1–2%)
in both designs and for both types of stratification when n = 50, but the bias is very
close to zero when n = 500.¹⁷ In both designs, the RMSE of the SEL estimator of
Q∗1 is always slightly larger than the RMSE of the GMM estimator under
endogenous stratification, implying that in small samples there appears to be no
efficiency gain in estimating Q∗1 jointly with the model parameters. As can be seen
from Tables 4 and 5, the increase in the RMSE of Q̂1 is due to its bias, because
RMSE ≈ SE whenever the bias is small. This becomes clear on comparing the bias
of Q̂1 under endogenous and exogenous stratification: the latter is always larger.
The higher bias of Q̂1 under exogenous stratification is likely a design effect.
5. CONCLUSION
ACKNOWLEDGMENTS
We thank two anonymous referees and seminar participants at the 2017 “Econo-
metrics of Complex Survey Data: Theory and Applications” workshop organized
by the Bank of Canada, Ottawa, Canada, for helpful comments. The simulation
experiments reported in this paper were carried out using the HPC facilities of the
University of Luxembourg (Varrette et al., 2014, http://hpc.uni.lu).
NOTES
1. If X∗ is constant P∗X∗-a.s., then there is no conditioning and (2.1) reduces to a system
of unconditional moment equalities. These models are studied in Tripathi (2011a,b).
2. Similar notation, but without the “∗” superscript, applies to the random variables and
probability measures in the realized population.
3. Cf. Severini and Tripathi (2013, Appendix H) for a short proof of (2.2).
4. A sufficient condition for this is that P∗Y∗|X∗((Y∗, x) ∈ Cl | X∗ = x) > 0 for each l and
x ∈ supp(X∗).
5. The same condition leads to global identification of θ ∗ whenever g(Z ∗ , θ ∗ ) is linear
in θ ∗ .
6. In the econometrics literature, stratification based on a finite set of response variables
is often referred to as choice-based sampling.
7. Unless mentioned otherwise, it is assumed throughout the paper that both Y ∗ and X∗
are stratified.
8. Tripathi (2011b) shows that in unconditional moment restriction models even
exogenous stratification cannot be ignored.
9. Hence, for the LS estimator in Example 2.1, one can say that it is inconsistent because
of selection bias due to endogenous stratification, whereas exogenous stratification does not
lead to any selection bias.
10. The abbreviation “l.b.” stands for “lower bound,” because the efficiency bound is the
greatest lower bound for the asymptotic variance of any n^{1/2}-consistent regular estimator.
11. Namely, M1 ≤L M2 for symmetric matrices M1 , M2 means that M1 − M2 is negative
semidefinite.
12. The estimator θ̂GMM,endog is an example of an inverse probability weighted (IPW)
estimator, which uses the weights 1/bendog (Y1 ), . . . , 1/bendog (Yn ) to correct the selection
bias due to stratification by downward weighting the strata that are oversampled and upward
weighting the strata that are undersampled.
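The IPW correction in this note can be written as a one-line reweighting of the sample moment conditions. A minimal Python sketch of an IPW sample mean (our own function name; b is the known sampling-probability function evaluated at each observation):

```python
import numpy as np

def ipw_mean(z, b):
    """Inverse probability weighted estimate of a target-population mean:
    each observation is weighted by 1/b(Y_i), downweighting oversampled
    strata and upweighting undersampled ones, then normalized."""
    w = 1.0 / b
    return np.sum(w * z) / np.sum(w)
```

With constant b the weights cancel and the estimator reduces to the ordinary sample mean, which is a quick sanity check.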
13. To see this, compare the first order conditions for (3.12) with (3.10).
14. The SEL estimator is implemented with Ti,n := 1. To the best of our knowledge, how
to choose an optimal data driven bandwidth for the SEL estimator remains an open problem.
Consequently, we naively chose the bandwidth by repeating the simulation experiment on
a grid of bandwidths and picking the one that minimized the average (across the simulation
replications) RMSE of the SEL estimator of β1∗ . The naively chosen bandwidth, labeled cn ,
is reported in Tables 2–5. For the sake of comparison, we also report the SEL estimator when
the bandwidth is chosen using Silverman’s rule of thumb, namely, bn = 1.06 sd(X) n^{−1/5}.
Since sd(X) depends on the data, the bn reported in the tables is the value averaged across
the simulations.
15. This is even more so for the LS estimator of the intercept because, under endogenous
stratification, the bias of the LS estimator of β0∗ is ≈ 18% (resp. ≈ 41%) in magnitude for
the heteroskedastic (resp. homoskedastic) design even when n = 500. For the remainder
of this section, however, we only discuss the simulation results for the slope coefficient
because it can be interpreted as an average partial effect. Results for the intercept, which is
a pure level effect, are qualitatively very similar.
16. It is shown in Appendix A, cf. (A.1), that asvar(n^{1/2}(θ̂GMM,exog − θ∗)) −
asvar(n^{1/2}(θ̂LS − θ∗)) = A + B holds under exogenous stratification, where the matrix A is
positive semidefinite and the matrix B is negative semidefinite. Therefore, in general, it is
not clear which estimator has smaller asymptotic variance. However, since B = 0 under con-
ditional homoskedasticity, cf. (A.4), asvar(n^{1/2}(θ̂LS − θ∗)) ≤L asvar(n^{1/2}(θ̂GMM,exog − θ∗))
holds under exogenous stratification and conditional homoskedasticity. Alternatively, under
conditional homoskedasticity, the Gauss–Markov theorem implies the same result because
θ̂GMM,exog and θ̂LS are both linear and unbiased when stratification is exogenous.
17. In Tables 4 and 5, the results under exogenous stratification are almost identical
for the heteroskedastic and homoskedastic designs because P ∗ (X∗ ∈ B1 ) is not affected by
conditional heteroskedasticity in Y ∗ (cf. Table 1).
18. In our setup, all the components of Z are continuous random variables, so that ties
in the data occur with probability (P ) zero.
19. This is a simplified but working version of the code we actually used. The complete
code is available from GitHub at https://github.com/Fifis/SELshares.
REFERENCES
Bhattacharya, D. (2005). Asymptotic inference from multi-stage samples. Journal of Econometrics,
126, 145–171.
Bhattacharya, D. (2007). Inference on inequality from household survey data. Journal of Econometrics,
137, 674–707.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive estimation for
semiparametric models. Baltimore, MD: Johns Hopkins University Press.
Bickel, P. J., & Ritov, Y. (1991). Large sample theory of estimation in biased sampling regression
models. Annals of Statistics, 19, 797–816.
Butler, J. S. (2000). Efficiency results of MLE and GMM estimation with sampling weights. Journal
of Econometrics, 96, 25–37.
Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions.
Journal of Econometrics, 34, 305–334.
Cosma, A., Kostyrka, A. V., & Tripathi, G. (2018). Smoothed empirical likelihood based inference
with missing endogenous variables. In progress.
Cosslett, S. R. (1981a). Efficient estimation of discrete choice models. In C. F. Manski &
D. McFadden (Eds.), Structural analysis of discrete data with econometric applications (pp.
51–111). Cambridge, MA: MIT Press.
164 ANTONIO COSMA ET AL.
Cosslett, S. R. (1981b). Maximum likelihood estimation for choice-based samples. Econometrica, 49,
1289–1316.
Cosslett, S. R. (1991). Efficient estimation from endogenously stratified samples with prior
information on marginal probabilities. Manuscript. Retrieved from economics.sbs.ohio-
state.edu/scosslett/papers/cbsample1.pdf.
Cosslett, S. R. (1993). Estimation from endogenously stratified samples. In G. Maddala, C. Rao, &
H. Vinod (Eds.), Handbook of statistics (Vol. 11, pp. 1–43). Amsterdam: Elsevier.
Cragg, J. G. (1983). More efficient estimation in the presence of heteroscedasticity of unknown form.
Econometrica, 51, 751–764.
Deaton, A. (1997). The analysis of household surveys. Baltimore, MD: Johns Hopkins University Press.
DeMets, D., & Halperin, M. (1977). Estimation of a simple regression coefficient in samples arising
from a subsampling procedure. Biometrics, 33, 47–56.
El-Barmi, H., & Rothmann, M. (1998). Nonparametric estimation in selection biased models in the
presence of estimating equations. Nonparametric Statistics, 9, 381–399.
Gill, R. D., Vardi, Y., & Wellner, J. A. (1988). Large sample theory of empirical distributions in biased
sampling models. Annals of Statistics, 16, 1069–1112.
Hausman, J. A., & Wise, D. A. (1981). Stratification on endogenous variables and estimation: The Gary
income maintenance experiment. In C. F. Manski & D. McFadden (Eds.), Structural analysis
of discrete data with econometric applications (pp. 365–391). Cambridge, MA: MIT Press.
Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and
limited dependent variables and a simple estimator for such models. Annals of Economic and
Social Measurement, 5, 475–492.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–161.
Hirose, Y. (2007). M-Estimators in semi-parametric multi-sample models. Manuscript. Retrieved from
sms.victoria.ac.nz/foswiki/pub/Main/ResearchReportSeries/mscs08-05.pdf.
Hirose, Y., & Lee, A. J. (2008). Semi-parametric efficiency bounds for regression models under gen-
eralised case-control sampling: The profile likelihood approach. Annals of the Institute of
Statistical Mathematics, 62, 1023–1052.
Holt, D., Smith, T. M. F., & Winter, P. D. (1980). Regression analysis of data from complex surveys.
Journal of the Royal Statistical Society, Series A, 143, 474–487.
Imbens, G. W. (1992). An efficient method of moments estimator for discrete choice models with
choice-based sampling. Econometrica, 60, 1187–1214.
Imbens, G. W., & Lancaster, T. (1996). Efficient estimation and stratified sampling. Journal of
Econometrics, 74, 289–318.
Jewell, N. P. (1985). Least squares regression with data arising from stratified samples of the dependent
variable. Biometrika, 72, 11–21.
Kalbfleisch, J. D., & Lawless, J. F. (1988). Estimation of reliability in field-performance studies (with
discussion). Technometrics, 30, 365–388.
Kitamura, Y., Tripathi, G., & Ahn, H. (2004). Empirical likelihood based inference in conditional
moment restriction models. Econometrica, 72, 1667–1714.
Manski, C. F. (1989). Anatomy of the selection problem. Journal of Human Resources, 24, 343–360.
Manski, C. F. (1995). Identification problems in the social sciences. Cambridge, MA, USA: Harvard
University Press.
Manski, C. F., & Lerman, S. R. (1977). The estimation of choice probabilities from choice based
samples. Econometrica, 45, 1977–1988.
Manski, C. F., & McFadden, D. (1981). Alternative estimators and sample design for discrete choice
analysis. In C. F. Manski & D. McFadden (Eds.), Structural analysis of discrete data with
econometric applications (pp. 2–50). Cambridge, MA: MIT Press.
where

asvar(n^{1/2}(θ̂GMM,exog − θ∗)) − asvar(n^{1/2}(θ̂LS − θ∗))
= (EPX[X̃X̃′/bexog(X)])^{−1} { EPX[X̃X̃′V1,exog(X)/b²exog(X)]
− EPX[X̃X̃′/bexog(X)] (EPX X̃X̃′)^{−1} (EPX[X̃X̃′V1,exog(X)]) (EPX X̃X̃′)^{−1} EPX[X̃X̃′/bexog(X)] }
× (EPX[X̃X̃′/bexog(X)])^{−1}.

Next, letting a1 := X̃ V^{1/2}_{1,exog}(X)/bexog(X) and a2 := (EPX X̃X̃′)^{−1} X̃/V^{1/2}_{1,exog}(X),
we have

asvar(n^{1/2}(θ̂GMM,exog − θ∗)) − asvar(n^{1/2}(θ̂LS − θ∗)) = A + B,

where

A := (EPX[X̃X̃′/bexog(X)])^{−1} [EPX a1a1′ − (EPX a1a2′)(EPX a2a2′)^{−1}(EPX a2a1′)] (EPX[X̃X̃′/bexog(X)])^{−1}

and

B := (EPX[X̃X̃′/bexog(X)])^{−1} (EPX a1a2′) [(EPX a2a2′)^{−1} − (EPX[X̃X̃′V1,exog(X)])] (EPX a2a1′) (EPX[X̃X̃′/bexog(X)])^{−1}.
By (2.8), var_P∗(Y∗|X∗ = x) = var_P(Y|X = x) for x ∈ supp(X∗).
Hence, asvar(n^{1/2}(θ̂LS − θ∗)) ≤L asvar(n^{1/2}(θ̂GMM,exog − θ∗)) holds under exogenous
stratification and conditional homoskedasticity.
However, as demonstrated in the next example, this result may not hold under
conditional heteroskedasticity.
p1 1B1 (d) + p2 1B2 (d) = p2 because d > 0. Then, it can be verified that
Consequently,

asvar(n^{1/2}(β̂1,LS − β1∗)) = 1.5 > asvar(n^{1/2}(β̂1,GMM − β1∗)) = 1.
This shows that the LS estimator may be asymptotically inefficient compared to the
GMM estimator under conditional heteroskedasticity and exogenous stratification.
APPENDIX B: COMPUTATION
In this appendix, we describe how the SEL estimator was implemented by adapt-
ing the code of Owen (2017). The R function cemplik in Owen (2017) was
originally written for count random variables and allows for ties in the data. Let
Zj := (Yj, Xj) be i.i.d. draws from the realized density dP and assume that each
of the n distinct values of Zj is taken by cj draws, so that the total sample size is
N := Σ_{j=1}^n cj. If we impose on the data the vector of unconditional moment
equalities EP m(Z, θ) = 0, then Owen (2017, p. 2) shows that the empirical
loglikelihood (EL), as a function of θ, and modulo constants not depending on θ,
is obtained by solving (in our notation)

− max_{λ̃} Σ_{j=1}^n cj log(1 + λ̃′m(Zj, θ)). (B.1)
Note how in (B.1) the original sample size N has disappeared, and only the
number n of distinct values of Zj remains. The function cemplik asks for
m := (m(Z1, θ), . . . , m(Zn, θ)) and a vector c := (c1, . . . , cn) as inputs, and delivers
three outputs:

(1) The EL for a given value of θ, computed at the vector λdim(m)×1 of Lagrange
multipliers that maximize (B.1), i.e.,

ELm(θ; c, λ) := − Σ_{j=1}^n cj log(1 + λ′m(Zj, θ)).

(2) The implied probabilities

pj := (cj/N) · 1/(1 + λ′m(Zj, θ)), j = 1, . . . , n.
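For concreteness, the inner maximization in (B.1) for a single scalar moment can be solved by finding the root of its first-order condition. The sketch below is our own Python analog of what cemplik computes (the function name is ours; it assumes the mj take both signs so the maximizer is interior):

```python
import numpy as np
from scipy.optimize import brentq

def weighted_el(m, c):
    """Inner problem of (B.1) for scalar moments: maximize
    sum_j c_j*log(1 + lam*m_j) over lam subject to 1 + lam*m_j > 0.
    Returns the EL value (minus the maximum), lam, and the implied
    probabilities p_j = (c_j/N)/(1 + lam*m_j) with N = sum_j c_j."""
    eps = 1e-10
    lo = -1.0 / m.max() + eps            # keep every 1 + lam*m_j positive
    hi = -1.0 / m.min() - eps
    foc = lambda lam: np.sum(c * m / (1.0 + lam * m))  # first order condition
    lam = brentq(foc, lo, hi)            # foc is decreasing: unique root
    el = -np.sum(c * np.log(1.0 + lam * m))
    p = (c / c.sum()) / (1.0 + lam * m)
    return el, lam, p
```

At the solution the implied probabilities sum to one and satisfy the sample moment equality Σj pj mj = 0, which is a quick correctness check.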
We now describe how to compute SELT(θ) when only the conditional moment
restriction EPY|X[ρ1(Z, θ)|X] = 0 is imposed on the data. In the following, we do
not deal with ties in the data.¹⁸ Instead, we take advantage of the formal resem-
blance of the optimization problem in (B.1) with the one that leads to the smoothed
EL. Indeed, obtaining SELT(θ) only under EPY|X[ρ1(Z, θ)|X] = 0 is equivalent to
solving (3.14) with ρ2 := 0, i.e.,

SELT(θ)|ρ2=0 := − max_{λ̃1,...,λ̃n} Σ_{i=1}^n Ti,n Σ_{j=1}^n wij log(1 + λ̃i ρ1(Zj, θ)). (B.2)
From the first order conditions, it is clear that the maximizers in (B.2) can be
recovered as solutions to n independent maximization problems, namely,

λi := argmax_{λ̃i} Σ_{j=1}^n wij log(1 + λ̃i ρ1(Zj, θ)), i = 1, . . . , n. (B.3)
The elements of c in (B.1) are not constrained to be integers but are only required
to be positive. Hence, comparing (B.1) with (B.3), each λi can be obtained by
invoking cemplik n times with ci := (wi1, . . . , win) as the weights and m replaced
with ρ1 := (ρ1(Z1, θ), . . . , ρ1(Zn, θ)). Consequently,

SELT(θ)|ρ2=0 = Σ_{i=1}^n Ti,n ELρ1(θ; ci, λi) (B.4)

with ELρ1(θ; ci, λ) := − Σ_{j=1}^n wij log(1 + λ ρ1(Zj, θ)). The R commands used to
implement (B.4) are as follows. Let rho1 denote (ρ1(Z1, θ), . . . , ρ1(Zn, θ)),
sel.weights be the n × n matrix whose elements are the kernel weights wij,
and trim the trimming vector (T1,n, . . . , Tn,n). Then, SELT(θ)|ρ2=0 is obtained with the
following code:

emplik.list = apply(sel.weights, MARGIN = 1, function(w) cemplik(rho1, w))
SEL = trim %*% unlist(lapply(emplik.list, '[[', 1))
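These R commands can be transcribed into Python for illustration (function names are ours; the sketch assumes the scalar residuals ρ1 change sign within the sample so the inner maximizers in (B.3) are interior):

```python
import numpy as np
from scipy.optimize import brentq

def inner_el(rho1, w):
    """Solve (B.3) for one i: maximize sum_j w_ij*log(1 + lam*rho1_j),
    returning EL_rho1(theta; c_i, lam_i) with its leading minus sign."""
    eps = 1e-10
    foc = lambda lam: np.sum(w * rho1 / (1.0 + lam * rho1))
    lam = brentq(foc, -1.0 / rho1.max() + eps, -1.0 / rho1.min() - eps)
    return -np.sum(w * np.log(1.0 + lam * rho1))

def sel_objective(rho1, sel_weights, trim):
    """SEL_T(theta)|_{rho2=0} as in (B.4): trimmed sum over i of the
    row-wise EL values, one row of kernel weights w_ij per observation."""
    els = np.array([inner_el(rho1, w) for w in sel_weights])
    return float(trim @ els)
```

Here sel_weights plays the role of the n × n kernel-weight matrix and trim that of (T1,n, . . . , Tn,n).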
Finally, we show how to impose a conditional and an unconditional moment
restriction on the data, i.e., compute the objective function SELT(θ, Q−L) defined
max_{λ̃i} Σ_{j=1}^n wij log(1 + λ̃i ρ1(Zj, θ) + μ̄′ρ2(Zj, Q−L)), i = 1, . . . , n. (B.5)
NONPARAMETRIC KERNEL REGRESSION USING COMPLEX SURVEY DATA

Luc Clair
Department of Economics, University of Winnipeg, Canada
ABSTRACT
Applied econometric analysis is often performed using data collected from
large-scale surveys. These surveys use complex sampling plans in order
to reduce costs and increase the estimation efficiency for subgroups of the
population. These sampling plans result in unequal inclusion probabilities
across units in the population. The purpose of this paper is to derive the
asymptotic properties of a design-based nonparametric regression estima-
tor under a combined inference framework. The nonparametric regression
estimator considered is the local constant estimator. This work contributes
to the literature in two ways. First, it derives the asymptotic properties
for the multivariate mixed-data case, including the asymptotic normality of
the estimator. Second, I use least squares cross-validation for selecting the
bandwidths for both continuous and discrete variables. I run Monte Carlo
simulations designed to assess the finite-sample performance of the design-
based local constant estimator versus the traditional local constant estimator
for three sampling methods, namely, simple random sampling, exogenous
stratification and endogenous stratification. Simulation results show that the
174 LUC CLAIR
1. INTRODUCTION
Nonparametric methods for estimating conditional mean functions have emerged
as viable alternatives to standard parametric methods. However, the discussion of
these methods in a complex survey setting has been kept to the survey statistics
literature despite their applications in economics. There are two reasons why these
methods should appeal to economists. Firstly, in applied economic analysis, the
regression functional form is rarely known up to a parametric specification. Economic
theory provides arguments for the inclusion or exclusion of variables in a model
but generally does not specify the functional form of the conditional mean function
(Yatchew, 1998). Consistent estimation using parametric methods requires that the
researcher perfectly specifies the functional form of the conditional mean function
prior to estimation (Li & Racine, 2007). In practice, though, one cannot be certain
that the correct parametric model has been chosen. Alternatively, nonparametric
estimators do not rely on functional form assumptions and are therefore free of
functional form misspecification. They simply assume that the conditional mean function exists
and that it follows certain regularity conditions, such as smoothness.
Secondly, many large-scale surveys use complex sampling plans in order to
reduce costs and increase the estimation efficiency for subgroups of the popula-
tion (Lohr, 2010). As in Binder and Roberts (2009), the term complex sampling
plan refers to any sampling method other than simple random sampling (SRS) that
results in the population units having nonuniform probabilities of being selected
into the sample. The data sets derived from these surveys typically offer a broader
range of variables (e.g., income, health, education, demographic variables, etc.)
and are easier to access than administrative data sets, making them popular among
economic researchers. The complex sampling methods used in these surveys dis-
proportionately sample subgroups of the population, leading to a sample that is
systematically unrepresentative of the finite population from which it is drawn. In
this case, finite population descriptive statistics cannot be consistently estimated by
their respective sample analogs. In order to consistently estimate the finite popula-
tion statistics, design-based estimators that weight each observation by the inverse
of the unit’s probability of being selected in the sample must be used. Solon, Haider,
and Wooldridge (2013) described a sample of units with unequal probabilities of
Nonparametric Kernel Regression Using Complex Survey Data 175
y = g(x) + u, (1)
where g(x) = E(y|x) is the unknown quantity of interest and u is the error term.
In a survey statistics framework, it is presumed that a finite population U of size
N is generated based on realizations of the random variables (y, x) with a joint
probability distribution f (y, x) (Buskirk & Lohr, 2005; Harms & Duchesne, 2010).
From the finite population, a sample S of size n is selected based on a complex
sampling plan. Each individual j in the finite population U has a probability πj of
being included in S and the probability of being selected depends on the sampling
methods that are implemented. The number of population units a given sample
unit j represents is then given by the weight variable wj = πj−1 .
If data for all N observations in the finite population were available, g(x) could
be estimated using the local constant estimator:
ĝU(x) = [Σ_{j=1}^N yj Kγ,jx] / [Σ_{j=1}^N Kγ,jx], (2)
ĝ(x) = [Σ_{i=1}^n πi^{−1} yi Kγ,ix] / [Σ_{i=1}^n πi^{−1} Kγ,ix]. (4)
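In code, Eq. (4) is just a ratio of inverse-probability-weighted kernel sums. A minimal sketch of ours for a single continuous regressor with a Gaussian kernel (the function name is an assumption, not from the paper):

```python
import numpy as np

def ht_local_constant(x0, xs, ys, pi, h):
    """Design-based local constant estimator of g(x0), Eq. (4): kernel
    weights are scaled by the inverse inclusion probabilities 1/pi_i."""
    k = np.exp(-0.5 * ((xs - x0) / h) ** 2)  # Gaussian kernel (constants cancel)
    w = k / pi                               # inverse probability weighting
    return np.sum(w * ys) / np.sum(w)
```

When every πi is equal, the weights cancel and the estimator reduces to the ordinary unweighted local constant estimator computed on the sample.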
2. MODEL-ASSISTED NONPARAMETRIC
REGRESSION ESTIMATOR
Let U = {1, . . . , N } denote a finite population of N units. For each
j ∈ U , the outcome variable yj and auxiliary variables xj = (xjd , xjc ) =
(xjd1 , . . . , xjdr , xjc1 , . . . , xjcq ) are realizations of the random variables (y, x), where
(y, x) follow a joint distribution f (y, x). xj is a (q + r) × 1 vector where the super-
scripts d and c denote that the variable is discrete or continuous, respectively. I
use xjct to denote the tth component of xjc and xjdt for the tth component of xjd and
assume that xjdt takes ct ≥ 2 different values in Dt = {0, 1, . . . , ct − 1}, t = 1, . . . , r.
Next, a sample S of size ns is drawn based on a complex sampling plan pN (·), where
pN(S) is the probability of drawing the sample S. The sampling rate is Q = ns/N,
with first order inclusion probabilities πj = Pr(j ∈ S) = Σ_{S: j∈S} pN(S) and second
order inclusion probabilities πij = Pr(j, i ∈ S) = Σ_{S: i,j∈S} pN(S). The variable ns
may be fixed (as in SRS) or random; however, I do not specify a sampling plan.
The first and second order probabilities are the probabilities of obtaining the unit
j and units j and i, respectively, while sampling from the population according
to the complex sampling design. The goal is to estimate the conditional mean
function g(x) from Eq. (1).
If data were available for every i ∈ U then g(x) in (1) could be estimated using the
local constant estimator. The local constant estimator was proposed by Nadaraya
(1964) and Watson (1964), who wanted to estimate conditional mean functions as a
locally weighted average, using a kernel as a weighting function. The mathematical
definition of E(y|x) is
E(y|x) = ∫ y f(y|x) dy = ∫ y [f(y, x)/f(x)] dy, (5)

where f(y|x) is the conditional density of y given x, f(y, x) is the joint density of
y and x and f (x) = f (x c , x d ) is the joint probability density function of (x c , x d ).
Nadaraya (1964) and Watson (1964) proposed substituting f(y, x) and f(x) by
their kernel density estimates. For the discrete regressors xtd, t = 1, . . . , r, a vari-
ation on Aitchison and Aitken’s (1976) kernel function can be used (for scalar xd)
or embedded in a product kernel (for multivariate xd). This function is defined by
l(xitd, xjdt, λ) = { 1 if xitd = xjdt; λt otherwise }, (6)
where λt ∈ [0, 1] is the smoothing parameter for xtd . When λt = 0, the above kernel
function becomes an indicator function, and when λt = 1, it is a constant function
and the (irrelevant) variable gets smoothed out. Here, a match between xitd and xtd
determines the value of the discrete kernel function (Sánchez-Borrego et al., 2014;
Opsomer and Miller, 2005). The product kernel function for a vector of discrete
variables is defined as
L(xid, xd, λ) = Π_{t=1}^r λt^{1−1(xitd = xtd)},
where 1( · ) is the indicator function, which takes a value of 1 if the logical argument
in the brackets is true and 0 otherwise.
Using k to denote a symmetric, univariate density function, the product kernel
for continuous variables is defined by

Wh(xc, xic) = Π_{t=1}^q (1/ht) k((xitc − xtc)/ht),
where 0 < ht < ∞ is the smoothing parameter for continuous variable xtc . The
shape of W depends on the choice of kernel function and the bandwidth. The
distance between xitc and xt is the traditional Euclidean distance (Sánchez-Borrego
et al., 2014). Defining γ = (h, λ), the multivariate mixed-data product kernel is
given by Kγ ,ix = Wh (x c , xic )L(xid , x d , λ). The local constant estimator is derived
by substituting the kernel density estimators f˜(x, y) and f˜(x) for f (x, y) and
f (x) = f (x d , x c ), respectively, in Eq. (5). After analytic integration, with a bit of
algebra, the local constant estimator can then be written as
ĝU(x) = [Σ_{i∈U} yi Kγ,ix] / [Σ_{i∈U} Kγ,ix]. (7)
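The mixed-data kernel Kγ,ix used above multiplies the continuous product kernel Wh by the discrete kernel L. A small sketch of ours (Gaussian k; the function name is an assumption):

```python
import numpy as np

def mixed_product_kernel(xc_i, xd_i, xc, xd, h, lam):
    """K_{gamma,ix} = W_h(x^c, x_i^c) * L(x_i^d, x^d, lam): Gaussian product
    kernel over the q continuous components and the discrete kernel of
    Eq. (6) over the r discrete components."""
    W = np.prod(np.exp(-0.5 * ((xc_i - xc) / h) ** 2)
                / (np.sqrt(2.0 * np.pi) * h))    # prod_t k(.)/h_t
    L = np.prod(np.where(xd_i == xd, 1.0, lam))  # lam_t on each mismatch
    return W * L
```

With λt = 0 the discrete kernel acts as an indicator (any mismatch zeroes the weight); with λt = 1 the discrete component is smoothed out entirely.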
The benefit of using this method over parametric regression techniques is that
it does not require the practitioner to specify the exact functional form of
Under SRS, the estimator proposed by Sánchez-Borrego et al. (2014) becomes the
traditional nonparametric estimator from Eq. (3).
There are three methods of inference when deriving the asymptotic properties
of estimators in survey sampling, namely, model-based inference, pure design-
based inference and combined inference. Model-based inference assumes the data
(y, x) are generated based on a given model and that the inclusion probabilities are
uninformative. Using this mode of inference relies heavily on model specification
as it is assumed that the model represents all units of the population (Lohr, 2010).
Naturally, this increases the appeal of nonparametric methods, such as the local
constant described above, which makes no assumptions about the functional form
of the model (other than smoothness and existence). However, if the sampling is
endogenous, model-based estimators will not take this into account and results
will be inconsistent. If one takes a model-based approach, then the estimator in
Eq. (3) is appropriate.
In pure design-based settings, inference depends on the probability distribu-
tion induced by the sampling design and not the probability distribution from an
underlying model. Inferences drawn using a design-based approach typically refer
to a particular finite population of interest and usually ignore any model structure
in the corresponding superpopulation. Expectations are taken with respect to the
sampling scheme. Therefore, asymptotic results depend on the sample size, the
sampling design, and the bandwidths h and λ (Buskirk and Lohr, 2005). Sánchez-
Borrego et al. (2014) adopted a design-based setting for deriving the asymptotic
properties for their model-assisted local constant estimator in Eq. (8). Under the
assumption of i.i.d. data, the authors showed that the estimator is asymptotically
design-unbiased and design-consistent with probability one.
Assumption 2.1.
Denote X as the compact support of x. Then, g(x), f(x) and σ²(x) = E(ui²|xi)
are second order differentiable in X. Letting As(x) and Ass(x) denote the first
and second order derivatives of any function A w.r.t. xs, gss(x)²f(x) > 0
for all s = 1, . . . , q (Li and Racine, 2007).
Assumption 2.2.
The kernel function k(·): R → R is symmetric with k(v) ≥ 0 for all v ∈ R, and
bounded⁴ by a finite constant z so that k(v) ≤ z. k(·) is m times differentiable with
∫ k(v)v² dv < ∞. k(·) is a second order kernel; define κ2 = ∫ v²k(v)dv and
κ = ∫ k²(v)dv (Li and Racine, 2007).
Assumption 2.3.
(hk,1, . . . , hk,q, λk,1, . . . , λk,r) ∈ [0, ηk]^{q+r} lies in a shrinking set, where ηk = ηNk is
a positive sequence that converges to zero at a rate slower than the inverse of
any polynomial in Nk, and Nk hk,1 · · · hk,q ≥ tNk with tNk → ∞ as k → ∞ (Li and
Racine, 2007).
Assumption 2.4.
The sampling plan is designed such that:

1. The sampling rate is such that limk ns,k/Nk = Q ∈ [0, 1).
2. For all N, minj∈U πj ≥ π∗ > 0, with probability one, mini,j∈U πij ≥ π∗∗ > 0, and

lim sup_{k→∞} ns,k max_{j,i∈U: j≠i} |πij − πi πj| < ∞.

3. Δk(nk, Nk, Pk) = nk Nk^{−2} Σ_{j∈Uk} (πj^{−1} − 1) = O(1).
fˆ(x) = N^{−1} Σ_{i∈S} πi^{−1} Kγ,ix = N^{−1} Σ_{i=1}^N πi^{−1} 1(i ∈ S) Kγ,ix. (9)
Next, take the expectation of (9) using the combined inference method.
EC(fˆ(x)|x) = Eξ{EP(fˆ(x)|π)|x} = Eξ[EP(N^{−1} Σ_{i∈S} πi^{−1} Kγ,ix | π) | x]
= Eξ[EP(N^{−1} Σ_{i=1}^N πi^{−1} 1(i ∈ S) Kγ,ix | π) | x]
= Eξ[N^{−1} Σ_{i=1}^N πi^{−1} EP(1(i ∈ S)|π) Kγ,ix | x]
= Eξ[N^{−1} Σ_{i=1}^N πi^{−1} πi Kγ,ix | x]
= Eξ[N^{−1} Σ_{i=1}^N Kγ,ix | x]
= Σ_{td∈D} ∫_{Rq} [Π_{s=1}^q hs^{−1} w((ts − xs)/hs)] [Π_{s=1}^r λs^{1(tsd ≠ xsd)}] f(tc, td) dtc
= ∫_{Rq} [Π_{s=1}^q w(vs)] f(xc + hv, xd) dv
+ Σ_{s=1}^r λs Σ_{td∈D} 1(td, xd) ∫_{Rq} [Π_{s=1}^q w(vs)] f(xc + hv, td) dv
= f(x) + (κ2/2) Σ_{s=1}^q hs² fss(x) + Σ_{s=1}^r Σ_{td∈D} 1(td, xd) f(xc, td) λs
+ o(Σ_{s=1}^q hs² + Σ_{s=1}^r λs)
= Eξ(f̃(x)|x). (10)
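The design-expectation step in (10), EP(1(i ∈ S)|π) = πi, is what makes fˆ(x) design-unbiased for the full-population estimate f̃(x). This can be checked by simulation; the Poisson sampling design and all constants below are illustrative choices of ours, not the chapter's:

```python
import numpy as np

rng = np.random.default_rng(1)

# One fixed finite population drawn from the superpopulation model.
N = 2000
x = rng.standard_normal(N)

# Informative design: units with x > 0 are twice as likely to be sampled.
pi = np.where(x > 0.0, 0.4, 0.2)

x0, h = 0.0, 0.3
K = np.exp(-0.5 * ((x - x0) / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
f_tilde = K.mean()                       # full-population estimate f~(x0)

# Average the design-based estimator (9) over repeated Poisson samples.
reps = 3000
est = np.empty(reps)
for r in range(reps):
    s = rng.uniform(size=N) < pi         # inclusion indicators
    est[r] = np.sum(K[s] / pi[s]) / N    # f^(x0) from Eq. (9)

mc_mean = est.mean()                     # should be close to f_tilde
```

Despite the informative design, the Monte Carlo average of fˆ(x0) matches f̃(x0) closely, while the unweighted sample average of K would not.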
ĝ(x) − g(x) = m̂(x)/fˆ(x), (11)

where m̂(x) = (ĝ(x) − g(x))fˆ(x). Using the equation for the regression model
with additive errors (1), m̂(x) can be written as

m̂(x) = N^{−1} Σ_{i=1}^N πi^{−1} 1(i ∈ S)[g(xi) − g(x)]Kγ,ix
+ N^{−1} Σ_{i=1}^N πi^{−1} 1(i ∈ S) ui Kγ,ix,
where the definition of m̂1 (x) and m̂2 (x) should be evident. In Section A.1 of
the appendix, I show that the leading term of the expectation of m̂1 (x) under the
combined framework is
EC[m̂1(x)|x] = (κ2/2) Σ_{s=1}^q hs² [gss(x)f(x) + 2gs(x)fs(x)]
+ Σ_{s=1}^r Σ_{td∈D} 1(td, xd)[g(xc, td) − g(xc, xd)]f(xc, td) λs
+ o(Σ_{s=1}^q hs² + Σ_{s=1}^r λs), (12)
that the variance of the model-assisted estimator under the combined framework is
equal to the variance under the model framework multiplied by a correction factor:
varC{[(ĝ(x) − g(x))|π]|x} = (N^{−2} n Σ_{i=1}^N (wi − 1) + n/N) · κ^q σ²(x)/(n h1 · · · hq f(x))
+ O((N h1 · · · hq)^{−1} (Σ_{s=1}^q hs² + Σ_{s=1}^r λs)), (13)
where κ^q σ²(x)/(n h1 · · · hq f(x)) is the leading term of varξ(g̃(x)). Note that
under SRS, the correction factor equals one and Eq. (13) reduces to the variance of
g̃(x) evaluated over the sample data. Combining these two results proves Theorem
2.1, which is an extension of Theorem 1 in Harms and Duchesne (2010):
Theorem 2.1. If Assumptions 2.1–2.4 are satisfied, then the conditional point-
wise MSE of the model-assisted local constant estimator ĝ(x) under the
combined inference mode is given by
MSE(ĝ(x)) = [f^{−1}(x) ((κ2/2) Σ_{s=1}^q hs² (gss(x) + 2gs(x)fs(x))
+ Σ_{s=1}^r Σ_{td∈D} 1(td, xd)(g(xc, td) − g(xc, xd))f(xc, td) λs)]²
+ (Δ + Q) κ^q σ²(x) f^{−1}(x)/(N h1 · · · hq)
+ op((N h1 · · · hq)^{−1} (Σ_{s=1}^q hs² + Σ_{s=1}^r λs) + (Σ_{s=1}^q hs² + Σ_{s=1}^r λs)²),

where Δ = N^{−2} n Σ_{i=1}^N (wi − 1) and Q = n/N.
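The correction factor (Δ + Q) can be computed directly from the design weights, and under SRS it equals one, which is easy to verify numerically (the function name is ours, and Δ is our label for the first term of the factor):

```python
import numpy as np

def correction_factor(pi, n):
    """Delta + Q with Delta = N^{-2} * n * sum_i (w_i - 1), w_i = 1/pi_i,
    and Q = n/N, as in Theorem 2.1."""
    N = pi.size
    w = 1.0 / pi
    delta = n * np.sum(w - 1.0) / N ** 2
    return delta + n / N
```

Under SRS every πi = n/N, so Δ + Q = (1 − n/N) + n/N = 1 and the MSE reduces to its i.i.d. form; a design with unequal inclusion probabilities (holding n = Σi πi) pushes the factor above one.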
Theorem 2.2. Under the assumption that x is an interior point and Assumptions
2.1–2.4 are satisfied, the asymptotic normality of ĝ(x) is given by

√(N h1 · · · hq) (ĝ(x) − g(x) − Σ_{s=1}^q B1s(x)hs² − Σ_{s=1}^r B2s(x)λs)
→d N(0, (Δ + Q) κ^q σ²(x)/f(x)), (14)

where B1s(x) = (κ2/2)[gss(x) + 2gs(x)fs(x)] and B2s(x) = Σ_{td∈D} 1(td, xd)[g(xc, td) − g(xc, xd)]f(xc, td).
Theorem 2.2 can be used to compute confidence intervals for ĝ(x) under the
combined inference framework. Let N̂ = Σ_{i=1}^n πi^{−1} and

σ̂²(x) = [Σ_{i=1}^n (yi − ĝ(xi))² πi^{−1} Kγ,ix] / [Σ_{i=1}^n πi^{−1} Kγ,ix],
3. BANDWIDTH SELECTION
A critical component to any nonparametric regression technique is the choice
of the smoothing parameters (h, λ). Selecting the smoothing parameters for the
q continuous variables creates a trade-off between the bias and variance of the
estimator. Large values of hs will oversmooth the underlying density and increase
the bias while reducing the variance. Conversely, small hs will undersmooth the
underlying density shrinking the bias but increasing the variability of the estimator.
For the univariate continuous variable case, Harms and Duchesne (2010) used
a modified plug-in method for selecting the bandwidth according to the MSE
criterion in the combined inference mode. The optimal bandwidth in that case was
equal to that of the i.i.d. case multiplied by a correction factor equal to (Δ + Q).
This method is not applicable to the multivariate case and, therefore, not applicable
for the present model. In their simulations, Sánchez-Borrego et al. (2014) used a
plug-in method for the bandwidth in which they selected three values for h and
five values for λ. In addition, they used survey cross-validation to choose among
the 15 possible combinations of the fixed values for h and λ.
It is widely accepted that data-driven methods for selecting the bandwidths
in a nonparametric kernel regression setting are required for proper inference
and analysis. I propose using LSCV, a data-driven method for selecting (h, λ) =
(h1 , . . . , hq , λ1 , . . . , λr ). LSCV chooses (h, λ) to minimize the following cross-
validation function:
where g−i (xi ) = j ∈S,j =i πj−1 yj Kγ , ij/( j ∈S,j =i πj−1 Kγ , ij ) is the leave-one-
q
out kernel estimator of g(xi ) and Kij = j =i k((xisc − xjcs )/ hs ) rs=1 λαs with α =
1(xis = xj s ) is equal to 1 if xis = xj s and zero otherwise (Li and Racine, 2007). 0 <
M(xi ) < 1 is a weight function which serves to avoid difficulties caused by dividing
by zero. Using the leave-one-out kernel estimator helps to avoid
a computational
issue encountered when optimizing according to i∈S (ûi )2 = i∈S (yi − ĝ(xi ))2 .
By letting $h_s\to 0$, $\hat g(x_i)$ can be made arbitrarily close to $y_i$ for any $i\in S$; hence $\sum_{i\in S}(\hat u_i)^2$ can be made very small as $h\to 0$. At the same time, $\mathrm{MSE}_C$ remains greater than zero for all values of $h$ (Opsomer and Miller, 2005). Replacing $\hat g(x_i)$ with the delete-one estimator ensures that the difference $y_i-\hat g_{-i}(x_i)$ does not go to zero as $h\to 0$. The analysis that follows requires the following assumption about the weight function $M(x_i)$:
Assumption 3.1.
M(·) is continuous, nonnegative and has a compact support S.
In Section A.3 of the appendix, I show that, if we ignore the terms unrelated to $(h,\lambda)$, the leading term of $\mathrm{CV}(h,\lambda)$ is given by $E[\mathrm{CV}_0(h,\lambda)]$:
$$E[\mathrm{CV}_0(h,\lambda)] = \int\left\{\left[\sum_{s=1}^{q}h_s^2B_s(x)+\sum_{s=1}^{r}\lambda_sD_s(x)/f(x)\right]^2+\frac{(1+Q)\,\kappa^q\sigma^2(x)}{Nh_1\cdots h_q\,f(x)}\right\}M(x)\,dx. \tag{16}$$
Therefore, the leading term of the LSCV criterion is the same as for an i.i.d. sample except for the correction factor $(1+Q)$ on the second term on the right-hand side of Eq. (16).
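The LSCV procedure can be sketched by minimizing the weighted leave-one-out criterion over a bandwidth grid. The snippet below is a hedged Python illustration with a single continuous regressor, a Gaussian kernel, and made-up inclusion probabilities; the chapter's actual implementation uses continuous optimization via R's np package:

```python
import numpy as np

def loo_cv(h, x, y, pi):
    """Design-weighted least-squares cross-validation criterion
    sum_i pi_i^{-1} (y_i - ghat_{-i}(x_i))^2 with a leave-one-out
    local constant estimator (single continuous regressor)."""
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)          # Gaussian kernel; constants cancel
    np.fill_diagonal(K, 0.0)         # leave-one-out: drop j = i
    W = K / pi[None, :]              # pi_j^{-1} K_{gamma,ij}
    ghat_loo = (W @ y) / W.sum(axis=1)
    return np.sum((y - ghat_loo) ** 2 / pi)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 300)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, 300)
pi = np.full(300, 0.1)
grid = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
h_star = min(grid, key=lambda h: loo_cv(h, x, y, pi))
```

Very large bandwidths oversmooth the sinusoidal mean and inflate the criterion, so the grid search settles on an intermediate value.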
188 LUC CLAIR
I consider four DGPs for g(x), which are outlined in Table 1. The first DGP,
g1 (X), was considered in Sánchez-Borrego et al. (2014) and is simply linear
in the continuous variable, x1c . In DGP two, the relationship between y and x1c
is quadratic introducing a further degree of smoothness. Population three, also
known as bump, was considered by Harms and Duchesne (2010) and Sánchez-
Borrego et al. (2014). This function produces a noticeable bump at x1c = 0.5.
Population four is the most complex function considered. The Härdle function
is characterized by a peak at x1c = 0.63, a valley at x1c = 0.91, and a saddle point
at x1c = 0.79. The error term for each population is assumed to be normally
distributed with mean of zero and standard deviation σ .
(3) Next, I draw a sample of size n based on three different sampling methods: SRS without replacement, stratification based on $x_1^c$, and stratification based on $y_i$, $i = 1, 2, 3, 4$. For stratified samples, the population was divided into three strata of varying size, with unequally sized samples being drawn from each stratum. Within-stratum sampling was performed by SRS without replacement. Table 2 displays the strata borders and the sample size drawn from each stratum.
(4) Using the sample data, I estimate g(x) using nonparametric and paramet-
ric estimation methods. The nonparametric estimators I consider are the
design-based local constant estimator and the traditional local constant
estimator, denoted by WLC and LC, respectively:
$$\mathrm{WLC}:\ \hat g(x) = \frac{\sum_{i=1}^{n}\pi_i^{-1}y_iK_{\gamma,ix}}{\sum_{i=1}^{n}\pi_i^{-1}K_{\gamma,ix}}, \qquad \mathrm{LC}:\ \tilde g(x) = \frac{\sum_{i=1}^{n}y_iK_{\gamma,ix}}{\sum_{i=1}^{n}K_{\gamma,ix}}.$$
Both WLC and LC are computed using the Gaussian kernel for the continuous variable $x_1^c$ and the variant of the Aitchison and Aitken (1976) kernel in (6) for the discrete variables $x_1^d$ and $x_2^d$. The bandwidths for WLC are computed using the LSCV method discussed in Section 3. The bandwidths for LC are computed using the LSCV method outlined in Li and Racine (2007, p. 69). The WLC estimator and LSCV method were implemented in R using the np package; the code is available from the author upon request. Note that I consider only the relevant-data case: I expect the bandwidths $\lambda_r < 1$, $r = 1, 2$, for $x_1^d$ and $x_2^d$. I let $h$ denote the bandwidth for $x_1^c$. For parametric estimation, I assume $g(x) = g(\beta, x) = \beta_1 x_1^c + \beta_2 x_1^d + \beta_3 x_2^d$. Therefore, OLS and WLS will be correctly specified in the estimation of $g_1(x)$ but misspecified in the estimation of $g_2(x)$ to $g_4(x)$.
(5) Finally, I compute the sample mean squared error
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat h(x_i)-g(x_i)\right)^2,$$
where $\hat h(x)$ denotes one of the estimators under comparison.
Tables 3–5 report the median MSE values from the Monte Carlo simulations
for all DGPs. To save space, I only report the results for n = 250 and n =
1,000. The values in brackets below the reported MSEs are the median absolute deviations (MAD) of the MSEs, a robust measure of the variability of the MSE (Andersen, 2008). The MAD is calculated as $\mathrm{median}\{|\mathrm{MSE}_m[\hat h(x)]-\mathrm{median}\{\mathrm{MSE}_m[\hat h(x)]\}|\}$ with $m = 1, \ldots, 1{,}000$, where $\hat h(x)$ is one of WLC, LC, WLS or OLS. Compared to the standard deviation, the MAD is more resilient to outliers.
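The median MSE and its MAD, as reported in the tables, can be computed as follows (a minimal sketch with made-up Monte Carlo MSE draws):

```python
import numpy as np

def median_mse_and_mad(mse_draws):
    """Median MSE across Monte Carlo replications and its median
    absolute deviation, the robust spread measure used in the tables."""
    mse_draws = np.asarray(mse_draws, dtype=float)
    med = np.median(mse_draws)
    mad = np.median(np.abs(mse_draws - med))
    return med, mad

# Five hypothetical MSE draws; the outlier 0.50 barely moves the MAD.
med, mad = median_mse_and_mad([0.10, 0.12, 0.11, 0.50, 0.09])
```

Here the median is 0.11 and the MAD is 0.01; the standard deviation, by contrast, would be dominated by the single outlying draw, which is exactly why the MAD is used.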
Table 3 displays the results from SRS. Not surprisingly, the median MSEs are
equal for the WLC and LC estimators, as well as the MSEs for the WLS and OLS
estimators. Under SRS, the inclusion probabilities are equal for all individuals and
WLC reduces to LC. As Harms and Duchesne (2010) point out in their simulations,
the results from SRS act as a benchmark for other sampling plans. Keeping n
constant, as σ increases, so too does the median MSE of each estimator. In order
for WLC to be a consistent estimator, the MSE needs to decrease as the sample size
increases. Keeping σ constant and increasing n, the MSE decreases for all DGPs.
This provides evidence that the estimator is consistent. As the functions increase in
the degree of complexity, the MSE also increases. The MADs for both the weighted and unweighted nonparametric estimators are likewise equal for all combinations of n and σ. Note that when the DGP is linear, the correctly specified parametric models report lower median MSEs than the nonparametric estimators.
However, when the parametric model is misspecified, the nonparametric estimators
perform better based on the sample MSEs.
Nonparametric Kernel Regression Using Complex Survey Data 191
Table 3. Median MSE for WLC, LC, WLS and OLS Under Simple Random
Sampling.
DGP n σ WLC LC WLS OLS
The results from stratification on the outcome variable are presented in Table 4.
Here, weighting by inclusion probabilities clearly shows an improvement as the
median MSE is smaller for WLC than it is for LC for all combinations of DGP,
n, and σ . By not accounting for endogenous sampling and unequal inclusion
probabilities, the traditional local constant performs worse than the weighted local
constant. Again, increasing the level of noise in the model reduces the efficiency
of each estimator. WLC remains consistent as the median MSE decreases and the
Table 4. Median MSE for WLC, LC, WLS and OLS Under Endogenous
Stratification.
DGP n σ WLC LC WLS OLS
MAD decreases as n increases. The MADs suggest that WLC is more variable than LC for small sample sizes: when n = 250, MAD(WLC) is greater than MAD(LC) for all DGPs. As the sample size increases, however, MSE(WLC) becomes less variable than MSE(LC). Overall, these results show that WLC dominates LC under endogenous stratification. WLS regression performs best under the linear DGP; however, as the DGP becomes more complex, WLC outperforms both (misspecified)
Table 5. Median MSE for WLC, LC, WLS and OLS Under Exogenous
Stratification.
DGP n σ WLC LC WLS OLS
parametric estimators. In fact, as the DGPs become more complex, the parametric
estimators become inconsistent.
Table 5 displays the results from stratification on the continuous predictor variable $x_1^c$. Since $x_1^c$ is included in the model, the sampling scheme is exogenous. These results mimic those from SRS in Table 3. The two nonparametric estimators report nearly identical median MSEs, with the small differences explained by simulation error. These results indicate that the sample design had no effect on the
estimation of g(x).
4.2. Bandwidths
Tables 6–8 report the median bandwidths for WLC and LC under SRS, exogenous
stratification, and endogenous stratification, respectively. The superscripts w and
u denote the bandwidths for WLC and LC, respectively. Under SRS, WLC and
LC select the same bandwidths (see Table 6). Similar results are found under
exogenous stratification in Table 7. Under endogenous stratification, WLC selects smaller bandwidths than LC. This suggests that the design-based estimator chooses a lower degree of smoothing. Under all three sampling plans, both WLC and LC
choose bandwidths for discrete variables x1d and x2d that are below 1. This is
encouraging as both variables are relevant in each of the models considered.
These results show that there is no loss in efficiency from using a design-based estimator when it is not required for consistent estimation of g(x). Under SRS, the weighted estimator reduces to the traditional nonparametric regression estimator and the results are equivalent, regardless of the underlying DGP. Comparing across sampling schemes, WLC is most efficient under SRS and least efficient under endogenous stratification.
5. APPLICATION
The application considered here is an extension of the example in Harms and
Duchesne (2010). The authors used data from the 2000 cycle of the Survey of
Labour and Income Dynamics (SLID) to estimate the relationship between age and
LMD for approximately 58,000 individuals (Statistics Canada, 2013). The purpose
of the SLID is to understand the economic well-being of Canadians, collecting data
on the primary source of income, education and demographic backgrounds of its
participants. The sampling scheme is based on a stratified, multistage design that
uses probability sampling. The result is unequal sampling weights for individuals
in the sample. The weights not only represent the sampling plan but also account
for nonresponse and are calibrated to meet certain benchmark criteria.
The application presented in this paper differs in two ways from Harms and
Duchesne (2010). First, in order to help reduce heteroskedasticity in the model,
the outcome variable I consider is LMD as a percent of age over 18. Looking only
at LMD versus age could lead to heteroskedastic errors as the variance of LMD for
older individuals is likely to be higher compared to younger individuals. Second,
I extend the model to include a discrete variable “Gender,” which takes on two
values, male or female.
I estimate the model using both the model-assisted local constant estimator, WLC, and the traditional local constant estimator, LC. In both cases, the Gaussian kernel was used for the continuous variable "Age" and the variant of the Aitchison and Aitken (1976) kernel was used for the discrete variable "Gender." The bandwidths for the WLC estimator were computed using the cross-validation method described in Section 3. The bandwidths for the LC estimator were computed using the method outlined in Li and Racine (2007, Chapter 2). The gray and black
lines in Figure 1 represent the weighted regression results for males and females,
respectively. It is clear that these two curves are pulled closer to observations
which represent a greater number of individuals in the population compared to
the unweighted estimates (the dashed lines in Figure 1). Results also show that females spend a smaller percentage of time in the labor force than men, as the black lines lie below the gray lines. The bandwidths for gender are 0.0313 and 0.0287 for LC and WLC, respectively; both are less than 1, indicating that gender is a relevant predictor of LMD.
Fig. 1. WLC and LC estimates of LMD as a percent of age over 18, by age, for males and females, with confidence intervals (CI) for the WLC estimates.
6. CONCLUSION
ACKNOWLEDGMENTS
I am grateful for the input and guidance I received from Dr. Jeff Racine, Dr. Jerry
Hurley, Dr. Phil DeCicca, Dr. Arthur Sweetman, and Dr. Michael Veall. Further-
more, I would like to thank participants at various seminars and conferences for
their feedback.
REFERENCES
Aitchison, J., & Aitken, C. G. G. (1976). Multivariate binary discrimination by the Kernel method.
Biometrika, 63, 413–420.
Andersen, R. (2008). Modern methods for robust regression. Thousand Oaks, CA: SAGE Publications, Inc.
Bellhouse, D., & Stafford, J. (1999). Density estimation from complex surveys. Statistica Sinica, 9,
407–424.
Binder, D. A., & Roberts, G. (2009). Design- and model-based inference for model parameters. Handbook of Statistics, 29B, 33–54.
Bravo, F., Huynh, K. P., & Jacho-Chávez, D. T. (2011). Average derivative estimation with miss-
ing responses. In D. M. Drukker (Ed.), Missing data methods: Cross-sectional methods and
applications (1st ed., Vol. 27A, pp. 129–154). Bingley: Emerald Group Publishing Limited.
Breidt, F. J., & Opsomer, J. D. (2000). Local polynomial regression estimators in survey sampling.
The Annals of Statistics, 28, 1026–1053.
Buskirk, T. D., & Lohr, S. L. (2005). Asymptotic properties of Kernel density estimation with complex
survey data. Journal of Statistical Planning and Inference, 128, 165.
Harms, T., & Duchesne, P. (2010). On kernel nonparametric regression designed for complex survey
data. Metrika, 72, 111–138.
Hausman, J., & Wise, D. (1981). Stratification on endogenous variables and estimation: The Gary
income maintenance experiment. In The analysis of discrete economic data. Cambridge, MA:
MIT Press.
Isaki, C., & Fuller, W. (1982). Survey design under the superpopulation model. Journal of the American
Statistical Association, 77(377), 89–96.
Li, Q., & Racine, J. S. (2004). Cross-validated local linear regression. Statistica Sinica, 14, 485–512.
Li, Q., & Racine, J. S. (2007). Nonparametric econometrics. Princeton, NJ: Princeton University Press.
Lohr, S. L. (2010). Sampling: Design and analysis (2nd ed.). Boston, MA: Brooks/Cole.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and Its Applications, 9, 141–142.
Opsomer, J. D., & Miller, C. P. (2005). Selecting the amount of smoothing in nonparametric regression
estimation for complex surveys. Journal of Nonparametric Statistics, 17(5), 593–611.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International
Statistics Review, 61, 317–337.
Sánchez-Borrego, I., Opsomer, J. D., Rueda, M., & Arcos, A. (2014). Nonparametric estimation with
mixed data types in survey sampling. Revista Matemática Complutense, 27, 685–700.
Särndal, C.-E., Swensson, B., & Wretman, J. (1992). Model-assisted survey sampling. New York, NY: Springer.
Solon, G., Haider, S. J., & Wooldridge, J. (2013). What are we weighting for? NBER Working Paper No. 18859. Cambridge, MA: National Bureau of Economic Research.
Statistics Canada. (2013). Survey of labour and income dynamics (SLID). http://www23.statcan.gc.ca/
imdb/p2SV.pl?Function=getSurvey&SDDS=3889.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā, Series A, 26, 359–372.
Yatchew, A. (1998). Nonparametric regression techniques in economics. Journal of Economic
Literature, 36(2), 699–721.
APPENDIX A. PROOFS
A.1. Proof of Theorem 2.1
$$
\begin{aligned}
E_C(\hat m_1(x)\mid x) &= E_\xi\left\{E_P(\hat m_1(x)\mid\pi)\mid x\right\} \\
&= E_\xi\left\{E_P\left(N^{-1}\sum_{i=1}^{N}\pi_i^{-1}\mathbf{1}(i\in S)[g(x_i)-g(x)]K_{\gamma,ix}\;\middle|\;\pi\right)\middle|\;x\right\} \\
&= E_\xi\left\{N^{-1}\sum_{i=1}^{N}[g(x_i)-g(x)]K_{\gamma,ix}\right\} \\
&= \sum_{t^d\in D}\int_{\mathbb{R}^q}[g(t)-g(x)]f(t)K_{h,tx}\,dt^c \\
&= \frac{\kappa_2}{2}\sum_{s=1}^{q}h_s^2\left[g_{ss}(x)f(x)+2g_s(x)f_s(x)\right] \\
&\quad+\sum_{s=1}^{r}\lambda_s\sum_{t^d\in D}\mathbf{1}_s(t^d,x^d)\left[g(x^c,t^d)-g(x^c,x^d)\right]f(x^c,t^d)+o_p\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right) \\
&= \sum_{s=1}^{q}h_s^2B_s(x)f(x)+\sum_{s=1}^{r}D_s\lambda_s+o_p\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right), \tag{19}
\end{aligned}
$$
where $B_s(x)=\kappa_2\left[g_{ss}(x)+2g_s(x)f_s(x)/f(x)\right]/2$, $D_s=\sum_{t^d\in D}\mathbf{1}_s(t^d,x^d)\left[g(x^c,t^d)-g(x^c,x^d)\right]f(x^c,t^d)$, $\mathbf{1}_s(t^d,x^d)$ indicates that $t^d$ and $x^d$ differ only in the $s$th component, and $K_{h,tx}=\prod_{s=1}^{q}h_s^{-1}k((t_s-x_s)/h_s)\prod_{s=1}^{r}\lambda_s^{\mathbf{1}(t_s^d\neq x_s^d)}$. By using the
Taylor expansion method from Särndal, Swensson, and Wretman (1992), Harms and Duchesne (2010) derived the following result:
$$\hat g(x)-g(x) \approx \hat f^{-1}(x)\,\frac{1}{N}\sum_{i=1}^{N}\pi_i^{-1}\mathbf{1}(i\in S)K_{\gamma,ix}u_i. \tag{20}$$
$$
\begin{aligned}
\mathrm{Avar}_P\{\hat g(x)-g(x)\mid\pi\} &= \mathrm{var}_P\!\left(\hat f^{-1}(x)\,\frac{1}{N}\sum_{i=1}^{N}\pi_i^{-1}\mathbf{1}(i\in S)K_{\gamma,ix}u_i\right) \\
&= \frac{1}{N^2}\hat f^{-2}(x)\sum_i\sum_j\mathrm{cov}_P\!\left(\mathbf{1}(i\in S),\mathbf{1}(j\in S)\right)\frac{K_{\gamma,ix}K_{\gamma,jx}u_iu_j}{\pi_i\pi_j} \\
&= \frac{1}{N^2}\hat f^{-2}(x)\sum_i\sum_j\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}K_{\gamma,ix}K_{\gamma,jx}u_iu_j. \tag{21}
\end{aligned}
$$
Replacing $\hat f(x)$ with the expression $\hat f(x)=f(x)+o_p(1)$, the expression for $\mathrm{Avar}_P$ becomes
$$\mathrm{Avar}_P\{\hat g(x)-g(x)\mid\pi\} = \frac{1}{N^2}f^{-2}(x)\sum_i\sum_j\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}K_{\gamma,ix}K_{\gamma,jx}u_iu_j. \tag{22}$$
The variance under the combined framework is then derived using the following decomposition:
$$\mathrm{var}_C\{\hat g(x)-g(x)\mid x\} = \mathrm{var}_\xi\{E_P[\hat g(x)-g(x)\mid\pi]\mid x\}+E_\xi\{\mathrm{var}_P[\hat g(x)-g(x)\mid\pi]\mid x\}. \tag{23}$$
Using a similar derivation as for the bias of $\hat f(x)$ and $\hat m(x)$, the first term in (23) involves the traditional expression for the variance of the local constant estimator:
$$
\begin{aligned}
\frac{1}{N}\mathrm{var}_\xi\left([g(t)-g(x)]K_{h,tx}\right) &= \frac{1}{N}\left[\sum_{t^d\in D}\int_{\mathbb{R}^q}[g(t)-g(x)]^2f(t^c,t^d)K_{h,tx}^2\,dt^c-O\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right] \\
&= O\!\left((Nh_1\cdots h_q)^{-1}\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right). \tag{24}
\end{aligned}
$$
$$
\begin{aligned}
E_\xi\!\left[\left(\frac{1}{N}\sum_{i=1}^{N}u_iK_{\gamma,ix}\right)^{\!2}\right] &= \frac{1}{N}E\!\left[\sigma^2(t)K_{h,tx}^2\right] \\
&= \frac{1}{N}\sum_{t^d\in D}\int_{\mathbb{R}^q}\sigma^2(t)f(t^c,t^d)K_{h,tx}^2\,dt^c \\
&= \frac{1}{N}\left[\int_{\mathbb{R}^q}\sigma^2(x^c+hv,x^d)f(x^c+hv,x^d)\prod_{s=1}^{q}h_s^{-1}k^2(v_s)\,dv_s+O\!\left(\sum_{s=1}^{r}\lambda_s\right)\right] \\
&= \frac{\kappa^q\sigma^2(x)f(x)}{Nh_1\cdots h_q}+O\!\left((Nh_1\cdots h_q)^{-1}\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right). \tag{25}
\end{aligned}
$$
Combining (24), (25), and $\mathrm{var}_\xi\{[\tilde g(x)-g(x)]\mid x\}=[f(x)]^{-2}\mathrm{var}_\xi(\tilde m(x))$, the first term in Eq. (23) is
$$\mathrm{var}_\xi\{[\tilde g(x)-g(x)]\mid x\} = \frac{\kappa^q\sigma^2(x)}{Nh_1\cdots h_q\,f(x)}+o\!\left((Nh_1\cdots h_q)^{-1}\right). \tag{26}$$
Plugging the result from (22) into the second term in Eq. (23) gives
$$E_\xi\left[\mathrm{Avar}_P\{\hat g(x)-g(x)\mid\pi\}\right] = \frac{Q\,\kappa^q\sigma^2(x)}{Nh_1\cdots h_q\,f(x)}+o\!\left((Nh_1\cdots h_q)^{-1}\right), \tag{28}$$
where the second term in the intermediate expression (27) is a zero mean function. To get the expression for $\mathrm{var}_C\{(\hat g(x)-g(x))\mid x\}$, simply sum (26) and (28):
$$\mathrm{var}_C\{(\hat g(x)-g(x))\mid x\} = \frac{(1+Q)\,\kappa^q\sigma^2(x)}{Nh_1\cdots h_q\,f(x)}+o\!\left((Nh_1\cdots h_q)^{-1}\right). \tag{29}$$
In order to prove the asymptotic normality of $\hat g(x)$, I make use of the Lyapunov double array central limit theorem (see the statistical appendix in Li and Racine, 2007):
$$
\begin{aligned}
&\sqrt{Nh_1\cdots h_q}\left(\hat g(x)-g(x)-\sum_{s=1}^{q}B_s(x)h_s^2-\sum_{s=1}^{r}D_s(x)\lambda_s\right) \\
&\quad\equiv \sqrt{Nh_1\cdots h_q}\,\frac{\left[\hat g(x)-g(x)-\sum_{s=1}^{q}B_s(x)h_s^2-\sum_{s=1}^{r}D_s(x)\lambda_s\right]\hat f(x)}{\hat f(x)} \\
&\quad= \sqrt{Nh_1\cdots h_q}\,\frac{\hat m(x)-\sum_{s=1}^{q}B_s(x)h_s^2\hat f(x)-\sum_{s=1}^{r}D_s(x)\lambda_s\hat f(x)}{\hat f(x)} \\
&\quad= \sqrt{Nh_1\cdots h_q}\,\frac{\hat m(x)-E(\hat m(x))}{\hat f(x)}+O\!\left(\sqrt{Nh_1\cdots h_q}\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right) \\
&\quad= \sqrt{Nh_1\cdots h_q}\,\frac{\hat m(x)-E(\hat m(x))}{\hat f(x)}+o(1) \\
&\quad= \frac{1}{f(x)}\sum_{i=1}^{N}Z_{N,i}+o(1), \tag{30}
\end{aligned}
$$
where $Z_{N,i}=(\sqrt{Nh_1\cdots h_q})^{-1}\left[\pi_i^{-1}\mathbf{1}(i\in S)(y_i-g(x))K_{\gamma,ix}-E\!\left(\pi_i^{-1}\mathbf{1}(i\in S)(y_i-g(x))K_{\gamma,ix}\right)\right]$ and $\hat f(x)=f(x)+o_p(1)$. Next, take the expectation of the absolute value of $Z_{N,i}$ raised to the power of $2+\delta$, where $\delta>0$ is some constant:
$$E|Z_{N,i}|^{2+\delta} = (\sqrt{Nh_1\cdots h_q})^{-(2+\delta)}E\left|\pi_i^{-1}\mathbf{1}(i\in S)(y_i-g(x))K_{\gamma,ix}-E\!\left(\pi_i^{-1}\mathbf{1}(i\in S)(y_i-g(x))K_{\gamma,ix}\right)\right|^{2+\delta},$$
and
$$\frac{1}{f(x)}\sum_{i=1}^{N}Z_{N,i}\xrightarrow{d}N\!\left(0,\,(1+Q)\kappa^q\sigma^2(x)/f(x)\right).$$
Write $g(x_j)=g(x_j)+g(x_i)-g(x_i)=g(x_i)+R_{ij}$ and plug this into the regression model $y_j=g(x_j)+u_j$:
$$y_j=g(x_i)+R_{ij}+u_j.$$
Using the definition of the modified kernel density estimator $\hat f(x^c,x^d)$, we can rewrite Eq. (32) as
$$\hat g_{-i}(x_i)=g(x_i)+\hat f_i^{-1}\,\frac{1}{N}\sum_{j\neq i}\pi_j^{-1}\mathbf{1}(j\in S)K_{h,ij}(R_{ij}+u_j), \tag{33}$$
where $\hat f_i=N^{-1}\sum_{j\neq i}\pi_j^{-1}\mathbf{1}(j\in S)K_{h,ij}$. The definition of $\mathrm{CV}(h,\lambda)$ can now be written as
$$\mathrm{CV}(h,\lambda)=N^{-1}\sum_{i\in S}\pi_i^{-1}\left(g(x_i)+u_i-\hat g_{-i}(x_i)\right)^2M(x_i). \tag{34}$$
The third term in the expansion of Eq. (34) does not depend on $(h,\lambda)$, and the second term has an order smaller than the first term. So, asymptotically, minimizing $\mathrm{CV}(h,\lambda)$ is equivalent to minimizing $\sum_i[g(x_i)-\hat g_{-i}(x_i)]^2M(x_i)$. Define
$$
\begin{aligned}
\mathrm{CV}_0(h,\lambda) &= N^{-1}\sum_{i\in S}\pi_i^{-1}\left[g(x_i)-\hat g_{-i}(x_i)\right]^2M(x_i) \\
&= N^{-1}\sum_{i\in S}\pi_i^{-1}\left[\frac{1}{N}\sum_{j\neq i}\pi_j^{-1}\mathbf{1}(j\in S)K_{h,ij}(R_{ij}+u_j)\hat f_i^{-1}\right]^2M(x_i) \\
&= N^{-1}\sum_{i\in S}\pi_i^{-1}\left\{\left[\frac{1}{N}\sum_{j\neq i}\pi_j^{-1}K_{h,ij}(g(x_j)-g(x_i))+\frac{1}{N}\sum_{j\neq i}\pi_j^{-1}K_{h,ij}u_j\right]\hat f_i^{-1}\right\}^2M(x_i). \tag{35}
\end{aligned}
$$
Again, using the definition $\hat f(x)=f(x)+o_p(1)$, write $\mathrm{CV}_0(h,\lambda)$ as
$$
\begin{aligned}
\mathrm{CV}_0(h,\lambda) &= \frac{1}{N}\sum_{i\in S}\pi_i^{-1}(m_{1i}+m_{2i})^2f_i^{-2}M(x_i)+(\mathrm{s.o.}) \\
&= \frac{1}{N}\left(\sum_{i\in S}\pi_i^{-1}m_{1i}^2f_i^{-2}M(x_i)+\sum_{i\in S}\pi_i^{-1}m_{2i}^2f_i^{-2}M(x_i)+2\sum_{i\in S}\pi_i^{-1}m_{1i}m_{2i}f_i^{-2}M(x_i)\right), \tag{36}
\end{aligned}
$$
where $m_{1i}=N^{-1}\sum_{j\neq i}\pi_j^{-1}K_{h,ij}(g(x_j)-g(x_i))$, $m_{2i}=N^{-1}\sum_{j\neq i}\pi_j^{-1}K_{h,ij}u_j$, and s.o. denotes smaller order terms. The leading term of $\mathrm{CV}(h,\lambda)$ is $\mathrm{CV}_0(h,\lambda)=E[\mathrm{CV}_0(h,\lambda)]+(\mathrm{s.o.})$.
Because $E_C(m_{1i}m_{2i}f_i^{-2}M(x_i))=0$, the leading term is
$$E[\mathrm{CV}_0(h,\lambda)]=E\!\left[m_{1i}^2f_i^{-2}M(x_i)\right]+E\!\left[m_{2i}^2f_i^{-2}M(x_i)\right]+(\mathrm{s.o.}). \tag{37}$$
Looking at the first term in Eq. (37),
$$E_C\!\left[m_{1i}^2f_i^{-2}M(x_i)\right]=\frac{1}{N^2}E_C\!\left[\left(\sum_{j\neq i}\pi_j^{-1}\mathbf{1}(j\in S)K_{h,ij}(g(x_j)-g(x_i))\right)^{\!2}f_i^{-2}M(x_i)\right], \tag{38}$$
where $v^d$ is a placeholder term and it is assumed that the data are identically distributed. The second term arising in (38) is $O\!\left((Nh_1\cdots h_q)^{-1}\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right)$:
$$
\begin{aligned}
&N^{-1}E_\xi\!\left[\sum_{j\neq i}\pi_j^{-1}K_{h,ij}^2(g(x_j)-g(x_i))^2f_i^{-2}\right] \\
&\quad= N^{-1}f_i^{-2}\sum_{j\neq i}\pi_j^{-1}\sum_{x^d\in D}\int K_{h,ij}^2(g(x_j)-g(x_i))^2\,dx_j^c \\
&\quad= O\!\left((Nh_1\cdots h_q)^{-1}\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right). \tag{41}
\end{aligned}
$$
Next, solve for the second term on the right-hand side of (36):
$$
\begin{aligned}
&= \left(N^2\prod_{s=1}^{q}h_s\right)^{\!-1}\sum_{j\neq i}\pi_j^{-1}\sum_{x^d\in D}\int\kappa^q\sigma^2(x)M(x)\,dx^c \\
&\quad+O\!\left(\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)^{\!3}+(N^2h_1\cdots h_q)^{-1}\!\left(\sum_{s=1}^{q}h_s^2+\sum_{s=1}^{r}\lambda_s\right)\right).
\end{aligned}
$$
Minimizing $\mathrm{CV}(h,\lambda)$ is equivalent to minimizing $\mathrm{CV}_1(h,\lambda)$ because $N^{-1}\sum_iu_i^2$ does not depend on $h_1,\ldots,h_q$.
NEAREST NEIGHBOR IMPUTATION
FOR GENERAL PARAMETER
ESTIMATION IN SURVEY SAMPLING
ABSTRACT
Nearest neighbor imputation has a long tradition for handling item nonre-
sponse in survey sampling. In this article, we study the asymptotic properties
of the nearest neighbor imputation estimator for general population param-
eters, including population means, proportions and quantiles. For variance
estimation, we propose novel replication variance estimation, which is
asymptotically valid and straightforward to implement. The main idea is to
construct replicates of the estimator directly based on its asymptotically lin-
ear terms, instead of individual records of variables. The simulation results
show that nearest neighbor imputation and the proposed variance estimation
provide valid inferences for general population parameters.
Keywords: Bahadur representation; bootstrap; jackknife variance
estimation; matching; missing at random; quantile estimation
210 SHU YANG AND JAE KWANG KIM
1. INTRODUCTION
In survey sampling, nearest neighbor imputation is popular for dealing with item
nonresponse. In nearest neighbor imputation, for each unit with missing data,
the nearest neighbor is identified among respondents based on the vector of fully
observed covariates and then is used as a donor for hot deck imputation (Little &
Rubin, 2002). Although nearest neighbor imputation has a long history of applica-
tion, there are relatively few papers on investigating its statistical properties. Sande
(1979) used nearest neighbor imputation in business surveys. Lee and Särndal
(1994) studied different methods of nearest neighbor imputation by simulation.
Chen and Shao (2000, 2001) developed asymptotic properties for the nearest neigh-
bor imputation estimator of population means. Shao and Wang (2008) proposed
methods for constructing confidence intervals for population means and quan-
tiles with nearest neighbor imputation. Kim et al. (2011) applied nearest neighbor
imputation for the US Census long form data. However, most of these studies
focused on mean estimation or a one-dimensional covariate in the context of a
simple random sample, which is restrictive both theoretically and practically.
In the empirical economics literature, nearest neighbor imputation (also known
as matching) has been widely used in evaluation research for adjusting the dis-
tribution of covariates among different treatment groups; see Stuart (2010) for a
survey of matching estimators. Abadie and Imbens (2006, 2008, 2011, 2012, 2016)
systematically studied the asymptotic properties of the matching estimators for the
average treatment effects with a finite number of matches. In particular, Abadie
and Imbens (2006, 2012) derived the asymptotic distribution for the matching
estimators that match directly on the covariates using a martingale representation.
Abadie and Imbens (2016) and Yang et al. (2016) further showed that the match-
ing estimators that match on the estimated propensity score are consistent and
asymptotically normal. However, these studies are restricted to mean estimation
and non-survey data.
Empirical researchers are often interested in various finite population quantities,
such as the population means, proportions and quantiles, to name a few (Francisco
and Fuller, 1991; Wu and Sitter, 2001; Berger and Skinner, 2003). Some corre-
sponding sample estimators should be treated differently than others. For example,
estimators of population quantiles involve nondifferentiable functions of estimated
quantities. Moreover, there is often more than one covariate available to facilitate
nearest neighbor imputation for survey data. The current framework of nearest
neighbor imputation does not fully cover inferences in these settings.
In this article, we provide a framework of nearest neighbor imputation for gen-
eral parameter estimation in survey sampling. In general, the nearest neighbor
imputation estimator is not root-n consistent (Abadie and Imbens, 2006), where n
is the sample size. Based on a scalar matching variable summarizing all covariates
information, we show that nearest neighbor imputation can provide consistent esti-
mators for a fairly general class of parameters. If the matching variable is chosen
to be the mean function of the outcome given the covariates, our method resem-
bles predictive mean matching imputation (Rubin, 1986; Little, 1988; Heitjan
and Little, 1991). However, unlike predictive mean matching imputation, nearest
neighbor imputation does not require the mean function be correctly specified. Its
consistency only requires the matching variable satisfy certain Lipschitz continuity
conditions; see Section 3 for details.
The asymptotic results suggest that variance estimation can proceed based on a
large sample approximation to the normal distribution but requires additional esti-
mation for the variance function of the outcome given the covariates. To avoid such
complication, we consider replication variance estimation (Rust and Rao, 1996;
Wolter, 2007; Mashreghi et al., 2016), which has gained popularity in practice
because of its intuitive appeal. Intrinsically, the nearest neighbor imputation esti-
mator with fixed number of matches is not smooth. The lack of smoothness makes
the conventional replication methods invalid for variance estimation (Abadie and
Imbens, 2008). This is because the conventional replication method distorts the
distribution of the number of times each unit is used as a match, ki . We provide a
heuristic illustration using an unrealistic but insightful example. Suppose that in a sample of size 2n, Sequence 1 consists of the first n observations and Sequence 2 consists of the last n observations. Further, suppose that each observation in Sequence 1 matches to exactly one observation in Sequence 2, so that the distribution of ki is degenerate at 1. On the other hand, under the conventional bootstrap, ki*, the number of times each unit is used as a match in the bootstrap sample, has a different distribution from ki. Therefore, the conventional
bootstrap fails to preserve the distribution of ki . If the number of matches increases
with the sample size, such as in the “kernel matching” estimators of Heckman et al.
(1998), both ki and ki∗ are infinite in the original and conventional bootstrapping
samples, and therefore the conventional bootstrap works in this setting. To address
the non-smoothness due to the fixed number of matches, subsampling (Politis et al.,
1999) and m out of n bootstrap (Bickel et al., 2012) can be used; however, their con-
sistency relies critically on the choice of the size for subsampling. Unfortunately,
there is no clear guidance on how to choose these values in practice. Alternatively,
Otsu and Rai (2016) proposed a wild bootstrap method for the matching estimator
based on the full vector of covariates in the context of non-survey data. Adusumilli
(2017) developed a novel bootstrap procedure for the matching estimator based on
the estimated propensity score, built on the notion of “potential errors.” His simu-
lation study also demonstrated the superior performance of the bootstrap method
relative to using the asymptotic distribution for inference.
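The two-sequence thought experiment above can be simulated directly. In the sketch below (Python, synthetic data; not the authors' code), each nonrespondent has a unique nearest donor, so the original match counts ki are degenerate at 1, while resampling the donors with replacement — as a conventional bootstrap would — produces a visibly different distribution of match counts:

```python
import numpy as np

def match_counts(x_resp, x_miss):
    """k_i: number of times respondent i (a position in x_resp) is the
    nearest-neighbor donor for the nonrespondents in x_miss."""
    idx = np.abs(x_miss[:, None] - x_resp[None, :]).argmin(axis=1)
    return np.bincount(idx, minlength=len(x_resp))

rng = np.random.default_rng(2)
n = 1000
x_resp = np.arange(n, dtype=float)   # Sequence 2: respondents (donors)
x_miss = x_resp + 0.1                # Sequence 1: each matches a unique donor

k = match_counts(x_resp, x_miss)     # degenerate: every donor used exactly once

# Conventional bootstrap: resample the donors with replacement and rematch.
boot = rng.choice(n, size=n, replace=True)
k_star = match_counts(x_resp[boot], x_miss)

var_k, var_k_star = k.var(), k_star.var()
```

The variance of `k` is exactly zero, while the variance of `k_star` is strictly positive: the bootstrap fails to preserve the degenerate distribution of the match counts, which is the source of its invalidity here.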
2. BASIC SETUP
Let FN = {(xi , yi , δi ) : i = 1, . . . , N } denote a finite population of size N , where
xi is a p-dimensional vector of covariates, which is always observed, yi is the
outcome that is subject to missingness, and δi is the response indicator of yi , i.e.,
δi = 1 if yi is observed and δi = 0 if it is missing. The δi ’s are defined throughout
the finite population, as in Shao and Steel (1999) and Kim et al. (2006). We
assume that FN is a random sample from a superpopulation model ζ , and N
is known. Our objective is to estimate the finite population parameter defined through $\mu_g = N^{-1}\sum_{i=1}^{N}g(y_i)$ for some known $g(\cdot)$, or $\xi_N = \inf\{\xi: S_N(\xi)\geq 0\}$, where $S_N(\xi) = N^{-1}\sum_{i=1}^{N}s(y_i-\xi)$ and $s(\cdot)$ is a univariate real function. These parameters are fairly general and cover many parameters of interest in survey sampling. For example, for $g(y)=y$, $\mu_g \equiv N^{-1}\sum_{i=1}^{N}y_i$ is the population mean of $y$; for $g(y)=I(y<c)$ with some constant $c$, $\mu_g \equiv N^{-1}\sum_{i=1}^{N}I(y_i<c)$ is the population proportion of $y$ less than $c$; and for $s(y_i-\xi)=I(y_i-\xi\leq 0)-\alpha$, $\xi_N$ is the population $\alpha$th quantile.
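These definitions can be computed directly for a small synthetic population; the Python sketch below (hypothetical data) mirrors them, with the quantile taken as the smallest value at which $S_N(\xi)$ crosses zero:

```python
import numpy as np

def mu_g(y, g):
    """Finite-population parameter mu_g = N^{-1} * sum_i g(y_i)."""
    return float(np.mean(g(np.asarray(y))))

def xi_N(y, alpha):
    """Population alpha-quantile xi_N = inf{xi : S_N(xi) >= 0}, where
    S_N(xi) = N^{-1} * sum_i [I(y_i - xi <= 0) - alpha]."""
    y = np.sort(np.asarray(y))
    S = np.arange(1, len(y) + 1) / len(y) - alpha  # S_N at each order statistic
    return float(y[np.argmax(S >= 0)])

y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
mean = mu_g(y, lambda t: t)                        # population mean
prop = mu_g(y, lambda t: (t < 4).astype(float))    # proportion below c = 4
med = xi_N(y, 0.5)                                 # population median
```

For this toy population the mean is 3.875, the proportion below 4 is 0.5, and the median is 3.0 (the smallest order statistic at which $S_N$ reaches zero).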
Let A denote an index set of the sample selected by a probability sampling design. Let $I_i$ be the sampling indicator function, i.e., $I_i=1$ if unit $i$ is selected into the sample and $I_i=0$ otherwise. The sample size is $n=\sum_{i=1}^{N}I_i$. Suppose that $\pi_i$, the first-order inclusion probability of unit $i$, is positive and known throughout the sample. If $y_i$ were fully observed throughout the sample, the sample estimators of $\mu_g$ and $\xi_N$ would be $\hat\mu_g = N^{-1}\sum_{i\in A}\pi_i^{-1}g(y_i)$ and $\hat\xi = \inf\{\xi: \hat S_N(\xi)\geq 0\}$, with $\hat S_N(\xi) = \hat N^{-1}\sum_{i\in A}\pi_i^{-1}s(y_i-\xi)$, where $\hat N = \sum_{i\in A}\pi_i^{-1}$ is an estimator for $N$. Even with a known $N$, it is necessary to use $\hat N$; we articulate this point in Example 3.
We make the following assumption for the missing data process.
Assumption 1.
(Missing at random and positivity) The missing data process satisfies $P(\delta=1\mid x,y) = P(\delta=1\mid x)$, denoted by $p(x)$. With probability 1, $p(x) > \epsilon$ for a constant $\epsilon > 0$.
Step 1. For each unit $i$ with $\delta_i=0$, find the nearest neighbor from the respondents, i.e., the unit minimizing the distance between $x_i$ and $x_j$ over $j\in A_R \equiv \{j\in A: \delta_j=1\}$. Let $i(1)$ be the index of its nearest neighbor, which satisfies $d(x_{i(1)},x_i)\leq d(x_j,x_i)$ for all $j\in A_R$.
Step 2. The nearest neighbor imputation estimators of $\mu_g$ and $\xi_N$ are computed by
$$\hat\mu_{g,\mathrm{NNI}} = \frac{1}{N}\sum_{i\in A}\frac{1}{\pi_i}\left\{\delta_ig(y_i)+(1-\delta_i)g(y_{i(1)})\right\}, \tag{1}$$
$$\hat S_{\mathrm{NNI}}(\xi) = \frac{1}{\hat N}\sum_{i\in A}\frac{1}{\pi_i}\left\{\delta_is(y_i-\xi)+(1-\delta_i)s(y_{i(1)}-\xi)\right\}. \tag{2}$$
In (1) and (2), the imputed values are real observations obtained from the current
sample.
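The two-step procedure can be sketched compactly in Python. This is a hedged illustration, not the authors' code: it uses a one-dimensional $x$, toy data, and substitutes $\hat N=\sum_i\pi_i^{-1}$ for the known $N$ in the mean estimator as a simplification:

```python
import numpy as np

def nni_estimates(x, y, delta, pi, g=lambda t: t, alpha=0.5):
    """Nearest neighbor imputation in the spirit of (1)-(2): each
    nonrespondent borrows the y-value of the respondent with the closest x.
    (N-hat = sum 1/pi_i is used in place of the known N throughout,
    a simplification of the estimators in the text.)"""
    x, y, delta, pi = map(np.asarray, (x, y, delta, pi))
    resp = np.where(delta == 1)[0]
    donor = resp[np.abs(x[:, None] - x[resp][None, :]).argmin(axis=1)]
    y_imp = np.where(delta == 1, y, y[donor])   # donors are real observations
    N_hat = np.sum(1.0 / pi)
    mu_hat = np.sum(g(y_imp) / pi) / N_hat
    order = np.argsort(y_imp)                   # smallest xi with S_NNI >= 0
    S = np.cumsum(1.0 / pi[order]) / N_hat - alpha
    xi_hat = float(y_imp[order][np.argmax(S >= 0)])
    return float(mu_hat), xi_hat

x = [0.1, 0.2, 0.3, 0.8, 0.9]
y = [1.0, 2.0, float("nan"), 8.0, 9.0]   # y missing where delta = 0
delta = [1, 1, 0, 1, 1]
pi = [0.5] * 5
mu_hat, xi_hat = nni_estimates(x, y, delta, pi)
```

In this toy sample the nonrespondent at x = 0.3 borrows y = 2.0 from its nearest respondent at x = 0.2, giving an imputed mean of 4.4 and an imputed median of 2.0.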
3. MAIN RESULTS
For asymptotic inference, we use the framework of Isaki and Fuller (1982), where
the asymptotic properties of estimators are established under a fixed sequence of
populations and a corresponding sequence of random samples. Specifically, let a
sequence of nested finite populations be given by FN1 ⊂ FN2 ⊂ FN3 ⊂ · · · . Also,
let a sequence of samples of sizes {nt : t = 1, 2, 3, . . .} be constructed from the
sequence of populations with an increasing sample size n1 < n2 < n3 < · · · . For
the ease of exposition, we omit the dependence of Nt and nt on t. Denote EP ( · )
and var P ( · ) to be the expectation and the variance under the sampling design,
respectively. We impose the following regularity conditions on the sampling
design.
Assumption 2.
(1) There exist positive constants $C_1$ and $C_2$ such that $C_1\leq Nn^{-1}\pi_i\leq C_2$ for $i=1,\ldots,N$; (2) the sampling fraction is negligible, i.e., $nN^{-1}=o(1)$; (3) the sequence of Horvitz–Thompson estimators $\hat\mu_{g,\mathrm{HT}} = N^{-1}\sum_{i\in A}\pi_i^{-1}g(y_i)$ satisfies $\mathrm{var}_P(\hat\mu_{g,\mathrm{HT}})=O(n^{-1})$ and $\{\mathrm{var}_P(\hat\mu_{g,\mathrm{HT}})\}^{-1/2}(\hat\mu_{g,\mathrm{HT}}-\mu_g)\mid \mathcal{F}_N\to N(0,1)$ in distribution, as $n\to\infty$.
The nearest neighbor imputation estimator of $\mu_g$ can equivalently be written as
$$\hat\mu_{g,\mathrm{NNI}} = \frac{1}{N}\sum_{i\in A}\frac{\delta_i}{\pi_i}(1+k_i)g(y_i), \tag{3}$$
with
$$k_i = \sum_{j\in A}\frac{\pi_i}{\pi_j}(1-\delta_j)d_{ij}. \tag{4}$$
Under simple random sampling, $k_i=\sum_{j\in A}(1-\delta_j)d_{ij}$ is the number of times that unit $i$ is used as the nearest neighbor for nonrespondents.
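The equivalence between the direct form (1) and the weighted-donor representation in (3)–(4) can be checked numerically; the Python sketch below uses toy data, a one-dimensional $x$, and the identity function for $g$:

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
y = np.array([1.0, 2.0, np.nan, 8.0, 9.0])   # y missing where delta = 0
delta = np.array([1, 1, 0, 1, 1])
pi = np.array([0.5, 0.25, 0.5, 0.25, 0.5])
N = 20

resp = np.where(delta == 1)[0]
donor = resp[np.abs(x[:, None] - x[resp][None, :]).argmin(axis=1)]

# Direct form (1): impute each nonrespondent with its donor's value.
y_imp = np.where(delta == 1, y, y[donor])
mu_direct = np.sum(y_imp / pi) / N

# Weighted-donor form (3)-(4): k_i = sum_j (pi_i/pi_j)(1 - delta_j) d_ij.
k = np.zeros(len(x))
for j in np.where(delta == 0)[0]:
    k[donor[j]] += pi[donor[j]] / pi[j]
mu_donor = np.sum(delta * (1 + k) * np.nan_to_num(y) / pi) / N
```

Both forms give the same estimate (3.2 on this toy data): each donor's weight in (3) absorbs the inclusion weights of the nonrespondents it serves in (1).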
We first study the asymptotic properties of μ̂g,NNI . Let μg (x) ≡ E{g(y) | x} and
σg2 (x) ≡ var{g(y) | x}, where the expectation and variance are taken with respect
to the superpopulation model. We use the following decomposition:
where
1 1
DN = n1/2 μg (xi ) + δi (1 + ki ){g(yi ) − μg (xi ) − μg , (6)
N i∈A πi
and
n1/2 1
BN = (1 − δi ){μg (xi(1) ) − μg (xi )}. (7)
N i∈A πi
The difference μ_g(x_{i(1)}) − μ_g(x_i) accounts for the matching discrepancy, and B_N contributes to the asymptotic bias of the matching estimator. In general, if x is p-dimensional, Abadie and Imbens (2006) showed that d(x_{i(1)}, x_i) = O_P(n^{-1/p}). Therefore, for nearest neighbor imputation with p ≥ 2, the asymptotic bias is B_N = O_P(n^{1/2 - 1/p}), which is not o_P(1). Abadie and Imbens (2011) proposed a bias adjustment using a nonparametric estimator \hat{\mu}_g(x) that renders matching estimators n^{1/2}-consistent. This approach, however, may not be convenient for general parameter estimation.
To address the matching discrepancy due to a non-scalar x, we propose
an alternative method. We first summarize the covariate information into a scalar
matching variable m = m(x) and then apply nearest neighbor imputation based on
this matching variable. For simplicity of notation, we may suppress the dependence
of m on x if there is no ambiguity. Let f1 (m) and f0 (m) be the conditional density
of m given δ = 1 and δ = 0, respectively. We assume the superpopulation model ζ
and the matching variable m satisfy the following assumption.
Assumption 3.
(1) The matching variable m has a compact and convex support, with density bounded and bounded away from zero. Suppose that there exist constants C_{1L} and C_{1U} such that C_{1L} ≤ f_1(m)/f_0(m) ≤ C_{1U}; (2) μ_g(x) and μ_s(ξ, x) ≡ E{s(y − ξ) | x} satisfy a Lipschitz continuity condition: there exists a constant C_2 such that |μ_g(x_i) − μ_g(x_j)| < C_2 |m_i − m_j| and |μ_s(ξ, x_i) − μ_s(ξ, x_j)| < C_2 |m_i − m_j| for any i and j; (3) there exists δ > 0 such that E{|g(y)|^{2+δ} | x} is uniformly bounded for any x, and E{|s(y − ξ)|^{2+δ} | x} is uniformly bounded for any x and ξ in the neighborhood of ξ_N.
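For illustration, one concrete choice of the scalar matching variable m(x) (an assumption of ours, not the paper's prescription) is a working linear predictor fitted on respondents, after which matching proceeds on m alone:

```python
import numpy as np

def scalar_match_impute(X, y, delta):
    """Collapse p-dimensional covariates into a scalar matching variable
    m(x) -- here a least-squares predictor fitted on respondents -- and
    run nearest neighbor imputation on m."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z[delta == 1], y[delta == 1], rcond=None)
    m = Z @ beta                                  # scalar m_i = m(x_i)
    resp = np.flatnonzero(delta == 1)
    donor = resp[np.argmin(np.abs(m[:, None] - m[resp][None, :]), axis=1)]
    return np.where(delta == 1, y, y[donor])
```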
with

V_{g\mu} = \lim_{n \to \infty} \frac{n}{N^2} E \left[ \mathrm{var}_P \left\{ \sum_{i \in A} \frac{\mu_g(x_i)}{\pi_i} \right\} \right],

V_{ge} = \lim_{n \to \infty} \frac{n}{N^2} \sum_{i=1}^{N} E \left[ \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\}^2 \sigma_g^2(x_i) \right],
n^{1/2} (\hat{\xi}_{\mathrm{NNI}} - \xi_N) = -n^{1/2} S'(\xi_N)^{-1} \{ \hat{S}_{\mathrm{NNI}}(\xi_N) - S_N(\xi_N) \} + o_P(1),   (9)
\mathrm{var}\{\hat{S}_{\mathrm{NNI}}(\xi_N)\} = \lim_{n \to \infty} \frac{n}{N^2} E \left[ \mathrm{var}_P \left\{ \sum_{i \in A} \frac{E\{s(y_i - \xi_N) \mid x_i\}}{\pi_i} \right\} \right]
+ \lim_{n \to \infty} \frac{n}{N^2} \sum_{i=1}^{N} E \left[ \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\}^2 \mathrm{var}\{s(y_i - \xi_N) \mid x_i\} \right],   (11)
Example 1. (Quantile estimation) The estimating function for the αth quantile is s(y_i − ξ) = I(y_i − ξ ≤ 0) − α, and the population estimating equation is S_{α,N}(ξ) = F_N(ξ) − α, where F_N(ξ) = N^{-1} \sum_{i=1}^{N} I(y_i ≤ ξ). The nearest neighbor imputation estimator \hat{\xi}_{α,\mathrm{NNI}} is defined as
\hat{V}_{\mathrm{rep}}(\hat{\mu}_g) = \sum_{k=1}^{L} c_k ( \hat{\mu}_g^{(k)} - \hat{\mu}_g )^2,   (13)
where L is the number of replicates, c_k is the kth replication factor and \hat{\mu}_g^{(k)} is the kth replicate of \hat{\mu}_g. For \hat{\mu}_g = \sum_{i \in A} \omega_i g(y_i), we can write the replicate of \hat{\mu}_g as \hat{\mu}_g^{(k)} = \sum_{i \in A} \omega_i^{(k)} g(y_i), where \omega_i^{(k)} is the replication weight that accounts for the complex sampling design. The replicates are constructed such that E_P\{\hat{V}_{\mathrm{rep}}(\hat{\mu}_g)\} = \mathrm{var}_P(\hat{\mu}_g)\{1 + o(1)\}.
which is essentially the sampling variance of ψ̂HT . This suggests that we can treat
{ψi : i ∈ A} as pseudo observations in applying the replication variance estimator.
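For instance, instantiating (13) with a delete-one jackknife under simple random sampling (c_k = (n − 1)/n, with the kth replicate dropping unit k; one standard choice among many):

```python
import numpy as np

def jackknife_variance(vals, w):
    """Replication variance (13) with delete-one jackknife replicates of
    the weighted mean mu_hat = sum(w*vals)/sum(w); c_k = (n-1)/n."""
    n = len(vals)
    mu_hat = np.sum(w * vals) / np.sum(w)
    reps = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k      # replicate weights zero out unit k
        reps[k] = np.sum(w[keep] * vals[keep]) / np.sum(w[keep])
    return (n - 1) / n * np.sum((reps - mu_hat) ** 2)
```

For the unweighted sample mean this reproduces the textbook estimator s²/n.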
Otsu and Rai (2017) used a similar idea to develop a wild bootstrap technique for the matching estimators for the average treatment effects. To be specific, we construct replicates of \hat{\psi}_{\mathrm{HT}} as follows: \hat{\psi}_{\mathrm{HT}}^{(k)} = \sum_{i \in A} \omega_i^{(k)} \psi_i. The replication variance estimator of \hat{\psi}_{\mathrm{HT}} is obtained by applying \hat{V}_{\mathrm{rep}}(\cdot) in (13) to the above replicates \hat{\psi}_{\mathrm{HT}}^{(k)}. It follows that E\{\hat{V}_{\mathrm{rep}}(\hat{\psi}_{\mathrm{HT}})\} = \mathrm{var}(\hat{\psi}_{\mathrm{HT}} - \mu_\psi)\{1 + o(1)\} = \mathrm{var}(\hat{\mu}_{g,\mathrm{NNI}} - \mu_g)\{1 + o(1)\}. Because the pseudo observations ψ_i involve the unknown μ_g(x), we use a nonparametric estimator \hat{\mu}_g(x). Concretely, we adopt sieve estimators (Geman and Hwang, 1982; Chen, 2007), which include power series estimators as examples; see the Appendices for details.
In summary, the new replication variance estimation for μ̂g,NNI proceeds as
follows:
\hat{\mu}_{g,\mathrm{NNI}}^{(k)} = \sum_{i \in A} \omega_i^{(k)} \left[ \hat{\mu}_g(x_i) + \delta_i (1 + k_i) \{ g(y_i) - \hat{\mu}_g(x_i) \} \right],   (14)

\hat{S}_{\mathrm{NNI}}^{(k)}(\hat{\xi}_{\mathrm{NNI}}) = \sum_{i \in A} \omega_i^{(k)} \left[ \hat{\mu}_s(\hat{\xi}_{\mathrm{NNI}}, x_i) + \delta_i (1 + k_i) \{ s(y_i - \hat{\xi}_{\mathrm{NNI}}) - \hat{\mu}_s(\hat{\xi}_{\mathrm{NNI}}, x_i) \} \right].   (16)
Step 3. Apply V̂rep ( · ) in (13) for the above replicates to obtain the variance
estimator of ŜNNI (ξ̂NNI ), denoted as V̂rep {ŜNNI (ξ̂NNI )}.
Step 4. Obtain the kernel-based derivative estimator \hat{S}'(\hat{\xi}_{\mathrm{NNI}}), where \hat{S}'(\xi) is defined in (15).
Step 5. Calculate the variance estimator of \hat{\xi}_{\mathrm{NNI}} as \hat{S}'(\hat{\xi}_{\mathrm{NNI}})^{-2} \hat{V}_{\mathrm{rep}}\{\hat{S}_{\mathrm{NNI}}(\hat{\xi}_{\mathrm{NNI}})\}.
For illustration, we continue with Example 3.
\hat{F}_{\mathrm{NNI}}^{(k)}(\hat{\xi}_{α,\mathrm{NNI}}) = \sum_{i \in A} \omega_i^{(k)} \left[ \hat{F}(\hat{\xi}_{α,\mathrm{NNI}}) + \delta_i (1 + k_i) \{ I(y_i ≤ \hat{\xi}_{α,\mathrm{NNI}}) - \hat{F}(\hat{\xi}_{α,\mathrm{NNI}}) \} \right].
Apply V̂rep ( · ) in (13) for the above replicates to obtain the replication variance
estimator of F̂NNI (ξ̂α,NNI ), denoted as V̂rep {F̂NNI (ξ̂α,NNI )}. Calculate the variance
estimator of ξ̂α,NNI as fˆ(ξ̂α,NNI )−2 V̂rep {F̂NNI (ξ̂α,NNI )}.
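For this quantile example, the point estimate and Steps 3 to 5 can be sketched under simple random sampling with delete-one jackknife replicates (a simplification of ours: these replicates are plain delete-one means of I(y ≤ ξ̂) and omit the μ̂_s adjustment terms the paper carries):

```python
import numpy as np

def nni_quantile(y_imp, alpha):
    """xi_hat solving F_hat(xi) - alpha ~ 0: the smallest imputed value
    at which the empirical CDF of the imputed data reaches alpha."""
    ys = np.sort(y_imp)
    cdf = np.arange(1, len(ys) + 1) / len(ys)
    return ys[np.searchsorted(cdf, alpha)]

def quantile_variance(y_imp, xi_hat, h=None):
    """f_hat(xi_hat)^{-2} * V_rep{F(xi_hat)} with delete-one jackknife
    replicates and a Gaussian-kernel density estimate."""
    n = len(y_imp)
    z = (y_imp <= xi_hat).astype(float)
    F_hat = z.mean()
    reps = (z.sum() - z) / (n - 1)            # delete-one replicates of F
    v_F = (n - 1) / n * np.sum((reps - F_hat) ** 2)
    if h is None:
        h = 1.5 * n ** (-0.2)                 # bandwidth used in Section 5
    f_hat = np.mean(np.exp(-0.5 * ((xi_hat - y_imp) / h) ** 2)) / (h * np.sqrt(2 * np.pi))
    return v_F / f_hat ** 2
```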
5. SIMULATION STUDY
In this section, we investigate the finite-sample performance of the proposed replication method for variance estimation and confidence interval construction, comparing it with conventional competitors.
For generating finite populations of size N = 50,000: first, let x_{1i}, x_{2i} and x_{3i} be generated independently from Uniform[0, 1], and x_{4i}, x_{5i}, x_{6i} and e_i be generated independently from N(0, 1); then, let y_i be generated under six mechanisms: (P1) y_i = −1 + x_{1i} + x_{2i} + e_i, (P2) y_i = −1.5 + x_{1i} + x_{2i} + x_{3i} + x_{4i} + e_i, (P3) y_i = −1.5 + x_{1i} + ··· + x_{6i} + e_i, (P4) y_i = −1 + x_{1i} + x_{2i} + x_{1i}^2 + x_{2i}^2 − 2/3 + e_i, (P5) y_i = −1.5 + x_{1i} + x_{2i} + x_{3i} + x_{4i} + x_{1i}^2 + x_{2i}^2 − 2/3 + e_i and (P6) y_i =
\hat{\mu}_{\mathrm{NNI}}^{(k)} = \sum_{i=1}^{n} \omega_i^{(k)} \left[ \hat{\mu}(x_i) + \delta_i (1 + k_i) \{ y_i - \hat{\mu}(x_i) \} \right],

\hat{\eta}_{\mathrm{NNI}}^{(k)} = \sum_{i=1}^{n} \omega_i^{(k)} \left[ \hat{\mu}_\eta(x_i) + \delta_i (1 + k_i) \{ I(y_i < c) - \hat{\mu}_\eta(x_i) \} \right],

\hat{\xi}_{\mathrm{NNI}}^{(k)}(\hat{\xi}_{\mathrm{NNI}}) = \hat{f}(\hat{\xi}_{\mathrm{NNI}})^{-2} \sum_{i=1}^{n} \omega_i^{(k)} \left[ \hat{\mu}_s(\hat{\xi}_{\mathrm{NNI}}, x_i) + \delta_i (1 + k_i) \{ I(y_i ≤ \hat{\xi}_{\mathrm{NNI}}) - \hat{\mu}_s(\hat{\xi}_{\mathrm{NNI}}, x_i) \} \right],
where \hat{\mu}_\eta(x), \hat{\mu}_s(\xi, x) and \hat{f}(\xi) are nonparametric estimators of μ_η(x) = P(y < c | x), μ_s(ξ, x) = P(y < ξ | x) and f(ξ), respectively. These are obtained by kernel regression using a Gaussian kernel with bandwidth h = 1.5 n^{-1/5}. We note that k_i
is the number of times that yi is selected to impute the missing values of y based
on the original data and therefore is kept the same across replicated data sets. The
variance estimators are compared in terms of empirical coverage rate and relative
bias, {E(V̂I ) − V }/V , where V is the true variance estimated from Monte Carlo
samples.
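The kernel regression step can be sketched as a plain Nadaraya-Watson estimator (our minimal version):

```python
import numpy as np

def kernel_regression(x0, x, y, h):
    """Nadaraya-Watson estimate of E(y | x) at points x0 with a Gaussian
    kernel, as used for the mu-hat estimators (h = 1.5 * n**(-1/5))."""
    x0, x, y = np.asarray(x0), np.asarray(x), np.asarray(y)
    K = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (K @ y) / K.sum(axis=1)
```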
Tables 1 and 2 present the simulation results under simple random sampling and probability proportional to size sampling, respectively, based on 2,000 Monte
Carlo samples. Under both sampling designs, the nearest neighbor imputation
estimator has small biases for all parameters μ, η and ξ , under (P1)–(P3) with m(x)
correctly specified for the mean function and (P4)–(P6) with m(x) misspecified for
the mean function. For variance estimation, as expected, the conventional jackknife
variance estimator is severely biased, indicating that the lack of smoothness of
the matching estimator needs to be taken into account in variance estimation. In
contrast, the proposed jackknife variance estimators provide satisfactory results
under both sampling designs and for all parameters. The relative biases are small and the empirical coverage rates are close to the nominal 95% level. Overall, the simulation results suggest that the proposed replication
variance estimation works reasonably well under the settings we considered.
6. CONCLUDING REMARKS
We focus on inference for general population parameters when the outcome is missing at random in survey data, using nearest neighbor imputation, a hot-deck type of imputation. The advantage of hot deck imputation over mean, ratio and regression imputation is that it provides not only asymptotically valid mean estimators but also valid distribution and quantile estimators. This article establishes asymptotic properties of
the nearest neighbor imputation estimators based on a scalar variable summarizing
all covariate information. Because of the non-smooth nature of nearest neighbor
[Tables 1 and 2 report the Bias, SE, relative bias (RB) and coverage rate (CR) of the proposed jackknife (Prop JK) and conventional jackknife (Conv JK) variance estimators, with m(x) correctly specified (c) or misspecified (m).]
however, their asymptotic properties are underdeveloped (Lenis et al., 2017). The
proposed methodology here can be easily generalized to investigate the asymptotic
properties of propensity score matching estimators with survey weights.
Our methodology and theoretical results for nearest neighbor imputation rep-
resent an important building block for future developments. Such developments
can follow three lines. First, extending the current theory to non-negligible sam-
pling fractions is possible; see, e.g., Mashreghi et al. (2014). For non-negligible
sampling fraction, note that

\mathrm{var}(\hat{\mu}_{g,\mathrm{NNI}} - \mu_g) = \mathrm{var}(\hat{\psi}_{\mathrm{HT}} - \mu_\psi) + \mathrm{var}(\mu_\psi - \mu_g) + o(n^{-1}),

and \mathrm{var}(\mu_\psi - \mu_g) = O(N^{-1}). Thus, we can add a model-based estimator of \mathrm{var}(\mu_\psi - \mu_g) to the replication variance estimator for \mathrm{var}(\hat{\psi}_{\mathrm{HT}} - \mu_\psi).
Second, instead of choosing the nearest neighbor as a donor for missing items, we
can consider fractional imputation (Kim and Fuller, 2004; Yang et al., 2013; Kim
and Yang, 2014; Yang and Kim, 2016) using K (K > 1) nearest neighbors. Third,
writing yi = xi Ri and using the fact that xi is always observed, we can apply near-
est neighbor imputation only to impute Ri , which can be called nearest neighbor
ratio imputation.
ACKNOWLEDGMENTS
Dr. Yang is partially supported by NSF grant DMS 1811245, NCI grant P01
CA142538, and Ralph E. Powe Junior Faculty Enhancement Award from Oak
Ridge Associated Universities. Dr. Kim is partially supported by NSF grant MMS
1733572.
REFERENCES
Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average
treatment effects. Econometrica, 74, 235–267.
Abadie, A., & Imbens, G. W. (2008). On the failure of the bootstrap for matching estimators.
Econometrica, 76, 1537–1557.
Abadie, A., & Imbens, G. W. (2011). Bias-corrected matching estimators for average treatment effects.
Journal of Business & Economic Statistics, 29, 1–11.
Abadie, A., & Imbens, G. W. (2012). A martingale representation for matching estimators. Journal of
the American Statistical Association, 107, 833–843.
Abadie, A., & Imbens, G. W. (2016). Matching on the estimated propensity score. Econometrica, 84,
781–807.
Adusumilli, K. (2017). Bootstrap inference for propensity score matching. Retrieved from
https://economics.sas.upenn.edu/events/bootstrap-inference-propensity-score-matching.
Berger, Y. G., & Skinner, C. J. (2003). Variance estimation for a low income proportion. Journal of the
Royal Statistical Society: Series C, 52, 457–468.
Bickel, P. J., Götze, F., & van Zwet, W. R. (2012). Resampling fewer than n observations: Gains, losses,
and remedies for losses. In Selected works of Willem van Zwet (pp. 267–297). New York, NY:
Springer.
Chen, J., & Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics,
16, 113–131.
Chen, J., & Shao, J. (2001). Jackknife variance estimation for nearest-neighbor imputation. Journal of
the American Statistical Association, 96, 260–269.
Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of
Econometrics, 6, 5549–5632.
Deville, J. C. (1999). Variance estimation for complex statistics and estimators: Linearization and
residual techniques. Survey Methodology, 25, 193–204.
Ding, P., & Li, F. (2018). Causal inference: A missing data perspective. Statistical Science, 33, 214–237.
Francisco, C. A., & Fuller, W. A. (1991). Quantile estimation with a complex survey design. Annals of
Statistics, 19, 454–469.
Fuller, W. A. (2009). Sampling Statistics. Hoboken, NJ: John Wiley & Sons.
Geman, S., & Hwang, C.-R. (1982). Nonparametric maximum likelihood estimation by the method of
sieves. Annals of Statistics, 10, 401–414.
Heckman, J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing selection bias using
experimental data. Econometrica, 66, 1017–1098.
Heitjan, D. F., & Little, R. J. (1991). Multiple imputation for the fatal accident reporting system.
Applied Statistics, 40, 13–29.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using
the estimated propensity score. Econometrica, 71, 1161–1189.
Ichimura, H., & Linton, O. B. (2005). Asymptotic expansions for some semiparametric program eval-
uation estimators. In D. Andrews & J. Stock (Eds.), Identification and inference in econometric
models: Essays in honor of Thomas J. Rothenberg. Cambridge: Cambridge University Press.
Isaki, C. T., & Fuller, W. A. (1982). Survey design under the regression superpopulation model. Journal
of the American Statistical Association, 77, 89–96.
Kim, J. K., & Fuller, W. (2004). Fractional hot deck imputation. Biometrika, 91, 559–578.
Kim, J. K., Fuller, W. A., Bell, W. R., et al. (2011). Variance estimation for nearest neighbor imputation
for US Census long form data. The Annals of Applied Statistics, 5, 824–842.
Kim, J. K., Navarro, A., & Fuller, W. A. (2006). Replication variance estimation for two-phase stratified
sampling. Journal of the American Statistical Association, 101, 312–320.
Kim, J. K., & Yang, S. (2014). Fractional hot deck imputation for robust inference under item
nonresponse in survey sampling. Survey Methodology, 40, 211–230.
Lee, H., & Särndal, C. E. (1994). Experiments with variance estimation from survey data with imputed
values. Journal of Official Statistics, 10, 231–243.
Lenis, D., Nguyen, T. Q., Dong, N., & Stuart, E. A. (2017). It’s all about balance: Propensity score
matching in the context of complex survey data. Biostatistics. doi:10.1093/biostatistics/kxx063.
Li, Q., & Racine, J. S. (2007). Nonparametric econometrics: Theory and practice. Princeton, NJ:
Princeton University Press.
Little, R. J. (1988). Missing-data adjustments in large surveys. Journal of Business & Economic
Statistics, 6, 287–296.
Little, R. J., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: John Wiley
& Sons.
Mashreghi, Z., Haziza, D., & Léger, C. (2016). A survey of bootstrap methods in finite population
sampling. Statistics Surveys, 10, 1–52.
Mashreghi, Z., Léger, C., & Haziza, D. (2014). Bootstrap methods for imputed data from regression,
ratio and hot-deck imputation. Canadian Journal of Statistics, 42, 142–167.
Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of
Econometrics, 79, 147–168.
Otsu, T., & Rai, Y. (2017). Bootstrap inference of matching estimators for average treatment effects.
Journal of the American Statistical Association, 112, 1720–1732.
Politis, D. N., Romano, J. P., & Wolf, M. (1999). Subsampling. New York, NY: Springer-Verlag.
Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple
imputations. Journal of Business & Economic Statistics, 4, 87–94.
Rust, K. F., & Rao, J. N. K. (1996). Variance estimation for complex surveys using replication
techniques. Statistical Methods in Medical Research, 5, 283–310.
Sande, I. G. (1979). A personal view of hot deck imputation procedures. Survey Methodology, 5,
238–258.
Serfling, R. J. (1980). Approximation theorems of mathematical statistics. Hoboken, NJ: John Wiley
& Sons.
Shao, J., & Steel, P. (1999). Variance estimation for survey data with composite imputation and
nonnegligible sampling fractions. Journal of the American Statistical Association, 94, 254–265.
Shao, J., & Wang, H. (2008). Confidence intervals based on survey data with nearest neighbor
imputation. Statistica Sinica, 18, 281–297.
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical
Science, 25, 1–21.
Wolter, K. (2007). Introduction to variance estimation (2nd ed.). New York, NY: Springer.
Wu, C., & Sitter, R. R. (2001). A model-calibration approach to using complete auxiliary information
from survey data. Journal of the American Statistical Association, 96, 185–193.
Yang, S., Imbens, G. W., Cui, Z., Faries, D. E., & Kadziola, Z. (2016). Propensity score match-
ing and subclassification in observational studies with multi-level treatments. Biometrics, 72,
1055–1065.
Yang, S., & Kim, J. K. (2016). Fractional imputation in survey sampling: A comparative review.
Statistical Science, 31, 415–432.
Yang, S., Kim, J. K., & Shin, D. W. (2013). Imputation methods for quantile estimation under missing
at random. Statistics and its Interface, 6, 369–377.
APPENDICES
The Appendices include proofs of Theorems 1–3 and additional technical details.
B_N = \frac{n^{1/2}}{N} \sum_{i \in A} \frac{1}{\pi_i} (1 - \delta_i) \{ \mu_g(x_{i(1)}) - \mu_g(x_i) \}
\le C_2 \frac{n^{1/2}}{N} \sum_{i \in A} \frac{1}{\pi_i} (1 - \delta_i) | m_{i(1)} - m_i | = o_P(1),
D_N = \frac{n^{1/2}}{N} \left[ \sum_{i \in A} \frac{1}{\pi_i} \{ \mu_{g,i} + \delta_i (1 + k_i) e_i \} - \sum_{i=1}^{N} g(y_i) \right]
= \frac{n^{1/2}}{N} \sum_{i=1}^{N} \left( \frac{I_i}{\pi_i} - 1 \right) \mu_{g,i} + \frac{n^{1/2}}{N} \sum_{i=1}^{N} \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\} e_i,   (A.2)
and we can verify that the covariance of the two terms in (A.2) is zero. Thus,
\mathrm{var}(D_N) = \mathrm{var}\left\{ \frac{n^{1/2}}{N} \sum_{i=1}^{N} \left( \frac{I_i}{\pi_i} - 1 \right) \mu_{g,i} \right\} + \mathrm{var}\left[ \frac{n^{1/2}}{N} \sum_{i=1}^{N} \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\} e_i \right].
It remains to show that V_{ge} = O(1). To do this, the key is to show that the
moments of ki are bounded. Under Assumption 2, it is easy to verify that
and

\sup_{\xi \in I_s} N^{-1} \sum_{i=1}^{N} | s(y_i - \xi_N - N^{-\alpha} \xi) - s(y_i - \xi_N) | = O_P(N^{-\alpha}),
Assumption B.1 (5) holds with probability one under suitable assumptions on
the probability mechanism generating the yi ’s and on the function s( · ), and there-
fore it is justifiable. Under Assumption B.1, by the standard arguments from the
theory on M-estimators (Serfling, 1980), ξ̂NNI is consistent for ξN . We further make
the following assumption.
Assumption B.2.
The nearest neighbor imputation estimator ξ̂NNI is root-n consistent for ξN .
Now, we give proof for Theorem 2. Under Assumptions B.1 and B.2, we can
write
Following a derivation similar to that in the proof of Theorem 1, it is easy to show that
\mathrm{var}\{\hat{S}_N(\xi)\} = \lim_{n \to \infty} \frac{n}{N^2} E \left[ \mathrm{var}_P \left\{ \sum_{i \in A} \frac{E\{s(y_i - \xi) \mid x_i\}}{\pi_i} \right\} \right]
+ \lim_{n \to \infty} \frac{n}{N^2} \sum_{i=1}^{N} E \left[ \left\{ \delta_i (1 + k_i) \frac{I_i}{\pi_i} - 1 \right\}^2 \mathrm{var}\{s(y_i - \xi) \mid x_i\} \right].
Assumption C.1 states conditions on the smoothness and tail behavior of the
kernel functions. Popular kernel functions, including Epanechnikov, Gaussian and
triangle kernels, satisfy the required conditions.
We consider continuous g(y) and power series estimation for μg (x) with K terms
in the series, where K increases with n. Let p be the dimension of X. Consider a
sequence of power functions
where W is a diagonal matrix with the ith diagonal element πi−1 , and (P T W P )−
denotes a generalized inverse of a matrix P T W P .
Suppose the following assumption holds for establishing the fast convergence
rate of μ̂g (x) in (D.2).
Assumption D.1.
1. The support of x is a Cartesian product of compact intervals;
2. μg (x) is s-times continuously differentiable at x with s/p > 1;
3. the number of series K = O(nν ) with 0 < ν < 1/3.
Lemma D.1. Under Assumption D.1, the power series estimator \hat{\mu}_g(x) in (D.2) satisfies \sup_x |\hat{\mu}_g(x) - \mu_g(x)| = O_P\{ (K^3/n)^{1/2} + K^{1 - s/p} \} = o_P(1).
Denote p_\xi(x) = E\{I(y ≤ \xi) \mid x\} and \mathrm{logit}(a) = \{1 + \exp(-a)\}^{-1}. The series logit estimator for p_\xi(x) can be obtained as

\hat{\pi}_K = \arg\max_{\pi} \sum_{i \in A} \pi_i^{-1} \left( I(y_i - \xi \le 0) \log \mathrm{logit}\{ p^K(x_i)^{\mathrm{T}} \pi \} + I(y_i - \xi > 0) \log\left[ 1 - \mathrm{logit}\{ p^K(x_i)^{\mathrm{T}} \pi \} \right] \right).
Suppose that the following assumption holds for establishing the fast convergence
rate of the series logit estimator p̂ξ (x) in (D.3).
Assumption D.2.
1. The support of x is a Cartesian product of compact intervals;
2. pξ (x) is s times continuously differentiable with s/p ≥ 3;
3. pξ (x) is bounded away from zero and one on the support of x;
4. the density of x is bounded away from zero on the support of x;
5. the number of series K = O(nν ) with ν < 1.
\hat{\mu}_{g,\mathrm{NNI}}^{*} = \sum_{i \in A} \omega_i^{*} \left[ \hat{\mu}_g(x_i) + \delta_i (1 + k_i) \{ g(y_i) - \hat{\mu}_g(x_i) \} \right] u_i
= \sum_{i \in A} \omega_i^{*} \left[ \mu_g(x_i) + \delta_i (1 + k_i) \{ g(y_i) - \mu_g(x_i) \} \right] u_i
+ \sum_{i \in A} \omega_i^{*} \{ (1 - \delta_i) + \delta_i k_i \} \{ \hat{\mu}_g(x_i) - \mu_g(x_i) \} u_i
= \sum_{i \in A} \omega_i^{*} \psi_i u_i + R_N^{*},   (E.1)

where R_N^{*} = \sum_{i \in A} \omega_i^{*} \{ (1 - \delta_i) + \delta_i k_i \} \{ \hat{\mu}_g(x_i) - \mu_g(x_i) \} u_i.
We now show E^{*}\{ (n^{1/2} R_N^{*})^2 \} \to 0 in probability. We write

E^{*}\{ (n^{1/2} R_N^{*})^2 \}
= nN\, E^{*}\{ (\omega_1^{*} u_1)^2 \} \frac{1}{N} \sum_{i \in A} \{ (1 - \delta_i) + \delta_i k_i \}^2 \{ \hat{\mu}_g(x_i) - \mu_g(x_i) \}^2
+ 2nN(N - 1)\, E^{*}( \omega_1^{*} \omega_2^{*} u_1 u_2 ) \frac{1}{N(N - 1)} \sum_{i \ne j \in A} \{ (1 - \delta_i) + \delta_i k_i \}
ABSTRACT
We survey banks to construct national estimates of total noncash payments
by type, payments fraud and related information. The survey is designed to
create aggregate total estimates of all payments in the United States using
data from responses returned by a representative, random sample. In 2016,
the number of questions in the survey doubled compared with the previous
survey, raising serious concerns about nonparticipation by smaller banks. To obtain
sufficient response data for all questions from smaller banks, we adminis-
tered a modified survey design which, in addition to randomly sampling
banks, also randomly assigned one of several survey forms, subsets of the
full survey. This case study illustrates that while several other factors influ-
enced response outcomes, the approach helped ensure sufficient response
for smaller banks. Using such an approach may be especially important in
an optional-participation survey, when reducing costs to respondents may
affect success, or when imputation of unplanned missing items is already
needed for estimation. While a variety of factors affected the outcome, we
find that the planned missing data approach improved response outcomes
238 GEOFFREY R. GERDES AND XUEMEI LIU
for smaller banks. The planned missing item design should be considered
as a way of reducing survey burden or increasing unit-level and item-level
responses for individual respondents without reducing the full set of survey
items collected.
Keywords: Business survey; responder burden; planned missing data; split
questionnaire; multiform design; imputation
JEL classifications: C83
The planned missing data design might lower unit response rates for smaller banks, but with an expected dividend of higher total response counts by item. To obtain sufficient response data for all items, the shorter
forms, which included some full-coverage items and some partial-coverage items,
were designed as complementary subsets of the full survey form. The shorter forms
were administered so that all partial-coverage items were presented to an equal number of randomly selected banks. This approach fielded the full survey
form to the largest banks, and three sets of partial form variants to smaller banks
that contained either 2/3, 1/2 or 1/3 of the partial-coverage items, allowing the
length of the surveys to decline with bank size. The resulting missing patterns are influenced (1) by randomized planned missing items determined by the form sent to the respondent and (2) by unplanned missing items in the returned surveys.
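The complementary-subsets idea can be sketched as follows (the three-block split and the item pool are our illustrative assumptions; the fielded forms additionally varied with bank size):

```python
import random

def build_forms(partial_items, seed=0):
    """Split the partial-coverage items into 3 complementary blocks and
    build the 2/3-length form variants; each item lands in exactly one
    block and hence on exactly two of the three 2/3 forms."""
    rng = random.Random(seed)
    items = list(partial_items)
    rng.shuffle(items)
    blocks = [items[b::3] for b in range(3)]
    two_thirds_forms = [blocks[b] + blocks[(b + 1) % 3] for b in range(3)]
    return blocks, two_thirds_forms
```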
By design, the approach introduces planned missing data items, but a non-
mandatory survey involving a lengthy survey form already introduces unplanned
missing data items, known also as item nonresponse. Aggregate estimates we pro-
duce from the survey data are subject to various adding up constraints, which also
apply at the level of the unit response. Complete case analysis wastes information and would violate adding-up and other logical constraints. More generally, missing items of both types introduce problems for analysis methods
requiring a rectangular data set. But modern imputation methods make it possible
to obtain a rectangular data set in the presence of missing data, retain all of the
collected information and produce estimates with desirable statistical properties
(Little and Rubin, 2002).
Our planned missing data design adapts split questionnaire and missing-by-design approaches from the literature. Multiple-matrix sampling
involves fielding samples of questions to test subjects. The idea appears to have
originated from problems of establishing normative univariate distributions for
broad sets of examination questions in educational testing and does not require
imputation (Lord, 1962). Inter alia, Raghunathan and Grizzle (1995) extended
the multiple-matrix idea to a split questionnaire survey design, which imposes
restrictions on the assignment of items to sampled subjects to retain the ability
to estimate population quantities. The three-form design of Graham, Hofer, and
MacKinnon (1996) has similar goals but imposes a particular structure on the
overlap of full-coverage and partial-coverage items. Both latter papers employ
imputation techniques to handle missing data, as we do.
Business surveys have a number of features that distinguish them from sur-
veys of individuals (Snijkers et al., 2013). A business survey collects data about
the business and not the individual responding to the survey. Political and social
considerations generally substitute for psychological ones. Recruitment involves obtaining the support of senior managers, of internal experts with access to the information, and of individuals assigned to fill out parts of the survey or to serve as a response contact. Further, once a participating business is recruited, obtaining
responses to the survey items involves the participant incurring varying levels of
cost, with paid staff consulting records, performing database queries and some-
times engaging third-party providers. Data quality remains a concern with the
collection of information that is objectively verifiable but complex to obtain.
Finally, business heterogeneity, which may cause behavior to vary by, for example, size and type, may lead to different treatments across classes.
This case study illustrates that our approach involved an increase in the sample size, which could reduce unit response rates; such rates often serve as a proxy for survey quality, potentially raising concerns of nonresponse bias. If that were our focus, we would direct such concerns to the smaller size strata, because larger bank strata return surveys at a higher rate. In any case, concerns about minimum data set size, as we have, and other factors that could contribute to nonsampling error in a complex business survey may justifiably dominate concerns about nonresponse bias (see, e.g., Lineback and Thompson (2010)). The disproportional influence of larger bank strata on the total estimates also means that misreported data or the loss of a very large bank response potentially overshadows concerns of low response rates in small bank strata.
The remainder of the paper is organized as follows. First, the standard approach
to our survey design is discussed. Second, we discuss the challenge we faced in
2016 by reviewing the 2013 survey outcome and considering the impact of the
growth in the number of items. Third, we discuss our implementation of a planned
missing data design for the 2016 survey. Fourth, we compare the outcome of the
2016 survey to the 2013 survey, and conclude the paper.
a covariate for estimation can improve precision over, for example, drawing a
simple random sample and constructing a probability estimate (Cochran, 1953).
Data on type and size, as measured by checkable deposits (CHKD) and money
market deposits (MMDA) for the population of banks, are available from reports
filed with the Federal Reserve. The population of US banks that process payments, which includes commercial banks (CMB), savings institutions (SAV) and credit unions (CUS), is large, with well over 10,000 banks in 2013 and 2016, and has a highly skewed size distribution.
Because most of the payments are made from CHKD and MMDA, they are
highly correlated with the payment volumes of interest, making them valuable
measures of bank size for stratification as well as potential covariates for estimation
purposes. To take advantage of the correlations and to account for the skewness,
the bank population is stratified into subpopulations by type and size, and separate
samples are drawn from each, with the sampling rate declining with size.
A bank's type has a meaningful relationship with how it is regulated, the type of business it conducts and how it reports its information. Another variable, STRATVAR, is defined to be equal to the sum of CHKD and MMDA for CMB and SAV, and equal to CHKD for CUS due to reporting differences. Stratification first by
type, and then by size within type, using STRATVAR improves the precision of
estimates for a given sample size. There are different procedures recommended in
the literature for choosing an optimal cutoff between a take-all and a collection
of take-some strata in a skewed population, for example, Hidiroglou (1986) or
Hansen, Hurwitz, and Madow (1953). For the take-some strata, the boundaries were chosen using the cum √f method of Dalenius and Hodges (1959) for a fixed number of strata, and the sample was allocated following Neyman (1934).
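A minimal sketch of these two steps (the binning and data layout are our assumptions):

```python
import numpy as np

def cum_sqrt_f_boundaries(sizes, n_strata, n_bins=50):
    """Dalenius-Hodges rule: cut where the cumulative sqrt of bin
    frequencies is equalized across the take-some strata."""
    freq, edges = np.histogram(sizes, bins=n_bins)
    csf = np.cumsum(np.sqrt(freq))
    targets = [csf[-1] * (h + 1) / n_strata for h in range(n_strata - 1)]
    return [edges[np.searchsorted(csf, t) + 1] for t in targets]

def neyman_allocation(N_h, S_h, n_total):
    """Neyman (1934): stratum sample sizes proportional to N_h * S_h."""
    w = np.asarray(N_h, dtype=float) * np.asarray(S_h, dtype=float)
    return n_total * w / w.sum()
```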
The framework of the sample selection procedure is a representative, random
sample using an auxiliary measure of size among the population from administra-
tive data, where we stratify the population, draw separate samples and construct
estimates from the sample responses within the strata. To obtain aggregate esti-
mates of volumes for the population, we used separate ratio estimators. We took
advantage of the high correlation between the universally available size covariate,
CHKD, with the various volumes measured in the study. Standard ratio estimators
for a population are used to “blow up” the sample data to the population estimate
with a covariate available on the size of banks in the population. Liu, Gerdes, and
Parke (2009) discusses the stratification and estimation approach we used in more
detail.
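A sketch of the separate ratio estimator (the per-stratum data layout is our assumption):

```python
import numpy as np

def separate_ratio_estimate(strata):
    """Separate ratio estimator: per stratum, blow up the sample ratio of
    payment volume y to the size covariate x (e.g. CHKD) by the known
    population covariate total X_h, then sum over strata."""
    return sum(np.sum(y) / np.sum(x) * X_h for y, x, X_h in strata)
```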
The rectangular data set used for estimation is imputed using an iterative EM algorithm approach that estimates the covariances between reported items in the presence of missing items while simultaneously imputing the missing items. Imputed data replace missing items and are treated as reported data. Standard errors that
account for imputation model error are calculated following a multiple imputation
strategy that uses a large number of replicate data sets with imputed data augmented
with random draws from the imputation model error distribution. Gerdes and Liu
(2010) discusses our approach to imputation and estimation in more detail.
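The combining step is standard multiple imputation; assuming it follows Rubin's rules for a scalar estimate (our reading of the multiple imputation strategy described above, not a detail stated here), the total variance is:

```python
import numpy as np

def mi_total_variance(estimates, within_vars):
    """Rubin's rules for M imputation replicates:
    T = mean within-variance + (1 + 1/M) * between-imputation variance."""
    q = np.asarray(estimates, dtype=float)
    w = np.asarray(within_vars, dtype=float)
    M = len(q)
    B = np.sum((q - q.mean()) ** 2) / (M - 1)
    return w.mean() + (1.0 + 1.0 / M) * B
```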
3. THE CHALLENGE
Technological developments, evolving market structure and expanding payments
research and policy interests led to several changes in the survey design. In par-
ticular, the 2016 form included 502 payment volume items, roughly twice the 253 items in the 2013 form. This large increase led us to try
a planned missing data design to address anticipated nonresponse due to survey
fatigue issues. Past experience, policy goals and survey form structure consid-
erations led to a prioritization of survey items into a set that would receive full
coverage in the survey, meaning they would be asked of all subjects, and to a set of
the remaining survey items, which would receive only partial coverage, meaning
that they would be asked of some but not all subjects. All partial-coverage survey
items would be distributed such that they would have an equal chance of being
presented to and answered by a subject.
The rise in the length of the survey form between 2013 and 2016 raised seri-
ous concerns that smaller banks, already exhibiting low response rates in 2013,
would not be willing to participate in sufficient numbers. Generally, the risk of
nonparticipation because of survey length is high and declines with bank size. Past
surveys have not exceeded a 55% unit response rate, and the unit response rate of
the 2013 survey had declined to 44%. To limit losses in the 2016 survey, we tried
a survey design that sacrifices potential gains at the intensive participation
margin (answering more survey items) in favor of reducing potential losses at the
extensive margin (the decision not to participate at any level in this
nonmandatory survey).
While our surveys have grown in length at each repetition, the unit-level
response rate to our 2013 survey had dropped markedly, to 44% overall, com-
pared with unit-level response rates in 2010 or earlier that ranged from 54 to
56%. While many factors could have been involved, some of the response rate
decline likely was attributable to the increased length and complexity. Evidence
from respondent feedback, such as complaints about an increasing burden placed
on bank staff by other, mandatory surveys, suggested that the survey environment
had also become increasingly challenging.2 Recognizing the importance of accom-
modating respondent needs in a nonmandatory survey, we entered the planning
stage for the 2016 collection of survey data with a recognition that greater effort
and adaptability would be needed to sustain the same response rate as 2013.
Improving Response Quality with Planned Missing Data 243
In the face of these challenges, the scope of information and the total number
of items included in the survey also grew. The number of payment volume items
increased by 249, growing from 253 in the 2013 survey to 502 during 2016. The
number of intersecting items between the two surveys is 205, with 297 new items
for 2016 and 48 expired items from 2013. The survey is not mandatory, and the
102% increase in the number of items in the survey led us to look for a relatively
radical adjustment to the survey design for 2016 which would offset the increased
burden and the anticipated drop off in response that might result, especially for
smaller banks.
In addition to the sheer increase in the number of items requested, we made a
major change to the survey reference period. Surveys prior to 2013 were designed
to collect prospective data during one or two months after banks registered their
participation, and banks were notified of the survey content before the reference period
so that they could prepare systems and staff to compile the requested information
while the measured activity was taking place. On the conjecture that the balance
of participating banks could provide information retrospectively from the previous
calendar year, given advances in electronic record keeping and retrieval capabil-
ities, the 2016 survey collected, for the first time, data for the previous calendar
year (2015).3 This new approach to the survey reference period was anticipated to
also have an effect on response rates, especially on those of the smaller banks.
The survey data have always exhibited complex patterns of item nonresponse.
Many items, however, are part of a logical structure in the form of subtotals
adding up to totals. An example for a volume of debit card payments is depicted
in Figure 1. Collecting data in this form is often necessary because of the subject
matter: it enumerates the components of totals for clarity, and it accommodates
cases where, for example, some respondents cannot easily report the subtotal
details or, conversely, have access to a subtotal but not all components of the
total.
Fig. 1. An Example of a Set of Logical “Adding Up” Relationships Between Items in the
Surveys. Note: Debit card must be the sum of Card-present and Card-not-present. Likewise,
Card-present must be the sum of Signature-authenticated, PIN-authenticated and Other.
Imputing and enforcing logic at the response level allows the exploitation
of within-stratum covariances and within-response logical constraints. A final
covariance matrix is obtained through an approach based on an iterative, E–M
algorithm-based imputation method which, under the assumption that values are
missing at random, produces a maximum likelihood estimate of the covariance
matrix in the presence of missing data and simultaneously produces imputed
data (Little and Rubin, 2002). Precision is measured through multiple
imputation estimates of the ratio estimator standard errors, which account for
the model-based parameter estimation errors.
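A simplified sketch of the iterative imputation idea follows, assuming a multivariate normal model: missing entries are repeatedly replaced by their conditional expectations given the observed entries under the current mean/covariance estimate. Full EM would also add a conditional-variance correction when updating the covariance, so this illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np

# Iterative conditional-mean imputation in the spirit of the E-M
# approach in the text (illustration only; real EM also corrects the
# covariance update for the conditional variance of imputed entries).

def iterative_impute(X, n_iter=50):
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # start from the column means of the observed values
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # E[missing | observed] under the current normal estimate
            X[i, m] = mu[m] + S[np.ix_(m, o)] @ np.linalg.solve(
                S[np.ix_(o, o)], X[i, o] - mu[o])
    return X

# Hypothetical two-column data: the second column is roughly twice the
# first, and its last value is missing.
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, np.nan]])
X_imp = iterative_impute(X)
print(round(float(X_imp[3, 1]), 1))  # → 8.2, extrapolating the linear pattern
```

The multiple imputation step in the text would then repeat this with random draws added from the model error distribution, producing replicate data sets for the standard-error calculation.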
To set the stage and explain the potential benefit of the planned missing data
design, it is helpful to discuss the response outcome of the 2013 survey first. To
simplify the discussion and to prepare for comparisons with the results of the
planned missing data design in 2016, we adjusted the 2013 strata boundaries to
represent the same population proportions as 2016. The one exception to this
proportional matching is that the stratum for the largest banks of each type is
constructed to include the top 50 CMB, the top 25 SVG and the top 25 CUS
in both 2013 and 2016. For convenience, we label CMB size strata 11–18, with
bank size, measured by STRATVAR, increasing with the label and, similarly, label
SVG strata 21–26 and CUS 31–38.
Figure 2 shows the unit and item response rates achieved for different strata in
the 2013 survey. Within each bank type, strata are displayed with size category
increasing from left to right. For example, stratum 11 contains the smallest CMB
and stratum 18 contains the largest CMB. The unit response rate within stratum
18 was about 85%. The item response rate is the total number of items returned
in proportion to the total number of items presented to all sampled banks, which,
for stratum 18, was approximately 65%. Smaller strata display much lower unit
response rates, and within each bank type, a rise in the unit and item response rates
is evident as bank size increases. The increase with bank size, with a few notable
exceptions, is close to being monotonic and is most pronounced for CUS.
The overall item response rate in the 2013 survey was about 30%, 14 percentage
points lower than the unit response rate of 44% (Table 1). In the smallest bank strata, unit-
level and item-level response rates dipped to very low levels. The lowest unit and
item-level response rates were in stratum 31, reaching 20% and 13%, respectively.
Relatively low unit response and low item response in small bank strata led to
sample sizes too small to estimate without combining strata.
[Figure 2 here: bar chart of unit (Unit) and item (Item) response rates, in percent, by stratum 11–38.]
Fig. 2. Response Rates by Stratum for 2013 are Calculated Two Different Ways.
Note: The unit response rate (Unit) is the number of responding banks (those that returned
a survey form) in proportion to the number of sampled banks. The item response rate (Item)
is the total number of items returned in proportion to the total number of items presented
to all sampled banks.
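The two rates defined in the note can be computed directly; the per-bank records below are made up. The example also shows why the item rate sits below the unit rate: even responding banks leave some items blank.

```python
# Unit and item response rates as defined in the note. Each sampled bank
# has the number of items presented and the number returned
# (nonrespondents return none). Counts are hypothetical.

banks = [
    {"responded": True,  "presented": 100, "returned": 80},
    {"responded": True,  "presented": 100, "returned": 40},
    {"responded": False, "presented": 100, "returned": 0},
    {"responded": False, "presented": 100, "returned": 0},
]

# unit rate: responding banks / sampled banks
unit_rate = sum(b["responded"] for b in banks) / len(banks)
# item rate: items returned / items presented to all sampled banks
item_rate = sum(b["returned"] for b in banks) / sum(b["presented"] for b in banks)

print(f"unit: {unit_rate:.0%}, item: {item_rate:.0%}")  # → unit: 50%, item: 30%
```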
doing so, we would have to double our sample size to present all items to the same
number of survey subjects. In a simple scenario of a survey of sample size n, the
example of shortening the survey just described would be equivalent to defining
the first half of the survey as survey form 1 and the second half as survey form 2.
If n banks receive survey form 1 and another n banks receive survey form 2, then
each survey item is posed to n banks.
This applied study requires an approach tailored to the specific survey form,
related to the three-form and split-questionnaire designs proposed in the literature.
The 2016 survey form is naturally divided into the following nine categories:
bank profile, check payments and check returns, automated clearing house (ACH)
payments and ACH returns, wire payments, debit and prepaid card payments, credit
card payments, cash deposits and withdrawals, alternative payment methods, and
unauthorized third-party fraudulent payments. In past studies, we imputed the data
Table 1. 2013 Population, Sample and Response Counts by Bank Type and Stratum.

Stratum    Boundary     N        n       n/N    Resp.   Unit    Item
CMB
  18       (top 50)     50       50      1.00   42      0.84    0.64
  17       8,100,000    298      298     1.00   152     0.51    0.35
  16       593,000      273      217     0.79   115     0.53    0.34
  15       301,000      373      208     0.56   100     0.48    0.30
  14       186,200      685      186     0.27   78      0.42    0.28
  13       116,250      966      186     0.19   73      0.39    0.25
  12       70,700       1,318    231     0.18   78      0.34    0.22
  11       35,955       1,529    153     0.10   53      0.35    0.24
  Subtotal              5,492    1,529   0.28   691     0.45    0.30
SVG
  26       (top 25)     25       25      1.00   16      0.64    0.56
  25       1,900,000    61       61      1.00   32      0.52    0.36
  24       468,000      120      90      0.75   48      0.53    0.41
  23       173,000      156      57      0.37   25      0.44    0.31
  22       86,600       183      37      0.20   19      0.51    0.30
  21       44,400       344      59      0.17   27      0.46    0.34
  Subtotal              889      329     0.37   167     0.51    0.37
CUS
  38       (top 25)     25       24      0.96   15      0.63    0.49
  37       580,000      50       47      0.94   34      0.72    0.55
  36       307,000      155      138     0.89   73      0.53    0.35
  35       142,100      201      134     0.67   62      0.46    0.34
  34       80,000       279      123     0.44   53      0.43    0.27
  33       44,450       492      113     0.23   30      0.27    0.16
  32       20,700       915      119     0.13   30      0.25    0.17
  31       8,638        3,192    130     0.04   26      0.20    0.13
  Subtotal              5,309    828     0.16   323     0.39    0.26
Total                   11,690   2,686   0.23   1,181   0.44    0.30

Note: Boundary is the lower stratum boundary of the size variable; N is the population count, n the sample count, Resp. the number of responding banks, and Unit and Item the unit and item response rates.
in blocks along these lines to save on computational complexity and, thus, time,
and because increasing the set of potentially correlated data outside of each block
appeared to have limited value relative to within-block information. With this in
mind, the categories formed natural blocks for dividing up the surveys as well.
Table 3. Number of Full-Coverage and Partial-Coverage Items by Section.

Section       Full-coverage   Partial-coverage
Profile       14              0
Check         18              72
ACH           20              64
Wire          4               48
Debit card    18              52
Credit card   19              36
Cash          39              22
Alt pymts     8               16
Fraud         52              0
Total         192             310
presented across a subset. Within strata with blocked surveys, we wanted to
achieve as much balance as possible, that is, to minimize the range of survey
form lengths across the companion versions administered to equal-sized
subsamples within a survey stratum. In addition, it was important to
limit the total number of survey versions to keep the complexity and potential
confusion of administering multiple versions to a minimum.
Table 3 contains the number of full-coverage and partial-coverage items by
section in the final survey form.
All items in the fraud section were designated full-coverage, in consideration
of policy priorities. The remaining sections had different numbers of survey items,
which made the balance consideration important and influenced which blocks
could appear together. After a decision to group the cash and alternative payments
sections together, which made the groups more even in total count and the
combination calculations easier, six distinct blocks of survey items were
defined. While correlations between blocks were not of primary importance
to us, it seemed reasonable to try to pair the sections off in at least one of the
survey form versions in each stratum, subject to practical constraints.
First, we consider the problem of dividing the survey form such that each
respondent gets 1/3 of the survey items to be allocated. In our case, the total
number of survey items to be allocated is 353 less the survey items from bank
profile (6) and fraud (18), or 329. Each sampled bank would expect to receive just
under 110 of these survey items. Now, consider the case of defining just three
of these questionnaires. This is easily done by just placing two sections in each
version of the survey form. This satisfies the simplicity objective but is only able
to produce correlations between three pairs, which means each of the six sections
is paired off only once. Now, there are 15 distinct pairs of sections in a set
of 6. It is apparent, therefore, that 15 versions of the survey would be required
in order to achieve a complete set of pairs, and each pair would exist in only one
of the 15 versions.
Second, we consider the problem of dividing the survey form such that each
respondent gets 1/2 of the survey items to be allocated. Again, there are six
sections. The simplest solution, of course, is to define two versions, as noted
above, by putting three sections on each. This would allow each section to be
correlated with two other sections. An alternative would be to define four versions,
where each section would appear in two of the surveys. Since each survey would
have two other sections, each section could be correlated with only four others,
leaving three pairs unmatched. Going further and defining six versions with three
sections each would mean that each section would be paired with each of the other
sections at least once, but the pairs would not work out evenly; three pairs would
occur twice. For total pairwise balance, 15 versions of this set of survey form
would need to be fielded.
Third, we consider the problem of dividing the survey form such that each
respondent gets 2/3 of the survey items to be allocated. In this case, each bank
would expect to receive four out of six sections. This can be achieved by fielding
three surveys. Because each section would appear with three others, there would
be 18 section pairings across the survey versions; since only 15 distinct pairs
exist, 3 pairs would occur twice. As with the other fractions, pairing off the
sections evenly would require 15
different surveys.
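The pair-coverage arithmetic for the 2/3 scheme can be checked in a few lines. The particular grouping of sections into versions below is illustrative, not the design actually fielded.

```python
from itertools import combinations
from collections import Counter

# Six section blocks, a 2/3 scheme with three versions of four sections
# each (every section appears in exactly two versions). The assignment
# below is one illustrative possibility.

sections = ["Check", "ACH", "Wire", "Debit", "Credit", "Cash+Alt"]
versions = [
    sections[2:],                  # omits Check, ACH
    sections[:2] + sections[4:],   # omits Wire, Debit
    sections[:4],                  # omits Credit, Cash+Alt
]

# count how often each pair of sections shares a version
pair_counts = Counter(p for v in versions for p in combinations(sorted(v), 2))

print(len(list(combinations(sections, 2))))            # → 15 distinct pairs exist
print(sum(pair_counts.values()))                       # → 18 pairings across versions
print(sum(1 for c in pair_counts.values() if c == 2))  # → 3 pairs occur twice
```

This reproduces the counts in the text: 18 pairings across three four-section versions, covering the 15 distinct pairs with 3 duplicates.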
The complexity of managing multiple versions of the surveys led to a decision
that came close to choosing the minimal number of survey versions. We decided to
field only three versions each of the 1/3 and 2/3 fractional survey form schemes. In
the case of the 1/2 fractional survey form scheme, we chose to field four versions.
These choices meant that correlations among certain sections would not be possible
in some strata, which particularly affected the smallest bank strata with the 1/3
scheme, minimally for the banks with the 1/2 scheme, and not at all for the 2/3
and 1 schemes.
Decisions about which combinations of section pairings would be chosen were
based on an attempt to minimize the difference between the longest and shortest
surveys in each stratum, which was determined by making an exhaustive list of all
possible combinations and choosing our preference among them. The final choice
is shown in Table 4. The 2/3 scheme had a range of 17 survey items, the 1/2
scheme had a range of 10 survey items and the 1/3 scheme had a range of 15
survey items.
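The exhaustive balance search can be sketched as follows for a two-version 1/2 split, using the partial-coverage counts from Table 3 with Cash and Alt pymts grouped (22 + 16 = 38). The fielded design used more versions and additional constraints, so its ranges differ from this toy search.

```python
from itertools import combinations

# Choose the split of six section blocks into two survey versions that
# minimizes the range (longest form minus shortest form). Section
# lengths are the partial-coverage item counts from Table 3.

lengths = {"Check": 72, "ACH": 64, "Wire": 48, "Debit": 52,
           "Credit": 36, "Cash+Alt": 38}

def version_range(split):
    totals = [sum(lengths[s] for s in v) for v in split]
    return max(totals) - min(totals)

sections = sorted(lengths)
# enumerate every split into two triples and keep the most balanced one
best = min(
    ((tuple(trio), tuple(s for s in sections if s not in trio))
     for trio in combinations(sections, 3)),
    key=version_range,
)
print(best, version_range(best))  # smallest achievable range here is 2
```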
Table 5 shows how a total sample of 3,797 banks was allocated to strata
and survey form versions. Notice that, in a few cases, the number of sampled
banks within a stratum differs by 1, because of indivisibility.
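The even split with indivisibility can be sketched with a largest-remainder rule; this matches, for example, strata 23 and 22 in Table 5, though the order in which the leftover units are assigned to versions is arbitrary.

```python
# Split a stratum's sample of n banks as evenly as possible across k
# survey form versions; counts differ by at most one when k does not
# divide n. (Which versions get the extra unit is an assumption here.)

def allocate(n, k):
    base, rem = divmod(n, k)
    return [base + 1 if i < rem else base for i in range(k)]

print(allocate(98, 4))   # → [25, 25, 24, 24]  (cf. stratum 23, v5–v8)
print(allocate(123, 3))  # → [41, 41, 41]      (cf. stratum 22, v9–v11)
```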
Table 4. Survey Form Versions Used for Each Partial-Coverage Scheme.
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
Profile Profile Profile Profile Profile Profile Profile Profile Profile Profile Profile
Table 5. Original 2016 Sample Counts by Bank Type, Stratum and Survey Form
Version.
Type Stratum n v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
SAV 25 75 75 0 0 0 0 0 0 0 0 0 0
24 102 1 34 33 34 0 0 0 0 0 0 0
23 98 0 0 0 0 25 25 24 24 0 0 0
22 123 0 0 0 0 0 0 0 0 41 41 41
21 102 0 0 0 0 0 0 0 0 34 34 34
Subtotal 500 76 34 33 34 25 25 24 24 75 75 75
CUS 37 63 63 0 0 0 0 0 0 0 0 0 0
36 130 0 43 43 44 0 0 0 0 0 0 0
35 140 0 47 46 47 0 0 0 0 0 0 0
34 163 0 0 0 0 40 41 41 41 0 0 0
33 150 0 0 0 0 38 37 38 37 0 0 0
32 237 0 0 0 0 0 0 0 0 79 79 79
31 216 0 0 0 0 0 0 0 0 72 72 72
Subtotal 1, 099 63 90 89 91 78 78 79 78 151 151 151
Total 3, 797 439 284 282 286 248 247 248 247 505 507 507
5. RESULTS
Once the survey design is determined, including survey content, sample construc-
tion and so on, a substantial amount of additional work goes into implementing the
data collection. This includes the design and development of the web-based inter-
face for survey participants, detailed recruiting strategies and response follow-up
to encourage a more complete response and to validate the quality of the provided
information. These activities, which are at the core of the success or failure of the
survey, are documented elsewhere. While the effort level, the survey environment,
the ability to properly communicate and define survey content that matches well
with participant understanding, and luck are inseparable and, perhaps, far more
important than the adjustments to the survey we describe here, it is worthwhile to
explore the outcome of the data collection we designed.
Table 6 shows the count of unit-level responses collectively for each stratum.
The column labeled n provides the total returned survey count, which equals
the potential number of high-priority items returned. Overall, we obtained 1,383
responses, exceeding the total goal of 1,215 by nearly 14%. Comparing unit
response counts by type of bank: for CMB we obtained 869 responses, exceeding
the goal of 700 by 24%; for SAV we obtained 198 responses, falling short of the
goal of 212 by 7%; and for CUS we obtained 316, exceeding the goal of 303 by
4%.
In order to obtain these response counts, we expanded the sample size from the
approximately 2,700 sample size used in past studies to approximately 3,800, an
increase of over 40%. In Table 7, we show the resulting response rates by stratum
and survey form version. Overall, the response rate was 36%, down 8 percentage points
Table 6. Response Counts by Bank Type, Stratum and Survey Form Version and
Original Strata.
Type Stratum n v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
SAV 25 40 40 0 0 0 0 0 0 0 0 0 0
24 45 0 15 16 14 0 0 0 0 0 0 0
23 36 0 0 0 0 8 9 9 10 0 0 0
22 41 0 0 0 0 0 0 0 0 16 16 9
21 36 0 0 0 0 0 0 0 0 9 11 16
Subtotal 198 40 15 16 14 8 9 9 10 25 27 25
CUS 37 31 31 0 0 0 0 0 0 0 0 0 0
36 51 0 17 14 20 0 0 0 0 0 0 0
35 52 0 16 18 18 0 0 0 0 0 0 0
34 46 0 0 0 0 12 12 14 8 0 0 0
33 41 0 0 0 0 13 10 10 8 0 0 0
32 48 0 0 0 0 0 0 0 0 20 10 18
31 47 0 0 0 0 0 0 0 0 14 14 19
Subtotal 316 31 33 32 38 25 22 24 16 34 24 37
or a relative decline of slightly less than 19% from the 44% response rate achieved
in the previous survey. Note that the special attention group of banks, which
expanded to over 300 compared with 100 in the previous survey, achieved an overall response rate
of roughly 50%. This group did not receive shortened versions of the surveys.
Response rates for the top 100 were similar to response rates for that group in the
past.
The response outcome for 2016, including the item response rates, is shown in
Figure 3 and Table 8. The overall item response rate is 19%, down considerably
from the 2013 item response rate of 30%. The mean item count in 2016 was
717, compared with a mean item count of 805 in 2013. That drop, all else equal,
could be interpreted as a bad outcome on its own, but that would ignore other
important factors. In particular, considering the increase in the number of items,
Table 7. Unit Response Rate (%) by Bank Type, Stratum and Survey Form
Version.
Type Stratum n v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
CMB 17 49 49
16 35 30 38 39
15 46 43 54 40
14 36 36 35 31 44
13 40 29 47 35 50
12 38 37 35 41
11 35 33 39 33
Subtotal 39 49 37 46 39 32 41 32 47 35 37 37
SAV 25 53 53
24 44 44 48 41
23 37 32 36 38 42
22 33 39 39 22
21 35 26 32 47
Subtotal 40 53 44 48 41 32 36 38 42 33 36 33
CUS 37 49 49
36 39 40 33 45
35 37 34 39 38
34 28 30 29 34 20
33 27 34 27 26 22
32 20 25 13 23
31 22 19 19 26
Subtotal 29 49 37 36 42 32 28 30 21 23 16 25
Total 37 49 38 43 40 32 36 32 38 31 30 33
[Figure 3 here: bar chart of unit (Unit) and item (Item) response rates, in percent, by stratum 11–38.]
Fig. 3. Response Rates by Stratum for 2016 Calculated Two Different Ways.
Note: The unit response rate is the number of responding banks (those that returned a
survey form) in proportion to the number of sampled banks. The item response rate is the
total number of items returned in proportion to the total number of items presented to all
sampled banks.
the 2016 survey returned nearly 360 thousand total response items, compared with
166 thousand total response items in 2013. The number of separate responses in
2016 was 1,383 compared with 1,181 in 2013, an increase of 17%. Having more
responses means that more items can be imputed, taking advantage of correlations
with other response items.
The survey items in the 2016 survey were a superset of the survey items in the
2013 survey. With so many new, untested survey items in 2016, it is revealing
to examine the outcomes for the 205 payment volume items that were the same
across both surveys. Figure 4 demonstrates that the 2016 survey approach
improved the number of items supplied to the imputation and estimation routines
for many strata, particularly for smaller CMB strata. The shorter surveys, contain-
ing 1/3 or 1/2 of the partial-coverage survey items, combined with increases in the
sampling rate, tended to do well for CMB and SAV, while doing relatively better
only for the very smallest CUS strata. Conversely, for the strata with the largest banks,
Table 8. 2016 Population, Sample and Response Counts by Bank Type and Stratum.

Stratum    Boundary     N        n       n/N    Resp.   Unit    Item
CMB
  18       (top 50)     50       50      1.00   40      0.80    0.53
  17       10,900,000   264      264     1.00   110     0.42    0.22
  16       799,500      247      237     0.96   88      0.37    0.19
  15       388,000      337      237     0.70   105     0.44    0.23
  14       232,000      618      308     0.50   114     0.37    0.19
  13       139,754      872      289     0.33   118     0.41    0.21
  12       83,909       1,190    444     0.37   167     0.38    0.18
  11       41,980       1,382    356     0.26   128     0.36    0.19
  Subtotal              4,960    2,185   0.44   870     0.40    0.21
SVG
  26       (top 25)     25       24      0.96   18      0.75    0.43
  25       1,650,000    48       48      1.00   21      0.44    0.22
  24       497,000      102      102     1.00   46      0.45    0.23
  23       195,000      132      104     0.79   42      0.40    0.23
  22       100,500      155      116     0.75   36      0.31    0.17
  21       46,300       292      96      0.33   34      0.35    0.22
  Subtotal              754      490     0.65   197     0.40    0.22
CUS
  38       (top 25)     25       25      1.00   14      0.56    0.34
  37       730,000      47       46      0.98   22      0.48    0.24
  36       365,000      137      126     0.92   47      0.37    0.19
  35       185,000      174      143     0.82   52      0.36    0.18
  34       105,500      240      147     0.61   34      0.23    0.09
  33       58,000       399      167     0.42   50      0.30    0.13
  32       26,680       690      201     0.29   47      0.23    0.11
  31       11,190       3,144    242     0.08   50      0.21    0.09
  Subtotal              4,856    1,097   0.23   316     0.29    0.14
Total                   10,570   3,772   0.36   1,383   0.37    0.19

Note: Boundary is the lower stratum boundary of the size variable; N is the population count, n the sample count, Resp. the number of responding banks, and Unit and Item the unit and item response rates.
where the survey form length was reduced less, or not at all, and where increases
in already high sampling rates were unavailable, the outcome was very different.
Except for the very large institutions, response levels dropped, indicating that the
survey length was an important factor.
[Figure 4 here: bar chart of mean item responses by stratum, 2013 vs. 2016.]
Fig. 4. Mean Number of Item Responses, 205 Comparable Payment Volume Items in the
Surveys, by Stratum and Survey Year.
ACKNOWLEDGMENT
Opinions are the authors’ alone and do not necessarily reflect those of the Board
of Governors, the Federal Reserve System, or its staff. We acknowledge David
Jacho-Chávez, Editor of the Advances in Econometrics, two anonymous ref-
erees, and participants at the Bank of Canada conference associated with this
volume for comments and suggestions. We thank Lauren Clark, Daniel Nikolic,
Justin Skillman, and Alexander Spitz for assistance during different stages of
the research. We also thank Michael Argento and Thomas Welander of the
Global Concepts Office of McKinsey and Company for support during survey
design and data collection. Further information about the surveys is available
at http://www.federalreserve.gov/paymentsystems/fr-payments-study.htm. Any
errors or omissions are the responsibility of the authors.
NOTES
1. For reports and data, visit the payment study website at www.federalreserve.gov/
paymentsystems/fr-payments-study.htm.
2. The 2013 survey contained roughly twice the number of conceptually different survey
items compared with the 2010 survey. The total number of payment volume items between
2010 and 2013 grew only slightly, however, as we reduced the number of distinct items
by half by changing the survey from covering 2 months (March and April) with separately
reported volumes from each month, to 1 month (March).
3. Sampled banks were asked either to retrieve annual 2015 data from their internal
records (preferred) or, in a limited number of cases, if unavailable, to estimate their annual
2015 volumes using data from a different reference period.
REFERENCES
Cochran, William G. (1953). Sampling Techniques. 2nd ed. New York, NY: John Wiley & Sons.
Dalenius, Tore and Joseph L. Hodges Jr. (1959). “Minimum Variance Stratification”. In: American
Statistical Association Journal 54.285, pp. 88–101.
Gerdes, Geoffrey R. and May X. Liu (2010). “Estimating Technology Adoption and Aggregate Volumes
from U.S. Payments Surveys in the Presence of Complex Item Nonresponse”. In Proceed-
ings of the Survey Research Methods Section. Joint Statistical Meetings. American Statistical
Association. URL: https://ww2.amstat.org/sections/srms/Proceedings/allyearsf.html.
Graham, John W., Scott M. Hofer, and David P. MacKinnon (1996). “Maximizing the Usefulness of
Data Obtained with Planned Missing Value Patterns: An Application of Maximum Likelihood
Procedures”. In: Multivariate Behavioral Research 31.2, pp. 197–218.
Hansen, Morris H., William N. Hurwitz, and William G. Madow (1953). Sample Survey Methods and
Theory. New York, NY: John Wiley & Sons.
Hidiroglou, Michele A. (1986). “The Construction of Self-Representing Stratum of Large Units in
Survey Design”. In: The American Statistician 40.1, pp. 27–31.
Lineback, Joanna Fane and Katherine J. Thompson (2010). “Conducting Nonresponse Bias Analysis
for Business Surveys”. In: Proceedings of the Government Statistics Section. Joint Statistical
Meetings. American Statistical Association.
Little, Roderick J. A. and Donald B. Rubin (2002). Statistical Analysis with Missing Data. 2nd ed.
New York, NY: John Wiley & Sons.
Liu, May X., Geoffrey R. Gerdes, and Darrel W. Parke (2009). “Sample Design and Estimation
of Volumes and Trends in the Use of Paper Checks and Electronic Payment Methods in
the United States”. In: Proceedings of the Survey Research Methods Section. Joint Statisti-
cal Meetings. American Statistical Association. URL: https://ww2.amstat.org/sections/srms/
Proceedings/allyearsf.html.
Lord, Frederick M. (1962). “Estimating Norms by Item-Sampling”. In: Educational and Psychological
Measurement 22.2, pp. 259–.
Neyman, Jerzy (1934). “On the Two Different Aspects of the Representative Method: The Method of
Stratified Sampling and the Method of Purposeful Selection”. In: Journal of the Royal Statistical
Society 97.4, pp. 558–625.
Raghunathan, Trivellore E. and James E. Grizzle (1995). “A Split Questionnaire Survey Design”. In:
Journal of the American Statistical Association 90.429, pp. 54–63.
Snijkers, Ger et al. (2013). Designing and Conducting Business Surveys. Hoboken, NJ: John
Wiley & Sons.
DOES SELECTIVE CRIME REPORTING
INFLUENCE OUR ABILITY TO
DETECT RACIAL DISCRIMINATION
IN THE NYPD’S STOP-AND-FRISK
PROGRAM?
ABSTRACT
Prior analyses of racial bias in New York City’s Stop-and-Frisk pro-
gram implicitly assumed that potential bias of police officers did not vary
by crime type and that their decision of which type of crime to report as
the basis for the stop did not exhibit any bias. In this paper, we first extend
the hit rates model to consider crime type heterogeneity in racial bias and
police officer decisions of reported crime type. Second, we reevaluate the
program while accounting for heterogeneity in bias along crime types and
for the sample selection which may arise from conditioning on crime type. We
present evidence that differences in biases across crime types are substantial
and specification tests support incorporating corrections for selective crime
260 STEVEN F. LEHRER AND LOUIS-PIERRE LEPAGE
reporting. However, the main findings on racial bias do not differ sharply
once accounting for this choice-based selection.
Keywords: Misclassification; selective reporting; racial discrimination; hit
rates test; selection correction; criminal activity
1. INTRODUCTION
In many administrative and survey data sets, researchers must confront the chal-
lenge that non-sampling errors due to deliberate bias in providing a response may
distort the analyses. Much research has investigated this issue in survey research
(see Bound, Brown, & Mathiowetz, 2001 for a survey of the literature), and the
presence of measurement error has been shown to potentially cause biased and
inconsistent parameter estimates, thereby leading to erroneous conclusions to var-
ious degrees in statistical and economic analyses. Different methods are needed
to treat measurement error in survey data since errors can arise from different
sources. For example, they might arise from coding errors by surveyors, or survey
participants may choose to not provide truthful responses.
A specific form of measurement error arises with qualitative data resulting
in misclassification. Misclassification occurs when observations are placed erro-
neously in a different group or category. Within administrative data sources, this
erroneous information is provided not from survey responses, but rather in how the
records are generated and maintained. Just as certain sampling issues may influence
those being surveyed, similar issues may affect those who create and maintain
administrative records. For example, individuals preparing entries in administrative records
may rely on rules of thumb in a bid to minimize the burden of completing the
underlying forms accurately.1 These errors in classification not only affect sum-
mary statistics on sample proportions but may influence analyses that investigate
heterogeneous behavioral relationships across these groups or categories.2 In many
settings, economic theory would suggest that we should expect heterogeneity
across these groups or categories3 which may be policy-relevant and could be
completely masked when investigating data on the full sample.
We illustrate the importance of considering the consequences of misclassifica-
tion that can arise from using rules of thumb to determine categories by reevaluating
if there is racial discrimination in New York City’s infamous Stop-and-Frisk pro-
gram. Under this program, officers can stop and frisk anyone they believe has
committed, is committing or might commit a crime. These policing practices often
disproportionately target minorities, generating significant controversy. Details on
each of over five million stops occurring in New York City between 2003 and 2014
have been collected and are frequently analyzed by researchers. However, these
Selective Crime Reporting 261
records may also feature misclassification error in the type of crime reported as the
basis for the stop, a source of bias that has been implicitly ignored in prior analyses
of this large administrative data set.
Advocacy groups have long criticized the NYPD’s Stop-and-Frisk program4
and have even suggested that it has effectively turned some neighborhoods – usu-
ally poor and nonwhite ones – into ‘occupied territories’ rife with unnecessary,
tense interactions between neighborhood residents and the police. More generally,
across all neighborhoods, surveys document that the American law enforcement
community has come under increasing scrutiny and criticism and that levels of
trust in the police have plummeted.5 Proponents of the NYPD Stop-and-
Frisk program such as former NYPD commissioner Ray Kelly claim it has saved
over 7,000 lives and played a key role in the city’s decrease in crime over the past
years. Opponents of the program claim it constitutes a violation of freedom and
provides a means for officers to engage in racial profiling.
These claims are based in part on unconditional summary statistics such as the
fact that the overwhelming majority of those targeted by the program (consistently
around 85% of all stops in each year) are minorities. Advocacy groups have used unconditional summary statistics to suggest evidence of racial discrimination at almost every stage of the criminal justice system.6
Testing for discrimination in the NYPD Stop-and-Frisk program can be chal-
lenging since an analysis of disparate impact alone does not constitute evidence
of discrimination. Knowles, Persico, and Todd (2001) (henceforth KPT) propose
a hit rates test that relies on the assumption that police officers try to maximize
successful searches. This test compares the productivity of stops across different
racial groups. Stops that result from discrimination alone rather than from reasonable suspicion should be less likely to lead to arrests or summonses, lowering the likelihood of those outcomes for that racial minority.7 This test has been
applied in prior research evaluating the NYPD Stop-and-Frisk program. Coviello
and Persico (2015) find no evidence of discrimination against African-Americans
in the aggregate sample of all recorded crime types in the whole city over 10 years
but, along with Goel, Rao, and Shroff (2016), do find evidence of discrimination
against African-Americans when restricting the sample to only stops relating to
the possession of a concealed weapon.8
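The unconditional version of such a comparison can be sketched as a two-sample test on arrest rates per stop. The counts below are hypothetical and for illustration only; the paper's tests additionally condition on precinct and time fixed effects.

```python
import numpy as np
from scipy import stats

def hit_rate_gap(arrests_a, stops_a, arrests_b, stops_b):
    """Difference in 'hit rates' (arrests per stop) between two groups,
    with a pooled two-sample z-test of equality. This is only the raw,
    unconditional comparison underlying the KPT-style test."""
    p_a, p_b = arrests_a / stops_a, arrests_b / stops_b
    pool = (arrests_a + arrests_b) / (stops_a + stops_b)
    se = np.sqrt(pool * (1 - pool) * (1 / stops_a + 1 / stops_b))
    z = (p_a - p_b) / se
    return p_a - p_b, 2 * stats.norm.sf(abs(z))  # gap, two-sided p-value

# Hypothetical counts, not figures from the paper:
gap, pval = hit_rate_gap(3_000, 100_000, 2_500, 50_000)
```

A negative gap with a small p-value would indicate that stops of group A are systematically less productive, which the hit rates logic interprets as evidence consistent with bias.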
We add to the existing evidence on the importance of accounting for potential
heterogeneity in bias across different types of crimes by showing its potential
importance in theory and confirming it empirically. African-Americans constitute
the overwhelming majority of suspects arrested for crimes related to drugs or
possession of a weapon, two types of crime related to the long-lasting nationwide
War on Drugs. Similarly to the NYPD’s Stop-and-Frisk program, the War on
Drugs has also been suggested to be discriminatory given disparate impacts for
African-Americans,9 motivating our particular focus on these crimes.
262 STEVEN F. LEHRER AND LOUIS-PIERRE LEPAGE
Prior research using Stop-and-Frisk data has also treated the reported crime
classifications as being exogenous. However, these classifications are selected by
individual police officers at the time when they complete the mandated forms
indicating why they stopped a given suspect. If individual police officers perceive a specific race as the perpetrators of certain types of crime,10 these
classifications might be subject to unconscious bias. In other words, conditioning
on those reported crime categories leads to endogenous stratification, which is well
known to lead to biased estimates.
Prior analyses of the Stop-and-Frisk program did not consider the decision we focus on, which is taken by the officer who, at the time of choosing whom to stop, must also define the basis for the stop. While we illustrate the issue of analyzing effects
on subgroups that may be misclassified with Stop-and-Frisk data, we should stress
that these issues are becoming increasingly prevalent in survey data.11 Further,
misclassification has been shown to have consequences for econometric estimates
(Bollinger & David, 1997, 2001), sometimes changing the overall conclusions.
Generally, the consequence of misclassification depends on how it occurred. In
this setting, it is not simply a measurement error problem but reflects a choice-
based sample. If misclassification occurs to the same degree across racial groups
and policing outcomes, then it is random and no bias should arise. Conversely,
if the degree of subgroup misclassification differs between racial groups, then misclassification is nonrandom. Since much research analyzes impacts on
subgroups, we argue that there is likely going to be an increase in misclassification
of race and ethnicity variables as time progresses given changes in the mean-
ing of these variables to survey respondents. As such, the methods we describe
could be applied to various other contexts with both survey and administrative
data sets.
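The distinction between random and nonrandom misclassification can be illustrated with a stylized simulation (all parameters below are invented for illustration): when the crime label is wrong at the same rate for both races and independently of outcomes, the subgroup arrest gap is recovered; when officers disproportionately apply the label to one race, the estimated gap is distorted.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 400_000
black = rng.random(n) < 0.5
drugs = rng.random(n) < 0.4                     # true basis of the stop
p_arrest = np.where(black & drugs, 0.05, 0.08)  # true gap for drug stops: -3pp
arrest = rng.random(n) < p_arrest

def gap(label):
    """Black-white arrest-rate difference within the labelled subgroup."""
    return arrest[label & black].mean() - arrest[label & ~black].mean()

# (a) Random misclassification: 20% of true drug stops lose the label,
#     independently of race and outcome; the -3pp gap is still recovered.
gap_random = gap(drugs & (rng.random(n) >= 0.2))

# (b) Differential misclassification: 20% of black non-drug stops are
#     labelled as drug stops anyway; the measured gap attenuates toward zero.
gap_differential = gap(drugs | (black & ~drugs & (rng.random(n) < 0.2)))
```

In this stylized case the differential mislabelling pulls the measured gap from roughly −3.0 toward −2.3 percentage points, which is the direction of bias the selection correction described later is designed to undo.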
In this paper, we first extend the model underlying the hit rates test to account
for the potential that police officer bias depends on both the type of crime and
the race of the suspect, thereby influencing the type of crime reported as the basis
for the stop. This motivates investigating racial bias across different crime types, which can reconcile the evidence in Coviello and Persico (2015) of no bias in the full sample of stops. Further, it also motivates the need to correct the estimates
to account for potential bias in the reporting of crime categories. Second, we
reevaluate whether there is evidence of racial discrimination in New York City’s
Stop-and-Frisk program by modeling the selection process of crime categories as
a polychotomous choice. In effect, we implement a sample-selection correction to
explicitly account for the relative impact that being African-American may have
on the difference in the likelihood of being stopped for certain types of crime when
conducting the hit rates tests. To the best of our knowledge, this study presents the
first use of a polychotomous selection model to estimate whether there is evidence
of racial discrimination in the economics of crime literature.12
Our main empirical results indicate that with or without correcting for selective
crime categorization, there is robust and conclusive evidence of discrimina-
tion. After accounting for both crime-dependent bias and selective classification,
African-Americans on average are 2.874 percentage points less likely than whites to be arrested when stopped for crimes classified under the US War on Drugs, a relative difference of approximately 50%. Contrasting estimates from hit rates tests that account for and that ignore selective reporting of crime categories provides evidence that the correction can be important both economically and statistically, and Hausman tests provide further evidence that this correction is statistically important.
The remainder of the paper is organized as follows. In the next section,
we discuss how the Stop-and-Frisk program is implemented in New York and
briefly describe the data set. To motivate our empirical tests, Section 3 presents a
theoretical model that extends KPT to consider crime-dependent payoffs to stop-
ping suspects which can differ by racial group. Section 4 describes the two-step
empirical strategy and discusses identification of the selection correction terms.
The empirical results are presented and discussed in Section 5. A final section
summarizes the main findings and concludes.
(2) a person stopped is frisked or frisked and searched, (3) a person is arrested
or (4) a person stopped refuses to identify themselves. This potential sampling
issue is noted in Coviello and Persico (2015) who discuss whether one should
restrict the analysis to only stops that legally must be recorded. They conclude
that imposing this restriction would require the implausible assumption that at the
time of choosing whom to stop, the officer could distinguish whether the stop will
develop into one that has to be recorded or not. The standard UF-250 form requires
officers to document, among other things, the time, date, place and precinct where
the stop occurred; the name, address, age, gender, race and physical description of
the person stopped; factors that caused the officer to reasonably suspect the person
stopped; the suspected crime that gave rise to the stop; the duration of the stop;
whether the person stopped was frisked, searched or arrested; and the name, shield
number and command of the officer who performed the stop. Each police officer
must submit all completed UF-250 forms to the desk officer in the precinct where
the stop occurred so the stated factual basis of the “stop” for legal sufficiency can
be reviewed. The data are then analyzed for quality assurance and a restricted
version that suppresses the shield number and other identifying information is
made publicly available.
The primary data used in this paper comprises all 5,028,789 recorded stops
in the Stop-and-Frisk program between 2003 and 2014. For each stop, we are
provided with all of the characteristics and outcomes listed on the UF-250 forms
that are permissible in the restricted data set. In total, over 40% of observed stops
in our sample did not have to be reported by law.15 We first restrict our sample
to stops involving only Caucasians or African-Americans and for which crime
categories are recorded on the UF-250 form, yielding 2,649,300 observations.
Crime classification was not reported for 2003 and inconsistently reported in both
2004 and 2005, reducing the number of observations for these earlier years.16
Summary statistics are presented in Table 1. Approximately 84% of this sample
is African-American and the vast majority of suspects are male. Among the 13
categories of crimes reported in Table 1, which account for over 95% of recorded
stops, possession of a weapon (27.67%), robbery (17.30%), trespassing (11.84%)
and drugs (11.11%) are the most commonly listed as the basis for a stop. As shown
in the table, we pool these crime categories into four main classifications.17 Drugs
and weapon crimes are pooled together since they represent felonies linked to
the US War on Drugs and account for nearly one-third of the stops. Similarly,
we pool other major crimes by whether they are either economic in motivation
(i.e., trespassing, burglary, grand larceny and grand larceny auto) or violent (i.e.,
assault, robbery, murder and rape). The final category consists of less severe
offenses which include petit larceny, graffiti and criminal misconduct. Crimes of
an economic nature account for over half of the stops, while there are fewer stops
associated with either violent or minor crimes.18
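The pooling described above amounts to a many-to-one mapping from the 13 reported categories to four groups; the names below are our shorthand labels, not the codes on the UF-250 form.

```python
# Pooling the 13 reported crime categories into the four classifications
# used in the analysis, following the grouping described in the text.
POOLED = {
    "war_on_drugs": ["drugs", "weapon"],
    "other_economic": ["trespassing", "burglary",
                       "grand_larceny", "grand_larceny_auto"],
    "violent": ["assault", "robbery", "murder", "rape"],
    "minor": ["petit_larceny", "graffiti", "criminal_misconduct"],
}
crime_to_group = {crime: group
                  for group, crimes in POOLED.items() for crime in crimes}
```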
Table 1. Summary Statistics.
Outcomes
Arrest rate×100 5.49 (22.78)
Summons×100 6.14 (24.01)
Demographics
Suspect is black 84.55 (36.15)
Suspect is male 92.84 (25.77)
Age of suspect 28.34 (12.39)
Suspect is youth 54.13 (49.83)
Suspect is tall 23.61 (42.47)
Suspect has heavy build 8.52 (27.91)
Stops
Number made at night 60.00 (48.99)
Mandated stops 58.52 (49.27)
Crimes
War on drugs 38.88 (48.75)
Drugs 11.21 (31.55)
Weapon 27.66 (44.73)
Other economic crimes 35.33 (47.80)
Trespassing 12.11 (32.63)
Burglary 9.25 (28.98)
Grand larceny 4.49 (20.70)
Grand larceny auto 9.48 (29.30)
Violent crimes 20.76 (40.57)
Assault 3.42 (18.17)
Robbery 17.21 (37.75)
Murder 0.05 (2.20)
Rape 0.10 (3.19)
Minor offenses 5.02 (21.83)
Petit larceny 2.62 (15.98)
Graffiti 1.16 (10.70)
Criminal misconduct 1.24 (11.06)
Observations 2,399,717
Standard deviations in parentheses. Youth is the fraction of suspects aged below 25. Tall is the fraction
of suspects 6 ft or taller. Heavy build is the fraction of suspects classified as heavy build by the NYPD.
Night refers to the fraction of stops performed between 7 p.m. and 6 a.m. Other economic crimes refers to
nonviolent crimes including trespassing, burglary, grand larceny and grand larceny auto. Violent refers
to violent crimes including rape, murder and assault. Minor refers to minor crimes and includes petit
larceny, graffiti and criminal misconduct.
Standard deviations in parentheses. The last column presents the p-values from tests where the null
hypothesis is that of equality in the probability of arrest between African-Americans and whites.
burglary and criminal misconduct. In contrast, over 92% of all stops categorized as
weapons, trespassing or murder involve an African-American suspect. The remain-
ing columns provide summary information on the percentage of stops that resulted
in an arrest by race and a test of whether there is a significant difference in these
proportions between races. In nearly every crime category, African-Americans
have on average lower rates of arrest relative to white suspects. Results from tests
of equality in arrest rates between groups are presented in the last column and
indicate that these differences are statistically significant at the 5% level in every
category with the exception of stops for murder, rape and robbery. Most striking is
that the rate of arrest for weapon possession stops is 75% higher for white suspects
relative to African-Americans, despite the fact that almost 19 of every 20 stops for
this category involve a black suspect.
3. THEORY
We begin by extending the model that underlies the hit rates test first developed
in KPT to directly consider the Supreme Court's holding in Terry v. Ohio that
the scope of any resulting police search has to be narrowly tailored to match the
original reason for the stop.19 Since the crime type must be reported and police
officers may associate certain racial groups with particular crime types, they may
be more likely to be biased for stops related to those classes of infractions.20 We
incorporate this feature within an economic model that describes a pedestrian’s
and a police officer’s behavior.
3.1. Pedestrians
$$V_i^{r,o}(\omega) \equiv \int V_i^{r,o}(\phi_1, \ldots, \phi_n, c_1, \ldots, c_n, \omega)\, dF^{r,o}\!\left(\phi_1, \ldots, \phi_n \,\Big|\, \phi_i = \max_{1 \le k \le n} \phi_k\right)$$
where $V_i^{r,o}(\omega)$ is the crime-dependent crime rate for group (r, o), that is, the objective probability that an individual of group (r, o) is guilty of crime type i.
There is a mass M of police officers who, after having been exogenously allocated to a given precinct, each receive a type p ∼ U[0, 1].21 An officer of type p has a search capacity $E_p$ and receives payoff $\pi_p^{r,i}$ from stopping a suspect of race r for crime i. We denote by D(p, i) the additional benefit that a racially biased officer of type p gains from stopping a suspect of race A for crime type i.22
Assumption 1. $\pi_p^{W,i} = \pi_p^W = 1$ (normalized) and $\pi_p^{A,i} = \pi_p^W + D(p,i)$ for $i = 1, \ldots, n$.
Let $E_p(r,o,i)$ be the number of stops for group (r, o) and crime i from an officer of type p, and let $E(r,o,i) = M \int_0^1 E_p(r,o,i)\,dp$ be the total number of stops for group (r, o) and crime i. Defining
$$W(p,r,o,i) = P(\text{Guilty of crime } i \mid r, o) = V_i^{r,o}(E(r,o,i)),$$
an officer chooses to stop an agent from group (r, o) if $\pi_p^{r,i}\, W(p,r,o,i) - s_p \ge 0$, where $s_p$ is the cost of performing a stop for the officer.
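The stopping condition makes the role of the bias term concrete: an officer stops whenever the expected payoff clears the stop cost, so a biased officer tolerates a lower guilt probability for race-A suspects. A small numerical sketch (the values D = 0.25 and s_p = 0.5 are arbitrary illustrations, not model calibrations):

```python
# Officer p stops a suspect of race r for crime i iff pi^{r,i} * W - s_p >= 0,
# so a biased officer (D(p, i) > 0) accepts a lower guilt probability W for
# race-A suspects, which in equilibrium lowers their measured hit rate.
def stop_decision(pi, W, s_p):
    return pi * W - s_p >= 0

s_p = 0.5                          # cost of a stop (illustrative)
threshold_white = s_p / 1.0        # pi^W normalized to 1
threshold_A = s_p / (1.0 + 0.25)   # pi^A = 1 + D(p, i), with D = 0.25 here

guilt_prob = 0.45                  # between the two thresholds
white_stopped = stop_decision(1.0, guilt_prob, s_p)
a_stopped = stop_decision(1.25, guilt_prob, s_p)
```

At a guilt probability of 0.45 the unbiased payoff does not justify a stop, but the biased payoff does; marginal stops of race-A suspects are therefore less productive, which is exactly what the hit rates test detects.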
4. EMPIRICAL STRATEGY
The traditional hit rates test examines whether there is a racial difference in the
percentage of stops that result in an arrest. When running regressions on subgroups defined by crime classification, this requires that the subgroups be exogenous, as assumed in both Coviello and Persico (2015) and Goel et al. (2016). This involves estimating an equation for whether a stop s, involving a suspect stopped for crime type c at time t in precinct p, resulted in an arrest. Formally,
$$\ln L(\gamma) = \sum_{s=0}^{S} \sum_{c=0}^{C} 1\{y_s = c\}\, \ln \frac{\exp(x_s \gamma_c)}{\sum_{k=0}^{C} \exp(x_s \gamma_k)} \qquad (2)$$
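Eq. (2) is the standard multinomial-logit log-likelihood and can be evaluated directly. A minimal NumPy sketch with hypothetical data (the variable names are ours, not the paper's):

```python
import numpy as np

def mnl_loglik(gamma, X, y):
    """Multinomial-logit log-likelihood of Eq. (2).
    gamma: (C+1, K) category coefficients, X: (S, K) stop characteristics,
    y: (S,) reported crime category in {0, ..., C}."""
    scores = X @ gamma.T                         # (S, C+1) linear indices
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_p[np.arange(len(y)), y].sum()

# Hypothetical example: 4 pooled crime categories, 3 covariates.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 4, size=1000)
ll_null = mnl_loglik(np.zeros((4, 3)), X, y)  # null model: equal shares
```

With all coefficients at zero each category gets probability 1/4, so the null log-likelihood is exactly −S ln 4, a useful sanity check before maximizing over γ.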
month prior to the current stop. With increased length, the exogeneity assumption
requires that, while the recent history of policing in a precinct informs the decision
to stop individuals for certain types of crime, it is not correlated to omitted factors
that determine arrests. This alternative assumption is presumably less restrictive
for longer lagged periods, but the assumption itself remains untestable.
Using estimates of Eq. (2), Bourguignon et al. (2007) provide formulas to construct C − 1 selection correction terms that are captured in the vector $\lambda_c(\cdot)$. Adding this vector of selectivity correction terms to Eq. (3) generates our estimating equation
Using weighted least squares for each crime category allows us to obtain unbi-
ased and consistent estimation of the coefficients.26 A nice feature of this estimator
is that it has been shown to perform well in correcting selection bias even in set-
tings where the restrictive independence of irrelevant alternatives assumption of
the multinomial logit model is violated. Last, to conduct inference, we use boot-
strapped standard errors to explicitly account for the two-step estimation procedure.
We use the strategy proposed in Bourguignon et al. (2007) to estimate the two-step model since it not only relaxes the restriction in Dubin and McFadden (1984) that all correlation coefficients add up to zero, but they also present Monte Carlo evidence indicating superior performance of their estimator relative to Dubin and McFadden (1984), Lee (1983) and Dahl (2002). For completeness, Bourguignon et al. (2007) differ from Lee (1983) in that the latter estimates a single selectivity effect for all choices as opposed to the C − 1 selection terms for the C choices we consider. The Bourguignon et al. (2007) approach is less restrictive, since Lee (1983) requires equal covariances between the unobservables in the arrest rate equation and the unobservables which determine the crime categories, but it comes with the computational cost of estimating additional parameters. Dahl (2002) differs from Bourguignon et al. (2007) in the functional form used to construct the selectivity correction terms.
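As a concrete illustration of the second step, the sketch below builds Dubin and McFadden (1984)-type correction terms from first-stage predicted probabilities. We show this variant because its functional form is compact; the paper uses the Bourguignon et al. (2007) formulas, which differ, so this is a sketch of the general mechanics rather than the paper's exact estimator.

```python
import numpy as np

def dmf_correction_terms(P, chosen):
    """Dubin-McFadden-type selection-correction regressors for the outcome
    equation of category `chosen`: for each alternative j != chosen, the term
    P_j * ln(P_j) / (1 - P_j) + ln(P_chosen), built from first-stage
    multinomial-logit probabilities P of shape (S, C). Scale factors are
    absorbed into the second-stage coefficients on these regressors."""
    ln_pc = np.log(P[:, chosen])
    cols = [P[:, j] * np.log(P[:, j]) / (1.0 - P[:, j]) + ln_pc
            for j in range(P.shape[1]) if j != chosen]
    return np.column_stack(cols)  # (S, C-1) extra regressors for Eq. (3)

# Sanity check: with equal predicted probabilities across 4 categories,
# every term collapses to (4/3) * ln(0.25).
P = np.full((5, 4), 0.25)
lam = dmf_correction_terms(P, chosen=0)
```

In the paper's procedure, analogues of these C − 1 columns are appended to the arrest-rate regression for each crime category, and the standard errors are then bootstrapped to account for the generated regressors.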
Examining estimates of β3 from Eq. (3) also provides insight since a posi-
tive (negative) coefficient estimate indicates higher (lower) arrest rates for those
stopped in this classification relative to a randomly chosen suspect who was stopped.
If at least one of the C − 1 estimates of β3 enters in a statistically significant manner,
then there is suggestive evidence of selection. To formally examine if selectiv-
ity correction leads to statistically significantly different estimates, we conduct
Hausman tests.27
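For a single coefficient, the Hausman comparison reduces to a scalar chi-square(1) statistic. A sketch under the textbook assumption that the uncorrected estimator is efficient under the null; all numbers are invented for illustration:

```python
from scipy import stats

def hausman_scalar(b_base, se_base, b_corrected, se_corrected):
    """One-parameter Hausman-style test that a coefficient (here, on black)
    is the same across the uncorrected model (efficient under H0) and the
    selection-corrected model (consistent under the alternative).
    Returns the chi-square(1) statistic and its p-value."""
    stat = (b_corrected - b_base) ** 2 / (se_corrected ** 2 - se_base ** 2)
    return stat, stats.chi2.sf(stat, df=1)

# Invented numbers for illustration only:
stat, pval = hausman_scalar(b_base=-2.5, se_base=0.5,
                            b_corrected=-2.9, se_corrected=0.7)
```

A small p-value would indicate that the corrected and uncorrected coefficients differ by more than sampling noise alone can explain, i.e., that the selection correction matters.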
The consequence of misclassification for the analysis depends on how it
occurred. The direction of bias depends on the correlation between unobservables
in the outcome and selection equations. If police officers are more likely to
misclassify suspects who are at high perceived risk of having committed a crime, we would expect that ignoring the selection correction would underestimate the effect
of racial differences. After all, when a decision to make a stop occurs rapidly,
police officers are more likely to use any implicit bias based on the suspect’s
characteristics when choosing crime classifications. In other words, where police
officers are less certain of the exact crime at the time they make the stop and they
are relying on criminal offender profiling to select the crime category, we would
underestimate the effect of racial discrimination.28
5. RESULTS
We first present estimates of the hit rates test across different types of crime in
Table 3. The columns of the table differ based on the level of fixed effects that are
included. An important point from Coviello and Persico (2015) is that, as shown
in the first row of Table 3, once accounting for time and precinct fixed effects in
columns 6 and 7, there is no evidence of discrimination when pooling all crimes
together. Further, the inclusion of precinct fixed effects leads to an approximate
50% reduction in the magnitude of the effect of race on arrest rates. Rows 2–4 present results by subgroups of different crime classifications as previously defined and show that the pooled result conflates vastly different effects into one, creating the false appearance of no arrest differential. The results indicate that African-
Americans are significantly less likely to be arrested when stopped for crimes
related to the War on Drugs but significantly more likely to be arrested when
stopped for other economic crimes with little estimated differential for violent or
minor crimes. This is consistent with the conjecture that potential police officer
bias differs by crime type which leads to inefficient policing. Further, we note that
adding extra pedestrian and stop characteristics to Eq. (1) as shown in column 7 has little effect on the results.
Table A2 and Figure A1 in the appendix also show that the estimated arrest dif-
ferential for War on Drugs crimes is present in every borough in the city (though
there is important heterogeneity) and has increased consistently over the period
considered in our sample. Table A1 shows that, in the case of summonses, the outcome differentials by type of crime are more reflective of the aggregate regression, as it is estimated that African-Americans are less likely to be issued a summons when stopped for any crime group.
Next, we investigate whether these results may be partly influenced by the
endogenous decision of police officers of which type of crime to report. To examine
if there is selective classification of crime type in the NYPD Stop-and-Frisk pro-
gram, Table 4 presents estimates of β1 from Eqs. (1) and (3) for each crime
category defined in Table 1. These specifications include additional pedestrian and
stop characteristics which may also define the selective classification of stops. As
Table 3. Estimates of the Hit Rates Test on Arrests, Overall and by Crime Type.
Model             OLS        OLS        OLS        FE         FE         FE         FE
                  (1)        (2)        (3)        (4)        (5)        (6)        (7)
Black
  All crimes      −0.386∗∗∗  −0.386∗∗∗  −0.386     0.164∗∗∗   0.142∗∗∗   0.142      0.221
                  (0.0416)   (0.0416)   (0.485)    (0.0540)   (0.0540)   (0.211)    (0.204)
  War on drugs    −3.726∗∗∗  −3.713∗∗∗  −3.713∗∗∗  −2.584∗∗∗  −2.597∗∗∗  −2.597∗∗∗  −2.463∗∗∗
                  (0.102)    (0.102)    (0.635)    (0.119)    (0.119)    (0.498)    (0.497)
  Other economic  1.696∗∗∗   1.704∗∗∗   1.704∗∗∗   1.794∗∗∗   1.771∗∗∗   1.771∗∗    1.821∗∗∗
                  (0.0521)   (0.0522)   (0.495)    (0.0715)   (0.0715)   (0.219)    (0.222)
  Violent         −3.726∗∗∗  −1.689∗∗∗  −1.689∗∗∗  −0.00628   −0.0483    −0.0483    −0.0267
                  (0.102)    (0.106)    (0.578)    (0.130)    (0.130)    (0.273)    (0.267)
  Minor           0.265      0.264      0.264      0.342∗     0.353∗     0.353      0.0595
                  (0.163)    (0.163)    (0.966)    (0.207)    (0.207)    (0.494)    (0.511)
Clustered SE      No         No         Yes        No         No         Yes        Yes
Time FE           No         Yes        Yes        No         Yes        Yes        Yes
Precinct FE       No         No         No         Yes        Yes        Yes        Yes
Extra controls    No         No         No         No         No         No         Yes
Standard errors are presented in parentheses. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. The dependent variable is the
probability of being arrested conditional on being stopped and is multiplied by 100. Extra controls
refer to the inclusion of indicators for gender, youth, suspect height and build as well as the time of day
as defined in Table 1. Note: OLS refers to ordinary least squares estimator and FE refers to a precinct
fixed effects estimator.
shown in Table A3, the number of stops differs across racial groups and other
pedestrian characteristics that are also likely correlated with race.
Estimates of the selection-correction model in the second column are noticeably different in economic significance from estimates using the standard hit rates test presented in column 1. While African-Americans are statistically less likely at the
1% level to be arrested when stopped for War on Drugs related crimes irrespective
of whether crime categories are exogenous or a behavioral choice, the estimated
coefficient is roughly 15% larger than that which ignores selectivity. For other
crime categories, the difference between the two methods is also important both
in magnitude and statistical significance. Correcting for selective crime classification leads to a 20% reduction in the magnitude of the race coefficient for economic crimes and large changes in magnitude for violent and minor crimes.
While our adjusted estimates do not alter the overall conclusion of racial dis-
crimination for War on Drugs crimes, the estimates obtained from the two-stage
procedure do suggest that there is non-negligible sample selection. The last column
in Table 4 reports the p-values from Hausman specification tests of the equality
of the estimated coefficient on black between estimates of Eqs. (1) and (3). For
War on Drugs, we observe that the p-values from the Hausman tests are less than
The dependent variable is the probability of being arrested conditional on being stopped and is multiplied
by 100. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. Robust standard errors in parentheses for column 1, bootstrapped (1000
repetitions) and reweighted standard errors in parentheses for column 2. Specifications additionally
include fixed effects for precincts and years as well as indicators for gender, youth, height, build and
time of day. Column 3 presents the p-values from Hausman specification tests where the null hypothesis
is that the estimated coefficient on black is the same across models from columns 1 and 2.
0.01, indicating that we can safely reject the assumption that crime categories are
exogenous. Similarly, we can clearly reject that crime categories do not reflect a
behavioral choice for major economic and violent crimes and for minor crimes
at the 6% level. The results provide evidence that the choice of crime that offi-
cers report as the basis for individual stops generates endogenous stratification.
Table A4 applies the two-stage correction in the case of summons and finds that we
can reject the hypothesis that sample selection is negligible for all crime categories.
Marginal effect estimates from the first-stage crime classification selection are presented in Table A5. Each of the variables used to identify the selection correction terms in Eq. (3) is individually and jointly statistically significant with a plausible
sign and magnitude. A somewhat striking finding is that African-Americans are
statistically significantly more likely to have their stop categorized as a War on
Drugs crime. Estimates of Eq. (2) also find that blacks are significantly less likely
to have their stop categorized as other crime types. Since the categories underlying
War on Drugs crimes can be viewed as representing police officer speculation that
a suspect is either hiding a weapon or drugs as opposed to having committed a
robbery or trespassing, it is likely that they are easier to use to justify a stop.
Thus, when an officer decides to instantly make a stop based on the suspect's characteristics, the use of this crime classification may also partially reflect implicit bias. It is therefore not surprising that the estimated effect of racial discrimination on arrest rates for War on Drugs crimes increases once the selection correction is used.
The dependent variable is an aggregate of crime types which takes the value 1 for crimes related to the War on Drugs, 2 for major nonviolent crimes, 3 for violent crimes and 4 for minor crimes. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. Standard errors in parentheses. Other economic refers to economic crimes including trespassing, burglary, robbery, grand larceny and grand larceny auto. Violent refers to violent crimes including rape, murder and assault. Minor refers to minor crimes and includes petit larceny, graffiti and criminal misconduct. Lag crime variables are defined as the proportion of stops that involved crimes of that type in the day before the stop in the same precinct. The exclusion restrictions p-value refers to a joint test of significance for the four exclusion restrictions.
Last, we conducted a series of robustness checks, shown in Table A5, to investigate how the results of the sample-selection correction procedure vary under alternative assumptions. We find that using different lagged values of the share of stops related to each crime category as the exclusion restriction, which improves the plausibility of the exogeneity assumption, has little effect on the conclusions. We also find that excluding first-stage fixed effects from the correction does alter the estimates of the correction but does not change the conclusion that there is a large arrest differential between racial groups for War on Drugs crimes.
6. CONCLUSION
The NYC Stop-and-Frisk program often plays a prominent role in debates surrounding racial profiling. Analyses of these data that condition on reported crime type are necessary to uncover heterogeneity in bias but may lead to biased estimates
ACKNOWLEDGMENT
We are grateful to Decio Coviello, Maxwell Pak and Rosina Rodriguez Olivera
for their helpful comments and suggestions on this project. We also thank Victor
Aguiar and other participants at the Econometrics of Complex Survey Data: Theory
and Applications conference for additional comments. NYC Stop-and-Frisk data
is publicly available at https://nycopendata.socrata.com as well as at the ICPSR
website at the University of Michigan. Lehrer thanks SSHRC for research support.
We are responsible for all errors.
NOTES
1. A large literature has documented evidence that police officers engage in implicit
bias and employ rules of thumb (see Fridell & Lim, 2016 for a recent survey). In recent
work, James (2017) provides evidence that the association between African-Americans and
weapons is stronger when officers have less sleep. The New York City Police Department
(NYPD) is well aware of this literature. In December 2014, the NYPD announced that they
would retrain a significant portion of their police force regarding implicit bias. However, a
2017 Newsweek investigation found that no officer had received such training to that date.
2. Rauscher, Johnson, Cho, and Walk (2008) present a meta-analysis of studies that
measured the race specific validity of survey questions about self-reported mammography
use against documented sources, such as medical and billing records. They found that
the specificity of survey questions that measure mammography use is lower among black
women than white women.
3. In work assuming categories are measured accurately, Lehrer, Pohl, and Song (2016)
use a simple static labor supply model to motivate why treatment effects from a welfare
program that changes work incentives should vary across demographic groups and over the
earnings distribution to motivate their empirical tests.
4. This program has also been targeted by legal action, including high-profile class action lawsuits such as Floyd, et al. v. City of New York, et al.
5. This finding is not unique to the United States, see Bradford, Jackson, and Stanko
(2009) for evidence on trends in the United Kingdom.
6. Studies that explore racial discrimination in the economics literature at other stages
of the criminal justice system make clear that these may be nondiscriminatory and one
needs to account for racial differences in crime prevalence. Yet, even after taking this
feature into account, there is evidence of racial prejudice at other stages. See Rehavi and
Starr (2014); Abrams, Bertrand, and Mullainathan (2012); Anwar, Bayer, and Hjalmarsson
(2012); Bushway and Gelbach (2010); Alesina and La Ferrara (2014) and Anwar and
Fang (2015) for studies looking at prejudice in prosecution, bail-setting, sentencing, prison
releases as well as in judges and juries.
7. In other words, police officers who stop more members of a certain racial group would
not be racially biased if these stops are productive and lead to arrests or summons. In addi-
tion, this test can account for the empirical features related to the geographic concentration
of crime across neighborhoods.
8. For completeness, other research evaluating Stop-and-Frisk include Gelman, Fagan,
and Kiss (2007) and Ridgeway (2012) who each use a different subset of the data
employed in the Coviello and Persico (2015) study. Lehrer and Lepage (2018) also use the
Stop-and-Frisk data to test for discrimination against Arabs and find evidence consistent
with racial profiling in periods of high terrorism threat. Regarding the use of non-lethal
force, Fryer (2016) finds that blacks are over 50% more likely than whites to have force
used against them when stopped.
9. See Nunn (2002) for further details.
10. See, for example, Fridell (2008).
11. For example, research has found high rates of misclassification in categorical vari-
ables such as education (Black, Sanders, & Taylor, 2003), labor market status (Poterba &
Summers, 1995) and disability status (Benitez-Silva, Buchinsky, Man Chan, Cheidvasser, &
Rust, 2004; Kreider & Pepper, 2008), among other measures.
12. This is surprising since as Bushway, Johnson, and Slocum (2007) note, issues of
selection bias pervade criminological research but econometric corrections have often been
misapplied.
13. This decision creates a narrow exception to the Fourth Amendment’s probable cause
and warrant requirements, permitting a police officer to briefly stop a citizen, question them
and frisk them to ascertain whether they possess a weapon that could endanger the officer.
By reasonable suspicion, a police officer “must be able to point to specific and articulable
facts” and is not permitted to base their decision on “inchoate and unparticularized suspicion
or [a] ‘hunch.’”
14. See the 2000 report by the US Commission on Civil Rights (2000) for detailed
information on policing in New York.
15. The fraction of mandated stops is higher for African-Americans (60% versus 47%),
which likely reflects higher rates of frisking, summonses and force used. Our main results
are robust to restricting the analysis only to those stops which had to be legally reported.
Following Coviello and Persico (2015), we do not use this as a sampling restriction since
it would condition on ex post information. The external validity of the results relies on the
plausible (yet untestable) assumption that the sample is representative of all stops in the city.
If police officers underreport racially sensitive stops, this would underestimate the number
of unproductive stops and our results would constitute a lower bound. This is consistent with
the data since the estimated arrest differential is larger when restricting to this subsample.
16. We also exclude observations related to crimes other than those reported
in Table 1 (less than 5% of all crimes) since we group crimes within categories in our
analysis. Including these crimes as an additional classification did not change the main
results but led to substantial computational costs.
17. Our results do not depend on these categories; the estimates are quantitatively and
qualitatively similar whether we group only War on Drugs crimes and leave the other
crime types ungrouped, or group violent and minor crimes into two categories while
leaving the rest ungrouped. We selected these classifications ex ante and as such present
them as the main results.
18. The low rate of stops for minor crimes may reflect an effort to justify the stops by
being “too ambitious” in stating suspected crimes to signal a stronger rationale for the stop.
Alternatively, police officers are likely to simply put more weight on serious offenses.
19. The KPT test was adapted to the Stop-and-Frisk setting in Coviello and Persico
(2015), whose model we extend. Our extension is also inspired by Anwar and Fang (2006).
Selective Crime Reporting 279
Note that the KPT test has faced criticism in Dharmapala and Ross (2004) and Gelman et al.
(2007), among others. These critiques focus on allowing police officers to consider varying
degrees of severity across types of crime, allowing for the fact that officers frequently do not
observe potential offenders or accounting for racial and neighborhood heterogeneity in the
probability of guilt. We pursue a similar line of inquiry in further investigating heterogeneity
along different types of crime, which prior work did not consider within the KPT framework.
20. While we do not make the distinction, racial bias is likely to be unconscious in
nature, particularly when police officers have limited time to decide whether or not to stop
a pedestrian on the street. Smith, Makarios, and Alpert (2006) provide evidence that police
officers can develop unconscious biases along observable characteristics such as gender
which affect their propensity to be suspicious of a member of that group under different
circumstances.
21. The assignment of police officers across precincts is considered further in Coviello
and Persico (2015) and is beyond the scope of this analysis.
22. This is the same channel through which racial bias enters the original hit rates model
of KPT. Our model differs by allowing this additional benefit to vary by crime type and
therefore to affect the officer’s decision to stop pedestrians differently across the race and
crime dimensions.
23. There is also the possibility that police officers lie about the type of crime forming the
basis of the stop. This could go in both directions, to show effort or to not appear prejudiced.
Ultimately, this may lead to bias even when adjusting for selection in crime classification
if officers consciously misreport the causes of their stops across different crime categories,
but the situation would be worse without the selection correction.
24. The intuition and mechanics behind the approach we use are proposed in Bourguignon,
Fournier, and Gurgand (2007) and parallel the seminal Heckman (1979) two-step estimator.
25. Note that the inclusion of these fixed effects leads to the well-known incidental
parameter problem, but since the ratio of the number of observations to the number of
parameters is very high in our application, issues of bias should be limited. On
the other hand, by ignoring the fixed effects, the interpretation of the coefficients of the
outcome equation would be unclear since a different set of coefficients enters the first
stage and outcome equation. Thus, our preferred estimates given the large sample size for
each precinct and year include fixed effects. For completeness, we present estimates using
both approaches in the results section. The use of a conditional fixed effects estimator is
computationally infeasible in our setting.
26. See Bourguignon et al. (2007) for further details on constructing the selectivity
correction terms and weights, which account for potential heteroskedasticity present in the
model due to selectivity.
27. Bootstrapped Hausman tests would be preferable since they relax the assumption
that ordinary least squares (OLS) is fully efficient under the null, but they are
computationally infeasible in our setting.
28. Whenever making statistical corrections for selection bias or endogeneity, there is
always the risk that the cure may be worse than the disease (see Bound, Jaeger, & Baker,
1995). That said, the literature on policing (e.g., James, 2017, and the references therein)
suggests that implicit bias may be higher for weapon crimes, which is accounted for by our
correction and consistent with our results. As such, it appears unlikely that these differences
in categories of crime would be solely due to chance and our selection correction appears
to be operating in the desired direction.
29. As Williams (2008) points out, some leading advocates of this change in the United
States were white women married to African-American men who found that their children
were almost always classified as black by those who collected statistical data or tabulated
persons by race.
REFERENCES
Abrams, D. S., Bertrand, M., & Mullainathan, S. (2012). Do judges vary in their treatment of race?
Journal of Legal Studies, 41(2), 347–383.
Alesina, A., & La Ferrara, E. (2014). A test for racial bias in capital punishment. The American
Economic Review, 104(11), 3397–3433.
Anwar, S., Bayer, P., & Hjalmarsson, R. (2012). The impact of jury race in criminal trials. Quarterly
Journal of Economics, 127(2), 1017–1055.
Anwar, S., & Fang, H. (2006). An alternative test of racial prejudice in motor vehicle searches: Theory
and evidence. The American Economic Review, 96(1), 127–151.
Anwar, S., & Fang, H. (2015). Testing for racial prejudice in the parole board release process: Theory
and evidence. Journal of Legal Studies, 44(1), 1–37.
Benitez-Silva, H., Buchinsky, M., Man Chan, H., Cheidvasser, S., & Rust, J. (2004). How large is the
bias in self-reported disability? Journal of Applied Econometrics, 19(6), 649–670.
Black, D., Sanders, S., & Taylor, L. (2003). Measurement of higher education in the census and current
population survey. Journal of the American Statistical Association, 98(463), 545–554.
Bollinger, C. R., & David, M. H. (1997). Modeling discrete choice with response error: Food stamp
participation. Journal of the American Statistical Association, 92(439), 827–835.
Bollinger, C. R., & David, M. H. (2001). Estimation with response error and nonresponse: Food-stamp
participation in the SIPP. Journal of Business & Economic Statistics, 19(2), 129–141.
Bound, J., Brown, C., & Mathiowetz, N. (2001). Measurement error in survey data. In Handbook of
econometrics (Vol. 5, pp. 3705–3843). Amsterdam: Elsevier.
Bound, J., Jaeger, D. A., & Baker, R. M. (1995). Problems with instrumental variables estimation
when the correlation between the instruments and the endogenous explanatory variable is weak.
Journal of the American Statistical Association, 90(430), 443–450.
Bourguignon, F., Fournier, M., & Gurgand, M. (2007). Selection bias corrections based on the
multinomial logit model: Monte Carlo comparisons. Journal of Economic Surveys, 21(1),
174–205.
Bradford, B., Jackson, J., & Stanko, E. (2009). Contact and confidence: Revisiting the impact of public
encounters with the police. Policing and Society, 19(1), 20–46.
Brigham, J. C. (1971). Racial stereotypes, attitudes, and evaluations of and behavioral intentions toward
Negroes and whites. Sociometry, 34(3), 360–380.
Bushway, S. D., & Gelbach, J. B. (2010). Testing for racial discrimination in bail setting using
nonparametric estimation of a parametric model. New Haven, CT: Mimeo; Yale Law School.
Bushway, S. D., Johnson, B. D., & Slocum, L. A. (2007). Is the magic still there? The relevance of
the Heckman two-step correction for selection bias in criminology. Journal of Quantitative
Criminology, 23(2), 151–178.
Correll, J., Park, B., Judd, C. M., & Wittenbrink, B. (2002). The police officer’s dilemma: Using
ethnicity to disambiguate potentially threatening individuals. Journal of Personality and Social
Psychology, 83(6), 1314.
Coviello, D., & Persico, N. (2015). An economic analysis of black-white disparities in NYPD’s
Stop-and-Frisk program. Journal of Legal Studies, 44(2), 315–360.
Dahl, G. B. (2002). Mobility and the return to education: Testing a Roy model with multiple markets.
Econometrica, 70(6), 2367–2420.
Devine, P. G., & Elliot, A. J. (1995). Are racial stereotypes really fading? The Princeton trilogy revisited.
Personality and Social Psychology Bulletin, 21(11), 1139–1150.
Dharmapala, D., & Ross, S. L. (2004). Racial bias in motor vehicle searches: Additional theory and
evidence. Contributions to Economic Analysis & Policy, 3(1). Article 12, 1–21.
Dubin, J. A., & McFadden, D. L. (1984). An econometric analysis of residential electric appliance
holdings and consumption. Econometrica, 52(2), 345–362.
Fridell, L. A. (2008). Racially biased policing: The law enforcement response to the implicit Black-
Crime Association. In M. J. Lynch, E. B. Patterson & K. K. Childs (Eds.), Racial divide: Racial
and ethnic bias in the criminal justice system (pp. 39–59). Monsey, NY: Criminal Justice
Press.
Fridell, L., & Lim, H. (2016). Assessing the racial aspects of police force using the implicit- and
counter-bias perspectives. Journal of Criminal Justice, 44, 36–48.
Fryer Jr, R. G. (2016). An empirical analysis of racial differences in police use of force. NBER Working
Paper No. 22399.
Gelman, A., Fagan, J., & Kiss, A. (2007). An analysis of the New York City police department’s
Stop-and-Frisk policy in the context of claims of racial bias. Journal of the American Statistical
Association, 102(479), 813–823.
Goel, S., Rao, J. M., & Shroff, R. (2016). Precinct or prejudice? Understanding racial disparities in
New York City’s Stop-and-Frisk policy. The Annals of Applied Statistics, 10(1), 365–394.
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46(6), 1251–1271.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161.
James, L. (2017). The stability of implicit racial bias in police officers. Police Quarterly, 21(1), 30–52.
Knowles, J., Persico, N., & Todd, P. (2001). Racial bias in motor vehicle searches: Theory and evidence.
Journal of Political Economy, 109(1), 203–229.
Kreider, B., & Pepper, J. V. (2011). Identification of expected outcomes in a data error mixing model
with multiplicative mean independence. Journal of Business & Economic Statistics, 29(1),
49–60.
Lee, L. F. (1983). Generalized econometric models with selectivity. Econometrica, 51(2), 507–512.
Lehrer, S. F., & Lepage, L. (2018). How do NYPD officers respond to general and specific terror
threats? Ann Arbor, MI: Mimeo; University of Michigan.
Lehrer, S. F., Pohl, R. V., & Song, K. (2016). Targeting policies: Multiple testing and distributional
treatment effects. NBER Working Paper No. 22950.
Nunn, K. B. (2002). Race, crime and the pool of surplus criminality: Or why the war on drugs was a
war on blacks. Journal of Gender, Race & Justice, 6, 381–445.
Poterba, J. M., & Summers, L. H. (1995). Unemployment benefits and labor market transitions: A
multinomial logit model with errors in classification. The Review of Economics and Statistics,
77(2), 207–216.
Rauscher, G. H., Johnson, T. P., Cho, Y. I., & Walk, J. A. (2008). Accuracy of self-reported cancer-
screening histories: A meta-analysis. Cancer Epidemiology and Prevention Biomarkers, 17(4),
748–757.
Rehavi, M. M., & Starr, S. (2014). Racial disparities in federal criminal sentences. Journal of Political
Economy, 122(6), 1320–1354.
Ridgeway, G. (2007). Analysis of racial disparities in the New York police department’s stop, question,
and frisk practices. RAND Technical Report #534.
Schmeidler, D. (1973). Equilibrium points of nonatomic games. Journal of Statistical Physics, 7(4),
295–300.
Sit, R. (2017). Since Eric Garner’s death, not one NYPD officer has received implicit bias training,
despite what the mayor says. Newsweek. Available at http://www.newsweek.com/eric-garner-
erica-nypd-implicit-bias-bill-de-blasio765165. [Date Published: 12/29/2017] [Date Accessed:
02/21/2018].
Smith, M. R., Makarios, M., & Alpert, G. P. (2006). Differential suspicion: Theory specification and
gender effects in the traffic stop context. Justice Quarterly, 23(2), 271–295.
U.S. Commission on Civil Rights (2000). Chapter 5: Stop, question, and frisk. Police practices and
civil rights in New York City. Available online at http://www.usccr.gov/pubs/nypolice/main.htm.
Williams, K. M. (2008). Mark one or more: Civil rights in multiracial America. Ann Arbor, MI:
University of Michigan Press.
APPENDIX
The estimated coefficients correspond to the coefficient on black from a set of
regressions of the probability of being arrested (multiplied by 100) conditional
on being stopped on a dummy for black as well as precinct indicators, estimated
separately for each year. The standard errors are clustered at the precinct level.
The dashed lines represent the pointwise 95% confidence interval.

Model           OLS        OLS        OLS       FE          FE          FE          FE
                1          2          3         4           5           6           7
Black
All crimes      0.056      0.0794∗    0.0794    −1.760∗∗∗   −1.746∗∗∗   −1.746∗∗∗   −1.545∗∗∗
                (0.0427)   (0.0428)   (0.379)   (0.0527)    (0.0527)    (0.330)     (0.322)
War on Drugs    −0.922∗∗∗  −0.886∗∗∗  −0.886    −2.468∗∗∗   −2.449∗∗∗   −2.449∗∗∗   −2.104∗∗∗

Standard errors are presented in parentheses. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. The dependent variable is the
probability of being issued a summons conditional on being stopped and is multiplied by 100. Extra
controls refer to the inclusion of indicators for gender, youth, suspect height and build as well as
time of day as defined in Table 1. OLS refers to ordinary least squares estimator and FE refers to a
precinct fixed effects estimator.
Table A2. Estimates of the Hit Rates Test on Arrests, by Crime Type and
Borough.
Model OLS OLS OLS FE FE FE FE
1 2 3 4 5 6 7
Black
Manhattan −4.559∗∗∗ −4.521∗∗∗ −4.521∗∗ −3.544∗∗∗ −3.471∗∗∗ −3.471∗∗ −3.299∗
(0.233) (0.233) (2.088) (0.251) (0.251) (1.589) (1.604)
Bronx −1.985∗∗∗ −2.039∗∗∗ −2.039∗∗ −1.905∗∗∗ −1.944∗∗∗ −1.944∗∗ −1.574∗∗
(0.278) (0.278) (0.782) (0.295) (0.295) (0.744) (0.634)
Brooklyn −3.109∗∗∗ −3.115∗∗∗ −3.115∗∗∗ −1.582∗∗∗ −1.631∗∗∗ −1.631∗∗∗ −1.493∗∗∗
(0.168) (0.168) (0.578) (0.187) (0.187) (0.494) (0.489)
Queens −3.456∗∗∗ −3.549∗∗∗ −3.549∗∗∗ −3.313∗∗∗ −3.338∗∗∗ −3.338∗∗∗ −3.250∗∗∗
(0.254) (0.255) (0.997) (0.307) (0.308) (0.956) (0.934)
Staten Island −1.204∗∗∗ −0.918∗∗∗ −0.918 −2.721∗∗∗ −2.536∗∗∗ −2.536∗∗ −2.645∗∗
(0.283) (0.288) (1.154) (0.373) (0.373) (0.583) (0.383)
Clustered SE No No Yes No No Yes Yes
Time FE No Yes Yes No Yes Yes Yes
Precinct FE No No No Yes Yes Yes Yes
Extra controls No No No No No No Yes
Standard errors are presented in parentheses. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. The dependent variable is the
probability of being arrested conditional on being stopped and is multiplied by 100. Extra controls
refer to the inclusion of indicators for gender, youth, suspect height and build as well as the time of
day as defined in Table 1. OLS refers to ordinary least squares estimator and FE refers to a precinct
fixed effects estimator.
Standard deviations in parentheses. The last column presents the p-values from tests where the null
hypothesis is that of equality in the probability of stop for the four crime categories.
The dependent variable is the probability of being issued a summons conditional on being stopped and
is multiplied by 100. ∗ 0.1, ∗∗ 0.05, ∗∗∗ 0.01. Robust standard errors in parentheses for column 1, boot-
strapped (1,000 repetitions) and reweighted standard errors in parentheses for column 2. Specifications
additionally include fixed effects for precincts and years as well as indicators for gender, youth, height,
build and time of day. Column 3 presents the p-values from Hausman specification tests where the null
hypothesis is that the estimated coefficient on black is the same across models from columns 1 and 2.
The dependent variable is the probability of being arrested conditional on being stopped and is multiplied
by 100. Robust standard errors in parentheses. Specifications additionally include fixed effects for
precincts and years as well as indicators for gender, youth, height, build and time of day.
SURVEY EVIDENCE ON BLACK
MARKET LIQUOR IN COLOMBIA
Gustavo J. Canavire-Bacarreza (Inter-American Development Bank, USA),
Alexander L. Lundberg (Department of Economics, West Virginia University, USA), and
Alejandra Montoya-Agudelo (School of Economics and Finance, Universidad EAFIT, Colombia)
ABSTRACT
1. INTRODUCTION
Illegal goods are a pervasive and understudied feature of modern economies.
Many goods are smuggled across borders to avoid taxation. Others are coun-
terfeit, designed to mimic an existing brand or product. Accurate estimates of
gross illegal economic activity remain elusive since market participants have a
clear incentive to conceal their transactions from authorities. To provide context,
however, the Organisation for Economic Co-operation and Development estimates
fake goods account for almost 3% of global imports, or 500 billion US dollars per
year (OECD, 2016). That figure does not include smuggled goods, nor does it
include purely domestic markets.
Although countries often coordinate trade policy through international agree-
ments, law enforcement is largely a domestic policy. Colombia, like many Latin
American countries, has grappled with the influx of illegal goods ever since
colonization. In 2014, the Colombian government adopted a novel approach to
enforcement when it commissioned a unique national survey to gather informa-
tion on black market liquor. Interviewers offered citizens money in exchange for
their most recently purchased bottle of liquor. Samples were sent to a laboratory for
testing, which confirmed whether a bottle was authentic, adulterated or contraband
(smuggled).1
The results of the survey confirm the importance of the black market for alcohol
in Colombia. Over 20% of the observations in the sample were confirmed to
be either contraband or adulterated. Figure 1 displays the percentage of illegal
purchases broken down by Colombian department.2
Different illegal goods carry different welfare implications. In two seminal arti-
cles, Grossman and Shapiro (1988a,b) classify fake goods according to whether
they are deceptive. If a good is deceptive, consumers believe it to be authentic
but incur a disutility from its inferior quality. If a good is not deceptive, con-
sumers can benefit from the option to buy a low quality imitation at lesser cost
(cf. Higgins & Rubin, 1986). Although either type of counterfeiting has the poten-
tial to raise or lower social welfare, deceptive counterfeiting is typically more
worrisome. For example, an adulterated bottle of alcohol may contain danger-
ous chemicals. Adulterated liquor is likely deceptive because for most purchases,
consumers have no obvious reason to want a fake product (e.g., a “snob effect”).
Contraband goods have a more direct link to welfare. They benefit consumers
who pay a lower price but hurt governments who lose tax revenue. One of the most
interesting features of the Colombian survey is the joint examination of adulter-
ated and contraband liquor. The two types of illegal liquor, though not mutually
exclusive, appear to enter the market through different channels. According to
regression analysis, two factors predict whether a bottle is contraband. One is the
2. DATA
Accounting for survey design is necessary for proper inference when data are not
taken from a simple random sample (see, e.g., Jain, 2006; Lumley, 2011). Esti-
mates and standard errors must be weighted to account for differing probabilities
of sample selection among members of the population.
The Colombian data come from a complex survey with multiple stages of sam-
pling. In the first stage, Colombia is divided into six different strata called regions.
Each region h contains Nh municipalities, and each municipality i has an adult pop-
ulation Mh,i . In the next stage, nh municipalities are randomly selected within each
region. Lastly, mh,i individuals are sampled randomly from within each selected
municipality i. Therefore, municipalities are the primary sampling units (PSUs)
and individuals are secondary sampling units (SSUs). The final sample includes
986 observations.
Table 1 displays the sample information in comparison to population fig-
ures. The population data come from the Colombian National Administrative
Department of Statistics (DANE), which offers population projections based on
census data. The DANE data can be disaggregated by age and municipality to cal-
culate the total adult population by region and municipality. Individuals younger
than age 18 are excluded from the calculation since the law does not allow minors
to purchase alcoholic beverages (the age of majority is 18 in Colombia).
Sampling weights, or probability weights, and finite population correction terms
are created according to PSUs and SSUs (Lumley, 2011). The weights are
calculated by taking the inverse probability of being sampled at each stage. Defining
fh,1 ≡ nh /Nh as the probability for each municipality to be selected within a region,
and fh,i,2 ≡ mh,i /Mh,i the probability for each individual to be sampled inside a
municipality i within region h, the sampling weights are given by
wh,i = (1/fh,1) × (1/fh,i,2) = (Nh/nh) × (Mh,i/mh,i),    (1)
which results in 94 different values for the sample. One way to interpret the weights
is that each observation represents wh,i individuals in the population.
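As a concrete sketch, the two-stage weight in (1) is just the product of the two inverse sampling fractions. All of the counts below are hypothetical, not taken from the survey:

```python
# Two-stage sampling weight w_{h,i} = (1/f_{h,1}) x (1/f_{h,i,2})
#                                   = (N_h/n_h) x (M_{h,i}/m_{h,i}).
# All counts are hypothetical illustrations, not the actual survey figures.

def sampling_weight(N_h, n_h, M_hi, m_hi):
    """Inverse probability of selection across both sampling stages."""
    f1 = n_h / N_h      # stage 1: municipality i selected within region h
    f2 = m_hi / M_hi    # stage 2: individual sampled within municipality i
    return (1 / f1) * (1 / f2)

# A region with 40 municipalities of which 4 are sampled; one sampled
# municipality has 50,000 adults, of whom 25 are interviewed.
w = sampling_weight(N_h=40, n_h=4, M_hi=50_000, m_hi=25)
print(w)  # 20000.0: each respondent stands in for 20,000 adults
```

In practice these weights would be supplied to survey software (e.g., R's survey package or Stata's svyset) together with the PSU identifiers and FPC terms.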
Finite population correction (FPC) terms are useful when PSUs and SSUs are
sampled without replacement. This correction accounts for the reduction in uncer-
tainty when the sample includes a large fraction of the population (Lumley, 2011).
Ignoring FPCs will inflate standard errors unless the sample is sufficiently small
compared to the population. The FPC term for municipalities is calculated as
FPCh,1 = 1 − fh,1, and the corresponding term for individuals within a sampled
municipality is FPCh,i,2 = 1 − fh,i,2.
Lastly, given the differences in the relative sample size per region, we also
conduct a post-stratification adjustment to control for under- and over-represented
regions. We use the total adult population per region to conduct the following
adjustment (Levy & Lemeshow, 2013):

w∗h,i = wh,i × Mh / ( Σi∈h wh,i mh,i ),    (2)

where Mh = Σi Mh,i (summing over the Nh municipalities in region h), which is
precisely the fourth column in Table 1.
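The post-stratification step in (2) rescales the weights so that the weighted total in each region matches the census figure Mh. A minimal sketch, again with hypothetical numbers:

```python
# Post-stratification for region h: scale each weight so the weighted sample
# total equals the known adult population M_h. All figures are hypothetical.

def poststratify(weights, m, M_h):
    """weights[i] and m[i] are the weight and number of sampled individuals
    for sampled municipality i in region h."""
    estimated_pop = sum(w * mi for w, mi in zip(weights, m))  # sum of w*m over i in h
    return [w * M_h / estimated_pop for w in weights]

# Two sampled municipalities: the raw weights imply a regional population of
# 20000*25 + 8000*50 = 900,000, while the census total is 1,080,000, so every
# weight is scaled up by the factor 1.2.
adjusted = poststratify(weights=[20_000, 8_000], m=[25, 50], M_h=1_080_000)
print(adjusted)  # [24000.0, 9600.0]
```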
3. RESULTS
Table 2 presents the summary statistics for the sample of 986 total observations.
The n column reports the number of complete observations for each variable.
Contraband and Adulterated are the two dependent variables. Each is an indicator
set to one if the purchase was confirmed to be contraband or adulterated, respec-
tively. Self-reported Transaction Price (STP) is the purchase price of the bottle
in Colombian Pesos reported by the buyer, and Size is the size of the bottle in
milliliters. Receipt is a dummy indicating whether the buyer obtained a receipt for
the purchase.
Discounts Offered and Asked are indicators set to one if the buyer was offered
a discount or if he or she asked for one, respectively. The following 10 indicator
variables denote the type of alcohol. Estrato is a set of six dummies capturing the
country’s unique socioeconomic classification system. Colombia assigns numbers
(or zones) 1–6 to housing buildings of roughly ascending wealth status. The classi-
fications determine tax rates and public utility pricing. The upper zones pay higher
rates and effectively subsidize the lower zones. The classification is decided only
by the physical features of housing, but since wealthier persons can afford to live in
the higher strata, Estrato is a proxy for socioeconomic status. Lastly, weekly usage
is recorded as an ordinal variable, with categories of less than half a bottle, one half
bottle (375 mL), one bottle (750 mL), more than one bottle, or didn’t know/didn’t
respond.
Ideally, the purchase of illegal liquor would be couched in a random utility
model of discrete choice. However, buyers and sellers may not know they are
buying contraband or illegal liquor, and even if they do, assumptions on the distri-
bution of random utility error terms are difficult to justify. Consider the framework
adapted from Train (2009). For simplicity, take the type of liquor as given and
assume all parties know whether a bottle is legal or not. Buyers, sellers (retailers),
and importers (or manufacturers) all choose whether to trade a legal or an illegal
bottle. Respectively, they derive utility

Uxij = Vxij + εxij,   x ∈ {b, s, m},

where the i index refers to the individual, and j is an indicator set to one if the bottle
is illegal and zero otherwise. The Vxij terms capture observable elements of utility,
including price for all parties (and perhaps size of the bottle), and cost for buyers
and importers. The εxij terms capture the remaining, unobserved component of
utility.
Ultimately, an illegal sale occurs when each party prefers an illegal bottle to a
legal one. The probability of the sale is then
P(illegal sale | Vbij, Vsij, Vmij) = P( ∩z∈{b,s,m} { Vzi1 + εzi1 > Vzi0 + εzi0 } ).    (3)
Empirically, we estimate binary response models of the form P(yi = 1 | xi) = F(x′iβ),
where yi is a dummy variable equal to one if the bottle was illegal – either contraband
or adulterated depending on the model – and xi, the vector of covariates,
includes the log of STP , the log of bottle size, whether a receipt was provided,
whether a discount was offered by the seller, whether a discount was received, and
a set of dummies for the type of alcohol and the Estrato of the buyer. The distribu-
tion function F(·) follows the right-hand side of (3). Again, the distribution is a
priori unknowable. Results for probit, logit and complementary log–log (cloglog)
estimation are included in all tables for comparison.4 They should be interpreted
as approximations to the right-hand side of (3).
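For intuition on how the candidate models can differ, the three link functions can be compared directly. This is a generic illustration of the three CDFs, not the chapter's estimation code:

```python
import math

# The three candidate CDFs for a binary response model. Probit and logit are
# symmetric around 0.5; the complementary log-log is asymmetric, approaching
# one faster than it leaves zero.

def probit(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logit(z):
    return 1 / (1 + math.exp(-z))

def cloglog(z):
    return 1 - math.exp(-math.exp(z))

for z in (-2, 0, 2):
    print(f"z = {z:+d}: probit {probit(z):.3f}, "
          f"logit {logit(z):.3f}, cloglog {cloglog(z):.3f}")
```

At z = 0 the probit and logit both return 0.5, while the cloglog already exceeds 0.6, which is why the three approximations can diverge most for probabilities away from one half.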
The log of STP has a significant effect on the likelihood of both types of ille-
gal alcohol, but the effect is stronger for contraband purchases. Recall that xi
Note: Standard errors in parentheses account for complex survey design. n=882 for all specifications.
Dummies for additional types of alcohol are excluded for perfect collinearity. We compute average
marginal effects (AMEs) for all our continuous covariates and discrete difference from the base level
for dummy variables; both cases were evaluated using actual values of the sample units. ∗ p < 0.1, ∗∗
p < 0.05, ∗∗∗ p < 0.01.
4. MULTIPLE IMPUTATION
The data also include variables for salary and weekly alcohol consumption, as
reported by the buyer. Both variables are theoretically relevant because heavy and
wealthy consumers may have different incentives to find and buy (or avoid) illegal
goods compared to other consumers. While the survey has nearly complete data
for most questions, including salary and weekly use regressors in the estimation
drops the sample size by roughly 40%. Furthermore, the indicator Discount Asked
contains missing values, and the variable might also influence illegal sales. To
better account for missing data, this section presents estimation results derived
from multiple imputation via chained equations.
In general, missing data can arise in one of two ways. The first is “unit non-
response,” which means the targeted individual did not participate in the survey.
The classic approach to dealing with unit nonresponse is by weighting the sample
to better match known characteristics of the population (see Section II). The sec-
ond source of missing data is “item nonresponse,” which means an individual was
unable or unwilling to answer a given question or “item.” The typical approach to
handling item nonresponse is to impute values for the missing observations before
running any estimation.
If data are missing completely at random (MCAR), then imputation is a fairly
harmless, simple procedure. In some cases, the data may be missing at random
(MAR) after conditioning on other variables in the data set. The trickier case is
when data are missing not at random (MNAR). Salary is a classic example of
MNAR data because very rich or poor individuals may be less willing to reveal
their salary. While MCAR can be tested against MAR, in principle, there is no
way to test for MAR vs. MNAR because the data lack the needed information by
definition.
Applying Little’s test of MCAR to both weekly use and salary based on the
estimation covariates, the assumption of MCAR is strongly rejected, with p-values
equal to zero to three decimal places for both variables (Li, 2013; Little & Rubin,
2002). In theory, however, salary and weekly use may be MNAR rather than MAR,
because individuals with high salaries or heavy use may be less willing to
disclose those facts. Table 4 confirms such a story for salary. Socioeconomic class
is correlated with salary, and response rates are nearly monotonically decreasing
as Estrato increases. Individuals with a higher socioeconomic class are less likely
to disclose income.
No variable appears to have a close relationship to missingness in weekly use.
For that variable, the missing category was combined with an “unsure” response,
and it remains unclear whether missing values actually represent a refusal to answer
or a sincere uncertainty.
Multiple imputation via chained equations (MICE) is a useful tool for handling
missing data (White, Royston, & Wood, 2010). The procedure involves specifying
a conditional mean regression equation for each variable with missing observa-
tions, using those models to predict the missing values, and iterating to create
multiple imputed data sets. Lastly, estimation is conducted on each data set and
estimates are combined to account for the uncertainty introduced by the procedure.
The equations are “chained” because the predicted values from one conditional
mean model appear as covariates in the next. Typically, the process starts with the
variable containing the fewest missing values (of those with any missing values),
then moves to the next least missing variable, and so on. See Appendix B for a
complete discussion (cf. Van Buuren, 2012).
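The chaining described above can be sketched in a few lines. This is a deliberately simplified conditional-mean version with two incomplete variables and made-up data (the full procedure adds random draws, as in Appendix B); the variable names and data-generating process are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)                              # fully observed
y = 1.0 + 2.0 * x + rng.normal(size=n)              # ~10% missing
z = 0.5 - 1.0 * x + 0.5 * y + rng.normal(size=n)    # ~30% missing
y_miss = rng.random(n) < 0.1                        # least missing -> imputed first
z_miss = rng.random(n) < 0.3

def ols_predict(X, target, obs):
    """Fit OLS on the observed rows only, then predict for every row."""
    beta, *_ = np.linalg.lstsq(X[obs], target[obs], rcond=None)
    return X @ beta

yi = np.where(y_miss, np.nan, y)
zi = np.where(z_miss, np.nan, z)
yi[y_miss] = np.nanmean(yi)          # crude starting values
zi[z_miss] = np.nanmean(zi)
ones = np.ones(n)
for _ in range(10):
    # Impute y from (x, z); the predictions from this model...
    yi[y_miss] = ols_predict(np.column_stack([ones, x, zi]), yi, ~y_miss)[y_miss]
    # ...appear as covariates in the next model in the chain: z from (x, y).
    zi[z_miss] = ols_predict(np.column_stack([ones, x, yi]), zi, ~z_miss)[z_miss]
```

Each pass re-fits the conditional models using the latest imputations from the other equation, which is exactly the sense in which the equations are “chained.”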
In the data, weekly use is the least missing variable, with roughly 10% of
values missing. Since no variables appear to predict missingness in weekly use,
the conditional mean is modeled by an ordered logit in the multiple imputation
(though if the variable is in fact MNAR, then results will be slightly biased in the
imputation).
Salary, on the other hand, appears to be MNAR based on Table 4. One way
to handle the selection issue is through a traditional Heckman two-step estimator
(Galimard, Chevret, Protopopescu, & Resche-Rigon, 2016; Heckman, 1979).
Unfortunately, two-step estimates are weakly identified without an exclusion
restriction; ideally, the data would contain a variable that affects the decision
to report salary but not the level of salary itself. While the survey contains no
convincing candidate, Escanciano et al. (2016) show that higher-order terms in
the selection equation can mitigate the weak identification.8 In the sample, age
and age squared offer continuous measures satisfying the notion behind
Escanciano et al. (2016), who find that identification for two-step estimators
can be achieved under weaker conditions than standard methods require, without
exclusion restrictions or instruments. In particular, identification can generally be
obtained when nothing more than the second stage, linear in our case, is
parameterized.
Thus, including age and age squared in the selection equation offers a partial
remedy to weak identification. We constrain log(Size) to be equal to 1 in the
Survey Evidence on Black Market Liquor in Colombia 299
$$dU = \frac{\partial U}{\partial STP}\,dSTP + \frac{\partial U}{\partial Receipt}\,dReceipt = 0 \tag{5}$$
$$dU = \beta_{STP}\,dSTP + \beta_{Receipt}\,dReceipt = 0$$
Note: Standard errors in parentheses account for the complex survey design. n = 986 for all specifications.
Dummies for additional types of alcohol are excluded due to perfect collinearity. We compute average
marginal effects (AMEs) for all continuous covariates and discrete differences from the base level
for dummy variables; both are evaluated at the actual values of the sample units. ∗ p < 0.1, ∗∗
p < 0.05, ∗∗∗ p < 0.01.
Note: Confidence intervals are calculated using Krinsky and Robb’s method with 5,000 replications
and do not account for the complex survey design. n = 986. ASL is the achieved significance level. The
level of confidence is 95%. H0 : W T P ≤ 0, H1 : W T P > 0.
$$dSTP = -\frac{\beta_{Receipt}}{\beta_{STP}}\,dReceipt \tag{6}$$
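Eq. (6) says the willingness to pay for a receipt is the coefficient ratio, and the Krinsky–Robb method referenced in the table note simulates its confidence interval by drawing coefficients from their estimated sampling distribution. A sketch with hypothetical point estimates and covariance (the real values come from the fitted survey model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical estimates for (beta_STP, beta_Receipt) and their covariance.
beta_hat = np.array([-1.2, 0.8])
V = np.array([[0.04, 0.01],
              [0.01, 0.09]])

# Eq. (6): dSTP = -(beta_Receipt / beta_STP) dReceipt, so the implied
# price change per unit change in Receipt is the coefficient ratio.
wtp_point = -beta_hat[1] / beta_hat[0]

# Krinsky-Robb: draw coefficients from MVN(beta_hat, V), recompute the
# ratio for each draw, and take percentiles as the confidence interval.
draws = rng.multivariate_normal(beta_hat, V, size=5_000)
wtp_draws = -draws[:, 1] / draws[:, 0]
lo, hi = np.percentile(wtp_draws, [2.5, 97.5])
```

The percentile interval reflects the skewness of a ratio of normals, which a delta-method interval would miss.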
5. DISCUSSION
The results provide several insights on the sources of black market liquor in Colom-
bia. First, the absence of a receipt and the presence of a discount offered by the
seller both make a sale more likely to be contraband (although the receipt indicator
loses significance in the MICE estimation). Neither variable has any relationship
with adulterated sales. The emerging story suggests that sellers are complicit in the
302 GUSTAVO J. CANAVIRE-BACARREZA ET AL.
For 2014, the second largest contributor to Colombian departmental tax revenue
was liquor taxes, with a share of 17.1%, beaten only by beer at 28.2% (DNP, 2014).
During the same year, liquor taxes depended only on the size of the bottle and the
alcohol by volume (abv). Based on the survey, illegal bottles would have collected
52.5% more tax revenue than the legal collection calculated in the sample.15
This means that departments could have raised tax revenue by almost 9 percentage
points. However, the Colombian government introduced tax reform in 2016 that
modified liquor taxes in order to raise tax revenue. These reforms include an ad
valorem tax and a value-added tax, which could change the way illegal liquor
affects tax loss in the future.
6. CONCLUSION
Illegal alcohol is a large industry in Colombia. Over 20% of the sampled bottles in
the government survey are contraband or adulterated. According to distributional
maps of the country, the two types of illegal liquor appear to arrive through different
channels. Contraband liquor originates through the north of the country, along
traditional shipping routes, while adulterated liquor is relatively dispersed across
departments.
Regression analysis offers further insight into the underground economy. The
most interesting predictors of contraband liquor are the absence of a receipt and the
presence of a discount offered specifically by the seller. Those two results strongly
suggest that sellers are complicit in the contraband market. Furthermore, sellers
are more likely to offer adulterated products when consumers ask for a discount.
The results have important implications for law enforcement. The government
has an incentive to stop contraband sales because they involve a loss of tax rev-
enue. Although adulterated sales involve little loss of revenue, they may still be a
point of emphasis for law enforcement because they are harmful to consumers and
perhaps public health. Since sellers appear complicit in the contraband market,
sellers and importers from the north would be the most effective targets for author-
ities. Conversely, authorities may need to identify different sellers when targeting
adulterated sales, as they appear more dispersed throughout the country. Firms
with an incentive to protect their brand may also have an interest both in shutting
down the adulterated market and in contributing insight to enforcement policy.
ACKNOWLEDGMENTS
We thank two anonymous referees and participants of the 2017 Econometrics
of Complex Survey Design workshop at the Bank of Canada for their helpful
comments. We also thank Jacques-Emmanuel Galimard for sharing R code on
a multiple imputation procedure. Finally, we thank Universidad EAFIT and the
FND – Federación Nacional de Departamentos – for allowing us to use the data.
The opinions expressed in this publication are those of the authors and do not
necessarily reflect the views of the Inter-American Development Bank, its Board
of Directors, or the countries they represent.
REFERENCES
Bravo, F., Huynh, K. P., & Jacho-Chávez, D. T. (2011). Average derivative estimation with missing
responses. In David M. Drukker (Ed.), Missing data methods: Cross-sectional methods and
applications (Advances in Econometrics, Volume 27 Part 1) (pp. 129–154). Bingley: Emerald
Publishing.
Buuren, S. V., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45, 1–68.
DNP (2014). Desempeño fiscal de los departamentos y municipios 2014 [Fiscal performance of
departments and municipalities 2014]. Bogotá: Departamento Nacional de Planeación.
Escanciano, J. C., Jacho-Chávez, D. T., & Lewbel, A. (2016). Identification and estimation of
semiparametric two-step models. Quantitative Economics, 7, 561–589.
Galbraith, J. W., & Kaiserman, M. (1997). Taxation, smuggling and demand for cigarettes in Canada:
Evidence from time-series data. Journal of Health Economics, 16(3), 287–301.
Galimard, J. E., Chevret, S., Protopopescu, C., & Resche-Rigon, M. (2016). A multiple imputation
approach for MNAR mechanisms compatible with Heckman’s model. Statistics in Medicine,
35(17), 2907–2920.
Grossman, G. M., & Shapiro, C. (1988a). Counterfeit-product trade. American Economic Review,
78(1), 59–75.
Grossman, G. M., & Shapiro, C. (1988b). Foreign counterfeiting of status goods. Quarterly Journal
of Economics, 103(1), 79–100.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161.
Higgins, R. S., & Rubin, P. H. (1986). Counterfeit goods. Journal of Law & Economics, 29(2), 211–230.
Jain, A. K., & Hausman, R. E. (2006). Stratified multistage sampling. In S. Kotz, C. B. Read, N.
Balakrishnan, B. Vidakovic and N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences.
Hoboken, NJ: John Wiley & Sons.
Jeanty, P. W. (2007). WTPCIKR: Constructing Krinsky and Robb Confidence Intervals for Mean and
Median willingness to pay (WTP) using Stata. In Sixth North American Stata users’ group
meeting, Boston, August (pp. 13–14).
Jiongo, V. D., Haziza, D., & Duchesne, P. (2013). Controlling the bias of robust small-area estimators.
Biometrika, 100(4), 843–858.
Levy, P. S., & Lemeshow, S. (2013). Sampling of populations: Methods and applications. Hoboken,
NJ: John Wiley & Sons.
Li, C. (2013). Little’s test of missing completely at random. The Stata Journal, 13(4), 795–809.
Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data (Vol. 333). Hoboken, NJ:
John Wiley & Sons.
Lumley, T. (2011). Complex surveys: a guide to analysis using R (Vol. 565). Hoboken, NJ: John
Wiley & Sons.
Organization for Economic Cooperation and Development (2016). Global trade in fake goods worth
nearly half a trillion dollars a year. Retrieved from: http://www.oecd.org/industry/global-trade-
in-fake-goods-worth-nearly-half-a-trillion-dollars-a-year.htm.
Qian, Y. (2008). Impacts of entry by counterfeiters. The Quarterly Journal of Economics, 123(4),
1577–1609.
Quercioli, E., & Smith, L. (2015). The economics of counterfeiting. Econometrica, 83(3), 1211–1236.
Rao, J. N., & Molina, I. (2015). Small area estimation. Hoboken, NJ: John Wiley & Sons.
Rubin, D. B. (1988). An overview of multiple imputation. In Proceedings of the survey research
methods section of the American statistical association (pp. 79–84).
Thursby, J. G., & Thursby, M. C. (2000). Interstate cigarette bootlegging: extent, revenue losses, and
effects of federal intervention. National Tax Journal, 53, 59–77.
Train, K. E. (2009). Discrete choice methods with simulation. New York, NY: Cambridge University
Press.
Van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: Chapman and Hall/CRC.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: issues
and guidance for practice. Statistics in Medicine, 30(4), 377–399.
APPENDIX A: TABLES
Table A1. Average Marginal Effects for Contraband and Adulterated Liquor –
Levels
Contraband Adulterated
Note: Standard errors in parentheses account for the complex survey design. n = 882 for all specifications.
Dummies for additional types of alcohol are excluded due to perfect collinearity. We compute average
marginal effects (AMEs) for all continuous covariates and discrete differences from the base level
for dummy variables; both are evaluated at the actual values of the sample units. ∗ p < 0.1, ∗∗
p < 0.05, ∗∗∗ p < 0.01.
Table A2. Coefficients for Different Specifications in the Salary Imputation Model

                    Following Escanciano et al. (2016)    Exclusion restrictions
                    (1)         (2)         (3)           (4)          (5)
Salary equation:
Age                 0.007       0.007       0.008
                    (0.006)     (0.007)     (0.006)
Selection equation:
Age                 −0.061∗∗∗   −0.015      0.325         −0.020∗∗∗    −0.062∗∗∗
                    (0.026)     (0.127)     (0.593)       (0.004)      (0.025)
Age2                0.001       −0.001      −0.015                     0.001∗
                    (0.000)     (0.003)     (0.024)                    (0.000)
Age3                            0.000       0.000
                                (0.000)     (0.000)
Age4                                        −0.000
                                            (0.000)
Note: Standard errors in parentheses account for the complex survey design. ∗ p < 0.1, ∗∗ p < 0.05, ∗∗∗ p < 0.01. Both stages include the following regressors:
log(ST P ), log(Size), Discount Asked, Discount Offered, Receipt, types of alcohol dummies, Estrato dummies, Weekly dummies, region dummies, Sex
(dummy), Adulterated (dummy), Contraband (dummy), Weekly reported (dummy), Discount Asked reported (dummy), and a constant term.
Table A3. Average Marginal Effects for Multiple Imputation With Chained
Equations Using Linear Exclusion Restriction.
Contraband Adulterated
Note: Standard errors in parentheses account for the complex survey design. n = 986 for all specifications.
Dummies for additional types of alcohol are excluded due to perfect collinearity. We compute average
marginal effects (AMEs) for all continuous covariates and discrete differences from the base level
for dummy variables; both are evaluated at the actual values of the sample units. ∗ p < 0.1, ∗∗
p < 0.05, ∗∗∗ p < 0.01.
In our specific case, we consider three different models: logit, ordered logit, and
the two-step Heckman estimator. White et al. (2011) provide the baseline for the
first two models, while Galimard et al. (2016) provide the baseline for the final
case. These procedures assume that β follows a multivariate normal distribution,
which is a common approximation for multiple imputation procedures, including
the case of categorical variables (Van Buuren, 2012). Therefore, a random draw β ∗
can be calculated as follows:
1. After estimating the proposed imputation model with k regressors (taking into
account the constant term, W, and R), the vector of coefficients β̂ and the
associated covariance matrix V are obtained.
2. Approximating the posterior distribution of β by MVN(β̂, V):
• σ∗ is drawn as
$$\sigma^{*} = \hat{\sigma}\sqrt{\frac{n_{obs}-k}{g}}, \tag{B.1}$$
where σ̂ is the estimated root mean-squared error and g is a random draw
from a χ2 distribution on nobs − k degrees of freedom.
3. β ∗ is drawn as
$$\beta^{*} = \hat{\beta} + \frac{\sigma^{*}}{\hat{\sigma}}\,u_{1}V^{1/2} = \hat{\beta} + \sqrt{\frac{n_{obs}-k}{g}}\,u_{1}V^{1/2}, \tag{B.2}$$
where u1 is a vector of k independent draws from a standard normal distribution.
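Steps 2 and 3 can be sketched numerically. All model quantities below (coefficients, covariance, residual error) are hypothetical placeholders standing in for a fitted imputation model:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical fitted imputation model: k coefficients with covariance V,
# root mean-squared error sigma_hat, estimated on n_obs complete cases.
n_obs, k = 400, 3
beta_hat = np.array([0.5, -1.0, 2.0])
V = np.diag([0.01, 0.02, 0.005])
sigma_hat = 1.3

# (B.1): scale the residual s.d. by a chi-square draw on n_obs - k d.f.
g = rng.chisquare(n_obs - k)
sigma_star = sigma_hat * np.sqrt((n_obs - k) / g)

# (B.2): perturb beta_hat with u1 ~ N(0, I_k) through a square root of V
# (Cholesky factor here), scaled by sigma_star / sigma_hat.
u1 = rng.standard_normal(k)
V_half = np.linalg.cholesky(V)
beta_star = beta_hat + (sigma_star / sigma_hat) * (V_half @ u1)
```

Drawing (σ∗, β∗) rather than reusing (σ̂, β̂) is what propagates parameter uncertainty into the imputations.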
Binary variables
Logit is the most common alternative to impute binary variables. The procedure
consists of fitting the imputation model
$$\Pr(y_i = 1 \mid X_i) = \frac{1}{1+\exp(-X_i\beta)}, \tag{B.3}$$
which yields β ∗ . Predicted probabilities for the missing data are computed as
Pi∗ = [1 + exp( − Xi β ∗ )]−1 . Next, a vector u2 is generated with random draws
from a uniform distribution on (0, 1) for all units with yi missing. Finally, the
imputed values are given by 1 if u2i < Pi∗ and 0 otherwise.
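The binary draw step looks like this in code. The design matrix and posterior coefficient draw below are hypothetical; in the procedure above, β∗ comes from (B.1)–(B.2) and the rows are the units with y missing:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical rows of the design matrix for units with y missing,
# and a posterior coefficient draw beta_star (as in B.1-B.2).
X_miss = np.column_stack([np.ones(6), rng.normal(size=6)])
beta_star = np.array([0.2, 1.5])

# (B.3): predicted probabilities under the logit imputation model.
p_star = 1.0 / (1.0 + np.exp(-X_miss @ beta_star))

# Compare against uniform draws: impute 1 where u2 < p_star, else 0.
u2 = rng.random(6)
y_imputed = (u2 < p_star).astype(int)
```

Comparing probabilities to uniform draws, instead of rounding them, preserves the variability of the imputed variable.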
Ordered variables
This is a straightforward extension of the previous case considering an ordered
logit. After obtaining β ∗ , the predicted probabilities for each category k are
calculated as
$$P_{ik}^{*} = \Pr(y_i = k \mid X_i) = \frac{1}{1+\exp(-\kappa_k + X_i\beta^{*})} - \frac{1}{1+\exp(-\kappa_{k-1} + X_i\beta^{*})}, \tag{B.4}$$
where κk for k = 0, . . . , L are the different cut-points, which are also parameters of
the model; κ0 is defined as −∞ and κL as +∞. Next, denote by cik the cumulative
class membership probabilities $\sum_{j=1}^{k} P_{ij}^{*}$. The imputed values are given by
$$y_i^{*} = 1 + \sum_{k=1}^{L-1} \mathbb{1}(u_{2i} > c_{ik}), \tag{B.5}$$
where u2i is the component of row i of vector u2 , as defined for the case of binary
variables, and 1( · ) stands for the indicator function.
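The ordered draw (B.4)–(B.5) can be written compactly using the cumulative form of the logit probabilities. The linear indices and cut-points below are hypothetical, with L = 3 categories:

```python
import numpy as np

rng = np.random.default_rng(5)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear indices X_i beta_star for 8 units, and cut-points
# kappa_1, kappa_2 (kappa_0 = -inf and kappa_3 = +inf are implicit).
xb = rng.normal(size=8)
kappa = np.array([-1.0, 1.0])

# (B.4) in cumulative form: c_ik = Pr(y_i <= k | X_i) = logistic(kappa_k - xb_i);
# category probabilities are the successive differences of these c_ik.
c = logistic(kappa[None, :] - xb[:, None])      # shape (8, L-1)

# (B.5): the imputed category is 1 plus the number of cumulative
# probabilities that a uniform draw exceeds.
u2 = rng.random(8)
y_imputed = 1 + (u2[:, None] > c).sum(axis=1)   # values in {1, 2, 3}
```

Working with the cumulative probabilities directly avoids computing each category probability and then re-summing them.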
MNAR data
For the case of salary, we closely follow the procedure proposed by Galimard et
al. (2016), who conduct the Heckman two-step estimator, including a variance
correction step.
The Heckman two-step estimator is derived from
$$P(R_{yi} = 1 \mid X_{is}) = \Phi(X_{is}\beta_s), \qquad R_{yi}^{*} = X_{is}\beta_s + \epsilon_{is}, \tag{B.6}$$
$$E(y_i \mid X_i, X_{is}, R_{yi} = 1) = X_i\beta + \rho\sigma_{\epsilon}\lambda_i, \tag{B.7}$$
where Ryi is a binary variable indicating the missingness of salaries, and R∗yi is the
associated latent variable. R∗yi and yi are linked by a bivariate normal distribution
of their error terms ε and εs ; ρ is the correlation coefficient between ε and εs , and
λi = φ(Xis βs )/Φ(Xis βs ) is the well-known inverse Mills ratio, where φ( · )
and Φ( · ) are the standard normal density and cumulative distribution functions.
The first step to obtain imputation values for yi is the estimation of β̂s and λ̂i from
the selection Eq. (B.6), and of β̂, β̂λ and σ̂η from Eq. (B.7), which can be estimated
through ordinary least squares as yi = Xi β + λi βλ + ηi , where ηi ∼ N (0, ση2 ).
ση2∗ is computed as σ̂η2 (nobs − k)/g, analogous to (B.1), and σε2∗ is obtained through
the following variance correction:
$$\hat{\sigma}_{\epsilon}^{2*} = \frac{\sigma_{\eta}^{2*}}{\frac{1}{N}\sum_{i=1}^{N}\left[1-\hat{\rho}^{2}\,\hat{\lambda}_i\left(\hat{\lambda}_i + X_{is}\hat{\beta}_s\right)\right]} \tag{B.8}$$
The estimation of σε2∗ allows for the computation of random draws (β ∗ , βλ∗ ) =
(β̂, β̂λ ) + σ̂ε∗ u1 V1/2 . Finally, the imputation values for yi are derived as
$$y_i^{*} = X_i\beta^{*} + \lambda_i\beta_{\lambda}^{*} + \sigma_{\eta}^{*}z_i, \tag{B.9}$$
where zi are random draws from a standard normal distribution.
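The two-step logic in (B.6)–(B.7) can be sketched on simulated data. This is a simplified illustration, not Galimard et al.'s full procedure: the selection coefficients are treated as already estimated (here, set to their true hypothetical values), and the data-generating process is made up, with ρ = 0.5 so that the coefficient on the inverse Mills ratio should recover ρσε = 0.5.

```python
import numpy as np
from math import erf, sqrt, pi, exp

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):  # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

rng = np.random.default_rng(11)
n = 20_000

# Hypothetical DGP with correlated errors (rho = 0.5, sigma_eps = 1).
x = rng.normal(size=n)
e_s = rng.normal(size=n)                       # selection error
e = 0.5 * e_s + sqrt(0.75) * rng.normal(size=n)  # outcome error
R = (0.3 + 1.0 * x + e_s) > 0                  # response indicator, as in (B.6)
y = 1.0 + 2.0 * x + e                          # salary, observed only when R

# Step 1: with the selection coefficients beta_s taken as estimated,
# form the inverse Mills ratio lambda_i = phi(X_s b_s) / Phi(X_s b_s).
xb_s = 0.3 + 1.0 * x
lam = np.array([phi(v) / Phi(v) for v in xb_s])

# Step 2, as in (B.7): OLS of y on (1, x, lambda) over respondents only;
# the lambda coefficient estimates rho * sigma_eps.
Xmat = np.column_stack([np.ones(n), x, lam])[R]
coef, *_ = np.linalg.lstsq(Xmat, y[R], rcond=None)
```

Without an exclusion restriction, only the nonlinearity of λ in x separates the Mills-ratio term from the regressors, which is the weak-identification problem the higher-order age terms are meant to ease.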
The second and third steps of MICE are also conducted in multiple imputation.
Although the following lines summarize the main results, please refer to Rubin
(1988) for further details. Let θ̂l and Ul be the estimate and associated variance for
each imputed data set l, where l = 1, . . . , M. The final estimate of θ is
$$\bar{\theta} = \sum_{l=1}^{M}\frac{\hat{\theta}_{l}}{M}. \tag{B.10}$$
The variability associated with the estimate has two components: the average
within-imputation variance,
$$\bar{U} = \sum_{l=1}^{M}\frac{U_{l}}{M}, \tag{B.11}$$
and the between-imputation component,
$$B = \sum_{l=1}^{M}\frac{(\hat{\theta}_{l}-\bar{\theta})^{2}}{M-1}. \tag{B.12}$$
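Rubin's pooling rules (B.10)–(B.12) are a short computation. The estimates and variances below are hypothetical placeholders for M = 5 imputed data sets; the final line combines the two components into the standard total variance (a standard result not written out above):

```python
import numpy as np

# Hypothetical point estimates and variances from M = 5 imputed data sets.
theta = np.array([0.42, 0.45, 0.40, 0.47, 0.44])
U = np.array([0.010, 0.012, 0.009, 0.011, 0.010])
M = len(theta)

theta_bar = theta.mean()                         # (B.10) pooled estimate
U_bar = U.mean()                                 # (B.11) within-imputation variance
B = ((theta - theta_bar) ** 2).sum() / (M - 1)   # (B.12) between-imputation variance

# Standard Rubin combination: total variance with a finite-M correction.
T = U_bar + (1 + 1 / M) * B
```

The between-imputation term B is what carries the uncertainty introduced by the imputation itself into the final standard errors.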
Fig. C1. Residuals: (a) Probit, Contraband; (b) Probit, Adulterated; (c) Logit, Contraband;
(d) Logit, Adulterated; (e) Cloglog, Contraband and (f) Cloglog, Adulterated.