
StataCorp LP

College Station, Texas

The Stata Journal

Editors

H. Joseph Newton
Department of Statistics
Texas A&M University
College Station, Texas
editors@stata-journal.com

Nicholas J. Cox
Department of Geography
Durham University
Durham, UK
editors@stata-journal.com

Associate Editors

Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, WZB, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, University of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
Philip Ender, University of California–Los Angeles
David Epstein, Columbia University
Allan Gregory, Queen's University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, London School of Economics and Political Science
Ulrich Kohler, University of Potsdam, Germany
Frauke Kreuter, Univ. of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, Univ. of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt Univ., Edinburgh
Jeroen Weesie, Utrecht University
Ian White, MRC Biostatistics Unit, Cambridge
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager: Lisa Gilmore
Stata Press Copy Editors: David Culwell, Shelbi Seiner, and Deirdre Skaggs

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository papers that link the use of Stata commands or programs to associated principles, such as those that will serve as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go beyond the Stata manual in explaining key features or uses of Stata that are of interest to intermediate or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could be of interest or usefulness to researchers, especially in fields that are of practical importance but are not often included in texts or other journals, such as the use of Stata in managing datasets, especially large datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata with topics such as extended examples of techniques and interpretation of results, simulations of statistical concepts, and overviews of subject areas.

The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behavioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch), Scopus, and Social Sciences Citation Index.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com

Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone

979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at

http://www.stata.com/bookstore/sj.html

Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.

                                           U.S. and Canada    Elsewhere
1-year subscription                        $ 98               $138
2-year subscription                        $165               $245
3-year subscription                        $225               $345
2-year institutional subscription          $445               $525
3-year institutional subscription          $645               $765
1-year subscription (electronic only)      $ 75               $ 75
2-year subscription (electronic only)      $125               $125
3-year subscription (electronic only)      $165               $165

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may

be ordered online.

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.

Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX

77845, USA, or emailed to sj@stata.com.

Copyright © 2014 by StataCorp LP

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.

The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions. This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites, fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.

Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting files understand that such use is made without warranty of any kind, by either the Stata Journal, the author, or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote free communication among Stata users.

The Stata Journal, electronic version (ISSN 1536-8734), is a publication of Stata Press. Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP.

Volume 14, Number 3, 2014

Articles and Columns 453

ivtreatreg: A new STATA routine for estimating binary treatment models with heterogeneous response to treatment and unobservable selection . . . . . . . . . . G. Cerulli 453
Obtaining critical values for test of Markov regime switching . . . . . . . . . . V. K. Bostwick and D. G. Steigerwald 481
A command for significance and power to test for the existence of a unique most probable category . . . . . . . . . . B. M. Fellman and J. Ensor 499
Merger simulation with nested logit demand . . . . . . . . . . J. Björnerstedt and F. Verboven 511
treatrew: A user-written command for estimating average treatment effects by reweighting on the propensity score . . . . . . . . . . G. Cerulli 541
Modeling count data with generalized distributions . . . . . . . . . . T. Harris, J. W. Hilbe, and J. W. Hardin 562
A Stata package for the application of semiparametric estimators of dose–response functions . . . . . . . . . . M. Bia, C. A. Flores, A. Flores-Lagunes, and A. Mattei 580
Space-filling location selection . . . . . . . . . . M. Bia and P. Van Kerm 605
Adaptive Markov chain Monte Carlo sampling and estimation in Mata . . . . . . . . . . M. J. Baker 623
csvconvert: A simple command to gather comma-separated value files into Stata . . . . . . . . . . A. A. Gaggero 662
The bmte command: Methods for the estimation of treatment effects when exclusion restrictions are unavailable . . . . . . . . . . I. McCarthy, D. Millimet, and R. Tchernis 670
Panel cointegration analysis with xtpedroni . . . . . . . . . . T. Neal 684
Stata and Dropbox . . . . . . . . . . R. Hicks 693
Juul and Frydenberg . . . . . . . . . . A. Linden 697

The Stata Journal (2014)
14, Number 3, pp. 453–480

ivtreatreg: A new STATA routine for estimating binary treatment models with heterogeneous response to treatment and unobservable selection

Giovanni Cerulli
Ceris-CNR
National Research Council of Italy
Institute for Economic Research on Firms and Growth
Rome, Italy
g.cerulli@ceris.cnr.it

Abstract. In this article, I present ivtreatreg, a command for fitting four different binary treatment models with and without heterogeneous average treatment effects under selection-on-unobservables (that is, treatment endogeneity). Depending on the model specified by the user, ivtreatreg provides consistent estimation of average treatment effects by using instrumental-variables estimators and a generalized two-step Heckman selection model. The added value of this new command is that it allows for generalization of the regression approach typically used in standard program evaluation by assuming heterogeneous response to treatment. It also serves as a sort of toolbox for conducting joint comparisons of different treatment methods, thus readily permitting checks on the robustness of results.

Keywords: st0346, ivtreatreg, microeconometrics, treatment models, instrumental variables, unobservable selection, treatment endogeneity, heterogeneous treatment response

1 Introduction

It is increasingly recognized as good practice to perform ex-post evaluation of economic and social programs through counterfactual evidence-based statistical analysis. Such analysis is particularly important at the policy-making level. The statistical approach is usually applied to measuring the causal effects of an intervention on the part of an external authority, such as local or national government, on a set of subjects targeted by a given program, such as individuals and companies. Similar analysis is also becoming popular in reassessing causal relations among factors identified under modern microeconometric theory from a counterfactual perspective but not necessarily regarding policy implications.

Several official Stata commands and new user-written commands have been applied to enlarge the set of available statistical tools for conducting these counterfactual analyses. Table 1 contains a list of commands for estimating binary treatment effects. However, the most recent release of Stata, version 13, provides a new far-reaching suite called teffects, which can be used to estimate treatment effects from observational data.

c 2014 StataCorp LP st0346


Table 1. Main Stata commands for estimating binary treatment effects. Methods: ordinary least squares (OLS) estimation using a control function; heckit, Heckman-type selection model; difference-in-differences (DID); instrumental variables (IV); regression discontinuity design.

Command      Method                                                Author(s)
regress      OLS estimation based on a control function,           StataCorp
             linear reweighting, DID (panel data)
ivregress    Basic IV, local average treatment effect              StataCorp
etregress    Selection model (heckit)                              StataCorp
psmatch2*    Matching (nearest neighbor on covariates              Leuven and Sianesi (2003)
             and propensity score)
pscore*      Matching (propensity score)                           Becker and Ichino (2002)
nnmatch*     Matching (nearest neighbor on covariates)             Abadie et al. (2004)
rd*          Regression discontinuity design (sharp and fuzzy)     Nichols (2007)
treatrew*    Reweighting on propensity score                       Cerulli (2014)
diff*        DID (repeated cross-section)                          Villa (2009)

* User-written command downloadable from the Statistical Software Components archive.

The teffects command can be used to estimate potential outcome means and average treatment effects (ATEs). As shown in table 2, the teffects suite covers a large set of methods, such as regression adjustment; inverse-probability weights; doubly robust methods, including inverse-probability-weighted regression adjustment; augmented inverse-probability weights; and matching on the propensity score or covariates (with nearest neighbors). Other subcommands can be used for postestimation purposes and for testing the reliability of results; for example, overlap allows for plotting the estimated densities of the probability of getting each treatment level.


Table 2. Stata 13 teffects subcommands for estimating treatment effects from observational data

Subcommand   Description
aipw         Augmented inverse-probability weighting
ipw          Inverse-probability weighting
ipwra        Inverse-probability-weighted regression adjustment
nnmatch      Nearest-neighbor matching
overlap      Overlap plots
psmatch      Propensity-score matching
ra           Regression adjustment

When applying teffects, the outcome models can be continuous, binary, count, or nonnegative. Binary outcomes can be modeled using logit, probit, or heteroskedastic probit regression, and count and nonnegative outcomes can be modeled using Poisson regression. The treatment model can be binary or multinomial. Binary treatments can be modeled using logit, probit, or heteroskedastic probit regression. For multinomial treatments, one can use pairwise comparisons and then exploit binary treatment approaches.¹

While the teffects command deals mainly with estimation methods suitable under selection-on-observables, Stata 13 presents two further commands to deal with endogenous binary treatment (occurring in the case of selection-on-unobservables): etregress and etpoisson. etregress estimates the ATE and the other parameters of a linear regression model augmented with an endogenous binary treatment variable. Basically, etregress is an improvement on Stata's treatreg command, whose estimation is based on the Heckman (1978) selection model. Because such a model is fully parametric, estimation can be performed either by full maximum likelihood or, less parametrically, by a two-step consistent estimator. Similarly, etpoisson estimates an endogenous binary treatment model when the outcome is a count variable by using a Poisson regression. Both the ATE and the ATE on the treated (ATET) can be estimated by etpoisson.

Although Stata 13 offers the above commands for dealing with endogenous treatment, the commands suffer from two important limitations. First, they assume joint normality of errors, meaning that they are not robust to violation of this hypothesis. Second, they do not allow, at least by default, for calculation of causal effects under observable heterogeneity, meaning that they assume causal effects to be the same in the subpopulation of treated and untreated units. This second limitation might be partially overcome by introducing interactions between the binary treatment and the covariates in the outcome equation, but this requires further user programming to recover all the parameters of interest.

1. For multinomial treatment, readers can refer to the user-written command poparms, which estimates multivalued treatment effects under conditional independence by using efficient semiparametric estimation of multivalued treatment effects. See Cattaneo (2010) and Cattaneo, Drukker, and Holland (2013) for tutorials.

The gsem command, also new in Stata 13, can estimate the causal parameters of models with selection-on-unobservables, implemented as unobserved components, and heterogeneous effects, implemented as random coefficients. However, gsem uses full-information maximum likelihood (ML), thus assuming a fully specified parametric model, which in some contexts could present questionable reliability.

The ivtreatreg command I present in this article implements a series of methods for treatment-effects estimation under treatment endogeneity that use only conditional-moment restrictions. These methods are more robust than those implemented by etregress or gsem. ML estimators would be naturally more efficient under correct specification, and this means that a trade-off may arise between robustness and efficiency. On the one hand, assuming some parametric distributive form for the error terms allows one to use ML estimation reaching the Cramér–Rao lower variance bound. On the other hand, when these distributive assumptions are questionable, ML may be less reliable than less efficient (but consistent) estimation procedures, which then become the more robust choice. Thus it seems useful to adopt distribution-free methods for dealing with treatment endogeneity, which the ivtreatreg command makes possible.

ivtreatreg fits four binary treatment models with and without idiosyncratic or heterogeneous ATEs.² Depending on the model specified by the user, ivtreatreg provides consistent estimation of ATEs under the hypothesis of selection-on-unobservables by using IV and a generalized Heckman-style selection model.

Conditional on a prespecified subset of exogenous variables x, thought of as driving the heterogeneous response to treatment, ivtreatreg calculates the ATE, the ATET, and the ATE on the nontreated (ATENT) for each called model, as well as the estimates of these parameters conditional on the observable factors x.

Specifically, the four models fit by ivtreatreg are direct-2sls (IV regression fit by direct two-stage least squares), probit-ols (IV two-step regression fit by probit and OLS), probit-2sls (IV regression fit by probit and two-stage least squares), and heckit (Heckman two-step selection model).

Extensive discussion of the conditions under which the previous methods provide consistent estimation of the ATE, ATET, and ATENT can be found in Wooldridge (2010). ivtreatreg provides value by allowing for generalization of the regression approach typically employed in standard program evaluation by assuming heterogeneous response to treatment and treatment endogeneity. It is also a sort of toolbox for conducting joint comparisons of different treatment methods, thus readily permitting the researcher to run checks on the robustness of results.

In sections 2 and 3 of this article, I briefly present the statistical framework and estimation methods implemented by ivtreatreg. In section 4, I present the syntax with a description of the help file, and in section 5, I conduct a Monte Carlo experiment to test the reliability of ivtreatreg. In section 6, I demonstrate the command applied to real data from a study of the relationship between education and fertility. I conclude with section 7, where I provide a brief summary and affirm the value of ivtreatreg. In the appendix, I derive the formulas for the selection model.

2. To my knowledge, no previous Stata command has addressed this objective.

2 Statistical framework³

Our hypothetical evaluation objective is to estimate the effect of binary treatment w (taking value 1 for treated and 0 for untreated units) on scalar outcome y.⁴ We suppose that the assignment to treatment is not random but instead due to some form of the unit's self-selection or external selection. For each unit, (y1, y0) denotes the two potential outcomes,⁵ where the outcome is y1 when the individual is treated and y0 when the individual is not treated. We then collect an independent and identically distributed sample of observations (yi, wi, xi) with i = 1, ..., N, where x is a row vector of covariates hypothesized as driving the observable nonrandom assignment to treatment (confounders).

Here we are interested in estimating the ATE, defined as

ATE = E(y1 − y0)

If we rely on observational data alone, we cannot identify the ATE because, for the same individual and at the same time, we can observe just one of the two quantities needed to calculate it (Holland 1986). By restricting the analysis to the group of treated units, we can also define a second causal parameter, the ATET, as

ATET = E(y1 − y0 | w = 1)

Similarly, the ATENT, meaning the ATE calculated within the subsample of untreated units, is

ATENT = E(y1 − y0 | w = 0)

The three parameters are linked by the identity ATE = ATET · p(w = 1) + ATENT · p(w = 0), where p(w = 1) is the probability of being treated and p(w = 0) is the probability of being untreated. Where x is known, we can also define the previous parameters conditional on x as follows:

ATE(x) = E(y1 − y0 | x)
ATET(x) = E(y1 − y0 | w = 1, x)
ATENT(x) = E(y1 − y0 | w = 0, x)

These quantities are functions of x, which means that they can be seen as individual-specific ATEs because each individual owns a specific value of x. Furthermore, by the law of iterated expectations, we have

ATE = Ex{ATE(x)}
ATET = Ex{ATET(x)}
ATENT = Ex{ATENT(x)}

3. This section draws on the substantial literature on the econometrics of program evaluation, such as Rubin (1974), Angrist (1991), Angrist, Imbens, and Rubin (1996), Heckman, LaLonde, and Smith (1999), Wooldridge (2010), and Cattaneo (2010). For a recent survey, see also Imbens and Wooldridge (2009).
4. Notation follows Wooldridge (2010).
5. For simplicity, I avoid writing the subscript form of the unit i when referring to population parameters.
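These definitions are easy to check numerically. The following self-contained Python sketch (purely illustrative; the article's own software is Stata, and the simulated design below is an assumption of mine) draws potential outcomes whose gain y1 − y0 rises with x, lets treatment select on x, and computes the ATE, ATET, and ATENT directly from the pair (y0, y1), which is unobservable in real data:

```python
import random

random.seed(42)
N = 100_000

# Potential outcomes with heterogeneous gain: y1 - y0 = 1 + x + noise,
# so units with larger x benefit more from treatment.
x = [random.gauss(0.0, 1.0) for _ in range(N)]
y0 = [0.5 + xi + random.gauss(0.0, 0.1) for xi in x]
y1 = [1.5 + 2.0 * xi + random.gauss(0.0, 0.1) for xi in x]

# Nonrandom assignment: units with high x are more likely to be treated.
w = [1 if xi + random.gauss(0.0, 1.0) > 0 else 0 for xi in x]

# In a simulation, both potential outcomes are visible, so the causal
# parameters are simple averages of the individual gains y1 - y0.
gain = [g1 - g0 for g1, g0 in zip(y1, y0)]
ate = sum(gain) / N
atet = sum(g for g, wi in zip(gain, w) if wi) / sum(w)
atent = sum(g for g, wi in zip(gain, w) if not wi) / (N - sum(w))
p1 = sum(w) / N

print(f"ATE={ate:.3f}  ATET={atet:.3f}  ATENT={atent:.3f}  p(w=1)={p1:.3f}")
```

Because the treated are selected on high x, ATET > ATE > ATENT here, and the computed values satisfy ATE = ATET · p(w = 1) + ATENT · p(w = 0) exactly, as the identity above requires.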

The analyst needs to recover consistent (and, when possible, efficient) estimators of the previous parameters from observational data. Before going on, note that throughout this article we assume that the stable unit treatment value assumption (Rubin 1978) holds. This assumption states that the treatment received by one unit does not affect other units' outcomes (Cox 1958). We thus restrict the analysis to a no-interference setting. Indeed, when the stable unit treatment value assumption does not hold, treatment externality effects between units may occur and pose severe problems in identifying effects.⁶

3 Estimation methods

The new command ivtreatreg implements four models to consistently estimate the previous parameters, and three of these are IV estimators. These methods are direct-2sls (IV regression estimated by direct two-stage least squares), probit-ols (IV two-step regression estimated by probit and OLS), probit-2sls (IV regression estimated by probit and two-stage least squares), and heckit (Heckman two-step selection model). Each of these can be estimated by assuming either homogeneous or heterogeneous response to treatment (for a total of eight models). Before presenting how ivtreatreg works, I briefly set out the formulas, conditions, and procedures of each model (see Wooldridge [2010, chap. 21]). We start by assuming that

y0 = μ0 + xβ0 + e0,   E(e0) = 0,   E(e0 | x) = 0    (1)
y1 = μ1 + xβ1 + e1,   E(e1) = 0,   E(e1 | x) = 0    (2)
y = y0 + w(y1 − y0)    (3)

6. Treatment-effects estimation under interference between units is a challenging field of study. Sobel (2006), Rosenbaum (2007), and Hudgens and Halloran (2008) offer important contributions on correct inferences within such a setting.


Equations (1) and (2) represent the potential-outcome equations, assumed to be linear in parameters, while the vector x can also contain nonlinear functions of the various covariates. Equation (3) is the so-called potential-outcome model and expresses the observational rule of the model, because y is the observed outcome. We do not need to explicitly specify an equation for w (that is, a selection equation) in this model; however, we could. We could assume, for instance, a linear probability model for the propensity to be selected into treatment,

w = γ0 + xγ1 + a    (4)

where a is an error component. As long as a is uncorrelated with (e0, e1), then (4) is redundant and not needed to identify the causal parameters. However, we must know w to identify the causal parameters, as we will discuss later. By substituting (1) and (2) into (3), we get

y = μ0 + (μ1 − μ0)w + xβ0 + wx(β1 − β0) + e0 + w(e1 − e0)

where β0 ≠ β1 implies observable heterogeneity and e1 ≠ e0 implies unobservable heterogeneity.

Next, we define η = e0 + w(e1 − e0). We can distinguish two cases, 1) e1 = e0 and 2) e1 ≠ e0, which can in turn be split into the following subcases:

Case 1.1. e1 = e0 = e, β0 = β1 = β, E(e | x, w) = 0: unobservable homogeneity, homogeneous reaction function of y0 and y1 to x, treatment exogeneity.

In this case, we can show that

E(y | w, x) = μ0 + w·ATE + xβ

ATE = ATE(x) = ATET = ATET(x) = ATENT = ATENT(x) = μ1 − μ0

Thus no heterogeneous ATEs (over x) exist. Furthermore, OLS consistently estimates the ATE.

Case 1.2. e1 = e0 = e, β0 ≠ β1, E(e | x, w) = 0: unobservable homogeneity, heterogeneous reaction function of y0 and y1 to x, treatment exogeneity.

In this case, letting δ = β1 − β0 and x̄ = E(x), we can show that

E(y | w, x) = μ0 + w·ATE + xβ0 + w(x − x̄)δ    (5)

ATE ≠ ATET ≠ ATENT

Heterogeneous ATEs (over x) exist, and the population causal parameters take the forms

ATE = (μ1 − μ0) + x̄δ
ATE(x) = ATE + (x − x̄)δ
ATET = ATE + Ex{(x − x̄)δ | w = 1}
ATET(x) = ATE + {(x − x̄)δ | w = 1}
ATENT = ATE + Ex{(x − x̄)δ | w = 0}
ATENT(x) = ATE + {(x − x̄)δ | w = 0}


Their sample equivalents, obtained from an OLS fit of (5), are

ÂTE = α̂_OLS
ÂTE(x) = α̂_OLS + (x − x̄)δ̂_OLS
ÂTET = α̂_OLS + (Σᵢ wᵢ)⁻¹ Σᵢ wᵢ(xᵢ − x̄)δ̂_OLS
ÂTET(x) = {α̂_OLS + (x − x̄)δ̂_OLS} for w = 1
ÂTENT = α̂_OLS + {Σᵢ (1 − wᵢ)}⁻¹ Σᵢ (1 − wᵢ)(xᵢ − x̄)δ̂_OLS
ÂTENT(x) = {α̂_OLS + (x − x̄)δ̂_OLS} for w = 0

with sums running over i = 1, ..., N, where it is clear that, under treatment exogeneity, these parameters can be consistently estimated by plugging in the coefficients from an OLS regression of (5).

But what happens when treatment exogeneity fails and w becomes endogenous? We then have three subcases.

Case 2.1. e1 = e0 = e, β0 = β1 = β, E(e | x, w) ≠ 0: unobservable homogeneity, homogeneous reaction function of y0 and y1 to x, treatment endogeneity.

In this case, we can show that

E(y | w, x) = μ0 + w·ATE + xβ

ATE = ATET = ATENT

Because w is correlated with the error term, OLS is no longer consistent, but this single parameter can be consistently estimated by a standard IV approach.

Case 2.2. e1 = e0 = e, β0 ≠ β1, E(e | x, w) ≠ 0: unobservable homogeneity, heterogeneous reaction function of y0 and y1 to x, treatment endogeneity.

In this case, we can show that

ATE ≠ ATET ≠ ATENT

and that the regression corresponding to (5),

y = μ0 + αw + xβ0 + w(x − x̄)δ + η    (6)

can be estimated by an IV approach. Observe, however, that we have two endogenous variables: w and w(x − x̄). However, once IV estimates of the parameters in (6) are available, we can consistently recover all the causal parameters of interest as follows:

ÂTE = α̂_IV
ÂTE(x) = α̂_IV + (x − x̄)δ̂_IV
ÂTET = α̂_IV + (Σᵢ wᵢ)⁻¹ Σᵢ wᵢ(xᵢ − x̄)δ̂_IV
ÂTET(x) = {α̂_IV + (x − x̄)δ̂_IV} for w = 1
ÂTENT = α̂_IV + {Σᵢ (1 − wᵢ)}⁻¹ Σᵢ (1 − wᵢ)(xᵢ − x̄)δ̂_IV
ÂTENT(x) = {α̂_IV + (x − x̄)δ̂_IV} for w = 0

with sums running over i = 1, ..., N.

Case 2.3. e1 ≠ e0, β0 ≠ β1, E(e0 | x, w) ≠ 0 or E(e1 | x, w) ≠ 0: unobservable heterogeneity, heterogeneous reaction function of y0 and y1 to x, treatment endogeneity.

In this case, we can show that

ATE ≠ ATET ≠ ATENT

To apply IV and get consistent estimation, this case requires a further orthogonality condition,

E{w(e1 − e0) | x, z} = E{w(e1 − e0)}    (7)

Given this condition, estimation may proceed as in Case 2.2.

Next, I present the methods implemented by ivtreatreg by referring to the case of heterogeneous reaction.

Control-function regression consistently estimates the previously defined causal effects under selection-on-observables, that is, when conditional mean independence (CMI) holds. CMI implies treatment exogeneity by restricting the independence between potential outcomes and treatment to the mean once the covariates x are fixed at a certain level. The control-function estimation protocol is as follows:

1. Fit an OLS regression of y on {1, w, x, w(x − x̄)}, thus obtaining consistent estimates of μ0, α, β0, and δ, with α = ATE.

2. Plug these estimated parameters into the sample formulas and recover all the causal effects.

However, ivtreatreg does not fit such a model, because it can be more robustly obtained by using the regression-adjustment estimator implemented in the teffects command of Stata 13 (with the suboption ra). This command handles many functional forms other than the linear one, and an estimate of the ATENT can also be obtained using the margins command after running the regression in step 1. For this reason, ivtreatreg concentrates on the endogenous treatment-effect case, for which it adds new tools.

When the CMI hypothesis does not hold, control-function regression yields biased estimates of the causal effects. This happens when the selection-into-treatment is due to both observable and unobservable factors. In this case, w becomes endogenous, that is, correlated with the regression error term. This is the case when the error term of (4) is correlated with e0 in (1) or with e1 in (2). IV can also restore consistency under selection-on-unobservables. Nevertheless, applying IV requires the availability of at least one variable z (the instrumental variable) assumed to be directly correlated with the treatment w and directly uncorrelated with the outcome y. This implies an exclusion restriction under which IV identifies the causal parameters. ivtreatreg implements the following three consistent but differently efficient IV methods: direct-2sls, probit-ols, and probit-2sls.

direct-2sls

When using direct-2sls, the analyst does not consider the binary nature of w. This method follows the typical IV steps:

1. Run an OLS of w on (1, x, z), thus obtaining the fitted values, indicated by w_fv,i.

2. Run a second OLS of y on {x, w_fv,i, w_fv,i(x − x̄)}. The coefficient of w_fv,i is a consistent estimate of the ATE.

3. Plug these estimated parameters into the sample formulas, recover all the other causal effects, and obtain standard errors for ATET and ATENT via bootstrap.
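A minimal Python sketch of the direct-2sls logic (illustrative only, not the ivtreatreg code; the simulated design and helper functions are my assumptions): both endogenous variables, w and w(x − x̄), are projected on the instrument set {1, x, z, z(x − x̄)} in a first stage, and the second-stage coefficient on the fitted w recovers the ATE, while naive OLS does not:

```python
import random

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    # OLS coefficients via the normal equations (X'X) b = X'y.
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def fitted(X, b):
    return [sum(ri * bi for ri, bi in zip(r, b)) for r in X]

random.seed(7)
N = 40_000

# Case 2.2 design: the unobservable v drives both selection and the outcome
# error, so w is endogenous; z is a valid instrument (relevant and excluded).
x = [random.gauss(0, 1) for _ in range(N)]
z = [random.gauss(0, 1) for _ in range(N)]
v = [random.gauss(0, 1) for _ in range(N)]
w = [1 if xi + zi + vi > 0 else 0 for xi, zi, vi in zip(x, z, v)]
e = [vi + random.gauss(0, 0.2) for vi in v]      # e1 = e0 = e, correlated with w
y = [(1.5 + 2 * xi if wi else 0.5 + xi) + ei
     for xi, wi, ei in zip(x, w, e)]             # true ATE = 1

xbar = sum(x) / N
h = [wi * (xi - xbar) for wi, xi in zip(w, x)]

# Naive OLS ignores the endogeneity of w and is biased.
ate_ols = ols([[1.0, wi, xi, hi] for wi, xi, hi in zip(w, x, h)], y)[1]

# First stage: project w and w(x - xbar) on the instrument set.
Z = [[1.0, xi, zi, zi * (xi - xbar)] for xi, zi in zip(x, z)]
w_fv = fitted(Z, ols(Z, w))
h_fv = fitted(Z, ols(Z, h))

# Second stage: the coefficient of w_fv consistently estimates the ATE.
X2 = [[1.0, wf, xi, hf] for wf, xi, hf in zip(w_fv, x, h_fv)]
_, ate_iv, _, delta_iv = ols(X2, y)

print(f"naive OLS ATE={ate_ols:.2f}  direct-2sls ATE={ate_iv:.2f}")
```

With selection on the unobservable v, the naive OLS estimate is pushed well away from the true ATE of 1, while the IV estimate lands close to it, at the cost of a larger sampling variance.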


probit-ols

In this case, the analyst exploits the binary nature of w by fitting a probit regression in the first step. Operationally, probit-ols follows these three steps:

1. Fit a probit of w on (1, x, z) and obtain the predicted probability of treatment (the propensity score), p̂_w,i.

2. Run an OLS of y on {1, p̂_w,i, x, p̂_w,i(x − x̄)}.

3. Plug the estimated parameters into the sample formulas and recover all the causal effects.

This estimator is more efficient (compared with direct-2sls) given that the process generating w is correctly specified. It has higher efficiency because the propensity score is the orthogonal projection of w in the vector space generated by (x, z). However, with this method, standard errors must be corrected for the presence of a generated regressor and heteroskedasticity.

probit-2sls

This method uses the estimated propensity score as an instrument within 2SLS:

1. Fit a probit of w on (1, x, z) and obtain the predicted probability of treatment, p̂_w.

2. Run an OLS of w on (1, x, p̂_w), thus getting the fitted values w2_fv,i.

3. Run an OLS of y on {1, w2_fv,i, x, w2_fv,i(x − x̄)}.

The coefficient of w2_fv,i is a more efficient estimator of the ATE compared with direct-2sls. Furthermore, to achieve consistency, this procedure does not require that the process generating w be correctly specified; thus, it is more robust than probit-ols.

3.3 heckit

ivtreatreg considers a generalized heckit model to consistently estimate the previous parameters without using an IV. The price is that of relying on a trivariate normality assumption between the error terms of the potential outcomes and the error term of the treatment. However, this model has the advantage of fitting Case 2.3 without invoking (7). The reference model is again the system of (1)–(4), where we also assume that (e0, e1, a) are trivariate normal. Such a model, as implemented by ivtreatreg, generalizes the two-step option of the official Stata command treatreg.

By default, the treatreg command assumes neither observable heterogeneity (because it holds that β0 = β1) nor unobservable heterogeneity (because it holds that e1 = e0). When these two assumptions are removed, the model leads to the following

464 Fitting binary treatment models

baseline regression function, which can be consistently estimated by OLS (see Wooldridge [2010, 949]):

E(y | x, z, w) = μ0 + αw + xβ0 + w(x − x̄)β + θ1 w φ(q)/Φ(q) − θ0 (1 − w) φ(q)/{1 − Φ(q)}

where α is the ATE, θ1 and θ0 are the correlations between the two potential outcomes' errors and the treatment's error, and φ(q) and Φ(q) are the standard normal density and cumulative distribution function, respectively. To estimate the previous regression, ivtreatreg

performs the following two-step procedure:

1. Fit a probit of w_i on (1, x_i, z_i), thus obtaining q̂_i and the estimated correction terms w_i φ(q̂_i)/Φ(q̂_i) and (1 − w_i) φ(q̂_i)/{1 − Φ(q̂_i)}.

2. Run an OLS of y_i on {1, w_i, x_i, w_i (x_i − x̄), w_i φ(q̂_i)/Φ(q̂_i), (1 − w_i) φ(q̂_i)/{1 − Φ(q̂_i)}}.

A test for the exogeneity of w can then be carried out by testing the null:

H0 : θ1 = θ0 = 0

More importantly, it is easy to show that

ATE = α

ATE(x) = α + (x − x̄)β

although ATET(x), ATET, ATENT(x), and ATENT assume different forms compared with previous models, specifically:7

ATET(x) = {α + (x − x̄)β + (θ1 + θ0) λ1(q)}_(w=1)

ATET = α + {1 / Σ_{i=1}^{N} w_i} Σ_{i=1}^{N} w_i (x_i − x̄)β + (θ1 + θ0) {1 / Σ_{i=1}^{N} w_i} Σ_{i=1}^{N} w_i λ1(q̂_i)

and

ATENT(x) = {α + (x − x̄)β + (θ1 + θ0) λ0(q)}_(w=0)

ATENT = α + {1 / Σ_{i=1}^{N} (1 − w_i)} Σ_{i=1}^{N} (1 − w_i)(x_i − x̄)β + (θ1 + θ0) {1 / Σ_{i=1}^{N} (1 − w_i)} Σ_{i=1}^{N} (1 − w_i) λ0(q̂_i)

where λ1(q) = φ(q)/Φ(q) and λ0(q) = −φ(q)/{1 − Φ(q)}. By plugging the estimates from the second step into these formulas, one can easily calculate all the causal effects. Here bootstrapping can again be used to obtain standard errors for ATET and ATENT.

7. See the appendix for the derivation of these formulas.
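The two correction terms, λ1(q) = φ(q)/Φ(q) for treated units and λ0(q) = −φ(q)/{1 − Φ(q)} for untreated units, need only the standard normal density and cdf; a minimal Python sketch:

```python
import math

def pdf(q):  # standard normal density phi(q)
    return math.exp(-0.5 * q * q) / math.sqrt(2.0 * math.pi)

def cdf(q):  # standard normal cdf Phi(q), via the error function
    return 0.5 * (1.0 + math.erf(q / math.sqrt(2.0)))

def lam1(q):  # lambda_1(q) = phi(q)/Phi(q), correction for treated units
    return pdf(q) / cdf(q)

def lam0(q):  # lambda_0(q) = -phi(q)/{1 - Phi(q)}, correction for untreated
    return -pdf(q) / (1.0 - cdf(q))

print(round(lam1(0.0), 4), round(lam0(0.0), 4))  # → 0.7979 -0.7979
```

At q = 0 both terms equal φ(0)/0.5 in absolute value; λ1 is always positive and λ0 always negative, which is what makes the (θ1 + θ0) terms shift ATET and ATENT in opposite directions.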


4 The ivtreatreg command

The ivtreatreg command fits the four binary treatment models presented above, with

and without idiosyncratic or heterogeneous ATEs. The command calculates the ATE,

ATET, and ATENT, as well as the estimates of these parameters conditional on the

observable factors x [that is, ATE(x), ATET(x), and ATENT(x)].

4.1 Syntax

ivtreatreg outcome treatment [varlist] [if] [in] [weight], model(modeltype) [hetero(varlist_h) iv(varlist_iv) conf(#) graphic vce(vcetype) beta const(noconstant) head(noheader)]

where outcome species the target variable that is the object of the evaluation, treat-

ment species the binary treatment variable (that is, 1 = treated or 0 = untreated),

and varlist denes the list of exogenous variables that are considered as observable

confounders.

fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.

4.2 Options

model(modeltype) specifies the treatment model to be fit, where modeltype must be one of the following four models (described in sections 3.3 and 3.4 above): direct-2sls, probit-2sls, probit-ols, or heckit. model() is required.

modeltype Description

direct-2sls    IV regression fit by direct two-stage least squares
probit-2sls    IV regression fit by probit and two-stage least squares
probit-ols     IV two-step regression fit by probit and OLS
heckit         Heckman two-step selection model

hetero(varlist_h) specifies the list of variables over which to calculate the idiosyncratic ATE(x), ATET(x), and ATENT(x), where x = varlist_h. When this option is not specified, the command fits the specified model without heterogeneous ATE. varlist_h should be the same set or a subset of the variables specified in varlist.

iv(varlist_iv) specifies the variables to be used as instruments. This option is required with model(direct-2sls); it is optional with other modeltypes.

conf(#) sets the confidence level to the specified number. The default is conf(95).


graphic displays a graph of the estimated kernel densities of ATE(x), ATET(x), and ATENT(x). graphic gives an outcome only if specified with hetero().

vce(robust) specifies to report standard errors that are robust to some kinds of misspecification.

vce(bootstrap | jackknife | conventional) may be specified when hetero() is not specified with model(heckit) to report standard errors that use the bootstrap method, the jackknife method, or the conventionally derived variance estimator.

beta reports standardized beta coecients.

const(noconstant) suppresses the regression constant term.

head(noheader) suppresses the display of summary statistics at the top of the output; only the coefficient table is displayed.

4.3 Remarks

The ivtreatreg command also creates several variables that can be used to further

examine the data:

_ws_varname_h are the interactions between the treatment w and the variables in varlist_h, created when hetero(varlist_h) is specified. _ws_varname_h are created for all models.

_z_varname_h are the IVs used in a model's regression when hetero(varlist_h) and iv(varlist_iv) are specified. _z_varname_h are created only for IV models.

G_fv is the first-step fitted value (or predicted probability) of the treatment, given the observable confounders used.

The standard errors for ATET and ATENT can be obtained via bootstrapping. Also, when option hetero() is not specified, ATE(x), ATET(x), and ATENT(x) are single numbers equal to ATE = ATET = ATENT.


ivtreatreg stores the following in e():

Scalars

e(N_tot)       total number of used observations

e(N_treat)     number of used treated units

e(N_untreat)   number of used untreated units

e(ate) value of the ATE

e(atet) value of the ATET

e(atent) value of the ATENT

5 A Monte Carlo experiment

In this section, I provide a Monte Carlo experiment to check whether ivtreatreg complies with predictions from the theory and to assess its correctness from a computational point of view. The first step is to define a data-generating process (DGP) as follows:

w = 1(0.5 + 0.5x1 + 0.3x2 + 0.6z + a > 0)

y0 = 0.1 + 0.2x1 + 0.2x2 + e0

y1 = 0.3 + 0.3x1 + 0.3x2 + e1

where

x1 = ln(h1)
x2 = ln(h2)
z = ln(h3)
h1 = χ²(1) + c
h2 = χ²(1) + c
h3 = χ²(1) + c
c ~ χ²(1)

and

(a, e0, e1) ~ N(0, Σ)

    [ σ²_a   σ_{a,e0}   σ_{a,e1}  ]   [ σ²_a   ρ_{a,e0} σ_a σ_e0    ρ_{a,e1} σ_a σ_e1   ]
Σ = [  ·     σ²_e0      σ_{e0,e1} ] = [  ·     σ²_e0                ρ_{e0,e1} σ_e0 σ_e1 ]
    [  ·      ·         σ²_e1     ]   [  ·      ·                   σ²_e1               ]

with

σ²_a = 1,  σ²_e0 = 3,  σ²_e1 = 6.5
ρ_{a,e0} = 0.5,  ρ_{a,e1} = 0.3,  ρ_{e0,e1} = 0

By assuming that the correlation between a and e0 (ρ_{a,e0}) and the correlation between a and e1 (ρ_{a,e1}) are different from 0, w (the binary selection indicator) is endogenous. We indicate the instrument with z, which is directly correlated with w but not directly correlated with y1 and y0. Given these assumptions, the DGP is completed by the observational rule yi = y0i + wi (y1i − y0i), generating the observable outcome y.
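This DGP can be replicated outside Stata as a quick sanity check. The sketch below (pure Python; the Cholesky factor of Σ is coded by hand, and a χ²(1) draw is taken as Gamma(1/2, scale 2)) recovers the implied true ATE of about 0.224 as the mean of y1 − y0:

```python
import math, random

random.seed(42)

# Cholesky factor of Sigma from the stated variances and correlations
sd_e0, sd_e1 = math.sqrt(3.0), math.sqrt(6.5)
c21 = 0.5 * sd_e0                       # Cov(a, e0) / sd_a
c22 = math.sqrt(sd_e0 ** 2 - c21 ** 2)
c31 = 0.3 * sd_e1                       # Cov(a, e1) / sd_a
c32 = (0.0 - c21 * c31) / c22           # Cov(e0, e1) = 0
c33 = math.sqrt(sd_e1 ** 2 - c31 ** 2 - c32 ** 2)

chi2_1 = lambda: random.gammavariate(0.5, 2.0)   # one chi-squared(1) draw

n, tot = 100_000, 0.0
for _ in range(n):
    z1, z2, z3 = (random.gauss(0.0, 1.0) for _ in range(3))
    a = z1
    e0 = c21 * z1 + c22 * z2
    e1 = c31 * z1 + c32 * z2 + c33 * z3
    c = chi2_1()
    x1, x2 = math.log(chi2_1() + c), math.log(chi2_1() + c)
    z = math.log(chi2_1() + c)
    w = 1 if 0.5 + 0.5 * x1 + 0.3 * x2 + 0.6 * z + a > 0 else 0
    y0 = 0.1 + 0.2 * x1 + 0.2 * x2 + e0
    y1 = 0.3 + 0.3 * x1 + 0.3 * x2 + e1
    y = y0 + w * (y1 - y0)              # observational rule
    tot += y1 - y0

true_ate = tot / n                      # close to 0.224
```

The simulated mean of y1 − y0 matches the "true value of ATE" reported in the Monte Carlo results below.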


The DGP is simulated 2,000 times using a sample size of 2,000. For each simulation, we get a different data matrix (x1, x2, y, w, z) on which we apply the four models implemented by ivtreatreg. Table 3 and figure 1 set out the simulation results.

Table 3. Monte Carlo simulation results

Estimator      Bias%   Mean    Std. dev.   Mean SE   Rejection rate
direct-2sls    5.05    0.235   0.316       0.318     0.042
probit-ols     2.92    0.217   0.272       0.268     0.045
probit-2sls    1.16    0.227   0.267       0.267     0.045
heckit         0.87    0.226   0.248       0.240     0.045

True value of ATE: 0.224

We see that the true value of ATE is 0.224. As expected, all the IV procedures consistently estimate the true ATE, with a slight bias of around 5% only for direct-2sls. Figure 1 confirms these findings by jointly plotting the distributions of the ATEs obtained by each single method over the 2,000 DGP simulations. All methods give similar results, though direct-2sls has a slightly different shape with fatter tails. This suggests that we should examine the estimation precision. Under our DGP assumptions, we expect model heckit to be the most efficient method, followed by model probit-ols and model probit-2sls, with model direct-2sls performing the worst. In fact, our DGP follows exactly the same assumptions on which the model heckit is based, including the joint trivariate normality of a, e0, and e1.


[Figure: kernel density of the ATE estimates for each of the four models over the simulations; legend shows the models, with heckit among them. True ATE = .224; sample size = 2,000; number of simulations = 2,000]

Table 3 confirms the following theoretical predictions: the lowest standard deviation is achieved by model heckit (0.248) and the highest by model direct-2sls (0.316), with the other methods lying in the middle with no appreciable differences. Observe that the standard error means (mean SE in column 4) show that the values of the standard deviations of the estimators in column 3 are estimated precisely (the values are much the same). This means that the asymptotic distribution of the ATE estimators approximates the finite-sample distribution well.

Table 3 also shows simulation results for test size. The size of a test is the probability of rejecting a hypothesis H0 when H0 is true. In our DGP, we set the size level at 0.05 for a two-sided test of H0: ATE = 0.224 against the alternative H1: ATE ≠ 0.224. The results, under the heading Rejection rate (column 5), represent the proportion of simulations that lead to rejection of H0. These values should be interpreted as the simulation estimate of the true test size (which we assumed to be 0.05). As expected, the rejection rates are all lower than the usual 5% significance level.

As a conclusion, these results seem to confirm both our expected theoretical results and the computational reliability of the ivtreatreg command.

6 An application: The relationship between education and fertility

To see how ivtreatreg works in practice, we consider an instructional dataset called fertil2.dta, which accompanies the book Introductory Econometrics: A Modern Approach by Wooldridge (2013) and is a collection of cross-sectional data on 4,361 women


and family characteristics. In this exercise, we are particularly interested in evaluating the impact of the variable educ7 (taking value 1 if a woman has seven years of education or more and 0 otherwise) on the number of family children (children). Several conditioning (or confounding) observable factors are included in the dataset, such as the age of the woman (age), whether or not the family owns a TV (tv), and whether or not the woman lives in a city (urban). To inquire about the relationship between education and fertility, following Wooldridge (2010), we estimate the following specification for each of the four models implemented by ivtreatreg:

. ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban) iv(frsthalf) model(modeltype) graphic

This specification adopts as the IV the covariate frsthalf, which takes value 1 if the woman was born in the first six months of the year and 0 otherwise. This variable is partially correlated with educ7, but it should not have any direct relationship with the number of family children.

The simple difference-in-mean estimator (the mean of the treated ones, which are the children in the group of more educated women, minus the mean of the untreated ones, which are the children in the group of less educated women) is −1.77 with a t-value of 28.46. This means that women with more education have about two children fewer than women with less education, without ceteris paribus conditions. By adding confounding factors in the regression specification, we get the OLS estimate of ATE as −0.394 with a t-value of 7.94, still in the absence of heterogeneous treatment. This is still significant, but the magnitude, as expected, dropped considerably compared with the difference-in-mean estimation, thus showing that confounders are relevant. When we consider OLS estimation with heterogeneity, we get an ATE equal to −0.37, which is still significant at 1%.9

When we consider IV estimation, results change dramatically. As we did in our working example of how to use ivtreatreg, we estimate the previous specification for probit-2sls with heterogeneous treatment response. The main outcome is reported below, where results from both the probit first step and the IV regression of the second step are set out. Results on the probit show that frsthalf is partially correlated with educ7; thus it can be reliably used as an instrument for this variable. Step 2 shows that the ATE (again, the coefficient of educ7) is no longer significant and that it changes sign, becoming positive and equal to 0.30.

9. OLS results on ATE are obtained by estimating the baseline regression set out in section 3.1 with

OLS.


. use fertil2.dta

. ivtreatreg children educ7 age agesq evermarr urban electric tv,

> hetero(age agesq evermarr urban) iv(frsthalf) model(probit-2sls) graphic

(output omitted )

Probit regression Number of obs = 4358

LR chi2(7) = 1130.84

Prob > chi2 = 0.0000

Log likelihood = -2428.384 Pseudo R2 = 0.1889

age -.0150337 .0174845 -0.86 0.390 -.0493027 .0192354

agesq -.0007325 .0002897 -2.53 0.011 -.0013003 -.0001647

evermarr -.2972879 .0486734 -6.11 0.000 -.392686 -.2018898

urban .2998122 .0432321 6.93 0.000 .2150789 .3845456

electric .4246668 .0751255 5.65 0.000 .2774235 .57191

tv .9281707 .0977462 9.50 0.000 .7365915 1.11975

_cons 1.13537 .2440057 4.65 0.000 .6571273 1.613612

(output omitted )

Source SS df MS Number of obs = 4358

F( 11, 4346) = 448.51

Model 10198.4139 11 927.128534 Prob > F = 0.0000

Residual 11311.6182 4346 2.60276536 R-squared = 0.4741

Adj R-squared = 0.4728

Total 21510.0321 4357 4.93689055 Root MSE = 1.6133

_ws_age -.8428913 .1368854 -6.16 0.000 -1.111256 -.5745262

_ws_agesq .011469 .0019061 6.02 0.000 .007732 .0152059

_ws_evermarr -.8979833 .2856655 -3.14 0.002 -1.458033 -.3379333

_ws_urban .4167504 .2316103 1.80 0.072 -.037324 .8708247

age .859302 .0966912 8.89 0.000 .669738 1.048866

agesq -.01003 .0012496 -8.03 0.000 -.0124799 -.0075801

evermarr 1.253709 .1586299 7.90 0.000 .9427132 1.564704

urban -.5313325 .1379893 -3.85 0.000 -.801862 -.260803

electric -.2392104 .1010705 -2.37 0.018 -.43736 -.0410608

tv -.2348937 .1478488 -1.59 0.112 -.5247528 .0549653

_cons -13.7584 1.876365 -7.33 0.000 -17.43704 -10.07977

Instruments: age agesq evermarr urban electric tv G_fv _z_age _z_agesq

_z_evermarr _z_urban

(output omitted )


This result is in line with the IV estimation obtained by Wooldridge (2010). Nevertheless, having assumed heterogeneous response to treatment, we can now also calculate the ATET and ATENT and inspect the cross-unit distribution of these effects. First, ivtreatreg returns these parameters as scalars (along with treated and untreated sample sizes).

. ereturn list

scalars:

(output omitted )

e(ate) = .3004007409051661

e(atet) = .898290019586237

e(atent) = -.4468834318294228

e(N_tot) = 4358

e(N_treat) = 2421

e(N_untreat) = 1937

(output omitted )

To get the standard errors for testing ATET and ATENT significance, we can easily implement a bootstrap procedure as follows:

. bootstrap atet=e(atet) atent=e(atent), reps(100):
> ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban) iv(frsthalf) model(probit-2sls)

Replications = 100

command: ivtreatreg children educ7 age agesq evermarr urban electric

tv, >hetero(age agesq evermarr urban) iv(frsthalf) model(probit-2sls)

atet: e(atet)

atent: e(atent)

Coef. Std. Err. z P>|z| [95% Conf. Interval]

atent -.4468834 .4124428 -1.08 0.279 -1.255257 .3614897

The results show that both ATET and ATENT are not significant and take quite different values, although not far from that of ATE. Furthermore, a simple check shows that ATE = ATET × p(w = 1) + ATENT × p(w = 0); for example,

ATE = .30040086
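This consistency check is easy to reproduce from the stored e() results (values taken from the ereturn list above):

```python
# e() values reported by ivtreatreg above
atet, atent = 0.898290019586237, -0.4468834318294228
n_treat, n_untreat = 2421, 1937
n_tot = n_treat + n_untreat             # 4358

# ATE = ATET * p(w = 1) + ATENT * p(w = 0), with sample shares as weights
ate = atet * n_treat / n_tot + atent * n_untreat / n_tot
print(round(ate, 8))  # → 0.30040086, in line with e(ate)
```

The weighted average of the two subpopulation effects recovers the overall ATE, as the decomposition requires.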

which confirms the expected result. Finally, we analyze the distribution of ATE(x), ATET(x), and ATENT(x). Figure 2 shows the result.


[Figure 2: kernel density distributions of ATE(x), ATET(x), and ATENT(x) for model probit-2sls]

ATE(x) and ATENT(x) show a distribution more concentrated on negative values. In particular, ATENT(x) shows the highest modal value around −2.2 children, thus predicting that less-educated women would have been less fertile if they had been more educated.

ATE results for all four models and for the simple difference-in-mean test (t test) are shown below. The ATE obtained by IV methods is consistently not significant, but it has a positive value only for probit-2sls. The rest of the ATEs consistently show negative values, meaning that more-educated women would have been more fertile if they had been less educated. heckit is a little more puzzling because the result is significant and very close to the difference-in-mean estimation, which is strongly suspected of being biased. This could be because the identification conditions of heckit are not met in this dataset.


(output omitted )

. estimates store ttest

. ivtreatreg children educ7 age agesq evermarr urban electric tv,

> hetero(age agesq evermarr urban) iv(frsthalf) model(heckit) graphic

(output omitted )

. estimates store heckit

. ivtreatreg children educ7 age agesq evermarr urban electric tv,

> hetero(age agesq evermarr urban) iv(frsthalf) model(probit-ols) graphic

(output omitted )

. estimates store probit_ols

. ivtreatreg children educ7 age agesq evermarr urban electric tv,

> hetero(age agesq evermarr urban) iv(frsthalf) model(direct-2sls) graphic

(output omitted )

. estimates store direct_2sls

. ivtreatreg children educ7 age agesq evermarr urban electric tv,

> hetero(age agesq evermarr urban) iv(frsthalf) model(probit-2sls) graphic

(output omitted )

. estimates store probit_2sls

. estimates table ttest probit_ols direct_2sls probit_2sls heckit,

> b(%9.2f) keep(educ7 G_fv) star

G_fv -0.11

Variable heckit

educ7 -1.92***

G_fv

Finally, figure 3 shows the plot of the ATE distribution for each method. These distributions largely follow a similar pattern, although direct-2sls and heckit show some appreciable differences. heckit, in particular, shows a very different pattern with a strong demarcation between the plot of treated and untreated units. Consequently, it appears not to be a reliable estimation procedure, an observation that deserves further inspection.


[Four kernel density panels: Model probit-ols, Model direct-2sls, Model probit-2sls, and Model heckit, each comparing ATE(x), ATET(x), and ATENT(x)]

Figure 3. Distribution of ATE(x), ATET(x), and ATENT(x) for the four models fit by ivtreatreg

7 Conclusion

In this article, I presented a new user-written Stata command, ivtreatreg, for fitting four different binary treatment models with and without idiosyncratic or heterogeneous ATEs. Depending on the model specified, ivtreatreg consistently estimates ATEs under the hypothesis of selection on unobservables by exploiting IV estimators and a generalized two-step Heckman selection model.

After presenting the statistical framework, I provided evidence on the reliability

of ivtreatreg by using a Monte Carlo experiment. To familiarize the reader with

the command, I also applied it to a real dataset. Results from both the Monte Carlo

experiment and the real dataset encourage one to use the command when the empirical

and theoretical setting suggests that treatment endogeneity and heterogeneous response

to treatment are present. In such cases, performing more than one method may be a

useful robustness check. The ivtreatreg command makes such checks possible and

easy to perform.


8 References

Abadie, A., D. Drukker, J. L. Herr, and G. W. Imbens. 2004. Implementing matching estimators for average treatment effects in Stata. Stata Journal 4: 290–311.

Angrist, J. D. 1991. Instrumental variables estimation of average treatment eects in

econometrics and epidemiology. NBER Technical Working Paper No. 0115.

http://www.nber.org/papers/t0115.

Angrist, J. D., G. W. Imbens, and D. B. Rubin. 1996. Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91: 444–455.

Nichols, A. 2007. rd: Stata module for regression discontinuity estimation. Statistical Software Components S456888, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s456888.html.

Becker, S. O., and A. Ichino. 2002. Estimation of average treatment effects based on propensity scores. Stata Journal 2: 358–377.

Cattaneo, M. D. 2010. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155: 138–154.

Cattaneo, M. D., D. M. Drukker, and A. D. Holland. 2013. Estimation of multivalued treatment effects under conditional independence. Stata Journal 13: 407–450.

Cerulli, G. 2014. treatrew: A user-written command for estimating average treatment effects by reweighting on the propensity score. Stata Journal 14: 541–561.

Cox, D. R. 1958. Planning of Experiments. New York: Wiley.

Heckman, J. J. 1978. Dummy endogenous variables in a simultaneous equation system. Econometrica 46: 931–959.

Heckman, J. J., R. J. LaLonde, and J. A. Smith. 1999. The economics and econometrics of active labor market programs. In Handbook of Labor Economics, ed. O. Ashenfelter and D. Card, vol. 3A, 1865–2097. Amsterdam: Elsevier.

Holland, P. W. 1986. Statistics and causal inference. Journal of the American Statistical Association 81: 945–960.

Hudgens, M. G., and M. E. Halloran. 2008. Toward causal inference with interference. Journal of the American Statistical Association 103: 832–842.

Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in the econometrics of program evaluation. Journal of Economic Literature 47: 5–86.

Leuven, E., and B. Sianesi. 2003. psmatch2: Stata module to perform full Mahalanobis

and propensity score matching, common support graphing, and covariate imbalance

testing. Statistical Software Components S432001, Department of Economics, Boston

College. http://ideas.repec.org/c/boc/bocode/s432001.html.


Rosenbaum, P. R. 2007. Interference between units in randomized experiments. Journal of the American Statistical Association 102: 191–200.

Rubin, D. B. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66: 688–701.

———. 1978. Bayesian inference for causal effects: The role of randomization. Annals of Statistics 6: 34–58.

Sobel, M. E. 2006. What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. Journal of the American Statistical Association 101: 1398–1407.

Statistical Software Components S457083, Department of Economics, Boston College.

http://ideas.repec.org/c/boc/bocode/s457083.html.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd

ed. Cambridge, MA: MIT Press.

———. 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western.

Giovanni Cerulli is a researcher at Ceris-CNR, National Research Council of Italy, Institute for

Economic Research on Firms and Growth. He received a degree in statistics and a PhD in

economic sciences from Sapienza University of Rome and is editor-in-chief of the International

Journal of Computational Economics and Econometrics. His research interests are mainly on

applied microeconometrics, with a focus on counterfactual treatment-eects models for program

evaluation. Stata programming and simulation- and agent-based methods are also among his

related fields of study. He has published articles in high-quality, refereed economics journals.


Appendix

Derivation of ATET(x), ATET, ATENT(x), and ATENT in the heckit model

Proof.

The heckit model with observable and unobservable heterogeneity relies on these assumptions:

1. y = μ0 + αw + xβ0 + w(x − x̄)β + u
2. E(e1 | x, z) = E(e0 | x, z) = 0
3. w = 1(γ0 + γ1 x + γ2 z + a ≥ 0) = 1(q ≥ 0)
4. E(a | x, z) = 0
5. (a, e0, e1) ~ trivariate normal
6. a ~ N(0, 1), so σ_a = 1
7. u = e0 + w(e1 − e0)

ATET(x) = E(y1 − y0 | x, w = 1) = (μ1 − μ0) + {g1(x) − g0(x)} + E(e1 − e0 | x, w = 1)

At the same time, because e1 and e0 are independent of x, we also have

E(e1 − e0 | x, w = 1) = E(e1 − e0 | w = 1)

The value of the last expectation is easy to compute; indeed, by putting

e1 − e0 = η

it follows that η still has a normal distribution. This means that, from the properties of truncated normal distributions,

E(η | w = 1) = E(η | γ0 + γ1 x + γ2 z + a ≥ 0) = E(η | q ≥ 0) = σ_{ηa} φ(q)/Φ(q)

From the linearity property of the covariance, we get

σ_{ηa} = Cov(η, a) = Cov(e1 − e0, a) = Cov(e1, a) − Cov(e0, a) = σ_{e1 a} − σ_{e0 a} = θ1 + θ0

because θ0 = −σ_{e0 a} and θ1 = σ_{e1 a}. This implies that

ATET(x) = {α + (x − x̄)β + (θ1 + θ0) λ1(q)}_(w=1)

ATET = α + {1 / Σ_{i=1}^{N} w_i} Σ_{i=1}^{N} w_i (x_i − x̄)β + (θ1 + θ0) {1 / Σ_{i=1}^{N} w_i} Σ_{i=1}^{N} w_i λ1(q̂_i)


where

λ1(q) = φ(q)/Φ(q)

As for ATENT, applying a similar procedure, it is immediate to get

ATENT(x) = {α + (x − x̄)β + (θ1 + θ0) λ0(q)}_(w=0)

ATENT = α + {1 / Σ_{i=1}^{N} (1 − w_i)} Σ_{i=1}^{N} (1 − w_i)(x_i − x̄)β + (θ1 + θ0) {1 / Σ_{i=1}^{N} (1 − w_i)} Σ_{i=1}^{N} (1 − w_i) λ0(q̂_i)

where

λ0(q) = −φ(q)/{1 − Φ(q)}

Derivation of ATE(x) and ATE in the heckit model

Proof.

Consider the formulas for ATET(x) and ATENT(x) in the heckit model:

ATET(x) = {α + (x − x̄)β + (θ1 + θ0) λ1(q)}_(w=1)

ATENT(x) = {α + (x − x̄)β + (θ1 + θ0) λ0(q)}_(w=0)

It follows that

ATE(x) = p(w = 1) ATET(x) + p(w = 0) ATENT(x)
= p(w = 1) {α + (x − x̄)β + (θ1 + θ0) λ1(q)} + p(w = 0) {α + (x − x̄)β + (θ1 + θ0) λ0(q)}
= {α + (x − x̄)β} + p(w = 1) {(θ1 + θ0) λ1(q)} + p(w = 0) {(θ1 + θ0) λ0(q)}
= {α + (x − x̄)β} + p(w = 1) (θ1 + θ0) φ(q)/Φ(q) − p(w = 0) (θ1 + θ0) φ(q)/{1 − Φ(q)}
= {α + (x − x̄)β} + p(w = 1) E(η | q ≥ 0) + p(w = 0) E(η | q < 0)

because E(η | q ≥ 0) = (θ1 + θ0) φ(q)/Φ(q) and E(η | q < 0) = −(θ1 + θ0) φ(q)/{1 − Φ(q)}.


By the law of iterated expectations, we get E(η) = p(w = 1) E(η | q ≥ 0) + p(w = 0) E(η | q < 0) = 0, because E(η) = E(e1 − e0) = 0, proving that

ATE(x) = α + (x − x̄)β

and finally

ATE = E_x{ATE(x)} = α

The Stata Journal (2014)
14, Number 3, pp. 481–498

Obtaining critical values for test of Markov regime switching

Valerie K. Bostwick                        Douglas G. Steigerwald
Department of Economics                    Department of Economics
University of California, Santa Barbara    University of California, Santa Barbara
Santa Barbara, CA                          Santa Barbara, CA
vkbostwick@gmail.com                       doug@econ.ucsb.edu

Abstract. For Markov regime-switching models, a nonstandard test statistic must be used to test for the possible presence of multiple regimes. Carter and

Steigerwald (2013, Journal of Econometric Methods 2: 2534) derive the ana-

lytic steps needed to implement the Markov regime-switching test proposed by

Cho and White (2007, Econometrica 75: 16711720). We summarize the imple-

mentation steps and address the computational issues that arise. We then in-

troduce a new command to compute regime-switching critical values, rscv, and

present it in the context of empirical research.

Keywords: st0347, rscv, Markov regime switching

1 Introduction

Markov regime-switching models are frequently used in economic analysis and are prevalent in fields such as finance, industrial organization, and business cycle theory. Unfortunately, conducting proper inference with these models can be exceptionally challenging. In particular, testing for the possible presence of multiple regimes requires the use of a nonstandard test statistic and critical values that may differ across model specifications.

Cho and White (2007) demonstrate that because of the unusually complicated nature of the null space, the appropriate measure for a test of multiple regimes in the Markov regime-switching framework is a quasi-likelihood-ratio (QLR) statistic. They provide an asymptotic null distribution for this test statistic from which critical values should be drawn. Because this distribution is a function of a Gaussian process, the critical values are difficult to obtain from a simple closed-form distribution. Moreover, the elements of the Gaussian process underlying the asymptotic null distribution are dependent upon one another. Thus the critical values depend on the covariance of the Gaussian process and, because of the complex nature of this covariance structure, are best calculated using numerical approximation. In this article, we summarize the steps necessary for such an approximation and introduce the new command rscv, which can be used to produce the desired regime-switching critical values for a QLR test of only one regime.

We focus on a simple linear model with Gaussian errors, but the QLR test and the

rscv command are generalizable to a much broader class of models. This methodology

can be applied to models with multiple covariates and non-Gaussian errors. It is also

© 2014 StataCorp LP   st0347

482 Obtaining critical values for test of Markov regime switching

although the difference between distributions must be in only one mean parameter.

Although most regime-switching models are thought of in the context of time-series data,

we provide an example in section 5 of how to use the QLR test in cross-section models.

However, there is one notable restriction on the allowable class of regime-switching

models. Carter and Steigerwald (2012) establish that the quasi-maximum likelihood

estimator created using the quasi-log-likelihood is inconsistent if the covariates include

lagged values of the dependent variable. Thus the QLR test should be used with extreme

caution on autoregressive models.

The article is organized as follows. In section 2, we describe the unusual null space

that corresponds to a test of only one regime versus the alternative of regime switching.

In section 3, we present the QLR test statistic, as derived by Cho and White (2007),

and the corresponding asymptotic null distribution. We also summarize the analysis

in Carter and Steigerwald (2013) describing the covariance structure of the relevant

Gaussian process. In section 4, we describe the methodology used by the rscv command

to numerically approximate the relevant critical values. We also present the syntax and

options of the rscv command and provide sample output. We illustrate the use of the

rscv command with an application from the economics literature in section 5. Finally,

we conclude in section 6 with some remarks on the general applicability of this command

and the underlying methods.

2 Null hypothesis

Specifying a Markov regime-switching model requires a test to confirm the presence of multiple regimes. The first step is to test the null hypothesis of one regime against the

alternative hypothesis of Markov switching between two regimes. If this null hypothesis

can be rejected, then one can proceed to estimate the Markov regime-switching models

with two or more regimes. The key to conducting valid inference is then a test of the

null hypothesis of one regime, which yields an asymptotic size equal to or less than the

nominal test size.

To understand how to conduct valid inference for the null hypothesis of only one

regime, consider a basic regime-switching model,

yt = 0 + st + ut (1)

where ut i.i.d. N 0, 2 . The unobserved state variable st (0, 1) indicates that

regime in state 0, yt has mean 0 , while regime in state 1, yt has mean 1 = 0 + . The

sequence (st )nt=1 is generated by a rst-order Markov process with P (st = 1|st1 = 0) =

p0 and P (st = 0|st1 = 1) = p1 .

The key is to understand the parameter space that corresponds to the null hypothesis. Under the null hypothesis, there is one regime with mean μ*. Hence, the null parameter space must capture all the possible regions that correspond to one regime. The first region corresponds to the assumption that μ0 = μ1 = μ*, which is the assumption that each of the two regimes is observed with positive probability: p0 > 0

V. K. Bostwick and D. G. Steigerwald 483

and p1 > 0. The nonstandard feature of the null space is that it includes two additional regions, each of which also corresponds to one regime with mean μ*. The second region corresponds to the assumption that only regime 0 occurs with positive probability, p0 = 0, and that μ0 = μ*. In this second region, the mean of regime 1, μ1, is not identified, so this region of the null hypothesis does not impose any value on μ1 − μ0. The third region is a mirror image of the second region, where now the assumption is that regime 1 occurs with probability 1: p1 = 0 and μ1 = μ*. The three regions are depicted in figure 1. The vertical distance measures the value of p0 and of p1, and the horizontal distance measures the value of μ1 − μ0. Thus the vertical line at μ1 − μ0 = 0 captures the region of the null parameter space that corresponds to the assumption that μ0 = μ1 = μ* together with p0, p1 ∈ (0, 1). The lower horizontal line captures the region of the null parameter space where p0 = 0 and μ1 − μ0 is unrestricted. Similarly, the upper horizontal line captures the region of the null parameter space where p1 = 0 and μ1 − μ0 is unrestricted.

Figure 1. The null parameter space: the vertical curve μ1 − μ0 = 0 and the horizontal curves p0 = 0 and p1 = 0, together with local neighborhoods of a point on μ1 − μ0 = 0 and a point on p1 = 0

The additional curves that correspond to the values p0 = 0 and p1 = 0 help prevent one from misclassifying a small group of extremal values as a second regime. In figure 1, we depict the null space together with local neighborhoods for two points in this space. These two neighborhoods illustrate the different roles of the three curves in the null space. Points in the circular neighborhood of the point on μ1 − μ0 = 0 correspond to processes with two regimes that have only slightly separated means. Points in the semicircular neighborhood around the point on p1 = 0 correspond to processes in which there are two regimes with widely separated means, one of which occurs infrequently. Because a researcher is often concerned that rejection of the null hypothesis of one regime is due to a small group of outliers rather than multiple regimes, including these boundary values reduces this type of false rejection. Consequently, a valid test of the null hypothesis of one regime must account for the entire null region and include all three curves.

484 Obtaining critical values for test of Markov regime switching

To implement a valid test of the null hypothesis of one regime, a likelihood-ratio statistic is needed. When considering the likelihood-ratio statistic for a Markov regime-switching process, Cho and White (2007) find that including p0 = 0 and p1 = 0 in the parameter space creates significant difficulties in the asymptotic analysis. These difficulties lead them to consider a QLR statistic for which the Markov structure of the state variable is ignored and (s_t) is instead a sequence of independent and identically distributed (i.i.d.) random variables.

This i.i.d. restriction allows Cho and White (2007) to consider only the stationary probability, P(s_t = 1) = π, where π = p0/(p0 + p1). Because π = 1 if and only if p1 = 0 (and π = 0 if and only if p0 = 0), the null hypothesis for a test of one regime based on the QLR statistic is expressed with three curves. The null hypothesis is H0: μ0 = μ1 = μ (curve 1), π = 0 and μ0 = μ (curve 2), and π = 1 and μ1 = μ (curve 3). The alternative hypothesis is H1: π ∈ (0, 1) and μ0 ≠ μ1.
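The reparameterization behind these hypotheses can be sketched in a few lines; the snippet makes explicit how the boundary curves in (p0, p1) map to the endpoints of the stationary probability (the function name is ours, for illustration only).

```python
def stationary_prob(p0, p1):
    """pi = P(s_t = 1) = p0 / (p0 + p1) for the stationary chain."""
    return p0 / (p0 + p1)

# pi = 0 exactly when p0 = 0 (curve 2), pi = 1 exactly when p1 = 0 (curve 3),
# and interior (p0, p1) give pi strictly between 0 and 1 (curve 1).
```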

For our basic model in (1), the quasi-log-likelihood analyzed by Cho and White (2007) is

    Ln(π, σ², μ0, μ1) = (1/n) Σ_{t=1}^n l_t(π, σ², μ0, μ1)

where l_t(π, σ², μ0, μ1) := log{(1 − π)f(y_t; σ², μ0) + π f(y_t; σ², μ1)} and f(y_t; σ², μ_j) is the conditional density with j = 0, 1. (π̂, σ̂², μ̂0, μ̂1) are the parameter values that maximize the quasi-log-likelihood function. (1, σ̃², ·, μ̃1) are the parameter values that maximize Ln under the null hypothesis that π = 1, under which μ0 is not identified. The QLR statistic is then

    QLRn = 2n{Ln(π̂, σ̂², μ̂0, μ̂1) − Ln(1, σ̃², ·, μ̃1)}

The asymptotic null distribution of QLRn is (Cho and White 2007, theorem 6(b), 1692)

    QLRn →d max[ {max(0, Ḡ)}² , sup_{μ0} {G⁻(μ0)}² ]    (2)

where G(μ0) is a Gaussian process, G⁻(μ0) := min{0, G(μ0)}, and Ḡ is a standard Gaussian random variable correlated with G(μ0). (For a more complete description of (2), see Bostwick and Steigerwald [2012].)
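To see why the second component matters, note that if the limit were only {max(0, Ḡ)}², the 5% critical value would be just Φ⁻¹(0.95)² ≈ 2.71, well below the values reported in section 4. A quick check with the Python standard library:

```python
from statistics import NormalDist

# P[{max(0, Gbar)}^2 <= c] = Phi(sqrt(c)) for c > 0,
# so the 0.95 quantile of this single term is Phi^{-1}(0.95)^2.
cv_first_term = NormalDist().inv_cdf(0.95) ** 2
```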

The critical value for a test based on the statistic QLRn thus corresponds to a quantile for the largest value over {max(0, Ḡ)}² and sup{G⁻(μ0)}². To determine this quantity, one must account for the covariance among the elements of G(μ0) as well as their covariance with Ḡ. The structure of this covariance, which is described in detail in Bostwick and Steigerwald (2012), is

    E{G(μ̄0)G(μ0)} = (e^{τ̄τ} − 1 − τ̄τ − τ̄²τ²/2) {(e^{τ̄²} − 1 − τ̄² − τ̄⁴/2)(e^{τ²} − 1 − τ² − τ⁴/2)}^{−1/2}    (3)

where τ := (μ0 − μ)/σ and τ̄ := (μ̄0 − μ)/σ measure the separation of the regime means in standard deviations.


This covariance structure governs the term sup{G⁻(μ0)}² that appears in the asymptotic null distribution. Because the regime-specific parameters enter (3) only through τ, a researcher does not need to specify the parameter space to calculate sup{G⁻(μ0)}². The only requirement is to specify the set H that contains the number of standard deviations that separate the regime means.

Finally, to fully capture the behavior of the asymptotic null distribution of QLRn, we must also account for the covariance between Ḡ and G(μ0). Cho and White (2007) show that Cov{Ḡ, G(μ0)} = (e^{τ²} − 1 − τ² − τ⁴/2)^{−1/2} τ⁴/√(4!).
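Because the regime means enter only through the standardized separations, the covariance kernel can be written as a function of τ alone. The Python below is an illustration of that structure under this reading of (3); the function names `v`, `cov_G`, and `cov_Gbar_G` are ours.

```python
import math

def v(t):
    """Variance factor e^{t^2} - 1 - t^2 - t^4/2 (tail of the series for e^{t^2})."""
    return math.exp(t * t) - 1.0 - t * t - t ** 4 / 2.0

def cov_G(t, tbar):
    """E{G(tau) G(taubar)}: the kernel in (3), as a function of the
    standardized separations t and tbar."""
    num = math.exp(t * tbar) - 1.0 - t * tbar - (t * tbar) ** 2 / 2.0
    return num / math.sqrt(v(t) * v(tbar))

def cov_Gbar_G(t):
    """Cov{Gbar, G(tau)}: the k = 4 Taylor coefficient tau^4/sqrt(4!),
    times the normalizing factor from (3)."""
    return t ** 4 / math.sqrt(24.0) / math.sqrt(v(t))
```

The unit diagonal of `cov_G` reflects that each G(τ) is standard normal.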

4.1 Syntax

rscv [, ll(#) ul(#) r(#) q(#)]

4.2 Description

rscv simulates the asymptotic null distribution of QLRn and returns the corresponding

critical value. If no options are specified, rscv returns the critical value for a size 5%

QLR test with a regime separation of 1 standard deviation calculated over 100,000

replications.

4.3 Options

ll(#) specifies a lower bound on the interval H containing the number of standard deviations separating regime means, where τ ∈ H. The default is ll(-1), meaning that the mean of regime 1 is no more than 1 standard deviation below the mean of regime 2.

ul(#) specifies an upper bound on the interval H containing the number of standard deviations separating regime means. The default is ul(1), meaning that the mean of regime 1 is no more than 1 standard deviation above the mean of regime 2.

r(#) specifies the number of simulation replications to be used in calculating the critical values. The default is r(100000), meaning that the simulation will be run 100,000 times.

q(#) specifies the quantile for which a critical value should be calculated. The default is q(0.95), which corresponds to a nominal test size of 5%.

For a QLR test with size 5%, the critical value corresponds to the 0.95 quantile of the limit distribution given on the right side of (2). Because the dependence in the process {G(μ0)} rules out an analytic expression for this quantile, we approximate it by simulating independent replications of the process. In this section, we describe the simulation process used to obtain these critical values and how each of the rscv command options affects those simulations.

Because the covariance of G(μ0) depends only on the index τ, we do not need to simulate G(μ0) directly. Instead, we simulate G^A(τ), which we construct to have the same covariance structure as G(μ0). The process G^A(τ) will therefore provide us with the correct quantile while relying solely on the index τ.

To construct G^A(τ) with the covariance structure in (3), recall that by a Taylor-series expansion, e^x = 1 + x + x²/2! + ⋯. Hence, for (ε_k)_{k=0}^∞ i.i.d. N(0, 1),

    Σ_{k=3}^∞ (τ^k/√(k!)) ε_k ∼ N(0, e^{τ²} − 1 − τ² − τ⁴/2)

We therefore set

    G^A(τ) = (e^{τ²} − 1 − τ² − τ⁴/2)^{−1/2} Σ_{k=3}^{K−1} (τ^k/√(k!)) ε_k

where K determines the accuracy of the Taylor-series approximation. Note that the covariance of this simulated process, E{G^A(τ)G^A(τ̄)}, is identical to the covariance structure of G(μ0) in (3).
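A quick Monte Carlo check of this construction (with an illustrative choice of K = 20 terms): each simulated G^A(τ) should be approximately standard normal, so its sample mean and variance should be near 0 and 1.

```python
import math
import random

K = 20          # Taylor terms retained (illustrative choice)
TAU = 1.0       # standardized separation at which to check the variance
R = 100_000     # Monte Carlo replications

rng = random.Random(12345)
# normalizing factor and Taylor coefficients tau^k / sqrt(k!), k = 3..K-1
scale = 1.0 / math.sqrt(math.exp(TAU ** 2) - 1 - TAU ** 2 - TAU ** 4 / 2)
coef = [TAU ** k / math.sqrt(math.factorial(k)) for k in range(3, K)]

draws = []
for _ in range(R):
    eps = [rng.gauss(0.0, 1.0) for _ in coef]
    draws.append(scale * sum(c * e for c, e in zip(coef, eps)))

mean = sum(draws) / R
var = sum((d - mean) ** 2 for d in draws) / R
```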

We must also account for the covariance between Ḡ and G(μ0). Cho and White (2007) establish that this covariance corresponds to the term in the Taylor-series expansion for k = 4. Thus we set Ḡ = ε4 so that Cov{Ḡ, G(μ0)} = Cov{Ḡ, G^A(τ)}. Therefore, the critical value that corresponds to (2) for a test size of 5% is the 0.95 quantile of the simulated value

    max[ {max(0, ε4)}² , max_{τ∈H} {min(0, G^A(τ))}² ]    (4)

The rscv command executes the numerical simulation of (4) by first generating the series (ε_k)_{k=0}^K i.i.d. N(0, 1). For each value of τ in a discrete subset of H, it then constructs G^A(τ) = (e^{τ²} − 1 − τ² − τ⁴/2)^{−1/2} Σ_{k=3}^{K−1} (τ^k/√(k!)) ε_k. The command then obtains the value m_i = max[{max(0, ε4)}², max_{τ∈H} {min(0, G^A(τ))}²], corresponding to (2), for each replication (indexed by i). Let (m_(i))_{i=1}^r be the vector of ordered values of m_i calculated over the replications. The command rscv then returns, as the critical value for a test of size 1 − q, the order statistic m_([qr]).

For each replication, rscv calculates G^A(τ) at a fine grid of values over the interval H. To do so requires three quantities: the interval H (which must encompass the true value of τ), the grid of values over H (given by the grid mesh), and the number of desired terms in the Taylor-series approximation, K. The user specifies the interval H using the ll() and ul() options. If μ0 is thought to lie within 3 standard deviations of μ1, the interval is H = [−3.0, 3.0]. Because the process is calculated at only a finite number of values, the accuracy of the calculated maximum increases as the grid mesh shrinks. Thus the command rscv implements a grid mesh of 0.01, as recommended in Cho and White (2007, 1693). For the interval H = [−3.0, 3.0], and with a grid mesh of 0.01, the process is calculated at the points (−3.00, −2.99, . . . , 3.00).
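In Python terms, that grid is simply:

```python
mesh = 0.01
ll, ul = -3.0, 3.0
# 601 evenly spaced points -3.00, -2.99, ..., 3.00
grid = [round(ll + i * mesh, 2) for i in range(int(round((ul - ll) / mesh)) + 1)]
```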

Given the grid mesh of 0.01 and the user-specified interval H, we must determine the appropriate value of K. To do so, we consider the approximation error

    ε_{K,τ} = (e^{τ²} − 1 − τ² − τ⁴/2)^{−1/2} Σ_{k=K}^∞ (τ^k/√(k!)) ε_k

We want to ensure that as K increases, the variance of ε_{K,τ} decreases toward zero. Carter and Steigerwald (2013) show that for large K, var(ε_{K,τ}) is of order e^{2K log|τ| − K log K}, which vanishes as K grows. Therefore, the command rscv implements a value of K such that, for the user-specified interval H, (max_{τ∈H} |τ|)²/K ≤ 1/2.
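The full procedure can be sketched compactly. The Python below is an illustrative re-implementation of the simulation just described, not the rscv code itself; the function name, the small replication count r = 1000, the coarser 0.02 mesh, and the fixed K = 20 are demonstration choices only (rscv uses a 0.01 mesh, r = 100,000, and chooses K from the rule above).

```python
import math
import random

def critical_value(ll=-1.0, ul=1.0, r=1000, q=0.95, K=20, mesh=0.02, seed=7):
    """Simulate the limit distribution of the QLR statistic and return its
    q quantile.  Gbar is identified with eps_4, and G^A(tau) is built from
    the same eps draws so the two pieces have the required covariance."""
    rng = random.Random(seed)
    n_steps = int(round((ul - ll) / mesh))
    grid = [ll + i * mesh for i in range(n_steps + 1)]
    # drop tau = 0, where G^A degenerates to 0 and contributes min(0, 0)^2 = 0
    grid = [t for t in grid if abs(t) > 1e-9]
    scales = [1.0 / math.sqrt(math.exp(t * t) - 1 - t * t - t ** 4 / 2)
              for t in grid]
    coefs = [[t ** k / math.sqrt(math.factorial(k)) for k in range(3, K)]
             for t in grid]
    m = []
    for _ in range(r):
        eps = [rng.gauss(0.0, 1.0) for _ in range(3, K)]  # eps_3 ... eps_{K-1}
        part1 = max(0.0, eps[1]) ** 2                     # eps[1] is eps_4
        part2 = max(min(0.0, s * sum(c * e for c, e in zip(cs, eps))) ** 2
                    for s, cs in zip(scales, coefs))
        m.append(max(part1, part2))
    m.sort()
    return m[int(q * r) - 1]

cv = critical_value()
```

The returned value is a simulated critical value of the same kind reported in table 1, up to the Monte Carlo noise from the small replication count.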

The rscv command also allows the user to specify the number of simulation replications and the desired quantile. For large values of H and the default number of replications (r = 100,000), the rscv command could require more memory than a 32-bit operating system can provide. In this case, the user may need to specify a smaller number of replications to calculate the critical values for the desired interval, H. Critical values derived using fewer simulation replications may be stable to only one significant digit. Table 1 depicts the results of rscv for a size 5% test over varying values of ll(), ul(), and r().

Table 1. Simulated critical values for a size 5% test

                              H = [ll(), ul()]
    Replications    [−1,1]   [−2,2]   [−3,3]   [−4,4]   [−5,5]
      100,000         4.9      5.6      6.2      6.7      7.0
       10,000         4.9      5.6      6.2      6.6      7.1

Nominal level 5%; grid mesh of 0.01.

5 Example

We demonstrate how to test for the presence of multiple regimes through an example from the economics literature. Unlike the simple model that we have considered until now, (1), the model in this example includes several added complexities that are commonly used in regime-switching applications. We describe how to construct the QLR test statistic for this more general model, how to use existing Stata commands to obtain the value of the test statistic, and, finally, how to use the new command, rscv, to obtain an appropriate critical value.

Our example is derived from Bloom, Canning, and Sevilla (2003), who test whether the large differences in income levels across countries are better explained by differences in intrinsic geography or by a regime-switching model where the regimes correspond to distinct equilibria. To this end, the authors use cross-sectional data to analyze the distribution of per capita income levels for countries with similar exogenous characteristics and test for the presence of multiple regimes.

Bloom, Canning, and Sevilla (2003) propose a model of switching between two possible equilibria. Regime 1 occurs with probability p(x) and corresponds to countries that are in a poverty-trap equilibrium,

    y = μ1 + β1 x + ε1,   Var(ε1) = σ1²    (5)

while regime 2 occurs with probability 1 − p(x) and corresponds to countries in the second equilibrium,

    y = μ2 + β2 x + ε2,   Var(ε2) = σ2²    (6)

In both regimes, y is the log gross domestic product per capita, and x is the absolute latitude, which functions as a catchall for a variety of exogenous geographic characteristics. This model differs from a Markov regime-switching model in that the authors are looking at different regimes in a cross-section rather than over time. Thus the probability of being in either regime is stationary, and the unobserved regime indicator is an i.i.d. random variable. This modification corresponds exactly to that made by Cho and White (2007) to create the quasi-log-likelihood, so in this example, the log-likelihood ratio and the QLR are one and the same.

Note that this model is more general than the basic regime-switching model presented in section 2. Bloom, Canning, and Sevilla (2003) have allowed for three generalizations: covariates with coefficients that vary across regimes; error variances that are regime specific; and regime probabilities that depend on the included covariates. However, as Carter and Steigerwald (2013) discuss, the asymptotic null distribution (2) is derived under the following assumptions: that the difference between regimes be in only the intercept μj; that the variance of the error terms be constant across regimes; and that the regime probabilities do not depend on the exogenous characteristic, x. Thus, to form the test statistic, we must fit the following two-regime model: regime 1 occurs with probability p and corresponds to

    y = μ1 + βx + ε    (5′)

while regime 2 occurs with probability 1 − p and corresponds to

    y = μ2 + βx + ε    (6′)

where Var(ε) = σ².

Simplifying the model like this does not diminish the validity of the QLR as a one-regime test for the model in (5) and (6). Under the null hypothesis of one regime, there is necessarily only one error variance, only one coefficient for each covariate, and a regime probability equal to one. Thus, under the null hypothesis, the QLR test will necessarily have the correct size even if the data are accurately modeled by a more complex system. Once the null hypothesis is rejected using this restricted model, the researcher can then fit a model with regime-specific variances and coefficients, if desired.¹

For the restricted model in (5′) and (6′), the quasi-log-likelihood is

    Ln(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^n l_t(p, σ², β, μ1, μ2)

where l_t(p, σ², β, μ1, μ2) := log{pf(y_t|x_t; σ², β, μ1) + (1 − p)f(y_t|x_t; σ², β, μ2)}, and f(y_t|x_t; σ², β, μj) is the conditional density for j = 1, 2. It is common to assume, as Bloom, Canning, and Sevilla (2003) do, that ε is a normal random variable² so that f(y_t|x_t; σ², β, μj) = (2πσ²)^{−1/2} e^{−(y_t − μj − βx_t)²/(2σ²)}. Let (p̂, σ̂², β̂, μ̂1, μ̂2) be the values that maximize Ln, and let (1, σ̃², β̃, μ̃1, ·) be the values that make Ln as large as possible under the null hypothesis of one regime, under which μ2 is not identified. The QLR statistic is then

    QLRn = 2n{Ln(p̂, σ̂², β̂, μ̂1, μ̂2) − Ln(1, σ̃², β̃, μ̃1, ·)}

To estimate QLRn, we use the same Penn World Table and CIA World Factbook data as in Bloom, Canning, and Sevilla (2003).³ First, we must determine the parameter values that maximize the quasi-log-likelihood under the null hypothesis, (1, σ̃², β̃, μ̃1, ·), and evaluate the quasi-log-likelihood at those values. To obtain these parameter values, we estimate a linear regression of y on x, which corresponds to maximizing

    Ln(1, σ², β, μ1, ·) = (1/n) Σ_{t=1}^n log{ (2πσ²)^{−1/2} e^{−(y_t − μ1 − βx_t)²/(2σ²)} }

While this can be achieved with a simple ordinary least-squares command, we also need the value of the log-likelihood, so we detail how to use Stata commands to obtain both the parameter estimates and this value.
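The null-model computation can also be sketched outside Stata. Under one regime, OLS gives the maximizing (μ1, β), the MLE of σ² is the mean squared residual (dividing by n, not n − 2), and the maximized log-likelihood has the closed form −(n/2){log(2πσ̂²) + 1}. The Python below checks this on synthetic data; the data and all numbers are illustrative, not the article's dataset.

```python
import math
import random

rng = random.Random(1)
n = 200
x = [rng.uniform(0, 60) for _ in range(n)]              # stand-in for latitude
y = [7.0 + 0.04 * xi + rng.gauss(0, 0.8) for xi in x]   # stand-in for log GDP

# OLS slope and intercept maximize the single-regime Gaussian likelihood
xbar, ybar = sum(x) / n, sum(y) / n
beta = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        / sum((xi - xbar) ** 2 for xi in x))
mu = ybar - beta * xbar

# the MLE of sigma^2 divides by n (not n - 2), matching the likelihood above
sigma2 = sum((yi - mu - beta * xi) ** 2 for xi, yi in zip(x, y)) / n

# log-likelihood two ways: summing the density terms, and in closed form
ll_sum = sum(math.log((2 * math.pi * sigma2) ** -0.5
                      * math.exp(-(yi - mu - beta * xi) ** 2 / (2 * sigma2)))
             for xi, yi in zip(x, y))
ll_closed = -n / 2 * (math.log(2 * math.pi * sigma2) + 1)
```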

1. With a more complex data-generating process, these restrictions could lead to an increased probability of failing to reject a false null hypothesis and, hence, a decrease in the power of the QLR test.

2. Bloom, Canning, and Sevilla (2003) assume normally distributed errors, but the QLR test allows for any error distribution within the exponential family.

3. Latitude data for countries appearing in the 1985 Penn World Tables and missing from the CIA World Factbook come from https://www.google.com/.


To find (1, σ̃², β̃, μ̃1, ·), we use the following code, which relies on the Stata command ml.

. program define llfsingle
  1. version 13
  2. args lnf mu beta sigma
  3. quietly replace `lnf'= (1/_N)*ln(((2*_pi*`sigma'^2)^(-1/2))*
> exp((-1/(2*`sigma'^2))*(lgdp-`mu'-`beta'*latitude)^2))
  4. end

. ml model lf llfsingle /mu /beta /sigma

. ml maximize

initial: log likelihood = -<inf> (could not be evaluated)

feasible: log likelihood = -127.9261

rescale: log likelihood = -31.297788

rescale eq: log likelihood = -2.3397622

Iteration 0: log likelihood = -2.3397622 (not concave)

Iteration 1: log likelihood = -1.5884033 (not concave)

Iteration 2: log likelihood = -1.2842957

Iteration 3: log likelihood = -1.2479471

Iteration 4: log likelihood = -1.1988284

Iteration 5: log likelihood = -1.1982503

Iteration 6: log likelihood = -1.1982487

Iteration 7: log likelihood = -1.1982487

Number of obs = 152

Wald chi2(0) = .

Log likelihood = -1.1982487 Prob > chi2 = .

                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

mu
   _cons    6.927805   1.420095     4.88   0.000     4.144469    9.711141
beta
   _cons    .0408554    .049703     0.82   0.411    -.0565607    .1382714
sigma
   _cons    .8019654   .5670752     1.41   0.157    -.3094815    1.913412

. matrix gammasingle=e(b)

We then evaluate the quasi-log-likelihood at (1, σ̃², β̃, μ̃1, ·).

. generate llf1regime=ln(((2*_pi*gammasingle[1,3]^2)^(-1/2))*

> exp((-1/(2*gammasingle[1,3]^2))*

> (lgdp-gammasingle[1,1]-gammasingle[1,2]*latitude)^2))

. quietly summarize llf1regime

. quietly replace llf1regime=r(sum)

. display "Final estimated quasi-log-likelihood for one regime: " llf1regime

Final estimated quasi-log-likelihood for one regime: -182.1338

Thus we have nLn(1, σ̃², β̃, μ̃1, ·) = −182.1338.

Second, we must determine the parameter values that maximize the quasi-log-likelihood under the alternative hypothesis of two regimes, (p̂, σ̂², β̂, μ̂1, μ̂2), and evaluate the quasi-log-likelihood at those values. Direct maximization is more difficult under the alternative hypothesis because the quasi-log-likelihood involves the log of the sum of two terms,

    Ln(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^n log{pf(y_t|x_t; σ², β, μ1) + (1 − p)f(y_t|x_t; σ², β, μ2)}

We use the expectation-maximization (EM) algorithm to overcome this difficulty. This algorithm requires iterative estimation of the latent regime probabilities, p, and maximization of the resultant log-likelihood function until parameter estimates converge. The EM algorithm proceeds as follows:

1. Choose starting guesses for the parameter values p(0), σ²(0), β(0), μ1(0), μ2(0).

2. For each observation, calculate η̂t = P(st = 1|yt, xt) such that

    η̂t = p(0) f(yt|xt; σ²(0), β(0), μ1(0)) / {p(0) f(yt|xt; σ²(0), β(0), μ1(0)) + (1 − p(0)) f(yt|xt; σ²(0), β(0), μ2(0))}

3. Use Stata's ml command to find the parameter values p(1), σ²(1), β(1), μ1(1), μ2(1) that maximize the complete log-likelihood

    LCn(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^n { η̂t log f(yt|xt; σ², β, μ1) + (1 − η̂t) log f(yt|xt; σ², β, μ2) + η̂t log p + (1 − η̂t) log(1 − p) }

4. Calculate the following three convergence criteria:

   a. max |(p(1), σ²(1), β(1), μ1(1), μ2(1)) − (p(0), σ²(0), β(0), μ1(0), μ2(0))|;
   b. |LCn(p(1), σ²(1), β(1), μ1(1), μ2(1)) − LCn(p(0), σ²(0), β(0), μ1(0), μ2(0))|; and
   c. max |∇LCn| (using numeric derivatives).

5. If all three convergence criteria are less than some tolerance level (we use 1/n), then quit and use p(1), σ²(1), β(1), μ1(1), μ2(1) as the final parameter estimates. Otherwise, repeat steps 2–5 with p(1), σ²(1), β(1), μ1(1), μ2(1) as the new starting guesses.
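The steps above can be sketched in Python. This is an illustration, not the article's ml-based implementation: because the complete log-likelihood is quadratic in (μ1, μ2, β) given η̂, the M-step here is solved in closed form by weighted least squares rather than by numeric maximization, and the data are synthetic.

```python
import math
import random

def norm_pdf(e, s2):
    return math.exp(-e * e / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def em_two_regimes(x, y, iters=200):
    """EM for y = mu1 + b*x + e (regime 1, prob p) vs y = mu2 + b*x + e
    (regime 2, prob 1-p), with common slope b and variance s2.
    The M-step is the closed-form weighted least-squares solution."""
    n = len(y)
    ybar = sum(y) / n
    mu1, mu2 = ybar - 1.0, ybar + 1.0           # crude starting guesses
    b, p = 0.0, 0.5
    s2 = sum((yi - ybar) ** 2 for yi in y) / n
    for _ in range(iters):
        # E-step: posterior probability of regime 1 for each observation
        eta = []
        for xi, yi in zip(x, y):
            f1 = norm_pdf(yi - mu1 - b * xi, s2)
            f2 = norm_pdf(yi - mu2 - b * xi, s2)
            eta.append(p * f1 / (p * f1 + (1 - p) * f2))
        # M-step: weighted least squares for (mu1, mu2, b), then s2 and p
        A = sum(eta)
        B = n - A
        sx1 = sum(e * xi for e, xi in zip(eta, x))
        sx2 = sum((1 - e) * xi for e, xi in zip(eta, x))
        sy1 = sum(e * yi for e, yi in zip(eta, y))
        sy2 = sum((1 - e) * yi for e, yi in zip(eta, y))
        sxx = sum(xi * xi for xi in x)
        sxy = sum(xi * yi for xi, yi in zip(x, y))
        b = ((sxy - sx1 * sy1 / A - sx2 * sy2 / B)
             / (sxx - sx1 ** 2 / A - sx2 ** 2 / B))
        mu1 = (sy1 - b * sx1) / A
        mu2 = (sy2 - b * sx2) / B
        s2 = sum(e * (yi - mu1 - b * xi) ** 2
                 + (1 - e) * (yi - mu2 - b * xi) ** 2
                 for e, xi, yi in zip(eta, x, y)) / n
        p = A / n
    return p, s2, b, mu1, mu2

rng = random.Random(3)
x = [rng.uniform(0, 60) for _ in range(400)]
y = [(6.0 if rng.random() < 0.7 else 10.0) + 0.04 * xi + rng.gauss(0, 0.4)
     for xi in x]
p, s2, b, mu1, mu2 = em_two_regimes(x, y)
```

On these well-separated synthetic data, the fitted intercepts land near the true values 6.0 and 10.0, with the estimated probability of the low-intercept regime near 0.7.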


The following code implements the EM algorithm to obtain (p̂, σ̂², β̂, μ̂1, and μ̂2).

. program define llfmulti
  1. version 13
  2. args lnf mu1 mu2 beta sigma p
  3. quietly replace `lnf'= (1/_N)*((1-etahat)*(ln((2*_pi*`sigma'^2)^(-1/2))+
> ((-1/(2*`sigma'^2))*(lgdp-`mu2'-`beta'*latitude)^2)+
> ln(1-`p'))+etahat*(ln((2*_pi*`sigma'^2)^(-1/2))+
> ((-1/(2*`sigma'^2))*(lgdp-`mu1'-`beta'*latitude)^2)+ln(`p')))
  4. end

. generate error=10

. generate tol=1/_N

. while error>tol {

2. quietly replace f1=((2*_pi*gammahat[1,4]^2)^(-1/2))*

> exp((-1/(2*gammahat[1,4]^2))*(lgdp-gammahat[1,1]-gammahat[1,3]*latitude)^2)

3. quietly replace f2=((2*_pi*gammahat[1,4]^2)^(-1/2))*

> exp((-1/(2*gammahat[1,4]^2))*(lgdp-gammahat[1,2]-gammahat[1,3]*latitude)^2)

4. quietly replace fboth=gammahat[1,5]*f1+(1-gammahat[1,5])*f2

5. quietly replace etahat=gammahat[1,5]*f1/fboth

6. ml model lf llfmulti /mu1 /mu2 /beta /sigma /p

7. ml init gammahat, copy

8. quietly ml maximize

9. matrix gammanew=e(b)

10. *Check for convergence using user-defined program nds

. nds

11. quietly replace error=max(nd1,nd2,nd3,nd4,nd5)

12. matrix gammahat=gammanew

13. }

. ml display

Number of obs = 152

Wald chi2(0) = .

Log likelihood = -1.4441013 Prob > chi2 = .

                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

mu1
   _cons    6.532847   1.148891     5.69   0.000     4.281062    8.784632
mu2
   _cons    7.813265    1.45266     5.38   0.000     4.966102    10.66043
beta
   _cons    .0451607   .0374139     1.21   0.227    -.0281691    .1184905
sigma
   _cons    .5986278   .4232938     1.41   0.157    -.2310128    1.428268
p
   _cons    .7708245   .4203024     1.83   0.067     -.052953    1.594602


We then evaluate the quasi-log-likelihood at (p̂, σ̂², β̂, μ̂1, μ̂2).

. quietly replace f1=((2*_pi*gammanew[1,4]^2)^(-1/2))*

> exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,1]-gammanew[1,3]*latitude)^2)

. quietly replace f2=((2*_pi*gammanew[1,4]^2)^(-1/2))*

> exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,2]-gammanew[1,3]*latitude)^2)

. generate lf2reg=gammanew[1,5]*f1+(1-gammanew[1,5])*f2

. generate llf2regime=ln(lf2reg)

. quietly summarize llf2regime

. quietly replace llf2regime=r(sum)

. display "Final estimated quasi-log-likelihood for two regimes: " llf2regime

Final estimated quasi-log-likelihood for two regimes: -179.9662

Thus we have nLn(p̂, σ̂², β̂, μ̂1, μ̂2) = −179.9662. Then, to calculate the test statistic, QLRn, we type

. generate QLR=2*(llf2reg-llf1reg)

. display "Quasi-likelihood-ratio test statistic of one regime: " QLR

Quasi-likelihood-ratio test statistic of one regime: 4.3352051

These estimates and the resulting QLR test statistic are summarized in table 2. For the

complete Stata code used to create table 2, see the appendix.

Table 2. Parameter estimates and QLR test statistic

                                      One regime        Two regimes
                                                     Regime I   Regime II
Constant (μ1, μ2)                        6.928         6.533       7.813
Latitude (β)                             0.041             0.045
Standard deviation of error (σ)          0.802             0.599
Probability of regime I (p)                                0.771
Log likelihood (nLn)                  −182.1            −180.0
QLRn                                                       4.3
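As a quick arithmetic check, the statistic is twice the gap between the two maximized quasi-log-likelihoods reported above:

```python
nLn_one = -182.1338   # n*Ln at the null (one-regime) estimates
nLn_two = -179.9662   # n*Ln at the alternative (two-regime) estimates
qlr = 2 * (nLn_two - nLn_one)   # matches the displayed 4.3352051 up to rounding
```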

Finally, we use the rscv command to calculate the critical value for the QLR test of size 5%. We allow for the possibility that the two regimes are widely separated and set H = [−5.0, 5.0]. The command and output are shown below.

. rscv, ll(-5) ul(5) r(100000) q(0.95)
7.051934397

Given that this critical value of 7.05 exceeds the QLR statistic of 4.3, we cannot reject the null hypothesis of one regime.

This result is consistent with the findings of Bloom, Canning, and Sevilla (2003), although they use a different method to obtain the necessary critical values. They report a likelihood ratio and the corresponding critical values for a restricted version of their model where the regime probabilities are fixed (p does not depend on x). Using this restricted model, the authors do not reject the null hypothesis of one regime. At the time that Bloom, Canning, and Sevilla (2003) was published, researchers had yet to successfully derive the asymptotic null distribution for a likelihood-ratio test of regime switching. Therefore, the authors use Monte Carlo methods to generate their critical values using random data generated from the estimated relationship given by the model in (5) and (6). The primary disadvantage of this approach is that the derived critical values are then dependent upon the authors' assumptions concerning the underlying data-generating process.

Bloom, Canning, and Sevilla (2003) go on to report a likelihood-ratio test of a single-regime model against the unrestricted model with latitude-dependent regime probabilities. With the unrestricted model, the authors can use the likelihood ratio and simulated critical values to reject the null hypothesis in favor of the alternative of two regimes. Because the null distribution derived by Cho and White (2007) applies only to the QLR constructed using the two-regime model given in (5′) and (6′), we cannot use the QLR test and, hence, the rscv command to obtain the critical values necessary to evaluate this unrestricted test statistic.

6 Discussion

We provide a methodology and a new command, rscv, to construct critical values for a test of regime switching for a simple linear model with Gaussian errors. Despite the complexity of the underlying methodology, rscv is relatively simple to execute and merely requires the researcher to provide a range for the standardized distance between regime means. In section 5, we demonstrate how these methods can be generalized to a very broad class of models, and we discuss the restrictions necessary to properly estimate the QLR statistic and use the rscv critical values.

7 References

Bloom, D. E., D. Canning, and J. Sevilla. 2003. Geography and poverty traps. Journal of Economic Growth 8: 355–378.

Bostwick, V. K., and D. G. Steigerwald. 2012. Obtaining critical values for test of Markov regime switching. Economics Working Paper Series qt3685g3qr, University of California, Santa Barbara. http://ideas.repec.org/p/cdl/ucsbec/qt3685g3qr.html.

Carter, A. V., and D. G. Steigerwald. 2012. Testing for regime switching: A comment. Econometrica 80: 1809–1812.

———. 2013. Markov regime-switching tests: Asymptotic critical values. Journal of Econometric Methods 2: 25–34.

Cho, J. S., and H. White. 2007. Testing for regime switching. Econometrica 75: 1671–1720.

Valerie Bostwick is currently completing a PhD in economics at the University of California,

Santa Barbara.

Douglas G. Steigerwald joined the faculty of the Department of Economics at the University

of California, Santa Barbara, after completing an MA in statistics and a PhD in economics at

the University of California, Berkeley.

Appendix

The following Stata code was used to create table 2. The code fits the model in section 5 under the alternative hypothesis of two regimes using the EM algorithm and then under the null hypothesis of one regime using the Stata ml command. Finally, the QLR test statistic is calculated.

* Estimating QLR test statistic for Bloom, Canning, and Sevilla (2003)

capture program drop llf

program define llf

version 13

args lnf theta1 theta0 delta sigma lambda

quietly replace `lnf'=(1/_N)*((1-etahat)*(ln((2*_pi*`sigma'^2)^(-1/2)) ///
+((-1/(2*`sigma'^2))*(lgdp-`theta0'-`delta'*latitude)^2)+ln(1-`lambda')) ///
+etahat*(ln((2*_pi*`sigma'^2)^(-1/2))+((-1/(2*`sigma'^2))*(lgdp-`theta1' ///
-`delta'*latitude)^2)+ln(`lambda')))

end

capture program drop llfsingle

program define llfsingle

version 13

args lnf theta delta sigma

quietly replace `lnf'= (1/_N)*ln(((2*_pi*`sigma'^2)^(-1/2))* ///
exp((-1/(2*`sigma'^2))*(lgdp-`theta'-`delta'*latitude)^2))

end

/***************************************************/

* First, estimate parameters and log likelihood for the case of two regimes:

* lgdp = theta0 + delta*latitude + u~N(0,sigma2) with probability (1-lambda)

* lgdp = theta1 + delta*latitude + u~N(0,sigma2) with probability lambda

/***************************************************/

* Start with initial guess for theta0, theta1, delta, sigma2, and lambda:

regress lgdp latitude

matrix beta=e(b)

svmat double beta, names(matcol)

scalar dhat=betalatitude

generate intercept=lgdp-dhat*latitude

summarize intercept

scalar t0hat=r(mean)-r(Var)


scalar t1hat=r(mean)+r(Var)

scalar shat=sqrt(r(Var))

scalar lhat=0.5

matrix gammahat=(t1hat, t0hat, dhat, shat, lhat)

display "Original guess for parameter values: "

matrix list gammahat

/***************************************************/

* Start loop that continues until parameter estimates have converged

generate error1=10

generate error2=10

generate error3=10

generate tol=1/_N

generate count=0

generate count1=1

generate count2=1

generate count3=1

generate f1=0

generate f0=0

generate fboth=0

generate etahat=0

generate llfhat=0

generate llfnew=0

generate fdelta=0

generate fnew=0

generate Inllfnew=0

generate Inllfdelta=0

generate nd1=0

generate nd2=0

generate nd3=0

generate nd4=0

generate nd5=0

while max(error1,error2,error3)>tol {
* Calculate f(Yt|St=1, gammahat)

quietly replace f1=((2*_pi*gammahat[1,4]^2)^(-1/2))* ///

exp((-1/(2*gammahat[1,4]^2))*(lgdp-gammahat[1,1]-gammahat[1,3]* ///

latitude)^2)

* Calculate f(Yt|St=0, gammahat)

quietly replace f0=((2*_pi*gammahat[1,4]^2)^(-1/2))* ///

exp((-1/(2*gammahat[1,4]^2))*(lgdp-gammahat[1,2]-gammahat[1,3]* ///

latitude)^2)

* Calculate f(Yt|gammahat)

quietly replace fboth=gammahat[1,5]*f1+(1-gammahat[1,5])*f0

quietly replace etahat=gammahat[1,5]*f1/fboth

/***************************************************/

* Now use etahat to create and maximize log-likelihood function

ml model lf llf /theta1 /theta0 /delta /sigma /lambda
ml init gammahat, copy

ml maximize

matrix gammanew=e(b)

/***************************************************/

* Check whether the parameter estimates have converged

mata: st_matrix("temp", max(abs(st_matrix("gammanew")-st_matrix("gammahat"))))

quietly replace error1=temp[1,1]


quietly replace llfnew=e(ll)

quietly replace llfhat=(1/_N)*((1-etahat) ///

*(ln((2*_pi*gammahat[1,4]^2)^(-1/2)) ///

+((-1/(2*gammahat[1,4]^2)) ///

*(lgdp-gammahat[1,2]-gammahat[1,3]*latitude)^2) ///

+ln(1-gammahat[1,5]))+etahat*(ln((2*_pi*gammahat[1,4]^2)^(-1/2)) ///

+((-1/(2*gammahat[1,4]^2)) ///

*(lgdp-gammahat[1,1]-gammahat[1,3]*latitude)^2) ///

+ln(gammahat[1,5])))

quietly summarize llfhat

quietly replace llfhat=r(sum)

quietly replace error2=abs(llfhat-llfnew)

* Recalculate incomplete log likelihood with new gamma estimates

quietly replace f1=((2*_pi*gammanew[1,4]^2)^(-1/2))* ///

exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,1]-gammanew[1,3]*latitude)^2)

quietly replace f0=((2*_pi*gammanew[1,4]^2)^(-1/2))* ///

exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,2]-gammanew[1,3]*latitude)^2)

quietly replace fnew=gammanew[1,5]*f1+(1-gammanew[1,5])*f0

quietly replace Inllfnew=log(fnew)

quietly summarize Inllfnew

quietly replace Inllfnew=r(sum)/_N

* Calculate incomplete log likelihood for gamma + 0.0001

forvalues i=1/5 {

matrix gammadelta=gammanew

matrix gammadelta[1,`i']=gammadelta[1,`i']+.0001

quietly replace f1=((2*_pi*gammadelta[1,4]^2)^(-1/2)) ///

*exp((-1/(2*gammadelta[1,4]^2)) ///

*(lgdp-gammadelta[1,1]-gammadelta[1,3]* ///

latitude)^2)

quietly replace f0=((2*_pi*gammadelta[1,4]^2)^(-1/2)) ///

*exp((-1/(2*gammadelta[1,4]^2)) ///

*(lgdp-gammadelta[1,2]-gammadelta[1,3]* ///

latitude)^2)

quietly replace fdelta=gammadelta[1,5]*f1+(1-gammadelta[1,5])*f0

quietly replace Inllfdelta=log(fdelta)

quietly summarize Inllfdelta

quietly replace Inllfdelta=r(sum)/_N

quietly replace nd`i'=abs(Inllfdelta-Inllfnew)/.0001

}

quietly replace error3=max(nd1,nd2,nd3,nd4,nd5)

/***************************************************/

* Keep track of when each convergence criterion is met

quietly replace count1=count1+1 if error1>tol

quietly replace count2=count2+1 if error2>tol

quietly replace count3=count3+1 if error3>tol

matrix gammahat=gammanew

quietly replace count=count+1

* End of loop

}


/***************************************************/

* Calculate final log likelihood for two regimes

quietly replace f1=((2*_pi*gammanew[1,4]^2)^(-1/2))* ///

exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,1]-gammanew[1,3]*latitude)^2)

quietly replace f0=((2*_pi*gammanew[1,4]^2)^(-1/2))* ///

exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,2]-gammanew[1,3]*latitude)^2)

generate f2reg=gammanew[1,5]*f1+(1-gammanew[1,5])*f0

generate llf2reg=ln(f2reg)

quietly summarize llf2reg

quietly replace llf2reg=r(sum)

* Output final parameter estimates

display "Final estimated parameter values for two regimes: "

matrix list gammanew

display "Final estimated log likelihood for two regimes: " llf2reg

display "Total number of loop iterations: " count

display "Parameter values converged after " count1 " iterations"

display "Log likelihood value converged after " count2 " iterations"

display "Gradient of Log likelihood converged after " count3 " iterations"

/***************************************************/

* Second, estimate parameters and log likelihood for the case of only one regime:

* lgdp = theta + delta*lat + u~N(0,sigma2)

quietly summarize intercept

matrix gamma0=(r(mean), dhat, .1)

* Maximize to find new estimate of gamma

ml model lf llfsingle /theta /delta /sigma

ml init gamma0, copy

ml maximize

matrix gammasingle=e(b)

generate llf1reg=ln(((2*_pi*gammasingle[1,3]^2)^(-1/2))* ///

exp((-1/(2*gammasingle[1,3]^2))*(lgdp-gammasingle[1,1]-gammasingle[1,2]* ///

latitude)^2))

quietly summarize llf1reg

quietly replace llf1reg=r(sum)

* Output final parameter estimates

display "Final estimated parameter values for one regime: "

matrix list gammasingle

display "Final estimated log likelihood for one regime: " llf1reg

/***************************************************/

* Finally, calculate QLR test statistic:

generate QLR=2*(llf2reg-llf1reg)

display "Quasi-likelihood-ratio test statistic of one regime: " QLR

The Stata Journal (2014)
14, Number 3, pp. 499–510

for the existence of a unique most probable

category

Bryan M. Fellman
MD Anderson Cancer Center
Houston, TX
bmfellman@mdanderson.org

Joe Ensor
MD Anderson Cancer Center
Houston, TX
joensor@mdanderson.org

Abstract. The analysis of multinomial data often includes the following question of interest: Is a particular category the most populous (that is, does it have the largest probability)? Berry (2001, Journal of Statistical Planning and Inference 99: 175–182) developed a likelihood-ratio test for assessing the evidence for the existence of a unique most probable category. Nettleton (2009, Journal of the American Statistical Association 104: 1052–1059) developed a likelihood-ratio test for testing whether a particular category was most probable, showed that the test was an example of an intersection-union test, and proposed other intersection-union tests of the same hypothesis. He extended his likelihood-ratio test to the existence of a unique most probable category and showed that his test was equivalent to the test developed by Berry (2001). Nettleton (2009) also showed that the likelihood ratio for identifying a unique most probable cell could be viewed as a union-intersection test. The purpose of this article is to survey different methods and present a command, cellsupremacy, for the analysis of multinomial data as it pertains to identifying the significantly most probable category; the article also presents a command for sample-size calculations and power analyses, power cellsupremacy, that is useful for planning multinomial data studies.

Keywords: st0348, cellsupremacy, cellsupremacyi, power cellsupremacy, most

probable category, multinomial data, cell supremacy, cell inferiority

1 Introduction

If Y1, Y2, . . . , Yk are independent Poisson-distributed random variables with means λ1, λ2, . . . , λk, then (Y1, Y2, . . . , Yk), conditional on their sum N, is multinomial(N, p1, p2, . . . , pk), where $p_i = \lambda_i / \sum_{j=1}^{k} \lambda_j$ represents the probability of the ith category. Multinomial

data are common in biological, marketing, and opinion research scenarios. In a recent

study, Price et al. (2011) used data from the 2008 National Health Interview Survey

to examine whether 18- to 26-year-old women who are most likely to benefit from

catch-up vaccination are aware of the human papillomavirus (HPV) vaccine and have

received initial and subsequent doses in the 3-dose series. The study found that the

most common reasons for lack of interest in the HPV vaccine were belief that it was not

needed (35.9%), not knowing enough about it (17.1%), concerns about safety (12.7%),

© 2014 StataCorp LP st0348

500 Cell supremacy

and not being sexually active (10.3%). These 4 responses were among the 11 possible

response categories to the survey question. Is the belief among respondents that the HPV

vaccine was not needed the unique most probable reason for lack of interest in the HPV

vaccine? Response to questionnaire-based infertility studies varies, and Morris et al.

(2013) noted that different modes of contact can affect response. Results of their study

indicated that 59% of the women surveyed preferred a mailed questionnaire, 37% chose

an online questionnaire, and only 3% selected a telephone interview as their mode of

contact. Is a mailed questionnaire the most preferred mode of contact? Are these

results signicant? The purpose of this article is to survey dierent methods and to

present a command for the analysis of multinomial data as it pertains to identifying the

signicantly most probable category; the article also presents a command for sample-size

calculations and power analyses that is useful for planning multinomial data studies.

2 Methods

Nettleton (2009) posed the test for the supremacy of a multinomial cell probability as an

intersection-union test (IUT). Suppose X = (X1 , . . . , Xk ) has a multinomial distribution

with n trials and the cell probabilities p1 , . . . , pk . The parameter p = (p1 , . . . , pk ) lies

in the set P of vectors of order k, whose components are positive and sum to one.

The tested null hypothesis states that a particular cell of interest is not more probable

than all others. Suppose the kth cell is the cell of interest; then the hypothesis can be

formulated as

\[ H_0\colon \bigcup_{i=1}^{k-1}\{p_k \le p_i\} \qquad \text{versus} \qquad H_1\colon \bigcap_{i=1}^{k-1}\{p_k > p_i\} \]

which Nettleton (2009) noted can be stated as

\[ H_0\colon p_k \le \max(p_1, \dots, p_{k-1}) \qquad \text{versus} \qquad H_1\colon p_k > \max(p_1, \dots, p_{k-1}) \]

Nettleton (2009) offered three possible asymptotic IUT statistics: the score test, the Wald test, and the likelihood-ratio test. Suppose x = (x1, . . . , xk) is a realization of X = (X1, . . . , Xk); then p̂i = xi/n, so that p̂ = (p̂1, . . . , p̂k) is the maximum likelihood estimate of p = (p1, . . . , pk). Each asymptotic IUT statistic is zero unless xk is greater than max(x1, . . . , xk−1). Nettleton (2009) also suggested a test based on the conditional distribution of Xk, given the sum of xk and m, where m = max(x1, . . . , xk−1).

The test statistic for the asymptotic score test is

\[ T_S = \begin{cases} \dfrac{n(\hat p_k - \hat p_M)^2}{\hat p_k + \hat p_M} & \text{if } \hat p_k > \hat p_M = \max(\hat p_1, \dots, \hat p_{k-1}) \\[6pt] 0 & \text{otherwise} \end{cases} \]

H0 is rejected if and only if TS ≥ χ²(1),1−2α, the {1 − 2α}th quantile of the χ² distribution with 1 degree of freedom. The approximate p-value for the test is given by Pr(χ²(1) ≥ TS | TS)/2, where χ²(1) denotes a χ² random variable with 1 degree of freedom.

The test statistic for the asymptotic Wald test is

\[ T_W = \begin{cases} \dfrac{n(\hat p_k - \hat p_M)^2}{\hat p_k + \hat p_M - (\hat p_k - \hat p_M)^2} & \text{if } \hat p_k > \hat p_M = \max(\hat p_1, \dots, \hat p_{k-1}) \\[6pt] 0 & \text{otherwise} \end{cases} \]

H0 is rejected if and only if TW ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr(χ²(1) ≥ TW | TW)/2.

The test statistic for the asymptotic likelihood-ratio test is

\[ T_{LR} = \begin{cases} 2\left\{ M \ln\dfrac{2M}{M + x_k} + x_k \ln\dfrac{2x_k}{M + x_k} \right\} & \text{if } x_k > M = \max(x_1, \dots, x_{k-1}) \\[6pt] 0 & \text{otherwise} \end{cases} \]

H0 is rejected if and only if TLR ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr(χ²(1) ≥ TLR | TLR)/2.

The conditional distribution of Xk, given m + xk, where m = max(x1, . . . , xk−1), is binomial(m + xk, 1/2). Thus a p-value for testing the null hypothesis that is valid for all n is Pr{Xk ≥ xk | xk + max(x1, . . . , xk−1)}. The conditional IUT is equivalent to a permutation test, where the p-value is expressed as

\[ \text{p-value} = \sum_{x=x_k}^{m+x_k} \binom{m+x_k}{x}\, 2^{-(m+x_k)} \]

The simulation studies by Nettleton (2009) showed that the conditional IUT based on

the binomial distribution yielded a true p-value typically less than the nominal value.

Farcomeni (2012) suggested that the exact test (that is, conditional binomial) may

be conservative and that the exact significance level may be smaller than the desired

nominal level. Farcomeni (2012) suggested using the typical continuity correction for

the binomial; namely, he recommended the mid-p value as the p-value of the test.


Using the mid-p-value approach, we see that the p-value is

\[ \text{p-value} = \binom{m+x_k}{x_k}\, 2^{-(m+x_k+1)} + \sum_{x=x_k+1}^{m+x_k} \binom{m+x_k}{x}\, 2^{-(m+x_k)} \]
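As a numerical cross-check of the formulas above, the sketch below evaluates all five supremacy tests for a vector of observed counts. It is plain Python, not part of the cellsupremacy package, and the helper name cell_supremacy is ours; the counts (45, 28, 27) are those used in the examples of section 3.4.

```python
import math

def chi2_sf(t):
    # Survival function of a chi-squared(1) variable: Pr(X >= t) = erfc(sqrt(t/2))
    return math.erfc(math.sqrt(t / 2.0))

def cell_supremacy(counts):
    xk = max(counts)                 # count of the cell of interest
    m = sorted(counts)[-2]           # largest count among the remaining cells
    n = sum(counts)
    pk, pm = xk / n, m / n
    ts = n * (pk - pm) ** 2 / (pk + pm)                    # score statistic
    tw = n * (pk - pm) ** 2 / (pk + pm - (pk - pm) ** 2)   # Wald statistic
    tlr = 2 * (m * math.log(2 * m / (m + xk))
               + xk * math.log(2 * xk / (m + xk)))         # likelihood-ratio statistic
    total = m + xk
    pmf = [math.comb(total, x) * 0.5 ** total for x in range(total + 1)]
    p_exact = sum(pmf[xk:])          # conditional binomial(m + xk, 1/2) tail
    p_mid = p_exact - 0.5 * pmf[xk]  # Farcomeni's mid-p correction
    return {"TS": ts, "TW": tw, "TLR": tlr,
            "p_score": chi2_sf(ts) / 2, "p_wald": chi2_sf(tw) / 2,
            "p_lr": chi2_sf(tlr) / 2, "p_exact": p_exact, "p_mid": p_mid}

r = cell_supremacy([45, 28, 27])
```

The statistics agree with the 3.9589, 4.1221, and 3.9955 reported by cellsupremacy in section 3.4, and doubling r["p_lr"] gives the p-value of the existence-of-a-unique-most-probable-cell test described below.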

One could formulate the test for cell inferiority (that is, that a particular cell is least probable) analogously to the test for cell supremacy, as

\[ H_0\colon p_k \ge \min(p_1, \dots, p_{k-1}) \qquad \text{versus} \qquad H_1\colon p_k < \min(p_1, \dots, p_{k-1}) \]

Farcomeni (2012) suggests using the exact test for inferiority where the sum goes from 0 to xk. That is, with m = min(x1, . . . , xk−1), the p-value for the conditional IUT for inferiority would be

\[ \text{p-value} = \sum_{x=0}^{x_k} \binom{m+x_k}{x}\, 2^{-(m+x_k)} \]

and the corresponding mid-p value would be

\[ \text{p-value} = \binom{m+x_k}{x_k}\, 2^{-(m+x_k+1)} + \sum_{x=0}^{x_k-1} \binom{m+x_k}{x}\, 2^{-(m+x_k)} \]

Alam and Thompson (1972) discussed the challenges of testing whether a particular cell is least probable from a design point of view. Nettleton (2009) showed that the likelihood-ratio test statistic could be used to test for the existence of a unique most probable cell. That is, rather than test whether a particular cell chosen a priori is the most probable, one could test whether the largest observed cell was uniquely most probable. The likelihood-ratio test statistic matches the test statistic developed by Berry (2001) and rejects H0 if and only if TLR ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr(χ²(1) ≥ TLR | TLR), where χ²(1) denotes a χ² random variable with 1 degree of freedom. That is, the p-value is twice the p-value for the test in which a particular cell chosen a priori is most probable.

2.7 Power

We consider the case of a random variable X ∼ multinomial(n, p1, . . . , pk). Without loss of generality, we will assume that pk is the maximum among the k cells. Let pM = max(p1, . . . , pk−1), attained at cell i = M, and consider the test

\[ H_0\colon p_k = p_M \qquad \text{versus} \qquad H_1\colon p_k > p_M \]

H0 is rejected when TS ≥ χ²(1),1−2α, and for xk > xM,

\[ T_S = \frac{n(\hat p_k - \hat p_M)^2}{\hat p_k + \hat p_M} = n\left\{ \frac{(\hat p_k - \hat p_0)^2}{\hat p_0} + \frac{(\hat p_M - \hat p_0)^2}{\hat p_0} \right\} \]

where p̂0 = (p̂k + p̂M)/2. Under the alternative hypothesis (pk > pM), TS is approximately distributed as a noncentral χ² with 1 degree of freedom and noncentrality parameter

\[ \lambda = n\left\{ \frac{(p_k - p_0)^2}{p_0} + \frac{(p_M - p_0)^2}{p_0} \right\} = \frac{2n(p_k - p_0)^2}{p_0} \]

where p0 = (pk + pM )/2 (Guenther 1977). For example, consider the random variable

at the = 0.05 signicance level. The null hypothesis is rejected if TS 2.70554. Solely

based on p4 and p5 , the noncentrality parameter for testing the 5th cell selected a priori

as the most probable cell is

(0.4 0.35)2

= 100 0.71429

0.35

and the approximate power is

0.71479 and 1 degree of freedom. The simulation of size 100,000 yielded a power equal

to 0.214 for this scenario. The approximation is ignorant of the distribution of the first k − 1 cells. Because p4 is three times greater than any other cell probability among the first k − 1 cells, the approximation yields a reasonable result. Now consider the random variable X ∼ multinomial(n = 50, p1 = 0, p2 = 0, p3 = 0.3, p4 = 0.3, p5 = 0.4).


We have a trinomial, and there is strong competition for the maximum among the first k − 1 cells. Because the cells of a multinomial are not independent, one would expect the distribution of the first k − 1 cells to affect the power to detect the kth cell as the most probable. The simulated power for this scenario was 0.087. Thus the approximation of power must consider the impact of the distribution of the first k − 1 cells. The correlation between any two cells of a multinomial is

\[ \rho_{a,b} = -\sqrt{\frac{p_a p_b}{(1 - p_a)(1 - p_b)}} \]

The power to detect the 5th cell as the most probable is the power that p̂5 > p̂4 and p̂5 > p̂3. Consider approximating the power by

\[ \text{power} \approx \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_k, p_M\} \times \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_k, p_N\}^{1+\rho_{M,N}} \]

where pM and pN represent the maximum and the second largest of the cell probabilities of the first k − 1 cells, respectively, and ρM,N represents the correlation between cells M and N. For our example, the approximate power is

\[ \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_5 = 0.4,\, p_4 = 0.3\} \times \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_5 = 0.4,\, p_3 = 0.3\}^{1+\rho_{4,3}} = (0.21833)(0.21833)^{1-0.42857} \approx 0.09151 \]

Applying this form of the approximation to the original example, with p1 through p3 equal to 0.1 and p4 equal to 0.3, yields an approximate power of

\[ \text{power} \approx \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_5 = 0.4,\, p_4 = 0.3\} \times \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_5 = 0.4,\, p_3 = 0.1\}^{1+\rho_{4,3}} = (0.21833)(0.91232)^{1-0.21822} \approx 0.20322 \]
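These worked approximations can be reproduced numerically. The sketch below is ours (plain Python, not the power cellsupremacy implementation); it uses the facts that the critical value 2.70554 is the 0.90 quantile of a central χ²(1) and that a noncentral χ²(1)(λ) variable is distributed as (Z + √λ)² for standard normal Z.

```python
import math

def norm_cdf(z):
    # standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def tail_prob(n, pk, pm, crit=2.70554):
    # Pr{TS >= crit} when TS ~ noncentral chi-squared(1, lambda), lambda = 2n(pk - p0)^2/p0
    p0 = (pk + pm) / 2.0
    lam = 2.0 * n * (pk - p0) ** 2 / p0
    root = math.sqrt(crit)
    return norm_cdf(math.sqrt(lam) - root) + norm_cdf(-math.sqrt(lam) - root)

def approx_power(n, pk, p_max, p_next):
    # two-factor approximation: the second factor is damped by 1 + rho (rho is negative)
    rho = -math.sqrt(p_max * p_next / ((1.0 - p_max) * (1.0 - p_next)))
    return tail_prob(n, pk, p_max) * tail_prob(n, pk, p_next) ** (1.0 + rho)

power_trinomial = approx_power(50, 0.4, 0.3, 0.3)  # competing cells 0.3 and 0.3
power_original = approx_power(50, 0.4, 0.3, 0.1)   # competing cells 0.3 and 0.1
```

The two results reproduce the 0.09151 and 0.20322 calculated above.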

Table 1 provides simulations of size 100,000 for several scenarios to investigate the adequacy of our proposed approximation. For each scenario, p6 is the cell of interest, ρ5,4 represents the correlation between the 5th and 4th cells, Sim. is the simulated power, and Approx. is our power approximation.


Scenario   p1      p2      p3      p4      p5     p6     ρ5,4     Subjects   Sim.    Approx.
    2                                                                  50     0.214    0.203
    3                                                                 200     0.520    0.519
    4                                                                1000     0.984    0.984
    5      0       0       0       0.3     0.3    0.4    0.4286       25     0.057    0.056
    6                                                                  50     0.087    0.092
    7                                                                 200     0.353    0.356
    8                                                                1000     0.971    0.974
    9      0.0626  0.0625  0.0625  0.0625  0.25   0.5    0.1491       25     0.413    0.384
   10                                                                  50     0.664    0.651
   11                                                                 200     0.994    0.993
   12                                                                1000     1.000    1.000
   13      0       0       0       0.25    0.25   0.5    0.3333       25     0.260    0.237
   14                                                                  50     0.504    0.493
   15                                                                 200     0.989    0.988
   16                                                                1000     1.000    1.000
   17      0.05    0.05    0.05    0.05    0.2    0.6    0.1147       25     0.747    0.698
   18                                                                  50     0.953    0.935
   19                                                                 200     1.000    1.000
   20                                                                1000     1.000    1.000
   21      0       0       0       0.2     0.2    0.6    0.2500       25     0.631    0.567
   22                                                                  50     0.915    0.890
   23                                                                 200     1.000    1.000
   24                                                                1000     1.000    1.000
   25      0.1     0.1     0.1     0.1     0.2    0.4    0.1667       25     0.257    0.265
   26                                                                  50     0.550    0.530
   27                                                                 200     0.981    0.978
   28                                                                1000     1.000    1.000
   29      0       0       0.2     0.2     0.2    0.4    0.2500       25     0.143    0.170
   30                                                                  50     0.326    0.376
   31                                                                 200     0.953    0.961
   32                                                                1000     1.000    1.000

2.8 Conclusions

Nettleton (2009) suggested that the asymptotic procedures are preferred for moderate to large sample sizes based on simulations, but the IUT based on conditional tests is a useful option when a small sample size casts doubt on the validity of the asymptotic procedures. Our power simulations also tend to suggest that the power approximation works best for moderate to large sample sizes. Scenarios 29–32 present a slightly more complex problem with three cells vying for the top spot among the first cells. For these scenarios, our power approximation yields slightly liberal results because the approximate power is consistently larger than the simulated power. Under this scenario, the power to detect the 6th cell as the most probable is the power that p̂6 > p̂5, p̂6 > p̂4, and p̂6 > p̂3. Thus one could improve the approximation by considering the added competition for supremacy among the first k − 1 cells. That is, for n = 200, the approximate power is


\[ \begin{aligned} \text{power} \approx{}& \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_6 = 0.4,\, p_5 = 0.2\} \\ &\times \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_6 = 0.4,\, p_4 = 0.2\}^{1+\rho_{5,4}} \\ &\times \Pr\{T_S \ge \chi^2_{(1),1-2\alpha} \mid p_6 = 0.4,\, p_3 = 0.2\}^{1+2\rho_{5,4}} \\ ={}& (0.97761)(0.97761)^{1-0.25}(0.97761)^{1-0.50} \approx 0.95032 \end{aligned} \]

which compares favorably with the simulated power. However, we believe that for most real-world problems, considering the impact of the top two cell probabilities among the first k − 1 cells is sufficient.

3 The cellsupremacy, cellsupremacyi, and power cellsupremacy commands

3.1 Syntax

cellsupremacy varname [weight]

cellsupremacyi, counts(numlist)

power cellsupremacy, freq(numlist) n(#) [simulate dots reps(#) alpha(#)]

counts(numlist) specifies the cell counts for each category of the variable of interest. counts() is required.

freq(numlist) specifies the frequency of cells for each category of the variable of interest. freq() is required.

n(#) specifies the number of observations. n() is required.

simulate calculates the simulated power and the approximate power. When not specified, only the approximated power is calculated.


dots shows the replication dots when using the simulate option.

reps(#) specifies the number of simulations used to calculate the power. The default is reps(10000).

alpha(#) specifies the alpha that is used for calculating the power. The default is alpha(0.05).

3.4 Examples

Suppose we are studying breast cancer and we find that the distribution of subtypes is a trinomial distribution with HER2+, HR+, and TNBC. In our data, we find that patients with leptomeningeal disease were more likely to be HER2+ (45%). We are interested in knowing whether this particular category is the most populous (that is, does it have the largest probability of occurring?). The following example will generate a sample dataset and illustrate the use of the new command to answer this question.

. set obs 100

obs was 0, now 100

. generate subtype = "HER2+" in 1/45

(55 missing values generated)

. replace subtype = "HR+" in 46/73

(28 real changes made)

. replace subtype = "TNBC" in 74/100

(27 real changes made)

. tab subtype

     subtype |      Freq.     Percent        Cum.
-------------+-----------------------------------
       HER2+ |         45       45.00       45.00
         HR+ |         28       28.00       73.00
        TNBC |         27       27.00      100.00
-------------+-----------------------------------
       Total |        100      100.00

. cellsupremacy subtype

TESTS FOR CELL SUPREMACY

Category HER2+ had the largest observed frequency.

TESTING WHETHER CATEGORY HER2+ SELECTED A PRIORI IS MOST PROBABLE.

Quantity Score Wald LR Binomial Mid-P

-----------------------------------------------------------------

Test Statistic 3.9589 4.1221 3.9955

p-value 0.0233 0.0212 0.0228 0.0302 0.0237

TEST FOR THE EXISTENCE OF A MOST PROBABLE CELL

Quantity LR

-------------------------

Test Statistic 3.9955

p-value 0.0456

TESTS FOR CELL INFERIORITY

Category TNBC had the smallest observed frequency.

TESTING WHETHER CATEGORY TNBC SELECTED A PRIORI IS LEAST PROBABLE.

Quantity Binomial Mid-P

---------------------------------------------

p-value 0.5000 0.4469


The p-values for all tests are less than 0.05, which indicates that HER2+ is the most probable. The test for the existence of a most probable cell is also significant. On the other hand, if we were interested in cell inferiority (least probable), we would not reject our hypothesis because our p-values are approximately 0.50. Below is another example with a slightly different distribution than before.

. clear

. set obs 100

obs was 0, now 100

. generate subtype = "HER2+" in 1/45

(55 missing values generated)

. replace subtype = "HR+" in 46/85

(40 real changes made)

. replace subtype = "TNBC" in 86/100

(15 real changes made)

. tab subtype

     subtype |      Freq.     Percent        Cum.
-------------+-----------------------------------
       HER2+ |         45       45.00       45.00
         HR+ |         40       40.00       85.00
        TNBC |         15       15.00      100.00
-------------+-----------------------------------
       Total |        100      100.00

. cellsupremacy subtype

TESTS FOR CELL SUPREMACY

Category HER2+ had the largest observed frequency.

TESTING WHETHER CATEGORY HER2+ SELECTED A PRIORI IS MOST PROBABLE.

Quantity Score Wald LR Binomial Mid-P

-----------------------------------------------------------------

Test Statistic 0.2941 0.2950 0.2943

p-value 0.2938 0.2935 0.2937 0.3323 0.2950

TEST FOR THE EXISTENCE OF A MOST PROBABLE CELL

Quantity LR

-------------------------

Test Statistic 0.2943

p-value 0.5875

TESTS FOR CELL INFERIORITY

Category TNBC had the smallest observed frequency.

TESTING WHETHER CATEGORY TNBC SELECTED A PRIORI IS LEAST PROBABLE.

Quantity Binomial Mid-P

---------------------------------------------

p-value 0.0005 0.0003

Because HER2+ and HR+ have similar frequencies, we cannot conclude that HER2+ is the most probable. In this case, we can conclude that TNBC is the least probable cell. The above examples can both be implemented by entering the raw counts directly: cellsupremacyi, counts(45 28 27) or cellsupremacyi, counts(45 40 15), respectively.


To illustrate how to use the power cellsupremacy command to calculate the power of the test, we consider the examples in section 2.7 for testing cell supremacy for the random variables

X ∼ multinomial(n = 50, p1 = 0, p2 = 0, p3 = 0.3, p4 = 0.3, p5 = 0.4)

and

Y ∼ multinomial(n = 50, p1 = 0.1, p2 = 0.1, p3 = 0.1, p4 = 0.3, p5 = 0.4)

. clear

. set seed 339487731

. power_cellsupremacy, simulate freq(0 0 0.3 0.3 0.4) n(50)

Simulations (10000)

N Simulated Power Approximate Power

50 0.0898 0.0915

. power_cellsupremacy, simulate freq(0.1 0.1 0.1 0.3 0.4) n(50)

Simulations (10000)

N Simulated Power Approximate Power

50 0.2121 0.2032

4 Acknowledgment

This research is supported in part by the National Institutes of Health through MD Anderson's Cancer Center Support Grant CA016672.

5 References

Alam, K., and J. R. Thompson. 1972. On selecting the least probable multinomial event. Annals of Mathematical Statistics 43: 1981–1990.

Berry. 2001. Journal of Statistical Planning and Inference 99: 175–182.

Farcomeni, A. 2012. …with application to biting preferences of loggerhead marine turtles. Communications in Statistics - Theory and Methods 41: 34–45.

Guenther, W. C. 1977. Power and sample size for approximate chi-square tests. American Statistician 31: 83–85.

Morris et al. 2013. …survey responded more by mail but preferred a choice: Randomized controlled trial. Journal of Clinical Epidemiology 66: 226–235.

Nettleton, D. 2009. Testing for the supremacy of a multinomial cell probability. Journal of the American Statistical Association 104: 1052–1059.

Price, R. A., J. A. Tiro, M. Saraiya, H. Meissner, and N. Breen. 2011. Use of human papillomavirus vaccines among young adult women in the United States: An analysis of the 2008 National Health Interview Survey. Cancer 117: 5560–5568.

Bryan Fellman is a research statistical analyst in the Department of Biostatistics at the University of Texas MD Anderson Cancer Center.

Joe Ensor is a research statistician in the Department of Biostatistics at the University of Texas MD Anderson Cancer Center.

The Stata Journal (2014)

14, Number 3, pp. 511–540

Jonas Björnerstedt Frank Verboven

Swedish Competition Authority University of Leuven

Stockholm, Sweden Leuven, Belgium

jonas@bjornerstedt.org frank.verboven@kuleuven.be

as a postestimation command, that is, after estimating an aggregate nested logit

demand system with a linear regression model. We also show how to implement

merger simulation when the demand parameters are not estimated but instead calibrated to be consistent with outside information on average price elasticities and profit margins. We allow for a variety of extensions, including the role of (marginal) cost savings, remedies (divestiture), and conduct different from Bertrand–Nash behavior.

Keywords: st0349, mergersim, merger simulation, aggregate nested logit model,

unit demand and constant expenditures demand

1 Introduction

Competition and antitrust authorities have long been concerned with the possible anticompetitive effects of mergers. This is in particular the case for horizontal mergers, which are mergers between firms selling substitute products. The traditional concern has been that such mergers raise market power, which may hurt consumers and reduce total welfare (the sum of producer and consumer surplus). At the same time, however, it has been recognized that mergers may also result in cost savings or other efficiencies. While such cost savings may often be insufficient to reduce prices and benefit consumers, it has been shown that even small cost savings can be sufficient to raise total welfare (see Williamson [1968] and Farrell and Shapiro [1990]).1 Despite the possible total welfare gains, most competition authorities in practice take a consumer surplus standard when evaluating proposed mergers.

Merger simulation is increasingly used as a tool to evaluate the effects of horizontal mergers. Consistent with policy practice, the focus is often on the price and consumer surplus effects, but various applications also evaluate the effects on total welfare.2 Merger simulation aims to predict the merger effects in the following three steps.

1. According to Williamson's (1968) analysis, the deadweight loss from the output reduction after the merger is a second-order effect that is easily compensated by the cost savings from the merger. However, Posner (1975) argues that there is another source of inefficiency from mergers because firms must spend wasteful resources to make a merger and maintain market power. In this alternative view, it may be more natural to use consumer surplus as a standard to evaluate mergers and to ignore the transfer from consumers to firms.

2. Early contributions to the merger simulation literature are Werden and Froeb (1994), Nevo

(2000), Epstein and Rubinfeld (2002), and Ivaldi and Verboven (2005). For a recent survey, see

Budzinski and Ruhmer (2010).

© 2014 StataCorp LP st0349

512 Merger simulation

The first step specifies and estimates a demand system, usually one with differentiated products. The second step makes an assumption about the firms' equilibrium behavior, typically multiproduct Bertrand–Nash, to compute the products' current profit margins and their implied marginal costs. The third step usually assumes that marginal costs are constant and computes the postmerger price equilibrium, accounting for increased market power, cost efficiencies, and perhaps remedies (such as divestiture). This enables one to compute the merger's effect on prices, consumer surplus, producer surplus, and total welfare. Stata is often used to estimate the demand system (the first step) but not to implement a complete merger simulation (including the second and third steps). In this article, we show how to implement merger simulation in Stata as a postestimation command, that is, after estimating the parameters of a demand system for differentiated products. We also illustrate how to perform merger simulation when the demand parameters are not estimated but rather calibrated to be consistent with outside industry information on price elasticities and profit margins. We allow for a variety of extensions, including the role of (marginal) cost savings, remedies (divestiture), and conduct different from Bertrand–Nash behavior.

We consider an oligopoly model with multiproduct price-setting firms that may partially collude and have constant marginal cost. Following Berry (1994), we specify the demand system as an aggregate nested logit model, which can be estimated with market-level data using linear regression methods (as opposed to the individual-level nested logit model). We consider both a unit demand specification, as in Berry (1994) and Verboven (1996), and a constant expenditures specification, as in Björnerstedt and Verboven (2013). The model requires a dataset on products sold in one market, or in a panel of markets, with information on the products' prices, their quantities sold, firm and nest identifiers, and possibly other product characteristics.

In section 2, we discuss the merger simulation model, including the nested logit demand system. In section 3, we introduce the commands required to carry out the merger simulation. Section 4 provides examples, and section 5 concludes.

2 Merger simulation with a nested logit demand system

2.1 Merger simulation

Suppose there are J products, indexed by j = 1, . . . , J. The demand for product j is qj(p), where p is a J × 1 price vector, and its marginal cost is constant and equal to cj. Each firm f owns a subset of products Ff and chooses the prices of its own products j ∈ Ff to maximize

\[ \pi_f(p) = \sum_{j \in F_f} (p_j - c_j)\, q_j(p) + \kappa \sum_{j \notin F_f} (p_j - c_j)\, q_j(p) \]

where κ ∈ (0, 1) is a conduct parameter to allow for the possibility that firms partially coordinate. If κ = 0, firms behave noncooperatively as multiproduct firms. If κ = 1,

firms fully collude and maximize joint profits. The price equilibrium is defined by the following system of first-order conditions:

\[ q_j(p) + \sum_{k \in F_f} (p_k - c_k)\,\frac{\partial q_k(p)}{\partial p_j} + \kappa \sum_{k \notin F_f} (p_k - c_k)\,\frac{\partial q_k(p)}{\partial p_j} = 0, \qquad j = 1, \dots, J \quad (1) \]

Define the J × J product-ownership matrix Θ, with element Θ(j, k) = 1 if products j and k are

produced by the same firm and Θ(j, k) = κ otherwise. If κ = 0 (no collusion), Θ becomes the usual block-diagonal matrix; if all firms own only one product, Θ becomes the identity

matrix. Furthermore, let q(p) be the J × 1 demand vector, Ω(p) ≡ ∂q(p)/∂p′ be the J × J Jacobian of first derivatives, and c be the J × 1 marginal cost vector. We can then write (1) in vector notation as

\[ q(p) + \{\Theta \circ \Omega(p)\}(p - c) = 0 \]

where ∘ denotes element-by-element multiplication.

This can be inverted to write price as the sum of marginal cost and a markup, where the markup term (inversely) depends on the price elasticities and on the product-ownership matrix:

\[ p = c - \{\Theta \circ \Omega(p)\}^{-1} q(p) \quad (2) \]

For single-product firms with no collusion (κ = 0), the markup term is price divided by the own-price elasticity of demand. With multiproduct firms and partial collusion, the cross-price elasticities also matter, and this increases the markup term (if products are substitutes).

Equation (2) serves two purposes. First, it can be rewritten to uncover the premerger marginal cost vector c based on the premerger prices and estimated price elasticities of demand; that is,

\[ c^{\text{pre}} = p^{\text{pre}} + \{\Theta^{\text{pre}} \circ \Omega(p^{\text{pre}})\}^{-1} q(p^{\text{pre}}) \]

Second, (2) can be used to predict the postmerger equilibrium. The merger involves two possible changes: a change in the product-ownership matrix from Θpre to Θpost and, if there are efficiencies, a change in the marginal cost vector from cpre to cpost. To simulate the new price equilibrium, one may use fixed-point iteration on (2), possibly with a dampening parameter in the markup term, or another algorithm such as the Newton method (see, for example, Judd [1998, 633]).
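To make the fixed-point iteration on (2) concrete, here is a self-contained sketch in plain Python rather than Stata. It is not part of mergersim, and every parameter value is made up for illustration: a simple logit demand (no nests) with α = −1, three symmetric products with unit marginal costs, and market size normalized to one. Products 1 and 2 merge, and prices are recomputed from p = c − {Θ ∘ Ω(p)}⁻¹ q(p).

```python
import math

def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

ALPHA = -1.0             # price coefficient (illustrative)
DELTA = [1.0, 1.0, 1.0]  # mean product valuations
COST = [1.0, 1.0, 1.0]   # constant marginal costs

def demand(p):
    """Simple logit market shares with an outside good (market size normalized to 1)."""
    e = [math.exp(d + ALPHA * pj) for d, pj in zip(DELTA, p)]
    denom = 1.0 + sum(e)
    return [ej / denom for ej in e]

def equilibrium(theta, p_start, iters=1000):
    """Fixed-point iteration on p = c - (theta .* Omega(p))^(-1) q(p)."""
    p = p_start[:]
    for _ in range(iters):
        q = demand(p)
        # Jacobian of logit demand: dq_j/dp_k = alpha * q_j * (1{j=k} - q_k)
        omega = [[ALPHA * q[j] * ((1.0 if j == k else 0.0) - q[k])
                  for k in range(len(p))] for j in range(len(p))]
        tw = [[theta[j][k] * omega[j][k] for k in range(len(p))] for j in range(len(p))]
        p = [c - m for c, m in zip(COST, solve(tw, q))]
    return p

pre = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # three single-product firms
post = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # products 1 and 2 now jointly owned
p_pre = equilibrium(pre, COST)
p_post = equilibrium(post, p_pre)
```

The merged products' prices rise above their premerger level, and the outside firm also raises its price slightly, because prices are strategic complements under logit demand.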

The demand system q = q(p) for the J products, j = 1, . . . , J, is specified as a nested logit model with two levels of nests, referred to as groups and subgroups. This model belongs to McFadden's (1978) generalized extreme value discrete choice model. Consumers choose the alternative that maximizes random utility, which results in a specification for choice probabilities for each alternative. The nested logit model relaxes the independence of irrelevant alternatives property of the simple logit model and allows consumers to have correlated preferences for products that belong to the same subgroup or group. While discrete choice models were initially developed to analyze individual-level


data (see Train [2009] for an overview), Berry (1994) and Berry, Levinsohn, and Pakes (1995) show how to estimate the models with aggregate data. The dataset consists of J × 1 vectors of the products' quantities q and prices p, and a J × K matrix of product characteristics x, including indicator variables for the products' subgroup and group and their firm affiliation. The dataset is for either one market or a panel of markets, for example, different years or different regions and countries. The panel is not necessarily balanced, because new products may be introduced over time, or old products may be eliminated, and not all products may be for sale in all regions.

In addition to each product j's quantity sold qj, its price pj, and the vector of product characteristics xj, it is necessary to observe (or estimate) the potential market size for the differentiated products. In the common unit demand specification of the nested logit, consumers have inelastic conditional demands: they buy either a single unit of their most preferred product j = 1, . . . , J or the outside good j = 0. The potential market size is then the potential number of consumers I, for example, an assumed fraction λ of the observed population L in the market, I = λL. An alternative is the constant expenditures specification, where consumers have unit elastic conditional demand: they buy a constant expenditure of their preferred product or the outside good. Here the potential market size is the potential total budget B, for example, an assumed fraction λ of total gross domestic product Y in the market, B = λY.

As shown by Berry (1994) and the extensions by Verboven (1996) and Björnerstedt and Verboven (2013), the aggregate two-level nested logit model gives rise to the following linear estimating equation for a cross section of products j = 1, . . . , J:

\[ \ln(s_j/s_0) = x_j\beta + \alpha \tilde p_j + \sigma_1 \ln(s_{j|hg}) + \sigma_2 \ln(s_{h|g}) + \xi_j \quad (3) \]

The price variable is p̃j = pj in the unit demand specification and p̃j = ln(pj) in the constant expenditures specification. The variable sj is the market share of product j in the potential market, sj|hg is the market share of product j in its subgroup h of group g, and sh|g is the market share of subgroup h in group g. More precisely, as discussed in more detail in Björnerstedt and Verboven (2013), the market shares are quantity shares in the unit demand specification

\[ s_j = \frac{q_j}{I}, \qquad s_{j|hg} = \frac{q_j}{\sum_{j \in H_{hg}} q_j}, \qquad s_{h|g} = \frac{\sum_{j \in H_{hg}} q_j}{\sum_{h=1}^{H_g} \sum_{j \in H_{hg}} q_j} \]

and value shares in the constant expenditures specification

\[ s_j = \frac{p_j q_j}{B}, \qquad s_{j|hg} = \frac{p_j q_j}{\sum_{j \in H_{hg}} p_j q_j}, \qquad s_{h|g} = \frac{\sum_{j \in H_{hg}} p_j q_j}{\sum_{h=1}^{H_g} \sum_{j \in H_{hg}} p_j q_j} \]
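To make the share definitions concrete, here is a small sketch of ours in plain Python (mergersim computes these internally from the dataset; the quantities, group labels, and potential market size below are made-up numbers) that builds sj, sj|hg, and sh|g under unit demand for four products in one market:

```python
# four products: quantity q, group g, subgroup h
products = [
    {"id": 1, "q": 20.0, "g": "A", "h": "A1"},
    {"id": 2, "q": 10.0, "g": "A", "h": "A1"},
    {"id": 3, "q": 30.0, "g": "A", "h": "A2"},
    {"id": 4, "q": 15.0, "g": "B", "h": "B1"},
]
I = 100.0  # assumed potential number of consumers

# total quantity by subgroup and by group
q_sub, q_grp = {}, {}
for p in products:
    q_sub[p["h"]] = q_sub.get(p["h"], 0.0) + p["q"]
    q_grp[p["g"]] = q_grp.get(p["g"], 0.0) + p["q"]

shares = {}
for p in products:
    shares[p["id"]] = {
        "s_j": p["q"] / I,                       # share in the potential market
        "s_j_hg": p["q"] / q_sub[p["h"]],        # share within subgroup h of group g
        "s_h_g": q_sub[p["h"]] / q_grp[p["g"]],  # subgroup share within group g
    }
s0 = 1.0 - sum(p["q"] for p in products) / I     # outside-good share
```

For product 1, for example, s_j = 0.20, s_j|hg = 20/30, and s_h|g = 30/60; the outside good absorbs the remaining 25% of the potential market.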

Furthermore, in (3), xj is a vector of observed product characteristics, and ξj is the error term, which captures the product's quality that is unobserved to the econometrician. Equation (3) has the following parameters to be estimated: a vector β of mean valuations for the observed product characteristics, a price parameter α < 0, and two nesting parameters σ1 and σ2, which measure the consumers' preference correlation for products in the same subgroup and group. The model reduces to a one-level nested logit model with only subgroups as nests if σ2 = 0, to a one-level nested logit model with only groups as nests if σ1 = σ2, and to a simple logit model without nests if σ1 = σ2 = 0. The mean gross valuation for product j is defined as

\[ \delta_j \equiv x_j\beta + \xi_j = \ln(s_j/s_0) - \alpha \tilde p_j - \sigma_1 \ln(s_{j|hg}) - \sigma_2 \ln(s_{h|g}) \]

so it can be computed from the product's market share, price, and the parameters α, σ1, and σ2.

In sum, the aggregate nested logit model is essentially a linear regression of the
products' market shares on price, product characteristics, and (sub)group shares. In
the unit demand specification, price enters linearly and market shares are in volumes; in
the constant expenditures specification, price enters logarithmically and market shares
are in values. In both cases, the unobserved product characteristics term, ξ_j, may
be correlated with price and market shares, so instrumental variables should be used.
Cost shifters would qualify as instruments, but these are typically not available at
the product level. Berry, Levinsohn, and Pakes (1995) suggest using sums of the other
products' characteristics (over the firm and the entire market). For the nested logit
model, Verboven (1996) adds sums of the other products' characteristics by subgroup
and group.
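As an illustrative sketch (not part of mergersim), such instrument sums can be built with egen; the variable names sum_hp_seg and iv_hp_seg are assumptions based on the car dataset used later in the article:

```stata
. egen sum_hp_seg = total(horsepower), by(yearcountry segment)
. generate iv_hp_seg = sum_hp_seg - horsepower  // sum of rival products' horsepower within the group
```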

Various mergersim subcommands implement merger simulation, either as commands
before and after a linear nested logit regression to estimate α, σ1, and σ2 or as stand-alone
commands where α, σ1, and σ2 are specified by the user. With a panel dataset, one
must time set the dataset before invoking the mergersim commands by using xtset id
time or tsset id time, where id is the unique product identifier within the market,
and time is the market identifier (time and region). Time setting is not required with a
dataset for one market.

3.1 Syntax

mergersim init [if] [in], marketsize(varname)
    {quantity(varname) | price(varname) | revenue(varname)} [nests(varlist)
    unitdemand cesdemand alpha(#) sigmas(# #) name(string)]

mergersim market [if] [in], firm(varname) [conduct(#) name(string)]

516 Merger simulation

mergersim simulate [if] [in], firm(varname) {buyer(#)
    seller(#) | newfirm(varname)} [conduct(#) name(string) buyereff(#)
    sellereff(#) efficiencies(varname) newcosts(varname) newconduct(#)
    method(fixedpoint | newton) maxit(#) dampen(#) keepvars detail]

mergersim mre [if] [in], {buyer(#) seller(#) | newfirm(varname)}
    [name(string)]

3.2 Options

Demand and market specification

The demand and market specification are set in mergersim init and mergersim market
(and in mergersim simulate if mergersim market is not explicitly invoked by the
user).

marketsize(varname) specifies the potential size of the market (the total number of
potential buyers in the unit demand specification, the total potential budget in the
constant expenditures specification). marketsize() is required with mergersim init.

Any two of price(), quantity(), or revenue() are required.

quantity(varname) specifies the quantity variable.

price(varname) specifies the price variable.

revenue(varname) specifies the revenue variable.

nests(varlist) specifies one or two nesting variables. The outer nest is specified first.
If only one variable is specified, a one-level nested logit model applies. If the option
is not specified, a simple logit model applies.

unitdemand specifies the unit demand specification (the default).

cesdemand specifies the constant expenditures specification rather than the default
unit demand specification.

alpha(#) specifies a value for the alpha parameter rather than using an estimate. Note
that this option has no effect if mergersim market has been run.

sigmas(# #) specifies values for the sigma parameters rather than using estimates.
In the two-level nested logit, the first sigma corresponds to the log share of
the product in the subgroup, and the second corresponds to the log share of the
subgroup in the group.

name(string) specifies a name for the simulation. Variables created will have the
specified name followed by an underscore character rather than the default M_. This
option can be used with all the mergersim subcommands.


firm(varname) specifies the integer variable indexing the firm owning the product.
firm() is required with mergersim market and mergersim simulate.

conduct(#) measures the fraction of the competitors' profits that firms take into account
when setting their own prices. It gives the degree of joint profit maximization
between firms before the merger (a number between 0 and 1).

Merger specification

Either the identity of the buyer and seller firms or the new ownership structure is required.
The identity corresponds to the value in the variable specified with the firm() option.

buyer(#) specifies the buyer ID in the firm variable.

seller(#) specifies the seller ID in the firm variable.

newfirm(varname) specifies postmerger ownership in more detail than the buyer() and
seller() options. For example, it can be used to simulate divestitures or two cumulative
mergers by manually constructing a new firm ownership variable that differs from
the firm variable specified with the firm() option.

Efficiency gains, in terms of the percentage reduction in marginal costs, can be specified
either for all seller and buyer products, using the buyereff() and sellereff() options, or
product by product, with the efficiencies() option.

buyereff(#) specifies the efficiency gain for all products of the buyer firm after the
merger. A value of 0 indicates no efficiency gain. The default is buyereff(0). For
example, to incorporate a 10% efficiency gain, specify the buyereff(0.1) option.

sellereff(#) specifies the efficiency gain for all products of the seller firm after the
merger.

efficiencies(varname) specifies a variable for efficiency gains more generally (that
is, product by product), where, for example, 0.2 is a 20% decrease in marginal costs,
and 0 is no change.

newcosts(varname) specifies a variable for postmerger costs.

newconduct(#) specifies the degree of joint profit maximization between firms after
the merger, in percentage terms. With a conduct value of 1, the profits of other
firms are as important as own profits.
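A minimal sketch of supplying product-level efficiencies; the variable name eff and the 10% saving for firm 26 are illustrative assumptions, not a run from the article:

```stata
. generate eff = 0
. replace eff = 0.1 if firm == 26   // assumed 10% marginal cost saving on firm 26's products
. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26) efficiencies(eff)
```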

Computation

The computation options can be set in mergersim simulate, where the postmerger
Nash equilibrium is computed.

method(fixedpoint | newton) specifies the method used to compute the postmerger
Nash equilibrium. The default is
method(newton). The Newton method starts with one iteration of the fixedpoint
method.

maxit(#) specifies the maximum number of iterations in the solver methods.

dampen(#) specifies an initial dampening factor lower than the default dampen(1) in
the fixed-point method. If fixedpoint does not converge, the method automatically
tries a dampening factor of half the initial dampening.

keepvars specifies that all generated variables be kept after simulation, calculation
of elasticities, or calculation of minimum required efficiencies.

detail shows market shares in mergersim simulate. These market shares are relative
to total sales (excluding the outside good). Market shares are in terms of volume
for the unit demand specification and in terms of value for the constant expenditures
specification. Changes in consumer and producer surplus and in the Herfindahl–
Hirschman index are also displayed.

3.3 Description

mergersim performs a merger simulation with the subcommands init, market, and
simulate. mergersim init must be invoked first to initialize the settings. mergersim
market calculates the price elasticities and marginal costs. mergersim simulate performs
a merger simulation, automatically invoking mergersim market if that command
has not been called by the user. In addition to displaying results, mergersim creates
various variables at each step. By default, the names of these variables begin with M_.

First, mergersim init initializes the settings for the merger simulation. It is required
before estimation and before a first merger simulation. It defines the upper and
lower nests; the specification (unit demand or constant expenditures demand); the price,
quantity, and revenue variables (two out of three); the potential market size variable;
and the firm identifier (a numerical variable). It also generates the variables necessary
to estimate the demand parameters (alpha and sigmas) using a linear (nested) logit
regression, similar to Berry (1994) and the extensions of Björnerstedt and Verboven
(2013). The names of the market share and price variables to use in the regression
depend on the demand specification and are shown in the display output of mergersim
init. Alternatively, the demand parameters can be calibrated with the alpha() and
sigmas() options rather than being estimated.

Second, mergersim market computes the premerger conditions: the gross valuations
δ_j and marginal costs c_j of each product j, under assumptions regarding the
degree of coordination. The computations are based on the last estimates of α, σ1,
and σ2 unless these are overruled by values specified by the user in the alpha() and
sigmas() options. mergersim market is required after mergersim init and before
the first mergersim simulate. It is not necessary to specify mergersim market before
additional mergersim simulate calls (unless one wants to specify new premerger values of
δ_j and c_j).

Third, mergersim simulate computes the postmerger prices and quantities under
assumptions regarding the identity of the merged firms, their cost efficiencies, and the
degree of collusion (the same as before the merger). It is possible to repeat the command
multiple times after estimation.

In addition to these three main subcommands, several other subcommands provide
useful information. For example, mergersim mre computes the minimum required
efficiencies per product for the price not to increase after the merger. It can be invoked
after mergersim init.

4 Examples

4.1 Preparing the data

To demonstrate mergersim, we use the dataset on the European car market collected by
Goldberg and Verboven (2001) and maintained on their webpages.3 We take a reduced
version of that dataset with fewer variables and a slightly more aggregate firm definition;
the dataset is called cars1.dta. Each observation comprises a car model, year, and
country. The total number of observations is 11,483: there are 30 years (1970–1999) and
5 countries (Belgium, France, Germany, Italy, and the United Kingdom), which implies
an average of 77 car models per year and country. The car market is divided into five
upper nests (groups) according to the segments: subcompact, compact, intermediate,
standard, and luxury. Each segment is further subdivided into lower nests (subgroups)
according to origin: domestic or foreign (for example, Fiat is domestic in Italy and
foreign in the other countries). Sales are new car registrations (qu). Price is measured in
1,000 Euro (in 1999 purchasing power). The product characteristics are horsepower (in
kilowatts), fuel efficiency (in liters/100 kilometers), width (in centimeters), and height
(in centimeters). The commands below are provided in a script called example.do.

3. See http://www.econ.kuleuven.be/public/ndbad83/frank/cars.htm.


. use cars1

. summarize year country co segment domestic firm qu price horsepower fuel

> width height pop ngdp

Variable Obs Mean Std. Dev. Min Max

country 11483 2.918488 1.443221 1 5

co 11483 223.0364 206.6172 1 980

segment 11483 2.559087 1.289577 1 5

domestic 11483 .1886267 .3912288 0 1

qu 11483 19911.44 37803.6 51 433694

price 11483 18.49683 8.922665 5.260726 150.3351

horsepower 11483 57.26393 23.89019 13 169.5

fuel 11483 6.728904 1.709702 3.8 18.6

height 11483 140.4434 4.631175 117.5 173.5

pop 11483 4.81e+07 2.18e+07 9660000 8.21e+07

ngdp 11483 1.76e+14 4.73e+14 5.18e+10 2.13e+15

A first key preparatory task is to define the two dimensions of the panel and to
time set the data (unless there is only one cross-section). The first dimension is the
product, that is, the car model (for example, the Volkswagen [VW] Golf). The second
dimension is the market, which can be defined as the country and year (for example,
France in 1995).

. xtset co yearcountry

panel variable: co (unbalanced)

time variable: yearcountry, 1 to 150, but with gaps

delta: 1 unit

Note that the panel is unbalanced because most models are not available throughout
the entire period or in all countries.

A second key preparatory task is to define the potential market size. For the car
market, it is sensible to adopt a unit demand specification. We specify the potential
market size as total population divided by 4, a crude proxy for the number of households.
In practice, the potential market size in a given year may be lower because cars are
durable and consumers who just purchased a car may not consider buying a new one
immediately.

. generate MSIZE=pop/4


Merger simulation can now proceed in three steps.

The first step initializes the settings for the merger simulation using the command
mergersim init. The next example specifies a two-level nested logit model where the
groups are the segments and the subgroups are domestic or foreign within the segments.
This requires the option nests(segment domestic). The specification is the default
unit demand specification. The price, quantity, market size, and firm variables are also
specified.

. mergersim init, nests(segment domestic) price(price) quantity(qu)
> marketsize(MSIZE) firm(firm)

Version 1.0, Revision: 218

Depvar Price Group shares

mergersim init creates market share and price variables labeled with an M_ prefix (the
default prefix). The variable M_ls is the dependent variable ln(s_j/s_0), M_lsjh is the log
of the subgroup share ln(s_{j|hg}), and M_lshg is the log of the group share ln(s_{h|g}).

We can estimate the nested logit model with a linear regression estimator, using
instrumental variables to account for the endogeneity of the price and market share
variables. As a simplification to illustrate the approach, we consider a fixed-effects
regression without instruments.


. xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year

> country2-country5, fe

Fixed-effects (within) regression Number of obs = 11483

Group variable: co Number of groups = 351

R-sq: within = 0.8948 Obs per group: min = 1

between = 0.7576 avg = 32.7

overall = 0.8427 max = 146

F(13,11119) = 7271.50

corr(u_i, Xb) = -0.0147 Prob > F = 0.0000

M_lsjh .9047371 .0041489 218.07 0.000 .8966045 .9128696

M_lshg .5677968 .0085109 66.71 0.000 .551114 .5844796

horsepower .0038279 .0005921 6.46 0.000 .0026672 .0049886

fuel -.0270919 .004539 -5.97 0.000 -.0359892 -.0181946

width .0103757 .0016768 6.19 0.000 .0070889 .0136625

height .0004322 .0022161 0.20 0.845 -.0039117 .0047761

domestic .5230743 .0124205 42.11 0.000 .4987279 .5474206

year .0017336 .0012022 1.44 0.149 -.000623 .0040902

country2 -.6621749 .01399 -47.33 0.000 -.6895977 -.6347521

country3 -.5883123 .0147382 -39.92 0.000 -.6172017 -.5594229

country4 -.7129762 .0137524 -51.84 0.000 -.7399333 -.686019

country5 -.4155907 .016715 -24.86 0.000 -.448355 -.3828265

_cons -8.193457 2.246407 -3.65 0.000 -12.59681 -3.790101

sigma_u .52455749

sigma_e .36374004

rho .6752947 (fraction of variance due to u_i)

F test that all u_i=0: F(350, 11119) = 22.69 Prob > F = 0.0000

The parameters that will influence the merger simulations are the price parameter
α = −0.0468 and the nesting parameters σ1 = 0.905 and σ2 = 0.568 (the coefficients
of, respectively, M_lsjh and M_lshg). These estimates satisfy the following restrictions
from economic theory: α < 0 and 1 > σ1 ≥ σ2 ≥ 0. However, it is important to stress
that the fixed-effects estimator is inconsistent because price and the subgroup and group
market share variables are endogenous. As discussed in Berry (1994), an instrumental-variables
estimator is required (for example, using ivreg or xtivreg with appropriate
instruments). We therefore use the results from the fixed-effects estimator only for
illustration.
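A hedged sketch of such an instrumental-variables estimation, treating price and the (sub)group shares as endogenous; the instruments z1–z5 are hypothetical placeholders (for example, sums of rival product characteristics), not variables in cars1.dta:

```stata
. xtivreg M_ls horsepower fuel width height domestic year country2-country5
>     (price M_lsjh M_lshg = z1 z2 z3 z4 z5), fe
```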

The second step in the merger simulation calculates the premerger market conditions
(the products' gross valuations, their marginal costs, and the price elasticities of
demand) using the command mergersim market. In the example below, these calculations
are done for only the five countries in 1998. Because no values for α, σ1, and σ2 are
specified, mergersim market uses the parameters from the last available Stata estimation,
that is, the ones from the fixed-effects regression.

. mergersim market if year == 1998, firm(firm)

Demand: Unit demand two-level nested logit

Demand estimate

xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year

> country2-country5, fe

Dependent variable: M_ls

Parameters

alpha = -0.047

sigma1 = 0.905

sigma2 = 0.568

variable mean sd min max

M_ejk 0.766 1.276 0.003 10.908

M_ejl 0.068 0.120 0.000 0.768

M_ejm 0.001 0.002 0.000 0.011

Observations: 449

Unweighted averages by firm (premerger price, marginal cost, Lerner index)

Fiat 15.277 10.553 0.372

Ford 14.557 11.923 0.207

Honda 20.094 17.941 0.128

Hyundai 12.915 10.849 0.179

Kia 10.814 8.772 0.207

Mazda 14.651 12.557 0.156

Mercedes 25.598 21.569 0.162

Mitsubishi 15.955 13.825 0.145

Nissan 15.438 13.259 0.159

GM 21.054 18.633 0.135

PSA 16.243 13.533 0.194

Renault 15.518 12.837 0.203

Suzuki 9.289 7.226 0.234

Toyota 14.560 12.430 0.172

VW 18.990 16.388 0.181

Volvo 23.167 20.912 0.099

Daewoo 13.871 11.789 0.170


These results imply fairly high own-price elasticities for the products in 1998: 7.488
on average (in absolute value). The cross-price elasticities are higher for products within the same subgroup
(0.766) than for products of a different subgroup (0.068) and especially for products of
a different group (0.001). The Lerner index, or percentage markup over marginal cost,
varies from 9.9% to 37.2%, with a tendency toward higher percentage markups for firms with
lower-priced models (a feature of most unit demand logit models).

The third step performs the actual merger simulation using the mergersim simulate
command. The example below considers a merger where General Motors (GM) (firm =
15) sells its operations to VW (firm = 26). Note that the merger simulations would be
the same if VW sold its operations to GM. We first carry out the merger simulations
for Germany in 1998, where it can be considered a domestic merger (because GM sells
the Opel brands, which are produced in Germany). It is assumed that there are no
marginal cost savings to the seller or the buyer and that there is no partial coordination
(neither before nor after the merger).

. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26)

> detail

Merger Simulation

Simulation method: Newton

Buyer Seller Periods/markets: 1

Firm 26 15 Number of iterations: 6

Marginal cost savings Max price change in last it: 4.5e-06

Prices

Unweighted averages by firm (premerger price, postmerger price, relative change)

Fiat 15.338 15.341 0.000

Ford 13.093 13.362 0.023

Honda 15.778 15.780 0.000

Hyundai 12.912 12.912 0.000

Kia 11.276 11.276 0.000

Mazda 14.229 14.231 0.000

Mercedes 20.114 20.155 0.003

Mitsubishi 15.832 15.834 0.000

Nissan 15.101 15.103 0.000

GM 19.921 21.054 0.076

PSA 16.397 16.399 0.000

Renault 15.292 15.295 0.000

Suzuki 9.225 9.225 0.000

Toyota 13.019 13.020 0.000

VW 17.182 17.739 0.036

Volvo 22.149 22.154 0.000

Daewoo 13.483 13.484 0.000

> dropped)

(output omitted )


The results show prices before and after the merger (in 1,000 Euro) and the percentage
price change averaged by firm. This information is provided as standard, even without
the detail option at the end. The merger simulations predict that GM will on average
raise its prices by 7.6%, while VW will on average raise its prices by 3.6%. The rivals
respond with only very small price increases (with the exception of Ford).4

Because the new price vector is saved, one can use Stata's graphics to plot these
results. Consider the following commands:

. generate perc_price_ch=M_price_ch*100

(11386 missing values generated)

. graph bar (mean) perc_price_ch if country==3&year==1998,

> over(firm, sort(perc_price_ch) descending label(angle(vertical)))

> ytitle(Percentage) title(Average percentage price increase per firm)

(Figure: bar chart titled "Average percentage price increase per firm", y axis "Percentage" from 0 to 8, firms sorted in descending order; GM, VW, and Ford show the largest increases, followed by BMW, Mercedes, and the remaining firms.)

4. Note that one can also specify the detail option to display the market shares before and after the
merger and the percentage-point difference. If one is interested in seeing more detailed results, one
can use additional options under mergersim results. One can also use standard Stata commands,
such as table, based on the variables M_price (premerger price) and M_price2 (postmerger price).


Without the detail option after the mergersim simulate command, the output
reports only the price information. The detail option produces additional results on
the following variables (premerger, postmerger, and changes): market shares by firm,
the Herfindahl index, C4 and C8 ratios (the market shares of the 4 and 8 largest firms), and
consumer and producer surplus.5

Market shares by quantity

Unweighted averages by firm (premerger share, postmerger share, change)

Fiat 0.043 0.045 0.003

Ford 0.095 0.132 0.037

Honda 0.012 0.012 0.001

Hyundai 0.006 0.006 0.000

Kia 0.003 0.003 0.000

Mazda 0.025 0.027 0.002

Mercedes 0.100 0.116 0.017

Mitsubishi 0.015 0.017 0.001

Nissan 0.025 0.027 0.002

GM 0.166 0.108 -0.058

PSA 0.034 0.037 0.003

Renault 0.051 0.054 0.003

Suzuki 0.006 0.006 0.000

Toyota 0.027 0.029 0.002

VW 0.300 0.280 -0.020

Volvo 0.012 0.013 0.001

Daewoo 0.006 0.007 0.001

Pre-merger Post-merger

C4: 66.07 71.50

C8: 86.21 88.01

Change

Producer surplus: 1,303,353

For example, the Herfindahl index increases from 1,501 to 1,972. Consumer surplus
(in Germany) drops by 1.8 billion Euro, or 586 Euro per car (because 3.1 million cars
were sold in Germany in 1998). This is partly compensated by an increase in producer
surplus of 1.3 billion Euro.

5. In logit and nested logit models, consumer surplus (up to a constant) is given by the well-known
log-sum expression divided by the marginal utility of income. Caution is warranted in the constant
expenditures specification because marginal utility is not constant. See Train (2009).


It is possible to account for several specific features of the merger.

Efficiencies

First, one may account for the possibility that the buying or the selling firm benefits
from a marginal cost saving, which may be passed on to consumer prices. The cost
saving is expressed as a percentage of current marginal cost. In the command below,
the options sellereff(0.2) and buyereff(0.2) mean that the seller and the buyer
each have a marginal cost saving of 20% on all of their products.

. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26)

> sellereff(0.20) buyereff(0.20) method(fixedpoint) maxit(40) dampen(0.5)

Merger Simulation

Simulation method: Dampened Fixed point

Buyer Seller Periods/markets: 1

Firm 26 15 Number of iterations: 19

Marginal cost savings .2 .2 Max price change in last it: .

Prices

Unweighted averages by firm

Fiat 15.338 15.265 -0.004

Ford 13.093 13.125 0.003

Honda 15.778 15.737 -0.002

Hyundai 12.912 12.908 -0.000

Kia 11.276 11.274 -0.000

Mazda 14.229 14.212 -0.001

Mercedes 20.114 19.259 -0.031

Mitsubishi 15.832 15.810 -0.001

Nissan 15.101 14.981 -0.005

GM 19.921 18.980 -0.022

PSA 16.397 16.372 -0.002

Renault 15.292 15.261 -0.003

Suzuki 9.225 9.219 -0.001

Toyota 13.019 13.005 -0.001

VW 17.182 15.717 -0.075

Volvo 22.149 22.036 -0.005

Daewoo 13.483 13.477 -0.000

> dropped)

There is now a predicted price decrease in Germany of 2.2% for GM and 7.5%
for VW. This implies that the 20% cost savings are sufficiently passed on to consumers.
To obtain convergence, we used a fixed-point iteration with a dampening factor of 0.5
because the default Newton method did not converge. sellereff() and buyereff()
assume the same percentage cost saving for all products of the seller and buyer. A
percentage cost saving that varies by product can instead be specified with the variable
that enters in efficiencies().

Instead of simulating the prices in the postmerger equilibrium with efficiencies,
one can compute the minimum required efficiency (the percentage cost saving by product)
for the prices to remain unchanged after the merger; see Froeb and Werden (1998)
or Röller, Stennek, and Verboven (2001). This can be done with the mergersim mre
command:

variable mean sd min max

M_costs2 13.769 9.938 5.439 43.620

M_mre 0.123 0.128 0.001 0.401

The generated variable M_mre refers to the minimum required efficiency per product
owned by the merging firms and is set to a missing value for the products of the non-merging
firms. According to the results, the minimum required efficiencies for the 19
products of the merging firms are on average 12.3% (unweighted) and 22.1% (weighted
by sales).
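These two averages can be reproduced directly; a minimal sketch, assuming qu holds sales quantities as in cars1.dta and that the mre run covered Germany in 1998:

```stata
. summarize M_mre if year == 1998 & country == 3                 // unweighted mean
. summarize M_mre [aweight=qu] if year == 1998 & country == 3    // sales-weighted mean
```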

Divestiture as a remedy

Second, one may account for divestiture as a remedy to mitigate the price effects of
a merger. Under such a remedy, the competition authority accepts the merger on the
condition that the firms sell some of their products or brands. To simulate the effects of a
merger with divestiture, one can replace the options buyer(#) and seller(#) with the
option newfirm(varname), which specifies a variable for the new ownership structure
after the merger. To illustrate, we consider a merger between Renault (firm = 18) and
PSA (firm = 16), where PSA sells the brands Peugeot and Citroën. This merger would
substantially raise average prices in France: 59.8% for the Renault products and 63.1%
for the PSA products (ignoring entry and substitution to other countries). To mitigate
the anticompetitive effects, the competition authority may request that PSA sell one of
its brands, Citroën (brand = 4), to Fiat (firm = 4). The commands below show how
to simulate the effects of such a merger with divestiture after creating the appropriate
variable firm_rem for the new ownership structure.6

6. Note that this example starts with mergersim init and moves to mergersim simulate without performing
a regression to obtain the price and nesting parameters. In this case, mergersim continues
to use the most recent results.


. generate firm_rem=firm

. replace firm_rem=16 if firm==18 // original merger

(890 real changes made)

. replace firm_rem=4 if brand==4 // divestiture

(583 real changes made)

. quietly mergersim init, nests(segment domestic) unit price(price)

> quantity(qu) marketsize(MSIZE) firm(firm)

. quietly mergersim simulate if year == 1998 & country == 2, seller(16)

> buyer(18)

. mergersim simulate if year == 1998 & country == 2, newfirm(firm_rem)

Merger Simulation

Simulation method: Newton

Variable name Periods/markets: 1

Ownership from: firm_rem Number of iterations: 7

Marginal cost savings Max price change in last it: 9.7e-08

Prices

Unweighted averages by firm

Fiat 12.688 12.749 0.006

Ford 11.995 12.001 0.001

Honda 15.742 15.744 0.000

Hyundai 9.862 9.863 0.000

Kia 7.040 7.040 0.000

Mazda 12.536 12.536 0.000

Mercedes 25.239 25.240 0.000

Mitsubishi 14.880 14.880 0.000

Nissan 12.371 12.372 0.000

GM 18.963 18.966 0.000

PSA 15.303 16.317 0.089

Renault 14.996 17.114 0.162

Suzuki 7.824 7.824 0.000

Toyota 12.638 12.638 0.000

VW 17.735 17.744 0.001

Volvo 22.641 22.642 0.000

Daewoo 13.939 13.940 0.000

> dropped)

The results show that the merger with divestiture raises the average price by only
16.2% for Renault and by 8.9% for the Peugeot brand, whereas the price of Fiat (now
including the Citroën brand) increases by 0.6%. The option newfirm(varname) can
also be used for other applications, for example, to assess the impact of two consecutive
mergers.
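As an illustrative sketch of that last use (the variable name firm_two is an assumption, and the firm codes reuse the article's GM–VW and Renault–PSA examples; this exact run does not appear in the article):

```stata
. generate firm_two = firm
. replace firm_two = 26 if firm == 15   // GM's products to VW
. replace firm_two = 16 if firm == 18   // Renault's products to PSA
. mergersim simulate if year == 1998 & country == 3, newfirm(firm_two)
```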


Conduct

Third, one may account for the possibility that firms partially coordinate, that is, take
into account a fraction of the competitors' profits when setting prices. Assume, for
example, that firms maintain the same degree of coordination before and after the
merger: one can set the conduct parameter such that the markups are in line with
outside estimates. Performing mergersim market before mergersim simulate enables
one to verify whether the conduct parameter results in premerger markups in line with
outside estimates. This is shown in the following example (which returns to the earlier
merger between GM and VW in Germany).
. mergersim market if year == 1998 & country == 3, firm(firm) conduct(0.5)

Demand: Unit demand two-level nested logit

Demand estimate

xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year

> country2-country5, fe

Dependent variable: M_ls

Parameters

alpha = -0.047

sigma1 = 0.905

sigma2 = 0.568

variable mean sd min max

M_ejk 0.781 1.141 0.007 4.920

M_ejl 0.060 0.123 0.001 0.637

M_ejm 0.001 0.002 0.000 0.011

Observations: 97


Unweighted averages by firm

Fiat 15.338 10.845 0.334

Ford 13.093 8.114 0.419

Honda 15.778 11.433 0.286

Hyundai 12.912 8.818 0.349

Kia 11.276 7.196 0.391

Mazda 14.229 10.012 0.315

Mercedes 20.114 13.753 0.348

Mitsubishi 15.832 11.612 0.280

Nissan 15.101 10.651 0.316

GM 19.921 14.862 0.297

PSA 16.397 12.106 0.299

Renault 15.292 10.893 0.340

Suzuki 9.225 5.084 0.461

Toyota 13.019 8.794 0.379

VW 17.182 12.104 0.352

Volvo 22.149 17.596 0.208

Daewoo 13.483 9.339 0.346

The results show that if firms coordinate by taking into account 50% of the competitors'
profits, then the Lerner index becomes almost twice as high as when there is
no coordination. The predicted price effects after the merger can now be computed.

. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26)
> conduct(0.5)

Merger Simulation

Simulation method: Newton

Buyer Seller Periods/markets: 1

Firm 26 15 Number of iterations: 6

Marginal cost savings Max price change in last it: 2.1e-07

Pre Post

Conduct: .5 .5

Prices

Unweighted averages by firm


Fiat 15.338 15.434 0.007

Ford 13.093 13.881 0.063

Honda 15.778 15.889 0.008

Hyundai 12.912 13.019 0.009

Kia 11.276 11.379 0.009

Mazda 14.229 14.334 0.008

Mercedes 20.114 20.427 0.025

Mitsubishi 15.832 15.956 0.008

Nissan 15.101 15.194 0.007

GM 19.921 21.171 0.084

PSA 16.397 16.503 0.007

Renault 15.292 15.395 0.008

Suzuki 9.225 9.314 0.010

Toyota 13.019 13.115 0.008

VW 17.182 17.947 0.049

Volvo 22.149 22.265 0.005

Daewoo 13.483 13.584 0.008

> dropped)

Under partial coordination, the merger simulation predicts larger price increases. On
the one hand, there is a larger predicted price increase for the merging firms; this feature
does not hold generally, because the merging firms already partially coordinate before
the merger. On the other hand, there is also a larger predicted price increase for the
outsider firms; this feature may hold more generally because it reflects that outsiders
have more cooperative responses to the price changes by the merging firms.

Calibration

The merger simulation results depend on the values of three parameters, α, σ1, and σ2
(and on the price and quantity data per product). A practitioner may not want to rely
too heavily on the econometric estimates of these parameters and may want to verify
whether the elasticities and markups are consistent with external industry information.
Here a practitioner would not estimate but calibrate the parameters such that they
result in price elasticities and markups that are equal to external estimates. Such
calibration is possible by specifying the options alpha() and sigmas() to mergersim
market. The selected values overrule the values in memory, for example, the ones from a
previous estimation. In the lines below, we specify α = −0.035 (closer to 0 than
the econometric estimate of α = −0.047), and we keep σ1 and σ2 at the previous
values. Hence, we calibrate such that demand is less elastic. The results from
this calibration indeed imply lower price elasticities (on average 5.5):

. mergersim market if year == 1998 & country == 3, firm(firm) alpha(-0.035)
> sigmas(0.910 0.570)

Demand: Unit demand two-level nested logit

Demand calibration

Parameters

alpha = -0.035

sigma1 = 0.910

sigma2 = 0.570

variable mean sd min max

M_ejk 0.624 0.911 0.006 3.946

M_ejl 0.045 0.093 0.000 0.480

M_ejm 0.001 0.001 0.000 0.008

Observations: 97

Unweighted averages by firm

Fiat 15.338 12.297 0.229

Ford 13.093 9.765 0.287

Honda 15.778 12.921 0.189

Hyundai 12.912 10.294 0.223

Kia 11.276 8.681 0.248

Mazda 14.229 11.455 0.206

Mercedes 20.114 15.030 0.255

Mitsubishi 15.832 13.019 0.186

Nissan 15.101 12.155 0.209

GM 19.921 16.573 0.199

PSA 16.397 13.576 0.197

Renault 15.292 12.302 0.236

Suzuki 9.225 6.586 0.294

Toyota 13.019 10.280 0.246

VW 17.182 13.540 0.254

Volvo 22.149 18.974 0.144

Daewoo 13.483 10.860 0.220


The next lines show what this calibration implies for merger simulation.
. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26)

Merger Simulation

Simulation method: Newton

Buyer Seller Periods/markets: 1

Firm 26 15 Number of iterations: 6

Marginal cost savings Max price change in last it: 5.9e-06

Prices

Unweighted averages by firm

Fiat 15.338 15.342 0.000

Ford 13.093 13.443 0.030

Honda 15.778 15.781 0.000

Hyundai 12.912 12.912 0.000

Kia 11.276 11.276 0.000

Mazda 14.229 14.231 0.000

Mercedes 20.114 20.167 0.003

Mitsubishi 15.832 15.835 0.000

Nissan 15.101 15.103 0.000

GM 19.921 21.372 0.098

PSA 16.397 16.399 0.000

Renault 15.292 15.296 0.000

Suzuki 9.225 9.226 0.000

Toyota 13.019 13.020 0.000

VW 17.182 17.892 0.045

Volvo 22.149 22.155 0.000

Daewoo 13.483 13.484 0.000

> dropped)

These results show that the predicted price increase is larger when demand is less elastic.

One can also use the calibration options alpha() and sigmas() to implement a parametric bootstrap for constructing confidence intervals of the computed merger effects. The following lines perform three steps. First, we take 100 draws for α, σ1, and σ2, assuming the parameters are normally distributed. Second, we perform one merger simulation for each of the 100 draws. Third, we save the results for the average price increase of the buying firm and the selling firm, and we compute summary statistics.
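The logic of these three steps is not Stata-specific. The following Python sketch illustrates the mechanics under stated assumptions: simulate_merger is a hypothetical stand-in for the equilibrium recomputation that mergersim simulate performs, and the covariance matrix is purely illustrative.

```python
import numpy as np

def simulate_merger(alpha, sigma1, sigma2):
    # Hypothetical stand-in for `mergersim simulate`: in the article the merger
    # effect comes from recomputing the Bertrand equilibrium at each draw.
    # A smooth toy mapping suffices to illustrate the bootstrap mechanics.
    return 0.076 * abs(-0.047 / alpha) * (1 + 0.5 * (sigma1 - 0.91) + 0.2 * (sigma2 - 0.57))

rng = np.random.default_rng(1)
b = np.array([-0.047, 0.91, 0.57])        # point estimates of (alpha, sigma1, sigma2)
V = np.diag([0.004, 0.01, 0.02]) ** 2     # illustrative (not estimated) covariance

# Step 1: 100 draws from the estimated sampling distribution of the parameters.
draws = rng.multivariate_normal(b, V, size=100)
# Step 2: one merger simulation per draw.
price_ch = np.array([simulate_merger(*d) for d in draws])
# Step 3: summary statistics for the simulated price increases.
lo, hi = np.percentile(price_ch, [2.5, 97.5])
print(f"mean {price_ch.mean():.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

The percentile interval here plays the role of the confidence intervals reported below for GM and VW.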


> marketsize(MSIZE) firm(firm)

. matrix b=e(b)

. matrix V=e(V)

. matrix bsub = ( b[1,1] , b[1,2] , b[1,3] )

. matrix Vsub = ( V[1,1], V[1,2], V[1,3] \ V[2,1] , V[2,2], V[2,3] \ V[3,1],

> V[3,2], V[3,3] )

. local ndraws 100
. set seed 1
. preserve
. drawnorm alpha sigma1 sigma2, n(`ndraws') cov(Vsub) means(bsub) clear
(obs 100)
. mkmat alpha sigma1 sigma2, matrix(params)
. restore
. matrix pr_ch = J(`ndraws',2,0)
. forvalues i = 1 2 to `ndraws' {
  2. local alpha = params[`i',1]
  3. local sigma1 = params[`i',2]
  4. local sigma2 = params[`i',3]
  5. quietly mergersim init, nests(segment domestic) price(price) quantity(qu)
>    marketsize(MSIZE) firm(firm) alpha(`alpha') sigmas(`sigma1' `sigma2')
  6. quietly mergersim simulate if year == 1998 & country == 3, seller(15)
>    buyer(26)
  7. sum M_price_ch if year == 1998 & country == 3 & firm == 15, meanonly
  8. matrix pr_ch[`i',1] = r(mean)
  9. sum M_price_ch if year == 1998 & country == 3 & firm == 26, meanonly
 10. matrix pr_ch[`i',2] = r(mean)
 11. }

. clear

. quietly svmat pr_ch , names(pr_ch)

. sum pr_ch1 pr_ch2

Variable Obs Mean Std. Dev. Min Max

pr_ch2 100 .0355618 .0015778 .0307121 .0394875

Earlier, we obtained point estimates for the percentage price increase of 7.6% for GM and 3.6% for VW (for the base scenario). The 95% confidence intervals for these price increases are [6.7%, 8.4%] for GM and [3.1%, 4.0%] for VW.

Finally, we illustrate how to do merger simulation based on a constant expenditures demand instead of a unit demand specification. For cars, this may not be a realistic option, because consumers typically buy one unit or no unit rather than spend a constant amount. Nevertheless, we can use the constant expenditures specification to see how functional form affects the predictions from merger simulation.


. generate MSIZE1=ngdpe/5

This assumes the potential expenditures on cars in a country and year are 20% of total gross domestic product.

Next we calibrate (rather than estimate) the parameters to α = −0.5, σ1 = 0.9, and σ2 = 0.6.

> marketsize(MSIZE1) firm(firm) alpha(-0.5) sigmas(0.9 .6)

(output omitted )

We can verify the premerger elasticities and markups at these calibrated parameters:

Demand: Constant expenditure two-level nested logit

Demand calibration

Parameters

alpha = -0.500

sigma1 = 0.900

sigma2 = 0.600

variable mean sd min max

M_ejk 0.426 0.493 0.005 1.946

M_ejl 0.039 0.065 0.000 0.283

M_ejm 0.001 0.001 0.000 0.006

Observations: 97

Unweighted averages by firm (price, marginal cost, Lerner markup)


Fiat 15.338 12.451 0.189

Ford 13.093 10.502 0.202

Honda 15.778 12.938 0.180

Hyundai 12.912 10.732 0.169

Kia 11.276 9.384 0.168

Mazda 14.229 11.684 0.177

Mercedes 20.114 14.228 0.260

Mitsubishi 15.832 12.978 0.180

Nissan 15.101 12.281 0.183

GM 19.921 15.784 0.206

PSA 16.397 13.473 0.179

Renault 15.292 12.504 0.188

Suzuki 9.225 7.661 0.170

Toyota 13.019 10.739 0.175

VW 17.182 13.395 0.221

Volvo 22.149 17.606 0.201

Daewoo 13.483 11.201 0.169

The premerger elasticities and markups are roughly comparable with those of the estimated unit demand model (with less variation between firms). However, as shown below, the merger simulation results in a larger predicted price increase: +10.1% for GM and +4.4% for VW. This follows from the different functional form: the constant expenditures specification has the property of quasi-constant price elasticity, whereas the unit demand specification has the property that consumers become more price sensitive as firms raise prices. For this same reason, efficiencies in the form of marginal cost savings would also be passed on to consumers to a greater extent under this specification.


> detail

Merger Simulation

Simulation method: Newton

Buyer: Firm 26    Seller: Firm 15
Periods/markets: 1
Number of iterations: 7
Max price change in last iteration: 4.7e-09
Marginal cost savings

Prices

Unweighted averages by firm (premerger price, postmerger price, relative change)

Fiat 15.338 15.342 0.000

Ford 13.093 13.302 0.017

Honda 15.778 15.781 0.000

Hyundai 12.912 12.912 0.000

Kia 11.276 11.276 0.000

Mazda 14.229 14.231 0.000

Mercedes 20.114 20.155 0.003

Mitsubishi 15.832 15.835 0.000

Nissan 15.101 15.103 0.000

GM 19.921 21.581 0.101

PSA 16.397 16.399 0.000

Renault 15.292 15.295 0.000

Suzuki 9.225 9.225 0.000

Toyota 13.019 13.020 0.000

VW 17.182 17.933 0.044

Volvo 22.149 22.159 0.000

Daewoo 13.483 13.484 0.000

> dropped)

(output omitted )

Because the detail option was added, mergersim simulate reports additional results. Consumer surplus now drops by 2.2 billion Euro (versus 1.8 billion Euro in the unit demand specification), and producer surplus increases by 1.1 billion Euro (versus 1.3 billion Euro before).

Pre-merger Post-merger

C4: 66.07 70.52

C8: 86.21 87.61

Change

Producer surplus: 1,140,647


5 Conclusions

This overview has shown how to apply two specifications of the two-level nested logit demand system to merger simulation. We show that merger simulation can be applied as a postestimation command based on estimated parameter values, or it can be implemented without estimation but with calibrated parameters. The merger simulation results yield intuitive predictions given the assumed demand parameters.7 The set of merger simulation commands can be used to simulate the effects of horizontal mergers in a standard setting (differentiated products, multiproduct Bertrand price setting). One can also incorporate various extensions, including efficiencies in the form of cost savings, remedies through partial divestiture, and alternative behavioral assumptions (partial collusive behavior).

Other applications and extensions could be considered. For example, for the car market, it could be interesting to generalize the demand model to allow consumers to substitute between countries by introducing an upper nest for the choice of country instead of assuming such substitution is not possible. These additional substitution possibilities would limit the market power effects of mergers. Other demand models may also be considered, such as a random coefficients logit model or the almost ideal demand system.

6 References

Berry, S., J. Levinsohn, and A. Pakes. 1995. Automobile prices in market equilibrium. Econometrica 63: 841–890.

Berry, S. 1994. Estimating discrete-choice models of product differentiation. RAND Journal of Economics 25: 242–262.

Björnerstedt, J., and F. Verboven. 2013. Does merger simulation work? Evidence from the Swedish analgesics market. http://www.econ.kuleuven.be/public/ndbad83/Frank/Papers/Bjornerstedt%20&%20Verboven,%202013.pdf.

Budzinski, O., and I. Ruhmer. 2010. Merger simulation in competition policy: A survey. Journal of Competition Law & Economics 6: 277–319.

Epstein, R. J., and D. L. Rubinfeld. 2001. Merger simulation: A simplified approach with new applications. Antitrust Law Journal 69: 883–919.

Farrell, J., and C. Shapiro. 1990. Horizontal mergers: An equilibrium analysis. American Economic Review 80: 107–126.

Froeb, L. M., and G. J. Werden. 1998. A robust test for consumer welfare enhancing mergers among sellers of a homogeneous product. Economics Letters 58: 367–369.

7. We stress, however, that the estimated parameters were based on an inconsistent fixed-effects estimator. In practice, one should use instrumental variables to estimate the parameters consistently.


Goldberg, P. K., and F. Verboven. 2001. The evolution of price dispersion in the European car market. Review of Economic Studies 68: 811–848.

Ivaldi, M., and F. Verboven. 2005. Quantifying the effects from horizontal mergers in European competition policy. International Journal of Industrial Organization 23: 669–691.

McFadden, D. 1978. Modelling the choice of residential location. In Spatial Interaction Theory and Planning Models, ed. A. Karlqvist, L. Lundqvist, F. Snickars, and J. Weibull, 75–96. Amsterdam: North-Holland.

Nevo, A. 2000. Mergers with differentiated products: The case of the ready-to-eat cereal industry. RAND Journal of Economics 31: 395–421.

Posner, R. A. 1975. The social costs of monopoly and regulation. Journal of Political Economy 83: 807–828.

Röller, L.-H., J. Stennek, and F. Verboven. 2001. Efficiency gains from mergers. European Economy 5: 31–128.

Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge: Cambridge University Press.

Verboven, F. 1996. International price discrimination in the European car market. RAND Journal of Economics 27: 240–268.

Werden, G. J., and L. Froeb. 1994. The effects of mergers in differentiated products industries: Logit demand and merger policy. Journal of Law, Economics, and Organization 10: 407–426.

Williamson, O. E. 1968. Economies as an antitrust defense: The welfare tradeoffs. American Economic Review 58: 18–36.

Jonas Björnerstedt is a researcher at the Swedish Competition Authority and at the University of Leuven (Belgium). His current research focuses on the empirical analysis of competition policy and industrial organization.

Frank Verboven is a professor of economics and industrial organization at the University of Leuven (Belgium) and a research fellow at the Centre for Economic Policy Research. His current research focuses on the empirical analysis of industries with market power, with applications to issues in competition policy and regulation.

The Stata Journal (2014)
14, Number 3, pp. 541–561

treatrew: A user-written command for estimating average treatment effects by reweighting on the propensity score

Giovanni Cerulli

Ceris-CNR

National Research Council of Italy

Institute for Economic Research on Firms and Growth

Rome, Italy

g.cerulli@ceris.cnr.it

Abstract. Reweighting is a popular approach for dealing with inference in the presence of a nonrandom sample, and various reweighting estimators have been proposed in the literature. This article presents the user-written command treatrew, which implements reweighting on the propensity-score estimator as proposed by Rosenbaum and Rubin (1983, Biometrika 70: 41–55) in their seminal article. The main contribution of this command lies in providing analytical standard errors for the average treatment effects in the whole population, in the subpopulation of the treated, and in that of the untreated. Standard errors are calculated using the approximation suggested by Wooldridge (2010, 920–930, Econometric Analysis of Cross Section and Panel Data [MIT Press]), but bootstrapped standard errors can also be easily computed. Because an implementation of this estimator with analytic standard errors and nonnormalized weights is missing in Stata, this article and the accompanying ado-file aim to provide the community with an easy-to-use method for reweighting on the propensity score. The estimator proves to be a valuable tool for estimating average treatment effects under selection on observables.

Keywords: st0350, treatrew, treatment models, reweighting, propensity score, average treatment effects, ATE, ATET, ATENT

1 Introduction

treatrew is a user-written command for estimating average treatment effects (ATEs) by reweighting (REW) on the propensity score. Depending on the specified model (probit or logit), treatrew provides consistent estimation of ATEs under the hypothesis of selection on observables. Conditional on a prespecified set of observable exogenous variables x (thought of as those driving the nonrandom assignment to treatment), treatrew estimates the average treatment effect (ATE), the average treatment effect on the treated (ATET), and the average treatment effect on the nontreated (ATENT); it also estimates these parameters conditional on the observable factors x (that is, ATE(x), ATET(x), and ATENT(x)).

Various reweighting estimators have been proposed. This article presents the user-written command treatrew, which implements reweighting on the propensity-score estimator as proposed by Rosenbaum and Rubin (1983) in their seminal article.

© 2014 StataCorp LP st0350

542 treatrew: A user-written command

The main contribution of this command lies in providing analytical standard errors for the estimation of the ATE, ATET, and ATENT using the approximation suggested by Wooldridge (2010, 920–930). However, bootstrapped standard errors can also be easily computed. treatrew assumes that the propensity score specified by the user is correct. Thus it is sensitive to propensity-score misspecification.

The article is organized as follows: Section 2 provides the statistical description of REW on the propensity-score estimator as implemented by treatrew. Section 3 provides the formulas for calculating the causal parameters of interest and their standard errors. Section 4 presents the syntax of treatrew and an application to real data. Section 5 shows the relation between treatrew and the recent Stata 13 command teffects ipw for implementing the inverse-probability-weighting (IPW) estimator. Section 6 concludes the article. Finally, two appendixes are reported at the end of the article.

2 Reweighting on the propensity score: An overview

Reweighting is a valuable approach to estimate (binary) treatment effects in a nonexperimental statistical setting when subjects' nonrandom assignment to treatment is due to selection on observables. The idea behind the REW procedure is straightforward: when the treatment is not randomly assigned, treated and untreated subjects may present different distributions of their observable characteristics. This may happen either because of the subjects' self-selection into the experiment (subjects may consider the net benefit of participation) or because of the selection process operated by an external entity (such as a public agency managing a subsidization program whose explicit objective is selecting beneficiaries with peculiar characteristics to maximize policy effect). Many examples can be drawn from both social and epidemiological statistical settings.

In nonrandomized experiments, the distribution of the variables feeding into x could be strongly unbalanced. To establish a balance in their distributions, one could implement REW on observations, using their probability of becoming treated, that is, according to subjects' propensity scores. A possible REW estimation protocol is as follows:

1. Estimate the propensity score with a probit or logit of w on x, thus obtaining the predicted probability p̂i.

2. Build weights as 1/p̂i for treated observations and 1/(1 − p̂i) for untreated observations.

3. Calculate ATEs by comparing the weighted means of the two groups (for instance, with a weighted least-squares [WLS] regression).
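This three-step protocol is easy to sketch on simulated data. The following Python fragment is a minimal illustration with an assumed data-generating process (true ATE of 1), not the treatrew implementation; the logit is fit by Newton-Raphson to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-0.5 * x))           # true propensity score
w = rng.binomial(1, p_true)                   # nonrandom assignment to treatment
y = 1.0 * w + 2.0 * x + rng.normal(size=n)    # outcome; true ATE = 1

# Step 1: estimate the propensity score with a logit of w on x (Newton-Raphson).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (w - p))
p_hat = 1 / (1 + np.exp(-X @ beta))

# Step 2: weights 1/p-hat for treated and 1/(1 - p-hat) for untreated subjects.
weights = np.where(w == 1, 1 / p_hat, 1 / (1 - p_hat))

# Step 3: ATE as the difference of the weighted outcome means.
ate = np.average(y[w == 1], weights=weights[w == 1]) - \
      np.average(y[w == 0], weights=weights[w == 0])
print(round(ate, 2))  # close to the true ATE of 1
```

Note that averaging with normalized weights, as np.average does, corresponds to a normalized weighting scheme; the sample formulas below use nonnormalized weights.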

G. Cerulli 543

This weighting scheme reflects the logic of inverse-probability weighting (Robins, Hernán, and Brumback 2000; Brunell and DiNardo 2004), that is, the idea that penalizing (advantaging) treated subjects with a higher (lower) probability of being treated, and advantaging (penalizing) untreated subjects with a higher (lower) probability of being treated, makes the two groups as similar as possible. In other words, the weights eliminate a confounding component induced by the extent of the nonrandom assignment to a program.

Alternative weighting schemes have been proposed in the literature,1 and some authors have shown that various matching methods can also be seen as specific REW estimators (Lunceford and Davidian 2004; Morgan and Harding 2006). As in matching, these estimators have different properties, but the main limit resides in the specification of the propensity score, because measurement errors in this specification could produce severe bias. In what follows, we focus on REW on propensity-score inverse probability as proposed by Rosenbaum and Rubin (1983). Here we start with the following assumptions about the data-generating process:

i. y1 = g1(x) + ε1, E(ε1) = 0

ii. y0 = g0(x) + ε0, E(ε0) = 0

iii. y = w·y1 + (1 − w)·y0

iv. Conditional mean independence (CMI) holds; therefore, E(y1|w, x) = E(y1|x) and E(y0|w, x) = E(y0|x)

v. x exogenous

y1 and y0 are the subject's outcomes when treated and untreated, respectively; g1(x) and g0(x) are the subject's reaction functions to the confounder x when the subject is treated and untreated, respectively; w is the binary treatment indicator taking value 1 for treated and 0 for untreated subjects; ε0 and ε1 are two error terms with unconditional zero mean; and x is a set of observable and exogenous confounding variables assumed to drive the nonrandom assignment into treatment. In short, the CMI assumption states that it is sufficient to control only for x to restore random assignment conditions. When assumptions i–v hold,

\[
\text{ATE} = E\left[\frac{\{w - p(x)\}\,y}{p(x)\{1 - p(x)\}}\right] \qquad (1)
\]
\[
\text{ATET} = E\left[\frac{\{w - p(x)\}\,y}{p(w=1)\{1 - p(x)\}}\right] \qquad (2)
\]
\[
\text{ATENT} = E\left[\frac{\{w - p(x)\}\,y}{p(w=0)\,p(x)}\right] \qquad (3)
\]

1. Another possible weighting scheme could be assuming p̂i/(1 − p̂i) for untreated subjects and 1 for treated ones (Nichols 2007). The literature distinguishes between normalized and nonnormalized weighting schemes depending on whether the weights sum to one or to a different value, respectively (Busso, DiNardo, and McCrary 2008).


3 Estimating ATE, ATET, and ATENT

Assuming that the propensity score is correctly specified, we can estimate the previous parameters by using the sample equivalents of the population parameters; that is,

\[
\widehat{\text{ATE}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\{w_i - \hat p(x_i)\}\,y_i}{\hat p(x_i)\{1 - \hat p(x_i)\}}
\]
\[
\widehat{\text{ATET}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\{w_i - \hat p(x_i)\}\,y_i}{\hat p(w=1)\{1 - \hat p(x_i)\}}
\]
\[
\widehat{\text{ATENT}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\{w_i - \hat p(x_i)\}\,y_i}{\hat p(w=0)\,\hat p(x_i)}
\]

Estimation follows in two steps: i) estimate the propensity score p(xi), thus obtaining p̂(xi); and ii) substitute p̂(xi) into the previous formulas to get the parameters. Consistency is guaranteed because these estimators are M-estimators.
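The three sample formulas can be sketched directly in a few lines of Python. This is a minimal illustration on simulated data in which the true propensity score is known and the treatment effect is homogeneous, so that ATE = ATET = ATENT = 1; it is not the treatrew code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))              # assume the propensity score p(x) is known
w = rng.binomial(1, p)
y = 1.0 * w + x + rng.normal(size=n)  # homogeneous effect: ATE = ATET = ATENT = 1

# Generic summand of ATE-hat, and the three sample averages;
# p(w=1) and p(w=0) are estimated by the sample shares of treated/untreated.
k = (w - p) * y / (p * (1 - p))
ate = k.mean()
atet = ((w - p) * y / (w.mean() * (1 - p))).mean()
atent = ((w - p) * y / ((1 - w.mean()) * p)).mean()
print(ate, atet, atent)               # all three close to 1
```

With a heterogeneous effect, the three averages would differ, mirroring the distinction between ATE, ATET, and ATENT.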

But how do we get standard errors for the previous estimators? We can exploit some results that apply when the first step is a maximum likelihood (ML) estimation and the second step is an M-estimation. In our case, the first step is an ML estimation based on logit (or probit), and the second step is a standard M-estimator. For such cases, Wooldridge (2007; 2010, 922–924) proposed a straightforward procedure to get analytical standard errors provided that the propensity score is correctly specified. In what follows, we present Wooldridge's (2007; 2010, 922–924) procedure and formulas for obtaining these standard errors. First, define the estimated ML score of the first step (probit or logit). It is, by definition, equal to

\[
\hat{\mathbf d}_i = \mathbf d(w_i, x_i, \hat{\boldsymbol\beta})
= \frac{\nabla_{\boldsymbol\beta}\,p(x_i,\hat{\boldsymbol\beta})\,\{w_i - p(x_i,\hat{\boldsymbol\beta})\}}{p(x_i,\hat{\boldsymbol\beta})\{1 - p(x_i,\hat{\boldsymbol\beta})\}}
\]

Observe that d̂ is a row vector conformable with the R × 1 parameter vector β, and ∇β p represents the gradient of the function p(x, β).

Second, define the generic estimated summand of the ATE as

\[
\hat k_i = \frac{\{w_i - \hat p(x_i)\}\,y_i}{\hat p(x_i)\{1 - \hat p(x_i)\}}
\]


Third, run the regression of k̂i on 1, d̂i with i = 1, ..., N, and call the residuals êi (i = 1, ..., N). The asymptotic standard error of the estimated ATE is equal to

\[
\left\{\frac{1}{N}\sum_{i=1}^{N}\hat e_i^2\right\}^{1/2}\Big/\sqrt{N} \qquad (4)
\]

and we can use it to test the significance of the estimated ATE. Of course, d̂ will have a different expression according to the probability model adopted. Here we consider the logit and probit cases.

Case 1: Logit

Suppose that the correct probability follows a logistic distribution. This means that

\[
p(x_i,\boldsymbol\beta) = \frac{\exp(x_i\boldsymbol\beta)}{1+\exp(x_i\boldsymbol\beta)} = \Lambda(x_i\boldsymbol\beta) \qquad (5)
\]

Thus, by simple algebra, we see that

\[
\hat{\mathbf d}_i = \underbrace{x_i\,(w_i - \hat p_i)}_{1\times R}
\]

Case 2: Probit

Suppose that the correct probability follows a normal distribution. This means that

\[
p(x_i,\boldsymbol\beta) = \Phi(x_i\boldsymbol\beta)
\]

Thus, by simple algebra, we see that

\[
\hat{\mathbf d}_i = \frac{\phi(x_i\hat{\boldsymbol\beta})\,x_i\,\{w_i - \Phi(x_i\hat{\boldsymbol\beta})\}}{\Phi(x_i\hat{\boldsymbol\beta})\{1-\Phi(x_i\hat{\boldsymbol\beta})\}}
\]

where Φ(·) and φ(·) are the normal cumulative distribution and density functions, respectively. One can also add functions of x to the regressors used in the previous formulas. This reduces the standard errors if these functions are partially correlated with k̂i.
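The whole procedure is short enough to sketch numerically for the logit case. This Python fragment (simulated data; a minimal illustration, not the treatrew code) computes the adjusted standard error and also the naive one that ignores the first step, so the two can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p = 1 / (1 + np.exp(-X @ np.array([0.0, 0.5])))
w = rng.binomial(1, p)
y = w + X[:, 1] + rng.normal(size=n)     # true ATE = 1

# First step: ML logit of w on X (Newton-Raphson).
beta = np.zeros(2)
for _ in range(25):
    ph = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve((X * (ph * (1 - ph))[:, None]).T @ X, X.T @ (w - ph))
ph = 1 / (1 + np.exp(-X @ beta))

k = (w - ph) * y / (ph * (1 - ph))       # generic summand of the ATE estimate
d = X * (w - ph)[:, None]                # logit score: d_i = x_i (w_i - p_i)

# Regress k on (1, d); the residuals feed the adjusted standard error.
Z = np.column_stack([np.ones(n), d])
e = k - Z @ np.linalg.lstsq(Z, k, rcond=None)[0]
se_adj = np.sqrt((e ** 2).mean() / n)
se_naive = np.sqrt(((k - k.mean()) ** 2).mean() / n)
print(k.mean(), se_adj, se_naive)        # se_adj never exceeds se_naive
```

Because regressing on (1, d) cannot increase the residual sum of squares relative to demeaning alone, the adjusted standard error is mechanically no larger than the naive one, which is the point made next.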

Finally, observe that the previous procedure produces standard errors that are lower than those produced by ignoring the first step (that is, the propensity-score estimation via ML). Indeed, the naive standard error

\[
\left\{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat k_i - \widehat{\text{ATE}}\bigr)^2\right\}^{1/2}\Big/\sqrt{N}
\]

is higher than the one produced by the previous procedure.


This follows a route similar to ATE's. Define the generic estimated summand of the ATET as

\[
\hat q_i = \frac{\{w_i - \hat p(x_i)\}\,y_i}{\hat p(w=1)\{1-\hat p(x_i)\}}
\]

and calculate

\[
\hat r_i = \text{residuals from the regression of } \hat q_i \text{ on } 1,\ \hat{\mathbf d}_i
\]

The asymptotic standard error of the estimated ATET is then equal to

\[
\{\hat p(w=1)\}^{-1}\left\{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat r_i - w_i\,\widehat{\text{ATET}}\bigr)^2\right\}^{1/2}\Big/\sqrt{N}
\]

In this case, define the generic estimated summand of the ATENT as

\[
\hat s_i = \frac{\{w_i - \hat p(x_i)\}\,y_i}{\hat p(w=0)\,\hat p(x_i)}
\]

and then calculate

\[
\{\hat p(w=0)\}^{-1}\left\{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat s_i - (1-w_i)\,\widehat{\text{ATENT}}\bigr)^2\right\}^{1/2}\Big/\sqrt{N}
\]

The standard errors presented in this section are correct when the actual data-generating process follows the probit or the logit probability rules. If not, then a measurement error is present, and the estimations might be inconsistent. Authors such as Hirano, Imbens, and Ridder (2003) and Li, Racine, and Wooldridge (2009) have suggested more flexible nonparametric estimation of the standard errors. Under correct specification, a straightforward alternative is to use bootstrapping, where the binary response estimation and the averaging are included in each bootstrap iteration.
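The key requirement of that bootstrap, redoing both steps in every replication, can be sketched as follows. This Python fragment is a minimal illustration on simulated data (true ATE of 1), not the treatrew implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
w = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = w + x + rng.normal(size=n)           # true ATE = 1

def ate_rew(x, w, y):
    """Two-step estimator: logit propensity score, then the REW average."""
    X = np.column_stack([np.ones(len(x)), x])
    b = np.zeros(2)
    for _ in range(25):                   # Newton-Raphson logit fit
        p = 1 / (1 + np.exp(-X @ b))
        b += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (w - p))
    p = 1 / (1 + np.exp(-X @ b))
    return ((w - p) * y / (p * (1 - p))).mean()

# Each replication resamples rows and redoes BOTH steps, so the bootstrap
# standard error reflects the first-step estimation of the propensity score.
reps = np.empty(200)
for r in range(200):
    i = rng.integers(0, n, n)
    reps[r] = ate_rew(x[i], w[i], y[i])
print(ate_rew(x, w, y), reps.std(ddof=1))
```

Wrapping the whole two-step routine inside the resampling loop is exactly what prefixing treatrew with Stata's bootstrap achieves, as shown in the examples below.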

4 The treatrew command

treatrew estimates the ATE, ATET, and ATENT parameters with either analytical or bootstrapped standard errors. The syntax is rather simple and follows the typical Stata


command syntax. The user has to declare: a) the outcome variable, that is, the variable over which the treatment is expected to have an impact (outcome); b) the binary treatment variable (treatment); c) a set of confounding variables (varlist); and, finally, d) a series of options. Two options are important: the option model(modeltype) sets the type of model, probit or logit, to be used in estimating the propensity score; the option graphic and the related option range(a b) produce a chart in which the distributions of ATE(x), ATET(x), and ATENT(x) are jointly plotted within the interval [a; b].

As an e-class command, treatrew provides an ereturn list of objects (such as scalars and matrices) to be used in later elaborations. In particular, the values of ATE, ATET, and ATENT are returned in the scalars e(ate), e(atet), and e(atent), and they can be used to get bootstrapped standard errors. By default, treatrew provides analytical standard errors.

4.1 Syntax

treatrew outcome treatment varlist [if] [in] [weight], model(modeltype) [graphic range(a b) conf(#) vce(robust)]

outcome is the target variable for measuring the impact of the treatment.

treatment is the binary treatment variable taking value 1 for treated and 0 for untreated subjects.

varlist is the set of pretreatment (or observable confounding) variables.

fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.

4.2 Description

treatrew estimates ATEs by REW on the propensity score. Depending on the specified model, treatrew provides consistent estimation of ATEs under the hypothesis of selection on observables. Conditional on a prespecified set of observable exogenous variables x (thought of as those driving the nonrandom assignment to treatment), treatrew estimates the ATE, the ATET, the ATENT, and these parameters conditional on the observable factors x (that is, ATE(x), ATET(x), and ATENT(x)). Parameters' standard errors are provided either analytically (following Wooldridge [2010, 920–930]) or via bootstrapping. treatrew assumes that the propensity-score specification is correct.

treatrew creates several variables:

ATE_x is an estimate of the idiosyncratic ATE.

ATET_x is an estimate of the idiosyncratic ATET.

ATENT_x is an estimate of the idiosyncratic ATENT.


4.3 Options

model(modeltype) specifies the model for estimating the propensity score, where modeltype must be either probit or logit. model() is required.

graphic allows for a graphical representation of the density distributions of ATE(x), ATET(x), and ATENT(x) within their whole support.

range(a b) restricts the graph of the density distributions of ATE(x), ATET(x), and ATENT(x) to the support [a; b] specified by the user. range() must be specified with the graphic option.

conf(#) sets the confidence level of the probit or logit estimates equal to the specified #. The default is conf(95).

vce(robust) allows for robust regression standard errors in the probit or logit estimates.

4.4 Stored results

treatrew stores the following in e():

Scalars
  e(N)      number of observations
  e(N1)     number of (used) treated subjects
  e(N0)     number of (used) untreated subjects
  e(ate)    value of the ATE
  e(atet)   value of the ATET
  e(atent)  value of the ATENT

4.5 Examples

To show a practical application of treatrew, we use an instructional dataset called fertil2.dta, which is included in Wooldridge (2013) and collects cross-sectional data on 4,361 women of childbearing age in Botswana. This dataset is freely downloadable at http://fmwww.bc.edu/ec-p/data/wooldridge/fertil2.dta. It contains 28 variables on women and family characteristics.

Using fertil2.dta, we are interested in evaluating the impact of the variable educ7 (taking value 1 if a woman has more than or exactly seven years of education and 0 otherwise) on the number of family children (children). Several conditioning (or confounding) observable factors are included in the dataset, such as the age of the woman (age), whether the family owns a television (tv), whether the woman lives in a city (urban), and so forth. To inquire into the relation between education and fertility according to Wooldridge's (2010, ex. 21.3, 940) specification, we estimate the ATE, ATET, and ATENT (as well as ATE(x), ATET(x), and ATENT(x)) by REW using treatrew. We also compare the REW results with those of other popular program evaluation methods: i) the difference in means (DIM), taken as a benchmark; ii) the OLS regression-based random-coefficient model with heterogeneous reaction to confounders, estimated through the user-written command ivtreatreg, provided by Cerulli (2011); and iii) a one-to-one nearest-neighbor matching, computed by the command psmatch2, provided


by Leuven and Sianesi (2003). Because matching estimators can be seen as specific REW procedures (Busso, DiNardo, and McCrary 2008), comparing REW with matching is worthwhile. By taking just the case of ATET, we can prove that

\[
\widehat{\text{ATET}}_{\text{Matching}}
= \frac{1}{N_1}\sum_{i\in(w=1)}\Bigl\{y_i - \sum_{j\in C(i)} h(i,j)\,y_j\Bigr\}
\]
\[
= \frac{1}{N_1}\sum_{i=1}^{N} w_i\,y_i - \frac{1}{N_1}\sum_{j=1}^{N}(1-w_j)\,y_j \sum_{i=1}^{N} w_i\,h(i,j)
\]
\[
= \frac{1}{N_1}\sum_{i=1}^{N} w_i\,y_i - \frac{1}{N_0}\sum_{j=1}^{N}(1-w_j)\,y_j\,\varphi(j)
= \widehat{\text{ATET}}_{\text{Reweighting}}
\]

where φ(j) = (N0/N1) Σ_{i=1}^{N} w_i h(i, j) are REW factors, C(i) is the untreated subjects' neighborhood for the treated subject i, and h(i, j) are matching weights that, once opportunely specified, produce different types of matching methods. Results from all of these estimators are reported in table 1.

550 treatrew: A user-written command

Table 1. Comparison of ATE, ATET, and ATENT estimation among DIM, CF-OLS, REW, and MATCH

              (1)        (2)        (3)         (4)         (5)         (6)         (7)
              DIM        CF-OLS     REW         REW         REW         REW         MATCH(a)
                                    (probit,    (logit,     (probit,    (logit,
                                    analytical  analytical  bootstrap   bootstrap
                                    SE)         SE)         SE)         SE)
ATE      b    -1.77***   -0.374***  -0.43***    -0.415***   -0.434***   -0.415***   -0.316***
         se    0.062      0.051      0.068       0.068       0.070       0.071       0.080
         t   -28.46      -7.35      -6.34       -6.09       -6.15       -5.87       -3.93
ATET     b               -0.255***  -0.355**    -0.345***   -0.355***   -0.345***   -0.131
         se               0.048      0.15        0.104       0.0657      0.054       0.249
         t               -5.37      -2.37       -3.33       -5.50       -6.45       -0.52
ATENT    b               -0.523***  -0.532***   -0.503**    -0.532***   -0.503***   -0.549***
         se               0.075      0.19        0.257       0.115       0.119       0.135
         t               -7.00      -2.81       -1.96       -4.61       -4.21       -4.07

Note: Each cell reports the coefficient (b), its standard error (se), and the t-statistic. DIM: difference in means; CF-OLS: control-function OLS; REW: reweighting; MATCH: one-to-one nearest-neighbor matching. (a) Standard errors for ATE and ATENT are computed by bootstrapping. *** = 1%, ** = 5%, * = 10% significance levels.


. ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban electric tv) model(cf-ols)

For CF-OLS, the standard errors for ATET and ATENT are obtained via bootstrap, by typing

> ivtreatreg children educ7 age agesq evermarr urban electric tv,

> hetero(age agesq evermarr urban electric tv) model(cf-ols)

The results set out in columns 3–6 refer to the REW estimator. In columns 3 and 4, standard errors are computed analytically, whereas in columns 5 and 6, they are computed via bootstrap for the probit and logit models, respectively. These results can be retrieved by typing sequentially

. treatrew children educ7 age agesq evermarr urban electric tv, model(probit)

. treatrew children educ7 age agesq evermarr urban electric tv, model(logit)

. bootstrap e(ate) e(atet) e(atent), reps(200):

> treatrew children educ7 age agesq evermarr urban electric tv, model(probit)

. bootstrap e(ate) e(atet) e(atent), reps(200):

> treatrew children educ7 age agesq evermarr urban electric tv, model(logit)

Column 7 refers to the one-to-one nearest-neighbor matching on the propensity score (MATCH). Here the standard error for ATET is obtained analytically, whereas those for ATE and ATENT are computed by bootstrapping. Matching results can be obtained by typing

. psmatch2 educ7 age agesq evermarr urban electric tv, ate out(children) common

. bootstrap r(ate) r(atu): psmatch2 educ7 $xvars, ate out(children) common

where the option common restricts the sample to subjects with common support. To test the balancing property for this matching estimation, we provide a DIM on the propensity score before and after matching treated and untreated subjects, using the psmatch2 postestimation command pstest:


Variable Matched Treated Control %bias |bias| t p>|t|

Matched .65692 .65688 0.0 100.0 0.01 0.994

(output omitted )

This test suggests that with regard to the propensity score, the matching procedure implemented by psmatch2 is balanced, so we can trust the matching results (the propensity score was unbalanced before matching, and it became balanced after matching).

Unlike the DIM results, the results from CF-OLS and REW are fairly comparable in terms of both coefficient size and significance: the values of ATE, ATET, and ATENT obtained using REW on the propensity score are a little higher than those obtained using CF-OLS. This means that the linearity of the potential-outcome equations assumed by CF-OLS is an acceptable approximation. According to the value of ATET, as obtained by REW and visible in column 3 of table 1, an educated woman in Botswana would have been, ceteris paribus, significantly more fertile if she had been less educated. We can conclude that education has a negative impact on fertility, leading a woman to have around 0.5 fewer children. If confounding variables were not considered, as happens when using DIM, this negative effect would appear dramatically higher, around 1.77 children: the difference between 1.77 and 0.5 (around 1.3) is an estimate of the bias induced by the presence of selection on observables.

Columns 3 and 4 show the REW results using Wooldridge's (2010) analytical standard errors in the case of probit and logit, respectively. As partly expected, these results are similar. But the REW results when standard errors are obtained via bootstrap (columns 5 and 6) are more interesting. Here statistical significance is confirmed when compared with the results derived from the analytical formulas. However, bootstrapping seems to increase significance for both ATET and ATENT, while the standard error for ATE is in line with the analytical one.

Some differences in results emerge when applying the one-to-one nearest-neighbor matching (column 7) to this dataset. In this case, ATET becomes insignificant, with a magnitude that is around one-third lower than that obtained by REW. As said above, the standard errors of ATE and ATENT are here obtained via bootstrap because psmatch2 does not provide analytical solutions for these two parameters. Nevertheless, as proved by Abadie and Imbens (2008), bootstrap performance is generally poor in the case of matching, so these results have to be taken with some caution.


Finally, figure 1 sets out the estimated kernel density for the distribution of ATE(x), ATET(x), and ATENT(x) when treatrew is used with the options graphic and range(-30 30). It is evident that the distribution of ATET(x) is a bit more concentrated around its mean (equal to ATET) than the distribution of ATENT(x) is; this indicates that more educated women respond more homogeneously to a higher level of education. On the contrary, less educated women react more heterogeneously to a potential higher level of education.

[Figure 1. Kernel density estimates of ATE(x), ATET(x), and ATENT(x) obtained by REW on the propensity score with range equal to (-30; 30). Model: logit.]

Stata 13 provides a new command, teffects, for estimating treatment effects for observational data. Among the many estimation methods provided by this command, teffects ipw implements a REW estimator based on IPW.

teffects ipw estimates the parameters ATE and ATET and the mean potential outcomes using a WLS regression where weights are a function of the propensity score estimated in the first step. To see the equivalence between IPW and WLS, we apply the teffects ipw command to our previous dataset by computing ATE.

554 treatrew: A user-written command

. use fertil2

. teffects ipw (children) (educ7 $xvars, probit), ate

Iteration 0: EE criterion = 6.624e-21

Iteration 1: EE criterion = 4.722e-32

Treatment-effects estimation Number of obs = 4358

Estimator : inverse-probability weights

Outcome model : weighted mean

Treatment model: probit

Robust

children Coef. Std. Err. z P>|z| [95% Conf. Interval]

ATE

educ7

(1 vs 0) -.1531253 .0755592 -2.03 0.043 -.3012187 -.0050319

POmean

educ7

0 2.208163 .0689856 32.01 0.000 2.072954 2.343372

In this estimation, we see that the value of ATE is -0.153 with a standard error of 0.075, which results in a moderately significant effect of educ7 on children.

This value of ATE can also be obtained using a simple WLS regression of y on w and a constant, with weights h_i designed in this way:

h_i = \begin{cases} h_{i1} = 1/p(x_i) & \text{if } w_i = 1 \\ h_{i0} = 1/\{1 - p(x_i)\} & \text{if } w_i = 0 \end{cases} \qquad (7)

. global xvars age agesq evermarr urban electric tv

. probit educ7 $xvars, robust // estimate the probit regression

(output omitted )

. predict _ps, p // call the estimated propensity score as _ps

(3 missing values generated)

. generate H=(1/_ps)*educ7+1/(1-_ps)*(1-educ7) // weighing function H for w=1

> and w=0

(3 missing values generated)

. regress children educ7 [pw=H], vce(robust) // estimate ATE by a WLS regression

(sum of wgt is 9.1714e+03)

Linear regression Number of obs = 4358

F( 1, 4356) = 2.00

Prob > F = 0.1576

R-squared = 0.0013

Root MSE = 2.1324

Robust

children Coef. Std. Err. t P>|t| [95% Conf. Interval]

_cons 2.208163 .0867265 25.46 0.000 2.038135 2.378191


This table shows that the results of the commands calculating IPW and WLS for ATE are identical. A difference, however, appears in the estimated standard errors, which are quite divergent: 0.075 for IPW against 0.108 for WLS. Moreover, observe that ATE calculated by WLS becomes nonsignificant.
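This equivalence of the point estimates is easy to check numerically: in a WLS regression of y on w and a constant with weights h_i, the coefficient on w equals the difference between the normalized inverse-probability-weighted means of the two groups. The following Python sketch illustrates the point on simulated data (the data-generating process and all names are illustrative, not the fertil2 data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative simulated data (not fertil2): true treatment effect is 0.7
n = 10_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                 # propensity score p(x)
w = rng.binomial(1, p)
y = 1 + x + 0.7 * w + rng.normal(size=n)

# WLS of y on a constant and w with weights h as in (7)
h = w / p + (1 - w) / (1 - p)
X = np.column_stack([np.ones(n), w])
sw = np.sqrt(h)
b = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

# Normalized IPW difference in means
ate_norm = (np.sum(w * y / p) / np.sum(w / p)
            - np.sum((1 - w) * y / (1 - p)) / np.sum((1 - w) / (1 - p)))

print(b[1], ate_norm)  # identical up to floating-point error
```

Because the regressors are a constant and the treatment dummy, the WLS fit reduces algebraically to weighted group means, which is why the two numbers coincide exactly while their standard errors, computed by different methods, need not.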

Why are these standard errors different? The answer resides in the different approaches used for estimating the variance of ATE (and, possibly, ATET): WLS regression uses the usual OLS variance-covariance matrix adjusted for the presence of a matrix of weights; however, WLS does not consider the presence of a generated regressor, namely, the weights computed through the propensity scores estimated in the first step. On the contrary, IPW accounts for the variability introduced by the generated weights by exploiting a generalized method of moments approach for estimating the correct variance-covariance matrix (see StataCorp [2013]). In this sense, IPW is a more robust approach than a standard WLS regression.

As implemented in Stata, both WLS and IPW by default use normalized weights, that is, weights that add up to one. treatrew, on the contrary, uses nonnormalized weights, which is why the ATE values obtained from treatrew (see the previous section) are numerically different from those obtained from WLS and IPW. As proved by Busso, DiNardo, and McCrary (2008, 7), a general formula for estimating ATE by REW is

\widehat{\text{ATE}} = \frac{1}{N}\sum_{i=1}^{N} w_i y_i h_{i1} - \frac{1}{N}\sum_{i=1}^{N} (1 - w_i)\,y_i h_{i0} \qquad (8)

where

h_{i1} = 1/p(x_i), \qquad h_{i0} = 1/\{1 - p(x_i)\}

Such weights do not sum up to one. In this case, analytical standard errors cannot be retrieved by a weighted regression, and the method suggested by Wooldridge (2010) and implemented through treatrew for getting correct analytical standard errors for ATE, ATET, and ATENT is thus needed because a generated regressor from the first-step estimation is used in the second step.

The normalized weights used in WLS and IPW are instead

h_{i1} = \frac{1/p(x_i)}{\dfrac{1}{N_1}\sum_{i=1}^{N} w_i/p(x_i)}

h_{i0} = \frac{1/\{1 - p(x_i)\}}{\dfrac{1}{N_0}\sum_{i=1}^{N} (1 - w_i)/\{1 - p(x_i)\}}


Appendix B shows that if the formula of ATE implemented in treatrew used normalized (rather than nonnormalized) weights, then treatrew's ATE estimate would become numerically equivalent to the value of ATE obtained by the commands used to calculate WLS and IPW.

Thus we can assert that both teffects ipw and treatrew lead to correct analytical standard errors because both take into account that the propensity score is a generated regressor from a first-step (probit or logit) regression. The different values of ATE and ATET obtained in the two approaches reside only in the different weighting schemes (normalized versus nonnormalized).
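The numerical consequence of the weighting scheme can be sketched in a few lines of Python: on the same data, the nonnormalized estimator in (8) and a normalized counterpart give different, though asymptotically equivalent, estimates. The data-generating process below is illustrative and unrelated to the article's example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: binary treatment w depends on x through a
# known propensity score, and the true ATE is 1 by construction.
n = 20_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                 # propensity score p(x)
w = rng.binomial(1, p)
y = 2 + 0.5 * x + 1.0 * w + rng.normal(size=n)

# Nonnormalized (pure IPW) weights, as in (8): h1 = 1/p(x), h0 = 1/{1-p(x)}
ate_ipw = np.mean(w * y / p) - np.mean((1 - w) * y / (1 - p))

# Normalized weights: each group's weights are rescaled to sum to one,
# as done by WLS and teffects ipw
ate_norm = (np.sum(w * y / p) / np.sum(w / p)
            - np.sum((1 - w) * y / (1 - p)) / np.sum((1 - w) / (1 - p)))

print(ate_ipw, ate_norm)  # close to 1, but not identical to each other
```

Both estimators are consistent for the true ATE; in any finite sample, however, the two sets of weights produce numerically different values, which is exactly the discrepancy observed between treatrew and teffects ipw.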

In short, treatrew is useful when considering nonnormalized weights, that is, when a pure IPW scheme is used. Moreover, compared with teffects ipw, treatrew provides an estimate of ATENT, though it does not by default provide an estimate of the mean potential outcomes.

6 Conclusion

This article provides a command, treatrew, for estimating ATEs by REW on the propensity score as proposed by Rosenbaum and Rubin (1983). Although REW is a popular and long-standing statistical technique to deal with the bias induced by drawing inference in the presence of a nonrandom sample, its implementation in Stata with parameters' analytical standard errors (as proposed by Wooldridge [2010, 920-930]) and a nonnormalized weighting scheme was still missing. This article and the accompanying ado-file fill this gap by providing an easy-to-use implementation of the REW method, which can be used as a valuable tool for estimating causal effects under selection on observables.

7 References

Abadie, A., and G. W. Imbens. 2008. On the failure of the bootstrap for matching estimators. Econometrica 76: 1537-1557.

Brunell, T. L., and J. DiNardo. 2004. A propensity score reweighting approach to estimating the partisan effects of full turnout in American presidential elections. Political Analysis 12: 28-45.

Busso, M., J. DiNardo, and J. McCrary. 2008. Finite sample properties of semiparametric estimators of average treatment effects. http://elsa.berkeley.edu/users/cle/laborlunch/mccrary.pdf.

Cerulli, G. 2011. ivtreatreg: A new Stata routine for estimating binary treatment models with heterogeneous response to treatment under observable and unobservable selection. 8th Italian Stata Users Group meeting proceedings. http://www.stata.com/meeting/italy11/abstracts/italy11_cerulli.pdf.

Hirano, K., G. W. Imbens, and G. Ridder. 2003. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71: 1161-1189.


Horvitz, D. G., and D. J. Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47: 663-685.

Leuven, E., and B. Sianesi. 2003. psmatch2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. Statistical Software Components S432001, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s432001.html.

Li, Q., J. S. Racine, and J. M. Wooldridge. 2009. Efficient estimation of average treatment effects with mixed categorical and continuous data. Journal of Business and Economic Statistics 27: 206-223.

Lunceford, J. K., and M. Davidian. 2004. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine 23: 2937-2960.

Morgan, S. L., and D. J. Harding. 2006. Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological Methods and Research 35: 3-60.

Nichols, A. 2007. Causal inference with observational data. Stata Journal 7: 507-541.

Robins, J. M., M. A. Hernán, and B. Brumback. 2000. Marginal structural models and causal inference in epidemiology. Epidemiology 11: 550-560.

Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70: 41-55.

StataCorp. 2013. Stata Treatment-Effects Reference Manual: Potential Outcomes/Counterfactual Outcomes. College Station, TX: Stata Press.

Wooldridge, J. M. 2007. Inverse probability weighted estimation for general missing data problems. Journal of Econometrics 141: 1281-1301.

———. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.

South-Western.

Giovanni Cerulli is a researcher at Ceris-CNR, National Research Council of Italy, Institute for Economic Research on Firms and Growth. He received a degree in statistics and a PhD in economic sciences from Sapienza University of Rome and is editor-in-chief of the International Journal of Computational Economics and Econometrics. His research interests are mainly in applied microeconometrics, with a focus on counterfactual treatment-effects models for program evaluation. Stata programming and simulation- and agent-based methods are also among his related fields of study. He has published articles in high-quality, refereed economics journals.


Appendix A

This appendix provides the mathematical steps to get the REW formulas for ATEs as reported in (1)-(3). Observe first that wy = w\{wy_1 + y_0(1-w)\} = w^2 y_1 + wy_0 - w^2 y_0 = wy_1 because w^2 = w. Therefore,

E\left\{\frac{wy}{p(x)} \,\middle|\, x\right\}
= E\left\{\frac{wy_1}{p(x)} \,\middle|\, x\right\}
\overset{\text{LIE}}{=} E\left[ E\left\{\frac{wy_1}{p(x)} \,\middle|\, x, w\right\} \middle|\, x\right]
= E\left\{\frac{w\,E(y_1|x,w)}{p(x)} \,\middle|\, x\right\}
\overset{\text{CMI}}{=} E\left\{\frac{w\,E(y_1|x)}{p(x)} \,\middle|\, x\right\}
= E\left\{\frac{w\,g_1(x)}{p(x)} \,\middle|\, x\right\}
= g_1(x)\,E\left\{\frac{w}{p(x)} \,\middle|\, x\right\}
= \frac{g_1(x)}{p(x)}\,E(w|x)
= \frac{g_1(x)}{p(x)}\,p(x) = g_1(x) \qquad (9)

Similarly,

E\left[\frac{(1-w)y}{\{1-p(x)\}} \,\middle|\, x\right] = g_0(x) \qquad (10)

so that

\text{ATE}(x) = g_1(x) - g_0(x)
= E\left\{\frac{wy}{p(x)} \,\middle|\, x\right\} - E\left[\frac{(1-w)y}{\{1-p(x)\}} \,\middle|\, x\right]
= E\left[\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}} \,\middle|\, x\right]

provided that 0 < p(x) < 1. To get ATE, one needs to take the expectation of ATE(x) over x:

\text{ATE} = E_x\{\text{ATE}(x)\}
= E_x\left( E\left[\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}} \,\middle|\, x\right] \right)
= E\left[\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}}\right]

It can be shown that such an estimator is equivalent to the Horvitz-Thompson estimator (Horvitz and Thompson 1952). In sampling theory, it is a method for estimating the total and mean of a superpopulation in a stratified sample. IPW is generally applied to account for different proportions of observations within strata in a target population.
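The identity just derived can be checked by simulation: generating potential outcomes with a known propensity score and selection on observables, the sample analogue of the right-hand side recovers the true ATE. A minimal Python sketch, with an illustrative data-generating process unrelated to the article's data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data-generating process with selection on observables
n = 100_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-0.8 * x))           # known propensity score p(x)
w = rng.binomial(1, p)
y1 = 1 + x + rng.normal(size=n)          # potential outcome if treated
y0 = x + rng.normal(size=n)              # potential outcome if untreated
y = w * y1 + (1 - w) * y0                # observed outcome

true_ate = np.mean(y1 - y0)              # equals 1 up to simulation noise
rew_ate = np.mean((w - p) * y / (p * (1 - p)))
print(true_ate, rew_ate)                 # the two values agree closely
```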

Similarly, we can also calculate ATET by considering that

\{w-p(x)\}\,y = \{w-p(x)\}\,y_0 + w\{w-p(x)\}(y_1-y_0) = \{w-p(x)\}\,y_0 + w\{1-p(x)\}(y_1-y_0)

so that, dividing both sides by \{1-p(x)\},

\frac{\{w-p(x)\}\,y}{\{1-p(x)\}} = \frac{\{w-p(x)\}\,y_0}{\{1-p(x)\}} + w(y_1-y_0) \qquad (11)


Consider now the quantity \{w-p(x)\}\,y_0 on the right-hand side of (11). We see that

E[\{w-p(x)\}\,y_0 \,|\, x]
= E[\{w-p(x)\}\,E(y_0|x,w) \,|\, x]
= E[\{w-p(x)\}\,E(y_0|x) \,|\, x]
= E[\{w-p(x)\}\,g_0(x) \,|\, x]
= g_0(x)\,E[\{w-p(x)\} \,|\, x]
= g_0(x)\,[E(w|x) - E\{p(x)|x\}]
= g_0(x)\,\{p(x) - p(x)\} = 0

so that

E\left[\frac{\{w-p(x)\}\,y}{\{1-p(x)\}} \,\middle|\, x\right]
= E\left[\frac{\{w-p(x)\}\,y_0}{\{1-p(x)\}} \,\middle|\, x\right] + E\{w(y_1-y_0) \,|\, x\}
= E\{w(y_1-y_0) \,|\, x\}

Taking the expectation over x, we get

E_x\left( E\left[\frac{\{w-p(x)\}\,y}{\{1-p(x)\}} \,\middle|\, x\right] \right) = E\left[\frac{\{w-p(x)\}\,y}{\{1-p(x)\}}\right]
\quad\text{and}\quad
E_x\left[ E\{w(y_1-y_0) \,|\, x\} \right] = E\{w(y_1-y_0)\} \qquad (12)

that is,

E\left[\frac{\{w-p(x)\}\,y}{\{1-p(x)\}}\right] = E\{w(y_1-y_0)\}

Defining h = \{w-p(x)\}\,y/\{1-p(x)\}, we have

E(h) = E\{w(y_1-y_0)\}
= p(w=1)\,E\{w(y_1-y_0) \,|\, w=1\} + p(w=0)\,E\{w(y_1-y_0) \,|\, w=0\}
= p(w=1)\,E\{(y_1-y_0) \,|\, w=1\}
= p(w=1)\cdot\text{ATET}

so that

E\left[\frac{\{w-p(x)\}\,y}{\{1-p(x)\}}\right] = E\{w(y_1-y_0)\} = p(w=1)\cdot\text{ATET}

proving that

\text{ATET} = E\left[\frac{\{w-p(x)\}\,y}{p(w=1)\{1-p(x)\}}\right]

Following the same steps, one can also prove that

\text{ATENT} = E\left[\frac{\{w-p(x)\}\,y}{p(w=0)\,p(x)}\right]


Appendix B

In this appendix, we show that if one considers the formula of ATE as implemented in treatrew by using normalized rather than nonnormalized weights, then treatrew's ATE estimate becomes numerically equivalent to the ATE obtained by the commands used to calculate WLS and IPW. To this purpose, we first calculate the ATE estimator by means of the general formula in (8) by adopting normalized IPW weights:

\widehat{\text{ATE}} = \frac{1}{N}\sum_{i=1}^{N} w_i y_i h_{i1} - \frac{1}{N}\sum_{i=1}^{N} (1-w_i)\,y_i h_{i0}

As an intermediary step, we show that the normalized weights sum up to one for both the treated and the untreated subjects.

. generate h1 = educ7/_ps // observe that educ7=w

. summarize h1

. scalar sum_h1 = _N*r(mean)

. summarize educ7 if educ7==1

. scalar mean_h1 = (1/r(N))*sum_h1

. generate H1 = (1/_ps)/mean_h1 // H1 is the normalized weight for treated units

. generate m1 = educ7*H1 // m1 is equal to w*h1 using h1=H1

. summarize m1

. scalar tot_m1 = _N*r(mean)

. summarize educ7 if educ7==1

. scalar N1 = r(N)

. scalar one1 = (1/N1)* tot_m1

. display one1

1 // ok

. generate h0 = (1-educ7)/(1-_ps)

. summarize h0

. scalar sum_h0 = _N*r(mean)

. summarize educ7 if educ7==0

. scalar mean_h0 = (1/r(N))*sum_h0

. generate H0 = (1/(1-_ps))/mean_h0 // H0 is the normalized weight for

> untreated units

. generate m0 = (1-educ7)*H0 // m0 is equal to (1-w)*h0 using h0=H0

. summarize m0

. scalar tot_m0 = _N*r(mean)

. summarize educ7 if educ7==0

. scalar N0 = r(N)

. scalar one0 = (1/N0)* tot_m0

. display one0

1 // ok


Second, we compute the estimation of ATE by multiplying the two summands for

the treated and untreated units in (8) by the outcome y (equal in this example to the

variable children):

. generate s1 = children*educ7*H1 // s1 is the summand y*w*h1 of (8) with h1=H1

. summarize s1

. scalar tot_s1 = _N*r(mean)

. summarize educ7 if educ7==1

. scalar N1 = r(N)

. scalar _s1 = (1/N1)* tot_s1 // _s1 is the average outcome for treated units

. display _s1

2.0550377

. generate s0 = children*(1-educ7)*H0 // s0 is y*(1-w)*h0 of (8) with h0=H0

. summarize s0

. scalar tot_s0 = _N*r(mean)

. summarize educ7 if educ7==0

. scalar N0 = r(N)

. scalar _s0 = (1/N0)* tot_s0 // _s0 is the average outcome for untreated units

. display _s0

2.208163

. display _s1 - _s0
-.15312

which is numerically equivalent to the value of the ATE obtained via WLS and IPW.

The Stata Journal (2014)
14, Number 3, pp. 562-579

Modeling count data with generalized distributions

Tammy Harris
Institute for Families in Society
University of South Carolina
Columbia, SC
harris68@mailbox.sc.edu

Joseph M. Hilbe
School of Social and Family Dynamics
Arizona State University
Tempe, AZ
hilbe@asu.edu

James W. Hardin

Institute for Families in Society

Department of Epidemiology and Biostatistics

University of South Carolina

Columbia, SC

jhardin@sc.edu

Abstract. We present motivation and new commands for modeling count data. While our focus is to present new commands for estimating count-data models, we also discuss generalized binomial regression and present the zero-inflated versions of each model.

Keywords: st0351, gbin, zigbin, nbregf, nbregw, zinbregf, zinbregw, binomial, Waring, count data, overdispersion, underdispersion

1 Introduction

We introduce programs for regression models of count data. Poisson regression analysis is widely used to model such response variables, but the Poisson model assumes equidispersion (equality of the mean and variance). In practice, equidispersion is rarely reflected in data. In most situations, the variance exceeds the mean. This occurrence of extra-Poisson variation is known as overdispersion (see, for example, Dean [1992]). In situations where the variance is smaller than the mean, data are characterized as being underdispersed. Modeling underdispersed count data with inappropriate models can lead to overestimated standard errors and misleading inference. While there are various approaches for modeling overdispersed count data, such as the negative binomial distributions and other mixtures of Poisson (Yang et al. 2007; Hilbe 2014), there are few models for underdispersed count data. Harris, Yang, and Hardin (2012) introduced a generalized Poisson regression command to handle underdispersed count data.

As stated earlier, count data can be analyzed using regression models based on the Poisson distribution. However, in this article, we will discuss other discrete regression models that can be used, such as the generalized negative binomial distribution, which was described by Jain and Consul (1971) and later by Consul and Gupta (1980). The distribution was also investigated by Famoye (1995), who illustrated its use for analyzing grouped binomial data.

© 2014 StataCorp LP   st0351

T. Harris, J. W. Hilbe, and J. W. Hardin 563

One such model is based on a simplification of the generalized negative binomial distribution for which we treat one of the parameters as the known denominator of proportional (grouped binomial) outcomes. The properties and utility of the distribution for regression models for count and grouped binomial data are discussed in Jain and Consul (1971), Consul and Gupta (1980), and Famoye (1995).

Another extension of the negative binomial distribution is the univariate generalized Waring distribution, or the beta negative binomial distribution. The present generalized Waring distribution was proposed and used by Irwin (1968) to model accident count data. An advantage of this model over the negative binomial model is that investigators can separate the unobserved heterogeneity from the internal factors of each individual's characteristics and external factors (covariates) that may affect the variability of data (confounding). For more technical and historical information on the distribution and associated regression models, see Rodríguez-Avi et al. (2009), Irwin (1968), and Hilbe (2011).

To distinguish the origins of specific regression models, we use NBREGF for count

models based on the generalized negative binomial distribution, GBIN for grouped bino-

mial models based on a simplication of the generalized negative binomial distribution,

and NBREGW for count models based on the generalized Waring distribution.

Many applications of the NBREGF regression model have been illustrated in studies involving medicine, ecology, physics, etc. Wang et al. (2012) used the NBREGF model to analyze a rehabilitation program study that evaluated brain function in stroke patients by using functional magnetic resonance imaging. Hardin and Hilbe (2012) presented an example that used microplot data of carrot fly damage. For this example, the authors analyzed these data by using Stata's suite of ml() functions and developed syntax for the GBIN regression. Lastly, Rodríguez-Avi et al. (2009) used the NBREGW regression model to model the number of goals scored by football players, and they compared the results with the results of a regression model based on the negative binomial distribution.

Herein, we illustrate modeling count data using the NBREGF, GBIN, and NBREGW regression models. This article is organized as follows. In section 2, we review the three count-data regression models and their zero-inflated versions. In section 3, we present the syntax for the new commands. In section 4, we present a real-world data example. Finally, in section 5, we give a summary. We also present software that we enhanced from Hardin and Hilbe (2012) to fit NBREGF and GBIN models.

2 The models

2.1 Generalized negative binomial: Famoye

As implemented in the accompanying software, the NBREGF model assumes that \varphi is a scalar unknown parameter. Thus the probability mass function (PMF), mean, and variance are given by

564 Modeling count data with generalized distributions

P(Y = y) = \frac{\varphi}{\varphi + \beta y}\binom{\varphi + \beta y}{y}\,\alpha^{y}\,(1 - \alpha)^{\varphi + \beta y - y} \qquad (1)

where 0 < \alpha < 1 and 1 \le \beta < 1/\alpha for \varphi > 0 and nonnegative outcomes y_i (0, 1, 2, \ldots), with

E(Y) = \varphi\alpha(1 - \alpha\beta)^{-1}

V(Y) = \varphi\alpha(1 - \alpha)(1 - \alpha\beta)^{-3}

The main differences from the GBIN model are that the \varphi parameter is an unknown parameter in (1) but a known parameter in (2) and that \beta > 1. In the limit \beta \to 1, the variance approaches that of the negative binomial distribution. Thus the \beta parameter generalizes the negative binomial distribution in the NBREGF model to have greater variance than is allowed in a negative binomial regression model. To construct a regression model, we implemented the log link \log(\mu) = x\beta to make results comparable to Poisson and negative binomial models.
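As a sanity check on the PMF in (1), it can be evaluated in log space to avoid overflow in the gamma functions. In the special case \beta = 1, the distribution reduces to the ordinary negative binomial, so the probabilities sum to one and the mean equals \varphi\alpha(1 - \alpha\beta)^{-1}. A Python sketch with illustrative parameter values (not estimates from any data):

```python
import math

# Generalized negative binomial PMF as in (1), evaluated in log space;
# phi, beta, and alpha below are illustrative values, not estimates.
def gnb_pmf(y, phi, beta, alpha):
    t = phi + beta * y
    logp = (math.log(phi) - math.log(t)
            + math.lgamma(t + 1) - math.lgamma(y + 1) - math.lgamma(t - y + 1)
            + y * math.log(alpha) + (t - y) * math.log(1 - alpha))
    return math.exp(logp)

phi, beta, alpha = 2.0, 1.0, 0.3     # beta = 1: ordinary negative binomial
probs = [gnb_pmf(y, phi, beta, alpha) for y in range(400)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
print(total, mean)  # total ~ 1; mean ~ phi*alpha/(1 - alpha*beta) = 6/7
```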

The generalized binomial regression model is based on a simplification of the generalized negative binomial distribution. We assume that the parameter \varphi in (1) is a vector of observation-specific known constants n (they are the denominators of grouped binomial data), that \beta = \psi, and that \alpha is replaced with \mu/(1 + \mu). When n is known, the \psi parameter is nonnegative, while in the generalized negative binomial distribution, \beta > 1. Under these changes, the PMF, mean, and variance are given by

P(Y = y) = \frac{n}{n + \psi y}\binom{n + \psi y}{y}\left(\frac{\mu}{1 + \mu}\right)^{y}\left(1 - \frac{\mu}{1 + \mu}\right)^{n + \psi y - y} \qquad (2)

E(Y) = n\,\frac{\mu}{1 + \mu}\left(1 - \frac{\psi\mu}{1 + \mu}\right)^{-1} = \frac{n\mu}{1 + \mu - \psi\mu}

V(Y) = n\,\frac{\mu}{1 + \mu}\left(1 - \frac{\mu}{1 + \mu}\right)\left(1 - \frac{\psi\mu}{1 + \mu}\right)^{-3} = \frac{n\mu(1 + \mu)}{(1 + \mu - \psi\mu)^{3}}

Parameterizing g(\pi) = x\beta, where g(\cdot) is a suitable link function and \pi = \mu/(1 + \mu) plays the role of the probability of success, we obtain results that coincide with a grouped-data binomial model. The variance is equal to the binomial variance if \psi = 0, and it is equal to the negative binomial variance if \psi = 1. Thus the \psi > 0 parameter generalizes the binomial distribution in the GBIN regression model.

As illustrated in Irwin (1968), the generalized Waring distribution can be constructed under the following specifications:


i. Y \,|\, x, \lambda_x, v \sim \text{Poisson}(\lambda_x)

ii. \lambda_x \,|\, v \sim \text{Gamma}(a_x, v)

iii. v \sim \text{Beta}(\rho, k)

interpreting \lambda_x as accident liability and v as accident proneness. The PMF is ultimately given by

P(Y = y) = \frac{\Gamma(a_x + \rho)\,\Gamma(k + \rho)}{\Gamma(\rho)\,\Gamma(a_x + k + \rho)} \; \frac{(a_x)_y\,(k)_y}{(a_x + k + \rho)_y\; y!}

where k, \rho, a_x > 0, a_x = \mu(\rho - 1)/k, and (a)_w is the Pochhammer notation for \Gamma(a + w)/\Gamma(a) if a > 0. The expected value and variance of the distribution are

E(Y) = \frac{a_x k}{\rho - 1} = \mu

V(Y) = \mu + \frac{k + 1}{\rho - 2}\,\mu + \frac{k + \rho - 1}{k(\rho - 2)}\,\mu^{2} \qquad (3)

where a_x, k > 0 and \rho > 2 (to ensure nonnegative variance). To construct a regression model, we implemented the log link \log(\mu) = x\beta to make results comparable to Poisson and negative binomial models. A unique characteristic of this model occurs when the data are from a different underlying distribution. For instance, when the data are from a Poisson distribution with V(Y) = \mu, it indicates that (k + 1)/(\rho - 2) \to 0 and (k + \rho - 1)/\{k(\rho - 2)\} \to 0, which occurs as k, \rho \to \infty. Also, if the data have an underlying NB-2 (negative binomial-2) distribution with V(Y) = \mu + \alpha\mu^2 (where \alpha is the dispersion parameter), it indicates that (k + 1)/(\rho - 2) \to 0 and (k + \rho - 1)/\{k(\rho - 2)\} \to \alpha, where k \to 1/\alpha and \rho \to \infty.
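These limits are easy to verify directly from the variance decomposition in (3). A short Python sketch with illustrative parameter values (not estimates from any dataset):

```python
# Coefficients of the mu and mu^2 terms in (3):
# V(Y) = mu + [(k + 1)/(rho - 2)]mu + [(k + rho - 1)/{k(rho - 2)}]mu^2
def waring_terms(k, rho):
    return (k + 1) / (rho - 2), (k + rho - 1) / (k * (rho - 2))

# NB-2 limit: with k = 1/alpha and rho large, the extra mu term vanishes
# and the mu^2 coefficient approaches the NB-2 dispersion parameter alpha
alpha = 0.5
t_mu, t_mu2 = waring_terms(1 / alpha, 1e6)
print(t_mu, t_mu2)  # t_mu -> 0, t_mu2 -> alpha
```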

When there is an excess of zeros in count-response data, Poisson (and other) distribution models may not be appropriate to use. Hardin and Hilbe (2012) describe the two origins of zero outcomes: 1) individuals who do not enter into the counting process and 2) individuals who enter into the counting process and have a zero outcome. Therefore, the model must be separated into different parts, one consisting of a zero count y = 0 and the other consisting of a nonzero count y > 0. The zero-inflated model is given by

P(Y = y) = \begin{cases} p + (1-p)f(0) & y = 0 \\ (1-p)f(y) & y = 1, 2, \ldots \end{cases}

where p is the probability that the binary process results in a zero outcome, 0 \le p < 1, and f(y) is the probability function. Zero-inflation models are proposed for the NBREGF, GBIN, and NBREGW distributions.
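The mixture is straightforward to implement for any base PMF f(y). In the Python sketch below, a Poisson PMF stands in for the NBREGF, GBIN, or NBREGW PMFs purely for illustration, and the parameter values are made up:

```python
import math

# Zero-inflated mixture with a Poisson f(y) standing in for any count PMF;
# p and mu are illustrative values
def poisson_pmf(y, mu):
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def zi_pmf(y, p, mu):
    base = (1 - p) * poisson_pmf(y, mu)
    return p + base if y == 0 else base

p, mu = 0.3, 2.0
probs = [zi_pmf(y, p, mu) for y in range(200)]
print(sum(probs), probs[0])  # sums to 1; P(0) = p + (1 - p)exp(-mu)
```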


3 Syntax

The accompanying software includes the command files as well as supporting files for prediction and help. In the following syntax diagrams, unspecified options include the usual collection of maximization and display options available to all estimation commands. In addition, all zero-inflated commands include the ilink(linkname) option to specify the link function for the inflation model. The generalized binomial model for grouped binomial data also includes the link(linkname) option for linking the probability of success to the linear predictor. Supported linknames include logit, probit, loglog, and cloglog.

The syntax for specifying a generalized binomial regression model for grouped data is given by

gbin depvar [indepvars] [if] [in] [weight] [, options]

zigbin depvar [indepvars] [if] [in] [weight],
      inflate(varlist[, offset(varname)] | _cons) [vuong options]

The syntax for fitting a generalized negative binomial regression model where the distribution is assumed to follow Famoye's description is given by

nbregf depvar [indepvars] [if] [in] [weight] [, options]

The syntax for fitting a generalized negative binomial regression model where the distribution is derived from the Waring distribution is given by

nbregw depvar [indepvars] [if] [in] [weight] [, options]

The syntax for specifying a zero-inflated count model where the count distribution follows that described by Famoye is given by

zinbregf depvar [indepvars] [if] [in] [weight],
      inflate(varlist[, offset(varname)] | _cons) [vuong options]

The syntax for specifying a zero-inflated count model where the count distribution follows the Waring distribution is given by

zinbregw depvar [indepvars] [if] [in] [weight],
      inflate(varlist[, offset(varname)] | _cons) [vuong options]

A Vuong test (see Vuong [1989]) evaluates whether the regression model with zero inflation or the regression model without zero inflation is closer to the true model. A random variable is defined as the vector of observation-level differences m = \log L_Z - \log L_S, where L_Z is the likelihood of the zero-inflated model evaluated at its maximum likelihood estimate, and L_S is the likelihood of the standard (nonzero-inflated) model evaluated at its maximum likelihood estimate. The vector of differences over the N observations is then used to define the statistic

V = \frac{\sqrt{N}\,\overline{m}}{\sqrt{\sum_{i=1}^{N}(m_i - \overline{m})^{2}/(N-1)}}

which, asymptotically, is characterized by a standard normal distribution. A significant positive statistic indicates preference for the zero-inflated model, and a significant negative statistic indicates preference for the model without zero inflation. Nonsignificant Vuong statistics indicate no preference for either model. Results of this test are included in a footnote to the estimation of the model when the user includes the vuong option in any of the zero-inflated commands. Vuong statistics with corrections based on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are also displayed in the output (see Desmarais and Harden [2013] for details). They are displayed for each of the zero-inflated models discussed in this article.
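The statistic is, in effect, a one-sample z test on the per-observation log-likelihood differences, and can be sketched as follows (the log-likelihood values below are made up for illustration):

```python
import math

# Vuong statistic from per-observation log likelihoods of the zero-inflated
# model (lz) and the standard model (ls); the values below are made up
def vuong(lz, ls):
    m = [a - b for a, b in zip(lz, ls)]
    n = len(m)
    mbar = sum(m) / n
    s2 = sum((mi - mbar) ** 2 for mi in m) / (n - 1)
    return math.sqrt(n) * mbar / math.sqrt(s2)

lz = [-1.2, -0.8, -2.1, -0.5, -1.7, -0.9]
ls = [-1.4, -0.9, -2.0, -0.8, -1.9, -1.0]
print(vuong(lz, ls))  # positive values favor the zero-inflated model
```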

4 Example

We shall use the popular German health data for the year 1984 as example data. The goal of our model is to understand the number of visits made to a physician during 1984. Our predictor of interest is whether the patient is highly educated based on achieving a graduate degree, for example, an MA or MS, an MBA, a PhD, or a professional degree. Confounding predictors are age (from 25 to 64) and income in German marks, divided by 10. We first model the data using Poisson regression. The glm command is used to determine the Pearson dispersion, or dispersion statistic, which is not available using the poisson command.


(German health data for 1984; Hardin & Hilbe, GLM and Extensions, 3rd ed)

. gen hh = hhninc/10

. glm docvis edlevel4 age hh, nolog eform fam(poisson)

Generalized linear models No. of obs = 3874

Optimization : ML Residual df = 3870

Scale parameter = 1

Deviance = 24369.36065 (1/df) Deviance = 6.296992

Pearson = 44032.57716 (1/df) Pearson = 11.37793

Variance function: V(u) = u [Poisson]

Link function : g(u) = ln(u) [Log]

AIC = 8.120749

Log likelihood = -15725.89176 BIC = -7604.745

OIM

docvis IRR Std. Err. z P>|z| [95% Conf. Interval]

age 1.026209 .0008362 31.75 0.000 1.024571 1.027849

hh .3468308 .0257417 -14.27 0.000 .299876 .4011378

_cons 1.326749 .0608884 6.16 0.000 1.212619 1.451619

. estat ic

Akaike's information criterion and Bayesian information criterion

. nbreg docvis edlevel4 age hh, nolog irr

Negative binomial regression Number of obs = 3874

LR chi2(3) = 161.23

Dispersion = mean Prob > chi2 = 0.0000

Log likelihood = -8344.5927 Pseudo R2 = 0.0096

age 1.026037 .0023731 11.11 0.000 1.021397 1.030699

hh .4487569 .0718929 -5.00 0.000 .327827 .6142958

_cons 1.246529 .1453412 1.89 0.059 .991871 1.56657


. estat ic

Akaike's information criterion and Bayesian information criterion

The AIC and BIC statistics are substantially lower here than they are for the Poisson model, indicating a much better fit.

. display 1/exp(_b[edlevel4])

1.3763358

Patients without a graduate education are 38% more likely to see a physician than are patients with a graduate education. We can likewise affirm that patients without a graduate education saw a physician 38% more often in 1984 than patients with a graduate education.

The negative binomial model did not adjust for all the correlation, or dispersion, in

the data.

. display e(dispers_p)

1.4017258

This is perhaps due to the excessive number of times a patient in the data never saw a physician in 1984. A tabulation of docvis shows that nearly 42% of the 3,874 patients in the data did not visit a physician. This value is far greater than what is accounted for by the Poisson and negative binomial distributional assumptions.

. count if docvis==0

1611

. display "Zeros account for " %4.2f (r(N)*100/3874) "% of the outcomes"

Zeros account for 41.58% of the outcomes

Given the excess zero counts in docvis, it may be wise to employ a zero-inflated regression model on the data. At the least, we can determine which predictors tend to prevent patients from going to the doctor.


. zinb docvis edlevel4 age hh, nolog inflate(edlevel4 age hh) irr

Zero-inflated negative binomial regression Number of obs = 3874

Nonzero obs = 2263

Zero obs = 1611

Inflation model = logit LR chi2(3) = 98.50

Log likelihood = -8330.799 Prob > chi2 = 0.0000

docvis

edlevel4 .9176719 .1289238 -0.61 0.541 .6967903 1.208573

age 1.020511 .0025432 8.15 0.000 1.015538 1.025508

hh .4506524 .0720932 -4.98 0.000 .3293598 .6166132

_cons 1.768336 .2419851 4.17 0.000 1.352333 2.31231

inflate

edlevel4 1.174194 .3519899 3.34 0.001 .4843067 1.864082

age -.0521002 .0115586 -4.51 0.000 -.0747547 -.0294458

hh .2071444 .570265 0.36 0.716 -.9105545 1.324843

_cons -.037041 .4438804 -0.08 0.933 -.9070305 .8329486

. estat ic

Akaike's information criterion and Bayesian information criterion

The AIC statistic is 20 points lower in the zero-inflated model, but the BIC statistic is 5 points higher. However, variables edlevel4 and age appear to affect zero counts, with younger graduate patients more likely to not see a physician at all during the year. Given the zero-inflated model, patients without a graduate education see the physician 9% more often than patients with a graduate education.

. display 1/exp(_b[edlevel4])

1.0897141

Because excess zero counts did not appear to account for the extra correlation in the data, there may be other factors. We employ a generalized Waring negative binomial model to further identify the source of extra dispersion.


. nbregw docvis edlevel4 age hh, nolog eform

Generalized negative binomial-W regression Number of obs = 3874

LR chi2(3) = 163.80

Log likelihood = -8315.421 Prob > chi2 = 0.0000

age 1.027732 .0024925 11.28 0.000 1.022859 1.032629

hh .4693135 .086958 -4.08 0.000 .3263967 .674808

_cons 1.142679 .1431097 1.06 0.287 .8939621 1.460593

/lnk -.6113509 .0521974 -.7136559 -.5090458

k .5426174 .0283232 .4898501 .6010688

. estat ic

Akaike's information criterion and Bayesian information criterion

The AIC and BIC statistics are substantially lower here than for either the negative binomial or zero-inflated version. For the calculated ρ and k, the variance is V(Y) = μ + 0.624μ + 2.994μ², where μ is the mean. Here we see that the term (k + ρ − 1)/{k(ρ − 2)} = 2.994, from (3), is close to the dispersion parameter α = 2.319 when using an NB-2 regression model from above. More information on the background of this model can be found in Hilbe (2011).
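The variance decomposition can be checked numerically. In the sketch below (Python, for illustration only), k is taken from the nbregw output, while ρ is an assumed value backed out from the reported coefficient 0.624, since ρ itself is not shown in the listing:

```python
# Check the generalized Waring variance decomposition
#   V(Y) = mu + [(k+1)/(rho-2)]*mu + [(k+rho-1)/(k*(rho-2))]*mu^2
k = 0.5426174        # from the nbregw output above
rho = 4.4721         # assumed value (not displayed in the listing)

linear_extra = (k + 1) / (rho - 2)           # coefficient on mu beyond 1
quadratic = (k + rho - 1) / (k * (rho - 2))  # coefficient on mu^2
print(round(linear_extra, 3), round(quadratic, 3))
```

The quadratic coefficient should reproduce the value 2.994 quoted in the text (up to rounding).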


To address the excess zeros in the outcome, we also fit a zero-inflated Waring model.

Zero-inflated gen neg binomial-W regression     Number of obs   =      3874
Regression link:                                Nonzero obs     =      2263
Inflation link : logit                          Zero obs        =      1611
                                                Wald chi2(3)    =     66.10
Log likelihood = -8262.174                      Prob > chi2     =    0.0000

docvis
   edlevel4    .9414482   .1406355    -0.40   0.686    .7024933   1.261684
        age    1.017108   .0024842     6.95   0.000    1.012251   1.021989
         hh    .4841428   .0964645    -3.64   0.000    .3276222   .7154409
      _cons    2.457403   .3313549     6.67   0.000    1.886691   3.200751

inflate
   edlevel4     .613575   .2222675     2.76   0.006    .1779387   1.049211
        age    -.026716   .0048778    -5.48   0.000   -.0362763  -.0171558
         hh   -.0137845   .3544822    -0.04   0.969   -.7085569   .6809879
      _cons    .1834942    .245023     0.75   0.454   -.2967421   .6637305

       /lnk    1.071457   .2498257                      .581808   1.561107
          k    2.919632   .7293992                     1.789271   4.764092

Vuong test of zinbregw vs. gen neg binomial(W): z =  0.55  Pr>z = 0.2897
Bias-corrected (AIC) Vuong test:                z =  0.13  Pr>z = 0.4482
Bias-corrected (BIC) Vuong test:                z = -1.20  Pr>z = 0.8845

. estat ic

Akaike's information criterion and Bayesian information criterion

Note that introducing the zero-inflation component into the regression model results in the education level losing significance in the model of the mean outcomes. However, that variable does play a significant role (along with age) in determining whether a person has zero visits to the doctor.

We can also attempt to understand the relationship between doctor visits and the high education of patients, with the additional factors age and income, by using another parameterization of the negative binomial. This model was discussed in Famoye (1995), but it has received little notice in the literature, probably because of the lack of associated software support.


. nbregf docvis edlevel4 age hh, nolog eform

Generalized negative binomial-F regression      Number of obs   =      3874
                                                LR chi2(3)      =    166.51
Log likelihood = -8337.884                      Prob > chi2     =    0.0000

        age    1.025957   .0024634    10.67   0.000     1.02114   1.030796
         hh    .4596616   .0743405    -4.81   0.000    .3347915   .6311055
      _cons    2.366462   .3349416     6.09   0.000    1.793177   3.123028

   /lntheta   -.6445887   .0760764                    -.7936957  -.4954816
      theta    .5248784   .0399309                     .4521706   .6092774

. estat ic

Akaike's information criterion and Bayesian information criterion

Note that the risk ratios are nearly identical to those from the NB-2 negative binomial model. The AIC and BIC statistics are lower than for NB-2, but only by about 12 and 5 points, respectively. Because of the excessive zero counts, we also fit a zero-inflated model.


. zinbregf docvis edlevel4 age hh, nolog inflate(edlevel4 age hh) eform vuong

Zero-inflated gen neg binomial-F regression     Number of obs   =      3874
Regression link:                                Nonzero obs     =      2263
Inflation link : logit                          Zero obs        =      1611
                                                LR chi2(3)      =    176.08
Log likelihood = -8292.015                      Prob > chi2     =    0.0000

docvis
   edlevel4    .9125286   .1191361    -0.70   0.483    .7065079   1.178626
        age    1.017058   .0024233     7.10   0.000    1.012319   1.021818
         hh    .4915087   .0753322    -4.63   0.000    .3639736   .6637315
      _cons    .0010836   .2112138    -0.04   0.972    1.3e-169   8.9e+162

inflate
   edlevel4    .7118035   .2073926     3.43   0.001    .3053213   1.118286
        age   -.0380198   .0054111    -7.03   0.000   -.0486254  -.0274142
         hh    .2529651   .3447803     0.73   0.463    -.422792   .9287221
      _cons     .368429   .2425669     1.52   0.129   -.1069933   .8438514

   /lntheta    7.679818   194.9173                    -374.3511   389.7107
      theta    2164.225   421844.9                     2.6e-163   1.8e+169

Vuong test of zinbregf vs. gen neg binomial(F): z = 6.23  Pr>z = 0.0000
Bias-corrected (AIC) Vuong test:                z = 5.68  Pr>z = 0.0000
Bias-corrected (BIC) Vuong test:                z = 3.99  Pr>z = 0.0000

. estat ic

Akaike's information criterion and Bayesian information criterion

The AIC and BIC statistics are substantially lower than for the non-zero-inflated parameterization, and they are also lower than for the Waring regression model. Here we find that younger patients without a graduate education see physicians more frequently than patients with a graduate education (as we discovered before) and that the extra dispersion is captured by the model's dispersion parameters.

If the outcomes are bounded counts (for which the bounds are known), then the data can be addressed by grouped binomial models. Rather than introducing a new dataset for these models as we did before, we illustrate how to generate synthetic data.

Herein, we synthesize the generalized binomial outcome along with a zero-inflated version of the generalized binomial outcome. To highlight the options built into the commands, we generate data following a complementary log-log link function for the generalized binomial outcome and a log-log link for the zero-inflation component.

. drop _all
. set obs 1500
obs was 0, now 1500
. // Linear predictors for zero-inflation
. gen z1 = runiform() < 0.5
. gen z2 = runiform() < 0.5
. gen zg = -0.5+0.25*z1+0.25*z2
. // Note that the zero-inflation link function is in terms of Prob(Y=0)
. gen z = rbinomial(1,1-exp(-exp(-zg)))   // ilink(loglog)
. // Linear predictors for the outcome
. gen x1 = runiform() < 0.5
. gen xb = -2+0.5*x1
. gen n = floor(10*runiform()) + 1
. // Note that the outcome link function is in terms of Prob(Y=1)
. gen mu = 1-exp(-exp(xb))   // link(cloglog)

Once we have defined the components of the outcome and the necessary covariates, we generate the outcome. The zero-inflated version of the outcome is the product of the binomial outcome and the zero-inflation (binary) component.

. gen double yu = runiform()   // random quantile
. gen y = 0   // initial outcome
. gen double p = 0   // initial cumulative probability
. capture program drop doit
. program define doit
  1.         args sigma
  2.         local flag 1
  3.         local y = 0
  4.         while `flag' {   // increase cumulative probability if y < n
  5.         quietly replace p = p + exp(lngamma(n+`y'*`sigma'+1)-
> lngamma(n+`y'*`sigma'-`y'+1)-lngamma(`y'+1)+log(n)+`y'*log(mu) +
> (n+`y'*`sigma'-`y')*log(1+mu*`sigma'-mu)-log(n+`y'*`sigma')-
> (n+`y'*`sigma')*log(1+mu*`sigma')) if `y' < n
  6.         quietly replace y = y+1 if p <= yu   // increase y if cumulative
> probability <= yu
  7.         quietly replace p = 1 if y >= n
  8.         local y = `y'+1
  9.         quietly count if p <= yu   // see if finished
 10.         if `r(N)'==0 {
 11.         local flag = 0   // all done
 12.         }
 13.         }
 14. end

. doit 1.25   // sigma=1.25
. // Zero-inflated outcomes "yo"
. gen yo = y*z
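The same data-generating process can be mirrored outside Stata. The following Python sketch is an illustration, not part of the package; it follows the inverse-CDF loop of the Stata code above, with the binomial denominator n fixed at 5 for brevity:

```python
import math, random

def gb_logpmf(y, n, mu, sigma):
    # log pmf used in the `doit` loop above (same lngamma expression)
    return (math.lgamma(n + y*sigma + 1) - math.lgamma(n + y*sigma - y + 1)
            - math.lgamma(y + 1) + math.log(n) + y*math.log(mu)
            + (n + y*sigma - y)*math.log(1 + mu*sigma - mu)
            - math.log(n + y*sigma) - (n + y*sigma)*math.log(1 + mu*sigma))

def draw_gb(n, mu, sigma, rng):
    # inverse-CDF draw: accumulate probability until the random quantile
    # is reached; remaining mass beyond y = n is assigned to n
    u, cum = rng.random(), 0.0
    for y in range(n):
        cum += math.exp(gb_logpmf(y, n, mu, sigma))
        if u <= cum:
            return y
    return n

rng = random.Random(1)
x1 = rng.random() < 0.5
mu = 1 - math.exp(-math.exp(-2 + 0.5*x1))          # cloglog link, as above
z1, z2 = rng.random() < 0.5, rng.random() < 0.5
zg = -0.5 + 0.25*z1 + 0.25*z2
z = rng.random() < (1 - math.exp(-math.exp(-zg)))  # loglog inflation
y = draw_gb(n=5, mu=mu, sigma=1.25, rng=rng)
yo = y * z                                         # zero-inflated outcome
```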


We fit a model to see how closely the sample data match the specifications.

. gbin y x1, link(cloglog) n(n) nolog

Generalized binomial regression                 Number of obs   =      1500
Link            = cloglog                       LR chi2(1)      =     50.73
Dispersion      = generalized binomial          Prob > chi2     =    0.0000
Log likelihood  = -1775.7031                    Pseudo R2       =    0.0141

      _cons   -2.000648   .0503157   -39.76   0.000   -2.099264  -1.902031

Before fitting the zero-inflated model for the zero-inflated outcome, we first illustrate how well a zero-inflated model might fit the non-zero-inflated outcome. In this case, we should expect the binomial regression components to estimate the means well, and we should expect the covariates of the zero-inflation component to be nonsignificant.

. zigbin y x1, inflate(z1 z2) n(n) link(cloglog) ilink(loglog) vuong nolog

Zero-inflated generalized binomial regression   Number of obs   =      1500
Regression link: cloglog                        Nonzero obs     =       751
Inflation link : loglog                         Zero obs        =       749
                                                LR chi2(1)      =     42.67
Log likelihood = -1772.5                        Prob > chi2     =    0.0000

y
         x1     .447438   .0681432     6.57   0.000    .3138797   .5809963
      _cons   -1.958826   .0540354   -36.25   0.000   -2.064733  -1.852918

inflate
         z1    .4499741   .4248806     1.06   0.290   -.3827765   1.282725
         z2    2.068714   60.05847     0.03   0.973   -115.6437   119.7812
      _cons   -3.264426   60.05983    -0.05   0.957   -120.9795   114.4507

Bias-corrected (AIC) Vuong test:  z =  0.08  Pr>z = 0.4683
Bias-corrected (BIC) Vuong test:  z = -3.04  Pr>z = 0.9988

Note that the Vuong statistic was nonsignificant in this example. Though it fails to provide compelling evidence for one model over the other, we would prefer the non-zero-inflated model because of the lack of significant covariates in the inflation equation. When we fit a zero-inflated model for the outcome that was specifically generated to include zero inflation, we see a much better fit.

. zigbin yo x1, inflate(z1 z2) n(n) link(cloglog) ilink(loglog) vuong nolog

Zero-inflated generalized binomial regression   Number of obs   =      1500
Regression link: cloglog                        Nonzero obs     =       541
Inflation link : loglog                         Zero obs        =       959
                                                LR chi2(1)      =     28.34
Log likelihood = -1518.557                      Prob > chi2     =    0.0000

yo
         x1    .4628085    .086265     5.36   0.000    .2937322   .6318848
      _cons   -1.969505   .0873894   -22.54   0.000   -2.140785  -1.798225

inflate
         z1    .2292778   .1270487     1.80   0.071    -.019733   .4782886
         z2    .3955768   .1296781     3.05   0.002    .1414125   .6497411
      _cons   -.4796692   .1724896    -2.78   0.005   -.8177426  -.1415958

Bias-corrected (AIC) Vuong test:  z = 2.59  Pr>z = 0.0048
Bias-corrected (BIC) Vuong test:  z = 1.20  Pr>z = 0.1159

Here the Vuong test indicates a clear preference for the zero-inflation model, and we note that the estimated coefficients are close to the values we specified in synthesizing these data.

In this article, we introduced programs for modeling count data. These count data can be overdispersed (variance greater than the mean), underdispersed (variance smaller than the mean), or equidispersed (variance equal to the mean). We then illustrated the use of the new commands nbregf, zinbregf, nbregw, and zinbregw using real-world German health data from 1984. We synthesized data and used them to demonstrate the gbin and zigbin models. This article is fairly technical, and some readers may desire more background on count-data models such as the Poisson, generalized Poisson, and negative binomial models. For those readers, we recommend Hardin and Hilbe (2012), Cameron and Trivedi (2013), Winkelmann (2008), and Tang, He, and Tu (2012).


6 References

Cameron, A. C., and P. K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. Cambridge: Cambridge University Press.

Consul, P. C., and H. C. Gupta. 1980. The generalized negative binomial distribution and its characterization by zero regression. SIAM Journal on Applied Mathematics 39: 231–237.

Dean, C. B. 1992. Testing for overdispersion in Poisson and binomial regression models. Journal of the American Statistical Association 87: 451–457.

Desmarais, B. A., and J. J. Harden. 2013. Testing for zero inflation in count models: Bias correction for the Vuong test. Stata Journal 13: 810–835.

Famoye, F. 1995. Generalized binomial regression model. Biometrical Journal 37: 581–594.

Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and Extensions. 3rd ed. College Station, TX: Stata Press.

Harris, T., Z. Yang, and J. W. Hardin. 2012. Modeling underdispersed count data with generalized Poisson regression. Stata Journal 12: 736–747.

Hilbe, J. M. 2011. Negative Binomial Regression. 2nd ed. Cambridge: Cambridge University Press.

Irwin, J. O. 1968. The generalized Waring distribution applied to accident theory. Journal of the Royal Statistical Society, Series A 131: 205–225.

Jain, G. C., and P. C. Consul. 1971. A generalized negative binomial distribution. SIAM Journal on Applied Mathematics 21: 501–513.

Rodríguez-Avi, J., A. Conde-Sánchez, A. J. Sáez-Castillo, M. J. Olmo-Jiménez, and A. M. Martínez-Rodríguez. 2009. A generalized Waring regression model for count data. Computational Statistics & Data Analysis 53: 3717–3725.

Tang, W., H. He, and X. M. Tu. 2012. Applied Categorical and Count Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.

Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57: 307–333.

Wang, X.-F., Z. Jiang, J. J. Daly, and G. H. Yue. 2012. A generalized regression model for region of interest analysis of fMRI data. NeuroImage 59: 502–510.

Winkelmann, R. 2008. Econometric Analysis of Count Data. 5th ed. Berlin: Springer.

Yang, Z., J. W. Hardin, C. L. Addy, and Q. H. Vuong. 2007. Testing approaches for overdispersion in Poisson regression versus the generalized Poisson model. Biometrical Journal 49: 565–584.

Tammy Harris is a senior research associate in the Institute for Families in Society at the University of South Carolina, Columbia, SC. She graduated from the Department of Epidemiology and Biostatistics at the University of South Carolina with a PhD in August 2013.

Joseph M. Hilbe is an emeritus professor (University of Hawaii), an adjunct professor of statistics at Arizona State University, Tempe, AZ, and a Solar System Ambassador at Jet Propulsion Laboratory, Pasadena, CA.

James W. Hardin is an associate professor in the Department of Epidemiology and Biostatistics and an affiliated faculty in the Institute for Families in Society at the University of South Carolina, Columbia, SC.

The Stata Journal (2014)
14, Number 3, pp. 580–604

A Stata package for the application of semiparametric estimators of dose–response functions

Michela Bia Carlos A. Flores

CEPS/INSTEAD Department of Economics

Esch-Sur-Alzette, Luxembourg California Polytechnic State University

michela.bia@ceps.lu San Luis Obispo, CA

core32@calpoly.edu

Alfonso Flores-Lagunes

Department of Economics

State University of New York, Binghamton

Binghamton, NY

aflores@binghamton.edu

Alessandra Mattei

Department of Statistics, Informatics, Applications Giuseppe Parenti

University of Florence

Florence, Italy

mattei@disia.unifi.it

In many applications, the treatment of interest is not binary or categorical but rather continuous, so the focus is on estimating a continuous dose–response function. In this article, we propose a set of programs that semiparametrically estimate the dose–response function of a continuous treatment under the unconfoundedness assumption. We focus on kernel methods and penalized spline models and use generalized propensity-score methods under continuous treatment regimes for covariate adjustment. Our programs use generalized linear models to estimate the generalized propensity score, allowing users to choose between alternative parametric assumptions. They also allow users to impose a common support condition and evaluate the balance of the covariates using various approaches. We illustrate our routines by estimating the effect of the prize amount on subsequent labor earnings for Massachusetts lottery winners, using data collected by Imbens, Rubin, and Sacerdote (2001, American Economic Review, 778–794).

Keywords: st0352, drf, dose–response function, generalized propensity score, kernel estimator, penalized spline estimator, weak unconfoundedness

1 Introduction

The evaluation process in economics, sociology, law, and many other fields generally relies on applying nonexperimental techniques to estimate average treatment effects.

© 2014 StataCorp LP st0352

M. Bia, C. A. Flores, A. Flores-Lagunes, A. Mattei 581

Propensity-score methods (Rosenbaum and Rubin 1983) are attractive empirical tools to balance the distribution of covariates between treatment groups and compare the groups in terms of observed covariates. Under the unconfoundedness assumption, which requires that potential outcomes are independent of the treatment conditional on the observed covariates, propensity-score methods allow one to eliminate (or at least reduce) the potential bias in treatment-effects estimates in observational studies. Most applications aim to evaluate causal effects of a binary treatment. There is extensive literature on identifying and estimating causal effects of binary treatments (for example, Imbens and Wooldridge [2009]; Stuart [2010]; Angrist, Imbens, and Rubin [1996]), and many statistical software packages have built-in or add-on functions for implementing methods to estimate causal effects of programs or policies. For example, Becker and Ichino (2002) developed a set of programs (pscore.ado) for estimating average treatment effects on the treated using propensity-score matching by focusing on four matching estimators: nearest-neighbor, radius, kernel, and stratification matching. More recently, building on the work of Becker and Ichino (2002), Dorn (2012) proposed a routine that helps improve covariate balance, and so the specification of the propensity-score model, using data-driven approaches.

In many empirical studies, treatments may take on many values, implying that participants in the study may receive different treatment levels. In such cases, one may want to assess the heterogeneity of treatment effects arising from variation in the amount of treatment exposure, that is, estimate a dose–response function (DRF). Over the past years, propensity-score methods have been generalized and applied to multivalued treatments (for example, Imbens [2000]; Lechner [2001]) and, more recently, to continuous treatments and arbitrary treatment regimes (for example, Hirano and Imbens [2004]; Imai and van Dyk [2004]; Flores et al. [2012]; Bia and Mattei [2012]; Kluve et al. [2012]).

In this article, we build on work by Hirano and Imbens (2004), who introduced the concept of the generalized propensity score (GPS) and used it to estimate the entire DRF of a continuous treatment. Hirano and Imbens (2004) used a parametric partial-mean approach to estimate the DRF. Here we focus on semiparametric techniques. Specifically, we present a set of programs that allows users to i) estimate the GPS under alternative parametric assumptions using generalized linear models;1 ii) impose the common support condition as defined in Flores et al. (2012) and assess the balance of covariates after adjusting for the estimated GPS; and iii) estimate the DRF using the estimated GPS by applying either the nonparametric inverse-weighting (IW) kernel estimator developed in Flores et al. (2012) or a new set of semiparametric estimators based on penalized spline techniques.

1. Guardabascio and Ventura (2014) proposed the routine gpscore2.ado to estimate the GPS using

generalized linear models.


We use a dataset collected by Imbens, Rubin, and Sacerdote (2001) to illustrate these programs and to evaluate the effect of the prize amount on subsequent labor earnings of winners of the Megabucks lottery in Massachusetts in the mid-1980s. We implement our programs to semiparametrically estimate the average potential postwinning labor earnings for each lottery prize amount. The prize is obviously assigned at random, but unit and item nonresponse lead to a self-selected sample where the prize amount received is no longer independent of background characteristics.

This article is organized as follows: Section 2 describes the methodological approach we refer to in the analysis. Section 3 introduces the GPS model and the semiparametric estimators of the DRF. Sections 4.1 and 4.2 show, respectively, the syntax and the options of the drf command. Section 5 illustrates the methods and the program using data from Imbens, Rubin, and Sacerdote (2001). Section 6 concludes.

2 Estimation strategy

We estimate a continuous DRF that relates each value of the dose (for example, lottery prize amount) to the outcome variable (for example, postwinning labor earnings) within the potential-outcome approach to causal inference (Rubin 1974, 1978). Formally, consider a set of N individuals, and denote each of them by subscript i: i = 1, …, N. Under the stable unit treatment value assumption (Rubin 1980, 1990), for each unit i, there is a set of potential outcomes {Y_i(t)}_{t∈T}, where T is a subset of the real line, T ⊂ ℝ. We are interested in estimating the average DRF, μ(t) = E{Y_i(t)}.

For each individual i, we observe a vector of pretreatment covariates, X_i; the received treatment level, T_i; and the corresponding value of the outcome for this treatment level, Y_i = Y_i(T_i).

The central assumption of our approach is that the assignment to treatment levels is weakly unconfounded given the set of observed variables; that is, Y_i(t) ⊥ T_i | X_i for all t ∈ T (Hirano and Imbens 2004). This assumption is described as weak unconfoundedness because it requires only conditional independence for each potential outcome Y_i(t) rather than joint independence of all potential outcomes.

Under weak unconfoundedness, we can apply the GPS techniques for continuous treatments introduced by Hirano and Imbens (2004). Let r(t, x) = f_{T|X}(t|x) be the conditional density of the treatment given the covariates. The GPS is defined as R_i = r(T_i, X_i). The GPS is a balancing score (Rosenbaum and Rubin 1983; Hirano and Imbens 2004); that is, within strata with the same value of r(t, x), the probability that T = t does not depend on the value of X. The weak unconfoundedness assumption, combined with the balancing score property, implies that assignment to treatment is weakly unconfounded given the GPS. Formally,

\[
f_T\{t \mid r(t, X_i), Y_i(t)\} = f_T\{t \mid r(t, X_i)\}
\]

for every t ∈ T (theorem 1.2.2 in Hirano and Imbens [2004]). Thus any bias associated with differences in the distribution of covariates across groups with different treatment levels can be removed using the GPS. Formally, Hirano and Imbens (2004) showed that


if assignment to the treatment is weakly unconfounded given the pretreatment variables X_i, then μ(t) = E[β{t, r(t, X_i)}], where β(t, r) = E{Y_i(t) | r(t, X_i) = r} = E(Y_i | T_i = t, R_i = r) (theorem 1.3.1 in Hirano and Imbens [2004]).

3 Inference

We use two-step semiparametric estimators of the DRF. The first step is to parametrically model and estimate the GPS, R_i = r(T_i, X_i), and to assess the common support condition and the balance of the covariates. The second step is to estimate the average DRF, μ(t), using either the nonparametric IW kernel estimator proposed by Flores et al. (2012) or a semiparametric spline-based estimator. Here we describe these two steps, implemented in the routine drf.

The first part of the drf program estimates the GPS, allows users to impose an overlap condition, and tests the balancing property of the GPS.

The GPS is estimated parametrically, and alternative distributional assumptions can be specified. Specifically, we assume that the conditional distribution of the treatment given the covariates belongs to a parametric family whose mean is a known function of the covariates depending on an unknown parameter vector β and where φ is a scale parameter. In the drf program, we consider the Gaussian, inverse Gaussian, and gamma distributions using the identity function, the logarithm, and the power function as link functions. We also implement a two-parameter beta distribution to address evaluation problems where the treatment variable takes on values in the interval (0, 1), representing, for instance, a proportion. We use maximum likelihood methods to fit these models by using the official Stata command glm (see [R] glm) or the user-written package betafit (Buis, Cox, and Jenkins 2003).2
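For intuition, this first step can be sketched in a few lines of Python for the simplest case, family(gaussian) with the identity link: fit the treatment on the covariates by maximum likelihood (here, ordinary least squares with one covariate) and take the GPS to be the estimated normal density evaluated at each unit's own treatment level. The data below are simulated for illustration; nothing here uses the package itself.

```python
import math, random

# simulate one covariate and a treatment with T = 1 + 0.5*X + N(0,1)
rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(200)]
t = [1.0 + 0.5*xi + rng.gauss(0, 1) for xi in x]

# ML fit of the Gaussian/identity model (OLS slope and intercept)
n = len(x)
xbar, tbar = sum(x)/n, sum(t)/n
beta = (sum((xi - xbar)*(ti - tbar) for xi, ti in zip(x, t))
        / sum((xi - xbar)**2 for xi in x))
alpha = tbar - beta*xbar
resid = [ti - (alpha + beta*xi) for xi, ti in zip(x, t)]
sigma2 = sum(r*r for r in resid) / n     # ML variance estimate

def gps(ti, xi):
    # estimated conditional density of T at (ti, xi): this is R_i
    m = alpha + beta*xi
    return math.exp(-(ti - m)**2 / (2*sigma2)) / math.sqrt(2*math.pi*sigma2)

R = [gps(ti, xi) for ti, xi in zip(t, x)]
```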

An important issue in GPS applications is determining the common support or overlap region. The drf program allows users to do this by using the approach proposed by Flores et al. (2012). Specifically, the sample is first divided into K intervals according to the distribution of the treatment, cutting at the 100 × (k/K)th, k = 1, …, K − 1, percentiles of the treatment's empirical distribution. Let q_k, k = 1, …, K, denote these intervals, and let Q_i be the interval unit i belongs to: T_i ∈ Q_i. For each interval q_k, let R̂_i^k be the GPS evaluated at the median level of the treatment in that interval for unit i, which is calculated for all units. The common support region with respect to q_k, denoted by CS_k, is obtained by comparing the support of the distribution

2. betafit (version 1.0.0 at the time of this writing) is available from the Statistical Software Components archive (or findit betafit) and must be installed separately from drf.


of R̂_i^k for those units with Q_i = q_k with that of units with Q_i ≠ q_k, and is given by the subsample

\[
CS_k = \left\{ i : \hat R_i^k \in \left[ \max\left\{ \min_{j:Q_j = q_k} \hat R_j^k,\; \min_{j:Q_j \ne q_k} \hat R_j^k \right\},\; \min\left\{ \max_{j:Q_j = q_k} \hat R_j^k,\; \max_{j:Q_j \ne q_k} \hat R_j^k \right\} \right] \right\}
\]

Finally, the sample is restricted to units that are comparable across all the K intervals simultaneously by keeping only individuals who are in the common support region for all k intervals. Therefore, the common-support subsample is given by CS = ∩_{k=1}^{K} CS_k.
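The overlap rule for a single interval q_k can be written compactly. The sketch below (plain Python, with made-up GPS values; not the package's code) returns the indices of the units in CS_k:

```python
def common_support_k(R_k, in_qk):
    # R_k[i]: GPS of unit i evaluated at the median treatment level of q_k
    # in_qk[i]: True if unit i's own treatment falls in interval q_k
    inside = [r for r, q in zip(R_k, in_qk) if q]
    outside = [r for r, q in zip(R_k, in_qk) if not q]
    lo = max(min(inside), min(outside))   # highest of the two lower bounds
    hi = min(max(inside), max(outside))   # lowest of the two upper bounds
    return [i for i, r in enumerate(R_k) if lo <= r <= hi]

# toy example: units 0-2 are in q_k, units 3-5 are not
R1 = [0.10, 0.30, 0.50, 0.70, 0.90, 0.20]
in_q1 = [True, True, True, False, False, False]
print(common_support_k(R1, in_q1))  # keeps units with GPS in [0.20, 0.50]
```

The final sample intersects these index sets over all K intervals.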

As in applications of standard propensity-score methods, in GPS applications, it is crucial to evaluate how well the estimated GPS balances the covariates. Several methods can be applied to evaluate the balancing properties of the GPS. The drf command implements two approaches: an approach based on blocking on the GPS and an approach that uses a likelihood-ratio (LR) test. The blocking-on-the-GPS approach was proposed by Hirano and Imbens (2004), and it is implemented in the drf routine using two-sided t tests or Bayes factors (see also Bia and Mattei [2008]). The second approach was proposed by Flores et al. (2012), who suggested using an LR test to compare an unrestricted model for T_i that includes all covariates and the GPS (up to a cubic term) with a restricted model that sets the coefficients of all covariates equal to zero. If the GPS sufficiently balances the covariates, then the covariates should have little explanatory power conditional on the GPS.3
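As an illustration of the blocking approach, the two-sided t statistic comparing a covariate's mean between units in an interval q_k and the remaining units, within a GPS block, can be computed as follows (a plain-Python sketch with toy data, using the unequal-variance form; the actual drf implementation may differ in details):

```python
import math

def t_stat(a, b):
    # Welch two-sample t statistic for the difference in means of a and b
    ma, mb = sum(a)/len(a), sum(b)/len(b)
    va = sum((x - ma)**2 for x in a) / (len(a) - 1)   # sample variances
    vb = sum((x - mb)**2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va/len(a) + vb/len(b))

# toy covariate values for units inside vs. outside the interval
print(round(t_stat([1.0, 2.0, 3.0], [1.5, 2.5, 3.5]), 3))
```

Small |t| values within each block are evidence that the GPS has balanced that covariate.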

We estimate the DRF by applying spline and kernel techniques. The first technique is implemented using a partial-mean approach (Newey 1994). Specifically, for the penalized spline methods, we first estimate the conditional expectation of the observed outcome Y_i given the treatment actually received, T_i, and the GPS previously estimated in the first stage, R̂_i, using bivariate penalized spline smoothing based on i) additive spline bases; ii) tensor products of spline bases; or iii) radial basis functions (for example, Ruppert, Wand, and Carroll [2003]). Mixed models provide a representation of the penalized splines that allows smoothing to be done using mixed-model methodologies and software. In our routine, we use the Stata routine xtmixed, renamed mixed in Stata 13, to fit penalized spline regressions. The average DRF at t is then estimated by averaging the estimated regression function over the estimated score function evaluated at the specific treatment level t; that is, R̂_i^t = r(t, X_i).

3. An alternative approach, which is not implemented in our program, was proposed by Kluve et al. (2012). It consists of regressing each covariate on the treatment variable and comparing the significance of the coefficients for specifications with and without conditioning on the GPS.


The simplest bivariate penalized spline smoothing relies on additive spline bases, which can be formally defined in our setting as

\[
E\bigl(Y_i \mid T_i, \hat R_i\bigr) = a_0 + a_t T_i + a_r \hat R_i
 + \sum_{k=1}^{K^t} u_k^t \bigl(T_i - \kappa_k^t\bigr)_+
 + \sum_{k=1}^{K^r} u_k^r \bigl(\hat R_i - \kappa_k^r\bigr)_+
\qquad (1)
\]

where for any number z, z_+ is equal to z if z is positive and is equal to 0 otherwise, and κ_1^t < ⋯ < κ_{K^t}^t and κ_1^r < ⋯ < κ_{K^r}^r are K^t and K^r distinct knots in the support of T and of the estimated GPS, R̂_i, respectively.
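For concreteness, one row of the additive truncated-power design implied by (1) can be built as follows (a Python sketch with arbitrary knot placements chosen for the example; drf selects knots internally):

```python
def pos(z):
    # the (z)_+ operator: z if positive, else 0
    return z if z > 0 else 0.0

def additive_row(t, r, kt, kr):
    # design row [1, T, R, (T-k)_+ for k in kt, (R-k)_+ for k in kr]
    return [1.0, t, r] + [pos(t - k) for k in kt] + [pos(r - k) for k in kr]

# one unit with T = 2.5 and estimated GPS R = 0.4
row = additive_row(t=2.5, r=0.4, kt=[1.0, 2.0, 3.0], kr=[0.2, 0.5])
print(row)
```

Stacking these rows over units gives the design matrix whose truncated-power coefficients are treated as random effects in the mixed-model fit.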

The additive models have many attractive features, one being their simplicity. However, an additive model may not provide a satisfactory fit, so more complex models including interaction terms are required. To this end, we consider tensor product bases, which are obtained by forming all pairwise products of the basis functions 1, T_i, (T_i − κ_1^t)_+, …, (T_i − κ_{K^t}^t)_+ and 1, R̂_i, (R̂_i − κ_1^r)_+, …, (R̂_i − κ_{K^r}^r)_+. Formally,

\[
\begin{aligned}
E\bigl(Y_i \mid T_i, \hat R_i\bigr) ={} & a_0 + a_t T_i + a_r \hat R_i + a_{tr} T_i \hat R_i
 + \sum_{k=1}^{K^t} u_k^t \bigl(T_i - \kappa_k^t\bigr)_+
 + \sum_{k=1}^{K^r} u_k^r \bigl(\hat R_i - \kappa_k^r\bigr)_+ \\
 & + \sum_{k=1}^{K^t} v_k^t \hat R_i \bigl(T_i - \kappa_k^t\bigr)_+
 + \sum_{k=1}^{K^r} v_k^r T_i \bigl(\hat R_i - \kappa_k^r\bigr)_+
 + \sum_{k=1}^{K^t} \sum_{k'=1}^{K^r} v_{kk'}^{tr} \bigl(T_i - \kappa_k^t\bigr)_+ \bigl(\hat R_i - \kappa_{k'}^r\bigr)_+
\qquad (2)
\end{aligned}
\]

Estimation problems may arise when the tensor product approach is applied, especially if the sample size is relatively small. When these problems arise, the drf program alerts users and suggests they adopt an additive model instead.

As an alternative to tensor product splines, we propose to use so-called radial basis functions, which are basis functions of the form C{‖(t, r)′ − (κ^t, κ^r)′‖} for some univariate function C. Here we consider the following (thin-plate) function

\[
C\left(\left\| \begin{pmatrix} t - \kappa_k^t \\ r - \kappa_k^r \end{pmatrix} \right\|\right)
= \left\| \begin{pmatrix} t - \kappa_k^t \\ r - \kappa_k^r \end{pmatrix} \right\|^2
\log \left\| \begin{pmatrix} t - \kappa_k^t \\ r - \kappa_k^r \end{pmatrix} \right\|
\]

so that

\[
E\bigl(Y_i \mid T_i, \hat R_i\bigr) = a_0 + a_t T_i + a_r \hat R_i + a_{tr} T_i \hat R_i
 + \sum_{k=1}^{K} u_k\, C\left(\left\| \begin{pmatrix} T_i - \kappa_k^t \\ \hat R_i - \kappa_k^r \end{pmatrix} \right\|\right)
\qquad (3)
\]

where the random coefficients u = (u_1, …, u_K)′ are assumed to have mean 0 and variance–covariance matrix

\[
\operatorname{Cov}(u) = \sigma_u^2\, \Omega_K^{-1/2} \bigl(\Omega_K^{-1/2}\bigr)',
\qquad
[\Omega_K]_{kk'} = C\left(\left\| \begin{pmatrix} \kappa_k^t - \kappa_{k'}^t \\ \kappa_k^r - \kappa_{k'}^r \end{pmatrix} \right\|\right),
\quad 1 \le k, k' \le K
\]

Given the estimated parameters of the regression functions (1), (2), or (3), the average potential outcome at treatment level t is estimated by averaging the estimated regression function over R̂_i^t.


Flores et al. (2012) proposed to estimate the DRF using a nonparametric IW estimator based on kernel methods. In this approach, the estimated scores are used to weight observations to adjust for covariate differences. Let K(u) be a kernel function with the usual properties, and let h be a bandwidth satisfying h → 0 and Nh → ∞ as N → ∞. The IW approach is implemented using a local linear regression of Y on T with the weighted kernel function K̃_{h,X}(T_i − t) = K_h(T_i − t)/R̂_i^t, where K_h(z) = h^{−1}K(z/h). Formally, the IW kernel estimator of the average DRF is defined as

\[
\hat\mu(t) = \frac{D_0(t)\,S_2(t) - D_1(t)\,S_1(t)}{S_0(t)\,S_2(t) - S_1^2(t)}
\]

where S_j(t) = Σ_{i=1}^{N} K̃_{h,X}(T_i − t)(T_i − t)^j and D_j(t) = Σ_{i=1}^{N} K̃_{h,X}(T_i − t)(T_i − t)^j Y_i, j = 0, 1, 2.

We implement the IW estimator using a normal kernel. By default, the global bandwidth is selected using the procedure proposed by Fan and Gijbels (1996), which estimates the unknown terms in the optimal global bandwidth by using a global polynomial of order p + 3, where p is the order of the local polynomial fitted. However, users can also choose an alternative global bandwidth.
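The estimator above is an ordinary local linear smoother whose kernel weights are divided by the GPS evaluated at t. A self-contained Python sketch follows (normal kernel, fixed user-supplied bandwidth; the Fan and Gijbels bandwidth selection is omitted). The check at the end uses flat GPS weights and exactly linear data, for which a local linear fit recovers the line:

```python
import math

def iw_kernel(t, T, Y, R_t, h):
    # normal-kernel weights divided by the GPS evaluated at t (R_t[i])
    Kw = [math.exp(-((Ti - t)/h)**2 / 2) / (h*math.sqrt(2*math.pi)) / Ri
          for Ti, Ri in zip(T, R_t)]
    # the S_j and D_j sums from the formulas above
    S = [sum(k*(Ti - t)**j for k, Ti in zip(Kw, T)) for j in range(3)]
    D = [sum(k*(Ti - t)**j * Yi for k, Ti, Yi in zip(Kw, T, Y))
         for j in range(2)]
    return (D[0]*S[2] - D[1]*S[1]) / (S[0]*S[2] - S[1]**2)

T = [0.5, 1.0, 1.5, 2.0, 2.5]
Y = [1 + 2*Ti for Ti in T]          # exactly linear outcomes
R_t = [1.0]*len(T)                  # flat GPS weights for the check
print(iw_kernel(1.5, T, Y, R_t, h=0.75))
```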

4.1 Syntax

drf varlist [if] [in] [weight], outcome(varname) treatment(varname)
      cutpoints(varname) index(string) nq_gps(#) method(type) [gps
      family(familyname) link(linkname) vce(vcetype) nolog(#) search
      common(#) numoverlap(#) test_varlist(varlist) test(type) flag(#)
      tpoints(vector) npoints(#) npercentiles(#) det delta(#)
      bandwidth(#) nknots(#) knots(#) standardized degree1(#)
      degree2(#) nknots1(#) nknots2(#) knots1(#) knots2(#) additive
      estopts(string)]

Note that the argument varlist represents the observed pretreatment variables, which are used to estimate the GPS. Note that spacefill must be installed (Bia and Van Kerm 2014).4

4.2 Options

Required

4. spacefill requires the Mata package moremata (Jann 2005).


cutpoints(varname) divides the range or set of the possible treatment values, T ,

into intervals within which the balancing properties of the GPS are checked using a

blocking on the GPS approach. varname is a variable indicating to which interval

each observation belongs. This option is required unless flag() is set to 0 (see

below).

index(string) specifies the representative point of the treatment variable at which the GPS must be evaluated within each treatment interval specified in cutpoints(). string identifies either the mean (string = mean) or a percentile (string = p1, …, p100). This is used when checking the balancing properties of the GPS using a blocking-on-the-GPS approach. This option is required unless flag() is set to 0 (see below).

nq_gps(#) specifies that, for each treatment interval defined in cutpoints(), the values of the GPS evaluated at the representative point index() have to be divided into # (# ∈ {1, …, 100}) intervals, defined by the quantiles of the GPS evaluated at the representative point index(). This is used when checking the balancing properties of the GPS using a blocking-on-the-GPS approach. This option is required unless flag() is set to 0 (see below).

method(type) specifies the type of approach to be used to estimate the DRF. The approaches are bivariate penalized splines (type = mtspline), bivariate penalized radial splines (type = radialpspline), or IW kernel (type = iwkernel).5

Global options

gps stores the estimated generalized propensity score in the gpscore variable that is

added to the dataset.6

family(familyname) specifies the distribution used to estimate the GPS. The available distributional families are Gaussian (normal) (family(gaussian)), inverse Gaussian (family(igaussian)), Gamma (family(gamma)), and Beta (family(beta)). The default is family(gaussian). The Gaussian, inverse Gaussian, and Gamma distributional families are fit using glm, and the beta distribution is fit using betafit.

The following four options are for the glm command, so they can be specified only when the Gaussian, inverse Gaussian, or Gamma distribution is assumed for the treatment variable.

link(linkname) specifies the link function for the Gaussian, inverse Gaussian, and Gamma distributional families. The available links are link(identity), link(log), and link(pow), and the default is the canonical link for the family() specified (see help for glm for further details).

5. The subroutines mtpspline and radialpspline are called, respectively, when estimators with penalized splines (type = mtspline) and radial penalized splines (type = radialpspline) are used.

6. This option must not be specied when running the bootstrap.

588 Semiparametric estimators of dose–response functions

vce(vcetype) specifies the type of standard error reported for the GPS estimation when the Gaussian, inverse Gaussian, or Gamma distribution is assumed for the treatment variable. vcetype may be oim, robust, cluster clustvar, eim, opg, bootstrap, jackknife, hac kernel, jackknife1 (see help glm for further details).

nolog(#) is a flag (# = 0, 1) that suppresses the iterations of the algorithm toward eventual convergence when running the glm command. The default is nolog(0).

search searches for good starting values for the parameters of the generalized linear

model used to estimate the generalized propensity score (see help glm for further

details).

Overlap options

common(#) is a flag (# = 0, 1) specifying whether drf imposes the common support condition, which is implemented when # = 1. The default is common(1).

numoverlap(#) specifies that the common support condition is imposed by dividing the sample into # groups according to # quantiles of the treatment distribution. By default, the sample is divided into 5 groups, cutting at the 20th, 40th, 60th, and 80th percentiles of the distribution if common(1).

test_varlist(varlist) specifies that the balancing property must be assessed for each variable in varlist. The default test_varlist() consists of all the variables used to estimate the GPS.

test(type) allows users to specify whether the balancing property is to be assessed using a blocking on the GPS approach employing either standard two-sided t tests (test(t_test)) or Bayes factors (test(Bayes_factor)) or using a model-comparison approach with an LR test (test(L_like)).

The blocking on the GPS approach using standard two-sided t tests provides the values of the test statistics before and after adjusting for the GPS for each pretreatment variable included in test_varlist() and for each prefixed treatment interval specified in cutpoints(). Specifically, let p be the number of control variables in test_varlist(), and let H be the number of treatment intervals specified in cutpoints(). Then the program calculates and shows p × H values of the test statistic before and after adjusting for the GPS, where the adjustment is done by dividing the values of the GPS evaluated at the representative point index() into the number of intervals specified in nq_gps(). (See Hirano and Imbens [2004] for further details.)

The model-comparison approach uses an LR test to compare an unrestricted model for Ti, including all the covariates and the GPS (up to a cubic term), with a restricted model that sets the coefficients of all covariates to zero. By default, both the blocking on the GPS approach and the model-comparison approach are applied.
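The same comparison can be mimicked with official Stata commands. The sketch below is not the drf implementation, only an illustration of the test being performed, using hypothetical names t, x1, x2 and precomputed GPS polynomial terms gps, gps2, gps3:

```stata
* Unrestricted model: covariates plus GPS terms up to a cubic
glm t x1 x2 gps gps2 gps3, family(gaussian) link(log)
estimates store unrestricted
* Restricted model: the covariate coefficients are set to zero
glm t gps gps2 gps3, family(gaussian) link(log)
estimates store restricted
* LR test of the restriction that the covariates add nothing
lrtest unrestricted restricted
```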


flag(#) allows the user to specify that drf estimate the GPS without performing the balancing test (flag(0)). The default is flag(1), which means that the balancing property is assessed.

DRF options

tpoints(vector) indicates that the DRF is evaluated at each level of the treatment in

vector. By default, the drf program creates a vector with jth element equal to

the jth observed treatment value. This option cannot be used with npoints() or

npercentiles() (see below).

npoints(#) indicates that the DRF is evaluated at each level of the treatment be-

longing to a set of evenly spaced values t0 , t1 , . . . , t# that cover the range of the

observed treatment. This option cannot be used with tpoints() (see above) or

npercentiles() (see below).

npercentiles(#) indicates that the DRF is evaluated at each level of the treatment corresponding to the percentiles tq0 , tq1 , . . . , tq# of the treatment's empirical distribution. This option cannot be used with tpoints() or npoints() (see above).
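For example, the grid passed to tpoints() is an ordinary Stata matrix. A sketch with hypothetical names y, t, x1, and x2 (flag(0) skips the balancing checks so the balancing-related required options can be omitted):

```stata
* Evaluate the DRF at treatment levels 10, 20, ..., 50 only
matrix define grid = (10\20\30\40\50)
drf x1 x2, outcome(y) treatment(t) flag(0) method(iwkernel) tpoints(grid)
```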

det displays more detailed output on the DRF estimation. When det is not specified, the program displays only the chosen DRF estimator: method(radialpspline), method(mtpspline), or method(iwkernel).

delta(#) specifies that drf also estimate the treatment-effect function μ(t + #) − μ(t). The default is delta(0), which means that drf estimates only the DRF, μ(t).

bandwidth(#) specifies the bandwidth for the IW kernel estimator. If this option is not specified, the bandwidth is chosen using the automatic procedure described in Fan and Gijbels (1996). This procedure estimates the unknown terms in the optimal global bandwidth by using a global polynomial of order p + 3, where p is the order of the local polynomial fitted.

nknots(#) specifies the number of knots of the treatment variable and the GPS. The default is nknots(max(20, min(n/4, 150))), where n is the number of unique (Ti , Ri ) pairs (Ruppert, Wand, and Carroll 2003). When this option is specified, the subroutines radialpspline and spacefill (Bia and Van Kerm 2014) are called. This option cannot be used with the knots() option (see below).


knots(numlist) species the list of knots for the treatment and the GPS variable. This

option cannot be used with the nknots() option (see above).

standardized implies that the spacefill algorithm standardizes the treatment vari-

able and the GPS variables before selecting the knots. The knots are chosen using

the standardized variables.

degree1(#) specifies the power of the treatment variable included in the penalized spline model. The default is degree1(1).

degree2(#) specifies the power of the GPS included in the penalized spline model. The default is degree2(1).

nknots1(#) specifies the number (#) of knots for the treatment variable. The location of the Kk th knot is defined as the {(k + 1)/(# + 2)}th sample quantile of the unique Ti for k = 1, . . . , #. The default is nknots1(max(5, min(n/4, 35))), where n is the number of unique Ti (Ruppert, Wand, and Carroll 2003). This option cannot be used with the knots1(numlist) option (see below).
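For instance, with nknots1(3) this rule places knots at the {(k + 1)/(3 + 2)}th quantiles for k = 1, 2, 3, that is, at the 40th, 60th, and 80th percentiles of the unique treatment values. A quick check with official commands (hypothetical variable t; drf applies the rule to the unique values of Ti):

```stata
* Knot locations implied by nknots1(3): 40th, 60th, 80th percentiles
_pctile t, percentiles(40 60 80)
display r(r1), r(r2), r(r3)
```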

nknots2(#) specifies the number (#) of knots for the GPS. The location of the Kk th knot is defined as the {(k + 1)/(# + 2)}th sample quantile of the unique Ri for k = 1, . . . , #. The default is nknots2(max(5, min(n/4, 35))), where n is the number of unique Ri (Ruppert, Wand, and Carroll 2003). This option cannot be used with the knots2() option (see below).

knots1(numlist) species the list of knots for the treatment variable. This option

cannot be used with the nknots1() option (see above).

knots2(numlist) species the list of knots for the GPS. This option cannot be used with

the nknots2() option (see above).

additive allows users to implement penalized splines using the additive model without

including the product terms.

Mutual options for the tensor-product and radial penalized spline estimators

Mutual options for the tensor-product and radial penalized spline estimators involve

either the mtpspline subroutine or the radialpspline subroutine, depending on which

estimator is used.

estopts(string) specifies all the possible options allowed when running the xtmixed models to fit penalized spline models (see help xtmixed for further details).


We illustrate the methods and the programs discussed by reanalyzing data from a survey of Massachusetts lottery winners (see Imbens, Rubin, and Sacerdote [2001] for details on the survey). We focus on evaluating how the prize amount affects future labor earnings (from social security records). This example is also considered in Hirano and Imbens (2004).

The sample we use consists of 237 individuals who won a major prize in the lottery.

The outcome of interest is earnings six years after winning the lottery (year6), and the

treatment is the prize amount (prize). The lottery prize is randomly assigned, but there

is substantial unit and item nonresponse as well as heterogeneity in the sample with

respect to background characteristics. Thus it is more reasonable to conduct the analysis

conditioning on the observed pretreatment variables under the weak unconfoundedness

assumption.

Pretreatment variables are age, gender, years of high school, years of college, winning

year, number of tickets bought, working status at the time of playing the lottery, and

earnings s years before winning the lottery, s = 1, 2, . . . , 6. To avoid results driven

by outliers, we drop observations belonging to the upper 5% of the treatment variable

distribution.

The output from running drf, shown below, is organized as follows. First, the GPS model and summary statistics of the estimated GPS are shown, and the common support is determined. The results show that 31 observations were dropped after we imposed the common support condition. Second, the balancing property is assessed. We specify the test(L_like) option for the balancing test, so results from only the model-comparison approach using the LR test are reported. The LR test shows that the GPS balances the covariates: they have little explanatory power conditional on the GPS. Indeed, the restricted model for Ti that excludes the covariates cannot be rejected at the usual significance levels (p-value is 0.284), whereas the restricted model that excludes the GPS is soundly rejected (p-value is 0).

. use lotterydataset.dta

. * we delete the extreme values (1 and 99 percentile)

. drop if year6==.

(35 observations deleted)

. summarize prize, de

Treatment variable = Prize amount

Percentiles Smallest

1% 5.3558 1.139

5% 10.05 5

10% 11.246 5.3558 Obs 202

25% 17.034 6.844 Sum of Wgt. 202

50% 32.1835 Mean 57.36918

Largest Std. Dev. 64.84194

75% 71.642 270.1

90% 137.27 305.09 Variance 4204.477

95% 171.73 323.32 Skewness 2.821964

99% 305.09 484.79 Kurtosis 14.18278


(11 observations deleted)

. replace year6 = year6/1000

year6 was long now double

(92 real changes made)

. matrix define tp = (10\20\30\40\50\60\70\80\90\100)

. set seed 2322

. drf agew ownhs owncoll male tixbot workthen yearm1 yearm2 yearm3 yearm4

> yearm5 yearm6, outcome(year6) treatment(prize) gps test(L_like)

> tpoints(tp) numoverlap(3) method(radialpspline) family(gaussian)

> link(log) nknots(10) nolog(1) search det delta(1)

******************************************************

Algorithm to estimate the generalized propensity score

******************************************************

Generalized linear models No. of obs = 191

Optimization : ML Residual df = 178

Scale parameter = 1365.58

Deviance = 243073.1517 (1/df) Deviance = 1365.58

Pearson = 243073.1517 (1/df) Pearson = 1365.58

Variance function: V(u) = 1 [Gaussian]

Link function : g(u) = ln(u) [Log]

AIC = 10.12285

Log likelihood = -953.731889 BIC = 242138.2

OIM

prize Coef. Std. Err. z P>|z| [95% Conf. Interval]

ownhs .0585063 .0742126 0.79 0.430 -.0869477 .2039603

owncoll -.0108263 .0389408 -0.28 0.781 -.0871488 .0654962

male .3615542 .1564085 2.31 0.021 .0549991 .6681093

tixbot -.0174202 .0188308 -0.93 0.355 -.0543279 .0194875

workthen .0680442 .1819285 0.37 0.708 -.2885291 .4246174

yearm1 -.0033454 .0102149 -0.33 0.743 -.0233662 .0166754

yearm2 .0018299 .0151926 0.12 0.904 -.0279471 .0316069

yearm3 -.0190244 .0134829 -1.41 0.158 -.0454505 .0074016

yearm4 .0451296 .0194034 2.33 0.020 .0070997 .0831596

yearm5 -.0094795 .0147496 -0.64 0.520 -.0383882 .0194293

yearm6 -.0055688 .0084792 -0.66 0.511 -.0221877 .0110501

_cons 2.534394 .489911 5.17 0.000 1.574186 3.494602


*****************************************************************

31 observations are dropped after imposing common support

*****************************************************************

drf_gpscore

Percentiles Smallest

1% .0000774 .0000308

5% .00118 .0000774

10% .0033023 .0003464 Obs 160

25% .0077024 .0004499 Sum of Wgt. 160

50% .0092675 Mean .0082089

Largest Std. Dev. .002953

75% .0103387 .0107928

90% .0107204 .010793 Variance 8.72e-06

95% .0107831 .0107953 Skewness -1.419599

99% .0107953 .0107956 Kurtosis 3.908883

********************************************

End of the algorithm to estimate the gpscore

********************************************

**********************************************************

Log-Likelihood test for Unrestricted and Restricted Model

**********************************************************

****************************************************

Unrestricted Model

link(E[T]) = GPSCORE + GPSCORE^2 + GPSCORE^3 + X

****************************************************

Generalized linear models No. of obs = 160

Optimization : ML Residual df = 144

Scale parameter = 383.389

Deviance = 55208.02303 (1/df) Deviance = 383.389

Pearson = 55208.02303 (1/df) Pearson = 383.389

Variance function: V(u) = 1 [Gaussian]

Link function : g(u) = ln(u) [Log]

AIC = 8.881567

Log likelihood = -694.5253454 BIC = 54477.2

OIM

prize Coef. Std. Err. z P>|z| [95% Conf. Interval]

drf_gpscore2 -45688.7 24107.57 -1.90 0.058 -92938.68 1561.268

drf_gpscore3 4243995 1464344 2.90 0.004 1373934 7114055

agew .0067685 .0036542 1.85 0.064 -.0003935 .0139306

ownhs .0159357 .0348134 0.46 0.647 -.0522974 .0841687

owncoll .0146014 .028581 0.51 0.609 -.0414163 .0706192

male -.0071926 .0945985 -0.08 0.939 -.1926022 .178217

tixbot -.0120352 .0108077 -1.11 0.265 -.033218 .0091475

workthen -.0411355 .1226241 -0.34 0.737 -.2814743 .1992032

yearm1 .0042786 .0080239 0.53 0.594 -.011448 .0200052

yearm2 -.0129785 .0123375 -1.05 0.293 -.0371595 .0112024

yearm3 .0191091 .015091 1.27 0.205 -.0104687 .048687

yearm4 .001562 .0113064 0.14 0.890 -.0205982 .0237222

yearm5 -.008559 .0116933 -0.73 0.464 -.0314774 .0143595

yearm6 .0002114 .00695 0.03 0.976 -.0134105 .0138332

_cons 4.74533 .2766597 17.15 0.000 4.203088 5.287573


********************************************************

Restricted Model: Pretreatment variables are excluded

link(E[T]) = GPSCORE + GPSCORE^2 + GPSCORE^3

********************************************************

Generalized linear models No. of obs = 160

Optimization : ML Residual df = 156

Scale parameter = 386.9127

Deviance = 60358.37384 (1/df) Deviance = 386.9127

Pearson = 60358.37384 (1/df) Pearson = 386.9127

Variance function: V(u) = 1 [Gaussian]

Link function : g(u) = ln(u) [Log]

AIC = 8.820758

Log likelihood = -701.6606578 BIC = 59566.65

OIM

prize Coef. Std. Err. z P>|z| [95% Conf. Interval]

drf_gpscore2 -53755.36 20238.49 -2.66 0.008 -93422.08 -14088.64

drf_gpscore3 4533115 1287859 3.52 0.000 2008958 7057273

_cons 5.034825 .0706282 71.29 0.000 4.896396 5.173253

**********************************************************

Restricted Model: GPS terms are excluded (link(E[T]) = X)

**********************************************************

Generalized linear models No. of obs = 160

Optimization : ML Residual df = 147

Scale parameter = 1311.924

Deviance = 192852.8661 (1/df) Deviance = 1311.924

Pearson = 192852.8661 (1/df) Pearson = 1311.924

Variance function: V(u) = 1 [Gaussian]

Link function : g(u) = ln(u) [Log]

AIC = 10.09489

Log likelihood = -794.5908861 BIC = 192106.8

OIM

prize Coef. Std. Err. z P>|z| [95% Conf. Interval]

ownhs .0445558 .0879733 0.51 0.613 -.1278687 .2169802

owncoll .0102703 .0484571 0.21 0.832 -.0847039 .1052445

male .3800062 .1676205 2.27 0.023 .051476 .7085364

tixbot -.0179112 .0212375 -0.84 0.399 -.0595359 .0237135

workthen .1593496 .2189032 0.73 0.467 -.2696929 .5883921

yearm1 .0158358 .0119526 1.32 0.185 -.0075909 .0392624

yearm2 -.0347405 .0256188 -1.36 0.175 -.0849524 .0154713

yearm3 -.0074285 .0246622 -0.30 0.763 -.0557656 .0409086

yearm4 .0487374 .0278511 1.75 0.080 -.0058497 .1033245

yearm5 -.013943 .018552 -0.75 0.452 -.0503042 .0224183

yearm6 .000416 .0150639 0.03 0.978 -.0291088 .0299408

_cons 2.285246 .6383848 3.58 0.000 1.034035 3.536457


********************************************************************

Likelihood-ratio tests:

Comparison between the unrestricted model and the restricted models

********************************************************************

LR_TEST[3,4]

Lrtest T-Statistics p-value Restrictions

Unrestricted -694.52535 . . .

Covariates X -701.66066 14.270625 .2837616 12

GPS terms -794.59089 200.13108 3.952e-43 3

Number of observations = 160

***********************************************************

End of the assesment of the balancing property of the GPS

***********************************************************

Then we estimate the DRF and the treatment-effect function, which represents the marginal propensity to earn out of the yearly prize money, using both penalized spline techniques and the IW kernel estimator. Following Hirano and Imbens (2004), we obtain the estimates of these functions at 10 different prize-amount values, considering increments of $10,000 between $10,000 and $100,000 for the estimation of the treatment-effect function. Note that we scaled the prize amount by dividing it by $1,000. To avoid redundancies, we show details on the output from running drf for only the radial penalized spline estimator (method(radialpspline)). Note that the det option is specified, so details on estimating the DRF are shown.

****************

DRF estimation

****************

Radial penalized spline estimator

Run 1 .. (Cpq = 383.37)

Run 2 .. (Cpq = 427.99)

Run 3 ... (Cpq = 388.19)

Run 4 .. (Cpq = 365.61)

Run 5 ... (Cpq = 389.08)

Performing EM optimization:

Performing gradient-based optimization:

Iteration 0: log restricted-likelihood = -509.60164

Iteration 1: log restricted-likelihood = -509.58312

Iteration 2: log restricted-likelihood = -509.58286

Iteration 3: log restricted-likelihood = -509.58286


Mixed-effects REML regression Number of obs = 129

Group variable: _all Number of groups = 1

Obs per group: min = 129

avg = 129.0

max = 129

Wald chi2(2) = 5.01

Log restricted-likelihood = -509.58286 Prob > chi2 = 0.0818

drf_gpscore -1355.627 897.2735 -1.51 0.131 -3114.25 402.997

_cons 34.56937 11.09994 3.11 0.002 12.8139 56.32485

_all: Identity

sd(__00002U..__000033)(1) .0285723 .0584111 .0005198 1.570645

LR test vs. linear regression: chibar2(01) = 0.06 Prob >= chibar2 = 0.4072

(1) __00002U __00002V __00002W __00002X __00002Y __00002Z __000030 __000031

__000032 __000033

. matrix list e(b)

e(b)[1,20]

c1 c2 c3 c4 c5 c6

y1 15.131775 12.106819 9.3763398 7.2519104 6.0217689 5.5866336

c7 c8 c9 c10 c11 c12

y1 5.7080575 5.9898157 6.0769106 5.7288158 -.3081758 -.2900365

c13 c14 c15 c16 c17 c18

y1 -.23826795 -.15935109 -.05448761 -.00673878 .02770708 .02217719

c19 c20

y1 -.01213146 -.06489899

. matrix C = e(b)

. drop gpscore

. set seed 2322


. bootstrap _b, reps(50): drf agew ownhs owncoll male tixbot workthen yearm1

> yearm2 yearm3 yearm4 yearm5 yearm6, outcome(year6) treatment(prize)

> test(L_like) tpoints(tp) numoverlap(3) method(radialpspline) family(gaussian)

> link(log) nolog(1) search nknots(10) det delta(1)

(running drf on estimation sample)

Bootstrap replications (50)

1 2 3 4 5

.................................................. 50

Bootstrap results Number of obs = 191

Replications = 50

Coef. Std. Err. z P>|z| [95% Conf. Interval]

c2 12.10682 6.628999 1.83 0.068 -.8857812 25.09942

c3 9.37634 6.500001 1.44 0.149 -3.363427 22.11611

c4 7.25191 7.843234 0.92 0.355 -8.120547 22.62437

c5 6.021769 12.20073 0.49 0.622 -17.89122 29.93475

c6 5.586634 15.15628 0.37 0.712 -24.11914 35.2924

c7 5.708057 18.95607 0.30 0.763 -31.44515 42.86127

c8 5.989816 23.01648 0.26 0.795 -39.12166 51.10129

c9 6.076911 26.94703 0.23 0.822 -46.7383 58.89212

c10 5.728816 31.02343 0.18 0.853 -55.07598 66.53361

c11 -.3081758 2.3051 -0.13 0.894 -4.826088 4.209736

c12 -.2900365 2.43639 -0.12 0.905 -5.065274 4.485201

c13 -.2382679 .5888614 -0.40 0.686 -1.392415 .9158791

c14 -.1593511 .641826 -0.25 0.804 -1.417307 1.098605

c15 -.0544876 .4563326 -0.12 0.905 -.9488831 .8399079

c16 -.0067388 .4477181 -0.02 0.988 -.8842501 .8707725

c17 .0277071 .5016994 0.06 0.956 -.9556057 1.01102

c18 .0221772 .4548985 0.05 0.961 -.8694075 .9137618

c19 -.0121315 .4958827 -0.02 0.980 -.9840437 .9597808

c20 -.064899 .5120701 -0.13 0.899 -1.068538 .93874

Figures 1 and 2 show the estimates of the DRF and the treatment-effect function by using the semiparametric techniques implemented in the drf routine and a parametric approach. The parametric estimates are derived using the doseresponse routine (Bia and Mattei 2008), which follows the parametric approach originally proposed by Hirano and Imbens (2004).7 As can be seen in figures 1 and 2, the two penalized spline estimators and the IW kernel estimator lead to similar results: the DRFs have a U shape (which is more tenuous in the case of the radial spline method) and the treatment-effect functions have irregular shapes, increasing over most of the treatment range and decreasing for high treatment levels. The parametric approach shows quite a different picture. The DRF goes down sharply for low prize amounts and follows an inverse J shape for prize amounts greater than $20,000. The treatment-effect function reaches a maximum around $30,000, and then it slowly decreases.

7. The code to derive the graphs is shown here for only the radial penalized spline estimator.


> yscale(r(6 18)) title("Radial spline method")

> xtitle("Treatment") ylabel(6 7 8 9 10 11 12 13 14 15 16 17 18)

> xlabel(0 10 20 30 40 50 60 70 80 90 100)

> ytitle("Dose-response function") scheme(medim)

. graph save DRF_RAD.gph, replace

(file DRF_RAD.gph saved)

. graph export DRF_RAD.eps, replace

(note: file DRF_RAD.eps not found)

(file DRF_RAD.eps written in EPS format)

. line radialder treatment, lcolor(black)

> yscale(r(-0.45 0.15)) title("Radial spline method")

> xtitle("Treatment") ylabel(-0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2)

> xlabel(0 10 20 30 40 50 60 70 80 90 100)

> ytitle("Derivative") scheme(medim)

. graph save dDRF_RAD.gph, replace

(file dDRF_RAD.gph saved)

. graph export dDRF_RAD.eps, replace

(note: file dDRF_RAD.eps not found)

(file dDRF_RAD.eps written in EPS format)

[Figure 1. Estimated dose–response functions, plotted against the treatment, for the semiparametric estimators and the parametric approach.]

[Figure 2. Estimated treatment-effect functions (derivatives), plotted against the treatment, for the semiparametric estimators and the parametric approach.]

Figures 3 and 4 show the DRFs and the treatment-effect functions estimated using the semiparametric and parametric techniques, now accompanied by pointwise 95% confidence bands. The confidence bands are based on a normal approximation using bootstrap standard errors, which are computed calling the drf program (or doseresponse program) in the bootstrap command.8

8. The radial spline-based models may produce slightly different estimates in different runs and when using the bootstrap command. This happens because within those models, an optimal set of design points is chosen via random selection of the knot values using the spacefill algorithm (see Bia and Van Kerm [2014] for further details). Some selected sets of knots may raise convergence issues depending on the data. Thus we recommend that users set a seed before running the drf code to make the results replicable.
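Given variables holding the bootstrap point estimates and standard errors at each evaluation point (hypothetical names est and se below), the normal-approximation bands are simply:

```stata
* Pointwise 95% normal-approximation bands: estimate +/- z(0.975) * SE
generate double upper = est + invnormal(0.975)*se
generate double lower = est - invnormal(0.975)*se
```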


> (line radialest treatment, lcolor(black))

> (line lowerEstRAD treatment, lcolor(black)),

> yscale(r(-40 60)) xtitle("Treatment") ylabel(-40 -20 0 20 40 60)

> title("Radial spline method") ytitle("Dose-response function")

> xlabel(0 10 20 30 40 50 60 70 80 90 100) scheme(medim)

. graph save CI_DRF_RAD.gph, replace

(file CI_DRF_RAD.gph saved)

. graph export CI_DRF_RAD.eps, replace

(note: file CI_DRF_RAD.eps not found)

(file CI_DRF_RAD.eps written in EPS format)

. twoway (line upperDerRAD treatment, lcolor(black))

> (line radialder treatment, lcolor(black))

> (line lowerDerRAD treatment, lcolor(black)),

> yscale(r(-2 2)) xtitle("Treatment") ylabel(-2 -1 0.0 1 2)

> title("Radial spline method") ytitle("Derivative")

> xlabel(0 10 20 30 40 50 60 70 80 90 100) scheme(medim)

. graph save CI_dDRF_RAD.gph, replace

(file CI_dDRF_RAD.gph saved)

. graph export CI_dDRF_RAD.eps, replace

(note: file CI_dDRF_RAD.eps not found)

(file CI_dDRF_RAD.eps written in EPS format)

[Figure 3. Estimated dose–response functions with pointwise 95% confidence bands, plotted against the treatment.]

[Figure 4. Estimated treatment-effect functions (derivatives) with pointwise 95% confidence bands, plotted against the treatment.]

The example allows us to highlight two important points. First, figures 3 and 4 show that differences in the point estimates and their precision among the three semiparametric estimators are more pronounced for low and high treatment levels. This is because our data are sparse for lower and higher values of the treatment.9 Because of the nonparametric methods we use, estimation becomes noisier and the parameters are estimated less precisely in regions of the data with few observations, which is reflected in the wider confidence intervals. This is particularly evident for the radial spline approach, which seems to be more sensitive to the sample size than the IW and penalized splines estimators are. Second, it is clear from figures 3 and 4 that the parametric estimator produces much tighter confidence bands relative to the semiparametric estimators. This is due to the additional structure imposed by the parametric estimator, which allows extrapolation from regions where data are abundant to regions where data are scarce. However, if the assumptions behind the parametric structure are incorrect, the results, including their precision, are likely misleading.

9. In particular, there are very few observations for prizes lower than $15,000 and greater than $40,000.


6 Conclusion

We develop a program where we implement semiparametric estimators of the DRF based on the GPS, assuming that assignment to the treatment is weakly unconfounded given pretreatment variables. We propose three semiparametric estimators: the IW kernel estimator developed in Flores et al. (2012) and two estimators using penalized spline methods for bivariate smoothing. We use data from a survey of Massachusetts lottery winners to illustrate the proposed methods and program. We find that the semiparametric estimators provide estimates of the DRF and the treatment-effect function that are substantially different from those obtained when using the parametric approach originally proposed in Hirano and Imbens (2004). All the semiparametric estimators agree on a U-shaped DRF, which contrasts with the estimated inverse J shape uncovered by the parametric estimator. Although we cannot draw a firm conclusion about the relative performance of the estimators based on one dataset, we argue that a misspecification of the conditional expectation of the outcome given treatment and GPS could result in inappropriate removal of self-selection bias and in misleading estimates of the DRF. Therefore, it is advisable to also use semiparametric estimators that account for complicated structures that are difficult to model parametrically. Conversely, semiparametric estimators can be sensitive to the sample size and might not perform well in regions with few observations.

7 Acknowledgments

This research is part of the "Estimation of direct and indirect causal effects using semiparametric and nonparametric methods" project supported by the Luxembourg Fonds National de la Recherche, which is cofunded under the Marie Curie Actions of the European Commission (FP7-COFUND).

8 References

Angrist, J. D., G. W. Imbens, and D. B. Rubin. 1996. Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91: 444–455.

Becker, S. O., and A. Ichino. 2002. Estimation of average treatment effects based on propensity scores. Stata Journal 2: 358–377.

Bia, M., and A. Mattei. 2008. A Stata package for the estimation of the dose–response function through adjustment for the generalized propensity score. Stata Journal 8: 354–373.

———. 2012. Assessing the effect of the amount of financial aids to Piedmont firms using the generalized propensity score. Statistical Methods & Applications 21: 485–516.

Bia, M., and P. Van Kerm. 2014. Space-filling location selection. Stata Journal 14: 605–622.


Buis, M. L., N. J. Cox, and S. P. Jenkins. 2003. betafit: Stata module to fit a two-parameter beta distribution. Statistical Software Components S435303, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s435303.html.

Dorn, S. 2012. pscore2: Stata module to enforce balancing score property in each

covariate dimension. UK Stata Users Group meeting.

http://econpapers.repec.org/paper/bocusug12/11.htm.

Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. New

York: Chapman & Hall/CRC.

Flores, C. A., A. Flores-Lagunes, A. Gonzalez, and T. C. Neumann. 2012. Estimating the effects of length of exposure to instruction in a training program: The case of Job Corps. Review of Economics and Statistics 94: 153–171.

Guardabascio, B., and M. Ventura. 2014. Estimating the dose–response function through a generalized linear model approach. Stata Journal 14: 141–158.

Hirano, K., and G. W. Imbens. 2004. The propensity score with continuous treatments. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, ed. A. Gelman and X.-L. Meng, 73–84. Chichester, UK: Wiley.

Imai, K., and D. A. van Dyk. 2004. Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99: 854–866.

Imbens, G. W. 2000. The role of the propensity score in estimating doseresponse

functions. Biometrika 87: 706710.

Imbens, G. W., D. B. Rubin, and B. I. Sacerdote. 2001. Estimating the eect of unearned

income on labor earnings, savings, and consumption: Evidence from a survey of

lottery players. American Economic Review 91: 778794.

Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in the econometrics

of program evaluation. Journal of Economic Literature 47: 586.

Jann, B. 2005. moremata: Stata module (Mata) to provide various functions. Sta-

tistical Software Components S455001, Department of Economics, Boston College.

http://ideas.repec.org/c/boc/bocode/s455001.html.

Kluve, J., H. Schneider, A. Uhlendor, and Z. Zhao. 2012. Evaluating continuous

training programmes by using the generalized propensity score. Journal of the Royal

Statistical Society, Series A 175: 587617.

Lechner, M. 2001. Identication and estimation of causal eects of multiple treatments

under the conditional independence assumption. In Econometric Evaluation of Labour

Market Policies, ed. M. Lechner and F. Pfeier, 4358. Heidelberg: Physica-Verlag.

Newey, W. K. 1994. Kernel estimation of partial means and a general variance estimator.

Econometric Theory 10: 233253.

604 Semiparametric estimators of dose–response functions

Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.

Rubin, D. B. 1974. Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology 66: 688–701.

Rubin, D. B. 1978. Bayesian inference for causal effects: The role of randomization. Annals
of Statistics 6: 34–58.


Rubin, D. B. 1990. Comment: Neyman (1923) and causal inference in experiments and
observational studies. Statistical Science 5: 472–480.

Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. Cambridge:
Cambridge University Press.

Stuart, E. A. 2010. Matching methods for causal inference: A review and a look forward.
Statistical Science 25: 1–21.

Michela Bia is a researcher at CEPS/INSTEAD, Population & Emploi, Esch-Sur-Alzette, Lux-

embourg.

Carlos A. Flores is an associate professor in the Department of Economics, Orfalea College of

Business at the California Polytechnic State University.

Alfonso Flores-Lagunes is an associate professor in the Department of Economics at the State

University of New York, Binghamton.

Alessandra Mattei is an assistant professor in the Department of Statistics, Informatics, Ap-

plications Giuseppe Parenti at the University of Florence.

The Stata Journal (2014)

14, Number 3, pp. 605–622

Space-filling location selection

Michela Bia Philippe Van Kerm

CEPS/INSTEAD CEPS/INSTEAD

Esch-sur-Alzette, Luxembourg Esch-sur-Alzette, Luxembourg

michela.bia@ceps.lu philippe.vankerm@ceps.lu

selection algorithm. The objective is to select a subset from a list of locations

so that the spatial coverage of the locations by the selected subset is optimized

according to a geometric criterion. Such an algorithm designed for geographical

site selection is useful for determining a grid of points that covers a data matrix

as needed in various nonparametric estimation procedures.

Keywords: st0353, spacefill, spatial sampling, space-filling design, site selection,
nonparametric regression, multivariate knot selection, point swapping

1 Introduction

Spatial statistics often address geographical sampling from a set of locations for network
construction (Cox, Cox, and Ensor 1997), for example, for installing air quality
monitoring (Nychka and Saltzman 1998) or for evaluating exposure to environmental
chemicals (Kim et al. 2010). The issue involves evaluating a discrete list of potential
locations and determining a small, optimal subset of places (a design) at which
to position, say, measurement instruments or sensors. One strategy to address such a
problem, the geometric approach, aims to find a design that minimizes the aggregate
distance between the locations and the sensors.

As discussed in Ruppert, Wand, and Carroll (2003) and Gelfand, Banerjee, and Fin-

ley (2012), location selection is also relevant in estimation of statistical models such as

multivariate nonparametric or semiparametric regression models. By analogy, instead

of locating measurement instruments, one seeks to identify a small number of loca-

tions from a large dataset at which to estimate a statistical model to reduce com-

putational cost. For example, kernel density estimates or locally weighted regression

models (Cleveland 1979; Fan and Gijbels 1996) are typically calculated on a grid of

points spanning the data range rather than over the whole input data points (and in-

terpolation is used where needed). The location of knots in spline regression models is

somewhat related; a small number of knots are selected instead of knots being placed

at many (or all) potential distinct data points. Determining such a grid is relatively

easy in one-dimensional models; for example, it is customary to locate knots at selected
percentiles of the data. Choosing an appropriate multidimensional grid while preserving
computational tractability is more complicated because merely taking combinations
of unidimensional grids quickly inflates the number of evaluation points. In this
context, Ruppert, Wand, and Carroll (2003) recommend applying a geometric space-filling
design to identify grid points or knot locations.

© 2014 StataCorp LP st0353

606 Space-filling location selection

design construction. The algorithm developed in Royle and Nychka (1998) selects a set

of design points from a discrete set of candidate points such that the coverage of the

candidate points by the design points is optimized according to a geometric coverage

criterion.1 The algorithm involves iterative point swapping between the candidate

points and the design points until no swapping can further improve the coverage of the

candidate points by the design points. The coverage criterion is geometric, but it is not

restricted to spatial, two-dimensional data. The procedure can be used in miscellaneous

settings when optimal subsampling of multivariate data is needed. Constraints are easily

imposed by excluding or including particular locations in the design. A nearest-neighbor

approximation makes the algorithm fast even for large samples.

We describe Royle and Nychka's (1998) algorithm in section 2 and its implementation
in Stata in section 3. We illustrate several uses of the spacefill command in section 4.
We show how it can be applied for generating a multidimensional grid of fixed size that
optimally covers a dataset.

2 The space-filling algorithm

2.1 Geometric coverage criterion

The space-filling design selection considered here is based on optimization with respect
to the geometric coverage of a set of data points. We refer to data points as locations,
although they are not restricted to geographic locations identified by spatial
coordinates; in principle, any unidimensional or multidimensional coordinates can be
used to locate points (see examples in section 4).

Following Royle and Nychka's (1998) notation, we let C denote a set of N candidate
locations (the candidate set). We let Dn be a subset of n locations selected from C.

Dn is a design of size n, and the locations selected in Dn are design points. The

geometric metric for the distance between any given location x and the design Dn is

d_p(x, D_n) = \left\{ \sum_{y \in D_n} \| x - y \|^p \right\}^{1/p} \qquad (1)

with p < 0. dp(x, Dn) measures how well the design Dn covers the location x. When
p → −∞, dp(x, Dn) tends to the shortest Euclidean distance between x and a point
in Dn (Johnson, Moore, and Ylvisaker 1990). dp(x, Dn) is zero if x is at a location in
Dn.

1. An R implementation of Royle and Nychkas (1998) algorithm is available in Furrer, Nychka, and

Sain (2013).

M. Bia and P. Van Kerm 607

A design Dn is optimal for given parameters p and q if it minimizes

C_{p,q}(C, D_n) = \left\{ \sum_{x \in C} d_p(x, D_n)^q \right\}^{1/q} \qquad (2)

over all possible designs Dn from C. The optimal design minimizes the q-power mean of
the coverages of all locations outside of the design (the candidate points). Increasing

q gives greater importance to the distance of the design to poorly covered locations.

Figure 1 can help readers visualize the criterion. From a set of 38 European cities, we
selected a potential design of five locations: Madrid, Brussels, Berlin, Riga, and Sofia.
The coverage of, say, London by this design is given by plugging the Euclidean distances
from London to the five selected cities into (1). With a large negative p, this coverage
will be determined by the distance to the closest city, namely, Brussels. Repeating such
calculations for all 33 cities from outside the design and aggregating the coverages using
(2) gives the overall geometric distance of European cities to the design composed
of Madrid, Brussels, Berlin, Riga, and Sofia. The optimal design is the combination of
any five cities that minimizes this criterion. The design composed of Madrid, Brussels,
Berlin, Riga, and Sofia is in fact the optimal design for p = −5 and q = 1.

Figure 1. Potential design of five cities (Madrid, Brussels, Berlin, Riga, and Sofia) with distances to London as example
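For readers outside Stata, the two formulas can be sketched in a few lines of illustrative Python; the function names d_p and C_pq below are ours, not part of spacefill:

```python
import math

def d_p(x, design, p=-5.0):
    """Coverage of location x by the design, eq. (1): a power-p aggregation
    of the Euclidean distances from x to every design point."""
    dists = [math.dist(x, y) for y in design]
    if min(dists) == 0.0:
        return 0.0          # d_p is zero when x is itself a design point
    return sum(d ** p for d in dists) ** (1.0 / p)

def C_pq(candidates, design, p=-5.0, q=1.0):
    """Overall coverage criterion, eq. (2): q-power aggregation of d_p over
    all candidate locations; smaller values mean better coverage."""
    return sum(d_p(x, design, p) ** q for x in candidates) ** (1.0 / q)

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 3.0)]
design = [(0.0, 0.0), (3.0, 3.0)]

# With a strongly negative p, d_p approaches the distance to the nearest
# design point: (4, 3) is 1 unit from (3, 3) and 5 units from (0, 0).
print(round(d_p((4.0, 3.0), design, p=-50.0), 3))

# Adding a design point can only improve (lower) the criterion.
print(C_pq(pts, [(0.0, 0.0)]) > C_pq(pts, design))
```

The guard for a zero distance matters in code because a negative power of zero is undefined, whereas the limit of (1) at a design point is zero.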


In most applications, identification of the optimal design by calculating the coverage

criterion for all possible subsets of size n from N is computationally prohibitive. Royle

and Nychka (1998) propose a simple point-swapping algorithm to determine Dn . Start-

ing from a random initial design Dn0 , the algorithm iteratively attempts to swap a point

from the design with the point from the candidate set that leads to the greatest im-

provement in coverage. If this tentative swap improves coverage of the candidate set

by the design, the latter is updated. Otherwise, the swap is ignored. The process is

repeated until no swap between a design point and a candidate point can improve cov-

erage. Users can significantly improve speed by restricting potential swaps for a point

in the design to its k nearest neighbors in the candidate set [according to (1)]. See

Royle and Nychka (1998) for details.
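A minimal, non-Stata Python sketch of the swap loop may help fix ideas; for brevity it checks every candidate point rather than only the k nearest neighbors, and in practice one would repeat it from several initial designs as the nruns() option does (all names here are ours):

```python
import math, random

def coverage(cands, design, p=-5.0, q=1.0):
    """Criterion (2) built on the per-point coverage (1)."""
    def d_p(x):
        dists = [math.dist(x, y) for y in design]
        if min(dists) == 0.0:
            return 0.0
        return sum(d ** p for d in dists) ** (1.0 / p)
    return sum(d_p(x) ** q for x in cands) ** (1.0 / q)

def swap_search(cands, n, p=-5.0, q=1.0, seed=0):
    """Start from a random design of size n; repeatedly try exchanging a
    design point with a candidate point, keeping a swap only when it lowers
    the criterion; stop when no exchange improves coverage."""
    rng = random.Random(seed)
    design = rng.sample(cands, n)
    pool = [x for x in cands if x not in design]
    best = coverage(cands, design, p, q)
    improved = True
    while improved:
        improved = False
        for i in range(len(design)):
            for j in range(len(pool)):
                design[i], pool[j] = pool[j], design[i]      # tentative swap
                c = coverage(cands, design, p, q)
                if c < best - 1e-12:
                    best, improved = c, True                 # keep the swap
                else:
                    design[i], pool[j] = pool[j], design[i]  # undo it
    return design, best

cands = [(float(i), float(j)) for i in range(5) for j in range(5)]
design, cpq = swap_search(cands, n=4)
print(sorted(design), round(cpq, 2))
```

The loop terminates because the criterion strictly decreases over a finite set of possible designs.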

The point-swapping algorithm makes it straightforward to impose constraints on the

inclusion or exclusion of specific locations; such points are considered in calculations of

the geometric criterion but excluded from any potential swap. Nonrandom initial design

points can also be used.

Although the algorithm always converges to a solution, it is not guaranteed to con-

verge to the globally optimal Dn for any initial design when potential swaps are limited

to nearest neighbors. Therefore, Royle and Nychka (1998) recommend repeating esti-

mation for multiple initial design sets and selecting the design with the best coverage

across repetitions (see section 4).

3 The spacefill command

The spacefill command performs space-filling location selection using Royle and Nychka's
(1998) point-swapping algorithm. It operates on N observations from variables

identifying the coordinates of the data points and returns the subset of n < N observa-

tions that optimally covers the data.

spacefill options allow forced inclusion or exclusion of particular observations,

user-specified initial design, and automatic standardization of location coordinates.

When weights are specied, spacefill performs weighted calculation of the aggre-

gate coverage measure [see (2)]. In section 4, we show that combining weights and

restrictions on candidate locations makes it easy to create an optimal regular grid

over a dataset.


3.1 Syntax

spacefill varlist [if] [in] [weight] [, ndesign(#) design0(varlist)
fixed(varname) exclude(varname) p(#) q(#) nnfrac(#) nnpoints(#)
nruns(#) standardize standardize2 standardize3 sphericize ranks
generate(newvar) genmarker(newvar) noverbose]

aweights, fweights, and iweights are allowed; see [U] 11.1.6 weight.

varlist and the if or in qualifier identify the data from which the optimal subset is

selected.

3.2 Options

ndesign(#) specifies n, the size of the design. The default is ndesign(4).

design0(varlist) identifies a set of initial designs identified by observations with nonzero

varlist. If multiple variables are passed, one optimization is performed for each initial

design, and the selected design is the one with best coverage.

fixed(varname) identifies observations that are included in all designs when varname

is nonzero.

exclude(varname) identifies observations excluded from all designs when varname is

nonzero.

p(#) specifies a scalar value for the distance parameter for calculating the distance of
each location to the design; for example, p = −1 gives harmonic mean distance, and
p = −∞ gives the minimum distance. The default is p(-5), as recommended in

Royle and Nychka (1998).

q(#) specifies a scalar value for the parameter q. The default is q(1) (the arithmetic

mean).

nnfrac(#) specifies the fraction of data to consider as nearest neighbors in the point-

swapping iterations. Limiting checks to nearest neighbors improves speed but does

not guarantee convergence to the best design; therefore, setting nruns(#) is recom-

mended. The default is nnfrac(0.50).

nnpoints(#) specifies the number of nearest neighbors considered in the point-swapping

iterations. Limiting checks to nearest neighbors improves speed. nnfrac(#) and

nnpoints(#) are mutually exclusive.

nruns(#) sets the number of independent runs performed on alternative random initial

designs. The selected design is the one with best coverage across the runs. The

default is nruns(5).


standardize standardizes all variables in varlist to zero mean and unit standard devi-

ation (SD) before calculating distances between observations.

standardize2 standardizes all variables in varlist to zero mean and unit SD before calculating

distances between observations, with an estimator of the SD as 0.7413 times the

interquartile range.

standardize3 standardizes all variables in varlist to zero median and unit SD before
calculating distances between observations, with an estimator of the SD as 0.7413 times

the interquartile range.

sphericize transforms all variables in varlist into zero mean, unit SD, and zero covariance

using a Cholesky decomposition of the variancecovariance matrix before calculating

distances between observations.

ranks transforms all variables in varlist into their (fractional) ranks and uses distances

between these observation ranks in each dimension to evaluate distances between

observations.

generate(newvar) specifies the names for new variables containing the locations of the

best design points. If one variable is specied, it is used as a stubname; otherwise,

the number of new variable names must match the number of variables in varlist.

genmarker(newvar) specifies the name of a new binary variable equal to one for
observations selected in the best design and zero otherwise.

noverbose suppresses output display.

Options standardize2, standardize3, and ranks require installation of the user-written
package moremata, which is available on the Statistical Software Components
archive (Jann 2005).
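The standardization options are simple per-variable transforms, illustrated below in hedged Python (the exact quantile and rank conventions spacefill uses internally are our assumptions; the factor 0.7413 ≈ 1/1.349 makes the interquartile range a consistent SD estimator under normality):

```python
import statistics as st

def standardize(col):
    """standardize: rescale to zero mean and unit SD."""
    m, s = st.fmean(col), st.pstdev(col)
    return [(v - m) / s for v in col]

def robust_sd(col):
    """SD estimator used by standardize2/standardize3:
    0.7413 times the interquartile range."""
    q1, _, q3 = st.quantiles(col, n=4)
    return 0.7413 * (q3 - q1)

def fractional_ranks(col):
    """ranks: replace each value by (rank - 0.5) / N in its dimension."""
    order = sorted(range(len(col)), key=lambda i: col[i])
    out = [0.0] * len(col)
    for rank, i in enumerate(order, start=1):
        out[i] = (rank - 0.5) / len(col)
    return out

z = standardize([150.0, 160.0, 170.0, 180.0])
print([round(v, 3) for v in z])
print(fractional_ranks([10.0, 20.0, 15.0]))  # smallest -> 1/6, largest -> 5/6
```

Distances between observations are then computed on the transformed columns, so no single variable dominates the geometric criterion merely because of its units.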

4 Examples

We provide two illustrations for the application of spacefill. The first example uses

ozone2.txt, which is available in the R elds package (Furrer, Nychka, and Sain 2013),

and provides examples of standard site selection. The second example uses survey data

from the Panel Socio-Économique Liewen zu Lëtzebuerg/European Union Statistics on

Income and Living Conditions (PSELL3/EU-SILC) and illustrates the use of spacefill

for nonparametric regression analysis with multidimensional, nonspatial data.

ozone2.txt contains air quality information in 147 locations in the US Midwest in the

summer of 1987 (Furrer, Nychka, and Sain 2013). Locations are identified by their relative

latitude (lat) and longitude (lon).


We start by selecting an optimal design of size 10 from the 147 locations, using

default values p = −5 and q = 1, candidate swaps limited to the nearest half of the

locations, and 5 runs with random starting designs.

. insheet using ozone2.txt
(3 vars, 147 obs)

. spacefill lon lat, ndesign(10)

Run 1 .... (Cpq = 100.34)

Run 2 .... (Cpq = 96.92)

Run 3 ...... (Cpq = 94.19)

Run 4 .... (Cpq = 95.00)

Run 5 .. (Cpq = 95.19)

. return list

scalars:

r(q) = 1

r(p) = -5

r(nn) = 69

r(Cpq) = 94.19164847896585

r(nexcluded) = 0

r(nfixed) = 0

r(ndesign) = 10

r(N) = 147

macros:

r(varlist) : "lon lat"

matrices:

r(Best_Design) : 10 x 2

. matrix list r(Best_Design)

r(Best_Design)[10,2]

lon lat

r1 -87.752998 41.855

r2 -90.160004 38.612

r3 -85.841003 39.935001

r4 -87.57 38.021

r5 -91.662003 41.992001

r6 -84.476997 39.106998

r7 -85.578003 38.137001

r8 -85.671997 42.985001

r9 -83.403 42.388

r10 -88.283997 43.333

Notice that the first run leads to a somewhat higher aggregate distance to the design
points (Cpq=100.34) than the other runs. This stresses the importance of multiple
starting designs. Figure 2 shows the selected locations in the best design (achieved at
run 3, where Cpq=94.19).


Figure 2. Scatterplot and histogram of longitude and latitude for all 147 locations (gray

histograms and gray hollow circles) and 10 best design points (thick histograms and

solid dots) with p = −5 and q = 1 (default)

Users can improve speed by restricting potential swaps to a smaller number of nearest
neighbors. Limiting a search to 25 nearest neighbors (against 69, the default half of the
locations, in the first example), our second example below runs in 4 seconds against
11 seconds for our initial example, without much loss in the coverage of the resulting
design (Cpq=96.59). On the other hand, running spacefill with the full candidate set
as potential swaps runs in over 30 seconds for an optimal design with Cpq=91.96.

. spacefill lon lat, ndesign(10) nnpoints(25) genmarker(set1)

Run 1 ..... (Cpq = 117.02)

Run 2 .... (Cpq = 109.93)

Run 3 .. (Cpq = 110.99)

Run 4 .. (Cpq = 101.05)

Run 5 ..... (Cpq = 96.59)

. spacefill lon lat, ndesign(10) nnfrac(1)

Run 1 ... (Cpq = 91.96)

Run 2 .... (Cpq = 91.96)

Run 3 .. (Cpq = 91.96)

Run 4 ... (Cpq = 92.32)

Run 5 ... (Cpq = 91.96)

We now illustrate the use of the genmarker(), fixed(), and exclude() options. In

the previous call, genmarker(set1) generated a dummy variable equal to 1 for the 10


points selected into the best design and 0 otherwise. We now specify exclude(set1)

to derive a new design with 10 different locations and then use fixed(set2) to force

this new design into a design of size 15.

> noverbose

10 points excluded from designs (set1>0)

. spacefill lon lat, ndesign(15) nnpoints(25) fixed(set2) genmarker(set3)

> noverbose

10 fixed design points (set2>0)

. list set1 set2 set3 if set1+set2+set3>0

4. 1 0 0

10. 0 1 1

25. 1 0 0

40. 1 0 0

48. 0 1 1

55. 1 0 0

58. 0 1 1

60. 1 0 0

61. 0 1 1

63. 0 0 1

67. 0 0 1

74. 1 0 0

77. 0 0 1

80. 0 1 1

82. 0 1 1

89. 0 0 1

91. 0 1 1

97. 1 0 0

107. 0 1 1

109. 1 0 0

121. 0 1 1

125. 0 0 1

135. 0 1 1

140. 1 0 0

143. 1 0 0

The key parameters q and p of the coverage criterion can also be flexibly specified.
Figure 3 illustrates three designs selected with default parameters p = −5 and q = 1 (dots),
with p = −1 and q = 1 (squares), and with p = −1 and q = 5 (crosses). With p = −5,
the distance of a location to the design is mainly determined by the distance to the
closest point of the design; p = −1 accounts for the distance to all points in the design,
leading to more central location selections. Setting q = 5 penalizes large distances
between design and nondesign points, leading to location selections more spread out
toward external points. Note our use of user-specified random starting designs with
option design0() to ensure comparison is made on common initial values.


. generate byte init1 = 1 in 1/10
(137 missing values generated)

. generate byte init2 = 1 in 11/20

(137 missing values generated)

. generate byte init3 = 1 in 21/30

(137 missing values generated)

. generate byte init4 = 1 in 31/40

(137 missing values generated)

. generate byte init5 = 1 in 41/50

(137 missing values generated)

. local options nnfrac(0.3) nruns(10) design0(init1 init2 init3 init4 init5)

> noverbose

. spacefill lat lon, `options' generate(Des)
. spacefill lat lon, `options' generate(Des_BIS) p(-1) q(1)
. spacefill lat lon, `options' generate(Des_TER) p(-1) q(5)
. spacefill lat lon, `options' generate(Des_QUAT) p(-5) q(5)

Figure 3. Scatterplot of longitude and latitude for all 147 locations (gray hollow circles)

and best design points with default p = −5 and q = 1 (dots), with p = −1 and q = 1
(squares), and with p = −1 and q = 5 (crosses)


By combining the exclude() option and weights, one can use spacefill to find an optimal
design from an external set of locations; that is, one can use it to select a subset
of points from a set A that optimally covers points from a set B. This is particularly
useful to identify a subset of points from a lattice (the set A) that best covers the data
(the set B). To set this up, we start by generating the lattice (a dataset with many
candidate grid points) using range (see [D] range) and fillin (see [D] fillin). We append
this generated dataset to the locations data. We then identify actual observations
from the sample by sample==1 and the generated candidate locations on the lattice by
sample==0.

We can now run spacefill to select a smaller subset of grid points from the full
lattice that optimally covers the actual locations. To do so, we run spacefill on the
whole set of data points i) with exclude(sample) to select points from the grid only
and ii) with [iw=sample] so that the aggregate distance is computed only between the
design points on the grid and the actual locations. A set of 25 optimally chosen grid
points from a candidate grid of 176 (11 × 16) points is shown in figure 4. Below we
illustrate how this can be used to speed up calculations of computationally intensive
nonparametric regression models.

. clear

. set obs 16

obs was 0, now 16

. range lon -95 -80 16

. range lat 36 46 11

(5 missing values generated)

. fillin lon lat

. gen byte sample = 0

. save gridlatlon.dta , replace

file gridlatlon.dta saved

. clear

. insheet using ozone2.txt

(3 vars, 147 obs)

. keep lat lon

. gen byte sample = 1

. append using gridlatlon

. spacefill lon lat [iw=sample], exclude(sample) ndesign(25) nnpoints(100)

> genmarker(subgrid1)

147 points excluded from designs (sample>0)

Run 1 .. (Cpq = 63.93)

Run 2 .... (Cpq = 63.92)

Run 3 .... (Cpq = 63.71)

Run 4 ... (Cpq = 63.07)

Run 5 ... (Cpq = 63.02)
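The same A-covers-B mechanics can be mimicked outside Stata. In this illustrative Python sketch (names ours), tentative swaps draw only on lattice points while the criterion sums distances to the data points only, mirroring the combination of exclude(sample) and [iw=sample]:

```python
import math, random

def coverage(data, design, p=-5.0, q=1.0):
    """Aggregate coverage of the data points (set B) by the design."""
    def d_p(x):
        dists = [math.dist(x, y) for y in design]
        if min(dists) == 0.0:
            return 0.0
        return sum(d ** p for d in dists) ** (1.0 / p)
    return sum(d_p(x) ** q for x in data) ** (1.0 / q)

def grid_design(data, lattice, n, seed=0):
    """Select n lattice points (set A) that best cover the data (set B):
    swaps are restricted to the lattice; the criterion sees only the data."""
    rng = random.Random(seed)
    design = rng.sample(lattice, n)
    pool = [g for g in lattice if g not in design]
    best = coverage(data, design)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(len(pool)):
                design[i], pool[j] = pool[j], design[i]
                c = coverage(data, design)
                if c < best - 1e-12:
                    best, improved = c, True
                else:
                    design[i], pool[j] = pool[j], design[i]
    return design, best

random.seed(42)
data = [(random.uniform(0, 2), random.uniform(0, 2)) for _ in range(60)]
lattice = [(0.5 * i, 0.5 * j) for i in range(9) for j in range(9)]  # 81 points on [0, 4]^2
sub, cpq = grid_design(data, lattice, n=6)
print(len(sub), round(cpq, 2))
```

Lattice points far from the data are never rewarded by the criterion, so the selected subset concentrates where the data actually lie, as in figure 4.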


Figure 4. Actual 147 locations (hollowed gray circles), 176 candidate grid points (lattice;

crosses), and 25 optimally selected grid points (solid dots)

We now illustrate the use of spacefill with multidimensional and nonspatial data

taken from the PSELL3/EU-SILC collected in 2007.2 We extracted information on the

height, weight, and wage of a random subsample of 500 working women.

We first use spacefill to select a subset of 50 women with characteristics on these 3
variables that best cover the sample. Given the different metrics of the three variables,

we specify the standardize option to compute the geometric distance criterion after

standardizing the three variables to have zero mean and unit SD in the sample.3

Figures 5 and 6 show bivariate scatterplots and histograms of the selected 50 design
points. Two features are worth noting. First, the quality of the coverage is not affected
by the skewness of the data (especially in the wage dimension). The space-filling
algorithm is indeed applicable to broad data configurations. Second, the difference in the
histograms for the sample and for the design points is a reminder that selecting a space-filling
design is distinct from drawing a representative subset of the data. The points
that best cover the data in a geometric sense must not necessarily reflect their frequency
distribution: few design points may contribute to cover many data points in areas of
high concentration, while design points spread out in areas of low data concentration
will contribute to cover a smaller number of data points.

2. The PSELL3/EU-SILC is a representative survey of the population residing in Luxembourg.
Data are collected annually in a sample of more than 3,500 private households.
3. Alternative standardization could have been adopted with options standardize2, standardize3,
sphericize, or ranks.

. summarize height weight wage

Variable Obs Mean Std. Dev. Min Max

weight 500 65.368 12.80502 43 127

wage 500 2720.688 1920.047 300 10000

. spacefill height weight wage, ndesign(50) nnfrac(0.05) generate(BH BW BWa)

> standardize

Run 1 .... (Cpq = 196.98)

Run 2 ..... (Cpq = 195.15)

Run 3 .... (Cpq = 196.13)

Run 4 ........ (Cpq = 196.79)

Run 5 .... (Cpq = 194.55)

Figure 5. Scatterplot and histogram of height and weight for all data (gray histograms

and hollowed markers) and best design points (thick histograms and markers) for the

standardized values of the height, weight, and wage


Figure 6. Scatterplot and histogram of height and wage for all data (gray histograms

and hollowed markers) and best design points (thick histograms and markers) for the

standardized values of the height, weight, and wage

We now use these data to run a locally weighted polynomial regression of wage

on height and weight. Our objective is to assess nonparametrically the relationship

between wage and body size. For the sake of illustration, we want to estimate expected

wage nonparametrically at multiple grid points from a lattice where each point is a
pair of height–weight values. One reason for this is that fitting the model at all height–
weight pairs in our data would be computationally expensive (and inefficient if there are
nearly identical height–weight pairs in the data). We seek a cheaper alternative with
fewer evaluation points. (This is similar to using lpoly with the at() option instead
of lowess in the unidimensional setting.) Also we use evaluation points on a lattice
instead of at sample values because we are considering fitting the model for different
subsamples, and we want to have model estimates on a common grid of evaluation points

for all subsamples. (If need be, bivariate interpolation will be used to recover estimates

at sample values; see [G-2] graph twoway contourline for the interpolation formula.)

This setting is relatively standard in nonparametric regression analysis, especially when

dealing with large samples or computationally heavy estimators (for example, cross-

validation-based bandwidth selection).
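The evaluate-then-interpolate idea is language-agnostic. Below is a hedged one-dimensional Python sketch (helper names are ours; neither spacefill nor lpoly is involved): a local-linear smoother is computed at a handful of grid points only, and predictions elsewhere come from linear interpolation between those fits:

```python
import math

def loclin(xs, ys, x0, h):
    """Local-linear fit at x0 with Gaussian kernel weights, bandwidth h."""
    w = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def interpolate(grid, fits, x):
    """Linear interpolation between the precomputed grid-point fits."""
    for g0, g1, f0, f1 in zip(grid, grid[1:], fits, fits[1:]):
        if g0 <= x <= g1:
            t = (x - g0) / (g1 - g0)
            return (1 - t) * f0 + t * f1
    raise ValueError("x outside the evaluation grid")

xs = [i / 100 for i in range(101)]       # 101 data points on [0, 1]
ys = [2 * x + 1 for x in xs]             # exactly linear response
grid = [0.0, 0.25, 0.5, 0.75, 1.0]       # fit at 5 points instead of 101
fits = [loclin(xs, ys, g, h=0.1) for g in grid]
print(round(interpolate(grid, fits, 0.6), 6))
```

Fitting at 5 rather than 101 points cuts the smoothing cost roughly twentyfold here; in the bivariate application the savings from a space-filling subset of the lattice are analogous.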


We start with a 20 × 20 rectangular lattice covering heights from 150 to 192 centimeters
and weights from 43 to 127 kilograms. While this lattice spans the values observed
in our sample, it also includes many empirically irrelevant height–weight pairs. Estimation
on the full grid is therefore unnecessary, and we use spacefill as described above

to select a subset of points on the lattice that covers our data.

Figure 7 shows resulting estimates based on a space-filling design of size 50, as well

as estimates based on a random subset of 100 lattice points, on 100 Halton draws from

the lattice, on the full lattice, and on all sample points. Brightness of the contours

corresponds to local regression estimates of expected wage from black (for monthly

wage below EUR 1000) to white (for monthly wage above EUR 5000). In each panel,

local regression was effectively calculated only at the marked grid points (and so it was

conducted faster on the space-lling design), while the overall coloring of the map was

based on the thin-plate-spline interpolation built in twoway contour.


Figure 7. Contour plot of expected wage of 500 Luxembourg women by height and

weight from monthly wage less than EUR 1000 (black) to more than EUR 5000 (white).

Calculations based on local regression estimation. White lines identify body-mass indices
of 18.5, 25, and 30, which delineate underweight, overweight, and obesity, respectively.


The contour plots display variations in areas of low data density (top left and bottom
right), reflecting both the imprecision and variability of the local linear regression
estimates in these zones and the variations introduced by the interpolation of values
away from the bulk of the data. In areas of higher data density (for height below 180
centimeters and weight below 100 kilograms), estimates on the 50-point space-filling
subset differ little from those of the full sample or from the full lattice.4

Acknowledgments

This research is part of the project Estimation of direct and indirect causal effects using

semi-parametric and non-parametric methods, which is supported by the Luxembourg

Fonds National de la Recherche, cofunded under the Marie Curie Actions of the

European Commission (FP7-COFUND). Philippe Van Kerm acknowledges funding for

the project Information and Wage Inequality, which is supported by the Luxembourg

Fonds National de la Recherche (contract C10/LM/785657).

5 References

Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association 74: 829–836.

Cox, D. D., L. H. Cox, and K. B. Ensor. 1997. Spatial sampling and the environment:
Some issues and directions. Environmental and Ecological Statistics 4: 219–233.

Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. New

York: Chapman & Hall/CRC.

Furrer, R., D. Nychka, and S. Sain. 2013. fields: Tools for spatial data. R package
version 6.7.6. http://CRAN.R-project.org/package=fields.

Gelfand, A. E., S. Banerjee, and A. O. Finley. 2012. Spatial design for knot selection
in knot-based dimension reduction models. In Spatio-Temporal Design: Advances in
Efficient Data Acquisition, ed. J. Mateu and W. G. Müller, 142–169. Chichester, UK:
Wiley.

Jann, B. 2005. moremata: Stata module (Mata) to provide various functions. Sta-

tistical Software Components S455001, Department of Economics, Boston College.

http://ideas.repec.org/c/boc/bocode/s455001.html.

Johnson, M. E., L. M. Moore, and D. Ylvisaker. 1990. Minimax and maximin distance
designs. Journal of Statistical Planning and Inference 26: 131–148.

Kim, J.-I., A. B. Lawson, S. McDermott, and C. M. Aelion. 2010. Bayesian spatial
modeling of disease risk in relation to multivariate environmental risk fields. Statistics
in Medicine 29: 142–157.

4. Note, incidentally, how taller women tend to be paid higher wages in these data in all three body-

mass index categories.


Nychka, D., and N. Saltzman. 1998. Design of air-quality monitoring networks. In Case
Studies in Environmental Statistics (Lecture Notes in Statistics 132), ed. D. Nychka,
W. Piegorsch, and L. Cox, 51–76. New York: Springer.

Royle, J. A., and D. Nychka. 1998. An algorithm for the construction of spatial coverage
designs with implementation in S-PLUS. Computers and Geosciences 24: 479–488.

Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. Cambridge:
Cambridge University Press.

Michela Bia and Philippe Van Kerm are at CEPS/INSTEAD, Esch-sur-Alzette, Luxembourg.

The Stata Journal (2014)

14, Number 3, pp. 623–661

Adaptive Markov chain Monte Carlo sampling
and estimation in Mata

Matthew J. Baker

Hunter College and the Graduate Center, CUNY

New York, NY

matthew.baker@hunter.cuny.edu

Abstract. In this article, I discuss adaptive Markov chain Monte Carlo (MCMC) methods; I introduce a Mata func-

tion for performing adaptive MCMC, amcmc(); and I present a suite of functions,

amcmc_*(), that allows an alternative implementation of adaptive MCMC. amcmc()

and amcmc_*() can be used with models set up to work with Mata's moptimize( )

(see [M-5] moptimize( )) or optimize( ) (see [M-5] optimize( )) or with stand-

alone functions. To show how the routines can be used in estimation problems, I

give two examples of what Chernozhukov and Hong (2003, Journal of Econometrics

115: 293346) refer to as quasi-Bayesian or Laplace-type estimatorssimulation-

based estimators using MCMC sampling. In the rst example, I illustrate basic

ideas and show how a simple linear model can be t by simulation. In the next

example, I describe simulation-based estimation of a censored quantile regression

model following Powell (1986, Journal of Econometrics 32: 143155); the discus-

sion describes the workings of the command mcmccqreg. I also present an example

of how the routines can be used to draw from distributions without a normalizing

constant and used in Bayesian estimation of a mixed logit model. This discussion

introduces the command bayesmixedlogit.

Keywords: st0354, amcmc(), amcmc_*(), bayesmixedlogit, mcmccqreg, Mata, Markov chain Monte Carlo, drawing from distributions, Bayesian estimation, mixed logit

1 Introduction

Markov chain Monte Carlo (MCMC) methods are a popular and widely used means of drawing from probability distributions that are not easily inverted, that have difficult normalizing constants, or for which a closed form cannot be found. While often considered a collection of methods with primary usefulness in Bayesian analysis and estimation, MCMC methods can be applied to a variety of estimation problems. Chernozhukov and Hong (2003), for example, show that MCMC methods can be applied to many problems of traditional statistical inference and used to fit a wide class of models: essentially, any statistical model with a pseudoquadratic objective function. This class of models encompasses many common econometric models that have traditionally been fit by maximum likelihood or generalized methods of moments. This article describes some Mata functions for drawing from distributions by using different types of adaptive MCMC algorithms. The Mata implementation of the algorithms is intended to allow straightforward application to estimation problems.

© 2014 StataCorp LP st0354

624 Adaptive MCMC in Mata

While it is well known that MCMC methods are useful for drawing from difficult densities, one might ask: why use MCMC methods in estimation? Sometimes, maximizing an objective function may be difficult or slow, perhaps because of discontinuities or nonconcave regions of the objective function, a large parameter space, or difficulty in programming analytic gradients or Hessians. When bootstrapping of standard errors is required, estimation problems are exacerbated because of the need to refit a model many times. MCMC methods may provide a more feasible means of estimation in these cases: estimation based on sampling directly from the joint parameter distribution does not require optimization and still provides the desired result of estimation, a description of the joint distribution of parameters. MCMC methods are a popular means of implementing Bayesian estimators because they allow one to avoid hard-to-calculate normalizing constants that often appear in posterior distributions. Unlike extrema-based estimation, Bayesian estimators do not rely on asymptotic results and thus are useful in small-sample estimation problems or when the asymptotic distribution of parameters is difficult to characterize.

In this article, I describe a Mata function, amcmc(), that implements adaptive or nonadaptive MCMC algorithms. I also describe a suite of routines, amcmc_*(), that allows implementation via a series of structured functions, as one might use Mata functions such as moptimize( ) (see [M-5] moptimize( )) or deriv( ) (see [M-5] deriv( )). The algorithms implemented by the Mata routines more or less follow Andrieu and Thoms (2008), who present an accessible overview of the theory and practice of adaptive MCMC.

In section 2, I provide an intuitive overview of adaptive MCMC algorithms, while in section 3, I describe how the algorithms are implemented in Mata by amcmc() or by creating a structured object via the suite of functions amcmc_*(). In section 4, I describe four applications. I show how the routines might be used in a straightforward parameter estimation problem, and I describe how methods can be applied to a more difficult problem: censored quantile regression. In this discussion, I also introduce the mcmccqreg command. I then show how routines can be used to sample from a distribution that is hard to invert and lacks a normalizing constant. In a final example in section 4, I apply the methods to Bayesian estimation of a mixed logit model following Train (2009) and introduce the bayesmixedlogit command. In section 5, I sketch a basic Mata implementation of an adaptive MCMC algorithm, which I hope will give users a template for developing adaptive MCMC algorithms in more specialized applications. In section 6, I conclude and offer some sources for additional reading.

2 Adaptive MCMC

At the heart of adaptive MCMC sampling is the Metropolis-Hastings (MH) algorithm. An MH algorithm is built around a target distribution that one wishes to sample from, π(X), and a proposal distribution, q(Y, X).1 If one is mainly interested in applying MCMC in estimation, one may think of π(X) as a conditional likelihood function, and X can be thought of as a 1 × n row vector of parameters. A basic MH algorithm is described in table 1.

1. For ease of comparison, I follow the notation of Andrieu and Thoms (2008) wherever possible.

M. J. Baker 625

Table 1. An MH algorithm. The proposal distribution is denoted by q(Y, X), while the target distribution is π(X). α(X, Y) denotes the draw acceptance probability.

Basic MH algorithm

1: Initialize start value X = X0 and draws T.
2: Set t = 0 and repeat steps 3-6 while t ≤ T:
3: Draw a candidate Yt from q(Yt, Xt).
4: Compute α(Yt, Xt) = min[{π(Yt)q(Yt, Xt)}/{π(Xt)q(Xt, Yt)}, 1].
5: Set Xt+1 = Yt with prob. α(Yt, Xt), Xt+1 = Xt otherwise.
6: Increment t.

Output: The sequence (Xt), t = 1, . . . , T.

The MH algorithm sketched in table 1 has the property that candidate draws Yt that increase the value of the target distribution, π(X), are always accepted, whereas candidate draws that produce lower values of the target distribution are accepted with only probability α. Under general conditions, the draws X1, X2, . . . , XT converge to draws from the target distribution, π(X); see Chib and Greenberg (1995) for proofs. One can see the convenience the algorithm provides in drawing from densities of the form π(X) = f(X)/K, where K is some perhaps difficult-to-calculate normalizing constant. Computation of K is unnecessary, because it cancels out of the ratio π(X)/π(Y). The proposal distribution, q(Y, X), is where the Markov chain part of Markov chain Monte Carlo comes in. It is what distinguishes MCMC algorithms from more general acceptance-rejection Monte Carlo sampling: candidate draws depend upon previous draws through this function.
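To make the mechanics concrete, here is a minimal sketch of the table 1 algorithm in Python rather than Mata, using a symmetric normal proposal so that the ratio q(Yt, Xt)/q(Xt, Yt) in step 4 equals 1; the standard normal target, left unnormalized, and all tuning values are assumptions chosen for this illustration:

```python
import math
import random

def mh_sample(log_target, x0, draws, scale=1.0, seed=1):
    """Basic MH sampler with a symmetric normal proposal q(Y, X)."""
    rng = random.Random(seed)
    x, lx = x0, log_target(x0)
    chain = []
    for _ in range(draws):
        y = x + rng.gauss(0.0, scale)      # step 3: candidate draw
        ly = log_target(y)
        # step 4: alpha = min{pi(Y)/pi(X), 1}; the constant K cancels
        if ly >= lx or rng.random() < math.exp(ly - lx):
            x, lx = y, ly                  # step 5: accept the candidate ...
        chain.append(x)                    # ... or keep X_{t+1} = X_t
    return chain

# Target: standard normal with the normalizing constant omitted
chain = mh_sample(lambda x: -0.5 * x * x, x0=0.0, draws=20000)
mean = sum(chain) / len(chain)
```

Because only the ratio of target values enters step 4, the sampler never needs the constant 1/sqrt(2π); the draws nonetheless settle into the standard normal distribution.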

MCMC algorithms are simple and flexible, and they are therefore applicable to a wide variety of problems. However, they can be challenging to implement, mainly because it can be hard to find an appropriate proposal distribution, q(Y, X). If q(Y, X) is chosen poorly, coverage of the target distribution, π(X), may be poor. This is where adaptive MCMC methods are used because they help tune the proposal distribution. As an adaptive MCMC algorithm proceeds, information about acceptance rates of previous draws is collected and embodied in some set of tuning parameters θ. Slow convergence or nonconvergence of an algorithm like that in table 1 is often caused by acceptance of too few or too many candidate draws: if the algorithm accepts too few candidate draws, candidates are too far away from regions of the support of the distribution where π(X) is large; if too many candidates are accepted, candidates occupy an area of the support of the distribution clustered closely around a large value of π(X). Accordingly, if the acceptance rate is too low, the tuning mechanism contracts the search range; if the acceptance rate is too high, it expands the search range. As a practical matter, one augments the proposal distribution with the tuning parameters so that the proposal distribution is something like q(Y, X) = q(Y, X, θ). A description of such an algorithm appears in table 2.


The algorithm in table 2 also relies on a simplification of the basic MCMC algorithm presented in table 1, which results when a symmetric proposal distribution is used so that q(Y, X, θ) = q(X, Y, θ). With a symmetric proposal distribution (the multivariate normal distribution being a prominent example), the proposal distribution drops out of the calculation of the acceptance probability in step 4 of the algorithm; this results in the simplified acceptance probability α(Y, Xt) = min[{π(Y)}/{π(Xt)}, 1]. All the Mata routines discussed in this article use a multivariate normal density for a proposal distribution.

Table 2. An MH algorithm with an adaptive proposal distribution.

1: Initialize start value X = X0, draws T, and tuning parameters θ0.
2: Set t = 0 and repeat steps 3-7 while t ≤ T:
3: Draw a candidate Yt from q(Yt, Xt, θt).
4: Compute α(Yt, Xt) = min[{π(Yt)}/{π(Xt)}, 1].
5: Set Xt+1 = Yt with prob. α(Yt, Xt), Xt+1 = Xt otherwise.
6: Update θt+1 = f(θt, X0, X1, X2, . . . , Xt).
7: Increment t.

Output: The sequence (Xt), t = 1, . . . , T.

Some care must be taken in tuning an algorithm like that in table 2. Tuning the proposal distribution results in loss of π as an invariant distribution of the process (Xt) (Andrieu and Thoms 2008, 345) if it is not done carefully. Tuning the proposal distribution alters the long-run behavior of the algorithm so that it no longer produces the sought-after draws from the target distribution, π(X). A solution to this problem is to tune the proposal distribution for some burn-in period and then stop tuning so that the proposal distribution is stationary. Another solution is to set up the algorithm so that tuning eventually recedes from the algorithm. The latter approach is referred to as vanishing or diminishing adaptation (Andrieu and Thoms 2008; Rosenthal 2011). With vanishing adaptation, if the algorithm runs for a sufficient number of iterations, the proposal distribution stabilizes while also (hopefully) being tuned to provide good coverage of the target distribution. The Mata functions presented in this article are built to work with vanishing adaptation, but they can also be set up so that no adaptation of the proposal distribution occurs.

Before discussing implementation of vanishing adaptation, I must discuss how frequently candidate draws should be accepted by an MCMC algorithm. Ideally, the acceptance rate should be such that good coverage of the target distribution is achieved with the smallest possible number of draws. Rosenthal (2011) provides a discussion of optimal acceptance rates in adaptive MCMC algorithms and a summary of the main ideas and results. At the risk of oversimplifying, I provide some guidelines. For univariate distributions, the optimal acceptance rate is about 0.44, and as the dimension of π(X) increases to infinity, the optimal acceptance rate converges to 0.234. Rosenthal (2011) points out that moderate departure from these rates is unlikely to greatly damage algorithm performance and that often for distributions with even relatively small dimension (that is, d ≥ 5), the optimal acceptance rate is close to the asymptotic bound of 0.234. In table 3, I describe an algorithm that is tuned toward a targeted acceptance rate α* (presumably in or close to the range [0.234, 0.44]).

Table 3. An adaptive MH algorithm with a multivariate normal proposal distribution and a specific tuning mechanism.

1: Set starting values X0, μ0, Σ0, λ0, α*, δ (δ > 0), and draws T.
2: Set t = 0 and repeat steps 3-10 while t ≤ T:
3: Draw a candidate Yt ~ MVN(Xt, λtΣt).
4: Compute α(Yt, Xt) = min[{π(Yt)}/{π(Xt)}, 1].
5: Set Xt+1 = Yt with prob. α(Yt, Xt), Xt+1 = Xt otherwise.
6: Compute weighting parameter γt = 1/(1 + t)^δ.
7: Update λt+1 = exp{γt(α(Yt, Xt) − α*)}λt.
8: Update μt+1 = μt + γt(Xt+1 − μt).
9: Update Σt+1 = Σt + γt{(Xt+1 − μt)′(Xt+1 − μt) − Σt}.
10: Increment t.

Output: The sequence (Xt), t = 1, . . . , T.

Table 3 shows how vanishing adaptation can be implemented and how the Mata functions presented in section 3 actually operate. In step 1, the algorithm starts with the initial value X0; an initial variance-covariance matrix for proposals, Σ0; an initial value of a scaling parameter, λ0; and a targeted acceptance rate, α*. The algorithm also requires a value for what can be considered an averaging or damping parameter, δ, which controls how quickly the impact of the tuning mechanism decays through the parameter γt = 1/(1 + t)^δ, calculated in step 6. For large values of δ, adaptation ceases quickly as γt rapidly approaches zero; for values of δ close to zero, adaptation occurs more slowly, and the algorithm uses more information about past draws in tuning proposals. The Mata routines presented below allow the user to specify such a parameter when implementing the algorithm.2 In steps 8 and 9, the algorithm updates the mean and covariance matrix of the proposal distribution according to the weighting parameter γt, and because γt eventually decays to zero, updating ceases, and λt+1 = λt, μt+1 = μt, and Σt+1 = Σt.

2. One might prefer this value to be as close to its upper bound as possible to reduce the impact of tuning quickly; the tradeoff is that the proposal distribution may not be as well adapted.
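For illustration only, the table 3 recipe can be sketched in Python for a univariate target (this is not the article's Mata code; the standard normal target, δ = 0.7, and α* = 0.44 are assumed values, and the loop index starts at 1 so that γt < 1 keeps the variance update strictly positive):

```python
import math
import random

def adaptive_mh(log_target, x0, draws, delta=0.7, a_opt=0.44, seed=1):
    """Univariate adaptive MH with vanishing adaptation (cf. table 3)."""
    rng = random.Random(seed)
    x, lx = x0, log_target(x0)
    lam, mu, sigma2 = 2.38 ** 2, x0, 1.0       # lambda_0 = 2.38^2/d with d = 1
    chain = []
    for t in range(1, draws + 1):
        y = x + rng.gauss(0.0, math.sqrt(lam * sigma2))   # step 3
        ly = log_target(y)
        alpha = math.exp(min(0.0, ly - lx))               # step 4
        if rng.random() < alpha:                          # step 5
            x, lx = y, ly
        gamma = 1.0 / (1.0 + t) ** delta                  # step 6
        lam *= math.exp(gamma * (alpha - a_opt))          # step 7
        # steps 8 and 9: both updates use the old mu_t on the right-hand side
        mu, sigma2 = (mu + gamma * (x - mu),
                      sigma2 + gamma * ((x - mu) ** 2 - sigma2))
        chain.append(x)
    return chain

chain = adaptive_mh(lambda x: -0.5 * x * x, x0=5.0, draws=20000)
post = chain[5000:]                      # discard a burn-in period
mean = sum(post) / len(post)
```

Even started far from the mode at x0 = 5, the scaling parameter and proposal variance adapt toward the targeted acceptance rate while γt shrinks, and the retained draws behave like standard normal samples.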

If a researcher wished to write his or her own adaptive MCMC routine, the specification of the weighting scheme embodied in γt and δ in table 3 could be extended. Andrieu and Thoms (2008) describe some other possibilities for adaptation, including stochastic schemes or weighting functions that adapt as the algorithm continues. As described by Andrieu and Thoms (2008, 356), virtually anything goes with the tuning process, provided that the sequence γt satisfies the following properties:

Σt γt = ∞,  Σt γt^(1+ε) < ∞;  ε > 0

These conditions are satisfied by the weighting parameter used in the adaptive algorithm in table 3 so long as δ ∈ (0, 1): the reason is that under these circumstances, Σt γt diverges, but a sufficiently large value of ε that forces the series {1/(1 + t)^δ}^(1+ε) to converge can always be found.
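A quick numerical illustration of this point (mine, not the article's): take δ = 0.7 and ε = 0.5, so that δ(1 + ε) = 1.05 > 1. The partial sums of γt keep growing, while the partial sums of γt^(1+ε) barely move once t is large:

```python
# gamma_t = 1/(1+t)^delta; compare partial sums for the two conditions
delta, eps = 0.7, 0.5            # delta * (1 + eps) = 1.05 > 1

def partial_sum(exponent, n):
    """Sum of 1/(1+t)^exponent for t = 0, ..., n-1."""
    return sum(1.0 / (1.0 + t) ** exponent for t in range(n))

# Growth between 10,000 and 100,000 terms
grow_div = partial_sum(delta, 100000) - partial_sum(delta, 10000)
grow_conv = partial_sum(delta * (1 + eps), 100000) - partial_sum(delta * (1 + eps), 10000)
# grow_div is large (the first sum diverges); grow_conv is small (the second converges)
```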

A last detail to address is how to initialize the value of the scaling parameter at the start of the algorithm. According to Andrieu and Thoms (2008, 359), theory suggests that a good place to start with the scaling parameter is λ0 = 2.38^2/d, where d is the dimension of the target distribution. The Mata routines presented below all use this value as a starting point, with one exception.

There are many variations on the basic theme of the algorithm presented in table 3. One possibility is one-at-a-time, sequential sampling of values from the distribution, which produces a Metropolis-within-Gibbs type sampler. Another possibility is to work halfway between the global sampling algorithm of table 3 and the sequential sampling, creating what might be labeled a block adaptive MCMC sampler.3 In my experience, Metropolis-within-Gibbs samplers or block samplers are often useful in situations in which variables are scaled very differently or in situations where the researcher might not have good intuition about starting values.

Related to determining how to execute the algorithm is the issue of how to choose

T , the length of the run. One would like to choose T large enough so that the conver-

gence criteria mentioned above are satised and enough draws are produced for reliable

statistical inference. How does one know that the algorithm has achieved these goals?

This is a surprisingly complex question that really does not have a good answer. While

one can often detect problems with the algorithm, there is no way to guarantee that

the algorithm has converged. Gelman and Shirley (2011) describe dierent techniques

for assessing performance and convergence of the run, but they also emphasize the

complementary roles of visual inspection of results, understanding the application, and

understanding the subject matter. These issues are discussed at greater length in the

conclusion.

3. I follow the convention of referring to a sequential sampler as a Metropolis-within-Gibbs sampler, even though many find this terminology misleading; see Geyer (2011, 28–29). What I call a block sampler, some might call a block-Gibbs sampler.


3 Implementation in Mata

3.1 A Mata function

Syntax

The first Mata implementation of the algorithms described in section 2 is through the Mata function amcmc(),4 which uses different types of adaptive MCMC samplers based upon user-provided information. In addition to describing details of sampling (specification of draws, weighting parameters, and acceptance rates), the user can specify whether sampling is to proceed all at once (globally), in blocks, or sequentially. The user can also set up amcmc() to work with a stand-alone distribution or with an objective function previously set up to work with moptimize() or optimize(). The syntax is as follows:

real matrix amcmc(string rowvector alginfo,
    pointer (real scalar function) scalar lnf, real rowvector xinit,
    real matrix Vinit, real scalar draws, real scalar burn,
    real scalar delta, real scalar aopt, transmorphic arate,
    transmorphic vals, transmorphic lambda,
    real matrix blocks [, transmorphic M, string scalar noisy])

Description

If the dimension of the target probability distribution (or the parameter vector) is characterized as a 1 × c row vector, amcmc() returns a matrix of draws from the distribution organized in c columns and r = draws − burn rows, so each row of the returned matrix can be considered a draw from the target distribution lnf. Additional information about the draws is collected in three arguments overwritten by amcmc(): arate, vals, and lam, which contain actual acceptance rates, the log value of the target distribution at each draw, and λ, the proposal scaling parameters. If a Metropolis-within-Gibbs sampler or a block sampler is used, lam, as well as arate, is returned as a row vector equal in length to the dimension of the distribution or the number of blocks.

Information about how to draw from the target distribution and how the distribution has been programmed is passed to the command as a sequence of strings in the (string) row vector alginfo. This row vector can contain information about whether sampling is to be sequential (mwg), in blocks (block), or global (global). If the user is interested in applying amcmc() to a model statement constructed with moptimize() or optimize(), information on this and the type of evaluator function used with the model should also be contained in alginfo. Target distribution information can be standalone, moptimize, or optimize. Information on evaluator type can also be of any sort (that is, d0, v0, etc.).5 A final option that can be passed along as part of alginfo is the key fast, which will execute the adaptive MCMC algorithm more quickly but less exactly. I give some examples of what alginfo might look like in the remarks about syntax.

The second argument of amcmc(), lnf, is a pointer to the target distribution, which must be written in log form. xinit and Vinit are conformable initial values for the routine and an initial variance-covariance matrix for the proposal distribution. The scalars draws and burn tell the routine how many draws to make from the distribution and how many of these draws are to be discarded as an initial burn-in period. delta is a real scalar that describes how adaptation is to occur, while aopt is the desired acceptance rate; see section 2.1.

The real matrix blocks contains information on how amcmc() should proceed if the

user wishes to draw from the function in blocks. If the user does not wish to draw in

blocks, the user simply passes a missing value for this argument. If the user provides an

argument here, but does not specify block as part of alginfo, sampling will not occur

in blocks.

If the user is drawing from a function constructed with a prespecified model command written to work with either moptimize() or optimize(), this model statement is

passed to amcmc() via the optional M argument. As described below, this argument can

also have other uses; for example, it can pass up to 10 additional explanatory variables

to amcmc().

The final option is noisy, and if the user specifies noisy="noisy", amcmc() will

produce feedback on drawing as the algorithm executes. A dot is produced every time

the evaluation function lnf is called (not every time a draw is completed, because the

latter is taken by amcmc() to mean a complete run through the routine). Thus, if a

block sampler or a Metropolis-within-Gibbs style sampler is used, a draw is deemed to

have occurred when all the blocks or variables have been drawn once. The value of the

target distribution is reported every 50 evaluations.

Remarks

It is helpful to have a few examples of how information about the draws to be conducted

can be passed to the amcmc() function through the first argument, alginfo. This is

described in table 4.

Table 4. Information passed to amcmc() as part of alginfo

Sampling scheme    mwg, block, global
Model definition   moptimize, optimize, standalone
Evaluator type     d*, q*, e*, g*, v*
Other information  fast

5. The routine will not work with evaluators of the lnf type.


The user can select any item from each of the rows on table 4 and pass it to amcmc()

as part of alginfo. For example, if the user is trying to draw from a function that was

written as a type d2 evaluator to work with moptimize and the user wished to use a

global sampler, he or she might specify

alginfo="moptimize","d2","global"

Order does not matter, so the user could also specify

alginfo="d2","moptimize","global"

If the user had a stand-alone function and wished to do Metropolis-within-Gibbs

style sampling from this function, he or she would specify

alginfo="standalone","mwg"

or even just alginfo="mwg" because if no model statement is submitted, amcmc() will

assume that the function is stand alone. The final option that the user might specify

is the "fast" option, which tacks on the string fast to alginfo. This option is helpful

when the user wishes to sample globally or in blocks but has a problem with large

dimension. Because the global and block samplers use Cholesky decomposition of the

proposal covariance matrix, large problems may be time consuming. The "fast" option

circumvents the potential slowdown by working with just the diagonal elements of the

proposal covariance matrix, so one can avoid Cholesky decomposition. One should,

however, be cautious in using this option and should probably apply it only when the

user can be reasonably certain that distribution variables are independent.6

The row vector xinit contains an initial value for the draws, while Vinit is an initial

variancecovariance matrix that may be a conformable identity matrix. If, however,

Vinit is a row vector, amcmc() will interpret this as the diagonal of a variance matrix

with zero off-diagonal entries.

While the user-specified scalar delta controls how rapidly adaptation vanishes, the

user may also specify delta equal to missing (delta = .). amcmc() will then assume that

the user does not want any adaptation to occur but instead wishes to draw from the

invariant proposal distribution with mean xinit and covariance matrix Vinit. In this

case, the user must supply values of lambda to describe to the algorithm how to scale

draws from the proposal distribution. Constructing the code this way allows users to

run the adaptive algorithm for a while, and once it has converged, it allows users to

switch to an algorithm using an invariant proposal distribution. If a global sampler is

used, only one value of lambda is required; otherwise, lambda must be conformable with

the sampler. So, if the option mwg is used, the dimension of lambda must match the

dimension of the target distribution; if the option block is used, lambda must contain

as many entries as the number of blocks.

Whether one wishes to do Metropolis-within-Gibbs sampling, block sampling, or

global sampling, the routine requires the same set of input information (although the

6. I included this option hoping that users might try it and see for what problems, if any, it does and

does not work well.


overwritten values lam and arate differ slightly) with one exception. When one samples

in block form, amcmc() requires a matrix to be provided in block, in which the number

of rows is equal to the number of sampling groups, and the values to be drawn together

have 1s in the appropriate positions and 0s elsewhere. So, for example, if one wished to

draw from a five-dimensional distribution and wished to draw values for the first three arguments together, and then arguments four and five together, one would set up a matrix B as follows:

B = [ 1 1 1 0 0
      0 0 0 1 1 ]

Alternatively, sampling each parameter in its own block corresponds to specifying B as an identity matrix:

B = [ 1 0 0 0 0
      0 1 0 0 0
      0 0 1 0 0
      0 0 0 1 0
      0 0 0 0 1 ]

One might suspect that this would result in the same sort of algorithm obtained by

specifying alginfo="mwg", but this is not the case. After each draw, the block algorithm

updates the entire mean proposal vector and covariance matrix, so information on each

draw is used to prepare for the next.7 While not the intended use of the block-sampling

algorithm, if one leaves a column of all 0s in the matrix B, the corresponding value of

the parameter will never be drawn. This is a quick, albeit not particularly efficient, way

of constraining parameters at particular values during the drawing process.

The argument M of amcmc() can contain a previously assembled model statement, or

it can be used to pass additional arguments of a function to the routine.8 For example,

if the user has written a function to be sampled from that has three arguments, such

as lnf(x,Y,Z), the user would specify the standalone option in the variable alginfo,

assemble the additional arguments into a pointer, and then pass this information to

amcmc(). In this instance, M might be constructed in Mata as follows:

M=J(2,1,NULL)

M[1,1]=&Y

M[2,1]=&Z

M can then be passed to amcmc(), which will use Y and Z (in order) to evaluate

lnf(x,Y,Z). As shown in the examples, this usage of pointers can be handy when

amcmc() is used as part of a larger algorithm: one can continually change Y and Z

without actually having to explicitly declare that Y and Z have changed as the algorithm

executes.

7. Using amcmc() in this way is akin to what Andrieu and Thoms (2008, 360) describe as an adaptive

MCMC algorithm with componentwise adaptive scaling.

8. But not both; we assume that any arguments have already been built into the model statement if

a previously constructed model is used.


3.2 A suite of functions: amcmc_*()

Syntax

Another alternative that has advantages in certain situations, particularly when one wishes to do adaptive MCMC as one step in a larger sampling problem, is to set up an adaptive MCMC sampling problem by using the set of functions amcmc_*(). The user first opens a problem using the amcmc_init() function and then fills in the details of the drawing procedure. The user can use the following functions to set up an adaptive MCMC problem, with the arguments corresponding to those described in section 3.1:

A = amcmc_init()
amcmc_lnf(A, pointer (real scalar function) scalar f)
amcmc_args(A, pointer matrix Z)
amcmc_xinit(A, real rowvector xinit)
amcmc_Vinit(A, real matrix Vinit)
amcmc_aopt(A, real scalar aopt)
amcmc_blocks(A, real matrix blocks)
amcmc_model(A, transmorphic M)
amcmc_noisy(A, string scalar noisy)
amcmc_alginfo(A, string rowvector alginfo)
amcmc_damper(A, real scalar delta)
amcmc_lambda(A, real rowvector lambda)
amcmc_draws(A, real scalar draws)
amcmc_burn(A, real scalar burn)

Once a problem has been specified, a run can be initiated via the function

amcmc_draw(A)

and results of the run can be retrieved via the functions

amcmc_results_*(A)

where * in the above function can be any of the following: vals, arate, passes, totaldraws, acceptances, propmean, propvar, or report. Additionally, users can recover their initial specifications by using * = draws, aopt, alginfo, noisy, blocks, damper, xinit, Vinit, or lambda. An additional function amcmc_results_lastdraw() produces the value of only the last draw. Two other functions that are useful when one is executing an adaptive MCMC draw as part of a larger algorithm are

amcmc_append(A, string scalar append)
amcmc_reeval(A, string scalar reeval)

The function amcmc_append() allows the user to indicate that results should be overwritten by specifying append="overwrite". In this case, the results of only the most recent draws are kept. This can be useful when doing an analysis where nuisance parameters of a model are being drawn, and storing all the previous draws would tax the memory and impact the speed of the algorithm's operation. The function amcmc_reeval() allows the user to indicate whether the target distribution should be reevaluated at the last draw before a proposed value is tried by specifying reeval="reeval". When the draw is part of a larger algorithm, some of the arguments of the target distribution might change as the larger algorithm proceeds. In these cases, the target distribution needs to be reevaluated at the new argument values and the last previous draw to function correctly. If the user sets reeval to anything else, it is assumed that nothing has changed and that the value of the target distribution has not changed between draws.

Remarks

Some of the information accessible with amcmc_results_*() provides hints as to why a user might prefer to use a problem statement to attack an adaptive MCMC problem instead of the Mata function amcmc(). Using a problem statement is particularly useful because one can easily stop, restart, and append a run within Mata's structure environment. In this way, a user can perform adaptive MCMC as part of a larger algorithm; the structure makes it easy to retain information about past adaptation and runs as the algorithm proceeds and also makes it easy to modify arguments of the algorithm. In the model statement syntax, information about the number of times a given problem has been initiated is retrievable via the function amcmc_results_passes(A), while the acceptance history of an entire run is accessible via amcmc_results_acceptances(A).

Given the initialization of an adaptive MCMC problem A, one can run amcmc_draw() sequentially and results will be appended to previous results. Accordingly, the burn period is active only the first time the function is executed. Thereafter, it is assumed that the user wishes to retain all drawn values. As mentioned above, the user can choose whether to retain all the information about previous draws with the function amcmc_append(). When a user specifies append="overwrite" to save the draws of only the last run, the routine still includes all information about adaptation contained in the entire drawing history.

When a user initializes an adaptive MCMC problem via amcmc_init(), some defaults

are set unless overwritten by the user. The number of draws is set to 1, the burn period

is set to 0, the target distribution is assumed to be stand alone, the acceptance rate is

set to 0.234, and results are appended to previous results if multiple passes are made.

It is also assumed that the function does not need to be reevaluated at the last value

before drawing a new proposal.


Further description can be found in the help files, accessible by typing help mata amcmc() or help mf_amcmc at Stata's command prompt.

4 Examples

4.1 Parameter estimation

For my first example, I apply adaptive MCMC to a simple estimation problem. Suppose that I have already programmed a likelihood function to use with moptimize() in Mata, but I wish to try another means of estimating parameters, perhaps because I have found that maximization of the likelihood function is taking too long or presents other difficulties or because I am worried about small-sample properties of the estimators.

I decide to try to t the model by drawing directly from the conditional distribution

of parameters. The ideas derive from Bayess rule and the usual principles of Bayesian

estimation, but they can be applied to virtually any maximum likelihood problem.9 Via

Bayess rule, the distribution of parameters conditional on the data can be written as

$$p(\theta \mid X) = \frac{p(X \mid \theta)\,p(\theta)}{p(X)} = \frac{p(X \mid \theta)\,p(\theta)}{\int p(X \mid \theta)\,p(\theta)\,d\theta} \qquad (1)$$

If one has no prior information about parameter values, one can take p(θ), the prior distribution of the parameters, to be (improper) uniform over the support of the parameters. As this renders p(θ) constant, one then obtains the posterior parameter distribution as

$$p(\theta \mid X) \propto p(X \mid \theta) \qquad (2)$$

So, according to (2), one might interpret a likelihood function as the distribution of

parameters conditional on data up to a constant of proportionality. The conditional

mean of parameter values is then

$$E(\theta \mid X) = \int \theta\, p(\theta \mid X)\, d\theta \qquad (3)$$

One can estimate E(θ|X) by simulating the right-hand side of (3) via S draws from the conditional distribution p(θ|X),

$$\widehat{E}(\theta \mid X) = \frac{1}{S}\sum_{s=1}^{S} \theta^{(s)}$$

These simulations can also be used to characterize higher-order moments of the parameter distribution. I shall follow the nomenclature adopted by Chernozhukov and Hong (2003) and refer to the estimators obtained as Laplace-type estimators (LTEs) or quasi-Bayesian estimators (QBEs).
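To make the LTE idea concrete, here is a minimal Python sketch (not Mata, and not part of the amcmc package; all names are mine): a random-walk Metropolis sampler targets a log posterior known only up to a constant of proportionality, and the LTE of the mean is the average of the retained draws. The toy target log_post is a standard normal, so the estimate should be near 0.

```python
import math
import random

def log_post(theta):
    # Toy log posterior: a standard normal, so E(theta | X) = 0.
    return -0.5 * theta * theta

def lte_mean(log_post_fn, theta0=0.0, draws=20000, burn=2000, scale=1.0, seed=1):
    """Estimate E(theta | X) by averaging random-walk Metropolis draws
    from a target known only up to a constant of proportionality."""
    rng = random.Random(seed)
    theta = theta0
    lp = log_post_fn(theta)
    kept = []
    for s in range(burn + draws):
        prop = theta + rng.gauss(0.0, scale)
        lp_prop = log_post_fn(prop)
        # Metropolis step: accept with probability min(1, exp(lp_prop - lp))
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        if s >= burn:               # discard the burn-in period
            kept.append(theta)
    return sum(kept) / len(kept)
```

The same average over the retained draws gives the estimator used throughout this section; higher-order moments are computed from the same retained sample.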

Returning to the example, I will posit a simple linear model with log-likelihood

function,

$$\ln L \propto -\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} - \frac{n}{2}\ln\sigma^2$$

9. They can also be applied to a wider variety of problems; see Chernozhukov and Hong (2003).

636 Adaptive MCMC in Mata

For comparison, in the following code, I take this simple model and fit it to some data by using a type d0 evaluator and Mata's moptimize() function. One subtlety of the code is that the variance is coded in exponentiated form. This is done so that when amcmc() is applied to the problem, the objective function is consistent with the multivariate normal proposal distribution, which requires that parameters have support (−∞, ∞).10 The following code develops the model statement and fits the model via maximum likelihood:

. sysuse auto

(1978 Automobile Data)

. mata:

mata (type end to exit)

: function lregeval(M,todo,b,crit,s,H)

> {

> real colvector p1, p2

> real colvector y1

> p1=moptimize_util_xb(M,b,1)

> p2=moptimize_util_xb(M,b,2)

> y1=moptimize_util_depvar(M,1)

> crit=-(y1:-p1)'(y1:-p1)/(2*exp(p2))-
> rows(y1)/2*p2

> }

note: argument todo unused

note: argument s unused

note: argument H unused

: M=moptimize_init()

: moptimize_init_evaluator(M,&lregeval())

: moptimize_init_evaluatortype(M,"d0")

: moptimize_init_depvar(M,1,"mpg")

: moptimize_init_eq_indepvars(M,1,"price weight displacement")

: moptimize_init_eq_indepvars(M,2,"")

: moptimize(M)

initial: f(p) = -18004

alternative: f(p) = -10466.142

rescale: f(p) = -298.60453

rescale eq: f(p) = -189.39334

Iteration 0: f(p) = -189.39334 (not concave)

Iteration 1: f(p) = -172.06827 (not concave)

Iteration 2: f(p) = -162.08563 (not concave)

Iteration 3: f(p) = -156.61996 (not concave)

Iteration 4: f(p) = -143.55991

Iteration 5: f(p) = -129.10949

Iteration 6: f(p) = -127.05705

Iteration 7: f(p) = -127.05447

Iteration 8: f(p) = -127.05447

10. A less efficient way to deal with parameters with restricted supports is to program the distribution so that it returns a missing value whenever a draw lands outside the appropriate range.

: moptimize_result_display(M)

Number of obs = 74

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
       price |  -.0000966   .0001591    -0.61   0.544    -.0004085    .0002153
      weight |  -.0063909   .0011759    -5.43   0.000    -.0086956   -.0040862
displacement |   .0054824   .0096492     0.57   0.570    -.0134296    .0243945
       _cons |   40.10848   1.974222    20.32   0.000     36.23907    43.97788
eq2          |
       _cons |   2.433905    .164399    14.80   0.000     2.111688    2.756121

: end

I now estimate the model parameters via simulation by treating the likelihood function as the parameters' conditional distribution. I start with a Metropolis-within-Gibbs sequential sampler to obtain 10,000 draws for each parameter value, discarding the first 50 draws as a burn-in period. I start with this sampler because it is usually a relatively safe choice when there is little information on starting points, which I am pretending are unavailable. I set the initial values used by the sampler to 0 and use an identity matrix as the initial covariance matrix for proposals. I choose a value of delta = 2/3, which allows a fairly conservative amount of adaptation to occur, and a desired acceptance rate of 0.4.11

. set seed 8675309

. mata:

mata (type end to exit)

: alginfo="moptimize","d0","mwg"

: b_mwg=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),10000,50,2/3,.4,

> arate=.,vals=.,lambda=.,.,M)

: st_matrix("b_mwg",mean(b_mwg))

: st_matrix("V_mwg",variance(b_mwg))

: end

. matrix colnames b_mwg=`names'

. matrix colnames V_mwg=`names'

. matrix rownames V_mwg=`names'

. ereturn post b_mwg V_mwg

11. Regarding what might seem a relatively short burn-in period, I set this period to be short enough

to show the convergence behavior of the algorithm.

. ereturn display

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
       price |  -.0001322   .0001714    -0.77   0.440    -.0004681    .0002036
      weight |  -.0057418   .0018016    -3.19   0.001     -.009273   -.0022107
displacement |     .00218   .0125846     0.17   0.862    -.0224854    .0268454
       _cons |   39.00328   3.095009    12.60   0.000     32.93717    45.06939
eq2          |
       _cons |   2.518081   .2071915    12.15   0.000     2.111993    2.924169

Although the algorithm was not allowed a very long burn-in time, the simulation-based parameter estimates are close to those obtained by maximum likelihood.12 How frequently were draws of each parameter accepted, and how close to the maximum value of the function is the algorithm operating? This information is returned in the overwritten arguments arate and vals.

. mata:

mata (type end to exit)

: arate

1

1 .3806030151

2 .3807035176

3 .3870351759

4 .4020100503

5 .3951758794

: max(vals),mean(vals)

1 2

1 -127.1097198 -130.2193494

: end

The sampler finds and operates close to the maximum value of the log likelihood (which was −127.05), and the acceptance rates of the draws are very close to the desired acceptance rate of 0.4. To understand what the distribution of the parameters looks like, I pass the information about the parameter draws to Stata and form visual pictures of the results. The code below accomplishes this and creates two panels of graphs: one that shows the distribution of the parameters (figure 1) and one that shows how the parameter draws and the value of the function evolved as the algorithm proceeded (figure 2).

12. One possible issue here is whether it is appropriate to summarize the results in usual Stata format

like this. One can assume that this is acceptable here because the parameters are collectively

normally distributed. Whether this is true in more general problems requires careful thought.

. preserve

. clear

. local varnames price weight displacement constant std_dev

. getmata (`varnames')=b_mwg

. getmata vals=vals

. generate t=_n

. local graphs

. local tgraphs

. foreach var of local varnames {

2. quietly {

3. histogram `var', saving(`var', replace) nodraw

4. twoway line `var' t, saving(t`var', replace) nodraw

5. }

6. local graphs "`graphs' `var'.gph"

7. local tgraphs "`tgraphs' t`var'.gph"

8. }

. histogram vals, saving(vals,replace) nodraw

(bin=39, start=-183.40158, width=1.4433811)

(file vals.gph saved)

. twoway line vals t, saving(vals_t,replace) nodraw

(file vals_t.gph saved)

. graph combine `graphs' vals.gph

. graph export vals_mwg.eps, replace

(file vals_mwg.eps written in EPS format)

. graph combine `tgraphs' vals_t.gph

. graph export valst_mwg.eps, replace

(file valst_mwg.eps written in EPS format)

. restore

Figure 1 is composed of histograms for each parameter, with the last panel being the histogram of the log likelihood. The parameters appear approximately normally distributed (with a few blips), excepting the first few draws, and they are also centered around the parameter values obtained via maximum likelihood.

[Figure 1 appears here: histograms of the draws for price, weight, displacement, constant, std_dev, and the log-likelihood values (vals).]

Figure 2 shows how the drawn values for parameters and the value of the objective

function evolved as the algorithm proceeded.

[Figure 2 appears here: trace plots of price, weight, displacement, constant, std_dev, and vals against the draw number t.]

From figure 2, one can see that after a few iterations, the algorithm settles down to drawing from an appropriate range. The draws are also autocorrelated, and this autocorrelation is a general property of any MCMC algorithm, adaptive or not. Thus, when one applies MCMC algorithms in practice, it is sometimes beneficial to thin out the draws by keeping, say, only every 5th or 10th draw, or to jumble the draws.
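Thinning and jumbling are simple post-processing steps on the retained draws; a small illustrative Python sketch (the function names are mine, not part of the package):

```python
import random

def thin(draws, k=10):
    """Keep every kth draw (draws 0, k, 2k, ...) to reduce the
    autocorrelation of the retained sample."""
    return draws[::k]

def jumble(draws, seed=0):
    """Return the draws in random order; reordering breaks up serial
    correlation without changing the empirical distribution."""
    out = list(draws)
    random.Random(seed).shuffle(out)
    return out
```

Either operation leaves the marginal distribution of the sample intact; thinning trades away draws for lower dependence, while jumbling keeps all draws.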

To illustrate the use of a global sampler and some of the problems one might encounter in an MCMC-based analysis, I now apply a global sampler to the problem so that all parameter values are drawn simultaneously. The following code shows the results of a run of 12,000 draws with a burn-in period of 2,000:

. mata:

mata (type end to exit)

: alginfo="global","d0","moptimize"

: b_glo=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),12000,2000,2/3,.4,

> arate=.,vals=.,lambda=.,.,M)

: st_matrix("b_glo",mean(b_glo))

: st_matrix("V_glo",variance(b_glo))

: end

. matrix colnames b_glo=`names'

. matrix colnames V_glo=`names'

. matrix rownames V_glo=`names'

. ereturn post b_glo V_glo

. ereturn display

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
       price |  -.0004614   .0019104    -0.24   0.809    -.0042057    .0032829
      weight |    .013056   .0232029     0.56   0.574    -.0324209    .0585328
displacement |  -.1798405   .3163187    -0.57   0.570    -.7998138    .4401328
       _cons |   15.16227   20.84814     0.73   0.467    -25.69933    56.02387
eq2          |
       _cons |   4.017751   1.880026     2.14   0.033     .3329679    7.702533

One can see from these results that the algorithm has not quickly found an appropriate range of values for the parameters. Figures 3 and 4 indicate why: the algorithm spends considerable time stuck away from the maximal function value.

[Figure 3 appears here: histograms of price, weight, displacement, constant, std_dev, and vals from the global run.]

Figure 3. Distribution of parameters after a global MCMC run that is slow to converge

[Figure 4 appears here: trace plots of price, weight, displacement, constant, std_dev, and vals against the draw number t for the global run.]

The problem observed in figures 3 and 4 is that the algorithm was not allowed to burn in long enough for the global MCMC algorithm to work correctly. While the parameter values eventually settled down closer to their true values, it took the algorithm upward of 6,000 draws to find the right range. In fact, it looks as though the algorithm settled into a stable range for draws 2,000–6,000 or so but then once again experienced a jump to the correct stable range, a phenomenon known as pseudoconvergence (Geyer 2011). This behavior is also responsible for the multimodal appearance of the histograms in figure 3.

While my intent is to illustrate how the Mata function amcmc() works, my example also illustrates what can happen when one fails to specify appropriate adjustment parameters and does not allow an adaptive MCMC algorithm to run long enough in a given estimation problem. One may unknowingly get bad results, as would be the case if the global algorithm had been allowed to run for only 5,000 iterations. This sometimes happens if poor starting values are mixed with parameters that have very different magnitudes, for example, the constant in the initial model relative to the other parameters. From inspecting figure 4, one can see that the constant did not find its correct range until just after 6,000 draws, and this is likely what caused the problem.

This discussion motivates using amcmc() in steps, where a slower but relatively robust sampler (a Metropolis-within-Gibbs sampler, in this case) is used to orient the parameters close to their correct range before a global sampler is used, as shown in the following code:

. mata:

mata (type end to exit)

: alginfo="mwg","d0","moptimize"

: b_start=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),5*1000,5*100,2/3,.4,

> arate=.,vals=.,lambda=.,.,M)

: alginfo="global","d0","moptimize"

: b_glo2=amcmc(alginfo,&lregeval(),mean(b_start),

> variance(b_start),11000,1000,2/3,.4,

> arate=.,vals=.,lambda=.,.,M)

: st_matrix("b_glo2",mean(b_glo2))

: st_matrix("V_glo2",variance(b_glo2))

: end

. matrix colnames b_glo2=`names'

. matrix colnames V_glo2=`names'

. matrix rownames V_glo2=`names'

. ereturn post b_glo2 V_glo2

. ereturn display

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
       price |  -.0001059   .0001584    -0.67   0.504    -.0004164    .0002046
      weight |  -.0063727   .0012014    -5.30   0.000    -.0087275   -.0040179
displacement |   .0056462   .0099215     0.57   0.569    -.0137997     .025092
       _cons |   40.10216   1.912111    20.97   0.000     36.35449    43.84982
eq2          |
       _cons |   2.480892   .1665249    14.90   0.000      2.15451    2.807275

Thus one can draw parameters that are scaled differently either alone or in blocks until the algorithm finds its footing, and then proceed with a global algorithm. I have motivated the use of a global drawing method because of its clear speed advantages, but another, more subtle reason to use it, which might not be obvious when visually inspecting the graphs, is that global draws often exhibit less serial correlation across draws.13 The conclusion provides sources with additional tips for setting up, analyzing, and presenting the results of an MCMC run.
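Serial correlation across draws can be checked directly by computing the lag-1 sample autocorrelation of each parameter's chain. A small illustrative Python sketch (the function name is mine, not part of the package):

```python
def lag1_autocorr(draws):
    """Sample lag-1 autocorrelation of a chain of MCMC draws:
    sum of cross-products of adjacent deviations over the sum of
    squared deviations about the chain mean."""
    n = len(draws)
    mean = sum(draws) / n
    num = sum((draws[t] - mean) * (draws[t + 1] - mean) for t in range(n - 1))
    den = sum((d - mean) ** 2 for d in draws)
    return num / den
```

Comparing this statistic for the Metropolis-within-Gibbs and global chains of the same parameter makes the difference in mixing concrete; values near 1 indicate a slowly mixing chain.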

Yet another alternative is to once again begin with a Metropolis-within-Gibbs sampler to characterize the distribution of the parameters and, once this is done sufficiently well, to run the algorithm without adaptation so that one is using an invariant proposal distribution and a regular MCMC algorithm. After an initial run with the "mwg" option, I submit the mean and variance of the results to the global sampler with no adaptation parameter, passing a value of missing (.) for delta. Because I am not passing any information to amcmc() on how to do adaptation in this case, I am required to submit a value for lambda, so I choose λ = 2.38²/n.14 Finally, I also submit a missing value for aopt. Because no adaptation occurs, aopt is not used by the algorithm.

. mata:

mata (type end to exit)

: alginfo="mwg","d0","moptimize"

: b_start=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),5*1000,5*100,2/3,.4,

> arate=.,vals=.,lambda=.,.,M)

: alginfo="global","d0","moptimize"

: b_glo3=amcmc(alginfo,&lregeval(),mean(b_start),

> variance(b_start),10000,0,.,.,

> arate=.,vals=.,(2.38^2/5),.,M)

: arate

.2253

: mean(b_glo3)

1

1 -.0000916295

2 -.0064095109

3 .0054916501

4 40.14276799

5 2.497166774

: end

Apparently, the proposal distribution was successfully tuned in the initial run with the

Metropolis-within-Gibbs sampler. The mean values of the parameters obtained from

the global draw are close to their maximum-likelihood values, and the acceptance rate

is in the healthy range.

14. Note that I did not retain and submit the values of lambda from the initial run; this is because the global sampler requires a scalar value for lambda, while the Metropolis-within-Gibbs run returns a vector of values overwritten in lambda.

. mata:

mata (type end to exit)

: A=amcmc_init()

: amcmc_alginfo(A,("global","d0","moptimize"))

: amcmc_lnf(A,&lregeval())

: amcmc_xinit(A,J(1,5,0))

: amcmc_Vinit(A,I(5))

: amcmc_model(A,M)

: amcmc_draws(A,4000)

: amcmc_damper(A,2/3)

: amcmc_draw(A)

: end

I can now access results using the previously described amcmc_results_*(A) set of functions.

While the previous example demonstrated the basic principles and showed how one might apply adaptive MCMC in problems of parameter estimation, it did not show how the methods might work when the usual maximization-based techniques fail. Chernozhukov and Hong (2003) use as an example censored quantile regression, originally developed in Powell (1984) and extended in Powell (1986), which, as Chernozhukov and Hong (2003, 296) note, provides a way to do valid inference in Tobin-Amemiya models without distributional assumptions and with heteroskedasticity of unknown form. Unfortunately, the model is hard to handle with the usual methods. The objective function is

$$L_n(\beta) = \sum_{i=1}^{n} \rho_\tau\{Y_i - \max(c_i, X_i'\beta)\} \qquad (4)$$

where c_i in (4) denotes a (left) censoring point that might be specific to the ith observation, ρ_τ(u) = {τ − 1(u < 0)}u is the check function, and τ ∈ (0, 1) is the quantile of interest. Estimation using derivative-based maximization methods is problematic because the objective function (4) has flat regions and discontinuities. While one might do well with a nonderivative-based optimization method such as Nelder-Mead, one is then confronted with the problem of characterizing the parameters' distribution and getting standard errors. For these reasons, one might opt for an LTE or a QBE.
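For clarity, here is a Python sketch of the objective in (4) (the helper names are mine; note that because amcmc() treats larger function values as better, the Mata evaluator used in this article returns the negative of this sum):

```python
def rho(u, tau):
    """Koenker check function: rho_tau(u) = {tau - 1(u < 0)} * u."""
    return (tau - (1.0 if u < 0 else 0.0)) * u

def cqreg_objective(beta, y, X, c, tau):
    """Powell's censored quantile objective from (4):
    L_n(beta) = sum_i rho_tau(y_i - max(c_i, x_i'beta)),
    where c holds the per-observation left-censoring points."""
    total = 0.0
    for yi, xi, ci in zip(y, X, c):
        xb = sum(b * xk for b, xk in zip(beta, xi))   # linear index x_i'beta
        total += rho(yi - max(ci, xb), tau)           # censored residual loss
    return total
```

The max(c_i, x_i'β) term is what introduces the flat regions: whenever x_i'β falls below the censoring point, small changes in β leave the objective unchanged.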

To apply amcmc() to the problem, I first program the objective function as follows:15

. mata:

mata (type end to exit)

: void cqregeval(M,todo,b,crit,g,H) {

> real colvector u,Xb,y,C

> real scalar tau

>

> Xb =moptimize_util_xb(M,b,1)

> y =moptimize_util_depvar(M,1)

> tau =moptimize_util_userinfo(M,1)

> C =moptimize_util_userinfo(M,2)

> u =(y:-rowmax((C,Xb)))

> crit =-colsum(u:*(tau:-(u:<0)))

> }

note: argument todo unused

note: argument g unused

note: argument H unused

: end

The following code sets up a model statement for use with the function moptimize( ) (see [M-5] moptimize( )). One can follow the Mata code with moptimize(M) to verify that this model, and variations on the basic theme obtained by dropping or adding variables, encounter difficulties.

. gen censorpoint=0

. mata:

mata (type end to exit)

: M=moptimize_init()

: moptimize_init_evaluator(M,&cqregeval())

: moptimize_init_depvar(M,1,"whrs")

: moptimize_init_eq_indepvars(M,1,"kl6 k618 wa")

: tau=.6

: moptimize_init_userinfo(M,1,tau)

: st_view(C=.,.,"censorpoint")

: moptimize_init_userinfo(M,2,C)

: moptimize_init_evaluatortype(M,"d0")

: end

15. One might code the objective function without summing over observations. I sum over observations so that the objective is compatible with Nelder-Mead in Stata, which requires a type d0 evaluator.

Setting up the problem like this allows the use of amcmc(), where I implement the

strategy of using a Metropolis-within-Gibbs-type algorithm followed by a global sampler.

. mata:

mata (type end to exit)

: alginfo="mwg","d0","moptimize"

: b_start=amcmc(alginfo,&cqregeval(),J(1,4,0),I(4),5000,1000,2/3,.4,

> arate=.,vals=.,lambda=.,.,M)

: alginfo="global","d0","moptimize"

: b_end=amcmc(alginfo,&cqregeval(),mean(b_start),

> variance(b_start),20000,10000,1,.234,arate=.,vals=.,lambda=.,.,M)

: end

Because this application might be of more general interest, I developed the command mcmccqreg, which is a wrapper for LTE and QBE estimation of censored quantile regression. The previous code can be executed by the following command:

. quietly mcmccqreg whrs kl6 k618 wa, tau(.6) sampler("mwg") draws(5000)

> burn(1000) dampparm(.667) arate(.4) censorvar(censorpoint)

. matrix binit=e(b)

. matrix V=e(V)

. mcmccqreg whrs kl6 k618 wa, tau(.6) sampler("global") draws(20000)

> burn(10000) arate(.234) saving(lsub_draws) replace

> from(binit) fromv(V)

Powell's mcmc-estimated censored quantile regression

Observations: 250

Mean acceptance rate: 0.359

Total draws: 20000

Burn-in draws: 10000

Draws retained: 10000

whrs         |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        k618 |  -171.3108   23.75806    -7.21   0.000    -217.8814   -124.7402
          wa |  -29.23027   10.74507    -2.72   0.007    -50.29276   -8.167779
       _cons |   2638.497   500.8126     5.27   0.000     1656.804    3620.191

Mean: -89298.96

Min: -89308.58

Max: -89295.52

Draws saved in: lsub_draws

*Results are presented to conform with Stata convention, but are summary statistics of draws, not coefficient estimates.

One can see from the way the command is issued how information about the sampler, the drawing process, and the censoring point (which has a default of 0 for all observations) can be controlled using the mcmccqreg command. The command produces estimates that are summary statistics of the sampling run. mcmccqreg allows one to save results, and the results of the run are saved in the file lsub_draws along with the objective function value after each draw. The user can then easily analyze the draws using Stata's graphing and statistical analysis tools. While the workings of the command derive more or less directly from the description of amcmc(), more information about the command and some additional examples can be found in mcmccqreg's help file.

I now show how to use amcmc() to draw from a distribution. Suppose that I have developed a theory that says three variables are jointly distributed according to a distribution characterized by

$$p(x_1, x_2, x_3) \propto \exp\left\{-x_1^2 - 0.5x_2^2 + x_1 x_2 - 0.05(x_3 - 100)^2\right\}$$

As written, p does not integrate to one and seems hard to invert. While Metropolis-within-Gibbs or global sampling works fine with this example, to illustrate the block sampler, I will draw from the distribution in blocks, where values for the first two arguments are drawn together, followed by a draw of the third. Thus the block matrix to be passed to amcmc() is

$$B = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

The code that programs the function and draws from the distribution is as follows:

. mata:

mata (type end to exit)

: real scalar ln_fun(x)

> {

> return(-x[1]^2-1/2*x[2]^2+x[1]*x[2]-.05*(x[3]-100)^2)

> }

: B=(1,1,0) \ (0,0,1)

: alginfo="standalone","block"

: x_block=amcmc(alginfo,&ln_fun(),J(1,3,0),I(3),4000,200,2/3,.4,

> arate=.,vals=.,lambda=.,B)

: end

The example is set up to draw 4,000 values with a burn-in period of 200. Graphs of the simulation results are shown in figures 5 and 6.

[Figures 5 and 6 appear here: histograms of x_1, x_2, x_3, and vals, and trace plots of each against the draw number t.]

The graphs give a visual sense of the marginal distributions of the variables, while the time-series diagrams verify that the simulation run is getting good coverage and converging rapidly to the target distribution.

A different way to draw from this distribution would be to set up an adaptive MCMC problem via a structured set of Mata functions.

. mata:

mata (type end to exit)

: A=amcmc_init()

: amcmc_lnf(A,&ln_fun())

: amcmc_alginfo(A,("standalone","block"))

: amcmc_draws(A,4000)

: amcmc_burn(A,200)

: amcmc_damper(A,2/3)

: amcmc_xinit(A,J(1,3,0))

: amcmc_Vinit(A,I(3))

: amcmc_blocks(A,B)

: amcmc_draw(A)

: end

In this section, I describe the nuts and bolts of Bayesian estimation of a mixed logit

model; the implementation is available via the command bayesmixedlogit, which I

have written and made available for download. The wrapper function bayesmixedlogit

adds some features but essentially works as described in this section.

While there is no strong reason to prefer using the amcmc routines as a function or as a structure in the previous examples, the power and flexibility of structured objects in Mata are indispensable in this example. My exposition of the basic ideas follows Train (2009) as closely as possible, which also contains a nice overview of the principles. The example assumes that one has access to traindata.dta, which is used by Hole (2007) to illustrate estimation of a mixed logit model by maximum simulated likelihood.16 The help file for amcmc, accessible by typing help mata amcmc() or help mf amcmc at Stata's command prompt, describes an example that relies on data downloadable from the Stata website.

The data concern n = 1, 2, 3, . . . , N people, each of whom makes a selection from among j = 1, 2, 3, . . . , J choices on occasions t = 1, 2, 3, . . . , T. For each choice made, there is a set of covariates x_njt that explain n's choices at t. A person's utility from the jth choice on occasion t is specified as

16. The data are downloadable from Train's website at http://eml.berkeley.edu/train/ and can also be found at http://fmwww.bc.edu/repec/bocode/t/traindata.dta.

$$U_{njt} = \beta_n' x_{njt} + \varepsilon_{njt} \qquad (5)$$

where in (5), ε_njt is an independent and identically distributed extreme-value error, and β_n are individual-specific parameters. Variation in these parameters across the population is captured by assuming the parameters are normally distributed with mean b and covariance matrix W. I denote person n's choice at t as y_nt ∈ J. Then the probability of observing person n's sequence of choices is

$$L(y_n \mid \beta_n) = \prod_t \frac{e^{\beta_n' x_{n y_{nt} t}}}{\sum_{j=1}^{J} e^{\beta_n' x_{njt}}} \qquad (6)$$

Given the distribution of β_n, I can write the above conditional on the distribution of parameters, φ(β|b, W), and integrate over the distribution of parameter values to get

$$L(y_n \mid b, W) = \int L(y_n \mid \beta)\,\phi(\beta \mid b, W)\,d\beta$$

In a Bayesian approach, a prior h(b, W) is assumed, and the joint posterior likelihood

of the parameters is formed using

$$H(b, W \mid Y, X) \propto \prod_n L(y_n \mid b, W)\,h(b, W) \qquad (7)$$

Maximum simulated likelihood is usually used in estimation, as in the package mixlogit developed in Hole (2007).17 An alternative is a Bayesian approach. As described by Train (2009), estimation becomes fairly easy (at least conceptually) if one breaks the problem into a sequence of conditional distributions, taking the view that each set of individual-level coefficients β_n constitutes additional parameters to be estimated. The posterior distribution of the parameters given the data becomes

$$H(b, W, \beta_n,\ n = 1, 2, 3, \ldots, N \mid y, X) \propto \prod_n L(y_n \mid \beta_n)\,\phi(\beta_n \mid b, W)\,h(b, W) \qquad (8)$$

Following the outline given in Train (2009, 301–302), we see that drawing from the posterior proceeds in three steps. First, b is drawn conditional on the β_n and W; then W is drawn conditional on b and the β_n; and finally, the values of β_n are drawn conditional on b and W. The first two steps are straightforward, assuming that the prior distribution of b is normal with extremely large variance and that the prior for W is an inverted Wishart with K degrees of freedom and an identity scale matrix. In this case, the conditional distribution of b is N(β̄, W/N), where β̄ is the mean of the β_n's. The conditional distribution of W is an inverted Wishart with K + N degrees of freedom and scale matrix (KI + N S̄)/(K + N), where S̄ = N⁻¹ Σ_n (β_n − b)(β_n − b)′ is the sample variance of the β_n's about b.
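The first two Gibbs steps can be sketched in Python for a single scalar coefficient (an illustration with hypothetical names, not the article's Mata code; in one dimension, the inverted Wishart reduces to an inverse gamma, so W can be drawn as a scale divided by a chi-squared variate):

```python
import random
import statistics

def draw_b(betas, W, rng):
    """Draw b | beta, W ~ N(mean(beta), W/N) for a scalar coefficient."""
    n = len(betas)
    return rng.gauss(statistics.fmean(betas), (W / n) ** 0.5)

def draw_W(betas, b, rng, K=1):
    """Draw W | b, beta for a scalar coefficient: an inverted Wishart with
    K + N degrees of freedom and scale (K*1 + N*S)/(K + N), where S is the
    sample variance of the betas about b.  In one dimension this is
    W = (K + N*S) / chi2_{K+N}."""
    n = len(betas)
    S = sum((bn - b) ** 2 for bn in betas) / n
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(K + n))  # chi2_{K+N}
    return (K + n * S) / chi2
```

The multivariate case replaces the square roots and divisions with Cholesky factors and matrix inverses, as in the Mata functions developed below.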

The distribution of β_n given choices, data, and (b, W) has no simple form, but from (8), we see that the distribution of a particular person's parameters obeys

17. From the Stata prompt, type search mixlogit.

$$K(\beta_n \mid b, W, y_n, X_n) \propto L(y_n \mid \beta_n)\,\phi(\beta_n \mid b, W) \qquad (9)$$

where the term L(y_n|β_n) in (9) is given by (6). This is a natural place to apply MCMC methods, and it is here that I can use the amcmc_*() suite of functions.

I now return to the example. traindata.dta contains information on the energy contract choices of 100 people, where each person faces up to 12 different choice occasions. Suppliers' contracts are differentiated by price, the type of contract offered, the supplier's location relative to the individual, how well known the supplier is, and the season in which the offer was made.

As a point of comparison, I fit the model in Train (2009, 305) using mixlogit (after download and installation).

. clear all

. set more off

. use http://fmwww.bc.edu/repec/bocode/t/traindata.dta

. set seed 90210

. mixlogit y, rand(price contract local wknown tod seasonal) group(gid) id(pid)

Iteration 0: log likelihood = -1253.1345 (not concave)

Iteration 1: log likelihood = -1163.1407 (not concave)

Iteration 2: log likelihood = -1142.7635

Iteration 3: log likelihood = -1123.6896

Iteration 4: log likelihood = -1122.6326

Iteration 5: log likelihood = -1122.6226

Iteration 6: log likelihood = -1122.6226

Mixed logit model Number of obs = 4780

LR chi2(6) = 467.53

Log likelihood = -1122.6226 Prob > chi2 = 0.0000

           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Mean         |
       price |  -.8908633   .0616638   -14.45   0.000    -1.011722   -.7700045
    contract |    -.22285   .0390333    -5.71   0.000    -.2993539   -.1463462
       local |   1.958347   .1827835    10.71   0.000     1.600098    2.316596
      wknown |   1.560163   .1507413    10.35   0.000     1.264715     1.85561
         tod |  -8.291551   .4995409   -16.60   0.000    -9.270633   -7.312469
    seasonal |  -9.108944   .5581876   -16.32   0.000    -10.20297   -8.014916
SD           |
       price |   .1541266   .0200631     7.68   0.000     .1148036    .1934495
    contract |   .3839507   .0432156     8.88   0.000     .2992497    .4686516
       local |   1.457113   .1572685     9.27   0.000     1.148873    1.765354
      wknown |  -.8979788   .1429141    -6.28   0.000    -1.178085   -.6178722
         tod |   1.313033   .1648894     7.96   0.000     .9898559     1.63621
    seasonal |   1.324614   .1881265     7.04   0.000     .9558927    1.693335

The sign of the estimated standard deviations is irrelevant: interpret them as
being positive

To implement the Bayesian estimator, I proceed in the steps outlined by Train (2009, 301–302). First, I develop a Mata function that produces a single draw from the conditional distribution of b.

. mata:

mata (type end to exit)

: real matrix drawb_betaW(beta,W) {

> return(mean(beta)+rnormal(1,cols(beta),0,1)*cholesky(W)')

> }

: end

Next I use the instructions described in Train (2009, 299) to draw from the conditional

distribution of W. The Mata function is

. mata

mata (type end to exit)

: real matrix drawW_bbeta(beta,b)

> {

> v=rnormal(cols(b)+rows(beta),cols(b),0,1)

> S1=variance(beta)

> S=invsym((cols(b)*I(cols(b))+rows(beta)*S1)/(cols(b)+rows(beta)))

> L=cholesky(S)

> R=(L*v')*(L*v')'/(cols(b)+rows(beta))

> return(invsym(R))

> }

: end

I now have two of the three steps of the drawing scheme in place. The last task is more nuanced and involves using structured amcmc problems in conjunction with the flexible ways in which one can manipulate structures in Mata. The key is to think of drawing each set of individual-level parameters β_n as a separate adaptive MCMC problem. It is helpful to first get all the data into Mata, get familiar with its structure, and then work from there.

. mata:

mata (type end to exit)

: st_view(y=.,.,"y")

: st_view(X=.,.,"price contract local wknown tod seasonal")

: st_view(pid=.,.,"pid")

: st_view(gid=.,.,"gid")

: end

The matrix (really, a column vector) y is a sequence of dummy variables marking the choices of individual n on each choice occasion, while the matrix X collects the explanatory variables for each potential choice. pid and gid are identifiers for individuals and choice occasions, respectively. I now write a Mata function that computes the log probability of a particular vector of parameters for a given person, conditional on that person's information.

. mata:

mata (type end to exit)

: real scalar lnbetan_bW(betaj,b,W,yj,Xj)

> {

> Uj=rowsum(Xj:*betaj)

> Uj=colshape(Uj,4)

> lnpj=rowsum(Uj:*colshape(yj,4)):-

> ln(rowsum(exp(Uj)))

> var=-1/2*(betaj:-b)*invsym(W)*(betaj:-b)'-
> 1/2*ln(det(W))-cols(betaj)/2*ln(2*pi())

> llj=var+sum(lnpj)

> return(llj)

> }

: end

The function takes five arguments, the first of which is a parameter vector for the person (that is, the values to be drawn). The second and third arguments characterize the mean and covariance matrix of the parameters across the population.18 The fourth and fifth arguments contain information about an individual's choices and explanatory variables.

The rst line of code multiplies parameters by explanatory variables to form utility

terms, which are then shaped into a matrix with four columns. Individuals have four

options available on each choice occasion. After reshaping, the utilities from potential

choices on each occasion occupy a row, with separate choice occasions in columns. lnpj

then contains the log probabilities of the choices actually made: the log of utility less

the logged sum of exponentiated utilities. Finally, var computes the log distribution

of parameters about the conditional mean, and llj sums the two components. The

result is the log likelihood of individual n's parameter values, given choices, data, and

the parameters governing the distribution of individual-level parameters.
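In notation (symbols are mine, not the article's: βn is person n's parameter row vector, b and W the population mean and covariance, Untk the utility of option k on occasion t, and yntk the choice dummy), the quantity the function returns can be written as

```latex
\ln f(\beta_n \mid b, W, y_n, X_n)
  = \sum_{t}\Bigl[\sum_{k=1}^{4} y_{ntk}\,U_{ntk}
      - \ln\sum_{k=1}^{4}\exp(U_{ntk})\Bigr]
  - \tfrac{1}{2}(\beta_n - b)\,W^{-1}(\beta_n - b)'
  - \tfrac{1}{2}\ln\det W - \tfrac{K}{2}\ln 2\pi,
\qquad U_{ntk} = x_{ntk}\,\beta_n'.
```

The bracketed term is the multinomial logit log probability accumulated in lnpj; the remaining terms are the multivariate normal log density computed in var.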

I now set up a structured problem for each individual in the dataset. I begin by

setting up a single adaptive MCMC problem and then replicate this problem using J( )

(see [M-5] J( )) to match the number of individual-level parameter sets, the same as

the number of individual-level identifiers in the data (gid), characterized via Mata's

panelsetup( ) (see [M-5] panelsetup( )) function.

18. This function is not as fast as it could be, and it is also specific to the dataset. One way to speed the

algorithm is to compute the Cholesky decomposition of W once before individual-level parameters

are drawn. The wrapper bayesmixedlogit exploits this and a few other improvements.

M. J. Baker 655

. mata

mata (type end to exit)

: m=panelsetup(pid,1)

: Ap=amcmc_init()

: amcmc_damper(Ap,1)

: amcmc_alginfo(Ap,("standalone","global"))

: amcmc_append(Ap,"overwrite")

: amcmc_lnf(Ap,&lnbetan_bW())

: amcmc_draws(Ap,1)

: amcmc_append(Ap,"overwrite")

: amcmc_reeval(Ap,"reeval")

: A=J(rows(m),1,Ap)

: end

I also apply the amcmc option "overwrite", which means that the results from only

the last round of drawing will be saved. Specifying the "reeval" option means that

each individual's likelihood will be reevaluated at the new parameter values and the old

values of coefficients before drawing.

I now duplicate the problem by forming a matrix of adaptive MCMC problems,

one for each individual, and then use a loop to fill in individual-level choices and

explanatory variables as arguments. In the end, the matrix A is a collection of 100

separate adaptive MCMC problems. Before this, some initial values for b and W are set,

and some initial values for individual-level parameters are drawn. I set up the pointer

matrix Args to hold this information along with the individual-level information.

. mata

mata (type end to exit)

: Args=J(rows(m),4,NULL)

: b=J(1,6,0)

: W=I(6)*6

: beta=b:+sqrt(diagonal(W))':*rnormal(rows(m),cols(b),0,1)

: for (i=1;i<=rows(m);i++) {

> Args[i,1]=&b

> Args[i,2]=&W

> Args[i,3]=&panelsubmatrix(y,i,m)

> Args[i,4]=&panelsubmatrix(X,i,m)

> amcmc_args(A[i],Args[i,])

> amcmc_xinit(A[i],b)

> amcmc_Vinit(A[i],W)

> }

: end

After creating some placeholders for the draws (bvals and Wvals), we can execute the

drawing algorithm as follows:

. mata

mata (type end to exit)

: its=20000

: burn=10000

: bvals=J(0,cols(beta),.)

: Wvals=J(0,cols(rowshape(W,1)),.)

: for (i=1;i<=its;i++) {

> b=drawb_betaW(beta,W/rows(m))

> W=drawW_bbeta(beta,b)

> bvals=bvals\b

> Wvals=Wvals\rowshape(W,1)

> beta_old=beta

> for (j=1;j<=rows(A);j++) {

> amcmc_draw(A[j])

> beta[j,]=amcmc_results_lastdraw(A[j])

> }

> }

: end

The algorithm consists of an outer loop and an inner loop, within which individual-level

parameters are drawn sequentially. The current value of the beta vector, which holds

individual-level parameters in rows, is overwritten with the last draw produced by using

the amcmc_results_lastdraw() function.

A subtlety of the code also indicates a reason why it is useful to pass additional

function arguments as pointers: each time a new value of b and W is drawn, a user

does not need to reiterate to each sampling problem that b and W have changed, be-

cause pointers point to positions that hold objects and not to the values of the objects

themselves. Thus, every time a new value of b or W is drawn, the arguments of all 100

problems are automatically changed. By specifying that the target distribution for each

individual-level problem is to be reevaluated, the user tells the routine to recalculate lnbetan_bW

at the last drawn value when comparing a new draw to the previous one.
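The same reference semantics can be sketched in any language; the following Python fragment (an illustration, not part of the amcmc() suite — all names are hypothetical) shows why storing a reference to a shared parameter object, rather than a copy, means one in-place update reaches every problem at once:

```python
# Each "problem" stores a reference to the same shared parameter vector,
# playing the role of a Mata pointer to b; nothing is copied.
shared_b = [0.0] * 6

# 100 sampling problems, all pointing at the same object.
problems = [{"args": shared_b} for _ in range(100)]

# Draw a new b and update it in place (like writing through the pointer):
shared_b[0] = 1.5

# Every problem sees the new value with no re-registration step.
assert all(p["args"][0] == 1.5 for p in problems)
```

Rebinding shared_b to a new list would break the link; it is the in-place update that corresponds to writing through a pointer.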

Because the technique might be of greater interest, I have developed a command that

implements the algorithm, bayesmixedlogit. For example, the algorithm described by

the previous code could be executed with the following command, which also summarizes

results in a way conformable with usual Stata output:

. bayesmixedlogit y, rand(price contract local wknown tod seasonal)

> group(gid) id(pid) draws(20000) burn(10000) samplerrand("global")

> saving(train_draws) replace

Bayesian Mixed Logit Model                          Observations  =  4780
                                                    Groups        =   100
                                                    Choices       =  1195
Acceptance rates:                                   Total draws   = 20000
  Fixed coefs                =                      Burn-in draws = 10000
  Random coefs(ave,min,max)  = 0.270, 0.235, 0.289

Random

price -1.168711 .1245738 -9.38 0.000 -1.4129 -.9245209

contract -.3433208 .0682585 -5.03 0.000 -.4771212 -.2095204

local 2.637242 .3436764 7.67 0.000 1.963567 3.310917

wknown 2.138963 .2596608 8.24 0.000 1.629976 2.647951

tod -11.16374 1.049769 -10.63 0.000 -13.2215 -9.105982

seasonal -11.19243 1.030291 -10.86 0.000 -13.212 -9.172849

Cov_Random

var_price .8499292 .2332495 3.64 0.000 .3927132 1.307145

cov_priceco~t .1128769 .0803203 1.41 0.160 -.044567 .2703208

cov_pricelo~l 1.583028 .4519537 3.50 0.000 .6971079 2.468948

cov_pricewk~n .8898662 .3096053 2.87 0.004 .2829775 1.496755

cov_pricetod 6.106009 1.909356 3.20 0.001 2.363286 9.848731

cov_pricese~l 6.044055 1.892895 3.19 0.001 2.333601 9.75451

var_contract .3450904 .0670202 5.15 0.000 .2137174 .4764634

cov_contrac~l .4714882 .2131141 2.21 0.027 .0537416 .8892347

cov_contrac~n .3624791 .1560516 2.32 0.020 .0565865 .6683717

cov_contrac~d .7592097 .6576296 1.15 0.248 -.5298765 2.048296

cov_contrac~l .9147682 .65939 1.39 0.165 -.3777688 2.207305

var_local 7.000292 1.883972 3.72 0.000 3.307328 10.69326

cov_localwk~n 4.022065 1.248119 3.22 0.001 1.575501 6.468629

cov_localtod 12.84674 3.787742 3.39 0.001 5.422006 20.27148

cov_localse~l 13.40598 3.727253 3.60 0.000 6.099812 20.71214

var_wknown 3.364285 1.012474 3.32 0.001 1.379632 5.348938

cov_wknowntod 6.513209 2.60766 2.50 0.013 1.401671 11.62475

cov_wknowns~l 7.109282 2.563623 2.77 0.006 2.084064 12.1345

var_tod 57.62449 16.97876 3.39 0.001 24.3427 90.90628

cov_todseas~l 53.93841 16.35184 3.30 0.001 21.88551 85.99131

var_seasonal 55.05572 16.54599 3.33 0.001 22.62226 87.48918

*Results are presented to conform with Stata convention, but

are summary statistics of draws, not coefficient estimates.

The results are similar but not identical to those obtained using mixlogit. Additional

information and examples for bayesmixedlogit can be found in the help file, and some

examples of estimating a mixed logit model using Bayesian methods are provided in

the help file for amcmc(), accessible via the commands help mf_amcmc or help mata

amcmc().

5 Description

In this section, I sketch a Mata implementation of what I have been referring to as

a global adaptive MCMC algorithm. The sketched routine omits a few details, mainly

about parsing options, but it is relatively true to form in describing how the algorithms

discussed in the article are actually implemented in Mata and might be used as a

template for developing more specialized algorithms. It assumes that the user wishes to

draw from a stand-alone function without additional arguments. The code is as follows:

. mata:

mata (type end to exit)

: real matrix amcmc_global(f,xinit,Vinit,draws,burn,damper,

> aopt,arate,val,lam)

> {

> real scalar nb,old,pro,i,alpha

> real rowvector xold,xpro,mu

> real matrix Accept,accept,xs,V,Vsq,Vold

>

> nb=cols(xinit) /* Initialization */

> xold=xinit

> lam=2.38^2/nb

> old=(*f)(xold)

> val=old

>

> Accept=0

> xs=xold

> mu=xold

> V=Vinit

> Vold=I(cols(xold))

>

> for (i=1;i<=draws;i++) {

> accept=0

> Vsq=cholesky(V) /* Prep V for drawing */

> if (hasmissing(Vsq)) {

> Vsq=cholesky(Vold)

> V=Vold

> }

>

> xpro=xold+lam*rnormal(1,nb,0,1)*Vsq /* Draw, value calc. */

>

>

> pro=(*f)(xpro)

>

> if (pro==. ) alpha=0 /* calc. of accept. prob */

>

> else if (pro>old) alpha=1

> else alpha=exp(pro-old)

>

> if (runiform(1,1)<alpha) {

> old=pro

> xold=xpro

> accept=1

> }

>

> lam=lam*exp(1/(i+1)^damper*(alpha-aopt)) /*update*/

> xs=xs\xold

> val=val\old

> Accept=Accept\accept

> mu=mu+1/(i+1)^damper*(xold-mu)

> Vold=V

> V=V+1/(i+1)^damper*((xold-mu)'(xold-mu)-V)

> _makesymmetric(V)

> }

>

> val =val[burn+1::draws,]

> arate=mean(Accept[burn+1::draws,])

> return(xs[burn+1::draws,])

> }

: end

The function starts by setting up a variable (nb) to hold the dimension of the distribu-

tion, and xold, which functions as xt in the algorithms discussed in table 3, is set to

the user-supplied initial value. The initial value of λ (called lam) is set as discussed by

Andrieu and Thoms (2008, 359).

Next the log value of the distribution (f) at xold is calculated and called old. The

next few steps proceed as one would expect. However, I find it useful to have a default

covariance matrix waiting (Vold in the code) in case the Cholesky decomposition

encounters problems. For example, this could happen if the initial variance–covariance

matrix is not positive definite or if there is insufficient variation in the draws, which

sometimes happens in the early stages of a run. Once a usable covariance matrix has

been obtained, xpro (which functions as Yt in the algorithms in tables 1, 2, and 3) is

formed using a conformable vector of standard normal random variates, and the function

is evaluated at xpro.

The acceptance probability alpha is then calculated in a numerically stable way in an

if-else if-else block. If the target function returns a missing value when evaluated,

alpha is set to 0 so that the draw will not be retained. If the proposal produces a higher

value of the target function, alpha is set to one. Otherwise, it is set as described by

the algorithms.19 Finally, a uniform random variable is drawn that determines whether

the draw is to be accepted. Once this is known, all values are updated according to

the scheme described in table 3. Once the for loop concludes, the algorithm overwrites

the acceptance rate, arate, and the function value, val, and returns the results of the

draw.
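In symbols (notation mine; the article's tables 1–3 give the formal statements), with gain γi coded as 1/(i+1)^damper and target acceptance rate α* (aopt), each pass of the loop performs

```latex
\alpha = \min\bigl\{1,\ \exp[\ln f(x_{\mathrm{pro}}) - \ln f(x_{\mathrm{old}})]\bigr\},
\qquad \gamma_i = (i+1)^{-\mathrm{damper}},
\lambda \leftarrow \lambda\,\exp\{\gamma_i(\alpha - \alpha^{*})\}, \qquad
\mu \leftarrow \mu + \gamma_i(x - \mu), \qquad
V \leftarrow V + \gamma_i\{(x - \mu)'(x - \mu) - V\},
```

where x is the retained draw (xold after the accept/reject step). These correspond to the lam, mu, and V updates at the bottom of the loop.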

6 Conclusions

I have given a brief overview of adaptive MCMC methods and how they can be imple-

mented using the Mata routine amcmc() and a suite of functions amcmc_*(). While I

have given some ideas about how one might use and display obtained results, my primary

purpose is to present and describe an implementation of adaptive MCMC algorithms.

19. The Mata function exp() does not evaluate to missing for very small values as it does for very large

values.

I have not discussed how one should set up the parameters of the draw, such as

the number of draws to take, whether to use a global sampler, or how aggressively to

tune the proposal distribution. I have also not discussed what users should do once

they have obtained draws from an adaptive MCMC algorithm. The functions leave these

decisions in the hands of users. Creating, describing, and analyzing results obtained via

MCMC is fortunately the subject of extensive literature. Broadly speaking, literature

on MCMC is built around the related issues of assessing convergence of a run and of

assessing the mixing and intensity of a run. A further issue is how one should deal

with autocorrelation between draws. Whatever means are used to analyze results, it

is fortunate that Stata provides a ready-made battery of tools to summarize, modify,

and graph results. However, while it is often easy to spot problems in an MCMC run, it

is impossible to know whether the run has actually provided draws from the intended

distribution.

On the subject of convergence, there is not any universally accepted criterion, but

researchers propose many guidelines. Gelman and Rubin (1992) present several useful

ideas. A general discussion appears in Geyer (2011), and some practical advice appears

in Gelman and Shirley (2011), who advocate discarding the first half of a run as a burn-

in period and performing multiple runs in parallel from different starting points and

comparing results. To be sure that one is actually sampling from the right region of the

density, one can use heated distributions in preliminary runs. Effectively, these heated

distributions raise the likelihood function to some fractional power,20 which flattens the

distribution and allows for more rapid and broader exploration of the parameter space.

One can also compare the results of multiple runs and compare the variance within

runs and between runs. A useful technique is to investigate the autocorrelation function

of results and then thin the results, retaining only a fraction of the draws so that most

of the autocorrelation is removed from the data. One can use time-series tools to test for

autocorrelation among draws. A possibility discussed by Gelman and Shirley (2011) is

to jumble the results of the simulation. While it might seem obvious, it is worthwhile

to note that solutions to these problems are interdependent. A draw that exhibits a

lot of autocorrelation may require more thinning and a longer run to obtain a suitable

number of draws. A good place to start with these and other aspects of analyzing results

is Brooks et al. (2011).
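As a small illustration of thinning (mine, not part of the amcmc() suite), the sketch below computes a lag-k sample autocorrelation for a stored chain and keeps every k-th draw:

```python
def autocorr(chain, lag):
    """Sample autocorrelation of a 1-D chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

def thin(chain, k):
    """Retain every k-th draw to reduce autocorrelation between draws."""
    return chain[::k]

# A slowly cycling chain is highly autocorrelated at short lags;
# thinning a 10,000-draw run by 10 leaves 1,000 less dependent draws.
chain = [float(i % 100) for i in range(10_000)]
kept = thin(chain, 10)
assert len(kept) == 1000
assert autocorr(chain, 1) > 0.9   # adjacent draws move together
```

How much to thin is exactly the kind of judgment the autocorrelation function informs: heavy autocorrelation calls for more thinning and a longer run.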

As may have been clear from the examples presented in section 4, another option

is to run the algorithm for some suitable amount of time and then restart the run

without adaptation by using previous results as starting values so that one is drawing

from an invariant proposal distribution. A simple yet useful starting point in judging

convergence is seeing whether the algorithm produces results with graphs that look like

those in figure 2 but not those in figure 4. A graph that does not contain jumps or

flat spots and looks more or less like white noise is a preliminary indication that the

algorithm is working well. However, pseudo-convergence can still be very difficult to

detect. In addition to containing much practical advice, Geyer (2011) also advises that

one should at least do an overnight run, adding only half in jest that one should start

a run when the article is submitted and keep running until the referees' reports arrive.

This cannot delay the article, and may detect pseudo-convergence (Geyer 2011, 18).

7 References

Andrieu, C., and J. Thoms. 2008. A tutorial on adaptive MCMC. Statistics and Com-

puting 18: 343–373.

Brooks, S., A. Gelman, G. L. Jones, and X.-L. Meng, eds. 2011. Handbook of Markov

Chain Monte Carlo. Boca Raton, FL: Chapman & Hall/CRC.

Chernozhukov, V., and H. Hong. 2003. An MCMC approach to classical estimation. Journal of Econometrics 115: 293–346.

Chib, S., and E. Greenberg. 1995. Understanding the Metropolis–Hastings algorithm. American Statistician 49: 327–335.

Gelman, A., and D. B. Rubin. 1992. Inference from iterative simulation using multiple

sequences. Statistical Science 7: 457–472.

Gelman, A., and K. Shirley. 2011. Inference from simulations and monitoring conver-

gence. In Handbook of Markov Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. L.

Jones, and X.-L. Meng, 163–174. Boca Raton, FL: Chapman & Hall/CRC.

Geyer, C. J. 2011. Introduction to Markov Chain Monte Carlo. In Handbook of Markov

Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, 3–48.

Boca Raton, FL: Chapman & Hall/CRC.

Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood.

Stata Journal 7: 388–401.

Powell, J. L. 1984. Least absolute deviations estimation for the censored regression

model. Journal of Econometrics 25: 303–325.

Rosenthal, J. S. 2011. Optimal proposal distributions and adaptive MCMC. In Handbook

of Markov Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. L. Jones, and X.-L.

Meng, 93–112. Boca Raton, FL: Chapman & Hall/CRC.

Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:

Cambridge University Press.

Matthew Baker is an associate professor of economics at Hunter College and the Graduate

Center, City University of New York. One of his current interests is simulation-based econo-

metrics.

The Stata Journal (2014)

14, Number 3, pp. 662–669

A simple command to gather comma-separated value files into Stata

Alberto A. Gaggero

Department of Economics and Management

University of Pavia

Pavia, Italy

alberto.gaggero@unipv.it

Abstract. This command meets the need of a researcher who holds multiple data

files in comma-separated value format differing by a period variable (for example,

year or quarter) or by a cross-sectional variable (for example, country or firm) and

must combine them into one Stata-format file.

Keywords: dm0076, csvconvert, comma-separated value file, .csv

1 Introduction

In applied research, it is common to come across several data files containing the same

set of variables that need to be combined into one file. For instance, in a cross-country

survey, a researcher may collect information country by country and thus create several

data files, one for each country. Or within the same cross-section (or even within the

same country), the researcher may sample each year independently and generate various

data files that differ by year.

A practical issue in this type of situation is determining how to read all of those

files together in Stata, especially if they are numerous. The standard approach would

be to import each data file sequentially into Stata by using a combination of import

delimited and append. This approach, however, requires a user to type several com-

mand lines proportional to the number of files to be included; thus it is reasonably

doable only if the number of data files is limited.

Suppose the directory C:\data\world bank contains three comma-separated value

(.csv) files: wb2007.csv, wb2008.csv, and wb2009.csv.1 After setting the appropri-

ate working directory, a user implements the aforementioned procedure by typing the

following command lines:

. import delimited using wb2007.csv, clear

. save wb2007.dta

. import delimited using wb2008.csv, clear

. save wb2008.dta

. import delimited using wb2009.csv, clear

. save wb2009.dta

. use wb2007.dta, clear

. append using wb2008.dta

1. csvconvert is designed to handle many .csv files; however, for simplicity, all the examples below

consider a limited set of .csv files.

© 2014 StataCorp LP dm0076

A. A. Gaggero 663

. append using wb2009.dta

Alternatively, and more compactly, the same result can be obtained with a loop.

. foreach file in wb2007 wb2008 wb2009 {

2. import delimited using `file'.csv, clear

3. save `file'

4. }

. foreach file in wb2007.dta wb2008.dta {

2. append using `file'

3. }

Another way is to work with the disk operating system (DOS) to gather all the .csv files

into one .csv file and then to read the assembled single .csv file into memory using

import delimited.

Under the DOS framework, the lines below assemble wb2007.csv, wb2008.csv, and

wb2009.csv into a newly created .csv file named input.csv.

cd "C:\data\world bank"

copy wb2007.csv wb2008.csv wb2009.csv input.csv

To assemble all .csv files stored in the directory C:\data\world bank into a new file

named input.csv, type

cd "C:\data\world bank"

copy *.csv input.csv

. import delimited using "C:\data\world bank\input.csv"

A similar approach that bypasses the DOS framework can be implemented. However,

if the number of .csv files is large, the process may not be as straightforward. For

simplicity, let us still consider just three .csv les. Once the appropriate working

directory is set, the command lines to type are as follows:

. copy wb2008.csv wb2007.csv, append

. copy wb2009.csv wb2007.csv, append

. import delimited using wb2007.csv

The first two command lines append wb2008.csv and wb2009.csv to wb2007.csv.

The third command reads the .csv file into Stata.

Note, however, that if the first line of both wb2008.csv and wb2009.csv contains

the variable names, these are also appended.2 Thus, because of the presence of extra

lines with names, all the variables are read as strings. To correct this inaccuracy, one

should first remove the lines with the variable names and then use destring to set the

numerical format.

2. Unfortunately, the option varnames(nonames), applicable with import delimited, is unavailable

with copy.
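The appended-header problem can be seen in miniature in any language. The Python sketch below (an illustration only — csvconvert's internals differ, and the sample data are made up) keeps the header row of the first file and drops it from the rest:

```python
import csv
import io

def combine_csv(texts):
    """Concatenate CSV contents, keeping the header row of the first one only."""
    rows = []
    for i, text in enumerate(texts):
        reader = csv.reader(io.StringIO(text))
        header = next(reader)       # every file starts with a header row
        if i == 0:
            rows.append(header)     # keep it once ...
        rows.extend(reader)         # ... and always keep the data rows
    return rows

wb2007 = "country,gdp\nItaly,100\n"
wb2008 = "country,gdp\nItaly,110\n"
combined = combine_csv([wb2007, wb2008])
assert combined == [["country", "gdp"], ["Italy", "100"], ["Italy", "110"]]
```

Skipping this step, as a plain copy does, leaves the extra header rows in the data and forces every column to be read as a string.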

664 A simple command to gather comma-separated value files into Stata

Alternatively, we could prevent this fault by manually preparing the .csv files (that

is, by removing the lines with the variable names in the .csv files to be appended). The

whole process can be time consuming, especially if the number of .csv files is large. The

csvconvert command simplifies and automates the procedure of gathering multiple

.csv files into one .dta file, as illustrated in the next section.

2.1 Syntax

The syntax is

csvconvert input_directory, replace [input_file(filenames)

output_dir(output_directory) output_file(filename)]

where input_directory is the path of the directory in which the .csv files are stored. Do

not use any quotes at the endpoints of the directory path, even if the directory name

contains spaces (see example 1 below).

2.2 Options

replace specifies that the existing output file (if it already exists) be overwritten.

replace is required.

input_file(filenames) specifies a subset of the .csv files to be converted. The filenames

must be separated by a space and include the .csv extension (see example 2 below).

If this option is not specified, csvconvert considers all the .csv files stored in the

input_directory.

output_dir(output_directory) specifies the directory in which the .dta output file is

saved. If this option is not specified, the file is saved in the same directory where

the .csv files are stored.

output_file(filename) specifies the name of the .dta output file. The default is

output_file(output.dta).

3 Examples

3.1 Example 1: Basic

The simplest way to run csvconvert is to type the command and the directory path

where the .csv files are stored followed by the mandatory option replace. In the same

directory, Stata will create output.dta, which collects all the .csv files of that directory

in Stata format.

_________________________________________________

The csv file wb2007.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

The csv file wb2008.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

The csv file wb2009.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

****************************************************************

You have successfully converted 3 csv files in one Stata file

****************************************************************

If you want to convert only a subset of the .csv files in the directory (for example,

wb2008.csv and wb2009.csv), then you need to list the files to be converted inside the

parentheses of the option input_file(). Filenames must be separated by a blank space

and must be specified using the .csv extension.

_________________________________________________

The csv file wb2008.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

The csv file wb2009.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

****************************************************************

You have successfully converted 2 csv files in one Stata file

****************************************************************

3.3 Example 3: Saving the output file in a predetermined directory

Suppose you wish to name your output file wb_data and save it in the directory

C:\data\wb dataset. In this case, you would type

> output_dir(C:\data\wb dataset)

_________________________________________________

The csv file wb2007.csv

(6 vars, 3 obs)

has been successfully included in wb_data.dta.dta

_________________________________________________

The csv file wb2008.csv

(6 vars, 3 obs)

has been successfully included in wb_data.dta.dta

_________________________________________________

The csv file wb2009.csv

(6 vars, 3 obs)

has been successfully included in wb_data.dta.dta

_________________________________________________

****************************************************************

You have successfully converted 3 csv files in one Stata file

****************************************************************

Example 2 and example 3 can be combined.

> output_file(wb_data.dta) output_dir(C:\data\wb dataset)

(output omitted )

csvconvert is designed to speed up the process of joining a large number of .csv

files. As the number of input files increases, the likelihood that one of them contains

inaccuracies rises. It is important, therefore, to keep track of all the steps in the process

so that the origin of possible faults can be detected.

While creating the output file, csvconvert offers various ways to check that the

conversion of the .csv files into Stata has been completed correctly.

First of all, at the end of the process, csvconvert displays the number of .csv files

contained in the output file. This information allows a researcher to check whether the

expected number of .csv files to be included in the output file is equal to the actual

number of .csv files that have been converted. The complete list of .csv files included

in the .dta file can be obtained by typing note (see example 5). Additionally, by

default, csvconvert creates one variable named _csvfile, which records the name of

the .csv file where the observation originates.

During conversion, csvconvert sequentially reports the name of the .csv file being

converted, the number of variables, and the number of observations. If something in the

process appears odd, extra messages are displayed to alert the researcher and demand

further inspection. For instance, suppose that one .csv file contains a symbol or a

letter in one cell of a numerical variable; if ignored, this inaccuracy may undermine the

whole process. For this reason, csvconvert adds a note to help the researcher detect

the fault. In example 6, wb2008_symbol.csv contains N/A in one cell of the variable

populationtotal.

Once csvconvert has been completed, the full list of .csv files included in the .dta

file, together with the date and time when each .csv file was converted, can be obtained

by typing note in the command window.

. note

_dta:

1. File included on 18 Jan 2014 10:11 : "wb2007.csv"

2. File included on 18 Jan 2014 10:11 : "wb2008.csv"

3. File included on 18 Jan 2014 10:11 : "wb2009.csv"

Suppose that you wish to convert three files: wb2007.csv, wb2008_symbol.csv, and

wb2009.csv. The file wb2008_symbol.csv contains a fault (that is, the aforementioned

N/A cell), but you are unaware of it.

> input_file(wb2007.csv wb2008_symbol.csv wb2009.csv)

_________________________________________________

The csv file wb2007.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

The csv file wb2008_symbol.csv

(6 vars, 3 obs)

(note: variable populationtotal was long in the using data, but will be str9

now)

has been successfully included in output.dta

_________________________________________________

The csv file wb2009.csv

(6 vars, 3 obs)

(note: variable populationtotal was str9 in the using data, but will be long

now)

(note: variable _csvfile was str10, now str17 to accommodate using data's

values)

has been successfully included in output.dta

_________________________________________________

****************************************************************

You have successfully converted 3 csv files in one Stata file

****************************************************************

By reading the log, you can see that in the conversion of wb2008_symbol.csv, the

variable populationtotal changed its format from numerical to string. Therefore,

wb2008_symbol.csv is the file that needs to be inspected. Once the anomalous obser-

vation is detected and manually corrected (for example, by emptying the anomalous

cell via Excel and saving the corrected file as wb2008_symbol2.csv), you can relaunch

csvconvert and check that it now runs smoothly.

> input_file(wb2007.csv wb2008_symbol2.csv wb2009.csv)

_________________________________________________

The csv file wb2007.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

The csv file wb2008_symbol2.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

The csv file wb2009.csv

(6 vars, 3 obs)

(note: variable _csvfile was str10, now str18 to accommodate using data's

values)

has been successfully included in output.dta

_________________________________________________

****************************************************************

You have successfully converted 3 csv files in one Stata file

****************************************************************

If csvconvert happens to include duplicate observations (for instance, if the same

input file was entered twice), it displays a warning message. Moreover, to facilitate the detec-

tion of double observations, csvconvert generates a new dummy variable, _duplicates,

that is equal to one in case of duplicate observations. This example describes the pro-

cedure to spot whether an input file has been entered twice and, if so, which one.
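The flagging logic can be sketched compactly. The Python fragment below (an illustration, not csvconvert's code) marks every row whose full set of values occurs more than once, mirroring how the dummy equals one for each copy of a duplicated observation:

```python
from collections import Counter

def flag_duplicates(rows):
    """Return a 0/1 flag per row: 1 if the row's values occur more than once."""
    counts = Counter(tuple(r) for r in rows)
    return [1 if counts[tuple(r)] > 1 else 0 for r in rows]

rows = [["Italy", "100"], ["Spain", "90"], ["Italy", "100"]]
assert flag_duplicates(rows) == [1, 0, 1]   # both copies get flagged
```

Because every copy is flagged, three duplicated observations can show up as six flagged rows.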

> input_file(wb2008.csv wb2009.csv wb2008.csv)

_________________________________________________

The csv file wb2008.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

The csv file wb2009.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

The csv file wb2008.csv

(6 vars, 3 obs)

has been successfully included in output.dta

_________________________________________________

****************************************************************

You have successfully converted 3 csv files in one Stata file

****************************************************************

Warning - output.dta has 3 duplicate observations: you might have entered a

> .csv file name twice in the input_file() option, or your original dataset may

> contain duplicates. Check if this is what you wanted: variable _duplicates

> = 1 in case of duplicate and = 0 otherwise may help.

The warning message shows that there are three duplicate observations. Of course,

you can look carefully at the Results window and find that wb2008.csv was entered

twice. However, if you are handling a large set of .csv files, checking each line of the

screen would be very time consuming.

Tabulating the variable _csvfile conditional on _duplicates being equal to one

quickly detects that the duplicate observations come from wb2008.csv.

. tabulate _csvfile if _duplicates==1

    csv file
  from which
 observation
  originates        Freq.     Percent        Cum.
-------------------------------------------------
  wb2008.csv            6      100.00      100.00
-------------------------------------------------
       Total            6      100.00

5 Acknowledgments

I am grateful to Editor Joseph Newton for his assistance during revision and to Violeta

Carrion, Emanuele Forlani, Edna Solomon, and one anonymous referee for very helpful

comments.

Alberto A. Gaggero is currently an assistant professor in the Department of Economics and

Management at the University of Pavia, where he teaches applied industrial organization.

He obtained his PhD from the University of Essex. He formerly held research positions at

the University of Genoa, at Hogeschool University Brussel, and at the Belgian Ministry of

Economic Affairs. His research topics center on applied industrial organization with particular

interest in airline pricing.

The Stata Journal (2014)

14, Number 3, pp. 670–683

estimation of treatment effects when exclusion

restrictions are unavailable

Ian McCarthy Daniel Millimet

Emory University Southern Methodist University

Atlanta, GA Dallas, TX

ianmccarthy.econ@gmail.com Institute for the Study of Labor

Bonn, Germany

millimet@smu.edu

Rusty Tchernis

Georgia State University

Atlanta, GA

Institute for the Study of Labor

Bonn, Germany

National Bureau of Economic Research

Cambridge, MA

rtchernis@gsu.edu

Abstract. In this article, we present a new command, bmte, that implements two new estimators proposed in Millimet and Tchernis

(2013, Journal of Applied Econometrics 28: 982–1017) and designed to estimate

the effect of treatment when selection on unobserved variables exists and appropriate

exclusion restrictions are unavailable. In addition, the bmte command estimates

treatment effects from several alternative estimators that also do not rely

on exclusion restrictions for identification of the causal effects of the treatment,

including the following: 1) Heckman's two-step estimator (1976, Annals of Economic

and Social Measurement 5: 475–492; 1979, Econometrica 47: 153–161); 2) a

control function approach outlined in Heckman, LaLonde, and Smith (1999, Handbook

of Labor Economics 3: 1865–2097) and Navarro (2008, The New Palgrave

Dictionary of Economics [Palgrave Macmillan]); and 3) a more recent estimator

proposed by Klein and Vella (2009, Journal of Applied Econometrics 24: 735–762)

that exploits heteroskedasticity for identification. By implementing two new estimators

alongside preexisting estimators, the bmte command provides a picture of

the average causal effects of the treatment across a variety of assumptions. We

present an example application of the command following Millimet and Tchernis

(2013, Journal of Applied Econometrics 28: 982–1017).

Keywords: st0355, bmte, treatment effects, propensity score, unconfoundedness,

selection on unobserved variables

© 2014 StataCorp LP st0355

I. McCarthy, D. Millimet, and R. Tchernis 671

1 Introduction

The causal effect of binary treatment on outcomes is a central component of empirical

research in economics and many other disciplines. When individuals self-select into

treatment and when prospective randomization of the treatment and control groups is

not feasible, researchers must adopt alternative empirical methods intended to control

for the inherent self-selection. If individuals self-select on the basis of observed variables

(selection on observed variables), a variety of appropriate methodologies are available

to estimate the causal effects of the treatment. If instead individuals self-select on the

basis of unobserved variables (selection on unobserved variables), estimating treatment

effects is more difficult.

When one is confronted with selection on unobserved variables, the most common

empirical approach is to rely on an instrumental variable (IV); however, if credible instruments

are unavailable, a few approaches now exist that attempt to estimate the effects of

the treatment without an exclusion restriction. This article introduces a new Stata command,

bmte, that implements two recent estimators proposed in Millimet and Tchernis

(2013) and designed to estimate treatment effects when selection on unobserved variables

exists and appropriate exclusion restrictions are unavailable:

i. The minimum-biased (MB) estimator: This estimator searches for the observations

with minimized bias in the treatment-effects estimate of interest. This is accomplished

by trimming the estimation sample to include only observations with a

propensity score within a certain interval as specified by the user. When the

conditional independence assumption (CIA) holds (that is, independence between

treatment assignment and potential outcomes, conditional on observed variables),

the MB estimator is unbiased. Otherwise, the MB estimator tends to minimize

the bias among estimators that rely on the CIA. Furthermore, the MB estimator

changes the parameter being estimated because of the restricted estimation

sample.

ii. The bias-corrected (BC) estimator: This estimator relies on the two-step estimator

of Heckman's bivariate normal (BVN) selection model to estimate the bias among

estimators that inappropriately apply the CIA (Heckman 1976, 1979). However,

unlike the BVN estimator, the BC estimator does not require specification of the

functional form for the outcome of interest in the final step. Moreover, unlike the

MB estimator, the BC estimator does not change the parameter being estimated.

The bmte command presents treatment-effects estimates across a range of assumptions, including standard ordinary least-squares (OLS) and

inverse-probability-weighted (IPW) treatment-effects estimates. The bmte command

also presents the results of additional estimates applicable when the CIA fails and valid

exclusion restrictions are unavailable, including the following: 1) Heckman's BVN estimator;

2) a control function (CF) approach outlined in Heckman, LaLonde, and Smith

(1999) and Navarro (2008); and 3) a more recent estimator proposed by Klein and Vella

(2009) that exploits heteroskedasticity for identification. By implementing two new

estimators alongside preexisting estimators, the bmte command provides a picture of the

average causal effects of the treatment across a variety of assumptions and when valid

exclusion restrictions are unavailable.

Here we provide a brief background on the potential-outcomes model and the estimators

implemented by the bmte command. For additional discussion, see Millimet and

Tchernis (2013). We consider the standard potential-outcomes framework, denoting by

Y_i(T) the potential outcome of individual i under binary treatment T ∈ T = {0, 1}. The

causal effect of the treatment (T = 1) relative to the control (T = 0) is defined as the

difference between the corresponding potential outcomes, Δ_i = Y_i(1) − Y_i(0).

In the evaluation literature, several population parameters are of potential interest.

The most commonly used parameters include the average treatment effect (ATE), the

ATE on the treated (ATT), and the ATE on the untreated (ATU), defined as

\[
\begin{aligned}
\tau_{\mathrm{ATE}} &= E(\Delta_i) = E\{Y_i(1) - Y_i(0)\} \\
\tau_{\mathrm{ATT}} &= E(\Delta_i \mid T = 1) = E\{Y_i(1) - Y_i(0) \mid T = 1\} \\
\tau_{\mathrm{ATU}} &= E(\Delta_i \mid T = 0) = E\{Y_i(1) - Y_i(0) \mid T = 0\}
\end{aligned}
\]

These parameters may also vary with a vector of covariates, X, in which case the

parameters have an analogous representation conditional on a particular value of X.1
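These three parameters can be illustrated directly when both potential outcomes are known, as in a simulation. The Python sketch below (the data are invented for illustration) computes each parameter from hypothetical potential-outcome pairs:

```python
# Illustration of the ATE, ATT, and ATU with made-up potential outcomes.
# Each tuple is (Y_i(0), Y_i(1), T_i); the unit-level effect is
# Delta_i = Y_i(1) - Y_i(0).
data = [(1.0, 3.0, 1), (2.0, 2.5, 1), (0.0, 1.0, 0), (4.0, 4.5, 0)]

def mean(xs):
    return sum(xs) / len(xs)

ate = mean([y1 - y0 for y0, y1, _ in data])             # E(Delta_i)
att = mean([y1 - y0 for y0, y1, t in data if t == 1])   # E(Delta_i | T = 1)
atu = mean([y1 - y0 for y0, y1, t in data if t == 0])   # E(Delta_i | T = 0)

print(ate, att, atu)
```

In observational data only one of the two potential outcomes is seen per individual, which is precisely why the estimators below are needed.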

For nonrandom treatment assignment, selection into treatment may follow one of two

general paths: 1) selection on observed variables, also referred to as unconfoundedness

or the CIA (Rubin 1974; Heckman and Robb 1985); and 2) selection on unobserved

variables. Under the CIA, selection into treatment is random conditional on covariates,

X, and the average effect of the treatment can be obtained by comparing outcomes

of individuals in the two treatment states with identical values of the covariates. This

approach often uses propensity-score methods to reduce the dimensionality problem

arising when X is a high-dimensional vector (Rosenbaum and Rubin 1983), with the

propensity score denoted by P (Xi ) = Pr(Ti = 1|Xi ).

If the CIA fails to hold, then the estimated treatment effects relying on the CIA are

biased. Following Heckman and Navarro-Lozano (2004) and Black and Smith (2004),

we denote the potential outcomes as Y(0) = g₀(X) + ε₀ and Y(1) = g₁(X) + ε₁, where

g₀(X) and g₁(X) are the deterministic portions of the outcome variable in the control

and treatment groups, respectively, and where (ε₀, ε₁) are the corresponding error terms.

We also denote the latent treatment variable by T* = h(X) − u, where h(X) represents

the deterministic portion of T*, and u denotes the error term. The observed treatment,

T, is therefore equal to 1 if T* > 0 and 0 otherwise. Finally, we denote by Δ the

difference in the residuals of the potential outcomes, Δ = ε₀ − ε₁.

1. More formally, the coefficient measures the treatment effect, adjusting for a simultaneous linear

change in the covariates, X, rather than being conditional on a specific value of X. We thank an

anonymous referee for highlighting this point.


Assuming Δ and u are jointly normally distributed, the bias can be derived as

\[
B_{\mathrm{ATE}}\{P(X)\} = \left[\rho_{0u}\sigma_0 + \{1 - P(X)\}\rho_{\Delta u}\sigma_{\Delta}\right]\,
\frac{\phi\{h(X)\}}{\Phi\{h(X)\}\left[1 - \Phi\{h(X)\}\right]} \qquad (1)
\]

where ρ₀ᵤ and ρ_Δᵤ are the correlations of u with ε₀ and Δ, respectively, σ₀ is the standard deviation of ε₀, σ_Δ is the standard deviation of Δ, and φ and Φ are the standard

normal probability density function and cumulative distribution function, respectively.

When the CIA fails, consistent estimation of the treatment effect of interest requires

an alternative technique robust to selection on unobservables. This is difficult because

obtaining a consistent point estimate of a measure of the treatment effect typically

requires an exclusion restriction, which is unavailable in many situations. The proposed

bmte command presents a series of treatment-effects estimators designed to estimate

the average effects of treatment when appropriate exclusion restrictions are unavailable,

exploiting the functional form of the bias in (1). Below we briefly present five of the

estimators implemented by the bmte command.

This technique relates generally to the normalized IPW estimator of Hirano and Imbens

(2001), given by

\[
\widehat{\tau}_{\mathrm{IPW,ATE}} =
\frac{\sum_{i=1}^{N} \dfrac{Y_i T_i}{\widehat{P}(X_i)}}{\sum_{i=1}^{N} \dfrac{T_i}{\widehat{P}(X_i)}}
- \frac{\sum_{i=1}^{N} \dfrac{Y_i (1 - T_i)}{1 - \widehat{P}(X_i)}}{\sum_{i=1}^{N} \dfrac{1 - T_i}{1 - \widehat{P}(X_i)}} \qquad (2)
\]

where P̂(Xᵢ) is an estimate of the propensity score obtained using a probit model.

Under the CIA, the IPW estimator in (2) provides an unbiased estimate of τ_ATE.
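The normalized weighting in (2) can be sketched in a few lines of Python. The outcomes, treatment indicators, and propensity scores below are invented for illustration; in practice P̂(Xᵢ) comes from a first-stage probit:

```python
def ipw_ate(y, t, p):
    """Normalized IPW estimate of the ATE, as in (2): a weighted treated mean
    minus a weighted control mean, with weights 1/p_i for the treated and
    1/(1 - p_i) for the controls."""
    num1 = sum(yi * ti / pi for yi, ti, pi in zip(y, t, p))
    den1 = sum(ti / pi for ti, pi in zip(t, p))
    num0 = sum(yi * (1 - ti) / (1 - pi) for yi, ti, pi in zip(y, t, p))
    den0 = sum((1 - ti) / (1 - pi) for ti, pi in zip(t, p))
    return num1 / den1 - num0 / den0

# Toy data: outcomes, treatment indicators, and (assumed) propensity scores.
y = [3.0, 2.5, 1.0, 4.5]
t = [1, 1, 0, 0]
p = [0.8, 0.6, 0.4, 0.5]
print(ipw_ate(y, t, p))
```

Normalizing by the summed weights (rather than by N) is what makes this the Hirano–Imbens version of IPW: the weights in each group sum to one.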

When this assumption fails, the bias for the ATE follows the closed functional form in

(1), with similar expressions for the ATT and ATU. The MB estimator aims to minimize

the bias by estimating (2) using only observations with a propensity score close to

the bias-minimizing propensity score, denoted by P*. Using P* effectively limits the

observations included in the estimation of the IPW treatment effects to minimize the

inherent bias when the CIA fails. We denote by Ω the set of observations ultimately

included in the estimation. In general, however, P* and Ω are unknown. Therefore, the

MB estimator estimates P* and Ω to minimize the bias in (1) by using Heckman's BVN

selection model, the details of which are provided in Millimet and Tchernis (2013).

The MB estimator of the ATE is formally given by

\[
\widehat{\tau}_{\mathrm{MB,ATE}}(P^*) =
\frac{\sum_{i\in\Omega} \dfrac{Y_i T_i}{\widehat{P}(X_i)}}{\sum_{i\in\Omega} \dfrac{T_i}{\widehat{P}(X_i)}}
- \frac{\sum_{i\in\Omega} \dfrac{Y_i (1 - T_i)}{1 - \widehat{P}(X_i)}}{\sum_{i\in\Omega} \dfrac{1 - T_i}{1 - \widehat{P}(X_i)}} \qquad (3)
\]

where Ω = {i | P̂(Xᵢ) ∈ C(P*)}, and C(P*) denotes a neighborhood around P*. Following

Millimet and Tchernis (2013), the MB estimator defines C(P*) as C(P*) =

{P(Xᵢ) | P̂(Xᵢ) ∈ (P_L, P_U)}, where P_L = max(0.02, P* − α), P_U = min(0.98, P* + α),

and α > 0 is the smallest value such that at least θ percent of both the treatment and

control groups are contained in Ω. Specific values of θ are specified within the bmte

command, with smaller values reducing the bias at the expense of higher variance. The

MB estimator trims observations with propensity scores above and below specific values,

regardless of the value of θ. These threshold values can be specified within the bmte

command options. Obtaining Ω does not require the use of Heckman's BVN selection

model when the focus is on the ATT or ATU, because P* is known to be one-half in these

cases (Black and Smith 2004).
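The window-selection step can be sketched as follows (Python, with invented data; the candidate P* is taken as given here, whereas bmte estimates it): α is widened until at least θ percent of both the treatment and control groups fall inside the window, subject to the outer pmin/pmax limits:

```python
def mb_window(pscores, treat, p_star, theta, pmin=0.02, pmax=0.98, step=0.005):
    """Return the indices Omega of observations kept by the MB trimming rule.

    alpha is widened until at least theta percent of BOTH the treatment
    and the control group have propensity scores inside [p_lo, p_hi]."""
    alpha = 0.0
    while alpha <= 1.0:
        p_lo = max(pmin, p_star - alpha)
        p_hi = min(pmax, p_star + alpha)
        omega = [i for i, pi in enumerate(pscores) if p_lo <= pi <= p_hi]
        n_t = sum(treat)
        n_c = len(treat) - n_t
        kept_t = sum(1 for i in omega if treat[i] == 1)
        kept_c = sum(1 for i in omega if treat[i] == 0)
        if kept_t >= theta / 100 * n_t and kept_c >= theta / 100 * n_c:
            return omega
        alpha += step
    return omega

# Invented propensity scores and treatment indicators.
ps = [0.10, 0.45, 0.50, 0.55, 0.90, 0.48]
tr = [0, 1, 0, 1, 1, 0]
print(mb_window(ps, tr, p_star=0.5, theta=50))
```

With these toy inputs the window expands just far enough around 0.5 to capture half of each group, dropping the extreme scores 0.10 and 0.90, which is the intended behavior of the trimming rule.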

If the user is sensitive to potential deviations from the normality assumptions underlying

Heckman's BVN model, the MB estimator and other estimators can be extended

appropriately (Millimet and Tchernis 2013). Such adjustments are included as part

of the bmte command, denoted by the Edgeworth-expansion versions of the relevant

estimators.

Estimation of the error correlation structure using Heckman's BVN model immediately

introduces the possibility of a BC version of each estimator. Specifically, estimates of

the bias of the MB estimator of the ATE, denoted by B̂_ATE(P*), can be derived from

the two-stage BVN model. The estimated bias can then be applied as an adjustment to

the standard IPW treatment-effects estimate.

The MB bias-corrected (MB-BC) estimator for the ATE is then given by

\[
\widehat{\tau}_{\mathrm{MB\text{-}BC,ATE}}(P^*) = \widehat{\tau}_{\mathrm{MB,ATE}}(P^*) - \widehat{B}_{\mathrm{ATE}}(P^*) \qquad (4)
\]

where the corresponding estimators for the ATT and ATU follow. With heterogeneous

treatment effects, the MB-BC estimator changes the parameter being estimated. To

identify the correct parameter of interest, the bmte command first estimates the MB-BC

estimator in (4) conditional on the propensity score, P(X), and then estimates the

(unconditional) ATE by taking the expectation of this over the distribution of X in the

population (or subpopulation of the treated). The resulting BC estimator is given by

\[
\widehat{\tau}_{\mathrm{BC,ATE}} = \widehat{\tau}_{\mathrm{IPW,ATE}} - \frac{1}{N}\sum_{i} \widehat{B}_{\mathrm{ATE}}\{\widehat{P}(X_i)\} \qquad (5)
\]

where again the corresponding estimators for the ATT and ATU follow.


Briefly, Heckman's BVN selection model adopts a two-stage approach: 1) estimate the

probability of treatment, Φ(Xᵢβ̂), using a standard probit model with binary treatment

as the dependent variable; and 2) estimate via OLS the following second-stage outcome

equation,

\[
Y_i = X_i\beta_0 + X_i T_i(\beta_1 - \beta_0)
+ \rho_0\sigma_0 (1 - T_i)\,\frac{\phi(X_i\widehat{\beta})}{1 - \Phi(X_i\widehat{\beta})}
+ \rho_1\sigma_1 T_i\,\frac{\phi(X_i\widehat{\beta})}{\Phi(X_i\widehat{\beta})} + \eta_i \qquad (6)
\]

where φ(·)/Φ(·) is the inverse Mills ratio, and η is an independent and identically distributed

error term with constant variance and zero conditional mean. With this approach,

the estimated ATE is given by

\[
\widehat{\tau}_{\mathrm{BVN,ATE}} = \overline{X}\,(\widehat{\beta}_1 - \widehat{\beta}_0) \qquad (7)
\]
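The two selection-correction regressors in (6) are inverse Mills ratio terms evaluated at the first-stage probit index. A minimal Python sketch of their construction follows (the index values passed in are illustrative, standing in for Xᵢβ̂ from the probit step):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

def mills_terms(index, treated):
    """Selection-correction regressor for one observation in the second
    stage of the BVN model: phi(Xb)/Phi(Xb) for treated observations and
    phi(Xb)/{1 - Phi(Xb)} for controls (each interacted with T_i or
    1 - T_i in the regression itself)."""
    lam1 = nd.pdf(index) / nd.cdf(index)          # enters with T_i
    lam0 = nd.pdf(index) / (1.0 - nd.cdf(index))  # enters with 1 - T_i
    return lam1 if treated else lam0

print(mills_terms(0.0, True))   # phi(0)/Phi(0)
print(mills_terms(0.0, False))  # phi(0)/{1 - Phi(0)}; equal at index 0
```

The coefficients on these two constructed regressors recover ρ₁σ₁ and ρ₀σ₀, which is exactly the error correlation structure that the BC estimators reuse to evaluate the bias in (1).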

2.4 CF approach

Heckman's BVN selection model is a special case of the CF approach. The idea is to devise

a function such that, once it is included, the treatment assignment is no longer correlated with the error term

in the outcome equation, as outlined nicely in Heckman, LaLonde,

and Smith (1999) and Navarro (2008). Specifically, consider the outcome equation

\[
Y_i(t) = (\alpha_t + \lambda_{t0}) + g_t(X_i) + \sum_{s=1}^{S} \lambda_{ts}\, P(X_i)^s + \widetilde{\varepsilon}_{it}, \qquad t = 0, 1
\]

where S is the order of the polynomial. The following equation is then estimable via

OLS:

\[
Y_i = (\alpha_0 + \lambda_{00}) + \{(\alpha_1 + \lambda_{10}) - (\alpha_0 + \lambda_{00})\} T_i
+ X_i\beta_0 + X_i T_i(\beta_1 - \beta_0)
+ \sum_{s=1}^{S} \lambda_{0s}(1 - T_i) P(X_i)^s
+ \sum_{s=1}^{S} \lambda_{1s} T_i P(X_i)^s + \epsilon_i \qquad (8)
\]

2. Depending on one's dataset and specific application, it may not be meaningful to evaluate all

covariates at their means. Therefore, when interpreting the treatment-effects estimates, the user

should check that the data support the use of X̄. We are grateful to an anonymous referee for

clarifying this important point.


As is clear from (8), α_t and λ_t0 are not separately identified; however, because the

selection problem disappears in the tails of the propensity score, it follows that the CF

becomes zero and that the intercepts from the potential-outcome equations are identified

using observations in the extreme end of the support of P(X). After one estimates the

intercept terms, the ATE and ATT are given by

\[
\widehat{\tau}_{\mathrm{CF,ATE}} = (\widehat{\alpha}_1 - \widehat{\alpha}_0) + \overline{X}\,(\widehat{\beta}_1 - \widehat{\beta}_0) \qquad (9)
\]
\[
\widehat{\tau}_{\mathrm{CF,ATT}} = (\widehat{\alpha}_1 - \widehat{\alpha}_0) + \overline{X}_1(\widehat{\beta}_1 - \widehat{\beta}_0)
+ \widehat{E}(\widetilde{\varepsilon}_1 - \widetilde{\varepsilon}_0 \mid T_i = 1) \qquad (10)
\]

where

\[
\widehat{E}(\widetilde{\varepsilon}_0 \mid T_i = 1) = -\,\frac{1 - \overline{P}(X)}{\overline{P}(X)}
\sum_{s=1}^{S} \widehat{\lambda}_{0s}\, \overline{P}(X)_0^{\,s}
\quad\text{and}\quad
\widehat{E}(\widetilde{\varepsilon}_1 \mid T_i = 1) = -\sum_{s=1}^{S} \widehat{\lambda}_{1s}\, \overline{P}(X)^s
+ \sum_{s=1}^{S} \widehat{\lambda}_{1s}\, \overline{P}(X)_1^{\,s}
\]

and where P̄(X) is the overall mean propensity score, and P̄(X)_t, t = 0, 1, is the mean

propensity score in group t.

Unlike the CF approach, which relies on observations at the extremes of the support

of P(X), the Klein and Vella (2009) (KV) estimator attempts to identify the treatment

effect by using more information from the middle of the support. Our implementation

of the KV estimator relies on a similar functional form assumption to the BVN estimator

in the absence of heteroskedasticity but effectively induces a valid exclusion restriction

in the presence of heteroskedasticity. Specifically, denote the latent treatment by T* =

Xβ − u*, where u* = S(X)u, S(X) is an unknown positive function, and u ∼ N(0, 1).

Here S(X) is intended to allow for a general form of heteroskedasticity in the treatment

effects.

In this case, the probability of receiving the treatment conditional on X is given by

\[
\Pr(T = 1 \mid X) = \Phi\left\{\frac{X\beta}{S(X)}\right\} \qquad (11)
\]

Assuming S(X) = exp(Xδ), the parameters of (11) are estimable by maximum likelihood,

with the log-likelihood function given by³

\[
\ln L = \sum_i \left[ T_i \ln \Phi\left\{\frac{X\beta}{\exp(X\delta)}\right\}
+ (1 - T_i) \ln\left(1 - \Phi\left\{\frac{X\beta}{\exp(X\delta)}\right\}\right) \right] \qquad (12)
\]

3. Our functional form assumption, S(X) = exp(Xδ), is a simplification made to compare the KV

estimator and the other estimators available with the bmte command. For more details on the KV

estimator and alternative functional forms for S(X), see Klein and Vella (2009).


where the element of δ corresponding to the intercept is normalized to zero for identification.

The maximum likelihood estimates are then used to obtain the predicted

probability of treatment, P̂(X), which may be used as an instrument for T in (6),

excluding the selection correction terms.
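The log likelihood in (12) is straightforward to evaluate directly. The Python sketch below codes it for given β and δ (no optimizer; the data and parameter values are invented), with the intercept element of δ set to zero as in the normalization just described:

```python
import math
from statistics import NormalDist

nd = NormalDist()  # standard normal

def het_probit_loglik(beta, delta, X, T):
    """Log likelihood (12) for the heteroskedastic probit with
    S(X) = exp(X * delta). Rows of X include a leading 1 for the
    intercept; delta[0] must be 0 (the identification normalization)."""
    assert delta[0] == 0.0
    ll = 0.0
    for x, t in zip(X, T):
        xb = sum(b * v for b, v in zip(beta, x))
        xd = sum(d * v for d, v in zip(delta, x))
        z = xb / math.exp(xd)          # index scaled by S(X)
        p = nd.cdf(z)
        ll += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return ll

# Invented data: constant plus one covariate; T is the treatment indicator.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
T = [1, 0, 1]
print(het_probit_loglik([0.2, 0.4], [0.0, 0.3], X, T))
```

Setting δ = 0 everywhere collapses (12) to the ordinary probit log likelihood, which is a convenient check on any implementation.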

3.1 Syntax

The bmte command implements the above MB, BC, BVN, CF, and KV estimators as well

as the traditional OLS and IPW estimators. The syntax for the bmte command is

bmte depvar indepvars [if] [in], group(varname) [ee hetero theta(#)

    psvars(indepvars) kv(indepvars) cf(#) pmin(#) pmax(#) psate(#)

    psatt(#) psatu(#) psateee(#) psattee(#) psatuee(#) saving(filename)

    replace bs reps(#) fixp]

3.2 Specication

The bmte command requires the user to specify an outcome variable, depvar, at least

one independent variable, and a treatment assignment variable, group(). Additional

independent variables are optional. The command also uses Stata commands hetprob

and ivreg2 (Baum, Schaffer, and Stillman 2003, 2004, 2005). The remaining options

of the bmte command are detailed below.

3.3 Options

group(varname) specifies the treatment assignment variable. group() is required.

ee indicates that the Edgeworth-expansion versions of the MB, BVN, and BC estimators

be included in addition to the original versions of each respective estimator. The

Edgeworth expansion is robust to deviations from normality in Heckman's BVN

selection model.

hetero allows for heterogeneous treatment effects, with ATE, ATT, and ATU estimates

presented at the mean level of each independent variable.

theta(#) denotes the minimum percentage θ such that both the treatment and control

groups have propensity scores in the interval (P_L, P_U) from (3). Multiple values of

theta() are allowed (for example, theta(5 25), for 5% and 25%). Each value will

form a different estimated treatment effect using the MB and MB-BC estimators.

678 The bmte command

psvars(indepvars) denotes the list of regressors used in the estimation of the propensity

score. If unspecified, the list of regressors is assumed to be the same as the original

covariate list.

kv(indepvars) denotes the list of independent variables used to model the variance in

the hetprob command. Like the psvars() option, the list of kv() regressors is

assumed to be the same as the original covariate list if not explicitly specified.

cf(#) specifies the order of the polynomial used in the CF estimator. The default is

cf(3).

pmin(#) and pmax(#) specify the minimum and maximum propensity scores, respec-

tively, included in the MB estimator. Observations with propensity scores outside

this range will be automatically excluded from the MB estimates. The defaults are

pmin(0.02) and pmax(0.98).

psate(#) through psatuee(#) specify the fixed propensity-score values (specific to each treatment

effect of interest) to be used as the bias-minimizing propensity scores in lieu

of estimating the values within the program itself.

saving(filename) indicates where to save the output.

replace indicates that the output in saving() should replace any preexisting file in

the same location.

bs and reps(#) specify that 95% confidence intervals be calculated by bootstrap using

the percentile method and the number of replications in reps(#). The default is

reps(100).

fixp is an option for the bootstrap command that, when specified, estimates the bias-minimizing

propensity score, P*, and applies this estimate across all bootstrap

replications rather than reestimating at each replication.

4 Example

Following Millimet and Tchernis (2013), we provide an application of the bmte command

to the study of the U.S. school breakfast program (SBP). Specifically, we seek

causal estimates of the ATEs of SBP on child health. The data are from the Early

Childhood Longitudinal Study, Kindergarten Class of 1998–1999, and are available for

download from the Journal of Applied Econometrics Data Archive.4 We provide estimates

of the effect of SBP on growth rate in body mass index from first grade to the

spring of third grade.

4. http://qed.econ.queensu.ca/jae/datasets/millimet001/.


We first define global variable lists XVARS and HVARS and limit our analysis to third-grade

students only. XVARS contains the covariates used in the OLS estimation as well as

in the calculation of the propensity score. HVARS contains the covariates used in the KV

estimator (that is, the variables that enter into the heteroskedasticity portion of the

hetprob command).

(output omitted )

. global XVARS gender age white black hispanic city suburb

> neast mwest south wicearly wicearlymiss momafb momafbmiss

> momft mompt momnw momeda momedb momedc momedd momede ses

> sesmiss bweight bweightmiss hfoodb hfoodbmiss books

> booksmiss momafb2 ses2 bweight2 books2 age2 z1-z22

. global HVARS ses age south city

We then estimate the effect of SBP participation in the first grade (break1) on body

mass index growth (clbmi) by using the bmte command. In our application, we specify a

θ of 5% and 25%, and we estimate bootstrap confidence intervals using 250 replications.

We also specify the ee option, asking that the results include the Edgeworth-expansion

versions of the relevant estimators. The resulting Stata output is as follows:

> kv($HVARS)

(output omitted; the table reports point estimates and bootstrap 95% confidence

intervals for the ATE, ATT, and ATU from the OLS, IPW, MB, MB-EE, MB-BC,

MB-BC-EE, BC, BVN, CF, and KV estimators, together with the estimated

bias-minimizing propensity scores, a joint significance test for the CF step

(F = 5.677, p = 0.000), a weak-instrument test for KV (F = 133.462, p = 0.000),

and a likelihood-ratio test for heteroskedasticity (LR = 27.393, p = 0.000))

Here we focus on the general structure and theme of the output. For a thorough

discussion and interpretation of the results, see Millimet and Tchernis (2013). As indicated

by the section headings, the output presents results for the ATE, ATT, and ATU

using basic OLS and IPW treatment-effects estimates as well as each of the MB (3), MB-BC

(4), BC (5), BVN (7), CF [(9) and (10)], and KV [(11), (12), and (6)] estimators.

Below each estimate is the respective 95% confidence interval.

As discussed in Millimet and Tchernis (2013), separate MB and MB-BC estimates

are presented for each value of θ specified in the bmte command (in this case, 5% and

25%). The results for the CF estimator also include a joint test of significance of all

covariates in the OLS step of the CF estimator (8). Similarly, the KV results include

a test for weak instruments (the Cragg–Donald Wald F statistic and p-value) as well

as a likelihood-ratio test for heteroskedasticity based on the results of hetprob. Also

included in the bmte output is the estimated bias-minimizing propensity score.


Two caveats apply when interpreting these results. First, the MB estimators will generally alter the interpretation of the parameter

being estimated. Thus they may estimate a parameter considered to be uninteresting.

Therefore, researchers should pay attention to the value of P* as well as the attributes

of observations with propensity scores close to this value. Second, none of the estimators

considered here match the performance of a traditional IV estimator, although IV may

also change the interpretation of the parameter being estimated.

5 Remarks

Despite advances in the program evaluation literature, treatment-effects estimators remain

severely limited when the CIA fails and when valid exclusion restrictions are unavailable.

Following the methodology presented in Millimet and Tchernis (2013), we

propose and describe a new Stata command (bmte) that provides a range of treatment-effects

estimates intended to estimate the average effects of the treatment when the CIA

fails and appropriate exclusion restrictions are unavailable.

Importantly, the bmte command provides results that are useful across a range of

alternative assumptions. For example, if the CIA holds, the IPW estimator provided

by the bmte command yields an unbiased estimate of the causal effects of treatment.

The MB estimator then offers a robustness check, given its comparable performance

when the model is correctly specified or overspecified and its improved performance if

the model is underspecified. If, however, the CIA does not hold, the bmte command

provides results that are appropriate under strong functional form assumptions, either

with homoskedastic (BVN or CF) or heteroskedastic (KV) errors, or under less restrictive

functional form assumptions (BC). As illustrated in our example application to the U.S.

SBP, the breadth of estimators implemented with the bmte command provides a broad

picture of the average causal effects of the treatment across a variety of assumptions.

6 References

Baum, C. F., M. E. Schaffer, and S. Stillman. 2003. Instrumental variables and GMM:

Estimation and testing. Stata Journal 3: 1–31.

Baum, C. F., M. E. Schaffer, and S. Stillman. 2004. Software update: Instrumental variables and GMM: Estimation and testing. Stata Journal 4: 224.

Baum, C. F., M. E. Schaffer, and S. Stillman. 2005. Software update: Instrumental variables and GMM: Estimation and testing. Stata Journal 5: 607.

Black, D. A., and J. Smith. 2004. How robust is the evidence on the effects of college

quality? Evidence from matching. Journal of Econometrics 121: 99–124.

Heckman, J., and S. Navarro-Lozano. 2004. Using matching, instrumental variables, and control functions to estimate economic choice models. Review of Economics and

Statistics 86: 30–57.

682 The bmte command

Heckman, J., and R. Robb, Jr. 1985. Alternative methods for evaluating the impact of

interventions: An overview. Journal of Econometrics 30: 239–267.

Heckman, J. J. 1976. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models.

Annals of Economic and Social Measurement 5: 475–492.

Heckman, J. J. 1979. Sample selection bias as a specification error. Econometrica 47: 153–161.

Heckman, J. J., R. J. LaLonde, and J. A. Smith. 1999. The economics and econometrics

of active labor market programs. In Handbook of Labor Economics, ed. O. Ashenfelter

and D. Card, vol. 3A, 1865–2097. Amsterdam: Elsevier.

Hirano, K., and G. W. Imbens. 2001. Estimation of causal effects using propensity score

weighting: An application to data on right heart catheterization. Health Services and

Outcomes Research Methodology 2: 259–278.

Klein, R., and F. Vella. 2009. A semiparametric model for binary response and contin-

uous outcomes under index heteroscedasticity. Journal of Applied Econometrics 24:

735–762.

Millimet, D. L., and R. Tchernis. 2013. Estimation of treatment effects without an exclusion restriction: With an application to the analysis of the school breakfast

program. Journal of Applied Econometrics 28: 982–1017.

Navarro, S. 2008. Control function. In The New Palgrave Dictionary of Economics, ed.

S. N. Durlauf and L. E. Blume, 2nd ed. London: Palgrave Macmillan.

Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in

observational studies for causal effects. Biometrika 70: 41–55.

Rubin, D. B. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66: 688–701.

Ian McCarthy is an assistant professor of economics at Emory University. His research relates

primarily to the fields of health economics, policy, and economic evaluation of health care

programs. Within these areas, he is interested in patient choice, hospital and insurance market

structure, and empirical methodologies in cost and comparative effectiveness research. Prior

to joining Emory University, he was a director in the economic consulting practice at FTI

Consulting and a director of health economics with Baylor Scott & White Health. He received

his PhD in economics from Indiana University.

Rusty Tchernis is an associate professor of economics in the Andrew Young School of Policy

Studies at Georgia State University. He is also a research associate at the National Bureau

of Economic Research and a Research Fellow at the Institute for the Study of Labor. His

primary areas of research are applied econometrics, health economics, and labor economics.

Before becoming a faculty member at the Andrew Young School, he was an assistant professor

in the Department of Economics at Indiana University and a postdoctoral research fellow in


the Department of Health Care Policy at Harvard Medical School. He received his PhD in

economics from Brown University.

Daniel Millimet is a professor of economics at Southern Methodist University and a research

fellow at the Institute for the Study of Labor. His primary areas of research are applied

microeconometrics, labor economics, and environmental economics. His research has been

funded by various organizations, including the United States Department of Agriculture. He

received his PhD in economics from Brown University.

The Stata Journal (2014)

14, Number 3, pp. 684–692

Panel cointegration analysis with xtpedroni

Timothy Neal

University of New South Wales

Sydney, Australia

timothy.neal@unsw.edu.au

Abstract. In this article, I introduce a new command, xtpedroni, that implements the Pedroni (1999, Oxford Bulletin of Economics and Statistics 61:

653–670; 2004, Econometric Theory 20: 597–625) panel cointegration test and

the Pedroni (2001, Review of Economics and Statistics 83: 727–731) group-mean

panel-dynamic ordinary least-squares estimator. For nonstationary heterogeneous

panels that are long (large T) and wide (large N), xtpedroni tests for cointegration

among one or more regressors by using seven test statistics under the null of

no cointegration, and it also estimates the cointegrating equation for each individual

as well as the group mean of the panel. The test can include common time

dummies and accommodate unbalanced panels.

Keywords: st0356, xtpedroni, panel cointegration, panel-dynamic ordinary least

squares, PDOLS, cointegration test, panel time series, nonstationary panels

1 Introduction

In recent years, it has become increasingly popular to use panel time-series datasets for

econometric analysis. These panel datasets are reasonably large in both cross-sectional

(N ) and time (T ) dimensions, as compared with the more conventional panels with very

large N yet small T. Theoretical research into the asymptotics of panel time series has

revealed two crucial differences from the typical panel: the need for slope coefficients

to be heterogeneous (for example, see Phillips and Moon [2000] and Im, Pesaran, and

Shin [2003]) and the concern of nonstationarity. Both differences suggest that the usual

fixed-effects or random-effects estimators are not appropriate for this application.

The long time dimension in panel time series allows one to use regular time-series

analytical tools, such as unit root and cointegration testing, to determine the order of

integration and the long-run relationship between variables. Researchers have proposed

a variety of tests and estimators that (in varying ways) extend time-series tools for panels

while importantly allowing for heterogeneity in the cross-sectional units (as opposed to

simply pooling the data). Users have already implemented several of these tests and

estimators into Stata (for example, see Blackburne and Frank [2007] and Eberhardt

[2012]).

This article and the associated program, xtpedroni, introduce two tools that were

developed in Pedroni (1999, 2001, 2004) for use in Stata. The first tool is seven test

statistics for the null of no cointegration in nonstationary heterogeneous panels with

one or more regressors. The second tool is a between-dimension (that is, group-mean)

panel-dynamic ordinary least-squares (PDOLS) estimator. Both tools can include time

© 2014 StataCorp LP st0356

T. Neal 685

dummies (by time demeaning the data) to capture common time effects among members

of the panel. Nevertheless, they cannot account for more sophisticated forms of cross-sectional dependence.

In this article, I will discuss the theoretical foundations of both tools. I will also

introduce the usage and capabilities of xtpedroni, and apply the program to replicate

the results in Pedroni (2001).

2 Pedroni's cointegration tests

Pedroni (1999, 2004) introduced seven test statistics that test the null hypothesis of no

cointegration in nonstationary panels. The seven test statistics allow heterogeneity in

the panel, both in the short-run dynamics and in the long-run slope and intercept coefficients. Unlike regular time-series analysis, this tool does not consider normalization or the exact number of cointegrating relationships. Instead, the hypothesis test simply gauges the degree of evidence, or lack thereof, for cointegration in the panel among two or more variables.

The seven test statistics are grouped into two categories: group-mean statistics that

average the results of individual country test statistics and panel statistics that pool

the statistics along the within-dimension. Nonparametric ($\rho$ and $t$) and parametric (augmented Dickey–Fuller [ADF] and $v$) test statistics are within both groups.

The test can include common time dummies to address simple cross-sectional dependency, which is applied by time demeaning the data for each individual and variable as follows:

$$\bar{y}_t = \frac{1}{N}\sum_{i=1}^{N} y_{i,t}$$

with each variable then replaced by its deviation from the period mean, for example, $y_{i,t}-\bar{y}_t$.
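As a concrete illustration of this demeaning step, the following Python sketch (with simulated data; not part of xtpedroni) removes the cross-sectional mean from each period of a balanced panel.

```python
import numpy as np

# Hypothetical balanced panel: y[i, t] holds the series for unit i at time t.
rng = np.random.default_rng(0)
N, T = 5, 40
y = rng.normal(size=(N, T)).cumsum(axis=1)  # unit-specific random walks

# Cross-sectional mean for each period: y_bar_t = (1/N) * sum_i y[i, t]
y_bar = y.mean(axis=0)

# Time-demeaned data, used when common time dummies are requested
y_tilde = y - y_bar  # broadcasts y_bar across units

# Each period's cross-sectional mean of the demeaned data is numerically zero
assert np.allclose(y_tilde.mean(axis=0), 0.0)
```

By construction, any shock common to all units in a given period is swept out of the demeaned data.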

All the test statistics are residual-based tests, with residuals collected from the

following regressions:

$$y_{i,t} = \alpha_i + \sum_{m=1}^{M}\beta_{mi}\,x_{mi,t} + e_{i,t}$$

$$\hat{e}_{i,t} = \hat{\gamma}_i\,\hat{e}_{i,t-1} + \hat{u}_{i,t}$$

$$\hat{e}_{i,t} = \hat{\gamma}_i\,\hat{e}_{i,t-1} + \sum_{k=1}^{K}\hat{\gamma}_{i,k}\,\Delta\hat{e}_{i,t-k} + \hat{u}^{*}_{i,t}$$

where i = 1, 2, . . . , N indexes the members of the panel, t = 1, 2, . . . , T is the number of time periods, m = 1, 2, . . . , M is the number of regressors, and k = 1, 2, . . . , K is the number of lags in the ADF regression (selected automatically by xtpedroni with

686 Panel cointegration analysis with xtpedroni

several available options). A linear time trend $\delta_i t$ can be inserted into the regression at the user's discretion.

Next, several series and parameters are calculated from the regressions above.

$$\hat{s}_i^{*2} = \frac{1}{T}\sum_{t=1}^{T}\hat{u}_{i,t}^{*2}, \qquad \tilde{s}_{N,T}^{*2} = \frac{1}{N}\sum_{i=1}^{N}\hat{s}_i^{*2}$$

$$\hat{L}_{11i}^{2} = \frac{1}{T}\sum_{t=1}^{T}\hat{\eta}_{i,t}^{2} + \frac{2}{T}\sum_{s=1}^{k_i}\left(1-\frac{s}{k_i+1}\right)\sum_{t=s+1}^{T}\hat{\eta}_{i,t}\,\hat{\eta}_{i,t-s}$$

$$\hat{\lambda}_i = \frac{1}{T}\sum_{s=1}^{k_i}\left(1-\frac{s}{k_i+1}\right)\sum_{t=s+1}^{T}\hat{u}_{i,t}\,\hat{u}_{i,t-s}$$

$$\hat{s}_i^{2} = \frac{1}{T}\sum_{t=1}^{T}\hat{u}_{i,t}^{2}, \qquad \hat{\sigma}_i^{2} = \hat{s}_i^{2} + 2\hat{\lambda}_i, \qquad \tilde{\sigma}_{N,T}^{2} = \frac{1}{N}\sum_{i=1}^{N}\hat{L}_{11i}^{-2}\,\hat{\sigma}_i^{2}$$

Here $\hat{u}_{i,t}$ and $\hat{u}^{*}_{i,t}$ are the residuals from the two autoregressions above, and $\hat{\eta}_{i,t}$ are the residuals from a regression of $\Delta y_{i,t}$ on the $\Delta x_{mi,t}$.

The seven statistics can then be constructed from the following equations. (See Pedroni

[1999] for a complete discussion on how these statistics are constructed.)

panel $v$: $\displaystyle T^{2}N^{3/2}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{2}\right)^{-1}$

panel $\rho$: $\displaystyle T\sqrt{N}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{2}\right)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_{i}\right)$

panel $t$: $\displaystyle\left(\tilde{\sigma}_{N,T}^{2}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{2}\right)^{-1/2}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_{i}\right)$

panel ADF: $\displaystyle\left(\tilde{s}_{N,T}^{*2}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{*2}\right)^{-1/2}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{*}\Delta\hat{e}_{i,t}^{*}$

group $\rho$: $\displaystyle TN^{-1/2}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\hat{e}_{i,t-1}^{2}\right)^{-1}\sum_{t=1}^{T}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_{i}\right)$

group $t$: $\displaystyle N^{-1/2}\sum_{i=1}^{N}\left(\hat{\sigma}_{i}^{2}\sum_{t=1}^{T}\hat{e}_{i,t-1}^{2}\right)^{-1/2}\sum_{t=1}^{T}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_{i}\right)$

group ADF: $\displaystyle N^{-1/2}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\hat{s}_{i}^{*2}\,\hat{e}_{i,t-1}^{*2}\right)^{-1/2}\sum_{t=1}^{T}\hat{e}_{i,t-1}^{*}\Delta\hat{e}_{i,t}^{*}$

The test statistics are then adjusted so that they are distributed as N (0, 1) under the

null. The adjustments performed on the statistics vary depending on the number of

regressors, whether time trends were included, and the type of test statistic.
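To make the construction concrete, here is a hedged Python sketch of the raw (unstandardized) group-$\rho$ statistic defined above; $\lambda_i$ is set to zero, as it would be for serially uncorrelated residuals, and the final mean and variance standardization from Pedroni (1999) is deliberately not applied.

```python
import numpy as np

# Illustrative sketch only: xtpedroni estimates lambda_i from the data and
# then standardizes the statistic with Pedroni's (1999) tabulated moments.
rng = np.random.default_rng(7)
N, T = 10, 250
# Residual series e[i, t]; pure random walks mimic the null of no cointegration
e = rng.normal(size=(N, T)).cumsum(axis=1)

lam = np.zeros(N)  # lambda_i, set to zero purely for illustration

terms = []
for i in range(N):
    # sum_t (e_{i,t-1} * delta e_{i,t} - lambda_i)
    num = np.sum(e[i, :-1] * np.diff(e[i]) - lam[i])
    den = np.sum(e[i, :-1] ** 2)   # sum_t e_{i,t-1}^2
    terms.append(num / den)

group_rho_raw = T * N ** -0.5 * np.sum(terms)
```

Each term in the loop is the unit-by-unit analogue of $T(\hat{\rho}_i - 1)$ from a time-series unit-root test, which is what the group-mean statistics average across the panel.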

When the null of no cointegration is rejected, the panel v statistic diverges to positive infinity while the other test statistics diverge to negative infinity. Baltagi (2013, 296) provides a formal interpretation of a rejection of the null: "Rejection of the null hypothesis means that enough of the individual cross-sections have statistics far away from the means predicted by theory were they to be generated under the null."

The relative power of each test statistic is not entirely clear, and there may be contradictory results between the statistics. Pedroni (2004) reported that the group and panel ADF statistics have the best power properties when T < 100, with the panel v and group $\rho$ statistics performing comparatively worse. Furthermore, the ADF statistics perform better if the errors follow an autoregressive process (see Harris and Sollis [2003]).

3 Pedroni's PDOLS

Consider the following model:

$$y_{i,t} = \alpha_i + \beta_i x_{i,t} + \epsilon_{i,t}$$

One suitable estimator of this relationship is dynamic ordinary least squares (DOLS), which is a simple yet efficient single-equation estimate of the

cointegrating vector. It can be applied to data that are nonstationary and exhibit a

cointegrating relationship between the variables. We can extend this to panel time-series

data and conduct a DOLS regression on each individual in the above panel as follows:

$$y_{i,t} = \alpha_i + \beta_i x_{i,t} + \sum_{j=-P}^{P}\gamma_{i,j}\,\Delta x_{i,t-j} + \epsilon_{i,t}$$

where i = 1, 2, . . . , N indexes the members of the panel, t = 1, 2, . . . , T is the number of time periods, p = 1, 2, . . . , P is the number of lags and leads in the DOLS regression, $\beta_i$ is the slope coefficient, and $x_{i,t}$ is the explanatory variable. The coefficients and associated t statistics are then averaged over the entire panel by using Pedroni's group-mean method.

$$\hat{\beta}_{GM}^{*} = \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T} z_{i,t} z_{i,t}'\right)^{-1}\sum_{t=1}^{T} z_{i,t}\left(y_{i,t}-\bar{y}_i\right)$$

$$t_{\hat{\beta}_i} = \left(\hat{\beta}_i-\beta_0\right)\left(\hat{\sigma}_i^{-2}\sum_{t=1}^{T}\left(x_{i,t}-\bar{x}_i\right)^{2}\right)^{1/2}$$

$$t_{\hat{\beta}_{GM}^{*}} = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} t_{\hat{\beta}_i}$$

Here $z_{i,t}$ is the $2(P+1)\times 1$ vector of regressors (this includes the lags and leads of the differenced explanatory variable), and $\hat{\sigma}_i^2$ is the long-run variance of the residuals $\epsilon_{i,t}$. $\hat{\sigma}_i^2$ is computed in the program through the Newey and West (1987) heteroskedasticity- and autocorrelation-consistent method with a Bartlett kernel. By default, the maximum lag for the Bartlett kernel is selected automatically for each cross-section in the panel according to $4(T/100)^{2/9}$ (see Newey and West [1994]), but it can also be set manually by the user.
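The group-mean recipe above can be sketched numerically. The following Python sketch is illustrative only: the names are invented, and xtpedroni's internals may differ in trimming and small-sample details. It runs a DOLS regression unit by unit, computes a Bartlett-kernel long-run variance with the automatic bandwidth just described, and averages slopes and t statistics across the panel.

```python
import numpy as np

def bartlett_lrv(u, max_lag):
    """Newey-West long-run variance of u with a Bartlett kernel."""
    T = len(u)
    u = u - u.mean()
    lrv = u @ u / T
    for s in range(1, max_lag + 1):
        lrv += 2.0 * (1.0 - s / (max_lag + 1)) * (u[s:] @ u[:-s]) / T
    return lrv

def dols_unit(y, x, P, beta0=0.0):
    """DOLS slope and t statistic (against beta0) for one panel unit."""
    T = len(y)
    d = np.diff(x)                      # d[t-1] = x[t] - x[t-1]
    t_idx = np.arange(P + 1, T - P)     # keep rows where all leads/lags exist
    Z = np.column_stack(
        [np.ones(t_idx.size), x[t_idx]]
        + [d[t_idx - j - 1] for j in range(-P, P + 1)]  # dx_{t-j}, j = -P..P
    )
    coef, *_ = np.linalg.lstsq(Z, y[t_idx], rcond=None)
    resid = y[t_idx] - Z @ coef
    max_lag = int(np.floor(4 * (T / 100) ** (2 / 9)))   # automatic bandwidth
    sigma2 = bartlett_lrv(resid, max_lag)
    ssx = np.sum((x[t_idx] - x[t_idx].mean()) ** 2)
    t_stat = (coef[1] - beta0) * np.sqrt(ssx / sigma2)
    return coef[1], t_stat

# Simulated cointegrated panel with true slope 1 for every unit
rng = np.random.default_rng(42)
N, T, P = 8, 300, 2
betas, tstats = [], []
for i in range(N):
    x = rng.normal(size=T).cumsum()                   # I(1) regressor
    y = 0.5 + 1.0 * x + rng.normal(scale=0.2, size=T)  # cointegrated with x
    b, t = dols_unit(y, x, P, beta0=1.0)              # "weak PPP"-style null b = 1
    betas.append(b)
    tstats.append(t)

beta_gm = np.mean(betas)              # group-mean point estimate
t_gm = np.sum(tstats) / np.sqrt(N)    # group-mean t statistic
```

With the simulated slope equal to 1 in every unit, the group-mean estimate lands very close to 1 and the group-mean t statistic (against the null of 1) stays moderate, mirroring how the b(1) option is used in the PPP application later in the article.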

In comparison, Kao and Chiang (1997) and Mark and Sul (2003) compute the panel statistics along the within-dimension, with the t statistics designed to test $H_0\colon \beta_i = \beta_0$ against $H_A\colon \beta_i = \beta_A \neq \beta_0$. Pedroni's PDOLS estimator is averaged along the between-dimension (that is, the group mean). Accordingly, the panel test statistics test $H_0\colon \beta_i = \beta_0$ against $H_A\colon \beta_i \neq \beta_0$. Under the alternative hypothesis, the slope coefficients are not constrained to equal a common value $\beta_A$. Pedroni (2001) argues that this is an important advantage for between-dimension panel time-series estimators, particularly when one expects slope heterogeneity.

4 The xtpedroni command

4.1 Syntax

xtpedroni depvar indepvars [if] [in] [, notdum nopdols notest extraobs b(#) mlags(#) trend lagselect(string) adflags(#) lags(#) full average(string)]

4.2 Options

Options that affect the cointegration test and the PDOLS estimation

notdum suppresses time demeaning of the variables (that is, the common time dummies).

Time demeaning is turned on by default. This option may be appropriate to use

when averaging over the N dimension may destroy the cointegrating relationship or

when there are comparability concerns between panel units in the data.

nopdols suppresses PDOLS estimation (that is, reports only the cointegration test re-

sults).

notest suppresses the cointegration tests (that is, reports only PDOLS estimation).

extraobs includes the available observations from the missing years in the time means

used for time demeaning if there is an unbalanced panel with observations missing for

some of the variables (at the start or end of the sample) for certain individuals. This

was the behavior of Pedroni's original PDOLS program but not of the cointegration test program. It is off by default.

b(#) defines the null hypothesis beta as #. The default is b(0).

mlags(#) specifies the number of lags to be used in the Bartlett kernel for the Newey–West long-run variance. If mlags() is not specified, then the number of lags is determined automatically for each individual following Newey and West (1994).

lagselect(string) specifies the criterion used to select lag length in the ADF regressions.

string can be aic (default), bic, or hqic.


adflags(#) specifies the maximum number of lags to be considered in the lag selection process for the ADF regressions. If adflags() is not specified, then it is determined automatically.

lags(#) specifies the number of lags and leads to be included in the DOLS regression.

The default is lags(2).

full reports the DOLS regression for each individual in the panel.

average(string) determines the methodology used to combine individual coefficient estimates into the panel estimate. string can be simple (default), sqrt, or precision. simple takes a simple average and is the behavior of the original Pedroni program. sqrt weighs each estimate according to the square root of the precision matrix, which is the same procedure used for averaging the t statistics. precision weighs each individual's coefficient estimates by its precision.
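For intuition about the lagselect() and adflags() options, the lag search behind a criterion such as aic can be sketched as follows. This is an illustrative Python sketch, not the ado-code; conventions such as sample trimming and penalty terms may differ from xtpedroni's implementation.

```python
import numpy as np

def adf_aic_lag(e, max_k):
    """Pick the lag k in 0..max_k minimizing the AIC of the ADF regression
    de_t = gamma * e_{t-1} + sum_{j=1..k} g_j * de_{t-j} + u_t,
    with every candidate model fit on the same trimmed sample."""
    de = np.diff(e)            # de[t] = e[t+1] - e[t]
    T = len(de)
    best_k, best_aic = 0, np.inf
    for k in range(max_k + 1):
        rows = np.arange(max_k, T)               # common sample across all k
        X = np.column_stack([e[rows]]            # lagged level e_{t-1}
                            + [de[rows - j] for j in range(1, k + 1)])
        y = de[rows]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ coef) ** 2)
        n = rows.size
        aic = n * np.log(rss / n) + 2 * (k + 1)  # Gaussian-likelihood AIC
        if aic < best_aic:
            best_k, best_aic = k, aic
    return best_k

rng = np.random.default_rng(3)
e = rng.normal(size=500).cumsum()   # toy residual series with a unit root
k = adf_aic_lag(e, max_k=8)
```

Swapping the penalty 2(k + 1) for log(n)(k + 1) gives a BIC-style search; hqic uses 2 log(log(n))(k + 1).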

5 The PPP hypothesis

Pedroni (2001) applied the group-mean PDOLS estimator empirically to test the purchasing power parity (PPP) hypothesis. Specifically, it tested the weak long-run PPP, which argues that while nominal exchange rates and aggregate price ratios move together, they may not be directly proportional in the long term. Accordingly, the cointegrating slope may be close to yet different from 1. Pedroni used monthly data on nominal exchange rates and Consumer Price Index deflators from the International Monetary Fund's International Financial Statistics database for this test.

We will now replicate the group-mean PDOLS results with the same dataset and

xtpedroni.

. use pedronidata

. xtset country time

panel variable: country (strongly balanced)

time variable: time, 1973m6 to 1993m11

delta: 1 month

. xtpedroni logexrate logratio, notest lags(5) mlags(5) b(1) notdum

Pedroni's PDOLS (Group mean average):

No. of Panel units: 20 Lags and leads: 5

Number of obs: 4700 Avg obs. per unit: 235

Data has not been time-demeaned.

690 Panel cointegration analysis with xtpedroni

Pedroni's PDOLS (Group mean average):

No. of Panel units: 20 Lags and leads: 5

Number of obs: 4700 Avg obs. per unit: 235

Data has been time-demeaned.

We computed the results without time dummies (by specifying the notdum option), and then with time dummies. We specified the option notest to suppress the results of the cointegration test, which are not yet relevant. The option b(1) instructed the program to compute all t statistics against the null hypothesis that the slope coefficient is equal to 1, which is appropriate for economic interpretation when testing the weak long-run PPP hypothesis. In accordance with Pedroni's original use of the group-mean PDOLS estimator to calculate these results, we set the number of lags and leads in the DOLS regression to 5 by specifying lags(5), and we set the number of lags used in the Bartlett kernel for the Newey–West long-run variance of the residuals to 5 by specifying mlags(5).

We can now replicate the individual DOLS results for each country in the panel as

follows:

(output omitted )

Country       beta   t statistic     Country       beta   t statistic

UK 0.67 1.91 Japan 1.75 5.03

Belgium 0.23 1.96 Greece 0.99 0.37

Denmark 1.90 2.85 Portugal 1.09 2.46

France 2.21 8.09 Spain 1.02 0.18

Germany 0.91 0.60 Turkey 1.11 5.84

Italy 1.08 1.12 NZ 1.02 0.61

Holland 0.66 2.06 Chile 1.37 10.95

Sweden 1.16 0.82 Mexico 1.03 3.60

Switzerland 1.36 2.25 India 2.06 7.80

Canada 1.43 1.88 South Korea 0.88 1.46

The output was compressed into a formatted table for brevity. We specified several options to obtain the exact results. The option full displays the results of estimation for each individual panel unit. Emulating Pedroni's original use of the program for this empirical application, we set the number of lags and leads in the DOLS regression to 4 by specifying lags(4) and the number of lags used in the Bartlett kernel for the Newey–West long-run variance of the residuals to 4 by specifying mlags(4). No common time dummies were used for the individual country results (notdum option).

Pedroni (2004) applied the seven panel cointegration test statistics to the PPP hypothesis. We repeat this procedure as follows:

Pedroni's cointegration tests:
No. of Panel units: 20        Regressors: 1
No. of obs.: 4920             Avg obs. per unit: 246
Data has been time-demeaned.

               Panel      Group
    v          4.735          .
    rho       -2.027     -2.814
    t         -1.434     -2.185
    adf       -.9087     -1.737

All test statistics are distributed N(0,1), under a null of no cointegration,
and diverge to negative infinity (save for panel v).

The results will be inconsistent with those found in Pedroni (2004), because those results relied on a larger sample period than did the Pedroni (2001) dataset we are currently using. The only option we specified here is nopdols, which suppresses the PDOLS estimation results.

Overall, the results indicate a cointegrating relationship between the log of the exchange rate and the log of the aggregate Consumer Price Index ratio. Statistical inference is straightforward because all the test statistics are distributed N(0,1). All the tests, except the panel t and ADF statistics, are significant at least at the 10% level. Furthermore, the PDOLS results support the weak long-run PPP hypothesis. Most of the coefficients are close to 1, but many are notably higher or lower. For a complete discussion of the results, see Pedroni (2001).

6 Acknowledgments

This program is indebted to the work of many individuals, including Peter Pedroni,

Tom Doan, Tony Bryant, Roselyne Joyeux, and an anonymous reviewer.

7 References

Baltagi, B. H. 2013. Econometric Analysis of Panel Data. 5th ed. New York: Wiley.

Blackburne, E. F., III, and M. W. Frank. 2007. Estimation of nonstationary heterogeneous panels. Stata Journal 7: 197–208.

Eberhardt, M. 2012. Estimating panel time-series models with heterogeneous slopes. Stata Journal 12: 61–71.

Harris, R., and R. Sollis. 2003. Applied Time Series Modelling and Forecasting. New York: Wiley.

Im, K. S., M. H. Pesaran, and Y. Shin. 2003. Testing for unit roots in heterogeneous panels. Journal of Econometrics 115: 53–74.

Kao, C., and M.-H. Chiang. 1997. On the estimation and inference of a cointegrated regression in panel data. Manuscript, Syracuse University.

Mark, N. C., and D. Sul. 2003. Cointegration vector estimation by panel DOLS and long-run money demand. Oxford Bulletin of Economics and Statistics 65: 665–680.

Newey, W. K., and K. D. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703–708.

Newey, W. K., and K. D. West. 1994. Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61: 631–653.

Pedroni, P. 1999. Critical values for cointegration tests in heterogeneous panels with multiple regressors. Oxford Bulletin of Economics and Statistics 61: 653–670.

Pedroni, P. 2001. Purchasing power parity tests in cointegrated panels. Review of Economics and Statistics 83: 727–731.

Pedroni, P. 2004. Panel cointegration: Asymptotic and finite sample properties of pooled time series tests with an application to the PPP hypothesis. Econometric Theory 20: 597–625.

Phillips, P. C. B., and H. R. Moon. 2000. Nonstationary panel data analysis: An overview of some recent developments. Econometric Reviews 19: 263–286.

Timothy Neal is currently a PhD candidate in the Australian School of Business at the University of New South Wales. His research interests include income inequality, panel econometrics, and welfare economics.

The Stata Journal (2014)

14, Number 3, pp. 693–696

Stata and Dropbox

Raymond Hicks

Woodrow Wilson School

Niehaus Center for Globalization and Governance

Princeton University

Princeton, NJ

rhicks@princeton.edu

Abstract. Dropbox makes scholarly collaboration easier by allowing scholars to share files across different computers. However, because the Dropbox directories have different pathnames for different users, sharing do-files can be complicated. In this article, I offer some tips on how to navigate pathnames in do-files when using Dropbox, and I present a command that automatically finds and changes to a user's Dropbox directory.

Keywords: pr0058, dropbox, Dropbox, directories, tips

1 Introduction

Dropbox makes scholarly collaboration much easier because it allows scholars to share files across different computers. At the same time, sharing do-files in Dropbox presents its own complications. Because users may install Dropbox in different locations and because users have different usernames, often on different computers, directory paths to Dropbox folders may not work in do-files. This is especially likely when multiple Dropbox users collaborate. Here I present some tips on how to overcome these difficulties.

2 Issues

There are three issues in using Stata with Dropbox. Two issues involve potential difficulties in syncing files. The third issue, which this article discusses more, is the pathnames of files.

2.1 Syncing

One issue with using Dropbox to share files is that Dropbox automatically syncs files as they are saved. Stata do-files can get ahead of the Dropbox synchronization if, for instance, a user saves files and then appends these files soon after in a loop. It may also happen if a user saves a file and then uses it. This problem can be solved with a sleep command at the end of the loop. Telling Stata to wait for five seconds or so before continuing the loop will usually solve the problem.

© 2014 StataCorp LP pr0058

694 Stata and Dropbox

A second issue may arise if multiple users have the same file open simultaneously. Changes made by one user may not be saved if another user also has the file open. Therefore, some people may want to store the data and do-files outside of Dropbox and share only log and result files in Dropbox.

2.2 Pathnames

Many users, however, will want to store data and do-files in a shared Dropbox folder, especially users that do all of their Stata work within do-files, including opening and saving files located in Dropbox. To open or save the files, Stata needs a pathname so that it knows where the Dropbox folder is located. Because the Dropbox directory is usually placed within a user's home directory, this creates a potential problem. Different people will have different usernames, and even the same user may have different usernames on office and personal computers. If a do-file explicitly refers to a specific username, the do-file will stop running if the username does not exist on the computer. For example, use /users/jdoe/Dropbox/data1.dta will not work if the user's name is johndoe. This type of failure may make collaborating or using multiple computers (such as home and office) frustrating.

Moreover, two other issues may complicate sharing files in Dropbox. First, different computers have different conventions for pathnames. Although cd /users/username/Dropbox will work on Windows and Mac computers, it will not work on Unix computers. For Macs and Unix, cd ~/Dropbox will work, but it will not work with Windows.1 Second, Dropbox can be installed in a default location (/users/username/Dropbox), but many users install it in different places. Some users install it as My Dropbox, while others store it within their Documents folder (/users/username/Documents/Dropbox).

All three of these issues potentially make it difficult to share Stata files in Dropbox.

3 Solutions

There are several different ways to ensure that everyone can easily share and use Stata do-files in Dropbox without errors. I discuss the advantages and drawbacks of the different ways below.

3.1 Edit file

One solution, at least for Windows users, is to open do-files using the edit option. The user does not have to specify a pathname, because Stata will automatically change the

1. I use /users/username to refer to a user's home directory because most users use Windows or Macs. Unix users should read it as ~.

R. Hicks 695

directory to the one where the do-file is located. From there, relative paths can be used to negotiate around the shared directory. The biggest drawback to this method is that it is limited to Windows users. It also does not fit with how a lot of people use Stata, because each time a user wants to open a do-file in a different directory, the user has to open a new instance of Stata or change the directory within Stata.

3.2 Capture

Other users may prefer to use the capture command to change the directory. Here each user puts a change directory (cd) command to his or her Dropbox folder preceded by the capture command, which prevents Stata from returning an error and aborting the do-file if the specified directory does not exist. As the number of users increases, or if users have different usernames for their home and office computers, keeping track of all the different directories becomes difficult.

3.3 c(username)

Stata stores the user's name in a c-class value called c(username). If all users have Dropbox in the same place, the macro can be used to specify the Dropbox directory. As noted above, one of the common places users store Dropbox is in /users/username/Dropbox/. The username is stored by Stata as c(username), which can be inserted as a local in the change directory command: cd /users/`c(username)'/Dropbox. This will work as long as all users have Dropbox installed in the same directory. However, some users may install Dropbox in /users/username/My Dropbox/ or in /users/username/Documents/Dropbox. If this is the case, then c(username) will not work. Moreover, as noted above, this will work with Windows and Mac computers but not with Unix computers. If all collaborators use Unix or Macs, they could use ~/Dropbox to go to the root Dropbox directory.

3.4 dropbox.ado

A final solution is to use an ado-file I created, dropbox.ado, which looks for the Dropbox directory in the most common places that users install Dropbox. It starts in the most commonly used location (/users/username/Dropbox for Windows and ~/Dropbox for Mac and Unix computers) and then searches within the Documents directory and then the root directory to find Dropbox. The command returns the local Dropbox directory as r(db), and unless the nocd option is specified, it changes the directory to a user's root Dropbox directory. From there, the relative paths of all users within Dropbox will be the same. The command also uses the username macro to look for the Dropbox directory.
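The search order described above can be sketched in a few lines; this is an illustrative, language-agnostic Python sketch, and the candidate list does not reproduce the ado-file's actual code.

```python
import os

def find_dropbox():
    """Try the common Dropbox locations in order and return the first hit.
    Returns None when no candidate directory exists (the caller can then
    fall back to an explicit path), analogous to dropbox.ado's r(db)."""
    home = os.path.expanduser("~")              # works on Windows, Mac, Unix
    candidates = [
        os.path.join(home, "Dropbox"),          # default install location
        os.path.join(home, "My Dropbox"),       # older installer default
        os.path.join(home, "Documents", "Dropbox"),
    ]
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

db = find_dropbox()
```

As with dropbox.ado, a single shared search routine means collaborators never hard-code each other's usernames into do-files.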


This command is limited because it may not provide the correct Dropbox directory

if a user has more than one instance of Dropbox installed. It will not work if a Windows

user has Dropbox installed on a drive other than the c: drive. Also the command will

work only if all shared users have the command on their computers.

4 Conclusion

Using multiple computers and sharing files in the Cloud is increasingly common. In this article, I presented some tips on how to best handle do-files shared with the popular Dropbox program. Here I conclude with a couple of general tips about navigating directories when sharing do-files.

First, avoid using the backslash when setting paths; instead, use a forward slash. The backslash is used only by Windows machines; it is also used as an escape character by Stata, which often causes confusion when users include locals in their pathnames. For example, c:\users\`c(username)'\Dropbox will not work in Stata because Stata will ignore the backslash between users and `c(username)'. Both Unix and Macs use the forward slash in directories, and Windows recognizes the forward slash, so it is a costless change. It will also ensure conformability across operating systems. Similarly, Windows users should avoid references to the c:\ drive as often as possible. Sometimes, this is unavoidable, especially with network drives or with partitioned drives. However, if all work is done on the c:\ drive, Windows will recognize cd / as referring to the c:\ drive, which brings Windows syntax in line with Unix and Mac syntax.

Second, users should become familiar with the commands to move around directories

without specifying full path names. Users can move up one directory using cd ..

or up two directories using cd ../... From the current directory, users can move

down a directory by specifying only the new directory name. For example, to go from

/users/username/Dropbox/ to /users/username/Dropbox/Shared Folder/, one can

type cd "Shared Folder".

Raymond Hicks is a statistical programmer in the Niehaus Center for Globalization and Gov-

ernance at Princeton University, where he focuses on trade and monetary issues.

The Stata Journal (2014)

14, Number 3, pp. 697–700

Review of An Introduction to Stata for Health Researchers, Fourth Edition, by Juul and Frydenberg

Ariel Linden

Linden Consulting Group, LLC

Ann Arbor, MI

alinden@lindenconsulting.org

Abstract. In this article, I review An Introduction to Stata for Health Researchers, Fourth Edition, by Svend Juul and Morten Frydenberg (2014 [Stata Press]).

Keywords: gn0061, introduction to Stata, data management, statistical analysis,

health research

1 Introduction

For instructors of measurement and evaluation and individuals seeking methodological guidance, it is difficult to find a book that both covers key analytic concepts and provides clear direction on how to perform the associated analyses in a given statistical software package. The fourth edition of An Introduction to Stata for Health Researchers, by Svend Juul and Morten Frydenberg, fills this need. It does an excellent job of covering a wide range of measurement and evaluation topics while providing a gentle introduction to Stata for those unfamiliar with the software. In fact, though the title suggests the book is for health researchers, it is readily generalizable to many disciplines that implement the same methods.

Many improvements have been made to the book since John Carlin's review of the inaugural edition in 2006 (Carlin 2006), including a reorganization of chapters to more closely mirror the typical flow of a research project, an increase in the number of practice exercises, and a more focused treatment of statistical issues. Additionally, this fourth edition has been updated for Stata 13. On the whole, Juul and Frydenberg have prepared a very accessible book for readers with varied levels of proficiency in statistics or Stata, or both.

2 Overview

Section I includes four chapters (called the basics) that introduce the reader to Stata. These chapters cover such issues as installing the program, getting help, understanding file types, and using command syntax. While a novice could go directly to the Stata user's manual (in particular, Getting Started with Stata and the Stata User's Guide), this book offers a more user-friendly introduction. Combined, these 35 pages are more than sufficient to get a Stata novice up and running.

© 2014 StataCorp LP gn0061

698 Review of An Introduction to Stata for Health Researchers

Section II includes six chapters dealing with issues pertaining to data management,

such as variable types (numeric, dates and strings) and their manipulation and storage

(chapter 5); importing and exporting data (chapter 6); applying labels (chapter 7);

generating and replacing values and performing basic calculations (chapter 8); and

changing data structure, such as appending, merging, reshaping, and collapsing data

(chapter 9). Chapter 10 provides excellent advice on creating documentation (via do-files and logs, etc.) to ensure reproducibility of data management and analytic steps. While creating documentation is seemingly intuitive, not all researchers consistently follow these steps.

Section III includes five chapters focusing on the types of data analyses most widely

used in health-related research.

Chapter 11 starts with basic descriptive analytics and then continues on to analyses using epidemiologic tables for binary variables (including the addition of stratified variables). This naturally progresses to analyses of continuous variables, and the chapter demonstrates some visual displays of the data (histograms, Q–Q plots, and kernel density plots) and methods of tabulation. The chapter then ventures into more formal basic statistical analyses, such as t tests, one-way analysis of variance, and nonparametric techniques (ranksum).

Chapter 12 presents ordinary least-squares and logistic regression, with a fair amount

of exposition on the use of lincom for postestimation.

Chapter 13 describes time-to-event analyses, starting with simple curves and tables,

and then moves into progressively more complex Cox regression models (without and

with time-varying covariates). Next it introduces Poisson models to examine more

complex models for rates. Finally, it includes a brief discussion on indirect and direct

standardization.

Chapter 14 is titled Measurement and diagnosis, and it describes graphical plots

and statistical tests for assessing measurement variation at one time point, and then

again over multiple measurements, for dependent samples. This transitions into methods

used for assessing accuracy of diagnostic tests (that is, sensitivity, specificity, area under

the curve, etc.).

Chapter 15 (Miscellaneous) includes topics such as random sampling, sample-size calculations (including a nice example using simulation to estimate power for a noninferiority study), error trapping, and log files.

Section IV includes one comprehensive chapter on graphs (44 pages). The chapter

begins by plotting a basic graph and describing the various elements, and it progresses

with increasing sophistication. It ends with some important tips on saving the code in do-files so that graphs can be reproduced or enhanced later.

The final section, section V, is composed of a single chapter titled Advanced topics and discusses storing and using results after estimation and defining macros and scalars. It then discusses looping through data using foreach, forvalues, and if/then statements. The chapter ends with a brief overview of creating user-written commands.

A. Linden 699

3 Comments

The book is well organized, following the logical step-by-step approach that investigators

apply to their research: data acquisition and management, analysis, and presentation

of results. The many brief examples are useful and generalizable, and the footnotes are helpful additions. When a topic is briefly touched upon, the authors refer the reader to the relevant help resource in Stata for more details. They also provide helpful recommendations for resolving issues that may have multiple solutions.

Another strength of the book is that it contains many important but often overlooked details (even for advanced Stata users), such as why a value may appear differently when formatted as float versus double (pages 45–46) and how this precision may impact comparisons. Other examples include the use of numlabel to display both the value and the value label of a variable (page 67), the use of egen cut() to easily recode continuous variables into categories (page 75), and setting showbaselevels to display a line for the reference level in regression output (page 153). Of arguably greatest value is the fact that the authors continually emphasize the importance of developing good habits in documenting the work process (using do-files and logs) so that all output can be replicated, errors can be tracked down, and time-consuming procedures can be performed repeatedly and efficiently.

There is very little that I would change about this book, and my suggestions all relate to what the authors could consider for future editions. First, the authors use lincom and testparm extensively in the chapters on regression and time-to-event analyses. Readers would benefit from seeing examples using margins (followed by marginsplot). margins is an extremely flexible command that allows the user to perform various analyses after running regression models, mostly with little additional specification. The authors currently provide only a footnote (page 150) pointing interested readers to the excellent book written by Michael N. Mitchell (2012). Second, some mention of parametric regression models for survival analysis would be valuable (using streg), because readers in certain disciplines may prefer these models over Cox regression models (using stcox).

Finally, while Stata 13 introduced a new set of commands to estimate treatment effects using propensity score-based matching and weighting techniques, the only mention of such approaches is in appendix A, where the authors briefly describe the Stata Treatment-Effects Reference Manual by saying this: "Despite its title, it does not correspond to the methods of analysis that are mainstream in health research." This statement left me somewhat perplexed, given that graduate programs in public health in the United States have a required course in program evaluation that likely covers these methods in at least some detail. Furthermore, there is a growing body of health research literature where using these methods has become commonplace (see, for example, Austin [2007; 2008]). Readers would benefit from an introduction to these techniques, perhaps as a final chapter in which some of the datasets analyzed in previous chapters using regression are reanalyzed using one of these approaches and the results compared. The Stata Treatment-Effects Reference Manual offers an excellent


introduction to the methods implemented in Stata, and Stuart (2010) provides a more comprehensive discussion of treatment-effects estimation using an array of approaches.
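For readers unfamiliar with the Stata 13 commands discussed above, a minimal sketch of the teffects syntax, using hypothetical variables (outcome y, binary treatment t, covariates x1 and x2), might look like this:

```stata
* Hypothetical example: treatment-effects estimation in Stata 13
teffects psmatch (y) (t x1 x2)       // propensity-score matching, ATE
teffects ipw (y) (t x1 x2, logit)    // inverse-probability weighting
teffects ipw (y) (t x1 x2), atet     // ATET rather than ATE
```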

In summary, I strongly recommend this book both for students in introductory

measurement and evaluation courses and for more seasoned health researchers who

would like to avoid a steep learning curve when trying to conduct analyses in Stata.

4 References

Austin, P. C. 2007. Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: A systematic review and suggestions for improvement. Journal of Thoracic and Cardiovascular Surgery 134: 1128–1135.

———. 2008. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Statistics in Medicine 27: 2037–2049.

Stata Journal 6: 580–583.

Juul, S., and M. Frydenberg. 2014. An Introduction to Stata for Health Researchers. 4th ed. College Station, TX: Stata Press.

Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata. College Station, TX: Stata Press.

Stuart, E. A. 2010. Matching methods for causal inference: A review and a look forward. Statistical Science 25: 1–21.

Ariel Linden is a health services researcher specializing in the evaluation of health care interventions. He is both an independent consultant and an adjunct associate professor at the University of Michigan in the Department of Health Management and Policy, where he teaches program evaluation.

The Stata Journal (2014)

14, Number 3, p. 701

Software Updates

trolled trials. K. Hemming and J. Marsh. Stata Journal 13: 114–135.

The original command restricted both the coefficient of variation of cluster sizes (size cv) and of the outcome (cluster cv) to be less than 1. This was an incorrect restriction and has been removed. The help file also incorrectly referred to cluster cv as being the coefficient of variation of the cluster sizes, when it is the coefficient of variation of the outcome.

st0295 1: Generating Manhattan plots in Stata. D. E. Cook, K. R. Ryckman, and J. C. Murray. Stata Journal 13: 323–328.

The manhattan package has been updated because there was an error in the way Bonferroni lines were drawn in the Manhattan plots. The update fixes this issue.

st0301 2: Fitting the generalized multinomial logit model in Stata. Y. Gu, A. R. Hole, and S. Knox. Stata Journal 13: 382–397.

A new noscale option has been added to gmnlbeta. By default, gmnlbeta calculates the individual-level scaled parameters (as in equation 2 of Gu, Hole, and Knox [2013]). When noscale is specified, gmnlbeta instead calculates the individual-level parameters without scaling by sigma.

st0331 1: Estimating marginal treatment effects using parametric and semiparametric methods. S. Brave and T. Walstrum. Stata Journal 14: 191–217.

This update to the margte command includes a bug fix for the parametric normal model fit by maximum likelihood. When run with the mlikelihood option, margte interfaces with the movestay command (Lokshin and Sajaia 2004, 2005a,b). The previous version of margte produced incorrect results when reading the output of movestay, except under a particular parameterization of the generalized Roy model. An updated help file is included to help clarify the differences in the treatment of the parameters of the generalized Roy model in movestay and margte.

References

Gu, Y., A. R. Hole, and S. Knox. 2013. Fitting the generalized multinomial logit model in Stata. Stata Journal 13: 382–397.

Lokshin, M., and Z. Sajaia. 2004. Maximum likelihood estimation of endogenous switching regression models. Stata Journal 4: 282–289.

———. 2005a. Software update: st0071 1: Maximum likelihood estimation of endogenous switching regression models. Stata Journal 5: 139.

———. 2005b. Software update: st0071 2: Maximum likelihood estimation of endogenous switching regression models. Stata Journal 5: 471.

© 2014 StataCorp LP up0044
