Wooldridge

Attribution Non-Commercial (BY-NC)

1.2K views

Wooldridge

Attribution Non-Commercial (BY-NC)

- Introductory econometrics test bank
- Wooldridge Solution chapter 3
- Wooldridge.solutions.chap.7 11
- Wooldridge.solutions.chap.2 6
- Econometrics Cheat Sheet Stock and Watson
- Jeffrey M Wooldridge Solutions Manual and Supplementary Materials for Econometric Analysis of Cross Section and Panel Data 2003
- Solucionario Econometría Jeffrey M. Wooldridge
- wooldridge(1)
- Econometrics
- Introduction to Econometrics James Stock Watson 2e Part2
- Jeffrey M. Wooldridge-Econometric Analysis of Cross Section and Panel Data, 2nd Edition(2010)
- Introductory Eco No Metrics Wooldridge 4th Edition Solutions Manual
- econometrics Solutions
- Basic Econometrics Chapter 3 Solutions
- Econ5025 Practice Problems
- UNSW -Econ2206 Solutions Semester 1 2011 - Introductory Eco No Metrics
- Econometrics
- Econometric Analysis of Cross Section and Panel Data 2010.pdf
- 236185337-Solucionario-Econometria-Jeffrey-M-Wooldridge.pdf
- Chapter Specific Appendices

You are on page 1of 57

AN EXTENDED ANALYSIS

Jeffrey M. Wooldridge

Department of Economics

Michigan State University

East Lansing, MI 48824-1038

(517) 353-5972

wooldri1@msu.edu

This version: June 2006

1

ABSTRACT

This is an expanded version of Wooldridge (2003), which provided an overview of

cluster-sample methods in linear models. Here I include additional details for linear models

and provide, as much as possible, a parallel treatment for nonlinear models, including a

summary of strategies for dealing with data sets having a small number of clusters.

Keywords: Cluster Correlation; Generalized Estimating Equations; Minimum Distance

Estimation; Panel Data; Robust Variance Matrix; Unobserved Effect

JEL Classification Codes: C13, C21, C23

2

1. INTRODUCTION

In Wooldridge (2003), I provided a brief overview of econometric approaches to analyzing

cluster samples in the context of a linear regression model. I considered cases with both large

and small cluster sizes (relative to the number of clusters). That treatment was necessarily

terse, and some subtle issues were only briefly mentioned or neglected entirely.

The asymptotic theory for the case with a large number of clusters (and relatively small

cluster sizes), either in linear or nonlinear models, has been pretty well worked out; for a

summary, see, for example, Wooldridge (2002). Just as importantly, popular statistical

packages, such as Stata

arbitrary cluster correlation for a variety of linear and nonlinear estimation methods. Still,

while accounting for clustering in data is much more common than it was 10 years ago,

inference methods robust to cluster correlation are still not used routinely in all relevant

applications. I think that is partly because empirical researchers are not entirely sure when

certain estimators are robust to various kinds of misspecification. I hope this expanded paper

helps to fill that gap.

For nonlinear models, there are some open modeling questions for cases where the group

sizes vary a common situation with true cluster samples and one wants to allow correlation

between the unobserved group effect (or heterogeneity) and the covariates that vary within

group. I discuss the modeling issues, and offer some tentative solutions, in Section 3.1. This

is very preliminary and hopefully generates some interest in the problem.

The case of a small number of clusters has received much attention recently, particularly

3

for linear models. In Wooldridge (2003), I summarized two possible ways to estimate the

effects of cluster-level covariates on individual-specific outcomes when the number of clusters

is small (while the sizes of the clusters are moderately large). One approach, suggested by

Donald and Lang (2001), is to effectively treat the number of groups as the number of

observations, and use finite sample analysis (with individual-specific unobservables becoming

unimportant relative to the cluster effect as the cluster sizes get large). A second approach

is to view the cluster-level covariates as imposing restrictions on cluster-specific intercepts in a

set of individual-specific regression models, and then imposing and testing the restrictions

using minimum distance estimation. The minimum distance (MD) approach, when valid, has

the benefit of allowing standard asymptotic analysis under a weak set of assumptions, at least

when the group sample sizes are large. When it is valid, the MD approach can be expected to

lead to stronger statistical significance than the Donald and Lang approach when there are few

groups. Campolieti (2004) applies both the Donald and Lang (2001) (or DL, for short) and

minimum distance approaches.

Conveniently, both the DL and minimum distance approaches can be extended to a broad

class of nonlinear models. As in the linear case, the conditions under which minimum distance

can be justified are less restrictive than those for the DL approach (which, importantly,

requires unobserved cluster effects to be drawn from a homoskedastic normal distribution).

Plus, the DL approach does not allow one to obtain standard errors for the partial effects

themselves in nonlinear models. I cover this case in Section 3.2.

2. THE LINEAR MODEL WITH CLUSTER

4

EFFECTS

This section considers linear models estimated using cluster samples (of which a panel data

set is a special case). For each group or cluster g, let y

gm

, x

g

, z

gm

: m 1, . . . , M

g

be the

observable data, where M

g

is the number of units in cluster g, y

gm

is a scalar response, x

g

is a

1 K vector containing explanatory variables that vary only at the group level, and z

gm

is a

1 L vector of covariates that vary within (as well as across) groups.

2.1. Specification of the Model

The linear model with an additive error is

y

gm

o x

g

[ z

gm

, v

gm

, m 1, . . . , M

g

; g 1, . . . , G. (1)

Our approach to estimation and inference in equation (1) depends on several factors, including

whether we are interested in the effects of aggregate variables [ or individual-specific

variables ,. Plus, we need to make assumptions about the error terms. In the context of pure

cluster sampling, an important issue is whether the v

gm

contain a common group effect that can

be separated in an additive fashion, as in

v

gm

c

g

u

gm

, m 1, . . . , M

g

, (2)

where c

g

is an unobserved cluster effect and u

gm

is the idiosyncratic error. (In the statistics

literature, (1) and (2) are referred to as a hierarchical linear model.) One important issue is

whether the explanatory variables in (1) can be taken to be appropriately exogenous. Under

5

(2), exogeneity issues are usefully broken down by separately considering c

g

and u

gm

.

Throughout we assume that the sampling scheme generates observations that are

independent across g. This assumption can be restrictive, particularly when the clusters are

large geographical units. This paper does not formally consider problems of spatial

correlation across clusters, although below I will mention some benefits of fixed effects

estimators in such settings.

Appropriate sampling assumptions within cluster are more complicated. Theoretically, the

simplest case also allows the most flexibility for robust inference: from a large population of

relatively small clusters, we draw a large number of clusters (G), where cluster g has M

g

members. This setup is appropriate, for example, in randomly sampling a large number of

families, classrooms, or firms from a large population. The key feature is that the number of

groups is large enough relative to the group sizes so that we can allow essentially unrestricted

within-cluster correlation. Randomly sampling a large number of clusters also applies to many

panel data sets, where the cross-sectional population size is large (say, individuals, firms, even

cities or counties) and the number of time periods is relatively small. In the panel data setting,

G is the number of cross-sectional units and M

g

is the number of time periods for unit g.

A different sampling scheme results in data sets that also can be arranged by group. We

first stratify the population into G 2 nonoverlapping groups and then obtain a random

sample of size M

g

from each group. Ideally, the strata sizes are large in the population,

hopefully resulting in large M

g

. I adopt this perspective for the small G case in Section 2.3.

2.2. Large Group Asymptotics

6

In this section I review methods and estimators justified when the asymptotic

approximations theory is with G and the group sizes are fixed. The theory is well

developed; see, for example, White (1984) and Wooldridge (2002, Chapters 10, 11). Here, I

summarize the large G theory, emphasizing how one might wish to use methods robust to

cluster sampling even when it is not so obvious.

First suppose that the covariates satisfy

Ev

gm

|x

g

, z

gm

0, m 1, . . . , M

g

; g 1, . . . , G. (3)

For consistency, we can, of course, get by with zero correlation assumptions, but we use (3) for

convenience because it meshes well with assumptions concerning conditional second

moments. Importantly, the exogeneity in (3) only requires that z

gm

and v

gm

are uncorrelated.

In particular, it does not specify assumptions concerning v

gm

and z

gp

for m p. In the panel

data literature, where m is a time index, (3) is called the contemporaneous exogeneity

assumption; see Wooldridge (2002, Chapter 7). Allowing for correlation between v

gm

and

z

gp

, m p is useful for some panel data applications and possibly even cluster samples (if the

covariates of one unit can affect another units response). Under (3) and a standard rank

condition on the covariates, the pooled OLS estimator, where we regress y

gm

on

1, x

g

, z

gm

, m 1, . . . , M

g

; g 1, . . . , G, is consistent for z o, [

, ,

(as G with M

g

fixed) and G -asymptotically normal. See, for example, Wooldridge (2002, Sections 7.8).

Without more assumptions, a robust variance matrix is needed to account for correlation

within clusters or heteroskedasticity in Varv

gm

|x

g

, Z

g

, or both. When v

gm

has the form in (2),

the amount of within-cluster correlation can be substantial, which means the usual OLS

7

standard errors can be very misleading (and, in most cases, systematically too small). In the

case of common cluster sizes, Wooldridge (2002, Section 7.8) gives the formula for a

variance-matrix estimator that assumes no particular kind of within-cluster correlation nor a

particular form of heteroskedasticity. Allowing for different group sizes is straightforward.

Write the equation by cluster as

y

g

W

g

z v

g

, g 1, . . . , G, (4)

where W

g

is the M

g

1 K L matrix of all regressors for group g. Then the

1 K L 1 K L variance matrix estimator is

Avarz

g1

G

W

g

W

g

1

g1

G

W

g

v

g

v

g

W

g

g1

G

W

g

W

g

1

(5)

where v

g

is the M

g

1 vector of pooled OLS residuals for group g. This asymptotic variance

is now computed routinely by packages such as Stata

however, to remember that the standard errors and test statistics obtained are known to be valid

only as G with each M

g

fixed. Probably they are valid if M

g

increases at a rate somewhat

less than G. Practically, the important issue is how well the robust variance matrix estimator

works for common cluster sample sizes, something I comment on below.

In estimating the parameters in (1), the pooled OLS estimator ignores the within-cluster

correlation of the v

gm

. If we strengthen the exogeneity assumption to

Ev

gm

|x

g

, Z

g

0, m 1, . . . , M

g

; g 1, . . . , G, (6)

where Z

g

is the M

g

L matrix of unit-specific covariates, then we can exploit the presence of

c

g

in (2) in a generalized least squares (GLS) analysis. [In the panel data case, (6) is known as

the strict exogeneity assumption on z

gm

: m 1, . . . , M

g

.] The standard random effects

8

approach makes enough assumptions so that the M

g

M

g

variance-covariance matrix of

v

g

v

g1

, v

g2

, . . . , v

g,Mg

Varv

g

o

c

2

j

Mg

j

Mg

o

u

2

I

Mg

, (7)

where j

Mg

is the M

g

1 vector of ones and I

Mg

is the M

g

M

g

identity matrix. In the standard

setup, we also make the system homoskedasticity assumption,

Varv

g

|x

g

, Z

g

Varv

g

. (8)

It is important to understand the role of assumption (8): it implies that the conditional

variance-covariance matrix is the same as the unconditional variance-covariance matrix, but it

does not restrict Varv

g

. The particular random effects structure on Varv

g

is given by (7).

Under (7) and (8), the resulting GLS estimator is the well-known random effects (RE)

estimator; see, for example, Wooldridge (2002, Section 10.3).

The random effects estimator is asymptotically more efficient than pooled OLS under (6),

(7), and (8) (and a standard rank condition). The RE estimates and test statistics are computed

routinely by popular software packages. Nevertheless, an important point is often overlooked

in applications of RE: one can, and in many cases should, make inference completely robust to

an unknown form of Varv

g

|x

g

, Z

g

. Wooldridge [2002, equation (7.49)] gives the robust

formula in the balanced case (same group sizes), and the formula extends easily to the

unbalanced case. One way to think about the robust formula is to characterize RE as a pooled

OLS estimator on quasi-time demeaned data. For each g, define

0

g

1 1/1 M

g

o

c

2

/o

u

2

1/2

, where o

c

2

and o

u

2

are estimators of the two variance

parameters. Then the RE estimator is identical to the pooled OLS estimator of

y

gm

0

g

y g

on 1 0

g

, 1 0

g

x

g

, z

gm

0

g

zg

, m 1, . . . , M

g

; g 1, . . . , G; (9)

9

see, for example, Hsiao (1986). For fully robust inference, we can just apply the fully robust

variance matrix estimator in (5) but on the transformed data.

The idea in obtaining a fully robust variance matrix of RE is straightforward, and dates

back at least to the generalized estimating equation (GEE) literature [Liang and Zeger

(1986)]. Even if Varv

g

|x

g

, Z

g

does not have the RE form, the RE estimator is still consistent

and G -asymptotically normal under (6), and it is likely to be more efficient than pooled OLS.

[Except in unusual cases, incorrectly accounting for within-group correlation by assuming

constant correlation within-group correlation is likely to be more efficient than entirely

ignoring within-group correlation.] Yet we should recognize that the RE second moment

assumptions can be violated without causing inconsistency in the RE estimator. Making

inference robust to serial correlation in the idiosyncratic errors for panel data applications,

particularly with more than a few time periods, can be very important. But within-group

correlation in the idiosyncratic errors can arise for cluster samples, too, especially if underlying

(1) is a random coefficient model, where z

gm

,

g

replaces z

gm

,. By estimating a standard

random effects model that assumes common slopes ,, we effectively include z

gm

,

g

, in the

idiosyncratic error; this generally creates within-group correlation because z

gm

,

g

, and

z

gp

,

g

, will be correlated for m p, conditional on Z

g

. Also, the idiosyncratic error will

have heteroskedasticity that is a function of z

gm

. Nevertheless, if we assume

E,

g

|X

g

, Z

g

E,

g

, along with (6), the random effects estimator still consistently

estimates the average slopes, ,. Therefore, in applying random effects to panel data or cluster

samples, it is sensible (with large G) to make the variance estimator of random effects robust to

arbitrary heteroskedasticity and within-group correlation.

With panel data so that there is a natural ordering of the observations within group it

10

may make sense to estimate an unrestricted version of Varv

g

, especially if G is large. Even

in that case, we can use Wooldridge [2002, equation (7.49)] to obtain a variance matrix robust

to Varv

gm

|x

g

, Z

g

Varv

g

.

In economics, the prevailing view is still that robust inference is not necessary when using

GLS. Adopting the perspective of the GEE literature that one often chooses a working

variance-covariance matrix for convenience, but should recognize it may not represent the

conditional variance-covariance matrix is relatively straightforward these days. The

appendix shows how to use GEE software to obtain fully robust inference after random effects

estimation.

If we are only interested in estimating ,, the fixed effects (FE) or within estimator is

attractive. The within transformation subtracts off group averages from the dependent variable

and explanatory variables:

y

gm

y g

z

gm

zg

, u

gm

g

, m 1, . . . , M

g

; g 1, . . . , G, (10)

and this equation is estimated by pooled OLS. (The x

g

get swept away by the within-group

demeaning.) Under a full set of fixed effects assumptions which, unlike random effects,

allows arbitrary correlation between c

g

and the z

gm

inference is straightforward using

standard software. Nevertheless, analogous to the random effects case, it is often important to

allow Varu

g

|Z

g

to have an arbitrary form, including within-group correlation and

heteroskedasticity. Arellano (1987) proposed a fully robust variance matrix estimator for the

fixed effects estimator, and it is consistent (as G increases with the M

g

fixed) with cluster

samples or panel data; see also Wooldridge [2002, equation (10.59)]. For panel data, the

idiosyncratic errors can always have serial correlation or heteroskedasticity, and it is easy to

11

guard against these problems in inference. Reasons for wanting a fully robust variance matrix

estimator for FE applied to cluster samples are similar to the RE case. For example, if ,

g

replaces , in (1), then z

gm

zg

,

g

, appears in u

gm

in (4). The FE estimator is still

consistent if E,

g

|z

g1

zg

, . . . , z

g,Mg

zg

E,

g

, a reasonable assumption that allows

,

g

to be correlated with zg

provide the group slopes are mean-independent of the within-group

deviations from the mean. Nevertheless, u

gm

, u

gp

will be correlated for m p. A fully robust

variance matrix estimator is

Avar,

FE

g1

G

Z

g

1

g1

G

Z

g

g1

G

Z

g

1

, (11)

where Z

g

is the matrix of within-group deviations from means and

g

is the M

g

1 vector of

fixed effects residuals. This estimator is justified with large-G asymptotics.

One benefit of a fixed effects approach, especially in the standard model with constant

slopes but c

g

in the composite error term, is that no adjustments are necessary if the c

g

are

correlated across groups. When the groups represent different geographical units, we might

expect correlation across groups close to each other. If we think such correlation is largely

captured through the unobserved effect c

g

, then its elimination via the within transformation

effectively solves the problem. If we use pooled OLS or a random effects approach, we would

have to deal with spatial correlation across g, in addition to within-group correlation, and this

is a difficult problem.

Recently, in the context of fixed effects estimation and panel data, Kzde (2001) and

Bertrand, Duflo, and Mullainathan (2002) study the finite-sample properties of robust variance

matrix estimators that are theoretically justified only as G . One common finding is that

the fully robust estimator works reasonably well even when the cross-sectional sample size (G)

12

is not especially large relative to the time-series dimension. When Varu

g

|Z

g

does not depend

on Z

g

, a variance matrix that exploits system homoskedasticity can perform better than the

fully robust variance matrix estimator. Kzde (2001) also finds that, without serial correlation,

an estimator that adjusts the variance matrix only for heteroskedasticity works well. This

finding is probably more relevant for cluster sample applications, where independence of the

idiosyncratic errors is often reasonable unless slopes vary by group.

Importantly, the encouraging simulaton findings for fixed effects with panel data are not in

conflict with findings that the robust variance matrix for the pooled OLS estimator with a

small number of clusters can behave poorly. An important difference in the two problems is

that for FE estimation using panel data, the issue is serial correlation in the idiosyncratic errors

u

gm

: m 1, . . . , M

g

, and in simulation studies to date this correlation has been assumed to

die out as the time periods get far apart. [Typically, a stable AR(1) model is assumed.] The

pooled OLS estimator that keeps c

g

in the error term suffers because of the constant correlation

across all observations within cluster. Plus, fixed effects can only estimate ,, while for pooled

OLS on cluster samples the focus is on typically on [. If we are interested in [, an interesting

open question for a small cluster sample size is whether inference with the random effects

estimator, whether carried out under the usual RE assumptions or made robust to arbitrary

within-group correlation and heteroskedasticity, can improve on robust inference for pooled

OLS.

The previous discussion extends immediately to instrumental variables versions of all

estimators. With large G, one can typically afford to make pooled two stage least squares

(2SLS), random effects 2SLS, and fixed effects 2SLS robust to arbitrary within-cluster

correlation and heteroskedasticity. Also, more efficient estimation is possible by applying

13

generalized method of moments (GMM); again, GMM is justified with large G.

Unfortunately, for any of the IV methods there appears to be little simulation evidence for how

large G standard errors work for IV methods when G is not so large.

2.3. Large Group Size Asymptotics

The previous subsection discussed the mixed performance of standard large G variance

matrix estimators when G is not so large. I now discuss estimation of the parameters in (1)

under a standard stratified sampling scheme. In particular, suppose we have defined G strata in

the population, and we obtain our data by randomly sampling from each stratum. As before,

M

g

is the sample size for stratum (group) g. Without specifying how the sample was obtained,

the resulting data set is essentially indistinguishable from that described in Section 2.2, where

we randomly draw clusters from a population of many clusters. But the difference turns out to

be important, and in this subsection I act as if the usual large G asymptotics do not apply.

We are mostly interested in estimating [ in (1), but that generally requires estimating ,, too.

Donald and Lang (2001) (hereafter, DL) study this problem, and I draw on their work while

offering a complementary approach.

DLs examples are for very small G, and one of their examples is Card and Krueger

(1994), who had G 2 states, New Jersey and Pennsylvania. DLs approach can be, and has

been, applied with larger numbers of groups. Moulton (1990), the clusters are states in the

U.S. (G 50) and, in Loeb and Bound (1996), G 36 cohort-division groups. I explicitly

consider the case where G and the M

g

are both (moderately) large in Section 2.4.

14

To illustrate the pitfalls in applying standard methods such as pooled OLS when G is

small, consider a special case of (1), where x

g

is a scalar (as in DL) and z

gm

is not in the

equation. The equation is

y

gm

o [x

g

c

g

u

gm

, m 1, . . . , M

g

; g 1, . . . , G, (12)

where c

g

and u

gm

: m 1, . . . , M

g

are independent of x

g

and u

gm

: m 1, . . . , M

g

is a

mean-zero, independent, identically distributed sequence for each g. Now, if the cluster effect

c

g

is absent from the model, then we can estimate (12) using pooled ordinary least squares, and

inference is straightforward. If we assume Varu

gm

is constant across g then, provided

N M

1

. . . M

G

is large enough (often N as small as 30 suffices) even if G is not large we

can use the usual t statistics from the pooled OLS regression as having an approximate

standard normal distribution. Making inference robust to heteroskedasticity is straightforward,

and tends to work well even for moderate N.

Donald and Lang (2001) allow for the presence of c

g

Normal0, o

c

2

, which they assume

to be independent of u

gm

: m 1, . . . , M

g

. With a common cluster effect, there is no

averaging out within cluster that allows application of the central limit theorem. One way to

see the problem is to note that the pooled OLS estimator, [

estimator obtained from the regression

y g

on 1, x

g

, g 1, . . . , G. (13)

Conditional on the x

g

, [

: g 1, . . . , G, the within-group

averages of the composite errors v

gm

c

g

u

gm

. The presence of c

g

means new observations

within group do not provide additional information for estimating [ beyond how they affect

the group average, y g

.

15

If we add some strong assumptions, there is an exact solution to the inference problem. In

particular, assume u

gm

Normal0, o

u

2

and M

g

M for all g (same sample size across strata).

Then v g

Normal0, o

c

2

o

u

2

/M for all g. Because we assume independence across g, the

equation

y g

o [x

g

v g

, g 1, . . . , G (14)

satisfies the classical linear model assumptions. Therefore, we can use inference based on the

t

G2

distribution to test hypotheses about [, provided G 2. When G is very small, the

requirements for a significant t statistic using the t

G2

distribution are much more stringent then

if we use the t

M

1

M

2

...M

G

2

distribution which is what we would be doing if we use the usual

pooled OLS statistics. When x

g

is a 1 K vector, we need G K 1 to use the t

GK1

distribution for inference. [In Moulton (1990), x

g

contains 17 elements, which can be handled

in this setup because G 50. ]

As pointed out by DL, performing the correct inference in the presence of c

g

is not just a

matter of correcting the pooled OLS standard errors for cluster correlation, or using the RE

estimator. There is only one estimator: pooled OLS, random effects, and the between

regression in (13) all lead to the same [

appropriate standard error and reports the correct (small) degrees of freedom in the t

distribution.

We can apply the DL method without normality of the u

gm

if the common group size M is

large: by the central limit theorem,

g

will be approximately normally distributed very

generally. Then, because c

g

is normally distributed, we can treat v g

as approximately normal

with constant variance. Further, even if the group sizes differ across g, for very large group

16

sizes

g

will be a negligible part of v g

: Varv g

o

c

2

o

u

2

/M

g

. Provided c

g

is normally

distributed and it dominates v g

, a classical linear model analysis on (14) should be roughly

valid.

The broadest applicability of DLs setup is when the average of the idiosyncratic errors,

g

,

can be ignored either because o

u

2

is small relative to o

c

2

, M

g

is large, or both. It is important

to see that applying DL with different group sizes or nonnormality of the u

gm

is identical to

ignoring the estimation error in the sample averages, y g

. In other words, it is as if we are

analyzing the simple regression j

g

o [x

g

c

g

using the classical linear model

assumptions (where y g

is used in place of the unknown group mean, j

g

). With small G, we

need to assume c

g

is normally distributed.

The DL solution to the inference problem is actually pretty common as a strategy to check

robustness of results obtained from cluster samples, but typically it is implemented with large

G. Often with cluster samples one estimates the parameters using the disaggregated data and

also the averaged data. With covariates that vary within cluster, using averaged data is

generally inefficient. But it does mean that standard errors need not be made robust to

within-cluster correlation. DL essentially point out that using (14) with small G ensures that

inference is conservative (at least if normality of c

g

holds and the group sizes are large).

For small G and large M

g

, inference obtained from analyzing (14) as a classical linear

model will be very conservative in the absense of a cluster effect. Perhaps this is desirable, but

it rules out some widely-used staples of policy analysis. For example, suppose we have two

populations (maybe men and women, two different cities, or a treatment and a control group)

with means j

g

, g 1, 2, and we would like to obtain a confidence interval for their difference.

Under random sampling from each population, and assuming normality and equal population

17

variances, the usual comparison-of-means statistic is distributed exactly as t

M

1

M

2

2

under the

null hypothesis of equal population means. (Or, we can construct an exact 95% confidence

interval of the difference in population means.) With even moderate sizes for M

1

and M

2

, the

t

M

1

M

2

2

distribution is close to the standard normal distribution. Plus, we can relax normality

to obtain approximately valid inference, and it is easy to adjust the t statistic to allow for

different population variances. The comparison-of-means setup is most students first

exposure to policy analysis. But we cannot even study the difference-in-means estimator in

the DL setup because G 2.

DL criticize Card and Krueger (1994) for comparing mean wage changes of fast-food

workers across two states because Card and Krueger fail to account for the state effect (New

Jersery or Pennsylvania), c

g

, in the composite error, v

gm

. But the DL criticism in the G 2

case is indistinguishable from a common question raised for any difference-in-differences

analyses: How can we be sure that any observed difference in means is due entirely to the

policy change? To characterize the problem as failing to account for an unobserved group

effect is not necessarily helpful. If the experiment is appropriately randomized, then c

g

should

be part of the effect to be estimated.

Consider a related example. Suppose that, over the summer, a school district with two high

schools, say A and B, decides to provide personal computers for students at high school B who

just finished their first year. The announcement is made just before the school year starts, so

that students cannot switch high schools. The response variable is the change in a standardized

test score given in the district from the first to the second year. If we sample students from

each high school, a comparsion of means should be appropriate if the experiment is effectively

randomized. Of course, if there are other confounding factors say, the average increase in

18

test scores is already higher at school B then the difference-in-differences estimator will not

be appropriate.

If z

gm

appears in (1), we can modify (14) by adding the group averages, zg

, as explanatory

variables, but only if G K L 1. Under the normality and homoskedasticity assumptions

with common group sizes, inference can be carried out using the t

GKL1

distribution. DL

propose an estimation method that has an exact t

GK1

distribution, regardless of the size of L,

but they assume that zg

is constant across g, zg

z, in which case z, gets absorbed into the

intercept in (14) and we are back to the case without z

gm

.

With large group sizes M

g

, a different perspective on estimating [ sheds additional insight

on problems of inference, and allows more flexibility on the kinds of questions that can be

answered. Write for each goup g

y

gm

o

g

z

gm

,

g

u

gm

, m 1, . . . , M

g

, (15)

where we assume random sampling within group and independent sampling across groups.

We make the standard assumptions for OLS to be consistent (as M

g

) and

M

g

-asymptotically normal; see, for example, Wooldridge (2002, Chapter 4). The presence

of group-level variables x

g

in a structural model can be viewed as putting restrictions on the

intercepts, o

g

, in the separate group models in (15). In particular,

o

g

o x

g

[, g 1, . . . , G, (16)

where we now think of x

g

as fixed, observed attributes of heterogeneous groups. With K

attributes we must have G K 1. If M

g

is large enough to estimate the o

g

precisely, a

simple two-step estimation strategy suggests itself. First, obtain the o

g

, along with ,

g

, from an

OLS regression within each group. If G K 1 then, typically, we can solve for 0

o , [

19

uniquely in terms of the G 1 vector o

:. 0

X

1

o

with g

th

row 1, x

g

. If G K 1 then, in a second step, we can use a minimum distance

approach, as described in Wooldridge (2002, Section 14.6). If we use as the weighting matrix

I

G

, the G G identity matrix, then the minimum distance estimator can be computed from the

OLS regression

o

g

on 1, x

g

, g 1, . . . , G. (17)

Under asymptotics such that M

g

g

M where 0

g

1 and M , the minimum distance

estimator 0

distance estimator is asymptotically inefficient except under strong assumptions. Because the

samples are assumed to be independent, it is not difficult to obtain the efficient minimum

distance (MD) estimator also called the minimum chi-square estimator.

First consider the case where z

gm

does not appear in the first stage estimation, so that the o

g

is just y g

, the sample mean for group g. Let o

g

2

denote the usual sample variance for group g.

Because the y g

are independent across g, the efficient MD estimator uses a diagonal weighting

matrix. As a computational device, the minimum chi-square estimator can be computed by

using the weighted least squares (WLS) version of (17), where group g is weighted by M

g

/o

g

2

(groups that have more data and smaller variance receive greater weight). Conveniently, the

reported t statistics from the WLS regression are asymptotically standard normal as the group

sizes M

g

get large. (With fixed G, the WLS nature of the estimation is just a computational

device; the standard asymptotic analysis of the WLS estimator has G .). The minimum

distance approach works with small G provided G K 1 and each M

g

is large enough so that

normality is a good approximation to the distribution of the (properly scaled) sample average

20

within each group.

If z

gm

is present in the first-stage estimation, we use as the minimum chi-square weights the

inverses of the asymptotic variances for the g intercepts in the separate G regressions. With

large M

g

, we might make these fully robust to heteroskedasticity in Eu

gm

2

|z

gm

using the White

(1980) sandwich variance estimator, or at least allow for different o

g

2

. In any case, once we

have the Avaro

g

which are just the squared reported standard errors for the o

g

we use as

weights 1/Avaro

g

in the computationally simple WLS procedure. We are still using

independence across g in obtaining a diagonal weighting matrix in the MD estimation.

An important by-product of the WLS regression is a minimum chi-square statistic that can

be used to test the G K 1 overidentifying restrictions. The statistic is easily obtained as the

weighted sum of squared residuals, say SSR

w

. Under the null hypothesis in (16),

SSR

w

a

_

GK1

2

as the group sizes, M

g

, get large. If we reject H

0

at a reasonably small

significance level, the x

g

are not sufficient for characterizing the changing intercepts across

groups. If we fail to reject H

0

, we can have some confidence in our specification, and perform

inference using the standard normal distribution for t statistics for testing linear combinations

of the population averages.

We might also be interested in how one of the slopes in ,

g

depends on the group features,

x

g

. Then, we simple replace o

g

with, say ,

g1

, the slope on the first element of z

gm

. Naturally,

we would use 1/Avar,

g1

as the weights in the MD estimation.

The minimum distance approach can also be applied if we impose ,

g

, for all g, as in

the original model (1). Obtaining the o

g

themselves is easy: run the pooled regression

y

gm

on d1

g

, d2

g

, . . . , dG

g

, z

gm

, m 1, . . . , M

g

; g 1, . . . , G (18)

21

where d1

g

, d2

g

, . . . , dG

g

are group dummy variables. Using the o

g

from the pooled regression

(18) in MD estimation is complicated by the fact that the o

g

are no longer asymptotically

independent; in fact, o

g

y g

zg

,, where , is the vector of common slopes, and the presence

of , induces correlation among the intercept estimators. Let V

be the G G estimated

(asymptotic) variance matrix of the G 1 vector o

0

V

1

X

1

X

V

1

o

V

1

X

1

. If the OLS

regression (17) is used, or the WLS version, the resulting standard errors will be incorrect

because they ignore the across group correlation in the estimators. (With large group sizes the

errors might be small; see the next subsection.)

Intermediate approaches are available, too. Loeb and Bound (1996) (LB for short) allow

different group intercepts and group-specific slopes on education, but impose common slopes

on demographic and family background variable. The main group-level covariate is the

student-teacher ratio. Thus, LB are interested in seeing how the student-teach ratio affects the

relationship between test scores and education levels. LB use both the unweighted estimator

and the weighted estimator and find that the results differ in unimportant ways. Because they

impose common slopes on a set of regressors, the estimated slopes on education (say ,

g1

) are

not asymptotically independent, and perhaps using a nondiagonal estimated variance matrix V

(which would be 36 36 in this case) is more appropriate; but see Section 2.4.

As another example, suppose x

g

is a binary treatment indicator, equal to unity if group g is

in the treatment group and zero otherwise. Then [

effect (ATE). If G 2, there are no restrictions to test, as in the simple comparison-of-means

setup. If G 2 we can test the overidentifying restrictions. If we reject them, then there are

unaccounted for group-level differences, and we might want to re-specify our model by adding

22

more elements to x

g

, even if we think the new elements of x

g

are not systematically related to

the original elements. Of course, if we add elements until G K 1, there are no restrictions

to test.

If we reject the overidentifying restrictions, we are essentially concluding that

o

g

o x

g

[ c

g

, where c

g

is the error made in imposing the restrictions (16) on the o

g

. One

possibility is to apply the Donald and Lang approach, which is to analyze the OLS regression

in (17) in the context of the classical linear model (CLM), where inference is based on the

t

GK1

distribution. Why is a CLM analysis justified? Since o

g

o

g

O

p

Mg

1/2

, we can

ingore the estimation error in o

g

for large M

g

(Recall that the same large M

g

assumption

underlies the minimum distance approach.) Then, it is as if we are estimating the equation

o

g

o x

g

[ c

g

, g 1, . . . , G by OLS. If the c

g

are drawn from a normal distribution,

classical analysis is applicable because c

g

is assumed to be independent of x

g

. This approach

is desirable when one cannot, or does not want to, find group-level observables that completely

determine the o

g

. It is predicated on the assumption that the other factors in c

g

are not

systematically related to x

g

, a reasonable assumption if, say, x

g

is a randomly assigned

treatment at the group level, a case considered by Angrist and Lavy (2002).

Before settling for conservative inference, which is justified only under normality of c

g

when we can ignore the estimation error in o

g

, it makes sense to choose x

g

carefully and test

Varc

g

0. This is precisely the purpose of the minimum chi-square overidentification

statistic. The minimum chi-square approach is justified very generally when the group sizes,

M

g

, are moderately large; normality is not needed anywhere, nor is homoskedasticity within or

across groups. If the M

g

are not very large but G K 1 is large, the _

GK1

2

distribution may

not be a good approximation to the distribution of the overidentification statistic.

23

Even if the overidentification statistic rejects (16), it may be too rash to adopt the DL

approach with small degrees of freedom. To see why, consider a treatment effect setup.

Suppose there are three large groups (G 3), and groups one and two are the control groups

while group three is the treatment group. So x

g

is a binary variable with x

1

x

2

0 and

x

3

1. For simplicity, there are no control variables. Then the estimated average treatment

effect using the DL regression (13) or the (unweighted) minimum distance regression (18) are

identical, and simply equal to the difference in means for the treatment and control group. It is

conveniently written as

[

y 3

p

1

y 1

p

2

y 2

, (19)

where p

1

M

1

/M

1

M

2

and p

2

M

2

/M

1

M

2

are the proportions of observations for

groups one and two, respectively, out of the total number of control units. In other words, the

mean outcome for the control group is just a weighted average. The efficient minimum

distance estimator would use a different weighting scheme, but that is not important for the

main point here.

Because G 3, we can test the single overidentifying restriction. This is tantamount to

testing whether j

1

j

2

, that is, the population means of the two control groups are the same.

Here is the important point: should we really reject the standard analysis just because there is a

heterogeneous response in the two control groups? I think not. If we do and apply the DL

approach, we wind up using a t distribution with a single degree of freedom to conduct

inference, even though we may have hundreds, or even thousands, of observations from each

group or stratum.

Instead, we can be satisfied with (19) as the estimate of the treatment effect without

worrying whether j

1

j

2

. In fact, if the sample proportions p

1

and p

2

converge to the

24

relative population frequencies, say

1

and

2

, then the probability limit of (19) is simply

j

3

1

j

1

2

j

2

, where the term in parentheses is the average across the two control group

populations. If p

1

and p

2

are not consistent estimates of the population frequencies, then the

control group mean estimates a weighted average of j

1

and j

2

. We may or may not be

interested in the particular weighted average implied by (19). Nevertheless, the DL method

produces exactly the same estimate of the ATE, but DL would insist that the estimated effect

be deemed statistically significant only if the t statistic, from the regression of y g

on 1, x

g

,

g 1, 2, 3, is on the order of 12.8 (for a two-sided, 5% level test).

Rather than either rejecting the minimum distance approach if j

1

j

2

, or simply

accepting the estimated treatment effect as possibly estimating an unteresting weighted

combination of the mean effects for different groups, we can specify ahead of time a particular

linear combination of the means that is of interest. So, for example, suppose that the first G

1

groups are the controls and the remaining G

2

groups are the treated groups. Then, we might

define the treatment effect in terms of two simple averages, say 0

C

j

1

. . . j

G

1

/G

1

and

0

T

j

G

1

1

. . . j

G

/G

2

; the treatment effect would be defined as [ 0

T

0

C

. Even with

only a moderate sample size for each g, we can typically get fairly precise estimates of the j

g

using the sample average. Estimation of [ is then straightforward, and inference is standard

because asymptotic standard errors for 0

C

and 0

T

are easily obtained.

Of course, one might use weighted averages in each case, particularly if one has knowledge

of population frequencies. The general point is that once we have specified an interesting

population parameter, we can obtain a consistent, approximately normal estimator, and

inference can be based on the standard normal distribution when the group sizes are

moderately large. There is no reason to rely on the small degrees of freedom approach of DL.

25

Beyond the treatment effect case, the issue of how to define parameters of interest appears

complicated, and deserves further research.

2.4. What if G and the M

g

are Both Large?

In Section 2.2, I reviewed methods appropriate for a large number of groups and relatively

small group sizes. In Section 2.3, I considered two approaches appropriate for large group

sizes and a small number of groups. The DL and minimum distance approaches use the large

group sizes assumption differently: in its most applicable setting, DL use the large M

g

assumption to ignore the first-stage estimation error entirely, while the asymptotics underlying

the MD approach is based on applying the central limit theorem within each group. Not

surprisingly, more flexibility is afforded if G and M

g

are both large.

For example, suppose we adope the DL specification (with an unobserved cluster effect)

and G 50 (say, states in the U.S.). Further, assume first that the group sizes are large enough

(or the cluster effects are so strong) that the first-stage estimation error can be ignored. Then,

it matters not whether we impose some common slopes or run separate regressions for each

group (state) in the first stage: we can treat the o

g

, g 1, . . . , G, as independent observations

to be used in the second stage. This means we apply regression (17) and apply the usual

inference procedures. The difference now is that with G 50, the usual t statistics have some

robustness to nonnormality of the c

g

, assuming the CLT approximation works well With small

G, the exact inference was based on normality of the c

g

.

Loeb and Bound (1996) essentially use regression (17), but with estimated slopes as the

26

dependent variable in place of estimated intercepts. As mentioned in Section 2.3, LB impose

some common slopes across groups, which means all estimated parameters are dependent

across group. The minimum distance approach without cluster effects is one way to account

for the dependence. Alternatively, one can simply adopt the DL perspective and just assume

the estimation error is swamped by c

g

; then standard OLS analysis is approximately justfied.

LB also apply a weighted least squares procedure that is intended to account for the different

sampling variations in their slope estimates (largely due to different group sample sizes). With

a cluster effect, the LB weights are not the correct ones for eliminating heteroskedasticity.

[Plus, we do not have the actual variances of the o

g

of ,

g1

; we have only estimates of their

asymptotic variances. This is fine from the MD perspective, but not really correct from a WLS

perspective.] There is nothing really wrong with using the incorrect weights provided the

standard errors are appropriately adjusted, but with G 36 one might be concerned about the

finite-sample properties of robust standard errors. With a cluster effect, the same problem

arises with OLS, as it ignores the heteroskedasticity. From the MD perspective, the LB WLS

approach would be correct if each first-stage regression were estimated separately; but they

were not.

In summary, with G somewhat large, the DL approach gains robustness to nonnormality of

the c

g

if the first-stage estimation error can be safely ignored. A WLS approach is justified in

the MD approach when separate regressions can be run for each group. But that weighting is

not the correct one when either a cluster effect is present or one must account for dependence

in the first-stage estimators across groups.

27

3. NONLINEAR MODELS

Many of the issues for nonlinear models are the same as for linear models. The biggest

difference is that, in many cases, standard approaches require distributional assumptions about

the unobserved group effects. In addition, it is more difficult in nonlinear models to allow for

group effects correlated with covariates, especially when group sizes differ. For the small G

case, we offer extensions of the Donald and Lang (2001) approach (with large group sizes) and

the minimum distance approach.

Rather than using a general, abstract setting, the issues for nonlinear models are easily

illustrated with a few widely-used examples. Here I discuss binary response, count data, tobit

models, and fractional response models.

3.1. Large Groups Asymptotics

As in Section 2, we begin with the case where the asymptotic analysis is with fixed group

sizes M

g

with the number of groups G getting large.

3.1.1. Binary Response Models

We can illustrate many issues using an unobserved effects probit model. Let y

gm

be a

binary response, with x

g

and z

gm

, m 1, . . . , M

g

, g 1, . . . , G defined as in Section 2. Assume

28

that

y

gm

1o x

g

[ z

gm

, c

g

u

gm

0

u

gm

|x

g

, Z

g

, c

g

~Normal0, 1

(20)

(21)

(where 1 is the indicator function). Equations (20) and (21) imply

Py

gm

1|x

g

, z

gm

, c

g

Py

gm

1|x

g

, Z

g

, c

g

o x

g

[ z

gm

, c

g

, (22)

where is the standard normal cumulative distribution function (cdf). We assume

throughout that only z

gm

affects the response probability of y

gm

conditional on x

g

and c

g

; the

outcomes of z

gp

for p m are assumed not to matter. This is captured in (22). For pooled

methods we could relax this restriction (as in the linear case), but, with the presence of c

g

, this

affords little generality in practice.

The presence of c

g

in (22) raises several important issues. Before even considering

estimation, we should know how to define the partial effects of the explanatory variables,

x

g

, z

gm

. The elements of [ and , provide the directions of effects and, for continuous

elements, the relative sizes. For example, if the first element of x

g

is continuous,

Py

gm

1|x

g

, z

gm

, c

g

x

g1

[

1

o x

g

[ z

gm

, c

g

, (23)

where is the standard normal density function. From (23), the sign of [

1

tells us whether

x

g1

has a positive or negative partial effect on the response probability. Further, if x

g2

is

another continuous variable, the ratio of partial effects is simply [

1

/[

2

. For discrete

covariates, the signs of the parameters give the direction of effects, although the relative

magnitudes are not free of the unobservable, c

g

. In any case, the magnitude of the partial

effects depends directly on c

g

, as (23) shows in the case of a continuous covariate. Often, we

need to a measure of the change in the response probability given a change in a regressor. To

29

address the dependence of (23) [and its discrete analogs] on c

g

, we can average (23) across the

population distribution of c

g

, for fixed values of the covariates. This results in what I have

called the average partial effect (APE); see, for example, Wooldridge (2002, 2005). The

simplest way to obtain APEs requires is to specify a distribution (or a conditional distribution)

for c

g

.

We can easily derive the APEs, and then consistently estimate them, if we assume c

g

is

independent of x

g

, Z

g

and is normally distributed:

c

g

|x

g

, Z

g

~Normal0, o

c

2

, (24)

where the zero mean is without loss of generality because (20) contains an intercept, o. The

unconditional normality assumption for c

g

implies that the APEs are obtained from the

function

Py

gm

1|x

g

, Z

g

o x

g

[ z

gm

,/1 o

c

2

1/2

o

c

x

g

[

c

z

gm

,

c

, (25)

where o

c

o/1 o

c

2

1/2

, and so on; see, for example, Wooldridge (2002, Chapter 15).

Conveniently, the scaled coefficients are exactly the coefficients estimated by using a simple

pooled probit procedure. So, for estimating the average partial effects, pooled probit is

perfectly acceptable. With large G and small group sizes, we can easily make the standard

errors and test statistics robust to arbitarary within group correlation. This is especially

convenient for panel data applications, where serial correlation in the idiosyncratic errors, u

gm

(where m indexes time) causes extra dependence in the responses y

gm

that is not accounted for

by c

g

. In other words, the cluster-robust variance matrix estimator allows for any kind of

within-group correlation, just as in the linear case.

If we supplement (20),.(21), and (24) with

30

u

g1

, . . . , u

g,Mg

are independent conditional on x

g

, Z

g

, c

g

(26)

then we have the so-called random effects probit model. Under the RE probit assumptions,

o, [, , and o

c

2

are all identified, which means we can estimate the APEs as well as the partial

effects evaluated at the everage value of c

g

, which is zero. We can also compute partial effects

at other values of c

g

that we might select from the normal distribution with estimated standard

deviation o

c

. The details for random effects probit in the balanced panel data case are given in

Wooldridge (2002, Chapter 15). The unbalanced case is similar.

The random effects probit estimator has no known robustness properties that is, if any of

the assumptions (20),.(21), (24), and (26) fail, the parameter estimators are believed to be

inconsistent. Plus, the APEs would not be consistently estimated in general. Therefore, if we

insist that the parameter estimators and APEs are consistently estimated, there is no logical

reason to use robust inference; the usual MLE inference is appropriate. Still, one might view

the RE probit model as an approximation to the true model, and use a sandwich estimator for

robust inference; see White (1982) and Wooldridge (2002, Chapter 12).

Random effects probit relies on a full set of distributional assumptions, while pooled probit

relaxes assumptions on the joint distribution at the cost of inefficient estimators (of the scaled

parameters indexing the APEs). If we maintain the strict exogeneity assumption of the

explanatory variables, there are estimators that are generally more efficient than pooled probit

while remaining consistent under (20), (21), and (24). The generalized estimating equation

(GEE) literature applies to grouped data (panel data as a special case); see Liang and Zeger

(1986). Perhaps the best way to think about GEE is as a multivariate weighted nonlinear least

squares (WNLS) estimator, with explicit recognition that the variance-covariance matrix

underlying the WNLS estimation is likely misspecified. The GEE approach begins by

31

specifying a conditional mean for the M

g

1 vector y

g

. In the binary response case,

Ey

g

|x

g

, Z

g

is simply the vector of response probabilities in (25). [In the context of GEE, the

expression in (25) would be called the population averaged model because the unobserved

effect has been averaged out.] Because y

gm

is binary,

Vary

gm

|x

g

, Z

g

o

c

x

g

[

c

z

gm

,

c

1 o

c

x

g

[

c

z

gm

,

c

. (27)

The pooled weighted nonlinear least squares estimator, using as weights the inverse of the

estimated variance function, produces an an estimator asymptotically equivalent to the pooled

probit estimator. (Remember, this is all as G with the M

g

fixed.) What GEE does is

exploit the within-group correlation to obtain a more efficient estimator. It does this by

specifying a working correlation matrix (WCM) which is, for each g, an M

g

M

g

matrix of

correlations. This matrix is supposed to approximate Corry

g

|x

g

, Z

g

. In most

implementations of GEE, the WCM does not depend on the condititioning variables. For

cluster samples and panel data, a useful WCM is the exchangeable matrix that assumes a

constant correlation, , across any two pairs m and p. Combined with (27), the WCM gives an

approximation to Vary

g

|x

g

, Z

g

. The resulting (simple) structure is known to be wrong if the

random effects probit model, or any extensions (such as modeling the within-group correlation

in u

g1

, . . . , u

g,Mg

). The idea is that accounting for the within-cluster correlation in even a

crude way generally leads to more efficient estimation than ignoring the correlation in

estimation, as with pooled probit. Because the M

g

M

g

variance matrices are assumed to be

misspecified, a sandwich estimator should be used for inference, and this option is standard

(and should always be used) for GEE applications. The GEE estimates are directly comparable

to the pooled probit estimates the parameters index the APEs in both cases and can be

32

compared to random effects estimates after scaling the latter by 1/1 o

c

2

1/2

.

In many large G applications, interest centers on the causal effect of the unit specific

covariates, z

gm

, controlling for differences in unobserved heterogeneity, c

g

. Then, x

g

(and an

intercept) is absorbed into c

g

and correlation between c

g

and z

g1

, z

g2

, . . . , z

g,Mg

is allowed.

For linear models, we saw that the within or fixed effects estimator allows arbitrary

correlation, and does not restrict the within-cluster dependence of u

g1

, . . . , u

g,Mg

.

Unfortunately, allowing general correlation is much more difficult in nonlinear models. In the

balanced case, where the M

g

are the same, a device due to Chamberlain (1980) can be used.

Actually, in practice a more restrictive version is often used: the distribution of c

g

given

z

g1

, z

g2

, . . . , z

gM

is assumed to be homoskedasticity normal with a mean linear in the group

average:

c

g

|Z

g

~Normalp zg

, o

a

2

, (28)

where o

a

2

is the conditional variance Varc

g

|Z

g

. If we use all random effects probit

assumptions but with (28) in place of (24), then we obtain a simple extension of the RE probit

model: simply add the group averages, zg

, as a set of additional explanatory variables. The

marginal distributions are

Py

gm

1|Z

g

p z

gm

, zg

/1 o

a

2

1/2

p

a

z

gm

,

a

zg

a

(29)

where now the coefficients are scaled by a function of the conditional variance. As shown in

Wooldridge (2002, Chapter 15), the average partial effects are consistently estimated by taking

derivatives and changes of

G

1

g1

G

p

a

z,

a

zg

a

(30)

33

with respect to elements of z. In other words, we average out the effects of the group averages,

zg

. The RE probit assumptions identify the original coefficients and o

a

2

. We can also apply

pooled probit or GEE without (26) and directly estimate the scaled coefficients (and only the

scaled coefficients are identified). Also, one is free to add x

g

to the set of covariates, but,

typically, one would not interpret the partial effects in a causal way (because c

g

and x

g

might

be (partially) correlated.)

Unfortunately, while the Chamberlain-Mundlak device is sensible for balanced cluster

samples (including balanced panels), it needs to be modified for the unbalanced case. [Of

course, one can always balance a cluster sample under the assumption that the cluster sizes are

exogenous, and that might be desirable if there is not much variation in the cluster sizes. But

sometimes it entails throwing away a lot of data.] Here I will simply mention what might be

done. At a minimum, (28) should be modified to allow the variances to depend on the cluster

size, M

g

. Under restrictive assumptions, such as joint normality of c

g

, z

g1

, . . . , z

g,Mg

, with the

z

gm

independent and identically distributed within a cluster, one can derive Varc

g

|Z

g

. But

these are strong assumptions. With G large and little variation in the group sizes, M

g

, one

might just allow a different variance for each M

g

. Then the marginal distributions are

Py

gm

1|Z

g

p z

gm

, zg

/1 o

a,Mg

2

1/2

. (31)

A simple approach, and one that allows p, to be p

Mg

,

Mg

, is to estimate a different

equation by pooled probit for each group size [with covariates 1, z

gm

, zg

]. Then, we can

estimate APEs for each M

g

, and then average those. Obtaining standard errors via the delta

method would be tricky, and this approach ignores the constancy of , across groups. [In

effect, we would just be obtaining p

Mg

, ,

Mg

,

Mg

for each group size.] To impose a common

34

,, a minimum distance approach can be used.

Alternatively, for each different value of M

g

we can estimate a random effects probit model

with the zg

included as covariates. In fact, we could be more general and allow p, to be

p

Mg

,

Mg

. Then, after estimating the slope and variance parameters for a set of extended RE

probits, we can then use a minimum distance approach to estimate the common ,. This might

be a useful topic for future research.

In estimating APEs, one can dispense with parametric assumptions entirely, provided

nonparametric estimation is feasible. In particular, consider the two assumptions

Py

gm

1|Z

g

, c

g

Py

gm

1|z

gm

, c

g

Fz

gm

, c

g

(32)

and

Dc

g

|z

g1

, z

g2

, . . . , z

g,Mg

Dc

g

|zg

. (33)

Assumption (32) implies strict exogeneity conditional on c

g

while (33) means that the

distribution of the group effect, c

g

, given all within-group covariates, depends only on the

group average; a specific distributional assumption, such as homoskedastic normality with a

mean linear in zg

, is clearly a special case.

Define H

g

z

gm

, zg

Py

gm

1|z

gm

, zg

. Under (32) and (33), it can be show that the

APEs are obtained from

E

zg

H

g

z, zg

; (34)

see, for example, Wooldridge (2005). If the group sizes differ, H

g

, generally does depend

on g. If there are relatively few group sizes, it makes sense to estimate the H

g

, separately

for each group size M

g

. Then, the APEs can be estimated from

35

G

1

g1

G

g

z, zg

. (35)

As a practical matter, we might just use flexible parametric models, such as pooled probit or

random effects probit. This approach does not impose the constant function F, as specified

in (32), and so more efficient methods should be available, but I will not explore that here.

Other strategies are available for estimating APEs. A procedure commonly known as

fixed effects probit treats the c

g

as parameters to estimate in

Py

gm

1|Z

g

, c

g

Py

gm

1|z

gm

, c

g

z

gm

, c

g

. (36)

With small group sizes M

g

(say, siblings or short panel data sets), treating the c

g

as parameters

to estimate creates an incidental parameters problem. In particular, the estimator of , can be

badly biased. Interestingly, as shown in recent work by Fernndez-Val (2005) for the balanced

panel case and independence across m conditional on Z

g

, c

g

, the average partial effects,

obtained from

G

1

g1

G

z,

g

, (37)

are unbiased for the population APEs up to order M

2

when the group effects are not present.

When group effects are present, Fernndez-Val (2005) proposes a bias correction for the APEs.

See Fernndez-Val (2005) for details.

Some prefer to use a logit response function rather than probit. The use of logit for any of

the methods discussed previously makes estimation of parameters and average partial effects

more difficult, so logit has little to offer as an alternative. Nevertheless, for allowing

correlation between c

g

and z

gm

, logit has an advantage over probit: under the conditional

36

independence assumption

y

g1

, . . . , y

g,Mg

are independent conditional on Z

g

, c

g

, (38)

the conditional maximum likelihood estimator eliminates c

g

and leads to consistent estimation

of ,, without having to specify Dc

g

|Z

g

. While this is useful for obtaining directions and

relative magnitudes, it does not allow us to estimate the APEs. Plus, for panel data

applications it is limited by the conditional independence assumption.

3.1.2. Count Data

Almost all of the discussion for binary response models extends to other nonlinear models

for cluster samples. An important application is to count data (or, in fact, any nonnegative

response variable). A popular model for the conditional mean is

Ey

gm

|x

g

, Z

g

, c

g

Ey

gm

|x

g

, z

gm

, c

g

expo x

g

[ z

gm

, c

g

. (39)

As with the probit model we assume strict exogeneity of the unit-specific covariates, z

gm

,

conditional on x

g

, c

g

. As usual, strict exogeneity can be relaxed for pooled methods but not

for random effects, fixed effects, and GLS-type methods. A convenient normalization in (39)

is Eexpc

g

1.

Under (39), a simple way to estimate o, [, and , is the pooled Poisson quasi-maximum

likelihood estimator (QMLE). In other words, act as if y

gm

has a Poisson distribution given

x

g

, z

gm

and ignore the within-group correlation in estimation. In many statistical packages, it

is simple to make inference robust to arbitrary within-group correlation (including serial

correlation in a panel data setting). The Poisson QMLE is known to be fully robust to

37

distributional misspecification, as well as within-group correlation. All that is required is

Ey

gm

|x

g

, Z

m

expo x

g

[ z

gm

, (40)

(and we can replace Z

m

with z

gm

) along with regularity conditions. See Wooldridge (2002,

Chapter 19) for references and a summary.

As with the linear case, any pooled method is likely to be inefficient. A typical random

effects Poisson approach adds the assumptions

y

gm

|x

g

, Z

g

, c

g

~Poissonexpo x

g

[ z

gm

, c

g

, (41)

y

g1

, . . . , y

g,Mg

are independent conditional on x

g

, Z

g

, c

g

, (42)

and

expc

g

|x

g

, Z

g

~Gammat, t, (43)

where the gamma distribution is parameterized to have mean equal to one. [Of course, other

distributions are possible, but the gamma is very common.] This estimator is programmed in

many packages, and is, of course, efficient under the maintained assumptions. Unfortunately,

the estimator has no known robustness properties to violations of (41), (42), or (43).

As in the binary response case, a middle ground that maintains robustness while likely

being more efficient than the pooled estimator is available. A GEE approach assumes (40),

uses the nominal Poisson variance assumption

Vary

gm

|x

g

, Z

g

Ey

gm

|x

g

, Z

g

expo x

g

[ z

gm

, (44)

and specifies a working correlation matrix. Typically, with a cluster sample the

exchangeable option would be used. As described in Section 3.1.1, the GEE estimator is a

multivariate weighted nonlinear least squares estimator with a (probably) misspecified

38

variance matrix. Thus, inference should be made fully robust using a sandwich estimator, and

this option is available (see appendix).

Unlike in the probit case, for count data we can derive the correct conditional variance

matrix under the nominal random effects Poisson assumption, and the resulting correlation

matrix does not have the exchangeable form. As shown in Wooldridge (2002, Chapter 19),

Vary

gm

|x

g

, Z

g

expo x

g

[ z

gm

, p

2

expo x

g

[ z

gm

,

2

Covy

gm

, y

mp

|x

g

, Z

g

p

2

expo x

g

[ z

gm

, expo x

g

[ z

gp

,

(45)

(46)

where p

2

1/t Varexpc

g

. The conditional correlation clearly depends on the

covariates. Therefore, rather than using a constant working correlation, one uses the variance

matrix implied by (45) and (46). Such an estimator is still fully robust but would be more

efficient than the exchangeable GEE estimator under the RE Poisson assumptions.

Wooldridge (2002, Chapter 19) provides details on estimating p

2

as well as fully robust

inference [to guard against (45) or (46) being incorrect]. As far as I know, this estimator is not

programmed in standard econometrics software, although it would not be difficult to do so.

It is straightforward to extend the previous models to allow correlation between c

g

and z

gm

.

Is in the binary response case, we initially drop x

g

. Then, we might be willing to assume

Eexpc

g

|Z

g

Eexpc

g

|zg

exp0

g

zg

g

(47)

where 0

g

,

g

depends on g through the group size, M

g

. Then

Ey

gm

|Z

m

exp0

g

z

gm

, zg

g

(48)

and now we can apply the pooled Poisson estimator, the GEE estimator, or the WNLS

estimator implied by (45) and (46), but but where we allow the slope and intercepts on zg

to

depend on the group sizes. One might want to use a different estimate of p

2

for each group

39

size, but that is not necessary for consistency. One could also extend the RE Poisson estimator

to allow expc

g

|Z

g

to have a gamma distribution with mean in (47), although estimation

would be much more complicated than extending the other methods. As in the binary response

case, one can add x

g

to the model, subject to the issue of how to interpret the parameters.

Because the elements of , are semi-elasticities (or elasticities), often estimation of , is

sufficient. Still, estimation of APEs can be based on

G

1

g1

G

exp0

g

z, zg

g

. (49)

Rather than restricting the nature of the relationship between c

g

and Z

g

, we can use the

fixed effects Poisson estimator to consistently estimate . Originally, this estimator was

derived as a conditional MLE by Hausman, Hall, and Griliches (1984). Blundell, Griffith, and

Windmeijer (2002) showed that treating the c

g

as parameters to estimate is the same as the

conditional MLE, so the name fixed effects Poisson cannot cause confusion.

As shown by Wooldridge (1999), the FE Poisson estimator has very desirable robustness

properties: it is consistent and G -asymptotically normal under the assumption

Ey

gm

|Z

g

, c

g

expz

gm

, c

g

. (50)

The actual distribution of y

gm

is practically unrestricted (except for regularity conditions), as is

the conditition dependence withing group. If interest lies in ,, the FE Poisson estimator is very

attractive. Of course, one should generally use a sandwich variance matrix estimator [see

Wooldridge (1999) or Wooldridge (2002, Section 19.6.4)]; unfortunately, this is not currently a

standard option in statistical packages.

40

3.1.2. Tobit Models for Corner Solutions

The discussion for Tobit models is very similar for probit models. Here, I assume that

Tobit is applied to a corner solution response, so that y

gm

, the nonnegative variable with

probability mass at zero, is the variable we wish to explain. In particular, I will focus on

estimating APEs on the mean response.

The standard Tobit model with a group effect is

y

gm

max0, o x

g

[ z

gm

, c

g

u

gm

(51)

u

gm

|x

g

, Z

g

, c

g

~Normal0, o

u

2

(52)

Among other things, (51) and (52) imply

Ey

gm

|x

g

, Z

g

, c

g

o x

g

[ z

gm

, c

g

/o

u

o x

g

[ z

gm

, c

g

o

u

o x

g

[ z

gm

, c

g

/o

u

mo x

g

[ z

gm

, c

g

, o

u

2

. (53)

The APEs are obtained by integrating this expression with respect to the distribution of c

g

. A

typical andom effects approach assumes

c

g

|x

g

, Z

g

~Normal0, o

c

2

, (54)

in which case the APEs are remarkably simple to obtain. They are estimated from

mo x[

z,, o

v

2

(55)

where o

v

2

o

c

2

o

u

2

. Conveniently, the pooled Tobit directly provides consistent and

G -asymptotically normal estimators of o, [, , and o

v

2

. A sandwich estimator is easily

obtained to allow arbitrary within-group correlation.

The full random effects Tobit specification adds the same conditional independence

41

assumption as random effects probit; see (26). Then o

c

2

and o

u

2

are separately identified, but

consistency requires all RE Tobit assumptions. In principle, a GEE-type approach can be used

in the Tobit case as well. This could be based on the scores from the Tobit log-likelihood,

although, as far as I know, details remain to be worked out.

As with probit, we can modify pooled Tobit and RE Tobit to allow for correlation between

c

g

and Z

g

, as in (28). But, with different group sizes, we should replace o

a

2

Varc

g

|zg

with

o

a,Mg

2

. As in the probit case, we can estimate a different variance for each group size M

g

if the

group sizes are relatively few. An even easier approach is to allow all parameters to differ by

M

g

, estimate the APEs in each case, and average them together. Either pooled Tobit or RE

Tobit can be used. The APEs would be obtained from

G

1

g1

G

mp

Mg

z,

Mg

zg

Mg

, o

v,Mg

2

(56)

where the M

g

just indicates a different set of estimates for each different group size.

With larger M

g

, one might use fixed effects Tobit. While it is known that the estimator

of , suffers from an incidental parameters problem, one can use

G

1

g1

G

mz,

g

, o

u

2

(57)

to estimate APEs. I know of no simulation evidence that studies APEs based on (57); they

might be reasonable estimates for moderate M

g

. Fernndez-Val (2005) contains some relevant

theory.

3.1.4. Fractional Responses

42

Recently, more applied econometric work is recognizing that fractional responses can

require special treatment. Naturally, one can always opt for a linear model (as in the case of

binary responses, count, and corner solutions responses). But there is more interest in using

functional forms that ensure expected values are in the unit interval. One possibility for

responses that have probability mass at zero and one is to adopt a two-limit Tobit model. The

analysis would then be very similar to Section 3.1.3. In particular, the formulas for the

expected value in the two-limit case can be easily adapted to allow the presence of unobserved

heterogeneity, either assumed independent of the covariates or with Dc

g

|zg

assumed normal.

The drawback of the two-limit Tobit model is that it makes sense only when there is pile up

a zero and one. Even in such cases, if fully specifies the distribution Dy

gm

|x

g

, Z

g

. If we are

primarily interested in effects on the mean response, more robust methods are available. These

methods produce consistent estimators with pile up at neither, both, or only one endpoint.

Papke and Wooldridge (2005) recently proposed methods for fractional responses for

balanced panel data sets. Here I draw on that work and make suggestions for the unbalanced

case. In fact, the methods are virtually identical to the results for the probit case, but with a

change in how we interpret the probit response function. With y

gm

a fractional response, we

begin with

Ey

gm

|x

g

, Z

g

, c

g

o x

g

[ z

gm

,, m 1, . . . , M

g

. (58)

Equation (58) is attractive as a mean response In other words, we simply assume the mean

response has the probit form. [The logit form is also possible but does not lead to simple

estimating equations.] If we add the independence and normality assumption

43

c

g

|x

g

, Z

g

~Normal0, o

c

2

, (59)

just as in (24), then we have an extension of (25):

Ey

gm

|x

g

, Z

g

o

c

x

g

[

c

z

gm

,

c

, (60)

where the c index denotes scaling by 1/1 o

c

2

1/2

. Because the Bernoulli quasi-likelihood

function identifies the parameters of a correctly specified conditional mean see Gourieroux,

Monfort, and Trognon (1984) or Papke and Wooldridge (1996) we can use pooled probit

to consistently estimate the scaled parameters and APEs. Naturally, we should use a fully

robust sandwich variance matrix estimator to account for within-group correlation but also for

the fact that the Bernoulli variance assumption can no longer be expected to hold. See Papke

and Wooldridge (2005) for more discussion.

A random effects probit analysis does not make sense here because we do not have a fully

specified distribution. But the GEE methods described for probit apply immediately. In fact,

using standard GEE commands, you would not even know y

gm

is fractional or binary (except

possibly for a warning that you applying probit to a nonbinary response).

The discussion about allowing correlation between c

g

and Z

g

is identical to the binary

response case. Again, it is important to allow at least Varc

g

|zg

depend on the group size, and

possibly the slopes on zg

as well. A simple, flexible, but inefficient approach is to estimate a

different model for each M

g

, provided there is not much variation in M

g

. Better would be to

apply a minimum distance approach that recognizes a constant , vector on z

gm

in (58).

3.2 Large Group Size Asymptotics

44

Unlike in the linear case covered by Donald and Lang (2001), for nonlinear models no

exact inference is available. Nevertheless, if the group sizes M

g

are reasonably large, we can

extend the approximate inference using the DL approach to nonlinear models. Plus, the

minimum distance approach carries over essentially without change.

We can apply the methods to any of the nonlinear models in Section 3.1 (as well as others).

Here, I illustrate the probit response function (which means it can apply to binary or fractional

responses).

With small G but random sampling of y

gm

, z

gm

: m 1, . . . , M

g

within each g, write

Py

gm

1|z

gm

o

g

z

gm

,

m

, m 1, . . . , M

g

(61)

o

g

o x

g

[, g 1, . . . , G. (62)

As with the linear model, we assume the intercept, o

g

in (61), is a function of the group

features x

g

. With the M

g

moderately large, we can get good estimates of the o

g

. The

o

g

, g 1, . . . , G, are easily obtained by estimating a separate probit for each group.

Under (62), we can apply the minimum distance approach just as before. Let Avaro

g

(so these shrink to zero at the rate 1/M

g

).

For binary response, these are just the usual MLE estimated variances. For fractional response,

these would be from a sandwich estimate of the asymptotic variance. Then, we obtain the

minimum distance estimates as the WLS estimates from

o

g

on 1, x

g

, g 1, . . . , G (63)

using weights 1/Avaro

g

are used as the weights. This is the efficient minimum distance

45

estimator and, conveniently, the proper asymptotic standard errors are reported from the WLS

estimation (even though we are doing large M

g

, not large G, asymptotics.) The

overidentification test is obtained exactly as in the linear case: there are G K 1

degrees-of-freedom in the test.

The same cautions about using the overidentification test to reject the minimum distance

approach apply here as well. In particular, in the treatment effect setup, where x

g

is zero or

one, one might reject a weighted comparision of means simply because the means within the

control or within the treatment group differ, or both. It might make more sense to define, using

other considerations, weighted averages of the population means, and then to estimate those

means using within-group sample averages. Similar considerations may be relevant when x

g

is

a vector with a more complicated structure.

If we impose common slopes ,

g

,, g 1, . . . , G, then in applying MD estimation we

should account for the across group correlation in the o

g

, as described in Section 2.3. This is

also true if we impose that a subset of the slopes are common across g.

If we reject the overidentification restrictions and wish to have conservative inference, then

we can adapt Donald and Lang and treat

o

g

o x

g

[ error

g

, g 1, . . . , G (64)

as approximately satisfying the classical linear model assumptions, provided G K 1, just as

before. As in the linear case, this approach is justified if o

g

o x

g

[ c

g

with c

g

independent of x

g

and c

g

drawn from a homoskedastic normal distribution. It assumes that we

can ignore the estimation error in o

g

, based on o

g

o

g

O1/ M

g

.

Because the DL approach ignores the estimation error in o

g

, it is unchanged if one imposes

46

some constant slopes across the groups, as with the linear model.

Once one estimates o and [, the estimated effect on the response probability can be

obtained by averaging the response probability for a given x:

G

1

g1

G

M

g

1

m1

Mg

o x[

z

gm

,

g

, (65)

where derivatives or differences with respect to the elements of x can be computed. Here, the

minimum distance approach has an important advantage of the DL approach: the finite sample

properties of (65) are viritually impossible to obtain, whereas the large-M

g

asymptotics

underlying minimum distance would be straightforward using the delta method. Whether

bootstrapping can be applied is an interesting question.

Particularly with binary response problems, the two-step methods described here are

problematical when the response does not vary within group. For example, suppose that x

g

is a

binary treatment equal to one for receiving a voucher to attend college and y

gm

is an

indicator of attending college. Each group is a high school class, say. If some high schools

have all students attend college, one cannot use probit (or logit) of y

gm

on z

gm

, m 1, . . . , M

g

.

A linear regression returns zero slope coefficients and intercept equal to unity. Of course, if

randomization occurs at the group level that is, x

g

is independent of group attributes then it

is not necessary to control for the z

gm

. Instead, the within-group averages can be used in a

simple minimum distance approach. In this case, as y

gm

is binary, the DL approximation will

not be valid, as the CLM assumptions will not even approximately hold in the model

y g

o x

g

[ e

g

(because y g

is always a fraction regardless of the size of M

g

).

As in the linear case, more flexibility is afforded if G is somewhat large along with large

M

g

. Then, in implementing the DL approach with or without common slopes imposed in the

47

first stage one gains robustness to nonnormality of c

g

if G is large enough so that

G

1

g1

G

c

g

and G

1

g1

G

x

g

c

g

are approximately normally distributed. If we estimate

separate models in the first stage and do not wish to ignore the estimation error in o

g

, then

error

g

in (64) is heteroskedastic, and we can, perhaps, use heteroskedasticity-robust standard

errors in (64).

4. CONCLUDING REMARKS

My purpose is writing this expanded version of Wooldridge (2003) is to flesh out some

issues only touched on in the shorter paper, while also considering more general frameworks.

This version of the paper contains what are best described as conjectures, or at least

suggestions that may not turn out to work very well. Future research could resolve the issue of

how to allow group heterogeneity to be correlated with the unit-specific covariates when the

group sizes differ. Minimum distance estimation strikes me as promising, but that remains to

be seen. Less parametric approaches should also be considered.

Much more remains to be learned about the small G case, too, particularly for nonlinear

models. A similuation study comparing Donald and Langs (2001) conservative approach to

minimum distance estimation, when neither method is entirely appropriate, would be

worthwhile. For example, in a treatment effect case with G 4, two treatment groups and two

control groups, with different means within each group, one might select different, modest

sizes of M

g

(ranging from maybe 20 to 40). The underlying population distributions can be

normal. Which method, DL or minimum distance, produces the best confidence intervals for

48

the ATE (defined appropriately for a fictitious population)?

49

REFERENCES

Angrist, J.D. and V. Lavy (2002), The Effect of High School Matriculation Awards:

Evidence from Randomized Trials, NBER Working Paper 9389.

Arellano, M. (1987), Computing Robust Standard Errors for Within-Groups Estimators,

Oxford Bulletin of Economics and Statistics 49, 431-434.

Baker, M. and N.M. Fortin (2001), Occupational Gender Composition and Wages in

Canada, 1987-1988, Canadian Journal of Economics 34, 345-376.

Bertrand, M., E. Duflo, and S. Mullainathan (2002), How Much Should We Trust

Differences-in-Differences Estimates? mimeo, MIT Department of Economics.

Blundell, R., R. Griffith, and F. Windmeijer (2002), Individual Effects and Dynamics in

Count Data Models, Journal of Econometrics 108, 113-131.

Campolieti, M. (2004), Disability Insurance Benefits and Labor Supply: Some Additional

Evidence, Journal of Labor Economics 22, 863-889.

Card, D., and A.B. Krueger (1994), Minimum Wages and Employment: A Case Study of

the Fast-Food Industry in New Jersey and Pennsylvania, American Economic Review 84,

772-793.

Donald, S.G. and K. Lang (2001), Inference with Difference in Differences and Other

Panel Data, mimeo, University of Texas Department of Economics.

Fernndez-Val, I. (2005), Estimation of Structural Parameters and Marginal Effects in

Binary Choice Panel Data Models with Fixed Effects, mimeo, Boston University Department

of Economics.

50

Hausman, J.A., B.H. Hall, and Z. Griliches (1984), Econometric Models for Count Data

with an Application to the Patents-R&D Relationship, Econometrica 52, 909-938.

Hsiao, C. (1986), Analysis of Panel Data. Cambridge: Cambridge University Press.

Kzdi, G. (2001), Robust Standard Error Estimation in Fixed-Effects Panel Models,

mimeo, University of Michigan Department of Economics.

Liang, K.-Y., and S.L. Zeger (1986), Longitudinal Data Analysis Using Generalized

Linear Models, Biometrika 73, 13-22.

Loeb, S. and J. Bound (1996), The Effect of Measured School Inputs on Academic

Achievement: Evidence form the 1920s, 1930s and 1940s Birth Cohorts, Review of

Economics and Statistics 78, 653-664

Moulton, B.R. (1990), An Illustration of a Pitfall in Estimating the Effects of Aggregate

Variables on Micro Units, Review of Economics and Statistics 72, 334-338.

Papke, L.E. and J.M. Wooldridge (1996), Econometric Methods for Fractional Response

Variables with an Application to 401(k) Plan Participation Rates, Journal of Applied

Econometrics 11, 619-632.

Papke, L.E. and J.M. Wooldridge (2005), Panel Data Methods for Fractional Response

Variables with an Application to Test Pass Rates, mimeo, Michigan State University

Department of Economics.

Pepper, J.V. (2002), Robust Inferences from Random Clustered Samples: An Application

Using Data from the Panel Study of Income Dynamics, Economics Letters 75, 341-345.

White, H. (1980), A Heterosekdasticity-Consistent Covariance Matrix Estimator and A

Direct Test for Heteroskeasticity, Econometrica 48, 817-838.

White, H. (1982), Maximum Likelihood Estimation with Misspecified Models,

51

Econometrica 50, 1-26.

White, H. (1984), Asymptotic Theory for Econometricians. Academic Press: Orlando, FL.

Wooldridge, J.M. (1999) Distribution-Free Estimation of Some Nonlinear Panel Data

Models, Journal of Econometrics 90, 77-97.

Wooldridge, J.M. (2002), Econometric Analysis of Cross Section and Panel Data. MIT

Press: Cambridge, MA.

Wooldridge, J.M. (2003), Cluster-Sample Methods in Applied Econometrics, American

Economic Review 93, 133-138.

Wooldridge, J.M. (2005), Unobserved Heterogeneity and Estimation of Average Partial

Effects, in Identification and Inference for Econometric Models: Essays in Honor of Thomas

Rothenberg. D.W.K. Andrews and J.H. Stock (eds.). Cambridge: Cambridge University Press,

27-55.

52

APPENDIX

Below are commands in version 8.0 of Stata that can be used for robust inference with

cluster or panel data sets. All commands assume that there is an identifier, id, that identifies

the cluster or group for each observation.

Section 2.2

Pooled OLS

reg y x1 x2 ... xK z1 z2 ... zL, cluster(id)

Random Effects: Usual Inference

iis id

xtreg y x1 ... xK z1 ... zL, re

Random Effects: Fully Robust Inference

iis id

xtgee y x1 x2 ... xK z1 ... zL, corr(exch) robust

Note: The xtgee and xtreg estimates will generally differ for two reasons. First, xtgee uses

estimates of o

c

2

based on an initial pooled OLS estimation, while xtreg uses a fixed-effects

based estimate. Second, xtgee iterates, while xtreg does not.

Section 2.3

A separate linear model is estimated for each group identifier to obtain the asymptotic

53

variances of the intercepts. In the second step, OLS can be used (as in DL), relying on the

reported t statistics and critical values, or minimum distance (implemented as WLS) using the

standard normal approximation.

reg y z1 ... zL if id g

Section 3.1.1

Pooled Probit

probit y x1 x2 ... xK z1 z2 ... zL, robust cluster(id)

Random Effects Probit

iis id

xtprobit y x1 x2 ... xK z1 z2 ... zK, re

GEE with an Exchangeable Working Correlation Matrix

iis id

xtgee y x1 x2 ... xK z1 z2 ... zK, family(bin) link(probit) corr(exch) robust

or

xtprobit y x1 x2 ... xK z1 z2 ... zK, pa

(The qualifier pa stands for population averaged. Stata simply uses the first command

when pa is used with xtprobit.)

Using the Chamberlain-Mundlak Device

All of the previous commands can include the within-group averages of the unit-specific

covariates. For example, Chamberlains random effects probit model can be estimated by

iis id

54

xtprobit y x1 x2 ... xK z1 z2 ... zL z1bar z2bar ... zLbar, re

But, as discussed in Section 3.1.1, at a minimum these coefficients are all scaled by a

variance that depends on the group size, M

g

. One could do this estimation for each group size

and average the APEs. Even better would be to use minimum distance to impose the

restriction of a common , in the structural model.

Section 3.1.2

Pooled Poisson

poisson y x1 ... xK z1 ... zL, robust cluster(id)

Random Effects Poisson

iis id

xtpois y x1 ... xK z1 ... zL, re

GEE with an Exchangeable WCM

xtgee y x1 ... xK z1 ... zL, family(pois) link(log) corr(exc) robust

The Fixed Effects Poisson Estimator

iis id

xtpois y x1 ... xK z1 ... zL, fe

Note: Currently, Stata does not report robust standard errors for FE Poisson

Section 3.1.3

Pooled Tobit

tobit y x1... xK z1 ... zL z1bar ... zLbar, ll(0) cluster(id)

55

Random Effects Tobit

xttobit y x1... xK z1 ... zL z1bar ... zLbar, re

Section 3.1.4

Pooled Two-Limit Tobit

tobit y x1... xK z1 ... zL z1bar ... zLbar, ll(0) ul(1) cluster(id)

Note: Unfortunately, xttobit does not currently allow specifying an upper bound, so the RE

version of the two-limit Tobit is not immediately available.

Pooled Quasi-MLE

glm y x1 ... xK z1 ... zL, family(bin) link(probit) robust cluster(id)

GEE with Exchangeable WCM

xtgee y x1 ... xK z1 ... zL, family(bin) link(logit) corr(exch) robust

Using the Chamberlain-Mundlak Device

For GEE, this would be

xtgee y x1 ... xK z1 ... zL z1bar ... zLbar, family(bin) link(probit) corr(exch) robust

Section 3.2

A separate model is estimated for each group identifier to obtain the asymptotic variances

of the intercepts. So, the commands are carried out for each group size, g. In the second step,

OLS can be used (as in DL), relying on the reported t statistics and critical values, or minimum

distance (implemented as WLS) using the standard normal approximation.

Binary Response

56

probit y z1 ... zL if id g

Count Response

glm y z1 ... zL if id g, family(poisson) link(log) robust

Fractional Response

glm y z1 ... zL if id g, family(bin) link(probit) robust

or

glm y z1 ... zL if id g, family(bin) link(logit) robust

57

- Introductory econometrics test bankUploaded byJohn Deichen
- Wooldridge Solution chapter 3Uploaded byAdami Fajri
- Wooldridge.solutions.chap.7 11Uploaded byBurner
- Wooldridge.solutions.chap.2 6Uploaded byBurner
- Econometrics Cheat Sheet Stock and WatsonUploaded bypeathepeanut
- Jeffrey M Wooldridge Solutions Manual and Supplementary Materials for Econometric Analysis of Cross Section and Panel Data 2003Uploaded byvanyta0201
- Solucionario Econometría Jeffrey M. WooldridgeUploaded byHéctor F Bonilla
- wooldridge(1)Uploaded byAlexis Singh
- EconometricsUploaded byRjnbk
- Introduction to Econometrics James Stock Watson 2e Part2Uploaded bygrvkpr
- Jeffrey M. Wooldridge-Econometric Analysis of Cross Section and Panel Data, 2nd Edition(2010)Uploaded byLuke Sperduto
- Introductory Eco No Metrics Wooldridge 4th Edition Solutions ManualUploaded bynayanandtina
- econometrics SolutionsUploaded byarmailgm
- Basic Econometrics Chapter 3 SolutionsUploaded bytariqmtch
- Econ5025 Practice ProblemsUploaded byTrang Nguyen
- UNSW -Econ2206 Solutions Semester 1 2011 - Introductory Eco No MetricsUploaded byleobe89
- EconometricsUploaded byGiovanniValente
- Econometric Analysis of Cross Section and Panel Data 2010.pdfUploaded bytrovas22
- 236185337-Solucionario-Econometria-Jeffrey-M-Wooldridge.pdfUploaded byarley320
- Chapter Specific AppendicesUploaded byEdu Marin
- Solution Manual for MicroeconometricsUploaded bySafis Hajjouz
- Introductory EconometricsUploaded byJohn Smythe
- Jehle and Reny SolutionsUploaded byKevin Andrew
- Cameron & Trivedi 2005 Microeconometrics Methods and Applications SolutionsUploaded byVanessa Gonçalves
- Solutions Jehle RanyUploaded bypatipet2753
- Stock Watson 3U ExerciseSolutions Chapter9 InstructorsUploaded byvivo15
- 236185337-Solucionario-Econometria-Jeffrey-M-Wooldridge.pdfUploaded byMartín Cruzat Riquelme
- Baum - An Introduction to Modern Econometrics Using StataUploaded byVelichka Dimitrova
- Foundations of Mathematical EconomicsUploaded byHisham Haider Dewan
- Econ 113 Probset2 SolUploaded bynataliasbueno

- Botakoz Kasumbekova Understanding Stalinism 2017Uploaded byamudaryo
- Botakoz Kasumbekova European Resttlement in TJK 2011Uploaded byamudaryo
- Call for Papers Nordic Conf March 182 2018pdfUploaded byamudaryo
- John Keats PoemsUploaded byamudaryo
- Outlook Tutorial - Outlook BasicsUploaded byc_clipper
- Deloitte Global Value Chain 2015Uploaded byamudaryo
- Brexit and UK 2017Uploaded byamudaryo
- Suyarkulova Uzbek an Elementary Textbook 2012Uploaded byamudaryo
- PDF PowerPoint Tutorial - PowerPoint BasicsUploaded byKhadija Rv
- Tajikistan SnapshotUploaded byamudaryo
- Ban 2015 GovernanceUploaded byamudaryo
- Verhoeven Multiyear Budget and Fiscal Performance 2014Uploaded byamudaryo
- Scholarforum Migrationcentral Asia and Mongolia 20160603Uploaded byamudaryo
- Haizhen Mou Oct 7Uploaded byamudaryo
- Taxotere NewUploaded byamudaryo
- Fritz Strengthening PFM 2014Uploaded byamudaryo
- WS Minto PosterUploaded byamudaryo
- Baltagi Financial Development and Openness 2009Uploaded byamudaryo
- Kiingdom of CaubulUploaded byamudaryo
- Broome 2015 GovernanceUploaded byamudaryo
- Japan-IMF ScholarshipUploaded byamudaryo
- R-Cheat SheetUploaded byPrasad Marathe
- ODI Budget Credibility 2014Uploaded byamudaryo
- Chapter1 EnUploaded byamudaryo
- Scopes CountrylistUploaded byELgün F. Qurbanov
- Tax Burden in ChinaUploaded byamudaryo
- infedilidadeUploaded byamudaryo

- ideapad_10015ibdUploaded byrubenmena1
- 2005 Mathematical Methods (CAS) Exam Assessment Report Exam 1.pdfUploaded bypinkangel2868_142411
- LC-60LE630EUploaded byGustavoLopezGuardado
- Lecture 10 Markov Analysis 1(1)Uploaded byKim Reyes
- Introdiction to Power ElectronicsUploaded byNo-No-No
- Transportation Statistics: entireUploaded byBTS
- Session 2 The Current State of ICTUploaded bySpare Man
- Accpac - Guide - Checklist for SM.pdfUploaded bycaplusinc
- 94449554-MAN-48-60-Engine-ExternUploaded bylubangjarum
- Building a Web Services ODBC DriverUploaded bykptran
- Global Access.docxUploaded bySasi Gangadhar B
- PassUploaded byAnonymous IG4gsLBuK
- manual para configurar un servidor ldapUploaded byEdwin Gutierrez
- Protecting From Network ThreatsUploaded byThet
- Audiocodes Mediant 800 MSBGUploaded byJulio Guarniz
- Diameter Edge - SonusUploaded byDecimusBlue
- 2080-rm001_-en-eUploaded byCuong Pham
- Zermelo Fraenkel Set Theory 001Uploaded byArthur Dux
- SB_CSE_MAUploaded byAnubhav000007
- 1-8 practice_cUploaded byStanley
- 02 Maintenance Strategies.pdfUploaded byIrfannurllah Shah
- Ford Transit Connect 2010Uploaded byStoneham Ford
- Developing XML SolutionsUploaded byapi-3738603
- PR4GM4_31010_v03englishUploaded byMihai Mihailescu
- Quiz ConvolutionUploaded byGeorge Karagiannidis
- Redirect Website Using Nat - MikroTik RouterOSUploaded byAndres Felipe G
- SCM270-FlexiblePlanning.docUploaded byharry el sucio
- EPM Theory ImplementationUploaded bykvpy26
- Arburg 720s Web 524650 en UsUploaded byumberti
- Step by Step Procedure for Creation of IDOCUploaded byjolliest