Journal of Econometrics
journal homepage: www.elsevier.com/locate/jeconom
Keywords:
Spatial statistics
Maximum likelihood
Probit model
H. Wang et al. / Journal of Econometrics 172 (2013) 77–89
✩ We are grateful to two referees, the Associate Editor, the Co-Editor and participants at the 2009 Econometric Society European Meeting, the 2009 Simposio de Análisis Económico, the 2010 "Brunel Macroeconomic Research Centre"-QASS Conference on Macro and Financial Economics and at seminars at London City University, London School of Economics, Tinbergen Institute, UCL, University Carlos III, University of Essex and University of Exeter for very useful comments. Any remaining errors are our own.
∗ Corresponding author.
E-mail addresses: hwang@hkma.gov.hk (H. Wang), emma.iglesias@udc.es, iglesia5@msu.edu (E.M. Iglesias), wooldri1@msu.edu (J.M. Wooldridge).
1 Anselin et al. (2004) wrote a comprehensive review about econometrics for spatial models.
0304-4076/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.jeconom.2012.08.005

1. Introduction

Most econometric techniques using cross-sectional data are based on the assumption of independence of the observations. When the data are outcomes measured at different geographical locations the assumption of independence is tenuous, especially as economic activities have become more and more correlated over space with the advent of modern communication and transportation improvements. Technological advances in the geographic information system (GIS) make collecting spatial data easier than ever before. Consequently, the possibility of spatial correlation among observations has received more and more attention in a wide range of fields, including regional, real estate, agricultural, environmental, and industrial organization economics (Lee, 2004).

Econometricians have begun to pay more attention to spatial dependence problems in the last two decades, and there have been important theoretical and empirical advances.1 The analysis of spatial data starts with an underlying spatial structure generating observed spatial correlations (Anselin and Florax, 1995). There are two popular ways of capturing spatial dependence. The first is in the domain of geostatistics, where the spatial index is continuous (Conley, 1999). The second is to assume that spatial sites form a countable lattice (Lee, 2004). Among lattice models, there are also two types of spatial dependence models that have received the bulk of the attention: the spatial autoregressive dependent variable model (SAR) and the spatial autoregressive error model (SAE). In most applications of spatial models, the dependent variables are continuous, work that has been aided by important theoretical results in Conley (1999), Lee (2004), and Kelejian and Prucha (1999, 2001). Nevertheless, there are a handful of applications that address spatial dependence with discrete choice dependent variables
(Case, 1991; McMillen, 1995; Pinkse and Slade, 1998; Lesage, 2000; Beron and Vijverberg, 2003; Pinkse et al., 2006). The purpose of this paper is to advance the available estimation methods for spatially correlated binary outcomes.

While the analysis can be made more general, we focus on the probit model with spatially correlated data. As is now well known, if we ignore the spatial correlation and construct a pseudo-likelihood function as if we had independent draws, the resulting pooled maximum likelihood estimator (MLE) is, under fairly general conditions, consistent and asymptotically normal, provided the marginal model is correctly specified. Poirier and Ruud (1988) established this result for time series data, and it is pretty clear that it also holds for spatial data under certain assumptions that restrict the amount of dependence. The main drawback to applying the pooled MLE when the observations are dependent is a loss of efficiency. Some authors, for example Robinson (1982), explicitly consider joint maximum likelihood estimation of a nonlinear model with time series data. Unfortunately, in the context of spatially correlated data, obtaining maximum likelihood estimators that account for the joint dependence in the data is computationally very demanding.

Rather than taking either extreme (ignoring the dependence in the data or trying to model full joint dependence), middle-ground approaches are possible. For example, Poirier and Ruud (1988) show how to estimate the probit model with dependence in time-series data using generalized conditional moment (GCM) estimators. These estimators are computationally attractive and relatively more efficient than ignoring serial dependence. Generally, nonlinear models with a time series dimension can be estimated by generalized method of moments (GMM). The GMM approach is (asymptotically) more efficient than just using a pooled MLE procedure. However, because time series dependence is ignored in forming moment conditions, GMM estimators still can be considerably less efficient than the joint MLE.

Similar considerations hold for spatially correlated data. Methods that only use information on the marginal distributions, such as Pinkse and Slade's (1998) GMM estimator of the SAE probit model based on the pooled MLE first order conditions, potentially give up much in terms of efficiency compared with a full MLE approach. The motivation for the current paper is that joint MLE is often prohibitively difficult, while methods based only on marginal distributions will often be too imprecise. Therefore, we propose a middle ground between a pooled probit approach and full maximum likelihood. In particular, we choose to capture spatial dependence by assuming that sites form a countable lattice. Then, we divide the lattice into many small groups (clusters), where the clusters are formed from adjacent observations. The resulting structure is a large number of small clusters. If we can obtain the joint density of the responses within cluster, we can improve upon methods that completely ignore the spatial dependence while arriving at estimation methods much less computationally demanding than joint MLE. We refer to our proposed method as "partial MLE" because we are only using partial joint distributions, not the entire joint distribution.

Because we model spatial correlation only within a cluster, we still need to account for spatial correlation across clusters. This feature is what distinguishes the current setting from a standard panel data setting, where independence across clusters is assumed. To obtain valid inference, we appeal to Conley (1999), who extends Newey and West (1987) to allow for data generated by a countable lattice. Conley (1999) uses metrics of economic distance to characterize dependence among agents, and shows that the GMM estimator is consistent and asymptotically normal under assumptions similar to those used for time-series data.

The rest of the paper is organized as follows. Section 2 provides a brief overview of popular spatial models with a binary response. Section 3 presents the bivariate spatial probit model. In Section 4, we prove consistency and asymptotic normality of the PML estimator (PMLE) under regularity assumptions, and discuss how to get consistent covariance matrix estimators. Section 5 presents a simulation study showing the advantages of our new estimation procedure in this setting. Finally, Section 6 concludes. The proofs are collected in Appendix A, while the results for the simulation study are provided in Appendix B.

2. Discrete choice models with spatial dependence

It is useful to begin with a brief discussion of general binary response models with spatial dependence. For a draw i, let Yi be a binary outcome and Xi a 1 × K vector of covariates. Assume that Yi is generated as

\[ Y_i = 1[X_i\beta + \varepsilon_i > 0], \tag{1} \]

where εi is an unobserved error and β is a K × 1 parameter vector to be estimated. Regardless of any dependence in the data across i, if εi is independent of Xi, then the response probability P(Yi = 1|Xi) can be obtained if the distribution of εi is known. In the case where εi ∼ Normal(0, 1), it is well known that P(Yi = 1|Xi) = Φ(Xi β), where Φ denotes the standard normal cumulative distribution function (cdf). The "marginal probability" can be used, under general assumptions, to consistently estimate β using a pooled MLE procedure, even though the data may not be independent. This is effectively the insight of the Poirier and Ruud (1988) results for time series data.

Allowing explicitly for spatial correlation of the kind that is popular for linear models raises a couple of important issues, as recognized in Pinkse and Slade (1998). First, the variance of the error in such models typically depends on the distances among pairs of observations in the lattice, via the matrix that is used in a weighted least squares analysis. Let W denote an N × N matrix of weights that are exogenous in the sense that

\[ \varepsilon_i \mid X, W \sim \mathrm{Normal}(0, h_i(W, \lambda)), \tag{2} \]

where hi(W, λ) > 0 is a variance function that depends on λ. The form of hi(·) differs across spatial models and is not yet important. The exogeneity assumption is embodied in the requirement E(εi|X, W) = 0, which also imposes a strict exogeneity assumption on the covariates X.

If we maintain (2) along with (1) then D(Yi|X, W) follows a so-called heteroskedastic probit model with

\[ P(Y_i = 1 \mid X, W) = \Phi\left( X_i\beta / \sqrt{h_i(W, \lambda)} \right). \tag{3} \]

Under sufficient regularity conditions, mainly restricting the amount of spatial dependence, β and λ can be consistently and √n-asymptotically normally estimated by using a pooled heteroskedastic probit approach. These moment conditions are used in the Pinkse and Slade (1998) GMM estimator.

Before we proceed further, the presence of W in (3) raises a question about how we should summarize the partial effects of the elements of Xi on the response probability. The notion of the average structural function (ASF), proposed by Blundell and Powell (2004) in a different context, seems useful. In the present application, the ASF is defined as

\[ \mathrm{ASF}(x) = E_W\left[ \Phi\left( x\beta / \sqrt{h_i(W, \lambda)} \right) \right]. \tag{4} \]

The average partial effects are obtained by taking changes or partial derivatives of ASF(x). Given consistent estimators β̂ and λ̂, ASF(x) can be (under regularity conditions) consistently estimated by

\[ n^{-1} \sum_{i=1}^{n} \Phi\left( x\hat{\beta} / \sqrt{h_i(W, \hat{\lambda})} \right). \tag{5} \]
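As a rough illustration of the sample analogue in (5), the sketch below averages the heteroskedastic probit response probability over fitted variances and approximates an average partial effect by a central difference of the estimated ASF. The function names (`asf_hat`, `ape_hat`) and the numerical inputs are hypothetical; in practice the fitted variances would come from whatever form of hi(W, λ̂) the chosen spatial model implies.

```python
import math


def norm_cdf(z):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def asf_hat(x, beta_hat, h_hat):
    """Sample analogue (5): average of Phi(x*beta_hat / sqrt(h_i)) over i."""
    index = sum(xk * bk for xk, bk in zip(x, beta_hat))
    return sum(norm_cdf(index / math.sqrt(h)) for h in h_hat) / len(h_hat)


def ape_hat(x, beta_hat, h_hat, k, step=1e-5):
    """Partial effect of continuous regressor k: central difference of the ASF."""
    up, down = list(x), list(x)
    up[k] += step
    down[k] -= step
    return (asf_hat(up, beta_hat, h_hat) - asf_hat(down, beta_hat, h_hat)) / (2.0 * step)


# Hypothetical inputs, for illustration only.
beta_hat = [0.5, -1.0]
h_hat = [1.0, 1.3, 2.0, 1.1]   # stand-ins for fitted variances h_i(W, lambda_hat)
x = [1.0, 0.2]
print(asf_hat(x, beta_hat, h_hat), ape_hat(x, beta_hat, h_hat, k=1))
```

For a discrete regressor, the difference of `asf_hat` at the two values of interest would replace the derivative.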
\[ Y_i^* = X_i\beta + \varepsilon_i, \tag{6} \]

\[ \varepsilon_i = \lambda \sum_{j=1}^{n} W_{ij}\varepsilon_j + u_i. \tag{7} \]

fact, constructing the likelihood function requires N-dimensional integration of a multivariate normal distribution. We refer the reader to Lee (2004) for details.

While the formulation in (10) is common, it is not the only possibility. We may prefer more of a moving average structure, such as

\[ \varepsilon_i = u_i + \lambda \sum_{h \neq i} W_{ih} u_h, \tag{11} \]

where the ui are i.i.d. Normal(0, 1) random variables. This formulation is attractive because it is relatively easy to find variances and pairwise covariances (which we will use in the pseudo MLEs introduced in the next section). In particular,

\[ \mathrm{Var}(\varepsilon_i \mid W) = 1 + \lambda^2 \sum_{h \neq i} W_{ih}^2 \tag{12} \]

and

\[ \mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid W) = \lambda W_{ij} + \lambda W_{ji} + \lambda^2 \sum_{h \neq i,\, h \neq j} W_{ih} W_{jh}. \tag{13} \]

Notice that if we use only Var(εi|W) in a pooled analysis we would have to take λ > 0, because (12) depends on λ only through λ².

3. Using partial MLEs to estimate general spatial probit models

As mentioned earlier, estimating a probit spatial autocorrelation model by full MLE is a prodigious task. One can use the EM algorithm (McMillen, 1992), the RIS simulator (Beron and Vijverberg, 2003), or the Bayesian Gibbs sampler (Lesage, 2000). But each of these approaches is computationally burdensome, making it very difficult to conduct simulation studies or to quickly estimate a range of models.

\[ \cdots + (1 - Y_i)\log[1 - \Phi(X_i\beta/\sigma_i(\lambda))]\}, \tag{14} \]

where σi(λ) is shorthand for √Var(εi|W). Assuming that β and λ are identified, and that the conditions below in Section 4 hold, the pooled heteroskedastic probit is generally consistent and √n-asymptotically normal. But, for reasons we discussed above, it is likely to be very inefficient relative to the full MLE. While Pinkse and Slade's GMM estimator can help a little, estimators that use some information on the spatial correlation across observations seem more promising in terms of increasing precision.

3.2. Bivariate probit partial MLE

We now turn to using information on pairs of "nearby" observations to identify and estimate β and λ. There is nothing special about using pairs; we could use, say, triplets, or even larger groups. But the bivariate case is easy to illustrate and is computationally quite feasible.

For illustration, assume a sample includes 2n observations, and we divide the 2n observations into n pairwise groups according, for example, to the spatial Euclidean distance between them (see Fig. 1). In other words, each group includes two observations, with the idea being that the internal correlation between the two observations is more important than external correlations with observations in other groups. Of course, the way that we group the observations will affect the asymptotic variance of our procedure. In practice, we recommend specifying different types of grouping and checking whether the variance estimates are reduced significantly in the different cases. One could, after obtaining estimators from several groupings, apply an efficient minimum distance procedure (for example Wooldridge, 2010, Chapter 14) to obtain a single estimator.

Let $Y_g^* = (Y_{g1}^*, Y_{g2}^*)$ be the bivariate vector of latent outcomes for group g, and assume for notational simplicity that we have 2n
observations. Write the linear equations for group g as

\[ Y_{g1}^* = X_{g1}\beta + \varepsilon_{g1} \tag{15} \]

\[ Y_{g2}^* = X_{g2}\beta + \varepsilon_{g2}, \tag{16} \]

where Xg1 and Xg2 are 1 × K vectors of regressors and β is a K × 1 vector; εg1 and εg2 are scalars. These two equations look like a two-period panel data model, but we must recognize that the variances and covariance depend on the weighting matrix in the underlying spatial model. Not only are εg1 and εg2 correlated with each other, but they are also correlated with the errors in other groups. Therefore, the variances and covariance of εg1 and εg2 depend not only on the weight within the group, but also on the weights with other observations outside the group.

By assumption, E(εg1|Xg1, W) = E(εg2|Xg2, W) = 0. Write the 2 × 2 variance matrix as

\[ \mathrm{Var}(\varepsilon_g \mid X_g, W) \equiv \Omega_g(W, \lambda) = \begin{pmatrix} \Omega_{g11} & \Omega_{g12} \\ \Omega_{g21} & \Omega_{g22} \end{pmatrix}, \tag{17} \]

where we often suppress the dependence of Ωg(W, λ) on W and λ in what follows. The variance terms are the same as in the Pinkse and Slade (1998) approach. To implement our procedure, the covariance must also be computed; exploiting this correlation is the source of improving the precision of the estimates of β and λ.

Let Yg1 and Yg2 be the binary outcomes associated with group g. The conditional joint distribution of Yg1 and Yg2 given Xg (and W), implied by the bivariate normality of the errors, is

\[ P(Y_{g1} = 1, Y_{g2} = 1 \mid X_g) = P(X_{g1}\beta + \varepsilon_{g1} > 0,\ X_{g2}\beta + \varepsilon_{g2} > 0 \mid X_g) \tag{18} \]

\[ = P(\varepsilon_{g1} \le X_{g1}\beta,\ \varepsilon_{g2} \le X_{g2}\beta \mid X_g) = \Phi_2\left( \frac{X_{g1}\beta}{\sqrt{\Omega_{g11}}}, \frac{X_{g2}\beta}{\sqrt{\Omega_{g22}}}, \rho_g \right), \tag{19} \]

\[ \rho_g = \frac{\mathrm{Cov}(\varepsilon_{g1}, \varepsilon_{g2})}{\sqrt{\mathrm{Var}(\varepsilon_{g1})\,\mathrm{Var}(\varepsilon_{g2})}} = \frac{\Omega_{g12}}{\sqrt{\Omega_{g11}\Omega_{g22}}}, \tag{20} \]

where Φ2 is the standard bivariate normal distribution, φ2 is the standard density function of the bivariate normal distribution and ρg is the standardized covariance between the two error terms. Estimation in this context is similar to "random effects" probit with two "time periods" and n observations. The difference is that we have system heteroskedasticity in Var(εg|Xg, W) and spatial correlation across g.

Obtaining the joint probabilities within the group is not difficult, and is most easily done by finding marginal and conditional probabilities. Given that (εg1, εg2) has a joint normal distribution, we can write

\[ \varepsilon_{g1} = \delta_{g1}\varepsilon_{g2} + e_{g1} \tag{21} \]

where

\[ \delta_{g1} = \frac{\mathrm{Cov}(\varepsilon_{g1}, \varepsilon_{g2})}{\mathrm{Var}(\varepsilon_{g2})}, \tag{22} \]

and eg1 is independent of Xg and εg2. Because of the joint normality of (εg1, εg2), eg1 is also normally distributed with E(eg1) = 0, and

\[ \mathrm{Var}(e_{g1}) = \mathrm{Var}(\varepsilon_{g1}) - \delta_{g1}^2\,\mathrm{Var}(\varepsilon_{g2}). \tag{23} \]

Thus, we can write

\[ P(Y_{g1} = 1 \mid X_g, \varepsilon_{g2}) = \Phi\left( \frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\mathrm{Var}(e_{g1})}} \right). \tag{24} \]

Once we obtain (24), we can retrieve explicit expressions for P(Yg1 = 1, Yg2 = 1|Xg), P(Yg1 = 0, Yg2 = 1|Xg), P(Yg1 = 1, Yg2 = 0|Xg) and P(Yg1 = 0, Yg2 = 0|Xg) as given in Appendix A. These properties form the basis of a partial MLE with two responses per group. We now turn to the asymptotic properties of this estimator.

4. Asymptotic properties of the partial MLEs

In the context of panel data and also cluster samples, Wooldridge (2010, Chapters 13 and 20) discusses partial MLE methods. These PMLEs apply to pooled log likelihoods where some dependence, across time or within cluster, is ignored in estimation. The asymptotic theory in Wooldridge (2010) is straightforward because observations are assumed to be independent across groups (and the group sizes are fixed, as we assume here). In the present setting, we still have correlation across all clusters due to the spatial nature of the data. But the arguments for how partial MLE identifies parameters and generally has desirable asymptotic properties are essentially unchanged from the standard case. Nevertheless, the details of showing that the groups are sufficiently weakly dependent are not simple, and estimating the asymptotic variance matrix requires some care.

If we let Yg be the 2 × 1 vector of observed responses for group g, the partial log likelihood has the form

\[ L = \sum_{g=1}^{n} \log f_g(Y_g \mid X_g, W, \theta), \tag{25} \]

where fg(yg|Xg, W, θ) is the density of Yg given Xg (and we again assume there are 2n total observations). Because this conditional density is correctly specified, the partial-log-likelihood function generally identifies θ0 because of the Kullback–Leibler inequality applied for each g (Wooldridge, 2010, Chapter 13). Of course, we would need to assume or otherwise show the uniqueness of θ0.

The general PMLE results apply to the spatial probit model if we have correctly specified the bivariate normal densities φ2g(Yg1, Yg2|Xg, W, θ). To ensure correct specification, we must properly obtain the 2 × 2 conditional variance–covariance matrix of (εg1, εg2)′. This is where the underlying spatial model comes in. We must also take care in restricting the spatial dependence in the data so that the standardized sum in (25) satisfies the usual limit laws. To ensure weak dependence, we assume that the spatial process is strong mixing, which means the grouped observations form a strong mixing sequence, too. The asymptotic approximations we use are based on the thought experiment that the geographic area is increasing in size.

To facilitate asymptotic analysis, write the partial log likelihood function as

\[ L = \sum_{g=1}^{n} \{ Y_{g1}Y_{g2}\log P(Y_{g1}=1, Y_{g2}=1 \mid X_g) + Y_{g1}(1-Y_{g2})\log P(Y_{g1}=1, Y_{g2}=0 \mid X_g) + (1-Y_{g1})Y_{g2}\log P(Y_{g1}=0, Y_{g2}=1 \mid X_g) + (1-Y_{g1})(1-Y_{g2})\log P(Y_{g1}=0, Y_{g2}=0 \mid X_g) \} \tag{26} \]

and for the sake of brevity define

\[ P_g(1,1) \equiv \log P(Y_{g1}=1, Y_{g2}=1 \mid X_g); \quad P_g(1,0) \equiv \log P(Y_{g1}=1, Y_{g2}=0 \mid X_g); \tag{27} \]

\[ P_g(0,1) \equiv \log P(Y_{g1}=0, Y_{g2}=1 \mid X_g) \quad \text{and} \quad P_g(0,0) \equiv \log P(Y_{g1}=0, Y_{g2}=0 \mid X_g). \tag{28} \]
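To make the within-group calculation concrete, the following sketch computes the four joint probabilities for one pair by conditioning on εg2 as in (21)–(24) and integrating numerically, and then evaluates that group's contribution to (26). It is only a sketch under stated assumptions: the variance and covariance come from the moving-average error structure (11)–(13), the one-dimensional integral uses a composite Simpson rule rather than the closed-form expressions of Appendix A, and all names (`ma_var`, `ma_cov`, `pair_probs`, `pair_loglik`) and the three-site weight matrix are hypothetical.

```python
import math


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def ma_var(W, lam, i):
    """Var(eps_i | W) under the moving-average errors, as in (12)."""
    return 1.0 + lam ** 2 * sum(W[i][h] ** 2 for h in range(len(W)) if h != i)


def ma_cov(W, lam, i, j):
    """Cov(eps_i, eps_j | W) under the moving-average errors, as in (13)."""
    cross = sum(W[i][h] * W[j][h] for h in range(len(W)) if h not in (i, j))
    return lam * (W[i][j] + W[j][i]) + lam ** 2 * cross


def pair_probs(a1, a2, v11, v22, v12, grid=2000, width=8.0):
    """Joint probabilities of (Y_g1, Y_g2), with a1 = X_g1*beta, a2 = X_g2*beta."""
    delta = v12 / v22                          # (22)
    sd_e = math.sqrt(v11 - delta ** 2 * v22)   # square root of (23)
    s2 = math.sqrt(v22)
    lo = max(-a2 / s2, -width)                 # Y_g2 = 1 means eps_g2 / s2 > -a2 / s2
    p11 = 0.0
    if lo < width:
        step = (width - lo) / grid

        def f(t):                              # (24) times the density of eps_g2 / s2
            return norm_cdf((a1 + delta * s2 * t) / sd_e) * norm_pdf(t)

        total = f(lo) + f(width)
        for k in range(1, grid):
            total += (4.0 if k % 2 else 2.0) * f(lo + k * step)
        p11 = total * step / 3.0               # composite Simpson rule
    p1 = norm_cdf(a1 / math.sqrt(v11))         # marginal P(Y_g1 = 1)
    p2 = norm_cdf(a2 / s2)                     # marginal P(Y_g2 = 1)
    return {(1, 1): p11, (1, 0): p1 - p11, (0, 1): p2 - p11,
            (0, 0): 1.0 - p1 - p2 + p11}


def pair_loglik(y1, y2, probs):
    """Contribution of one group to the partial log likelihood (26)."""
    return math.log(probs[(y1, y2)])


# Hypothetical three-site weight matrix; sites 0 and 1 form the group.
W = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25], [0.5, 0.25, 0.0]]
lam = 0.5
probs = pair_probs(0.3, -0.2, ma_var(W, lam, 0), ma_var(W, lam, 1), ma_cov(W, lam, 0, 1))
print(probs, pair_loglik(1, 0, probs))
```

Summing `pair_loglik` over groups and maximizing over (β, λ) with a generic optimizer would give one possible implementation of the PMLE; a library routine for the bivariate normal cdf could replace the numerical integral.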
Therefore, we can rewrite the partial log likelihood (PLL) as

\[ L = \sum_{g=1}^{n} \{ Y_{g1}Y_{g2}P_g(1,1) + Y_{g1}(1-Y_{g2})P_g(1,0) + (1-Y_{g1})Y_{g2}P_g(0,1) + (1-Y_{g1})(1-Y_{g2})P_g(0,0) \}. \tag{29} \]

The PLL in (29) is the simplest way one might exploit spatial correlation in pairs of observations. One possibility is to expand the group sizes and shrink the number of groups, although expanding the group size makes the computational problem harder (because the dimension of Yg grows). Previously we mentioned the possibility of using several different pairings and using minimum distance estimation. A related possibility is to estimate (β, λ) by pooling pseudo log likelihoods across multiple partitions of the data. Suppose we settle on J partitions of the data into n groups of two observations. Let $i_{jg1}$ and $i_{jg2}$ denote the observation indices of the first and second member of group g in partition j = 1, . . . , J. Then we can form a PLL as

\[ \sum_{j=1}^{J} \sum_{g=1}^{n} \log f_{i_{jg1}, i_{jg2}}(\beta, \lambda), \tag{30} \]

where $f_{i_{jg1}, i_{jg2}}(\beta, \lambda)$ has the same form as the contribution to the log likelihood given in (29). This estimator will be a bit more computationally demanding than the one that we propose explicitly in this paper, but it will be more efficient. In this paper our asymptotic analysis is restricted to the case of a single partition of the data.

4.1. Consistency of bivariate probit estimation

In this section, to make the asymptotic arguments formal, we distinguish between the true value, θ0, and a generic parameter value θ. We establish conditions under which the PMLE introduced above is weakly consistent, that is, $\hat{\theta} \xrightarrow{p} \theta_0$ as n → ∞.

The objective function for the bivariate probit PMLE, standardized by n⁻¹, is

\[ Q_n(\theta) \equiv n^{-1} \sum_{g=1}^{n} \{ Y_{g1}Y_{g2}P_g(1,1) + Y_{g1}(1-Y_{g2})P_g(1,0) + (1-Y_{g1})Y_{g2}P_g(0,1) + (1-Y_{g1})(1-Y_{g2})P_g(0,0) \}, \tag{31} \]

and θ̂ maximizes Qn(θ) over the parameter space Θ. Remember that this objective function represents a partial log likelihood: we are only using information on the conditional distributions D(Yg1, Yg2|X, W) and not D(Y1, Y2, . . . , Yn|X, W), as in a full maximum likelihood setting.

The identification condition essentially requires that the limit of E[Qn(θ)] is uniquely maximized at the true value θ0. From the argument described earlier, the only issue is whether θ0 is unique. Define the limiting function as

\[ Q(\theta) \equiv \lim_{n \to \infty} E[Q_n(\theta)]. \]

Then θ0 will uniquely maximize Q(θ) in well-specified models when there is not perfect collinearity among the regressors or some other degeneracy. It can require some care in parameterizing the spatial autocorrelation, but standard models of spatial autocorrelation cause no problems. As in Pinkse and Slade (1998) we assume uniqueness in all our analysis.

The following Theorem 1 states the main consistency result for a broad class of spatial probit models.

Theorem 1. If (i) θ0 is in the interior of a compact set Θ, which is the closure of a concave set, (ii) Q attains a unique maximum over the compact set Θ at θ0, (iii) Q is continuous on Θ, (iv) the density of observations in any region whose area exceeds a fixed minimum is bounded, (v) as n → ∞,

\[ \sup_{1 \le g \le n} \left[ \frac{1}{P(Y_{g1}=1, Y_{g2}=1 \mid X_g)} + \frac{1}{P(Y_{g1}=1, Y_{g2}=0 \mid X_g)} + \frac{1}{P(Y_{g1}=0, Y_{g2}=1 \mid X_g)} + \frac{1}{P(Y_{g1}=0, Y_{g2}=0 \mid X_g)} \right] < \infty, \]

(vi) as n → ∞, $\sup_g (\lVert X_g \rVert + \lVert Y_g \rVert) = O(1)$, (vii) $\sup_{n,g,j} |\mathrm{Cov}(Y_{gi}, Y_{ji})| \le \alpha(d_{gj})$, i = 1, 2, where $d_{gj}$ denotes the distance between groups g and j, and α(d) → 0 as d → ∞, (viii) $\lim_{n \to \infty} E[Q_n(\theta)]$ exists, (ix) $\sup_g \lVert W_g \rVert < \infty$, then θ̂ − θ0 = op(1).

Proof. Given in Appendix A.

Condition (i) is a standard assumption for optimization estimators. Condition (ii) is the identification condition for MLE. Condition (iii) assumes that the function Q is continuous in the metric space, which is a reasonable assumption and necessary for the proof that Qn(θ) is stochastically equicontinuous. Condition (iv) simply excludes that an infinite number of observations crowd in one bounded area. The minimum area restriction is imposed because an infinitesimal area around a single observation has infinite density. Condition (v) makes sure each of these four outcomes will be present in a sufficiently large sample in our bivariate probit structure. Condition (vi) makes sure the regressors are deterministic and uniformly bounded, which is not a strong assumption in this literature. Condition (vii) is the key assumption for this theorem, and it requires that the dependence among groups decays sufficiently quickly as the groups become further apart. This assumption employs the concept of α-mixing to define the rate at which dependence decreases as distance increases. Condition (viii) assumes the limit of E[Qn(θ)] exists as n → ∞, which is not a strong assumption. Condition (ix) is actually implied by the rule of dividing groups, which just excludes that two groups are exactly in the same location. An important remark is that the assumptions in Theorem 1 allow for general types of spatial dependence such as the ones given in (7), (11) and higher order spatial error lags. Moreover, for simplicity, we focus on the setting of a bivariate probit with a likelihood function as given in (31). However, the results of our Theorems can be easily generalized, at the expense of more complex notation, to go beyond the bivariate dependence provided that we extend assumptions such as (v) to allow for a finite number of observations inside each group.

4.2. Asymptotic normality

Our proof of asymptotic normality must recognize the spatial dependence in the scores of the partial log likelihood. To deal with general dependence problems, a common approach in the literature is to use the so-called "Bernstein Sums", which break up Sn into blocks (partial sums). This is the approach we take here. Each block must be so large, relative to the rate at which the memory of the sequence decays, that the degree to which the next block can be predicted from current information is negligible. At the same time, the number of blocks must increase with n so that the CLT argument can be applied to this derived sequence (Davidson, 1994).

In this section, we show under what assumptions we are able to apply McLeish's central limit theorem (1974) to spatial dependence cases to get asymptotic normality for the spatial Probit
estimator. This is presented in the following Theorem. A^T denotes the transpose of matrix A. Define the score of the objective function as

\[ S_n(\theta) \equiv \frac{\partial Q_n}{\partial \theta}(\theta). \tag{32} \]

Theorem 2. If the assumptions of Theorem 1 hold, and in addition: (i) as d → ∞, $d^2 \alpha(d d^*) / \alpha(d^*) = o(1)$ for all fixed d∗ > 0, (ii) the sampling area grows uniformly at a rate of √n in two non-opposing directions, (iii) $B(\theta_0) \equiv \lim_{n \to \infty} E[n S_n(\theta_0) S_n^T(\theta_0)]$ and $A(\theta_0) \equiv \lim_{n \to \infty} -E[H_n(\theta_0)]$ are positive definite matrices. Then

\[ \sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N[0, A(\theta_0)^{-1} B(\theta_0) A(\theta_0)^{-1}], \]

index random field $W_s^*$ that is equal to one if location $s \in Z^2$ is sampled and zero otherwise. $W_s^*$ is assumed to be independent of the underlying random field and to have a finite expectation and to be stationary. The strong mixing coefficients are defined as

\[ \alpha_{k,l}(n) \equiv \sup \{ |P(A \cap B) - P(A)P(B)| \}, \quad A \in \Xi_{\Lambda_1},\ B \in \Xi_{\Lambda_2} \text{ and } |\Lambda_1| \le k,\ |\Lambda_2| \le l,\ \Upsilon(\Lambda_1, \Lambda_2) \ge n. \]

We also define a new process Rs(θ) such as

\[ R_s(\theta) = \begin{cases} S(\theta) & \text{if } W_s^* = 1, \\ 0 & \text{if } W_s^* = 0. \end{cases} \]

We have the following theorem.
pooled estimators: both estimation approaches have neglected In generating the data according to Eqs. (34) and (35) we set
dynamics making analytical comparisons of the asymptotic the true parameter values for β1 , β2 and β3 all equal to unity. We
variances very difficult, if not impossible. Intuitively, it seems are particularly interested in estimation of the spatial parameter
reasonable that using more information about the spatial structure λ, and so we vary its value as follows: λ = 0.2; 0.4; 0.6; and 0.8.
should produce more precise estimators. In this section we use a These values for λ are in the range of the estimated value in the
small simulation study to verify this intuition. empirical application of Pinkse and Slade (1998). We consider total
sample sizes of N = 500 (so n = 250 groups), N = 1000, and
5.1. Simulation design and results N = 1500. We use 1000 replications in the simulations. The results
are reported in Table 1 (for the spatial parameter λ) and Table 2 (for
Instead of comparing our PMLE to the GMM estimator of Pinkse β1 , β2 and β3 ) in Appendix B.
and Slade (1998) directly, we choose to compare the bivariate We start with estimation of β . Table 2 shows that the PMLE
PMLE to the univariate PMLE, which we refer to as the het- has little bias with N = 1500 (except when λ = .2), whereas
eroskedastic probit estimator (HPE). We have two justifications for the HPE still has substantial bias. The poor behavior of the HPE
using the HPE rather than the GMM version. First, the HPE uses the for estimating β may be due to its inability to estimate λ. The
PMLE does much better in terms of precision, too. Generally, as we
same moment conditions as the GMM estimator because both use
expect, the Monte Carlo standard deviations shrink as the sample
the first-order condition from the HPE. Thus, efficiency gains from
size increases.
using an optimal weighting matrix are unlikely to be important.
Table 1 shows that the HPE struggles when trying to estimate
Second, the STATA2 source codes for bivariate probit estimation
and heteroskedastic probit estimation are available online, and we
λ. The PMLE is always much closer to the true parameter value —
although it has a systematic upward bias for each N — with smaller
can easily adopt the code for the kind of heteroskedasticity in the
standard deviations across all sample sizes and parameter values.
variances and covariances implied by common spatial dependence
The bias of the PMLE decreases when N increases but there is room
structures. We consider two simulation settings allowing for dif-
for improvement. Possible bias adjustments to the estimator of λ
ferent types of spatial dependence.
is a good topic for future research.
In summary, from the simulation results of Tables 1 and 2, we
5.1.1. Case 1 see how the PMLE clearly outperforms the HPE, especially when
According to the theoretical framework given in previous estimating the spatial parameter λ. While simulation findings are
sections, we could generate a dataset which allows a general necessarily special, the ones here provide strong support for the
correlation structure across groups as in Eqs. (6) and (7). We idea that using even a little information on the spatial correlation
require knowing the 2 × 2 matrices Ωg as functions of λ and W . structure can go a long way in obtaining less biased, more precise
estimators.

Generally, it is quite difficult to derive the pairwise covariances for the bivariate probit because the exact formula for $\Omega_{g12}$ (and for $\Omega_{g11}$, $\Omega_{g22}$) is very complicated; they must be obtained from the inverse of the full $2n \times 2n$ variance–covariance matrix. For the SAE model, this matrix is

\[
[(I - \lambda W)'(I - \lambda W)]^{-1} =
\begin{pmatrix}
\Omega_{111} & \cdots & \cdots & \cdots & \cdots \\
\vdots & \ddots & & & \vdots \\
\cdots & \cdots & \Omega_{g11} & \Omega_{g12} & \cdots \\
\cdots & \cdots & \Omega_{g21} & \Omega_{g22} & \cdots \\
\vdots & & & \ddots & \vdots \\
\cdots & \cdots & \cdots & \cdots & \Omega_{n22}
\end{pmatrix}, \tag{33}
\]

and it is difficult to obtain the $\Omega_{ghi}$ in closed form. Instead, it seems reasonable to do the following. Let $R$ be a weighting matrix (which can be generated in STATA$^3$) according to the distance between observations. Then define

\[
Y_i^* = X_{i1}\beta_1 + X_{i2}\beta_2 + X_{i3}\beta_3 + \varepsilon_i \tag{34}
\]

\[
\varepsilon = \lambda R u, \tag{35}
\]

where $u \sim \mathrm{Normal}(0, I_{2n})$. The weighting matrix $R$ is standardized so that the diagonal elements are ones, and then the elements of $R$ shrink as distance between observations increases. Using this approach, it is relatively easy to determine $\mathrm{Var}(\varepsilon_i)$ and $\mathrm{Cov}(\varepsilon_i, \varepsilon_j)$, which facilitates the HP bivariate probit estimation. We still allow general correlation across groups, and we are able to compare the efficiency gains from only using the marginal information (the HP approach) to using both diagonal and off-diagonal information (bivariate probit).

2 See http://www.stata.com/.
3 The STATA command is "Spatwmat". Since inverting a matrix becomes much slower as the size of the matrix increases, and moreover the maximum matrix size in Stata is 800, we allow here each observation to be spatially correlated with its nearest 99 observations.

5.1.2. Case 2

We consider a second data generating process given as (34), where again we set the true parameter values for $\beta_1$, $\beta_2$ and $\beta_3$ all equal to unity. In this case we assume (11), where the $u_i$ are i.i.d. Normal(0, 1) and $W_{ih}$ is the reciprocal of the Euclidean distance between $i$ and $h$. We obtain the closed form expressions for the variance and covariance given in (12) and (13). Results for 1000 replications are provided in Tables 3 and 4 in Appendix B.

Again, the PMLE provides substantial improvements over the HPE, especially when estimating $\lambda$. It makes sense that using information in the pairwise data helps to substantially improve the precision in estimating $\lambda$, which is fundamentally a spatial correlation parameter. Efficiency gains of the bivariate procedure in estimating $\beta$ are smaller in Table 4, possibly because we chose to group nearby observations, but still nontrivial. For this data generating process (DGP), both HPE and PMLE show little bias in estimating $\beta$, especially for the largest sample size.

6. Conclusions

The idea of this paper is simple and intuitive: rather than just using information contained in the marginal distributions, we divide observations into pairwise groups and use a partial MLE approach. Using the spatial correlation for pairs of outcomes, we prove that the bivariate PMLE is consistent and asymptotically normal under reasonable regularity conditions (although these could be relaxed in future research). We also discuss how to get consistent covariance matrix estimators under general spatial dependence by following the approach of Conley (1999), which is much more practical than the proposal of Pinkse and Slade (1998).

The simulation study in Section 5 demonstrates that using bivariate rather than univariate distributions not only improves efficiency, but can substantially decrease finite-sample bias, especially for estimating the spatial correlation parameters.
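The bivariate distributions referred to here enter the partial likelihood through the four pairwise probabilities (43)–(46) of Appendix A.1. The following is a minimal sketch of how they might be evaluated numerically for one pair, not the authors' implementation. It assumes that $\delta_{g1} = \mathrm{Cov}(\varepsilon_{g1}, \varepsilon_{g2})/\mathrm{Var}(\varepsilon_{g2})$ and that $e_{g1} = \varepsilon_{g1} - \delta_{g1}\varepsilon_{g2}$, it scales $\phi$ by $1/\sqrt{\mathrm{Var}(\varepsilon_{g2})}$ so that the integrand is a proper density, and it truncates the integration range at ten standard deviations. The function names are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simpson(f, a, b, n_steps=2000):
    """Composite Simpson rule on [a, b]; n_steps must be even."""
    if b <= a:
        return 0.0
    h = (b - a) / n_steps
    total = f(a) + f(b)
    for k in range(1, n_steps):
        total += (4.0 if k % 2 else 2.0) * f(a + k * h)
    return total * h / 3.0


def pair_probabilities(xb1, xb2, var1, var2, cov12):
    """Return {(y1, y2): P(Yg1 = y1, Yg2 = y2 | Xg)} for one pair g.

    xb1, xb2 stand for Xg1*beta and Xg2*beta; var1, var2, cov12 are the
    variances and covariance of (eps_g1, eps_g2).
    """
    s2 = math.sqrt(var2)
    delta = cov12 / var2                    # assumed definition of delta_g1
    s_e1 = math.sqrt(var1 - delta * cov12)  # assumed s.d. of e_g1

    def integrand(eps2):
        # P(Yg1 = 1 | eps_g2) times the N(0, var2) density of eps_g2
        return norm_cdf((xb1 + delta * eps2) / s_e1) * norm_pdf(eps2 / s2) / s2

    lo, hi = -10.0 * s2, 10.0 * s2          # tail truncation (assumption)
    cut = min(max(-xb2, lo), hi)            # Yg2 = 1 iff eps_g2 > -Xg2*beta
    p11 = simpson(integrand, cut, hi)       # as in (43)
    p10 = simpson(integrand, lo, cut)       # as in (45)
    p2 = norm_cdf(xb2 / s2)                 # P(Yg2 = 1 | Xg)
    return {(1, 1): p11, (0, 1): p2 - p11, (1, 0): p10, (0, 0): 1.0 - p2 - p10}
```

The log of the probability matching the observed pair $(Y_{g1}, Y_{g2})$ would then be one group's contribution to the partial log-likelihood; in the symmetric case with zero indices, unit variances and correlation 0.5, the sketch reproduces the known orthant probability of 1/3.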
84 H. Wang et al. / Journal of Econometrics 172 (2013) 77–89
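The fact that we can undertake a substantial simulation study demonstrates that our approach is computationally much more feasible than the full, joint MLE. Our conjecture is that an estimator that uses, say, trivariate distributions would perform even better. Of course that comes at the expense of computation. Nevertheless, computation for a single data set should not be difficult for even larger group sizes. We think the findings for group sizes of two make a strong case for the general PMLE approach.

A fixed and known spatial error structure is a limitation of our results. Ideally, one could accommodate endogenous location decisions. Unfortunately, endogenous location raises both conceptual and technical difficulties that need to be studied in future research. Extensions that are more immediate are models with spatially distributed lags in the covariates and other kinds of nonlinear models that can be estimated by PMLE, including Tobit, count and switching models.

Appendix A

A.1. Expressions of conditional bivariate distributions

Since

\[
P(Y_{g1}=1, Y_{g2}=1 \mid X_g) = P(Y_{g1}=1 \mid Y_{g2}=1, X_g) \cdot P(Y_{g2}=1 \mid X_g) \tag{36}
\]

it is easy to see that $P(Y_{g2}=1 \mid X_g) = \Phi\bigl(X_{g2}\beta/\sqrt{\mathrm{Var}(\varepsilon_{g2})}\bigr)$, and thus it remains to get $P(Y_{g1}=1 \mid Y_{g2}=1, X_g)$.

First, since $Y_{g2}=1$ if and only if $\varepsilon_{g2} > -X_{g2}\beta$, and $\varepsilon_{g2}$ follows a normal distribution and is independent of $X_g$, the density of $\varepsilon_{g2}$ given $\varepsilon_{g2} > -X_{g2}\beta$ is

\[
\frac{\phi\!\left(\dfrac{\varepsilon_{g2}}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right)}{P(\varepsilon_{g2} > -X_{g2}\beta)}
= \frac{\phi\!\left(\dfrac{\varepsilon_{g2}}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right)}{\Phi\!\left(\dfrac{X_{g2}\beta}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right)}. \tag{37}
\]

Therefore,

Now we are ready to get $P(Y_{g1}=1, Y_{g2}=1 \mid X_g)$ as follows:

\[
P(Y_{g1}=1, Y_{g2}=1 \mid X_g)
= \frac{1}{\Phi\!\left(\dfrac{X_{g2}\beta}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right)}
\int_{-X_{g2}\beta}^{\infty}
\Phi\!\left(\frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\mathrm{Var}(e_{g1})}}\right)
\phi\!\left(\frac{\varepsilon_{g2}}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right) d\varepsilon_{g2}
\times \Phi\!\left(\frac{X_{g2}\beta}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right) \tag{42}
\]

\[
= \int_{-X_{g2}\beta}^{\infty}
\Phi\!\left(\frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\mathrm{Var}(e_{g1})}}\right)
\phi\!\left(\frac{\varepsilon_{g2}}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right) d\varepsilon_{g2}, \tag{43}
\]

and similarly we can finally obtain

\[
P(Y_{g1}=0, Y_{g2}=1 \mid X_g)
= \Phi\!\left(\frac{X_{g2}\beta}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right)
- \int_{-X_{g2}\beta}^{\infty}
\Phi\!\left(\frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\mathrm{Var}(e_{g1})}}\right)
\phi\!\left(\frac{\varepsilon_{g2}}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right) d\varepsilon_{g2} \tag{44}
\]

\[
P(Y_{g1}=1, Y_{g2}=0 \mid X_g)
= \int_{-\infty}^{-X_{g2}\beta}
\Phi\!\left(\frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\mathrm{Var}(e_{g1})}}\right)
\phi\!\left(\frac{\varepsilon_{g2}}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right) d\varepsilon_{g2} \tag{45}
\]

\[
P(Y_{g1}=0, Y_{g2}=0 \mid X_g)
= 1 - \Phi\!\left(\frac{X_{g2}\beta}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right)
- \int_{-\infty}^{-X_{g2}\beta}
\Phi\!\left(\frac{X_{g1}\beta + \delta_{g1}\varepsilon_{g2}}{\sqrt{\mathrm{Var}(e_{g1})}}\right)
\phi\!\left(\frac{\varepsilon_{g2}}{\sqrt{\mathrm{Var}(\varepsilon_{g2})}}\right) d\varepsilon_{g2}. \tag{46}
\]

A.2. Proofs of theorems

Proof of Theorem 1. By Newey and McFadden (1994), for consistency it is sufficient to verify the following conditions: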
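and therefore,

\[
\frac{1}{n}\sum_{g=1}^{n} Y_{g1}Y_{g2}\,\frac{-1}{[P(Y_{g1}=1, Y_{g2}=1 \mid X_g)]^{2}}
\]

As usual, we apply repeatedly the above arguments to the other terms. Finally, we can get that

\[
\lim_{n\to\infty} \frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta^*) \xrightarrow{p} E\!\left[\frac{\partial^2 Q_n}{\partial\theta\,\partial\theta^T}(\theta_0)\right]. \tag{67}
\]

If we define

\[
H \equiv Y_{g1}Y_{g2}\frac{\partial^2 P_g(1,1)}{\partial\theta\,\partial\theta^T}
+ Y_{g1}(1-Y_{g2})\frac{\partial^2 P_g(1,0)}{\partial\theta\,\partial\theta^T}
+ (1-Y_{g1})Y_{g2}\frac{\partial^2 P_g(0,1)}{\partial\theta\,\partial\theta^T}
+ (1-Y_{g1})(1-Y_{g2})\frac{\partial^2 P_g(0,0)}{\partial\theta\,\partial\theta^T} \tag{68}
\]

where $H$ denotes the Hessian, Eq. (68) can be rewritten as

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{g=1}^{n} H(\theta^*) \xrightarrow{p} \lim_{n\to\infty} E[H(\theta_0)]. \tag{69}
\]

Therefore, it remains to show the asymptotic normality of the score term, $S_n(\theta_0)$. Now

\[
S_n(\theta_0) = \frac{1}{n}\sum_{g=1}^{n}\Bigl[
Y_{g1}Y_{g2}\frac{\partial P_g(1,1)}{\partial\theta}(\theta_0)
+ Y_{g1}(1-Y_{g2})\frac{\partial P_g(1,0)}{\partial\theta}(\theta_0)
+ (1-Y_{g1})Y_{g2}\frac{\partial P_g(0,1)}{\partial\theta}(\theta_0)
+ (1-Y_{g1})(1-Y_{g2})\frac{\partial P_g(0,0)}{\partial\theta}(\theta_0)
\Bigr]. \tag{70}
\]

We need to show that $B^{-1/2}(\theta_0)S_n(\theta_0) \to N(0, I_K)$, where $B(\theta) \equiv \lim_{n\to\infty} nE[S_n(\theta)S_n^T(\theta)]$. Note that the information matrix equality does not hold here, i.e. $-E[H_n(\theta_0)] \neq E[S_n(\theta)S_n^T(\theta)]$, because the score terms are correlated with each other over space. In this part, we follow Pinkse and Slade (1998) and we use Bernstein's blocking methods and McLeish's (1974) central limit theorem for dependent processes. First, define $T_{na_n} \equiv \prod_{j=1}^{a_n}(1 + i\gamma D_{n,j})$, $0 < \tau < \tfrac{1}{2}$. Let $\Lambda_{nj}$ denote the set of indices corresponding to the observations in area $j$. By assumption a number $C > 0$ exists such that $\max_j(\#\Lambda_{nj}) < Cb_n$. Define $D_{n,j} \equiv n^{-1/2}\sum_{g\in\Lambda_{nj}} A_{ng}$, and hence we can write $Y_{0n} = \sum_{j=1}^{a_n} D_{n,j}$.

Now we are ready to discuss the four conditions for McLeish's (1974) central limit theorem. First, look at condition (iv), which requires that $\max_{j\le a_n}|D_{n,j}| = o_p(1)$:

\[
\max_{j\le a_n}|D_{n,j}| = \max_{j\le a_n}\Bigl| n^{-1/2}\sum_{g\in\Lambda_{nj}} A_{ng} \Bigr|. \tag{71}
\]

Since by assumption

\[
\max_j(\#\Lambda_{nj}) < Cb_n \;\Rightarrow\; \max_{j\le a_n}\Bigl| n^{-1/2}\sum_{g\in\Lambda_{nj}} A_{ng} \Bigr| \le Cb_n \times n^{-1/2}\sup|A_{ng}|, \tag{72}
\]

where $\#$ denotes the number of objects, by definition we have that

\[
\frac{\varpi^T \sqrt{n}\,S_g(\theta_0)}{\sqrt{B(\theta_0)}} = n^{-1/2}\sum_{g=1}^{n} A_{ng},
\]

\[
A_{ng} = \varpi^T \frac{1}{\sqrt{B_0}}\Bigl[
Y_{g1}Y_{g2}\frac{\partial P_g(1,1)}{\partial\theta}(\theta_0)
+ Y_{g1}(1-Y_{g2})\frac{\partial P_g(1,0)}{\partial\theta}(\theta_0)
+ (1-Y_{g1})Y_{g2}\frac{\partial P_g(0,1)}{\partial\theta}(\theta_0)
+ (1-Y_{g1})(1-Y_{g2})\frac{\partial P_g(0,0)}{\partial\theta}(\theta_0)
\Bigr]. \tag{73}
\]

Since $B(\theta_0)$ is positive definite, $B(\theta_0)^{-1/2}$ is bounded as $n \to \infty$, and we have that $\sup_g Y_g < \infty$ by assumption (vi) in Theorem 1. We have also proved that $\sup_g \frac{\partial P_g(1,1)}{\partial\theta} < \infty$ in Lemma 2.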
Therefore, we are able to prove that sup Ang < ∞. Then Cbn × by construction of Y0n , since E (Y0n ) = 1. It remains to show that
2
≤ P sup Πja=n1
1 + γ 2 D2n,j
|E (Dn,i Dn,j )| = o(n−1 bn ) = o(a−
n ).
1
n >N
√an
τ τ Lemma 6. Under the assumptions in Theorem 2, max
> K sup n |Dn,j | ≤ C + P sup n |Dn,j | > C j∈Ξnil
(77) l =2
n>N ,j |E (Dn,i Dn,j )| = o(a−
n ).
1
an
= E (Y0n
2
)−1− E (Dn,i Dn,j ) + op (1) = op (1), (81) 4 Lemmas 5–8 are along the lines of those in Pinkse and Slade (1998), which are
Table 1
Case 1: Simulation results of different estimators of λ in the context of the bivariate spatial probit model.*

                     λ = 0.2              λ = 0.4              λ = 0.6              λ = 0.8
                     HPE       PMLE       HPE       PMLE       HPE       PMLE       HPE       PMLE
N = 500    Mean      3.938     0.514      6.177     0.519      7.698     0.571      7.735     0.634
           Bias      3.738     0.314      5.777     0.319      7.098     −0.029     6.935     −0.166
           (s.d.)    (12.158)  (0.120)    (15.776)  (0.205)    (16.929)  (0.151)    (16.202)  (0.289)
N = 1000   Mean      3.174     0.512      4.668     0.518      5.456     0.581      5.914     0.672
           Bias      2.974     0.312      4.268     0.118      4.856     −0.019     5.114     −0.128
           (s.d.)    (8.844)   (0.107)    (9.100)   (0.133)    (9.631)   (0.149)    (10.173)  (0.276)
N = 1500   Mean      2.746     0.511      4.050     0.507      4.872     0.609      5.426     0.708
           Bias      2.546     0.311      3.650     0.107      4.272     0.009      4.626     −0.092
           (s.d.)    (6.423)   (0.099)    (7.414)   (0.124)    (8.598)   (0.149)    (8.514)   (0.253)

* Results are presented for the bivariate Partial Maximum Likelihood Estimator (PMLE) and the Heteroskedastic Probit Estimator (HPE) of λ. Numbers in brackets are standard deviations (s.d.).
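The Case 1 results above come from the design in (34)–(35), where the errors are generated as ε = λRu and the latent index has unit coefficients. The sketch below illustrates that design on a small scale and shows why Var(ε_i) and Cov(ε_i, ε_j) are easy to obtain in this setup: they are the entries of λ²RR′. It is an illustration under stated assumptions, not the authors' simulation code. The decay rule R_ij = 1/(1 + d_ij) is hypothetical (it only respects the stated properties of a unit diagonal and entries that shrink with distance), the covariates are drawn as standard normal because their distribution is not given here, and all function names are made up for the example.

```python
import math
import random


def make_weights(coords):
    """Hypothetical weighting matrix: ones on the diagonal, decaying in distance."""
    n = len(coords)
    return [[1.0 / (1.0 + math.dist(coords[i], coords[j])) for j in range(n)]
            for i in range(n)]


def error_covariance(R, lam):
    """Var-cov matrix of eps = lam * R * u with u ~ N(0, I): lam^2 * R * R'."""
    n = len(R)
    return [[lam * lam * sum(R[i][k] * R[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


def simulate_case1(coords, lam, beta=(1.0, 1.0, 1.0), seed=0):
    """Draw one sample (X, Y) from the latent-index design (34)-(35)."""
    rng = random.Random(seed)
    n = len(coords)
    R = make_weights(coords)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    eps = [lam * sum(R[i][k] * u[k] for k in range(n)) for i in range(n)]
    # Covariate distribution is an assumption made for this sketch only.
    X = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    y_star = [sum(b * x for b, x in zip(beta, X[i])) + eps[i] for i in range(n)]
    Y = [1 if v > 0 else 0 for v in y_star]   # observed binary outcome
    return X, Y, R
```

For a pair of observations (i, j), the entries `omega[i][i]`, `omega[j][j]` and `omega[i][j]` of `omega = error_covariance(R, lam)` supply the variances and covariance that the heteroskedastic probit and the bivariate probit require.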
Table 2
Case 1: Simulation results of different estimators of β1, β2 and β3 in the context of the bivariate spatial probit model.*

                     β1 = 1               β2 = 1               β3 = 1
                     HPE       PMLE       HPE       PMLE       HPE       PMLE
Table 3
Case 2: Simulation results of different estimators of λ in the context of the bivariate spatial probit model.*

                     λ = 0.2              λ = 0.4              λ = 0.6              λ = 0.8
                     HPE       PMLE       HPE       PMLE       HPE       PMLE       HPE       PMLE
N = 500    Mean      2.151     0.381      2.575     0.667      2.491     0.970      2.876     1.202
           Bias      1.951     0.181      1.175     0.267      1.891     0.370      2.076     0.402
           (s.d.)    (4.630)   (0.844)    (5.073)   (0.923)    (4.996)   (0.913)    (6.213)   (0.966)
N = 1000   Mean      1.013     0.356      1.089     0.606      1.307     0.863      1.660     1.160
           Bias      0.813     0.156      0.689     0.206      0.707     0.263      0.860     0.360
           (s.d.)    (2.131)   (0.606)    (2.241)   (0.622)    (2.424)   (0.671)    (2.675)   (0.813)
N = 1500   Mean      0.684     0.324      0.792     0.592      0.906     0.860      1.305     1.156
           Bias      0.484     0.124      0.392     0.192      0.306     0.260      0.505     0.356
           (s.d.)    (1.508)   (0.484)    (1.566)   (0.515)    (1.611)   (0.601)    (1.910)   (0.706)

* Results are presented for the bivariate Partial Maximum Likelihood Estimator (PMLE) and the Heteroskedastic Probit Estimator (HPE) of λ. Numbers in brackets are standard deviations (s.d.).
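One way to read Table 3 is to combine bias and dispersion into a root mean squared error, RMSE = sqrt(bias² + s.d.²), with bias computed as the mean estimate minus the true λ. The sketch below does this for the N = 1500 rows. The RMSE figures are a derived summary and are not reported in the paper, and the variable and function names are chosen for illustration only.

```python
import math

# (true lambda, HPE mean, HPE s.d., PMLE mean, PMLE s.d.) from the N = 1500 rows of Table 3
TABLE3_N1500 = [
    (0.2, 0.684, 1.508, 0.324, 0.484),
    (0.4, 0.792, 1.566, 0.592, 0.515),
    (0.6, 0.906, 1.611, 0.860, 0.601),
    (0.8, 1.305, 1.910, 1.156, 0.706),
]


def bias(mean, truth):
    """Bias of an estimator: Monte Carlo mean minus the true value."""
    return mean - truth


def rmse(mean, sd, truth):
    """Root mean squared error from the reported mean and standard deviation."""
    return math.sqrt(bias(mean, truth) ** 2 + sd ** 2)


def summarize(rows):
    """Return bias and RMSE of both estimators for each true lambda."""
    out = []
    for lam, hpe_mean, hpe_sd, pmle_mean, pmle_sd in rows:
        out.append({
            "lambda": lam,
            "hpe_bias": bias(hpe_mean, lam),
            "pmle_bias": bias(pmle_mean, lam),
            "hpe_rmse": rmse(hpe_mean, hpe_sd, lam),
            "pmle_rmse": rmse(pmle_mean, pmle_sd, lam),
        })
    return out


summary = summarize(TABLE3_N1500)
for row in summary:
    print(row["lambda"], round(row["hpe_rmse"], 3), round(row["pmle_rmse"], 3))
```

In every N = 1500 column the PMLE has the smaller RMSE, which is consistent with the text's statement that the PMLE improves substantially on the HPE when estimating λ.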
Table 4
Case 2: Simulation results of different estimators of β1, β2 and β3 in the context of the bivariate spatial probit model.*

                     β1 = 1               β2 = 1               β3 = 1
                     HPE       PMLE       HPE       PMLE       HPE       PMLE
References

Andrews, D.W.K., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59 (3), 817–858.
Anselin, L., Florax, R.J.G.M., 1995. New Directions in Spatial Econometrics. Springer-Verlag, Berlin, Germany.
Anselin, L., Florax, R.J.G.M., Rey, J.S., 2004. Econometrics for Spatial Models: Recent Advances. In: Advances in Spatial Econometrics, Springer-Verlag, Berlin, Germany, pp. 1–28.
Beron, K.J., Vijverberg, W.P., 2003. Probit in a Spatial Context: A Monte Carlo Approach. In: Advances in Spatial Econometrics, Springer-Verlag, Berlin, Germany, pp. 169–196.
Blundell, R., Powell, J.L., 2004. Endogeneity in semiparametric binary response models. Review of Economic Studies 71, 655–679.
Case, A.C., 1991. Spatial patterns in household demand. Econometrica 59, 953–965.
Conley, T.G., 1999. GMM estimation with cross sectional dependence. Journal of Econometrics 92, 1–45.
Davidson, J., 1994. Stochastic Limit Theory. Oxford University Press, Oxford.
Kelejian, H.H., Prucha, I.R., 1999. A generalized moments estimator for the autoregressive parameter in a spatial model. International Economic Review 40, 509–533.
Kelejian, H.H., Prucha, I.R., 2001. On the asymptotic distribution of the Moran I test statistic with applications. Journal of Econometrics 104, 219–257.
Lee, L.-F., 2004. Asymptotic distribution of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72 (6), 1899–1925.
Lesage, J.P., 2000. Bayesian estimation of limited dependent variable spatial autoregressive models. Geographical Analysis 32, 19–35.
McMillen, D.P., 1992. Probit with spatial autocorrelation. Journal of Regional Science 32, 335–348.
McMillen, D.P., 1995. Spatial Effects in Probit Models: A Monte Carlo Investigation. In: New Directions in Spatial Econometrics, Springer-Verlag, Berlin, Germany, pp. 189–228.
Newey, W.K., McFadden, D., 1994. Large sample estimation and hypothesis testing. In: Handbook of Econometrics, Vol. 4. North-Holland, New York, Ch. 36.
Newey, W.K., West, K.D., 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–708.
Pinkse, J., Slade, M.E., 1998. Contracting in space: an application of spatial statistics to discrete-choice models. Journal of Econometrics 85, 125–154.
Pinkse, J., Slade, M.E., Shen, L., 2006. Dynamic spatial discrete choice using one-step GMM: an application to mine operating decisions. Spatial Economic Analysis 1 (1), 53–99.
Poirier, D., Ruud, P.A., 1988. Probit with dependent observations. Review of Economic Studies 55, 593–614.
Robinson, P.M., 1982. On the asymptotic properties of estimators of models containing limited dependent variables. Econometrica 50, 27–41.
Wooldridge, J.M., 2005. Unobserved heterogeneity and estimation of average partial effects. In: Andrews, D.W.K., Stock, J.H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, Cambridge, pp. 27–55.
Wooldridge, J.M., 2010. Econometric Analysis of Cross Section and Panel Data, second ed. MIT Press, Cambridge, Massachusetts.