
Ecography 35: 879–888, 2012

doi: 10.1111/j.1600-0587.2011.07138.x
© 2012 The Authors. Ecography © 2012 Nordic Society Oikos
Subject Editor: Alexandre Diniz-Filho. Accepted 19 December 2011

A new method for dealing with residual spatial autocorrelation in species distribution models

Beth Crase, Adam C. Liedloff and Brendan A. Wintle


B. Crase (beth.crase@nt.gov.au) and B. A. Wintle, School of Botany, The Univ. of Melbourne, Parkville, VIC 3010, Australia. – A. C. Liedloff,
CSIRO Ecosystem Sciences, Berrimah, NT 0828, Australia.

Species distribution modelling (SDM) is a widely used tool and has many applications in ecology and conservation biology.
Spatial autocorrelation (SAC), a pattern in which observations are related to one another by their geographic distance, is
common in georeferenced ecological data. SAC in the residuals of SDMs violates the ‘independent errors’ assumption
required to justify the use of statistical models in modelling species’ distributions. The autologistic modelling approach
accounts for SAC by including an additional term (the autocovariate) representing the similarity between the value of the
response variable at a location and neighbouring locations. However, autologistic models have been found to introduce
bias in the estimation of parameters describing the influence of explanatory variables on habitat occupancy. To address
this problem we developed an extension to the autologistic approach by calculating the autocovariate on SAC in residuals
(the RAC approach). Performance of the new approach was tested on simulated data with a known spatial structure and
on strongly autocorrelated mangrove species’ distribution data collected in northern Australia. The RAC approach was
implemented as generalized linear models (GLMs) and boosted regression tree (BRT) models. We found that the BRT
models with only environmental explanatory variables can account for some SAC, but applying the standard autologistic or
RAC approaches further reduced SAC in model residuals and substantially improved model predictive performance. The
RAC approach showed stronger inferential performance than the standard autologistic approach, as parameter estimates
were more accurate and statistically significant variables were accurately identified. The new RAC approach presented here
has the potential to account for spatial autocorrelation while maintaining strong predictive and inferential performance,
and can be implemented across a range of modelling approaches.
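
As a rough illustration of the distinction drawn above, the sketch below computes a first-order focal-mean autocovariate twice: once from the response variable (the input to a standard autologistic model) and once from the residuals of an environment only model (the input to a RAC model). It is written in Python with NumPy rather than the R used for the analyses, and it is not the authors' code. The names `focal_mean`, `y` and `p_env` and the toy values are hypothetical; excluding the centre cell and averaging only over the neighbours that exist at grid edges are assumptions.

```python
import numpy as np

def focal_mean(grid):
    """Mean of the adjacent cells (up to eight) around every cell, centre excluded."""
    grid = np.asarray(grid, dtype=float)
    rows, cols = grid.shape
    # Pad with NaN so that edge cells average only over neighbours that exist.
    padded = np.pad(grid, 1, mode="constant", constant_values=np.nan)
    neighbours = [
        padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]
    return np.nanmean(np.stack(neighbours), axis=0)

# Toy 3 x 3 presence/absence grid (hypothetical data).
y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])
# Hypothetical fitted probabilities from an environment only model.
p_env = np.full(y.shape, 0.5)

autocov_autologistic = focal_mean(y)   # autocovariate from the response
autocov_rac = focal_mean(y - p_env)    # autocovariate from the residuals

print(autocov_autologistic[1, 1])  # 0.375: three of the eight neighbours are occupied
print(autocov_rac[1, 1])           # -0.125: mean residual of the eight neighbours
```

In either case the resulting grid would be supplied as one additional explanatory variable when the final model is fitted; only the quantity being averaged differs.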

Species distribution models (SDMs) relate species distribution data such as abundance, occurrence (presence only) or presence/absence to environmental characteristics (Elith and Leathwick 2009). They have been applied extensively in conservation planning and management, for predicting range shifts in response to climate change (Bakkenes et al. 2002, Peterson et al. 2002, Midgley et al. 2003, Thomas et al. 2004), and in invasive species management, disease mapping and evolutionary biology (Austin 2002, Miller et al. 2004, Araújo and New 2007). Such models have two broad applications: firstly, to facilitate an understanding of the variables influencing species distribution (inference); and secondly, to predict a species’ spatial distribution. Accurate predictive ability of models depends on the use of sound ecological rationale and robust variable selection during model building. The modelling framework applied to the development of SDMs should ideally produce models that perform well for both explanation and prediction.

SDMs use spatially referenced observations of species occurrence as the dependent variable in the modelling process. Most species observation data exhibit spatial autocorrelation (Lennon 2000). Spatial autocorrelation (SAC) is a pattern in which observations are related to one another by the geographic distance between them (Legendre and Fortin 1989, Fortin and Dale 2005). When SAC is positive, locations close to each other exhibit more similar values than those further apart. The presence of SAC in model residuals violates the assumption of independent and identically distributed errors and can inflate type I errors (Legendre 1993, Kühn 2007). This can lead to the selection of unimportant explanatory variables and poorly estimated parameters (regression coefficients) in SDMs (Lennon 2000, Dormann 2007a). This is an important issue in ecology as autocorrelation is a general property of ecological variables measured over geographic space (Legendre and Legendre 1998).

Several methods have been developed to account for the effects of SAC and can be used to model species distributions (reviewed by Keitt et al. 2002, Dormann et al. 2007). The autologistic approach is widely applied (Dormann 2007b) and accounts for SAC by including an additional term in the model (the autocovariate) to represent the influence of neighbouring observations (Besag 1972, Augustin et al. 1996). This autocovariate specifies the relationship between the value of the observation at a location and those at neighbouring locations. Autologistic models have good predictive performance in comparison to models that do not account
for SAC (Hoeting et al. 2000, Wintle and Bardos 2006, Betts 2009); however, parameter estimates can be biased (Dormann 2007b, Dormann et al. 2007), potentially distorting inference.

Simultaneous autoregressive (SAR) models and generalized estimating equations (GEE) are alternative approaches for dealing with SAC and use an autocovariate to represent only the portion of spatial structure in the response variable that is not explained by the environmental explanatory variables (Carl and Kühn 2007, Dormann et al. 2007). That is, the autocovariate specifies the relationship between the value of the residuals at each location and those at neighbouring locations. When analysing simulated data these approaches produced less biased parameter estimates than the autologistic approach (Dormann et al. 2007). The way the autocovariate is calculated for the SAR and GEE approaches is more congruent with theory because it is the SAC remaining in model residuals that violates the assumptions of statistical models, not SAC in the response variable per se (Legendre 1993).

However, SAR and GEE models are more complex to apply than the autologistic approach. The autologistic approach is usually applied within a GLM framework (Augustin et al. 1996, Luoto et al. 2001, Betts et al. 2006, Wintle and Bardos 2006). In contrast to SAR and GEE models, GLMs are widely applied in ecology and other disciplines, their use is supported by numerous textbooks and software packages, and they are easy to implement and interpret. In addition, the autologistic approach has the potential to be implemented across a range of modelling techniques, such as tree-based machine learning methods including boosted regression trees (BRT) and Random Forests. These methods have been shown to model complex relationships and interactions between variables, leading to strong predictive performance (Breiman 2001, Elith et al. 2006, De’ath 2007). Tree-based learning methods could be a useful tool for modelling spatially autocorrelated data (Hothorn et al. 2011), as they may capture complex structures within the data that simpler modelling approaches may miss (Elith and Graham 2009).

Given that virtually all spatially explicit ecological datasets contain SAC (Lennon 2000), it is important that species distribution modelling methods can account for the spatial autocorrelation that cannot be attributed to the spatial structuring of predictor variables included in the SDMs. SDMs accounting for SAC should have strong predictive and inferential performance, be straightforward to apply and be able to be implemented across a range of modelling approaches. In this paper we extend the autologistic approach by fitting an autocovariate derived from model residuals instead of directly from the response variable. This combines the strengths of SAR and GEE approaches with the accessibility of the autologistic approach. We compare predictive and inferential performance of three models implemented within GLM and BRT frameworks. In the first model (environment only), explanatory variables are used to model the occurrence of the focal species; in the second model (standard autologistic), autocovariates derived directly from the response variable are included as well as the environmental variables; and for the third model (residuals autocovariate, RAC), autocovariates derived from the residuals of an environment only model plus the environmental explanatory variables are included. For both BRT and GLM, the three modelling strategies are compared on the basis of their predictive power, their inferential performance and their ability to reduce residual SAC.

Methods

Datasets

Two datasets with Bernoulli distributed observation data (presence/absence), one simulated and the other derived from field mapping of mangrove tree species, were used to illustrate and evaluate alternative approaches to modelling autocorrelated species distribution data. The first dataset was simulated by Dormann et al. (2007) and used to compare the performance of several spatial regression modelling methods including spatial eigenvector mapping, autoregressive models (e.g. SAR), generalized linear mixed models and generalized estimating equations. The distribution of the virtual species (called ‘Snouter’) was simulated ten times across a grid of 1108 cells based on a known linear response to an environmental variable (rain) and a spatially correlated error structure. Two explanatory variables, one informative (rain) and the other not (djungle), were used to model the distribution of the virtual species. The uninformative djungle variable was used to test the frequency with which the various modelling methods made type I inferential errors. For detailed methodology refer to Dormann et al. (2007).

The second dataset provided the mapped distribution of three mangrove species (Sonneratia alba, Ceriops tagal and Rhizophora stylosa) along the Elizabeth River in Darwin Harbour, northern Australia (Brocklehurst and Edmeades 1996). The presence or absence of each species (the response variable) was represented as a binary map of 25 by 25 m grid cells over an area of 8 × 8 km (a total of 102 400 cells). Three explanatory variables (elevation, distance to water and distance to harbour) were used as independent variables in our regression models of mangrove species distribution. These variables are correlated with soil physicochemical gradients including salinity, nitrogen, phosphorus and sulphide, and are used as proxies for these gradients (Bunt et al. 1982, Boto and Wellington 1983, Ball 1998, McKee et al. 2002). Elevation was used as a proxy for inundation frequency, soil nitrogen, phosphorus and sulphide (Boto and Wellington 1983, Ball 1998, Matthijs et al. 1999) and was derived from a digital elevation model (DEM derived from 1:25 000 contours captured at vertical intervals of 5 m). Distance to nearest water represents gradients in salinity, nitrogen, phosphorus, sulphide and inundation frequency and duration (Boto and Wellington 1983, Matthijs et al. 1999, McKee et al. 2002). Distance to the outer harbour provided a proxy for the salinity of inundating water (Bunt et al. 1982, Williams et al. 2006). Distance to water and distance to the outer harbour were generated from a coastal map of Darwin Harbour (scale 1:25 000; resolution 12.5 m).

Modelling approach

Three classes of models were considered. First, an environment only model that does not explicitly account for
SAC; second, the standard autologistic model, an approach commonly applied in ecological studies (Dormann 2007a) that includes an additional autocovariate representing SAC in the response variable; and finally, a new approach that utilizes an autocovariate derived from the residuals of an environment only model, termed a residuals autocovariate (RAC) model. The only difference between these three classes of models is whether or not an autocovariate term is included and the way the term is derived. The environmental explanatory variables are identical in all models. The models in each class (environment only, standard autologistic, and RAC) were implemented with both a GLM and a BRT approach.

Generalized linear modelling is an approach where the response variable is related to explanatory variables via a link function (McCullagh and Nelder 1983). A linear model combining the explanatory variables is fitted to the observed values of the response variable. In this study we have a value for the response variable for each cell (location) and test how accurate the models are by comparing the modelled and observed values. The linear predictor, $X\beta$, consists of a matrix of the explanatory variables $X$, and a vector of associated parameters $\beta$. The linear predictor is related to the estimated (modelled) value $E(Y)$ via a link function $g$ as

$$E(Y) = \mu = g^{-1}(X\beta) \qquad (1)$$

where $\mu$ is the mean of the distribution of the predicted value of the response variable.

The link function relates the mean of the response variable to the linear predictor, and as the datasets used in this study are binary with a Bernoulli distribution, a logit link function was used (McCullagh and Nelder 1983), such that

$$X\beta = \ln\left(\frac{\mu}{1 - \mu}\right) \qquad (2)$$

The GLMs were implemented considering only linear relationships between the environmental explanatory variables and the response variable. The analyses were performed using the statistical package R (ver. 2.11.1; R Core Development Team).

Boosted regression tree (BRT) models were also used to relate the occurrence of the target species to the explanatory variables (Breiman et al. 1984, Friedman et al. 2000). BRT models function by combining two algorithms: the first generates trees by recursive binary splits, with explanatory variables and split points (nodes) selected to minimize prediction errors. The second algorithm, called boosting, combines many trees in a forward stagewise process, by selecting, at each step, the tree that maximises reduction in deviance (De’ath 2007). BRT models were implemented with a tree complexity of three (i.e. each tree has three nodes), enabling simple interactions between variables to be captured. A 10-fold cross validation procedure was used to determine the optimum number of trees to be fitted to the BRT model and the learning rate was set to ensure that at least 1000 trees were produced during the fitting process, as recommended by Elith et al. (2008). For a more detailed explanation of these terms and more information on the BRT fitting procedure refer to Elith et al. (2008). BRT models were developed using the R package ‘gbm’ (Ridgeway 2010), and the custom functions of Elith et al. (2008).

Environment only model

The environment only model was fitted with the environmental explanatory variables and spatial autocorrelation was not considered. Therefore, variation in the probability of occurrence of the target species is related only to variation in the environmental variables. For the simulated data two explanatory variables, ‘rain’ and ‘djungle’, were included in the model. For the mangrove species data three explanatory variables, ‘elevation’, ‘distance to water’ and ‘distance to harbour’, were included. For the GLMs, the linear predictor at location $i$ is related to the explanatory variables as

$$X\beta_i = \alpha + \sum_{k=1}^{a} \delta_k \gamma_{ki} \qquad (3)$$

where $\alpha$ is the intercept, $\delta_k$ is the parameter to be estimated for each explanatory variable $k$, where $k$ ranges from 1 to $a$, the maximum number of variables, and $\gamma_{ki}$ is the value of the explanatory variable $k$ at location $i$. In fitting the BRT models the same variables were used although the fitting procedure differs from the GLM as described in the previous section.

Autologistic model

The autologistic model included the explanatory variables as in the environment only model, plus an additional variable, the autocovariate, representing SAC in the response variable. SAC is accounted for in the model by estimating how much the response variable at a location is correlated with the values of the response variable at neighbouring locations (Dormann et al. 2007). Fitting the autologistic model, where the autocovariate is derived from the response variable, is a two step process. First the autocovariate is calculated based on the observation data, then the explanatory variables and the autocovariate are fitted within a GLM or a BRT approach. To calculate the autocovariate, the observations (the values of the response variable) are arranged in a grid and a neighbourhood size is defined. There are several ways to calculate the autocovariate and they fall into two general classes: direct estimates and focal calculations. For direct estimates, the autocovariate is calculated as the average of the number of occupied cells in a set of neighbouring cells, divided by the distance to the neighbouring cells (inverse distance weighting) (Augustin et al. 1996). Focal calculations of the autocovariate are a special instance of inverse distance weighting where cells within a selected neighbourhood have a weight of 1, and all other cells 0. The autocovariate is then calculated as the average or maximum value within the defined neighbourhood (Leathwick and Austin 2001, Luoto et al. 2001). A first-order neighbourhood is defined as the eight cells adjacent to the central cell and a second-order neighbourhood includes the 16 cells adjacent to the first-order neighbours (O’Sullivan and Unwin 2003). Here a first-order neighbourhood with a focal mean autocovariate is presented as this produced the strongest model performance (in terms of cross-validated AUC scores, deviance explained and reduction in SAC in model residuals) in comparison to two other methods
(inverse distance weighting and maximum focal autocovariate, both calculated for first- and second-order neighbourhoods, results not shown here). The focal means were calculated in R using the package ‘raster’ (Hijmans 2011).

For the autologistic model, SAC is accounted for by estimating how much the response variable at site $i$ is correlated with the values of the response variable at neighbouring sites $j$ (Dormann et al. 2007). For the GLMs, when the autocovariate is calculated as a focal mean, the relationship between the presence of the target species and the linear predictor, at location $i$, can be represented as

$$X\beta_i = \alpha + \sum_{k=1}^{a} \delta_k \gamma_{ki} + \delta_{a+1} \frac{1}{N(i)} \sum_{j \in N(i)} y_j \qquad (4)$$

where $\alpha$, $\delta$, $\gamma$, $k$, $i$ and $a$ are as defined for Eq. 3, and the final term is the autocovariate where $y_j$ is the value of the response variable at location $j$, where $j$ is a cell within the set of cells forming neighbourhood $N(i)$. For each location $i$, a neighbourhood is defined, so the summation is across the set of cells in the neighbourhood and divided by the number of cells in the neighbourhood. A parameter, $\delta_{a+1}$, is estimated for the autocovariate simultaneously with estimates of the parameters associated with the environmental explanatory variables ($\delta_k$). Here the same autocovariate and environmental explanatory variables were fitted to the GLM and BRT models.

Residuals autocovariate model

The residuals autocovariate (RAC) model includes the environment only explanatory variables plus an autocovariate that represents spatial autocorrelation in the residuals of the environment only model. In a RAC model, SAC is accounted for by estimating the strength of the relationship between the environment only model residuals at a location and the values of those residuals at neighbouring locations. Fitting models when the autocovariate is based on the residuals is a three step process. First, a model describing the influence of environmental variables on species occupancy is fitted as either a GLM or a BRT model. Second, the residuals from that model are calculated for each grid cell and the autocovariate is calculated from these residuals using a mean focal operation for a first-order neighbourhood, in a procedure identical to the autologistic model. This produces an autocovariate calculated from the model residuals rather than directly from the response variable as is the case in the standard autologistic approach. The residuals are in the scale of the link function (logit), and are approximately normally distributed, potentially ranging from negative infinity to infinity. In the third step the environmental explanatory variables and the residuals autocovariate are fitted to a GLM or BRT model, and the parameter values describing the influence of explanatory variables and the autocovariate on the observed occupancy data are simultaneously estimated.

The RAC model linear predictor, at location $i$, has the following form when fitted as a GLM

$$X\beta_i = \alpha + \sum_{k=1}^{a} \delta_k \gamma_{ki} + \delta_{a+1} \frac{1}{N(i)} \sum_{j \in N(i)} (y_j - \theta_j) \qquad (5)$$

where $\theta_j$ is the estimated probability of occurrence at site $j$ derived from an environment only model (Eq. 3). The remaining terms are as defined for Eq. 3 and 4. The same RAC autocovariate and environmental explanatory variables were fitted to the GLM and BRT models. The RAC models were fitted using the statistical package R (ver. 2.11.1; R Core Development Team) and code is provided in the Supplementary material Appendix 1.

Assessment of model performance

The predictive and inferential performance of the three classes of models were assessed against the following criteria: predictive ability, reduction in SAC and inferential performance.

Predictive ability of the models
Predictive accuracy is measured according to the match between the predictive map of the probability of species occurrence produced by the model and the observed spatial distribution of the species. Prediction accuracy was assessed quantitatively in two ways. In the first, we calculated the area under the receiver operating characteristic curve (AUC), a metric that summarises the trade-off between sensitivity (the true positive proportion) and the false positive proportion across all possible thresholds (Swets 1988). A model with a high AUC score has a greater ability to correctly rank sites that are actually occupied by the target organism above unoccupied sites. For pairs of randomly selected sites ranked in order of predicted probability of occupancy, a model with an AUC score of 1 will correctly rank the sites 100% of the time, while an AUC score of 0.5 indicates that the model will rank sites correctly only 50% of the time (Pearce and Ferrier 2000). The AUC was calculated on the held-out folds of a 10-fold cross validation (Stone 1974); therefore, the model was not tested against the data on which it was fitted.

The second performance metric, deviance explained by the model, indicates the goodness of fit between the modelled values and the observed values (Crawley 2007). The percent of deviance explained was calculated as the null deviance less the residual deviance as a proportion of the null deviance. Deviance explained incorporates measures of the match between the actual and predicted frequency of occurrence, and strong explanatory power is indicated by a high percent of deviance explained (Ferrier and Watson 1997).

Ability of the model to account for spatial autocorrelation
The ability of the models to account for SAC is indicated by the reduction of SAC in model residuals. SAC was measured by calculating Moran’s index and plotting these values as correlograms. Moran’s index indicates the strength of the correlation between observations as a function of the distance separating them, with the distance between observations grouped into intervals. Values of Moran’s I range between 1 (indicating strong positive SAC, e.g. clustering) and −1 (indicating strong negative SAC, e.g. dispersion), with zero indicating a random pattern with no spatial autocorrelation (Cliff and Ord 1981). Moran’s I was calculated
and correlograms plotted using R, with the ‘spdep’ package (Bivand 2010) and ‘ncf’ package (Bjørnstad 2009).

Quality of the model for inference
The reliability of model inference was measured for GLMs as: a) the unbiased estimation of model parameters; b) the accurate selection of explanatory variables as statistically significant; and c) correct identification of the sign of regression slopes indicating positive or negative relationships between the response and explanatory variables. As BRT models are not used for parameter estimation, fitted functions representing the relationship between the response variable and each explanatory variable are plotted and compared qualitatively. The relative importance of variables is indicated by the proportion of times a variable is selected as a node within the BRT algorithm (Elith et al. 2008).

Results

The predictive performance of the autocovariate models describing variation in mangrove species’ occurrence was stronger than for the environment only models, as indicated by substantial improvements in cross-validated AUC scores and the percent of deviance explained (Tables 1, 2). Both the standard autologistic model and the RAC model reduced residual SAC to negligible levels (0.35 to −0.006 for simulated data; 0.6 to −0.04 for C. tagal). For the simulated data the RAC approach produced less biased parameter estimates than the environment only model, and significant variables were correctly identified as statistically significant by the RAC model but not by the standard autologistic model. For the simulated data GLM and BRT model performance was similar; however, for the mangrove species occurrence data the BRT models outperformed the GLMs in deviance reduction.

Synthetic data

Predictive ability (GLM and BRT)
Model predictive performance was evaluated using cross-validated AUC scores and the deviance explained by each model (Table 1). The GLM and BRT environment only models had AUC scores of < 0.70 and the percent of deviance explained was very low, with both the GLM and BRT approaches accounting for < 9%. The addition of either the autocovariate derived from the response variable or the autocovariate derived from residuals improved AUCs by 0.28–0.30 over the environment only models, with similar improvements for both the autologistic model and the RAC (Table 1). Notably, there was little difference in these scores between the GLM and BRT approaches.

Table 1. The predictive performance and ability to control residual SAC for three models (environment only, autologistic and RAC, implemented as GLMs and BRTs) for simulated data, as indicated by the mean (± SE) of cross-validated percent of deviance explained, AUC scores and Moran’s I.

Model              % deviance explained   AUC            Moran’s I
GLM environment     7.29 ± 0.17           0.66 ± 0.02     0.343 ± 0.016
GLM autologistic   58.95 ± 0.20           0.96 ± 0.01    −0.039 ± 0.003
GLM RAC            58.95 ± 0.38           0.95 ± 0.01    −0.040 ± 0.003
BRT environment     8.03 ± 0.01           0.67 ± 0.02     0.336 ± 0.017
BRT autologistic   56.29 ± 0.03           0.95 ± 0.01    −0.006 ± 0.006
BRT RAC            55.80 ± 0.03           0.95 ± 0.01    −0.008 ± 0.009

Spatial pattern in residuals (GLM and BRT)
The spatial autocorrelation remaining in model residuals was measured by calculating Moran’s I (Table 1). Moran’s I was very similar for the GLM and BRT models without the autocovariates, and was reduced to approximately zero for all models containing an autocovariate term. The autologistic and RAC GLM and BRT models had Moran’s I ranging from −0.008 to −0.04, indicating that all spatial modelling methods tested here effectively removed spatial structure from model residuals.

Correlograms, a visual representation of SAC, were plotted for the raw data and residuals of each model (Fig. 1). The spatial structure of the raw data, and the GLM and BRT environment only models were similar (Fig. 1a, b, g), and showed that locations within five cells of each other had the strongest spatial autocorrelation. The correlation structure was very similar for all the autologistic and RAC models fitted, irrespective of whether they were GLMs or BRTs, with negligible spatial structure in residuals at any lag distance. This indicates that all methods successfully explained the spatial structuring of the observation data.

Parameter estimation bias (GLM)
The spatial modelling approaches were assessed for their ability to correctly estimate the true parameter values used to generate the artificial Snouter dataset, and to correctly

Table 2. The predictive performance of three models (environment only, autologistic and RAC, implemented as GLMs and BRTs) for three mangrove species, as indicated by the mean (± SE) of cross-validated percent of deviance explained and AUC scores.

                   Sonneratia alba                 Rhizophora stylosa              Ceriops tagal
Model              % deviance      AUC             % deviance      AUC             % deviance      AUC
                   explained                       explained                       explained
GLM environment    44.55 ± 0.09    0.95 ± 0.002    13.13 ± 0.03    0.75 ± 0.002     2.87 ± 0.02    0.65 ± 0.003
GLM autologistic   90.36 ± 0.06    1.00 ± 0.0001   79.78 ± 0.03    0.99 ± 0.0003   75.25 ± 0.02    0.98 ± 0.0003
GLM RAC            78.02 ± 0.08    0.99 ± 0.001    68.32 ± 0.05    0.97 ± 0.001    63.89 ± 0.04    0.96 ± 0.001
BRT environment    72.64 ± 0.002   0.99 ± 0.001    32.47 ± 0.005   0.87 ± 0.002    22.49 ± 0.004   0.80 ± 0.002
BRT autologistic   92.10 ± 0.001   1.00 ± 0.0001   80.68 ± 0.004   0.99 ± 0.0003   75.96 ± 0.004   0.98 ± 0.0005
BRT RAC            86.93 ± 0.002   1.00 ± 0.0002   76.70 ± 0.002   0.99 ± 0.0003   72.63 ± 0.003   0.98 ± 0.0003
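
The two predictive metrics reported in Tables 1 and 2 can be illustrated with a short calculation. The sketch below is in Python with NumPy, not the R used for the analyses. It computes the percent of deviance explained as the null deviance less the residual deviance, as a proportion of the null deviance, and the AUC as the proportion of presence/absence pairs that are ranked correctly. The function names and the four toy observations are hypothetical, and no cross-validation is performed here, so the numbers only show the arithmetic.

```python
import numpy as np

def bernoulli_deviance(y, p, eps=1e-12):
    """Deviance (-2 x log-likelihood) of binary observations y given probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def percent_deviance_explained(y, p):
    """(null deviance - residual deviance) / null deviance, as a percentage."""
    y = np.asarray(y, dtype=float)
    # The null model predicts the overall prevalence at every site.
    null_dev = bernoulli_deviance(y, np.full_like(y, y.mean()))
    return 100.0 * (null_dev - bernoulli_deviance(y, p)) / null_dev

def auc(y, p):
    """Share of (occupied, unoccupied) site pairs ranked correctly; ties count as 0.5."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Hypothetical observations and predicted probabilities of occurrence.
y_obs = np.array([1, 1, 0, 0])
p_hat = np.array([0.9, 0.4, 0.6, 0.1])

print(auc(y_obs, p_hat))                         # 0.75: three of the four pairs are ranked correctly
print(percent_deviance_explained(y_obs, p_hat))  # roughly 26.3
```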

Table 3. The inferential performance of three models (environment
only, autologistic and RAC, implemented as GLMs) for simulated
data, as indicated by the mean ( SE) of parameter estimates for the
explanatory variables. ns: non significant, *p  0.001.
Explanatory variables
Model
Actual value of Rain Djungle
parameter: (20.002) (0.0)
GLM environment 20.0022  0.0003* 0.0052  0.013 ns
GLM autologistic 20.0006  0.0004 ns 0.0029  0.0167 ns
GLM RAC 20.005  0.001* 0.0065  0.0072 ns

model however, the relationship between rain and Snouter


occurrence is positive, while for the RAC BRT model it
was negative, and the probability of Snouter occurrence
decreased with increasing rain, as it should. The rela-
tive importance of variables (RI in Fig. 2), represented by
the number of times the variable is selected during the
BRT fitting procedure, was high for the autocovariates
in both the standard autologistic and RAC approaches
(86.6 to 93.4 autologistic model, 86.3 to 94.0 RAC
model), and low for the environment only model 0.32 to
13.3 (Fig. 2).

Mangrove data
Predictive ability (GLMs and BRT models)
Accounting for SAC by incorporating an autocovariate term
in models of mangrove species distribution improved model
predictive performance as indicated by improvements in
cross-validated AUC scores and in the percent of deviance
explained (Table 2). For the Ceriops and Rhizophora species,

Figure 1. The ability of three models (environment only (a, b),


autologistic (c, d), and RAC (e, f )) to control SAC in model
residuals, as indicated by correlograms of Moran’s I plotted
against distance classes, for simulated data. Models were imple-
mented as GLMs (a, c, e) and BRTs (b, d, f ). Spatial autocorrela-
tion in the raw data is shown in panel (g).

identify significant explanatory variables (Table 3). All mod-


eling approaches correctly found djungle to be a non-signifi-
cant explanatory variable. However, the standard autologistic
approach also erroneously found the rain variable to be non-
significant (a type I error). The parameter estimate for rain
by the GLM RAC model consistently over-estimated the
influence of rain on the distribution of the virtual organism
(true value of rain 20.002, RAC estimate 20.005  0.001).

Fitted functions (BRT models)


Fitted functions graphically representing the relationship between the response variable and each explanatory variable were plotted and visually interpreted (Fig. 2). For the environment only model, rain was the most important variable and the likelihood of Snouter occurring declined strongly as rain increased. For the standard BRT autologistic

Figure 2. The relationship between the predicted occurrence of the simulated organism and explanatory variables, as indicated by the fitted functions shown for three BRT models (environment only, autologistic and RAC) for the tenth simulated dataset. The relative importance of variables (RI) was calculated for each simulated dataset, and the range of values shown.

Table 4. The ability of three models (environment only, autologistic and RAC, implemented as GLMs and BRTs) to control spatial autocorrelation remaining in model residuals, as indicated by Moran’s I, for three mangrove species. An index of 1 indicates high positive autocorrelation; 0 no autocorrelation; -1 high negative autocorrelation. Spatial autocorrelation in the raw data is shown in the last line.

Model               Sonneratia alba   Rhizophora stylosa   Ceriops tagal
GLM environment          0.67               0.62               0.60
GLM autologistic        -0.04              -0.07              -0.07
GLM RAC                 -0.05              -0.07              -0.07
BRT environment          0.41               0.48               0.47
BRT autologistic        -0.08              -0.08              -0.08
BRT RAC                 -0.02               0.02              -0.04
Raw data                 0.77               0.68               0.62

both types of autocovariate model had strong predictive performance with AUC scores in excess of 0.96 as compared to environment only GLMs (0.65, 0.75 for Ceriops and Rhizophora respectively). The AUC scores for the Sonneratia models were high for all three models (0.95–1.00), and indicate that the fitted models achieve almost perfect discrimination between occupied and unoccupied sites. Improvements in the percent of deviance explained by the models were greater for the autologistic models than the RAC models, with the total percent of deviance explained differing by 3–6% in the BRT approach, and 10–12% in the GLM approach.

Spatial pattern in model residuals (GLMs and BRT models)
Residual SAC was high in both the GLM and BRT environment only models, although the BRT models showed a reduction in SAC in comparison to the GLMs (Moran’s I, BRT: 0.41–0.48; GLM: 0.60–0.67). This indicates that the BRT approach can automatically capture some of the SAC present, although there was still sufficient SAC in the model residuals to violate assumptions of regression models. This residual spatial structure was effectively reduced to zero when either type of autocovariate was included in the distribution models of each of the three mangrove species (Table 4).

Parameter estimation bias (GLMs)
The true values of the parameters describing the influence of environmental explanatory variables on mangrove species distributions are unknown and therefore model performance cannot be assessed by a comparison between actual and estimated values of the parameters. Instead, the differences in variables that were statistically significant in the environment only, autologistic and RAC models are compared; and incongruities identified between modelling methods in the sign (positive or negative) of the parameters describing the relationship between the response and explanatory variables.
The RAC models retained a greater proportion of environmental explanatory variables deemed statistically significant in the environment only models, compared to the standard autologistic models (Table 5). For the Rhizophora and Ceriops species, all three environmental explanatory variables were significant in the RAC and environment only models, while only one of the three variables was considered statistically significant in the autologistic model. The RAC model had a more stable variable selection, in that the explanatory variables included in the environment only model tended to remain in the RAC model. For Sonneratia alba however, both the standard autologistic model and the RAC model had three significant environmental explanatory variables while the environment only model had two.
The sign of the relationship between the occurrence of the target species and environmental variables was compared between the environment only, autologistic and RAC models, and assessed for consistency. The autologistic model for Rhizophora stylosa indicated a positive relationship between probability of occurrence and distance to water. This relationship was negative under both the environment only and RAC models. Similarly for Ceriops tagal, the sign of the parameter was negative for the standard autologistic approach and positive for the two other methods (Table 5). The direction of relationships between the response and explanatory variables was the same for RAC and environment only models in all cases for all variables tested.

Fitted functions (BRT models)
The fitted functions describing the relationship between the occurrence of Sonneratia alba and the explanatory variables

Table 5. The inferential performance of three models (environment only, autologistic and RAC, implemented as GLMs and BRTs) for three mangrove species, as indicated by mean (± SE) of parameter estimates for the explanatory variables. ns: non significant, *p < 0.01, **p < 0.001.

Model               Distance to nearest water   Distance to outer harbour   Elevation       Autocovariate
Sonneratia alba
GLM environment     -0.0379**                    0.0000146 ns               -0.1455**
GLM autologistic    -0.00192*                   -0.000704**                 -0.263**        14.57**
GLM RAC             -0.0453**                   -0.0000267**                -0.388**         5.55**
Rhizophora stylosa
GLM environment     -0.00428**                   0.0000378**                -0.06367**
GLM autologistic     0.001067**                 -0.00000914 ns              -0.000684 ns    12.573**
GLM RAC             -0.0069**                    0.0006759**                -0.1318**        5.20**
Ceriops tagal
GLM environment      0.0015**                    0.0000119**                -0.0492**
GLM autologistic    -0.0007808**                -0.000000197 ns              0.00539 ns     12.15**
GLM RAC              0.00246**                   0.0002571**                -0.0986**        5.121**
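
The sign comparison described in the text can be reproduced directly from the Table 5 estimates. The sketch below is an illustration rather than part of the original analysis: the values are transcribed from Table 5 (reading the extracted "20." prefix as a minus sign), the short variable keys and the function name `sign_conflicts` are hypothetical, and only estimates flagged as significant are compared, which is an assumption about how the consistency check was made.

```python
# (estimate, significant?) per species, model and variable, from Table 5.
TABLE5 = {
    "Sonneratia alba": {
        "environment": {"water": (-0.0379, True), "harbour": (0.0000146, False),
                        "elevation": (-0.1455, True)},
        "autologistic": {"water": (-0.00192, True), "harbour": (-0.000704, True),
                         "elevation": (-0.263, True)},
        "RAC": {"water": (-0.0453, True), "harbour": (-0.0000267, True),
                "elevation": (-0.388, True)},
    },
    "Rhizophora stylosa": {
        "environment": {"water": (-0.00428, True), "harbour": (0.0000378, True),
                        "elevation": (-0.06367, True)},
        "autologistic": {"water": (0.001067, True), "harbour": (-0.00000914, False),
                         "elevation": (-0.000684, False)},
        "RAC": {"water": (-0.0069, True), "harbour": (0.0006759, True),
                "elevation": (-0.1318, True)},
    },
    "Ceriops tagal": {
        "environment": {"water": (0.0015, True), "harbour": (0.0000119, True),
                        "elevation": (-0.0492, True)},
        "autologistic": {"water": (-0.0007808, True), "harbour": (-0.000000197, False),
                         "elevation": (0.00539, False)},
        "RAC": {"water": (0.00246, True), "harbour": (0.0002571, True),
                "elevation": (-0.0986, True)},
    },
}


def sign_conflicts(table):
    """List (species, variable, signs) where significant estimates disagree in sign."""
    conflicts = []
    for species, models in table.items():
        for var in ("water", "harbour", "elevation"):
            signs = {}
            for model, estimates in models.items():
                est, significant = estimates[var]
                if significant:
                    signs[model] = "+" if est > 0 else "-"
            if len(set(signs.values())) > 1:
                conflicts.append((species, var, signs))
    return conflicts


for species, var, signs in sign_conflicts(TABLE5):
    print(species, var, signs)
```

Under these assumptions the only disagreements are for distance to water in Rhizophora stylosa and Ceriops tagal, where the autologistic estimate has the opposite sign to both the environment only and RAC estimates.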

(distance to water, distance to harbour, elevation and the autocovariate) illustrate how the autocovariate in the standard autologistic model overwhelms the other explanatory variables (Fig. 3). For the environment only model, the fitted function illustrating the relationship between occurrence of S. alba and the distance to water variable (Fig. 3a) shows that the probability of occurrence is highest at  200 m from water followed by a smooth, steep decline to zero around 500 m inland. The relative importance of the three variables is similar (distance to water 37.9, elevation 27.2 and distance to harbour 34.9), and the model is strongly performing (deviance explained 72.64%, AUC 0.99).
Although the standard autologistic model for S. alba was also strongly performing (deviance explained 92.1%, AUC 1.00), the fitted functions are flat and approximately zero across the range of all three variables, including distance to water (Fig. 3d–f). The relative importance of the explanatory variables dropped to between 0.2 and 0.4, leaving the autocovariate as the most powerful predictor of Sonneratia occurrence in the autologistic model (relative importance of 99.2). Much of the information previously contained in the environmental explanatory variables has been absorbed by the autocovariate term.
In contrast, in the RAC model, the environmental explanatory variables still play a role, in particular distance to water which retains a similar shape in the fitted function (Fig. 3c). Furthermore, although the RAC autocovariate had a relative importance of 85.3 and is still the most influential explanatory variable, the three explanatory variables have a relative importance of 8.3 for distance to water, 3.5 for elevation and 2.9 for distance to harbour, much greater values than in the autologistic model. This shows that the RAC model allows the environmental explanatory variables to contribute to deviance reduction in the model while removing the spatial autocorrelation in the model residuals.

Figure 3. The relationship between the predicted occurrence of the mangrove tree species Sonneratia alba and explanatory variables, shown for three BRT models (environment, autologistic and RAC) as indicated by the fitted functions. RI is the relative importance of variables.

Discussion

In this study we show that an extension of the standard autologistic model, where the autocovariate is calculated from model residuals instead of from the response variable, outperforms both standard autologistic and environment only models when evaluated against four criteria: cross-validated predictive performance, removal of residual spatial autocorrelation, reduction in parameter estimation bias and the correct identification of statistically significant variables. This new approach (residuals autocovariate, RAC) performs as well as or better than the two other approaches when applied within GLMs and BRT modelling frameworks, using both synthetic and field datasets.
Spatial autocorrelation in the response variable does not always generate bias in parameter selection and estimation (Diniz-Filho et al. 2003), and it is only the SAC in model residuals that violates the assumptions of statistical models (Legendre 1993). While the standard autologistic model has good predictive performance in comparison to models that do not account for SAC (Hoeting et al. 2000, Wintle and Bardos 2006, Betts et al. 2009) it can produce biased parameter estimates (Carl and Kühn 2007, Dormann 2007b, Dormann et al. 2007). This is largely because the autologistic approach accounts for SAC in the response variable before fitting the explanatory variables, whereas there are sources of SAC that can be explained by the explanatory variables
and potentially eliminate any SAC in model residuals. Spatial autocorrelation in biological data arises from exogenous sources such as from the environment (e.g. rainfall, temperature and soil type) (Legendre et al. 2002) and endogenous sources related to characteristics of the target species itself, such as limits to dispersal or breeding aggregations (Legendre 1993, Keitt et al. 2002, Wintle and Bardos 2006). The advantage of the RAC approach over the autologistic approach is that the explanatory variables are fitted first and have an opportunity to account for SAC in species distributions. By deriving the autocovariate from model residuals, only the variance unexplained by the explanatory variables is incorporated, and therefore the RAC model better captures the true influence of these explanatory variables, resulting in stronger inferential performance than the autologistic approach. This was demonstrated for the datasets used in this study where the RAC model parameter estimates were more accurate than those of the standard autologistic model, and significant variables were correctly identified and retained in the models.
The two modelling frameworks, GLM and BRT, applied here for the RAC approach are complementary. Generalized linear models are easy to interpret, are widely applied in ecology and produce parameter estimates that can be used in meta-analyses or in Bayesian learning (McCarthy and Master 2005). The strength of BRT lies in their ability to automatically capture complex, non-linear relationships with sharp disjunctions (De’ath 2007, Elith et al. 2008). For the mangrove data we found that the environment only BRT models were able to reduce SAC in model residuals, although substantial residual SAC remained. The RAC BRT models for mangrove species were superior to the RAC GLM models, as shown by an increase in deviance explained. This indicates that the advantages of the BRT approach, evident in the environment only models, are also present in the RAC models. The combination of a residuals autocovariate model (RAC) with a tree-based modelling method (BRT) results in superior model predictive performance. However, the BRT method automatically fits interactions between variables while such interactions must be explicitly defined and included in the GLM approach. If we had explicitly included these interactions in the GLMs, their performance could most likely have been improved.
There are advantages to using both synthetic and actual data to test model prediction and inferential ability (or performance) as shown by the comparison of GLM and BRT, and autologistic and RAC model performance. Synthetic data, where the actual values of the parameters are known, enables identification of bias in the recovery of parameter estimates. However, synthetic data is necessarily structurally simplistic and does not replicate the complexity of field data (Betts et al. 2009, Bini et al. 2009). The field data used here for mangrove species is complex and strongly spatially autocorrelated and we found that the BRT models had stronger predictive performance than the GLMs for all approaches. These differences were not evident in results from the synthetic data, probably because the data were simulated via a simple linear relationship with no interactions and were, therefore, well modelled by a simple linear GLM. Much of the advantage of BRT models lies in the ability to handle complex interactions and non-linearities, qualities that frequently occur in actual data, but less so in synthetic datasets. This underscores the importance of testing modelling approaches on both synthetic and actual data.
Models are widely used in ecology and conservation. A model can represent our mechanistic understanding of the causal relationship between explanatory and response variables (Crawley 2007). In correlative statistical modelling, incorrect parameter estimation can lead to a poor understanding of the system and to poor predictions. Under- or over-estimation of the importance of explanatory variables can lead to misunderstandings about the role of natural and anthropogenic processes in ecology and result in poor decisions and policies that may jeopardize the persistence of species. Given that virtually all geo-referenced ecological datasets contain SAC (Lennon 2000), and that the presence of spatial autocorrelation in model residuals can cause inferential problems in statistical analyses, it is important that species distribution modelling methods account for spatial autocorrelation, while remaining straightforward to apply and simple to interpret. We have demonstrated the advantages of an approach that reduces bias in parameter estimation and increases predictive performance of models fitted to spatially autocorrelated observation data. A practical advantage of the RAC model over others accounting for SAC is that it is simple to apply and can be implemented within the very widely used GLM approach, or to enhance predictive performance further, within a BRT framework.

Acknowledgements – We extend our gratitude to Peter Brocklehurst for providing the mangrove distribution datasets, to Michael Bode and Yacov Salomon for assistance in preparation of the equations and to Jane Elith, Mark Burgman, Alan Andersen and Anna Richards for their thoughtful comments on the manuscript.

References

Araújo, M. B. and New, M. 2007. Ensemble forecasting of species distributions. – Trends Ecol. Evol. 22: 42–47.
Augustin, N. H. et al. 1996. An autologistic model for spatial distribution of wildlife. – J. Appl. Ecol. 33: 339–347.
Austin, M. P. 2002. Spatial prediction of species distribution: an interface between ecological theory and statistical modelling. – Ecol. Model. 157: 101–118.
Bakkenes, M. et al. 2002. Assessing effects of forecasted climate change on the diversity and distribution of European higher plants for 2050. – Global Change Biol. 8: 390–407.
Ball, M. C. 1998. Mangrove species richness in relation to salinity and water logging: a case study along the Adelaide River floodplain, northern Australia. – Global Ecol. Biogeogr. Lett. 7: 73–82.
Besag, J. E. 1972. Nearest-neighbour systems and the auto-logistic model for binary data. – J. R. Stat. Soc. B 34: 75–83.
Betts, M. et al. 2009. The ecological importance of space in species distribution models: a comment on Dormann et al. – Ecography 32: 1–5.
Betts, M. G. et al. 2006. The importance of spatial autocorrelation, extent and resolution in predicting forest bird occurrence. – Ecol. Model. 191: 197–224.
Bini, L. M. et al. 2009. Coefficient shifts in geographical ecology: an empirical evaluation of spatial and non-spatial regression. – Ecography 32: 193–204.

Bivand, R. 2010. Package ‘spdep’. – CRAN.
Bjørnstad, O. N. 2009. Package ‘ncf’ – spatial nonparametric covariance functions. – CRAN.
Boto, K. G. and Wellington, T. 1983. Phosphorous and nitrogen nutrient status of a northern Australian mangrove forest. – Mar. Ecol. Prog. Ser. 11: 63–69.
Breiman, L. 2001. Statistical modeling: the two cultures. – Stat. Sci. 16: 199–215.
Breiman, L. et al. 1984. Classification and regression trees. – Wadsworth International Group.
Brocklehurst, P. and Edmeades, B. 1996. The mangrove communities of Darwin Harbour. – Dept of Lands, Planning and Environment of the Northern Territory, Resource Capability Assessment Branch.
Bunt, J. S. et al. 1982. River water salinity and the distribution of mangrove species along several rivers in North Queensland. – Aust. J. Bot. 30: 401–412.
Carl, G. and Kühn, I. 2007. Analyzing spatial autocorrelation in species distributions using Gaussian and logit models. – Ecol. Model. 207: 159–170.
Cliff, A. D. and Ord, J. K. 1981. Spatial processes: models and applications. – Pion.
Crawley, M. J. 2007. The R book. – Wiley.
De’ath, G. 2007. Boosted trees for ecological modeling and prediction. – Ecology 88: 243–251.
Diniz-Filho, J. A. F. et al. 2003. Spatial autocorrelation and red herrings in geographical ecology. – Global Ecol. Biogeogr. 12: 53–64.
Dormann, C. F. 2007a. Effects of incorporating spatial autocorrelation into the analysis of species distribution data. – Global Ecol. Biogeogr. 16: 129–138.
Dormann, C. F. 2007b. Assessing the validity of autologistic regression. – Ecol. Model. 207: 234–242.
Dormann, C. F. et al. 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. – Ecography 30: 609–628.
Elith, J. and Graham, C. H. 2009. Do they? How do they? WHY do they differ? On finding reasons for differing performances of species distribution models. – Ecography 32: 66–77.
Elith, J. and Leathwick, J. R. 2009. Species distribution models: ecological explanation and prediction across space and time. – Annu. Rev. Ecol. Evol. Syst. 40: 677–697.
Elith, J. et al. 2006. Novel methods improve prediction of species’ distributions from occurrence data. – Ecography 29: 129–151.
Elith, J. et al. 2008. A working guide to boosted regression trees. – J. Anim. Ecol. 77: 802–813.
Ferrier, S. and Watson, G. 1997. An evaluation of the effectiveness of environmental surrogates and modelling techniques in predicting the distribution of biological diversity. – New South Wales Parks and Wildlife: Biodiversity Group, Environment Australia.
Fortin, M. J. and Dale, M. R. T. 2005. Spatial analysis – a guide for ecologists. – Cambridge Univ. Press.
Friedman, J. et al. 2000. Additive logistic regression: a statistical view of boosting. – Ann. Stat. 28: 337–407.
Hijmans, R. J. 2011. Package ‘raster’. – CRAN.
Hoeting, J. A. et al. 2000. An improved model for spatially correlated binary responses. – J. Agric. Biol. Environ. Stat. 5: 102–114.
Hothorn, T. et al. 2011. Decomposing environmental, spatial and spatiotemporal components of species distributions. – Ecol. Monogr. in press.
Keitt, T. H. et al. 2002. Accounting for spatial pattern when modeling organism–environment interactions. – Ecography 25: 616–625.
Kühn, I. 2007. Incorporating spatial autocorrelation may invert observed patterns. – Divers. Distrib. 13: 66–69.
Leathwick, J. R. and Austin, M. P. 2001. Competitive interactions between tree species in New Zealand’s old-growth indigenous forests. – Ecology 82: 2560–2573.
Legendre, P. 1993. Spatial autocorrelation: trouble or new paradigm? – Ecology 74: 1659–1673.
Legendre, P. and Fortin, M. J. 1989. Spatial pattern and ecological analysis. – Vegetatio 80: 107–138.
Legendre, P. and Legendre, L. 1998. Numerical ecology. – Elsevier.
Legendre, P. et al. 2002. The consequences of spatial structure for the design and analysis of ecological field surveys. – Ecography 25: 601–615.
Lennon, J. J. 2000. Red-shifts and red herrings in geographical ecology. – Ecography 23: 101–113.
Luoto, M. et al. 2001. Determinants of distribution and abundance in the clouded apollo butterfly: a landscape ecological approach. – Ecography 24: 601–617.
Matthijs, S. et al. 1999. Mangrove species zonation and soil redox state, sulphide concentration and salinity in Gazi Bay (Kenya), a preliminary study. – Mangroves Salt Marshes 3: 243–249.
McCarthy, M. A. and Master, P. 2005. Profiting from prior information in Bayesian analysis of ecological data. – J. Appl. Ecol. 42: 1012–1019.
McCullagh, P. and Nelder, J. A. 1983. Generalized linear models. – Chapman and Hall.
McKee, K. L. et al. 2002. Mangrove isotopic (δ15N and δ13C) fractionation across a nitrogen vs. phosphorus limitation gradient. – Ecology 83: 1065–1075.
Midgley, G. F. et al. 2003. Developing regional and species-level assessments of climate change impacts on biodiversity in the Cape Floristic Region. – Biol. Conserv. 112: 87–97.
Miller, J. R. et al. 2004. Spatial extrapolation: the science of predicting ecological patterns and processes. – BioScience 54: 310–320.
O’Sullivan, D. and Unwin, D. J. 2003. Geographic information analysis. – Wiley.
Pearce, J. and Ferrier, S. 2000. Evaluating the predictive performance of habitat models developed using logistic regression. – Ecol. Model. 133: 225–245.
Peterson, A. T. et al. 2002. Future projections for Mexican faunas under global climate change scenarios. – Nature 416: 626–629.
Ridgeway, G. 2010. ‘gbm’ package. – CRAN.
Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. – J. R. Stat. Soc. B 36: 111–147.
Swets, J. A. 1988. Measuring the accuracy of diagnostic systems. – Science 240: 1285–1293.
Thomas, C. D. et al. 2004. Extinction risk from climate change. – Nature 427: 145–148.
Williams, D. et al. 2006. Hydrodynamics of Darwin Harbour. – In: Wolanski, E. (ed.), The environment in Pacific harbours. Springer, pp. 461–476.
Wintle, B. A. and Bardos, D. C. 2006. Modelling species–habitat relationships with spatially autocorrelated observation data. – Ecol. Appl. 16: 1945–1958.

Supplementary material (Appendix E7138 at www.oikosoffice.lu.se/appendix). Appendix 1.
