Jari Oksanen
processed with vegan 2.5-4 in R Under development (unstable) (2019-02-04 r76055) on February 4, 2019
Contents

1 Parallel processing
  1.1 User interface
    1.1.1 Using parallel processing as default
    1.1.2 Setting up socket clusters
    1.1.3 Random number generation
    1.1.4 Does it pay off?
  1.2 Internals for developers
2 Nestedness and Null models
  2.1 Matrix temperature
3 Scaling in redundancy analysis
4 Weighted average and linear combination scores
  4.1 LC Scores are Linear Combinations
  4.2 Factor constraints
  4.3 Conclusion

1 Parallel processing

Several vegan functions can perform parallel processing using the standard R package parallel. The parallel package implements the functionality of the earlier contributed packages multicore and snow: the multicore functionality forks the analysis to multiple cores, and the snow functionality sets up a socket cluster of workers. The multicore functionality only works in unix-like systems.

1.1 User interface

The functions that are capable of parallel processing have an argument parallel. The normal default is parallel = 1, which means that no parallel processing is performed. It is possible to set parallel processing as the default in vegan (see § 1.1.1). The parallel argument can be either

1. An integer, in which case the given number of parallel processes will be launched (value 1 launches non-parallel processing). In unix-like systems (e.g., MacOS, Linux) these will be forked multicore processes. In Windows, socket clusters will be set up, initialized and closed.

2. A previously created socket cluster. This saves time, as the cluster is not set up and closed within the function. If the argument is a socket cluster, it will also be used in unix-like systems. Setting up a socket cluster is discussed in § 1.1.2.

1.1.1 Using parallel processing as default

If the user sets the option mc.cores, its value will be used as the default value of the parallel argument in vegan functions. The following command sets up parallel processing for all subsequent vegan commands:

> options(mc.cores = 2)
The mc.cores option is defined in the parallel package, but it is usually unset, in which case vegan defaults to non-parallel computation. The mc.cores option can be set by the environment variable MC_CORES when the parallel package is loaded.

R allows setting up a default socket cluster (setDefaultCluster), but this will not be used in vegan.

1.1.2 Setting up socket clusters

If socket clusters are used (and they are the only alternative in Windows), it is often wise to set up a cluster before calling parallelized code and to give the pre-defined cluster as the value of the parallel argument in vegan. If you want to use socket clusters in unix-like systems (MacOS, Linux), this can only be done with pre-defined clusters.

If a socket cluster is not set up in Windows, vegan will create and close the cluster within the function body. This involves the following commands:

clus <- makeCluster(4)
## perform parallel processing
stopCluster(clus)

The first command sets up the cluster, in this case with four cores, and the second command stops the cluster.

Most parallelized vegan functions work similarly in socket and fork clusters, but in oecosimu the parallel processing is used to evaluate user-defined functions, and their arguments and data must be made known to the socket cluster. For example, if you want to run in parallel the meandist function of the oecosimu example with a pre-defined socket cluster, you must use:

> ## start up and define meandist()
> library(vegan)
> data(sipoo)
> meandist <- function(x) mean(vegdist(x, "bray"))
> library(parallel)
> clus <- makeCluster(4)
> clusterEvalQ(clus, library(vegan))
> mbc1 <- oecosimu(dune, meandist, "r2dtable", parallel = clus)
> stopCluster(clus)

Socket clusters are used for parallel processing in Windows, but you do not need to pre-define the socket cluster in oecosimu if you only use vegan commands. However, if you need some other contributed packages, you must pre-define the socket cluster also in Windows, with the appropriate clusterEvalQ calls. If you pre-set the cluster, you can also use snow-style socket clusters in unix-like systems.

1.1.3 Random number generation

Vegan does not use parallel processing in random number generation, and you can set the seed for the standard random number generator. Setting the seed for the parallelized generator (L'Ecuyer) has no effect in vegan.

1.1.4 Does it pay off?

Parallelized processing has a considerable overhead, and the analysis is faster only if the non-parallel code is really slow (takes several seconds of wall-clock time). The overhead is particularly large in socket clusters (in Windows). Creating a socket cluster and evaluating library(vegan) with clusterEvalQ can take two seconds or longer, and only pays off if the non-parallel analysis takes ten seconds or longer. Using pre-defined clusters reduces this overhead. Fork clusters (in unix-like operating systems) have a smaller overhead and can be faster, but they too have an overhead.

Each parallel process needs memory, and a large number of processes needs much memory. If the memory is exhausted, the parallel processes can stall and take much longer than non-parallel processes (minutes instead of seconds).

If the analysis is fast, and a function runs in, say, less than five seconds, parallel processing is rarely useful. Parallel processing is useful only in slow analyses: a large number of replications or simulations, or slow evaluation of each simulation. The danger of memory exhaustion must always be remembered.

The benefits and potential problems of parallel processing depend on your particular system: it is best to rely on your own experience.
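The trade-off described above can be checked empirically on your own machine. The following sketch (the number of simulations and the two workers are arbitrary illustrative choices) times the same oecosimu() call without and with a pre-defined socket cluster; whether the parallel run wins depends entirely on your system:

```r
## Minimal benchmark sketch: compare wall-clock times of the same
## oecosimu() analysis run non-parallel and with a pre-defined cluster.
library(vegan)
library(parallel)
data(dune)

meandist <- function(x) mean(vegdist(x, "bray"))

## non-parallel run
t1 <- system.time(
  oecosimu(dune, meandist, "r2dtable", nsimul = 999)
)

## parallel run with a pre-defined socket cluster, so that the cluster
## start-up cost is not charged to the analysis itself
clus <- makeCluster(2)
clusterEvalQ(clus, library(vegan))
t2 <- system.time(
  oecosimu(dune, meandist, "r2dtable", nsimul = 999, parallel = clus)
)
stopCluster(clus)

t1["elapsed"]
t2["elapsed"]
```

If the elapsed time of the non-parallel run is only a second or two, do not expect the second timing to be smaller.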
1.2 Internals for developers

The implementation of parallel processing should accord with the description of the user interface above (§ 1.1). Function oecosimu can be used as a reference implementation, and a similar interpretation and order of interpretation of arguments should be followed. All future implementations should be consistent, and all must be changed if the call heuristic changes.

The value of the parallel argument can be NULL, a positive integer or a socket cluster. Integer 1 means that no parallel processing is performed. The "normal" default is NULL, which in the "normal" case is interpreted as 1. Here "normal" means that R is run with default settings, without setting mc.cores or the environment variable MC_CORES.

Function oecosimu interprets the parallel argument in the following way:

2. Integer: An integer value is taken as the number of created parallel processes. In unix-like systems this is the number of forked multicore processes, and in Windows it is the number of workers in socket clusters. In Windows, the socket cluster is created, and if needed library(vegan) is evaluated in the cluster (this is not necessary if the function only uses internal functions), and the cluster is stopped after parallel processing.

Figure 1: Matrix temperature for Falco subbuteo on Sibbo Svartholmen (dot). The curve is the fill line, and in a cold matrix all presences (red squares) should be in the upper left corner behind the fill line. A dashed diagonal line of length D goes through the point, and an arrow of length d connects the point to the fill line. The "surprise" for this point is u = (d/D)² and the matrix temperature is based on the sum of surprises: presences outside the fill line or absences within the fill line.
Function nestedtemp will never exactly reproduce results from other programs, although it is based on the same general principles.¹ I try to give the main computation details in this document; all details can be seen in the source code of nestedtemp.

• Species and sites are put into the unit square (Rodríguez-Gironés and Santamaria, 2006). The row and column coordinates will be (k − 0.5)/n for k = 1, ..., n, so that there are no points in the corners or on the margins of the unit square, and a diagonal line can be drawn through any point. I do not know how the rows and columns are converted to the unit square in other software, and this may be a considerable source of differences among implementations.

• Species and sites are ordered alternately using the indices (Rodríguez-Gironés and Santamaria, 2006):

    s_j = Σ_{i | x_ij = 1} i²
    t_j = Σ_{i | x_ij = 0} (n − i + 1)²        (1)

  This is similar to the equation suggested by Rodríguez-Gironés and Santamaria (2006, eq. 4), but omits all terms dependent on the numbers of species or sites, because I could not understand why they were needed. The differences are visible only in small data sets.

• The y and x are the coordinates in the unit square, and the parameter p is selected so that the curve covers the same area as the proportion of presences (Fig. 1). The parameter p is found numerically using the R functions integrate and uniroot. The fill line used in the original matrix temperature software (Atmar and Patterson, 1993) is supposed to be similar (Rodríguez-Gironés and Santamaria, 2006). Small details in the fill line, combined with differences in the scores used in the unit square (especially in the corners), can cause large differences in the results.

• A line with slope −1 is drawn through the point, and the x coordinate of the intersection of this line and the fill line is found using the function uniroot. The difference between this intersection and the row coordinate gives the argument d of the matrix temperature (Fig. 1).
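As an illustration, eq. (1) can be transcribed directly into R for a binary site-by-species matrix. The helper nested_indices below is a name invented here; it is only a sketch of the packing indices, not vegan's actual internal code:

```r
## Packing indices of eq. (1) for a binary site-by-species matrix x:
## s_j sums squared row indices over presences of species j, and
## t_j sums squared indices counted from the other end over absences.
nested_indices <- function(x) {
  n <- nrow(x)
  i <- seq_len(n)
  s <- apply(x, 2, function(col) sum((i[col == 1])^2))
  t <- apply(x, 2, function(col) sum((n - i[col == 0] + 1)^2))
  list(s = s, t = t)
}

## tiny 3 x 3 example matrix
x <- matrix(c(1, 1, 0,
              1, 0, 1,
              0, 0, 1), nrow = 3, byrow = TRUE)
nested_indices(x)
```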
product is zero, or Σ_i u_ik² = Σ_j v_jk² = 1, and Σ_i u_ik u_il = Σ_j v_jk v_jl = 0 for k ≠ l. This is a decomposition, and the original matrix is found exactly from the singular vectors and corresponding singular values, and the first two singular components give the rank = 2 least-squares estimate of the original matrix.

The coefficients u_ik and v_jk are scaled to unit length for all axes k. Eigenvalues λ_k give the information on the importance of the axes, or the 'axis lengths'. Instead of the orthonormal coefficients, or equal-length axes, it is customary to scale species (column) or site (row) scores, or both, by eigenvalues to display the importance of the axes and to describe the true configuration of the points. Table 1 shows some alternative scalings. These alternatives apply to principal components analysis in all cases, and in redundancy analysis they apply to species scores and constraints or linear combination scores; weighted averaging scores have somewhat wider dispersion.

In community ecology, it is common to plot both species and sites in the same graph. If this graph is a graphical display of PCA, or a graphical, low-dimensional approximation of the data, the graph is called a biplot. The graph is a biplot if the transformed scores satisfy x_ij = c Σ_k u*_ik v*_jk, where c is a scaling constant. In the functions princomp, prcomp and rda with scaling = "sites", the plotted scores define a biplot so that the eigenvalues are expressed for sites, and species are left unscaled.

There is no natural way of scaling species and site scores to each other. The eigenvalues in redundancy and principal components analysis are scale-dependent and change when the data are multiplied by a constant. If we have percent cover data, the eigenvalues are typically very high, and the scores scaled by eigenvalues will have a much wider dispersion than the orthonormal set. If we express the percentages as proportions, and divide the matrix by 100, the eigenvalues will be reduced by a factor of 100², and the scores scaled by eigenvalues will have a narrower dispersion. For graphical biplots we should be able to fix the relations of row and column scores to be invariant against scaling of the data. The solution in the R standard function biplot is to scale site and species scores independently, and typically very differently (Table 1), but plot each independently to fill the graph area. The solution in Canoco and rda is to use proportional eigenvalues λ_k / Σ λ_k instead of the original eigenvalues. These proportions are invariant with scale changes, and typically they have a nice range for plotting two data sets in the same graph.

The vegan package uses a scaling constant c = ⁴√((n − 1) Σ λ_k) in order to be able to use scaling by proportional eigenvalues (like in Canoco) and still be able to have a biplot scaling. Because of this, the scaling of rda scores is non-standard. However, the scores function lets you set the scaling constant to any desired value. It is also possible to have two separate scaling constants: the first for the species, and the second for the sites and friends; this allows getting the scores of other software or R functions (Table 2).

The scaling is controlled by three arguments in the scores function in vegan:

1. scaling, with options "sites", "species" and "symmetric", defines the set of scores which is scaled by eigenvalues (Table 1).

2. const can be used to set the numeric scaling constant to non-default values (Table 2).

3. correlation can be used to modify species scores so that they show the relative change of species abundance, or their correlation with the ordination (Table 1). This is no longer a biplot scaling.

4 Weighted average and linear combination scores

Constrained ordination methods such as Constrained Correspondence Analysis (CCA) and Redundancy Analysis (RDA) produce two kinds of site scores (ter Braak, 1986; Palmer, 1993):

LC or Linear Combination Scores, which are linear combinations of the constraining variables.

WA or Weighted Averages Scores, which are such weighted averages of species scores that are as similar to the LC scores as possible.

Many computer programs for constrained ordinations give only or primarily LC scores, following the recommendation of Palmer (1993). However, the functions cca and rda in the vegan package use primarily WA scores. This chapter explains the reasons for this choice.
Table 1: Alternative scalings for rda used in the functions prcomp and princomp, and the ones used in the vegan function rda and the proprietary software Canoco, expressed in terms of orthonormal species (v_ik) and site scores (u_jk), eigenvalues (λ_k), number of sites (n) and species standard deviations (s_j). In rda, const = ⁴√((n − 1) Σ λ_k). The corresponding negative scaling in vegan is derived by dividing each species by its standard deviation s_j (possibly with some additional constant multiplier).

Table 2: Values of the const argument in vegan to get scores that are equal to those from other functions and software. The number of sites (rows) is n, the number of species (columns) is m, and the sum of all eigenvalues is Σ_k λ_k (this is saved as the item tot.chi in the rda result).

                   scaling       const
prcomp, princomp   1             √((n − 1) Σ_k λ_k)
Canoco v3          -1, -2, -3    √(n − 1), √n
Canoco v4          -1, -2, -3    √m, √n
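As a concrete reading of Table 2, the following sketch (using the varespec data) sets const so that the site scores of an unconstrained rda should match prcomp scores up to axis sign; this is an illustration of the table's first row, not a guarantee for every scaling:

```r
## Matching unconstrained rda site scores to prcomp via const (Table 2)
library(vegan)
data(varespec)

pca <- rda(varespec)          # unconstrained rda is a PCA
n   <- nrow(varespec)
tot <- pca$tot.chi            # sum of all eigenvalues

## Table 2, first row: const = sqrt((n - 1) * sum(lambda_k))
v1 <- scores(pca, display = "sites", scaling = "sites",
             const = sqrt((n - 1) * tot))
v2 <- prcomp(varespec)$x[, 1:2]

head(v1)
head(v2)   # should agree with v1 up to the signs of the axes
```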
Briefly, the main reasons are that

• LC scores are linear combinations, so they give us only the (scaled) environmental variables. This means that they are independent of the vegetation and cannot be found from the species composition. Moreover, identical combinations of environmental variables give identical LC scores irrespective of the vegetation.

• McCune (1997) has demonstrated that noisy environmental variables result in deteriorated LC scores, whereas WA scores tolerate some errors in the environmental variables. All environmental measurements contain some errors, and therefore it is safer to use WA scores.

This article studies mainly the first point. The users of vegan have a choice of either LC or WA (default) scores, but after reading this article, I believe that most of them do not want to use LC scores, because they are not what they were looking for in ordination.

4.1 LC Scores are Linear Combinations

Let us perform a simple CCA analysis using only two environmental variables, so that we can see the constrained solution completely in two dimensions:

> library(vegan)
> data(varespec)
> data(varechem)
> orig <- cca(varespec ~ Al + K, varechem)

Function cca in vegan uses WA scores as the default, so we must specifically ask for LC scores (Fig. 2):

> plot(orig, dis=c("lc","bp"))

What would happen to the LC scores if we shuffle the ordering of the sites in the species data? Function sample() below shuffles the indices.
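A minimal sketch of such a shuffle, assuming the varespec and varechem data loaded above (the object names i and shuffled are illustrative):

```r
## Shuffle the row order of the community data, keep the environmental
## data fixed, and refit the constrained model
library(vegan)
data(varespec)
data(varechem)

i <- sample(nrow(varespec))
shuffled <- cca(varespec[i, ] ~ Al + K, varechem)
plot(shuffled, dis = c("lc", "bp"))
```

Because the LC scores depend only on the environmental variables, the constrained configuration of the shuffled fit stays essentially the same.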
[Figures 2–4: CCA ordination diagrams with axes CCA1 and CCA2 and biplot arrows for the constraints Al and K; only the axis and arrow labels are recoverable here.]
Figure 5: Procrustes rotation of LC scores in RDA of the original and shuffled data.

Figure 6: Procrustes rotation of WA scores of CCA with the original and shuffled data.
Figure 7: LC scores of the dune meadow data using only one factor as a constraint.

Figure 8: A "spider plot" connecting WA scores to corresponding LC scores. The shorter the web segments, the better the ordination.
4.2 Factor constraints

It seems that users often get confused when they perform constrained analysis using only one factor (class variable) as the constraint. The following example uses the classical dune meadow data (Jongman et al., 1987):

> data(dune)
> data(dune.env)
> orig <- cca(dune ~ Moisture, dune.env)

When the results are plotted using LC scores, the sample plots fall into only four alternative positions (Fig. 7). In the previous chapter we saw that this happens because the LC scores are the environmental variables, and they can be distinct only if the environmental variables are distinct. However, normally the user would like to see how well the environmental variables separate the vegetation, or inversely, how well we could use the vegetation to discriminate the environmental conditions. For this purpose we should plot WA scores, or LC scores and WA scores together: the LC scores show where the site should be, and the WA scores show where the site is.

Function ordispider adds line segments to connect each WA score with the corresponding LC score (Fig. 8):

> plot(orig, display="wa", type="points")
> ordispider(orig, col="red")
> text(orig, dis="cn", col="blue")

This is the standard way of displaying the results of discriminant analysis, too. Moisture classes 1 and 2 seem to be overlapping, and cannot be completely separated by their vegetation. The other classes are more distinct, but there seems to be a clear arc effect or a "horseshoe", despite using CCA.

4.3 Conclusion

LC scores are only the (weighted and scaled) constraints and are independent of the vegetation. If you plot them, you plot only your environmental variables. WA scores are based on the vegetation data, but are constrained to be as similar to the LC scores as possible. Therefore vegan calls LC scores constraints and WA scores site scores, and uses primarily WA scores in plotting. However, the user makes the ultimate choice, since both scores are available.
References

Atmar W, Patterson BD (1993). “The measure of order and disorder in the distribution of species in fragmented habitat.” Oecologia, 96, 373–382.

Jongman RH, ter Braak CJF, van Tongeren OFR (1987). Data analysis in community and landscape ecology. Pudoc, Wageningen.

McCune B (1997). “Influence of noisy environmental data on canonical correspondence analysis.” Ecology, 78, 2617–2623.

Palmer MW (1993). “Putting things in even better order: The advantages of canonical correspondence analysis.” Ecology, 74, 2215–2230.

Rodríguez-Gironés MA, Santamaria L (2006). “A new algorithm to calculate the nestedness temperature of presence–absence matrices.” Journal of Biogeography, 33, 924–935.

ter Braak CJF (1986). “Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis.” Ecology, 67, 1167–1179.