
Introduction to Ecological Multivariate Analysis

Lasse Ruokolainen and Guillaume Blanchet
University of Helsinki
2014
Contents

1 Data
  1.1 Matrices
      1.1.1 Matrix algebra
      1.1.2 Data matrix
  1.2 Species data manipulation
      1.2.1 Transformation
      1.2.2 Standardization

2 Association
  2.1 What is an association coefficient?
      2.1.1 Correlation
      2.1.2 Partial correlation
      2.1.3 Visualization
  2.2 Distance
      2.2.1 Double-zero problem
  2.3 Measuring ecological distance
      2.3.1 Chord distance
      2.3.2 Chi-square distance
      2.3.3 Hellinger distance
      2.3.4 Bray-Curtis distance, aka Odum's index, aka Renkonen index, aka percentage difference dissimilarity
  2.4 Metric and semimetric distances
  2.5 Similarity indices
  2.6 Implementation

3 Cluster Analysis
  3.1 Overview
  3.2 Hierarchical clustering
      3.2.1 Single-linkage clustering
      3.2.2 Complete-linkage clustering
      3.2.3 Average-linkage clustering
      3.2.4 Ward's clustering method
      3.2.5 Comparison
  3.3 Interpretation of hierarchical clustering results
      3.3.1 Cophenetic correlation
      3.3.2 Finding interpretable clusters
      3.3.3 Graphical presentation of the final clustering result
  3.4 Non-hierarchical clustering
      3.4.1 Partitioning by k-means
      3.4.2 Fuzzy clustering
  3.5 Validation with external data
      3.5.1 Continuous predictors
      3.5.2 Categorical predictors

4 Simple (or unconstrained) Ordinations
  4.1 Overview
  4.2 Principal Component Analysis
      4.2.1 Correlation or Covariance?
      4.2.2 Scaling
      4.2.3 Equilibrium contribution circle
      4.2.4 Number of axes to interpret
      4.2.5 Pre-transformation of species data
  4.3 Correspondence analysis (CA)
      4.3.1 Scaling
      4.3.2 Word of caution
  4.4 Principal coordinate analysis (PCoA)
  4.5 Non-metric multidimensional scaling (NMDS or MDS)

5 Canonical (constrained) ordinations
  5.1 Redundancy analysis (RDA) and canonical correspondence analysis (CCA)
  5.2 Partial canonical ordination
      5.2.1 Variation partitioning
      5.2.2 Forward selection of explanatory variables
  5.3 Distance-based redundancy analysis (db-RDA)
  5.4 Consensus RDA

1 Data

1.1 Matrices

In mathematics a matrix is basically an n-by-m table of information. Usually the cells, or elements, of a matrix are numbers, in which case one can talk of a numeric matrix. An ecological example of a matrix is a data table, also called a data matrix, where species are columns and samples (sites) are rows.
1.1.1 Matrix algebra

In order to better understand the methodology related to ecological matrices, it is useful to understand a few basic mathematical operations carried out with matrices.

A binary operation is called commutative if changing the order of the terms does not change the result. This is the case with scalar numbers, e.g., in 1 + 2 or 3 × 5. However, this is not necessarily the case with matrices. Consider the two matrices A and B:



$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix}, \qquad \mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{pmatrix} \tag{1.1}$$
Summation of these matrices is not possible as such, because A is a 3-by-2 matrix and B is a 2-by-3 matrix. When dealing with matrices, the dimensions must be the same for a summation to be carried out. The sum of A and the transpose of B is then:

$$\mathbf{A} + \mathbf{B}^T = \begin{pmatrix} a_{11}+b_{11} & a_{12}+b_{21} \\ a_{21}+b_{12} & a_{22}+b_{22} \\ a_{31}+b_{13} & a_{32}+b_{23} \end{pmatrix} = (\mathbf{A}^T + \mathbf{B})^T, \tag{1.2}$$
where T indicates a matrix transpose (this is required, as the rows and columns of the two matrices do not match if B is not transposed). While adding matrices is still quite straightforward, matrix multiplication is somewhat more complicated. In element-wise operations the sizes of the matrices need to match exactly, as above. In a matrix product, the number of columns in the matrix on the left of the operation must match the number of rows in the matrix on the right (a matrix multiplication in R is coded as %*%):

$$\mathbf{B}\mathbf{A} = \begin{pmatrix} a_{11}b_{11}+a_{21}b_{12}+a_{31}b_{13} & a_{12}b_{11}+a_{22}b_{12}+a_{32}b_{13} \\ a_{11}b_{21}+a_{21}b_{22}+a_{31}b_{23} & a_{12}b_{21}+a_{22}b_{22}+a_{32}b_{23} \end{pmatrix}. \tag{1.3}$$

When a matrix with only one row and another with only one column (i.e.,
two vectors) are multiplied, the result is a scalar (a single value), and thus this
operation is called a scalar product. A useful piece of information related to

the scalar product is that if two vectors (variables) are both standardized (zero
mean and unit variance) and normalized (scaled in such a way that their sum
of squares equals 1), the scalar product between them equals the correlation
between the original variables.
It is also possible to multiply the example matrices the other way around (that is, AB), which results in a 3-by-3 matrix. In matrix algebra one cannot divide matrices, so A/B is impossible. Instead, this is written as AB⁻¹, where B⁻¹ is the inverse of matrix B.
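As a quick illustration in R (a minimal sketch; the example matrices below are arbitrary and not taken from the text):

> # Two small example matrices, chosen only for illustration:
> A = matrix(1:6, nrow=3, ncol=2)   # a 3-by-2 matrix
> B = matrix(1:6, nrow=2, ncol=3)   # a 2-by-3 matrix
> A + t(B)                          # summation needs matching dimensions
> B %*% A                           # matrix product: a 2-by-2 matrix
> A %*% B                           # the other way around: a 3-by-3 matrix
> C = matrix(c(2,1,1,3), 2, 2)      # a square matrix
> solve(C)                          # its inverse, C^-1
> C %*% solve(C)                    # recovers the 2-by-2 identity matrix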
A square matrix is a special class of matrices that has an equal number of
rows and columns, for example:

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}. \tag{1.4}$$
The diagonal of a square matrix is the set of matrix elements that runs from the upper left corner to the lower right corner (use function diag to extract the diagonal):

$$\mathrm{diag}(\mathbf{A}) = \begin{pmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & 0 \\ 0 & 0 & a_{33} \end{pmatrix}. \tag{1.5}$$
The off-diagonal elements form the lower and upper triangle around the diagonal
(use functions lower.tri and upper.tri to get the triangles):

$$\mathrm{upper}(\mathbf{A}) = \begin{pmatrix} 0 & a_{12} & a_{13} \\ 0 & 0 & a_{23} \\ 0 & 0 & 0 \end{pmatrix}. \tag{1.6}$$

For any square matrix, one can apply the following equation:

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{u} = 0, \tag{1.7}$$

where I is an identity matrix (ones on the diagonal and zeros on the off-diagonal). This so-called characteristic equation is used to derive the eigenvalues λ and eigenvectors u of a matrix. An n-by-n square matrix has n eigenvalues, and each eigenvalue is associated with an eigenvector. Eigenvectors are orthogonal and thus they represent independent directions of variation in the matrix. This is a very useful property, which will become evident later.
Eigenvalues and eigenvectors can be calculated with function eigen in R.
For example, the following matrix A:


$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \tag{1.8}$$

has eigenvalues

$$\lambda_1 = 5.37, \qquad \lambda_2 = -0.37 \tag{1.9}$$

and eigenvectors

$$\mathbf{u}_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \mathbf{u}_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}. \tag{1.10}$$

The eigenvectors can be interpreted as orthogonal unit vectors that define the dimensions of the matrix (Figure 1.1).

Figure 1.1: Graphical interpretation of eigenvectors

Eigenvalues can then be considered as multipliers that stretch or shrink the eigenvectors. For example, many ordination techniques are based on the eigendecomposition of a matrix derived from the raw data (such as the covariance matrix). The eigenvectors of this matrix represent the ordination axes, and the eigenvalues represent the amount of variation accounted for by each eigenvector.
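In R the example above can be reproduced with eigen (a brief sketch; note that eigen returns eigenvectors scaled to unit length, so only their directions should be compared with Eq. 1.10):

> A = matrix(c(1, 3, 2, 4), 2, 2)   # the matrix of Eq. 1.8 (filled column-wise)
> e = eigen(A)
> e$values                          # approximately 5.37 and -0.37 (Eq. 1.9)
> e$vectors                         # one eigenvector per column, scaled to unit length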
1.1.2 Data matrix

Ecologists are usually dealing with data matrices, which are generally non-square
matrices, and covariance/correlation matrices that are square matrices. Let us
look at an example data matrix:
> library(vegan) # load the vegan package to the workspace
> data(dune)    # load a data set to the workspace
> head(dune[,1:10])

   Belper Empnig Junbuf Junart Airpra Elepal Rumace Viclat Brarut Ranfla
2       3      0      0      0      0      0      0      0      0      0
13      0      0      3      0      0      0      0      0      0      2
4       2      0      0      0      0      0      0      0      2      0
16      0      0      0      3      0      8      0      0      4      2
6       0      0      0      0      0      0      6      0      6      0
1       0      0      0      0      0      0      0      0      0      0
This data matrix contains the abundance of vascular plants on a dune meadow.
The entire matrix has 30 species sampled in 20 sites (sample plots). The dimensions of the matrix can be examined with the following commands:
> dim(dune)
> nrow(dune)
> ncol(dune)
If one wishes to consider the table not as sites-by-species but as species-by-sites,
the data matrix can be transposed :
> t(dune[1:6,1:10])
       2 13 4 16 6 1
Belper 3  0 2  0 0 0
Empnig 0  0 0  0 0 0
Junbuf 0  3 0  0 0 0
Junart 0  0 0  3 0 0
Airpra 0  0 0  0 0 0
Elepal 0  0 0  8 0 0
Rumace 0  0 0  0 6 0
Viclat 0  0 0  0 0 0
Brarut 0  0 2  4 6 0
Ranfla 0  2 0  2 0 0

As shown above, indexing can be used to access specific parts of a data matrix. The same approach can also be used to leave out certain parts:
> dune[-4,-c(2,5,7,15)]
We can also calculate simple statistics from the sites-by-species data table,
such as the species richness in each site and the proportional frequency of each
species among sites (Figure 1.2):
> rich = rowSums(dune>0) # calculate species richness
> per.freq = 100*colSums(dune>0)/nrow(dune) # calculate %-frequency of each species
> par(mfrow=c(1,2),mar=c(3,3,1,1),mgp=c(2,.8,0))
> hist(rich,10,col='wheat',xlab='Species richness',cex.lab=1.2,main='')
> hist(per.freq,10,col='wheat',xlab='Species prevalence (%)',cex.lab=1.2,main='')

8
0

Frequency

4
3
2

Frequency

1
0

10

12

14

Species richness

20

40

60

80

Species prevalence (%)

Figure 1.2: The distribution of species richness and percentage frequency of


species in the dune data set

1.2 Species data manipulation

1.2.1 Transformation

The distribution of variables within a data matrix can be examined using histograms. If the distribution is undesirable for some purposes (e.g., non-symmetric), the data can be transformed, e.g., using square root (sqrt) or logarithm (log, log10, log1p) functions (Figure 1.3):

> data(varechem)
> attach(varechem)
> par(mfrow=c(1,3),mar=c(6,6,2,1),mgp=c(4,1,0),xpd=NA)
> hist(Al,15,col='lightblue',cex.axis=2,cex.lab=3,cex.main=3)
> hist(sqrt(Al),10,col='lightblue',cex.axis=2,cex.lab=3,cex.main=3)
> hist(log(Al),10,col='lightblue',cex.axis=2,cex.lab=3,cex.main=3)
> detach(varechem)
Figure 1.3: The distribution of aluminium concentration in the soil of lichen pastures in eastern Fennoscandia: raw data, square-root transformed data, and (natural) log-transformed data
Such transformations may be needed, e.g., if the data are not normally distributed (when using Pearson's correlation), or if model residuals are not normally distributed (when using linear regression). Normality can be tested, e.g., using the Shapiro-Wilk test:
> shapiro.test(Al)
Shapiro-Wilk normality test
data: Al
W = 0.8877, p-value = 0.01193

which indicates that the null hypothesis (the distribution is normal) can be rejected for the distribution of aluminium concentration in the soil. Doing the same test for the square-root transformed data gives p = 0.18, indicating that normality cannot be rejected for the transformed data.
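The corresponding call is shown below (a one-line sketch; the exact p-value depends on the version of the varechem data):

> shapiro.test(sqrt(varechem$Al))   # p-value around 0.18, so normality is not rejected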
Instead of applying a specific function to the data to alter the scale of observations, species abundances can also be transformed to presence-absence (i.e.,
changing the scale to 0/1). This can be done either as:
> dune.pa = matrix(as.numeric(dune>0),dim(dune),
+                  dimnames=list(rownames(dune),colnames(dune)))

or:
> dune.pa = ifelse(dune > 0, 1, 0)

or by using the decostand function in vegan, with pa as the method:


> dune.pa = decostand(dune,method='pa')

Note that it is the ifelse function that is applied internally by decostand to perform the transformation.
1.2.2 Standardization

The above transformations can only be applied to single variables. However, in some cases it can be useful to apply a transformation that affects the entire data set, which is often necessary when analyzing species data. The decostand function (vegan package) provides many options for common standardization procedures of ecological data. In this function, standardization, as contrasted with simple transformation (such as square root, log or presence-absence), means that the values are not transformed with regard to one variable only, but the transformation considers the other variables in the data table. Depending on the transformation, this can be done with respect to sites, species, or both. For example, the scale of species abundances can be changed either
(1) by standardizing to maximal abundances:

$$y'_i = \frac{y_i}{\max(y)}, \tag{1.11}$$

where $y_i$ is the abundance of a species in site i,

(2) by standardizing with site totals (producing relative abundances):

$$y'_{ij} = \frac{y_{ij}}{\sum_{k=1}^{S} y_{kj}}, \tag{1.12}$$

where $y_{ij}$ is the abundance of species i (out of S in total) in site j, or


(3) by standardizing with the grand total:

$$y'_{ij} = \frac{y_{ij}}{\sum_j \sum_k y_{kj}}. \tag{1.13}$$

Consider examples of data standardization:


> # Data standardization methods using decostand:
> # ---------------------------------------------
> data(varespec) # load data
> # (1) Scale species abundances by their maxima:
> spe.scaled = decostand(varespec,'max')
> # (2) Scale to species relative abundances dividing by site totals:
> spe.relab = decostand(varespec,'total',MARGIN=2)
> # (3) Normalize row vectors to length of one:
> spe.norm = decostand(varespec,'normalize')

This scaling (3) is called the chord transformation; when calculating Euclidean distances after this transformation, chord distances are returned. This can be useful when performing analyses such as Principal Component Analysis (PCA) and k-means partitioning, because it is then the chord distance, not the Euclidean distance, that is preserved in the analysis. Other similar standardizations are considered later in Chapter 2.
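A quick way to convince oneself of this (a small sketch using the objects created above) is to compare the Euclidean distance between two chord-transformed rows with the chord distance computed directly from the raw abundances:

> x = as.matrix(varespec)
> # chord distance between sites 1 and 2, computed from the raw data:
> sqrt(sum((x[1,]/sqrt(sum(x[1,]^2)) - x[2,]/sqrt(sum(x[2,]^2)))^2))
> # Euclidean distance between the same sites after the chord transformation:
> dist(spe.norm[1:2,])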
> # Standardization of both species and sites:
> # -------------------------------------------
> # (4) Chi-square transformation:
> spe.chi = decostand(varespec,'chi')

As with the chord transformation, this chi-square transformation leads to χ² distances being returned when calculating Euclidean distances between sites.
> # (5) Wisconsin standardization:
> spe.wis = wisconsin(varespec)

This method first ranges abundances by species maxima (1) and then by site totals (2). The function metaMDS (in vegan), which is used to perform non-metric multidimensional scaling, automatically performs a Wisconsin transformation if there are abundance values greater than 9. This function also automatically applies a square-root transformation if the maximum count in the data is greater than 50. The reasoning behind this transformation is that, theoretically, the variance of a Poisson-distributed variable (a distribution followed by, e.g., species count data) that has been square-root transformed tends to 1/4.
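This variance-stabilizing property is easy to check by simulation (a minimal sketch; the mean count of 20 is arbitrary):

> lambda = 20                       # an arbitrary, reasonably large mean count
> var(sqrt(rpois(1e5, lambda)))     # close to 0.25, i.e. 1/4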

All these standardizations (like the transformations considered previously) affect the distribution of species abundances. The differences between these transformations can be compared using boxplots (Figure 1.4):

> w = 'Emp.nig' # select Empetrum nigrum as an example
> spe = varespec[,w]
> par(mfrow=c(1,3),mar=c(4,4,1,1))
> boxplot(spe,sqrt(spe),log1p(spe),col='lightblue',
+         names=c('Raw','Sqrt','log'),cex.axis=2)
> boxplot(spe.scaled[,w],spe.relab[,w],spe.norm[,w],
+         col='plum',names=c('Max','Total','Norm'),cex.axis=2)
> boxplot(spe.chi[,w],spe.wis[,w],col='olivedrab1',
+         names=c('Chi','Wisconsin'),cex.axis=2)

Figure 1.4: The effect of various transformations on the distribution of Empetrum nigrum in the varespec data


2 Association

2.1 What is an association coefficient?

Usually, considering a table of variables, the association between these variables is measured by correlation, a broad class of statistical coefficients describing dependence. The most commonly used methods are Pearson's correlation, which is parametric, and Spearman's correlation, which is non-parametric (parametric methods assume that the data come from a certain distribution and make inferences about the parameters of that distribution). Correlation coefficients range between −1 and 1. Negative correlation indicates inverse dependence, positive correlation indicates direct dependence, and zero indicates independence.
2.1.1 Correlation

When considering Pearson's correlation, it is useful to start from defining covariance. The (sample) variance of a variable x (with n observations) is defined as:

$$V[x] = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \tag{2.1}$$

Similarly, the (sample) covariance between variables x and y is calculated as:

$$\mathrm{Cov}[x, y] = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \tag{2.2}$$

While the variance measures the amount of dispersion around the mean, covariance measures the amount of joint dispersion between two variables (how much overlap is there in the variances of the two variables?).

Pearson's correlation is calculated by scaling the covariance by the square root of the product of the variances, such that:

$$r_{x,y} = \frac{\mathrm{Cov}[x, y]}{\sqrt{V[x]\,V[y]}}. \tag{2.3}$$

From here it can be seen that if the covariance is calculated for standardized variables (with zero mean and unit variance), the result is Pearson's correlation. If correlation is calculated among several variables, the results are usually combined into a correlation matrix. This is a symmetric matrix ($r_{ij} = r_{ji}$), with ones on the diagonal ($r_{ii} = 1$).
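This equivalence is easy to verify with simulated data (a minimal sketch):

> x = rnorm(100); y = x + rnorm(100)
> cov(scale(x), scale(y))           # covariance of the standardized variables...
> cor(x, y)                         # ...equals the Pearson correlation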
As mentioned above, Pearson's correlation is a parametric method that assumes variables are normally distributed. If the values of the variables are replaced by their rank orders, the formula for Pearson's correlation returns the Spearman correlation ρ. If there are no duplicate values (i.e., all ranks are unique), the Spearman correlation is calculated as follows:

$$\rho = 1 - \frac{6\sum_{i=1}^{n}(x_i - y_i)^2}{n(n^2 - 1)}, \tag{2.4}$$

where $x_i$ and $y_i$ are the rank values of the two variables.

Another rank-order correlation, Kendall's τ, is based on ranking pairs of observations. Let's consider two random variables x and y, such that all the values $x_i$ and $y_i$ are unique. Any pair of observations $[x_i, y_i]$ and $[x_j, y_j]$ is said to be concordant if the ranks of both elements agree: that is, if both $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$. If these conditions are not met, the pair is said to be discordant. When all pairs have been scored as concordant (a in total) or discordant (b in total), Kendall's τ is calculated as follows:

$$\tau = \frac{2(a - b)}{n(n - 1)}. \tag{2.5}$$
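Equation (2.5) can be checked against R's built-in implementation (a short sketch; with no tied values, cor(..., method = 'kendall') returns exactly this quantity):

> set.seed(1)
> x = rnorm(8); y = rnorm(8)
> pr = combn(8, 2)                              # all pairs of observations
> s = sign(x[pr[1,]] - x[pr[2,]]) * sign(y[pr[1,]] - y[pr[2,]])
> a = sum(s > 0); b = sum(s < 0)                # concordant and discordant pairs
> 2*(a - b)/(8*7)                               # Eq. (2.5)
> cor(x, y, method='kendall')                   # gives the same value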

The Spearman and Kendall correlations are non-parametric, and thus assume nothing about the distribution of the variables considered. Also, while Pearson's correlation models a linear relationship between variables, rank-order correlations are more flexible about the shape of the association. For example, consider how differently various relationships are captured by linear and rank-order correlations (Figure 2.1):
> # Simulate correlations:
> # ----------------------
> m = character(6)
> x1 = seq(-2,2,length.out=100)
> y1 = x1 + rnorm(100)*.5
> y2 = x1^3 + rnorm(100)*.5
> x2 = c(seq(0,2,length.out=90),seq(3,3.5,length.out=10))
> y3 = c(2*x2[1:90]+rnorm(90)*.5,6-x2[91:100]
+        +rnorm(10)*.5)
> m[1] = as.character(signif(cor(x1,y1),2))
> m[2] = as.character(signif(cor(x1,y1,method='spearman'),2))
> m[3] = as.character(signif(cor(x1,y2),2))
> m[4] = as.character(signif(cor(x1,y2,method='spearman'),2))
> m[5] = as.character(signif(cor(x2,y3),2))
> m[6] = as.character(signif(cor(x2,y3,method='spearman'),2))

2.1.2 Partial correlation

When considering the association between several variables, one might be interested in the unique dependence between two variables, when the effect of the other variables has been accounted for. This is essentially what partial correlation is about. For example, when the focal data consist of three variables x, y, and z, the partial correlation between x and y is calculated as follows:

$$\rho_{x,y|z} = \frac{\rho_{x,y} - \rho_{x,z}\,\rho_{y,z}}{\sqrt{1 - \rho_{x,z}^2}\,\sqrt{1 - \rho_{y,z}^2}}.$$

Figure 2.1: Three different relationships between variables X and Y

Partial correlations can be applied with any correlation coefficient. There are several functions available for this purpose, e.g., pcor (package ggm), pcor (package ppcor), and cor2pcor (package corpcor). The latter alternative converts a correlation matrix into a partial correlation matrix (and vice versa).
2.1.3 Visualization

Patterns in between-variable association can be easily examined using the plot.data.frame function (call the function plot on an object of class data.frame). As an example, consider the environmental data table varechem, in package vegan, in Figure 2.2:
> require(vegan) # load vegan
> require(gclus) # load gclus
> require(RColorBrewer) # load RColorBrewer
> data(varechem) # load a data.frame of environmental variables
> plot(varechem[,c('N','P','Ca','Fe','Mn','Al','pH')],
+      panel=panel.smooth,pch=16,cex=.8,lwd=2)

A much neater way of using plot.data.frame to illustrate correlations is given in Borcard et al. (2011).

If one is just interested in portraying correlation patterns in the data, a simpler approach is to use plotcolors (package gclus) to plot the correlation matrix. The function brewer.pal (package RColorBrewer) is used here to select a more appealing color scheme, as compared to the default option. This is exemplified in Figure 2.3.
Figure 2.2: A scatter plot matrix with smoothing within panels

> # Figure 2.3.
> par(mfrow=c(1,2),mar=c(1,1,1,1),mgp=c(2,.2,0))
> n = 11
> cors = cor(varechem)
> o = order.single(cors)
> cmat = dmat.color(cors,breaks=seq(-1,1,length=n),colors=brewer.pal(n,'RdBu'))
> plotcolors(cmat,rlabels=T,clabels=T,dlabels=names(varechem))
> cmat = dmat.color(cors[o,o],breaks=seq(-1,1,length=n),
+                   colors=brewer.pal(n,'RdBu'))
> plotcolors(cmat,rlabels=o,clabels=o,dlabels=names(varechem)[o])

2.2 Distance

In many community ecology studies, the aim is to compare the species composition between two or several samples, and possibly relate that to external explanatory variables. Many applications, such as ordination and classification (clustering), are based on some measure of resemblance between samples, rather than on the raw presence/absence or abundance of species. In this section we will consider what is meant by ecological resemblance and how it can be calculated.

For example, consider a simple data set with 2 species, whose relative abundance has been observed in 5 sites. The relationships between the sites, with respect to their species composition, can be easily examined in a scatter plot (Figure 2.4).

Figure 2.3: Color plots of the correlation matrix. On the left the matrix is plotted in its original form, while on the right the matrix is reordered to reflect patterns in correlation between several variables (the variables in the middle are all negatively correlated with each other)
> # Figure 2.4
> # Data matrix:
> X = matrix(c(0.78,0.90,0.29,0.73,0.19,0.28,0.93,0.63,0.51,0.87),5,2);
> n = nrow(X)
> # all possible combinations of the elements of [1:n] taken 2
> # at a time
> pairs = combn(1:n, 2)

> par(mar=c(5,5,3,3))
> plot(X[pairs,1], X[pairs,2],type='b',xlab='Species 1',ylab='Species 2',
+      pch=21,cex=2,bg='gray70',cex.lab=1.5,xlim=c(0.1,1),ylim=c(0.2,1))
> text(X[,1], X[,2],1:n,pos=2,col=2,cex=2)
> text(.5,.92,expression('D'[25])); text(.6,.74,expression('D'[23]));
> text(.75,.7,expression('D'[24])); text(.89,.6,expression('D'[12]))

The similarity in species composition between sites can be measured as the distance between the points on this two-dimensional plane, defined by the abundances of the two species. The simplest way to do this is to use the Pythagorean theorem:

$$D_{Eucl}^{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \tag{2.6}$$

which gives the Euclidean distance D between sites i and j, where x and y are the two species. Note that $D_{ij} = D_{ji}$, i.e., the measure of distance is symmetric. More generally, with n species, this becomes:

$$D_{Eucl}^{ij} = \sqrt{\sum_{k=1}^{n}(x_{k,i} - x_{k,j})^2}. \tag{2.7}$$
Figure 2.4: The dispersion of 5 samples in the space of 2 species. The lines connecting each site (from 1 to 5) represent the shortest (Euclidean) distance between them, D

The Euclidean distance ranges from 0 to infinity, with increasing number of
variables and increasing difference in the value of each variable between sites (this means that the scale of each variable has an effect on the distance). This can be avoided by standardizing each variable (i.e., subtract the mean and divide by the standard deviation: $(X - \bar{X})/\sigma_X$). As a result, all variables will be on the same scale and have equal (zero) mean.
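A brief sketch of the effect, using the varechem data (where the variables are measured in very different units):

> data(varechem)
> round(as.matrix(dist(varechem))[1:3, 1:3], 1)        # dominated by variables with large numeric values
> round(as.matrix(dist(scale(varechem)))[1:3, 1:3], 1) # all variables contribute on a comparable scale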
As with correlations, all distances between objects are collected into a symmetric distance matrix, with zeros on the diagonal (Dii = 0):
> print(signif(as.matrix(dist(X)),2))
     1    2    3    4    5
1 0.00 0.66 0.60 0.24 0.83
2 0.66 0.00 0.68 0.45 0.71
3 0.60 0.68 0.00 0.46 0.26
4 0.24 0.45 0.46 0.00 0.65
5 0.83 0.71 0.26 0.65 0.00

2.2.1 Double-zero problem

The Euclidean distance is a useful, and logical, measure to characterize differences in, e.g., spatial location, or the physical properties of sampling locations.
In these cases the value zero has the same meaning as any other value on the
scale of the variable. For example, the absence of nitrogen in the soil or the
fact that two samples have been acquired from the same spot are ecologically
meaningful pieces of information.
In contrast, when it comes to species presence/absence, the interpretation
of double-zeros becomes more tricky. The presence of a species at a given site
generally implies that this site provides a set of minimal conditions allowing the
species to survive (the dimensions of its ecological niche). Note that a species
might be found in a site because it appeared there by accident and not because
the local conditions are suitable for it; many species can be transiently found
in sites where they cannot survive in the long run. However, the absence of a
species from a sample can be due to a variety of causes: the species niche may
be occupied by a replacement species, or the absence of the species is due to
adverse conditions on any of the important dimensions of its ecological niche,
or the species has been missed because of a purely stochastic component of its
spatial distribution, or the species does not show a regular distribution on the
site under study, or simply due to observation error (the species was there but
the observer missed it).
The crucial point here is that the absence of a species from two sites cannot readily be counted as an indication of resemblance between the two sites, because this double absence may be due to completely different reasons in the two samples. Luckily, many alternative methods for measuring ecological resemblance that account for this problem are available.

2.3 Measuring ecological distance

As we learned above, Euclidean distance (giving the same interpretation to double presence as to double absence) is not suitable for species data. However, using specific pre-transformations of the data, the Pythagorean formula ($D_{Eucl}$) can be made to return other than Euclidean distances. Here we will consider three such transformations.
2.3.1 Chord distance

If the raw data are normalized between 0 and 1, calculating the Euclidean distance between two objects results in the so-called chord distance (here referred to as $D_{chord}$). The transformation required is:

$$\hat{X}_{ij} = \frac{X_{ij}}{\sqrt{\sum_{j=1}^{p} X_{ij}^2}}, \tag{2.8}$$

that is, the abundance of species j in site i is divided by the square root of the sum of squared abundances at that site. The clear advantage of $D_{chord}$ over $D_{Eucl}$ is that the chord distance is insensitive to double zeros, making it suitable for species abundance data.
While $D_{Eucl}$ is the shortest distance between points in variable space, the chord distance is equivalent to the length of a chord joining two points within a segment of a sphere or hypersphere of radius 1. If only two variables are considered, the sphere becomes a circle and the chord distance can be represented as shown in Figure 2.5.

Figure 2.5: Graphical representation of the chord distance. The red line between points 1 and 2 illustrates the Euclidean distance, whereas the black arc defines the chord distance.

The chord distance is maximum when the species at the two sites are completely different (no common species). In this case, the normalized site vectors are at 90° to each other, and the distance between the two sites is $\sqrt{2}$ ($D = \sqrt{1^2 + 1^2}$).
2.3.2 χ² distance

This distance is related to the χ² statistic used to study contingency tables or to compare expected and observed distributions. This distance can be calculated by first transforming species abundances into profiles of conditional probability and then computing a weighted Euclidean distance among sites. The weight is inversely proportional to the abundance of each species, which means that this measure puts a special emphasis on rare species.
To calculate the χ² distance using $D_{Eucl}$, the data need to be transformed as follows:

$$\hat{X}_{ij} = \sqrt{X_{++}}\,\frac{X_{ij}}{\sqrt{X_{i+}\,X_{+j}}}, \tag{2.9}$$

where $X_{++}$ is the grand sum of the sites-by-species data table, $X_{i+}$ is the total abundance of site i, and $X_{+j}$ is the total abundance of species j in the data. This distance has no upper bound, similarly to $D_{Eucl}$.

The χ² distance (here $D_{\chi^2}$) is considered here because Correspondence Analysis (CA) preserves this distance between sites. As with $D_{chord}$, $D_{\chi^2}$ does not suffer from the double-zero problem.


2.3.3 Hellinger distance

A distance measure related to both $D_{chord}$ and $D_{\chi^2}$ is the Hellinger distance, which is contained between 0 and $\sqrt{2}$, and does not suffer from the double-zero problem. The Hellinger distance (here $D_{Hel}$) can again be calculated using $D_{Eucl}$ by transforming the raw data:

$$\hat{X}_{ij} = \sqrt{\frac{X_{ij}}{X_{i+}}}, \tag{2.10}$$

where $X_{i+}$ is the site total. This distance places less emphasis on species abundances (the square root down-weights large abundances), which makes it in many cases more useful than $D_{\chi^2}$.
2.3.4 Bray-Curtis distance, aka Odum's index, aka Renkonen index, aka percentage difference dissimilarity

This beloved child has many names, which reflects the fact that the percentage difference dissimilarity is one of the most applied measures of ecological resemblance. It is the complement of the Steinhaus similarity index, $D_{BC} = 1 - S_{Stein}$. The Steinhaus similarity is calculated as follows:

$$S_{Stein}^{ij} = \frac{2W}{A + B}, \tag{2.11}$$

where W is the sum of the minimum abundances of the species shared between sites i and j ($W = \sum \min(X_i, X_j)$), A is the total abundance at site i and B is the total abundance at site j. That is, this measure gives the proportion of shared abundance between two sites, hence the name percentage difference (dis)similarity.
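A short sketch comparing this formula with vegan's vegdist for the first two sites of the varespec data (loaded in Chapter 1); the 'bray' method of vegdist implements exactly this dissimilarity:

> x = as.numeric(varespec[1,]); y = as.numeric(varespec[2,])
> W = sum(pmin(x, y)); A = sum(x); B = sum(y)
> 1 - 2*W/(A + B)                           # D_BC = 1 - S_Stein (Eq. 2.11)
> vegdist(varespec[1:2,], method='bray')    # gives the same value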

2.4 Metric and semimetric distances

A distance is said to be metric when it fulfills these four axioms:

(1) D(x, y) ≥ 0
(2) D(x, y) = 0 if and only if x = y
(3) D(x, y) = D(y, x)
(4) D(x, z) ≤ D(x, y) + D(y, z)

The first axiom posits that metric distances can only be positive (negative distances are usually nonsensical). The second axiom states that the distance between two points is zero if, and only if, the two points are identical. The third axiom means that distances are symmetric; the distance from x to y is the same as the distance from y to x. The fourth axiom is often called the triangle inequality. This axiom requires that the distance from x to z via y is at least as great as the distance from x to z directly. $D_{Eucl}$, $D_{\chi^2}$, $D_{chord}$, and $D_{Hel}$ are metric distances.

A semimetric distance satisfies the first three axioms, but not necessarily the triangle inequality. These measures cannot directly be used to order points in a metric or Euclidean space because, for three points (x, y and z), the sum of the distances from x to y and from y to z may be smaller than the distance between x and z. This is the case, for example, with the Bray-Curtis dissimilarity ($D_{BC}$), which is semimetric. However, note that the square-rooted Bray-Curtis dissimilarity ($\sqrt{D_{BC}}$) is metric.

2.5 Similarity indices

In many cases similarity and distance are just complements of each other. Thus, all similarity coefficients can be converted into distances by one of the following formulas:

$$D = 1 - S$$
$$D = \sqrt{1 - S^2}$$
$$D = \sqrt{1 - S}$$
$$D = 1 - S/S_{max}$$
A similarity that has been converted to a distance is usually referred to as a dissimilarity. The above-presented Steinhaus similarity is commonly used for ecological data. This is a quantitative index that takes into account the differences in species abundances. Other commonly used similarity indices are binary, meaning that they only consider presence/absence data.

When considering binary data, there are four different quantities related to the similarity of two samples (objects):

a: number of species present in both samples (matches)
b: number of species unique to site i
c: number of species unique to site j
d: number of co-absences

Different similarity indices can be generated by combining these quantities, with different weights. As in the case of distances, double absence is not meaningful when considering species data. However, it can be useful if one is analysing categorical data. The best-known indices utilizing these quantities are the Jaccard index and the Sørensen index:

$$S_{Jac} = \frac{a}{a+b+c} \tag{2.12}$$

$$S_{Sor} = \frac{2a}{2a+b+c}. \tag{2.13}$$

The Jaccard index can also be calculated from the Sørensen index as $S_{Jac} = S_{Sor}/(2 - S_{Sor})$. $S_{Sor}$ is the binary equivalent of $S_{Stein}$. The Sørensen dissimilarity ($D_{Sor} = 1 - S_{Sor} = (b + c)/(2a + b + c)$) between two sites equals Whittaker's species turnover between those sites. If the formula for the Jaccard index is applied to quantitative data (such that the letters stand for summed abundances rather than presences), the resulting similarity is called the Ruzicka index (according to J. Oksanen).
The binary similarity between sites can also be calculated using a correlation coefficient. Using the above quantities, this becomes:

$$r = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}. \tag{2.14}$$

Since 1 − r ranges between 0 and 2, it is convenient to do the conversion to dissimilarity as (1 − r)/2. Again, remember that this index is not optimal for species data, given that it considers double zeros (d).

2.6 Implementation

There are many R packages with functions dedicated to calculating ecological distances/similarities, such as vegan (vegdist), ade4 (dist.binary, dist.quant), cluster (daisy), and labdsv (dsvdis). The properties of each function can be examined through their help files.

A particularly interesting function is designdist in the vegan package. Note that this function generates only distances (or dissimilarities), even when the applied index is actually a similarity. When calculating the distances considered above, one can use the function decostand in vegan to do the data transformations. The following example utilizes these two functions to generate a range of distances and dissimilarities.
> data(BCI) # load data of tree counts
> # calculate Bray-Curtis dissimilarity:
> D.BC = designdist(BCI,method='(A+B-2*J)/(A+B)',terms='minimum')
> # calculate Jaccard dissimilarity:
> D.Jac = designdist(BCI,method='(A+B-2*J)/(A+B-J)',terms='binary')
> # calculate Sorensen dissimilarity:
> D.Sor = designdist(BCI,method='(A+B-2*J)/(A+B)',terms='binary')
> # calculate Euclidean distance:
> D.Eucl = dist(BCI)
> # calculate chord distance:
> D.chord = dist(decostand(BCI,'normalize'))
> # calculate Chi^2 distance:
> D.Chi = dist(decostand(BCI,'chi.square'))
> # calculate Hellinger distance:
> D.Hel = dist(decostand(BCI,'hellinger'))


> D = data.frame(D.Eucl=D.Eucl[1:1225],D.BC=D.BC[1:1225],
+                D.Jac=D.Jac[1:1225],D.Sor=D.Sor[1:1225],
+                D.chord=D.chord[1:1225],D.Chi=D.Chi[1:1225],
+                D.Hel=D.Hel[1:1225])

> plot(D,lower.panel=NULL,panel=panel.smooth,cex=.5,pch=16,lwd=2)
Figure 2.6: Comparison of ecological distances and dissimilarities calculated for the BCI data in the vegan package

The first thing to notice here is that Euclidean distances differ quite considerably from all the other measures. One reason for this is that the Euclidean distance is the only one that considers double zeros. Next, one can see that the Bray-Curtis dissimilarity is most similar to the Hellinger distance. This is useful to keep in mind when using methods such as PCA, k-means partitioning, and redundancy analysis (RDA). It is not very surprising that the Jaccard and Sørensen dissimilarities are almost identical. However, there is a rather large quantitative deviation between them and the other resemblance measures (remember that these are binary indices).

3 Cluster Analysis

3.1 Overview

Clustering is a family of methods that are used to classify objects into discrete categories, using specific rules and assumptions. The majority of clustering algorithms fall into the category of hierarchical agglomerative clustering. Non-hierarchical methods can also be used, if one is interested in finding an optimal grouping of observations and not their hierarchical relationships. Most clustering methods are based on an association matrix calculated between the objects of interest.

Hierarchical, agglomerative procedures begin with the discontinuous collection of objects (i.e., each object forms its own group) that are successively grouped into larger and larger clusters until a single, all-encompassing cluster is obtained (all objects are combined in the same group). That is, the members of lower-ranking clusters become members of larger, higher-ranking clusters. In contrast, non-hierarchical methods (such as k-means clustering) produce one single partition, without any hierarchy among the groups.

In the 80s and 90s a popular method, especially among vegetation ecologists, was to use a divisive clustering method called TWINSPAN (for Two-Way INdicator SPecies ANalysis). This method proceeds by first calculating the first axis of Correspondence Analysis (CA) and splitting the data at the centre of this axis. The process is repeated within each part of the division, until all objects form their own group. Divisive clustering methods are not considered here, as they are rather problematic (objects are assigned to different groups based on a rather arbitrary partitioning) and are no longer used in ecology.

In the stats package in R, the function hclust can be used to produce a variety of clustering methods. Other functions are available, e.g., in package cluster.

3.2 Hierarchical clustering

3.2.1 Single-linkage clustering

Consider a distance, similarity, or dissimilarity matrix. In single-linkage (or


nearest neighbour ) clustering, one gradually increases a distance threshold. All
objects that are within that distance from each other are grouped together. Two
groups are joined, if at least one member of each group has a distance within
the threshold.
Single-linkage clustering is conceptually related to a minimum spanning tree,
which can be represented in ordinations. This method is particularly useful in
finding discontinuities in the data. However, single-linkage is prone to chaining;
single sites are joined to large clusters. The resulting dendrogram does not show
clearly separated groups, but can be used to identify gradients in the data.
The result of a hierarchical clustering is generally presented in the form of
a dendrogram; a tree-like figure that illustrates the agglomeration hierarchy. In
the following snippet of code we are using the dune data set found in the vegan
package.


> library(vegan)
> data(dune)
> dat = dune
> D.Hel = dist(decostand(dat,'hellinger'))
> C.SL = hclust(D.Hel,method='single')
> plot(C.SL)

Figure 3.1: Cluster dendrogram for the single-linkage method

3.2.2 Complete-linkage clustering

Contrary to single-linkage clustering, complete-linkage clustering (a.k.a. furthest neighbour clustering) allows an object (or a group) to join another group only at a distance (or similarity) corresponding to that of the most distant pair of objects. That is, for two groups to join, it is required that all objects be related at the given distance threshold. Complete linkage therefore tends to produce many small, separate groups that agglomerate at large distances. This makes the method interesting for looking for discontinuities in data that are a priori quite compact.

Complete-linkage clustering might be preferable over single-linkage clustering because it makes compact clusters. However, this is partially an artefact of the method: the clusters are not allowed to grow, because the complete-linkage criterion would be violated.
> C.CL = hclust(D.Hel,method='complete')
> plot(C.CL)

Figure 3.2: Cluster dendrogram for the complete-linkage method

The conceptual difference between single- and complete-linkage clustering is illustrated in Figure 3.3. The two groups, coloured red and blue, are joined by single-linkage at the distance corresponding to the green line, whereas in complete-linkage the respective distance is given by the cyan line.

Figure 3.3: Illustrating the conceptual difference between single- and complete-linkage clustering. The two groups (red and blue points) are joined by single-linkage at the distance corresponding to the green line between the most similar objects of the two groups. In contrast, complete-linkage joins the two groups at the distance shown by the cyan line, between the most distant points of the groups


3.2.3 Average-linkage clustering

Average-linkage clustering is often considered to be a compromise between the two extremes presented above, and more neutral in grouping. Unlike the methods described above, average-linkage clustering is not based on the number of links between groups or objects, but rather on average similarities among objects or on the centroids of clusters (a centroid is the point for which the distance to all group members is minimized). These methods differ in the way the position of the groups is computed (arithmetic average versus centroid) and in the weighting of the groups according to the number of objects that they contain (Table 3.1). This is illustrated conceptually in Figure 3.4.
Table 3.1: Four methods of average-linkage agglomerative clustering.

Weight     Arithmetic average    Centroid clustering
Equal      Unweighted (UPGMA)    Unweighted (UPGMC)
Unequal    Weighted (WPGMA)      Weighted (WPGMC)

Figure 3.4: In UPGMA an object joins a cluster at the average distance between the object and all members of the cluster. In UPGMC the joining distance is that between the object and the cluster centroid

The most commonly applied method, UPGMA (Unweighted Pair-Group Method using Arithmetic averages), can be calculated using the average method in the function hclust, whereas UPGMC is calculated using the centroid method in function hclust.
> C.upgma = hclust(D.Hel,method='average')
> plot(C.upgma)

UPGMA must be applied with caution because it gives equal weights to the
original similarities. It assumes that the objects in each group form a representative sample of the corresponding larger groups of objects in the reference
population under study. For this reason, UPGMA clustering should only be used

in connection with simple random or systematic sampling designs if the results are to be extrapolated to a larger reference population.

Figure 3.5: Cluster dendrogram for the average-linkage method UPGMA
The important distinction between unweighted (UPGMA, UPGMC) and
weighted (WPGMA, WPGMC) average-linkage methods is that the latter consider the number of objects in each group. The two unweighted clustering methods may be distorted when a large and a small group of objects are clustered
together. WPGMA can be calculated with function agnes in package cluster.
While intuitively appealing, centroid clustering is not used much in practice,
partly owing to its tendency to produce trees with reversals. Reversals occur
when the values at which clusters merge do not increase from one clustering step
to the next, but decrease instead, which can make the result hard to interpret.
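A brief sketch of the WPGMA route mentioned above (in agnes the method is called 'weighted'; hclust offers the same linkage under the name 'mcquitty'):

> require(cluster)
> C.wpgma = agnes(D.Hel, method='weighted')   # WPGMA via agnes
> plot(C.wpgma, which.plots=2)                # draw the dendrogram
> # the equivalent call with hclust:
> C.wpgma2 = hclust(D.Hel, method='mcquitty')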
3.2.4 Ward's clustering method

This method is related to the centroid clustering methods described above (UPGMC and WPGMC); that is, cluster centroids play an important role. What this method does is minimize the squared error of ANOVA. At the beginning, each object forms its own cluster; for this starting point, the sum of squared distances between objects and centroids is 0. As clusters form, the centroids move away from the actual object coordinates and the sum of the squared distances from the objects to the centroids increases.

At each clustering step, Ward's clustering method finds the pair of objects or clusters whose fusion increases as little as possible the sum, over all objects, of the squared distances between objects and cluster centroids. As the mean squared deviation can be calculated for both raw data and distances, Ward's method is very flexible about the type of input data. However, if using raw species data, it is best to pre-transform the data before analysis (such that other than Euclidean distances are preserved between objects). Let's see how this method performs using Hellinger-transformed data, as above.
> C.Ward = hclust(D.Hel^2,method='ward')
> C.Ward$height = sqrt(C.Ward$height)
> plot(C.Ward)

Figure 3.6: Cluster dendrogram for the Hellinger-transformed data, using Ward's method. Here the height refers to the sum of squared distances to the cluster centroid. To obtain the correct solution the distances need to be squared when using function hclust. In this case it is useful to take a square root of the height element of the clustering object

3.2.5 Comparison

The four different methods of hierarchical clustering all produced somewhat different results. We can do a simple visual comparison, before going into detail about selecting the best method, by considering the grouping of objects between the methods. Let's assume that we are interested in defining three groups in the data (Figure 3.7). One can easily see from Figure 3.7 that the methods differ somewhat in their final grouping. Consider, for example, how sites 17 and 19 are classified by the four methods.
> # Figure 3.7:
> # -----------

> par(mfrow=c(2,2),mar=c(2,4,2,1))
> plot(C.SL,main='Single-linkage',xlab='')
> rect.hclust(C.SL,3,border=2:4)
> plot(C.CL,main='Complete-linkage',xlab='')
> rect.hclust(C.CL,3,border=2:4)
> plot(C.upgma,main='UPGMA',xlab='')
> rect.hclust(C.upgma,3,border=2:4)
> plot(C.Ward,main='Ward',xlab='')
> rect.hclust(C.Ward,3,border=2:4)


Figure 3.7: Visual comparison between the clustering patterns between four
hierarchical clustering methods. The function rect.hclust is used to visualize
three groups in each dendrogram. This function uses another function cutree,
which can be used to find the hierarchy level of a desired number of groups, or
find the number of groups at a desired height in the dendrogram

3.3 Interpretation of hierarchical clustering results

An important point to bear in mind is that clustering is a procedure that maps the distances among objects into a classification, not a statistical test. The choices regarding the association coefficient and the clustering algorithm influence the outcome, as you have seen above in Figure 3.7. This stresses the importance of choosing a method that is consistent with the aims of the analysis. If one uses a distance measure with an interpretable interval (such as Bray-Curtis or Jaccard), a meaningful cutting level is 0.5, since objects in a cluster are then more similar to each other than to other objects in the dendrogram. There are also several methods for assessing the suitability of a given approach.
3.3.1 Cophenetic correlation

A clustering algorithm maps the original distances between objects into cophenetic distances. The cophenetic distance between two objects in a dendrogram is the distance at which the objects become members of the same group; that is, the distance to the node that is the common ancestor of both objects in the dendrogram. As the original distances between objects form a distance matrix, the cophenetic distances form a cophenetic matrix. To evaluate the correspondence between the original and the cophenetic matrix (i.e., to assess how well the original distances have been mapped), one can calculate a correlation between these matrices, called the cophenetic correlation. Note that since the two matrices are not independent, this correlation cannot be tested for significance.

Let's see how the above methods perform. The cophenetic matrix is found with function cophenetic in package stats. Then, a correlation coefficient is calculated. It is preferable to use a rank-order correlation, since the relationship between the original and cophenetic distances is likely to be non-linear.
> # Cophenetic correlations:
> # ------------------------
> cph.SL = cophenetic(C.SL)
> cph.CL = cophenetic(C.CL)
> cph.upgma = cophenetic(C.upgma)
> cph.Ward = cophenetic(C.Ward)
> cors = matrix(0,1,4,dimnames=list('COR',c('Single','Complete','UPGMA','Ward')))
> cors[1] = cor(D.Hel,cph.SL,method='spearman')
> cors[2] = cor(D.Hel,cph.CL,method='spearman')
> cors[3] = cor(D.Hel,cph.upgma,method='spearman')
> cors[4] = cor(D.Hel,cph.Ward,method='spearman')
> print(cors)

       Single  Complete     UPGMA     Ward
COR 0.5300351 0.5329543 0.7858786 0.629469

This analysis suggests that UPGMA would be the optimal method, given the
Hellinger distances used to model between-object association.
3.3.2 Finding interpretable clusters

Above we briefly considered the way the four clustering methods divided the objects into three groups. Next we will put such a cutting of the dendrogram to a test. There are several approaches to validating a given partitioning. Here we will consider a method based on silhouette widths.


The silhouette width is a measure of the degree of membership of an object to its cluster, based on the average distance between this object and all objects of the cluster to which it belongs, compared to the same measure computed for the next closest cluster. Silhouette widths range from −1 to 1. The greater this value is, the better the object is clustered. Negative values indicate misclassification. The silhouette widths can be averaged over all objects of a partition to get an overall goodness estimate. Silhouette widths can be calculated with function silhouette in package cluster (Figure 3.8).
> # Silhouette widths:
> # ------------------
> require(cluster)
> sil.wid = numeric(nrow(dat))
> # Calculate silhouette widths for each number of clusters,
> # disregarding the trivial k = 1:
> for(k in 2:(nrow(dat)-1)){
+   tmp = silhouette(cutree(C.Ward,k=k),D.Hel)
+   sil.wid[k] = summary(tmp)$avg.width
+ }
> # Best width
> k.best = which.max(sil.wid)
> # Plotting:
> par(xpd=NA)
> plot(1:(nrow(dat)),sil.wid,type='h',main='Silhouette: optimal number
+      of clusters, Ward',xlab='k number of clusters',
+      ylab='Average silhouette width',cex.lab=1.25)
> lines(rep(k.best,2),c(0,max(sil.wid)),col=2,cex=1.5,lwd=3)

Figure 3.8: Bar plot of silhouette widths for k = 2-20 groups, for Ward's clustering method. The optimal partition is given by the red bar, having the highest average silhouette width

When an optimal number of groups has been selected, the classification of each object can be diagnosed further by considering their respective silhouette widths. This can be done, e.g., using a silhouette plot. Let's try this for a case with three groups, which was used as an example in Figure 3.7.
> # Silhouette plot of the 'optimal' partition:
> # -------------------------------------------
> k = 3
> cutc = cutree(C.Ward,k=k)
> sil = silhouette(cutc,D.Hel)
> sil.ord = sortSilhouette(sil)
> rownames(sil.ord) = row.names(dat)[attr(sil.ord,'iOrd')]
> plot(sil.ord,main='Silhouette plot, Ward (Hellinger)',cex.names=0.8,
+      col=sil.ord[,1]+1,nmax.lab=100)

Figure 3.9: Silhouette plot for a three-group partition, based on Ward's clustering method using the Hellinger distance. (In this partition the three clusters contain 10, 6 and 4 sites, with average silhouette widths of 0.12, 0.27 and 0.29; the overall average silhouette width is 0.2.)
In Figure 3.9, showing the silhouette plot, on the right you can see the number of objects (n) and the average silhouette width (ave) for each group. Below the figure the average silhouette width for the entire partition is given. Notice that objects 1 and 2 seem to be misclassified. Going through this procedure for several alternative methods might be needed for finding the best approach.
A similar approach to silhouette widths is to calculate a Mantel correlation
between the original distance matrix and binary matrices computed from the
dendrogram cut at various levels (representing group allocations).
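A rough sketch of that idea (not necessarily the exact procedure intended in the text): for each number of groups k, build a binary "different group" matrix from the dendrogram and correlate it with the original distances; a high correlation suggests a well-supported partition.

> grp.cor = numeric(nrow(dat))
> for(k in 2:(nrow(dat)-1)){
+   grp = cutree(C.Ward, k=k)
+   # 1 when two sites fall in different groups, 0 when they share a group:
+   B = as.dist(1 * outer(grp, grp, FUN='!='))
+   grp.cor[k] = cor(as.vector(D.Hel), as.vector(B))
+ }
> which.max(grp.cor)     # candidate number of groups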


3.3.3 Graphical presentation of the final clustering result

After selecting the best algorithm and finding the optimal number of clusters, it is time to consider how the result should be presented. First, produce a dendrogram with the final grouping (Figure 3.10).
> # Figure 3.10:
> # ------------
> require(gclus)
> # Reorder the dendrogram such that the ordering in the dissimilarity
> # matrix is respected as much as possible (this does not affect
> # dendrogram topology).
> C.Ward.ord = reorder.hclust(C.Ward,D.Hel)
> plot(C.Ward.ord,hang=-1,xlab='5 groups',sub='',main='Reordered, Ward (Hellinger)')
> # the hang = -1 argument draws the branches of the dendrogram down to 0.
> rect.hclust(C.Ward.ord,k=k.best,border=2:6)

Figure 3.10: Final dendrogram, using Ward's method on Hellinger dissimilarities. Each group is highlighted with a box.
A highly illustrative approach is to use a so-called heat map to visualize
patterns in the original distance matrix and the clustering at the same time.
Figure 3.11 shows a symmetric heatmap with the same clustering result for both
columns and rows of the distance matrix. If desired, a different clustering result
can be used for each.
> # Figure 3.11:
> # ------------
> dend = as.dendrogram(C.Ward.ord)
> heatmap(as.matrix(D.Hel),Rowv=dend,symm=T)

Using a similar approach one can explore the species composition of each cluster. Here it is useful to rescale the species abundances, which can be done, e.g., using the vegemite function in the vegan package.

Figure 3.11: Heat map of the Hellinger dissimilarity matrix, reordered according to the Ward clustering dendrogram. Darker colors indicate higher similarity.
Using a heat map, the rescaled abundance of each species can be displayed in association with the cluster dendrogram, which allows one to inspect how species content varies between groups. By default, vegemite orders species by their weighted averages on the site scores (Hill's method). This is illustrated in Figure 3.12, which indicates that there is a gradient in the data, associated with considerable turnover in species composition.
> # Figure 3.12:
> # ------------
> require(RColorBrewer)
> or = vegemite(dat,C.Ward.ord,'Hill',zero='-')

(vegemite prints a compact community table of the 20 sites and 30 species, with sites ordered by the dendrogram and species by their weighted averages on the Hill scores; output omitted here.)

> heatmap(t(dat[rev(or$species)]),Rowv=NA,Colv=dend,
+ col=c('white',brewer.pal(5,'Blues')),xlab='Sites',
+ margin=c(4,4),ylab='Species')

3.4 Non-hierarchical clustering

As explained in the Overview, non-hierarchical clustering methods do not produce a tree-like hierarchy of objects, but generate only a single partitioning. Here we will consider two methods, k-means and fuzzy clustering.
3.4.1 Partitioning by k-means

Here a predetermined number of groups is sought by partitioning the objects into k groups, such that the objects within each cluster are more similar to one another than to objects in the other clusters. To achieve this, the method iteratively minimizes an objective function called the total error sum of squares (E² or TESS). This quantity is the sum, over the k groups, of the means of the squared distances among objects in their respective groups.
The process starts from either a random partitioning of the objects into k groups, a partition derived from a hierarchical clustering computed on the same data, or one provided by an ecological hypothesis. Especially when the starting configuration is random, several iterations are needed to find a stable solution. As with Ward's method, k-means partitioning can be computed either from raw data or from a distance matrix (as TESS can be computed directly from distances among objects). When using raw data, it is preferable to pre-transform the data to avoid double-zeros (i.e., so that Euclidean distances are preserved between objects). The function kmeans in the stats package uses raw data.
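A minimal sketch of a single k-means run on pre-transformed data is shown below (not from the original text; the data object dat and the choice of three groups are assumptions carried over from the earlier examples).

> # k-means on Hellinger-transformed data; nstart repeats the algorithm
> # from several random starting configurations and keeps the best solution:
> X.Hel = decostand(dat,'hellinger')
> km = kmeans(X.Hel,centers=3,nstart=100)
> km$cluster       # group membership of each object
> km$tot.withinss  # total error sum of squares (TESS) of the solution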
Figure 3.12: Heat map of the community table with the cluster dendrogram.


As the number of groups is predefined, it is necessary to repeat the analysis for different values of k. A convenient way of doing this is to use the function cascadeKM in the vegan package (Figure 3.13). Here you specify the minimum and maximum number of groups, and a criterion is used to compare the partitionings. The cascadeKM function offers two alternative criteria, either calinski (Calinski-Harabasz criterion) or ssi (Simple Structure Index). The former is an F-statistic comparing the among-group to the within-group sum of squares of the partition, while the latter multiplicatively combines three elements that influence the interpretability of a solution, normalized to give a value between 0 and 1. More details of these and other criteria are available in clustIndex in package cclust.
The choice of criterion is not trivial. For example, the ssi and calinski criteria give six and three groups, respectively, as optimal for the example data. This means that, in order to arrive at a stable solution, it might be necessary to consider different association measures and different criteria, as well as the suitability of each to the question at hand.
> # k-means partitioning:
> # ---------------------
> X.Hel = decostand(dat,'hellinger')
> KM.ssi = cascadeKM(X.Hel,sup.gr=10,inf.gr=2,criterion='ssi')
> KM.calinski = cascadeKM(X.Hel,sup.gr=10,inf.gr=2,criterion='calinski')

> plot(KM.calinski,sortg=T)
> plot(KM.ssi,sortg=T)

Figure 3.13: k-means cascade plots for the two evaluation criteria (calinski and ssi), showing the grouping of each object for each partition. The optimal solution, given the criterion, is marked with red in the right-hand panel.
The same analysis could be done step by step with the function kmeans, which provides a partitioning into k groups, which can in turn be evaluated with the function cIndexKM.
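A rough sketch of that step-by-step route is given below (not from the original text); the argument names for cIndexKM follow its vegan help page, so treat the call as an assumption rather than canonical usage.

> # Evaluate k-means solutions for k = 2-10 with the Calinski-Harabasz index:
> crit = numeric(10)
> for(k in 2:10){
+   km = kmeans(X.Hel,centers=k,nstart=50)
+   crit[k] = cIndexKM(km,X.Hel,index='calinski')
+ }
> which.max(crit)  # number of groups with the highest criterion value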
3.4.2 Fuzzy clustering

In all the methods above, objects are unambiguously (discontinuously) assigned to different groups. However, as the natural world is mostly continuous, it might be preferable to use less clear-cut approaches to partition objects based on their species composition. This can be done by fuzzy clustering. Instead of a classification where a given object belongs to only one cluster, fuzzy clustering associates with each object a series of membership values measuring the strength of its membership in the various clusters. An object that is clearly linked to a given cluster has a strong membership value for that cluster and weak (or null) values for the other clusters. The membership values add up to 1 for each object, which means that the membership can be interpreted as the probability of an object belonging to a given group. Fuzzy clustering is non-hierarchical and in that respect similar to k-means clustering.
Here we will make an example using the function fanny in the cluster package. This function accepts either a data table or a distance matrix. We will use the Hellinger-transformed species data, as with k-means clustering. Because there are no validation criteria for fuzzy clustering similar to those available for k-means clustering, selection of the optimal grouping needs to be done by hand (this is of course possible also for k-means clustering).
> # Fuzzy clustering:
> # -----------------
> k = 3
> C.fuz = fanny(X.Hel,k=k,memb.exp=1.5)
> s.C.fuz = summary(C.fuz)
> C.fuz.g = C.fuz$clustering
> # Silhouette plot:
> # ----------------
> plot(silhouette(C.fuz),main='Silhouette plot, fuzzy (Hellinger)',
+ cex.names=0.8,col=C.fuz$silinfo$widths+1)

Figure 3.14: Silhouette plot for fuzzy clustering. The three clusters contain 8, 4, and 8 objects, with average silhouette widths of 0.19, 0.33, and 0.19; the average silhouette width of the partition is 0.22.

A good way to visually evaluate the membership of each object is to overlay the clustering result on an ordination diagram (Figure 3.15).
> # Ordination plot of clustering result:
> # -------------------------------------
> pcoa = scores(cmdscale(D.Hel),choices=c(1,2))
> plot(pcoa,asp=1,type='n',main='Ordination of fuzzy clusters',ylim=c(-.6,.6))
> abline(h=0,lty='dotted');abline(v=0,lty='dotted')
> for(ii in 1:k){
+   tmp = pcoa[C.fuz.g==ii,]
+   tmp2 = chull(tmp);tmp2 = c(tmp2,tmp2[1])
+   lines(tmp[tmp2,],col=ii+1,lty=2)
+ }
> stars(C.fuz$membership,location=pcoa,add=T,scale=F,draw.segments=T,
+ len=.1,col.segments=2:(k+1))
> legend(.5,-.4,paste('Cluster',1:k,sep=' '),pch=15,pt.cex=2,
+ col=2:(k+1),bty='n')

Figure 3.15: Fuzzy clustering membership displayed in a principal coordinates ordination (PCoA).

Here, it is fairly straightforward to identify the objects that are difficult to classify: they have relatively similar memberships in more than one group. For example, while site 19 has its strongest membership in cluster 3, it is not strongly associated with any of the three groups.

3.5 Validation with external data

Clustering of species data classifies objects (sites, patches, etc.) based on differences in species composition. Validation of a given partition based on silhouettes or similar means only considers the mapping of the original distances onto the classification. However, what should be of interest is the ecological interpretability of the grouping of objects: how can the clusters be explained? Here any data external to the species abundances comes into the picture. A simple way of contrasting the clustering result with, e.g., environmental data is to use the grouping as a factor in an ANOVA (or a Kruskal-Wallis test).
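As a minimal sketch (not from the original text), the dune clustering could be contrasted with the only continuous variable in dune.env, the thickness of the A1 soil horizon; the three-group cut of the Ward dendrogram is assumed from the earlier examples.

> # Cluster membership as a factor in simple univariate tests:
> data(dune.env)
> grp = factor(cutree(C.Ward,k=3))
> summary(aov(dune.env$A1~grp))   # one-way ANOVA
> kruskal.test(dune.env$A1~grp)   # non-parametric alternative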
3.5.1 Continuous predictors

The dune data set is not optimal for an external validation of clustering results (the associated environmental data contain only a single continuous variable). Instead, we'll use here another data set from vegan, the varespec species data and the varechem environmental data. Using the calinski criterion, a k-means partition on Hellinger-transformed data gives an optimum of 3 groups, which is used here. The varechem data set contains 14 variables, but we will use only six of them here: the content of nitrogen, phosphorus, calcium, aluminium, iron, and manganese in the soil (Figure 3.16).
> # Analyzing clustering results against environmental variables:
> # --------------------------------------------------------------
> data(varechem);data(varespec)
> spec = decostand(varespec,'hellinger')
> env = varechem[,c('N','P','Ca','Al','Fe','Mn')]
> groups = factor(kmeans(spec,3)$cluster)
> par(mfrow=c(2,3),mar=c(3,3,1,1))
> for(ii in 1:6){
+   boxplot(env[,ii]~groups,main=names(env)[ii],col=2:5)
+ }

A global test of the log-transformed environmental variables (MANOVA) indicates that the groups do differ in their environmental conditions (log-transformed data are used for two reasons: (1) variance tends to increase with the mean and (2) a unit change in concentration is likely to be more important under low concentrations than under high concentrations):
> m = manova(as.matrix(log(env))~groups)
> summary(m)
          Df Pillai approx F num Df den Df    Pr(>F)
groups     2 1.3075   5.3495     12     34 5.415e-05 ***
Residuals 21
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary statistics for the univariate tests can be accessed via summary.aov of the model object (here m). Doing this shows (as should already be clear from Figure 3.16) that the best descriptors of group differences are Al and Fe.
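For instance (a minimal illustration; output omitted):

> summary.aov(m)  # one ANOVA table per response variable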
Figure 3.16: Boxplots of the six environmental variables. Boxes represent the three groups based on the k-means partitioning.

Note that we did not evaluate the assumptions of parametric testing here. The assumption of normality is not usually an issue, for two reasons: 1) linear
models and ANOVA are considered to be fairly robust against this assumption, and 2) the statistical test can always be performed using random permutation. The more problematic assumption is that the residual variance is homogeneous between groups (or across fitted values for continuous predictors). Here this is likely to be a problem, at least for the response variable Fe (Figure 3.16). We are going to use the function gls (for Generalized Least Squares) in package nlme for further examination.
> require(nlme)
> m.Fe = gls(log(Fe)~groups,data=env,method='ML')
> plot(m.Fe)

As the plot of standardized residuals against fitted values indicates heterogeneous variance between groups (Figure 3.17), statistical tests of between-group differences are likely to be biased. The problem can be addressed, e.g., by accounting for this heterogeneity. In the function gls this can be done by specifying weights to describe the within-group variance structure. In this case we let the error variance differ between groups.
> m.Fe.2 = gls(log(Fe)~groups,data=env,weights=varIdent(form=~1|groups),method='ML')

The two models can be compared with the anova function to evaluate the importance of heteroscedasticity for inference:
> anova(m.Fe,m.Fe.2)
       Model df      AIC      BIC    logLik   Test  L.Ratio p-value
m.Fe       1  4 64.36398 69.07620 -28.18199
m.Fe.2     2  6 62.59623 69.66455 -25.29811 1 vs 2 5.767754  0.0559

This test indicates that the model fit is only marginally improved by accounting for the heterogeneity in the residual variance.

Figure 3.17: Standardized model residuals versus fitted values. Notice that the residual variance differs between groups even after log-transforming the response variable.
One way to visualize the relationship between a grouping and a continuous predictor is to overlay the predictor on an ordination plot, using either bubbles or smoothed surfaces (see, e.g., Oksanen, 2013). For example, let's consider how the concentrations of iron, aluminium, and manganese are distributed across the sites:
> # Visualizing environmental variables in ordination:
> # --------------------------------------------------
> par(mfrow=c(1,3),mar=c(4,4,1,1),pty='s')
> pcoa = cmdscale(dist(spec))
> plot(pcoa,type='p',cex=log(env$Fe),pch=16,col='gray50',xlab='Dim 1',
+ ylab='Dim 2',xlim=c(-.3,.5),ylim=c(-.15,.1),main='Iron')
> ordihull(pcoa,kmeans(spec,3)$cluster,col=2)
> plot(pcoa,type='p',cex=log(env$Al),pch=16,col='gray50',xlab='Dim 1',
+ ylab='Dim 2',xlim=c(-.3,.5),ylim=c(-.15,.1),main='Aluminium')
> ordihull(pcoa,kmeans(spec,3)$cluster,col=4)
> plot(pcoa,type='p',cex=log(env$Mn),pch=16,col='gray50',xlab='Dim 1',
+ ylab='Dim 2',xlim=c(-.3,.5),ylim=c(-.15,.1),main='Manganese')
> ordihull(pcoa,kmeans(spec,3)$cluster,col=3)

3.5.2 Categorical predictors

When both the response and the predictor variables are categorical, one is concerned with the analysis of contingency tables. That is, the response data are the counts of observations belonging to each combination of categories. The table of counts can then be analyzed using Poisson regression (a generalized linear model with Poisson-distributed errors). Here we can again use the dune data set as an example.

Figure 3.18: The (log) concentration of iron, aluminium, and manganese illustrated in PCoA ordination diagrams.
> # Analyzing clustering result with categorical predictors:
> # ---------------------------------------------------------
> data(dune.env)
> env = dune.env[,c('Management','Use')]
> groups = factor(kmeans(X.Hel,3)$cluster)
> # Generate a table of counts between group and predictor
> # level pairs, including zeros:
> tmp = table(paste(groups,env$Manage,sep='.'))
> counts = numeric(3*4)
> f1 = factor(c(1,1,1,1,2,2,2,2,3,3,3,3))
> f2 = factor(rep(levels(env$Manage),3))
> names(counts) = paste(f1,f2,sep='.')
> counts[c(1,2,4,6:8,9,11)]=tmp
> # Analyze the contingency table with a Poisson glm; here the interest
> # is in the interaction between groups and the predictor (are the
> # levels of these two significantly associated?):
> m = glm(counts~f1*f2,family='poisson')
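As a hedged alternative sketch (not from the original text), the same zero-inclusive contingency table can be built directly with table(), which avoids filling the counts vector by hand:

> # Cross-tabulate groups and management, keeping empty cells:
> tab = as.data.frame(table(groups=groups,Management=env$Management))
> m2 = glm(Freq~groups*Management,family='poisson',data=tab)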

The analysis of contingency tables should start from the saturated model, proceeding by sequentially dropping non-significant terms. This can be done using the drop1 function with test = 'Chi', which effectively performs a likelihood ratio test between the models that do and do not contain a specific term.
> drop1(m,test='Chi')
Single term deletions

Model:
counts ~ f1 * f2
       Df Deviance    AIC    LRT Pr(>Chi)
<none>       0.000 45.847
f1:f2   6   18.555 52.401 18.555 0.004986 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This test indicates that there is a significant interaction between the grouping (acquired from k-means clustering) and the management of the sites. A somewhat simpler approach is to use the Chi-square test (chisq.test) to compare the groups and a categorical variable:
> chisq.test(groups,dune.env$Manage,simulate.p.value=T)

	Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  groups and dune.env$Manage
X-squared = 12.8222, df = NA, p-value = 0.04298

4 Simple (or unconstrained) Ordinations

4.1 Overview

Simple ordinations (or unconstrained ordinations) are designed to find general trends in a multivariate dataset. In a community ecology context, these methods are particularly well adapted to model species distributions, which generally follow gradients.
More generally, simple ordinations are designed to make sense of data where many variables were sampled at multiple sampling units. An example of an ordination is represented in Figure 4.1, where two variables characterize seven sampling units. In a community ecology context, this could be species sampled at multiple sites, whereas in a population genetics setting the variables measured could be gene frequencies for individuals of a particular population.
        Var 1   Var 2
SU 1      1       4
SU 2      2       3
SU 3      3       5
SU 4      5       1
SU 5      3       4
SU 6      6       3
SU 7      2       4

Figure 4.1: Ordination of seven samples in the space of two variables.
The first ever application of an ordination in an ecological context was presented by Goodall (1954) to study variation in plant communities. However, the first ordination technique, principal component analysis (PCA), was developed more than half a century before by Pearson (1901). Since the middle of the 20th century, a few different ordination techniques have been proposed by statisticians and ecologists; a subset of these have been adopted by ecologists and are still commonly used today.
In this lecture, we will discuss some of these techniques, their properties, and how and when they should be used. In each case, examples based on real data will be used to illustrate the properties of the methods discussed.
More specifically, we will discuss principal component analysis (PCA), correspondence analysis (CA), principal coordinate analysis (PCoA) and non-metric multidimensional scaling (NMDS).

4.2 Principal Component Analysis

Let's imagine a data matrix where n sampling units are characterized by p variables. Graphically, the sampling units can be represented by a group of points in a p-dimensional space. These points are generally not distributed in a perfect sphere across all dimensions; the cluster of points can be elongated in one or a few directions, and flattened in others. In addition, the directions in which the points are spread are not necessarily aligned with a single dimension (i.e. with a single variable) of the multidimensional space. The direction in which the set of points is most elongated represents the direction of the largest variance of the set of points.

Figure 4.2: PCA rotation of the 7 samples presented in Figure 4.1.


In essence, PCA performs a rigid rotation of the original system of axes in
such a way that the new axes (also called principal components) are orthogonal
to one another. The axes correspond to the successive dimensions of maximum
variance of the set of points. The principal component gives the position of the
sampling units in the new set of coordinates after the rotation (Figure 4.2).
Because each principal component is a linear combination of the original variables, the axes of a PCA can be interpreted by studying which variables

46

contribute most to the first few principal components. It is also possible to represent the variables on the PCA diagram together with the sample points. However, it is important to note that when one is interested in the relationships among variables, another type of projection is preferable (this will be discussed later).
Each principal component is built on an eigenvector with which an eigenvalue λi is associated. This eigenvalue defines the amount of variance represented by the principal component. The eigenvalues are always presented in decreasing order; that is, the first axis presents the most important part of the variance in the data, the second axis carries less information than the first axis but more than the others, and so on. The number of principal components in a PCA equals the number of variables in the original data set.
When performing a PCA on a matrix Y, it is usual to present the total variance of the data as a reference to evaluate the proportion of variance represented by each principal component. The total variance of a matrix can be calculated as follows:
$$\mathrm{Var}(\mathbf{Y}) = \frac{1}{n-1} \sum_{j=1}^{p} \sum_{i=1}^{n} \left( y_{ij} - \bar{y}_j \right)^2 \qquad (4.1)$$

However, within the PCA framework, the total variance of Y can also be calculated as the sum of all the eigenvalues:

$$\mathrm{Var}(\mathbf{Y}) = \sum_{i=1}^{p} \lambda_i \qquad (4.2)$$

This property of PCA makes it possible to calculate how much variance is described by each axis. For example, the proportion of variance represented by the first axis can be calculated as follows:

$$\frac{\lambda_1}{\mathrm{Var}(\mathbf{Y})} \qquad (4.3)$$

When calculating a PCA, it is important to know that it can be performed, and its results presented, in different ways. In its basic form, PCA (1) is computed on the raw variables (after each one has been centred, but with no additional transformation applied to the variables) and (2) preserves the Euclidean distance among objects. Later on we will discuss how it is possible to get around these properties.
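As a quick numerical check of Eqs. 4.2 and 4.3 (a sketch, not from the original text; the random matrix is only for illustration):

> # The sum of the PCA eigenvalues equals the total variance of the data:
> set.seed(1)
> Y = matrix(rnorm(100),nrow=20,ncol=5)
> pca = prcomp(Y,center=TRUE,scale.=FALSE)
> sum(pca$sdev^2)                  # sum of the eigenvalues (Eq. 4.2)
> sum(apply(Y,2,var))              # total variance of Y (Eq. 4.1)
> pca$sdev[1]^2/sum(pca$sdev^2)    # proportion of variance on axis 1 (Eq. 4.3)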
4.2.1 Correlation or Covariance?

In a PCA, the association measure used to compare all pairs of variables is either the covariance or the correlation. Both of these association measures are linear. It is important to decide which of these two should be used when computing a PCA. The reason this decision matters is the Euclidean property of PCA: the Euclidean distance is very sensitive to the scales of the variables. For this reason, performing a PCA on the raw (that is, only centred) variables (yielding a PCA on a covariance matrix) is only valid when the variables have the same dimensions.

Otherwise, it is recommended to remove the effect of the differences in scale among the variables. This can be done by performing a PCA on the correlation matrix, because a correlation is a covariance calculated on standardized variables.

Figure 4.3: PCA diagram of the data of Figures 4.1 and 4.2, with projection of the original variables. There were only two variables in the data, thus there are only two PCA axes (80.53% and 19.47% of the variance). Scaling type 1.
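Returning to the choice between covariance and correlation, a minimal sketch of the two options in vegan is shown below (not from the original text; varechem is used only because its variables have clearly different units):

> # PCA on a covariance matrix (variables only centred) versus
> # PCA on a correlation matrix (variables standardized):
> library(vegan)
> data(varechem)
> pca.cov = rda(varechem,scale=FALSE)
> pca.cor = rda(varechem,scale=TRUE)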
4.2.2 Scaling

When performing a PCA, both the samples and the variables can be represented on the same diagram, called a biplot. There are two types of biplots that can be used to represent the result of a PCA; each one has particular properties.
With this in mind, if the main interest of the analysis is to interpret the relationships among samples, a distance biplot (scaling 1) should be used. If the interest is to study the relationships among variables, a correlation biplot (scaling 2) should be used.

Table 4.1: Properties of the distance biplot (scaling 1) and the correlation biplot (scaling 2) in PCA.

Distance among samples in biplot
  - Scaling 1: approximation of the Euclidean distance in multidimensional space.
  - Scaling 2: meaningless.

Projection of samples on variables
  - Scalings 1 and 2: projecting a sample at right angle on a variable approximates its position on the variable.

Length of variables
  - Scaling 1: 1 in full dimensional space; indicates the contribution of a variable in the reduced space.
  - Scaling 2, covariance matrix: the standard deviation of the variable in full dimensional space; the length of the descriptor in the reduced space approximates its standard deviation.
  - Scaling 2, correlation matrix: 1 in full dimensional space; indicates the contribution of a variable in the reduced space.

Angles among variables
  - Scaling 1: meaningless.
  - Scaling 2: reflect the covariance or correlation among the variables.

4.2.3 Equilibrium contribution circle

In all but one of the options discussed above (i.e., all except a PCA performed on a covariance matrix and plotted using a correlation biplot [scaling 2]), a circle representing the equilibrium contribution of the variables can be drawn on the plane defined by two principal components. The equilibrium contribution is the length a variable (a vector in the biplot) would have if it contributed equally to all the axes of the PCA. Variables whose vectors lie within the equilibrium contribution circle contribute little to the given reduced space (i.e., the plane described by the first and second axes). Conversely, variables whose vectors extend beyond the radius of the equilibrium contribution circle contribute more to the reduced space. Figure 4.4 shows an example of a PCA calculated on the dune data where an equilibrium circle is drawn. In a distance biplot (scaling 1), the radius of the equilibrium circle is calculated as follows:
$$\sqrt{\frac{\text{Dimensions of the reduced space (usually 2)}}{\text{Total number of variables}}} \qquad (4.4)$$

For a PCA performed on a covariance matrix and plotted using a correlation biplot (scaling 2), the equilibrium contribution can only be computed independently for each variable:

$$\text{Standard deviation of the variable} \times \sqrt{\frac{\text{Dimensions of the reduced space}}{\text{Total number of variables}}} \qquad (4.5)$$

> # Figure 4.4:
> # -----------
> library(vegan)
> ### Load the dune data
> data(dune)
> ### Hellinger transformation on dune
> duneHell<-decostand(dune,method="hellinger")
> ### Perform PCA on the correlation matrix of the Hellinger-transformed dune data
> PCABase<-rda(duneHell,scale=TRUE)
> ### Extract species information
> PCAsp<-scores(PCABase,choices=1:2,display="species",scaling=2)
> #==============
> ### Plot graphs
> #==============
> ### Plot basis
> par(mar=c(3,3,0.5,0.5),pty='s',mgp=c(2,.8,0))
> labels<-paste("PCA Axis",1:2," - ",round(eigenvals(PCABase)[1:2]
+ /sum(eigenvals(PCABase)),4)*100,"%",sep="")
> plot(PCAsp, asp=1,xlim=c(-1,1),ylim=c(-1,1),type="n",
+ xlab=labels[1],ylab=labels[2])
> abline(h=0,lty=2)

Figure 4.4: PCA on a correlation matrix of Hellinger-transformed species using the dune data. Scaling type 2. Axes 1 and 2. Circle of equilibrium contribution (blue). Circle of radius 1 (green): the maximum length possible for a vector in a PCA on a correlation matrix.
> abline(v=0,lty=2)
> ### Equilibrium circle
> symbols(0,0,circles=1,inches=FALSE,fg="darkgreen",add=TRUE,lwd=2)
> symbols(0,0,circles=sqrt(2/20),inches=FALSE,fg="blue",add=TRUE,lwd=2)
> arrows(0,0,PCAsp[,1],PCAsp[,2],length=0.1,lwd=2,angle=30,col="red")
> pos1<-which(PCAsp[,2] < 0)
> pos3<-which(PCAsp[,2] > 0)
> text(PCAsp[pos1,1],PCAsp[pos1,2],labels=rownames(PCAsp)[pos1],pos=1,cex=0.65,col="red")
> text(PCAsp[pos3,1],PCAsp[pos3,2],labels=rownames(PCAsp)[pos3],pos=3,cex=0.65,col="red")

4.2.4 Number of axes to interpret

PCA is not a statistical test. The goal of PCA is to represent the major features of a data matrix on a reduced number of axes, which is why the expression "ordination in reduced space" is often used to describe it. Generally, one studies the eigenvalues (i.e., the amount of variance represented by each axis) and decides how many axes are worth presenting. The decision can be arbitrary (e.g., only the axes that together represent 75% of the variance are considered).
However, procedures have been proposed to distinguish the axes that represent interesting and valuable features of the data from the axes that display random variance. One can calculate the average of the eigenvalues and interpret only the axes associated with eigenvalues larger than that average. Another idea is to compute a so-called broken stick model, which divides a stick of unit length into as many pieces as there are axes in the PCA. The pieces are then ordered from longest to shortest and compared to the eigenvalues. One interprets only the axes whose eigenvalues are larger than the length of the corresponding piece. Figure 4.5 illustrates these two techniques for assessing the number of axes to interpret in a PCA.
> # Figure 4.5:
> # -----------
> ### Extract eigenvalues
> eigPCA<-as.vector(eigenvals(PCABase))
> eigPCAmean<-mean(eigPCA)
> ### Calculate the percentage of variation represented by each eigenvalue
> eigPCAVar<-eigPCA/sum(eigPCA)*100
> ### Construct broken stick model
> brokenStick<-bstick(length(eigPCA),tot.var=100)
> ### Combine eigenvalues and broken-stick model result
> PCAVar<-rbind(eigPCAVar,brokenStick)
> #==============
> ### Plot graphs
> #==============
> ### Plot basis
> par(mfrow=c(2,1),mar=c(4,5,0.5,0.5))
> barplot(eigPCA,ylab="Eigenvalue",cex.axis=0.9)
> abline(h=eigPCAmean,col="red",lwd=3)
> legend("topright","Average eigenvalue",lwd=3,col="red",cex=0.8,bty="n")
> barplot(PCAVar,beside=TRUE,col=c("lightblue","orange"),
+ names.arg=paste("Axis",1:19),las=3,ylab="Variance (%)",cex.axis=0.9)
> legend("topright",c("PCA Axis","Broken-stick model"),
+ fill=c("lightblue","orange"),cex=0.8,bty="n")

4.2.5 Pre-transformation of species data

Traditionally, PCA has been very useful for the ordination of matrices of environmental data. However, because it is a linear method with the Euclidean distance as the underlying distance, it is not adapted to ordinate raw species abundance data, mainly because of the double-zero problem (i.e., zeros are treated as any other value in the data). However, Legendre & Gallagher (2001) found a way to overcome this problem. What they propose is to pre-transform the species data so that, after carrying out a PCA, the distance preserved among objects is no longer the Euclidean distance, but an ecologically meaningful one; a distance that does not account for the double-zeros in the computation of the resemblances between objects. The transformations proposed by Legendre & Gallagher (2001) and their associated distance coefficients are presented in Table 4.2.

Figure 4.5: Barplots to help decide how many PCA axes to interpret: eigenvalues compared to the average eigenvalue (top), and percentages of variance compared to the broken-stick model (bottom). The two results presented here were constructed using the dune data.

Note that these pre-transformations can be used with many linear analytical methods such as PCA, RDA, and k-means clustering.
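As a sketch (not from the original text), the transformations of Table 4.2 are available through vegan's decostand(); the method names below follow that function's documentation.

> # Pre-transformations of species data (Legendre & Gallagher 2001):
> library(vegan); data(dune)
> dune.chord = decostand(dune,method="normalize")   # chord
> dune.chi   = decostand(dune,method="chi.square")  # chi-square
> dune.prof  = decostand(dune,method="total")       # species profiles
> dune.hel   = decostand(dune,method="hellinger")   # Hellinger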

4.3 Correspondence analysis (CA)

Correspondence analysis is essentially a PCA on species data that have been transformed into a Pearson χ² statistic: the raw data are first transformed into profiles of conditional probabilities weighted by the row and column sums, and a PCA is then performed on the resulting table. The result is an ordination where the χ² distance, instead of the Euclidean distance, is preserved among samples. The advantage of performing a CA is that the underlying χ² distance does not consider double zeros; therefore, CA is better adapted to the analysis of species abundance data. Note that the data submitted to a CA must be dimensionally homogeneous and equal to or larger than 0. This makes CA useful for analysing species counts or presence-absence data.
For technical reasons that will not be developed here, CA produces one axis less than the minimum of the number of samples and the number of species. As is the case for PCA, the orthogonal ordination axes are organized from the one representing the most variance to the one representing the least. However, in CA the variation is usually measured as a quantity called the total inertia. In CA, individual eigenvalues are always smaller than 1. To know the amount of variation represented on an axis, one divides the eigenvalue of this axis by the total inertia of the species data matrix.

Table 4.2: Pre-transformations of species abundance data and their associated distances.

Chord:
  distance $D(x_1,x_2)=\sqrt{\sum_{j=1}^{p}\left(\frac{y_{1j}}{\sqrt{\sum_{j=1}^{p} y_{1j}^2}}-\frac{y_{2j}}{\sqrt{\sum_{j=1}^{p} y_{2j}^2}}\right)^2}$;
  transformation $y'_{ij}=\frac{y_{ij}}{\sqrt{\sum_{j=1}^{p} y_{ij}^2}}$

Chi-square ($\chi^2$):
  distance $D(x_1,x_2)=\sqrt{\sum_{j=1}^{p}\frac{1}{y_{+j}/y_{++}}\left(\frac{y_{1j}}{y_{1+}}-\frac{y_{2j}}{y_{2+}}\right)^2}$;
  transformation $y'_{ij}=\sqrt{y_{++}}\,\frac{y_{ij}}{y_{i+}\sqrt{y_{+j}}}$

Species profiles:
  distance $D(x_1,x_2)=\sqrt{\sum_{j=1}^{p}\left(\frac{y_{1j}}{y_{1+}}-\frac{y_{2j}}{y_{2+}}\right)^2}$;
  transformation $y'_{ij}=\frac{y_{ij}}{y_{i+}}$

Hellinger:
  distance $D(x_1,x_2)=\sqrt{\sum_{j=1}^{p}\left(\sqrt{\frac{y_{1j}}{y_{1+}}}-\sqrt{\frac{y_{2j}}{y_{2+}}}\right)^2}$;
  transformation $y'_{ij}=\sqrt{\frac{y_{ij}}{y_{i+}}}$

Here $y'_{ij}$ refers to the transformed value of sample $i$ for species $j$, $y_{i+}$ is the sum of abundances of all species sampled at $i$, $y_{+j}$ is the sum of abundances of species $j$ across all samples, and $y_{++}$ is the sum of all abundances in the whole data matrix.

4.3.1 Scaling

Unlike in PCA, in CA both the samples and the variables are usually displayed as points on the same biplot (also known as a joint plot in CA). In CA there are three scalings, two of which, the distance scaling (scaling 1) and the correlation scaling (scaling 2), are commonly used in ecology. Scaling 3 (also known as Hill scaling) is less commonly used in ecology, but it has its merits. Table 4.3 describes the properties of each scaling and how the results from each scaling should be interpreted.
An example of a CA is presented in Figure 4.6. The data used for this illustration are given in Table 4.4.
> # Figure 4.6:
> # -----------
> library(vegan)
> ### Three fictitious species sampled at 7 sites
> sp1<-c(1,2,3,5,3,6,2)
> sp2<-c(4,3,5,1,4,3,4)
> sp3<-c(1,0,2,1,0,3,5)
> sp<-cbind(sp1,sp2,sp3)
> rownames(sp)<-paste("SU",1:7)
> ### Perform a CA on the artificial data
> CABase<-cca(sp)

Table 4.3: Properties of the scalings in correspondence analysis and how to interpret their results.

When to use
  - Scaling 1: when the interest lies in the ordination of the samples.
  - Scaling 2: when the interest lies in the ordination of the variables (species).
  - Scaling 3: when both samples and variables are important.

Properties
  - Scaling 1: rows are the centroids of the columns.
  - Scaling 2: columns are the centroids of the rows.
  - Scaling 3: rows and columns are both centred.

Distance among sample points*
  - Scaling 1: approximation of the χ² distance in multidimensional space.
  - Scaling 2: meaningless.
  - Scaling 3: approximation of the χ² distance in multidimensional space.

Distance among variable points
  - Scaling 1: meaningless.
  - Scaling 2: approximation of the χ² distance in multidimensional space.
  - Scaling 3: approximation of the χ² distance in multidimensional space.

*For scalings 1 and 3: (1) sample points close to one another are generally similar in their species frequencies; (2) for abundance data, a sample point close to a species point is likely to have a high contribution of that species; (3) for presence-absence data, the probability of an occurrence is higher for sample points closer to a species point.
For scalings 2 and 3: (1) a variable point found close to a sample point is likely to have a higher frequency in this sample than in samples further away; (2) variable points close to one another are likely to have relatively similar relative frequencies across samples.

Table 4.4: Artificial data for CA.

        Sp 1   Sp 2   Sp 3
SU 1      1      4      1
SU 2      2      3      0
SU 3      3      5      2
SU 4      5      1      1
SU 5      3      4      0
SU 6      6      3      3
SU 7      2      4      5

Figure 4.6: First and second axes of the CA of the data shown in Table 4.4. (a) Scaling 1. (b) Scaling 2. (c) Scaling 3. Axis 1 represents 53.09% of the inertia (eigenvalue 0.134) and axis 2 represents 46.91% (eigenvalue 0.119).
> eig<-eigenvals(CABase)
> explAxis<-round(eig/sum(eig),4)*100
> labels<-paste("CA Axis ",1:2," - ",explAxis[1:2],"% (",round(eig[1:2],3),")",sep="")
> par(mar=c(4,5,0.5,0.5))
> layout(matrix(c(1,1,2,2,0,3,3,0),byrow=TRUE,nrow=2,ncol=4))
> plot(CABase,scaling=1,xlab=labels[1],ylab=labels[2])
> leg<-legend("topright","(a)",bty="n")
> plot(CABase,scaling=2,xlab=labels[1],ylab=labels[2])
> leg<-legend("topright","(b)",bty="n")
> plot(CABase,scaling=3,xlab=labels[1],ylab=labels[2])
> leg<-legend("topright","(c)",bty="n")

4.3.2 Word of caution

Correspondence analysis was first described to analyse contingency tables. Therefore, it tends to overemphasise extreme values and, as an ordination method, it is very sensitive to rare species, which tend to be located at extreme positions in the ordination diagram. When using CA, it may therefore be advisable to eliminate the rarest species from the data table.

Arch and horseshoe effects. Long environmental gradients often support a succession of species (Figure 4.7). Since species that are controlled by environmental factors tend to have unimodal distributions, a long gradient may encompass sites that, at the two ends of the gradient, have no species in common; thus, their distance reaches a maximum value (or their similarity is 0). But at either end of the succession, the sites still represent a continuation of the ecological succession, so contiguous sites continue to grow more different from each other. Therefore, instead of a linear trend, the gradient is represented on a pair of CA axes as an arch (Figure 4.8). Several detrending techniques have been proposed to counter this effect, leading to detrended correspondence analysis (DCA), which we will discuss in the following section.
Note that the arch-like pattern is even stronger in PCA. There the extreme sites tend to be actually closer to one another as the number of non-overlapping species increases, because the double zeros involved are considered in the Euclidean space as a resemblance between the sites. Thus, the extreme sites become closer as the number of double zeros increases. One can clearly see that this is ecological nonsense. This pattern is called the horseshoe effect (Figure 4.8), because the extremities of the arch bend inwards.
> # Figure 4.7:
> # -----------
> comm<-matrix(0,nrow=140,ncol=20)
> for(i in 1:20){
+   comm[,i]<-dnorm(1:140,i*5,sd=5)*250
+ }
> comm<-comm[-(101:140),]
> par(mar=c(4,5,0.5,0.5))
> plot(comm[,1],type="l",ylab="Abundance",xlab="Sampling units",lwd=3,col="red")
> couleur<-rainbow(19)
> for(i in 2:20){
+   lines(comm[,i],col=couleur[i],lwd=3)
+ }

Figure 4.7: Succession of species along an ideal gradient (species packing model).

> par(mar=c(4,5,0.5,0.5),mfrow=c(1,2))
> CA<-cca(comm)
> PCA<-rda(comm)
> CAsites<-scores(CA,display="sites",choices=1:2,scaling=1)
> PCAsites<-scores(PCA,display="sites",choices=1:2,scaling=1)
> plot(CAsites,xlab="CA Axis 1",ylab="CA Axis 2",cex.lab=1.5,pch=19,col="blue")
> abline(h=0,lty=2)
> abline(v=0,lty=2)
> legend("top","(a)",cex=2,bty="n")
> plot(PCAsites,xlab="PCA Axis 1",ylab="PCA Axis 2",cex.lab=1.5,pch=19,col="blue")
> abline(h=0,lty=2)
> abline(v=0,lty=2)
> legend("top","(b)",cex=2,bty="n")

Figure 4.8: (a) CA, describing the arch effect, and (b) PCA, describing the horseshoe effect, on the data of Figure 4.7. Scaling 1 for both CA and PCA.

Detrended correspondence analysis (DCA)

Detrending by segments. In this approach, the first axis is divided into a number of segments and, within each segment, the mean of the object scores along the second axis is forced to zero. This method has been strongly rejected by many authors; the scores on the second axis are essentially meaningless.
Detrending by polynomials. Another line of reasoning about the origin of the arch effect leads to the observation that, when an arch occurs, the second axis can be seen as quadratically related to the first (i.e., it is a second-order polynomial of the first). This accounts for the parabolic shape of the scatter of points. Hence, a solution is to make the second axis not only linearly but also quadratically independent of the first. Although intuitively attractive, this method of detrending has to be applied with caution because it imposes a more constraining model on the data.

4.4 Principal coordinate analysis (PCoA)

PCA as well as CA (at least in their classic forms) impose the distance preserved among samples: the Euclidean distance for PCA and the χ² distance for CA (remember, however, that one can modify this to some extent by pre-transforming the data (see Table 4.2) before carrying out a PCA). But if one would like to ordinate samples on the basis of yet another distance measure, more appropriate to the problem at hand, then PCoA is the method to apply. It allows one to obtain a Euclidean representation of a set of samples whose relationships are measured by any similarity or distance coefficient chosen by the user (Figure 4.9).
> par(mar=c(3,3,0.5,0.5),pty='s',mgp=c(2,.8,0))
> PCoA<-cmdscale(vegdist(sp,"bray"),eig=TRUE)
> PCoAEig<-PCoA$eig
> PCoAAxis<-(PCoAEig[1:2]+abs(min(PCoAEig)))/
+ (sum(PCoAEig)+((nrow(sp)-1)*abs(min(PCoAEig))))
> PCoAAxisPres<-round(PCoAAxis,4)*100
> labels<-paste("PCoA Axis ",1:2," - ",PCoAAxisPres,"% (",round(PCoAEig,3),")",sep="")
> plot(PCoA$points[,1:2],xlab=labels[1],ylab=labels[2],
+ ylim=range(c(PCoA$points[,2],0.24)),cex.lab=1.5,pch=19,cex=2)
> text(PCoA$points[,1:2],labels=1:7,cex=1.25,pos=3)
> abline(h=0,lty=2)
> abline(v=0,lty=2)

Like PCA and CA, PCoA produces a set of orthogonal axes whose importance is measured by eigenvalues. When negative eigenvalues are obtained in a PCoA, the following correction is needed to compute the proportion of variance represented by the first m axes:

$$\frac{\sum_{k=1}^{m} \lambda_k + m\,\lvert \min(\boldsymbol{\lambda}) \rvert}{\sum_{k=1}^{n} \lambda_k + (n-1)\,\lvert \min(\boldsymbol{\lambda}) \rvert} \qquad (4.6)$$

where $\lambda_k$ is the $k$th eigenvalue calculated in the PCoA, $\boldsymbol{\lambda}$ is a vector containing all the eigenvalues calculated in the PCoA, $m$ is the dimensionality of the reduced space (2 most of the time), and $n$ is the number of samples in the community matrix.
Since a PCoA is based on an association matrix, in its classical usage it can only represent the relationships among samples (if the association matrix defines relationships among samples) or among variables (if the association matrix defines relationships among variables), but not both at the same time (Figure 4.10). However, when a PCoA is carried out on an association matrix describing the relationships among samples, it is possible to reproject the variables defining the samples and construct a joint plot similar to the one in Figure 4.2.
> par(mar=c(4,5,0.5,0.5))
> plot(PCoA$points[,1:2],xlab=labels[1],ylab=labels[2],cex.lab=1.5,type="n")
> abline(h=0,lty=2)
> abline(v=0,lty=2)
> ### Plot samples
> text(PCoA$points[,1:2],labels=1:7,cex=1.25)
> ### Plot species
> spWa<-wascores(PCoA$points,sp)
> text(spWa[,1:2],labels=colnames(sp),cex=1.25,col="red")
Figure 4.9: First and second axes of a PCoA carried out using the percentage difference distance on the data of Table 4.4. The first two axes have eigenvalues of 0.224 and 0.107; they represent 53.1% and 26.5% of the variance, respectively. This PCoA gives four positive, one zero and two negative eigenvalues.

In the case of Euclidean association measures, PCoA will behave in a Euclidean manner. For instance, computing a Euclidean distance among sites and running a PCoA will yield the same results as running a PCA on a covariance matrix with scaling 1 on the same data. But if the association coefficient used is non-metric, semi-metric, or otherwise non-Euclidean, then PCoA will react by producing several negative eigenvalues in addition to the positive ones (and a null one in between). The negative eigenvalues can be seen as the representation of the non-Euclidean part of the structure of the association matrix, which is, of course, not representable on real ordination axes. In most cases this does not affect the representation of the samples on the first few principal axes, but in several applications it can lead to problems. There are technical solutions to this problem (e.g. the Lingoes and Cailliez corrections), but they are not always recommendable, and they go beyond the scope of this lecture.

Figure 4.10: First and second axes of a PCoA as in Figure 4.9, where the species have been added a posteriori. The ordination axes of a PCoA can be interpreted like those of a CA using scaling 1.

4.5 Non-metric multidimensional scaling (NMDS or MDS)

If the user's priority is not to preserve the exact distances among samples, but rather to represent as well as possible the ordering relationships among samples in a small and specified number of axes, then NMDS may be the solution. Like PCoA, NMDS is not limited to Euclidean distance matrices; it can produce ordinations of samples from any distance matrix. The method can also proceed with missing distance estimates, as long as there are enough measures left to position an object with respect to a few others.
NMDS is not an eigenvalue technique, and it does not maximise the variability associated with individual axes of the ordination. As a result, plots may arbitrarily be rotated, centred, or inverted. The procedure goes as follows (very schematically; for details see Borcard et al. 2011, section 5.6):
Step 1 Specify the number m of axes (dimensions) desired.
Step 2 Construct an initial configuration of the objects in the m dimensions, to be used as a starting point of an iterative adjustment process. This is a tricky step, since the end result may depend on the starting configuration.
Step 3 An iterative procedure seeks to position the objects in the desired number of dimensions in such a way as to minimize a stress function (scaled from 0 to 1), which measures how far the reduced-space configuration is from being monotonic to the original distances in the association matrix.
Step 4 The adjustment goes on until the stress value can no longer be lowered, or until it attains a predefined value (tolerated lack-of-fit).
Step 5 Most NMDS programs rotate the final solution using PCA, for easier interpretation.
For a given and small number of axes (e.g. 2 or 3), NMDS often achieves a less deformed representation of the relationships among objects than a PCoA can show on the same number of axes. But NMDS remains a computer-intensive solution, exposed to the risk of suboptimal solutions in the iterative process (because the objective function to minimize may reach a local minimum).
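A minimal sketch in vegan (not from the original text): metaMDS() wraps the iterative search, repeats it from random starts, and rotates the final solution with PCA.

> library(vegan); data(dune)
> nmds<-metaMDS(dune,distance="bray",k=2,trymax=50)
> nmds$stress            # final stress value
> plot(nmds,type="t")    # ordination of sites and species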


5 Canonical (constrained) ordinations

5.1 Redundancy analysis (RDA) and canonical correspondence analysis (CCA)

The simple (unconstrained) ordination methods presented previously are meant


to represent the variation of a data matrix in a reduced number of dimensions.
Interpretation of the structures is done a posteriori, hence the expression indirect gradient analysis used for this approach. For instance, one can interpret
the CA ordination axes (one at a time), by regressing the object scores on one
or several environmental variables. The ordination procedure itself has not been
influenced by these external variables, which become involved only after the computation. One lets the data matrix express itself without constraint. This is an
exploratory, descriptive approach.
Constrained ordination, on the contrary, explicitly puts two matrices into relationship: one matrix of response variables (e.g. a community matrix) and one matrix of explanatory variables. Both are involved at the ordination stage. This approach is called direct gradient analysis, and it integrates the techniques of ordination and multiple regression (Table 5.1).
Table 5.1: Relationship between ordination and regression.

Data to explain    Explanatory variables    Analysis
1 variable         1 variable               Simple regression
1 variable         Many variables           Multiple regression
Many variables     No variable              Simple ordination
Many variables     Many variables           Canonical ordination

There are two commonly used canonical ordinations: redundancy analysis (RDA) and canonical correspondence analysis (CCA). In RDA and CCA, the ordination process is directly influenced by a set of explanatory variables: the ordination seeks the axes that are best explained by a linear combination of the explanatory variables. In other words, these methods seek the combinations of explanatory variables that best explain the variation of the response matrix. It is therefore a constrained ordination process. The difference with an unconstrained ordination is important: the matrix of explanatory variables conditions the weights (eigenvalues), the orthogonality and the directions of the ordination axes. Here one can say that the axes explain (in the statistical sense) the variation of the response matrix.
A constrained ordination produces as many canonical axes as there are explanatory variables, but each of these axes is a linear combination (a multiple regression model) of all the explanatory variables. Examination of the canonical coefficients (i.e., the regression coefficients of the models) of the explanatory variables on each axis allows one to know which variable(s) are most important in explaining the first, second, third, ... axis.
The variation of the response matrix that cannot be explained by the explanatory variables is expressed on a series of unconstrained axes following the canonical ones.
Because in many cases the explanatory variables are not dimensionally homogeneous, canonical ordinations are usually carried out with standardized explanatory variables. In RDA, this does not affect the choice between running the analysis on a covariance or a correlation matrix, since that choice relates to the response (y) variables.
Depending on the algorithm used, the search for the optimal linear combinations of explanatory variables that represent the orthogonal canonical axes is done either sequentially (axis by axis, using an iterative algorithm) or in one step (direct algorithm). Figure 5.1, which is Figure 11.2 of Legendre and Legendre (2012, p. 631), summarises the steps of a redundancy analysis (RDA) using the direct algorithm:
Step 1 Regress each dependent variable separately on the explanatory variables
and compute the fitted and residual values of the regressions.
Step 2 Run a PCA of the matrix of fitted values of these regressions.
Step 3 Use the matrix of canonical eigenvectors to compute two sorts of ordinations:
(a) An ordination in the space of the response variables (species space);
the ordination axes are not orthogonal in this ordination;
(b) An ordination in the space of the explanatory variables; this yields
the fitted site scores; the canonical axes obtained here are orthogonal
to one another;
Step 4 Use the matrix of residuals from the multiple regressions to compute an
unconstrained ordination (PCA in the case of an RDA).
Redundancy analysis (RDA) is the canonical version of principal component
analysis (PCA). Canonical correspondence analysis (CCA) is the canonical version of correspondence analysis (CA).
Due to various technical constraints, the maximum numbers of canonical and
non-canonical axes differ (Table 5.2).
Graphically, the results of RDA and CCA are presented in the form of biplots or triplots, i.e. scattergrams showing the samples, response variables
(usually species) and explanatory variables on the same diagram. In canonical
ordinations, explanatory variables can be qualitative (the multiclass ones are
coded as a series of binary variables) or quantitative. A qualitative explanatory
variable is represented on the bi- or triplot as the centroid of the sites that have
the description 1 for that variable, and the quantitative ones are represented as
vectors. The analytical choices are the same as for PCA and CA with respect to
the analysis on a covariance or correlation matrix (RDA) and the scaling types
(RDA and CCA). Table 5.3 presents how an RDA triplot should be interpreted.

Table 5.2: Maximum number of non-zero eigenvalues and corresponding eigenvectors that may be obtained from canonical analysis of a matrix of response variables Y (n × p) and a matrix of explanatory variables X (n × m) using redundancy analysis (RDA) or canonical correspondence analysis (CCA). This is Table 11.1 from Legendre & Legendre (2012).

       Canonical eigenvalues      Non-canonical eigenvalues
       and eigenvectors           and eigenvectors
RDA    min(p, m, n − 1)           min(p, n − 1)
CCA    min(p − 1, m, n − 1)       min(p − 1, n − 1)

Table 5.3: Properties of the distance triplot (scaling 1) and the correlation triplot (scaling 2) in RDA.

Distance among samples, among centroids, and between centroids and samples
  - Scaling 1: approximates the Euclidean distance.
  - Scaling 2: meaningless.

Projection of samples on variables
  - Scalings 1 and 2: projecting a sample at right angle on a variable (response or explanatory) approximates its position on the variable.

Angles among response variables
  - Scaling 1: meaningless.
  - Scaling 2: reflect their correlations.

Angles among explanatory variables
  - Scaling 1: meaningless.
  - Scaling 2: reflect their correlations.

Angles among response and explanatory variables
  - Scalings 1 and 2: reflect their correlations.

Projection of the centroid of a qualitative explanatory variable on response variables
  - Scalings 1 and 2: projecting the centroid at right angle on a response variable approximates its relationship with the response variable.

Figure 5.1: The steps to perform a redundancy analysis (RDA) using the direct algorithm. This is a modification of Figure 11.2 of Legendre & Legendre (2012).
In CCA, on can use the same types of scalings as in CA. Samples and response
variables are plotted as points on the triplot. For the response variables (species)
and samples, the interpretation is the same as in CA. The interpretation of the
explanatory variables should be made as followin in CCA:
> ### CCA using varespec (community matrix) and varechem (explanatory variables)
> CCA <- cca(varespec, varechemSc)
> par(mar=c(4,5,0.5,0.5))
> ordiplot(CCA, scaling=1, type="t")

Scaling 1 (focus on samples)


(1) The position of an object along a quantitative explanatory variable can be
obtained by projecting the object at right angle on the variable.
(2) An object found near the point representing the centroid of a qualitative explanatory variable is more likely to possess the state 1 for
that variable.
Figure 5.2: RDA example using the varespec and varechem data. The community matrix was pre-transformed using the chord transformation. RDA axis 1 explains 32.52% and RDA axis 2 explains 20.26% of the variation.
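The code behind the RDA of Figure 5.2 is not shown in the text. The following is a minimal sketch of how such an analysis could be produced with vegan; it assumes the chord transformation is applied with decostand (method "normalize") and that all varechem variables are used as constraints.

> ### Hedged sketch (not the original code): tb-RDA of the chord-transformed data
> varespecChord <- decostand(varespec, method="normalize")   # chord transformation
> RDA <- rda(varespecChord ~ ., data=varechem)               # constrained by all soil variables
> RDA                                                        # eigenvalues per canonical axis
> par(mar=c(4,5,0.5,0.5))
> ordiplot(RDA, scaling=1, type="t")                         # distance triplot (scaling 1)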
Scaling 2 (focus on response variables)
(1) The optimum of a species along a quantitative environmental variable can be obtained by projecting the species at right angle on the
variable.
(2) A species found near the centroid of a qualitative environmental variable is likely to be found frequently (or in larger abundances) in the
sites possessing the state 1 for that variable.
Scaling 3 (focus on samples and response variables)
Combines the properties of scaling 1 and scaling 2.

Figure 5.3: CCA example using the varespec and varechem data.

5.2 Partial canonical ordination

5.2.1 Variation partitioning

In the same way as one can do partial regression, it is possible to run partial
canonical ordinations. It is thus possible to run, for instance, a CCA of a species
data matrix (Y matrix), explained by a matrix of climatic variables (X), controlling for the edaphic variables (W). Such an analysis would allow the user to
assess how much species variation can be uniquely attributed to climate when
the effect of the soil factors have been removed. This possibility has led Borcard
et al. (1992) to devise a procedure called variation partitioning in a context of
spatial analysis. One explanatory matrix X contains the environmental variables, and the other (W) contains the x-y geographical coordinates of the sites,
augmented by the terms of a third-order polynomial of these coordinates:

b_0 + b_1 x + b_2 y + b_3 x^2 + b_4 xy + b_5 y^2 + b_6 x^3 + b_7 x^2 y + b_8 xy^2 + b_9 y^3    (5.1)
The procedure aims at partitioning the variation of a Y matrix of species
data into the following fractions (Figure 5.4):
[a] Variation explained solely by matrix X
[b] Variation explained jointly by matrices X and W
[c] Variation explained solely by matrix W
[d] Unexplained variation

If run with RDA, the partitioning is done under a linear model, the total SS
of the Y matrix is partitioned, and it corresponds strictly to what is obtained
by multiple regression if the Y matrix contains only one response variable. If
run under CCA, the partitioning is done on the total inertia of the Y matrix.
More recently, Borcard & Legendre (2002), Borcard et al. (2004), Dray et al.
(2006) and Blanchet et al. (2008b) have proposed to replace the spatial polynomial
by a much more powerful representation of space defined using various types of
spatial eigenfunctions. See Chapter 7 of Borcard et al. (2011) for more details.
> par(mar=c(0.5,0.5,0.5,0.5))
> showvarparts(2, cex=3, lwd=3)

Figure 5.4: The fractions of variation obtained by partitioning a response data
set Y (large rectangle) with two explanatory data matrices X (Fractions [a]+[b])
and W (Fractions [b] + [c]).
Fractions [a]+[b], [b]+[c], [a] alone and [c] alone can be obtained by canonical
or partial canonical analyses. Fraction [b] does not correspond to a fitted fraction
of variation and can only be obtained by subtraction of some of the fractions
obtained by the ordinations.
The procedure must be run as follows if one is interested in the R² values of
the four fractions:
Step 1 Perform an RDA (or CCA) of Y explained by X. This yields fraction
[a] + [b].
Step 2 Perform an RDA (or CCA) of Y explained by W. This yields fraction
[b] + [c].
Step 3 Perform an RDA (or CCA) of Y explained by X and W together. This
yields fraction [a] + [b] + [c].

The R² values obtained above are unadjusted, i.e. they do not take into account the number of explanatory variables used in matrices X and W. In canonical ordination, as in regression analysis, R² always increases when an explanatory variable xi is added to the model, regardless of the real meaning of this variable. In the case of regression, to obtain a better estimate of the population coefficient of determination (ρ²), Zar (1999), p. 423, among others, propose to use an adjusted coefficient of multiple determination:

R²adj = 1 − [(n − 1)/(n − m − 1)] (1 − R²)    (5.2)
As Peres-Neto et al. (2006) have shown using extensive simulations, this formula can be applied to the fractions obtained above in the case of RDA
(but not CCA), yielding adjusted fractions: ([a] + [b])adj , ([b] + [c])adj and
([a] + [b] + [c])adj . These adjusted fractions can then be used to obtain the individual adjusted fractions:
Step 4 Fraction [a]adj is obtained by subtracting ([b] + [c])adj from ([a] + [b] + [c])adj.
Step 5 Fraction [b]adj is obtained by subtracting [a]adj from ([a] + [b])adj.
Step 6 Fraction [c]adj is obtained by subtracting ([a] + [b])adj from ([a] + [b] + [c])adj.
Step 7 Fraction [d]adj is obtained by subtracting ([a] + [b] + [c])adj from 1 (i.e.
the total variance of Y).
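In practice, the adjusted fractions can be obtained directly with the varpart function in vegan. The following is a minimal sketch; the split of the varechem variables into two explanatory subsets is an arbitrary assumption made only for illustration.

> ### Hedged sketch: adjusted variation partitioning with varpart()
> Y <- decostand(varespec, method="hellinger")   # pre-transformed community data
> X <- varechem[, c("N","P","K")]                # hypothetical explanatory subset 1
> W <- varechem[, c("Al","Fe","Mn")]             # hypothetical explanatory subset 2
> vp <- varpart(Y, X, W)                         # adjusted R2 of all fractions
> vp
> plot(vp)                                       # Venn diagram of the adjusted fractions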
Alternatively, if one is interested in the fitted site scores for fractions [a]adj
and [c]adj, the partitioning can be run using partial canonical ordinations. Note,
however, that it is not possible to obtain the R²adj values on this basis:
Step 1 Perform an RDA (or CCA) of Y explained by X. This yields fraction
[a] + [b].
Step 2 Perform a partial RDA (or CCA) of Y explained by X, controlling for
W. This yields fraction [a].
Step 3 Perform a partial RDA (or CCA) of Y explained by W, controlling
for X. This yields fraction [c].
Step 4 Fraction [b] is obtained by subtracting [a] from [a] + [b].
Step 5 Fraction [d] is obtained by subtracting [a] + [b] + [c] from 1 (RDA) or
the total inertia of Y (CCA).
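A minimal sketch of these partial ordinations in vegan, continuing with the Y, X and W assumed in the sketch above:

> ### Hedged sketch: partial RDAs for the individual fractions
> rda.ab <- rda(Y, X)       # fraction [a] + [b]
> rda.a  <- rda(Y, X, W)    # fraction [a]: X, controlling for W
> rda.c  <- rda(Y, W, X)    # fraction [c]: W, controlling for X
> anova(rda.a)              # permutation test of fraction [a]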
It must be emphasised here that fraction [b] has nothing to do with
the interaction term of an ANOVA! In ANOVA, an interaction measures the
effect that an explanatory variable (a factor) has on the influence
of the other explanatory variable(s) on the dependent variable. An
interaction can have a non-zero value when the two explanatory variables are

orthogonal, which is the situation where fraction [b] is equal to zero. Fraction
[b] arises because there is some correlation between matrices X and W. Note
that in some cases fraction [b] can even take negative values. This happens, for
instance, if matrices X and W have strong opposite effects on matrix Y while
being positively correlated to one another.
This variation partitioning procedure can be extended to more than two
explanatory matrices, and can be applied outside the spatial context.
5.2.2 Forward selection of explanatory variables

There are situations where one wants to reduce the number of explanatory variables in a regression or canonical ordination model. An approach commonly used for this purpose is forward selection. This is how it works:
Step 1 Compute the independent contribution of all the m explanatory variables to the explanation of the variation of the response data table. This is done by running m separate canonical analyses.
Step 2 Test the significance of the contribution of the best variable (i.e. the one with the highest R²) and calculate an R²adj.
Step 3 If it is significant and the R²adj does not exceed the R²adj calculated on the full model (i.e. a model constructed with all the explanatory variables of interest), include it into the model as a first explanatory variable.
Step 4 Compute (one at a time) the partial contributions (conditional effects) of the m − 1 remaining explanatory variables, controlling for the effect of the one already in the model.
Step 5 Test the significance of the best partial contribution among the m − 1 variables and calculate an R²adj for this new model.
Step 6 If it is significant and the R²adj does not exceed the R²adj calculated on the full model, include it into the model as a second explanatory variable.
Step 7 Compute (one at a time) the partial contributions (conditional effects) of the m − 2 remaining explanatory variables, controlling for the effect of the two already in the model.
Step 8 The procedure goes on until no more significant partial contribution is found or until the R²adj of the reduced model becomes larger than the R²adj calculated on the full model.
Traditionally, forward selection only included one model selection criterion, the level of significance of the model. However, this criterion was known to be overly liberal. After a forward selection carried out using only significance as a selection criterion, the final model would generally include too many explanatory variables, which sometimes yielded models that were wrongly deemed significant. To solve this problem, Blanchet et al. (2008a) proposed to include the R²adj as a second stopping criterion in the forward selection procedure.
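A minimal sketch of this double-stopping-criterion forward selection, using vegan's ordiR2step function (forward.sel in packfor is an alternative) and assuming the Hellinger-transformed community matrix Y from the sketches above:

> ### Hedged sketch: forward selection with significance and R2adj as stopping criteria
> mod0 <- rda(Y ~ 1, data=varechem)              # intercept-only model
> mod1 <- rda(Y ~ ., data=varechem)              # full model (all explanatory variables)
> fwd <- ordiR2step(mod0, scope=formula(mod1), direction="forward")
> fwd$anova                                      # variables retained, in order of inclusion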

Remarks
(a) The tests are run by random permutations.
(b) Like all variable selection procedures (forward, backward or stepwise), this
one does not guarantee that the best model is found. From the second step
on, the inclusion of variables is conditioned by the nature of the variables
that are already in the model.
(c) As in all regression models, the presence of strongly intercorrelated explanatory variables renders the regression/canonical coefficients unstable.
Forward selection does not necessarily eliminate this problem since even
strongly correlated variables may be admitted into a model.
(d) Forward selection can help when several candidate explanatory variables
are strongly correlated, but the choice has no a priori ecological validity.
In this case it is often advisable to eliminate one of the intercorrelated
variables on ecological rather than statistical grounds.
(e) The classic forward selection is a rather conservative procedure when compared to backward elimination (see below): it tends to admit a smaller set of explanatory variables. In absolute terms, however, it is relatively liberal. The forward selection proposed by Blanchet et al. (2008a), however, is much better at avoiding the inclusion of spurious variables.
(f) If one wants to select an even larger subset of variables, another choice is
backwards elimination, where one starts with all the variables included,
and remove one by one the variables whose partial contributions are not
significant. The partial contributions must also be recomputed at each
step. Backward elimination is not offered in packfor; it is, however, offered
in vegan through the ordistep function.
(g) In cases where several correlated explanatory variables are present, without
clear a priori reasons to eliminate one or the other, one can examine the
variance inflation factors (VIF).
(h) The variance inflation factors (VIF) measure how much the variance of the
canonical coefficients is inflated by the presence of correlations among explanatory variables. This measures in fact the instability of the regression
model. As a rule of thumb, ter Braak recommends that variables that
have a VIF larger than 20 be removed from the analysis. Beware: always
remove the variables one at a time and recompute the analysis, since the
VIF of every variable depends on all the others!
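In vegan, the VIFs of the variables included in an RDA or CCA can be inspected with vif.cca; a minimal sketch, assuming the full model mod1 of the previous sketch:

> ### Hedged sketch: variance inflation factors of the explanatory variables
> vif.cca(mod1)    # values well above ~20 flag strongly redundant variables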

5.3 Distance-based redundancy analysis (db-RDA)

For cases where the user does not want to base the comparisons among objects on the distances that are preserved in CCA or RDA (including the species
pre-transformations), another approach is possible for canonical ordination: db-RDA (Legendre & Anderson, 1999). Described in the framework of multivariate
ANOVA testing, the steps of a db-RDA are as follows (Figure 5.5):
Step 1 Compute a distance matrix from the raw data using the most appropriate association coefficient.
Step 2 Compute a PCoA of the matrix obtained in Step 1. If necessary, correct
for negative eigenvalues (Lingoes or Cailliez correction), because the aim
here is to conserve all the data variation.
Step 3 Compute an RDA, using the objects' principal coordinates as response
(Y) matrix and the matrix of explanatory variables as X matrix.
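In vegan, the capscale function wraps these steps (PCoA of a chosen dissimilarity followed by RDA on the principal coordinates); a minimal sketch, assuming the Bray-Curtis dissimilarity is the appropriate coefficient:

> ### Hedged sketch: db-RDA with capscale(); add=TRUE adds a constant so that
> ### no negative eigenvalues remain in the underlying PCoA
> dbRDA <- capscale(varespec ~ ., data=varechem, distance="bray", add=TRUE)
> anova(dbRDA)                   # permutation test of the canonical relationship
> ordiplot(dbRDA, type="t")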

Figure 5.5: The steps to perform a db-RDA. Modified from Legendre and Anderson (1999).

5.4 Consensus RDA

The development of db-RDA and the pre-transformations proposed by Legendre & Gallagher (2001) have raised, in the minds of many ecologists, the question: "Which distance should be used to analyse community data?". This question has recently been approached in two studies. (1) Legendre & De Cáceres (2013) studied many association coefficients and grouped them based on 14 properties used to compare these coefficients. Their conclusions are summarized in Figure 5.6. Although Figure 5.6 helps in choosing an association coefficient suited to a particular ecological question, the choice within one group (type) of coefficients can be problematic, because the final canonical ordination triplots may differ slightly depending on the coefficient. (2) Blanchet et al. (in press) approached this problem and found that the best solution is to perform a consensus of ordination triplots, i.e. a consensus of RDAs in which the only aspect that differs among the RDAs is the association coefficient. The steps to perform a consensus RDA are as follows:
Step 1 Compute an RDA using the same response and explanatory variables
for each association coefficient considered interesting for the analysis of the
community data.
Step 2 Calculate a Z matrix (site scores in the space of the explanatory variables) from each of the RDAs computed above.
Step 3 Correlate each pair of Z matrices using the RV coefficient and store the
result in a resemblance matrix.
RV(Zi, Zj) = tr(Zj Zi^t Zi Zj^t) / √[ tr(Zj Zj^t Zj Zj^t) tr(Zi Zi^t Zi Zi^t) ]    (5.3)

Step 4 Construct a minimum spanning tree from the resemblance matrix of RV coefficients. This minimum spanning tree represents how similar the association coefficients are to one another.
Step 5 If necessary, select the association coefficient on which to perform the
consensus RDA.
Step 6 Perform the consensus RDA (Figure 5.7).
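No packaged implementation is assumed here; the following small script is a minimal sketch of Steps 1-4 with vegan, using three arbitrarily chosen coefficients and, for simplicity, the first two canonical axes of each ordination instead of only the significant axes.

### Hedged sketch: RV coefficients (equation 5.3) between the canonical site
### scores obtained with different association coefficients, and their
### minimum spanning tree
library(vegan)
data(varespec); data(varechem)

## Steps 1-2: fitted site scores (Z, scaling 1) for each coefficient
getZ <- function(ord) scores(ord, display="lc", scaling=1, choices=1:2)
Z <- list(
  hellinger = getZ(rda(decostand(varespec, "hellinger") ~ ., varechem)),
  chord     = getZ(rda(decostand(varespec, "normalize") ~ ., varechem)),
  bray      = getZ(capscale(varespec ~ ., varechem, distance="bray", add=TRUE))
)

## Step 3: RV coefficient for every pair of Z matrices
RV <- function(Zi, Zj) {
  num <- sum(diag(Zj %*% t(Zi) %*% Zi %*% t(Zj)))
  den <- sqrt(sum(diag(Zj %*% t(Zj) %*% Zj %*% t(Zj))) *
              sum(diag(Zi %*% t(Zi) %*% Zi %*% t(Zi))))
  num / den
}
RVmat <- outer(seq_along(Z), seq_along(Z),
               Vectorize(function(i, j) RV(Z[[i]], Z[[j]])))
dimnames(RVmat) <- list(names(Z), names(Z))

## Step 4: minimum spanning tree on the corresponding dissimilarities (1 - RV)
tree <- spantree(as.dist(1 - RVmat))
tree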


Figure 5.6: Comparison of distance coefficients. Reproduction of Figure 2 from Legendre & De Cáceres (2013).


Figure 5.7: Schematic representation of consensus RDA. (a) The first step of the procedure is to perform a series of RDAs (tb-RDA or db-RDA) to model the community data Y using explanatory variables X. Each RDA is computed with a different dissimilarity coefficient using scaling type 1 (distance triplot, Z matrices). In the figure, K different dissimilarity coefficients are used. (b) For each of the K dissimilarity coefficients, the significant axes within each Z matrix are grouped in a large matrix. (c) An RDA is then performed on this large matrix using X as the explanatory variables. (d) This RDA yields the consensus site-score matrix Z*, a diagonal matrix of eigenvalues Λ*, and the consensus canonical coefficients C*. (e) Equation 5 is then used to obtain the consensus species scores U*. (f) Z*, U*, and C* can be used to draw a consensus RDA triplot; the eigenvalues in Λ* show the importance of each axis in the consensus triplot. This figure was modified from Blanchet et al. (in press).
76

References

Blanchet, F.G., Legendre, P., Bergeron, J.A.C. & He, F. (in press). Consensus RDA across dissimilarity coefficients for canonical ordination of community composition data. Ecological Monographs.

Blanchet, F.G., Legendre, P. & Borcard, D. (2008a). Forward selection of explanatory spatial variables. Ecology, 89, 2623–2632.

Blanchet, F.G., Legendre, P. & Borcard, D. (2008b). Modelling directional spatial processes in ecological data. Ecological Modelling, 215, 325–336.

Borcard, D., Gillet, F. & Legendre, P. (2011). Numerical Ecology with R. Use R! Springer, New York.

Borcard, D. & Legendre, P. (2002). All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices. Ecological Modelling, 153, 51–68.

Borcard, D., Legendre, P., Avois-Jacquet, C. & Tuomisto, H. (2004). Dissecting the spatial structure of ecological data at multiple scales. Ecology, 85, 1826–1832.

Borcard, D., Legendre, P. & Drapeau, P. (1992). Partialling out the spatial component of ecological variation. Ecology, 73, 1045–1055.

Dray, S., Legendre, P. & Peres-Neto, P.R. (2006). Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecological Modelling, 196, 483–493.

Goodall, D.W. (1954). Objective methods for the classification of vegetation. III. An essay in the use of factor analysis. Australian Journal of Botany, 2, 304–324.

Greenacre, M. & Primicerio, R. (2013). Multivariate Analysis of Ecological Data. Fundación BBVA.

Legendre, P. & Anderson, M.J. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69, 1–24.

Legendre, P. & De Cáceres, M. (2013). Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters, 16, 951–963.

Legendre, P. & Gallagher, E. (2001). Ecologically meaningful transformations for ordination of species data. Oecologia, 129, 271–280.

Legendre, P. & Legendre, L. (2012). Numerical Ecology. 3rd edn. Vol. 24 of Developments in Environmental Modelling. Elsevier.

Oksanen, J. (2013). Multivariate analysis of ecological communities in R.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559–572.

Peres-Neto, P.R., Legendre, P., Dray, S. & Borcard, D. (2006). Variation partitioning of species data matrices: estimation and comparison of fractions. Ecology, 87, 2614–2625.

Romesburg, C. (2004). Cluster Analysis for Researchers. Lulu Press.

Zar, J.H. (1999). Biostatistical Analysis. 4th edn. Prentice Hall.

Zuur, A., Ieno, E.N., Walker, N., Saveliev, A.A. & Smith, G.M. (2009). Mixed Effects Models and Extensions in Ecology with R. Springer.
