
# biostatistics - multivariate data analysis 2010-11 19

Ordination – introduction

Ordination (from the Latin 'ordinatio' and German 'Ordnung') is the arrangement of units in some order.
It consists of plotting objects along one axis representing an ordered relationship, or forming a scatter
diagram with two or more axes.
In ecology, several descriptors are usually observed for each object under study. In a simple example (cf. also
above), samples could then be ordered along two axes representing the two species present in the data set (in a
bivariate graph). In most instances however, ecologists are interested in characterizing the main trends in
variation of the objects with respect to all descriptors, not only a few of them. Looking at scatter plots of the
objects with respect to all possible pairs of descriptors is a tedious approach, which generally does not shed
much light on the problem at hand. In contrast, the multivariate approach consists of representing the scatter
of objects in a multidimensional diagram, with as many axes as there are descriptors in the study. It is not
possible, however, to draw such a diagram on paper with more than two or three dimensions, even though it is a
perfectly valid mathematical construct. For the purpose of analysis, ecologists therefore project the
multidimensional scatter diagram onto bivariate graphs whose axes are known to be of particular interest
(= ordination axes). These axes are chosen to represent a large fraction of the variability of the
multidimensional data matrix, in a space with reduced dimensionality relative to the original data set. Methods
for ordination in reduced space also allow one to derive quantitative information on the quality of the
projections and study the relationships among descriptors as well as objects.

Bivariate scatter plot of 4 samples in a 2D space defined by the abundances of species 1 and 2 (from Legendre & Legendre 1998)

Reduction of a p-dimensional space to a k-dimensional ordination (from McCune & Grace 2002)

## A simple two-species example

[Figure: samples plotted in the plane defined by species A and species B (species axes, each running from -20 to 20), with ordination axis 1 and ordination axis 2 superimposed]

Samples (blue dots) can be ordered along axes representing variation in the abundance of species A
and species B (species axes). A similar approach can be used for three species-axes, and for n species
n axes exist. But beyond three species, it becomes impossible to plot. We therefore seek the directions
of maximum variation in the species data: these (ordination) axes will represent the combined variation
in the species data. In the example above, ordination axis 1 captures most of the combined variation in
species A and B. Axis 2 does the same, but has to be independent of (orthogonal to) axis 1 (otherwise a
part of the same variation would be shown twice). For two species two ordination axes can be drawn, for
n species n axes can be drawn. The aim however is that only a few of these will manage to capture a
large part of the variation in the species data, so only a few axes need to be shown: the first axes
always represent the main variation in species composition. How these axes are established will be
treated in detail below.
The main underlying assumption in ordination analysis, which is at the same time the main
reason why we do ordination analyses, is that these gradients in species composition have an
underlying, ecological cause (i.e. there is an ecological reason why these two species covary): variation
in species composition represents variation in their environment (be it abiotic and/or biotic). Ordination
techniques allow us to reconstruct these hypothetical environmental gradients by extracting the main
gradients in species composition from complex data sets. They can then be used to study relationships
between and amongst objects and predictors.

## Indirect vs. direct ordination

The way in which abiotic and biotic environmental variables influence community structure is often
explored in the following way. First, one samples a set of sites and records which species occur where
(and/or when), and in what quantity. Since the number of species is usually large, one uses ordination to
summarize and arrange the data in an ordination diagram, which is then interpreted in the light of
whatever is known about the environment or the biology and ecology of the species. If only species data
are available, this type of analysis is called indirect gradient analysis: the most important gradients in
species composition (ordination axes) are reconstructed, and these can afterwards be related to what is
known about the environment. We thus look indirectly at the environmental gradients present in the data
sets, via the species gradients. In direct gradient analysis, environmental variables are explicitly
incorporated in the analysis. Direct gradient analysis can be univariate (e.g. when the abundance of a
species is plotted or formally correlated or regressed against environmental variables, cf. part Prof.
Vanreusel) or multivariate. In the latter case, environmental variables are explicitly incorporated in the
ordination analysis [usually involving a (multiple) regression step].
Indirect gradient analysis has some important advantages over direct gradient analysis. It is
often very hard or expensive to measure environmental variables, and very often the ones that most
influence species composition have not been measured or cannot be measured (e.g. biotic variables
such as competition, parasitism, disease, etc.).

In the next chapters, we will discuss a number of indirect ordination techniques. Direct ordination
techniques, also called constrained or canonical ordination techniques, will not be discussed in this
course.

Response models

Before a data set is analyzed, it is crucial that a technique which is appropriate for the data at hand is
selected. The choice of the most appropriate method depends on the nature of the ecological data, and
more specifically, the underlying response model of the species data.

In ecology, it is assumed that species respond to the underlying environmental gradients. Different
response curves can be fitted to the species data. The choice of an appropriate response curve is
essential for choosing an appropriate regression curve (what shape should be fitted, how complicated
should the fit be?) or ordination technique. The figure below shows different types of response curves.
The expected response is plotted against the variable. Response curves can be constant (a), monotonic
increasing [sigmoid (b) or straight line (c)], monotonic decreasing [sigmoid (d)], unimodal [parabola (e),
Gaussian (f), skewed or block function (g), or bimodal (h)]. Note that the shape of the response model
can depend on the size of the sampling interval.

Figure: response models (from Jongman et al. 1987)
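The remark that the shape of the response model can depend on the sampling interval can be made concrete with a small sketch. The function names, parameter names and all numbers below are assumptions chosen for illustration; they do not come from the course material.

```python
import numpy as np

def linear_response(x, b0, b1):
    """Straight-line response: y = b0 + b1 * x."""
    return b0 + b1 * x

def sigmoid_response(x, top, midpoint, rate):
    """Monotonic sigmoid response that levels off at `top`."""
    return top / (1.0 + np.exp(-rate * (x - midpoint)))

def gaussian_response(x, maximum, optimum, tolerance):
    """Unimodal (Gaussian) response with its peak at `optimum`."""
    return maximum * np.exp(-0.5 * ((x - optimum) / tolerance) ** 2)

# Hypothetical environmental gradient and a species with a Gaussian response.
gradient = np.linspace(0.0, 10.0, 101)
abundance = gaussian_response(gradient, maximum=20.0, optimum=5.0, tolerance=1.5)

# Sampling the whole gradient reveals a unimodal curve peaking at the optimum.
peak_position = gradient[np.argmax(abundance)]

# Sampling only a short interval below the optimum looks monotonic increasing.
short = (gradient >= 1.0) & (gradient <= 3.0)
looks_monotonic = bool(np.all(np.diff(abundance[short]) > 0))
```

Over a short interval the same unimodal species could therefore be handled reasonably well by a linear method, whereas over the full gradient it could not.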

In ordination analysis, it is imperative that the technique chosen fits the underlying response model.
Some techniques assume a linear response model (like Principal Components Analysis - PCA), while
others assume a unimodal model (e.g. Correspondence Analysis - CA). A third group of techniques
does not assume a specific underlying model but starts from a (dis)similarity or distance matrix between
the sites (Multidimensional Scaling - MDS). Before an analysis is started, it is essential to check the
structure of the species data.

Eigenanalysis

As mentioned above, one of the main aims of ordination analysis is to generate new variables
(ordination axes) to replace the original descriptors, with each new variable explaining a large portion of
the (co)variation in the data set, and to make sure that these new variables explain different aspects of
the data set. There will be as many new variables as there are original descriptors/dimensions. This
may seem contradictory (because it does not seem to reduce the multidimensionality of the data set),
but the aim is that only a few of the new variables will capture as much of the variation as possible, so
only a few need to be shown. As the new variables need to explain different aspects of the data set,
they need to be independent from one another (i.e. they need to be orthogonal).

Mathematically, the problem may be solved as follows. Starting from an association matrix A (which
most often describes the association between the descriptors), we want to find an equivalent
association matrix. This equivalent matrix will (1) represent the same amount of information as the
original matrix A, but distributed in a different way over the variables (descriptors/species in A vs.
ordination axes in the new matrix). In addition, the new variables should be independent (so all
elements outside the diagonal in the new matrix should be zero). This new, equivalent matrix is called
the matrix of eigenvalues. The new variables (the ordination axes) are eigenvectors. Each eigenvector
is associated with an eigenvalue, which represents a portion of the original variation in the data set.
Eigenvectors and eigenvalues can be calculated using matrix algebra.

The figure below shows a symmetric association matrix A derived from an ecological data matrix E, with
the terms below (and above) the diagonal characterizing the degree of association between the
descriptors (or objects). We now want to find a new matrix Λ, which has to be equivalent to A. This
diagonal matrix Λ is called the matrix of eigenvalues, and it describes the association between the
eigenvectors. The new variables (eigenvectors) are linearly independent of one another (as all values
outside the diagonal are zero). Matrix A and Λ are equivalent as they represent exactly the same
amount of variation in the original species data set: in matrix A, the diagonal represents the variances of
all species (i.e. the total variation in data set E), matrix Λ gives the variation captured by all the
eigenvectors (or ordination axes) in its diagonal. The difference between both matrices therefore mainly
lies in how the variation is partitioned [respectively over descriptors/species (A) or over eigenvectors (Λ);
remember that an ordination 'replaces' species axes with new variables/axes which capture the
covariation among species. See also the concrete example below].

(from Legendre & Legendre 1998 - note that in this example n stands for the number of species, while in the text
below p stands for the number of species and n for the number of samples!)

Calculating the new variables and their variances (eigenvalues) is the basis of ordination analysis. The
eigenvalues and eigenvectors of a matrix A are found from the following equation (footnote 6):

$$A\,u_k = \lambda_k\,u_k$$

with $u_k$ = eigenvector k and $\lambda_k$ = eigenvalue of eigenvector k

The eigenvalues λk can be calculated by solving the following equation which states that the
determinant of the difference between matrices A and (λk I) must be equal to zero for each λk:

$$|A - \lambda_k I| = 0 \quad \text{with } k = 1 \ldots p$$

with matrix A being the association matrix derived from the sample (n) x species (p) matrix E, and matrix
I being a unit matrix (matrix with all elements zero except for the diagonal where values are 1) with the
same dimensions as matrix A (note that the eigenvalues λk are the diagonal elements of Λ). This equation is
also called the characteristic equation. For a matrix of order p, the characteristic equation is a polynomial
of degree p, whose p solutions are the values λk. The eigenvalues are therefore the roots of a polynomial.
There are p eigenvalues (some of which may coincide).

When these values are calculated, it is possible to calculate the corresponding eigenvectors $u_k$, using
the following equation:

$$(A - \lambda_k I)\,u_k = 0$$
There are as many eigenvectors as there are eigenvalues (p).

The matrix U of the eigenvectors is a transform matrix, allowing one to go from matrix A to matrix Λ:

AU=UΛ

Each eigenvector contains the coefficients of the linear equation for a given ordination axis; they
represent the contribution of a given variable (here a descriptor/species) to a particular axis. The matrix of
the scores of the objects (samples) on each axis can be found as follows:

X=EU

Finding the eigenvalues and eigenvectors involves complex matrix algebra. In practice however, the
coefficients for the objects and the descriptors and variation explained for all ordination axes can also be
deduced using an iterative regression and calibration algorithm (see below). A numerical example of a
simple eigenanalysis will be given when describing the first ordination technique, PCA.
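The relations stated in this section (A U = U Λ, orthogonal eigenvectors, conservation of the total variance, scores obtained by postmultiplying the data with U) can be checked numerically. The sketch below assumes NumPy and uses an invented 12 x 3 data matrix; it also assumes that A is the covariance matrix and that the data are centred before the scores are computed, as is done later in the PCA section.

```python
import numpy as np

# Hypothetical data: 12 samples (rows) x 3 species (columns), invented for illustration.
rng = np.random.default_rng(0)
E = rng.poisson(lam=5.0, size=(12, 3)).astype(float)

Yc = E - E.mean(axis=0)                 # centre each species on its mean
A = np.cov(Yc, rowvar=False)            # p x p association (covariance) matrix

eigvals, U = np.linalg.eigh(A)          # eigh handles symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]       # reorder so that axis 1 has the largest eigenvalue
eigvals, U = eigvals[order], U[:, order]
Lam = np.diag(eigvals)                  # diagonal matrix of eigenvalues

X = Yc @ U                              # scores of the samples on the ordination axes

print(np.allclose(A @ U, U @ Lam))                  # A U = U Lambda
print(np.allclose(U.T @ U, np.eye(3)))              # eigenvectors are orthonormal
print(np.isclose(np.trace(A), eigvals.sum()))       # total variance is conserved
print(np.allclose(np.cov(X, rowvar=False), Lam))    # new axes are uncorrelated
```

The last check shows the point made in the text: the covariance matrix of the scores is the diagonal matrix Λ, so the same variation is simply redistributed over independent axes.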

## Ordination – indirect ordination techniques

In this section, we will discuss four different indirect ordination techniques, namely Principal
Components Analysis (PCA), Correspondence Analysis (CA) and a derived technique, Detrended
Correspondence Analysis (DCA), and Nonmetric MultiDimensional Scaling (NMDS). This list is not
exhaustive, other methods exist, such as Metric Multidimensional Scaling (or Principal Coordinates
Analysis) and Bray-Curtis (or Polar) Ordination. An overview of all methods is given in McCune & Grace
(2002).

6 The validity of this equation is demonstrated in Legendre & Legendre (1998, p. 81-82).

## Principal components analysis (PCA)

Principal components analysis is the first and most basic eigenvector method of ordination (Pearson
1901). The object of PCA is to represent a data set (matrix E) containing many variables by a smaller
number of composite variables [principal components (PC) or axes] with the most interesting and
strongest covariation emerging along the first axes. PCA is a linear method: it seeks the strongest linear
correlation structure among the variables.

Below, we will first explain how the ordination axes are constructed using eigenanalysis. Then we will
explain the same method using response models as a conceptual basis for ordination.

Eigenanalysis

- Step 1. Produce a p x p variance/covariance (or correlation) matrix (A) with variances along the
  diagonal and covariances (or correlations) in the triangles.

- Step 2. Find the eigenvalues of this matrix. There are p values of λ. Each represents a portion of the
  original variance, corresponding to that particular PC.

  $$|A - \lambda I| = 0$$

- Step 3. Find the eigenvectors. For every λ there is a vector u, such that

  $$(A - \lambda_k I)\,u_k = 0$$

  These eigenvectors are the principal axes of dispersion matrix A. Each eigenvector contains the
  coefficients (the loadings or scores) of the descriptors for a linear equation for a given PC, so each
  gives the contribution of a given descriptor (species) to an axis. Collectively these vectors form a
  p x p matrix (U). Each eigenvector has zero correlation with the others (orthogonality).

- Step 4. Find the scores for each object on an axis. Scores are the original data matrix postmultiplied
  with the matrix of eigenvectors, X = E U. The score of object i on axis 1 is

  $$x_{i1} = \sum_{k=1}^{p} y_{ik}\,u_{1k}$$

  with $u_{1k}$ the loading (species score) of species k (taken from eigenvector 1), and $y_{ik}$ the original
  abundance of species k in sample i. Usually (see below) $y_{ik}$ is replaced by the centered abundance of
  the species: $y_{ik} - \bar{y}_k$.

Numerical example

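Consider a data matrix E consisting of 5 samples and 2 species:

$$E = \begin{bmatrix} 2 & 1 \\ 3 & 4 \\ 5 & 0 \\ 7 & 6 \\ 9 & 2 \end{bmatrix} \quad \text{after centering on the column means,} \quad [y - \bar{y}] = \begin{bmatrix} -3.2 & -1.6 \\ -2.2 & 1.4 \\ -0.2 & -2.6 \\ 1.8 & 3.4 \\ 3.8 & -0.6 \end{bmatrix}$$

The association matrix of the above descriptors (species), in this case a dispersion matrix of variances and
covariances, is:

$$A = \begin{bmatrix} 8.2 & 1.6 \\ 1.6 & 5.8 \end{bmatrix}$$

The characteristic equation is then:

$$|A - \lambda_k I| = \left| \begin{bmatrix} 8.2 & 1.6 \\ 1.6 & 5.8 \end{bmatrix} - \begin{bmatrix} \lambda_k & 0 \\ 0 & \lambda_k \end{bmatrix} \right| = 0$$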

Solving this equation (footnote 7) yields two eigenvalues, λ1 = 9 and λ2 = 5. Note that the total variance stays the
same, but it is partitioned in a different way: the sum of the variances in the main diagonal of matrix A is
8.2 + 5.8 = 14, while the sum of the eigenvalues is 9 + 5 = 14. The first principal component (PC, =
ordination axis in PCA) thus accounts for 9/14 = 64.3 % of the variation in the species data, while the second
accounts for 5/14 = 35.7 %. Successive eigenvalues always account for progressively smaller (or at most equal)
fractions of the variance in the species data.
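For readers who want to retrace the step from the characteristic equation to the two eigenvalues, the 2 x 2 determinant can be expanded as follows (a worked expansion that uses only the numbers of matrix A given in this example):

$$
\begin{aligned}
|A - \lambda I| &= (8.2 - \lambda)(5.8 - \lambda) - 1.6^2 \\
&= \lambda^2 - 14\lambda + 47.56 - 2.56 \\
&= \lambda^2 - 14\lambda + 45 \\
&= (\lambda - 9)(\lambda - 5) = 0
\end{aligned}
$$

so that λ1 = 9 and λ2 = 5.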

Introducing, in turn, the λk's in (A - λk I) uk = 0 provides the eigenvectors u1 and u2 associated with the
eigenvalues (footnote 8):

$$(A - \lambda_k I)\,u_k = 0 \;\Rightarrow\; U = [\,u_1 \;\; u_2\,] = \begin{bmatrix} 0.8944 & -0.4472 \\ 0.4472 & 0.8944 \end{bmatrix}$$

The eigenvectors have been normalized to unit length (footnote 9). The elements of the eigenvectors are the
weights, or the loadings, of the original descriptors in the linear combination of descriptors from which
the PCs are computed. These loadings are the scores (the coordinates) of the descriptors (species) in
the ordination space defined by the two principal components (see figure below).
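As a worked sketch of how these eigenvectors arise, the eigenvalues can be substituted one at a time. Setting one element of each vector to 1 before normalizing is the arbitrary choice described in footnote 8; which element is fixed is an assumption made here for convenience.

$$
\begin{aligned}
\lambda_1 = 9:&\quad \begin{bmatrix} -0.8 & 1.6 \\ 1.6 & -3.2 \end{bmatrix} \begin{bmatrix} u_{11} \\ u_{21} \end{bmatrix} = 0 \;\Rightarrow\; u_{21} = 0.5\,u_{11} \;\Rightarrow\; u_1 = \frac{1}{\sqrt{1.25}} \begin{bmatrix} 1 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.8944 \\ 0.4472 \end{bmatrix} \\
\lambda_2 = 5:&\quad \begin{bmatrix} 3.2 & 1.6 \\ 1.6 & 0.8 \end{bmatrix} \begin{bmatrix} u_{12} \\ u_{22} \end{bmatrix} = 0 \;\Rightarrow\; u_{12} = -0.5\,u_{22} \;\Rightarrow\; u_2 = \frac{1}{\sqrt{1.25}} \begin{bmatrix} -0.5 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.4472 \\ 0.8944 \end{bmatrix}
\end{aligned}
$$

The two vectors are orthogonal: 0.8944 x (-0.4472) + 0.4472 x 0.8944 = 0. Multiplying an eigenvector by -1 gives an equally valid solution, so software may report either sign.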

7 Solving the characteristic equation yields the characteristic polynomial λ² - 14λ + 45 = 0, whose two roots are the eigenvalues 9 and 5.
8 Solving this equation yields pairs of linear equations which are indeterminate [as for each eigenvector uk both elements u
are unknown]. This problem is solved by assigning an arbitrary value (e.g. 1) to one of the elements in uk.
9 Normalization is achieved by dividing each element in the vector by the norm (or length) of the vector (i.e. the square root
of the sum of the squares of all elements in the vector).

The PCs then give the positions of the objects with respect to this new system of principal axes. The
positions of the objects (samples) on the principal axes are given by the following function (or linear
combination):

$$X = [y - \bar{y}]\,U$$

with $[y - \bar{y}]$ the matrix of the centered observations.

For the numerical example, the principal components are computed as follows:

$$X = \begin{bmatrix} -3.2 & -1.6 \\ -2.2 & 1.4 \\ -0.2 & -2.6 \\ 1.8 & 3.4 \\ 3.8 & -0.6 \end{bmatrix} \begin{bmatrix} 0.8944 & -0.4472 \\ 0.4472 & 0.8944 \end{bmatrix} = \begin{bmatrix} -3.578 & 0 \\ -1.342 & 2.236 \\ -1.342 & -2.236 \\ 3.130 & 2.236 \\ 3.130 & -2.236 \end{bmatrix}$$

Since the two columns of the matrix of the object (sample) scores are the coordinates of the five objects
with respect to the principal axes, they can be used to plot the objects with respect to the principal axes
1 and 2. It is clear that in this simple example the objects are positioned by the PC in the same way as
in the original system of species axes; PCA has simply rotated the axes in such a way that the new axes
correspond to the main components of variation. PCA thus preserves the Euclidean distances among
the objects. When there are more than two species PCA still only performs a rotation of the system of
species axes, but now in multidimensional space. In that case, PC 1 and 2 define the plane allowing the
representation of the largest amount of variation in the species data. The objects are then projected
onto that plane in such a way as to preserve, as much as possible, the relative Euclidean distances they
have in the multidimensional space of the original descriptors.
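The whole numerical example can be reproduced in a few lines. The sketch below assumes NumPy; the sign convention applied to the eigenvectors is a choice made here so that the output matches the signs printed in the text, since eigenvectors are only defined up to a factor of -1.

```python
import numpy as np

# Data matrix E of the numerical example: 5 samples x 2 species.
E = np.array([[2, 1], [3, 4], [5, 0], [7, 6], [9, 2]], dtype=float)

Yc = E - E.mean(axis=0)                  # centre on the column means
A = Yc.T @ Yc / (len(E) - 1)             # dispersion (variance-covariance) matrix

eigvals, U = np.linalg.eigh(A)           # eigenanalysis of the symmetric matrix A
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, U = eigvals[order], U[:, order]

# Eigenvector signs are arbitrary: flip each column so that its largest element
# is positive (a convention assumed here, not part of PCA itself).
largest = np.argmax(np.abs(U), axis=0)
U = U * np.sign(U[largest, np.arange(U.shape[1])])

X = Yc @ U                               # sample scores on PC 1 and PC 2
explained = eigvals / eigvals.sum()      # fraction of variance per axis

def pairwise_distances(M):
    """Euclidean distances between all pairs of rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# PCA is a rotation, so distances among samples are unchanged.
same_distances = np.allclose(pairwise_distances(Yc), pairwise_distances(X))
```

With these data, `eigvals` holds 9 and 5, `explained` is about 0.643 and 0.357, and `same_distances` is true, which is the numerical counterpart of the statement that PCA preserves the Euclidean distances among the objects.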

Numerical example: (a) five objects plotted with respect to descriptors y1 and y2. (b) After centring of the data, the
objects are now plotted with respect to ($y_1 - \bar{y}_1$) and ($y_2 - \bar{y}_2$). (c) The objects are plotted with respect to the
principal axes I and II. (d) The two axis systems can be superimposed after rotation
(from Legendre & Legendre 1998)

By analogy with the objects, the descriptors as well can be shown in the reduced ordination space,
which allows us to assess the relationships between these descriptors and between the descriptors and
the objects. The coordinates of the descriptors actually specify the position of an arrow in the reduced
space (see below: PCA biplots).

## Linear response model as a conceptual basis for Principal Components Analysis

In this procedure, PCA is regarded as an extension of fitting straight lines and planes by least squares
(LSQ) regression (Jongman et al. 1995) using an iterative two-way weighted summation algorithm.

Suppose we want to explain the abundance values of several species from the dune meadow data set
by a particular environmental variable, say moisture, and suppose we attempt to do so by fitting straight
lines to the data. We then, for each species, carry out LSQ regression of its abundance values on the
moisture values and obtain, among other things, the residual sum of squares (SSres). This SSres is
actually a measure of how badly moisture explains the data of that species. To measure how badly
moisture explains the data of all species together, we use the total of all the SSres of all species: the
total SSres. If this total SSres is small, then moisture explains the species data well. We can now try to
regress the species data against other environmental variables and try and find which one explains the
species data best (or which variable minimizes the total SSres). In ordination techniques, we will try and
construct a hypothetical or theoretical variable (the ordination axis) which explains the species data
even better. PCA is the technique that constructs the theoretical variable that minimizes the total SSres
after fitting straight lines through the species data. PCA does so by choosing the best site (or sample)
scores (hence defining the axis). This is shown in the figure below. The abundances of six species are
here plotted against a theoretical variable, namely ordination axis or principal component X. The
residuals are shown for Lolium perenne. Any other choice of the site scores (or axis) would have
resulted in a larger total SSres.

(from Jongman et al. 1987)

The horizontal axis in this figure is actually PC 1. The scores of the samples along this axis are the
values (ticks) on the axis. The score of a species for this axis is the slope of the line fitted through the
data for that species. The larger the slope (or the score), the stronger the relationship between the
species and that axis will be. After a first axis is constructed, a second axis will be sought. This axis will
also try and minimize the total SSres, but will be constrained to be orthogonal (independent) from the
first axis, in order to avoid showing (part of) the same information (i.e. relationship between axis and
data) twice.
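The claim that PC 1 is the variable that minimizes the total SSres can be checked on the small 5 x 2 example used in the eigenanalysis section. The sketch below assumes NumPy; the helper `total_ss_res` and the random candidate variables are illustrative constructions, not part of the course material.

```python
import numpy as np

E = np.array([[2, 1], [3, 4], [5, 0], [7, 6], [9, 2]], dtype=float)
Y = E - E.mean(axis=0)                       # centred abundances

def total_ss_res(Y, x):
    """Total residual sum of squares after regressing every species (column of Y) on x."""
    x = x - x.mean()
    slopes = Y.T @ x / (x @ x)               # least-squares slope per species (intercept is 0 for centred data)
    residuals = Y - np.outer(x, slopes)
    return float((residuals ** 2).sum())

# Site scores on PC 1, taken from the eigenvector with the largest eigenvalue.
eigvals, U = np.linalg.eigh(np.cov(Y, rowvar=False))
pc1_scores = Y @ U[:, np.argmax(eigvals)]

ss_pc1 = total_ss_res(Y, pc1_scores)         # theoretical variable constructed by PCA
ss_species_a = total_ss_res(Y, Y[:, 0])      # using species A itself as explanatory variable

# Many arbitrary candidate variables (random site scores) for comparison.
rng = np.random.default_rng(1)
ss_random = [total_ss_res(Y, rng.normal(size=len(Y))) for _ in range(1000)]

print(ss_pc1, ss_species_a, min(ss_random))
```

None of the candidate variables gives a total SSres below that of the PC 1 scores, in line with the statement that any other choice of site scores results in a larger total SSres.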

Algorithms

There are several mathematical algorithms which can be used for calculating all the eigenvectors (and
hence the species and site scores) and the eigenvalues of a symmetric matrix. In ecological applications
however, we only need to know a subset of these, namely the first few, most important components. Ter
Braak (in Jongman et al. 1987) proposed a two-way weighted summation algorithm, which is explained
below.

If the relation between a species and an environmental variable is linear, we can summarize this
relationship by calculating the intercept and the slope of a straight line. These parameters are estimated
using LSQ regression. Consequently, if the intercepts and the slopes are known, we can estimate the
value of an environmental variable by calibration. If it is not known in advance which environmental
variable determines the abundances of the species, we can try and discover the underlying gradient (i.e.
the ordination axis) by applying straight-line regression and calibration alternately in an iterative fashion,
starting from arbitrary initial values for the sites or arbitrary initial values for the intercepts and slopes.
This iteration process will eventually converge to a set of values for species and sites that does not
depend on the initial (arbitrary) values.

The iteration process is simplified by first centering the abundances of each species to zero mean ($\bar{y} = 0$)
and standardizing the site scores to $\bar{x} = 0$ and $\sum_i (x_i - \bar{x})^2 = 1$. Then the equations (footnote 10) to
estimate the intercept and the slope of a straight line ($y = b_0 + b_1 x$) reduce to $b_0 = 0$ and
$b_1 = \sum_i y_i x_i$. We can thus ignore the intercepts and focus on the slopes. From now on, $b_k$ will denote the
slope parameter for species k and $y_{ik}$ the centered abundance of species k at site i. The slope parameter (or
the score for species k) is then calculated as

$$b_k = \sum_{i=1}^{n} y_{ik}\,x_i$$

and the score for the site

$$x_i = \sum_{k=1}^{p} y_{ik}\,b_k$$

Note that the species score is a weighted sum of the site scores, and the site score is a weighted sum of
the species scores (weighted by the abundances of each species). Note also that this last equation has the
same form as the one in step 4 of the eigenanalysis section!

Initial step in the iterative two-way weighted summation algorithm (left) and second step
(from Jongman et al. 1987)

10 $b_0 = \bar{y} - b_1 \bar{x}$ and $b_1 = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) \,/\, \sum_{i=1}^{n} (x_i - \bar{x})^2$

As already mentioned, we need a set of start values for either the slopes or the sites. Above and below,
this is elaborated using the dune meadow data set. We start from initial arbitrary values for the site
scores: this is the sample number (1 to 20) standardized to zero mean and unit variance (footnote 11). We use
these initial xi values to calculate the species scores bk, and rearrange these from smallest to largest value.
We then use these bk's to calculate new xi's (second step). These are then standardized (footnote 12) and also
rearranged. We now continue to calculate new species and site scores, until the values for both the
species and site scores stabilize (see below).

The results of this iteration process yield the final species and site scores for the first PCA axis of the
dune meadow data set (see table below). We can now use these scores to plot the PCA ordination
diagram. Scores of species and sites for the second and consecutive axes are calculated in a similar
fashion, but these axes have to be unrelated (orthogonal) to the variation shown along the first and
other existing axes. To achieve this, a so-called orthogonalization procedure is applied (not shown).
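The iteration described above can be sketched in code. This is a minimal illustration under stated assumptions, not the implementation of Ter Braak: it uses NumPy, the small 5 x 2 example from the eigenanalysis section instead of the dune meadow data, the sample numbers as start values, and a simple projection step as the orthogonalization procedure. The function name and its parameters are hypothetical.

```python
import numpy as np

def two_way_weighted_summation(Y, previous=(), n_iter=1000, tol=1e-12):
    """Iterate species scores b = Y'x and site scores x = Yb until they stabilize.

    Y        : centred abundance matrix (sites x species)
    previous : site-score vectors of axes already extracted (for orthogonalization)
    Returns the site scores, the species scores and the final scaling factor.
    """
    n = Y.shape[0]
    x = np.arange(1, n + 1, dtype=float)          # arbitrary start: the sample numbers
    x = x - x.mean()
    x = x / np.sqrt(np.sum(x ** 2))               # zero mean, unit sum of squares
    for _ in range(n_iter):
        b = Y.T @ x                               # species scores: b_k = sum_i y_ik x_i
        x_new = Y @ b                             # site scores:    x_i = sum_k y_ik b_k
        for x_prev in previous:                   # remove what earlier axes already show
            x_new = x_new - (x_new @ x_prev) * x_prev
        scale = np.sqrt(np.sum(x_new ** 2))
        x_new = x_new / scale                     # standardize the site scores
        converged = np.max(np.abs(x_new - x)) < tol
        x = x_new
        if converged:
            break
    return x, Y.T @ x, scale

E = np.array([[2, 1], [3, 4], [5, 0], [7, 6], [9, 2]], dtype=float)
Y = E - E.mean(axis=0)

x1, b1, s1 = two_way_weighted_summation(Y)
x2, b2, s2 = two_way_weighted_summation(Y, previous=[x1])

# With this standardization the scaling factor equals (n - 1) times the eigenvalue.
lambda1 = s1 / (len(Y) - 1)
lambda2 = s2 / (len(Y) - 1)
```

For this example the iteration recovers the eigenvalues 9 and 5, and the species scores `b1` are proportional to the first eigenvector (0.8944, 0.4472), so both routes lead to the same ordination.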

Final species and sample scores for a PCA of the dune meadow data set
(from Jongman et al. 1987)

11 Plus an additional step as outlined below in footnote 12.
12 Standardizing the site scores is achieved by calculating the sum of squares of all site scores ($s^2 = \sum_{i=1}^{n} x_i^2$) and
calculating the new $x_i = x_{i,\text{old}} / s$.

## PCA ordination diagram – PCA biplot

The figure below shows a PCA biplot. The prefix 'bi' refers to the joint representation of sites and
species. In this particular biplot, from a PCA analysis of the dune meadow data set, the site scores and
the species scores of the first PCA axis are plotted against those of the second PCA axis. As can also
be seen in the table above, the ranges of the species and site scores are different. In the biplot, different
scalings are therefore used for both types of scores.

Figure: PCA biplot and interpretation rules (from Jongman et al. 1987)

Species in a PCA biplot are represented by arrows. These arrows are the shifted and rotated axes of
the species in species space. The endpoints of each arrow are the scores of the species against the first
two axes. For each species, PCA fits a straight line to the (centered) abundances of the species in one
dimension (cf. Fig. p 26). In two dimensions, a plane is fitted. The abundance of a species thus changes
linearly across the biplot. The direction of the arrow indicates the direction of the steepest ascent of the
fitted plane, i.e. the direction in which the abundance of the species increases most. For instance,
Agrostis stolonifera increases both along the first and the second axis. Remember that the species
scores are actually the slopes of the regression lines of the species for the ordination axes. The species
do not change in abundance in a direction perpendicular to their arrow. The larger the slope (i.e. the
higher the species score), the steeper the regression line with respect to an axis. So the length of the
arrow tells us something about the rate of change of a species along an axis. Note that this does not
necessarily mean that species with the longest arrows are the most abundant species. It only concerns
those species that show a lot of variation in abundance along the axes shown in the diagram. Species whose
arrows point in the same (opposite) direction are positively (negatively) correlated in the data set. If the
arrows are at 90°, they show no correlation (i.e. they behave in an independent manner in the data
set) (footnote 13).
Samples are represented by points. Samples close together have similar species composition;
samples further apart are increasingly different in species composition. Note that this increase in
difference goes in all directions from the sample points (see concentric circles in the graph). If we want
to know in what samples a species is most abundant, we can project the sample points on an axis
running through the species arrow. This yields a ranking of the fitted abundances of that particular
species in all the sites (see red lines on arrow of Eleocharis palustris): the species will have the highest
fitted relative abundance in the samples that are on the most positive side of the arrow, and the lowest
fitted relative abundance in the samples that are on the most negative side of the arrow. The fitted
abundance of the species will be higher than the species mean on the positive side of the origin and
lower on the negative side of the origin (because the species data have been centered).
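Two of the interpretation rules can be verified numerically on the small 5 x 2 example from the eigenanalysis section. The sketch below assumes NumPy and assumes two common biplot scalings: a distance biplot (site scores X = [y - ȳ] U, species arrows given by the rows of U) and a covariance biplot (species arrows scaled by the square root of the eigenvalues). Because both axes are kept here, the fitted values are exact; in a reduced diagram they are approximations.

```python
import numpy as np

E = np.array([[2, 1], [3, 4], [5, 0], [7, 6], [9, 2]], dtype=float)
Yc = E - E.mean(axis=0)

eigvals, U = np.linalg.eigh(np.cov(Yc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# Distance biplot: sites at X, species arrows at the rows of U.
X = Yc @ U
fitted = X @ U.T          # projection of every site onto every species arrow

# Ranking of the samples for species A (column 0), lowest to highest fitted abundance.
ranking_species_a = np.argsort(fitted[:, 0])

# Covariance biplot: species arrows are the rows of U scaled by sqrt(eigenvalue).
arrows = U * np.sqrt(eigvals)
cos_theta = (arrows[0] @ arrows[1]) / (
    np.linalg.norm(arrows[0]) * np.linalg.norm(arrows[1])
)
r = np.corrcoef(Yc, rowvar=False)[0, 1]   # correlation between the two species
```

Here `fitted` reproduces the centred abundances, positive values lying above the species mean, and `cos_theta` equals the correlation `r` between the two species, which is the relation r ~ cos θ mentioned in footnote 13.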

The output of an ordination analysis usually also gives the eigenvalue of each axis (which is a measure
of importance of an axis, or the fraction of the total variation in species composition which is captured
along that axis). In addition, the total inertia is also given. This is a measure for the total variation in the
species data. In PCA, the total inertia is equal to the total sum of squares of the regressions of all
species. Note that the total variance is also equal to the sum of all the eigenvalues. When showing an
ordination diagram, it is always important to mention the eigenvalues of the axes, as this gives an idea
of how representative the diagram is. However, axes with low eigenvalues can also be very informative,
especially in very noisy data sets.

13 This is a simplification. Depending on the scaling type of the biplot, the focus is on the samples or the species. For example, in
a covariance biplot, the focus is on the species, and the angle θ between two arrows provides an approximation of their
correlation (r ~ cos θ). If the species scores are post-transformed (divided by species standard deviation), we can estimate
the correlations by perpendicularly projecting the arrow tips of other species onto a particular species arrow. In some scaling
types, the angles between the species arrows are meaningless.

## Correspondence analysis (CA)

Correspondence analysis was developed independently by several authors, the earliest version dating
back to 1935. It is sometimes also referred to as reciprocal averaging (Hill 1973).

As in PCA, we start from an n x p matrix E. For every cell ik in this matrix, we now calculate its contribution
χik to the χ² statistic. Each χik value is the standardized residual of a frequency fik after fitting a null model to the
original data table. This null model states that there is no relationship between the rows and the
columns of the table. Eik, the expected frequency of species k in sample i, is then equal to the product of
the sum in row i with the sum in column k divided by the total number of observations in the table (f++).

| | total n° observations in each sample | species 1 | species 2 | species 3 | species 4 |
|---|---|---|---|---|---|
| total abundance of each species | | 30 | 30 | 30 | 30 |
| sample 1 | 60 | 30 (15) | 10 (15) | 15 (15) | 5 (15) |
| sample 2 | 30 | 0 (7.5) | 20 (7.5) | 0 (7.5) | 10 (7.5) |
| sample 3 | 15 | 0 (3.75) | 0 (3.75) | 0 (3.75) | 15 (3.75) |
| sample 4 | 15 | 0 (3.75) | 0 (3.75) | 15 (3.75) | 0 (3.75) |

In the table above, the observed frequencies ($O_{ik}$) are shown, with the expected frequencies $E_{ik}$ between
brackets. The Pearson χ² statistic states that

$$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E},$$

or, in the case of the table above,

$$\chi_{ik} = \frac{O_{ik} - E_{ik}}{\sqrt{E_{ik}}} = \sqrt{f_{++}} \left[ \frac{p_{ik} - p_{i+}\,p_{+k}}{\sqrt{p_{i+}\,p_{+k}}} \right]$$

with $p_{ik}$ the relative frequency of each entry in the whole data matrix (i.e. each entry divided by $f_{++}$,
the sum of all frequencies over the whole table). Correspondence analysis is based on a matrix

$$Q = [q_{ik}] = \left[ \frac{p_{ik} - p_{i+}\,p_{+k}}{\sqrt{p_{i+}\,p_{+k}}} \right]$$

whose $q_{ik}$ values only differ from the $\chi_{ik}$ values by a constant factor ($\sqrt{f_{++}}$). Note that each value $q_{ik}$ is
the original relative frequency value $p_{ik}$ which has been centred and simultaneously weighted by the row
and the column totals. Likewise, an uncentred matrix

$$\tilde{Q} = [\tilde{q}_{ik}] = \left[ \frac{p_{ik}}{\sqrt{p_{i+}\,p_{+k}}} \right]$$

can be constructed, in which each value $\tilde{q}_{ik}$ is the original relative frequency value $p_{ik}$ which has been
simultaneously weighted by the reciprocals of the square roots of the sample unit (row, $p_{i+}$) and the
species (column, $p_{+k}$) totals.

From this matrix we now derive a p × p matrix which is analogous to a covariance matrix¹⁴, namely $\tilde{Q}'\tilde{Q}$.
Like in PCA this is a variance-covariance matrix but the cross-products are weighted. On this $\tilde{Q}'\tilde{Q}$ we
now perform an eigenanalysis, which produces the eigenvalue matrix Λ and the eigenvectors U.
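As a hedged illustration, the quantities defined above can be computed for the 4 × 4 example table with a few lines of NumPy. Variable names are illustrative assumptions. The sketch also checks one standard property that is not stated in the text: the uncentred matrix gives the same eigenvalues as the centred one, plus a single trivial eigenvalue equal to 1.

```python
import numpy as np

# Observed frequencies O_ik from the example table
# (rows = samples 1-4, columns = species 1-4).
O = np.array([
    [30, 10, 15,  5],
    [ 0, 20,  0, 10],
    [ 0,  0,  0, 15],
    [ 0,  0, 15,  0],
], dtype=float)

f_total = O.sum()                    # f++
row_sum = O.sum(axis=1)              # observations per sample
col_sum = O.sum(axis=0)              # abundance per species

# Expected frequencies under the null model of no row-column relationship.
E = np.outer(row_sum, col_sum) / f_total

# Standardized residuals chi_ik and the overall Pearson chi-square.
chi = (O - E) / np.sqrt(E)
chi2_total = (chi ** 2).sum()

# Relative frequencies and the centred / uncentred CA matrices.
P = O / f_total
p_row = P.sum(axis=1)                # p_i+
p_col = P.sum(axis=0)                # p_+k
weights = np.sqrt(np.outer(p_row, p_col))
Q = (P - np.outer(p_row, p_col)) / weights
Q_tilde = P / weights

# Eigenvalues, largest first.
eigval = np.linalg.eigvalsh(Q.T @ Q)[::-1]
eigval_tilde = np.linalg.eigvalsh(Q_tilde.T @ Q_tilde)[::-1]
total_inertia = eigval.sum()

print("expected frequencies:\n", E)
print("eigenvalues (centred):  ", np.round(eigval, 4))
print("eigenvalues (uncentred):", np.round(eigval_tilde, 4))
print("total inertia:", round(total_inertia, 4))
```

The expected frequencies reproduce the bracketed values in the table, and the total inertia equals the Pearson χ² divided by $f_{++}$.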

By analogy with the eigenanalysis in PCA (cf. steps 2-4 in PCA above), we can now calculate the
eigenvalues and the corresponding eigenvectors, and the scores for each sample on the ordination
axes. Note that as in PCA the sample scores are again a linear combination of the original species data
multiplied by the coefficient of each species contained in the eigenvectors.

14
Remember (see p.4) that a covariance matrix can be computed directly by multiplying the matrix of centred data with its
transpose.

## Unimodal response model as a conceptual basis for Correspondence Analysis

While PCA can be regarded as an extension of fitting straight lines and planes by least squares (LSQ)
regression (Jongman et al. 1995) using an iterative two-way weighted summation algorithm,
correspondence analysis can be viewed as an extension of the weighted averaging approach used in
direct gradient analysis (Jongman et al. 1995), and the site and species scores can be calculated using
an iterative weighted averaging approach.

Species commonly show bell-shaped curves with respect to environmental gradients. For example, in
the figure below, species A to E are plotted against a moisture gradient. A plant species may prefer a
particular soil moisture content, and not grow at all in places where the soil is either too dry or too wet.
In the example below, species A prefers drier conditions than species E. Note that in this figure
presence (1) - absence (0) data are shown on the y-axis; these data are only shown for species D which
is present at four of the sites.

Suppose we want to explain the abundance values of these species by moisture. We can obtain an
indication of where a species occurs along the moisture gradient by taking the average of the moisture
values of the sites in which the species occurs. This average is an estimate of the optimum of the
species. This average here is called the species score. The arrows in the figure point to the scores of
the five species. As a measure of how well moisture explains the species data, we use the dispersion
('spread') of the species scores. If the dispersion is large, moisture neatly separates the species scores
and explains the species data well. If the spread is small, moisture explains the variation in the species
data less well.

## (from Jongman et al. 1987)

We can now do the same for different environmental variables (note that these have to be standardized
first otherwise the dispersions cannot be compared), and try and find the variable that explains the
species data best. We might now try and find a variable which explains the data even better.
Correspondence analysis is the technique that constructs the theoretical variable that best explains the
species data. CA does this by choosing the best values for the sites, i.e. the values that maximize the
dispersion of the species scores (see figure). Note that as a consequence, the species curves have
become narrower. This theoretical variable is the first CA ordination axis. Second and further CA axes
can now also be constructed. These also maximize the dispersion of the species scores but subject to
the constraint of being uncorrelated to the previous CA axes.

If the species data were quantitative abundances, we would take a weighted average, i.e. the average of
the values of the (hypothetical) environmental variable over the sites in which the species occurs,
weighted by the abundance of the species at each site.
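The idea of species scores as (weighted) averages can be sketched as follows. All data are hypothetical. As a simplifying assumption, the dispersion is measured here by the plain variance of the species scores after standardizing the environmental variable; the text does not give an exact formula.

```python
import numpy as np

# Hypothetical data: 3 species (rows) observed at 6 sites (columns).
moisture = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
other_variable = np.array([3.0, 1.0, 2.0, 3.0, 1.0, 2.0])
abundance = np.array([
    [4, 2, 1, 0, 0, 0],
    [0, 1, 2, 4, 1, 0],
    [0, 0, 0, 1, 3, 4],
], dtype=float)
presence = (abundance > 0).astype(float)


def species_scores(Y, x):
    """Average of the site values x for each species (row of Y), weighted by Y."""
    return Y @ x / Y.sum(axis=1)


def dispersion(Y, x):
    """Spread of the species scores for a standardized variable (assumed definition)."""
    z = (x - x.mean()) / x.std()
    return species_scores(Y, z).var()


# Presence-absence data give a plain average over the occupied sites,
# abundance data give a weighted average.
scores_presence = species_scores(presence, moisture)
scores_abundance = species_scores(abundance, moisture)

print("species scores (presence-absence):", scores_presence)
print("species scores (abundance-weighted):", np.round(scores_abundance, 3))
print("dispersion for moisture:      ", round(dispersion(abundance, moisture), 3))
print("dispersion for other variable:", round(dispersion(abundance, other_variable), 3))
```

In this made-up example moisture separates the species scores much better than the second variable, so it would be the better explanatory variable of the two.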

Algorithm

The algorithm by which correspondence analysis was introduced in ecology is called reciprocal
averaging (Hill 1973). It is an extension of the method of weighted averaging and involves a process of
two-way weighted averaging.

SAWA = sample weighted average, SPWA = species weighted average (from Lepš & Šmilauer 2003)

The table above shows a simple example of an ecological data matrix, with 4 plant species and 3
samples. The process starts, as in PCA, by arbitrarily choosing initial site scores (initial values 0 to 10).

The first set of species scores (SPWA1) is derived from these initial sample scores by calculating the
weighted average (WA) of the site scores for each species:

$$u_k = \sum_{i=1}^{n} y_{ki}\,x_i \Big/ \sum_{i=1}^{n} y_{ki}$$

Using these species scores, a new set of sample scores (SAWA1) is calculated using weighted
averaging calibration:

$$x_i = \sum_{k=1}^{m} y_{ki}\,u_k \Big/ \sum_{k=1}^{m} y_{ki}$$

The sample scores are then rescaled to the range of the initial values (SAWA1resc) ('stretching the
axis').

After a few cycles of WA regression and calibration, the sample and species scores stabilize and
converge to a final set of definitive scores. We can now use these scores to plot the CA ordination
diagram.
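The two-way weighted averaging cycle can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data matrix `Y` is hypothetical (the values of the example table are not reproduced here), the rescaling is a simple min-max stretch back to the range of the initial scores, and only the first axis is computed.

```python
import numpy as np


def reciprocal_averaging(Y, x0, tol=1e-10, max_iter=1000):
    """First CA axis by two-way weighted averaging.

    Y  : abundance matrix, m species (rows) x n samples (columns), entries y_ki
    x0 : arbitrary, non-constant initial sample scores (length n)
    """
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x0, dtype=float)
    low, high = x.min(), x.max()
    for _ in range(max_iter):
        # WA regression: species scores u_k from the sample scores.
        u = Y @ x / Y.sum(axis=1)
        # WA calibration: new sample scores x_i from the species scores.
        x_new = Y.T @ u / Y.sum(axis=0)
        # Stretch the axis back to the range of the initial values.
        x_new = low + (high - low) * (x_new - x_new.min()) / (x_new.max() - x_new.min())
        converged = np.max(np.abs(x_new - x)) < tol
        x = x_new
        if converged:
            break
    # Species scores belonging to the final sample scores.
    u = Y @ x / Y.sum(axis=1)
    return x, u


# Hypothetical data: 5 species x 6 samples with a clear turnover in composition.
Y = np.array([
    [5, 3, 1, 0, 0, 0],
    [2, 4, 3, 1, 0, 0],
    [0, 1, 4, 4, 1, 0],
    [0, 0, 1, 3, 4, 2],
    [0, 0, 0, 1, 3, 5],
], dtype=float)
x0 = np.linspace(0.0, 10.0, Y.shape[1])   # initial sample scores 0 to 10

sample_scores, species_scores = reciprocal_averaging(Y, x0)
print("sample scores: ", np.round(sample_scores, 3))
print("species scores:", np.round(species_scores, 3))
```

Up to a shift and a change of scale, the stabilized sample scores coincide with the first-axis scores obtained from an eigenanalysis, which is why both routes lead to the same ordination.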

Scores of species and sites for the second and consecutive axes are calculated in a similar fashion, but
these axes have to be unrelated (orthogonal) to the variation shown along the first and other existing
axes. To achieve this, a so-called orthogonalization procedure is applied (not shown).

Below, the same procedure is shown (initial steps and final set) for the dune meadow data set.

Initial step in the iterative two-way weighted averaging algorithm (left) and second step. Note that the initial site
scores are not standardized as in the PCA example (from Jongman et al. 1987)

Final species and sample scores for a CA of the dune meadow data set (from Jongman et al. 1987)

## CA ordination diagram – CA distance diagram

As in PCA, the scores of the species and the sites of a CA can be plotted in an ordination diagram.
Usually, the first two axes, which capture most of the variation in the species data, are plotted against
one another. However, the rules for interpretation of the relationships between and amongst species
and sites are different. This is mainly due to the fact that in CA the species scores are weighted
averages of the site scores (and vice versa), and not slopes of regression lines against the axes.

## CA distance diagram (from Jongman et al. 1987)

As in PCA, sites that lie close together have a more similar species composition than sites that are
further apart. Site points lie at the centroid of the points of the species that occur in them. Sites that lie
close to a species point are therefore likely to have a high abundance of that species. Species points lie
at the centroid of the site points in which they occur. The scores of the species against each axis are the
optima of the species for these axes. The species points can therefore be viewed as lying on the tips of
imaginary hills with their expected abundances decreasing with distance in all directions from the point.
For example, in the figure above, Juncus bufonius is predicted to be most abundant (in decreasing
order) in 3, 9, 4, 13, 12 etc. In reality (see original species x site table) it is present in four sites: 9, 12, 13
and 7.

Species points at the edge of CA diagrams should be interpreted with care, as they are often rare
species whose position is either due to the fact that they prefer sites with more extreme conditions, or to
mere chance. In addition, species with optima close to the centre of the distance diagram either have a
unimodal response curve with respect to the axes, or bimodal curves, or are not related to the axes.
This can be checked by plotting the abundance of the species against the axes.

As in PCA, the output of a CA ordination analysis usually also lists the eigenvalues of each axis. This is
a measure of importance of an axis, or the fraction of the total variation in species composition which is
captured along that axis. In addition, the total inertia is also given. This is a measure for the total
variation in the species data. In CA, the total inertia is based on χ² distances. Note that as in PCA the
total variance is also equal to the sum of all the eigenvalues. While in PCA the Euclidean distances
between the sites are preserved, the distance preserved in CA is the χ² distance.

## Artefacts in PCA and CA

The figures below show an artificial data set, with three species showing a unimodal response against
an environmental gradient. In the table (lower panel), the first row of each of four different distance matrices
calculated for this data set is shown (only the upper one, Euclidean distance, and the bottom one, χ²
distance, are important here). Ordination methods aim at rendering this non-linear phenomenon in a
Euclidean space¹⁵ (two-dimensional ordination plots). In such plots, non-linear relationships will usually
show up as horseshoes or arches, depending on the distance measure used by the different
ordination methods. This is why it is imperative that PCA and CA are used for the kind of data they are
meant for. PCA is most appropriate for data sets in which the species show monotonic relationships
against the environmental gradients underlying the variation in species composition. Straight lines can
only be fitted against this kind of data; the underlying distance measure (between the sample units) that
is preserved is the Euclidean distance. CA is most appropriate for data sets in which there is a strong
turnover in species composition, i.e. the species show unimodal response curves along the axes, and
samples on both extremes of the axes have almost no species in common (like in the example below).

Distributions of 3 species at 19 sampling sites along a hypothetical environmental gradient (from Legendre &
Legendre 1998)

If the species increased or decreased monotonically along the gradient in the figure above, the
Euclidean distances would increase from one end of the gradient to the other, and PCA would be able
to correctly represent the original gradient. In the case of the artificial data set however, we see that the
EDs of site 1 against itself and the other sites increase, then decrease, then increase again etc. and
decrease again at the other end (sample 19). The most dramatic effect of the PCA algorithm is thus
found at the ends of the transect, which are folded inwards along axis one (the horseshoe effect, see
figure a below).

15
A Euclidean representation is a representation in a Cartesian coordinate system.


Upper panel: artificial data illustrated in the figure above. Lower panel: the first rows of four distance matrices,
comparing site 1 to itself and the 18 other sites (from Legendre & Legendre 1998)

This is caused by the interpretation (in the Euclidean distance measure) of shared zeros (samples at the
ends of the gradients) as an indication of positive relationship (hence small distance values between site
1 and 19 which both lack species 2). Remember also that ED is very sensitive to quantitative aspects of
the data sets: this causes sites 1 and 3 to be more distant than 1 and 19, although 1 and 3 both share
species 1.
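The behaviour of the Euclidean distance along a gradient with unimodal species can be reproduced with a small simulation. The data below are hypothetical Gaussian response curves chosen for illustration (optima, width and maximum abundance are assumptions); they are not the artificial data set of Legendre & Legendre.

```python
import numpy as np

# 19 sites along a gradient, 3 species with Gaussian (unimodal) responses.
sites = np.arange(1, 20)
optima = np.array([4.0, 10.0, 16.0])     # assumed species optima
tolerance = 2.0                          # assumed width of the curves
abundance = 10.0 * np.exp(
    -(sites[:, None] - optima[None, :]) ** 2 / (2.0 * tolerance ** 2)
)                                        # shape: 19 sites x 3 species

# Euclidean distance of site 1 to itself and to every other site.
ed_from_site1 = np.sqrt(((abundance - abundance[0]) ** 2).sum(axis=1))

for site, dist in zip(sites, ed_from_site1):
    print(f"site {site:2d}: ED to site 1 = {dist:6.2f}")
```

In this simulation the distances rise and fall repeatedly along the gradient. Site 19, which shares no species with site 1, ends up closer to site 1 than site 4, where the species of site 1 reaches its optimum.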
CA, which preserves the χ² distances between sites, is less sensitive to shared zeros. However,
in CA as well a so-called arch effect is observed (see figure below, c). Especially when there is a single
dominant gradient, this effect can be pronounced. In this situation, a second (independent) axis will be
created by folding the first axis in the middle and bringing the ends together. The second axis will then
have similar scores for sites that are on opposite sides along the first axis. Even if there were a true
second axis, CA would still choose this artificial second axis if it would spread the optima more than the
true CA axis. In addition, the WA algorithm tends to shrink the axes (cf. above), which also contributes
to the arch effect.

Ordinations from the data introduced above: (a) PCA axes 1 and 2; (c and bottom - a) CA axes 1 and 2 and
(bottom - b) DCA axes 1 and 2 (from Legendre & Legendre 1998)

In order to remove the arch effect from the ordinations, detrending was introduced (Detrended
Correspondence Analysis - DCA). Detrending can be achieved in different ways. Here we only show
detrending by segments. In detrending by segments, the first axis is divided into segments and, within
each segment, the detrended site scores on the 2nd axis are obtained by subtracting the mean of the CA
scores of the 2nd axis in that segment. As can be seen in the examples above and below, the arch
effects have been removed from the ordination after detrending by segments. Another popular way of
detrending is detrending by polynomials.
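Detrending by segments, as described above, can be sketched as follows. This is a deliberately simplified version with equal-width, non-overlapping segments; actual DCA software may differ in detail. The arch-shaped scores are hypothetical.

```python
import numpy as np


def detrend_by_segments(axis1, axis2, n_segments=5):
    """Subtract, within each segment of axis 1, the mean of the axis 2 scores."""
    axis1 = np.asarray(axis1, dtype=float)
    axis2 = np.asarray(axis2, dtype=float)
    # Equal-width segments along axis 1; only the interior edges are needed,
    # so that every site gets a segment index between 0 and n_segments - 1.
    edges = np.linspace(axis1.min(), axis1.max(), n_segments + 1)
    segment = np.digitize(axis1, edges[1:-1])
    detrended = axis2.copy()
    for s in np.unique(segment):
        in_segment = segment == s
        detrended[in_segment] -= axis2[in_segment].mean()
    return detrended, segment


# Hypothetical arch: the second axis is a quadratic function of the first.
axis1 = np.linspace(-1.0, 1.0, 20)
axis2 = axis1 ** 2
detrended, segment = detrend_by_segments(axis1, axis2, n_segments=5)

print("range of axis 2 before detrending:", round(axis2.max() - axis2.min(), 3))
print("range of axis 2 after detrending: ", round(detrended.max() - detrended.min(), 3))
```

After detrending, the mean of the second-axis scores is zero in every segment, so the systematic arch is flattened while the variation within segments is kept.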

## Detrending by segments (from McCune & Grace 2002)

DCA was very popular for a while, but has also been seriously criticized because sometimes
meaningful arch effects (i.e. curves that truly represented an underlying gradient) were removed from
the ordination diagrams, resulting in loss of information. The figure above also shows the effect of
detrending on the artificial data set introduced above.

Multidimensional scaling

Ordination is defined as a method that arranges site points in the best possible way in a continuum such
that points that are close together correspond to sites that are similar in species composition, and points
that are far apart correspond to sites that are dissimilar. A particular ordination technique is obtained by
further specifying what 'similar' means and what 'best' is. The definition suggests that we choose a
measure of dissimilarity between sites, replace the original species composition data by a matrix of
dissimilarity values between sites and work further from the dissimilarity matrix to obtain an ordination
diagram. This final step is termed multidimensional scaling.
In general, it is not possible to arrange sites such that the mutual distances between the sites in
the ordination diagram are equal to the calculated dissimilarity values. We therefore need a measure
that expresses in a single number how well or how badly the distances in the diagram correspond to the
dissimilarity values. Such a measure is termed a loss function or a stress function. In metric
ordination techniques such as PCA or CA the loss function depends on the actual numerical values of
the dissimilarities (in this case distances: ED or χ²); in non-metric techniques, the loss function depends only on
the rank order of the dissimilarities.
In PCA and CA we need not calculate a matrix of dissimilarities first, yet those techniques use
particular measures of dissimilarity (Euclidean and χ² distance respectively). Remember that the χ²
distance is based on proportional differences in the abundances of species, and Euclidean distance
involves absolute differences. Differences in site and species totals are therefore less influential in CA
than in PCA, unless a data transformation is used in PCA to correct for this effect.
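The difference between absolute and proportional differences can be illustrated with three hypothetical sites. The χ² distance formula used in the sketch is the usual textbook definition (site profiles compared with weights inversely proportional to the species totals); it is not spelled out in the text above, so treat it as an assumption of this example.

```python
import numpy as np


def euclidean_distance(Y):
    """Euclidean distances between the rows (sites) of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def chi2_distance(Y):
    """Chi-square distances between the rows (sites) of Y."""
    profiles = Y / Y.sum(axis=1, keepdims=True)   # relative abundances per site
    species_weight = Y.sum() / Y.sum(axis=0)      # y++ / y+j
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((species_weight * diff ** 2).sum(axis=-1))


# Hypothetical sites (rows) x species (columns).
# Sites A and B have the same proportions but very different totals;
# site C has a similar total to A but a reversed composition.
Y = np.array([
    [10.0, 20.0, 30.0],   # site A
    [ 1.0,  2.0,  3.0],   # site B
    [30.0, 20.0, 10.0],   # site C
])

ed = euclidean_distance(Y)
chi2 = chi2_distance(Y)
print("Euclidean A-B, A-C: ", round(ed[0, 1], 3), round(ed[0, 2], 3))
print("chi-square A-B, A-C:", round(chi2[0, 1], 3), round(chi2[0, 2], 3))
```

The Euclidean distance places A further from B than from C purely because of the difference in site totals, whereas the χ² distance between A and B is zero because their species proportions are identical.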

A simple metric technique for multidimensional scaling is principal coordinate analysis (PCoA) or
metric multidimensional scaling. This technique can be used to obtain a Euclidean representation of
a set of objects whose relationships are measured by any similarity or distance coefficient chosen by the
users. PCoA can be looked upon as the equivalent of PCA. However, in PCA, the principal components
are linear combinations of the original variables. Principal coordinates on the other hand are also
functions of the original variables, but mediated through the similarity or distance measure chosen. In
any case, PCoA can only fully represent in Euclidean space the Euclidean part of the data matrix under
study. This is not a property of the data but a result of the Euclidean model that is forced upon the data
because the objective is to draw scatter diagrams on flat sheets of paper. Whatever is not Euclidean in
the data set cannot be drawn on paper.
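The text does not describe how principal coordinates are computed. A common approach, assumed here, is Gower's double-centring of the squared distances followed by an eigenanalysis. The sketch applies it to the Euclidean distances between the four samples of the CA example table; in that case the principal coordinates reproduce the original distances exactly, which illustrates the equivalence with PCA.

```python
import numpy as np


def pairwise_euclidean(X):
    """Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pcoa(D):
    """Principal coordinates from a distance matrix D (assumed: Gower's method)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * centring @ (D ** 2) @ centring
    eigval, eigvec = np.linalg.eigh(G)
    order = np.argsort(eigval)[::-1]              # largest eigenvalue first
    eigval, eigvec = eigval[order], eigvec[:, order]
    # Keep only the axes with positive eigenvalues: the part of the
    # distance matrix that can be represented in Euclidean space.
    keep = eigval > 1e-10 * eigval.max()
    coords = eigvec[:, keep] * np.sqrt(eigval[keep])
    return coords, eigval


# Samples x species abundances from the CA example table.
Y = np.array([
    [30, 10, 15,  5],
    [ 0, 20,  0, 10],
    [ 0,  0,  0, 15],
    [ 0,  0, 15,  0],
], dtype=float)

D = pairwise_euclidean(Y)
coords, eigval = pcoa(D)
print("principal coordinates:\n", np.round(coords, 3))
print("eigenvalues:", np.round(eigval, 3))
```

With a non-Euclidean dissimilarity measure some eigenvalues can become negative; the corresponding axes are dropped by the `keep` filter, which is the part of the data that cannot be drawn on paper.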

The method of non-metric multidimensional scaling (NMDS) also obtains ordinations of objects from
any resemblance matrix. It is better than PCoA at compressing the distance relationships among objects
into two or three dimensions. It will always obtain a Euclidean representation, even from distances that are
not Euclidean-embeddable (i.e. which cannot be embedded or fully represented in a Euclidean space). NMDS
is favoured by some authors (e.g. McCune & Grace 2002) because it can 'see' a much wider range of
structures in the data set, while PCA and CA can only 'see' that portion of the configuration that fits a
limited perspective. NMDS is well-suited to data that are non-normal. NMDS appears to work better with
simulated data sets than PCA and CA (see below).

It is important to keep in mind that all ordination methods (PCoA, NMDS, PCA, CA, etc.) provide
Euclidean representations of point-objects (ordination diagrams), but unlike in PCA or CA, where
particular measures are used (and hence only certain relationships can be 'seen'), in PCoA and NMDS
any dissimilarity measure can be used.

NMDS, in contrast to other ordination techniques, is not concerned with preserving (as much as
possible) the distances present in the original multidimensional space defined by the species axes.
Instead, it tries to represent the objects in a small and specified number of dimensions (usually two or
three), with dissimilar objects being far apart and similar ones close to one another. The ordering
relationships (the ranking) between the objects are being preserved, not the exact distances.
Contrary to PCA, CA and PCoA, NMDS is not an eigenvector method. It does not maximize the
variability associated with the individual axes of the ordination. Instead, NMDS performs an iterative
search for the best positions of n entities on k dimensions (axes) that minimizes the stress of the k-
dimensional configuration.

## NMDS: how it works

- Step 1. Produce an n × n distance matrix (D = [Dhi]) computed from the original n × p matrix E using a
measure appropriate to the data at hand (with Dhi the empirical distance between objects h and i).

- Step 2. Specify a priori the number m of dimensions for scaling the objects. The output will then
provide coordinates of the n objects on the m axes.

- Step 3. Construct an initial configuration of the objects in m dimensions. These can be randomly
assigned numbers, or the scores of another ordination like PCA.

- Step 4. Calculate a matrix Δ of fitted distances dhi in the ordination space, using the Euclidean
distance.

- Step 5. Rank the elements of D in ascending order, then rank the elements of Δ in ascending order,
and plot these against one another in a bivariate graph (also referred to as a Shepard diagram, see
both figures below).

Moving a point to achieve monotonicity in a Shepard diagram (plotting the distances in the original space vs those
in m-space) (from McCune & Grace 2002). Note that here m-space is referred to as k-space, and Dhi as δhi

- Step 6. Regress dhi on Dhi. Values forecasted by the regression line are called $\hat{d}_{hi}$. The choice of
regression is up to the user, but often monotone regressions are applied. Monotone regression is
a step-function which is constrained never to decrease from left to right. The amount of movement
to achieve monotonicity is measured by the difference between $d$ and $\hat{d}$ (see figure above). The
sum of these squared differences is the basis for evaluating stress. The closer the points lie on a
monotonic line, the better the fit and the lower the stress.

Calculate raw stress:

$$S^* = \sum_{h,i} \left( d_{hi} - \hat{d}_{hi} \right)^2$$

- Step 7. Improve the configuration by moving it slightly in a direction of decreasing stress. This is
done by a numerical optimization algorithm called the method of steepest descent. The direction of
steepest descent is the direction in the space of solutions along which stress is decreasing most
rapidly.

- Step 8. Repeat steps 4-7 until the stress function reaches a small, predetermined value, or until
convergence is achieved, i.e. until it reaches a minimum and no further progress can be made.
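Steps 4 to 6 can be sketched for a given configuration as follows. This is a minimal illustration, not a full NMDS: the steepest-descent update of steps 7 and 8 is left out, the monotone regression is implemented with the pool-adjacent-violators procedure (an assumption, since the text does not name an algorithm), each pair of objects is counted once in the sum, and the data are hypothetical.

```python
import numpy as np


def pairwise_euclidean(X):
    """Fitted distances d_hi between the rows of the configuration X (step 4)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def monotone_regression(y):
    """Least-squares non-decreasing step function fitted to the sequence y."""
    blocks = []                          # each block holds [sum, count]
    for value in y:
        blocks.append([float(value), 1])
        # Pool neighbouring blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return np.concatenate([np.full(count, total / count) for total, count in blocks])


def raw_stress(D, X):
    """Raw stress S* of configuration X for the dissimilarity matrix D (steps 4-6)."""
    pairs = np.triu_indices(D.shape[0], k=1)          # each pair h < i once
    d = pairwise_euclidean(X)[pairs]
    order = np.argsort(D[pairs], kind="stable")       # rank order of D_hi (step 5)
    d_hat = np.empty_like(d)
    d_hat[order] = monotone_regression(d[order])      # fitted values (step 6)
    return float(((d - d_hat) ** 2).sum())


# Hypothetical example: four objects on a line. The dissimilarities D are a
# monotone (squared) function of the true distances.
X_true = np.array([[0.0], [1.0], [3.0], [7.0]])
D = pairwise_euclidean(X_true) ** 2

# A configuration in which objects 2 and 3 have been swapped.
X_swapped = np.array([[0.0], [3.0], [1.0], [7.0]])

stress_true = raw_stress(D, X_true)
stress_swapped = raw_stress(D, X_swapped)
print("raw stress, rank order preserved:", stress_true)
print("raw stress, two objects swapped: ", stress_swapped)
```

Because only the rank order matters, the first configuration has zero stress even though its distances are not equal to the dissimilarities. A full NMDS would repeatedly move the points of the second configuration in the direction that lowers this value.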

## Detrending by segments (from McCune & Grace 2002)

With the artificial data set introduced above, a one-dimensional NMDS almost perfectly reconstructed
the gradient from sites 1-19.

## Choice of indirect ordination method

The main difference between PCA/CA and NMDS is that PCA and CA use the original “species by sample”
matrix to extract ordination axes based on Euclidean distance or χ² distance measures, whereas NMDS
estimates distances between samples from a derived “sample by sample” matrix. This “sample by
sample” matrix is obtained by transforming the original “species by sample” matrix using a (dis)similarity
measure. NMDS thus has the advantage over e.g. PCA that it is not restricted to the Euclidean distance
measure: any (dis)similarity measure can be used, which can also relax the requirement of normality
of data. Another advantage is that NMDS can better deal with missing data because the (dis)similarity
between samples can be calculated from the measured variables only, whereas PCA needs a complete
“species by sample” matrix.
The quality of ordination spaces is best illustrated using artificial data sets (cf. the examples
above). The figures below show a CA, a NM(D)S and a PCA of a data set shown top left. This data set
has smooth, noiseless, unimodal species responses to a strong primary gradient and to an independent,
much weaker, secondary gradient. The species responses were sampled with a 10x3 grid of 30
sampling units, with the long axis of the grid corresponding to the stronger environmental gradient. In
the ordination diagrams, the points along the major grid axis have been connected.

Comparison of a 2D CA, NMDS and PCA (from McCune & Grace 2002)

NMDS was most successful at recovering both gradients, while CA and PCA (partly) recover the first
gradient but struggle with the second gradient.

On the other hand, the fact that NMDS does not use the original “species by sample” matrix also has
disadvantages, which can be quite serious depending on the aim of the ordination analysis. First and
foremost, NMDS does not allow a simultaneous ordination of the descriptors (the species). Second,
because NMDS does not use the original matrix, an evaluation in terms of displayed percentage
variance (eigenvalues) is not possible.

In conclusion, it can be stated that all methods have advantages and disadvantages, and the choice
between methods often depends on 'tradition' in a particular lab or field of research. The best way to
proceed with the analysis of an ecological data set is to compare different ordinations. It is also always
advisable to check the underlying response model of the species data.

## Interpretation of indirect ordinations with external data

Indirect ordination techniques summarize relationships between samples and species and order them
along ordination axes, which represent the main gradients in species composition and can therefore be
regarded as hypothetical environmental gradients which underlie the gradients in species composition.
These gradients are usually afterwards interpreted with external knowledge on the sites and the
species. There are several ways to do this, some of which are informal, while others are strictly formal
(statistical tests). Most try and relate environmental parameters associated with the sites to their
ordering along the ordination axes. This can be done by plotting the scores of an axis against the values
of an environmental variable (see figure below; more formally, the DCA scores could be regressed
against the manure data), or by calculating the correlation coefficients between the environmental
variables and the ordination axes.
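Both approaches, correlation and regression of axis scores on an environmental variable, take only a few lines. The site scores and manure classes below are hypothetical values for illustration, not the dune meadow data.

```python
import numpy as np

# Hypothetical site scores on an ordination axis and the amount of manure
# (ordinal classes) recorded at the same eight sites.
axis_scores = np.array([1.2, 0.9, 0.5, 0.1, -0.2, -0.4, -0.9, -1.1])
manure = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0])

# Correlation coefficient between the axis and the environmental variable.
r = np.corrcoef(axis_scores, manure)[0, 1]

# More formally: least-squares regression of the axis scores on manure.
slope, intercept = np.polyfit(manure, axis_scores, 1)

print(f"100 x r = {100 * r:.0f}")
print(f"axis score = {intercept:.2f} + {slope:.2f} * manure")
```

A strongly negative coefficient, as in this made-up example, would indicate that heavily manured sites lie on the negative side of the axis.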

Site scores of second DCA axis of the dune meadow data set plotted against the amount of manure (see box p.
15)(left) and correlation coefficients (100 x r) of environmental variables with the first four DCA axes of the same
data set (from Jongman et al. 1987)

Classes of nominal or ordinal environmental variables can be plotted in the ordination diagram (see below).
From all examples it should be clear that the second axis in the (D)CA diagrams appears to be related
to the management type of the meadows: sites which are intensively used for farming (SF), and which
as a result receive higher amounts of manure, are found on the negative side of the second axis, while
'unspoilt' sites (nature reserves) are on the positive side of the axis. Keep in mind however that this axis
represents the second most important gradient in species composition. The main gradient in species
composition is the first ordination axis, which is related to other factors (which ones?).
