
BIO3060 2016: Practical 1

Hierarchical Cluster Analysis using PAST v.3
Sandro Lanfranco
Department of Biology, University of Malta

September 2016

General
Hierarchical cluster analysis is a multivariate procedure that can be used to detect patterns in ecological
data. It can work with abundance data or with binary presence-absence data.

Software
The material in this guide is based on the procedures carried out using PAST (Hammer, Harper, and Ryan,
2001). This is free statistical software (http://folk.uio.no/ohammer/past/) that was originally designed for
paleontological analyses but that is also very well-suited to ecological applications. At the time of writing,
the current version was 3.13 (August 2016). The software is updated frequently.

Procedure
Preparing the data
(1) Prepare a data matrix with sites/locations in rows and species in columns. In the case
of presence/absence data (as in BIO3060 Practical 1), the occurrence of a species should be
indicated by a ‘1’ and its absence by a ‘0’. The matrix shown in Figure 1 shows the occurrence of
15 species of endemic plants in 20 locations around the Maltese Islands (Galea, 2016). The matrix
was originally edited in Microsoft Excel and subsequently copied into PAST.
(2) To enter data in PAST, tick the ‘Row attributes’ and ‘Column attributes’ boxes on the
left-hand side of the top ribbon and paste the copied data into the cell labelled ‘Name’ in both
rows and columns (Figure 2). After the data have been entered, untick the ‘Row attributes’ and
‘Column attributes’ boxes.
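The matrix layout described in step (1) can be sketched in a few lines of Python. This is only an illustration and not part of the PAST workflow: the site and species names are made up, the helper names `presence_absence` and `to_tab_text` are hypothetical, and the empty top-left corner cell in the tab-separated output is an assumption about how the block would be laid out in a spreadsheet.

```python
# Hypothetical field records: one (site, species) pair per observation.
# All names and values here are invented for illustration.
records = [
    ("SiteA", "sp1"), ("SiteA", "sp3"),
    ("SiteB", "sp1"), ("SiteB", "sp2"),
    ("SiteC", "sp2"),
]


def presence_absence(records):
    """Return (sites, species, matrix) with sites in rows and species in columns."""
    sites = sorted({site for site, _ in records})
    species = sorted({sp for _, sp in records})
    seen = set(records)
    # 1 marks a recorded occurrence, 0 marks an absence.
    matrix = [[1 if (site, sp) in seen else 0 for sp in species] for site in sites]
    return sites, species, matrix


def to_tab_text(sites, species, matrix):
    """Format the matrix as tab-separated text with species as column headers."""
    lines = ["\t".join([""] + species)]  # empty corner cell (assumed layout)
    for site, row in zip(sites, matrix):
        lines.append("\t".join([site] + [str(value) for value in row]))
    return "\n".join(lines)


sites, species, matrix = presence_absence(records)
table_text = to_tab_text(sites, species, matrix)
print(table_text)
```

Tab-separated text of this kind can be opened in a spreadsheet editor, checked by eye, and then copied into PAST as described in step (2).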

Figure 1: Presence-absence data for endemic plants of the Maltese Islands. Species names are in columns and site names are in
rows. Matrix compiled by Christine M Galea (2016).


Figure 2: Data screen in PAST v.3.13. The boxes that need to be ticked, and the cell where data should be pasted from another
editor are indicated.

Running the analysis


(1) Select all the data for analysis by clicking on the top left-hand cell of the spreadsheet in
PAST.
(2) Select ‘Multivariate>Clustering>Classical’ from the top menu.
(3) In the new window that appears (Figure 3), set ‘Algorithm’ to ‘Paired Group (UPGMA)’
and ‘Similarity Index’ to ‘Jaccard’. The Jaccard Coefficient is suitable for presence/absence data.
All other options can be left unchanged for the purposes of this analysis.
(4) Maximise the window and click on the ‘Compute’ button. A dendrogram similar to the one
shown in Figure 3 should appear.
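To make the Jaccard coefficient selected in step (3) concrete, the sketch below computes it for pairs of toy sites. It is a minimal illustration and not PAST's implementation: the function name `jaccard`, the toy site vectors, and the value returned when neither site contains any species are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two equal-length 0/1 presence-absence lists.

    Similarity = (species present at both sites) / (species present at either site).
    Species absent from both sites are ignored.
    """
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # If neither site has any species the ratio is undefined; returning 1.0
    # here is an assumed convention, not something taken from PAST.
    return shared / either if either else 1.0


# Toy sites (invented): each position is one species.
site_a = [1, 1, 0, 1, 0]
site_b = [1, 0, 0, 1, 1]
site_c = [0, 0, 1, 0, 0]

sim_ab = jaccard(site_a, site_b)  # 2 shared species out of 4 present at either site
sim_aa = jaccard(site_a, site_a)  # identical composition
sim_ac = jaccard(site_a, site_c)  # no species in common
print(sim_ab, sim_aa, sim_ac)
```

Because joint absences do not enter the calculation, two sites are not treated as similar merely because both lack the same species, which is one reason the index suits presence/absence data.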


Figure 3: Hierarchical clustering window in PAST

Interpreting the results


(1) The dendrogram (Figure 3, Figure 4) has a similarity scale, ranging from 0 to 1 (0% to 100%) along
the left-hand side. Pairs of sites, or groups of sites, are arranged in ‘clusters’ according to the
similarity of their species composition. Sites with identical species composition have a similarity
of 100% and sites that are very different would be expected to have a similarity closer to zero.
(2) A cursory inspection of the dendrogram suggests that there are two main groups of sites (both
highlighted in shaded boxes in Figure 4). The larger cluster may, in turn, be subdivided into smaller
clusters (enclosed by an outline in Figure 4). There is no fixed rule for identifying clusters. This is
an exploratory technique, and the interpretation is dependent on your judgement as a biologist.
Look for patterns that are ecologically meaningful.
(3) In this specific case, inspection of the dendrogram suggests three principal groups of sites:
a. Cluster 1: HQ, MG, Q2, CJ, QM
b. Cluster 2: D2, LA, PM, RT, SP, DA, Q1, GH, TW, D1, ZN
c. Cluster 3: DW, XL, MF
d. Site MK seems to be an outlier, but is closer to Clusters 1 and 2 than to Cluster 3.
(4) These observations may suggest three ecological zones, three groups of habitats, etc., depending
on the nature of the study in question. It should be emphasised that cluster analysis, like other
multivariate procedures, only summarises complex variation into a simpler pattern and that,
in doing so, information is necessarily lost. This should be borne in mind when interpreting the
results of the analysis.
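The similarity levels at which branches join on the dendrogram can be illustrated with a small average-linkage (UPGMA) sketch. This is a simplified teaching version under stated assumptions, not PAST's algorithm as implemented: the site vectors are invented, the function names are hypothetical, and ties are broken simply by taking the first pair found.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two equal-length 0/1 lists."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return shared / either if either else 1.0  # empty-site case: assumed convention


def upgma(sites):
    """Cluster sites by average linkage; return merges as (cluster, cluster, similarity)."""
    names = list(sites)
    sim = {frozenset(pair): jaccard(sites[pair[0]], sites[pair[1]])
           for pair in combinations(names, 2)}

    def average_similarity(c1, c2):
        # Mean of all pairwise similarities between members of the two clusters.
        total = sum(sim[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the highest average similarity.
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_similarity(*pair))
        merges.append((c1, c2, average_similarity(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Invented toy data: four sites, six species.
toy_sites = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [1, 1, 0, 0, 0, 0],
    "C": [0, 0, 0, 1, 1, 1],
    "D": [0, 0, 1, 1, 1, 1],
}
merges = upgma(toy_sites)
for left, right, level in merges:
    print(left, "+", right, "joined at similarity", round(level, 3))
```

Each recorded level corresponds to the height on the similarity scale at which two branches meet. In this toy example the final join occurs at a very low similarity, which is the pattern that would suggest two well-separated groups of sites.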


Figure 4: Dendrogram showing results of cluster analysis (Paired-Group linkage; Jaccard similarity) on the data in the matrix in
Figure 1.

How ‘different’ are the clusters?


The hypothesis that three groups of sites are present may be tested using ANOSIM (‘Analysis of
Similarities’). Other procedures, such as PERMANOVA (‘Permutational Multivariate Analysis of Variance’),
are also suitable for this purpose.

(1) In the PAST data matrix, please tick the ‘Row attributes’ and ‘Column attributes’ boxes
and insert a new column before the first species column. Use the ‘Edit>Insert more columns ...’
command from the top menu to do this. The new column will be labelled ‘c1’ by default.
(2) With the newly-added column ‘c1’ selected, click on the cell in the ‘Type’ row. This should
bring up a drop-down menu with the options ‘Group’, ‘Ordinal’, ‘Nominal’, and ‘Binary’. Please
select the ‘Group’ option, indicating that the values in this column are grouping variables (indicating
membership of a cluster).
(3) Enter the cluster within which each site has been classified in column c1 (Figure 5).
(4) Select all the data by clicking on the cell in the top left-hand corner of the spreadsheet.
(5) Select the ANOSIM procedure by selecting ‘Multivariate>Tests>One-way ANOSIM’ from the
top menu.
(6) Set the ‘Similarity Index’ to ‘Jaccard’ in the pop-up window that appears. Leave the number
of permutations at the default value. Click the ‘Recompute’ button.
(7) The results will appear in a window as shown in Figure 6.
(8) The important results are the R value and the p value. The R value can in principle range from
-1 to +1, but in practice usually lies between 0 and +1, with values closer to +1 indicating higher
dissimilarity between the clusters being compared, and values closer to 0 indicating greater
similarity. The p value indicates whether this separation is statistically significant. In the case
of this analysis, the R value (0.63) suggests a distinct dissimilarity between the species
composition of the sites in Clusters 1, 2, and 3. The p value of 0.0002 indicates that this
dissimilarity is also statistically significant.


(9) Clicking on the ‘pairwise’ tab shows a matrix comparing each pair of clusters with each other.
The R values and p values associated with each comparison can be shown (Figure 7, Figure 8). In
this case the very high R values of the comparisons between Cluster 1 and Cluster 3 (R=0.9949)
and between Cluster 2 and Cluster 3 (R=0.8386) suggest that the species composition for the sites
in Cluster 3 is very different when compared to those in Clusters 1 and 2.
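The R and p values reported by ANOSIM can be illustrated with the following sketch, which ranks all pairwise Jaccard distances, compares the mean rank between groups with the mean rank within groups, and then shuffles the group labels to obtain a permutation p value. It is a minimal sketch rather than PAST's code: the toy data, the function names, the fixed random seed, and the (hits + 1) / (permutations + 1) form of the p value are assumptions.

```python
import random
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus Jaccard similarity for two equal-length 0/1 lists."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1 - shared / either if either else 0.0  # empty-site case: assumed convention


def average_ranks(values):
    """Rank values from 1 (smallest); tied values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def r_statistic(ranks, pairs, groups):
    """R = (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    n = len(groups)
    return (sum(between) / len(between) - sum(within) / len(within)) / (n * (n - 1) / 4)


def anosim(rows, groups, permutations=999, seed=0):
    """Return (R, p) for a one-way ANOSIM on presence-absence rows."""
    pairs = list(combinations(range(len(rows)), 2))
    ranks = average_ranks([jaccard_distance(rows[i], rows[j]) for i, j in pairs])
    observed = r_statistic(ranks, pairs, groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # reassign sites to groups at random
        if r_statistic(ranks, pairs, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Invented toy data: two groups of three sites with no species in common.
rows = [
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 0, 1],
]
groups = [1, 1, 1, 2, 2, 2]
r_value, p_value = anosim(rows, groups)
print("R =", round(r_value, 3), "p =", round(p_value, 4))
```

In this toy case every between-group distance exceeds every within-group distance, so R reaches its maximum of 1. With only six sites there are few distinct ways of reassigning the labels, so the p value cannot become very small, which illustrates why R and p should be read together.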


Figure 5: Data matrix in PAST showing the addition of a grouping variable indicating cluster membership for each site.


Figure 6: Results of ANOSIM procedure from PAST

Figure 7: Pairwise p-values from ANOSIM

Figure 8: Pairwise R values from ANOSIM



Why do the clusters differ?


At this stage in the analysis of data, the ANOSIM procedure would have indicated whether the clusters
identified through inspection of the dendrogram are ‘distinct’ from each other. The next step would be
to establish why those differences between clusters exist. This can be carried out using the SIMPER
(‘Similarity Percentage’) procedure in PAST.

(1) Select all the data in the matrix by clicking on the cell in the top left-hand corner of the
PAST spreadsheet.
(2) Select the ‘Multivariate>Tests>SIMPER’ command from the top menu. A results window
similar to the one shown in Figure 9 should appear.
(3) Set the ‘Distance/similarity measure’ to ‘Euclidean’ from the drop-down menu, and
select the two clusters to be compared (1 v 2, 1 v 3, or 2 v 3, in this case). Click on the
‘Recompute’ button when the required options have been selected. The results will change
depending on the new options selected.
(4) The SIMPER results (Figure 9) list the taxa (in this case, species) that are causing the
difference between the selected clusters.
(5) The third column (‘Contrib.%’) shows the contribution of each species listed to the
differences between the selected clusters. In the example shown in Figure 9, where Cluster 1
and Cluster 2 are being compared, the species Chiliadenus bocconei contributes 26.96% of the
difference between the two clusters.
(6) The fifth and sixth columns (‘Mean 1’ and ‘Mean 2’) show the proportion of sites in
which a species is present for a given cluster. For the data shown in Figure 9, Chiliadenus
bocconei was not present in any of the five sites in Cluster 1 (‘Mean 1’ =0) but was recorded from
11 out of the 12 sites in Cluster 2 (‘Mean 2’ =0.917). Similarly, Anthemis urvilleana was not
present in any of the Cluster 1 sites and recorded from half the Cluster 2 sites (‘Mean 2’ =0.5).
(7) From the ‘Contrib.%’ column, it can be seen that A. urvilleana contributes 14.71% of the
difference between Cluster 1 sites and Cluster 2 sites. The cumulative contribution of the first
two species to the difference between Clusters 1 and 2 (fourth column: ‘Cumulative %’) is
therefore 41.67%.
(8) Taken together, these results suggest that the principal differences between Cluster 1 and
Cluster 2 sites arise from the presence of C. bocconei and A. urvilleana in the Cluster 2 sites, whereas
these species were absent from all of the Cluster 1 sites.
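The arithmetic behind the ‘Mean’ and ‘Cumulative %’ columns described in steps (6) and (7) can be checked with a short sketch. It uses only the figures quoted above (presence in 11 of 12 sites; contributions of 26.96% and 14.71%); the order of the 0/1 values in the column and the helper name `mean_presence` are hypothetical, and the sketch does not reproduce how PAST derives the contribution percentages themselves.

```python
def mean_presence(column):
    """Proportion of sites in a cluster at which a species was recorded."""
    return sum(column) / len(column)


# A 0/1 column for one species across the 12 sites of a cluster:
# present at 11 sites, absent from 1 (the order of values is arbitrary).
cluster_2_column = [1] * 11 + [0]
mean_2 = round(mean_presence(cluster_2_column), 3)

# Contribution percentages as quoted in the text for the first two species.
contributions = [
    ("Chiliadenus bocconei", 26.96),
    ("Anthemis urvilleana", 14.71),
]

# The cumulative column is a running total of the contribution column.
running_total = 0.0
cumulative = []
for species_name, percent in contributions:
    running_total += percent
    cumulative.append((species_name, round(running_total, 2)))

print("Mean 2 =", mean_2)
for species_name, total in cumulative:
    print(species_name, "cumulative % =", total)
```

For presence/absence data the mean of a 0/1 column is simply the fraction of sites occupied, which is why ‘Mean 2’ for C. bocconei equals 11/12.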

Figure 9: Results of SIMPER procedure in PAST


Summary
In conclusion, a cluster analysis should comprise the following steps:
(1) Entry of data
(2) Choice of an appropriate linkage method and similarity index
(3) Clustering of data and generation of a dendrogram
(4) Recognition of meaningful clusters from the dendrogram
(5) Testing for distinctness of these clusters using ANOSIM
(6) Testing for the contribution of individual species to differences across clusters using SIMPER
In all of this, please bear in mind that using statistics can lend support to your arguments/conclusions but
is, in itself, not a substitute for your scientific reasoning as a biologist.

References
Galea, C.M. (2016). Trait characteristics of endemic plants of the Maltese Islands. Unpublished Bachelor
of Science dissertation. Faculty of Science, University of Malta: xiv+66pp.

Hammer, Ø., Harper, D. A. T., & Ryan, P. D. (2001). PAST: Paleontological Statistics Software Package for
Education and Data Analysis. Palaeontologia Electronica 4(1): 9 pp.
