http://palaeo-electronica.org
ABSTRACT
Hammer, Øyvind, Harper, David A.T., and Paul D. Ryan, 2001. Past: Paleontological Statistics Software Package for Education and
Data Analysis. Palaeontologia Electronica, vol. 4, issue 1, art. 4: 9pp., 178kb.
http://palaeo-electronica.org/2001_1/past/issue1_01.htm.
Øyvind Hammer, David A. T. Harper, and Paul D. Ryan: PALEONTOLOGICAL STATISTICS SOFTWARE
STAT gained a wide user base among both paleontologists and biologists.

After some years of service, however, it was becoming clear that PALSTAT had to undergo major revision. The DOS-based user interface and an architecture designed for computers with minuscule memories (by modern standards) were becoming an obstacle for most users. Also, the field of quantitative paleontology has changed and expanded considerably in the last 15 years, requiring the implementation of many new algorithms. Therefore, in 1999 we decided to redesign the program totally, keeping the general concept but without concern for the original source code. The new program, called PAST (PAleontological STatistics), takes full advantage of the Windows operating system, with a modern, spreadsheet-based user interface and extensive graphics. Most PAST algorithms produce graphical output automatically, and the high-quality figures can be printed or pasted into other programs. The functionality has been extended substantially with inclusion of important algorithms in the standard PAST toolbox. Functions found in PAST that were not available in PALSTAT include (but are not limited to) parsimony analysis with cladogram plotting, detrended correspondence analysis, principal coordinates analysis, time-series analysis (spectral and autocorrelation), geometrical analysis (point distribution and Fourier shape analysis), rarefaction, modelling by nonlinear functions (e.g., logistic curve, sum-of-sines) and quantitative biostratigraphy using the unitary associations method. We believe that the functions we have implemented reflect the present practice of paleontological data analysis, with the exception of some functionality that we hope to include in future versions (e.g., morphometric analysis with landmark data and more methods for the validation and correction of diversity curves).

One of the main ideas behind PAST is to include many functions in a single program package while providing for a consistent user interface. This minimizes time spent on searching for, buying, and learning a new program each time a new method is approached. Similar projects are being undertaken in other fields (e.g., systematics and morphometry). One example is Wayne Maddison’s ‘Mesquite’ package (http://mesquite.biosci.arizona.edu/mesquite/mesquite.html).

An important aspect of PALSTAT was the inclusion of case studies, including data sets designed to illustrate possible uses of the algorithms. Working through these examples allowed the student to obtain a practical overview of the different methodologies in a very efficient way. Some of these case studies have been adjusted and included in PAST, and new case studies have been added in order to demonstrate the new features. The case studies are primarily designed as student exercises for courses in paleontological data analysis. The PAST program, documentation, and case studies are available free of charge at http://www.nhm.uio.no/~ohammer/past.

PLOTTING AND BASIC STATISTICS

Graphical plotting functions (see http://www.nhm.uio.no/~ohammer/past/plot.html) in PAST include different types of graph, histogram, and scatter plots. The program can also produce ternary (triangle) plots and survivorship curves.

Descriptive statistics (see http://www.nhm.uio.no/~ohammer/past/univar.html) include minimum, maximum, and mean values, population variance, sample variance, population and sample standard deviations, median, skewness, and kurtosis.
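
As an illustration only, the descriptive statistics listed above can be sketched in a few lines of Python. This is not PAST's own code: the function name `describe` is hypothetical, and the skewness and kurtosis here use the simple moment-based (population) definitions, which is an assumption, since the text does not say which estimators PAST uses.

```python
import math


def describe(values):
    """Return basic univariate statistics for a sequence of numbers.

    Hypothetical helper for illustration; not part of PAST.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    sum_sq = sum(d * d for d in deviations)
    pop_var = sum_sq / n             # population variance: divide by n
    sample_var = sum_sq / (n - 1)    # sample variance: divide by n - 1
    ordered = sorted(values)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    if pop_var == 0:
        # Shape statistics are undefined when all values are identical.
        skewness = kurtosis = math.nan
    else:
        # Assumed estimators: third and fourth standardized central moments.
        m3 = sum(d ** 3 for d in deviations) / n
        m4 = sum(d ** 4 for d in deviations) / n
        skewness = m3 / pop_var ** 1.5
        kurtosis = m4 / pop_var ** 2 - 3.0   # excess kurtosis (normal = 0)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": mean,
        "pop_var": pop_var,
        "sample_var": sample_var,
        "pop_sd": math.sqrt(pop_var),
        "sample_sd": math.sqrt(sample_var),
        "median": median,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
```

The only difference between the population and sample forms is the divisor (n versus n - 1), which matters most for the small samples common in paleontological data.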
become rare for lower and higher values (this is in contrast to PCA, which assumes a linear response). The CA algorithm employed in PAST is taken from Davis (1986), which also includes a more detailed description of the method and example analysis. Ordination of both samples and taxa can be plotted in the same CA coordinate system, whose axes will normally be interpreted in terms of environmental parameters (e.g., water depth, type of substrate, temperature).

The Detrended Correspondence Analysis (DCA) module uses the same ‘reciprocal averaging’ algorithm as the program Decorana (Hill and Gauch 1980). It is specialized for use on “ecological” data sets with abundance data (taxa in rows, localities in columns), and it has become a standard method for studying gradients in such data. Detrending is a type of normalization procedure in two steps. The first step involves an attempt to “straighten out” points lying along an arch-like pattern (= Kendall’s Horseshoe). The second step involves “spreading out” the points to avoid artificial clustering at the edges of the plot.

Hierarchical clustering routines produce a dendrogram showing how and where data points can be clustered (Davis 1986, Harper 1999). Clustering is one of the most commonly used methods of multivariate data analysis in paleontology. Both R-mode clustering (grouping of taxa) and Q-mode clustering (grouping of variables or associations) can be carried out within PAST by transposing the data matrix. Three different clustering algorithms are available: the unweighted pair-group average (UPGMA) algorithm, the single linkage (nearest neighbor) algorithm, and Ward’s method. The similarity-association matrix upon which the clusters are based can be computed using nine different indices: Euclidean distance, correlation (using Pearson’s r or Spearman’s ρ), Bray-Curtis, chord and Morisita indices for abundance data, and Dice, Jaccard, and Raup-Crick indices for presence-absence data.

Seriation of an absence-presence matrix can be performed using the algorithm described by Brower and Kyle (1988). For constrained seriation, columns should be ordered according to some external criterion (normally stratigraphic level) or positioned along a presumed faunal gradient. Seriation routines attempt to reorganize the data matrix such that the presences are concentrated along the diagonal. Also, in the constrained mode, the program runs a ‘Monte Carlo’ simulation to determine whether the original matrix is more informative than a random matrix. In the unconstrained mode both rows and columns are free to move: the method then amounts to a simple form of ordination.

The degree of separation between two hypothesized groups (e.g., species or morphs) can be investigated using discriminant analysis (Davis 1986). Given two sets of multivariate data, an axis is constructed that maximizes the differences between the sets. The two sets are then plotted along this axis using a histogram. The null hypothesis of equal group means is tested using Hotelling’s T² test.

CURVE FITTING AND TIME-SERIES ANALYSIS

Curve fitting (see http://www.nhm.uio.no/~ohammer/past/fitting.html) in PAST includes a range of linear and non-linear functions.

Linear regression can be performed with two different algorithms: standard (least-squares) regression and the “Reduced Major Axis” method. Least-squares regression keeps the x values fixed, and it finds the line that minimizes the squared errors in the y values. Reduced Major Axis minimizes both the x
and the y errors simultaneously. Both x and y values can also be log-transformed, in effect fitting the data to the “allometric” function $y = 10^{b} x^{a}$. An allometric slope value around 1.0 indicates that an “isometric” fit may be more applicable to the data than an allometric fit. Values for the regression slope and intercepts, their errors, a χ² correlation value, Pearson’s r coefficient, and the probability that the columns are not correlated are given.

In addition, the sum of up to six sinusoids (not necessarily harmonically related) with frequencies specified by the user, but with unknown amplitudes and phases, can be fitted to bivariate data. This method can be useful for modeling periodicities in time series, such as annual growth cycles or climatic cycles, usually in combination with spectral analysis (see below). The algorithm is based on a least-squares criterion and singular value decomposition (Press et al. 1992). Frequencies can also be estimated by trial and error, by adjusting the frequency so that amplitude is maximized.

Further, PAST allows fitting of data to the logistic equation $y = a/(1 + b e^{-cx})$, using Levenberg-Marquardt nonlinear optimization (Press et al. 1992). The logistic equation can model growth with saturation, and it was used by Sepkoski (1984) to describe the proposed stabilization of marine diversity in the late Palaeozoic. Another option is fitting to the von Bertalanffy growth equation $y = a(1 - b e^{-cx})$. This equation is used for modeling growth of multi-celled animals (Brown and Rothery 1993).

Searching for periodicities in time series (data sampled as a function of time) has been an important and controversial subject in paleontology in the last few decades, and we have therefore implemented two methods for such analysis in the program: spectral analysis and autocorrelation. Spectral (harmonic) analysis of time series can be performed using the Lomb periodogram algorithm, which is more appropriate than the standard Fast Fourier Transform for paleontological data (which are often unevenly sampled; Press et al. 1992). Evenly-spaced data are of course also accepted. In addition to the plotting of the periodogram, the highest peak in the spectrum is presented with its frequency and power value, together with a probability that the peak could occur from random data. The data set can be optionally detrended (linear component removed) prior to analysis. Applications include detection of Milankovitch cycles in isotopic data (Muller and MacDonald 2000) and searching for periodicities in diversity curves (Raup and Sepkoski 1984). Autocorrelation (Davis 1986) can be carried out on evenly sampled temporal-stratigraphical data. A predominantly zero autocorrelation signifies random data; periodicities turn up as peaks.

GEOMETRICAL ANALYSIS

PAST includes some functionality for geometrical analysis (see http://www.nhm.uio.no/~ohammer/past/morpho.html), even if an extensive morphometrics module has not yet been implemented. We hope to implement more extensive functionality, such as landmark-based methods, in future versions of the program.

The program can plot rose diagrams (polar histograms) of directions. These can be used for plotting current-oriented specimens, orientations of trackways, orientations of morphological features (e.g., trilobite terrace lines), etc. The mean angle and Rayleigh’s spread are given. Rayleigh’s spread is further tested against a random distribution using Rayleigh’s test for directional data (Davis 1986). A χ² test is also available, giving
the probability that the directions are randomly and evenly distributed.

Point distribution statistics using nearest neighbor analysis (modified from Davis 1986) are also provided. The area is estimated using the convex hull, which is the smallest convex polygon enclosing the points. The probability that the distribution is random (Poisson process, giving an exponential nearest neighbor distribution) is presented, together with the ‘R’ value. Clustered points give R<1, Poisson patterns give R~1, while over-dispersed points give R>1. Applications of this module include spatial ecology (are in-situ brachiopods clustered?) and morphology (are trilobite tubercles over-dispersed? See Hammer 2000).

The Fourier shape analysis module (Davis 1986) accepts x-y coordinates digitized around an outline. More than one shape can be analyzed simultaneously. Points do not need to be evenly spaced. The sine and cosine components are given for the first ten harmonics, and the coefficients can then be copied to the main spreadsheet for further analysis (e.g., by PCA). Elliptic Fourier shape analysis is also provided (Kuhl and Giardina 1982). For an application of elliptic Fourier shape analysis in paleontology, see Renaud et al. (1996).

Character states are coded using integers in the range 0 to 255. The first taxon is treated as the outgroup and will be placed at the root of the tree. Missing values are coded with a question mark. There are four algorithms available for finding short trees: branch-and-bound (finds all shortest trees), exhaustive (finds all shortest trees, and allows the plotting of tree-length distribution), heuristic nearest neighbor interchange (NNI) and heuristic subtree pruning and regrafting (SPR). Three different optimality criteria are available: Wagner (reversible and ordered characters), Fitch (reversible and unordered characters), and Dollo (irreversible and ordered). Bootstrapping can be performed with a given number of replicates.

All shortest (most parsimonious) trees can be viewed. If bootstrapping has been performed, a bootstrap value is given at the root of the subtree specifying each group.

The consensus tree of all shortest (most parsimonious) trees can also be viewed. Two consensus rules are implemented: strict (groups must be supported by all trees) and majority (groups must be supported by more than 50% of the trees).

PAST can read and export files in the NEXUS format, making it compatible with packages such as PAUP and MacClade.
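
The two consensus rules described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each tree is represented as a set of groups (clades), each group being a frozenset of taxon names; the function name `consensus_groups` and this representation are hypothetical and say nothing about how PAST stores trees internally.

```python
from collections import Counter


def consensus_groups(trees, rule="strict"):
    """Return the groups retained by a strict or majority consensus rule.

    trees: list of trees, each given as a set of frozensets of taxon names
           (an assumed representation, chosen for brevity).
    """
    if not trees:
        raise ValueError("need at least one tree")
    n = len(trees)
    # Count in how many trees each group occurs.
    counts = Counter(group for tree in trees for group in set(tree))
    if rule == "strict":
        kept = [g for g, k in counts.items() if k == n]       # in all trees
    elif rule == "majority":
        kept = [g for g, k in counts.items() if k > n / 2]    # in > 50%
    else:
        raise ValueError("rule must be 'strict' or 'majority'")
    # Sort for a stable, readable result: small groups first.
    return sorted(kept, key=lambda g: (len(g), sorted(g)))


# Three equally short hypothetical trees on taxa A-D.
AB, ABC, ABD = frozenset("AB"), frozenset("ABC"), frozenset("ABD")
trees = [{AB, ABC}, {AB, ABD}, {AB, ABC}]

strict = consensus_groups(trees, "strict")
majority = consensus_groups(trees, "majority")
```

In this toy example the group {A, B} appears in all three trees and survives both rules, while {A, B, C} appears in two of three trees and survives only the majority rule.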
because of its solid theoretical basis and minimum of statistical assumptions.

The data input consists of a presence-absence matrix with samples in rows and taxa in columns. Samples belong to a set of sections (localities), where the stratigraphical relationships within each section are known. The basic idea is to generate a set of assemblage zones (similar to ‘Oppel zones’) that are optimal in the sense that they give maximal stratigraphic resolution with a minimum of superpositional contradictions. An example of such a contradiction would be a section containing species A above species B, while assemblage 1 (containing species A) is placed below assemblage 2 (containing species B). The method of Unitary Associations is a logical but somewhat complicated procedure, consisting of several steps. Its implementation in PAST does not include all the features found in the standard program, called BioGraph (Savary and Guex 1999), and advanced users are referred to that package.

PAST produces a detailed report of the analysis, including maximal cliques, unitary associations, correlation table, reproducibility matrix, contradictions between cliques, biostratigraphic graph, graph of superpositional relationships between maximal cliques, and strong components (cycles) in the graphs (Guex 1991). It is important to inspect these results thoroughly in order to assess the quality of the correlation and to improve the quality of the data, if necessary. Angiolini and Bucher (1999) give an example of such careful use of the method of Unitary Associations.

CASE STUDIES

The fourteen case studies have been designed to demonstrate both the use of different data analysis methods in paleontology and the specific use of the functions in the program. The cases are taken from such diverse fields as morphology, taxonomy, paleoecology, paleoclimatology, sedimentology, extinction studies, and biostratigraphy. The examples are taken from both vertebrate and invertebrate paleontology, and they cover the whole of the Phanerozoic. These case studies are well suited for an introductory course in paleontological data analysis and have been tested in classroom situations. The cases are organized into four main subject areas: morphology and taxonomy, biogeography and paleoecology, time-series analysis, and biostratigraphy.

Case studies 1-5 (see note 1) involve the description and analysis of morphological variation of different sorts, while case study 6 targets some phylogenetic problems in a group of Cambrian trilobites and the mammals.

Case Study 1 investigates the external morphology of the Permian brachiopod Dielasma, developing ontogenic models for the genus and comparing the growth rates and outlines of different samples from in and around a Permian reef complex. In a more focused exercise, Case Study 2 uses spatial statistics to assess the mode of distribution of tubercles on the cranidium of the trilobite Paradoxides from the middle Cambrian.

Case Study 3 tackles the multivariate morphometrics of the Ordovician illaenid trilobite Stenopareia using Principal Components Analysis (PCA), Principal Coordinate Analysis (PCO), cluster and discriminant analyses to determine the validity of two species from Scandinavia.

1. PE Note: The Case Study files are available from the PE site, and also directly from the author. The links below point to the author's site, which will, as time and the author proceed, contain updates and newer versions. The author’s site is: http://www.nhm.uio.no/~ohammer/past/.
Case Study 4 demonstrates the use of Elliptic Fourier shape analysis and principal components for detecting changes in trilobite cephalon shape through ontogeny.

In Case Study 5, aspects of the allometric growth of the Triassic rhynchosaur Scaphonyx are investigated using regression analysis.

Case Study 6 investigates the phylogenetic structure of the middle Cambrian Paradoxididae through cladistic analysis, using parsimony analysis and bootstrapping. Similar techniques can be applied to a matrix of 20 mammal taxa; cladograms generated by the program can be compared with a cluster analysis of the data matrix.

Case studies 7-11 cover aspects of paleobiogeography and paleoecology. Case Study 7 analyzes a global dataset of late Ordovician brachiopod distributions. A series of provincial faunas were developed against a background of regression and cooler surface waters during the first strike of the late Ordovician (Hirnantian) glaciation. Through the calculation of similarity and distance coefficients together with cluster analysis, these data can be organized into a set of latitudinally controlled provinces. Seriation helps to develop any faunal, possibly climatically generated, gradients within the data structure.

In Case Study 8 faunal changes through a well-documented section in the upper Llanvirn rocks of central Wales are investigated graphically and by the calculation of diversity, dominance, and related parameters for each of ten horizons in the sections. The changes in faunas fingerprint environmental shifts through the section, shadowed by marked changes in lithofacies. This dataset is ripe for considerable experimentation.

Case Study 9 involves a re-evaluation of Ziegler’s classic Lower Paleozoic depth-related communities from the Anglo-Welsh area. Using a range of multivariate techniques (similarity and distance coefficients, cluster analysis, detrended correspondence analysis, and seriation), the reality and mutual relationships of these benthic associations can be tested using a modified dataset.

Case Study 10 discusses some well-known Jurassic shelly faunas from England and France. The integrity and onshore-offshore distribution of six Corallian bivalve-dominated communities are investigated with diversity measures, cluster analysis and detrended correspondence analysis.

Case Study 11 completes the analysis of biotic assemblages with an investigation of the direction and orientation of a bedding-plane sample of brachiopod shells from the upper Ordovician rocks of Scotland.

Two cases involve the study of time series data. Case Study 12 investigates the periodicity of mass extinctions during the Permian to Recent time interval using spectral analysis. A number of diversity curves can be modeled for the Paleozoic and post-Paleozoic datasets available in Fossil Record 2, and turnover rates can be viewed for Phanerozoic biotas.

Case Study 13 addresses the periodicity of oxygen isotope data from ice cores representing the last million years of Earth history.

The final case study demonstrates the use of quantitative biostratigraphical correlation with the method of Unitary Associations. Eleven sections from the Eocene of Slovenia are correlated using alveolinid foraminiferans studied by Drobne.

CONCLUSION

Statistical and other quantitative methods are now very much part of the paleontologists’ tool kit. PAST is a free, user-friendly and comprehensive package of statistical and graphical algorithms, tailor
made for the scientific investigation of paleontological material. PAST provides a window on current and future developments in this rapidly evolving research area. Together with a simple manual and linked case histories and datasets, the package is an ideal educational aid and first-approximation research tool. Planned future developments include extended functionality for morphometrics and the extension of available algorithms within the cladistics and unitary associations modules.

REFERENCES

Adrain, J.M., Westrop, S.R. and Chatterton, D.E. 2000. Silurian trilobite alpha diversity and the end-Ordovician mass extinction. Paleobiology, 26:625-646.
Angiolini, L. and Bucher, H. 1999. Taxonomy and quantitative biochronology of Guadalupian brachiopods from the Khuff Formation, Southeastern Oman. Geobios, 32:665-699.
Brower, J.C. and Kyle, K.M. 1988. Seriation of an original data matrix as applied to palaeoecology. Lethaia, 21:79-93.
Brown, D. and Rothery, P. 1993. Models in biology: mathematics, statistics and computing. John Wiley & Sons, New York.
Bruton, D.L. and Owen, A.W. 1988. The Norwegian Upper Ordovician illaenid trilobites. Norsk Geologisk Tidsskrift, 68:241-258.
Davis, J.C. 1986. Statistics and Data Analysis in Geology. John Wiley & Sons, New York.
Guex, J. 1991. Biochronological Correlations. Springer Verlag, Berlin.
Hammer, Ø. 2000. Spatial organisation of tubercles and terrace lines in Paradoxides forchhammeri - evidence of lateral inhibition. Acta Palaeontologica Polonica, 45:251-270.
Harper, D.A.T. (ed.). 1999. Numerical Palaeobiology. John Wiley & Sons, New York.
Harper, D.A.T. and Ryan, P.D. 1987. PALSTAT. A statistical package for palaeontologists. Lochee Publications and the Palaeontological Association.
Hill, M.O. and Gauch Jr, H.G. 1980. Detrended Correspondence analysis: an improved ordination technique. Vegetatio, 42:47-58.
Kitching, I.J., Forey, P.L., Humphries, C.J. and Williams, D.M. 1998. Cladistics. Oxford University Press, Oxford.
Krebs, C.J. 1989. Ecological Methodology. Harper & Row, New York.
Kuhl, F.P. and Giardina, C.R. 1982. Elliptic Fourier analysis of a closed contour. Computer Graphics and Image Processing, 18:259-278.
Muller, R.A. and MacDonald, G.J. 2000. Ice ages and astronomical causes: Data, Spectral Analysis, and Mechanisms. Springer Praxis, Berlin.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. 1992. Numerical Recipes in C. Cambridge University Press, Cambridge.
Raup, D. and Crick, R.E. 1979. Measurement of faunal similarity in paleontology. Journal of Paleontology, 53:1213-1227.
Raup, D. and Sepkoski, J.J. 1984. Periodicities of extinctions in the geologic past. Proceedings of the National Academy of Sciences, 81:801-805.
Renaud, S., Michaux, J., Jaeger, J.-J. and Auffray, J.-C. 1996. Fourier analysis applied to Stephanomys (Rodentia, Muridae) molars: nonprogressive evolutionary pattern in a gradual lineage. Paleobiology, 22:255-265.
Ryan, P.D., Harper, D.A.T. and Whalley, J.S. 1995. PALSTAT, Statistics for palaeontologists. Chapman & Hall (now Kluwer Academic Publishers).
Savary, J. and Guex, J. 1999. Discrete Biochronological Scales and Unitary Associations: Description of the BioGraph Computer Program. Mémoires de Géologie (Lausanne), 34.
Sepkoski, J.J. 1984. A kinetic model of Phanerozoic taxonomic diversity. Paleobiology, 10:246-267.