Albert - Parsimony Phylogeny and Genomics

Parsimony, Phylogeny, and Genomics
EDITED BY
Victor A. Albert
University of Oslo, Norway
Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide in
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam
Oxford is a registered trade mark of Oxford University Press
in the UK and in certain other countries
Published in the United States
by Oxford University Press Inc., New York
© Oxford University Press 2005
The moral rights of the authors have been asserted
Database right Oxford University Press (maker)
First published 2005
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
without the prior permission in writing of Oxford University Press,
or as expressly permitted by law, or under terms agreed with the appropriate
reprographics rights organization. Enquiries concerning reproduction
outside the scope of the above should be sent to the Rights Department,
Oxford University Press, at the address above
You must not circulate this book in any other binding or cover
and you must impose this same condition on any acquirer
British Library Cataloging in Publication Data
(Data available)
Library of Congress Cataloging-in-Publication Data
Parsimony, phylogeny, and genomics / edited by Victor A. Albert.
p. cm.
Includes bibliographical references and index.
ISBN 0-19-856493-7 (alk. paper)
1. Cladistic analysis. 2. Genetics. 3. Phylogeny. I. Albert, Victor A. (Victor
Anthony), 1964-
QH441.P37 2005
576.8'8—dc22 2004026054
10 9 8 7 6 5 4 3 2 1
Typeset by Newgen Imaging Systems (P) Ltd., Chennai, India
Printed in Great Britain
on acid-free paper by
Antony Rowe, Chippenham
ISBN 0-19-856493-7 (Hbk) 978 0 19 856493 5
Dedication

This book is dedicated to James S. Farris, one of the foremost scholars of phylogenetic biology in the twentieth century. It is now thirty-five years since the two-paper introduction of Wagner parsimony¹. Wagner parsimony paved the way for all modern parsimony approaches, including the more general algorithms of Fitch and Sankoff. These landmark publications by Dr. Farris are among the best known and highest-cited papers in systematic biology. At the same time, and little known to either the biological or mathematical communities, was Dr. Farris’s 1970 development² of what has come to be known as the Farris transform³. This transformation⁴ was rediscovered in the context of phylogenetics twice, as well as in Wolf Prize winner Mikhail Gromov’s work on hyperbolic groups, for which it was later dubbed the Gromov product. The Farris transform also appears, in ‘disguise’, in distance geometry, where it is known as the covariance mapping.

Dr. Farris has received the honor Doctor of Philosophy honoris causa from the University of Helsinki, Finland.

Personal Dedication

I first met Steve Farris in 1990, at the International Congress of Systematic and Evolutionary Biology meetings held in College Park, Maryland. I was anxious to meet the man whose work had so affected me already, despite having just started my PhD in 1989. Axel Meyer, who was then at Stony Brook, introduced us. Steve had taught statistics at Stony Brook for many years (although few know that he was trained as a systematic ichthyologist). I showed Steve some calculations and graphics I had made with Brent Mishler that we thought had bearing on the issue of consistency of parsimony (see Chapter 1, for example). Seeing an opportunity, Steve spontaneously proceeded to persuade organizers of a Hennig Society symposium at the Congress to let me fill an empty slot. I was lucky enough to have brought overheads. I gave the talk, and suffice it to say that I made some friends, and estranged some others. But this was the true start of my career. Controversy has never been slight in the field of phylogenetics, and Steve has almost never been slight (except when he was young, or in that photo in David Hull’s book), and certainly never anything but controversial. Like it or not, debate, duel, divide, and conquer is one approach to science. Regardless, we can thank Steve’s relentless pursuit of what he thought was correct for many thousands of papers in the literature using parsimony approaches, as well as for many others that argue against it.

Victor A. Albert

¹ Kluge, A. G. and Farris, J. S. (1969). Quantitative phyletics and the evolution of Anurans. Syst. Zool. 18: 1–32; Farris, J. S. (1970). Methods for computing Wagner trees. Syst. Zool. 19: 83–92.
² Farris, J. S., Kluge, A. G., and Eckhart, M. J. (1970). A numerical approach to phylogenetic systematics. Syst. Zool. 19: 172–189.
³ Dress, A., Holland, B., Huber, K. T., Koolen, J. H., Moulton, V. and Weyer-Menkoff, J. (2005). Δ additive and Δ ultra-additive maps, Gromov’s trees, and the Farris transform. Discrete Appl. Math. 146: 51–73.
⁴ The similarity measure $S_a = 1 - D_a$ that one gets from a dissimilarity measure $D$ defined on a set of tree leaves (terminal taxa) $X$ containing leaf $a$ by placing $D_a = D(x, y) - D(a, x) - D(a, y)$ is the Farris transform of $D$ relative to $a$. Here, $a$ can be interpreted as an outgroup root and the values of $D(a, x)$ and $D(a, y)$ as the distances of leaves $x$ and $y$ from that root, in which case $-D_a = D(a, x) + D(a, y) - D(x, y)$ would be twice the distance of the last common ancestor of $x$ and $y$ from $a$.
Preface

Parsimony analysis (cladistics) has long been one of the most widely used methods of phylogenetic inference in the fields of systematic and evolutionary biology. Moreover, it has mathematical attributes that lend themselves to use with complex, genomic-scale data sets. This book reviews philosophical, statistical, methodological, and mathematical aspects of parsimony analysis, and demonstrates the potential that this powerful hierarchical data-summarization method has for both structural and functional genomic research.

The book is aimed primarily at graduate-level students as well as professional researchers in the fields of phylogenetics and phylogenomics (within both the evolutionary and molecular biology communities). However, mathematicians, statisticians, and philosophers of science will also find the contents of relevance and use.

Readers will discover among the chapters that parsimony analysis does not represent a single research view, but rather a variety of perspectives all based upon a theme. I viewed it of great importance to display this diversity in light of the multiplicity of other phylogenetic methods that have been developed over the years.

My aim with this volume has been to provide parsimony analysis with a benchmark for its current place in science and for judgment of its progress into the future. Previous works focusing on parsimony analysis are surprisingly few given the extremely widespread use of parsimony methods in the academic journal literature. Those books that have been written are mainly introductory treatises, i.e. geared for mid-upper level university courses. A noteworthy exception is Elliott Sober’s Reconstructing the Past: Parsimony, Evolution, and Inference (1988, MIT Press), which was written for a specialist audience. While there exist advanced texts devoted to phylogenetic analysis of morphological data, concepts of species, cladistic methods in biogeography, and mathematical aspects of phylogenetic inference, there has been no book that specifically incorporates advanced material spanning philosophical, methodological, and mathematical perspectives on the relevance of parsimony analysis, particularly as applied to the burgeoning field of genomic biology.

My work on this book began at the 21st annual meeting of the Willi Hennig Society, held at the Hanasaari Cultural Centre, Helsinki, Finland. I am grateful to the various chapter authors for their enthusiasm for the project. Mike Steel and David Penny are acknowledged for winning the “First Draft In Prize.” I thank Cécile Ané, Joe Felsenstein, Mike Sanderson, Mark Simmons, and several chapter authors for their thoughtful reviews of one or more chapters. Other referees are listed among chapter acknowledgments. Andreas Dress, Katharina Huber, and Vincent Moulton kindly contributed information on the Farris transform. Finally, I thank Oxford University Press Commissioning Editor Ian Sherman for immediate interest in the project, and Editorial Assistants Abbie Headon, Kerstin Demata and Production Editor Anita Petrie for their assistance. Heartfelt thanks also to Charlotte, Torben, and Siri for putting up with me.

Victor A. Albert
Contents

Contributors

1 Parsimony and phylogenetics in the genomic age
Victor A. Albert

I Philosophical aspects of parsimony analysis, including comparison with model-based approaches

2 What is the rationale for ‘Ockham’s razor’ (a.k.a. parsimony) in phylogenetic inference?
Arnold G. Kluge

3 Parsimony and its presuppositions
Elliott Sober

II Parsimony, character analysis, and optimization of sequence characters

4 The logic of the data matrix in phylogenetic analysis
Brent D. Mishler

5 Alignment, dynamic homology, and optimization
Ward C. Wheeler

6 Parsimony and the problem of inapplicables in sequence data
Jan E. De Laet

III Computational limits of parsimony analysis: from historical aspects to competition with fast model-based approaches

7 The limits of conventional cladistic analysis
Jerrold I. Davis, Kevin C. Nixon, and Damon P. Little

8 Parsimony and Bayesian phylogenetics
Pablo A. Goloboff and Diego Pol

IV Mathematical attributes of parsimony

9 Maximum parsimony and the phylogenetic information in multistate characters
Mike Steel and David Penny

V Parsimony and genomics

10 Using phylogeny to understand genomic evolution
David A. Liberles

11 Dollo parsimony and the reconstruction of genome evolution
Igor B. Rogozin, Yuri I. Wolf, Vladimir N. Babenko, and Eugene V. Koonin

References

Index
Contributors

Victor A. Albert, Natural History Museum, University of Oslo, P.O. Box 1172 Blindern, NO-0318 Oslo, Norway. e-mail: victor.albert@nhm.uio.no

Vladimir N. Babenko, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bldg. 38A, Bethesda, MD 20894, USA. e-mail: babenko@ncbi.nlm.nih.gov

Jerrold I. Davis, L.H. Bailey Hortorium and Department of Plant Biology, Cornell University, Ithaca, NY 14850, USA. e-mail: jid1@cornell.edu

Jan E. De Laet, Royal Belgian Institute of Natural Sciences, Vautierstraat 29, Brussels, Belgium. e-mail: jdelaet@natuurwetenschappen.be, jan.delaet@lid.kviv.be

Pablo A. Goloboff, CONICET, INSUE, Instituto Miguel Lillo, Miguel Lillo 205, 4000 San Miguel de Tucumán, Argentina. e-mail: pablogolo@csnat.unt.edu.ar

Arnold G. Kluge, Cladistics Institute, Ann Arbor, MI 48103, USA. e-mail: akluge@umich.edu

Eugene V. Koonin, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bldg. 38A, Bethesda, MD 20894, USA. e-mail: koonin@ncbi.nlm.nih.gov

David A. Liberles, Computational Biology Unit, Bergen Centre for Computational Science, University of Bergen, NO-5020 Bergen, Norway. e-mail: liberles@cbu.uib.no

Damon P. Little, L.H. Bailey Hortorium and Department of Plant Biology, Cornell University, Ithaca, NY 14850, USA. e-mail: dpl10@cornell.edu, dlittle@nybg.org

Brent D. Mishler, University Herbarium, Jepson Herbarium, and Department of Integrative Biology, University of California, Berkeley, CA 94720, USA. e-mail: bmishler@socrates.berkeley.edu

Kevin C. Nixon, L.H. Bailey Hortorium and Department of Plant Biology, Cornell University, Ithaca, NY 14850, USA. e-mail: kcn2@cornell.edu

David Penny, Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand. e-mail: d.penny@massey.ac.nz

Diego Pol, Division of Paleontology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024, USA. e-mail: dpol@amnh.org

Igor B. Rogozin, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bldg. 38A, Bethesda, MD 20894, USA. e-mail: rogozin@ncbi.nlm.nih.gov

Elliott Sober, Department of Philosophy, University of Wisconsin, Madison, WI 53706, USA. e-mail: ersober@wisc.edu

Mike Steel, Allan Wilson Centre for Molecular Ecology and Evolution, University of Canterbury, Christchurch, New Zealand. e-mail: m.steel@math.canterbury.ac.nz

Ward C. Wheeler, Division of Invertebrate Zoology, American Museum of Natural History, Central Park West at 79th St, New York, NY 10024–5192, USA. e-mail: wheeler@amnh.org

Yuri I. Wolf, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bldg. 38A, Bethesda, MD 20894, USA. e-mail: wolf@ncbi.nlm.nih.gov
CHAPTER 1

Parsimony and phylogenetics in the genomic age

Victor A. Albert

1.1 Parsimony inference

Parsimony (Ockham’s razor) as a method of inference has a long history. Based upon a parsimony argument, Copernicus maintained that his heliocentric solar system theory was superior to the geocentric one of Ptolemy because of its greater simplicity. His reasoning was that Ptolemy’s theory required what amounted to independent models for each planet’s movement (extra parameters), whereas his own included the simplifying factor of Earth–Sun movement for each planet.¹

According to Copernicus, his theory “follow[s] Nature, who producing nothing vain or superfluous often prefers to endow one cause with many effects . . . .” An important point about Copernicus’s argument is that it represented an appeal to a universal law in nature, in other words, God. Modern considerations of parsimony methodology, especially those following Lamarck and Darwin, have by necessity been occupied with other, non-deist justifications (Sober 2003).

Parsimony today stands as a method of inference from observations. For example, if one has a coin with heads and tails, in the absence of any prior information about the coin other than this observation, the most parsimonious assumption for the result of a coin toss is one or the other, i.e. 50/50 chance. If the toss were to be repeated 1000 times, one could establish a frequency-based probability (with margin of error) that this were so. If one were a Bayesian, and Joe had already tossed the coin 1000 times and gotten heads for 500 tosses, this prior probability could be used to assess the posterior probability.

In this simple example, parsimony, maximum likelihood (the mean of a normal distribution, as with 1000 coin tosses), and posterior probability all give the same answer. However, this was a very simple example, involving a single object with only two alternatives. The relationships between parsimony, likelihood, and Bayesian inference become much less obvious with more objects (characters) and alternatives (states). A biological example that Sober and Steel (2000) and Sober (2003) have examined is Crick’s (1968) parsimony-based claim that all life has a common ancestor. Crick’s argument was that since many different versions of the genetic code could have been possible, the common use of one (albeit with slight modifications) by all extant organisms strongly suggests their common ancestry. The idea is that selection would operate against code changes in descendants of a given code. In other words, one beginning of life with this attribute is more parsimonious than many (say, X). But Crick’s is also a likelihood argument (Sober 2003):

Pr(code universal in extant organisms | one ancestor) > Pr(code universal in extant organisms | X separate ancestors)

which takes the standard form $\Pr(O \mid M_1) > \Pr(O \mid M_2)$, comparing likelihoods of observations $O$ given models $M$. The formulation above follows the Law of Likelihood (Royall 1997), which states that a hypothesis with higher likelihood is preferable over one with lesser.

¹ Sober (e.g., Sober 1989, 2003; Chapter 3) has been an active student of this history, and I acknowledge his work for this example and several others I present below.
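As a minimal sketch of the Law of Likelihood comparison in Section 1.1, the coin-toss example can be worked numerically. The binomial model, the function name `binomial_log_likelihood`, and the rival value of 0.6 for the probability of heads are illustrative assumptions, not part of the original argument. The sketch compares Pr(O | M1) with Pr(O | M2) for 500 heads in 1000 tosses and also reports the maximum likelihood estimate.

```python
import math


def binomial_log_likelihood(heads: int, tosses: int, p_heads: float) -> float:
    """Return log Pr(O | M) for `heads` heads in `tosses` independent tosses,
    where model M fixes the probability of heads at `p_heads` (0 < p_heads < 1)."""
    # Number of distinct toss sequences that contain exactly `heads` heads.
    log_arrangements = math.log(math.comb(tosses, heads))
    return (log_arrangements
            + heads * math.log(p_heads)
            + (tosses - heads) * math.log(1.0 - p_heads))


heads, tosses = 500, 1000      # the observations O from the text
mle_p_heads = heads / tosses   # maximum likelihood estimate of Pr(heads)

# Two candidate models: M1 is the 50/50 coin; M2 (hypothetical) is biased toward heads.
log_lik_m1 = binomial_log_likelihood(heads, tosses, 0.5)
log_lik_m2 = binomial_log_likelihood(heads, tosses, 0.6)

# Law of Likelihood: prefer the model under which the observations are more probable.
preferred = "M1" if log_lik_m1 > log_lik_m2 else "M2"
print(f"MLE of Pr(heads) = {mle_p_heads}")
print(f"log Pr(O|M1) = {log_lik_m1:.2f}, log Pr(O|M2) = {log_lik_m2:.2f}, preferred: {preferred}")
```

Only the difference between the two log-likelihoods matters for the comparison, so the binomial coefficient cancels between models; it is kept here so that each value is a true log-probability.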
As pointed out by Sober (2003), parsimony and likelihood therefore provide identical evaluations of Crick’s common ancestry hypothesis. However, further equivalence postulates between parsimony and likelihood, which I explore later, show the issue to be much more complex.

1.2 Examples of modern uses of parsimony

1.2.1 Curve fitting

Parsimony often plays a role in choosing among models fit to a set of points on an x,y plane. For example, a set of points might be regressed by a line a bit sloppily, or by a parabola far better. The question is, which model to accept? The latter requires an extra adjustable parameter, so it could be considered less parsimonious based on economy of assumptions (PEA parsimony), i.e. Ockham’s razor. On the other hand, a better fit to a parabola, which takes the form of minimization of residual variance between observations and model, is of course the likelihood, $\Pr(O \mid M)$.

There are different criteria to choose among models. In one example, an excellent parabolic fit (or even a higher-order one) might have no logical relationship to the data at hand (e.g. length of a naked mRNA strand vs. number of free bases after chemical degradation in vitro), and so a sloppy line would be the better model (through PEA), albeit representing data that may have been collected in an error-prone manner.

Other data with better parabolic fit might have a realistic basis; this would then suggest a defiance of PEA in terms of the number of adjustable parameters. But how to decide between the models? Two well-known criteria offer likelihood-based methods for this choice, the Bayesian and Akaike information criteria (BIC and AIC, respectively; see Sober 2003; Felsenstein 2004). In these criteria, parsimony takes the role of a penalty for complexity (in terms of number of adjustable parameters, $p$, referring to PEA) among models, $M$, with different log-likelihoods. For example, with the BIC (the ratio of the average likelihoods for two models), for the most likely parabola to be preferred, it must fit the $n$ observations sufficiently better than the most likely linear model to offset the complexity factor, which is dependent on sample size:

$$\mathrm{BIC}_M = -2 \log L_M + p_M \log n$$

Roughly equal likelihoods, $L$, will likely mean that the line will win.

1.2.2 Trees of species or genes

Data points based on characters (e.g. nucleotides) sampled from species or genes can be analyzed under a hierarchic model in order to reconstruct most parsimonious trees. This operation is, of course, the central subject of this book. Most parsimonious trees are hierarchies or partially collapsed hierarchies with changes minimized across all characters that could show evidence for grouping. Here, groups are defined as two or more species or genes partitioned from two or more other species or genes. Not all character-state distributions can show evidence for grouping, and the specifics of information use are a major difference between parsimony, likelihood, and distance matrix methods. An illustration of this, as well as what trees demonstrate among the methods, will be useful.

For four species or genes, A, B, C, and D, there are $2^{n-1}$ different ways (in this case $n = 4$) for binary characters to partition species or genes:

{AB} {CD}     {ABC} {D}     {ABCD} {}
{AC} {BD}     {ABD} {C}
{AD} {BC}     {ACD} {B}
              {BCD} {A}

Parsimony can use only $2^{n-1} - (n + 1)$ partitions, i.e. three: the two-item splits shown to the left. A character (in isolation from other characters) that argues for such a partition incurs one state change between such splits, yielding two groups (Fig. 1.1). None of the other partitions produce groups, although the middle four 3 : 1 splits incur state changes (these merely show a difference between, say, D vs. A, B, and C; Fig. 1.1). No changes are implied in the 4 : 0 split. However, likelihood methods use all eight partitions (see below). Distance matrix methods use information at rate $(n^2 - n)/2$. As species/gene number increases, it
can be seen that $2^{n-1} - (n + 1)$ approaches $2^{n-1}$, but that $(n^2 - n)/2$ lags far behind. Thus, for a given large $n$ (such as with genomic-scale estimation of gene family phylogenies), parsimony utilizes the majority of all available evidence while only incorporating characters that could show evidence for grouping.

Character   1   2
A           c   c
B           c   c
C           t   c
D           t   t

Figure 1.1 Examples of two characters and their states that (1) can show evidence for hierarchy vs. (2) evidence for difference. Trees implied by 1 vs. 2 alone are also shown. Δ indicates where a character-state change (c↔t) could occur. Note that only character 1 could support two groups, a group defined as comprising two or more items; character 2 can only support a fully collapsed tree in which one branch (not a group) is different from the others. A–D, species or genes. c/t, different pyrimidine bases.

Likelihood trees demonstrate relationships among species or genes that maximize $\Pr(O \mid M)$, where $M$ is an evolutionary model. As such, likelihood methods need all of the observations $O$, including the non-grouping partitions, to maximize the likelihood of the data; anything less would compromise the calculation. Branches of likelihood trees have lengths in terms of character-state change probabilities. Parsimony trees display relationships in terms of character-state changes along branches. Distance matrix methods of building trees show raw or model-adjusted differences between species or genes.

These illustrations are not meant as justifications for one method over the other in phylogeny reconstruction; rather, my goal has been to draw attention to differences in information use among the different methods, and to what trees derived from them demonstrate. This has often been confused in the theoretical and biological literature.

1.2.3 Phylogenetic models for which parsimony and likelihood are equivalent

I have already illustrated simple, non-phylogenetic models for which parsimony and likelihood are equivalent. The first attempts to establish this equivalence for phylogeny reconstruction were those of Farris (1973) and Felsenstein (1973). These models were different in that Farris’s was basically Bayesian, with equal (flat) prior probabilities on all trees, whereas Felsenstein’s was based on likelihood. Farris solved for the tree topology and character-state assignments at all points along branches, while including no assumption about rates of character-state change. Felsenstein’s model summed over all possible character-state assignments, but required low rates of character-state change. According to Sober (Chapter 3), Farris’s solution for topology plus character-state assignments (additional parameters) renders it inequivalent to likelihood, whereas Felsenstein’s parsimony model does achieve a likelihood equivalence. These considerations depend of course on the type of likelihood under consideration, for which there are several variants (Steel and Penny 2000; Goloboff 2003). To echo the point made by Steel and Penny (2000), both Farris’s and Felsenstein’s models are likelihood equivalents, just not for the same kind of likelihood.

Goldman’s (1990) parsimony-likelihood formulation permits all branches to have the same length, a very simple model. However, Goldman, and indeed Sober (Chapter 3), assert its inequivalence with likelihood for basically the same reasons as for Farris’s (1973) model: inference of the topology plus something else, in this case, ancestral character states. However, it is worth pointing out other views in the literature. According to Farris (1986) and Goloboff (2003a), ancestral states are not to be viewed as parameters:

Goldman (1990) decided that, even if the ancestral reconstructions are not parameters, they “could be treated as if they were.” But they could also be treated (much more properly) as if they were not a parameter. The ancestral states are more like a kind of inferred observation (Farris, 1986). Parameters are instead those variables of the process that determine the conditions of the problem—the variables that determine the outcome of
evolution, that is. Even if not observed, the ancestral states are (just like observed states) part of that outcome. [Goloboff 2003a, p. 100]

Goldman’s formulation of parsimony assumes that each character type occurs with a probability equal to the pathway with highest probability, among all the pathways that lead to that character type. If the probability of change in each branch is low, this estimation produces probabilities that are roughly proportional to the actual probabilities (i.e., the ones obtained by summing); that is, all the resulting character types are ranked in the same order of increasing probability by both criteria. This, however, does not convert the calculations under Goldman’s model into estimations of a parameter; if a reconstruction was indeed a parameter, there would be one of them which would confer to the corresponding character type its true probability of occurrence under the model, and there is none. Only the sum of all reconstructions provides the true value for a given type. [Goloboff 2003a, p. 100]

Thus, Goloboff argues, using the most likely reconstruction (instead of the sum of likelihoods for all reconstructions) produces a good approximation of the actual likelihood, which is not exact, but then again some likelihood methods are not exact either. Goloboff gives the example that the assumption of nucleotide state frequencies remaining constant over time also implies that likelihood calculations are approximate instead of exact, since reconstructions are then not truly independent (they must sum to the assumed frequencies) (Goloboff 2003a, p. 101).

Without debate as an equivalence between parsimony and likelihood (Sober, Chapter 3; Goloboff 2003a) is the formulation of Tuffley and Steel (1997). They provided a proof that parsimony was a maximum likelihood estimator under the assumption of no common mechanism (potentially unequal change probabilities) for each character with r states and a symmetric change assumption. With this formulation, the different rates can either be very, very small or very, very large: in fact, only 0 or infinity, and nothing in between. The Tuffley and Steel result has been considered by some a complex parsimony equivalent because of its numerous adjustable parameters (the lengths of each branch for each character, as estimated from a single datum). On the other hand, Goldman’s formulation is an extremely simple model: the fit of a tree to data is solely based on its topology and on state change/stasis probabilities. As such, parsimony inference can receive a likelihood equivalence at both ends of the complexity spectrum, which has been interpreted to speak toward its generality as an inferential method (see Goloboff 2003a). Of course, equivalence between parsimony and likelihood under a few models does not mean that equivalence extends to all models, or that it has to do so in order to justify use of parsimony methods.

1.2.4 A non-likelihood justification for parsimony

Not all users of parsimony analysis care about equivalencies between parsimony and likelihood under certain process models. Farris himself, who produced a series of statistical interpretations of parsimony (1973, 1977, 1978), later downplayed these in one of the most important philosophical papers on parsimony analysis (Farris 1983). He famously stated that:

A number of authors, myself among them (Farris, 1973, 1977, 1978), have used statistical arguments to defend parsimony, using, of course, different models from Felsenstein’s [1973]. . . . my own models, if perhaps not quite so fantastic as Felsenstein’s, are nonetheless like the latter in comprising uncorroborated (and no doubt false) claims on evolution. If reasoning from unsubstantiated suppositions cannot legitimately question parsimony, then neither can it properly bolster that criterion. The statistical approach to phylogenetic inference was wrong from the start, for it rests on the idea that to study phylogeny at all, one must first know in great detail how evolution has proceeded. That cannot very well be the way in which scientific knowledge is obtained. [Farris 1983, p. 17]

Farris argued in favor of parsimony as a method that maximizes explanatory power among observations that could be expected to reflect genealogical relationships, i.e. potential homologies. He characterized most-parsimonious trees as the least falsified hierarchical hypotheses in the context of the philosopher Karl Popper’s ideas on the treatment of observations (see Kluge, Chapter 2). However, Farris carried his argument into more general terms: trees with minimal homoplasy
(i.e. with minimal parallelism or reversal) must be preferred over trees that have more, because the latter require more observations to be dismissed for the sole purpose of protecting conclusions from ‘offending’ evidence. This is a fundamentally different use of the parsimony criterion, that which the data requires (PDR parsimony), and indeed, this is a criterion that can be interpreted quantitatively. Operationally, for a given set of independent pairwise similarity statements,

$$\min \sum_k H_k \iff \max \sum_k J_k$$

where $H$ and $J$ represent independent statements of pairwise homoplasy and homology, respectively, across $k$ characters (see De Laet, Chapter 6). Minimization of $H$, with the addition of a tree-independent constant, is equivalent to minimizing total steps (character-state changes) as calculated by standard parsimony software.

With reference to the minimization of $H$, Farris also refuted the commonly held belief that parsimony assumes rarity of homoplasy by use of an analogy to linear regression analysis; although residual variation in a least squares fit is certainly minimized, there is no requirement that this variation be small. Likewise, minimization of $H$ occurs in the context of all characters, and this involves no requirement that estimated homoplasy be rare. For further discussion of Farris’s arguments, see De Laet (Chapter 6; including an interesting elaboration) and Kluge (Chapter 2).

1.2.5 Parsimony and statistical consistency

Although consistency enters into further discussion below, I will only briefly deal with its formalities. As Felsenstein (2004; p. 107) explains:

An estimator is consistent if, as the amount of data gets larger and larger (approaching infinity), the estimator converges to the true value of the parameter with probability 1. If it converges to something else, we must suspect the method of trying to push us toward some untrue conclusion. In 1978 I presented . . . an argument that parsimony is, under some circumstances, an inconsistent estimator of the tree topology. [italics in the original; bold emphasis is mine]

Farris (1983) rejected Felsenstein’s 1978 model by arguing against its applicability to real data. He didn’t object to the general idea of seeking a consistent estimator; he just felt that one was not available in practice. A decade ago, colleagues and I modeled consistency for sequence evolution and concluded that the ‘Felsenstein zone’ of inconsistency (under Felsenstein’s own conditions) was small enough to be insignificant for real data (Albert et al. 1992, 1993). Felsenstein showed similar results himself (2004; see also Steel and Penny 2000), which echo our findings that as r states increase, the zone of inconsistency decreases. This will be seen to have bearing when gene-order data are discussed below.

1.2.6 Other practical considerations

Parsimony analysis yields hierarchic results that are both fully diagnosable and interconvertible with the original data (Fig. 1.2). This is a very positive feature in terms of tree interpretation and for information storage and retrieval (Farris 1979). As stated above, most-parsimonious trees have branch lengths in terms of changes among the states of informative characters. This format is more intuitive than branches in terms of state-change probabilities, distances (via some metric), or those with no dimensions whatsoever. Indeed, many investigators have exploited parsimony’s branch-length properties to optimize their original data on to trees derived from other methods; however, such comparisons are not interconvertible with the data matrix, rendering interpretations potentially tree-biased as opposed to data-biased (remember that the data that go into most-parsimonious trees provide the branch lengths that come back out; see Mishler, Chapter 4).

1.3 Genomic-scale data and parsimony

Current whole-genome sequences and projects underway represent the tip of the iceberg. Now that we have complete genome sequences for many prokaryotes, several eukaryotes, and numerous organellar DNAs, bioinformaticians
O cc c c c c c c ccccccccc c c c c c c cccccccccc
A gg g g g c c c ggggggggg c c c c c c cccccccccc
B gg g g g c c c ccccccccc g c c c c c cccccccccc
C gg g g c c c c ccccccccc c g c c c c cccccccccc
D gg g c c c c c ccccccccc c c g c c c cccccccccc
E gg c c c g c c ccccccccc c c c g c c cccccccccc
F gg c c c g g c ccccccccc c c c c g c cccccccccc
G gg c c c g g g ccccccccc c c c c c g cccccccccc
H gg c c c g g g ccccccccc c c c c c c gggggggggg
1 2 3 4 5 6 7
[Tree diagram: terminals in the order O, D, C, B, A, E, F, G, H, with internal branches labeled by the numbered characters above.]
Figure 1.2 Phylogenetic trees based on parsimony are fully diagnosable. Similarities that had the capacity to bear hierarchical information, here
characters 2–7, were those used to build the most parsimonious tree, and upon inspection it can be seen that inferred character-state changes can
be readily optimized on to internal branches. All other characters shown in this example only show difference, as opposed to evidence for hierarchy.
This fact can be readily appreciated by examination of the lengths of branches—two differences separate O (outgroup) from the other species or genes
A–H; A and H have whole blocks of singular differences, B–G each have singular differences assigned, and all of this is reflected in the tree as
differential branch lengths, not as arguments for a different overall hierarchy. Note also that the tree and its branch lengths are fully interconvertible
with the original data matrix.
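As a rough check on the caption's claim that tree and matrix are interconvertible, the sketch below counts character-state changes for the Fig. 1.2 matrix with the standard Fitch procedure for unordered characters. The names `MATRIX`, `TREE`, `fitch_steps`, and `steps_per_character` are illustrative only, and the nested-tuple topology is my reading of the figure (an assumption), not something supplied by the chapter.

```python
# Data matrix transcribed from Fig. 1.2; spaces only separate column groups.
MATRIX = {
    "O": "cc c c c c c c ccccccccc c c c c c c cccccccccc",
    "A": "gg g g g c c c ggggggggg c c c c c c cccccccccc",
    "B": "gg g g g c c c ccccccccc g c c c c c cccccccccc",
    "C": "gg g g c c c c ccccccccc c g c c c c cccccccccc",
    "D": "gg g c c c c c ccccccccc c c g c c c cccccccccc",
    "E": "gg c c c g c c ccccccccc c c c g c c cccccccccc",
    "F": "gg c c c g g c ccccccccc c c c c g c cccccccccc",
    "G": "gg c c c g g g ccccccccc c c c c c g cccccccccc",
    "H": "gg c c c g g g ccccccccc c c c c c c gggggggggg",
}

# Assumed rooted topology (outgroup O sister to everything else).
TREE = ("O", (((("A", "B"), "C"), "D"), ("E", ("F", ("G", "H")))))


def fitch_steps(tree, states):
    """Return (state set, step count) for one character on a binary tree."""
    if isinstance(tree, str):  # a terminal: its observed state, no steps
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra step
        return shared, left_steps + right_steps
    # children disagree: take the union and count one change
    return left_set | right_set, left_steps + right_steps + 1


def steps_per_character(tree, matrix):
    """Fitch step count for every column of the matrix."""
    rows = {taxon: seq.replace(" ", "") for taxon, seq in matrix.items()}
    n_chars = len(next(iter(rows.values())))
    counts = []
    for i in range(n_chars):
        column = {taxon: seq[i] for taxon, seq in rows.items()}
        counts.append(fitch_steps(tree, column)[1])
    return counts


steps = steps_per_character(TREE, MATRIX)
total_steps = sum(steps)
print(len(steps), total_steps)  # 33 columns, 33 steps on this topology
```

Under these assumptions every column costs exactly one change, so the tree length equals the number of columns and each change can be assigned to a single branch, which is the sense in which the branch lengths can be read back into the matrix.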
must find ways to reduce this complexity to provide meaningful fodder for biological hypotheses. One urgent need is for fast and predictive phylogenetic estimation of species and gene relationships. Parsimony is a method that has the logical and practical attributes discussed above, as well as, recently, the speed necessary to carry out massive topological calculations.

1.3.1 Sequence data and tree size

Parsimony analyses of sequence data for large numbers of species have been possible for a number of years (see Davis et al., Chapter 7), but only recently have these become fast enough to be considered of use at the genomic scale. For genome comparisons, it will be important to use parsimony calculations to determine gene relationships within gene families or superfamilies. The issue of orthology vs. paralogy is important in the context of molecular evolutionary hypothesis testing (see Liberles, Chapter 10, and Rogozin et al., Chapter 11). One current application, TNT (Goloboff et al. 2004), has the ability to solve reliably for large most-parsimonious trees that were once thought to be intractable problems. For example, the 500-sequence rbcL data set for seed plants (see Davis et al., Chapter 7) can now be solved for most-parsimonious trees in seconds (<42 s on my 1.6 GHz Pentium M laptop, in fact). This application has also been parallelized (Goloboff et al. 2003b), so for individual sequence alignments, the practical limits for genomic-scale sequence data will be the strength of such alignments.

In the case of rbcL, a highly conserved protein-coding gene with no introns, sequence start-stop and internal base alignment is unambiguous. However, this certainly does not hold for genes in general, not to mention (e.g.) non-coding regions in between genes. This raises an issue that I have avoided until now: according to Wheeler (Chapter 5) and De Laet (Chapter 6), the logic of a priori multiple alignment is erroneous and the results are incomplete at best. These authors argue persuasively for optimization of sequences as whole, complex characters under Sankoff parsimony (Sankoff 1975). However,
practical limits rise greatly for such algorithms, which create compound NP-complete problems that must, by current necessity, be solved heuristically.

Likelihood- or distance-based phylogenetics will reach different sorts of complexity blockades in dealing with relationships among large gene families. For example, the R2R3-MYB gene family in Arabidopsis is composed of ca. 100 members, as opposed to only three found in human (Martin and Paz-Ares 1997; Romero et al. 1998). Imagine ca. 100 R2R3-MYBs across 500 plant genomes, perhaps 50 000 genes, which, given advances in sequencing technology, isn’t a far-fetched possibility in the not-too-distant future. If full diagnosability (tree/matrix interconvertibility) and speed of execution together form the important criterion, distance matrix and likelihood methods will prove inferior. Distance matrix methods are fast, but they decompose information into pairwise estimates of path lengths between n_i and n_j and then, at best, try to reassemble them into a tree that somehow optimizes these lengths. Such an operation is fraught with error since pairwise path lengths may show no relationship to those on reconstructed trees. Moreover, the inherent information loss, especially as n gets large, is unacceptable (see above). Likelihood methods do not provide character-state diagnoses either, and are understandably slower than parsimony given that calculations are CPU-intensive, especially as n increases (even using the pruning algorithm, computational effort is proportional to k(n − 1)r²; Felsenstein 2004). Supercomputers and CPU clusters speed up likelihood calculations, but eventually a tradeoff will be reached. Besides, to quote Felsenstein (2004, p. 122):

"If it escapes the clutches of long branch attraction [inconsistency], parsimony is a fairly well-behaved method. It is close to being a likelihood method, but is simpler and faster. It is robust against violations of the assumption that rates of change at different sites are equal. (It shares this with its likelihood doppelganger.)"

Given the equatability of parsimony with likelihood under several models that range from simple to complex (see above), parsimony should be the method of choice as applied to genomic-scale questions that include enormous numbers of species or genes. It will no doubt arise in some readers’ minds, however, that the issue of tree size, e.g. for sequences of 50 000 genes, could impact statistical consistency. Indeed, some have cautioned that inconsistency can occur more often as trees become larger and larger (Kim 1996). I will present a more positive outlook on parsimony and large trees below.

1.3.2 A conjecture on parsimony and large phylogenetic trees

Background

With reference to the largest phylogenetic analysis yet attempted (2 538 rbcL sequences for photosynthetic organisms), colleagues and I (Källersjö et al. 1999) observed that relatively rapidly evolving nucleotide sites, such as those in third positions of codons, provide the majority of tree structure despite initial estimates of saturation and high levels of homoplasy on most-parsimonious trees. We pointed out in reference to this analysis that, analyzed by themselves, third positions resolve 1 327 supported groups with an average parsimony jackknife frequency of 85%, whereas the first two positions together resolve only 431 groups, with an average frequency of only 75%. The groups recovered by third positions are also well supported by the full data and are spread over the tree, including both older and younger lineages. In contrast, the first two positions fail, for example, to recognize either land plants or flowering plants as monophyletic groups.

We also generated random subsets (10 for each) of n = 100 species, n, 2n . . . 10n, from the 2 538-species matrix, and calculated the average retention index, position-wise within codons, for each subset. The retention index for individual characters, (g − s)/(g − m), where g is the maximum number of steps, s is the most-parsimonious number, and m is the minimum number, measures the amount of initial similarity retained as homology on most-parsimonious trees (Farris 1989a). As matrix size rose from 100 to 1 000, the retention index rose for third positions. In contrast, the retention indices for first and second positions, those sometimes
favored for molecular phylogenetics because they evolve more slowly, decreased. Our simple interpretation, also following the group support data reported above, was that third positions were performing better than first or second positions. Moreover, with respect to total homoplasy, the consistency index (m/s; Kluge and Farris 1969), which is inversely proportional to the total number of substitutions, gave the converse view: first, second, and third positions averaged 0.155, 0.178, and 0.046, respectively, on the tree calculated from all positions; that is, the greatest homoplasy was discovered for the more rapidly evolving third positions, despite their better performance. Given these results, our conclusion was that homoplasy can increase phylogenetic structure.

Conjecture

Homoplasy on trees can have a direct relationship with rates of change. It certainly does on large trees, such as those discussed above (n = 100–2 538), for which enough branches exist to observe the products of high evolutionary rates. The branches of small trees (e.g. n = 4–6), such as those often used for simulation studies, would not be expected to reveal underlying rate differences as accurately for vast divergences, and such differences are precisely those that might lead to inconsistency.

In Chapter 9 of this book, Steel and Penny prove a common-mechanism equivalence between parsimony and likelihood when the number of character states, r, is large enough (we will get back to this issue later regarding gene-order data). They also show that under such conditions, few characters, k, are required to arrive at a most-parsimonious solution.

My conjecture is that a parsimony-likelihood equivalence can hold when r is much smaller than required by Steel and Penny’s Theorem 9.6.2, e.g. in the r = 4 case as for nucleotide data, if n is large enough.

Erdős et al. (1999) proved that a compatibility tree, from which homoplasy is prohibited, requires at least (n − 3) log(n − 3) − (n − 3) informative characters (i.e., with grouping potential) to reconstruct a tree of n species or genes with at least 50/50 probability. For the 2 538 species case illustrated above, this quantity is at least 17 300 informative sites. But, in reality, k was only 1 428, the number of bases in the rbcL gene. Erdős et al. (1999) also established that at least some phylogenetic methods should require, for a given constant c, only c log(n) or at worst a power of c log(n) characters. Steel and Penny’s Conjecture 9.4.3 and Proposition 9.4.4 (Chapter 9) similarly suggest that k follows logarithmic growth on n, and as n grows, so too would homoplasy for some characters, especially given any limitation put on r. This requirement for homoplasy echoes the empirical findings discussed above. With Mike Steel’s help, I provide below a mathematical formalization of my conjecture in light of these findings.

Conjecture 1.3.1. Consider the r-state symmetric Poisson model. For any ε > 0 and constant B ≥ 1 there exist constants h and c that depend only on r, ε, and B for which the following holds.

Suppose k characters are generated independently for this model on any fully resolved phylogenetic tree T with n species or genes for which all the branch lengths of T are at most h and the ratio of any two branch lengths of T is at most B. Then provided k ≥ c log(n)/f², where f is the smallest branch length in T, maximum parsimony will correctly recover T with probability at least 1 − ε.

Here, ε is any positive real number, and B = 1 is the case where all branch lengths are equal. As f, which is a function of n, gets smaller and smaller, the sequence length needed to ‘detect’ that branch has to grow (indeed quadratically with 1/f). The role of B is to avoid inconsistency (as described by Felsenstein 1978a). However, note that the conjecture just says ‘for any B’, so one could take B = 1 000 a priori, and thereby allow one branch to be 1 000 times as long as another. This will impact the values of c and h, but these are constants so far as n is concerned. Presumably also as n increases, the value of B may come down closer to 1, provided that adding to n does not create a branch that is too short.

The conditions stated mean that most of the characters will have a fair degree of homoplasy; indeed, the expected number of steps will go to infinity with n, since each branch length is bounded below by h, which is a positive number.
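The quantities used in this section are simple enough to check numerically. The sketch below is only an illustration: the function names are hypothetical, the logarithm in the Erdős et al. bound is assumed to be natural (an assumption that reproduces the "at least 17 300" figure quoted above), and the constant `c` in the conjectured scaling is an arbitrary placeholder, since the conjecture states only that such a constant exists.

```python
import math


def retention_index(g, s, m):
    """(g - s) / (g - m): g = maximum steps, s = steps on the tree, m = minimum steps."""
    return (g - s) / (g - m)


def consistency_index(m, s):
    """m / s: minimum possible steps over steps observed on the tree."""
    return m / s


def compatibility_bound(n):
    """(n - 3) ln(n - 3) - (n - 3); natural log is an assumption of this sketch."""
    x = n - 3
    return x * math.log(x) - x


def conjectured_characters(n, f, c=1.0):
    """c * log(n) / f**2, the scaling in Conjecture 1.3.1; c is a placeholder."""
    return c * math.log(n) / f ** 2


# Toy character: at most 10 steps, 4 steps on the tree, minimum 1 step.
toy_ri = retention_index(g=10, s=4, m=1)
toy_ci = consistency_index(m=1, s=4)

# Homoplasy-free bound for the 2 538-sequence rbcL case, versus k = 1 428 sites.
bound_2538 = compatibility_bound(2538)
print(round(bound_2538), 1428)

# Halving the shortest branch length quadruples the conjectured requirement.
ratio = conjectured_characters(2538, 0.05) / conjectured_characters(2538, 0.10)
print(toy_ri, toy_ci, ratio)
```

The comparison makes the argument concrete: a method that forbids homoplasy would need roughly twelve times more informative characters than the gene has sites, whereas a requirement that grows with log(n) stays small even for thousands of terminals.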
1.3.3 Strings, and more on r-state characters

Around the same time, Steve Farris and I developed methods that were intended to lower worries about parsimony and inconsistency. I will begin with my procedure (Albert et al. 1994), which was to accept strings of nucleotides (randomly selected) as unit characters instead of individual bases; these strings were recoded as presence vs. absence for data analysis. The intended effect was to reduce the probability of homoplasy given, e.g., that a six-base-pair string is much easier to lose once it exists than it is to regain once lost.

My argument was based on investigations of Dollo parsimony (Farris 1977) and its use with DNA restriction-site data (Albert et al. 1992). In this context, Dollo never permits parallel gains of a restriction site (often a six-base recognition string), only multiple losses. In this earlier work, we concluded that the Dollo model was too severe and that despite the asymmetry in probabilities just discussed, parsimony with equal character-state weights (Kluge and Farris 1969; Farris 1970) was more appropriate. However, my work did not consider increasing string length. Felsenstein (2004, p. 236) has examined the parallel gain case more thoroughly, solving for the probability (under the Jukes–Cantor model) that two species or genes and their common ancestor each have/had (+) a particular nucleotide string of k sites given substitution rate q and t units of branch length:

\[ \Pr(+\,+\,+) = \left(\frac{1}{4}\right)^{k} \left(\frac{1 + 3e^{-\frac{4}{3}qt}}{4}\right)^{2k} \]

The conditional probability that the ancestor is in state + is the ratio of this equation with the probability that both species/genes have state +. The latter probability takes the same form as above, with 2qt substituted for qt and the exponent k for 2k. This probability ratio clearly demonstrates that as strings grow longer and longer, the probability of parallel gain still remains small so long as substitution rates remain low.

This is precisely why I developed the string character method; if one were to code only those completely matching strings beginning at certain nucleotide positions, especially larger and larger ones, then these should be rather conservative characters for deep branchings within phylogenetic problems. As such, the string character concept need not be restricted to nucleotide data; amino acid data, exons/introns, or even genomic regions could be so coded (see below).

Now on to r-state characters. Farris and Källersjö presented a related method, supersites, at the 1999 meetings of the Willi Hennig Society in Göttingen, Germany. With supersites, strings of nucleotides are recognized beginning at nucleotide W and then parsed downwards through the matrix, recognizing as many character states as necessary to account for differences within the strings. Supersites can therefore generate considerable character-state space among fewer informative characters. However, Steel and Penny (2000) suggest that such procedures may not avoid inconsistency because probabilities of change along branches increase toward 1 as a function of k, where r = d^k and d is the number of possible character states.

1.3.4 Gene content

Genomic-scale phylogenetic studies based on gene content are reviewed by Rogozin et al. (Chapter 11). Two approaches have been used: (1) estimate species trees from orthologous gene presence vs. absence among whole genomes, or (2) optimize these data on to a predetermined species tree. In my string character method, above, I permitted string presence-absence to have equal weight, whereas the probabilities modeled above imply asymmetric weights favoring losses. So which character-state weights to use? Rogozin et al. (Chapter 11) discuss Dollo analyses based on whole genes, which are of course nucleotide strings themselves (see above). The Dollo assumption is the asymptotic case, and with reference to the equation above this should be entirely appropriate as string size increases, say, to 1 428 bases. Use of Dollo optimization onto species trees incorporates the same state-change asymmetry. Moreover, Huson and Steel (2004) have shown that Dollo parsimony compares very favorably with a genesis-loss likelihood model they constructed to analyze gene-content data.

1.3.5 Gene order

The growing rate of whole-genome sequencing, particularly for the relatively small and circular genomes (prokaryote, chloroplast, and most mitochondrial), has been accompanied by heightened interest in determining phylogenetic relationships
based on gene order (synteny). The problem is not a simple one, at least in terms of encoding the data. For one, Steel and Penny point out in Chapter 9 that the order of G genes in a signed (oriented) circular genome can display any of 2^G (G − 1)! combinations. Nonetheless, their proof of equivalence between likelihood and parsimony for large character-state spaces bodes well for use of computationally simpler parsimony calculations in this genomic arena.

Parsimony analyses of gene order are related to analyses of string data (as discussed above), but differ in their attempt to account for adjacency vs. non-adjacency of strings. Coding methods for use with parsimony analysis have already been under investigation, e.g. Maximum Parsimony on Multistate Encodings (MPME; Wang et al. 2002, as suggested by Bryant 2000) methods. This method produces signed, multistate circular permutations of gene adjacency on circular genomes (see discussion in Steel and Penny, Chapter 9). In one simulation study, MPME has been shown to have greater accuracy in comparison with a method incorporating neighbor joining (Wang et al. 2002), a distance matrix method that inherently incorporates less information from the data (see above). Still other coding methods exist or are under development, including a technique that utilizes Dollo parsimony on tightly linked gene pairs that are then binary-recoded (Wolf et al. 2001; see Rogozin et al., Chapter 11). This method is directly related to the string recoding method of Albert et al. (1994).

A limitation encountered by Wang et al. (2002) and mentioned by Steel and Penny (Chapter 9) is the relatively small number of character states permitted by the most rigorous parsimony software (e.g. TNT) and required by the multistate coding methods. In other words, getting anywhere with the gene-order issue will require algorithmic advances regarding state space.

1.3.6 Microarray data

Hierarchic analysis of microarray expression data has become routine. However, almost all methods used are those of phenetic clustering (Eisen et al. 1998), which only supplies dimensionless levels of difference based on a distance matrix of log-transformed fluorescence intensities. Clusters of genes or ‘treatments’ (including tissue types) are formed based on shared patterns of up- vs. down-regulation of gene expression. Such analyses have proven of some use in (e.g.) tumor classification by gene-expression profiles, as well as, by inverting the matrix, identification of genes active in different tumor types. The analyses do not intend to be phylogenetic. However, phylogenetic methods, such as parsimony analysis, can be brought to bear on microarray data, at least when these data could be expected a priori to show evidence for hierarchy (e.g. through hereditary relationship). A parsimony approach to microarray analysis has been developed (Planet et al. 2001; Sarkar et al. 2002) and applied to the tumor classification problem. However, classification of different tumor types may not fit a phylogenetic model; gene regulation can be hierarchical and is certainly heritable, but it may also be networked, and tumor types do not share clear evolutionary relationships. A less stringent view on fit to model might be worth adopting for exploratory studies, since Sarkar et al. did identify gene-expression events that had also been identified by phenetic clustering.

There is one study of which I am aware that explicitly used parsimony analysis to reconstruct heritable relationships (they cited Planet et al. 2001). Uddin et al. (2004) used genome-wide expression profiles from primate brains to perform a parsimony analysis of organismic relationships; echoing other substantial evidence (e.g. Salem et al. 2003), the chimpanzee was identified as Homo sapiens’ closest relative. Another classic study of heritable gene expression relationships within and between species was that of Oleksiak et al. (2002) on Fundulus fish populations. These authors used the phenetic clustering methods of Eisen et al. (1998) to group populations by genes, as well as genes by populations. Although Oleksiak et al. were studying population differentiation and not phylogeny per se, it would have been possible to use parsimony methods that incorporate population-level information. Sarkar et al.’s Characteristic Attribute Organization System (CAOS) is closely related to Population Aggregation Analysis (PAA; Davis and Nixon 1992), which identifies patterns of discrete features that
unambiguously mark groups (i.e. that have gone to fixation). The CAOS approach further identifies characteristic expression patterns found in some members of a group and never outside that group, those found in all members of a group but never found together outside that group, and those found in some samples within a group but never outside that group.

CAOS illustrates the major advantage of using parsimony analysis for microarray data: diagnosability (see above). State changes that identify groups (of genes or treatments) and changes among members of groups have far greater predictive value than dimensionless clustering.

1.4 The future: some predictions

It is difficult to predict future modes and rates of genomic-scale data acquisition, but with Moore’s law, computer capacity should open up previously inaccessible data-analysis possibilities in less than a decade. Parsimony will remain an indispensable part of the phylogenetics and genomics tool kit, particularly to estimate enormously large trees with full diagnosability, and also for data with large character-state space (e.g. gene-order information). Perhaps a proof for Conjecture 1.3.1 can be given, establishing that particular amounts of homoplasy on large-enough trees can render another parsimony-likelihood equivalence. A massive increase in whole-genome sequencing will no doubt permit refinements to estimations of ancestral gene content. To mention an area barely discussed in this chapter, optimization of whole sequences as complex characters will also become a practical and everyday tool with large numbers of species or genes. The gene-order issue, which will no doubt develop further with different encoding methods, will also include similar approaches to optimization of whole genomes as complex characters (see De Laet, Chapter 6). Finally, parsimony analyses of microarray data should become commonplace for gene-expression data with underlying hereditary relationships, such as for phylogenetic and population genomics.

1.5 Acknowledgments

I thank Pablo Goloboff for his extremely helpful comments on the manuscript, Mike Steel for his interest in helping me formalize Conjecture 1.3.1, and Steve Farris for the example in Fig. 1.2. Again, the work of Elliott Sober is acknowledged for useful examples. I also thank the authors of the various chapters of this book; their insights and topical reviews helped make this introductory chapter possible.
I
Philosophical aspects of parsimony
analysis, including comparison with
model-based approaches
CHAPTER 2
What is the rationale for ‘Ockham’s razor’ (a.k.a. parsimony) in phylogenetic inference?
Arnold G. Kluge
Anyone suggesting a justification for a method of inference—be it parsimony
or anything else—should be careful to distinguish sufficiency from necessity.
It is one thing to show that a set of assumptions suffice to justify a method; it
is much more difficult to show that those sufficient conditions are also
necessary. I have suggested a [likelihood] framework, which, if true, suffices to
justify parsimony. But what assurance can there be that this framework is
necessary for the method to be justified? The possibility always remains that
different or more meager assumptions will suffice to legitimize the method.
Although this possibility cannot be ruled out in principle, there is an
indirect test that provides some indication of whether the suggested justifying
principles are fundamental. If those principles provide a general framework
that allows one to characterize and investigate other aspects of phylogenetic
inference, this is some indication that the framework proposed is not only
sufficient, but fundamental.
Elliott Sober (1986, p. 41)
2.1 Introduction
Philosophers continue to debate the meaning and rationale of ‘Ockham’s razor.’ For instance, Sober (1994) concluded that parsimony is not a global principle of theory evaluation because it has no subject-matter-invariant applicability. He also maintained (p. 77) “parsimony, in and of itself, cannot make one hypothesis more plausible than another,” a position that obtains under the anti-quantity principle (AQP). On the other hand, according to Barnes (2000, p. 370), interpreting parsimony as an anti-superfluity principle (ASP), parsimony in and of itself does make a theory more plausible because it “releases a theory from its commitment to components unsupported by the relevant data.” Moreover, Barnes went on to conclude that ASP is a global principle of theory evaluation because it does not depend on any subject-matter-specific assumption. Adding yet another dimension to this recent debate, Baker (2003; see also Nolan 1997) justified ASP parsimony hypotheses on strictly quantitative grounds. Thus, parsimony has been rationalized in terms of AQP and ASP, for which there are different quantitative rationales.

Most empirical scientists have shown little interest in these debates and the newer meanings and justifications for Ockham’s razor, not that empiricists have ever exhibited much concern for the philosophical. Indeed, it is common to find no argument whatsoever for parsimony being cognitively virtuous in the evaluation of a set of competing theories or hypotheses, and if an argument is provided at all it is usually made in operational
terms. But, in the absence of a concept, most kinds of justification are difficult, if not impossible, to make (Grant 2002). One of the underlying themes of this chapter is the importance of making the distinction between concept and operation, and where the former precedes the latter (see also Farris 1967, p. 44). Another important theme is identifying incoherence among nested conceptualizations.

Some unconcerned empiricists, when pressed, will fall back on syntactic simplicity, such as the conventionalist argument that stresses aesthetic value (e.g. W. C. Wheeler, personal communication, XVI Annual Meeting of the Willi Hennig Society). Such appeals to simplicity cannot, however, save conventionalism from arbitrariness, since choice of a convention is arbitrary (Popper 1959). Others have laid claim to the pragmatic, where for example the most parsimonious hypothesis is judged a “necessary property of methods of analysis” (Patterson 1988, p. 79). Still others have embraced simplicity/parsimony for its descriptive efficiency (Farris 1979; e.g. see Brower 2000; Frost 2000), but for which there is no epistemology. A few empiricists have spoken of a statistical, goodness-of-fit, kind of justification when applying a parsimony method, where for example the most probable or plausible hypothesis is claimed to be most predictive. A best explanation kind of justification has also been mentioned by a few scientists, but usually without stating the relevant cause and effect. To merely assume ‘some kind of explanation’ is to treat explanation as a primitive term.

The ontological status of that to which a parsimony criterion is applied is not always considered, but without which the application cannot be judged. For example, the goodness-of-fit kind of justification can, and often does, involve an explicit connection between parsimony and probability, and that in turn supposes there are multiple instances of a kind with which to statistically estimate or model fit. However, not all sciences are concerned with class concepts and lawful regularities, i.e. not all are nomothetic sciences (Wrinch and Jeffreys 1921; Grant 2002), and the justification for parsimony has to be sought elsewhere (see below). For example, it has been argued that there can be no probabilification when the entities of interest (objects and events) are necessarily unique, as they are in a historical science like phylogenetics (Grant 2002; Kluge 2002).

Also relevant to parsimony is the question of the meagerness of the conditions legitimizing a choice among hypotheses. What are the minimally sufficient assumptions required to make an inference? And, of those justifications for parsimony that are ontologically sound and minimally sufficient, are there any that can be judged necessary (see epigraph)? I begin my evaluation of these questions in the inference of phylogeny with a brief discussion of the newer classifications of parsimony, and their most general justifications. To set the stage for my evaluation of the kinds of parsimony that have been, or are likely to be, employed in phylogenetic inference, I briefly explicate phylogenetic inference, as I see it, focusing on the ontological status of what is being inferred, as well as the scientific approach that is consistent with achieving that kind of knowledge. I conclude with some examples of how the fundamental nature of parsimony in phylogenetics, as well as theory unification, might be judged. I will only discuss uses of parsimony applied in the inference of phylogeny per se, i.e. relative recency of common ancestry. Parsimony as it pertains to networks (undirected graphs), methods of algorithmic efficiency, data exploration (Grant and Kluge 2003), optimization, and indexes of support will not be considered. Mention of parsimony in relation to outgroups and additivity, the study of adaptation, coevolution, and biogeography will only be made in passing. Each of these topics deserves a separate forum. While it is fair to say that I seek a necessary and sufficient justification for parsimony in phylogenetic inference, I will be satisfied if I am only able to convince a few more empirical scientists to become involved in debating the meaning and rationale of Ockham’s razor (a.k.a. parsimony) in phylogenetic inference.

2.2 Parsimony: classifications and justifications

Syntactic simplicity has long been distinguished from ontological simplicity. Some philosophers refer to the former as elegance (Walsh 1979), while the term parsimony is applied to the latter.
According to Baker (2003), syntactic simplicity involves the number and complexity of hypotheses, where justification for that kind of minimization is sought in descriptive efficiency, subjective epistemology, such as aesthetic value, or instrumentalism. In addition to these, syntactically simple hypotheses can have practical consequences, since the simpler theory can be more clearly expressed and thereby more easily understood (McAllister 1996). “Such a structure, simply as a structure, is intrinsically perspicuous” (Walsh 1979, p. 243). This subjective, ‘easily understood,’ argument can also be interpreted objectively in terms of testability—a simpler theory being more logically improbable than a less parsimonious hypothesis (Popper 1959, p. 119; see below). Obviously, with this kind of understanding, we have passed from the syntactic to the ontological. Some continue to restrict the domain of parsimony to the theoretical, usually the theoretical hypotheses that posit the fewest entities, objects, events, or processes, or ascribe certain properties to objects (Barnes 2000, p. 354). Appealing to a most-parsimonious hypothesis of species relationships in any of these senses certainly gives the appearance of assuming evolution is parsimonious. However, parsimony can be seen only as a rule of inference, not an empirical assumption of reality. It need not be an ontological claim that evolution, wherever it applies in the universe, is really parsimonious. Enjoining parsimony is then consistent with the background knowledge assumption of ‘descent, with modification’ (see below). Having said that, however, any phylogenetic method that denies the independent evolution of similarities would seem to be making a claim on the minimization of the evolutionary process.
Justification for parsimony is usually sought in explanatory power, realism, objective epistemologies, or objective cognitive (epistemic) values. Those who use explanatory power to justify parsimony are considered realists, not subjectivists, when they argue from cognitive (epistemic) criteria in favor of empirical assumptions. Moreover, in the sense of realism, there are good reasons to treat the epistemologically preferred hypothesis as tentatively true, i.e. as an objectively optimal knowledge claim.
As mentioned in the introduction, Barnes (2000) distinguished two kinds of parsimony justification, AQP and ASP. The former principle recommends positing as few theoretical components as possible, whereas the latter recommends against positing the superfluous. As Barnes (2000) exemplified (p. 354):
The two principles are clearly not equivalent: consider two competing theories, A and B, which both fit the relevant data equally well. Theory A contains more components than B, and is thus less parsimonious than B by the lights of the AQP. But while A contains no components that are not required (within A) to explain the data, theory B posits one or more superfluous components—i.e. one or more components which could be deleted from B without impairing B’s ability to explain the data. Thus B is less parsimonious than A by the ASP.
The fact that AQP entails ASP, but not vice versa, leads to the conclusion that whatever justifies B may not justify A. Barnes (2000) thought ASP to be at the heart of what is generally known as inference to the best explanation.
As this discussion suggests, there are many ways to define Ockham’s razor, as well as to classify its various kinds. My only reason for choosing Barnes’ (2000) general background knowledge, pragmatic, unification, anti-free parameters and local background knowledge as the kinds of parsimony categories that I recognize in relation to phylogenetic inference is that they expose justifications that have been previously lumped together in this field. I have added testability and ASP categories to complete that exposure. As will become apparent, the justifications that I exploit in these discussions are the usual philosophical contrasts, between theory-based and narrative explanation, descriptive efficiency and explanatory power, subjective and objective epistemologies, instrumentalism and realism, and the subjective values of aesthetics and the like and the objective cognitive (epistemic).
2.2.1 General background knowledge
A general background knowledge kind of parsimony supposes the universe is naturally parsimonious in some way. Certainly one of the most general arguments for its justification is due to Sir Isaac Newton (Thayer 1953, p. 3), who claimed, “nature is pleased with simplicity
and affects not the pomp of superfluous causes.” This kind of justification is more than a curiosity, as will be illustrated below with examples from phylogenetic inference.
2.2.2 Pragmatic
Truth is tested by its practical consequences with this kind of parsimony, where the more parsimonious of two hypotheses stands a better chance of confirmation. According to Quine (1963, p. 105), for example, the more parsimonious hypothesis, “the one with fewer parameters, is initially the more probable because a wider range of possible subsequent findings is classified as favorable to it.” This is an instrumentalist justification, and for which there can be no straightforward report as to what is explained (Walsh 1979, p. 244).
2.2.3 Unification
Friedman’s (1983; see also Greene 2004) lengthy discussion of parsimony in relativistic physics underscored the importance of unified theories. As Barnes (2000, p. 356) put the issue inductively, “unified theories are multiply confirmed by the various empirical phenomena they explain, while a competing theory with less unifying power can only be confirmed by its smaller class of explananda—thus unified theories tend to be better confirmed than their disunified competitors.” Eliminating those entities that are indeterminate or have no unifying power in the particular context is where parsimony is said to come into play in unification. Isomorphism and reduction are the traditional techniques used in the analysis of theory unification. For example, two or more theories are said to be isomorphic if there is a one-to-one mapping of their structures; e.g. if the properties attributed to them by their respective theories are the same. Reduction of theories focuses on the microstructure of phenomena, rather than on their physicalities, such as properties.
“According to scientific realists, the unification of theories reveals common causes or mechanisms underlying unconnected phenomena” (McAllister 2000, p. 538). Realists also point out that the relative ease with which theories of different domains are unified would have to be considered fortuitous according to instrumentalism, which is currently popular in phylogenetics. The unification of evolutionary and genetic theories in the neo-Darwinian synthesis is an example that should be familiar to most contemporary biologists.
2.2.4 Anti-free parameters
This kind of parsimony concerns a preference for hypotheses with few adjustable parameters. For many years the focus was on minimizing parameters, and now, with so much emphasis placed on models in inference (Burnham and Anderson 1998), Akaike’s (1973) approach to evaluating models is receiving most of the attention. Akaike argued that the predictive accuracy of models provides a reason for choosing among models, and he proved how the predictive accuracy of a model is estimated (Forster and Sober 1994; Sober 1996). Simply, old data are used to estimate the maximum-likelihood values of the parameters of the model, and this fitted model is in turn used to predict future data. Parsimony is involved when a penalty is given to the number of adjustable parameters, which is subtracted from the log-likelihood estimate of the best-fitting model. The proof of the theorem is not, however, without its difficulties (Forster and Sober 1994). For example, it is assumed that: (1) nature is uniform, i.e. the old and new data sets involved in the definition of predictive accuracy come from the same underlying distribution (Forster 2000), (2) the likelihood function is asymptotically normal, where the likelihood is a function of the parameter values, and (3) the sample size (amount of data) is large enough to ensure the likelihood function approximates asymptotic normality.
2.2.5 Local background knowledge
This anti-quantity kind of parsimony is firmly grounded in local background knowledge. Several different rationales have been advanced that depend on subject-matter-specific assumptions. This is even true in a narrowly defined field like phylogenetic inference, as will be illustrated below.
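As a minimal sketch of the penalty described in Section 2.2.4, the following Python fragment compares two hypothetical models of the same "old" data: one with no adjustable parameters and one with a single fitted mean. The data values, the unit-variance normal model, and all function names are assumptions made purely for illustration; none is taken from Akaike (1973) or Forster and Sober (1994). The score is the maximized log-likelihood minus the number of adjustable parameters, and the familiar AIC is that score multiplied by -2.

```python
import math


def gaussian_log_likelihood(data, mean, sd=1.0):
    """Log-likelihood of the data under a normal model with the given mean and sd."""
    return sum(
        -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)
        for x in data
    )


def penalized_score(log_likelihood, n_params):
    """Maximized log-likelihood minus the number of adjustable parameters."""
    return log_likelihood - n_params


def aic(log_likelihood, n_params):
    """Conventional AIC; equal to -2 times the penalized score above."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical "old" data, invented for this illustration.
old_data = [0.3, -0.2, 0.1, 0.4, -0.1]

# Model with no adjustable parameters: the mean is fixed at zero.
ll_fixed = gaussian_log_likelihood(old_data, mean=0.0)

# Model with one adjustable parameter: the mean is set to its
# maximum-likelihood value, which for this model is the sample mean.
mean_hat = sum(old_data) / len(old_data)
ll_free = gaussian_log_likelihood(old_data, mean=mean_hat)

score_fixed = penalized_score(ll_fixed, n_params=0)
score_free = penalized_score(ll_free, n_params=1)

print("log-likelihood gain from fitting the mean:", ll_free - ll_fixed)
print("penalized score, fixed mean:", score_fixed)
print("penalized score, fitted mean:", score_free)
```

With these five invented observations the fitted mean raises the log-likelihood by only about 0.025, which is less than the one-unit penalty, so the parameter-free model receives the higher score.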
2.2.6 Testability
Testability places a premium on the improbability of the hypothesis, not its probability (Popper 1959, p. 119, 1983, pp. 283–240; Kluge 2001b; contra de Queiroz and Poe 2001, 2003). Testability is defined objectively, as the power of an hypothesis to explain the evidence, in light of the background knowledge, where the data consists of reports of the outcome of sincere attempts to refute the hypothesis, not of attempts to confirm it (Popper 1959, p. 414; see also Salmon 1966, p. 46; Kluge 2001b; contra de Queiroz and Poe 2001, 2003). In maximizing severity of test, explanatory power and degree of corroboration are maximized. In Popper’s logic of scientific discovery (Popper 1959, p. 145; see also below), enjoining parsimony protects the falsifiability of the system from going to zero.
2.2.7 Anti-superfluity
Being opposed to the superfluous has been justified in various ways. For example, Nolan (1997, p. 339) justified it in terms of plausibility; however, I believe his argument, that it is “Better, rather, to have quantitative parsimony expressed as a different principle to the independently plausible principle about explanatory parsimony, rather than tying them together in this way,” unnecessarily emphasizes plausibility in determining extravagance (see also Baker 2003, p. 248). Barnes (2000) appealed to the subjectivism of Bayesian inference as his justification, while Baker (2003) made it clear that he interpreted ASP as pertaining to the class of cases that are demonstrably additive, i.e. that involve the postulation of a collection of qualitatively equivalent individual entities in the relevant respects, be they objects or events (see below). In this, Baker presumed an analysis of equivalent singular causal statements of one kind, of the form ‘e causes e′.’ This is not a holist justification for parsimony in the sense that the strengths and weaknesses of one kind of science carry over to another kind. Its virtue is that collectively such equivalent entities can explain some particular phenomenon. “The explanation is ‘additive’ in the sense that the overall phenomenon is explained by totaling the individual positive contributions of each object” where “quantitative parsimony tends to increase the explanatory power of hypotheses compared to their less quantitatively parsimonious rivals” (Baker 2003, p. 248). A less quantitatively parsimonious hypothesis can only match the most quantitatively parsimonious proposition in explanatory power by adding auxiliary claims. “Thus the preference for quantitatively parsimonious hypotheses emerges as one facet of a more general preference for hypotheses with greater explanatory power” (p. 258). However, it is also clear from Baker’s lengthy discussion that he defined explanatory power as a primitive term. For example (p. 258), he simply concluded, “quantitatively parsimonious hypotheses allow the explanation of more things.” In other words, Baker did not define quantitative parsimony specifically in terms of explanation in relation to evidence. The ASP from which this justification obtains is not strictly equivalent to Sober’s (1981, p. 145) more general principle that entities should not be postulated beyond those that have explanatory power (see also Farris 1983).
2.3 The ontological status of phylogeny: what is ideographic science is not nomothetic science
The historical science of phylogenetic inference is ideographic (Grant 2002). The word ideographic, in this context, springs from the idea that relative recency of common ancestry can be represented directly as a concrete, spatio-temporally restricted, explainable thing, the phylogenetic hypothesis, cladogram, or tree, as can the accompanying transformation of an inherited trait or homologue. For all such things there is orderliness to their unfolding, a transformation series, and for more complex (highly integrated) traits there is usually an assumable direction to the sequence of change, based on such things as ontogeny. At the ‘quantum level’ of systematics, where physico-chemically identical nucleotide states (A, G, C, T) substitute for one another indeterministically, direction of transformation may not be assumed. In either case, there is an asymmetry between the past and the future. As an historical science,
phylogenetics is then retrodictive, but not predictive; it does not predict speciation events or evolutionary changes that have yet to happen. While the laws of relativity may govern time itself, the direction or arrow of time does not appear to be governed by relativity. It seems that time’s arrow was merely conditioned at the birth of the universe, at a starting point of low entropy (Greene 2004, Fig. 6.3).
Phylogeny defined in terms of Darwin’s (1859, p. 420) principles of “descent, with modification”—relative recency of species common ancestry (monophyletic entities or clades) and the transformation of the phenotype/genotype (homologous characters or character states)—is an evolutionary concept, where only heritable things can evolve. The event of transformation is the same factor that gives absolutely the same result in all places and at all times, and it is on this basis that the law of inheritance is argued.
Phylogeny is judged to be a lineage system, one consisting of “bundles” of characters transforming through time as part of the evolution of species (Hennig, 1966). Within a species lineage is located the lineage histories of the organisms, tokogenetically related in the case of biparentals. Within each of those parts are located the ontogenetic histories of the more complex phenotypes. At the least inclusive level of spatio-temporal restrictedness there is the transformation series—what the evolutionary systematist claims as evidence (Grant and Kluge, 2004). This part/whole system is spatio-temporally restricted at each level. Phylogenetic inference is then devoted to the deduction of those sister lineages and the explanation of the singular, non-recurrent, heritable events in each bundle that mark such points of phylogenesis. Species diversification is the result of historical contingency and all of the propensities acting at the time of divergence (Kluge, 2002).
Inference in such an ideographic system is limited to what can be observed of organisms, observations that are then used to retrodict some objectively defined part of the system’s history to which they belong. That inference explains the heritable variation observed among lineages by identifying the series of necessarily unique transformation events that occurred in the lineage immediately subtending a lineage splitting event. Each stage of occurrence in a hypothesized transformation series is tested simultaneously against all other such evidence, thereby maximizing severity of test (Kluge 2003a). Phylogenetic hypotheses are chosen for their explanatory power, on the basis of the number of transformation events they explain. The only things that are explanatorily relevant in this system are the cladistic and patristic (Farris 1967; not the patristic of Sokal and Camin 1965; see below), which is to say that only the inherited patristic things can provide a critical test of competing cladistic hypotheses. The spatio-temporally unrestricted can be rejected, because it has no meaning in the part/whole system of organism history. The a priori testable and a posteriori reciprocally illuminated hypotheses are the statements of relative recency of species common ancestry and homology, respectively (Kluge 2003a). As Grant and Kluge (2004) pointed out, however, the number of possible hypotheses of homology is defined a priori by pure logic, as a function of the number of inherited parts identified for each terminal taxon, just as all possible hypotheses of phylogeny are predefined as a function of the number of those terminals (Siddall and Kluge 1997). As such, no special procedure is required to generate hypotheses of homology, nor hypotheses of relationships, since they already exist.
Popper (1957, pp. 105–122, 143–147; see also Scriven 1959; Goudge 1961, p. 63; Hull 1974, 1982; Sober 1993, pp. 14–18) discussed the distinction between historical things and lawful generalizations—the ideographic and the nomothetic, respectively. For the most part, the ideographic and the nomothetic are readily distinguished in terms of being concerned with the spatio-temporally restricted and unrestricted, respectively. Nomological necessity pertains to relations that are repeatable in an indefinitely recurrent way, or to sequences of variable phenomena, which are invariable under the same conditions. The historical entity—object or event—is firmly grounded in objective reality, whereas laws explain what is inherent in the abstract—classes (kinds) or sets of particulars (Grant 2002). No frequency-based probability exists for the
necessarily unique. There is no basis for modeling the probability of error in the ideographic science of phylogenetic inference (Kluge 2002). As will be argued below, that homoplasy is a universal concept, the complement, ‘not-a,’ of homology, means that it can play no role in the conceptualization of phylogeny or in its inference.
Hull (1977, 1989) and Sober (1980) forcefully argued that evolutionary theory precludes the conception of taxa, including species, as classes or sets. I agree, and would add that the end game or goal of identifying species and the monophyletic taxa of which they are a part involves historical questions, which cannot be answered with nomothetic means (Kluge 2002; see however Rieppel 2004). Why is it then that some phylogeneticists use the lawful ‘if/then’ means of modeling historical relationships (e.g. Felsenstein 2004)? I believe the answer lies in “distinguishing means from ends,” and in separating the nomothetic and ideographic sciences according to that distinction (Sober 1993, p. 14). As Sober (1993, p. 15) illustrated:
The astronomer’s problem is a historical one because the goal is to infer the properties of a particular object; the astronomer uses laws only as a means. Particle physics, on the other hand, is a nomothetic discipline because the goal is to infer general laws; descriptions of particular objects are relevant only as a means.
The problem to ponder, as I see it, is how the means (say, of modeling) can justify the end (of identifying species and their relationships). By focusing first on the kind of question involved, as dictated by the ontological status of what is being inferred, as I argued above, you are effectively saying the opposite—that it is the end that justifies the means.
To see that there really is a general truth in this aphorism, consider what is often presented as one of the greatest challenges to Darwin’s theory of natural selection—the origin of sex, where as a consequence the individual female wastes half her energy producing males. The significance of the problem is usually stated as a function of the number of biparental relative to uniparental species, nb and nu, respectively, and the solution to the problem is sought with lawful, if/then, means. These are the standards of neo-Darwinian science, such as modeling and frequentist statistical inference applied within and among populations, where modeled accuracy and prediction are the goals. Alternatively, however, one can identify the minimum number of independent character-state transformations of sex, inferred from the most parsimonious hypothesis of species relationships, assuming for example only nb $ nu, and proceed to explain each of those relatively few unique past instances of evolution on a case-by-case basis. To apply the standard nomothetic means of analysis at the population level, as if the problem is necessarily one of lawful optimality, ‘due to natural selection,’ may be arguable, but to treat the nb and nu observations as if they are all independent in those analyses does not give consideration to the historical nature of the problem, including the uniqueness of each instance of transformation, and Darwin’s other major principles of evolution, “descent, with modification.” To think of Darwin’s contribution only in terms of the lawful regularities of natural selection and adaptation is to miss the significance of his theory of propinquity of descent.
Popper (1957) also used the distinction between lawfulness and explanatory retrodiction, the nomothetic and the ideographic, as part of his argument that laws cannot predict history, nor explain trends in history, because you cannot use what is spatio-temporally unrestricted to retrodict what is spatio-temporally restricted. Even familiar law-like evolutionary statements, such as “all swans are white,” can be accorded a singular, spatio-temporally restricted, object explanation, because “all white swans” can be hypothesized to be parts of an historical individual or monophyletic taxon (Kluge 1999). In addition, there is the argument that “evolutionary theory contains no reference to particular taxa, just what one would expect if taxa are actually individuals and not classes. According to this view, ‘All swans are white’ could not count as a scientific law even if it were true” (Hull 1977, p. 83). As Simpson (1964, p. 128) succinctly concluded, “The search for historical laws is . . . mistaken in principle.”
Nomothetic science is not the domain of phylogenetics, not only because each instance of common ancestry is a spatio-temporally restricted unique part of history, but because each species is part of a replicator system (Lidén 1990) that renders it “necessarily unique,” which is uniqueness in the
strictest sense (Goudge 1961; Simpson 1964, p. 186; Kluge 2002; see however Hull 1974, p. 47, 97–98). The phylogeneticist cannot meaningfully practice estimation because there is no set of instances with which to assess a frequentist probability statement of species relationships. Not only is each event of common ancestry necessarily unique, so is each transformation event that is used as evidence (a proposition of homology) in the inference of phylogenetic relatedness and the explanation of observed biological diversity. Whether inherited variation is identified with DNA substitutions or the modification of a complex (highly integrated) phenotypic character, each hypothesized transformation or proposition of homology involves a necessarily unique event—an historical individual—just as are species and monophyletic groups of species (Grant and Kluge 2004).
The rationale one uses in defense of parsimonious inferences in phylogenetics must not rely on assumptions that violate the logic imposed by the ontological status of history. But what is assumed is often disputed. For example, there is the popular claim that uses of unweighted parsimony in phylogenetic inference are dependent, at least implicitly, on a model that assumes a constant rate of evolution in which all character transformations are equally likely to occur, and therefore parsimony is unable to identify any patterns of relationships other than those kinds. The error in this argument is obvious upon inspection of empirical results—parsimony methods often identify heterogeneous rates of character evolution in the unweighted most-parsimonious phylogenetic hypothesis. In other words, there is no basis for modeling data, either in light of, or independent of, the hypothesis.
Even this brief discussion indicates why those who apply parsimony must be careful to evaluate the ontological status of what is being inferred, as well as the nature of the evidence used in those inferences—which are the normative aspects of parsimony. No scientist should feel safe in appealing to parsimony without first assessing the ontological status of what he/she is inferring. Even the instrumentalist’s heuristic use of probability or likelihood as the basis for historical prediction, and not explanation, cannot be founded on an illogical thesis (Ariew 1998).
2.4 Causality and scientific practice in phylogenetic inference
Given the ideographic generalities discussed in the previous section, I can now explicate the specifics of the causality involved and the kind of scientific method that can be practiced in the inference of the necessarily unique parts of history. To begin with, the causal event of heritability is the sufficient condition for claiming a homology or historical identity (H), where that relation specifies a part of phylogeny (P), that which is ostensively defined in terms of common ancestry. The point is that an event of heritability is precisely located in relation to the object or character state that is inherited (Hennig, 1966, fig. 21). Although the part/whole relations of ontogenesis/tokogenesis/phylogenesis may be necessary to the conceptualization of causality in phylogenetic inference, it is the event of heritability or transformation that fixes the nature of the causality involved.
This logic does not tell us what things exist; it only suggests how to determine what things a theory claims to exist. Only H is relevant in the study of P, which in the simplest case of three terminal taxa, A, B and C, can be represented as (A,B)C, (A,C)B and (B,C)A. Since neither P nor H is observable, typically a hypothesized shared derived state, a synapomorphy (S), is used as the unit of empirical evidence to test among the logically possible P, a choice of which in turn hypothesizes at least some S being explained by H. Regarding linguistic conventions (Kluge, 2003a, p. 236), we can say that P(A,B)C causally explains H(A,B)C, as inferred from S(A,B)C, but not H(A,C)B or H(B,C)A, as inferred from S(A,C)B or S(B,C)A, respectively. Likewise, P(A,C)B causally explains H(A,C)B, as inferred from S(A,C)B, but not H(A,B)C or H(B,C)A, as inferred from S(A,B)C or S(B,C)A; or P(B,C)A causally explains H(B,C)A, as inferred from S(B,C)A, but not H(A,B)C or H(A,C)B, as inferred from S(A,B)C or S(A,C)B.
Two attendant considerations turn this ideographic kind of causal explanation into a historical kind of scientific operation (following the general outline provided by Goudge 1961). For the practice of phylogenetic inference to be scientific (1) the evolutionary principles of “descent, with modification,” must contain concepts that do not necessarily
correlate with what is hypothesized, the P and the H. These concepts are called theoretical constructs, which is what heritability is in the present case. These constructs may not be directly observable, but nonetheless play an important role in the framework of the theory. Their scientific admissibility depends on the fact that they occur in statements that have a deductive connection with statements that refer directly to the inherited object, i.e. the empirical data. It is because of this connection that theoretical constructs have scientific meaning conferred on them. In the practice of phylogenetic inference (Kluge 2003a), so-called observation statements, S, stand as a hypothesis of H, if at best a weak one (see below), which in turn provide the means whereby the theory, P, is tested to ascertain its falsity. (2) In addition, there are reasons for holding that a theory, P, is properly called scientific provided it entails observation statements, which are capable of being refuted by any empirical data (Kluge 1999). This guarantees that the theory, P, is potentially falsifiable. Character congruence, and the reciprocally illuminating process of character re-analysis, provides the test of such statements in phylogenetic inference (Farris et al. 1970, pp. 177–178; Kluge 1997b). There is no vicious circularity in this scheme of causality and testing (for further discussion see Hull 1967; Kluge 2003b, p. 365), because the observation statements, such as S, are not perfectly correlated with H (Farris et al. 1970). The logical proof of this obtains from the familiar argument that while all H are S, not all S are H (Farris et al. 1970, p. 187). Thus, the scientific quality of P, as inferred from H, is maintained.
Goudge (1961) also formulated two kinds of historical explanation, integrative and narrative. While neither supposed genuine scientific laws, they are not to be confused with the deductive historical explanation described immediately above, which I will consider further in my explication of phylogenetic inference. In the case of Goudge’s integrative explanation, similarity relations and spatial patterns observed among organisms are explained in light of a phylogenetic hypothesis, showing the relations and patterns to be the outcome of, or partly dependent on, past sequences of historical phenomena, which have continuity and direction. Although Goudge argued that integrative explanations can be causally and critically evaluated, most such analyses are fraught with issues of non-independence and enumerative confirmation, which I believe cannot be part of a scientific philosophy.
According to Goudge (1961), why particular historical events have occurred requires a narrative explanation. As he summarized (p. 77):
What we seek to formulate is a temporal sequence of conditions which, taken as a whole, constitutes a unique sufficient condition of that event. This sequence will likewise never recur, though various elements of it may. When, therefore, we affirm ‘E because s’, under the above circumstances, we are not committed to the empirical generalization (or law) ‘Whenever s, then E’. What we are committed to, of course, is the logical principle ‘If s, then E’, for its acceptance is required in order to argue ‘E because s’. But the logical principle does not function as a premise in an argument; the affirmation, ‘E because s’, is not deducible from it . . . Both s and E are concrete, individual phenomena between which an individual relation holds.
Critical to Goudge’s thinking on integrative and narrative explanations (p. 174), and the part to which I do not take exception, is the idea that if we envisage a transformation series “as a unique sequence of historical events, extending from the past into the present, then it is irreversible in the sense of being irrevocable. What has happened cannot be altered, and a fortiori cannot be reversed.”
The particulars involved in a historical narrative explanation are akin to the central subjects of literary narratives, and they take their place in the chronicle as a consequence of interpretative or explanatory writing. Under this interpretation, explanation is achieved through closure, that which contributes to the cohesiveness and conclusiveness of the chronicle. The analogue to closure in phylogenetics is Hull’s (1975, 1981) notion of integration, where explanation is achieved through integrating ontological individuals, i.e. by making them wholes.
Narrative explanation is, however, not without significant problems. As O’Hara (1988) pointed out, not all clades exhibit closure; the more inclusive ones remain open, to the extent that any included lineage is extant. Also, O’Hara argued that the literary interpretation tends to emphasize linearity, which in reading the phylogenetic
hypothesis promotes unnatural, paraphyletic, groups. O’Hara (1988, p. 153) concluded that these “false concepts arise out of our expectation that the central subject of an evolutionary history is a linear individual, instead of a branched tree.” Then there is the disanalogy between the divergent nature of phylogenetic hypotheses and the reticulate nature of at least some aspects of cultural evolution, such as the “tree of knowledge.” In addition, it remains unclear how historical integration (sensu Hull 1981; O’Hara 1988) would involve those properties exhibited by the central subjects, as well as those processes in which central subjects participate. These are the cause–effect relations (the sufficient conditions) that connect the central subjects, such as evolving species. Indeed, the historical narrative does not have the form of theory-based explanation, as in science, where an hypothesis is sought that has the power to explain the evidence (Popper 1957, 1962a, b; see however, Ruse 1971). Additional criticisms of Goudge’s narrative explanation can be found in Ruse (1971; see also R. Laudan 1990).

2.5 Parsimony: justifications in phylogenetic inference

With the ontological status of phylogeny, the nature of historical causality, and the scientific practice of phylogenetic inference having been explicated, we are now in a position to more critically evaluate the justifications for parsimony in phylogenetics in light of these details.

2.5.1 General background knowledge

The Camin–Sokal parsimony method, popular for only a brief time in phylogenetics, was argued in terms of the parsimonious nature of the evolutionary process, such as the probability of character state change being rare. As Camin and Sokal (1965, pp. 311–312) stated:

Comparison by Camin of these various schemes with the “truth” led him to the observation that those trees which most closely resembled the true cladistics invariably required for their construction the least number of postulated evolutionary steps for the characters studied.

However, this justification for parsimony falls short of general background knowledge, because it has been falsified empirically. Any unweighted most-parsimonious hypothesis of species relationships on which character states cannot be optimized as unique and unreversed disconfirms this justification, assuming the absence of systematic error. The best falsifiers in this regard are physico-chemically identical nucleotide states. Moreover, this kind of falsification has long been considered commonplace (Felsenstein 1979, p. 60). At best, the probability of character-state change being rare is an auxiliary conditional of the kind one expects to find in a model. In fact, Felsenstein (1979) took Camin and Sokal’s statement to be the model for a likelihood argument: the hypothesis of maximum likelihood being the rooted branching pattern that requires the fewest character-state changes to explain the observed data, assuming the absence of systematic error and processes that lead to reversals of character evolution, 1 → 0.

A similar auxiliary conditional, again not of the quality of general background knowledge, forms the basis for Farris’ (1977a, b) Dollo method. Here, the opposite of the Camin and Sokal model is assumed. (1) Forward changes (0 → 1) are allowed, but are considered very rare. (2) As many reversions (1 → 0) are permitted to occur as are necessary to explain the data. As Farris discussed (1977a, p. 86): “A useful way of assessing the significance of a probability ratio between two trees is to compare it to the likelihood ratio between null and alternative hypotheses attained when the null hypothesis can be rejected in favor of the alternative at exactly error-rate α in large-sample normal statistics.” Basically, the phylogenetic hypothesis with fewest reversions is preferred under that model of evolution (see review by Blackburn 1984).

Earlier references to a general background knowledge kind of minimum evolution assumption in phylogenetic inference can be found in Edwards and Cavalli-Sforza (1963, 1964), and that assumption continues to be explicitly assumed or implied (e.g. Dayhoff and Eck 1968; Dayhoff and Park 1969, p. 7–16; Crisci and Stuessy 1980; Cartmill 1981; Kumar et al. 1993; Pritchard 1994; de Queiroz 1996; Larson and Losos 1996; Gee 2000, pp. 6–7). This class of justifications for discrete character-state change, as well as those
for minimum distances (e.g. see Kidd and Sgaramella-Zonta 1971; Farris 1972; Rzhetsky and Nei 1992; Swofford et al. 1996, p. 451), not only generally fail in their presumption that evolutionary change is rare, but they exceed the sufficient general background knowledge premise of “descent, with modification” (see below).

2.5.2 Pragmatic

Many pattern cladists appeal to a kind of pragmatic justification for their use of parsimony. Their usual interpretation of pragmatic is like Friedman’s (1983, p. 269), where the descriptively most efficient hypothesis is sought because it is argued to be the most predictive (Farris 1979). The kind of prediction most commonly mentioned is that of other characters, even in the situation where an evolutionary explanation of homology is explicitly denied (Brower 2000)! A frequently repeated justification for why pattern cladists exorcise evolutionary assumptions is that the most-parsimonious pattern of relationships, once determined, can then, and only then, serve as evidence of the basic principles of evolution. Brady (1985) went even further, arguing that the pre-Darwinian standing patterns of natural history (common plan, homology, ontogenetic parallelism, and the hierarchy of groups) are recoverable with the parsimony method, without presuming “descent, with modification.” In other words, it is parsimony’s inductive confirmation of patterns that is critical, not its evolutionary explanation. And, like fitting the simplest curve to a set of points, pattern cladists argue that an hypothesis will be produced in the long run that is predictively efficient and arbitrarily close to the truth. Of course, Brady’s argument runs afoul of reification, i.e. it is illogical to interpret an abstract pattern as a historical thing (Kluge 2003b; see however Rieppel and Kearney 2002).

There are a number of additional points that can be made regarding the pattern cladists’ pragmatic, theory-free, justification for parsimony when using it to predict phylogeny (see review by Kluge 2001a). (1) To begin with, the evidential basis for grouping species (taxa) is overall similarity, which is notoriously deficient when it comes to identifying the relative recency of common ancestry required of all approaches to phylogenetic inference. As will be further discussed below, it is also imperfect when using shared-derived similar states (Farris et al. 1970, p. 187; Kluge and Farris 1999). Even physico-chemically identical states of nucleotide characters are well known to be an imperfect index to common ancestry. (2) Without a minimal a priori assumption of ‘descent’ there is no reason to presuppose a nested, hierarchical, pattern of relationships; it might just as well be a circular array, a reticulate pattern, or a periodic order. (3) There is also no reason to assume that the pattern of relationships is necessarily dichotomous, since the difference between the reticulate pattern of tokogeny and the increasingly divergent pattern of species relationships is theory-dependent. (4) There is no reason to exclude compatibility/clique analysis as a kind of parsimony method because the largest clique consists of a completely congruent set of characters, one in which all the evidence is unique and unreversed, without exception. (5) Without an assumption, such as “descent, with modification,” there is no justification for optimizing character states at the internodes of a pattern of relationships. The fact that Wagner and Prim networks provide most-parsimonious hypotheses among the terminals without entailing internodes should make them the methods of choice for the pattern cladist. (6) Without a limiting assumption, such as ‘descent,’ i.e. one history, there is no reason to seek one most-parsimonious hypothesis of relationships. In other words, pattern cladists have no basis for using the phylogeneticists’ optimality criterion of the most parsimonious tree hypothesis.

2.5.3 Unification

Some might consider systematics, including most of its sub-disciplines, to be the “poster-child” for disunification, not unification. For example, there is no consensus as to phylogenetic method and relevant evidence, let alone a theory of inference. Consider, phenetics was born out of a concern for similarity relations, shared states, not shared steps (Farris et al. 1970, pp. 178, 187), and pattern cladism continues to be an argument for phenetics
(Kluge and Farris 1999). Further, Bayesian and likelihood inference of species history involve excess assumptions and subjectivism, where there is little to connect what is observed to what is analyzed and explained as character data. Moreover, there is as yet little published that convincingly indicates theory unification within any of the disciplines that relate to phylogenetic inference, such as the study of adaptation, vicariance biogeography, coevolution, and taxonomy (see below).

However, I believe there is one overarching consideration that holds promise for the unification of all the historical sciences, which is to eliminate or reformulate the theories (and methods) of those disciplines and sub-disciplines that are inconsistent with the ontological status of what is being inferred. The history of species relations being what it is clearly defines the criteria for systematics more generally and phylogenetic inference in particular: an ordered set of historically contingent and necessarily unique events of common ancestry. As will be further discussed below, a unification of this kind would eliminate those theories (and methods) that are indeterminate or inconsistent and cannot therefore have any unifying power in this kind of historical science. Such a unified theory in its ultimate form could then be legitimately judged an ideographic science, as distinct from the nomothetic. That is to say, historicism would finally be removed from phylogenetics (Popper 1957). This ideographic science would not only be recognized for its power to unify, but to simplify and explain the particulars of species diversity. A step in the direction of this kind of unification has already been taken in questioning the similarity-based theory of character and replacing it with an evolutionary concept of “transformation series” or “stages of expression” (Hennig 1966, p. 91; Kluge 2003b; Grant and Kluge 2004; see however Rieppel and Kearney 2002). This is considered significant because all of the important entities in phylogenetics, the species relations, as well as the statements of homology, conceptualized as spatio-temporally restricted objects or events, are interpretable in terms of evolutionary theory (Hull 1977, 1989; Sober 1980). This is not to say that another form of theory unification cannot be achieved according to other, or alternative, principles. If competing forms of phylogenetic unification should actually be formulated then they can be evaluated in terms of that which has the greater generality or scope, the more general and broader in scope being more vulnerable to refutation (see epigraph).

2.5.4 Anti-free parameters

Numerous authors have examined the premium placed on free parameters in terms of the models assumed by parsimony and likelihood methods of inference (e.g. see Farris 1973b; Felsenstein 1973, 1979, 1981a,b, 1982, 1983, 1988; Sober 1985, 1988a; Felsenstein and Sober 1987; Goldman 1990; Steel et al. 1993, 1994; Yang et al. 1995a; Yang 1996; Lewis 2001; Steel and Penny 2000; Steel 2002). According to Tuffley and Steel’s (1997: 599, italics in the original; Steel and Penny 2000, Theorem 2) Theorem 5, “Maximum parsimony and maximum likelihood with no common mechanism are equivalent in the sense that both choose the same tree or trees.” No common mechanism in this theorem refers to the absence of constraints on edge parameters from site to site. This theorem does require, however, the simplest type of substitution model at a particular nucleotide position, a Poisson model, where each of the possible substitutions occurs with equal probability. Given just a tree and a single character (and no information as to edge lengths), the maximum likelihood estimate of the state at any internal node is precisely the maximum parsimony state.

More recently, Goloboff (2003) provided an example of anti-free parameter justification for parsimony in phylogenetic inference, where he proved that parsimony assumes fewer model parameters than does likelihood. Thus, the unweighted most-parsimonious hypothesis of species relationships must necessarily be included in any likelihood ratio test to decide whether the simpler model should be rejected.

However, I believe the fact that Tuffley and Steel assumed more model parameters than did Goloboff illustrates the futility of attempting to understand a method based on the model it supposedly implies. If exactly the same hypothesis can be understood as being derived from a maximally
complex model (in terms of free parameters) or a maximally simple model (in terms of free parameters), or just assuming “descent, with modification,” as background knowledge, then this demonstrates that focusing on free parameters is a meaningless exercise.

Another anti-free parameter justification for parsimony in phylogenetic inference is Akaike’s framework (e.g. see Posada and Crandall 1998, 2001a, b; Sober and Steel 2002), where the goal of model selection is predictive accuracy, and parsimony is employed in hypothesis evaluation. However, the ‘uniformity of nature’ assumption disqualifies that framework (see above) when it comes to phylogenetics. To assume that the old and new data sets evolved according to the same underlying distribution is a counter-factual conditional, and one that is recognized among phylogeneticists as being generally false. More importantly, while this framework may provide instrumentalism with a kind of justification, its appeal to frequentism is denied in the study of phylogeny, because the events of interest and the relevant evidence are necessarily unique (Kluge 2002; Grant and Kluge 2004).

Minimally, any phylogenetic method, parsimony or likelihood, that assumes a model, can be criticized. First, models assume counter-factual conditionals. There is also the issue that models are usually statistical, and to relate them to the necessarily unique hypotheses of phylogeny is illogical. Further, to employ a model is to assume more than background knowledge, that which is minimally sufficient to provide a causal explanation of historical individuality (Kluge 2002; Grant and Kluge 2004; see also below).

2.5.5 Local background knowledge

This justification for parsimony has taken a variety of forms in phylogenetic inference, including weighting (Wheeler 1986; Goloboff 1993b, p. 83). The several kinds of weighted parsimony analysis, namely a priori, successive (iterative, a posteriori), implied (heaviest tree), support, and strongest evidence, attempt to correct for instances of homoplasious similarity, which is assumed to lead to a better-supported, more reliable, hypothesis (e.g. Farris 1966, 1969, 1979, 2001; Le Quesne 1969; Goloboff 1993b; Mindell and Thacker 1996; Penny et al. 1996; Salisbury 1999). All of the more explicitly stated arguments for differential character weighting assume some concept of conservatism/constancy (uniformitarianism; e.g. Goloboff 1993b; see Kluge 1997b).

More specific criticisms of weighting that must also be rebutted before proceeding with any such practice include the following: (1) Weighting leads to suboptimal, less-parsimonious, not more-parsimonious, phylogenetic hypotheses when it comes to the data of observation (Kluge 1997b; see however Farris 1983). Thus, to weight is to be logically inconsistent with parsimony’s goal of maximizing explanatory power and finding the best-supported hypothesis given the evidence (sensu Grant and Kluge 2003). As an aside, by the same argument, weighting does not maximize descriptive efficiency (Farris 1979; Kluge 1997a). (2) Assuming a conservative/constancy model of evolution also diminishes severity of test (Kluge 1997a). (3) Weighting contributes to a loss of character independence; there is a loss of independence by virtue of the fact that the members of any weighted-class of characters (the more or less conservative classes) are weighted the same (Kluge 1997b). (4) There is also a potential loss of information because an incongruent character state can in fact increase phylogenetic structure; e.g. a reversed state can be diagnostic of a monophyletic group (Källersjö et al. 1999).

Sober (1988a, 1994, p. 85; see also Felsenstein 1973) provided quite a different kind of local background knowledge justification for parsimony, where parsimony impacts on likelihoods in terms of a common causal explanation. Many phylogeneticists, including some who call themselves cladists (e.g. Nelson and Platnick 1981), have implied that their inference extends from such an explanation. As developed by Sober (1988a), Bayes’ theorem summarizes the plausibility of the common causal explanation of homology (cc) in relation to the separate ‘explanation’ of homoplasy (sc) in light of shared-derived character-state similarities or synapomorphies (e), p(cc, e) = p(e, cc) p(cc)/p(e), and p(sc, e) = p(e, sc) p(sc)/p(e). Sober (1988a, p. 79) traced the concept of common causal
explanation to the idea of improbable coincidences developed by Russell (1948) and Reichenbach (1956): “If two events are similar in ways that would be immensely improbable if they had separate causes, we may reasonably hypothesize that they trace back to a common cause.” Sober’s ‘Smith/Quackdoodle theorem’ formalized a common causal explanation for three taxa (Sober 1988a, p. 239). In the present case, a hypothesis of homology is considered more plausible than a separate cause of homoplasy when the independent origin of similar shared-derived (synapomorphic) character states is relatively unlikely. Sober (1988a) argued, by analogy, that the genealogical relatedness of two people listed in a phone book as Smith is not as plausible as that of two people named Quackdoodle.

Given the same denominators in p(cc, e) and p(sc, e), p(e, cc) p(cc) > p(e, sc) p(sc), or p[(A1, B1), cc] > p[(A1, B1), sc], where the hypothesis of homology is considered more plausible than the separate cause of homoplasious similarity observed in two species, A1, B1. For three taxa, the likelihood terms, p(e, h), are p[110, (A,B)C] > p[110, A(B,C)], or p[110, B(A,C)]. It is in this context that this set of terms was considered “a prima facie plausible inference principle” (Sober 1993, p. 174).

While there appears to be explanatory power in Sober’s Bayesian approach to phylogenetic inference, where shared-derived similarities are explained as homologues, his justification depends on a frequentist assumption of character-state occurrence (the frequency of Smiths and Quackdoodles in a phone book), as well as a causal explanation of homoplasy (see above). Those who assert that parsimony is a kind of likelihood have some form of plausibility parsimony in mind (e.g. Swofford et al. 1996; de Queiroz and Poe 2001, 2003), not an unweighted parsimony analysis (see also Kluge 1997b, 2001b).

A phylogenetic hypothesis cannot tell you whether a given character relation is expected or not. Each transformation is necessarily unique, and any hypothesis of such change can only be true (p = 1) or false (p = 0); some frequentist probability value in between true and false has no meaning when applied to the concept of historical individuality of transformation (Kluge 2002; Grant and Kluge 2004; see however Sober 1988a; Felsenstein 2004). To be sure, phylogeneticists can be wrong in their choice of data used to test some part of species history, but that concerns the uncertainty of the operational issues required to identify transformation series. There is no logic that says phylogenetic hypotheses are able to say how probable the observations are as evidence of common ancestry “if we append further assumptions about character evolution” (contra Sober 1994, p. 88; my italics). Yes, there is uncertainty in the observations systematists employ as evidence of phylogenetic relationships, a normal part of operationalism; however, as we will see below, there is no uncertainty in the relationship between the ideographic character concept of transformation series and the nested hierarchy concept of species relationships: the two concepts are perfectly coincident (e.g. see Hennig 1966, Fig. 21).

Further, the concept of support is important in the inference of species relationships (Grant and Kluge 2003), as measured by the relative degree of corroboration of the competing hypotheses, not their probability/plausibility. Assessing truth is subjective, founded on probabilities, statistics, or likelihood, where what is being inferred, and the evidence used to infer it, are misapplied class concepts. As argued elsewhere (e.g. Kluge 2002), it is illogical to treat historical individuals as class concepts, and to do so leads unnecessarily to over-reductionism (Frost and Kluge 1994).

2.5.6 Testability

Phylogenetic inference has long been cast in Popperian terms (Wiley 1975; see also Bock 1973), where testability is a function of the improbability of a hypothesis of relative recency of common ancestry, not its frequentist probability. In the case of phylogenetic inference (Kluge 2003a), assuming only “descent, with modification,” as background knowledge, the evidence for competing hypotheses of sister-group relationships should be equally likely. Thus, in the simplest case of three terminal taxa, the possible hypotheses are P(A,B), P(A,C), and P(B,C), and the expected data of observation equally likely, S(A,B) = S(A,C) = S(B,C). However, if a large majority of one of those possible
kinds of data were to be observed in an unbiased sample, say S(A,B), which counts against P(A,C) and P(B,C), but counts for P(A,B), then this is improbable given the background knowledge alone, but not under that background knowledge plus the postulated rooted cladogram P(A,B)C (Kluge 1997a). It follows from this improbability argument, where only incongruent data count as a falsifier, that severity of test increases with the number of those tests that have been carried out, and the more severe the test, as supporting evidence of the corroborated hypothesis, the greater power the hypothesis has to causally explain the data. The corroboration of the hypothesis by the evidence is simply the measure of the degree of support given by the evidence to the hypothesis, and explanatory power and degree of corroboration are maximized by minimizing the data on the hypothesis, where “enjoining parsimony protects the falsifiability of the phylogenetic system from going to zero” (Kluge 2003a, p. 237).

In this system, the unweighted most-parsimonious phylogenetic hypothesis requires the fewest character-state changes or steps. The application of this kind of parsimony in a total evidence analysis of equally weighted data minimizes the total number of hypotheses of character transformation required to explain the heritable variation observed among species and, as such, the unweighted most-parsimonious cladogram represents the objectively optimal phylogenetic theory (Grant and Kluge 2004; Kluge 2004). Moreover, support can be defined objectively in this system (see also above), as the “degree to which critical evidence refutes competing hypotheses. A hypothesis is unsupported if it is either (1) decisively refuted by the critical evidence or (2) contradicted by other, equally optimal hypotheses (i.e. the evidence is ambiguous), otherwise it is supported. That is, rational hypothesis preference is based on the relative degree of corroboration of competing hypotheses, where the hypothesis that is the least refuted by critical evidence is preferred” (Grant and Kluge 2003, p. 383). While some have argued that Popper (1959) interpreted testability only in nomothetic terms, there still appears to be no reason why it, and his epistemological principle of explanatory power, do not apply to the ideographic science of phylogenetic inference (Goudge 1961; Popper 1980; Kluge 2003a, p. 238).

2.5.7 Anti-superfluity

Farris’ (1983) minimization of ad hoc hypotheses is a well-known ASP justification for most parsimonious phylogenetic hypotheses. The value of minimizing ad hoc hypotheses is unassailable, as philosophers and scientists alike acknowledge (e.g. Popper 1962b, p. 288; Farris 1983, p. 18), because such hypotheses are adopted only for the purpose of saving a theory from difficulty or refutation, in the absence of any independent rationale. Without such minimization there would be no way to distinguish personal belief from evidence in choosing among competing theories. Moreover, ad hocisms can be explanatorily empty. As Popper (1957, p. 103) pointed out, “the ad hoc hypothesis that the laws have changed would ‘explain’ everything,” but in doing so would explain nothing at all.

Farris (1983) was clear that it was ad hoc hypotheses of a particular kind that are explanatorily superfluous. As he stated (p. 18; my italics), “the explanatory power of genealogy is . . . measured by the degree to which it can avoid postulating homoplasies,” where (Farris 1989b, p. 107) “A postulate of homology explains similarities among taxa as inheritance, while one of homoplasy requires that similarities be dismissed as coincidental, so that most parsimonious arrangements have greatest explanatory power.” Contrary to homologues, homoplasious similarities are then minimized in phylogenetic inference, according to Farris, because they do not constitute propositions of similarity that identify monophyletic groups.

There are, however, significant problems with the different ways homoplasy has been explicated. First, there is the issue of interpreting homoplasy as independently evolved instances of a similar kind (for a review see Kluge 2003b). Suffice it to say, similarity in this context is being treated as a class concept, one tied to lawfulness or natural necessity, where one or more immutable properties constitute the basis for intensionally defining any particular class or kind. As such, similarity is an abstraction, and so too is
any group of organisms defined in terms of having properties of that kind (see however Sober 1988b). Aside from the arguments that have been lodged against using similarity in the inference of phylogeny (Hennig 1966; Farris et al. 1970; Kluge 2003b; see also below), homoplasy cannot be explained in terms of evolution, when homoplasy is intensionally defined as an immutable set of similarity relations. Only ostensively defined, spatio-temporally restricted, things have the potential to evolve according to Darwin’s principles of “descent, with modification” (contra Sanderson and Hufford 1996).

When propositions of homology are tested with character congruence, and from which homoplasy is deduced, homology and homoplasy become a complementary relation, a, and not-a, respectively. As the not-a relation, homoplasy is nominal (everything that a is not) and as such it cannot be causally explained. Of course, any one of the independently evolved instances of homoplasy might be explained in its own right as homology (Kluge 1999). None of this argument denies the lawfulness of natural selection, only as it applies to a set of independently evolved similar things (Kluge 2003b). Homoplasy per se can have no common causal historical explanation because the independently evolved instances of similarity are spatio-temporally unrestricted. If nature has taught us anything, it would be that living things respond to the same selective pressures in any number of ways, a lesson that is anathema to inductive reasoning in comparative biology (T. Grant, personal communication).

Lastly, it was Farris’ (1983) position that homoplasy is merely investigator “error” in the inference of homology. However, there is no natural causal explanation for such error. Although homoplasy as systematic error may be defined intensionally as a class concept, it cannot be modeled as if it were a historical law. Of course, increasing precision by minimizing error is a worthwhile endeavor in all sciences, but it has no epistemological standing itself. As Popper (1979, pp. 356–357) recognized, a “precise statement can be more easily refuted than a vague one, and can therefore be better tested.” However, as he went on to note, the theoretical or the explanatory has logical priority over the practical or “instrumental” tasks of science, such as precision.

Phylogeneticists are concerned with the ideographic, patterns of inherited things that can be deduced from a common ancestral state. That being the case, I assert that homoplasy can be nothing more than a description of inferred transformation events; effectively, it is a description of explanations. Just as referring to something as similar is acausally descriptive, referring to something as homoplasious is acausally descriptive. Explanation of the observed, independently evolved heritable variation is achieved through the inference of transformation events, and nothing explanatory is added by referring to them as homoplasies.

Arguably, homoplasy is an example of Aristotle’s “fallacy of accident,” where distinct differences between the essential and the accidental are assumed, i.e. that independent transformations result in a set of “similar,” causally accidental, things. As Ghiselin (1966, p. 148) argued, we may want to compare similar things, but it is an error to subsume one relation within the other, because homology involves some kind of similarity between organisms (Farris et al. 1970).

It is true that incongruent transformations can be made useful, both in the sense of Hennig’s (1966) reciprocal illumination and also as a heuristic in developing and testing adaptive/selectionist explanations of particular transformations, but even in these cases there is nothing at all explanatory in the term homoplasy. Although further explanation may in principle be achieved for each of those transformation events, e.g. by establishing the selective basis for their origin and retention, those conditions may be determined to be causally the same (cf. homoplasy) or causally different, and in neither case does this impinge on phylogenetic explanation which is concerned with the spatio-temporally restricted, i.e. historical individuals.

I conjecture that it has been the inductionists’ preoccupation with homoplasy (e.g. see Sober 1988, p. 32), with the possibility of interpreting similar character states as repetitions of a kind, that has given license to the myriad of methods concerned with which hypothesis is most likely to
be true (Bayesianism), which hypothesis is statistically the most probable (frequentism), or which hypothesis confers the highest likelihood on the data (likelihoodism), where counter-factual auxiliary assumptions are entailed in an attempt to model the course of such a history of independent evolution. Repetitions, like repeated trials, may instantiate a class concept of some kind of similarity. That set of instances may even be used to generate a frequency profile that is interpreted as approximating a probability distribution relevant to some method. That concept, distribution and method may even be thought of as governed by a universal law or propensity. While this may be nomothetic science at its best, it bears no relationship to the practice of ideographic science.

This is the same argument that denies the use of weighting against instances of homoplasy, from a priori, successive (iterative, a posteriori), implied (heaviest tree), support, to strongest evidence (see above). It is true that the multiple hypotheses required to explain the variation that we describe as similar can lead to the reciprocally clarifying elimination of operational error (Kluge 1999). That, however, does not offer an historical epistemological argument for minimizing instances of independent evolution. The study of homoplasy may be of interest to students of function, but that research has no special meaning for historical biologists, independent of the separately evolved states being homologous. To be so concerned, the phylogeneticist is engaged in fallacious reasoning.

A quantitative kind of ASP has yet to be articulated as a justification for parsimony in the inference of phylogenetic relationships. Wheeler’s (1996) direct optimization approach to gene-sequence alignment may be an analog of ASP, but an epistemological justification has yet to be provided for it (see below). To be sure, few phylogeneticists have shown much interest in the philosophical, and Baker’s (2003) discussion of the quantitative justification was published only recently (however, see preliminaries by Nolan 1997). Also important is the additivity requirement (a collection of qualitatively equivalent individual objects or events in the relevant respects), which would appear to make this justification an unlikely candidate for phylogenetic inference. Indeed, none of the character concepts usually referred to in phylogenetics suggest that kind of individuality, where character states can be interpreted as additive instances of one kind. Recently, however, Grant and Kluge (2004) made that connection with their definition of an ideographic character. As they stated (p. 29; my italics),

    the application of phylogenetic parsimony in a total evidence analysis of equally weighted evidence minimizes the total number of hypotheses of transformation required to explain the heritable variation observed among species and, as such, the most parsimonious cladogram represents the objectively optimal phylogenetic theory.

In fact, it was this treatment of Hennig’s (1966) transformation series character concept that I consider fundamental to my quantitative parsimony rationale (where the conceptualization of history determines the operational and methodological means used in its inference) and which in turn is critical to my attempt to focus phylogenetic inference only on the ideographic.

As Grant and Kluge (2004) pointed out, most of the character concepts currently in use emphasize kinds and degrees of similarity among terminal taxa as evidence of their relationships (for details see Kluge 2003b; see also Rieppel and Kearney 2002). For example, it is usually stated that “Derived similarity is evidence of propinquity of descent” and “Ancestral similarity is not evidence of propinquity of descent” (Sober 1994, p. 87). Aside from the problem of not being able to provide an evolutionary epistemology for “similarity” (because the properties of organisms to which similarities refer are spatio-temporally unrestricted, abstract, and immutable; see above), there is no basis on which to claim an additive accounting. To begin with, the plesiomorphic and the apomorphic states of a single heritable transformation may entail any number of properties (Hennig 1966, pp. 92–93), with the total number of properties being infinite. Moreover, logic dictates that a similarity relation according to one kind of property cannot be equivalent to one
based on another kind, because each kind has its own intensionally defined necessary and sufficient set of conditions. Thus, similarity is without a common currency, because it is one of degree as well as kind.

Grant and Kluge’s (2004) ideographic definition of character is an event concept, events being things that happen, such as phylogenesis and transformation. That definition is not a material object concept, those being things to which physical features are attributed, like volume, mass, and being containable and storable, even though the object is the thing that systematists claim to observe when operationalizing the concept character, and it is the thing geneticists currently use to measure heritability, i.e. the proportion of the variance in a trait among individuals that is attributable to differences in genotype. What is it then that allowed Grant and Kluge to argue that their transformation series character concept is concerned with heritability when the ontological distinctions between event and object imply their incommensurability? To begin with, the problem is simplified by virtue of the fact that the transformation event(s) and the transformed object(s) form a spatio-temporally restricted, historically contingent, transformation series (Hennig, 1966, fig. 21). That is, the locatability and mobility of the event is not a problem with reference to the object, they are causally related, and consequently, paraphrasing Woodger (1929, pp. 301–302), it can be stated that the perceptual object we also call the character state is expressive of certain of the knowable characteristics of the event that can be exemplified in sense-experience. That is, the character state is the event and the event is the character state, or, in a word, the event and the object are coextensive. Thus, the ontological distinctness of mutation (event) and mutant (object) concepts does not deny their causal continuity and their comparability in such terms as heritability. Moreover, it is because the transformation events occupy the same place in the causal sequence, i.e. have the same causes and the same effects, and that they are identical with events described in the causal law of inheritance, that they can be considered identical and additive (Davidson, 1991; Baker, 2003).

With regard to the systematists’ and geneticists’ character operationalisms, it is important to recognize that the phenotypic states that are attributed to an organism are, at best, only proxies for the actual “stages of expression” in the transformation series (Hennig, 1966, p. 91). For example, no one should be fooled into thinking that the states of eye color and handedness in humans are things that literally pass from parent to offspring. An investigator would do well to sample nucleotides if precision in heritability is of particular concern.

As already mentioned above, Hennig’s (1966, Fig. 21) transformation series character concept, assuming just “descent, with modification,” as background knowledge, does not entail the contradictions and inconsistencies with respect to evolutionary theory as do similarity-based definitions. More importantly in my formulation of a quantitative parsimony rationale, adopting Grant and Kluge’s (2004; see also Kluge 2003b) ideographic explication of Hennig’s concept, insofar as it is relevant to phylogenetic explanation, each inferred transformation is metaphysically the same kind of process, with each such event counting equally as heritable evidence in the analysis of singular causal statements of that kind (Bach 1981; Davidson, 1991).

How this quantitative ASP parsimony rationale is connected to Grant and Kluge’s (2004) ideographic character concept is clarified by Farris’ (1967) definition of evolutionary relationship. In that seminal, but largely overlooked, paper, Farris distinguished phenetic and evolutionary or phylogenetic systems, and in doing so he made distinctions and identified other relevant parameters sufficiently rich in ideas to restrict the permissible meanings of relationship. As he pointed out, in distinguishing phenetic relations (pp. 45–47; my italics), “The best one can do is to study the form of the measure of overall phenetic similarity . . . until the meaning of overall similarity is standardized.” Whereas, with four axioms, he precisely defined a priori the evolutionary form of the measure of phylogenetic relationship.

    Axiom 1: The objective of the [phylogenetic] system is to place [species] in such a way as to describe their
    patristic and cladistic relationships as completely as possible. . . . Axiom 2: The patristic difference between [species] is a function of the displacement in all the unit characters of the [species] along the phyletic line connecting the [species]. . . . Axiom 3: The phylogenetic relationship between two given [species] is a fixed value. . . . Axiom 4: The measures of patristic and cladistic difference are non-negative real numbers.

And from these axioms, Farris characterized two phylogenetic relationship functions, cladistic and patristic (not the patristic of Sokal and Camin 1965), with phylogenetic relationship being the negative of the corresponding cladistic or patristic differences. For example, as he stated:

    The overall patristic difference is the sum of the patristic unit character differences. Each patristic unit character difference is the summation of the changes of that character from point to point over the phyletic line between the [taxa] compared.

Likewise, Farris defined cladistic difference as the sum of the number of lineage divergences between any two taxa and their most recent common ancestor, which means that both phylogenetic relationship functions have the same properties of historical individuality, where each divergence and transformation is spatio-temporally restricted and necessarily unique. As Farris (1967, p. 47) concluded, “The two components of evolutionary difference thus have similar properties, and this fact lends a certain unity to the concept of [phylogenetic] relationship.”

Let there be no mistake, Farris’ definition of phylogenetic relationship and relationship functions are genealogical and not phenetic. As Darwin (1859, p. 420) stated:

    All the foregoing rules and aids and difficulties in classification are explained, if I do not greatly deceive myself, on the view that the natural system is founded on descent with modification; that the characters which naturalists consider as showing true affinity between any two or more species, are those which have been inherited from a common parent, and, in so far, all true classification is genealogical; that community of descent is the hidden bond which naturalists have been unconsciously seeking, and not some unknown plan of creation, or the enunciation of general propositions, and the mere putting together and separating of objects more or less alike.

While Kluge and Farris (1969; see also Farris 1970) provided an heuristically efficient, if not an effective, algorithm in their Wagner method for choosing a best fitting hypothesis, it was Farris et al. (1970) who further clarified and extended the meaning of cladistics and patristics (Farris 1967). They abstracted four premises from Hennig’s (1966) “Phylogenetic Systematics,” and from these they derived three theorems, plus corollaries, which they used to explicate evolutionary tree hypotheses in accordance with Hennigian phylogenetic principles. Their specific points that constitute the basis for my ideographic interpretation of phylogenetic systematics are as follows. (1) Their Axiom I described Hennig’s transformation series concept of character (see also Grant and Kluge 2004), which defined the evolutionary ordering of character-states as plesiomorphous and apomorphous in the simple case or as a character-state tree when the transformation series consists of more than two stages of expression, and they allowed reversals and any state to be potentially permissible as the most ancestral state for some restricted part of the tree (their Axiom I′). (2) According to their Axiom II (Farris et al. 1970, p. 173), all monophyletic groups are distinguished by sharing one or more apomorphous “stages of expression,” whether the group has an apomorphic state x or a state apomorphous relative to state x (as determined by the predefined character-state tree). (3) Transformation series or stages of expression were characterized in terms of “steps” or “derived steps,” where emphasis was put on sharing “stages of expression.” As Farris et al. (1970, p. 174; italics in the original) summarized, “two [taxa] with states y and z share a step, x, if and only if y [is derived from] x and z [is derived from] x.” (4) In their Axiom III (Hennig’s auxiliary principle), in the absence of evidence to the contrary, any state corresponding to a step shared by a group of taxa is assumed to be unique and unreversed (at least locally). (5) Their Axiom IV measured the strength of the evidence for a monophyletic group: the more characters certainly interpretable as apomorphous the better founded is the assumption the group is monophyletic. (6) Their Theorem I provided the basis for describing the common ancestral state of a monophyletic group,
i.e. the most derived state from which the sister lineages are derived. In this theorem they identified homologous states, at least in the unambiguously optimized case. (7) As a corollary, their Theorem II described the same relation for taxa, i.e. the common ancestor for a monophyletic group is the most derived hypothetical taxon from which the sister lineages are derived. In these two theorems, they had characterized both monophyletic taxa and transformation series as spatio-temporally restricted. (8) Their Theorem III stated, in terms of derived steps, the evidential basis for a taxon being excluded from a monophyletic group.

Summarizing, Farris et al. identified a close connection between Hennigian phylogenetic systematics and unweighted most parsimonious hypotheses of species relationships, and they found the Wagner method for inferring those hypotheses (Kluge and Farris 1969; Farris 1970) to be consistent with their generalization of the Hennigian axioms. Effectively, they made the connection between amount of evidence (Axiom IV) and the minimum number of steps required of the unweighted most-parsimonious hypothesis of species relationships, which is measured by the additive requirement of quantitative parsimony.

The virtue of minimizing the quantitatively superfluous (the patristic difference) is that collectively the hypothesized heritable stages of transformation (T, not synapomorphy or S) can explain the particular phenomenon of homology (H) as a nested series of such statements. The explanation is ‘additive’ in the sense that the overall phenomenon of phylogeny (P) is explained by totaling the individual positive contributions of each transformation, where quantitative parsimony tends to increase the explanatory power of phylogenetic hypotheses compared to their less quantitatively parsimonious rivals. Less quantitatively parsimonious hypotheses can only match the more quantitatively parsimonious propositions in explanatory power by adding auxiliary claims of one sort or another.

The following injunction summarizes these details: choose the hypothesis of cladistic relationships that minimizes the overall patristic difference, because that hypothesis has the greatest power to explain the independently heritable transformation events as propositions of homology. In keeping with Farris et al.’s (1970, p. 172) reference to a “quantitative analog of phylogenetic systematics,” I believe it is fitting to designate this ideographic kind of phylogenetic inference quantitative phylogenetic systematics (QPS), and I name its quantitative parsimony rationale Farris parsimony (FP), in recognition of James S. Farris’ many significant contributions to the theory of phylogenetics. Wagner parsimony remains an efficient, if not an effective, method for operationalizing FP (Kluge and Farris 1969; Farris 1970).

I underscore the fact that not one of the parameters of QPS (sensu Farris 1967; Kluge and Farris 1969; Farris et al. 1970) is conceptualized in terms of similarity, nor is FP’s rationale identified with the minimization of ad hoc hypotheses of homoplasy. Setting aside similarity in these conceptualizations means more than discounting overall similarity, which subsumes symplesiomorphy, synapomorphy, and independent evolution (Hennig 1966; Kluge 2003b). It also means setting aside synapomorphic similarity, s(A,B), i.e. “shared-derived character states,” when it comes to the conceptual. At the very most, conceptually speaking, one might say that QPS is left with a kind of similarity “owing to ancestral states,” as per Farris et al.’s (1970, p. 187) formal distinction between s(A,B) and sE(A,B). And, as they pointed out:

    The actual choice of a phyletic tree is left to an algorithm that effectively constructs the evolutionary hypothesis most in accord with available data. Thus only a weak connection between s or sE and relationship is assumed.

Unfortunately, their distinction between the conceptual and the operational when it comes to similarity in phylogenetic inference (see also Kluge 2003b; Grant and Kluge 2004) has been largely overlooked, even in the most recent literature (e.g. see Mayr and Bock 2002; Ghiselin 2004; Padian 2004).

Although there may not be an absolute criterion for knowing the truth, as I stated above, specifying the conditions for truth is no more burdensome in QPS than it is in other approaches to phylogenetic inference. For example, given three terminal taxa, the statement “(A,B)C is true” if and only if A and B
share a more recent common ancestor than either does with C; “(A,C)B is true” if and only if A and C share a more recent common ancestor than either does with B; “(B,C)A is true” if and only if B and C share a more recent common ancestor than either does with A. In turn, a transformation series characteristic of A and B is presumed to be homologous if and only if the stage of expression observed in A and the stage of expression observed in B can be derived eventually from the ancestral state of the group (A,B), not including C; a transformation series characteristic of B and C is presumed to be homologous if and only if the stage of expression observed in B and the stage of expression observed in C can be derived eventually from the ancestral state of the group (B,C), not including A; a transformation series characteristic of A and C is presumed to be homologous if and only if the stage of expression observed in A and the stage of expression observed in C can be derived eventually from the ancestral state of the group (A,C), not including B (Theorem I, Farris et al. 1970, p. 175).

As suggested above, the virtue of ASP is that collectively qualitatively equivalent things can explain some particular phenomenon, like transformation series, whose stages of expression constitute the basis for inferring the parts of species history. It is the application of FP (choosing the unweighted most parsimonious hypothesis of species relationships) that maximizes explanatory power, i.e. the stages of expression form a nested series of homology statements. I maintain that the difference is conceptually significant between minimizing steps, sE(A,B), in order to maximize explanatory power, and minimizing ad hoc hypotheses of homoplasy in order to explain shared-derived character state similarities, s(A,B), as homologues. Transformation has a common causal explanation in this historical science (heritable change), whereas similarity and homoplasy do not.

2.6 Judging the ontological consistency and sufficiency of parsimony justifications

Each parsimony justification discussed in the previous section was examined for its ontological and epistemological consistency in the inference of phylogeny, inconsistency being any apparent negation or contradiction among the concepts and operations that are claimed to lead to advances in objective knowledge (Grant 2002). In these evaluations I was primarily concerned with the normative aspect of parsimony, what justifies the minimization, and not with the descriptive or how the related optimality criteria are scored (Sober 1983). The limited evaluations undertaken are an example of normative naturalism, where rules or criteria are identified for picking the theories or concepts according to the aims of the discipline at hand (L. Laudan 1990).

Leaving the theory unification justification for parsimony until the final section, it is in this sense of evaluation that all but the testability and quantitative justifications were judged to be inconsistent, having failed ontologically and/or epistemologically. For example, the failures of the general background-knowledge justifications were simply a function of the falsity of the assumptions they make. And, claiming an advance in knowledge is not possible according to the pragmatic justification because there is no way to evaluate either ‘best fit’ or explanation. The principal reason for the failures of the other justifications was not distinguishing between the ideographic and nomothetic (not taking account of the ontological status of what is being inferred), each instance of common ancestry being necessarily unique and not a class or set of things.

What constitutes a minimally sufficient justification for parsimony is another important kind of evaluation. In this I am guided by the ‘principle of less is more,’ a conditional of the usual form, if p then q, if the less of something then the more of something else. For example, according to Sober (1988a, p. 11; my italics), the “less we need to know about the evolutionary process to make an inference about pattern, the more confidence we can have in our conclusions. From the point of view of an evolutionary theory that is used to uncover phylogenetic relationships, the best outcome would be that minimal process assumptions suffice to identify that pattern. If on the other hand a detailed understanding were required of why evolution proceeded in the way it did, then an
inference about pattern would have to await a detailed understanding of process,” or be unattainable if the understanding of process is entirely dependent on knowledge of the historical pattern.

How to exploit the ‘less is more’ conditional is suggested by Sober’s reference to “confidence”, i.e. there being a basis in the logical condition of being necessary/sufficient: if p is a necessary condition of q, then q cannot be true unless p is true. If p is a sufficient condition of q, then given that p is true, q is also true. However, as Sober recorded, there are some potential difficulties with this logic when applied to the current topic. (1) What is the minimum set of assumptions required to make an inference of phylogeny? (2) At some point, pattern will become irretrievable, because process assumptions will have become too meager, and how is that point to be recognized (Sober 1988a)?

In phylogenetic inference, the principles of “descent, with modification,” are widely considered the minimal set of assumptions. To assume “descent” alone is too meager, since it does nothing more than provide an assumption of common ancestry (contra de Queiroz 1992, p. 305), albeit an important assumption in the argument against a creationist interpretation of pattern. To assume absolutely nothing about evolution, as pattern cladists claim, is obviously vacuous because it provides no basis whatsoever for an empirical evaluation of competing hypotheses of relative recency of common ancestry. The simplicity justification that Goloboff (2003) attributed to parsimony pertains to model parameters, i.e. auxiliary assumptions, which are in addition to “descent, with modification.”

Only premises that are not known to be false can serve as background knowledge, and it is in this sense that assuming just “descent, with modification,” is not considered problematic (Siddall and Kluge 1997, p. 320). Further, the hierarchy of relative recency of common ancestry, given this particular example of background knowledge, cannot be judged an a priori truth (pace Brady 1994, p. 22). On the other hand, models are problematic, because they are counterfactual conditionals, of the general form “if p were to have happened q would have happened, where the supposition of p is contrary to the known fact not-p.” It is for this reason that assumptions used in the inference of phylogeny, other than “descent, with modification,” should be looked upon with skepticism, if not outright rejected.

The ‘less is more’ principle can also be interpreted as saying “the less one assumes the more one can test, and thereby explain.” For example, in assuming a model of homogeneity of rate of evolutionary change in phylogenetic inference, one cannot in turn use the phylogenetic hypothesis to measure that rate or its homogeneity. Likewise, if the temporal order of fossils is used to polarize a character then that temporal record cannot serve as the basis for testing competing hypotheses of species relationships (Donoghue et al. 1989). Consider further that to a priori down-weight transversions relative to transitions leaves no opportunity to judge that inequality most critically, i.e. historically. In Popperian terms, the simplest (unweighted most-parsimonious) set of assumptions maximizes severity of test, and in turn explanatory power and degree of corroboration (Kluge 2003a). Historical scientists cannot afford to lose these opportunities to critically evaluate hypotheses of relative recency of common ancestry, because that is the basis for providing objective knowledge.

I conclude that the testability and quantitative justifications for parsimony are judged ontologically consistent, as well as minimally sufficient, in that they rely solely on the non-problematic background knowledge of “descent, with modification.” Not only are model assumptions of some of the other justifications ontologically inconsistent, they violate the principle of less is more. The question remains, however, whether the testability and quantitative justifications are rival, complementary, or coextensive in the inference of phylogeny. The basic issue is whether or not QPS is inclusive of testability.

Having a skeptical research ethic, as formally provided by testability, is certainly to be commended in the historical sciences, because nothing inferred can be proven to be true, or even probably true (Kluge 2002). As already discussed above, testability places a premium on the improbability of hypotheses, not on their probability, where evidence consists of reports of the outcome of sincere attempts to refute a hypothesis, not of
attempts to verify it. Thus, falsificationism is distinguished from verificationism, deduction from induction (Kluge 1997a). In maximizing severity of test, given the total relevant available evidence, explanatory power and degree of corroboration are maximized, and objective knowledge gains can be claimed, as can an objective measure of support (Grant and Kluge 2003; Kluge 2004). All that is required of testability is that the investigator exhibit no a priori bias towards one or more of the competing phylogenetic hypotheses, as in the simplest case of P(A,B), P(A,C), or P(B,C). Thus, when a large majority of one of the kinds of data, T(A,B), T(A,C), or T(B,C) in QPS, is observed in the unbiased sample, say that which counts for P(A,B), then the phylogeneticist can argue that this is improbable given only “descent, with modification,” but not given that background knowledge plus the rooted cladogram P(A,B)C. That a parsimony algorithm maximizes severity of test is what is important, not that its justification is sought in a particular argument, such as testability or the quantitative interpretation of ASP.

I believe testability benefits QPS in another way, besides providing for severity of test. As Grant and Kluge (2003) pointed out, testability is accompanied by an objective concept of support, which QPS is not provided with in the quantitative justification for parsimony. While it can always be argued that such a concept will eventually be defined for that justification, I fail to see that there is any room for it. By this I mean FP is limited to maximizing explanatory power, without the accompanying severity of test that underlies the concept of support in testability. A concept of support might be formulated for QPS but I don’t see how it could be anything but “explanatory power = support.” Having defined QPS as including testability, this problem may be considered moot. While adopting the philosophy of testability in the practice of QPS has great merit, it is FP that provides a sufficient justification for advancing our knowledge of species relationships (Kluge 2003a, p. 237).

2.7 The fundamental nature of Farris parsimony in phylogenetic inference

While FP certainly qualifies as sufficient in the inference of phylogeny, and I believe minimally so as well, some may doubt those conditions are also necessary. Taking my cue from Sober (1986, p. 41; see epigraph), I will now judge how necessary FP is by examining some of its sufficient conditions for fundamental contributions, those justifying principles that provide a general framework for characterizing and investigating the empirical aspects of phylogenetic inference and related fields of inquiry.

In this regard, I believe the empirical nature of phylogenetic inference benefits significantly from parsimony being defined in terms of transformation series. For example, according to that definition, FP then provides an evolutionary epistemology. That argument, either the empirical or the epistemological, cannot be made when the concept of character is defined in terms of similarity, s(A,B) (Hennig 1966; Farris et al. 1970, p. 187). Moreover, similarity is neither predictive nor projectible (Kluge 2003b).

Further, it is equally important to re-emphasize the fact that FP departs significantly from similarity-valued comparisons, where the phylogeneticist is faced with having to argue relations in terms of natural kinds and properties, concepts that are contradictory, if not antagonistic, to evolutionary theory (Frost and Kluge 1994; Kluge 2003b; Grant and Kluge 2004). The ideographic character concept defined by Grant and Kluge (2004), with its unambiguous reference to the concept of transformation series (Hennig 1966), and in turn heritability, is not only evolutionary, it focuses directly on what it is the phylogeneticist is concerned with (parts of species history and homology), and more precisely on the congruence of two kinds of historical things, monophyletic taxa and transformation or heritable change (Farris et al. 1970). That FP provides a criterion for choosing among competing hypotheses of these parameters may be interpreted as a fundamental conceptual advantage over all those methods that rely on counting instances of similarity of a kind, including those similarity-based uses of parsimony, s(A,B).

Character coding is an area where similarity has been mistakenly involved in a variety of ways. For example, Lipscomb (1992) advocated that all multistate characters be treated non-additively, even though that kind of coding
discards both the form and direction of the character-state tree, preserving only the identity of the character states. Additive coding, by contrast, preserves the form and the direction of evolutionary ordering, and as Farris et al. (1970, p. 181) concluded, only “Additive coding corresponds directly to the operations employed in the phylogenetic system,” even though an analysis of non-additive characters results in a hypothesis of relationships equal to or of fewer steps than an additive analysis (Grant and Kluge 2003). Curiously, Lipscomb (1992, p. 51; italics in the original; my bold) asked the correct question, “If parsimony is minimizing some assumptions and this is to be used to derive hypotheses of multistate character transformation, we must decide what types of assumptions are most important to minimize,” but then proceeded to operationalize it incorrectly. Incorrectly I say (p. 52), because the “order of the states is postulated so that states that are most similar are adjacent to each other” was the first step in her “transformation series” method. To assume the non-additivity of complex morphological characters is in principle to embrace a pattern cladistic kind of phenetics, i.e. to be content with similarity relations or shared states, s(A,B) (Kluge and Farris 1999). Moreover, Lipscomb’s (p. 54) elaborate method for distinguishing “the incongruence in the character that is due to the character state order from that caused by non-homology of states” becomes a non-issue. As Farris et al. (1970, p. 181) clearly had in mind, the shared steps ordering relation of taxa and character states is integral to the phylogenetic system, of which FP is a part, whereas “in phenetic practice this need not be.”

An unweighted most-parsimonious phylogenetic hypothesis is chosen because it counts more transformation events as homologues than does any competing hypothesis, and when the concept character is defined ideographically, as an historical individual, all such events are necessarily independent. “The dependency between such transformation series is non-problematic, because it merely reflects the transformation event(s) they share, i.e., the shared portion of their history” (Grant and Kluge 2004, p. 26). The functional and developmental dependence that occurs at the level of the organism is not problematic either, because it merely reflects the integrated nature of the organism, as a whole. However, a category error is committed when these two kinds of non-independence are conflated (e.g. see Naylor and Adams 2001, 2003), and the problem is unavoidable when the concept character is defined in purely operational terms, as it was in Rieppel and Kearney’s (2002) similarity definition of character. Dependence in QPS is not a problem, provided the ideographic concept of character is primary and the operationalisms of character analysis follow from it. To observe functional and developmental dependence may affect how the systematist a priori defines the unit character, but the possibility of such dependence does not invalidate choosing the unweighted most-parsimonious hypothesis.

The importance of the distinction between sE(A,B) and s(A,B), a patristic difference and synapomorphic similarity difference, is no more evident than in the analysis of homologous nucleotide sequences that differ in length. Typically, in multiple sequence alignment, gaps are introduced so that base correspondences can be interpreted as shared similarities. Alternatively, there is Wheeler’s (1996) direct optimization approach, which was founded on the idea that the number of DNA sequence events is provided directly by phylogeny, with character-optimization procedures finding the minimum number of those events on the competing tree hypotheses. Ignoring the fact that Wheeler’s actual method used weighted (cost) functions (relying ultimately on frequentist probability arguments), the relevant conceptual feature of direct optimization is that it analyzes all events as transformations, sE(A,B), insertion and deletion (indels), as well as substitutions, rather than the implied similarity relations that obtain from multiple sequence-alignment methods. However, Wheeler’s (p. 1) justification for that minimization was descriptive efficiency, direct optimization providing “more efficient (simpler) explanations of sequence variation than does multiple alignments.” But as Frost et al. (2001, p. 354) pointed out, even that interpretation is only consistent when setting all substitution costs and the unit gap cost equal. Still, the
bottom line remains that unweighted direct optimization can find a pattern of character state change that is more parsimonious than those based on maximizing pair-wise statements of alignment similarities, s(A,B) (e.g., compare figs. 1B and D to 1C and E in Simmons, 2004). New methods that use s(A,B) and claim to avoid the problem of suboptimal hypotheses have yet to be justified epistemologically (e.g., see DeLaet, 1997; DeLaet and Smets, 1998).

Lastly, on the subject of similarity, QPS does not provide a basis for distinguishing ‘good’ from ‘bad’ data, unique and unreversed from homoplasy, where the number of instances of independent evolution supposedly marks the relative weakness of the evidence. Indeed, FP calls into question the whole issue of weighting (Kluge 1997b). Instead of weighting, Hennig (1966, p. 148; Hull 1967, p. 186; Farris et al. 1970; Kluge 1997b) emphasized the importance of research cycles in his empirical concept of reciprocal clarification (reciprocal illumination). Basically, the incongruence of different kinds of observations, in light of the unweighted most-parsimonious phylogenetic hypothesis, suggests the need for further study, a posteriori, and further testing of the incongruences may lead to a reinterpretation of the data, such as a redefinition of the characters and character states, and ultimately to a more severely tested and better-supported hypothesis. There is no vicious circularity of reasoning at work in QPS under these conditions; however, one must be careful to maintain testability at all levels of analysis and reanalysis, because it is easy for reciprocal clarification to be born out of utilitarianism.

Obviously, this argument for research cycles favors including testability in QPS, because ASP does not advocate any particular scientific scheme of inference, either deductive or inductive. Being able to claim the potential for research cycles in testability also draws attention to how little potential of that objective kind there is in the Bayesian and likelihood kinds of inference, where models and claiming to know the truth are critical (subjectively speaking, of course).

I believe QPS is also relevant to the issue of character-state polarity (rooting). In theory at least, QPS argues against Nixon and Carpenter’s (1993, p. 413) “unconstrained, simultaneous analysis of all terminals,” which those authors judged to be sufficient with respect to the discovery of monophyletic groups according to the parsimony criterion. The problem with their conclusion is that they asserted an operational imperative, global parsimony, without regard for the epistemological argument that justifies the concept of parsimony in phylogenetic inference. A convincing justification is required of global parsimony, just as it is of FP. To appeal to descriptive efficiency in these circumstances won’t do, because it is without an epistemological foundation of its own.

The importance of FP maximizing explanatory power goes beyond the philosophical and theoretical, with the practical and the heuristic being covered as well. For example, the fact that the unweighted most-parsimonious hypothesis of species relationships maximally explains the relevant available evidence in terms of discrete and incontrovertible homology statements means the evidence can then be used to diagnose monophyletic groups. Such information is practically important to the community of scholars responsible for identifying and systematizing museum/herbarium collections. It is also possible to use the history of each character, described in light of that optimal hypothesis of species relationships, as the basis for formulating testable population-level hypotheses, such as in the study of adaptation. That not all methods of phylogenetic inference are explanatory, and do not therefore have these extra research benefits, is clear. For example, the histories of individual characters in a maximum likelihood analysis can only be estimated subjectively, as probabilities. Effectively, there is no distribution of real-valued (other than abstract) character states on a maximum likelihood tree, a distribution that is often obtained from an a posteriori most-parsimonious mapping of discrete character states on the tree of greatest likelihood (e.g. Smith et al. 2004, Fig. 9)!

2.8 Ideographic theory unification

While FP may be both necessary and sufficient in the inference of phylogeny (see previous two
sections), the question remains whether QPS addresses more than the empirical in the evaluation of scientific hypotheses. Can that ideographic theory make significant contributions to the philosophical, to metaphysical system building?

In addressing this question from the point of view of theory unification (Friedman 1983; McAllister 2000), I briefly survey a small sample of relevant areas of comparative biology to determine to what extent they multiply disconfirm ideographic theory. I will also consider using ‘Ockham’s razor’ to eliminate other approaches to phylogenetic inference, other than those based on unweighted evidence, on the grounds that their contributions are indeterminate, or have less unifying power.

The latter exercise is certainly not the first attempt to unify phylogenetic theory. Initially, there were the debates that resulted in the elimination of phenetics and the evolutionary systematics of the neo-Darwinian synthesis. Recently, the indeterminate nature of phenetics was identified in pattern cladistics, which I believe has marginalized, if not eliminated, the influence of that approach. Even more recently, the excess of assumptions and the subjectivism of evolutionary systematics have been identified in the currently popular Bayesian and likelihood approaches to phylogenetic inference. Thus, the ideographic continues to be tested, as it should be if unification is to be a scientific exercise.

Fink (1982) was the first to explicate development in the context of a cladogram. He argued that to interpret ontogenetic processes requires such a hypothesis. For example, a paedomorphic condition can resemble the plesiomorphic state, and the only way to distinguish the two is in light of a phylogenetic hypothesis, the former often being described as an “evolutionary reversal.” This is a good example of the well-known principle that to explain any pattern of interspecific variation in terms of “descent, with modification,” that pattern must be in accord with ideographic theory. Certainly, the best-known example of this principle is the explanation of homologues. Not all theories of historical inference, however, such as represented by Bayesian and maximum likelihood methods, provide these kinds of causal explanation on their own (e.g. Smith et al. 2004, Fig. 9). Many systematists are especially interested in discrete evolutionary changes in the phenotype and genotype, and not being able to deduce that kind of history, in light of the most probable or likely phylogenetic hypothesis, must be considered a significant shortcoming of the methods employed. I eliminate Bayesian and likelihood approaches from ideographic theory because they are indeterminate when it comes to the causality and explanation of the evidence employed. Phylogenetic inference is about more than a probable or likely classification, assuming those subjective conditions can be determined.

Also relevant to ideographic theory unification is reliance on a character concept of transformation series, assuming “descent, with modification,” and not on one of similarity. Not only does QPS avoid being logically inconsistent with evolutionary theory in this conceptualization, it is directly relevant to all fields of comparative biology that assume heritable objects/events. This consideration is relevant even to areas of historical research outside biology. For example, the study of illuminated manuscripts (Platnick and Cameron 1977; Cameron 1987) assumes a concept of a transformation series that involves a kind of “heritability,” a change in what is copied and passed on to subsequent illuminators, and for which there is the objective of maximizing the explanatory power of the event, explaining mistakes in copying in terms of the history of the manuscript. The same cannot be said, however, for the studies of languages that depend on class concepts of historical evidence (Rexová et al. 2003), and here there is an explicit basis for their elimination from ideographic theory.

Taxonomy has always been an important area of systematics, and one where phylogenetics is now being promoted with increased vigor. Not only has there been an emphasis on the individuality of taxa in at least some of these efforts (e.g. Kluge 2005), giving license to the distinction between the classification of classes and the systematization of historical things, but the use of testability has been advocated as the basis for ruling on the content (wholeness) of taxa and their names, as
opposed to handing over such important responsibilities to international commissions of nomenclature which reach decisions according to legalistic, non-scientific, conventions (ICZN 1999; Grueter et al. 2000). I interpret this example to be an issue of metaphysical system building, reaching out into areas that heretofore had not been considered scientific. Perhaps only a small point, but a point nonetheless, taxonomic names at all levels of taxonomy are proper names, and therefore the things to which they apply are required to be spatio-temporally restricted, as they are in QPS. Most other approaches to phylogenetic inference treat most or all of the parts of species history as class concepts, as is evident in names being preceded by the article “the,” as in the Homo sapiens. As noted earlier, evolutionary theory precludes the conception of taxa, including species, as classes or sets.

Most of the current interspecific studies of adaptation, coevolution, and vicariance biogeography entail a phylogenetic hypothesis. Moreover, most of the analyses in these areas use parsimony as an optimality criterion. Still, I doubt they multiply confirm ideographic theory. For example, Lauder et al.’s (1993) parsimony-based lineage (homology) method for inferring adaptation was tested for its ability to deliver increased knowledge, but sufficient failures were found in its theory, especially in its presuppositions, to call into question its credibility in the inference of adaptation. Harvey and Pagel’s (1991) comparative (convergence) method, although statistical, fails for the same kinds of reasons.

The historical treatments of coevolution and biogeography fare no better. As Sober (1988b) put it, there is a disanalogy between phylogenetic inference and those sciences accompanied by “dispersal” theory, such as coevolution and vicariance biogeography. In phylogenetic inference, only ancestor-descendant (vertical) relations are assumed and counted as heritable events in choosing between competing hypotheses. In coevolution and vicariance biogeography, deciding among competing hypotheses is a complex function of the kinds of vertical and horizontal (dispersal) events the researcher is prepared to accept. I believe assuming a shared history (Brooks 1988) is ontologically indecisive in making that decision, and increasing information density (Brooks 1981), which is just another phrase for increasing descriptive efficiency, provides no epistemological basis for choosing. However, as Sober (p. 252; see also Kluge 1988, p. 316) contemplated, “If horizontal transmission and vertical transmission are unequally probable or make different predictions, that may provide a reason for preferring some . . . hypotheses to others.” But, here we are being asked to use a nomothetic means to address an ideographic end, which is the instrumentalist scheme that I rejected earlier in this paper. Thus, given these issues, and aside from the possibility of their heuristic potential, I doubt the historical theories of adaptation, coevolution, and vicariance biogeography can confirm ideographic theory unification.

Like the evolutionary systematics of the neo-Darwinian synthesis, the nomothetic sciences practiced in the inference of phylogeny, such as Bayesian inference and likelihood, make excessive assumptions, and thereby have relatively little unifying power. As already discussed, these model assumptions are counterfactual conditionals. Nor can those assumptions be tested because testing requires a phylogenetic hypothesis in the first place. In addition, nomothetic inference relies on a frequentist interpretation of history. However, as Kluge (2002) argued, the adequacy of a probabilistic interpretation must be judged according to the nature of the event, or object, being inferred, and he found a conditional (frequency) interpretation of probability fails all the tests of adequacy: admissibility, ascertainability, and applicability. Dilemmas of this sort are becoming especially evident in emerging fields of comparative biology, like genomics, where the impressive technicalities of nomothetic science seem to count more than an unweighted parsimony analysis that can objectively identify hypotheses of species relationships and character history, and do so with philosophical, theoretical, and methodological consistency.

It is significant that QPS assumes only background knowledge, “descent, with modification,” whereas Bayesian and likelihood methods of
inference depend on choosing from among an infinite number of additional auxiliary or model assumptions. Not only is the assumption of “descent, with modification,” not known to be false, unlike model assumptions, it is uniformly required of all the comparative biological sciences, including the nomothetic. Surely, choosing from among an infinite variety of models contributes relatively little to any kind of theory unification.

Obviously, the details involved in this most recent attempt at unification of the historical sciences have only just begun to be exposed. At least some of the important bases for distinguishing the ideographic from the nomothetic have been identified, and all that remains is to continue to critically evaluate those research programs that are comparative and claim to assume historical change. As for future studies in this area, I see no reason to necessarily exclude any kind of research, including those at the populational level, such as illustrated by the nomothetic sciences of phylogeography (Avise 2000) and conservation genetics (Moritz 2002). While I am not optimistic that confirmation can be obtained from any theory that does not focus on historical individuality, the unification of the phylogenetic and the tokogenetic may yet be possible by reformulating the latter fields in ideographic terms, by divesting them of all their references to class concepts and kinds. If successful, the result would at least be equivalent in scope, although not in kind, to the neo-Darwinian synthesis, i.e. the unification of evolutionary and genetic theories (McAllister 2000).

2.9 Acknowledgments

I dedicate this paper to James S. Farris. His justification of parsimony, that the minimization of ad hoc hypotheses of homoplasy maximizes explanatory power, is certainly among the best known of his many theoretical contributions to phylogenetic inference. This mantra was argued early in the phylogenetics revolution (Farris 1983) and it has guided more than one generation of systematists in their preference for most-parsimonious hypotheses, including myself.

I especially want to thank Taran Grant for a most critical review of the diverse ideas expressed in this paper. I also thank Taran for bringing Nolan’s and Baker’s ASP justifications of parsimony to my attention. Prior to knowing about their works I appealed either to Farris’ mantra or, more recently (Kluge 2003b), to Popper’s (1959) falsificationist justification for parsimony. Elliott Sober also read an early draft of this paper and he provided several trenchant comments. It was Ward C. Wheeler’s candor in acknowledging aesthetic value that caused me to search more widely for a sufficient justification for parsimony in theory evaluation. I take full responsibility for the final positions taken in this paper. This manuscript was written and revised at the Cladistics Institute, Ann Arbor, MI, USA.

CHAPTER 3
Parsimony and its presuppositions

Elliott Sober
3.1 Introduction

The use of a principle of parsimony in phylogenetic inference is both widespread and controversial. It is controversial because biologists who view phylogenetic inference as first and foremost a statistical problem have pressed the question of what one must assume about the evolutionary process if one is entitled to use parsimony in this way. They suspect not just that parsimony makes assumptions about the evolutionary process but that it makes highly specific assumptions that are often implausible. That it must make some assumptions seems clear to them because they are confident that the method of maximum parsimony must resemble the main statistical procedure that is used to make phylogenetic inferences, the method of maximum likelihood.¹ Maximum likelihood requires the explicit statement of a probabilistic model of the evolutionary process. Parsimony does not; you can calculate how parsimonious different tree topologies are for a given data set without stating a process model. Likelihoodists suspect that parsimony nonetheless involves an implicit model. The question, for them, is to discover what that model is.

¹ For the purposes of this chapter, I will treat ‘the maximum likelihood approach’ as an umbrella term that covers both frequentist and Bayesian implementations. The difference between them is discussed later.

Cladists who defend the criterion of maximum parsimony often reply that parsimony does make assumptions about evolution, but that those assumptions are modest and unproblematic. For example, cladists sometimes claim that parsimony assumes just that descent with modification has occurred. This suggests that the disagreement between critics and defenders of parsimony is not about whether parsimony makes assumptions about the evolutionary process, but concerns what those assumptions are and whether they are troublesome. But perhaps more important is the fact that critics and defenders also disagree about how those assumptions should be unearthed and evaluated. Defenders of maximum likelihood approach this problem by embedding the principle of parsimony in a statistical framework; they evaluate parsimony by examining it through the lens of probability. Defenders of parsimony often reject the use of statistics and probability as a criterion for evaluating parsimony; as Farris (1983/1994, p. 342) says, “the modeling approach was wrong from the start.” His preferred alternative is to evaluate (and justify) parsimony in terms of what he takes to be the more basic idea that the best phylogenetic hypothesis is the one that has the most explanatory power.

There are many dimensions to this dispute, too many to discuss in the brief compass of the present chapter. What I wish to concentrate on here is the relationship that exists between maximum likelihood and maximum parsimony. Felsenstein (1973, 1979) and Tuffley and Steel (1997) have each identified models of the evolutionary process that suffice to insure that the two methods always agree on which hypothesis is best supported by a given data set. I will argue that these results provide only negative guidance concerning what parsimony presupposes. They allow one to establish that this or that proposition is not assumed by parsimony, but do not allow one to conclude that any proposition is an assumption that parsimony makes. To discover what parsimony presupposes, another strategy is needed. I suggest that parsimony’s presuppositions can be found by examining simple
examples in which parsimony and likelihood disagree. My arguments will assume a broadly likelihoodist point of view, but will not require the assumption that any evolutionary model is correct.

3.2 Preliminaries

The principle of parsimony does not provide a rule of acceptance; rather, it provides a rule of evaluation. That is, the principle does not tell you to believe the phylogenetic hypothesis that requires the fewest changes in character state to explain the data at hand. After all, if the most-parsimonious tree requires that there be at least 25 changes, and the second and third most-parsimonious trees require that there be 26 and 27 respectively, the most you should conclude is that the most-parsimonious tree is better supported than the others; you are not obliged to conclude that the most parsimonious tree is true. In other words, parsimony would be a sound principle if the parsimony ordering of phylogenetic hypotheses and the support ordering of those hypotheses came to the same thing:

(1) For any data set D, and any phylogenetic hypotheses H1 and H2, D supports H1 more than D supports H2 if and only if H1 is a more parsimonious explanation of D than H2 is.

If (1) is always true, I will say that parsimony is ‘correct’ in what it says. And if (1) is true in some restricted domain, I will say that parsimony is correct in what it says about that domain.

Likelihood likewise seeks to provide a rule of evaluation, not a rule of acceptance. If one hypothesis confers a higher probability on the data than another hypothesis does, it does not follow that the first hypothesis is true; in fact, it doesn’t even follow that the first has the higher probability of being true. The fact that Pr(Data | H1) > Pr(Data | H2) does not entail that Pr(H1 | Data) > Pr(H2 | Data). Rather, the virtue that has been claimed for the likelihood concept is that it provides an indication of which hypotheses are better supported by the data. The following principle has come to be called the Law of Likelihood (Hacking 1965, Edwards 1972, Royall 1997):

(2) For any data set, and any phylogenetic hypotheses H1 and H2, the Data support H1 more than they support H2 if and only if Pr(Data | H1) > Pr(Data | H2).

Proposition (2) restricts likelihood to hypotheses that describe phylogenetic relationships. I state the Law of Likelihood in this way to preserve its symmetry with (1), even though likelihood is supposed to be a perfectly general criterion for evaluating the direction in which the evidence points. Proposition (2) is not a consequence of the axioms of probability; it is not a mathematical truth, but rather is a philosophical thesis: that the epistemological concept of support is adequately represented by the mathematical concept of likelihood. Just as one can ask whether, or in what circumstances, (1) is true, the same questions can be posed about (2). If (2) is always true, then I will say that likelihood is ‘correct’ in what it says. And if (2) is true in some restricted domain, I will say that likelihood is correct in what it says about that restricted domain.

How are (1) and (2) related? If there is a data set and a pair of hypotheses H1 and H2 such that the parsimony ordering and the likelihood ordering do not agree (e.g. where H1 is more parsimonious than H2, but Pr(Data | H1) < Pr(Data | H2)), then (1) and (2) cannot both be true. On the other hand, if parsimony and likelihood agreed about the relative support of any two hypotheses for any data set you please, then (1) and (2) would be perfectly compatible. The fact that one is stated in terms of likelihood and the other in terms of parsimony would be no more significant than the difference between measuring distance in meters and measuring it in feet.

Proposition (2) gives a somewhat misleading picture of what it means to apply the Law of Likelihood to phylogenetic hypotheses. The problem is that phylogenetic hypotheses that describe the topology of a tree (not the times of branching events, or the amount of change that has taken place on branches, or the character states of interior nodes) do not, all by themselves, confer probabilities on the data. In the language of statistics, phylogenetic hypotheses are composite, not simple. There are two possible solutions to this problem. The first is Bayesian; one represents the likelihood of a phylogenetic hypothesis H as a weighted average. H will vary in its likelihood, depending on which process model is considered, and depending also on what the values are for the
parameters that occur in a process model. The ‘full’ Bayesian approach is to take all these possibilities into account, weighting each by its probability, conditional on H:²

$$\mathrm{Pr}(\text{Data} \mid H) = \sum_i \sum_j \mathrm{Pr}(\text{Data} \mid H \,\&\, \text{process model } i \,\&\, \text{values } j \text{ for the parameters in model } i)\; \mathrm{Pr}(\text{process model } i \,\&\, \text{values } j \text{ for the parameters in model } i \mid H)$$

Although biologists are starting to explore Bayesian methods in phylogenetic inference (see e.g. Huelsenbeck et al. 2001), no one has proposed to represent a hypothesis’ likelihood by averaging over all possible process models;³ rather, Bayesians have tended to adopt a single process model M and to average over the different values that the parameters in M might take. According to this ‘attenuated’ Bayesian approach, the likelihood of H should be written as Pr_M(Data | H), not as Pr(Data | H), where

(B) $$\mathrm{Pr}_M(\text{Data} \mid H) = \sum_j \mathrm{Pr}(\text{Data} \mid H \,\&\, \text{model } M \,\&\, \text{values } j \text{ for the parameters in } M)\; \mathrm{Pr}(\text{values } j \text{ for the parameters in model } M \mid H \,\&\, M)$$

Whereas Bayesians treat the likelihood of H (once a model has been adopted) as a weighted average over the likelihood that H would have under the different possible settings of the model’s parameters, frequentists treat the likelihood of H (given an assumed model) by finding L(H&M), where L(H&M) is the likeliest special case of the conjunction (H&M); it is found by setting the adjustable parameters in M to values that maximize the likelihood of (H&M).⁴ For them, the appropriate quantity is

(F) $$\mathrm{Pr}_M(\text{Data} \mid H) = \mathrm{Pr}(\text{Data} \mid L(H \,\&\, M))$$

Whereas likelihood means average likelihood for a Bayesian, likelihood means best-case likelihood for a frequentist. We will return to this difference between Bayesian and frequentist treatments of likelihood in a moment. For now, let’s focus on what they have in common: both evaluate the likelihood of H by assuming a process model M.⁵

How should this recognition of the model-relativity of likelihoods be incorporated in (2)? If different models are correct for different data sets and different taxa, there won’t be a single ‘master model’ that should be used to evaluate the support of all phylogenetic hypotheses. Rather, what we need is the following:

(2*) If M is the correct model for how the characters described in a data set evolved in the taxa described in phylogenetic hypotheses H1 and H2, then the Data support H1 more than they support H2 if and only if Pr_M(D | H1) > Pr_M(D | H2).

Just as Proposition (2) is a philosophical thesis, not a mathematical truth, the same point holds for (2*). If (2*) is true, then I’ll say that likelihood_M is correct for the taxa and data set in question.

I earlier described how (1) and (2) can come into conflict. What would it take for (1) and (2*) to conflict? You need the same ingredients as before, plus a model M that is correct for the taxa and characters involved. That is, consider a pair of phylogenetic hypotheses, a data set, and a model M, where the parsimony ordering of the hypotheses differs from their likelihood_M ordering. If you accept (2*) and also think that model M is correct, then you are obliged to accept the judgment of likelihood_M and reject the judgment of parsimony concerning which hypothesis is better supported by the data. Notice that there are two if’s in this italicized statement. This means that if you do not reject what parsimony says about the hypotheses, there are two options available, not just one. You can reject model M or you can reject (2*). That is, cladists are not obliged to reject model M; they also have the option of rejecting the Law of Likelihood as it is embodied in (2*).

I so far have described how a model can lead to a conflict between (1) and (2*). However, it is equally true that there are models of the evolutionary process that lead to a perfect harmony

² As an expository convenience, I represent H’s average likelihood as a discrete summation, rather than as a continuous integration.
³ Huelsenbeck et al. (2004) average over all of the many different time-reversible models by assigning them equal prior probabilities. Since some of these models are nested inside others, this prior distribution is questionable.
⁴ Note that L(H&M) is a proposition, not a number between 0 and 1.
⁵ In Sober (2004a) I argue that model selection criteria such as Akaike information criteria (AIC) permit phylogenetic inference to proceed by considering any number of process models without one’s having to commit to any of them.
between (1) and (2*). Such models lead parsimony and likelihood_M to be ordinally equivalent:

(OE) Parsimony and likelihood_M are ordinally equivalent if and only if, for any data set D, and any pair of phylogenetic hypotheses, the parsimony ordering of that pair is the same as the likelihood_M ordering.

If a model M induces ordinal equivalence, what does that establish about the legitimacy of parsimony and likelihood_M? If you accept the model and regard one method as legitimate, then you should regard the other method as legitimate as well. In this circumstance, likelihoodists will say that M provides a likelihood justification of parsimony, whereas friends of parsimony will say that M provides a parsimony justification of likelihood_M. On the other hand, if you do not accept the model that induces ordinal equivalence, the status of the two methods is left open; for example, both could turn out to be unsatisfactory methods for evaluating the support of phylogenetic hypotheses. The point to notice here is that (OE) says nothing about whether parsimony is correct; it merely says what it means for parsimony and likelihood_M to be in the same boat; if parsimony and likelihood_M are ordinally equivalent, then both are correct or neither is. Two broken thermometers can be ordinally equivalent in what they say about the temperatures of different objects.

In summary, the model-relativity of likelihood entails that we are asking the wrong question when we ask "what is the relationship between likelihood and parsimony?" The word 'the' is where the trouble lies; there are many likelihood concepts (one for each possible model of the evolutionary process) and so there are many relationships between the different likelihood concepts and parsimony. More specifically, if we adopt (2*), the following two lines of reasoning are valid.

If model M is correct, then likelihood_M correctly evaluates support. If likelihood_M has this property, and moreover is ordinally equivalent with parsimony, then parsimony also correctly evaluates support.

If parsimony and likelihood_M are not ordinally equivalent and parsimony correctly evaluates support, then likelihood_M does not. If likelihood_M does not correctly evaluate support, then M cannot be correct.

The first line of reasoning describes what would be true if likelihood_M and parsimony were not just in perfect agreement, but additionally had the property of correctly evaluating support. The second describes how a failure of ordinal equivalence can help uncover a presupposition of parsimony: if parsimony correctly evaluates support, then process model M must be false. Notice that both lines of reasoning require (2*). A more thorough investigation would address the question of why one should accept this formulation of the Law of Likelihood. This is a topic I will not take up here; I'll assume (2*) without trying to justify it.

3.3 How to determine what parsimony does not presuppose

A number of writers have attempted to find models that induce ordinal equivalence. Three have succeeded, Felsenstein (1973, 1979) and Tuffley and Steel (1997). In Felsenstein's model, characters are constrained to have very low probabilities of changing state, but there is no requirement that the probability of a character's changing from state i to state j on a branch is the same as its probability of changing from state j to state i. In Tuffley and Steel's, characters can have high probabilities of changing state (though they need not), but the probabilities of change must be symmetrical.⁶ The models are very different, but each entails ordinal equivalence.

⁶ The two models agree that different traits on the same branch can have different probabilities of changing; this also applies to the same trait on different branches.

Both Felsenstein, and Tuffley and Steel, evaluate the likelihoods of phylogenetic hypotheses by using the frequentist approach (F) for assigning values to the parameters in the models they discuss. For example, consider a single site in the aligned sequences that characterize four species W, X, Y, and Z. Suppose that W and X are in state G and that Y and Z are in state A. The most parsimonious unrooted tree is (WX)(YZ). Under the
symmetrical model that Tuffley and Steel assume, the highest likelihood this tree can have, relative to this character, is (1/4)(1/4)(1)(1)(1)(1) = 1/16, and this is the likelihood that Tuffley and Steel take the unrooted tree to have.⁷ A Bayesian would want to consider the average likelihood of (WX)(YZ), not the maximum.

⁷ Of the two occurrences of one-quarter in this expression, one of them is the prior probability of the root's being in a given state; the other is the probability of a change in state in the tree's interior.

There are other attempts in the literature to establish ordinal equivalence. Farris (1973) tried to prove this result by using a model that makes very weak assumptions about the evolutionary process. Goldman (1990) sought to do the same thing by using a model in which the probability that a character will change state on a branch is independent of the branch's duration. Both these efforts fail to establish ordinal equivalence because both interpret parsimony as inferring not just the topology of a tree but something more inclusive. Goldman viewed parsimony as a procedure for inferring the topology plus an assignment of character states to interior nodes; Farris took parsimony to output the topology plus an assignment of character states to all points along the branches. Why does this vitiate the arguments that Farris and Goldman present? The reason is that even if H1&X1 is more likely and more parsimonious than H2&X2, and H1 is more parsimonious than H2, it doesn't follow that H1 is more likely than H2 (Felsenstein 1973; Sober 1988; Steel and Penny 2000). The likelihood of a tree must sum over all possible assignments of character states to points in the tree's interior.

When a model induces ordinal equivalence, what does this reveal concerning parsimony's presuppositions about the evolutionary process? It most certainly does not show that parsimony assumes that the model is true. The models of Felsenstein (1973, 1979) and of Tuffley and Steel (1997) are simply sufficient conditions for ordinal equivalence. No one has shown that either of these models is necessary for ordinal equivalence. And, obviously, neither of them is; if each of two models suffices, neither is necessary. We must be careful to distinguish what a modeler assumes from what the model reveals concerning what parsimony assumes (Sober 1988, 2004a).

Still, if we accept the instance of the Law of Likelihood given by (2*), these results about ordinal equivalence provide a partial test for whether parsimony assumes this or that proposition about the evolutionary process (Sober 2002, 2004a). As noted earlier, parsimony and likelihood_M can be ordinally equivalent even if both are wrong in what they say about support. However, if they are ordinally equivalent and model M is true, then both are correct, given (2*). Consider a model M that induces ordinal equivalence; M might be Felsenstein's model, or the one described by Tuffley and Steel, or some third model that no one has yet identified:

M → parsimony is correct → A

If model M is true (where M induces ordinal equivalence), then parsimony is correct in what it says about support (and so is likelihood_M, of course). What does parsimony assume? The assumptions (A) of parsimony are just those propositions that must be true, if parsimony is correct in what it says about support. Notice that any proposition that is entailed by the claim that parsimony is correct also must be entailed by model M. However, the converse is not true: if model M entails a proposition, that proposition may or may not be entailed by the thesis that parsimony is correct. This means that the results of Felsenstein (1973, 1979) and of Tuffley and Steel (1997) provide the following test concerning whether parsimony assumes that proposition X is true:

If model M entails X, then X may or may not be an assumption of parsimony's.

If model M does not entail X, then X is not an assumption of parsimony's.

Applying this partial test yields some surprising results. First, many biologists have suspected that parsimony assumes that changes in character state are very improbable and that homoplasies are rare; from a likelihood point of view, this suspicion is provably mistaken. The reason is that the Tuffley and Steel model does not entail that changes are
improbable or that homoplasies are rare. Second, it follows that parsimony does not assume that change is symmetrical; the reason is that the Felsenstein model does not assume this. These results depend on using the Law of Likelihood (2*); but once that interpretative framework is adopted, these results are secure.

As illuminating as these results are, they still have the limitation of being purely negative. The partial test can show that this or that proposition is not an assumption that parsimony makes, but the test isn't able to demonstrate that a given proposition is assumed by parsimony. Results that demonstrate that a model induces ordinal equivalence have this inherent limitation. In order to obtain positive results concerning what parsimony assumes about the evolutionary process, a new strategy is needed.

3.4 How to determine what parsimony presupposes

Mathematical arguments for ordinal equivalence are necessarily general; they must show, for any data set and for any pair of phylogenetic hypotheses (which may describe an arbitrarily large number of taxa), that parsimony and likelihood_M agree about the support ordering. In contrast, an argument that demonstrates a failure of ordinal equivalence need not be general; it can just take the form of a simple example. All that is needed is a model M, a single data set, and a pair of hypotheses such that the parsimony ordering differs from the likelihood_M ordering. If parsimony is right in what it says, then likelihood_M is wrong. And if likelihood_M is wrong, so is the model M (assuming that (2*) is true). Such cases therefore help reveal parsimony's presuppositions.

3.4.1 Example 1

Let's begin with a simple example in which the hypotheses being evaluated don't describe tree topologies, but rather assign character states to ancestors in trees that are taken as given. Imagine a bifurcating tree in which all the tip species are observed to have the same character state a. Parsimony asserts that the best-supported estimate of the character state of the most recent common ancestor A of those tip species is that A was also in state a. Parsimony's solution to the problem would remain the same if we were talking about a star phylogeny. In fact, distilled to its simplest form, the problem and parsimony's solution to it can be formulated by considering a single lineage that ends with a descendant D that is in state a; the problem is to infer what the character state was of the ancestor A that existed at the start of the lineage. Parsimony says that the best estimate is that A = a.

What would a likelihood analysis of this problem look like? If the character in question is dichotomous (with character states 0,1), the standard approach from the theory of stochastic processes (Parzen 1962) is to divide the lineage into a large number of brief temporal intervals. In each, there is a probability u that the lineage will change from state 0 to state 1, and there is a (possibly different) probability v that the lineage will change from state 1 to state 0. Each of these instantaneous probabilities is assumed to be small (at least less than one-half). They allow us to describe the probability Pr_N(i → j) that a lineage that is N units of time in duration will end in state j, given that it starts in state i. These lineage transition probabilities are as follows:

$$
\begin{aligned}
\Pr_N(0 \to 1) &= \frac{u}{u+v} - \frac{u}{u+v}\,(1-u-v)^N \\
\Pr_N(1 \to 1) &= \frac{u}{u+v} + \frac{v}{u+v}\,(1-u-v)^N \\
\Pr_N(1 \to 0) &= \frac{v}{u+v} - \frac{v}{u+v}\,(1-u-v)^N \\
\Pr_N(0 \to 0) &= \frac{v}{u+v} + \frac{u}{u+v}\,(1-u-v)^N
\end{aligned}
$$

This is the two-state Markov process model. Each of these transition probabilities averages over all possible scenarios consistent with the specified initial and end states. For this reason, it would be misleading to say that the two transition probabilities of the form Pr_N(i → i) describe the probability of stasis; Pr_N(i → i) encompasses the possibility that there has been no change in the lineage but also the possibility that the lineage has flip-flopped an even number of times. There is no assumption in this model as to whether u = v. If u = v, the lineage undergoes an unbiased process of drift. If u > v, there is a directionality or bias in the evolutionary process, favoring state 1 over state 0. One possible source of this bias is natural selection; however, mutation and migration also can induce a bias in how the lineage tends to evolve.
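The four lineage transition probabilities of the two-state Markov process model can be sketched in a few lines of Python. This is an illustrative sketch, not code from the chapter: the function name `transition_prob` and its argument names are assumptions introduced here, and N is taken to be a whole number of brief intervals, with u and v each below one-half as the text requires.

```python
def transition_prob(start: int, end: int, u: float, v: float, n: int) -> float:
    """Pr_N(start -> end) in the two-state Markov process model.

    u: per-interval probability of a 0 -> 1 change (assumed < 0.5)
    v: per-interval probability of a 1 -> 0 change (assumed < 0.5)
    n: lineage duration, counted in brief intervals (hypothetical convention)
    """
    if start not in (0, 1) or end not in (0, 1):
        raise ValueError("states must be 0 or 1")
    if not (0.0 <= u < 0.5 and 0.0 <= v < 0.5):
        raise ValueError("u and v must lie in [0, 0.5)")
    if n < 0:
        raise ValueError("n must be non-negative")

    total = u + v
    if total == 0.0:
        # No change is possible, so the lineage keeps its initial state.
        return 1.0 if start == end else 0.0

    # Long-run probability of each state, reached as n grows without limit.
    long_run = {1: u / total, 0: v / total}
    # Memory of the initial state; it shrinks geometrically with duration.
    decay = (1.0 - total) ** n

    if start == end:
        # Same start and end state: long-run value plus a decaying bonus.
        return long_run[end] + long_run[1 - end] * decay
    # Different start and end state: long-run value minus a decaying penalty.
    return long_run[end] - long_run[end] * decay


if __name__ == "__main__":
    u, v = 0.1, 0.2
    for n in (0, 1, 10, 1000):
        row = [transition_prob(i, j, u, v, n) for i in (0, 1) for j in (0, 1)]
        print(n, [round(p, 4) for p in row])
```

With a duration of zero the function returns the identity (the lineage stays where it started), with a duration of one interval it returns u and v themselves for the two kinds of change, and for very long durations the probability of ending in a given state no longer depends on the starting state.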
When N is very small, the two probabilities of the form Pr_N(i → i) are close to unity and the two probabilities of the form Pr_N(i → j) (where i ≠ j) are close to 0. When N is infinite Pr_N(i → j) = Pr_N(j → j); the lineage has the same probability of ending in state j, regardless of what state it was in when the lineage began. Thus, when a lineage has a very short duration, its initial condition virtually determines its final state and the relationship of u and v doesn't matter; when a lineage is very old, it is the process that occurs during the lineage's duration (represented by u and v) that matters; the initial condition is forgotten.

It is a property of this model that a backwards inequality obtains: Pr_N(j → j) ≥ Pr_N(i → j), with strict inequality when N is finite (Sober 1988). Don't confuse the backwards inequality with the forwards inequality Pr_N(j → j) > Pr_N(j → i). An instance of this forwards inequality (e.g. that Pr_N(1 → 1) > Pr_N(1 → 0)) will be true for some values of u, v, and N, but not for others. The backwards inequality says that if a descendant is in state j, that outcome is made more probable by the hypothesis that its ancestor was in state j than by the hypothesis that the ancestor was in state i. This provides a likelihood solution to our problem: if the descendant is in state a of a dichotomous character, the hypothesis of maximum likelihood about the state of the ancestor is that the ancestor was in state a as well. This result holds regardless of what the values of u, v, and N are; even if these values entail that the expected number of changes in the lineage is large, the most-parsimonious assignment of character state to the ancestor is still the assignment of maximum likelihood. Parsimony and likelihood_M therefore agree when M is the two-state Markov process model and the problem is to infer an ancestor's character state from the character state of a descendant.⁸

⁸ The same result holds when we pose this question about a star phylogeny or a bifurcating tree. If branches are conditionally independent of each other, the support for A = a (as measured by the likelihood ratio) is greater when there are several descendants than when there is just one.

In analyzing this simple problem, I used the Bayesian method (B), not the frequentist procedure (F), for taking account of the fact that the two-state Markov model can have different values assigned to its parameters u, v, and N. I didn't focus exclusively on the values for these parameters that would maximize the likelihood of each hypothesis about the state of ancestor A; that would have led to the conclusion that the two likelihoods are as close together as you please, since

$$\Pr(D = 1 \mid L(A = 1 \,\&\, u = a_1 \,\&\, v = a_2 \,\&\, N = a_3)) = 1$$

and

$$\Pr(D = 1 \mid L(A = 0 \,\&\, u = a_1 \,\&\, v = a_2 \,\&\, N = a_3)) \to 1 \text{ as } a_3 \to \infty.$$

Rather, my argument is that for each value of u, each value of v (each less than 0.5), and for each finite value of N,

$$\Pr(D = 1 \mid A = 1 \,\&\, u = a_1 \,\&\, v = a_2 \,\&\, N = a_3) > \Pr(D = 1 \mid A = 0 \,\&\, u = a_1 \,\&\, v = a_2 \,\&\, N = a_3).$$

From this it follows that

$$
\begin{aligned}
&\sum_{i,j,k} \Pr(D = 1 \mid A = 1 \,\&\, u = i \,\&\, v = j \,\&\, N = k)\,\Pr(u = i \,\&\, v = j \,\&\, N = k \mid A = 1) \\
&\quad > \sum_{i,j,k} \Pr(D = 1 \mid A = 0 \,\&\, u = i \,\&\, v = j \,\&\, N = k)\,\Pr(u = i \,\&\, v = j \,\&\, N = k \mid A = 0)
\end{aligned}
$$

if the settings of u, v, and N are independent of the character state of the ancestor A. In this instance, the Bayesian weighting terms Pr(u = i & v = j & N = k | A = 1) and Pr(u = i & v = j & N = k | A = 0) are innocuous; the last stated inequality holds, no matter what their values are.

I now want to consider the same problem, that of inferring the character state of the ancestor A from the observed character state of the descendant D, when the character in question is a quantitative phenotype (e.g. the average length in the species of a particular bone), not dichotomous. It remains true, of course, that if the descendant is in state a, then the most-parsimonious hypothesis about the state of the ancestor is that it was in state a as well. To see what likelihood says about this problem, we need to construct a probabilistic model of the
evolution of the quantitative character. Let's begin by setting limits on the values of the character in question; suppose it can't go below zero or above 100. We can think of u as the probability of the lineage's increasing its character state by a very small amount during a brief interval of time, and v as the probability of the lineage's reducing its value during that instant. Since there are upper and lower bounds on the character state, u and v cannot remain constant over the full range of the lineage's possible states; for example, u must have a value of zero when the lineage is in state 100, though of course it can have a nonzero value when the lineage has a value less than 100.⁹ In addition, we want to allow for the possibility that the lineage is evolving towards a stable equilibrium; for example, perhaps a trait value of 75 is optimal, and selection is pushing the lineage towards that value. This means that u > v when the lineage's trait value is less than 75, but that u < v when the population has a value greater than 75. In addition, the degree to which u > v must decline as the population approaches 75 from below.

⁹ I do not conceptualize its maximum and minimum values (0 and 100) as absorbing states. The same will be true in the drift model to be discussed shortly.

When a biased process (such as natural selection) is pushing a lineage towards a single attractor state, the lineage's probability of reaching that equilibrium is greater, the closer its initial state is to that attractor. Similarly, the equilibrium value has a higher probability of being attained, the more time there is in the lineage. When the lineage has a very short duration, stasis is almost certain; as the lineage is given a longer duration, the biased process takes over and the initial condition recedes in its impact on the lineage's final state. In the limit of infinite time, the initial condition is entirely forgotten and the lineage's probability of attaining a given end state is the same, regardless of the state in which the lineage began.

How should we conceptualize a pure drift process for continuous phenotypic characters? In this case, u = v, except when the lineage is at the limit values of 0 and 100. If the ancestor has a given trait value, that trait value is the expected value of its descendant. With very little time, the probability distribution of the descendant's value is tightly peaked around the lineage's initial state. As time goes on, this low variance bell curve flattens and spreads out. With infinite time, there is a flat distribution: each character state has the same probability. Whereas selection in a finite population involves both the shifting and the squashing of a distribution, the process of pure drift involves only squashing.¹⁰

¹⁰ See Lande (1976) and Sober (2005) for further details concerning these phenotypic models for selection and drift. Let me emphasize that my discussion of 'drift' in this paper is not about random genetic drift, but concerns change in the population average of a quantitative phenotype.

Now let's return to the inference problem. If the descendant D is in state a of a quantitative phenotypic trait, which assignment of character state to the ancestor A has the highest likelihood? Since the backwards inequality holds for dichotomous characters, one might expect the model for continuous phenotypic characters to have the same unconditional consequence: that A = a has maximum likelihood. This is not always correct (Sober 2002). If D = a, then A = a is the maximum likelihood assignment when the process is one of pure drift (W. Maddison 1991). And if D = a and a is the optimal character state towards which natural selection is pushing, then A = a is again the maximum-likelihood assignment. However, if the descendant has a character state of, say, 40 and selection is pushing the lineage towards a value of 50, then the maximum likelihood assignment to the ancestor A will be less than 40; how much less than 40 the maximum likelihood value is depends on how long the lineage has been evolving, on how strong the directional force is, and on the character's heritability (Sober in press). This means that parsimony and likelihood_M conflict when the model says that there is a directional process whose attractor is some state different from a, the observed character state of the descendant. Thus, to defend the parsimonious assignment of A = a without rejecting the Law of Likelihood, one must reject this model. Parsimony assumes that the trait either evolves by pure drift or by a selection process in which the descendant's character state is optimal.
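The direction of this conflict can be illustrated with a small Python sketch. It does not implement the bounded model described above. Instead it assumes, purely for illustration, an Ornstein-Uhlenbeck process (a standard continuous stand-in for selection towards an optimum plus drift), it ignores the limits of 0 and 100, and it folds heritability into a single hypothetical `pull` parameter. Under that assumption the descendant's value is normally distributed around `optimum + (ancestor - optimum) * exp(-pull * duration)` with a variance that does not depend on the ancestor, so the likelihood of an observed descendant value is maximized by the ancestral value that makes this expected value equal to the observation.

```python
import math


def expected_descendant(ancestor: float, optimum: float,
                        pull: float, duration: float) -> float:
    """Expected descendant value under the assumed Ornstein-Uhlenbeck process.

    pull: strength of the directional force (hypothetical parameter; 0 = pure drift)
    duration: how long the lineage has been evolving
    """
    return optimum + (ancestor - optimum) * math.exp(-pull * duration)


def ml_ancestor_state(descendant: float, optimum: float,
                      pull: float, duration: float) -> float:
    """Ancestral value that maximizes the likelihood of the observed descendant.

    Because the variance of the assumed process does not depend on the
    ancestor, the maximum is found by solving
    expected_descendant(ancestor, ...) == descendant for the ancestor.
    The bounds 0 and 100 used in the text are NOT enforced here.
    """
    return optimum + (descendant - optimum) * math.exp(pull * duration)


if __name__ == "__main__":
    # Descendant at 40, selection pushing towards 50: estimate falls below 40.
    print(ml_ancestor_state(40.0, 50.0, pull=0.1, duration=5.0))
    # Pure drift (no pull): the estimate equals the descendant's state.
    print(ml_ancestor_state(40.0, 50.0, pull=0.0, duration=5.0))
    # Descendant already at the optimum: the estimate equals the descendant's state.
    print(ml_ancestor_state(75.0, 75.0, pull=0.1, duration=5.0))
```

In this sketch the estimate agrees with parsimony under pure drift and when the descendant sits at the optimum, and it moves further below 40 as `pull` or `duration` increases. For large enough values it leaves the 0 to 100 range, which is one place where the bounded model in the text would behave differently.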
3.4.2 Example 2

Consider two extant species A and B and their most recent common ancestor C. Suppose that A = 1 and B = 0; parsimony says that C = 1 and C = 0 are equally parsimonious. In what circumstances do these two assignments of character state to the ancestor have the same likelihood? That is, when will it be true that Pr_A(0 → 1) Pr_B(0 → 0) = Pr_A(1 → 1) Pr_B(1 → 0)? Here the subscripts A and B represent which of the two lineages the transition probabilities describe. It is helpful to rewrite this equality as

$$\frac{\Pr_A(0 \to 1)}{\Pr_A(1 \to 1)} = \frac{\Pr_B(1 \to 0)}{\Pr_B(0 \to 0)}$$

If the two lineages experience the same evolutionary processes (i.e. are characterized by the same pair of u, v values), then this equality holds if and only if the duration N of the lineages is 0, or infinity, or u = v. That is, parsimony assumes that if the two lineages are of finite duration and have experienced the same evolutionary process, then that process is pure drift.

3.4.3 Example 3

The next problem is just like the previous one, except that the two lineages have unequal temporal durations. When A = 1 and B = 0, parsimony says that C = 1 and C = 0 are equally well-supported estimates of the ancestral character state even when A is a present-day species and B is a fossil. This temporal difference between A and B means that the lineage leading from C to A lasted longer than the lineage leading from C to B. The two-state Markov model views this difference as evidentially significant; recall that N, the duration of a lineage, figures in the expressions for the lineage transition probabilities. If the processes in the two lineages are characterized by the same values of u and v, then B provides stronger evidence (in the sense of a larger likelihood ratio) about the state of C than A does; likelihood will then favor C = 0 over C = 1. Parsimony denies this. Parsimony therefore assumes that the u and v values that characterize one lineage must differ from the u and v values that characterize the other. But not just any difference between the two pairs of values will suffice for C = 0 and C = 1 to have the same likelihood. The lineage with the longer duration (the one leading to A) must have a smaller value for u + v; in fact, the degree to which its value for u + v must be smaller is determined by the two lineage durations. This has the embarrassing consequence that the lineage leading to A must change its values for u and v in a very precise way as it gets older. At one point the lineage leading to A and the lineage leading to B were of equal duration. But then B went extinct while A continued to exist, so the lineage leading to A got longer while the one leading to B did not. According to parsimony, A's values for u and v must evolve in precise response to its duration and to the duration and u and v values that attach to the lineage leading to B.

3.4.4 Example 4

The next example in which a process model M induces a conflict between likelihood_M and parsimony concerns a rooted tree in which the character state of the root is specified. It is a familiar property of parsimony in this context that shared derived characters are said to provide evidence of common ancestry, but that shared ancestral characters do not. If three tip species A, B, and C, are in states A = 1, B = 1, and C = 0, with 0 taken to represent the ancestral character state, then (AB)C will be more parsimonious than A(BC). However, if the polarity is reversed, with 1 now representing the ancestral condition, then (AB)C and A(BC) will be equally parsimonious.

Consider the two-state Markov model given before on which an additional constraint is imposed, namely that the probability of one branch's ending in state i if it begins in state j (i,j = 0,1) is the same as another branch's doing the same, if the two branches are simultaneous. This model has the consequence that (AB)C has a higher likelihood than A(BC), given the observation that A = 1, B = 1, C = 0, if all branches have finite duration, regardless of what the polarity of the character is (Sober 1988, pp. 206–212). This contradicts what parsimony asserts, when 1 is the ancestral state.
Parsimony's interpretation of the observations therefore requires a rejection of the process model just described.

3.4.5 Discussion of the examples

If the Law of Likelihood, as formulated in (2*), is correct, then parsimony assumes the falsehood of any model M for which likelihood_M and parsimony are not ordinally equivalent. A summary of the models discussed in this chapter that parsimony assumes are false is provided in Table 3.1. These descriptions do not lay out the full details of the models that parsimony must reject. This is an important point, since these models are each conjunctions of several propositions. If parsimony assumes that model M is false, this means that at least one of the constitutive propositions that specifies the model must be false, not that all of them must be. So don't take the table's brief description of a model to mean that the detail described must be false. Also, I have described, for each inference problem, a model that parsimony must regard as false; don't assume that this is the only model that parsimony must reject when it addresses that inference problem.

Table 3.1 Summary of the models discussed in this chapter that parsimony assumes to be false

| | Inference problem | A model that parsimony assumes is false |
|---|---|---|
| 1 | For a quantitative character evolving in a lineage, infer the character state of the ancestor when the descendant has character state a. | Selection is pushing the lineage towards an optimum that differs from the character state of the descendant. |
| 2 | When two descendants alive now exhibit different states of a dichotomous character, infer the character state of their most recent common ancestor. | The two lineages are characterized by the same pair of values for u and v, where u ≠ v. |
| 3 | When two descendants (one extant, the other a fossil) exhibit different states of a dichotomous character, infer the character state of their most recent common ancestor. | The two lineages are characterized by the same pair of values for u and v. |
| 4 | When two species share a symplesiomorphy not exhibited by a third, infer the rooted tree topology. | Simultaneous branches in the tree have the same pair of values for u and v. |

Though each example requires that parsimony reject a process model, a model that parsimony needs to reject in one inference problem doesn't necessarily have to be rejected in another. In Example 1, parsimony requires a nontrivial assumption when the character is quantitative, but no such requirement is imposed when the trait is dichotomous. In Example 3, parsimony assumes that the two lineages experience different evolutionary processes. In Example 2, parsimony does not require this assumption; rather, it assumes that if the same process is at work in the two lineages, then that single process is drift. In Example 4, parsimony leaves open whether selection or drift is operating within a branch, but requires that different simultaneous branches be characterized by different pairs of values for u and v. These results suggest, but do not demonstrate, that parsimony may impose different assumptions about the evolutionary process when it addresses different inference problems.

Although I think these examples make clear at least some of the assumptions that parsimony makes about the evolutionary process, I have not commented on whether those assumptions are innocuous or implausible. I have emphasized that my arguments are predicated on the assumption that the Law of Likelihood, as formulated in (2*), is true. This is so general a proposition that it can hardly be said to be an assumption about evolution. Even so, if it is rejected, we are back to square one. If it is retained, the question becomes more specifically biological, but here again, there are choices to consider. For example, in Example 3, a likelihoodist may wish to maintain that the state of the fossil B provides more evidence about the state of the most recent common ancestor C than the extant organism A does. If so, parsimony's solution to this problem is mistaken. But it is open to the defender of parsimony to take the opposite
position; however, I think it is not enough just to insist that parsimony is right in what it says about this example and to conclude from this that the model that leads likelihood to a contrary verdict must be mistaken. An additional argument is needed concerning why the Markov process model should not be taken seriously. This point generalizes to the other examples. All these examples can be handled in the same way by attacking the entire Markov process framework. I don’t say that this framework is beyond criticism. Rather, I suggest that criticisms of this framework must be biological in their content. This is an important point: once the Law of Likelihood is accepted, both criticisms and defenses of parsimony must be based on biological, not purely methodological, considerations.
I have not discussed the issue of statistical consistency in this essay, but there is a part of the debate about that matter that bears on the present discussion. Felsenstein (1978) described a model of evolution and an assumed true phylogeny that together lead parsimony to converge on a false phylogeny as the data are made large without limit. Farris’ (1983/1994) reaction was to reject Felsenstein’s model as unrealistic; after all, Felsenstein’s model says that all traits in a given branch have identical transition probabilities and that the probability of reversion from the derived to the ancestral state is always zero. Felsenstein said at the outset that the model he describes is unrealistic; Farris emphatically agreed, and took this point to cancel whatever criticism of parsimony the demonstration of statistical inconsistency might be thought to imply. Farris apparently was reasoning that the correctness of parsimony requires parsimony to be statistically consistent; thus, if model M entails that parsimony is not consistent, then the correctness of parsimony requires that model M be false.11 I have reasoned similarly about the examples in this essay, except that I have focused on ordinal equivalence with respect to finite data sets, not on statistical consistency, which describes what will happen if you have an infinite data set. This difference aside, I am hardly the first to suggest that parsimony’s failing to have some property elucidates what its biological assumptions are.
3.5 Acknowledgments
I thank Joe Felsenstein and Michael Steel for useful comments on the present paper. I also want to acknowledge the considerable influence that Steve Farris has had on my thinking about the role of parsimony in phylogenetic inference, and in a wider scientific context. It is a pleasure to contribute this chapter to a volume honoring his work.
11 In Sober (1988, pp. 166–171), I argue that a method’s statistical consistency is not a necessary condition for one to be rational in using that method. As it happens, there are parameter settings of the Tuffley and Steel (1997) no common mechanism model (N) that have the consequence that parsimony and likelihoodN both fail to be statistically consistent. However, I don’t see why that forces one to decline to use N as one process model, possibly among several, in phylogenetic inference; for discussion of the use of multiple process models, see Sober (2004a). Furthermore, if using parsimony required the rejection of any model whose parameters can be assigned values that render parsimony statistically inconsistent, then the Tuffley–Steel model has both of the following properties: (1) it suffices for likelihood and parsimony to agree, and (2) its falsity is presupposed by parsimony. This illustrates how fundamentally different likelihood and statistical consistency are as tools for thinking about parsimony. Bayesians see the Law of Likelihood as fundamental; frequentists such as Felsenstein see consistency as the fundamental desideratum.
II
Parsimony, character analysis, and optimization of sequence characters
CHAPTER 4
The logic of the data matrix in phylogenetic analysis
Brent D. Mishler
4.1 Introduction
The process of phylogenetic analysis inherently consists of two phases. First a data matrix is assembled, then a phylogenetic tree is inferred from that matrix. There is obviously some feedback between these two phases, yet they remain logically distinct parts of the overall process. One could easily argue that the first phase of phylogenetic analysis is the most important; the tree is basically just a re-representation of the data matrix with no value added. This is especially true from a parsimony viewpoint, the point of which is to maintain an isomorphism between a data matrix and a cladogram. We should be very suspicious of any attempt to add something beyond the data in translating a matrix into a tree!
Paradoxically, despite the logical preeminence of data matrix construction in phylogenetic analysis, by far the greatest effort in phylogenetic theory has been directed at the second phase of analysis, the question of how to turn a data matrix into a tree. Extensive series of publications have been elaborated to attempt to justify such tree building approaches as neighbor-joining, maximum likelihood, and Bayesian inference, while ignoring entirely the nature of the data matrix that must underlie any analysis. The reasons for this asymmetry in research on phylogenetic theory are not entirely clear, but it probably has to do with the fact that the problem of tree building may appear simpler, more clear-cut. Perhaps it is just a matter of research fashions. For whatever reason, relatively little attention has been paid to the assembly of the data matrix, and it is high time to examine this all-important part of systematic research. At stake are each of the logical elements of the data matrix: the rows (what are the terminals?), the columns (what are the characters?), and the individual entries (what are the character states?).
The tree of life is inherently fractal-like in its complexity, which complicates the search for answers to these questions. Look closely at one lineage of a phylogeny (defined as a diachronic connection between an ancestor and a descendent) and it dissolves into many smaller lineages, and so on, down to a very fine scale. Thus the nature of both the terminal units (TUs; the twigs of the tree in any particular analysis) and the characters (hypotheses of homology, markers that serve as evidence for the past existence of a lineage) change as one goes up and down this ‘fractal’ scale. Furthermore, there is a tight interrelationship between TUs and character states, since they are reciprocally recognized during the character analysis process.
This chapter will deal with logical issues involving the elements of the data matrix in light of the nested and interrelated nature of TUs and characters. I will argue at the end that if care is taken to construct an appropriate data matrix to address a particular question of relationships at a given level, then simple parsimony analysis is all that is needed to transform the matrix into a tree. Debates over more-complicated models for tree building can then be seen for what they are: attempts to compensate for marginal data.
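The two phases described above can be made concrete with a small sketch. The Python below is an illustration only, not a method taken from this chapter: the four TUs, the three binary characters, and the names MATRIX, fitch_steps, and tree_length are all hypothetical. It stores a data matrix as rows (TUs) by columns (characters) and scores each candidate rooted tree by the minimum number of state changes the matrix requires on it, using the Fitch counting procedure for unordered characters, which is assumed here as one standard way of computing parsimony length.

```python
from typing import Dict, List, Set, Tuple, Union

# A rooted binary tree written as nested pairs of TU names (hypothetical encoding).
Tree = Union[str, Tuple["Tree", "Tree"]]

# Hypothetical data matrix: rows are TUs, columns are characters,
# entries are discrete character states.
MATRIX: Dict[str, List[str]] = {
    "A": ["0", "0", "1"],
    "B": ["0", "1", "1"],
    "C": ["1", "1", "0"],
    "D": ["1", "1", "0"],
}


def fitch_steps(tree: Tree, column: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate state set, minimum changes) for one character."""
    if isinstance(tree, str):
        # A terminal contributes its observed state and no changes.
        return {column[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], column)
    right_set, right_steps = fitch_steps(tree[1], column)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change is needed.
        return shared, left_steps + right_steps
    # No state in common: one change must occur below this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree: Tree, matrix: Dict[str, List[str]]) -> int:
    """Sum the minimum number of changes over all columns of the matrix."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for j in range(n_chars):
        column = {tu: states[j] for tu, states in matrix.items()}
        total += fitch_steps(tree, column)[1]
    return total


# The three possible groupings of four TUs, scored against the same matrix.
CANDIDATES: Dict[str, Tree] = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
LENGTHS = {name: tree_length(tree, MATRIX) for name, tree in CANDIDATES.items()}
BEST = min(LENGTHS, key=LENGTHS.get)
```

On this toy matrix the grouping ((A,B),(C,D)) requires three changes and each alternative requires five. The second character changes on a single terminal, so it adds one step to every tree without discriminating among them; the tree that is preferred is determined entirely by how the other two columns were coded.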
4.2 What exactly is a terminal branch on a tree (that is, a row in the data matrix)?
People who publish phylogenetic analyses are usually cavalier about what their terminal branches represent. One often sees species or other taxon names, or even geographic designations of populations, attached to terminal branches of published trees without explanation. A larger-scale unit might indeed be a well-justified TU, but it needs to be justified, not assumed a priori. Taxa or populations are never the fundamental things from which phylogenies are actually built. Not even individuals are the TUs (contra Vrana and Wheeler 1992). As was carefully elaborated by Hennig (1966), the fundamental terminal entity in phylogenetics is the semaphoront, an instantaneous time slice of an individual organism at some point in its ontogeny. A tube of extracted DNA and its associated museum voucher specimen (a semaphoront) should be considered the ultimate TU.
This realization helps conceptually, but doesn’t solve all of the empirical problems that arise in assembling a matrix. In practice, TUs (i.e. rows in a data matrix) are usually not semaphoronts. Especially in larger-scale studies, TUs are usually a complicated assemblage of semaphoronts, and sometimes even include data removed from any connection with their original semaphoront. Many specimens often need to be examined for relevant character information (not all of which can be gathered from all semaphoronts because of their sex, life stage, or state of preservation). Information from the literature or a database such as GenBank is often included in the matrix, based on a taxon identification alone without reference to a voucher specimen. This process of assembly of such composite TUs needs careful examination.
Similar sorts of terminals have been called operational taxonomic units (OTUs) in the past, but I think a refined concept of TUs, as referred to above, is necessary, one designed specifically for phylogenetics. The original concept of OTU was defined by pheneticists as a minimal cluster in a Euclidean distance sense. Cladists need instead to refer to specific, potentially homologous and discrete-state characters in a Manhattan distance sense. An additional flaw of the original concept of OTU is that, by using the word ‘taxonomic,’ it implies that one can do taxonomy before an analysis is completed. This view, by confounding the logical precedence of analysis before classification, has led to major mistakes in systematics research, both phenetic and cladistic, most acutely in the development of phylogenetic species concepts (see the debates framed in Wheeler and Meier 2000).
So how can we define a TU that is suitable for use in phylogenetics? Epistemologically speaking, a TU is a set of semaphoronts that are homogeneous for the informative character states currently known (as explained in detail below). A TU is essentially a pile of semaphoronts that cannot currently be subdivided by character data, and thus it is a pragmatic unit, always subject to change as knowledge of characters progresses. Ontologically speaking, a TU is taken to represent a time slice of one of the terminal lineages whose relationships are being studied in a particular analysis.
Why do I say “in a particular analysis”? Because the definition of TUs, even for the same group of organisms, may change in analyses at different scales. There unfortunately isn’t one fundamental TU suitable for any and all analyses, for several different reasons. Epistemologically speaking, since TUs are dependent on character-state divisions in the characters being employed, they are discovered and defined in the course of character analysis (as discussed in detail below), and of course different characters are useful at different scales of analysis. There is thus a reciprocal relationship between character states and TUs as they are being discovered during character analysis at different levels. Ontologically speaking, larger-scale lineages are usually composed of smaller lineages nested inside them, and the choice of which lineage to represent in a particular analysis depends on the questions being asked. Furthermore, the lineages at these different levels potentially have different histories; in other words the smaller lineages are not always proper subsets of the larger ones. This is sometimes called the gene tree/species tree distinction (Maddison and Maddison 1992), but that distinction is far too simplified; there are many nested levels of potentially
incongruent lineages, not just two (more on this topic later).
Even if one wanted to try to avoid these problems by using only semaphoronts in a data matrix, one would still need to pay attention to the same issues of scale. One would still need to decide conceptually which lineages are being represented by what semaphoronts. It is nearly impossible in practice to use single semaphoronts as terminals rather than compositely coded TUs that have data taken from a number of semaphoronts. For one thing, not all semaphoronts bear all the characters; there may be juvenile specializations or sexual dimorphism present in a lineage. Some specimens will be missing reproductive organs or other key features. Different genes will often be sequenced from different individuals. Furthermore, data are often taken from the literature (e.g. a previously published ultrastructural analysis) or from a database (e.g. another laboratory’s gene sequence), in cases where no reference can be made to an original semaphoront (e.g. if no voucher specimen was deposited in a museum). Thus, data are virtually always compiled from studies of different individual organisms considered to represent the same terminal lineage. TUs are nearly always composites in practice, their composition varying depending on the scale of analysis.
This topic obviously touches on the species debate, on which I have some opinions (Mishler and Brandon 1987; Mishler 1999; Mishler and Theriot 2000a, b, c), but which I am attempting to steer clear of in this essay to maintain focus. I am speaking here to how data matrices are made: classification (including naming species) is something that happens much later in the process. So, while this is not the place to debate species concepts, I do need to point out that the fractal scaling of nested lineages includes those well below the traditional species level. Thus, species are not somehow different from lineages at any other level; they are not ‘privileged’ TUs, and they simply need to be justified like any other.
In summary, there is never a given, a priori set of TUs to begin a phylogenetic analysis with. Certainly, named taxa (including species) should not be taken as TUs without question. TUs need to be constructed during each analysis, and re-checked each time a group is re-studied. They need to be carefully justified and re-justified using character evidence. This causes problems with easy comparison between analyses based on different data, but is an unavoidable fact of life in systematics and needs to be taken into account in such areas as database design (more below).
4.3 What exactly is a character (that is, a column in the data matrix)?
The fundamental activity in phylogenetic systematics is character analysis (Neff 1986) in which characters and states are hypothesized, tested, and refined in a reciprocal manner, in concert with the assembly of TUs, as part of the development of a data matrix. In addition to the logical primacy of data matrix construction, there is a temporal primacy as well. It is an established fact that a systematist spends 95% of his/her time gathering and analyzing character data and less than 5% of that time turning the assembled data matrix into a tree. Character analysis must be the all-important part of the phylogenetic reconstruction process if there is going to be a hope of discovering the history of a group. Fortunately, there have been some clear treatments of the elements of character analysis (Wiley 1981, Farris 1983, Neff 1986), but these were published some time ago and seem to be unknown to many recent workers. Younger systematists would do well to put more energy into investigations of the principles of character analysis and building better matrices, than into ever more complex model building for tree reconstruction, keeping firmly in mind the principle of ‘garbage in, garbage out.’ No model of the evolutionary process can be brought to bear successfully if the data matrix does not represent cogently argued character and character-state statements.
Before using a tool (characters in this case) it is wise to think carefully about what one is trying to do with the tool. What we are trying to do in phylogenetic analysis is to infer the existence of some past lineage by finding characters that changed state in that lineage and can thus serve as a potential marker for reconstructing that branch in the future (the Hennig Principle). The goal of
character analysis is to find as many potential markers as possible that can serve as evidence for the past existence of lineages shared by one or more of the TUs (see Fig. 4.1). These markers are the only tools a phylogeneticist has to reconstruct the branching history of life, but of course the kind of markers that are useful for branches at one level of depth in time won’t necessarily be so for another level. Thus markers need to be searched for carefully in light of the particular branching events one is trying to reconstruct. Since semaphoronts are chosen to build TUs that are representative of the branching events under study, we need to find ‘good’ characters that differentiate the chosen semaphoronts.
Figure 4.1 Illustration of the concept of a phylogenetic marker. Labels in the figure: ‘Period of shared history’; ‘A character changing state on the branch, becoming a marker for the existence of that branch in the future’.
Much has been written about what constitutes a ‘good’ character. Ontologically speaking, potentially informative markers need to support a hypothesis of homology across the group under study; thus they need to be comparable in a convincing way across the study organisms. They need to be independent, so they can be taken as separate pieces of evidence for the existence of past lineages in the face of confounding effects such as convergence. They need to have discrete states so they can be inferred to contain a record of evolutionary events marking at least one specific past lineage. The epistemological rules of character analysis can thus be summarized as follows. Potential characters need to be evaluated by evidence for: (1) homology and heritability of a character across the TUs being studied, (2) independent evolution of different characters, and (3) presence in each character of a system of at least two discrete states. I elaborate somewhat on each of these criteria in turn below:
(1) Homology is certainly one of the most important concepts in systematics, and therefore also one of the most controversial. Following on from the work of Hennig and later phylogenetic systematists, when we say that two semaphoronts share the same characteristic, we mean they share a profound historical continuity of information (Roth 1984, 1988). They are postulated to have shared a common ancestor that had that characteristic. Thus an important contribution of cladistics has been the explicit formulation of a phylogenetic criterion for homology: a hypothesis of taxic homology (i.e. a potential synapomorphy) by necessity is also a hypothesis for the existence of a monophyletic group (Patterson 1982; Stevens 1984). Each postulated homology (i.e. a column in the data matrix) is essentially a miniature phylogenetic hypothesis all by itself (especially as viewed in the context of its assigned states), and can be tested against other postulated homologies. Therefore, congruence among all postulated homologies provides a test of any single character in question; some characters initially thought to be homologous are later inferred not to be because they are in conflict with the majority of characters. The initial hypotheses of homology are based on detailed similarity in structure and development (see the discussion in Wiley 1981); these go into the matrix for eventual testing by congruence.
(2) For character changes to count as independent pieces of evidence in the congruence test (Patterson 1982), it is necessary that they not be
genetically, developmentally, or functionally correlated with other characters. There are many biological processes acting to distort the phylogenetic signal present in characters (e.g. reversal to primitive states caused by heterochrony, convergent evolution across different characters caused by natural selection, parallel changes to the same state within one character caused by functional constraints, etc.), along with random effects such as long branch attraction (caused by the accumulation of homoplastic matches on long, non-sister branches making them appear to be sister groups). The only weapon the phylogenetic systematist has against this inevitable distortion is many independent sources of information that are, as best can be determined, not impacted by the same biasing factors.
Note that there is another meaning of ‘correlation’, phylogenetic congruence, that does not disqualify characters from counting as independent! Congruence is what gives us supporting evidence for the existence of a monophyletic group. Thus the rules of character analysis need to be carefully drawn to encompass all the valid potential markers possible while rejecting those that are not suitable.
(3) Why is it necessary for a useful character to have at least two distinct states? Again, we need to think back to what we are trying to do: discrete states are needed because we are trying to reconstruct a discrete thing, an evolutionary event in which a prior state changed to some new posterior state, thus marking the existence of a shared ancestral lineage. The literature on the practice of how to define character states has had a checkered past. In most cases, people have simply made character state distinctions without any justification at all, and many methods proposed for ‘gap coding’ are flawed in various ways (Stevens 1991). The empirical details are beyond the scope of this chapter; see Mishler and De Luna (1991) for a discussion of this issue and a recommended approach using ANOVA and multiple range tests to seek statistically homogeneous groups of TUs representing character states.
To summarize, a ‘good’ character for phylogenetic analysis shows greater variation among TUs than within them. This variation must be heritable and independent of other characters. This view of taxonomic characters also requires that each be a system of at least two discrete transformational homologs, or character states. Note that just as with TUs, there is never a given, a priori set of characters to begin a phylogenetic analysis with. Characters need to be discovered and evaluated during each analysis, and re-checked each time a group is studied.
4.4 What is the relationship between TUs and character states (that is, the individual entries in the data matrix)?
Neither the concept of TU nor the concept of character can be fully understood alone, without reference to each other and to the ‘fractal’ nature of the tree of life (as discussed earlier). The nature of both TUs and characters changes as you go up and down this fractal scale.
As discussed earlier, the rows in a data matrix are virtually never based on data taken from a single individual, given that different labs are producing the data over time, and that different data-gathering techniques (ranging from DNA extraction through preparation for anatomical study) often require destructive sampling; thus data are often compiled from study of different organisms considered to represent the same TU. Thus TUs are nearly always composites in practice, their composition varying depending on the scale of analysis.
Likewise, what counts as a useful character changes depending on the scale of analysis. The characters in a matrix have been selected based on their apparent utility for the task at hand, homologized (e.g. aligned) for the organisms under study, and pre-screened for their fit to the criteria of a good taxonomic character. Thus, the columns in a data matrix are already highly refined hypotheses of phylogenetic homology, defined with respect to the scale of the current study.
To make things more complicated, there is clearly a reciprocal relationship between TUs and character states. As detailed earlier, a TU can best be defined as a set of individual samples (semaphoronts in Hennig’s terminology) that are homogeneous for character states currently known,
while a character can best be defined as a potential marker for shared history of some subset of the known TUs. This means that TUs and characters emerge during a process of “reciprocal illumination” (Hennig 1966). To a large extent their definitions and discovery are interlinked. How do we proceed empirically in a way that avoids circularity? Before answering this question we need to consider the scaling problem in more detail.
4.5 Deep vs. shallow phylogenetics
The reconstruction of ‘deep’ relationships is fundamentally different from reconstruction of ‘shallow’ relationships (Mishler 2000). This is because the problems faced at these different temporal scales are quite distinct. In shallow reconstruction problems, the branching events at issue happened a relatively short time ago and the set of lineages resulting from these branching events is relatively complete (extinction has not had time to be a major effect). In these situations the relative lengths of internal and external branches are similar, giving less opportunity for long-branch attraction. However, the investigator working at this level has to deal with the potentially confounding effects of reticulation and lineage sorting. Character-state distinctions may be quite subtle, at least at the morphological level. At the nucleotide level it is necessary to look very carefully to find genes evolving rapidly enough; however, such genes may be relatively selectively neutral, and thus less subject to adaptive constraints which can lead to non-independence.
In deep reconstruction problems, the branching events at issue happened a relatively long time ago and the set of lineages resulting from these branching events is relatively incomplete (extinction has had a major effect). In these situations, the relative lengths of internal and external branches are often quite different; thus there is more opportunity for long branch attraction, even though there is little to no problem with reticulation and lineage sorting since most of the remaining branches are so old and widely separated in time. Due to all the time available on many branches, many potential morphological characters should be available, yet they may have changed so much as to make homology assessments difficult; the same is true at the nucleotide level, where multiple substitutions in the same region may make alignment difficult. Thus very slowly evolving genes may be sought, but that very conservatism is caused by strong selective constraints, which increase the danger of convergence leading to character dependence. Another approach is to increase sampling density: if TUs can be added that more evenly sample the true tree, thus reducing the asymmetry between internal and external branches, then faster-evolving genes may have better performance (Källersjö et al. 1998, 1999).
These considerations suggest that the problems being faced, and their best-justified solutions, will change as you go up and down this fractal scale. The nature of TUs and usable characters is going to change, and we need to have a way to scale phylogenetic results from one level to the next if we are going to have a hope of building a complete tree of life. There is effectively an infinite number of semaphoronts out there; there will never be a ‘complete’ data matrix for all of them for the practical reason that there are too many. But more importantly, it isn’t at all clear that a single global analysis of all semaphoronts living on Earth would be desirable, even if we could do it. There is the fact discussed earlier that a given semaphoront doesn’t bear all the relevant data, and thus composite TUs would need to be constructed in practice. There is also the fact that character homologies can be drawn much more easily when comparing only closely related TUs. Very few characters can be coded reliably across the whole tree of life. So we need to examine the scaling issue closely to see how we might combine or concatenate data matrices and phylogenetic results from more-shallow analyses into deeper and deeper ones until eventually a global tree of life can be produced.
4.6 How should we connect up analyses and data matrices that are ‘nested’ inside each other at various different time scales?
How will we ultimately connect up deep and shallow analyses, each with its own distinctively
useful data and problems? Some hold out hope for branches is introduced. These problems can lead
eventual global analyses, once enough universally to erroneous long branch attractions in global
comparable data have been gained and computer analyses.
programs get much more efficient, to deal with all At the right-hand end of the spectrum,
extant organisms at once. Others would go to the local analyses are simply grafted together into
opposite extreme, and use a supertree approach, supertrees at the place where shared taxa occur,
where shallow analyses are grafted on to the tips without reference back to the original data. There
of deeper analyses. An intermediate approach, are many ways to do this in detail (as reviewed by
called compartmentalization (Mishler 1994, 2000), Sanderson et al. 1998), but the important thing is
uses shallow topologies (that are based on analyses that the analyses on real character data are only
of the characters useful locally) to constrain global done locally, and the concatenation is based on a
deep analyses (that are based on analyses of combination of local topologies rather than an
characters useful globally). All of these issues integration of local data sets into a global data set.
surrounding how to use phylogenetic markers at Both of these approaches may be problematic,
their appropriate level to reconstruct the extremely one too global, the other too local. Thus the appeal
deep tree of life are likely to be among the major of a promising intermediate approach called
concerns of phylogenetics in coming years, as compartmentalization (by analogy to a water-tight
reconstruction of the whole tree of life from twigs compartment on a ship—homoplasy is not allowed
to trunk is attempted.

The different approaches to concatenating analyses at different scales can
be best viewed as a spectrum (see Fig. 4.2). At the left-hand end of this
spectrum, the approach is to include all possible TUs and potential
characters in one matrix. Generally this is not actually done, because the
sheer amount of data (millions of possible TUs) makes thorough phylogenetic
analysis computationally impossible. The most-common approach in practice
in global analyses is to select a few representatives of a large, clearly
monophyletic group (the exemplar method). Care is sometimes taken to select
representatives that are ‘basal’ TUs within the group to be represented
(i.e. cladistically basal relative to the imaginary root defined by
outgroups); however, this still does not avoid two important problems:
(1) within-group variation is not fully represented in the analysis, and
(2) an increase both in terminal branch lengths and in asymmetry between
lengths of different

in or out). This approach represents diverse yet clearly monophyletic
clades by their inferred ancestral states in larger-scale cladistic
analyses (Mishler 1994, 2000). A well-supported local topology is sought
first, then an inferred “archetype” or Hypothetical Ancestor (HA) for the
group is inserted into a more inclusive analysis. In more detail, the
procedure is to: (1) perform global analyses, determine the best supported
clades (these become the compartments); (2) perform local analyses within
compartments, including more taxa and characters (more characters can be
used within compartments due to improved homology assessments among closely
related organisms; see below); (3) return to a global analysis, in one of
two ways, either (a) with compartments represented by single HAs (the
archetypes), or (b) with compartments constrained to the topology found in
local analyses (for smaller data sets, this approach appears better because
it allows flexible character state assignments to the base of the
compartment based on optimizations to the local topology).
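Step (3a) of the procedure can be illustrated with a small sketch. It is not the chapter's own implementation: it assumes unordered, equally weighted characters and a fully resolved local topology, and it uses the standard Fitch bottom-up pass to obtain the most parsimonious state set at the base of a compartment. The names fitch_root_states and hypothetical_ancestor, the four TUs, and the two-character matrix are all hypothetical.

```python
from typing import Dict, List, Set, Tuple, Union

# A tree is either a terminal-unit name (str) or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_root_states(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (state set at the root of `tree`, minimum number of changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_root_states(tree[0], states)
    right_set, right_cost = fitch_root_states(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two subtrees agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # No shared state: keep the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def hypothetical_ancestor(tree: Tree, matrix: Dict[str, str]) -> List[Set[str]]:
    """Code one HA row: a state set per character for the compartment root."""
    n_chars = len(next(iter(matrix.values())))
    ha = []
    for i in range(n_chars):
        column = {tu: row[i] for tu, row in matrix.items()}
        root_set, _ = fitch_root_states(tree, column)
        ha.append(root_set)
    return ha


# Hypothetical compartment: four TUs, two characters, local topology known.
local_tree = (("tu1", "tu2"), ("tu3", "tu4"))
local_matrix = {"tu1": "GA", "tu2": "GC", "tu3": "TA", "tu4": "GA"}
ha_row = hypothetical_ancestor(local_tree, local_matrix)
```

The HA row is built from every TU in the compartment, so it need not match any single exemplar. A state set with more than one member would be coded as a polymorphism or ambiguity in the global matrix, which is one reason option (b) can be preferable.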
The compartmentalization approach differs from the exemplar approach in
that the representative character-states coded for the archetype are based
on all the TUs in the compartment, thus the reconstructed HA is likely to
be quite different from any particular TU. As an estimate of the states of
the most recent common ancestor of all the local TUs, the HA is likely to
have a much shorter terminal branch with respect to the global analysis,
which in turn can have the beneficial global effect of reducing long branch
attraction. In addition to these advantages of compartmentalization at the
global level, the local analyses will be better because one can:
(1) include all local TUs for which data are available; (2) incorporate
more (and better justified) characters, by adding in those characters for
which homology could not be determined globally (e.g. genes that can only
be aligned locally); (3) avoid spurious homoplasy that can change the local
topology due to long-branch attractions with distant outgroups. The effects
of compartmentalization are thus to cut large data sets down to manageable
size, suppress the impact of spurious homoplasy, and allow the use of more
information in analyses. This approach is self-reinforcing; as better
understanding of phylogeny is gained, the support for compartments will be
improved, leading in turn to refined understanding of appropriate
characters and TUs both within the compartments and between the
compartments.

Figure 4.2 How to concatenate different analyses to build the tree of life?
Shown is a spectrum of approaches ranging from global to local. Points
labeled along the spectrum: ‘All’ TUs, Compartmentalization, Supertrees.
See text for explanation.

4.7 Structural vs. DNA sequence characters

The choice of data for use at different scales of analysis is the crux of
the matter. One important issue to consider is how intrinsically useful
different categories of characters are at these different scales. It is
clear that, as general categories, structural data (i.e. anatomical,
morphological, or genomic) and DNA sequence data have different and
complementary strengths and weaknesses. DNA sequence characters are much
more numerous than structural characters, thus increasing the chance that
sufficient markers can be found for all branches of a tree. They are
especially useful in organisms with simple morphology, such as fungi and
bacteria, that may lack a sufficient number of structural characters.
Objectively defining character states in structural comparisons can be
difficult, particularly in shallow reconstructions, while the states are
usually clear-cut in DNA sequence data. It has been argued that it is
useful that DNA sequence data are independent from morphological characters
that are perhaps subject to adaptive convergence (although convergence of
course cannot be ruled out in DNA sequences, particularly at deeper
levels). Sequences of highly conserved genes can be homologized across very
broad groups that share little morphologically, although these same highly
conserved regions are probably highly subject to adaptive convergence.
Finally, models of evolutionary change are easier to postulate for DNA
sequence evolution, a perceived advantage for those who like to use maximum
likelihood methods.

On the other hand, especially in deeper comparisons, structural characters
(i.e. traditional morphological characters but also modern genomic
characters such as rearrangements and intron insertions; see next section)
often have much greater complexity, and may exhibit ontogeny, allowing a
temporal axis of comparison not available with DNA sequence data.
Structural characters often change in an episodic pattern, which is
necessary for evidence of deep, short branches to remain detectable.
Clock-like markers are the worst kind of data for those sorts of branches;
the markers keep changing and thus erasing history. It is much better for
discovering those deep, short branches to have a clock like those found
frozen in place on the sunken ship Titanic (still showing the time the ship
went down); a clock that stopped ticking when some major change occurred.
Furthermore, the number of possible character states is usually much higher
in morphological character systems (and in genomic rearrangements) than in
DNA sequence data, which serves to make long branch attraction less of a
problem (see Mishler 1994 for discussion). Morphological data are more
easily gathered from large numbers of specimens, and from fossils, making
it much easier to robustly sample the true phylogeny. For all these
reasons, morphological data have remained among the characters of choice at
deeper phylogenetic levels, and have been joined recently by an exciting
new class of structural characters derived from genomic comparisons. The
latter promise to be very useful in the future, particularly for those
deep, relatively short internal branches that have proven resistant to
phylogenetic reconstruction with DNA sequence data.
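The point about the number of character states can be made concrete with a deliberately crude calculation. It is an illustration only, under an assumption the chapter does not state: when a character changes, every other state is equally likely. Under that assumption, two lineages that each change independently from the same ancestral state end up matching by chance with probability 1/(k - 1) for a k-state character. The function name chance_match and the 20-state example are hypothetical.

```python
from fractions import Fraction


def chance_match(n_states: int) -> Fraction:
    """Probability that two independent changes from the same ancestral state
    arrive at the same derived state, if every other state is equally likely."""
    if n_states < 2:
        raise ValueError("a character needs at least two states to change")
    return Fraction(1, n_states - 1)


dna = chance_match(4)          # four nucleotide states
many_state = chance_match(20)  # hypothetical 20-state structural character
```

With only four states, a third of such paired changes would look like shared derived states on two long branches. The fraction shrinks as the state space grows, which is the sense in which many-state characters are less prone to long branch attraction.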
4.8 Genomic characters

This is the era of whole-genome sequencing; molecular data are becoming
available at a rate unanticipated even a few years ago. Sequencing projects
in a number of countries have produced a growing number of fully sequenced
genomes, providing computational biologists with tremendous opportunities.
However, comparative genomics has so far largely been restricted to
pairwise comparisons of genomes; for instance, to identify syntenic
regions, orthologous genes, and common regulatory elements between human
and mouse. The importance of taking a phylogenetic approach to
systematically relating larger sets of genomes has only recently been
realized.

A recent synthesis of phylogenetic systematics and molecular
biology/genomics, two fields once estranged, is beginning to form a new
field that could be called phylogenomics (Eisen et al. 1998). Something can
be learned about the function of genes by examining them in one organism.
However, a much richer array of tools is available using a phylogenetic
approach. Close sister-group comparisons between lineages differing in a
critical phenotype (e.g. desiccation or freeze tolerance) can allow a quick
narrowing of the search for genetic causes. Dissecting a complicated,
evolutionarily advanced genotype/phenotype complex (e.g. development of the
angiosperm flower) by tracing the components back through simpler ancestral
reconstructions can lead to quicker understanding. Hence, phylogenomics
allows one to go beyond the use of pairwise sequence similarities and use
phylogenetic comparative methods to confirm and/or to establish gene
function and interactions.

Cross-genome phylogenetic approaches have the potential to provide insights
into many open functional questions. A short list includes understanding
the processes underlying genomic evolution, identifying key regulatory
regions, understanding the complex relationship between phenotype and
genomic changes, and understanding the evolution of complex physiological
pathways in related organisms. Using such a comparative approach will aid
in elucidating how these genes interact to perform specific biological
processes. For example, Stuart et al. (2003) used microarray data from four
completely sequenced genomes (yeast, nematode, insect, and human) to show
coexpression relationships that have been conserved across a wide spectrum
of animal evolution.

Most importantly for the systematist, the new comparative genomic data
should also greatly increase the accuracy of reconstructions of the tree of
life. Even though nucleotide sequence comparisons have become the workhorse
of phylogenetic analysis at all levels, there are clearly phylogenetic
problems for which nucleotide sequence data are poorly suited, because of
their simple nature (having only four character states) and tendency to
evolve in a regular, more-or-less clock-like fashion. In particular, as
stated earlier, deep branching questions (with relatively short internodes
of interest mixed with long terminal branches) are notoriously difficult to
resolve with DNA sequence data. It is fortunate, therefore, that
fundamentally new kinds of structural genomic characters such as
inversions, translocations, losses, duplications, and insertion/deletion of
introns will be increasingly available in the future.

These characters need to be evaluated using much the same principles of
character analysis (discussed earlier) that were originally developed for
morphological characters. They must be looked at carefully to establish
likely homology (e.g. examining the ends of breakpoints across genomes to
see whether a single rearrangement event is likely to have occurred),
independence, and discreteness of character states. Given the close link
between characters and TUs discussed above, it is also necessary to
consider carefully the appropriate TUs for comparative genomic analysis,
especially since different parts of one organism’s genome may or may not
have exactly the same history. Thus close collaboration between
systematists and molecular biologists will be required to code these
genomic characters properly, and to assemble them into matrices with other
data types. Challenges resulting from combining different data sources, in
light of the possibility of different histories for different parts of the
same genome, are discussed in the next two sections.
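One simple way to turn gene order into discrete characters, offered here as an assumption rather than as the coding scheme the chapter has in mind, is to score the presence or absence of each gene adjacency. The sketch below treats each genome as a single linear chromosome, ignores gene orientation, and uses hypothetical genome names and gene numbers throughout.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Adjacency = FrozenSet[int]


def adjacencies(gene_order: List[int]) -> Set[Adjacency]:
    """Unordered neighbor pairs along one linear chromosome (orientation ignored)."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def adjacency_matrix(
    genomes: Dict[str, List[int]]
) -> Tuple[List[Adjacency], Dict[str, List[int]]]:
    """Score each adjacency seen in any genome as present (1) or absent (0)."""
    per_genome = {name: adjacencies(order) for name, order in genomes.items()}
    characters = sorted(set().union(*per_genome.values()), key=sorted)
    matrix = {
        name: [1 if adj in found else 0 for adj in characters]
        for name, found in per_genome.items()
    }
    return characters, matrix


def breakpoints(order_a: List[int], order_b: List[int]) -> int:
    """Number of adjacencies in genome A that are disrupted in genome B."""
    return len(adjacencies(order_a) - adjacencies(order_b))


# Hypothetical gene orders: Y differs from X and Z by one inverted segment.
genomes = {"X": [1, 2, 3, 4], "Y": [1, 3, 2, 4], "Z": [1, 2, 3, 4]}
characters, matrix = adjacency_matrix(genomes)
```

A single inversion disrupts two adjacencies at once, so these columns are not independent characters. That is exactly the kind of issue the homology and independence checks described above are meant to catch before such data enter a matrix.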
4.9 Dealing with heterogeneous data types

There is every reason to search carefully for good potential markers in all
kinds of data, particularly for the deep branching questions discussed
earlier. Deep phylogenetic reconstructions are by their nature difficult,
and all characters should be sought and used if they meet the criteria of
good potential markers (Mishler 2000). However, it remains controversial
how data from different sources are to be evaluated and compared (Swofford
1991). Some have argued that data sets derived from fundamentally different
sources should be analyzed separately, and only common results taken as
well supported (i.e. consensus tree approaches), or at least that only data
sets that appear to be similar in the trees they favor should be combined
(Huelsenbeck et al. 1996). Others have argued that all putative homologies
should be combined into one matrix (i.e. ‘total evidence’; Kluge 1989;
Barrett et al. 1991; Donoghue and Sanderson 1992; Mishler 1994).
Theoretical arguments at present favor the latter approach: if characters
have been independently judged to be good candidates for phylogenetic
markers, then they are equivalent and should be analyzed together.

There is one major exception to the preference for a ‘total evidence’
position: data should not be combined into a single matrix if there is
evidence that some characters had a different branching history than the
rest (Mishler 2000). However, this is not easy to detect. There are several
sources of homoplasy other than different branching history, including
evolutionary convergence. If several data partitions show different highly
discordant trees due to convergence, the only way to see the ‘true’ tree
topology is to combine them. The only weapon a systematist has against
convergence is the likelihood that truly independent characters will be
subject to different confusing factors and thus the true history may emerge
when these independent characters are combined (Barrett et al. 1991).
Probably all character systems are influenced by constraints that tend to
bias phylogeny reconstruction one way or another, yet a combination of very
different character data may allow the noise to cancel out, and the
historical signal to come through.

Therefore, observing a particular gene or other data partition exhibiting
serious conflict with another is not sufficient reason to reject combining
them. There must also be additional evidence, outside of the phylogenetic
analysis, for reticulation or lineage sorting. The best current examples of
such discordance are in shallow analyses, where organellar genomes may have
different phylogenies than those of associated nuclear genomes and
morphologies (Smith and Sytsma 1990; Rieseberg and Soltis 1991). Barring
that sort of clearly explainable discordance via reticulation, all
appropriate data should be used, especially in deep analyses because, as
argued earlier, reticulation and lineage sorting are much less likely to be
problems in deep analyses, while convergence is likely to be a greater
problem. But even if its effects may be negligible in many deeper analyses,
the problem of reticulation is a difficult one, worthy of a more detailed
look.

4.10 Reticulation

As introduced earlier, the tree of life is essentially composed of nested
sets of lineages. Look closely at one lineage, and it turns out to be
composed of smaller lineages, all the way down to within the organism (e.g.
cell lineages and gene genealogies). None of the levels of nested lineages
can be considered fundamental (Mishler and Theriot 2000a, b, c); it depends
on the scale of the specific question being asked. To build the large-scale
framework of the tree of life one can probably ignore the fine-scale
lineages within organisms and between organisms within populations. But to
study microevolutionary differentiation processes and design conservation
plans at the population level, one needs to look at the fine-scale
lineages, and to look at the spread of cancer cells in a body, one needs to
look at finer levels still.

The major problem that arises is that these nested sets of lineages are not
always proper subsets. Especially at the finer levels, sublineages of a
larger lineage may not all have the same history, and/or may not have the
same history as the larger lineage. For example, parts of the genome within
one organism can have different
histories, for two main reasons. The first of these is lineage sorting (see
Fig. 4.3), which occurs when genes exist in families within the genome due
to past duplication events, and differential extinction has taken place in
derived higher-level lineages such that the relationships of the genes
appear not to match the relationships of the higher-level lineages (Avise
1989). The problem in this case is one of mistaken homology: paralogy is
confused with orthology because not all the gene lineages are present in
all higher-level lineages.

Figure 4.3 Illustration of lineage sorting. Three larger-scale lineages are
outlined with dark lines and labeled with capital letters. Three extant
smaller-scale lineages are included, together with extinct relatives, and
shown with lighter lines and lower-case letters. Note that the
relationships of the larger-scale lineages are A(B,C) while the
relationships of the smaller-scale lineages are (a,b)c because of the
particular pattern of extinction that occurred. This would result in
apparent homoplasy at the level of the larger-scale lineages.

The second major reason for differential histories is reticulation, which
occurs when once separate lineages blend back together. At the genome
level, recombination can bring genes with different histories together into
a single lineage. Of all the different sources of homoplasy, such as
adaptive convergence, gene conversion, developmental constraints, mistaken
coding, lineage sorting, and reticulation, the last is the most
problematical. This is because reticulation violates a fundamental
assumption underlying cladistic analysis, that of a branching model of
history. The other factors are all cases of mistaken hypotheses of homology
of one sort or another, whereas ‘homoplastic’ character distributions due
to reticulate evolution involve true homologies whose mode of transmission
was not tree-like. The possibility of reticulation further complicates the
relationship between TUs and characters discussed earlier, since it ensures
that some lineages nested inside of larger ones truly have different
histories than others.

Because of this important violation of a fundamental cladistic assumption,
Hennig (1966) and later Nixon and Wheeler (1990) were correct in focusing
on reticulation and the problems it causes for cladistics. However, the
problems posed by reticulation are more complicated than their proposed
‘solution,’ i.e. their suggestion that the species level can be used as a
dividing line by supposing that reticulation only occurs below the species
level. This assumption (made by many, but not all, cladists) of an abrupt
cessation of interbreeding at the species level, separating rampant
reticulation below from clean divergent evolution above, was wrong in two
important respects. One is the implication that reticulation can be
disregarded at higher levels, and the other is the implication that
cladistic methods are not appropriate below the species level. Mishler and
Theriot (2000a, b, c) refuted both implications; here are their arguments
in summary:

(1) There is no consistent demarcation between reticulate and branching
relationships at any particular level. Hybridization takes place between
clades of various patristic/cladistic degrees of relatedness. Reticulate
relationships range from intense (in panmictic, sexually reproducing groups
where individual relationships are exclusively reticulate), to less intense
(in spatially or temporally subdivided groups where both reticulate and
divergent relationships exist among individuals), to none in clonally
reproducing organisms. Rare, high-level hybridizations may occur among very
divergent lineages, such as among genera of orchids; viral-mediated lateral
transfer of genetic material is suspected at much higher levels.

(2) Just as barriers to reticulation are often not complete, reticulation
is not a complete barrier to cladistic analysis. There is much phylogenetic
structure within named species; indeed, a whole new field of phylogeography
was founded to explore this (Avise 1989). We can reconstruct relationships
in the face of some amount of reticulation (how much is not yet clear, but
is amenable to study). For example, McDade (1992) showed that incorporating
a few known hybrids in an analysis of ‘good’ species does not seriously
affect the cladistic topology of the good species. There may be a
self-correcting mechanism here as there
is with other sources of homoplasy: even major convergence (e.g. among cave
animals) can be uncovered via cladistic analysis. As with convergence,
where the application of cladistic analysis provides the only rigorous
basis we have for identifying homoplasy and thus demonstrating
non-parsimonious evolution, the only way we can identify reticulation on
the basis of character analysis alone is through the application of
cladistic parsimony, followed by the examination of homoplasy to attempt to
discover its source (see discussion by Vrana and Wheeler 1992). As was
pioneered by Slatkin and Maddison (1989), cladistic analysis of
non-recombining genes can even be used to measure gene flow between
populations. Thus, cladistic analysis can be used to study reticulation, at
any level.

(3) Thus, just as there may be no largest cladistic unit for which
reticulation is impossible, there may be no smallest ‘irreducible’
cladistic unit within which no further diverging phylogenetic patterns
occur. Ontologically speaking, we are dealing with a fractal pattern again;
if you look inside one lineage you see a pattern of divergence of lineages
within (and some reticulation, perhaps increasingly greater as one looks at
less-inclusive lineages). This fractal pattern of reticulation and
branching presents a problem for simple phylogenetic inference. But, as
argued above, phenomena such as lineage sorting and reticulation can be
discovered as incongruence between organismal and gene phylogenies, or
incongruence between different genes or different regions of the genome.

4.11 TUs, characters, and database design

One of the big challenges in modern biology is informatics. There are so
many data available, and a number of projects are attempting to represent
the information in databases. However, existing databases (e.g. GenBank or
Tropicos) are essentially flat files with respect to phylogeny. Data are
entered with whatever taxon name happens to be attached to them. The only
sense of evolutionary relationships is given by a schema of higher-taxon
names (say families and phyla) that can be used to group the basic entries.
These higher taxa may or may not be monophyletic, and essentially function
as static sorting bins for pulling out the basic records; there is no way
to access or display emergent properties of data at higher evolutionary
levels or to discover finer-scale patterns at lower levels. In other words,
databases are not yet sensitive to the fractal nature of phylogenies (with
their many hierarchically nested levels). As argued above, there are no
basic comparable taxa (terminal or otherwise), or characters. Both TUs and
characters are defined with respect to a certain level in the phylogeny.

As a new generation of phylogenetic databases is built (in part coordinated
by a large NSF ITR grant supporting a national resource in
phyloinformatics, Cyber Infrastructure for Phylogenetic Research (CIPRes);
see www.phylo.org), there needs to be much more flexibility built in. The
main themes of this chapter need to be explored to appropriately present
the richness of phylogenetic data to users. Fundamental open questions that
need to be addressed for databases include: (1) How can the elements of the
data matrix (TUs, characters, and states) as defined and recognized in some
particular study be stored and potentially retrieved for use in a future
study at a different level? (2) How can heterogeneous data types (e.g. DNA
sequences, genomic rearrangements, morphology) be compared/combined?
(3) How can data sets and analyses at very different scales be concatenated
(e.g. supertree, compartmentalization, or global approaches as discussed
earlier)? (4) How can phylogenetic results at these different concatenated
scales, where TUs are nested inside larger ones, and character definitions
(e.g. alignments) change as you move up and down the scale, be presented to
the community in comprehensible and useable ways?

The centerpiece of all future biological databases will need to be
phylogenetic classification, a deeply nested hierarchy of named nodes
linked to all available structural and functional data at each level
dynamically, as new data enter the database. All biological data fall
somewhere on the tree of life, which is the one thing that can unify them
all. This new approach to biodiversity informatics will take advantage of
the richness of the phylogenetic structure of biological data.
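The contrast between a flat file and a nested hierarchy of named nodes can be sketched in a few lines. This is only one possible shape for such a structure, not a description of any existing database. The Node class, its field names, and the example taxa and records are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """A named node in a nested classification, with records attached at its level."""
    name: str
    records: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def find(self, name: str) -> Optional["Node"]:
        """Depth-first search for a named node at or below this one."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def clade_records(self) -> Iterator[str]:
        """Yield records attached to this node and to every node nested inside it."""
        yield from self.records
        for child in self.children:
            yield from child.clade_records()


# Hypothetical fragment of a classification with data linked at several levels.
root = Node("land plants")
mosses = root.add_child(Node("mosses", records=["moss morphology matrix"]))
mosses.add_child(Node("Physcomitrella", records=["rbcL sequence"]))
root.add_child(Node("angiosperms", records=["flower development data"]))
```

A query such as `list(root.find("mosses").clade_records())` gathers everything attached at or below one named node. A flat file keyed only by the taxon name on each record cannot answer that without an external grouping schema.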
4.12 Tree building

This chapter has focused on the first phase of phylogenetic analysis,
building the data matrix, rather than the second phase, building a tree
from the matrix. Still, a few words on the latter are appropriate. The
simplest model for evaluating congruence among characters (different
hypotheses of homology) is equally weighted parsimony (Farris 1983), which
remains the preferred method for comparing diverse sorts of characters.
Each column in a data matrix can be regarded as an independently justified
hypothesis about phylogenetic grouping (the criteria for justifying these
individual character hypotheses are discussed above), an individual piece
of evidence for the existence of a monophyletic group. Parsimony assumes
that an apparent homology is more likely to be due to true homology than to
homoplasy, unless evidence to the contrary exists, i.e. a plurality of
apparent homologies showing a different pattern (Funk and Brooks 1990;
Mishler 1994). Parsimony does involve some simplifying assumptions, i.e.
that all character-state changes are similar in their probability of
change, and thus they can all be equally weighted. This assumption, while
robust, can lead to mistaken reconstructions under some extreme
circumstances of asymmetric probabilities of change within and among
characters, and in such cases simple parsimony can be modified using more
complicated models of change by either character and character-state
weighting (Albert et al. 1992, 1993; Albert and Mishler 1992) or maximum
likelihood approaches (Felsenstein 1981; Yang 1994).

Debates will no doubt continue over how complicated an evolutionary model
it is prudent to include in an analysis, but it is clear that all the
parsimony and maximum likelihood methods, by using individual character
data (specific hypotheses of homology), belong to a related Hennigian
family of methods. Fortunately, one important empirical observation is that
differential weighting and maximum likelihood have little effect on simple
parsimony reconstructions. Weighted parsimony and maximum likelihood
topologies are almost always a subset of the equally weighted parsimony
topologies, especially when applied to data with an appropriate rate of
change for the problem at hand (more on this later). Thus, paradoxically,
pursuit of well-supported weighting schemes has ended up convincing many of
us of the broad applicability and robustness of equally weighted parsimony
(Albert et al. 1993). Furthermore, all reconstruction methods work best
with ‘good data’, i.e. characters chosen with respect to a particular level
of phylogenetic question. It is with more problematic data (e.g. with a
limited number of informative characters, a high rate of change, or strong
constraints) that results of different methods begin to diverge. Weighting
algorithms and maximum likelihood approaches may be able to extend the use
of problematic data, but only if the evolutionary parameters that are
biasing rates of change are known. As biases become greater, precise
knowledge of them becomes ever more important for avoiding spurious
reconstructions. Therefore, given the large number of potential characters
made available by modern technology, it is desirable to be highly selective
about the characters that are used to address any particular phylogenetic
question; to the extent possible, the problematic data should be left out
(possibly to be used at a different, more appropriate level: see discussion
on compartmentalization in Mishler 1994, and elsewhere in this chapter).

What is the relationship between this chapter emphasizing the data matrix,
and the general themes of this book on parsimony? Simple. A rigorously
produced data matrix has already been evaluated carefully for potential
homology of each feature when being assembled. Everything interesting has
already been encoded in the matrix; what is needed is a simple
transformation of that matrix into a tree without any pretended value
added. Straight, evenly weighted parsimony is to be preferred, because it
is a robust method (insensitive to variation over a broad range of possible
biasing factors) and because it is based on a simple, interpretable, and
generally applicable model. More-complicated models for tree building are
fundamentally attempts to compensate for marginal data. Given the surfeit
of data available these days, it would be wiser to avoid the use of
marginal data!

These issues of how to use phylogenetic markers at their appropriate level
to reconstruct the
extremely fractal tree of life are likely to be one of the major concerns
in the theory of phylogenetics in coming years. In the future, my
prediction is that more-careful selection of characters for particular
questions (i.e. more-careful and rigorous construction of the data matrix)
will lead to less emphasis on the need for modifications to equally
weighted parsimony. The future of phylogenetic analysis appears to be in
careful selection of appropriate characters (discrete, heritable,
independent, and with an appropriate rate of change) for use at a carefully
defined phylogenetic level.

4.13 Acknowledgements

This chapter has benefited from analyses and collaborations supported by
three NSF grants, and I acknowledge my co-principal investigators for their
help in understanding these issues: the Deep Gene Research Coordination
Network (DEB-0090227; http://ucjeps.berkeley.edu/bryolab/deepgene/), the
Green Tree of Life Project (EF-0228729;
http://ucjeps.berkeley.edu/TreeofLife/), and the ITR grant entitled Cyber
Infrastructure for Phylogenetic Research (CIPRes; EF-0331494;
www.phylo.org/).
CHAPTER 5
Alignment, dynamic homology, and optimization
Ward C. Wheeler
5.1 Introduction

Systematics is the production of cladograms that link taxa through their
observed variation. These cladograms must optimize an objective function
such that they can participate in hypothesis testing on the basis of this
function. The core activity of systematics is to assay the relative merits
of a pair of competing scenarios and judge one superior. The repeated and
transitive application of this elemental comparison results in a globally
optimal solution that is the ultimate goal of systematics.

This depiction of systematics raises three points: the nature of the
objective optimality criterion; the manner of determination of this value;
and the assessment of the relative merits of cladograms. This chapter is
concerned with the second of these three, the realm of character analysis,
homology and optimization. The arguments here will be based on the
optimality criterion of parsimony or minimum cost. Likelihood or other
criteria could well be used, however, and most of the
character-optimization discussions would remain largely unchanged other
than the specifics of their implementation and numerical values. The
comparison of cladograms is the province of cladogram or tree searching and
is not discussed in any depth here.

Homology is the relationship between features that is derived from their
shared, unique origin. Given a single cladogram, two features are
homologous if their origin can be traced back to a specific transformation
on a branch of that cladogram, but the same pair of features may not be
homologous on alternate cladograms. Homology is entirely
cladogram-dependent and the relative optimality of alternate cladograms
determines whether or not features exhibit this relationship.

The dynamic homology framework (Wheeler 2001a) is an analytical concept
that extends through optimization of transformations to the correspondence
among features (often referred to as putative or primary homologies)
themselves. The joint scenario of correspondence and transformation is
chosen such that the overall cladogram cost is minimal. The correspondences
among features (nucleotides in this context) are not predetermined, but a
result of the analysis. In this framework, there is no distinction between
putative or primary and secondary homology (de Pinna 1991); all variation
is optimized de novo for each cladogram.

5.2 Sequence data

There are two properties that have been used to differentiate sequence data
from other sorts of information: simplicity of states and length variation.
Unlike complex anatomical features (e.g. limb or wing) that can express
themselves in a myriad of forms, nucleotides exhibit only four conditions.
Complexity and difference imply that states (e.g. presence/absence, or
conditions) are not comparable across characters. Nucleotide states, on the
other hand, are identical no matter where they occur. Nucleotide sequences
may also differ in length. These two aspects of molecular sequence data
remove the complexity and positional information so often used in
establishing primary homologies in anatomical systems.
5.3 Alignment and optimization

Two approaches have been developed to deal with the absence of preordained
homologies and analyze sequence data. On one hand, methods have been
devised to create the missing primary homology statements that are then
analyzed by standard techniques, broadly referred to as multiple alignment.
Traditionally, sequence data have undergone this pre-phylogenetic analysis
step to permit familiar procedures akin to those used with anatomical
characters. A second approach is to directly optimize sequence variation
during cladogram searching. This methodology requires no notions of primary
character homology or any global (i.e. topology-independent) homology
statements whatsoever (other than that the compared sequences themselves be
homologous). Direct optimization is also applicable to simple, serial
morphologies, for which characters and their states are similarly
constrained. For the sake of discussion here, the terms alignment and
optimization will refer to these alternate approaches.

When coupled with cladogram search, we are faced with a compound
NP-complete problem and all of our statements will be based on approximate
solutions.

Both alignment and optimization may be viewed as heuristic approaches to
solving this problem. Alignment accomplishes this based on static, global,
primary homology statements, whereas optimization techniques propose
cladogram-specific homology scenarios.

5.5 Alignment methods

As a heuristic solution, alignment decomposes the nested homology/search
procedure into two sequential problems. Length-variable sequences are
converted into a series of column vectors (primary homology statements)
through the insertion of gap characters (-) as placeholders that denote the
results of insertion/deletion events. Alignments minimize a cost function
(in the case of two sequences the cost to ‘edit’ one sequence into the
other) that is based on the relative costs of transformation events
(especially insertion/deletion, or ‘indel’, costs), which may or may not be
cladogram-based.
5.4 The problem
There are several components to the alignment
In computational terms, the problem of determin- process. These progress from pairwise alignment
ing the cost of a given cladogram reduces to the of two sequences to the exact solution for multiple
determination of the set of internal vertex (hypo- sequences and then to the heuristic methods
thetical ancestral) sequences such that the overall employed in real-world analyses.
cost is minimized. Whether expressed in terms of
alignment or optimization, the problem (known as
5.5.1 Pairwise alignment
the tree-alignment problem) is NP-hard (see Wang
and Jiang 1994); hence, we are very unlikely to Alignment of sequence pairs is the foundation
achieve exact solutions. NP-hard problems are of all more elaborate procedures. The problem,
members of a class of computational problems for simply stated, is to create the series of corres-
which there is no known polynomial time solution. pondences between the nucleotides in two seque-
These problems are often combinatorially ‘explo- nces via the insertion of gaps, such that the edit
sive’, with the size of the solution space expanding cost (the weighted sum of all events—insertions,
factorially. That is, when the sequences can vary in deletions, nucleotide substitutions—required to
length, even the determination of the cost of a convert one sequence into another) between the
single cladogram will be heuristic (for 10 sequen- sequences is minimized (or some other function
ces of length 5—an unrealistically small and well- optimized). Costs must be assigned to each type of
behaved case—there are 1.35 1038 homology event, or trivial, zero-cost alignments can result
schemes; Slowinski 1998). (e.g. indels costing zero and an alignment
Given that determining cladogram cost is heur- that places each nucleotide opposite a gap). The
istic, the transformation and homology statements first algorithmic solution to this form of string-
derived from the cladogram are heuristic as well. matching problem was proposed by Needleman
and Wunsch (1970) and is used throughout most alignment procedures (see Gusfield 1997 for more extensive discussion).

Consider two sequences ACGT and AGCT and alignment parameters of nucleotide-substitution cost equal to 1 (CostSubs) and indel cost equal to 10 (CostInDel). The algorithm follows a dynamic programming approach by solving a series of small, dependent sub-problems that implicitly examine all possible alignments. There are two components to the procedure. The first determines the cost of the best alignment (or alignments; there may be multiple solutions). This is often referred to as the wavefront update. The second component is the traceback, which yields the alignment itself (more-complex examples can be found in Phillips et al. 2000). Needleman and Wunsch described a maximization-of-identity algorithm, where here a minimization of difference is presented. The underlying principles are unchanged.

$$\mathrm{Cost}_{i,j} = \min\{\mathrm{Cost}_{i-1,j} + \mathrm{CostInDel},\ \mathrm{Cost}_{i,j-1} + \mathrm{CostInDel},\ \mathrm{Cost}_{i-1,j-1} + \mathrm{CostSubs}_{i,j}\}$$

The first part of the algorithm fills a matrix M of size $(n + 1) \times (m + 1)$ to align a pair of sequences a and b of length n and m respectively. Each cell (i,j) is the cost of aligning the first i characters of a with the first j characters of b (i.e. aligning $a_1 \ldots a_i$ and $b_1 \ldots b_j$). Each value is calculated using the previously aligned subsequences: that is, the cost of cell (i,j) will be

$$\min\{(i-1, j) + \text{indel},\ (i, j-1) + \text{indel},\ (i-1, j-1) + \text{align character } a_i \text{ and } b_j\}$$

or less formally, the minimum among

    aligning $a_1 \ldots a_{i-1}$ and $b_1 \ldots b_j$ and aligning character $a_i$ with a gap,
    aligning $a_1 \ldots a_i$ and $b_1 \ldots b_{j-1}$ and aligning character $b_j$ with a gap,
    aligning $a_1 \ldots a_{i-1}$ and $b_1 \ldots b_{j-1}$ and aligning character $a_i$ and $b_j$.

The additional first row and column (the reason for the +1 in the matrix dimensions) represents the alignment of a sequence with an empty string; that is, initial gaps. Each decision minimum is recorded, to follow the path that leads to the cost of aligning a and b; that is, the cost in cell (n,m) (Fig. 5.1).

In order to create the actual alignment between the sequences a traceback step is performed that proceeds back up and to the left of the matrix, keeping track of the optimal indels and substitutions performed in the matrix-update operations. The minimum cost path is followed back, where the best move is diagonal if the nucleotides of the sequences correspond, and the left and upwards moves signify indels (Fig. 5.1). The minimal cost alignment for these sequences (ACGT and AGCT) with the cost regime {indels = 10, substitutions = 1} is 2 with two base substitutions implied between the sequences (C ↔ G, and G ↔ C).

If a complementary cost scenario is specified, e.g. indels = 1 and substitutions = 10, a different optimal solution is found (Fig. 5.1, right). In this case as well, the minimum cost is two, but no substitutions are implied, only indels (2).
Left (indels = 10, substitutions = 1)      Right (indels = 1, substitutions = 10)

     -    A    C    G    T                      -   A   C   G   T
-    0   10   20   30   40                 -    0   1   2   3   4
A   10    0   11   12   13                 A    1   0   1   2   3
G   20   10    1   11   13                 G    2   1   2   1   2
C   30   20   10    2   12                 C    3   2   1   2   3
T   40   30   11   11    2                 T    4   3   2   3   2

Alignment (left):      Alignments (right):
A C G T                A C G - T        A - C G T
A G C T                A - G C T        A G C - T

Figure 5.1 Needleman and Wunsch (1970) alignment matrix tables for two cost scenarios. On the left, indel events cost 10 steps and nucleotide changes 1, while these are reversed on the right. Both cost scenarios yield minimum cost alignments of cost 2, although minimizing indels in the former (left) and nucleotide substitutions in the latter (right).
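The wavefront update and traceback described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of any program discussed in this chapter: the function name `needleman_wunsch` and its parameters are hypothetical, matching nucleotides are assumed to cost 0, and ties in the traceback are broken in a fixed order (diagonal first, then a gap in the second sequence, then a gap in the first), so only one of possibly several equally costly alignments is returned.

```python
def needleman_wunsch(a, b, indel, subst):
    """Minimum edit cost and one optimal alignment of sequences a and b.

    Hypothetical helper for illustration: matches cost 0, mismatches cost
    `subst`, and every gap position costs `indel`.
    """
    n, m = len(a), len(b)
    # (n + 1) x (m + 1) matrix; row 0 and column 0 hold the initial gaps.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel
    for j in range(1, m + 1):
        cost[0][j] = j * indel

    def pair_cost(i, j):
        return 0 if a[i - 1] == b[j - 1] else subst

    # Wavefront update: each cell is the minimum over its three predecessors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + indel,                 # a_i opposite a gap
                cost[i][j - 1] + indel,                 # b_j opposite a gap
                cost[i - 1][j - 1] + pair_cost(i, j),   # a_i opposite b_j
            )

    # Traceback: walk from cell (n, m) back to (0, 0) along a minimum path.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + pair_cost(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(top)), "".join(reversed(bottom))


for indel, subst in [(10, 1), (1, 10)]:
    print(indel, subst, needleman_wunsch("ACGT", "AGCT", indel, subst))
```

Under both cost regimes of Fig. 5.1 the sketch returns a cost of 2: with expensive indels it pairs ACGT directly with AGCT, and with expensive substitutions it returns one of the gapped alignments. Which gapped alignment is returned depends only on the arbitrary tie-breaking order chosen here.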
Furthermore, there are two equally optimal solutions differing in the placement of the gaps. This ambiguity comes from the equally costly paths found at matrix element 3,3 (of 0,0 to 4,4). The non-unique nature of such solutions is a frequent property of alignments and can have dramatic effects on phylogenetic conclusions (Wheeler 1994).

5.5.2 Exact multiple alignment

The pairwise procedure can be generalized in a straightforward fashion to align more than two sequences. The matrix would have an axis for each sequence (l sequences would require l dimensions) and there would be $2^l - 1$ paths to each cell representing all the possible combinations of gaps and substitutions possible (seven in the case of three sequences). These two factors add enormously to the calculations, making true multidimensional alignments unattainable for real data sets.

An additional complexity arises in analyses of data sets with more than three sequences. The cost calculations at each cell may (as Sankoff and Cedergren 1983 suggested) be based on the cladogram of relationships of the sequences. If this is known, or at least specified a priori, the cell cost can be calculated directly. If, however, the cladogram is unspecified, a search would be performed for each cell, or the entire multidimensional alignment repeated for multiple (potentially all) cladograms. The immense computational burden of exact multiple alignment ensures that heuristic solutions are used in nearly all real-world cases.

5.5.3 Heuristic multiple alignment

Current heuristic procedures are similar in that many attempt to render multiple alignment tractable by breaking down simultaneous n-dimensional alignments into a series of manageable pairwise alignments related by a ‘guide tree’ (in the parlance of Feng and Doolittle 1987). These differ in the techniques used to generate the guide tree and conduct the pairwise alignments at the guide tree nodes. Furthermore, the procedures may or may not be explicitly linked to optimality criteria (Fig. 5.2).

By far the most commonly used heuristic multiple-alignment implementation is CLUSTAL, mainly because it is fast and relatively easy to use. Many others are freely available, however, and take different approaches to the problem. Several of these approaches are illustrated in this sample. More-complete lists can be found at http://pbil.univ-lyon1.fr/alignment.html and more comparisons in Phillips et al. (2000).

CLUSTAL (Higgins and Sharp 1988 et seq.) creates a single multiple alignment based on a single guide tree. A neighbor joining tree (Saitou and Nei 1987) is calculated from the pairwise alignments via a ‘corrected’ distance formula. This tree is used as a guide tree for progressive pairwise alignment of terminal sequences and internal consensus sequences (a down-pass). A second (up) pass resolves the placement of gaps in internal and ultimately observed sequences. There is no optimality value associated with a CLUSTAL alignment.

TREEALIGN (Hein 1989a, b) also produces a single multiple alignment based on a single guide tree, but that guide tree is constructed (with some tree refinement) as the alignment is created. A parsimony step is included as part of the tree-reconstruction procedure. Although alignments are not searched as such, the generation of the guide tree examines multiple alternatives. A final, single multiple alignment is generated with an attached parsimony score, but no comparisons to other complete alignments are made.

DIALIGN (Morgenstern et al. 1996) differs from other methods in looking for alignments of contiguous gap-free fragments of DNA that may have mismatches. This contrasts with the approach that attempts to align each position in a sequence. No gap penalty is employed. The idea behind this method is to create complete alignments by stitching together locally similar sequences that may be separated by highly divergent regions. An optimal alignment is one that maximizes the weighted sum of the matches in the smaller segments. Alignments can be compared on this basis. This method makes no reference to cladograms or trees whatsoever.
[Figure 5.2 diagram: alternative guide trees for the sequences AGT, AT, and ATC, each yielding a multiple alignment; a tree search is run on each alignment, the resulting tree lengths are compared, and the best alignment is chosen.]
Figure 5.2 General heuristic multiple sequence alignment. Top, a guide tree is specified to direct a series of pairwise alignments which
incrementally include sequences as the guide tree is traversed from tips to root (e.g. CLUSTAL). Usually, some form of consensus sequence is created
at the internal nodes. Additional gaps are inserted as the tree is traversed a second time from root to tips. When an optimality criterion is
employed (e.g. MALIGN) multiple guide trees are created and the derived alignments compared by some metric (such as phylogenetic tree cost).
COFFEE (Notredame et al. 1998, 2000) behaves as a ‘wrapper,’ using a genetic algorithm to optimize multiple alignments based on consistency with the pairwise alignments of the same sequences. Any pairwise alignment procedure can be used under the COFFEE optimality function.

The following alignment methods involve ‘search’ procedures. In MALIGN and the method of Hein et al. (2003), tree searches are conducted to produce multiple alignments, whereas POY searches for optimal cladograms directly and can generate alignments post facto for the optimal cladogram.

MALIGN (Wheeler and Gladstein 1994) uses multiple guide trees to generate a diversity of multiple sequence alignments, choosing the best on the basis of the parsimony score (indels included) of the most parsimonious cladogram derived from that alignment. Guide trees are searched and multiple alignments created for each candidate guide tree. Each alignment is used as the basis for a heuristic cladogram search (indels weighted and included). The cost of the most parsimonious cladogram is attached to the alignment as its optimality score. MALIGN will output multiple multiple-alignments if they are equally optimal.
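To make the notion of a parsimony score with indels included more concrete, the sketch below scores a fixed alignment on a given cladogram, column by column, using a standard Fitch-style down-pass for unordered characters. It is an illustration only and not MALIGN's code: the function names, the nested-tuple tree encoding, the terminals W to Z, and their sequences are all made up for the example, the gap is simply treated as a fifth state, and every change costs 1.

```python
def fitch_length(tree, column):
    """Number of changes one alignment column requires on a rooted binary tree.

    `tree` is a nested tuple of terminal names, e.g. (("W", "X"), ("Y", "Z"));
    `column` maps each terminal name to its state ('-' counts as a state).
    """
    def down_pass(node):
        if isinstance(node, str):            # terminal: its observed state
            return {column[node]}, 0
        left_states, left_steps = down_pass(node[0])
        right_states, right_steps = down_pass(node[1])
        shared = left_states & right_states
        if shared:                           # agreement: no extra step
            return shared, left_steps + right_steps
        return left_states | right_states, left_steps + right_steps + 1

    return down_pass(tree)[1]


def alignment_cost(tree, alignment):
    """Sum of column lengths for an alignment given as {terminal: aligned sequence}."""
    width = len(next(iter(alignment.values())))
    return sum(
        fitch_length(tree, {name: seq[k] for name, seq in alignment.items()})
        for k in range(width)
    )


# Hypothetical four-terminal alignment, invented for this example.
toy_alignment = {"W": "ACGT", "X": "AC-T", "Y": "AGGT", "Z": "AGGT"}
for tree in [(("W", "X"), ("Y", "Z")), (("W", "Y"), ("X", "Z"))]:
    print(tree, alignment_cost(tree, toy_alignment))
```

On this toy alignment the cladogram ((W X)(Y Z)) requires 2 steps and ((W Y)(X Z)) requires 3. In a MALIGN-like scheme, the cost of the best cladogram found for an alignment would be the score attached to that alignment.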
Hein et al. (2003) employ the Thorne–Kishino–Felsenstein (TKF) model (Thorne et al. 1991) for likelihood-based multiple alignments related by a tree. The algorithm employed is based on Sankoff (1975) for likelihood. Currently, the implementation (designed for demonstration purposes) can manage a few sequences (ca. 7) but could well be extended to larger data sets.

POY Implied Alignment (Wheeler 2003a; Wheeler et al. 2003) is not an alignment program, but searches for parsimonious cladograms directly (see the next section). A multiple alignment can be generated, however, from the transformation series implied by the optimal cladogram. This is not a multiple alignment in the sense of other methods, but rather is inextricably linked to the cladogram from which it was derived (Wheeler 2003a).

5.6 Optimization methods

In contrast to alignment procedures, optimization methods skip the alignment step and proceed directly to the determination of cladogram cost. This is achieved by focusing on determining optimal hypothetical taxonomic unit (HTU) sequences at internal tree nodes. In doing so, homology schemes are created for each cladogram uniquely, and for cladogram costs based on them. Multiple-alignment methods create a single alignment upon which all cladograms are diagnosed. Optimization methods create individualized homology schemes for each cladogram.

5.6.1 Exact solutions

As mentioned earlier, the determination of the lowest cost for a single cladogram depends on the lowest cost assignment of HTU sequences, and this is an NP-hard problem (Wang and Jiang 1994). Exact solutions, therefore, will not be available generally.

Sankoff (1975) proposed a recursive procedure that would calculate the minimum-cost cladogram exactly. This method requires a number of steps proportional to $(2n)^m$ where n is the average length of the sequences and m the number of sequences for a given cladogram. An alternate, simple-minded exhaustive approach would be to simply generate a list of all possible sequences, determine the edit cost between each pair (via some procedure akin to that of Needleman and Wunsch 1970), and try each possible sequence at each internal cladogram node by dynamic programming (Sankoff and Rousseau 1975). This type of explicit enumeration could be accomplished by extending the candidate set of sequences employed by search-based optimization (Wheeler 2003b) to include all possible sequences. Since this would entail an explosively increasing number of sequences this technique would become untenable rapidly. Some sort of branch-and-bound technique could be applied to this search given an initial upper-bound estimate, but it is unclear whether much additional headway can be made towards exact solutions.

5.6.2 Heuristic solutions

The operational goal of heuristic optimization procedures is to determine a set of HTU sequences that minimizes the overall cladogram length (= edge weight). Two general sorts of approach have been proposed based on attempts to estimate these internal vertex sequences using known sequences or on a search for them within the world of possible sequences.

The first-estimation heuristic was proposed by Sankoff et al. (1973) and Sankoff (1975). Given the high dimensionality of the exact recursive solution proposed by Sankoff (1975), a three-dimensional local-optimum heuristic was proposed. This would break the problem down into a series of single-point estimations surrounded by three known or previously estimated sequences (Fig. 5.3; as opposed to the two-point problem reduction in many heuristic alignment approaches). At the time, the method was too time-consuming for real data sets.

Wheeler (1996, 2002) proposed a two-dimensional heuristic (optimization alignment, later called direct optimization), which though more approximate than the three-dimensional approach, was more rapid (Fig. 5.4a). Later, Wheeler combined the Sankoff method with direct optimization and incremental character optimization (Gladstein 1997) in iterative-pass optimization,
[Figure 5.3 diagram: an internal node X connected to Seq i, Seq j, and Seq k; X minimizes d(X, seq i) + d(X, seq j) + d(X, seq k).]

Figure 5.3 Median-state heuristic for n-dimensional optimization proposed by Sankoff (1975). The state of X (which could be an entire sequence) is that which minimizes the summed distances to the nodes which connect to it.

[Figure 5.4 diagram: two cladograms, (a) and (b), for the terminal sequences AA, ATTA, AAAAA, and AAA, with reconstructed internal sequences and the implied indels and substitutions marked on the branches.]

Figure 5.4 Estimation methods. (a) Direct optimization (Wheeler 1996) results in a cladogram of cost 8 for the input sequences AA, ATTA, AAAAA, and AAA when all events (indels and nucleotide substitutions) are equally costly. (b) Iterative-pass optimization (Wheeler 2003c) improves on this by 1 step. The horizontal bars signify indels and Δ represents nucleotide substitutions whose location may be ambiguous.

[Figure 5.5 diagram: three cladograms for the terminal sequences AA, AG, AGAG, and AAGG. (a) 3 indels + 1 substitution = 4; (b) 4 indels + 1 substitution = 5; (c) 4 indels + 0 substitutions = 4.]

Figure 5.5 Search-based methods. (a) Direct optimization (Wheeler 1996) for the sequences AG, AA, AGAG, and AAGG. When all events are equally costly (indels and nucleotide substitutions) the cladogram has a cost of 4. (b) Fixed-state optimization (Wheeler 1999) limits the HTU sequences to those present in the terminals and results in a cost of 5. (c) Search-based optimization (Wheeler 2003b) allows the addition of HTU states (here AAG) and thereby reduces the cladogram cost to 4.

which improved cladogram-length calculations and can be used for larger numbers of sequences (Fig. 5.4b; Wheeler 2003c).

Sequence-search heuristics first appeared with fixed-state optimization (Wheeler 1999). The fixed-state method limited the possible set of HTU sequences to those observed in terminal taxa, which are then diagnosed via dynamic programming based on a matrix of edit costs between the sequences. Given this constraint, less-satisfactory lower bounds on cladogram length are usually found (Fig. 5.5a; when sequences differ greatly in length this may not be true). Since the method is not calculating ancestral sequence states but simply optimizing states, cladogram optimization time, after initialization, is independent of sequence length. As the number of sequences increases, the number of potential sequence states rises as well, both improving the cladogram cost estimation and increasing the cost of computation of a given cladogram (roughly $m^3$ for m sequences) (Fig. 5.5b).

Search-based optimization (Wheeler 2003b) relaxes the strict limit on sequence states by the addition of heuristically chosen sequences (Fig. 5.5c). Through the increase of the state set at
the cost of execution time, progressively lower bounds can be found until further enlargement of the set is unproductive. The set could be made all-inclusive with an exact solution the result (but at great time cost).

5.7 Comparison of alignment and optimization

Although the goals of alignment (at least in phylogenetics) and optimization are the same (to find minimum-cost cladograms), the approaches are quite different. Alignment methods seek to find a single putative homology scheme upon which all cladograms are evaluated. Optimization methods perform this operation for each evaluated cladogram. As such, cladogram searches based on alignment methods are likely to be consistently faster than optimization approaches since the steps involved in determining the cost of a cladogram from a fixed alignment are much less burdensome. Optimization methods, however, are likely to find lower-cost cladograms (Wheeler 1996; Giribet et al. 2002; T. Grant, pers. comm.) and execution time comparisons should include the time consumed by alignment.

This can be illustrated by examining a simple set of four sequences (Fig. 5.6). There is not necessarily one globally optimal alignment. An alignment may be optimal for a particular cladogram (à la Sankoff and Cedergren) but any cladogram search based on such an alignment may well overlook other equal or lower-cost solutions. Optimization procedures, by examining the cladograms themselves, do not suffer this shortcoming (direct optimization as implemented in POY (Wheeler et al. 2003) finds both cladograms).

(a)  I   GGGG        (b)  I   GGGG        (c)  I   GGGG
     II  -GGG             II  G-GG             II  GGG-
     III GAAG             III GAAG             III GAAG
     IV  -GAA             IV  GAA-             IV  GAA-
     one cladogram,       one cladogram,       ((I III)(II IV)) = 6 steps
     6 steps              6 steps              ((I II)(III IV)) = 6 steps

Figure 5.6 Simple alignments of four sequences GGGG, GGG, GAAG, and GAA. The alignments in (a) and (b) result in minimal cost, but different cladograms. The alignment in (c) yields both.

5.7.1 Evaluation

Given the identical goals of alignment and optimization, how can these somewhat competing methods be evaluated? Speed and effectiveness are two obvious criteria. Speed would be measured straightforwardly as the time required to complete the combined alignment/cladogram-search operation versus that for the optimization-based cladogram search. The determination of cladogram cost for fixed alignments can be accomplished extremely efficiently (Goloboff 1994, 1998b) even for fairly general dynamic-programming characters (Goloboff 1995, 1996a). Implementations such as TNT (Goloboff et al. 2002) are able to evaluate many tens of thousands of cladograms (containing hundreds of taxa) per second. Multiple alignment implementations that generate a single multiple alignment (such as CLUSTAL) can create an alignment of a thousand nucleotides for a hundred taxa in a few minutes. Multiple-alignment procedures that evaluate many candidate multiple alignments (such as MALIGN) will absorb much more time. Such a search using dynamic homology optimization (at least under present implementations) could take yet longer.

An example (for illustrative purposes, not exhaustive by any means) is provided by the analysis of 100 mollusk 18S rRNA sequences (G. Giribet, personal communication). The alignment programs CLUSTAL and MALIGN were used and compared to optimization-based POY. CLUSTAL produced alignments most quickly and MALIGN most slowly. When comparing the approach of CLUSTAL and POY, CLUSTAL was faster by 20% (without cladogram search and when minimal POY options were specified), but the multiple alignment (really an implied alignment) produced by POY was 30% less costly in terms of parsimony (see Table 5.1). Cladogram searching would add time to the total solution of the alignment methods, but this would be a small premium
in comparison with the alignment time. This could, however, narrow the CLUSTAL vs. POY execution-time difference.

Table 5.1 Performance of CLUSTAL, MALIGN, and POY on mollusk test set. The analyses were performed with all transformations (indels included) costing 1. The data set consisted of 100 mollusk 18S DNA sequences of approximately 1000 bp (G. Giribet, personal communication). All runs were on a Pentium M computer at 1.7 GHz under Linux. Runs for MALIGN and POY specified indel cost as 1; CLUSTAL was run twice, once under the default values (Default) and a second time specifying all gaps and transformations as 1 (1:1:1:1). The POY run with TBR branch swapping yields two equally costly cladograms. The arrows denote the cost of the implied alignments when analyzed using NONA (Goloboff 1993b). NONA diagnosed the POY cladograms as the cost found by POY, but was able to find more-parsimonious solutions using the implied alignments.

Method                     Options         Execution time (s)    Cost
CLUSTALW                   1:1:1:1         688                   11 999
CLUSTALW                   Default         722                   10 642
MALIGN                     ‘Build’ only    26 270                11 790
POY + Implied Alignment    ‘Build’ only    920                   7 989 → 7 970
POY + Implied Alignment    TBR             134 470               2 @ 7 690 → 7 684

The second criterion, effectiveness, favors optimization methods. Given that optimization methods are creating homology schemes specifically tailored to be optimal for each cladogram examined, it is only logical that this approach should result in less-costly (or higher-likelihood, for that matter) cladograms. This has been shown several times (Wheeler and Hayashi 1998; Giribet et al. 2002; T. Grant, pers. comm.) and in the example above. Decreases in cladogram cost of 10% over alignment methods are not unexpected. The more length variation present in the sequences, the more opportunity there is for dynamic homology to find more-effective topology-specific solutions.

5.7.2 Interrelationship

There is a connection between multiple alignment and cladogram optimization. Transformation series inherent in a cladogram can be extracted and represented as an implied alignment (Wheeler 2003a). Such an implied alignment contains all the synapomorphy statements and transformation events required by the topology under an optimization approach. As such, implied alignments resemble standard multiple alignments, but are actually derived from the analysis of a specific cladogram rather than serving as the basis for a search. Given the dependence of this sort of alignment on a specific cladogram, the object created is not necessarily fair to topologies other than its basis cladogram. Each of those topologies would be tested best by its own implied alignment. Such a unique-alignment procedure is the approach optimization methods bring to phylogenetic analysis.

An effect of this is seen in the calculations of Bremer (1994) support. Support values calculated on the basis of a global alignment can overestimate support compared to those based on dynamic homology. Given the specific homology schemes created by optimization methods, alternate cladogram lengths should be lower than (or at worst equal to) those based on an alignment that is optimal for some other cladogram. A global alignment will therefore tend to inflate the differences in cladogram lengths, and hence Bremer values.

Alignment and optimization can be used in tandem to reduce execution time in optimization-based searches. In essence, an implied alignment (or any alignment for that matter) represents a static approximation of dynamic homology. Given that an implied alignment is generated for a specific cladogram, it can be used as the basis for rapid cladogram cost evaluations among similar topologies. The implied alignment is used to identify candidate cladograms quickly for further, more time-consuming, analysis. If a cladogram is found to be superior to previously identified solutions, a new implied alignment is created based on the new topology and the process continued. This approach can accelerate searches
by a factor of four or more depending on the problem at hand (Wheeler 2003a).

5.8 Conclusion

Traditionally, alignment has been used to convert data without inherent putative homology statements into data that have them. This step is operationally logical, but, given the ultimate goal of optimal cladograms, unnecessary. The criticism of optimization-based methods as lacking primary homology is largely based on this historical exercise. Clearly, a priori notions of homology (at least at the nucleotide level) are not logical or computational requisites of phylogenetic analysis. Criticisms of the optimization approach need to be based in effectiveness and logic, not on appeals to tradition.

As such, multiple alignment does not have separate standing in phylogenetic analysis. It is one approach to solving a complex, NP-complete problem. In comparison to optimization-based procedures, it may be fast, but it is approximate. In essence, alignment is a heuristic, and not a very effective one.

5.9 Acknowledgments

I thank Vic Albert for initiating and managing this effort, Gonzalo Giribet for use of unpublished sequences, Vic Albert, Lorenzo Prendini, Andres Varon, and an anonymous reviewer for helpful critique of the manuscript, NSF and NASA for support, and Steve Farris for decades of guidance.
CHAPTER 6
Parsimony and the problem of inapplicables in sequence data
Jan E. De Laet
‘I don’t know what you mean by “glory,”’ Alice said. Humpty Dumpty smiled contemptuously. ‘Of course you don’t–till I tell you. I meant “there’s a nice knock-down argument for you!”’ ‘But “glory” doesn’t mean “a nice knock-down argument,”’ Alice objected. ‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean–neither more nor less.’
(Carroll 1872, chapter VI)
6.1 Introduction

About 10 years ago, Maddison (1993; see also Platnick et al. 1991) drew attention to problems that can arise in parsimony analyses when data sets contain characters that are not applicable across all terminals. Examples of such characters are tail color when some terminals lack tails, or positions in DNA sequences in which gaps are present. Maddison (1993) examined various ways of coding such characters for various parsimony algorithms and concluded that no general solution was available. Since then, the problem of inapplicables has been rediscussed repeatedly (e.g. Lee and Bryant 1999; Strong and Lipscomb 1999; Seitz et al. 2000), but Maddison’s conclusion still holds.

Farris (1983), focusing on regular single-column characters as classically used in phylogenetic analysis, characterized parsimony as a method that maximizes explanatory power in the sense that most-parsimonious trees are best able to explain observed similarities among organisms by inheritance and common ancestry. This led De Laet (1997; see also De Laet and Smets 1998) to formulate parsimony analysis as two-item analysis. In this view, parsimony maximizes the number of observed pairwise similarities that can be explained as identical by virtue of common descent, subject to two methodological constraints: the same evidence should not be taken into account multiple times, and the overall explanation must be free of internal contradictions.

Here, I examine how this formulation can be used to deal with the problem of inapplicables. More specifically, I deal with the problem of inapplicables in sequence data, a harder and more general problem than most cases of inapplicability that Maddison (1993) had in mind. The review of parsimony analysis in the first section provides the basis for discussing the analysis of sequence data in the second section. The basic idea of the whole chapter is to explore the ramifications of the conceptual framework of Farris (1983) beyond the realm of single-column characters. This was in part prompted by the double observation that several authors seem to be using isolated elements of that paradigm when discussing methods for sequence analysis (see, e.g., Frost et al. 2001; Simmons 2004), while, at the same time, no coherent discussion of those ideas as applied to sequence data is available.

6.2 Parsimony analysis as two-item analysis

Some notes on terminology are appropriate first. Take a simple term such as ‘autapomorphy’.
Originally, autapomorphies were defined as ‘apomorphous features characteristic for a particular monophyletic group (present only in it)’ (Hennig 1966, p. 90). In addition to this original meaning, a more restrictive usage that reserves the term for ‘novelties that are coded as unique in a data set’ (Kluge 1989, p. 9) is widespread.

Consider the data set of Fig. 6.1 and its most-parsimonious tree (out1 out2 (A ((B C) (D (E F))))) (see Fig. 6.2). Under Hennig’s original definition, the first seven characters all provide autapomorphies. As an example, character c4 has apomorphous state 0 for monophyletic group (B C), and that state does not occur outside that clade. Under the more restrictive definition only character c7 is autapomorphic. Obviously, questions as to whether autapomorphies should be taken into account or not when calculating the consistency index of a data set on a tree (e.g. Yeates 1992) take on an entirely different meaning depending on the way in which the term ‘autapomorphy’ is used.

             Characters
Terminals    c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
out1          0   0   0   1   0   0   0   1   0   0
out2          0   0   0   1   0   0   0   0   0   0
A             1   0   0   1   0   2   0   0   1   1
B             1   1   1   0   0   1   1   0   1   0
C             1   1   1   0   0   1   0   0   1   0
D             1   1   1   1   1   2   0   0   1   1
E             1   1   1   1   1   2   0   0   0   1
F             1   1   1   1   1   2   0   1   0   1

Figure 6.1 A data set with 10 unordered characters for eight terminals. Terminals out1 and out2 are interpreted as outgroups.
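The two usages of ‘autapomorphy’ can be contrasted mechanically on the data of Fig. 6.1. The following Python sketch is an illustration only and is not part of the original analysis. The names DATA, INGROUP_TREE, characterizes_clade, and coded_as_unique are hypothetical, and the Hennig-style test is a simplification: it assumes that a state found in all members of one ingroup clade and nowhere else is apomorphous for that clade, without examining polarity separately.

```python
# Data of Fig. 6.1: one string of state codes (c1..c10) per terminal.
DATA = {
    "out1": "0001000100",
    "out2": "0001000000",
    "A":    "1001020011",
    "B":    "1110011010",
    "C":    "1110010010",
    "D":    "1111120011",
    "E":    "1111120001",
    "F":    "1111120101",
}

# Ingroup part of the most-parsimonious tree as nested tuples (hypothetical encoding).
INGROUP_TREE = ("A", (("B", "C"), ("D", ("E", "F"))))


def collect_clades(node, found):
    """Add the terminal set of every clade at or below `node` to `found`."""
    if isinstance(node, str):
        members = frozenset([node])
    else:
        members = frozenset().union(*(collect_clades(child, found) for child in node))
    found.add(members)
    return members


CLADES = set()
collect_clades(INGROUP_TREE, CLADES)


def state_sets(col):
    """Map each state of character `col` (0-based) to the terminals that show it."""
    sets = {}
    for terminal, row in DATA.items():
        sets.setdefault(row[col], set()).add(terminal)
    return sets


def characterizes_clade(col):
    """Simplified Hennig-style test: a state confined to exactly one ingroup clade."""
    return any(frozenset(terms) in CLADES for terms in state_sets(col).values())


def coded_as_unique(col):
    """Restrictive usage: a state that is scored for exactly one terminal."""
    return any(len(terms) == 1 for terms in state_sets(col).values())


hennig_style = [f"c{c + 1}" for c in range(10) if characterizes_clade(c)]
restrictive = [f"c{c + 1}" for c in range(10) if coded_as_unique(c)]
print(hennig_style)  # c1 through c7
print(restrictive)   # c7 only
```

Under these assumptions the sketch recovers the counts given in the text: seven characters under the original definition, one under the restrictive definition.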
[Figure 6.2: tree diagrams for panels (a) to (f); graphics not reproduced.]
Figure 6.2 Parsimony analysis of the data of Fig. 6.1. (a) The most-parsimonious explanation of the data requires 14 steps. (b) To come to hypotheses of
synapomorphy and monophyly in the ingroup, the ingroup is rooted using the branch that leads to the outgroups (note that this procedure does not imply
such hypotheses outside the ingroup). (c, d) Two alternative optimal explanations of character c10 on the most-parsimonious tree. (e) A suboptimal
explanation of character c10 on the most parsimonious tree. (f) An optimal explanation of character c10 on a suboptimal tree.
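The length of 14 steps reported in the caption can be checked with a short Python sketch. It is a hedged illustration, not the procedure used to produce the figure: it applies the standard Fitch counting pass for unordered characters, and it assumes that the basal trichotomy of the tree may be resolved by placing the root on the branch between outgroups and ingroup, which does not change the length for unordered characters. The names DATA, TREE, fitch, and character_steps are hypothetical.

```python
# Data of Fig. 6.1: one string of state codes (c1..c10) per terminal.
DATA = {
    "out1": "0001000100",
    "out2": "0001000000",
    "A":    "1001020011",
    "B":    "1110011010",
    "C":    "1110010010",
    "D":    "1111120011",
    "E":    "1111120001",
    "F":    "1111120101",
}

# Most-parsimonious tree, rooted between outgroups and ingroup (assumed resolution).
TREE = (("out1", "out2"), ("A", (("B", "C"), ("D", ("E", "F")))))


def fitch(node, states):
    """Return (candidate state set, number of steps) for the subtree at `node`."""
    if isinstance(node, str):
        return {states[node]}, 0
    (left_set, left_steps), (right_set, right_steps) = (
        fitch(child, states) for child in node
    )
    steps = left_steps + right_steps
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra step is needed here.
        return common, steps
    # No shared state: one transformation is required between the subtrees.
    return left_set | right_set, steps + 1


def character_steps(col):
    """Minimum number of steps for character `col` (0-based) on TREE."""
    states = {terminal: row[col] for terminal, row in DATA.items()}
    return fitch(TREE, states)[1]


steps = [character_steps(c) for c in range(10)]
print(steps)       # [1, 1, 1, 1, 1, 2, 1, 2, 2, 2]
print(sum(steps))  # 14
```

Characters c6, c8, c9, and c10 each need two steps on this tree; the other six need one.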
Paraphrasing Farris (1983, p. 8), I share Humpty Dumpty’s disdain for arguing definitions as such. Therefore I shall not discuss and evaluate the pros and cons of various possible meanings of the terms that I employ, nor indicate alternative terms with identical or similar meanings. But as the above example shows, it is important to make intended meanings clear, so in this section I shall explicitly point out my usages of terms.

At the same time, this process will provide an interlocked set of concepts that will allow a clear discussion of parsimony and inapplicables in the next section, and help to distinguish terminological issues from more substantial argument. To preempt any objection that the conclusions hinge on major redefinitions of familiar terms, I shall indicate how my usages are rooted in existing literature. This, however, should not be taken to imply that these usages are always strictly in line with those references: whenever some existing term is close enough in spirit to the intended use (as, e.g., Kluge’s use of Hennig’s autapomorphy above), I shall adopt existing terminology rather than propose a new term.

6.2.1 Characters and character analysis

Conceptually, a cladistic analysis consists of two main activities (see, e.g., Rieppel 1988; de Pinna 1991; Rieppel and Kearney 2002). The first comprises empirical observation, leading to delimitation of characters and character states, and to a data set in which those characters are scored for the terminals in the analysis. This is the activity of perceiving similarity and coding it into characters and data sets, to which I shall refer as character analysis (Kluge and Farris 1969, pp. 9–10; see also Rieppel and Kearney 2002, p. 60). The second activity takes data sets as input, identifies their most-parsimonious hierarchic arrangement(s), and uses the resulting cladogram(s) as a basis for phylogenetic inference. I shall refer to this as parsimony analysis (Farris 1983, pp. 10–12; see also later).

Character analysis and parsimony analysis stand in a continuous relationship of reciprocal illumination, at different levels (e.g. Rieppel 2003, p. 182; see also Hennig 1950, p. 26). As an example, the selection of terminals that will be included in a data set is in part guided by existing phylogenetic hypotheses. Likewise, empirical work that results in new characters that are added to data sets can lead to cladograms with new or refined hypotheses of phylogenetic relationships. These, in turn, can point to characters that are highly incongruent with the general pattern and that may therefore be worth additional scrutiny. If an empirical basis can be found for a reinterpretation of such characters or their states, the data set can be adapted accordingly (see, e.g., Farris 1983, p. 10).

At a given point in this process of continuous refinement, consider an individual character such as c4 in the data set of Fig. 6.1. From the point of view of character analysis this character is a statement about a feature that comes in two states, coded 0 and 1, such that state 0 is observed in terminals B and C and state 1 in all other terminals. Theoretically, such a character expresses the hypothesis that the observed feature carries evidence on the genealogical relationships among the taxa that are involved. This directly limits characters and character states for phylogenetic analysis to features that are inheritable. A thought-provoking discussion of this seemingly trivial observation can be found in Freudenstein et al. (2003).

Beyond this, however, little more specific can be said other than that a character state as observed in different terminals ‘must be sufficiently similar to be called the same [ . . . ] at some level of taxonomic generality’ (Kluge 1997a, p. 89; the quote refers to derived states but the statement is valid in general), an observation that also holds for the character as a whole (see, e.g., Platnick 1979, p. 542; Jenner 2004, p. 301). For morphological and anatomical features, the criteria of composition, conjunction, ontogeny, and topography provide perspectives that can serve to evaluate if such sufficiency holds in particular cases (Kluge 1997a). Of those, topography or topological relationships are often considered to be the fundamental criterion (e.g. Rieppel 1988; de Pinna 1991; Hennig 1966, pp. 93–94; see also Remane 1952, pp. 31–66).

As discussed extensively by Rieppel and Kearney (2002, in the context of anatomy; see also Jenner 2004), care must be taken to give similarity statements as expressed in characters an observational basis. In order to do so one has to rely, however, unavoidably on background knowledge,
and there is in principle no limit to the degree of background knowledge that can be incorporated in a character (Rieppel and Kearney 2002, p. 265). So even in this specific and restricted context of erecting character hypotheses for cladistic analysis, the concept of similarity unavoidably retains some elusiveness. This notwithstanding, similarity assessments as expressed in characters and their states, in the theoretical framework as just discussed, are the empirical basis on which further phylogenetic inference is built.

6.2.2 Single-character phylogenetic inference

If no other comparative data were available for the terminals that are involved, a character such as c4 would constitute a data set on its own. It is a useful exercise to subject such a minimal data set to parsimony analysis. Within the constraint of terminal sampling, this leads to the following inferences: (1) the feature arose in a common ancestor of these terminals, from which they inherited it; (2) differentiation into two states occurred at a later stage; (3) for each state, the terminals with that state are only connected through ancestors that have that same state. These inferences do not yet include a polarity statement for which state is considered apomorphic and which plesiomorphic.

The apomorphy/plesiomorphy pair of terms is defined as follows: for a given evolutionary transformation, the condition or state from which the transformation started is plesiomorphic or primitive and the condition after the transformation apomorphic or derived (Hennig 1966, p. 89). As discussed by Hennig (1966, p. 93), coming to an hypothesis of features that are involved in such a transformation on the one hand and deciding on the evolutionary direction of such a transformation on the other are entirely different questions. The inclusion of outgroups in data sets is arguably the most general and least assumption-laden way to address the latter question.

Roots and outgroups

In general, when studying the phylogenetic relationships among a group of terminals, one assumes that these are part of a monophyletic group at some level of inclusiveness, meaning that they share a common ancestor that they do not share with terminals outside that group (Hennig 1966, pp. 73–74; see Farris 1991 for a review of this and related terms). The terminals that are assumed to be part of the monophyletic group are called ingroup terminals and are collectively referred to as the ingroup. Terminals outside that group are called outgroup terminals or outgroups for short.

When outgroups are included in a data set, they can be used to root the ingroup after the globally most-parsimonious arrangements of the data have been identified (Farris 1972, p. 657; see Figs 6.2a and 6.2b for an example). In the ingroup, hypotheses of relative apomorphy and plesiomorphy and of the direction of transformations then directly follow (Farris 1982a; see Figs 6.2c and 6.2d for some examples). This is the procedure that is now almost universally used to root ingroups and polarize characters, and it is mostly referred to as the outgroup method or the outgroup criterion (see, e.g., Farris 1979, p. 511). Confusingly, these and similar labels were also used in a series of papers in the 1980s for a series of methods of prior character polarization that are fundamentally different and mostly no longer in use. A historical account and a discussion of these methods can be found in Nixon and Carpenter (1993). The precise way in which hypotheses on character polarity come about does not affect the argumentation in this paper, so without loss of generality the discussion is restricted to outgroups.

In a data set that has only one character, as above, the general use of outgroups as just described becomes simplified because the best tree for the data set coincides with the structure of its single character. In the above example, the outgroup hypothesis could be the assumption that terminals A through F (the ingroup) share a most recent common ancestor that is not shared with terminals out1 and out2 (the outgroups). Observing that state 1 of character c4 is present in the outgroups as well as in the ingroup, it follows that state 1 is plesiomorphic in the ingroup; that state 0 is apomorphic in that same group; and that (B C) is a monophyletic subgroup of the ingroup.
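The single-character inference for c4 can be written out as a small Python sketch. It is offered as an illustration under stated assumptions rather than as a general polarization method: the function polarize and the names C4 and OUTGROUPS are hypothetical, and the helper deliberately declines to answer (returns None) unless the character is binary and all outgroups show the same state.

```python
# Character c4 of Fig. 6.1, scored per terminal.
C4 = {"out1": 1, "out2": 1, "A": 1, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
OUTGROUPS = {"out1", "out2"}


def polarize(character, outgroups):
    """Polarize a binary character by its outgroup state.

    Returns (plesiomorphic state, apomorphic state, putative monophyletic group),
    or None when this simple sketch cannot decide.
    """
    out_states = {s for t, s in character.items() if t in outgroups}
    all_states = set(character.values())
    if len(out_states) != 1 or len(all_states) != 2:
        return None  # outgroups disagree, or the character is not binary
    (plesiomorphic,) = out_states
    (apomorphic,) = all_states - out_states
    # The apomorphic state is absent from the outgroups, so this group is
    # made up of ingroup terminals only.
    group = sorted(t for t, s in character.items() if s == apomorphic)
    return plesiomorphic, apomorphic, group


print(polarize(C4, OUTGROUPS))  # (1, 0, ['B', 'C'])
```

The result is conditional on the outgroup hypothesis and on the coding of the character, exactly as the inference in the text is.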
Outgroups do not always lead to such unambiguous single-character inferences. An example is character c6, where (A D E F) and (B C) could both be monophyletic; or, alternatively, either could be paraphyletic with the other monophyletically nested in it. In addition, contradictions can arise between a character hypothesis and the outgroup hypothesis, even with binary characters. An example is character c8: the two following statements, derived from the character, contradict the outgroup hypothesis: terminals out1 and F are only connected through ancestors that have state 1; the other terminals are only connected through ancestors that have state 0. Such cases are mostly but not necessarily interpreted to mean that the hypothesis of ingroup monophyly is incorrect. In general, nothing more can be said other than that the data do not support the prior assumption of ingroup monophyly (Farris 1972, p. 657), an observation that is also consistent with the alternative interpretation that the data are wrong. Neither issue addressed in this paragraph affects the argumentation of this paper.

Premises

Obviously, the above conclusion of monophyly for (B C) is conditional: it depends on the correctness of the outgroup hypothesis, on the correctness of the similarity assessments that led to character c4 and its coded states, and on the correctness of several other, hidden, assumptions that remained unexpressed (such as absence of reticulate evolution). So, it would be more precise to say that (B C) is a putative monophyletic group, or a presumed monophyletic group, or that B and C are hypothesized to be monophyletic, each time conditional on the premises stated above (see Farris 1983, p. 13 for a similar use of the term ‘putative’). Below, I shall use such verbose formulations only when confusion could arise otherwise, or when I wish to stress the difference between hypothesis or inference on the one hand and true historical account on the other. For the latter I shall then use the convenient adjective ‘true’, following existing practice (see, e.g., Farris 1983, p. 12), while observing that the philosophical problems that surround the notion of truth (see, e.g., Boyd 1991) do not affect this usage. The same applies to some other terms that I already have used: outgroup, apomorphy, and plesiomorphy are defined in terms of phylogenetic history but are often used to refer to just a hypothesis about that history.

Hennig (1966, p. 89) introduced the terms symplesiomorphy and synapomorphy to describe the presence of plesiomorphies and apomorphies among terminals. As above, these terms are defined with respect to true evolutionary history, but are often used to refer to inferences as well. Such context-dependent shifts in meaning of these and similar terms are widespread in the literature, Hennig (1966) being a prime example. Related to this, when considering a transformation series such as a → a′, Hennig (1966, pp. 88–89) sometimes referred to a and a′ as ‘character conditions,’ sometimes as ‘special characters’ and sometimes even just as ‘characters.’ Combined with context-dependent meanings of terms, such use of different terms for the same thing, with meanings that often differ from current usage, can make it hard to understand Hennig’s writings. This is even more problematic because Hennig used an argumentation scheme to order and polarize characters that is very different from current practice. In the above example, Hennig referred to a and a′ as characters ‘in the sense that they distinguish their bearers from one another’ (Hennig 1966, p. 89). At the level of character analysis they are, in current usage, just character states.

When used conditionally, the precise meaning of terms such as synapomorphy and plesiomorphy in particular cases can drastically change according to the exact conditionals that are used or implied. Consider, for example, isolated character c9 and the outgroup hypothesis. In that case the presence of state 1 in terminals A, B, C, and D is a (putative) synapomorphy compared to the presence of state 0 in terminals out1, out2, E, and F, which is a (putative) plesiomorphy. On the other hand, when considering the whole data set of Fig. 6.1 and its most-parsimonious tree (Fig. 6.2b), the presence of the same character state 1 in the same terminals A, B, C, and D is now a (putative) symplesiomorphy compared to the presence of state 0 in terminals E and F, which has become a (putative) synapomorphy. The presence of state 0 in the outgroups
remains a (putative) symplesiomorphy. More interestingly, the presence of apomorphic state 1 in its original form (terminals A, B, C, and D) and in its more derived form (terminals E and F) is now a putative synapomorphy for terminals A–F.

6.2.3 Homology, the Hennig–Farris auxiliary principle, and parsimony analysis

A crucial assumption in the above interpretation of a single character is Hennig’s auxiliary principle, stating ‘that the presence of apomorphous characters in different species . . . is always reason for suspecting kinship [i.e. that the species belong to a monophyletic group], and that their origin by convergence should not be assumed a priori’ (Hennig 1966, p. 121; square brackets present in original). In this quote, the term ‘character’ refers to a ‘special character’ (Hennig 1966, p. 89), which is a character state as used in this chapter, whereas an apomorphous (special) character refers to a special character that ‘can certainly or with reasonable probability be interpreted as apomorphous’ (Hennig 1966, p. 121), i.e. an hypothesis of apomorphy or a putative apomorphy; monophyly is used in its true historical meaning.

Without this principle, one could equally well assume that, for example, state 1 of character c5 of Fig. 6.1 arose multiple times. As an example, on the most-parsimonious tree (Fig. 6.2b) state 1 could have arisen a first time in the branch that leads up to terminal D, and a second time in a common ancestor of E and F that is not a common ancestor of D. Under this interpretation, the shared presence of 1 in E and F would be interpreted as evidence for monophyly of clade (E F), to the specific exclusion of terminal D, even if D has the same state.

However, given that the delimitation of character c5 is grounded in empirical observation, this is not a very plausible interpretation of the character. Indeed, if any empirical evidence were available that state 1 as present in terminal D is not sufficiently similar to state 1 as found in terminals E and F to be called the same at some level of generality, these terminals would not have been assigned the same numeric state code to begin with. Since this was not the case, preferring the second interpretation over the first amounts to discarding some of the evidence that bears on the problem at hand (viz. the perceived similarity between terminal D on the one hand and terminals E and F on the other). The remaining evidence (viz. the perceived similarity between E and F) then supports monophyly of E and F to the exclusion of D.

Homology should be presumed in the absence of evidence to the contrary

Hennig’s formulation of his auxiliary principle, quoted earlier, is logically inconsistent because it can lead to internal contradictions: if the presence of presumed apomorphies is always to be a reason for suspecting true monophyly (first part of the principle), then it is not simply sufficient that multiple, convergent, origins of that state should not be assumed a priori (second part). This would still leave open the possibility that some terminals with the presumed plesiomorphic state obtained that state through a reversal. In that case, the group of all terminals with the presumed apomorphic state would no longer be truly monophyletic, which contradicts the first part. So that first part by logical necessity requires an additional statement that the origin of presumed plesiomorphies should not a priori be interpreted as reversals (for characters with more than two states, a similar statement is required for each state). As an example, without this addition a character such as c5 could be taken as evidence for, e.g., a monophyletic group (A D E F) because it is not precluded that state 0 in terminal A arose as a reversal within that clade. In this interpretation, state 0 as present in terminal A would be derived relative to state 1 as present in terminals D, E, and F.

Such additional statements are implicit in Farris’ (1983, p. 8) formulation of Hennig’s auxiliary principle: ‘homology should be presumed in absence of evidence to the contrary’, where homology refers to similarities among organisms that have arisen historically through inheritance from a common ancestor, irrespective of these similarities being apomorphic or plesiomorphic. More explicit discussions of the necessity, in parsimony analysis, of explaining plesiomorphic similarities as due to common descent can be found in Farris et al. (1995, p. 215) and
Farris (1997, pp. 132–133). I shall therefore refer to the auxiliary criterion in its logically consistent form as the Hennig–Farris auxiliary principle.

When, as above, the Hennig–Farris auxiliary principle is applied to single-character data sets, it can be interpreted as a condition that makes the apomorphic state by necessity mark a true monophyletic group: the state arose only once and never reverted. That group will be present on any tree that requires only a single origin for that state, which is in line with Farris’ (1983, p. 12) observation that grouping by true synapomorphy would have to behave exactly as parsimony, in the sense that it would lead to preference for the tree(s) on which no homoplasy is present (homoplasy being a point of similarity among organisms that cannot be explained by inheritance and common descent on a particular tree; Farris 1983, p. 18; see also below). These are, by definition, the shortest trees possible, so they are also most-parsimonious trees.

Parsimony and the Hennig–Farris auxiliary principle

In practice, however, one is constrained to work with actual observable traits of organisms rather than with true historical synapomorphies. Character codings of such traits seldom if ever capture all true evolutionary transformations, let alone their order, as exemplified by the presence of homoplasy in all but the smallest and simplest data sets (note that absence of homoplasy in such data sets would hardly justify the conclusion that all relevant transformations have been captured; absence of evidence is not evidence of absence). This led Farris (1983, pp. 17–19; see also Farris and Kluge 1986, p. 300; Farris 1986, pp. 15–16) to a general characterization of parsimony analysis in terms of a methodological principle that is fundamental to science in general: maximization of explanatory power or conformity between observation and theory. More specifically, the observations are the similarity statements as coded in characters, and the theory is that these similarities have arisen through inheritance and common descent. Most-parsimonious cladograms are then preferred because they are the trees on which the greatest amount of such observed points of similarity among organisms can be explained by inheritance and common descent (contra Grant and Kluge 2004, p. 29). As such they provide the best explanation of the observations on account of the theory.

Note that, at this level of analysis, characters and their states can indeed be treated as simple observations, even if, as discussed above, they are complex theories or hypotheses on their own. Likewise, little confusion arises if the presence of the same character state of a given character in two terminals is simply called an observed point of similarity between those two terminals. Such usages of these terms can be found, for example, throughout Farris (1983).

Similarities as coded in characters can very well be true homoplasies rather than true homologies. Likewise, it cannot be ruled out that character similarities that can be explained as homologies on most-parsimonious cladograms are true homoplasies instead, even when using single-character data sets as above. Combined with the observation that parsimony minimizes putative homoplasy, such observations are sometimes taken to mean that it is an assumption of parsimony analysis that homoplasy is rare in evolutionary history. However, even if rarity of homoplasy may be a sufficient condition to prefer most-parsimonious trees (see, e.g., Felsenstein 1981), it is definitely not a necessary condition.

Consider a data set for terminals out, A, B, and C where 10 characters support clade (B C) and just one character supports clade (A C) (this example and discussion is based on Farris 1983, pp. 13–14, see also p. 12, pp. 18–19). If clade (A C) is genealogically correct, then the 10 characters that support (B C) are (true) homoplasies; if, on the other hand, clade (B C) is genealogically correct, then the single character that supports (A C) is a (true) homoplasy. These simple observations point out an interesting asymmetry in the relationship between characters and genealogies: a given genealogy implies that characters that contradict this genealogy are homoplasious but requires nothing concerning characters that do not contradict the genealogy. Now assume that true homoplasy is so abundant that only one out of those 11 characters has escaped its effects. Under the assumption that this one character can equally well be any character in the data set, a simple statistical argument
leads to preference for clade (B C): the probability that this single historically correct character supports this clade is 10 times higher than the probability that it supports (A C) (10/11 against 1/11). Thus it is seen that even under extremely high levels of homoplasy most-parsimonious trees can still be the best phylogenetic hypotheses one can make on the basis of the available data, even if some of the putative homologies may be true homoplasies instead.

The underlying assumption of the above conclusion is best stated in the negative: absence of any assumption about the distribution of homoplasies in data sets. In a statistical framework, this can be understood as the use of an uninformative prior. Obviously, one can postulate distributions of homoplasy such that the most-parsimonious trees will no longer be the best bets. Such distributions are typically derived from stochastic models of sequence evolution (see, e.g., Felsenstein 1978a; Huelsenbeck and Lander 2003). The mere fact, however, that such distributions can be postulated does not by itself invalidate parsimony analysis as a method to analyze empirical data. Indeed, such a conclusion would crucially hinge on the realism or plausibility of the underlying stochastic models (and not on their simplicity, as Huelsenbeck and Lander 2003 seem to suggest). Farris (1983, pp. 14–17, p. 12; see also Farris 1999) amply discussed these issues and found the models that were in use at that time greatly lacking in realism. Stochastic models of sequence evolution have dramatically increased in complexity since then (see Felsenstein 2004 for a review), but they still seem mostly inadequate to model even small-sized real data sets (D. Pol, personal communication). Therefore, Farris’ discussion and conclusions remain as valid and to the point as they were more than 20 years ago.

Considering all this, the Hennig–Farris auxiliary principle can be phrased as the following rule for erecting character hypotheses and interpreting their optimizations on trees: ‘features that on the basis of empirical evidence are deemed sufficiently similar to be called the same at some level of generality should be treated as putative homologues in phylogenetic analysis (even if they may be true homoplasies instead).’ In combination with the principle of maximizing explanatory power, this makes similarity-based statements of putative homology the centerpiece of phylogenetic inference: most-parsimonious trees are trees on which the greatest amount of putative homology statements that return from character analysis can be explained as due to inheritance and common descent, and such trees are the best available phylogenetic hypotheses for the terminals at hand, whether or not the individual similarity statements or their explanations are historically correct.

As just discussed, the premises under which this holds are best stated in the negative: complete non-reliance on specific premises regarding correlations of evolutionary rates within and across characters and lineages. As such, parsimony analysis can be considered the most general method for phylogenetic analysis that is available. Tuffley and Steel (1997; see also Steel and Penny 2000) and Goloboff (2003) have examined similar but less extreme positions of agnosticism with respect to the details of evolutionary processes, using stochastic modeling. In both cases the most-parsimonious tree(s) are the best phylogenetic hypotheses, reinforcing the above conclusion.

6.2.4 Quantifying and maximizing homology

Given a tree and a data set such as in Fig. 6.1, Farris (1983) did not directly quantify the amount of points of similarity that can be explained by common descent and inheritance on that tree. Instead he used, as a relative measure, the minimum number of independent statements of homoplasy that are required on that tree. This works because an instance of homoplasy is present on a tree whenever a point of similarity as expressed in a character cannot be explained as homology on that tree (Farris 1983, p. 18).

So, when comparing two trees, the tree with the lower level of homoplasy will have the greater amount of similarity that can be explained as homology, and hence the greater power to explain the data on account of the theory. In practice, most parsimony programs calculate the minimum number of steps that are required, which, for a given character, differs from the minimum number of independent statements of homoplasy
by a constant factor. As a result, the same ranking of trees is obtained. Several points are worth elaborating here.

Inner-node state assignments and the requirement of internal consistency

First, whether or not a particular pairwise similarity as coded in a character can be explained as a homology on a particular tree does not just depend on the structure of the tree and on the state distribution of the character that is involved, but also on assumptions that are made about the character states that are present at the internal nodes of the tree.

Take character c10 of the data set of Fig. 6.1 and the most-parsimonious tree for that data set (Fig. 6.2b). Representing a pairwise similarity that is expressed as the presence of a same state i of a character in two terminals X and Y as Si(X Y), or, equivalently, Si(Y X), the similarity among terminals A and D as coded in c10 is S1(A D). With inner node state assignments as in Figs. 6.2c or 6.2e, this pairwise similarity cannot be explained as a homology because independent derivations of state 1 from state 0 are involved. On the other hand, with state assignments as in Fig. 6.2d, that same similarity can be explained as a homology. Similarly, S0(out1 B) can be explained as a homology in Fig. 6.2c but not in Figs. 6.2d and 6.2e. In general, a pairwise similarity Si(X Y) can be explained as a homology on a tree when all nodes that connect X and Y have been assigned that same state i; in that case, the statement is said to be accommodated on the tree. In all other cases, it is a homoplasy, and the statement is not accommodated (only cases in which unique states are assigned to inner nodes are considered in this paper; polymorphic inner nodes, as in Farris (1978a) or in Felsenstein (1979), are left undiscussed).

The connection between the explanation of a character and assignments of states to inner nodes can be seen as a methodological constraint that ensures that the set of all homology statements that can be derived from a tree and a character state distribution is free from internal contradictions (De
makes the explanation of the character on the tree logically capable of phylogenetic interpretation (Farris et al. 2001b). For example, on this tree one can explain either the similarity between A and D (e.g. Fig. 6.2d) or the similarity between out1 and B as a homology (e.g. Fig. 6.2c); one cannot possibly, however, simultaneously explain both similarities as homologies because they are mutually exclusive. This logical requirement of non-contradiction is also met in maximum likelihood methods that integrate over all possible sets of inner-node state assignments, such as that of Felsenstein (1981). It is not met in quartet and triplet methods (De Laet and Smets 1998). Pairwise similarity statements that can simultaneously be explained as homology on a given tree will be referred to as (mutually) compatible statements.

When the terminals of a tree are labeled with the observed states of a particular character and the inner nodes have been assigned character states as well, the tree can be cut into a number of parts in which all nodes have the same state, and such that neighboring parts have different states. I shall refer to such parts as regions. There is a straightforward connection between number of regions and number of steps: any boundary between two regions implies a step, so the number of steps is one less than the number of regions. By definition, all similarities within a region can be explained as homologies, while similarities across regions are homoplastic. Because these regions are non-overlapping and because homologies do not cross the borders of such regions, the problem of quantifying the amount of similarity of the character that can be explained as homology on the tree can be broken down easily into the smaller problem of determining the amount of homology in such a region. For the same reason, the different states of a character can be treated independently under those conditions.

Independence and the units of empirical content of comparative data sets

A second issue is logical independence of pairwise homology (and homoplasy) statements within
Laet and Smets 1998, pp. 374–376). Or, put posi- characters (Farris 1983, pp. 19–20, 21–22; De Laet
tively, it ensures that the overall explanation is and Smets 1998, pp. 369–374; this is different
logically possible or consistent. This, in turn, from logical dependence between characters, as
discussed, e.g., in Wilkinson 1995, pp. 297–298). Consider state 1 of character c10 as it returns from character analysis. At that point, all its six pairwise similarity statements can be interpreted as homologies: S1(A D), S1(A E), S1(A F), S1(D E), S1(D F), and S1(E F). Not all of these are independent though: if, e.g., S1(A D) and S1(A E) can be interpreted as homologies, then, by necessity, S1(D E) can be interpreted as a homology as well. In general, if ni terminals have the same character state for a given character, there are ni(ni − 1)/2 different pairwise similarity statements that can be made, but no more than ni − 1 of those can be independent. Adding statements beyond this number will introduce redundancy in the description of the data. This maximum number of independent pairwise similarity statements is at the same time the minimum number of statements that must be considered to deduce the complete set: when removing statements from a largest set of independent statements, there is no longer sufficient information to generate all data.

Non-redundant descriptions. I shall call such maximal sets of independent pairwise similarity statements smallest generating sets. The exact identity of the members of such sets does not matter; the important points are completeness and absence of logical dependencies. As an example, {S1(A D), S1(A E), S1(A F)} and {S1(A D), S1(D E), S1(E F)} are two different smallest generating sets for state 1 of character c10; {S1(A D), S1(A E), S1(A F), S1(E D)} is a generating set, but not a smallest one because not all of its elements are independent. Next consider how the pairwise similarities in a character state can be explained on a particular tree with a particular set of inner-node state assignments, such as, for example, in Fig. 6.2c. There are two regions that have character state 1: isolated node A and subtree (D (E F)). All similarities within a region are homologies and all similarities across regions homoplasies, so S1(D E), S1(D F), and S1(E F) are homologies, while S1(A D), S1(A E), and S1(A F) are homoplastic.

A non-redundant description of this can be determined as follows. For each region that is involved, establish a smallest generating set (in general, a region with j terminals will have smallest generating sets of cardinality j − 1). These sets non-redundantly describe the homologies of the character state on the tree, and the total number of independent statements that are accommodated is the total number of statements in these sets. Then pool these generating sets and augment the resulting set to obtain a smallest generating set for all similarities in the character state, without reference to a tree. The added statements form a maximal set of independent pairwise similarity statements that are not accommodated. This procedure establishes that the number of independent accommodated homologies and homoplasies for a given state add up to a number that is tree-independent. As a result, minimizing the number of independent statements of pairwise homoplasy in a character state and maximizing the number of independent statements of pairwise homology in that same state are equivalent problems indeed. Because independent homologies can be counted one region at a time, this remains true when summing over all states in a character, and/or over all characters in a data set.

In this example, the first region (isolated node A) has no similarities and therefore an empty smallest generating set; {S1(D E), S1(E F)} is a smallest generating set for the second region. Adding, for example, homoplastic statement S1(A E) is sufficient to fully describe the character state and its explanation on the given tree. As an example, given that S1(D E) is accommodated and that S1(A E) is not accommodated, it follows that S1(A D) is not accommodated either.

Explanation. When assessing how well a tree with inner-node state assignments can explain a character state as due to inheritance and common descent, the correct measure is the number of independent accommodated pairwise similarities, not the total number of accommodated pairwise similarities. Consider a character in which 100 terminals have state 0 and another 100 state 1, and two trees on which the first 100 terminals occur in one region and the other 100 in two regions. Assume that in the first tree, the first region with state 1 has one terminal and the second 99; and that, in the second tree, both regions with state 1 have 50 terminals. The total number of pairwise
similarities in this character state is 99 × 100/2 = 4 950, of which at most 99 are independent. Summing over regions, in the first case a total of 0 + 4 851 = 4 851 similarities are accommodated, in the second case only 1 225 + 1 225 = 2 450.

Yet in both cases, the same number of 98 independent pairwise similarities are required for a non-redundant description of the situation. Or, conversely, in both cases only a single independent pairwise similarity cannot be explained as a homology. This is in direct agreement with the observation that both cases can equally well explain the observations on account of the theory, which in this restricted case is possible historical identity of state 1 through inheritance and common descent on the given trees with the given sets of inner-node state assignments for the given character. The total number of pairwise homologies gives a different answer (the first tree is considered about twice as good: score 4 851 vs. 2 450) because that number also depends on the numbers of terminals that are present in each region of a tree in which the state is homologous. As these numbers do not feature in the theory on account of which the data are explained, the total number of accommodated similarities is not suited to measure agreement between theory and observation.

Weighting. An alternative way of viewing the difference between all and independent pairwise similarity statements is in terms of dynamic weighting of similarity statements (see De Laet and Smets 1998 for a similar discussion in the context of triplet and quartet methods). More particularly, if the weight that is assigned to an independent accommodated similarity statement in a given region is calculated dynamically as the total number of statements in that region divided by the number of independent statements in that region, then the total number of unweighted accommodated statements equals the number of weighted independent accommodated statements. This weighting scheme is highly unnatural and hard if not impossible to defend, which just reinforces the conclusion of the previous paragraph. But it also raises the general question of weighting.

I have been assuming equal weighting of similarity statements throughout, but the principle of parsimony as discussed here does in itself not prescribe that all parts of the data be equally weighted. Farris (1983, p. 11) discussed this issue at the level of differential weighting of entire characters and characterized his preference for equal weighting as a stance of ignorance: in the absence of any convincing reason for doing otherwise, all characters in a data set are treated as if they provide equally cogent evidence on phylogenetic relationship. The same reasoning applies at the level of the independent similarity statements that make up characters.

Algorithms such as Farris (1970; additive characters) or Sankoff and Rousseau (1975; step matrices) can be seen as methods that apply differential weighting within characters. Such differential weighting is defined in terms of transformations, not in terms of similarities: transformations between different pairs of character states can receive different weights. This may seem problematic for the current approach because the simple equivalence of minimizing homoplasy and maximizing homology, as discussed above, in general only holds when all transformations and all unit homologies are weighted equally. However, differential weighting as in Farris (1970) and Sankoff (1975) can also be characterized in terms of similarities that are hierarchically nested. A full discussion of this issue is beyond the scope of this review.

A methodological requirement. The unit of evidential value of a data set on a tree that arises from this discussion is an independent accommodated pairwise similarity statement. Likewise, independent pairwise similarity statements are the currency in which the empirical content of a data set is measured. This ultimately permits interpreting the preference for independent accommodated statements (versus all accommodated statements) as a methodological requirement when maximizing the number of pairwise similarity statements that can be explained as homology: it enforces that each unit or quantum of empirical content of a data set is considered precisely once. Note that, in itself, this does not amount to equal weighting: whether
or not all quanta of comparative empirical content should receive the same weight is an entirely different question.

Again, this methodological constraint is not met in quartet and triplet methods (De Laet and Smets 1998). Likewise, it is not met in methods that base the inference on a square matrix of pairwise distances among terminals, such as neighbor joining (Saitou and Nei 1987), for the simple reason that the required information to do so is not present in such matrices. To be sure, neighbor joining can in principle operate directly on character state data (Saitou and Nei 1987, p. 410), but such data sets are mostly reduced to square distance matrices first. In maximum likelihood methods such as Felsenstein (1981), the constraint is met. The difference with parsimony analysis is that in such methods the explanation of a similarity statement on a tree is based on integration over all possible inner-node state assignments, using stochastic models of character evolution and best-scenario branch lengths (see, e.g., Steel and Penny 2000 and Goloboff 2003 for a discussion). As seen above, when looking for best trees, parsimony analysis evades uncertainty as to the true historical status of a similarity statement that can be explained as a homology on a tree at an entirely different level, thus enabling it to remain largely agnostic about details of the processes of character evolution.

Maximizing the amount of homology

Given a data set of characters, one has to identify the tree or trees on which the highest number of independent compatible pairwise similarity statements can be explained as homology. This involves an optimization at two different levels. First, which is the highest number of such homology statements on a given tree? Second, given a procedure to solve the first problem, which is (are) the tree(s) on which this number is maximal?

The first problem can be tackled one character at a time because there are no logical interactions among the explanations of different characters (this is a fundamental assumption that is not met when inapplicables are present). Within a character, though, it cannot be tackled one state at a time because the explanation of any given state imposes methodological constraints on allowed explanations of the other states. As discussed above, such constraints are met when inner-node state assignments are taken into account, in addition to the observed states at the terminal nodes. Therefore, a crude solution for optimizing a character on a tree is to generate all possible sets of inner-node state assignments and to count the number of independent accommodated statements for each (three different possibilities, on the same tree, are illustrated in Figs. 6.2c–6.2e, with scores 5, 5, and 2). If the sets of inner-node state assignments are generated in a clever enough order, this can be improved using a branch-and-bound mechanism.

However, a much more efficient approach is possible, starting from the above observation that the number of independent compatible homologies and homoplasies for a character add up to a number that is tree-independent. As a result, a set of inner-node state assignments that minimizes independent homoplasies also maximizes independent homologies. Next, the minimum number of independent homoplasies for a given character and a given optimal set of inner-node state assignments equals, up to a tree-independent constant, the number of regions as imposed by the inner-node state assignments, which in turn is one more than the minimum number of steps in the character. Therefore, algorithms that minimize the number of steps in such characters can be used to maximize homology. Examples are the algorithm of Farris (1970) for binary characters and additive multistate characters, or the algorithm of Fitch (1971; see also Hartigan 1973) for unordered characters.

The second problem is illustrated in the two trees of Figs. 6.2b and 6.2f: even if the second tree can explain some characters better than the first tree (e.g. c10), the first tree is preferred because it provides a better explanation of the data as a whole. The problem of deciding whether a given tree is an optimal tree for the data at hand is NP-complete (Foulds and Graham 1982). Practically, this means that in general the only way to find the best tree(s) is the hard approach of examining all possible trees that exist for the given terminals, either explicitly or implicitly, by using a branch-and-bound approach (for which see Hendy and
Penny 1982). Unfortunately, the number of trees grows so extremely fast as the number of terminals grows (see, e.g., Felsenstein 1978b) that this approach is only feasible for relatively small numbers of terminals. Exactly how many terminals can be analysed in this way depends on the structure of the data set and on the computing power and time that is available, but as a rule of thumb it is somewhere between 15 and 25. So, when dealing with increasingly larger numbers of terminals, one is practically forced to restrict the tree search to increasingly smaller subsets of all possible trees, proportionwise. In doing so, heuristics such as branch swapping are used to make sure that no or little computing effort is wasted on trees that are manifestly not optimal (for a broader discussion and some developments beyond simple branch swapping see, e.g., Goloboff 1999; Moilanen 1999; Nixon 1999; Moilanen 2001).

Both levels of optimization are logically independent, even if they are in practice often tightly integrated in heuristic approaches (see, e.g., Goloboff 1996b for examples). One could do a tree search using any imaginable function that computes a number from a tree and a data set, and, heuristic uncertainty aside, the resulting trees would be optimal according to that function. Therefore, when comparing and evaluating different methods, it is sufficient to examine the meaning of the function used to evaluate any single tree.

6.2.5 Characters revisited

Summarizing this long introductory section, observation-based pairwise similarity statements are the fundamental statements of comparative research. When searching for trees on which the highest number of such similarities can be explained as homologies, two methodological requirements must be met: (1) the overall explanation of the data must be free of internal contradictions, which can be enforced by assigning, for each character, states to inner nodes of the tree; (2) the same piece of empirical content should not be used multiple times, which translates into counting only homologies that are logically independent.

From this point of view, a character that describes the distribution of a number of states in a number of terminals is just a convenient non-redundant summary of elementary putative homology decisions that are made, during character analysis, in all possible pairwise comparisons of some observable characteristic in those terminals (see De Laet and Smets 1998, pp. 378–380; the unhappy informal use of the term ‘essence’ does not invalidate their discussion). In each such pairwise comparison, the mere fact that the characteristic is being compared entails the hypothesis that at some level of generality it is historically the same. At a lower level, the different states of the character are hypotheses of alternative expressions of the characteristic, each of which is also hypothesized to be historically the same. As discussed above, all such hypotheses are to be seen through the lens of the Hennig–Farris auxiliary principle.

To clarify, consider some angiosperms and a character that codes a floral structure that comes in two forms, rounded (state 0) and square (1). The fact that these two forms are coded as states of the same character reflects the hypothesis that the structures, despite the observed difference in form, are homologous at a more general level. Mostly, such an hypothesis is based on a combination of criteria. As an example, when the development of floral buds in different terminals is compared, the meristem that gives rise to the structure could originate in almost identical topological relationships relative to other meristems. In addition, the adult structures, whether round or square, could share many anatomical and morphological similarities. As a whole, the character then reflects the higher-level prior hypothesis that the structure in all these terminals is identical through common descent and inheritance. Within the character, the difference in general form (round vs. square) is considered important enough to warrant recognition of two different states, reflecting the lower-level prior hypotheses that the roundness and the squareness of these structures can be explained as identity through common descent and inheritance as well.

The different roles of characters and character states

It has often been observed that there is a large discrepancy between the formalized nature of phylogenetic analysis once a data set has been
constructed and the much more subjective decisions that are involved in character analysis, when it comes to deciding if observed features in two terminals should be coded as the same state of a character, two alternative states of a character, or part of different characters altogether (see e.g., de Pinna 1991, p. 380). Pleijel (1995) argued that this is especially relevant for the assumptions regarding homology of states within a character (are two such floral structures homologous, irrespective of their general form?). Contrary to hypotheses of homology within states (is the roundness of two round structures homologous, is the squareness of square structures homologous?), such higher-level hypotheses are never questioned during subsequent phylogenetic analysis (Pleijel 1995, p. 312). As an example, consider character c9 of the data set of Fig. 6.1, and assume that state 0 codes the square and state 1 the round structure of the above character. On the most-parsimonious tree for these data (Fig. 6.2b), the squareness of the structure that is observed in terminals out1 and out2 is not homologous to the squareness of the same structure that is observed in terminals E and F, and the initial lower-level hypothesis has to be revised. Similar posterior revisions of the higher-level hypothesis cannot be made because the homology of round versus square structures has been hard-coded in the analysis, precisely because they have been coded as states of the same character. To remove such hard-coded higher-level assumptions, Pleijel (1995) proposed to use absence/presence coding of character states, which is formally identical to non-additive binary coding, a technique that stems from phenetics (see, e.g., Sokal 1986). Whether it is feasible or desirable to exclude such assumptions from the analysis will be examined below.

But whatever the answer, the use of absence/presence coding as a means of doing so can lead to internal inconsistencies in the phylogenetic explanation of data, a result that is particularly relevant for this paper because Pleijel (1995) advanced absence/presence coding as a promising way to deal with inapplicables. Consider the data set of Fig. 6.3a and assume, without loss of generality, that none of the character states codes for absence. In the recoded version of Fig. 6.3b each column stands for one character state of a character of Fig. 6.3a, with 0 coding for absence of that state and 1 for presence. When analyzing Fig. 6.3a, the three trees of Fig. 6.3c are obtained (nine steps; loss of two independent pairwise similarities). With the recoded data, only one shortest tree is found, the middle tree of Fig. 6.3c; the two other trees are suboptimal by one step (18 vs. 17).

Pleijel (1995, p. 313) pointed out that, with absence/presence coding, hypotheses concerning

(a)
       c1  c2  c3  c4  c5  c6
out1    0   0   0   0   0   0
out2    0   0   0   0   0   0
A       1   1   0   0   1   1
B       1   1   0   1   1   1
C       1   0   1   1   1   2
D       1   0   1   1   0   2

(b)
       c1  c2  c3  c4  c5  c6
out1   01  01  01  01  01  100
out2   01  01  01  01  01  100
A      10  10  01  01  10  010
B      10  10  01  10  10  010
C      10  01  10  10  10  001
D      10  01  10  10  01  001

(c) Three trees; terminals from top to bottom. Left tree: out1, out2, A, B, C, D. Middle tree: out1, out2, A, B, C, D. Right tree: out1, out2, D, C, A, B.

Figure 6.3 Absence/presence coding of character states aims to remove prior hypotheses of homology among states (Pleijel 1995) but can lead to internal inconsistencies. (a) A dataset with characters that reflect nested hypotheses of homology as determined during character analysis (characters unordered). (b) The characters of (a) with absence/presence recoding of character states. (c) The three most-parsimonious trees for (a). With the data coded as in (b) only the middle tree is considered optimal. The two other trees are rejected even if they explain the data equally well under acceptable hypotheses of homology that they imply.

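
As a rough check on the step counts quoted for Fig. 6.3, the following Python sketch counts steps with the first (downward) pass of the Fitch (1971) algorithm for unordered characters and recodes the matrix of Fig. 6.3a by absence/presence of each state. The tree topologies are an assumption: they are reconstructed here from the terminal order in Fig. 6.3c and from the lengths given in the text, and the two outgroups are arranged arbitrarily as (out1 (out2 ingroup)). The names `fitch_steps`, `tree_length`, and `recode_absence_presence` are hypothetical helpers introduced for illustration only.

```python
# Data set of Fig. 6.3a: one string per terminal, one symbol per character c1..c6.
DATA = {
    "out1": "000000",
    "out2": "000000",
    "A":    "110011",
    "B":    "110111",
    "C":    "101112",
    "D":    "101102",
}

# ASSUMPTION: topologies inferred from the terminal order of Fig. 6.3c and the
# reported lengths; they are not given explicitly in the text.
TREES = {
    "left":   ("out1", ("out2", ("A", ("B", ("C", "D"))))),
    "middle": ("out1", ("out2", (("A", "B"), ("C", "D")))),
    "right":  ("out1", ("out2", ("D", ("C", ("A", "B"))))),
}


def fitch_steps(tree, states):
    """Minimum number of steps for one unordered character on a binary tree."""
    steps = 0

    def down(node):
        nonlocal steps
        if isinstance(node, str):          # terminal: its observed state
            return {states[node]}
        left, right = (down(child) for child in node)
        shared = left & right
        if shared:                         # children agree: no step needed here
            return shared
        steps += 1                         # children disagree: one step
        return left | right

    down(tree)
    return steps


def tree_length(tree, matrix):
    """Sum of per-character step counts (characters treated independently)."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_steps(tree, {t: row[k] for t, row in matrix.items()})
        for k in range(n_chars)
    )


def recode_absence_presence(matrix):
    """One binary column per observed state: 1 = state present, 0 = absent."""
    taxa = list(matrix)
    n_chars = len(matrix[taxa[0]])
    recoded = {t: [] for t in taxa}
    for k in range(n_chars):
        for state in sorted({matrix[t][k] for t in taxa}):
            for t in taxa:
                recoded[t].append("1" if matrix[t][k] == state else "0")
    return {t: "".join(bits) for t, bits in recoded.items()}


RECODED = recode_absence_presence(DATA)
original_lengths = {name: tree_length(t, DATA) for name, t in TREES.items()}
recoded_lengths = {name: tree_length(t, RECODED) for name, t in TREES.items()}
print(original_lengths)
print(recoded_lengths)
```

Under these assumptions the three trees tie at nine steps for the original coding, whereas the recoded matrix separates them at 18, 17, and 18 steps, with the middle tree shortest. The recoded columns come out in a different order than in Fig. 6.3b, which does not affect tree length.
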
transformation series between the analysed states will emerge as part of the results, but he remained somewhat vague about the logical and technical implications of this observation. As an example, take the three recoded states of the original character c6, each with a perfect fit on the single most-parsimonious tree for the recoded data. Because 0 stands for absence of the corresponding state, an inner node that is optimized as 0 can be hypothesized to have one of the two other states (other possibilities exist but are not relevant for the argument). Combining and summarizing all possible such optimizations of the three recoded states of c6, and using the outgroup hypothesis, three possible implied transformation series emerge from the tree: 1 ← 0 → 2, 0 → 1 → 2, and 0 → 2 → 1. Each of these has a perfect fit on the tree as well, and in each case only two steps are required to explain the state distribution. When doing the same exercise for the groups of states as defined by the other characters of Fig. 6.3a, all these other states can be explained by postulating a total of only seven steps (note that some of the implied transformation series incorporate non-homology of states as defined a priori; an example is character c4).

The middle tree of Fig. 6.3c is considered the best tree for the recoded states because it has the shortest length for the recoded data. But on the basis of possible transformation series that emerge as part of the analysis, one can construct a phylogenetic explanation of the data on that tree that requires fewer steps. So, whatever the length of an absence/presence recoded matrix on a tree means, it definitely does not measure how well that tree can explain the data phylogenetically under the assumption that character states can transform into one another, and maximization of phylogenetic explanatory power under that assumption cannot be the rationale for preferring trees that minimize this recoded length. Indeed, when the two other trees are analyzed in the same manner, they too can explain the data by postulating only nine steps (which should not come as a surprise, as it was already clear from the analysis of the data set of Fig. 6.3a that the states could be grouped such that only nine steps are required on those trees). Yet they are rejected if the length of the recoded matrix is used as an optimality criterion.

(a)
       c1  c2  c3  c4  c5  c6
out     0   2   4   6   8  10
A       1   2   4   6   9  11
B       1   3   5   7   8  12
C       1   3   5   7   9  13

(b)
       c5′  c6′
out     8   10
A      11    9
B       8   12
C      13    9

Figure 6.4 Absence/presence coding of character states, to remove prior hypotheses of homology among states, can lead to surprising optimal implied transformation series. (a) A dataset with six unordered characters as they return from character analysis; the groupings of character states in columns (characters) reflect nested hypotheses of putative homology; the most-parsimonious tree is (out (A (B C))), which is also the best tree when the data are recoded to remove prior assumptions of homologies among states. (b) Alternative grouping of the states of characters c5 and c6 that cannot be rejected on the basis of the optimized recoded states. For this grouping, the transformation series as implied by the optimized recoded characters provides a better explanation of the data than the original characters.

One step further, posterior groupings of states may exist that reduce the total number of steps below the number required by the groupings as they come out of character analysis. An example is presented in Fig. 6.4. As above, it can be assumed without loss of generality that none of the states in Fig. 6.4a codes for absence. When states 8–13 are grouped as in characters c5 and c6 of Fig. 6.4a, the transformation series that are implied by the optimizations of the recoded states require a total of five steps on the best tree. But the alternative grouping as in Fig. 6.4b, implying 11 ← 8 → 13 and 10 → 9 → 12, can explain the observed distributions of states 8–13 at only four steps. This optimal implied grouping of states obviously contradicts the empirical evidence on the basis of which the original characters were proposed. But then it is the aim of this approach to remove such untestable assumptions (Pleijel 1995, p. 312), and posterior acceptance of groups of states as in characters c5′ and c6′ is just a logical consequence. More precisely, recognition of such transformation series follows from the notion that hypotheses concerning transformation series among the analysed states should emerge as part of the results and from the general requirements that the analysis should be logically capable of phylogenetic interpretation and internally consistent.

It does not require much imagination to see that in practice this could easily lead to situations where square floral structures of one angiosperm
would a posteriori be considered homologous with, for example, a type of root system as present in another angiosperm, and the round floral structures of this other angiosperm to the root system of the first. Most systematists would not hesitate to reconsider homology within states on the basis of a well-supported most-parsimonious tree (the squareness of the floral structures in these terminals is not the same as the squareness of such structures in those other terminals after all, despite my prior assessment to the contrary), but in general such reinterpretations across characters are much more difficult to accept (darn, these flowers are actually not flowers but modified root systems!).

So, even if statements of homology among states are untestable in the sense of Pleijel (1995), they put bounds on the degree of reinterpretation of character states one is willing to accept in the light of incongruence in the data, and these bounds reflect empirical evidence as obtained during character analysis. Outright removal of such bounds, as would seem to be a logical consequence of using absence/presence coding as advocated by Pleijel (1995), therefore amounts to throwing away important relevant empirical data. As a work-around, one could limit implied transformation series to include only groupings of states that are compatible with the results of character analysis. But that actually amounts to giving up the premise that prior statements regarding homology among states should be removed from the analysis. And as discussed above, absence/presence coding then results in the same trees as obtained with regularly coded characters, at least if the aim of the analysis is to maximize explanatory power in a phylogenetic context.

Beyond single-column characters

On the other hand, it is not uncommon in character analysis to find multiple possible interpretations for features, which is not surprising given the role of background knowledge as discussed earlier. As an example, depending on the view one takes, the vegetative region in some species of the angiosperm genus Utricularia (bladderworts) can be interpreted morphologically as a shoot-like leaf, a branched stem system without leaves, or a shoot with stems and leaves (Rutishauser and Sattler 1989; a fourth, more complex, interpretation is also provided). Similar problems abound when dealing with fossils or when making comparisons across very divergent groups. In both cases one often has to deal with structures that cannot be easily homologized across the terminals being compared, which in turn often results in competing and conflicting prior interpretations. In studies of sequence data, this problem can come in the form of different prior hypotheses about orthology and paralogy of sequences (Fitch 1970) or in different alignments for the same set of putative orthologs (several examples of the latter case are discussed in the second section).

In each such case, when characters are coded according to just one of the competing interpretations, chances are that the chosen view will be favored by the resulting trees simply because the data have been exclusively interpreted as such to begin with. As observed by Endress (1994, pp. 401–402), circular reasoning when dealing with such ambiguously interpretable features can be overcome by repeatedly testing all different possibilities. Only this approach amounts to a sincere attempt at falsification. Unfortunately, in formal analyses and with current algorithms this is not easy to achieve because the technical framework of independent single-column characters does not lend itself to simultaneous analysis of such alternative interpretations of the data in a logically consistent and correct way.

A hard work-around would be to manually construct and analyse as many data sets as there are different combinations of different interpretations in different characters, which may be practically feasible when the number of such combinations is not too large. The best phylogenetic hypotheses would then be the shortest trees across all those data sets, and optimal homologizations and details of transformation series would emerge from those trees as part of the analysis. The difference with absence/presence coding of states is that, as above, the level of reinterpretation of states that one is willing to accept in the light of incongruence is still bounded by the results of character analysis. The difference with an analysis of just one set of classic single-column
characters is of a purely technical nature: these are cases in which the a priori acceptable hypotheses of homology among states cannot be expressed as a simple series of independent single-column characters. But the purpose remains maximization of the number of independent pairwise similarities that can be interpreted as identical through common descent and inheritance. From this point of view, the next section can be seen as an attempt to develop a formal and logically consistent method to deal with the problem of multiple a priori acceptable hypotheses of homology among states in the case of putative homology statements within putative orthologous sequences.

6.3 Parsimony analysis of sequence data

When dealing with sequence data, it is not unusual to find that putative homologous sequences have different lengths in different terminals. Such length differences are explained as the result of indel events, insertions and/or deletions that occurred in the course of evolutionary history. As a consequence of indel events, two sequences that are homologous as a whole will nevertheless contain subsequences that are not homologous: with a deletion, the resulting sequence misses a part of the original sequence; with an insertion the resulting sequence has a subsequence that was not present before. In both cases, characters that describe the subsequences that are involved will be inapplicable in the other sequence.

For the purpose of phylogenetic analysis, it is common practice to establish the positions and sizes of indels by creating a multiple alignment prior to tree evaluation and tree search, thus turning the putative homologous sequences into a sequence of single-column positional characters that subsequently can be treated as a regular data set (see Fig. 6.5a for an example). Each such positional character describes the state distribution of the base that is found at that position of the alignment, with gaps (coded as dashes in this chapter) indicating inapplicability. As discussed by Maddison (1993, p. 578), this makes sequence data susceptible to the general problems that come with inapplicables.

However, the approach of generating multiple alignments prior to tree evaluation and tree search is fundamentally insufficient as a general method for analysis of sequence data, as will be discussed below. As a consequence, the question of inapplicables in sequence data cannot be discussed in
[Figure 6.5, panel (a): prior multiple alignment with positional characters c1 to c5 and the reconstructed inner-node row]

        c1  c2  c3  c4  c5
A       a   a   a   t   a
B       a   g   t   t   -
C       g   a   t   t   -
inner   a   a   t   t   -

[Figure 6.5, panel (b): unaligned sequences A aaata, B agtt, C gatt. First optimal inner-node sequence aatt, with implied alignment A aaata / B agtt- / C gatt- or A aaata / B ag-tt / C ga-tt. Second optimal inner-node sequence agatt, with implied alignment A aaata / B ag-tt / C -gatt.]

Figure 6.5 Three putative homologous sequences and two different approaches to evaluating them on the single unrooted tree for three terminals. (a) First a multiple alignment is constructed to establish base-level positional correspondences (dashes indicate gaps); the resulting positional characters are optimized using the algorithm of Fitch (1971), resulting in three substitutions (s) and one indel (i). (b) The unaligned sequences are optimized directly on the tree using the algorithm of Sankoff (1975); in this example, two optimal reconstructions of the sequence at the inner node exist, each at four steps; in each case, the optimal length imposes one or more optimal sets of positional correspondences.
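The step counts in the caption of Fig. 6.5 can be checked by hand, or with a small sketch such as the one below. It is an illustration only, under stated assumptions: the function names `fitch_length` and `three_terminal_length` are hypothetical, the unit gap is treated as a fifth state, and all substitution costs and the unit gap cost are 1. The second function is not the general algorithm of Sankoff (1975); it is a brute-force three-way dynamic program that is valid only for three terminals joined at a single inner node under these unit costs.

```python
from functools import lru_cache
from itertools import product

GAP = "-"


def fitch_length(tree, alignment):
    """Fitch (1971) length of a prior alignment on a rooted binary tree.

    `tree` is a nested tuple of terminal names (any rooting of the unrooted
    tree gives the same length); a unit gap counts as a fifth state.
    """
    def column(node, i):
        if isinstance(node, str):                 # terminal: observed entry
            return {alignment[node][i]}, 0
        (ls, lc), (rs, rc) = column(node[0], i), column(node[1], i)
        shared = ls & rs
        # Non-empty intersection: no extra step; otherwise union plus one step.
        return (shared, lc + rc) if shared else (ls | rs, lc + rc + 1)

    width = len(next(iter(alignment.values())))
    return sum(column(tree, i)[1] for i in range(width))


def three_terminal_length(x, y, z):
    """Minimal number of steps for three unaligned sequences on their one tree."""
    def column_cost(col):
        # The best inner-node entry is the most frequent entry of the column.
        return len(col) - max(col.count(e) for e in col)

    @lru_cache(maxsize=None)
    def best(i, j, k):
        if i == j == k == 0:
            return 0
        options = []
        # Each column consumes one base from at least one of the sequences.
        for di, dj, dk in product((0, 1), repeat=3):
            if (di, dj, dk) == (0, 0, 0) or di > i or dj > j or dk > k:
                continue
            col = (x[i - 1] if di else GAP,
                   y[j - 1] if dj else GAP,
                   z[k - 1] if dk else GAP)
            options.append(best(i - di, j - dj, k - dk) + column_cost(col))
        return min(options)

    return best(len(x), len(y), len(z))


prior = {"A": "aaata", "B": "agtt-", "C": "gatt-"}
print(fitch_length((("A", "B"), "C"), prior))          # prior alignment, panel (a)
print(three_terminal_length("aaata", "agtt", "gatt"))  # direct optimization, panel (b)
```

Both calls should report four steps for this example, in agreement with the caption. The two approaches differ in what is held fixed: the first takes positional correspondences as given, while the second searches over them.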
general at that level. It is argued that a general method by necessity requires that unaligned sequences be directly optimized on trees, using algorithms such as those of Sankoff (1975) or Altschul (1989, pp. 307–308). Such algorithms treat the unaligned putative homologous sequences as one single complex character, to which I shall refer as a sequence character. It is widely believed that the various parameters that these algorithms employ to set up a cost regime, such as base substitution and gap costs, can only be specified or interpreted with reference to detailed models of the evolutionary processes that generated the data. However, the cost regime can also be set according to the principle of parsimony as discussed above, leading to a maximization of the amount of independent sequence similarity that can be interpreted as due to inheritance and common descent (De Laet 2004).

Throughout this section I use DNA sequences, but the discussion is general and applies to any kind of data that can be conceptualized to be hierarchically related through substitutions and indels, including, for example, serial homologs in morphology or different versions of manuscripts in stemmatology. Examples are constructed such that optimalities can be verified by hand.

6.3.1 Some background

Some additional notes on terminology are appropriate first. Gap and gap cost terminology can be confusing because the same terms are sometimes used for different things and the other way around. As an example, in a sequence like att---ttac the term gap is sometimes used for each of the three consecutive missing positions in the middle (three gaps), or alternatively for the whole stretch of three missing positions (one gap). In this paper, a gap always refers to a maximal stretch of missing positions, not to smaller composing parts. The length of a gap is the number of positions over which it extends. The smallest composing part of a gap is referred to as a unit gap. The character that is used to indicate a unit gap, a dash in this chapter, is sometimes called the gap character, a term that has also been used for characters in data sets that describe the distribution of putative indel events (e.g. Simmons and Ochoterena 2000).

All gap costs in this paper are of the form a + (n − 1)b, in which n is the length of the gap, a the (gap) opening cost, and b the (gap) extension cost. If gap opening cost and gap extension cost are equal, the term unit gap cost refers to either, and the cost for a gap of length n is n times the unit gap cost. Such a cost regime can be expressed as a 5 × 5 step matrix (see Sankoff and Rousseau 1975) in which the unit gap is included as a fifth state, in addition to a, c, g, and t.

The minimal mutation algorithm of Sankoff (1975) is illustrated in the example of Fig. 6.5b. It reconstructs inner node sequences and positional correspondences among observed sequences such that the total number of mutations is minimized under the assumption that a gap of length n constitutes n mutation events. This corresponds to a cost regime in which all base substitution costs, the gap opening cost, and the gap extension cost are equal. Sankoff and Cedergren (1983) generalized the approach to a step matrix with arbitrary metric distances, still treating a gap of length n as n events. A further extension to include gap costs of the form a′ + nb, in which n is the length of the gap, a′ + b the gap opening cost, and b the gap extension cost, was examined by Altschul (1989, pp. 307–308). With such gap costs, the first unit gap of a gap incurs a cost (a′ + b), each next unit a cost of b.

Sankoff (1975) used the concept of optimal frame sequences to specify reconstructed sequences and positional correspondences that lead to minimal costs. Sankoff and Cedergren (1983) framed their discussion in terms of the slightly less general concept of tree alignments. A tree alignment always refers to a particular tree with the given sequences at the tips and hypothetical or reconstructed sequences at the inner nodes. It consists of (1) that tree; (2) a matrix in which both observed and reconstructed sequences are aligned; and (3) correspondences between nodes of the tree and rows of the matrix. It is conveniently represented as a tree in which the nodes are labeled with the rows of the matrix, as, for example, in Fig. 6.10 (see below). In this way it is easy to see that, in a tree alignment, each branch of the tree defines a pairwise alignment between the sequences at the two nodes that the branch connects. The cost of the tree
alignment is then defined as the sum of the costs of these pairwise alignments along all branches of the tree, always with reference to the cost regime in use. A ‘classic’ multiple alignment of the terminal sequences is obtained by deleting the rows with inner-node sequences from the matrix of a tree alignment. Multiple alignments that are obtained in this way have been called implied alignments (e.g. Schwikowski and Vingron 1997; Wheeler 2003a). Some examples of optimal implied alignments can be found in Fig. 6.5b.

With cost regimes that make no difference between gap opening cost and gap extension cost, the cost at any position in a pairwise alignment of a tree alignment is independent of the costs at its other positions. By extension, this also applies to the costs of complete columns of a tree alignment. As a result, each such column can be interpreted as a single-column character with a set of inner-node state assignments. In this way the algorithm of Sankoff (1975; all substitution costs and unit gap cost equal) can be seen as a generalization of the minimum mutation algorithm of Fitch (1971). Indeed, under the conditions of Sankoff (1975), each column of an optimal tree alignment specifies a character and set of inner-node state assignments that are also optimal under the conditions of Fitch (1971). The generalization lies in the fact that different optimal tree alignments for the same data on the same tree can imply different sets of Fitch characters (see Fig. 6.5b for examples). The algorithm of Sankoff and Cedergren (1983; tree alignments with step matrices) is a similar generalization of the algorithm of Sankoff and Rousseau (1975), which, in turn, generalized Fitch (1971) to accommodate differential weighting within characters. Under the conditions of Altschul (1989; different gap opening and gap extension costs), the costs of the different columns of a tree alignment are no longer independent. As a result, such tree alignments cannot be understood in terms of independent single-column positional characters.

As was the case with inner-node state assignments for simple single-column characters (compare, e.g., Figs. 6.2c and 6.2e), tree alignments on a given tree can be optimal or suboptimal. Sankoff and Cedergren (1983) called the cost of an optimal tree alignment for a set of observed sequences on a given tree the tree distance of those sequences on that tree. Their algorithm and similar ones (Sankoff 1975; Altschul 1989) can be used to calculate such tree distances and the reconstructions that come with them. In terms of the current approach, the tree distance as defined by Sankoff and Cedergren (1983) is the length of the sequence character on that tree. As such, the algorithms of, for example, Fitch (1971) and Sankoff (1975) are comparable in the sense that they both calculate the cost of an optimal reconstruction of a character on a tree. As will be discussed below, they are vastly different when it comes to computational complexity. For tree alignments, the second level of optimization (the problem of finding, among all possible trees, trees of minimal length or tree distance) is often called generalized tree alignment (e.g. Jiang and Lawler 1994; Vingron 1999) but other terms are used as well; Hein (1989a), for example, refers to it as the general parsimony problem.

6.3.2 Putative homologous sequences: a sequence of characters or a sequence character?

It has been argued that all substitution costs and the unit gap cost should be set equal in Sankoff (1975) style analyses of sequence data (Frost et al. 2001), a position that will be examined more closely later. However, first it is argued, in this subsection, that a general method of sequence alignment must by necessity move beyond prior multiple alignments (contra Simmons and Ochoterena 2000; Simmons 2004). The argumentation does not depend on the particular settings of the cost regime, but for clarity I tentatively accept the position of Frost et al. (2001) and contrast (equally weighted) Fitch (1971) analysis of prior alignments with Sankoff (1975) analysis of unaligned sequences.

When optimizing a sequence character on a tree, base-level correspondences among the observed sequences are not determined and fixed a priori but calculated as part of the optimization process, as already illustrated for three terminals in Fig. 6.5. The full implication of this can be seen when analyzing more than three sequences, such that alternative trees exist and have to be examined. Consider the data set of Fig. 6.6a. For four taxa
A–D, three unrooted trees exist: (A B)(C D), (A C)(B D), and (A D)(B C). Using Sankoff (1975), the latter two are both diagnosed at cost 3 (each time two substitutions and one indel) while (A B)(C D) comes at cost 4 (three substitutions and one indel). Looking at the two optimal trees, (A C)(B D) comes with the implied alignment of Fig. 6.6b, (A D)(B C) with the different implied alignment of Fig. 6.6c. So it is not just that base correspondences are not fixed prior to analysis, a posteriori they can be different in different optimal trees.

      (a)    (b)    (c)    (d)
A     gc     gc     gc     gc-
B     cg     cg     cg     cg-
C     c      -c     c-     --c
D     gg     gg     gg     gg-

Figure 6.6 A simple dataset (a) and three different multiple alignments (b, c, d).

A simple case of symmetry

The data set of Fig. 6.6a has a peculiar symmetry: when the labels of A and B are switched and the directions of all sequences reversed, the original data set is recovered. As such it provides a perfect example where mutually exclusive sets of putative homology statements cannot be distinguished at the level of character analysis. The higher-level hypothesis in this data set is that the sequences are orthologs. Within the orthologs, however, the symmetry makes it logically impossible to decide a priori if the single c of terminal C is to be considered homologous to the c in the second position of A or to the c in the first position of B. Conceptually, this is like the situation in bladderworts, discussed above, where it cannot be determined a priori if the vegetative system should be considered a shoot-like leaf or a leaf-like shoot system (even if the situation with bladderworts is more complex because there are still other homologizations that are considered acceptable on a priori grounds).

Turning to trees, the symmetry has, as a consequence, that these data cannot possibly distinguish between (A C)(B D) and (B C)(A D), two unrooted trees in which the labels of A and B have been exchanged. This conclusion follows directly and solely from the internal structure of the data set. As such it can be used to establish the following strong test for candidate phylogenetic methods: (A C)(B D) and (B C)(A D) should get the same score. Any method that does not meet this test is in serious trouble.

As discussed, Sankoff (1975) optimization diagnoses (A C)(B D) and (B C)(A D) at the same cost and thus meets the test. Turning to prior alignments, the first question is which prior alignments to consider. With data as simple as this it is easily established that the alignments in Figs 6.6b and 6.6c are the only valid candidates. All other alternatives, such as, for example, Fig. 6.6d, would need some special argumentation as to why, in this case, the c that is observed in terminal C should not a priori be considered homologous to the c that is observed in A or to the c that is observed in B. Given that it is accepted, a priori, that the sequences as a whole are homologous (they are putative orthologs), this seems hard to do. A Fitch (1971) analysis of alignment 6b yields tree (A C)(B D) at cost 3, with (B C)(A D) one step more costly; alignment 6c yields (B C)(A D), also at cost 3, and with (A C)(B D) one step more costly (in both cases, (A B)(C D) has a cost of 4). So, when looking at just one alignment, the two trees get a different score and the method fails the above test. As a result, depending on the prior alignment that is used, positive support is found for either (B C)(A D) or (A C)(B D), whereas in fact relationships are ambiguous.

Similar symmetry observations can be made with respect to alignments 6b and 6c: they can be turned into one another by exchanging the labels of A and B and reversing the direction of each sequence. Therefore, if either is considered optimal according to some criterion, the other should be as well. So a way out of the problem of finding spurious relationships with single prior alignments suggests itself: rather than to construct and analyse just one prior alignment, identify and analyse all different prior multiple alignments that are considered optimal, and accept only groups that are common to all. This may sound trivial but it raises the non-trivial question of how to calculate the relevant prior optimal multiple alignments. For this particular example, that question comes down to finding a criterion that gives an optimal score to the alignments of Figs 6.6b and 6.6c and a worse score to all other alignments.

Optimal alignments of two sequences can be calculated using dynamic programming algorithms
as pioneered, in biology, by Needleman and Wunsch (1970) and Sellers (1974). A description of the basic algorithm and some historic notes can be found in Kruskal (1983); extensions are reviewed in, for example, Gusfield (1997). For the current purpose, approaches that generalize such algorithms to more than two sequences can be grouped according to whether or not they use the tree-alignment approach.

      (a)    (b)    (c)
A     ct     ct     ct
B     c      -c     c-
C     tc     tc     tc
D     tc     tc     tc
E     tt     tt     tt

Figure 6.7 A simple data set (a) and two different multiple alignments (b, c). According to the SP criterion, alignment (b) is better than alignment (c) (SP scores 13 and 14).
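The SP (sums-of-pairs) scores quoted for Figs 6.6 and 6.7 can be recomputed with a few lines of code. The sketch below is illustrative only: the function name `sp_score` and its parameters are hypothetical, it scores a given multiple alignment rather than searching for an optimal one, and it assumes that all substitution costs and the unit gap cost are 1 and that a column pair in which both rows carry a unit gap costs nothing.

```python
from itertools import combinations


def sp_score(alignment, substitution_cost=1, unit_gap_cost=1):
    """Sum-of-pairs score of a multiple alignment given as {name: aligned row}."""
    total = 0
    for row1, row2 in combinations(alignment.values(), 2):
        for x, y in zip(row1, row2):
            if x == y:
                continue  # identical bases (or a shared unit gap) cost nothing
            total += unit_gap_cost if "-" in (x, y) else substitution_cost
    return total


fig_6_6b = {"A": "gc", "B": "cg", "C": "-c", "D": "gg"}
fig_6_6c = {"A": "gc", "B": "cg", "C": "c-", "D": "gg"}
fig_6_7b = {"A": "ct", "B": "-c", "C": "tc", "D": "tc", "E": "tt"}
fig_6_7c = {"A": "ct", "B": "c-", "C": "tc", "D": "tc", "E": "tt"}

for name, aln in [("6.6b", fig_6_6b), ("6.6c", fig_6_6c),
                  ("6.7b", fig_6_7b), ("6.7c", fig_6_7c)]:
    print(name, sp_score(aln))
```

Under these assumptions the two alignments of Fig. 6.6 should tie at 9, while those of Fig. 6.7 should come out at 13 and 14, so the SP criterion separates two alignments that are mirror images of each other with respect to the data.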
In optimal tree alignments, the kind of data symmetry in Fig. 6.6a is reflected directly in symmetry of calculations when comparing trees (A C)(B D) and (B C)(A D). So it was not just coincidence that the above Sankoff (1975) optimization of the data of Fig. 6.6a gave identical scores for those trees, with implied alignments that display among themselves the same symmetry as the data. Theoretically then, one could use a tree-alignment analysis to generate implied alignments that are next used as prior alignments. There would be no need to analyze the implied alignments, though, because their best trees would already have been identified in the preliminary tree alignment analysis. In fact, while the approach provides a solution to the problem discussed here, it actually comes down to giving up the notion that sequences should be aligned prior to tree evaluation and tree search.

Among the multiple alignment methods that do not use tree alignments, SP alignments or sums-of-pairs alignments (Murata et al. 1985; Carillo and Lipman 1988) and especially progressive alignment methods (e.g. Feng and Doolittle 1987; Thompson et al. 1994; Notredame et al. 2000) are probably most widely used. First consider SP alignments. An SP alignment of a set of sequences is an alignment for which the sum of pairwise alignment scores between all possible pairs of sequences is minimal. Setting all substitution costs and the unit gap cost to 1, it is easily verified that the alignments of Figs 6.6b and 6.6c have identical SP scores of 9, leaving the SP criterion as a potential solution to the problem.

Another case of symmetry

However, consider the data of Fig. 6.7a. Reading each sequence in reverse, nothing changes for B and E, but the sequence of A is turned into the sequence of C and D, and the sequences of C and D are turned into the sequence of A. Therefore, the structure of the data set is such that these data cannot distinguish between trees that differ only in the positions of A vs. (C D), as, for example, the pair (B (C D) (A E)) and (A B (E (C D))). Using Sankoff (1975), these trees both have a cost of 3, which is the optimal cost over all trees as well. Tree (B (C D) (A E)), or any other tree that has an AE–BCD partition, comes with optimal implied alignment 7b; tree (A B (E (C D))), or any other tree that has an AB–CDE and an ABE–CD partition, comes with alignment 7c. As above, these implied alignments have among themselves the same symmetry as the unaligned data. So Sankoff (1975) optimization does not tell these trees apart, and correctly so.

This is necessarily so as long as the ancestor of C and D has a reconstructed sequence that is identical to and perfectly aligned with the sequences of C and D in optimal tree alignments. If this is the case, the data symmetry is directly reflected in the Sankoff (1975) calculations that are performed on the two trees that are involved, and an identical cost on both trees follows. The assumption about the reconstructed sequence for the ancestor of C and D is easily proved by showing that its negation leads to a contradiction. Assume that an optimal tree alignment exists in which the ancestor of C and D has a sequence that is different or differently aligned. In that case, the tree alignment can be improved (contradicting the premise) by changing that ancestor and its alignment as indicated above. That this is an improvement can be seen as follows: for any position in the ancestor of C and D with an entry (base or unit gap) that is different from the base at the corresponding position in C and D, changing that entry into the corresponding entry of C and D will improve the cost
by two mutations; at the same time, that change can incur at most one additional mutation, between the ancestor of C and D and the third node to which this ancestor is connected. So, in conclusion, optimal tree alignments are not tricked by data symmetries such as in Fig. 6.7.

This does not hold for SP alignments: alignment 7b has a better SP score than alignment 7c (13 vs. 14; the score for 7b is optimal), proving the case by counterexample. As a result, if the SP criterion were used to construct and select prior alignments, alignment 7b would be selected and trees with AB–CDE and ABE–CD partitions considered suboptimal in the subsequent phylogenetic analysis. To salvage the approach, one could consider examining suboptimal SP alignments, like 7c, up to the degree that all prior alignments have been accepted that are involved in symmetries such as in Figs. 6.6a and 6.7a. But this would not work, for two reasons. First, there is no general way to tell how far one has to descend into suboptimality before all relevant alignments have been taken into account. Second, many additional and unwanted alignments might pass as well. So accepting suboptimal SP alignments cannot be a general solution to this problem of data symmetry.

Similar problems can arise with progressive alignments using guide trees (e.g. Thompson et al. 1994; see also Feng and Doolittle 1987). Such trees are usually constructed on the basis of a square overall distance matrix that is derived from pairwise alignment scores. Multiple alignment then proceeds by traversing this tree from terminals to the root. At each node that is visited, a partial multiple alignment is created that includes and combines the partial alignments that are found at the daughter nodes (terminal nodes are initially assigned a trivial partial alignment that includes just the observed sequence of that node). In this way, all sequences are included in the alignment after the root node has been visited. At any node, the alignment of partial alignments mostly proceeds by using some modification of the SP criterion, considering only those pairwise alignments across the node being considered. Moreover, this criterion is mostly applied only locally: gaps that have been inserted before will never be removed. In general, this group of methods cannot guarantee that symmetries as discussed here are properly taken into account.

A case of local symmetry

Based on the premise that multiple alignments should be constructed prior to tree search on the basis of a similarity criterion, Simmons (2004, p. 876; see also Ochoterena 2004) recently proposed the following tree-independent procedure for constructing optimal prior alignments. In a first step, construct one or more multiple alignments using, for example, programs that try to maximize (an unspecified measure of) similarity, or information from secondary structure. Next, evaluate these alignments using the number of ‘differences’ that are implied, and try to lower that score by adjusting those alignments. Such adjustments can be done manually or, ideally, using optimization programs. The rationale is to further increase the amount of similarity that is present in the alignment. The best alignments that are obtained are then subjected to parsimony analysis.

In the above, the number of differences is best explained by first looking at a regular data set such as in Fig. 6.1. For each character in the data set, the observed variation m (Farris 1989a, p. 417) is one less than the number of states in the character, and that number is the minimum number of steps that the character can have on any tree. The observed variation for the data set as a whole, M, is the sum of the observed variation in all its characters, and can be interpreted as the number of steps that the best tree for the data set would have if all characters were congruent. If indel events did not occur, the number of differences in the sense of Simmons (2004) would be equal to M. But indel events do occur and complicate matters because single indel events can affect multiple columns of an alignment. However, as will be clear below, further details of the calculations that are involved in such cases (see, for example, Simmons and Ochoterena 2000) are not required for the current argument. Simmons (2004) observed that minimization of differences in this sense can lead to trivial alignments that require only as many indels as there are sequences in the data set, irrespective of the tree being considered (see Fig. 6.13c, below, for an example). To circumvent that problem,
        (a) first set          (a) second set   (b)     (c)     (d)
out     ttttttttttggggtttt     tcca             tcca    tcca
A       aattttttttggggtttt     c                -c--    --c-
B       aaaattttttggggtttt     c                -c--    --c-
C       aaaaaattttggggtttt     c                -c--    --c-
D       aaaaaaaattggggtttt     c                -c--    --c-    I  c
E       aaaaaaaaaaggggaaaa     cg               -cg-    -cg-    E  cg
F       aaaaaaaaaacccctttt     gc               -gc-    -gc-    F  gc
G       aaaaaaaaaaccccaaaa     aca              -aca    aca-    G  aca
H       aaaaaaaaaaccggaatt     gg               -gg-    -gg-    H  gg

Figure 6.8 An example of localized data symmetry. (a) A data set consisting of two sets of putative homologous sequences. (b, c) Two multiple alignments for the second set. (d) Reduced data set that exhibits the same kind of symmetry as discussed for Fig. 6.6.
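The kind of symmetry named in the caption of Fig. 6.8 can be stated as a simple check: reverse every sequence, exchange two labels, and compare with the original data. The following sketch is illustrative only; the function name `is_reversal_symmetric` is hypothetical, and the data are transcribed from Figs 6.6a and 6.8.

```python
def is_reversal_symmetric(data, first, second):
    """True if reversing every sequence and swapping two labels recovers `data`."""
    mirrored = {name: seq[::-1] for name, seq in data.items()}
    mirrored[first], mirrored[second] = mirrored[second], mirrored[first]
    return mirrored == data


fig_6_6a = {"A": "gc", "B": "cg", "C": "c", "D": "gg"}
fig_6_8d = {"I": "c", "E": "cg", "F": "gc", "G": "aca", "H": "gg"}
fig_6_8a_second_set = {"out": "tcca", "A": "c", "B": "c", "C": "c", "D": "c",
                       "E": "cg", "F": "gc", "G": "aca", "H": "gg"}

print(is_reversal_symmetric(fig_6_6a, "A", "B"))             # data of Fig. 6.6a
print(is_reversal_symmetric(fig_6_8d, "E", "F"))             # reduced data of Fig. 6.8d
print(is_reversal_symmetric(fig_6_8a_second_set, "E", "F"))  # full second set
```

The first two checks should succeed, whereas the full second set of Fig. 6.8a should fail this global check because the outgroup sequence is not its own reverse, which is consistent with describing the symmetry as local.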
Simmons (2004, p. 876) suggested not to add positions to alignments as obtained in the first step during possible adjustments in the second step.

This optimality criterion assigns the same scores to the symmetric alignments of Figs. 6.6 and 6.7, and in each case all other alignments have a worse score. Therefore this approach could correctly identify the relevant prior alignments for these problematic data sets. However, consider the data set of Fig. 6.8a, a case where two different sets of putative homologous sequences are analysed simultaneously (the example uses two sets of sequences for reasons of clarity only; similar examples can be constructed that use only one set of putative homologs). The structure of the first set of sequences jumps out so clearly that it is easily seen that the best trees for that part of the data are (out (A (B (C (D (E (H (F G)))))))) and (out (A (B (C (D (F (H (E G)))))))). Moreover, it is easily established that the first set of sequences is so strongly structured that the problem of finding the best trees for the data set as a whole reduces to evaluating the second set of sequences on those two trees.

In both trees, consider the ancestor of terminals D–H and this second set of sequences. In each case, that node will be optimized as c for the alignments of Figs. 6.8b and 6.8c, or indeed for any alignment in which the c’s of terminals A–D are aligned (it is easily seen that such must be the case for optimal explanations). Next consider the data set of Fig. 6.8d, where terminals out, A, B, C, and D have been replaced by a single hypothetical terminal I that is assigned that reconstructed sequence c. This reduced data set exhibits the same kind of data symmetry as discussed above: change the labels of E and F, reverse the direction in which the sequences are read, and the original data set is recovered. Considering all this, the second set of sequences of Fig. 6.8a cannot be used to distinguish between the two candidate trees, as these only differ in their relative positions of E and F. Therefore, any method that assigns different scores to these trees for these data is in serious trouble.

The algorithm of Sankoff (1975) properly takes into account data symmetries such as in Fig. 6.8d. It also treats the whole data set of Fig. 6.8a correctly, which can be shown, as above, by observing that optimal tree alignments on optimal trees have to reconstruct the ancestral sequence for terminals D–H as c, and such that this c is aligned with the c’s of terminals A–D. The score for the complete data set of Fig. 6.8a on both trees is 30, and this is also the optimal score. Two corresponding implied alignments are shown in Figs. 6.8b and 6.8c. As above, these display the same symmetry as the raw data (other optimal tree alignments exist, but that does not affect the argumentation).

Evaluating these implied alignments using the criterion of Simmons (2004) cannot be done by simply summing over isolated columns because some gaps affect more than one column, and more elaborate calculations are required. However, these are not really required in this case because reversing the sequences in both alignments establishes mutual symmetry of gap positions for such calculations. So, whatever the contribution of the gaps in the first alignment, it will be the same in the second and their unit gaps can therefore be treated as missing entries for the purpose of assessing the relative scores of the alignments. This results in relative score three for Fig. 6.8b but four
for Fig. 6.8c, and the procedure of Simmons (2004) therefore would lead to prior rejection of the alignment of Fig. 6.8c. The net result is that this procedure leads to rejection of a tree that the data cannot distinguish from a tree that it accepts.

Comparing the alignments of Figs 6.8b and 6.8c, the preference of the optimality criterion of Simmons (2004) for the first one boils down to the fact that it puts the last a of terminal G in the same column as the last a of the outgroup. But on the best tree for this alignment, the a that G and the outgroup share cannot be explained as identical by common descent and inheritance. Consider the consequences of this observation in the light of the overall analysis, where tree (out (A (B (C (D (E (H (F G)))))))) is accepted but (out (A (B (C (D (F (H (E G)))))))) rejected. Given the local symmetry in the second sequence character, both trees explain the data equally well, albeit with different posterior homologizations of positions and base identities. But they are different in their amounts of homoplasy: overall, the first tree has a homoplastic pairwise base similarity (the last a of terminal G and the outgroup) that the second tree lacks. Moreover, the preference for the first tree when using the procedure of Simmons (2004) is based solely on this difference: of the two trees with equal amount of similarity that can be explained as homology, it selects the tree that has the higher amount of homoplasious similarity. In more complex cases, this effect can ultimately lead to rejection of trees with higher amounts of homologous similarity in favour of trees with lower amounts of homologous similarity. The same problem can also occur with the related tree-independent optimality criteria for multiple alignments that have recently been discussed by Carpenter (2003, pp. 6–7) and Nixon and Little (2004).

General conclusions
None of this is accidental. Data symmetries such as in Figs 6.6a, 6.7a, and 6.8a have a consequence that no distinction can be made between particular trees or groups of trees. As a result, methods of analysis that do not directly take into account the structure of trees (e.g. SP alignment or the procedure of Simmons 2004), or do so in a way that violates the symmetry (e.g. progressive alignment, or even just the use of suboptimal tree alignments), will not in general be able to deal with such situations. This leaves, by definition, optimal tree alignment methods. As a corollary, unless one is willing to defend methods that in some cases can give different scores to trees that cannot be distinguished by the data at hand, alignment and tree search cannot be properly separated in phylogenetic analysis of sequence data. Note that this conclusion is argued and reached in logical space. Whether or not it results in a practically feasible method will be discussed below.

The examples of Figs 6.6a, 6.7a, and 6.8a are unusual in that some terminals have sequences that are the exact reverse of other sequences, a situation that will hardly if ever arise in real data sets. But such perfect crab canons are not necessary for the phenomenon to occur. Sequences such as those can be embedded as short motifs in longer sequences that as a whole are not identical when read in reverse, and similar distortions could result. For simple examples as above, one could argue that the problem can easily be spotted and solved by carefully inspecting the data and the alignments by eye, but this approach would no longer work in such more complex cases.

In addition, the motifs that are involved do not have to be identical when read in reverse, only their alignment scores with the other sequences must remain unchanged. Lastly, even when the symmetry in the motifs is not perfect, by deviations in motif sequence and/or substitution costs that are involved, systematic distortions, though less well defined, would still arise. So situations where short subsequences can have alternative optimal alignments, with different local costs on different trees, may well be relatively common in empirical data. Moreover, when such data sets are aligned progressively according to a guide tree (using, for example, CLUSTAL; Thompson et al. 1994), such ambiguities that include groups of the guide tree may systematically be resolved in favor of the guide tree.

Summarizing, alignment and tree evaluation cannot be properly separated in phylogenetic analyses of sequence data. As a consequence, the view that a set of sequences that are deemed putative homologues should be turned into a sequence of
positional characters prior to tree search and evaluation is erroneous or at best incomplete. Instead, such sequences constitute a single complex character, a sequence character, that can be optimized on trees using optimal tree alignment algorithms such as that of Sankoff (1975). These conclusions follow from very general considerations of data symmetry and do not depend on details of the cost regime that is used.

6.3.3 Quantifying and maximizing homology in sequence characters
Frost et al. (2001, pp. 354–355; they use the term ‘indel’ for a unit gap as used here) discussed the method of direct optimization (Wheeler 1996), and argued for setting all substitution costs and the unit gap cost equal because this amounts to equal weighting of all hypothesized transformations, which in turn ‘renders the highest degree of descriptive efficiency and maximizes the explanatory power of all lines of evidence (i.e. characters).’ Direct optimization has been proposed and is still often discussed as a sequence optimization method that is qualitatively different from optimal tree alignment methods, but the method is best seen as a heuristic approximation for optimal tree alignments (De Laet and Wheeler 2003; see also below), and the claimed novelty of the approach rests on a lack of familiarity with or misunderstanding or misrepresentation of the work of Sankoff (1975) and Sankoff and Cedergren (1983) (see, e.g., Wheeler 1996, 1998; Giribet and Wheeler 1999; Phillips et al. 2000; Wheeler 2001b, 2002, 2003a). Therefore, the argumentation of Frost et al. (2001) amounts to a preference for the minimum mutation algorithm of Sankoff (1975).

Consider the sequence character aaa, gat, and agt and two alternative tree alignments on the single tree for three terminals as presented in Fig. 6.9. With the above cost regime, tree alignment 9a is better than 9b (three steps versus four). On the other hand, when looking at independent accommodated pairwise similarities, as a measure of the amount of similarity that can be explained as homology, 9b performs better than 9a: it accommodates one more independent pairwise base match. This should not come as a surprise. For pairwise alignments, Smith et al. (1981; their equation 4b with $w_k = 0$) showed that maximization of base-to-base matches is equivalent to minimization of cost when all base substitution costs are set at twice the unit gap cost, a different regime than advocated by Frost et al. (2001). This result of Smith et al. (1981) cannot directly be extended to comparisons of more than two sequences, but a generalization to tree alignments (see below) still yields a cost regime that is different from the one favored by Frost et al. (2001). With more than three sequences, this difference can lead to a preference for different trees.

[Figure 6.9 panels: (a) terminals aaa, agt, and gat with inner node aat; (b) terminals aaa-, ag-t, and -gat with inner node agat.]
Figure 6.9 Two different tree alignments of the putative homologues aaa, agt, and gat on the single tree for three sequences. (a) This reconstruction requires three steps (three substitutions, no indels) and retains three independent pairwise base similarities among observed sequences. (b) At four steps (one substitution, three indels) this reconstruction requires one more transformation, even if it retains one more independent pairwise similarity among observed sequences.

On a general level, this example merely reflects the well-known fact that the choice of substitution, gap opening, and gap extension costs affects the result of alignment and tree-building procedures. When examining the logical basis of sequence analysis, however, the paradoxical situation arises that the objectives of maximizing explanatory power and maximizing independent homologous similarity seem to be at odds. As discussed below, this contradiction is only apparent because the premises at either side of the comparison are faulty: setting all costs equal does not maximize explanatory power, and independent base-to-base homologous similarity is not all there is to sequence homology.

Subsequence homology and compositional homology
The latter is easily seen when considering a data set, such as in Fig. 6.10, where sequences differ only in length. The two tree alignments that are shown do not differ in the number of independent
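[Figure 6.10 panels, as far as they can be recovered from the extracted layout:
(a) A aaa; B aaa; C aaaaaaa; D aaaaaaa; E aaaaaaa
(b) Tree in which A and B are grouped: A aaa----; B aaa----; their inner node aaa----; C, D, E, and the remaining inner nodes aaaaaaa.
(c) Tree in which A and B are separated (A apparently next to D, B next to E): A aaa----; B aaa----; C, D, E, and all inner nodes aaaaaaa.]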
Figure 6.10 A data set in which the sequences only differ in their lengths (a) and two trees with optimal inner-node reconstructions and
positional correspondences under the assumption that insertion/deletion of a stretch of contiguous bases is counted as one transformation (b, c).
Double bars indicate indel events. Note that on each tree alternative sets of optimal positional correspondences exist.
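The indel counts implied by the two trees of Fig. 6.10 can be checked with a small sketch. Because the sequences of this data set differ only in length, counting indel events here comes down to optimizing sequence length as an unordered character, which the sketch does with Fitch (1971) optimization. The function name `fitch_steps`, the rooted nested-tuple encoding of the trees, and the exact topologies (A next to B in the first tree; A next to D and B next to E in the second) are assumptions made for illustration; the topologies are read off the figure layout.

```python
def fitch_steps(tree, states):
    """Fitch optimization on a rooted binary tree given as nested 2-tuples.

    Returns (state set at the root of `tree`, number of steps below it).
    """
    if isinstance(tree, str):            # terminal: its observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    common = left_set & right_set
    if common:                           # no step needed at this node
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


# Data set of Fig. 6.10a; only the lengths matter here.
seqs = {"A": "aaa", "B": "aaa", "C": "a" * 7, "D": "a" * 7, "E": "a" * 7}
lengths = {name: len(seq) for name, seq in seqs.items()}

# Hypothetical rootings of the two unrooted trees (rooting does not
# change the Fitch step count).
tree_b = ((("A", "B"), "C"), ("D", "E"))   # A and B grouped
tree_c = ((("A", "D"), "C"), ("B", "E"))   # A and B separated

steps_b = fitch_steps(tree_b, lengths)[1]
steps_c = fitch_steps(tree_c, lengths)[1]
print(steps_b, steps_c)
```

Under these assumptions the first tree needs one length change and the second needs two, which corresponds to the shared absence of bases 4–7 in A and B being explained once on the first tree and twice on the second.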
base-to-base matches among observed sequences that they accommodate: in both cases there are 20 independent base-to-base comparisons, and all these are matches. Yet, the first tree alignment can be considered a better explanation of the data at hand because it captures an element of homologous similarity between the sequences of A and B that is not retained in the second one. However the tree of the first tree alignment is rooted, A and B share the absence of bases 4–7 with their direct ancestor. Depending on the position of the root, these three contiguous nodes lack the insertion of that subsequence, or they share its deletion; in both cases, this comes down to one unit similarity that can be explained as a homology. On the second tree alignment, the shared absence of bases 4–7 in A and B must be explained as a homoplasy. The main conclusion that can be drawn from this simple example is that sequence homology has a component that cannot be reduced to mere base-to-base composition. This component I shall refer to as homology of subsequences, as opposed to base-to-base or compositional homology within homologous subsequences.

The two components of sequence homology can be optimized separately but there would be little use in doing so. When just optimizing base-to-base similarities, gaps will be inserted ‘at will’ to maximize matches (Smith et al. 1981, p. 42). On the other hand, maximizing subsequence homology without regard for the composition of those subsequences comes down to optimizing the length of the observed sequences as a regular unordered character, irrespective of the number of substitutions that are implied. Optimized in isolation, neither will in general result in a globally optimal explanation of the data.

Instead, what is needed is an optimal balance between subsequence and compositional homology. This optimal balance can be found by using a cost regime that is the sum of the two cost regimes that are involved, provided that there is a mechanism to avoid or deal with logical contradictions between optimizations of both components. Such a mechanism is implicit in tree alignments because tree alignments are internally consistent explanations of the data. Therefore, expressions to describe the amount of subsequence homology and the amount of compositional homology in tree alignments can be derived independently and then simply summed to get an expression for the total amount of sequence homology. This expression, finally, can be used for purposes of optimization.

Quantifying the amount of subsequence homology of a tree alignment
The amount of subsequence similarity in a tree alignment that can be interpreted as homology can be measured indirectly and in a relative way by
counting $n_{indels}$, the number of independent indel events, provided that the insertion/deletion of a series of contiguous bases is counted as a single event. This is so because each such indel event effectively marks a subsequence that is not homologous across a branch. Therefore, an independent indel event can be seen as an independent unit of non-homology in subsequence homology.

As discussed above, the cost of a tree alignment is obtained as the sum of the costs of the pairwise alignments across the branches of the tree. Technically, counting independent indel events in such a pairwise alignment is achieved by setting substitution costs to 0, gap opening cost to 1, and gap extension cost to 0. In addition, when evaluating such a pairwise alignment, paired gaps have to be removed first, a procedure that Altschul (1989) called projection. Projection is required because paired gaps just indicate that both sequences miss something that is present elsewhere on the tree and because the indel events that caused such a shared absence are accounted for along other branches. As an example, going from -gaat---ccct- to -gaat--ccccc- in, for example, the second tree alignment of Fig. 6.14 (see below) means going from gaat-ccct to gaatccccc. As far as subsequence homology is concerned, this comes at cost 1 (1 times the gap opening cost of 1 plus 0 times the extension cost of 0).

Quantifying the amount of compositional homology of a tree alignment
Specifying an expression for compositional similarity that can be explained as homology is more elaborate. A tree alignment can be seen as a regular multiple alignment with, for each position, reconstructions at the inner nodes. If, in a single column, the tree path between two observed bases passes through an inner node that is optimized as a unit gap character, these bases are not comparable because they are part of non-homologous subsequences; if, on the other hand, the connecting path has no nodes with unit gaps, they belong to homologous subsequences; more specifically, they occur at the same position within those homologues. I refer to such bases as comparable bases.

The observed bases in a single column of a tree alignment can be sorted into a number of groups such that two bases from the same group are comparable but two bases from different groups are not comparable. I shall refer to these groups of comparable bases as subcharacters, a concept that is closely related to the concept of regions as defined above, and denote the number of subcharacters in a column of a tree alignment as $n_{scc}$. This number is related but not identical to the number of indel events in which this column of the alignment is involved.

Within a subcharacter, denote the number of observed bases as $n_{obsc}$. If two such bases are identical and all nodes in the path that connects them are labeled with that same base, then the two bases match and their shared presence can be explained as a homology. If any node in the path that connects two such identical bases has a base that is different, then they do not match and their shared presence cannot be explained as a homology. Two non-identical bases of a subcharacter or two bases that belong to different subcharacters, finally, do not contribute to base-to-base homology. The minimum number of pairwise comparisons that have to be made to classify the bases of a subcharacter into subgroups of such matching bases is $n_{obsc} - 1$. The number of mismatches $n_{mmsc}$ in any such set of $n_{obsc} - 1$ independent pairwise comparisons can be thought of as the number of base substitutions or steps within the subcharacter.

With these definitions, the amount of compositional homology in a subcharacter is obtained just as the amount of homology in a regular character: the maximum number of independent pairwise comparisons minus its number of steps, or $n_{obsc} - 1 - n_{mmsc}$. With $n_{obc}$ the total number of observed bases and $n_{mmc}$ the total number of substitutions in a column of a tree alignment, the amount of compositional homology in a column is $n_{obc} - n_{scc} - n_{mmc}$. The amount of compositional homology in the whole tree alignment is the sum of this value over all columns. Switching signs, $n_{scc} + n_{mmc} - n_{obc}$ describes a cost function that decreases as compositional homology in a column increases. In this expression, $n_{scc}$ can be considered a cost factor that accounts for local loss of compositional homology due to indel events (that may
encompass multiple neighbouring columns), and $n_{mmc}$ a regular substitution cost factor.

Maximizing homology in sequence characters
Adding it up, the total amount of homology of different tree alignments for a given set of sequences can be compared using cost function $n_{indels} + \sum (n_{scc} + n_{mmc} - n_{obc})$, where the summation is over all columns of the tree alignment: the lower the cost, the higher the amount of homology. In this expression, losses in subsequence homology and compositional homology are weighted equally. Differential weighting, for example to downweight subsequence homology, can be done by applying different weights to the two terms that are involved. As $\sum n_{obc}$ is identical for different tree alignments for the same data, the cost function for a tree alignment can be reduced to $n_{indels} + \sum (n_{scc} + n_{mmc})$. Using $n_{subc}$ for $\sum n_{scc}$ and $n_{subst}$ for $\sum n_{mmc}$, the relative amounts of total homology of two different tree alignments can be compared using $n_{indels} + n_{subc} + n_{subst}$, the sum of indel events, subcharacters, and substitutions.

Alternatively, the problem can be presented as a maximization of a similarity measure; this similarity measure would count independent homologous base-to-base matches but assign a penalty to indel events, much as in the original algorithm of Needleman and Wunsch (1970). More specifically, the penalty would be 1 for each indel event in the tree alignment, irrespective of the length of the indel. In comparisons of two and three sequences, such similarity measures with length-independent gap penalties have been studied by Fredman (1984) (fide Hein 1989a, p. 650).

In Figs. 6.11–6.15, the positions of all inferred indel events are indicated throughout the tree alignments, using vertical bars. The subsequences that are defined in that way can be considered logical subsequences. In simple cases, such logical subsequences are identical to the subsequences that effectively take part in the inferred indel events (e.g. Figs 6.11–6.14), but in more complex cases a single inferred indel event along a particular branch can affect a series of contiguous logical subsequences (see Fig. 6.15 for examples). The total number of subcharacters in a tree alignment can be easily determined as the sum of the lengths of its different homologous logical subsequences.

For any given tree alignment, $n_{indels} + n_{subc} + n_{subst}$ is a straightforward expression that is easily checked, but finding the tree alignment(s) for which this expression is minimal is quite something else. Even for a single given tree, the problem of deciding if a tree alignment is optimal has been shown to be NP-complete (Wang and Jiang 1994). Algorithmically, as the subsequence homology component requires use of variable gap costs (gap opening cost 1, gap extension cost 0), the algorithms of Sankoff (1975) and Sankoff and Cedergren (1983) are not adequate. Altschul (1989) does accommodate variable gap costs but still this is not sufficient because his algorithm does not keep track of the number of subcharacters in a column. This directly implies that the current cost function cannot be expressed just in terms of substitution, gap opening, and gap extension costs. To optimize this function, the dynamic programming recurrences of Altschul (1989) would have to be adapted and extended to keep track of observed bases and subcharacters in columns as well.

6.3.4 Discussion
So, when applied to sequence data, the simple principle of maximizing similarity that can be interpreted as homology, in a logically correct way, leads to a preference for those trees on which the sum of indel events, base substitutions, and subcharacters is minimal. In this final subsection, some properties and wider connections of this parsimony criterion are discussed.

Heuristics
Even with simple Hamming distances, as when using Fitch (1971) optimization of prior alignments, the problem of deciding if a tree is optimal is NP-complete (Foulds and Graham 1982). So, when combining tree search and tree alignment, one NP-complete problem is nested within another. As pointed out by Hein (1989a, p. 651), the computational complexity of this problem makes the use of heuristic approximations unavoidable. Examples of algorithms for heuristic approximations of optimal tree alignment costs, or
algorithms that can be interpreted as such, can be found in, for example, Sankoff et al. (1973, 1976), Hein (1989a, b), Jiang and Lawler (1994), Wang et al. (1996), Wheeler (1996, 1999, 2003c; all available in Wheeler et al. 2003, where they are tightly integrated with a wide range of tree search heuristics; see De Laet and Wheeler 2003), and Schwikowski and Vingron (1997, 2003). Still other approaches can be found in the reviews of Vingron (1999) and Notredame (2002).

It currently remains largely an open question how well these various approaches perform in practice. In the end, even the use of an a priori alignment can be seen as a quick and dirty heuristic for the analysis of a sequence character. Even if any single such analysis is too shallow to be satisfactory, analyses of many different prior alignments may be effectively combined into a more elaborate search strategy, following the heuristic logic as developed in Farris et al. (1996) (see also Goloboff and Farris 2001).

Most heuristic tree alignment methods attack the optimal tree alignment problem by approximate decomposition into a set of simpler problems that can easily be solved exactly using pairwise alignments (e.g. Hein 1989a; Wang et al. 1996; Wheeler 1996, 1999) or threewise alignments on a star tree (e.g. Sankoff et al. 1973; Wheeler 2003c). Interestingly, compositional homology in a pairwise alignment amounts to the number of base matches, a number that can be maximized by setting the unit gap cost to half the substitution cost (Smith et al. 1981). To maximize total sequence homology in a pairwise alignment, an additional penalty has to be added for losses in subsequence homology, which, as discussed above, can be done using the gap opening cost. With equal weighting of both components of homology, this penalty equals the substitution cost. As an example, using a substitution cost of 2, the corresponding gap opening cost is 2 + 1, and the corresponding gap extension cost 1. The same result holds for threewise comparisons on a star tree.

Beyond three sequences this simpler cost regime is no longer equivalent to the criterion developed here, as can be seen from the following counterexample. The tree alignment of Fig. 6.11b explains the sequence character of Fig. 6.11a better than Fig. 6.11c because it can explain an additional independent pairwise base match: the a that terminates the sequences of B and D. This difference is correctly measured by the sum of indels, subcharacters, and substitutions, but with the simpler cost regime, both tree alignments come at the same cost of 12. In more complex examples, such situations can lead to a preference for different trees altogether. The simpler cost regime may nevertheless be a good choice when using heuristic tree alignment methods that are based on pairwise or threewise comparisons of sequences.

For some approximation methods an upper bound can be established for their deviation from optimality. As an example, consider lifted alignments (Jiang and Lawler 1994; Wang et al. 1996; see also Wheeler 1999; Lutzoni et al. 2000), in which possible inner-node sequences are chosen from and restricted to the set of observed sequences. Under these restricted conditions, an efficient algorithm exists to find the optimal assignments of sequences to inner nodes of a given tree, and the resulting tree alignment can be shown to have a cost that is at most twice the cost of the unrestricted
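[Figure 6.11 panels, as far as they can be recovered from the extracted layout:
(a) A ggg; B ggga; C ttt; D ttta
(b) A ggg|-; B ggg|a; inner node next to A and B ggg|a; inner node next to C and D ttt|a; C ttt|-; D ttt|a
(c) A ggg|-; B ggg|a; inner node next to A and B ggg|-; inner node next to C and D ttt|-; C ttt|-; D ttt|a]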
Figure 6.11 An example of the parsimony criterion for sequence characters. (a) A sequence character. (b) An optimal tree alignment on the optimal tree.
(c) A suboptimal tree alignment on the optimal tree (same number of indel events and substitutions, but one more subcharacter). Single bars across
branches indicate substitutions, double bars indel events. Logical subsequences are indicated using vertical bars, and numbered for clarity.
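The contrast between the simpler pairwise-derived cost regime and the sum of indel events, subcharacters, and substitutions can be sketched numerically for Fig. 6.11. The function names and the tallies below are assumptions for illustration: three substitutions (ggg versus ttt), two single-base indel events per alignment, and logical subsequence lengths as numbered in the figure. A gap of length L is charged as gap opening plus (L - 1) times gap extension, following the worked pairwise example in the text.

```python
def simpler_regime_cost(n_subst, indel_lengths, sub_cost=2):
    """Substitution cost 2, gap opening 2 + 1, gap extension 1 (defaults)."""
    gap_ext = sub_cost / 2             # unit gap cost: half the substitution cost
    gap_open = sub_cost + gap_ext      # extra penalty equal to the substitution cost
    gap_total = sum(gap_open + (length - 1) * gap_ext for length in indel_lengths)
    return n_subst * sub_cost + gap_total


def homology_cost(n_subst, indel_lengths, subsequence_lengths):
    """Indel events + subcharacters + substitutions."""
    n_indels = len(indel_lengths)
    n_subc = sum(subsequence_lengths)  # sum of logical subsequence lengths
    return n_indels + n_subc + n_subst


# Tallies assumed from Fig. 6.11b and 6.11c (hypothetical encoding).
fig_b = dict(n_subst=3, indel_lengths=[1, 1], subsequence_lengths=[3, 1])
fig_c = dict(n_subst=3, indel_lengths=[1, 1], subsequence_lengths=[3, 1, 1])

simple_b = simpler_regime_cost(fig_b["n_subst"], fig_b["indel_lengths"])
simple_c = simpler_regime_cost(fig_c["n_subst"], fig_c["indel_lengths"])
full_b = homology_cost(**fig_b)
full_c = homology_cost(**fig_c)
print(simple_b, simple_c, full_b, full_c)
```

With these tallies the simpler regime cannot separate the two tree alignments (both cost 12), whereas the three-term sum differs by the one extra subcharacter noted in the caption.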
optimum for that tree (Wang et al. 1996). As discussed by Gusfield (1997, p. 358), such bounded-error approximation methods can help to understand the behaviour of difficult optimization problems; from a practical point of view, they may be combined with other methods, such as local improvement methods, to obtain more elaborate heuristic search strategies.

Inapplicables
The example of Fig. 6.12 illustrates that indel events divide the sequences of the tree alignment into subsequences that can be considered independently: the two optimal alignments that are shown have identical subsequences and only differ in the way that those subsequences (and their subcharacters) are presented. Incidentally, this example also shows that postulated indel events may improve the explanation of the data even in cases where all observed sequences have the same length. This independence is a direct consequence of the fact that, in the current approach, base-to-base comparisons are only made within subsequences that can be explained as homologs. As a consequence, comparisons of sequences and their bases automatically occur at the correct levels of generality, and the problems with inapplicables that Maddison (1993) described simply dissolve. Indeed, Maddison (1993, p. 580) observed that all solutions that he considered to deal with inapplicables were in the end problematic because they did not properly restrict counting of steps to parts of trees where comparisons were valid, and he correctly surmised that an eventual solution would lie in the development of new algorithms. Most cases of inapplicability, however, would not require an algorithm as complex as the one discussed here, because there are fewer degrees of freedom in a priori acceptable hypotheses of homology.
[Figure 6.12 panels, as far as they can be recovered from the extracted layout; tree (A B)(C D):
(a) A aaaattt; B tttaaaa; C gggcccc; D ccccggg
(b) A ---|aaaa|ttt; B ttt|aaaa|---; inner node of (A B) ---|aaaa|---; inner node of (C D) ---|cccc|---; C ggg|cccc|---; D ---|cccc|ggg. subs: 4; subc: 16; indels: 4; total: 24
(c) A ------|aaaa|ttt|; B ---|ttt|aaaa|------; inner node of (A B) ------|aaaa|------; inner node of (C D) ------|cccc|------; C ggg|---|cccc|------; D ------|cccc|---|ggg. subs: 4; subc: 16; indels: 4; total: 24
(d) A aaaattt; B tttaaaa; inner node of (A B) tttaaaa; inner node of (C D) gggcccc; C gggcccc; D ccccggg. subs: 19; subc: 7; indels: 0; total: 26]
Figure 6.12 An example of the parsimony criterion for sequence characters. (a) A sequence character in which all sequences have equal length.
(b, c, d) Three tree alignments of the character on the optimal tree (A B)(C D). The first two, requiring four indel events, are optimal; the third,
not requiring indel events, is suboptimal by two units. The two optimal alignments that are shown imply the same five subsequences that take part
in indel events and differ only in the way that these subsequences are presented (still other possibilities exist). Subs, subc, and indels are numbers
of substitutions, subcharacters, and indel events. Single bars across branches indicate substitutions, double bars indel events. Logical subsequences
are indicated using vertical bars, and numbered for clarity.
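The three tallies reported in Fig. 6.12 can be recomputed from a fully specified tree alignment. The sketch below is one possible reading of the definitions in the text, not the author's implementation: indel events are counted per branch after projection, substitutions are counted across branches between comparable bases, and subcharacters are counted per column as groups of observed bases connected by gap-free paths. The node names X and Y for the inner nodes, the function names, and the bar-free sequence strings are assumptions for illustration.

```python
from itertools import groupby


def project(a, b):
    """Projection: drop columns in which both sequences have a gap."""
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in cols), "".join(y for _, y in cols)


def indel_events(a, b):
    """Runs of gapped columns after projection (gap opening 1, extension 0)."""
    pa, pb = project(a, b)
    kinds = ["gap_a" if x == "-" else "gap_b" if y == "-" else None
             for x, y in zip(pa, pb)]
    return sum(1 for kind, _ in groupby(kinds) if kind is not None)


def substitutions(a, b):
    """Mismatches between two bases in the same column across a branch."""
    return sum(1 for x, y in zip(a, b) if "-" not in (x, y) and x != y)


def subcharacters(nodes, edges, terminals):
    """Per column, count gap-free connected groups that hold observed bases."""
    total = 0
    for col in range(len(next(iter(nodes.values())))):
        present = {n for n, seq in nodes.items() if seq[col] != "-"}
        seen = set()
        for start in present:
            if start in seen:
                continue
            group, stack = set(), [start]
            while stack:
                node = stack.pop()
                if node in group:
                    continue
                group.add(node)
                for u, v in edges:
                    if u == node and v in present:
                        stack.append(v)
                    elif v == node and u in present:
                        stack.append(u)
            seen |= group
            if group & terminals:
                total += 1
    return total


def score(nodes, edges, terminals):
    subs = sum(substitutions(nodes[u], nodes[v]) for u, v in edges)
    indels = sum(indel_events(nodes[u], nodes[v]) for u, v in edges)
    subc = subcharacters(nodes, edges, terminals)
    return subs, subc, indels, subs + subc + indels


# Tree (A B)(C D) with hypothetical inner-node names X and Y.
edges = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]
terminals = {"A", "B", "C", "D"}
fig_b = {"A": "---aaaattt", "B": "tttaaaa---", "X": "---aaaa---",
         "Y": "---cccc---", "C": "gggcccc---", "D": "---ccccggg"}
fig_d = {"A": "aaaattt", "B": "tttaaaa", "X": "tttaaaa",
         "Y": "gggcccc", "C": "gggcccc", "D": "ccccggg"}

score_b = score(fig_b, edges, terminals)
score_d = score(fig_d, edges, terminals)
print(score_b, score_d)
```

This only evaluates a given tree alignment; it makes no attempt at the much harder problem of finding an optimal one.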
Consider again the multiple alignment of Fig. 6.8b, but now assume that the four columns are regular independent single-column characters, with a dash indicating inapplicability. Obviously, in this case there is no need to examine alternative groupings of states, such as in Fig. 6.8c, during tree search and optimization. Permitting such shifts would lead to the same problems as when using absence/presence coding of individual states. As the computational complexity of the current approach mostly derives from the need to examine alternative groupings of bases when optimizing sequences on a tree, this restriction has as a fortunate consequence that the general algorithm for dealing with this kind of inapplicability is much simpler and faster (De Laet 2003).

Maximizing homologous similarity vs. minimizing transformations
The parsimony criterion as discussed here relies on the notion that one indel event counts as one unit loss of subsequence homology, irrespective of the number of bases that are involved. But this does not mean that it would in general produce trivial alignments that are obtained by simply juxtaposing all observed sequences, which requires only as many insertion events as there are sequences. An example is presented in Fig. 6.13. In the optimal tree alignment of Fig. 6.13b, two independent pairwise base matches can be explained as homology. The trivial alignment that is obtained by juxtaposing all observed sequences (Fig. 6.13c) has no such base matches. In addition, compared to the first tree alignment, it has four independent instances of subsequence non-homology. The total difference in explanatory power thus equals six, which is reflected in the relative tree scores.

This shows that the current criterion is not a minimum evolution method: the second tree alignment of Fig. 6.13 requires only four mutations (four insertions of subsequences of length four) but it is considered a much worse explanation of the data than the first one, which requires 10 mutations (10 substitutions). Given that one of the terms in the minimization for sequence character homology is the number of subcharacters, a quantity that has no direct relationship with evolutionary transformations, the non-equivalence of both approaches when dealing with sequence characters should come as no surprise. But this non-equivalence with minimization of evolutionary transformations does not imply that the current method is not logically capable of phylogenetic interpretation. Such an interpretation, however, is in terms of unit statements of similarity that can be explained in a logically consistent way as identity through
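[Figure 6.13 panels, as far as they can be recovered from the extracted layout; tree (A B)(C D):
(a) A aaaa; B ggag; C tccc; D tttt
(b) A aaaa; B ggag; inner node of (A B) aaaa; inner node of (C D) tttt; C tccc; D tttt. subs: 10; subc: 4; indels: 0; total: 14
(c) A aaaa|------------; B ----|ggag|--------; both inner nodes ----------------; C --------|tccc|----; D ------------|tttt. subs: 0; subc: 16; indels: 4; total: 20]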
Figure 6.13 An example of the parsimony criterion for sequence characters. (a) A sequence character. (b, c) Two tree alignments on the optimal
tree (A B)(C D). The first is optimal. The second, obtained by simply juxtaposing all observed sequences, is suboptimal by six units. Subs, subc, and
indels are numbers of substitutions, subcharacters, and indel events. Single bars across branches indicate substitutions, double bars indel events.
Logical subsequences are indicated using vertical bars, and numbered for clarity.
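The score of the trivial alignment follows directly from the data, which gives a quick check of the six-unit difference stated in the caption. The sketch below assumes, as the text describes, one insertion event per juxtaposed sequence, no comparable bases between sequences (so no substitutions), and every observed base forming its own subcharacter; the function name is hypothetical.

```python
def trivial_alignment_score(sequences):
    """Score of the alignment that simply juxtaposes all observed sequences."""
    indels = len(sequences)                     # one insertion event per sequence
    subc = sum(len(seq) for seq in sequences)   # each base is its own subcharacter
    subs = 0                                    # no comparable bases, no substitutions
    return subs + subc + indels


seqs = ["aaaa", "ggag", "tccc", "tttt"]         # sequence character of Fig. 6.13a
trivial = trivial_alignment_score(seqs)

# Tallies reported in Fig. 6.13b for the optimal tree alignment.
optimal = 10 + 4 + 0                            # subs + subc + indels

mutations_trivial = len(seqs)                   # four insertions
mutations_optimal = 10                          # ten substitutions
print(trivial, optimal, trivial - optimal)
```

The trivial alignment thus has fewer mutations (4 versus 10) but a score that is six units worse, which is the sense in which the criterion is not a minimum evolution method.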
[Figure 6.14 panels, as far as they can be recovered from the extracted layout; C and D share an inner node, A and B share an inner node, and E attaches to the central inner node:
(a) A gaatcgct; B gaatccgt; C ataaaaacccac; D ataaaaaccccgg; E gaatccccc
(b) C ataa|aaa|c|ccac|-; D ataa|aaa|c|cccg|g; inner node of (C D) ataa|aaa|c|cccc|-; E gaat|---|c|cccc|-; central inner node gaat|---|c|cccc|-; inner node of (A B) gaat|----|ccct|-; A gaat|----|cgct|-; B gaat|----|ccgt|-. subs: 8; subc: 13; indels: 3; total: 24
(c) C a|taaa|aa|c|ccac|-; D a|taaa|aa|c|cccg|g; inner node of (C D) a|taaa|aa|c|cccc|-; E -|gaat|--|c|cccc|-; central inner node -|gaat|--|c|cccc|-; inner node of (A B) -|gaat|---|ccct|-; A -|gaat|---|cgct|-; B -|gaat|---|ccgt|-. subs: 7; subc: 13; indels: 4; total: 24
(d) C --|at|aaaaa|c|ccac|-; D --|at|aaaaa|c|cccg|g; inner node of (C D) --|at|aaaaa|c|cccc|-; E ga|at|-----|c|cccc|-; central inner node ga|at|-----|c|cccc|-; inner node of (A B) ga|at|------|ccct|-; A ga|at|------|cgct|-; B ga|at|------|ccgt|-. subs: 5; subc: 15; indels: 4; total: 24]
Figure 6.14 An example of the parsimony criterion for sequence characters. (a) A sequence character. (b, c, d) Three optimal tree alignments on its
optimal tree. Subs, subc, and indels are numbers of substitutions, subcharacters, and indel events. Single bars across branches indicate substitutions,
double bars indel events. Logical subsequences are indicated using vertical bars, and numbered for clarity.
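The numbers printed in Figs 6.13 and 6.14 can be tabulated to compare the different ways of scoring a tree alignment that are discussed in this section. The sketch below is illustrative only. It assumes that the total printed in each panel is the sum of the three counts, and it also evaluates the alternative function (twice the indel events, plus subcharacters, plus twice the substitutions) that is considered near the end of the section. The dictionary and function names are hypothetical.

```python
# (substitutions, subcharacters, indel events) as printed in the figures.
counts = {
    "6.13b": (10, 4, 0),
    "6.13c": (0, 16, 4),
    "6.14b": (8, 13, 3),
    "6.14c": (7, 13, 4),
    "6.14d": (5, 15, 4),
}


def figure_total(subs, subc, indels):
    # Assumption: the total printed in each panel is the plain sum.
    return subs + subc + indels


def event_count(subs, subc, indels):
    # Indel events plus substitutions, ignoring subcharacters.
    return subs + indels


def combined(subs, subc, indels):
    # Alternative function considered near the end of this section.
    return 2 * indels + subc + 2 * subs


totals = {k: figure_total(*v) for k, v in counts.items()}
events = {k: event_count(*v) for k, v in counts.items()}
alternative = {k: combined(*v) for k, v in counts.items()}

for panel in counts:
    print(panel, totals[panel], events[panel], alternative[panel])
```

With these inputs the three alignments of Fig. 6.14 all total 24 while requiring 11, 11, and 9 indel events plus substitutions. Under the alternative function the two alignments of Fig. 6.13 tie at 24, and the first two alignments of Fig. 6.14 score 35 against 33 for the third.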
common descent and inheritance, and not in terms of numbers of transformations that are required to that effect.
An example where different optimal tree alignments on the best tree have different numbers of indels plus substitutions is presented in Fig. 6.14. The first two tree alignments have more indel events plus substitutions than the third one (11 versus 9), but despite this higher total number of mutations, they provide an equally good overall explanation of the data in terms of the amount of total sequence similarity that can be explained as homology. More precisely, the first alignment accommodates 29 independent pairwise matches among observed bases, the second 30, and the third one 30 as well, as easily verified by examining the tree alignments column by column. So just considering compositional homology, the first explanation is suboptimal. The difference, however, is exactly offset by its lower loss in subsequence homology (three indels versus four and four). With the cost regime that is advocated by Frost et al. (2001) (all costs equal), the optimization of Fig. 6.14c is preferred (cost 12 vs. costs 13 for Fig. 6.14b and 14 for Fig. 6.14d).
The difference between both cost regimes is further illustrated in Fig. 6.15. Maximizing the amount of sequence similarity that can be interpreted as homology, the tree of Fig. 6.15b is optimal, and an optimal tree alignment is shown. The tree of Fig. 6.15c is suboptimal by two units, as can be seen from the optimal alignment that
[Figure 6.15, panels (a) to (c); tree drawings and internal-node rows omitted.]
(a) A ggaaaaaaaaaat; B ggaaat; C ccat; D ccaaaaaat
(b) tree (A B)(C D), terminal rows: A gga|aa|aaa|aaaa|t; B gga|aa|-------|t; C cca|---------|t; D cca|aa|aaa|----|t
(c) tree (A D)(B C), terminal rows: A gga|aa|aaa|aaaa|t; B gga|aa|-------|t; C cca|---------|t; D cca|aa|aaa|----|t
Figure 6.15 An example of the parsimony criterion for sequence characters. (a) A sequence character. (b) An optimal tree alignment on the
optimal tree. (c) An optimal tree alignment on a suboptimal tree. Single bars across branches indicate substitutions, double bars indel events. Logical
subsequences are indicated using vertical bars, and numbered for clarity. The number of subcharacters in both optimizations is the same.
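The effect of counting an indel of length k as k events, which this section describes as operationally treating gap length as an ordered character, can be illustrated with the sequence character ttaatt, ttaaatt, ttaaaatt, ttaaaaatt for terminals A to D. The sketch below is not a tree alignment. It simply assumes that the length of the central run of a's is scored as an ordered (additive) character and finds, by brute force, the minimum cost on each of the three unrooted trees for four terminals. All names are illustrative.

```python
from itertools import product

# Lengths of the central run of a's in the four observed sequences.
seqs = {"A": "ttaatt", "B": "ttaaatt", "C": "ttaaaatt", "D": "ttaaaaatt"}
run_length = {name: s.count("a") for name, s in seqs.items()}


def ordered_cost(pair1, pair2, states):
    """Minimum ordered-character cost on the unrooted quartet (pair1)(pair2).

    The two internal nodes take integer states x and y; the cost is the sum
    of absolute state differences over the five branches.
    """
    lo, hi = min(states.values()), max(states.values())
    best = None
    for x, y in product(range(lo, hi + 1), repeat=2):
        cost = (sum(abs(states[t] - x) for t in pair1)
                + abs(x - y)
                + sum(abs(states[t] - y) for t in pair2))
        best = cost if best is None else min(best, cost)
    return best


quartets = {
    "(A B)(C D)": (("A", "B"), ("C", "D")),
    "(A C)(B D)": (("A", "C"), ("B", "D")),
    "(A D)(B C)": (("A", "D"), ("B", "C")),
}
costs = {name: ordered_cost(p1, p2, run_length)
         for name, (p1, p2) in quartets.items()}
print(costs)  # {'(A B)(C D)': 3, '(A C)(B D)': 4, '(A D)(B C)': 4}
```

Scored this way, tree (A B)(C D) is one step shorter than the other two, which matches the preference the text attributes to unit gap costs. Under the criterion that maximizes homology the three trees are instead considered equally good.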
is shown. Under the costs of Frost et al. (2001) the tree alignments of Figs 6.15b and 6.15c are also optimal for their respective trees, but the ranking of the trees reverses: the second tree is now preferred (costs 14 vs. 13). This shift in preference is a consequence of counting an indel event of length k as k events, as implicitly advocated by Frost et al. (2001). In this example, this amounts operationally to treating the lengths of the gaps that are involved as an ordered character.
A more extreme example of the same phenomenon occurs with a sequence character such as ttaatt, ttaaatt, ttaaaatt, and ttaaaaatt for terminals A, B, C, and D. With the cost regime of Frost et al. (2001), unrooted tree (A B)(C D) is preferred because, operationally, it best groups the series of a’s in the middle of the observed sequences according to their length. With the cost regime that maximizes homology, the three different unrooted trees for four terminals are considered equally good explanations of the character.
The preference of Frost et al. (2001, pp. 354–355) for equal substitution and unit gap costs follows from their position that all hypothesized evolutionary transformations should be weighted equally. However, this cost regime only accomplishes such equal weighting under the very restrictive assumption that indels only affect single bases, which constitutes a severe knowledge claim about the processes that shape sequence evolution. It is hard to see then how this approach ‘maximizes the explanatory power of all lines of evidence’ (Frost et al. 2001, p. 354), even more so if one considers their apparent position that methods that make severe knowledge claims can be safely ignored (Frost et al. 2001, p. 354). No comparable claim is present in the current method, in which the lengths and positions of subsequences that take part in indel events are left open to optimization.
A similar methodological asymmetry exists between methods that impose irreversibility of inferred character evolution and methods that leave the possibility of reversal open during phylogenetic analysis. An extensive discussion of the issues that are involved can be found in Farris (1983, pp. 24–27). Frost et al. (2001) did not discuss such issues. In fact, they did not even provide arguments why equal weighting of all evolutionary transformations should lead to equal substitution and unit gap costs. It can reasonably be argued that the principle of equal weighting of all transformations is instead better implemented by using equal substitution and gap costs, irrespective of the length of the gaps that are involved. However, for most sequence characters this cost regime
would lead to trivial alignments such as in Fig. 6.13c, requiring only as many transformations as there are terminals, irrespective of the tree that is considered. Again, it is hard to see how such optimizations can be considered to maximize explanatory power. Yet they are optimal under the notion of minimizing equally weighted transformations.

Sequence characters and branch support
The example of Fig. 6.13 illustrates an interesting consequence for the concept of branch support. Consider the tree alignment of Fig. 6.13b. In that alignment, the (A B)(C D) branch is supported, not because of the four substitutions on that branch, but because collapse of the branch (resulting in an unresolved tree) would remove either the a–a base match between A and B or the t–t base match between C and D. This is in line with the observation of Farris et al. (2001a) that branch lengths do not measure support. Instead, support for any single branch is measured as the degree to which removal of the branch worsens the explanation of the data, which holds for sequence and non-sequence data alike. This, by definition, is Bremer (1988) support.
Alternatively, one could measure robustness of a branch using the jackknife (Farris et al. 1996) or related methods. However, as sequence characters have no predefined single-column characters, pseudoreplicates cannot be constructed in the usual way. This problem can be solved by resampling at the level of individual bases in the sequences to be compared, such that each base is made uninformative with a probability equal to the character removal probability of regular jackknifing (operationally, this can be done by replacing a base with a polymorphism code for ‘a or c or g or t or -’; or, a bit more conservative, for ‘a or c or g or t’). With a removal probability of 0.37, the (A B)(C D) branch in the above example would not survive, as it depends on the simultaneous presence of the four bases mentioned above. With the conservative approach, the probability that all four are retained in a pseudoreplicate is only $(1 - 0.37)^4 \approx 0.16$.

A likelihood conjecture
Miklós et al. (2004) recently described a probabilistic model of sequence evolution that allows insertions and deletions of arbitrary length, a more general approach than Thorne et al. (1992), the first probabilistic method that incorporated indels that affect multiple residues at once. In their model, substitutions are described using a regular time-reversible rate matrix; indels are modelled such that the rates for insertions as a function of their length k are a geometric function of k, and such that the ratio between the rates of insertions and deletions of length k is a constant.
Miklós et al. (2004) only dealt with comparisons of two sequences, but the model can in principle be extended to simultaneous comparison of more than two sequences that are related by a binary tree, much as Hein (2001) extended the two-sequence model of Thorne et al. (1991), the first stochastic model to include insertions and deletions (single residue indels only). In the approach of Hein (2001), rate parameters are assumed to be constant throughout the sequences. Removal of assumptions of that kind would turn the model into a no-common-mechanism model akin to the model of Tuffley and Steel (1997, pp. 584, 597) for regular r-state characters.
Envisioning such a double extension of the model of Miklós et al. (2004), it can be conjectured that, under a wide range of possible non-fixed rates, the trees that are found with a parsimony criterion along the lines described here are also trees of maximum likelihood. As with single-column characters (see above), this does not imply that such a probabilistic process model would exhaustively describe and capture the current method.

Beyond sequence characters: the genome
Most examples above consist of data sets with just a single sequence character, but data sets can have several such characters, and in addition any number of single-column characters. Exactly which observations are coded as characters, the subject of character analysis, is ultimately outside the realm of the technical aspects of further analysis that have been discussed in this section. For sequence characters, a widely used criterion for
establishing hypotheses of putative sequence homology is almost identical to the technology to obtain those sequences in the first place: whatever is amplified using a particular primer pair. In addition, various other criteria can be used to identify biologically relevant structures, such as exons and introns in protein coding sequences, or stems and loops in rRNA sequences (see, e.g., Kjer 1995; Giribet 2002).
On the basis of such criteria, even contiguous stretches of the genome can be subdivided into sequences of sequence characters that can be optimized separately. When doing so, it may be a legitimate concern that the subsequent analysis might be constrained and even biased by preconceived ideas about the evolution of such structures. However, given that the complexity of the calculations when dealing with sequence characters makes the use of heuristics and approximations unavoidable, the procedure of breaking up long sequences into smaller components prior to analysis may very well be part of a heuristic search strategy. This approach could be especially powerful when combined with heuristic multiple alignment methods that try to assemble global alignments from alignments of fragments that are dynamically identified (e.g. Morgenstern et al. 1996; Morgenstern 2004).
On a more fundamental level, sequence characters as discussed here are thought to be hierarchically related through indels and substitutions only. This may be a biologically plausible assumption for shorter parts of the genome, but it definitely breaks down for complete genomes, where other processes such as inversions, duplications, and translocations play a role as well. Over the past few years, many combinatorial algorithms have been developed to study such phenomena (see, e.g., Sankoff and Nadeau 2000), and heuristic multiple-alignment methods that incorporate such rearrangement events are becoming available (see, e.g., Brudno et al. 2003, 2004). It remains an open question how such methods can be interpreted or generalized to accommodate a parsimony criterion as developed here. Such extensions may well lead to revisions or further elaborations of the current framework.
Consider, for example, a process such as lateral transfer, which may well play an important role in the evolution of genomes (see, e.g., Kunin and Ouzounis 2003), or speciation through allopolyploidization (see, e.g., Vander Stappen et al. 2002). For any data set, positing sufficient such events in any phylogenetic tree will make it possible to explain all observed similarities as historically identical, whether through regular ancestor–descendant relationships of organisms or through non-hierarchic processes such as lateral transfer. It may be sufficient to restrict the current criterion to the former case, but, alternatively or additionally, a more general criterion might be conceived that maximizes the difference between similarity that can be explained as historical identity, whatever the underlying processes, and the minimum number of hypothesized historical events required to that effect.
This second approach would need careful elaboration of a broader theoretical concept of explanation than used here, which is beyond the scope of this chapter. However, one way to go would be to couple the principle of maximizing conformity between observation and theory to the principle of choosing the simplest theory or theories that can explain the data, which would lead to a true synthesis of two different but interwoven lines of argument that can be found in the work of Farris (see, e.g., Farris 1982b, 1983). As discussed extensively in this paper, the first principle leads to maximization of similarity that can be explained as homology. The second principle requires a measure of the simplicity of a phylogenetic explanation, which may well be the minimum number of logically distinct historical events that have to be postulated. The rationale for a combined optimality function as above would then be to find an optimal balance between both principles.
For single-column character data and under the above restriction, that approach would operationally be equivalent to the current parsimony criterion, because in such cases it amounts to minimizing twice the amount of homoplasy. For sequence characters as defined here (only indels and substitutions), it would amount to minimizing
$2n_{\mathrm{indels}} + n_{\mathrm{subc}} + 2n_{\mathrm{subst}}$, which would obviously change details of several examples discussed in this section. For example, both tree alignments of Fig. 6.13 are then considered equally good explanations; or the first two tree alignments of Fig. 6.14 become suboptimal by two units. But the main conclusions, and especially those based on data symmetries, would remain valid.

6.4 Acknowledgments

This chapter was prepared on a GNU/Linux system, mostly using vim and LaTeX. Thanks to Victor Albert, Norberto Giannini, Pablo Goloboff, Mark Simmons, Hubert Turner, John Wenzel, and two anonymous reviewers for constructive criticisms. Thanks to Kevin Nixon for bandwidth and disk space to make the computer program goechel available at www.cladistics.com. Goechel is a Java shell around POY (Wheeler et al. 2003) to perform jackknife analyses as discussed in 6.3.4 (subsection Sequence characters and branch support). The author holds a return grant from The Belgian Federal Science Policy Office. Lastly I wish to mention Steve Farris. It’s been my privilege to have him as a member of my PhD committee, late last century. Ever since, I consider myself a student of his, as should be obvious by even just a cursory glance at this paper. Thanks for it all, Steve!
III
Computational limits of parsimony analysis: from historical aspects to competition with fast model-based approaches

CHAPTER 7
The limits of conventional cladistic analysis
Jerrold I. Davis, Kevin C. Nixon, and Damon P. Little
7.1 Introduction

Software for cladistic analysis has been widely available for more than 20 years, and a series of advances made during this time have facilitated the analysis of matrices of ever-increasing size. A milestone was reached in 1993 with the assembly of a 500-terminal rbcL matrix (known as zilla), but optimal trees for this matrix were not discovered until years later. A range of analytical methods developed since that time have found what appear to be most-parsimonious trees for the zilla matrix, but there continue to be perceptions that shortest trees for this matrix cannot be discovered by conventional search methods with a single personal computer in a reasonable period of time. We provide an overview of the development of parsimony methods for cladistic analysis, describe strategies that have allowed the zilla matrix to be analyzed by conventional methods, and demonstrate that zilla was amenable to analysis by these methods as early as the mid-1990s, using then-available hardware and software. Preliminary analyses, even when unsuccessful at discovering most-parsimonious trees, can be used to identify appropriate software settings for use during thorough analyses. A useful indicator of the settings that yield the most efficient searches is the excess branch-swapping ratio, which is the ratio between the number of tree rearrangements conducted during a particular phase of branch swapping in which shorter trees are being discovered, and the minimum possible number of rearrangements during this phase. Two-stage search strategies, with intensive branch swapping conducted on a small percentage of the most optimal sets of trees obtained by a large number of relatively short searches, are more efficient than one-stage searches. Although data sets with substantially greater numbers of terminals than the zilla matrix are beyond the current limits of conventional cladistic analysis with a solitary personal computer, these techniques are likely to continue to be of importance when employed in association with more recently developed methods such as tree fusion, sectorial searches, tree drifting, and the parsimony ratchet.

7.2 A brief history of parsimony methods for phylogenetic analysis

Willi Hennig, the German dipterist, is widely considered to be the father of modern phylogenetics, and his book Phylogenetic Systematics (Hennig 1966) had a broad-reaching influence in the early development of the field. Hennig’s greatest contributions are observed in his clear definitions of monophyly, in his discussion of the evidence used to determine monophyly (i.e. synapomorphy), and in his strict adherence to phylogenetic classifications. However, Hennig’s explication of the methods by which one might determine synapomorphies, and thus monophyletic groups, and ultimately phylogenetic trees, was less precise. On the other hand, some methods that at least superficially embodied Hennig’s proposals had been published several years earlier (e.g. Wagner 1952, 1961). Following the publication of Phylogenetic Systematics in the late 1960s, efforts
were begun to reconcile Hennigian phylogenetics with quantitative methods that were becoming practicable through the availability of digital computers. Very quickly, the concept of parsimony as the overriding criterion in constructing phylogenetic trees began to predominate, although there was great resistance among the numerical taxonomists (pheneticists) of the day. Most of the advances in the development and application of parsimony to phylogenetics were due to the work of J. S. Farris, and much of the early seminal work, both by Farris and others, was published in the journal Systematic Zoology.
Among the first computational approaches for the production of phylogenetic trees using Hennigian concepts of synapomorphy was the Wagner tree method (Wagner 1961) as developed and implemented by Farris (1970). Initially, Wagner trees were computed by hand, which could be tedious with more than a few taxa. Such trees were then utilized as ‘final’ results; in other words, a Wagner tree was computed and this cladogram alone was the basis for interpreting the phylogeny. Once computer programs became available for ‘quickly’ computing Wagner trees (e.g. Farris 1978b¹), it became apparent that there were two major problems: first, Wagner trees were often suboptimal (i.e. there were shorter trees) when the data were noisy, or had significant levels of homoplasy, and second, there were often multiple equally parsimonious solutions. This was evident when the taxon order in an analysis was changed, which often resulted in the discovery of different trees of different lengths, often with more than one of the shortest length found. In the early days when the computer programs were mostly executed by stacks of cards, the easiest way to generate these extra trees was to ‘shuffle’ the taxon deck (each taxon, along with its character scores, was on a single card in the Fortran deck) and resubmit the job.

¹ Note: we have attempted to provide accurate citations for the release dates of the computer programs cited in this chapter. However, we have encountered conflicting records concerning the release dates of certain versions of these programs, and some of the dates given may not be accurate.

The first real breakthroughs in calculating parsimony trees came in the program PHYSYS (Mickevich and Farris 1981). PHYSYS was only available for mainframe computers (PCs had not yet become widely available) and it was installed only at a few universities in North America. In addition to numerous other numerical techniques, PHYSYS included routines that performed branch swapping, including branch breaking (BB; later called tree bisection and reconnection (TBR) by Swofford 1990; branch-breaking actually had been implemented by Farris in 1970, in a little-known and undistributed program called Clad/OS (J. S. Farris, personal communication)). It quickly became obvious that these ‘heuristic’ branch-swapping methods were more effective at finding shorter trees than those methods that merely shuffled the taxon order in a Wagner tree analysis. With this recognition also came the discouraging realization that such methods were not effective enough to guarantee the discovery of shortest trees for data sets of any realistic size, because of the massive numbers of trees that would need to be examined (see Felsenstein 1978b).
In the mid-1980s PCs became widely available, and MS-DOS machines with 64–640 KB of RAM, running at 5–8 MHz, were the platforms for the cladistic parsimony program Hennig86 (Farris 1988; the 86 refers to the Intel 8086 chip family, not to the date of release, a common misconception). Many of the parsimony features of PHYSYS were implemented in Hennig86. The most important features included a ‘branch and bound’ command (ie, which stood for implicit enumeration) that could guarantee shortest trees on data sets with as many as 20 or more terminals, and the mh* and bb* commands that performed branch breaking (again, BB or TBR) on input trees (the starting trees usually were calculated with the Wagner algorithm). Although not obvious to users, the mh* command did a series of quick analyses using different taxon-addition sequences, followed by branch breaking while holding few trees. The user would typically take the results of an mh* analysis and perform branch breaking on the shortest of the trees from the initial sets while holding as many trees as would fit in available memory, with the command bb* (which students sometimes confused with branch and bound). The mh* + bb* sequence in Hennig86 thus was the first common implementation of
a two-stage analysis (see below), with a series of ‘quick’ runs initially conducted, using multiple starting trees, followed by more exhaustive swapping on the best trees found during the first stage.
The first release of PAUP (Swofford 1984) initially ran on mainframes, then on MS-DOS PCs (Swofford 1985), and later the program was ported to the principal platform on which it is now used, the Macintosh (Swofford 1990). Although available for desktop computers before Hennig86 was released, PAUP had limitations on tree output, search options, and overall speed. Before the release of version 3.0 in 1990, PAUP did not implement branch breaking, and indeed the same method as branch breaking was named TBR in that release (Swofford 1990). Because of the ease of use on the Macintosh platform, PAUP became and appears to remain (as PAUP*) the most widely used cladistic program. Unfortunately, the default settings and culture that developed around PAUP resulted in many analyses being conducted as single one-stage analyses (see below) with the maximum number of trees held for branch swapping being 100 (or in many cases the maximum that would fit in the available RAM).
The use of these methods, in association with what became known by many PAUP users as “swapping to completion” (conducting a complete round of branch swapping on an entire set of trees held in memory), came to be regarded as a thorough analysis by many investigators. Even these analyses often were not completed on moderately sized data sets, and on large data sets, such as the 500-terminal rbcL matrix (discussed further below), analyses had to be stopped prematurely after running for a period of time, often several months (e.g. Chase et al. 1993; Rice et al. 1997).
Of course, adherence to the mantra of swapping to completion does not in any way guarantee the discovery of shortest trees, due to problems of “tree islands” (D. Maddison 1991), so the benefits of this approach are at best illusory. Because of the widespread adoption of these methods, many cladograms published with PAUP over the years, whether or not the trees were swapped to completion, are not actually among the most-parsimonious trees that could have been found easily with concurrent versions of Hennig86 or Nona (e.g. Chase et al. 1993; Rice et al. 1997), or represent only a subset of the complete set of most-parsimonious trees, resulting in spurious resolution in the reported consensus (e.g. Donoghue and Doyle 1989).
The program Nona (Goloboff 1993b) was developed by the Argentine arachnologist Pablo Goloboff, while a graduate student at Cornell University, as a companion to his program Pee-Wee (Goloboff 1993c), which was designed to conduct implied-weighting tree searches. Pee-Wee was available as a beta version from 1991, and Nona from 1993. Nona has many similarities to Hennig86, but allows more precise control over search strategies, and by using the defaults it is very easy to implement customized two-stage searches as described above (e.g. mult* + max*). This has resulted in more experimentation with different search strategies, including those described in this chapter.
The belief that data sets with 100 or more taxa are virtually intractable to analyze was common through the 1990s and persists even today, particularly among users of PAUP and PAUP*. This belief was voiced strongly by Rice et al. (1997), who expressed the goal of exploring “methodological and theoretical issues raised by very large data sets” (p. 554), yet conducted a one-stage reanalysis of the 500-terminal rbcL seed plant matrix of Chase et al. (1993). Much of the attitude about intractability of large data sets rests on the misguided belief that an analysis must swap to completion in order to be valid. However, as illustrated in the present chapter, the methods and thus conclusions of the Rice et al. paper were flawed. By performing eight one-stage analyses while holding large numbers of trees in RAM, Rice et al. merely repeated the ineffective analytical strategies that were employed as the default settings of PAUP (and PAUP*). The only difference was the total amount of time devoted to the analyses, which consumed 11.6 months of computer time on three separate Sun workstations (Rice et al. 1997). Other supposed new search strategies proposed by Rice et al. were either not implemented or not shown to be effective, and thus the overall result of the paper was to reinforce the common myth that data sets of this magnitude were intractable. The current chapter, using the same data set as analyzed by Rice et al., and using software available at the time, and
computers of comparable speed, shows how ineffective the Rice et al. approach actually was at the time it was published.

7.3 Newer methods for parsimony analysis

The first breakthrough in analyzing what might now be considered large (>150 terminal) data sets came with the introduction of the parsimony jackknife by Farris et al. (1996), which remains the fastest method by which to undertake a parsimony analysis. Using the parsimony jackknife, Källersjö et al. (1998) analyzed a data set of 2538 terminals, and their results include, as far as we are aware, the largest cladogram ever published. Rice et al. (1997, p. 559), referring to the parsimony jackknife and to analogous approaches for rapid maximum likelihood analysis, incorrectly characterized the use of such methods as “abandoning the notion that maximum parsimony or maximum likelihood is the criterion that we are optimizing.”
More recently, tree-search strategies and new algorithms that are more effective than simple two-stage methods have been developed, utilizing randomization of character weights during successive searches (the parsimony ratchet; Nixon 1999), temporary changes in optimality criteria (tree drifting; Goloboff 1999), as well as tree fusion (Goloboff 1999) and swapping on only a portion of a tree (sectorial searches; Goloboff 1999). Detailed descriptions of these methods are beyond the scope of this chapter, and the reader is referred to the original publications for more information on the algorithms and how they have been implemented. The parsimony ratchet, originally available only in WinClada (running Nona as a daughter process), has now been implemented as the program PAUPRat (Sikes and Lewis 2001), which produces batch files that can be analyzed with PAUP*. The parsimony ratchet, tree fusion, tree drifting, and sectorial searches are available in the latest version of the computer program TNT (Goloboff et al. 2004). Besides raw speed (the TBR swapper is estimated to be at least 10 times faster than that of the latest version of Nona), TNT allows the user to combine methods (conventional, ratchet, tree drifting, and sectorial searches) to produce optimal or near-optimal trees that can then be subjected to tree fusion (driven searches). Since the success of tree fusion is dependent on the input trees collectively having all of the clades found in the shortest trees, the selection of methods to produce these input trees is very important.
With TNT it is possible to analyze large data sets very quickly. For example, an analysis of a matrix of 1553 small-subunit ribosomal DNA sequences, with a combination of the ratchet (using Nona as the search engine, because the ratchet was not yet available in TNT) and tree fusion (using TNT), took approximately 2 months of computer time on a 1.3 GHz Xeon processor, and resulted in a minimum of five independent discoveries of presumed shortest trees (Tehler et al. 2003). Attempts to use conventional searches, tree drifting, and sectorial searches to produce input trees for tree fusion did not yield trees as short as those obtained when the ratchet was used to produce the input trees, but it is not known if this would be the case with other matrices of comparable size.
The availability of new programs that implement advanced tree-search strategies (e.g. TNT) does not eliminate the need to fully understand the idiosyncrasies and factors influencing the efficiency of conventional tree-search strategies. With the exception of tree fusion, these new strategies (the parsimony ratchet, tree drift, and particularly sectorial searches) use standard branch-swapping techniques, applied in an iterative fashion. Although tree fusion does not utilize branch swapping per se, the trees which form the population to fuse must be found by conventional searches or more often with the ratchet or sectorial searches (e.g. Tehler et al. 2003). Thus, the detailed information provided here on conventional searches, in combination with further study of the nature of tree islands, and further exploration of the effectiveness of different combinations of tree-search methods, has utility in a general theoretical sense, and may facilitate the development of new tree search strategies.

7.4 Challenges of large data sets

With the rise of molecular systematics, which involves the scoring of thousands of characters for
a taxon in the course of a single set of operations (i.e. the sequencing of a genomic region), it has become possible over the past two decades to assemble numerous characters for a single taxon in a short order of time, and thus to assemble cladistic data matrices with increasingly large numbers of taxa. As noted above, a milestone in this progression was reached more than 10 years ago, when Chase et al. (1993) generated and analyzed the zilla matrix, which consists of 500 rbcL sequences, though they did not discover most-parsimonious trees for the matrix. This point was demonstrated by the discovery of shorter trees by Rice et al. (1997), whose re-analysis of the data set consumed a total of nearly a year of computer time on three Sun workstations (i.e. an average of about one-third of a year on each of the three computers), and yielded trees five steps shorter than those that had been described by Chase et al. However, it soon became evident that Rice et al. also had not discovered the shortest trees for this matrix.
Nixon (1999) and Goloboff (1999) analyzed the same matrix, the former author using the parsimony ratchet and the latter using tree fusion, tree drifting, and sectorial searches (alone and in combination), and both authors found trees two steps shorter than the shortest trees that had been discovered by Rice et al. Trees of this length have since been discovered numerous times, in re-analyses by the authors of the present chapter, using the methods of Nixon and Goloboff. Although one cannot be certain that these are most-parsimonious trees for this matrix, it seems likely that they are, since trees of this length are discovered rapidly with these methods, in the course of many hundreds of searches. Even if shorter trees still remain to be discovered, trees of the length discovered by Nixon and Goloboff still represent a benchmark against which other analytical methods can be compared.
Here we present results of a set of conventional analyses (i.e. using only standard branch-swapping techniques) of the 500-terminal data set and, as detailed below, we have discovered trees of the same length as those previously found by Nixon and Goloboff using other techniques. These trees, with uninformative characters removed from the matrix, are 16 218 steps in length, and sets of trees of this length have been discovered more than 200 times during the course of the present study. During the same period, approximately 10 times as many sets of trees of length less than or equal to 16 220 (the shortest length discovered by Rice et al. 1997) were discovered.
The analyses that constitute this overall study were conducted on several different personal computers of various processor speeds, over the course of a period of about 6 years. With a PC that is fast by current standards (3 GHz), and using Nona with the optimum software settings, as determined by the current analysis, one set of trees of length 16 218 can be discovered in about 1 day (details below). This does not imply that a single day’s analysis is sufficient to declare the analysis of this or any other matrix of this magnitude complete, because there may be multiple ‘‘islands’’ (D. Maddison 1991) of equally parsimonious trees that cannot be discovered by conventional branch swapping from one tree of this length. Thus, the true consensus tree for a matrix may be less resolved than the one that is discovered following the initial discovery of a set of most-parsimonious trees.
The estimate of 1 day of data analysis also can be misleading because it is based on the use of optimum settings for this matrix, and does not include the preliminary analyses that are required to determine these settings. However, when this matter is taken into consideration it is reasonable to state that a thorough conventional analysis of the 500-terminal matrix, including the preliminary analyses, and resulting in a consensus tree that includes no erroneously resolved nodes, can be completed over the course of a few weeks. Thus, if it is assumed that the shortest trees currently known for the 500-terminal rbcL matrix actually are the shortest possible trees for this matrix, it is possible at present to analyze this matrix in a reasonable amount of time on one personal computer using conventional search methods. Desktop computers with processors approximately one-tenth the speed of those currently available have been available since 1997, as was the software that was used in the present study (Nona; Goloboff 1993b and 1993), so it was possible at that time that Rice et al. were conducting their analysis of this matrix to discover at least one set of shortest trees for the 500-terminal matrix over the course
of a few weeks, and possibly, with some luck, as early as 1993, when the matrix first was assembled.
Although the 500-terminal rbcL matrix still is a relatively large data set, matrices with more than 1000 terminals have been generated and analyzed in recent years (e.g. Källersjö et al. 1998; Tehler et al. 2003), and the present availability of nucleotide sequences of growing numbers of genes from hundreds or even thousands of accessions suggests that additional matrices of this magnitude and greater will be generated during the next few years.
One might ask, then, whether we have reached the limits of conventional cladistic analysis. Faster processors become available in personal computers at regular intervals, but a doubling in rates occurs, at best, over a period of 1 year to a few years, and the required processing power for cladistic matrices is growing much faster than that. Although substantially more efficient approaches to cladistic analysis are now available (i.e. the methods of Nixon and Goloboff, as discussed above), those methods resemble conventional analytical techniques in employing heuristic tools such as branch swapping. Thus, empirical study of the limits of conventional analysis, as described in the present chapter, should provide insights that are applicable to the development of analytical strategies that may help to maximize the effectiveness of these and other methods that may be developed. Also, by establishing benchmarks and limits, the current analysis will help to establish a basis for comparisons among current and future methods.
We refer to the general methods of heuristic search techniques that have been used for cladistic analysis over the past several years, as described above, as conventional search methods. Similar methods have been used for maximum likelihood searches, and those too can be called conventional search methods. These searches vary in certain details, but they follow a basic multistep pattern, and it is useful to review this pattern. Typically, an initial or ‘starting’ tree is generated (usually a Wagner tree), and this tree is then subjected to a methodical regimen of branch rearrangements (i.e. branch swapping).
When swapping results in the discovery of a tree that is more optimal than the tree that is being subjected to swapping, the tree that is being swapped is abandoned, as are all other trees of equal length that are held in memory, and swapping is then initiated on the new tree, which can be regarded as a new parent tree. In this manner, conventional analyses proceed from a starting tree through sets of successively shorter trees, often spending considerable periods of time at one length or another, accumulating and swapping through a large number of trees, before shorter trees are discovered. Eventually, every search of this sort results in the discovery of a set of one or more trees of some length (which may or may not be most-parsimonious for the matrix), and the search culminates in a complete round of branch swapping through this set of trees.
With matrices for which the discovery of most-parsimonious trees is difficult, the starting tree for each search often is far from optimal, and there is an initial period of swapping during which there is a relatively rapid approach towards optimality, with a substantial portion of the trees that are subjected to swapping being discarded before being swapped completely. Later in the process the approach to optimality slows, with large numbers of trees accumulated and subjected to swapping, while little or no progress is made towards the discovery of more optimal trees.
In light of this pattern, a tradeoff in potential search strategies is apparent. If the investigator chooses to retain relatively few trees in memory, each search ends relatively quickly (i.e. stalls at some length after failing to find shorter trees within the imposed limits), but a large number of searches can be conducted in a given period of time. With few trees retained, the thoroughness of each search is minimal, and each search is relatively unlikely to result in the discovery of most-parsimonious trees.
Alternatively, if greater numbers of trees are retained in memory, each tree search is more thorough, and thus more likely to result in the discovery of most-parsimonious or nearly most-parsimonious trees, but each search also consumes a great deal of time, and few searches can be conducted. The latter situation is illustrated by the analysis of the 500-terminal rbcL matrix that was conducted by Rice et al. (1997), who allowed large numbers of trees to be held in memory, but conducted only eight individual searches during the course of nearly a year of
computer time, with none of those searches yielding most-parsimonious trees for the data matrix. All eight of the searches were aborted prior to completion (i.e. an ad hoc limit on the number of trees to be swapped in each search was interposed during the course of the analysis) because more trees had been accumulated than could be subjected to swapping in a reasonable amount of time.
In light of the tradeoff between rapidity and thoroughness of individual searches, one would need to evaluate various combinations of settings between two extremes to determine the optimal point(s) of balance between the conflicting goals of conducting numerous searches and of having these searches be thorough. One of the goals of the present analysis was to examine this tradeoff, and we have determined that there can be multiple peaks of search efficiency for a matrix, and that one of the peaks of search efficiency can occur, even with a homoplasious matrix of 500 terminals, with as few as 50 trees held in memory during each search.
Apart from the number of trees held in memory and subjected to branch swapping during individual searches, several additional factors can affect the overall efficiency of conventional tree searches. We refer to the general approach described in the previous paragraphs (one or more independent search initiations, each followed by branch swapping from a starting tree, and terminating after a set of trees of some length has been accumulated and subjected to swapping) as a one-stage search. In fact, many practitioners now conduct two-stage searches (as with the mh* and bb* commands of Hennig86), with the first stage corresponding to a one-stage search in which a relatively large number of searches are conducted, each with a relatively small number of trees held in memory. Following the completion of this stage, the second stage proceeds as the most optimal sets of trees obtained during the first stage are subjected to additional swapping with greater numbers of trees held in memory. The two-stage approach has the advantage of confining the most intensive and time-consuming branch swapping (i.e. with numerous trees held in memory and subjected to swapping) to those sets of trees that are relatively optimal to begin with and, as detailed below, there are two key advantages to this approach.
First, trees that are nearly optimal for the matrix are more likely to yield most-parsimonious trees under intensive branch swapping than are those that are less optimal. Thus, use of the two-stage search strategy focuses the most-intensive swapping on sets of trees that are among the most likely to yield most-parsimonious trees. Second, the two-stage search strategy allows a greater number of sets of trees to be subjected to intensive swapping during a given period of time than does the one-stage search strategy. This is because the trees that are subjected to intensive swapping during the second stage of a search are relatively optimal to begin with, and therefore have a relatively shorter path to follow to their point of completion, whether or not they succeed in discovering most-parsimonious trees.
For example, swapping with 2 000 trees held in memory, when initiated with trees two steps longer than the most-parsimonious, involves branch swapping through a maximum of 6 000 trees (up to 2 000 trees each at the initial length and at one and two steps shorter), and a maximum of only 4 000 trees in searches that fail to yield most-parsimonious trees. However, if intensive branch swapping is initiated with trees that are 10 steps longer than the most-parsimonious, as many as 20 000 trees may be subjected to branch swapping in searches that do not yield most-parsimonious trees.
Hence, a key advantage of two-stage searching lies in its ability to minimize the time that is spent in intensive swapping during each search, which thereby allows a greater number of searches to be conducted in a given period of time. In other words, the limitations inherent in one-stage searches, as are evident in the tradeoff between conducting a small number of intensive searches, and a larger number of individually less-intensive searches, are ameliorated by the very structure of two-stage searches. As described below, with reference to the 500-terminal matrix, almost any two-stage search that is conducted with reasonable software settings is superior to even the most efficient one-stage search.
If two-stage searches are more efficient than one-stage searches, as a general rule (i.e. with appropriate software settings in each case), it still remains to determine how many trees should be held in memory during each stage.
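The tree counts in the worked example above follow a simple pattern: a search that holds h trees and starts k steps above the optimum swaps through at most h(k + 1) trees if it reaches the optimum, and at most hk trees if it stalls short of it. The following is a minimal Python sketch of that arithmetic only. The function and parameter names are illustrative assumptions; they are not commands or options of Nona or PAUP*.

```python
def max_trees_swapped(hold, steps_above_optimum, reaches_optimum):
    """Upper bound on the number of trees swapped in one search.

    hold                -- trees retained in memory at each length (assumed name)
    steps_above_optimum -- extra steps in the starting trees relative to the optimum
    reaches_optimum     -- True if the search finds most-parsimonious trees
    """
    if hold < 1 or steps_above_optimum < 0:
        raise ValueError("hold must be >= 1 and steps_above_optimum >= 0")
    if steps_above_optimum == 0 and not reaches_optimum:
        raise ValueError("a search that starts at the optimum cannot fail to reach it")
    # A successful search can visit every length from the start down to the
    # optimum; a failed search stops at least one step above the optimum.
    lengths_visited = steps_above_optimum + (1 if reaches_optimum else 0)
    return hold * lengths_visited


print(max_trees_swapped(2000, 2, True))    # 6000
print(max_trees_swapped(2000, 2, False))   # 4000
print(max_trees_swapped(2000, 10, False))  # 20000
```

The three printed values reproduce the figures quoted in the text (6 000, 4 000, and 20 000 trees). The bound assumes that the full complement of held trees is accumulated at every length visited, which is the worst case.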
A further question is what percentage of tree sets obtained during the first stage should be subjected to additional swapping during the second stage. In addition to these points, there are other significant factors. There are different branch swapping procedures, ranging from relatively cursory (e.g. nearest-neighbor interchange) to much more thorough (e.g. BB, or TBR). Also, trees with polytomies may or may not be recognized as acceptable trees for evaluation during a branch swapping procedure, and when polytomies are allowed, there are different rules that can be applied to determine whether the various possible dichotomous resolutions of a polytomy are recognized as being supported by a given matrix. All searches in the present analysis involved BB/TBR swapping, but we examined various combinations of the numbers of trees held, and of the search settings relating to polytomies.
If the optimum search conditions for one data matrix differ from those for another, as seems reasonable to assume, it is important to develop methods for estimating the appropriate settings to be used with any particular data set. This goal may be difficult to attain for large matrices, because the analysis of any data set that is recognized as a large one is, almost by definition, computationally demanding. The problem, then, is that if the most efficient search conditions for a given data set can be determined only by conducting a series of preliminary searches under a variety of conditions, and if most-parsimonious trees are discovered only infrequently under even the best of conditions, and furthermore, if the results of preliminary searches are evaluated in terms of success in the discovery of most-parsimonious trees, the computational demands of the required preliminary searches will be nearly as burdensome as a thorough analysis itself. However, a preliminary phase of searching, preceding an actual search, would be feasible if the effectiveness of the various settings that are explored during the preliminary phase could be evaluated in terms of a result that is correlated with efficiency in the discovery of most-parsimonious trees, and that could be determined on the basis of a relatively small number of preliminary searches. We consider various correlates of search efficiency with the 500-terminal matrix, and demonstrate that one useful indicator is the ratio between the number of tree rearrangements required during a particular phase of a search and the minimum that would be required by an unsuccessful search, i.e. an excess branch-swapping ratio (see below).
We have conducted preliminary searches and applied these calculations to a second and larger data set, the three-gene matrix of 567 terminals of Soltis et al. (2000). As indicated by those authors, the shortest trees discovered by conventional searches with PAUP*, conducted over a period of several weeks, were six steps longer than those that were discovered in the course of a few hours with the parsimony ratchet, and the ratchet eventually yielded trees that were one further step shorter. Since that time, numerous additional ratchet searches have been conducted with the three-gene matrix and, as with the 500-terminal matrix, we are confident but not certain that the shortest trees obtained with the ratchet, and reported by Soltis et al., are most-parsimonious trees for this matrix.
However, as noted above with reference to the 500-terminal matrix, even if shortest trees for the three-gene matrix have not yet been discovered, the tree length obtained with the ratchet still provides a basis for comparison among search strategies. It might be argued that an optimal method for the discovery of trees slightly longer than most-parsimonious is not the best method for obtaining most-parsimonious trees themselves, but if this were the case we would expect any differences in methods for the efficient discovery of most-parsimonious and nearly most-parsimonious trees to be relatively minor.
We conducted preliminary analyses with the three-gene matrix, using a range of settings, calculated excess branch-swapping ratios, used these ratios to select software settings for more thorough searches of this matrix, and then conducted additional searches using those settings.
7.5 Methodological matters
7.5.1 The zilla data set
Analyses were conducted using the rbcL data set that was assembled and first analyzed by Chase et al. (1993). Chase et al. analyzed two different versions of the same general data set, using
different analytical procedures. In their search II they employed a data matrix comprising 500 terminals, including two sequences from Canella, so the matrix can be described as comprising 499 taxa or 500 terminals. Their search II was conducted with all transformations between nucleotides weighted equally, and therefore it represents the earliest analysis of a matrix of this magnitude using analytical techniques that are generally accepted today. Rice et al. obtained a copy of the search II matrix and nicknamed it Treezilla (attributing the name to A. Yoder), but this name often is replaced by the shorter name zilla, which we use for the balance of this chapter. Rice et al. (1997) re-analyzed the zilla matrix, using equal character weights, and the present analysis is also based on that version of the data.
The data matrix used here was downloaded in January 1998, as posted by Rice et al. (at www.herbaria.harvard.edu/~rice/treezilla/), converted from NEXUS format (Maddison et al. 1997) to Nona format, and stripped of cladistically uninformative characters, leaving 759 informative characters. The aforementioned Harvard website is unavailable at the time of writing, but the NEXUS-format version of the zilla matrix is available currently at a different website (www.cis.upenn.edu/~krice/treezilla/). In addition to the zilla matrix, the downloaded file includes a tree of length 16 533, which is the shortest tree length reported by Rice et al. (1997). We undertook extensive evaluation of this data set to establish that it was identical to the published 500-taxon data set of Chase et al. and Rice et al. Details of these tests are available upon request from the senior author of the present chapter (J. I. D.).
7.5.2 Comparability of reported results with zilla across software platforms
The present analyses of zilla, and prior analyses by other authors, have been conducted with a variety of programs, and in order to compare results of the various studies it is important to determine the comparability of the results reported by the various packages. Several different versions of PAUP, MacClade, Nona, and WinClada have been used in these various analyses, and two principal items of interest reported by these programs are tree lengths and numbers of trees examined during branch swapping. Reported tree lengths vary, in part, because uninformative characters have been included in some calculations and excluded from others. Another source of variation lies in the use of different criteria of character informativeness by the various programs. This problem was noted by Rice et al., who used PAUP and MacClade, and who observed that the programs reported different tree lengths when commands were invoked to exclude uninformative characters from consideration.
The following software packages have been employed in this and previous studies of the zilla data set: Chase et al. (1993) used PAUP version 3.0s (Swofford 1990) for their search II analysis; Rice et al. (1997) used PAUP versions 3.0s and 3.1.1 (Swofford 1993), plus MacClade version 3.04 (Maddison and Maddison 1992); and we used Nona version 1.5.1 (compiled September 4, 1996; Goloboff 1993b), the multi-thread version of Nona version 1.6 (Goloboff 1993, i.e. Paranona, compiled February 26, 1998), WinClada version 1.00.08 (Nixon 2002), PAUP version 3.1.1, PAUP* version 4.0b10 (Swofford 2002), and MacClade version 4.03 (Maddison and Maddison 2001).
Because of different reported lengths for the trees found by Chase et al. and Rice et al., due to errors in excluding uninformative characters in older versions of PAUP, we use here the tree lengths as determined by MacClade version 3.04 (Maddison and Maddison 1992), Nona, and WinClada, which are all consistent (details of the comparisons that were made are available from J.I.D.). These programs determine the shortest trees found by Rice et al. to be of length 16 220 with uninformative characters excluded (80 steps fewer than the number reported by older versions of PAUP, and 313 steps fewer than the number obtained with uninformative characters included). Rice et al. included one of these trees in their posted matrix, and we confirm, using PAUP* version 4.0b10, that this tree is of length 16 533 with the data matrix in that file, when all characters are included. This version of PAUP* also detects 759 cladistically informative characters in the matrix, which is the same number detected by Nona and WinClada in the matrices that we derived from this file (see above). Also, this version of PAUP* reports a length of 16 220 for the Rice et al. shortest tree when uninformative
characters are excluded from consideration. Finally, we have determined that PAUP* version 4.0b10 and MacClade version 4.03 both agree with Nona and WinClada in determining that the shortest trees discovered by the present analysis are of length 16 218 (uninformative characters excluded) or 16 531 (uninformative characters included). In both cases these trees are recognized as two steps shorter than those discovered by Rice et al., and seven steps shorter than those discovered by Chase et al.
The preceding comparisons substantiate an overall consistency in data structure among the various zilla files that exist in NEXUS and Nona formats, and among Nona, WinClada, PAUP* version 4.0b10, and MacClade in the determination of character informativeness and in the calculation of tree lengths with and without uninformative characters included. In contrast, it appears that older versions of PAUP employ a definition of character informativeness that differs from the one discussed by Farris (1989a), and employed by all current versions of the programs mentioned above. On the basis of these results we recommend that investigators using older versions of PAUP, or consulting publications that used those versions, verify all results that are based on distinctions between informative and uninformative characters, including items such as consistency indices, which are inflated when uninformative characters are interpreted as informative.
7.5.3 Three-gene matrix
In addition to the zilla data set, we conducted analyses with another large data matrix, the three-gene data set of Soltis et al. (2000). This data set comprises 567 taxa scored for rbcL, atpB, and 18S rDNA. As noted by Soltis et al., they found shortest trees for this matrix only with the parsimony ratchet (Nixon 1999), as implemented in WinClada, using Nona as a daughter process for tree searches (the Nona-format version of the matrix employed in those searches is available at www.cladistics.com/). Analyses of this data set for the present study were conducted using copies of the matrix downloaded from that website. This matrix includes 567 terminals scored for 2 153 informative characters, and most-parsimonious trees are of length 44 163, or, as reported by Soltis et al., 45 100 steps when uninformative characters are included.
7.5.4 One-stage tree searches
Except where specified otherwise, cladistic analyses were conducted with the multi-thread version of Nona version 1.6 cited above. All analyses were conducted as one-thread tree searches (using the default setting thread 1). Searches were performed with the mult* command, which conducts a set of replicate tree searches (e.g. 10 replicates with mult*10), and most searches were conducted in sets of 10 or 20, with trees saved after each set of searches. The mult* command initiates each replicate search by generating a Wagner tree, assembled with a taxon entry sequence determined by a seed number that is input with the command rseed, then conducts a round of subtree pruning-regrafting (SPR) swapping on the Wagner tree, with one tree held in memory, followed by BB/TBR branch swapping with a user-determined number of trees held. We used the command rseed 0, which uses the computer’s clock to generate a random seed number, which is saved with the results of each search, thereby facilitating repeated searches using identical taxon entry sequences.
In each replicate search, a predetermined number of trees (x, as set with the hold/x command) is retained in memory and branch swapping is conducted successively on each of these trees. If a shorter tree is discovered, all trees except the new one are discarded, swapping is initiated on the new tree, and a new set of daughter trees is accumulated. For reasons discussed below, we also note that if a shorter tree is not discovered, the search terminates after all x trees in memory have been subjected to swapping, and that the final phase of every replicate search therefore consists of a complete round of swapping through x trees of the shortest length discovered in that search, unless this set of trees constitutes an island (D. Maddison 1991) of fewer than x trees. In the latter case, the final phase of the search would consist of swapping through all trees in the island. However, every search of the zilla matrix conducted during the course of the present analysis did accumulate the maximum number of trees set by the hold/ command.
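The control flow of a single replicate search, as described in the preceding paragraph, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Nona's implementation: integers stand in for trees, `toy_neighbors` stands in for the rearrangements produced by branch swapping, `toy_length` stands in for tree length, and the Wagner-tree and SPR steps are omitted. All names are hypothetical.

```python
def replicate_search(start, neighbors, length, hold):
    """Toy one-replicate search with at most `hold` trees kept in memory."""
    held = [start]            # trees of the current shortest length
    best = length(start)
    i = 0                     # index of the next held tree to swap
    while i < len(held):
        parent = held[i]
        i += 1
        for tree in neighbors(parent):
            tree_length = length(tree)
            if tree_length < best:
                # Shorter tree found: discard all others and restart from it.
                held, best, i = [tree], tree_length, 0
                break
            if tree_length == best and tree not in held and len(held) < hold:
                held.append(tree)   # accumulate equally short trees
    # The loop ends only after every held tree has been swapped
    # without finding anything shorter.
    return best, held


# Hypothetical stand-ins: integers play the role of trees.
def toy_neighbors(x):
    return (x - 1, x + 1)


def toy_length(x):
    return abs(x) // 3


best, held = replicate_search(10, toy_neighbors, toy_length, hold=50)
print(best, sorted(held))    # 0 [-2, -1, 0, 1, 2]

best2, held2 = replicate_search(10, toy_neighbors, toy_length, hold=2)
print(best2, held2)          # 2 [8, 7]
```

With 50 trees held, the toy search reaches the shortest length and ends on a set of five trees, i.e. an island smaller than the hold limit. With only two trees held, it stalls at length 2 because the tree that leads to a shorter length is never retained, which mirrors the tradeoff between the number of trees held and the thoroughness of each search.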
Because the hold/ limit was always reached, each of these searches did end with a complete round of swapping through that number of trees.
The output generated by each set of replicate searches includes a record of the seed number that was used to initiate each search, the minimum tree length obtained by the search, and the number of trees of that length that were accumulated. Also, use of the tcount command causes the total number of tree rearrangements examined in each set of replicate searches to be reported (e.g. for mult*10, the total for the 10 constituent replicates).
To determine the effectiveness of tree searches under a variety of conditions, analyses were conducted using a variety of settings for other commands (see below). The output from all searches conducted under each set of conditions was combined into a common file, and duplicate searches (i.e. those based on the same randomly drawn seed number, which invariably yielded identical results) were removed from consideration.
One setting that was varied among sets of one-stage searches was the number of trees retained and subjected to branch swapping, as specified with the hold/ command. Analyses were conducted with the following numbers of trees retained in each search: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1 000, 2 000, and 5 000. Two other settings that were varied among analyses are those that determine whether polytomies are allowed in trees (using the poly command) and, if so, the conditions for recognition of potential resolutions as being supported by the data (using the ambiguous or amb command).
A complete set of analyses (i.e. using each of the hold/ settings listed above) was conducted with the default settings in Nona for these commands, poly= (polytomies allowed), and ambiguous- (when polytomies are allowed, a branch is collapsed to form a polytomy if its length is zero under at least one possible character optimization, i.e. a branch is resolved only if it has unambiguous support). Additional analyses also were conducted with the alternative settings for these commands, either poly- (polytomies disallowed), or ambiguous= (when polytomies are allowed, a branch is collapsed only if its length is zero under all possible optimizations, i.e. a branch is resolved if it has ambiguous or unambiguous support). When the poly- command is invoked, the setting for the ambiguous command is irrelevant, because polytomies are disallowed in any case; therefore, apart from the default settings of these commands, preliminary analyses were conducted using only two of the three remaining combinations of the settings determined by these commands (poly= ambiguous=, and poly- ambiguous-). On the basis of results from the preliminary analyses, intensive analyses were conducted with these polytomy and ambiguity settings with 50, 100, and 2 000 trees held in memory (i.e. using the commands ho/50, ho/100, and ho/2000).
In addition to tree length, Nona and PAUP* both report the number of tree rearrangements conducted during the course of an analysis. In order to compare these programs, including one of the two versions of PAUP used by Rice et al. (1997), with respect to the numbers of tree rearrangements that they report when conducting comparable actions, we selected one putative most-parsimonious tree (i.e. length 16 218) from each of 10 sets that were generated during the course of this study. Each of the 10 trees was subjected to branch swapping, with 100 trees retained (i.e. 99 new trees propagated from the first, and all 100 from each set subjected to branch swapping), using Nona version 1.6, PAUP version 3.1.1, and PAUP* version 4.0b10. Because this swapping never yielded shorter trees, this procedure resulted in a complete round of branch swapping through exactly 1 000 trees with each program, and the average number of rearrangements required to swap through one tree (signified as r1avg) was calculated from the results of these analyses. With Nona, the default polytomy and ambiguity settings were used (poly= and amb-), and BB (i.e. TBR) branch swapping was conducted, using the command max*. With PAUP, TBR swapping was conducted with only minimal-length trees retained, and ‘zero-length branches collapsed,’ and with PAUP*, TBR swapping was conducted with branches collapsed if ‘maximum length is zero.’ It should be noted that most-parsimonious trees are not actually required for this procedure to be conducted, because a starting tree of any length can be used, with any number of trees retained and subjected to branch swapping, under any particular swapping regime, if it has been determined by a prior round of swapping (or after the fact, from the
results of such a procedure) that this particular tree does not yield shorter trees when swapped under the specified conditions.

Results from the various one-stage analyses were compared in terms of two principal metrics, the frequency of success and the search efficiency, both of which are expressed with reference to a particular tree length. Frequency of success is the percentage of searches under a given set of conditions that result in the discovery of trees as short as the specified length. In most cases we specify success with reference to most-parsimonious trees, but when applied to a tree length greater than that, such as two steps longer than most-parsimonious, we apply this measure in a comprehensive manner, so that it includes all searches that yielded trees of the specified length or shorter.

Tree-search efficiency describes the results of a search in terms of the number of tree rearrangements required to discover a set of trees of a specified length. Like frequency of success, it can be specified with respect to most-parsimonious trees or longer trees, and in the latter case it includes searches that result in the discovery of trees of length equal to or shorter than the specified length. In order for greater efficiency to be specified by higher numbers, search efficiency is defined as the ratio of number of searches that result in trees of a given length or shorter to the number of rearrangements required to discover those sets of trees. Thus, if an average of 1 billion branch swaps is required to conduct each replicate search under a particular set of conditions, and if an average of one of every four such searches yields trees of a specified length or shorter, the frequency of success for this tree length is 25%, and the search efficiency for this length is one set of trees for every 4 billion tree rearrangements. In this chapter we sometimes discuss efficiency informally, in terms of billions of rearrangements required per successful search, and it should be noted that with this parlance a higher number refers to a lower efficiency. It should be noted as well that frequency of success and tree-search efficiency are expressed in terms of the number of searches that obtain trees of a given length, not in terms of the number of trees that are obtained. Thus, if two searches are conducted, one with 50 trees retained, and the other with 500 trees retained, each search that yields one set of trees of a specified length is counted as a single successful search, even though each of the latter searches accumulates 10 times as many trees as each of the former.

Search efficiency, as expressed in terms of tree rearrangements, can be converted to efficiency in terms of computer time elapsed, with the latter expression reflecting processor speed and other factors. Calculation of these indices in terms of the number of rearrangements facilitates the pooling of results generated here from several different computers, and also facilitates comparisons between these results and those obtained with other computer platforms.

7.5.5 Phases of one-stage searches

For the purpose of interpretation of preliminary results it is useful to distinguish two phases of branch swapping that occur during conventional searches. As noted above, conventional one-stage searches (as conducted in Nona, PAUP*, and other programs) begin with the generation of a Wagner tree, after which branch swapping is conducted, starting with this tree. Also noted above is the observation that for any number x of trees held in memory and subjected to swapping in searches of this sort (as determined by the Nona command hold/x), searches typically end immediately after a phase of unsuccessful swapping in which a set of x trees of some length is generated and subjected to branch swapping without yielding shorter trees. When several searches are conducted, this is the point at which each search ends, prior to the initiation of the next one. The number of trees accumulated and subjected to swapping during this phase of the search can be less than x if there are fewer than x trees in an island, but with large data sets, and with x set at a number that is likely to be efficient for tree searches, this may occur only rarely, and during the present study the accumulation of fewer than the set maximum number of trees never occurred with the zilla matrix, and occurred only once with the three-gene matrix.

The following discussion assumes that every search with hold/x involves a final phase of unsuccessful swapping through x trees. Although this is not necessarily the case, appropriate adjustments could be made for other situations.
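The two metrics defined above, frequency of success and search efficiency, can be illustrated with a minimal sketch. The function names and the eight-search sample below are hypothetical; the sample is chosen only to mirror the worked example in the text (1 billion rearrangements per search, one success in every four searches) and is not output from Nona or PAUP*.

```python
def frequency_of_success(n_successful, n_searches):
    """Percentage of searches that found trees of the target length or shorter."""
    return 100.0 * n_successful / n_searches


def search_efficiency(n_successful, total_rearrangements):
    """Formal efficiency: successful searches per rearrangement (higher is better)."""
    return n_successful / total_rearrangements


def rearrangements_per_success(total_rearrangements, n_successful):
    """Informal efficiency: rearrangements per successful search (higher is worse)."""
    if n_successful == 0:
        return None  # undefined when no search succeeded
    return total_rearrangements / n_successful


# Hypothetical sample: eight replicate searches, each costing 1 billion
# rearrangements, two of which reached the target tree length.
rearrangements = [1e9] * 8
found_target = [True, False, False, False, False, True, False, False]

n_success = sum(found_target)
total = sum(rearrangements)

freq = frequency_of_success(n_success, len(found_target))
per_success = rearrangements_per_success(total, n_success)
efficiency = search_efficiency(n_success, total)

print(f"frequency of success: {freq:.1f}%")
print(f"rearrangements per successful search: {per_success:.2e}")
```

Counting searches rather than trees means that a search holding 500 trees and a search holding 50 trees each contribute at most one success, as the text specifies.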
Consider a search that is conducted with one tree held in memory (hold/1). In this case, swapping is initiated on a Wagner tree, and it proceeds as a succession of trees are subjected to swapping, each yielding a shorter tree before being subjected to a complete round of swapping. The search ends after it reaches the first tree in the progression that is subjected to a complete round of swapping without yielding a shorter tree.

Retrospectively, two phases of branch swapping in this search can be recognized. The first (phase 1) is the productive phase of the search, during which shorter trees are discovered, and the second (phase 2) is the nonproductive phase, during which the single shortest tree discovered by the search is swapped completely without yielding any shorter trees. If another search is initiated with the same seed number, and with two trees retained (i.e. hold/2), it proceeds through an identical sequence, except that one additional tree is retained in memory, and is subjected to swapping following the completion of swapping on the solitary tree that did not yield a shorter tree during the hold/1 search.

If phase 2 is again recognized as being initiated when swapping begins on the first tree that is swapped unsuccessfully, and if the second tree, like the first, is swapped completely without yielding a shorter tree, the search ends with exactly one additional tree having been subjected to branch swapping. In this case, the search with hold/2 is no more successful than the search with hold/1, and phase 1 of the two searches is identical, while phase 2 for the second search is precisely twice as long as phase 2 for the first search. However, if swapping on the second tree yields a shorter tree, the total number of rearrangements during phase 2 of the two-tree search (which now includes some successful swapping) is greater than if it had not yielded a shorter tree. Thus, phase 2 of any search with hold/1 is always unsuccessful, by definition (being the phase during which the first and only tree in the search is swapped unsuccessfully), but phase 2 of a search with more than one tree held in memory may or may not be successful. The final portion of phase 2 of any search ends when a complete round of swapping is conducted through a set of trees that is equal to the number held in memory, but when more than one tree is held in memory this portion of phase 2 may be preceded by a period of successful swapping.

Thus, for any search that is conducted with a given seed number, with x trees held in memory, the number of rearrangements required during phase 1 is identical for all numbers x, and the minimum possible number of rearrangements during phase 2 is equal to the number that is required to swap through x trees. Similarly, the average number of rearrangements required during phase 1, for sufficiently large numbers of searches conducted with randomly selected seed numbers, should be identical during phase 1 with any number x trees held in memory, and the degree to which the average number of rearrangements required during phase 2 exceeds the average number required to swap through x trees is a reflection of the overall degree of success of these searches in discovering shorter trees.

To formalize these expressions, let $r_{\mathrm{avg}}^{\mathrm{hold}=x}$ be the average number of rearrangements required to conduct a complete search with any number x trees retained, and let $rx_{\mathrm{avg}}$, which is the average number of rearrangements required to swap through x trees without discovering shorter trees, be the minimum possible number of rearrangements required during phase 2 of any search (i.e. when phase 2 is unsuccessful). Because phase 2 of any search commences when branch swapping is initiated on the first tree that does not yield shorter trees, the average number of rearrangements during phase 1 is identical for any x, and it equals $r_{\mathrm{avg}}^{\mathrm{hold}=1} - r1_{\mathrm{avg}}$. The average number of rearrangements actually conducted during phase 2, for any x, is the average total for a search with x trees held, minus the average number of rearrangements required during phase 1, or $r_{\mathrm{avg}}^{\mathrm{hold}=x} - (r_{\mathrm{avg}}^{\mathrm{hold}=1} - r1_{\mathrm{avg}})$. We define the excess branch-swapping ratio as the ratio of this number (i.e. the average number of rearrangements during phase 2) to the minimum possible for phase 2, or

\[
\frac{r_{\mathrm{avg}}^{\mathrm{hold}=x} - \left( r_{\mathrm{avg}}^{\mathrm{hold}=1} - r1_{\mathrm{avg}} \right)}{rx_{\mathrm{avg}}} \tag{1}
\]

This ratio is 1 when x = 1, by definition, because phase 2 of such a search is the phase during which unsuccessful swapping is conducted on a single tree, and the number of rearrangements conducted during this phase is precisely the minimum number
possible. When x > 1, the minimum value for the ratio in a given search is 1, and this ratio occurs when all swapping during phase 2 is unsuccessful, i.e. when a search with x > 1 finds trees no shorter than are found when x = 1. However, this ratio exceeds 1 to the extent that the additional swapping with x > 1 results in the discovery of shorter trees.

With the zilla data set (see results, below), we have determined that the second term in the numerator of the excess branch-swapping ratio is trivial in magnitude relative to the first when x is greater than about 20, so (1), above, is approximately equal to $r_{\mathrm{avg}}^{\mathrm{hold}=x} / rx_{\mathrm{avg}}$. However, when x lies between 1 and 20 the second term of the numerator has a substantial effect on the ratio that is obtained. The numbers required to compute this ratio, for any number x, are simply the average number of rearrangements required for searches with x trees held in memory, and the average number of rearrangements to swap unsuccessfully through x trees. Both can be obtained easily from Nona or PAUP*.

7.5.6 Two-stage searches

Two-stage analyses were conducted by subjecting sets of trees obtained from various one-stage searches to additional branch swapping, with greater numbers of trees retained for swapping than in the original searches. Of the available sets of trees from the one-stage searches, the sets subjected to additional swapping were those derived from the most successful one-stage searches, i.e. the shortest trees obtained from those searches, plus sets of successively longer trees, as is common in conventional two-stage analyses, where second-stage swapping is conducted on the best available trees obtained from first-stage swapping. This creates certain complications in evaluating the results. First, trees as short as 16 218 steps (i.e. trees believed to be most-parsimonious for the zilla matrix) were obtained from some of the one-stage searches conducted with 50 trees held in memory. These sets of trees, and sets up to four steps longer, were subjected to additional swapping with various larger numbers of trees held in memory, to determine the frequency with which continued swapping on trees of various lengths yielded trees of length 16 218.

Although the trees of length 16 218 were not expected to yield shorter trees (and did not), and although subjecting them to additional swapping diminished the efficiencies that were ultimately computed for the two-stage searches, because this constituted additional but fruitless swapping, they were nonetheless included in the second round of swapping so that this procedure would accurately reflect the additional swapping that normally would be conducted in the analysis of a matrix for which the length of shortest trees is not known. It should also be noted that each episode of second-stage swapping began with a set of 50 trees that had already been subjected to swapping during the first stage, and that all 50 of these trees were swapped again, as part of the larger task of swapping through a larger set of trees. This action also diminishes the efficiency of the two-stage searches, but it too reflects actions that normally are taken in a two-stage conventional search.

The results of these searches were used to calculate the efficiency of two-stage searches by summing the number of rearrangements required for the one-stage search that generated the sets that were subjected to additional swapping, along with the additional searches conducted in the second stage, and dividing the total number of sets of trees obtained of length 16 218 by this sum. In many cases it was not feasible to conduct second-stage swapping on all available sets of trees of a given length obtained from the initial one-stage analysis, and in these cases the sets actually subjected to second-stage swapping were selected randomly from the available sets, and success and efficiency rates were calculated by pro-rating the observed results. In this manner, the efficiencies of two-stage searches were calculated for several combinations of settings (number of trees held during primary and secondary searches, various percentages of trees from the primary searches subjected to secondary swapping, and various combinations of polytomy and ambiguity settings).

7.5.7 Preliminary analyses to estimate optimum search conditions

After most of the one- and two-stage analyses of zilla had been conducted, and the most efficient
settings had been determined, we computed excess branch-swapping ratios for the various combinations of settings to determine if this ratio is a useful predictor of search efficiency. Having determined that it is, we conducted preliminary analyses with the three-gene matrix, in most cases involving only 10 replicate searches for each number of trees retained, and used these results to determine the settings for more extensive one- and two-stage analyses.

7.6 Comparability of branch swapping with Nona and PAUP

BB (or TBR) swapping through 10 sets of 100 most-parsimonious trees for the zilla data set, with the multi-thread version of Nona version 1.6, using one thread, and with the default settings poly= and amb-, involved an average of ca. 9.75 × 10^6 rearrangements to swap through each tree. TBR swapping through 10 sets of 100 most-parsimonious trees propagated from the same 10 starting trees required an average of ca. 9.95 × 10^6 rearrangements per tree with PAUP version 3.1.1, and 9.74 × 10^6 with PAUP* version 4.0b10. On the basis of these comparisons it appears that Nona, PAUP, and PAUP* count tree rearrangements in approximately the same manner when conducting comparable branch-swapping processes, or at least those processes which constitute the bulk of the branch swapping reported here. Hence, it appears that the numbers reported by Nona for this overall study can be compared directly with those reported by PAUP and PAUP*.

Comparisons of other sorts also can be made with the analysis conducted by Rice et al. (1997). Those authors reported that they conducted a total of 27.9 billion tree rearrangements in the course of eight one-stage searches with large numbers of trees held in memory. Their searches apparently were aborted at various stages, but the overall average for the eight searches was ca. 3.5 billion tree rearrangements conducted per search. For the present analysis, an average of ca. 3.8 billion rearrangements was required for the completion of each one-stage search with 200 trees retained in memory, under the default polytomy and ambiguity settings in Nona (Table 7.1). With these settings, and with 200 trees held in memory, about 4% of all searches yielded trees of length 16 220 or shorter. Thus, there is a rough equivalence between our results and those reported by Rice et al., who discovered trees of length 16 220 in at least one of their eight one-stage searches (they reported the discovery of trees of this length, but not the number of times that this occurred among their eight searches).

The present analysis involved over 2.17 × 10^14 (i.e. 217 trillion) tree rearrangements during one-stage searches (Table 7.1), or about 7 778 times the number of tree rearrangements reported by Rice et al. We also conducted 4.80 × 10^13 (i.e. 48 trillion) tree rearrangements in the course of our two-stage analyses and the various preliminary analyses of the zilla matrix (details available on request), for a total of more than 2.65 × 10^14 (i.e. 265 trillion) tree rearrangements, or about 9 500 times the number reported by Rice et al. In addition to the analysis of the zilla matrix, we conducted 1.85 × 10^13 (i.e. 18.5 trillion) tree rearrangements with the three-gene matrix of Soltis et al. (2000), for a total of ca. 284 trillion tree rearrangements, or more than 10 000 times the number reported by Rice et al.

7.7 One-stage searches

A series of one-stage searches of the zilla matrix conducted in Nona using the default polytomy and ambiguity settings, with various numbers of trees held in memory, indicates that while the number of rearrangements required per search increases with the number of trees held in memory (Table 7.1, Fig. 7.1a), this increase is uneven in rate. On a log-log graph of the relationship between these variables, the increase in tree rearrangements required per search is greatest between 10 and 50 trees held, and between 500 and 2 000 trees held. Similarly, the average length of trees obtained by these searches drops most steeply in the same two regions (Fig. 7.1b), and the greatest excess branch-swapping ratio occurs in both of these regions, with one peak corresponding to 50 trees held in memory, and a secondary peak corresponding to 2 000 trees held (Fig. 7.1c). Preliminary searches conducted with alternative polytomy and
Table 7.1 Results of one-stage searches with various numbers of trees held in memory under various combinations of polytomy and ambiguity-of-support settings. Bold italics identifies the two search strategies in each column that are most successful as measured by average tree length obtained, percentage of successful searches, or search efficiency

Columns: (A) number of trees retained, grouped by polytomy and ambiguity settings; (B) number of searches; (C) total number of trees examined in all searches (billions); (D) average number of trees examined per search (billions); (E) average tree length; (F) discovery of trees of length 16 218: number (and frequency, in percent) of successful searches; (G) discovery of trees of length 16 218: number of trees examined per successful search (billions); (H) and (I) as (F) and (G), for trees of length 16 220.

A        B        C          D        E            F            G         H               I
poly=, amb-
1        3 664    216.8      0.059    16 237.881   0 (—)        —         0 (—)           —
2        11 640   838.8      0.072    16 237.380   0 (—)        —         0 (—)           —
5        3 614    414.9      0.115    16 236.707   0 (—)        —         0 (—)           —
10       3 606    673.4      0.187    16 236.101   0 (—)        —         0 (—)           —
20       3 221    1 673.7    0.520    16 233.084   0 (—)        —         5 (0.155)       334.7
50       3 225    5 979.6    1.854    16 227.383   4 (0.124)    1 494.9   99 (3.070)      60.4
100      3 209    8 670.1    2.702    16 226.249   4 (0.125)    2 167.5   134 (4.176)     64.7
200      3 200    12 244.0   3.826    16 226.134   3 (0.094)    4 081.3   124 (3.875)     98.7
500      2 708    19 845.6   7.329    16 225.790   4 (0.148)    4 961.4   130 (4.801)     152.7
1 000    850      14 119.7   16.611   16 224.866   5 (0.588)    2 823.9   83 (9.765)      170.1
2 000    765      31 952.1   41.767   16 222.868   20 (2.614)   1 597.6   189 (24.706)    169.1
5 000    765      55 919.6   73.098   16 222.841   14 (1.830)   3 994.3   190 (24.837)    294.3
poly=, amb=
50       3 224    5 218.3    1.619    16 228.978   0 (—)        —         35 (1.086)      149.1
100      1 628    4 926.7    3.026    16 226.187   2 (0.123)    2 463.4   55 (3.378)      89.6
2 000    765      29 881.8   39.061   16 223.529   18 (2.353)   1 660.1   145 (18.954)    206.1
poly-, amb-
50       3 226    4 953.5    1.535    16 229.513   1 (0.031)    4 953.5   17 (0.527)      291.4
100      1 629    5 315.6    3.263    16 226.012   2 (0.123)    2 657.8   73 (4.481)      72.8
2 000    388      14 651.9   37.763   16 224.054   6 (1.546)    2 442.0   70 (18.041)     209.3
Totals   51 327   217 496    —        —            83 (n/a^d)   —         1 349 (n/a)     —

^d n/a, not applicable.
ambiguity settings reveal the same general relationship between number of trees held in memory and average tree length obtained (Fig. 7.1b), as well as between the number of trees held and the greatest excess in rearrangements required per search (Fig. 7.1c).

The average tree length obtained should decrease with every increase in the number of trees held, if all other factors are constant, but this was not always observed to be the case. For example, the average tree length obtained with the settings poly= and amb= was substantially greater with five trees held than with two trees held (Fig. 7.1b). We attribute anomalies such as these to our use of random taxon entry sequences, in combination with limited sample sizes. Note that the most complete set of analyses was conducted with the settings poly= and amb-, and that the only anomaly observed for this set of analyses was in the slightly higher average tree length obtained with 5 000 trees held than with 2 000 trees held (Table 7.1, Fig. 7.1b); in both of these cases only 765 analyses were conducted. Fewer replicate searches were conducted with the alternative settings, in some cases amounting to only a small fraction of the number of searches conducted with poly= and
[Figure 7.1, image not reproduced. Three panels, each plotted against number of trees held (log scale, 1 to 10 000): (a) average number of rearrangements per search (log scale, 10^7 to 10^11); (b) average tree length (16 218 to 16 243); (c) excess branch-swapping ratio (1 to 4).]
Figure 7.1 Rearrangements required, lengths of shortest trees discovered, and excess branch-swapping ratios (see text) for one-stage searches of the 500-terminal rbcL data set (zilla), as conducted with Nona, with various numbers of trees held in memory and with various combinations of polytomy and ambiguity-of-support settings. Most of the results depicted represent preliminary sets of 10–20 searches with each combination of settings, but in some cases (particularly analyses with poly= and amb-) the results of more-extensive searches conducted during the course of the overall study are depicted. (a) Average number of tree rearrangements required per search with poly= and amb-. (b) Average tree length discovered in each search, with all three combinations of polytomy and ambiguity settings that were examined (&, poly= and amb-; *, poly= and amb=; ~, poly- and amb-). (c) Excess branch-swapping ratio (key as in b).
amb-, and it is in these cases that the anomalies are most prevalent.

Although the peak excess branch-swapping ratio with the settings poly= and amb- was observed with 50 trees held in memory, as it was with the settings poly= and amb=, the peak with the settings poly- and amb- occurred with 100 trees held in memory (Fig. 7.1c). This difference initially was observed among preliminary sets of searches, but it continued to be observed after more thorough sampling was conducted with both 50 and 100 trees held in memory for all three combinations of polytomy and ambiguity settings (Table 7.1). Additional sampling with the two non-default combinations of polytomy and ambiguity settings was also conducted with 2 000 trees held in memory (Table 7.1), and comparable series of one-stage searches therefore exist for all three of these combinations of polytomy and ambiguity settings with 50, 100, and 2 000 trees held during searches. With 50 and 2 000 trees held, the success rates and efficiencies of searches conducted with the settings poly= and amb- exceeded those with the other two combinations of settings in the discovery of
[Figure 7.2, image not reproduced. Two panels, each plotted against tree length (16 218 to 16 228): (a) success rate (per cent, 0 to 100); (b) efficiency (billions of rearrangements, log scale, 1 to 10^4).]
Figure 7.2 Success rates and efficiencies of one-stage searches of the 500-terminal rbcL data set (zilla), as conducted with Nona with various numbers of trees held in memory and with various combinations of polytomy and ambiguity-of-support settings. Results are depicted for searches with 50, 100, and 2 000 trees held in memory under all three combinations of polytomy and ambiguity settings that were examined, and with 5 000 trees held in memory with the settings poly= and amb- (see text); results of these searches are also provided in Table 7.1. Dashed lines, 50 trees held in memory; dotted lines, 100 trees held; solid lines, 2 000 trees held; alternately dashed/dotted lines, 5 000 trees held; *, poly= and amb-; &, poly= and amb=; ~, poly- and amb-. (a) Success rates for all tree lengths from 16 218 to 16 228, with success defined as the frequency of discovery of trees of a given length or shorter among the searches conducted. (b) Search efficiencies for all tree lengths from 16 218 to 16 228, with efficiency defined as the number of tree rearrangements required per successful search, with success for each tree length including sets of that length and shorter.
most-parsimonious trees (Table 7.1) and for tree lengths up to six steps longer than most-parsimonious (Fig. 7.2), though in some cases slightly greater efficiencies were observed with other polytomy and ambiguity settings for the discovery of longer trees (e.g. length 16 228 with 2 000 trees held in memory, Fig. 7.2b). With 100 trees held in memory, success rates in the discovery of most-parsimonious trees were nearly identical under all three combinations of polytomy and ambiguity settings (ca. 0.125% of searches yielded trees of this length; Table 7.1), and similar results also were obtained for sets of trees of greater length (Fig. 7.2). However, the efficiency of searches conducted with 100 trees held in memory, in the discovery of most-parsimonious trees, was greatest with the settings poly= and amb- (Table 7.1). Thus, for all three numbers of trees held in memory that were examined intensively, the default polytomy and ambiguity settings in Nona yielded the greatest efficiency in the discovery of most-parsimonious trees (though in some cases by narrow margins), and consequently additional analyses were conducted only with these settings.
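The excess branch-swapping ratio of equation (1), which is plotted in Fig. 7.1c and compared among settings above, can be evaluated as in the following sketch. The function and variable names are hypothetical. The inputs are taken from this chapter where available: the average rearrangements per search for hold/1 and hold/50 under the default settings (Table 7.1), and the average of ca. 9.75 × 10^6 rearrangements to swap through one tree (Section 7.6). The average cost of swapping unsuccessfully through x trees is not reported in the text, so it is approximated here as x times the single-tree cost. That approximation is an assumption made for illustration only, and the resulting number should not be read as a reproduction of Fig. 7.1c.

```python
def excess_branch_swapping_ratio(r_hold_x, r_hold_1, r1, rx):
    """Equation (1): average phase-2 rearrangements over the phase-2 minimum.

    r_hold_x: average rearrangements per complete search with x trees held
    r_hold_1: average rearrangements per complete search with one tree held
    r1:       average rearrangements to swap unsuccessfully through one tree
    rx:       average rearrangements to swap unsuccessfully through x trees
    """
    phase1 = r_hold_1 - r1      # productive phase, identical for any x
    phase2 = r_hold_x - phase1  # everything after phase 1
    return phase2 / rx


R1 = 9.75e6          # Section 7.6, Nona with poly= and amb-
R_HOLD_1 = 0.059e9   # Table 7.1, hold/1, average per search
R_HOLD_50 = 1.854e9  # Table 7.1, hold/50, average per search

# With one tree held, the ratio is 1 by definition.
ratio_1 = excess_branch_swapping_ratio(R_HOLD_1, R_HOLD_1, R1, R1)

# ASSUMPTION: cost of swapping through 50 trees taken as 50 * R1.
rx_50 = 50 * R1
ratio_50 = excess_branch_swapping_ratio(R_HOLD_50, R_HOLD_1, R1, rx_50)

print(f"ratio with hold/1:  {ratio_1:.3f}")
print(f"ratio with hold/50: {ratio_50:.3f} (illustrative only)")
```

A ratio well above 1 indicates that much of phase 2 was spent on swapping that led to shorter trees, which is the sense in which the text treats the ratio as an indicator of search efficiency.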
[Figure 7.3, image not reproduced. Two panels, each plotted against number of trees held (log scale, 1 to 10 000), with one curve per tree length from 16 218 to 16 228: (a) success rate (per cent, 0 to 100); (b) efficiency (billions of rearrangements, log scale, 1 to 10^4).]
Figure 7.3 Success rates and efficiencies of one-stage searches of the 500-terminal rbcL data set (zilla), as conducted with Nona with various numbers of trees held in memory, from 1 to 5 000, with the settings poly= and amb-; results from the same searches are also provided in Table 7.1. (a) Success rates, as in Fig. 7.2. (b) Search efficiencies, as in Fig. 7.2.
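The success rates plotted in Figs 7.2a and 7.3a count a search as successful for a given tree length if the trees it found are of that length or shorter. A minimal sketch of that cumulative tally follows. The function name and the list of per-search best lengths are invented for illustration and are not data from this study.

```python
def cumulative_success_rates(best_lengths, target_lengths):
    """Percentage of searches whose best tree length is <= each target length."""
    n = len(best_lengths)
    return {
        target: 100.0 * sum(1 for best in best_lengths if best <= target) / n
        for target in target_lengths
    }


# Hypothetical best tree length found by each of ten replicate searches.
best_lengths = [16218, 16220, 16220, 16223, 16225,
                16226, 16226, 16228, 16230, 16231]

rates = cumulative_success_rates(best_lengths, range(16218, 16229))

for length, rate in rates.items():
    print(f"length {length} or shorter: {rate:.0f}% of searches")
```

Because each search is counted for every target length at or above its own best length, the resulting curve can never decrease as the target length increases.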
Comparisons of success rates and efficiency rates of one-stage searches, in the discovery of most-parsimonious trees and trees up to 10 steps longer, with the settings poly= and amb-, and for a range of numbers of trees held in memory, are depicted in Fig. 7.3. The greatest efficiency of searches occurs with 50 trees held, and the efficiency of searches with 2 000 trees held is slightly less (Table 7.1; ca. 1.5 vs. 1.6 trillion rearrangements required per set of most-parsimonious trees). Success rates of searches with 2 000 and 5 000 trees held in memory are nearly equal for all tree lengths examined (i.e. there is a plateau between these points; Figs 7.2a and 7.3a), but greater efficiencies consistently are observed with 2 000 trees held than with 5 000 (Fig. 7.3). In light of the plateau in success rates between 2 000 and 5 000, and the limited number of searches conducted with these settings, it is not surprising that anomalies are evident in the slightly greater success rates obtained for length 16 218 with 2 000 trees held in memory than with 5 000 trees held (Table 7.1, Fig. 7.3a). This anomaly is also evident in the relative success rates for some tree lengths greater than the most-parsimonious (e.g. lengths 16 221 and 16 222, Fig. 7.3a). With fewer than 800 searches conducted with these particular settings (though more than 110 trillion rearrangements were conducted in these two sets of searches), we attribute this anomaly to the effects of random sampling of taxon addition sequences.

As is evident from the shapes of the success and efficiency curves in Fig. 7.3, success rates in the discovery of trees longer than the most-parsimonious rise unevenly with an increase in the number of trees held, with the regions of steepest increase between 10 and 100 trees held, and between 500 and 2 000 trees held (Fig. 7.3a), and
with nearly flat plateaus observed outside these two regions.

As already noted, there are two peaks in efficiency for searches yielding most-parsimonious trees, corresponding to 50 and 2 000 trees held (Table 7.1, Fig. 7.3b). Two peaks in efficiency are also observed for tree lengths of 16 219 (i.e. one step longer than the most-parsimonious; Fig. 7.3b), with the peak for 50 trees held (ca. 214 billion rearrangements required per set of trees) substantially higher than that for 2 000 trees held (ca. 404 billion rearrangements). For trees of length 16 220, the peak with 2 000 trees held is nearly absent and, for greater tree lengths it is absent. Meanwhile, a second peak in efficiency appears with fewer than 50 trees held in memory (e.g. for length 16 221, there is a peak with five trees held). This peak shifts to lower numbers of trees held (i.e. to two and then to one) for successively longer trees and, eventually, for trees around nine to ten steps longer than the most-parsimonious, it supplants the peak at 50 trees held, with the single peak efficiency in the discovery of trees of length 16 228 corresponding to one tree held during searches.

Thus, the greatest search efficiencies for trees longer than the most-parsimonious occur with relatively few trees held, and for most-parsimonious trees the greatest search efficiency is observed with 2 000 trees held in memory. Beyond this point (i.e. with 5 000 trees held) individual searches are only slightly more likely to yield most-parsimonious trees, and because these searches involve many more tree rearrangements they are substantially less efficient. Many more searches would need to be conducted to determine with confidence whether the most efficient searches for most-parsimonious trees occur with 50 or 2 000 trees held in memory (or possibly with somewhat different numbers of trees held). The similarity in overall efficiency between these two settings, in contrast with the substantially greater frequency of successful searches with 2 000 trees held than with 50 trees held (ca. 1.8 vs. 0.12%; Table 7.1) is indicative of the tradeoff between the number of searches conducted and the intensiveness of each search.

the settings poly= and amb-, the peak search efficiencies of one-stage searches are observed with 50 and 2 000 trees held in memory, and these are the settings at which the greatest excess branch-swapping ratios are observed (Fig. 7.1c). Fewer combinations of settings were examined in depth with alternative polytomy and support-ambiguity settings, and the results of those analyses should be interpreted with caution, but the excess branch-swapping ratios are generally highest for those settings that yielded the most efficient searches (cf. Table 7.1, Fig. 7.1c). Thus, the amount of excess swapping required during phase two of one-stage searches appears to be a useful indicator of tree-search efficiency, and it can be determined with fewer searches conducted than are required to discover most-parsimonious trees. This having been said, it should be noted that the location of the peak in the excess branch-swapping ratio is of greater importance than its height. With the settings poly= and amb- the excess branch-swapping ratio is substantially greater for 50 trees held in memory than for 2 000 trees held, but the actual search efficiencies of these settings are quite similar in magnitude. Also, the excess branch-swapping ratio is substantially greater with 100 trees held in memory for the settings poly- and amb- than for poly= and amb-, but the latter combination exhibits a greater search efficiency.

7.8 Two-stage searches

With two-stage searches, the principal settings include the number of trees held in memory during each of the two stages, and the percentage of tree sets obtained during the first stage that are subjected to additional branch swapping during the second stage. In light of the many possible combinations of these settings, plus the various polytomy and support-ambiguity settings, and the many individual searches that must be conducted to examine the actual search efficiency with each of the possible combinations of these settings, we were not able to examine all possible combinations. We concentrated our efforts on those combinations that appeared to be the most promising
The excess branch-swapping ratio appears to be on the basis of results of the one-stage searches,
a reliable indicator of tree-search efficiency. With and we also conducted less-intensive searches
with a range of alternative settings to provide preliminary indications of variation patterns in success and efficiency rates.

Nine combinations of numbers of trees held during the first and second stages of two-stage searches eventually were examined in detail with the settings poly= and amb- (Table 7.2, Fig. 7.4), and one combination (50 trees held followed by 2 000 trees held) was examined with the settings poly= and amb=. In most cases, second-stage swapping was conducted on tree sets of various lengths that had been obtained during the one-stage searches. However, the numbers of sets of trees from the one-stage searches that had been conducted with two trees held in memory (i.e. those in the first six rows of combinations presented in Table 7.2) were insufficient to conduct thorough two-stage analyses, so the available tree sets obtained from the one-stage searches were supplemented by the results of an additional series of one-stage searches with two trees held.

Also, when only a few sets of trees of a given length were available for second-stage branch swapping (e.g. just seven sets of length 16 222 from the one-stage searches with two trees held in memory, and the settings poly= and amb-), they were combined with sets of successively greater length into a more inclusive set for examining search success and efficiency. Hence, the shortest tree sets subjected to second-stage swapping after first-stage swapping with two trees held in memory are those of length less than or equal to 16 224 (i.e. from the most-parsimonious to six steps longer, a category that included only 0.43% of available sets of trees; Table 7.2), and the shortest tree sets subjected to second-stage swapping after first-stage swapping with 50 trees held in memory are those of length less than or equal to 16 219 (Table 7.2).

As second-stage swapping proceeded on trees of successively greater length (e.g. up to length 16 229 for tree sets derived from first-stage swapping with two trees held in memory; Table 7.2), the number of available tree sets for second-stage swapping was substantially greater than the number of tree sets of shorter length. For these longer sets of trees, swapping was conducted only with randomly selected tree sets from among those that were available, and success rates and efficiencies for these categories were calculated by extrapolating from the results of these searches.

Calculations of search efficiencies for two-stage searches took into account the rearrangements required during the first stage of each search, the percentage of trees available in each tree-length category, the percentage of trees within each of these categories that actually were subjected to second-stage swapping, and the number of rearrangements required during the second stage. The latter factor varies substantially among tree-length categories.

For example, for two-stage searches with 50 trees held during the first stage and 2 000 during the second, with the settings poly= and amb-, the average number of rearrangements required per set of trees during the second stage was ca. 22.3 billion for those that started at length 16 219, and ca. 33.2 billion for those that started at length 16 222. Although these numbers are substantially different, both of them compare favorably with the ca. 41.8 billion rearrangements required for one-stage searches with 2 000 trees held and the same ambiguity and polytomy settings.

Thus, when second-stage branch swapping in a two-stage search is concentrated on a small sample of relatively optimal trees derived from the first stage, the number of rearrangements required per tree set during the second stage can be small (additional details available on request). However, the total number of rearrangements required per set of trees subjected to second-stage swapping also includes the swapping that must be conducted during the first stage to create the pool of sets of trees for the second stage. When only a small percentage of the shortest tree sets obtained during the first stage are subjected to second-stage swapping, the swapping required during the first stage is apportioned across a small number of sets of trees, and therefore the total number of rearrangements required per set (including first- and second-stage swapping) is relatively large (Fig. 7.4a).

As the percentage of available sets of trees that are subjected to second-stage branch swapping increases, the average number of second-stage rearrangements required per set increases, because trees of successively greater lengths are now being subjected to intensive swapping. However, the total amount of branch swapping that was conducted
Table 7.2 Results of two-stage searches with various combinations of polytomy and ambiguity-of-support settings, various numbers of trees held in memory during each stage, and various percentages of available sets of trees from the first stage of each search subjected to branch swapping during the second stage. Bold italics in each row identifies the combination that yielded the greatest percentage of successful searches and the combination that yielded greatest overall tree-search efficiency. Success rate is given as the percentage of second-stage searches yielding trees of length 16 218, and efficiency as billions of trees examined per set of trees discovered of length 16 218. Numbers in parentheses give the percentage of total sets available.

Trees held,       Sets swapped   Success rate / efficiency
stage 1/stage 2   in stage 2

poly=, amb-
                                 16 224 (0.43)   16 225 (0.92)   16 226 (1.85)   16 227 (3.55)   16 228 (5.93)   16 229 (9.07)
2/50              1 339          2.0 / 902.8     1.0 / 962.7     0.5 / 1 081.2   0.5 / 725.9     0.3 / 926.0     0.2 / 1 221.9
2/100             1 239          2.0 / 934.4     1.3 / 708.4     0.9 / 615.6     0.7 / 562.1     0.4 / 751.6     0.4 / 662.3
2/500             167            4.0 / 567.6     1.9 / 744.3     0.9 / 1 088.0
2/1 000           442            6.9 / 416.2     3.3 / 630.8     3.0 / 591.3     2.2 / 733.2     1.3 / 1 216.6   1.8 / 866.8
2/2 000           167            20.0 / 233.9    14.0 / 275.1    8.7 / 413.9
2/5 000           107            20.0 / 394.0    15.0 / 464.7

                                 16 219 (0.87)   16 220 (3.07)   16 221 (6.51)   16 222 (12.81)
50/1 000          214            39.3 / 571.8    13.4 / 532.0    7.2 / 556.6     4.4 / 607.5
50/2 000          214            53.6 / 439.4    27.9 / 302.2    18.3 / 306.1    11.7 / 383.8
50/5 000          128            53.6 / 496.9    32.4 / 356.4    20.5 / 408.8

poly=, amb=
                                 16 219 (0.40)   16 220 (1.09)   16 221 (2.67)   16 222 (6.22)
50/2 000          41             46.2 / 923.5    28.6 / 612.0    14.6 / 612.1    7.9 / 739.0
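As a rough illustration of what the efficiencies in Table 7.2 mean in computing time, the following sketch divides an efficiency (rearrangements per set of most-parsimonious trees) by a processing rate. The two rates are those reported in Section 7.10 for Nona on a 1.7 GHz Pentium 4 and a 75 MHz Pentium; the function and constant names are illustrative only, and the sketch assumes a constant rate with no overhead.

```python
SECONDS_PER_DAY = 86_400


def days_per_success(rearrangements_per_success, rearrangements_per_second):
    """Return processor time, in days, per set of most-parsimonious trees.

    Assumes a constant processing rate and ignores any overhead.
    """
    if rearrangements_per_second <= 0:
        raise ValueError("rearrangements_per_second must be positive")
    seconds = rearrangements_per_success / rearrangements_per_second
    return seconds / SECONDS_PER_DAY


# Efficiency of the 2/2 000 search in the 0.43% category (Table 7.2): 233.9 billion.
EFFICIENCY_2_2000 = 233.9e9

# Processing rates reported in Section 7.10 (rearrangements per second).
RATE_PENTIUM_4 = 1.65e6
RATE_PENTIUM_75MHZ = 8.10e4

if __name__ == "__main__":
    for label, rate in [("1.7 GHz Pentium 4", RATE_PENTIUM_4),
                        ("75 MHz Pentium", RATE_PENTIUM_75MHZ)]:
        days = days_per_success(EFFICIENCY_2_2000, rate)
        print(f"{label}: about {days:.1f} days per set of most-parsimonious trees")
```

Under these assumptions the first rate gives roughly 1.6 days and the second roughly a month, in line with the figures quoted in Section 7.10.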
[Figure 7.4: three panels, (a) to (c), sharing the horizontal axis "Percentage of trees from stage one subjected to branch swapping in second stage" (0 to 14). Vertical axes: (a) average number of rearrangements per search (logarithmic scale); (b) success rate; (c) efficiency (billions of rearrangements). Curves are labeled by number of trees held in stage 1/stage 2, from 2/50 to 2/5 000 and from 50/1 000 to 50/5 000.]
Figure 7.4 Rearrangements required, success rates, and efficiencies of two-stage searches of the 500-terminal rbcL data set (zilla), as conducted with Nona using the settings poly= and amb-, with various numbers of trees held in memory during each stage, and with various percentages of tree sets from the first stage subjected to swapping during the second stage. Baseline data from comparable one-stage searches (cf. Figs 7.1–7.3, Table 7.1) also are presented, as horizontal dotted lines through each panel, with numbers on the right indicating number of trees held; all one-stage searches that were conducted but are not depicted in Fig. 7.4b and 7.4c had lower success rates and efficiencies, respectively, than those that are depicted. Portions of the data on two-stage searches are also presented in Table 7.2. Dashed lines, two-stage searches conducted with 50 trees held in memory during the first stage; solid lines, two-stage searches with two trees held in memory during the first stage. (a) Average number of tree rearrangements required per set of trees subjected to two-stage search, with the number of trees held during each stage indicated in the figure body for each set of results, and with number of trees held for baseline one-stage searches indicated on the right. (b) Success rates for searches resulting in the discovery of trees of length 16 218, with labels as in (a). (c) Efficiencies of searches resulting in the discovery of trees of length 16 218, with labels as in (a).
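The relationship among the three panels of Fig. 7.4 can be sketched with a simplified accounting. This is one reading of the text, not the authors' exact calculation, which also involved extrapolation across tree-length categories. If a fraction of the first-stage tree sets is passed to the second stage, each such set carries a proportionally larger share of the first-stage swapping plus its own second-stage swapping, and efficiency is that total divided by the second-stage success rate. All names below are hypothetical, and the first-stage cost used in the example is a placeholder, not a value from the chapter.

```python
def total_rearrangements_per_set(first_stage_cost, second_stage_cost, fraction_swapped):
    """Rearrangements charged to each tree set that reaches the second stage.

    first_stage_cost:  average rearrangements per first-stage search
    second_stage_cost: average rearrangements per second-stage search
    fraction_swapped:  fraction (0, 1] of first-stage sets swapped in stage two
    """
    if not 0 < fraction_swapped <= 1:
        raise ValueError("fraction_swapped must be in (0, 1]")
    # Every first-stage search must be paid for, but only a fraction continue,
    # so each continuing set carries 1/fraction of the first-stage cost.
    return first_stage_cost / fraction_swapped + second_stage_cost


def efficiency(first_stage_cost, second_stage_cost, fraction_swapped, success_rate):
    """Rearrangements required per set of most-parsimonious trees discovered."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    per_set = total_rearrangements_per_set(
        first_stage_cost, second_stage_cost, fraction_swapped
    )
    return per_set / success_rate


if __name__ == "__main__":
    FIRST_STAGE_COST = 2.0e9    # hypothetical placeholder, not from the chapter
    SECOND_STAGE_COST = 22.3e9  # ca. 22.3 billion (50/2 000, sets starting at length 16 219)
    for fraction in (0.01, 0.05, 0.25, 1.0):
        per_set = total_rearrangements_per_set(
            FIRST_STAGE_COST, SECOND_STAGE_COST, fraction
        )
        print(f"fraction swapped {fraction:>5.2f}: {per_set / 1e9:8.1f} billion per set")
```

The sketch holds the second-stage cost fixed, whereas the text notes that it rises as sets of greater length are included, so it illustrates only the apportioning of first-stage swapping and should not be expected to reproduce the values in Table 7.2.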
during the first stage is now apportioned among a greater number of sets of trees that are subjected to second-stage swapping, so the average number of total rearrangements per tree set subjected to second-stage swapping actually diminishes.

With increasing percentages of the available tree sets subjected to second-stage branch swapping, the average number of total rearrangements required for each set approaches the average number required for one-stage searches conducted
with the same number of trees held as during the second stage of two-stage searches, and in at least some cases it overshoots that number and fewer branch rearrangements are actually required (e.g. Fig. 7.4a, in which the curve representing a two-stage search with two trees held in memory in the first stage and 1 000 in the second crosses the line corresponding to a one-stage search with 1 000 trees held). The limit of this trend would occur if second-stage branch swapping were conducted on all tree sets derived from the first stage of a search.

For example, if a two-stage search were conducted with 50 trees held during the first stage and 2 000 held during the second, and all tree sets derived from the first stage were subjected to second-stage swapping, the overall amount of swapping (and the average per set swapped with 2 000 trees held) would be slightly greater than that which is conducted during a one-stage search with 2 000 trees held, with the excess corresponding to the 50 trees derived from each first-stage search that are subjected to swapping a second time at the beginning of each second-stage round of branch swapping.

A similar relationship exists between the percentage of available tree sets that are subjected to second-stage swapping and the frequency of success in the discovery of most-parsimonious trees. When only the shortest trees obtained by first-stage swapping are subjected to second-stage swapping, the percentage of tree sets that yield most-parsimonious trees can exceed the percentage obtained with one-stage searches by an order of magnitude or more (Fig. 7.4b). The greatest frequency of success obtained in one-stage searches, specifically in those in which 2 000 or 5 000 trees were retained in memory, is ca. 2% (Table 7.1, Fig. 7.3a), while frequencies as great as 7%, 20%, and more were obtained in two-stage searches when as many as 12% of available tree sets were subjected to second-stage searches, with the greatest success rates observed when only the shortest trees obtained during the first stage were subjected to second-stage swapping (Table 7.2, Fig. 7.4b).

This increase in the percentage of searches yielding most-parsimonious trees, relative to the numbers obtained in single-stage searches, is also evident in searches in which the greatest number of trees held for branch swapping is as few as 50. In two-stage searches using the default polytomy and ambiguity settings in Nona, with two trees held during the first stage, 50 trees held during the second stage, and only the shortest 0.43% of available sets of trees from the first stage subjected to second-stage swapping (i.e. only tree sets of length 16 224 and less), 2% of the tree sets subjected to second-stage swapping yielded most-parsimonious trees (i.e. length 16 218; Table 7.2), while only 0.12% of single-stage searches with 50 trees held and with the same polytomy and ambiguity settings yielded trees of this length. In this case, the use of a two-stage search strategy increases the success rate by a factor greater than 16, and in other cases (e.g. with 50 trees held in the first stage, and 2 000 trees in the second, as compared to a one-stage search with 2 000 trees held) the increase in success rate is by a factor greater than 20.

It is evident, then, that the overall efficiency of two-stage searches reflects a complex set of tradeoffs between the various factors of the search, with a critical role played by the percentage of tree sets obtained during the first stage that are subjected to second-stage branch swapping. When only a small percentage of the available sets is subjected to second-stage swapping, the total amount of swapping that is required for each of these sets is large, because a complete accounting of the swapping required with this approach includes the large amount of first-stage swapping that is required to generate each tree set that is subjected to second-stage swapping. However, the success rate for second-stage swapping that is conducted on only a small percentage of the available tree sets (i.e. the shortest trees available from first-stage swapping) is also quite high.

These factors are integrated in the calculation of overall search efficiencies (Table 7.2, Fig. 7.4c). Before considering the results of particular search strategies, a general feature of this figure should be noted, which is that all of the two-stage searches that were conducted were more efficient than any of the one-stage searches. The combinations of settings that were tested in two-stage searches were not chosen at random, as the choice was based on the results of one-stage searches, but on
the basis of those results we are confident that no combination of settings for single-stage searches would yield results that were substantially superior to the most efficient ones described above. Hence, it appears that many combinations of settings for two-stage searches, if chosen according to reasonable criteria, will yield results that are superior to even the most efficient one-stage searches.

Two-stage searches conducted with relatively few trees held during the first stage (i.e. two trees), and with relatively large numbers of trees (e.g. 2 000 or 5 000) held during the second stage exhibit peak efficiencies when a relatively small percentage of the tree sets obtained during the first stage are subjected to second-stage swapping (Table 7.2, Fig. 7.4c). For example, the peak efficiency for a search with two trees held in the first stage and 500 in the second (i.e. a ‘2/500’ search) occurs when only the most optimal 0.43% of tree sets derived from the first stage are subjected to second-stage searching. The peak efficiencies of 2/1 000, 2/2 000, and 2/5 000 searches also occur when the most optimal 0.43% of tree sets from first-stage searching are subjected to second-stage searching. This overall pattern is not surprising, because the strategy implicit in searches of this sort is to conduct thorough searches on each of a very small number of tree sets that have been examined in only a cursory way during the first stage. Of the various two-stage searches conducted, the most efficient one is in this category (the 2/2 000 search) with an efficiency of ca. 234 billion tree rearrangements required per set of most-parsimonious trees obtained.

An alternative and similarly efficient strategy for two-stage searches is to conduct relatively thorough first-stage searches (i.e. with 50 trees held in memory), and to conduct second-stage searching on a relatively large percentage of the tree sets that are obtained in the first stage. Among the combinations that were examined, the 50/2 000 search yielded the highest efficiency (ca. 302 billion rearrangements per set of most-parsimonious trees obtained), with second-stage swapping conducted on the most optimal 3.1% of tree sets obtained during the first stage, and with second-stage swapping on the most optimal 6.5% of tree sets yielding nearly equivalent results. In searches of this sort (including 50/1 000 and 50/5 000 searches), the first stage of branch swapping yields a relatively large percentage of sets of most-parsimonious and nearly most-parsimonious trees, though at the cost of a considerable amount of branch swapping. In light of these facts, it is not surprising that the best results are obtained when a relatively large percentage of these tree sets is subjected to second-stage swapping.

It is notable that branch swapping during the second stage of the two most efficient two-stage searches (2/2 000 and 50/2 000) was conducted with the same number of trees held in memory, and that this number corresponds to one of the two points of peak efficiency for one-stage searches (Figs 7.1 and 7.3b). With 50 trees held during the first stage of the 50/2 000 two-stage analysis, both settings (i.e. 50 and 2 000) correspond to the two points of peak efficiency. For the 2/2 000 search, the greatest efficiency occurred when tree sets of length 16 224 and shorter were subjected to second-stage swapping (Table 7.2, Fig. 7.4c), and a secondary point of peak efficiency for the discovery of trees of this length and shorter occurs when two trees are held in memory (Fig. 7.3b). Thus, the two most efficient two-stage searches utilize settings that are predicted by observed points of peak efficiency for one-stage searches, and two of these settings (50 and 2 000) correspond to peaks in the excess branch-swapping ratio (Fig. 7.1c). We note that one of the settings for the most efficient two-stage search (i.e. two trees held) was not predicted by the excess branch-swapping ratio. It may be the case, however, that the most efficient two-stage searches, even for large data sets, often involve only cursory branch swapping during the first stage. If so, the utility of large numbers of such searches in identifying a few promising starting points for second-stage branch swapping may have more to do with breadth of coverage (i.e. among possible taxon-entry sequences) than with search efficiency per se.

7.9 Three-gene matrix

Preliminary analyses of the three-gene matrix, using the same range of number of trees held in memory as were examined with the zilla matrix
(i.e. a range of settings from 1 through 5 000), and the default Nona settings poly= and amb-, identified peaks in the excess branch-swapping ratio corresponding to 20 and 200 trees held in memory, with no evidence of a peak between 200 and 5 000. When BB swapping is conducted, an average of ca. 1.34 × 10^7 tree rearrangements are required to swap through a single tree, or about 1.4 times the number required with the zilla matrix. On the basis of the observed peaks in the excess branch-swapping ratio, two-stage searches with a variety of settings were conducted, and most-parsimonious trees were discovered eight times.

We regard the number of successful searches sufficient to provide only a general estimate of the optimal conditions for conducting a conventional analysis of this matrix, and the maximum efficiency obtainable by these methods, but the precision of these estimates is limited. The most efficient two-stage searches require approximately 1.5 trillion tree rearrangements for the discovery of each set of most-parsimonious trees, or about five to eight times as many as are required for the zilla matrix. This efficiency is obtained by two-stage searches conducted with 200 trees held in memory during the first stage, followed by second-stage swapping, with 2 000 trees held in memory, on the most optimal 3% of tree sets obtained during the first stage. The preliminary analyses had suggested a search-efficiency peak with 200 trees held in memory, but not the apparent peak at 2 000. It is possible that there is another peak that corresponds to 5 000 trees held in memory, but we have not conducted analyses with settings in that range.

7.10 Real time

The analyses of the zilla and three-gene matrices, as described above, were conducted with Nona on several different computers with a range of processor speeds. One of these computers, a Pentium 4 with a 1.7 GHz chip speed, can conduct and evaluate about 1.65 × 10^6 tree rearrangements per second. With a minimum of 2.34 × 10^11 rearrangements required for the discovery of one set of most-parsimonious trees (i.e. for a two-stage 2/2 000 search), this computer can discover a set of most-parsimonious trees about once every 1.6 days. On the fastest standard desktop PCs currently available, which have processor speeds at least twice that of this computer, it should be possible to discover approximately one set of most-parsimonious trees for the zilla matrix per day. Another computer that was used in the present study, with a 75 MHz Pentium I chip, was manufactured in 1994. On this computer, Nona conducts about 8.10 × 10^4 rearrangements per second (about 5% of the number conducted with the 1.7 GHz Pentium 4), and a set of most-parsimonious trees for the zilla matrix can be discovered with this computer, using conventional search methods, about once per month. Thus, by 1997, when computers with processors about three times as fast as the 75 MHz chip were available, as was Nona, it was possible to discover a set of most-parsimonious trees approximately once every 2 weeks with the proper settings. Under those circumstances, a year of processor time on a conventional desktop computer would have been sufficient to conduct a fairly thorough analysis of the zilla matrix, including time for preliminary analyses to determine the most efficient settings, and afterward, for the discovery of a dozen or more sets of most-parsimonious trees.

Why, then, did Rice et al. (1997), using a total of about a year of processor time on three Sun workstations, fail to discover shortest trees? We will leave aside the matter of the search efficiency of Nona vs. that of PAUP, and examine search strategies. First, it is likely that Rice et al. used the default ambiguity setting of PAUP, which closely resembles the amb= setting in Nona. On the basis of the results presented here, we would urge investigators to use settings corresponding to the amb- setting of Nona, in association with those corresponding to poly=.

Second, Rice et al. conducted one-stage searches and, as demonstrated here, almost any two-stage search with reasonable settings should outperform a one-stage search. With the zilla matrix, the most efficient two-stage searches are approximately five times as efficient as the most efficient one-stage searches (ca. 200–300 billion vs. ca. 1.5–1.6 trillion rearrangements required for the discovery of one set of most-parsimonious trees). In fact, the exclusive reliance on one-stage searches is perhaps
the greatest problem with the analysis of Rice et al. As discussed below, we believe that investigators should never rely on conventional one-stage searches when analyzing matrices for which multiple sets of presumed shortest trees cannot be found in a reasonable amount of time. Using currently available hardware and software this corresponds to matrices of ca. 150–250 taxa.

Third, Rice et al. conducted their one-stage searches with large numbers of trees held in memory, which allowed very few individual searches to be conducted during the course of their study. As demonstrated by our results, the most efficient searches of the zilla matrix are those with as few as 50 trees or even two trees held during the first stage of a two-stage search, and the most efficient one- and two-stage analyses never involved swapping with more than 2 000 trees held in memory.

7.11 Conclusions and recommendations

The present analysis of the 500-terminal rbcL matrix, conducted with a variety of software settings, demonstrates (unsurprisingly) that the amount of branch swapping that is required to complete a one-stage analysis increases with the number of trees held in memory. However, the rate of increase in the required amount of branch swapping is uneven, and we have demonstrated that the intervals in which the branch-swapping requirements ascend most steeply are those in which the average tree length obtained descends most steeply. The regions of greatest change in these relationships correspond in turn to the points at which the greatest excess branch-swapping ratios are obtained, and these ratios are themselves predictive of peaks in search efficiency. This overall set of relationships demonstrates that the optimal settings for tree searches are those that require substantially more branch swapping than when fewer trees are held for swapping, and hence those that consume substantially more processor time.

Perhaps it should not be surprising that the settings that require the greatest amount of branch swapping are also those that are most efficient, because the reason that these settings require so much branch swapping is that they are successfully discovering shorter trees, and subjecting them to additional swapping. What might not have been predicted, however, is that the increase in required processor time, and tree-search efficiency, rises so unevenly with an increase in the number of trees retained for branch swapping. The general significance of this phenomenon is that search efficiencies under some software settings may differ substantially, and in unpredictable ways, from those obtained with what may appear to be fairly similar settings (e.g. one-stage searches conducted with 50 or 2 000 trees held in memory are about three times as efficient in the discovery of most-parsimonious trees as those conducted with either 200 or 500 trees held; Table 7.1, Fig. 7.3b).

This study also demonstrates that optimal settings for one-stage searches are predictive to some degree of optimal settings for two-stage searches. This should not be surprising, since both stages of a two-stage search are themselves one-stage searches. However, this relationship is important because it allows a preliminary series of one-stage searches to be predictive of optimal settings for two-stage searches, which appear to be more efficient than one-stage searches, in general, for the analysis of large data sets. When conducting preliminary one-stage analyses, searches with very few trees held in memory (e.g. two) should be included, because settings in this range may be useful during the first stage of a two-stage search.

Using these various relationships, we have discovered that the zilla matrix is quite amenable to analysis using conventional methods. With Nona running on a standard PC, and with appropriately chosen software settings, a set of most-parsimonious trees can be discovered in a day or so of computer time. However, with matrices substantially larger than zilla, conventional searches on a single PC become impractical. The 567-terminal three-gene matrix appears to lie near the current limits of conventional cladistic analysis with a single PC. Fortunately, alternative methods such as tree fusion, tree drifting, sectorial searches, and the parsimony ratchet are available.

On the basis of the present analysis, and our experiences with other data sets, we believe that conventional single-stage analyses are sufficient
only with relatively small data sets. With a matrix of up to 100 or perhaps 150 taxa, a good starting point for exploratory analyses would be the equivalent of approximately 100 one-stage analyses with ca. 20 trees held in memory (i.e. hold/20; mult*100 in Nona), with polytomies allowed, and with unambiguous support required for each dichotomous resolution (i.e. poly= and amb- in Nona; if a user wishes to examine the diversity of trees obtained under alternative polytomy or ambiguity settings, additional branch swapping might be conducted under those conditions after a thorough search has been conducted using the recommended settings). There can be no guarantee that shortest trees have been discovered, even with matrices this small, but if the shortest trees found by the overall analysis are discovered by 10% or more of the 100 searches in each set of analyses, and if this occurs (with the same shortest tree length discovered in each set of 100 analyses) on repeated runs of mult*100, there is a good likelihood that these are most-parsimonious trees for the matrix. It may be advisable at that point to run a series of additional one-stage analyses, with greater numbers of trees held in memory (e.g. hold/50 or hold/100), and if shorter trees are not discovered, those that have been discovered can be accepted provisionally as shortest trees for the matrix, but more-extended swapping with a few of the optimal sets obtained (i.e. limited two-stage searching) still would be advisable. Note that this set of recommendations is contingent upon the discovery of trees of a given length by multiple individual searches within each set of 100 one-stage searches.

If the shortest trees obtained during the initial sets of one-stage analyses are found in only a small percentage of the individual searches, or if the number of taxa exceeds 150 or so and is less than 300 or so, it is advisable to consider running two-stage searches from the outset. Because the initial stage of a two-stage search is a series of one-stage searches, the analysis still begins, of course, with one-stage searches, but will not end with them. Preliminary analyses can be conducted with the intention of calculating excess branch-swapping ratios or, alternatively, a minimum of several hundred one-stage searches should be conducted, as described in the previous paragraph, with perhaps 2–20 trees held in memory, and with the best 5–10% of the tree sets obtained by these searches subjected to second-stage searching, with five to ten times as many trees held in memory during the second stage as were held during the first. Results from all searches that yielded shortest trees eventually should be combined into a single tree file, from which duplicate trees are eliminated (e.g. using the unique or best commands of Nona), and an attempt to swap through all of these trees should be made, with the goal of discovering all most-parsimonious trees for the matrix.

With matrices of more than 200 or so taxa, any results obtained by conventional searches should be verified by conducting parallel searches using the fastest available methods, such as those described by Nixon (1999) and Goloboff (1999); this approach was taken by Davis et al. (2004), who found identical sets of trees for a series of matrices with up to 218 terminals using conventional searches as well as the parsimony ratchet.

With matrices larger than 200–250 taxa, the use of these alternative methods should be regarded as essential, but conventional searches also should be conducted whenever practicable, for there appear to be some matrices in this size range that are more amenable to analysis by conventional methods than by the use of the more recently developed methods. Indeed, it should be noted that a parsimony ratchet analysis, as implemented in WinClada (Nixon 2002), is initiated with a short one-stage conventional search.

Thus, optimal approaches to the analysis of large data sets may involve successive stages in which various methods are employed (e.g. Tehler et al. 2003), including conventional searches during early stages (e.g. numerous one-stage searches) to generate sets of relatively short trees that are then subjected to additional searching using other methods. Goloboff (1999), for example, combined multiple search methods, and of the strategies he tested, the most efficient searches were those that used tree drifting and sectorial searches to produce suboptimal trees that were then subjected to tree fusion to produce optimal trees. A similar strategy was used by Tehler et al. (2003), but with their matrix the parsimony ratchet, rather than conventional searches, tree drifting, or sectorial
searches, was required to produce suboptimal trees with appropriate qualities (likely a sample of
trees from multiple islands) amenable to the discovery of shortest trees with tree fusion.

When only one or a few personal computers are available, it is currently possible to discover
shortest trees for matrices with up to 500–700 terminals using conventional methods alone, over
the course of several days or a few weeks. With larger data sets, however, the success of such an
endeavor would be questionable. Thus, conventional analytical methods are useful with smaller
data sets, and with larger ones they are likely to continue to play an important role during the
preliminary stages of searches that also invoke the more recently developed methods.
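
The rule of thumb given earlier in this chapter for one-stage searches (the same shortest length recovered in every set of 100 searches, by at least 10% of the searches in each set) can be checked mechanically once the tree lengths from each search have been tabulated. The following Python sketch is only an illustration: the function name, the input layout (one list of best lengths per set), and the example numbers are assumptions, not output of Nona or of any other program.

```python
def shortest_length_is_well_sampled(sets_of_lengths, min_fraction=0.10):
    """Return True if the overall shortest length is found in every set
    by at least min_fraction of that set's searches.

    sets_of_lengths: list of lists; each inner list holds the best tree
    length reached by each individual search in one set (hypothetical layout).
    """
    overall_best = min(min(lengths) for lengths in sets_of_lengths)
    for lengths in sets_of_lengths:
        hits = sum(1 for length in lengths if length == overall_best)
        if hits / len(lengths) < min_fraction:
            return False
    return True


# Made-up lengths for two sets of 100 searches each:
# 14% and 11% of the searches reach the shortest length, 512.
set_1 = [512] * 14 + [513] * 86
set_2 = [512] * 11 + [514] * 89
print(shortest_length_is_well_sampled([set_1, set_2]))

# Here the second set reaches the shortest length only 3 times in 100.
set_3 = [512] * 3 + [513] * 97
print(shortest_length_is_well_sampled([set_1, set_3]))
```

A passing check corresponds only to the "good likelihood" described in the text; it does not guarantee that the shortest trees have been found, and the additional swapping recommended there still applies.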
CHAPTER 8
Parsimony and Bayesian phylogenetics
Pablo A. Goloboff and Diego Pol
8.1 Introduction

Methods of phylogeny reconstruction are often divided into statistical methods (which require an
explicit model of evolution) and non-statistical methods. Among methods with an explicit
statistical justification, the most widely used are the methods of maximum likelihood, resulting
from Felsenstein’s (1973, 1981c) work, and more recently, Bayesian phylogenetic methods based on
Monte Carlo Markov chains, following Li (1996), Mau and Newton (1997), and Larget and Simon (1999).

The aim of a statistically based method is to estimate tree topologies and values of possibly
relevant parameters, as well as the uncertainty inherent in those estimations. A method that could
do that with reasonable accuracy would be attractive indeed. It is often claimed that it is
advantageous for a method to be based on a specific evolutionary model, because that allows
incorporating into the analysis the ‘knowledge’ of the real world embodied in the model. Bayesian
methods have become very prominent among model-based methods, in part because of computational
advantages, and in part because they estimate the probability that a hypothesis is true, given the
observations and model assumptions.

Early work on phylogenetics suggested the desirability of probabilifying the falsehood or truth of
hypotheses. This includes early papers by Farris (1973, 1977, 1978), who later reconsidered the
question of whether phylogeny estimation is to be viewed as a statistical problem or not, and moved
to the position that phylogenetic inference is best viewed in non-statistical terms (Farris 1983).
When he first approached phylogeny as a statistical problem, Farris (1973, p. 250) pointed out that
the tree to be selected “should be the most probable tree on the basis of available data,” and that
(for tree $T$ and data $X$) this probability (normally called posterior probability) can be
calculated with Bayes’ Theorem:

$$\Pr(T \mid X) = \frac{\Pr(X \mid T)\,\Pr(T)}{\Pr(X)}$$

where $\Pr(T)$ is the prior probability of the tree being analyzed (i.e. the probability, a priori
of any observation, of the tree being the true one), the factor $\Pr(X \mid T)$ is the likelihood
of the topology (i.e. the probability of the data, given the tree), and the denominator $\Pr(X)$ is
the prior probability of the observed data (calculated as $\sum_T \Pr(X \mid T)\,\Pr(T)$ for all
possible topologies). Farris (1973) noted that because the prior probability of each tree topology
can be assumed to be the same (equal prior probabilities are usually called a flat prior), and
because $\Pr(X)$ is fixed for the given observations, the choice, equivalent to parsimony, depends
only on the likelihood of the tree. Farris (1973) developed a very general model, with minimal
assumptions; under that model, the most likely tree is equivalent to the most-parsimonious tree. In
the very same issue of Systematic Zoology, Felsenstein (1973) laid the basis for his subsequent
developments of a very different model, with much more specific assumptions (including assumptions
of Markovian evolution and Poisson substitution), conceived mostly as applicable to the evolution
of DNA sequences. In the approach of Felsenstein (1981c) the values of parameters as well as branch
lengths are jointly estimated
in order to maximize the likelihood function of a tree.

Bayesian approaches to phylogenetics have taken Felsenstein’s methods a step further, and instead
of producing point estimations of all parameters to maximize the likelihood, they have suggested
integrating the likelihood across the different possible parameter values (i.e. branch lengths and
substitution model parameters):

$$\Pr(X \mid T) = \int_{B_T} \int_{\Phi} \Pr(X \mid T, b_T, \varphi)\, f(b_T, \varphi)\, d\varphi\, db_T$$

where $B_T$ is the set of possible branch lengths ($b_T$) of topology $T$, $\Phi$ is the set of all
possible substitution parameter values ($\varphi$) of the model, and $f(b_T, \varphi)$ is the prior
distribution of these parameters. Both Farris (1973) and Felsenstein (1973) had considered such a
type of integration desirable, but noted that a major problem with this approach is that it
involves the calculus of a multidimensional integral for every possible topology, which is
exceedingly complex and computationally demanding.

In order to overcome this problem, some researchers (e.g. Farris 1973; Hasegawa and Kishino 1989;
Smouse and Li 1989) have attempted to compute the Bayesian posterior probability of a topology
using the parameter values that maximize its likelihood factor (e.g. the maximum likelihood
estimate of branch lengths). However, this approximation (as noted by Goloboff 2003, for the case
of maximum likelihood) ignores an infinite number of additional hypotheses that result from
alternative sets of branch lengths (or other parameter values) for that topology.

Others, instead, have suggested calculating the exact probabilities, integrating the likelihood of
a topology across all possible sets of branch lengths (e.g. Rannala and Yang 1996) or other
parameters (e.g. Sinsheimer et al. 1996). The complexity of this procedure, however, precludes its
applicability to data sets of more than a few sequences, and therefore these methods were hardly
ever used.

8.2 Markov chain Monte Carlo

Recently, three independent groups originally applied Markov chain Monte Carlo methods (MCMC) to
approximate the posterior probabilities of trees (Li 1996; Mau 1996; Mau and Newton 1997; Yang and
Rannala 1997; Larget and Simon 1999; Mau et al. 1999; Newton et al. 1999; Li et al. 2000).

The idea in a MCMC is to make computationally feasible the integration of the posterior
probabilities across the parameters of interest (e.g. topology, branch lengths, substitution
parameters). The chain uses a proposal mechanism, which consists of gradual modifications from a
starting point (ideally, randomly chosen), and it alternatively changes some parameter values (e.g.
topology, branch lengths, substitution parameters), stochastically and aperiodically. These
proposals or transitions are accepted with a probability given by the Metropolis–Hastings algorithm
(Metropolis et al. 1953; Hastings 1970; see Larget and Simon 1999 or Huelsenbeck et al. 2002 for
details) and the Markov chain proceeds until it reaches a stationary state.

If the Markov chain is irreducible (i.e. it is possible for the chain to visit every possible set
of parameters and tree topologies) its stationary state converges to the joint posterior
probability distribution of the parameters being modified (Tierney 1994). Thus, the frequency with
which a given topology is visited in the Markov chain approximates its marginal posterior
probability (Mau et al. 1999). Thus, the results of MCMC are directly interpreted in probabilistic
terms; they can estimate the probability that a particular tree is the true tree for these
sequences, conditional on the stochastic model of substitution (Li et al. 2000). Additionally,
since the posterior probability distribution is simultaneously estimated, measures of uncertainty
can be derived from the Markov chain.

The outcome of the stationary state of the MCMC is a set of phylogenetic trees (with their
associated parameters). In phylogenetic applications of MCMC, the relative frequency of each
topology (irrespective of branch length and substitution parameter values) is interpreted as its
posterior probability (given the stochastic model and data). Therefore, it seems straightforward to
take the topology with the highest posterior probability as the point estimate of the true
topology. This was clearly recognized by several authors (Li 1996; Rannala and Yang 1996;
Yang and Rannala 1997; Larget and Simon 1999; Mau et al. 1999). The estimated tree was referred to
as the maximum posterior probability (MAP) tree by Rannala and Yang (1996).

However, the availability of the approximation of the posterior distribution of trees also allows
the evaluation of the variability of the estimates (e.g. topology or any other parameter integrated
by the MCMC). As noted by Mau et al. (1999), summarizing the distribution of MCMC trees indeed
presents a challenge and several methods have been proposed for such purpose.

For instance, Mau et al. (1999) and Rannala and Yang (1996) noted that a Bayesian “credible set”
can be obtained as the collection of topologies having the sum of their posterior probabilities
constrained to be no less than a specified value (e.g. 0.95). Li (1996) also considered alternative
ways to estimate the posterior probability of the true phylogeny from the MCMC results, such as the
use of the tree that has the minimum topological distance to the majority (e.g. 90%) of the MCMC
trees, or the use of a majority-rule consensus of the set of topologies generated by MCMC.

In the latter option, the frequency of the clades has been interpreted by most Bayesian
phylogeneticists (e.g. Huelsenbeck et al. 2002) as the posterior probability that the clade is
true, following ideas of Newton et al. (1999) and Larget and Simon (1999). These authors propose
summing the posterior probabilities of the trees in which each clade of the MAP is present as a way
to summarize uncertainty in the tree topology estimate. This approach, which sums the frequency
with which a particular clade appears in the Markov chain in order to estimate its posterior
probability, is certainly the most commonly used way to summarize MCMC results. This approach is
implemented in available software packages (e.g. MrBayes of Huelsenbeck and Ronquist 2001; BAMBE,
of Simon and Larget 1998), and is frequently reported in empirical analyses using Bayesian methods.
Here we will focus on some undesirable properties found on this frequently used option to summarize
MCMC results. Other alternative ways to summarize the results are less frequently used, and they
differ from this one in depending much more on whether the chain has succeeded in finding the
actual MAP(s). As the chain is not conceived as a search mechanism, but instead as a sampling
mechanism, it is extremely unlikely that it will find the individual trees of maximum a posteriori
probability, except in very small data sets.

8.3 Problems with estimations of monophyly by MCMC

In this section, the discussion will be within the realm of the rules and goals postulated by
defenders of model-based methods. We also have other general concerns about model-based methods;
these reflect a viewpoint not shared by Bayesians, and are therefore discussed in the following
section. While the MCMC can be used to estimate any parameter of the evolutionary process, we are
concerned here with the estimates that are relevant for phylogenetic studies: estimations of
monophyly of groups. Other parameters, such as transition:transversion (ts:tv) ratios, while
possibly the primary interest for other evolutionary studies, are only of secondary interest to the
phylogeneticist. Part of the attraction of MCMC Bayesian methods is that the values estimated for
those other parameters, such as ts:tv ratios, do not rely on estimation of a tree topology, an
advantage for such studies which we do not dispute. However, the fact that our examples show that
there are problems when the estimations of monophyly are carried out in a certain way suggests that
establishing proper estimations from MCMC is far from automatic, and raises concerns about the
validity of the inferences of those other parameters as well.

The most common approach to estimating probability of monophyly of a group X is by summing the
posterior probabilities of all the trees where group X is monophyletic. This can be done for the
groups present in the individual tree of highest posterior probability (as proposed in Larget and
Simon 1999), or for each of the groups found in the analysis; these options make no difference for
our argument.

Huelsenbeck et al. (2002, p. 674) claimed that, since “Bayesian inference is based on the
likelihood function, it should inherit many of the nice statistical properties of the
maximum-likelihood
method.” The “nice statistical property” for which likelihood has been held superior to parsimony
is, quintessentially, statistical consistency. Statistical consistency has been proven for maximum
likelihood, but only as a byproduct of the consistent estimation of the branch lengths between taxa
(see Rogers 1997; Chang 1996; with discussion in Goloboff 2003). If the tree topologies are
estimated without estimating branch lengths (integrating branch lengths for a given tree topology,
as done in the Bayesian methods), then statistical consistency might be lost (as discussed in
Goloboff 2003). And even if Bayesian analysis used optimal branch lengths (which would slow it down
considerably), the fact that posterior probabilities of individual clades are estimated from sums
of posterior probabilities of the trees having the clade still creates problems. So, the idea that
Bayesian analysis should automatically “inherit the nice statistical properties of maximum
likelihood” is no more than wishful thinking; Bayesian analysis with MCMC involves substantial
modifications to maximum-likelihood.

Estimating the posterior probability for monophyly of a given group as the sum of posterior
probabilities of the trees with that group may create serious problems, and it is easy to see why.
Imagine that there is a single tree of highest likelihood¹, where group X is not present. Imagine
that there are many trees of a likelihood only slightly inferior, where group X is monophyletic.
The sum of the likelihoods of the trees with the group may exceed the likelihood of the one tree
without the group, and then the method would conclude that the group has a relatively large
probability of being monophyletic. While there are many situations under which such an asymmetry
could occur, some of them are surprisingly simple.

Consider the case of Fig. 8.1, a 25-taxon data set, with taxon A having only missing entries. The
data determine a perfectly pectinate tree, except for the placement of A. The strict consensus for
these data (analyzed with either parsimony or likelihood) is an unresolved bush, which does express
the fact that the monophyly of no group is actually supported by the data. Note that A can float in
the skeleton tree of the remaining taxa; each of the 45 trees with alternative placements of A has
exactly the same likelihood (and thus, under a flat prior on tree topologies, the same posterior
probability). However, of all those trees, only two (A sister to B, or A sister to C) make the
group BC non-monophyletic; the proportion of trees with the group BC monophyletic is thus
43/45 = 0.955. That is almost exactly the posterior probability for monophyly of BC estimated by
MrBayes (see Fig. 8.1; values on the branches are values reported by MrBayes, numbers above the
branches are the frequencies of the groups in the 45 most-parsimonious trees). The group BCD,
instead, is made non-monophyletic by two additional locations of taxon A, so it is monophyletic in
a proportion of 41/45 = 0.911. As one moves towards the middle of the tree, the proportion of
locations that make the group non-monophyletic increases (6/45 for group BCDE, 8/45 for group
BCDEF, etc.), so the proportion of trees with the group monophyletic decreases. Past the middle of
the tree, the proportion of trees with the group starts increasing again. This is reflected almost
exactly in the posterior probabilities reported by MrBayes. Since this is perfectly expected, the
proposal mechanism used by MrBayes seems, at least for data sets as simple as this one, to provide
a sample of the tree space adequate to estimate the sums of posterior probabilities for different
groups; our criticism has nothing to do with sampling problems, but simply with the quantity that
is being estimated. The (estimated) sum of posterior probabilities of the trees with and without a
group provides a measure with no apparent utility. Using such a measure leads to the unfounded
conclusion that, even when what we know about A is nothing, we can still estimate with some
precision its placement in the tree! That location of A is determined rather by the priors on
trees, but that means that the priors on groups are highly unequal. That an equal prior on trees
may mean an unequal prior on groups has been discussed by Pickett and Randle (2005); Pickett and
Randle also note that using flat priors on some aspects of a simulation may impose non-flat priors
on other aspects. While the non-flat priors on groups (which undoubtedly exist) influence the
posterior

¹ Whether the likelihood is calculated as the likelihood for optimal branch lengths, or the sum of
the likelihoods for all the branch lengths for the given topology, makes no difference to our
argument.
ROOT AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
X AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
W GGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
V GGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
U GGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
T GGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
S GGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
R GGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Q GGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
P GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
O GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
N GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
M GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
L GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
K GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
J GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
I GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAA
H GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAA
G GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAA
F GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAA
E GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAA
D GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAA
C GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
B GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
A ????????????????????????????????????????????????????????????????????????????????????
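[Tree diagram not reproduced: pectinate tree with terminals, from top to bottom, ROOT, X, W, V, U,
T, S, R, Q, P, O, A, N, M, L, K, J, I, H, G, F, E, D, C, B. Each branch carries three support
values (on, above, and below the branch; see caption), ranging from about 96 near either end of the
tree to about 50 near the middle; individual values are not recoverable from the extracted layout.]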
Figure 8.1 A data set with a taxon (A) scored only with missing entries. No group has any actual support, since the monophyly of any group can
be violated at no cost. The numbers on the branches are the posterior probabilities of monophyly, estimated by MrBayes with 100 000 generations,
using four chains, with a sampling frequency of 100, and a ‘burn-in’ of 250 (i.e. discarding the first 25 000 generations). The numbers above the
branches indicate group frequency in the most-parsimonious (dichotomous) trees. The numbers below the branches show the bootstrap frequencies,
as calculated by PAUP* (with 100 replications, analyzing each resampled data set with a branch-and-bound solution). Tree topology corresponds to
the analysis with MrBayes.
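
The arithmetic behind the group frequencies discussed for Fig. 8.1 can be reproduced with a few lines of Python. This is a minimal sketch under a flat prior on topologies, where the posterior of a group is simply its share of the summed likelihoods. The function names are illustrative, and the second example (one best tree lacking group X, twenty slightly worse trees containing it) uses made-up likelihoods to mimic the asymmetry described in the text.

```python
def group_posterior(likelihoods, has_group):
    """Posterior probability of a group under a flat prior on topologies.

    With equal priors, Pr(T|X) is proportional to Pr(X|T), so the group's
    posterior is the share of the total likelihood contributed by the
    trees in which the group is monophyletic.
    """
    total = sum(likelihoods)
    with_group = sum(lk for lk, present in zip(likelihoods, has_group) if present)
    return with_group / total


N_PLACEMENTS = 2 * 24 - 3   # branches of the unrooted 24-taxon skeleton tree: 45


def breaking_placements(k):
    """Placements of A that break a tip group of k terminals (B, C, ...).

    The group spans k terminal branches and k - 2 internal branches;
    attaching A to any of them makes the group non-monophyletic.
    """
    return 2 * k - 2


# Fig. 8.1: all 45 placements of A have the same likelihood.
for name, k in [("BC", 2), ("BCD", 3), ("BCDE", 4), ("BCDEF", 5)]:
    broken = breaking_placements(k)
    has_group = [False] * broken + [True] * (N_PLACEMENTS - broken)
    p = group_posterior([1.0] * N_PLACEMENTS, has_group)
    print(f"{name}: {N_PLACEMENTS - broken}/{N_PLACEMENTS} = {p:.3f}")

# Hypothetical asymmetric case: the single best tree lacks group X, but
# 20 trees with 90% of its likelihood contain it (numbers are made up).
likelihoods = [1.0] + [0.9] * 20
has_x = [False] + [True] * 20
print(f"group X: {group_posterior(likelihoods, has_x):.3f}")
```

The loop recovers the fractions 43/45 and 41/45 quoted in the text, and the last line gives 18/19 for a group that is absent from the single best tree.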
probabilities reported by MrBayes, that is only part of the picture; the other aspect is the shape
of the likelihood landscape, which is what our examples show.

Admittedly, the example of Fig. 8.1 is contrived in that no worker will attempt to analyze a matrix
where a taxon is represented only by missing entries. But the same effects may come in much
more subtle flavors; for example, a sub-clade of a larger clade that can connect with different
rootings (all with about the same likelihood) to the rest of a tree will produce, inside the clade
with an undetermined root, the same effect observed for Fig. 8.1. This can also happen even for
groups of a relatively large size (which non-flat priors on groups of different sizes do not easily
explain), as in the example of Fig. 8.2, analyzed with MrBayes under the No Common Mechanism model
(= parsimony). Under such a model, group N–Z is well supported by the data, and group T–Z is not:
each of the characters that might support the monophyly of T–Z becomes an ambiguous synapomorphy
when taxon M is the sister group of N–Z (so that there are trees of best fit that do not have T–Z
as monophyletic). However, this happens only when M is the sister group of N–Z; for each of the
other (numerous) possible locations of M in the rest of the tree, the group T–Z is required to
provide the best fit to the data. The group T–Z is present in about 91% of the most-parsimonious
trees for the data set². Not surprisingly, MrBayes reports unsupported group T–Z as strongly
supported, with a posterior probability of 0.93.

The examples of Figs 8.1 and 8.2 were not derived from any model, and for this reason may perhaps
be dismissed by Bayesians as being irrelevant. But the same effect can appear even in simulated
data, where there are no violations of the model. The easiest way to produce the effect is to mimic
the conditions of Fig. 8.1. For this, we used as the model tree a perfectly pectinate tree, as in
Fig. 8.3, with taxa A and B forming a monophyletic group at the tip of the tree, and successive
terminals appearing as successive sister groups. All the branches in the tree had a length of

ROOT AAAAAAAAAA AAAAA
A    AAAAAAAAAA AAAAA
B    AAAAAAAAAA AAAAA
C    AAAAAAAAAA AAAAA
D    AAAAAAAAAA AAAAA
E    AAAAAAAAAA AAAAA
F    AAAAAAAAAA AAAAA
G    AAAAAAAAAA AAAAA
H    AAAAAAAAAA AAAAA
I    AAAAAAAAAA AAAAA
J    AAAAAAAAAA AAAAA
K    AAAAAAAAAA AAAAA
L    AAAAAAAAAA AAAAA
M    AAAAAAAAAA GGGGG
N    GGGGGGGGGG AAAAA
O    GGGGGGGGGG AAAAA
P    GGGGGGGGGG AAAAA
Q    GGGGGGGGGG AAAAA
R    GGGGGGGGGG AAAAA
S    GGGGGGGGGG AAAAA
T    GGGGGGGGGG GGGGG
U    GGGGGGGGGG GGGGG
V    GGGGGGGGGG GGGGG
W    GGGGGGGGGG GGGGG
X    GGGGGGGGGG GGGGG
Y    GGGGGGGGGG GGGGG
Z    GGGGGGGGGG GGGGG
Figure 8.2 A data set with a group (T–Z) unsupported but found in many optimal trees, and thus with
a high estimated posterior probability. See text for details.

[Tree diagram not reproduced: pectinate tree with terminals J, I, H, G, F, E, D, C, B, A (top to
bottom); branch lengths labeled 0.03 and, for the branch leading to A, 3.00; posterior
probabilities 0.744, 0.769, 0.799, and 0.883 shown on four branches, from the middle of the tree
towards the tip.]
Figure 8.3 Tree shape used in the simulations (results reported in Fig. 8.4). Data were generated
for trees with different numbers of taxa, using a Jukes–Cantor model. All branches were of length
0.03, except the branch leading to A, with a length of 3.0. The simulations generated 1 000
characters each. MrBayes analyses used 50 000 generations, with three chains, sampling every 50
generations, and a burn-in of 250 (i.e. discarding the first 12 500 generations). The posterior
probabilities are shown for four incorrect groups (for 50 taxa, average posterior probability for
20 replications); note that the posterior probabilities decrease towards the middle of the tree,
just as in Fig. 8.1.

² We calculated the frequency of group T–Z in most-parsimonious trees by taking a pseudo-random
sample of 1 000 most-parsimonious trees. We generated each by a Wagner tree where both the
insertion and addition sequences were randomized (as implemented in TNT; Goloboff et al. 2004),
followed by tree bisection and reconnection (TBR) branch swapping. Randomizing the insertion
sequence means that, for each taxon to be added to form the Wagner tree, the pre-existing locations
to insert the new taxon are tried in a random order; this eliminates bias in tree shapes in the
resulting Wagner trees for poorly informative data.
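
The statement that visit frequencies in a Markov chain approximate posterior probabilities can be illustrated with a toy sampler. The sketch below is not the proposal mechanism of MrBayes: it treats the 45 placements of taxon A from Fig. 8.1 as the only states, uses a uniform (hence symmetric) proposal, and accepts moves with the Metropolis ratio under a flat prior. The state indices, function names, chain length, and the "tilted" likelihoods are all assumptions made for illustration.

```python
import random

N_PLACEMENTS = 45          # possible attachment points for taxon A (Fig. 8.1)
BREAKS_BC = {0, 1}         # assumed indices for "A sister to B" and "A sister to C"


def sample_placements(likelihood, n_steps, seed=1):
    """Metropolis sampler over placements; returns visit counts per placement."""
    rng = random.Random(seed)
    state = rng.randrange(N_PLACEMENTS)
    visits = [0] * N_PLACEMENTS
    for _ in range(n_steps):
        proposal = rng.randrange(N_PLACEMENTS)        # uniform, hence symmetric
        ratio = likelihood[proposal] / likelihood[state]
        if rng.random() < min(1.0, ratio):            # flat prior: ratio of likelihoods
            state = proposal
        visits[state] += 1
    return visits


def bc_frequency(visits):
    """Fraction of sampled states in which group BC is monophyletic."""
    broken = sum(visits[i] for i in BREAKS_BC)
    return 1.0 - broken / sum(visits)


# Equal likelihoods, as in Fig. 8.1.
flat = [1.0] * N_PLACEMENTS
print(round(bc_frequency(sample_placements(flat, 50_000)), 3))

# Made-up case: the two placements that break BC are the best trees,
# yet BC is still sampled in the large majority of states.
tilted = [1.0 if i in BREAKS_BC else 0.9 for i in range(N_PLACEMENTS)]
print(round(bc_frequency(sample_placements(tilted, 50_000)), 3))
```

Both sampled frequencies should fall close to the exact values, 43/45 for the flat case and 38.7/40.7 for the tilted one, so the chain is sampling correctly; the high value for BC comes from the quantity being summed, not from sampling error.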
0.03 (thus, a probability of no change along the branch of 0.978; we used a Jukes–Cantor model),
except for the branch leading to taxon A, which was very long (with a length of 3; that is, a
probability of no change of 0.287). The model tree was used to generate simulated data sets, with
1 000 characters, for different numbers of taxa. Since A has a very long branch, it connects to the
rest of the tree with about the same likelihood at every possible location. The effect is therefore
the same as that of Fig. 8.1. In most of the simulations, MrBayes reports a high posterior
probability that the group BC is monophyletic, which is in fact false. The estimated probability of
monophyly of the wrong group BC actually increases with the number of taxa, since then the number
of alternative locations of A that make a significant contribution to the sum of posterior
probabilities for group BC also increases. Because there is significant variability in different
simulated data sets, we used 20 replications for each of 5, 10, 20, 30, 40 and 50 taxa. The results
are shown in Fig. 8.4. While there is of course some sampling error in our measurements, the trends
evident in Fig. 8.4 make it clear that the high posterior probability attributed to the wrong
groups (often over 0.90) is not the effect of sampling error or lack of convergence in the chains.
As the number of taxa increases, so does the apparent confidence in the false groups (the more so
for the smaller groups), while the confidence in the true groups decreases (the more so for the
smaller groups). Whereas the reference to smaller and larger groups makes sense in these examples,
with pectinate trees, this does not mean that MCMC analysis will in general favor groups of a given
size; the problem arises because of the relative differences in likelihood (= posterior
probabilities, since we used a flat prior on trees) of those trees with and without each group, and
this effect could potentially happen for groups of any size. These results could possibly derive as
well from violations of the model, or from examining data for several genes where some of the taxa
have not been sequenced for all the genes.

The examples are not intended to be realistic, but they show unequivocally that the estimations of
posterior probabilities of individual groups may lead to grossly mistaken conclusions, and in real
cases such an effect can easily be confounded by other factors. For the simulated examples, a taxon
[Plot not reproduced: two panels, “True groups” (series A–B, A–C, A–D, A–E, A–F) and “False groups”
(series B–C, B–D, B–E, B–F, B–G, B–H). Vertical axis: average posterior probability (0 to 0.9);
horizontal axis: total number of taxa (0 to 60).]
Figure 8.4 Results for the simulations, using the model tree shown in Fig. 8.3, for different numbers of taxa. As the number of taxa increases, so
does the estimated posterior probability of the false groups (BC, BCD, etc.), the more so the smaller the group. All the averages reported correspond
to 20 replications for each number of taxa.
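
The probabilities of no change quoted in the text for branch lengths 0.03 and 3 (0.978 and 0.287) can be checked against the Jukes–Cantor formula. The sketch below evaluates two conventions and is an assumption about how those figures arise, not a statement of what the authors or MrBayes computed. With the exponent taken as -v the quoted values are recovered, whereas the widely used convention in which v is the expected number of substitutions per site uses the exponent -4v/3 and gives somewhat lower values. The function and parameter names are illustrative.

```python
import math


def jc_prob_no_change(v, exponent_scale):
    """Jukes-Cantor probability that a site shows the same base at both
    ends of a branch of length v: 1/4 + 3/4 * exp(-exponent_scale * v)."""
    return 0.25 + 0.75 * math.exp(-exponent_scale * v)


for v in (0.03, 3.0):
    as_quoted = jc_prob_no_change(v, 1.0)          # matches 0.978 and 0.287
    per_site = jc_prob_no_change(v, 4.0 / 3.0)     # v in substitutions per site
    print(f"v = {v}: {as_quoted:.3f} (exponent -v), {per_site:.3f} (exponent -4v/3)")
```

Under either convention the long branch leading to A is close to the saturation value of 0.25, which is why A attaches with about the same likelihood anywhere in the tree.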
with a branch as long as the branch leading to A cannot be confidently placed anywhere in the tree; every location will have roughly the same fit. The most serious problem faced by Bayesian analysis is not that it places A in some definite location (i.e. in the middle of the tree), but rather that it leads one to conclude that there is a very high probability that A is not placed as sister to B, which is the one true placement. A proper method should recognize, in cases like Fig. 8.3, that no conclusion is possible. Note that our criticism of Bayesian analysis, in this case, is not equivalent to Siddall’s (1998) criticism of likelihood; Siddall (1998) criticized likelihood because in his simulations it separated long branches that were in fact sisters; Swofford et al. (2001) showed in their reply that, while likelihood indeed separates long sister branches (for small numbers of characters), the likelihoods of the alternative trees that place those long branches together are only slightly inferior, so that the maximum likelihood analysis actually implies that no decision is possible. That is not the case for the Bayesian results; they attribute a high probability to false groups that should at least be recognized as ambiguous.

8.4 Potential problems of the statistical approach

Statistically justified, model-based approaches to phylogeny have come to dominate the field in the last decade, but many authors still feel that those model-based justifications miss the mark. The controversy, not surprisingly, has often involved criticism and even misrepresentation from both sides. Among the topics on which the debate has centered are the questions of statistical consistency, the complexity of the evolutionary models used, the possible empirical basis of the evolutionary models on which the inferences are to be based, and whether epistemological considerations support the use of specific models of evolution.

The issue of consistency has been discussed mostly in relation to the likelihood vs. parsimony controversy (Steel et al. 1993; Siddall 1998; Farris 1999; Swofford et al. 2001). We consider that consistency, since it is relevant, at best, only under unrealistic conditions (i.e. infinite, or at least massive amounts of data evolving under the same model³, with a perfect fit to the model), is not a very relevant property when deciding among possible methods of phylogenetic inference. Even some statistically inclined phylogeneticists hold this point of view (e.g. Kim 1996; Sanderson and Kim 2000). The focus of this chapter is on Bayesian methods: Bayesians, with their claim that Bayesian analysis, being based on likelihood methods, should inherit the “nice statistical properties” of likelihood (see above), have adhered (implicitly, at least) to the notion that consistency is desirable. However, as we show later, the only feasible implementation of Bayesian phylogenetic analyses is likely to suffer from inconsistency.

³ Note that we say here “under the same model.” While it is true that current-day techniques allow sampling of very long DNA sequences, the chances of all the sites still obeying the same model decrease as the sequences become longer and include different genes or gene regions (as pointed out by Pol and Siddall 2001). The amount of data available for a given model will always be on the order of a few kilobases.

The complexity of the inferential models used has also appeared in the likelihood vs. parsimony controversy. Although several likelihoodists (Goldman 1990; Steel and Penny 2000; Lewis 2001) had suggested that parsimony requires estimation of many more parameters than maximum likelihood, Goloboff (2003) reconsidered the problem and concluded that, if anything, parsimony requires estimation of fewer parameters than traditional maximum likelihood methods (a similar conclusion had been reached by Farris 1986, p. 22). Bayesian phylogenetic methods could in theory integrate uncertainty in some parameters during the MCMC (thus not requiring ‘estimation’ of those parameters). The problem is that the parameter space to be explored then becomes more complex, so that the chain would have to be run for much longer to ensure convergence and an adequate sampling.

Finally, the epistemological questions about using evolutionary models and their empirical basis are perhaps the problems that have been least openly discussed in the literature. Several authors have presented the controversy in terms of which of the two approaches can be justified under Karl
Popper’s philosophy (Popper 1968). Most notable among recent philosophical defenses of model-based approaches is perhaps the paper by de Queiroz and Poe (2003). They argue that parsimony can be justified as a Popperian approach only by reference to specific models of evolution. de Queiroz and Poe (2003) say that it is not true that characters provide falsifiers of phylogenies (which Farris 1983 had used to characterize phylogenetic hypotheses as falsifiable), because a phylogeny cannot per se make impossible any particular character distribution. de Queiroz and Poe (2003) argue that, having disposed of other possible justifications, the only way to falsify a phylogeny is to show that it is less probable than its rival, and that parsimony can only be justified as Popperian if coupled with specific evolutionary models that specify those probabilities. But de Queiroz and Poe (2003) have not actually disposed of other possible justifications: they present only part of Farris’ arguments. Farris (1983) had made it clear that no character could provide absolute falsification, and that the relationship between falsifier and hypothesis is purely logical. That is, if a given apparent character-state homology between two taxa is truly due to common ancestry, then it follows that a phylogeny that places them apart is truly false. Contra de Queiroz and Poe (2003), a strictly logical justification of parsimony, made without reference to a specific evolutionary model, is possible. Probabilistic models are necessary only to interpret the results of a parsimony analysis probabilistically; they are unnecessary otherwise.

The question of whether some apparent homologies are more probably truly homologous than others only enters the picture when we accept statistical justification and specific models. Contrary to some defenders of parsimony (e.g. Siddall and Kluge 1997; Kluge 1997, 2001; Kluge, Chapter 2 of this volume), we do not argue that such a type of justification is philosophically and intrinsically flawed. Our concerns have to do with common sense, more than with philosophy. If one knew with certainty that sequence evolution is driven exclusively by a reduced set of parameters, and that those parameters remain very stable over time, then model-based methods of phylogeny reconstruction would be perfectly justified. Using those model-based methods would have the advantage that they make it possible to provide measures of uncertainty with a direct interpretation.

The alternative is considering that sequence evolution is driven by too many parameters, which may change too much over time, and that the samples (of sequences) we may expect to ever obtain are far below what could reasonably allow accurate inferences. Of course no one expects inferences that are 100% error-free; but the problem is how much is too much. Philosophical positions aside, many people who use parsimony place themselves on the ‘too much’ side of the scale, and tend to think that the probabilities estimated by using specific models are likely to be so far off that there is no point in trying to consider the results in terms of real probabilities. All we can expect is to simply provide the best explanation of the data, and it is best to remain silent about the probability of the resulting hypothesis being true. When de Queiroz and Poe (2003) claim that parsimony can be justified only by reference to some specific model, they mean “parsimony as a statistical method can be justified only by reference to some specific model,” which is true in itself, but then most proponents of parsimony do not view parsimony as attempting to provide the tree with the highest posterior probability: any attempt to provide a figure representing an actual probability, in the case of a process as complex as phylogeny, is no more reliable than a guess⁴. In this sense, the number of things that model-based methods try to estimate (statistically speaking) is much greater, and then it is natural that researchers with no previous experience in the field are attracted by estimation methods which are apparently omnipotent.

⁴ Measures of support such as the Bremer support (Bremer 1994; Goloboff and Farris 2001) or resampling (Farris et al. 1996; Goloboff et al. 2003b) are often interpreted as somehow measuring the truth content of the hypothesis, but this is not correct: all they measure is how much evidence supports the hypothesis.

To some extent, the two aims, providing the best possible rationalization of the data by means of a phylogeny, or providing the best statistical
estimation of the phylogeny, are both defensible in their own right. The difference is not only regarding whether a probabilistic model forms the basis for inferences or not; the difference is also about how the results are to be interpreted. Which aim a particular worker prefers and pursues may depend on many factors, actual personal interests, or even peer pressure, among others. But, fashions in science aside, the decision whether to use phylogenetic methods based either on models or pure logic depends also to a good extent on the dose of skepticism the researcher holds. Several defenders of model-based methods (e.g. Swofford et al. 2001) have suggested that, in different fields of science, the first approach is based on intuitive methods and that, as the field becomes mature, explicit statistical justifications replace the original intuitive ones. This claim is not strictly true (or testable, at least), and it is presented as if somehow the current use of statistically justified methods was evidence of maturity, when the alternative interpretation, namely that the use of statistical methods in phylogenetics is still premature and unjustified, may be much more reasonable to some workers. But how is one to decide whether the field is mature enough, or whether our knowledge of the mechanisms of evolution is detailed enough, to justify using those models? The answer to this need not be an all-or-none answer; there is instead a gray area between those who believe that our ignorance of evolutionary mechanisms is almost total (a view which some supporters of parsimony seem to hold), and those who believe that our knowledge is so complete as to guarantee even the most detailed inferences (as some likelihoodists and Bayesians seem to believe). At what particular point of this scale a particular worker finds himself/herself will depend on how he/she resolves a large number of subtle issues; such a decision requires reason and logic, but it cannot be accomplished with a statistical test. This, of course, is not to say that anything goes; for example, a model-based method that may produce incorrect estimations (even without violations of the model from which it is derived) is clearly to be avoided. Such is the case with estimations of posterior probabilities of monophyly by MCMC, as we have already seen.

Furthermore, whether the data can be modeled reasonably will depend on the nature of the data. While a Poisson model of substitution seems reasonable for some types of DNA sequence data, it seems unfounded to apply such a model to morphological data (although it has been attempted, by Lewis 2001). It is unfounded because there is very little ground to think that all characters of the given organisms have about the same probability of changing along a given branch of a tree, and that alternative states in morphological characters are like units turned on and off. Some vertebrates have mammary glands, and some arthropods have chelicerae, but within a given group of tetrapods, who could claim that the chances of gaining mammary glands are the same as, or even comparable to, the chances of gaining chelicerae? For genomic data, which include so many different types of transformations (insertions, deletions, translocations, inversions, etc.), postulating reasonable models is also very difficult or impossible. Therefore, for these types of data, only the parsimony approach has been used so far (starting from Sankoff and Blanchette 1998), for no other approach seems reasonable. Even the simplest case, insertions/deletions, presents a serious challenge to modeling; although some programs (like POY; Wheeler et al. 2003) have implemented “maximum likelihood models” for insertions/deletions, these Poisson “models” are based simply on attributing some probability to an insertion as a function of other parameters (like branch “length”). Wheeler et al. (personal communication⁵) present the likelihood methods in POY as “interpretive tools, without any necessary relationships to the actual process of change in nature.” They point out themselves that this has a meaning quite different from the use of Poisson models in DNA sequences: in those models, a base that is to replace another one exists outside the sequence and therefore, given a certain chance of replication error, there is a certain chance for each possible base to be inserted. Thus, Poisson substitution models are based on more than just attributing arbitrary probabilities of change to events; the probabilities postulated by those models are based on some knowledge of the mechanisms that govern the process of DNA substitution, at least in the absence of selection and constraints (whether the model is factually correct, of course, is a different matter, but it is plausible and coherent in itself). Gaps, on the other hand, are not units to be incorporated into a string of DNA being synthesized. A Poisson model for gaps, while it seems natural given the widespread use of Poisson models for DNA substitutions, may be totally inadequate. Other likelihood ‘models’ of insertions/deletions (e.g. Thorne et al. 1992; Miklós et al. 2004; see De Laet, Chapter 6 in this volume, for comments on these) do not use Poisson models, but still are based on arbitrarily assigning probabilities to the possible events. In those models, the final probability is no more real than are the figures obtained by parsimony analysis. On the other hand, analyzing sequences of unequal length by prealigning them and then discarding positions with gaps (a practice common among likelihoodists and Bayesians) is probably even more inadequate, so that we are again in a situation where probabilities cannot really be assigned meaningfully. Much the same can be said of other types of chromosomal rearrangement.

⁵ Wheeler, W., Aagesen, L., Arango, C., Faivovich, J., Grant, T., D’Haese, C., Janies, D., Smith, W., Varon, A. and Giribet, G. (unpublished manuscript). Dynamic Homology and Phylogenetic Systematics: A Unified Approach using POY.

8.5 Discussion

Strictly speaking, our simulations do not demonstrate that the estimations of posterior probabilities of individual groups produced by MrBayes are inconsistent. That would require either running data sets with infinite numbers of characters, or an analytical treatment of the multidimensional integral across all possible trees. Neither of those is possible. Admittedly, in cases like our simulations, as the number of characters increases, the difference in likelihood between the correct and alternative placements of the long branch increases. Eventually this difference might be so great as to make the likelihood of the individual best placement of the long branch (the correct one) higher than the sum of the likelihoods of the alternative (wrong) placements. However, as the number of taxa increases, this situation becomes less and less likely, for the sum of likelihoods of the alternative placements increases as well. So, it is hard to predict what would happen for infinite numbers of characters in cases with very large numbers of taxa. However, even if there is the potential for Bayesian estimations of monophyly to provide correct topological estimations for infinite numbers of characters, there is still the problem that Bayesian analysis claims to do much more than simply producing consistent estimations: it also claims to measure the degree of support of the conclusions, in a statistical sense. Our examples show that it does not.

Several recent papers (Suzuki et al. 2002; Alfaro et al. 2003; Cummings et al. 2003) have compared bootstrap and clade credibility values. In terms of the problem discussed above, some of those comparisons could never have been very informative, despite large amounts of computational effort. For example, the study of Cummings et al. (2003) used over 15 years of CPU time, but examined only data sets with four taxa. The problems pointed out here with Bayesian analyses can arise only with larger numbers of taxa, so Cummings et al.’s effort could never have led to discovery of those problems. Moreover, even for larger numbers of taxa, the problem pointed out here could not have been discovered by comparing posterior probabilities with the bootstrap values produced by PAUP* (the program used in essentially all published comparisons between bootstrap and Bayesian credibilities; Swofford 2002). When bootstrapping or jackknifing, in the case of multiple trees for a resampled matrix, PAUP* weights each group found according to its frequency. This produces exactly the same results as summing posterior probabilities of monophyly: groups that are very frequent in optimal or quasi-optimal trees always appear as highly supported, regardless of their actual support. Fig. 8.1 shows, below the branches, the bootstrap values estimated by PAUP*; they are almost exactly the same as Bayesian estimates. The relative (although not universal) agreement between bootstrap and Bayesian estimations has been taken as mutual confirmation, but in fact MrBayes and PAUP*’s implementation of bootstrapping and jackknifing share similar biases. Alternative implementations of resampling
methods (such as the one in TNT; see Goloboff et al. 2004) avoid this problem by producing the strict consensus for each resampled matrix.

What happens with other ways to summarize the results of an MCMC? As noted above, they depend on whether the chain succeeded in finding the individual trees of highest posterior probability. For larger numbers of taxa, it is extremely unlikely that the chain will ever pass through the optimal tree(s), let alone pass through the optimal tree(s) enough times to estimate their posterior probability with any accuracy. Although tree bisection reconnection (TBR) forms the basis for both tree search and MCMC algorithms, rearrangements leading to worse trees are often accepted under MCMC, while they are normally rejected in a tree search. For equivalent numbers of rearrangements, then, a tree search (especially one combining different algorithms, like the methods in TNT; see Goloboff 1999, for details) is much more likely than an MCMC to find an optimal tree. Even if MCMC and a tree search had the same chances of finding an optimal tree for the same number of rearrangements, the numbers of rearrangements required to find optimal trees during a search cannot ever be achieved in Bayesian analyses. For example, in the case of a relatively small matrix of 84 taxa (from Goloboff 1995), TNT requires at least 5–10 million rearrangements to produce the first hit to minimum length. For Chase et al.’s (1993) data set (zilla, 500 taxa), it takes TNT an average of about 500 million rearrangements to find an optimal tree for the first time⁶; for the 854 taxa used in Goloboff (1999), it takes about 5 000 million rearrangements (about 18 min on an 800 MHz machine). Running 5 000 000 000 generations of an MCMC is impossible (in practical terms).

⁶ Note that the figure of 500 million trees corresponds to analyses using sectorial searches, tree drifting, and tree fusing. Davis et al. (Chapter 7) report the numbers of rearrangements required to find optimal trees for zilla using only TBR; those numbers are much larger.

Suppose anyway that the chain succeeds in finding the trees of maximum a posteriori probability a certain number of times. The posterior probability of each individual tree will thus be negligible (the more so the more taxa are included in the analysis). In our view, such a low posterior probability is perfectly reasonable, and illustrates the fact that the statistical significance of phylogenetic conclusions cannot be meaningfully assessed in real cases. But statistically minded phylogeneticists will likely show continued interest in making probabilities more robust, i.e. in producing more ‘acceptable’ values. The alternative is to identify a credible set of trees. A strict consensus of the credible set of trees may contain exclusively well-supported groups, but only to the extent that the chain was run for long enough to find some trees that are relatively close to optimal trees. For the simulations carried out here (small numbers of taxa, very clean data without violations of the model, chains quickly converging), credibility sets of 90% still display, in many cases, false groups. Only running very large numbers of generations would avoid that problem, but in the case of larger data sets this will be impossible.

8.6 Acknowledgments

It is our pleasure to contribute to a book in honor of James S. Farris, whose work in the field of phylogenetic systematics we greatly admire and respect. We also thank Victor Albert, Jan De Laet, Mark Simmons, and Ward Wheeler for comments. Financial support from FONCyT and CONICET (PAG), and The American Museum of Natural History (DP), is gratefully acknowledged.
IV
Mathematical attributes of parsimony
CHAPTER 9
Maximum parsimony and the phylogenetic information in multistate characters
Mike Steel and David Penny
9.1 Introduction
In this chapter we investigate some of the statistical issues that surround the maximum parsimony (MP) method. Such issues have long been of interest, since the pioneering work of Farris (1973) and Felsenstein (1978). The latter was particularly interested in the question of statistical consistency: would MP select a correct tree under a simple finite-state Markov model, as the number of characters became large? Although much more is now known about the (necessary and sufficient) conditions for this to occur there is still a lot that isn’t. More recently, there has also been interest in other types of statistical questions. For example, when will MP and maximum likelihood (ML) select the same tree on any given data, and under what sort of model(s) is MP an ML method?

This chapter considers this last question, and describes some new sufficient conditions for such an equivalence. We are particularly interested here in settings that involve a large state space. Traditionally most of the biological studies involving MP have involved a state space that is small (typically 2 or 4 or 20) and fixed (independent of the number of taxa). Indeed much standard software for parsimony (including PAUP*) appears to have problems dealing with a state space that has size of more than (say) 64. However increasingly there is interest in genomic characters such as gene order where the underlying state space may be very large (Rokas and Holland 2000; Moret et al. 2001, 2002; Gallut and Barriel 2002). For example, the order of $k$ genes in a signed circular genome can take any of $2^k (k-1)!$ values. In these models whenever there is a change of state, for example a re-shuffling of genes by a random inversion (of a consecutive subsequence of genes), it is likely that the resulting state (gene arrangement) is a unique evolutionary event, arising for the first time in the evolution of the genes under study. At this point the reader may object that the observed number of states in such a situation can never exceed the number $n$ of extant species and so this is the only bound that matters. However when we come to investigate the stochastic properties of MP under simple models of state transition, it is the potential rather than the observed number of states that is important. Having a large state space allows for a low level of predicted homoplasy, leading to one of the links we report below between MP and ML.

A related central question we consider in this chapter is how many characters are needed to unambiguously recover a phylogenetic tree? We consider this both for random models of state transition, and in the deterministic setting. We also consider the question of when, on a fixed tree, we can expect the most-parsimonious reconstruction of a character to correspond exactly with its actual evolution.

This chapter is organized so the first three sections are largely ‘model-free’ (beyond the assumption of evolution on a tree), and the remaining three sections are based on simple Markov models of character evolution. We begin by recalling some background and definitions that
are required to state our results, and by reviewing some basic combinatorial properties of MP.

9.2 Preliminaries

Throughout this chapter, $X$ will denote a set of $n$ extant species or individuals. A character (on $X$, over a set $R$ of character states) is any function $w$ from $X$ into some finite set $R$. Throughout this chapter, we let $r$ denote the size of $R$. Suppose we have a tree $T = (V, E)$. We say that $T$ is a tree on $X$ if $X$ is a subset of $V$, and all vertices of $T$ of degree 1 or 2 are contained in $X$. If, in addition, $X$ is precisely the set of leaves of $T$ we say that $T$ is a phylogenetic $X$-tree, and if, furthermore, every interior (non-leaf) vertex of $T$ has degree 3 we say that $T$ is fully resolved. Two phylogenetic $X$-trees are regarded as equivalent if the identity mapping from $X$ to $X$ induces a graph isomorphism between the two trees. Further background and mathematical details concerning phylogenetic trees can be found in Semple and Steel (2003).

The MP method for reconstructing a tree on $X$ from a collection of characters on $X$ can be described as follows.

Suppose we have a tree $T = (V, E)$ on $X$, and a character $w : X \to R$. A function $\bar{w} : V \to R$ with $\bar{w}|_X = w$ is said to be an extension of $w$, since it describes an assignment of states to all the vertices of $T$ that agrees with the states that $w$ stipulates at the leaves.

Let $ch(\bar{w}, T) := |\{e = \{u, v\} \in E : \bar{w}(u) \neq \bar{w}(v)\}|$. Given a character $w : X \to R$, the parsimony score of $w$ on $T$ is defined by

\[ l(w, T) := \min_{\bar{w} : V \to R,\ \bar{w}|_X = w} \{ ch(\bar{w}, T) \} \]

where $\bar{w}|_X$ denotes the restriction of $\bar{w}$ to $X$. A map $\bar{w}$ that extends $w$ and which minimizes $ch(\bar{w}, T)$ is called a minimal extension (or most-parsimonious extension) of $w$ on $T$. Let

\[ h(w, T) = l(w, T) - |w(X)| + 1 \]

be the homoplasy of $w$ on $T$. By necessity, $h(w, T) \geq 0$, and when $h(w, T) = 0$ we say that $w$ is homoplasy-free on $T$. This condition is exactly equivalent to a statement that, informally, says the following: regardless of where $T$ is rooted, one can evolve states down the tree (from the root to the leaves) in such a way that (1) the leaf states are specified by $w$ and (2) there is no convergent or reverse evolution (for a more formal rendition of this equivalence, see Semple and Steel 2002).

Suppose we are given a sequence $C = (w_1, \ldots, w_k)$ of characters on $X$. The parsimony score of $C$ on $T$, denoted $l(C, T)$, is defined by

\[ l(C, T) := \sum_{i=1}^{k} l(w_i, T) \]

Any tree $T$ on $X$ that minimizes $l(C, T)$ is said to be a maximum parsimony (MP) tree for $C$, and the corresponding $l$-value is the parsimony or MP score of $C$, denoted $l(C)$. Similarly, we may define

\[ h(C, T) := \sum_{i=1}^{k} h(w_i, T) \]

the total homoplasy of $C$ on $T$, and the tree(s) on $X$ that minimize $h$ are precisely the MP trees (since $h(C, T) = l(C, T) + \text{constant}$, where the constant depends on $C$ and not $T$). This minimal value of $h$ we write as $h(C)$.

As is well known, the problem of finding an MP tree for $C$ is computationally intractable (NP-hard), as shown by Foulds and Graham (1982). One might therefore ask for a more reasonable goal. For example, is it possible to determine splits that are shared by all (or some) MP trees? One sufficient condition that allows for the identification of such splits was described recently by David Bryant, and can be stated as follows (from Bryant 2003, Lemma B6). Recall that two binary characters are compatible if there exists a tree $T$ on which they are both homoplasy-free (this is equivalent to the condition that at most three (of the four possible) pairs of states are assigned by these two characters).

Proposition 9.2.1. Suppose $C = (w_1, \ldots, w_k)$ is any sequence of binary characters on $X$. Let $w$ be any nontrivial binary character that is compatible with all the characters in $C$. Then there exists an MP tree $T$ for $C$ that contains the $X$-split defined by $w$. Furthermore, if $w$ is one of the characters in $C$ then every MP tree for $C$ contains the $X$-split defined by $w$.
9.2.1 Bounds on the MP score of data

For a single character $w : X \to R$ it is easily shown that

\[ \min_T \{ l(w, T) \} = |w(X)| - 1 \qquad (1) \]

and that

\[ \max_T \{ l(w, T) \} = |X| - m \qquad (2) \]

where $m$ is the largest number of species in $X$ that are assigned the same state (formally $m = \max\{ |w^{-1}(a)| : a \in R \}$). In (1) and (2) $T$ ranges over all phylogenetic $X$-trees (or, equivalently, over all fully resolved phylogenetic $X$-trees).

For a collection $C$ of characters it is also useful to determine lower bounds on $l(C)$. We first recall an easily computed lower bound. Form a graph by taking $X$ as the set of vertices, and placing an edge between each pair of vertices (this produces the ‘complete graph on $X$’). Weight each edge $\{x, y\}$ by the number of characters $f$ in $C$ for which $f(x) \neq f(y)$, then construct a minimum-length-spanning tree for this graph. This last task can be accomplished using one of the well-known polynomial-time techniques, such as Kruskal’s algorithm or Prim’s algorithm. Let $L(C)$ denote the sum of the weights of the edges in this tree. Then, $l(C) \geq \frac{1}{2} L(C)$. Furthermore, the factor of $\frac{1}{2}$ is (asymptotically) optimal for a lower bound based on this approach due to Foulds (1984); however by adopting a more complex polynomial-time approach a slightly better approximation to $l(C)$ is possible (see Prömel and Steger 2000). Here we describe a quite different type of lower bound, which has the advantage of coinciding with $l(C)$ when the homoplasy $h(C)$ is zero (in contrast to the minimum-length-spanning tree bound, which does not have this property in general).

Let $F$ be a family of subsets of $\{1, \ldots, k\}$ with the property that each number $1, 2, \ldots, k$ appears in the same number of sets from $F$. In this case we say that $F$ is uniformly covering. Let $n(F)$, or more briefly $n$, denote this number of sets from $F$ that each number appears in (formally, $n(F) = |\{S \in F : j \in S\}|$ for each $j \in \{1, \ldots, k\}$). One natural example of such a family is the collection $F_p$ of all subsets of $\{1, \ldots, k\}$ of fixed size $p$ (i.e. $F_p := \{S \subseteq \{1, \ldots, k\} : |S| = p\}$), for which $n(F_p) = \binom{k-1}{p-1}$. A second class of examples is where $F$ is a partition of $\{1, \ldots, k\}$ into nonoverlapping subsets, in which case $n(F) = 1$.

Given a sequence $C = (w_1, \ldots, w_k)$ of characters on $X$, and a set $S \subseteq \{1, \ldots, k\}$, let $C_S = (w_j : j \in S)$ and let

\[ h^F := \sum_{S \in F} h(C_S) \]

The following result extends the ‘partition theorem’ of Hendy et al. (1980).

Proposition 9.2.2. Let $F$ be a uniformly covering family of subsets of $\{1, \ldots, k\}$, and let $C$ be a sequence of characters. Then,

\[ h(C) \geq \frac{1}{n(F)} h^F \]

Proof. Let $T'$ denote an MP tree for $C$, and let $h'(j) := h(w_j, T')$. For $S \subseteq \{1, \ldots, k\}$, let $h'(C_S) := \sum_{j \in S} h'(j)$. Thus, $h'(C_S) \geq h(C_S)$ and so

\[ h(C) = \sum_{j=1}^{k} h'(j) = \frac{1}{n(F)} \sum_{S \in F} h'(C_S) \geq \frac{1}{n(F)} \sum_{S \in F} h(C_S) = \frac{1}{n(F)} h^F \]

where the second equality is justified by the identity:

\[ \sum_{S \in F} h'(C_S) = \sum_{S \in F} \sum_{j \in S} h'(j) = \sum_{j=1}^{k} \sum_{S : j \in S} h'(j) = \sum_{j=1}^{k} n(F)\, h'(j) \]

For applications one would construct a family $F$ of (small) subsets of $\{1, \ldots, k\}$ that cover each element of $\{1, \ldots, k\}$ the same number of times, and compute $h(C_S)$ for each small subset. As a special case, if we take $F = F_2$ (so that $n = k - 1$) and note that $h(C_S) \geq 1$ whenever $C_S$ is incompatible, then we obtain the following bound for any collection $C = (w_1, \ldots, w_k)$ of characters:

\[ l(C) \geq \sum_{i=1}^{k} (|w_i(X)| - 1) + \frac{In(C)}{k - 1} \]
where In(C) is the number of pairs of characters in recourse to simulations. That such a formula exists
C that are incompatible. The ‘partition theorem’ is truly remarkable, and is due to a little-known
from Hendy et al. (1980), which states that if F is a but nontrivial result from Carter et al. (1980).
P
partition of {1, . . . , k} then l(C) S [ F l(CS ), also Using straightforward algebra one can easily
follows directly from Proposition 9.2.2. derive the following result from Theorem 2 of that
Note that the requirement of Proposition 9.2.2 paper.
that F covers each element of X the same number
Proposition 9.3.1. Suppose that a character w
of times can be weakened by adopting a linear
partitions a set of n species into classes of size
programming approach. That is, if we let h be
P a1, a2, . . . , ar. Then
the minimal value of ki¼1 xi subject to the linear
inequality constraints, xi 0 for all i ¼ 1, . . . , k,
P
and x h(CS ) for all S [ F , then clearly X
nrþ1
Pj [ S j I(w) ¼ (1 bj ) log (2j 3)
l(C) ki¼1 (jwi (X)j 1) þ h. j¼3
where bj ¼ j {i : ai j} j .
9.3 How phylogenetically informative is a single r-state character?

In this section we consider the question of how to quantify the phylogenetic information a single r-state character carries (a priori, without regard to other characters, or to the character’s fit on an existing tree). Let $\chi : X \to R$ be a character. One measure of the phylogenetic information content of $\chi$, based on compatibility, is the following:

$$ I(\chi) = -\log(p(\chi)) \qquad (3) $$

where $p(\chi)$ is the proportion of fully resolved phylogenetic X-trees for which $\chi$ is homoplasy-free. For example, if $\chi$ assigns the same state to all species in X or, at the other extreme, a separate state to each species in X then $I(\chi) = 0$, as we should expect, since every such character is homoplasy-free on all trees.

A measure of phylogenetic content is only useful if it can be readily computed. For the measure I described in (3) it might seem tempting to approximate this quantity by simulation: simply generate fully resolved trees at random and count what proportion of them allow $\chi$ to be homoplasy-free. However this turns out to be generally impractical once X becomes large, for the obvious reason: even if you simulate a huge number of large trees at random, it is likely that few if any of them will provide a homoplasy-free fit for $\chi$. Fortunately it turns out that I can be easily computed by a simple exact formula, and without

For example, consider a character $\chi$ that partitions 20 species into classes of size 6, 4, 4, 3, 2, and 1. Then

$$ I(\chi) = -3\log(3) - 2\log(5) + \log(11) + \log(13) + \cdots + \log(27) $$

In this example, $b_3 = 4$ (giving rise to the $-3$ ($= 1 - b_3$) multiplier for log(3)), $b_4 = 3$, $b_5 = b_6 = 1$, and $b_j = 0$ for j > 6.

Proposition 9.3.1 may be useful for deciding how to construct and select between possible character codings, for example for genomic data. Ideally we would like $I(\chi)$ to be as large as possible, and achieving this may assist in tuning certain coding procedures. Further aspects of this information measure have also been explored recently using simulations by Dezulian and Steel (2004). At this point we will simply note an interesting consequence of Proposition 9.3.1. Firstly, if we fix r, the number of classes that X is partitioned into, then $I(\chi)$ is largest when all of the classes have (approximately) the same size. Let $I_{\max}(n, r)$ be this largest value of $I(\chi)$ over all characters $\chi$ that partition a set of size n into r non-empty sets. We may ask how this quantity varies as a function of r. Clearly if r = 1 or r = n then $I_{\max}(n, r) = 0$. Consequently, there is some intermediate value, between r = 1 and r = n, where $I_{\max}(n, r)$ is largest. A plot of $I_{\max}(120, r)$ is shown in Fig. 9.1. Under the I measure, the most informative character for
n = 120 is one that partitions the taxa into 24 groups, each of size 5.

Figure 9.1 Distribution of log of the number of fully resolved phylogenetic trees on 120 species for a homoplasy-free character that partitions the species into r equally sized sets. [Plot of −log(p) against the number of states.]

Other measures of the informativeness of a character are possible and have been proposed; for example, following Farris (1989), one can consider the difference

$$ d(\chi) = \max_{T} \{l(\chi, T)\} - \min_{T} \{l(\chi, T)\} $$

where the terms on the right-hand side of this last equation are given by (1) and (2). Note that if, as above, we fix r, the number of classes that X is partitioned into by $\chi$, then $d(\chi)$ is largest when all of the classes have (approximately) the same size.

9.3.1 Coding gene order as multistate character data

It is instructive to consider the types of genomic data for which we may expect, simultaneously, both low homoplasy due to a large state space, and yet phylogenetically informative characters.

For gene-order data, one approach (that has been called “maximum parsimony on multistate encodings”) was proposed by Bryant (2000) and tested by Wang et al. (2002). Suppose one has n genomes. We will take these to be circular, and consider the genes as signed (oriented) and we will suppose that the genomes have been edited so that each of them contains the set of N genes, which we can label 1, 2, . . . , N. A circular gene ordering then can be regarded as a signed circular permutation, for example (1, −4, −3, −2) (which is equivalent to (−4, −3, −2, 1) or to (4, −1, 2, 3), etc.). The coding procedure considered by Bryant (2000) and Wang et al. (2002) is based on the observation that each gene order induces a sequence of length 2N by considering the gene that immediately follows each given gene in either direction. Given a collection of genomes, this allows one to define a sequence of characters $\chi_1, \ldots, \chi_{2N}$ (on a state space of size 2N) as follows. For each i between 1 and N, set $\chi_i(j) = k$ if k immediately follows gene i in genome j; and for each i between N + 1 and 2N, $\chi_i(j) = k$ if k immediately follows gene −(i − N) in genome j. For example, for j = (1, −4, −3, −2) the sequence ($\chi_i(j)$ : i = 1, 2, . . . , 8) is (−4, 3, 4, −1, 2, 1, −2, −3).
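The encoding of a signed circular gene order as a sequence of length 2N can be sketched in a few lines of Python. This is an illustrative reading of the coding rule described above, not code from Bryant (2000) or Wang et al. (2002); the function name `multistate_encoding` and the use of tuples of signed integers are assumptions. Reading the circle in the opposite direction reverses the order and flips every sign, which is how the genes following −1, . . . , −N are found.

```python
def multistate_encoding(genome):
    """Encode a signed circular gene order as a list of length 2N.

    Entry i (for 1 <= i <= N) is the gene that immediately follows gene i;
    entry N + i is the gene that immediately follows gene -i.
    """
    n_genes = len(genome)
    follows = {}
    for pos, gene in enumerate(genome):
        nxt = genome[(pos + 1) % n_genes]   # wrap around: the genome is circular
        follows[gene] = nxt                 # reading the circle forwards
        follows[-nxt] = -gene               # reading the same circle backwards
    forward = [follows[i] for i in range(1, n_genes + 1)]
    backward = [follows[-i] for i in range(1, n_genes + 1)]
    return forward + backward

# The signed circular permutation (1, -4, -3, -2) and two equivalent writings of it.
code = multistate_encoding((1, -4, -3, -2))
print(code)
print(multistate_encoding((-4, -3, -2, 1)) == code)
print(multistate_encoding((4, -1, 2, 3)) == code)
```

Applying the function to each of n genomes gives the n rows of a character matrix with 2N columns, one column per character.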
Let $d_{\max}(n, r)$ be this largest value of $d(\chi)$ over all characters $\chi$ that partition a set of size n into r non-empty sets. We may ask how this quantity varies as a function of r. Clearly if r = 1 or r = n then, as with the measure based on I, we have $d_{\max}(n, r) = 0$. Note that if r divides n, then applying (1) and (2) gives

$$ d_{\max}(n, r) = n\left(1 - \frac{1}{r}\right) - r + 1 $$

Maximizing this expression over r (treated as a real number), we find that the maximum occurs precisely when $r = \sqrt{n}$. For the example discussed above with n = 120 the character that maximizes d partitions the taxa into fewer groups (namely 10 or 12) than the 24-fold partition that maximizes I.

The method of Gallut and Barriel (2002) has a similar flavor. In their approach each gene is associated with the (unordered) pair of genes that appear on either side of it. Thus if there are n genomes, each consisting of N genes, then this coding method produces N characters that have a state space of size $\binom{N}{2}$.

Other methods of coding are also possible, and these are currently being investigated (Dezulian and Steel, unpublished work).

9.4 The smallest number of multistate characters required for tree reconstruction

In this section we consider two related questions: given a fully resolved phylogenetic tree T with
n leaves, what is the smallest possible number of characters for which (1) T is the unique MP phylogenetic tree for these characters, and (2) T is the unique phylogenetic tree for which the characters have no homoplasy? If we call these two numbers, respectively, $n_1(T)$ and $n_2(T)$ it is clear that $n_1(T) \le n_2(T)$. It might be expected that both these quantities would grow with the size of the tree, yet it has recently been shown that this is not so, provided no bound is placed on the size of the state space. More precisely, we have the following result, from Huber et al. (2002).

Theorem 9.4.1. For any fully resolved phylogenetic tree T, on any number of species, the quantities $n_1(T)$ and $n_2(T)$ are at most 4.

When a bound is placed on the size of the state space, then an elementary counting argument shows that both $n_1(T)$ and $n_2(T)$ cannot be bounded by any fixed number that is independent of the number n of leaves of T. This begs the question: how fast must $n_1(T)$ and $n_2(T)$ grow with n? In the case of binary characters it is well known that

$$ n_2(T) = n - 3 $$

since every one of the n − 3 interior edges of the fully resolved tree T must be distinguished by at least one of the binary characters. Furthermore, for r-state characters, it was shown by Semple and Steel (2002) that

$$ n_2(T) \ge \frac{n-3}{r-1} $$

and it seems that this bound is fairly close to the true value. The behavior of $n_1(T)$ has received less investigation, and consequently little is known about how large $n_1(T)$ might be. However the following result shows that $n_1(T)$ must grow at least logarithmically with n (at least for some trees).

Proposition 9.4.2. For any given state space size r, there is a positive constant c such that for each n there exists a fully resolved phylogenetic tree T with n leaves, for which $n_1(T) \ge c \log(n)$.

Proof. Suppose that to each fully resolved phylogenetic X-tree T we can associate a sequence $C_T$ of k characters on X for which T is the unique MP phylogenetic tree. Then the number B(n) of fully resolved phylogenetic trees on a set of size n must be less than or equal to the number of sequences of k characters on a set of n species. This latter number is $r^{nk}$ where n = |X|, which we may rewrite as $e^{nk \log(r)}$. Now $B(n) = \prod_{i=3}^{n} (2i - 5)$ and it can be shown (using Stirling’s approximation for n!) that for a constant b > 0 we have $B(n) > e^{bn \log(n)}$. Thus $B(n) \le r^{nk}$ implies that $k \ge c \log n$ where c = b/log(r). This completes the proof.

It seems plausible that this lower bound on $n_1(T)$ is not too far from the true value, even for binary characters, and so we offer the following.

Conjecture 9.4.3. There exists a constant c > 0 such that, for any fully resolved phylogenetic tree T, there exists a sequence of at most $\lfloor c \log(n) \rfloor$ binary characters on X for which T is the unique MP phylogenetic tree, where n denotes as usual the number of leaves of T.

Proposition 9.2.1 places interesting constraints on the sorts of sequences of characters that this last conjecture requires. Namely, any split that is not in T must be incompatible with at least one of the (at most) c log(n) characters in the collection promised by the conjecture. Can such a small set of binary characters be incompatible with virtually all other binary characters? We end this section by describing a result that shows that this is indeed possible. The proof is given in Appendix 9.1.

Proposition 9.4.4. There exists a set C of $\log_2(n)$ binary characters on a set X of size n ($= 2^k$) with the following property: any binary character on X that is compatible with every character in C is a trivial character.

A further interesting feature of the type of data sets that would be required to verify Conjecture 9.4.3 is that many of the characters would need to have large homoplasy values on the tree T. The effectiveness of such data sets in recovering trees is in line with recent observations by Källersjö et al. (1999).

9.4.1 Reconstructing ancestral states

In the previous section we considered the question of defining a tree using parsimony. Now we will consider the analogous question for the
‘small parsimony’ (i.e. fixed-tree) problem. Given a phylogenetic X-tree, T, and a character $\chi : X \to R$ that has evolved on T, when are the states that were present at the ancestral vertices of the tree identical to the most-parsimonious reconstruction? We will present a sufficient condition (on the evolution of the character) that guarantees the historical accuracy of the ancestral-state reconstructions. Essentially this sufficient condition is that substitutions that occur are ‘well-separated’ in the tree (that is, they do not occur too close to each other in the tree). Apart from its intrinsic interest, this result will also be useful later in providing a limiting Poisson distribution for the parsimony score of a tree, under low substitution rates.

Theorem 9.4.5. Suppose that T is a phylogenetic X-tree, and consider the assignment of states $\bar{\chi} : V(T) \to R$ corresponding to the evolution of some character on T. Let $\chi = \bar{\chi}|_X$ be the observed states on the extant set of species (leaves of T). Suppose furthermore that the evolution of the character is such that any two edges of T on which a net transition occurs are separated by at least three other edges of T. Then $\bar{\chi}$ is a minimal extension of $\chi$ on T; moreover it is the only minimal extension of $\chi$ on T.

Proof. Suppose that T is a phylogenetic X-tree, and $\bar{\chi} : V(T) \to R$. Suppose furthermore that for any two edges {u, v} and {u′, v′} for which $\bar{\chi}(u) \ne \bar{\chi}(v)$ and $\bar{\chi}(u') \ne \bar{\chi}(v')$ there are at least three other edges separating {u, v} and {u′, v′}. Let $\chi = \bar{\chi}|_X$. Then we claim that $\bar{\chi}$ is the unique minimal extension of $\chi$ on T. To establish this claim, let $\chi'$ be a minimal extension of $\chi$ on T; we will show that for each vertex v of T we have $\chi'(v) = \bar{\chi}(v)$.

Let us root tree T on vertex v and direct all the edges of T away from v. For any vertex u in this rooted tree, let S(u) denote the set of states assigned to u by applying the first pass of the Fitch–Hartigan algorithm (Fitch 1971; Hartigan 1973) to the pair (T, $\chi$). We will establish the following. Claim: suppose that u is an internal vertex of T and that $v_1, v_2, \ldots, v_k$ are the vertices of T that are immediate descendants of u. Then

$$ S(u) = \begin{cases} \{\bar{\chi}(v_1), \bar{\chi}(v_2)\}, & \text{if } k = 2 \text{ and } \bar{\chi}(v_1) \ne \bar{\chi}(v_2) \\ \{\bar{\chi}(u)\}, & \text{otherwise} \end{cases} $$

The proof of this claim is by induction on the height h of u (i.e. h is the number of edges separating u from a most distant descendant leaf). When h = 1 the claim holds, since the assumption on $\bar{\chi}$ implies that all but at most one (of the two or more) descendant leaves of u have the same state under $\chi$. Suppose the claim holds for all internal vertices of height at most h and that u has height h + 1. By the assumption on $\bar{\chi}$ one of the following two cases applies: (i) $\bar{\chi}(v_i) = \bar{\chi}(u)$ for all $i \in \{1, \ldots, k\}$; (ii) $\bar{\chi}(v_i) = \bar{\chi}(u)$ for all but exactly one i.

In case (i), we may apply the induction hypothesis to the vertices $v_1, \ldots, v_k$ which each have height at most h. It follows that $\bar{\chi}(u) \in S(v_i)$ for all i. Furthermore there is at most one vertex $v_i$ for which $S(v_i) \ne \{\bar{\chi}(u)\}$ since if there were two such vertices, then we would obtain two edges on which $\bar{\chi}$ changes state, yet which are separated by only two edges in T. Consequently, by the Fitch–Hartigan recursion we deduce that $S(u) = \{\bar{\chi}(u)\}$.

Consider now case (ii). We may suppose that $\bar{\chi}(v_1) \ne \bar{\chi}(u)$. Consider first the case where k > 2. Applying the induction hypothesis to $v_1, \ldots, v_k$ and invoking the assumption on $\bar{\chi}$ we have that $S(v_1) = \{\bar{\chi}(v_1)\}$, and for all i > 1 we have $S(v_i) = \{\bar{\chi}(u)\}$. It now follows by the Fitch–Hartigan recursion (remembering that k > 2) that $S(u) = \{\bar{\chi}(u)\}$. Thus we have established the second part of the claim. It remains to consider the other possibility for case (ii), namely k = 2. Again we apply the induction hypothesis on $v_1, v_2$ and invoke the assumption on $\bar{\chi}$ to deduce that $S(v_1) = \{\bar{\chi}(v_1)\}$ and $S(v_2) = \{\bar{\chi}(v_2)\}$; hence $S(u) = \{\bar{\chi}(v_1), \bar{\chi}(v_2)\}$, as required to justify the claim.

Now let us take u = v, the vertex we have selected as our putative root for T. Since T is a phylogenetic tree, v has degree at least three, so by the claim we have $S(v) = \{\bar{\chi}(v)\}$. However, since v is the root of the tree for the recursion, S(v) is precisely the set of states that can occur at v across all possible minimal extensions of $\chi$ on T (Hartigan 1973). Thus we have shown that all such minimal extensions (in particular $\chi'$) assign vertex v the same state as that specified by $\bar{\chi}$. Since we can repeat this argument for any vertex v in T the theorem now follows.
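The set-valued recursion used in this proof can be made concrete. The sketch below implements the first (bottom-up) pass in Hartigan's form for rooted trees that need not be binary: each internal vertex receives the states that occur in the largest number of its children's sets, and the parsimony score grows by the number of children whose sets miss those states. The tree encoding (a `children` dictionary), the function name and the toy tree are assumptions made for illustration; they are not taken from the text.

```python
from collections import Counter

def fitch_hartigan_first_pass(children, leaf_state, root):
    """Bottom-up pass on a rooted tree.

    children   : dict mapping each internal vertex to a list of its children
    leaf_state : dict mapping each leaf to its observed state
    Returns (S, score): S[u] is the state set at u, score the parsimony length.
    """
    S, score = {}, 0
    stack = [(root, False)]                 # iterative post-order traversal
    while stack:
        u, children_done = stack.pop()
        kids = children.get(u, [])
        if not kids:                        # leaf: singleton set
            S[u] = {leaf_state[u]}
        elif not children_done:             # first visit: process children first
            stack.append((u, True))
            stack.extend((v, False) for v in kids)
        else:
            counts = Counter(s for v in kids for s in S[v])
            best = max(counts.values())
            S[u] = {s for s, c in counts.items() if c == best}
            score += len(kids) - best       # children whose sets miss the chosen states
    return S, score

# Hypothetical toy tree: root "v" of degree three, six leaves, one leaf in state 1.
children = {"v": ["a", "b", "c"],
            "a": ["x1", "x2"], "b": ["x3", "x4"], "c": ["x5", "x6"]}
leaf_state = {"x1": 0, "x2": 0, "x3": 0, "x4": 1, "x5": 0, "x6": 0}
S, score = fitch_hartigan_first_pass(children, leaf_state, "v")
print(S["b"], S["v"], score)
```

In this toy case the vertex with two children in different states receives both states (the k = 2 case of the claim), while the set at the root is a single state, so every minimal extension agrees at the root.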
Figure 9.2 Example showing that two-edge separation does not suffice for Theorem 9.4.5.

Note that Theorem 9.4.5 is no longer true if we weaken the edge-separation requirement from three edges to two. For example, consider the tree and character $\chi$ shown in Fig. 9.2. Then the extension $\bar{\chi}$ of $\chi$ defined by making substitutions precisely on the five (bold) edges incident with leaves in state 0 as indicated in Fig. 9.2 is not a minimal extension of $\chi$, even though each pair of bold edges is separated by at least two other edges. For this example, the minimal extension is provided by assigning state 0 to all the interior vertices of the tree. Note also that an ancestral-state reconstruction satisfying the requirements of Theorem 9.4.5 is not necessarily the ‘true’ reconstruction, it is merely the unique most-parsimonious reconstruction. Nevertheless, as we will see in the next section (Proposition 9.5.1), certain stochastic models of character evolution imply that this unique most-parsimonious reconstruction is also likely to be historically accurate, provided the substitution probabilities are uniformly small.

9.5 The Poisson model

In this section and the next we consider the simplest tree-based model for the evolution of characters with state space R, which we will refer to here simply as the Poisson model on R (with parameters (T, p)). In this model, we have a tree T on X, select any element $x_0 \in X$ as a reference vertex, and direct all edges of T away from $x_0$. We will regard the value from R assigned to vertex $x_0$ as being given (it would make little difference to the arguments below if we allowed the state at $x_0$ to be random). The model assigns states from R recursively to the remaining vertices of the tree according to the following scheme: if e = {u, v} is an edge of T directed from u to v and u has been assigned state a, then, with probability 1 − p(e) we assign v state a, otherwise, with probability p(e) we select uniformly at random one of the other r − 1 states (different to a) and assign this state to v. The assignments are made independently across edges, and the value p(e) is called the substitution probability associated with edge e. It is natural to constrain p(e) to lie in the interval $[0, \frac{r-1}{r}]$; the reason for the upper bound is that, if we realise this model by a continuous-time Markov process, the probability of a net substitution over any period of time is always less than $\frac{r-1}{r}$. We will say that the mapping $e \mapsto p(e)$ is admissible if the p(e) values all lie within this allowed interval.

When r = 4, this model is essentially the same as what is often referred to as the Jukes–Cantor model. For general values of r, this model was investigated in 1970 by Jerzy Neyman (1971), and has more recently been studied by Paul Lewis (2001) as a starting framework for likelihood analysis for certain morphological characters. This model has been christened in the bioinformatics literature under a variety of titles, including the Neyman r-state model and the r-state Jukes–Cantor model.

Given the pair (T, p) where T = (V, E) is a tree on X, and p is an admissible assignment of transition probabilities, and given a map $\bar{\chi} : V \to R$, let $\Pr(\bar{\chi} \mid T, p)$ denote the probability that the vertices in T take values specified by $\bar{\chi}$ under the Poisson model on R with parameters (T, p). More formally, $\Pr(\bar{\chi} \mid T, p) = \Pr\left(\bigcap_{v \in V - \{x_0\}} \{Z(v) = \bar{\chi}(v)\}\right)$, where Z(v) is the random variable state assigned to v under the model. By the assumptions of the model, we have

$$ \Pr(\bar{\chi} \mid T, p) = \prod_{\{u,v\} \in E :\, \bar{\chi}(u) \ne \bar{\chi}(v)} \frac{p(e)}{r-1} \prod_{\{u,v\} \in E :\, \bar{\chi}(u) = \bar{\chi}(v)} (1 - p(e)) \qquad (4) $$

For any character $\chi : X \to R$, let

$$ \Pr(\chi \mid T, p) = \sum_{\bar{\chi} \in c(\chi)} \Pr(\bar{\chi} \mid T, p) $$

where $c(\chi) = \{\bar{\chi} : V \to R : \bar{\chi}|_X = \chi\}$.
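Equation (4) and the sum over extensions can be evaluated by brute force on a very small tree. The Python sketch below is only an illustration under stated assumptions: the tree, the vertex names, the edge probabilities and the function names `prob_full` and `prob_leaves` are all hypothetical, edges are stored as (parent, child) pairs directed away from the reference vertex, and states are coded as 0, . . . , r − 1. Enumerating all extensions takes time exponential in the number of internal vertices, so this is a check of the definitions rather than a practical algorithm.

```python
from itertools import product

def prob_full(edges, p, state, r):
    """Equation (4): probability of one complete assignment of states to all vertices."""
    prob = 1.0
    for (u, v) in edges:
        if state[u] != state[v]:
            prob *= p[(u, v)] / (r - 1)   # substitution to one specific other state
        else:
            prob *= 1.0 - p[(u, v)]       # no net substitution on this edge
    return prob

def prob_leaves(edges, p, leaf_state, internal, r):
    """Sum equation (4) over every extension of the leaf states to the internal vertices."""
    total = 0.0
    for combo in product(range(r), repeat=len(internal)):
        state = {**leaf_state, **dict(zip(internal, combo))}
        total += prob_full(edges, p, state, r)
    return total

# Hypothetical four-leaf tree; x0 is the reference leaf with its state fixed to 0.
r = 4
edges = [("x0", "u"), ("u", "a"), ("u", "v"), ("v", "b"), ("v", "c")]
p = {e: 0.1 for e in edges}               # admissible, since 0.1 <= (r - 1) / r
print(prob_leaves(edges, p, {"x0": 0, "a": 0, "b": 1, "c": 1}, ["u", "v"], r))

# Summing over all states of the remaining leaves should give (numerically) 1.
total = sum(prob_leaves(edges, p, {"x0": 0, "a": a, "b": b, "c": c}, ["u", "v"], r)
            for a, b, c in product(range(r), repeat=3))
print(total)
```

The final check works because, on each edge, the probability of no change plus the r − 1 equal probabilities of a change to a specific other state add up to one.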
9.5.1 Distribution of the parsimony score

Theorem 9.4.5 has the following consequence for the (limiting) distribution of the parsimony score of a character under the Poisson model.

Proposition 9.5.1. Consider a process on a fully resolved phylogenetic tree T with n leaves, and let

$$ h = \max\{p(e) : e \in E\}\, \sqrt{n} $$

and

$$ m = \sum_{e \in E(T)} p(e) $$

Generate a character $\chi$ by this process on T and let $\bar{\chi}$ denote the states at all the vertices of T. Then, for small values of h, the most-parsimonious reconstruction of $\chi$ is likely to be both unique and historically accurate, and the parsimony score $L = l(\chi, T)$ of a character $\chi$ generated by this process on T is closely approximated by a Poisson distribution with mean m. More precisely, for any value of h we have (i) $\Pr[\bar{\chi}$ is the unique MP reconstruction of $\chi$ on T$] \ge 1 - 28h^2$, (ii) $|\Pr(L = k) - e^{-m} \frac{m^k}{k!}| < 32h^2$, and (iii) $\sum_{k=0}^{\infty} |\Pr(L = k) - e^{-m} \frac{m^k}{k!}| < 60h^2$.

To illustrate this result, suppose that a fully resolved phylogenetic X-tree has n = 10 000 leaves, and the substitution probability p(e) on each edge is (say) $2 \times 10^{-4}$. In this case we can take $h = 2 \times 10^{-2}$, and so we may approximate L closely by a Poisson distribution with mean 4. Notice that, in Proposition 9.5.1, a small value of h does not necessarily imply a small value for m if the number of leaves in the tree T is large.

Proof of Proposition 9.5.1

Let A be the event that substitutions occur on some pair of edges that are separated by two or fewer edges. The number of ordered pairs of edges that are separated by two or fewer edges is at most $(2n - 3) \times (4 + 8 + 16)$ since (2n − 3) is the number of edges of T and since (4 + 8 + 16) bounds the number of edges of T that are separated by 0, 1 or 2 other edges from any given edge of T. Thus the number of unordered pairs of edges that are separated by two or fewer edges is at most $\frac{1}{2}(2n - 3) \times (4 + 8 + 16) < 28n$, and so, by the Bonferroni inequality,

$$ \Pr(A) < 28n \left(\frac{h}{\sqrt{n}}\right)^2 = 28h^2 \qquad (5) $$

which, together with Theorem 9.4.5 establishes part (i).

Let $L^*$ denote the random number of edges of T on which there is a substitution. Thus $L \le L^*$, and $L^*$ has a limiting Poisson distribution since it is the sum of an increasing (with n) number of independent 0/1 random variables, where the probability that each variable takes the value 1 converges to 0 (with n). Moreover, Le Cam’s inequality (Le Cam 1960) gives

$$ \sum_{k=0}^{\infty} \left|\Pr(L^* = k) - e^{-m} \frac{m^k}{k!}\right| < 2 \sum_{e} p(e)^2 \qquad (6) $$

By the law of total probability,

$$ \Pr(L = k) = \Pr(L = k \mid A^c)\Pr(A^c) + \Pr(L = k \mid A)\Pr(A) \qquad (7) $$

and

$$ \Pr(L^* = k) = \Pr(L^* = k \mid A^c)\Pr(A^c) + \Pr(L^* = k \mid A)\Pr(A) \qquad (8) $$

where $A^c$ is the complementary event of A. Now, conditional on the event $A^c$, Theorem 9.4.5 guarantees that $L = L^*$ (with probability 1); that is, $\Pr(L = k \mid A^c) = \Pr(L^* = k \mid A^c)$. Applying this identity to (7) and (8) gives

$$ |\Pr(L = k) - \Pr(L^* = k)| = |\Pr(L = k \mid A) - \Pr(L^* = k \mid A)|\,\Pr(A) \le \Pr(A) < 28h^2 $$

where the last inequality is from (5). Furthermore, (6) implies that

$$ \left|\Pr(L^* = k) - e^{-m} \frac{m^k}{k!}\right| < 4h^2 $$

Combining these last two inequalities gives

$$ \left|\Pr(L = k) - e^{-m} \frac{m^k}{k!}\right| < 32h^2 $$
which establishes part (ii). Similarly,

$$ \sum_{k=0}^{\infty} |\Pr(L = k) - \Pr(L^* = k)| \le \sum_{k=0}^{\infty} |\Pr(L = k \mid A) - \Pr(L^* = k \mid A)|\,\Pr(A) \le 2\Pr(A) < 56h^2 $$

from which part (iii) now follows.

9.6 Links between MP and ML

Given a sequence $C = (\chi_1, \ldots, \chi_k)$ of characters on X, we put

$$ \Pr(C \mid T, p) = \prod_{i=1}^{k} \Pr(\chi_i \mid T, p), $$

$$ L(T \mid C) = \sup_{p}\, (\Pr(C \mid T, p)), $$

$$ \Pr(C \mid T, p)_{mp} = \prod_{i=1}^{k} \max\,(\Pr(\bar{\chi}_i \mid T, p) \mid \bar{\chi}_i \in c(i)) $$

$$ L_{mp}(T \mid C) = \sup_{p}\, (\Pr(C \mid T, p)_{mp}) $$

where the supremum is taken over all admissible choices of p and $c(i) = c(\chi_i)$ is the set of extensions of $\chi_i$ to V. Note that $\Pr(C \mid T, p)$ is the probability of generating the k characters by independent and identical evolution under a Poisson model with parameters (T, p).

Similarly one has analogous definitions for the ‘no common mechanism’ Poisson model, in which each character evolves independently under a Poisson model on R but where p in the parameter pair (T, p) for this model takes admissible values that are permitted to vary freely between the characters. Specifically, let

$$ \Pr(C \mid T, (p_1, \ldots, p_k)) = \prod_{i=1}^{k} \Pr(\chi_i \mid T, p_i) $$

and

$$ L_{ncm}(T \mid C) = \sup_{(p_1, \ldots, p_k)} (\Pr(C \mid T, (p_1, \ldots, p_k))) $$

where the supremum is taken over all k-tuples $(p_1, \ldots, p_k)$ where each $p_i$ is admissible.

Recall that $L(T \mid C)$ and $L_{ncm}$ are referred to as the maximum (average) likelihood or ML score, and $L_{mp}(T \mid C)$ as the most-parsimonious likelihood or MPL score, of T given C (cf. Barry and Hartigan 1987; Steel and Penny 2000).

The distinction between these two forms of likelihood is as follows: the ML score of T is the largest probability (over all admissible choices of substitution probabilities p) of generating the observed sequence of characters at the leaves of T but without specifying or conditioning on any particular assignment of sequences of characters at the interior vertices of the tree (these are effectively ‘averaged over’). In contrast the MPL score of T is the largest probability (over all admissible choices of substitution probabilities p) of generating any particular assignment of sequence of characters to all the vertices of the tree, so that the sequences assigned to the tips are the observed sequences.

A tree T on X is said to be an ML tree or an MPL tree for C if $L(T \mid C) \ge L(T' \mid C)$ or $L_{mp}(T \mid C) \ge L_{mp}(T' \mid C)$, respectively, holds for all other trees T′ on X. The problem of finding an MPL tree given only C was recently shown to be NP-hard by Addario-Berry et al. (2004) (where the method is referred to as “ancestral maximum likelihood”). Finding an MP tree from C is also NP-hard (Foulds and Graham 1982); most likely so too is the problem of finding an ML tree for C.

We say that an MP, ML, or MPL tree for C is irreducible if we cannot collapse any edge of T to obtain another such tree for C.

We now describe three links between two tree reconstruction methods, one of which (ML) is based explicitly on an underlying Markov model for the evolution of characters on a tree (the Poisson model), while the other method (MP) is based solely on a minimality principle.

9.6.1 Link 1: no common mechanism and an extension

MP is an ML estimator for phylogenetic trees under the ‘no common mechanism’ model described above. In particular, a tree T maximizes $L_{ncm}(T \mid C)$ precisely if T is an MP tree for C. This result, established in Tuffley and Steel (1997), extended the result
for r = 2 that was described by Penny et al. (1994). Here we describe a further slight extension of this result where we allow the size of the state space of the Poisson model to vary from character to character. In this case it can be shown that a weighted form of MP is an ML estimator for a phylogenetic tree under the ‘no common mechanism’ model.

First recall that character-weighted parsimony is directly analogous to standard MP; given a sequence $(\chi_1, \ldots, \chi_k)$ of characters and a weighting function $w : \{1, \ldots, k\} \to \mathbb{R}^{\ge 0}$ we simply replace l(C, T) by its weighted version $l_w(C, T) = \sum_{i=1}^{k} w(i)\, l(\chi_i, T)$. We then have the following result.

Theorem 9.6.1. Suppose $C = (\chi_1, \ldots, \chi_k)$ are characters on X. Consider the model in which all characters evolve independently on a phylogenetic tree T and that each character $\chi_i$ evolves according to some Poisson model on a state space of size $r_i$ according to admissible edge parameters that are free to vary from character to character. Then the (average) ML method ranks phylogenetic trees on X in exactly the same order as the weighted MP method provided that each character $\chi_i$ is assigned weight $\log(r_i)$.

Proof. The proof relies on a key result from Tuffley and Steel (1997): for any character $\chi : X \to R$, and any phylogenetic X-tree T′ we have

$$ \sup_{p'} \Pr(\chi \mid T', p') = r^{-l(\chi, T')} \qquad (9) $$

where the supremum is over all admissible p′. Consequently,

$$ L_{ncm}(T' \mid C) = \prod_{i=1}^{k} r_i^{-l(\chi_i, T')} = \exp\left(-\sum_{i=1}^{k} \log(r_i)\, l(\chi_i, T')\right) = \exp(-l_w(C, T')) $$

where w is the character weight function defined by $w(i) = \log(r_i)$. Consequently the tree(s) T′ that maximize $L_{ncm}(T' \mid C)$ are precisely the tree(s) that minimize $l_w(C, T')$, as claimed.

Note that if the size ($r_i$) of the state space for character $\chi_i$ is unknown for some or all values of i, then in an ML framework we might optimize these variables ($r_i$) subject to the obvious constraint that $r_i \ge |\chi_i(X)|$. In that case Theorem 9.6.1 holds if we replace the character weight $\log(r_i)$ by $\log(|\chi_i(X)|)$.

9.6.2 Link 2: large state space

In this section, we describe a quite different link between MP and ML. In contrast to the aforementioned link we consider here the ‘common mechanism’ setting for which the two methods are in general quite different, since they may select different trees (Felsenstein 1973). However when the number of states is sufficiently large, then once again ML trees are always MP trees. As we will see this may be relevant to the use of certain genomic data (such as gene order) for inferring phylogenies, as in this case the underlying state space may be very large. The proof of the following result (which also relies on the identity (9)) can be found in Steel and Penny (2004).

Theorem 9.6.2. Suppose $C = (\chi_1, \chi_2, \ldots, \chi_k)$ is a sequence of k characters on X over a state space R of size $r \ge 4nk$. Under the model in which the characters evolve independently according to the same Poisson model on R, any ML tree for C is an MP tree for C.

9.6.3 Link 3: dense sampling of sequences

Let $S = \{S_1, S_2, \ldots, S_n\}$ be a collection of aligned sequences of length k on $r \ge 2$ states. Equivalently, we may view S as a sequence $C_S = (\chi_1, \ldots, \chi_k)$ where $\chi_i$ is an r-state character on X. If we write $S_i$ as $S_i(1), \ldots, S_i(k)$, then $S_i(l) = \chi_l(i)$ for all $i \in \{1, \ldots, n\}$ and $l \in \{1, \ldots, k\}$. Let $d_H$ denote the Hamming metric on S, defined by setting $d_H(S_i, S_j) = |\{l : S_i(l) \ne S_j(l)\}|$. We will suppose that the sequences in S are distinct: that is, $d_H(S_i, S_j) > 0$ for all $i \ne j$. Let $G_S$ be the graph with vertex set S and with an edge connecting any two sequences that differ in exactly one coordinate. Equivalently, $G_S = (S, E)$ where

$$ E = \{(S_i, S_j) : d_H(S_i, S_j) = 1\} $$

In the context of molecular genetics, $G_S$ is the ‘haplotype graph’ described, for example, in
Excoffier and Smouse (1994). We say that S is ample if $G_S$ is connected. It is easily shown that if S is an ample collection of sequences then the set of spanning trees of $G_S$ (i.e. the trees in $G_S$ on vertex set S) is precisely the set of irreducible MP trees for $C_S$. Consequently, $C_S$ has MP score n − 1.

Theorem 9.6.3 below implies that when S is ample, then any spanning tree for $C_S$ is also an MPL tree for $C_S$ under this model. That is, we cannot improve the MPL score by introducing additional ‘Steiner points’ (hypothetical ancestral sequences). As an aside, this result provides another case where a particular instance of an NP-hard problem (namely that described by Addario-Berry et al. 2004) has a simple, polynomial-time solution. We note also that the Buneman complex (Buneman 1971) or, equivalently, the median network (Bandelt et al. 1995) of a collection of X-splits provides natural examples of ample sets of sequences. The proof of the following result can be found in Steel and Penny (2004).

Theorem 9.6.3. Suppose that S is ample. Then, under the model in which the characters evolve independently under the same Poisson model on R, the MP trees and the MPL trees for $C_S$ coincide. Furthermore, the MPL value is given by

$$ L_{mp}(T \mid C_S) = \left( \frac{1}{k(r-1)} \left(1 - \frac{1}{k}\right)^{k-1} \right)^{n-1} $$

where k is the length of the sequences, and r is the size of the state space.

To introduce the more general class of Markov processes, we note that many processes involving simple reversible models of change can be modeled by a random walk on a regular graph. To explain this connection, suppose there are certain ‘elementary moves’ that can transform each state into some ‘neighboring’ states. In this way we can construct a graph from the state space, by placing an edge between state a and state b precisely if it is possible to go from either state to the other in one elementary move. The graph so obtained is said to be regular, or more specifically d-regular if each state is adjacent to the same number d of neighboring states.

For example, aligned sequences of length N under the r-state Poisson model can be regarded as a random walk on the set of all sequences of length N over R; here an elementary move involves changing the state at any one position to some other state (chosen uniformly at random from the remaining r − 1 states). Thus the associated graph has $r^N$ vertices and it is N(r − 1)-regular.

As another example, consider a simple model of (unsigned) genome rearrangement where the state space consists of all permutations of length N (corresponding to the order of genes 1, . . . , N) and an elementary move consists of an inversion of the order of the elements of the permutation between positions i and j, where this pair is chosen uniformly at random from all such pairs between {1, . . . , N}. In this case the state space has size N! and the graph is d-regular for $d = \binom{N}{2}$.
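The regularity of the two state-space graphs just described (sequences under single-site changes, and permutations under inversions) can be checked by direct enumeration for small cases. The following Python sketch is illustrative only: the function names and the small parameter values (N = 3 sites over 4 states, and permutations of 4 genes) are assumptions chosen to keep the enumeration tiny.

```python
from itertools import combinations, permutations, product

def sequence_neighbors(seq, states):
    """All sequences obtained by changing exactly one position to a different state."""
    return {seq[:i] + (s,) + seq[i + 1:]
            for i in range(len(seq)) for s in states if s != seq[i]}

def inversion_neighbors(perm):
    """All permutations obtained by reversing the segment between positions i < j."""
    return {perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]
            for i, j in combinations(range(len(perm)), 2)}

def degrees(vertices, neighbors):
    """Set of vertex degrees that occur; a single value means the graph is regular."""
    return {len(neighbors(v)) for v in vertices}

states, n_sites = (0, 1, 2, 3), 3
seq_degrees = degrees(product(states, repeat=n_sites),
                      lambda s: sequence_neighbors(s, states))
perm_degrees = degrees(permutations(range(1, 5)), inversion_neighbors)
print(seq_degrees, perm_degrees)
```

Each call returns a set with a single element when the graph is regular, and that element can be compared with N(r − 1) for the sequence graph and with the number of position pairs for the inversion graph.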
9.7 More general models; the probability of homoplasy-free evolution

In this section we investigate a more general class of Markov processes than the simple Poisson model. For these models we ask the question of how likely it is that a character has evolved without homoplasy. This question has been investigated for the two-state Poisson model (and pairs of taxa) by Chang and Kim (1996). Here we consider more general processes on a larger state space, and for many taxa. Consequently we obtain bounds rather than the exact expressions that are possible in the simpler setting of Chang and Kim (1996).

Both of the graphs we have just described have more structure than mere d-regularity. To describe this we recall the concept of a Cayley graph. Suppose we have a (non-abelian or abelian) group G together with a subset S of elements of G, with the properties that $1_G \notin S$ and $s \in S \Rightarrow s^{-1} \in S$. Then the Cayley graph associated with the pair (G, S) has vertex set G and an edge connecting g and g′ whenever there exists some element $s \in S$ for which $g = g' s$. To recover the above graph on aligned sequences of length N over an r-letter alphabet, we may take G as the (abelian) group $(\mathbb{Z}_r)^N$ and the set S of all N-tuples that are the identity element of $\mathbb{Z}_r$ except on one coordinate. To recover the graph described above for unsigned genome rearrangements we may take G to be the
(non-abelian) symmetric group on N letters and S to be the elements corresponding to inversions.

The demonstration that such graphs are Cayley graphs has an important consequence: it implies that they also have the following property. A graph G is said to be vertex-transitive if, for any two vertices u and v there is an automorphism of G that maps u to v. Informally, a graph is vertex-transitive if it "looks the same, regardless of which vertex one is standing at." Clearly a (finite) vertex-transitive graph must be d-regular for some d, and it is an easy and standard exercise to show that every Cayley graph is vertex-transitive (however not every vertex-transitive graph is a Cayley graph, and not every regular graph is vertex-transitive). Thus, there are three properly nested classes of graphs:

Cayley graphs $\subset$ vertex-transitive graphs $\subset$ regular graphs

Given a connected graph G, a (simple) random walk on G is a walk on the vertices of G that, from any given position, selects as its next state one of the neighboring vertices (selected uniformly at random). This random process forms a reversible Markov chain. The proof of the following result is given in Appendix 9.1.

Lemma 9.7.1. Suppose $W_0, W_1, \ldots$ is a random walk on a d-regular graph G. Then, for any two distinct vertices u, v, and any $n \geq 0$,

$$\Pr(W_n = v \mid W_0 = u) \leq \frac{1}{d} \qquad (10)$$

Furthermore, if G is vertex-transitive then

$$\Pr(W_n = u \mid W_0 = u) = \Pr(W_n = v \mid W_0 = v) \qquad (11)$$

Consider now a continuous-time Markov process $(X_t; t \geq 0)$ on a finite state space R, and with rate matrix Q. Thus, for any two distinct states a, b, $Q_{ab}$ is the instantaneous rate at which state a changes to state b. Suppose that for some fixed positive integer d and some fixed positive real number q we have the following property: for each state $a \in R$ there is some neighborhood $N(a) \subseteq R - \{a\}$ of size d for which, for all $b \neq a$ we have

$$Q_{ab} = \begin{cases} q, & \text{if } b \in N(a) \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

Associated with any such process there is a corresponding graph with vertex set R and where the edge set E is defined by $E = \{\{a,b\} : Q_{ab} \neq 0, a \neq b\}$. Note that this graph is d-regular, and substitution events under a model satisfying (12) correspond to a random walk on the associated graph. Accordingly we will call any continuous-time Markov process that satisfies (12) a d-regular walk process. The equilibrium distribution of any such process is uniform.

Lemma 9.7.2. Let $(X_t; t \geq 0)$ be a d-regular walk process. Then, for any two distinct states a, b, and any values $s, t \geq 0$,

$$\Pr(X_{t+s} = b \mid X_t = a) \leq \frac{1}{d}$$

Proof. For this Markov process, consider the associated graph (R, E). Let M denote the random number of transitions between states, during the interval between time t and t + s. Then $\Pr(X_{t+s} = b \mid X_t = a)$ can be written as

$$\sum_{m \geq 0} \Pr(X_{t+s} = b \mid M = m, X_t = a)\,\Pr(M = m \mid X_t = a). \qquad (13)$$

Now, $\Pr(X_{t+s} = b \mid M = m, X_t = a)$ is precisely the probability that for a random walk $W_n$ on the graph (R, E) we have $W_m = b$ conditional on $W_0 = a$, and by Lemma 9.7.1 this is at most $1/d$. Applying this to the expression for $\Pr(X_{t+s} = b \mid X_t = a)$ given by (13) completes the proof.

The following result shows that for such a Markov process if d is much larger than $2n^2$ (where n is the number of species) then any character generated on a tree with n species will almost certainly be homoplasy-free on that tree.

Proposition 9.7.3. Suppose characters evolve on a phylogenetic tree T according to a d-regular walk process. Let p(T) denote the probability that the resulting randomly generated character $\chi$ is homoplasy-free on T. Then

$$p(T) \geq 1 - \frac{(2n-3)(n-1)}{d}$$

where $n = |X|$.
Proof. Consider a general Markov process on T with state space R. Suppose that for each arc (u, v) of T and each pair a, b of distinct states in R, the conditional probability that state b occurs at v given that a occurs at u is at most p. Then, from Proposition 7.1 of Semple and Steel (2003) we have $p(T) \geq 1 - (2n-3)(n-1)p$. By Lemma 9.7.2 we may take $p = 1/d$. The result now follows.

As an example to illustrate Proposition 9.7.3 consider the simple model for random inversions of (unsigned) gene orders mentioned above. If we have L genes then $d = \binom{L}{2}$ and so if we have (say) n = 10 genomes each consisting of the same set of L = 100 (unsigned) genes that have evolved on a phylogenetic tree, the probability that this character is homoplasy-free on that tree is at least 0.97 (here $d = \binom{100}{2} = 4950$ and $(2n-3)(n-1) = 17 \times 9 = 153$, so the bound is $1 - 153/4950 \approx 0.969$).

9.8 Results for infinite and large state spaces

Finally, we turn to the question of how many characters we need to reconstruct a large tree if the characters evolve under a Markov model on a large state space.

Markov models for genome rearrangement such as the (generalized) Nadeau–Taylor model (Nadeau and Taylor 1984; Moret et al. 2002) confer a high probability that any given character generated is homoplasy-free on the underlying tree, provided the number of genes is sufficiently large relative to $|X|$ (Semple and Steel 2002). In this setting the appropriate limiting model is to assume that every time a substitution occurs a completely new and unique state arises: such a model may be viewed as the phylogenetic analogue of what is known in population genetics as the 'infinite alleles model' of Kimura and Crow (1964).

Mossel and Steel (2004a) recently investigated such a 'random cluster' model on a phylogenetic tree T, which operates as follows. For each edge e let us independently either cut this edge (with probability p(e)) or leave it intact. The resulting disconnected graph (forest) G partitions the vertex set V(T) of T into non-empty sets according to the equivalence relation that $u \sim v$ if u and v are in the same component of G. This model thus generates random partitions of V(T), and thereby of X by connectivity, and we will refer to these partitions as $\bar{\chi}$ and $\chi$, respectively. For an element $x \in X$ we will let $\chi(x)$ denote the equivalence class containing x. We call the resulting probability distribution on partitions of X the random cluster model with parameters (T, p) where p is the map $e \mapsto p(e)$.

A central result from Mossel and Steel (2004d) was that the number of characters required to correctly reconstruct a fully resolved phylogenetic tree with n leaves grows (with n) at the rate log(n) provided upper and lower limits to p are specified (and the upper limit is less than 0.5). More precisely, let us suppose for the rest of this section that each value p(e) lies between a value $p_{\min}$ and value $p_{\max}$ where $0 < p_{\min} \leq p_{\max} < 0.5$.

For this model Mossel and Steel (2004d) established the following result: if one independently generates at least

$$\frac{2}{b}\log\left(\frac{n}{\sqrt{\epsilon}}\right) \qquad (14)$$

characters under this model, where

$$b = p_{\min}\left(\frac{1 - 2p_{\max}}{1 - p_{\max}}\right)^4 \qquad (15)$$

then with probability at least $1 - \epsilon$, T is the only phylogenetic tree on which the characters are homoplasy-free; furthermore T can be reconstructed from the characters in polynomial time (simulations conducted by Dezulian and Steel (2004) show that even fewer characters may suffice for accurate tree reconstruction than (14) requires, although a logarithmic dependence on n is still provably necessary).

We now provide a similar result for certain regular walk processes on a finite state space. We will show that for a subclass of d-regular walk processes, and provided d grows at least as fast as $n^2 \log(n)$ (where n is the number of leaves of T), then we can generate enough homoplasy-free characters to reconstruct T correctly.

First we describe a subclass of regular walk processes. Suppose that R is a group, and for some subset S (closed under inverses and not containing the identity element of R) we have $Q_{ab} = q$ if and only if there exists some element $s \in S$ for which $b = a \cdot s$, otherwise for any distinct pair a, b we have $Q_{ab} = 0$. Such a process we will call a group
walk process (on the generating set S). Clearly a group walk process is a regular process, and the graph (R, E) associated with the regular walk process is the Cayley graph for the pair (R, S). Random walk processes have a further useful property on trees: for each arc e = (u, v) of T = (V, E) consider the event D(e) that the state that occurs at v is different from the state that occurs at u (i.e. there has been a net transition across the edge). By Lemma 9.7.1 (and the fact that the Cayley graph for (R, S) is vertex transitive), it follows that the events $(D(e), e \in E)$ are independent. Let $p'_{\min} = \min\{\Pr(D(e)) : e \in E\}$, $p'_{\max} = \max\{\Pr(D(e)) : e \in E\}$, and for any $\epsilon > 0$ let

$$c_\epsilon = \frac{1 + \log\left(\frac{1}{\sqrt{\epsilon}}\right)}{b'\epsilon} \qquad (16)$$

where $b' = p'_{\min}\left(\frac{1 - 2p'_{\max}}{1 - p'_{\max}}\right)^4$.

We are now ready to state a result for certain Markov processes on large (but finite!) state spaces, which brings together several ideas presented above. Informally, Theorem 9.8.1 states that, for a group walk process, a growth of around $n^2 \log(n)$ in the size of the generating set is sufficient (with all else held constant) for producing a sequence of homoplasy-free characters that define T.

Theorem 9.8.1. Suppose characters evolve independently on a fully resolved phylogenetic tree T according to a group walk process on a generating set of size d, where

$$d \geq c_\epsilon\, n^2 \log(n)$$

with $c_\epsilon$ given by (16) and with $p_{\max} < \frac{1}{2}$. Then with probability at least $1 - 2\epsilon$ we can correctly reconstruct the topology of T by generating $\left\lceil \frac{2}{b'}\log\left(\frac{n}{\sqrt{\epsilon}}\right) \right\rceil$ characters and applying a method such as MP or maximum compatibility.

As an example, consider the group walk process for (unsigned) gene-order reversal mentioned earlier. In this case, for L genes, we have $d = \binom{L}{2}$. Theorem 9.8.1 shows that provided L grows at the rate (with n) at least some constant times $n\sqrt{\log(n)}$ then one can hope to recover fully resolved phylogenetic trees with n leaves from a (logarithmic with n) number of such independent gene-order characters.

Outline of the proof of Theorem 9.8.1. A detailed proof of Theorem 9.8.1 can be found in Mossel and Steel (2004b). Here we simply outline the argument and indicate how it depends on earlier results.

Generate $k = \left\lceil \frac{2}{b'}\log\left(\frac{n}{\sqrt{\epsilon}}\right) \right\rceil$ characters under a group walk process satisfying condition (12) on a rooted phylogenetic tree. Consider the event H that all of these characters are homoplasy-free on T. Since a group walk process is a regular walk process, satisfying (12), using Proposition 9.7.3 it can be shown that $P[H] \geq 1 - \epsilon$. Furthermore the probability that T will be correctly reconstructed (using MP or maximum compatibility) from k characters produced by a coupled random cluster model (with $b = b'$) is at least $1 - \epsilon$ by (14) (recalling that $p_{\max} < \frac{1}{2}$). Now, the original k characters induce the same partitions as the coupled random cluster characters whenever event H holds, and $P[H] \geq 1 - \epsilon$. Consequently, by the Bonferroni inequality, the joint probability that event H holds and that the k characters produced by the coupled process recover T is at least $1 - 2\epsilon$. Thus the probability that the original k characters recover T is at least this joint probability, and so at least $1 - 2\epsilon$, as claimed.

We end this section by noting that a related result (namely the statistical consistency of MP for certain Markov processes on a sufficiently large state space) was established in Steel and Penny (2000). The main difference between that result and Theorem 9.8.1 is that statistical consistency is a limiting statement; it says that as the number of characters becomes large, the probability of recovering the correct tree converges to 1. Theorem 9.8.1 meanwhile provides an explicit bound on the probability of correctly reconstructing the correct tree from a certain given number of characters.

9.9 Concluding comments

MP has continued to provide mathematicians with a rich variety of problems for study. Often these problems have led to elegant and surprising solutions, including the bichromatic binary tree theorem (Carter et al. 1990; Erdős and Székely 1993; Steel 1993), the min-max theorem of Erdős
and Székely (1992), and the guaranteed embedding of MP trees in median networks due to Bandelt et al. (1995). In this chapter we have considered further problems, particularly those concerning the statistical aspects of applying MP to character data on a large state space, and for which some solutions have been proposed. However the reader would be wrong to conclude that MP for even two-state character data is completely understood. Indeed the following problem is still open: under the two-state Poisson process is there a value p > 0 so that MP is statistically consistent for all fully resolved trees (having any number of leaves) under the constraint p(e) = p for all edges of the tree? The fact that such a basic question is still open suggests that challenges still await investigators in the future.

9.10 Acknowledgments

We thank the New Zealand Marsden Fund and the New Zealand Institute for Mathematics and its Applications (NZIMA) for supporting this research. We also thank Andrew Hugall for posing a question that led to Theorem 9.6.1, and Joseph Felsenstein, Michael Sanderson, and Cécile Ané for helpful comments on an earlier version of this chapter.

Appendix 9.1 Proof of Proposition 9.4.4, and Lemma 9.7.1

Proof of Proposition 9.4.4. Let $X = \{0,1\}^k$ and let $C = \{A_i \mid B_i,\ i = 1, \ldots, k\}$ where $A_i := \{x \in X : x_i = 1\}$ and $B_i = X - A_i$. We claim that C has the property described. To this end, suppose that $A \mid B$ is an X-split that is compatible with every character in C. Let $\mathbf{1} = (1, 1, \ldots, 1) \in X$. Without loss of generality (by interchanging A and B, as well as $A_i$ and $B_i$ if necessary) we may suppose that $\mathbf{1} \in A$ and, for each i, $\mathbf{1} \in A_i$. Note that, by definition, $|A_i| = 2^{k-1}$; also we have $A_i \cap A \neq \emptyset$ for all i. Thus the compatibility of $A \mid B$ with $A_i \mid B_i$ ensures that

$$\text{for each } i \text{ either } A_i \subseteq A \text{ or } A \subseteq A_i \text{ or } B_i \subseteq A \qquad (17)$$

We then consider two cases:

(i) $|A| < 2^{k-1}$
(ii) $|A| \geq 2^{k-1}$

In case (i) condition (17) and the equality $|A_i| = 2^{k-1}$ ensure $A \subseteq A_i$ for all i. But this means that $A = \{(1, 1, \ldots, 1)\}$ and so $A \mid B$ is a trivial character. In case (ii) condition (17) and the equality $|A_i| = 2^{k-1}$ ensure that for each i either $A_i \subseteq A$ or $B_i \subseteq A$; in the first case we will let $y_i = 0$ and in the second case we will let $y_i = 1$. Let $y = (y_1, \ldots, y_k)$. Then $A = X - \{y\}$ and so again $A \mid B$ is a trivial character.

Proof of Lemma 9.7.1. We prove the first claim by induction on n. The result trivially holds for n = 0, and for n = 1 we have $\Pr(W_1 = v \mid W_0 = u) \in \{0, \frac{1}{d}\}$ since the graph is d-regular, and so (10) holds. Suppose (10) holds for n = k. Then by the elementary theory of Markov chains,

$$\Pr(W_{k+1} = v \mid W_0 = u) = \sum_{w} \Pr(W_1 = w \mid W_0 = u)\,\Pr(W_k = v \mid W_0 = w) \qquad (18)$$

Letting N(u) denote the set of vertices that neighbor u the right-hand term in (18) is

$$\frac{1}{d}\sum_{w \in N(u)} \Pr(W_k = v \mid W_0 = w) = \frac{1}{d}\sum_{w \in N(u)} \Pr(W_k = w \mid W_0 = v) \qquad (19)$$

where the equality in (19) arises since the chain-transition matrix is symmetric and so $\Pr(W_k = v \mid W_0 = w) = \Pr(W_k = w \mid W_0 = v)$. Combining (18) and (19) we have

$$\Pr(W_{k+1} = v \mid W_0 = u) = \frac{1}{d}\sum_{w \in N(u)} \Pr(W_k = w \mid W_0 = v) \leq \frac{1}{d}$$

so that (10) holds for n = k + 1, establishing the induction step and thereby the lemma.

The proof of (11) in Lemma 9.7.1 is similar but easier.
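The two claims of Lemma 9.7.1 can also be checked numerically on a small Cayley graph. The sketch below is only an illustration, not part of the proof: it builds the sequence graph described in Section 9.7 (sequences of length N over r states, neighbors differing at one position), computes exact n-step probabilities for the simple random walk, and tests inequality (10) and equality (11). The function names `sequence_graph`, `n_step`, and `check_lemma` are hypothetical helpers introduced here.

```python
from fractions import Fraction
from itertools import product


def sequence_graph(r, N):
    """Adjacency lists of the graph on length-N sequences over r states.

    Two sequences are neighbors when they differ at exactly one position,
    so every vertex has degree N * (r - 1).
    """
    verts = product(range(r), repeat=N)
    return {
        v: [v[:i] + (s,) + v[i + 1:]
            for i in range(N) for s in range(r) if s != v[i]]
        for v in verts
    }


def n_step(adj, start, n):
    """Exact distribution of W_n given W_0 = start (simple random walk)."""
    dist = {start: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for u, p in dist.items():
            share = p / len(adj[u])  # uniform choice among neighbors
            for w in adj[u]:
                nxt[w] = nxt.get(w, Fraction(0)) + share
        dist = nxt
    return dist


def check_lemma(adj, max_n):
    """Return (d, bound (10) holds, return probabilities agree as in (11))."""
    d = len(next(iter(adj.values())))
    bound_ok, return_ok = True, True
    for n in range(max_n + 1):
        returns = set()
        for u in adj:
            dist = n_step(adj, u, n)
            returns.add(dist.get(u, Fraction(0)))
            bound_ok &= all(p <= Fraction(1, d)
                            for v, p in dist.items() if v != u)
        return_ok &= len(returns) == 1
    return d, bound_ok, return_ok


# Example: r = 3 states, N = 2 positions, so 9 vertices and d = 4.
print(check_lemma(sequence_graph(3, 2), max_n=5))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison with 1/d is not affected by floating-point rounding; the graph is kept tiny because the check enumerates every starting vertex.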
V
Parsimony and genomics
CHAPTER 10
Using phylogeny to understand genomic evolution
David A. Liberles
10.1 Introduction

As genome-sequencing projects have propagated, comparative genomics has emerged as a method of choice for understanding protein function. Simple approaches for comparing sequences, like relative entropy (Shenkin et al. 1991) or binary transformations of gene-content comparisons (Gaasterland and Ragan 1998; Pellegrini et al. 1999), have been presented. However, phylogenetic methods that explicitly consider evolutionary history are not only more powerful, but enable additional types of analysis drawing on knowledge in parallel fields, such as ecology, anthropology, and geology. This chapter will focus both on methodological issues and on their application to real genomic-scale problems.

Parsimony and maximum likelihood are two phylogenetic approaches that are used and often compared side by side. While the choice between them has been contentious at times, they frequently give similar results and, where they don't, they can complement each other. Maximum likelihood works well when a good model is available. Parsimony works well when a good model does not or cannot exist, as for very complex processes, and also along very short branches where multiple events per position (as in a sequence) are extremely infrequent.

Both methods can be used to estimate ancestral states in a phylogenetic tree. Fitch (1971) famously provided an algorithm for parsimony reconstruction of ancestral character states in a rooted phylogenetic tree. This approach is depicted in Fig. 10.1. Variations on this approach, including branch length weighting, have been implemented more recently (Liberles 2001). Increasingly sophisticated maximum-likelihood approaches for determining ancestral sequences have also been developed (Yang et al. 1995b; Koshi and Goldstein 1996; Pupko et al. 2000, 2002). Parsimony-based ancestral character reconstruction is fast and can be performed easily in large-scale genomic applications.

Both explicit ancestral sequence reconstruction (from either parsimony or maximum likelihood) and maximum likelihood methods can be used to estimate the evolution that occurred along any given branch of a phylogenetic tree. Using explicitly reconstructed ancestral sequences, one can examine the difference between nodes connected by a branch (see Fig. 10.2). This gives a reconstructed picture of evolution that is predicted to have occurred along any branch of interest in a phylogenetic tree.

10.2 Gene sequence evolution

In a phylogenetic tree based upon gene sequences, branches correspond to periods of evolution following speciation events or to periods of evolution following gene duplication or gene transfer events. Genes related most recently by a node representing a speciation event are called orthologs, while genes related most recently by a node representing a gene duplication event are called paralogs. All such genes related by common ancestry are called homologs. A species tree is frequently derived from either the fossil record or from sets of genes that are believed to be
[Figure 10.1 tree labels: leaf sequences ACCT, ACGT, and AGCT. Node joining ACCT and ACGT: 1. AC_T; 2. AC(CG)T; 4. ACCT. Root node: 1. A_CT; 2. A(GC)CT.]
Ordered set of actions for Fitch parsimony on a rooted tree
1. Going up the tree from the extant species, at each node take the intersection of possible characters from the descendants.
2. If the intersection is a null set, then take the union. Do this for all nodes until you reach the root. You now have the preliminary nodal set.
Now work back down the tree.
3. At a node, if the preliminary nodal set contains all of the characters present in the final set of the ancestor, go to 4, otherwise go to 5.
4. Eliminate all characters from the set that are not in the final set of the immediate ancestor. Continue with the next node.
5. If the set was formed by a union, go to 6, otherwise go to 7.
6. Add to the set any characters not present that appear in the final set of the immediate ancestor. Continue with the next node.
7. Add any characters not present that are present in both the final set of the immediate ancestor and the current set in at least one of the
two descendants. Continue with the next node.
8. Finally, eliminate possible links involving mutations to characters added in steps 3–7.
Figure 10.1 Ancestral sequences are calculated over a rooted phylogenetic tree according to the approach of Fitch (1971). At each of the two
nodes, the sequence obtained after each step is indicated.
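The ordered steps in Fig. 10.1 can be sketched in code. The following is a minimal illustration under stated assumptions rather than a reference implementation: the `Node` class and the functions `up_pass`, `down_pass`, and `show` are hypothetical names introduced here, the tree is assumed to be rooted and strictly binary, and step 8 (pruning links to characters added during the downward pass) is not implemented. The example tree is the three-sequence tree of Fig. 10.1.

```python
class Node:
    """A node of a rooted binary tree; leaves carry an observed sequence."""

    def __init__(self, left=None, right=None, seq=None):
        self.left, self.right, self.seq = left, right, seq
        self.prelim = None  # preliminary nodal sets, one set per site
        self.union = None   # per site: was the preliminary set a union?
        self.final = None   # final nodal sets, one set per site


def up_pass(node):
    """Steps 1-2: build preliminary sets; return the number of unions."""
    if node.seq is not None:
        node.prelim = [{c} for c in node.seq]
        node.union = [False] * len(node.seq)
        return 0
    changes = up_pass(node.left) + up_pass(node.right)
    node.prelim, node.union = [], []
    for a, b in zip(node.left.prelim, node.right.prelim):
        common = a & b
        node.union.append(not common)
        node.prelim.append(common if common else a | b)
        changes += int(not common)
    return changes


def down_pass(node, parent_final=None):
    """Steps 3-7: derive final sets from the ancestor's final sets."""
    if parent_final is None or node.seq is not None:
        # The root and the leaves keep their preliminary sets.
        node.final = [set(s) for s in node.prelim]
    else:
        node.final = []
        for i, (p, f) in enumerate(zip(node.prelim, parent_final)):
            if f <= p:               # step 3 leads to step 4
                node.final.append(set(f))
            elif node.union[i]:      # step 5 leads to step 6
                node.final.append(p | f)
            else:                    # step 7
                kids = node.left.prelim[i] | node.right.prelim[i]
                node.final.append(p | (f & kids))
    if node.seq is None:
        down_pass(node.left, node.final)
        down_pass(node.right, node.final)


def show(sets):
    """Render nodal sets, writing ambiguous sites as e.g. (CG)."""
    return "".join(next(iter(s)) if len(s) == 1
                   else "(" + "".join(sorted(s)) + ")" for s in sets)


inner = Node(Node(seq="ACCT"), Node(seq="ACGT"))
root = Node(inner, Node(seq="AGCT"))
length = up_pass(root)
down_pass(root)
print(length, show(root.prelim), show(inner.prelim), show(inner.final))
```

The count returned by `up_pass` is the number of union operations, which equals the minimum number of changes required on the tree for these sequences. Sets are sorted alphabetically when printed, so the root appears as A(CG)CT rather than A(GC)CT as drawn in the figure.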
[Figure 10.2 tree labels: Sequence 1, Sequence 2, Sequence 3, Sequence 4.]

Figure 10.2 In looking for significant events that have occurred between sequence 1 or 2 and sequence 3 or 4, pairwise comparison or phylogenetic analysis to determine evolution along a branch are possible. The pairwise comparisons will average over four branches, while phylogenetic methods allow individual analysis of each of the four branches (for example, the differences between reconstructed ancestral sequences at nodes connected by a branch). If a function-changing event occurred somewhere along the dashed branch, analysis considering only that branch will have a higher signal-to-noise ratio than the pairwise comparison, increasing the chance of detection.

orthologous. Understanding the evolution of genes in a genome in the context of the species tree requires mapping of the gene tree (and the events it represents) on to a species tree representing the history of life on Earth. These concepts are displayed in Fig. 10.3.

10.3 Mapping gene trees onto species trees

Several approaches are available for doing this mapping. Goodman and coworkers (1979) introduced a rigid parsimony approach to mapping gene trees on to species trees. More recently, a Bayesian approach has been developed (Arvestad et al. 2003). In these approaches, fixed binary gene and species trees are used. However, not all sections of a genome necessarily show the same ancestral history, especially in periods where rapid successive speciation events may have led to differential fixation of shared ancestral polymorphisms, or where lateral gene transfer has been common. An alternative soft parsimony-based approach that allows for non-binary species trees and uncertainty in gene trees has recently also been developed (Steffansson 2004). Specifically,
this approach considers gene trees that map on to non-binary nodes in species trees with different resolutions as equally parsimonious to those that resolve non-binary nodes consistently throughout the gene tree. Further work will continue to improve methods for mapping of real data on to relevant species trees, as well as improve the species trees themselves.

[Figure 10.3 tree labels: (a) Rat alpha 1, Rat alpha 2, Mouse alpha, Human alpha, Rat beta, Mouse beta, Human beta, Salmon protein. (b) Mouse, Rat, Human, Salmon.]

Figure 10.3 (a) A gene tree is indicated for an idealized gene family. Gene duplication events are shown with white circles, while speciation events are shown in various shades of grey. Orthologs are proteins related by a speciation event at the last common ancestor in a phylogenetic tree, while paralogs are related by a gene duplication event. With respect to each other, rat alpha 1 and rat alpha 2 are paralogs, as are the alpha and beta proteins with respect to each other. All mammalian proteins are co-orthologs of the salmon protein, as are both rat alpha proteins with respect to the mouse and human alpha proteins. (b) The species tree for rat, mouse, human, and salmon is shown. Speciation nodes that correspond to each other are indicated in the same shades of grey in a mapping of the gene tree from (a) on to this species tree.

With the framework that has been established above, it is now possible to analyze various genomic data in the context of species trees. Koonin and coworkers (Koonin et al. 2004; see also Chapter 11 in this book) have done this with gene content, with quite interesting results. The analysis from Koonin's work is based upon complete genomes. This allows a definitive statement about presence and copy number, or absence of genes from a genome. Gene families, like those found in HOVERGEN (Duret et al. 1994), the Master Catalog (Benner et al. 2000), and The Adaptive Evolution Database (TAED; Liberles et al. 2001), are based upon gene or domain families, or independently evolving units in the case of the Master Catalog. Independently evolving units are pieces of a gene that are found as a self-contained gene in at least one organism or secondarily are found in conjunction with gene segments that are self-contained in another species. Families based upon genes or independently evolving units permit an assessment of more species (including those without complete genomes), but only allow a statement of presence and minimum copy number, not of absence. This can all be combined to give an increasingly comprehensive picture of the genes common to various last common ancestral points in the tree of life, which is presented in Chapter 11 (see also Koonin et al. 2004) using a Dollo parsimony approach (allowing gene loss but not de novo gene gain) to gene content from various completed genomes.

Such approaches contrast with the non-phylogenetic analyses done using a binary transformation of gene-content data from various complete genomes, called, ironically enough, phylogenetic profiling (Pellegrini et al. 1999). This is used as a method to identify functions in bacterial genes without known functions. The principle behind phylogenetic profiling is that proteins performing basic interacting functions for an organism will be conserved together. Nonidentical profiles can actually be analyzed using a parsimony analysis over a species tree (Liberles et al. 2002). Incorporating phylogeny into the analysis improves its performance, but the method is still not trustworthy for blind prediction of gene functions without additional information from databases or experiments (Marcotte et al. 1999; Liberles et al. 2002).

10.4 Understanding gene function

Phylogeny can also be used to understand gene function in other ways. Within-sequence evolution in gene families can be understood in a
phylogenetic context. This of course includes evolutionary divergence of both paralogs and orthologs.

To begin with paralogs, Ohno (1970) saw gene duplication as the driving force for innovation. Gene duplication, under a purely neutral mechanism, led to relaxation of selective constraint on both duplicate copies. Both were then free to explore sequence space until one copy no longer achieved the basic function necessary in the genome. That copy remained free to evolve, while the other copy became constrained to uphold the ancestral function. Possible fates for the freely evolving copy were neofunctionalization (the evolution of a new function) and pseudogenization (the loss of gene function).

This theory has been extended by Lynch and coworkers (Force et al. 1999), who have proposed a third fate: subfunctionalization. Subfunctionalization occurs when part of the sequence or its regulatory regions becomes modified or inactivated in one copy while another region becomes modified or inactivated in the other copy. Both copies are then required in the genome to perform the ancestral function. Subfunctionalization can also be viewed as a transition state to neofunctionalization, where the sequence freed from constraint in each copy can evolve to either optimize the original activity or develop a new activity.

Neofunctionalization, however, is not limited to paralogs. Both paralogs and orthologs can evolve new functions under neutral or positive selection pressures. An innovative combination of phylogeny, ancestral sequence reconstruction, and evolutionary theory was presented by Messier and Stewart (1997) in examining the evolution of primate lysozyme orthologs.

Under a neutral evolutionary model, the rate of substitution at nucleotide positions that can change the encoded amino acid (called the non-synonymous nucleotide substitution rate, Ka, or dN) should be equal to the rate of substitution at nucleotide positions where substitution does not change the encoded amino acid (called the synonymous nucleotide substitution rate, Ks, or dS). Most protein-encoding genes in a comparison of closely related species show a Ka/Ks ratio significantly less than 1. This is not surprising and is indicative of negative selection or conservation. Most proteins have been optimized over millions of years of evolution for a given function. Therefore, any given mutation is more likely to decrease fitness and be selected against, reducing the rate of substitution at such sites. In rarer cases, in which adaptation involving modification of protein function is one of several possible causes, mutations increase fitness and are selected for, resulting in Ka/Ks ≫ 1. This is called positive selection.

10.5 Case studies of gene-family evolution

We now turn back to the innovation of Messier and Stewart (1997). Lysozyme is a bacteriolytic enzyme that is widespread among species. Colobine monkeys are the only primates with a foregut, where bacteria ferment edible plant material before passing digested food to a true stomach with high levels of lysozyme. Other primates have a 'simple stomach' with a different anatomy. Instead of just comparing the Ka/Ks ratios between extant species, Messier and Stewart also calculated ancestral sequences at various nodes in the phylogenetic tree and calculated Ka/Ks ratios over the tree along these branches. This implicated not just the branch leading to colobine monkeys as being under positive selective pressure, but also, unexpectedly, the branch leading to hominids (see Fig. 10.4). The events driving this positive selection during the emergence of hominids are not clear, but may correspond to dietary changes. Leptin, another dietarily important protein, also appears to have been under positive selective pressure at the same time (Benner et al. 1998).

Coupling ancestral sequence reconstruction to phylogeny in searching for periods of positive selection pressure allows more-precise dating of such selective regimes and increases the power in detecting them, when compared with pairwise calculation involving extant sequences (again, see Fig. 10.2). Of course, it is also possible to evaluate such scenarios using a likelihood-based approach involving nested models. This was done by Yang (1998) for the lysozyme data set of Messier and Stewart (1997), largely confirming the original results.
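The per-branch comparison described above can be illustrated with a small sketch. It is a deliberately simplified assumption-laden example, not the procedure of Messier and Stewart (1997) or Yang (1998): the function `branch_codon_changes` is a hypothetical helper that only counts codons that differ between the sequences at the two ends of one branch and labels each difference as nonsynonymous or synonymous under the standard genetic code. Proper Ka and Ks estimates additionally normalize by the numbers of nonsynonymous and synonymous sites and correct for multiple substitutions, which this sketch does not attempt.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def branch_codon_changes(ancestor, descendant):
    """Count (nonsynonymous, synonymous) codon differences along a branch.

    Both arguments are aligned, gap-free coding sequences of equal length,
    e.g. a reconstructed ancestral sequence and its descendant node.
    """
    if len(ancestor) != len(descendant) or len(ancestor) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    nonsyn = syn = 0
    for pos in range(0, len(ancestor), 3):
        a, d = ancestor[pos:pos + 3], descendant[pos:pos + 3]
        if a == d:
            continue
        if CODON_TABLE[a] == CODON_TABLE[d]:
            syn += 1      # codon changed, amino acid unchanged
        else:
            nonsyn += 1   # codon changed, amino acid changed
    return nonsyn, syn


# Toy branch: AAA (Lys) -> AGA (Arg) and CTG (Leu) -> CTA (Leu).
print(branch_codon_changes("ATGAAACTG", "ATGAGACTA"))
```

Applying such a count to each branch separately, rather than to pairs of extant sequences, is what localizes an excess of nonsynonymous change to a particular branch.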
[Figure 10.4 tree labels: Douc langur, Hanuman langur, Purple-faced langur, Dusky langur, Francois' langur, Proboscis monkey, Guereza colobus, Angolan colobus, Patas monkey, Vervet, Talapoin, Allen's monkey, Rhesus macaque, Olive baboon, Sooty mangabey, Lar gibbon, Orangutan, Gorilla, Human, Chimpanzee, Bonobo.]

Figure 10.4 The phylogenetic tree of primate lysozyme sequences from Messier and Stewart (1997) is shown. After reconstruction of ancestral sequences and calculation of Ka/Ks ratios along each branch, the two branches shown with thick lines showed evidence of positive selection pressures for adaptation of the encoded lysozyme protein.

Another interesting case study is that of myostatin. Myostatin has been implicated in the double-muscling phenotype in cattle and other mammals, where some breeds of cattle have twice the number of a specific type of muscle fiber due to mutation in myostatin. From comparative sequencing of various ruminant species (relatives of cattle), myostatin was found to be under positive selective pressure during the divergence of bovids and Antilopinae (sheep, goats, and close relatives; Tellgren et al. 2004). This was demonstrated using the approach of Messier and Stewart (1997) as well as that of Yang (1998), as seen in Fig. 10.5. From this analysis, a key protein regulating skeletal muscle appears to have changed function as ruminants diverged, possibly enabling phenotypic divergence. Interestingly, this protein has also been under positive selective pressure following gene duplication in teleost fish (Liberles et al. 2001) and may be a more general selectable regulator for modulating muscle mass. The myostatin gene itself encodes a signal peptide, a regulatory propeptide, and the mature protein. Positive selective pressures in ruminants have acted on both the regulatory propeptide and the mature protein. The more detailed molecular and structural basis of this positive selection is currently under further investigation.

[Figure 10.5 tree labels: Duiker, Sheep, Goat, Ibex, Tahr, Impala, Eland, Gaur, Cow, Pronghorn, Pig.]

Figure 10.5 The phylogenetic tree of ruminant myostatin sequences from Tellgren et al. (2004) is shown. After reconstruction of ancestral sequences and calculation of Ka/Ks ratios along each branch, the three branches shown with thick lines showed evidence of positive selective pressures for adaptation of the encoded myostatin protein.

In another interesting case study, perhaps inspired by Jurassic Park (Crichton 1990), Chang et al. (2002) sought to reconstruct a visual pigment of the last common ancestor of alligators and birds, which also is the last common ancestor with carnivorous dinosaurs (see Fig. 10.6). Through analysis of the ancestral protein's function, insight in to the visual capabilities of this long-extinct
organism are possible. Using only four ingroup species (American alligator plus three birds: domestic pigeon, chicken, and zebra finch) plus a collection of outgroup species, the ancestral sequence was reconstructed for the ancient archosaur using a phylogenetic approach. Only three positions were found to be ambiguous by method and several variants of the ancestral protein were considered. Ancestral proteins were ultimately synthetically reconstructed in the laboratory and the ambiguous positions proved not to be functionally important. The visible absorption maxima were measured, and the maximum of the protein from the ancient archosaur was red-shifted compared with values reported for most extant birds and reptiles. The further implication was that the ancient archosaur had dim-light vision and may have been nocturnal rather than diurnal.

[Figure 10.6: tree diagram not reproduced. Tip labels with absorption maxima: Domestic pigeon (502–505 nm), Chicken (503–507 nm), Zebra finch (501–507 nm), American alligator (499 nm), Green anole (491 nm), Human (495 nm). A further value of 508 nm appears in the figure, apparently at the circled ancestral node.]
Figure 10.6 From Chang et al. (2002), the ancestral sequence for an archosaur visual pigment was determined, indicated by the circled node. In their study, additional outgroup sequences besides green anole and human were included. The optimal absorption spectra of proteins from the extant species are indicated. The archosaur sequence was synthesized experimentally and its optimal absorption was determined to be slightly red-shifted compared to the extant sequences.

The pioneering approach of Jermann et al. (1995) to study the evolution of function of ribonucleases, which was adopted by Chang et al. (2002), is an increasingly important combination of computational phylogenetics and experimental molecular biology/functional genomics used to study protein function. This can be coupled to periods of high Ka/Ks ratio or rapid sequence evolution, where any functional changes are assessed before and after periods of positive selective pressure. Ultimately, such approaches may be powerful, not just for understanding nature, but also for identifying key residues that can then be used in protein engineering.

The case studies presented above represent a small number of the growing set of examples studied individually in detail (see Yang and Bielawski 2000). As genome and individual gene-sequencing data have amassed, it has also been possible to apply Ka/Ks phylogenetic methods systematically.

10.6 Large-scale analysis

The first approaches to search for positive selection in large data sets did not use phylogeny. An early pairwise comparison of 363 mouse and rat homologs yielded only interleukin-3 as being under positive selection pressure (Wolfe and Sharp 1993). A subsequent systematic comparison by Gojobori and coworkers (Endo et al. 1996) examined 3595 gene families from GenBank. Positive selection in at least half of the pairwise comparisons was seen as evidence for a gene family under positive selective pressure. Using these criteria, only 17 gene families were identified. CDC6, snake neurotoxin, and prostatic steroid-binding protein were the only eukaryotic examples, the latter two in chordates.

A systematic approach using methodology similar to that of Messier and Stewart (1997) was applied to Master Catalog families (Benner et al. 2000) in chordates and higher plants. Following a parsimony mapping of gene trees onto species trees, this was collected in a phylogenetic context in the original version of The Adaptive Evolution Database (TAED) (Liberles et al. 2001). On the chordate side, 5305 gene families were analyzed with 280 families containing 643 positively selected branches spanning over 63 branches of the National Center for Biotechnology Information (NCBI) taxonomy (Benson et al. 2004). A picture of a greater role for positive selection was beginning to emerge. The approach used in the original calculation of this database was approximate, but it was still likely to be conservative, given that it averaged over all sites in a protein.
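The family-level criterion attributed above to Endo et al. (1996), positive selection in at least half of the pairwise comparisons, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the input format (precomputed pairwise Ka and Ks estimates), and the plain Ka > Ks test without any significance threshold are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def family_under_positive_selection(
    pairwise_rates: Iterable[Tuple[float, float]],
    min_fraction: float = 0.5,
) -> bool:
    """Return True if at least `min_fraction` of the pairs have Ka > Ks.

    `pairwise_rates` holds hypothetical, precomputed (Ka, Ks) estimates,
    one tuple per pair of sequences in a gene family. Estimating Ka and Ks
    is outside the scope of this sketch, and no significance test is applied.
    """
    pairs = list(pairwise_rates)
    if not pairs:
        return False
    # Comparing Ka > Ks directly avoids dividing by zero when Ks == 0.
    selected = sum(1 for ka, ks in pairs if ka > ks)
    return selected / len(pairs) >= min_fraction


# Made-up values for two toy families (not data from any cited study).
conserved_family = [(0.02, 0.30), (0.05, 0.41), (0.01, 0.22)]
candidate_family = [(0.40, 0.20), (0.35, 0.30), (0.10, 0.25), (0.50, 0.21)]

print(family_under_positive_selection(conserved_family))
print(family_under_positive_selection(candidate_family))
```

Because each (Ka, Ks) pair here summarizes a whole alignment, a screen of this kind shares the conservatism noted in the text for estimates averaged over all sites in a protein.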
10.7 Positive selection, protein structure, and coevolution

From many protein structure-function studies, we know that some residues play key scaffolding roles, other residues are involved in surface interactions with solvent, and additional residues perform catalysis, binding, and other functions. In a protein where function is being modified, probably only a subset of these residues corresponding to the nature of the modification is likely to be under positive selective pressure, while the remainder are not.

Further, structure itself can drive positive selective pressure. This can represent a real change, such as coevolving sites modifying a binding affinity. This can also be driven by compensation in interacting residues for slightly deleterious substitutions. As has been seen before for other cases, using phylogeny seems to be the best way to detect intramolecular covariation, from the co-occurrence of changes at such sites along the branches separating parsimony-reconstructed ancestors (as in Fukami-Kobayashi et al. 2002). This information can actually be used in both structure prediction and phylogenetic reconstruction.

The intermolecular coevolution of proteins can also be studied phylogenetically. Both adaptation and compensatory covariation can explain the correlated evolution of residues in a protein. One interesting case where this has been detected is the evolution of the interaction between leptin and the extracellular domain of its receptor in higher primates. Both show periods of high Ka/Ks in several branches during the diversification of primates (Benner et al. 1998). Interestingly, both proteins appear to have evolved new gene expression patterns during this period due to the action of transposition in enhancer and splicing regulatory regions of the respective genes (Bi et al. 1997; Kapitonov and Jurka 1999).

This type of analysis can be extended to whole pathways, combining genes under positive selection with information on either metabolism or protein interaction. A list of 19 proteins showing evidence for positive selection in Helicobacter pylori was published recently (Davids et al. 2002). Around the same time, a protein–protein interaction map for H. pylori based upon the two-hybrid system was published (Rain et al. 2001). From mapping one data set onto the other, two pathways with multiple positively selected hits were identified (H. Ardawatia and D. A. Liberles, unpublished observations). While the significance of this is still under investigation, this type of approach can increasingly link sequence evolution in the context of phylogeny with the growing field of systems biology.

10.8 Continuous-character ancestral-state reconstruction

Beyond sequence evolution, parsimony and phylogeny have other applications in genomics. Both gene expression and mRNA splicing are important processes regulating how the genome is converted to the proteome. Their regulation, evolution, and ultimately the species-specific effects caused by this combination are, so far, less well understood than sequence evolution.

Data from large-scale gene expression studies and also the relative abundance of alternative mRNA transcripts are continuous characters, rather than discrete ones such as individual sequence positions. The reconstruction of continuous characters over a phylogeny using the principle of parsimony can be turned into a minimum-evolution-type distance method (Rossnes 2004). The methodology here is similar (but not identical) to continuous-parsimony approaches like Wagner parsimony (see Kluge and Farris 1969) extended to a rooted tree, without any assumptions of ordered numerical transitions that may not be appropriate for gene expression data. The implementation by Rossnes (2004) is shown by example in Fig. 10.7. A range of values consistent with parsimony or minimum evolution is obtained. The midpoint of this range can be selected if conservatism is desired or an unchanging model of regulatory evolution is anticipated.

Subsequently, looking for branches with significantly different changes in value can be coupled to a traditional reconstruction and branch analysis of the regulatory sequences (upstream regions in the case of gene expression). This can be used to improve the signal-to-noise ratio in identifying sites with important functional roles
[Figure 10.8: tree diagrams not reproduced; the caption appears with the text below. Labels surviving from the discrete-character tree: Sequence 1, ACCT; Sequence 2, ACCT; Sequence 3, AGCT; Sequence 4, ACGG; Sequence 5, ACCT; reconstructed node (ACCT). Labels apparently from an accompanying continuous-character tree: Sequence 4, value 7.8; Sequence 3, value 3.3; Sequence 2, value 4.5.]

[Figure 10.7: tree diagram not reproduced. Surviving tip labels: Sequence 5, value 7.8; Sequence 4, value 4.8; Sequence 3, value 3.3. Surviving node ranges: (4.5–7.8), (3.3–4.8), (4.5–4.8), (4.5–5.0). Surviving node midpoints: 6.15 and 4.65.]
Figure 10.7 For reconstruction of continuous character traits, like relative expression or splice-site usage values, a minimum-evolution distance method has similar properties to a parsimony reconstruction. After going up and then down the tree, the ranges that minimize the total branch length of the tree are indicated next to the nodes. Ultimately midpoint values can be selected if conservatism is desired or if a homogeneous or non-episodic model of evolution is believed. Several measures of along-branch change and significance are possible.
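A minimal sketch of the first ('up') pass of such a range-based reconstruction is given below. It uses the interval rule familiar from Wagner-type parsimony for continuous characters: the intersection of the two child ranges when they overlap, otherwise the gap between them. That rule is an assumption made for illustration and is not the implementation of Rossnes (2004), which the text describes as similar but not identical. The nested-tuple tree, its topology, and all function names are hypothetical; the tip values are merely borrowed from labels in Fig. 10.7, and the second ('down') pass is omitted.

```python
from typing import Tuple, Union

# A tree is either a tip value (float) or a pair of subtrees.
Tree = Union[float, Tuple["Tree", "Tree"]]
Interval = Tuple[float, float]


def combine(a: Interval, b: Interval) -> Interval:
    """Range for a parent node given the ranges of its two children."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    if lo <= hi:
        return (lo, hi)  # ranges overlap: keep their intersection
    return (hi, lo)      # ranges are disjoint: keep the gap between them


def up_pass(tree: Tree) -> Interval:
    """Tips-to-root pass returning the range assigned to the root of `tree`."""
    if isinstance(tree, tuple):
        left, right = tree
        return combine(up_pass(left), up_pass(right))
    return (float(tree), float(tree))


def midpoint(interval: Interval) -> float:
    """Midpoint of a range, the conservative point estimate."""
    return (interval[0] + interval[1]) / 2.0


# Hypothetical topology: two similar tips, then two increasingly distant ones.
toy_tree: Tree = (((4.5, 4.8), 3.3), 7.8)

inner = up_pass((4.5, 4.8))
root = up_pass(toy_tree)
print(inner, midpoint(inner))
print(root, midpoint(root))
```

A full reconstruction would follow this with a root-to-tips pass that narrows or widens each internal range in light of its parent, which is the "up and then down" traversal mentioned in the caption.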

in regulating gene expression or mRNA splicing. Additionally, the reconstructed nodes can be used to address gene expression or relative mRNA abundances of differently spliced variants in ancient organisms. This is shown in Figs 10.7 and 10.8. If one evaluates the absorption maxima of Chang et al. (2002) using this approach, one obtains a value that is less red-shifted than the value obtained from the sequence-based reconstruction. However, a fuller characterization of the statistical properties (including variance estimates) of this approach is required to assess any possible incompatibilities of the two results.

Figure 10.8 For reconstructed ancestral states of a continuous character, an idealized branch with significant change is shown with a dashed line. The discrete molecular characters that are candidates for regulating the process can be reconstructed simultaneously (for example, regions upstream of a gene for gene expression, or intron/exon boundary sequences for alternative mRNA splicing). In this case, one might point to a CT → GG substitution as a candidate for driving the change seen along the dashed branch. This method benefits from an improvement in signal-to-noise ratio, as previously indicated in Fig. 10.2.

On the side of gene expression, Pääbo and coworkers (Enard et al. 2002) have produced an innovative data set with comparative expression levels of the same genes in human, chimpanzee, orangutan, and rhesus macaque. This data set has been utilized generally to detect significant changes in expression of genes expressed in human brain relative to the other primates, compared to the changes in liver-expressed genes (Enard et al. 2002; Gu and Gu 2003). Interspecific data sets like this, motivated by phylogenetic knowledge, will become increasingly common, paving the way for analyses like that described above.

On the mRNA-splicing side, ASD, the Alternative Splicing Database, has been established for comparative genomics purposes (Thanaraj et al. 2004). This database contains alternative-splicing events available from alignments of expressed sequence tag (EST)/cDNA data sets from human, mouse, rat, cow, chicken, zebrafish, fruit fly, nematode, and mustard weed. As data become available from more closely related species, this database will become increasingly amenable to phylogenetic analysis.

Starting with a set of the most closely related species, human and the rodents mouse and rat, Modrek and Lee (2003) showed that the most common splice forms were much more conserved than the alternative-splice forms. The alternative forms were viewed as an opportunity for evolutionary innovation. Extending this work, it was shown that the rare splice variants that were conserved were under strong conservative selective pressure (Resch et al. 2004). This implies that these
splicing patterns have been present in the last common ancestor of the species studied. Extending this analysis phylogenetically as databases like ASD grow in their species coverage will enable a more systematic evolutionary analysis of the role of alternative splicing in cellular biology and regulation. This will ultimately allow a quantifiable view of the role of new splice variants in the generation of evolutionary molecular innovation leading to adaptation.

10.9 Conclusion

Ultimately, evolution of various biological processes including gene content, gene-sequence evolution, gene expression, and mRNA splicing can be collected in a phylogenetic context after mapping onto a species tree. This will enable an understanding of the molecular genomic events underpinning phenotypic evolution. Parsimony will remain a valuable method for pursuing this research goal.

10.10 Acknowledgments

I am grateful to Matthew Betts, Roald Rossnes, Himanshu Ardawatia, and Jessica Liberles for careful reading of this chapter. I also thank Victor Albert for his input and for reading a draft version of the chapter as well.
CHAPTER 11
Dollo parsimony and the reconstruction of genome evolution
Igor B. Rogozin, Yuri I. Wolf, Vladimir N. Babenko and Eugene V. Koonin

11.1 Introduction

The Dollo parsimony method, which was first formalized by J. S. Farris in 1977 (Farris, 1977a, b), is based on the assumption that a complex character that has been lost during evolution of a particular lineage cannot be regained. When applicable, this principle leads to a substantial simplification of evolutionary analysis and provides for unambiguous reconstruction of evolutionary scenarios, which may not be attainable with other methods. In this chapter, we describe applications of Dollo parsimony to the quantitative analysis of the dynamics of genome evolution. Dollo parsimony is the method of choice for reconstructing evolution of the gene repertoire of eukaryotic organisms because, although multiple, independent losses of a gene in different lineages are common, multiple gains of the same gene are improbable. This contrasts with the situation in prokaryotes where the widespread occurrence of horizontal gene transfer makes multiple gains possible, thereby invalidating the Dollo principle. We apply Dollo parsimony to reconstruct the scenario of evolution for the genomes of crown-group eukaryotes by assigning the loss of genes and emergence of new genes to the branches of the phylogenetic tree, and delineate the minimal gene sets for various ancestral forms. A similar analysis, with rather unexpected results, was performed to infer gain vs. loss of introns in conserved eukaryotic genes. We discuss the applicability of the Dollo principle for these and other problems in evolutionary genomics.

11.2 Dollo parsimony for molecular data in the pre-genomic era

Dollo’s Law, also known as the Law of Phylogenetic Irreversibility or the Law of Irreversible Evolution, is an important tenet of evolutionary theory, formulated by the Belgian biologist Louis Dollo (1893). It basically states that organisms cannot re-evolve along lost pathways, but must find alternate routes because the same fortuitous combination of mutational events, being completely random, will never repeat. Dolphins, in other words, will never again walk on land with re-evolved pelvic appendages that derive from their current remnant structures that correspond to legs of land animals. They might, however, evolve walking appendages that derive from other biological provenance, especially if there were some selective advantage to do so, say, if the oceans began to dry up. While some non-Darwinian theorists have attempted to use Dollo’s Law to promote their cause, Dollo was simply seeking to explain convergence of form in diverse species (e.g. sharks, ichthyosaurs, and dolphins): ichthyosaurs and dolphins look so similar because they have converged to the same (hydrodynamically favorable) shape through independent, parallel paths of degradative evolution.

The Dollo parsimony method was first formalized by Farris (1977a). In its simplest form, the algorithm explains the presence of the complex, derived state 1 by allowing only one forward change 0 → 1 (where 0 is the primitive ancestral state) and as many reversions 1 → 0 as are necessary to explain
the observed pattern of states. The method attempts to minimize the required number of 1 → 0 reversions. In molecular studies, Dollo parsimony analysis was often applied for analysis of restriction sites because the loss of an existing restriction site is more probable than a parallel gain of the same site at any particular location (DeBry and Slade 1985).

11.3 Genome evolution

Comparative genomics has already changed our understanding of genome evolution. In what might amount to a paradigm shift in evolutionary biology, genome comparisons have shown that lineage-specific gene loss and horizontal gene transfer (HGT) are not inconsequential freak incidents of evolution but rather extremely common phenomena. To a large degree, these processes have shaped extant genomes, at least those of prokaryotes (Doolittle 1999; Koonin et al. 2000; Gogarten et al. 2002; Snel et al. 2002; Mirkin et al. 2003). The extent of gene loss occurring in certain lineages of prokaryotes, particularly parasites, is astonishing: in some cases, >80% of genes in the genome have been lost over approximately 200 million years of evolution (Moran 2002). HGT is harder to document, but a strong case has been made for its extensive contribution to the evolution of prokaryotes (Ochman et al. 2000; Koonin et al. 2001; Gogarten et al. 2002; Mirkin et al. 2003). Gene exchange between phylogenetically distant eukaryotes does not appear to be an important evolutionary phenomenon. In contrast, the contribution of gene loss to the evolution of eukaryotic genomes was probably substantial, although the level of genome fluidity observed in prokaryotes is unlikely to have been attained in eukaryotic evolution. A comparison of the genomes of two yeasts, Saccharomyces cerevisiae and Schizosaccharomyces pombe, showed that, in the S. cerevisiae lineage, up to 10% of genes have been lost since the divergence of the two species (Aravind et al. 2000). In eukaryotic parasites with small genomes, such as the microsporidia, much more extensive gene elimination appears to have occurred (Katinka et al. 2001). In contrast, the extent of gene loss in complex, multicellular eukaryotes remains unclear, although the small number of unique genes in the human genome when compared to the mouse genome (and vice versa) suggests considerable stability of the gene repertoire in mammals (Waterston et al. 2002).

11.4 Orthologous and paralogous genes

Sequencing of multiple genomes from diverse taxa provides the data required for quantitative analysis of the dynamics of genome evolution. A prerequisite for such studies is a classification of the genes from the sequenced genomes based on homologous relationships. The two principal categories of homologs are orthologs and paralogs (Fitch 1970; Sonnhammer and Koonin 2002; Storm and Sonnhammer 2003). Orthologs are homologous genes that evolved via vertical descent from a single ancestral gene in the last common ancestor of the compared species. Paralogs are homologous genes, which, at some stage of evolution, have evolved by duplication of an ancestral gene. Orthology and paralogy are two sides of the same coin because, when a duplication (or a series of duplications) occurs after the speciation event that separated the compared species, orthology becomes a relationship between sets of paralogs, rather than between individual genes; genes that belong to such orthologous sets are sometimes termed co-orthologs (Sonnhammer and Koonin 2002).

Robust identification of orthologs and paralogs is critical for the construction of evolutionary scenarios, which include, along with vertical inheritance, lineage-specific gene loss and, possibly, HGT (Snel et al. 2002; Mirkin et al. 2003). The algorithms for the construction of these scenarios involve, in one form or another, tracing the fates of individual genes, which is feasible only when orthologous (including co-orthologous) relationships are known. In principle, orthologs, including co-orthologs, should be identified by phylogenetic analysis of entire families of homologous proteins, which is expected to define orthologous protein sets as clades (e.g. Sicheritz-Ponten and Andersson 2001). However, for genome-wide protein sets, such analysis remains
labor-intensive and error-prone. Thus, procedures have been developed for identification of sets of probable orthologs without explicit use of phylogenetic methods. Generally, these approaches are based on the notion of a genome-specific best hit (BeT), i.e. the protein from a target genome that has the greatest sequence similarity to a given protein from the query genome (Tatusov et al. 1997; Huynen and Bork 1998). The central assumption is that orthologs have a greater similarity to each other than to any other protein from the respective genomes due to the conservation of functional constraints. Of course, evolutionary history and sequence similarity can be at odds in some cases, which would invalidate this model (Storm and Sonnhammer 2003). The extent to which this occurs in practice is in fact not known, again owing to the sheer scale of proteome-wide phylogenetic surveys.

When multiple genomes are analyzed using the BeT approach, pairs of probable orthologs detected on the basis of BeTs are combined into orthologous clusters represented in all or a subset of the analyzed genomes (Tatusov et al. 1997; Montague and Hutchison 2000). This approach, amended with procedures for detecting co-orthologous protein sets and for treating multidomain proteins, was implemented in the database of Clusters of Orthologous Groups (COGs) of proteins (Tatusov et al. 1997, 2003). The current COG set includes approximately 70% of the proteins encoded in 69 genomes of prokaryotes and unicellular eukaryotes (Tatusov et al. 2003). The COGs have been extensively employed for genome-wide evolutionary studies, functional annotation of new genomes, and target selection in structural genomics (Koonin and Galperin 2002 and references therein). Recently, we extended the system of orthologous protein clusters to complex, multicellular eukaryotes by constructing clusters of euKaryotic Orthologous Groups (KOGs) for seven sequenced genomes of animals, fungi, microsporidia, and plants; namely humans (Hs), the nematode Caenorhabditis elegans (Ce), the fruit fly Drosophila melanogaster (Dm), two yeasts, S. cerevisiae (Sc) and S. pombe (Sp), the microsporidian Encephalitozoon cuniculi (Ec), and the green plant Arabidopsis thaliana (At) (Tatusov et al. 2003; Koonin et al. 2004).

11.5 Matrices of character presence/absence and Dollo parsimony

A simple but critically important concept that was introduced in the context of the COG analysis is a phyletic (phylogenetic) pattern, which is the pattern of representation (presence/absence) of the analyzed species in each COG (Tatusov et al. 1997; Koonin and Galperin 2002). Similar notions have been independently developed and applied by others (Gaasterland and Ragan 1998; Pellegrini et al. 1999). The COGs show a wide scatter of phyletic patterns, with only a small minority (approximately 1%) represented in all included genomes. Similarity and complementarity among the phyletic patterns of COGs have been successfully employed for prediction of gene functions (Galperin and Koonin 2000; Koonin and Galperin 2002; Myllykallio et al. 2002; Levesque et al. 2003).

Phyletic patterns can be formally represented as strings of ones for presence of a species and zeros for absence of a species (Table 11.1), which can be easily input to a variety of algorithms. The evolutionary parsimony methods are among those that naturally apply to this type of data. Pairs of neighboring genes and intron positions also can be represented as a character matrix and used for parsimony analysis (see discussion below).

A Dollo parsimony tree can be constructed using a matrix of gene (pair of genes/intron) presence/absence and the data-dependent reliability of the tree topology can be assessed in the standard manner using the bootstrap method. The presence vs. absence of a gene in a genome can be naturally treated in terms of character states. The Dollo model is based on the assumption that each derived character state (in this case, the presence of a gene) originates only once, and homoplasies exist only in the form of reversals to the ancestral condition (absence of a gene) in accord with the Dollo principle as formalized by Farris (1977a). In other words, parallel or convergent gains of the derived condition are assumed to be highly unlikely (or impossible, for practical purposes).

The Dollo parsimony principle also can be applied in the opposite direction: assuming a particular species tree topology, a parsimonious scenario of evolution can be constructed. Such a
Table 11.1 Matrix of presence/absence of genes in eukaryotic genomes. For each KOG, the number and the (predicted) protein function are shown. 1 indicates that the given gene (KOG) is represented in the given species and 0 indicates absence of the gene in the given species. Species abbreviations: At, Arabidopsis thaliana; Ce, Caenorhabditis elegans; Dm, Drosophila melanogaster; Ec, Encephalitozoon cuniculi; Hs, Homo sapiens; Sc, Saccharomyces cerevisiae; Sp, Schizosaccharomyces pombe.

Predicted protein functions: KOG2207, 3′-5′ exonuclease; KOG4125, acid trehalase; KOG0006, E3 ubiquitin-protein ligase; KOG0090, signal-recognition particle receptor β-subunit; KOG0050, mRNA splicing factor CDC5.

Species   KOG2207   KOG4125   KOG0006   KOG0090   KOG0050
At        1         0         0         1         1
Ec        0         0         0         0         1
Sc        0         1         0         1         1
Sp        0         0         0         1         1
Ce        1         0         1         1         1
Dm        1         1         1         1         1
Hs        0         1         1         1         1
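Read column by column, Table 11.1 gives one phyletic pattern per KOG. The sketch below encodes those patterns and, for a fixed rooted species tree, places the single gain and the minimal set of losses that the Dollo assumption allows. It is an illustration under stated assumptions and not the procedure of any program used by the authors: the nested-tuple tree (an animal–fungus clade with the microsporidian grouped with the fungi, a coelomate clade, and the plant as outgroup) is one assumed topology, and all function and variable names are hypothetical.

```python
from typing import Dict, List, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]

# Assumed rooted topology: (((Hs,Dm),Ce),((Sc,Sp),Ec)) with At as outgroup.
SPECIES_TREE: Tree = (((("Hs", "Dm"), "Ce"), (("Sc", "Sp"), "Ec")), "At")

# Phyletic patterns transcribed from Table 11.1 (species scored 1).
PATTERNS: Dict[str, Set[str]] = {
    "KOG2207": {"At", "Ce", "Dm"},
    "KOG4125": {"Sc", "Dm", "Hs"},
    "KOG0006": {"Ce", "Dm", "Hs"},
    "KOG0090": {"At", "Sc", "Sp", "Ce", "Dm", "Hs"},
    "KOG0050": {"At", "Ec", "Sc", "Sp", "Ce", "Dm", "Hs"},
}


def leaves(tree: Tree) -> Set[str]:
    """Set of species names below a node."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def dollo_scenario(tree: Tree, present: Set[str]) -> Tuple[Set[str], List[Set[str]]]:
    """Return (clade in which the gene arose, clades on whose stem it was lost)."""
    if not present:
        raise ValueError("the gene must be present in at least one species")
    # Descend to the smallest clade containing every species with the gene;
    # the single allowed gain is placed on the branch leading to that clade.
    node = tree
    while not isinstance(node, str):
        inside = [child for child in node if leaves(child) & present]
        if len(inside) != 1:
            break
        node = inside[0]

    losses: List[Set[str]] = []

    def collect(sub: Tree) -> None:
        tips = leaves(sub)
        if not tips & present:
            losses.append(tips)  # the whole clade lacks the gene: one loss
        elif not isinstance(sub, str):
            for child in sub:
                collect(child)

    collect(node)
    return leaves(node), losses


for kog, pattern in PATTERNS.items():
    origin, lost = dollo_scenario(SPECIES_TREE, pattern)
    print(kog, sorted(origin), [sorted(clade) for clade in lost])
```

When the origin clade is the whole tree, the sketch cannot distinguish a gene inherited from an earlier ancestor from one gained on the root branch; both are reported in the same way.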
scenario is, essentially, a mapping of different types of evolutionary events onto the branches of the phylogenetic tree. Obviously these scenarios, which naturally also include the reconstruction of the character states in all internal nodes of the tree and in the root (when the root position is known), can be of major value for understanding the evolution of the analyzed taxa.

However, in the context of evolutionary genomics, the Dollo principle cannot be assumed to be valid by default. The specific biological context must be examined in order to ascertain whether or not the elementary evolutionary events involved could violate phylogenetic irreversibility by producing homoplastic character gains. Extensive HGT shown to occur during evolution of prokaryotes obviously has the potential to violate the Dollo principle: the ‘same’ gene (more precisely, a member of the same set of orthologous genes; a COG) can be readily regained via HGT after being lost from the given lineage. In this case, it does not matter whether or not the gene regained via HGT comes from the same lineage as the original (lost) gene because, in this type of analysis, a COG is treated as the basic character. Thus, Dollo parsimony cannot be used for reconstructing evolution of prokaryotic genomes; some form of weighted parsimony should be applied instead (Snel et al. 2002; Kunin and Ouzounis 2003; Mirkin et al. 2003). Since the relative weights of different elementary events, i.e. lineage-specific gene loss and HGT, are unknown and probably differ between genes as well as between lineages, evolutionary scenarios produced using these parameters are ambiguous and have to be assessed using external criteria (Mirkin et al. 2003).

With eukaryotes, however, the situation is quite different. Although acquisition of bacterial genes via HGT was an important contribution to the evolution of eukaryotic genomes, at least prior to the emergence of multicellular forms (Doolittle et al. 2003), gene exchange between eukaryotic lineages themselves does not appear to occur at an appreciable rate. Therefore, as far as evolution of eukaryotic genomes is concerned, Dollo parsimony yields unambiguous parsimonious scenarios that include only two types of elementary event: loss of genes (or other characters, such as introns or pairs of genes) and emergence of new characters.

In the studies summarized below, we used the PAUP* (Swofford 2002) and DOLLOP (Felsenstein 1996) programs. An important difference between the implementations of Dollo parsimony in these programs should be noted: PAUP* produces unrooted trees while DOLLOP reconstructs rooted trees (Swofford et al. 1996). Where root inferences are reported, the latter program was used.

11.6 Dollo parsimony tree based on gene content: application to a crucial problem in animal evolution

The relative positions of nematodes, arthropods and chordates in the phylogeny of animals remain
uncertain (for review see Hedges 2002). The traditional tree topology joins arthropods with chordates in a coelomate clade, whereas nematodes, which lack a coelom (a true body cavity), occupy a basal position (e.g. Raff 1996). However, the current leading hypothesis, which is based on phylogenetic trees for 18S rRNA and some additional comparisons of protein-coding genes, joins nematodes with arthropods in a clade of molting animals, Ecdysozoa (Aguinaldo et al. 1997; Giribet et al. 2000; Peterson and Eernisse 2001; Mallatt and Winchell 2002). The complete genome sequences of the nematodes, insects, and vertebrates make it possible to extend phylogenetic studies to the genomic scale, in order to address this major issue in animal evolution (Mushegian et al. 1998; Blair et al. 2002; Wolf et al. 2004).

We addressed this problem both by traditional phylogenetic analysis of numerous sets of orthologous genes and by using patterns of gene presence/absence in orthologous sets for tree construction, which is the most straightforward technique in the category of so-called genome-tree methods (Fitz-Gibbon and House 1999; Snel et al. 1999; Wolf et al. 2002). Presence/absence of a gene in a set of species can be naturally treated as a binary character and the table of such characters can be subjected to either parsimony or distance phylogenetic analysis. After constructing such a character matrix for the complete set of KOGs, we applied the Dollo parsimony method. The rooted tree produced using Dollo parsimony confidently supported the coelomate topology (Fig. 11.1). Otherwise, however, this tree was at odds with the prevalent taxonomic view (Hedges 2002) in that an animal–plant clade, as opposed to an animal–fungus clade with plants as an outgroup, was observed. This deviation of the gene-content trees from the currently accepted phylogeny is probably due to the varying amount of gene loss in different eukaryotic lineages; in particular, massive gene loss in yeasts. As discussed previously in the context of prokaryotic genome analysis, the topology of gene-content trees seems to reflect a combination of the phylogenetic signal and other trends in genome evolution that are not necessarily linked to phylogeny, such as parallel gene loss associated with life-style similarities (Wolf et al. 2002). The clustering of humans with flies in the gene-content tree points to the congruence in gene repertoires of these animals. In this particular case, the majority of phylogenetic trees for orthologous gene sets point in the same direction, suggesting that the coelomate clade could, after all, reflect the actual course of animal evolution (Wolf et al. 2004).

[Figure 11.1: tree diagram not reproduced. Tip labels, top to bottom: Sc, Sp, At, Ce, Dm, Hs. Bootstrap values shown on internal branches: 100%, 100%, 100%, 99%.]
Figure 11.1 Dollo parsimony tree of the eukaryotic crown group based on gene presence/absence. The bootstrap values are indicated for each internal branch. The species abbreviations are as in Table 11.1.

11.7 A parsimonious scenario of gene gain and loss in eukaryotic evolution

As discussed in the previous section, the Dollo parsimony tree based on gene presence/absence shows some conflicts with the accepted phylogenetic tree of the eukaryotic crown group, the principal clades of which have been established with considerable confidence. In particular, some conflicting observations notwithstanding, the consensus of many phylogenetic analyses points to an animal–fungus clade, grouping of microsporidia with the fungi, and a coelomate (chordate–arthropod) clade among the animals (Blair et al. 2002; Hedges 2002; Wolf et al. 2004). Assuming this tree topology and treating the phyletic pattern of each KOG as a string of binary characters (1 for the presence of the given species and 0 for its absence in the given KOG), the parsimonious scenario of gene loss and emergence during the evolution of the eukaryotic crown group was constructed.

In the resulting scenario, each branch was associated with both gene loss and gain of new genes, with the exception of the plant branch and the branch leading to the common ancestor of
[Figure 11.2: tree diagram not reproduced. Values on the terminal branches:]

        Dm      Hs       Ce      Sc      Sp    Ec      At
Gain    3 491   13 688   4 503   1 679   842   586     3 711
Loss    520     162      541     299     202   1 969   –

[Values at internal nodes and branches, as extracted row by row; their placement on the tree is not recoverable: 3 260; 5 361 267; 398 55; 37 5 000 3 048; 1 358 15; 193 3 835 802; 422; –; 3 413.]

Figure 11.2 Scenario of gene gain and loss during evolution of the eukaryotic crown group derived using Dollo parsimony under a fixed tree topology. The numbers in boxes indicate the inferred number of KOGs in the respective ancestral forms. The numbers next to branches indicate the number of gene gains (emergence of KOGs; top) and gene (KOG) losses (bottom) associated with the respective branches; a dash indicates that the number of losses for a given branch could not be determined. Proteins from each genome that did not belong to KOGs as well as LSEs (lineage-specific gene family expansions) were counted as gains on the terminal branches. The species abbreviations are as in Table 11.1.
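The fixed-topology Dollo reconstruction behind Figure 11.2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the six-species tree, the function names, and the five toy phyletic patterns (K1 to K5) are all assumptions made for the example. Each character is gained once, on the branch leading to the smallest clade that contains every species in which it is present, and is then lost on every maximal subclade of that clade in which it is absent.

```python
from collections import Counter

# Assumed topology for illustration only: plant outgroup, fungi + animals,
# and fly grouped with human. Leaves are strings, internal nodes are tuples.
TREE = ("At", (("Sc", "Sp"), ("Ce", ("Dm", "Hs"))))


def leaves(node):
    """Return the frozenset of species names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def dollo_events(tree, present):
    """Dollo reconstruction of one binary character on a fixed tree.

    Returns (origin, lost): the clade (as a leaf set) on whose stem branch
    the character is gained, and the list of clades in which it is lost.
    """
    present = frozenset(present)
    if not present or not present <= leaves(tree):
        raise ValueError("pattern must be non-empty and use species in the tree")
    node = tree
    # Descend while a single child still contains every presence.
    while not isinstance(node, str):
        holders = [child for child in node if present <= leaves(child)]
        if not holders:
            break
        node = holders[0]
    lost = []

    def walk(n):
        below = leaves(n)
        if not (below & present):
            lost.append(below)      # one loss on the branch leading to n
        elif not isinstance(n, str):
            for child in n:
                walk(child)

    walk(node)
    return leaves(node), lost


def scenario(tree, patterns):
    """Tally gains and losses per branch; a branch is named by its clade."""
    gains, losses = Counter(), Counter()
    for present in patterns.values():
        origin, lost = dollo_events(tree, present)
        gains[origin] += 1
        losses.update(lost)
    return gains, losses


def ancestral_count(tree, clade, patterns):
    """Number of characters inferred present in the ancestor of `clade`."""
    clade = frozenset(clade)
    return sum(
        1
        for present in patterns.values()
        if clade <= dollo_events(tree, present)[0] and clade & frozenset(present)
    )


# Hypothetical phyletic patterns (species in which each KOG is present).
PATTERNS = {
    "K1": {"At", "Sc", "Sp", "Ce", "Dm", "Hs"},
    "K2": {"At", "Hs"},
    "K3": {"Ce", "Dm", "Hs"},
    "K4": {"Sc"},
    "K5": {"Sp", "Dm", "Hs"},
}

gains, losses = scenario(TREE, PATTERNS)
```

In this toy data set, K2 is placed at the root because Arabidopsis co-occurs with another species, which is the rule stated in the text for the crown-group ancestor; the price is three inferred losses (in the yeast clade, the worm, and the fly).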
fungi and animals, to which gene losses could not be assigned without an additional outgroup (Fig. 11.2). There is little doubt that, once genomes of early-branching eukaryotes are included, gene loss associated with these branches will become apparent. The principal features of the reconstructed scenario include massive gene loss in the fungal clade, additional elimination of numerous genes in the microsporidia, emergence of a large set of new genes at the onset of the animal clade, and subsequent substantial gene loss in each of the animal lineages, particularly in the nematodes and arthropods (Fig. 11.2). The estimated number of genes lost in S. cerevisiae after its divergence from the common ancestor with the other yeast species, S. pombe, closely agreed with a previous estimate produced using a different approach (Aravind et al. 2000).

The parsimony analysis described above involves explicit reconstruction of the gene sets of ancestral eukaryotic genomes. Under the Dollo parsimony model, an ancestral gene (KOG) set is the union of the KOGs that are shared by the respective outgroup and each of the remaining species. Thus, the gene set for the common ancestor of the crown group includes all the KOGs in which Arabidopsis co-occurs with any of the other analyzed species. Similarly, the reconstructed gene set for the common ancestor of fungi and animals consists of all KOGs in which at least one fungal species co-occurs with at least one animal species. Clearly, these are conservative reconstructions of ancestral gene sets because, as mentioned above, gene losses in the lineages branching off the deepest bifurcation could not be detected. Under this conservative approach, 3 413 genes (KOGs) were assigned to the last common ancestor of the crown group (Fig. 11.2). More realistically, it appears likely that a certain number of ancestral genes have been lost in all or all but one of the analyzed lineages during subsequent evolution, such that the gene set of the eukaryotic crown group ancestor might have been close in size to those of modern yeasts, or even larger (i.e. 6 000–7 000 genes).

11.8 Dollo parsimony applied to evolution of eukaryotic gene structure

Most of the eukaryotic protein-coding genes contain multiple introns that are spliced out of the pre-mRNA by a distinct, large RNA–protein complex, the spliceosome, which is conserved in all eukaryotes (Dacks and Doolittle 2001). The positions of some spliceosomal introns are conserved in orthologous genes from plants and animals (Marchionni and Gilbert 1986; Logsdon et al. 1995; Boudet et al. 2001). A recent systematic analysis of pairwise alignments of homologous
proteins from animals, fungi, and plants suggested that 10–25% of introns are ancient (Fedorov et al. 2002; Rogozin et al. 2003). However, intron densities in different eukaryotic species differ widely, and the location of introns in orthologous genes does not always coincide, even in closely related species (Logsdon et al. 1998). Likely cases of intron insertion and loss have been described (e.g. Rzhetsky et al. 1997; Logsdon et al. 1998), and indications of a high intron-turnover rate have been obtained (Lynch and Richardson 2002). It has been suggested that the proportion of shared intron positions decreases with increasing evolutionary distance and, accordingly, that intron conservation could be a useful phylogenetic marker (Nei and Kumar 2001). However, the evolutionary history of introns and the selective forces that shape intron evolution remain mysterious. Although recent comparisons have revealed the existence of many ancient introns shared by animals, plants, and fungi (Fedorov et al. 2002), the point(s) of origin of these introns in eukaryotic evolution and the relative contributions of intron loss and intron insertion in the evolution of eukaryotic genes remain unknown.

We used the KOG data set for evolutionary analysis of intron–exon structure in eukaryotic genes on a genomic scale. For the purpose of this analysis, orthologs from two additional eukaryotic species, the mosquito Anopheles gambiae (Ag) and the apicomplexan malarial parasite Plasmodium falciparum (Pf), were included in the KOGs using the COGNITOR method (Tatusov et al. 1997). Many of the KOGs contain multiple paralogs from one or more of the constituent species due to lineage-specific duplications; among these paralogs, the one showing the greatest evolutionary conservation (defined as the mean similarity to KOG members from other species) was selected for evolutionary analysis. For a pair of introns to be considered orthologous, they were required to occur in exactly the same position in the aligned sequences of KOG members. Altogether, 684 KOGs were examined for intron conservation; these comprised the great majority, if not the entirety, of highly conserved eukaryotic genes that are amenable to an analysis of the exon–intron structure over the entire span of crown group evolution. The analyzed KOGs contained 21 434 introns in 16 577 unique positions; 5 981 introns were conserved among two or more genomes. Most of the conserved introns were present in only two species, but a considerable number was found in three genomes, and several introns were shared by four to seven species (Table 11.2). A simulation of the intron distribution in the analyzed sample of orthologous gene sets by random shuffling of the intron positions showed that approximately 1% of the observed number of introns shared by two species was expected to occur by chance, whereas none were expected to be shared by three or more species (Table 11.2). It has been proposed that introns insert into coding sequences non-randomly, primarily into ‘proto-splice sites’ (Dibb and Newman 1989). Although the proto-splice model has been questioned as inconsistent with the observed distribution of intron phases (Long and Rosenberg 2000), we considered the potential effect of non-random intron insertion on the apparent evolutionary conservation of intron positions. For this purpose, random simulation was repeated with intron insertion allowed in 10% of the positions in the analyzed genes. Obviously, this led to an increase in the expected number of
Table 11.2 Conservation of intron positions in orthologous gene sets from eight eukaryotic species

Number of introns, by number of species sharing the position

Number of species   1        2       3     4     5     6    7   8
Observed (a)        13 406   2 047   719   275   104   25   1   0
Expected            21 368   33      0     0     0     0    0   0
Expected 10%        20 083   662     8     0     0     0    0   0

(a) The probability that intron sharing in different species was due to chance, P(Monte Carlo) < 0.0001 (this applies both to the analysis of all alignment positions and to the test with 10% of the positions allowed for intron insertion).
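The shuffling test summarized in Table 11.2 can be illustrated with a small Monte Carlo sketch. It is a simplified stand-in for the published simulation, not a reproduction of it: a single alignment is modelled instead of 684 KOGs, and the function names, the `allowed_fraction` parameter (1.0 for unrestricted insertion, 0.1 for the 10% proto-splice test), and the add-one P-value estimator are assumptions of this example.

```python
import random
from collections import Counter


def shuffle_sharing(n_positions, introns_per_species, allowed_fraction=1.0, rng=None):
    """Place each species' introns at random and tabulate position sharing.

    Returns a Counter mapping "number of species sharing a position" to the
    number of such positions. Insertion is restricted to a random subset of
    the alignment containing `allowed_fraction` of its positions.
    """
    rng = rng or random.Random()
    n_allowed = max(1, int(n_positions * allowed_fraction))
    allowed = rng.sample(range(n_positions), n_allowed)
    occupancy = Counter()
    for n_introns in introns_per_species:
        # Within one species, introns occupy distinct positions.
        for pos in rng.sample(allowed, n_introns):
            occupancy[pos] += 1
    return Counter(occupancy.values())


def monte_carlo_p(observed_shared, n_positions, introns_per_species,
                  reps=999, allowed_fraction=1.0, seed=0):
    """Estimate P(shared positions >= observed) under random placement.

    Uses the add-one estimator (hits + 1) / (reps + 1), so the estimate is
    never exactly zero.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        hist = shuffle_sharing(n_positions, introns_per_species, allowed_fraction, rng)
        shared = sum(count for n_species, count in hist.items() if n_species >= 2)
        if shared >= observed_shared:
            hits += 1
    return (hits + 1) / (reps + 1)
```

Lowering `allowed_fraction` concentrates the random placements on fewer sites and therefore raises the number of positions shared by chance, which is the effect the 10% row of the table quantifies.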
shared introns in two or more species, but the excess of introns found in the same position remained substantial and highly statistically significant (Table 11.2). These observations suggest that a substantial majority of introns located in the same position in orthologous genes from different eukaryotic lineages are indeed orthologous, i.e. originate from an ancestral intron in the same position in the respective gene of the last common ancestor of the compared species. Nevertheless, the simulation results under the proto-splice site assumption show that the applicability of Dollo parsimony in this case could be limited. We should note that the magnitude of error introduced by the assumption of irreversible character gain depends directly on the nature of the biological phenomenon involved; in this case, on how specific the proto-splice signal is.

The matrix of shared introns in all pairs of analyzed eukaryotic genomes revealed an unexpected pattern (Table 11.3). The number of conserved introns did not drop monotonically with increasing evolutionary distance among the compared organisms. On the contrary, human genes shared the greatest number of introns with the plant Arabidopsis instead of with any of the other three animals included in the comparison. In the conserved regions (which give more accurate results given the alignment uncertainties in other parts of genes), 24% of the intron positions in the analyzed human genes were shared with Arabidopsis (these comprised approximately 27% of the Arabidopsis introns) compared to approximately 12–17% of intron positions shared by humans with the fly, mosquito, or worm (Table 11.3). The difference becomes even more dramatic when the numbers of introns conserved in Arabidopsis and each of the three animal species are compared: approximately three times more plant introns have a counterpart at the same position in orthologous human genes than in the fly or worm orthologs (Table 11.3). Although the yeast S. pombe and the apicomplexan protist Plasmodium have few introns compared to plants or animals, the same asymmetry was observed for these organisms: the numbers of introns shared with Arabidopsis and humans are close and are two or three times greater than the number of introns shared with the insects or the worm (Table 11.3).

We then examined the evolutionary dynamics of introns in greater detail by using phylogenetic analysis. It should be noted that the comparative data on intron positions lend themselves to representation as a character matrix just as readily as the phyletic pattern data. Specifically, intron positions were represented as a data matrix of intron presence/absence (encoded, as usual, as 1/0). An example of such a matrix for intron positions in one of the conserved gene clusters (KOGs) is shown in Fig. 11.3. To reconstruct the genome-wide scenario of gene-structure evolution, the matrices for all analyzed KOGs were concatenated to produce one alignment, which consisted of 16 577 columns of ones and zeros. The Dollo parsimony tree that we reconstructed from the matrix of intron presence/absence obviously did not mimic the species tree,
Table 11.3 Conservation of intron positions in eukaryotic orthologous gene sets: the matrix of pairwise interspecies comparisons

      Pf    Sc   Sp    At      Ce      Dm      Ag      Hs
Pf    971   2    48    137     50      46      54      145
Sc          46   7     3       3       3       4       6
Sp               839   209     98      114     111     308
At                     5 589   353     255     254     1 148
Ce                             3 465   315     312     948
Dm                                     1 826   787     802
Ag                                             1 768   771
Hs                                                     6 930

The diagonal shows the total number of introns in the 684 analyzed genes from the given species. Species abbreviations: At, Arabidopsis thaliana; Ce, Caenorhabditis elegans; Dm, Drosophila melanogaster; Hs, Homo sapiens; Ag, Anopheles gambiae; Pf, Plasmodium falciparum; Sc, Saccharomyces cerevisiae; Sp, Schizosaccharomyces pombe.
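The asymmetry described in the text can be read directly off Table 11.3. The sketch below stores the upper triangle of the table and expresses each pairwise count as a percentage of one species' intron total (the diagonal). The helper names are assumptions of this example. Because these percentages use all introns in the 684 genes, they are lower than the 24% and 27% quoted in the text, which refer to conserved regions only.

```python
SPECIES = ["Pf", "Sc", "Sp", "At", "Ce", "Dm", "Ag", "Hs"]

# Upper triangle of Table 11.3, row by row, starting at the diagonal.
UPPER = [
    [971, 2, 48, 137, 50, 46, 54, 145],
    [46, 7, 3, 3, 3, 4, 6],
    [839, 209, 98, 114, 111, 308],
    [5589, 353, 255, 254, 1148],
    [3465, 315, 312, 948],
    [1826, 787, 802],
    [1768, 771],
    [6930],
]


def shared(a, b):
    """Introns at identical positions in species a and b (total if a == b)."""
    i, j = sorted((SPECIES.index(a), SPECIES.index(b)))
    return UPPER[i][j - i]


def percent_shared(a, b):
    """Percentage of the introns of species a that have a counterpart in b."""
    return 100.0 * shared(a, b) / shared(a, a)


for partner in ["At", "Ce", "Dm", "Ag"]:
    print(f"Hs introns shared with {partner}: {percent_shared('Hs', partner):.1f}%")
```

The same comparison for Arabidopsis gives the ratios 1 148 / 353 and 1 148 / 255 for human versus worm and human versus fly, in line with the "approximately three times" statement in the text.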
[Figure 11.3: alignment of KOG0473 members. The arrows that mark intron positions on the alignment are not reproduced; the same information is given in the presence/absence matrix below.]

Intron positions: 33, 55, 144, 169, 233

Pf  MSR RTKKVGLTGKYGTRY GSSLRKQIKKIELMQ HAKYLCTFCGKTATK RTCVGIWKCK--KCK RKVCGGAWSLTTPAA VAAKSTIIRLRKQKE EAQKS
At  MTK RTKKARIVGKYGTRY GASLRKQIKKMEVSQ HNKYFCEFCGKYSVK RKVVGIWGCK--DCG KVKAGGAYTMNTASA VTVRSTIRRLREQTE S
Sc  MAK RTKKVGITGKYGVRY GSSLRRQVKKLEIQQ HARYDCSFCGKKTVK RGAAGIWTCS--CCK KTVAGGAYTVSTAAA ATVRSTIRRLREMVE A
Sp  MTK RTKKVGVTGKYGVRY GASLRRDVRKIEVQQ HSRYQCPFCGRLTVK RTAAGIWKCSGKGCS KTLAGGAWTVTTAAA TSARSTIRRLREMVE V
Ce  MAK RTKKVGIVGKYGTRY GASLRKMAKKLEVAQ HSRYTCSFCGKEAMK RKATGIWNCA--KCH KVVAGGAYVYGTVTA ATVRSTIRRLRDLKE
Dm  MAK RTKKVGIVGKYGTRY GASLRKMVKKMEITQ HSKYTCSFCGKDSMK RAVVGIWSCK--RCK RTVAGGAWVYSTTAA ASVRSAVRRLRETKE Q
Ag  MAK RTRKVGIVGKYGTRY GASLRKMVKKMEITQ HAKYTCTFCGKDAMK RSCVGIWSCK--RCN RVVAGGAWVYSTTAA ASVRSAVRRLREM
Hs  MAK RTKKVGIVGKYGTRY GASLRKMVKKIEISQ HAKYTCSFCGKTKMK RRAVGIWHCG--SCM KTVAGGAWTYNTTSA VTVKSAIRRLKELKD Q

      33   55   144   169   233
Pf    1    0    1     0     0
At    0    1    1     0     0
Sc    0    0    0     0     0
Sp    0    0    0     1     0
Ce    0    0    0     0     0
Dm    0    0    1     0     0
Ag    0    0    1     0     0
Hs    0    0    1     0     1

Figure 11.3 Examples of conservation and variability of intron positions in orthologous eukaryotic genes. The data are for KOG0473, ribosomal protein L37. The intron positions are shown directly on the alignment and the conversion of the intron-alignment mapping into a presence/absence matrix is illustrated. 1 indicates the presence of an intron and 0 indicates the absence of an intron in the given alignment position (shown at the top). The species abbreviations are as in Table 11.3.
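The conversion illustrated in Figure 11.3 is easy to express in code. The sketch below rebuilds the presence/absence matrix for KOG0473 from the intron positions recorded per species and shows how per-KOG matrices can be concatenated into one long character matrix, as described in the text. The function names are assumptions of this example; the intron positions are those given in the figure.

```python
# Intron positions per species for KOG0473, as given in Figure 11.3.
KOG0473 = {
    "Pf": {33, 144},
    "At": {55, 144},
    "Sc": set(),
    "Sp": {169},
    "Ce": set(),
    "Dm": {144},
    "Ag": {144},
    "Hs": {144, 233},
}


def presence_matrix(introns_by_species):
    """Return (positions, rows): sorted intron positions and 1/0 rows."""
    positions = sorted(set().union(*introns_by_species.values()))
    rows = {
        species: [int(pos in own) for pos in positions]
        for species, own in introns_by_species.items()
    }
    return positions, rows


def concatenate(row_sets):
    """Join the rows of several per-KOG matrices, species by species."""
    species = list(row_sets[0])
    if any(set(rows) != set(species) for rows in row_sets):
        raise ValueError("every matrix must cover the same species")
    return {sp: [bit for rows in row_sets for bit in rows[sp]] for sp in species}


positions, rows = presence_matrix(KOG0473)
column_sums = [sum(rows[sp][k] for sp in rows) for k in range(len(positions))]
```

Here `column_sums` gives the number of species sharing each position; only position 144 is shared, by five of the eight species.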
with humans and Arabidopsis forming a strongly supported lineage embedded within the metazoan clade, and another anomalous group formed by yeast S. pombe and Plasmodium (Fig. 11.4). The topology of these trees supported the notion, already suggested by the pairwise comparisons summarized in Table 11.3, that ancestral introns have been, to a large extent, conserved in plants and vertebrates, but have been extensively lost in fungi, nematodes, and arthropods. Clustering of Plasmodium with S. pombe, to the exclusion of the other yeast species, S. cerevisiae, seems to be due to the fact that Plasmodium shared approximately as many introns with S. pombe as with worm and insects, but that the total number of introns in S. pombe was substantially lower than in animals (Table 11.3). Thus, conservation of intron positions does not seem to be a good source of information for inferring phylogenetic relationships across long evolutionary distances.

[Figure 11.4: tree diagram not reproduced; terminals Dm, Hs, Ag, At, Ce, Sc, Sp, and Pf, with bootstrap values of 100%, 100%, 100%, 100%, and 99% on the internal branches.]
Figure 11.4 Dollo parsimony tree of the eukaryotic crown group based on comparison of intron positions. The bootstrap values are indicated for each internal branch. The species abbreviations are as in Table 11.3.

Having shown that evolution of introns in eukaryotic genes did not follow the species tree, we applied Dollo parsimony in the opposite direction: given a species tree topology, construct the most-parsimonious scenario for the evolution of gene structure, i.e. the distribution of intron-gain
[Figure 11.5: tree diagram not reproduced. Values on the terminal branches:]

        Dm    Ag    Hs      Ce      Sc    Sp    At      Pf
Loss    342   377   170     1 141   394   4     73      –
Gain    662   639   4 377   2 307   35    438   4 187   761

[Values at internal nodes and branches, as extracted row by row; their placement on the tree is not recoverable: 1 461 1 214; 1 306 244 405 3; 119; 2 523 543; 48; 2 099 731; 34; 1 616 175; –; 1 475 1 265; 210 –; 210.]

Figure 11.5 The parsimonious evolutionary scenario of intron gain/loss for the most likely topology of the eukaryotic phylogenetic tree. Intron gains and losses are mapped to each species and each internal branch. The numbers next to branches indicate the number of intron losses (top) and gains (bottom) associated with the respective branches; dashes show branches for which losses could not be inferred from the available data. The (minimal) number of introns inferred to have existed in the analyzed set of genes in the respective ancestral forms is indicated in a box next to each internal node of the tree. The species abbreviations are as in Table 11.3.
and -loss events among the tree branches. This approach is completely analogous to the construction of the scenario for gene gain and loss described above. The resulting schema suggests an intron-rich ancestor for the crown group, with limited intron loss in the animal ancestor, but massive losses in yeasts (particularly S. cerevisiae), worm, and insects (Fig. 11.5). The differences in the relative rates of intron gain and loss in the terminal branches are remarkable; there is a huge excess of gains over losses in humans and S. pombe, and an equally obvious excess of losses in insects and S. cerevisiae, whereas C. elegans shows nearly identical numbers of gains and losses. All introns shared by Plasmodium and any of the crown-group species (at least 210) are assigned to the last common ancestor of alveolates and the crown group, which lived some 1.5–2.0 billion years ago (Hedges 2002). Thus, a substantial fraction of the introns in extant eukaryotic genomes appear to be inherited from a common ancestor of the crown group and the alveolates, i.e. almost from the very onset of eukaryotic evolution. At present, loss of ancestral introns in Plasmodium cannot be documented because Plasmodium is the outgroup to the crown-group species; neither can losses be assigned to the internal branch that leads to the ancestor of the crown group. Hence we produced a conservative estimate of the number of the most ancient introns in the analyzed gene set, which is likely to be a substantial underestimate given that Plasmodium is a parasite with a highly degraded genome and low intron density. Sequencing and analysis of genomes of other early-branching eukaryotes is expected to substantially increase the number of introns that have survived since the dawn of eukaryotic evolution.

11.9 Dollo parsimony analysis of prokaryotic gene order

As discussed above, Dollo parsimony is hardly applicable to the analysis of evolution of prokaryotic gene repertoires because extensive HGT leads to gross violations of the irreversibility principle. However, it might be possible to come up with (nearly) irreversible, Dollo-compatible characters even in the case of prokaryotic genome evolution. Elements of gene order are perhaps the most obvious candidates for such characters. Genome colinearity is preserved only in closely related bacteria and archaea because rearrangements continuously shuffle prokaryotic genomes, gradually breaking ancestral gene strings. Nevertheless, many operons (groups of co-expressed, functionally linked prokaryotic genes, usually three or four; Jacob et al. 1960) are highly conserved (Mushegian and Koonin 1996; Watanabe et al. 1997; Dandekar et al. 1998). The elementary unit of gene-order
conservation is a contiguous (or tightly linked) gene pair. Independent formation of the same gene pair in several genomes of distantly related prokaryotes is extremely unlikely. The possibility of HGT of operons should be considered more seriously in light of the so-called ‘selfish-operon’ concept, which posits that operons are often transferred between species as single units (Lawrence and Roth 1996; Lawrence 1999). Depending on the strength of the selfish-operon trend, this could lead to significant regaining of lost characters (gene pairs) and, accordingly, to violations of the Dollo assumption. Nevertheless, phylogenetic analysis using conserved gene pairs as characters for genome comparison appears to be an attractive possibility. Due to the (relatively) high rate of intragenomic rearrangements, the gene-order trees are, at least in theory, especially suitable for resolving the phylogeny of closely related species.

We identified pairs of genes (COGs) whose physical proximity is conserved in several genomes. The presence/absence matrices of these pairs were analyzed using Dollo parsimony and neighbor joining, which produced essentially the same topology (Wolf et al. 2001). The results were also very similar to the results of distance-tree analysis of prokaryotic gene orders reported by Korbel and coworkers (Korbel et al. 2002). The resulting tree topology showed a good separation between archaea and bacteria and also reproduced well-established, tight bacterial clades, but had a poor resolution at deep branching points. Furthermore, some of the groups seemed to result from preferential HGT between certain prokaryotic lineages (Wolf et al. 2001). Thus, the effect of HGT could be too significant for Dollo parsimony to be an appropriate method for tree construction in this case.

11.10 Genomics and Dollo parsimony: validity of the Dollo principle for different types of genomic data

Dollo parsimony assumes that each derived character state originates only once, and homoplasies exist only in the form of reversals to the primitive condition. Obviously, this is not an absolute but a probabilistic notion. It is not physically impossible for dolphins to re-evolve feet or for yeast to re-evolve the lost system for post-transcriptional gene silencing (Aravind et al. 2000), but it appears exceedingly unlikely that these features could reappear in the same form as the lost ones, or at least in a form close enough to be considered the same character. A loose thermodynamic analogy seems to fit: movement of a single molecule is completely time-reversible, but the Second Law of Thermodynamics virtually guarantees that any regular configuration of molecules would be irreversibly destroyed by thermal motion. The probabilistic nature of the Dollo law suggests that there would be a continuum of characters differing in their degree of (ir)reversibility. Indeed, regain of a lost gene seems to be impossible, for all practical purposes, barring HGT. By contrast, regain of an ancestral amino acid at a particular site is quite likely, especially given sufficient evolutionary distance separating the compared sequences. The specific problems discussed here cover this entire range. In the case of the eukaryotic gene repertoires, gene loss clearly is irreversible and, accordingly, Dollo parsimony, which simplifies the analysis and allows for more reliable conclusions than other methods of evolutionary reconstruction, is the approach of choice. In the case of intron positions, the Dollo approach can be applied also, but the probability of multiple gains might not be negligible (depending on the biology of the process, which is not yet sufficiently understood), and caution is due. By contrast, Dollo parsimony is not applicable to the study of evolution of prokaryotic genomes due to massive HGT. For the same reason, the attempt to use conserved gene pairs as Dollo characters was not particularly successful either.

With the further growth of genomics and systems biology, the numbers of characters potentially suitable for phylogenetic analysis will continue to grow. Examples of new types of information that are becoming amenable to this type of analysis include various characteristics of gene expression and protein–protein interaction networks. The case studies described here indicate that Dollo parsimony is a useful and potentially powerful methodology of evolutionary genomics, but that its applicability always needs to be gauged against the biology of the specific systems under analysis.
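The gene-pair characters discussed in Section 11.9 can be sketched in the same presence/absence framework. The example below is illustrative only: genomes are treated as linear lists of gene identifiers (circular chromosomes and gene orientation are ignored), the identifiers A to E and the genome names are hypothetical, and the `min_genomes` threshold is an assumed stand-in for "conserved in several genomes".

```python
from collections import Counter


def adjacent_pairs(gene_order):
    """Unordered pairs of neighbouring genes in one (linear) genome."""
    return {
        frozenset(pair)
        for pair in zip(gene_order, gene_order[1:])
        if pair[0] != pair[1]
    }


def pair_matrix(genomes, min_genomes=2):
    """Presence/absence of gene pairs that are adjacent in >= min_genomes genomes.

    Returns (pairs, rows): the retained pairs as sorted tuples, and one 1/0 row
    per genome, suitable as input for a parsimony or distance analysis.
    """
    per_genome = {name: adjacent_pairs(order) for name, order in genomes.items()}
    counts = Counter(pair for pairs in per_genome.values() for pair in pairs)
    kept = sorted(
        (pair for pair, n in counts.items() if n >= min_genomes), key=sorted
    )
    rows = {
        name: [int(pair in pairs) for pair in kept]
        for name, pairs in per_genome.items()
    }
    return [tuple(sorted(pair)) for pair in kept], rows


# Hypothetical gene orders for three genomes.
GENOMES = {
    "g1": ["A", "B", "C", "D"],
    "g2": ["B", "A", "D", "C"],
    "g3": ["C", "B", "A", "E"],
}

pairs, rows = pair_matrix(GENOMES)
```

With these toy genomes, the pairs A–B, B–C, and C–D are retained, and the pairs A–D and A–E, each seen in a single genome, are dropped.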
References
Addario-Berry, L., Chor, B., Hallett, M., Lagergren, J., Panconesi, A. and Wareham, T. (2004). Ancestral maximum likelihood of phylogenetic trees is hard. J. Bioinf. Comp. Biol. 2: 257–271.
Aguinaldo, A.M., Turbeville, J.M., Linford, L.S., Rivera, M.C., Garey, J.R., Raff, R.A. and Lake, J.A. (1997). Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387: 489–493.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, September 2–8, 1971 (eds B.N. Petrov and F. Csaaki), pp. 267–281. Budapest, Akademiai Kiado.
Albert, V.A. and Mishler, B.D. (1992). On the rationale and utility of weighting nucleotide sequence data. Cladistics 8: 73–83.
Albert, V.A., Mishler, B.D. and Chase, M.W. (1992). Character-state weighting for restriction site data in phylogenetic reconstruction, with an example from chloroplast DNA. In Molecular Systematics of Plants (eds P.S. Soltis, D.E. Soltis, and J.J. Doyle), pp. 369–403. New York, Chapman and Hall.
Albert, V.A., Chase, M.W. and Mishler, B.D. (1993). Character-state weighting for cladistic analysis of protein-coding DNA sequences. Ann. Missouri Bot. Gard. 80: 752–766.
Albert, V.A., Backlund, A., Bremer, K., Chase, M.W., Manhart, J.R., Mishler, B.D. and Nixon, K.C. (1994). Functional constraints and rbcL evidence for land plant phylogeny. Ann. Missouri Bot. Gard. 81: 534–567.
Alfaro, M., Zoller, S. and Lutzoni, F. (2003). Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol. Biol. Evol. 20: 255–266.
Altschul, S.F. (1989). Gap costs for multiple alignment. J. Theor. Biol. 138: 297–309.
Aravind, L., Watanabe, H., Lipman, D.J. and Koonin, E.V. (2000). Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc. Natl. Acad. Sci. USA 97: 11319–11324.
Ariew, A. (1998). Are probabilities necessary for evolutionary explanations? Biol. Philos. 13: 245–253.
Arvestad, L., Berglund, A.C., Lagergren, J. and Sennblad, B. (2003). Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19 Suppl. 1: i7–i15.
Avise, J.C. (1989). Gene trees and organismal histories: a phylogenetic approach to population biology. Evolution 43: 1192–1208.
Avise, J.C. (2000). Phylogeography: The History and Formation of Species. Cambridge, MA, Harvard University Press.
Bach, E. (1981). On time, tense and aspect: An essay in English metaphysics. In Radical Pragmatics (ed. P. Cole), pp. 63–81. New York, Academic Press.
Baker, A. (2003). Quantitative parsimony and explanatory power. Br. J. Phil. Sci. 54: 245–259.
Bandelt, H.J., Forster, P., Sykes, B.C. and Richards, M.B. (1995). Mitochondrial portraits of human populations using median networks. Genetics 141: 743–753.
Barnes, E.C. (2000). Ockham’s Razor and the anti-superfluity principle. Erkenntnis 53: 353–374.
Barrett, M., Donoghue, M.J. and Sober, E. (1991). Against consensus. Syst. Zool. 40: 486–493.
Barry, D. and Hartigan, J.A. (1987). Statistical analysis of hominoid molecular evolution. Stat. Sci. 2: 191–210.
Benner, S.A., Trabesinger, N. and Scheiber, D. (1998). Post-genomic science: Converting primary sequence into physiological function. Adv. Enzyme Regul. 38: 155–190.
Benner, S.A., Chamberlin, S.G., Liberles, D.A., Govindarajan, S. and Knecht, L. (2000). Functional inferences from reconstructed evolutionary biology involving rectified databases – an evolutionarily grounded approach to functional genomics. Res. Microbiol. 151: 97–106.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. (2004). GenBank: update. Nucleic Acids Res. 32: D23–D26.
Bi, S., Garilova, O., Gong, D.W., Mason, M.M. and Reitman, M. (1997). Identification of a placental enhancer for the human leptin gene. J. Biol. Chem. 272: 30583–30588.
Blackburn, D.G., (1984). From whale toes to snake eyes: Bryant, D. (2003). A classification of consensus methods
Comments on the reversibility of evolution. Syst. Zool. for phylogenetics. In Bioconsensus (eds Janowitz, M.F.,
33: 241–245. Lapointe, F.J., McMorris, F.R., Mirkin, B. and
Blair, J.E., Ikeo, K., Gojobori, T. and Hedges, S.B. (2002). The Roberts, F.S.), pp. 163–184. DIMACS Series in Discrete
evolutionary position of nematodes. BMC Evol. Biol. 2: 7. Mathematics and Theoretical Computer Science, vol. 61,
Bock, W.J. (1973). Philosophical foundations of classical Providence, RI, American Mathematical Society.
evolutionary classification. Syst. Zool. 22: 375–392. Buneman, P. (1971). The recovery of trees from measures
Boudet, N., Aubourg, S., Toffano-Nioche, C., Kreis, M. and of dissimilarity. In Mathematics in the Archaeological and
Lecharny, A. (2001). Evolution of intron/exon structure Historical Sciences (eds Hodson, F.R., Kendall, D.G.
of DEAD helicase family genes in Arabidopsis, Caenor- and Tautu, P.), pp. 387–395. Edinburgh, Edinburgh
habditis, and Drosophila. Genome Res. 11: 2101–2114. University Press.
Boyd, R. (1991). Confirmation, semantics, and the inter- Burnham, K.P. and Anderson, D.R. (1998). Model Selection
pretation of scientific theories. In The Philosophy of Science and Inference. A Practical Information-theoretic Approach.
(eds R. Boyd, P. Gasper and J.D. Trout), pp. 3–35. New York, Springer.
Cambridge, MA, MIT Press. Cameron, H.D. (1987). The upside-down cladogram:
Brady, R.H. (1985). On the independence of systematics. Problems in manuscript affiliation. In Biological
Cladistics 1: 113–126. Metaphor and Cladistic Classification: An Interdisciplinary
Brady, R.H. (1994). Explanation, description, and the Perspective (eds H.M. Hoenigswald and L.F. Wiener),
meaning of transformation in taxonomic evidence. In pp. 227–242. Philadelphia, PA, University of Pennsyl-
Models in Phylogeny Reconstruction (eds R.W. Scotland, vania Press.
D.J. Siebert and D.M. Williams), pp. 11–29 special vol. Camin, J.H. and Sokal, R.R. (1965). A method for dedu-
52, Syst. Assoc. Oxford, Clarendon Press. cing branching sequences in phylogeny. Evolution 19:
Bremer, K. (1988). The limits of amino acid sequence data 311–326.
in angiosperm phylogenetic reconstruction. Evolution Carillo, H. and Lipman, D. (1988). The multiple sequence
42: 795–803. alignment problem in biology. SIAM J. Appl. Math. 48:
Bremer, K. (1994). Branch support and tree stability. 1073–1082.
Cladistics 10: 295–304. Caroll, L. (1872). Through the Looking Glass. London,
Brooks, D.R. (1981). Hennig’s parasitological method: Macmillan.
A proposed solution. Syst. Zool. 30: 229–249. Carpenter, J.M. (2003). On ‘‘Molecular phylogeny of
Brooks, D.R. (1988). Scaling effects in historical biogeo- Vespidae (Hymenoptera) and the evolution of sociality
graphy: A new view of space, time, and form. Syst. in wasps’’. Am. Museum Novitates 3389: 1–20.
Zool. 37: 237–244. Carter, M., Hendy, M.D., Penny, D., Székely, L.A. and
Brower, A.V.Z. (2000). Homology and the inference of Wormald, N.C. (1990). On the distribution of lengths of
systematic relationships: Some historical and philoso- evolutionary trees. SIAM J. Discrete Math. 3: 38–47.
phical perspectives. In Homology and Systematics: Coding Cartmill, M. (1981). Hypothesis testing and phylogenetic
Characters for Phylogenetic Analysis (eds R. Scotland and reconstruction. Z. Zool. Syst. Evol.-forsch. 19: 73–96.
R.T. Pennington), pp. 10–21. New York, Taylor and Chang, B.S.W., Jönsson, K., Kazmi, M.A., Donoghue, M.J.
Francis. and Sakmar, T.P. (2002). Recreating a functional
Brudno, M., Chapman, M., Göttgens, B., Batzoglou, S. ancestral archosaur visual pigment. Mol. Biol. Evol. 19:
and Morgenstern, B. (2003). Fast and sensitive multiple 1483–1489.
alignment of large genomic sequences. BMC Bioinfor- Chang, J. (1996). Full reconstruction of Markov models on
matics 4: 66. evolutionary trees: Identifiability and consistency.
Brudno, M., Poliakov, A., Salamov, A., Cooper, G.M., Math. Biosc. 137: 51–73.
Sidow, A., Rubin, E.M., Solovyev, V., Batzoglou, S. and Chang, J.T. and Kim, J. (1996). The measurement of
Dubchak, I. (2004). Automated whole-genome multiple homoplasy: A stochastic view. In Homoplasy: the
alignment of rat, mouse, and human. Genome Res. 14: Recurrence of Similarity in Evolution (eds M.J. Sanderson
685–692. and L. Hufford), pp. 189–303. Academic Press.
Bryant, D. (2000). A lower bound for the breakpoint Chase, M.W., Soltis, D.E., Olmstead, R.G., Morgan, D.,
phylogeny problem. In Proceedings of the 11th Annual Les, D.H., Mishler, B.D., Duvall, M.R., Price, R.A.,
Symposium on Combinatorial Pattern Matching (eds Hills, H.G., Qiu, Y.-L. et al. (1993). Phylogenetics of seed
R. Giancarlo and D. Sankoff), pp. 235–247. London, plants: An analysis of nucleotide sequences from the
Springer Verlag. plastid gene rbcL. Ann. Missouri Bot. Gard. 80: 528–580.
Crichton, M. (1990). Jurassic Park. New York, Ballantine De Laet, J. (2003). Parsimony algorithms for characters
Books. that are inapplicable in some terminals (Abstract, 21st
Crick, F. (1968). The origin of the genetic code. J. Mol. annual meeting of the Willi Hennig Society, Helsinki
Biol. 38: 367–379. 2002). Cladistics 19: 151.
Crisci, J.V. and Stuessy, T.F. (1980). Determining pri- De Laet, J. (2004). When one and one is not two: Parsi-
mitive character states for phylogenetic reconstruction. mony analysis of sequence data (Abstract, 22nd annual
Syst. Bot. 5: 112–135. meeting of the Willi Hennig Society, New York 2003).
Cummings, M., Handley, S., Myers, D., Reed, D., Rokas, A. Cladistics 20: 81.
and Winka, K. (2003). Comparing bootstrap and De Laet, J. and Smets, E. (1998). On the three-taxon
posterior probability values in the four-taxon case. approach to parsimony analysis. Cladistics 14: 363–381.
Syst. Biol. 52: 477–487. De Laet, J. and Wheeler, W. (2003). POY version 3.0.11
Dacks, J.B. and Doolittle, W.F. (2001). Reconstructing/ (Wheeler, Gladstein and De Laet, May 6 2003). Com-
deconstructing the earliest eukaryotes: How compara- mand line documentation. Available at ftp://ftp.amnh.
tive genomics can help. Cell 107: 419–425. org/pub/molecular/poy.
Dandekar, T., Snel, B., Huynen, M. and Bork, P. (1998). de Pinna, M. C. C. (1991). Concepts and tests of homo-
Conservation of gene order: A fingerprint of proteins logy in the cladistic paradigm. Cladistics 7: 367–394.
that physically interact. Trends Biochem. Sci. 23: 324–328. de Queiroz, K. (1992). Phylogenetic definitions and
Darwin, C. (1859). The Origin of Species by Means of Natural taxonomic philosophy. Biol. Philos. 7: 295–313.
Selection, or the Preservation of Favoured Races in the de Queiroz, K. (1996). Including the characters of interest
Struggle for Life. London, John Murray [1964. Facsimile of during tree reconstruction and the problems of cir-
1st edition. Cambridge, MA, Harvard University Press] cularity and bias in studies of character evolution.
Davids, W., Gamieldien, J., Liberles, D.A. and Hide, W. Am. Nat. 148: 700–708.
(2002). Positive selection scanning reveals decoupling de Queiroz, K. and Poe, S. (2001). Philosophy and
of enzymatic activities of carbamoyl phosphate phylogenetic inference: A comparison of likelihood
synthetase in H. pylori. J. Mol. Evol. 54: 458–464. and parsimony methods in the context of Karl
Davidson, D. (1991). On the individuation of events. Popper’s writings on corroboration. Syst. Biol. 50:
Synthese 86: 229–254. 305–321.
Davis, J.I. and Nixon, K.C. (1992). Populations, genetic de Queiroz, K. and Poe, S. (2003). Failed refutations:
variation, and the delimitation of phylogenetic species. Further comments on parsimony and likelihood
Syst. Biol. 41: 421–435. methods and their relationship to Popper’s degree of
Davis, J.I., Stevenson, D.W., Petersen, G., Seberg, O., corroboration. Syst. Biol. 52: 352–367.
Campbell, L.M., Freudenstein, J.V., Goldman, D.H., Dezulian, T. and Steel, M. (2004). Phylogenetic closure
Hardy, C.R., Michelangeli, F.A., Simmons, M.P. et al. operations and homoplasy-free evolution. In Classi-
(2004). A phylogeny of the monocots, as inferred from fication, Clustering, and Data Mining Applications (Pro-
rbcL and atpA sequence variation, and a comparison of ceedings of the meeting of the International Federation
methods for calculating jackknife and bootstrap values. of Classification Societies (IFCS) 2004). (eds D. Banks,
Syst. Bot. 29: 467–510. L. House, F.R. McMorris, P. Arabie and W. Gaul),
Dayhoff, M.O. and Eck, R.V. (1968). Atlas of Protein pp. 395–416. Springer-Verlag, Berlin.
Sequence and Structure. 1967–68. Silver Spring, MD, Dibb, N.J. and Newman, A.J. (1989). Evidence that introns
National Biomed. Res. Foundation. arose at proto-splice sites. EMBO J. 8: 2015–2021.
Dayhoff, M.O. and Park, C.M. (1969). Cytochrome C: Dollo, L. (1893). Les lois de l’évolution. Bull. Soc. Belge
Building a phylogenetic hypothesis. In Atlas of Protein Geol. Paleontol. d’Hydrol. 7: 164–167.
Sequence and Structure. 1969 (ed. M.O. Dayhoff), Donoghue, M.J. and Doyle, J.A. (1989). Phylogenetic
pp. 7–16 vol. 4. Silver Spring, MD, National Biomed. analysis of angiosperms and the relationships of
Res. Foundation. Hamamelidae. In Evolution, Systematics and Fossil
DeBry, R.W. and Slade, N.A. (1985). Cladistic analysis History of the Hamamelidae, vol.1 (eds P. Crane and
of restriction endonuclease cleavage maps within a S. Blackmore), pp. 17–45. Oxford, Clarendon Press.
maximum-likelihood framework. Syst. Zool. 34: 21–34. Donoghue, M.J. and Sanderson, M.J. (1992). The suit-
De Laet, J. (1997). A Reconsideration of Three-Item Analysis, ability of molecular and morphological evidence in
the Use of Implied Weights in Cladistics, and a Practical reconstructing plant phylogeny. In Molecular Systematics
Application in Gentianaceae. PhD Thesis, University of of Plants. (eds P.S. Soltis, D.E. Soltis and J.J. Doyle),
Leuven, Belgium. pp. 340–368, New York, Chapman & Hall.
Donoghue, M.J., Doyle, J.A., Gauthier, J.A., Kluge, A.G. Farris, J.S. (1969). A successive approximations approach
and Rowe, T. (1989). The importance of fossils in phy- to character weighting. Syst. Zool. 18: 374–385.
logeny reconstruction. Annu. Rev. Ecol. Syst. 20: 431–460. Farris, J.S. (1970). Methods for computing Wagner trees.
Doolittle, W.F. (1999). Lateral genomics. Trends Cell. Biol. Syst. Zool. 19: 83–92.
9: M5–M8. Farris, J.S. (1970). Methods for computing Wagner trees.
Doolittle, W.F., Boucher, Y., Nesbo, C.L., Douady, C.J., distance matrices. Am. Nat. 106: 645–668.
Andersson, J.O. and Roger, A.J. (2003). How big is the Farris, J.S. (1973a). On the use of the parsimony criter-
iceberg of which organellar genes in nuclear genomes are ion for inferring phylogenetic trees. Syst. Zool. 22:
but the tip? Phil. Trans. R. Soc. Lond. B Biol. Sci. 358: 39–57. 250–256.
Duret, L., Mouchiroud, D. and Gouy, M. (1994). Farris, J.S. (1973b). A probability model for inferring
HOVERGEN: A database of homologous vertebrate evolutionary trees. Syst. Zool. 22: 250–256.
genes. Nucleic Acids Res. 22: 2360–2365. Farris, J.S. (1977a). Phylogenetic analysis under Dollo’s
Edwards, A.W.F. (1972). Likelihood. Cambridge, law. Syst. Zool. 26: 77–88.
Cambridge University Press. Farris, J.S. (1977b). Some further comments on Le
Edwards, A.W.F. and Cavalli-Sforza, L.L. (1963). The Quesne’s methods. Syst. Zool. 26: 220–223.
reconstruction of evolution. Heredity 18: 553. Farris, J.S. (1978a). Inferring phylogenetic trees from
Edwards, A.W.F. and Cavalli-Sforza, L.L. (1964). chromosome inversion data. Syst. Zool. 27: 275–284.
Reconstruction of evolutionary trees. In Phenetic and Farris, J.S. (1978b). Wagner78. Published by the author.
Phylogenetic Classification (eds V.H. Heywood and Farris, J.S. (1979). The information content of the phylo-
J. McNeill), pp. 67–76 no. 6. London, Syst. Assoc. Publ. genetic system. Syst. Zool. 28: 483–519.
Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. Farris, J.S. (1982a). Outgroups and parsimony. Syst. Zool.
(1998). Cluster analysis and display of genome-wide 31: 328–334.
expression patterns. Proc. Natl. Acad. Sci. USA 95: Farris, J.S. (1982b). Simplicity and informativeness in
14863–14868. systematics and phylogeny. Syst. Zool. 31: 413–444.
Enard, W., Khaitovich, P., Klose, J., Zöllner, S., Heissig, F., Farris, J.S. (1983). The logical basis of phylogenetic
Giavalisco, P., Nieselt-Struwe, K., Muchmore, E., analysis. In Advances in Cladistics, volume 2: Proceedings
Varki, A., Ravid, R. et al. (2002). Intra and interspecific of the Second Meeting of the Willi Hennig Society. (eds
variation in primate gene expression patterns. Science N. Platnick and V. Funk) pp. 7–36. New York,
296: 340–343. Columbia University Press.
Endo, T., Ikeo, K. and Gojobori, T. (1996). Large-scale Farris, J.S. (1983/1994). The logical basis of phylogenetic
search for genes on which positive selection may analysis. In Advances in Cladistics — Proceedings of the
operate. Mol. Biol. Evol. 13: 685–690. 2nd Annual Meeting of the Willi Hennig Society
Endress, P.K. (1994). Diversity and Evolutionary Biology of (eds N. Platnick and V. Funk). New York, Columbia
Tropical Flowers. Cambridge, Cambridge University University Press. Abridged and reprinted in E. Sober
Press. (ed.) pp. 7–36. Conceptual Issues in Evolutionary Biology,
Erdős, P.L. and Székely, L.A. (1992). Evolutionary trees: Cambridge, MA, MIT Press, 1994 (page references to
An integer multicommodity maxflow–min-cut theorem. the latter).
Adv. Appl. Math. 13: 375–389. Farris, J.S. (1986). On the boundaries of phylogenetic
Erdős, P.L. and Székely, L.A. (1993). Counting bichromatic systematics. Cladistics 2: 14–27.
evolutionary trees. Discrete Appl. Math. 47: 1–8. Farris, J.S. (1988). Hennig86. Published by the author,
Erdős, P.L., Steel, M.A., Székely, L.A. and Warnow, T. Port Jefferson Station, New York.
(1999). A few logs suffice to build (almost) all trees Farris, J.S. (1989a). The retention index and the rescaled
(Part 1). Random Struct Algorithms 14: 153–184. consistency index. Cladistics 5: 417–419.
Excoffier, L. and Smouse, P.E. (1994). Using allele Farris, J.S. (1989b). Entropy and fruit flies. Cladistics 5:
frequencies and geographic subdivision to reconstruct 103–108.
gene trees within a species: Molecular variance parsi- Farris, J.S. (1991). Hennig defined paraphyly. Cladistics 7:
mony. Genetics 136: 343–359. 297–304.
Farris, J.S. (1966). Estimation of conservatism of characters Farris, J.S. (1997). Cycles. Cladistics 13: 131–144.
by constancy within biological populations. Evolution Farris, J.S. (1999). Likelihood and inconsistency. Cladistics
20: 319–334. 15: 199–204.
Farris, J.S. (1967). The meaning of relationship and taxo- Farris, J.S. (2001). Support weighting. Cladistics 17:
nomic procedure. Syst. Zool. 16: 44–51. 389–394.
Farris, J.S. and Kluge, A.G. (1986). Synapomorphy, Felsenstein, J. (1996). Inferring phylogenies from protein
parsimony, and evidence. Taxon 35: 298–315. sequences by parsimony, distance, and likelihood
Farris, J.S., Kluge, A.G. and Eckardt, M.J. (1970). methods. Methods Enzymol. 266: 418–427.
A numerical approach to phylogenetic systematics. Felsenstein, J. (2004). Inferring Phylogenies. Sunderland,
Syst. Zool. 19: 172–189. MA, Sinauer Associates.
Farris, J.S., Källersjö, M., Albert, V.A., Allard, M., Felsenstein, J. and Sober, E. (1987). Parsimony and like-
Anderberg, A., Bowditch, B., Bult, C., Carpenter, J.M., lihood: An exchange. Syst. Zool. 35: 617–626.
Crowe, T.M., De Laet, J. et al. (1995). Explanation. Feng, D.F. and Doolittle, R.F. (1987). Progressive sequence
Cladistics 11: 211–218. alignment as a prerequisite to correct phylogenetic
Farris, J.S., Albert, V.A., Källersjö, M., Lipscomb, D. and trees. J. Mol. Evol. 25: 351–360.
Kluge, A.G. (1996). Parsimony jackknifing outperforms Fink, W.L. (1982). The conceptual relationship
neighbor-joining. Cladistics 12: 99–124. between ontogeny and phylogeny. Paleobiology 8:
Farris, J.S., Källersjö, M. and De Laet, J.E. (2001a). Branch 254–264.
lengths do not indicate support — even in maximum Fitch, W.M. (1970). Distinguishing homologous from
likelihood. Cladistics 17: 298–299. analogous proteins. Syst. Zool. 19: 99–113.
Farris, J.S., Kluge, A.G. and De Laet, J.E. (2001b). Taxic Fitch, W.M. (1971). Toward defining the course of
revisions. Cladistics 17: 79–103. evolution: Minimal change for a specific tree topology.
Fedorov, A., Merican, A.F. and Gilbert, W. (2002). Large- Syst. Zool. 20: 406–416.
scale comparison of intron positions among animal, Fitz-Gibbon, S.T. and House, C.H. (1999). Whole genome-
plant, and fungal genes. Proc. Natl. Acad. Sci. USA 99: based phylogenetic analysis of free-living micro-
16128–16133. organisms. Nucleic Acids Res. 27: 4218–4222.
Felsenstein, J. (1968). Statistical Inference and the Estimation Force, A., Lynch, M., Pickett, F.B., Amores, A., Yan, Y.L.
of Phylogenies. PhD Thesis, University of Chicago, and Postlethwait, J. (1999). Preservation of duplicate
Chicago, IL. genes by complementary, degenerative mutations.
Felsenstein, J. (1973). Maximum likelihood and minimum- Genetics 151: 1531–1545.
steps methods for estimating evolutionary trees from Forster, M.R. (2000). Key concepts in model selection:
data on discrete characters. Syst. Zool. 22: 240–249. Performance and generalizability. J. Math. Psych. 44:
Felsenstein, J. (1978a). Cases in which parsimony and 205–231.
compatibility methods can be positively misleading. Forster, M.R. and Sober, E. (1994). How to tell when
Syst. Zool. 27: 401–410. simpler, more unified, or less ad hoc theories will
Felsenstein, J. (1978b). The number of evolutionary trees. provide more accurate predictions. Br. J. Phil. Sci. 45:
Syst. Zool. 27: 27–33. 1–35.
Felsenstein, J. (1979). Alternative methods of phylo- Foulds, L.R. (1984). Maximum savings in the Steiner
genetic inference and their interrelationship. Syst. Zool. problem in phylogeny. J. Theoret. Biol. 107: 471–474.
28: 49–62. Foulds, L.R. and Graham, R.L. (1982). The Steiner
Felsenstein, J. (1981a). Evolutionary trees from DNA problem in phylogeny is NP-complete. Adv. Appl. Math.
sequences: A maximum likelihood approach. J. Mol. 3: 43–49.
Evol. 17: 368–376. Friedman, M. (1983). Foundations of Space-Time Theories:
Felsenstein, J. (1981b). Evolutionary trees from gene Relativistic Physics and Philosophy of Science. Princeton,
frequencies and quantitative characters: Finding max- NJ, Princeton University Press.
imum likelihood estimates. Evolution 35: 1229–1242. Fredman, M.L. (1984). Algorithms for computing evolu-
Felsenstein, J. (1981c). A likelihood approach to character tionary similarity measures with length independent
weighting and what it tells us about parsimony and gap penalties. Bull. Math. Biol. 46: 545–563.
compatibility. Biol. J. Linnean Soc. 16: 183–196. Freudenstein, J.V., Pickett, K.M., Simmons, M.P. and
Felsenstein, J. (1982). Numerical methods for inferring Wenzel, J.W. (2003). From basepairs to birdsongs:
evolutionary trees. Q. Rev. Biol. 57: 379–404. phylogenetic data in the age of genomics. Cladistics
Felsenstein, J. (1983). Methods for inferring phylo- 19: 333–347.
genies: A statistical view. In Numerical Taxonomy (ed. Frost, D.R. (2000). Species, descriptive efficiency, and
J. Felsenstein), pp. 315–334. Berlin, Springer-Verlag. progress in systematics. In The Biology of Plethodontid
Felsenstein, J. (1988). Phylogenies from molecular Salamanders (eds R.C. Bruce, R.J. Jaeger, and
sequences: Inference and reliability. Annu. Rev. Genet. L.D. Houck), pp. 7–29. New York, Kluwer Academic/
22: 521–565. Plenum Publishing.
Frost, D.R. and Kluge, A.G. (1994). A consideration of Gogarten, J.P., Doolittle, W.F. and Lawrence, J.G. (2002).
epistemology in systematic biology, with special Prokaryotic evolution in light of gene transfer. Mol.
reference to species. Cladistics 10: 259–294. Biol. Evol. 19: 2226–2238.
Frost, D.R., Rodrigues, M.T., Grant, T. and Titus, T.A. Goldman, N. (1990). Maximum likelihood inference of
(2001). Phylogenetics of the lizard genus Tropidurus phylogenetic trees, with special reference to a Poisson
(Squamata: Tropiduridae: Tropidurinae): Direct opti- process model of DNA substitution and to parsimony
mization, descriptive efficiency, and sensitivity analysis analyses. Syst. Zool. 39: 345–361.
of congruence between molecular data and morpho- Goloboff, P.A. (1995). A revision of the south American
logy. Mol. Phylogenet. Evol. 21: 352–371. spiders of the family Nemesiidae (Araneae, Mygalo-
Fukami-Kobayashi, K., Schreiber, D.R. and Benner, S.A. morphae). Part I: species from Peru, Chile, Argen-
(2002). Detecting compensatory covariation signals tina, and Uruguay. Bull. Am. Mus. Nat. Hist. 224:
in protein evolution using reconstructed ancestral 1–189.
sequences. J. Mol. Biol. 319: 729–743. Goloboff, P.A. (1993a). Estimating character weights
Funk, V.A. and Brooks, D.R. (1990). Phylogenetic Syste- during tree search. Cladistics 9: 83–91.
matics as the Basis of Comparative Biology. Washington, Goloboff, P.A. (1993b). Nona: a tree-searching program.
DC, Smithsonian Institution Press. Available at http://www.zmuc.dk/public/phylogeny/
Gaasterland, T. and Ragan, M.A. (1998). Microbial Nona-PeeWee/.
genescapes: Phyletic and functional patterns of Goloboff, P.A. (1993c). Pee-Wee: Parsimony and Implied
ORF distribution among prokaryotes. Microb. Comp. weights. Available at http://www.zmuc.dk/public/
Genomics 3: 199–217. phylogeny/Nona-PeeWee/.
Gallut, C. and Barriel, V. (2002). Cladistic coding of Goloboff, P.A. (1994). Character optimization and
genomic maps. Cladistics 18: 526–536. calculation of tree lengths. Cladistics 9: 433–436.
Galperin, M.Y. and Koonin, E.V. (2000). Who’s your Goloboff, P.A. (1995). SPA: Sankoff Parsimony Analysis,
neighbor? New computational approaches for func- ver. 1.1. Available at http://www.zmuc.dk/public/
tional genomics. Nat. Biotechnol. 18: 609–613. phylogeny/Nona-PeeWee/.
Gee, H. (2000). Deep Time: Cladistics, the Revolution in Goloboff, P.A. (1996a). PHAST: Phylogenetic Analysis for
Evolution. London, Fourth Estate. Sankovian Transformations, ver. 1.1. Available at http://
Ghiselin, M.T. (1966). On semantic pitfalls of biological www.zmuc.dk/public/phylogeny/Nona-PeeWee/.
adaptation. Philos. Sci. 33: 147–153. Goloboff, P.A. (1996b). Methods for faster parsimony
Ghiselin, M.T. (2004). Mayr and Bock versus Darwin on analysis. Cladistics 12: 199–220.
genealogical classification. J. Zool. Syst. Evol. Res. 42: Goloboff, P.A. (1998b). Tree searches under Sankoff
165–169. parsimony. Cladistics 14: 229–238.
Giribet, G. (2002). Relationships among metazoan phyla Goloboff, P.A. (1999). Analyzing large data sets in
as inferred from 18S rRNA sequence data: A methodo- reasonable times: Solutions for composite optima.
logical approach. In Molecular Systematics and Evolution: Cladistics 15: 415–428.
Theory and Practice, (eds R. DeSalle, G. Giribet, and Goloboff, P.A. (2003). Parsimony, likelihood, and sim-
W. Wheeler), pp. 85–101. Basel, Birkhäuser Verlag. plicity. Cladistics 19: 91–103.
Giribet, G. and Wheeler, W.C. (1999). On gaps. Mol. Goloboff, P.A. and Farris, J.S. (2001). Methods for quick
Phylogenet. Evol. 13: 132–143. consensus estimation. Cladistics 17: S26–S34.
Giribet, G., Distel, D.L., Polz, M., Sterrer, W. and Goloboff, P.A., Wheeler, W. and Pol, D. (2003a). Parallel
Wheeler, W.C. (2000). Triploblastic relationships with TNT. Cladistics 19: 152 (in Muona, J. (2003). Abstracts of
emphasis on the acoelomates and the position of the 21st annual meeting of the Willi Hennig society.
Gnathostomulida, Cycliophora, Plathelminthes, and Cladistics 19: 148–163.)
Chaetognatha: A combined approach of 18S rDNA Goloboff, P., Farris, J., Källersjö, M., Oxelman, B.,
sequences and morphology. Syst. Biol. 49: 539–562. Ramírez, M. and Szumik, C. (2003b). Improvements to
Giribet, G., Wheeler, W.C. and Muona, J. (2002). DNA resampling measures of group support. Cladistics 19:
multiple sequence alignments. In Molecular Systematics 324–332.
and Evolution: Theory and Practice (eds R. DeSalle, Goloboff, P., Farris, J. and Nixon, K. (2004). T.N.T.: Tree
G. Giribet and W. Wheeler), pp. 107–114. Basel, Analysis Using New Technology. Available at www.zmuc.
Birkhäuser Verlag. dk/public/phylogeny/tnt.
Gladstein, D.S. (1997). Efficient incremental character Goodman, M., Czelusniak, J., Moore, G.W., Romero-
optimization. Cladistics 13: 21–26. Herrera, A.E. and Matsuda, G. (1979). Fitting the gene
lineage into its species lineage, a parsimony strategy on Biocomputing 2001. (eds R.B. Altman, A.K. Dunker,
illustrated by cladograms constructed from globin L. Hunter, K. Lauderdale and T.E. Klein), vol. 6,
sequences. Syst. Zool. 28: 132–163. pp. 179–190. Singapore, World Scientific.
Goudge, T.A. (1961). The Ascent of Life. A Philosophical Hein, J., Jensen, J.L. and Pedersen, C.N.S. (2003). Recur-
Study of the Theory of Evolution. Toronto, University of sions for statistical multiple alignment. Proc. Natl. Acad.
Toronto Press. [1967 reprint] Sci. USA 100: 14960–14965.
Grant, T. (2002). Testing methods: The evaluation of dis- Hendy, M.D. and Penny, D. (1982). Branch and bound
covery operations in evolutionary biology. Cladistics 18: algorithms to determine minimal evolutionary trees.
94–111. Math. Biosci. 59: 277–290.
Grant, T. and Kluge, A.G. (2003). Data exploration in Hendy, M.D., Foulds, L.R. and Penny, D. (1980). Proving
phylogenetic inference: Scientific, heuristic, or neither. phylogenetic trees minimal with l-clustering and set
Cladistics 19: 379–418. partitioning. Math. Biosci. 51: 71–88.
Grant, T. and Kluge, A.G. (2004). Transformation series as Hennig, W. (1950). Grundzüge einer Theorie der phylo-
an ideographic character concept. Cladistics 20: 23–31. genetischen Systematik. Berlin, Deutscher Zentralverlag.
Greene, B. (2004). The Fabric of the Cosmos. Space, Time, and Hennig, W. (1966). Phylogenetic Systematics. Urbana, IL,
the Texture of Reality. New York, A.A. Knopf. University of Illinois Press.
Greuter, W., McNeill, J., Barrie, F.R., Burdet, H.M., Higgins, D.G. and Sharp, P.M. (1988). CLUSTAL: A
Demoulin, V., Filgueiras, T.S., Nicolson, D.H., Silva, package for performing multiple sequence alignment
P.C., Skog, J.E., Trehane, P., et al. (2000). International on a microcomputer. Gene 73: 237–244.
Code of Botanical Nomenclature (St. Louis Code). Regnum Huber, K.T., Moulton, V. and Steel, M. (2002). Four
Vegetabile 138. Königstein, Koeltz Scientific Books. characters suffice to convexly define a phylogenetic tree.
Gu, J. and Gu, X. (2003). Induced gene expression in Research Report UCDMA2002/12, Christchurch,
human brain after the split from chimpanzee. Trends New Zealand, Department of Mathematics and
Genet. 19: 63–65. Statistics, University of Canterbury.
Gusfield, D. (1997). Algorithms on Strings, Trees, and Huelsenbeck, J.P. and Lander, K.M. (2003). Frequent
Sequences: Computer Science and Computational Biology. inconsistency of parsimony under a simple model of
Cambridge, Cambridge University Press. cladogenesis. Syst. Biol. 52: 641–648.
Hacking, I. (1965). The Logic of Statistical Inference. Huelsenbeck, J.P. and Ronquist, F. (2001). MrBayes:
Cambridge, Cambridge University Press. Bayesian inference of phylogeny. Bioinformatics 17:
Hartigan, J.A. (1973). Minimum mutation fits to a given 754–755.
tree. Biometrics 29: 53–65. Huelsenbeck, J.P., Bull, J.J. and Cunningham, C.W.
Harvey, P.H. and Pagel, M.D. (1991). The Comparative (1996). Combining data in phylogenetic analysis.
Method in Evolutionary Biology. New York, Oxford Trends Ecol. Evol. 11: 152–158.
University Press. Huelsenbeck, J., Ronquist, F., Nielsen, R. and Bollback, J.
Hasegawa, M. and Kishino, H. (1989). Confidence limits (2001). Bayesian inference of phylogeny and its impact
on the maximum-likelihood estimate of the hominoid on evolutionary biology. Science 294: 2310–2314.
tree from mitochondrial-DNA sequences. Evolution 43: Huelsenbeck, J.P., Larget, B., Miller, R.E. and Ronquist, F.
672–677. (2002). Potential applications and pitfalls of Bayesian
Hastings, W.K. (1970). Monte Carlo sampling methods inference of phylogeny. Syst. Biol. 51: 673–688.
using Markov chains and their applications. Biometrika Huelsenbeck, J., Larget, B. and Alfaro, M. (2004). Bayesian
57: 97–109. phylogenetic model selection using reversible jump
Hedges, S.B. (2002). The origin and evolution of model Markov chain Monte Carlo. Mol. Biol. Evol. 21: 1123–1133.
organisms. Nat. Rev. Genet. 3: 838–849. Hull, D.L. (1967). Certainty and circularity in evolu-
Hein, J. (1989a). A new method that simultaneously tionary taxonomy. Evolution 21: 174–189.
aligns and reconstructs ancestral sequences for any Hull, D.L. (1974). Philosophy of Biological Science.
number of homologous sequences when a phylogeny is Englewood Cliffs, NJ, Prentice-Hall.
given. Mol. Biol. Evol. 6: 649–668. Hull, D.L. (1975). Central subjects and historical narra-
Hein, J. (1989b). A tree reconstruction method that is tives. History and Theory: Studies Philos. Hist. 14: 253–274.
economical in the number of pairwise comparisons Hull, D.L. (1977). The ontological status of species as
used. Mol. Biol. Evol. 6: 669–684. evolutionary units. In Foundational Problems in Special
Hein, J.J. (2001). An algorithm for statistical alignment of Sciences (eds R. Butts and J. Hintikka), pp. 91–102.
sequences related by a binary tree. In Pacific Symposium Dordrecht, D. Reidel Pub. Co.
Hull, D.L. (1981). Historical narratives and integrating Kim, J. (1996). General inconsistency conditions for
explanations. In Pragmatism and Purpose: Essays Presented maximum parsimony: Effects of branch lengths and
to Thomas A. Goudge (eds L.W. Sumner, J.G. Slater and increasing numbers of taxa. Syst. Biol. 45: 363–374.
F. Wilson), pp. 172–188, 308–310. Toronto, University of Kimura, M. and Crow, J. (1964). The number of alleles
Toronto Press. that can be maintained in a finite population. Genetics
Hull, D.L. (1982). Exemplars and scientific change. 49: 725–738.
Proc. Biennial Mtg. Phil. Sci. Assoc. 2: 479–503. Kjer, K.M. (1995). Use of rRNA secondary structure in
Hull, D.L. (1989). The Metaphysics of Evolution. Albany, phylogenetic studies to identify homologous positions:
NY, SUNY Press. An example of alignment and data presentation from
Huson, D.H. and Steel, M. (2004). Phylogenetic trees the frogs. Mol. Phylogenet. Evol. 4: 314–330.
based on gene content. Bioinformatics 20: 2044–2049. Kluge, A.G. (1988). Parsimony in vicariance biogeo-
Huynen, M.A. and Bork, P. (1998). Measuring genome graphy: A quantitative method and a Greater Antillean
evolution. Proc. Natl. Acad. Sci. USA 95: 5849–5856. example. Syst. Zool. 37: 315–328.
ICZN (1999). International Code of Zoological Nomenclature, Kluge, A.G. (1989). A concern for evidence and a phy-
4th Edn. London, International Trust for Zoological logenetic hypothesis of relationships among Epicrates
Nomenclature. (Boidae, Serpentes). Syst. Zool. 38: 7–25.
Jacob, F., Perrin, D., Sanchez, C. and Monod, J. (1960). Kluge, A.G. (1997a). Testability and the refutation and
L’Operon: groupe de genes a expression coordonee par corroboration of cladistic hypotheses. Cladistics 13: 81–96.
un operateur. C. R. Seance Acad. Sci. 250: 1727–1729. Kluge, A.G. (1997b). Sophisticated falsification and
Jenner, R.A. (2004). The scientific status of metazoan research cycles: Consequences for differential character
cladistics: why current research practice must change. weighting in phylogenetic systematics. Zool. Scripta 26:
Zool. Scripta 33: 293–310. 349–360.
Jermann, T.M., Opitz, J.G., Stackhouse, J. and Benner, S.A. Kluge, A.G. (1999). The science of phylogenetic sys-
(1995). Reconstructing the evolutionary history of the tematics: Explanation, prediction, and test. Cladistics 15:
artiodactyl ribonuclease superfamily. Nature 374: 57–59. 429–436.
Jiang, T.L. and Lawler, E.L. (1994). Aligning sequences Kluge, A.G. (2001a). Parsimony with and without scien-
via an evolutionary tree: Computational complexity tific justification. Cladistics 17: 199–210.
and approximation. In Proceedings of the 26th ACM Kluge, A.G. (2001b). Philosophical conjectures and their
Symposium on the Theory of Computing, pp. 760–769. refutation. Syst. Biol. 50: 322–330.
New York, ACM. Kluge, A.G. (2002). Distinguishing ‘‘or’’ from ‘‘and’’
Källersjö, M., Farris, J.S., Chase, M.W., Bremer, B., Fay, and the case for historical identification. Cladistics 18:
M.F., Humphries, C.J., Petersen, G., Seberg, O. and 585–593.
Bremer, K. (1998). Simultaneous parsimony jackknife Kluge, A.G. (2003a). On the deduction of species
analysis of 2538 rbcL DNA sequences reveals support relationships: A précis. Cladistics 19: 233–239.
for major clades of green plants, land plants, seed plants Kluge, A.G. (2003b). The repugnant and the mature
and flowering plants. Plant Syst. Evol. 213: 259–287. in phylogenetic inference: A temporal similarity and
Källersjö, M., Albert, V.A. and Farris, J.S. (1999). historical identity. Cladistics 19: 356–368.
Homoplasy increases phylogenetic structure. Cladistics Kluge, A.G. (2004). On total evidence: For the record.
15: 91–93. Cladistics 20: 205–207.
Kapitonov, V.V. and Jurka, J. (1999). The long terminal Kluge, A.G. (2005). Taxonomy in theory and practice,
repeat of an endogenous retrovirus induces alternative with arguments for a new phylogenetic system of
splicing and encodes an additional carboxy-terminal taxonomy. In Ecology and Evolution in the Tropics:
sequence in the human leptin receptor. J. Mol. Evol. 48: a Herpetological Perspective (eds M.A. Donnelly,
248–251. B.I. Crother, C. Guyer, M.H. Wake and M.E. White),
Katinka, M.D., Duprat, S., Cornillot, E., Metenier, G., pp. 7–47. Chicago, University of Chicago Press.
Thomarat, F., Prensier, G., Barbe, V., Peyretaillade, E., Kluge, A.G. and Farris, J.S. (1969). Quantitative phyletics
Brottier, P., Wincker, P. et al. (2001). Genome sequence and the evolution of Anurans. Syst. Zool. 18: 1–32.
and gene compaction of the eukaryote parasite Kluge, A.G. and Farris, J.S. (1999). Taxic homology =
Encephalitozoon cuniculi. Nature 414: 450–453. overall similarity. Cladistics 15: 205–212.
Kidd, K.K. and Sgaramella-Zonta, L.A. (1971). Phylo- Koonin, E.V., Aravind, L. and Kondrashov, A.S. (2000).
genetic analysis: Concepts and methods. Am. J. Hum. The impact of comparative genomics on our under-
Genet. 23: 235–252. standing of evolution. Cell 101: 573–576.
Koonin, E.V., Makarova, K.S. and Aravind, L. (2001). Horizontal gene transfer in prokaryotes: Quantification and classification. Annu. Rev. Microbiol. 55: 709–742.
Koonin, E.V. and Galperin, M.Y. (2002). Sequence-Evolution-Function. Computational Approaches in Comparative Genomics. Kluwer Academic Publishers, New York.
Koonin, E.V., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Krylov, D.M., Makarova, K.S., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., Rao, B.S., et al. (2004). A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5: R7.
Korbel, J.O., Snel, B., Huynen, M.A. and Bork, P. (2002). SHOT: A web server for the construction of genome phylogenies. Trends Genet. 18: 158–162.
Koshi, J.M. and Goldstein, R.A. (1996). Probabilistic reconstruction of ancestral protein sequences. J. Mol. Evol. 42: 313–320.
Kruskal, J. (1983). An overview of sequence comparison. In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (eds D. Sankoff and J. Kruskal), pp. 1–44. Stanford, CA, CSLI Publications (1999 reprint).
Kumar, S., Tamura, K. and Nei, M. (1993). MEGA: Molecular Evolutionary Genetics Analysis, vers. 1.01. University Park, PA, Pennsylvania State University.
Kunin, V. and Ouzounis, C.A. (2003). The balance of driving forces during genome evolution in prokaryotes. Genome Res. 13: 1589–1594.
Lande, R. (1976). Natural selection and random genetic drift in phenotypic evolution. Evolution 30: 314–334.
Larget, B. and Simon, D. (1999). Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16: 750–759.
Larson, A. and Losos, J.B. (1996). Phylogenetic systematics of adaptation. In Adaptation (eds M.R. Rose and G.V. Lauder), pp. 187–220. San Diego, CA, Academic Press.
Laudan, L. (1990). Science and Relativism: Some Key Controversies in the Philosophy of Science. Chicago, University of Chicago Press.
Laudan, R. (1990). What’s so special about the past? In Evolutionary Innovations (ed. M. Nitecki), pp. 55–67. Chicago, University of Chicago Press.
Lauder, G.V., Leroi, A.M. and Rose, M.R. (1993). Adaptations and History. Trends Ecol. Evol. 8: 294–297.
Lawrence, J. (1999). Selfish operons: The evolutionary impact of gene clustering in prokaryotes and eukaryotes. Curr. Opin. Genet. Dev. 9: 642–648.
Lawrence, J.G. and Roth, J.R. (1996). Selfish operons: Horizontal transfer may drive the evolution of gene clusters. Genetics 143: 1843–1860.
Le Cam, L. (1960). An approximation theorem for the Poisson binomial distribution. Pacific J. Math. 10: 1181–1197.
Le Quesne, W. (1969). A method of selection of characters in numerical taxonomy. Syst. Zool. 18: 201–205.
Lee, D.C. and Bryant, H.N. (1999). A reconsideration of the coding of inapplicable characters: Assumptions and problems. Cladistics 15: 373–378.
Levesque, M., Shasha, D., Kim, W., Surette, M.G. and Benfey, P.N. (2003). Trait-to-gene: A computational method for predicting the function of uncharacterized genes. Curr. Biol. 13: 129–133.
Lewis, P.O. (2001). A likelihood approach to estimating phylogeny from discrete morphological character data. Syst. Biol. 50: 913–925.
Li, S. (1996). Phylogenetic Tree Construction using Markov Chain Monte Carlo. PhD Dissertation, Ohio State University, Columbus, OH.
Li, S., Pearl, D.K. and Doss, H. (2000). Phylogenetic tree construction using Markov chain Monte Carlo. J. Am. Stat. Assoc. 95: 493–508.
Liberles, D.A. (2001). Evaluation of methods for determination of a reconstructed history of gene sequence evolution. Mol. Biol. Evol. 18: 2040–2047.
Liberles, D.A., Schreiber, D.R., Govindarajan, S., Chamberlin, S.G. and Benner, S.A. (2001). The Adaptive Evolution Database (TAED). Genome Biol. 2(8): research0028.1–research0028.6.
Liberles, D.A., Thoren, A., von Heijne, G. and Elofsson, A. (2002). The use of phylogenetic profiles for gene predictions. Curr. Genomics 3: 131–137.
Lidén, M. (1990). Replicators, hierarchy, and the species problem. Cladistics 6: 183–186.
Lipscomb, D.L. (1992). Parsimony, homology, and the analysis of multistate characters. Cladistics 8: 45–65.
Logsdon, Jr., J.M., Stoltzfus, A. and Doolittle, W.F. (1998). Molecular evolution: Recent cases of spliceosomal intron gain? Curr. Biol. 8: R560–R63.
Logsdon, Jr., J.M., Tyshenko, M.G., Dixon, C., J, D.J., Walker, V.K. and Palmer, J.D. (1995). Seven newly discovered intron positions in the triose-phosphate isomerase gene: Evidence for the introns-late theory. Proc. Natl. Acad. Sci. USA 92: 8507–8511.
Long, M. and Rosenberg, C. (2000). Testing the “proto-splice sites” model of intron origin: Evidence from analysis of intron phase correlations. Mol. Biol. Evol. 17: 1789–1796.
Lutzoni, F., Wagner, P., Reeb, V. and Zoller, S. (2000). Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst. Biol. 49: 628–651.
Lynch, M. and Richardson, A.O. (2002). The evolution of spliceosomal introns. Curr. Opin. Genet. Dev. 12: 701–710.
Maddison, D.R. (1991). The discovery and importance of multiple islands of most-parsimonious trees. Syst. Zool. 40: 315–328.
Maddison, D.R. and Maddison, W.P. (2001). MacClade 4: Analysis of Phylogeny and Character Evolution (incl. vers. 4.03). Sunderland, MA, Sinauer Associates.
Maddison, D.R., Swofford, D.L. and Maddison, W.P. (1997). NEXUS: An extensible file format for systematic information. Syst. Biol. 46: 590–621.
Maddison, W.P. (1991). Squared-change parsimony reconstructions of ancestral states for continuous-valued characters on a phylogenetic tree. Syst. Zool. 40: 304–314.
Maddison, W.P. (1993). Missing data versus missing characters in phylogenetic analysis. Syst. Biol. 42: 576–581.
Maddison, W.P. and Maddison, D.R. (1992). MacClade 3: Analysis of Phylogeny and Character Evolution (incl. vers. 3.04). Sunderland, MA, Sinauer Associates.
Mallatt, J. and Winchell, C.J. (2002). Testing the new animal phylogeny: First use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes. Mol. Biol. Evol. 19: 289–301.
Marchionni, M. and Gilbert, W. (1986). The triosephosphate isomerase gene from maize: Introns antedate the plant-animal divergence. Cell 46: 133–141.
Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O. and Eisenberg, D.A. (1999). A combined algorithm for genome-wide prediction of protein function. Nature 402: 83–86.
Martin, C. and Paz-Ares, J. (1997). MYB transcription factors in plants. Trends Plant Sci. 13: 67–73.
Mau, B. (1996). Bayesian Phylogenetic Inference via Markov Chain Monte Carlo Methods. PhD Dissertation, University of Wisconsin, Madison, WI.
Mau, B. and Newton, M. (1997). Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J. Comput. Graph. Stat. 6: 122–131.
Mau, B., Newton, M. and Larget, B. (1999). Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55: 1–12.
Mayr, E. and Bock, W.J. (2002). Classification and other ordering systems. J. Zool. Syst. Evol. Res. 40: 169–194.
McAllister, J.W. (1996). Beauty and Revolution in Science. Ithaca, NY, Cornell University Press.
McAllister, J.W. (2000). Unification of theories. In A Companion to the Philosophy of Science (ed. W.H. Newton-Smith), pp. 537–539. Oxford, Blackwell Publishing.
McDade, L.A. (1992). Hybrids and phylogenetic systematics II. The impact of hybrids on cladistic analysis. Evolution 46: 1329–1346.
Messier, W. and Stewart, C.B. (1997). Episodic adaptive evolution of primate lysozymes. Nature 385: 151–154.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087–1091.
Mickevich, M.F. and Farris, J.S. (1981). PHYSYS: Phylogenetic Analysis System. Published by the authors.
Miklós, I., Lunter, A. and Holmes, I. (2004). A “long indel” model for evolutionary sequence alignment. Mol. Biol. Evol. 21: 529–540.
Mindell, D.P. and Thacker, C.E. (1996). Rates of molecular evolution: Phylogenetic issues and applications. Annu. Rev. Ecol. Syst. 27: 279–303.
Mirkin, B.G., Fenner, T.I., Galperin, M.Y. and Koonin, E.V. (2003). Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol. 3: 2.
Mishler, B.D. (1994). Cladistic analysis of molecular and morphological data. Am. J. Phys. Anthropol. 94: 143–156.
Mishler, B.D. (1999). Getting rid of species? In Species: New Interdisciplinary Essays (ed. R. Wilson), pp. 307–315. Cambridge, MA, MIT Press.
Mishler, B.D. (2000). Deep phylogenetic relationships among “plants” and their implications for classification. Taxon 49: 661–683.
Mishler, B.D. and Brandon, R.N. (1987). Individuality, pluralism, and the phylogenetic species concept. Biol. Phil. 2: 397–414.
Mishler, B.D. and De Luna, E. (1991). The use of ontogenetic data in phylogenetic analyses of mosses. Adv. Bryol. 4: 121–167.
Mishler, B.D. and Theriot, E. (2000a). The phylogenetic species concept (sensu Mishler and Theriot): Monophyly, apomorphy, and phylogenetic species concepts. In Species Concepts and Phylogenetic Theory: A Debate (eds Q.D. Wheeler and R. Meier), pp. 44–54. New York, Columbia University Press.
Mishler, B.D. and Theriot, E.G. (2000b). A critique from the Mishler and Theriot phylogenetic species concept perspective: Monophyly, apomorphy, and phylogenetic species concepts. In Species Concepts and Phylogenetic Theory: A Debate (eds Q.D. Wheeler and R. Meier), pp. 133–145. New York, Columbia University Press.
Mishler, B.D. and Theriot, E.G. (2000c). A defense of phylogenetic species concept (sensu Mishler and Theriot): Monophyly, apomorphy, and phylogenetic species concepts. In Species Concepts and Phylogenetic Theory: A Debate (eds Q.D. Wheeler and R. Meier), pp. 179–184. New York, Columbia University Press.
Modrek, B. and Lee, C.J. (2003). Alternative splicing in the human, mouse, and rat genomes is associated with
an increased frequency of exon creation and/or loss. Nat. Genet. 34: 177–180.
Moilanen, A. (1999). Searching for most parsimonious trees with simulated evolutionary optimization. Cladistics 15: 39–50.
Moilanen, A. (2001). Simulated evolutionary optimization and local search: Introduction and application to tree search. Cladistics 17: S12–S25.
Montague, M.G. and Hutchison, 3rd., C.A. (2000). Gene content phylogeny of herpesviruses. Proc. Natl. Acad. Sci. USA 97: 5334–5339.
Moran, N.A. (2002). Microbial minimalism: Genome reduction in bacterial pathogens. Cell 108: 583–586.
Moret, B.M.E., Wang, L.S., Warnow, T. and Wyman, S. (2001). New approaches for reconstructing phylogenies based on gene order. Proc. 9th Int’l Conf. on Intelligent Systems for Molecular Biology ISMB-2001, Bioinformatics 17: S165–S173.
Moret, B.M.E., Tang, J., Wang, L.S. and Warnow, T. (2002). Steps toward accurate reconstruction of phylogenies from gene-order data. J. Comput. Syst. Sci. 65(3): 508–525.
Morgenstern, B. (2004). DIALIGN: Multiple DNA and protein sequence alignment at BiBiServ. Nucleic Acids Res. 32: W33–W36.
Morgenstern, B., Dress, A. and Werner, T. (1996). Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA 93: 12098–12103.
Moritz, C. (2002). Strategies to protect biological diversity and the processes that sustain it. Syst. Biol. 51: 238–254.
Mossel, E. and Steel, M. (2004a). A phase transition for a random cluster model on phylogenetic trees. Math. Biosci. 187: 189–203.
Mossel, E. and Steel, M. (2005). How much can evolved characters tell us about the tree that generated them? In Mathematics of Evolution and Phylogeny (ed. O. Gascuel). Oxford, Oxford University Press.
Murata, M., Richardson, J. and Sussman, J. (1985). Simultaneous comparison of three protein sequences. Proc. Natl. Acad. Sci. USA 82: 3073–3077.
Mushegian, A.R. and Koonin, E.V. (1996). Gene order is not conserved in bacterial evolution. Trends Genet. 12: 289–290.
Mushegian, A.R., Garey, J.R., Martin, J. and Liu, L.X. (1998). Large-scale taxonomic profiling of eukaryotic model organisms: A comparison of orthologous proteins encoded by the human, fly, nematode, and yeast genomes. Genome Res. 8: 590–598.
Myllykallio, H., Lipowski, G., Leduc, D., Filee, J., Forterre, P. and Liebl, U. (2002). An alternative flavin-dependent mechanism for thymidylate synthesis. Science 297: 105–107.
Nadeau, J.H. and Taylor, B.A. (1984). Lengths of chromosome segments conserved since divergence of man and mouse. Proc. Natl. Acad. Sci. USA 81: 814–818.
Naylor, G.J.P. and Adams, D.C. (2001). Are the fossil data really at odds with the molecular data? Morphological evidence for Cetartiodactyla phylogeny reexamined. Syst. Biol. 50: 444–453.
Naylor, G.J.P. and Adams, D.C. (2003). Total evidence versus relevant evidence: A response to O’Leary et al. (2003). Syst. Biol. 52: 864–865.
Needleman, S.B. and Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
Neff, N.A. (1986). A rational basis for a priori character weighting. Syst. Zool. 35: 102–109.
Nei, M. and Kumar, S. (2001). Molecular Evolution and Phylogenetics. Oxford, Oxford University Press.
Nelson, G. and Platnick, N.I. (1981). Systematics and Biogeography: Cladistics and Vicariance. New York, Columbia University Press.
Newton, M., Mau, B. and Larget, B. (1999). Markov chain Monte Carlo for the Bayesian analysis of evolutionary trees from aligned molecular sequences. In Statistics in Molecular Biology and Genetics, vol. 33 (ed. F. Seillier-Moiseiwitsch), pp. 143–162. Bethesda, MD, Institute of Mathematical Statistics.
Neyman, J. (1971). Molecular studies of evolution: A source of novel statistical problems. In Statistical Decision Theory and Related Topics (eds S. Gupta and J. Yackel), pp. 1–27. New York, Academic Press.
Nixon, K.C. (1999). The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics 15: 407–414.
Nixon, K.C. (2002). WinClada, vers. 1.00.08. Published by the author, Ithaca, New York (distributed through www.cladistics.org).
Nixon, K.C. and Carpenter, J.M. (1993). On outgroups. Cladistics 9: 413–426.
Nixon, K.C. and Little, D.P. (2004). The use of optimality criteria in DNA sequence data and its application in a new computer program (Abstract, 22nd annual meeting of the Willi Hennig Society, New York 2003). Cladistics 20: 90–91.
Nixon, K.C. and Wheeler, Q.D. (1990). An amplification of the phylogenetic species concept. Cladistics 6: 211–223.
Nolan, D. (1997). Quantitative parsimony. Br. J. Phil. Sci. 48: 329–343.
Notredame, C. (2002). Recent progress in multiple sequence alignment: A survey. Pharmacogenomics 3: 1–14.
Notredame, C., Holm, L. and Higgins, D.G. (1998). COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407–422.
Notredame, C., Higgins, D.G. and Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302: 205–217.
O’Hara, R.J. (1988). Homage to Clio, or, toward an historical philosophy for evolutionary biology. Syst. Zool. 37: 142–155.
Ochman, H., Lawrence, J.G. and Groisman, E.A. (2000). Lateral gene transfer and the nature of bacterial innovation. Nature 405: 299–304.
Ochoterena, H. (2004). Independence of alignment and phylogenetic reconstruction and their optimality criteria (Abstract, 22nd annual meeting of the Willi Hennig Society, New York 2003). Cladistics 20: 91.
Ohno, S. (1970). Evolution by Gene Duplication. New York, Springer-Verlag.
Oleksiak, M.F., Churchill, G.A. and Crawford, D.L. (2002). Variation in gene expression within and among natural populations. Nat. Genet. 32: 261–266.
Padian, K. (2004). For Darwin, ‘genealogy alone’ did give classification. J. Zool. Syst. Evol. Res. 42: 162–164.
Parzen, E. (1962). Stochastic Processes. San Francisco, Holden-Day.
Patterson, C. (1982). Morphological characters and homology. In Problems of Phylogenetic Reconstruction (eds K.A. Joysey and A.E. Friday), pp. 21–74. New York, Academic Press.
Patterson, C. (1988). The impact of evolutionary theories on systematics. In Prospects in Systematics (ed. D.L. Hawksworth), pp. 59–91. Syst. Assoc. Spec. Vol. 36. Oxford, Clarendon Press.
Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. and Yeates, T.O. (1999). Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96: 4285–4288.
Penny, D., Lockhart, P.J., Steel, M.A. and Hendy, M.D. (1994). The role of models in reconstructing evolutionary trees. In Models in Phylogeny Reconstruction (eds R.W. Scotland, D.J. Siebert and D.M. Williams), pp. 211–230. Systematics Assoc. Special vol. 52. Oxford, Clarendon Press.
Penny, D., Hendy, M.D., Lockhart, P.J. and Steel, M.A. (1996). Corrected parsimony, minimum evolution, and Hadamard conjugations. Syst. Biol. 45: 596–606.
Peterson, K.J. and Eernisse, D.J. (2001). Animal phylogeny and the ancestry of bilaterians: Inferences from morphology and 18S rDNA gene sequences. Evol. Dev. 3: 170–205.
Phillips, A., Janies, D. and Wheeler, W.C. (2000). Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol. 16: 317–330.
Planet, P.J., DeSalle, R., Siddall, M., Bael, T., Sarkar, I.N. and Stanley, S.E. (2001). Systematic analysis of DNA microarray data: Ordering and interpreting patterns of gene expression. Genome Res. 11: 1149–1155.
Platnick, N.I. (1979). Philosophy and the transformation of cladistics. Syst. Zool. 28: 537–546.
Platnick, N.I. and Cameron, H.D. (1977). Cladistic methods in textual, linguistic, and phylogenetic analysis. Syst. Zool. 26: 380–385.
Platnick, N.I., Griswold, C.E. and Coddington, J.A. (1991). On missing entries in cladistic analysis. Cladistics 7: 337–343.
Pleijel, F. (1995). On character coding for phylogeny reconstruction. Cladistics 11: 309–315.
Pol, D. and Siddall, M. (2001). Biases in maximum likelihood and parsimony: A simulation approach to a 10-taxon case. Cladistics 17: 266–281.
Popper, K. (1957). The Poverty of Historicism. London, Routledge and Kegan Paul.
Popper, K. (1959). The Logic of Scientific Discovery. New York, Harper and Row [1968 edition].
Popper, K. (1962a). Some comments on truth and the growth of knowledge. In Logic, Methodology and Philosophy of Science (eds E. Nagel, P. Suppes and A. Tarski), pp. 285–292. Proc. 1960 Internatl. Congress. Stanford, CA, Stanford University Press.
Popper, K. (1962b). Conjectures and Refutations: The Growth of Scientific Knowledge. London, Routledge and Kegan Paul.
Popper, K. (1968). The Logic of Scientific Discovery. New York, Harper Torchbooks.
Popper, K. (1979). Objective Knowledge. An Evolutionary Approach. New York, Oxford University Press.
Popper, K. (1980). Evolution. New Scientist 87: 611.
Popper, K. (1983). Realism and the Aim of Science. London, Routledge.
Posada, D. and Crandall, K. (1998). MODELTEST: Testing the model of DNA substitution. Bioinformatics 14: 817–818.
Posada, D. and Crandall, K. (2001a). Selecting models of nucleotide substitution: An application to human immunodeficiency virus 1 (HIV-1). Mol. Biol. Evol. 18: 897–906.
Posada, D. and Crandall, K. (2001b). Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50: 580–601.
Pritchard, P.C.H. (1994). Cladistics: The great delusion. Herpetol. Rev. 25: 103–110.
Prömel, H.J. and Steger, A. (2000). A new approximation algorithm for the Steiner tree problem with performance ratio 5/3. J. Algorithms 36: 89–101.
Pupko, T., Pe’er, I., Shamir, R. and Graur, D. (2000). A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol. Biol. Evol. 17: 890–896.
Pupko, T., Pe’er, I., Hasegawa, M., Graur, D. and Friedman, N. (2002). A branch-and-bound algorithm for the inference of ancestral amino-acid sequences when the replacement rate varies among sites: Application to the evolution of five gene families. Bioinformatics 18: 1116–1123.
Quine, W.V. (1963). On simple theories of a complex world. Synthese 15: 103–106.
Raff, R.A. (1996). The Shape of Life: Genes, Development, and the Evolution of Animal Form. Chicago, IL, University of Chicago Press.
Rain, J.-C., Selig, L., De Reuse, H., Battaglia, V., Reverdy, C., Simon, S., Lenzen, G., Petel, F., Wojcik, J., Schachter, V. et al. (2001). The protein-protein interaction map of Helicobacter pylori. Nature 409: 211–215.
Rannala, B. and Yang, Z. (1996). Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J. Mol. Evol. 43: 304–311.
Reichenbach, H. (1956). The Direction of Time. Berkeley, CA, University of California Press.
Remane, A. (1952). Die Grundlagen des natürlichen Systems, der vergleichenden Anatomie und der Phylogenetik. Leipzig, Akademische Verlagsgesellschaft Geest & Portig.
Resch, A., Xing, Y., Alekseyenko, A., Modrek, B. and Lee, C. (2004). Evidence for a subpopulation of conserved alternative splicing events under selection pressure for protein reading frame conservation. Nucleic Acids Res. 32: 1261–1269.
Rexová, K., Frynta, D. and Zrzavý, J. (2003). Cladistic analysis of languages: Indo-European classification based on lexicostatistical data. Cladistics 19: 120–127.
Rice, K.A., Donoghue, M.J. and Olmstead, R.G. (1997). Analyzing large data sets: rbcL 500 revisited. Syst. Biol. 46: 554–563.
Rieppel, O.C. (1988). Fundamentals of Comparative Biology. Basel, Birkhäuser Verlag.
Rieppel, O. (2003). Semaphoronts, cladograms and the roots of total evidence. Biol. J. Linn. Soc. 80: 167–186.
Rieppel, O. and Kearney, M. (2002). Similarity. Biol. J. Linn. Soc. 75: 59–82.
Rieseberg, L.H. and Soltis, D.E. (1991). Phylogenetic consequences of cytoplasmic gene flow in plants. Evol. Trends Plants 5: 5–84.
Rogers, J. (1997). On the consistency of the maximum likelihood estimation of phylogenetic trees from nucleotide sequences. Syst. Biol. 46: 354–357.
Rogozin, I.B., Wolf, Y.I., Sorokin, A.V., Mirkin, B.G. and Koonin, E.V. (2003). Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Curr. Biol. 13: 1512–1517.
Rokas, A. and Holland, P.W.H. (2000). Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15: 454–459.
Romero, I., Fuertes, A., Benito, M.J., Malpica, J.M., Leyva, A. and Paz-Ares, J. (1998). More than 80 R2R3-MYB regulatory genes in the genome of Arabidopsis thaliana. Plant J. 14: 273–284.
Rossnes, R. (2004). Ancestral Reconstruction of Continuous Characters and its Potential Application to Gene Expression and Alternative Splicing Analysis. MSc Thesis, University of Bergen, Norway.
Roth, V.L. (1984). On homology. Biol. J. Linn. Soc. 22: 13–29.
Roth, V.L. (1988). The biological basis of homology. In Ontogeny and Systematics (ed. Humphries, C.J.). New York, Columbia University Press.
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. New York, Chapman and Hall.
Ruse, M. (1971). Narrative explanation and the theory of evolution. Can. J. Phil. 1: 59–74.
Russell, B. (1948). Human Knowledge, its Scope and Limits. London, George Allen and Unwin.
Rutishauser, R. and Sattler, R. (1989). Complementary and heuristic value of contrasting models in structural biology. III. Case study on shoot-like “leaves” and leaf-like “shoots” in Utricularia macrorhiza and Utricularia purpurea (Lentibulariaceae). Botanische Jahrbücher für Systematik 111: 121–137.
Rzhetsky, A. and Nei, M. (1992). A simple method for estimating and testing minimum-evolution trees. Mol. Biol. Evol. 9: 945–967.
Rzhetsky, A., Ayala, F.J., Hsu, L.C., Chang, C. and Yoshida, A. (1997). Exon/intron structure of aldehyde dehydrogenase genes supports the “introns-late” theory. Proc. Natl. Acad. Sci. USA 94: 6820–6825.
Saitou, N. and Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406–425.
Salem, A.H., Ray, D.A., Xing, J., Callinan, P.A., Myers, J.S., Hedges, D.J., Garber, R.K., Witherspoon, D.J., Jorde, L.B. and Batzer, M.A. (2003). Alu elements and hominid phylogenetics. Proc. Natl. Acad. Sci. USA 100: 12787–12791.
Salisbury, B.A. (1999). Strongest evidence: Maximum apparent phylogenetic signal as a new cladistic optimality criterion. Cladistics 15: 137–149.
Salmon, W.C. (1966). The Foundations of Scientific Inference. Pittsburgh, PA, University of Pittsburgh Press.
Sanderson, M. and Hufford, L., eds (1996). Homoplasy. San Diego, CA, Academic Press.
Sanderson, M. and Kim, J. (2000). Parametric phylogenetics? Syst. Biol. 49: 817–829.
Sanderson, M.J., Purvis, A. and Henze, C. (1998). Phylogenetic supertrees: Assembling the trees of life. Trends Ecol. Evol. 13: 105–109.
Sankoff, D. (1975). Minimal mutation trees of sequences. SIAM J. Appl. Math. 28: 35–42.
Sankoff, D. and Blanchette, M. (1998). Multiple genome rearrangement and breakpoint phylogeny. J. Comp. Biol. 5: 555–570.
Sankoff, D. and Cedergren, R.J. (1983). Simultaneous comparison of three or more sequences related by a tree. In Time Warps, String Edits, and Macromolecules. The Theory and Practice of Sequence Comparison (eds D. Sankoff and J. Kruskal), pp. 253–263. Stanford, CA, CSLI Publications (1999 reprint).
Sankoff, D. and Nadeau, J.H. (eds) (2000). Comparative Genomics. Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families. Dordrecht, Kluwer Academic Publishers.
Sankoff, D. and Rousseau, P. (1975). Locating the vertices of a Steiner tree in arbitrary space. Math. Program. 9: 240–246.
Sankoff, D., Cedergren, R.J. and Lapalme, G. (1976). Frequency of insertion-deletion, transversion, and transition in the evolution of 5S ribosomal RNA. J. Mol. Evol. 7: 133–149.
Sankoff, D., Morel, C. and Cedergren, R.J. (1973). Evolution of 5S RNA and the non-randomness of base replacement. Nat. New Biol. 245: 232–234.
Sarkar, I.N., Planet, P.J., Bael, T.E., Stanley, S.E., Siddall, M., DeSalle, R. and Figurski, D.H. (2002). Characteristic attributes in cancer microarrays. J. Biomed. Inform. 35: 111–122.
Schwikowski, B. and Vingron, M. (1997). The deferred path heuristic for the generalized tree alignment problem. J. Comput. Biol. 4: 415–431.
Schwikowski, B. and Vingron, M. (2003). Weighted sequence graphs: Boosting iterated dynamic programming using locally suboptimal solutions. Discr. Appl. Math. 127: 95–117.
Scriven, M. (1959). Explanation and prediction in evolutionary theory. Science 130: 477–482.
Seitz, V., Ortiz García, S. and Liston, A. (2000). Alternative coding strategies and the inapplicable data coding problem. Taxon 49: 47–54.
Sellers, P.H. (1974). An algorithm for the distance between two sequences. J. Comb. Theory 16: 253–258.
Semple, C. and Steel, M. (2002). Tree reconstruction from multi-state characters. Adv. Appl. Math. 28: 169–184.
Semple, C. and Steel, M. (2003). Phylogenetics. Oxford, Oxford University Press.
Shenkin, P.S., Erman, B. and Mastrandrea, L.D. (1991). Information-theoretical entropy as a measure of sequence variability. Proteins Struct. Funct. Genet. 11: 297–313.
Sicheritz-Ponten, T. and Andersson, S.G. (2001). A phylogenomic approach to microbial evolution. Nucleic Acids Res. 29: 545–552.
Siddall, M. (1998). Success of parsimony in the four-taxon case: Long branch repulsion by likelihood in the Farris zone. Cladistics 14: 209–220.
Siddall, M.E. and Kluge, A.G. (1997). Probabilism and phylogenetic inference. Cladistics 13: 313–336.
Sikes, D.S. and Lewis, P.O. (2001). PAUPRat: PAUP Implementation of the Parsimony Ratchet. Published by the authors (distributed through www.ucalgary.ca/dsikes/sikes_lab.htm).
Simmons, M.P. (2004). Independence of alignment and tree search. Mol. Phylogenet. Evol. 31: 874–879.
Simmons, M.P. and Ochoterena, H. (2000). Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol. 49: 369–381.
Simon, D. and Larget, B. (1998). Bayesian Analysis in Molecular Biology and Evolution (BAMBE), version 1.01 beta. Pittsburgh, PA, Department of Mathematics and Computer Science, Duquesne University.
Simpson, G.G. (1964). This View of Life: The World of an Evolutionist. New York, Harcourt, Brace and World.
Sinsheimer, J., Lake, J.A. and Little, R.J.A. (1996). Bayesian hypothesis testing of four-taxon topologies using molecular sequence data. Biometrics 52: 193–210.
Slatkin, M. and Maddison, W. (1989). A cladistic measure of gene flow inferred from the phylogenies of alleles. Genetics 123: 603–613.
Slowinski, J.B. (1998). The number of multiple alignments. Mol. Phylogenet. Evol. 10: 264–266.
Smith, R.L. and Sytsma, K.J. (1990). Evolution of Populus nigra (sect. Aigeiros): Introgressive hybridization and the chloroplast contribution of Populus alba (sect. Populus). Am. J. Bot. 77: 1176–1187.
Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981). Comparative biosequence metrics. J. Mol. Evol. 18: 38–46.
Smith, V.S., Page, R.D.M. and Johnson, K.P. (2004). Data incongruence and the problem of avian louse phylogeny. Zool. Scripta 33: 239–259.
Smouse, P.E. and Li, W.-H. (1989). Likelihood analysis of mitochondrial restriction-cleavage patterns for the human-chimpanzee-gorilla trichotomy. Evolution 41: 1162–1176.
Snel, B., Bork, P. and Huynen, M.A. (1999). Genome phylogeny based on gene content. Nat. Genet. 21: 108–110.
Snel, B., Bork, P. and Huynen, M.A. (2002). Genomes in flux: The evolution of archaeal and proteobacterial gene content. Genome Res. 12: 17–25.
Sober, E. (1980). Evolution, population thinking, and Steel, M. (2002). Some statistical aspects of the maximum
essentialism. Phil. Sci. 47: 350–383. parsimony method. In Molecular Systematics and Evolu-
Sober, E. (1981). The principle of parsimony. Br. J. Phil. tion: Theory and Practice (eds R. De Salle, R. Giribet and
Sci. 32: 145–156. W. Wheeler), pp. 125–139. Basel, Birkhäuser Verlag.
Sober, E. (1983). Parsimony methods in systematics — Steel, M. and Penny, D. (2000). Parsimony, likelihood,
philosophical issues. Annu. Rev. Ecol. Syst. 14: and the role of models in molecular phylogenetics.
335–357. Mol. Biol. Evol. 17: 839–850.
Sober, E. (1985). A likelihood justification for parsimony. Steel, M. and Penny, D. (2004). Two links between MP
Cladistics 1: 209–233. and ML under the Poisson model. Applied Math. Lett.
Sober, E. (1986). Parsimony and character weighting. (in press).
Cladistics 2: 28–42. Steel, M., Penny, D. and Hendy, M. (1993). Parsimony can
Sober, E. (1988a). Reconstructing the Past: Parsimony, be consistent! Syst. Biol. 42: 581–587.
Evolution and Inference. Cambridge, MA, MIT Press. Steel, M., Szekely, L. and Hendy, M. (1994). Reconstruc-
Sober, E. (1988b). The conceptual relationship of cladistic ting trees from sequences whose sites evolve at variable
phylogenetics and vicariance biogeography. Syst. Zool. rates. J. Comp. Biol. 1: 153–163.
37: 245–253. Steffansson, P. (2004). Inferring Duplication and Loss Events
Sober, E. (1993). Philosophy of Biology. San Francisco, CA, Using Soft Parsimony. MSc Thesis, Royal Institute of
Westview Press. Technology, Stockholm, Sweden.
Sober, E. (1994). Let’s razor Ockham’s Razor. In Expla- Stevens, P.F. (1984). Homology and phylogeny: Morpho-
nation and its Limits (ed. D. Knowles), pp. 73–93. Suppl. logy and systematics. Syst. Bot. 9: 395–409.
Philosophy, Roy. Inst. Phil. 27. Stevens, P.F. (1991). Character states, Morphological
Sober, E. (1996). Parsimony and predictive equivalence. variation, and phylogenetic analysis: A review. Syst.
Erkenntnis 44: 167–197. Bot. 16: 553–583.
Sober, E. (2002). Reconstructing ancestral character Storm, C.E. and Sonnhammer, E.L. (2003). Comprehen-
states — a likelihood perspective on cladistic parsi- sive analysis of orthologous protein domains using the
mony. The Monist 85: 156–176. HOPS database. Genome Res. 13: 2353–2362.
Sober, E. (2003). Parsimony. In The Philosophy of Science: Strong, E. and Lipscomb, D. (1999). Character coding and
an Encyclopedia (eds S. Sarkar and J. Pfeifer). London, inapplicable data. Cladistics 15: 363–371.
Routledge. Stuart, J.M., Segal, E., Koller, D. and Kim, S.K. (2003). A
Sober, E. (2004a). The contest between likelihood and gene-coexpression network for global discovery of
parsimony. Syst. Biol. 53: 644–653. conserved genetic modules. Science 302: 249–255.
Sober, E. (in press): Is drift a serious alternative to natural Suzuki, Y., Glazko, G. and Nei, M. (2002). Overcredibi-
selection as an explanation of complex adaptive traits? lity of molecular phylogenetics obtained by
In The Spandrels of San Marco Twenty-Five Years After Bayesian phylogenetics. Proc. Natl. Acad. Sci. USA 99:
(ed. D. Walsh). Oxford, Oxford University Press. 16138–16143.
Sober, E. and Steel, M. (2002). Testing the hypothesis of Swofford, D.L. (1984). PAUP: Phylogenetic Analysis Using
common ancestry. J. Theor. Biol. 218: 395–408. Parsimony. Champaign, IL, Illinois Natural History
Sokal, R.R. (1986). Phenetic taxonomy: Theory and Survey.
methods. Annu. Rev. Ecol. Syst. 17: 423–442. Swofford, D.L. (1985). PAUP: Phylogenetic Analysis Using
Sokal, R.R. and Camin, J.H. (1965). The two taxonomies: Parsimony, vers. 2.4. Champaign, IL, Illinois Natural
Areas of agreement and conflict. Syst. Zool. 14: 176–195. History Survey.
Soltis, D.E., Soltis, P.S., Chase, M.W., Mort, M.E., Swofford, D.L. (1990). PAUP: Phylogenetic Analysis Using
Albach, D.C., Zanis, M., Savolainen, V., Hahn, W.H., Parsimony, vers. 3.0. (incl. vers. 3.0s). Champaign, IL,
Hoot, S.B., Fay, M.F. et al. (2000). Angiosperm phylo- Illinois Natural History Survey.
geny inferred from 18S rDNA, rbcL, and atpB sequences. Swofford, D.L. (1991). When are phylogeny estimates
Bot. J. Linn. Soc. 133: 381–461. from molecular and morphological data incongruent?
Sonnhammer, E.L. and Koonin, E.V. (2002). Orthology, In Phylogenetic Analysis of DNA Sequences. (eds
paralogy and proposed classification for paralog sub- M.M. Miyamoto and J. Cracraft), pp. 295–333. New
types. Trends Genet. 18: 619–620. York, Oxford University Press.
Steel, M.A. (1993). Distributions on bicoloured binary Swofford, D.L. (1993). PAUP: Phylogenetic Analysis Using
trees arising from the principle of parsimony. Discrete Parsimony, vers. 3.1 (incl. vers. 3.1.1). Champaign, IL,
Appl. Math. 41: 245–261. Illinois Natural History Survey.
Swofford, D.L. (2002). PAUP*: Phylogenetic Analysis Using M. (2004). Sister grouping of chimpanzees and humans
Parsimony (*and other methods), vers. 4 (incl. vers. 4.0b10). as revealed by genome-wide phylogenetic analysis
Sunderland, MA, Sinauer Associates. of brain gene expression profiles. Proc. Natl. Acad. Sci.
Swofford, D.L., Olsen, G.J., Waddell, P.J. and Hillis, D.M. USA. 101: 2957–2962.
(1996). Phylogenetic inference. In Molecular Systematics, Vander Stappen, J., De Laet, J., Gama-López, S., Van
2nd edn (eds D.M. Hillis, C. Moritz and B.K. Mable), Campenhout, S. and Volckaert, G. (2002). Phylogenetic
pp. 407–514. Sunderland, MA, Sinauer Associates. analysis of Stylosanthes (Fabaceae) based on the internal
Swofford, D., Waddell, P., Huelsenbeck, J., Foster, P., transcribed spacer region (ITS) of nuclear ribosomal
Lewis, P. and Rogers, J. (2001). Bias in phylogenetic DNA. Plant Syst. Evol. 234: 27–51.
estimation and its relevance to the choice between Vingron, M. (1999). Sequence alignment and phylogeny
parsimony and likelihood methods. Syst. Biol. 50: construction. In Mathematical Support for Molecular Biol-
525–539. ogy (eds M. Farach-Colton, F.S. Roberts, M. Vingron and
Tatusov, R.L., Koonin, E.V. and Lipman, D.J. (1997). M. Waterman), pp. 53–64. DIMACS Series in Discrete
A genomic perspective on protein families. Science 278: Mathematics and Theoretical Computer Science, vol. 47.
631–637. Providence, RI, American Mathematical Society.
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Vrana, P. and Wheeler, W. (1992). Individual organisms
Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, as terminal entities: Laying the species problem to rest.
R., Mekhedov, S.L., Nikolskaya, A.N., et al. (2003). Cladistics 8: 67–72.
The COG database: An updated version includes Wagner, Jr., W.H. (1952). The fern genus Diellia: struc-
eukaryotes. BMC Bioinformatics 4: 41. ture, affinities, and taxonomy. Univ. Cal. Publ. Bot. 26:
Tehler, A., Little, D.P. and Farris, J.S. (2003). The full- 1–212, pl. 1–21.
length phylogenetic tree from 1 551 ribosomal sequen- Wagner, Jr., W.H. (1961). Problems in the classification of
ces of chitinous fungi, Fungi. Mycol. Res. 107: 901–916. ferns. In Recent Advances in Botany, vol. 1, pp. 841–844.
Tellgren, Å., Berglund, A.C., Savolainen, P., Janis, C.M. Toronto, University of Toronto Press.
and Liberles, D.A. (2004). Myostatin rapid sequence Walsh, D. (1979). Occam’s razor: A principle of intel-
evolution in ruminants predates domestication. Mol. lectual elegance. Am. Phil. Q. 16: 241–244.
Phylogenet. Evol. 33: 782–790. Wang, L. and Jiang, T. (1994). On the complexity of
Thanaraj, T.A., Stamm, S., Clark, F., Riethoven, J.J., Le multiple sequence alignment. J. Comput. Biol. 1: 337–348.
Texier, V. and Muilu, J. (2004). ASD: The Alternative Wang, L., Jiang, T. and Lawler, L. (1996). Approximation
Splicing Database. Nucleic Acids Res. 32: D64–D69. algorithms for tree alignment with a given phylogeny.
Thayer, H.S. (1953). Newton’s Philosophy of Nature: Algorithmica 16: 302–315.
Selections from his Writings. New York, Hafner Wang, L.S., Jansen, R., Moret, B., Raubeson, L. and
Publishing Co. Warnow, T. (2002). Fast phylogenetic methods for the
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). analysis of genome rearrangement data: An empirical
CLUSTAL W: Improving the sensitivity of progressive study. Proceedings of the Pacific Symposium on Biocompu-
multiple alignment through sequence weighting, ting (PSB 02), pp. 524–535. Singapore, World Scientific.
position-specific gap penalties and weight matrix Watanabe, H., Mori, H., Itoh, T. and Gojobori, T. (1997).
choice. Nucleic Acids Res. 22: 4673–4680. Genome plasticity as a paradigm of eubacteria evolu-
Thorne, J.L., Kishino, H. and Felsenstein, J. (1991). An tion. J. Mol. Evol. 44: S57–S64.
evolutionary model for maximum likelihood alignment Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J.,
of DNA sequences. J. Mol. Evol. 33: 114–124. Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R.,
Thorne, J.L., Kishino, H. and Felsenstein, J. (1992). Alexandersson, M., An, P. et al. (2002). Initial sequen-
Inching toward reality: An improved likelihood model cing and comparative analysis of the mouse genome.
of sequence evolution. J. Mol. Evol. 34: 3–16. Nature 420: 520–562.
Tierney, L. (1994). Markov chains for exploring posterior Wheeler, Q.D. (1986). Character weighting and cladistic
distributions. Ann. Stat. 22: 1701–1786. analysis. Syst. Zool. 35: 102–109.
Tuffley, C. and Steel, M. (1997). Links between max- Wheeler, Q.D. and Meier, R. (eds) (2000). Species Concepts
imum likelihood and maximum parsimony under a and Phylogenetic Theory: a Debate, pp. 179–184. New
simple model of site substitution. Bull. Math. Biol. 59: York, Columbia University Press.
581–607. Wheeler, W.C. (1994). Sources of ambiguity in nucleic
Uddin, M., Wildman, D.E., Liu, G., Xu, W., Johnson, R.M., acid sequence alignment. In Molecular Ecology and Evo-
Hof, P.R., Kapatos, G., Grossman, L.I. and Goodman, lution: Approaches and Applications (eds B. Schierwater,
B. Streit, G.P. Wagner and R. DeSalle), pp. 323–352. Wilkinson, M. (1995). A comparison of two methods of
Basel, Birkhäuser Verlag. character construction. Cladistics 11: 297–308.
Wheeler, W.C. (1996). Optimization alignment: The end Wolf, Y.I., Rogozin, I.B., Grishin, N.V., Tatusov, R.L. and
of multiple sequence alignment in phylogenetics? Koonin, E.V. (2001). Genome trees constructed using
Cladistics 12: 1–9. five different approaches suggest new major bacterial
Wheeler, W.C. (1998). Alignment characters, dynamic clades. BMC Evol. Biol. 1: 8.
programming and heuristic solutions. In Molecular Wolf, Y.I., Rogozin, I.B., Grishin, N.V. and Koonin, E.V.
Approaches to Ecology and Evolution (eds R. DeSalle and (2002). Genome trees and the tree of life. Trends Genet.
B. Schierwater), pp. 243–251. Basel, Birkhäuser Verlag. 18: 472–479.
Wheeler, W.C. (1999). Fixed character states and the Wolf, Y.I., Rogozin, I.B. and Koonin, E.V. (2004).
optimization of molecular sequence data. Cladistics 15: Coelomata and not Ecdysozoa: evidence from genome-
379–385. wide phylogenetic analysis. Genome Res. 14: 29–36.
Wheeler, W.C. (2001a). Homology and the optimization Wolfe, K.H. and Sharp, P.M. (1993). Mammalian gene
of DNA sequence data. Cladistics 17: S3–S11. evolution: Nucleotide sequence divergence between
Wheeler, W. (2001b). Homology and DNA sequence data. mouse and rat. J. Mol. Evol. 37: 441–456.
In The Character Concept in Evolutionary Biology Woodger, J.H. (1929). Biological principles: A critical study.
(ed. G.P. Wagner), pp. 303–317. San Diego, Academic New York, Harcourt, Brace and Co.
Press. Wrinch, D. and Jeffreys, H. (1921). On certain funda-
Wheeler, W.C. (2002). Optimization Alignment: Down, mental principles of scientific inquiry. Phil. Mag. 42:
up, error, and improvements. In Techniques in Molecular 369–390.
Systematics and Evolution (eds R. DeSalle, G. Giribet and Yang, Z. (1994). Maximum likelihood phylogenetic
W. Wheeler), pp. 55–69. Basel, Birkhäuser Verlag. estimation from DNA sequences with variable rates
Wheeler, W.C. (2003a). Implied alignment: A synapo- over sites: Approximate methods. J. Mol. Evol. 39:
morphy-based multiple-sequence alignment method 306–314.
and its use in cladogram search. Cladistics 19: 261–268. Yang, Z. (1996). Phylogenetic analysis using parsimony
Wheeler, W.C. (2003b). Search-based optimization. and likelihood methods. J. Mol. Evol. 42: 294–307.
Cladistics 19: 348–355. Yang, Z.H. (1998). Likelihood ratio tests for detecting
Wheeler, W.C. (2003c). Iterative pass optimization of positive selection and application to primate lysozyme
sequence data. Cladistics 19: 254–260. evolution. Mol. Biol. Evol. 15: 568–573.
Wheeler, W.C. and Gladstein, D.S. (1994). MALIGN: Yang, Z.H. and Bielawski, B. (2000). Statistical methods
A multiple sequence alignment program. J. Hered. 85: for detecting molecular adaptation. Trends Ecol. Evol.
417–418. 15: 496–503.
Wheeler, W.C. and Hayashi, C.Y. (1998). The phylogeny Yang, Z. and Rannala, B. (1997). Bayesian phylogenetic
of the extant chelicerate orders. Cladistics 14: 173–192. inference using DNA sequences: A Markov chain
Wheeler, W., Gladstein, D. and De Laet, J. (2003). POY, Monte Carlo method. Mol. Biol. Evol. 14: 717–724.
ver. 3.0.11. Available at ftp://ftp.amah.org/pub/ Yang, Z., Goldman, N. and Friday, A. (1995a). Maximum
molecular/poy. likelihood trees from DNA sequences: A peculiar
Wiley, E.O. (1975). Karl R. Popper, systematics, and statistical estimation problem. Syst. Biol. 44: 384–399.
classification: A reply to Walter Bock and other Yang, Z.H., Kumar, S. and Nei, M. (1995b). A new
evolutionary taxonomists. Syst. Zool. 24: 233–243. method of inference of ancestral nucleotide and amino
Wiley, E.O. (1981). Phylogenetics: the Theory and Practice acid sequences. Genetics 141: 1641–1650.
of Phylogenetic Systematics. New York, John Wiley Yeates, D. (1992). Why remove autapomorphies? Cladistics
and Sons. 8: 387–389.
Index
Page numbers in italics refer to Tables and Figures

18S rRNA 78, 79, 194 sequences 181, 182, 184, 186 Bayesian 1–3, 40, 47, 53 n.11,
500-terminal rbcL matrix (see zilla) states 3–5, 34, 51, 63, 168–70, 153, 157, 182
567-terminal three-gene matrix 181 analysis 155, 159
see three-gene (567-terminal) reconstructing 168–70 approach 28, 45
matrix 145 Anderson, D.R. 18 ‘credible set’ 150
absence/presence coding 94–96, 111 Andersson, S.G. 191 estimations 158
Adams, D.C. 38 Anopheles gambiae (Ag) 196, 197 inference 19, 26, 39, 41, 57
adaptation 16, 26, 39, 41, 184, 185, ANOVA 61 methods 42, 49, 148, 150–1, 155
187, 189 anti-free parameters 17–18, 26–7 phylogenetics 148–9
adaptive/selectionist anti-quantity principle (AQP) 15, and parsimony 148–59
constraints 62 17 weighted average 44
convergence 64, 67 anti-superfluity principle (ASP) weighting terms 49
explanations 30 15, 17, 19, 29–35, 37 Bayesianism 31
The Adaptive Evolution Database apomorphy/plesiomorphy Benner, S.A. 183–4, 186–7
(TAED) 183, 186 84–6 Benson, D.A. 186
Addario-Berry, L. 172, 174 AQP see also anti-quantity principle best-case likelihood 45
aesthetic value 16–17 Arabidopsis thaliana (At) 7, 188, best-fitting hypothesis/model 18,
agnosticism 88 192, 193, 195, 197, 197–8 33, 35, 153
Aguinaldo, A.M. 194 Aravind, L. 191, 195, 200 Bi, S. 187
Akaike, H. 2, 18, 27, 45 archetypes 63 bichromatic binary tree theorem
Akaike information criterion Ardawatia, H. 187 177
(AIC) 2, 45 Ariew, A. 22 Bielawski, B. 186
Albert, V.A. 1–11, 69 Aristotle ‘fallacy of accident’ 30 binary characters 85, 92, 164, 168,
Alfaro, M. 158 arthropods 157, 193–5, 198 194
alignment 68, 71–2, 78, 197 Arvestad, L. 182 binary gene and species trees,
lifted 109 ASP see anti-superfluity principle fixed 182
methods 38, 72, 75–6, 78, 101, ‘attenuated’ Bayesian mapping 182
115 approach 45 biogeography 16, 41
and optimization 72, 78–80 autapomorphy 81–3 Blackburn, D.G. 24
pairwise 72–5, 98–102, 107, 109, average likelihood 45, 47 Blair, J.E. 194
195 Avise, J.C. 42, 67 Blanchette, M. 157
progressive 101–2, 104 Bock, W.J. 28, 34
allopolyploidization, speciation Bonferroni inequality 171, 177
115 Babenko, N. 190 bootstrap 152
Alternative Splicing Database Bach, E. 32 estimations 158
(ASD) 188–9 backwards inequality 49–50 method 192
Altschul, S.F. 98–9, 107–8 bacteria 64, 184, 193, 199–200 values 158, 194
ambiguity settings 129, 135–6, Baker, A. 15, 17, 19, 30–1 Bork, P. 192
142, 146 BAMBE 150 Boudet, N. 195
ancestral Bandelt, H.J. 174, 178 bounded-error approximation
gene 11, 195, 199 Barnes, E.C. 15, 17–19 methods 110
content 11 Barrett, M. 66 Boyd, P. 85
duplication of 191 Barriel, V. 163, 167 Brady, R.H. 25, 36
strings 199 Barry, D. 172 branch
intron 197–8; see also introns base-to-base homology 105–8 -and-bound 76, 92
maximum likelihood 172 Bayes’ theorem 27, 148 length 5–6, 9, 114
weighting 181 evolution best 146

rearrangements 142 processes of 92 hold 146
support 114, 116 rate of 22 hold/ 129
swapping 79, 93, 121–36, 138–46, exact meaning 59–61 hold/x 128, 130
140, 153 ‘good’ 60–1 mh* 120, 125
ratio 119, 126, 131, 133–5, 135, independence 27 max* 129
138, 143–6 ordered 113 mult* 128, 146
second-stage 142 positional 97, 105 poly ¼ 129, 135, 137, 144, 146
techniques 123 quantitative 52 poly- 135–6
time-consuming 125 state 8, 38–9, 48–51, 61, 83, rseed 128
two phases of 130 85, 89–90, 164, 169, thread1 128
branch breaking (BB) see also TBR 175–6, 192–3, 200 tcount 129
120–1, 126, 129, change 3, 6, 24, 29 unique 146
133, 144 different roles of 93 common-mechanism 173
Brandon, R.N. 59 homology 156 equivalence between parsimony
Bremer, K. 79, 114, 156 polarity 39 and likelihood 8
Bremer support 79, 114, 156 space 9 comparative genomics 65, 191
Brooks, D.R. 41, 69 statements 59 compartmentalization 63–4, 68–9
Brower, A.V.Z. 16, 25 transformations 21 compatibility/compatible 25, 166,
Brudno, M. 115 tree 38 177
Bryant, D. 10, 164, 167 weighting 69 characters 164
Bryant, H.N. 81 well separated on tree 169 statements (mutually) 89
Buneman, P. 174 uninformative 123, 127–8 tree 8
Buneman complex 174 unordered 106 composite terminal units
Burnham, K.P. 18 weighting 27, 36, 69 (TUs) 59, 61
Chase, M.W. 121, 123, 126–8, 159 compositional homology 106–7,
Caenorhabditis elegans (Ce) 192–3, chimpanzee 10, 188 109, 112
193, 197, 199 chromosomal rearrangement 158 compound NP-complete 72
Cameron, H.D. 40 circular genome 10, 163 consensus tree 123
Camin, J.H. 20, 24, 33 circular reasoning 96 approaches 66
Camin–Sokal parsimony method 24 clade credibility values 158 conservation genetics 42, 66
CAOS see Characteristic Attribute cladistic 43 consistency index 8, 82
Organization System 10 analysis, conventional conventional
Carillo, H. 101 fundamental assumption 67 analyses/searches 119, 122, 124
Carpenter, J.M. 34, 39, 84, 104 limits of 119–47 of 500-terminal data set
Carter, M. 166, 177 software for 119 123, 144
Cartmill, M. 24 data matrices 123 branch swapping 124
Cavalli-Sforza, L.L. 24 difference 33 of zilla matrix 123, 144
Cayley graph 174–5, 177 parsimony 68 methods 119, 124
Cedergren, R.J. 74, 78, 98–9, 105, cladogram 57 convergence 41, 60, 61, 66, 68,
108 cost 72 86, 190
Chang, B.S.W. 185–6, 186, 188 optimization 79 co-orthologs 183, 191–2
Chang, J.T. 151, 174 clock-like markers 64 Copernicus 1
Characteristic Attribute CLUSTAL 74–5, 75, 78–9, 104 corroboration, degree of 19, 28–9,
Organization System Clusters of Orthologous Groups 36–7
(CAOS) 10 (COGs) 192, 193, 200 costs 99
characters 85 codons, third position of 7 function 72
additive 91 coevolution 16, 26, 41, 187 of the tree alignment 98
analysis 59–61, 65, 71, 83–4, 88, COFFEE 75 covariation
90, 94, 100 COGs see Clusters of Orthologous compensatory 187
complex 187–8 Groups intramolecular 187
congruence 30 COGNITOR method 196 Crandall, K. 27
conservative for deep commands see also Hennig86, Crichton, M. 185
branchings 9 NONA Crick, F. 1, 2
continuous 50, 187–8 amb- 144 Crisci, J.V. 24
different roles of 93–6 amb= 144 Crow, J. 176
discrete 187 bb* 120, 125 Cummings, M. 158
Dacks, J.B. 195 Dollo, L. 190 concept of support 44

DALIGN 74 Dollo method 24 equal (flat) prior probabilities 3
Dandekar, T. 199 DOLLOP 193 equal weighting 91, 105, 109, 113
Darwin, C. 1, 33 Dollo parsimony 9, 183, 195, Erdős, P.L. 8, 177
theory of evolution 21, 39 198 Erdős and Székely
data ancestral gene model 195 min–max theorem 177
equally weighted 99, 114 character presence/absence, eukaryotic crown group 190–200
‘good’ 69 matrices of 192–3 eukaryotic genomes 197
from the literature 59 eukaryotic gene structure, evolution of 193
database design application in 194–9 euKaryotic Orthologous Groups
TUs, characters 68 in the opposite direction 198 (KOGs) 192
data matrices/matrix 57, 59, orthologous/paralogous evolution/evolutionary
62–4, 69, 123 genes 191–2 conservation 196
classification 59 pre-genomic era, molecular data epistemology 37
column in 59–61 for 190–1 genomics 190, 193, 200
individual entries in 61 prokaryotic gene order, analysis history 181
representation 58–9 of 191, 199–200 model 3, 69, 148, 156
rows in 58, 61 and the reconstruction of the complexity of 155
data sets, large 6, 126–7, 132–3, genome evolution mechanisms of 157
135–7, 141 191–200 neutral 184
challenges of 122–6 tree, in animal evolution process 43, 47–8, 88
files 128 193–4 rates 88
matrix 119, 121, 123, 124–6, 130, unambiguous scenarios 193 change 36
143 Dollo’s Law 190, 200 regulatory 187
daughter nodes 102 Donoghue, M.J. 36, 66, 121 reticulate cultural 24, 85
Davids, W. 187 Doolittle, R.F. 74, 101–2 systematics 40–1
Davis, J.I. 6, 10, 119–47, 159 Doolittle, W.F. 191, 193, 195 theory 41, 184
Dayhoff, M.O. 24 Doyle, J.A. 121 transformation 84, 87, 111, 113
DeBry, R.W. 191 Drosophila melanogaster (Dm) 192, Excoffier, L. 174
deduction/deductive 20, 23, 37, 39 193, 197 exons and introns, in protein
‘deep’ branching questions 65 duplications 65, 115 coding sequences 9, 115
deep vs shallow phylogenetics 62 Duret, L. 183 explanatory power 4, 17, 19–20,
De Laet, J.E. 5–6, 11, 39, 81–116, dynamic 27–9, 34–7, 39–40, 43, 81,
158–9 algorithms 100 87–8, 95–6, 105, 111, 113–14
deletion 38, 97, 157 approach 73 expressed sequence tag (EST)/
of introns 65 homology cDNA data 188
De Luna, E. 61 framework 71
de Pinna, M.C.C. 71, 83, 94 and optimization 71–80 falsifiability 19, 23, 29
de Queiroz, K. 24, 28, 36, 156 programming 76–7 falsification/falsificationism 37, 96
descendant 1, 48–52, 169, 182 weighting 91 falsifiers and hypothesis 24
descent with modification 17, 20–1, relationship between 156
25, 27, 30, 32, 36–7, 40, 42–3 Eck, R.V. 24 Farris, J.S. 3–9, 16, 19–20, 23–7,
descriptive efficiency 16–17, 27, Edwards, A. 24, 44 29–30, 32–9, 43, 47, 53, 59,
38–9, 41 Eernisse, D.J. 194 69, 81, 83–9, 91–2, 102,
destructive sampling 61 Eisen, M.B. 10, 65 109, 113–15, 120, 122,
developmental constraints 67 empirical 128, 148–9, 155–6, 163,
Dezulian, T. 166–7, 176 assumptions 17 167, 187, 190, 192
diagnosability 5, 6, 7, 11 content 92–3 Farris parsimony (FP) 34–9
Dibb, N.J. 196 of a data set 91 in phylogenetic inference
differential weighting 108 data 96, 104 fundamental nature of 37–9
direct optimization 76–8, 77, 105 evidence 95 Fedorov, N.D. 196
disconnected graph (forest) 176 empiricists, unconcerned 16 Felsenstein, J. 2, 3–5, 7–9, 20–2,
discrete states 60–1 Enard, W. 188 24, 26–8, 43, 46–8, 53, 69,
‘dispersal’ theory 41 Endo, T. 186 76, 87–9, 92–3, 120,
distance Endress, P.K. 96 148–9, 163, 173, 193
-based phylogenetics 2–3, 7, 10, epistemological/epistemology 16, zone 5
102, 200 29–30, 35, 39, 41, 58, 60 Feng, D.F. 74, 101–2
finite state space 176 genome evolution 190 -scale estimation 3

Fink, W.L. 40 evolution of 163 -scale questions 7
first-stage swapping 132, 139, 142 expression 187–8, 200 sequencing projects 181
Fitch, W.M. 81, 92, 96, 97, 99–100, families 6, 183 -specific best hit (bet) 192
108, 169, 181, 182 family evolution, case studies tree methods 194
Fitch-Hartigan algorithm 169 of 184–6 genome sequencing, whole 9, 11,
Fitz-Gibbon, S.T. 194 flow 68 65
fixed-state optimization 77 function, understanding of genotype 32, 40, 65
flat prior 50, 148, 151, 154–5 183–4, 192 Ghiselin, M.T. 30, 34
Force, A. 184 gain 195 Gilbert, W. 195
Forster, M.R. 18 genealogies 66 Giribet, G. 78–9, 105, 115, 194
forwards inequality 49 highly conserved 64 Gladstein, D.S. 75–6
fossils 36, 64, 96, 181 identification in different tumor Gogarten, J.P. 191
Foulds, L.R. 92, 108, 164–5, 172 types 10 Gojobori, T. 186
fractal 70 loss 191, 194–5, 200 Goldman, N. 3–4, 26, 47, 155
scaling 57, 59, 61, 62, 68, 70 order 9–11, 163, 167, 173, 176–7, Goldstein, R.A. 181
frame sequences 98 199 Goloboff, P. 3–4, 6, 26–7, 36, 78,
frequency of success 130, 138, 142 data 5 79, 88, 92–3, 109, 121–4,
frequentism 27, 30 orthologous and paralogous 146, 148–9, 151, 153,
frequentist approach (F) 46 191–2 155–6, 158–9
assumption 28 pair 200 Goodman, M. 182
probability 21–2, 38 random inversion, re-shuffling goodness-of-fit 16
procedure 49 of 163 Goudge, T.A. 20, 21–4, 29
frequentists 45, 53 n.11 regulation 187 Graham, R.L. 92, 108, 164, 172
Freudenstein, J.V. 83 sequences 181, 189 Grant, T. 16, 19, 20, 22, 26–30,
Friedman, M.L. 108 silencing 200 32–5, 37–8, 42, 68, 87
Friedman, N. 18, 25, 40 structure, evolution of 197–8 graph isomorphism 164
Frost, D.R. 16, 28, 37–8, 81, 99, 105, transfer 115, 181–2, 190 Greene, B. 18, 20
112–13 tree 183 grouping, evidence for 2
Fukami-Kobayashi, K. 187 mapping of 182–3 group walk process 177
functional tree/species tree distinction 58 Grueter, W. 41
annotation 192 genetic algorithm 75 Gu, J. 188
constraints 61 genetic code 1 guide tree 74–5, 102, 104
and developmental dependence genomes/genomic 41, 114–15, 176, Gusfield, D. 73, 101, 110
38 182, 187
genomics 186 applications 181 Hacking, I. 44
Funk, V.A. 69 characters Hamming
categories of 64–5 distances 108
Gaasterland, T. 181, 192 structural 64 metric 173
Gallut, C. 163, 167 colinearity 199 haplotype graph 173
Galperin, M.Y. 192 comparative 6, 65, 191 Hartigan, J.A. 92, 169, 172
gap 74, 81, 88, 97, 102, 106, 158 complete 115 Harvey, P.H. 41
characters 72, 98 data 157, 166, 173 Hasegawa, M. 149
coding 61 and Dollo parsimony Hastings, W.K. 149
costs 98 genome evolution, Hedges, S.B. 194, 199
extension cost 98–9, 105, 108 reconstruction of 190–200 Hein, J.J. 74–6, 99, 108–9, 114
opening cost 99, 105, 108 validity of 200 heliocentric solar system theory 1
penalty 74 evolution, 181–9, 191, 194 Hendy, M.D. 165–6
positions 103 ancestral state reconstruction, Hennig, W. 20, 26, 30–3, 37, 39,
‘garbage in, garbage out’ 59 continuous-character 187–9 58, 60–2, 67, 69, 83–7,
Gee, H. 24 case studies of 184–6 119, 121
GenBank 58, 186 gene trees, mapping of 182–3 auxiliary principle 33, 86
genes 167, 181 loss of genes 190 concepts of synapomorphy 120
content 9, 181, 183, 193–4 using phylogeny 181–9 Hennig86 120–121, 125
conversion 67 parsimony and 181–200 see also commands
duplication 181, 183, 184–5, 191 rearrangement 68, 174, 176 Hennig–Farris auxiliary principle
emergence 190, 194 -scale data and parsimony 5–11 86–8, 93
Hennigian phylogenetics 33, 34, 120 -free 164, 166, 167, 174–7 statements of homoplasy 88
Hennig principle 59 increase in phylogenetic statements of pairwise
heterochrony 61 structure 8 homology 5
heterogeneous data types 66, 68 large values of 168 induction/inductive 37, 39
heuristic/heuristics 108–10 separate ‘explanation’ of 27 reasoning 30
approximation of optimal tree humans 65, 192, 194, 197, 199 inferential models
alignment costs 108 gene expression in 188 complexity of 155
branch swapping methods 120 genome 191 inferred character evolution,
multiple alignment 74–5 Homo sapiens (Hs) 10, 193, 197 irreversibility of 113
methods 115 horizontal gene transfer infinite alleles model 176
search techniques 124 (HGT) 190, 191, 193, 199–200 infinite and large state spaces
solutions 76 horizontal transmission 41 results for 176–7
HGT see horizontal gene House, C.H. 194 informatics 68
transfer 190, 191, 193, HOVERGEN 183 informative markers 60
199–200 HTU, see hypothetical taxonomic ingroup 79, 82, 186
hierarchic model 2 unit terminals 84
Higgins, D.G. 74 Huber, K.T. 168 inner node
historical causality 24 Huelsenbeck, J.P. 45, 66, 88, 149–50 reconstructions 106
historical contingency 20 Hull, D.L. 20–3, 39 sequences 98
Hayashi, C.Y. 79 Huson, D.H. 9 state assignments 89–92, 99
Holland, P.W.H. 163 Huynen, M.A. 192 insertion/deletion (see indel)
homologies/homology 20, 25, hybridization 67 38, 65, 72, 97, 157–8, 196
28–30, 35, 37, 39, 41, 57, hypothesis of a stretch of contiguous
60–2, 65, 69, 71–2, 76, evaluation 27 bases 106
78–80, 86, 88–93, 96, testing 71 instrumentalism 17, 27
106–7, 109, 113, 115, 156 Hypothetical Ancestor (HA) 63–4 instrumentalist 22, 41
common causal explanation 27–8 sequences 174 justifications 18
of different tree alignments 108 hypothetical taxonomic unit parsimony 18
hypotheses of, mistaken 67 (HTU) 76–7 internal nodes 182, 188, 193
logically independent 93 intragenomic rearrangements 200
maximizing the amount of ICZN 41 introns 6, 190, 192, 196–200, 198–9
92–3, 108 ideographic 20, 22, 26, 28–30, in eukaryotic genes 198
and parsimony analysis, 32–3, 40–2 evolutionary history of 196
Hennig–Farris auxiliary science vs nomothetic insertions
principle 86–8 science 19 structural characters
secondary 71 theory unification 40–2 (traditional morphological
in sequence characters 105–8 illuminated manuscripts 40 characters) 64
similarities that can be explained implied loss 196
as 115 alignments 79, 79, 99–101, 103 presence/absence 197
of subsequences 106 transformation series 95 inversions 115, 157, 163, 176
homologous 30, 34 weighting 27, 30, 121 irreversible evolution, law of 190
proteins inapplicability/inapplicables 81, islands of trees 128, 130, 147
systematic analysis of 195 83, 91, 93–4, 97, 110–11 isomorphism 18, 57, 164
similarity, maximizing incongruence 39, 68 theory 18
vs minimizing inconsistency 8 iterative-pass optimization 76, 77
transformations 111–14 incremental character
homologs 29, 38, 61, 98, 103, 110, optimization 76 jackknife
181, 186, 191 indels 73, 75, 77, 79, 107, 109 frequency 7
homology schemes, independent jackknifing 114, 158
individualized 76 and identical evolution Jacob, A.R. 199
homoplasies/ homoplasy 4–5, 8–9, (IID) 172 Jeffreys, H. 16
11, 20, 28–30, 39, 47–8, 63, evolution 60 Jenner, R.A. 83
66–8, 87–90, 104, 106, 120, pairwise comparisons 107 Jermann, T.M. 186
163–5, 167–8, 174, 176–7, 192, pairwise similarities 91–2, 94, Jiang, T. 72, 76, 99, 108–9
200 97, 105 Jukes–Cantor model 9, 153,
distributions of 88 single-column characters 96–7, 154–5
extinction 67 111 r-state 170
Jurka, J. 187 of PrM(Data | H) = maximization

juvenile specialization 59 likelihoodM 45 -of-identity algorithm 73
ratio 51 of independent homologies 92
Ka/Ks ratio 184, 185, 186–7 test 26 of independent sequence
Källersjö, M. 7, 9, 27, 62, 122, 124, lineage similarity 98
168 duration 49 maximizing
Kapitonov, V.V. 187 sorting 62, 67–8 homologous similarity vs
Katinka, M.D. 191 sorting 62, 67–8 homologous similarity vs
Kearney, M. 26, 31, 38, 83–4 -specific minimizing
Kidd, K.K. 25 duplications 196 transformations 111–14
Kim, J. 7, 155, 174 gene family expansion 195 homology 88, 91
Kimura, M. 176 gene loss 193 independent homologous
Kishino, H. 149 transition probabilities 48 similarity 105
Kjer, K.M. 115 linear programming approach pairwise homology 90
Kluge, A.G. 4–5, 8–9, 15–16, 19–23, 166 pair-wise similarities 39
25–30, 32–41, 66, 83, 87, 156, linear regression analysis 5 see also likelihood
187 Lipscomb, D.L. 37–8, 81 maximum (average) likelihood
KOGs see euKaryotic Orthologous Little, D.P. 104, 119 (ML) see also likelihood 1,
Groups 194–7, 195 logical subsequences 108–13 18, 22, 24, 26–8, 30, 39–50,
Koonin, E.V. 183, 190–200 Logsdon, Jr., J.M. 195–6 57, 64, 69, 89, 114, 122, 124,
Korbel, J.O. 200 Long, M. 196 148–50, 153–5, 157–8,
Koshi, J.M. 181 long branch attraction 61–4 163, 172–3, 181
Kruskal, J. 101 Losos, J.B. 24 analysis 155
algorithm 165 LSEs (lineage-specific gene family ancestral 172
Kumar, S. 24, 196 expansions) 195 estimator 4
Kunin, V. 115, 193 Lutzoni, F. 109 for phylogenetic trees 172
Lynch, M. 184, 196 maximum parsimony (MP)
Lamarck, J-B. links with ML 172
parsimony methodology 1 MacClade 127–8 methods 92
Lander, K.M. 88 Maddison, D.R. 50, 58, 127 models 157
Larget, B. 148–50 Maddison, W.P. 58, 68, 81, 97, maximum parsimony (MP)
Larson, A. 24 110, 127 see also parsimony
last common ancestor 183, 185, majority-rule consensus 150 infinite and large state
191, 197 MALIGN 75, 78–9 spaces 163
Laudan, L. 35 Mallatt, J. 194 result of 176–7
Laudan, R. 24 Manhattan distance 58 links between MP and ML
Lauder, G.V. 41 Marchionni, M. 195 172–4
Lawler, L. 99, 109 Marcotte, E.M. 183 on Multistate Encodings
Law of Likelihood 1, 44–8, 50, 52–3 Markov Chain Monte Carlo (MPME) 10
law of total probability 171 (MCMC) 149–50 phylogenetic information
Lawrence, J.G. 200 algorithms 159 in multistate characters
Le Cam, L. Bayesian methods 163–78
inequality 171 attraction of 150 phylogenetic tree 168
Lee, C.J. 188 posterior probabilities of score of data 165–6
Lee, D -C. 81, 188 monophyly by 157 bounds on 165
Le Quesne, W. 27 problems with 150–5 maximum posterior probability
‘less is more’ principle 35–6 results of 149 (MAP)
Lewis, P.O. 26, 122, 155, 157, 170 Markov chains 148–50, 175 tree 150
Li, S. 148–50 elementary theory of 178 Mayr, E. 34
Liberles, D.A. 181–9 Markov model 48–9, 51, 52–3, 170, McAllister, J.W. 18, 40
Lidén, M. 21 172, 174–7, 175 McDade, L.A. 67
likelihood 2–3, 45–52, 184 of character evolution 163, MCMC 150, 154–5, 157–9
conjecture 114 172 see also Markov Chain Monte
landscape 152 genome rearrangement 176 Carlo
Law of 1, 44–48, 50, 52–3 Martin, C. 7 Meier, R. 58
mathematical concept of 44 Master Catalog 183, 186 Messier, W. 184–6, 185
vs parsimony 155 Mau, B. 148–50 metaphysical system building 41
Metropolis, N. 149 monophyly 85–6, 150–4, 157–8 science 21

Metropolis–Hastings definition of 119 synthesis 18, 40–2
algorithm 149 Moore’s law 11 neofunctionalization 184
Mickevich, M.F. 120 Moran, N.A. 191 nested
microarray data 10–11, 65 Moret, B.M.E. 163, 176 hypotheses of putative
microevolutionary differentiation Morgenstern, B. 74, 115 homology 95
processes 66 Moritz, C. 42 lineages 59, 66
Miklós, I. 114, 158 morphological/morphology 64, sets of lineages 66
Mindell, D.P. 27 68, 98 Newman, A.J. 196
minimal and anatomical features 83 Newton, M. 148–50
cluster in a Euclidian distance characters 62, 157, 170 Newton, Sir, I. 7
sense 58 mosquito see Anopheles gambiae (Ag) view 32
extension (or most-parsimonious Mossel, E. 176–7 Neyman, J. 170
extension) 164, 169–70 most likely reconstruction 4 r-state model 170
gene sets 190 motif sequence 104 Nixon, K.C. 10, 39, 67, 84, 93,
mutation algorithm 98 mouse genome 191 104, 119, 122–4, 127–8, 146
minimization most-parsimonious likelihood no common mechanism 4, 172
of ad hoc hypotheses 29 (MPL) 174 model 114, 153, 172–3
of independent homoplasies 92 score 172, 174 Nolan, D. 15, 19, 30, 42
minimizing MrBayes 150–5, 152, 158 nomothetic 19, 20–1, 26, 29, 41–2
pairwise homoplasy 90–1 mRNA splicing 187–9, 188 NONA see also commands 79,
steps 35 multidomain proteins 192 121–3, 127–30, 132–7,
minimum multiple 135–7, 141–6
distances 25 alignments 72, 74, 79–80, 99, 100, non-additive binary coding 94
evolution 24 101–2, 107 non-binary nodes 183
method 111 characters 97 non-flat priors 151, 153
minimum mutation algorithm 99 methods 76 non-synonymous rate (Ka) 184
-evolution distance method equally parsimonious solutions normal distribution 1
187–8, 188 120 normative naturalism 35
independent homoplasies 92 losses 9 Notredame, C. 75, 101, 109
independent statements of range tests 61 NP-complete 7, 80, 92, 108
homoplasy 88 multistate characters 37–8, 163, 167 NP-hard 72, 76, 164, 172, 174
-length-spanning tree 165 Murata, M. 101 nucleotide/nucleotides 2, 4, 9, 32, 38,
number of steps 88 Mushegian, A.R. 194, 199 62, 65, 71–3, 77–80, 124, 127
Minkowski, H. mutation 48, 62, 98, 102, 105, characters 25
geometric representation 32 111–12, 167, 182–5, 190 data 8–9
min–max theorem 177 decrease fitness 184 Poisson model 26
Mirkin, B.G. 191, 193 and migration 48 positions 9, 26, 184
Mishler, B.D. 5, 57–70 myostatin 185 states 4, 19, 24, 32, 71
mitochondrial 9 teleost fish, gene duplication strings 9
model/models in 185 substitutions 77
-based methods 155, 156–7
of the evolutionary processes Nadeau, J.J. 115, 176 O’Hara, R.J. 23–4
64, 98 Nadeau-Taylor model 176 objective epistemologies 17
parameters 26, 36 National Center for Biotechnology objective function 71
parsimony and likelihood 3 Information (NCBI) 186 observed point of similarity 87
of sequence evolution 114 natural selection 30, 48, 50, 61 observed variation, m 102
Modrek, B. 188 Naylor, G.J.P. 38 for the data set as a whole, M 102
Moilanen, A. 93 Needleman, S.B. 72–3, 101, 108 Ochman, H. 191
molecular Neff, N. 59 Ochoterena, H. 99, 102
biology/genomics 65 Nei, M. 25, 92, 196 Ockham’s razor 1–2, 15–17, 40
evolutionary hypothesis neighbor joining 10, 92, 200 in phylogenetic inference 15–42
testing 6 Nelson, G. 27 Ohno, S. 184
phylogenetics 8 nematodes (see Oleksiak, M.F. 10
monophyletic groups 7, 22, 27, 29, Caenorhabditis elegans) one-stage searches 119, 121,
33–4, 39, 60–3, 69, 82, 84–7, 65, 194–5, 197, 198, 199 125, 130, 132–9, 135–6,
119, 150–4 neo-Darwinian 141, 142, 144–6
conventional 145 panmictic, sexually reproducing and likelihood

success rates vs efficiency groups 67 equivalence between 4, 11, 114
rates 137 parabola 2 phylogenic models 3–4, 46
ontogenetic parallelism 25 parallel gains 9 links between MP and ML 172–4
ontogeny 19, 83 parallelism 4 mathematical attributes of 163–78
ontological/ontologically 17, 19, paralogs/paralogy 6, 67, 96, 181, and maximum-likelihood 181
21, 23–4, 32, 35, 41, 58, 60, 68 183, 183–4, 191, 196 methods
consistency 35–6 neofunctionalization 184 for cladistic analysis 119
inconsistent 36 Paranona 127 for phylogenetic analysis
simplicity 16 parasites 191, 196, 199 119–22
status 22, 26 Park, C.M. 24 modern uses
of parsimony 16 Parsimony examples of 2–5
operationalisms 28, 32 analysis 5, 83, 97–116, 144–5, 158, a non-likelihood justification
operational taxonomic units 183, 192, 195 for 4
(OTUs) 58 at different scales 25, 63 ordinal equivalence with
operons 199 information storage and likelihood 46–8
optimal inner-node retrieval 5 phylogenetic information in
reconstructions 106 interconvertible 5 multistate characters
optimality of sequence data 97 163–78
criterion 25, 71, 75 anti-free parameters 18 and phylogenetics, in the
function, combined 115 background knowledge 17, 27–9, genomic age 1–11
optimization 7, 38–9, 71–2, 78 36, 42, 83–4, 96 Poisson model 170–2
alignment 76 vs Bayesian inference 1 pragmatic 18, 25
-based procedures 80 and Bayesian phylogenetics and its presuppositions 43–53
direct 77 148–59 principle of
fixed-state 77 character analysis and preliminaries 44–6
iterative-pass 77 optimization of probability of homoplasy-free
methods 76, 78–9 sequence characters 57–116 evolution 174–6
search-based 77 classifications and problems, general 99
of whole sequences as complex justifications 16–19 ratchet 119, 122–3, 126, 128,
characters 11 determination of what, it does 145–6
ordinal equivalence (OE) 46–8, 52 not presuppose 46–8 r-state character,
organellar DNA 6 determination of what it phylogenetically
orthologous 182 presupposes 48–53 informative 166–7
clusters 192 Dollo score 75, 164, 174
gene 65 reconstruction of genome distribution of 171–2
and paralogous genes 191–2 evolution 190–200 search efficiencies 137
robust identification of 191 equally weighted 22, 70 sequence data
sets 191, 196, 197 equitability with likelihood 7 problem of inapplicables
protein 191 Fitch 182 in 81–116
orthologs 100, 181, 183, 184, general background knowledge tree size 6–7
191–2, 196 17–18 and statistical consistency 5
orthology 6, 67, 96, 197 and genomics 5–11, 181–200 as a statistical method 156
outgroups 6, 16, 63, 82, 84–5, evolution 181–9 summary of the models 52
95, 104, 186, 194–5, 199 and Hennig–Farris auxiliary tree reconstruction 167–70
criterion 84 principle 87–8 weighted 27, 173, 193
roots and 84–5 infinite and large state particle physics 21
Ouzounis, C.A. 115, 193 spaces 163 partition theorem 165–6
equivalence between Parzen, E. 48
PAA (see Population Aggregation likelihood and 10 patristic 20
Analysis) result of 176–7 difference 33–4
Pääbo, S. 188 jackknife 122 unit character 33
Padian, K. 34 justifications in, phylogenetic pattern cladistics/cladists 25, 36,
Pagel, M.D. 41 inference 24–35 38, 40
pairwise sequence similarities 65, large numbers of taxa 121 Patterson, C. 16, 60
81, 89–90 and large phylogenetic trees PAUP* 121–2, 126–30, 132–3,
statements 93 conjecture on 7–8 144, 152, 158, 163, 193
PAUP* (cont.) single-character 84–6 differentiation 10

first release of 121 structure within named genetics 176
implementation of species 67 level hypotheses 39, 66
bootstrapping 158 systematics 33, 119 Posada, D. 27
ineffective analytical synthesis of 65 positional correspondences 97, 98,
strategies 121 tree 119, 168, 173, 176, 191, 193 106
jackknifing 158 ancestral states in a 181 positive selection 184, 185, 186–7
conventional searches ML estimator for 172 molecular basis of 185
with 126 X-tree 164–5, 169 pressure 184–6
PAUPRat 122 phylogenomics 1, 65 in ruminants 185
Paz-Ares, J. 7 phylogeny 183 structural basis of 185
Pee-Wee 121 definition 57 posterior probabilities 1, 148–56,
Pellegrini, M. 181, 183, 192 estimation of 148, 157 150–1, 153, 154–9
Penny, D. 3, 5, 8–10, 26, 47, 88, ontological status of 19–22 POY 75, 78–9, 79, 157
92–3, 155, 163, 172–4, 177 potential problem of statistical implied alignment 76, 79
permutation 10, 167, 174 approach to 155–7 pre-Darwinian 25
Peterson, K.J. 194 reconstruction methods of 148, predictive accuracy 27
PHAST 78 156 of models 18
pheneticists 58 in terms of Darwin’s principles pre-mRNA 195
phenetics 10, 25, 38, 40, 94 20 presumed apomorphies 86
phenotype/phenotypic 20, 40, to understand genomic Prim networks 25
49, 65 evolution 181 Prim’s algorithm 165
character 22, 50 phylogeography 42, 67 primary homologies 71–2, 80
divergence 185 physiological pathways 65 prior alignment 100, 103, 108
evolution 189 PHYSYS 120 Pritchard, P.C.H. 24
Phillips, A. 73, 74, 105 Pickett, K.M. 151 probabilistic
phylogenetics 192 Planet, P.J. 10 frequency 41
analyses 119–22, 183, 188, Plasmodium falciparum (Pf) 196–9, model 43, 49, 114, 156–7
191, 200 197, 199 of the evolutionary process 43
Bayesian approaches 149 Platnick, N.I. 27, 40, 81, 83 probability 36, 88
classification 68 Pleijel, F. 94, 96 of change 46, 69
conclusions 159 plesiomorphy 84–5 prokaryotes/prokaryotic 6, 9,
content, measure of 166 see also apomorphy 190–3, 199–200
deep vs shallow 62 Poe, S. 19, 28, 156 evolution of 191
hypotheses 36, 156 point estimations 149 gene orders, distance-tree
inference 20, 148 Poisson model 8, 26, 157–8, 168–9, analysis of 200
causality and scientific practice 171, 170–4 genomes 199
in 22–4 for DNA substitutions 148, lineages 200
Farris parsimony in 37–9 157–8 Prömel, H.J. 165
parsimony justifications in for gaps 158 protein
24–35 r-state 174 -coding gene 6
infinite alleles model 176 two-stage 174, 178 engineering 186
information in multistate Pol, D. 88, 147–59 function 185
characters 163–78 polarity statement 84 –protein interaction 187, 200
single r-state character 166 polymorphic inner nodes 89 structure-function 187
likelihood 7 polymorphism code 114 proto-splice sites 196–7
logic of data matrix 57–70 polynomial time 165, 176 pruning algorithm 7
marker 196 polytomies 129 pseudogenization 184
methods 181 settings 133, 135, 146 pseudoreplicates 114
models 3–4 in Nona 142 Ptolemy’s theory 1
and parsimony, in the genomic trees with 126 Pupko, T. 181
age 1–11 Popper, K. 4, 16–17, 19–21, 24, 26, putative homologous sequences
methods 119–22 28–30, 36, 42, 156–7 of characters or a sequence
and population genomics 11 philosophy 155–6 character 99–105
profiling 183 Population Aggregation Analysis
reconstruction 187 (PAA) 10 quantitative 37
signal 194 population character 50
parsimony 19, 31–2, 34 Royall, R. 1, 44 evolution 156

phenotype 49 rRNA sequences, stems and loops length 174
phylogenetic systematics in 115 regulatory 187
(QPS) 34, 36–38, 40–42 r-state 173–4 as whole complex characters 6
quartet and triplet methods 89, 92 character 9, 114, 167–8, 173 serial homologs 98
quasi-optimal trees 158 single 166–7 serial morphologies 72
Quine, W.V. 18 Jukes–Cantor model 170 severity of test 19, 20, 27, 36, 37
symmetric Poisson model 8 Sgaramella-Zonta, L.A. 25
R2R3-MYB gene family 7 rule of acceptance/evaluation 44 ‘shallow’ analyses 66
Raff, R.A. 194 Ruse, M. 24 ‘shallow’ relationships 62
Ragan, M.A. 181, 192 Russell, B. 28 Sharp, P.M. 74
Rain, J.-C. 187 Rutishauser, R. 96 Shenkin, P.S. 181
random cluster model 176 Rzhetsky, A. 25, 196 Sicheritz-Ponten, T. 191
random walk 174–5, 177 Siddall, M.E. 20, 36, 155–6
Rannala, B. 149, 150 Saccharomyces cerevisiae (Sc) 191–3, criticism of likelihood 155
rate matrix, time reversible 114 195, 197, 199 signal-to-noise ratio 187
raw or model-adjusted Saitou, N. 74, 92 Sikes, D.S. 122
differences 3 Salisbury, B.A. 27 Simmons, M.P. 39, 81, 99, 102–4
real time 144–5 Salmon, W.C. 19 Simon, D. 148–9, 150
reciprocal illumination 23, 30, 39, sampling density 62 simplicity 1
62, 83 Sanderson, M.J. 63, 66, 155 of a phylogenetic explanation
recombination 67 Sankoff, D. 7, 76–8, 91, 97, 115
reconstructed 98–9, 100–1, 103, 105, syntactic vs ontological 16–17
nodes 188 108–9, 115, 157 Simpson, G.G. 21
scenario, principal features Sarkar, I.N. 10 simulation studies 8
of 195 Sattler, R. 96 signed circular permutation 10,
reduction theory 18 saturation 7 167
regions of trees 89–90 Schizosaccharomyces pombe 191–2, single-character
regular walk processes 176 195, 197, 198–9 data sets 87
regulatory elements 65 Schwikowski, B. 99, 109 phylogenetic inference 84–6
Reichenbach, H. 28 Scriven, M. 20 single-column
relativistic physics 18 search efficiencies 126, 136, 138–9, characters 81, 114–115
Remane, A. 83 140 beyond them 96–7
research cycles 39 search-based optimization 76, 77 positional 97
restriction sites 191 Second Law of Thermodynamics single-stage analyses,
retention index 7, 167 200 conventional 145
reticulation 62, 66–8 sectorial searches 119, 122, 145–6 single-stage searches 143
Rexová, K. 40 Seitz, V. 81 Sinsheimer, J. 149
Rice, K.A. 121–4, 127–9, 133, 144–5 selection 52 Slade, N.A. 191
Richardson, A.O. 196 of characters 70 Slatkin, M. 68
Rieppel, O. 21, 26, 31, 38, 83–4 in a finite population 50 Slowinski, J.B. 72
Rieseberg, L.H. 66 selfish-operon concept 200 smallest generating sets 90
Rogers, J. 151 Sellers, P.H. 101 Smets, E. 39, 81, 89, 91–3
Rogozin, I.B. 6, 9, 190, 196 semaphoronts 58–62 Smith, R.L. 40, 66, 105–6
Rokas, A. 163 global analysis of 62–4 Smith, T.F. 105
Romero, I. 7 Semple, C. 164, 168, 176 Smith, V.S. 28, 39–40
Ronquist, F. 150 sequence Smith/Quackdoodle theorem 28
rooted/roots 82, 84 characters (complex) 98, 105, Smouse, P.E. 149, 174
branching pattern 24 108, 110, 111, 112–13, 114–15 Snel, B. 191, 193–4
position of 106 analysis and optimization of Sober, E. 1–3, 15, 18–21, 26–8, 30–1,
tree 29, 37, 47, 51, 164, 169, 182, parsimony 57–116 35–7, 41, 43, 45, 47, 49–50, 53
187, 193–4 and branch support 114, 116 n.11
Rosenberg, C. 196 data 71, 81 Sokal, R.R. 20, 24, 33, 94
Rossnes, R. 187 parsimony analysis of 97–116 Soltis, D.E. 66, 126, 128, 133
Roth, J.R. 200 and the problem of Sonnhammer, E.L. 191–2
Roth, V. 60 inapplicables 81–116 speciation events 20, 181–2, 183
Rousseau, P. 76, 91, 98–9 dense sampling of 173 speciation nodes 183
species probability 170 three-gene (567-terminal)

diversification 20 success rate 136, 137, 139, matrix 126, 128, 133,
‘good’ 67 140–1, 142 143, 144–5
level 67 successive weighting 27, 30 data 128
tree 181–2, 197 Sums-of-pairs (SP) preliminary analyses of
mRNA splice variants 188–9 alignments 101–2, 104 143–4
spurious homoplasy 64 supersites 9 Tierney, L. 149
stages of expression 26, 33, 35 supertree 63, 68 TNT 6, 10, 78, 122, 153,
star phylogeny 48 Suzuki, Y. 158 158–9
starting tree 120–1, 124–5, 129, 133 swapping 125 tokogeny/tokogenetic 20, 25, 42
state-change probabilities 5 to completion 121 topography 83
state space 163, 168, 170, 173, 176 initial period of 124 topology 3–5, 44, 47, 52, 63–7,
infinite and large 176 intensive 139 72, 79, 148–52, 177, 192,
statistical argument 87, 88 regime 129 194–5, 198–200
to defend parsimony 4 Swofford, D.L. 25, 28, 66, -independent 72
statistical consistency 5, 53, 151, 120–1, 155, 157, 193 traceback 73
155, 177–8 symmetric change assumption 4 tradeoff 125
Steel, M.A. 1, 3, 4–5, 8–10, 26–7, symmetry 100–1 transformation 71
43, 46–7, 88, 92, 114, symplesiomorphy 34, 52, 85–6 events 30
155, 163–4, 166–8, synapomorphy 22, 34, 60, series 18–20, 23, 26, 28,
172–4, 176–7 85–7, 119, 153 31–8, 40, 76, 79,
Steffansson, P. 182 synonymous nucleotide 85, 95–6
Steger, A. 165 substitution rate Ks 184 transformational homologs 61
Steiner points 174 syntactic simplicity 16–17 transition 36
stemmatology 98 synteny 10, 65 probabilities 51
step matrices 91, 99 systematic error 30 transition:transversion (ts:tv)
Stevens, P.F. 60–1 systems biology 187, 200 ratios 150
Stewart, C. 184–5, 185 Sytsma, K.J. 66 translocations 65, 115, 157
Stirling’s approximation 168 Székely, L.A. 177 transversions 36
stochastic model 149 tree
of character evolution 88, Tatusov, R.L. 192, 196 alignment 72, 98, 98–9, 101,
92, 170 taxa, large numbers of 158–9 103, 105–8, 110, 110,
stochastic properties of MP taxic homology 60 112, 113
parsimony 163 taxon 20, 34, 58, 68, 76, 123, generalized 99
Storm, C.E. 191–2 127, 134, 137, 151–4, 159 branch length, differential 6
strict consensus 159 -entry order 120, 143 building 69–70
string/strings 9, 10, 158 Taylor, B.A. 176 credible set of 159
Strong, E. 81 (TBR) tree-bisection distance 99
structural vs DNA sequence reconnection, see also branch drifting 119, 122–3, 145–6
characters 64–5 breaking (BB) 121–2, 126, fully collapsed 3
Stuart, J.M. 65 129, 133 fusion 119, 122–3, 145–7
Stuessy, T.F. 24 branch swapping 79, 153 islands 121–2
subcharacters 107, 109–13, Tehler, A. 122, 124, 146 /matrix interconvertibility 7
112–13 Tellgren, Å. 185 non-binary species
subfunctionalization 184 terminal units (TUs) 57–8 mapping 182
subjective epistemology 17 characters and data base from orthologous gene
suboptimal SP alignments 102 design 68 presence vs absence 9
suboptimal trees 146–7 relationships 61–2 parent 124
subsequence/subsequences 97, representation 58–9 phylogenetic, large 7,
104, 107–8, 110, 111, 113 testability 17, 19, 28–9, 36–7, 41 168, 171
homology 105–12 Thanaraj, T.A. 188 reconstruction, ancestral
and compositional Theriot, E. 59, 66–7 states 168–70
homology 105–6 thermodynamic analogy 200 -search efficiency 130,
of a tree alignment, Thompson, J.D. 101–2, 104 140, 145
quantification 106–7 Thorne, J.L. 76, 114, 158 branch-swapping ratio 138
substitution/substitutions 38, 74 Thorne–Kishino–Felsenstein of species or genes 2–3
cost 104–5, 109 (TKF) model 76 topology 148, 194
unrooted 193 unit character 9, 38 weighting algorithms 69

TREEALIGN 74 unit gap 98, 101, 105, 107 Wheeler, Q.D. 67
tree bisection and reconnection cost 38, 99, 109, 113 Wheeler, W.C. 6, 16, 27, 30–1,
(TBR) see also branch position 101 38, 42, 58, 67–8, 71, 74–6,
breaking (BB) 120, 159 77, 78–80, 99, 105,
branch swapping 153 Vander, S.G. 115 109, 157
Treezilla 127 verificationism 37 Wiley, E. 28, 59–60
triplet and quartet methods 91 vertebrates 157, 194 Wilkinson, M. 90
true tree 62, 66, 149 vertex-transitive graphs 175 Willi Hennig Society 9
Tuffley, C. 4, 23, 26–7, 43, vicariance biogeography WinClada 122, 127–8, 146
46–7, 53, 88, 114, 26, 41 Wolf, Y.I. 190, 194, 200
172–3 Vingron, M. 99, 109 Woodger 32
no common mechanism model viral-mediated lateral Wrinch, D. 16
(N) 53 n.11 transfer 67 Wunsch, C.D. 72–3, 76, 101, 108
tumor classification 10 visual pigment 185–6, 188
TUs 58–65, 67–8 voucher specimen 59
Vrana, P. 58, 68 x, y plane 2
close link between characters
X tree 164–6, 168–9, 171, 173
and 65
see also terminal units Wagner, Jr., W.H. 33,
two-item analysis 81 119–20 Yang, Z.H. 26, 69, 149–50, 181,
two-stage analyses 121, 133, algorithm 120 184–6
139 method 33–4 Yeates, D. 82, 182
two-stage search 125, 132, networks 25 Yoder, A. 127
138–43, 140–1, 142–6 parsimony 187
key advantage of 125 and Prim networks 25
tree 124, 128, 130–1, 153 zilla 119, 123, 126–8, 130,
overall efficiency of 142
analysis 120 132–3, 135, 137, 141,
search efficiencies for
suboptimal 120 143–5
calculations of 139
Walsh, D. 17–18 across software platforms
strategies 119
Wang, L.S. 10, 72, 76, 108–10, comparability of
167 127–8
Uddin, M. 10 analyses of 144
Watanabe, H. 199
uniformitarianism 27 Waterston, R.H. 191 one-, two-stage analyses of
uniformly covering 165 wavefront update 73 132


