Professional Documents
Culture Documents
Evolutionary Genomics: Statistical and Computational Methods
Evolutionary Genomics: Statistical and Computational Methods
Evolutionary
Genomics
Statistical and Computational
Methods
Second Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Second Edition
Edited by
Maria Anisimova
Institute of Applied Simulations, School of Life Sciences and Facility Management, Zurich University
of Applied Sciences (ZHAW), W€adenswil, Switzerland
Swiss Institute of Bioinformatics, Lausanne, Switzerland
Editor
Maria Anisimova
Institute of Applied Simulations
School of Life Sciences and Facility Management
Zurich University of Applied Sciences (ZHAW)
W€adenswil, Switzerland
Swiss Institute of Bioinformatics
Lausanne, Switzerland
This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer
Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface
This volume is a thoroughly revised second edition of Evolutionary Genomics: Statistical and
Computational Methods published in 2012. Like the first edition, the new volume includes
comprehensive reviews of the most recent and fundamental developments in bioinformatics
methods for evolutionary genomics and related challenges associated with increasing data
size, heterogeneity, and its inherent complexity.
Throughout the volume, prominent authors address the challenge of analyzing and
understanding the dynamics of complex biological systems, and elaborate on some
promising strategies that would bring us closer to the ultimate “holy grail” of biology—
uncovering of the relationships between genotype and phenotype. Consequently, the pre-
sented collection of peer-reviewed articles also represents a synergy between theoretical and
experimental scientists from a range of disciplines, working together towards a common
goal. Once again, the revised volume reiterates the power of taking an evolutionary
approach to study molecular data.
This book is intended for scientists looking for a compact overview of the cutting-edge
statistical and computational methods in evolutionary genomics. The volume may serve as a
comprehensive guide for both graduate and advanced undergraduate students planning to
specialize in genomics and bioinformatics. Equally, the volume should be helpful for
experienced researchers entering genomics from more fundamental disciplines, such as
statistics, computer science, physics, and biology. In other words, the material presented
here should suit both a novice in biology with strong statistics and computational skills and a
molecular biologist with a good grasp of standard mathematical concepts. To cater to
differences in reader backgrounds, Part I is composed of educational primers to help with
fundamental concepts in genome biology (Chapter 1), probability and statistics (Chapter 2),
and molecular evolution (Chapter 3). As these concepts reappear repeatedly throughout the
book, the first three chapters will help the neophyte to stay “afloat”. The exercises and
questions offered at the end of each chapter serve to deepen the understanding of the
material.
Part II of this volume focuses on sequence homology and alignment—from aligning
whole genomes (Chapter 4) to disentangling orthologs, paralogs, and transposable ele-
ments (Chapters 5 and 6). Part III includes chapters on phylogenetic methods to study
genome evolution. Chapter 7 presents multispecies coalescent methods for reconciling
phylogenetic discord between gene and species trees. However, a mathematically convenient
“binary tree” model does not always live up to scrutiny as numerous evolutionary processes
act in reticulate (network-like) fashion, complicating the statistical description of evolution-
ary models and increasing computational complexity, often to prohibitive levels. One
simplification is to assume that some molecular sequence units (genes, gene segments)
still evolve in a treelike manner. If so, Chapter 8 describes one practical approach to
meaningfully summarize the binary tree distributions for a set of genomes as a “forest of
trees”. Alternatively network-like phylogenetic relationships can be represented by graphs
(Chapter 9). Dating methods for genome-scale data are discussed in Chapter 10, while
Chapter 11 provides more examples of non-treelike processes in a comparative review of
genome evolution in different breeding systems.
v
vi Preface
vii
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Contributors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
Contributors
xiii
xiv Contributors
Abstract
Organisms display astonishing levels of cell and molecular diversity, including genome size, shape, and
architecture. In this chapter, we review how the genome can be viewed as both a structural and an
informational unit of biological diversity and explicitly define our intended meaning of genetic information.
A brief overview of the characteristic features of bacterial, archaeal, and eukaryotic cell types and viruses sets
the stage for a review of the differences in organization, size, and packaging strategies of their genomes. We
include a detailed review of genetic elements found outside the primary chromosomal structures, as these
provide insights into how genomes are sometimes viewed as incomplete informational entities. Lastly, we
reassess the definition of the genome in light of recent advancements in our understanding of the diversity
of genomic structures and the mechanisms by which genetic information is expressed within the cell.
Collectively, these topics comprise a good introduction to genome biology for the newcomer to the field
and provide a valuable reference for those developing new statistical or computation methods in genomics.
This review also prepares the reader for anticipated transformations in thinking as the field of genome
biology progresses.
Key words Organism diversity, Viruses, Prokaryotes, Eukaryotes, Organelles, DNA, RNA, Protein,
Regulatory DNA, Epigenetics, Plasmids, Transcription, Translation, DNA replication, Chromatin,
Gene structure
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_1, © The Author(s) 2019
3
4 Noor Youssef et al.
Cells are the smallest living unit of an organism. All cells have three
attributes in common: cell membrane, cytoplasm, and genome.
Structurally, cells can be divided into two basic types: prokaryotic
and eukaryotic cells. Eukaryotic cells tend to be more complex.
They possess a nucleus and other membrane-bound organelles,
which are specialized components in the cell that perform unique
functions (e.g., nucleus, mitochondria, plastids). Conversely, pro-
karyotic cells lack membrane-bound organelles. Although similar in
cell structure, prokaryotes include two fundamentally distinct
domains: the eubacteria (true bacteria, often referred to simply as
bacteria) and the archaea.
Introduction to Genome Biology and Diversity 5
2.1 Viruses Viruses are infectious agents of living cells that are unable to repro-
duce in the absence of a host. Viruses are not considered cellular
entities since they lack two of the essential attributes that define a
cell; they possess neither a cell membrane nor cytoplasm. The
discovery of virophages, viruses that parasitize other viruses, resur-
rected the debate on their classification as living organisms
[14]. Some consider viruses to be living entities since they can be
hosts to other viruses, with a virophage infection leading to the
eventual death of the host virus, implying an initial “living” state
[15]. The opposing view asserts that a virus’ inability to reproduce
outside of a cellular host makes them nonliving entities
[16, 17]. Irrespective of their delineation as living or nonliving,
viruses are relevant to this review as they possess genomes and are
the most abundant biological replicators in the biosphere [18].
Outside of their host, viruses exist as viral particles (virions)
consisting of a protein capsule that protects and encloses their
genome. Once a virion has entered a host cell, it “hijacks” the
host’s cellular structures and processes to carry out the metaboli-
cally active phase of the viral life cycle. At this stage, the virus
exhibits physiological properties reminiscent of living cells; they
metabolize, grow, and reproduce. There is a wide range of viral
lifestyles, with corresponding diversity in viral forms, sizes, hosts,
and genomes [16]. The largest known virus, the mimivirus, was
originally identified as an infectious agent of an amoeba [19] and
can itself become a host for virophages [14]. To put this in context,
the virion of a mimivirus can be larger than some prokaryotic cells
[16]. At the other end of the scale are viruses such as the circo-
viruses, some of which have small genomes made up of less than
2000 nucleotides [20]. A more detailed account of viral diversity
can be found at the ViralZone website [21].
2.2 Bacteria The bacterial cell is prokaryotic, and it is relatively simple as com-
pared to eukaryotic cells. It has no membrane-bound organelles,
and the chromosome (usually one) is not separated from the other
components of the cell. While predominantly unicellular, they often
live in biofilms, a community of cells bound together by a secreted
6 Noor Youssef et al.
2.3 Archaea Archaea are single-celled organisms that appear strikingly similar to
bacteria under light and electron microscopes. Like bacteria they
often have a single circular chromosome and lack a nucleus, and for
a long period of time the archaea were wrongly categorized as
bacteria. The first indication that the archaea might be a separate
domain of life was obtained from phylogenetic analyses of the 16S
rRNA gene [36]. Advancements in genome sequencing and analy-
sis yielded further evidence of the evolutionary distinction between
the bacterial and archaeal domains [37]. Despite their superficial
cellular similarity to bacteria, the archaea have many molecular-level
similarities to eukaryotes, leading researchers to hypothesize that
the ancestor of the eukaryotes arose within the archaea [38].
Previously, archaea were assumed to be a minor group of
organisms inhabiting extreme environments beyond the tolerance
of bacteria (salt brines, hydrothermal vents, acidic and anoxic
Introduction to Genome Biology and Diversity 7
3.1 Viral Genomes Viruses use any combination of either RNA or DNA, either single-
or double-stranded molecules, in either circular or linear forms, to
encode their genetic instructions [75, 76]. The viral genetic mate-
rial is typically referred to as segments rather than chromosomes.
Viral genomes composed of multiple segments are referred to as
segmented. When different strains of the same segmented viral
species infect a cell, genomes from the different strains can mix to
produce hybrids—a process known as reassortment. Hybrid flus
such as the H1N1 swine influenza A virus originated in this
way [77].
Viral strains package their genomes in various ways. Most DNA
and RNA viruses with small genomes (<20 kb) employ energy-
independent packaging systems where capsid assembly and genome
condensation are coupled. One example is the RNA genome of the
HIV retrovirus that, in the mature virion, forms a RNA-protein
complex with one of the cleavage products of the Gag polyprotein
[78]. Other viruses, such as the lambda bacteriophage, require ATP
to pump their genome directly into a preassembled capsid
[79]. The latter type of machinery is ubiquitous in bacterial viruses.
Alternatively, large viruses package their genome using histone-like
proteins that are critical for eukaryotic genome packaging [80]. For
a review on genome packaging in viruses, see ref. 81.
Introduction to Genome Biology and Diversity 11
3.3 Archaeal Archaeal genomes share features with both bacteria and eukaryotes.
Genomes Archaea typically possess circular chromosomes reminiscent of bac-
teria genomes; some have a single chromosome and a single origin
12 Noor Youssef et al.
3.5 Auxiliary DNA Both prokaryotes and eukaryotes have secondary chromosomal
Structures structures. For eukaryotes, this refers to any form of DNA found
outside of a nucleus—although the discovery of microDNA
extends this classification [109]. Eukaryotic auxiliary DNA often
contains essential genes that are necessary for normal cell produc-
tion. For example, the DNA chromosome located within the mito-
chondrial organelle encodes genes that are involved in oxidative
phosphorylation and the creation of different types of RNA
[110]. For prokaryotes, auxiliary DNA refers to any DNA that is
not associated with the primary chromosome, and unlike eukar-
yotes, the genes encoded in such DNA are often dispensable. For
example, small circular chromosomes, called plasmids, often
14 Noor Youssef et al.
3.5.2 Plastid DNA Plastids are similarly derived from an endosymbiosis with a bacte-
rium, with the organelle retaining remnants of that ancestral bacte-
rial genome. Like the mitochondrion, the plastid is a double
membrane-bound cytoplasmic organelle. Unlike the mitochon-
drion, plastids often contain pigment used in photosynthesis. Plas-
tids are found in the cytoplasm of protists and all higher plants.
Plastid DNA (ptDNA) is highly reduced relative to the genomes of
extant photosynthetic bacteria. In part, the reduction in genome
Introduction to Genome Biology and Diversity 15
size is due to gene loss with some regions excised and incorporated
into the host nuclear DNA [116]. The ptDNA encodes important
proteins that are essential for cell viability [117]. Almost all plastids
have circular DNA, with the alveolate Chromera velia being the
single known case of linear ptDNA. The linear extrachromosomal
ptDNA has a telomere arrangement resembling those of linear
mtDNA [117, 118].
Genes encoded in ptDNA are involved in the synthesis and
storage of various cellular components, including those necessary
for photosynthesis. Plastids have diverged to carry out different
functions with multiple types identified. For example, chloroplasts
are specialized for carrying out photosynthesis; chromoplasts con-
tain pigments that provide petal colors, whereas amyloplasts are
used for bulk storage of starch [117].
3.5.4 Plasmid DNA Plasmids are present in bacteria, archaea, and eukaryotes [120].
Most plasmids are circular, although linear plasmids have been
identified [121]. The genes carried on plasmids tend to be asso-
ciated with functions that enable or enhance survival and growth
under specific conditions. They can be horizontally transferred
between prokaryotic cells and represent an important vehicle for
sharing genetic information [122]. For example, a plasmid that has
evolved an antibiotic resistance gene(s) can be transferred to neigh-
boring bacteria promoting their rapid adaptation to various stresses
associated with an antibiotic environment.
The eubacteria E. coli is estimated to have more than 270 plas-
mids having different distributions among and within cells; some
promote mating, while others contain genes that kill other bacteria.
The number of plasmids known and sequenced is much higher in
bacteria as compared to archaea, with the lowest number having
16 Noor Youssef et al.
4.1 Gene Expression Mere knowledge of the DNA sequence of a genome is often insuf-
ficient to predict phenotype. The amount and timing of gene
expression play a key role. For example, human cells with a nucleus
have copies of almost identical DNA sequences. Yet cells perform
varying functions, and they organize to create the multiple organs
that constitute the human body. Cells achieve this primarily by
differentially regulating the rate of transcription and/or translation
of genes.
DNA transcription and protein translation comprise elemen-
tary levels of information transfer from genotype to phenotype.
Maintaining control of these processes is fundamental for all organ-
isms. Genetic elements involved in regulating gene expression are
referred to as regulatory elements. They often represent sequences
found on the DNA or RNA. In this way, regulatory information can
be encoded directly within the nucleic acid sequence. Direct struc-
tural proximity if often not necessary, as regulatory elements may be
found proximal or distal to the genes they affect. In humans,
approximately 8% of nuclear DNA is composed of elements
involved directly in regulation such as promoters, enhancers, silen-
cers, and insulators (defined in Subheading 4.1.1; [127, 128]).
If all genetic and regulatory information is encoded in the
DNA sequence, why can’t any cell with a complete genome be
used to produce a viable organism? The specificity of cells suggests
that additional regulatory markers also exist outside of the primary
DNA sequence. This type of regulation is epigenetic (above the
genes) and is essential for normal development. Epigenetic infor-
mation is derived from chemical modifications of the chromosome
(e.g., DNA methylation or histone modification) that do not
change the primary sequence of chromosomal DNA and can be
18 Noor Youssef et al.
4.1.1 Transcriptional DNA transcription is the chemical process through which informa-
Regulation tion is transferred from DNA to RNA. The transcribed RNA may
itself carry out some biological function or may be part of an
intermediate information-carrying class of RNA known as messen-
ger RNA (mRNA). mRNA along with other RNA molecules
(tRNA and rRNA) are part of the machinery used to synthesize
proteins. The flow of genetic information from DNA to RNA to
protein is present in all forms of life. However, it is important to
note that information transfer is not exclusively unidirectional. The
enzyme reverse transcriptase can transfer genetic information from
an RNA template into DNA.
The basic model of transcriptional regulation requires that
regulatory proteins called transcriptional factors (TFs) bind specific
DNA sequences in regulatory modules (RMs). TFs are protein
products that are themselves subjected to regulation of gene
expression. RMs are defined according to both the primary DNA
sequence to which TFs bind and their role in the process of reg-
ulating gene expression. One type of RMs are promoters. They are
specific motifs on DNA that are necessary regulatory elements for
RNA transcription in prokaryotes and eukaryotes. They bind the
basal transcriptional machinery, RNA polymerase and general TFs.
Enhancers are RMs that bind activator proteins and enhance the
affinity of RNA polymerase to the promoter region. They, there-
fore, result in an upregulation of transcription of a gene or set of
genes. Enhancers often act by stabilizing RNA polymerase binding
through structural histone modifications [131]. Silencers are regu-
latory elements that when bound to repressor proteins function to
prevent gene transcription. Silencers and enhancers are often
distance-independent, meaning that they can act on gene(s) that
are proximal or distal to their location [132]. Enhancers can be
thought of as on-switches for gene expression, whereas silencers are
the off-switches.
4.1.2 Translational The fate of all mRNAs, transcribed from protein-coding genes, is
Regulation not the same. The mRNA is often subjected to translational regu-
lation depending on cellular and environmental conditions. These
regulatory mechanisms affect the rate of protein synthesis. In pro-
karyotes and eukaryotes, most translational regulation involves
structural changes in the mRNA molecule that impact its accessi-
bility [133, 134]. The mRNAs can be sequestered in stress granules
or localized in specific regions of a cell’s cytoplasm
[135–137]. Another mechanism of translational regulation is
Introduction to Genome Biology and Diversity 19
4.2 Mobile Genetic Also known as transposons or jumping genes, mobile genetic ele-
Elements ments are sequences that can move around within a genome inde-
pendently of the complex networks which otherwise regulate gene
expression [146]. Through their movement, transposons often
cause mutations either by inserting into a gene and disturbing its
function or by promoting DNA rearrangement. If a transposon is
inserted within a protein-coding region, then it will undoubtedly
affect the expression of this gene by altering the final protein
product. Transposons may also be inserted into regulatory regions
resulting in over- or under-expression of certain gene(s). The capa-
bility of these DNA sequences to produce new copies of themselves
elsewhere in a genome is called transposition. The two types of
transposition are:
Copy-and-paste (replicative) transposition: a new copy of the trans-
posable element is inserted into a new site, while the old copy
remains integrated into the original site [147]. This type of
transposition requires transfer of information into an RNA
intermediate (retrotransposons) and subsequent retro-
transcription into DNA. This mechanism results in an increase
in the number transposon copies.
Cut-and-paste (non-replicative or conservative) transposition: the
transposable element is excised from the old site and is inserted
into a new site in the genome. The number of transposons is
not increased in this case [147].
Transposable elements are found in all cell types. The kinds of
transposable elements vary within and between prokaryotes and
eukaryotes. They are often viewed as genetic parasites since they
rely on a host cell for information processing systems (replication,
transcription, and/or translation). In humans, about 44% of the
genome is comprised of sequences that are related to transposable
Introduction to Genome Biology and Diversity 21
References
1. Lederberg J, McCray AT (2001) Ome 6. Mueller RL (2015) Genome biology and the
SweetOmics--a genealogical treasury of evolution of cell-size diversity. Cold Spring
words. Scientist 15(7):8 Harb Perspect Biol 7(11):a019125
2. Patra S, Andrew AA (2015) Human, social, 7. Kysela DT, Randich AM, Caccamo PD, Brun
and environmental impacts of human genetic YV (2016) Diversity takes shape: understand-
engineering. J Biomed Sci 4:2 ing the mechanistic and adaptive basis of bac-
3. Behjati S, Tarpey PS (2013) What is next terial morphology. PLoS Biol 14(10):
generation sequencing? Arch Dis Child Educ e1002565
Pract Ed 98(6):236–238 8. Minelli A, Fusco G (2010) Developmental
4. Goldman AD, Landweber LF (2016) What is plasticity and the evolution of animal complex
a genome? PLoS Genet 12(7):e1006181 life cycles. Philos Trans R Soc Lond B Biol Sci
5. Tyler-Smith C, Yang H, Landweber LF, 365(1540):631–640
Dunham I, Knoppers BM, Donnelly P et al 9. Forster SC (2017) Illuminating microbial
(2015) Where next for genetics and geno- diversity. Nat Rev Microbiol 15(10):578
mics? PLoS Biol 13(7):e1002216
Introduction to Genome Biology and Diversity 25
10. Carroll SB (2001) Chance and necessity: the 28. Jovel J, Patterson J, Wang W, Hotte N,
evolution of morphological complexity and O’Keefe S, Mitchel T et al (2016) Characteri-
diversity. Nature 409(6823):1102 zation of the gut microbiome using 16S or
11. History of life through time UCMP. www. shotgun metagenomics. Front Microbiol
ucmp.berkeley.edu/exhibits/historyoflife. 7:459
php 29. Peterson BF, Scharf ME (2016) Metatran-
12. The tree of life web project. tolweb.org scriptome analysis reveals bacterial symbiont
13. The encyclopedia of life. eol.org contributions to lower termite physiology and
potential immune functions. BMC Genomics
14. Claverie J, Abergel C (2009) Mimivirus and 17(1):772
its virophage. Annu Rev Genet 43:49–66
30. Young KD (2006) The selective value of bac-
15. Pearson H (2008) ‘Virophages’ suggests terial shape. Microbiol Mol Biol Rev 70
viruses are alive. Nature 454(7205):677 (3):660–703
16. Forterre P (2010) Defining life: the virus 31. Willis L, Huang KC (2017) Sizing up the
viewpoint. Orig Life Evol Biosph 40 bacterial cell cycle. Nat Rev Microbiol 15
(2):151–160 (10):606–620
17. Koonin EV, Starokadomskyy P (2016) Are 32. Haeusser DP, Levin PA (2008) The great
viruses alive? The replicator paradigm sheds divide: coordinating cell cycle events during
decisive light on an old but misguided ques- bacterial growth and division. Curr Opin
tion. Stud Hist Philos Biol Biomed Sci Microbiol 11(2):94–99
59:125–134
33. Thanbichler M (2010) Synchronization of
18. Koonin EV (2010) The wonder world of chromosome dynamics and cell division in
microbial viruses. Expert Rev Anti-Infect bacteria. Cold Spring Harb Perspect Biol 2
Ther 8(10):1097–1099 (1):a000331
19. Raoult D, Audic S, Robert C, Abergel C, 34. Brown PJ, Hardy GG, Trimble MJ, Brun YV
Renesto P, Ogata H et al (2004) The 1.2- (2008) Complex regulatory pathways coordi-
megabase genome sequence of Mimivirus. nate cell-cycle progression and development
Science 306(5700):1344–1350 in Caulobacter crescentus. Adv Microb
20. Finsterbusch T, Mankertz A (2009) Porcine Physiol 54:1–101
circoviruses—small but powerful. Virus Res 35. Frost LS, Leplae R, Summers AO, Toussaint
143(2):177–183 A (2005) Mobile genetic elements: the agents
21. Swiss Institute of Bioinformatics. ViralZone. of open source evolution. Nat Rev Microbiol
http://www.expasy.org 3(9):722
22. Nadell CD, Drescher K, Foster KR (2016) 36. Woese CR, Fox GE (1977) Phylogenetic
Spatial structure, cooperation and competi- structure of the prokaryotic domain: the pri-
tion in biofilms. Nat Rev Microbiol 14 mary kingdoms. Proc Natl Acad Sci U S A 74
(9):589–600 (11):5088–5090
23. Rosenberg SM (2009) Life, death, differenti- 37. Hug LA, Baker BJ, Anantharaman K, Brown
ation, and the multicellularity of bacteria. CT, Probst AJ, Castelle CJ et al (2016) A new
PLoS Genet 5(3):e1000418 view of the tree of life. Nat Microbiol 1:16048
24. Flores E, Herrero A (2010) Compartmenta- 38. Williams TA, Foster PG, Cox CJ, Embley TM
lized function through cell differentiation in (2013) An archaeal origin of eukaryotes sup-
filamentous cyanobacteria. Nat RevMicrobiol ports only two primary domains of life.
8(1):39 Nature 504:231–236
25. Lasken RS, McLean JS (2014) Recent 39. Eckburg PB, Lepp PW, Relman DA (2003)
advances in genomic DNA sequencing of Archaea and their potential role in human
microbial species from single cells. Nat Rev disease. Infect Immun 71(2):591–596
Genet 15(9):577–584 40. Nakamura N, Lin HC, McSweeney CS,
26. Stewart EJ (2012) Growing unculturable bac- Mackie RI, Gaskins HR (2010) Mechanisms
teria. J Bacteriol 194(16):4151–4160 of microbial hydrogen disposal in the human
27. Qin Y, Hou J, Deng M, Liu Q, Wu C, Ji Y, He colon and implications for health and disease.
X (2016) Bacterial abundance and diversity in Annu Rev Food Sci Technol 1:363–395
pond water supplied with different feeds. Sci 41. Saengkerdsub S, Ricke SC (2014) Ecology
Rep 6:35232 and characteristics of methanogenic archaea
26 Noor Youssef et al.
in animals and humans. Crit Rev Microbiol 40 metabolism in archaea. Curr Opin Microbiol
(2):97–116 8(6):695–705
42. Jarrell KF, Walters AD, Bochiwal C, Borgia 56. Coelho SM, Peters AF, Charrier B, Roze D,
JM, Dickinson T, Chong JP (2011) Major Destombe C, Valero M, Cock JM (2007)
players on the microbial stage: why archaea Complex life cycles of multicellular eukar-
are important. Microbiology 157 yotes: new approaches based on the use of
(4):919–936 model organisms. Gene 406(1):152–170
43. Prosser JI, Nicol GW (2008) Relative contri- 57. Adl SM, Simpson AG, Lane CE, Lukeš J,
butions of archaea and bacteria to aerobic Bass D, Bowser SS et al (2012) The revised
ammonia oxidation in the environment. Envi- classification of eukaryotes. J Eukaryot Micro-
ron Microbiol 10(11):2931–2941 biol 59(5):429–514
44. Leigh JA (2000) Nitrogen fixation in metha- 58. Mathur J (2004) Cell shape development in
nogens: the archaeal perspective. Curr Issues plants. Trends Plant Sci 9(12):583–590
Mol Biol 2:125–131 59. Mogilner A, Keren K (2009) The shape of
45. DeLong EF (1998) Everything in modera- motile cells. Curr Biol 19(17):R771
tion: archaea as ‘non-extremophiles’. Curr 60. Fagarasanu A, Rachubinski RA (2007)
Opin Genet Dev 8(6):649–654 Orchestrating organelle inheritance in Sac-
46. Kandler O, König H (1998) Cell wall poly- charomyces cerevisiae. Curr Opin Microbiol
mers in archaea (archaebacteria). Cell Mol 10(6):528–538
Life Sci 54(4):305–308 61. Bornens M (2008) Organelle positioning and
47. Vollmer W, Bertsche U (2008) Murein (pep- cell polarity. Nat RevMol Cell Biol 9(11):874
tidoglycan) structure, architecture and bio- 62. Dyall SD, Brown MT, Johnson PJ (2004)
synthesis in Escherichia coli. Biochim Ancient invasions: from endosymbionts to
Biophys Acta 1778(9):1714–1734 organelles. Science (New York, NY) 304
48. Kates M (1992) Archaebacterial lipids: struc- (5668):253–257
ture, biosynthesis and function. Biochem Soc 63. Corliss JO (2002) Biodiversity and biocom-
Symp 58:51–72 plexity of the protists and an overview of their
49. Sato T, Atomi H (2011) Novel metabolic significant roles in maintenance of our bio-
pathways in archaea. Curr Opin Microbiol sphere. Acta Protozool 41(3):199–220
14(3):307–314 64. Schlegel M, Hülsmann N (2007) Protists–a
50. Lombard J, López-Garcı́a P, Moreira D textbook example for a paraphyletic taxon.
(2012) Phylogenomic investigation of phos- Organisms Divers Evol 7(2):166–172
pholipid synthesis in archaea. Archaea 65. Parfrey LW, Walters WA, Knight R (2011)
2012:630910 Microbial eukaryotes in the human micro-
51. Forterre P (2013) The common ancestor of biome: ecology, evolution, and future direc-
archaea and eukarya was not an archaeon. tions. Front Microbiol 2:153
Archaea 2013:372396 66. Caron DA, Alexander H, Allen AE, Archibald
52. Zillig W (1991) Comparative biochemistry of JM, Armbrust EV, Bachy C et al (2017) Prob-
archaea and bacteria. Curr Opin Genet Dev 1 ing the evolution, ecology and physiology of
(4):544–551 marine protists using transcriptomics. Nat
53. Moissl C, Rachel R, Briegel A, Engelhardt H, Rev Microbiol 15(1):6–20
Huber R (2005) The unique structure of 67. Sutton WS (1902) On the morphology of the
archaeal ‘hami’, highly complex cell appen- chromoso group in Brachystola magna. Biol
dages with nano-grappling hooks. Mol Bull 4(1):24–39
Microbiol 56(2):361–370 68. O’Donnell M, Langston L, Stillman B (2013)
54. Szabo Z, Stahl AO, Albers SV, Kissinger JC, Principles and concepts of DNA replication in
Driessen AJ, Pohlschroder M (2007) Identifi- bacteria, archaea, and eukarya. Cold Spring
cation of diverse archaeal proteins with class Harb Perspect Biol 5(7):10
III signal peptides cleaved by distinct archaeal 69. Dolgin E (2009) Human mutation rate
prepilin peptidases. J Bacteriol 189 revealed. Nat News. https://doi.org/10.
(3):772–778 1038/news.2009.864
55. Siebers B, Schönheit P (2005) Unusual path-
ways and enzymes of central carbohydrate
Introduction to Genome Biology and Diversity 27
99. Kelman LM, Kelman Z (2003) Archaea: an 113. Nosek J, Tomáška Ľ (2003) Mitochondrial
archetype for replication initiation studies? genome diversity: evolution of the molecular
Mol Microbiol 48(3):605–615 architecture and replication strategy. Curr
100. Bell SD, White MF (2010) Archaeal chroma- Genet 44(2):73–84
tin organization. Bacterial chromatin. 114. Smith DR, Keeling PJ (2013) Gene conver-
Springer, New York, pp 205–217 sion shapes linear mitochondrial genome
101. White MF, Bell SD (2002) Holding it architecture. Genome Biol Evol 5
together: chromatin in the archaea. Trends (5):905–912
Genet 18(12):621–626 115. Kolesnikov AA, Gerasimov ES (2012) Diver-
102. Mattiroli F, Gu Y, Yadav T, Balsbaugh JL, sity of mitochondrial genome organization.
Harris MR, Findlay ES et al (2017) Biochemistry 77(13):1424
DNA-mediated association of two histone- 116. Green BR (2011) Chloroplast genomes of
bound complexes of yeast Chromatin Assem- photosynthetic eukaryotes. Plant J 66
bly Factor-1 (CAF-1) drives tetrasome assem- (1):34–44
bly in the wake of DNA replication. elife 6: 117. Rogalski M, do Nascimento Vieira L, Fraga
e22799 HP, Guerra MP (2015) Plastid genomics in
103. Forterre P, Confalonieri F, Knapp S (1999) horticultural species: importance and applica-
Identification of the gene encoding archeal- tions for plant population genetics, evolution,
specific DNA-binding proteins of the Sac10b and biotechnology. Front Plant Sci 6:586
family. Mol Microbiol 32(3):669–670 118. Janouškovec J, Sobotka R, Lai D,
104. Zlatanova J, Caiafa P (2009) CCCTC- Flegontov P, Konı́k P, Komenda J et al
binding factor: to loop or to bridge. Cell (2013) Split photosystem protein, linear-
Mol Life Sci 66(10):1647–1660 mapping topology, and growth of structural
105. Tark-Dame M, van Driel R, Heermann DW complexity in the plastid genome of Chro-
(2011) Chromatin folding--from biology to mera velia. Mol Biol Evol 30(11):2447–2462
polymer models and back. J Cell Sci 124 119. Archibald JM (2007) Nucleomorph gen-
(Pt 6):839–845 omes: structure, function, origin and evolu-
106. Biscotti MA, Olmo E, Heslop-Harrison JP tion. BioEssays 29(4):392–402
(2015) Repetitive DNA in eukaryotic gen- 120. Funnell BE, Phillips GJ (2004) Plasmid biol-
omes. Chromosom Res 23(3):415–420 ogy. ASM Press, Washington, DC
107. Rubin GM, Spradling AC (1982) Genetic 121. Ravin NV (2011) N15: the linear phage–plas-
transformation of drosophila with transposa- mid. Plasmid 65(2):102–109
ble element vectors. Science (New York, NY) 122. Shintani M, Sanchez ZK, Kimbara K (2015)
218(4570):348–353 Genomics of microbial plasmids: classification
108. Treangen TJ, Salzberg SL (2011) Repetitive and identification based on replication and
DNA and next-generation sequencing: transfer systems and host taxonomy. Front
computational challenges and solutions. Nat Microbiol 6:242
Rev Genet 13(1):36–46 123. Burgess DJ (2017) Genetic engineering:
109. Shibata Y, Kumar P, Layer R, Willcox S, CREATE-ing genome-wide designed muta-
Gagan JR, Griffith JD, Dutta A (2012) Extra- tions. Nat Rev Genet 18(2):69
chromosomal microDNAs and chromosomal 124. Kumar P, Dillon LW, Shibata Y, Jazaeri AA,
microdeletions in normal tissues. Science Jones DR, Dutta A (2017) Normal and
(New York, NY) 336(6077):82–86 cancerous tissues release extrachromosomal
110. Chen XJ, Butow RA (2005) The organization circular DNA (eccDNA) into the circulation.
and inheritance of the mitochondrial genome. Mol Cancer Res 15(9):1197–1205
Nat RevGenet 6(11):815 125. Barbieri M (2016) What is information?
111. Karnkowska A, Vacek V, Zubáčová Z, Treitli Philos Trans R Soc A 374(2063):20150060
SC, Petrželková R, Eme L et al (2016) A 126. Gibson DG, Glass JI, Lartigue C, Noskov
eukaryote without a mitochondrial organelle. VN, Chuang RY, Algire MA et al (2010)
Curr Biol 26(10):1274–1284 Creation of a bacterial cell controlled by a
112. Stehling O, Lill R (2013) The role of mito- chemically synthesized genome. Science 329
chondria in cellular Iron-sulfur protein bio- (5987):52–56
genesis: mechanisms, connected processes, 127. ENCODE Project Consortium (2012) An
and diseases. Cold Spring Harb Perspect integrated encyclopedia of DNA elements in
Biol 5(8):a011312
Introduction to Genome Biology and Diversity 29
154. Howard-Varona C, Hargreaves KR, Abedon Proc Natl Acad Sci U SA 104
ST, Sullivan MB (2017) Lysogeny in nature: (35):14092–14097
mechanisms, impact and ecology of temperate 166. Wang Y, Wang Y, Sheng Y, Huang J, Chen X,
phages. ISME J 11:1511–1520 AL-Rasheid KA, Gao S (2017) A comparative
155. Lindell D, Sullivan MB, Johnson ZI, Tolonen study of genome organization and epigenetic
AC, Rohwer F et al (2004) Transfer of photo- mechanisms in model ciliates, with an empha-
synthesis genes to and from Prochlorococcus sis on Tetrahymena, Paramecium and Oxytri-
viruses. Proc Natl Acad Sci U S A cha. Eur J Protistol 61(Pt B):376–387
101:11013–11018 167. Yerlici VT, Landweber LF (2014) Pro-
156. Zeidner G, Bielawski JP, Shmoish M, Scanlan grammed genome rearrangements in the Cil-
DJ, Sabehi G, Béjà O (2005) Potential pho- iate Oxytricha. Microbiol Spectr 2:6. https://
tosynthesis gene recombination between Pro- doi.org/10.1128/microbiolspec.MDNA3-
chlorococcus and Synechococcus via viral 0025-2014
intermediates. Environ Microbiol 7 168. Pilling OA, Rogers AJ, Gulla-Devaney B,
(10):1505–1513 Katz LA (2017) Insights into transgenera-
157. Sullivan MB, Lindell D, Lee JA, Thompson tional epigenetics from studies of ciliates.
LR, Bielawski JP, Chisholm SW (2006) Prev- Eur J Protistol 61(Pt B):366–375
alence and evolution of core photosystem II 169. Bulic A, Postberg J, Fischer A, Jönsson F,
genes in marine cyanobacterial viruses and Reuter G, Lipps HJ (2013) A permissive
their hosts. PLoS Biol 4(8):e234 chromatin structure is adopted prior to site-
158. Orsi RH, Sun Q, Wiedmann M (2008) specific DNA demethylation of developmen-
Genome-wide analyses reveal lineage specific tally expressed genes involved in macronu-
contributions of positive selection and recom- clear differentiation. Epigenetics Chromatin
bination to the evolution of Listeria monocy- 6(1):5
togenes. BMC Evol Biol 8(1):233 170. Bracht JR, Fang W, Goldman AD,
159. Dunn KA, Bielawski JP, Ward TJ, Dolzhenko E, Stein EM, Landweber LF
Urquhart C, Gu H (2009) Reconciling eco- (2013) Genomes on the edge: programmed
logical and genomic divergence among genome instability in ciliates. Cell 152
lineages of Listeria under an “extended (3):406–416
mosaic genome concept”. Mol Biol Evol 26 171. Forcob S, Bulic A, Jönsson F, Lipps HJ, Post-
(11):2605–2615 berg J (2014) Differential expression of his-
160. Buchrieser C, Rusniok C, Kunst F, Cossart P, tone H3 genes and selective association of the
Glaser P (2003) Comparison of the genome variant H3. 7 with a specific sequence class in
sequences of Listeria monocytogenes and Lis- Stylonychia macronuclear development. Epi-
teria innocua: clues for evolution and patho- genetics Chromatin 7(1):4
genicity. Pathog Dis 35(3):207–213 172. Blekhman R, Goodrich JK, Huang K, Sun Q,
161. Nightingale KK, Windham K, Wiedmann M Bukowski R, Bell JT et al (2015) Host genetic
(2005) Evolution and molecular phylogeny of variation impacts microbiome composition
Listeria monocytogenes isolated from human across human body sites. Genome Biol 16
and animal listeriosis cases and foods. J Bac- (1):191
teriol 187(16):5537–5551 173. Clemente JC, Ursell LK, Parfrey LW, Knight
162. Lawrence JG (2002) Gene transfer in bacte- R (2012) The impact of the gut microbiota
ria: speciation without species? Theor Popul on human health: an integrative view. Cell
Biol 61(4):449–460 148(6):1258–1270
163. Nesbø CL, Dlutek M, Doolittle WF (2006) 174. Baquero F, Nombela C (2012) The micro-
Recombination in Thermotoga: implications biome as a human organ. Clin Microbiol
for species concepts and biogeography. Infect 18(s4):2–4
Genetics 172(2):759–769 175. Sandoval-Motta S, Aldana M, Martı́nez-
164. Retchless AC, Lawrence JG (2007) Temporal Romero E, Frank A (2017) The human
fragmentation of speciation in bacteria. Sci- microbiome and the missing heritability prob-
ence 317(5841):1093–1096 lem. Front Genet 8:80
165. Papke RT, Zhaxybayeva O, Feil EJ, 176. Gordon H, Moller FT, Andersen V, Harbord
Sommerfeld K, Muise D, Doolittle WF M (2015) Heritability in inflammatory bowel
(2007) Searching for species in haloarchaea. disease: from the first twin study to genome-
Introduction to Genome Biology and Diversity 31
wide association studies. Inflamm Bowel Dis structure are associated with sustained remis-
21(6):1428 sion after nutritional treatment of pediatric
177. Dunn KA, Moore-Connors J, MacIntyre B, Crohn’s disease. Inflamm Bowel Dis 22
Stadnyk A, Thomas NA, Noble A et al (2016) (12):2853–2862
The gut microbiome of pediatric Crohn’s dis- 179. Bordenstein SR, Theis KR (2015) Host biol-
ease patients differs from healthy controls in ogy in light of the microbiome: ten principles
genes that can influence the balance between of holobionts and hologenomes. PLoS Biol
a healthy and dysregulated immune response. 13(8):e1002226
Inflamm Bowel Dis 22(11):2607–2618 180. Gavrilets S (2004) Fitness landscapes and the
178. Dunn KA, Moore-Connors J, MacIntyre B, origin of species (MPB-41), vol 41. Princeton
Stadnyk AW, Thomas NA, Noble A et al University Press, Princeton, NJ
(2016) Early changes in microbial community
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 2
Abstract
In this chapter, we review basic concepts from probability theory and computational statistics that are
fundamental to evolutionary genomics. We provide a very basic introduction to statistical modeling and
discuss general principles, including maximum likelihood and Bayesian inference. Markov chains, hidden
Markov models, and Bayesian network models are introduced in more detail as they occur frequently and in
many variations in genomics applications. In particular, we discuss efficient inference algorithms and
methods for learning these models from partially observed data. Several simple examples are given
throughout the text, some of which provide the basis for models that are discussed in more detail in
subsequent chapters.
Key words Bayesian inference, Bayesian networks, Dynamic programming, EM algorithm, Hidden
Markov models, Markov chains, Maximum likelihood, Statistical models
1 Statistical Models
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_2, © The Author(s) 2019
33
34 Niko Beerenwinkel and Juliane Siebourg
Aa
AA aa
0.20
l
l
0.15
Probability
0.10
l l
l
0.05
l l
l
0.00
l l l l l l l
0 5 10 15
Reads per base
Fig. 2 Coverage distribution of a shotgun sequencing experiment with n ¼ 108 reads of length L ¼ 100 of the
human genome of length G ¼ 3 109. The average coverage is c ¼ np ¼ 3.4, where p ¼ L/G. Dots show
the binomial coverage distribution Binomðn, pÞ and the solid line its approximation by the Poisson distribution
PoisðnpÞ. Note that the Poisson distribution is also discrete and just shown as a line to distinguish it from the
binomial distribution
n c k e c
PðX ¼ kÞ ¼ pk ð1 pÞnk : ð12Þ
k k!
For example, using next-generation sequencing technology, one
might obtain n ¼ 108 reads of length L ¼ 100 bases in a single
run. For the human genome of length G ¼ 3 109, we obtain a
coverage of c ¼ 3.4. The distribution of the number of reads per
base pair is shown in Fig. 2. In particular, the fraction of unse-
quenced positions is P(X ¼ 0) ¼ ec ¼ 3.57%. □
A continuous random variable X takes values in X ¼ and is
defined by a nonnegative function f(x) such that:
Z
PðX ∈BÞ ¼ f ðxÞdx, for all subsets B : ð13Þ
B
Z 1
EðX Þ ¼ x f ðxÞ dx, ð16Þ
1
Z 1
VarðX Þ ¼ ½x EðX Þ2 f ðxÞ dx: ð17Þ
1
1 X ðiÞ
N
x ¼ x , ð19Þ
N
i¼1
1 X ðiÞ
N
s2 ¼ ðx x Þ2 : ð20Þ
N 1
i¼1
0.04
0.08
2
2
0.12
0.16
0.2
0.24
1
1
0.22
0.18
0.14
0.1
0
0
0.06
0.02
−1
−1
−2
−2
−2 −1 0 1 2 −2 −1 0 1 2
Fig. 3 Marginalization. Left: two-dimensional histogram of a discrete bivariate distribution with the two
marginal histograms. Right: contour plot of a two-dimensional Gaussian density with the marginal distribu-
tions of each component
2 Statistical Inference
0.25
0.20
0.15
Likelihood
0.10
0.05
0.00
Fig. 4 Likelihood function of the binomial model. The underlying data set
consists of k ¼ 7 successes out of N ¼ 10 Bernoulli trials. The likelihood L
( p) ¼ pk (1 p)Nk is plotted as a function of the model parameter p, the
probability of success (solid line). The MLE is the maximum of this function,
p^ ML ¼ k =N ¼ 7=10 (dashed line)
800
150
600
10
100
400
5
50
200
0
0
0.80 0.85 0.90 0.95 1.00 0.80 0.85 0.90 0.95 1.00 0.80 0.85 0.90 0.95 1.00
p p p
Fig. 5 Bootstrap analysis of the ML allele frequency. The bootstrap distribution of the maximum likelihood
estimator p^ ML ¼ ð2n AA þ n Aa Þ=ð2N Þ of the major allele frequency in the Hardy–Weinberg model is plotted for
B ¼ 100 (left), B ¼ 1000 (center), and B ¼ 10, 000 (right) bootstrap samples, for the data set (nAA, nAa,
naa) ¼ (81, 18, 1)
Aa Aa Aa
× =
l l
MLE MLE
AA aa AA aa AA aa
Fig. 6 Dirichlet prior for multinomial likelihood. The Dirichlet prior is conjugate to the multinomial likelihood.
Shown are contour lines of the prior Dirð10, 10, 10Þ on the left, the multinomial likelihood P((nAA, nAa,
naa) ¼ (81, 18, 1)jθAA, θAa, θaa) in the center, and the resulting posterior Dirð91, 28, 11Þ on the right. The
posterior is the product of prior and likelihood
We often cannot observe all relevant random variables due to, for
example, experimental limitations or study designs. In this case, a
statistical model P(X, Z j θ ∈ Θ) consists of the observed random
variable X and the hidden (or latent) random variable Z, both of
which can be multivariate. In this section, we write X ¼ (X(1), . . .,
X(N )) for the random variables describing the N observations and
refer to X also as the observed data. The hidden data for this model
is Z ¼ (Z(1), . . ., Z(N )) and the complete data is (X, Z). For
Probability, Statistics, and Computational Science 49
X1 X2 ... XL
N L N L
ðZ ðiÞ Þ
∏ ∏ θn, Z ðiÞ X ðiÞn ¼ ∏ ∏ ∏ ∏ θnn, ,kxkx
I
/ , ð55Þ
i¼1n¼1 i¼1n¼1 k∈½K x∈X
ð56Þ
In the E step, we compute the expected values of Z(i):
X
L XX
EZ j X , θ0 ½ℓhid ðθÞ ¼ N n, kx logθn, kx : ð58Þ
n¼1 k∈½K x∈X
4 Markov Chains
1−α H D 1−β
β
The probability of a healthy individual becoming sick due to
infection is α ¼ 0.6, and the probability of a diseased individual to
clear the infection and recover is β ¼ 0.9. The initial probabilities
for health and disease are P(H) ¼ 0.7 and P(D) ¼ 0.3. Write down
the transition matrix T of this Markov chain. What is the probability
of observing the disease trajectories DDHHD and HDHDH? Cal-
culate the stationary distribution of the Markov chain.
Probability, Statistics, and Computational Science 55
A T
C G
ð74Þ
and the stationary distribution as t !1 is uniform, π ¼ (1/4, 1/4,
1/4, 1/4)t. □
Example 21 (The Poisson Process): A continuous-time Markov chain
X(t) is a counting process, if X(t) represents the total number of
events that occur by time t. It is a Poisson process, if in addition
X(0) ¼ 0, the increments are independent, and in any interval of
length t the number of events is Poisson distributed with rate λt:
ðλtÞk ð75Þ
P½X ðt þ sÞ X ðsÞ ¼ k ¼ P½X ðtÞ ¼ k ¼ e λt :
k!
The Poisson process is used, for example, to count mutations in a
gene. □
Example 22 (Exponential Distribution): The exponential distribu-
tion ExpðλÞ with parameter λ > 0 is a common distribution for
waiting times. It is defined by the density function:
f ðxÞ ¼ λe λx , for x 0: ð76Þ
If X ExpðλÞ, then X has expectation E(X) ¼ λ1 and variance
VarðX Þ ¼ λ2 . The exponential distribution is memoryless, which
means that P(X > s + t j X > t) ¼ P(X > s), for all s, t > 0. An
important consequence of the memoryless property is that the
waiting times between successive events are i.i.d. For example, the
waiting times τn (n 1) between the events of a Poisson process,
the sequence of interarrival times, are exponentially distributed,
τn ExpðλÞ, for all n 1. □
Exercise 23 (Kimura Model): The Kimura two-parameter model is a
DNA substitution model that distinguishes between transitions,
i.e., purine-to-purine and pyrimidine-to-pyrimidine substitutions,
from transversions, i.e., purine-to-pyrimidine and pyrimidine-to-
purine substitutions [18]. It is defined by the rate matrix:
0 1
2β α β α β
Bβ 2β α β α C
R¼B @α
C,
A
β 2β α β
β α β 2β α
where α, β∈þ are the two substitution rates. Assuming that the
Markov chain is ergodic, derive its stationary distribution.
Probability, Statistics, and Computational Science 57
Fig. 9 Hidden Markov model. Shaded nodes represent observed random variables (or symbols) Xn, and clear
nodes represent hidden states (or the annotation). Directed edges indicate statistical dependencies which are
given, respectively, by transition and emission probabilities among hidden states and between hidden states
and observed symbols
58 Niko Beerenwinkel and Juliane Siebourg
There are KL possible state paths such that already for sequences of
moderate length, the optimization problem (Eq. 80) cannot be
solved by enumerating all paths.
However, there is an efficient algorithm solving (Eq. 80) based
on the following factorization along the Markov chain:
L1
max PðX , Z Þ ¼ max PðZ 1 Þ ∏ PðX n j Z n ÞPðZ nþ1 j Z n Þ
Z Z 1 , ..., Z L n¼1
Dn
In
B Mn E
Fig. 10 Profile hidden Markov model. The hidden state space and its transitions
are shown for the profile HMM of length L ¼ 3. Match states are denoted Mn,
insert states In, and delete states Dn. B and E denote silent begin and end states,
respectively. With match and insert states, probability tables for the emissions of
symbols (amino acids or nucleotides, and gaps) are associated
where X(1), . . ., X(N ) are the i.i.d. observations. For each observa-
tion, we can rewrite the joint probability as:
N ðZ ðiÞ Þ N ðZ ðiÞ Þ
PðX ðiÞ , Z ðiÞ j θÞ ¼ ∏ ∏ E kxkx ∏ ∏ T kl kl ,
k∈½K x∈X k∈½K l∈½K
ð86Þ
(i)
where Nkx(Z ) is the number of x emissions when in state k and
Nkl(Z(i)) the number of k-to-l transitions in state path Z(i)
(cf. Eq. 68).
62 Niko Beerenwinkel and Juliane Siebourg
ð87Þ
The expected counts Nkx and Nkl are the sufficient statistics [11] of
the HMM, i.e., with respect to the model, they contain all infor-
mation about the parameters available from the data. The expected
counts can be computed using the forward and backward algo-
rithms. In the M step, this expression is maximized P
with respect to
θ ¼ (T, E). We find the MLEs T^ kl ¼ N kl = m N km and
P
^ kx ¼ N kx = N ky .
E y
7 Bayesian Networks
E A
B D
for the given optimal structure G∗. The first step is a model selec-
tion problem as introduced in Subheading 2.
Model selection for Bayesian networks is a particularly hard
problem because the number of DAGs increases super-
exponentially with the number of vertices rendering exhaustive
searches impractical, and the objective function in Eq. 92 is difficult
to compute. Recall that the posterior PðG j DÞ is proportional to
the product PðD j GÞPðGÞ of marginal likelihood and network
prior, and the marginal likelihood:
R
PðD j GÞ ¼ PðD j θ, GÞPðθ j GÞ dθ ð94Þ
is usually analytically intractable. Here, P(θ j G) is the prior distri-
bution of parameters given the network topology.
To address this limitation, the marginal likelihood (Eq. 94) can
be approximated by a function that is easier to evaluate. A popular
choice is the Bayesian information criterion (BIC) [26]:
1
logPðD j GÞ logPðD j θ^ML , GÞ νlogN , ð95Þ
2
where ν is the number of free parameters of the model and N the
size of the data. The BIC approximation can be derived under
certain assumptions, including a unimodal likelihood. It replaces
computation of the integral (Eq. 94) by evaluating the integrand at
the MLE and adding the correction term ðνlogN Þ=2, which
penalizes models of high complexity.
The model selection problem remains hard even with a tracta-
ble scoring function, such as BIC, because of the enormous search
Probability, Statistics, and Computational Science 65
Z4
Z3
Z1 Z2
X1 X2 X3 X4 X5
Z1 Z2 Z3 Z4 ...
X1,1 X1,2 X1,3 X2,1 X2,2 X2,3 X3,1 X3,2 X3,3 X4,1 X4,2 X4,3
Fig. 13 Phylo-HMM. Shown are the first four positions of a Phylo-HMM. The hidden Markov chain has random
variables Z. In the trees, Y denote the hidden common ancestors and X the observed species. Note that the
tree topology changes between position 2 and 3
Probability, Statistics, and Computational Science 69
A B C D E
C1 0 0 0 0 0
C2 0 0 0 0 1
E A
C3 0 0 0 0 1
C4 1 1 1 1 0
B D
C5 1 0 1 1 0
C6 0 0 0 1 1
C7 1 1 1 1 0 C
C8 1 0 0 0 1
C9 1 0 0 1 1
(a) Specify the adjacency matrix of the directed graph.
(b) Determine the local probability distributions for each vertex of
the graph. Use conditional counting to determine the condi-
tional probabilities as:
N ðX i , X paðiÞ Þ
PðX i j X paðiÞ Þ P ,
k N ðX i ¼ k, X paði Þ
References
1. Ewens WJ, Grant GR (2005) Statistical meth- 3. Davison AC (2009) Statistical models. Cam-
ods in bioinformatics: an introduction, 2nd bridge series in statistical and probabilistic
edn. Springer, Berlin mathematics. Cambridge University Press,
2. Deonier RC, Tavaré S, Waterman MS (2005) Cambridge
Computational genome analysis: an introduc- 4. Ross SM (2007) Introduction to probability
tion. Springer, Berlin models. Academic, London
70 Niko Beerenwinkel and Juliane Siebourg
5. Hardy GH (1908) Mendelian proportions in a 20. Durbin R (1998) Biological sequence analysis.
mixed population. Science 28:49–50 Probabilistic models of proteins and nucleic
6. Weinberg W (1908) Uber € den Nachweis der acids. Cambridge University Press, Cambridge
Vererbung beim Menschen. Jahres Wiertt Ver 21. Viterbi A (1967) Error bounds for convolu-
Vaterl Natkd 64:369–382 tional codes and an asymptotically optimum
7. Pachter L, Sturmfels B (2005) Algebraic statis- decoding algorithm. IEEE Trans Inf Theory
tics for computational biology. Cambridge 13:260–269
University Press, Cambridge 22. Baum LE (1972) An inequality and associated
8. Casella G, Berger RL (2002) Statistical Infer- maximization technique in statistical estima-
ence. Thomson Learning, Pacific Grove tion for probabilistic functions of Markov pro-
9. Efron B, Tibshirani R (1993) An introduction cesses. Inequalities 3:1–8
to the bootstrap. Chapman and Hall/CRC, 23. Husmeier D, Dybowski R, Roberts S (2005)
Boca Raton Probabilistic modeling in bioinformatics and
10. Gelman A, Carlin JB, Stern HS, Rubin DB medical informatics. Springer, New York
(2004) Bayesian data analysis, vol 46, 2 edn. 24. Koller D, Friedman N (2009) Probabilistic
Chapman and Hall/CRC, Boca Raton graphical models: principles and techniques.
11. Bishop CM (2006) Pattern recognition and The MIT Press, Cambridge
machine leaning. Springer, Berlin 25. Jordan MI (1998) Learning in graphical mod-
12. Kass R, Raftery A (1995) Bayes factors. J Am els. Kluwer Academic Publishers, Dordrecht
Stat Assoc 90:773–795 26. Schwarz G (1978) Estimating the dimension of
13. Dempster AP, Laird NM, Rubin DB (1977) a model. Ann Stat 6:461–464
Maximum likelihood from incomplete data via 27. Neal RM (1993) Probabilistic inference using
the EM algorithm. J R Stat Soc Ser B 39:1–38 Markov Chain Monte Carlo methods. Intelli-
14. Norris JR (1998) Markov chains. Cambridge gence 62:144
University Press, Cambridge 28. Hastings WK (1970) Monte Carlo sampling
15. Wright S (1990) Evolution in Mendelian methods using Markov chains and their appli-
populations. Bull Math Biol 52:241–295 cations. Biometrika 57:97–109
16. Fisher RA (1930) The genetical theory of nat- 29. Geman S, Geman D (1984) Stochastic relaxa-
ural selection. Clarendon Press, Oxford tion, Gibbs distributions, and the Bayesian res-
toration of images. IEEE Trans Pattern Anal
17. Jukes TH, Cantor CR (1969) Evolution of Mach Intell 6:721–741
protein molecules. Mamm Protein Metab
3:21–132 30. Felsenstein J (2004) Inferring phylogenies.
Sinauer Associates, Sunderland
18. Kimura M (1980) A simple method for esti-
mating evolutionary rates of base substitutions 31. Felsenstein J (1981) Evolutionary trees from
through comparative studies of nucleotide DNA sequences: a maximum likelihood
sequences. J Mol Evol 16:111–120 approach. J Mol Evol 17:368–376
19. Rabiner LR (1989) A tutorial on HMM and 32. Siepel A, Haussler D (2005) Phylogenetic hidden
selected applications in speech recognition. Markov models. Statistical methods in molecular
Proc IEEE 77:257–286 evolution. Springer, New York, pp 325–351
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 3
Abstract
In this chapter, we give a not-so-long and self-contained introduction to computational molecular evolu-
tion. In particular, we present the emergence of the use of likelihood-based methods, review the standard
DNA substitution models, and introduce how model choice operates. We also present recent developments
in inferring absolute divergence times and rates on a phylogeny, before showing how state-of-the-art
models take inspiration from diffusion theory to link population genetics, which traditionally focuses at a
taxonomic level below that of the species, and molecular evolution. Although this is not a cookbook
chapter, we try and point to popular programs and implementations along the way.
1 Introduction
Many books [1–7] and review papers [8–10] have been published
in recent years on the topic of computational molecular evolution,
so that updating our previous primer on the very same topic [11]
may seem redundant. However, the field is continuously under-
going changes, as both models and algorithms become even more
sophisticated, efficient, robust, and accurate. This increase in refine-
ment has not been motivated by a desire to complicate existing
models, but rather to make an old wish come true: that of having
integrated methods that can take unaligned sequences as an input,
and simultaneously output the alignment, the tree, and other esti-
mates of interest, in a sound statistical framework justified by sound
principles: those of population genetics.
The aim of this chapter is still to provide readers with the
essentials of computational molecular evolution, offering a brief
overview of recent progress, both in terms of modeling and algo-
rithm development. Some of the details will be left out as they are
dealt with by others in this volume. Likewise, the analysis of
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_3, © The Author(s) 2019
71
72 Stéphane Aris-Brosou and Nicolas Rodrigue
genomic-scale data is briefly touched upon, but the details are left
to other chapters.
2.1 A Brief Overview The simplest phylogenetic question pertains to the reconstruction
of Parsimony of a rooted tree with three sequences (Fig. 1). The sequences can be
made of DNA, RNA, amino acids, or codons, but for the sake of
simplicity we focus on DNA throughout this chapter. In the basic
example below, based on [12], DNA sequences are assumed to have
been sampled from three different species that diverged a “long
time ago.” In this context, we assume that the data or gene
sequences have been aligned (see Subheading 6), and that the
DNA alignment is:
s1 ATGACCCCAATACGCAAAACTAACCCCCTAATAAAATTAATTAACCACTCCTTC
s2 ATGACCCCAATACGGAAAACTAACCCCCAAATAAAATTAATTAACCACTCATTC
s3 ATGACGCCAATACGCAAAACTAACCGCCTAATAAAATTAATTTACCACTCATTC
T0 T1 T2 T3
s1 s2 s3 s1 s2 s3 s2 s1 s3 s1 s2 s3
Fig. 1 The simplest phylogenetic problem. With three species, s1 s2, and s3, four
rooted trees are possible: T0, the star tree, and the three resolved topologies
T1–T3
A Not-So-Long Introduction to Computational Molecular Evolution 73
s1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyxxx
s2 xxxxxxxxxxxxxxyxxxxxxxxxxxxxyxxxxxxxxxxxxxxxxxxxxxxxxx
s3 xxxxxyxxxxxxxxxxxxxxxxxxxyxxxxxxxxxxxxxxxxyxxxxxxxxxxx
Table 1
The winning-site strategy
2.2 Assessing the As with any statistical exercise estimating a quantity of interest, we
Reliability of an would like to have a confidence interval, taken at a particular level,
Estimate: The so that we can gauge the reliability of our estimate. A standard
Bootstrap approach to derive confidence intervals is the bootstrap [17], a
computational technique that resamples data points with replace-
ment to simulate the distribution of any test statistic under the null
hypothesis that is tested. The bootstrap, particularly useful in com-
plicated nonparametric problems where no asymptotic results can
be obtained [18], was adapted by Felsenstein to the nonstandard
phylogenetic problem [19]. Indeed, the problem is nonstandard in
that the object for which we wish to assess accuracy is not a real-
valued parameter, but a graph.
The basic idea, clearly explained in [20], consists in resampling
columns of the alignment, with replacement, to construct a “syn-
thetic” alignment of the same size as the original alignment. This
synthetic or bootstrap replicate is then subjected to the same tree-
reconstruction algorithm used on the original data (Fig. 2). This
exercise is repeated a large number of times (e.g., 106), and the
proportion of each original bipartition (internal node) in the set of
bootstrapped trees is recorded. In Fig. 2, for instance, the bipar-
tition s1s2|s3 is found in two bootstrap trees out of three, so the
bootstrap support for this node is 66.7%. In this simple case with
three sequences, the bootstrap support for topology T1 is also
66.7%. This bootstrap proportion for topologies (or for trees
when branch lengths are taken into account, in a maximum likeli-
hood context, for instance—see below) can be computed very
quickly by bootstrapping the sitewise log-likelihood values, instead
of the columns of the alignment; this bootstrap is called RELL, for
“resampling estimated log-likelihood” [21].
However, this approach is no longer used or cited extensively
since 2008 (source: ISI Thompson). One alternative that has
gained momentum is the one based on the approximated likelihood
ratio test (aLRT) [22], implemented, for instance, in phyml
[23, 24]. Instead of resampling any quantity (sites or sitewise
log-likelihood values), the aLRT tests the null hypothesis that an
interior branch length is zero. In spite of being slightly conservative
in simulations, the approach is extremely fast and hence highly
practical [22].
The meaning of the bootstrap has been a matter of debate for
years. As noted before [8] (see also [22]), the bootstrap proportion
P can be seen as assessing the correctness of an internal node, and
A Not-So-Long Introduction to Computational Molecular Evolution 75
s1 s2 s3
Bootstrap replicate #1
T1
s1 s2 s3
Bootstrap replicate #2
T2 T1
2/3
s2 s1 s3 s1 s2 s3
Bootstrap replicate #3
T1
s1 s2 s3
2.3 Parsimony and Now that we have a means of evaluating the support for the
LBA different topologies, we can test some of the conditions under
which parsimony estimates the correct tree topology. Ideally, a
good method should return the correct answer with a probability
of one when the number of sites increases to infinity. This desirable
statistical property is called consistency. One serious criticism of
parsimony is its sensitivity to long branch attraction, or LBA, even
in the presence of an infinite amount of data (infinite alignment
length) [31]. In other words, parsimony is not statistically
consistent.
Different types of model misspecification can lead to LBA, and
new ones are continually identified. The topology originally used to
demonstrate the artifact is represented in Fig. 3, where two long
branches are separated by a shorter one. Felsenstein demonstrated
that, under a simple evolutionary process, the artifact or LBA tree is
reconstructed. Note that parsimony is not the only phylogenetic
method affected by LBA, but because it posits a very simple model
of evolution [32–34], parsimony is particularly sensitive to the
artifact. In spite of this, one particular journal chose to enforce
the use of parsimony, stating that authors should estimate their
phylogenies by parsimony but also that, if estimated by some
other method, they would need to defend their position “on phil-
osophical grounds” [35]; there is of course no valid scientific
justification for taking such a step—derided in the “Twittersphere”
as “#parsimonygate.”
The LBA artifact has been shown to plague the analysis of
numerous data sets, and a number of empirical approaches have
been used to detect the artifact [36, 37]. Most recent papers based
on multigene analyses (e.g., [38, 39]) now examine carefully the
effect of across-site and across-lineage rate variation (in addition to
the use of heterogeneous models). For both sites and lineages, the
s1 s2
(a) s4 s3
True tree topology (b)
s1 s2
s1 s2
s4 s3
Attract tree topology s4 s3
Tree topology
(in absence of LBA)
Fig. 3 The long branch attraction artifact. The true tree topology has two long branches separated by a short
one. The tree reconstructed under a simple model of evolution (a) is the artifact or LBA tree on the left. The tree
reconstructed under the correct model of evolution (b) is the correct tree, on the right
A Not-So-Long Introduction to Computational Molecular Evolution 77
present past
Geological time (and actual number of substitution)
Fig. 4 Saturation of DNA sequences. As time increases, the observed number of differences between pairs of
sequences reaches a plateau, whereas the actual number of substitutions keeps increasing
2.4 Origin of the By definition, parsimony minimizes the number of changes along
Problem each branch of the tree. When there is only a small number of
changes per branch, the method is expected to be accurate. How-
ever, when sequences are quite divergent, the parsimony assump-
tion leads to underestimating the actual number of changes (Fig. 4;
see also [40]).
Consequently, we would like a tree-reconstruction method that
accounts for multiple substitutions. We would also like a method
that (1) takes into account less parsimonious as well as most parsi-
monious state reconstructions (intervals, tests), (2) weights changes
differently if they occur on branches of different length (evolution-
ary time), and (3) weights different kinds of events (transitions,
transversions) differently (biological realism). Likelihood methods
include such considerations explicitly, as they require modeling the
substitution process itself.
2.5 Modeling The basic model of DNA substitution (Fig. 5) is defined on the
Molecular Evolution DNA state space, made of the four nucleotides thymine (T), cyto-
sine (C), adenine (A), and guanine (G). Note that T and C are
pyrimidines (biochemically, six-membered rings), while A and
G are purines (fused five- and six-membered heterocyclic com-
pounds). Depending on these two biochemical categories, two
different types of substitutions can happen: transitions within a
category, and transversions between categories. Their respective
rates are denoted α and β in Fig. 5.
The process we want to model should describe the substitution
process of the different nucleotides of a DNA sequence. Again, we
78 Stéphane Aris-Brosou and Nicolas Rodrigue
α
pyrimidines (Y) T C
β β
α: transitions
purines (R) A α
G β: transversions
will make the simplifying assumption that sites evolve under a time-
homogeneous Markov process and are iid, as above. We can there-
fore concentrate on one single site for now (e.g., [41]).
At a particular site, we want to describe the change in nucleo-
tide frequency after a short amount of time dt, during which the
nucleotide frequency of A, for instance, after dt will change from
fA(t) to fA(t + dt). According to Fig. 5, fA(t + dt) will be equal to
what we had at time t, fA(t), minus the quantity of A that “dis-
appeared” by mutation during dt, plus the quantity of A that
“appeared” by mutation during dt. Denoting the mutation rate as
μ, the quantity of A that “disappeared” by mutation during dt is
simply fA(t)μAdt. These mutations away from A generated quanti-
ties of T, C, and G, in which we are not interested at the moment
since we only want to know what happens to A. There are three
different ways to generate A: from either T, C, or G (Fig. 5).
Coming from T, mutation will generate fT(t)μT!Adt of A during
dt. Similar expressions exist for C and for G, so that in total, over
the three non-A nucleotides, mutation will generate
∑i6¼A fi(t)μi!Adt. Mathematically, we can express these ideas as:
X
f A ðt þ dtÞ ¼ f A ðtÞ f A ðtÞμA dt þ f i ðtÞμiA dt ð1Þ
i6¼A
ð2Þ
which, in matrix notation, can simply be rewritten as:
A Not-So-Long Introduction to Computational Molecular Evolution 79
T C
t0 t1
i (fictive root)
2.6 Computation on Now that we know how to determine the rate of change of nucleo-
a Tree tide frequencies during a time interval dt, we can compute the
probability of a particular nucleotide change on a tree. The simplest
case, though somewhat artificial with only two sequences, is
depicted in Fig. 6.
We are looking at a particular nucleotide position, denoted j,
for two aligned sequences. The observed nucleotides at this posi-
tion are T in sequence 1, and C in sequence 2. The branch separat-
ing T from C has a total length of t0 + t1. For the sake of
convenience, we set an arbitrary root along this path. The likeli-
hood at site j is then given by the probability of going from the
fictive root i to T in t0, and from i to C in t1. Any of the four
nucleotides can be present at the fictive root. As we do not know
which one was there, we sum these probabilities over all possible
state, weighted by their prior probabilities, the equilibrium fre-
quencies π i. In all, we have the likelihood ℓj at site j:
X
ℓj ¼ π i P i, T ðt 0 ÞP i, C ðt 1 Þ
ð9Þ
i¼fT , C , A, G g
2.7 Substitution Now that we have almost all the elements to compute the likeli-
Models and hood of a set of parameters, including the tree (i.e., the set of
Instantaneous Rate branch lengths and the tree topology; see Subheading 2.10), the
Matrices Q only missing element required to compute the likelihood at each
site, as in Eq. 9, for instance, is the specification of the instanta-
neous rate matrix Q as in Eq. 4. Remember that the μi,j represent
mutation rates from state (nucleotide) i to j. This matrix is generally
rewritten as:
0 1
rT C rT A rT G
B C
B r CT r CA r CG C
B C
Q ¼ μB C ð10Þ
B r AT r AC r C
@ AG A
r GT r GC r GA
so that each entry rij is a rate of change from nucleotide i to
nucleotide j. The diagonal entries are left out, indicated by a “,”
and are in fact calculated as the negative sum of the off-diagonal
entries (as rows sum to 0).
The simplest specification of Q would be that all rates of change
are identical, so that Q becomes (leaving out the mutation rate μ
and indexing the matrix to indicate the difference):
0 1
1 1 1
B C
B1 1 1 C
B C
Q JC ¼ B C ð11Þ
B1 1 1 C
@ A
1 1 1
which is the model proposed by Jukes and Cantor [45] and often
noted “JC” or “JC69”. Under the specification of Eq. 11, this
model has no free parameter. The process is generally scaled such
that the unit of branch lengths can be interpreted as an expected
number of substitutions per site.
Of course, this model is extremely simplistic and neglects a fair
amount of basic molecular biology. In particular, it overlooks two
observations. First, base frequencies are not all equal in actual DNA
sequences, but are rather skewed, and second, transitions are more
frequent than transversions (see Subheading 2.5).
The way to account for this first “biological realism” is as
follows. If DNA sequences were made exclusively of As, for
instance, that would mean that all mutations are towards the
observed base, in this case A, whose equilibrium or stationary
frequency is π A. The same reasoning can be used for arbitrary
equilibrium frequencies π, so that all relative rates of change in
Q become proportional to the vector of equilibrium frequency π
of the target nucleotide. In other words, the instantaneous rate
matrix Q becomes:
82 Stéphane Aris-Brosou and Nicolas Rodrigue
0 1
πC πA πG
B C
B πT πA πG C
B C
Q F81 ¼B C ð12Þ
B πT πC πG C
@ A
πT πC πA
again with the requirement that rows sum to 0. This matrix repre-
sents the Felsenstein or F81 model [44]. This model has four
parameters (the four base frequencies), but since base frequencies
sum to 1, we only have three free parameters.
The second “biological realism,” accounting for the different
rates of transversions and transitions, can be described by saying
that transitions occur κ times faster than transversions. From Fig. 5,
recall that transitions are mutations from T to C (and vice versa) and
from A to G (and vice versa). This translates into:
0 1
κ 1 1
B C
Bκ 1 1 C
B C
Q K80 ¼ B C ð13Þ
B1 1 κ C
@ A
1 1 κ
This model is called the Kimura two-parameter model or K80
(or K2P) [46]. The model is alternatively described with the two
rates α and β (see Fig. 5). In the “κ version” of the model as in
Eq. 13, there is only one free parameter.
Of course it is possible to account for both kinds of “biological
realism,” unequal equilibrium base frequencies and transition bias,
all in the same model, whose generator Q becomes:
0 1
πC κ πA πG
B C
B πT κ πA πG C
Q HKY ¼ B B C ð14Þ
@ πT πC πG κ C
A
πT πC πA κ
which corresponds to the Hasegawa–Kishino–Yano or HKY
(or HKY85) model [47]. This model has four free parameters: κ
and three base frequencies.
The level of “sophistication” goes “up to” the general time-
reversible model [48], denoted GTR or REV, which has for
generator:
0 1
aπ C bπ A cπ G
B C
B aπ T dπ A eπ G C
Q GTR ¼ BB bπ
C ð15Þ
@ T dπ C πG CA
cπ T eπ C π A
A Not-So-Long Introduction to Computational Molecular Evolution 83
2.8 Some For a given substitution model, how should parameters be esti-
Computational mated, given the (potentially) high dimensionality of the model?
Aspects Analytical solutions consist in determining when the first derivative
of the likelihood function is equal to zero (with a change of sign in
2.8.1 Optimization of the the second derivative). However, finding the root of the likelihood
Likelihood Function function analytically is only possible in the simple case of three
sequences of binary characters under the assumption of the molec-
ular clock (see Subheading 3.1) [12]. As a result, numerical solu-
tions must be found to maximize the likelihood function.
A number of ideas have been combined to search efficiently for
the parameter values that maximize the likelihood function. Most
programs will start from a random starting point, for example,
ð0Þ ð0Þ
ðθ1 , θ2 Þ, denoted by an x in Fig. 7, where we limit ourselves to
a two-parameter example. The optimization procedure can follow
(A) (B)
x x
parameter θ2
parameter θ2
+ +
parameter θ1 parameter θ1
Fig. 7 Two optimization strategies. The likelihood surface of a function with two parameters θ1 and θ2 (e.g.,
two branch lengths) is depicted as a contour plot, whose highest peak is at the + sign. (a) Optimization one
parameter at a time. (b) Optimization of all parameters simultaneously. See text for details
84 Stéphane Aris-Brosou and Nicolas Rodrigue
one of the two strategies. In the first one, parameters are optimized
one at a time. In Fig. 7a, parameter θ1 is first optimized to maximize
the likelihood function with a line search, which defines a direction
along which the other parameter (θ2) or parameters in the multidi-
ð1Þ
mensional case are kept constant. Once θ1 is found, a new
direction is defined to optimize θ2, and so on so forth until conver-
gence to the maximum of the likelihood function. As shown in
Fig. 7a, many iterations can be required, in particular when the
parameters θ1 and θ2 are correlated. The alternative to optimizing
one parameter at a time is to optimize all parameters simulta-
neously. In this case (Fig. 7b), an initial direction is defined at
ð0Þ ð0Þ
ðθ1 , θ2 Þ such that the slope at this point is maximized. The
process is repeated until convergence. More technical details can
be found in [52]. The simultaneous optimization procedure gener-
ally requires fewer steps than optimizing parameters one at a time,
but not always. Since the computation of the likelihood function is
the most expensive computation of these algorithms, the simulta-
neous optimization is much more efficient, at least in our toy
example.
How general is this result? Simultaneously optimizing para-
meters of the substitution model, while optimizing branch lengths
one at a time, was shown to be more effective on large data sets
[43], potentially because of the correlation that exists between
some of the parameters entering the Q matrix (see Subheading 2.7).
2.8.2 Convergence Convergence is usually reached either when the increment in the
log-likelihood score becomes smaller than an ε value, usually set to
a small number such as 106 (but yet a number larger than the
machine ε: the smallest number that a given computer can repre-
sent), or when the log-likelihood score has not changed after a
predetermined number of iterations. However, none of these stop-
ping rules guarantees that the global maximum of the likelihood
function has been found. Therefore, it is generally recommended to
run the optimization procedure at least twice, starting from differ-
ent initial values of the model parameters, and to check that the
likelihood score after optimization is the same across the different
runs (Fig. 8). If this is not the case, additional runs may be required,
and the one with the largest likelihood is chosen for inference (e.g.,
[53]).
In many instances though, different substitution models will
give different tree topologies, and therefore different biological
conclusions. One difficulty is therefore to know which model
should be used to analyze a particular data set.
2.9 Selection of the One important issue in model selection is about the trade-off
Appropriate between bias and variance [55]: a simple model will fail to capture
Substitution Model all the sophistication of the actual substitution process, and will
A Not-So-Long Introduction to Computational Molecular Evolution 85
CV no CV
p(X | θ)
θ
2.9.1 The Likelihood The substitution models presented above have one key property: it
Ratio Test is possible to reduce the most sophisticated time-reversible named
model (GTR+Γ+I) to any simpler model by imposing some con-
straints on parameters. As a result, the models are said to be nested,
and statistical theory (the Neyman–Pearson lemma) tells us that
there is an optimal (most powerful) way of comparing two nested
models (a simple null vs. a simple alternative hypothesis) based on
the likelihood ratio test or LRT.
86 Stéphane Aris-Brosou and Nicolas Rodrigue
2.9.3 The Bayesian The Bayesian framework has permitted the development of two
Approach main approaches, which are actually two sides of the same coin: one
based on finding the model that is the most probable a posteriori,
and one based on ranking models and estimating a quantity called
the Bayes factor.
In a nutshell, the frequentist approaches developed in the
previous sections are based on the likelihood, which is the proba-
bility of the data, given the parameters: p(X|θ). However, this
approach may not be the most intuitive, since most practitioners
are not interested in knowing the conditional probability of their
data, as the data were collected to learn more about the processes
that generated them. It can therefore be argued that the Bayesian
approach, which considers the probability of the parameters given
the data or p(θ|X), is more intuitive than the frequentist approach.
Unlike likelihood, which relies on the function p(X|θ) and permits
point estimation, Bayesian inference is based on the posterior dis-
tribution p(θ|X). This distribution is often summarized by a cen-
trality measure such as its mode, mean, or median. Measures of
uncertainty are based on credibility intervals, the Bayesian equiva-
lent of confidence intervals. Typically, credibility intervals are taken
at the 95% cutoff and are called highest posterior densities (HPDs).
The connection between posterior probability and likelihood is
made with Bayes’ inversion formula, also called Bayes’ theorem, by
means of a quantity called the prior distribution p(θ):
pðX jθÞ pðθÞ
pðθjX Þ ¼ ð21Þ
pðX Þ
The prior represents what we think about the process that gener-
ated the data, before analyzing the data, and is at the origin of all
controversies surrounding Bayesian inference. In practice, priors
are more typically chosen based on statistical convenience, and
often have nothing to do with our genuine state of knowledge
about parameters before observing the available data. We will see
in Subheading 3.1 that priors can be used to distinguish between
parameters that are confounded in a maximum likelihood analysis
90 Stéphane Aris-Brosou and Nicolas Rodrigue
2.10 Finding the Best Now that we can select a model of evolution (Subheading 2.9) and
Tree Topology estimate model parameters (Subheading 2.8) under a particular
model (Subheading 2.5), how do we find the optimal tree? The
2.10.1 Counting Trees
basic example in Subheading 2.1 suggested that we score all possi-
ble tree topologies and choose for inference the one that has the
highest score. However, a simple counting exercise shows that an
exhaustive examination of all possible topologies is not realistic.
Figure 9 shows how to count tree topologies. Starting from the
simplest possible unrooted tree, with three taxa, there are three
positions where a fourth branch (leading to a fourth taxon) can be
added. As a result, there are three possible topologies with four
taxa. For each of these, there are four places on the tree where a fifth
branch can be added, which leads to a total of 3 5 ¼ 15 topol-
ogies with five taxa. A recursion appears immediately, and it can be
shown that the total number of unrooted topologies with n taxa is
equal to 1 3 2n 5 [100] (see [15] for the deeper his-
tory), which, as given in [101], is equal to:
3
ð2n 5Þ! 2 n2
Γðn Þ
T ðnÞ
N unrooted ¼ n3 ¼ pffiffiffi 2 ð26Þ
2 ðn 3Þ! π
where the Γ function for any real number x is defined as
Ð 1 x1
ΓðxÞ ¼ 0 t e t dt. An approximation based on Stirling num-
ber is also given in [101].
The same exercise can be done for rooted trees (Fig. 10), where
the number of possible rooted topologies with n taxa becomes
1 3 2n 3, which is
B B
B
Table 2
Counting tree topologies
A B C
A B C D
A B C D E
1
2 n1
Γ n
T ðnÞ ð2n 3Þ! 2 ð27Þ
N rooted ¼ ¼ pffiffiffi
2 ðn 2Þ!
n2 π
T ðnÞ T ðn1Þ
Note that N unrooted ¼ N rooted , as Table 2 clearly suggests.
As a result, the number of possible topologies quickly becomes
very large when the number n of sequences increases, even with a
very modest n, so that heuristics become necessary to find the best-
scoring tree.
2.10.2 Some Heuristics The simplest approach builds upon the idea presented in Figs. 9
to Find the Best Tree and 10. Stepwise addition, for instance, starts with three sequences
drawn at random among the n sequences to be analyzed, and adds
sequences one at a time, keeping only the tree that has the highest
score at each step (e.g., [52]). However, there is no guarantee that
the final tree is the optimal tree [44]. The idea behind branch-and-
A Not-So-Long Introduction to Computational Molecular Evolution 95
2.10.3 Cutting Corners As some of the above computations can become quite costly (high
with ABC and AI runtimes, heavy memory footprints, poor scalability with large data
sets, etc.), computational workarounds have been and are being
explored. One of these resorts to approximate Bayes computing
(ABC), which is essentially a likelihood-free approach. First devel-
oped in the context of population genetics [109, 110], the driving
idea is to bypass the optimization procedures and replace them with
simulations in the context of a rejection sampler. In population
genetics, the problem could be about a gene tree, which is usually
96 Stéphane Aris-Brosou and Nicolas Rodrigue
3.1 Dating the Tree Similar to the problem of estimating the tree of life, dating the tree
of Life: Always of life poses many challenges [135]. Since it was first proposed in
Deeper? 1965 [40], the idea of estimating divergence times has since under-
gone a dramatic change, and new approaches are regularly pro-
posed. Population geneticists have their own approaches, which are
either fully Bayesian [136] or based on approximate Bayesian com-
putation in the coalescent framework [137]. All these approaches
make it possible to infer divergence times between recently
diverged species, as in the case of humans and chimpanzees, or to
date demographic events such as the migrations “out of Africa” of
early human populations [138].
In the context of molecular evolution, we are usually interested
in estimating deeper divergence times, such as those between spe-
cies, which are available online, for instance, at www.timetree.org
[139], recently revamped and extended to cover close to 100k
species [140]. While early “molecular dates” were systematically
biased towards ages that are too old [135], we argue here that
recent developments in the field have led to more accurate methods
and also to a better understanding of methodological limitations.
3.1.1 The Strict One quantity that we can estimate when comparing pairs of
Molecular Clock sequences is the number of differences that exist. This number,
estimated as a branch length b, can be corrected for multiple sub-
stitutions (see Subheading 2.7), but basically remains an expected
number of substitutions per site. With “dating” (defined here as the
activity of estimating divergence times [141]), we are interested in
estimating time t, which relates to the expected numbers of sub-
stitutions b according to the following equation:
b ¼ Δt r ð28Þ
where Δt is a period of time and r the rate of evolution. In technical
terms, times and rates are said to be confounded, because we
cannot estimate one without making an assumption about the
other.
The molecular clock hypothesis does just this by assuming that
rates of evolution are constant in time [40] (see also [142], p. 65).
Under this assumption, the estimated tree is ultrametric as in the
basic example represented in Fig. 11, which implies that all the tips
are level, or equivalently that the distance from root to tip is the
same for all branches.
In this example (Fig. 11), the branch length from the fossil-
dated node is 0.1 substitutions/site (sub/site), and the fossil was
estimated to be present 10 million years ago (MYA). Under the
strict molecular clock assumption (equal rates over the whole tree),
we can (1) estimate the rate of evolution (0.1/10 ¼ 0.01
98 Stéphane Aris-Brosou and Nicolas Rodrigue
0.1
0.02
S2 S6 S5 S3 S4 S1 S7
Fig. 11 The strict molecular clock. The tree is ultrametric. The node marked with a star indicates the presence
of a fossil, dated in this example to 10 million years ago (MYA). This is the point that we will use to calibrate the
clock, that is, to estimate the global rate of evolution. The number of substitutions that accumulated from the
marked node to the tips (present) is indicated on the right weights in at 0.1 substitutions/site. The node that is
the most recent common ancestor of S2 and S5 is the node of interest. The number of substitutions from this
node to the tips is 0.02 substitutions/site
sub/site/my) and (2) date all the other nodes on the tree. For
instance, the most recent common ancestor of S2 and S5 is sepa-
rated from the tips by a branch length of 0.02 sub/site. Its diver-
gence time is therefore 0.02/0.01 ¼ 2 MYA.
As with any hypothesis, the strict clock can be tested. Tests
based on relative rates assess whether two species evolve at the same
rate as a third one, used as an outgroup. Originally formulated in a
distance-based context [143], likelihood versions have been
described [44, 144]. However, because of their low power [145]
their use is on the wane. The most powerful test is again the LRT
(see Subheading 2.9.1). The test proceeds as usual, first calculating
the test statistic 2Δℓ (twice the difference of log-likelihood values).
The null hypothesis (strict clock) is nested within the alternative
hypothesis (clock not enforced), so that 2Δℓ follows a χ 2 distribu-
tion. The degree of freedom is calculated following Fig. 12. With
an alignment of n sequences, we can estimate n 1 divergence
times under the null model (disregarding parameters of the substi-
tution model) and we have 2n 3 branch lengths under the
alternative model. The difference in number of free parameters is
therefore n 2, which is our degree of freedom. This version of
the test actually assesses whether all tips are at the same distance
from the root of the tree [44]. For time-stamped data, serially
sampled in time as in the case of viruses, the alternative model
incorporates information on tip dates [146].
This linear regression model suggested by the molecular clock
hypothesis has often been portrayed as a recipe [147], which gave
rise in the late twentieth to early twenty-first century to a veritable
cottage industry [148–151], culminating with a paper suggesting
that the age of the tree of life might be older than the age of planet
Earth [152]. This recipe was put down by two factors: (1) the
A Not-So-Long Introduction to Computational Molecular Evolution 99
t6
t5
b2
t4 b1
b4 b3
t3 b5
b6
t2
t1 b7 b8 b9 b10 b11
S2 S6 S5 S3 S4 S1 S7
Fig. 12 Testing the strict molecular clock. The divergence times that can be estimated under the strict clock
assumption are denoted ti. The branch lengths that can be estimated without the clock are denoted bi. In the
case depicted, with n ¼ 7 sequences, we have n 1 ¼ 6 divergence times and 2n 3 ¼ 11 branch
lengths
3.1.2 Local Molecular This “debacle” has motivated further theoretical developments in
Clocks the dating field. The simplest idea is that, if a global clock does not
hold for the entire tree, then perhaps groups of related species share
the same rate. That is, if a global clock does not hold, perhaps the
tree can be subdivided into local molecular clocks. An initial idea
was proposed in the context of quartets of sequences [154] and was
later generalized to a tree of any size with any number of local
clocks on the tree [155] (constrained by the number of branches on
the tree and calibration points). Because of the arbitrariness of such
local clocks, methods have been devised to place the clocks on the
tree [156] and to estimate the appropriate number of clocks that
should be used [157]. A Bayesian approach now estimates all these
parameters and their placement in an integrated statistical
framework [158].
3.1.3 Correlated Relaxed The idea of a correlated relaxed molecular clock goes back to
Clocks Sanderson [159] (see also [160]), who considered that rates of
evolution can change from branch to branch on a tree. By con-
straining rates of evolution to vary in an autocorrelated manner on
a tree, it is possible to devise a method that minimizes the amount
of rate change.
100 Stéphane Aris-Brosou and Nicolas Rodrigue
max
3
Prior distribution min
on rates
2
Prior on
calibration
1
Prior distribution
on times
3.1.4 Uncorrelated Because of the autocorrelation between the rate of each branch and
Relaxed Clocks that of its ancestral branch (except for the root, which obviously
requires a special treatment), the tree topology is fixed under the
autocorrelated models described above. By relaxing this assump-
tion about rate autocorrelation, [172] were able to implement a
model that also integrates over topological uncertainty. In spite of
the somewhat counter-intuitive nature of the relaxation of the
autocorrelated process, as implemented in BEAST [91, 173], empir-
ical studies have found this approach to be one of the best-
performing (e.g., [157]).
When first published, it was proposed that making use of an
uncorrelated relaxed molecular clock could improve phylogenetic
inference [172]. The idea was that calibration points and their
placement on the tree could act as additional information. How-
ever, a simulation study suggests that relaxed molecular clocks
might not improve phylogenetic accuracy [174], a result that
might be due to the lack of calibration constraints in this particular
simulation study.
3.1.5 Some Applications Since the advent of relaxed molecular clocks, two very exciting
of Relaxed Clock Models developments have seen the light of day. The first concerns the
inclusion of spatial statistics into dating models [175, 176]. Spatial
statistics are not new in population genetics [177] and have been
used with success in combination with analyses in computational
molecular evolution (e.g., [178]). However, the originality in
[176], for instance, is to combine in a single statistical framework
102 Stéphane Aris-Brosou and Nicolas Rodrigue
4.1 Bridging the Gap Assuming a point-mutation process, such that events only change
Between Population one nucleotide of a codon during a small time interval, Muse and
Genetics and Gaut proposed a codon substitution model with rates specified
Phylogenetics from the QGTR nucleotide-level matrix (see Subheading 2.7),
along with one parameter that modulates synonymous events and
another one that modulates nonsynonymous events [194]. In most
subsequent formulations, the parameter associated with synony-
mous events is assumed to be fixed, such that the model only
modulates nonsynonymous rates by means of a parameter denoted
ω. This parameter has traditionally been interpreted as the nonsy-
nonymous to synonymous rate ratio, and is generally associated
with a different formulation of the codon model proposed by
Goldman and Yang [195]. More details on codon models can be
found in Chapter 4.1 [196]. There continues to be a debate
regarding the interpretation of the ω parameter [197, 198]. Regard-
less of how this issue is settled, it is clear that ω is aimed at capturing
the net overall effects of selection, irrespective of the exact nature of
these effects.
With the intention to model selective effects themselves, Hal-
pern and Bruno proposed a codon substitution model that com-
bines a nucleotide-level layer, as described above, for controlling
mutational features, along with a fixation factor that is proportional
to the fixation probability of the mutational event [199]. The
fixation factor is in turn specified from an account of amino acid
or codon preferences. One objective of the model, then, consists in
teasing apart mutation and selection. While [199] proposed their
model with site-specific fixation factors, later work has explored
simpler specifications, where all sites have the same fixation factor
[200]. Other models that aimed at capturing across-site heteroge-
neities in fixation factors were proposed using nonparametric
devices and empirical mixtures [201]. Another core idea behind
these approaches is to construct a more appropriate null model
against which to test for features of the evolutionary process. This
idea has been put into practice for the detection of adaptive evolu-
tion in protein-coding genes [202, 203]. Recent developments
include sequence-wide fixation factors [9, 197, 204, 205], and we
predict that these models will play a role in bridging the gap
between molecular evolution at the population and at the species
levels.
mutation, selection, and drift. The question is then, what is the fate
of an allele under the combined action of these processes? Our
development here follows [207] (but see [208] for a very clear
account).
4.2.1 Fixation Of the three processes affecting allele frequencies, mutation and
Probabilities selection can be seen as directional forces in that their action will
shift the distribution of allele frequencies towards a particular point,
be it an internal equilibrium, or fixation/loss of an allele. On the
other hand, drift is a non-directional process that will increase the
variance in allele frequencies across populations, and will therefore
spread out the distribution of allele frequencies. This distribution is
denoted Ψ ( p, t). We also must assume that the magnitude of all
three processes, mutation, selection, and drift, is small and of the
1
order of 2N e
, where Ne is the effective population size. To derive the
fate of an allele after a certain number of generations, we also need
to define g( p, ε;dt), the probability that allele frequency changes
from p to p + ε during a time interval dt.
In phylogenetics (and population genetics) we are generally
interested in predicting the past. The tool making this possible is
called the Kolmogorov backward equation, which predicts the
frequency of an allele at some time t, given its frequency p0 at
time t0:
Ð
Ψ ðp, t þ dtjp0 Þ ¼ Ψ ðp, tjp0 þ εÞ gðp0 , ε;dtÞ dε ð30Þ
We can take the Taylor expansion of Eq. 30 around p0, neglect all
terms whose order is larger than two (oðp20 Þ) and since Ψ is not a
function of ε, we obtain:
ð ð
Ð ∂Ψ
2
∂ Ψ ε2
Ψ ðp, t þ dtjp0 Þ ¼ Ψ g dε þ εgdε þ gdε
∂p0 ∂p20 2
ð31Þ
This formulation leads to the definition of two terms that represent
the directional processes affecting allele frequencies (M) and the
non-directional process, or drift (V ):
8 ð
>
< M ðpÞ dt ¼ g ε dε
ð ð32Þ
>
: V ðpÞ dt ¼ g ε2 dε
Table 3
The standard selection models
4.2.2 The Case of Genic We are now ready to derive an explicit form to Ψ (1, 1; p0) in
Selection Eq. 34 in the case of the genic selection model (Table 3; [209]).
We obtain:
w ¼ 1 þ sp2 þ 2pqhs ¼ 1 þ 2phs þ sp2 ð1 2hÞ ð35Þ
which can be approximated by 1 + 2phs (the result is exact only
when h ¼ 1/2). Therefore, dw=dp ¼ 2hs, and we can calculate the
M and V terms to obtain the popular result:
Ð p0Ð 2M
e V dp dp e 4N e hsp0 1
Ψ ð1, 1;p0 Þ ¼ Ð0 Ð 2M ¼ 4N hs ð36Þ
1
V dp dp
e e 1
0 e
5.2 HPC and Cloud More recent work has focused on the development of heuristics
Computing that make large-scale phylogenetics amenable to high-performance
computing (HPC) that are performed on computer clusters.
A Not-So-Long Introduction to Computational Molecular Evolution 107
6 Conclusions
Acknowledgements
References
1. Nei M, Kumar S (2000) Molecular evolution 7. Drummond AJ, Bouckaert RR (2015) Bayes-
and phylogenetics. Oxford University Press, ian evolutionary analysis with BEAST. Cam-
Oxford bridge University Press, Cambridge
2. Higgs PG, Attwood TK (2005) Bioinformat- 8. Aris-Brosou S, Xia X (2008) Phylogenetic
ics and molecular evolution. Blackwell Pub- analyses: a toolbox expanding towards Bayes-
lishing, Oxford ian methods. Int J Plant Genomics
3. Balding DJ, Bishop MJ, Cannings C (2007) 2008:683509
Handbook of statistical genetics, 3rd edn. 9. Rodrigue N, Philippe H (2010) Mechanistic
Wiley, Chichester revisions of phenomenological modeling stra-
4. Salemi M, Vandamme A-M, Lemey P (2009) tegies in molecular evolution. Trends Genet
The phylogenetic handbook: a practical 26:248–252
approach to phylogenetic analysis and 10. Yang Z, Rannala B (2012) Molecular phylo-
hypothesis testing, 2nd edn. Cambridge Uni- genetics: principles and practice. Nat Rev
versity Press, Cambridge Genet 13:303–314
5. Hall BG (2011) Phylogenetic trees made easy: 11. Aris-Brosou S, Rodrigue N (2012) The essen-
a how to manual. Sinauer Associates, tials of computational molecular evolution.
Sunderland Methods Mol Biol 855:111–152
6. Yang Z (2014) Molecular evolution: a statisti- 12. Yang Z (2000) Complexity of the simplest
cal approach. Oxford University Press, phylogenetic estimation problem. Proc Biol
Oxford Sci 267:109–116
110 Stéphane Aris-Brosou and Nicolas Rodrigue
13. Sober E (1988) Reconstructing the past: par- 29. Shimodaira H, Hasegawa M (2001) CON-
simony, evolution, and inference. MIT Press, SEL: for assessing the confidence of phyloge-
Cambridge netic tree selection. Bioinformatics
14. Durbin R, Eddy SR, Krogh A, Mitchison G 17:1246–1247
(1998) Biological sequence analysis: probabi- 30. Salichos L, Rokas A (2013) Inferring ancient
listic models of proteins and nucleic acids. divergences requires genes with strong phylo-
Cambridge University Press, Cambridge genetic signals. Nature 497:327–331
15. Felsenstein J (2004) Inferring phylogenies. 31. Felsenstein J (1978) Cases in which parsi-
Sinauer Associates, Sunderland mony or compatibility methods will be posi-
16. Yang Z (2007) PAML 4: phylogenetic analysis tively misleading. Syst Zool 27:401–410
by maximum likelihood. Mol Biol Evol 32. Tuffley C, Steel M (1997) Links between
24:1586–1591 maximum likelihood and maximum parsi-
17. Efron B, Tibshirani R (1993) An introduction mony under a simple model of site substitu-
to the bootstrap, vol 57. Chapman and Hall, tion. Bull Math Biol 59:581–607
Boca Raton 33. Steel M, Penny D (2000) Parsimony, likeli-
18. Efron B, Halloran E, Holmes S (1996) Boot- hood, and the role of models in molecular
strap confidence levels for phylogenetic trees. phylogenetics. Mol Biol Evol 17:839–850
Proc Natl Acad Sci USA 93:7085–7090 34. Holder MT, Lewis PO, Swofford DL (2010)
19. Felsenstein J (1985) Confidence limits on The Akaike information criterion will not
phylogenies: an approach using the bootstrap. choose the no common mechanism model.
Evolution 39:783–791 Syst Biol 59:477–485
20. Baldauf SL (2003) Phylogeny for the faint of 35. Editors T (2016) Editorial. Cladistics 32:1.
heart: a tutorial. Trends Genet 19:345–351 https://doi.org/10.1111/cla.12148
21. Hasegawa M, Kishino H (1989) Confidence 36. Philippe H, Zhou Y, Brinkmann H,
limits of the maximum-likelihood estimate of Rodrigue N, Delsuc F (2005) Heterotachy
the hominoid three from mitochondrial-DNA and long-branch attraction in phylogenetics.
sequences. Evolution 43:672–677 BMC Evol Biol 5:50
22. Anisimova M, Gascuel O (2006) Approxi- 37. Brinkmann H, van der Giezen M, Zhou Y, de
mate likelihood-ratio test for branches: a fast, Raucourt GP, Philippe H (2005) An empirical
accurate, and powerful alternative. Syst Biol assessment of long-branch attraction artefacts
55:539–552 in deep eukaryotic phylogenomics. Syst Biol
23. Guindon S, Delsuc F, Dufayard J-F, Gascuel 54:743–757
O (2009) Estimating maximum likelihood 38. Hampl V, Hug L, Leigh JW, Dacks JB, Lang
phylogenies with phyml. Methods Mol Biol BF, Simpson AG, Roger AJ (2009) Phyloge-
537:113–137 nomic analyses support the monophyly of
24. Guindon S, Dufayard J-F, Lefort V, Excavata and resolve relationships among
Anisimova M, Hordijk W, Gascuel O (2010) eukaryotic “supergroups”. Proc Natl Acad
New algorithms and methods to estimate Sci USA 106:3859–3864
maximum-likelihood phylogenies: assessing 39. Liu H, Aris-Brosou S, Probert I, de Vargas C
the performance of PhyML 3.0. Syst Biol (2010) A timeline of the environmental
59:307–321 genetics of the haptophytes. Mol Biol Evol
25. Hillis DM, Bull JJ (1993) An empirical test of 27:161–176
bootstrapping as a method for assessing con- 40. Zuckerkandl E, Pauling L (1965) Evolution-
fidence in phylogenetic analysis. Syst Biol ary divergence and convergence in proteins.
42:182–192 In: Bryson V, Vogel HJ (eds) Evolving genes
26. Felsenstein J, Kishino H (1993) Is there and proteins. Academic, Cambridge, pp
something wrong with the bootstrap on phy- 97–166
logenies? A reply to Hillis and Bull. Syst Biol 41. Galtier N, Gascuel O, Jean-Marie A (2005)
42:193–200 Markov models in molecular evolution. In:
27. Yang Z, Rannala B (2005) Branch-length Nielsen R (ed) Statistical methods in molecu-
prior influences Bayesian posterior probability lar evolution. Statistics for biology and health.
of phylogeny. Syst Biol 54:455–470 Springer, New York, pp 3–24
28. Berry V, Gascuel O (1996) On the interpreta- 42. Cox DR, Miller HD (1965) The theory of
tion of bootstrap trees: appropriate threshold stochastic processes. Chapman and Hall/
of clade selection and induced gain. Mol Biol CRC, Boca Raton
Evol 13:999 43. Yang Z (2000) Maximum likelihood estima-
tion on large phylogenies and analysis of
A Not-So-Long Introduction to Computational Molecular Evolution 111
adaptive evolution in human influenza virus methods for detecting adaptive evolution in
A. J Mol Evol 51:423–432 protein coding sequences and for identifying
44. Felsenstein J (1981) Evolutionary trees from positively selected sites. Genetics
DNA sequences: a maximum likelihood 168:1041–1051
approach. J Mol Evol 17:368–376 59. Massingham T, Goldman N (2005) Detecting
45. Jukes JC, Cantor CR (1969) Evolution of amino acid sites under positive selection and
protein molecules. In: Munro HN purifying selection. Genetics 169:1753–1762
(ed) Mammalian protein metabolism. Aca- 60. Zhang J, Nielsen R, Yang Z (2005) Evalua-
demic, New York, pp 21–123 tion of an improved branch-site likelihood
46. Kimura M (1980) A simple method for esti- method for detecting positive selection at the
mating evolutionary rates of base substitu- molecular level. Mol Biol Evol 22:2472–2479
tions through comparative studies of 61. Anisimova M, Yang Z (2007) Multiple
nucleotide sequences. J Mol Evol hypothesis testing to detect lineages under
16:111–120 positive selection that affects only a few sites.
47. Hasegawa M, Kishino H, Yano T (1985) Dat- Mol Biol Evol 24:1219–1228
ing of the human-ape splitting by a molecular 62. Yang Z (2010) A likelihood ratio test of spe-
clock of mitochondrial DNA. J Mol Evol ciation with gene flow using genomic
22:160–174 sequence data. Genome Biol Evol 2:200–211
48. Tavaré S (1986) Some probabilistic and statis- 63. Fletcher W, Yang Z (2010) The effect of inser-
tical problems in the analysis of DNA tions, deletions, and alignment errors on the
sequences. Lect Math Life Sci 17:57–86 branch-site test of positive selection. Mol Biol
49. Huelsenbeck JP, Larget B, Alfaro ME (2004) Evol 27:2257–2267
Bayesian phylogenetic model selection using 64. Yang Z, dos Reis M (2011) Statistical proper-
reversible jump Markov chain Monte Carlo. ties of the branch-site test of positive selec-
Mol Biol Evol 21:1123–1133 tion. Mol Biol Evol 28:1217–1228
50. Yang Z, Roberts D (1995) On the use of 65. Self SG, Liang K-Y (1987) Asymptotic prop-
nucleic acid sequences to infer early branch- erties of maximum likelihood estimators and
ings in the tree of life. Mol Biol Evol likelihood ratio tests under nonstandard con-
12:451–458 ditions. J Am Stat Assoc 82:605–610
51. Huelsenbeck JP, Bollback JP, Levine AM 66. Posada D, Crandall KA (1998) MODELT-
(2002) Inferring the root of a phylogenetic EST: testing the model of DNA substitution.
tree. Syst Biol 51:32–43 Bioinformatics 14:817–818
52. Yang Z (2006) Computational molecular evo- 67. Posada D (2008) jModelTest: phylogenetic
lution. Oxford University Press, Oxford model averaging. Mol Biol Evol
53. Aris-Brosou S (2005) Determinants of adap- 25:1253–1256
tive evolution at the molecular level: the 68. Cunningham CW, Zhu H, Hillis DM (1998)
extended complexity hypothesis. Mol Biol Best-fit maximum-likelihood models for phy-
Evol 22:200–209 logenetic inference: empirical tests with
54. Anisimova M, Yang Z (2004) Molecular evo- known phylogenies. Evolution 52:978–987
lution of the hepatitis delta virus antigen gene: 69. Pol D (2004) Empirical problems of the hier-
recombination or positive selection? J Mol archical likelihood ratio test for model selec-
Evol 59:815–826 tion. Syst Biol 53:949–962
55. Burnham KP, Anderson DR (1998) Model 70. Kullback S, Leibler RA (1951) On informa-
selection and inference: a practical tion and sufficiency. Ann Math Stat 22:79–86
information-theoretic approach. Springer, 71. Minin V, Abdo Z, Joyce P, Sullivan J (2003)
Berlin Performance-based selection of likelihood
56. Anisimova M, Bielawski JP, Yang Z (2001) models for phylogeny estimation. Syst Biol
Accuracy and power of the likelihood ratio 52:674–683
test in detecting adaptive molecular evolution. 72. Ripplinger J, Sullivan J (2008) Does choice in
Mol Biol Evol 18:1585–1592 model selection affect maximum likelihood
57. Whelan S, Goldman N (2004) Estimating the analysis? Syst Biol 57:76–85
frequency of events that cause multiple- 73. Posada D, Crandall KA (2001) Selecting the
nucleotide changes. Genetics best-fit model of nucleotide substitution. Syst
167:2027–2043 Biol 50:580–601
58. Wong WS, Yang Z, Goldman N, Nielsen R 74. Abdo Z, Minin VN, Joyce P, Sullivan J (2005)
(2004) Accuracy and power of statistical Accounting for uncertainty in the tree
112 Stéphane Aris-Brosou and Nicolas Rodrigue
topology has little effect on the decision- 90. Ronquist F, Huelsenbeck JP (2003) MrBayes
theoretic approach to model selection in phy- 3: Bayesian phylogenetic inference under
logeny estimation. Mol Biol Evol 22:691–703 mixed models. Bioinformatics 19:1572–1574
75. Luo A, Qiao H, Zhang Y, Shi W, Ho SY, 91. Drummond AJ, Rambaut A (2007) BEAST:
Xu W, Zhang A, Zhu C (2010) Performance Bayesian evolutionary analysis by sampling
of criteria for selecting evolutionary models in trees. BMC Evol Biol 7:214
phylogenetics: a comprehensive study based 92. Raftery AE (1996) Hypothesis testing and
on simulated datasets. BMC Evol Biol 10:242 model selection. In: Gilks WR, Richardson S,
76. Schwarz G (1978) Estimating the dimension Spiegelhalter DJ (eds) Markov chain Monte
of a model. Ann Stat 6:461–464 Carlo in practice. Chapman & Hall, Boca
77. Evans J, Sullivan J (2011) Approximating Raton, pp 163–187
model probabilities in Bayesian information 93. Ogata Y (1989) A Monte Carlo method for
criterion and decision-theoretic approaches high dimensional integration. Numer Math
to model selection in phylogenetics. Mol 55:137–157
Biol Evol 28:343–349 94. Gelman A, Meng X-L (1998) Simulating nor-
78. Gascuel O (1997) BIONJ: an improved ver- malizing constants: from importance sam-
sion of the NJ algorithm based on a simple pling to bridge sampling to path sampling.
model of sequence data. Mol Biol Evol Stat Sci 13:163–185
14:685–695 95. Xie W, Lewis PO, Fan Y, Kuo L, Chen MH
79. Darriba D, Taboada GL, Doallo R, Posada D (2011) Improving marginal likelihood esti-
(2012) jModelTest 2: more models, new mation for Bayesian phylogenetic model
heuristics and parallel computing. Nat Meth- selection. Syst Biol 60:150–60
ods 9:772–772 96. Baele G, Lemey P, Bedford T, Rambaut A,
80. Lefort V, Longueville J-E, Gascuel O (2017) Suchard MA, Alekseyenko AV (2012)
SMS: smart model selection in PhyML. Mol Improving the accuracy of demographic and
Biol Evol 34:2422–2424 molecular clock model comparison while
81. Kleinman CL, Rodrigue N, Bonnard C, accommodating phylogenetic uncertainty.
Philippe H, Lartillot N (2006) A maximum Mol Biol Evol 29:2157–2167
likelihood framework for protein design. 97. Raftery AE, Newton MA, Satagopan JM, Kri-
BMC Bioinformatics 7:326 vitsky PN (2007) Estimating the integrated
82. Rodrigue N, Philippe H, Lartillot N (2007) likelihood via posterior simulation using the
Exploring fast computational strategies for harmonic mean identity. Bayesian Stat 8:1–45
probabilistic phylogenetic analysis. Syst Biol 98. Smyth P (2000) Model selection for probabi-
56:711–726 listic clustering using cross-validated likeli-
83. Yang Z (2005) Bayesian inference in molecu- hood. Stat Comput 10:63–72
lar phylogenetics. In: Gascuel O 99. Lartillot N, Brinkmann H, Philippe H (2007)
(ed) Mathematics of evolution and phylogeny. Suppression of long-branch attraction arte-
Oxford University Press, Oxford, pp 63–90 facts in the animal phylogeny using a site-
84. Jeffreys H (1939) Theory of probability. The heterogeneous model. BMC Evol Biol 7
International series of monographs on phys- (Suppl 1):S4
ics. The Clarendon Press, Oxford 100. Cavalli-Sforza LL, Edwards AW (1967) Phy-
85. Kass RE, Raftery AE (1995) Bayes factors. J logenetic analysis. Models and estimation
Am Stat Assoc 90:773–795 procedures. Am J Hum Genet 19:233–257
86. Lartillot N, Philippe H (2006) Computing 101. Aris-Brosou S (2003) Least and most power-
Bayes factors using thermodynamic integra- ful phylogenetic tests to elucidate the origin
tion. Syst Biol 55:195–207 of the seed plants in the presence of
87. Fan Y, Wu R, Chen MH, Kuo L, Lewis PO conflicting signals under misspecified models.
(2011) Choosing among partition models in Syst Biol 52:781–793
Bayesian phylogenetics. Mol Biol Evol 102. Foulds LR, Penny D, Hendy MD (1979) A
28:523–32 general approach to proving the minimality of
88. Newton MA, Raftery AE (1994) Approximat- phylogenetic trees illustrated by an example
ing Bayesian inference with the weighted like- with a set of 23 vertebrates. J Mol Evol
lihood bootstrap. J R Stat Soc B 56:3–48 13:151–166
89. Aris-Brosou S (2003) How Bayes tests of 103. Hendy MD, Penny D (1982) Branch and
molecular phylogenies compare with frequen- bound algorithms to determine minimal evo-
tist approaches. Bioinformatics 19:618–624 lutionary trees. Math Biosci 59:277–290
A Not-So-Long Introduction to Computational Molecular Evolution 113
104. Saitou N, Nei M (1987) The neighbor- 119. Ratmann O, Donker G, Meijer A, Fraser C,
joining method: a new method for recon- Koelle K (2012) Phylodynamic inference and
structing phylogenetic trees. Mol Biol Evol model assessment with approximate Bayesian
4:406–425 computation: influenza as a case study. PLoS
105. Bruno WJ, Socci ND, Halpern AL (2000) Comput Biol 8:e1002835
Weighted neighbor joining: a likelihood- 120. Zheng Y, Aris-Brosou S (2013) Approximate
based approach to distance-based phylogeny Bayesian computation algorithms for estimat-
reconstruction. Mol Biol Evol 17:189–197 ing network model parameters. In: Joint sta-
106. Larget B, Simon D (1999) Markov chain tistical meeting proceedings (2013)—
Monte Carlo algorithms for the Bayesian biometrics section, pp 2239–2253
analysis of phylogenetic trees. Mol Biol Evol 121. Poon AF (2015) Phylodynamic inference
16:750 with kernel ABC and its application to HIV
107. Holder MT, Lewis PO, Swofford DL, Larget epidemiology. Mol Biol Evol 32:2483–2495
B (2005) Hastings ratio of the LOCAL pro- 122. Ibeh N, Aris-Brosou S (2016) Estimation of
posal used in Bayesian phylogenetics. Syst sub-epidemic dynamics by means of sequen-
Biol 54:961–965 tial Monte Carlo approximate Bayesian com-
108. Whelan S (2007) New approaches to phylo- putation: an application to the Swiss HIV
genetic tree search and their application to cohort study. https://doi.org/10.1101/
large numbers of protein alignments. Syst 085993
Biol 56:727–740 123. Hastie T, Tibshirani R, Friedman JH (2009)
109. Pritchard JK, Seielstad MT, Perez-Lezaun A, The elements of statistical learning: data
Feldman MW (1999) Population growth of mining, inference, and prediction. Springer
human y chromosomes: a study of Y chromo- series in statistics, 2nd edn. Springer,
some microsatellites. Mol Biol Evol New York
16:1791–1798 124. Poon AF, Walker LW, Murray H, McCloskey
110. Beaumont MA, Zhang W, Balding DJ (2002) RM, Harrigan PR, Liang RH (2013)
Approximate Bayesian computation in popu- Mapping the shapes of phylogenetic trees
lation genetics. Genetics 162:2025–2035 from human and zoonotic RNA viruses.
111. Kingman JFC (1982) The coalescent. Stoch PLoS One 8:e78122
Process Appl 13:235–248 125. Schwarz RF, Fletcher W, Förster F, Merget B,
112. Hein J, Schierup MH, Wiuf C (2005) Gene Wolf M, Schultz J, Markowetz F (2010) Evo-
genealogies, variation and evolution: a primer lutionary distances in the twilight zone—a
in coalescent theory. Oxford University Press, rational kernel approach. PLoS One 5:
Oxford e15788
113. Marjoram P, Molitor J, Plagnol V, Tavaré S 126. Höhl M, Ragan MA (2007) Is multiple-
(2003) Markov chain Monte Carlo without sequence alignment required for accurate
likelihoods. Proc Natl Acad Sci inference of phylogeny? Syst Biol 56:206–221
100:15324–15328 127. Sanderson M, Nicolae M, McMahon M
114. Sisson SA, Fan Y, Tanaka MM (2007) (2017) Homology-aware phylogenomics at
Sequential Monte Carlo without likelihoods. gigabase scales. Syst Biol 66:590–603
Proc Natl Acad Sci 104:1760–1765 128. Jordan MI, Mitchell TM (2015) Machine
115. Toni T, Welch D, Strelkowa N, Ipsen A, learning: trends, perspectives, and prospects.
Stumpf MP (2009) Approximate Bayesian Science 349:255–260
computation scheme for parameter inference 129. Rusk N (2016) Deep learning. Nat Methods
and model selection in dynamical systems. J R 13:35
Soc Interface 6:187–202 130. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter
116. Beaumont MA (2010) Approximate Bayesian SM, Blau HM, Thrun S (2017)
computation in evolution and ecology. Annu Dermatologist-level classification of skin can-
Rev Ecol Evol Syst 41:379–406 cer with deep neural networks. Nature
117. Sunnåker M, Busetto AG, Numminen E, 542:115–118
Corander J, Foll M, Dessimoz C (2013) 131. Morell V (1996) TreeBASE: the roots of phy-
Approximate Bayesian computation. PLoS logeny. Science 273:569
Comput Biol 9:e1002803 132. Whelan S, de Bakker PIW, Quevillon E,
118. Lintusaari J, Gutmann MU, Dutta R, Kaski S, Rodriguez N, Goldman N (2006) PANDIT:
Corander J (2017) Fundamentals and recent an evolution-centric database of protein and
developments in approximate Bayesian com- associated nucleotide domains with inferred
putation. Syst Biol 66:e66–e82 trees. Nucleic Acids Res 34:D327–D331
114 Stéphane Aris-Brosou and Nicolas Rodrigue
133. Zhou J, Troyanskaya OG (2015) Predicting 149. Kumar S, Hedges SB (1998) A molecular
effects of noncoding variants with deep timescale for vertebrate evolution. Nature
learning-based sequence model. Nat Meth- 392:917–920
ods 12:931–934 150. Wang DY, Kumar S, Hedges SB (1999)
134. Tran NH, Zhang X, Xin L, Shan B, Li M Divergence time estimates for the early his-
(2017) De novo peptide sequencing by deep tory of animal phyla and the origin of plants,
learning. Proc Natl Acad Sci. https://doi. animals and fungi. Proc Biol Sci 266:163–171
org/10.1073/pnas.1705691114 151. Heckman DS, Geiser DM, Eidell BR, Stauffer
135. Benton MJ, Ayala FJ (2003) Dating the tree RL, Kardos NL, Hedges SB (2001) Molecu-
of life. Science 300:1698–700 lar evidence for the early colonization of land
136. Rannala B, Yang Z (2007) Inferring specia- by fungi and plants. Science 293:1129–1133
tion times under an episodic molecular clock. 152. Hedges SB, Chen H, Kumar S, Wang DY,
Syst Biol 56:453–66 Thompson AS, Watanabe H (2001) A geno-
137. Wegmann D, Leuenberger C, Excoffier L mic timescale for the origin of eukaryotes.
(2009) Efficient approximate Bayesian com- BMC Evol Biol 1:4
putation coupled with Markov chain Monte 153. Graur D, Martin W (2004) Reading the
Carlo without likelihood. Genetics entrails of chickens: molecular timescales of
182:1207–1218 evolution and the illusion of precision. Trends
138. Reich D, Green RE, Kircher M et al (2010) Genet 20:80–86
Genetic history of an archaic hominin group 154. Rambaut A, Bromham L (1998) Estimating
from Denisova Cave in Siberia. Nature divergence dates from molecular sequences.
468:1053–1060 Mol Biol Evol 15:442–448
139. Hedges SB, Dudley J, Kumar S (2006) Time- 155. Yoder AD, Yang Z (2000) Estimation of pri-
Tree: a public knowledge-base of divergence mate speciation dates using local molecular
times among organisms. Bioinformatics clocks. Mol Biol Evol 17:1081–1090
22:2971–2972 156. Yang Z (2004) A heuristic rate smoothing
140. Kumar S, Stecher G, Suleski M, Hedges SB procedure for maximum likelihood estima-
(2017) TimeTree: a resource for timelines, tion of species divergence times. Acta Zool
timetrees, and divergence times. Mol Biol Sin 50:645–656
Evol 34:1812–1819 157. Aris-Brosou S (2007) Dating phylogenies
141. Welch JJ, Bromham L (2005) Molecular dat- with hybrid local molecular clocks. PLoS
ing when rates vary. Trends Ecol Evol One 2:e879
20:320–327 158. Drummond AJ, Suchard MA (2010) Bayesian
142. Kimura M (1983) The neutral theory of random local clocks, or one rate to rule them
molecular evolution. Cambridge University all. BMC Biol 8:114
Press, Cambridge 159. Sanderson M (1997) A nonparametric
143. Sarich VM, Wilson AC (1973) Generation approach to estimating divergence times in
time and genomic evolution in primates. Sci- the absence of rate constancy. Mol Biol Evol
ence 179:1144–1147 14:1218
144. Muse SV, Weir BS (1992) Testing for equality 160. Sanderson MJ (2002) Estimating absolute
of evolutionary rates. Genetics 132:269–276 rates of molecular evolution and divergence
145. Bromham L, Penny D, Rambaut A, Hendy times: a penalized likelihood approach. Mol
MD (2000) The power of relative rates tests Biol Evol 19:101–109
depends on the data. J Mol Evol 50:296–301 161. Gillespie JH (1991) The causes of molecular
146. Rambaut A (2000) Estimating the rate of evolution. Oxford University Press, Oxford
molecular evolution: incorporating 162. Thorne JL, Kishino H, Painter IS (1998)
non-contemporaneous sequences into maxi- Estimating the rate of evolution of the rate
mum likelihood phylogenies. Bioinformatics of molecular evolution. Mol Biol Evol
16:395–399 15:1647–1657
147. Martin AP (2001) Molecular clocks. Encyclo- 163. Aris-Brosou S, Yang Z (2002) Effects of mod-
pedia of life sciences. Wiley, Hoboken, pp 1–6 els of rate evolution on estimation of diver-
148. Wray GA, Levinton JS, Shapiro LH (1996) gence dates with special reference to the
Molecular evidence for deep Precambrian metazoan 18S ribosomal RNA phylogeny.
divergences among Metazoan phyla. Science Syst Biol 51:703–714
274:568–573 164. Aris-Brosou S, Yang Z (2003) Bayesian mod-
els of episodic evolution support a late
A Not-So-Long Introduction to Computational Molecular Evolution 115
precambrian explosive diversification of the 178. Nadin-Davis SA, Feng Y, Mousse D, Wande-
Metazoa. Mol Biol Evol 20:1947–1954 ler AI, Aris-Brosou ST (2010) Spatial and
165. Rannala B, Yang Z (1996) Probability distri- temporal dynamics of rabies virus variants in
bution of molecular evolutionary trees: a new big brown bat populations across Canada:
method of phylogenetic inference. J Mol Evol footprints of an emerging zoonosis. Mol
43:304–311 Ecol 19:2120–2136
166. Pybus OG, Rambaut A, Harvey PH (2000) 179. Pagel M, Meade A (2004) A phylogenetic
An integrated framework for the inference of mixture model for detecting pattern-
viral population history from reconstructed heterogeneity in gene sequence or character-
genealogies. Genetics 155:1429–1437 state data. Syst Biol 53:571–581
167. Drummond AJ, Rambaut A, Shapiro B, 180. Pagel M, Meade A, Barker D (2004) Bayesian
Pybus OG (2005) Bayesian coalescent infer- estimation of ancestral character states on
ence of past population dynamics from molec- phylogenies. Syst Biol 53:673–684
ular sequences. Mol Biol Evol 22:1185–1192 181. Lartillot N, Poujol R (2011) A phylogenetic
168. Minin VN, Bloomquist EW, Suchard MA model for investigating correlated evolution
(2008) Smooth skyride through a rough sky- of substitution rates and continuous pheno-
line: Bayesian coalescent-based inference of typic characters. Mol Biol Evol 28:729–744
population dynamics. Mol Biol Evol 182. Bromham L, Woolfit M, Lee MS, Rambaut A
25:1459–1471 (2002) Testing the relationship between mor-
169. Hedges SB, Kumar S (2004) Precision of phological and molecular rates of change
molecular time estimates. Trends Genet along phylogenies. Evolution 56:1921–1930
20:242–247 183. Ho SYW, Duchêne S (2014) Molecular-clock
170. Yang Z, Rannala B (2006) Bayesian estima- methods for estimating evolutionary rates and
tion of species divergence times under a timescales. Mol Ecol 23:5947–5965
molecular clock using multiple fossil calibra- 184. dos Reis M, Donoghue PCJ, Yang Z (2016)
tions with soft bounds. Mol Biol Evol Bayesian molecular clock dating of species
23:212–226 divergences in the genomics era. Nat Rev
171. Inoue J, Donoghue PCJ, Yang Z (2010) The Genet 17:71–80
impact of the representation of fossil calibra- 185. Donoghue PCJ, Yang Z (2016) The evolu-
tions on Bayesian estimation of species diver- tion of methods for establishing evolutionary
gence times. Syst Biol 59:74–89 timescales. Philos Trans R Soc Lond B Biol
172. Drummond AJ, Ho SYW, Phillips MJ, Ram- Sci. https://doi.org/10.1098/rstb.2016.
baut A (2006) Relaxed phylogenetics and dat- 0020
ing with confidence. PLoS Biol 4:e88 186. Ho SY, Tong KJ, Foster CS, Ritchie AM,
173. Bouckaert R, Heled J, Kühnert D, Lo N, Crisp MD (2015) Biogeographic cali-
Vaughan T, Wu CH, Xie D, Suchard MA, brations for the molecular clock. Biol Lett
Rambaut A, Drummond AJ (2014) BEAST 11:20150194
2: a software platform for Bayesian evolution- 187. Kühnert D, Wu C-H, Drummond AJ (2011)
ary analysis. PLoS Comput Biol 10:e1003537 Phylogenetic and epidemic modeling of rap-
174. Wertheim JO, Sanderson MJ, Worobey M, idly evolving infectious diseases. Infect Genet
Bjork A (2010) Relaxed molecular clocks, Evol 11:1825–1141
the bias-variance trade-off, and the quality of 188. Rieux A, Balloux F (2016) Inferences from
phylogenetic inference. Syst Biol 59:1–8 tip-calibrated phylogenies: a review and a
175. Lemey P, Rambaut A, Drummond AJ, practical guide. Mol Ecol 25:1911–1924
Suchard MA (2009) Bayesian phylogeogra- 189. Ho SYW, Chen AXY, Lins LSF, Duchêne DA,
phy finds its roots. PLoS Comput Biol 5: Lo N (2016) The genome as an evolutionary
e1000520 timepiece. Genome Biol Evol 8:3006–3010
176. Lemey P, Rambaut A, Welch JJ, Suchard MA 190. O’Reilly JE, dos Reis M, Donoghue PCJ
(2010) Phylogeography takes a relaxed ran- (2015) Dating tips for divergence-time esti-
dom walk in continuous space and time. Mol mation. Trends Genet 31:637–50
Biol Evol 27:1877–1885 191. 1000 Genomes Project Consortium, Abecasis
177. Guillot G, Santos F, Estoup A (2008) Analys- GR, Altshuler D, Auton A, Brooks LD, Dur-
ing georeferenced population genetics data bin RM, Gibbs RA, Hurles ME, McVean GA
with Geneland: a new algorithm to deal with (2010) A map of human genome variation
null alleles and a friendly graphical user inter- from population-scale sequencing. Nature
face. Bioinformatics 24:1406–1407 467:1061–1073
116 Stéphane Aris-Brosou and Nicolas Rodrigue
192. UK10K Consortium, Walter K, Min JL, sequences. Philos Trans R Soc Lond B Biol
Huang J et al (2015) The UK10K project Sci 363:3931–3939
identifies rare variants in health and disease. 205. Rodrigue N, Kleinman CL, Philippe H, Lar-
Nature 526:82–90 tillot N (2009) Computational methods for
193. Ledford H (2016) AstraZeneca launches evaluating phylogenetic models of coding
project to sequence 2 million genomes. sequence evolution with dependence between
Nature 532:427 codons. Mol Biol Evol 26:1663–1676
194. Muse SV, Gaut BS (1994) A likelihood 206. Hartl DL, Clark AG (2007) Principles of pop-
approach for comparing synonymous and ulation genetics, 4th edn. Sinauer Associates,
nonsynonymous nucleotide substitution Sunderland
rates, with application to the chloroplast 207. Kimura M (1962) On the probability of fixa-
genome. Mol Biol Evol 11:715–724 tion of mutant genes in a population. Genet-
195. Goldman N, Yang Z (1994) A codon-based ics 47:713–719
model of nucleotide substitution for protein- 208. Rice SH (2004) Evolutionary theory: mathe-
coding DNA sequences. Mol Biol Evol matical and conceptual foundations. Sinauer
11:725–736 Associates, Sunderland
196. Kosiol C, Anisimova M (2011) Methods for 209. Kimura M (1978) Change of gene frequen-
detecting natural selection in protein-coding cies by natural selection under population
genes. In: Anisimova M (ed) Evolutionary number regulation. Proc Natl Acad Sci USA
genomics: statistical and computational 75:1934–1937
methods. Methods in molecular biology 210. Tamuri A, dos Reis M, Goldstein R (2012)
series. Humana-Springer, New York Estimating the distribution of selection coef-
197. Thorne JL, Choi SC, Yu J, Higgs PG, Kishino ficients from phylogenetic data using sitewise
H (2007) Population genetics without intra- mutation-selection models. Genetics
specific data. Mol Biol Evol 24:1667–1677 190:1101–1115
198. Choi SC, Hobolth A, Robinson DM, 211. Rodrigue N (2013) On the statistical inter-
Kishino H, Thorne JL (2007) Quantifying pretation of site-specific variables in
the impact of protein tertiary structure on phylogeny-based substitution models. Genet-
molecular evolution. Mol Biol Evol ics 193:557–564
24:1769–1782 212. Prins P, Belhachemi D, Möller S, Smant G
199. Halpern AL, Bruno WJ (1998) Evolutionary (2011) Scalable computing in evolutionary
distances for protein-coding sequences: mod- genomics. In: Anisimova M
eling site-specific residue frequencies. Mol (ed) Evolutionary genomics: statistical and
Biol Evol 15:910–917 computational methods. Methods in molecu-
200. Yang Z, Nielsen R (2008) Mutation-selection lar biology series. Humana-Springer,
models of codon substitution and their use to New York
estimate selective strengths on codon usage. 213. Yang Z (1993) Maximum-likelihood estima-
Mol Biol Evol 25:568–579 tion of phylogeny from DNA sequences when
201. Rodrigue N, Philippe H, Lartillot N (2010) substitution rates differ over sites. Mol Biol
Mutation-selection models of coding Evol 10:1396–1401
sequence evolution with site-heterogeneous 214. Yang Z (1994) Maximum likelihood phyloge-
amino acid fitness profiles. Proc Natl Acad netic estimation from DNA sequences with
Sci USA 107:4629–4634 variable rates over sites: approximate meth-
202. Rodrigue N, Lartillot N (2017) Detecting ods. J Mol Evol 39:306–314
adaptation in protein-coding genes using a 215. Altekar G, Dwarkadas S, Huelsenbeck JP,
Bayesian site-heterogeneous mutation-selec- Ronquist F (2004) Parallel Metropolis cou-
tion codon substitution model. Mol Biol pled Markov chain Monte Carlo for Bayesian
Evol 34:204–214 phylogenetic inference. Bioinformatics
203. Bloom JD (2017) Identification of positive 20:407–415
selection in genes is greatly improved by 216. Lartillot N, Rodrigue N, Stubbs D, Richer J
using experimentally informed site-specific (2013) PhyloBayes MPI: phylogenetic recon-
models. Biol Direct 12:1. https://doi.org/ struction with infinite mixtures of profiles in a
10.1186/s13062-016-0172-z parallel environment. Syst Biol 62:611–615
204. Choi SC, Redelings BD, Thorne JL (2008) 217. Rodrigue N, Lartillot N (2014) Site-
Basing population genetic inferences and heterogeneous mutation-selection models
models of molecular evolution upon desired within the PhyloBayes-MPI package. Bioin-
stationary distributions of DNA or protein formatics 30:1020–1021
A Not-So-Long Introduction to Computational Molecular Evolution 117
218. Stamatakis A, Hoover P, Rougemont J (2008) Liu L, Suchard MA, Huelsenbeck JP (2012)
A rapid bootstrap algorithm for the RAxML MrBayes 3.2: efficient Bayesian phylogenetic
Web servers. Syst Biol 57:758–771 inference and model choice across a large
219. Stamatakis A, Ludwig T, Meier H (2005) model space. Syst Biol 61:539–542
RAxML-III: a fast program for maximum 224. Muir P, Li S, Lou S et al (2016) The real cost
likelihood-based inference of large phyloge- of sequencing: scaling computation to keep
netic trees. Bioinformatics 21:456–463 pace with data generation. Genome Biol
220. Stamatakis A, Göker M, Grimm GW (2010) 17:53
Maximum likelihood analyses of 3,490 rbcL 225. Schatz MC, Langmead B, Salzberg SL (2010)
sequences: scalability of comprehensive infer- Cloud computing and the DNA data race.
ence versus group-specific taxon sampling. Nat Biotechnol 28:691–693
Evol Bioinform Online 6:73–90 226. Dereeper A, Guignon V, Blanc G et al (2008)
221. Stamatakis A, Alachiotis N (2010) Time and Phylogeny.fr: robust phylogenetic analysis for
memory efficient likelihood-based tree the non-specialist. Nucleic Acids Res 36:
searches on phylogenomic alignments with W465–W469
missing data. Bioinformatics 26:i132–i139 227. de Koning AP, Gu W, Pollock DD (2010)
222. Suchard MA, Rambaut A (2009) Many-core Rapid likelihood analysis on large phylogenies
algorithms for statistical phylogenetics. Bioin- using partial sampling of substitution his-
formatics 25:1370–1376 tories. Mol Biol Evol 27:249–265
223. Ronquist F, Teslenko M, van der Mark P,
Ayres DL, Darling A, Höhna S, Larget B,
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Part II
Whole-Genome Alignment
Colin N. Dewey
Abstract
Whole-genome alignment (WGA) is the prediction of evolutionary relationships at the nucleotide level
between two or more genomes. It combines aspects of both colinear sequence alignment and gene
orthology prediction and is typically more challenging to address than either of these tasks due to the
size and complexity of whole genomes. Despite the difficulty of this problem, numerous methods have been
developed for its solution because WGAs are valuable for genome-wide analyses such as phylogenetic
inference, genome annotation, and function prediction. In this chapter, we discuss the meaning and
significance of WGA and present an overview of the methods that address it. We also examine the problem
of evaluating whole-genome aligners and offer a set of methodological challenges that need to be tackled in
order to make most effective use of our rapidly growing databases of whole genomes.
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_4, © The Author(s) 2019
121
122 Colin N. Dewey
2.2 Toporthology In many cases, WGAs do not aim to predict all orthologous
sequences. Instead, they only predict toporthology (positional
orthology), a distinguished subset of orthology [6, 7]. The concept
of toporthology captures the notion that not all orthologous rela-
tionships are equivalent in terms of the evolutionary history of the
genomic context of the orthologs. Figure 1 gives an example
scenario in which toporthology helps to distinguish between two
orthologous relationships.
The definition of toporthology relies on a classification of
duplication events. A duplication event is considered to be “sym-
metric” if the removal of either copy of the duplicated genomic
material (immediately after the event) reverts the genome to its
original (preduplication) state. Examples of symmetric duplications
are tandem and whole-genome duplications. If only one specific
copy can be removed to undo a duplication event, then the event is
considered “asymmetric.” In the asymmetric case, the removable
copy is referred to as the “target,” with the other copy referred to as
the “source.” Retrotransposition and segmental duplication both
belong to the asymmetric class.
With this classification of duplication events in hand, we can
now define toporthology. Two genomic segments are toportholo-
gous if they are orthologous and neither segment is derived from
the target of an asymmetric duplication event since the time of the
MRCA of the segments. Thus, two orthologous segments are
toporthologous if their evolutionary history (since the MRCA)
only involves symmetric duplication events or asymmetric duplica-
tions in which their ancestral segment was part of the source copy.
The important property of toporthologs is that, in the absence
of rearrangement events, they share the same ancestral genomic
context. As the context of a gene or genomic segment has func-
tional consequences, toporthologous sequences are generally
expected to be more similar in their function than orthologous
sequences that are not toporthologous (atoporthologs) [6]. How-
ever, there is no guarantee that toporthologs share a common
124 Colin N. Dewey
Ancestor
X Y Z
speciation
XA YA ZA XB YB ZB
asymmetric
Species A duplication
XB YB1 ZB YB2
Species B
Fig. 1 A hypothetical evolutionary scenario in which the relation of toporthology distinguishes between two
ortholog pairs. The bullet-like shapes indicate genomic segments. Both YB1 and YB2 are orthologous to
YA. However, only YB1 is toporthologous to YA because YB2 was derived from the target of an asymmetric
duplication since the time of the most recent common ancestor, Y, of YB2 and YA
function or that two genomic intervals that have the same function
are toporthologs. Thus, a rigorous functional analysis of genomes
should consider all classes of homology. Nevertheless, WGAs that
focus on toporthology produce a good first approximation to a
functional correspondence between genomes.
2.3 Definition and To be more precise, a WGA is, in general, the prediction of homol-
Representation ogous pairs of positions between two or more genome sequences.
Often, as we have previously discussed, only orthologous or
toporthologous relations are predicted in WGAs. And while align-
ment is typically focused on homologous relationships between
sequences, whole-genome comparisons can also include alignments
within genomes, which represent paralogous sequences.
Note that we define WGA as homology prediction at the level
of nucleotides. Although the concept of homology is more com-
monly used with respect to entire genes or proteins, it is easily used
and, in fact, more naturally defined at the level of single nucleotides.
Homology of nucleotide positions is established through template-
driven nucleotide synthesis, and the definitions of orthology, paral-
ogy, and xenology for nucleotides follow those for genes [7].
While a WGA can be defined as a prediction of homology
statements, it is usually represented as a set of nucleotide-level
alignment matrices or “blocks,” each block made up by segments
of the genomes that are both homologous and colinear. Homolo-
gous genomic segments are colinear if they have not been broken
by a rearrangement event since the time of their MRCA. Since
rearrangement events, such as inversions, are common at the scale
of entire genomes, WGAs are typically made up of many blocks. In
general, a block contains two or more genomic segments, and
multiple segments in the same block may belong to the same
genome (indicating paralogous sequence). One specific WGA rep-
resentation, the “threaded blockset” [8], requires that every
Whole-Genome Alignment 125
(B) (C)
W X Y Z A TT-CTAAGTG
A A1 A A B CTACTAAG-G
B A2 B B C1 CTACT--GTG
C B C1 C2 CTACC--GTG
C C2
Fig. 2 An example WGA of three genomes represented as a set of alignment blocks. (a) The positions of the
genomic segments that are in the alignment blocks are shown as shaded bullet-like shapes (the direction of
the bullet indicates the orientation of the segment). In this example, not all genomic segments belong to a
block (note the unshaded intervals). (b) The alignment blocks of the WGA. Note that blocks do not need to
contain a segment from all genomes (e.g., block Y) and that some blocks can contain multiple segments from
the same genome (e.g., blocks X and Z). (c) A slice of alignment block Z, which is a nucleotide-level alignment
2.4 Comparison to WGA is closely related to classical sequence alignment (the align-
Other Homology ment of two or more relatively short and colinear sequences), and
Prediction Tasks most whole-genome aligners rely on classical alignment techniques
(e.g., the Needleman–Wunsch [10] and Smith–Waterman [11]
pairwise alignment algorithms and heuristics used for multiple
alignments) as subroutines. However, there are three key differ-
ences between these two classes of alignment. First, and most
importantly, classical alignment requires sequences to be colinear,
which is often not the case for genome sequences due to rearrange-
ment events. Second, even when restricted to toporthologous rela-
tionships, the correspondences between genomes are not one to
126 Colin N. Dewey
2.5 Significance WGAs are powerful because they allow for the analysis of molecular
evolution at both large and small scales. At the large scale, one can
use such alignments to estimate the frequency and location of
rearrangement and duplication events. For example, one might
use a WGA between human and mouse to identify colinear ortho-
logous blocks, which are then given to a rearrangement analysis
method (e.g., [12]) to determine a most parsimonious set of rear-
rangement events explaining the current structures of the two
genomes. At the small scale, WGAs can be used to examine the
rates of substitutions and indels across the entire genome. For
example, one might look at alignments of ancestral repeats to
estimate the neutral rates of nucleotide evolution. Both small-
Whole-Genome Alignment 127
3.2 The Two Major Existing WGA methods attempt to address one or more of the
Approaches to WGA weaknesses of this simple approach. These methods can be loosely
classified into two major strategies which we refer to as the “hierar-
chical” and “local” approaches. The main idea behind the hierar-
chical approach is to split the WGA problem into a set of global
multiple alignment problems. To do this, it first identifies the
colinear and homologous (typically orthologous) segments of the
genomes. Each set of colinear segments is then given to a
specialized genomic global alignment method to produce a
nucleotide-level alignment. In contrast, the first step of the
“local” approach is to produce a large set of nucleotide-level align-
ments. Later steps involve the filtering and merging of these align-
ments to produce sets of pairwise or multiple alignments of
homologous (typically orthologous) sequences. Despite their dif-
ferences, both strategies typically begin with a local alignment step
that is similar to the simplistic all-vs.-all alignment of the BLAST
approach. A summary of all of the WGA methods described in this
chapter and the role they play within one or both approaches is
given in Table 1.
Both approaches have advantages and disadvantages. The pri-
mary advantage of the hierarchical approach is that it can often be
faster and breaks a WGA into a number of independent subpro-
blems that can be solved in parallel. It is faster because the
Whole-Genome Alignment 129
Table 1
A list of the WGA methods cited in this chapter
Relationships Pairwise or
Method Category predicted multiple References
BLAST Local alignment Homology Pairwise [21]
BLAT Local alignment Homology Pairwise [32]
STELLAR Local alignment Homology Pairwise [33]
LASTZ Local alignment Homology Pairwise [34]
LAST Local alignment Homology Pairwise [28]
MUMmer Local alignment Orthology Pairwise [35]
CHAOS Local alignment Homology Pairwise [36]
GRIMM-Synteny Homology mapping Toporthology Multiple [40]
DRIMM-Synteny Homology mapping Homology Multiple [45]
Mercator Homology mapping Toporthology Multiple [46]
Enredo Homology mapping Homology Multiple [47]
OSfinder Homology mapping Toporthology Multiple [48]
SuperMap Homology mapping Homology Multiple [49]
Sibelia Homology mapping Homology Multiple [50]
M-GCAT Hierarchical WGA Toporthology Multiple [51]
progressiveMauve Hierarchical WGA Toporthology Multiple [52]
MUGSY Hierarchical WGA Toporthology Multiple [53]
Cactus Hierarchical WGA Homology Multiple [54]
MAVID Global genomic Colinear homology Multiple [60]
alignment
LAGAN/Multi- Global genomic Colinear homology Pairwise/ [37]
LAGAN alignment multiple
DIALIGN Global genomic Colinear homology Multiple [36]
alignment
SeqAn::T-Coffee Global genomic Colinear homology Multiple [61]
alignment
Pecan Global genomic Colinear homology Multiple [47]
alignment
FSA Global genomic Colinear homology Multiple [62]
alignment
NUCmer/PROmer Local WGA Orthology Pairwise [35]
MULTIZ/TBA Local WGA Homology Multiple [8]
AXTCHAIN/ Alignment chaining Orthology Pairwise [67]
CHAINNET and filtering
(continued)
130 Colin N. Dewey
Table 1
(continued)
Relationships Pairwise or
Method Category predicted multiple References
PicoInversionMiner Alignment refinement Orthology Pairwise [68]
Cassis Alignment refinement Orthology Pairwise [69, 70]
GenAlignRefine Alignment refinement Colinear homology Multiple [71]
PSAR-Align Alignment refinement Colinear homology Multiple [73]
Phylo Alignment refinement Colinear homology Multiple [76, 77]
SLAM Alignment refinement Colinear homology Pairwise [78]
DOUBLESCAN Alignment refinement Colinear homology Pairwise [79]
CESAR Alignment refinement Colinear homology Pairwise [81]
MORPH Alignment refinement Colinear homology Pairwise [82]
EMMA Alignment refinement Colinear homology Pairwise [83]
MAFIA Alignment refinement Colinear homology Multiple [84]
SAPF Alignment refinement Colinear homology Multiple [85]
REAPR Alignment refinement Colinear homology Multiple [86]
For each method, the approach it uses or the role it plays within a larger WGA system is given in the “category” column.
Each method is labeled as either “pairwise” or “multiple” depending on whether it can be applied to generate multiple
alignments. In addition, the primary type of evolutionary relationship predicted by each method is given in the “relation-
ships predicted” column
3.3 Local Pairwise Methods for both WGA strategies generally start by finding local
Genomic Alignment alignments between, and perhaps within, the genomes. The
Smith–Waterman algorithm is the classical solution to the pairwise
local alignment problem, but is generally not used for WGA
because it runs in time quadratic in the size of the genomes,
which can be large. Instead, most methods adopt a “seed-and-
extend” approach for discovering high-scoring local alignments,
Whole-Genome Alignment 131
3.4 The Hierarchical The hierarchical approach to WGA consists of two steps. First, a
Approach high-level homology map between the genomes is constructed.
Second, a nucleotide-level alignment is obtained by running a
genomic global alignment tool on each homologous and colinear
set of genomic segments identified by the homology map. Hierar-
chical WGA methods vary in the exact techniques used for
each step.
The idea behind the hierarchical approach is to separate the
problem of identifying rearrangements and duplications from that
of obtaining a nucleotide-level alignment. In the absence of rear-
rangements and duplications, WGA simply reduces to classical
sequence alignment although at a much larger scale. Thus, if a
WGA problem can be broken into a set of subproblems that do
not contain these large-scale events, the numerous methods that
have been developed for classical global alignment can be utilized.
The first step of the hierarchical strategy is to construct a
homology map between the genomes of interest. A homology
map is a collection of sets of genomic intervals, where each set of
intervals is required to be homologous and colinear (i.e., free of
rearrangements and duplications). Each set represents the
sequences that will ultimately form a block within a WGA.
Whole-Genome Alignment 133
3.5 The Local The local approach to WGA bypasses the high-level homology map
Approach construction phase of the hierarchical approach and instead begins
by identifying a comprehensive set of nucleotide-level pairwise local
alignments. The second step of this approach is to combine the
pairwise local alignments into a cohesive WGA by filtering out
nonorthologous relationships and merging pairwise alignments
into multiple alignments. Because there is typically no additional
pairwise nucleotide-level alignment performed in the second step,
the local alignments generated by the first step are obtained with a
more sensitive aligner than that used by hierarchical methods for
homology map building. The two primary examples of local WGA
methods are MUMmer, a pairwise genome aligner, and MULTIZ/
TBA, a multiple genome aligner [8].
MUMmer was one of the first pairwise WGA methods to be
developed and was initially targeted at the alignment of
prokaryotic-sized genomes. The WGA ability of MUMmer is
achieved through a combination of smaller modules that is orche-
strated by the NUCmer or PROmer scripts. The first module
identifies maximum unique matches (MUMs) between a pair of
genomes with a suffix tree data structure. Nearby matches are
clustered together, and a high-scoring colinear chain of matches is
identified within each cluster. Finally, the matches within the chains
are extended with a variant of the Smith–Waterman algorithm, and
the resulting extended chains are output as a WGA. The raw WGA
output by MUMmer can, in general, include all classes of homolo-
gous relationships. However, the chains are typically filtered to
leave only those that are highest scoring or that result in a reference
position being overlapped by only a single chain. Thus, a filtered
WGA from MUMmer is usually representative of orthology.
MULTIZ/TBA, which was instead designed for large eukary-
otic genomes, starts by using LASTZ to generate sensitive local
pairwise alignments between all pairs of genomes or between a
reference genome and all others. MULTIZ is then used to identify
local alignment blocks of subsets of genomes that should be com-
bined and to merge these blocks using a banded variant of the
Smith–Waterman algorithm. TBA is the program that is used to
coordinate this entire process when all pairs of genomes are com-
pared. Thus far, it does not appear that TBA has been used at the
whole-genome scale, although MULTIZ is regularly used for
reference-based WGAs hosted by the UCSC Genome Browser
[66]. For these reference-based WGAs, the ungapped segments of
LASTZ alignments are first processed with a chaining program
(AXTCHAIN) to establish large colinear alignments between the
reference and another genome. In contrast to the output of
Whole-Genome Alignment 137
4 Evaluation of WGAs
5 Future Challenges
6 Exercises
Ancestor
X Y Z
speciation
inversion duplication
speciation
duplication
inversion duplication
XA ZA YA1 YA2 XB1 XB2 YB ZB XC1 YC XC2 XC3 ZC
Species A Species B Species C
Fig. 3 The evolutionary scenario to be considered for Exercise 3. Each bullet-like shape corresponds to a
genomic segment, with the direction of the bullet indicating the orientation of the segment
References
1. Löytynoja A (2012) Alignment methods: v.6: data updates and feature enhancements.
strategies, challenges, benchmarking, and Nucleic Acids Res 45:D446–D456
comparative overview. Methods Mol Biol 4. Fitch WM (1970) Distinguishing homolo-
855:203–235 gous from analogous proteins. Syst Zool
2. Fleischmann RD, Adams MD, White O et al 19:99–113
(1995) Whole-genome random sequencing 5. Altenhoff AM, Dessimoz C (2012) Inferring
and assembly of Haemophilus influenzae orthology and paralogy. Methods Mol Biol
Rd. Science 269:496–512 855:259–279
3. Mukherjee S, Stamatis D, Bertsch J et al 6. Dewey CN (2011) Positional orthology: put-
(2017) Genomes OnLine Database (GOLD) ting genomic evolutionary relationships into
context. Brief Bioinform 12(5):401–412
144 Colin N. Dewey
7. Dewey CN, Pachter L (2006) Evolution at new generation of protein database search
the nucleotide level: the problem of multiple programs. Nucleic Acids Res 25:3389–3402
whole-genome alignment. Hum Mol Genet 23. Brudno M, Malde S, Poliakov A et al (2003)
15 Spec No 1:R51–R56 Glocal alignment: finding rearrangements
8. Blanchette M, Kent WJ, Riemer C et al during alignment. Bioinformatics 19(Suppl
(2004) Aligning multiple genomic sequences 1):i54–i62
with the threaded blockset aligner. Genome 24. Ma B, Tromp J, Li M (2002) PatternHunter:
Res 14:708–715 faster and more sensitive homology search.
9. Ma J, Ratan A, Raney BJ et al (2008) The Bioinformatics 18:440–445
infinite sites model of genome evolution. 25. Sun Y, Buhler J (2005) Designing multiple
Proc Natl Acad Sci U S A 105:14254–14261 simultaneous seeds for DNA similarity search.
10. Needleman SB, Wunsch CD (1970) A general J Comput Biol 12:847–861
method applicable to the search for similari- 26. Xu J, Brown D, Li M et al (2006) Optimizing
ties in the amino acid sequence of two pro- multiple spaced seeds for homology search. J
teins. J Mol Biol 48:443–453 Comput Biol 13:1355–1368
11. Smith TF, Waterman MS (1981) Identifica- 27. Zhang L (2007) Superiority of spaced seeds
tion of common molecular subsequences. J for homology search. IEEE/ACM Trans
Mol Biol 147:195–197 Comput Biol Bioinform 4:496–505
12. Tesler G (2002) GRIMM: genome rearrange- 28. Kiełbasa SM, Wan R, Sato K et al (2011)
ments web server. Bioinformatics Adaptive seeds tame genomic sequence com-
18:492–493 parison. Genome Res 21:487–493
13. Paten B, Herrero J, Fitzgerald S et al (2008) 29. Kucherov G, Noé L, Roytberg M (2006) A
Genome-wide nucleotide-level mammalian unifying framework for seed sensitivity and its
ancestor reconstruction. Genome Res application to subset seeds. J Bioinform Com-
18:1829–1843 put Biol 4:553–569
14. Ma J, Zhang L, Suh BB et al (2006) Recon- 30. Flannick J, Batzoglou S (2005) Using multi-
structing contiguous regions of an ancestral ple alignments to improve seeded local align-
genome. Genome Res 16:1557–1565 ment algorithms. Nucleic Acids Res
15. Stark A, Lin MF, Kheradpour P et al (2007) 33:4563–4577
Discovery of functional elements in 12 Dro- 31. Sun H, Buhler JD (2012) PhyLAT: a phylo-
sophila genomes using evolutionary signa- genetic local alignment tool. Bioinformatics
tures. Nature 450:219–232 28:1336–1344
16. Alioto T (2012) Gene prediction. Methods 32. Kent WJ (2002) BLAT–the BLAST-like align-
Mol Biol 855:175–201 ment tool. Genome Res 12:656–664
17. Eddy SR (2002) Computational genomics of 33. Kehr B, Weese D, Reinert K (2011) STEL-
noncoding RNA genes. Cell 109:137–140 LAR: fast and exact local alignments. BMC
18. Margulies EH, Blanchette M, NISC Compar- Bioinform 12:S15
ative Sequencing Program et al (2003) Iden- 34. Schwartz S, Kent WJ, Smit A et al (2003)
tification and characterization of multi-species Human-mouse alignments with BLASTZ.
conserved sequences. Genome Res Genome Res 13:103–107
13:2507–2518 35. Delcher AL, Kasif S, Fleischmann RD et al
19. Tagle DA, Koop BF, Goodman M et al (1999) Alignment of whole genomes. Nucleic
(1988) Embryonic epsilon and gamma globin Acids Res 27:2369–2376
genes of a prosimian primate (Galago crassi- 36. Brudno M, Chapman M, Göttgens B et al
caudatus). Nucleotide and amino acid (2003) Fast and sensitive multiple alignment
sequences, developmental regulation and of large genomic sequences. BMC Bioinform
phylogenetic footprints. J Mol Biol 4:66
203:439–455
37. Brudno M, Do CB, Cooper GM et al (2003)
20. Bejerano G, Pheasant M, Makunin I et al LAGAN and Multi-LAGAN: efficient tools
(2004) Ultraconserved elements in the for large-scale multiple alignment of genomic
human genome. Science 304:1321–1325 DNA. Genome Res 13:721–731
21. Altschul SF, Gish W, Miller W et al (1990) 38. Gusfield D (1997) Algorithms on strings,
Basic local alignment search tool. J Mol Biol trees, and sequences: computer science and
215:403–410 computational biology. Cambridge University
22. Altschul SF, Madden TL, Sch€affer AA et al Press, Cambridge
(1997) Gapped BLAST and PSI-BLAST: a
Whole-Genome Alignment 145
39. Nadeau JH, Taylor BA (1984) Lengths of 52. Darling AE, Mau B, Perna NT (2010) pro-
chromosomal segments conserved since gressiveMauve: multiple genome alignment
divergence of man and mouse. Proc Natl with gene gain, loss and rearrangement.
Acad Sci U S A 81:814–818 PLoS One 5:e11147
40. Pevzner P, Tesler G (2003) Genome rearran- 53. Angiuoli SV, Salzberg SL (2011) Mugsy: fast
gements in mammalian evolution: lessons multiple alignment of closely related whole
from human and mouse genomes. Genome genomes. Bioinformatics 27:334–342
Res 13:37–45 54. Paten B, Earl D, Nguyen N et al (2011) Cac-
41. Ghiurcuta CG, Moret BME (2014) Evaluat- tus: algorithms for genome multiple sequence
ing synteny for improved comparative studies. alignment. Genome Res 21:1512–1528
Bioinformatics 30:i9–i18 55. Kehr B, Trappe K, Holtgrewe M et al (2014)
42. Wang X, Shi X, Li Z et al (2006) Statistical Genome alignment with graph data struc-
inference of chromosomal homology based tures: a comparison. BMC Bioinform 15:99
on gene colinearity and applications to Arabi- 56. Kececioglu J (1993) The maximum weight
dopsis and rice. BMC Bioinform 7:447 trace problem in multiple sequence align-
43. Proost S, Fostier J, De Witte D et al (2012) ment. In: Combinatorial pattern matching,
i-ADHoRe 3.0–fast and sensitive detection of Lecture notes in computer science. Springer,
genomic homology in extremely large data Berlin, pp 106–119
sets. Nucleic Acids Res 40:e11 57. Pevzner PA, Pevzner PA, Tang H et al (2004)
44. Lucas JMEX, Muffato M, Roest Crollius H De novo repeat classification and fragment
(2014) PhylDiag: identifying complex syn- assembly. Genome Res 14:1786–1796
teny blocks that include tandem duplications 58. Raphael B, Zhi D, Tang H et al (2004) A
using phylogenetic gene trees. BMC Bioin- novel method for multiple alignment of
form 15:268 sequences with repeated and shuffled ele-
45. Pham SK, Pevzner PA (2010) DRIMM- ments. Genome Res 14:2336–2346
Synteny: decomposing genomes into evolu- 59. Paten B, Diekhans M, Earl D et al (2011)
tionary conserved segments. Bioinformatics Cactus graphs for genome comparisons. J
26:2509–2516 Comput Biol 18:469–481
46. Dewey CN (2007) Aligning multiple whole 60. Bray N, Pachter L (2004) MAVID: con-
genomes with Mercator and MAVID. In: strained ancestral alignment of multiple
Bergman N (ed) Methods in Molecular Biol- sequences. Genome Res 14:693–699
ogy, vol 395. Humana, Clifton, NJ, pp 61. Rausch T, Emde AK, Weese D et al (2008)
221–236 Segment-based multiple sequence alignment.
47. Paten B, Herrero J, Beal K et al (2008) Bioinformatics 24:i187–i192
Enredo and Pecan: genome-wide mammalian 62. Bradley RK, Roberts A, Smoot M et al (2009)
consistency-based multiple alignment with Fast statistical alignment. PLoS Comput Biol
paralogs. Genome Res 18:1814–1828 5:e1000392
48. Hachiya T, Osana Y, Popendorf K et al (2009) 63. Slater GSC, Birney E (2005) Automated gen-
Accurate identification of orthologous seg- eration of heuristics for biological sequence
ments among multiple genomes. Bioinfor- comparison. BMC Bioinform 6:31
matics 25:853–860
64. Herrero J, Muffato M, Beal K et al (2016)
49. Dubchak I, Poliakov A, Kislyuk A et al (2009) Ensembl comparative genomics resources.
Multiple whole-genome alignments without a Database (Oxford) 2016:bav096
reference organism. Genome Res
19:682–689 65. Brudno M, Poliakov A, Minovitsky S et al
(2007) Multiple whole genome alignments
50. Minkin I, Patel A, Kolmogorov M et al and novel biomedical applications at the
(2013) Sibelia: a scalable and comprehensive VISTA portal. Nucleic Acids Res 35:
synteny block generation tool for closely W669–W674
related microbial genomes. In: Algorithms in
bioinformatics, Lecture notes in computer 66. Casper J, Zweig AS, Villarreal C et al (2018)
science. Springer, Berlin, pp 215–229 The UCSC Genome Browser database: 2018
update. Nucleic Acids Res 46:D762–D769
51. Treangen TJ, Messeguer X (2006) M-GCAT:
interactively and efficiently constructing 67. Kent WJ, Baertsch R, Hinrichs A et al (2003)
large-scale multiple genome comparison fra- Evolution’s cauldron: duplication, deletion,
meworks in closely related species. BMC and rearrangement in the mouse and human
Bioinform 7:433 genomes. Proc Natl Acad Sci U S A
100:11484–11489
146 Colin N. Dewey
68. Hou M, Yao P, Antonou A et al (2011) Pico- 83. He X, Ling X, Sinha S (2009) Alignment and
inplace-inversions between human and chim- prediction of cis-regulatory modules based on
panzee. Bioinformatics 27:3266–3275 a probabilistic model of evolution. PLoS
69. Lemaitre C, Tannier E, Gautier C et al (2008) Comput Biol 5:e1000299
Precise detection of rearrangement break- 84. Majoros WH, Ohler U (2010) Modeling the
points in mammalian chromosomes. BMC evolution of regulatory elements by simulta-
Bioinform 9:286 neous detection and alignment with phyloge-
70. Baudet C, Lemaitre C, Dias Z et al (2010) netic pair HMMs. PLoS Comput Biol 6:
Cassis: detection of genomic rearrangement e1001037
breakpoints. Bioinformatics 26:1897–1898 85. Satija R, Pachter L, Hein J (2008) Combining
71. Wang C, Lefkowitz EJ (2005) Genomic mul- statistical alignment and phylogenetic foot-
tiple sequence alignments: refinement using a printing to detect regulatory elements. Bioin-
genetic algorithm. BMC Bioinform 6:200 formatics 24:1236–1242
72. Notredame C, Holm L, Higgins DG (1998) 86. Will S, Yu M, Berger B (2013) Structure-
COFFEE: an objective function for multiple based whole-genome realignment reveals
sequence alignments. Bioinformatics many novel noncoding RNAs. Genome Res
14:407–422 23:1018–1027
73. Kim J, Ma J (2014) PSAR-align: improving 87. Earl D, Nguyen N, Hickey G et al (2014)
multiple sequence alignment using probabilis- Alignathon: a competitive assessment of
tic sampling. Bioinformatics 30:1010–1012 whole-genome alignment methods. Genome
74. Schwartz AS, Pachter L (2007) Multiple Res 24:2077–2089
alignment by sequence annealing. Bioinfor- 88. Darling ACE, Mau B, Blattner FR et al (2004)
matics 23:e24–e29 Mauve: multiple alignment of conserved
75. Kim J, Ma J (2011) PSAR: measuring multi- genomic sequence with rearrangements.
ple sequence alignment reliability by probabi- Genome Res 14:1394–1403
listic sampling. Nucleic Acids Res 89. Edgar RC, Asimenos G, Batzoglou S et al
39:6359–6368 (2011) Evolver: a whole-genome sequence
76. Kawrykow A, Roumanis G, Kam A et al evolution simulator. https://www.drive5.
(2012) Phylo: a citizen science approach for com/evolver
improving multiple sequence alignment. 90. Dalquen DA, Anisimova M, Gonnet GH et al
PLoS One 7:e31362 (2012) ALF–a simulation framework for
77. Kwak D, Kam A, Becerra D et al (2013) genome evolution. Mol Biol Evol
Open-Phylo: a customizable crowd- 29:1115–1123
computing platform for multiple sequence 91. Stoye J, Evers D, Meyer F (1998) Rose: gen-
alignment. Genome Biol 14:R116 erating sequence families. Bioinformatics
78. Alexandersson M, Cawley S, Pachter L (2003) 14:157–163
SLAM: cross-species gene finding and align- 92. Cartwright RA (2005) DNA assembly with
ment with a generalized pair hidden Markov gaps (Dawg): simulating sequence evolution.
model. Genome Res 13:496–502 Bioinformatics 21(Suppl 3):iii31–iii38
79. Meyer IM, Durbin R (2002) Comparative ab 93. Pollard DA, Moses AM, Iyer VN et al (2006)
initio prediction of gene structures using pair Detecting the limits of regulatory element
HMMs. Bioinformatics 18:1309–1318 conservation and divergence estimation
80. Durbin R, Eddy S, Korgh A et al (1998) using pairwise and multiple alignments.
Biological sequence analysis: probabilistic BMC Bioinform 7:376
models of proteins and nucleic acids. Cam- 94. Huang W, Nevins JR, Ohler U (2007) Phylo-
bridge University Press, Cambridge genetic simulation of promoter evolution:
81. Sharma V, Elghafari A, Hiller M (2016) Cod- estimation and modeling of binding site turn-
ing exon-structure aware realigner (CESAR) over events and assessment of their impact on
utilizes genome alignments for accurate com- alignment tools. Genome Biol 8:R225
parative gene annotation. Nucleic Acids Res 95. Varadarajan A, Bradley RK, Holmes IH
44(11):e103 (2008) Tools for simulating evolution of
82. Sinha S, He X (2007) MORPH: probabilistic aligned genomic regions with integrated
alignment combined with hidden Markov parameter estimation. Genome Biol 9:R147
models of cis-regulatory modules. PLoS
Comput Biol 3:e216
Whole-Genome Alignment 147
96. Fletcher W, Yang Z (2009) INDELible: a alignment: the problem of spurious alignment
flexible simulator of biological sequence evo- flanks. Nucleic Acids Res 36:5863–5871
lution. Mol Biol Evol 26:1879–1888 106. Landan G, Graur D (2007) Heads or tails: a
97. Kim J, Sinha S (2010) Towards realistic simple reliability check for multiple sequence
benchmarks for multiple alignments of alignments. Mol Biol Evol 24:1380–1383
non-coding sequences. BMC Bioinform 107. Penn O, Privman E, Landan G et al (2010)
11:54 An alignment confidence score capturing
98. Arenas M, Posada D (2014) Simulation of robustness to guide tree uncertainty. Mol
genome-wide evolution under heterogeneous Biol Evol 27:1759–1767
substitution models and complex multispecies 108. Chang JM, Di Tommaso P, Notredame C
coalescent histories. Mol Biol Evol (2014) TCS: a new multiple sequence align-
31:1295–1301 ment reliability measure to estimate align-
99. Margulies EH, Cooper GM, Asimenos G et al ment accuracy and improve phylogenetic
(2007) Analyses of deep mammalian sequence tree reconstruction. Mol Biol Evol
alignments and constraint predictions for 1% 31:1625–1637
of the human genome. Genome Res 109. Genome 10K Community of Scientists
17:760–774 (2009) Genome 10K: a proposal to obtain
100. Morgenstern B, Rinner O, Abdeddaı̈m S et al whole-genome sequence for 10,000 verte-
(2002) Exon discovery by genomic sequence brate species. J Hered 100:659–674
alignment. Bioinformatics 18:777–787 110. Treangen TJ, Ondov BD, Koren S et al
101. Wang AX, Ruzzo WL, Tompa M (2007) How (2014) The Harvest suite for rapid core-
accurately is ncRNA aligned within whole- genome alignment and visualization of
genome multiple alignments? BMC Bioin- thousands of intraspecific microbial genomes.
form 8:417 Genome Biol 15:524
102. Karlin S, Altschul SF (1990) Methods for 111. Frith MC, Hamada M, Horton P (2010)
assessing the statistical significance of molec- Parameters for accurate genome alignment.
ular sequence features by using general scor- BMC Bioinform 11:80
ing schemes. Proc Natl Acad Sci U S A 112. Frith MC, Kawaguchi R (2015) Split-
87:2264–2268 alignment of genomes finds orthologies
103. Prakash A, Tompa M (2005) Statistics of local more accurately. Genome Biol 16:106
multiple alignments. Bioinformatics 21(Suppl 113. Lunter G, Rocco A, Mimouni N et al (2008)
1):i344–i350 Uncertainty in homology inferences: asses-
104. Prakash A, Tompa M (2007) Measuring the sing and improving genomic sequence align-
accuracy of genome-size multiple alignments. ment. Genome Res 18:298–309
Genome Biol 8:R124 114. Herman JL, Novák Á, Lyngsø R et al (2015)
105. Frith MC, Park Y, Sheetlin SL et al (2008) Efficient representation of uncertainty in mul-
The whole alignment and nothing but the tiple sequence alignments using directed acy-
clic graphs. BMC Bioinform 16:108
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 5
Abstract
The distinction between orthologs and paralogs, genes that started diverging by speciation versus duplica-
tion, is relevant in a wide range of contexts, most notably phylogenetic tree inference and protein function
annotation. In this chapter, we provide an overview of the methods used to infer orthology and paralogy.
We survey both graph-based approaches (and their various grouping strategies) and tree-based approaches,
which solve the more general problem of gene/species tree reconciliation. We discuss conceptual differ-
ences among the various orthology inference methods and databases and examine the difficult issue of
verifying and benchmarking orthology predictions. Finally, we review typical applications of orthologous
genes, groups, and reconciled trees and conclude with thoughts on future methodological developments.
1 Introduction
Adrian M. Altenhoff and Natasha M. Glover are the Joint first authors
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_5, © The Author(s) 2019
149
150 Adrian M. Altenhoff et al.
a) b)
S1
S2 S2
Fig. 1 (a) Simple evolutionary scenario of a gene family with two speciation
events (S1 and S2) and one duplication event (star). The type of events completely
and unambiguously define all pairs of orthologs and paralogs: The frog gene is
orthologous to all other genes (they coalesce at S1). The red and blue genes are
orthologs between themselves (they coalesce at S2), but paralogs between each
other (they coalesce at star). (b) The corresponding orthology graph. The genes
are represented here by vertices and orthology relationships by edges. The frog
gene forms one-to-many orthology with both the human and dog genes, because
it is orthologous to more than one sequence in each of these organisms. In such
cases, the bi-directional best-hit approach only recovers one of the relations
(the highest scoring one). Note that in contrary to BBH, the nonsymmetric BeTs
approach—simply taking the best genome-wide hit for each gene regardless of
reciprocity—would in the situation of a lost blue human gene infer an incorrect
orthologous relation between the blue dog and red human gene
Box 1: Terminology
Homology is a relation between a pair of genes that share a
common ancestor. All pairs of genes in the below figure are
homologous to each other.
S1
D1
S2 S2
x1 y1 x2 y2 z1
(continued)
Inferring Orthology and Paralogy 151
Box 1: (continued)
Orthology is a relation defined over a pair of homologous
genes, where the two genes have emerged through a specia-
tion event [4]. Example pairs of orthologs are (x1, y1) or (x2,
z1). Orthologs can be further subclassified into one-to-one,
one-to-many, many-to-one, and many-to-many orthologs.
The qualifiers one and many indicate for each of the two
involved genes whether they underwent an additional dupli-
cation after the speciation between the two genomes. Hence,
the gene pair (x1, y1) is an example of a one-to-one ortholo-
gous pair, whereas (x2, z1) is a many-to-one ortholog relation.
Paralogy is a relation defined over a pair of homologous
genes that have emerged through a gene duplication, e.g., (x1,
x2) or (x1, y2).
In-Paralogy is a relation defined over a triplet. It involves a
pair of genes and a speciation event of reference. A gene pair is
an in-paralog if they are paralogs and duplicated after the
speciation event of reference [5]. The pair (x1, y2) are
in-paralogs with respect to the speciation event S1.
Out-Paralogy is also a relation defined over a pair of genes
and a speciation event of reference. This pair is out-paralogs if
the duplication event through which they are related to each
other predates the speciation event of reference. Hence, the
pair (x1, y2) are out-paralogs with respect to the speciation
event S2.
Co-orthology is a relation defined over three genes, where
two of them are in-paralogs with respect to the speciation
event associated to the third gene. The two in-paralogous
genes are said to be co-orthologous to the third (out-group)
gene. Thus, x1 and y2 are co-orthologs with respect to z1.
Homoeology is a specific type of homologous relation in a
polyploid species, which thus contain multiple “sub-gen-
omes.” This relation describes pairs of genes that originated
by speciation and were brought back together in the same
genome by allopolyploidization (hybridization) [6]. Thus, in
the absence of rearrangement, homoeologs can be thought of
as orthologs between sub-genomes.
2 Inferring Orthology
2.1.1 Graph Construction In its most basic form, the graph construction phase identifies
Phase: Orthology Inference orthologous genes by considering pairs of genomes at a time. The
main idea is that between any given two genomes, the orthologs
tend to be the homologs that diverged least. Why? Because assum-
ing that speciation and duplication are the only types of branching
events, the orthologs branched by definition at the latest possible
time point—the speciation between the two genomes in question.
Therefore, using sequence similarity score as surrogate measure of
closeness, the basic approach identifies the corresponding ortholog
of each gene through its genome-wide best hit (BeT)—the highest
scoring match in the other genome [8]. To make the inference
symmetric (as orthology is a symmetric relation), it is usually
required that BeTs be reciprocal, i.e., that orthology be inferred
for a pair of genes g1 and g2 if and only if g2 is the BeT of g1 and g1 is
the BeT of g2 [9]. This symmetric variant, referred to as bi-direc-
tional best hit (BBH), has also the merit of being more robust
against a possible gene loss in one of the two lineages (Fig. 1).
Inferring orthology from BBH is computationally efficient,
because each genome pair can be processed independently and
high-scoring alignments can be computed efficiently using dynamic
programming [10] or heuristics such as BLAST [11]. Overall, the
Inferring Orthology and Paralogy 153
In- Available
Method paralogs Based on Grouping strategy Database Extra algorithm/DB References
BBH (best No BLAST scores n.a. – –/– [9]
bi-directional
hit)
COG Yes BLAST scores Merged adjacent triangles COG/KOG ✓/✓ [8]
of BeTs
Adrian M. Altenhoff et al.
EggNOG Yes Smith Waterman Hierarchical orthologous EggNOG Computed at several levels –/✓ [15–17]
scores groups of taxonomic tree
Hieranoid Yes BLAST scores and Hierarchical orthologous HieranoiDB ✓/✓ [13, 18]
HMM profiles groups
InParanoid Yes BLAST scores Orthologous groups InParanoid ✓/✓ [5, 19, 20]
between pairs of species
OMA Yes ML distance Hierarchical orthologous OMA Browser Computed at all levels of ✓/✓ [21, 22]
GETHOGS estimates groups the taxonomic tree
OMA Pairs Yes ML distance Every pair is orthologous OMA Browser Detects differential gene ✓/✓ [23, 24]
estimates loss
OrthoDB Yes Smith Waterman Hierarchical orthologous OrthoDB Computed at any level of ✓/✓ [25, 26])
scores groups taxonomic tree
OrthoInspector Yes BLAST scores Only between pairs of OrthoInspector ✓/✓ [27, 28]
species
OrthoMCL Yes BLAST scores MCL clusters OrthoMCL- ✓/✓ [29, 30]
DB
RSD No ML distance Deterministic single- – ✓/✓ [31–33]
(reciprocal estimates linkage clustering
smallest
distance)
Inferring Orthology and Paralogy 155
Allowing for More Than One Some genes can have more than one orthologous counterpart in a
Ortholog given genome. This happens whenever a gene undergoes duplica-
tion after the speciation of the two genomes in question. Since
BBH only picks the best hit, it only captures part of the ortholo-
gous relations (Fig. 1). The existence of multiple orthologous
counterparts is often referred to as one-to-many or many-to-many
orthology, depending whether duplication took place in one or
both lineages. To designate the copies resulting from such duplica-
tions occurring after a speciation of reference, Remm et al. coined
the term in-paralogs and introduced a method called InParanoid
that improves upon BBH by potentially identifying all pairs of
many-to-many orthologs [5]. In brief, their algorithm identifies
all paralogs within a species that are evolutionarily closer (more
similar) to each other than to the BBH gene in the other genome.
This results in two sets of in-paralogs—one for each species—where
all pairwise combinations between the two sets are orthologous
relations. Alternatively, it is possible to identify many-to-many
orthology by relaxing the notion of “best hit” to “group of best
hits.” This can be implemented using a score tolerance threshold or
a confidence interval around the BBH [23, 34].
Differential Gene Losses As discussed above, one of the advantages of BBH over BeT is that
by virtue of the bi-directional requirement, the former is more
robust to gene losses in one of the two lineages. But if gene losses
occurred along both lineages, it can happen that a pair of genes
mutually closest to one another is in fact paralogs, simply because
both their corresponding orthologs were lost—a situation referred
to as “differential gene losses.” Dessimoz et al. [37] presented a
way to detect some of these cases by looking for a third species in
which the corresponding orthologs have not been lost and thus can
act as witnesses of non-orthology.
2.1.2 Clustering Phase: The graph construction phase yields orthologous relationships
From Pairs to Groups between pairs of genes. But this is often not sufficient. Concep-
tually, information obtained from multiple genes or organisms is
often more powerful than that obtained from pairwise comparisons
156 Adrian M. Altenhoff et al.
2.2 Tree-Based At their core, tree-based methods infer orthologs on the basis of
Methods gene family trees whose internal nodes are labeled as speciation or
duplication nodes. Indeed, once all nodes of the gene tree have
been inferred as a speciation or duplication event, it is trivial to
establish whether a pair of genes is orthologous or paralogous,
based on the type of the branching where they coalesce. Such
labeling is traditionally obtained by reconciling gene and species
trees. In most cases, gene and species trees have different topolo-
gies, due to evolutionary events acting specifically on genes such as
duplications, losses, lateral transfers, or incomplete lineage sorting
[42]. Goodman et al. [43] pioneered research to resolve these
158 Adrian M. Altenhoff et al.
Speciation
Duplication
Gene loss
Fig. 2 Schematic example of the gene/species tree reconciliation. The gene tree
and species tree are not compatible. Reconciliation methods resolve the
incongruence between the two by inferring speciation, duplication, and losses
events on the gene tree. The reconciled tree indicates the most parsimonious
history of this gene, constrained to the species tree. The simple representation
(bottom right) suggests that the human and frog genes are orthologs and that
they are both paralogous to the dog gene
2.2.1 Unresolved A first problem ignored by most early reconciliation algorithms lies
Species Tree in the uncertainty often associated with the species tree, which
these methods assume as correct and heavily rely upon.
One way of dealing with the uncertainties is to treat unresolved
parts of the species tree as multifurcating nodes (also known as soft
polytomies). By doing so, the reconciliation algorithm is not forced
to choose for a specific type of evolutionary event in ambiguous
regions of the tree. This approach is, for instance, implemented in
TreeBeST [52] and used in the Ensembl Compara project [53].
Alternatively, Heijden et al. [57] demonstrated that it is often
possible to infer speciation and duplication events on a gene tree
without knowledge of the species tree. Their approach, which they
call species overlap, identifies for a given split the species represented
in the two subtrees induced by the split. If at least one species has
genes in both subtrees, a duplication event is inferred; else a specia-
tion event is inferred. In fact, this approach is a special case of soft
polytomies where all internal nodes have been collapsed. Thus, the
only information needed for this approach is a rooted gene tree.
Since then, this approach has been adopted in other projects, such
as PhylomeDB [59].
2.2.2 Rooting The classical reconciliation formulation requires both gene and
species trees to be rooted. But most models of sequence evolution
are time reversible and thus do not allow to infer the rooting of the
reconstructed gene tree. One sensible solution is to root a gene tree
so that it minimizes the number of duplication events [62]. Thus,
this method uses the parsimony principle for both rooting and
reconciliation. For cases of multiple optimal rootings, ties can be
broken by selecting the tree that minimizes the tree height [63] or
by picking the rooting which minimizes the number of gene
losses [61].
Another approach is to place the root at the “center of the
tree”—also known as “midpoint rooting” [58]. The idea of this
method goes back to Farris [64] and is motivated by the concept of
a molecular clock. But for most gene families, assuming a constant
rate of evolution is inappropriate [65, 66], and thus this approach is
not used widely. A newly introduced refinement based on minimiz-
ing average deviations among children nodes holds promise of
being more robust [67] but still relies on a molecular clock
assumption.
For the species tree, the most common and reliable way of
rooting trees is by identifying an outgroup species. PhylomeDB
uses genes from outgroup species to root gene trees [59]. One
main potential problem with this approach is that in many situa-
tions, it can be difficult to identify a suitable outgroup. For exam-
ple, in analysis covering all kingdoms of life, an outgroup species
may not be available, or the relevant genes might have been lost
Inferring Orthology and Paralogy 161
2.2.3 Gene Tree Another assumption made in the original tree reconciliation prob-
Uncertainty lem is the (topological) correctness of the gene tree. But it has been
shown that this assumption is commonly violated, often due to
finite sequence lengths, taxon sampling [70, 71], or gene evolution
model violations [72]. On the other hand, techniques of expressing
uncertainties in gene tree reconstruction via support measures, e.g.,
bootstrap values, have become well established. Storm and Sonn-
hammer [58] as well as Zmasek and Eddy [63] independently
suggested to extend the bootstrap procedure to reconciliation,
thereby reducing the dependency of the reconciliation procedure
on any one gene tree while providing a measure of support of the
inferred speciation/duplication events. The downsides of using the
bootstrap are the high computational costs and interpretation dif-
ficulties associated with it [73].
Similarly to how unresolved species tree can be handled, unre-
solved parts of the gene tree can also be collapsed into multifurcat-
ing nodes. For instance, HOGENOM [55] and Softparsmap [61]
collapse branches with low bootstrap support values.
A third way of tackling this problem consists in simultaneously
solving both the gene tree reconstruction and reconciliation pro-
blems [74]. They use the parsimony criterion of minimizing the
number of duplication events to improve on the gene tree itself.
This is achieved by rearranging the local gene tree topology of
regions with low bootstrap support such that the number of dupli-
cations and losses is further reduced.
2.2.4 Parsimony All the approaches mentioned so far try to minimize the number of
vs. Likelihood gene duplication events. This is generally justified by a parsimony
argument, which assumes that gene duplications and losses are rare
events. But what if this assumption is frequently violated? Little is
known about duplication and loss rates in general [75], but there is
strong evidence for historical periods with high gene duplication
occurrence rates [76] or gene families specifically prone to massive
duplications (e.g., olfactory receptor, opsins, serine/threonine
kinases, etc.)
Motivated by this reasoning, Arvestad et al. introduced the idea
of a probabilistic model for tree reconciliation [49]. They used a
Bayesian approach to estimate the posterior probabilities of a rec-
onciliation between a given gene and species tree using Markov
chain Monte Carlo (MCMC) techniques. Arvestad et al. [49]
162 Adrian M. Altenhoff et al.
3 Meta-methods
5 Benchmarking Orthology
5.1 Benchmarking Several ways of benchmarking orthology inference have been devel-
Approaches oped in the past years. In the next sections, we go over the main
approaches, bringing attention to the advantages and limitations
to each.
5.1.1 Functional The first surrogate measures proposed revolved around conserva-
Conservation tion of function [94]. This was motivated by the common belief
that orthologs tend to have conserved function, while paralogs
tend to have different functions. Indeed, orthologs tend to be
more conserved than paralogs in terms of GO annotation similarity
[95]. Thus, “for a given evolutionary distance, more accurate
orthology inference is likely to be correlated with more functionally
similar gene pairs.” Hulsen et al. [94] assessed the quality of ortho-
log predictions in terms of conservation of co-expression levels,
domain annotation, and protein-protein interaction partners.
Additionally, Altenhoff et al. [96] used similarity of experimentally
validated GO annotations as well as Enzyme Commission
(EC) numbers as a functional benchmark. Functional benchmarks
have an advantage in that many researchers are interested in orthol-
ogy because they want to find functionally conserved genes, thus
making functional tests important for assessing different inference
methods. The main limitation of these measures is that it is not so
clear how much they correlate with orthology/paralogy. Indeed, it
166 Adrian M. Altenhoff et al.
5.1.2 Gene The fraction of orthologs that have neighboring genes being ortho-
Neighborhood logs themselves is an indicator of consistency and therefore to some
Conservation extent also of quality of orthology predictions [94]. Although
synteny has been used as part of the orthology inference for several
algorithms, to date it has not been used as part of large-scale
benchmarking efforts. One possible problem is that gene neighbor-
hood can be conserved among paralogs, such as those resulting
from whole-genome duplications. Furthermore, some methods use
gene neighborhood conservation to help in their inference process,
which can bias the assessment done on such measures (principle of
independence stated above).
5.1.3 Species Tree The quality of ortholog predictions can also be assessed based on
Discordance Test phylogeny. By definition, the tree relating a set of genes all ortho-
logous to one another only contains speciation splits and has the
same topology as the underlying species. We introduced a bench-
marking protocol that quantifies how well the predictions from
various orthology inference methods agree with undisputed species
tree topologies [96, 98]. Thus, the species tree discordance test
judges the accuracy of ortholog predictions based on the correct-
ness of the species tree which can be constructed from them.
The advantage of this measure is that by virtue of directly ensuing
from the definition of orthology, it correlates strongly with it and
thus satisfies the first principle. However, the second principle,
independence from the inference process, is not satisfied with
methods relying on the species tree—typically all reconciliation
methods but also most graph-based methods producing hierarchi-
cal groups. In such cases, interpretation of the results must be done
carefully.
5.1.4 Gold Standard High-quality reference gene trees can also be used to assess
Gene Tree Test orthology inferences. For this, one compares the pairs of ortho-
logs from a given method to pairs of orthologs derived from these
expertly curated gene trees [40, 99]. One drawback of this bench-
mark is that it is limited by the ability to curate the phylogeny—if
the evolutionary history of the gene family is ambiguous, the
resulting reference tree will unavoidably have mistakes. Another
limitation is the small size of most benchmarks of this type. This
casts doubts on their generalizability and makes them prone to
overfitting.
Inferring Orthology and Paralogy 167
5.1.5 Subtree For inference methods based on reconciliation between gene and
Consistency Test species trees, Vilella et al. [53] proposed a different phylogeny-
based assessment scheme. For any duplication node of the labeled
gene tree, a consistency score is computed, which captures the
balance of the species found in the two subtrees. Unbalanced
nodes correspond to an evolutionary scenario involving extensive
gene losses and therefore, under the principle of parsimony, are less
likely to be correct. Given that studies to date tend to support the
adequacy of the parsimony criterion in the context of gene family
dynamics (Subheading 2.2.4), it can be expected that this metric
correlates highly with correct orthology/paralogy assignments.
However, since virtually all tree-based methods themselves incor-
porate this very criterion in their objective function (i.e., minimiz-
ing the number of gene duplications and losses), the principle of
independence is violated, and thus the adequacy of this measure is
questionable.
5.1.6 Latent Class Chen et al. [100] proposed a purely statistical benchmark based on
Analysis latent class analysis (LCA). Given the absence of a definitive answer
on whether two given genes are orthologs, the authors argue that
by looking at the agreement and disagreement of predictions made
by several inference methods on a common dataset, one can esti-
mate the reliability of individual predictors. More precisely, LCA is
a statistical technique that computes maximum likelihood estimates
of sensitivity and specificity rates for each orthology inference
methods, given their predictions and given an error model. This is
attractive, because it does not depend on any surrogate measure.
However, the results depend on the error model assumed. Thus, we
are of the opinion that LCA merely shifts the problem of assessing
orthology to the problem of assessing an error model of various
orthology inference methods.
5.1.7 Simulated Finally, simulated data can be used in benchmarking. By this, the
Genomes precise evolutionary history of a genome can be validated, in terms
of gene duplication, insertion, deletion, and lateral gene transfer
[101]. Knowing for certain all aspects of the simulated genomes
gives an advantage over assessments based on empirical data, where
the true evolutionary history is unknown. On the other hand, how
well the simulated data reflect “real” data is debatable.
5.3 Conclusions on Overall, it becomes apparent that there is no “magic bullet” strat-
Benchmarking egy for orthology benchmarking, as each approach discussed here
has its limitations (though some limitations are more serious than
others). Nevertheless, comparative studies based on these various
benchmarking measures have reported surprisingly consistent find-
ings [40, 94, 96, 98, 100]: these assessments generally observe that
there is a trade-off between accuracy and coverage and most com-
mon databases are situated on a Pareto frontier. The various assess-
ments concur that the “best” orthology approach is highly
dependent on the various possible applications of orthology.
6 Applications
8 Exercises
A B C D E F
where duplications are depicted as star and all other splits are
speciations.
Problem #1: Draw the corresponding orthology graph, where the
vertices correspond to the observed genes and the edges indi-
cate orthologous relations between them.
Problem #2: Apply the following two clustering methods on your
orthology graph. First, reconstruct all the maximal fully
Inferring Orthology and Paralogy 171
Acknowledgments
References
1. Dewey CN (2012) Whole-genome align- 11. Altschul SF, Madden TL, Sch€affer AA et al
ment. Methods Mol Biol 855:237–257 (1997) Gapped BLAST and PSI-BLAST: a
2. Alioto T (2012) Gene prediction. In: Anisi- new generation of protein database search
mova M (ed) Evolutionary genomics: statisti- programs. Nucleic Acids Res 25:3389–3402
cal and computational methods, vol 12. Zhang L (1997) On a Mirkin-Muchnik-Smith
1. Humana, Totowa, NJ, pp 175–201 conjecture for comparing molecular phyloge-
3. Löytynoja A (2012) Alignment methods: nies. J Comput Biol 4:177–187
strategies, challenges, benchmarking, and 13. Schreiber F, Sonnhammer ELL (2013) Hier-
comparative overview. In: Anisimova M anoid: hierarchical orthology inference. J Mol
(ed) Evolutionary genomics: statistical and Biol 425:2072–2081
computational methods, vol 1. Humana, 14. Chor B, Tuller T (2005) Maximum likelihood
Totowa, NJ, pp 203–235 of evolutionary trees is hard. In: Proceedings
4. Fitch WM (1970) Distinguishing homolo- of the 9th annual international conference on
gous from analogous proteins. Syst Zool research in computational molecular biology.
19:99–113 Springer, Berlin, pp 296–310
5. Remm M, Storm CEV, Sonnhammer ELL 15. Jensen LJ, Julien P, Kuhn M et al (2008)
(2001) Automatic clustering of orthologs eggNOG: automated construction and anno-
and in-paralogs from pairwise species compar- tation of orthologous groups of genes.
isons. J Mol Biol 314:1041–1052 Nucleic Acids Res 36:D250–D254
6. Glover NM, Redestig H, Dessimoz C (2016) 16. Muller J, Szklarczyk D, Julien P et al (2010)
Homoeologs: what are they and how do we eggNOG v2.0: extending the evolutionary
infer them? Trends Plant Sci 21:609–621 genealogy of genes with enhanced
7. Kuzniar A, van Ham RCHJ, Pongor S et al non-supervised orthologous groups, species
(2008) The quest for orthologs: finding the and functional annotations. Nucleic Acids
corresponding gene across genomes. Trends Res 38:D190–D195
Genet 24:539–551 17. Huerta-Cepas J, Szklarczyk D, Forslund K
8. Tatusov RL, Koonin EV, Lipman DJ (1997) et al (2016) eggNOG 4.5: a hierarchical
A genomic perspective on protein families. orthology framework with improved func-
Science 278:631–637 tional annotations for eukaryotic, prokaryotic
9. Overbeek R, Fonstein M, D’Souza M et al and viral sequences. Nucleic Acids Res 44:
(1999) The use of gene clusters to infer func- D286–D293
tional coupling. Proc Natl Acad Sci U S A 18. Kaduk M, Sonnhammer E (2017) Improved
96:2896–2901 orthology inference with Hieranoid 2. Bioin-
10. Smith TF, Waterman MS (1981) Identifica- formatics 33:1154–1159
tion of common molecular subsequences. J 19. Ostlund G, Schmitt T, Forslund K et al
Mol Biol 147:195–197 (2010) InParanoid 7: new algorithms and
tools for eukaryotic orthology analysis.
Nucleic Acids Res 38:D196–D203
172 Adrian M. Altenhoff et al.
20. Sonnhammer ELL, Östlund G (2015) InPar- genomics for over 1800 genomes. Bioinfor-
anoid 8: orthology analysis between 273 pro- matics 28:715–716
teomes, mostly eukaryotic. Nucleic Acids Res 34. Fulton DL, Li YY, Laird MR et al (2006)
43:D234–D239 Improving the specificity of high-throughput
21. Altenhoff AM, Gil M, Gonnet GH et al ortholog prediction. BMC Bioinform 7:270
(2013) Inferring hierarchical orthologous 35. Koski LB, Golding GB (2001) The closest
groups from orthologous gene pairs. PLoS BLAST hit is often not the nearest neighbor.
One 8:e53786 J Mol Evol 52:540–542
22. Train C-M, Glover NM, Gonnet GH et al 36. Roth ACJ, Gonnet GH, Dessimoz C (2008)
(2017) Orthologous Matrix (OMA) algo- Algorithm of OMA for large-scale orthology
rithm 2.0: more robust to asymmetric evolu- inference. BMC Bioinform 9:518
tionary rates and more scalable hierarchical 37. Dessimoz C, Boeckmann B, Roth ACJ et al
orthologous group inference. Bioinformatics (2006) Detecting non-orthology in the
33:i75–i82 COGs database and other approaches group-
23. Dessimoz C, Cannarozzi G, Gil M et al ing orthologs using genome-specific best hits.
(2005) OMA, a comprehensive, automated Nucleic Acids Res 34:3309–3316
project for the identification of orthologs 38. Kristensen DM, Kannan L, Coleman MK et al
from complete genome data: introduction (2010) A low-polynomial algorithm for
and first achievements. In: Comparative geno- assembling clusters of orthologous groups
mics. Springer, Berlin, pp 61–72 from intergenomic symmetric best matches.
24. Altenhoff AM, Schneider A, Gonnet GH et al Bioinformatics 26:1481–1487
(2011) OMA 2011: orthology inference 39. Van Dongen SM (2001) Graph clustering by
among 1000 complete genomes. Nucleic flow simulation. PhD thesis, University of
Acids Res 39:D289–D294 Utrecht
25. Kriventseva EV, Rahman N, Espinosa O et al 40. Boeckmann B, Robinson-Rechavi M, Xenar-
(2008) OrthoDB: the hierarchical catalog of ios I et al (2011) Conceptual framework and
eukaryotic orthologs. Nucleic Acids Res 36: pilot study to benchmark phylogenomic data-
D271–D275 bases based on reference gene trees. Brief
26. Zdobnov EM, Tegenfeldt F, Kuznetsov D Bioinform 12:423–435
et al (2017) OrthoDB v9.1: cataloging evolu- 41. Jothi R, Zotenko E, Tasneem A et al (2006)
tionary and functional annotations for animal, COCO-CL: hierarchical clustering of homol-
fungal, plant, archaeal, bacterial and viral ogy relations based on evolutionary correla-
orthologs. Nucleic Acids Res 45:D744–D749 tions. Bioinformatics 22:779–788
27. Linard B, Thompson JD, Poch O et al (2011) 42. Nei M (1987) Molecular evolutionary genet-
OrthoInspector: comprehensive orthology ics. Columbia University Press, New York
analysis and visual exploration. BMC Bioin-
form 12:11 43. Goodman M, Czelusniak J, Moore GW et al
(1979) Fitting the gene lineage into its species
28. Linard B, Allot A, Schneider R et al (2015) lineage, a Parsimony strategy illustrated by
OrthoInspector 2.0: software and database cladograms constructed from globin
updates. Bioinformatics 31:447–448 sequences. Syst Zool 28:132–163
29. Li L, Stoeckert CJ Jr, Roos DS (2003) 44. Page RDM (1994) Maps between trees and
OrthoMCL: identification of ortholog groups cladistic analysis of historical associations
for eukaryotic genomes. Genome Res among genes, organisms, and areas. Syst Biol
13:2178–2189 43:58–77
30. Chen F, Mackey AJ, Stoeckert CJ Jr et al 45. Mirkin B, Muchnik I, Smith TF (1995) A
(2006) OrthoMCL-DB: querying a compre- biologically consistent model for comparing
hensive multi-species collection of ortholog molecular phylogenies. J Comput Biol
groups. Nucleic Acids Res 34:D363–D368 2:493–507
31. Wall DP, Fraser HB, Hirsh AE (2003) Detect- 46. Eulenstein O (1997) A linear time algorithm
ing putative orthologs. Bioinformatics for tree mapping. Arbeitspapiere der GMD
19:1710–1711 No. 1046, St
32. DeLuca TF, Wu I-H, Pu J et al (2006) 47. Zmasek CM, Eddy SR (2001) A simple algo-
Roundup: a multi-genome repository of rithm to infer gene duplication and speciation
orthologs and evolutionary distances. Bioin- events on a gene tree. Bioinformatics
formatics 22:2044–2046 17:821–828
33. DeLuca TF, Cui J, Jung J-Y et al (2012)
Roundup 2.0: enabling comparative
Inferring Orthology and Paralogy 173
48. Poptsova MS, Gogarten JP (2007) Branch- Proceedings of the fourth annual interna-
Clust: a phylogenetic algorithm for selecting tional conference on computational molecular
gene families. BMC Bioinform 8:120 biology. ACM, New York, NY, pp 138–146
49. Arvestad L, Berglund A-C, Lagergren J et al 63. Zmasek CM, Eddy SR (2002) RIO: analyzing
(2003) Bayesian gene/species tree reconcilia- proteomes by automated phylogenomics
tion and orthology analysis using MCMC. using resampled inference of orthologs.
Bioinformatics 19(Suppl 1):i7–i15 BMC Bioinform 3:14
50. Åkerborg Ö, Sennblad B, Arvestad L et al 64. Farris JS (1972) Estimating phylogenetic
(2009) Simultaneous Bayesian gene tree trees from distance matrices. Am Nat
reconstruction and reconciliation analysis. 106:645–668
Proc Natl Acad Sci U S A 106:5714–5719 65. Avise JC, Bowen BW, Lamb T et al (1992)
51. Ullah I, Sjöstrand J, Andersson P et al (2015) Mitochondrial DNA evolution at a turtle’s
Integrating sequence evolution into probabi- pace: evidence for low genetic variability and
listic orthology analysis. Syst Biol 64:969–982 reduced microevolutionary rate in the Testu-
52. Li H, Coghlan A, Ruan J et al (2006) Tree- dines. Mol Biol Evol 9:457–473
Fam: a curated database of phylogenetic trees 66. Ayala FJ (1999) Molecular clock mirages.
of animal gene families. Nucleic Acids Res 34: Bioessays 21:71–75
D572–D580 67. Tria FDK, Landan G, Dagan T (2017) Phylo-
53. Vilella AJ, Severin J, Ureta-Vidal A et al genetic rooting using minimal ancestor devia-
(2009) EnsemblCompara GeneTrees: com- tion. Nat Ecol Evol 1:193
plete, duplication-aware phylogenetic trees in 68. Huelsenbeck JP, Bollback JP, Levine AM
vertebrates. Genome Res 19:327–335 (2002) Inferring the root of a phylogenetic
54. Herrero J, Muffato M, Beal K et al (2016) tree. Syst Biol 51:32–43
Ensembl comparative genomics resources. 69. Tarrı́o R, Rodrı́guez-Trelles F, Ayala FJ
Database 2016:bav096 (2000) Tree rooting with outgroups when
55. Dufayard J-F, Duret L, Penel S et al (2005) they differ in their nucleotide composition
Tree pattern matching in phylogenetic trees: from the ingroup: the Drosophila saltans and
automatic search for orthologs or paralogs in Willistoni groups, a case study. Mol Phylo-
homologous gene sequence databases. Bioin- genet Evol 16:344–349
formatics 21:2596–2603 70. Graybeal A (1998) Is it better to add taxa or
56. Penel S, Arigon A-M, Dufayard J-F et al characters to a difficult phylogenetic problem?
(2009) Databases of homologous gene Syst Biol 47(1):9–17
families for comparative genomics. BMC 71. Rokas A, Williams BL, King N et al (2003)
Bioinform 10(Suppl 6):S3 Genome-scale approaches to resolving incon-
57. van der Heijden RTJM, Snel B, van Noort V gruence in molecular phylogenies. Nature
et al (2007) Orthology prediction at scalable 425:798–804
resolution by phylogenetic tree analysis. BMC 72. Yang Z, Goldman N, Friday A (1994) Com-
Bioinform 8:83 parison of models for nucleotide substitution
58. Storm CEV, Sonnhammer ELL (2002) Auto- used in maximum-likelihood phylogenetic
mated ortholog inference from phylogenetic estimation. Mol Biol Evol 11:316–324
trees and calculation of orthology reliability. 73. Anisimova M, Gascuel O (2006) Approxi-
Bioinformatics 18:92–99 mate likelihood-ratio test for branches: a fast,
59. Huerta-Cepas J, Dopazo H, Dopazo J et al accurate, and powerful alternative. Syst Biol
(2007) The human phylome. Genome Biol 8: 55:539–552
R109 74. Durand D, Halldórsson BV, Vernot B (2006)
60. Huerta-Cepas J, Capella-Gutiérrez S, Pryszcz A hybrid micro-macroevolutionary approach
LP et al (2014) PhylomeDB v4: zooming into to gene tree reconstruction. J Comput Biol
the plurality of evolutionary histories of a 13:320–335
genome. Nucleic Acids Res 42:D897–D902 75. Lynch M, Conery JS (2000) The evolutionary
61. Berglund-Sonnhammer A-C, Steffansson P, fate and consequences of duplicate genes. Sci-
Betts MJ et al (2006) Optimal gene trees ence 290:1151–1155
from sequences and species trees using a soft 76. Robinson-Rechavi M, Marchand O, Escriva
interpretation of parsimony. J Mol Evol H et al (2001) Euteleost fish genomes are
63:240–250 characterized by expansion of gene families.
62. Hallett MT, Lagergren J (2000) New algo- Genome Res 11:781–788
rithms for the duplication-loss model. In:
174 Adrian M. Altenhoff et al.
77. Kendall DG (1948) On the generalized 91. Steinegger M, Söding J (2017) MMseqs2
“birth-and-death” process. Ann Math Stat enables sensitive protein sequence searching
19:1–15 for the analysis of massive data sets. Nat Bio-
78. Doyon J-P, Hamel S, Chauve C (2012) An technol 35(11):1026–1028
efficient method for exploring the space of 92. Wittwer LD, Piližota I, Altenhoff AM et al
gene tree/species tree reconciliations in a (2014) Speeding up all-against-all protein
probabilistic framework. IEEE/ACM Trans comparisons while maintaining sensitivity by
Comput Biol Bioinform 9:26–39 considering subsequence-level homology.
79. Gabaldón T, Dessimoz C, Huxley-Jones J PeerJ 2:e607
et al (2009) Joining forces in the quest for 93. Huerta-Cepas J, Forslund K, Coelho LP et al
orthologs. Genome Biol 10:403 (2017) Fast genome-wide functional annota-
80. Contreras-Moreira B, Vinuesa P (2013) tion through orthology assignment by
GET_HOMOLOGUES, a versatile software eggNOG-Mapper. Mol Biol Evol
package for scalable and robust microbial pan- 34:2115–2122
genome analysis. Appl Environ Microbiol 94. Hulsen T, Huynen MA, de Vlieg J et al
79:7696–7701 (2006) Benchmarking ortholog identification
81. Salgado D, Gimenez G, Coulier F et al (2008) methods using functional genomics data.
COMPARE, a multi-organism system for Genome Biol 7:R31
cross-species data comparison and transfer of 95. Altenhoff AM, Studer RA, Robinson-Rechavi
information. Bioinformatics 24:447–449 M et al (2012) Resolving the ortholog con-
82. Eyre TA, Wright MW, Lush MJ et al (2007) jecture: orthologs tend to be weakly, but sig-
HCOP: a searchable database of human nificantly, more similar in function than
orthology predictions. Brief Bioinform 8:2–5 paralogs. PLoS Comput Biol 8:e1002514
83. Hu Y, Flockhart I, Vinayagam A et al (2011) 96. Altenhoff AM, Boeckmann B, Capella-
An integrative approach to ortholog predic- Gutierrez S et al (2016) Standardized bench-
tion for disease-focused and other functional marking in the quest for orthologs. Nat
studies. BMC Bioinform 12:357 Methods 13:425–430
84. Maher MC, Hernandez RD (2015) Rock, 97. Studer RA, Robinson-Rechavi M (2009)
paper, scissors: harnessing complementarity How confident can we be that orthologs are
in ortholog detection methods improves similar, but paralogs differ? Trends Genet
comparative genomic inference. G3 25:210–216
5:629–638 98. Altenhoff AM, Dessimoz C (2009) Phyloge-
85. Pereira C, Denise A, Lespinet O (2014) A netic and functional assessment of orthologs
meta-approach for improving the prediction inference projects and methods. PLoS Com-
and the functional annotation of ortholog put Biol 5:e1000262
groups. BMC Genomics 15(Suppl 6):S16 99. Trachana K, Larsson TA, Powell S et al (2011)
86. Pryszcz LP, Huerta-Cepas J, Gabaldón T Orthology prediction methods: a quality
(2011) MetaPhOrs: orthology and paralogy assessment using curated protein families.
predictions from multiple phylogenetic evi- BioEssays 33:769–780
dence using a consistency-based confidence 100. Chen F, Mackey AJ, Vermunt JK et al (2007)
score. Nucleic Acids Res 39:e32 Assessing performance of orthology detection
87. Sutphin GL, Mahoney JM, Sheppard K et al strategies applied to eukaryotic genomes.
(2016) WORMHOLE: novel least diverged PLoS One 2:e383
ortholog prediction through machine 101. Dalquen DA, Altenhoff AM, Gonnet GH et al
learning. PLoS Comput Biol 12:e1005182 (2013) The impact of gene duplication, inser-
88. Tabari E, Su Z (2017) PorthoMCL: parallel tion, deletion, lateral gene transfer and
orthology prediction using MCL for the sequencing error on orthology inference: a
realm of massive genome availability. Big simulation study. PLoS One 8:e56925
Data Anal 2:4 102. Thomas PD, Campbell MJ, Kejariwal A et al
89. Cosentino S, Iwasaki W (2018) SonicPara- (2003) PANTHER: a library of protein
noid: extremely fast, accurate, and easy families and subfamilies indexed by function.
orthology inference. Bioinformatics. https:// Genome Res 13:2129–2141
doi.org/10.1093/bioinformatics/bty631 103. Engelhardt BE, Jordan MI, Muratore KE et al
90. Buchfink B, Xie C, Huson DH (2015) Fast (2005) Protein molecular function prediction
and sensitive protein alignment using DIA- by Bayesian phylogenomics. PLoS Comput
MOND. Nat Methods 12:59–60 Biol 1:e45
Inferring Orthology and Paralogy 175
104. Cook SA (1971) The complexity of theorem- 108. Hallett M, Lagergren J, Tofigh A (2004)
proving procedures. In: Proceedings of the Simultaneous identification of duplications
third annual ACM symposium on theory of and lateral transfers. In: Proceedings of the
computing. ACM, New York, NY, pp eighth annual international conference on
151–158 Research in computational molecular biology.
105. Sharan R, Ideker T (2006) Modeling cellular ACM, New York, NY, pp 347–356
machinery through biological network com- 109. Forslund K, Pereira C, Capella-Gutierrez S
parison. Nat Biotechnol 24:427–433 et al (2017) Gearing up to handle the mosaic
106. Dewey CN, Pachter L (2006) Evolution at nature of life in the quest for orthologs. Bio-
the nucleotide level: the problem of multiple informatics. https://doi.org/10.1093/bioin
whole-genome alignment. Hum Mol Genet formatics/btx542
15 Spec No 1:R51–RR6 110. Guigó R, Muchnik I, Smith TF (1996)
107. Górecki P (2004) Reconciliation problems Reconstruction of ancient molecular phylog-
for duplication, loss and horizontal gene eny. Mol Phylogenet Evol 6:189–213
transfer. In: Proceedings of the eighth annual 111. Bansal MS and Eulenstein O (2008) The mul-
international conference on research in tiple gene duplication problem revisited. Bio-
computational molecular biology. ACM, informatics 24:i132–i13i138
New York, NY, pp 316–325
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 6
Abstract
Most genomes are populated by hundreds of thousands of sequences originated from mobile elements. On
the one hand, these sequences present a real challenge in the process of genome analysis and annotation. On
the other hand, they are very interesting biological subjects involved in many cellular processes. Here we
present an overview of transposable elements biodiversity, and we discuss different approaches to transpo-
sable elements detection and analyses.
Key words Transposable elements, Transposons, Mobile elements, Repetitive elements, Genome
analysis, Genome evolution
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_6, © The Author(s) 2019
177
178 Wojciech Makałowski et al.
1
The historical background of the “junk DNA” term was recently discussed by Dan Graur in his excellent blog
http://judgestarling.tumblr.com/post/64504735261/the-origin-of-the-term-junk-dna-a-historical
Transposable Elements: Classification, Identification, and Their Use. . . 179
3 Transposons Classification
dr IR ORF(s) IR dr
3.2.1 Class I: Mobile As mentioned above, class I TEs transpose through an RNA inter-
Elements mediary. The RNA intermediate is transcribed from genomic DNA
and then reverse-transcribed into DNA by a TE-encoded reverse
transcriptase (RT), followed by reintegration into a genome. Each
replication cycle produces one new copy, and as a result, class I
elements are the major contributors to the repetitive fraction in
large genomes. Retrotransposons are divided into five orders: LTR
retrotransposons, DIRS-like elements, Penelope-like elements
(PLEs), LINEs (long interspersed elements), and SINEs (short
interspersed elements). This scheme is based on the mechanistic
features, organization, and reverse transcriptase phylogeny of these
retroelements. Accidentally, the retrotranscriptase coded by an
autonomous TE can reverse-transcribe another RNA present in
the cell, e.g., mRNA, and produce a retrocopy of it, which in
most cases results in a pseudogene.
The LTR retrotransposons are characterized by the presence of
long terminal repeats (LTRs) ranging from several hundred to
several thousand base pairs. Both exogenous retroviruses and
LTR retrotransposons contain a gag gene that encodes a viral
Transposable Elements: Classification, Identification, and Their Use. . . 181
Table 1
Prokaryotic transposable elements as presented in the IS Finder database [31]
Family Typical size range in bp Direct repeat size in bp IRsa Number of ORFs
IS1 740–4600 0–10 Y 1 or 2
IS110 1200–1550 0 Y 1
IS1182 1330–1950 0–60 Y 1
IS1380 1550–2000 4–5 Y 1
IS1595 700–7900 8 Y 1
IS1634 1500–2000 5–6 Y 1
IS200/IS605 600–2000 0 Y/N 1 or 2
IS21 1750–2600 4–8 Y 2
IS256 1200–1500 8–9 Y 1
IS3 1150–1750 5 Y 2
IS30 1000–1700 2–3 Y 1
IS4 1150–5400 8–13 Y 1 or more
IS481 950–1300 4–15 Y 1
IS5 800–1500 2–9 Y 1 or 2
IS6 700–900 8 Y 1
IS607 1700–2500 0 N 2
IS630 1000–1400 2 Y 1 or 2
IS66 1350–3000 8–9 Y 1 or more
IS701 1400–1550 4 Y 1
IS91 1500–2000 0 N 1
IS982 1000 3–9 Y 1
ISAs1 1200–1500 8–10 Y 1
ISAzo13 1250–2200 0–4 Y 1
ISH3 1225–1500 4–5 Y 1
ISH6 1450 8 Y ISL
ISKra4 1400–2900 0–9 Y 1 or more
ISL3 1300–2300 8 Y ISKra4
ISLre2 1500–2000 9 Y 1
Tn3 Over 3000 0 Y More than 1
ISNCY 1300–2400 0–12 Y/N 1 or 2
a
Presence (Y) or absence (N) of terminal inverted repeats
182 Wojciech Makałowski et al.
Table 2
Classification of eukaryotic transposable elements as proposed by Wicker et al. [33]
Class I
Retroviruses
TSD TSD
LTR gag pol env
envv LTR
LTR elements
TSD gag TSD
LTR pol LTR
DIRS
TSD gag
gag TSD
TIR YR TIR
LINEs
TSD ORF2 polyA TSD
ORF1
SINEs
TSD polyA TSD
A B
SVAs
TSD TSD
CGCTCTn Alu-like VNTR SSINE-R
INE-R polyA
Retrogenes
TSD TSD
ORF polyA
Class II - subclass 1
“classical” autonomous DNA transposons
TIR TIR
TTRANSPOSASE
RANSPOSA
ASEE
Class II - subclass 2
Helitrons
TC HELICASE,
H pal)
ELICASE, (rpal)
(rp CTRR
Mavericks
TDS TIR TIR TDS
iint
nt ORFs
ORFs polB
B
polB
Fig. 2 Structures of eukaryotic mobile elements. See text for detailed discussion
LTR retrotransposons can acquire the env gene and become infec-
tious entities [36]. Presently, most of the LTR sequences (85%) in
the human genome are found only as isolated LTRs, with the
internal sequence being lost most likely due to homologous
184 Wojciech Makałowski et al.
3.2.2 Class II: Mobile Class II elements move by a conservative cut-and-paste mechanism;
Elements the excision of the donor element is followed by its reinsertion
elsewhere in the genome. DNA transposons are abundant in bacte-
ria, where they are called insertion sequences (see Subheading 3.1),
but are present in all phyla. Wicker et al. distinguished two sub-
classes of DNA transposons based on the number of DNA strands
that are cut during transposition [33].
Classical “cut-and-paste” transposons belong to the subclass I,
and they are classified as the TIR order. They are characterized by
terminal inverted repeats (TIR) and encode a transposase that binds
near the inverted repeats and mediates mobility. This process is not
usually a replicative one, unless the gap caused by excision is
repaired using the sister chromatid. When inserted at a new loca-
tion, the transposon is flanked by small gaps, which, when filled by
host enzymes, cause duplication of the sequence at the target site.
The length of these TSDs is characteristic for particular transpo-
sons. Nine superfamilies belong to the TIR order, including Tc1-
Mariner, Merlin, Mutator, and PiggyBac. The second order Cryp-
ton consists of a single superfamily of the same name. Originally
thought to be limited to fungi [66], now it is clear that they have a
wide distribution, including animals and heterokonts [67]. A
Transposable Elements: Classification, Identification, and Their Use. . . 187
4.1 De Novo There are several steps involved in the de novo characterization of
Approaches to Finding transposons. First, we need to find all the repeats in a genome, then
Repetitive Elements build a consensus of each family of related sequences, and finally
classify detected sequences. For the first step, three groups of
algorithms exist: the k-mer approach, sequence self-comparison,
and periodicity analysis.
In the k-mer approach, sequences are scanned for overrepre-
sentation of strings of certain length. The idea is that repeats that
belong to the same family are compositionally similar and share
some oligomers. If the repeats occur many times in a genome, then
those oligomers should be overrepresented. However, since repeats
and transposons in particular are not perfect copies of a certain
sequence, some mismatches must be allowed when oligo frequen-
cies are calculated. The challenge is to determine optimal size of an
oligo (k-mer) and number of mismatches allowed. Most likely,
these parameters should be different for different types of transpo-
sons, i.e., low versus high copy number, old versus young transpo-
sons, and those from different classes and families. Several programs
have been developed based on the k-mer idea using a suffix tree data
structure including REPuter [77, 78], Vmatch (Kurtz, unpub-
lished; http://www.vmatch.de/), and Repeat-match
[79, 80]. Another approach is to use fixed length k-mers as seeds
and extend those seeds to define repeat’s family as it was imple-
mented in ReAS [81], RepeatScout [82], and Tallymer
[83]. Another interesting algorithm can be found in the FORRe-
peats software [84], which uses factor oracle data structure [85]. It
starts with detection of exact oligomers in the analyzed sequences,
followed by finding approximate repeats and their alignment.
The second group of programs developed for de novo detec-
tion of repeated sequences is using self-comparison approach.
Repeat Pattern Toolkit [86], RECON [87], PILER [88, 89], and
BLASTER [90] belong to this group. The idea is to use one of the
fast sequence similarity tools, e.g., BLAST [91], followed by clus-
tering search results. The programs differ in the search engine for
the initial step, though most are using some of the BLAST algo-
rithms, the clustering method, and heuristics of merging initial hits
into a prototype element. For instance, RECON [87], which was
developed for the repeat finding in unassembled sequence reads,
starts with an all-to-all comparison using WU-BLAST engine.
Then, single-linkage clustering is applied to alignment results that
is followed by construction of an undirected graph with overlap-
ping. The shortest sequence that contains connected images
Transposable Elements: Classification, Identification, and Their Use. . . 189
4.3 Population-Level Recent advances in sequencing technology and the sharp decrease
Analyses of in sequencing costs allow genomic studies at population level.
Transposable Although initially focused on human populations [101–103],
Elements recent population studies of other species have been initiated as
well [104, 105]. One of the common questions in such studies is
Transposable Elements: Classification, Identification, and Their Use. . . 191
a k
b l
c m
d n
e o
f p
g q
h r
i s
j t
a k
b l
c m
d n
e o
f p
g q
h r
i s
j t
k
l
m
n
o
p
q
r
s
t
Fig. 4 Detection of a TE insertion (polymorphic TE) from the NGS data. The upper panel shows real genomic
sequence with a TE, which is not present in the reference genome (lower panel). Hypothetical discordant pair-
reads (a, b, d, f, g, i, j, k, l, o, q, s, and t) have only one the pairs mapped to the reference genome, while the
other would map to a consensus sequence of a TE. The hypothetical split reads (c, e, h, m, p, and r) will have
part of the sequence mapped to the reference genome and the other to a TE consensus sequence
4.5 Classification of Once the consensus of a repetitive element has been constructed, it
Transposable can be subjected to further analyses. There are two major categories
Elements of programs dealing with the issue of TE classification: library or
similarity-based and signature-based. The latter approach is very
often used in specialized software, i.e., tailored for specific type of
TEs. However, some general tools also exist, e.g., TEclass [120].
The library approach is probably the most common approach
for TE classification. It is also very efficient and quite reliable as long
as good libraries of prototype sequences exist. In practice, it is the
recommended approach when we analyze sequences from well-
characterized genomes or from a genome relatively closely related
to a well-studied one. For instance, since the human genome is one
of the best studied, any primate sequences can be confidently
analyzed using the library approach. Most likely, the first software
using the similarity-based approach for repeat classification was
Censor developed by Jerzy Jurka in the early 1990s [121]. It uses
RepBase [122] as a reference collection and BLAST as a search
engine [91]. However, the most popular TE detection software is
RepeatMasker (RM) (http://www.repeatmasker.org). Interest-
ingly, RM is also using RepBase as a reference collection and
AB-BLAST, RM-BLAST, or cross-match as a search engine. In
both cases, original search hits are processed by a series of Perl
scripts to determine the structure of elements and classify them to
one of known TE families. Both Censor and RM also employ user-
provided libraries, including “third-party” lineage-specific libraries,
e.g., TREP [123]. Over the years, RepeatMasker has become a
standard tool for TE analyses, and often its output is used for
more biologically oriented studies (see below). The aforemen-
tioned programs have one important drawback: since they are
completely based on sequence similarity, they can detect only TEs
that had been previously described. Nevertheless, similarity
searches, like in many other bioinformatics tasks, should be the
first approach for the analysis of repetitive elements.
Signature-based programs are searching for certain features
that characterize specific TEs, for example, long terminal repeats
(LTRs), target site duplications (TSDs), or primer-binding sites
(PBSs). Since different types (families) of elements are structurally
different, they require specific rules for their detection. Hence,
many of the programs that use signature-based algorithms are
specific for certain type of transposons. There are a number of
programs specialized in detection of LTR transposons, which are
based on a similar methodology. They take into account several
structural features of LTR retroposons including size, distance
between paired LTRs and their similarity, the presence of TSDs,
and the presence of replication signals, i.e., the primer-binding site
and the polypurine tract (PPTs). Some of the programs check also
for ORFs coding for the gag, pol, and env proteins. LTR_STRUC
[124] was one of the first programs based on this principle. It uses
Transposable Elements: Classification, Identification, and Their Use. . . 195
4.6 Pipelines Recent years witnessed some attempt to create more complex,
global analyses systems. One such a system is REPCLASS
[134]. It consists of three classification modules: homology
(HOM), structure (STR), and target site duplication (TSD). Each
module can be run separately or in the pairwise manner, whereas
the final step of the analysis involves integration of the results
delivered by each module. There is one interesting novelty in the
STR module, namely, implementation of tRNAscan-SE [135] to
196 Wojciech Makałowski et al.
4.7 Meta-analyses Most of the software developed are focused on the TE discovery
and rarely offer more biological oriented analyses. Consequently,
researchers interested in TE biology or using TE insertions as tools
for another biological investigations need to utilize other resources.
One of them is TinT (transposition in transposition), tool that
applies maximum likelihood model of TE insertion probability to
Transposable Elements: Classification, Identification, and Their Use. . . 197
5 Concluding Remarks
Annoying junk for some, hidden treasure for others, TEs can hardly
be ignored [148]. With their diversity and high copy number in
most of the genomes, they are not the easiest biological entities to
analyze. Nevertheless, recent years witnessed increased interest in
TEs. On the one hand, we observe improvement in computational
tools specialized in TE analyses. Table 3 lists some of such tools and
Table 3
Selected resources for transposable elements discovery and analyses
Software Address
AB-BLAST http://www.advbiocomp.com/blast.html
ACLAME http://aclame.ulb.ac.be/
BLASTER suite http://urgi.versailles.inra.fr/index.php/urgi/Tools/BLASTER
Censor http://www.girinst.org/censor/download.php
DOTTER http://sonnhammer.sbc.su.se/Dotter.html
DROPOSON ftp://biom3.univ-lyon1.fr//pub/drosoposon/
find_ltr http://darwin.informatics.indiana.edu/cgi-bin/evolution/ltr.pl
FINDMITE http://jaketu.biochem.vt.edu/dl_software.htm
FORRepeats http://al.jalix.org/FORRepeats/
Gepard http://cube.univie.ac.at/gepard
HelitronFinder http://limei.montclair.edu/HT.html
HelSearch http://sourceforge.net/project/showfiles.php?group_id¼260708
HERVd http://herv.img.cas.cz/
IRF http://tandem.bu.edu/irf/irf.download.html
LTR_FINDER http://tlife.fudan.edu.cn/ltr_finder/
LTR_MINER http://genomebiology.com/2004/5/10/R79/suppl/s7
LTR_par http://www.eecs.wsu.edu/~ananth/software.htm
MGEScan-LTR http://darwin.informatics.indiana.edu/cgi-bin/evolution/daphnia_ltr.pl
MGEScan-nonLTR http://darwin.informatics.indiana.edu/cgi-bin/evolution/nonltr/nonltr.pl
microTranspoGene http://transpogene.tau.ac.il/microTranspoGene.html
MITE-Hunter http://target.iplantcollaborative.org/mite_hunter.html
PILER http://www.drive5.com/piler/
REannotate http://www.bioinformatics.org/reannotate/index.html
ReAS ftp://ftp.genomics.org.cn/pub/ReAS/software/
RECON http://eddylab.org/software/recon/
(continued)
Transposable Elements: Classification, Identification, and Their Use. . . 199
Table 3
(continued)
Software Address
RepSeek http://wwwabi.snv.jussieu.fr/public/RepSeek/
RepeatFinder http://cbcb.umd.edu/software/RepeatFinder/
RepeatMasker http://www.repeatmasker.org/
RepeatModeler http://www.repeatmasker.org/RepeatModeler/
RepeatRunner http://www.yandell-lab.org/software/repeatrunner.html
Repeat-match http://mummer.sourceforge.net/
REPET http://urgi.versailles.inra.fr/index.php/urgi/Tools/REPET
RepMiner http://repminer.sourceforge.net/index.htm
REPuter http://bibiserv.techfak.uni-bielefeld.de/reputer/
RetroMap http://www.burchsite.com/bioi/RetroMapHome.html
SMaRTFinder http://services.appliedgenomics.org/software/smartfinder/
SoyTEdb http://www.soytedb.org
Spectral Repeat Finder http://www.imtech.res.in/raghava/srf/
T-lex http://petrov.stanford.edu/cgi-bin/Tlex.html
Tallymer http://www.zbh.uni-hamburg.de/Tallymer/
TARGeT http://target.iplantcollaborative.org/
TEclass http://www.bioinformatics.uni-muenster.de/tools/teclass/
TE Displayer http://labs.csb.utoronto.ca/yang/TE_Displayer/
TE nest http://www.plantgdb.org/prj/TE_nest/TE_nest.html
TESD http://pbil.univ-lyon1.fr/software/TESD/
TinT http://www.bioinformatics.uni-muenster.de/tools/tint/
TIPseqHunter https://github.com/fenyolab/TIPseqHunter
TRANSPO http://alggen.lsi.upc.es/recerca/search/transpo/transpo.html
TranspoGene http://transpogene.tau.ac.il/
Transposon-PSI http://transposonpsi.sourceforge.net/
TRAP http://www.coccidia.icb.usp.br/trap/tutorials/
TRF http://tandem.bu.edu/trf/trf.html
TROLL http://finder.sourceforge.net/
TSDfinder http://www.ncbi.nlm.nih.gov/CBBresearch/Landsman/TSDfinder/
WikiPoson http://www.bioinformatics.org/wikiposon/doku.php
VariationHunter http://compbio.cs.sfu.ca/software-variation-hunter
Vmatch http://www.vmatch.de/
200 Wojciech Makałowski et al.
References
1. Waring M, Britten RJ (1966) Nucleotide 12. Aronson AI, Bolton ET, Britten RI, Cowie
sequence repetition - a rapidly reassociating DB, Duerksen JD, McCarthy BJ,
fraction of mouse DNA. Science 154 McQuillen K, Roberts RB (1960) Biophysics.
(3750):791–794 In: Yearbook 59, vol 59. Carnegie Institution
2. Britten RJ, Kohne DE (1968) Repeated of Washington, Washington, pp 229–279
sequences in DNA. hundreds of thousands 13. Ehret CF, De Haller G (1963) Origin, devel-
of copies of DNA sequences have been opment and maturation of organelles and
incorporated into the genomes of higher organelle systems of the cell surface in Para-
organisms. Science 161(841):529–540 mecium. J Ultrastruct Res 23(Suppl 6):1–42
3. Makalowski W (2001) The human genome 14. Brosius J (1991) Retroposons--seeds of evo-
structure and organization. Acta Biochim lution. Science 251(4995):753
Pol 48(3):587–598 15. Makalowski W, Mitchell GA, Labuda D
4. C._elegans_Sequencing_Consortium (1998) (1994) Alu sequences in the coding regions
Genome sequence of the nematode of mRNA: a source of protein variability.
C. elegans: a platform for investigating biol- Trends Genet 10(6):188–193
ogy. Science 282(5396):2012–2018 16. Jordan IK, Rogozin IB, Glazko GV, Koonin
5. SanMiguel P, Tikhonov A, Jin YK, EV (2003) Origin of a substantial fraction of
Motchoulskaia N, Zakharov D, Melake- human regulatory sequences from transposa-
Berhan A, Springer PS, Edwards KJ, Lee M, ble elements. Trends Genet 19(2):68–72. Pii
Avramova Z, Bennetzen JL (1996) Nested S0168-9525(02)00006-9
retrotransposons in the intergenic regions of 17. Thornburg BG, Gotea V, Makalowski W
the maize genome. Science 274 (2006) Transposable elements as a significant
(5288):765–768 source of transcription regulating signals.
6. Keller EF (1983) A feeling for the organism: Gene 365:104–110. https://doi.org/10.
the life and work of Barbara McClintock. 1016/j.gene.2005.09.036. S0378-1119(05)
W.H. Freeman, San Francisco 00653-0 [pii]
7. McClintock B (1950) The origin and behav- 18. Feschotte C (2008) Transposable elements
ior of mutable loci in maize. Proc Natl Acad and the evolution of regulatory networks.
Sci U S A 36(6):344–355 Nat Rev Genet 9(5):397–405. https://doi.
8. McClintock B (1951) Chromosome organi- org/10.1038/nrg2337
zation and genic expression. Cold Spring 19. Mita P, Boeke JD (2016) How retrotranspo-
Harb Symp Quant Biol 16:13–47 sons shape genome regulation. Curr Opin
9. McClintock B (1956) Controlling elements Genet Dev 37:90–100. https://doi.org/10.
and the gene. Cold Spring Harb Symp 1016/j.gde.2016.01.001
Quant Biol 21:197–216 20. Chuong EB, Elde NC, Feschotte C (2017)
10. Malamy MH, Fiandt M, Szybalski W (1972) Regulatory activities of transposable elements:
Electron microscopy of polar insertions in the from conflicts to benefits. Nat Rev Genet 18
lac operon of Escherichia coli. Mol Gen Genet (2):71–86. https://doi.org/10.1038/nrg.
119(3):207–222 2016.139
11. Ohno S (1972) So much “junk” DNA in our 21. Franke V, Ganesh S, Karlic R, Malik R,
genome. In: Smith HH (ed) Brookhaven Pasulka J, Horvat F, Kuzman M, Fulka H,
symposia in biology, vol 23. Gordon & Cernohorska M, Urbanova J, Svobodova E,
Breach, New York, pp 366–370 Ma J, Suzuki Y, Aoki F, Schultz RM et al
Transposable Elements: Classification, Identification, and Their Use. . . 201
(2017) Long terminal repeats power evolu- 33. Wicker T, Sabot F, Hua-Van A, Bennetzen JL,
tion of genes and gene expression programs Capy P, Chalhoub B, Flavell A, Leroy P,
in mammalian oocytes and zygotes. Genome Morgante M, Panaud O, Paux E,
Res 27(8):1384–1394. https://doi.org/10. SanMiguel P, Schulman AH (2007) A unified
1101/gr.216150.116 classification system for eukaryotic transposa-
22. Wang L, Rishishwar L, Marino-Ramirez L, ble elements. Nat Rev Genet 8(12):973–982.
Jordan IK (2017) Human population-specific https://doi.org/10.1038/nrg2165.
gene expression and transcriptional network nrg2165 [pii]
modification with polymorphic transposable 34. Hughes SH (2015) Reverse transcription of
elements. Nucleic Acids Res 45 retroviruses and LTR retrotransposons.
(5):2318–2328. https://doi.org/10.1093/ Microbiol Spectr 3(2):MDNA3-0027-2014.
nar/gkw1286 https://doi.org/10.1128/microbiolspec.
23. Venuto D, Bourque G (2018) Identifying MDNA3-0027-2014
co-opted transposable elements using com- 35. Kazazian HH Jr (2004) Mobile elements: dri-
parative epigenomics. Develop Growth Differ vers of genome evolution. Science 303
60(1):53–62. https://doi.org/10.1111/ (5664):1626–1632. https://doi.org/10.
dgd.12423 1126/science.1089670. 303/5664/1626
24. Mahillon J, Chandler M (1998) Insertion [pii]
sequences. Microbiol Mol Biol Rev 62 36. Malik HS, Henikoff S, Eickbush TH (2000)
(3):725–774 Poised for contagion: evolutionary origins of
25. Wilde C, Escartin F, Kokeguchi S, Latour- the infectious abilities of invertebrate retro-
Lambert P, Lectard A, Clement JM (2003) viruses. Genome Res 10(9):1307–1318
Transposases are responsible for the target 37. Leib-Mosch C, Haltmeier M, Werner T, Geigl
specificity of IS1397 and ISKpn1 for two EM, Brack-Werner R, Francke U, Erfle V,
different types of palindromic units (PUs), Hehlmann R (1993) Genomic distribution
Nucleic Acid Res 31(15):4345–4353 and transcription of solitary HERV-K LTRs.
26. Derbyshire KM, Grindley NDF (1996) Cis Genomics 18(2):261–269. https://doi.org/
preference of the IS903 transposase is 10.1006/geno.1993.1464
mediated by a combination of transposase 38. Lander ES, Linton LM, Birren B,
instability and inefficient translation. Mol Nusbaum C, Zody MC, Baldwin J,
Microbiol 21(6):1261–1272. https://doi. Devon K, Dewar K, Doyle M, FitzHugh W,
org/10.1111/j.1365-2958.1996.tb02587.x Funke R, Gage D, Harris K, Heaford A, How-
27. Ichikawa H, Ikeda K, Amemura J, Ohtsubo E land J et al (2001) Initial sequencing and
(1990) Two domains in the terminal analysis of the human genome. Nature 409
inverted-repeat sequence of transposon Tn3. (6822):860–921. https://doi.org/10.1038/
Gene 86(1):11–17 35057062
28. Maekawa T, Amemura-Maekawa J, Ohtsubo 39. Wicker T, Stein N, Albar L, Feuillet C,
E (1993) DNA binding domains in Tn3 Schlagenhauf E, Keller B (2001) Analysis of
transposase. Mol Gen Genet 236 a contiguous 211 kb sequence in diploid
(2–3):267–274 wheat (Triticum monococcum L.) reveals
29. Weinert TA, Schaus NA, Grindley ND (1983) multiple mechanisms of genome evolution.
Insertion sequence duplication in transposi- Plant J 26(3):307–316. tpj1028 [pii]
tional recombination. Science 222 40. Vicient CM, Kalendar R, Anamthawat-
(4625):755–765 Jonsson K, Schulman AH (1999) Structure,
30. Turlan C, Chandler M (1995) IS1-mediated functionality, and evolution of the BARE-1
intramolecular rearrangements: formation of retrotransposon of barley. Genetica 107
excised transposon circles and replicative dele- (1–3):53–63
tions. EMBO J 14(21):5410–5421 41. SanMiguel P, Gaut BS, Tikhonov A,
31. Siguier P, Gourbeyre E, Varani A, Nakajima Y, Bennetzen JL (1998) The pale-
Ton-Hoang B, Chandler M (2015) Every- ontology of intergene retrotransposons of
man’s guide to bacterial insertion sequences. maize. Nat Genet 20(1):43–45. https://doi.
Microbiol Spectr 3(2):MDNA3-0030-2014. org/10.1038/1695
https://doi.org/10.1128/microbiolspec. 42. Peterson DG, Schulze SR, Sciara EB, Lee SA,
MDNA3-0030-2014 Bowers JE, Nagel A, Jiang N, Tibbitts DC,
32. Finnegan DJ (1989) Eukaryotic transposable Wessler SR, Paterson AH (2002) Integration
elements and genome evolution. Trends of Cot analysis, DNA cloning, and high-
Genet 5(4):103–107 throughput sequencing facilitates genome
characterization and gene discovery. Genome
202 Wojciech Makałowski et al.
66. Goodwin TJ, Butler MI, Poulter RT (2003) complete genomes. Bioinformatics 15
Cryptons: a group of tyrosine-recombinase- (5):426–427
encoding DNA transposons from pathogenic 78. Kurtz S, Choudhuri JV, Ohlebusch E,
fungi. Microbiology 149. (Pt 11:3099–3109 Schleiermacher C, Stoye J, Giegerich R
67. Kojima KK, Jurka J (2011) Crypton transpo- (2001) REPuter: the manifold applications
sons: identification of new diverse families and of repeat analysis on a genomic scale. Nucleic
ancient domestication events. Mob DNA 2 Acids Res 29(22):4633–4642
(1):12. https://doi.org/10.1186/1759- 79. Delcher AL, Kasif S, Fleischmann RD,
8753-2-12 Peterson J, White O, Salzberg SL (1999)
68. Bureau TE, Wessler SR (1994) Stowaway: a Alignment of whole genomes. Nucleic Acids
new family of inverted repeat elements asso- Res 27(11):2369–2376
ciated with the genes of both monocotyle- 80. Delcher AL, Phillippy A, Carlton J, Salzberg
donous and dicotyledonous plants. Plant SL (2002) Fast algorithms for large-scale
Cell 6(6):907–916. https://doi.org/10. genome alignment and comparison. Nucleic
1105/tpc.6.6.907. 6/6/907 [pii] Acids Res 30(11):2478–2483
69. Feschotte C, Swamy L, Wessler SR (2003) 81. Li RQ, Ye J, Li SG, Wang J, Han YJ, Ye C,
Genome-wide analysis of mariner-like trans- Wang J, Yang HM, Yu J, Wong GKS, Wang J
posable elements in rice reveals complex rela- (2005) ReAS: recovery of ancestral sequences
tionships with stowaway miniature inverted for transposable elements from the unassem-
repeat transposable elements (MITEs). bled reads of a whole genome shotgun. PLoS
Genetics 163(2):747–758 Comput Biol 1(4):313–321. https://doi.
70. Zhou MB, Tao GY, Pi PY, Zhu YH, Bai YH, org/10.1371/Journal.Pcbi.0010043. Artn
Meng XW (2016) Genome-wide characteri- E43 [pii]
zation and evolution analysis of miniature 82. Price AL, Jones NC, Pevzner PA (2005) De
inverted-repeat transposable elements novo identification of repeat families in large
(MITEs) in moso bamboo (Phyllostachys het- genomes. Bioinformatics 21:I351–I358.
erocycla). Planta 244(4):775–787. https:// https://doi.org/10.1093/Bioinformatics/
doi.org/10.1007/s00425-016-2544-0 Bti1018
71. Wicker T, Robertson JS, Schulze SR, Feltus 83. Kurtz S, Narechania A, Stein JC, Ware D
FA, Magrini V, Morrison JA, Mardis ER, Wil- (2008) A new method to compute K-mer
son RK, Peterson DG, Paterson AH, Ivarie R frequencies and its application to annotate
(2005) The repetitive landscape of the large repetitive plant genomes. BMC Geno-
chicken genome. Genome Res 15 mics 9:517. https://doi.org/10.1186/1471-
(1):126–136. https://doi.org/10.1101/gr. 2164-9-517
2438004 84. Lefebvre A, Lecroq T, Dauchel H, Alexandre
72. Kapitonov VV, Jurka J (2001) Rolling-circle J (2003) FORRepeats: detects repeats on
transposons in eukaryotes. Proc Natl Acad Sci entire chromosomes and between genomes.
U S A 98:8714–8719 Bioinformatics 19(3):319–326. https://doi.
73. Hood ME (2005) Repetitive DNA in the org/10.1093/Bioinformatics/Btf843
automictic fungus Microbotryum violaceum. 85. Crochemore M, Ilie L, Seid-Hilmi E (2006)
Genetica 124(1):1–10 Factor oracles. In: Ibarra OH, Yen H-C (eds)
74. Pritham EJ, Feschotte C (2007) Massive Implementation and application of automata.
amplification of rolling-circle transposons in Springer, Berlin, pp 78–89
the lineage of the bat Myotis lucifugus. Proc 86. Agrawal P, States D (1994) The Repeat Pat-
Natl Acad Sci U S A 104:1895–1900 tern Toolkit (RPT): analyzing the structure
75. Pritham EJ, Putliwala T, Feschotte C (2007) and evolution of the C. elegans genome.
Mavericks, a novel class of giant transposable Proc Int Conf Intell Syst Mol Biol 2:9
elements widespread in eukaryotes and related 87. Bao ZR, Eddy SR (2002) Automated de novo
to DNA viruses. Gene 390(1–2):3–17. identification of repeat sequence families in
https://doi.org/10.1016/j.gene.2006.08. sequenced genomes. Genome Res 12
008. S0378-1119(06)00537-3 [pii] (8):1269–1276. https://doi.org/10.1101/
76. Kapitonov VV, Jurka J (2006) Self- Gr.88502
synthesizing DNA transposons in eukaryotes. 88. Edgar RC, Myers EW (2005) PILER: identi-
Proc Natl Acad Sci U S A 103:4540–4545 fication and classification of genomic repeats.
77. Kurtz S, Schleiermacher C (1999) REPuter: Bioinformatics 21:I152–I158. https://doi.
fast computation of maximal repeats in org/10.1093/Bioinformatics/Bti1003
204 Wojciech Makałowski et al.
89. Edgar RC (2007) PILER-CR: fast and accu- repeatome with dnaPipeTE from raw geno-
rate identification of CRISPR repeats. BMC mic reads and comparative analysis with the
Bioinform 8:18. https://doi.org/10.1186/ yellow fever mosquito (Aedes aegypti).
1471-2105-8-18 Genome Biol Evol 7(4):1192–1205.
90. Quesneville H, Bergman CM, Andrieu O, https://doi.org/10.1093/gbe/evv050
Autard D, Nouaud D, Ashburner M, Anxola- 101. Genome of the Netherlands Consortium
behere D (2005) Combined evidence annota- (2014) Whole-genome sequence variation,
tion of transposable elements in genome population structure and demographic his-
sequences. PLoS Comput Biol 1 tory of the Dutch population. Nat Genet 46
(2):166–175. Artn E22. https://doi.org/10. (8):818–825. https://doi.org/10.1038/ng.
1371/Journal.Pcbi.0010022 3021
91. Altschul SF, Gish W, Miller W, Myers EW, 102. 1000 Genomes Project Consortium,
Lipman DJ (1990) Basic local alignment Auton A, Brooks LD, Durbin RM, Garrison
search tool. J Mol Biol 215(3):403–410 EP, Kang HM, Korbel JO, Marchini JL,
92. Fitch WM (1969) Locating gaps in amino McCarthy S, McVean GA, Abecasis GR
acid sequences to optimize the homology (2015) A global reference for human genetic
between two proteins. Biochem Genet 3 variation. Nature 526(7571):68–74. https://
(2):99–108 doi.org/10.1038/nature15393
93. Gibbs AJ, McIntyre GA (1970) The diagram, 103. UK10K Consortium, Walter K, Min JL,
a method for comparing sequences. Its use Huang J, Crooks L, Memari Y, McCarthy S,
with amino acid and nucleotide sequences. Perry JR, Xu C, Futema M, Lawson D,
Eur J Biochem 16(1):1–11 Iotchkova V, Schiffels S, Hendricks AE,
94. Sonnhammer EL, Durbin R (1995) A Danecek P et al (2015) The UK10K project
dot-matrix program with dynamic threshold identifies rare variants in health and disease.
control suited for genomic DNA and protein Nature 526(7571):82–90. https://doi.org/
sequence analysis. Gene 167(1–2):GC1–G10 10.1038/nature14962
95. Krumsiek J, Arnold R, Rattei T (2007) 104. Lack JB, Lange JD, Tang AD, Corbett-Detig
Gepard: a rapid and sensitive tool for creating RB, Pool JE (2016) A thousand fly genomes:
dotplots on genome scale. Bioinformatics 23 an expanded Drosophila genome nexus. Mol
(8):1026–1028. https://doi.org/10.1093/ Biol Evol 33(12):3308–3313. https://doi.
bioinformatics/btm039 org/10.1093/molbev/msw195
96. Staton SE, Burke JM (2015) Transposome: a 105. Lynch M, Gutenkunst R, Ackerman M,
toolkit for annotation of transposable element Spitze K, Ye Z, Maruki T, Jia Z (2017) Popu-
families from unassembled sequence reads. lation genomics of Daphnia pulex. Genetics
Bioinformatics 31(11):1827–1829. https:// 206(1):315–332. https://doi.org/10.1534/
doi.org/10.1093/bioinformatics/btv059 genetics.116.190611
97. Novak P, Neumann P, Pech J, Steinhaisl J, 106. Gardner EJ, Lam VK, Harris DN, Chuang
Macas J (2013) RepeatExplorer: a Galaxy- NT, Scott EC, Pittard WS, Mills RE, Gen-
based web server for genome-wide characteri- omes Project Consortium, Devine SE
zation of eukaryotic repetitive elements from (2017) The Mobile Element Locator Tool
next-generation sequence reads. Bioinformat- (MELT): population-scale mobile element
ics 29(6):792–793. https://doi.org/10. discovery and biology. Genome Res 27
1093/bioinformatics/btt054 (11):1916–1929. https://doi.org/10.1101/
gr.218032.116
98. Koch P, Platzer M, Downie BR (2014)
RepARK--de novo creation of repeat libraries 107. Ewing AD (2015) Transposable element
from whole-genome NGS reads. Nucleic detection from whole genome sequence
Acids Res 42(9):e80. https://doi.org/10. data. Mob DNA 6:24. https://doi.org/10.
1093/nar/gku210 1186/s13100-015-0055-3
99. Chu C, Nielsen R, Wu Y (2016) REPdenovo: 108. Rishishwar L, Marino-Ramirez L, Jordan IK
inferring de novo repeat motifs from short (2016) Benchmarking computational tools
sequence reads. PLoS One 11(3):e0150719. for polymorphic transposable element detec-
https://doi.org/10.1371/journal.pone. tion. Brief Bioinform. https://doi.org/10.
0150719 1093/bib/bbw072
100. Goubert C, Modolo L, Vieira C, 109. Helman E, Lawrence MS, Stewart C,
ValienteMoro C, Mavingui P, Boulesteix M Sougnez C, Getz G, Meyerson M (2014)
(2015) De novo assembly and annotation of Somatic retrotransposition in human cancer
the Asian tiger mosquito (Aedes albopictus) revealed by whole-genome and exome
Transposable Elements: Classification, Identification, and Their Use. . . 205
sequencing. Genome Res 24(7):1053–1063. CM, Lee BT, Karolchik D, Hinrichs AS,
https://doi.org/10.1101/gr.163659.113 Haeussler M, Guruvadoo L, Navarro
110. Lee E, Iskow R, Yang L, Gokcumen O, Gonzalez J, Gibson D et al (2018) The
Haseley P, Luquette LJ III, Lohr JG, Harris UCSC Genome Browser database: 2018
CC, Ding L, Wilson RK, Wheeler DA, Gibbs update. Nucleic Acids Res 46(D1):
RA, Kucherlapati R, Lee C, Kharchenko PV D762–D769. https://doi.org/10.1093/
et al (2012) Landscape of somatic retrotran- nar/gkx1020
sposition in human cancers. Science 337 119. Noll A, Grundmann N, Churakov G,
(6097):967–971. https://doi.org/10.1126/ Brosius J, Makalowski W, Schmitz J (2015)
science.1222077 GPAC-genome presence/absence compiler: a
111. Tubio JMC, Li Y, Ju YS, Martincorena I, web application to comparatively visualize
Cooke SL, Tojo M, Gundem G, Pipinikas multiple genome-level changes. Mol Biol
CP, Zamora J, Raine K, Menzies A, Roman- Evol 32(1):275–286. https://doi.org/10.
Garcia P, Fullam A, Gerstung M, Shlien A et al 1093/molbev/msu276
(2014) Mobile DNA in cancer. Extensive 120. Abrusan G, Grundmann N, DeMester L,
transduction of nonrepetitive DNA mediated Makalowski W (2009) TEclass-a tool for
by L1 retrotransposition in cancer genomes. automated classification of unknown eukary-
Science 345(6196):1251343. https://doi. otic transposable elements. Bioinformatics 25
org/10.1126/science.1251343 (10):1329–1330. https://doi.org/10.1093/
112. Thung DT, de Ligt J, Vissers LE, bioinformatics/btp084
Steehouwer M, Kroon M, de Vries P, Slag- 121. Jurka J, Klonowski P, Dagman V, Pelton P
boom EP, Ye K, Veltman JA, Hehir-Kwa JY (1996) Censor - a program for identification
(2014) Mobster: accurate detection of mobile and elimination of repetitive elements from
element insertions in next generation DNA sequences. Comput Chem 20
sequencing data. Genome Biol 15(10):488. (1):119–121
https://doi.org/10.1186/s13059-014- 122. Bao W, Kojima KK, Kohany O (2015)
0488-x Repbase Update, a database of repetitive ele-
113. Keane TM, Wong K, Adams DJ (2013) Ret- ments in eukaryotic genomes. Mob DNA
roSeq: transposable element discovery from 6:11. https://doi.org/10.1186/s13100-
next-generation sequencing data. Bioinfor- 015-0041-9
matics 29(3):389–390. https://doi.org/10. 123. Wicker T, Matthews DE, Keller B (2002)
1093/bioinformatics/bts697 TREP: a database for Triticeae repetitive ele-
114. Tang Z, Steranka JP, Ma S, Grivainis M, ments. Trends Plant Sci 7(12):561–562. [pii]
Rodic N, Huang CR, Shih IM, Wang TL, S1360-1385(02)02372-5
Boeke JD, Fenyo D, Burns KH (2017) 124. McCarthy EM, McDonald JF (2003)
Human transposon insertion profiling: analy- LTR_STRUC: a novel search and identifica-
sis, visualization and identification of somatic tion program for LTR retrotransposons. Bio-
LINE-1 insertions in ovarian cancer. Proc informatics 19(3):362–367. https://doi.org/
Natl Acad Sci U S A 114(5):E733–E740. 10.1093/Bioinformatics/Btf878
https://doi.org/10.1073/pnas. 125. Kalyanaraman A, Aluru S (2006) Efficient
1619797114 algorithms and software for detection of full-
115. Chen Y, Ye W, Zhang Y, Xu Y (2015) High length LTR retrotransposons. J Bioinforma
speed BLASTN: an accelerated MegaBLAST Comput Biol 4(2):197–216.
search tool. Nucleic Acids Res 43 S021972000600203X [pii]
(16):7762–7768. https://doi.org/10.1093/ 126. Rho M, Choi JH, Kim S, Lynch M, Tang H
nar/gkv784 (2007) De novo identification of LTR retro-
116. Kent WJ (2002) BLAT--the BLAST-like transposons in eukaryotic genomes. BMC
alignment tool. Genome Res 12 Genomics 8:90. https://doi.org/10.1186/
(4):656–664. https://doi.org/10.1101/gr. 1471-2164-8-90. 1471-2164-8-90 [pii]
229202. Article published online before 127. Xu Z, Wang H (2007) LTR_FINDER: an
March 2002 efficient tool for the prediction of full-length
117. Kielbasa SM, Wan R, Sato K, Horton P, Frith LTR retrotransposons. Nucleic Acids Res 35
MC (2011) Adaptive seeds tame genomic (Web Server issue):W265–W268. https://
sequence comparison. Genome Res 21 doi.org/10.1093/nar/gkm286. gkm286
(3):487–493. https://doi.org/10.1101/gr. [pii]
113985.110 128. Ellinghaus D, Kurtz S, Willhoeft U (2008)
118. Casper J, Zweig AS, Villarreal C, Tyner C, LTRharvest, an efficient and flexible software
Speir ML, Rosenbloom KR, Raney BJ, Lee
206 Wojciech Makałowski et al.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Part III
Abstract
The multispecies coalescent (MSC) model provides a compelling framework for building phylogenetic trees
from multilocus DNA sequence data. The pure MSC is best thought of as a special case of so-called
“multispecies network coalescent” models, in which gene flow is allowed among branches of the tree,
whereas MSC methods assume there is no gene flow between diverging species. Early implementations of
the MSC, such as “parsimony” or “democratic vote” approaches to combining information from multiple
gene trees, as well as concatenation, in which DNA sequences from multiple gene trees are combined into a
single “supergene,” were quickly shown to be inconsistent in some regions of tree space, in so far as they
converged on the incorrect species tree as more gene trees and sequence data were accumulated. The
anomaly zone, a region of tree space in which the most frequent gene tree is different from the species tree,
is one such region where many so-called “coalescent” methods are inconsistent. Second-generation
implementations of the MSC employed Bayesian or likelihood models; these are consistent in all regions
of gene tree space, but Bayesian methods in particular are incapable of handling the large phylogenomic
data sets currently available. Two-step methods, such as MP-EST and ASTRAL, in which gene trees are first
estimated and then combined to estimate an overarching species tree, are currently popular in part because
they can handle large phylogenomic data sets. These methods are consistent in the anomaly zone but can
sometimes provide inappropriate measures of tree support or apportion error and signal in the data
inappropriately. MP-EST in particular employs a likelihood model which can be conveniently manipulated
to perform statistical tests of competing species trees, incorporating the likelihood of the collected gene
trees on each species tree in a likelihood ratio test. Such tests provide a useful alternative to the multilocus
bootstrap, which only indirectly tests the appropriateness of competing species trees. We illustrate these
tests and implementations of the MSC with examples and suggest that MSC methods are a useful class of
models effectively using information from multiple loci to build phylogenetic trees.
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_7, © The Author(s) 2019
211
212 Liang Liu et al.
1.1 Stopgap By and large, the ensuing decades of molecular phylogenetics has
Approaches to Gene fulfilled much of its potential, revolutionizing taxonomies and
Tree Heterogeneity resolving conundrums previously considered intractable. However,
as the amount of genetic data per species becomes ever-more
voluminous, it has become clear that the conflicts between individ-
ual genes with each other and with the overarching species tree,
Building Phylogenetic Trees Using the Multispecies Coalesent Model 213
5k gen
Fig. 1 An example showing the utility of multiple gene trees in producing species tree topologies. (a) Nine
unlinked loci are simulated (or inferred without error) from a species group with substantial amounts of
incomplete lineage sorting. Note that no single gene recovers the correct relationship between clades.
Furthermore, despite identical conditions for all nine simulations, no two genes agree on the correct topology,
let alone the correct divergence times. (b) Superimposing the nine gene trees on top of each other clarifies the
relationships. It can be (correctly) inferred that the true tree is perfectly ordered, with (ABC) diverging from D
about 1500 generations ago, the (AB)-C split occurring at 800, and A diverging from B about 600 generations
ago. Also, the amount of crossbreeding within the recently diverged taxa implies (correctly) that C has the
effective smallest population size
A B C D A B C D
species tree
gene trees
A B C DA B C D A B C D A BC D A B C D A B C D
Fig. 2 The relationship between gene trees and species trees. Lines within the
species trees indicate gene lineages. Simplified gene trees are shown below
each species tree. Whereas gene trees on the left vary due to deep coalescence,
gene trees on the right are topologically concordant but vary slightly in branch
lengths due to the coalescent. Modified with permission from [19]
input data being DNA sequences: BEST [58, 68, 69], *BEAST2
[59], and a new model (A00) in the Bayesian Phylogenetics and
Phylogeography (bpp) package [70–72]. An additional “one-step”
method, SVD Quartets [73], derives species trees directly from
aligned, unlinked single-nucleotide polymorphisms using the
method of invariants in a coalescent framework. Species tree meth-
ods exhibit a number of attractive advantages over concatenation
methods in terms of performance. These advantages are not
restricted to the anomaly zone, occur across broad regions of tree
space, and include less susceptibility to long-branch attraction [74]
and missing data [75]. Another attractive aspect of species tree
methods and multispecies coalescent models is that they deliver
more appropriate estimated levels of confidence that are more
evenly spread across genes and appear to be less susceptible to the
inflation of posterior probabilities that was early on attributed to
Bayesian analyses (e.g., [76, 77]) but may also be due to model
misspecification due to concatenation [53]. Bayesian methods are
generally agreed to be the most efficient and accurate, capturing all
details of the MSC model seamlessly [52]. However, one drawback
is that the estimation of larger numbers of parameters (population
sizes and divergence times in addition to topologies) can slow
computation, may not be relevant in some situations [78], and is
generally not possible with the large data sets that are routinely seen
today in phylogenomics [59]. Thus far, two-step methods such as
ASTRAL, STAR, NJst, and MP-EST have proven the most widely
used for large-scale phylogenomic studies, such as the Avian Phy-
logenomics Project [79] and large-scale phylogenomics of fish [80]
and plants [81].
2.1 Sources of Gene The “standard” and most common reason why gene trees are not
Tree/Species Tree speciodendritic is incomplete lineage sorting, i.e., lineages have not
Discordance and yet been reproductively isolated for long enough for drift to cause
Violations of the complete genetic divergence in the form of reciprocal monophyly
Multispecies of gene trees ([82]; Figs. 1 and 2). This source of gene tree
Coalescent Model heterogeneity is guaranteed to be ubiquitous, if only because it
arises from the finite size populations of all species that have ever
2.1.1 Population come into existence. Almost all the techniques and software
Processes packages discussed above are designed to approximate uncertainties
in species tree topology arising from this phenomenon.
Delimitation of Species and For recent divergences, the definition of “species” can become
Diverging Lineages problematic for species tree methods [63], and the challenge of
delimiting species has, if anything, increased now that the overly
conservative strictures of gene tree monophyly as a delimiter of
species have been mostly abandoned [82]. This fundamental issue
in a phylogenetic study—whether the extent of divergence among
lineages warrants species status—has not gone away in the genomic
era. However, traditional species tree methods using the MSC need
Building Phylogenetic Trees Using the Multispecies Coalesent Model 219
not use “good” species as OTUs; they will work perfectly well on
lineages that have recently diverged, so long as they have ceased
exchanging genes. The key issue is not whether the OTUs in
species tree analyses are in fact species but rather whether they
have ceased exchanging genes, which has been shown to compro-
mise traditional MSC methods [83, 84] (see below).
The problem of species delimitation may ultimately be solved
by data other than genetics, and today few species concepts use
strictly genetic criteria [85]. Some have suggested that the line
between a population-level difference and a species-level difference
can be drawn empirically and with consistency in well-studied taxa
such as birds, using morphological, environmental, and behavioral
data simultaneously [86]. Thus, there is some hope that species
delimitation can be performed rigorously a priori in many cases.
Researchers who opt for delimiting species primarily with molecu-
lar data have a wide array of techniques and prior examples available
to them, although not all without controversy [71, 87–93]. Recent
progress in species delimitation is motivated by the conceptual
transition from “biological/reproductive isolation species” to the
“lineage species concept,” which defines species not in terms of
monophyly of gene lineages but as population lineage segments in
the species tree [93]. Under that expanded concept, boundaries of
species (i.e., lineages in the species tree) can be facilitated by
collection and analysis of gene trees in the framework of the
multispecies coalescent model [72]. The recent suggestion that
coalescent species delimitation methods define only structure but
not species [90] was, in our view, already well-established, with
confusion stemming largely from the term “species delimitation,”
as opposed to “delimitation of populations between which gene
flow has ceased.”
Gene Flow There are a number of other situations in which the assumptions
of the coalescent are violated. MSC models involve a series of
isolation events unaccompanied by gene flow. In this regard,
they are like the isolation-migration models of phylogeography
[94, 95] but without the migration. The assumption of no gene
flow naturally restricts their utility, but gene flow of course com-
promises other methods of phylogenetic inference, including con-
catenation methods, as well. Additionally, situations in which gene
flow yields a prominent molecular signal often are detectable
primarily among very closely related species in the realm of phy-
logeography [96]. If some substantial gene flow continues
between species after divergence, then the multispecies coalescent
can quickly destabilize, especially for a small number of loci and as
the rate of genetic introgression increases (Fig. 6 in [87, 97–99]).
We recommend model comparison algorithms like PHRAPL [87]
for determining whether a given data set conforms to the assump-
tions of the MSC.
220 Liang Liu et al.
2.1.2 Molecular In addition to species delimitation and gene flow, there are at least
Processes three mechanisms that generate discordance on the molecular level
(Fig. 3). These include horizontal gene transfer (HGT), which can
pose a serious risk to phylogenetic analysis; gene duplication, whose
risks can be avoided by certain models; and natural selection, which
generally poses no direct threat but, depending on its mode of
action and consequences for DNA and protein sequences, can be
the most challenging of all.
A B C D A B C D
Gene Duplication
Copy 1
Copy 2
A A B B A B
Convergent Evolution
Mutation
A B C D A B C D
Fig. 3 Three examples of gene histories that depart from the standard
multispecies coalescent model. (a) A duplication event that precedes a
speciation event can lead to incorrect inference of divergence times in the
species tree if copy 1 is compared to copy 2. This can be particularly difficult
if one of the gene copies has been lost or not sequenced by the researcher. (b)
Convergent evolution can occur at the molecular level, for example, in certain
genes under strong natural selection or highly biased mutational processes.
These processes will tend to bring together distantly related taxa in the
phylogenetic tree and are likely to be given additional false support by
morphological data. (c) Horizontal gene transfer causes difficulties in some
current species tree methods, because it establishes a spurious lower bound
to divergence times. Though rare in eukaryotes, it is by no means unknown and
is likely to become a more difficult problem in the future when species trees are
based on tens of thousands of loci
Building Phylogenetic Trees Using the Multispecies Coalesent Model 221
Horizontal Gene Transfer HGT is now known to be so widespread across the Tree of Life,
especially in prokaryotes, that some have suggested a web of life
may be a more appropriate paradigm for phylogenetic change
[100–102]. Growing evidence shows that even eukaryotic gen-
omes contain substantial amounts of “uploaded” genetic material
from bacteria, archaea, viruses, and even fellow eukaryotes
[103–105]. Even though effective techniques are not yet widely
available for detecting HGT in eukaryotes, enough individual cases
have been “accidentally” discovered that researchers have given up
trying to list them all [103].
The implications of HGT for species tree construction vary
depending on the method used. For example, following the stan-
dard assumption in coalescent theory that allelic divergences must
occur earlier in time than the divergences of species harboring those
alleles, some species tree techniques [48, 58], as well as classical
approaches (e.g., [13]), assume that the gene tree exhibiting the
most recent divergence between taxon A and taxon B establishes a
hard upper limit on the divergence time of those species in the
species tree. For small sets of genes in taxa where HGT is rare, a
researcher would need to be quite unlucky to choose a horizontally
transferred gene for analysis. However, as the genomic era
advances, it becomes more likely that at least one of the thousands
of genes studied will have been transferred horizontally and thus
establish a spurious upper bound for clade divergence at the species
level. When selective introgression of genes from one species to
another is considered, this number of genes coalescing recently
between species will increase [106]. Although HGT is clearly a
problem for some current methodologies, if transferred genes can
first be identified, then they could be extremely useful as genomic
markers for monophyletic groups that have inherited such genes
and would otherwise be difficult to resolve [107]. However, for
other species tree methods that calculate averages of coalescence
times, such as STAR [65], HGT events will have less of an impact.
Liu et al. [56] examined the effect of HGT on the pseudo-
likelihood method MP-EST and predicted that, mathematically,
species tree branch lengths may be biased by HGT but that topol-
ogies were fairly robust. Davidson et al. [108] found that quartet-
based methods, such as ASTRAL-II, were fairly robust to HGT in
the presence of ILS. Removal of genes suspected to be transferred
via HGT prior to species tree analysis would be warranted; how-
ever, some methods to detect such events rely both on having the
true species tree already in hand and also on the absence of other
mechanisms causing gene tree discordance [109–112]. Recent
work aims to incorporate HGT into other mechanisms of gene
tree incongruence (reviewed in [113]); how much we need to
invest in such synthetic methods will likely depend on the preva-
lence of HGT in particular taxonomic groups.
222 Liang Liu et al.
Gene Duplication Gene duplication presents another violation of the basic MSC
model (Fig. 3); like HGT, its potential problems are worst when
they go unrecognized [49]. Imagine a taxon where a gene of
interest duplicated 10 Mya into copy α and copy β; the taxon then
split 5 Mya into species 1 and 2. A researcher investigating the
daughter species would therefore sequence four orthologous
genes, with the potential to compare α1 to β2 and β1 to α2 and
thus generate two gene trees where the estimated split time was
10 Mya, rather than 5 Mya. Such a situation will be easily recog-
nized if copy α and β have diverged sufficiently by the time of their
duplication, and a number of methods of coalescent analysis have
incorporated gene duplication (e.g., [114, 115]; reviewed in
[116]). Additionally, failure to recognize the situation may not
have drastic consequences for phylogenetic analysis if the paralogs
have coalesced very recently or are species-specific, in which case the
estimated gene coalescence would be approximately correct no
matter which comparison was made. However, if one of the copies
has been lost and only one of the remaining copies is sequenced,
then the chances of inferring an inappropriately long period of
genetic isolation are larger and will increase as the size of the family
of paralogs increases. Assessing paralogs in phylogenomic data is a
major challenge, particularly in groups like plants and fish, and a
growing number of dedicated methods ([117]; assessed in [118])
or filtering protocols [119] for doing so exist. This problem will
tend to overestimate gene coalescence times, and some species tree
methods depend on minimum isolation times among a large set of
genes. These deep coalescences might spuriously increase inferred
ancestral population sizes. A systematic search for biases incurred by
species tree methods due to gene duplication is needed.
Natural Selection Natural selection causes yet another violation of the multispecies
coalescent model. Selection can cause serious problems in some
cases, although in other circumstances it is predicted not to cause
problems of phylogenetic analysis [47, 120]. The usual stabilizing
selection can be helpful to taxonomists working at high levels
because it slows the substitution rate; likewise selective sweeps,
directional selection, and genetic surfing [121] tend to clarify
phylogenetic relationships by accelerating reciprocal monophyly
for genes in rapidly diverging clades. However, challenges to phy-
logenetic inference are posed by any evolutionary force that may
bias the reconstruction of gene trees, including convergent neutral
mutations (homoplasy), balancing selection, and selection-driven
convergent evolution (e.g., [122]). Balancing selection tends to
preserve beneficial alleles at a gene for long periods of time and is
probably the most insidious form of selection with respect to
accurately reconstructing gene trees and species trees.
Building Phylogenetic Trees Using the Multispecies Coalesent Model 223
2.2 More About Many of the instances of violations of the coalescent model will
Violations and Model occur at individual genes and usually will not dominate the signal
Fit of the Multispecies of the entire suite of genes sampled for phylogenetic analysis. Reid
Coalescent Model et al. [123] conducted one of the few tests of the fit of the MSC to
multilocus phylogenetic data. Although the title of their article
suggests that the MSC overall provides a “poor fit” to empirical
data, we suggest that their results provide a more hopeful picture.
The most important thing is that they investigated the fit of the
MSC to individual loci in phylogenetic data sets and were able to
identify loci that failed to fit the MSC. They were less successful at
identifying the causes of departure from the MSC for individual loci.
More common but still rare are efforts to determine which
models of phylogenetic inference, the MSC or concatenation, pro-
vide a better fit to empirical phylogenomic data. Edwards et al.
[124] and Liu and Pearl [58] both used the Bayesian species tree
method BEST [68] to ask using Bayes factors whether the MSC or
concatenation fits empirical data sets better. Uniformly, they found
that the MSC fit empirical data sets better than concatenation,
often by a large margin. However, further work in this area is still
needed. Most discussions in the literature have focused on the
perceived failings or violations of the MSC by empirical data
sets—such as evidence for recombination within loci—even when
such failings or assumptions also apply to concatenation [47].
Given that all models are approximations of reality, a better focus
would be to ask which model better fits empirical data sets better.
The limited research that has been done suggests overwhelmingly
that the MSC provides a better fit to empirical data sets than
concatenation.
Are there better models for phylogenomics than the MSC?
Depending on the data set, almost surely there are (Fig. 4). Several
authors working with phylogenomic data sets have suggested that
gene flow is detectable, even among lineages that diverged a long
time ago (e.g., [129, 130]). The increasing number of reports of
hybridization and introgression among phenotypically distinct spe-
cies suggests that hybridization may be a typical component of
speciation and that even phylogenetic models can be improved by
incorporating such reticulation (e.g., [47, 106, 131]). The pure
MSC is best thought of as a special case of so-called “multispecies
network coalescent” models, or MSNC [127, 132–134] (Fig. 4), in
which gene flow connects some branches of the species tree. In the
end, empiricists will need to decide what level of model fit they are
willing to tolerate and which software packages can accommodate
the large data sets that are now routine in phylogenomics.
2.2.1 Phylogenetic Genes whose phylogenetic signal differs significantly from that of
Outlier Loci the remainder of data set can be thought of as phylogenetic outliers.
These loci are conceptually similar to outliers in population genet-
ics, which have been the focus of many studies (reviewed in
224 Liang Liu et al.
1 = 4N 1 µ 2 = 4N 2 µ
1 = m1/µ
M2
Θ1 Θ2 2 = m2/µ
M1 =µt
Time
equilibrium migration model = 4N µ
isolation-migration model
1 M23 = m/μ
M34 = m/μ
1
2
2
3 3
pure isolation
(phylogeny) multispecies
multispecies coalescent network coalescent
2.2.2 Genomic Signals of When faced with a surprising or nonconvergent species tree, one
Phylogenetic Outliers possibility is that an unusual gene tree is to blame. Though techni-
ques for dealing with violations of the coalescent model are in their
infancy, researchers do have a few options. Below we list several
ideas, some borrowed from classical phylogenetics or from meth-
ods used in bioinformatics. It is likely that the several tests con-
structed to detect phylogenetic outliers in classical phylogenetics
can be extended slightly to incorporate the additional variation
among genes expected due to the coalescent process. Of course,
with larger data sets, at least with some coalescent methods, single
anomalous genes may have little effect on the resulting species tree,
particularly in species tree methods utilizing summary statistics
[65]. However, as pointed out above, species tree methods such
as BEST that relies on “hard” boundaries for the species tree by
individual genes could be derailed due to the anomalous behavior
of even a single gene.
Jackknifing: A straightforward approach to detecting phyloge-
netic outliers under the multispecies coalescent model is to rerun
the analysis n times, where n is the number of loci in the study,
leaving one locus out each time. An outlier can then be identified if
the analysis that does not include that gene differs from the remain-
ing analyses in which that gene is included. This approach has been
applied successfully in fruit flies by Wong et al. [21], who consid-
ered their problem resolved when the elimination of one of the ten
genes unambiguously resolved a polytomy. There may be other
metrics of success that are more robust or sensitive or do not
depend as strongly on a priori beliefs about the relationships
among taxa. Because some duplications or horizontal transfers
may affect only one taxon, whole-tree topology summary statistics
are unlikely to be sensitive enough to detect recent events. How-
ever, the cophenetic distance of each taxon to its nearest neighbor
in the complete species tree could be compared across jackknife
results. This procedure will produce a distribution of “typical”
distances, and significance can therefore be assigned to highly
divergent results. The drawback to such an approach is the
226 Liang Liu et al.
2.2.3 Simulation Simulating gene trees from a species tree is another method for
Approaches to Detecting identifying gene trees that differ from the majority of loci in the
Phylogenetic Outliers data set. Several species tree methods yield estimate of the phylog-
eny that include branch lengths in coalescent units [56, 57, 70],
which are required to simulate gene trees from a species tree.
Branch lengths in the estimated species tree can be decomposed
into a number of substitutions per site and an estimate of θ ¼ 4Nμ
that are compatible with the original branch length in coalescent
units. For example, using any number of algorithms, including
maximum likelihood or Bayesian methods, the length of species
tree branch lengths in substitutions per site can be approximated by
fitting the concatenated alignment of genes to the estimated species
tree topology, yielding a tree with the same topology but branch
lengths in substitutions per site (μt, where t is the time span of the
branch in either generations or years). With these branch lengths in
hand, estimates of θ can then be applied to each branch so that the
original coalescent units t/2N μt/θ from the species tree are
retained. Care needs to be taken to preserve the appropriate ploidy
units when simulating gene trees from an estimated species tree.
Packages such as MP-EST yield estimates of species tree branch
lengths in coalescent units of 4N generations, appropriate for
diploids, whereas packages such as Phybase [142] simulate gene
trees from a species tree in estimates of 2N units, appropriate for
haploids. Another issue that is important to be aware of is the
distinction between gene coalescence times and species tree branch
lengths [143, 144]. Whereas species tree branch lengths are esti-
mates of lineage or population branch lengths in the species tree,
the DNA sequence alignment that is fitted to the species tree will
yield branch lengths reflecting the coalescence time of genes in
ancestral species. This discrepancy occurs because gene coalescence
times by necessity predate and record a more ancient event than do
species divergence times. The discrepancy may represent a small
fraction of the branch length if species divergence times are large,
but Angelis and dos Reis [143] have suggested that the discrepancy
can be quite large even in comparisons of distantly related species,
such as exemplars of mammalian orders. There is a great need for
methods of molecular dating and combining fossils and DNA data
that distinguish between gene coalescence times and speciation
times, the latter of which is usually of primary interest.
Once the branch lengths of the species tree are prepared for
simulation, gene trees can be simulated using a number of packages
(Phybase, [142]; TreeSim, [145]; CoMus, [146]). Even packages
traditionally used in phylogeography can be used to simulated gene
Building Phylogenetic Trees Using the Multispecies Coalesent Model 227
4 Future Directions
5 Practice Problems
References
1. Darwin C (1859) On the origin of species, vol DM, Moritz C, Mable BK (eds) Molecular
Facsimile of 1st Edition. Harvard University systematics. Sinauer, Sunderland, MA
Press, Cambridge, p 513 7. Nei M (1987) Molecular evolutionary genet-
2. Hillis DM (1987) Molecular versus morpho- ics, vol 512. Columbia University Press,
logical approaches to systematics. Annu Rev New York
Ecol Syst 18:23–42 8. Nei M, Kumar S (2000) Molecular evolution
3. Scotland RW, Olmstead RG, Bennett JR and phylogenetics. Oxford University Press,
(2003) Phylogeny reconstruction: the role of New York
morphology. Syst Biol 52:539–548 9. Rosenberg NA (2002) The probability of
4. Kocher TD, Thomas WK, Meyer A, Edwards topological concordance of gene trees and
SV, P€a€abo S, Villablanca FX, Wilson AC species trees. Theor Popul Biol 61:225–247
(1989) Dynamics of mitochondrial DNA evo- 10. Cavalli-Sforza LL (1964) Population struc-
lution in animals: amplification and sequenc- ture and human evolution. Proc R Soc Lond
ing with conserved primers. Proc Natl Acad Series B 164:362–379
Sci U S A 86:6196–6200 11. Avise JC, Arnold J, Ball RM, Bermingham E,
5. Miyamoto MM, Cracraft J (1991) Phylogeny Lamb T, Neigel JE, Reeb CA, Saunders NC
inference, DNA sequence analysis, and the (1987) Intraspecific phylogeography: the
future of molecular systematics. In: Miyamoto mitochondrial DNA bridge between popula-
MM, Cracraft J (eds) Phylogenetic analysis of tion genetics and systematics. Annu Rev Ecol
DNA sequences. Oxford University Press, Syst 18:489–522
New York, NY, pp 3–17 12. Pamilo P, Nei M (1988) Relationships
6. Swofford DL, Olsen GJ, Waddell PJ, Hillis between gene trees and species trees. Mol
DM (1996) Phylogenetic inference. In: Hillis Biol Evol 5:568–583
234 Liang Liu et al.
13. Takahata N (1989) Gene genealogy in three gene lineages on a species tree. Proc Natl Acad
related populations: consistency probability Sci U S A 113:8002–8009
between gene and population trees. Genetics 28. Degnan JH, Rosenberg NA (2009) Gene tree
122:957–966 discordance, phylogenetic inference, and the
14. Avise JC (1994) Molecular markers, natural multispecies coalescent. Trends Ecol Evol
history and evolution. Chapman and Hall, 24:332–340
New York 29. Degnan JH, Rosenberg NA (2006) Discor-
15. Wollenberg K, Avise JC (1998) Sampling dance of species trees with their most likely
properties of genealogical pathways underly- gene trees. Public Lib Sci Genet 2:762–768
ing population pedigrees. Evolution 30. Rosenberg NA, Tao R (2008) Discordance of
52:957–966 species trees with their most likely gene trees:
16. Gatesy J, Springer MS (2014) Phylogenetic the case of five taxa. Syst Biol 57:131–140
analysis at deep timescales: unreliable gene 31. Huang HT, Knowles LL (2009) What is the
trees, bypassed hidden support, and the coa- danger of the anomaly zone for empirical phy-
lescence/concatalescence conundrum. Mol logenetics? Syst Biol 58:527–536
Phylogenet Evol 80:231–266 32. Sackton TB et al (2018) Convergent regu-
17. de Queiroz A, Gatesy J (2007) The superma- latory evolution and the origin of flightless-
trix approach to systematics. Trends Ecol Evol ness in palaeognathous birds. bioRxiv.
22:34–41 https://doi.org/10.1101/262584
18. Maddison WP (1997) Gene trees in species 33. Linkem CW, Minin VN, Leaché AD (2016)
trees. Syst Biol 46:523–536 Detecting the anomaly zone in species trees
19. Edwards SV (2009) Is a new and general the- and evidence for a misleading signal in higher-
ory of molecular systematics emerging? Evo- level skink phylogeny (Squamata: Scincidae).
lution 63:1–19 Syst Biol 65:465–477
20. Carstens BC, Knowles LL (2007) Estimating 34. Bryant D (2003) A classification of consensus
species phylogeny from gene-tree probabil- methods for phylogenetics. In: Janowitz M
ities despite incomplete lineage sorting: an et al (eds) BioConsensus. American Mathe-
example from melanoplus grasshoppers. Syst matical Society, Providence, RI, pp 163–183
Biol 56:400–411 35. Ewing GB, Ebersberger I, Schmidt HA, von
21. Wong A, Jensen JD, Pool JE, Aquadro CF Haeseler A (2008) Rooted triple consensus
(2007) Phylogenetic incongruence in the and anomalous gene trees. BMC Evol Biol
Drosophila melanogaster species group. Mol 8:118
Phylogenet Evol 43:1138–1150 36. Degnan JH, DeGiorgio M, Bryant D, Rosen-
22. Knowles LL, Kubatko LS (2010) Estimating berg NA (2009) Properties of consensus
species trees: an introduction to concepts and methods for inferring species trees from gene
models. In: Knowles LL, Kubatko LS (eds) trees. Syst Biol 58:35–54
Estimating species trees: practical and theoret- 37. Ranwez V, Criscuolo A, Douzery EJ (2010)
ical aspects. Wiley-Blackwell, New York, pp SuperTriplets: a triplet-based supertree
1–14 approach to phylogenomics. Bioinformatics
23. Liu L, Yu L, Kubatko L, Pearl DK, Edwards 26:i115–i123
SV (2009) Coalescent methods for estimating 38. Steel M, Rodrigo A (2008) Maximum likeli-
phylogenetic trees. Mol Phylogenet Evol hood supertrees. Syst Biol 57:243–250
53:320–328 39. Wiens JJ (2003) Missing data, incomplete
24. Neigel JE, Avise JC (1986) Phylogenetic rela- taxa, and phylogenetic accuracy. Syst Biol
tionships of mitochondrial DNA under vari- 52:528–538
ous demographic models of speciation. In: 40. Gadagkar SR, Rosenberg MS, Kumar S
Karlin S, Nevo E (eds) Evolutionary processes (2005) Inferring species phylogenies from
and theory. Academic, New York, pp multiple genes: concatenated sequence tree
515–534 versus consensus gene tree. J Exp Zool B
25. Degnan JH, Salter L (2005) Gene tree distri- Mol Dev Evol 304:64–74
butions under the coalescent process. Evolu- 41. Bull JJ, Huelsenbeck JP, Cunningham CW,
tion 59:24–37 Swofford DL, Waddell PJ (1993) Partitioning
26. Tajima F (1983) Evolutionary relationships of and combining data in phylogenetic analysis.
DNA sequences in finite populations. Genet- Syst Biol 42:384–397
ics 105:437–460 42. Rokas A, Williams B, King N, Carroll S
27. Mehta RS, Bryant D, Rosenberg NA (2016) (2003) Genome-scale approaches to resolving
The probability of monophyly of a sample of
Building Phylogenetic Trees Using the Multispecies Coalesent Model 235
72. Yang Z, Rannala B (2010) Bayesian species 87. Jackson ND, Carstens BC, Morales AE,
delimitation using multilocus sequence data. O’Meara BC (2017) Species delimitation
Proc Natl Acad Sci U S A 107:9264–9269 with gene flow. Syst Biol 66:799–812
73. Chifman, J Kubatko L (2014) Quartet infer- 88. Leache AD, Zhu T, Rannala B, Yang Z (2018)
ence from SNP data under the coalescent The spectre of too many species. Syst Biol
model. Bioinformatics 30:3317–3324 66:379
74. Liu L, Xi ZX, Davis CC (2015) Coalescent 89. Solis-Lemus C, Knowles LL, Ane C (2015)
methods are robust to the simultaneous Bayesian species delimitation combining mul-
effects of long branches and incomplete line- tiple genes and traits in a unified framework.
age sorting. Mol Biol Evol 32:791–805 Evolution 69:492–507
75. Xi ZX, Liu L, Davis CC (2016) The impact of 90. Sukumaran J, Knowles LL (2017) Multispe-
missing data on species tree estimation. Mol cies coalescent delimits structure, not species.
Biol Evol 33:838–860 Proc Natl Acad Sci U S A 114:1607–1612
76. Shen X-X, Hittinger CT, Rokas A (2017) 91. Carstens BC, Pelletier TA, Reid NM, Satler
Contentious relationships in phylogenomic JD (2013) How to fail at species delimitation.
studies can be driven by a handful of genes. Mol Ecol 22:4369–4383
Nat Ecol Evol 1:0126 92. Carstens BC, Dewey TA (2010) Species
77. Suzuki Y, Glazko GV, Nei M (2002) Over- delimitation using a combined coalescent
credibility of molecular phylogenies obtained and information-theoretic approach: an
by Bayesian phylogenetics. Proc Natl Acad Sci example from North American Myotis bats.
U S A 99:16138–16143 Syst Biol 59:400–414
78. Huang HT, He QI, Kubatko LS, Knowles LL 93. De Queiroz K (2007) Species concepts and
(2010) Sources of error inherent in species- species delimitation. Syst Biol 56:879–886
tree estimation: impact of mutational and coa- 94. Pinho C, Hey J (2010) Divergence with gene
lescent effects on accuracy and implications flow: models and data. Annu Rev Ecol Evol
for choosing among different methods. Syst Syst 41:215–230
Biol 59:573–583 95. Hey J, Nielsen R (2007) Integration within
79. Jarvis ED et al (2014) Whole-genome ana- the Felsenstein equation for improved Mar-
lyses resolve early branches in the tree of life kov chain Monte Carlo methods in popula-
of modern birds. Science 346:1320–1331 tion genetics. Proc Natl Acad Sci U S A
80. Hughes LC et al (2018) Comprehensive phy- 104:2785–2790
logeny of ray-finned fishes (Actinopterygii) 96. Carstens BC, Morales AE, Jackson ND,
based on transcriptomic and genomic data. O’Meara BC (2017) Objective choice of phy-
Proc Natl Acad Sci U S A 115:6249–6254 logeographic models. Mol Phylogenet Evol
81. Wickett NJ et al (2014) Phylotranscriptomic 116:136–140
analysis of the origin and early diversification 97. Wakeley J (2001) The effects of subdivision
of land plants. Proc Natl Acad Sci U S A 111: on the genetic divergence of populations and
E4859–E4868 species. Evolution 54:1092–1101
82. Avise JC, Ball RMJ (1990) Principles of gene- 98. Solı́s-Lemus C, Yang M, Ané C (2016) Incon-
alogical concordance in species concepts and sistency of species tree methods under gene
biological taxonomy. Oxf Surv Evol Biol flow. Syst Biol 65:843–851
7:45–67 99. Eckert AJ, Carstens BC (2008) Does gene
83. Solis-Lemus C, Yang M, Ane C (2016) Incon- flow destroy phylogenetic signal? The perfor-
sistency of species tree methods under gene mance of three methods for estimating species
flow. Syst Biol 65:843–851 phylogenies in the presence of gene flow. Mol
84. Stenz NW, Larget B, Baum DA, Ane C Phylogenet Evol 49(3):832–842
(2015) Exploring tree-like and non-tree-like 100. Doolittle WF, Bapteste E (2007) Pattern plu-
patterns using genome sequences: an example ralism and the Tree of Life hypothesis. Proc
using the inbreeding plant species Arabidopsis Natl Acad Sci U S A 104:2043–2049
thaliana (L.) Heynh. Syst Biol 64:809–823 101. Boto L (2010) Horizontal gene transfer in
85. Hudson RR, Coyne JA (2002) Mathematical evolution: facts and challenges. Proc Biol Sci
consequences of the genealogical species con- 277:819–827
cept. Evolution 56:1557–1565 102. Soucy SM, Huang J, Gogarten JP (2015)
86. Tobias JA, Seddon N, Spottiswoode CN, Pil- Horizontal gene transfer: building the web
grim JD, Fishpool LDC, Collar NJ (2010) of life. Nat Rev Genet 16:472–482
Quantitative criteria for species delimitation.
Ibis 152:724–746
Building Phylogenetic Trees Using the Multispecies Coalesent Model 237
103. Keeling PJ, Palmer JD (2008) Horizontal groups from orthologous gene pairs. PLoS
gene transfer in eukaryotic evolution. Nat One 8:e53786
Rev Genet 9:605–618 119. Irisarri I et al (2017) Phylotranscriptomic
104. Soanes D, Richards TA (2014) Horizontal consolidation of the jawed vertebrate time-
gene transfer in eukaryotic plant pathogens. tree. Nat Ecol Evol 1:1370–1378
Annu Rev Phytopathol 52:583–614 120. Edwards SV (2009) Natural selection and
105. Thomas J, Schaack S, Pritham EJ (2010) Per- phylogenetic analysis. Proc Natl Acad Sci U
vasive horizontal transfer of rolling-circle S A 106:8799–8800
transposons among animals. Genome Biol 121. Ray N, Excoffier L (2009) Inferring past
Evol 2:656–664 demography using spatially explicit popula-
106. Mallet J, Besansky N, Hahn MW (2015) How tion genetic models. Hum Biol 81:141–157
reticulated are species? BioEssays 38:140–149 122. Castoe TA, de Koning APJ, Kim H-M, Gu W,
107. Huang J, Gogarten JP (2006) Ancient hori- Noonan BP, Naylor G, Jiang ZJ, Parkinson
zontal gene transfer can benefit phylogenetic CL, Pollock DD (2009) Evidence for an
reconstruction. Trends Genet 22:361–366 ancient adaptive episode of convergent
108. Davidson R, Vachaspati P, Mirarab S, Warnow molecular evolution. Proc Natl Acad Sci U S
T (2015) Phylogenomic species tree estima- A 106:8986–8991
tion in the presence of incomplete lineage 123. Reid NH, Hird SM, Brown JM, Pelletier TA,
sorting and horizontal gene transfer. BMC McVay JD, Satler JD, Carstens BC (2014)
Genomics 16(Suppl 10):S1 Poor fit to the multispecies coalescent is
109. Linz S, Semple C, Stadler T (2010) Analyzing widely detectable in empirical data. Syst Biol
and reconstructing reticulation networks 63:322–333
under timing constraints. J Math Biol 124. Edwards SV, Liu L, Pearl DK (2007) High-
61:715–737 resolution species trees without concatena-
110. Linz S, Radtke A, von Haeseler A (2007) A tion. Proc Natl Acad Sci U S A
likelihood framework to measure horizontal 104:5936–5941
gene transfer. Mol Biol Evol 24:1312–1319 125. Beerli P, Felsenstein J (2001) Maximum like-
111. Rasmussen MD, Kellis M (2011) A Bayesian lihood estimation of a migration matrix and
approach for fast and accurate gene tree effective population sizes in n subpopulations
reconstruction. Mol Biol Evol 28:273–290 by using a coalescent approach. Proc Natl
112. Rasmussen MD, Kellis M (2007) Accurate Acad Sci U S A 98:4563–4568
gene-tree reconstruction by learning gene- 126. Hey J, Wakeley J (1997) A coalescent estima-
and species-specific substitution rates across tor of the population recombination rate.
multiple complete genomes. Genome Res Genetics 145:833–846
17:1932–1942 127. Solis-Lemus C, Bastide P, Ane C (2017) Phy-
113. Szöllosi GJ, Tannier E, Daubin V, Boussau B loNetworks: a package for phylogenetic net-
(2015) The inference of gene trees with spe- works. Mol Biol Evol 34:3292–3298
cies trees. Syst Biol 64:e42–e62 128. Than C, Ruths D, Nakhleh L (2008) Phyl-
114. Sanderson MJ, McMahon MM (2007) Infer- oNet: a software package for analyzing and
ring angiosperm phylogeny from EST data reconstructing reticulate evolutionary rela-
with widespread gene duplication. BMC tionships. BMC Bioinform 9:322
Evol Biol 7(Suppl 1):S3 129. Hallström BM, Janke A (2010) Mammalian
115. Thomas PD (2010) GIGA: a simple, efficient evolution may not be strictly bifurcating. Mol
algorithm for gene tree inference in the geno- Biol Evol 27:2804–2816
mic age. BMC Bioinform 11:312 130. Kutschera VE, Bidon T, Hailer F, Rodi JL,
116. Boussau B, Szollosi GJ, Duret L, Gouy M, Fain SR, Janke A (2014) Bears in a forest of
Tannier E, Daubin V (2013) Genome-scale gene trees: phylogenetic inference is compli-
coestimation of species and gene trees. cated by incomplete lineage sorting and gene
Genome Res 23:323–330 flow. Mol Biol Evol 31:2004–2017
117. Conte MG, Gaillard S, Droc G, Perin C 131. Mavárez J, Salazar CA, Bermingham E,
(2008) Phylogenomics of plant genomes: a Salcedo C, Jiggins CD, Linares M (2006)
methodology for genome-wide searches for Speciation by hybridization in Heliconius
orthologs in plants. BMC Genomics 9:183 butterflies. Nature 441:868–871
118. Altenhoff AM, Gil M, Gonnet GH, Dessimoz 132. Wen D, Nakhleh L (2017) Coestimating
C (2013) Inferring hierarchical orthologous reticulate phylogenies and gene trees from
multilocus sequence data. Syst Biol 67
(3):439–457
238 Liang Liu et al.
133. Yu Y, Nakhleh L (2015) A maximum pseudo- 147. Anderson CNK, Ramakrishnan U, Chan YL,
likelihood approach for phylogenetic net- Hadly EA (2005) Serial SimCoal: a popula-
works. BMC Genomics 16(Suppl 10):S10 tion genetics model for data from multiple
134. Stenz NW, Larget B, Baum DA, Ané C populations and points in time. Bioinformat-
(2015) Exploring tree-like and non-tree-like ics 21:1733–1734
patterns using genome sequences: an example 148. Excoffier L, Foll M (2011) fastsimcoal: a
using the inbreeding plant species Arabidopsis continuous-time coalescent simulator of
thaliana (L.) Heynh. Syst Biol 64(5):809–823 genomic diversity under arbitrarily complex
135. Beaumont MA, Balding DJ (2004) Identify- evolutionary scenarios. Bioinformatics
ing adaptive genetic divergence among popu- 27:1332–1334
lations from genome scans. Mol Ecol 149. Robinson DF, Foulds LR (1981) Compari-
13:969–980 son of phylogenetic trees. Math Biosci
136. Storz JF (2005) Using genome scans of DNA 53:131–147
polymorphism to infer adaptive population 150. Song S, Liu L, Edwards SV, Wu SY (2012)
divergence. Mol Ecol 14:671–688 Resolving conflict in eutherian mammal phy-
137. Barrett RDH, Hoekstra HE (2011) Molecu- logeny using phylogenomics and the multi-
lar spandrels: tests of adaptation at the genetic species coalescent model. Proc Natl Acad Sci
level. Nat Rev Genet 12:767–780 U S A 109:14942–14947
138. Swofford DL (1991) When are phylogeny 151. Heled J, Bryant D, Drummond AJ (2013)
estimates from molecular and morphological Simulating gene trees under the multispecies
data incongruent? In: Miyamoto MM, Cra- coalescent and time-dependent migration.
craft J (eds) Phylogenetic analysis of DNA BMC Evol Biol 13:44
sequences. Oxford University Press, Oxford, 152. Edwards SV, Potter S, Schmitt CJ, Bragg JG,
pp 295–333 Moritz C (2016) Reticulation, divergence,
139. Dufraigne C, Fertil B, Lespinats S, Giron A, and the phylogeography–phylogenetics con-
Deschavanne P (2005) Detection and charac- tinuum. Proc Natl Acad Sci U S A
terization of horizontal transfers in prokar- 113:8025–8032
yotes using genomic signature. Nucleic 153. Weyenberg G, Huggins PM, Schardl CL,
Acids Res 33:e6 Howe DK, Yoshida R (2014) KDETREES:
140. Roettger M, Martin W, Dagan T (2009) A non-parametric estimation of phylogenetic
machine-learning approach reveals that align- tree distributions. Bioinformatics
ment properties alone can accurately predict 30:2280–2287
inference of lateral gene transfer from 154. Gaither, J Kubatko L (2016) Hypothesis tests
discordant phylogenies. Mol Biol Evol for phylogenetic quartets, with applications
26:1931–1939 to coalescent-based species tree inference.
141. Zimmermann T, Mirarab S, Warnow T J Theor Biol 408:179–186
(2014) BBCA: improving the scalability of 155. McVay JD, Carstens BC (2013) Phylogenetic
*BEAST using random binning. BMC Geno- model choice: justifying a species tree or con-
mics 15(Suppl 6):S11 catenation analysis. J Phylogenet Evol Biol
142. Liu L, Yu L (2010) Phybase: an R package 1:114
for species tree analysis. Bioinformatics 156. Felsenstein J (1985) Confidence limits on
26:962–963 phylogenies: an approach using the bootstrap.
143. Angelis K, dos Reis M (2015) The impact of Evolution 39:783–791
ancestral population size and incomplete line- 157. Lemoine F, Domelevo Entfellner JB,
age sorting on Bayesian estimation of species Wilkinson E, Correia D, Davila Felipe M, De
divergence times. Curr Zool 61:874–885 Oliveira T, Gascuel O (2018) Renewing Fel-
144. Edwards SV, Beerli P (2000) Perspective: senstein’s phylogenetic bootstrap in the era of
gene divergence, population divergence, and big data. Nature 556:452–456
the variance in coalescence time in phylogeo- 158. Seo TK (2008) Calculating bootstrap prob-
graphic studies. Evolution 54:1839–1854 abilities of phylogeny using multilocus
145. Stadler T (2011) Simulating trees with a fixed sequence data. Mol Biol Evol 25:960–971
number of extant species. Syst Biol 159. Sayyari E, Mirarab S (2016) Fast coalescent-
60:676–684 based computation of local branch support
146. Papadantonakis S, Poirazi P, Pavlidis P (2016) from quartet frequencies. Mol Biol Evol
CoMuS: simulating coalescent histories and 33:1654–1668
polymorphic data from multiple species. Mol 160. Liu L et al (2017) Reply to Gatesy and
Ecol Resour 16:1435–1448 Springer: Claims of homology errors and
Building Phylogenetic Trees Using the Multispecies Coalesent Model 239
zombie lineages do not compromise the dat- maximum likelihood. Infect Genet Evol
ing of placental diversification. Proc Natl 9:384–385
Acad Sci U S A 114:E9433–E9434 166. Leaché AD, Oaks JR (2017) The utility of
161. Edwards SV (2016) Phylogenomic subsam- single nucleotide polymorphism (SNP) data
pling: a brief review. Zool Scr 45:63–74 in phylogenetics. Annu Rev Ecol Evol Syst
162. Lee JY, Joseph L, Edwards SV (2012) A spe- 48:69–84
cies tree for the Australo-Papuan Fairy-wrens 167. Pease JB, Haak DC, Hahn MW, Moyle LC
and Allies (Aves: Maluridae). Syst Biol (2016) Phylogenomics reveals three sources
61:253–271 of adaptive variation during a rapid radiation.
163. Stamatakis A (2014) RAxML version 8: a tool PLoS Biol 14:e1002379
for phylogenetic analysis and post-analysis of 168. Hahn MW, Nakhleh L (2016) Irrational exu-
large phylogenies. Bioinformatics berance for resolved species trees. Evolution
30:1312–1313 70:7–17
164. Xi Z, Liu L, Davis CC (2015) Genes with 169. Jennings WB, Edwards SV (2005) Specia-
minimal phylogenetic information are prob- tional history of Australian grass finches (Poe-
lematic for coalescent analyses when gene tree phila) inferred from 30 gene trees. Evolution
estimation is biased. Mol Phylogenet Evol 59:2033–2047
92:63–71 170. Stamatakis A (2006) RAxML-VI-HPC: maxi-
165. Guindon S, Dufayard JF, Hordijk W, mum likelihood-based phylogenetic analyses
Lefort V, Gascuel O (2009) PhyML: fast and with thousands of taxa and mixed models.
accurate phylogeny reconstruction by Bioinformatics 22:2688–2690
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 8
Abstract
Genome-wide comparison of phylogenetic trees is becoming an increasingly common approach in evolu-
tionary genomics, and a variety of approaches for such comparison have been developed. In this article we
present several methods for comparative analysis of large numbers of phylogenetic trees. To compare
phylogenetic trees taking into account the bootstrap support for each internal branch, the boot-split
distance (BSD) method is introduced as an extension of the previously developed split distance
(SD) method for tree comparison. The BSD method implements the straightforward idea that comparison
of phylogenetic trees can be made more robust by treating tree splits differentially depending on the
bootstrap support. Approaches are also introduced for detecting treelike and netlike evolutionary trends in
the phylogenetic Forest of Life (FOL), i.e., the entirety of the phylogenetic trees for conserved genes of
prokaryotes. The principal method employed for this purpose includes mapping quartets of species onto
trees to calculate the support of each quartet topology and so to quantify the tree and net contributions to
the distances between species. We describe the applications methods used to analyze the FOL and the
results obtained with these methods. These results support the concept of the Tree of Life (TOL) as a
central evolutionary trend in the FOL as opposed to the traditional view of the TOL as a “species tree.”
Key words Forest of Life, Tree of Life, Phylogenomic methods, Tree comparison, Map of quartets
Abbreviations
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_8, © The Author(s) 2019
241
242 Pere Puigbò et al.
SD Split distance
TNT Tree-Net trend
TOL Tree of Life
1 Introduction
METHODS CONCEPTS
Boot split distance (BSD) Forest of life (FOL)
Inconsistency Score (IS) Nearly universal trees (NUTs)
Classical muldimensional scaling Central trend tree of life (TOL)
analysis( CMDS) Paerns in the FOL
Map of quartet species Tree and net components of evoluon
The Tree-Net Trend (TNT)
Fig. 1 A schematic of the methods and concepts involved in the FOL analysis
2 Materials
2.1 The Forest of Life We analyzed the set of 6901 phylogenetic trees from [5] that were
(FOL) and Nearly obtained as follows. Clusters of orthologous genes were obtained
Universal Trees (NUTs) from the COG [62] and EggNOG [63] databases from 100 pro-
karyotic species (59 bacteria and 41 archaea). The species were
selected to represent the taxonomic diversity of Archaea and Bacte-
ria (for the complete list of species, see Additional File 1). The BeTs
algorithm [62] was used to identify the orthologs with the highest
mean similarity to other members of the same cluster (“index
orthologs”), so the final clusters contained 100 or fewer genes,
with no more than one representative of each species. The
sequences in each cluster were aligned using the Muscle program
[64] with default parameters and refined using Gblocks [65]. The
program Multiphyl [66], which selects the best of 88 amino acid
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 245
3 Methods
3.1 Boot-Split The BSD method compares trees based on the original split dis-
Distance: A Method to tance (SD) [12] method. Both methods work by collecting all
Compare Phylogenetic possible binary splits of the two compared trees and calculating
Trees Taking into the fraction of equal splits, i.e., those splits that are present in both
Account Bootstrap trees (different splits refer to splits that are present in only one of
Support the two trees). Instead of considering all branches as being equal as
is the case in SD, the BSD method takes into account the bootstrap
3.1.1 Boot-Split Distance values to increase or decrease the SD value proportionally to the
(BSD) robustness of individual internal branches. The BSD value is the
average of the BSD in the equal splits (eBSD) and the BSD in the
different splits (Eq. 1). Equations 2 and 3 give the formulas to
calculate the eBSD and dBSD values, respectively.
eBSD þ dBSD
BSD ¼ ð1Þ
2
he i
eBSD ¼ 1 ∙M e ð2Þ
a
d
dBSD ¼ ∙M d ð3Þ
a
Here e is the sum of bootstrap values of equal splits, d is the sum
of bootstrap value of different splits, a is the sum of all bootstrap
values, Me is the mean bootstrap value of equal splits, and Md is the
mean bootstrap value of different splits.
The BSD algorithm proceeds in four basic steps to compare
pairs of trees (Fig. 2). The first step is to obtain all possible splits
from both trees. This procedure implies a binary split of the tree at
each internal branch, so that the tree is partitioned into two parts
each of which contains at least two species. Then, the common set
of leaves between the two trees is obtained, that is, the set of shared
species. Only trees with a common leaf set of at least four species
can be compared. The third step consists in pruning all splits to the
common leaf set of species; at this step, species that are present in
only one of the two compared trees are removed from the split list.
After this procedure, in partially overlapping trees, the algorithm
checks whether each of the splits remains a valid partition, that is, a
partition that separates at least two species from the rest of the tree.
If a split is not a valid partition, it is removed. Finally, the algorithm
calculates the BSD using Eqs. 1–3.
246 Pere Puigbò et al.
Tree 1 Tree 2
2 1 2
Set of species in
common
3 3
All splits from All splits from
tree 1 pruned to 4 tree 2 pruned to
common species common species
BSD
Fig. 2 The main algorithm of the BSD method. The algorithm to calculate the
BSD between two trees includes four basic steps: (1) split both trees in all
possible partitions, (2) read the common set of species of both trees, (3) prune
the splits according with the common leaf set, and (4) calculate the BSD
3.1.2 The BSD Algorithm There are three possible types of comparisons for trees that do not
include paralogs, that is, include one and only one sequence from
each of the constituent species (Fig. 3). In the first case, the two
trees completely overlap, that is, consist of the same set of species
(Fig. 3a). In this case, step 2, the pruning procedure, is not neces-
sary, and the comparison involves only obtaining all possible splits
and the calculation of the BSD. In the second case, one of the
compared trees is a subset of the other tree (Fig. 3b). In this case,
the splits are only pruned and occasionally removed from the bigger
tree. In the third case, when the two trees partially overlap or when
a tree is a subset of another tree, a pruning procedure is required. In
the example shown in Fig. 4, after the pruning procedure (step 3),
there is only one remaining split (split: AB|CD) that is repeated
several times in both trees. The remaining AB|CD split in Tree 1 is
separated by four nodes that have different bootstrap values. In this
case, the bootstrap of the remaining split is calculated using Eq. 4,
where n is the total number of nodes between the two sides of the
split and BSi is the bootstrap value (adjusted to the 0–1 range) of
the node i.
Bootstrap ¼ 1 ∏ni¼1 ð1 BSi Þ ð4Þ
96 4 2
5 6 58
[96] 45 | 6231 16|2345 [58]
47 6 1
2
[47] 62 | 4531 162|345 [8] 3 8
26 3 [26] 31 | 4562 2613|45 [79] 4 79
1 5
SD = 0.667 BSD = 0.333
SD BSD eBSD dBSD p=2, q=4, m=6, a=3.140, e=1.750, d=1.390
0.667 0.333 0.512 0.154 Ma=0.53, Me=0.875, Md=0.3475
a
5
2 72 | 54163 [5] 4 15
33 5 7 5
[33] 43 | 152 724 | 5163 [15]
1 2
[59] 51 | 432 7254 | 163 [38] 1
59 4 18
3 72543 | 16 [18] 6 38
3
SD = 1.000 BSD = 0.681
SD BSD eBSD dBSD p=0, q=4, m=4, a=1.450, e=0.000, d=1.450
1.000 0.681 1.000 0.363 Ma=0.3625, Me=0, Md=0.3625
b
Fig. 3 Examples of the BSD algorithm in single family trees. (a) Two trees of the
same size. (b) Tree 1 is a subtree of the Tree 2. Two trees that partially overlap.
SD split distance, BSD boot-split distance, eBSD BSD of equal splits, dBSD BSD
of different splits, p number of equal splits, q number of different splits, m total
number of splits, a sum of bootstraps in all splits, e sum of bootstraps in equal
splits, d sum of bootstraps in different splits, Ma mean bootstrap value, Me mean
bootstrap value in equal splits, Md mean bootstrap value in different splits
3.1.3 Using a Bootstrap The key question regarding the BSD method is as follows: what is
Threshold: Pros and Cons the best approach to phylogenetic tree comparison—using all
branches, reliable or not, with the appropriate weighting, or using
only branches supported by high bootstrap values? The first option
is illustrated in Fig. 3, whereas Fig. 5 shows an example of a tree
comparison that employs a bootstrap threshold of 70, i.e., only
branches supported by a higher bootstrap are taken into account in
the comparison. The second procedure appears reasonable and can
be recommended in some cases. However, it is not advisable as a
general approach because, when two large trees with varying
248 Pere Puigbò et al.
90 A
Tree 1 10 Tree 2
B
10 A
P
10 B
Q 90
C
R
C D
D
90 A
10
B
10
P
10
Q
R
C
D
90 A A
10
B B
10 93
10
C C
D D
Bootstrap = 1-(0.1x0.9x0.9x0.9) = 0.93
Fig. 4 Calculation of BSD for trees with an unequal numbers of species. The
larger tree (1) is pruned prior to the calculation of BSD. The bootstrap value for
the only shared internal branch is calculated according to Eq. 4
7 2
86 6 [98] 34 | 67125 71 | 52346 [75] 5 35
98 3 [86] 346 | 7125 715 | 2346 [35] 7 75
4 1
[37] 3467 | 125 7152 | 346 [32]
77 2 3 80
37 5
[77] 34671 | 25 71526 | 34 [80] 32
4
1 6
SD = 0.600 BSD = 0.536
Threshold SD BSD eBSD dBSD p: 2, q: 3, m: 5, a: 4.160, e: 1.780, d:2.380,
70 0.600 0.536 0.619 0.454 Ma: 0.832, Me: 0.890, Md: 0.793
Fig. 5 Example of the BSD algorithm using a bootstrap cutoff. The figure shows
the comparison of two phylogenetic trees that takes into account only those
branches with bootstrap support greater than 70. SD split distance, BSD boot-
split distance, eBSD BSD of equal splits, dBSD BSD of different splits, p number
of equal splits, q number of different splits, m total number of splits, a sum of
bootstraps in all splits, e sum of bootstraps in equal splits, d sum of bootstraps in
different splits, Ma mean bootstrap value, Me mean bootstrap value in equal
splits, Md mean bootstrap value in different splits
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 249
3.1.4 Testing the BSD The performance of the BSD method was compared with that of
Method the original SD method implemented in the TOPD/FMTS pro-
gram [12]. Figure 6 shows the correlation of SD and BSD for trees
with a number of species from 4 to 15 (a) and from 16 to
100 (b) from a recent large-scale analysis of the FOL [5]. The
three-way comparison of SD, BSD, and tree size (number of spe-
cies) shows a positive correlation between SD and BSD for all tree
sizes (R2 ¼ 0.8613 for trees with 4–16 species and R2 ¼ 0.7055 for
trees with 16–100 species) (Fig. 6c). However, the SD follows a
discrete distribution, which obviously is most conspicuous in the
comparisons of small trees (Fig. 6a), whereas, thanks to the use of
the bootstrap values, the BSD distribution is continuous (Fig. 7).
Figure 7 shows an example of the comparison (all-against-all)
of three trees with six species each that differ in one, two, and three
splits, resulting in SD values of 0.33, 0.66, and 1, respectively
(Fig. 7a). Also, each tree was compared to itself resulting in a SD
of 0. Then, bootstrap values were assigned randomly to the trees in
order to compare the trees using the BSD method, and this proce-
dure was repeated 1000 times. The resulting plot (Fig. 7b) shows
that, for the comparison of trees with SD of 0 and 1, the BSD values
ranged from 0 to 0.5 and from 0.5 to 1, respectively, and in
principle, could assume all intermediate values. In the case of the
comparisons that differed in one split (SD ¼ 0.33), the BSD value
was greater than 0.33 in 75% of the comparison, whereas for the
comparisons that differed in two splits (SD ¼ 0.67), 25% of the
BSD values were greater than 0.67. Thus, the BSD method for tree
comparison offers a better resolution than the SD method, espe-
cially, for trees with a small number of species.
Figure 8a shows the results of analysis of six simulated align-
ments with an increasing level of noise (divergence respect to the
initial alignment) in each alignment, i.e., from the alignment
0 (without noise and producing trees with bootstrap values of
100) to alignment 5 with the maximum level of noise. For each
alignment, a tree was constructed using the UPGMA method from
the web server DendroUPGMA (http://genomes.urv.cat/
250 Pere Puigbò et al.
a 1.0 b 1.0
0.8 0.8
0.6 0.6
BSD
BSD
0.4 0.4
0.2 0.2
0.0 0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
SD SD
c 1.0
0.8
SD
0.6 1.0
BSD
0.8
0.4
0.6
0.4
0.2
0.2
0.0 0.0
0 20 40 60 80 100
Tree size
Fig. 6 Correlation of BSD and SD from the all-against-all tree comparisons of 6901 phylogenetic trees. (a)
Trees containing 4–15 species. (b) Trees containing 16–100 species. (c) SD, BSD, and tree size for trees
containing between 16 and 100 species
a)
Tree 1 Tree 2
SD=0
A B A
A
AB|CDEF AB|CDEF
Tree 1
Tree 1
ABC|DEF ABC|DEF
B
C B
SD=0.33
ABCD|EF ABCD|EF
C1
D C1
E AB|CDEF AC|BDEF
Tree 1
Tree 2
F
E D E
ABC|DEF ACB|DEF
ABCD|EF ACBD|EF
Tree 3
Tree 2
A
ACE|BDF ACB|DEF
E
B
ACEB|DF ACBD|EF
SD=0
BC1
D AB|CDEF AC|EBDF
Tree 1
Tree 3
F
E
ABC|DEF ACE|BDF
Tree 3 ABCD|EF ACEB|DF
1
b)
0.8
BSD|SD=0
0.6 BSD|SD=0.33
Distance
BSD|SD=0.67
BSD|SD=1
0.4
SD=0
SD=0.33
0.2 SD=0.67
SD=1
0
0 250 500 750 1000
Repeon
Fig. 7 Comparisons of trees with six taxa. Bootstrap values were assigned randomly in each comparison
3.1.5 Analysis of To assess the significance of the tree comparison by the BSD
Random Trees and the method, we performed several tree comparisons using random
Significance of BSD Results trees containing between 4 and 100 species (Fig. 9). Each test is
an all-against-all comparison of 1000 random trees (for complete
results see Additional File 2). The results from random tree
252 Pere Puigbò et al.
J
a)
I
0 H
K
1 (3 changes)
100
A 111110000000000000000000000000
100 A 111110000000000000000000000000
G
100 L
B 111110000000000000000000000000 B 111110000000000000000000000000
C 000001111100000000000000000000 100 C 000001111100000000000000000000
D 000001111100000000000000000000 D 000001111100000000000000000000
E 000000000011111000000000000000 E 000000000011111000000000000000
F 000000000011111000000000000000 100 F 000111000011111000000000000000
G 000000000000000111110000000000 F 100 G 000000000000000111110000000000
H 000000000000000111110000000000 H 000000000000000111110000000000
I 000000000000000000001111100000
100 I 000000000000000000001111100000
J 000000000000000000001111100000 100 A J 000000000000000000001111100000
K 000000000000000000000000011111
E K 000000000000000000000000011111
L 000000000000000000000000011111 L 000000000000000000000000011111
B
D
C tree 0
F
5 (46 changes) E
G
A 111110011000000000000110000110
J H
B 111110000000000000100000000110 99
C 000001111100011000110100000000
D 000001111100000011000000000100 96
E 01001000011111000001100000000 77
F 000000010011111000000000000000 I 88
G 000011000000000111110000011000 33 56 98 A
H 000000000000110111110000000100 94
I 000000001000001100001111100000 83
J 001100000001100000111111100000
K 001100011000000000000000111111
B
D
L 000000000110000000000010011111
C K
L
tree 5
b)
1
Normalized distance
0.75
SD
0.5
ND
0.25 BSD
0
0 1 2 3 4 5
Trees
Fig. 8 Comparison of six trees constructed from alignments with increasing noise levels. (a) Comparison of
trees from six simulated alignments. The UPGMA tree from each alignment was reconstructed with the web
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 253
BSD SD
1.00
0.75
Distance
0.50
0.25
0.00
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
Tree size
Fig. 9 Random BSD and SD depending on the tree size. Results of the tree comparison of random trees (with
different sizes ranging from 4 to 100 species) show that the BSD and SD increase up to 0.75 and 0.999,
respectively
Fig. 8 (continued) server DendroUPGMA (http://genomes.urv.cat/UPGMA) using the Jaccard coefficient as the
measure of distance and generating 100 bootstraps replicates. Alignment 0 corresponds to the initial
alignment without noise that perfectly separates all branches, resulting in a tree with bootstrap values of
100 for all internal nodes. Alignments 1 to 5 correspond to the derivatives of the initial alignment with
increasing noise levels at each step. (b) Results of the comparison of each tree [1 to 5] with the initial tree (0).
The trees were compared using three methods: split distance (SD), nodal distance (ND), and boot-split
distance (BSD). For the purpose of comparison, the results obtained with each of the three methods were
normalized to the maximum value in each case
254 Pere Puigbò et al.
opposite, namely, that the two trees are no more similar than two
random trees of 15 species.
Another and probably the most important problem of the
comparison of phylogenetic trees is how to interpret the results
from a biological perspective. To address this issue, we generated
random trees containing from 4 to 100 species and performed 1 to
100 permutations (swap of a pair of branches) in each tree. The
resulting tree was then compared with the source tree (Fig. 10a, b).
The results show the number of permutations required to obtain a
particular BSD value for different tree sizes (number of species).
For instance, BSD ¼ 0.3 in the comparison of two trees with
a) 1.0
0.8
0.6
BSD
100
0.4 80
60
s
ion
0.2
at
40
ut
20 rm
Pe
0.0 0
0 20 40 60 80 100
Tree size
b) 0.9
0.6
BSD
0.3
0
0 20 40 60 80 100
Permutaons
Fig. 10 The number of permutations and the BSD. (a) BSD depending on the
number of permutations and tree size. (b) Mean and standard deviation of the
BSD for up to 100 permutations for trees with 20 species
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 255
20 species indicates that the two trees are separated by one permu-
tation whereas BSD ¼ 0.6 indicates that the trees are separated by
approximately 9 permutations (for the complete listing of equiva-
lences between BSD, SD and the number of permutations, see
Additional File 3). Considering that each permutation corresponds
to an HGT event, the BSD may be construed as the measure of the
extent of HGT contributing to the topological difference between
the compared trees. Given the discrete distribution of SD values,
this measure cannot be used to infer the number of permutations
with the same precision as BSD.
3.2 Analysis of A key characteristic of the FOL is the degree of the topological (in)
Topological Trends in consistency between the constituent trees. To quantify this trend,
a Set of Phylogenetic we introduced the inconsistency score (IS), which is the fraction of
Trees the times that the splits from a given tree are found in all N trees
that comprise the FOL. Thus, the IS may be naturally taken as a
3.2.1 Calculation of the measure of how representative of the entire FOL is the topology of
Tree Inconsistency the given tree. The IS is calculated using Eqs. 5–7, where N is the
total number of trees, X is the number of splits in the given tree,
and Y is the number of times the splits from the given tree are found
in all trees of the FOL.
1
ISmin
IS ¼ Y ð5Þ
ISmax
1
ISmin ¼ ð6Þ
X N
1
ISmax ¼ ISmin ð7Þ
X
In addition to the calculation of a single value of IS for a given
tree by comparing its topology to the topologies of rest of trees in
the FOL, IS can be calculated along the depth of the trees, namely,
split depth and phylogenetic depth. The split depth was calculated
for each unrooted tree according to the number of splits from the
tips to the center of the tree. The value of split depth ranged from
1 to 49 ([100 species/2] 1). The phylogenetic depth was
obtained from the branch lengths of a rescaled ultrametric tree,
rooted between archaeal and bacterial species, and ranged from 0 to
1. The topology of the ultrametric tree was obtained from the
supertree of the 102 NUTs using the CLANN program
[67]. The branch lengths from each of the 6901 trees were used
to calculate the average distance between each pair of species. The
obtained matrix was used to calculate the branch lengths of the
supertree of the NUTs. This supertree with branch lengths was
then used to construct an ultrametric tree using the program
KITSCH from the Phylip package [68] and rescaled to the depth
range from 0 to 1. The resulting ultrametric tree was used for the
analysis of the dependence of tree inconsistency on phylogenetic
depth.
256 Pere Puigbò et al.
b) Species
100
Quartets (Q)
abcd, abce, abcf, abcg, …
100
C4 = 3, 921, 225
QT vs. FOL
8.12·1010 comparisons
Fig. 11 Quartets and quartet topologies. (a) Each quartet (qi) is defined by a set of
four species (different colors denote species) and may be represented by three
possible unrooted tree topologies (qiti). (b) Quartet topologies (QT). In
100 species, the total number of quartets (Q) is 3,921,225. Each quartet may
be represented by three distinct QTs, resulting in a total of 11,763,735 QTs. Each
QT was mapped onto the FOL, i.e., for each QT, it was determined which of the
three topologies is represented in each phylogenetic tree in the FOL
(8.12 1010 comparisons). Modified from ref. 61
3.3.2 Distance Matrices Using the quartet support values for each quartet, a 100 100
and Heat Maps between-species distance matrix was calculated as dij ¼ 1 Sij/Qij
where dij is the distance between two species, Sij is the number of
trees containing quartets in which the two species are neighbors,
and Qij is the total number of quartets containing the given two
species. Then, this distance matrix was used to construct different
heat maps using the matrix2png web server ([73], Fig. 12b). In
contrast to the BSD method, which is best suited for the analysis of
the evolution of individual genes, the distance matrices derived
from maps of quartets are used to analyze the evolution of species
and to disambiguate treelike evolutionary relationships and “high-
ways” (preferential routes) of HGT.
3.3.3 The Tree-Net Trend The quartet-based between-species distances were used to calculate
(TNT) the Tree-Net Trend (TNT) score. The TNT score is calculated by
rescaling each matrix of quartet distances to a 0–1 scale between the
258 Pere Puigbò et al.
a)
q1t1 X X X X X X X X 80%
Solved
q1 q1t2 X 10%
q1t3 X 10%
q2t1 X X X X 40%
Unsolved
q2 q2t2 X X X 30%
q3t3 X X 30%
qit1 X X X X
qn qit2 X X X X
qit3 X X
Sp. 1 2 … 100
Sp. i * 1 0
2 0
Sp. j * … 0
100 0
Fig. 12 Mapping quartets. (a) Mapping quartets onto a set of ten trees. (b) A
schematic of the procedure used to reconstruct a species matrix from the map of
quartets
4.1 Patterns in the The reconstruction of the evolutionary trends in the FOL is based
Phylogenetic Forest of on the idea that prokaryotes, effectively, share a common gene
Life pool. This gene pool consists of genes with widely different ranges
of phyletic spread, from universal to rare ones only present in a few
species [74]. Thus, genes, as the elements of this gene pool, have
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 259
Dr Ds
Dr = 0.67 Ds > Dr
Ds Dr
Ds < Dr Dr = 0.67
Fig. 13 The Tree-Net Trend (TNT). The figure shows a schematic of the TNT
calculation and the rescaling procedure. Modified from ref. 61
2000
N u m b e r o f t re e s
1000
0
0 20 40 60 80 100
Tree size
Fig. 14 The Forest of Life (FOL). The distribution of the trees in the FOL by the
number of species. Modified from ref. 5
75%
Percentage of NUTs
50%
25%
0%
I R D M M O U H K L E G F J
COG funcons
Fig. 15 Distribution of the gene functions among the NUTs. The functional
classification of genes was from the COG database [62]
numerous small trees, with about 2/3 of the trees including <20
species (Fig. 14).
4.2 The Nearly We define the nearly universal trees (NUTs) as trees for those
Universal Trees (NUTs) COGs that were represented in more than 90% of the included
prokaryotes. This definition yielded 102 NUTs. Not surprisingly,
the great majority of the NUTs are genes encoding proteins
involved in translation and the core aspects of transcription
(Fig. 15). Among the NUTs, only 14 corresponded to COGs
that consist of strict 1:1 orthologs (all of them ribosomal proteins),
whereas the rest of NUTs included paralogs in some organisms
(only the most conserved paralogs were used for tree construction
[5]). The 1:1 NUTs were similar to the rest of the NUTs in terms of
the connectivity in tree similarity (1-BSD) networks and their
positions in the single cluster of NUTs obtained using CMDS.
The 102 NUTs were compared to trees produced by analysis of
concatenations of universal proteins [49]. The results showed that
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 261
4.3 The Tree of Life We analyzed the matrix of all-against-all tree comparisons of the
(TOL) as a Central NUTs by embedding them into a 30-dimensional tree space using
Trend in the FOL the CMDS procedure [69, 70]. The gap statistics analysis [71]
reveals a lack of significant clustering among the NUTs in the tree
space. Thus, all the NUTs seem to belong to one unstructured
cloud of points scattered around a single centroid. This organiza-
tion of the tree space is best compatible with individual trees
randomly deviating from a single, dominant topology (which may
be denoted the TOL), apparently as a result of random HGT (but
in part possibly due to random errors in the tree-construction
procedure). Therefore, there is an unequivocal general trend
among the NUTs. Although the topologies of the NUTs were,
for the most part, not identical, so that the NUTs could be sepa-
rated by their degree of inconsistency (a proxy for the amount of
HGT), the overall high consistency level indicated that the NUTs
are scattered in the close vicinity of a consensus tree, with HGT
events distributed randomly [5].
Thus, the NUTs present a unique and strong signal of unity
that seems to reflect the TOL pattern of evolution. The inconsis-
tency score (IS) among the NUTs ranged from 1.4% to 4.3%,
whereas the mean IS value for an equivalent set (102) of randomly
generated trees with the same number of species was approximately
80%, indicating that the topologies of the NUTs are highly consis-
tent and nonrandom [5].
To further assess the potential contribution of phylogenetic
analysis artifacts to observed inconsistencies between the NUTs,
we analyzed these trees with different bootstrap support thresholds
(i.e., only splits supported by bootstrap values above the respective
threshold value were compared). Particularly low IS levels were
detected for splits with high bootstrap support, but the inconsis-
tency was never eliminated completely, suggesting that HGT is a
significant contributor to the observed inconsistency among the
NUTs (IS ranges from 0.3% to 2.1% and 0.3% to 1.8% for splits with
a bootstrap value higher than 70 and 90, respectively) [5].
Analysis of the supernetwork built from the 102 NUTs [5]
showed that the incongruence among these trees is mainly con-
centrated at the deepest levels, with a much greater congruence
at shallow phylogenetic depths. The major exception is the
262 Pere Puigbò et al.
4.4 The NUTs Using the BSD method, we compared the topologies of the NUTs
Topologies as the to those of the rest of the trees in the FOL. Notably, 2615 trees
Central Trend and (~38% of the FOL) showed a greater than 50% similarity (P-value
Detection Distinct <0.05) to at least one of the NUTs, being the mean similarity of the
Evolutionary Patterns trees to the NUTs approximately 50% (Fig. 16). For a set of
in the FOL 102 randomized trees of the same size as the NUTs, only about
10% of the trees in the FOL showed the same or greater similarity,
indicating that the NUTs were strongly and nonrandomly
connected to the rest of the FOL.
We then analyzed the structure of the FOL by embedding the
3789 COG trees into a 669-dimensional space using the CMDS
procedure [69, 70]. A CMDS clustering of the entire set of 6901
trees in the FOL was beyond the capacity of the R software package
used for this analysis; however, the set of COG trees included most
of the trees with a large number of species for which the topology
comparison is most informative. A gap statistics analysis [69, 70] of
K-means clustering of these trees in the tree space revealed distinct
clusters of trees in the forest. The FOL is optimally partitioned into
seven clusters of trees (the smallest number of clusters for which the
gap function did not significantly increase with the increase of the
number of clusters) (Fig. 17). Clusters 1, 4, 5, and 6 were enriched
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 263
Similarity
>80% >60% 40-60% <40% <20%
100%
75%
% of trees
50%
25%
0%
NUTs
Fig. 16 Topological similarity between the NUTs and the rest of the FOL.
Percentage of trees connected to the NUTs at a different percentage of similarity.
Modified from ref. 5
(6) 48.6%
** (2) 63.34%
*
(1) 42.43%
*
(3) 62.11%
**
(4) 56.21%
**
(5) 50.17% (7) 49.66%
** **
* p = 0.0014
** p < 0.000001
Fig. 17 Clusters and patterns in the FOL. The seven clusters identified in the FOL
using the CMDS method and the mean similarity values between the 102 NUTs
and all trees from each of the seven clusters are shown. Modified from ref. 5
4.5 The Tree and Net The TNT map of the NUTs was dominated by the treelike signal
Components of (green in Fig. 18a): the mean TNT score for the NUTs was 0.63
Prokaryote Evolution (Fig. 19b), so the evolution of the nearly universal genes of
Net Tree
0 1
Euryarchaeota
Archaea
Crenarchaeota (a)
Thaumarchaeota
Nanoarchaeota
Thermotogae
Aquificae
Cyanobacteria
Acnobacteria
Chloroflexi
Firmicutes
Fusobacteria
Deinococci
Bacteroidetes 0 1
Bacteria
Chlorobi
(b)
Spirochaetes
Chlamydiae
Verrucomicrobia
Lensphaerae
Planctomycetes
Acidobacteria
g-Proteobacteria
b-Proteobacteria
a-Proteobacteria
d-Proteobacteria
a-Proteobacteria
Fig. 18 The Tree-Net Trend (TNT) score heatmaps. (a) The 102 NUTs. (b) The FOL
without the NUTs (6799 trees). The TNT increases from red (low score, close to
random, an indication of netlike evolution) to green (high score, close to the
supertree topology, an indication of treelike evolution). The species are ordered
according to the topology of the supertree of the 102 NUTs. In (a), the major
groups of archaea and bacteria are denoted. Modified from ref. 61
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 265
a) 0 1
NET TREE
b) 0.63 NUTs
c) 0.39
FOL
Fig. 19 The Tree-Net Trend in the FOL and in the NUTs. (a) A hypothetical
equilibrium between the tree and net trends. (b) A schematic representation of
the tree tendency in the NUTs. (c) A schematic representation of the net
tendency in the FOL
5 Conclusions
6 Exercises
Acknowledgment
References
1. Huerta-Cepas J, Dopazo H, Dopazo J, Gabal- 17. Allen L, Steel M (2001) Subtree transfer opera-
don T (2007) The human phylome. Genome tions and their induced metrics on evolutionary
Biol 8:R109 trees. Ann Comb 5:1–15
2. Huerta-Cepas J, Bueno A, Dopazo J, Gabal- 18. Waterman MS, Steel M (1978) On the similar-
don T (2008) PhylomeDB: a database for ity of dendrograms. J Theor Biol 73:789–800
genome-wide collections of gene phylogenies. 19. Beiko RG, Hamilton N (2006) Phylogenetic
Nucleic Acids Res 36:D491–D496 identification of lateral genetic transfer events.
3. Frickey T, Lupas AN (2004) PhyloGenie: auto- BMC Evol Biol 6:15
mated phylome generation and analysis. 20. Hickey G, Dehne F, Rau-Chaplin A, Blouin C
Nucleic Acids Res 32:5231–5238 (2008) SPR distance computation for
4. Sicheritz-Ponten T, Andersson SG (2001) A unrooted trees. Evol Bioinformatics Online
phylogenomic approach to microbial evolu- 4:17–27
tion. Nucleic Acids Res 29:545–552 21. Bogdanowicz D, Giaro K (2017) Comparing
5. Puigbo P, Wolf YI, Koonin EV (2009) Search phylogenetic trees by matching nodes using the
for a Tree of Life in the thicket of the phyloge- transfer distance between partitions. J Comput
netic forest. J Biol 8:59 Biol 24:422–435
6. Felsenstein J (2004) Inferring phylogenies. 22. Kubicka E, Kubicki G, McMorris FR (1995)
Sinauer Associates, Sunderland, MA An algorithm to find agreement subtrees. J
7. Nei M, Kumar S (2001) Molecular evolution Classif 12:91–99
and phylogenetics. Oxford University Press, 23. Nye TM, Lio P, Gilks WR (2006) A novel
Oxford algorithm and web-based tool for comparing
8. Castresana J (2007) Topological variation in two alternative phylogenetic trees. Bioinfor-
single-gene phylogenetic trees. Genome Biol matics 22:117–119
8:216 24. de Vienne DM, Giraud T, Martin OC (2007) A
9. Soria-Carrasco V, Castresana J (2008) Estima- congruence index for testing topological simi-
tion of phylogenetic inconsistencies in the larity between trees. Bioinformatics
three domains of life. Mol Biol Evol 23:3119–3124
25:2319–2329 25. Cotton JA, Page RD (2002) Going nuclear:
10. Marcet-Houben M, Gabaldon T (2009) The gene family evolution and vertebrate phylog-
tree versus the forest: the fungal tree of life and eny reconciled. Proc Biol Sci 269:1555–1561
the topological diversity within the yeast phy- 26. Soria-Carrasco V, Talavera G, Igea J, Castre-
lome. PLoS One 4:e4357 sana J (2007) The K tree score: quantification
11. Robinson DF, Foulds LR (1981) Comparison of differences in the relative branch length and
of phylogenetic trees. Math Biosci 53:131–147 topology of phylogenetic trees. Bioinformatics
12. Puigbo P, Garcia-Vallve S, McInerney JO 23:2954–2956
(2007) TOPD/FMTS: a new software to com- 27. Marcet-Houben M, Gabaldon T (2011)
pare phylogenetic trees. Bioinformatics TreeKO: a duplication-aware algorithm for
23:1556–1558 the comparison of phylogenetic trees. Nucleic
13. Steel MA, Penny D (1993) Distribution of tree Acids Res 39:e66
comparison metrics - some new results. Syst 28. Lu B, Zhang L, Leong HW (2017) A program
Biol 42:126–141 to compute the soft Robinson-Foulds distance
14. Bluis J, Shin D-G (2003) Nodal distance algo- between phylogenetic networks. BMC Geno-
rithm: calculating a phylogenetic tree compari- mics 18:111
son metric. In: Proceedings of the third IEEE 29. Koonin EV, Wolf YI, Puigbo P (2009) The
symposium on bioInformatics and bioEngi- phylogenetic forest and the quest for the elu-
neering, IEEE Computer Society, pp 87–94 sive tree of life. Cold Spring Harb Symp Quant
15. Cardona G, Llabres M, Rossello F, Valiente G Biol 74:205–213
(2009) Nodal distances for rooted phyloge- 30. Zuckerkandl E, Pauling L (1962) Molecular
netic trees. J Math Biol 61(2):253–276 evolution. In: Kasha M, Pullman B (eds) Hor-
16. Estabrook GF, McMorris FR, Meachan A izons in biochemistry. Academic, New York, pp
(1985) Comparison of undirected phyloge- 189–225
netic trees based on subtree of four evolution- 31. Woese CR (1987) Bacterial evolution. Micro-
ary units. Syst Zool 34:193–200 biol Rev 51:221–271
268 Pere Puigbò et al.
32. Bapteste E, O’Malley MA, Beiko RG, 48. Galtier N, Daubin V (2008) Dealing with
Ereshefsky M, Gogarten JP, Franklin-Hall L incongruence in phylogenomic analyses. Philos
et al (2009) Prokaryotic evolution and the Trans R Soc Lond B Biol Sci 363:4023–4029
tree of life are two different things. Biol Direct 49. Ciccarelli FD, Doerks T, von Mering C, Cree-
4:34 vey CJ, Snel B, Bork P (2006) Toward auto-
33. Doolittle WF (2000) Uprooting the tree of life. matic reconstruction of a highly resolved tree
Sci Am 282:90–95 of life. Science 311:1283–1287
34. Doolittle WF, Bapteste E (2007) Pattern plu- 50. Choi IG, Kim SH (2007) Global extent of
ralism and the Tree of Life hypothesis. Proc horizontal gene transfer. Proc Natl Acad Sci U
Natl Acad Sci U S A 104:2043–2049 S A 104:4489–4494
35. Kurland CG, Canback B, Berg OG (2003) 51. Dagan T, Martin W (2009) Getting a better
Horizontal gene transfer: a critical view. Proc picture of microbial evolution en route to a
Natl Acad Sci U S A 100:9658–9662 network of genomes. Philos Trans R Soc
36. Kurland CG (2005) What tangled web: bar- Lond B Biol Sci 364:2187–2196
riers to rampant horizontal gene transfer. 52. Boucher Y, Douady CJ, Papke RT, Walsh DA,
BioEssays 27:741–747 Boudreau ME, Nesbo CL et al (2003) Lateral
37. Logsdon JM, Faguy DM (1999) Thermotoga gene transfer and the origins of prokaryotic
heats up lateral gene transfer. Curr Biol 9: groups. Annu Rev Genet 37:283–328
R747–R751 53. Bucknam J, Boucher Y, Bapteste E (2006)
38. Genereux DP, Logsdon JM Jr (2003) Much Refuting phylogenetic relationships. Biol
ado about bacteria-to-vertebrate lateral gene Direct 1:26
transfer. Trends Genet 19:191–195 54. Schliep K, Lopez P, Lapointe FJ, Bapteste E
39. Kunin V, Goldovsky L, Darzentas N, Ouzounis (2011) Harvesting evolutionary signals in a
CA (2005) The net of life: reconstructing the forest of prokaryotic gene trees. Mol Biol
microbial phylogenetic network. Genome Res Evol 28:1393–1405
15:954–959 55. Beiko RG, Doolittle WF, Charlebois RL
40. Daubin V, Moran NA, Ochman H (2003) Phy- (2008) The impact of reticulate evolution on
logenetics and the cohesion of bacterial gen- genome phylogeny. Syst Biol 57:844–856
omes. Science 301:829–832 56. Doolittle WF, Zhaxybayeva O (2009) On the
41. Lerat E, Daubin V, Moran NA (2003) From origin of prokaryotic species. Genome Res
gene trees to organismal phylogeny in prokar- 19:744–756
yotes: the case of the gamma-proteobacteria. 57. Gogarten JP, Townsend JP (2005) Horizontal
PLoS Biol 1:E19 gene transfer, genome innovation and evolu-
42. Woese CR, Olsen GJ, Ibba M, Soll D (2000) tion. Nat Rev Microbiol 3:679–687
Aminoacyl-tRNA synthetases, the genetic 58. Gogarten JP, Doolittle WF, Lawrence JG
code, and the evolutionary process. Microbiol (2002) Prokaryotic evolution in light of gene
Mol Biol Rev 64:202–236 transfer. Mol Biol Evol 19:2226–2238
43. Fitz-Gibbon ST, House CH (1999) Whole 59. O’Malley MA, Koonin EV (2011) How stands
genome-based phylogenetic analysis of free- the Tree of Life a century and a half after The
living microorganisms. Nucleic Acids Res Origin? Biol Direct 6:32
27:4218–4222 60. Puigbo P, Wolf YI, Koonin EV (2013) Seeing
44. Hanage WP, Fraser C, Spratt BG (2006) the Tree of Life behind the phylogenetic forest.
Sequences, sequence clusters and bacterial spe- BMC Biol 11:46
cies. Philos Trans R Soc Lond B Biol Sci 61. Puigbo P, Wolf YI, Koonin EV (2010) The tree
361:1917–1927 and net components of prokaryote evolution.
45. Eisen JA, Fraser CM (2003) Phylogenomics: Genome Biol Evol 2:745–756
intersection of evolution and genomics. Sci- 62. Tatusov RL, Fedorova ND, Jackson JD, Jacobs
ence 300:1706–1707 AR, Kiryutin B, Koonin EV et al (2003) The
46. Salzberg SL, White O, Peterson J, Eisen JA COG database: an updated version includes
(2001) Microbial genes in the human genome: eukaryotes. BMC Bioinformatics 4:41
lateral transfer or gene loss? Science 63. Jensen LJ, Julien P, Kuhn M, von Mering C,
292:1903–1906 Muller J, Doerks T et al (2008) eggNOG:
47. Galtier N (2007) A model of horizontal gene automated construction and annotation of
transfer and the bacterial phylogeny problem. orthologous groups of genes. Nucleic Acids
Syst Biol 56:633–642 Res 36:D250–D254
Genome-Wide Comparative Analysis of Phylogenetic Trees. . . 269
64. Edgar RC (2004) MUSCLE: multiple 75. Ge F, Wang LS, Kim J (2005) The cobweb of
sequence alignment with high accuracy and life revealed by genome-scale estimates of hori-
high throughput. Nucleic Acids Res zontal gene transfer. PLoS Biol 3:e316
32:1792–1797 76. Brochier C, Bapteste E, Moreira D, Philippe H
65. Castresana J (2000) Selection of conserved (2002) Eubacterial phylogeny based on trans-
blocks from multiple alignments for their use lational apparatus proteins. Trends Genet
in phylogenetic analysis. Mol Biol Evol 18:1–5
17:540–552 77. Wolf YI, Rogozin IB, Grishin NV, Koonin EV
66. Keane TM, Naughton TJ, McInerney JO (2002) Genome trees and the tree of life.
(2007) MultiPhyl: a high-throughput phyloge- Trends Genet 18:472–479
nomics webserver using distributed comput- 78. Wolf YI, Rogozin IB, Grishin NV, Tatusov RL,
ing. Nucleic Acids Res 35:W33–W37 Koonin EV (2001) Genome trees constructed
67. Creevey CJ, McInerney JO (2005) Clann: using five different approaches suggest new
investigating phylogenetic information major bacterial clades. BMC Evol Biol 1:8
through supertree analyses. Bioinformatics 79. Creevey CJ, Fitzpatrick DA, Philip GK, Kin-
21:390–392 sella RJ, O’Connell MJ, Pentony MM et al
68. Felsenstein J (1996) Inferring phylogenies (2004) Does a tree-like phylogeny only exist
from protein sequences by parsimony, distance, at the tips in the prokaryotes? Proc Biol Sci
and likelihood methods. Methods Enzymol 271:2551–2558
266:418–427 80. Brochier-Armanet C, Boussau B, Gribaldo S,
69. Torgerson WS (1958) Theory and methods of Forterre P (2008) Mesophilic Crenarchaeota:
scaling. Wiley, New York proposal for a third archaeal phylum, the Thau-
70. Gower JC (1966) Some distance properties of marchaeota. Nat Rev Microbiol 6:245–252
latent root and vector methods used in multi- 81. Elkins JG, Podar M, Graham DE, Makarova
variate analysis. Biometrika 53:325–328 KS, Wolf Y, Randau L et al (2008) A korarch-
71. Tibshirani R, Walther G, Hastie T (2001) Esti- aeal genome reveals new insights into the evo-
mating the number of clusters in a data set via lution of the Archaea. Proc Natl Acad Sci U S A
the gap statistic. J R Stat Soc B Stat Methodol 105:8102–8107
63:411–423 82. Wolf YI, Aravind L, Grishin NV, Koonin EV
72. Hillis DM, Heath TA, St John K (2005) Anal- (1999) Evolution of aminoacyl-tRNA synthe-
ysis and visualization of tree space. Syst Biol tases--analysis of unique domain architectures
54:471–482 and phylogenetic trees reveals a complex his-
73. Pavlidis P, Noble WS (2003) Matrix2png: a tory of horizontal gene transfer events.
utility for visualizing matrix data. Bioinformat- Genome Res 9:689–710
ics 19:295–296 83. Koonin EV (2003) Comparative genomics,
74. Koonin EV, Wolf YI (2008) Genomics of bac- minimal gene-sets and the last universal com-
teria and archaea: the emerging dynamic view mon ancestor. Nat Rev Microbiol 1:127–136
of the prokaryotic world. Nucleic Acids Res
36:6688–6719
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 9
Abstract
In the post genomic era, large and complex molecular datasets from genome and metagenome sequencing
projects expand the limits of what is possible for bioinformatic analyses. Network-based methods are
increasingly used to complement phylogenetic analysis in studies in molecular evolution, including com-
parative genomics, classification, and ecological studies. Using network methods, the vertical and horizon-
tal relationships between all genes or genomes, whether they are from cellular chromosomes or mobile
genetic elements, can be explored in a single expandable graph. In recent years, development of new
methods for the construction and analysis of networks has helped to broaden the availability of these
approaches from programmers to a diversity of users. This chapter introduces the different kinds of
networks based on sequence similarity that are already available to tackle a wide range of biological
questions, including sequence similarity networks, gene-sharing networks and bipartite graphs, and a
guide for their construction and analyses.
Key words Sequence similarity network, Evolution, Lateral gene transfer (LGT), Metagenomics,
Gene remodeling, Ecology
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_9, © The Author(s) 2019
271
272 Andrew K. Watson et al.
of network and the types of analyses that are possible using these
networks: sequence similarity networks, gene-sharing networks,
and multipartite graphs [23].
Fig. 1 Constructing a simple sequence similarity network. A set of sequences (protein or DNA) in fasta format
(a) are aligned in pairs using alignment tools (such as BLAST). These alignments (b) are scored with metrics
such as the percentage identity between two sequences (the number of identical nucleotides/amino acids
displayed above) or the E-value of the alignment. In the resulting network (c), sequences are represented as
nodes. Two sequence nodes are joined with an edge if they can be aligned above a define threshold, with the
weight of the edge often based on percentage identity or E-value
274 Andrew K. Watson et al.
(continued)
The Methodology Behind Network Thinking. . . 275
Box 1: (continued)
2. Similarity search: To generate a sequence similarity network,
all sequences must be aligned against one another in an all-
versus-all search, in which the dataset of sequences is
searched against a database including the same sequences.
For gene networks, the alignment is usually done with a fast
pairwise aligner such as BLAST [58, 59] as implemented in
EGN [52]. Filters are often used to remove low-complexity
sequences from the search, as these can cause artefactual hits
(BLAST options --seg yes, -soft-masking true). The BLAST
method of alignment will be the focus of future discussion in
this chapter; however, alternatives are available including
BLAT [60] (also implemented in EGN), SWORD [61],
USEARCH [62], and DIAMOND [63]. These alternatives
generally include an option to produce a “BLAST” style
tabulated output, making them compatible with programs
commonly used in network analyses.
Within alignment tools like BLAST, it is possible to
assign thresholds, such as the maximum E-value of the
alignment. It is not recommended to set minimal thresh-
olds for some parameters (such as % sequence identity)
unless required due to memory constraints so that you
can generate networks from a single sequence alignment
with different thresholds for comparison (e.g., compari-
son of a 30% similarity threshold to a 90% threshold,
where edges will only be drawn between highly similar
genes).
Note: It may be intuitive to use additional CPUs to
speed up the alignment process; however, in BLAST it can
be more efficient to split the query file and launch multi-
ple searches on separate cores instead of using the BLAST
multithreading option. The pairwise alignment step is
generally the most time-limiting part of generating a
SSN, so benchmarking should be used to establish the
optimal settings for the pairwise and/or determine the
feasibility of a project given the size of the dataset and the
available computational resources.
3. Filtering similarity search results: In an all-versus-all similar-
ity search, any given query sequence will have a self-hit in the
corresponding database. For example, with sequences A
and B, a self-hit is query sequence A matching to sequence
A in the database, cases of which must be removed prior to
network construction (Fig. 2). When query sequence A in a
similarity search is aligned with sequence B in the database,
often the reciprocal result is also identified (an alignment
between query sequence B and sequence A in the database).
These are called reciprocal hits; while the sequences involved
(continued)
276 Andrew K. Watson et al.
Fig. 2 Filtering sequence similarity results for network construction. In the output of an all-against-all
sequence similarity search, there are a number of features that are often filtered out prior to network
construction. Self-hits (1/ and 2/), where like sequences are paired in a sequence alignment, are not
informative to network construction and are removed (highlighted by the red box surrounding the alignments).
In cases where there are reciprocal hits (3/ and 4/) between two sequences, then only the alignment with the
highest E-value is retained (highlighted with a green box around the retained alignment) to ensure only one
edge representing the best possible alignment connects any two nodes in the network. The same is true for
cases where a sequence has multiple hits against another sequence, such as when it aligns to another
sequence in multiple positions (5/ and 6/)
Box 1: (continued)
are identical, the alignments and scores are not. Retaining
both hits would generate two different edges between the
same two nodes in a SSN, so generally only the best results
from reciprocal hits are retained, based on a score such as the
E-value (Fig. 2). Finally, a single query sequence may be
significantly aligned multiple times in different positions of
the same sequence in the database; however, for SSN con-
struction only the best BLAST hit is generally retained
(Fig. 2). The selection of the best BLAST hit is again gener-
ally often based on the E-value. Removing multiple hits
against the same sequence allows the generation of an undi-
rected network where a single edge connects two nodes,
representing the best possible alignment between these
nodes.
4. Thresholding and network construction: Constructing a SSN
from a BLAST output is conceptually simple; an edge is
created between two sequences (nodes) that have been
aligned in the sequence similarity search. It is common to
apply thresholding criteria such as minimal % ID and/or
coverage and/or maximal E-value to determine whether an
edge is drawn between two sequences in the network
(Fig. 1). There are different ways to calculate the % coverage
of an alignment. This could be based on the coverage of a
single sequence in the alignment, selecting either the query
or the database sequence in each alignment or the longest or
shortest sequence in each alignment. Alternatively both
(mutual coverage) can be used, retaining an alignment
(continued)
The Methodology Behind Network Thinking. . . 277
Box 1: (continued)
when both values are above a given threshold. Edges above
the thresholding criteria can be assigned a weight based on
these criteria, producing a weighted sequence similarity net-
work that retains information of the properties of the align-
ment between two sequences (Fig. 1). It is often useful to
construct and compare several SSNs with variable stringen-
cies defining the edges between sequences, for example, to
optimize gene family detection within the SSN (discussed
below).
2.2 Exploiting A gene family is usually defined as a group of sequences that are
Sequence Similarity similar at the sequence level, indicative of homology and potentially
Networks for of shared functions; however, there is no uniform way to define this
Identification of Gene similarity [70, 71]. One of the early contributions of SSNs in
Families molecular sequence analysis was the construction of the COG
database of homologous protein sequences [39, 40]. This study
attempted to define gene families based on similarity at the
sequence level using the results of sequence similarity searches.
Within the results of an all-versus-all BLAST search, groups of at
least three proteins encoded by different genomes that were more
similar to each other than they were to other proteins found in the
same genomes were defined as a likely orthologous gene family.
Orthologous gene families are group of genes in different genomes
that show sequence similarity, likely as a result of their shared
evolutionary history.
The idea of using graphs to identify gene families is now a core
part of many graph-based analyses. Members of a gene family
aggregate in a sub-network in a SSN. These sub-networks are called
connected components (CCs) at these defined thresholds, i.e.,
clusters of nodes connected by edges either directly or indirectly
(via intermediate nodes) (Fig. 3). The size (number of nodes and
edges in a CC) and density (the proportion of potential connec-
tions between all nodes in a CC that are actually connected by edges
in the graph) of CCs will depend on the thresholds used for
Fig. 3 Louvain community detection in a sequence similarity network. The network is assembled from the
results of an all-versus-all alignment, as previously described. Edges can be weighted by E-value, percentage
of identity, or bitscore. For the purpose of simplification, we consider strong or weak weights rather than
actual values. (a) A giant connected component at relaxed threshold. (b) Three connected components at a
more stringent threshold. (c) Three communities with Louvain clustering algorithm, taking into account edge
weights
The Methodology Behind Network Thinking. . . 279
Fig. 4 Giant connected component before and after community detection. (a) A single giant connected
component from a sequence similarity network. (b) The same giant connected component after application
of a community detection algorithm. Node colors correspond to the newly assigned communities
280 Andrew K. Watson et al.
2.3 Exploiting SSNs When discussing identification of gene families, we have focused on
to Identify Signatures networks where edges are drawn between protein sequences that
of “Tinkering” and show a high enough similarity across their entire length, defined by
Gene Fusion a high mutual coverage threshold (e.g., 80%). Sequence similarity
can also be partial, for example, following gene remodeling or
“tinkering” [85] producing new combinations of gene domains
via gene fusion and fission events, or through the de novo sequence
synthesis of gene extensions, adding to existing sequences. The
term “Rosetta Stone sequence” was coined to define the formation
of a new fusion protein in a species as the result of the fusion of two
proteins that are found separate in another species, with authors
originally predicting that these fusions could occur between pro-
teins that physically interact in a common structural complex
[86]. One of the earliest applications of sequence similarity searches
to identify fusion proteins was an attempt to predict pairs of pro-
teins that may physically interact in an organism based on whether
they could be identified as a single “composite” fusion protein in
another organism [44]. Beyond predicting protein-protein interac-
tions, this kind of gene remodeling and recycling of existing gene
parts has the potential to contribute to the expansion of functional
diversity in genomes, creating new and unique combinations of
domains and functions [51, 85, 87–91]. Similarity search-based
screens have been implemented to identify composite genes and
genome rearrangements in a range of prokaryotes [92–94], eukar-
yotes [87, 95–97], and viruses [98].
Early attempts to identify composite genes were based on the
output of sequence similarity searches, but without formalizing the
results of search methods into a graph structure. The first attempt
to formalize the problem of identifying “composite” genes in net-
works was the “Neighborhood Correlation” approach, aiming to
distinguish genuine multi-domain proteins sharing common ances-
try (homologues) from novel multi-domain proteins that share
domains due to insertions [99]. The later development of the
FusedTriplets and MosaicFinder tools attempted to unify existing
graph-based methods for detection of “composite” gene detection
[50]. FusedTriplets is a graph-based implementation of the tradi-
tional gene-centered method for composite gene identification,
originally introduced by Enright et al. [44], with additional cross-
282 Andrew K. Watson et al.
Fig. 5 Composite gene identification using “minimal clique separators.” (a) A multiple sequence alignment of
composite genes (yellow) with two components (blue and magenta). (b) The sequence similarity network
corresponding to the multiple sequence alignment. The composite genes (yellow) are a minimal clique
separator for the network. Their removal (shown in c) decomposes the network to the two separate
component families
(continued)
The Methodology Behind Network Thinking. . . 283
Box 2: (continued)
tendency of BLAST to produce overhanging alignments
[100]. The output includes a node, edge, and information
file including information on number of nodes, edges, and
family connectivity from family detection. Two outputs are
included for composite gene detection, a “composites” file
with detailed information on each predicted composite gene
in fasta format and a “compositesinfo” file, summarizing the
data. Similarly, two files provide detailed information on
composite gene families and a summary of composite gene
families.
3. Filtering results: By default, CompositeSearch outputs all
possible composite genes in “fast” mode or composite
gene families in the full mode. These are given alongside a
number of different metrics designed to help to filter families
for more confident predictions, including the gene family
size, number of composites directly predicted within the
gene family, the number of domains, the number of compo-
nent families, the number of singleton component families
(families including only one sequence), the connectivity of
the family, and a score based on the overlap between differ-
ent components mapped to the composite gene.
2.4 Exploiting SSNs Ecological studies increasingly involve the assembly, analysis, and
for Ecological Studies comparison of large metagenome datasets. In addition to identifi-
cation of functions and organisms associated with a particular
environment, these studies enable the investigation of important
hypotheses in microbial ecology at the level of organism or
284 Andrew K. Watson et al.
Fig. 6 Exploring distribution of annotations in sequence similarity networks. In this example, nodes within a
single connected component are assigned two colors, blue and yellow, corresponding to their having a
different categorical annotation (e.g., originating from a different environmental source). Using the example of
environmental source, genes in cluster A would all have the same environmental source (blue), indicating an
environment-specific cluster of genes. Genes in cluster B are found in two different environmental sources
(blue and yellow); however, nodes of the same type are preferentially linked to each other in the network than
to genes from different environmental sources. This would result in a positive assortativity coefficient
approaching 1 for environment and a low conductance score, suggesting a strong environmental community
structure. Genes in cluster C are also found in two different environmental sources; however, there is no clear
pattern for the distribution of genes with regard to environment. This network would have an assortativity
approaching 0 and a high conductance score
The Methodology Behind Network Thinking. . . 285
2.4.1 Assortativity as a Forster et al. used assortativity (among other network statistics,
Tool to Study Geographical including the previously discussed shortest path analysis) to explore
and Habitat Distributions of the geographical dispersion patterns of marine ciliates in a network
Microbes and Genes generated from ciliate SSU-rDNA sequences [25]. Sequences were
clustered into two different levels of gene family—CCs and Louvain
communities (LCs) as previously described. Sequences were
assigned categorical labels based on their geographical point of
origin (eight locations) or habitat of origin (three habitats), and
assortativity was calculated. If sequences, and thus species, are
broadly distributed across geographical categories, then assortativ-
ity of SSU-rDNA sequences labeled with these geographical cate-
gories would be low because similar sequences would be found in
different environments. Contrarily, if similar sequences tend to be
from the same geographical category, indicative of endemism, then
assortativity of sequence geographical origin will be high (Fig. 6).
The majority of CCs and LCs showed a positive assortativity for
geographical origin, higher than expected by chance, indicative of
geographical community structure as opposed to global dispersal of
ciliates. Similar approaches were used by Fondi et al. and applied to
a collection of environmental metagenome samples to test the
“everything is everywhere” hypothesis at the gene pool and func-
tional level. Gene pools were more strongly associated with a
particular ecological niche than with specific geographical location,
supporting the idea that microbial genes are found everywhere but
the environment selects for them [26].
2.4.2 Conductance in the Conductance is used to explore the clustering of pairs of different
Comparison of Lifestyles node categories in a connected component. In a study by Cheng
and Evolutionary Histories et al., the proteomes of 84 prokaryote genomes were categorized
into four broad redox groups based on their lifestyle, methanogens,
obligate anaerobes, facultative anaerobes, and obligate aerobes
[27]. For each CC in a pan-proteome sequence similarity network
including all 84 genomes, the conductance was calculated for pairs
of redox categories and compared to values obtained following
random relabelling of the components. The distributions of con-
ductance values for methanogens and for obligate anaerobes
groups indicated that the sequences in these groups have features
distinct from those in other groups, that anaerobes and aerobes
tend to be dissimilar, and that their sequences are more isolated
from one another in the SSN than expected by chance.
286 Andrew K. Watson et al.
2.5 SSNs in Remote Up to 99% of microbial species are not cultivable and thus have not
Homologue been studied in isolated culture. Analysis of high-throughput
Identification: sequencing and metagenomics datasets has shed light on these
Shedding Light on the uncultivable organisms, often referred to as the “microbial dark
Microbial Dark Matter matter” [109], and in some cases enabled the reconstruction of
draft genomes [110–114]. A considerable portion of most meta-
genome studies have predicted ORFs showing no detectable simi-
larity to any known proteins, termed metaORFans [115]. These
can represent 25–85% of the total ORFs identified in metagenomes
[22]. Identifying distant homologues of ORFans may help to pre-
dict their functions and begin to unravel the microbial dark matter.
Recent work by Lopez et al. in 2015 probed the microbial diversity
of metagenome datasets from a range of environments including
the human gut microbiome, identifying homologues of genes from
86 ancient gene families that are distributed across archaea and
bacteria. The majority of these gene families included environmen-
tal homologues that were highly divergent from any of their
cultured homologues, and many branched deeply with the phylo-
genetic tree of life, highlighting our limited understanding of
diverse elements of the microbial world and hinting at the existence
of yet unknown major divisions of life [24] (Fig. 7).
2.6 Exploiting SSNs Metagenomic and genomic data are providing scientists with a
to Analyze tantalizing amount of sequence data, casting the analysis of the
Classifications extent of biodiversity as a major research theme in biology
[116–120]. In theory, existing organismal and viral classifications
are invaluable tools to structure and analyze this biodiversity. How-
ever, the way taxonomical classifications are constructed raises
questions about their naturalness and their actual application
scope [38, 120–128], in particular regarding genetic diversity
surveys. There are three major reasons for this. First, organismal
and viral diversity is still largely undersampled, which means that
existing classifications are incomplete [119, 120]. Therefore, taxo-
nomically unassigned sequences cannot be readily used in class-
based genetic diversity surveys, since this dark matter remains
outside existing classes. Second, classifications are constructed
The Methodology Behind Network Thinking. . . 287
60%
Max % identity to homologues in databases ≥ 60%
Table 1
Schematic properties of two extreme kinds of taxonomic classes with respect to their genetic
diversity
A. B. C.
Member 1 Member 2
wi
3/3 X%-core
Out of 6 class members
PID PCS (4/6)
wj 2/3 Member 3 Member 4
D.
Fig. 8 Intra- and interclasses heterogeneity measurements in weighted similarity networks. Sequences are
represented by nodes. Each node is colored to represent the taxonomic class to which its host belongs. Nodes
with the same color belong to the same class. Edge weight is represented by edge size proportional to the
weight. Subgraphs correspond to clusters of sequences. Direct neighbors have a greater similarity than the
threshold set to allow such connections. PID, average edge weights (% identity) between two sequences from
genomes of the same class; CCC, average proportion of genetic conservation between sequences from the
same cluster and from the same taxonomic class; maxCore%, conductance; and %-exclusive components
correspond to the estimates used to assess genetic consistency of classes
more similar to one another than they are to sequences from other
classes (Fig. 8d, e). Such sequences could be used as classificatory
features to assign members to the class. In a SSN, this property
translates to a low ratio of interclass edges over intra-class edges and
is measured by conductance (Fig. 8d). Likewise, the proportion of
clusters comprised exclusively of sequences from one class, a diag-
nostic feature of the class, provides an estimate of the class genetic
distinctiveness. Genetically highly distinct classes have a high % of
such exclusive clusters. Based on these network measures, interclass
genetic heterogeneity can simply be diagnosed by contrasting esti-
mates of genetic consistency for all the above measures for each
class. There is interclass heterogeneity within a classification when
the mean PID, mean CCC, maxCore%, DRC, and % of exclusive
components differ between classes.
Such network analyses show that virus classifications face a
pragmatic issue: overall genetic distinctiveness allows relatively
safe assignments of viral sequences to existing classes; however,
genetic diversity of viral taxa of similar ranks differs among the
tested classifications. Therefore, virus classifications (especially
ICTV classification at the family level) should be used carefully to
avoid inaccurate estimates in metagenomic diversity surveys. Clas-
ses with broader genetic diversity will tend to be more easily
290 Andrew K. Watson et al.
3 Gene-Sharing Networks
a g2 b
G1
GF1 g1
3
g3 3
g4
G2 G3
g5 g6 3
GF2
c
g7 g8 G1
g9
2
GF3 g10 g11
G2 G3
g12
3.1 Classification of The possibility of summarizing gene sharing between sets of enti-
Entities Using Gene- ties with complex evolutionary histories means that gene-sharing
Sharing Networks networks can be useful for classifying organisms based on their gene
content. Lima-Mendez et al. analyzed bacteriophage genomes to
generate two different phage gene-sharing networks that reflect
their reticulate evolutionary history [37]. In the first gene-sharing
network, phage genomes (nodes) were connected by edges when
292 Andrew K. Watson et al.
Fig. 10 Functional gene-sharing network reflecting the chimeric nature of eukaryotes. These gene-sharing
networks describing how genes in different functional categories are shared between bacteria (green),
archaea (yellow), eukaryotes (gray), plasmids (purple), and viruses (red) from a published dataset [29]. In
both cases, a giant connected component is shown alongside examples of smaller connected components (a)
Gene-sharing network for COG category D: cell division control. In this network, sequences of eukaryote origin
(gray) cluster with bacterial sequences, reflecting their origin in the alphaproteobacterial endosymbiont that
would become the mitochondrion. (b) Gene-sharing network for COG category K: transcription machinery. In
this network, eukaryote sequence (gray) cluster with archaeal sequences, reflecting the origin of these genes
in the archaeal host for the eukaryotic endosymbiont
3.2 Exploring Routes Two network metrics, also useful in the analysis of gene networks,
of Gene Sharing in can be used to attempt to identify “hubs” of gene sharing in the
Gene-Sharing context of gene-sharing networks: node “degree” and “between-
Networks ness.” Both metrics aim to determine the centrality of a node in a
network. The degree of a node is simply the number of edges that it
is connected to. The betweenness of a node is the frequency at
The Methodology Behind Network Thinking. . . 293
which it is found in all the possible shortest paths between any two
nodes in the network. Halary et al. used gene-sharing networks
based on DNA sequence similarity to explore gene sharing between
prokaryotes and mobile genetic elements [30]. Plasmids were iden-
tified as hubs of gene sharing within this pool of genomes, suggest-
ing that they are key vectors for genetic exchange between cellular
genomes and a potential DNA reservoir shared by genomes. Phages
were more peripheral in the network and mostly linked prokaryotes
from the same lineage. Thus, gene-sharing networks provided
insights on the evolutionary processes that shape the gene content
of prokaryote genomes.
The importance of plasmids in genetic worlds was further high-
lighted by exploring plasmid gene-sharing networks without inclu-
sion of prokaryote genomes [14, 36]. Connecting 2343 plasmid
genomes based on shared gene content in a single graph demon-
strated that plasmids tended to cluster based on the phylogenetic
class of their corresponding host prokaryote rather than habitat but
that more mobile plasmids tended to be more “central” in the graph,
indicating that these were hubs of gene sharing. Specifically, routes of
gene sharing for gene families including antibiotic resistance markers
were identified between actinobacterial plasmids and gammaproteo-
bacterial plasmids, suggesting that Actinobacteria may act as a reser-
voir for antibiotic resistance genes for Gammaproteobacteria [14].
The finding that plasmids are hubs of gene sharing for prokary-
ote genomes was supported by analysis of gene sharing in a pro-
teobacterial phylogenomic network including 329 proteobacterial
genomes [32]. A phylogenomic network is a type of phylogenetic
network that has been constructed from fully sequenced genomes.
In this example the phylogenomic network is an alternative to a
gene-sharing network, in which genome nodes within a phylogeny
are linked by edges if they share genes [34]. This study identified
extensive evidence for lateral gene transfer among Proteobacteria,
with at least one LGT event inferred in 75% of all gene families. Of
these putative LGTs, more were related to plasmid-related genes
than phage-related genes, suggesting plasmid conjugation was a
more frequent source of gene transfer [32]. Directed graphs explor-
ing directionality of LGT events between 657 prokaryote genomes
allowed the polarization of 32,028 putative LGT events finding
that frequency of recent events correlates with genome sequence
similarity and most LGTs occurring between donor-recipient pairs
with <5% difference in GC content, suggesting that there are some
barriers to lateral gene transfer between prokaryotes but that these
are not insurmountable [31]. Later reconstruction of transduction
events linking phage donors and recipients in a phylogenomic
network demonstrated that LGT by transduction was generally
highest in similar genomes and between clusters of closely related
species but that this constraint was occasionally broken, resulting in
LGTs over long evolutionary distances [35].
294 Andrew K. Watson et al.
4 Bipartite Graphs
Fig. 11 A bipartite graph and its reduction to a quotient graph: (a) An example of a bipartite graph displaying
how five gene families are shared between three genomes. (b) A reduced form of the bipartite graph in which
gene families are combined to “twin” nodes if they share identical taxonomic distributions. A single
“articulation point” connects all three genomes
The Methodology Behind Network Thinking. . . 295
(continued)
The Methodology Behind Network Thinking. . . 297
Box 4: (continued)
l Gene family identification: AcCNET uses kClust [155] to
assemble gene families, a kmer-based method for rapid assem-
bly of clusters of homologous proteins from sequence data.
By default, MultiTwin identifies gene families using an all-
versus-all BLAST search, followed by identification of
connected components at a given threshold, as previously
discussed for gene family detection from SSNs. MultiTwin
can also be used in a modular way allowing for additional
customization, including the use of any custom gene family
input in the form of a “community file”: a tab-delimited file
linking every gene/protein ID to a community identifier,
with gene families defined using a clustering method of
choice.
l Edge weighting: In AcCNET the edge weight is proportional
to the inverse of the phylogenetic distance between proteins
in a cluster from a given genome to other proteins within the
same cluster. In MultiTwin, the default edge weight is based
on the number of genes present in a gene family from any
given genome.
l Graph compression: While both methods can be used to iden-
tify “twin” nodes, only MultiTwin generates a quotient graph
from these twin nodes and identifies articulation points.
AcCNET is available at: https://sourceforge.net/pro
jects/accnet
MultiTwin is available at: http://www.evol-net.fr/index.
php/en/downloads
4.1 Using Bipartite The simplest application of a bipartite graph is the summary of all
Graphs to Explore genes shared between genomes in a single parsable graph, and this
Patterns of Gene feature has been used to explore gene sharing in the dsDNA virome
Sharing Between [38], a range of Escherichia coli genomes to investigate the E. coli
Diverse Entities pangenome [150] and between a broad range of prokaryotes that
include newly discovered organisms [33]. In their analysis of pro-
karyote genomes, Jaffe et al. used the notion of “twins” to explore
patterns of gene sharing between prokaryotes, including Archaea
and the recently discovered ultrasmall “Candidate Phyla Radiation”
and TM6 bacteria with extremely unusual and reduced genomes.
The group found evidence for lateral gene transfer between ultra-
small bacteria and other prokaryotes, consistent with the sugges-
tion that the ultrasmall bacteria may be symbionts [33]. In their
exploration of the dsDNA virome, Iranzo et al. used graph module
detection, algorithms designed to identify groups of densely
connected nodes in a graph, to identify sets of densely connected
viral genes and genomes that included viruses with broad host
ranges, as well as 14 hallmark viral genes that account for most of
the gene sharing between all different viral modules [38].
298 Andrew K. Watson et al.
5 Conclusions
Fig. 12 A workflow highlighting some of the available routes for generation and analysis of SSNs, gene-
sharing networks, and bipartite graphs. This workflow highlights just some of the many tools and routes for
network construction and analysis
The Methodology Behind Network Thinking. . . 299
6 Exercises
The exercises use EGN [52] and require access to a local installation
of BLAST+ [58] and Perl. The fasta sequence file “example.faa”
provided with EGN includes a dataset of protein sequences from
Archaea, Bacteria, Eukaryota, and mobile genetic elements, avail-
able at http://www.evol-net.fr/index.php/fr/downloads:
1. Perform a manual all-versus-all BLAST using search for a given
protein sequence file from the unix terminal (requires local
installation of BLAST). The output can be filtered to generate
a network:
(a) Make the blast database using the “makeblastdb.”
l Command: “makeblastdb -dbtype prot -in example.faa
–out example”
(b) Performing the BLAST search using “blastp,” remember-
ing to output data in a tabular format for easy processing.
l Command: “blastp -query example.faa -db example
-evalue 1e-5 -seg yes -soft_masking true - max_target_-
seqs 5000 -outfmt “6 qseqid sseqid evalue pident bitscore
qstart qend qlen sstart send slen” -out protein.blastpout”
2. Generate a SSN using EGN from example.faa (requires local
installation of BLAST and download of EGN from http://
www.evol-net.fr/index.php/fr/downloads):
(a) Run EGN from the terminal using “perl egn.1.0.plus.pl”
from the programs home directory.
(b) Follow on-screen prompts sequentially to generate an
alignment, filter the output, and generate a gene network
with outputs compatible with both Cytoscape and Gephii.
3. Visualize SSN networks:
(a) In Cytoscape: Import files named “cc.*.txt” as a network
to visualize that set of connected components.
l To associate nodes with their annotations, import “cc*.
atr” as a table.
(b) In Gephi: Open “cc*.gxf” files to import individual
connected components from the network into Gephi.
Use the “layout” menu to explore different kinds of lay-
outs for the network.
Glossary
References
1. Timmis JN, Ayliffe MA, Huang CY, Martin 8. Kondo N, Nikoh N, Ijichi N et al (2002)
W (2004) Endosymbiotic gene transfer: Genome fragment of Wolbachia endosymbi-
organelle genomes forge eukaryotic chromo- ont transferred to X chromosome of host
somes. Nat Rev Genet 5:123–135. https:// insect. Proc Natl Acad Sci U S A
doi.org/10.1038/nrg1271 99:14280–14285. https://doi.org/10.
2. Embley TM, Martin W (2006) Eukaryotic 1073/pnas.222228199
evolution, changes and challenges. Nature 9. McInerney JO (2017) Horizontal gene trans-
440:623–630. https://doi.org/10.1038/ fer is less frequent in eukaryotes than prokar-
nature04546 yotes but can be important (retrospective on
3. Williams TA, Foster PG, Cox CJ, Embley TM DOI 10.1002/bies.201300095). BioEssays
(2013) An archaeal origin of eukaryotes sup- 39:1700002. https://doi.org/10.1002/
ports only two primary domains of life. bies.201700002
Nature 504:231–236. https://doi.org/10. 10. Gogarten JP, Doolittle WF, Lawrence JG
1038/nature12779 (2002) Prokaryotic evolution in light of
4. Alsmark C, Foster PG, Sicheritz-Ponten T gene transfer. Mol Biol Evol 19:2226–2238
et al (2013) Patterns of prokaryotic lateral 11. Dagan T, Martin W (2007) Ancestral genome
gene transfers affecting parasitic microbial sizes specify the minimum rate of lateral gene
eukaryotes. Genome Biol 14:R19. https:// transfer during prokaryote evolution. Proc
doi.org/10.1186/gb-2013-14-2-r19 Natl Acad Sci U S A 104:870–875. https://
5. Hirt RP, Alsmark C, Embley TM (2015) Lat- doi.org/10.1073/pnas.0606318104
eral gene transfers and the origins of the 12. Hooper SD, Mavromatis K, Kyrpides NC
eukaryote proteome: a view from microbial (2009) Microbial co-habitation and lateral
parasites. Curr Opin Microbiol 23:155–162. gene transfer: what transposases can tell
https://doi.org/10.1016/j.mib.2014.11. us. Genome Biol 10:R45. https://doi.org/
018 10.1186/gb-2009-10-4-r45
6. Nowack ECM, Price DC, Bhattacharya D 13. Nelson-Sathi S, Sousa FL, Roettger M et al
et al (2016) Gene transfers from diverse bac- (2014) Origins of major archaeal clades cor-
teria compensate for reductive genome evolu- respond to gene acquisitions from bacteria.
tion in the chromatophore of Paulinella Nature 517:77–80. https://doi.org/10.
chromatophora. Proc Natl Acad Sci U S A 1038/nature13805
113:12214–12219. https://doi.org/10. 14. Tamminen M, Virta M, Fani R, Fondi M
1073/pnas.1608016113 (2012) Large-scale analysis of plasmid rela-
7. McCoy JM, Mi S, Lee X et al (2000) Syncytin tionships through gene-sharing networks.
is a captive retroviral envelope protein Mol Biol Evol 29:1225–1240. https://doi.
involved in human placental morphogenesis. org/10.1093/molbev/msr292
Nature 403:785–789. https://doi.org/10. 15. Lapierre P, Gogarten JP (2009) Estimating
1038/35001608 the size of the bacterial pan-genome. Trends
The Methodology Behind Network Thinking. . . 303
37. Lima-Mendez G, Van Helden J, Toussaint A, 49. Bapteste E, Lopez P, Bouchard F et al (2012)
Leplae R (2008) Reticulate representation of Evolutionary analyses of non-genealogical
evolutionary and functional relationships bonds produced by introgressive descent.
between phage genomes. Mol Biol Evol Proc Natl Acad Sci U S A
25:762–777. https://doi.org/10.1093/ 109:18266–18272. https://doi.org/10.
molbev/msn023 1073/pnas.1206541109
38. Iranzo J, Krupovic M, Koonin EV (2016) The 50. Jachiet P-A, Pogorelcnik R, Berry A et al
double-stranded DNA virosphere as a modu- (2013) MosaicFinder: identification of fused
lar hierarchical network of gene sharing. MBio gene families in sequence similarity networks.
7:e00978–e00916. https://doi.org/10. Bioinformatics 29:837–844. https://doi.
1128/mBio.00978-16 org/10.1093/bioinformatics/btt049
39. Tatusov RL, Koonin EV, Lipman DJ (1997) A 51. Méheust R, Zelzion E, Bhattacharya D et al
genomic perspective on protein families. Sci- (2016) Protein networks identify novel sym-
ence 278:631–637 biogenetic genes resulting from plastid endo-
40. Tatusov RL, Galperin MY, Natale DA, Koo- symbiosis. Proc Natl Acad Sci U S A
nin EV (2000) The COG database: a tool for 113:3579–3584. https://doi.org/10.1073/
genome-scale analysis of protein functions pnas.1517551113
and evolution. Nucleic Acids Res 28:33–36. 52. Halary S, McInerney JO, Lopez P, Bapteste E
https://doi.org/10.1093/nar/28.1.33 (2013) EGN: a wizard for construction of
41. Huson DH, Scornavacca C (2011) A survey gene and genome similarity networks. BMC
of combinatorial methods for phylogenetic Evol Biol 13:146. https://doi.org/10.1186/
networks. Genome Biol Evol 3:23–35. 1471-2148-13-146
https://doi.org/10.1093/gbe/evq077 53. Martin AJM, Walsh I, Di Domenico T et al
42. Huson DH, Rupp R, Scornavacca C (2011) (2013) PANADA: protein association net-
Phylogenetic networks: concepts, algorithms work annotation, determination and analysis.
and applications. Cambridge University Press, PLoS One 8:e78383. https://doi.org/10.
New York, NY 1371/journal.pone.0078383
43. Nakhleh L (2011) Evolutionary phylogenetic 54. Shannon P, Markiel A, Ozier O et al (2003)
networks: models and issues. In: Problem Cytoscape: a software environment for
solving handbook in computational biology integrated models of biomolecular interaction
and bioinformatics. Springer, New York, pp networks. Genome Res 13:2498–2504.
125–158 https://doi.org/10.1101/gr.1239303
44. Enright AJ, Iliopoulos I, Kyrpides NC, 55. Bastian M, Heymann S, Jacomy M (2009)
Ouzounis CA (1999) Protein interaction Gephi: an Open source software for exploring
maps for complete genomes based on gene and manipulating networks. Third Int AAAI
fusion events. Nature 402:86–90. https:// Conf Weblogs Soc Media. pp 361–362.
doi.org/10.1038/47056 https://doi.org/10.1136/qshc.2004.
45. Pasternak G, Hochhaus A, Schultheis B, 010033
Hehlmann R (1998) Chronic myelogenous 56. Csárdi G, Nepusz T (2006) The igraph soft-
leukemia: molecular and cellular aspects. J ware package for complex network research.
Cancer Res Clin Oncol 124:643–660 InterJ Complex Syst 1695
46. Watanabe H, Otsuka J (1995) A comprehen- 57. Hagberg AA, Schult DA, Swart PJ (2008)
sive representation of extensive similarity link- Exploring network structure, dynamics, and
age between large numbers of proteins. function using NetworkX. In: Varoquaux G,
Bioinformatics 11:159–166. https://doi. Vaught T, Millman J (eds) Proc. 7th Python
org/10.1093/bioinformatics/11.2.159 Sci. Conf, Pasadena, CA, pp 11–15
47. Park J, Teichmann SA, Hubbard T, Chothia C 58. Camacho C, Coulouris G, Avagyan V et al
(1997) Intermediate sequences increase the (2009) BLAST+: architecture and applica-
detection of homology between sequences. J tions. BMC Bioinform 10:421. https://doi.
Mol Biol 273:349–354. https://doi.org/10. org/10.1186/1471-2105-10-421
1006/jmbi.1997.1288 59. Altschul SF, Gish W, Miller W et al (1990)
48. Bolten E, Schliep A, Schneckener S et al Basic local alignment search tool.pdf. J Mol
(2001) Clustering protein sequences--struc- Biol 215:403–410. https://doi.org/10.
ture prediction by transitive homology. Bioin- 1016/S0022-2836(05)80360-2
formatics 17:935–941. https://doi.org/10. 60. Kent WJ (2002) BLAT--the BLAST-like
1093/bioinformatics/17.10.935 alignment tool. Genome Res 12:656–664.
The Methodology Behind Network Thinking. . . 305
137. Kulp A, Kuehn MJ (2010) Biological func- 146. Borgatti SP, Everett MG (1997) Network
tions and biogenesis of secreted bacterial analysis of 2-mode data. Soc Netw
outer membrane vesicles. Annu Rev Micro- 19:243–269. https://doi.org/10.1016/
biol 64:163–184. https://doi.org/10. S0378-8733(96)00301-2
1146/annurev.micro.091208.073413 147. Goh K-I, Cusick ME, Valle D et al (2007)
138. McDaniel LD, Young E, Delaney J et al The human disease network. Proc Natl Acad
(2010) High frequency of horizontal gene Sci U S A 104:8685–8690. https://doi.org/
transfer in the oceans. Science 330:50. 10.1073/pnas.0701361104
https://doi.org/10.1126/science.1192243 148. Himmelstein DS, Baranzini SE, Rand V et al
139. Dubey GP, Ben-Yehuda S (2011) Intercellu- (2015) Heterogeneous network edge predic-
lar nanotubes mediate bacterial communica- tion: a data integration approach to prioritize
tion. Cell 144:590–600. https://doi.org/10. disease-associated genes. PLoS Comput Biol
1016/j.cell.2011.01.015 11:e1004259. https://doi.org/10.1371/
140. Desnues C, La Scola B, Yutin N et al (2012) journal.pcbi.1004259
Provirophages and transpovirons as the 149. Ahn Y-Y, Ahnert SE, Bagrow JP et al (2011)
diverse mobilome of giant viruses. Proc Natl Flavor network and the principles of food
Acad Sci U S A 109:18078–18083. https:// pairing. Sci Rep 1:196. https://doi.org/10.
doi.org/10.1073/pnas.1208835109 1038/srep00196
141. Kutschera VE, Bidon T, Hailer F et al (2014) 150. Lanza VF, Baquero F, de la Cruz F, Coque
Bears in a forest of gene trees: phylogenetic TM (2017) AcCNET (Accessory Genome
inference is complicated by incomplete line- Constellation Network): comparative geno-
age sorting and gene flow. Mol Biol Evol mics software for accessory genome analysis
31:2004–2017. https://doi.org/10.1093/ using bipartite networks. Bioinformatics
molbev/msu186 33:283–285. https://doi.org/10.1093/bio
142. Simmonds P (2014) Methods for virus classi- informatics/btw601
fication and the challenge of incorporating 151. Barber MJ (2007) Modularity and commu-
metagenomic sequence data. J Gen Virol. nity detection in bipartite networks. Phys Rev
https://doi.org/10.1099/jgv.0.000016 E 76:66102. https://doi.org/10.1103/
143. Iranzo J, Koonin EV, Prangishvili D, Krupo- PhysRevE.76.066102
vic M (2016) Bipartite network analysis of the 152. Beckett SJ (2016) Improved community
archaeal virosphere: evolutionary connections detection in weighted bipartite networks. R
between viruses and capsid-less mobile ele- Soc Open Sci 3:140536. https://doi.org/10.
ments. J Virol 90:11043–11055. https:// 1098/rsos.140536
doi.org/10.1128/JVI.01622-16 153. Diestel R (2010) Graph theory. Springer,
144. Lord E, Le Cam M, Bapteste É et al (2016) New York
BRIDES: a new fast algorithm and software 154. McInerney JO, Pisani D, Bapteste E, O’Con-
for characterizing evolving similarity net- nell MJ (2011) The public goods hypothesis for
works using breakthroughs, roadblocks, the evolution of life on Earth. Biol Direct 6:41.
impasses, detours, equals and shortcuts. https://doi.org/10.1186/1745-6150-6-41
PLoS One 11:e0161474. https://doi.org/ 155. Hauser M, Mayer CE, Söding J (2013)
10.1371/journal.pone.0161474 kClust: fast and sensitive clustering of large
145. van Dongen SM (2001) Graph clustering by protein sequence databases. BMC Bioinform
flow simulation. PhD thesis, University of 14:248. https://doi.org/10.1186/1471-
Utrecht 2105-14-248
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 10
Abstract
Bayesian methods for molecular clock dating of species divergences have been greatly developed during the
past decade. Advantages of the methods include the use of relaxed-clock models to describe evolutionary
rate variation in the branches of a phylogenetic tree and the use of flexible fossil calibration densities to
describe the uncertainty in node ages. The advent of next-generation sequencing technologies has led to a
flood of genome-scale datasets for organisms belonging to all domains in the tree of life. Thus, a new era has
begun where dating the tree of life using genome-scale data is now within reach. In this protocol, we explain
how to use the computer program MCMCTree to perform Bayesian inference of divergence times using
genome-scale datasets. We use a ten-species primate phylogeny, with a molecular alignment of over three
million base pairs, as an exemplar on how to carry out the analysis. We pay particular attention to how to set
up the analysis and the priors and how to diagnose the MCMC algorithm used to obtain the posterior
estimates of divergence times and evolutionary rates.
Key words Molecular clock, Bayesian analysis, MCMC, Fossil, Phylogeny, Primates, Genome
1 Introduction
The molecular clock hypothesis, which states that the rate of molec-
ular evolution is approximately constant with time, provides a
powerful way to estimate the times of divergence of species in a
phylogeny. Since its proposal over 50 years ago [1], the molecular
clock hypothesis has been used countless times to calibrate molec-
ular phylogenies to geological time, with the ultimate aim of dating
the tree of life [2, 3]. Several statistical inference methodologies
have been developed for molecular clock dating analyses; however,
during the past decade, the Bayesian method has emerged as the
method of choice [4, 5], and several Bayesian inference software
packages now exist to carry out this type of analysis [6–10].
In this protocol, we will explain how to use the computer
program MCMCTree to estimate times of species divergences
using genome-scale datasets within the Bayesian inference
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_10, © The Author(s) 2019
309
310 Mario dos Reis and Ziheng Yang
To run the protocol, you will need the MCMCTree and BASEML
programs, which are part of the PAML software package for phylo-
genetic analysis [22]. The source code and compiled versions of the
code are freely available from bit.ly/ziheng-paml. All the data files
necessary to run the protocol can be obtained from github.com/
mariodosreis/divtime. Please create a directory called divtime in
your computer and download all the data files from the GitHub
repository. This protocol was tested with PAML version 4.9e.
You are assumed to have basic knowledge of the command line
in Unix or Windows (also known as command prompt, shell, or
terminal). Simple tutorials for users of Windows, Mac OS, and
Linux are posted at bit.ly/ziheng-software. Install MCMCTree
and BASEML in your computer system, and make sure you have
Molecular Clock Dating 311
2.1 Tree and Fossil The phylogenetic tree of the ten species is shown in Fig. 1. The tree
Calibrations encompasses members of all the main primate lineages. The ten
species were chosen because they have had their complete genomes
sequenced. They are a subset of the 36 mammal species analyzed in
[23]. File data/10s.tree contains the tree with fossil calibrations
in Newick format, which is the format required by MCMCTree.
The eight fossil calibrations are shown in Table 1. The calibrations
are the same used to estimate primate divergence times in [24]. We
discuss fossil calibrations in detail in the “Sampling from the Prior”
section. The time unit in the analysis is 100 million years (My).
Thus, the calibration B(0.075, 0.10) means the node age is con-
strained to be between 7.5 and 10 million years ago (Ma).
2.2 Molecular The molecular data are an alignment of 5614 protein-coding genes
Sequence Data from the ten species. All ambiguous codon sites were removed, and
thus the alignment contains no missing data. The alignment was
separated into two partitions: A partition consisting of all the first
and second codon positions (2,253,316 nucleotides long) and a
partition of third codon positions (1,126,658 nucleotides long).
The alignment is a subset of the larger 36-mammal-species align-
ment in [23]. See also ref. 24. File 10s.phys in the data directory
contains the alignment. The alignment is compressed into site
patterns (a site pattern is a unique combination of character states
in an alignment column) to save disk space.
3 Tutorial
Threeshrew
Bushbaby
11 13 Mouse lemur
12 Tarsier
Marmoset
14
Rhesus
15
Orangutang
16
Gorilla
17
18 Chimp
19 Human
Fig. 1 The tree of ten species. Nodes with fossil calibrations are indicated with
black dots (see Table 1 for calibration densities). Internal nodes are numbered
from 11 to 19 according to the nomenclature used by MCMCTree
Table 1
List of fossil calibrations used in this tutorial
3.1 Overview We will use the approximate likelihood method to speed up the
computation of the likelihood on the large genome alignment. The
general strategy for the analysis is as follows:
1. Approximate likelihood calculation: First, we will calculate the
gradient (g) and Hessian (H) matrix of the branch lengths on
the unrooted tree. For this step, we will need to use the
MCMCTree and BASEML programs (BASEML will carry
out the actual computation of g and H). The substitution
model is chosen at this step.
2. MCMC sampling from the posterior: Once g and H have been
calculated and we have decided on our priors, we can use
MCMCTree to perform MCMC sampling from the posterior
distribution of times and rates. We will then look at the sum-
maries of the posterior (such as posterior mean times and rates
and 95% credibility intervals).
3. Convergence diagnostics: The MCMC algorithm is a stochastic
algorithm that visits regions of the parameter space in propor-
tion to the posterior distribution. Due to its very nature, it is
possible that sometimes the MCMC chain is terminated before
it has had a chance to explore the parameter space appropri-
ately. The way to guard against this is to run the analysis two or
more times and compare the summary statistics from the two
(or more) MCMC chains. If the results from different runs are
very similar, then convergence to the posterior distribution can
be reasonably assumed.
4. MCMC sampling from the prior: Finally, we will sample directly
from the prior of times and rates. This is particularly important
in Bayesian molecular clock dating because in most cases the
prior on times may look quite different from the fossil calibra-
tion densities specified by the user. Thus, sampling from the
prior allows the user to check the soundness of the prior
actually used.
Note that in this protocol we assume the user has chosen a
suitable sequence alignment and a phylogenetic tree to carry out
the analysis. For genome-scale alignments, it is important that the
genes chosen among the various species are orthologous and that
the alignment has been checked for accuracy. Several chapters in
this volume can guide the user in this purpose.
3.2 Calculation of the Go into the gH directory, and open the mcmctree-outBV.ctl file
Gradient and Hessian using your favorite text editor. This control file contains the set of
to Approximate the parameters necessary for MCMCTree to carry out the calculations
Likelihood of the gradient and Hessian needed for the approximate likelihood
method. Figure 2 shows the contents of the mcmctree-outBV.
ctl file.
Molecular Clock Dating 315
seqfile = ../data/10s.phys
treefile = ../data/10s.tree
ndata = 2
seqtype = 0 * 0: nucleotides; 1:codons; 2:AAs
usedata = 3 * 0: no data (prior); 1:exact likelihood;
* 2: approximate likelihood; 3:out.BV (in.BV)
clock = 2 * 1: global clock; 2: independent rates; 3: correlated rates
Fig. 2 The gH/mcmctree-outBV.ctl file, with appropriate options to set up calculation of the gradient
and Hessian matrix for the approximate likelihood method
10
Hessian
Fig. 3 The gH/out.BV file produced by BASEML. The first line has the number of species (10), the second
line has the tree topology with MLEs of branch lengths, and the MLEs of branch lengths are given again in the
third line. The fourth line contains the gradient, g, followed by the Hessian, H, for partition 1. This file will be
renamed in.BV and placed into the mcmc/ directory to carry out MCMC sampling using the approximate
likelihood method
$ mcmctree mcmctree-outBV.ctl
3.3 Calculation of the Now that we have calculated g and H, we can proceed to MCMC
Posterior of Times and sampling of the posterior distribution using the approximate likeli-
Rates hood method. Copy the gH/out.BV file into the mcmc directory,
and rename it as in.BV. Now go into the mcmc directory. There
3.3.1 Control File and you will find mcmctree.ctl, the necessary MCMCTree control
Priors file to carry out MCMC sampling from the posterior. Figure 4
shows the contents of the file. The first item, seed, is the seed for
the random number generator used by the MCMC algorithm.
Here it is set to 1, which tells MCMCTree to use the system’s
clock time as the seed. This is useful, as running the program
multiple times will generate different outputs.
Molecular Clock Dating 317
seed = -1
seqfile = ../data/10s.phys
treefile = ../data/10s.tree
mcmcfile = mcmc.txt
outfile = out.txt
ndata = 2
seqtype = 0 * 0: nucleotides; 1:codons; 2:AAs
usedata = 2 * 0: no data (prior); 1:exact likelihood;
* 2:approximate likelihood; 3:out.BV (in.BV)
clock = 2 * 1: global clock; 2: independent rates; 3: correlated rates
RootAge = '<1.0' * safe constraint on root age, used if no fossil for root.
Fig. 4 The mcmc/mcmctree.ctl file necessary to sample from the posterior distribution using the
approximate likelihood method
uniformly distributed between present time and the age of the root.
Options kappa_gamma and alpha_gamma set gamma priors for
the κ and α parameters in the substitution model. These have no
effect as we are using the likelihood approximation. Options rge-
ne_gamma and sigma2_gamma set the gamma-Dirichlet prior on
the mean substitution rate for partitions and for the rate variance
parameter, σ 2 [19]. The prior on the mean rate is Gamma(2, 40),
which has mean 0.05 substitutions per time 100 My. A symmetric
Dirichlet distribution with concentration parameter equal to 1 is
used to spread the rate prior across partitions (thus rgene_gamma
¼ 2 40 1). See ref. 19 for details. The prior on σ is Gamma(1, 10)
2
which has mean 0.1. A Dirichlet is also used to spread the prior
across partitions.
The final block of options, print, burnin, sampfreq, and
nsample, control the length and sampling frequency of the
MCMC. We will discard the first 20,000 iterations as the burn-in
and then print parameter values to the mcmc.txt file every
100 iterations, to a maximum of 20,000 + 1 samples. Thus, our
MCMC chain will run for a total of 20,000 + 20,000 100 ¼
2,020,000 iterations.
$ mcmctree mcmctree.ctl
This will run the analysis a second time. The results should be
slightly different to the previous run due to the stochastic nature of
the algorithm. Once the second run has finished, rename mcmc.
Molecular Clock Dating 319
0% 0.26 0.39 0.23 0.39 0.28 1.285 1.243 0.588 1.158 0.541 0.321 - 0.192 0.197 -16.9 0:02
(nsteps = 50)
Current Pjump: 0.26200 0.39475 0.23175 0.38650 0.28000 0.27550 0.39200 0.43750
0.40100 0.29725 0.33725 0.27525 0.32275 0.23475 0.23150 0.29875 0.31600 0.27800
0.25300 0.29975 0.29650 0.32575 0.27500 0.61150 0.29850 0.31225 0.35400 0.23200
0.30800 0.28250 0.33050 0.21325 0.22700 0.25900 0.26725 0.26900 0.33150 0.23725
0.31000 0.20700 0.24225 0.61625 0.30675 0.30150 0.32000 0.21975 0.27650 0.22500
0.36650 0.00000
Current finetune: 0.00365 0.00166 0.00586 0.00182 0.00503 0.00697 0.00486 0.00500
0.00835 0.24230 0.21346 0.71942 0.65595 0.01093 0.01230 0.01256 0.00960 0.01492
0.02008 0.02466 0.03547 0.03942 0.04624 0.17077 0.02425 0.04971 0.01513 0.03626
0.03661 0.04475 0.08082 0.00867 0.00949 0.01146 0.00861 0.01133 0.01263 0.02252
0.02728 0.03996 0.03790 0.14736 0.02025 0.04584 0.01209 0.02975 0.02776 0.03389
0.05173 0.00000
New finetune: 0.00313 0.00232 0.00438 0.00248 0.00465 0.00632 0.00675 0.00806
0.01194 0.23972 0.24532 0.65158 0.71499 0.00829 0.00918 0.01250 0.01020 0.01367
0.01654 0.02463 0.03499 0.04345 0.04183 0.47928 0.02411 0.05210 0.01846 0.02714
0.03776 0.04175 0.09064 0.00592 0.00694 0.00969 0.00755 0.01000 0.01422 0.01728
0.02835 0.02644 0.02976 0.42023 0.02079 0.04611 0.01305 0.02100 0.02527 0.02454
0.06589 0.00000
5% 0.34 0.30 0.31 0.32 0.28 1.163 0.981 0.622 0.893 0.464 0.295 - 0.129 0.154 -17.0 0:08
10% 0.35 0.30 0.31 0.32 0.27 1.189 0.943 0.607 0.859 0.457 0.293 - 0.128 0.153 -17.0 0:15
15% 0.36 0.30 0.30 0.31 0.27 1.156 0.920 0.604 0.837 0.457 0.290 - 0.133 0.160 -17.0 0:22
20% 0.35 0.30 0.30 0.32 0.26 1.126 0.908 0.600 0.825 0.453 0.290 - 0.137 0.165 -17.0 0:29
25% 0.36 0.30 0.30 0.31 0.26 1.139 0.912 0.605 0.829 0.458 0.293 - 0.138 0.165 -17.0 0:37
30% 0.36 0.30 0.30 0.31 0.26 1.153 0.918 0.609 0.834 0.460 0.293 - 0.136 0.163 -17.0 0:43
Fig. 5 Screenshot of MCMCTree’s output during MCMC sampling of the posterior. Different runs of the
program will give slightly different output values
Species tree for FigTree. Branch lengths = posterior mean times; 95% CIs = labels
(1_Tree_shrew, ((2_Bushbaby, 3_Mouse_lemur) 13 , (4_Tarsier, (5_Marmoset, (6_Rhesus, (7_Orangut
Fig. 6 The end of the mcmc/out.txt file produced by MCMCTree at the end of the MCMC sampling of the
posterior
between human and chimp (node 19, Fig. 1) is 8.52 Ma (95% CI,
7.58–9.81 Ma).
You will also notice that MCMCTree created a file called Fig-
Tree.tre. This contains the posterior tree in Nexus format, suit-
able for plotting in the program FigTree (tree.bio.ed.ac.uk/
software/figtree/). Figure 7 shows the posterior tree plotted in
FigTree, with the time unit set to 1 My.
Fig. 7 The dated primate phylogeny with error bars (representing 95% CIs of node ages), drawn with FigTree.
The time unit is 1 My
Fig. 8 Convergence diagnostic plots of the MCMC drawn with R (see R/analysis.R)
Fig. 9 Calculations of posterior mean, ESS, posterior variance, and standard error of the posterior mean in R
(see R/analysis.R)
Molecular Clock Dating 323
1.53 My (Fig. 9). That is, we have estimated the mean accurately to
within 2 1.53 My ¼ 3.06 My. To reduce the S.E. by half, you
need to increase the ESS four times. Note that independent
MCMC runs can be combined into a single run. Thus, you may
save time by running several MCMC chains in parallel for compu-
tationally expensive analyses, although care must be taken to ensure
each chain has run long enough to exit the burn-in phase and
explore the posterior appropriately.
Trace plots and histograms are useful to spot problems and
check convergence. Figure 8b, c shows trace plots for t_n19 and
t_n11, respectively. The trace of t_n19, which has high ESS, looks
like a “hairy caterpillar.” Compare it to the trace of t_n11, which
has low ESS. Visual inspection of a trace plot usually gives a sense of
whether the parameter has an adequate ESS without calculating
it. Note that both traces are trendless, that is, the traces oscillate
around a mean value (the posterior mean). If you see a persistent
trend in the trace (such as an increase or a decrease), that most likely
means the MCMC did not converge to the posterior and needs a
longer burn-in period.
Figure 8d shows the smoothed histograms (calculated using
density in R) for t_n11 for the two runs. Notice that the two
histograms are slightly different. As the ESS becomes larger, histo-
grams for different runs will converge in shape until becoming
indistinguishable. If you see large discrepancies between histo-
grams, that may indicate serious problems with the MCMC, such
as lack of convergence due to short burn-in or the MCMC getting
stuck in different modes of a multimodal posterior.
3.5 MCMC Sampling Note that fossil calibrations (such as those of Table 1) are repre-
from the Prior sented as statistical distributions of node ages. MCMCTree uses
these distributions to construct the prior on times. However, the
resulting time prior used by the program may be substantially
different from the original fossil calibrations, because the program
applies a truncation so that daughter nodes are younger than their
ancestors [14, 27]. Thus, it is advisable to calculate the time prior
explicitly by running the MCMC with no data so that it can be
examined and compared with the fossil calibrations and the
posterior.
Go to the prior directory and type
$ mcmctree mcmctree-pr.ctl
This will start the MCMC sampling from the prior. File
mcmctree-pr.ctl is identical to mcmc/mcmctree.ctl except
that option usedata has been set to 0. Sampling from the prior is
much quicker because the likelihood does not need to be calcu-
lated. It takes about 1 min on the Intel Core i7 for MCMCTree to
complete the analysis. Rename files mcmc.txt and out.txt to
324 Mario dos Reis and Ziheng Yang
Fig. 10 Prior (gray) and posterior (black) density plots of node ages plotted with R (see R/analysis.R)
4.1 Taxon Sampling, In this tutorial we used a small phylogeny to illustrate Bayesian time
Data Partitioning, and estimation using approximate likelihood calculation. In practical
Estimation of Tree data analysis, it may be desirable to analyze much larger phylogenies
Topology (see Sect. 5.5). In large phylogenies, there may be uncertainties in
the relationships of some groups. The approximate method dis-
cussed here can only be applied to a fixed (known) tree topology. If
the uncertainties in the tree are few so that just a handful of tree
topologies appear reasonable, the approximate method can be used
by analyzing each topology separately [23, 32]. This involves esti-
mating g and H for each topology and then running separate
MCMC chains on each topology to estimate the times. Several
methods to co-estimate divergence times and tree topology are
available [8, 9, 17, 18], although they do not implement the
approximate likelihood method and are thus unsuitable for the
analysis of genome-scale datasets.
326 Mario dos Reis and Ziheng Yang
4.2 Selection of Fossil calibrations are one of the most important pieces of informa-
Fossil Calibrations tion needed to perform divergence time estimation and thus should
be chosen after careful consideration of the fossil record, although
this may involve some subjectivity [29]. Parham et al. [30] discuss
best practice for construction of fossil calibrations. For example,
minimum bounds on node ages are normally set to be the age of the
oldest fossil member of the crown group. A small probability (say
2.5%) should be set for the probability that the node age violates
the minimum bound (e.g., to guard against misidentified or incor-
rectly dated fossils). Specifying maximum bounds is more difficult,
as absence of fossils for a given geological period is not evidence
that the clade in question was absent during the period [31]. Cur-
rent practice is to set the maximum bound to a reasonable value
according to the expertise of the paleontologist (see ref. 29 for
examples), although a large probability (say 10% or even 20%)
may be required to guard against badly specified maximum bounds.
Calibration densities based on statistical modeling of species diver-
sification, fossil preservation, and discovery are also possible
Molecular Clock Dating 327
4.3 Construction of The birth-death process with species sampling was used here to
the Time Prior construct the time prior for nodes in the phylogeny for which fossil
calibrations are not available. Varying the birth (μ), death (λ), and
sampling (ρ), parameters can result in substantially different time
priors. For example, using μ ¼ λ ¼ 1 and ρ ¼ 0 leads to a uniform
distribution prior on node ages. This diffuse prior appears appro-
priate for most analyses. Varying the values of μ, λ, and ρ is useful to
assess whether the time estimates are robust to the time prior.
Parameter configurations can be set up to generate time densities
that result in young node ages or in very old node ages (see p. 381 in
[20] for examples).
4.4 Selection of the In analysis of closely related species (such as the apes), the clock
Clock Model assumption appears to be appropriate for time estimation. A likeli-
hood ratio test can be used to determine whether the strict clock is
appropriate for a given dataset [40]. If the clock is rejected, then
Bayesian molecular clock dating should proceed using one of the
various relaxed-clock models available [7, 13]. In this case, Bayesian
model selection may be used to choose the most appropriate
relaxed-clock model [41], although the method is computationally
expensive and thus only applicable to small datasets. The use of
different relaxed-clock models (such as the autocorrelated vs. the
independent log-normally distributed rates) may result in substan-
tially different time estimates (see ref. 32 for an example). In such
cases, repeating the analysis under the different clock models may
be desirable.
5 Exercises
5.1 Autocorrelated Modify file mcmc/mcmctree.ctl and set clock ¼ 3. This activates
Rate Model the autocorrelated log-normal rates model, also known as the
geometric Brownian motion rates model [6, 13]. Run the
MCMC twice and check for convergence. Compare the posterior
times obtained with those obtained under the independent
log-normal model (clock ¼ 2). Are there any systematic differences
in node age estimates between the two analyses? Which clock model
produces the most precise (i.e., narrower CIs) divergence time
estimates?
328 Mario dos Reis and Ziheng Yang
5.2 MCMC Sampling Modify file mcmc/mcmctree.ctl and set clock ¼ 2 (independent
with Exact Likelihood rates), usedata ¼ 1 (exact likelihood), burnin ¼ 200, sampfreq ¼
Calculation 2, and nsample ¼ 500. These last three options will lead to a much
shorter MCMC chain, with a total of 1200 iterations. Run the
MCMC sampling twice, and check for convergence using the
ESS, histograms, and trace plots. How long does it take for the
sampling to complete? Can you estimate how long it would take to
run the analysis using 2,020,000 iterations, as long as for the
approximate method of Sect. 3.3.2? Did the two chains converge
despite the low number of iterations?
5.3 Change of Fossil There is some controversy over whether {Sahelanthropus, used to
Calibrations set the minimum bound for the human-chimp divergence, is
indeed part of the human lineage. The next (younger) fossil in
the human lineage is {Orrorin which dates to around 6 Ma. Modify
file data/10s.tree and change the calibration in the human-
chimp node to B(0.057, 0.10, 0.01, 0.2). Also change the calibra-
tion on the root node to B(0.615, 1.315, 0.01, 0.05). Run the
MCMC analysis with the approximate method and again sampling
from the prior. Are there any substantial differences in the posterior
distributions of times under the new fossil calibrations? Which
nodes are affected? How bad is the truncation effect among the
calibration densities and the prior?
5.4 Comparing This is a difficult exercise. Use R to plot the prior densities of times
Calibration Densities sampled using MCMC (the same as in Fig. 10). Now try to work
and Prior Densities out how to overlay the calibration densities onto the plots. For
example, see Fig. 3 in [23] for an idea. First, write functions that
calculate the calibration densities. The dunif function in R is useful
to plot uniform calibrations. Functions sn::dsn and sn::dst
(in the SN package) are useful to plot the skew-t (ST) and skew-
normal (SN) distributions. Calibration type S2N (Table 1) is a
mixture of two skew-normal distributions [15]. How do the sam-
pled priors compare to the calibration densities? Are there any
substantial truncation effects?
5.5 Time Estimation Good taxon sampling is critical to obtaining robust estimates of
in a Supermatrix of divergence times for clades. In the data/ directory, an alignment
330 Species of the first and second codon positions from mitochondrial
protein-coding genes from 330 species (326 primate and
4 out-group species) is provided, 330s.phys, with corresponding
tree topology, 330s.tree. First, place the fossil calibrations of
Table 1 on the appropriate nodes of the species tree. Then obtain
the gradient and Hessian matrix for the 330-species alignment
using the HKY + G model. Finally, estimate the divergence times
on the 330-species phylogeny by using the approximate likelihood
method. How does taxon sampling affect node age estimates when
comparing the 10-species and 330-species trees? How does
Molecular Clock Dating 329
References
1. Zuckerkandl E, Pauling L (1965) Evolutionary 15. Wilkinson RD, Steiper ME, Soligo C et al
divergence and convergence in proteins. In: (2011) Dating primate divergences through
Bryson V, Vogel HJ (eds) Evolving genes and an integrated analysis of palaeontological and
proteins. Academic, New York, pp 97–166 molecular data. Syst Biol 60:16–31
2. Kumar S (2005) Molecular clocks: four dec- 16. Dos Reis M, Yang Z (2011) Approximate like-
ades of evolution. Nat Rev Genet 6:654–662 lihood calculation on a phylogeny for Bayesian
3. Bromham L, Penny D (2003) The modern estimation of divergence times. Mol Biol Evol
molecular clock. Nat Rev Genet 4:216–224 8(7):2161–2172
4. dos Reis M, Donoghue PCJ, Yang Z (2016) 17. Bouckaert R, Heled J, Kühnert D et al (2014)
Bayesian molecular clock dating of species BEAST 2: a software platform for Bayesian
divergences in the genomics era. Nat Rev evolutionary analysis. PLoS Comput Biol 10
Genet 17:71–80 (4):e1003537
5. Donoghue PCJ, Yang Z (2016) The evolution 18. Höhna S, Landis MJ, Heath TA et al (2016)
of methods for establishing evolutionary time- RevBayes: Bayesian phylogenetic inference
scales. Philos Trans R Soc B Biol Sci using graphical models and an interactive
371:20160020 model-specification language. Syst Biol
6. Thorne JL, Kishino H, Painter IS (1998) Esti- 65:726–736
mating the rate of evolution of the rate of 19. Dos Reis M, Zhu T, Yang Z (2014) The impact
molecular evolution. Mol Biol Evol of the rate prior on Bayesian estimation of
15:1647–1657 divergence times with multiple loci. Syst Biol
7. Drummond AJ, Ho SYW, Phillips MJ et al 63:555–565
(2006) Relaxed phylogenetics and dating with 20. Yang Z (2014) Molecular Evolution: A Statis-
confidence. PLoS Biol 4:699–710 tical Approach. Oxford University Press,
8. Ronquist F, Teslenko M, Van Der Mark P et al Oxford
(2012) Mrbayes 3.2: efficient Bayesian phylo- 21. Heath TA, Moore BR (2014) Bayesian infer-
genetic inference and model choice across a ence of species divergence times. In: Chen
large model space. Syst Biol 61:539–542 M-H, Kuo L, Lewis PO (eds) Bayesian Phylo-
9. Lartillot N, Lepage T, Blanquart S (2009) Phy- genetics: Methods, Algorithms, and Applica-
loBayes 3: a Bayesian software package for phy- tions. CRC Press, Boca Raton, pp 277–318
logenetic reconstruction and molecular dating. 22. Yang Z (2007) PAML 4: phylogenetic analysis
Bioinformatics 25:2286–2288 by maximum likelihood. Mol Biol Evol
10. Heath TA, Holder MT, Huelsenbeck JP 24:1586–1591
(2012) A Dirichlet process prior for estimating 23. dos Reis M, Inoue J, Hasegawa M et al (2012)
lineage-specific substitution rates. Mol Biol Phylogenomic datasets provide both precision
Evol 29:939–955 and accuracy in estimating the timescale of pla-
11. Kishino H, Thorne JL, Bruno WJ (2001) Per- cental mammal phylogeny. Proc Biol Sci
formance of a divergence time estimation 279:3491–3500
method under a probabilistic model of rate 24. dos Reis M, Gunnell G, Barba-Montoya J et al
evolution. Mol Biol Evol 18:352–361 (2018) Using phylogenomic data to explore
12. Yang Z, Rannala B (2006) Bayesian estimation the effects of relaxed clocks and calibration
of species divergence times under a molecular strategies on divergence time estimation: pri-
clock using multiple fossil calibrations with soft mates as a test case. Syst Biol 67(4):594–615
bounds. Mol Biol Evol 23:212–226 25. Yang Z (1996) Among-site rate variation and
13. Rannala B, Yang Z (2007) Inferring speciation its impact on phylogenetic analyses. Trends
times under an episodic molecular clock. Syst Ecol Evol 11(9):367–372
Biol 56:453–466 26. Gillespie JH (1984) The molecular clock may
14. Inoue J, Donoghue PCJ, Yang Z (2010) The be an episodic clock. Proc Natl Acad Sci U S A
impact of the representation of fossil calibra- 81:8009–8013
tions on Bayesian estimation of species diver-
gence times. Syst Biol 59:74–89
330 Mario dos Reis and Ziheng Yang
27. Warnock RCM, Yang Z, Donoghue PCJ species divergence times. Syst Biol 67
(2012) Exploring uncertainty in the calibration (1):61–77
of the molecular clock. Biol Lett 8:156–159 35. Lanfear R, Calcott B, Ho SYW et al (2012)
28. Zollikofer CPE, Ponce de León MS, Lieber- PartitionFinder: combined selection of parti-
man DE et al (2005) Virtual cranial reconstruc- tioning schemes and substitution models for
tion of Sahelanthropus tchadensis. Nature phylogenetic analyses. Mol Biol Evol
434:755–759 29:1695–1701
29. Benton MJ, Donoghue PCJ (2007) Paleonto- 36. Duchêne S, Molak M, Ho SYW (2014) Clock-
logical evidence to date the tree of life. Mol staR: choosing the number of relaxed-clock
Biol Evol 24(1):26–53 models in molecular phylogenetic analysis. Bio-
30. Parham JF, Donoghue PCJ, Bell CJ et al informatics 30:1017–1019
(2012) Best practices for justifying fossil cali- 37. Heath TA, Huelsenbeck JP, Stadler T (2014)
brations. Syst Biol 61(2):346–359 The fossilized birth-death process for coherent
31. Ho SYW, Phillips MJ (2009) Accounting for calibration of divergence-time estimates. Proc
calibration uncertainty in phylogenetic estima- Natl Acad Sci U S A 111:E2957–E2966
tion of evolutionary divergence times. Syst Biol 38. Ronquist F, Klopfstein S, Vilhelmsen L et al
58:367–380 (2012) A total-evidence approach to dating
32. Dos Reis M, Thawornwattana Y, Angelis K et al with fossils, applied to the early radiation of
(2015) Uncertainty in the timing of origin of the hymenoptera. Syst Biol 61:973–999
animals and the limits of precision in molecular 39. O’Reilly JE, dos Reis M, Donoghue PCJ
timescales. Curr Biol 25:2939–2950 (2015) Dating tips for divergence-time estima-
33. Zhu T, Reis MD, Yang Z (2014) Characteriza- tion. Trends Genet 31(11):637–650
tion of the uncertainty of divergence time esti- 40. Felsenstein J (1981) Evolutionary trees from
mation under relaxed molecular clock models DNA sequences: a maximum likelihood
using multiple loci. Syst Biol 64(2):267–280 approach. J Mol Evol 17:368–376
34. Angelis K, Alvarez-Carretero S, dos Reis M 41. Lepage T, Bryant D, Philippe H et al (2007) A
et al (2018) An evaluation of different parti- general comparison of relaxed molecular clock
tioning strategies for Bayesian estimation of models. Mol Biol Evol 24:2669–2680
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 11
Abstract
A major current molecular evolution challenge is to link comparative genomic patterns to species’ biology
and ecology. Breeding systems are pivotal because they affect many population genetic processes and thus
genome evolution. We review theoretical predictions and empirical evidence about molecular evolutionary
processes under three distinct breeding systems—outcrossing, selfing, and asexuality. Breeding systems may
have a profound impact on genome evolution, including molecular evolutionary rates, base composition,
genomic conflict, and possibly genome size. We present and discuss the similarities and differences between
the effects of selfing and clonality. In reverse, comparative and population genomic data and approaches
help revisiting old questions on the long-term evolution of breeding systems.
Key words Breeding systems, GC-biased gene conversion, Genome evolution, Genomic conflicts,
Selection, Transposable elements
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_11, © The Author(s) 2019
331
332 Sylvain Glémin et al.
outcrossing
selfing
asexuality
Effective
Recombination population size
Outcrossing
Selfing
Asexuality
Segregation
Fig. 2 A schematic representation of the effect of breeding systems on population genetic parameters
Table 1
Summary of the major theoretical predictions regarding breeding systems and evolutionary genomic
variables, with outcrossing being taken as reference
Outcrossing: h = 0.5
4 Selfing: any h
Outcrossing: h = 0.3
3 Clonality: h = 0.7
Clonality: h = 0.5
Clonality: h = 0.3
2
S
–0.0002 –0.0001 0.0000 0.0001 0.0002
Fig. 3 Substitution rates relative to the neutral case (dN/dS) in outcrossers (thin lines), selfers (bold line), and
asexuals (dotted lines) for different mutation dominance levels. The fitness of the resident, heterozygote, and
homozygote mutant genotypes are 1, 1 hs, and 1 s, respectively. For asexuals, it is necessary to consider
two substitution rates corresponding to the initial fixation of heterozygotes and the ultimate fixation of
complete homozygote mutants from an initially heterozygote population [31]. Population size: N ¼ 10,000.
To highlight the difference between selfers and asexuals due to segregation, demographic and hitchhiking
effects reducing Ne in asexuals and selfers are not taken into account
2.2 Breeding The effective population size strongly affects the outcome of natu-
Systems and Selection ral selection. The probability of fixation of a new mutation is a
Efficacy function of the Nes product, where s is the selection coefficient
([49] and see Fig. 3). As Ne is reduced, a higher proportion of
2.2.1 Drift and mutations behave almost neutrally. Weakly deleterious alleles can
Recombination: Parallel thus be fixed, while weakly advantageous ones can be lost. Genetic
Reduction in Selection associations among loci generated by selfing and clonality also
Efficacy in Selfers and induce selective interferences [26, 50]. Because of their reduced
Asexuals? effective population size and recombination rate, selection is thus
expected to be globally less effective in selfers and asexuals than in
outcrossers, which should result in various footprints at the molec-
ular level (Table 1). Assuming that most mutations are deleterious
(with possible back compensatory mutations), both the ratio of
non-synonymous to synonymous polymorphism, πN/ πS, and the
ratio of non-synonymous to synonymous substitutions, dN/dS, are
predicted to be higher in selfers and asexuals than in outcrossers.
Codon usage should also be less optimized in selfers and asexuals
than in outcrossers.
Contrary to polymorphism surveys, few studies have tested
these predictions empirically (Table 2). In the few available
Table 2
338
Summary of studies comparing patterns of molecular evolution between different breeding systems
+ and indicate if theoretical predictions are confirmed or not. Empty cells correspond to nonavailable data
a
No more significant after controlling for terminal vs. internal branches
b
Terminal vs. internal branches not controlled
c
Possible confounding effect of hybrid origin of one asexual lineage
339
340 Sylvain Glémin et al.
selective sweeps [68, 69]. If such mechanisms are the main cause of
reduction of Ne in selfers, then even a low recombination rate could
be enough to maintain the selection efficacy. This is suggested by
genomic patterns across recombination gradients in outcrossing
species. In primates, no effect of recombination on the selection
efficacy has been detected [70]. In Drosophila, Haddrill et al. [71]
found little evidence of reduced selection in low recombining
regions, except when recombination was fully suppressed, as in Y
chromosomes. Differences between selfers and asexuals could thus
simply result from different degrees of residual outcrossing. How-
ever, as stated above, selfers and asexuals also fundamentally differ
as far as segregation is concerned, as we now discuss in more detail.
2.2.2 Segregation: Selfing affects the selection efficacy by increasing homozygosity and
Dealing with Heterozygotes thus exposing recessive alleles to selection. This effect can counter-
act the effect of reducing Ne. Considering the sole reduction in Ne
due to non-independent gamete sampling, selection is less efficient
under partial selfing for dominant mutations but more efficient for
recessive ones (Fig. 3, and see ref. 72). More precisely, Glémin [73]
determined the additional reduction in Ne (due to hitchhiking and
demographic effects) necessary to overcome the increased selection
efficacy due to homozygosity. This additional reduction can be high
for recessive mutations. On the contrary, the lack of segregation in
asexuals reduces selection efficacy and increases the drift load, as
heterozygotes can fix [31]. The effects of selfing and clonality on
the fixation probability of codominant, recessive, or dominant
mutations are summarized in Fig. 3. Note that segregation may
also have indirect effects. When recombination is suppressed, Mul-
ler’s ratchet is supposed to reduce Ne and contribute to the fixation
of weakly deleterious alleles [74]. In selfers, the purging of partially
recessive deleterious alleles slows down the ratchet [67], which
suggests that the fixation of deleterious alleles at linked loci would
be lower in selfers than in asexuals. The same mechanism also
contributes to weaker background selection in selfers than in asex-
uals (see above, [36]). In the extreme case of intra-gametophytic
selfing, purging could be even more efficient at removing deleteri-
ous alleles [11], as it has been suggested for moss species [75]. Seg-
regation at meiosis could thus partly explain the differences
between selfers and asexuals, but more data are clearly needed to
confirm this hypothesis.
The two opposite effects of drift and segregation in selfers
should also affect adaptive evolution. In outcrossers, new beneficial
mutations are more likely to be rapidly lost if recessive, as they are
initially present in heterozygotes and masked to selection—a pro-
cess known as Haldane’s sieve [76]. By unmasking these mutations
in homozygotes, selfing could help adaptive evolution from reces-
sive mutations [72, 73]. However, this advantage of selfing disap-
pears when adaptation proceeds from pre-existing variation because
342 Sylvain Glémin et al.
2.2.3 Selection on So far, we have only considered the immediate, mechanistic effects
Genetic Systems of breeding systems on population genetic parameters. Breeding
systems, however, can also affect the evolution of genetic systems
themselves, which modulates previous predictions. Theoretical
arguments suggested that selfing, even at small rates, greatly
increases the parameter range under which recombination is
selected for [91–93]. These predictions have been confirmed in a
meta-analysis in angiosperms in which outcrossers exhibited lower
chiasmata counts per bivalent than species with mixed or selfing
mating systems [94]. Higher levels of physical recombination (r0)
could thus help break down LD and reduce hitchhiking effects.
This could contribute to explaining why little evidence of long-
term genomic degradation has been observed in selfers, compared
to asexuals.
Breeding systems may also affect selection on mutation rates.
Since the vast majority of mutations are deleterious, mutation rates
should tend toward zero, up to physiological costs of further
reducing mutation rates being too high (e.g., [95, 96]). Under
complete linkage, a modifier remains associated with its “own”
mutated genome. Selection should thus favor lower mutation
rates in asexuals and selfers (e.g., [95, 96]). However, Lynch
recently challenged this view and suggested a lower limit to DNA
repair may be set by random drift, not physiological cost [97]. Such
a limit should thus be higher in asexuals and selfers. Asexuality is
often associated with very efficient DNA repair systems (reviewed in
[43]), supporting the view that selection for efficient repair may
overwhelm drift in asexual lineages. Alternatively, only groups
344 Sylvain Glémin et al.
2.3 Breeding Outcrossing species undergo various sorts of genetic conflict. Sex-
Systems and Genomic ual reproduction directly leads to conflicts within (e.g., for access to
Conflicts mating) and between sexes (e.g., for resource allocations between
male and female functions or between offspring). In selfers and
asexuals, such conflicts occur because mates are akin or because
mating is absent [100, 101]. Outcrossers are also sensitive to
epidemic selfish element proliferation and to meiotic drive, because
alleles can easily spread over the population through random mat-
ing. In contrast, selfers and asexuals should be immune to such
genomic conflicts because selection only occurs between selfing or
asexual lineages so that selfish elements should be either lost or
evolve into commensalists or mutualists [102].
2.3.1 Relaxation of Some genes involved in sexual reproduction are known to evolve
Sexual Conflicts in Selfers rapidly because of recurrent positive selection [103]. Arm races for
and Asexuals mating or for resource allocation to offspring are the most likely
causes of this accelerated evolution. In selfers and asexuals, selec-
tion should be specifically relaxed on these genes, not only because
of low recombination and effective size but mainly because the
selection pressure per se should be suppressed. According to this
prediction, in the outcrosser C. grandiflora, 6 out of the 20 genes
that show the strongest departure from neutrality are reproductive
genes and under positive selection. This contrasts with the selfer
A. thaliana, for which no reproductive genes are under positive
selection [83].
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 345
2.3.2 Biased Gene GC-biased gene conversion (gBGC) is a kind of meiotic drive at the
Conversion as a Meiotic base pair scale that can also be strongly influenced by breeding
Drive Process: systems. In many species, gene conversion occurring during
Consequences for double-strand break recombination repair is biased toward G and
Nucleotide Landscape and C alleles (reviewed in [113]). This process mimics selection and can
Protein Evolution rapidly increase the GC content, especially around recombination
hotspots [114, 115], and, more broadly, can affect genome-wide
nucleotide landscapes. For instance, it is thought to be the main
force that shaped the isochore structure of mammals and birds
[116]. gBGC has been mostly studied by comparing genomic
regions with different rates of (crossing-over) recombination
(reviewed in [116]). However, comparing species with contrasted
breeding systems offers a broader and unique opportunity to study
gBGC. gBGC cannot occur in asexuals because recombination is
lacking. Selfing is also expected to reduce the gBGC efficacy
because meiotic drive does not occur in homozygotes [117]. To
our knowledge, GC content has never been compared between
sexual and asexual taxa, but there have been comparisons between
outcrossers and selfers.
As expected, no relationship was found between local recombi-
nation rates and GC-content in the highly selfing Arabidopsis thali-
ana [117], and Wright et al. [118] suggested that the (weak)
differences observed with the outcrossing A. lyrata and Brassica
oleracea could be due to gBGC. Much stronger evidence has been
obtained in grasses. Grasses are known to exhibit unusual genomic
base composition compared to other plants, being richer and more
heterogeneous in GC-content [119], and direct and indirect evi-
dences of gBGC have been accumulating [119, 120–122]. Accord-
ingly, GC-content or equilibrium GC values were found to be
higher in outcrossing than in selfing species [24, 84, 120]. Differ-
ence in gBGC between outcrossing and selfing lineages has also
been found in the plant genus Collinsia [123] and in freshwater
snails [66], although difference in selection on codon usage cannot
be completely ruled out.
gBGC can also affect functional sequence evolution, leaving a
spurious signature of positive selection and increasing the mutation
load through the fixation of weakly deleterious AT!GC muta-
tions: gBGC would represent a genomic Achilles’ heel
[124]. Once again, comparing outcrossing and selfing species is
useful for detecting interference between gBGC and selection.
gBGC is expected to counteract selection in outcrossing species
only. The Achilles’ heel hypothesis could explain why relaxed selec-
tion was not detected in four grass species belonging to the Triti-
ceae tribe [84]. In outcrossing species, but not in selfing ones,
dN/dS was found to be significantly higher for genes exhibiting
high than low equilibrium GC-content, suggesting that selection
efficacy could be reduced because of high substitution rates in favor
of GC alleles in these outcrossing grasses. In outcrossing species,
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 347
2.3.3 Transposable Considering the role of sex in the spread of selfish elements, TEs
Elements in Selfers and should be less frequent in selfers and asexuals than in outcrossers
Asexuals: Purging or because they cannot spread from one genomic background to
Accumulation? another through syngamy. However, highly selfing and asexual
species derive from sexual outcrossing ancestors, from which they
inherit their load of TEs. TE distribution eventually depends on the
balance between additional transposition within selfing/clonal
lineages on one hand and selection or excision on the other. Fol-
lowing the abandonment of sex, large asexual populations are
expected to purge their load of TEs, provided excision occurs,
even at very low rates. However, purging can take a very long
time, and, without excision, TEs should slowly accumulate, not
decline [126]. In small populations, even with excision, a Muller’s
ratchet-like process drives TE accumulation throughout the
genome [126]. Transition from outcrossing to selfing should also
rapidly purge TEs, but as for asexuals, in small fully selfing popula-
tions, TEs can be retained [127]. Using yeast populations, it was
experimentally confirmed that sex increases the spread of TEs
[128, 129]. TE numbers were also found to be higher in cyclically
sexual than in fully asexual populations of Daphnia pulex
[130–132] (Table 3), contrary to what was described in the para-
sitoid wasp Leptopilina clavipes and in root knot nematode species
(Table 3). It should be noted that several comparative studies on
asexual arthropods, nematodes, primroses, and green algae did not
evidence any significant effect of breeding system on TE content or
evolution (Table 3). At larger evolutionary scales, the putatively
ancient asexual bdelloid rotifers strikingly exemplify the fact that
348
Table 3
Summary of studies comparing transposable element distribution and dynamics between different breeding systems
fraction of TE in asexuals
inherited directly from
sexuals
349
(continued)
350
Table 3
(continued)
Sylvain Glémin et al.
asexuals can purge their load of TEs. Unlike all sexual eukaryotes,
they appear to be free of vertically transmitted retrotransposon,
while their genome contains DNA transposons, probably acquired
via horizontal transfers [133, 134]. Examples of TE accumulation
in asexuals are less common, maybe because species are doomed to
extinction under this evolutionary scenario [135]. However, the
increase in genome size in some apomictic lineages of Hypericum
species may result from this process [136].
In selfers, the distribution of TEs depends not only on the
population size but also on the mode of selection against TEs
[127, 137]. Under the “deleterious” model, TE insertions are
selected against because they disrupt gene functions. According
to the “ectopic exchange” model, TEs are selected against because
they generate chromosomal rearrangements through unequal
crossing-over between TE at nonhomologous insertion sites.
Under the first of these two models, homozygosity resulting from
selfing increases the selection efficacy against TEs, while under the
second one, under-dominant chromosomal rearrangements are less
selected against in selfing than in outcrossing populations
[127, 137]. A survey of Ty1-copia-like elements in plants suggests
that they are less abundant in self-fertilizing than in outcrossing
plants, thus supporting the “deleterious” rather than the “ectopic”
exchange model [127]. The distribution of retrotransposons in
self-incompatible and self-compatible Solanum species also sup-
ports the “deleterious” model, even though most insertions are
probably neutral [138] (Table 3). In the selfer Arabidopsis thali-
ana, selection efficacy against TEs seems to be reduced compared
to its outcrossing sister species A. lyrata [139, 140], but compari-
son of the two complete genomes revealed a higher load of TE in
A. lyrata and a recent decrease in TE in number in A. thaliana, in
agreement with the date of transition to selfing [141]. In the
Capsella genus, while the very recent selfer C. rubella possesses a
slightly higher number of TEs than the outcrossing C. grandiflora,
the oldest selfer C. orientalis exhibits a significantly reduced load of
TE [142] (Table 3). Other selfish elements, such as B chromo-
somes, are also less frequent in selfers, in support of the view that
inbreeding generally prevents selfish element transmission [102].
2.4 Breeding Atypical breeding systems are often associated with polyploidy
Systems, Ploidy, and [143], and the reasons for this association are not entirely clear.
Hybridization Polyploid mutants might be more likely to establish as new lineages
in selfers and asexuals than in obligate outcrossers if crosses
between polyploids and diploids are unfertile or counterselected.
This is because at low population frequency a polyploid mutant will
experience the disadvantage of mostly mating with diploids—the
minority cytotype exclusion principle [144, 145]—unless it repro-
duces asexually or via selfing. In addition, by doubling gene copy
number, polyploidy might alleviate the fitness cost of recessive
352 Sylvain Glémin et al.
2.5 Breeding As argued above, breeding systems can affect many aspects of
Systems and Genome genome content and organization. They should also affect the
Size Evolution whole genome size. Following Lynch’s theory [1], genome size
should be higher in selfers and asexuals because of their reduced
effective population size, hence reduced ability to get rid of useless,
slightly costly sequences. However, the picture is probably more
complex. First, because of the recent origin of many selfing and
(at least some) asexual lineages, relaxed selection may not have
operated longly enough to impact genome size. Second, because
of their immunity to selfish element transmission, selfers and asex-
uals should exhibit lower genome size, especially in groups where
TEs are major determinants of genome size. Hence, it is not clear
whether genetic drift or resistance to selfish elements (or other
processes) is the most important in governing genome size evolu-
tion in various breeding systems.
Meta-analyses performed in plants provided equivocal answers.
Analysis of the distribution of B chromosomes showed a strong and
significant positive association between outcrossing, the occurrence
of B chromosomes, and genome size [102, 158]. However, after
phylogenetic control, only the association between breeding sys-
tems and B chromosomes remains. Whitney et al. [159] simulta-
neously tested the effect of breeding systems (using outcrossing
rate estimates) and genetic drift (using polymorphism data) on
genome size in seed plants. Raw data showed a significant effect
of both breeding systems and genetic drift, according to theoretical
predictions. However, no effect was observed after phylogenetic
control, leading the authors to reconsider the hypothesis of a role
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 353
3.1 Genomic Genetic markers have long been used to determine breeding sys-
Approaches to Infer tems and quantify selfing rates or degrees of asexuality. For
Breeding System instance, current selfing rates can be inferred using molecular mar-
Evolution kers through FIS estimates or preferably—although more time
consuming—through progeny analyses [164–166]. Multilocus-
3.1.1 Genomic based estimates that take identity disequilibrium into account
Characterization of greatly improve the simple FIS-based method that is sensitive to
Breeding Systems several artifacts such as null alleles ([167], see also refs. 168, 169).
This method, implemented in the RMES software [167], has
proven to give results very similar to progeny-based methods
[170]. To take advantage of the information potentially available
in sequence data, coalescence-based estimators have also been pro-
posed to infer long-term selfing rates, and they have been imple-
mented more recently in a Bayesian clustering approach in the
INSTRUCT software package [171]. However, this approach
mostly captures information from recent coalescence events so
that such approaches still estimate recent selfing rates [28]. Much
more information about long-term selfing rates can be derived
from LD patterns [19], but this has not been fully exploited for
selfing rate estimators (for instance, LD is not taken into account in
INSTRUCT). Similarly, recombination can be inferred using
genetic markers or sequence data, and more generally, various
methods have been proposed to characterize the degree of clonality
in natural populations (for review see ref. 172) and recently imple-
mented in the R package RClone [173].
Initially, such methods were applied with few markers, from
which only global descriptions of breeding systems were deducible.
Thanks to the considerable increase in sequencing facilities, it has
become possible to finely characterize temporal and spatial varia-
tions in breeding systems. In A. thaliana, an analysis of more than
1000 individuals in 77 local stands using more than 400 SNP
markers revealed spatial heterogeneity in outcrossing rates. Local
“hotspots” of recent outcrossing (up to 15%) were identified, while
other stands exhibited complete homozygosity with no detectable
outcrossing [174]. Interestingly, at this local scale (from 30 m to
40 km), outcrossing rates have been found to be twofold higher on
average in rural than in urban stands; hence, selfing could be
associated with higher disturbance in urban stands.
Genomic data may also help characterize breeding systems in
species with unknown or ill-characterized life cycles. In yeasts Sac-
charomyces cerevisiae and S. paradoxus, the analyses of linkage dis-
equilibrium patterns allowed to quantify the frequency of (rare)
sexual reproduction events and the proportion of inbreeding and
outcrossing during these events [175, 176]. For instance, in the
pico-algae Ostreococcus, no sexual form or process has been detected
in the lab. However, the occurrence of infrequent recombination
(about 1 meiosis for 10 mitoses) inferred from a population geno-
mics approach and the presence of meiosis genes in the genome
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 355
3.1.2 Inferring and Genomic approaches are also useful for analyzing the dynamics of
Dating Breeding System breeding system evolution. A simple way is to map breeding system
Transitions evolution on phylogenies, which could provide a raw picture of the
frequency and relative timing of breeding system transitions (e.g.,
[183]). However, these approaches, based on ancestral character
reconstruction, are hampered by numerous uncertainties. For
instance, in the case of two sister species with contrasting breeding
systems, such as A. thaliana and A. lyrata, it is impossible to know
whether A. thaliana evolved toward selfing just after divergence
(about five million years ago) or only very recently. At a larger
phylogenetic scale, inferring rates of transition between characters
and ancestral states can be biased if diversification rates differ
between characters—this is typically expected with breeding sys-
tems for which asexuals and selfers should exhibit higher extinction
rates than outcrossers [184].
Thanks to the genomic signatures left by contrasted breeding
systems, it is possible to trace back transitions in the past and to date
them more precisely. In diploid asexual species, because of the
arrest of recombination, the two copies of each gene have diverged
independently since the origin of asexuality. After having calibrated
the molecular clock, it is thus possible to date this origin from the
level of sequence divergence between the two copies. This so-called
Meselson effect was observed and quantified in the trypanosome T.
b. gambiense, suggesting that this species evolved asexually about
10,000 years ago [88]. However, no Meselson effect has been
observed in other presumably ancient asexual species such as oriba-
tid mites [185] or darwinulid ostracods [186], while data refute the
possibility of cryptic sex. In such cases, it is thus not possible to infer
when recombination actually stopped, presumably because of
356 Sylvain Glémin et al.
3.2 Matching The expected reduction in Ne in selfers and asexuals may increase
Breeding System the drift load (accumulation of slightly deleterious mutations) and
Evolution Theories preclude adaptation. Selfing and clonality are thus supposed to be
with Genomic Data evolutionary dead ends [17, 18]. The twiggy phylogenetic distri-
butions of asexuals [196] and selfers [183] or self-compatible
3.2.1 Testing the Dead- species [197] suggest they are mostly derived recently from out-
End Hypothesis: crossing ancestors (but see ref. 198). However, this observation may
Comparison Between not be sufficient to support the dead-end hypothesis, and neutral
Selfing and Asexuality models can also explain this pattern [199–201]. In a comprehensive
and epochal phylogenetic study of several Solanaceae genera, Gold-
berg et al. [202] went further by testing the irreversibility of
transitions. Using a phylogenetic method developed for estimating
the character effect on speciation and extinction [203, 204], they
showed that self-compatible species have both higher speciation
and extinction rates—with the resulting net diversification rates
being lower—than self-incompatible species. This was the first
direct demonstration of the dead-end hypothesis, and additional
results have been obtained in Primula species [205]. On the con-
trary, in the Oenothera genus, asexuality has been found associated
with increased diversification but frequent reversion toward the
sexual system, suggesting that the form of asexuality in this group
is not an evolutionary dead end [206].
Genomic data also provide an opportunity to investigate the
genetic causes of such long-term evolutionary failures. The
increased dN/dS ratios reported in asexuals (see above) suggest
that deleterious point mutations contribute to the load. However,
in Daphnia rapid exposure of recessive deleterious alleles through
mitotic recombination or gene conversion likely has a much stron-
ger effect on clone persistence than their long-term accumulation
under Muller’s ratchet [60]. TE could also contribute to the load
and to the extinction of asexuals [135], though more data are still
needed to unambiguously support this hypothesis (but see ref.
136). The pattern in selfers is less clear. While theory globally
predicts a reduction in selection efficacy in selfers, models also
highlight conditions under which selection can be little affected
or even enhanced in selfers [72, 73, 207], especially regarding TE
accumulation [127, 137]. Empirical data on both protein and TE
evolution have not revealed any strong evidence of long-term
accumulation of deleterious mutation in selfers, as compared to
outcrossers, whereas polymorphism data mainly support relaxation
of selection in selfers (Table 2). This is in agreement with the recent
origin of selfing but makes difficult further inference of the under-
lying causes of higher extinction in selfers as trait-dependent diver-
sification processes alter the relationship between life history traits
and rate of molecular evolution [208]. A reduced ability to respond
to environmental changes through adaptive evolution could also
contribute to long-term extinction in asexuals (but see ref. 209) and
selfers, especially if standing variation is needed to rescue
358 Sylvain Glémin et al.
3.2.2 Evading the “Dead The few putatively ancient asexuals known so far seem to escape the
End” mutational load predicted by the dead-end hypothesis and avoid
extinction over long evolutionary time scales. For example, fossil
evidence and decades of microscopic observations indicate that
bdelloid rotifers have apparently persisted for over 40 million
years without meiosis, males, or conventional sexual reproduction
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 359
5 Questions
Acknowledgments
References
1. Lynch M (2007) The origin of genome archi- 5. Figuet E, Nabholz B, Bonneau M, Mas
tecture, 1st edn. Sinauer, Sunderland, MA Carrio E, Nadachowska-Brzyska K,
2. Smith SA, Donoghue MJ (2008) Rates of Ellegren H, Galtier N (2016) Life history
molecular evolution are linked to life history traits, protein evolution, and the nearly neu-
in flowering plants. Science 322(5898):86–89 tral theory in amniotes. Mol Biol Evol 33
3. Romiguier J, Gayral P, Ballenghien M, (6):1517–1527
Bernard A, Cahais V, Chenuil A, Chiari Y, 6. Chen J, Glemin S, Lascoux M (2017) Genetic
Dernat R, Duret L, Faivre N, Loire E, Lour- diversity and the efficacy of purifying selection
enco JM, Nabholz B, Roux C, across plant and animal species. Mol Biol Evol
Tsagkogeorga G, Weber AA, Weinert LA, 34(6):1417–1428
Belkhir K, Bierne N, Glemin S, Galtier N 7. Lefebure T, Morvan C, Malard F, Francois C,
(2014) Comparative population genomics in Konecny-Dupre L, Gueguen L, Weiss-Gayet-
animals uncovers the determinants of genetic M, Seguin-Orlando A, Ermini L, Sarkissian C,
diversity. Nature 515(7526):261–263 Charrier NP, Eme D, Mermillod-Blondin F,
4. Bromham L, Hua X, Lanfear R, Cowman PF Duret L, Vieira C, Orlando L, Douady CJ
(2015) Exploring the relationships between (2017) Less effective selection leads to larger
mutation rates, life history, genome size, envi- genomes. Genome Res 27(6):1016–1028
ronment, and species richness in flowering 8. Jarne P, Auld JR (2006) Animals mix it up
plants. Am Nat 185(4):507–524 too: the distribution of self-fertilization
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 361
among hermaphroditic animals. Evolution 60 Proc R Soc Lond B Biol Sci 273
(9):1816–1824 (1604):3011–3019
9. Vogler DW, Kaliz S (2001) Sex among the 25. Golding GB, Strobeck C (1980) Linkage dis-
flowers: the distribution of plant mating sys- equilibrium in a finite population that is par-
tems. Evolution 55(1):202–204 tially selfing. Genetics 94(3):777–789
10. Haldane JBS (1932) The causes of evolution, 26. Roze D (2015) Effects of interference
vol 1, 1st edn. Princeton University Press, between selected loci on the mutation load,
Princeton inbreeding depression, and heterosis. Genet-
11. Hedrick PW (1987) Population genetics of ics 201(2):745–757
intragametophytic selfing. Evolution 41 27. Pollak E (1987) On the theory of partially
(1):137–144 inbreeding finite populations. I. Partial self-
12. Balloux F, Lehmann L, de Meeus T (2003) ing. Genetics 117(2):353–360
The population genetics of clonal and partially 28. Nordborg M, Donnelly P (1997) The coales-
clonal diploids. Genetics 164(4):1635–1644 cent process with selfing. Genetics 146
13. Simon JC, Delmotte F, Rispe C, Crease TJ (3):1185–1195
(2003) Phylogenetic relationships between 29. Ceplitis A (2003) Coalescence times and the
parthenogens and their sexual relatives: the Meselson effect in asexual eukaryotes. Genet
possible routes to parthenogenesis in animals. Res 82(3):183–190
Biol J Lin Soc 79:151–163 30. Hartfield M, Wright SI, Agrawal AF (2016)
14. Whitton J, Sears CJ, Baack EJ, Otto SP Coalescent times and patterns of genetic
(2008) The dynamic nature of apomixis in diversity in species with facultative sex: effects
the angiosperms. Int J Plant Sci 169 of gene conversion, population structure, and
(1):169–182 heterogeneity. Genetics 202(1):297–312
15. Schurko AM, Neiman M, Logsdon JM 31. Haag CR, Roze D (2007) Genetic load in
(2009) Signs of sex: what we know and how sexual and asexual diploids: segregation, dom-
we know it. Trends Ecol Evol 24(4):208–217 inance and genetic drift. Genetics 176
16. Neiman M, Sharbel TF, Schwander T (2014) (3):1663–1678
Genetic causes of transitions from sexual 32. Schoen DJ, Brown AHD (1991) Intraspecific
reproduction to asexuality in plants and ani- variation in population gene diversity and
mals. J Evol Biol 27(7):1346–1359 effective population size correlates with the
17. Maynard-Smith J (1978) The evolution of mating system in plants. Proc Natl Acad Sci
sex. Cambridge University Press, Cambridge U S A 88:4494–4497
18. Stebbins GL (1957) Self fertilization and 33. Haag CR, Ebert D (2004) A new hypothesis
population variability in higher plants. Am to explain geographic parthenogenesis. Ann
Nat 91:337–354 Zool Fennici 41:539–544
19. Nordborg M (2000) Linkage disequilibrium, 34. Ingvarsson PK (2002) A metapopulation per-
gene trees and selfing: an ancestral recombi- spective on genetic diversity and differentia-
nation graph with partial self-fertilization. tion in partially self-fertilizing plants.
Genetics 154(2):923–929 Evolution 56(12):2368–2373
20. Padhukasahasram B, Marjoram P, Wall JD, 35. Gordo I, Charlesworth B (2001) Genetic
Bustamante CD, Nordborg M (2008) linkage and molecular evolution. Curr Biol
Exploring population genetic models with 11(17):R684–R686
recombination using efficient forward-time 36. Agrawal AF, Hartfield M (2016) Coalescence
simulations. Genetics 178(4):2417–2427 with background and balancing selection in
21. Hartfield M, Glemin S (2016) Limits to adap- systems with bi- and uniparental reproduc-
tation in partially selfing species. Genetics 203 tion: contrasting partial asexuality and selfing.
(2):959–974 Genetics 202(1):313–326
22. Roze D (2016) Background selection in par- 37. Thomas CG, Wang W, Jovelin R, Ghosh R,
tially selfing populations. Genetics 203 Lomasko T, Trinh Q, Kruglyak L, Stein LD,
(2):937–957 Cutter AD (2015) Full-genome evolutionary
23. Flint-Garcia SA, Thornsberry JM, Buckler ES histories of selfing, splitting, and selection in
IV (2003) Structure of linkage disequilibrium Caenorhabditis. Genome Res 25(5):667–678
in plants. Annu Rev Plant Biol 54:357–374 38. Andersen EC, Gerke JP, Shapiro JA, Crissman
24. Glémin S, Bazin E, Charlesworth D (2006) JR, Ghosh R, Bloom JS, Felix MA, Kruglyak L
Impact of mating systems on patterns of (2012) Chromosome-scale selective sweeps
sequence polymorphism in flowering plants. shape Caenorhabditis elegans genomic diver-
sity. Nat Genet 44(3):285–290
362 Sylvain Glémin et al.
39. Coop G (2016) Does linked selection explain 53. Johnson SG, Howard RS (2007) Contrasting
the narrow range of genetic diversity across patterns of synonymous and nonsynonymous
species? bioArxiv. https://doi.org/10.1101/ sequence evolution in asexual and sexual
042598 freshwater snail lineages. Evolution 61
40. Hamrick JL, Godt MJW (1996) Effects of life (11):2728–2735
history traits on genetic diversity in plant spe- 54. Neiman M, Hehman G, Miller JT, Logsdon
cies. Philos Trans R Soc Lond B 351 JM Jr, Taylor DR (2010) Accelerated muta-
(1345):1291–1298 tion accumulation in asexual lineages of a
41. Nybom H (2004) Comparison of different freshwater snail. Mol Biol Evol 27
nuclear DNA markers for estimating intraspe- (4):954–963
cific genetic diversity in plants. Mol Ecol 13 55. Lovell JT, Williamson RJ, Wright SI, McKay
(5):1143–1155 JK, Sharbel TF (2017) Mutation accumula-
42. Fontcuberta Garcia-Cuenca A, Dumas Z, tion in an asexual relative of Arabidopsis.
Schwander T (2016) Extreme genetic diver- PLoS Genet 13(1):e1006550
sity in asexual grass thrips populations. J Evol 56. Ollivier M, Gabaldon T, Poulain J, Gavory F,
Biol 29(5):887–899 Leterme N, Gauthier JP, Legeai F, Tagu D,
43. Normark BB, Judson OP, Moran NA (2003) Simon JC, Rispe C (2012) Comparison of
Genomic signatures of ancient asexual gene repertoires and patterns of evolutionary
lineages. Biol J Lin Soc 79:69–84 rates in eight aphid species that differ by
44. Lercher MJ, Hurst LD (2002) Human SNP reproductive mode. Genome Biol Evol 4
variability and mutation rate are higher in (2):155–167
regions of high recombination. Trends 57. Pellino M, Hojsgaard D, Schmutzer T,
Genet 18(7):337–340 Scholz U, Horandl E, Vogel H, Sharbel TF
45. Hellmann I, Ebersberger I, Ptak SE, Paabo S, (2013) Asexual genome evolution in the apo-
Przeworski M (2003) A neutral explanation mictic Ranunculus auricomus complex: exam-
for the correlation of diversity with recombi- ining the effects of hybridization and
nation rates in humans. Am J Hum Genet 72 mutation accumulation. Mol Ecol 22
(6):1527–1535 (23):5908–5921
46. Longman-Jacobsen N, Williamson JF, Daw- 58. Hollister JD, Greiner S, Wang W, Wang J,
kins RL, Gaudieri S (2003) In polymorphic Zhang Y, Wong GK, Wright SI, Johnson MT
genomic regions indels cluster with nucleo- (2015) Recurrent loss of sex is associated with
tide polymorphism: quantum genomics. accumulation of deleterious mutations in
Gene 312:257–261 Oenothera. Mol Biol Evol 32(4):896–905
47. Tian D, Wang Q, Zhang P, Araki H, Yang S, 59. Ament-Velasquez SL, Figuet E,
Kreitman M, Nagylaki T, Hudson R, Ballenghien M, Zattara EE, Norenburg JL,
Bergelson J, Chen JQ (2008) Single- Fernandez-Alvarez FA, Bierne J, Bierne N,
nucleotide mutation rate increases close to Galtier N (2016) Population genomics of sex-
insertions/deletions in eukaryotes. Nature ual and asexual lineages in fissiparous ribbon
455(7209):105–108 worms (Lineus, Nemertea): hybridization,
polyploidy and the Meselson effect. Mol
48. Hollister JD, Ross-Ibarra J, Gaut BS (2010) Ecol 25(14):3356–3369
Indel-associated mutation rate varies with
mating system in flowering plants. Mol Biol 60. Tucker AE, Ackerman MS, Eads BD, Xu S,
Evol 27(2):409–416 Lynch M (2013) Population-genomic
insights into the evolutionary origin and fate
49. Kimura M (1962) On the probability of fixa- of obligately asexual Daphnia pulex. Proc Natl
tion of mutant genes in a population. Genetics Acad Sci U S A 110(39):15740–15745
47:713–719
61. Wright SI, Lauga B, Charlesworth D (2002)
50. Hill WG, Robertson AW (1966) The effect of Rates and patterns of molecular evolution in
genetic linkage on the limits to artificial selec- inbred and outbred Arabidopsis. Mol Biol
tion. Genet Res 8:269–294 Evol 19(9):1407–1420
51. Paland S, Lynch M (2006) Transitions to 62. Cutter AD, Wasmuth JD, Washington NL
asexuality result in excess amino acid substitu- (2008) Patterns of molecular evolution in
tions. Science 311(5763):990–992 Caenorhabditis preclude ancient origins of
52. Henry L, Schwander T, Crespi BJ (2012) selfing. Genetics 178(4):2093–2104
Deleterious mutation accumulation in asexual 63. Escobar JS, Cenci A, Bolognini J, Haudry A,
Timema stick insects. Mol Biol Evol 29 Laurent S, David J, Glémin S (2010) An inte-
(1):401–408 grative test of the dead-end hypothesis of
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 363
selfing evolution in Triticeae (poaceae). Evo- 78. Kamran-Disfani A, Agrawal AF (2014) Self-
lution 64(10):2855–2872 ing, adaptation and background selection in
64. Glémin S, Muyle A (2014) Mating systems finite populations. J Evol Biol 27
and selection efficacy: a test using chloroplas- (7):1360–1371
tic sequence data in Angiosperms. J Evol Biol 79. Kirkpatrick M, Jenkins CD (1989) Genetic
27(7):1386–1399 segregation and the maintenance of sexual
65. Arunkumar R, Ness RW, Wright SI, Barrett reproduction. Nature 339(6222):300–301
SC (2015) The evolution of selfing is accom- 80. Ronfort J, Glémin S (2013) Mating system,
panied by reduced efficacy of selection and Haldane’s sieve, and the domestication pro-
purging of deleterious mutations. Genetics cess. Evolution 67(5):1518–1526
199(3):817–829 81. McDonald JH, Kreitman M (1991) Adapta-
66. Burgarella C, Gayral P, Ballenghien M, tive protein evolution at the ADH locus in
Bernard A, David P, Jarne P, Correa A, Drosophila. Nature 351:652–654
Hurtrez-Bousses S, Escobar J, Galtier N, Glé- 82. Eyre-Walker A, Keightley PD (2009) Estimat-
min S (2015) Molecular evolution of freshwa- ing the rate of adaptive molecular evolution in
ter snails with contrasting mating systems. the presence of slightly deleterious mutations
Mol Biol Evol 32(9):2403–2416 and population size change. Mol Biol Evol 26
67. Charlesworth D, Morgan MT, Charlesworth (9):2097–2108
B (1993) Mutation accumulation in finite 83. Slotte T, Foxe JP, Hazzouri KM, Wright SI
outbreeding and inbreeding populations. (2010) Genome-wide evidence for efficient
Genet Res 61:39–56 positive and purifying selection in Capsella
68. Hartfield M, Glemin S (2014) Hitchhiking of grandiflora, a plant species with a large effec-
deleterious alleles and the cost of adaptation tive population size. Mol Biol Evol 27
in partially selfing species. Genetics 196 (8):1813–1821
(1):281–293 84. Haudry A, Cenci A, Guilhaumon C, Paux E,
69. Hartfield M, Otto SP (2011) Recombination Poirier S, Santoni S, David J, Glémin S (2008)
and hitchhiking of deleterious alleles. Evolu- Mating system and recombination affect
tion 65(9):2421–2434 molecular evolution in four Triticeae species.
70. Bullaughey K, Przeworski M, Coop G (2008) Genet Res 90(1):97–109
No effect of recombination on the efficacy of 85. Hersch-Green EI, Myburg H, Johnson MT
natural selection in primates. Genome Res 18 (2012) Adaptive molecular evolution of a
(4):544–554 defence gene in sexual but not functionally
71. Haddrill PR, Halligan DL, Tomaras D, Char- asexual evening primroses. J Evol Biol 25
lesworth B (2007) Reduced efficacy of selec- (8):1576–1586
tion in regions of the Drosophila genome that 86. Engelstadter J (2017) Asexual but not clonal:
lack crossing over. Genome Biol 8(2):R18 evolutionary processes in automictic popula-
72. Charlesworth B (1992) Evolutionary rates in tions. Genetics 206(2):993–1009
partially self-fertilizing species. Am Nat 140 87. Mandegar MA, Otto SP (2007) Mitotic
(1):126–148 recombination counteracts the benefits of
73. Glémin S (2007) Mating systems and the effi- genetic segregation. Proc Biol Sci 274
cacy of selection at the molecular level. Genet- (1615):1301–1307
ics 177(2):905–916 88. Weir W, Capewell P, Foth B, Clucas C,
74. Charlesworth B, Charlesworth D (1997) Pountain A, Steketee P, Veitch N, Koffi M,
Rapid fixation of deleterious alleles can be De Meeus T, Kabore J, Camara M, Cooper A,
caused by Muller’s ratchet. Genet Res 70 Tait A, Jamonneau V, Bucheton B,
(1):63–73 Berriman M, MacLeod A (2016) Population
75. Szovenyi P, Devos N, Weston DJ, Yang X, genomics reveals the origin and asexual evo-
Hock Z, Shaw JA, Shimizu KK, McDaniel lution of human infective trypanosomes. Elife
SF, Wagner A (2014) Efficient purging of 5:e11473
deleterious mutations in plants with haploid 89. Omilian AR, Cristescu ME, Dudycha JL,
selfing. Genome Biol Evol 6(5):1238–1252 Lynch M (2006) Ameiotic recombination in
76. Haldane JBS (1937) The effect of variation on asexual lineages of Daphnia. Proc Natl Acad
fitness. Am Nat 71:337–349 Sci U S A 103(49):18638–18643
77. Glémin S, Ronfort J (2013) Adaptation and 90. Keith N, Tucker AE, Jackson CE, Sung W,
maladaptation in selfing and outcrossing spe- Lledo JIL, Schrider DR, Schaack S, Dudycha
cies: new mutations versus standing variation. JL, Ackerman M, Younge AJ, Shaw JR, Lynch
Evolution 67(1):225–240 M (2016) High mutational rates of large-scale
364 Sylvain Glémin et al.
duplication and deletion in Daphnia pulex. 106. Spillane C, Schmid KJ, Laoueille-Duprat S,
Genome Res 26(1):60–69 Pien S, Escobar-Restrepo JM, Baroux C,
91. Charlesworth D, Charlesworth B, Strobeck C Gagliardini V, Page DR, Wolfe KH, Grossnik-
(1979) Selection for recombination in self- laus U (2007) Positive darwinian selection at
fertilizing species. Genetics 93:237–244 the imprinted MEDEA locus in plants.
92. Charlesworth D, Charlesworth B, Strobeck C Nature 448(7151):349–352
(1977) Effects of selfing on selection for 107. Kawabe A, Fujimoto R, Charlesworth D
recombination. Genetics 68:213–226 (2007) High diversity due to balancing selec-
93. Roze D, Lenormand T (2005) Self- tion in the promoter region of the Medea
fertilization and the evolution of recombina- gene in Arabidopsis lyrata. Curr Biol 17
tion. Genetics 170:841–857 (21):1885–1889
94. Ross-Ibarra J (2007) Genome size and 108. Budar F, Touzet P, De Paepe R (2003) The
recombination in angiosperms: a second nucleo-mitochondrial conflict in cytoplasmic
look. J Evol Biol 20(2):800–806 male sterilities revisited. Genetica 117
(1):3–16
95. Dawson KJ (1998) Evolutionarily stable
mutation rates. J Theor Biol 194(1):143–157 109. Houliston GJ, Olson MS (2006) Nonneutral
evolution of organelle genes in Silene vul-
96. Kondrashov AS (1995) Modifiers of garis. Genetics 174(4):1983–1994
mutation-selection balance - general-
approach and the evolution of mutation- 110. Ingvarsson PK, Taylor DR (2002) Genealog-
rates. Genet Res 66(1):53–69 ical evidence for epidemics of selfish genes.
Proc Natl Acad Sci U S A 99
97. Lynch M (2010) Evolution of the mutation (17):11265–11269
rate. Trends Genet 26(8):345–352
111. Touzet P, Delph LF (2009) The effect of
98. Schoen DJ (2005) Deleterious mutation in breeding system on polymorphism in mito-
related species of the plant genus Amsinckia chondrial genes of Silene. Genetics 181
with contrasting mating systems. Evolution (2):631–644
59(11):2370–2377
112. Foxe JP, Wright SI (2009) Signature of diver-
99. Baer CF, Joyner-Matos J, Ostrow D, sifying selection on members of the pentatri-
Grigaltchik V, Salomon MP, Upadhyay A copeptide repeat protein family in Arabidopsis
(2010) Rapid decline in fitness of mutation lyrata. Genetics 183(2):663–672,
accumulation lines of gonochoristic (out- 661SI–668SI
crossing) Caenorhabditis nematodes. Evolu-
tion 64(11):3242–3253 113. Marais G (2003) Biased gene conversion:
implications for genome and sex evolution.
100. Brandvain Y, Haig D (2005) Divergent mat- Trends Genet 19(6):330–338
ing systems and parental conflict as a barrier
to hybridization in flowering plants. Am Nat 114. Clement Y, Arndt PF (2013) Meiotic recom-
166(3):330–338 bination strongly influences GC-content evo-
lution in short regions in the mouse genome.
101. Tazzyman SJ, Abbott JK (2015) Self- Mol Biol Evol 30(12):2612–2618
fertilization and inbreeding limit the scope
for sexually antagonistic polymorphism. J 115. Glémin S, Arndt PF, Messer PW, Petrov D,
Evol Biol 28(3):723–729 Galtier N, Duret L (2015) Quantification of
GC-biased gene conversion in the human
102. Burt A, Trivers R (1998) Selfish DNA and genome. Genome Res 25(8):1215–1228
breeding systems in plants. Proc R Soc Lond
B 265:141–146 116. Duret L, Galtier N (2009) Biased gene con-
version and the evolution of mammalian
103. Swanson WJ, Vacquier VD (2002) The rapid genomic landscapes. Annu Rev Genomics
evolution of reproductive proteins. Nat Rev Hum Genet 10:285–311
Genet 3(2):137–144
117. Marais G, Charlesworth B, Wright SI (2004)
104. Palopoli MF, Rockman MV, TinMaung A, Recombination and base composition: the
Ramsay C, Curwen S, Aduna A, Laurita J, case of the highly self-fertilizing plant Arabi-
Kruglyak L (2008) Molecular basis of the dopsis thaliana. Genome Biol 5(7):R45
copulatory plug polymorphism in Caenor-
habditis elegans. Nature 454 118. Wright SI, Iorgovan G, Misra S, Mokhtari M
(7207):1019–1022 (2007) Neutral evolution of synonymous
base composition in the Brassicaceae. J Mol
105. Cutter AD (2008) Reproductive evolution: Evol 64(1):136–141
symptom of a selfing syndrome. Curr Biol
18(22):R1056–R1058 119. Serres-Giardi L, Belkhir K, David J, Glémin S
(2012) Patterns and evolution of nucleotide
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 365
landscapes in seed plants. Plant Cell 24 133. Arkhipova I, Meselson M (2000) Transposa-
(4):1379–1397 ble elements in sexual and ancient asexual
120. Muyle A, Serres-Giardi L, Ressayre A, taxa. Proc Natl Acad Sci U S A 97
Escobar J, Glémin S (2011) GC-biased gene (26):14473–14477
conversion and selection affect GC content in 134. Arkhipova IR, Meselson M (2005) Diverse
the Oryza genus (rice). Mol Biol Evol 28 DNA transposons in rotifers of the class Bdel-
(9):2695–2706 loidea. Proc Natl Acad Sci U S A 102
121. Rodgers-Melnick E, Vera DL, Bass HW, (33):11781–11786
Buckler ES (2016) Open chromatin reveals 135. Arkhipova I, Meselson M (2005) Deleterious
the functional maize genome. Proc Natl transposable elements and the extinction of
Acad Sci U S A 113(22):E3177–E3184 asexuals. BioEssays 27(1):76–85
122. Li X, Li L, Yan J (2015) Dissecting meiotic 136. Matzk F, Hammer K, Schubert I (2003)
recombination based on tetrad analysis by Coevolution of apomixis and genome size
single-microspore sequencing in maize. Nat within the genus Hypericum. Sex Plant
Commun 6:6648 Reprod 16:51–58
123. Hazzouri KM, Escobar JS, Ness RW, Killian 137. Wright SI, Schoen DJ (1999) Transposon
Newman L, Randle AM, Kalisz S, Wright SI dynamics and the breeding system. Genetica
(2013) Comparative population genomics in 107(1–3):139–148
Collinsia sister species reveals evidence for 138. Tam SM, Causse M, Garchery C, Burck H,
reduced effective population size, relaxed Mhiri C, Grandbastien MA (2007) The dis-
selection, and evolution of biased gene con- tribution of copia-type retrotransposons and
version with an ongoing mating system shift. the evolutionary history of tomato and
Evolution 67(5):1263–1278 related wild species. J Evol Biol 20
124. Galtier N, Duret L (2007) Adaptation or (3):1056–1072
biased gene conversion? Extending the null 139. Wright SI, Le QH, Schoen DJ, Bureau TE
hypothesis of molecular evolution. Trends (2001) Population dynamics of an Ac-like
Genet 23(6):273–277 transposable element in self- and cross-
125. Glémin S (2010) Surprising fitness conse- pollinating Arabidopsis. Genetics 158
quences of GC-biased gene conversion: (3):1279–1288
I. Mutation load and inbreeding depression. 140. Lockton S, Gaut BS (2010) The evolution of
Genetics 185(3):939–959 transposable elements in natural populations
126. Dolgin ES, Charlesworth B (2006) The fate of self-fertilizing Arabidopsis thaliana and its
of transposable elements in asexual popula- outcrossing relative Arabidopsis lyrata. BMC
tions. Genetics 174(2):817–827 Evol Biol 10:10
127. Morgan MT (2001) Transposable element 141. de la Chaux N, Tsuchimatsu T, Shimizu KK,
number in mixed mating populations. Genet Wagner A (2012) The predominantly selfing
Res 77(3):261–275 plant Arabidopsis thaliana experienced a
128. Zeyl C, Bell G, Green DM (1996) Sex and the recent reduction in transposable element
spread of retrotransposon Ty3 in experimen- abundance compared to its outcrossing rela-
tal populations of Saccharomyces cerevisiae. tive Arabidopsis lyrata. Mob DNA 3(1):2
Genetics 143(4):1567–1577 142. Agren JA, Wang W, Koenig D, Neuffer B,
129. Goddard MR, Greig D, Burt A (2001) Out- Weigel D, Wright SI (2014) Mating system
crossed sex allows a selfish gene to invade shifts and transposable element evolution in
yeast populations. Proc Biol Sci 268 the plant genus Capsella. BMC Genomics
(1485):2537–2542 15:602
130. Sullender BW, Crease TJ (2001) The behav- 143. Otto SP, Whitton J (2000) Polyploid inci-
ior of a Daphnia pulex transposable element dence and evolution. Annu Rev Genet
in cyclically and obligately parthenogenetic 34:401–437
populations. J Mol Evol 53(1):63–69 144. Fowler NL, Levin DA (1984) Ecological con-
131. Valizadeh P, Crease TJ (2008) The associa- straints on the establishment of a novel poly-
tion between breeding system and transposa- ploid in competition with its diploid
ble element dynamics in Daphnia pulex. J Mol progenitor. Am Nat 124:703–711
Evol 66(6):643–654 145. Husband BC (2000) Constraints on poly-
132. Schaack S, Pritham EJ, Wolf A, Lynch M ploid evolution: a test of the minority cyto-
(2010) DNA transposon dynamics in popula- type exclusion principle. Proc Biol Sci 267
tions of Daphnia pulex with and without sex. (1440):217–223
Proc R Soc B Biol Sci 277(1692):2381–2387
366 Sylvain Glémin et al.
146. Husband BC (2016) Effect of inbreeding on 156. Moon CD, Craven KD, Leuchtmann A,
pollen tube growth in diploid and tetraploid Clement SL, Schardl CL (2004) Prevalence
Chamerion angustifolium: do polyploids of interspecific hybrids amongst asexual fun-
mask mutational load in pollen? Am J Bot gal endophytes of grasses. Mol Ecol 13
103(3):532–540 (6):1455–1467
147. Kreiner JM, Kron P, Husband BC (2017) 157. Blanc-Mathieu R, Perfus-Barbeoch L, Aury
Frequency and maintenance of unreduced J-M, Da Rocha M, Gouzy J, Sallet E,
gametes in natural plant populations: associa- Martin-Jimenez C, Bailly-Bechet M,
tions with reproductive mode, life history and Castagnone-Sereno P, Flot J-F, Kozlowski
genome size. New Phytol 214(2):879–889 DK, Cazareth J, Couloux A, Da Silva C,
148. Xu S, Innes DJ, Lynch M, Cristescu ME Guy J, Kim-Jo Y-J, Rancurel C, Schiex T,
(2013) The role of hybridization in the origin Abad P, Wincker P, Danchin EGJ (2017)
and spread of asexuality in Daphnia. Mol Ecol Hybridization and polyploidy enable genomic
22(17):4549–4561 plasticity without sex in the most devastating
149. Douglas GM, Gos G, Steige KA, Salcedo A, plant-parasitic nematodes. PLoS Genet 13
Holm K, Josephs EB, Arunkumar R, Agren (6):e1006777
JA, Hazzouri KM, Wang W, Platts AE, Wil- 158. Trivers R, Burt A, Palestis BG (2004) B chro-
liamson RJ, Neuffer B, Lascoux M, Slotte T, mosomes and genome size in flowering
Wright SI (2015) Hybrid origins and the ear- plants. Genome 47(1):1–8
liest stages of diploidization in the highly suc- 159. Whitney KD, Baack EJ, Hamrick JL, Godt
cessful recent polyploid Capsella bursa- MJ, Barringer BC, Bennett MD, Eckert CG,
pastoris. Proc Natl Acad Sci U S A 112 Goodwillie C, Kalisz S, Leitch IJ, Ross-Ibarra
(9):2806–2811 J (2010) A role for nonadaptive processes in
150. Novikova PY, Tsuchimatsu T, Simon S, plant genome size evolution? Evolution 64
Nizhynska V, Voronin V, Burns R, Fedorenko (7):2097–2109
OM, Holm S, Sall T, Prat E, Marande W, 160. Agren JA, Greiner S, Johnson MT, Wright SI
Castric V, Nordborg M (2017) Genome (2015) No evidence that sex and transposable
sequencing reveals the origin of the allotetra- elements drive genome size variation in even-
ploid Arabidopsis suecica. Mol Biol Evol 34 ing primroses. Evolution 69(4):1053–1062
(4):957–968 161. Albach DC, Greilhuber J (2004) Genome size
151. Szitenberg A, Salazar-Jaramillo L, Blok VC, variation and evolution in Veronica. Ann Bot
Laetsch DR, Joseph S, Williamson VM, Blax- 94(6):897–911
ter ML, Lunt DH (2017) Comparative geno- 162. Wright S, Ness RW, Foxe JP, Barrett SC
mics of apomictic root-knot nematodes: (2008) Genomic consequences of outcross-
hybridization, ploidy, and dynamic genome ing and selfing in plants. Int J Plant Sci 169
change. Genome Biol Evol 9(10):2844–2861 (1):105–118
152. Svensson O, Smith A, Garcia-Alonso J, van 163. Fierst JL, Willis JH, Thomas CG, Wang W,
Oosterhout C (2016) Hybridization gener- Reynolds RM, Ahearne TE, Cutter AD, Phil-
ates a hopeful monster: a hermaphroditic self- lips PC (2015) Reproductive mode and the
ing cichlid. R Soc Open Sci 3(3):150684 evolution of genome size and structure in
153. Beck JB, Alexander PJ, Allphin L, Al-Shehbaz Caenorhabditis nematodes. PLoS Genet 11
IA, Rushworth C, Bailey CD, Windham MD (6):e1005323
(2012) Does hybridization drive the transi- 164. Ritland K (2002) Extensions of models for
tion to asexuality in diploid Boechera? Evolu- the estimation of mating systems using n
tion 66(4):985–995 independent loci. Heredity 88(4):221–228
154. Janko K, Kotusz J, De Gelas K, Slechtova V, 165. Ritland K, Jain S (1981) A model for the
Opoldusova Z, Drozd P, Choleva L, estimation of outcrossing rate and gene-
Popiolek M, Balaz M (2012) Dynamic forma- frequencies using N independent loci. Hered-
tion of asexual diploid and polyploid lineages: ity 47(1):35–52
multilocus analysis of Cobitis reveals the 166. Koelling VA, Monnahan PJ, Kelly JK (2012)
mechanisms maintaining the diversity of A Bayesian method for the joint estimation of
clones. PLoS One 7(9):e45384 outcrossing rate and inbreeding depression.
155. Lampert KP, Schartl M (2008) The origin Heredity (Edinb) 109(6):393–400
and evolution of a unisexual hybrid: Poecilia 167. David P, Pujol B, Viard F, Castella V, Goudet
formosa. Philos Trans R Soc Lond B Biol Sci J (2007) Reliable selfing rate estimates from
363(1505):2901–2909 imperfect population genetic data. Mol Ecol
16(12):2474–2487
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 367
168. Redelings BD, Kumagai S, Tatarenkov A, 179. Birky CW (1996) Heterozygosity, hetero-
Wang L, Sakai AK, Weller SG, Culley TM, morphy, and phylogenetic trees in asexual
Avise JC, Uyenoyama MK (2015) A Bayesian eukaryotes. Genetics 144(1):427–437
approach to inferring rates of selfing and 180. Welch DM, Meselson M (2000) Evidence for
locus-specific mutation. Genetics 201 the evolution of bdelloid rotifers without sex-
(3):1171–1188 ual reproduction or genetic exchange. Science
169. McClure NS, Whitlock MC (2012) Multilo- 288(5469):1211–1215
cus estimation of selfing and its heritability. 181. Signorovitch A, Hur J, Gladyshev E, Mesel-
Heredity (Edinb) 109(3):173–179 son M (2015) Allele sharing and evidence for
170. Burkli A, Sieber N, Seppala K, Jokela J (2017) sexuality in a mitochondrial clade of bdelloid
Comparing direct and indirect selfing rate rotifers. Genetics 200(2):581–590
estimates: when are population-structure esti- 182. Debortoli N, Li X, Eyres I, Fontaneto D,
mates reliable? Heredity (Edinb) 118 Hespeels B, Tang CQ, Flot JF, Van Doninck
(6):525–533 K (2016) Genetic exchange among bdelloid
171. Gao H, Williamson S, Bustamante CD rotifers is more likely due to horizontal gene
(2007) A Markov chain Monte Carlo transfer than to meiotic sex. Curr Biol 26
approach for joint inference of population (6):723–732
structure and inbreeding rates from multilo- 183. Takebayashi N, Morrell PL (2001) Is self-
cus genotype data. Genetics 176 fertilization an evolutionary dead end? Revi-
(3):1635–1651 siting an old hypothesis with genetic theories
172. Halkett F, Simon JC, Balloux F (2005) Tack- and a macroevolutionary approach. Am J Bot
ling the population genetics of clonal and 88(7):1143–1150
partially clonal organisms. Trends Ecol Evol 184. Goldberg EE, Igic B (2008) On phylogenetic
20(4):194–201 tests of irreversible evolution. Evolution 62
173. Bailleul D, Stoeckel S, Arnaud-Haond S (11):2727–2741
(2016) RClone: a package to identify Multi- 185. Schaefer I, Domes K, Heethoff M,
Locus Clonal Lineages and handle clonal data Schneider K, Schon I, Norton RA, Scheu S,
sets in R. Methods Ecol Evol 7:966–970 Maraun M (2006) No evidence for the
174. Bomblies K, Yant L, Laitinen RA, Kim ST, ‘Meselson effect’ in parthenogenetic oribatid
Hollister JD, Warthmann N, Fitz J, Weigel mites (Oribatida, Acari). J Evol Biol 19
D (2010) Local-scale patterns of genetic (1):184–193
variability, outcrossing, and spatial structure 186. Schon I, Martens K (2003) No slave to sex.
in natural stands of Arabidopsis thaliana. Proc R Soc Lond B 270(1517):827–833
PLoS Genet 6(3):e1000890 187. Bechsgaard JS, Castric V, Charlesworth D,
175. Tsai IJ, Bensasson D, Burt A, Koufopanou V Vekemans X, Schierup MH (2006) The tran-
(2008) Population genomics of the wild yeast sition to self-compatibility in Arabidopsis
Saccharomyces paradoxus: quantifying the life thaliana and evolution within S-haplotypes
cycle. Proc Natl Acad Sci U S A 105 over 10 Myr. Mol Biol Evol 23
(12):4957–4962 (9):1741–1750
176. Ruderfer DM, Pratt SC, Seidel HS, Kruglyak 188. Tang C, Toomajian C, Sherman-Broyles S,
L (2006) Population genomic analysis of out- Plagnol V, Guo YL, Hu TT, Clark RM, Nas-
crossing and recombination in yeast. Nat rallah JB, Weigel D, Nordborg M (2007) The
Genet 38(9):1077–1081 evolution of selfing in Arabidopsis thaliana.
177. Grimsley N, Pequin B, Bachy C, Moreau H, Science 317(5841):1070–1072
Piganeau G (2010) Cryptic sex in the smallest 189. Shimizu KK, Tsuchimatsu T (2015) Evolu-
eukaryotic marine green alga. Mol Biol Evol tion of selfing: recurrent patterns in molecular
27(1):47–54 adaptation. Annu Rev Ecol Syst 46:593–622
178. Derelle E, Ferraz C, Rombauts S, Rouze P, 190. Castric V, Billiard S, Vekemans X (2014) Trait
Worden AZ, Robbens S, Partensky F, transitions in explicit ecological and genomic
Degroeve S, Echeynie S, Cooke R, Saeys Y, contexts: plant mating systems as case studies.
Wuyts J, Jabbari K, Bowler C, Panaud O, Adv Exp Med Biol 781:7–36
Piegu B, Ball SG, Ral JP, Bouget FY, 191. Tsuchimatsu T, Goubet PM, Gallina S, Holl
Piganeau G, De Baets B, Picard A, AC, Fobis-Loisy I, Berges H, Marande W,
Delseny M, Demaille J, Van de Peer Y, Mor- Prat E, Meng D, Long Q, Platzer A,
eau H (2006) Genome analysis of the smallest Nordborg M, Vekemans X, Castric V (2017)
free-living eukaryote Ostreococcus tauri Patterns of polymorphism at the self-
unveils many unique features. Proc Natl incompatibility locus in 1,083 Arabidopsis
Acad Sci U S A 103(31):11647–11652
368 Sylvain Glémin et al.
thaliana genomes. Mol Biol Evol 34 203. Fitzjohn RG, Maddison WP, Otto SP (2009)
(8):1878–1889 Estimating trait-dependent speciation and
192. Durvasula A, Fulgione A, Gutaker RM, Ala- extinction rates from incompletely resolved
cakaptan SI, Flood PJ, Neto C, phylogenies. Syst Biol 58(6):595–611
Tsuchimatsu T, Burbano HA, Pico FX, 204. Maddison WP, Midford PE, Otto SP (2007)
Alonso-Blanco C, Hancock AM (2017) Afri- Estimating a binary character’s effect on spe-
can genomes illuminate the early history and ciation and extinction. Syst Biol 56
transition to selfing in Arabidopsis thaliana. (5):701–710
Proc Natl Acad Sci U S A 114 205. de Vos JM, Hughes CE, Schneeweiss GM,
(20):5213–5218 Moore BR, Conti E (2014) Heterostyly accel-
193. Foxe JP, Slotte T, Stahl EA, Neuffer B, erates diversification via reduced extinction in
Hurka H, Wright SI (2009) Recent speciation primroses. Proc Biol Sci 281
associated with the evolution of selfing in (1784):20140075
Capsella. Proc Natl Acad Sci U S A 106 206. Johnson MT, Fitzjohn RG, Smith SD,
(13):5241–5245 Rausher MD, Otto SP (2011) Loss of sexual
194. Guo YL, Bechsgaard JS, Slotte T, Neuffer B, recombination and segregation is associated
Lascoux M, Weigel D, Schierup MH (2009) with increased diversification in evening prim-
Recent speciation of Capsella rubella from roses. Evolution 65(11):3230–3240
Capsella grandiflora, associated with loss of 207. Glémin S (2003) How are deleterious muta-
self-incompatibility and an extreme bottle- tions purged? Drift versus nonrandom mat-
neck. Proc Natl Acad Sci U S A 106 ing. Evolution 57(12):2678–2687
(13):5246–5251 208. Tahir D, Glémin S, Lascoux M, Kaj I (2019)
195. Brandvain Y, Slotte T, Hazzouri KM, Wright Modeling a trait-dependent diversification
SI, Coop G (2013) Genomic identification of process coupled with molecular evolution on
founding haplotypes reveal the history of the a random species tree. J Theor Biol
selfing species Capsella rubella. PLoS Genet 9 461:189–203
(9):e1003754 209. Dalrymple RL, Buswell JM, Moles AT (2015)
196. Judson OP, Normark BB (1996) Ancient Asexual plants change just as often and just as
asexual scandals. Trends Ecol Evol 11 fast as do sexual plants when introduced to a
(2):41–46 new range. Oikos 124(2):196–205
197. Igic B, Bohs L, Kohn JR (2006) Ancient 210. Uecker H (2017) Evolutionary rescue in ran-
polymorphism reveals unidirectional breeding domly mating, selfing, and clonal popula-
system shifts. Proc Natl Acad Sci U S A 103 tions. Evolution 71(4):845–858
(5):1359–1363 211. Richards AJ (1997) Plant breeding systems,
198. Ferrer MM, Good-Avila SV (2007) Macro- 2nd edn. Chapman & Hall Ltd, London
phylogenetic analyses of the gain and loss of 212. Igic B, Kohn JR (2006) The distribution of
self-incompatibility in the Asteraceae. New plant mating systems: study bias against obli-
Phytol 173(2):401–414 gately outcrossing species. Evolution 60
199. Janko K, Drozd P, Flegr J, Pannell JR (2008) (5):1098–1103
Clonal turnover versus clonal decay: a null 213. Wright SI, Kalisz S, Slotte T (2013) Evolu-
model for observed patterns of asexual lon- tionary consequences of self-fertilization in
gevity, diversity and distribution. Evolution plants. Proc Biol Sci 280(1760):20130133
62(5):1264–1270
214. Fontaneto D, Barraclough TG (2015) Do
200. Janko K (2014) Let us not be unfair to asex- species exist in asexuals? Theory and evidence
uals: their ephemerality may be explained by from bdelloid rotifers. Integr Comp Biol 55
neutral models without invoking any evolu- (2):253–263
tionary constraints of asexuality. Evolution 68
(2):569–576 215. Flot JF, Hespeels B, Li X, Noel B,
Arkhipova I, Danchin EGJ, Hejnol A,
201. Schwander T, Crespi BJ (2009) Twigs on the Henrissat B, Koszul R, Aury JM, Barbe V,
tree of life? Neutral and selective models for Barthelemy RM, Bast J, Bazykin GA,
integrating macroevolutionary patterns with Chabrol O, Couloux A, Da Rocha M, Da
microevolutionary processes in the analysis of Silva C, Gladyshev E, Gouret P,
asexuality. Mol Ecol 18(1):28–42 Hallatschek O, Hecox-Lea B, Labadie K,
202. Goldberg EE, Kohn JR, Lande R, Robertson Lejeune B, Piskurek O, Poulain J,
KA, Smith SA, Igic B (2010) Species selection Rodriguez F, Ryan JF, Vakhrusheva OA,
maintains self-incompatibility. Science 330 Wajnberg E, Wirth B, Yushenova I, Kellis M,
(6003):493–495 Kondrashov AS, Welch DBM, Pontarotti P,
Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 369
Weissenbach J, Wincker P, Jaillon O, Van 225. Brandvain Y, Slotte T, Hazzouri KM, Wright
Doninck K (2013) Genomic evidence for SI, Coop G (2013) Genomic Identification of
ameiotic evolution in the bdelloid rotifer Adi- Founding Haplotypes Reveals the History of
neta vaga. Nature 500(7463):453–457 the Selfing Species Capsella rubella. PLOS
216. Golczyk H, Massouh A, Greiner S (2014) Genetics 9(9): e1003754
Translocations of chromosome 226. Whittle CA, Sun Y, Johannesson H (2011)
end-segments and facultative heterochroma- Evolution of synonymous codon usage in
tin promote meiotic ring formation in even- Neurospora tetrasperma and Neurospora dis-
ing primroses. Plant Cell 26(3):1280–1293 creta. Genome Biol Evol 3:332–343
217. Barraclough TG, Fontaneto D, Ricci C, Her- 227. Bast J, Schaefer I, Schwander T, Maraun M,
niou EA (2007) Evidence for inefficient selec- Scheu S, Kraaijeveld K (2016) No accumula-
tion against deleterious mutations in tion of transposable elements in asexual
cytochrome oxidase I of asexual bdelloid roti- arthropods. Mol Biol Evol 33:697–706
fers. Mol Biol Evol 24:1952–1962 228. Docking TR, Saade FE, Elliott MC, Schoen
218. Foxe JP, Dar VU, Zheng H, Nordborg M, DJ (2006) Retrotransposon sequence varia-
Gaut BS et al (2008) Selection on amino acid tion in four asexual plant species. J Mol Evol
substitutions in Arabidopsis. MBE 62:375–387
25:1375–1383 229. Dolgin ES, Charlesworth B, Cutter AD
219. Gioti A, Stajich J, Johannesson H (2013) (2008) Population frequencies of transposa-
Neurospora and the dead-end hypothesis: ble elements in selfing and outcrossing Cae-
genomic consequences of selfing in the norhabditis nematodes. Genet Res
model genus. Evolution 67(12):3600–3616 90:317–329
220. Mark Welch DB, Meselson MS (2001) Rates 230. Goodwin TJ, Poulter RT (2008) Multiple
of nucleotide substitution in sexual and LTR-Retrotransposon families in the asexual
anciently asexual rotifers. Proc Natl Acad Sci yeast Candida albicans. Genome Res
USA 98:6720–6724 10:174–191
221. Ness RW, Siol M, Barrett SC (2012) Genomic 231. Jiang X, Tang H, Ye Z, Lynch M (2017)
consequences of transitions from cross- to Insertion polymorphisms of mobile genetic
self-fertilization on the efficacy of selection elements in sexual and asexual populations of
in three independently derived selfing plants. Daphnia pulex. Genome Biol Evol
BMC Genomics 13:611 9:362–374
222. Nygren K, Strandberg R, Wallberg A, 232. Kraaijeveld K, Zwanenburg B, Hubert B,
Nabholz B, Gustafsson T et al (2011) A com- Vieira C, De Pater S, Alphen V et al (2012)
prehensive phylogeny of Neurospora reveals a Transposon proliferation in an asexual parasit-
link between reproductive mode and molecu- oid. Mol Ecol 21:3898–3906
lar evolution in fungi. Mol Phylogenet Evol 233. Schaack S, Choi E, Lynch M, Pritham EJ
59:649–663 (2010) DNA transposons and the role of
223. Qiu S, Zeng K, Slotte T, Wright S, Charles- recombination in mutation accumulation in
worth D (2011) Reduced efficacy of natural Daphnia pulex. Genome Biol 11:R46
selection on codon usage bias in selfing Ara- 234. Szitenberg A, Cha S, Opperman CH, Bird
bidopsis and Capsella species. Genome Biol DM, Blaxter ML, Lunt DH (2016) Genetic
Evol 3:868–880 drift, not life history or RNAi, determine
224. Slotte T, Hazzouri KM, Agren JA, Koenig D, long-term evolution of transposable ele-
Maumus F et al (2013) The Capsella rubella ments. Genome Biol Evol 8:2964–2978
genome and the genomic consequences of 235. Zeyl C, Bell G, Da Silva J (1994) Transposon
rapid mating system evolution. Nat Genet abundance in sexual and asexual populations
45:831–835 of Chlamydomonas reinhardtii. Evolution
48:1406–1409
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Part IV
Abstract
Populations evolve as mutations arise in individual organisms and, through hereditary transmission, may
become “fixed” (shared by all individuals) in the population. Most mutations are lethal or have negative
fitness consequences for the organism. Others have essentially no effect on organismal fitness and can
become fixed through the neutral stochastic process known as random drift. However, mutations may also
produce a selective advantage that boosts their chances of reaching fixation. Regions of genomes where new
mutations are beneficial, rather than neutral or deleterious, tend to evolve more rapidly due to positive
selection. Genes involved in immunity and defense are a well-known example; rapid evolution in these
genes presumably occurs because new mutations help organisms to prevail in evolutionary “arms races”
with pathogens. In recent years genome-wide scans for selection have enlarged our understanding of the
genome evolution of various species. In this chapter, we will focus on methods to detect selection on the
genome. In particular, we will discuss probabilistic models and how they have changed with the advent of
new genome-wide data now available.
Key words Conserved and accelerated regions, Positive selection scans, Codon models, Selection-
mutation models, Polymorphism-aware phylogenetic models
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_12, © The Author(s) 2019
373
374 Carolin Kosiol and Maria Anisimova
Fig. 1 A diagram illustrating the different data and levels to analyze genomic
sequences and the relationship of the various approaches modeling selection
Several whole genome sequence data sets are now available for
selection scans. Mammalian genomes are well represented
(in particular primates), and insect genomes are becoming more
numerous (in particular Drosophila). These data can be down-
loaded as orthologous alignments from the Ensembl [1] and
UCSC [2] browsers.
In light of recent advances in DNA sequencing, with so-called
next generation sequencing (NGS) technologies that have dramat-
ically reduced the cost and time needed to sequence an organism’s
entire genome, large-scale (involving many organisms) sequencing
projects have been and are currently being undertaken. Just to
name a few, genome projects re-sequencing 1000 D. melanogaster
[3] and 1001 Arabidopsis [4] were accomplished, and the 100,000
human genome project [5] is ongoing. These polymorphism data
from multiple individuals from several species enable us to detect
very recent selection.
Together with the progress in sequencing technologies, algo-
rithmic advances now allow the de novo assembly of genomes from
NGS data, including complex mammalian genomes (e.g., giant
panda genome [6]). Therefore, not only international consortia
but also small groups and individual labs can now envisage to
sequence the organisms of their interest. As a consequence plat-
forms for sharing this data have been established. For example, the
Genome 10K project aims to assemble a genomic zoo—a collection
Selection Acting on Genomes 375
3 Methods
3.2 Detecting Understanding the forces shaping the evolution of specific lineages
Regions of Accelerated is one of the most exciting areas in evolutionary genomics. In
Genome Evolution particular, regions of accelerated evolution in mammalian and
insect species have been studied (e.g., see [14]). To eliminate non-
functional regions, one strategy is to begin with a search for regions
that are conserved through the mammalian history or longer. A
likelihood ratio test (LRT) may be used to detect acceleration of
rates in a lineage of interest, for example, the human lineage. Such
LRT compares the likelihood of the alignment data under two
probabilistic models. The null model has a single scale parameter
representing shortening (more conserved) and lengthening (less
conserved) of all branches of the tree. The alternative model has an
additional parameter for the human lineage, which is constraint to
be 1. This extra parameter allows the human branch to be rela-
tively longer (accelerated) than the branches in the rest of the tree.
For example, this approach was used to identify genomic
regions that are conserved in most vertebrates but have evolved
rapidly in humans. Interestingly, the majority of the human accel-
erated regions (HARs) were noncoding, and many were located
near protein-coding genes with protein functions related to the
nervous system [14].
In contrast, the majority of Drosophila melanogaster accelerated
regions (DMARs) are found in protein-coding regions and
Selection Acting on Genomes 377
3.3.2 Accounting for First codon models assumed constant nonsynonymous and synon-
Variability of Selective ymous rates among sites and over time. Although most proteins
Pressures evolve under purifying selection most of the time, positive selection
may drive the evolution in some lineages. During episodes of
adaptive evolution, only a small fraction of sites in the protein
have the capacity to increase the fitness of the protein via AA
replacements. Thus approaches assuming constant selective pres-
sure over time and over sites lack power in detecting genes affected
by positive selection. Consequently, various scenarios of variation in
selective pressure were incorporated in codon models, making
them more powerful at detecting positive selection, and short
episodes of adaptive evolution in particular. Evidence of positive
selection on a gene can be obtained by a LRT comparing two
nested models: a model that does not allow positive selection
(constraining ω 1 to represent the null hypothesis) and a model
that allows positive selection (ω > 1 is allowed in the alternative
hypothesis). Positive selection is detected if a model ω > 1 fits data
significantly better compared to the model restricting ω 1 at all
sites and lineages. However, the asymptotic null distribution may
vary from the standard χ 2 due to boundary problems or if some
parameters become not estimable (e.g., see [40, 41]).
3.3.3 Case Study: In 2006, six high-coverage genome assemblies became available for
Application of a Genome- eutherian mammals. The increased phylogenetic depth of this data
Wide Scan of Positive set permitted Kosiol and colleagues [42] to perform several new
Selection on Six lineage- and clade-specific tests using branch-site codon models. Of
Mammalian Genomes ~16,500 human genes with high-confidence orthologs in at least
two other species, 544 genes showed significant evidence of posi-
tive selection using branch-site codon models and standard LRTs.
380 Carolin Kosiol and Maria Anisimova
3.3.4 Selective Variability The simplest site models use the general discrete distribution with a
Among Codons: Site pre-specified number of site classes. Each site class i has an inde-
Models pendent parameter ωi estimated by ML together with proportions
of sites pi in each class. Since a large number of site categories
require many parameters, three categories are usually used (requir-
ing five independent parameters). To test for positive selection,
several pairs of nested site models were defined to represent the
null and alternative hypotheses in LRTs. For example, model M1a
includes two site classes, one with ω0 < 1 and another with ω1 ¼ 1,
representing the neutral model of evolution (the null hypothesis).
The alternative model M2a extends M1a by adding an extra site
class with ω2 1 to accommodate sites evolving under positive
selection. Significance of the LRT is tested using the χ 22-distribution
for the M1 vs. M2 comparison. We test the C7 gene for positive
selection by the LRT comparing nested models M1a and M2a
(Table 1).
Model M2a has two additional parameters compared to model
M1a. The resulting LRT statistic is 2(log L2 log L1) ¼ 2
(6377.35 (6369.67)) ¼ 2 7.68 ¼ 15.36. This is much
greater than the critical value of the chi-square distribution
χ 2 (df ¼ 2, at 5%) ¼ 5.99, and we calculate a p-value of
P ¼ 5.0e04. However, the M1a vs. M2a comparison for genes
C8B and C9 is not significant.
Selection Acting on Genomes 381
Table 1
Parameter estimates and log-likelihoods for a LRT of positive selection for the complement immunity
component C7
M1a (neutral)
Site class 0 1
Proportion p0 ¼ 0.69 ( p1 ¼ 1 p0 ¼ 0.31)
ω ratio ω0 ¼ 0.07 (ω1 ¼ 1)
Log-likelihood L1 ¼ 6377.35
M2a (selection)
Site class 0 1 2
Proportion p0 ¼ 0.70 p1 ¼ 0.29 ( p2 ¼ 1 p0 p1 ¼ 0.01)
ω ratio ω0 ¼ 0.08 (ω1 ¼ 1) ω2 ¼ 10.89
Log-likelihood L2 ¼ 6369.67
The model M2a is the alternative model with a class of sites with ω2 1. The null hypothesis M1a is the same model but
with ω2 ¼ 1 fixed
3.3.5 Selective Variability A simple way to include the variation of the selective pressure over
over Time: Branch Models time is by using separate parameters ω for each branch of a phylog-
eny (known as free-ratio model; [35]). Compared with the one-
ratio model (which assumes constant selection over time), the free-
ratio model requires additional 2T 4 ω parameters for T species.
Figure 2 shows the estimates of the free-ratio model for the C8B
gene. Although the ML estimates of ω values on the rodent lineages
are visibly higher than on the primate lineages, none of the
branches has ω > 1.
Other branch models can be defined by constraining different
sets of branches of a tree to have an individual ω. LRTs are used to
decide (1) whether selective pressure is significantly different on a
pre-specified set of branches and (2) whether these branches are
under positive selection.
However, branch models have relatively poor power to detect
selection [53] in comparison to branch-site models that are dis-
cussed in the next section. Also note that testing of multiple
hypotheses on the same data requires a correction, so the overall
false-positive rate is kept at the required level (most often 5%).
Correction for multiple testing further reduces the power of the
method, especially when many hypotheses are tested simulta-
neously (see Subheading 4 later).
3.3.6 Temporal and Several solutions were proposed to simultaneously account for
Spatial Variation of differences in selective constraints among codons and the episodic
Selective Pressure nature of molecular evolution at individual sites. One of the first
models—model MA [45]—assumes four site classes. Two classes
contain sites evolving constantly over time: one under purifying
selection with ω0 < 1; another with ω1 ¼ 1. The other two site
Selection Acting on Genomes 383
4 Notes/Discussion
4.1 Quality of The impact of the quality of sequence and the alignment is a major
Multiple Alignments concern when performing positive selection scans. For example, in
their analysis of 12 genomes Markova-Raina and Petrov [71] found
that the results were highly sensitive to the choice of an alignment
method. Furthermore, visual analysis indicated that most sites
inferred as positively selected are in fact misaligned at the codon
level. The rate of false positives ranged ~50% and more depending
on the aligner used. Some of these results can be ascribed to the
high divergence level of the 12 Drosophila species and could be
addressed by better filtering of the data. Nevertheless, even in
mammals where alignment is easier, problems have been observed.
Bakewell et al. [72] used the branch-site test to analyze
~14,000 genes from the human, chimpanzee, and macaque and
detected more genes to be under positive selection on the chim-
panzee lineage than on the human lineage (233 vs. 154). The same
pattern was also observed by Arbiza et al. [73] and Gibbs et al.
Selection Acting on Genomes 387
4.2 Biased Gene Mutation rate variation can also cause genomic regions to have
Conversion and different substitution rates without any change in fixation rate.
Recombination Recent studies of guanine and cytosine (GC)-isochores in the
mammalian genome have suggested the importance of another
selectively neutral evolutionary process that affects nucleotide evo-
lution. As described in the work of Laurent Duret and others (see
[82, 83]), biased gene conversion (BGC) is a mechanism caused by
the mutagenic effects of recombination combined with the prefer-
ence in recombination-associated DNA repair toward strong
(GC) versus weak (adenine and thymine [AT]) nucleotide pairs at
non-Watson-Crick heterozygous sites in heteroduplex DNA during
crossover in meiosis. Thus, beginning with random mutations,
BGC results in an increased probability of fixation of G and C
alleles. In particular, methods looking for accelerated regions in
coding DNA but also codon models cannot distinguish positive
selection from BGC (see [84, 85]). Therefore, the putatively
selected genes should be checked for GC content and closeness to
recombination hotspots and telomeres.
Most codon models assume a single phylogeny and a constant
synonymous rate among sites, implying that rate variation among
388 Carolin Kosiol and Maria Anisimova
4.3 Selection on Most selection studies to date focused on detecting selection on the
Synonymous Sites protein, since synonymous changes are often presumed neutral and
so unaffected by selective pressures. However, selection on synony-
mous sites has been documented more than a decade ago. Codon
usage bias is known to affect the majority of genes and species. In
his seminal work, Akashi [91] demonstrated purifying selection on
genes of Drosophila melanogaster, where strong codon bias favoring
certain (optimal) codons serves to increase the translational accu-
racy. Pressure to optimize for translational efficiency, robustness,
and kinetics leads to synonymous codon bias, which was shown to
widely affect mammalian genes [92], as well as genes of fast-
evolving pathogens like viruses [93]. The standard approach to
study selection on codon usage computes various codon adaptation
indexes on full-length protein-coding genes (see [94] for review).
More recently, methods to study selection on synonymous changes
adopted more sophisticated approaches, mainly the following stra-
tegies: (1) account for synonymous rate variation within sequences;
(2) include codon fitness parameters within a modeling framework
that connects population and intraspecific parameters; and
(3) allow for selection on synonymous substitutions by introducing
Selection Acting on Genomes 389
changes between preferred and rare codons have the same fitness
(and so the same selection coefficient). They used one selection
coefficient for optimal codon usage for each branch of a phylogeny
and estimated these jointly with the ω ratio by ML. Using this
approach to study ancestral codon usage bias, Nielsen et al. [114]
confirmed the reduction in selection for optimal codon usage in
D. melanogaster. In contrast, Yang and Nielsen [115] estimated
individual codon fitness parameters and used them to estimate
optimal codon frequencies for a gene across multiple species. LRT
is used to test whether the codon bias is due to the mutational bias
alone. Nevertheless, one remarkable contribution of the mutation-
selection models is the connection they make between the interspe-
cific and population parameters. Exploiting this further should
provide insights to how changing demographic factors influence
observed intraspecific patterns. Mutation-selection models also
allow a new perspective on understanding codon models in the
context of fitness landscapes with statistical implications as
discussed in Subheading 4.2 of Chapter 13 by Jones, Susko, and
Bielawski.
Finally, it is also possible to study selection on synonymous
changes by introducing a parametric relationship between fitness
and protein production cost. The idea was first described by Gilchr-
ist [116], who assumed that, in addition to mutation and drift, the
codon bias evolved under selection to reduce the cost of nonsense
errors. Protein production cost can be computed as a ratio of the
expected cost to the expected benefit [117]. Kubatko and collea-
gues [118] have extended a standard codon model to include the
difference in protein production due to the usage of different
codons (and therefore different elongation probabilities). How-
ever, such a model requires position-specific instantaneous rate
matrices, and consequently also the probability transition matrices,
making the approach computationally very intensive. To circum-
vent this, a GPU-based implementation was developed and used for
phylogeny inference from 104 gene data set from Saccharomyces
cerevisiae. Based on the standard model selection measure AIC, the
new model outperformed the simplest model M0 as well as the
mutation-selection model FMutSel of Yang and Nielsen.
5 Exercises
Acknowledgments
References
1. Zerbino DR, Achuthan P, Akanni W, Amode The sequence and de novo assembly of the
MR, Barrell D, Bhai J, Billis K, Cummins C, giant panda genome. Nature 463:311–317
Gall A, Girón CG, Gil L, Gordon L, 7. Posada D, Crandall KA (2002) The effect of
Haggerty L, Haskell E, Hourlier T, Izuogu recombination on the accuracy of phyloge-
OG, Janacek SH, Juettemann T, To JK, Laird netic estimation. J Mol Evol 54:396–402
MR, Lavidas I, Liu Z, Loveland JE, Maurel T, 8. Sawyer S (1989) Statistical tests for detecting
McLaren W, Moore B, Mudge J, Murphy gene conversion. Mol Biol Evol 6:526–538
DN, Newman V, Nuhn M, Ogeh D, Ong
CK, Parker A, Patricio M, Riat HS, 9. Semple C Wolfe KH (1999) Gene duplication
Schuilenburg H, Sheppard D, Sparrow H, and gene conversion in the Caenorhabditis
Taylor K, Thormann A, Vullo A, Walts B, elegans genome. J Mol Evol 48:555–564
Zadissa A, Frankish A, Hunt SE, 10. Doolittle WF (1999) Phylogenetic classifica-
Kostadima M, Langridge N, Martin FJ, tion and the universal tree. Science
Muffato M, Perry E, Ruffier M, Staines DM, 284:2124–2129
Trevanion SJ, Aken BL, Cunningham F, 11. Robinson DM, Jones DT, Kishino H,
Yates A, Flicek P (2018) Ensembl 2018. Goldman N, Thorne JL (2003) Protein evo-
Nucleic Acids Res 46:D754–D761 lution with dependence among codons due to
2. Casper J, Zweig AS, Villarreal C, Tyner C, tertiary structure. Mol Biol Evol
Speir ML, Rosenbloom KR, Raney BJ, Lee 20:1692–1704
CM, Lee BT, Karolchik D, Hinrichs AS, 12. Choi SC, Holboth A, Robinson DM,
Haeussler M, Guruvadoo L, Navarro Kishino H, Thorne JL (2007) Quantifying
Gonzalez J, Gibson D, Fiddes IT, the impact of protein tertiary structure on
Eisenhart C, Diekhans M, Clawson H, Barber molecular evolution. Mol Biol Evol
GP, Armstrong J, Haussler D, Kuhn RM, 24:1769–1782
Kent WJ (2018) The UCSC Genome Browser 13. Keilson J (1979) Markov Chain models-rarity
database: update 2018. Nucleic Acids Res 46: and exponentiality. Springer, New York, NY
D762–D769 14. Pollard KS, Salama SR, King B, Kern AD,
3. Lack JB, Lange JD, Tang AD, Corbett-Detig Dreszer T, Katzman S, Siepel A, Perdersen
RB, Pool JE (2016) A thousand fly genomes: JS, Berjerano G, Baertsch R, Rosenblum KR,
an expanded drosophila genome nexus. Mol Kent J, Haussler D (2006) Forces shaping the
Biol Evol 33:3308–3313 fastest evolving regions in the human
4. Weigel D, Mott R (2009) The 1001 Genomes genome. PLoS Genet 2(10):e168
Project for Arabidopsis thaliana. Genome Biol 15. Holloway AK, Begun DJ, Siepel A, Pollard K
10:107 (2008) Accelerated sequence divergence of
5. Turnbull C, Scott RH, Thomas E, Jones L, conserved genomic elements in Drosophila
Murugaesu N, Pretty FB, Halai D, Baple E, melanogaster. Genome Res 18:1592–1601
Craig C, Hamblin A, Henderson S, Patch C, 16. Miyamoto MM, Fitch WM (1995) Testing
O’Neill A, Devereaux A, Smith K, Martin AR, the covarion hypothesis of molecular evolu-
Sosinsky A, McDonagh EM, Sultana R, tion. Mol Biol Evol 12:503–513
Mueller M, Smedley D, Toms A, Dinh L, 17. Lockhart PJ, Steel MA, Barbrook AC, Huson
Fowler T, Bale M, Hubbard T, Rendon A, DH, Charleston MA, Howe CJ (1998) A
Hill S, Caulfield MJ, 100,000 Genomes Proj- covariotide model explains apparent phyloge-
ect (2018) The 100 000 Genomes Project: netic structure of oxygenic photosynthetic
bringing whole genome sequencing to the lineages. Mol Biol Evol 15:1183–1188
NHS. BMJ 361:k1687
18. Penny D, McComish BJ, Charleston MA,
6. Li R, Fan W, Tian G, Zhu H, He L, Cai J, Hendy MD (2001) Mathematical elegance
Huang Q, Cai Q, Li B, Bai Y, Zhang Z, with biochemical realism: the covarion
Zhang Y, Xuan Z, Wang W, Li J et al (2010)
394 Carolin Kosiol and Maria Anisimova
model of molecular evolution. J Mol Evol 32. Goldman N, Yang Z (1994) A codon-based
53:711–753 model of nucleotide substitution for protein-
19. Siltberg J, Liberles DA (2002) A simple coding DNA sequences. Mol Biol Evol
covarion-based approach to analyse nucleo- 11:725–736
tide substitution rates. J Evol Biol 33. Muse SV, Gaut BS (1994) A likelihood
15:588–594 approach for comparing synonymous and
20. Lichtarge O, Bourne HR, Cohen FE (1996) nonsynonymous nucleotide substitution
An evolutionary trace method defines binding rates, with application to the chloroplast
surfaces common to protein families. J Mol genome. Mol Biol Evol 11:715–724
Evol 257:342–358 34. Grantham R (1974) Amino acid difference
21. Gu X (1999) Statistical methods for testing formula to help explain protein evolution.
functional divergence after gene duplication. Science 185:862–864
Mol Biol Evol 16:1664–1674 35. Yang Z (1998) Likelihood ratio tests for
22. Armon A, Graur D, Ben-Tal N (2001) Con- detecting positive selection and application
Surf: an algorithmic tool for the identification to primate lysozyme evolution. Mol Biol
of functional regions in proteins by surface Evol 15:568–573
mapping of phylogenetic information. J Mol 36. Schneider A, Cannarozzi GM, Gonnet GH
Biol 307:447–463 (2005) Empirical codon substitution matrix.
23. Gaucher EA, Gu X, Miyamoto MM, Benner BMC Bioinformatics 6:134
SA (2002) Predicting functional divergence in 37. Kosiol C, Holmes I, Goldman N (2007) An
protein evolution by site-specific rate shifts. empirical codon model for protein sequence
Trends Biochem Sci 27:315–321 evolution. Mol Biol Evol 24:1464–1479
24. Pupko T, Galtier N (2002) A covarion-based 38. Doron-Faigenboim A, Pupko T (2007) A
method for detecting molecular adaptation: combined empirical and mechanistic codon
application to the evolution of primate mito- model. Mol Biol Evol 24:388–397
chondrial genomes. Proc Biol Sci 39. De Maio N, Holmes I, Schlötterer C, Kosiol
269:1313–1316 C (2013) Estimating empirical hidden Mar-
25. Blouin C, Boucher Y, Roger AJ (2003) Infer- kov models. Mol Biol Evol 30:725–736
ring functional constraints and divergence in 40. Whelan S, Goldman N (1999) Distributions
protein families using 3D mapping of phylo- of statistics used for the comparison of models
genetic information. Nucleic Acids Res of sequence evolution in phylogenetics. Mol
31:790–797 Biol Evol 16:1292–1299
26. Landau M, Mayrose I, Rosenberg Y, Glaser F, 41. Anisimova M, Bielawski JP, Yang Z (2001)
Martz E, Pupko T, Ben-Tal N (2005) Con- Accuracy and power of the likelihood ratio
Surf 2005: the projection of evolutionary test in detecting adaptive molecular evolution.
conservation scores of residues on protein Mol Biol Evol 18:1585–1592
structures. Nucleic Acids Res 33: 42. Kosiol C, Vinar T, Da Fonseca RR, Hubisz
W299–W302 MJ, Bustamante CD, Nielsen R, Siepel A
27. Gu X (2001) Maximum-likelihood approach (2008) Patterns of positive selection in six
for gene family evolution under functional mammalian genomes. PLoS Genet 4:
divergence. Mol Biol Evol 18:453–464 e10000144
28. Gu X (2006) A simple statistical method for 43. Anisimova M, Bielawski JP, Yang Z (2002)
estimating type-II (cluster-specific) functional Accuracy and power of bayes prediction of
divergence of protein sequences. Mol Biol amino acid sites under positive selection.
Evol 23:1937–1945 Mol Biol Evol 19:950–958
29. Bofkin L, Goldman N (2007) Variation in 44. Yang Z, Wong WS, Nielsen R (2005) Bayes
evolutionary processes at different codon empirical bayes inference of amino acid sites
positions. Mol Biol Evol 24:513–521 under positive selection. Mol Biol Evol
30. Hughes AL, Nei M (1988) Pattern of nucleo- 22:1107–1118
tide substitution at major histocompatibility 45. Yang Z, Nielsen R, Goldman N, Pedersen
complex class I loci reveals overdominant AMK (2000) Codon-substitution models for
selection. Nature 335:167–170 heterogeneous selection pressure at amino
31. Yang Z, Nielsen R (2000) Estimating synon- acid sites. Genetics 155:431–449
ymous and nonsynonymous substitution rates 46. Huelsenbeck JP, Dyer KA (2004) Bayesian
under realistic evolutionary models. Mol Biol estimation of positively selected sites. J Mol
Evol 17:32–43 Evol 58:661–672
Selection Acting on Genomes 395
47. Scheffler K, Seoighe C (2005) A Bayesian 62. Pollard KS, Hubisz MJ, Rosenbloom KR, Sie-
model comparison approach to inferring pos- pel A (2010) Detection of non-neutral substi-
itive selection. Mol Biol Evol 22:2531–2540 tution rates on mammalian phylogenies.
48. Aris-Brosou S, Bielawski JP (2006) Large- Genome Res 20:110–121
scale analyses of synonymous substitution 63. Yang Z (2007) PAML 4: phylogenetic analysis
rates can be sensitive to assumptions about by maximum likelihood. Mol Biol Evol
the process of mutation. Gene 378:58–64 24:1586–1591
49. Massingham T, Goldman N (2005) Detecting 64. Kosakovsky Pond SL, Muse SV (2005) Site-
amino acid sites under positive selection and to-site variation of synonymous substitution
purifying selection. Genetics 169:1753–1762 rates. Mol Biol Evol 22:2375–2385
50. Kosakovsky Pond SL, Posada D, Gravenor 65. Schrempf D, Minh BQ, De Maio N, von
MB, Woelk CH, Frost SD (2006) GARD: a Haeseler A, Kosiol C (2016) Reversible
genetic algorithm for recombination detec- polymorphism-aware phylogenetic models
tion. Bioinformatics 22:3096–3098 and their application to tree inference. J
51. Kosakovsky PSL, Posada D, Gravenor MB, Theor Biol 407:362–370
Woelk CH, Frost SD (2006) Automated phy- 66. Gil M, Zanetti MS, Zoller S, Anisimova M
logenetic detection of recombination using a (2013) CodonPhyML: fast maximum likeli-
genetic algorithm. Mol Biol Evol hood phylogeny estimation under codon sub-
23:1891–1901 stitution models. Mol Biol Evol
52. Felsenstein J (2004) Inferring phylogenies. 30:1270–1280
Sinauer Associates, Sunderland, MA 67. Zhang J, Nielsen R, Yang Z (2005) Evalua-
53. Yang Z, Dos Reis M (2011) Statistical proper- tion of an improved branch-site likelihood
ties of the branch-site test of positive selec- method for detecting positive selection at the
tion. Mol Biol Evol 28:1217–1228 molecular level. Mol Biol Evol 22:2472–2479
54. Anisimova M, Yang Z (2007) Multiple 68. Yang Z, Nielsen R (2002) Codon-
hypothesis testing to detect lineages under substitution models for detecting molecular
positive selection that affects only a few sites. adaptation at individual sites along specific
Mol Biol Evol 24:1219–1228 lineages. Mol Biol Evol 19:908–917
55. Kosakovsky Pond SL, Frost SD (2005) A 69. Vamathevan JJ, Hasan S, Emes RD, Amrine-
genetic algorithm approach to detecting Madsen H, Rajagopalan D, Topp SD,
lineage-specific variation in selection pressure. Kumar V, Word M, Simmons MD, Foord
Mol Biol Evol 22:478–485 SM, Sanseau P, Yang Z, Holbrook JD
56. Guindon SA, Rodrigo G, Dyer KA, Huelsen- (2008) The role of positive selection in deter-
beck JP (2004) Modeling the site-specific var- mining the molecular cause of species differ-
iation of selection patterns along lineages. ences in disease. BMC Evol Biol 8:273
Proc Natl Acad Sci U S A 101:12957–12962 70. Nozawa M, Suzuki Y, Nei M (2009) Reliabil-
57. De Maio N, Schlötterer C, Kosiol C (2013) ities of identifying positive selection by the
Linking great apes genome evolution across branch-site and site-prediction methods.
time scales using polymorphism-aware phylo- Proc Natl Acad Sci U S A 106:6700–6705
genetic models. Mol Biol Evol 30:2249–2262 71. Markova-Raina P, Petrov D (2011) High sen-
58. De Maio N, D Schrempf D, Kosiol C (2016) sitivity to aligner and high rate of false posi-
PoMo: an allele frequency-based approach for tives in the estimates of positive selection in
species tree estimation. Syst Biol 12 Drosophila genomes. Genome Res
64:1018–1031 21:863. https://doi.org/10.1101/gr.
115949.110
59. Maddison W, Knowles L (2006) Inferring
phylogeny despite incomplete lineage sorting. 72. Bakewell MA, Shi P, Zhang J (2007) More
Syst Biol 55:21–30 genes underwent positive selection in chim-
panzee than in human evolution. Proc Natl
60. Eyre-Walker A, Hurst L (2001) The evolution Acad Sci U S A 104:E97
of isochores. Nat Rev Genet 2:549–555
73. Arbiza L, Dopazo J, Dopazo H (2006) Posi-
61. Siepel A, Bejerano G, Pedersen JS, tive selection, relaxation, and acceleration in
Hinrichs A, Hou M, Rosenbloom K, the evolution of the human and chimp
Clawson H, Spieth J, Hillier LW, Richards S, genome. PLoS Comput Biol 2:e38
Weinstock GM, Wilson RK, Gibbs RA, Kent
WJ, Miller W, Haussler D (2005) Evolution- 74. Gibbs RA, Rogers J, Katze MG,
arily conserved elements in vertebrate, insect, Bumgarner R, Weinstock GM, Mardis ER,
worm, and yeast genomes. Genome Res Remington KA, Strausberg RL, Venter JC,
20:1034–1050 Wilson RK et al (2007) Evolutionary and
396 Carolin Kosiol and Maria Anisimova
biomedical insights from the macaque 88. Drummond AJ, Suchard MA (2008) Fully
genome. Science 316:222–234 Bayesian tests of neutrality using genealogical
75. Mallik S, Gnerre S, Muller P, Reich D (2010) summary statistics. BMC Genet 9:68
The difficulty of avoiding false positives in 89. Scheffler K, Martin DP, Seoighe C (2006)
genome scans for natural selection. Genome Robust inference of positive selection from
Res 19:922–933 recombining coding sequences. Bioinformat-
76. Schneider A, Souvorov A, Sabath N, ics 22:2493–2499
Landan G, Gonnet GH (2009) Estimates of 90. Rasmussen MD, Hubisz MJ, Gronau I, Siepel
positive Darwinian selection are inflated by A (2014) Genome-wide inference of ancestral
errors in sequencing, annotation, and align- recombination graphs. PLoS Genet 10(5):
ment. Genome Biol Evol 1:114–118 e1004342
77. Fletcher W, Yang Z (2010) The effect of inser- 91. Akashi H (1994) Synonymous codon usage in
tions, deletions and alignment errors on the Drosophila melanogaster: natural selection
branch-site test of positive selection. Mol Biol and translational accuracy. Genetics
Evol 27:2257–2267 136:927–935
78. Löytynoja A, Goldman N (2005) An algo- 92. Chamary JV, Parmley JL, Hurst LD (2006)
rithm for progressive multiple alignment of Hearing silence: non-neutral evolution at syn-
sequences with insertions. Proc Natl Acad onymous sites in mammals. Nat Rev Genet
Sci U S A 102:10557–10562 7:98–108
79. Löytynoja A, Goldman N (2008) Phylogeny- 93. Ngandu N, Scheffler K, Moore P,
aware gap placement prevents error in Woodman Z, Martin D, Seoighe C (2009)
sequence alignment and evolutionary analysis. Extensive purifying selection acting on synon-
Science 320:1632–1635 ymous sites in HIV-1 Groug M sequences.
80. Jordan G, Goldman N (2012) The effects of Virol J 5:160
alignment error and alignment filtering on the 94. Roth A, Anisimova M, Cannarozzi GM
sitewise detection of positive selection. Mol (2012) Measuring codon usage bias. Codon
Biol Evol 29:1125–1139 evolution: mechanisms and models. Oxford
81. Penn O, Privman E, Landan G, Graur D, University Press, New York, NY
Pupko T (2010) An alignment confidence 95. Resch AM, Carmel L, Marino-Ramirez L,
score capturing robustness to guide tree Ogurtsov AY, Shabalina SA, Rogozin IB,
uncertainty. Mol Biol Evol 27:1759–1767 Koonin EV (2007) Widespread positive selec-
82. Duret L, Semon M, Piganeau G, tion in synonymous sites of mammalian genes.
Mouchiroud D, Galtier N (2002) Vanishing Mol Biol Evol 24:1821–1831
GC-rich isochores in mammalian genomes. 96. Cannarozzi GM, Faty M, Schraudolph NN,
Genetics 162:1837–1847 Roth A, von Rohr P, Gonnet P, Gonnet GH,
83. Meunier J, Duret L (2004) Recombination Barral Y (2010) A role for codons in transla-
drives the evolution of GC content in the tional dynamics. Cell 141:355–367
human genome. Mol Biol Evol 21:984–990 97. Hurst LD, Pál C (2001) Evidence of purifying
84. Berglund J, Pollard KS, Webster MT (2009) selection acting on silent sites in BRCA1.
Hotspots of biased nucleotide substitutions in Trends Genet 17:62–65
human genes. PLoS Biol 7:e26 98. Chamary JV, Hurst LD (2005) Biased usage
85. Ratnakumar A, Mousset S, Glemin S, near intron-exon junctions: selection on splic-
Berglund J, Galtier N, Duret L, Webster MT ing enhancers, splice site recognition or some-
(2010) Detecting positive selection within thing else? Trends Genet 21:256–259
genomes: the problem of biased gene conver- 99. Gu W, Wang X, Zhai C, Xie X, Zhou T (2012)
sion. Phil Trans R Soc B 365:2571–2580 Selection on synonymous sites for increased
86. Anisimova M, Nielsen R, Yang Z (2003) accessibility around miRNA binding sites in
Effect of recombination on the accuracy of plants. Mol Biol Evol 29:3037–3044
the likelihood method for detecting positive 100. Garcia V, Anisimova M (2018) Accounting
selection at amino acid sites. Genetics for programmed ribosomal frameshifting in
164:1229–1236 the computation of codon usage bias indices.
87. Martin DP, Williamson C, Posada D (2005) G3 (Bethesda) 8:3173
RDP2: recombination detection and analysis 101. Komar AA (2008) Protein translational rates
of sequence alignments. Bioinformatics and protein misfolding: is there any link? In:
21:260–262 O’Doherty CB, Byrne AC (eds) Protein
Selection Acting on Genomes 397
misfolding: new research. Nova Science Pub- 110. Xing Y, Lee C (2006) Alternative splicing and
lisher Inc, New York, NY RNA selection pressure--evolutionary conse-
102. Kimichi-Sarfaty C, Oh JM, Kim IW, Sauna quences for eukaryotic genomes. Nat Rev
ZE, Calcagno AM, Ambudkar SV, Gottesman Genet 7:499–509
MM (2007) A silent polymorphism in the 111. Zhou T, Gu W, Wilke CO (2010) Detecting
MDR1 gene changes substrate specificity. Sci- positive and purifying selection at synony-
ence 315:525–528 mous sites in yeast and worm. Mol Biol Evol
103. Nackley AG, Shabalina SA, Tchivileva IE, 27:1912–1922
Satterfield K, Korchynskyi O, Makarov SS, 112. Wong WSW, Nielsen R (2004) Detecting
Maixner W, Diatchenko L (2006) Human selection in non-coding regions of nucleotide
catechol-O-methyltransferase haplotypes sequences. Genetics 167:949–958
modulate protein expression by altering 113. Nielsen R, Yang Z (2003) Estimating the
mRNA secondary structure. Science distribution of selection coefficients from
314:1930–1933 phylogenetic data with applications to mito-
104. Mayrose I, Doron-Faigenboim A, chondrial and viral DNA. Mol Biol Evol
Bacharach E, Pupko T (2007) Towards realis- 20:1231–1239
tic codon models: among site variability and 114. Nielsen R, Bauer DuMont VL, Hubisz MJ,
dependency of synonymous and Aquadro CF (2007) Maximum likelihood
non-synonymous rates. Bioinformatics 23: estimation of ancestral codon usage bias para-
i319–i327 meters in Drosophila. Mol Biol Evol
105. Dimitrieva S, Anisimova M (2014) Unravel- 24:228–235
ing patterns of site-to-site synonymous rates 115. Yang Z, Nielsen R (2008) Mutation-selection
variation and associated gene properties of models of codon substitution and their use to
protein domains and families. PLoS One 9 estimate selective strengths on codon usage.
(7):e102721 Mol Biol Evol 25:568–579
106. Martincorena I, Raine KM, Gerstung M, 116. Gilchrist MA (2007) Combining models of
Dawson KJ, Haase K, Van Loo P, Davies H, protein translation and population genetics to
Stratton MR, Campbell PJ (2017) Cell predict protein production rates from codon
171:1029–1041.e21 usage patterns. Mol Biol Evol 24:2362–2372
107. Rubinstein ND, Doron-Faigenboim A, May- 117. Gilchrist MA, Shah P, Zaretzki R (2009)
rose I, Pupko T (2011) Evolutionary models Measuring and detecting molecular adapta-
accounting for layers of selection in protein- tion in codon usage against nonsense errors
coding genes and their impact on the infer- during protein translation. Genetics
ence of positive selection. Mol Biol Evol 183:1493–1505
28:3297–3308 118. Kubatko L, Shah P, Herbei R, Gilchrist MA
108. Yang Z (2006) Computational molecular (2016) A codon model of nucleotide substi-
evolution. Oxford University Press, tution with selection on synonymous codon
New York, NY usage. Mol Phylogenet Evol 94:290–297
109. Anisimova M, Liberles DA (2012) Detecting 119. Kimura M, Ohta T (1969) The average num-
and understanding natural selection. Codon ber of generations until fixation of a mutant
evolution: mechanisms and models. Oxford gene in a finite population. Genetics
University Press, New York, NY 61:763–771
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 13
Abstract
Codon substitution models (CSMs) are commonly used to infer the history of natural section for a set of
protein-coding sequences, often with the explicit goal of detecting the signature of positive Darwinian
selection. However, the validity and success of CSMs used in conjunction with the maximum likelihood
(ML) framework is sometimes challenged with claims that the approach might too often support false
conclusions. In this chapter, we use a case study approach to identify four legitimate statistical difficulties
associated with inference of evolutionary events using CSMs. These include: (1) model misspecification,
(2) low information content, (3) the confounding of processes, and (4) phenomenological load, or
PL. While past criticisms of CSMs can be connected to these issues, the historical critiques were often
misdirected, or overstated, because they failed to recognize that the success of any model-based approach
depends on the relationship between model and data. Here, we explore this relationship and provide a
candid assessment of the limitations of CSMs to extract historical information from extant sequences. To
aid in this assessment, we provide a brief overview of: (1) a more realistic way of thinking about the process
of codon evolution framed in terms of population genetic parameters, and (2) a novel presentation of the
ML statistical framework. We then divide the development of CSMs into two broad phases of scientific
activity and show that the latter phase is characterized by increases in model complexity that can sometimes
negatively impact inference of evolutionary mechanisms. Such problems are not yet widely appreciated by
the users of CSMs. These problems can be avoided by using a model that is appropriate for the data; but,
understanding the relationship between the data and a fitted model is a difficult task. We argue that the only
way to properly understand that relationship is to perform in silico experiments using a generating process
that can mimic the data as closely as possible. The mutation-selection modeling framework (MutSel) is
presented as the basis of such a generating process. We contend that if complex CSMs continue to be
developed for testing explicit mechanistic hypotheses, then additional analyses such as those described in
here (e.g., penalized LRTs and estimation of PL) will need to be applied alongside the more traditional
inferential methods.
Key words Codon substitution model, dN/dS, False positives, Maximum likelihood, Mechanistic
model, Model misspecification, Mutation-selection model, Parameter confounding, Phenomenologi-
cal load, Phenomenological model, Positive selection, Reliability, Statistical inference, Site-specific
fitness landscape
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_13, © The Author(s) 2019
399
400 Christopher T. Jones et al.
1 Introduction
1
The original CSM proposed by Goldman and Yang [16] was in fact quite complex in that it adjusted substitution
rates between nonsynonymous codons to account for differences in physicochemical properties using the
Grantham matrix [17]. This approach was later abandoned in favor of the simpler formulation now known as
M0 [44], e.g., the first M-series model [81].
402 Christopher T. Jones et al.
2 Conceptual Foundations
2.1 How Should We A codon substitution model represents an attempt to explain the
Think About the way a target protein-coding gene changed over time by a combina-
Alignment-Generating tion of mutation, selection (purifying as well as adaptive), and drift.
Process? Adaptive evolution occurs at each site within a protein in response
to a hierarchy of effects, including, but not limited to, changes in
the network of the protein’s interactions, changes in the functional
properties of that network, and changes in both the cellular and
organismal environment over time. The result of the complex
interplay between these effects is typically viewed through the
narrow lens of an alignment of homologous sequences X obtained
from extant species, possibly accompanied by a tree topology τ (for
our purposes, it is always assumed that τ is known). The informa-
tion contained in X is evidently insufficient to resolve all of the
Looking for Darwin in Genomic Sequences. . . 403
2
Single-nucleotide (SN) mutations that are nonsynonymous occur more frequently than those that are synony-
mous due to idiosyncrasies in the genetic code. This is accounted for in the formulation of dN and dS, so that dN
can be interpreted as the proportion of nonsynonymous SN mutations that are fixed. Likewise, dS is the
proportion of synonymous SN mutations that are fixed. See Jones et al. [25] for a discussion of various
interpretations of dN/dS.
404 Christopher T. Jones et al.
2.2 What Is the CSMs have become increasingly complex with the addition of more
Objective of Model free parameters since the introduction of the M-series models in
Building? Yang et al. [81]. The prima facie objective of this trend is to
produce models that provide better mechanistic explanations of
the data. The assumption is that this will lead to more accurate
inferences about evolutionary processes, particularly as the volume
of genetic data increases [35]. However, the significance of a new
model parameter is assessed by a comparison of site-pattern distri-
butions without reference to mechanism. Combined with the pos-
sibility of confounding, this feature of the ML framework means
that the objective of improving model fit does not necessarily
coincide with the objective of providing a better representation of
the mechanisms of the true generating process.
Given any CSM with parameters θM, it is possible to compute a
vector P that assigns a probability to each of the 61N possible site
patterns for an N-taxon alignment (i.e., a multinomial distribution
Looking for Darwin in Genomic Sequences. . . 405
with another process that did occur, the LLR in Eq. 3 can still be
significant. Under this scenario, the addition of ψ to M1 would
engender movement toward P S ðθ^S Þ and PGP, but the new model
M2 would also provide a worse mechanistic explanation of the data
because it would falsely indicate that E occurred. The possibility of
confounding and its impact on inference is demonstrated in Case
Study D.
3.1 Case Study A: To study the impact of low information content on inference, we
Low Information use a pair of nested M-series models known as M1a and M2a
Content [70, 82]. Under M1a, sites are partitioned into two rate-ratio
categories, 0 < ω0 < 1 and ω1 ¼ 1 in proportions p0 and p1 ¼ 1
p0. M2a includes an additional category for the proportion of
sites p2 ¼ 1 p0 p2 that evolved under positive selection with
Looking for Darwin in Genomic Sequences. . . 409
3.1.1 Irregularity and Issues associated with low information content can be made worse
Penalized Likelihood by violations of certain regularity conditions. For example, M2a is
the same as M1a but for two extra parameters, p2 and ω2. Usual
likelihood theory would therefore predict that the limiting distri-
bution of the LLR is χ 22 . However, this result is valid only if the
regularity conditions hold. Among these conditions is that the null
model is not obtained by placing parameters of the alternate model
on the boundary of parameter space. Since M1a is the same as M2a
but with p2 ¼ 0, this condition is violated. The same can be said for
many nested pairs of Phase I CSMs, such as M7 vs M8 [81] or M1
vs branch-site Model A [79]. Although the theoretical limiting
distribution of the LLR under some irregular conditions has been
determined by Self and Liang [54], those results do not include
cases where one of the model parameters is unidentifiable under the
null [2]. Since M1a is M2a with p2 ¼ 0, the likelihood under M1a
is the same for any value of ω2. This makes ω2 unidentifiable under
the null. The limiting distribution for the M1a vs M2a contrast is
therefore unknown [74].
A penalized likelihood ratio test (PLRT; [39]) has been pro-
posed to mitigate problems associated with unidentifiable para-
meters. Under this method, the likelihood function for the
alternate model (e.g., M2a) is modified so that values of p2 closer
to zero are penalized. This has the effect of drawing the MLE for p2
away from the boundary, and can interpreted as a way to “regular-
ize” the model. PLRT seems to be more useful in cases where
the analysis of a real alignment produces a small value of p^2 accom-
panied by an unrealistically large value of ω ^ 2 . This can happen
because ω ^ 2 is influenced by fewer and fewer site patterns as p^2
approaches zero, and is therefore subject to larger and larger sam-
pling errors. In addition, ω ^ 2 and p^2 tend to be negatively correlated,
which further contributes to the large sampling errors. For exam-
ple, Mingrone et al. [39] found that M2a fitted to a 5-taxon
alignment with 198 codon sites without penalization gave
ðp^2 ,ω
^ 2 Þ ¼ ð0:01,34:70Þ. These MLEs, if taken at face value, sug-
gest that a small number of sites in the gene underwent positive
412 Christopher T. Jones et al.
3.2 Case Study B: The mechanisms that give rise to the diversity of site patterns in a
Model set of homologous genes are highly complex and not fully under-
Misspecification stood. CSMs are therefore necessarily simplified representations of
the true generating process, and are in this sense misspecified. The
extent to which misspecification might cause an omnibus LRT to
falsely detect positive selection was of primary concern during
Phase I of model development. We use a particular form of the
YN-BSM called Model A [79] to illustrate this issue. In its original
form, the omnibus LRT assumes a null under which a proportion
p0 of sites evolved under stringent selection with ω0 ¼ 0 and the
remaining sites evolved under a neutral regime with ω1 ¼ 1 on all
branches of the tree (i.e., model M1 in [44]). This is contrasted
with Model A, which is the same as M1 except that it assumes that
some stringent sites and some neutral sites evolved under positive
selection with ω2 > 1 on a prespecified branch called the fore-
ground branch. The omnibus test contrasting M1 with Model A
was therefore designed to detect a subset of sites that evolved
adaptively on the same branch of the tree.
During this period of model development, the standard
method to test the impact of misspecification on the reliability of
Looking for Darwin in Genomic Sequences. . . 413
Table 1
Rate ratios (ω) for regimes X and Z taken from Zhang [85]
Sites 1–20 21–40 41–60 61–80 81–100 101–120 121–140 141–160 161–180 181–200
ω regime X 1.00 1.00 0.80 0.80 0.50 0.50 0.20 0.20 0.00 0.00
ω regime Z 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
4.1 Case Study C: By expressing the codon substitution process in terms of explicit
Confounding population genetic parameters, the MutSel framework facilitates
the investigation of complex evolutionary dynamics, such as shift-
ing balance on a fixed fitness landscape or adaptation to a change in
selective constraints (i.e., a peak shift; [6, 25]) that are missing from
alignments generated using traditional methods. Specifically, by
assigning a different vector of fitness coefficients for the 20 amino
acids to each site, MutSel can generate more variation in rate ratio
across sites and over time than has been realized in the past simula-
tion studies (e.g., Table 1). In this way, MutSel provides the basis of
a generating model that can be adjusted to produce alignments that
closely mimic real data [26]. MutSel therefore serves to connect
demonstrably plausible evolutionary dynamics to the pathology we
refer to as confounding.
Under MutSel, the dynamic regime at the hth codon site (e.g.,
shifting balance, neutral, nearly neutral, or adaptive evolution) is
uniquely specified by a vector of fitness coefficients
f h ¼ f h1 , . . . , f hm . It is generally assumed that mutation to any of
the three stop codons is lethal, so m ¼ 61 for nuclear genes and
m ¼ 60 for mitochondrial genes. And, although it is not a require-
ment, it is typical to assume that the f hj are constant across synony-
mous codons [25, 57]. Given fh, the elements of a site-specific
instantaneous rate matrix Ah can be defined as follows for all i 6¼ j
(cf. Eq. 1):
8
>
> μ if s hij ¼ 0
< ij h
A hij / μ s ij ð5Þ
> otherwise
> ij
: 1 exp s h
ij
Fig. 3 Fitness coefficients for the 20 amino acids were drawn from a normal
distribution centered at zero and with standard deviation σ ¼ 0.001. Bars show
the resulting stationary frequencies (a proxy for fitness) sorted from largest to
smallest. They compose a metaphorical site-specific landscape over which the
site is imagined to move. The solid red line shows the codon-specific rate ratio
d N hi =d S hi for the sorted codons. This varies depending on the codon currently
occupying the site, and can be greater than one following a chance substitution
into the tail (to the right) of the landscape. In this case, the codon-specific rate
ratio for the site ranged from 0.21 to 4.94 with a temporally averaged site-
specific rate ratio of dNh/dSh ¼ 0.52
Fig. 4 The box plot depicts the distribution of the phenomenological load
(PL) carried by ð^ α , β^ Þ produced by fitting the RaMoSS vs RaMoSSwDT
contrast to 50 alignments generated under MutSel-mmtDNA: the circles
represent outliers of this distribution. The diamond is the percent reduction in
deviance for the same parameters estimated by fitting RaMoSS vs RaMoSSwDT
to the real mtDNA alignment
3
It has previously been noted that the rapid fixation of compensatory mutations following substitution to an
unstable base pair (e.g., AT!GT!GC) can also produce site patterns that suggest fixation of DT mutations [74,
p. 46].
Looking for Darwin in Genomic Sequences. . . 421
5 Discussion
References
1. Anisimova M, Kosiol C (2009) Investigating 9. Fisher R (1930) The distribution of gene ratios
protein-coding sequence evolution with prob- for rare mutations. Proc R Soc Edinb
abilistic codon substitution models. Mol Biol 50:205–220
Evol 26:255–271 10. Forsberg R, Christiansen FB (2003) A codon-
2. Anisimova M, Bielawski JP, Yang ZH (2001) based model of host-specific selection in para-
Accuracy and power of the likelihood ratio test sites, with an application to the influenza a
in detecting adaptive molecular evolution. Mol virus. Mol Biol Evol 20:1252–1259
Biol Evol 18:1585–1592 11. Friedman R, Hughes AL (2007) Likelihood-
3. Anisimova M, Bielawski JP, Yang ZH (2002) ratio tests for positive selection of human and
Accuracy and power of Bayes prediction of mouse duplicate genes reveal nonconservative
amino acid sites under positive selection. Mol and anomalous properties of widely used meth-
Biol Evol 19:950–958 ods. Mol Phylogenet Evol 542:388–393
4. Bielawski JP, Yang ZH (2004) A maximum 12. Galtier N (2001) Maximum-likelihood phylo-
likelihood method for detecting functional genetic analysis under a covarion-like model.
divergence at individual codon sites, with appli- Mol Biol Evol 18:866–873
cation to gene family evolution. J Mol Evol 13. Gaston D, Susko E, Roger AJ (2011) A phylo-
59:121–132 genetic mixture model for the identification of
5. De Maio N, Holmes I, Schlötterer C, Kosiol C functionally divergent protein residues. Bioin-
(2013) Estimating empirical codon hidden formatics 27:2655–2663
Markov models. Mol Biol Evol 30:725–736 14. Gibbs RA (2007) Evolutionary and biomedical
6. dos Reis M (2013). http://arxiv:1311.6682v1. insights from the Rhesus macaque genome.
Last accessed 26 Nov 2013 Science 316:222–234
7. dos Reis M (2015) How to calculate the 15. Goldman N (1993) Statistical tests of models
non-synonymous to synonymous rate ratio of DNA substitution. J Mol Evol 36:182–198
protein-coding genes under the Fisher-Wright 16. Goldman N, Yang ZH (1994) Codon-based
mutation-selection framework. Biol Lett model of nucleotide substitution for protein-
11:1–4. coding DNA-sequences. Mol Biol Evol
8. Field SF, Bulina MY, Kelmanson IV, Bielawski 11:725–736
JP, Matz MV (2006) Adaptive evolution of 17. Grantham R (1974) Amino acid difference for-
multicolored fluorescent proteins in reef- mula to help explain protein evolution. Science
building corals. J Mol Evol 62:332–339 862–864
424 Christopher T. Jones et al.
18. Guindon S, Rodrigo AG, Dyer KA, Huelsen- 33. Kosiol C, Vinař T, daFonseca RR, Hubisz MJ,
beck JP (2004) Modeling the site-specific vari- Bustamante CD, Nielsen R, Siepel A (2008)
ation of selection patterns along lineages. Proc Patterns of positive selection in six mammalian
Natl Acad Sci USA 101:12957–12962 genomes. PLoS Genet 4:1–17
19. Halpern AL, Bruno WJ (1998) Evolutionary 34. Lartillot N, Philippe H (2004) A Bayesian mix-
distances for protein-coding sequences: model- ture model for across-site heterogeneities in the
ing site-specific residue frequencies. Mol Biol amino-acid replacement process. Mol Biol Evol
Evol 15:910–917 21:1095–1109
20. Harris K, Nielsen R (2014) Error-prone poly- 35. Liberles DA, Teufel AI, Liu L, Stadler T (2013)
merase activity causes multinucleotide muta- On the need for mechanistic models in compu-
tions in humans. Genome Res 9:1445–1554 tational genomics and metagenomics. Genome
21. Huelsenbeck JP, Dyer KA (2004) Bayesian esti- Biol Evol 5:2008–2018
mation of positively selected sites. J Mol Evol 36. Lopez P, Casane D, Phillipe H (2002) Hetero-
58:661–672 tachy, and important process of protein evolu-
22. Hughes AL (2007) Looking for Darwin in all tion. Mol Biol Evol 19:1–7
the wrong places: the misguided quest for pos- 37. Lu A, Guindon S (2013) Performance of stan-
itive selection at the nucleotide sequence level. dard and stochastic branch-site models for
Heredity 99:364–373 detecting positive selection among coding
23. Hughes AL, Friedman R (2008) Codon-based sequences. Mol Biol Evol 31:484–495
tests of positive selection, branch lengths, and 38. Mingrone J, Susko E, Bielwaski JP (2016)
the evolution of mammalian immune system Smoothed bootstrap aggregation for assessing
genes. Immunogenetics 60:495–506 selection pressure at amino acid sites. Mol Biol
24. Hughes AL, Nei M (1988) Pattern of nucleo- Evol 33:2976–2989
tide substitution at major histocompatibility 39. Mingrone J, Susko E, Bielwaski JP (2018)
complex class-1 loci reveals overdominant Modified likelihood ratio tests for positive
selection. Nature 335:167–170 selection (submitted). Bioinformatics, Advance
25. Jones CT, Youssef N, Susko E, Bielawski JP Access https://doi.org/10.1093/bioinformat
(2017) Shifting balance on a static mutation- ics/bty1019
selection landscape: a novel scenario of positive 40. Miyazawa S (2011) Advantages of a mechanis-
selection. Mol Biol Evol 34:391–407 tic codon substitution model for evolutionary
26. Jones CT, Youssef N, Susko E, Bielawski JP analysis of protein-coding sequences. PLoS
(2018) Phenomenological load on model para- ONE 6:20
meters can lead to false biological conclusions. 41. Murrell B, Weaver S, Smith MD, Wertheim JO,
Mol Biol Evol 35:1473–1488 Murrell S, Aylward A, Eren K, Pollner T, Mar-
27. Keightley P, Trivedi U, Thomson M, Oliver F, tin DP, Smith DM, Scheffler K, Pond SLK
Kumar S, Blaxter M (2009) Analysis of the (2015) Gene-wide identification of episodic
genome sequences of three Drosophila mela- selection. Mol Biol Evol 32:1365–1371
nogaster spontaneous mutation accumulation 42. Muse SV, Gaut BS (1994) A likelihood
lines. Genet Res 19:1195–1201 approach for comparing synonymous and non-
28. Kimura M (1962) On the probability of fixa- synonymous nucleotide substitution rates, with
tion of mutant genes in a population. Genetics applications to the chloroplast genome. Mol
47:713–719 Biol Evol 11:715–724
29. Kosakovsky Pond SL, Frost SDW (2005) Not 43. Nei M, Gojobori T (1986) Simple methods for
so different after all: a comparison of methods estimating the numbers of synonymous and
for detecting amino acid sites under selection. nonsynonymous nucleotide substitutions.
Mol Biol Evol 22:1208–1222 Mol Biol Evol 3:418–426
30. Kosakovsky Pond SL, Muse SV (2007) Site-to- 44. Nielsen R, Yang ZH (1998) Likelihood models
site variations of synonymous substitution for detecting positively selected amino acid
rates. Mol Biol Evol 22:2375–2385 sites and applications to the HIV-1 envelope
31. Kosakovsky Pond SL, Murrell B, Fourment M, gene. Genetics 148:929–936
Frost SDW, Delport W, Scheffler K (2011) A 45. Nielsen R, Yang Z (2003) Estimating the dis-
random effects branch-site model for detecting tribution of selection coefficients from phylo-
episodic diversifying selection. Mol Biol Evol genetic data with applications to mitochondrial
28:3033–3043 and viral DNA. Mol Biol Evol 20:1231–1239
32. Kosiol C, Holmes I, Goldman N (2007) An 46. Nozawa M, Suzuki Y, Nei M (2009) Reliabil-
empirical codon model for protein sequence ities of identifying positive selection by the
evolution. Mol Biol Evol 24:1464–1479 branch-site and the site-prediction methods.
Proc Natl Acad Sci USA 106:6700–6705
Looking for Darwin in Genomic Sequences. . . 425
47. Pagel M, Meade A (2004) A phylogenetic mix- 61. Suzuki Y, Nei M (2001) Reliabilities of
ture model for detecting pattern-heterogeneity parsimony-based and likelihood-based meth-
in gene sequence or character-state data. Syst ods for detecting positive selection at single
Biol 53:571–581 amino acid sites. Mol Biol Evol 18:2179–2185
48. Rodrigue N, Lartillot N (2014) Site- 62. Suzuki Y, Nei M (2002) Simulation study of
heterogeneous mutation-selection models the reliability and robustness of the statistical
with the PhyloBayes-MPI package. Bioinfor- methods for detecting positive selection at sin-
matics 30:1020–1021 gle amino acid sites. Mol Biol Evol
49. Rodrigue N, Lartillot N (2016) Detection of 19:1865–1869
adaptation in protein-coding genes using a 63. Suzuki Y, Nei M (2004) False-positive selec-
Bayesian site-heterogeneous mutation-selec- tion identified by ML-based methods: exam-
tion codon substitution model. Mol Biol Evol ples from the Sig1 gene of the diatom
34:204–214 Thalassiosira weissflogii and the tax gene of
50. Rodrigue N, Philippe H, Lartillot N (2010) the human T-cell lymphotropic virus. Mol
Mutation-selection models of coding sequence Biol Evol 21:914–921
evolution with site-heterogeneous amino acid 64. Tamuri AU, dos Reis M, Goldstein RA (2012)
fitness profiles. Proc Natl Acad Sci USA Estimating the distribution of selection coeffi-
107:4629–4634 cients from phylogenetic data using sitewise
51. Rubinstein ND, Doron-Faigenboim A, May- mutation-selection models. Genetics
rose I, Pupko T (2011) Evolutionary model 190:1101–1115
accounting for layers of selection in protein- 65. Tamuri AU, Goldman N, dos Reis M (2014) A
coding genes and their impact on the inference penalized-likelihood method to estimate the
of positive selection. Mol Biol Evol distribution of selection coefficients from phy-
28:3297–3308 logenetic data. Genetics 197:257–271
52. Sawyer SL, Emerman M, Malik HS (2007) 66. Wang H, Li K, Susko E, Rodger AJ (2008) A
Discordant evolution of the adjacent antiretro- class frequency mixture model that adjusts for
viral genes trim22 and trim5 in mammals. site-specific amino acid frequencies and
PLoS Pathog 3:e197 improves inference of protein phylogeny.
53. Schrider D, Hourmozdi J, Hahn M (2014) BMC Evol Biol 8:1–13
Pervasive multinucleotide mutational events 67. Wertheim JO, Murrell B, Smith MD, Pond
in eukaryotes. Curr Biol 21:1051–1054 SLK, Scheffler K (2014) Relax: detecting
54. Self SG, Liang KY (1987) Asymptotic proper- relaxed selection in a phylogenetic framework.
ties of maximum likelihood estimators and like- Mol Biol Evol 32:820–832
lihood ratio test under nonstandard 68. Whelan S, Goldman N (2004) Estimating the
conditions. J Am Stat Assoc 82:605–610 frequency of events that cause multiple-
55. Smith MD, Wertheim JO, Weaver S, Murrell B, nucleotide changes. Genetics 167:2027–2043
Scheffler K, Pond SLK (2015) Less is more: an 69. Whelan S, de Bakker PIW, Quevillon E,
adaptive branch-site random effects model for Rodriguez N, Goldman N (2006) Pandit: an
efficient detection of episodic diversifying evolution-centric database of protein and asso-
selection. Mol Biol Evol 32:1342–1353 ciated nucleotide domains with inferred trees.
56. Spielman S, Wilke CO (2015) Pyvolve: a flexi- Nucleic Acids Res 34(Database issue):
ble Python module for simulating sequences D327–D331
along phylogenies. PLoS ONE 10:1–7 70. Wong WSW, Yang ZH, Goldman N, Nielsen R
57. Spielman S, Wilke CO (2015) The relationship (2004) Accuracy and power of statistical meth-
between dN/dS and scaled selection coeffi- ods for detecting adaptive evolution in protein
cients. Mol Biol Evol 34:1097–1108 coding sequences and for identifying positively
58. Spielman S, Wilke CO (2016) Extensively para- selected sites. Genetics 168:1041–1051
meterized mutation-selection models reliably 71. Yang ZH (2005) The power of phylogenetic
capture site-specific selective constraints. Mol comparison in revealing protein function. Proc
Biol Evol 33:2990–3001 Natl Acad Sci USA 102:3179–3180
59. Struder RA, Robinson-Rechavi M (2009) Evi- 72. Yang ZH (2006) On the varied pattern of evo-
dence for an episodic model of protein lution in 2 fungal genomes: a critique of
sequence evolution. Biochem Soc Trans Hughes and Friedman. Mol Biol Evol
37:783–786 23:2279–2282
60. Suzuki Y (2008) False-positive results obtained 73. Yang ZH (2007) PAML4: phylogenetic analy-
from the branch-site test of positive selection. sis by maximum likelihood. Mol Biol Evol
Genes Genet Syst 83:331–338 24:1586–1591
426 Christopher T. Jones et al.
74. Yang ZH (2014) Molecular evolution: a statis- 81. Yang ZH, Nielsen R, Goldman N, Pedersen
tical approach. Oxford University Press, AMK (2000) Codon-substitution models for
Oxford heterogeneous selection pressure at amino
75. Yang ZH (2017) PAML: phylogenetic analysis acid sites. Genetics 155:431–449
by maximum likelihood. http://abacus.gene. 82. Yang ZH, Wong SWS, Nielsen R (2005) Bayes
ucl.ac.uk/software/pamlDOC.pdf empirical Bayes inference of amino acid sites
76. Yang ZH, Bielawski JP (2000) Statistical meth- under positive selection. Mol Biol Evol
ods for detecting molecular adaptation. Trends 22:1107–1118
Ecol Evol 15:496–503 83. Zaheri M, Dib L, Salamin N. (2014) A
77. Yang ZH, dos Reis M (2011) Statistical proper- generalized mechanistic codon model. Mol
ties of the branch-site test of positive selection. Biol Evol 31:2528–2541
Mol Biol Evol 28:1217–1228 84. Zhai W, Nielsen R, Goldman N, Yang ZH
78. Yang ZH, Nielsen R (1998) Synonymous and (2012) Looking for Darwin in genomic
nonsynonymous rate variation in nuclear genes sequences – validity and success of statistical
of mammals. J Mol Evol 46: 409–418 methods. Mol Biol Evol 20:2889–2893
79. Yang ZH, Nielsen R (2002) Codon- 85. Zhang J (2004) Frequent false detection of
substitution models for detecting molecular positive selection by the likelihood method
adaptation at individual sites along specific with branch-site models. Mol Biol Evol
lineages. Mol Biol Evol 19:908–917 21:1332–1339
80. Yang ZH, Nielsen R (2007) Mutation- 86. Zhang J, Nielsen R, Yang ZH (2005) Evalua-
selection models of codon substitution and tion of an improved branch-site likelihood
their use to estimate selective strengths on method for detecting positive selection at the
codon usage. Mol Biol Evol 25:568–579 molecular level. Mol Biol Evol 22:2472–2479
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 14
Abstract
Natural selection is a fundamental force shaping organismal evolution, as it both maintains function and
enables adaptation and innovation. Viruses, with their typically short and largely coding genomes, experi-
ence strong and diverse selective forces, sometimes acting on timescales that can be directly measured.
These selection pressures emerge from an antagonistic interplay between rapidly changing fitness require-
ments (immune and antiviral responses from hosts, transmission between hosts, or colonization of new host
species) and functional imperatives (the ability to infect hosts or host cells and replicate within hosts).
Indeed, computational methods to quantify these evolutionary forces using molecular sequence data were
initially, dating back to the 1980s, applied to the study of viral pathogens. This preference largely emerged
because the strong selective forces are easiest to detect in viruses, and, of course, viruses have clear
biomedical relevance. Recent commoditization of affordable high-throughput sequencing has made it
possible to generate truly massive genomic data sets, on which powerful and accurate methods can yield
a very detailed depiction of when, where, and (sometimes) how viral pathogens respond to various selective
forces.
Here, we present recent statistical developments and state-of-the-art methods to identify and characterize
these selection pressures from protein-coding sequence alignments and phylogenies. Methods described
here can reveal critical information about various evolutionary regimes, including whole-gene selection,
lineage-specific selection, and site-specific selection acting upon viral genomes, while accounting for
confounding biological processes, such as recombination and variation in mutation rates.
Key words Virus evolution, Molecular evolution, Recombination, Positive selection, Relaxed selec-
tion, Phylogenetics, Codon models
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_14, © The Author(s) 2019
427
428 Stephanie J. Spielman et al.
context using codon models [1, 3]. A typical analysis on viral gen-
omes, for example, might be performed for a single gene repre-
sented by isolates from different individuals (e.g., sequences from
many HIV-1-infected hosts) or from different hosts (e.g., primate
lentiviruses).
In the context of codon models, selection is typically measured
using dN/dS (also referred to as ω, or Ka/Ks), which represents
the ratio of the non-synonymous evolutionary rate (dN) to the
synonymous evolutionary rate (dS). The synonymous evolutionary
rate is used to provide a baseline rate of neutral evolution because
the average selective effect of a synonymous substitution is assumed
to be negligible compared to the effect of a non-synonymous
substitution. 1 The selective regime can be deduced by establishing,
with a degree of statistical confidence, that dN/dS differs from
unity, i.e., the neutral expectation where dN/dS ¼ 1. Diversifying,
balancing, or (sometimes) directional selection yields dN/dS > 1,
whereas purifying selection effects dN/dS < 1. Comparative meth-
ods for selection detection estimate dN/dS, or dS and dN sepa-
rately, at sites and/or branches and perform a statistical test to
establish on which side of the neutral expectation the inferences
fall. As with any statistical procedure applied to finite data, each
inference can be a false positive or a false negative, although meth-
ods typically take care to control the rates of both.
While the question “Is this gene under selection?” is an obvious
one, the nearly universally applicable answer to this question is
“yes.’. That is because a functional gene is (or has been) subject
to some form of selection, e.g., negative selection to maintain
essential features. On the other extreme is the question that has
an immediate biological significance: “Is changing a leucine to an
arginine at position 209 in gene X along a specific branch in the
phylogeny adaptive?”. Without additional information, such as a
carefully experimentally measured fitness impact of introducing
said substitution, current comparative sequence approaches cannot
answer this question. Indeed, such a scenario presents a sample size
of one, which cannot be statistically meaningful.
In this chapter, we present a collection of statistical methods,
each of which is designed to carefully address a biological question
somewhere on the spectrum between the two extremes: sufficiently
specific to be interesting, yet general enough to be answerable
based only on the evolutionary history of homologous sequences.
We will not discuss the technical details of codon substitution
methods here (for details, please see one of the excellent available
reviews Anisimova and Kosiol [1], Delport et al. [3], Yang [44], or
the primary methods papers including Goldman and Yang [8],
1
We note that there are a variety of well-documented situations where synonymous substitutions can have strong
effects on fitness [11, 30].
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 429
Kosakovsky Pond and Muse [13], Muse and Gaut [27], Nielsen
and Yang [29]). Instead, we present each method operationally
(“How and when does one use this method?”), by addressing the
following points:
1. What biological question is the method designed to answer?
2. What are the recommended applications?
3. What is the statistical procedure and statistical test used to
establish significance for this method?
4. How should one interpret positive and negative test results?
5. Rules of thumb for when this method is likely to work well, and
when it is not.
We conclude by discussing how inferences can accommodate
potentially confounding biological processes, including intragenic
recombination and mutation rate variation. It is critical to model
these processes, both in their own right and because ignoring their
effects could bias selection inference tools and yield misleading
results.
2 Materials
3 Methods
3.1 How to Run a There is a uniform workflow to run any of the described methods,
Selection Analysis either locally (on one’s own computer and/or a high-performance
computing environment) in HyPhy or using the Datamonkey
web-service, available at www.datamonkey.org. The version of
HyPhy that supports all of the analyses is a command-line program,
i.e., it must be run from a terminal prompt (similar to most other
bioinformatics packages) in Linux or Mac OS X. It is also possible
to run the program in Windows, with an appropriate POSIX emu-
lation environment (e.g., MinGW) installed.
To execute a selection analysis locally, the following steps will
need to be taken.
1. Prepare your coding sequence alignment. In general, any
duplicate sequences should be removed before analysis. Most
importantly, it is imperative that the sequence alignment be in
the correct reading frame, meaning that alignment must be
performed with codon structure in mind. A common approach
to ensure this criterion is met is to generate the alignment using
translated amino-acid data and then back-translate to the orig-
inal nucleotide sequences.
2. Prepare a phylogenetic tree from the multiple sequence align-
ment. Note that certain analyses may require a labeled phylo-
genetic tree, as indicated within each subsequent tutorial. Keep
in mind that for most selection analyses, a tree topology is a
nuisance parameter. Hence, while it is advisable to use good
practices when inferring trees, minor errors in tree inference
tend to have minor effects on gene- and site-level inference. A
notable exception occurs when lineage-specific selection is
investigated; in this case, ensuring high-quality tree topologies
is important.
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 431
2
With the exception of GARD.
432 Stephanie J. Spielman et al.
3.2 BUSTED
What Biological Question Is the Method Designed to
Answer?:
Is there evidence that some sites in the alignment have been
subject to positive diversifying selection, either pervasive
(throughout the evolutionary tree) or episodic (only on some
lineages)? In other words, BUSTED asks whether a given gene
has been subject to positive, diversifying selection at any site, at
any time [26]. If a priori information about lineages of interest
is available (e.g., due to migration, change in the environment,
etc.), then BUSTED can be restricted to test for selection only on
a subset of tree lineages, potentially boosting power.
Recommended Applications
1. Annotating a collection of alignments with a binary attribute:
Has this alignment been subject to positive diversifying selec-
tion (yes/no)? [34].
2. Testing small- or low-divergence alignments (i.e., 10
sequences) for evidence of positive diversifying selection,
where neither branch- nor site-level methods have sufficient
power to detect weak, but present, signal.
3
Note that the method GARD does not provide markdown output or a JSON, and output is in a different format.
This may be updated in a future HyPhy release.
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 433
Example Analysis:
To begin, we will perform a BUSTED analysis using a dataset of
primate-specific KSR2, kinase suppressor of RAS2, genes from
Enard et al. [5]. This gene has been implicated as a so-called
‘virus-interacting protein,’ and previous work has suggested it
has experienced adaptation in mammalian lineages due to
selective pressures exerted by viruses [5]. We will test all lineages
for positive selection (rather than specifying a subset of “test”
branches), thereby asking the question: “Has KSR2 been subject
to diversifying selection at some time during evolution in
primates?”
### Obtaining branch lengths and nucleotide substitution biases under the
nucleotide GTR model
* Log ( L ) = -5768.01, AIC - c = 11582.06 (23 estimated parameters)
### Obtaining the global omega estimate based on relative GTR branch lengths
and nucleotide substitution biases
* Log ( L ) = -5342.48, AIC - c = 10745.17 (30 estimated parameters)
* non - synonymous / synonymous rate ratio for *test* = 0.0342
### Improving branch lengths , nucleotide substitution biases, and global dN/dS
ratios under a full codon model
* Log ( L ) = -5333.46, AIC - c = 10727.13 (30 estimated parameters)
* non - synonymous / synonymous rate ratio for *test* = 0.0307
### Performing the constrained (dN/dS > 1 not allowed) model fit
* Log ( L ) = -5326.18, AIC - c = 10718.63 (33 estimated parameters)
* For * test * branches under the null (no dN/dS > 1 model), the following
rate distribution for branch-site combinations was inferred
----
## Branch - site unrestricted statistical test of episodic diversification
[BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =
0.0015**.
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 435
Interpreting Results:
The results printed to the terminal indicate a highly significant
result (P ¼ 0.0015) in the test for whole-gene selection. Analysis
with BUSTED therefore provides robust evidence that KSR2
experienced episodic positive selection in the primates. Because
we performed the original BUSTED analysis on the entire tree
(i.e., without a specified set of test branches), we do not know
from this result along which lineages KSR2 was subject to posi-
tive selection. We can conclude only that a non-zero proportion of
sites on some lineage(s) in the primate tree experienced diversify-
ing selection pressure.
Fig. 1 Example analysis visualization in HyPhy-Vision of BUSTED results. (a) The summary section provides a
brief overview of the analysis performed, including information about the inputted data (which can be
downloaded via the linked file name) and primary results from the hypothesis test performed. (b) The
model statistics section provides information about models fitted to the data. In BUSTED, this section
additionally includes an interactive display of site evidence ratios, which can be interpreted as a descriptive
measure for which sites may have contributed to the selection signal. (c) The tree section displays the
phylogeny as fitted under all inferred models and data partitions, if specified. Tree views can be toggled under
the Options drop-down menu. (d) Graphical views of each model’s inferred ω distribution can be viewed when
clicking on a given row’s plot icon in the Model fits table seen in (b)
3.3 RELAX
What Biological Question Is the Method Designed to
Answer?:
Is there evidence that the strength of selection has been relaxed
(or conversely intensified) on a specified group of lineages (Test)
relative to a set of reference lineages (Reference)? We note that
the RELAX framework can perform this specific hypothesis test
as well as fit a suite of descriptive models which address, for
example, overall rate differences between test and reference
branches or lineage-specific inferences of selection relaxation.
We focus our attention here on RELAX’s hypothesis testing
abilities. More information about descriptive analyses is avail-
able on hyphy.org as well as in RELAX’s primary publication
[43]. Importantly, RELAX is not designed to detect diversify-
ing selection specifically.
Recommended Applications
1. Testing for a systematic shift (relaxation/intensification) in
the distribution of selection pressure associated with major
biological transitions such as hosting switching in viruses [6]
or lifestyle evolution in bacteria (i.e., transition from free-living
to endosymbiotic lifestyle [43]).
2. Comparing selective regimes between two subsets of
branches in the tree, e.g., to investigate selective differences
among transmission routes in HIV-1 [42].
Example Analysis:
We will perform a RELAX analysis using a dataset of Influ-
enza A PB2 subunit sequences from Tamuri et al. [41]. The
PB2 subunit, which is part of influenza’s RNA polymerase
complex, has emerged as a critical determinant of influenza
infectivity and, as a consequence, host range [9, 18]. The dataset
we examine here contains sequences from both avian host and
human host strains.4 Previous studies have shown that this host
switch is correlated with significant shifts in selection pressures
and preferred amino acids at key sites in PB2 [36, 40, 41]. We
now re-analyze this dataset using RELAX to ask a different but
related question: “Was the shift from avian to human hosts
associated with a relaxation of selection pressures in Influenza
A PB2?”
4
The original dataset in Tamuri et al. [41] contained 401 sequences. For the purposes of this chapter, we analyze a
subset of this alignment with only 35 sequences (20 from avian and 15 from human hosts), thereby achieving a
tractable runtime on a personal machine.
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 439
### Obtaining branch lengths and nucleotide substitution biases under the
nucleotide GTR model
* Log ( L ) = -16755.26, AIC - c = 33660.66 (75 estimated parameters)
### Obtaining the global omega estimate based on relative GTR branch lengths
and nucleotide substitution biases
* Log ( L ) = -14410.97, AIC - c = 28988.46 (83 estimated parameters)
* non - synonymous / synonymous rate ratio for *Reference* = 0.0401
* non - synonymous / synonymous rate ratio for *Test* = 0.0604
### Improving branch lengths , nucleotide substitution biases, and global dN/dS
ratios under a full codon model
* Log ( L ) = -14354.67, AIC - c = 28875.86 (83 estimated parameters)
* non - synonymous / synonymous rate ratio for *Reference* = 0.0358
* non - synonymous / synonymous rate ratio for *Test* = 0.0609
----
## Test for relaxation ( or intensification) of selection [RELAX]
Likelihood ratio test ** p = 0.0014**.
> Evidence for * relaxation of selection* among **test** branches _relative_ to
the **reference** branches at P<=0.05
----
Interpreting Results:
On this data, RELAX has inferred a relaxation parameter
K ¼ 0.73 with a highly significant P ¼ 0.0014. Therefore,
there is evidence to reject the null hypothesis that selection pres-
sure has not been shifted in the test (here, human host) lineages.
We instead have strong evidence that selection has been relaxed
(because the inferred K < 1) in the human host lineages. In
other words, selection in the test branches has generally moved
towards neutrality (ω ¼ 1) compared to the reference branches.
This finding is consistent with the evolutionary changes that
typically occur during a virus host-switching event, wherein
selection stringency will be reduced to facilitate viral
adaptation.
Recommended Applications
1. Exploratory testing for evidence of lineage-specific positive
diversifying selection in small- to medium-sized alignments
(up to 100 sequences).
442 Stephanie J. Spielman et al.
Example Analysis:
Here, we will demonstrate aBSREL use and interpretation
using a dataset of HIV-1 env sequences collected from an epide-
miologically linked donor–recipient transmission pair [7]. This
dataset can be found in the provided file hiv1_transmission.
fna.
### Obtaining branch lengths and nucleotide substitution biases under the
nucleotide GTR model
* Log ( L ) = -5524.50, AIC - c = 11153.08 (52 estimated parameters)
### Fitting the baseline model with a single dN/dS class per branch, and no
site-to-site variation.
* Log ( L ) = -5402.40, AIC - c = 11009.72 (102 estimated parameters)
* Branch - level non - synonymous / synonymous rate ratio distribution has median
0.66, and 95% of the weight in 0.00--5.41
### Determining the optimal number of rate classes per branch using a step up
procedure
444 Stephanie J. Spielman et al.
| Branch | Length |Rates| Max. dN/dS | Log(L) | AIC-c |Best AIC-c so far|
|-----------|-------|-----|--------------–– |------------|--------- |---------------–|
| 0564 _22 | 0.01 | 2 | 1.96 (52.27%) | -5402.41 | 11013.78 | 11009.72 |
| 0564 _7 | 0.01 | 2 | 0.74 ( 5.19%) | -5402.40 | 11013.76 | 11009.72 |
| Separator | 0.01 | 2 | 197.32 ( 3.95%) | -5397.53 | 11004.02 | 11004.02 |
| Separator | 0.01 | 3 | 180.22 ( 4.08%) | -5397.53 | 11008.06 | 11004.02 |
| 0564 _4 | 0.01 | 2 | 29.79 ( 2.15%) | -5394.37 | 11001.74 | 11001.74 |
| 0564 _4 | 0.01 | 3 | 29.78 ( 2.15%) | -5394.37 | 11005.78 | 11001.74 |
| 0564 _3 | 0.01 | 2 | 126.86 ( 3.14%) | -5388.59 | 10994.22 | 10994.22 |
| 0564 _3 | 0.01 | 3 | 135.96 ( 3.05%) | -5388.59 | 10998.25 | 10994.22 |
| 0564 _9 | 0.01 | 2 | 10.01 ( 8.61%) | -5388.37 | 10997.82 | 10994.22 |
...
| Node53 | 0.00 | 2 | 1.00 (100.00%) | -5371.63 | 10976.46 | 10971.76 |
| 0557 _6 | 0.00 | 2 | 27.66 (100.00%) | -5371.32 | 10975.83 | 10971.76 |
| 0557 _21 | 0.00 | 2 | 0.25 ( 1.96%) | -5371.30 | 10975.80 | 10971.76 |
| 0557 _7 | 0.00 | 2 | 0.25 ( 1.96%) | -5371.30 | 10975.80 | 10971.76 |
Interpreting Results:
The first printed markdown table ("Determining the optimal
number of rate classes per branch using a step up procedure")
summarizes the model selection process. For example, when two ω
rates were assigned to branch Separator, this improved the AICc
score of the fit (compared to the single-rate model) from
11, 009.72 to 11, 004.02. However, allocating three ω rates to
the same branch worsens the score to 11, 008.06. Therefore the
aBSREL model will use two ω rates at the branch.
Fig. 2 HyPhy-Vision tree viewer depicting the fitted aBSREL Adaptive model to HIV-1 data. Branches are
colored by their inferred ω distribution, as indicated in the legend. Lineages identified as positive selection at
P < 0.05 after correction for multiple testing are shown with thick branches, with color distributions
representing the relative values and proportions of inferred ω categories. Note that taxon labels beginning
with “0554” represent HIV-1 sequences derived from the donor patient, and labels beginning with “0557”
represent HIV-1 sequences derived from the recipient patient
3.5 Site-Level
Selection: MEME, FEL,
What Biological Question Is the Method Designed to
SLAC, and FUBAR Answer?:
The methods FEL, SLAC, and FUBAR address the question:
Which site(s) in a gene are subject to pervasive, i.e., consistently
across the entire phylogeny, diversifying selection? MEME
addresses a more general question: Which site(s) in a gene are
subject to pervasive or episodic, i.e., only on a single lineage or
subset of lineages, diversifying selection?
Recommended Applications
1. MEME is the sole method in HyPhy for detecting selection at
individual sites that considers both pervasive and episodic selec-
tion. MEME is therefore our recommended method if maxi-
mum power is desired.
2. The phenomenon of pervasive selection is generally most prev-
alent in pathogen evolution and any biological system influ-
enced by evolutionary arms race dynamics (or balancing
selection), including adaptive immune escape by viruses. As
such, FEL, SLAC, and FUBAR are ideally suited to identify
sites under positive selection which represent candidate sites
subject to strong selective pressures across the entire phylog-
eny. Each of these methods has a particular use case as well:
l FEL is our recommended method for analyzing small-to-
medium size datasets when one wishes only to study perva-
sive selection at individual sites.
l FUBAR is our recommended method for detecting perva-
sive selection at individual sites on large (> 500 sequences)
datasets for which other methods have prohibitive runtimes,
unless you have access to a computer cluster.
l SLAC provides legacy functionality as a counting-based
method adapted for phylogenetic applications. In general,
this method will be the least statistically robust.
(continued)
448 Stephanie J. Spielman et al.
(continued)
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 449
Note that FEL and SLAC report both positively and negatively
selected sites, but MEME and FUBAR report only sites under
positive selection.
Example Analysis:
We will demonstrate the use and interpretation of site-level
methods using data from influenza strain H3N2 (the “Hong
Kong flu”), the primary circulating strain of seasonal influ-
enza since the late 1960s. We specifically will assess selection on
the H3 hemagglutinin, the influenza surface protein which is
responsible for host cell binding. Hemagglutinin experiences
rapid evolution triggered by host immune escape, and previous
studies have identified numerous signatures of positive diversify-
ing selection in H3 sequences with a particular concentration
around the host-binding domain [28].
0.01
Fig. 3 Phylogeny of H3 hemagglutinin sequences analyzed here. Tip colors indicate those selected for each
dataset
Table 1
Sites identified as positively selected across the H3 datasets analyzed here
### Obtaining branch lengths and nucleotide rates under the GTR model
* Log ( L ) = -7506.06
### Obtaining the global omega estimate based on relative GTR branch lengths
and nucleotide substitution biases
* Log ( L ) = -7302.10
* non - synonymous / synonymous rate ratio for *test* = 0.2923
### Improving branch lengths , nucleotide substitution biases, and global dN/dS
ratios under a full codon model
* Log ( L ) = -7289.65
* non - synonymous / synonymous rate ratio = 0.2598
### ** Found _3_ sites under pervasive positive diversifying and _115_
sites under negative selection at p <= 0.1**
...
### For partition 1 these sites are significant at p <=0.1
...
### Ancestor sampling analysis
...
SLAC reports several key quantities for codons with significant
P-values for positive or negative selection to the screen.
S The number of synonymous substitutions
inferred at this site
NS The number of non-synonymous substitutions
inferred at this site
dS Estimated site-specific synonymous rate
dN Estimated site-specific non-synonymous rate
Selection detected? Selection classification (positive or negative)
and the corresponding P-value for the binomial
test
...
| Codon |Partition | alpha | beta+ | p+ |LRT |Episodic selection detected?|#branches|
|:------:|:-------: |:----:|:----: |:---: |:---:|:---------------------–: |:--------: |
| 64 | 1 | 0.000 | 14.717 | 0.204|3.512| Yes, p = 0.0816 | 5 |
| 154 | 1 | 0.000 | 35.302 | 0.145|5.334| Yes, p = 0.0317 | 8 |
| 171 | 1 | 0.000 | 45.005 | 0.017|5.753| Yes, p = 0.0256 | 1 |
| 208 | 1 | 0.000 | 59.749 | 0.089|5.554| Yes, p = 0.0283 | 6 |
| 242 | 1 | 1.839 | 34.114 | 0.216|4.273| Yes, p = 0.0549 | 7 |
| 402 | 1 | 0.000 | 10.476 | 0.091|3.493| Yes, p = 0.0824 | 2 |
5
Note that for all prompts with default values, simply pressing enter will choose this default.
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 455
...
### Tabulating site - level results
| Codon |Partition| alpha | beta | N.eff |Posterior prob for positive selection|
|:-------:|:------–: |:----:|:---–: |:------–:|:----------------------------------–:|
| 61 | 1 | 0.753 | 4.365| 64.549 | Pos. posterior = 0.9262 |
| 64 | 1 | 0.753 | 3.920| 77.106 | Pos. posterior = 0.9095 |
| 69 | 1 | 0.730 | 4.447| 64.182 | Pos. posterior = 0.9325 |
| 154 | 1 | 0.637 | 6.595| 53.312 | Pos. posterior = 0.9826 |
456 Stephanie J. Spielman et al.
Interpreting Results:
Sites identified as positively selected by each method, across all
three datasets, are given in Table 1. In general, we expect
MEME to be the most comprehensive and robust of all site-level
methods because it uniquely considers both pervasive and epi-
sodic selection [24]. In addition, power studies have shown that
FUBAR is expected to outperform FEL and SLAC under most
circumstances [25]. Finally, we expect that SLAC will be the
least robust method due to its reliance on a relatively naive
counting-based approach [12].
3.6 Screening A critical aspect of sequence analysis we have not yet covered is the
Sequences for detection of and correction for intragenic recombination in an
Recombination alignment of homologous sequences. Because recombination is
such a key biological process in many viral pathogens, we strongly
advocate screening an alignment for recombination before pro-
ceeding with additional analyses, unless there is a sound biological
reason to discount (i.e., intragenic recombination Influenza A is
negligibly rare). Indeed, because recombination causes different
regions of an alignment to be related by different phylogenies, its
presence can heavily influence selection detection and other down-
stream applications.
There are many computational approaches to finding evidence
of recombination in a sequence alignment [32], however at their
core, many such methods look for evidence of phylogenetic incon-
gruence. Here, we demonstrate one such method, GARD (genetic
algorithms for recombination detection) that we have found to
perform very well among a wide range of approaches on simulated
data [14]. Note that at this time, GARD will not produce a JSON
file as output but instead several text files containing inference
information, as well as a final partitioned alignment for downstream
use if recombination was detected.
3.7 GARD
What Biological Question Is the Method Designed to
Answer?:
Have sequences in the given alignment undergone recombina-
tion, and if so what are the recombination breakpoints and
segment-specific phylogenies?
Recommended Applications:
GARD is geared towards mapping the breakpoints and detecting
segments of the alignment which can be adequately described by a
single tree topology. Therefore, alignments, particularly alignments
of viral sequences, should be screened for the presence of recombi-
nation before performing any selection inference. The NEXUS
output from GARD can be directly used as input for most down-
stream selection detection analyses.
460 Stephanie J. Spielman et al.
A
B
R
O
O B A R O A B R
5. Save results to. For this option, provide a full path to the
output file to which you would like GARD to write results.
The supplied file name will ultimately contain an HTML-for-
matted summary of the analysis. HyPhy will generate several
other files with names obtained by appending suffixes (as in
<file name>_suffix) to the main result file. In particular,
the _finalout file stores the original alignment in NEXUS
format with inferred non-recombinant sections of the alignment
saved in the ASSUMPTIONS block and trees inferred for each
partition in the TREES block. This NEXUS file can be input into
many recombination-aware analyses in HyPhy and other pro-
grams that can read NEXUS. The _ga_details file contains
two lines of information about each model examined by the
genetic algorithm: its AICc score and the location of breakpoints
in the model. Finally, the _ga_splits file stores information
about the location of breakpoints and trees inferred for each
alignment region under the best model found by the GA.
GARD will now run to completion, printing status indica-
tors to screen while it runs:
Listing 8 Partial GARD output:
Interpreting Results:
GARD found evidence of recombination in this dataset with
three breakpoints, yielding a 135.9 point AICc improvement
over the model without recombination. Among all models with
three breakpoints in the Cache Valley Virus glycoprotein align-
ment, the best model places them at nucleotides 587, 962, and
1475. Importantly, if GARD had reported that the best model
had 0 breakpoints, we could conclude that no evidence of recom-
bination had been found. Note that because genetic algorithms
are stochastic, there is no guarantee that replicate runs will
converge to exactly the same quantitative results. When there is
a strong signal of recombination breakpoints in the data, how-
ever, the qualitative results (number and general location of
breakpoints) should be fairly robust.
Interpreting Results:
FEL inference on the GARD-processed partitioned Cache Valley
Virus data does not detect sites under selection at P 0.1. By
contrast, FEL inference on the unpartitioned Cache Valley
Virus data (i.e., not pre-screened for recombination) detects
three positively selected sites at P 0.1 (212, 516, and 558 at
P = 0.08, P = 0.03, and P = 0.09, respectively). From these
results, we can clearly tell that not screening or recombination
has the potential for adverse consequence including an increased
false positive rate as seen here. As such, we strongly encourage
users to screen alignments for recombination if such processes are
suspected before proceeding to selection detection.
3.8 Accounting for A critical genomic process that one must consider when detecting
Synonymous Rate selection is the phenomenon of synonymous rate variation, wherein
Variation the rate of synonymous codon evolution (represented by dS in the
6
Note that neither aBSREL nor RELAX accepts partitioned data because they require a consistent phylogeny to
define branch sets.
464 Stephanie J. Spielman et al.
Interpreting Results:
At P 0.1, analysis of CD2 with synonymous rate variation
revealed a total of 14 sites under positive selection. By contrast,
CD2 analysis with FEL without dS variation only detected four
sites under positive selection (Fig. 5). Similarly, analysis with dS
variation revealed 27 sites under purifying selection, but analy-
sis without dS variation revealed only 15 sites under purifying
selection. Most importantly, all sites detected when dS was fixed
to 1 were a subset of the sites identified by the model with dS
variation (Fig. 5). Together, these results demonstrate that
ignoring dS variation can induce both an increased false nega-
tive rate regarding positive selection detection and an overall
decrease in power to detect any selective regime. We acknowledge
that it is possible that the opposite conclusion might be true,
namely, that additional sites identified by FEL with dS varia-
tion might instead be false positives. However, in our experience,
this is much less frequently the case [12].
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 465
With dS variation
** * * ** ** ** * ** ** * ** * * * * * * * ** * * * **** * * ** * *
* * ** * * * * * * * ** * * * * * * Without dS variation
Site
Fig. 5 Sites identified as positively (red) and negatively (blue) selected in CD2 at P 0.1 by FEL run with
(above the line) and without dS variation (below the line). Sites with arrows represent those identified as
selected by FEL with dS variation that were not identified by FEL when dS variation was ignored
4 Tips
5 Exercises
References
1. Anisimova M, Kosiol C (2009) Investigating 3. Delport W, Scheffler K, Seoighe C (2009)
protein-coding sequence evolution with prob- Models of coding sequence evolution. Brief
abilistic codon substitution models. Mol Biol Bioinform 10(1):97–109
Evol 26:255–271 4. Drummond DA, Wilke CO (2008)
2. Chamary JV, Hurst LD (2005) Evidence for Mistranslation-induced protein misfolding as
selection on synonymous mutations affecting a dominant constraint on coding-sequence
stability of mRNA secondary structure in mam- evolution. Cell 134:341–352
mals. Genome Biol 6:R75
Evolution of Viral Genomes: Interplay Between Selection, Recombination. . . 467
5. Enard D, Cai L, Gwennap C, Petrov DA 18. Labadie K, Dos Santos Afonso E, Rameix-Welti
(2016) Viruses are a dominant driver of protein M-A, van der Werf S, Naffakh N (2007) Host-
adaptation in mammal. eLife 5:e12469 range determinants on the PB2 protein of
6. Forni D, Cagliani R, Clerici M, Sironi M (2017) influenza A viruses control the interaction
Molecular evolution of human coronavirus gen- between the viral polymerase and nucleopro-
omes. Trends Microbiol 25(1):35–48. ISSN tein in human cells. Virology 362:271–282
0966-842X 19. Lorenzo-Redondo R, Fryer HR, Bedford T,
7. Frost SDW, Liu Y, Kosakovsky Pond SL, Kim E-Y, Archer J, Pond SLK, Chung Y-S,
Chappey C, Wrin T, Petropoulos CJ, Little SJ, Penugonda S, Chipman JG, Fletcher CV,
Richman DD (2005) Characterization of Schacker TW, Malim MH, Rambaut A, Haase
Human Immunodeficiency Virus Type AT, McLean AR, Wolinsky SM (2016) Persis-
1 (HIV-1) envelope variation and neutralizing tent HIV-1 replication maintains the tissue res-
antibody responses during transmission of ervoir during therapy. Nature 530
HIV-1 Subtype B. J Virol 79:6523–6527 (7588):51–56
8. Goldman N, Yang Z (1994) A codon-based 20. Lynch M, Ackerman MS, Gout J-F, Long H,
model of nucleotide substitution for protein- Sung W, Thomas WK, Foster PL (2016)
coding DNA sequences. Mol Biol Evol Genetic drift, selection and the evolution of
11:725–736 the mutation rate. Nature 17(11):704–714
9. Graef KM, Vreede FT, Lau Y-F, McCall AW, 21. Lynn DJ, Freeman AR, Murray C, Bradley DG
Carr SM, Subbarao K, Fodor E (2010) The (2005) A genomics approach to the detection
PB2 subunit of the Influenza virus RNA Poly- of positive selection in cattle: adaptive evolu-
merase affects virulence by interacting with the tion of the t-cell and natural killer cell-surface
mitochondrial antiviral signaling protein and protein cd2. Genetics 170(3):1189–1196
inhibiting expression of beta interferon. J 22. Meyer AG, Wilke CO (2015) Geometric con-
Virol 84:8433–8445 straints dominate the antigenic evolution of
10. Harrison RJ, Charlesworth B (2011) Biased Influenza H3N2 Hemagglutinin. PLoS Pathog
gene conversion affects patterns of codon 11:e1004940
usage and amino acid usage in the Saccharomy- 23. Mugal CF, Wolf JBW, Kaj I (2014) Why time
ces sensu stricto group of yeasts. Mol Biol Evol matters: codon evolution and the temporal
28:117–129 dynamics of dN/dS. Mol Biol Evol
11. Hershberg R, Petrov D (2008) Selection on 31:212–231
codon bias. Annu Rev Genet 42:287–299 24. Murrell B, Wertheim JO, Moola S, Weighill T,
12. Kosakovsky Pond SL, Frost SWD (2005) Not Scheffler K, Kosakovsky Pond SL (2012)
so different after all: a comparison of methods Detecting individual sites subject to episodic
for detecting amino acid sites under selection. diversifying selection. PLoS Genet 8(7):
Mol Biol Evol 22:1208–1222 e1002764
13. Kosakovsky Pond SL, Muse SV (2005) Site-to- 25. Murrell B, Moola S, Mabona A, Weighill T,
site variation of synonymous substitution rates. Scheward D, Kosakovsky Pond SL, Scheffler
Mol Biol Evol 22:2375–2385 K (2013) FUBAR: a fast, unconstrained Bayes-
14. Kosakovsky Pond SL, Posada D, Gravenor MB, ian AppRoximation for inferring selection. Mol
Woelk CH, Frost SDW (2006) Automated Biol Evol 30:1196–1205
phylogenetic detection of recombination 26. Murrell B, Weaver S, Smith MD, Wertheim JO,
using a genetic algorithm. Mol Biol Evol 23 Murrell S, Aylward A, Eren K, Pollner T, Martin
(10):1891–901 DP, Smith DM, Scheffler K, Kosakovsky Pond
15. Kosakovsky Pond SL, Delport W, Muse SV, SL (2015) Gene-wide identification of episodic
Scheffler K (2010) Correcting the bias of selection. Mol Biol Evol 32:1365–1371
empirical frequency parameter estimators in 27. Muse SV, Gaut BS (1994) A likelihood
codon models. PLoS One 5:e11230 approach for comparing synonymous and non-
16. Kosakovsky Pond SL, Murrell B, Fourment M, synonymous nucleotide substitution rates, with
Frost SD, Delport W, Scheffler K (2011) A application to the chloroplast genome. Mol
random effects branch-site model for detecting Biol Evol 11:715–724
episodic diversifying selection. Mol Biol Evol 28. Nelson MI, Holmes EC (2007) The evolution
28:3033–3043 of epidemic influenza. Nat Rev Genet 8
17. Kryazhimskiy S, Plotkin JB (2008) The popu- (3):196–205
lation genetics of dN/dS. PLoS Genet 4: 29. Nielsen R, Yang Z (1998) Likelihood models
e1000304 for detecting positive selected amino acid sites
468 Stephanie J. Spielman et al.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 15
Abstract
This chapter reviews current research on how protein domain architectures evolve. We begin by summariz-
ing work on the phylogenetic distribution of proteins, as this will directly impact which domain architec-
tures can be formed in different species. Studies relating domain family size to occurrence have shown that
they generally follow power law distributions, both within genomes and larger evolutionary groups. These
findings were subsequently extended to multi-domain architectures. Genome evolution models that have
been suggested to explain the shape of these distributions are reviewed, as well as evidence for selective
pressure to expand certain domain families more than others. Each domain has an intrinsic combinatorial
propensity, and the effects of this have been studied using measures of domain versatility or promiscuity.
Next, we study the principles of protein domain architecture evolution and how these have been inferred
from distributions of extant domain arrangements. Following this, we review inferences of ancestral domain
architecture and the conclusions concerning domain architecture evolution mechanisms that can be drawn
from these. Finally, we examine whether all known cases of a given domain architecture can be assumed to
have a single common origin (monophyly) or have evolved convergently (polyphyly). We end by a
discussion of some available tools for computational analysis or exploitation of protein domain architectures
and their evolution.
Key words Protein domain, Protein domain architecture, Superfamily, Monophyly, Polyphyly, Con-
vergent evolution, Domain evolution, Kingdoms of life, Domain co-occurrence network, Node degree
distribution, Power law, Parsimony
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_15, © The Author(s) 2019
469
470 Sofia K. Forslund et al.
1.2 Protein Domains Protein domains are high-level parts of proteins that either occur
alone or together with partner domains on the same protein chain.
Most domains correspond to tertiary structure elements and are
able to fold independently. All domains exhibit evolutionary con-
servation, and many either perform specific functions or contribute
in a specific way to the function of their proteins. The word domain
strictly refers to a distinct region of a specific protein, an instance of
a domain family. However, domain and domain family are often
used interchangeably in the literature.
1.5 Mechanisms for Most mutations are point mutations: substitutions, insertions, or
Domain Architecture deletions of single nucleotides. While conceivably enough of these
Change might create a new domain from an old one or noncoding sequence
or remove a domain from a protein, in practice we are interested in
mechanisms whereby the domain architecture of a protein changes
instantly or nearly so (but see below for an overview of recent work
on the origin of new domains). Figure 1 shows some examples of
ways in which domain architectures may mutate. In general, adding
or removing domains requires genetic recombination events. These
can occur either through errors made by systems for repairing DNA
damage such as homologous [16, 17] or nonhomologous (illegiti-
mate) [18, 19] recombination or through the action of mobile
genetic elements such as DNA transposons [20] or retrotranspo-
sons [21, 22]. Recombination can cause loss or duplication of parts
of genes, entire genes or much longer chromosomal regions.
In organisms that have introns, exon shuffling [23, 24] refers to
the integration of an exon from one gene into another, for instance,
through chromosomal crossover, gene conversion, or mobile
genetic elements. Exons could also be moved around by being
472 Sofia K. Forslund et al.
Fig. 1 Examples of mutations that can change domain architectures. Adapted from Buljan et al. [25]. (a) Gene
fusion by a mobile element. LINE refers to a Long Interspersed Nuclear repeat Element, a retrotransposon. The
reverse transcriptase encoded within the LINE causes its mRNA to be reverse-transcribed into DNA and
integrated into the genome, making the domain-encoding blue exon from the donor gene integrate along with
it in the acceptor gene. (b) Gene fusion by loss of a stop signal or deletion of much of the intergenic region.
Genes 1 and 2 are joined together into a single, longer gene. (c) Domain insertion through recombination. The
blue domain from the donor gene is inserted within the acceptor gene by either homologous or illegitimate
recombination. (d) Right: Gene fission by introduction of transcription stop (the letter Ω) and start (the letter A).
Left: Domain loss by introduction of a stop codon (exclamation mark) with subsequent degeneration of the
now untranslated domain
N ¼ c ði þ X Þa
Power law distributions arise in a vast variety of contexts: from
human income distributions, connectivity of internet routers, word
usage in languages, and many other situations ([34, 35, 40, 41], see
also [42], for a conflicting view). Luscombe et al. [35] described a
number of other genomic properties that also follow power law
distributions, such as the occurrence of DNA “words,” pseudo-
genes, and levels of gene expression. These distributions fit much
better than the alternative they usually are contrasted against, an
exponential decay distribution. The most important difference
between exponential and power law distributions in this context
concerns the fact that the latter has a “fat tail,” that is, while most
domain families occur only a few times in each proteome, most
domains in the proteome still belong to one of a small number of
families.
Later work ([39, 43], see also [44]) demonstrated that
proteome-wide domain occurrence data fit the general GPD better
than the power law but that it also asymptotically fits a power law as
X i. The deviation from strict power law behavior depends on
proteome size in a kingdom-dependent manner [43]. Regardless, it
is mostly appropriate to treat the domain family size distribution as
approximately (and asymptotically) power law-like, and later stud-
ies typically assume this.
The power law, but not the GPD, is scale-free in the sense of
fulfilling the condition
f ðax Þ ¼ g ða Þf ðx Þ
where f(x) and g(x) are some functions of a variable x and where a is
a scaling parameter, that is, studying the data at a different scale will
not change the shape of function. This property has been exten-
sively studied in the literature and is connected to other attributes,
notably when it occurs in network degree distributions (i.e., fre-
quency distributions of edges per node). Here it has been asso-
ciated with properties such as the presence of a few central and
critical hubs (nodes with many edges to other nodes), the similarity
between parts and the whole (as in a fractal), and the growth
process called preferential attachment, under which nodes are
more likely to gain new links the more links they already have.
However, the same power law distribution may be generated
from many different network topologies with different patterns of
connectivity. In particular, they may differ in the extent that hubs
are connected to each other [42]. It is possible to extend the
analysis by taking into account the distribution of degree pairs
along network edges, but this is normally not done.
What kind of evolutionary mechanisms give rise to this kind of
distribution of gene or domain family sizes within genomes? In one
model by Huynen and van Nimwegen [33], every gene within a
Evolution of Protein Domain Architectures 475
Fig. 2 (a) Distribution of domain family sizes in three selected species. Power law distributions were fitted to
these curves such that for frequency f of families of size X, f ¼ cXa. For S. cerevisiae, a ¼ 1.9, for E. coli,
a ¼ 1.7, and for H. sapiens, a ¼ 1.5. (b) Distribution of domain family sizes across the three kingdoms.
Power law distributions were fitted to these curves such that for frequency f of families of size X, f ¼ cXa. For
bacteria, a ¼ 0.9, for archaea, a ¼ 1.1, for eukaryotes, a ¼ 0.8, and for viruses, a ¼ 1.9
Evolution of Protein Domain Architectures 477
Fig. 3 (a) Distribution of multi-domain (architecture) family sizes in three selected species. Power law
distributions were fitted to these curves such that for frequency f of families of size X, f ¼ cXa. For
S. cerevisiae, a ¼ 2.0, for E. coli, a ¼ 1.8, and for H. sapiens, a ¼ 1.5. (b) Distribution of multi-
domain (architecture) family sizes across the three kingdoms. Power law distributions were fitted to these
curves such that for frequency f of families of size X, f ¼ cXa. For bacteria, a ¼ 1.0, for archaea, a ¼ 1.1,
for eukaryotes, a ¼ 1.1, and for viruses, a ¼ 2.0
478 Sofia K. Forslund et al.
Fig. 4 (a) Kingdom distribution of unique domains. Values are given as percentages of the total, 10,330
domains. (b) Kingdom distribution of unique domain pairs. Values are given as percentages of the total, 31,287
domain pairs. (c) Kingdom distribution of unique domain triplets. Values are given as percentages of the total,
33,662 domain triplets. (d) Kingdom distribution of unique multi-domain architectures. Values are given as
percentages of the total, 23,238 multi-domain architectures
Table 1
The 20 most densely connected hubs with regard to immediate domain neighbors, according to Pfam
30.0
Fig. 6 (a) Distribution of domain co-occurrence network node degrees in three selected species. Power law
distributions were fitted to these curves such that for frequency f of families of size X, f ¼ cXa. For
S. cerevisiae, a ¼ 2.2, for E. coli, a ¼ 2.0, and for H. sapiens, a ¼ 1.9. (b) Distribution of domain
co-occurrence network node degrees across the three kingdoms. This corresponds to a network where two
domains are connected if any species within the kingdom has a protein where these domains are immediately
adjacent. Power law distributions were fitted to these curves such that for frequency f of families of size X,
f ¼ cXa. For bacteria, a ¼ 1.6, for archaea, a ¼ 1.7, for eukaryotes, a ¼ 1.5, and for viruses a ¼ 2.0
Weiner et al. [14] used Pfam domains for 10,746 species in all
kingdoms and took as their relative versatility measure the logarith-
mic regression coefficient for each domain family across genomes,
meaning that it is not defined within single proteomes.
To what extent is high versatility an intrinsic property of a
certain domain? Vogel et al. [66] only examined large groups of
domains together and therefore did not address this question for
single domains. Basu et al. [55] and Weiner et al. [14] instead
analyzed each domain separately and concluded that there are
strong variations in relative versatility at this level. Their results
are very different in detail, however, reflected by the fact that only
one domain family (PF00004, AAA ATPase family) is shared
between the ten most versatile domains reported in the two studies.
As they used fairly similar domain datasets, it would appear that the
results strongly depend on the definition of relative versatility.
Another potential reason for the different results is that Basu’s list
was based on eukaryotes only, while Weiner’s analysis was heavily
biased toward prokaryotes. Furthermore, the top ten list in Basu
et al. [55] and their follow-up paper [56] only overlap by four
domains, yet the main difference is that in the latter study all
28 eukaryotes were considered, while the former study was limited
to the subset of 20 animal, plant, and fungal species. The choice of
species thus seems pivotal for the results when using this method.
They also used different methods for calculating the average value
of relative versatility across many species, which may influence the
results.
Does domain versatility vary between different functional clas-
ses of domains? Vogel et al. [66] found no difference in relative
versatility between broad functional or process categories or
between SCOP structural classes. In contrast to this, Basu et al.
[55] reported that high versatility was associated with certain func-
tional categories in eukaryotes. However, no test for the statistical
significance of these results was performed. Weiner et al. [14] also
noted some general trends but found no significant enrichment of
gene ontology terms in versatile domains. This does not necessarily
mean that no such correlation exists, but more research is required
to convincingly demonstrate its strength and its nature. More
recently, Cromar et al. [68] analyzed domain architectures in
eukaryotic extracellular matrix proteomes, noting that these struc-
tures are organized around a set of versatile domains under the
weighted bigram metric of Basu et al. [55].
Another important question is to what extent domain versatil-
ity varies across evolutionary lineages. Vogel et al. [66] reported no
large differences in average versatility for domains in different king-
doms. The versatility measure of Basu et al. [55] can be applied
within individual genomes, which means that according to this
measure domains may be versatile in one organism group but not
in another, as well as gain or lose versatility across evolutionary
Evolution of Protein Domain Architectures 489
time. They found that more domains were highly versatile in ani-
mals than in other eukaryotes. Modeling versatility as a binary
property defined for domains in extant species, they further used
a maximum parsimony approach to study the persistence of versa-
tility for each domain across evolutionary time and concluded that
both gain and loss of versatility are common during evolution.
Inferring ancestral domain architectures, Cohen-Gihon et al. [69]
report an increase in versatility in many domains during eukaryotic
evolution, in particular around the divergence of Bilateria. Weiner
at al. [14] divided domains into age categories based on distribu-
tion across the tree of life and reported that the versatility index is
not dependent on age, i.e., domains have equal chances of becom-
ing versatile at different times in evolution. This is consistent with
the observation by Basu et al. [55] that versatility is a fast-evolving
and varying property. When measuring versatility as a regression
within different organism groups, Weiner et al. [14] found slightly
lower versatility in eukaryotes, which is in conflict with the findings
of Basu et al. [55]. Again, this underscores the strong dependence
of the method and dataset on the results.
Further properties reported to correlate with domain versatility
include sequence length, where Weiner et al. [14] found that
longer domains are significantly more versatile within the frame-
work of their study, while at the same time, shorter domains are
more abundant and hence may have more domain neighbors in
absolute numbers. Basu et al. [55] further reported that more
versatile domains have more structural interactions than other
domains. To determine which of these reported correlations that
genuinely reflect universal biological trends, further comprehensive
studies are needed using more data and uniform procedures. This
would hopefully allow the results from the studies described here to
be validated and any conflicts between them to be resolved.
Basu et al. [55] further analyzed the phylogenetic spread of all
immediate domain neighbor pairs (“bigrams”) containing domains
classified as promiscuous. The main observation this yielded was
that although most such combinations occurred in only a few
species, most promiscuous domains are part of at least one combi-
nation that is found in a majority of species. They interpreted this as
implying the existence of a reservoir of evolutionarily stable domain
combinations from which lineage-specific recombination may draw
promiscuous domains to form unique architectures. Later work by
Hsu et al. [70] analyzed the domain co-occurrence networks cen-
tered on each domain family, classifying such subnetworks as being
either mostly starlike, taillike, or tetragon-like, with promiscuous
domains forming cores of starlike architecture networks in this
representation.
490 Sofia K. Forslund et al.
procedure on yet more data and concluded that fusion was 5.6
times as likely as fission.
Buljan and Bateman [15] performed a similar maximum parsi-
mony reconstruction of ancestral domain architectures. They too
observed that domain architecture changes primarily take place at
the protein termini, and the authors suggested that this might
largely occur because terminal changes to the architecture are less
likely to disturb overall protein structure. Moreover, they con-
cluded from reconciliation of gene and species trees that domain
architecture changes were more common following gene duplica-
tions than following speciation but that these cases did not differ
with respect to the relative likelihood of domain losses or gains.
Recently, Buljan et al. [25] presented a new ancestral domain
architecture reconstruction study which assumed that gain of a
domain should take place only once in each gene tree, i.e., Dollo
parsimony [79]. Their results also support gene fusion as a major
mechanism for domain architecture change. The fusion is generally
preceded by a duplication of either of the fused genes. Intronic
recombination and insertion of exons are observed but relatively
rarely. They also found support for de novo creation of disordered
segments by exonization of previously noncoding regions. More
recently still a method for domain architecture history reconstruc-
tion using a network construct called a plexus was described
[83]. Yang and Bourne [84] further described another
parsimony-based reconstruction approach, as did Wu et al. [85],
reporting that histories of signaling and development proteins are
enriched for gene fusion/fission events. Stolzer et al. [86] present
another method for domain architecture history inference, made
available through the Notung software.
10 Conclusions
Updated statistics were generated from the data in Pfam 30.0. All
UniProt proteins in the SwissPfam set for Pfam 30.0 were included.
These span 1090 bacteria, 506 eukaryotes, and 94 archaea. All
Pfam-A domains regardless of type were included. However, as
stretches of repeat domains are highly variable, consecutive subse-
quences of the same domain were collapsed into a single pseudo-
domain, if it was classified as type Motif or Repeat, as in several
previous works [50, 60, 66, 82].
Domains were ordered within each protein based on their
sequence start position. In the few cases of domains being inserted
within other domains, this was represented as the outer domain
followed by the nested domain, resulting in a linear sequence of
domain identifiers. As long regions without domain assignments
are likely to represent the presence of as-yet uncharacterized
domains, we excluded any protein with unassigned regions longer
than 50 amino acids (more than 95% of Pfam-A domains are longer
than this). This approach is similar to that taken in previous works
[59, 60, 71]. Other studies [50, 72] have instead performed addi-
tional, more sensitive domain assignment steps, such as clustering
the unassigned regions to identify unknown domains within them.
Pfam domains are sometimes organized in clans, where clan-
mates are considered homologous. A transition from a domain to
another of the same clan is thus less likely to be a result of domain
swapping of any kind and more likely to be a result of sequence
divergence from the same ancestor. Because of this, we replaced all
Pfam domains that are clan members with the corresponding clan.
The statistics and plots were generated using a set of Perl and R
scripts, which are available upon request. Power law regressions
were done using the R nls function. For reasons of scale, the
regression for a power law relation such as
N ¼ cX a
was performed on the equivalent relationship
logðX Þ ¼ ð1=aÞðlogðc Þ logðN ÞÞ
for the parameters a and c, with the exception of the data for Fig. 6,
where instead the relationship
logðN Þ ¼ logðc Þ alogðX Þ
Evolution of Protein Domain Architectures 497
For further studies or research into this field, the first and most
important stop will be the domain databases. Table 2 presents a
selection of domain databases in current use.
Table 2
A selection of protein domain databases
Several software tools have been described and made available that
allow for analysis and visualization of domain architectures and
their evolution. A selection of such tools is shown in Table 3.
A few of these tools allow domain architecture evolution analy-
sis by visualizing each protein’s domain architecture along a protein
sequence tree. An example is the web tool TreeDom [96] which,
given a protein domain family and an anchor sequence, fetches the
family from Pfam and builds a tree with the nearest neighbors of the
anchor sequence. An example output from TreeDom is shown in
Fig. 7, in which a nonredundant set of representative proteomes
were queried. Here one can see that while the NUDIX domain of
the anchor sequence tends to co-occur with two other domains
(zf-NADH-PPase and NUDIX-like), it also has recombined with
many other domains over the course of evolution.
Other tools allow different types of analyses, for instance,
searching for similar domain architectures or showing taxonomic
distributions. Some of the protein domain databases listed in
Table 2 include variants of such analyses, while external tools
typically offer more specialized functionality. For example, the
Pfam website allows searching for domain content, while the java
tool PfamAlyzer allows searching Pfam for particular domain archi-
tecture patterns specified with a given domain order and
spacing [94].
The RAMPAGE/RADS tools [95] make use of domain assign-
ments for rapid homology searching. DoMosaics [92] is a software
Table 3
A selection of online software applying protein domain architecture evolution analysis
Fig. 7 TreeDom output using as query the NUDIX domain (PF00293), the human NUDT12 (Q9BQG2) protein,
30 closest sequences, and RP15 (representative proteomes at 15% co-membership). The domains are green,
NUDIX; blue, NUDIX-like (PF09296); yellow, zf-NADH-PPase (PF09297); red, Ocnus (PF05005); cyan, Ank_2
(PF12796); black, Ank_5 (PF13857); orange, Prefoldin (PF02996); and pink, Fibrinogen_C (PF00147)
tool that can act as a wrapper for domain annotation tools, allowing
detailed visualization and analysis of domain architectures, as does
DomArch [97]. The DAAC algorithm [98] explicitly transfers
functional annotation to query sequences based on domain archi-
tectural similarity to annotated homologs, as does FACT [93]. In
the same vein, similarity measures between architectures are avail-
able using the WDAC [99] tool and in ADASS [100]. Domain
architecture similarity is used for orthology detection in the
porthoDom software [68]. The DOGMA tool makes use of
domain content data to assess completeness of a proteome or
transcriptome [101].
500 Sofia K. Forslund et al.
14 Exercises/Questions
References
1. Chandonia J-M, Fox NK, Brenner SE (2017) InterPro in 2017—beyond protein family and
SCOPe: manual curation and artifact removal domain annotations. Nucleic Acids Res 45
in the structural classification of proteins – (D1):D190–D199
extended database. Comput Res Mol Biol 7. Marchler-Bauer A, Derbyshire MK, Gonzales
429(3):348–355 NR, Lu S, Chitsaz F, Geer LY et al (2015)
2. Dawson NL, Lewis TE, Das S, Lees JG, CDD: NCBI’s conserved domain database.
Lee D, Ashford P et al (2017) CATH: an Nucleic Acids Res 43(D1):D222–D226
expanded resource to predict protein function 8. Letunic I, Doerks T, Bork P (2015) SMART:
through structure and sequence. Nucleic recent updates, new developments and status
Acids Res 45(D1):D289–D295 in 2015. Nucleic Acids Res 43(D1):
3. Wilson D, Pethica R, Zhou Y, Talbot C, D257–D260
Vogel C, Madera M et al (2009) SUPER- 9. Bru C, Courcelle E, Carrère S, Beausse Y,
FAMILY—sophisticated comparative geno- Dalmar S, Kahn D (2005) The ProDom data-
mics, data mining, visualization and base of protein domain families: more empha-
phylogeny. Nucleic Acids Res 37(suppl_1): sis on 3D. Nucleic Acids Res 33(suppl_1):
D380–D386 D212–D215
4. Lam SD, Dawson NL, Das S, Sillitoe I, 10. UniProt Consortium (2017) UniProt: the
Ashford P, Lee D et al (2016) Gene3D: universal protein knowledgebase. Nucleic
expanding the utility of domain assignments. Acids Res 45(D1):D158–D169
Nucleic Acids Res 44(D1):D404–D409 11. Heger A, Wilton CA, Sivakumar A, Holm L
5. Finn RD, Mistry J, Tate J, Coggill P, Heger A, (2005) ADDA: a domain database with global
Pollington JE et al (2010) The Pfam protein coverage of the protein universe. Nucleic
families database. Nucleic Acids Res 38 Acids Res 33(suppl_1):D188–D191
(suppl_1):D211–D222 12. Lewis TE, Sillitoe I, Andreeva A, Blundell TL,
6. Finn RD, Attwood TK, Babbitt PC, Buchan DWA, Chothia C et al (2015) Gen-
Bateman A, Bork P, Bridge AJ et al (2017) ome3D: exploiting structure to help users
Evolution of Protein Domain Architectures 501
understand their sequences. Nucleic Acids 29. Demuth JP, Hahn MW (2009) The life and
Res 43(D1):D382–D386 death of gene families. Bioessays 31(1):29–39
13. Kummerfeld SK, Teichmann SA (2009) Pro- 30. Toll-Riera M, Albà MM (2013) Emergence of
tein domain organisation: adding order. BMC novel domains in proteins. BMC Evol Biol 13
Bioinformatics 10(1):39 (1):47
14. Weiner J, Moore AD, Bornberg-Bauer E 31. Moore AD, Bornberg-Bauer E (2012) The
(2008) Just how versatile are domains? BMC dynamics and evolutionary potential of
Evol Biol 8(1):285 domain loss and emergence. Mol Biol Evol
15. Buljan M, Bateman A (2009) The evolution 29(2):787–796
of protein domain families. Biochem Soc 32. Kersting AR, Bornberg-Bauer E, Moore AD,
Trans 37(4):751 Grath S (2012) Dynamics and adaptive bene-
16. Orozco-Mosqueda M d C, Altamirano- fits of protein domain emergence and
Hernandez J, Farias-Rodriguez R, Valencia- arrangements during plant genome evolution.
Cantero E, Santoyo G (2009) Homologous Genome Biol Evol 4(3):316–329
recombination and dynamics of rhizobial gen- 33. Huynen MA, van Nimwegen E (1998) The
omes. Res Microbiol 160(10):733–741 frequency distribution of gene family sizes in
17. Heyer W-D, Ehmsen KT, Liu J (2010) Regu- complete genomes. Mol Biol Evol 15
lation of homologous recombination in (5):583–589
eukaryotes. Annu Rev Genet 44:113–139 34. Qian J, Luscombe NM, Gerstein M (2001)
18. Brissett NC, Doherty AJ (2009) Repairing Protein family and fold occurrence in gen-
DNA double-strand breaks by the prokaryotic omes: power-law behaviour and evolutionary
non-homologous end-joining pathway. Bio- model11Edited by J. Thornton. J Mol Biol
chem Soc Trans 37(3):539 313(4):673–681
19. van Rijk A, Bloemendal H (2003) Molecular 35. Luscombe NM, Qian J, Zhang Z, Johnson T,
mechanisms of exon shuffling: illegitimate Gerstein M (2002) The dominance of the
recombination. Genetica 118(2):245–249 population by a selected few: power-law beha-
20. Feschotte C, Pritham EJ (2007) DNA trans- viour applies to a wide variety of genomic
posons and the evolution of eukaryotic gen- properties. Genome Biol 3(8):research0040.1
omes. Annu Rev Genet 41:331–368 36. Apic G, Gough J, Teichmann SA (2001)
21. Cordaux R, Batzer MA (2009) The impact of Domain combinations in archaeal, eubacterial
retrotransposons on human genome evolu- and eukaryotic proteomes11Edited by G. von
tion. Nat Rev Genet 10(10):691–703 Heijne. J Mol Biol 310(2):311–325
22. Gogvadze E, Buzdin A (2009) Retroelements 37. Apic G, Huber W, Teichmann SA (2003)
and their impact on genome evolution and Multi-domain protein families and domain
functioning. Cell Mol Life Sci 66(23):3727 pairs: comparison with known structures and
a random model of domain recombination. J
23. Patthy L (2003) Modular assembly of genes Struct Funct Genomics 4(2–3):67–78
and the evolution of new functions. In: Long
M (ed) Origin and evolution of new gene 38. Vogel C, Berzuini C, Bashton M, Gough J,
functions. Springer, Dordrecht, pp 217–231 Teichmann SA (2004) Supra-domains: evolu-
tionary units larger than single protein
24. Liu M, Grigoriev A (2004) Protein domains domains. J Mol Biol 336(3):809–823
correlate strongly with exons in multiple
eukaryotic genomes – evidence of exon shuf- 39. Karev GP, Wolf YI, Rzhetsky AY, Berezovs-
fling? Trends Genet 20(9):399–403 kaya FS, Koonin EV (2002) Birth and death
of protein domains: a simple model of evolu-
25. Buljan M, Frankish A, Bateman A (2010) tion explains power law behavior. BMC Evol
Quantifying the mechanisms of domain gain Biol 2:18–18
in animal proteins. Genome Biol 11(7):R74
40. Wuchty S (2001) Scale-free behavior in pro-
26. Weiner J, Beaussart F, Bornberg-Bauer E tein domain networks. Mol Biol Evol 18
(2006) Domain deletions and substitutions (9):1694–1702
in the modular protein evolution. FEBS J
273(9):2037–2047 41. Rzhetsky A, Gomez SM (2001) Birth of scale-
free molecular networks and the number of
27. Schmidt EE, Davies CJ (2007) The origins of distinct DNA and protein domains per
polypeptide domains. Bioessays 29 genome. Bioinformatics (Oxford, England)
(3):262–270 17(10):988–996
28. Bornberg-Bauer E, Huylmans A-K, Sikosek T 42. Li L, Alderson D, Doyle JC, Willinger W
(2010) How do new proteins arise? Nucl (2005) Towards a theory of scale-free graphs:
Acids Seq Topol 20(3):390–396
502 Sofia K. Forslund et al.
definition, properties, and implications. Inter- 55. Basu MK, Carmel L, Rogozin IB, Koonin EV
net Math 2(4):431–523 (2008) Evolution of protein domain promis-
43. Kuznetsov VA, Pickalov VV, Senko OV, Lnott cuity in eukaryotes. Genome Res 18
GD (2002) Analysis of the evolving pro- (3):449–461
teomes: predictions of the number of protein 56. Basu MK, Poliakov E, Rogozin IB (2009)
domains in nature and the number of genes in Domain mobility in proteins: functional and
eukaryotic organisms. J Biol Syst 10 evolutionary implications. Brief Bioinform 10
(04):381–407 (3):205–216
44. Koonin EV, Wolf YI, Karev GP (2002) The 57. Barabási A-L, Albert R (1999) Emergence of
structure of the protein universe and genome scaling in random networks. Science 286
evolution. Nature 420(6912):218–223 (5439):509
45. Yanai I, Camacho CJ, DeLisi C (2000) Pre- 58. Bashton M, Chothia C (2002) The geometry
dictions of gene family distributions in micro- of domain combination in proteins11Edited
bial genomes: evolution by gene duplication by J. Thornton. J Mol Biol 315(4):927–939
and modification. Phys Rev Lett 85 59. Gough J (2005) Convergent evolution of
(12):2641–2644 domain architectures (is rare). Bioinformatics
46. Eirin-Lopez JM, Rebordinos L, Rooney AP, 21(8):1464–1471
Rozas J (2012) The birth-and-death evolu- 60. Forslund K, Henricson A, Hollich V, Sonn-
tion of multigene families revisited. Genome hammer ELL (2008) Domain tree-based
Dyn 7:170–196 analysis of protein architecture evolution.
47. van Nimwegen E (2003) Scaling laws in the Mol Biol Evol 25(2):254–264
functional content of genomes. Trends Genet 61. Parikesit AA, Stadler PF, Prohaska SJ (2017)
19(9):479–484 Large-scale evolutionary patterns of protein
48. Ranea JAG, Sillero A, Thornton JM, Orengo domain distributions in eukaryotes. BioRxiv
CA (2006) Protein superfamily evolution and 62. Hsu C-H, Chiang AWT, Hwang M-J, Liao
the last universal common ancestor (LUCA). B-Y (2016) Proteins with highly evolvable
J Mol Evol 63(4):513–525 domain architectures are nonessential but
49. Chothia C, Gough J (2009) Genomic and highly retained. Mol Biol Evol 33
structural aspects of protein evolution. Bio- (5):1219–1230
chem J 419(1):15 63. Brivanlou AH, Darnell JE (2002) Signal
50. Ekman D, Björklund ÅK, Elofsson A (2007) transduction and the control of gene expres-
Quantification of the elevated rate of domain sion. Science 295(5556):813
rearrangements in metazoa. J Mol Biol 372 64. Weiner J III, Bornberg-Bauer E (2006) Evo-
(5):1337–1348 lution of circular permutations in multido-
51. Itoh M, Nacher JC, Kuma K, Goto S, Kane- main proteins. Mol Biol Evol 23(4):734–743
hisa M (2007) Evolutionary history and func- 65. Tordai H, Nagy A, Farkas K, Bányai L, Patthy
tional implications of protein domains and L (2005) Modules, multidomain proteins and
their combinations in eukaryotes. Genome organismic complexity. FEBS J 272
Biol 8(6):R121 (19):5064–5078
52. Nasir A, Kim KM, Caetano-Anollés G (2014) 66. Vogel C, Teichmann SA, Pereira-Leal J
Global patterns of protein domain gain and (2005) The relationship between domain
loss in superkingdoms. PLoS Comput Biol 10 duplication and recombination. J Mol Biol
(1):e1003452 346(1):355–365
53. Przytycka T, Davis G, Song N, Durand D 67. Xie X, Jin J, Mao Y (2011) Evolutionary ver-
(2005) Graph theoretical insights into evolu- satility of eukaryotic protein domains revealed
tion of multidomain proteins. In: Miyano S, by their bigram networks. BMC Evol Biol 11
Mesirov J, Kasif S, Istrail S, Pevzner PA, (1):242
Waterman M (eds) Res. Comput. Mol. Biol. 68. Bitard-Feildel T, Kemena C, Greenwood JM,
9th Annu. Int. Conf. RECOMB 2005 Camb. Bornberg-Bauer E (2015) Domain similarity
MA USA May 14-18 2005 Proc. Springer, based orthology detection. BMC Bioinfor-
Berlin, pp 311–325 matics 16(1):154
54. Marcotte EM, Pellegrini M, Ng H-L, Rice 69. Cohen-Gihon I, Fong JH, Sharan R,
DW, Yeates TO, Eisenberg D (1999) Detect- Nussinov R, Przytycka TM, Panchenko AR
ing protein function and protein-protein (2011) Evolution of domain promiscuity in
interactions from genome sequences. Science eukaryotic genomes-a perspective from the
285(5428):751
Evolution of Protein Domain Architectures 503
inferred ancestral domain architectures. Mol 84. Yang S, Bourne PE (2009) The evolutionary
Biosyst 7(3):784–792 history of protein domains viewed by species
70. Hsu C-H, Chen C-K, Hwang M-J (2013) phylogeny. PLoS One 4(12):e8378
The architectural design of networks of pro- 85. Wu Y-C, Rasmussen MD, Kellis M (2012)
tein domain architectures. Biol Lett 9 Evolution at the subgene level: domain rear-
(4):20130268 rangements in the drosophila phylogeny. Mol
71. Björklund ÅK, Ekman D, Light S, Frey- Biol Evol 29(2):689–705
Skött J, Elofsson A (2005) Domain rearran- 86. Stolzer M, Siewert K, Lai H, Xu M, Durand D
gements in protein evolution. J Mol Biol 353 (2015) Event inference in multidomain
(4):911–923 families with phylogenetic reconciliation.
72. Björklund ÅK, Ekman D, Elofsson A (2006) BMC Bioinformatics 16(14):S8
Expansion of protein domain repeats. PLoS 87. Zmasek CM, Godzik A (2012) This Déjà Vu
Comput Biol 2(8):e114 Feeling—analysis of multidomain protein
73. Nagy A, Szlama G, Szarka E, Trexler M, evolution in eukaryotic genomes. PLoS Com-
Banyai L, Patthy L (2011) Reassessing put Biol 8(11):e1002701
domain architecture evolution of metazoan 88. Leclère L, Rentzsch F (2012) Repeated evo-
proteins: major impact of gene prediction lution of identical domain architecture in
errors. Genes. 2(3):449–501 metazoan netrin domain-containing proteins.
74. Doolittle RF, Bork P (1993) Evolutionarily Genome Biol Evol 4(9):883–899
mobile modules in proteins. Sci Am 269 89. Nagy A, Bányai L, Patthy L (2011) Reasses-
(4):50–56 sing domain architecture evolution of meta-
75. Moore AD, Björklund ÅK, Ekman D, zoan proteins: major impact of errors caused
Bornberg-Bauer E, Elofsson A (2008) by confusing paralogs and epaktologs. Genes.
Arrangements in the modular evolution of 2(3):516–561
proteins. Trends Biochem Sci 33(9):444–451 90. Nagy A, Patthy L (2011) Reassessing domain
76. Grassi L, Fusco D, Sellerio A, Cora D, architecture evolution of metazoan proteins:
Bassetti B, Caselle M et al (2010) Identity the contribution of different evolutionary
and divergence of protein domain architec- mechanisms. Genes 2(3):578–598
tures after the yeast whole-genome duplica- 91. Geer LY, Domrachev M, Lipman DJ, Bryant
tion event. Mol Biosyst 6(11):2305–2315 SH (2002) CDART: protein homology by
77. Zhang X-C, Wang Z, Zhang X, Le MH, domain architecture. Genome Res 12
Sun J, Xu D et al (2012) Evolutionary dynam- (10):1619–1623
ics of protein domain architecture in plants. 92. Moore AD, Held A, Terrapon N, Weiner J III,
BMC Evol Biol 12(1):6 Bornberg-Bauer E (2014) DoMosaics: soft-
78. Sharma M, Pandey GK (2016) Expansion and ware for domain arrangement visualization
function of repeat domain proteins during and domain-centric analysis of proteins. Bio-
stress and development in plants. Front Plant informatics 30(2):282–283
Sci 6:1218 93. Koestler T, von Haeseler A, Ebersberger I
79. Farris JS (1977) Phylogenetic analysis under (2010) FACT: functional annotation transfer
Dollo’s law. Syst Zool 26(1):77–88 between proteins with similar feature archi-
80. Snel B, Bork P, Huynen M (2000) Genome tectures. BMC Bioinformatics 11(1):417
evolution. Trends Genet 16(1):9–11 94. Hollich V, Sonnhammer ELL (2007) PfamA-
81. Kummerfeld SK, Teichmann SA (2005) Rela- lyzer: domain-centric homology search. Bio-
tive rates of gene fusion and fission in multi- informatics 23(24):3382–3383
domain proteins. Trends Genet 21(1):25–30 95. Terrapon N, Weiner J, Grath S, Moore AD,
82. Fong JH, Geer LY, Panchenko AR, Bryant Bornberg-Bauer E (2014) Rapid similarity
SH (2007) Modeling the evolution of protein search of proteins using alignments of domain
domain architectures using maximum parsi- arrangements. Bioinformatics 30(2):274–281
mony. J Mol Biol 366(1):307–315 96. Haider C, Kavic M, Sonnhammer ELL
83. Wiedenhoeft J, Krause R, Eulenstein O (2016) TreeDom: a graphical web tool for
(2010) Inferring evolutionary scenarios for analysing domain architecture evolution. Bio-
protein domain compositions. In: informatics 32(15):2384–2385
Borodovsky M, Gogarten JP, Przytycka TM, 97. Vera-Parra N, Gutiérrez-Ramirez M, Lopez-
Rajasekaran S (eds) Bioinforma. Res. Appl. Sarmiento D (2016) Automatic construction
6th Int. Symp. ISBRA 2010 Storrs CT USA and graph-making of functional domain
May 23-26 2010 Proc. Springer, Berlin, pp architectures. Adv Nat Appl Sci 10
179–190 (12):99–106
504 Sofia K. Forslund et al.
98. Doğan T, MacDougall A, Saidi R, Poggioli D, 100. Syamaladevi DP, Joshi A, Sowdhamini R
Bateman A, O’Donovan C et al (2016) (2013) An alignment-free domain architec-
UniProt-DAAC: domain architecture align- ture similarity search (ADASS) algorithm for
ment and classification, a new method for inferring homology between multi-domain
automatic functional annotation in Uni- proteins. Bioinformation 9(10):491–499
ProtKB. Bioinformatics 32(15):2264–2271 101. Dohmen E, Kremer LPM, Bornberg-Bauer E,
99. Lee B, Lee D (2009) Protein comparison at Kemena C (2016) DOGMA: domain-based
the domain architecture level. BMC Bioinfor- transcriptome and proteome quality assess-
matics 10(15):S5 ment. Bioinformatics 32(17):2577–2581
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 16
Abstract
Understanding the abundance, diversity, and distribution of TEs in genomes is crucial to understand
genome structure, function, and evolution. Advances in whole-genome sequencing techniques, as well as
in bioinformatics tools, have increased our ability to detect and analyze the transposable element content in
genomes. In addition to reference genomes, we now have access to population datasets in which multiple
individuals within a species are sequenced. In this chapter, we highlight the recent advances in the study of
TE population dynamics focusing on fruit flies and humans, which represent two extremes in terms of TE
abundance, diversity, and activity. We review the most recent methodological approaches applied to the
study of TE dynamics as well as the new knowledge on host factors involved in the regulation of TE activity.
In addition to transposition rates, we also focus on TE deletion rates and on the selective forces that affect
the dynamics of TEs in genomes.
Key words Long-read sequencing, Transposition rates, Self-regulation, Effective population size,
Adaptation, Horizontal transfer
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_16, © The Author(s) 2019
505
506 Lain Guio and Josefa González
Fig. 1 TE content in the genome of different organisms expressed as percentage of the genome: Homo sapiens
(~45% [12], >66% [13]) Mus musculus [143], Saccharomyces cerevisiae [144], Arabidopsis thaliana [145],
Pyrococcus furiosus [146], Clostridium difficile [147], Danio rerio [133], Kryptolebias marmoratus [148],
Bombyx mori [149], Hypothenemus hampei [150], Drosophila melanogaster (11%, [68], ~20% [69]), Pseu-
dozyma antarctica, and Laccaria bicolor [151]. Zea mays [152] and Fritillaria imperialis [8]. All estimates were
obtained with homology-based methods except [13] that uses P-cloud and [69] that uses de novo approaches
Fig. 2 Effects of TEs on the host genome. (a) TEs can affect the expression and/or structure of genes. Exons
are represented as blue boxes and TEs as green boxes. (1) A TE inserted in the upstream region of a gene can
add insulator sequences, transcription factor-binding site (TFBS), or can disrupt an existing promoter gene;
(2) A TE inserted in an intron can truncate the mRNA or induce alternative splicing; (3) A TE inserted in the
downstream region of a gene can add microRNA binding sites or alter the polyadenylation site; (4) A TE
New Insights on the Evolution of Genome Content: Population Dynamics of. . . 509
Fig. 2 (continued) inserted in the exon of a gene can lead to exonization of the TE or to transcript truncation;
(5) the whole domain of a TE protein could insert in the coding region of a gene generating a chimeric gene
with host and TE domains [5, 21]. In addition to these changes that depend on where the TE is inserted and on
the sequences that the TE is adding, TEs can also alter the posttranslational modifications of histones. (b) TEs
could also induce translation repression by generating secondary structure in the 30 UTR of genes that leads to
changes in the localization of the mRNA. This secondary structure could bind to one of the protein components
of paraspeckle (P54nrb) and translocate to paraspeckle, a group of subnuclear bodies, avoiding moving out of
the nucleus. However, the same secondary structure could bind to the dsRNA-binding protein Staufen 1
(STAU1) and in this case translocate to cytoplasm. Once in the cytoplasm, the secondary structure could bind
to STAU1 again allowing translation, but under some situations mRNA could bind to the ds-RNA- dependent
protein kinase (PKR) repressing translation [23]. (c) Ectopic recombination between TE copies (green boxes
with yellow arrows) in the same orientation can lead to deletions when recombination takes place between
copies located on the same chromatid (1) or deletions and duplications when recombination takes place
between copies in different chromosomes (2) (recombination between two nonhomologous chromosomes
should lead to a translocation). Ectopic recombination between TE copies in opposite orientation leads to
inversion of the DNA between the two TEs (3)
510 Lain Guio and Josefa González
Table 1
Summary of recent TE population dynamic studies
(continued)
512 Lain Guio and Josefa González
Table 1
(continued)
(continued)
New Insights on the Evolution of Genome Content: Population Dynamics of. . . 513
Table 1
(continued)
TE REGULATION
HOST
Natural selection
Fig. 3 Factors that influence the population and evolutionary dynamics of TEs. Our understanding of TE
population and evolutionary dynamics is still incomplete. The different factors that affect TE population and
evolutionary dynamics are interrelated, new factors have been identified in recent years, and future research
is still likely to reveal existence of additional factors
Table 2
Summary of recent mathematical models and computer simulations applied to the study of TE
dynamics
(continued)
516 Lain Guio and Josefa González
Table 2
(continued)
4 Rates of Transposition
4.2.2 Regulation by Host Small RNAs, such as small-interfering RNAs (siRNAs) and piwi-
Factors interacting RNAs (piRNAs), are well-known to play an essential
role in silencing TEs and preventing transposition. Several recent
reviews highlight the monumental progress in this field
[115–119]. In addition to posttranscriptional regulation of TEs,
small RNAs are involved in transcriptional regulation as well. In
mouse, piRNAs are required for de novo methylation and silencing
of TEs [120]. In Drosophila, Piwi proteins repress transcription and
correlate with an increase in repressive chromatin marks at loci
targeted by piRNAs [121].
While the role of siRNAs and piRNAs has been established for
several years, a role of micro RNAs (miRs) in suppressing the
mobility of retrotransposons was only recently described
[122]. The authors showed that mir-128 binds to L1 RNA and
represses its integration in humans [122].
New studies have also provided evidence for the role in TE
repression of proteins previously known for their roles in other
cellular processes such as interferon-stimulated proteins, the
tumor suppressor p53, and the longevity regulating protein
SIRT6. Several interferon-stimulated genes, such as the Moloney
leukemia virus 10 (MOV10), the zinc-finger antiviral protein
(ZAP), and the 30 repair exonuclease 1 (TREX1), which are asso-
ciated with virus response, have been recently involved in the
inhibition of L1 activity [66, 123]. Recently, it has also been
shown that the p53 transcription factor, which is involved in stress
response networks and acts to restrict oncogenesis, also restricts
retrotransposon activity in zebra fish, flies, and humans [124]. The
authors showed that p53 interacts with components of the piwi-
interacting RNA to suppress retrotransposition [124]. Finally, the
longevity regulating protein SIRT6 is also involved in retrotranspo-
son repression by coordinating their packaging into transcription-
ally repressive heterochromatin. SIRT6 binds to the 50 UTR region
520 Lain Guio and Josefa González
5.1 Natural Selection Natural selection and stochastic processes influence both the rate of
Against TE Insertions fixation and the frequency distribution of TEs in populations. The
efficiency of selection depends on the effective population size,
which largely differs between Drosophila and humans: >108 and
~104, respectively [126, 127]. Thus, while in Drosophila the high
efficiency of selection should led to the removal of slightly deleteri-
ous TE insertions, in humans, these insertions may accumulate in
the genome. Indeed most of the TE sequences in the human
genome are remnants of ancient insertions [12].
A review by Barrón et al. [128] explored the latest insights on
the nature of selection acting against the deleterious effects of TEs
in D. melanogaster populations [128]. More recently, Kofler et al.
[129] analyzed intraspecific TE dynamics between D. melanogaster
and D. simulans populations to shed light on the long-term evolu-
tion of TEs [129]. They confirmed that most of the TEs are present
at low frequencies in D. melanogaster and showed that the same
pattern is present in D. simulans. Based on computer simulations
showing that 50% of the TE families have temporally heteroge-
neous transposition rates, and on the differences in TE composition
between populations of the same species, the authors suggested
that TE activity has recently increased in the two species. They
proposed that the demographic history of both species, with a
recent colonization of different environments, could be the cause
of the high TE activity detected [129].
In humans, a recent study took advantage of the 1000 Genome
Project data that reports 16,192 polymorphic TEs to perform the
most complete TE dynamics analysis to date [130]. Most of the
polymorphic TEs were found to be present at very low frequencies:
>93% of TEs showed <5% allele frequency in 26 human popula-
tions. These results confirm that overall polymorphic TE insertions
are deleterious in humans as was previously suggested with smaller
family-specific datasets [131].
5.2 TE-Induced Several recent reviews have compiled results that showcase the
Adaptations adaptive role of TEs [19, 24, 50, 59, 128]. We would like to
highlight the recent discovery of a TE in a fish-like marine chordate
that encodes RAG-like proteins with endonuclease-transposase
activity [39]. This discovery provides evidence that supports the
TE origin hypothesis for the adaptive immune system in jawed
vertebrates [39]. Two other recent publications provide
New Insights on the Evolution of Genome Content: Population Dynamics of. . . 521
6 Rate of Loss
8 Conclusion
9 Questions
How differences in the rate of DNA loss can affect the evolutionary
dynamics of TEs?
Why host regulation of transposition is relevant for TE dynamics?
New Insights on the Evolution of Genome Content: Population Dynamics of. . . 523
Acknowledgment
References
42. Moran JV, DeBerardinis RJ, Kazazian HH Jr Drosophila melanogaster. Mol Biol Evol 32
(1999) Exon shuffling by L1 retrotransposi- (7):1800–1814
tion. Science 283(5407):1530–1534 55. Merenciano M, Ullastres A, de Cara MA, Bar-
43. Goodier JL, Kazazian HH Jr (2008) Retro- ron MG, Gonzalez J (2016) Multiple inde-
transposons revisited: the restraint and reha- pendent retroelement insertions in the
bilitation of parasites. Cell 135(1):23–35 promoter of a stress response gene have vari-
44. Campbell IM, Gambin T, Dittwald P, Beck able molecular and functional effects in dro-
CR, Shuvarikov A, Hixson P, Patel A, sophila. PLoS Genet 12(8):e1006249
Gambin A, Shaw CA, Rosenfeld JA, Stankie- 56. Le Manh H, Guio L, Merenciano M,
wicz P (2014) Human endogenous retroviral Rovira Q, Barron MG, Gonzalez J (2017)
elements promote genome instability via Natural and laboratory mutations in kuzba-
non-allelic homologous recombination. nian are associated with zinc stress pheno-
BMC Biol 12:74 types in Drosophila melanogaster. Sci Rep
45. Startek M, Szafranski P, Gambin T, Campbell 7:42663
IM, Hixson P, Shaw CA, Stankiewicz P, Gam- 57. McCue AD, Nuthikattu S, Reeder SH, Slot-
bin A (2015) Genome-wide analyses of kin RK (2012) Gene expression and stress
LINE-LINE-mediated nonallelic homolo- response mediated by the epigenetic regula-
gous recombination. Nucleic Acids Res 43 tion of a transposable element small RNA.
(4):2188–2198 PLoS Genet 8(2):e1002474
46. Hancks DC, Kazazian HH Jr (2012) Active 58. Schrader L, Kim JW, Ence D, Zimin A,
human retrotransposons: variation and dis- Klein A, Wyschetzki K, Weichselgartner T,
ease. Curr Opin Genet Dev 22(3):191–203 Kemena C, Stokl J, Schultner E, Wurm Y,
47. Helman E, Lawrence MS, Stewart C, Smith CD, Yandell M, Heinze J, Gadau J,
Sougnez C, Getz G, Meyerson M (2014) Oettler J (2014) Transposable element islands
Somatic retrotransposition in human cancer facilitate adaptation to novel environments in
revealed by whole-genome and exome an invasive species. Nat Commun 5:5495
sequencing. Genome Res 24(7):1053–1063 59. Shapiro JA (2017) Exploring the read-write
48. Evrony GD, Lee E, Park PJ, Walsh CA (2016) genome: mobile DNA and mammalian adap-
Resolving rates of mutation in the brain using tation. Crit Rev Biochem Mol Biol 52
single-neuron genomics. Elife 5:e12966 (1):1–17
49. Payer LM, Steranka JP, Yang WR, 60. Magwire MM, Bayer F, Webster CL, Cao C,
Kryatova M, Medabalimi S, Ardeljan D, Jiggins FM (2011) Successive increases in the
Liu C, Boeke JD, Avramopoulos D, Burns resistance of Drosophila to viral infection
KH (2017) Structural variants caused by Alu through a transposon insertion followed by a
insertions are associated with risks for many duplication. PLoS Genet 7(10):e1002337
human diseases. Proc Natl Acad Sci U S A 114 61. Aminetzach YT, Macpherson JM, Petrov DA
(20):E3984–E3992 (2005) Pesticide resistance via transposition-
50. Chuong EB, Elde NC, Feschotte C (2016) mediated adaptive gene truncation in Dro-
Regulatory evolution of innate immunity sophila. Science 309(5735):764–767
through co-option of endogenous retro- 62. Schmidt JM, Good RT, Appleton B,
viruses. Science 351(6277):1083–1087 Sherrard J, Raymant GC, Bogwitz MR,
51. Gonzalez J, Lenkov K, Lipatov M, Macpher- Martin J, Daborn PJ, Goddard ME,
son JM, Petrov DA (2008) High rate of Batterham P, Robin C (2010) Copy number
recent transposable element-induced adapta- variation and transposable elements feature in
tion in Drosophila melanogaster. PLoS Biol 6 recent, ongoing adaptation at the Cyp6g1
(10):e251 locus. PLoS Genet 6(6):e1000998
52. Mateo L, Ullastres A, Gonzalez J (2014) A 63. Daborn PJ, Yen JL, Bogwitz MR, Le Goff G,
transposable element insertion confers xeno- Feil E, Jeffers S, Tijet N, Perry T, Heckel D,
biotic resistance in Drosophila. PLoS Genet Batterham P, Feyereisen R, Wilson TG,
10(8):e1004560 ffrench-Constant RH (2002) A single p450
53. Guio L, Barron MG, Gonzalez J (2014) The allele associated with insecticide resistance in
transposable element Bari-Jheh mediates oxi- Drosophila. Science 297(5590):2253–2256
dative stress response in Drosophila. Mol Ecol 64. Rishishwar L, Marino-Ramirez L, Jordan IK
23(8):2020–2030 (2017) Benchmarking computational tools
54. Ullastres A, Petit N, Gonzalez J (2015) for polymorphic transposable element detec-
Exploring the phenotypic space and the evo- tion. Brief Bioinform 18:908
lutionary history of a natural mutation in
526 Lain Guio and Josefa González
65. Rishishwar L, Wang L, Clayton EA, Marino- 77. Leblanc P, Desset S, Giorgi F, Taddei AR,
Ramirez L, McDonald JF, Jordan IK (2017) Fausto AM, Mazzini M, Dastugue B, Vaury
Population and clinical genetics of human C (2000) Life cycle of an endogenous retrovi-
transposable elements in the (post) genomic rus, ZAM, in Drosophila melanogaster. J Virol
era. Mob Genet Elements 7(1):1–20 74(22):10658–10669
66. Goodier JL (2016) Restricting retrotranspo- 78. Le Rouzic A, Payen T, Hua-Van A (2013)
sons: a review. Mob DNA 7:16 Reconstructing the evolutionary history of
67. Bergman CM, Quesneville H, transposable elements. Genome Biol Evol 5
Anxolabehere D, Ashburner M (2006) Recur- (1):77–86
rent insertion and duplication generate net- 79. Xue C, Goldenfeld N (2016) Stochastic
works of transposable element sequences in predator-prey dynamics of transposons in the
the Drosophila melanogaster genome. human genome. Phys Rev Lett 117
Genome Biol 7(11):R112 (20):208101
68. Sessegolo C, Burlet N, Haudry A (2016) 80. Startek M, Le Rouzic A, Capy P, Grzebelus D,
Strong phylogenetic inertia on genome size Gambin A (2013) Genomic parasites or sym-
and transposable element content among bionts? Modeling the effects of environmental
26 species of flies. Biol Lett 12(8):20160407 pressure on transposition activity in asexual
69. Quesneville H, Bergman CM, Andrieu O, populations. Theor Popul Biol 90:145–151
Autard D, Nouaud D, Ashburner M, Anxola- 81. Gogolesky K, Startek A, Gambin A, Le
behere D (2005) Combined evidence annota- Rouzic A (2016) Modelling the proliferation
tion of transposable elements in genome of transposable elements in populations under
sequences. PLoS Comput Biol 1(2):166–175 environmental stress. arXiv.
70. Kaminker JS, Bergman CM, Kronmiller B, arXiv:1611.04812
Carlson J, Svirskas R, Patel S, Frise E, Wheeler 82. Boutin TS, Le Rouzic A, Capy P (2012) How
DA, Lewis SE, Rubin GM, Ashburner M, Cel- does selfing affect the dynamics of selfish
niker SE (2002) The transposable elements of transposable elements? Mob DNA 3:5
the Drosophila melanogaster euchromatin: a 83. Kijima TE, Innan H (2013) Population
genomics perspective. Genome Biol 3(12): genetics and molecular evolution of DNA
RESEARCH0084 sequences in transposable elements. I. A sim-
71. Kapitonov VV, Jurka J (2003) Molecular pale- ulation framework. Genetics 195(3):957–967
ontology of transposable elements in the Dro- 84. Moulin S, Seux N, Chretien S, Guyeux C,
sophila melanogaster genome. Proc Natl Acad Lerat E (2017) Simulation-based estimation
Sci U S A 100(11):6569–6574 of branching models for LTR retrotranspo-
72. Singh ND, Petrov DA (2004) Rapid sequence sons. Bioinformatics 33(3):320–326
turnover at an intergenic locus in Drosophila. 85. Goubert C, Modolo L, Vieira C,
Mol Biol Evol 21(4):670–680 ValienteMoro C, Mavingui P, Boulesteix M
73. Yang HP, Barbash DA (2008) Abundant and (2015) De novo assembly and annotation of
species-specific DINE-1 transposable ele- the Asian tiger mosquito (Aedes albopictus)
ments in 12 Drosophila genomes. Genome repeatome with dnaPipeTE from raw geno-
Biol 9(2):R39 mic reads and comparative analysis with the
74. Petrov DA, Fiston-Lavier AS, Lipatov M, yellow fever mosquito (Aedes aegypti).
Lenkov K, Gonzalez J (2011) Population Genome Biol Evol 7(4):1192–1205
genomics of transposable elements in Dro- 86. Ewing AD (2015) Transposable element
sophila melanogaster. Mol Biol Evol 28 detection from whole genome sequence
(5):1633–1644 data. Mob DNA 6:24
75. Adrion JR, Song MJ, Schrider DR, Hahn 87. Kofler R, Gomez-Sanchez D, Schlotterer C
MW, Schaack S (2017) Genome-wide esti- (2016) PoPoolationTE2: comparative popu-
mates of transposable element insertion and lation genomics of transposable elements
deletion rates in Drosophila melanogaster. using pool-seq. Mol Biol Evol 33
Genome Biol Evol 9(5):1329–1340 (10):2759–2764
76. Kim A, Terzian C, Santamaria P, Pelisson A, 88. Fiston-Lavier AS, Barron MG, Petrov DA,
Purd’homme N, Bucheton A (1994) Retro- Gonzalez J (2015) T-lex2: genotyping, fre-
viruses in invertebrates: the gypsy retrotran- quency estimation and re-annotation of trans-
sposon is apparently an infectious retrovirus posable elements using single or pooled next-
of Drosophila melanogaster. Proc Natl Acad generation sequencing data. Nucleic Acids
Sci U S A 91(4):1285–1289 Res 43(4):e22
New Insights on the Evolution of Genome Content: Population Dynamics of. . . 527
89. Rahman R, Chirn GW, Kanodia A, Sytnikova Korlach J, Wilson RK, Eichler EE (2017)
YA, Brembs B, Bergman CM, Lau NC (2015) Discovery and genotyping of structural varia-
Unique transposon landscapes are pervasive tion from long-read haploid genome
across Drosophila melanogaster genomes. sequence data. Genome Res 27(5):677–685
Nucleic Acids Res 43(22):10655–10672 97. Pendleton M, Sebra R, Pang AW, Ummat A,
90. Treiber CD, Waddell S (2017) Resolving the Franzen O, Rausch T, Stutz AM, Stedman W,
prevalence of somatic transposition in Dro- Anantharaman T, Hastie A, Dai H, Fritz MH,
sophila. Elife 6:e28297 Cao H, Cohain A, Deikus G, Durrett RE,
91. Nelson MG, Linheiro RS, Bergman CM Blanchard SC, Altman R, Chin CS, Guo Y,
(2017) McClintock: an integrated pipeline Paxinos EE, Korbel JO, Darnell RB, McCom-
for detecting transposable element insertions bie WR, Kwok PY, Mason CE, Schadt EE,
in whole genome shotgun sequencing data. Bashir A (2015) Assembly and diploid archi-
G3 (Bethesda) 7:2763 tecture of an individual human genome via
92. McCarthy A (2010) Third generation DNA single-molecule technologies. Nat Methods
sequencing: Pacific Biosciences’ single mole- 12(8):780–786
cule real time technology. Chem Biol 17 98. Disdero E, Filee J (2017) LoRTE: detecting
(7):675–676 transposon-induced genomic variants using
93. McCoy RC, Taylor RW, Blauwkamp TA, Kel- low coverage PacBio long read sequences.
ley JL, Kertesz M, Pushkarev D, Petrov DA, Mob DNA 8:5
Fiston-Lavier AS (2014) Illumina TruSeq 99. Kelleher ES, Edelman NB, Barbash DA
synthetic long-reads empower de novo assem- (2012) Drosophila interspecific hybrids phe-
bly and resolve complex, highly-repetitive nocopy piRNA-pathway mutants. PLoS Biol
transposable elements. PLoS One 9(9): 10(11):e1001428
e106689 100. Vela D, Fontdevila A, Vieira C, Garcia Guer-
94. Cridland JM, Macdonald SJ, Long AD, reiro MP (2014) A genome-wide survey of
Thornton KR (2013) Abundance and distri- genetic instability by transposition in Dro-
bution of transposable elements in two Dro- sophila hybrids. PLoS One 9(2):e88992
sophila QTL mapping resources. Mol Biol 101. Romero-Soriano V, Modolo L, Lopez-
Evol 30(10):2311–2327 Maestre H, Mugat B, Pessia E,
95. Sudmant PH, Rausch T, Gardner EJ, Hand- Chambeyron S, Vieira C, Garcia Guerreiro
saker RE, Abyzov A, Huddleston J, Zhang Y, MP (2017) Transposable element misregula-
Ye K, Jun G, Fritz MH, Konkel MK, tion is linked to the divergence between
Malhotra A, Stutz AM, Shi X, Casale FP, parental piRNA pathways in drosophila
Chen J, Hormozdiari F, Dayama G, Chen K, hybrids. Genome Biol Evol 9(6):1450–1470
Malig M, Chaisson MJP, Walter K, Meiers S, 102. Robillard E, Le Rouzic A, Zhang Z, Capy P,
Kashin S, Garrison E, Auton A, Lam HYK, Hua-Van A (2016) Experimental evolution
Mu XJ, Alkan C, Antaki D, Bae T, Cerveira E, reveals hyperparasitic interactions among
Chines P, Chong Z, Clarke L, Dal E, Ding L, transposable elements. Proc Natl Acad Sci U
Emery S, Fan X, Gujral M, Kahveci F, Kidd S A 113(51):14763–14768
JM, Kong Y, Lameijer EW, McCarthy S, 103. Gonzalez J, Petrov DA (2012) Evolution of
Flicek P, Gibbs RA, Marth G, Mason CE, genome content: population dynamics of
Menelaou A, Muzny DM, Nelson BJ, transposable elements in flies and humans.
Noor A, Parrish NF, Pendleton M, Methods Mol Biol 855:361–383
Quitadamo A, Raeder B, Schadt EE, 104. Cordaux R, Hedges DJ, Herke SW, Batzer
Romanovitch M, Schlattl A, Sebra R, Shabalin MA (2006) Estimating the retrotransposition
AA, Untergasser A, Walker JA, Wang M, Yu F, rate of human Alu elements. Gene
Zhang C, Zhang J, Zheng-Bradley X, 373:134–137
Zhou W, Zichner T, Sebat J, Batzer MA,
McCarroll SA, Genomes Project C, Mills 105. Ewing AD, Kazazian HH Jr (2010) High-
RE, Gerstein MB, Bashir A, Stegle O, Devine throughput sequencing reveals extensive vari-
SE, Lee C, Eichler EE, Korbel JO (2015) An ation in human-specific L1 content in individ-
integrated map of structural variation in ual human genomes. Genome Res 20
2,504 human genomes. Nature 526 (9):1262–1270
(7571):75–81 106. Huang CR, Schneider AM, Lu Y, Niranjan T,
96. Huddleston J, Chaisson MJP, Steinberg KM, Shen P, Robinson MA, Steranka JP, Valle D,
Warren W, Hoekzema K, Gordon D, Graves- Civin CI, Wang T, Wheelan SJ, Ji H, Boeke
Lindsay TA, Munson KM, Kronenberg ZN, JD, Burns KH (2010) Mobile interspersed
Vives L, Peluso P, Boitano M, Chin CS,
528 Lain Guio and Josefa González
repeats are major structural variants in the 120. Aravin AA, Sachidanandam R, Bourc’his D,
human genome. Cell 141(7):1171–1182 Schaefer C, Pezic D, Toth KF, Bestor T, Han-
107. Xing J, Zhang Y, Han K, Salem AH, Sen SK, non GJ (2008) A piRNA pathway primed by
Huff CD, Zhou Q, Kirkness EF, Levy S, Bat- individual transposons is linked to de novo
zer MA, Jorde LB (2009) Mobile elements DNA methylation in mice. Mol Cell 31
create structural variation: analysis of a com- (6):785–799
plete human genome. Genome Res 19 121. Le Thomas A, Rogers AK, Webster A, Mar-
(9):1516–1526 inov GK, Liao SE, Perkins EM, Hur JK, Ara-
108. Ernst C, Odom DT, Kutter C (2017) The vin AA, Toth KF (2013) Piwi induces piRNA-
emergence of piRNAs against transposon guided transcriptional silencing and establish-
invasion to preserve mammalian genome ment of a repressive chromatin state. Genes
integrity. Nat Commun 8(1):1411 Dev 27(4):390–399
109. McCullers TJ, Steiniger M (2017) Transpo- 122. Hamdorf M, Idica A, Zisoulis DG,
sable elements in Drosophila. Mob Genet Gamelin L, Martin C, Sanders KJ, Pedersen
Elements 7(3):1–18 IM (2015) miR-128 represses L1 retrotran-
110. Charlesworth B, Charlesworth D (1983) The sposition by binding directly to L1 RNA. Nat
population dynamics of transposable ele- Struct Mol Biol 22(10):824–831
ments. Genet Res 42(1):1–27 123. Ariumi Y (2016) Guardian of the human
111. Lohe AR, Hartl DL (1996) Autoregulation of genome: host defense mechanisms against
mariner transposase activity by overproduc- LINE-1 retrotransposition. Front Chem 4:28
tion and dominant-negative complementa- 124. Wylie A, Jones AE, D’Brot A, Lu WJ, Kurtz P,
tion. Mol Biol Evol 13(4):549–555 Moran JV, Rakheja D, Chen KS, Hammer
112. Grabundzija I, Irgang M, Mates L, Belay E, RE, Comerford SA, Amatruda JF, Abrams
Matrai J, Gogol-Doring A, Kawakami K, JM (2016) p53 genes function to restrain
Chen W, Ruiz P, Chuah MK, mobile elements. Genes Dev 30(1):64–77
VandenDriessche T, Izsvak Z, Ivics Z (2010) 125. Van Meter M, Kashyap M, Rezazadeh S,
Comparative analysis of transposable element Geneva AJ, Morello TD, Seluanov A, Gorbu-
vector systems in human cells. Mol Ther 18 nova V (2014) SIRT6 represses LINE1 retro-
(6):1200–1209 transposons by ribosylating KAP1 but this
113. Bire S, Casteret S, Arnaoty A, Piegu B, repression fails with stress and age. Nat Com-
Lecomte T, Bigot Y (2013) Transposase con- mun 5:5011
centration controls transposition activity: 126. Karasov T, Messer PW, Petrov DA (2010)
myth or reality? Gene 530(2):165–171 Evidence that adaptation in Drosophila is
114. Woodard LE, Downes LM, Lee YC, Kaja A, not limited by mutation at single sites. PLoS
Terefe ES, Wilson MH (2017) Temporal self- Genet 6(6):e1000924
regulation of transposition through host- 127. Park L (2011) Effective population size of
independent transposase rodlet formation. current human population. Genet Res
Nucleic Acids Res 45(1):353–366 (Camb) 93(2):105–114
115. Wheeler BS (2013) Small RNAs, big impact: 128. Barron MG, Fiston-Lavier AS, Petrov DA,
small RNA pathways in transposon control Gonzalez J (2014) Population genomics of
and their effect on the host stress response. transposable elements in Drosophila. Annu
Chromosome Res 21(6-7):587–600 Rev Genet 48:561–581
116. Clark JP, Lau NC (2014) Piwi proteins and 129. Kofler R, Nolte V, Schlotterer C (2015)
piRNAs step onto the systems biology stage. Tempo and mode of transposable element
Adv Exp Med Biol 825:159–197 activity in drosophila. PLoS Genet 11(7):
117. Toth KF, Pezic D, Stuwe E, Webster A (2016) e1005406
The piRNA pathway guards the germline 130. Rishishwar L, Tellez Villa CE, Jordan IK
genome against transposable elements. Adv (2015) Transposable element polymorphisms
Exp Med Biol 886:51–77 recapitulate human evolution. Mob DNA
118. Yang F, Xi R (2017) Silencing transposable 6:21
elements in the Drosophila germline. Cell 131. Boissinot S, Davis J, Entezam A, Petrov D,
Mol Life Sci 74(3):435–448 Furano AV (2006) Fitness cost of LINE-1
119. Luo S, Lu J (2017) Silencing of transposable (L1) activity in humans. Proc Natl Acad Sci
elements by piRNAs in drosophila: an evolu- U S A 103(25):9590–9594
tionary perspective. Genomics Proteomics 132. Sundaram V, Choudhary MN, Pehrsson E,
Bioinformatics 15(3):164–176 Xing X, Fiore C, Pandey M, Maricque B,
New Insights on the Evolution of Genome Content: Population Dynamics of. . . 529
153. Boissinot S, Sookdeo A (2016) The evolution mapping of transposable element diversity
of LINE-1 in vertebrates. Genome Biol Evol reveals links to gene regulation and epige-
8(12):3485–3507 nomic variation. Elife 5:e20777
154. Quadrana L, Bortolini Silveira A, Mayhew 156. Yan L, Gu YH, Tao X, Lai XJ, Zhang YZ, Tan
GF, LeBlanc C, Martienssen RA, Jeddeloh XM, Wang H (2014) Scanning of transposa-
JA, Colot V (2016) The Arabidopsis thaliana ble elements and analyzing expression of
mobilome and its impact at the species level. transposase genes of sweet potato Ipomoea
Elife 5:e15716 batatas. PLoS One 9(3):e90895
155. Stuart T, Eichten SR, Cahn J, Karpievitch YV,
Borevitz JO, Lister R (2016) Population scale
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Part V
Abstract
In this chapter, we give a short introduction to the genetics of complex diseases emphasizing evolutionary
models for disease genes and the effect of different models on the genetic architecture, and we give a survey
of the state-of-the-art of genome-wide association studies (GWASs).
Key words Complex diseases, Association mapping, Genome-wide association studies, Common
disease/common variant
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_17, © The Author(s) 2019
533
534 Søren Besenbacher et al.
domain
domain
domain
Fig. 1 Mutation, drift, and selection. New mutations enter a population at stochastic intervals, determined by
the mutation rate, u, and the effective population size, N. For low or high frequencies, where the range of such
frequencies is determined by the selection factor, s, and the effective population size, the frequency of a
mutant allele changes stochastically. At medium frequencies, on the other hand, the frequency of the allele
changes up or down, depending on s, in a practically deterministic fashion. If a positively selected allele
reaches moderate frequency, it will quickly be brought to high frequency, at a speed also determined by s and
N
1.0
0.8
Cumulative disease
allele frequency
0.6
Frequency
0.4
0.2
Individual rare
allele frequencies
0.0
Fig. 2 Accumulation of several rare frequencies. If selection works against a set of alleles, each will be kept at
a low frequency. Their accumulated frequency, however, can be high in the population
Fig. 3 A population out of equilibrium following an expansion. In a transition period following a population
expansion, the allele frequency patterns are different from the patterns in a stable population
538 Søren Besenbacher et al.
Fig. 4 A population out of equilibrium following changes in the selective landscape. If the selection of an allele
changes direction, so the positively selected allele becomes negatively selected and vice versa, it will
eventually move through moderate frequencies. Following a change in the selective landscape, it is thus
possible to find alleles at moderate frequencies that would not otherwise be found
2.2 The Allelic The vast majority of human nucleotide variation is very rare because
Frequency Spectrum in of our history of population bottlenecks followed by rapid growth.
Humans For instance, in the 2500 individuals of the 1000 genomes study,
64 million SNVs have frequency <0.5%, and 20 million SNVs have
frequency >0.5% [11]. Nevertheless the majority of heterozygous
variants observed within a single individual are not rare [11]. The
Association Mapping and Disease: Evolutionary Perspectives 539
rare variants are most often very recent and therefore specific to
populations, and they are also more often deleterious because
selection has not yet acted on them [12]. This is particularly clear
for loss-of-function variants and other protein-coding variants. A
study of 2636 Icelanders found that the fraction of variants with a
minor allele frequency (MAF) below 0.1% was 62% for protein-
truncating variants, 46% for missense variants, and 38% for synony-
mous variants [13].
The strong recent population expansions have also allowed
variants to increase in frequency by surfing on the population
expansion wave front even if they would be selected against in a
population with stable size. Thus, rare variants with large effects on
disease may exist. The GWAS studies so far have been successful in
identifying a large set of common variants associated with disease,
so common variants contributing to disease do exist. It is likely that
rare variants with large phenotypic effects also contribute to the
heritability of many common diseases, but the extend is likely to be
disease specific.
The first GWASs were published around 2006 [14, 15] when
Illumina and Affymetrix first introduced genotyping chips that
made it possible to test hundreds of thousands of SNPs quickly
and inexpensively. The GWASs’ approach to find susceptibility
variants for diseases boils down to testing approximately 0.3–-
2 million SNPs (depending on chip type) for differences in allele
frequencies between cases and controls, adjusting for the high
number of multiple tests. This approach is a wonderfully simple
procedure that requires no complicated statistics or algorithms but
only well-known statistical tests and a minimum of computing
power. Despite the simplicity, some issues remain, such as faulty
genotype data and confounding factors that can result in erroneous
findings if not handled properly. The most important aspects of any
GWAS are, therefore, thorough quality control of the data used and
measures to avoid and reduce the effect of confounding factors.
3.1 Statistical Tests The primary analysis in an association study is usually testing each
variant separately under the assumption of an additive or multipli-
cative model. One way of doing that is by creating a 2 2 allelic
contingency table as shown in Table 1 by summing the number of
A and B alleles seen in all case individuals and all control individuals.
Be aware that we are counting alleles and not individuals in this
contingency table, so Ncases will be equal to two times the number
of case individuals because each individual carries two copies of each
variant unless we are looking at non-autosomal DNA. If there is no
association between the variant and the disease in question, we
540 Søren Besenbacher et al.
Table 1
Contingency table for allele counts in case/control data
Allele A Allele B
Case Ncase,A Ncase,B Ncases
Control Ncontrol,A Ncontrol,B Ncontrols
NA NB N
Table 2
Expected allele counts in case/control data
Allele A Allele B
Case (Ncases · NA)/N (Ncases · NB)/N Ncases
Control (Ncontrols · NA)/N (Ncontrols · NB)/N Ncontrols
NA NB N
3.2 Effect Estimates A commonly used way of measuring the effect size of an association
is the allelic odds ratio (OR), which is the ratio of the odds of being
a case given that you carry n copies of alleles A to the odds of being
a case if you carry n 1 copies of allele A. Assuming a multiplicative
model, this can be calculated as:
OR ¼ N case, A =N control, A = N case, B =N control, B
¼ N case, A N control, B =N case, B N control, A
Another measure of effect size that is perhaps more intuitive is
the relative risk (RR), which is the disease risk in carriers divided by
the disease risk in noncarriers. This measure, however, suffers from
the weakness that it is harder to estimate. If our cases and controls
were sampled from the population in an unbiased way, the allelic
RR could be calculated as:
RR ¼ N case, A =N A = N case, B =N B
but it is very rare to have an unbiased population sample in associa-
tion studies because the studies are generally designed to deliber-
ately oversample the cases to increase the power. This oversampling
affects the RR as calculated by the formula above but not the OR
which is one of the reasons why the OR is usually reported in
association studies instead of the RR.
3.3 Quality Control Data quality problems can be either variant specific or individual
specific, and inspection usually results in the removal of both prob-
lematic individuals and problematic variants from the data set.
Individual-specific problems can be caused by low DNA quality
or contamination by foreign DNA. A sample of low DNA quality
results in a high rate of missing data, where particular variants
cannot be called, and there is a higher risk of miscalling variants.
It is, therefore, recommended that individuals lacking calls in more
than 2–3% of the variants are removed from the analysis. Excess
heterozygosity is an indicator of sample contamination, and indivi-
duals displaying that should also be disregarded. Sex checks and
other kinds of phenotype tests might also be applied to remove
individuals, where the genotype information does not match the
phenotype information due to a sample mix-up [18].
For a given variant, the data from an individual can be suspi-
cious in two ways: it can fail to be called by the genotype-calling
program or it can be miscalled. Typically, a conservative cutoff value
is used in the calling process securing that most problems show up
as missing data rather than miscalls. Most problematic variants,
therefore, reveal a high fraction of missing data, and variants miss-
ing calls above a given threshold (typically, 1–5%) are removed.
Miscalls typically occur when the homozygotes are hard to distin-
guish from the heterozygotes, and some of the heterozygotes are
being misclassified as homozygotes or vice versa. Both biases
542 Søren Besenbacher et al.
3.4 Confounding Confounding in GWAS can arise if there are genotyping batch
Factors effects or if there is population or family structure in the sample.
For example, if cases and controls in GWAS are predominantly
collected from geographically distinct areas, association signals
could arise due to genetic differences caused by geographic varia-
tion, and most of such genetic signals are unlikely to be causal. Such
confounding due to population structure typically occurs when
samples have different genetic ancestry, e.g., if the sample contains
individuals of both European and Asian ancestry. Population struc-
ture confounding can also happen when the population structure is
more subtle, especially for large sample sizes. Methods for inferring
population substructure, such as principal components analysis, are
useful for detecting outliers we can remove from the data
[20]. However, this approach is not suitable when dealing with
subtle structure, as a small bias can become significant in a large
enough sample of individuals of similar genetic ancestry.
Confounding in GWAS can be detected as inflation of the test
statistics, beyond what is expected due to truly causal variants. A
useful way of visualizing such inflation of test statistics is the
so-called quantile–quantile (QQ) plot. In this plot, ranked values
of the test statistic are plotted against their expected distribution
under the null hypothesis. In the case of no true positives and no
inflation of the test statistic due to population structure or cryptic
relatedness, the points of the plot lie on the x ¼ y line (see Fig. 5a).
True positives show an increase in values above the line in the right
tail of the distribution but do not affect the rest of the points since
a b c
25
25
25
20
20
20
Sample Quantiles
Sample Quantiles
Sample Quantiles
15
15
15
10
10
5 10
5
0
0 5 10 15 20 0 5 10 15 20 0 5 10 15 20
Theoretical Quantiles Theoretical Quantiles Theoretical Quantiles
Fig. 5 QQ plots from a χ 2 distribution. (a) A QQ plot, where the observation follows the expected distribution.
(b) A QQ plot, where the majority of observations follow the expected distribution, but where some have
unexpectedly high values, i.e., are statistically significant. (c) A QQ plot, where the observations all seem to be
higher than expected, which is an indication that the observations are not following the expected distribution
Association Mapping and Disease: Evolutionary Perspectives 543
3.5 Meta-analysis The statistical power to detect association depends directly on the
of GWAS sample size used, all other things being equal. This fact has driven
researchers to collaborate across institutions and countries in
GWAS consortia, where they combine multiple cohorts in one
large analysis. However, for logistic and legal reasons, it may not
be possible to share individual-level genotypes, which are required
for all of the GWAS approaches covered so far. Meta-analyses of
GWASs performed in each cohort are a solution to this problem.
These require coordination between the researchers, where they
share GWAS summary statistics instead of individual-level geno-
types. These summary statistics are then meta-analyzed using sta-
tistical approaches that either assume a constant effect across
cohorts or not. In recent years many large-scale GWAS meta-
analyses have been published, and the resulting summary statistics
of these are often made public, providing a treasure trove for
understanding genetics of common diseases and traits [32].
3.6 Replication The best way to make sure that a finding is real is to replicate it. If
the same signal is found in an independent set of cases and controls,
it means that the association is unlikely to be the result of a con-
founding factor specific to the original data. Likewise, if the associ-
ation persists after typing the markers using another genotyping
method, it means that it is not a false positive due to some artifact of
the genotyping method used.
When trying to replicate a finding, the best strategy is to try to
replicate it in a population of similar ancestry. A marker that corre-
lates with a true causal variant in one population might not be
correlated with the same variant in a population of different ethnic-
ity, where the LD structure can be different. This is especially
problematic when trying to replicate an association found in a
non-African population in an African population [33]. A marker
might easily have 20 completely correlated markers in a European
population, but no good correlates in an African population. To
replicate a finding in the European population of one of these
variants, it does not suffice to test one of the variants in an African
population; all 20 variants must be tested. This, however, also offers
a way to fine map the signal and possibly find the causative
variant [34].
Before spending time and effort to replicate an association
signal in a foreign cohort, it is a good idea to search for the existing
partial replication of the marker within the data. Usually, a marker is
surrounded by several correlated markers on the genotyping chip,
and if one marker shows a significant association, then the corre-
lated markers should show an association too. If a marker is signifi-
cantly associated with a disease, but no other marker in the region
is, then it should be viewed as suspicious.
Association Mapping and Disease: Evolutionary Perspectives 545
4.1 Selection of The two important choices when performing imputation are the
Reference Data Set reference data set to use and the software to use. Usually, a publicly
available reference data set, such as the 1000 Genomes Project [11]
or the large Haplotype Reference Consortium [37], is used. Alter-
natively, researchers sequence a part of their study cohort and thus
create their own reference data set. The latter strategy has the
advantage that one can be certain that the ancestry of the reference
data matches the ancestry of the study cohort. It is important that
the reference data be from a population that is similar to the study
population. If the reference population is too distantly related to
the study population, the reliability of the imputed data will be
reduced. The quality and nature of the reference data also limit the
quality of the imputed data in other ways. A reference data set
consisting of only a small number of individuals is not able to
reliably estimate the frequency of rare variants and that in turn
means that the imputation of rare variants lacks in accuracy. This
means that there is a natural limit to how low a frequency a variant
can have and still be reliably imputed.
546 Søren Besenbacher et al.
4.3 Testing Imputed Since imputation is based on probabilistic models, the output is
Variants merely a probability for each genotype for the unknown variants in
a given individual. That is, instead of reporting the genotype of an
individual as AG, say, the program reports that the probability of
Association Mapping and Disease: Evolutionary Perspectives 547
5 Current Status
Fig. 6 GWAS statistics from the NHGRI-EBI GWAS Catalog [63] (accessed June 2017). (a) The cumulative
number of GWASs published since 2005. (b) The initial sample sizes of the GWASs. For dichotomous traits the
combined number of cases and controls is shown. Replication samples are not counted. (c) The number of
tested variants in each study
548 Søren Besenbacher et al.
number of variants that are tested in a GWAS (Fig. 6c). All these
GWASs published in the last decade have increased our knowledge
about the genetic architecture of common diseases a lot. In this
section, we will go through some of the insights that have been
revealed by these studies.
5.1 Polygenic GWASs have consistently shown that most complex traits and dis-
Architecture of eases have very polygenic architectures with a large number of
Common Diseases causal variants with small effects. The small effect sizes mean that
enormous sample sizes are needed to detect the associated variants
and that each variant only explains a small fraction of the heritabil-
ity. Even though large sample sizes have led to the discovery of
many loci affecting common diseases, the aggregated effect of all
these loci still only explains a small fraction of the heritability.
A good example is type 2 diabetes where researchers by 2012
had identified 63 associated loci that collectively only explained
5.7% of the liability-scale variance [48]. Such results led to much
discussion about the possible source of the remaining “missing
heritability” [49, 50]. A significant contribution to this debate
was when researchers in 2010 started using mixed linear models
to estimate the heritability explained by all common variants not
only those that surpass a conservative significance threshold. These
studies showed that a significant fraction of the so-called missing
heritability was not truly missing from the GWAS data sets but only
hidden due to small effect sizes. This was first illustrated in height
where 180 statistically significant SNPs could only explain 10% of
the heritability, but this fraction increased to 45% when all geno-
typed variants were considered [51].
For common diseases, such analyses have typically shown that
around half of the heritability can be explained by considering all
common variants. Given the small individual contribution of each
of the discovered variants and that the individual contribution of
the yet to be found variants will be even smaller, it is likely that the
actual number of causal variants will be much more than a thousand
for many common diseases. Recent data shows that in many dis-
eases these causal variants are relatively uniformly distributed along
the genome. It has, for instance, been estimated that 71–100% of
1 MB windows in the genome contribute to the heritability of
schizophrenia [52]. Another article recently estimated that most
100 kB windows contribute to the variation of height and that
more than 100,000 markers have an independent effect on height.
This strikingly large number leads the authors to propose a new
“omnigenic” model in which most genes expressed in a cell type
that is relevant for a given disease have a nonzero contribution to
the heritability of that disease [53].
5.2 Pleiotropy The variants that have been discovered by GWASs so far reveal
numerous examples where one genetic locus affects multiple often
seemingly unrelated traits [54, 55]. One explanation for such a
Association Mapping and Disease: Evolutionary Perspectives 549
5.3 Differences Because of differences in age of onset and severity, we do not expect
Between Diseases identical allelic architectures in all common diseases. Using the
currently available GWAS data sets, we can now start to identify
these differences in the allelic architectures, but because of the
significant differences in samples sizes and the number of tested
variants, this is not an easy task.
The data available to date show that the degree of polygenicity
differs between diseases with schizophrenia, for example, having
more predicted loci than immune disorders [57] and hypertension
[52]. Results also show that rare variants play a larger role in some
diseases compared to others. Rare variants, for example, have a
greater role in amyotrophic lateral sclerosis than in schizophrenia
[58] and are even less important in lifestyle-dependent diseases
such as type 2 diabetes [59].
6 Perspectives
7 Questions
References
1. Visscher PM, Hill WG, Wray NR (2008) Heri- 15. Duerr RH, Taylor KD, Brant SR et al (2006) A
tability in the genomics era |[mdash]| concepts genome-wide association study identifies
and misconceptions. Nat Rev Genet 9:255. IL23R as an inflammatory bowel disease gene.
https://doi.org/10.1038/nrg2322 Science 314:1461–1463. https://doi.org/10.
2. Cardon LR, Abecasis GR (2003) Using haplo- 1126/science.1135245
type blocks to map human complex trait loci. 16. Lewis CM (2002) Genetic association studies:
Trends Genet 19:135 design, analysis and interpretation. Brief Bioin-
3. de Bakker PIW, Yelensky R, Pe’er I et al (2005) form 3:146–153
Efficiency and power in genetic association 17. Balding DJ (2006) A tutorial on statistical
studies. Nat Genet 37:1217–1223. https:// methods for population association studies.
doi.org/10.1038/ng1669 Nat Rev Genet 7:781–791. https://doi.org/
4. Daly MJ, Rioux JD, Schaffner SF et al (2001) 10.1038/nrg1916
High-resolution haplotype structure in the 18. Wellcome Trust Case-Control Consortium
human genome. Nat Genet 29:229–232. (2007) Genome-wide association study of
https://doi.org/10.1038/ng1001-229 14,000 cases of seven common diseases and
5. Shi H, Kichaev G, Paşaniuc B (2016) Contrast- 3,000 shared controls. Nature 447:661–678.
ing the genetic architecture of 30 complex https://doi.org/10.1038/nature05911
traits from summary association data. Am J 19. Mccarthy MI, Abecasis GCAR, Cardon LR
Hum Genet 99:139–153. https://doi.org/ et al (2008) Genome-wide association studies
10.1016/j.ajhg.2016.05.013 for complex traits: consensus, uncertainty and
6. Wright S (1931) Evolution in mendelian popu- challenges. Nat Rev Genet 9:356. https://doi.
lations. Genetics 16:97–159 org/10.1038/nrg2344
7. Pritchard JK (2001) Are rare variants responsi- 20. Patterson N, Price AL, Reich D (2006) Popu-
ble for susceptibility to complex diseases? Am J lation structure and eigenanalysis. PLoS Genet
Hum Genet 69:124–137. https://doi.org/10. 2:e190. https://doi.org/10.1371/journal.
1086/321272 pgen.0020190
8. Pritchard JK, Cox NJ (2002) The allelic archi- 21. Devlin B, Roeder K (1999) Genomic control
tecture of human disease genes: common dis- for association studies. Biometrics
ease–common variant. . . or not? Hum Mol 55:997–1004
Genet 11:2417 22. Devlin B, Roeder K, Wasserman L (2001)
9. Reich DE, Lander ES (2001) On the allelic Genomic control, a new approach to genetic-
spectrum of human disease. Trends Genet based association studies. Theor Popul Biol
17:502 60:155–166. https://doi.org/10.1006/tpbi.
10. Di Rienzo A, Hudson RR (2005) An evolu- 2001.1542
tionary framework for common diseases: the 23. Yang J, Weedon MN, Purcell S et al (2011)
ancestral-susceptibility model. Trends Genet Genomic inflation factors under polygenic
21:596–601. https://doi.org/10.1016/j.tig. inheritance. Eur J Hum Genet 19:807–812.
2005.08.007 https://doi.org/10.1038/ejhg.2011.39
11. 1000 Genomes Project Consortium, Auton A, 24. Bulik-Sullivan BK, Loh P-R, Finucane HK et al
Brooks LD et al (2015) A global reference for (2015) LD Score regression distinguishes con-
human genetic variation. Nature 526:68–74. founding from polygenicity in genome-wide
https://doi.org/10.1038/nature15393 association studies. Nat Genet 47:291–295.
12. Quintana-Murci L (2016) Understanding rare https://doi.org/10.1038/ng.3211
and common diseases in the context of human 25. Price AL, Patterson NJ, Plenge RM et al
evolution. Genome Biol 17:225. https://doi. (2006) Principal components analysis corrects
org/10.1186/s13059-016-1093-y for stratification in genome-wide association
13. Gudbjartsson DF, Helgason H, Gudjonsson studies. Nat Genet 38:904–909. https://doi.
SA et al (2015) Large-scale whole-genome org/10.1038/ng1847
sequencing of the Icelandic population. Nat 26. Yu J, Pressoir G, Briggs WH et al (2006) A
Genet 47:435. https://doi.org/10.1038/ng. unified mixed-model method for association
3247 mapping that accounts for multiple levels of
14. Klein RJ, Zeiss C, Chew EY et al (2005) Com- relatedness. Nat Genet 38:203–208. https://
plement factor H polymorphism in age-related doi.org/10.1038/ng1702
macular degeneration. Science 308:385–389. 27. Price AL, Zaitlen NA, Reich D, Patterson N
https://doi.org/10.1126/science.1109557 (2010) New approaches to population
552 Søren Besenbacher et al.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 18
Abstract
Borrowing both from population genetics and phylogenetics, the field of population genomics emerged as
full genomes of several closely related species were available. Providing we can properly model sequence
evolution within populations undergoing speciation events, this resource enables us to estimate key
population genetics parameters such as ancestral population sizes and split times. Furthermore we can
enhance our understanding of the recombination process and investigate various selective forces. With the
advent of resequencing technologies, genome-wide patterns of diversity in extant populations have now
come to complement this picture, offering an increasing power to study more recent genetic history.
We discuss the basic models of genomes in populations, including speciation models for closely related
species. A major point in our discussion is that only a few complete genomes contain much information
about the whole population. The reason being that recombination unlinks genomic regions, and therefore a
few genomes contain many segments with distinct histories. The challenge of population genomics is to
decode this mosaic of histories in order to infer scenarios of demography and selection. We survey modeling
strategies for understanding genetic variation in ancestral populations and species. The underlying models
build on the coalescent with recombination process and introduce further assumptions to scale the analyses
to genomic data sets.
Key words Ancestral population, Coalescence, Demography, Divergence, Markov model, Migration,
Recombination, Selection, Speciation
1 Introduction
We are in the population genomics era where data sets from the
1000 human genomes project [1], the great apes project [2], and
the 1001 arabidopsis genomes project [3] are available. The under-
lying data sets contain genotypic information for thousands of
individuals in one or several species, in the form of de novo
sequenced genomes or variation compared to an available “refer-
ence” genome (a.k.a. resequencing). By comparing genomes from
several individuals of the same species or closely related species, we
can obtain information about split times, population sizes, recom-
bination events, and selection in contemporary and ancestral
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_18, © The Author(s) 2019
555
556 Julien Y. Dutheil and Asger Hobolth
Speciation
Recombination event
Fig. 1 Left: Isolation model of two species. Right: The coalescent process along the genomes of the two
species. By comparing the two genomes we obtain information about the split time of the species and the
ancestral population size. Furthermore the breakpoints along the genomes correspond to recombination
events, so we also have information about the recombination process
species (see Fig. 1). In this chapter we discuss various models for
obtaining this information.
Comparing homologous sequences available for a given locus
to infer their degree of relatedness enables the discovery of the
parental relationships of the sequences, depicted as a tree thereby
named genealogy. When one sequence sampled from one individual
of one species is compared with sequences from other species, the
resulting genealogy contains information about the history of spe-
cies, the so-called phylogeny. The phylogeny summarizes the rela-
tionship and the divergence times between the species.
Conversely, when sequences from several individuals within a
species are sampled, we have access to the genetic variation in
contemporary populations. The evolutionary forces that shape
genetic variation within a species are genetic drift, mutation,
recombination, and selection and are the subject of population
genetics. The key modeling tool in population genetics is coales-
cent theory. Classical coalescent theory describes the genetic ances-
try of a sample of homologous DNA sequences from the same
species. This genealogical description includes times to common
ancestry, which is measured back into the past.
Molecular phylogenetics and population genetics have accu-
mulated 50 years of methodological developments. The conver-
gence of these two fields and their key mathematical and statistical
tools is needed in order to fully understand genomic sequence
alignments, because comparing genealogies and phylogenies is at
the heart of the study of the speciation process [4].
We describe the interplay between population genetics and
phylogenetics by reviewing the methods and models that have
been developed to understand evolutionary history from genomic
data (see Table 1 for a comparative summary of all methods).
Table 1
Methods comparison
NA2, r
Integrating over the discretized Markov process I T, NA, r — Orangutans: two full genomes CoalHMM [13]
distribution of divergence for a pair
557
of genomes
(continued)
558
Table 1
(continued)
Use the conditional sampling Markov process + — Ne1. . .n, — Human diploid genomes diCal [16–18]
distribution to approximate the CSD r fixed
integration over the discretized
distribution of divergence
coalescence events within multiple
haploid genomes in one or more
population(s)
Extension of the pairwise sequentially Markov process + — Ne1. . .n — Unphased human genomes SMC++ [19]
Markov coalescent with site Poisson
frequency spectrum based on many random field
individuals
Bayesian sampling of ARG, using a Markov process + — Ne1. . .n, r — Human diploid genomes ARGweaver [20]
discretized distribution of “threading”
divergence time conditioned on
multiple haploid genomes in a
population
This table summarizes and compares existing ancestral population genomics methods. Parameters correspond to the one in Fig. 4. RAS: Rate across site model, assuming an a
priori distribution of evolutionary rate (usually a discretized gamma distribution) over alignment positions
Ancestral Population Genomics 559
ð Þ:
1 r
PðR2 > rÞ ¼ 1
2N
If we scale time t in units of 2N, i.e., set r ¼ 2Nt, we get
ð Þ ð Þ
1 r 1 2N t
PðR2 > rÞ ¼ 1 ¼ 1 e t ,
2N 2N
where the approximation is valid for large N. In coalescent time
units the waiting time T2 ¼ R2/(2N) before coalescence of two
individuals is therefore exponentially distributed with mean one.
These considerations can be extended to multiple individuals.
In general the time Tn before two of n individuals coalesce is
exponentially distributed with rate n2 .
The waiting time Wn for a sample of n individuals to find the
most recent common ancestor (MRCA) is given by
W n ¼ T n þ T n1 þ þ T 2 ,
where Tk are
independent exponential random variables with
parameter 2k ; see Fig. 2 for an illustration. It follows that the
mean of Wn is
560 Julien Y. Dutheil and Asger Hobolth
T2
W5
T3
T4
T5
Fig. 2 Illustration of the coalescent process. The waiting time before two out of n individuals coalesce is Tn and
the time before a sample of n individuals find common ancestry is Wn
X
n X
n X 1 n
E½W n ¼ E½T k ¼
2
kðk 1Þ
¼2
k1 k
1
ð Þ
k¼2 k¼2 k¼2
¼2 1ð 1
n
Þ
:
2.2 Adding Now suppose mutations occur at a rate u per locus per generation.
Mutations to the In a lineage of r generations, we then expect ru mutations or in the
Standard Coalescent coalescent time units with r ¼ 2Nt we expect 2Ntu mutations. We
Model let θ ¼ 4Nu be the mutation rate parameter. Since u is small we can
make a Poisson approximation of the binomial number of muta-
tions in a lineage of r generations
Binðr,uÞ ¼ Binð2N t,θ=ð2 2N ÞÞ Poisðtθ=2Þ:
We have thus arrived at the following two-step process for
simulating samples under the coalescent: (a) simulate the genealogy
by merging lineages uniformly at random
and with waiting times
exponentially distributed with rate n2 when n lineages are present;
(b) on each lineage in the tree add mutations according to a Poisson
process with rate θ/2.
Another possibility is to scale the coalescent process such that
one mutation is expected in one time unit. In this case
the expo-
nentially distributed waiting times in (a) have rate n2 ð2=θÞ, and in
(b) the mutations are added with unit rate. We use the latter version
of the coalescent-with-mutations process below.
2.3 Taking For species where recombination occurs, different parts of the
Recombination into genome come from distinct ancestors, and therefore have a distinct
Account history. Figure 3 exemplifies this phenomenon for two species. It
displays the genealogical relationships for two sequences which
underwent a single recombination event. In the presence of recom-
bination, each position of a genome alignment therefore has a
specific genealogy, and close positions are more likely to share the
same one (recall Fig. 1). The genome alignment can therefore be
described as an ordered series of genealogies, spanning a variable
amount of sites, and then changing because of a recombination
event [4]. The genealogy is therefore depicted as a complex graph
with nodes representing both coalescence and recombination
events, the ancestral recombination graph (ARG, Fig. 3c). A single
genome thus contains different samples from the distribution of the
age of the MRCA, and the distribution contains information about
the ancestral population size and speciation time. The coalescent
with recombination serves as a basis for modeling genome-wide
genealogy, a point that we will further develop in Subheading 4.
3 Adding Genetic Barriers and Gene Flow to the Picture: The Structured Coalescent
A) B)
1 2 3 4 1 2 3 4
C) Left tree ARG Right tree
1 2 3 4 1 2 3 4 1 2 3 4
Fig. 3 Ancestral recombination graph for two species. (a) Genealogy of four sampled sequences from two
species. The bold line shows the divergence of two sequences of interest. (b) A single recombination event
happened between the lineages of sequences 3 and 4 (horizontal line), so that in a part of the sequences, the
genealogy is as depicted by the bold line and therefore displays an older divergence. (c) The corresponding
ancestral recombination graph (in black) with the trees of each side of the recombination break point
superimposed (red: left tree; blue: right tree). When going backward in time, a split corresponds to a
recombination event and a merger to a coalescence event
shown in Fig. 4 (see also Table 1) and include: (a) The two species
isolation model; (b) The two species isolation-with-migration
models; (c) The three species isolation model (and incomplete
lineage sorting); and (d) The three species isolation-with-migration
model. We also discuss the general multiple species isolation-with-
migration model. The two species isolation model was introduced
in [24] and the isolation-with-migration model was introduced
in [25].
3.1 Isolation Model If the sequences are sampled from two distinct species that have
with Two Species diverged a time T ago (see Fig. 4a), then the distribution of the age
of the MRCA is shifted to the right with the amount T, resulting in
the distribution
Ancestral Population Genomics 563
NA NA
m1→2
T T N1 N2
m2→1
a) b)
NA2 NA2
mA1→3
T2 NA1 T2 NA1
m3→A1
m1→3 N3
m1→2 m2→3
T1 T1 N1 N2
m2→1 m3→2
m3→1
c) d)
Fig. 4 Speciation models and associated parameters. In all exemplified models effective population size is
constant between speciation events, represented by dash lines. The timing of the speciation events, noted
T are parameters of the models, together with ancestral effective population sizes, noted NA. In some cases,
contemporary population sizes can also be estimated, and are noted Ni, where i is the index of the population.
Models with post-divergence genetic exchanges have additional migration parameters labeled mfrom!to. The
number of putative migration rates increases with the number of contemporary populations under study, and
some models might consider some of them to be equal or eventually null to reduce complexity. (a) Isolation
model with two species. (b) Isolation-migration model with two species. (c) Isolation model with three species.
(d) Isolation-Migration model with three species
8
<0 if t < T
f T 2 ðtÞ ¼ 2 2ðtT Þ=θA ,
: e if t > T
θA
where θA ¼ 4NA u is the ancestral mutation rate. The mean time
to coalescent is E[T2] ¼ T + θA/2 and the average divergence time
between two sequences is twice this quantity, that is, 2T + θA. Since
θA ¼ 4NAu it follows that the larger the size of the ancestral
564 Julien Y. Dutheil and Asger Hobolth
3.2 Isolation Model Now consider the isolation model with three species depicted in
with Three or More Fig. 4c. Such a model is often used for the human–chimpanzee–-
Species and gorilla (HCG) triplet (e.g., [10–12]).
Incomplete Lineage The density function for the time to coalescence between
Sorting sample 1 and sample 2 is given by
8
> 0 if t < T 1
>
>
>
>
>
> 2 2ðtT 1 Þ=θA1
if T 1 < t < T 12
< e
θA1
f T 2 ðtÞ ¼ ð1Þ
>
> 2 2ðtT 12 Þ=θA2
>
> if t > T 12 ,
>
> P 12 e
>
: θA2
where
T 12 ¼ T 1 þ T 2 and P 12 ¼ e 2ðT 12 T 1 Þ=θA1
is the probability of the two samples not coalescing in the ancestral
population of sample 1 and sample 2. In the upper right corner of
Fig. 5 we plot the density (Eq. 1) with parameters that resemble the
HCG triplet.
If sample 1 and sample 2 do not coalesce in the ancestral
population of sample 1 and sample 2, then the three trees
((1,2),3), ((1,3),2), and ((2,3),1) are equally likely. The probability
of the gene tree being different from the species tree is thus
2 2
PrðincongruenceÞ ¼ P 12 ¼ e 2ðT 12 T 1 Þ=θA1 : ð2Þ
3 3
The event that the gene tree is different from the species tree is
called incomplete lineage sorting (ILS). ILS is important because
species tree incongruence often manifests itself as a relatively clear
signal in a sequence alignment and thereby allows for accurate
estimation of population parameters. In Fig. 6 we show the (in)
congruence probability Eq. 2. We also refer to Exercise 1 (see
Subheading 8.1) and Exercise 2 (see Subheading 8.2) for more
discussion of ILS.
Ancestral Population Genomics 565
Time
0.000 0.005 0.010 0.015 0.020 0.025
100 150 200 250 300
Isolation model
Density
50
0
300
250
Isolation+migration model
150 200
Density
100
50
0
0.000 0.005 0.010 0.015 0.020 0.025
Time
Fig. 5 Illustration of the density for coalescent in various models and data layout. The curves are the
probability density functions. In the most simple case with two species, a constant ancestral population
size and a punctual speciation (top left panel), more genomic regions find a common ancestor close to the
species split (the vertical line), while a few regions have a more ancient common ancestor, distributed in an
exponential manner (see Eq. 1). If speciation is not punctual and migration occurred after isolation of the
species, then some sequences have a common ancestor which is more recent than the species split and the
distribution in the ancestor becomes more complex (bottom left panel, see Eqs. 4 and 6). When a third species
is added (right panel), then another discontinuity appears and all distributions depend on additional para-
meters, particularly when migration is allowed. We use θA1 ¼ 0.0062, θA2 ¼ 0.0033 and τ1 ¼ 0.0038 (the
first vertical line), τ2 ¼ 0.0062 (the second vertical line) corresponding to the HCG triplet. Ancestral population
sizes are taken from the simulation study in Table 6 in Wang and Hey [8]: θ1 ¼ 0.005 and θ2 ¼ 0.003.
Migration parameters are all set to 50
1.0
0.8
congruence
incongruence
0.6
Probability
0.4
((human,chimpanzee),gorilla)
0.2
0.0
Fig. 6 Probability (Eq. 2) of gene tree and species tree being incongruent. In case
of the HCG triplet we obtain (T12 T1)/θA1 ¼ (0.0062 0.0038)/
0.0062 ¼ 0.39 which corresponds to an incongruence probability of 30%
S12 m1 · m2 0 0
S1 0 0 0 · m2
S2 0 0 0 m1 ·
½
f ðtÞ ¼ ðe Q T ÞaS 11 þ ðe Q T ÞaS 12 þ ðe Q T ÞaS 22 θ2 e ð2=θ ÞðtT Þ :
A
A
ð7Þ
In Fig. 5 we illustrate the coalescent density in the two species
isolation-with-migration model.
The likelihood for a pair of homologous sequences X is given
by
ð1
PðX jΘÞ ¼ LðΘjX Þ ¼ PðX jtÞf ðtjΘÞdt ð8Þ
0
where f (t) ¼ f (t|Θ) given by Eqs. 6 and 7 is the density of the two
sequences finding a MRCA at time t and P(X|t) is the probability of
the two sequences given that they find a MRCA at time t. The latter
term is calculated using a distance-based method. One possibility is
to use the infinite sites model where it is assumed that substitutions
568 Julien Y. Dutheil and Asger Hobolth
4.1 The Independent The simplest way to handle issues relating to the ancestral recom-
Loci Approach: Free bination graph is to divide the data into presumably independent
Recombination loci. Such analyses are therefore restricted to candidate regions that
Between, No are not too large (to avoid including a recombination point) and
Recombination Within not too close (to ensure several recombination events happened
between loci). Each region can then be described by a single
underlying tree, reducing the analytical and computational load.
Using 15,000 loci distant from 10 kb totaling 7.4 Mb and the
isolation model introduced above, Burgess and Yang [9] (Table 2,
model (b) sequencing errors) find the following ancestral popula-
tion sizes and speciation times estimates for human (H), chimpan-
zee (C), gorilla (G), orangutan (O), and macaque (M) ancestors:
θHC ¼ 0.0062, θHCG ¼ 0.0033, θHCGO ¼ 0.0061, θHCGOM
¼ 0.0118 and THC ¼ 0.0038, THCG ¼ 0.0062, THCGO
¼ 0.0137, THCGOM ¼ 0.0260. Converting these estimates into
time units requires an estimate of the substitution rate, either
absolute or deduced from a scaling point. Using u ¼ 109 as an
estimate for substitutions per year, this leads to an estimate of 3.8
My for the human–chimpanzee speciation, a very recent estimate.
Using the same data, Yang [10] showed that the isolation-with-
migration model was preferred. Yang finds a more ancient specia-
tion time THC ¼ 0.0053 (5.3 My with u ¼ 109) when migration
is accounted for.
4.2 State-Space The coalescent with recombination for two loci and two sequences
Model: is originally described in Simonsen and Churchill [31] as a
Simonsen–Churchill continuous-time Markov chain backward in time with eight states
Framework as shown in Fig. 7. This Markov chain is given a careful treatment in
the textbooks by Durrett [34, Section 3.1.1] and Wakeley [21,
Section 7.2.4], and we therefore only briefly explain the basic
properties of the model here.
A single sequence is either linked ( , , , or
meaning that it contains material ancestral to the sample at both
loci, or it is unlinked ( , , , or ) when it contains material
ancestral to the sample at only one locus. The coalescent rate is one
for any two sequences, and the recombination rate is ρ/2 for any
linked sequence. The chain begins at time zero in state 1 with two
linked sequences. After an exponential waiting time with rate 1 + ρ
the chain enters state 8 with probability 1/(1 + ρ) or state 2 with
probability ρ/(1 + ρ). The transition from state 1 to state 8 is a
coalescent event, and the left and right tree heights are identical.
The transition from state 1 to state 2 is a recombination event that
breaks apart one of the two sequences. All other transitions have
similar interpretations. Common ancestry for a locus is marked
570 Julien Y. Dutheil and Asger Hobolth
1 1
1 ρ
ρ/2 1 ρ/2
× ×
× ×
6 2 4 5 2 7
1
1 1
× ×
1 1
or
×
×
8
Fig. 7 State transition diagram for two loci and two sequences described as a continuous-time Markov chain
backward in time. The figure is adapted from Figure 7.7 in Wakeley [21]. A line with a bullet or a cross at both
ends is a linked sequence (ancestral material to the sample at both loci), whereas a line with a bullet or a cross
at one end only is a sequence with ancestral material at one locus only. A cross denotes common ancestry.
s and t denote the heights of the left and right trees, respectively
a b
1.0
ρ = 0.2
ρ = 0.5
0.6
0.4
ρ = 2.5
0.4
0.2
0.2
0.0
0.0
0 1 2 3 4 0 1 2 3 4
Left tree height s Right tree height t
Fig. 8 (a) Probability of same tree height. (b) Density for right tree height conditional on the left tree height
being equal to s and a recombination rate equal to ρ
572 Julien Y. Dutheil and Asger Hobolth
4.3 Time Li and Durbin [14] and Mailund et al. [13] analyze pairs of
Discretization: Setting sequences using a hidden Markov model (HMM). The hidden
Up the Finite State states are tree heights (times to the most recent common ancestor),
HMM and the tree height is discretized to obtain a finite hidden state
space. The observed states of the HMM are alignment columns,
with probabilities corresponding to a substitution process on the
tree (see Fig. 9). In the Li and Durbin model, an infinite site model
is assumed and observed states are converted to binary data, telling
whether the site is heterozygous (one mutation) or homozygous
(no mutation).
We now describe how we discretize time for the case of two
sequences considered in the previous section. The discrete version
of the Markov process is used to build a finite Markov chain along
the two sequences. When the finite Markov chain is combined with
a substitution process, we obtain an HMM as in Li and
Durbin [14].
Let the discrete time points (backward in time) of the Markov
chain be d0 ¼ 0 < d1 < d2 < < dM1 < dM ¼ 1 and denote
the corresponding states by 1, 2, . . ., M. State m (m ∈{1, . . ., M})
then corresponds to a tree height in the interval between dm1 and
dm. The continuous stationary distribution is πðtÞ ¼ expðtÞ, and
therefore the discrete times are chosen such that
a Recombination process r b
4
Tree heights
x1 x2 ···
3
Discretized time
(latent)
2
Mutation process q
1
Binary data
y1 y2
0
l l l l l ll l ll
ll
ll
ll l l lll ll
l l l l l l
l l l l ll lll l
l l l ll
llll
l l l l l l
l ll l ll l l l l ll ll
l l l l l ll l l l l l l
(observed)
0 200 400 600 800 1000
Genomic position
Fig. 9 (a) Graphical structure of the hidden Markov Model. (b) Simulation from the hidden Markov model
Ancestral Population Genomics 573
ð11Þ
The reason for the first case is that in order for the left tree height to
be in state ℓ < r, it must be in state 1, 2, or 3 at time dℓ1 and in
state 5 or 7 at time dℓ (i.e., there have been no coalescent events
before time dℓ1 and a left coalescent event between time dℓ1 and
dℓ), and similarly it must still be in state 5 or 7 at time dr1 and in
state 8 at time dr (i.e., there have been no coalescent events between
time dℓ and time dr1 and a right coalescent event between time
dr1 and time dr). The next case corresponds to no coalescent
events before time dℓ1 and both a left and a right coalescent
event between time dℓ1 and dℓ. The last case is due to symmetry
of the chain.
From the joint tree states (ℓ, r) we easily get the conditional
tree states
PðL ¼ ℓ, R ¼ rÞ
P ðℓ, rÞ ¼ PðrjℓÞ ¼ PðR ¼ rjL ¼ ℓÞ ¼ ,
PðL ¼ ℓÞ
where P(L ¼ ℓ) ¼∑rP(R ¼ r, L ¼ ℓ). These probabilities are used
in the HMM.
4.4 Careful A careful treatment of the mutation process allows for a more
Treatment of Mutation coarse binning procedure and is needed to avoid biasing the results.
Process In continuous time the probability for a mutation given a tree
height t is given by μðtÞ ¼ 1 expðθtÞ, and the stationary tree
height distribution is πðtÞ ¼ expðtÞ. The probability of a muta-
tion conditionally on the hidden state m becomes
574 Julien Y. Dutheil and Asger Hobolth
μm ¼ pðy i ¼ 1jx i ¼ mÞ
pðy i ¼ 1jt∈ðd m1 , d m ÞÞ
¼ pðy i ¼ 1jt∈ðd m1 , d m ÞÞ ¼
pðt∈ðd m1 , d m ÞÞ
ð dm ð dm
pðy i ¼ 1jtÞπðtÞdt ð1 e θt Þe t dt
¼ d m1
¼ d m1 ð12Þ
ð dm ð dm
πðtÞdt e t dt
d m1 d m1
¼ 1 e θd m1
ð1 e ð1þθÞðd m d m1 Þ
Þ :
ð1 þ θÞð1 e ðd m d m1 Þ Þ
Note that with a fine discretization we have that the interval dm
dm1 is small and the first-order Taylor expansion
expðazÞ 1 az for z small gives
pðy i ¼ 1jx i ¼ mÞ 1 e θd m1 ,
as perhaps expected. We are, however, discretizing the interval [0,
1[, so it is not possible to avoid one or more large bins. Generally
we have found that a careful treatment of the mutation process is
crucial for accurate inference [36].
4.5 Statistical Here we choose to focus on three inference methods for estimating
Inference of the recombination rate. The first method is based on the full
Population Parameters likelihood obtained from the classical forward (or backward) algo-
from Sequences rithm for HMMs. The second is based on the distribution of the
distance between segregating sites. This summary statistics was
used in Harris and Nielsen [37] for demographic inference. It is
sometimes also described as the distribution of the distance
between heterozygote sites, runs of homozygosity, or the nearest-
neighbor distribution. The third summary statistics is the probabil-
ity that two sites at certain distance apart are both heterozygote
sites. This probability is closely related to the pair correlation func-
tion from spatial statistics [36] and to the zygosity correlation
introduced in [38].
4.5.1 Summary Recall that in continuous time the probability for a mutation given
Statistics: Runs of a tree height t is given by μðtÞ ¼ 1 expðθtÞ, and the stationary
Homozygosity and Pair tree height distribution is πðtÞ ¼ expðtÞ. The marginal probability
Correlation for a mutation is therefore given by
ð1
μðtÞπðtÞdt ¼ θ=ð1 þ θÞ: ð13Þ
0
Ancestral Population Genomics 575
0.15
exp(−t) θ=0.1 and ρ=1
0.8
0.10
θ=5
Probability
0.6
Density
0.05
θ=0.1
θ=0.01
0.2
0.00
0.0
0 1 2 3 4 0 10 20 30 40 50
Time Distance from mutation
0 10 20 30 40 50
Distance from mutation
Fig. 10 (a) Stationary distribution of tree height conditional on a mutation. (b) Probability of a mutation at
various distances away from a mutation. (c) Probability of the first mutation at various distances away from a
mutation
4.5.2 Parameter We estimate the mutation rate using an estimating equation based
Estimation on the marginal probability for a mutation (Eq. 13). If the observed
frequency of a mutation is p^, then the mutation rate is
θ^ ¼ p^=ð1 p^Þ (see left plot in Fig. 11). The recombination rate is
estimated using maximum likelihood for the HMM and goodness
of fit for the pair correlation (see middle plot in Fig. 11) and runs of
homozygosity (see right plot in Fig. 11).
We simulated 50 sequences of length 20,000 base pairs and
with mutation rate θ ¼ 0.1 and recombination rate ρ ¼ 0.1. We
estimated the mutation rate using the estimating equation and the
recombination rate using maximum likelihood and the HMM, and
goodness of fit for the pair correlation and nearest neighbor
(Fig. 12) [35]. As expected the HMM procedure shows the best
results because here we are using all the available information. It
seems, however, that we are not losing too much power when
applying the pair correlation function. This is in contrast to the
nearest-neighbor summary statistics that perform much worse than
the other two methods.
We have provided a detailed treatment of the main components
involved in an analysis of pair of DNA sequences based on an HMM
derived from coalescent theory. Pairwise sequentially Markov coa-
lescent (PSMC) models have been extensively applied to various
organisms, see, for instance [39–43].
Ancestral Population Genomics 577
0.15
l
l
l
l
l
l l
l
l
l
0.4
l
l l l
l l
l
l
l l l
l l
l l l
l l l l
l l
l l
l l
l l
l l l
l l l
l
0.10
l
l l
l
0.3
Probability
θ/(1+θ)
0.2
0.05
0.1
Empirical
0.00
0.0
Fitted
0.0 0.1 0.2 0.3 0.4 0.5 0 10 20 30 40 50
θ Distance
Fitted
l
0.10
Probability
l
l
0.05
l
l
l
l
l l
l l
l l
l l
l
0.00
l l
l l
l l l l
l l l l l
l l l l l l l
l l l l l l l
l l
0 10 20 30 40 50
Distance
Fig. 11 Parameter estimation for summary statistics. (a) The mutation rate θ is estimated from the observed
number of mutations and length of the region. (b) The recombination rate ρ is estimated using the empirical
distribution of a mutation at various distances from a mutation. (c) The recombination rate is estimated using
the empirical distribution of the first mutation from a mutation
Simulation study
0.25
l l l l l l l l l
Hidden Markov Model
l
Pair Correlation
Nearest Neighbour
l
0.20
l
l
l
l
l
Recombination Rate
l
l
l l
0.15
l
l
l
l
l l
l l
l
l
l l l
l
l l
l l l
l l l l
l l l
l l l
l l l l
l l l l l l
l l l l
l l l
0.10
l l l
l l
l l l l
l l
l l l
l l l
l l l
l l l l l l l
l l l
l l l l l
l l l
l l l l
l l l l
l l l
l l l
l l
l l
l l
l l l
l
l l
0.05
l l
l
l
l
l
l
l
l l
l
0.00
l l l l
0 10 20 30 40 50
Simulation Number
Fig. 12 Results of parameter estimation for simulation study. The pair correlation summary performs rather
well compared to the full HMM data analysis. Nearest neighbor is a poor summary statistics
5.1 From 2 to n Schiffels and Durbin [15] proposed to extend the PSMC model
Genomes [14] to more than two haploid genomes by modeling the most
recent coalescence event in the sample. In this framework, the
5.1.1 The Multiple
hidden states of the model are a combination of divergence times,
Sequentially Markov
taken from a discretized distribution, and identity of the
Coalescent (MSMC)
corresponding haplotypes involved. The rationale for such simplifi-
cation was that the PSMC showed poor resolution in the recent
past [14], and considering more genomes would bring additional
signal. The drawback of this implementation is that the more
genomes are considered, the more “shifted” toward the present is
the timeframe where population parameters can be inferred. As a
result, the authors reported that with more than 8 diploid indivi-
duals (16 haploid genomes), parameters can virtually not be esti-
mated (see also [44] for an illustration of this effect with
simulations). Another consequence of this approach is that the
recombination rate parameter cannot be reliably estimated
[15]. The MSMC was used to infer the recent history of human
population. In particular, the authors introduced the possibility to
label individuals and look at cross-coalescence rate between groups,
a way to get a fine-tuned view of population divergence [15, 45].
Ancestral Population Genomics 579
5.1.2 The Demographic An alternative approach was introduced by Song and colleagues
Inference with Composite [16–18]. The demographic inference with composite approximate
Approximate Likelihood likelihood (diCal) approach is based on the conditional sampling
(diCal) distribution, which computes the likelihood of one genome
conditioned on the observation of others. Using the so-called
composite likelihood formula, it is therefore possible to compute
the likelihood of the data for n genomes as the product of the
likelihood of one genome given the n 1 other ones and the
likelihood the remaining n 1 genomes:
PðD 1...n jΘÞ ¼ PrðD 1 jD 2...n , ΘÞ PðD 2...njΘ Þ,
where Θ is the set of model parameters and D1. . .n denotes the data
set with n genomes. By further noting that
PðD 2...n jΘÞ ¼ PðD 2 jD 3...n , ΘÞ PðD 3...n1 jΘÞ
the likelihood of the full data set can be computed by recursion.
The terms P(Di|Di+1. . .n) form the conditional sampling distribu-
tion (CSD). Paul et al. [16] proposed a way to compute the CSD at
the cost of introducing several additional hypotheses: (a) the hap-
lotypes upon which the sample is conditioned are considered inde-
pendent, that is, no coalescence events involving these haplotypes
are allowed and (b) mutations can only occur once in any lineage
(infinite site hypothesis). The likelihood resulting from this
approximated CSD is therefore not exact. This approach was intro-
duced by Li and Stephens [46] and is referred to as the product of
approximate conditionals (PAC) model. Under the PAC model,
the likelihood depends on the order by which the data is
conditioned, which can be circumvented with permutation proce-
dures. While the CSD-based SMC does not have the same draw-
backs as the MSMC of Schiffels and Durbin [15], its computational
efficiency decreases as the number of haplotypes considered
increases and becomes impractical for more than 10 genomes
[19]. An elegant feature of the diCal approach is that it can be
extended to more complex demographic models, including popu-
lation structure and gene flow [18, 45]. Such extension is of
interest as the SMC approximation has been shown to be sensitive
to strong population structure [47].
5.1.3 Extending the SMC In order to use the large amount of data available in “1000 gen-
with Conditional Site omes” projects, Terhorst et al. [19] extended the PSMC in a
Frequency Spectra (CSFS) different direction. Instead of modeling the genealogy of the com-
plete sample, the authors proposed to model the divergence of two
haplotypes (the PSMC model) as hidden states, yet considering the
full set of genomes as observed states. In this approach, the transition
probabilities of the coalescent HMM are similar to the PSMC (or to
be more precise, similar to the MSMC with two haplotypes, as the
original PSMC uses the SMC of McVean and Cardin [33] and not
580 Julien Y. Dutheil and Asger Hobolth
the SMC’ of Marjoram and Wall [48]), but the emission probabil-
ities are extended to account for the full site frequency spectrum of
hundreds of genomes. This conditional site frequency spectrum
(CSFS) is computed using coalescence theory, offering a generali-
zation of the Poisson random field (PRF) model introduced by
Sawyer and Hartl [49]. Just like the original PRF, however, the
CSFS ignores linkage of observed states, only linkage between the
two conditioned haplotypes is modeled via the SMC. Additional
data reduction steps are therefore required to ensure that the
independence condition of sampled sites is met.
5.1.4 Explicit While the ARG contains all historical information about a sample of
Reconstruction of the genomes, genomes themselves contain very little information
Ancestral Recombination regarding the underlying ARG. As a result, in most statistical
Graph inference methods is the ARG treated as a variable accounted for,
but not directly inferred. In the SMC models presented above, this
is taken care of by the hidden Markov methodology, which com-
putes a likelihood for a given sample by summing over all possible
ARG (via the so-called forward algorithm). The Viterbi algorithm
and the posterior decoding procedure are HMM algorithms that
allow to reconstruct a posteriori the most likely ARG for a sample,
such procedures are notably used for the inference of patterns of
incomplete lineage sorting along genomes [11, 12, 50, 51]. Yet the
variance in such estimation is typically very large [12].
Rasmussen et al. [20] proposed a different approach: they
developed a Bayesian sampler of ARGs conditioned on a set of
genome sequences. Similar in principle to the PAC and CSD
approaches, the authors proposed to generate the ARG of
n genomes conditioned on the ARG of n 1 genomes, a proce-
dure they refer to as threading. The generated ARGs can then be
used to infer evolutionary processes of interest. Palacios et al. [52]
developed a non-parametric method that allows to estimate the
variation in time of the effective population size based on such
reconstructed ARG. Rasmussen et al. further showed that while
the model used for inference is purely neutral, the a posteriori
inferred ARG contains signature of selection, visible for instance
as a decrease of the time of the most common ancestor of two
samples in the data close to coding sequences. Such approaches
offer promising avenues for the development of new statistical
methods to detect genomic regions with unusual history.
5.2 The Case of Hobolth et al. [11] developed a hidden Markov model (HMM) to
Multiple Species infer the ancestral recombination graph between three closely
related species. Because this model only contains one haploid
genome per species, it only allows to infer population parameters
in the ancestral species. Dutheil et al. [12] reparametrized this
model in the context of the sequentially Markov coalescent. In
contrast to the previous approaches, only four hidden states were
Ancestral Population Genomics 581
Fig. 13 The coalescent process along genomes of three closely related species. (a) Four archetypes of
coalescence scenarios with three species, exemplified with human, chimpanzee, and gorilla. In the first
scenario, human and chimpanzee coalesce within the human–chimpanzee common ancestor. In the three
other scenarios, all sequences coalesce within the common ancestor of all species, with probability 1/3
depending on which two sequences coalesce first. (b) Example of genealogical changes along a piece of an
alignment. The alignment was simulated using the true coalescent process and parameters corresponding to
the human–chimpanzee–orangutan history. The blue line depicts the variation along the genome of the
human–chimpanzee divergence. The background colors depict the change in topology, red and yellow
corresponding to incomplete lineage sorting. Each change in color or break of the blue line is the result of
a recombination event
6.2 Diploid Data and While sequencing of diploid individuals allows to infer the two
Phasing alleles present at heterozygous positions, establishing how these
alleles are combined on each homologous chromosome requires an
additional, error-prone step calling phasing. Analyses based on the
comparison of individuals from distinct species do not require such
information, as the coalescence time of two alleles from the same
species is expected to have happened much after the speciation time
of the compared species. In such case alleles at each heterozygous
position can be sampled randomly [13] in order to build a com-
posite haploid genome. The same rationale applies with respect to
the use of the human reference genome, a composite genome
obtained from multiple individuals. Conversely, inferences at the
population level typically rely on the modeling of haploid genomes
and therefore require phased data. A notable exception is the
PSMC [14], as well as its extension SMC++ [19], which, when
applied to one diploid individual, only requires the knowledge of
the position of heterozygous positions.
6.3 Structural Genome data are intrinsically fragmented, firstly because of chro-
Variation and Genome mosomal organization, but also because of rearrangements that
Alignment prevent molecule-to-molecule alignment from one species to
another. A genome data set is therefore a set of distinct alignments,
one per synteny block. Synteny information can only be extracted
when individual genomes are available, which is typically not the
case for most “re-sequencing” data sets. At the population level,
however, such large-scale variation is considered negligible (but
see, for instance, [59] for an exception), while it becomes more
prominent when genomes from distinct species are compared. In
such cases, a genome alignment is constructed with potential errors
ultimately leading to the comparison of nonhomologous regions.
So far, the only way to deal with such errors is to restrict the analysis
on regions where orthology can be unambiguous resolved, mostly
by removing short synteny blocks and regions that contain a high
proportion of repeated elements, gaps, and duplications.
584 Julien Y. Dutheil and Asger Hobolth
7 Discussion
8 Exercises
8.1 ILS in Primates Assuming that there are 5 My between the speciation times of
human with the gorilla and the orangutan, that the HG ancestral
effective population size was 50,000, what is the expected amount
of ILS between human, gorilla, and orangutan? Assuming that
another 2.5 My separates the speciations of human with chimpan-
zee and gorilla, with an HC effective ancestral population size of
50,000, what is the expected amount of ILS between human,
chimpanzee, and orangutan? We assume a generation time of
20 years for all extent and ancestral primates.
8.2 Estimating Given that 30% of incomplete lineage sorting is observed between
Ancestral Population human, chimpanzee, and gorilla and assuming a generation time of
Size from the Observed 20 years and a that 2.5 My separate the splits between human/
Amount of ILS chimpanzee and human—chimpanzee/gorilla, what is the effective
ancestral population size compatible with this observed amount?
Using Burgess and Yang’s method [9], a researcher finds a higher
estimate of Ne than expected. What could explain this discrepancy?
Acknowledgements
References
12. Dutheil JY, Ganapathy G, Hobolth A, 25. Nielsen R, Wakeley J (2001) Distinguishing
Mailund T, Uyenoyama MK, Schierup MH migration from isolation: a Markov chain
(2009) Ancestral population genomics: the Monte Carlo approach. Genetics 158
coalescent hidden Markov model approach. (2):885–896
Genetics 183(1):259–274 26. Tavaré S (1979) A note on finite homogeneous
13. Mailund T, Dutheil JY, Hobolth A, Lunter G, continuous-time Markov chains. Biometrics
Schierup MH (2011) Estimating speciation 35:831–834
time and ancestral effective population size of 27. Hobolth A, Andersen LN, Mailund T (2011)
Bornean and Sumatran orangutan subspecies On computing the coalescence time density in
using a coalescent hidden Markov model. an isolation-with-migration model with few
PLoS Genet 7(3):e1001319 samples. Genetics 187(4):1241–1243
14. Li H, Durbin R (2011) Inference of human 28. Hey J (2010) Isolation-with-migration models
population history from individual whole- for more than two populations. Mol Biol Evol
genome sequences. Nature 475 27(4):905–920
(7357):493–496 29. Kelleher J, Etheridge AM, McVean G (2016)
15. Schiffels S, Durbin R (2014) Inferring human Efficient coalescent simulation and genealogi-
population size and separation history from cal analysis for large sample sizes. PLoS Com-
multiple genome sequences. Nat Genet 46 put Biol 12(5):e1004842
(8):919–925. http://www.nature.com/ng/ 30. Staab PR, Zhu S, Metzler D, Lunter G (2015)
journal/v46/n8/full/ng.3015.html Scrm: efficiently simulating long sequences
16. Paul JS, Steinrücken M, Song YS (2011) An using the approximated coalescent with recom-
accurate sequentially Markov conditional sam- bination. Bioinformatics 31(10):1680–1682
pling distribution for the coalescent with 31. Simonsen N, Churchill N (1997) A Markov
recombination. Genetics 187(4):1115–1128 chain model of coalescence with recombina-
17. Sheehan S, Harris K, Song YS (2013) Estimat- tion. Theor Popul Biol 52(1):43–59
ing variable effective population sizes from 32. Wiuf C, Hein J (1999) Recombination as a
multiple genomes: a sequentially Markov con- point process along sequences. Theor Popul
ditional sampling distribution approach. Biol 55(3):248–259
Genetics 194(3):647–662. https://doi.org/
10.1534/genetics.112.149096 33. McVean GAT, Cardin NJ (2005) Approximat-
ing the coalescent with recombination. Philos
18. Steinrücken M, Paul JS, Song YS (2013) A Trans R Soc Lond B Biol Sci 360
sequentially Markov conditional sampling dis- (1459):1387–1393
tribution for structured populations with
migration and recombination. Theor Popul 34. Durrett R (2008) Probability models for DNA
Biol 87:51–61 sequence evolution. Probability and its applica-
tions. Springer, New York
19. Terhorst J, Kamm JA, Song YS (2017) Robust
and scalable inference of population history 35. Hobolth A, Jensen JL (2014) Markovian
from hundreds of unphased whole genomes. approximation to the finite loci coalescent
Nat Genet 49(2):303–309 with recombination along multiple sequences.
Theor Popul Biol 98:48–58. https://doi.org/
20. Rasmussen MD, Hubisz MJ, Gronau I, Siepel 10.1016/j.tpb.2014.01.002
A (2014) Genome-wide inference of ancestral
recombination graphs. PLoS Genet 10(5): 36. Nielsen SV, Simonsen S, Hobolth A (2016)
e1004342 Inferring population genetic parameters: parti-
cle filtering, HMM, ripleys K-function or runs
21. Wakeley J (2008) Coalescent theory: an intro- of homozygosity? In: Algorithms in bioinfor-
duction, 1st edn. Roberts and Company Pub- matics. Lecture notes in computer science.
lishers, Arapahoe County Springer, Cham, pp 234–245
22. Hein J, Schierup MH, Wiuf C (2005) Gene 37. Harris K, Nielsen R (2013) Inferring demo-
genealogies, variation and evolution: a primer graphic history from a spectrum of shared hap-
in coalescent theory. Oxford University Press, lotype lengths. PLoS Genet 9(6):e1003521.
Oxford http://journals.plos.org/plosgenetics/article?
23. Tavaré S (2004) Ancestral inference in popula- id¼10.1371/journal.pgen.1003521
tion genetics, vol 1837. Springer, New York, 38. Lynch M, Xu S, Maruki T, Jiang X,
pp 1–188 Pfaffelhuber P, Haubold B (2014) Genome-
24. Takahata N, Nei M (1985) Gene genealogy wide linkage-disequilibrium profiles from sin-
and variance of interpopulational nucleotide gle individuals. Genetics 198(1):269–281.
differences. Genetics 110(2):325–344 https://doi.org/10.1534/genetics.114.
166843
588 Julien Y. Dutheil and Asger Hobolth
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 19
Abstract
Metagenomics has become a part of the standard toolkit for scientists interested in studying microbes in the
environment. Compared to 16S rDNA sequencing, which allows coarse taxonomic profiling of samples,
shotgun metagenomic sequencing provides a more detailed analysis of the taxonomic and functional
content of samples. Long read technologies, such as developed by Pacific Biosciences or Oxford Nanopore,
produce much longer stretches of informative sequence, greatly simplifying the difficult and time-
consuming process of metagenomic assembly. MEGAN6 provides a wide range of analysis and visualization
methods for the analysis of short and long read metagenomic data. A simple and efficient analysis pipeline
for metagenomic analysis consists of the DIAMOND alignment tool on short reads, or the LAST alignment
tool on long reads, followed by MEGAN. This approach performs taxonomic and functional abundance
analysis, supports comparative analysis of large-scale experiments, and allows one to involve experimental
metadata in the analysis.
Key words Metagenomics, Software, MEGAN, Taxonomic analysis, Functional analysis, Long reads
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_19, © The Author(s) 2019
591
592 Caner Bağcı et al.
The basic workflow for using MEGAN consists of two main steps:
read alignment against a reference database and then import an
analysis of the alignments in MEGAN. The aim of pipeline is to
perform taxonomic and functional binning of the input reads.
The alignment can be performed using a number of different
tools depending on the type of sequencing data and on the chosen
database, its sequence type, size, and available computer power. For
smaller databases more sensitive tools can be chosen such as MALT
[6] or even BLAST [7]. These tools generally offer higher sensitiv-
ity at the cost of a longer runtime. For large datasets and databases,
it is more suitable to choose an alignment tool such as DIAMOND
or LAST. We use the NCBI NR database [8] with both of the latter
tools, because it is the largest and most comprehensive protein
database available today. NCBI NR contains 144.5 million protein
sequences (August 2017).
2.1 Short Read We describe here the basic short read analysis pipeline as shown in
Pipeline Fig. 1. By default, we use DIAMOND to align reads against the full
NCBI NR database.
Before running the pipeline, one can optionally perform
preprocessing, that is, quality control, trimming, and filtering, of
the raw reads. However, these steps usually have little impact on
the results of the alignment-based analysis described in this
document.
Introduction to the Analysis of Environmental Sequences: Metagenomics with MEGAN 593
2.1.1 Read Alignment DIAMOND uses double indexed alignment, which means both the
with DIAMOND reference database and the query are indexed for comparison. This
leads to a large speedup especially for large queries and databases.
Like BLASTX, DIAMOND uses the “seed and extend” method to
find all matches between a query and the database. To further
increase speed, DIAMOND utilizes spaced seeds, which are long
seeds where only some positions are used for matching the seed.
This leads to another increase of speed without decreasing
sensitivity.
DIAMOND can be run either in fast or sensitive mode. Fast
mode will run around 20,000 times faster than BLASTX on short
reads and will be able to find 75–90% of all relevant matches that
one would find with BLASTX, while sensitive mode provides a
speedup of 2500 while recovering up to 94% of significant
matches.
2.1.2 Taxonomic and DIAMOND can save alignments in a compressed format called
Functional Classification DAA (DIAMOND alignment archive) format. DAA files can be
with MEGAN6 imported into MEGAN6 in multiple ways. A small number of small
DAA files can easily be imported interactively using menu items
provided in MEGAN. For larger datasets and or many files, one
should use the command-line tools provided with MEGAN. These
include daa2rma, which will generate a RMA file as used by
MEGAN from one or two (for paired reads) DIAMOND files
and daa-meganizer, which analyzes a DAA file and then appends
the result to the end of the file. Such “meganized” DAA files can
then be opened directly in MEGAN. The latter approach is much
faster and is more space efficient. However, to use paired reads all
alignments have to be in the same file.
One can use the program blast2rma to process the output of
a range of different alignment programs, such as BLAST.
During the processing of alignments for MEGAN, the reads
will be assigned to nodes in the NCBI taxonomy and any functional
classifications that have been configured in the import dialog or on
594 Caner Bağcı et al.
2.1.3 Investigation of the The resulting files can be opened and interactively investigated
Results using the MEGAN6 graphical user interface. The first view when
opening a file is always a hierarchical representation of the taxo-
nomic composition of the sample. Selecting different nodes of this
tree, the user can uncover further information on the reads mapped
to the represented taxon. Selecting Inspect Reads on a node will
open the Inspector Window, which displays the reads assigned to
that node, as well as their alignments. This functionality can be used
both in the Taxonomy Viewer, where nodes represent taxa, and in
any of the Functional Viewers. Figure 2a shows an example of the
Inspector Window.
Instead of just viewing a listing of the matches and alignments,
it is also possible to select Show Alignments. This will open the
Alignment Viewer (Fig. 2b), where for each of the database refer-
ences with matches from the reads assigned to the selected node it is
Fig. 2 (a) The Inspector Viewer showing some reads that have been assigned to Alistipes ihumii. (b) The
Alignment Viewer showing reads aligned to a reference sequence
Introduction to the Analysis of Environmental Sequences: Metagenomics with MEGAN 595
Fig. 3 (a) Bar chart of taxonomic assignments on family level, sorted by abundance. (b) Radial chart of
functional assignments to KEGG for the same sample from [14]
Fig. 4 Taxonomic assignment of reads from the day 0 sample for “Alice” from the ASARI [14] dataset which
have been assigned to “resistance of fluoroquinolones” in the SEED hierarchy
2.2 Long Read As presented in the previous section, using metagenomic short reads,
Pipeline one can assembly gene sequences and obtain variants of a single gene
using a gene-centric assembly, or of course use other assembly tech-
niques. However, using short read data, it is very difficult to establish
whether different genes are present in the same organism. We can
connect the genes if they are found on a single DNA molecule with
long sequencing reads, provided by third generation sequencing
technologies such as PacBio [16] or Oxford Nanopore [17].
The PacBio and Nanopore devices can produce reads that are
hundreds of thousands of bases long, with error rates of around
10%, say [17]. In contrast to short reads, which each can be safely
assumed to overlap with only a single gene, long read will usually
overlap or contain multiple genes. Hence, many popular short read
alignment and analysis algorithms may require modification so as to
take into account that a given read can align to multiple genes.
2.2.1 Long Read Analysis The basic long read analysis pipeline is analogous to the above
Pipeline described short read pipeline, and consists of the alignment and
MEGAN analysis steps (Fig 5), but the details of the analysis
pipeline as well as some components of MEGAN6 differ from the
short read solution.
As described in the following, for long reads alignment is
performed using LAST, processing of the alignments requires an
additional step and MEGAN provides some modified algorithms
for processing and visualizing long reads.
Introduction to the Analysis of Environmental Sequences: Metagenomics with MEGAN 597
2.2.2 Alignment Using Third generation sequencing technologies produce much longer
LAST reads, with a higher error rate (approximately 10%, mostly inser-
tions and deletions). Most DNA-to-protein aligners (such as
BLASTX [7] or DIAMOND) translate the complete DNA query
sequence in all six reading frames and then align the translated
sequences against the protein database. Insertions or deletions in
long reads cause a frame-shift and break translation-based align-
ments. LAST is a frame-shift aware aligner that incorporates single-
base insertions or deletions into the alignment calculation. These
are represented as “∖” for forward-shifts and ”/” for reverse-shifts,
as shown in Fig. 6.
LAST, when used with large databases, such as NCBI-nr, splits
the database into several volumes and indexes them individually.
Similarly the large input files are loaded in separate volumes, and
each volume of input is searched against each volume of the data-
base. LAST, by default, generates output in MAF, “Multiple Align-
ment Format.”
2.2.3 Taxonomic and Because of processing both the query and database in different
Functional Classification of volumes and writing the output as soon as it is generated, the
Long Reads alignments for a single read appear in different parts of the MAF
output of LAST. MEGAN processes alignment files line-by-line,
identifies all alignments of a single read, and then assigns that read
to a taxonomic and/or functional class. The unordered structure of
LAST output prevents MEGAN from doing this. Thus, MAF files
produced by LAST must be sorted before they are imported to
MEGAN. For this task, MEGAN provides a command-line script,
called sort-last-maf.
598 Caner Bağcı et al.
2.2.4 Investigation of the The first view the user gets when a long read dataset is loaded in to
Results MEGAN6 is identical to that of a short read dataset; however, there
are some underlying differences and several investigation modes
designed specifically for long reads.
Due to a large variability of read length of long reads [18], it is
impractical for MEGAN to report number of reads assigned to class
as a mean of abundance. Using the raw read length is also not
feasible for Nanopore technology as reads tend to have “head”
Introduction to the Analysis of Environmental Sequences: Metagenomics with MEGAN 599
Fig. 7 Long Read Inspector in MEGAN6. The read is drawn as a line in the middle and the protein alignments
are drawn as arrows on their corresponding positions and strands on the read
Fig. 8 MEGAN6 offers a remote BLAST functionality, namely “BLAST on NCBI,” which can be used for
suspicious assignments. (a) Long Read Inspector view for a read assigned to Trichuris trichiura, based on
protein alignments against NCBI nr. (b) Long Read Inspector view for the same read as in (a), assigned to
Escherichia coli, after searching it against nucleotide collection of NCBI using the remote BLAST functionality
of MEGAN6
Fig. 9 PCoA analysis of 12 samples associated with “Alice” (round shapes) and “Bob” (square shapes), from
[14]. Time points of antibiotic intake are colored light blue, time points before and after antibiotic intake dark
red. (a) A PCoA plot based on Bray–Curtis distances as calculated by MEGAN using the taxonomic abundances
for the samples. The green vectors represent the bi-plot vectors. The samples are grouped by individual,
showing the convex hulls of the groups as well as ellipses. (b) is based on the same data but using the
abundances of GO terms in the InterPro2GO hierarchy and only showing the convex hulls of the group. Here the
orange vectors are the tri-plot vectors, showing the relation of metadata values to the principal components
602 Caner Bağcı et al.
Fig. 10 Attribute correlation plot for the data from [14] for two healthy individuals taking antibiotics for 6 days
(day 1–6). Correlation is shown as a heat map with red marking positive correlation between the attribute and
the taxon and blue marking negative correlation. Correlations are shown for antibiotics intake (boolean) and
time (day 0, 1, 3, 6, 8, and 34)
4 Outlook
It goes without saying that the quality and quantity of the input
sequencing data limits the reliability of the output analysis. More
directly, quality of the MEGAN hierarchy assignments is deter-
mined by the quality of the read alignment, which, in turn, depends
on the chosen database and alignment tool. On the one hand, the
database needs to be well annotated and comprehensive, as it is only
possible to analyze the organisms or entities present in it. On the
other hand, the alignment tool needs to be sensitive in order to
identify the matching sequence. It is especially difficult to deal with
sets of very similar sequences. Currently, for the human gut micro-
biome sequencing data analyzed with the basic short read pipeline,
as much as 30% of reads are not assigned to any node in the course
of the taxonomic analysis.
In order to avoid the bias introduced by the database one can
also use one of the database-free strategies, e.g., k-mer counting.
They are good for tracking the global changes in the data, but it is
difficult to correct for possible contaminations. Although MEGAN
does not support this type of analysis, it enables global comparisons
with PCoA based on the profiles computed for each of the samples.
Introduction to the Analysis of Environmental Sequences: Metagenomics with MEGAN 603
4.1 MEGAN MEGAN Community software is freely available on the website: ab.
Resources inf.uni-tuebingen.de/data/software/megan6, together with the
current mapping files for taxonomic and functional analysis.
Short read datasets presented in this chapter and used for
visualizations are publicly accessible in MEGAN via MeganServer.
The dataset used in the Long Read Pipeline section was down-
loaded from the supplementary material of Brown et al.
[20]. Instructions for use of MEGAN and user support can be
found on the MEGAN community website (megan.informatik.
uni-tuebingen.de).
References
1. Handelsman J (2004) Metagenomics: applica- 5. Huson DH, Beier S, Flade I, Górska A,
tion of genomics to uncultured microorgan- El-Hadidi M, Mitra S, Ruscheweyh HJ,
isms. Microbiol Mol Biol Rev 68(4):669–685 Tappu R (2016) MEGAN community editio-
2. Buchfink B, Xie C, Huson DH (2015) Fast and n–interactive exploration and analysis of large-
sensitive protein alignment using DIAMOND. scale microbiome sequencing data. PLoS Com-
Nat Methods 12:59–60 put Biol 12(6):e1004957. https://doi.org/
3. Kiełbasa SM, Wan R, Sato K, Horton P, Frith 10.1371/journal.pcbi.1004957
MC (2011) Adaptive seeds tame genomic 6. Herbig A, Maixner F, Bos KI, Zink A, Krause J,
sequence comparison. Genome Res 21 Huson DH (2016) MALT: fast alignment and
(3):487–493 analysis of metagenomic DNA sequence data
4. Sheetlin SL, Park Y, Frith MC, Spouge JL applied to the Tyrolean Iceman. bioRxiv
(2014) Frameshift alignment: statistics and 050559. https://doi.org/10.1101/050559
post-genomic applications. Bioinformatics 30 7. Altschul SF, Gish W, Miller W, Myers EW, Lip-
(24):3575–3582 man DJ (1990) Basic local alignment search
tool. J Mol Biol 215(3):403–410
604 Caner Bağcı et al.
8. Geer LY, Marchler-bauer A, Geer RC, Han L, determination through sequence-based meta-
He J, He S, Liu C, Shi W, Bryant SH (2010) genomics. Antimicrob Agents Chemother 59
The NCBI biosystems database. Nucleic Acids (12):7335–7345
Res 38(2009):492–496 15. Huson DH, Tappu R, Bazinet AL, Xie C,
9. Mitchell A, Chang HY, Daugherty L, Fraser M, Cummings MP, Nieselt K, Williams R (2017)
Hunter S, Lopez R, McAnulla C, Fast and simple protein-alignment-guided
McMenamin C, Nuka G, Pesseat S, Sangrador- assembly of orthologous gene families from
Vegas A, Scheremetjew M, Rato C, Yong SY, microbiome sequencing reads. Microbiome 5
Bateman A, Punta M, Attwood TK, Sigrist CJ, (1):11
Redaschi N, Rivoire C, Xenarios I, Kahn D, 16. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G,
Guyot D, Bork P, Letunic I, Gough J, Oates M, Peluso P, Rank D, Baybayan P, Bettman B,
Haft D, Huang H, Natale DA, Wu CH, Bibillo A, Bjornson K, Chaudhuri B,
Orengo C, Sillitoe I, Mi H, Thomas PD, Finn Christians F, Cicero R, Clark S, Dalal R,
RD (2014) The InterPro protein families data- Dixon J, Foquet M, Gaertner A,
base: the classification resource after 15 years. Hardenbol P, Heiner C, Hester K, Holden D,
Nucleic Acids Res 43(D1):D213–D221 Kearns G, Kong X, Kuse R, Lacroix Y, Lin S,
10. Hunter S, Corbett M, Denise H, Fraser M, Lundquist P, Ma X, Marks P, Maxham M,
Gonzalez-Beltran A, Hunter C, Jones P, Murphy D, Park I, Pham T, Phillips M, Roy J,
Leinonen R, McAnulla C, Maguire E, Sebra R, Shen G, Sorenson J, Tomaney A,
Maslen J, Mitchell A, Nuka G, Oisel A, Travers K, Trulson M, Vieceli J, Wegener J,
Pesseat S, Radhakrishnan R, Rocca-Serra P, Wu D, Yang A, Zaccarin D, Zhao P, Zhong F,
Scheremetjew M, Sterk P, Vaughan D, Korlach J, Turner S (2009) Real-time DNA
Cochrane G, Field D, Sansone SA. EBI meta- sequencing from single polymerase molecules.
genomics–a new resource for the analysis and Science 323(5910):133–138
archiving of metagenomic data. Nucleic Acids 17. Mikheyev AS, Tin MY (2014) A first look at the
Res 42(D1):D600–D606 Oxford Nanopore MinION sequencer. Mol
11. Powell S, Szklarczyk D, Trachana K, Roth A, Ecol Resour 14(6):1097–1102
Kuhn M, Muller J, Arnold R, Rattei T, 18. Laver T, Harrison J, ONeill PA, Moore K,
Letunic I, Doerks T, Jensen LJ, von Farbos A, Paszkiewicz K, Studholme DJ
Mering C, Bork P (2012) eggNOG v3.0: (2015) Assessing the performance of the
orthologous groups covering 1133 organisms Oxford Nanopore Technologies MinION.
at 41 different taxonomic ranges. Nucleic Acids Biomol Detect Quant 3:1–8
Res 40(D1):D284–D289 19. Yang C, Chu J, Warren RL, Birol I (2017)
12. Kanehisa M, Goto S (2000) KEGG: Kyoto NanoSim: nanopore sequence read simulator
encyclopedia of genes and genomes. Nucleic based on statistical characterization. Giga-
Acids Res 28(1):27–30 Science 6(4):1–6
13. Overbeek R, Olson R, Pusch GD, Olsen GJ, 20. Brown BL, Watson M, Minot SS, Rivera MC,
Davis JJ, Disz T, Edwards RA, Gerdes S, Franklin RB (2017) MinION nanopore
Parrello B, Shukla M, Vonstein V, Wattam sequencing of environmental metagenomes: a
AR, Xia F, Stevens R The seed and the rapid synthetic approach. GigaScience 6(3):1–10
annotation of microbial genomes using subsys- 21. Medvedev P, Georgiou K, Myers G, Brudno M
tems technology (RAST). Nucleic Acids Res 42 (2007) Computability of models for sequence
(D1):D206–D214 assembly. Gene 4645:289–301
14. Willmann M, El-Hadidi M, Huson DH, et al
(2015) Antibiotic selection pressure
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 20
Abstract
Metagenomics, also known as environmental genomics, is the study of the genomic content of a sample of
organisms (microbes) obtained from a common habitat. Metagenomics and other “omics” disciplines have
captured the attention of researchers for several decades. The effect of microbes in our body is a relevant
concern for health studies. There are plenty of studies using metagenomics which examine microorganisms
that inhabit niches in the human body, sometimes causing disease, and are often correlated with multiple
treatment conditions. No matter from which environment it comes, the analyses are often aimed at
determining either the presence or absence of specific species of interest in a given metagenome or
comparing the biological diversity and the functional activity of a wider range of microorganisms within
their communities. The importance increases for comparison within different environments such as
multiple patients with different conditions, multiple drugs, and multiple time points of same treatment
or same patient. Thus, no matter how many hypotheses we have, we need a good understanding of
genomics, bioinformatics, and statistics to work together to analyze and interpret these datasets in a
meaningful way. This chapter provides an overview of different data analyses and statistical approaches
(with example scenarios) to analyze metagenomics samples from different medical projects or clinical trials.
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_20, © The Author(s) 2019
605
606 Suparna Mitra
1
https://www.the-scientist.com/?articles.view/articleNo/50044/title/Shotgun-Sequencing-Outdone-by-
Amplicon/.
Multiple Data Analyses and Statistical Approaches for Analyzing Data from. . . 607
2.1.1 Methodology For this study, we used atherosclerotic tissue samples from a group
Details of 15 patients that underwent elective carotid endarterectomy
following repeated transient ischemic attacks or minor strokes
(samples from symptomatic atherosclerotic plaques as cases).2 Fur-
ther, we have asymptomatic atherosclerotic plaques from seven
2
All methods and experimental manuals were approved by The National Committee on Health Research Ethics
(Danish) and was granted by the Ethical Committee of the region of Copenhagen (H-3-2011-013).
Multiple Data Analyses and Statistical Approaches for Analyzing Data from. . . 609
Organising samples
Data processing
Fig. 1 Analysis pipeline for the study of human atherosclerotic plaque samples. Interested readers may refer to
the full study here [15]
3
These samples originated from the tissue bank at the Department of Forensic Medicine (Approval
No. 1501230).
610
Table 1
Sample statistics and read assignments
Assigned in MEGAN
(MSc:50 MSp25
Suparna Mitra
Hiseq 2000
238 1/2-12 tissue DNA 10/18/2012 153,302,968 36,976,828 24.12% 5,216,154 3,002,923 835,412 16.02%
Cases
239 7/2-12 tissue DNA 10/18/2012 154,652,444 48,177,200 31.15% 5,643,923 3,045,461 222,363 3.94%
240 27/2-12 tissue DNA 10/18/2012 101,591,496 46,000,276 45.28% 5,166,877 2,640,033 110,186 2.13%
241 5/3-12 tissue DNA 10/18/2012 99,927,824 42,032,952 42.06% 4,607,355 2,471,259 153,425 3.33%
242 13/3-12 tissue DNA 10/18/2012 82,850,664 46,721,284 56.39% 5,002,916 2,716,756 135,617 2.71%
243 25/4-12 tissue DNA 10/18/2012 104,094,892 52,853,464 50.77% 5,726,930 3,095,542 193,819 3.38%
Hiseq 2500 977 (P0613) 7/11/2013 111,699,176 13,191,406 11.81% 2,344,220 955,602 74,646 3.18%
Miseq V1 P0613 repeat 13/06-12 tissue DNA 7/11/2013 7,184,184 980,004 13.64% 759,262 228,262 31,889 4.20%
977+APD1 118,883,360 14,171,410 11.92% 3,103,482 1,183,864 106,535 3.43%
55 AP26 gDNA (Control) 3/14/2012 105,779,932 29,019,712 27.43% 8,911,853 4,040,359 293,160 3.29%
56 AP25 gDNA (Control) 3/20/2012 128,471,814 10,767,822 8.38% 4,680,735 1,922,688 204,397 4.37%
232 AP 21 tissue DNA (Control) 3/18/2012 127,173,774 25,793,632 20.72% 3,267,969 1,726,714 164,711 5.04%
233 AP24 tissue DNA (Control) 3/18/2012 166,547,592 29,282,304 17.58% 4,559,468 2,634,338 649,226 14.24%
Controls
Hiseq 2000
234 AP27 tissue DNA (Control) 3/18/2012 114,673,124 37,683,568 32.86% 4,550,457 2,477,740 193,659 4.26%
235 AP28 tissue DNA (Control) 3/18/2012 151,195,284 34,767,160 22.99% 4,633,661 2,550,975 231,926 5.01%
236 AP29 tissue DNA (Control) 3/18/2012 152,179,180 32,269,700 21.21% 4,254,572 2,319,403 204,778 4.81%
2,610,268,774 685,143,146 26.25% 4.60%
Multiple Data Analyses and Statistical Approaches for Analyzing Data from. . . 611
Fig. 2 Schedule of visits for the study to understand the effect of omega-3 polyunsaturated fatty acid
supplements on the human intestinal microbiota
2.2.2 Sample Microbial DNA extractions were performed based on the method
Preparation and of Yu and Morrison, [23] with slight modifications. DNA was
Sequencing extracted from approximately 250 mg feces using the QIAamp
DNA Stool Mini Kit (Qiagen, Germany) with bead beating. DNA
Library Prep Kit for Illumina, NEBNext Singleplex Oligos for
Illumina (New England Biolabs, UK), and unique in-house-
designed index primers (Integrated DNA Technologies, UK)
were used to allow for multiplexing of samples. Twelve cycles of
enrichment PCR were performed, and final libraries were cleaned
with AMPure Beads (Beckman Coulter, UK). Successful libraries
were confirmed by DNA 1000 bioanalyzer chips or DNA Analysis
screen tapes (Agilent, UK). Quantification was performed with the
Quant-iT dsDNA Assay Kit, broad range. A total of 30 ng of
each library was pooled and sequenced on an Illumina MiSeq
(2 250 bp) [24]. The variable region (V4) of the 16S rRNA
gene was sequenced for these samples.
2.2.3 Data Analyses Demultiplexed FASTQ files were trimmed of adapter sequences
using cutadapt [25]. Paired reads were merged using fastq-join
[26] under default settings and then converted to FASTA format.
Consensus sequences were removed if they contained any ambigu-
ous base calls, two contiguous bases with a PHRED quality score
lower than 33, or a length more than 2 bp different from the
expected length of 240 bp. Further analysis was performed using
QIIME [27]. Operational taxonomy units (OTUs) were picked
using usearch [28] and aligned to the Greengenes reference data-
base using PyNAST [29]. Taxonomy was assigned using the RDP
2.2 classifier [30]. The resulting OTU BIOM files from the above
Multiple Data Analyses and Statistical Approaches for Analyzing Data from. . . 613
Drug X or
DrugY
Multiple
Time points
factors
End of treatment
Mid treatment (Day 2 Followup (Day 20
Baseline (Day 9/10 to Day
to Day 8) or more)
11/12)
2.3.1 Sample In a clinically relevant setting this type of study wants to know
Preparation and which drug works better for a similar group of patients. Patients
Sequencing and Data are randomized between drug arms to control any selection bias.
Analyses Usually in this type of projects as we want to compare several
factors, we need many samples to start with. Readers are advised
to seek statistics help to do power calculation to obtain the pre-
ferred sample size. In general, as we end up having hundreds of
samples, we usually go for 16S sequencing as a cost-effective solu-
tion. However, some projects can also use shotgun sequencing.
Similar to previous examples, we assume that we have sequenced
(either 16S or shotgun sequencing) our samples and performed
further analysis process as outlined earlier to obtain taxonomic
profile (following data analyses methods as described in previous
scenarios) for each patient at each time point. Besides analyzing
time series of each individual separately, we have also grouped them
in certain time points such as baseline, mid-treatment, end of
treatment, and follow-up. Besides treatment groups, patients are
also compared based on multiple factors such as age, gender, and
geography.
3.1 Taxonomic and Taxonomic annotation addresses the question, ‘Who is out there?’ or
Functional Annotation in other words tries to obtain information regarding the species
composition of a given metagenome. On the other hand, func-
tional annotation attempts to answer the question, ‘What are they
doing?’ There are different approaches for metagenome analyses,
among which one type of approach is to use phylogenetic markers
to distinguish between different species in a sample. The most
widely used marker is the small subunit ribosomal ribonucleic acid
(SSU rRNA) gene (16S or 18S) and a second type of method is
based on analyzing the nucleotide composition of reads. In a
supervised approach the nucleotide composition of a collection of
reference genomes is used to train a classifier, which is then used to
place a given set of reads into taxonomic bins. In an unsupervised
approach, reads are clustered by composition similarity and then
the resulting clusters are analyzed in an attempt to place the reads.
Subheading 4 of this chapter provide details of multiple approaches
and available different tools which readers can use according to
their preferences.
In general, for annotating 16S rRNA sequences we use QIIME
[27] and for shotgun sequencing we use MEGAN [31] which can
also be used for 16S. MEGAN is a highly efficient program for
Multiple Data Analyses and Statistical Approaches for Analyzing Data from. . . 615
3.2 Metagenome Similar in nature to the genomic assembly, which is the reconstruc-
Assembly tion of genomes from the sequenced DNA segments (or reads),
metagenome assembly is more complex. The main goal is to stitch
together the fragments of the reads that could be from the same
genome. Here the reads consist of mixture of DNA from different
organisms and also may have widely different levels of abundance.
Few recent reviews discussed new challenges and opportunities as
well as assessed the most common and freely available metagenome
assembly tools with respect to their output statistics, their sensitiv-
ity for low-abundance community members and variability in
resulting community profiles as well as their ease of use. Interested
readers please refer to reviews [43, 44].
350
325
300
275
Number of leaves in taxonomy
250
225
200
175
150
125
100
75
50
25
0
0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 55000 60000 65000 70000 75000 80000 85000 90000 95000 100000
Fig. 4 Rarefaction. Rarefaction plot using annotated species profile for all 22 (unstable and stable) athero-
sclerotic plaque samples. These curves show the number of nodes that would be present if based on 10%,
20%, and up to 90% of the reads
3.4 Subsample In situations like Fig. 3, where two samples have much higher
Comparison sequencing depth, another option can be subsample comparison.
In this process without excluding high-depth samples from further
study, another approach is to simulate subsample of lowest sample
size (of other samples in the study) for sufficient number of times.
And then take a median of the subsamples to generate a pseudo
profile, which can serve as a good comparable sample for the group.
For example, if in a study for most of the samples sequence reads are
in a range of 200,000–300,000. However, only few samples have
approx. 1 million reads, in those cases we simulate subsample of
200,000 reads from them for large number of times (say 1000) and
we take median of the profiles, which we can then compare with
other samples.
Fig. 5 Genus level taxonomic comparison of patients’ microbiome (median of each time point group) in two
drug treatment groups for certain time points such as baseline, mid-treatment, end of treatment and follow
up. Here different colors indicate different genera and the size of each color in the pie reflects the percentage
of those genus in median microbiome for each time point group and for each drug
618 Suparna Mitra
Microbacterium laevaniformans
Bacteroides
Bacteroidales Porphyromonadaceae
Prevotella Prevotella copri
Prevotella histicola
Waddlia chondrophila
Cyanothece sp. CCY0110
Bacillales Bacillus cereus
Paenibacillus
Lactobacillus casei group
Lactobacillus crispatus
Lactobacillus fermentum
Lactobacillus Lactobacillus gasseri
Bacilli Lactobacillus jensenii
Lactobacillales Lactobacillus johnsonii
Lactobacillus oris
Firmicutes Lactobacillus rhamnosus
Lactobacillus salivarius
Lactobacillus vaginalis
Streptococcus
Streptococcus parasanguinis
cellular organisms Bacteria Streptococcus salivarius
Clostridium butyricum
Clostridiales Eubacterium rectale
Lachnospiraceae
[Clostridium] difficile
Faecalibacterium prausnitzii
root Coprobacillus sp. D7
Veillonella atypica
Fusobacterium ulcerans
Candidatus Pelagibacter ubique
Betaproteobacteria Acidovorax sp. CF316
Neisseria Neisseria polysaccharea
Neisseria subflava
Proteobacteria Helicobacter pylori
Enterobacteriaceae
Gammaproteobacteria Actinobacillus pleuropneumoniae
Pseudomonadales Acinetobacter sp. P8-3-8
Pseudomonas
Beggiatoa sp. PS
Endoriftia persephone
Aster yellows phytoplasma
candidate division TM7 candidate division TM7 single-cell isolate TM7a
candidate division TM7 single-cell isolate TM7c
Eukaryota
Gammaretrovirus
Retroviridae Human endogenous retrovirus
Human endogenous retroviruses
unclassified Retroviridae Human endogenous retrovirus K
Human endogenous retrovirus W
Murine endogenous retrovirus ERV-L
synthetic construct
Not assigned
Low complexity
Fig. 6 Tree view at “family” level taxonomy comparing merged data from cases and control samples using
data from Scenario 1
Form this figure we can easily see that the microbiome pattern in
drug X over treatment period is more consistent (or more stable
over the time) than in drug Y. Here with visual comparison we are
not making any conclusion, but with these types of plots we can
start to see if there is any trend in our data, which can later be
investigated with appropriate statistical tests.
Further as metagenomic data are often hierarchical in nature,
besides doing basic plots which can be done only at certain taxo-
nomic levels (e.g., family/genus), often it is helpful to display the
whole data as comparative tree view. For example in Scenario
1, samples from cases and controls have grouped closely (as can
be seen later in Subheading 3.9), we can explore their broad differ-
ences by comparing total biome from cases and controls using
comparative tree view (Fig. 6). This kind of tree view also help us
to assess multiple time point samples from single patient or
grouped data comparison for multiple factors (e.g., in Scenario 3).
0.75
Simpson diversity index
Group
0.50
X
Y
0.25
Fig. 7 Boxplot showing Simpson diversity indices for samples from each time point and for both the drugs X
and Y
−4 −2 0 2 4
Row Z−Score
Helicobacteraceae
Waddliaceae
Candidatus Saccharibacteria
Neisseriaceae
sulfur−oxidizing symbionts
unclassified Alphaproteobacteria
Thiotrichaceae
Burkholderiales Genera incertae sedis
Chitinophagaceae
Nitrospiraceae
Rhodobacteraceae
Sphingomonadaceae
Cytophagaceae
Bradyrhizobiaceae
Comamonadaceae
Flavobacteriaceae
Nocardioidaceae
Mycobacteriaceae
Nannocystineae
unclassified Betaproteobacteria
Polyangiaceae
Caldilineaceae
Chromatiaceae
Leptospiraceae
Cystobacterineae
Rhodocyclaceae
Planctomycetaceae
Verrucomicrobia subdivision 3
Saprospiraceae
Solibacteraceae
Gallionellaceae
Methylophilaceae
Bdellovibrionales
Methylococcaceae
Nostocaceae
Oxalobacteraceae
Peptostreptococcaceae
Dermatophilaceae
Dermabacteraceae
Chlorobiaceae
Opitutaceae
Chloroflexales
Alicyclobacillaceae
Natranaerobiaceae
Thermoanaerobacteraceae
Alcaligenaceae
Thermoanaerobacterales Family IV. Incertae Sedis
Sanguibacteraceae
Intrasporangiaceae
unclassified Actinobacteria
Ktedonobacteraceae
Dictyoglomaceae
Heliobacteriaceae
Oscillospiraceae
Syntrophomonadaceae
Caulobacteraceae
Geobacteraceae
Synergistaceae
Gordoniaceae
Corynebacteriaceae
Tsukamurellaceae
Frankiaceae
Glycomycetaceae
Beutenbergiaceae
Brevibacteriaceae
Cellulomonadaceae
Promicromonosporaceae
Micromonosporaceae
Propionibacteriaceae
Pseudonocardiaceae
Streptosporangineae
Conexibacteraceae
Sphingobacteriaceae
Bacillales Family XII. Incertae Sedis
Listeriaceae
Planococcaceae
Sporolactobacillaceae
Thermoactinomycetaceae
Leuconostocaceae
Peptococcaceae
Halanaerobiaceae
Thermoanaerobacterales Family III. Incertae Sedis
Methylobacteriaceae
Phyllobacteriaceae
Acetobacteraceae
Desulfobacteraceae
Brachyspiraceae
Thermotogaceae
Haloplasmataceae
Clostridiaceae
Aeromonadaceae
Paenibacillaceae
Bacillaceae
Xanthomonadaceae
Rhizobiaceae
unclassified Clostridiales
Streptomycetaceae
Burkholderiaceae
Desulfovibrionaceae
Brucellaceae
Rhodospirillaceae
Nocardiaceae
Shewanellaceae
Staphylococcaceae
environmental samples <Bacteria>
Ruminococcaceae
Vibrionaceae
Enterococcaceae
Carnobacteriaceae
Leptotrichiaceae
Aerococcaceae
Thermaceae
Porphyromonadaceae
Bacteroidaceae
Micrococcaceae
Prevotellaceae
Coriobacteriaceae
Actinomycetaceae
Veillonellaceae
Clostridiales Family XIII. Incertae Sedis
Fusobacteriaceae
Campylobacteraceae
Bacillales Family XI. Incertae Sedis
Rikenellaceae
Bifidobacteriaceae
Lachnospiraceae
Clostridiales Family XI. Incertae Sedis
Oceanospirillales
environmental samples <gamma proteobacteria>
Verrucomicrobiaceae
Acidaminococcaceae
Enterobacteriaceae
Pasteurellaceae
Pseudomonadaceae
Moraxellaceae
Eubacteriaceae
Mycoplasmataceae
Spirochaetaceae
Streptococcaceae
Lactobacillaceae
Cases Controls
P0613
241
240
237
239
243
242
238
232
235
236
233
234
52
54
48
53
51
49
50
56
55
Fig. 8 Taxonomic comparison of all DNA samples. Hierarchical clustering result of “family” level taxonomic
comparisons of data from Scenario 1: unstable atherosclerotic plaques from 15 patients with symptomatic
atherosclerotic disease (unstable plaques) and stable plaques from a control group of seven patients that died
from other causes than atherosclerosis (controls). Red indicates downregulation, green indicates upregulation,
and black indicates no change in read abundance level comparing to all samples. Hierarchical clustering was
computed with average linkage, whereas Pearson correlation was used for clustering the families (rows) and
Spearman correlation was used for clustering the datasets (columns), respectively
622 Suparna Mitra
3.10 Principal PCA and PCoA are tools for multivariate analysis. PCA uses an
Component Analysis orthogonal transformation to convert a set of observations of
(PCA) and Principal possibly correlated variables into a set of values of linearly uncorre-
Coordinates Analysis lated variables called principal components [58]. This is often used
(PCoA) for quantitative variables, so the axes in graphic have a quantitative
weight, and the positions of the samples are in relation with those
weight. On the other hand, PCoA or multidimensional scaling
(MDS) is a means of visualizing the level of similarity of individual
cases of a dataset [59]. PCoA is similar to Polar ordination (PO;
[60]) arranges samples between endpoints or ‘poles’ according to
the distance matrix maximizing the linear correlation between the
distances in the distance matrix. If further interested in these meth-
ods please see [61].
For multiple sample comparison we often use PCoA and PCA,
these are among the best tools available for multivariate analysis.
These can give us powerful information of similarities and dissim-
ilarities within samples. When coupled with phenotypic data or
metadata (using colors and symbols etc.), these can be very helpful
tools to understand within group variations. As an example, we
have used PCoA on 22 plaque samples from Scenario 1 (Fig. 9).
Here we can see that sample 238 and 238 being very different
possibly due to high sequence depth (as also seen in Fig. 4).
Biplots: In addition to PCA or PCoA, variables can also be
plotted on the same diagram (this is called a biplot). The biplot
provides a useful tool of data analysis and allows the visual appraisal
of the structure of large data matrices [62]. In our examples, where
taxa are variables, biplot can show important taxa which helps in
determining relatedness represented as arrows. For example, in
Scenario 2, β diversity was compared using principal coordinate
analysis (PCoA) on all samples from all visits, where biplots are
displayed with green arrows (Fig. 10). From this PCoA with biplot,
we interpret that samples from volunteers 8, 13, and 16 are differ-
ent than the other volunteers and that they have higher abundance
of Succinivibrionaceae, Gammaproteobacteria, Aeromonadales, etc.
3.11 Canonical- CCA (correlation) seeks to find the linear combination of the Xi
Correlation Analysis and Yj that have the greatest correlation with each other where
(CCA) and Canonical- X ¼ (X1, . . ., Xn) and Y ¼ (Y1, . . ., Ym) of random variables thus it
Correspondence is often used as a dimension–reduction method. The method was
Analysis (CCA) first introduced by Harold Hotelling [63]. On the other hand,
CCA (correspondence) is a multivariate method to elucidate the
relationships between biological assemblages of species and their
Multiple Data Analyses and Statistical Approaches for Analyzing Data from. . . 623
-0.80
-0.70 Legend:
Cases group
-0.60 Control group
-0.50
238
-0.40
-0.30
233
-0.20
-0.10
55
53
0.00 56
234 232 235
54 50 48
49 240 236
243 241237 P0613
51 242 239
0.10 52
0.20
0.30
0.40
-0.70 -0.60 -0.50 -0.40 -0.30 -0.20 -0.10 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00
Fig. 9 principal coordinate analyses (PCoA) of “family” level taxonomic comparisons of data from Scenario 1:
unstable atherosclerotic plaques from 15 patients with symptomatic atherosclerotic disease (cases: cyan) and
stable plaques from a control group of seven patients that died from other causes than atherosclerosis
(controls: magenta)
19 9
9
10 11
9
-3
10
6
18
17
-2 8
7
13 10 10
17
17
12
12
-1 7 3718
12 12 12
3 2
6 Terrabacteria group
8 8 11
6 Clostridia
Clostridiales
7
8 14 11 Firmicutes
PC2 4 1
15 15 18
Aeromonadales 7 2 6 17
8
13
Succinivibrionaceae Proteobacteria 1 11
Succinivibrio 14
1 Gammaproteobacteria Ruminococcaceae
13 15 11 17
5
6 20
20
14 2 5 20
17
2 13
1 1
18
13 16 4 20
16 4
16 5 1
3 16
16
4
18
20
2
16 4 5
2
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 PC1 1 2 3 4 5 6 7 8
Fig. 10 principal coordinate analyses (PCoA) of level taxonomic comparisons of data from Scenario 2: all
samples (V1–V5) for all participants, where biplots are displayed with green arrows. Each visit is denoted by a
different color
3.12 Multivariate Multivariate data analysis refers to any statistical approach used to
Analyses analyze data with more than one variable. For example, as described
in Scenario 3 we have multiple factors. The key to identifying
important microbial taxa associated with two treatments is that
the large datasets from each patient are compared within groups,
and then the metadata from the patients’ groups are compared
against each other. Analysis of multivariate data in response to
factors, groups, or treatments in an experimental design needs
sophisticated methods.
To achieve this, we can use PERMANOVA (permutational
multivariate analysis of variance) [66] to test the homogeneity of
multivariate dispersions within groups, on the basis of any resem-
blance measure. PERMANOVA is a better approach than ANOVA
(Analysis of variance)/MANOVA (Multivariate analysis of variance)
for our study as PERMANOVA works with any distance measure
that is appropriate to the data, and uses permutations to make it
Multiple Data Analyses and Statistical Approaches for Analyzing Data from. . . 625
DHA
LNA
DPA
EPA
OA
AA
SA
LA
PA
1
Elusimicrobia
Bacteroidetes 0.8
Fusobacteria
Proteobacteria 0.6 b
5
LNA
Chlamydiae
OA DPA
LA
0.4 +
Lentisphaerae AA +
+ + +
+ EPA
+ + +++++++++++
++ +++ +
+++
++++++ +
0
0
+ ++++++
+++++
++ ++ +
++++++ +++
+ +
++++
Verrucomicrobia +
++
DHA+ ++++++++
+ +
0.2 ++ SA
Spirochaetes PA
−5
Synergistetes 0
CCA2
Actinobacteria..phylum.
−1
−10
−0.2
Cyanobacteria
Deinococcus.Thermus
−0.4
−15
Firmicutes
Tenericutes −0.6
−20
Euryarchaeota
Streptophyta
−0.8
−10 −5 0 5 10
The.Firmicutes.Bacteroidetes_ratio CCA1
−1
Fig. 11 (a) Pearson correlation between genus level microbiome and RBC fatty acid data. (b) Canonical
correspondence analysis of microbiome (genus level taxonomy) distribution in relation to blood parameters
(biplot: represented by blue arrows). Red crosses represent taxa and black circles represents individual
samples
l Velvet [87].
l SOAP [88].
l Abyss [89].
(c) The next generation of assembly tools.
l MetaVelvet-SL [90].
l Meta-IDBA [91].
l InteMAP [92].
l SAT-Assembler [93].
l IDBA-UD [94].
5. Removing near-exact matches by maping to specific genomes.
(a) Bowtie 2 [17].
6. Binning tools for metagenomes.
(a) Composition-based binning algorithms.
l S-GSOM [95].
l PhylopythiaS [96].
l TACAO [97].
l PCAHIER [98].
l ESOM [95].
l ClaMS [99].
(b) Similarity-based binning software include tools.
l MEGAN [31].
l IMG/MER 4 [35].
l MG-RAST [34].
l CARMA [100].
l MetaPhyler [101].
(c) Unsupervised binning.
l PhylopythiaS+ [102].
l PhymmBL [103].
l ESOMs [104].
l VizBin [105].
l IFCM (fuzzy c-means method) [106].
7. Binning of metagenome contigs for reconstructing single
genomes.
(a) ICoVeR [107].
(b) MyCC [108].
(c) MetaBAT [109].
(d) GroopM [110].
628 Suparna Mitra
5 Concluding Remarks
References
38. Overbeek R, Begley T, Butler RM, Choud- using phylogenetic networks based on ecolog-
huri JV, Chuang HY, Cohoon M et al (2005) ical indices. ISME J 4(10):1236–1242
The subsystems approach to genome annota- 52. Lozupone C, Knight R (2005) UniFrac: a
tion and its use in the project to annotate new phylogenetic method for comparing
1000 genomes. Nucleic Acids Res 33 microbial communities. Appl Environ Micro-
(17):5691–5702 biol 71(12):8228–8235
39. Powell S, Forslund K, Szklarczyk D, 53. Hamady M, Lozupone C, Knight R (2010)
Trachana K, Roth A, Huerta-Cepas J et al Fast UniFrac: facilitating high-throughput
(2014) eggNOG v4.0: nested orthology phylogenetic analyses of microbial commu-
inference across 3686 organisms. Nucleic nities including analysis of pyrosequencing
Acids Res 42(D1):D231–D2D9 and PhyloChip data. ISME J 4(1):17–27
40. Tatusov RL, Galperin MY, Natale DA, Koo- 54. Lozupone C, Lladser ME, Knights D,
nin EV (2000) The COG database: a tool for Stombaugh J, Knight R (2011) UniFrac: an
genome-scale analysis of protein functions effective distance metric for microbial com-
and evolution. Nucleic Acids Res 28 munity comparison. ISME J 5(2):169–172
(1):33–36 55. McGill R, Tukey JW, Larsen WA (1978) Var-
41. Haft DH, Selengut JD, White O (2003) The iations of box plots. Am Stat 32(1):12–16
TIGRFAMs database of protein families. 56. Kaufman L, Rousseeuw PJ (1990) Finding
Nucleic Acids Res 31(1):371–373 groups in data: an introduction to cluster
42. Finn RD, Coggill P, Eberhardt RY, Eddy SR, analysis, 1st edn. John Wiley, New York
Mistry J, Mitchell AL et al (2016) The Pfam 57. Hastie T, Tibshirani R, Friedman J (2009)
protein families database: towards a more sus- Hierarchical clustering. In: The elements of
tainable future. Nucleic Acids Res 44(D1): statistical learning, 2nd edn. Springer,
D279–DD85 New York
43. Vollmers J, Wiegand S, Kaster AK (2017) 58. Pearson K (1901) On lines and planes of clos-
Comparing and evaluating Metagenome est fit to systems of points in space. Philos
assembly tools from a microbiologist’s per- Mag 2(7–12):559–572
spective—not only size matters! PLoS One
12(1):e0169662 59. Borg I, Groenen PJ (2005) Modern multidi-
mensional scaling: theory and applications.
44. Ghurye JS, Cepeda-Espinoza V, Pop M Springer Science & Business Media,
(2016) Metagenomic assembly: overview, New York
challenges and applications. Yale J Biol Med
89(3):353–362 60. Bray JR, Curtis JT (1957) An ordination of
the upland forest communities of southern
45. Bacaro G, Rocchini D, Ghisla A, Wisconsin. Ecol Monogr 27(4):326–349
Marcantonio M, Neteler M, Chiarucci A
(2012) The spatial domain matters: spatially 61. Michael PW Ordination methods—an over-
constrained species rarefaction in a free and view. http://ordination.okstate.edu/over
open source environment. Ecol Complex view.htm
12:63–69 62. Gabriel KR (1971) Biplot graphic display of
46. Ni J, Yan Q, Yu Y (2013) How much meta- matrices with application to principal compo-
genomic sequencing is enough to achieve a nent analysis. Biometrika 58(3):453–467
given goal? Sci Rep 3:1968 63. Hotelling H (1936) Relations between two
47. Rodriguez RL, Konstantinidis KT (2014) sets of variates. Biometrika 28:321–377
Estimating coverage in metagenomic data 64. Terbraak CJF (1986) Canonical correspon-
sets and why it matters. ISME J 8 dence-analysis—a new eigenvector technique
(11):2349–2351 for multivariate direct gradient analysis. Ecol-
48. Whittaker RH (1972) Evolution and mea- ogy 67(5):1167–1179
surement of species diversity. Taxon 21 65. Chen J, Bushman FD, Lewis JD, Wu GD, Li
(2/3):213–251 H (2013) Structure-constrained sparse
49. Hill MO (1973) Diversity and evenness: a canonical correlation analysis with an applica-
unifying notation and its consequences. Ecol- tion to microbiome data analysis. Biostatistics
ogy 54(2):427–432 14(2):244–258
50. Tuomisto H (2010) A consistent terminology 66. Anderson MJ (2001) A new method for
for quantifying species diversity? Yes, it does non-parametric multivariate analysis of vari-
exist. Oecologia 164(4):853–860 ance. Austral Ecol 26(1):32–46
51. Mitra S, Gilbert JA, Field D, Huson DH 67. Clarke KG, Gorley RN (2006) PRIMER v6:
(2010) Comparison of multiple metagenomes user manual/tutorial. PRIMER-E, Plymouth
632 Suparna Mitra
68. Gentleman RC, Carey VJ, Bates DM, independent, community-supported software
Bolstad B, Dettling M, Dudoit S et al for describing and comparing microbial com-
(2004) Bioconductor: open software develop- munities. Appl Environ Microbiol 75
ment for computational biology and bioinfor- (23):7537–7541
matics. Genome Biol 5(10):16 82. Quince C, Lanzen A, Davenport RJ, Turn-
69. Oksanen J, Blanchet FG, Friendly M, baugh PJ (2011) Removing noise from Pyr-
Kindt R, Legendre P, McGlinn D, et al osequenced amplicons. BMC Bioinformatics
(2008) The vegan package. https://cran.r- 12:38
project.org/web/packages/vegan/vegan. 83. Schloss PD, Handelsman J (2008) A statistical
pdf toolbox for metagenomics: assessing func-
70. McMurdie PJ, Holmes S (2013) Phyloseq: an tional diversity in microbial communities.
R package for reproducible interactive analysis BMC Bioinformatics 9:34
and graphics of microbiome census data. 84. Chevreux B, Pfisterer T, Drescher B, Driesel
PLoS One 8(4):11 AJ, Müller WEG, Wetter T et al (2004) Using
71. Teeling H, Glockner FO (2012) Current the miraEST assembler for reliable and auto-
opportunities and challenges in microbial mated mRNA transcript assembly and SNP
metagenome analysis-a bioinformatic per- detection in sequenced ESTs. Genome Res
spective. Brief Bioinform 13(6):728–742 14(6):1147–1159
72. Oulas A, Pavloudi C, Polymenakou P, Pavlo- 85. Zheng Y, Zhao LJ, Gao JP, Fei ZJ (2011)
poulos GA, Papanikolaou N, Kotoulas G et al iAssembler: a package for de novo assembly
(2015) Metagenomics: tools and insights for of Roche-454/sanger transcriptome
analyzing next-generation sequencing data sequences. Bmc Bioinformatics 12:8
derived from biodiversity studies. Bioinform 86. Pevzner PA, Tang H, Waterman MS (2001)
Biol Insights 9:75–88 An Eulerian path approach to DNA fragment
73. Pavlopoulos GA, Oulas A, Iacucci E, Sifrim A, assembly. Proc Natl Acad Sci U S A 98
Moreau Y, Schneider R et al (2013) Unravel- (17):9748–9753
ing genomic variation from next generation 87. Zerbino DR, McEwen GK, Margulies EH,
sequencing data. BioData Min 6:13 Birney E (2009) Pebble and rock band: heu-
74. Lindgreen S, Adair KL, Gardner PP (2016) ristic resolution of repeats and scaffolding in
An evaluation of the accuracy and speed of the velvet short-read de novo assembler. PLoS
metagenome analysis tools. Sci Rep 6:19233 One 4(12):9
75. Smeds L, Künstner A (2011) ConDeTri—a 88. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J
content dependent read trimmer for Illumina et al (2012) SOAPdenovo2: an empirically
data. PLoS One 6(10):e26314 improved memory-efficient short-read de
76. Bolger AM, Lohse M, Usadel B (2014) Trim- novo assembler. GigaScience 1:18
momatic: a flexible trimmer for illumina 89. Jackman SD, Vandervalk BP, Mohamadi H,
sequence data. Bioinformatics 30 Chu J, Yeo S, Hammond SA et al (2017)
(15):2114–2120 ABySS 2.0: resource-efficient assembly of
77. Cox MP, Peterson DA, Biggs PJ (2010) large genomes using a bloom filter. Genome
SolexaQA: at-a-glance quality assessment of Res 27(5):768–777
illumina second-generation sequencing data. 90. Afiahayati, Sato K, Sakakibara Y (2015)
BMC Bioinformatics 11:485 MetaVelvet-SL: an extension of the velvet
78. Schmieder R, Edwards R (2011) Quality con- assembler to a de novo metagenomic assem-
trol and preprocessing of metagenomic data- bler utilizing supervised learning. DNA Res
sets. Bioinformatics 27(6):863–864 22(1):69–77
79. Ye Y, Choi J-H, Tang H (2011) RAPSearch: a 91. Peng Y, Leung HCM, Yiu SM, Chin FYL
fast protein similarity search tool for short (2011) Meta-IDBA: a de Novo assembler for
reads. BMC Bioinformatics 12:159 metagenomic data. Bioinformatics 27(13):
80. Quast C, Pruesse E, Yilmaz P, Gerken J, i94–i101
Schweer T, Yarza P et al (2013) The SILVA 92. Lai B, Wang F, Wang X, Duan L, Zhu H
ribosomal RNA gene database project: (2015) InteMAP: integrated metagenomic
improved data processing and web-based assembly pipeline for NGS short reads. BMC
tools. Nucleic Acids Res 41(D1): Bioinformatics 16:244
D590–D5D6 93. Zhang Y, Sun Y, Cole JR (2014) A scalable
81. Schloss PD, Westcott SL, Ryabin T, Hall JR, and accurate targeted gene assembly tool
Hartmann M, Hollister EB et al (2009) Intro- (SAT-assembler) for next-generation
ducing mothur: open-source, platform-
Multiple Data Analyses and Statistical Approaches for Analyzing Data from. . . 633
sequencing data. PLoS Comput Biol 10(8): augmented binning of metagenomic data.
e1003737 Microbiome 3:1
94. Peng Y, Leung HCM, Yiu SM, Chin FYL 106. Liu Y, Hou T, Kang B, Liu F (2017) Unsu-
(2012) IDBA-UD: a de novo assembler for pervised binning of metagenomic assembled
single-cell and metagenomic sequencing data Contigs using improved fuzzy C-means
with highly uneven depth. Bioinformatics 28 method. IEEE/ACM Trans Comput Biol
(11):1420–1428 Bioinform 14(6):1459–1467
95. Chan C-KK, Hsu AL, Tang S-L, Halgamuge 107. Broeksema B, Calusinska M, McGee F,
SK (2008) Using growing self-Organising Winter K, Bongiovanni F, Goux X et al
maps to improve the binning process in envi- (2017) ICoVeR—an interactive visualization
ronmental whole-genome shotgun sequenc- tool for verification and refinement of meta-
ing. J Biomed Biotechnol 2008:513701 genomic bins. BMC Bioinformatics 18:233
96. Patil KR, Roune L, McHardy AC (2012) The 108. Lin H-H, Liao Y-C (2016) Accurate binning
PhyloPythiaS web server for taxonomic of metagenomic contigs via automated clus-
assignment of metagenome sequences. PLoS tering sequences using information of geno-
One 7(6):e38581 mic signatures and marker genes. Sci Rep
97. Diaz NN, Krause L, Goesmann A, Niehaus K, 6:24175
Nattkemper TW (2009) TACOA—taxo- 109. Kang DD, Froula J, Egan R, Wang Z (2015)
nomic classification of environmental geno- MetaBAT, an efficient tool for accurately
mic fragments using a kernelized nearest reconstructing single genomes from complex
neighbor approach. BMC Bioinformatics microbial communities. PeerJ. 3:e1165
10:56 110. Imelfort M, Parks D, Woodcroft BJ,
98. Zheng H, Wu H (2010) Short prokaryotic Dennis P, Hugenholtz P, Tyson GW (2014)
DNA fragment binning using a hierarchical GroopM: an automated tool for the recovery
classifier based on linear discriminant analysis of population genomes from related meta-
and principal component analysis. J Bioin- genomes. PeerJ 2:e603
forma Comput Biol 8(6):995–1011 111. Wu Y-W, Tang Y-H, Tringe SG, Simmons BA,
99. Pati A, Heath LS, Kyrpides NC, Ivanova N Singer SW (2014) MaxBin: an automated
(2011) ClaMS: a classifier for metagenomic binning method to recover individual gen-
sequences. Stand Genomic Sci 5(2):248–253 omes from metagenomes using an
100. Krause L, Diaz NN, Goesmann A, Kelley S, expectation-maximization algorithm. Micro-
Nattkemper TW, Rohwer F et al (2008) Phy- biome 2:26
logenetic classification of short environmental 112. Alneberg J, Bjarnason BS, de Bruijn I,
DNA fragments. Nucleic Acids Res 36 Schirmer M, Quick J, Ijaz UZ et al (2014)
(7):2230–2239 Binning metagenomic contigs by coverage
101. Liu B, Gibbons T, Ghodsi M, Treangen T, and composition. Nat Methods 11
Pop M (2011) Accurate and fast estimation of (11):1144–1146
taxonomic profiles from metagenomic shot- 113. Zhu W, Lomsadze A, Borodovsky M (2010)
gun sequences. BMC Genomics 12(2):S4 Ab initio gene identification in metagenomic
102. Gregor I, Dröge J, Schirmer M, Quince C, sequences. Nucleic Acids Res 38(12):e132
McHardy AC (2016) PhyloPythiaS+: a self- 114. Hyatt D, Chen G-L, LoCascio PF, Land ML,
training method for the rapid reconstruction Larimer FW, Hauser LJ (2010) Prodigal: pro-
of low-ranking taxonomic bins from meta- karyotic gene recognition and translation ini-
genomes. PeerJ 4:e1603 tiation site identification. BMC
103. Brady A, Salzberg S (2011) PhymmBL Bioinformatics 11:119
expanded: confidence scores, custom data- 115. Hoff KJ, Lingner T, Meinicke P, Tech M
bases, parallelization and more. Nat Methods (2009) Orphelia: predicting genes in metage-
8(5):367 nomic sequencing reads. Nucleic Acids Res
104. Dick GJ, Andersson AF, Baker BJ, Simmons 37(Web Server issue):W101–W1W5
SL, Thomas BC, Yelton AP et al (2009) 116. Rho M, Tang H, Ye Y (2010) FragGeneScan:
Community-wide analysis of microbial predicting genes in short and error-prone
genome sequence signatures. Genome Biol reads. Nucleic Acids Res 38(20):e191
10(8):R85 117. Bland C, Ramsey TL, Sabree F, Lowe M,
105. Laczny CC, Sternal T, Plugaru V, Gawron P, Brown K, Kyrpides NC et al (2007) CRISPR
Atashpendar A, Margossian HH et al (2015) recognition tool (CRT): a tool for automatic
VizBin—an application for reference- detection of clustered regularly interspaced
independent visualization and human-
634 Suparna Mitra
palindromic repeats. BMC Bioinformatics 120. Laczny CC, Kiefer C, Galata V, Fehlmann T,
8:209 Backes C, Keller A (2017) BusyBee web:
118. Edgar RC (2007) PILER-CR: fast and accu- metagenomic data analysis by bootstrapped
rate identification of CRISPR repeats. BMC supervised binning and annotation. Nucleic
Bioinformatics 8:18 Acids Res 45(W1):W171–W1W9
119. Arndt D, Xia J, Liu Y, Zhou Y, Guo AC, Cruz 121. Richardson EJ, Escalettes F, Fotheringham I,
JA et al (2012) METAGENassist: a compre- Wallace RJ, Watson M (2013) Meta4: a web
hensive web server for comparative metage- application for sharing and annotating meta-
nomics. Nucleic Acids Res 40(Web Server genomic gene predictions using web services.
issue):W88–W95 Front Genet 4:168
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 21
Abstract
Systems genetics combines high-throughput genomic data with genetic analysis. In this chapter, we review
and discuss application of systems genetics in the context of evolutionary studies, in which high-throughput
molecular technologies are being combined with quantitative trait locus (QTL) analysis in segregating
populations.
The recent explosion of high-throughput data—measuring thousands of RNAs, proteins, and metabo-
lites, using deep sequencing, mass spectrometry, chromatin, methyl-DNA immunoprecipitation, etc.—
allows the dissection of causes of genetic variation underlying quantitative phenotypes of all types. To deal
with the sheer amount of data, powerful statistical tools are needed to analyze multidimensional relation-
ships and to extract valuable information and new modes and mechanisms of changes both within and
between species. In the context of evolutionary computational biology, a well-designed experiment and the
right population can help dissect complex traits likely to be under selection using proven statistical methods
for associating phenotypic variation with chromosomal locations.
Recent evolutionary expression QTL (eQTL) studies focus on gene expression adaptations, mapping the
gene expression landscape, and, tentatively, define networks of transcripts and proteins that are jointly
modulated sets of eQTL networks. Here, we discuss the possibility of introducing an evolutionary “prior”
in the form of gene families displaying evidence of positive selection, and using that prior in the context of
an eQTL experiment for elucidating host-pathogen protein-protein interactions.
Here we review one exemplar evolutionairy eQTL experiment and discuss experimental design, choice of
platforms, analysis methods, scope, and interpretation of results. In brief we highlight how eQTL are
defined; how they are used to assemble interacting and causally connected networks of RNAs, proteins, and
metabolites; and how some QTLs can be efficiently converted to reasonably well-defined sequence variants.
Key words Systems genetics, Genetical genomics, QTL, eQTL, xQTL, R-genes, Evolution, R/qtl,
LMM, GEMMA, NGS, Genomics, Metabolomics, Network inference, GeneNetwork
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_21, © The Author(s) 2019
635
636 Pjotr Prins et al.
experiments conducted more than 10, even 30, years ago can still
be compared with those today. Databases, such as GeneNetwork
[7, 8], contain thousands of studies conducted on the same indi-
vidual mouse strains.
Systems genetics combines genetics with high-throughput
molecular technologies. Combining gene expression, as measured
by microarray probes or RNA sequencing, with linkage leads to
gene expression QTL (eQTL). Such eQTL studies elucidate how
genotypic variation underlies, for example, morphological pheno-
types, by using gene expression levels as intermediate molecular
phenotypes. In other words, the expression level, as measured by a
microarray probe or probe set, is treated as a phenotype, i.e., a gene
expression trait. This phenotype is associated with the genome in
the form of one or more eQTL. With microarrays, the genomic
location of the probe is usually known. Therefore, expression phe-
notype and probe connect two types of genomic information:
eQTL location(s) and gene location. It is usually assumed that
eQTL loci represent cis- or trans-transcription regulators of the
target gene [9]. If the eQTL is located close to the gene on the
genome, the eQTL may point to a cis-regulator. If the eQTL is
located far from the gene on the genome, the eQTL may point to a
trans-regulator of a single gene or even eQTL trans-bands that
regulate multiple genes (see Fig. 1a and [10, 11]).
In a similar fashion, proteins and metabolites can be measured
to map protein QTL ( pQTL) and metabolite QTL (mQTL). A
remarkable study published in 1994 used two-dimensional protein
electrophoresis and a restriction fragment length polymorphism
map (RFLP) [12]. Deep sequencing, chromatin, and methyl-
DNA immunoprecipitation are just a few of the latest technologies
that add to the arsenal of tools available for the study of the genetic
variation underlying quantitative phenotypes. Together, eQTL,
mQTL, and pQTL are referred to as xQTL. Different xQTL appear
to confirm each other, for example, with the A. thaliana glucosi-
nolate pathway where eQTL, mQTL, and pQTL were mapped
together and used to infer the underlying pathways [13]. Such
causal inference can lead to dissecting pathways and gene networks
which is an active field of research, e.g., [14–16] (see also Fig. 1).
1.1 Evolutionary From the perspective of evolutionary biology, systems genetics has
xQTL Studies been applied to elucidate evolutionary adaptations of transcript
regulation. For example, Fraser et al. introduced a test for
lineage-specific selection on gene expression and analyzed the
directionality of microarray eQTL for 112 haploid segregants of a
genetic cross between two strains of the budding yeast Saccharomy-
ces cerevisiae, reanalyzing the two-color cDNA microarray data of
Brem and Kruglyak [17]. They found that hundreds of gene
expression levels have been subjected to lineage-specific selection.
Comparing these findings with independent population genetic
638 Pjotr Prins et al.
Fig. 1 In this hypothetical and schematic example related to mapped locations on a chromosome, prior
information is combined with multiple phenotype-genotype QTL mappings to zoom in on genomic areas and to
reason about causal relations between different layers of information. (a) The prior (red area on the
chromosome) points out that certain sections are of interest; these sections consist of related genes with
high homology showing evidence of positive selection, as discussed in the main text. The blue double arrow
points out the confidence interval for each QTL, above the significance threshold (red dotted line). The
accumulated evidence (light-blue areas) leads to a narrowed down section on the genome, where in this case
the prior information is the most specific. In addition, expression phenotypes A and B point to exact gene
locations (dotted line, based on exact probe information). (b) To infer causal relationships, network inference is
possible. On the left (vertical I), traits A, B, and D map to one hot spot, where A may be a regulator of B
because one QTL is shared. B causes metabolite phenotype C, again a shared QTL. Phenotype D matches A
and B, and phenotype E matches A, B, and C. These causal relationships are drawn by arrows. The figure
suggests that, even if individual QTL are not very informative, the accumulated evidence starts to paint a
picture
Systems Genetics for Evolutionary Studies 639
1.2 Adding a Prior QTL mapping links complex traits with one or more locations on
the genome (see Fig. 1). Such a location is a wide measure because a
QTL is a statistical estimate and rarely a precise indicator. On the
genome, a single QTL may represent tens, hundreds, and even
thousands of real genes. Combining the QTL with high-
throughput technologies, such as microarrays, can add informa-
tion. To zoom in on the genes underlying QTL, information from
other sources has to be utilized. Such a priori knowledge (prior)
could consist of results from traditional linkage studies or associa-
tion studies of, for example, human disease. That way one can
assign a specific regulatory role to polymorphic sites in a genomic
region known to be associated with disease [23]. Other useful
priors can be derived from existing information on gene ontology
terms, metabolic pathways, and protein-protein interactions, which
can be used to identify genes and pathways [25], provided these
databases are sufficiently informative.
Zou et al., for example, used gene ontology as a prior and
concluded that trans-acting eQTL divergence between duplicate
pairs of genes is related to a fitness defect under treatment condi-
tions, but not with fitness under normal condition [19].
Chen et al. identified strong candidate genes for resistance to
leaf rust in barley and on the general pathogen response pathway
using a custom barley microarray on 144 doubled haploid lines of
the St/Mx population [26]. Fifteen thousand six hundred and
eighty-five eQTL were mapped from 9557 genes. Correlation anal-
ysis identified 128 genes that were correlated with resistance, of
which 89 had eQTL colocating with the phenotypic QTL (phQTL)
or classic QTL. Transcript abundance in the parents and conserva-
tion of synteny with rice prioritized six genes as candidates for
Rphq11, the phQTL of largest effect [26].
Systems Genetics for Evolutionary Studies 641
(continued)
642 Pjotr Prins et al.
Box 1: (continued)
of these proteins is regulated in trans, as has been shown for
RPM1 and RPS2 [35].
A single A. thaliana plant has about 150 R-genes, repre-
senting a subset of R-genes in the overall population. The
protein products of R-genes are involved in molecular inter-
actions. They generally have a recognition site which can dock
against, i.e., recognize, one or more specific molecule(s). The
proteins encoded by the largest class of R-genes carry a
nucleotide-binding site LRR domain (NB-LRR, also referred
to as NB-ARC-LRR and NBS-LRR). NB-LRR R-genes can
be further subdivided based on their N-terminal structural
features into TIR-NB-LRR, which have homology to the
Drosophila Toll and mammalian interleukin-1 receptors and
CC-NB-LRR, which contain a putative coiled-coil motif
[36]. The LRR domain appears to mediate specificity in path-
ogen recognition, while the N-terminal TIR, or coiled-coil
motif, is likely to play a role in downstream signaling
[34]. When a molecule is docked, the R-protein is able to
activate pathways in the cell, resulting in, for example, a
hypersensitive response causing apoptosis and preventing
spread of infection.
Meanwhile, one single R-protein only recognizes one
type of invading molecule. Therefore, through its R-genes,
one individual plant only recognizes a limited number of
strains of invading pathogens, as the individual pathogens
have variation in effectors too. When a pathogen evolves to
use nonrecognized effectors, the plant becomes susceptible.
The success of plant defense is determined by both evolution
and the variation of specificity in a population. Unlike the
evolved mammal immune system, which can change in a living
organism and learn about invasions “on the fly” [37], plant
R-genes depend on the variation inside a gene pool to provide
the resistance against a pathogen; see, for example, Holub
et al. [38]. Even so, many genes involved in pathogen recog-
nition undergo rapid adaptive evolution [39], and studies
have found that A. thaliana R-genes show evidence of posi-
tive selection, e.g., [40–42].
2.1 Create a Prior To create the prior, we use Ziheng Yang’s codeml implementation
with PAML of phylogenetic analysis by maximum likelihood (PAML)
[44]. PAML can find amino acid sites which show evidence of
positive selection using dN/dS ratios, which is the ratio of
non-synonymous over synonymous substitution (ω, see [44]). The
calculation of maximum likelihood for multiple evolutionary mod-
els is computationally expensive, and executing PAML over an
alignment of a hundred sequences may take hours, sometimes
days, on a PC. The software for generating the prior is prepackaged
and makes up the workflow in Chap. 25, which includes BLAST
[45], Clustal Omega [46], pal2nal [47], PAML [44], and
BioRuby [48].
It is possible to find nonoverlapping large gene families by
using BLASTCLUST, a tool that is part of the BLAST tool set
[45]. After fetching the A. thaliana cDNA sequences from the
Arabidopsis Information Resource (TAIR) [49], convert the
sequences to a protein BLAST database format. Based on a homol-
ogy criterion, the identity score and genes are clustered into puta-
tive gene families by running BLASTCLUST with 70% amino acid
sequence identity. Note that the percentage identity may not render
all families and will leave out a number of genes. It is used here for
demonstration purposes only. For A. thaliana such a genome-wide
search finds at least 60 gene families, including some R-gene
families.
644 Pjotr Prins et al.
2.2 Select a Suitable To select a suitable experimental population, the choice of parents
Experimental is key. Because we want a descriptive evolutionary prior based on
Population gene families with known genome locations, we also need a
sequenced genome, from one parent and ideally from both of the
parental strains. The choice of parents for QTL analysis is normally
based on large (classical) phenotypic differences. For testing path-
ogen resistance, the choice would ideally be one susceptible parent
and one resistant (nonsusceptible) parent. For eQTL, phylogenetic
distance can be used, when there is no obvious phenotype. In
general, it is a good idea to choose one or both parents from
common library strains based on, for example, Columbia (Col-0),
Landsberg erecta (Ler-0), Wassilewskija (Ws-0), or Kashmir
(Kas-1). This is because a great number of experimental resources
and online information will be available. In addition, a reference
genetic background is provided in this way, which allows the com-
parison of the effects of QTL and mutant alleles [53]. A number of
RIL populations can be found through TAIR, a model organism
database providing a centralized, curated gateway to Arabidopsis
biology, research materials, and community [49].
Systems Genetics for Evolutionary Studies 645
2.3 Select an xQTL A large part of published xQTL studies is based on gene expression
Technology eQTL partly because gene expression probe provides a direct geno-
mic link. When it comes to selecting single-color or two-color
arrays, one consideration may be that two-color arrays have higher
efficiency when using a distant pair design [54].
Deep sequencing technology (RNA-seq, [55]) is affordable for
eQTL studies. The main advantage over microarrays is improved
signal-to-noise ratios and possibly improved coverage depending
on the reference genome. Microarrays are noisy partly due to cross
hybridization, e.g., [56], and have limited signal on low-abundance
transcripts or expressors; both facts are detrimental to significance.
Deep sequencing is no panacea, however, since it accentuates the
high expressors. High expressors are expressed thousands of times
higher than low expressors. Low expressors may lack significance
for differential expression. Worse because deep sequencing is sto-
chastic, many low expressors may even be absent. Another point to
consider is that currently at least 1 in 1000 nucleotide base pairs is
misread, which makes it harder to disentangle error from genetic
variation. Only when a sequence polymorphism is measured many
times (say 20), it can be considered to represent genetic variation.
Also a choice for a certain eQTL technology should take into
account that, when looking at differential gene expression analysis,
different microarray platforms agree with each other, but overlap
between microarray and deep sequencing is much lower, suggest-
ing a technical bias [57].
For an example of a metabolite mQTL study, see Keurentjes
et al. [58] and Fu et al. [59]. For a study integrating eQTL, pQTL,
mQTL, and classical phenotypic QTL, see Fu et al. [60] and Jansen
et al. [13].
2.4 Sizing the The size of the experimental population should be large enough to
Experimental give informative results. For classical QTL analysis, the sizing may
Population be assisted using estimates of total environmental variance and the
total genetic variance derived from the accessions, selected as par-
ents. Roughly, population sizes of 200 RILs, without replications,
will allow detection of large-effect QTL with an explained variance
of 10% in confidence intervals of 10–20 cM. Detection of small-
effect QTL or mapping accuracy below 5% requires increasing the
population size to at least 300 RILs [53]. It is important to note
that QTL mapping accuracy is a function of marker density and
population size. The number of strains to use differs between
inbred lines. The promise of extreme dense marker maps, such as
delivered by SNPs, does not automatically translate to higher accu-
racy. It is the number of recombination events in the population for
a particular genomic region that limits QTL interval size. In fact,
current marker maps, in the order of thousands of (evenly spread)
markers per genome, suite population sizes of a few hundred RILs.
646 Pjotr Prins et al.
2.5 Analyzing the R/qtl is extensible, interactive free software for the mapping of
xQTL Experiment with xQTL in experimental crosses. It is implemented as an add-on
R/qtl package for the widely used statistical language/software R. Since
its introduction, R/qtl has become a reference implementation
with an extensive guide on QTL mapping [62].
R/qtl includes multiple QTL mapping (MQM), as described in
[10], an automated procedure, which combines the strengths of
generalized linear model regression with those of interval mapping.
MQM can handle missing data by analyzing probable genotypes.
MQM selects important marker cofactors by multiple regression
and backward elimination. QTL are moved along the chromo-
somes using these preselected markers as cofactors. QTL are inter-
val mapped using the most informative model through maximum
likelihood. MQM for R/qtl brings the following advantages to
QTL mapping: (1) higher power, as long as the QTL explain a
reasonable amount of variation; (2) protection against overfitting,
because MQM fixes the residual variance from the full model;
(3) prevention of ghost QTL detection (between two QTL in
coupling phase); and (4) detection of negating QTL (QTL in
repulsion phase) [10].
MQM for R/qtl brings additional advantages to systems genet-
ics data sets with hundreds to millions of traits: (5) a pragmatic
permutation strategy for control of the FDR and prevention of
locating false QTL hot spots, as discussed above; (6) high-
performance computing by scaling on multi-CPU computers, as
well as clustered computers, by calculating phenotypes in parallel,
through the message passing interface (MPI) of the parallel package
for R; and (7) visualizations for exploring interactions in a genomic
Systems Genetics for Evolutionary Studies 647
2.6 Matching the After detecting eQTL, we have a map of gene regulation in the form
Prior of a cis-trans map. When taking a priori information into account,
i.e., genomic locations derived through other methods, we can
potentially match the genomic locations of genes and gene families
with the eQTL cis-trans map. Until now, there has been no com-
bined QTL and evolutionary study, involving PAML, for host-
pathogen relationships in plants, though they have been conducted
separately.
3 Discussion
4 Questions
References
1. Qin L, Prins P, Jones JT et al (2001) Genest, a 15. Li Y, Tesson BM, Churchill GA, Jansen RC
powerful bidirectional link between cdna (2010) Critical reasoning on causal inference
sequence data and gene expression profiles in genome-wide linkage and association stud-
generated by cdna-aflp. Nucleic Acids Res 29 ies. Trends Genet 26(12):493–498
(7):1616–1622 16. Arends D, Li Y, Brockmann G et al (2016)
2. Mckeown PC, Laouielle-duprat S, Prins P et al Correlation trait loci (ctl) mapping: phenotype
(2011) Identification of imprinted genes sub- network inference subject to genotype. JOSS. 1
ject to parent-of-origin specific expression in (6):87
arabidopsis thaliana seeds. BMC Plant Biol 17. Brem RB, Kruglyak L (2005) The landscape of
11:113 genetic complexity across 5,700 gene expres-
3. Nandi S, Subudhi PK, Senadhira D et al (1997) sion traits in yeast. Proc Natl Acad Sci U S A
Mapping QTLs for submergence tolerance in 102(5):1572–1577
rice by AFLP analysis and selective genotyping. 18. Fraser HB, Moses AM, Schadt EE (2010) Evi-
Mol Gen Genet 255(1):1–8 dence for widespread adaptive evolution of
4. Meaburn E, Butcher LM, Schalkwyk LC, Plo- gene expression in budding yeast. Proc Natl
min R (2006) Genotyping pooled DNA using Acad Sci U S A 107(7):2977–2982
100K SNP microarrays: a step towards geno- 19. Zou Y, Su Z, Yang J, Zeng Y, Gu X (2009)
mewide association scans. Nucleic Acids Res 34 Uncovering genetic regulatory network diver-
(4):e27 gence between duplicate genes using yeast eqtl
5. Kim S, Plagnol V, Hu TT et al (2007) Recom- landscape. J Exp Zool B Mol Dev Evol 312
bination and linkage disequilibrium in arabi- (7):722–733
dopsis thaliana. Nat Genet 39(9):1151–1155 20. Li Y, Breitling R, Jansen RC (2008) General-
6. Dixon AL, Liang L, Moffatt MF et al (2007) A izing genetical genomics: getting added value
genome-wide association study of global gene from environmental perturbation. Trends
expression. Nat Genet 39(10):1202–1207 Genet 24(10):518–524
7. Sloan Z, Arends D, Broman KW et al (2016) 21. Hager R, Lu L, Rosen GD, Williams RW
Genenetwork: framework for web-based (2012) Genetic architecture supports mosaic
genetics. JOSS 1(2):25 brain evolution and independent brain-body
8. Mulligan MK, Mozhui K, Prins P, Williams RW size regulation. Nat Commun 3:1079
(2017) Genenetwork: a toolbox for systems 22. Kliebenstein DJ, West MA, Van leeuwen H et al
genetics. Methods Mol Biol 1488:75–120 (2006) Identification of QTLs controlling
9. Gibson G, Weir B (2005) The quantitative gene expression networks defined a priori.
genetics of transcription. Trends Genet 21 BMC Bioinformatics 7:308
(11):616–623 23. Gilad Y, Rifkin SA, Pritchard JK (2008) Reveal-
10. Arends D, Prins P, Jansen RC, Broman KW ing the architecture of gene regulation: the
(2010) R/qtl: high-throughput multiple promise of eQTL studies. Trends Genet 24
QTL mapping. Bioinformatics 26 (8):408–415
(23):2990–2992 24. Alberts R, Terpstra P, Li Y et al (2007)
11. Li Y, Alvarez OA, Gutteling EW et al (2006) Sequence polymorphisms cause many false cis
Mapping determinants of gene expression plas- eqtls. PLoS One 2(7):e622
ticity by genetical genomics in C. elegans. 25. Franke L, Bakel van H, Fokkens L et al (2006)
PLoS Genet 2(12):e222 Reconstruction of a functional human gene
12. Damerval C, Maurice A, Josse JM, De Vienne network, with an application for prioritizing
D (1994) Quantitative trait loci underlying positional candidate genes. Am J Hum Genet
gene product variation: a novel perspective for 78(6):1011–1025
analyzing regulation of genome expression. 26. Chen X, Hackett CA, Niks RE et al (2010) An
Genetics 137(1):289–301 eqtl analysis of partial resistance to puccinia
13. Jansen RC, Tesson BM, Fu J, Yang Y, Mcintyre hordei in barley. PLoS One 5(1):e8598
LM (2009) Defining gene and QTL networks. 27. Qin L, Kudla U, Roze EH et al (2004) Plant
Curr Opin Plant Biol 12(2):241–246 degradation: a nematode expansin acting on
14. Langfelder P, Horvath S (2008) Wgcna: an r plants. Nature 427(6969):30
package for weighted correlation network anal- 28. Saijo Y, Schulze-lefert P (2008) Manipulation
ysis. BMC Bioinformatics 9:559 of the eukaryotic transcriptional machinery by
Systems Genetics for Evolutionary Studies 651
bacterial pathogens. Cell Host Microbe 4 43. Bevan M, Bancroft I, Bent E, Chalwatzis N
(2):96–99 (1998) Analysis of 1.9 Mb of contiguous
29. Chen LQ, Hou BH, Lalonde S et al (2010) sequence from chromosome 4 of Arabidopsis
Sugar transporters for intercellular exchange thaliana. Nature 391(6666):485–488
and nutrition of pathogens. Nature 468 44. Yang Z (1997) PAML: a program package for
(7323):527–532 phylogenetic analysis by maximum likelihood.
30. Hewitson JP, Grainger JR, Maizels RM (2009) Comput Appl Biosci 13(5):555–556
Helminth immunoregulation: the role of para- 45. Altschul SF, Madden TL, Schaffer AA et al
site secreted proteins in modulating host (1997) Gapped blast and psi-blast: a new gen-
immunity. Mol Biochem Parasitol 167(1):1–11 eration of protein database search programs.
31. Bird PI, Trapani JA, Villadangos JA (2009) Nucleic Acids Res 25(17):3389–3402
Endolysosomal proteases and their inhibitors 46. Sievers F, Wilm A, Dineen D et al (2011) Fast,
in immunity. Nat Rev Immunol 9 scalable generation of high-quality protein
(12):871–882 multiple sequence alignments using clustal
32. Dangl JL, Jones JD (2001) Plant pathogens omega. Mol Syst Biol 7:539
and integrated defence responses to infection. 47. Suyama M, Torrents D, Bork P (2006) Pal2nal:
Nature 411(6839):826–833 robust conversion of protein sequence align-
33. Flor H (1956) The complementary genic sys- ments into the corresponding codon align-
tems in flax and flax rust*. Adv Genet 8:29–54 ments. Nucleic Acids Res 34(Web Server):
34. Bakker EG, Toomajian C, Kreitman M, Bergel- W609–W612
son J (2006) A genome-wide survey of R gene 48. Goto N, Prins P, Nakao M et al (2010) BioR-
polymorphisms in Arabidopsis. Plant Cell 18 uby: bioinformatics software for the Ruby pro-
(8):1803–1818 gramming language. Bioinformatics 26
35. Mackey D, Belkhadir Y, Alonso JM, Ecker JR, (20):2617–2619
Dangl JL (2003) Arabidopsis rin4 is a target of 49. Rhee SY, Beavis W, Berardini TZ et al (2003)
the type iii virulence effector avrrpt2 and mod- The Arabidopsis Information Resource
ulates rps2-mediated resistance. Cell 112 (TAIR): a model organism database providing
(3):379–389 a centralized, curated gateway to Arabidopsis
36. Richly E, Kurth J, Leister D (2002) Mode of biology, research materials and community.
amplification and reorganization of resistance Nucleic Acids Res 31(1):224–228
genes during recent arabidopsis thaliana evolu- 50. Anisimova M, Nielsen R, Yang Z (2003) Effect
tion. Mol Biol Evol 19(1):76–84 of recombination on the accuracy of the likeli-
37. Medzhitov R, Janeway CA Jr (1997) Innate hood method for detecting positive selection at
immunity: impact on the adaptive immune amino acid sites. Genetics 164(3):1229–1236
response. Curr Opin Immunol 9(1):4–9 51. Arabidopsis Genome Initiative (2000) Analysis
38. Holub EB (2001) The arms race is ancient of the genome sequence of the flowering plant
history in Arabidopsis, the wildflower. Nat arabidopsis thaliana. Nature 408
Rev Genet 2(7):516–527 (6814):796–815
39. Bishop JG, Dean AM, Mitchell-olds T (2000) 52. Michelmore RW, Meyers BC (1998) Clusters
Rapid evolution in plant chitinases: molecular of resistance genes in plants evolve by divergent
targets of selection in plant-pathogen coevolu- selection and a birth-and-death process.
tion. Proc Natl Acad Sci U S A 97 Genome Res 8(11):1113–1130
(10):5322–5327 53. Salinas J, Sanchez-serrano J (2006) Arabidop-
40. Xiao S, Emerson B, Ratanasut K et al (2004) sis protocols. Humana Press Inc, Totowa, NJ
Origin and maintenance of a broad-spectrum 54. Fu J, Jansen RC (2006) Optimal design and
disease resistance locus in Arabidopsis. Mol analysis of genetic studies on gene expression.
Biol Evol 21(9):1661–1672 Genetics 172(3):1993–1999
41. Mondragon-palomino M, Meyers BC, Michel- 55. Mortazavi A, Williams BA, Mccue K,
more RW, Gaut BS (2002) Patterns of positive Schaeffer L, Wold B (2008) Mapping and
selection in the complete nbs-lrr gene family of quantifying mammalian transcriptomes by
Arabidopsis thaliana. Genome Res 12 rna-seq. Nat Methods 5(7):621–628
(9):1305–1315 56. Eklund AC, Turner LR, Chen P et al (2006)
42. Sun X, Cao Y, Wang S (2006) Point mutations Replacing cRNA targets with cDNA reduces
with positive selection were a major force dur- microarray cross-hybridization. Nat Biotech-
ing the evolution of a receptor-kinase resistance nol 24(9):1071–1073
gene family of rice. Plant Physiol 140 57. Hoen PA, Ariyurek Y, Thygesen HH et al
(3):998–1008 (2008) Deep sequencing-based expression
652 Pjotr Prins et al.
analysis shows major advances in robustness, 63. Arends D, Prins P, Broman KW, Jansen RC
resolution and inter-lab portability over five (2010) Tutorial - multiple-QTL mapping
microarray platforms. Nucleic Acids Res 36 (MQM) analysis. http://www.rqtl.org/
(21):e141 tutorials/MQM-tour.pdf
58. Keurentjes JJ, Sulpice R, Gibon Y et al (2008) 64. Jansen RC, Nap JP (2001) Genetical genomics:
Integrative analyses of genetic variation in the added value from segregation. Trends
enzyme activities of primary carbohydrate Genet 17(7):388–391
metabolism reveal distinct modes of regulation 65. Wayne ML, Mcintyre LM (2002) Combining
in Arabidopsis thaliana. Genome Biol 9(8): mapping and arraying: an approach to candi-
R129 date gene identification. Proc Natl Acad Sci U
59. Fu J, Swertz MA, Keurentjes JJ, Jansen RC S A 99(23):14903–14906
(2007) Metanetwork: a computational proto- 66. Lippert C, Listgarten J, Liu Y et al (2011) Fast
col for the genetic study of metabolic net- linear mixed models for genome-wide associa-
works. Nat Protoc 2(3):685–694 tion studies. Nat Methods 8(10):833–835
60. Fu J, Keurentjes JJ, Bouwmeester H et al 67. Zhou X, Stephens M (2012) Genome-wide
(2009) System-wide molecular evidence for efficient mixed-model analysis for association
phenotypic buffering in Arabidopsis. Nat studies. Nat Genet 44(7):821–824
Genet 41(2):166–167 68. Wang X, Pandey AK, Mulligan MK et al (2016)
61. Breitling R, Li Y, Tesson BM et al (2008) Joint mouse-human phenome-wide association
Genetical genomics: spotlight on QTL hot- to test gene function and disease risk. Nat
spots. PLoS Genet 4(10):e1000232 Commun 7:10464
62. Broman K, Sen S (2009) A guide to QTL
mapping with R/qtl. Springer, New York, NY
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Part VI
Abstract
Biological databases are growing at an exponential rate, currently being among the major producers of Big
Data, almost on par with commercial generators, such as YouTube or Twitter. While traditionally biological
databases evolved as independent silos, each purposely built by a different research group in order to answer
specific research questions; more recently significant efforts have been made toward integrating these
heterogeneous sources into unified data access systems or interoperable systems using the FAIR principles
of data sharing. Semantic Web technologies have been key enablers in this process, opening the path for new
insights into the unified data, which were not visible at the level of each independent database. In this
chapter, we first provide an introduction into two of the most used database models for biological data:
relational databases and RDF stores. Next, we discuss ontology-based data integration, which serves to
unify and enrich heterogeneous data sources. We present an extensive timeline of milestones in data
integration based on Semantic Web technologies in the field of life sciences. Finally, we discuss some of
the remaining challenges in making ontology-based data access (OBDA) systems easily accessible to a larger
audience. In particular, we introduce natural language search interfaces, which alleviate the need for
database users to be familiar with technical query languages. We illustrate the main theoretical concepts
of data integration through concrete examples, using two well-known biological databases: a gene expres-
sion database, Bgee, and an orthology database, OMA.
Key words Data integration, Ontology-based data access, Knowledge representation, Query proces-
sing, Keyword search, Relational databases, RDF stores
Abbreviations
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_22, © The Author(s) 2019
655
656 Ana Claudia Sima et al.
OMA Orthologous Matrix, a database for the inference of orthologs among complete
genomes.—https://omabrowser.org, SPARQL endpoint: https://sparql.
omabrowser.org/sparql
PK Primary key in a relational database
PK-FK Primary key-foreign key relationship; enables joining two tables in a relational
database
RDB Relational database
RDF Resource Description Framework
SODA Search Over Relational Databases [21]
SQL Structured Query Language
SPARQL SPARQL Protocol and RDF Query Language
TBox Terminological box
URI Uniform Resource Identifier
1 Introduction
Fig. 1 Sample relational database (extracted from the gene expression database Bgee)
Semantic Integration and Enrichment of Heterogeneous Biological Databases 659
2.1 Limitations of So far, we have seen that relational databases are a mature, highly
Relational Databases optimized technology for storing and querying structured data.
and Emerging Also, combined with a powerful and expressive query language,
Solutions for Data SQL, they allow users to federate (join) data even from different
Integration databases.
However, there are certain relationships that are not natural for
relational databases. Let us consider the relationship “hasOrtho-
log”. Both the domain and the range of this relationship, as defined
in the Orthology Ontology [13], are the same—a gene. For exam-
ple, the hemoglobin (HBB) gene in human has the Hbb-bt ortho-
logous gene in the mouse (expressed via the relation hasOrtholog).
In the relational database world, this translates into a so-called self-
join. As the name suggests, this requires joining one table—in this
case, Gene—with itself, in order to retrieve the answer. These types
of “self-join” relations, while frequent in the real world (e.g., a
manager of an employee is also an employee, a friend of a person
is also a person, etc.), are inefficient in the context of relational
databases. While there are sometimes ways to avoid self-joins, these
660 Ana Claudia Sima et al.
that aligns, but also enriches the existing data, with the purpose of
facilitating knowledge discovery.
Throughout the remainder of this chapter, we will combine
theoretical aspects of data integration with concrete examples,
based on our SODA project [21], as well as from our ongoing
research project, Bio-SODA [22], where we are currently building
an integrated data access system for biological databases (starting
with OMA and Bgee), using a natural language search interface. In
the context of this project, Semantic Web technologies, such as
RDF, are used to enhance interoperability among heterogeneous
databases at the semantic level (e.g., RDF graphs with predefined
semantics). Moreover, currently, several life science and biomedical
databases such as OMA [3], UniProt [7], neXtProt [22], the
European Bioinformatics Institute (EMBL-EBI) RDF data [24],
and the WorldWide Protein Data Bank [25] already provide RDF
data access, which also justifies an RDF-based approach to enable
further integration efforts to include these databases. A recent
initiative for (biological) data sharing is based on the FAIR princi-
ples [26], aiming to make data findable, accessible, interoperable,
and re-usable.
3.1 Unique Resource A Uniform Resource Identifier (URI) is a character sequence that
Identifier (URI) identifies an abstract or physical resource. A URI is classified as a
locator, a name, or both. The Uniform Resource Locators (URLs)
are a subset of URIs that, in addition to identifying a resource,
provide a means of locating the resource by describing its primary
662 Ana Claudia Sima et al.
Fig. 4 An RDF graph with two nodes (subject and object) and an edge connecting
them (predicate)
(a TBox axiom that states a man is a human and male) and john:
Man (an ABox assertion that states john is an instance of man).
Given that one of the goals of the Semantic Web is to assign
unambiguous names to resources (URIs), an ontology should be
more than a simple description of data in a particular triple store.
Rather, it should more generally serve as a description of a domain,
for instance, genomics (see Gene Ontology [37]) or orthology (see
Orth Ontology [13]). Different instantiations of this domain, for
example, by different research groups, should reuse and extend this
ontology. Therefore, constructing good ontologies requires careful
consideration and agreement between domain specialists, with the
goal of formally representing knowledge in their field. As a conse-
quence, ontologies are usually defined in the scope of consor-
tiums—such as the Gene Ontology Consortium [38] or the
Quest for Orthologs Consortium [39]. A notable collaborative
effort is the Open Biological and Biomedical Ontology (OBO)
Foundry [40]. It established principles for ontology development
and evolution, with the aim of maximizing cross-ontology coordi-
nation and interoperability, and provides a repository of life science
ontologies, currently, including about 140 ontologies.
To give an example of RDF data in a concrete life sciences use
case, let us consider the following RDF triples, which illustrate a
few of the assertions used in the OMA orthology database to
describe the human hemoglobin protein (“HBB”), using the first
version of the ORTH ontology [13]:
3.3 RDF Schema RDF Schema (RDFS) provides a vocabulary for modeling RDF
(RDFS) data and is a semantic extension of RDF. It provides mechanisms
for describing groups (i.e., classes) of related resources and the
relationships between these resources. The RDFS is defined in
RDF. The RDFS terms are used to define attributes of other
resources such as the domains (rdfs:domain) and ranges (rdfs:
range) of properties. Moreover, the RDFS core vocabulary is
defined in a namespace informally called rdfs here, and it is conven-
tionally associated with the prefix rdfs:. That namespace is identified
by the URI http://www.w3.org/2000/01/rdf-schema#.
In this section, we will mostly focus on the RDF and RDFS
terms used in this chapter. Further information about RDF/RDFS
terms is available in [42].
l Classes
– rdfs:Resource—all things described by RDF are called
resources, which are instances of the class rdfs:Resource (i.e.,
rdfs:Resource is an instance of rdfs:Class).
– rdfs:Class is the class of resources that are RDF classes.
Resources that have properties (attributes) in common may
be divided into classes. The members of a class are instances.
– rdf:Property is a relation between subject and object
resources, i.e., a predicate. It is the class of RDF properties.
– rdfs:Literal is the class of literal values such as textual strings
and integers. rdfs:Literal is a subclass of rdfs:Resource.
l Properties
– rdfs:range is an instance of rdf:Property. It is used to state
that the values of a property are instances of one or more
classes. For example, orth:hasHomolog rdfs:range orth:Sequen-
ceUnit (see Fig. 5a). This statement means that the values of
orth:hasHomolog property can only be instances of orth:
SequenceUnit class.
– rdfs:domain is an instance of rdf:Property. It is used to state
that any resource that has a given property is an instance of
one or more classes. For example, orth:hasHomolog rdfs:
domain orth:SequenceUnit (see Fig. 5b). This statement
means that resources that assert the orth:hasHomolog prop-
erty must be instances of orth:SequenceUnit class.
– rdf:type is an rdf:Property that is used to state that a resource
is an instance of a class.
666 Ana Claudia Sima et al.
3.4 Web Ontology The first level above RDF/RDFS in the Semantic Web stack (see
Language (OWL) Fig. 2) is an ontology language that can formally describe the
meaning of resources. If machines are expected to perform useful
reasoning tasks on RDF data, the language must go beyond the
basic semantics of RDF Schema [43]. Because of this, OWL and
OWL 2 (i.e., Web Ontology languages) include more terms for
describing properties and classes, such as relations between classes
(e.g., disjointness, owl:disjointWith), cardinality (e.g., “exactly 2,”
owl:cardinality), equality (i.e., owl:equivalentClass), richer typing
of properties, characteristics of properties (e.g., symmetry, owl:
SymmetricProperty), and enumerated classes (i.e., owl:oneOf). The
owl: prefix replaces the following URI namespace: http://www.w3.
org/2002/07/owl#.
As a full description of OWL and OWL 2 is beyond the scope of
this chapter, we refer the interested reader to [44, 45]. In the
following, we focus solely on some essential modeling features
that the OWL languages offer in addition to RDF/RDFS
vocabularies.
l owl:Class is a subclass of rdfs:Class. Like rdfs:Class, an owl:Class
groups instances that share common properties. However, this
new OWL term is defined due to the restrictions on DL-based
Semantic Integration and Enrichment of Heterogeneous Biological Databases 667
Fig. 6 Examples of instances of orth:SequenceUnit and orth:Gene and object and datatype property assertions
3.5 RDF Serialization RDF is a graph-based data model which provides a grammar for its
Formats syntax. Using this grammar, RDF syntax can be written in various
concrete formats which are called RDF serialization formats. For
example, we can mention the following formats: Turtle [48],
RDF/XML (an XML syntax for RDF) [49], and JSON-LD
(a JSON syntax for RDF) [50]. In this section, we will solely
focus on the Turtle format.
Turtle language (TTL) allows for writing an RDF graph in a
compact textual form. To exemplify this serialization format, let us
consider the following turtle document that defines the homolo-
gous and orthologous relations:
668 Ana Claudia Sima et al.
# http://purl.org/net/orth#SequenceUnit
orth:SequenceUnit rdf:type owl:Class .
3.6 Querying the Once we have defined the knowledge base (TBox and ABox), how
Semantic Web with can we use it to retrieve relevant data? Similar to SQL for relational
SPARQL databases, data in RDF stores can be accessed by using a query
language. One of the main RDF query languages, especially used
in the field of life sciences, is SPARQL [51]. A SPARQL query
essentially consists of a graph pattern, namely, conjunctive RDF
triples, where the values that should be retrieved (the unknowns—
either subjects, predicates, or objects) are replaced by variable names,
prefixed by “?”. Looking again at the previous example, if we want to
get the description of the “HBB” protein from OMA, we would
simply use a graph pattern, where the value of the “description”—
the one we want to retrieve—is replaced by a variable as follows:
Fig. 7 A portion of the ontology defined over the relational database sample from Bgee. For readability
purposes, we omitted the namespace (“bgee:”) for the ontology properties
Fig. 8 The class hierarchy of the OMA ontology. Ellipses indicate class labels, while arrows indicate the “rdfs:
subClassOf ” property. Further details are available in [13]
most important concepts in the ontology are shown in the top right
corner, namely, the cluster of orthologs and the cluster of paralogs,
which store information about gene orthology (or paralogy) in a
hierarchical tree structure (the gene-tree node). Similarly to the
Bgee ontology, the Gene class in OMA is external. Arrows indicate
the “rdfs:subClassOf” relationship—for example, both the “Clus-
ter of Orthologs” and the “Cluster of Paralogs” classes—are
Semantic Integration and Enrichment of Heterogeneous Biological Databases 671
?proteinOMA a orth:Protein.
?proteinOMA oma:xrefUniprot ?proteinUniProt.
}
5.1 A System’s We start with a bottom-up description of the layers that make up an
Perspective integrated data access system, followed by a concrete example using
the two bioinformatics databases introduced above: the orthology
database OMA and the gene expression database Bgee.
The main four layers of an integrated data access system, as
shown in Fig. 9, are:
Semantic Integration and Enrichment of Heterogeneous Biological Databases 673
5.1.1 Base Data Layer This represents the physical storage layer, where all the actual data,
for example, experimental results, annotations, etc., are kept.
Figure 9 illustrates only a few of the possible storage types, namely,
relational databases, hierarchical data stores (e.g., HDF5), and
RDF stores. At this low-level layer, the data are usually structured
so as to optimize machine parameters, such as storage space, com-
plexity of joins required to answer physical queries, etc. Therefore,
it is not designed for human readability. Furthermore, tables, col-
umn names, or even IDs may not match any real terms. For exam-
ple, the Bgee relational database uses the table name “anatEntity”
to refer to the term “anatomic entity,” while others may be even
further away from the original terms.
5.1.2 Data Model Layer This layer is used to describe, at a higher level of abstraction, the
data contained in the physical storage. Here, for example, original
names for terms are recovered while also creating a mapping
between these higher-level terms (“Anatomical Entity”) and their
corresponding physical layer location (table “anatEntity” in schema
Bgee). The data model layer can be viewed as the first semantic layer
in the system, as it allows representing the actual terms referred to
in the underlying physical storage while abstracting away the details
of the actual structure of the physical storage. The data model layer
can be understood as an ontology, however, only applicable to the
level of an individual database.
5.1.3 Integration Layer The integration layer performs a similar task to the data model
layer, in that it defines a mapping between high-level concepts
(“Anatomical Entity”) and all the occurrences where these concepts
can be found in the physical storage (table “anatEntity” in schema
Bgee, class “Anatomic Entity” in UniProt, etc.). In doing so, the
integration layer also aligns the different data models, by defining
674 Ana Claudia Sima et al.
5.1.4 Presentation Layer The three layers presented so far already achieve data integration,
but with a significant drawback, which is that the user is required to
know a technical query language, such as SPARQL. The role of the
presentation layer is to expose data from all integrated resources in
Semantic Integration and Enrichment of Heterogeneous Biological Databases 675
5.2 A Concrete So far we have seen an abstract view of a system for data integration
Example: A Global across heterogeneous databases. It is time to look at how this
Ontology to Unify OMA translates into a real-world example, using the Bgee relational
and Bgee database and the OMA RDF database.
676 Ana Claudia Sima et al.
Fig. 10 A sample global ontology for integrating OMA and Bgee and an example assertion
The top part of Fig. 10, the terminological box, illustrates part
of the global ontology (layer 3, integration layer) for the two
databases, with most of the terms being part of OMA, except for
Anatomic Entity, which is specific to Bgee. As mentioned previ-
ously, OMA extends the ORTH ontology, which is why the
corresponding terms in the ontology are prefixed with “orth:.”
The Gene concept can actually be found in both Bgee and OMA;
therefore the global ontology will define mappings to both sources.
As we can see in the ontology, the Gene is the common point that
joins together OMA and Bgee. The gene IDs used in both data-
bases are Ensembl IDs [62], stored in the ensemblGeneId string
property. For example, the human hemoglobin gene, “HBB,”
which we previously showed as an example entry in OMA, corre-
sponds to the ENSG00000244734 Ensemble ID and can also be
found in Bgee.
Semantic Integration and Enrichment of Heterogeneous Biological Databases 677
5.3 How to Link a One of the main challenges in implementing technologies for the
Database with an Semantic Web was recognized from early on (see the study pub-
Ontology? lished in 2001 by Calvanese et al. [63]) to be the problem of
integrating heterogeneous sources. In particular, one of the observa-
tions made was that integrating legacy data will not be feasible
through a simple 1-to-1 mapping of the underlying sources into
an integrative ontology (e.g., mapping all attributes of tables in
relational databases to properties of classes in an ontology), but
rather through more complex transformations, that map views of
the data into elements of the global ontology [63].
To illustrate this with a concrete example, let us consider again
the unified ontology for OMA and Bgee that we introduced in the
previous section. Although Figure 10 shows properties such as
“gene isExpressedIn” or “gene hasOrtholog,” this data is actually
not explicitly stored in the underlying databases but rather needs to
be computed on-the-fly based on the available data. For example,
the “isExpressedIn” property can be computed based on the num-
ber of experiments which show the expression of a gene in a certain
anatomic entity in Bgee. Deciding the exact threshold for when a
gene is considered as “expressed” according to the data available is
not straightforward and needs to be agreed upon by domain spe-
cialists. Therefore, the integration layer will also serve to enrich the
data available in the underlying layers, by defining new concepts
678 Ana Claudia Sima et al.
5.4 Putting Things So far we have seen how individual sources can be represented into
Together a single, unified ontology, and we had a high-level view of a data
access system that enables users to ask queries and get responses in a
unified way, without knowledge of where data is located or how it is
structured. In this section we finally look at how all of these com-
ponents can work together in answering natural language queries
on biological databases. Although there are multiple alternatives to
natural language interfaces, including visual query interfaces or
keyword-based search interfaces, it has been shown that natural
language interfaces are the most appropriate means to query
Semantic Web data for non-technical end-users [68]. As a conse-
quence, natural language querying, based on Semantic Web tech-
nologies, is currently one of the active areas of research, examples of
recent systems implementing an ontology-based natural language
interface including the Athena [59] and TRDiscover [60] systems.
Semantic Integration and Enrichment of Heterogeneous Biological Databases 679
The field of life sciences has been an early adopter of Semantic Web
technologies, due to the need of interoperability and integration of
biological data spread across different databases. In this section, we
provide a brief timeline (see Fig. 11), including the example ontol-
ogies introduced in this chapter.
– 1995: Davidson et al. [72] suggest basic steps to integrate
bioinformatics data (common data model, match semantically
related objects, schema integration, transform data into feder-
ated database, match semantically equivalent data).
– 2000: TAMBIS (Transparent Access to Multiple Bioinfor-
matics Information Sources) [73] proposes a unified ontology
covering many aspects of the bioinformatics knowledge space.
– 2000: The “Gene Ontology—a tool for the unification of
biology” [37] is the first significant milestone in unifying
diverse biological databases, focusing on gene functions. Even
before the publication of the Semantic Web paper by Tim Ber-
ners Lee (in the following year), the GO highlighted the benefits
of controlled vocabularies and standardized naming, both pre-
cursors of Semantic Web technologies, which were adopted in
the GO in the year 2002 [74]. Today it is, arguably, the most
682 Ana Claudia Sima et al.
8 Exercises
PREFIX up:<http://purl.uniprot.org/core/>
SELECT (count(?protein) as ?count)
WHERE
{ ?protein a up:Protein. }
prefix up:<http://purl.uniprot.org/core/>
prefix taxon:<http://purl.uniprot.org/taxonomy/>
select distinct ?proteinOMA ?proteinUniProt
where {
service <http://sparql.uniprot.org/sparql> {
?proteinUniProt a up:Protein .
?proteinUniProt up:organism taxon:9606 . # Homo Sapiens
?proteinUniProt up:annotation ?annotation . # annotations of this
protein entry
?annotation rdfs:comment ?text
filter( regex(str(?text), "leukemia") ) # only those containing
the text "leukemia"
}
?proteinOMA a orth:Protein.
?proteinOMA oma:xrefUniprot ?proteinUniProt.
}
Acknowledgments
References
1. Mole B (2004) The gene sequencing future is 9. W3Schools. Online SQL Tutorial. https://
here. http://www.sciencenews.org/article/ www.w3schools.com/sql/sql_intro.asp.
gene-sequencing-future-here. Accessed Accessed 15 Feb 2018
15 Feb 2018 10. Beaulieu A (2009) Learning SQL: master SQL
2. Stephens ZD, Lee SY, Faghri F et al (2015) Big fundamentals. O’Reilly Media, Inc, Sebasto-
data: astronomical or genomical? PLoS Biol 13 pol, CA
(7):e1002195. https://doi.org/10.1371/jour 11. Fehily C (2014) SQL (database program-
nal.pbio.1002195 ming). Questing Vole Press, Pacific Grove,
3. Altenhoff AM, Škunca N, Glover N et al CA. (2015 Edition)
(2014) The OMA orthology database in 12. Teodoro D, Pasche E, Wipfli R et al (2009)
2015: function predictions, better plant sup- Integration of biomedical data using federated
port, synteny view and other improvements. databases. Swiss Medical Informatics, Muttenz
Nucleic Acids Res 43(D1):D240–D249. 13. Fernández-Breis JT, Chiba H, del Carmen
https://doi.org/10.1093/nar/gku1158 L-GM et al (2016) The orthology ontology:
4. Waterhouse RM, Tegenfeldt F, Li J et al (2012) development and applications. J Biomed
OrthoDB: a hierarchical catalog of animal, fun- Semant 7(1):34. https://doi.org/10.1186/
gal and bacterial orthologs. Nucleic Acids Res s13326-016-0077-x
41(D1):D358–D365. https://doi.org/10. 14. Self-join incurs more I/O activities and
1093/nar/gks1116 increases locking overhead (2013) http://
5. Powell S, Szklarczyk D, Trachana K et al sqltouch.blogspot.ch/2013/07/self-join-
(2011) eggNOG v3. 0: orthologous groups incurs-more-io-activities-and.html. Accessed
covering 1133 organisms at 41 different taxo- 15 Feb 2018
nomic ranges. Nucleic Acids Res 40(D1): 15. Sadalage PJ, Fowler M (2012) NoSQL dis-
D284–D289 tilled: a brief guide to the emerging world of
6. Bastian F, Parmentier G, Roux J et al (2008) polyglot persistence. Pearson Education,
Bgee: integrating and comparing heteroge- Upper Saddle River, NJ
neous transcriptome data among species. In: 16. Hunger M, Boyd R, Lyon W (2016) RDBMS
International Workshop on Data Integration & Graphs: why relational databases aren’t
in the Life Sciences. Springer, Berlin, pp always enough. https://neo4j.com/blog/
124–131 rdbms-graphs-why-relational-databases-arent-
7. UniProt Consortium (2014) UniProt: a hub enough/. Accessed 15 Feb 2018
for protein information. Nucleic Acids Res 43 17. Stockinger K, Bödi R, Heitz J et al (2017)
(D1):D204–D212. https://doi.org/10. ZNS-Efficient query processing with Zurich-
1093/nar/gku989 NoSQL. Data Know Eng 112:38–54
8. Codd EF (1970) A relational model of data for 18. Erling O, Mikhailov I (2009) RDF support in
large shared data banks. Commun ACM 13 the virtuoso DBMS. In: Networked
(6):377–387. https://doi.org/10.1145/ knowledge-networked media. Springer, Berlin,
362384.362685 pp 7–24. https://doi.org/10.1007/978-3-
642-02184-8_2
688 Ana Claudia Sima et al.
19. Berners-Lee T, Hendler J, Lassila O (2001) 32. URN syntax, RFC2141 (1997) https://tools.
The semantic web. Sci Am 284(5):34–43 ietf.org/html/rfc2141. Accessed 15 Feb 2018
20. Kimball R, Ross M, Mundy J et al (2015) The 33. URI Syntax, RFC3986 (2005) https://tools.
kimball group reader: relentlessly practical ietf.org/html/rfc3986. Accessed 15 Feb 2018
tools for data warehousing and business intelli- 34. RDF 1.1 Primer (2014) https://www.w3.
gence remastered collection. John Wiley & org/TR/rdf11-primer/. Accessed 15 Feb
Sons, New York, NY 2018
21. Blunschi L, Jossen C, Kossmann D et al (2012) 35. Studer R, Benjamins VR, Fensel D (1998)
Soda: generating sql for business users. Proc Knowledge engineering: principles and meth-
VLDB Endow 5(10):932–943 ods. Data Know Eng 25(1-2):161–197
22. Bio-SODA: enabling complex, semantic 36. Baader F (ed) (2003) The description logic
queries to bioinformatics databases through handbook: theory, implementation and appli-
intuitive searching over data (2017) https:// cations. Cambridge university press,
www.zhaw.ch/no_cache/en/research/peo Cambridge
ple-publications-projects/detail-view-project/ 37. Ashburner M, Ball CA, Blake JA (2000) Gene
projekt/3066/. Accessed 15 Feb 2018 ontology: tool for the unification of biology.
23. Lane L, Argoud-Puy G, Britan A et al (2011) Nat Genet 25(1):25. https://doi.org/10.
neXtProt: a knowledge platform for human 1038/75556
proteins. Nucleic Acids Res 40(D1): 38. Gene Ontology Consortium (2014) Gene
D76–D83. https://doi.org/10.1093/nar/ ontology consortium: going forward. Nucleic
gkr1179. Sparql endpoint. Available at: Acids Res 43(D1):D1049–D1056. https://
https://sparql.nextprot.org/ doi.org/10.1093/nar/gku1179
24. Li W, Cowley A, Uludag M et al (2015) The 39. Dessimoz C, Gabaldón T, Roos DS et al
EMBL-EBI bioinformatics web and program- (2012) Toward community standards in the
matic tools framework. Nucleic Acids Res 43 quest for orthologs. Bioinformatics 28
(W1):W580–W584. https://doi.org/10. (6):900–904. https://doi.org/10.1093/bioin
1093/nar/gkv279. RDF data. Available at: formatics/bts050
https://www.ebi.ac.uk/rdf
40. Smith B, Ashburner M, Rosse C et al (2007)
25. Berman H, Henrick K, Nakamura H (2003) The OBO Foundry: coordinated evolution of
Announcing the worldwide protein data bank. ontologies to support biomedical data integra-
Nat Struct Mol Biol 10(12):980. RDF data. tion. Nat Biotechnol 25(11):1251
Available at https://pdbj.org/help/rdf
41. Federhen S (2011) The NCBI taxonomy data-
26. Wilkinson MD, Dumontier M, Aalbersberg IJ base. Nucleic Acids Res 40(D1):D136–D143.
et al (2016) The FAIR guiding principles for https://doi.org/10.1093/nar/gkr1178
scientific data management and stewardship.
Sci Data 3:160018. https://doi.org/10. 42. McBride B (2004) The resource description
1038/sdata.2016.18 framework (RDF) and its vocabulary descrip-
tion language RDFS. In: Handbook on ontol-
27. Bizer C, Heath T, Berners-Lee T (2009) ogies. Springer, Berlin, pp 51–65
Linked data-the story so far. Int J Semant
Web Inf Syst 5(3):1–22 43. OWL Web Ontology Language Overview
(2004) https://www.w3.org/TR/owl-
28. Domingue J, Fensel D, Hendler JA (eds) features/. Accessed 15 Feb 2018
(2011) Handbook of semantic web technolo-
gies. Springer Science & Business Media, 44. Motik B, Patel-Schneider PF, Parsia B, Bock C,
New York, NY Fokoue A, Haase P, Hoekstra R, Horrocks I,
Ruttenberg A, Sattler U, Smith M (2009)
29. Hitzler P, Krotzsch M, Rudolph S (2009) OWL 2 web ontology language: structural
Foundations of semantic web technologies. specification and functional-style syntax. W3C
CRC press, Boca Raton, FL Recomm 27(65):159
30. Patel-Schneider PF (2005) A revised architec- 45. OWL Web Ontology Language semantics and
ture for semantic web reasoning. In: Interna- abstract syntax (2004) W3C Recommenda-
tional Workshop on Principles and Practice of tion, 10. https://www.w3.org/TR/owl-
Semantic Web Reasoning. Springer, Berlin, pp semantics/. Accessed 15 Feb 2018
32–36
46. W3C Owl Working Group (2012) OWL 2 web
31. Bratt S (2007) Semantic Web, and Other Tech- ontology language document overview.
nologies to Watch. https://www.w3.org/ https://www.w3.org/TR/owl2-overview/.
2007/Talks/0130-sb-W3CTechSemWeb/#( Accessed 15 Feb 2018
24). Accessed 15 Feb 2018
Semantic Integration and Enrichment of Heterogeneous Biological Databases 689
47. Dean M, Schreiber G, Bechhoffer S, et al natural language querying over relational data
(2004) OWL web ontology language refer- stores. Proc VLDB Endow 9(12):1209–1220.
ence. W3C Recommendation. https://www. https://doi.org/10.14778/2994509.
w3.org/TR/owl-ref/. Accessed 15 Feb 2018 2994536
48. Prud’hommeaux E, Carothers G (2014) RDF 60. Song D, Schilder F, Smiley C et al (2015) TR
1.1 Turtle: terse RDF triple language. W3C Discover: a natural language interface for que-
recommendation. http://www.w3.org/TR/ rying and analyzing interlinked datasets. In:
2014/REC-turtle-20140225/. The latest edi- International Semantic Web Conference,
tion is available at http://www.w3.org/TR/ 2. Springer, Cham
turtle/ 61. Ell B, Vrandečić D, Simperl E (2011) Labels in
49. World Wide Web Consortium (2014) RDF 1.1 the web of data. Semant Web 2011:162–176.
XML Syntax. The latest edition is available at https://doi.org/10.1007/978-3-642-25073-
https://www.w3.org/TR/rdf-syntax- 6_11
grammar/ 62. Kinsella RJ, K€ah€ari A, Haider S et al (2011)
50. World Wide Web Consortium (2014) JSON- Ensembl BioMarts: a hub for data retrieval
LD 1.0: a JSON-based serialization for linked across taxonomic space. Database 2011:
data. The latest edition is available at https:// bar030. https://doi.org/10.1093/database/
www.w3.org/TR/json-ld/ bar030
51. SPARQL Query Language for RDF (2008) 63. Calvanese D, De Giacomo G, Lenzerini M
https://www.w3.org/TR/rdf-sparql-query/. (2001) Ontology of integration and integra-
Accessed 15 Feb 2018 tion of ontologies. Descr Logics 49(10-19):30
52. Lohmann S, Link V, Marbach E et al (2014) 64. Bornea MA, Dolby J, Kementsietsidis A et al
WebVOWL: web-based visualization of ontol- (2013) Building an efficient RDF store over a
ogies. In: International Conference on Knowl- relational database. In: Proceedings of the
edge Engineering and Knowledge 2013 ACM SIGMOD International Confer-
Management. Springer, Cham, pp 154–158 ence on Management of Data. ACM,
53. Altenhoff AM, Glover NM, Train CM et al Washington, DC, pp 121–132
(2017) The OMA orthology database in 65. R2RML: RDB to RDF mapping language
2018: retrieving evolutionary relationships (2012) https://www.w3.org/TR/r2rml/.
among all domains of life through richer web Accessed 15 Feb 2018
and programmatic interfaces. Nucleic Acids 66. Sequeda JF (2016) Integrating relational data-
Res 46(D1):D477–D485 bases with the Semantic Web. IOS Press,
54. Rakhmawati NA, Umbrich J, Karnstedt M et al Amsterdam
(2013) A comparison of federation over 67. Michel F, Montagnat J, Zucker CF (2013) A
SPARQL endpoints frameworks. In: Interna- survey of RDB to RDF translation approaches
tional Conference on Knowledge Engineering and tools. Dissertation. Inria Sophia Antipolis
and the Semantic Web. Springer, Berlin, pp 68. Kaufmann E, Bernstein A (2007) How useful
132–146 are natural language interfaces to the semantic
55. Calvanese D, Cogrel B, Komla-Ebri S et al web for casual end-users? In: The Semantic
(2017) Ontop: answering SPARQL queries Web. Springer, Berlin, pp 281–294
over relational databases. Semant Web 8 69. Leis V, Gubichev A, Mirchev A et al (2015)
(3):471–487 How good are query optimizers, really? Proc
56. Sequeda JF, Miranker DP (2013) Ultrawrap: VLDB Endow 9(3):204–215
SPARQL execution on relational data. Web 70. Wu W, Chi Y, Zhu S et al (2013) Predicting
Semant 22:19–39 query execution time: are optimizer cost mod-
57. Bizer C, Seaborne A (2004) D2RQ-treating els really unusable? In: Data Engineering
non-RDF databases as virtual RDF graphs. In: (ICDE), 2013 IEEE 29th International Con-
Proceedings of the 3rd international semantic ference on 2013 Apr 8. IEEE, New York, NY,
web conference (ISWC2004). Springer, pp 1081–1092
New York, NY 71. Montoya G, Skaf-Molli H, Hose K (2017) The
58. Gasteiger E, Gattiker A, Hoogland C et al Odyssey approach for optimizing federated
(2003) ExPASy: the proteomics server for SPARQL queries. In: International Semantic
in-depth protein knowledge and analysis. Web Conference. Springer, Cham, pp 471–489
Nucleic Acids Res 31(13):3784–3788. 72. Davidson SB, Overton C, Buneman P (1995)
https://doi.org/10.1093/nar/gkg563 Challenges in integrating biological data
59. Saha D, Floratou A, Sankaranarayanan K et al sources. J Comput Biol 2(4):557–572.
(2016) Athena: an ontology-driven system for https://doi.org/10.1089/cmb.1995.2.557
690 Ana Claudia Sima et al.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 23
Abstract
In this chapter, we focus on the computational challenges associated with statistical phylogenomics and how
use of the broad-platform evolutionary analysis general likelihood evaluator (BEAGLE), a high-
performance library for likelihood computation, can help to substantially reduce computation time in
phylogenomic and phylodynamic analyses. We discuss computational improvements brought about by the
BEAGLE library on a variety of state-of-the-art multicore hardware, and for a range of commonly used
evolutionary models. For data sets of varying dimensions, we specifically focus on comparing performance
in the Bayesian evolutionary analysis by sampling trees (BEAST) software between multicore central
processing units (CPUs) and a wide range of graphics processing cards (GPUs). We put special emphasis
on computational benchmarks from the field of phylodynamics, which combines the challenges of phylo-
genomics with those of modelling trait data associated with the observed sequence data. In conclusion, we
show that for increasingly large molecular sequence data sets, GPUs can offer tremendous computational
advancements through the use of the BEAGLE library, which is available for software packages for both
Bayesian inference and maximum-likelihood frameworks.
Key words Adaptive Markov chain Monte Carlo, Multipartite data, Generalized linear model, High-
performance computing, BEAGLE, BEAST, Pathogen phylodynamics, Data integration, Bayesian
phylogenetics, Phylogenomics
1 Introduction
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_23, © The Author(s) 2019
691
692 Guy Baele et al.
2.1 Principles The most effective methods for phylogenetic inference involve
computing the probability of observed character data for a set of
2.1.1 Computing
taxa given an evolutionary model and phylogenetic tree, which is
Observed Data Likelihoods
often referred to as the (observed data) likelihood of that tree.
Felsenstein demonstrated an algorithm to calculate this probability
[16], and his algorithm recursively computes partial likelihoods via
simple sums and products. These partial likelihoods track the prob-
ability of the observed data descended from an internal node con-
ditional on a particular state at that internal node.
The partial likelihood calculations apply to a subtree compris-
ing a parent node, two child nodes, and connecting branches. It is
repeated for each unique site pattern in the data (in the form of a
multiple sequence alignment), for each possible character of the
state space (e.g., nucleotide, amino acid, or codon), and for each
694 Guy Baele et al.
Fig. 1 Diagrammatic example of the tree sampling process and fine-grained parallel computation of
phylogenetic partial likelihoods using BEAGLE for a nucleotide model problem with five taxa, nine site
patterns, and four evolutionary rate categories. Each entry in a partial likelihood array L is assigned to a
separate GPU thread t. In this example, 144 GPU threads are created to enable parallel evaluation of each
entry of the partial likelihood array L(x0)
library API
BEAGLE BEAST
analysis parameters
implementation manager
sequence data
implementation base-code
BEAGLE
CPU model CUDA model iterative tree sampling
Fig. 2 Layer diagram depicting BEAGLE library organization, and illustration of API use. Arrows indicate
direction and relative size of data transfers between the client program and library
2.2 Design The general structure of the BEAGLE library can be conceptua-
lized as layers (Fig. 2, library), the upper most of which is the
2.2.1 Library
application programming interface. Underlying this API is an
implementation management layer, which loads the available
implementations, makes them available to the client program, and
passes API commands to the selected implementation.
The design of BEAGLE allows for new implementations to be
developed without the need to alter the core library code or how
client programs interface with the library. This architecture also
includes a plugin system, which allows implementation-specific
code (via shared libraries) to be loaded at runtime when the
required dependencies are present. Consequently, new frameworks
and hardware platforms can more easily be made available to pro-
grams that use the library, and ultimately to users performing
phylogenetic analyses.
Currently, the implementations in BEAGLE derive from two
general models. One is a serial CPU implementation model, which
does not directly use external frameworks. Under this model, there
is a standard CPU implementation, and one with added SSE intrin-
sics, which uses vector processing extensions present in many CPUs
to parallelize computation across character state values. The other
implementation model involves an explicit parallel accelerator pro-
gramming model, which uses the CUDA external computing
framework to exploit NVIDIA GPUs. It implements fine-grained
parallelism for evaluating likelihoods under arbitrary molecular
evolutionary models, and thus harnessing the large number of
processing cores to efficiently perform calculations [3, 32].
High-Performance Computing in Bayesian Phylogenetics and Phylodynamics. . . 697
2.2.2 Application The BEAGLE API was designed to increase performance via fine-
Programming Interface scale parallelization while reducing data transfer and memory copy
overhead to an external hardware accelerator device (e.g., GPU).
Client programs, such as BEAST [11], use the API to offload the
evaluation of tree likelihoods to the BEAGLE library (Fig. 2, API).
API functions can be subdivided into two categories: those which
are only executed once per inference run and those which are
repeatedly called as part of the iterative sampling process. As part
of the one-time initialization process, client programs use the API
to indicate analysis parameters such as tree size and sequence
length, as well as specifying the type of evolutionary model
and hardware resource(s) to be used. This allows BEAGLE to
allocate the appropriate number and size of data buffers on device
memory. Additionally at this initialization stage, the sequence data
is specified and transferred to device memory. This costly memory
operation is only performed once, thus minimizing its impact.
During the iterative tree sampling procedure, client programs
use the API to specify changes to the evolutionary model and
instruct a series of partial likelihood operations that traverse the
proposed tree in order to find its overall likelihood. BEAGLE
efficiently computes these operations and makes the overall tree
likelihood as well as per-site likelihoods available via another
API call.
2.3 Performance Peak performance with BEAGLE is achieved when using a high-
end GPU; however, the relative gain over using a CPU depends on
model type and problem size as more demanding analyses allow for
better utilization of GPU cores. Figure 3 shows speedups relative to
serial CPU code when using BEAGLE with an NVIDIA P100
GPU for the critical partial likelihood function, with increasing
unique site pattern counts and for two model types. Computing
these likelihoods typically accounts for over 90% of the total execu-
tion time for phylogenetic inference programs and the relationship
between speedups and problem size observed here primarily
matches what would be observed for a full analysis.
Figure 3 includes performance results for computing partial
likelihoods under both nucleotide and codon models. The vertical
axis labels show the speedup relative to the average performance of
a baseline serial, single threaded and non-vectorized, CPU imple-
mentation. This nonparallel CPU implementation provides a
consistent performance level across different problem sizes and
698 Guy Baele et al.
speedup factor
256 256
64 64
16 16
4 4
l l l l l l l l l l l l l l l l l l l l l l l l l 1 ll l l l l l l l l l l l l l l l l l l l l l l l 1
Fig. 3 Plots showing BEAGLE partial likelihood computation performance on the GPU relative to serial CPU
code, under nucleotide and codon models and for an increasing number of unique site patterns. Speedup
factors are on a log-scale
2.4 Memory Usage When assessing the suitability of a phylogenetic analysis for GPU
acceleration via BEAGLE, it is also important to consider if the
GPU has sufficient on-board memory for the analysis to be per-
formed. GPUs typically have less memory than what is available to
CPUs, and the high transfer cost of moving data from CPU to
GPU memory prevents direct use of CPU memory for GPU
acceleration.
Figure 4 shows how much memory is required for problems of
different sizes when running nucleotide and codon-model analyses
in BEAST with BEAGLE GPU acceleration. Note that when mul-
tiple GPUs are available, BEAST can partition a data set into
High-Performance Computing in Bayesian Phylogenetics and Phylodynamics. . . 699
B GB
8 12
800 6 GB GB 800
GB 8
4 6 GB
600 G 600
2
B 4 GB
GB GB
400 1 2 400
GB 1 G GB
200 B 200
RABV l carnivores
l
10000 20000 30000 40000 50000 1000 2000 3000 4000 5000
unique site patterns
Fig. 4 Contour plots depicting BEAGLE memory usage on GPUs for BEAST nucleotide and codon-model
analyses with 4 rate categories in double-precision floating-point format, for a range of problem sizes with
different numbers of taxa and of unique site patterns. Memory requirements shown here assume an
unpartitioned dataset. Partitioned analyses and more sophisticated models that use multiple BEAGLE
instances incur memory overhead per additional library instance. Black dots indicate memory usage require-
ments for the unpartitioned version of three data sets subsequently described in this chapter
separate BEAGLE instances, one for each GPU. Thus, each GPU
will only require as much memory as necessary for the data subset
assigned to it. Typical PC-gaming GPUs have 8 GB of memory or
less, while GPUs dedicated to high-performance computing, such
as the NVIDIA Tesla series, may have as much as 24 GB of memory.
2.5 Hardware Highly parallel computing technologies such as GPUs have over-
taken traditional CPUs in peak performance potential and continue
to advance at a faster pace. Additionally, the memory bandwidth
available to the processor is especially relevant to data-intensive
computations, such as the evaluation of nucleotide model likeli-
hoods. In this measure as well, high-end GPUs significantly out-
perform equivalently positioned CPUs.
BEAGLE was designed to take advantage of this trend of
increasingly advanced GPUs and uses runtime compilation meth-
ods to optimize code for whichever generation of hardware is being
used. Table 1 lists hardware specifications for the processors used in
this chapter. We note that further advancements in the GPU market
for scientific computing are on its way, with NVIDIA preparing the
launch (at the time of writing) of the Tesla V100 in Q3 of 2017.
The new NVIDIA Tesla V100 features a total of 5120 CUDA cores
and comes equipped with 32 GB of on-board memory with
900 GB/s of bandwidth. As such, it seems to have the potential
to reach 7.5 TFLOPs in double-precision peak performance
(DP PP), a roughly 50% increase over their current flagship, the
Tesla P100.
700 Guy Baele et al.
Table 1
Hardware specifications for the Intel CPUs and NVIDIA GPUs used in this chapter
3 Results
time in minutes
600
40
400
20
200
0 0
1 2 4 8 12 16 20 1 2 4 8 12 16 20 24 1 2 1 2 1 2 1 1 2
number of cores number of devices
Fig. 5 Performance evaluation of different CPU and GPU configurations when estimating a single full codon
substitution model on a full genome mitochondrial data set. Left: performance comparison on multicore CPUs
(with the dashed line indicating optimal GPU performance). Right: performance comparison on GPUs. The
numbers below the bars indicate the number of partitions/threads used in each analysis. The GTX 590, a GPU
released in 2011 for PC gaming, performs equally well as a 24-core CPU configuration. However, GPUs from
the Tesla generation, aimed at scientific computing, drastically outperform the CPU system, with the recently
released P100 showcasing the impressive improvements in the GPU market, running 166 times faster than a
single-core CPU setup
3.2 Ebola Virus The original developments within the BEAGLE library offered
considerable computational speedup when evaluating codon mod-
els—up to a 52-fold increase when employing three GPU cards—
and nucleotide models—up to a 15-fold increase when using three
GPU cards—in double precision [32]. This may have resulted in
the perception that GPU cards are mainly useful when evaluating
codon models, but that the benefit for fitting models was not
sufficiently substantial to warrant GPUs. To offer an objective
assessment of the usefulness of GPUs in such cases, we analyze
the use of various CPU and GPU configurations on a full genome
Ebola virus data set, consisting of 1610 publicly available genomes
sampled over the course of the 2013–2016 Ebola virus disease
epidemic in West Africa [12] (we discuss this study in more detail
in Subheading 4). This data set encompassing 18,992 nt columns is
modelled with four partitions: one for each codon position and one
additional partition for the intergenic region (which consists of
several noncoding regions interspersed in the genome). The three
codon partitions contain, respectively, 2366, 2328, and 2731
unique site patterns, while the intergenic partition contains 2785
unique site patterns. We model among-site rate variation [33] in
each partition independently, which confronts us with a computa-
tionally demanding analysis for this large number of taxa and
unique site patterns.
Figure 6 shows how the performance of such a large nucleotide
data set scales with the available CPU and GPU resources. Contrary
to the carnivores data set analysis in Subheading 3.1, this analysis
does not scale particularly well with the number of CPU cores
available, as the main benefit lies with splitting each partition into
two subpartitions and only limited performance gains can be
observed when using additional partitions or threads. Popular
single GPU cards for scientific computing—such as the Tesla
K40—match the optimal performance brought about by using
16 CPU cores, and may provide a useful alternative to multicore
CPU systems. However, the decreasing cost for increasingly parallel
multicore CPU systems makes this a difficult matchup for slightly
older GPUs. More recently introduced GPU cards, such as the
Tesla P100, are able to deliver a substantial performance improve-
ment over a multicore CPU setup, with two Tesla P100 GPUs
running in parallel offering over twice the performance of a
16-core CPU setup. We note that the GTX 590 cards, as well as a
single Tesla K20 card, do not contain sufficient on-board memory
to hold the full data set and as such, these benchmarks could not be
run on those resources.
3.3 Phylogeography As shown in the results in Subheadings 3.1 and 3.2, different
partitions of the aligned sequence data can contain a large number
of unique site patterns, rendering phylogenomic inference chal-
lenging. However, other data types are also included more
704 Guy Baele et al.
40 20
time in minutes
time in minutes
30 15
20 10
10 5
0 0
4 8 12 16 20 4 8 12 16 20 24 2 1 2 1 1 2
number of cores number of devices
Fig. 6 Performance evaluation of different CPU and GPU configurations when estimating a four-partition
nucleotide substitution model on a full genome Ebola virus data set. Left: performance comparison on
multicore CPUs (with the dashed line indicating optimal GPU performance). Right: performance comparison
on GPUs. The numbers below the bars indicate the number of partitions/threads used in each analysis.
Because of the lower dimensionality compared to full codon models, nucleotide substitution models currently
take less advantage of the large amounts of cores present on GPUs. The fastest GPU configuration—
consisting of two Tesla P100 GPUs—outperforms by 107% the fastest CPU configuration that employs
16 threads on a 24-core Haswell CPU
A B C
A . AB AC
B
B BA . BC
A C CA CB .
A B C D E F G H I J
A . AB AC AD AE AF AG AH AI AJ
C B BA . BC BD BE BF BG BH BI BJ
A B D C CA CB . CD CE CF CG CH CI CJ
J
D DA DB DC . DE DF DG DH DI DJ
I
E E EA EB EC ED . EF EG EH EI EJ
H
F FA FB FC FD FE . FG FH FI FJ
G
G GA GB GC GD GE GF . GH GI GJ
F H HA HB HC HD HE HF HG . HI HJ
I IA IB IC ID IE IF IG IH . IJ
J JA JB JC JD JE JF JG JH JI .
Fig. 7 Graphical depiction of the Benelux countries (top) and the provinces of Belgium (bottom). When these
countries and provinces are used as discrete location states in a discrete trait model, this yields, respectively,
a 3 3 and a 10 10 CTMC model with instantaneous rates of transition between each pair of locations.
Such models are subject to the same restrictions as those used in popular substitution models, i.e., the rows
sum to 0. As such, these CTMC models consist of, respectively, 6 and 90 free parameters to be estimated
3.3.1 Bat Rabies We here assess the performance of a phylodynamic setup aimed at
reconstructing the spatial dispersal and cross-species dynamics of
rabies virus (RABV) in North American bat populations based on a
set of 372 nucleoprotein gene sequences (nucleotide positions:
594–1353). The data set comprises a total of 17 bat species sam-
pled between 1997 and 2006 across 14 states in the USA [31]. Two
additional species that had been excluded from the original analysis
owing to a limited amount of available sequences, Myotis austror-
iparius (Ma) and Parastrellus hesperus (Ph), are also included here
[14]. We also include a viral sequence with an unknown sampling
date (accession no. TX5275, sampled in Texas from Lasiurus bor-
ealis), which will be adequately accommodated in our inference.
This leads to a total of 548 unique site patterns. Following Faria
et al. [14], we employ two GLM-diffusion models for this analysis,
one on the discrete set of 17 bat species and another on the discrete
set of 14 location states.
Figure 8 shows the performance of various multicore platforms
on the bat rabies Bayesian phylodynamic analysis. In contrast to
previous examples, the low number of sites (and hence unique site
patterns) in the alignment does not offer many options for splitting
the observed data likelihood over additional threads. While four
CPU cores offer the optimal performance across our CPU plat-
forms, using more threads for the analysis causes serious communi-
cation overhead, slowing down the analysis. Comparing the CPU
results with the GPU results shows that, across all multicore plat-
forms tested, a 4-core CPU offers the best performance.
Nonetheless, this scenario provides a very interesting use case
for employing multiple graphics cards for scientific computing.
Even though the (relatively small) dimensions of this particular
example do not allow for a performance increase, it will be benefi-
cial for higher-dimensional cases to compute each diffusion model
on a separate GPU. When assuming independent diffusion pro-
cesses that only depend on the underlying phylogeny, each of the
trait diffusion models can be computed on a different GPU,
whereas the data alignment can be split into two subpartitions of
equal complexity (i.e., with an equal number of unique site pat-
terns) and hence also be computed in parallel over the two GPUs.
However, the limited sequence data size and the relatively restricted
High-Performance Computing in Bayesian Phylogenetics and Phylodynamics. . . 707
4 20
time in minutes
time in minutes
3 15
2 10
1 5
0 0
1 2 4 8 12 16 1 2 4 8 12 16 20 1 2 1 2 1 2 1 1 2
number of cores number of devices
Fig. 8 Performance evaluation of different CPU and GPU configurations when estimating a single-partition
nucleotide substitution model on a rabies virus data set, along with two GLMs. Left: performance comparison
on multicore CPUs (with the dashed line indicating optimal GPU performance). Right: performance comparison
on GPUs. Given the short length of the alignment, communication overhead becomes quite severe when
adding a large number of CPU processor cores. Even two state-of-the-art Tesla P100 GPUs fail to outperform
four CPU cores, given the low number of sequences and the low number of states for the GLMs
number of discrete locations make this data set less suited for
illustrating performance increases using GPUs.
3.3.2 Ebola Virus We here assess the performance of a similar setup as in the previous
section, but using the data from the 2013–2016 West African
epidemic caused by the Ebola virus [12]. We hence use the nucleo-
tide data from the previous Ebola example (see Subheading 3.2) and
augment it with location states. Using a phylogeographic GLM
that integrates covariates of spatial spread, we have examined
which features influenced the spread of EBOV among administra-
tive regions at the district (Sierra Leone), prefecture (Guinea), and
country (Liberia) levels. This resulted in a GLM parameterizing
transition rates among 56 discrete location states according to
25 potential covariates (see Dudas et al. [12] for more information),
resulting in a computationally challenging analysis.
As shown in Fig. 9, we have evaluated the performance of this
challenging data set on our different multicore platforms. By com-
paring these benchmarks with those in Fig. 6, it’s clear that the
addition of a high-dimensional discrete trait model is much harder
to process for any multicore CPU configuration. Adding more
CPU cores to the analysis does not improve performance by
much, indicative of the discrete trait model being the main
708 Guy Baele et al.
250
80
time in minutes
200
time in minutes
60
150
40
100
50 20
0 0
4 8 12 16 4 8 12 16 20 2 1 2 1 1 2
number of cores number of devices
Fig. 9 Performance evaluation of different CPU and GPU configurations when estimating a four-partition
nucleotide substitution model on a full genome Ebola virus data set along with a GLM-diffusion model for the
location traits. Left: performance comparison on multicore CPUs (with the dashed line indicating optimal GPU
performance). Right: performance comparison on GPUs. While a previous comparison of solely the nucleotide
data resulted in a performance increase of over 100% for GPUs over CPUs, we observe a far more pronounced
benefit for GPU in this case with two Tesla P100 GPUs outperforming a 12-core CPU setup by 568%
4 Examples
4.1 Human Influenza A potentially powerful predictor for the behavior of influenza and
H3N2 other infectious diseases comes in the form of information on
global human movement patterns, of which the worldwide air
transportation network is by far the best studied system of global
mobility in the context of human infectious diseases [8].
Lemey et al. [25] use a discrete phylogeographic model
equipped with a GLM to show that the global dynamics of influ-
enza H3N2 are driven by air passenger flows, whereas at more local
scales spread is also determined by processes that correlate with
geographic distance. For a data set that encompasses 1441 time-
stamped hemagglutinin sequences (sampled between 2002 and
2007) and up to 26 locations to be used in a discrete phylogeo-
graphic model equipped with a GLM, BEAGLE can offer substan-
tial performance gains. A snapshot of a visual reconstruction
through geographic space is presented in Fig. 10, which includes
a summary of the support for the collection of covariates in the
inclusion probability
0.0 0.2 0.4 0.6 0.8 1.0
Passenger
flow
Destination
population
density
Destination
sample size
Average
distance
Origin
sample size
Origin
population
density
Minimum
distance
Origin
antigenic
divergence
Destination
BF > 150
population size
20 < BF < 150
Origin 3 < BF < 20
population size 1 < BF < 3
Fig. 10 Snapshot of the geographic spread of human influenza subtype H3N2, based on 1441 hemagglutinin
sequences sampled between 2002 and 2007 [25]. A discrete phylogeographic approach was used, allocating
the sequence data into a discrete number of locations and employing a generalized linear model on the
parameters that model geographic spread. Inclusion probabilities and Bayes factor support are shown for the
most prominent predictors of H3N2 geographic spread. D3 visualization is made using SpreaD3 [7], with
circular polygon areas proportional to the number of tree lineages maintaining that location at that time
710 Guy Baele et al.
4.2 Ebola Virus During the two and a half years Ebola virus (EBOV) circulated in
West Africa, it caused at least 28,646 cases and 11,323 deaths. As
mentioned in Subheading 3.3.2, Dudas et al. [12] used 1610
genome sequences collected throughout the epidemic, represent-
ing over 5% of recorded Ebola virus disease (EVD) cases to recon-
struct a detailed phylogenetic history of the movement of EBOV
within and between the three most affected countries. This study
considers a massive time-stamped data set that allows to uncover
regional patterns and drivers of the epidemic across its entire dura-
tion, whereas individual studies had previously focused on either
limited geographical areas or time periods. The authors use the
phylogeographic GLM to test which features were important in
shaping the spatial dynamics of EVD during the West African
epidemic (see Fig. 11).
The phylogeographic GLM allowed Dudas et al. [12] to deter-
mine the factors that influenced the spread of EBOV among admin-
istrative regions at the district (Sierra Leone), prefecture (Guinea),
and country (Liberia) levels. The authors find that EBOV tends to
disperse between geographically close regions, with great circle
distances having among the strongest Bayes factor support for
inclusion in the GLM among all covariates tested (along with four
other predictors). Additionally, both origin and destination popu-
lation sizes are equally strongly and positively correlated with viral
dissemination (see Fig. 11). Dudas et al. [12] conclude that the
combination of the positive effect of population sizes with the
inverse effect of geographic distance implies that the epidemic’s
spread followed a classic gravity-model dynamic, with intense dis-
persal between larger and closer populations. Finally, the authors
found a significant propensity for virus dispersal to occur within
each country, relative to internationally, suggesting that country
borders may have provided a barrier for the geographic spread
of EBOV.
High-Performance Computing in Bayesian Phylogenetics and Phylodynamics. . . 711
inclusion probability
Great-circle
distances
Within
country effect
International
border sharing
Origin
population size
Destination
population size
Origin
temperature
seasonality
Fig. 11 Snapshot of the geographic spread of the Ebola virus during the 2013–2016 West African epidemic,
based on 1610 whole genome sequences [12]. A discrete phylogeographic approach was used, allocating the
sequence data into a discrete number of locations and employing a generalized linear model on the
parameters that model geographic spread. Inclusion probabilities and Bayes factor support are shown for
the most prominent predictors of Ebola virus geographic spread. D3 visualization is made using SpreaD3 [7],
with circular polygon areas proportional to the number of tree lineages maintaining that location at that time
5 Adaptive MCMC
The various data sets described in this chapter so far have shown the
use and computational performance of a wide range of models in
phylogenetic, phylogenomic, and phylodynamic research. Whether
employing full codon models (e.g., the carnivores data set), codon
partition models (e.g., the Ebola data set), or discrete phylogeo-
graphic models, the number of parameters of a typical Bayesian
phylogenetic analysis has increased drastically over the years. This is
exacerbated by the use of partitioning strategies, resulting also in a
potentially large array of likelihoods that need to be evaluated
simultaneously, increasing run times for most phylogenetic ana-
lyses. In a similar fashion, computational resources available to
researchers have also markedly increased, both in the form of multi-
core CPU technology and increasingly powerful graphics cards
targeted towards scientific computing. The ubiquitous availability
of multiprocessor and multicore computers practically has moti-
vated the design of novel parallel algorithms to make efficient use of
these machines [22, 32].
712 Guy Baele et al.
Fig. 12 Conceptual visualization on the potential benefits of an adaptive MCMC algorithm over single-
component Metropolis–Hastings (green bars indicate that a processor is computing a specific likelihood). In
Bayesian phylogenetics, the common practice of updating a single parameter (i.e., either a, b, c, or d) at a time
leaves many CPU cores idle, underusing the computational performance of such architecture. Adaptive MCMC
allows to update a collection of continuous parameters simultaneously (i.e., a, b, c, and d ), putting many cores
(in this case: 4) to work in a parallel fashion. Quad-Core AMD Opteron processor silicon die is shown, courtesy
of Advanced Micro Devices, Inc. (AMD), obtained from Wikimedia Commons
High-Performance Computing in Bayesian Phylogenetics and Phylodynamics. . . 713
6 Conclusions
7 Notes
8 Exercises
2 : Tesla K40c
Global memory (MB): 11520
Clock speed (Ghz): 0.74
Number of cores: 2880
Flags: PRECISION_SINGLE PRECISION_DOUBLE ...
Acknowledgements
References
1. Ayres DL, Cummings MP (2017) Configuring 3. Ayres DL, Darling A, Zwickl DJ, Beerli P,
concurrent computation of phylogenetic par- Holder MT, Lewis PO, Huelsenbeck JP,
tial likelihoods: accelerating analyses using the Ronquist F, Swofford DL, Cummings MP,
BEAGLE library. In: 17th International con- Rambaut A, Suchard MA (2012) BEAGLE:
ference on algorithms and architectures for par- an application programming interface and
allel processing: ICA3PP 2017, Collocated high-performance computing library for statis-
Workshops tical phylogenetics. Syst Biol 61(1):170–173
2. Ayres DL, Cummings MP (2017) Heteroge- 4. Baele G, Lemey P (2013) Bayesian evolution-
neous hardware support in BEAGLE, a high- ary model testing in the phylogenomics era:
performance computing library for statistical matching model complexity with computa-
phylogenetics. In: 46th International confer- tional efficiency. Bioinformatics 29
ence on parallel processing workshops (16):1970–1979
(ICPPW 2017)
High-Performance Computing in Bayesian Phylogenetics and Phylodynamics. . . 721
5. Baele G, Lemey P, Rambaut A, Suchard MA 13. Eisen JA, Fraser CM (2003) Phylogenomics:
(2017) Adaptive MCMC in Bayesian phyloge- intersection of evolution and genomics. Sci-
netics: an application to analyzing partitioned ence 300(5626):1706–1707
data in BEAST. Bioinformatics 33 14. Faria NR, Suchard MA, Rambaut A, Streicker
(12):1798–1805 DG, Lemey P (2013) Simultaneously recon-
6. Baele G, Suchard MA, Rambaut A, Lemey P structing viral cross-species transmission his-
(2017) Emerging concepts of data integration tory and identifying the underlying
in pathogen phylodynamics. Syst Biol 66(1): constraints. Philos Trans R Soc B 368
e47–e65 (1614):20120196
7. Bielejec F, Baele G, Vrancken B, Suchard MA, 15. Felsenstein J (1978) The number of evolution-
Rambaut A, Lemey P (2016) Spread3: interac- ary trees. Syst Biol 27(1):27–33
tive visualization of spatiotemporal history and 16. Felsenstein J (1981) Evolutionary trees from
trait evolutionary processes. Mol Biol Evol 33 DNA sequences: a maximum likelihood
(8):2167–2169. https://doi.org/10.1093/ approach. J Mol Evol 17(6):368–376
molbev/msw082 17. Gilks WR, Richardson S, Spiegelhalter DJ
8. Brockmann D, David V, Gallardo AM (2009) (1996) Markov chain Monte Carlo in practice.
Human mobility and spatial disease dynamics. Chapman and Hall, London
In: Diffusion fundamentals III. Leipziger Uni- 18. Goldman N, Yang Z (1994) A codon-based
versit€atsverlag, Leipzig, pp 55–81 model of nucleotide substitution for protein-
9. De Maio N, Wu CH, O’Reilly KM, Wilson D coding DNA sequences. Mol Biol Evol 11
(2015) New routes to phylogeography: a (5):725–736. http://mbe.oxfordjournals.
Bayesian structured coalescent approximation. org/content/11/5/725.abstract, http://
PLoS Genet 11(8):e1005421 mbe.oxfordjournals.org/content/11/5/725.
10. Doucet A, De Freitas N, Gordon N (eds) full.pdf+html
(2001) Sequential Monte Carlo methods in 19. Guindon S, Dufayard JF, Lefort V,
practice. Springer, New York Anisimova M, Hordijk W, Gascuel O (2010)
11. Drummond AJ, Suchard MA, Xie D, Rambaut New algorithms and methods to estimate
A (2012) Bayesian phylogenetics with BEAUti maximum-likelihood phylogenies: assessing
and the BEAST 1.7. Mol Biol Evol 29 the performance of PhyML 3.0. Syst Biol 59
(8):1969–1973 (3):307–321
12. Dudas G, Carvalho LM, Bedford T, Tatem AJ, 20. Huelsenbeck JP, Ronquist F, Nielsen R, Boll-
Baele G, Faria NR, Park DJ, Ladner JT, back JP (2001) Bayesian inference of phylog-
Arias A, Asogun D, Bielejec F, Caddy SL, eny and its impact on evolutionary biology.
Cotten M, D’Ambrozio J, Dellicour S, Caro Science 294(5550):2310–2314
AD, Diclaro II JD, Durrafour S, Elmore MJ, 21. Jeffroy O, Brinkmann H, Delsuc F, Philippe H
Fakoli III LS, Faye O, Gilbert ML, Gevao SM, (2006) Phylogenomics: the beginning of
Gire S, Gladden-Young A, Gnirke A, Goba A, incongruence? Trends Genet 22(4):225–231
Grant DS, Haagmans BL, Hiscox JA, Jah U, 22. Kobert K, Flouri T, Aberer A, Stamatakis A
Kargbo B, Kugelman JR, Liu D, Lu J, Mal- (2014) The divisible load balance problem
boeuf CM, Mate S, Matthews DA, Matranga and its application to phylogenetic inference.
CB, Meredith LW, Qu J, Quick J, Pas SD, Phan In: International workshop on algorithms in
MVT, Pollakis G, Reusken CB, Sanchez- bioinformatics, pp 204–216
Lockhart M, Schaffner SF, Schieffelin JS, Seal-
fon RS, Simon-Loriere E, Smits SL, 23. Kumar S, Filipski AJ, Battistuzzi FU, Pond
Stoecker K, Thorne L, Tobin EA, Vandi MA, SLK, Tamura K (2012) Statistics and truth in
Watson SJ, West K, Whitmer S, Wiley MR, phylogenomics. Mol Biol Evol 29(2):457–472
Winnicki SM, Wohl S, Wölfel R, Yozwiak NL, 24. Lemey P, Rambaut A, Drummond AJ, Suchard
Andersen KG, Blyden SO, Bolay F, Carroll MA (2009) Bayesian phylogeography finding
MW, Dahn B, Diallo B, Formenty P, Fraser C, its roots. PLoS Comput Biol 5(9):e1000520
Gao GF, Garry RF, Goodfellow I, Günther S, 25. Lemey P, Rambaut A, Bedford T, Faria N,
Happi CT, Holmes EC, Keı̈ta S, Kellam P, Bielejec F, Baele G, Russell CA, Smith DJ,
Koopmans MPG, Kuhn JH, Loman NJ, Pybus OG, Brockmann D, Suchard MA
Magassouba N, Naidoo D, Nichol ST, (2014) Unifying viral genetics and human
Nyenswah T, Palacios G, Pybus OG, Sabeti transportation data to predict the global trans-
PC, Sall A, Ströher U, Wurie I, Suchard MA, mission dynamics of human influenza H3N2.
Lemey P, Rambaut A (2017) Virus genomes PLoS Pathog 10(2):e1003932
reveal factors that spread and sustained the 26. Muse SV, Gaut BS (1994) A likelihood
Ebola epidemic. Nature 544(7650):309–315 approach for comparing synonymous and
722 Guy Baele et al.
nonsynonymous nucleotide substitution rates, 31. Streicker DG, Turmelle AS, Vonhof MJ, Kuz-
with application to the chloroplast genome. min IV, McCracken GF, Rupprecht CE (2010)
Mol Biol Evol 11(5):715–724 Host phylogeny constrains cross-species emer-
27. Neal RM (2010) MCMC using Hamiltonian gence and establishment of rabies virus in bats.
dynamics. In: Handbook of Markov chain Science 329(5992):676–679
Monte Carlo, vol 54. CRC Press, Boca Raton, 32. Suchard MA, Rambaut A (2009) Many-core
pp 113–162 algorithms for statistical phylogenetics. Bioin-
28. Roberts GO, Rosenthal JS (2009) Examples of formatics 25:1370–1376
adaptive MCMC. J Comput Graph Stat 33. Yang Z (1996) Among-site rate variation and
18:349–367 its impact on phylogenetic analyses. Trends
29. Ronquist F, Teslenko M, van der Mark P, Ayres Ecol Evol 11(9):367–372
DL, Darling A, Höhna S, Larget B, Liu L, 34. Yang Z, Rannala B (1997) Bayesian phyloge-
Suchard MA, Huelsenbeck JP (2012) MrBayes netic inference using DNA sequences: a Mar-
3.2: efficient Bayesian phylogenetic inference kov chain Monte Carlo method. Mol Biol Evol
and model choice across a large model space. 14(7):717–724
Syst Biol 61(3):539–542 35. Zwickl DJ (2006) Genetic algorithm
30. Sanmartı́n I, van der Mark P, Ronquist F approaches for the phylogenetic analysis of
(2008) Inferring dispersal: a Bayesian approach large biological sequence datasets under the
to phylogeny-based island biogeography, with maximum likelihood criterion. PhD thesis,
special reference to the Canary Islands. J Bio- The University of Texas at Austin
geogr 35:428–449
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 24
Abstract
Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes,
proteomes, and interactomes, within and between individuals and across species. Due to large volumes, the
analysis and integration of data generated by such high-throughput technologies have become computa-
tionally intensive, and analysis can no longer happen on a typical desktop computer.
In this chapter we show how to describe and execute the same analysis using a number of workflow
systems and how these follow different approaches to tackle execution and reproducibility issues. We show
how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and
run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different
workflow engines: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snake-
make, and Nextflow. Each of which can be run in parallel.
We show how to bundle a number of tools used in evolutionary biology by using Debian, GNU Guix,
and Bioconda software distributions, along with the use of container systems, such as Docker, GNU Guix,
and Singularity. Together these distributions represent the overall majority of software packages relevant for
biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. By bundling software in lightweight
containers, they can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters.
By bundling software through these public software distributions, and by creating reproducible and
shareable pipelines using these workflow engines, not only do bioinformaticians have to spend less time
reinventing the wheel but also do we get closer to the ideal of making science reproducible. The examples in
this chapter allow a quick comparison of different solutions.
Key words Bioinformatics, Evolutionary biology, Big data, Parallelization, MPI, Cloud computing,
Cluster computing, Virtual machine, MrBayes, Debian Linux, GNU Guix, Bioconda, CWL, Common
Workflow Language, Guix Workflow Language, Snakemake, Nextflow
Availability: All included software, scripts, and Docker images are based on free and open-source software and
can be found at https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter.
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_24, © The Author(s) 2019
723
724 Francesco Strozzi et al.
1 Introduction
1.2 Parallelization in Cloud computing allows the use of “on-demand” CPUs accessible
the Cloud via the Internet and is playing an increasingly important role in
bioinformatics. Bioinformaticians and system administrators previ-
ously had to physically install and maintain large compute clusters
to scale up computations, but now cloud computing makes it
possible to rent and access CPUs, GPUs, and storage, thereby
enabling a more flexible concept of on-demand computing
[7]. The cloud scales and commoditizes cluster infrastructure and
management and, in addition, allows users to run their own
operating system, usually not true with existing cluster and GRID
infrastructure (a GRID is a heterogeneous network of computers
that act together). A so-called hypervisor sits between the host
operating system and the guest operating system, and it makes
sure they are clearly separated while virtualizing host hardware.
This means many guests can share the same machine that appears
to the users as a single machine on the network. This allows
providers to efficiently allocate resources. Containers are another
form of light virtualization that is now supported by all the main
cloud providers, such as Google, Microsoft, Rackspace OpenStack,
and Amazon (AWS). Note that only OpenStack is available as free
and open-source software.
An interesting development is that of portable batch systems
(PBS) in the cloud. PBS-like systems are ubiquitous in high-
performance computing (HPC). Both Amazon EC2 and Microsoft
Cloud offer batch computing services with powerful configuration
options to run thousands of jobs in the cloud while transparently
automating the creation and management of virtual machines and
containers for the user. As an alternative, Arvados is an open-source
product specifically aimed at bioinformatics applications that makes
the cloud behave as if it is a local cluster of computers, e.g., [8].
At an even higher level, MapReduce is a framework for
distributed processing of huge datasets, and it is well suited for
problems using large number of computers [9]. The map step takes
a dataset and splits it into parts and distributes them to worker
nodes. Worker nodes can further split and distribute data. At the
726 Francesco Strozzi et al.
1.3 A Pipeline for the To create a bioinformatics pipeline, it is possible to combine remote
Cloud cloud instances with a local setup. Prepare virtual machines or
containers using similar technologies on a local network, such as a
few office computers or servers, and then use these for calculations
in the cloud when an analysis takes too long. The cloud computing
resources may, for instance, support a service at peak usage, while
regular loads are met with local infrastructure (i.e., burst compute).
New ideas can be developed and pre-evaluated using modest
in-house setups and then scaled to match the most demanding
work.
Cloud services can be used for burst computing – enabling local
clusters to be much smaller – as small as a single computer
In the following sections, we will provide instructions to deploy
applications, and we will show how the use of workflow systems and
reproducible environments can greatly simplify running scalable
workflows on different environments, including the cloud.
(continued)
Scalable Workflows and Reproducible Data Analysis for Genomics 729
Box 1: (continued)
where depending tasks wait for each other forever and, with
race conditions, where tasks try to modify a shared resource
(e.g., a file) at the same time resulting in a loss of data or an
undetermined condition. This introduces additional com-
plexity in software development, bug hunting, and code
maintenance. Typically it takes more time to debug such
code than to write it.
1.4.1 GPU Programming Another recent development is the introduction of GPU comput-
ing or “heterogeneous computing” for offloading computations.
Most GPUs consist of an array of thousands of cores that can
execute similar instructions at the same time. Having a few
Scalable Workflows and Reproducible Data Analysis for Genomics 731
2.1.1 Create a Docker Using the bio packages already present in Debian, it is straightfor-
Image with Debian ward to build a Docker container that includes all the necessary
software to run the example workflows. Here is the code for creat-
ing the Docker image (see also [30]). We created a pre-built Docker
image which is available on Docker Hub [31].
Essentially, write a Docker script:
FROM debian:buster
RUN apt-get update && apt-get -y install perl clustalo paml
ADD pal2nal.pl /usr/local/bin/pal2nal.pl
RUN chmod +x /usr/local/bin/pal2nal.pl
2.2.1 Create a Docker GNU Guix has native support for creating Docker images. Creating
Image with GNU Guix a Docker image with GNU Guix is a one liner:
2.3.1 Create a Docker A Docker container can be created starting from the “Miniconda”
Image with Bioconda image template, which is based on Debian. The Docker instruc-
tions are comparable to those of Debian above:
FROM conda/miniconda3
RUN conda config --add channels conda-forge
RUN conda install -y perl=5.22.0
RUN conda install -y -c bioconda paml=4.9 clustalo=1.2.4
wget=1.19.1
ADD pal2nal.pl /usr/local/bin/pal2nal.pl
RUN chmod +x /usr/local/bin/pal2nal.pl
2.4 A Note on All above packaging systems use free and open-source software
Software Licenses (FOSS) released under a permissible license, i.e., a license permit-
ting the use, modification, and distribution of the source code for
any purpose. This is important because it allows software distribu-
tions to distribute all included software freely. Software that is made
available under more restrictive licenses, such as for “academic
nonprofit use only,” cannot be distributed in this way. An example
is PAML that used to have such a license. Only when it was changed
PAML got included into Debian, etc. Also, for this book chapter,
we asked the author of pal2nal to add a proper license. After adding
the GPLv2, it became part of the Debian distribution; see also
https://tracker.debian.org/pkg/pal2nal. This means that above
Docker scripts can be updated to install the pal2nal Debian
package.
When you use scientific software, always check the type of
license under which it is provided, to understand what you can or
cannot do with it. When you publish software, add a license along
with your code, so others can use it and distribute it.
Typical licenses used in bioinformatics are MIT (Expat) and
BSD, which are considered very permissive, and also GPL and the
Apache License, which are designed to grant additional protections
with regard to derivative works and patentability. Whenever possi-
ble, free software licenses such as mentioned above are encouraged
for scientific software. Check the guidelines of your employer and
funding agencies.
Scalable Workflows and Reproducible Data Analysis for Genomics 735
the git repository locally. The examples can be executed from the
repository tree:
Workflow Inputs
nucleotides proteins
multi_sequence
nucleotides clustal
protein_alignment
pal2nal tree
sequences
codeml
Workflow Outputs
Fig. 1 Workflow automatically generated from the CWL schema displays how
PAML’s Codeml receives inputs from two sources and outputs the dN/dS infor-
mation. A workflow engine figures out that it has to run clustal first, followed by
pal2nal and Codeml as a linear sequence. For each input, the job can be
executed in parallel
3.3 Guix Workflow The Guix Workflow Language (GWL) extends the functional pack-
Language age manager GNU Guix [32] with workflow management capabil-
ities. GNU Guix provides an embedded domain-specific language
(EDSL) for packages and package composition. GWL extends this
EDSL with processes and process composition.
In GWL, a process describes the computation, for example,
running the clustalo program. A workflow in the GWL describes
how processes relate to each other. For example, the Codeml
program can only run after both clustalo and pal2nal finished
successfully.
The tight coupling of GWL and GNU Guix ascertains that not
only the workflow is described rigorously but also the deployment
of the programs on which the workflow depends.
To run the GWL example, you need to install GNU Guix
(https://www.gnu.org/software/guix/manual/html_node/
Binary-Installation.html) and the GWL installed on your com-
puter. Once GNU Guix is available, installing GWL can be done
using:
cd scalability-reproducibility-chapter/GWL
guix workflow -r example-workflow
3.4 Snakemake Snakemake [42] is a workflow management system that takes inspi-
ration from GNU Make [40], a tool to coordinate the compilation
of large programs consisting of interdependent source files
(https://snakemake.readthedocs.io/en/stable/).
Snakemake provides a DSL that allows the user to specify
generator rules. A rule describes the steps that need to be per-
formed to produce one or more output files, such as running a
shell script. These output files may be used as inputs to other rules.
The workflow is described as a graph in which the nodes are files
Scalable Workflows and Reproducible Data Analysis for Genomics 739
cd Snakemake
snakemake
3.5 Nextflow Nextflow [43] is a framework and an orchestration tool that enables
scalable and reproducible scientific workflows using software con-
tainers (https://www.nextflow.io/). It is written in the Groovy
JVM programming language [44] and provides a domain-specific
language (DSL) that simplifies writing and deploying complex
workflows across different execution platforms in a portable
manner.
A Nextflow pipeline is described as a series of processes, where
each process can be written in any language that can be executed or
740 Francesco Strozzi et al.
4 Discussion
5 Questions
References
1. Ronquist F, Huelsenbeck JP (2003) MrBayes 3. Yang Z (1997) PAML: a program package for
3: Bayesian phylogenetic inference under phylogenetic analysis by maximum likelihood.
mixed models. Bioinformatics 19 Comput Appl Biosci 13(5):555–556
(12):1572–1574 4. Doctorow C (2008) Big data: welcome to the
2. Eddy SR (2008) A probabilistic model of local petacentre. Nature 455(7209):16–21
sequence alignment that simplifies statistical 5. Trelles O, Prins P, Snir M, Jansen RC (2011)
significance estimation. PLoS Comput Biol 4 Big data, but are we ready? Nat Rev Genet 12
(5):e1000069 (3):224
744 Francesco Strozzi et al.
41. Vivian J, Rao AA, Nothaft FA et al (2017) Toil 44. Koenig D, Glover A, King P, Laforge G, Skeet J
enables reproducible, open source, big bio- (2007) Groovy in action. Manning Publica-
medical data analyses. Nat Biotechnol 35 tions Co. Greenwich, CT
(4):314–316 45. (2017) Ga4gh platform for docker-based tools
42. Köster J, Rahmann S (2012) Snakemake--a and workflows sharing. https://dockstore.org
scalable bioinformatics workflow engine. Bio- 46. Amdahl GM (1967) Validity of the single pro-
informatics 28(19):2520–2522 cessor approach to achieving large scale com-
43. Di tommaso P, Chatzou M, Floden EW et al puting capabilities. In: Proceedings of the April
(2017) Nextflow enables reproducible compu- 18-20, 1967, Spring Joint Computer Confer-
tational workflows. Nat Biotechnol 35 ence. ACM, Washington, DC, pp 483–485
(4):316–319 47. (2015) Reproducible builds https://reproduc-
ible-builds.org. https://reproducible-builds.org
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 25
Abstract
Open-source software encourages computer programmers to reuse software components written by others.
In evolutionary bioinformatics, open-source software comes in a broad range of programming languages,
including C/C++, Perl, Python, Ruby, Java, and R. To avoid writing the same functionality multiple times
for different languages, it is possible to share components by bridging computer languages and Bio*
projects, such as BioPerl, Biopython, BioRuby, BioJava, and R/Bioconductor.
In this chapter, we compare the three principal approaches for sharing software between different
programming languages: by remote procedure call (RPC), by sharing a local “call stack,” and by calling
program to programs. RPC provides a language-independent protocol over a network interface; examples
are SOAP and Rserve. The local call stack provides a between-language mapping, not over the network
interface but directly in computer memory; examples are R bindings, RPy, and languages sharing the Java
virtual machine stack. This functionality provides strategies for sharing of software between Bio* projects,
which can be exploited more often.
Here, we present cross-language examples for sequence translation and measure throughput of the
different options. We compare calling into R through native R, RSOAP, Rserve, and RPy interfaces, with
the performance of native BioPerl, Biopython, BioJava, and BioRuby implementations and with call stack
bindings to BioJava and the European Molecular Biology Open Software Suite (EMBOSS).
In general, call stack approaches outperform native Bio* implementations, and these, in turn, outper-
form “RPC”-based approaches. To test and compare strategies, we provide a downloadable Docker
container with all examples, tools, and libraries included.
Key words Bioinformatics, R, Python, Ruby, Perl, Java, Web services, RPC, EMBOSS, PAML
1 Introduction
Bioinformatics has created its tower of Babel. The full set of func-
tionality for bioinformatics, including statistical and computational
methods for evolutionary biology, is implemented in a wide range
of computer languages, e.g., Java, C/C++, Perl, Python, Ruby, and
R. This comes as no surprise, as computer language design is the
Download: https://github.com/EvolutionaryGenomics/Cross-language-interfacing
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0_25, © The Author(s) 2019
747
748 Raoul J. P. Bonnal et al.
1.1 Bridging The most simple way of interfacing software is by invoking one
Functional Resources program from another. This strategy is often used in Bio* projects,
Calling from Program for example, for invoking external programs. A regular subset
to Program would be PAML [11], HMMER [12], ClustalW [13], MAFFT
[14], Muscle [15], BLAST [16], and MrBayes [17]. The Bio*
projects typically contain modules which invoke the external pro-
gram and parse the results. The advantage of this approach is that it
mimics running a program on the command line, so invocation is
straightforward. Another advantage, in a web service context, is
that if the called program crashes, it does not have to take the whole
service down. There are also some downsides, however. Loading a
new instance of a program every time incurs extra overhead. More
importantly, nonstandard input and output makes the interface
fragile, i.e., what happens when input or output differs between
two versions of a program? A further downside is that external
programs do not have fine-grained function access and have no
support for advanced error handling and exceptions. What hap-
pens, for example, when the invoked program runs out of process
memory? How to handle that gracefully? A final complication is
that such a program is an external software deployment depen-
dency, which may be hard to resolve for an end user.
1.2 Remote In contrast to calling one program from another, true cross-
Procedure Call language interfacing allows one language to access functions
and/or objects in another language, as if they are native function
calls. To achieve transparent function calls between different com-
puter languages, there are two principal approaches. The first
approach is for one language to call directly into another language’s
function or method over a network interface, the so-called remote
procedure call (RPC). The second approach is to call into another
language over a local “call stack.”
In bioinformatics, cross-language RPC comes in the form of
web services and binary network protocols. A web service applica-
tion programming interface (API) is exposed, and a function call
gets translated with its parameters into a language-independent
Sharing Programming Resources Between Bio* Projects 751
1.3 Local Call Stack The alternative to RPC is to create native local bindings from one
language to another using a shared native call stack, essentially
linking into code of a different computer language. With the call
stack, function calls do not run over the network but over a stack
implementation in shared computer memory. In a single virtual
machine, such as the JVM and Erlang Beam, compiled code can
share the same call stack, which can make cross-language calling
efficient. For example, the languages Java, Jython, JRuby, Clojure,
Groovy, and Scala can transparently call into each other when
running on the same virtual machine using native speeds.
Native call stack sharing is also supported at the lowest level by
the computer operating system through compiled shared libraries.
These shared libraries have an extension .so on Linux, .dylib on
OSX, and .dll on Windows. The shared libraries are designed so
that they contain code and data that provide services to indepen-
dent programs, which allows the sharing and changing of code and
data in a modular fashion. Shared library interfaces are well defined
at the operating system level, and languages have a way of binding
them. Specialized interface bindings to shared libraries exist for
every language, for example, R’s C modules, the Java Native Inter-
face (JNI) for the JVM, Foreign Function Interfaces (FFI) for
Python and Ruby, the Parrot native compiler interface PerlXS
for Perl.
752 Raoul J. P. Bonnal et al.
1.4 Comparing Here, we compare bridging code from one language to another
Approaches using the RPC approach and the call stack approach. As a compari-
son we also provide a program-to-program approach and show how
dependencies can be fixated. The comparison is done in the form of
short experiments (scripts) which can be executed by the reader. To
measure performance between different approaches, we use codon
translation as an example of shared functionality between Bio*
projects. Codon translation is a straightforward algorithm with
table lookups. Such sequence translation is representative of many
bioinformatics tasks that deal with genome-sized data and require
many function calls with small-sized parameters.
In this chapter we first focus on comparing R and Python
bindings. We include native Bio* implementations, i.e., Biopython,
BioRuby, BioPerl, BioJava, and EMBOSS (C) for an absolute speed
comparison. Next we try bindings on the JVM.
Sharing Programming Resources Between Bio* Projects 753
2 Results
2.1 Calling into R R is a free and open-source environment for statistical computing
and graphics [25]. R comes with a wide range of functionality,
including modules for bioinformatics, such as bundled in R/Bio-
conductor [1]. R is treated as a special citizen in this chapter
because the language is widely used and comes with statistical
algorithms for evolutionary biology, such as Ape [26] and SeqinR
[27], both available through the comprehensive R archive network
(CRAN).
R defines a clear interface between the high-level language R
and low-level highly optimized C and FORTRAN libraries, some of
which have been around for a long time, such as the libraries for
linear regression and linear algebra. In addition, the R environment
successfully handles cross-platform packaging of C, C++, FOR-
TRAN, and R code. The combination of features has resulted in
R becoming the open-source language of choice in a number of
communities, including statistics and some disciplines in biology.
R/Bioconductor has gene expression analysis [1] and R/qtl [28]
and R/qtlbim [29] for QTL mapping (see also QTL mapping in
Chapter 21). Not all is lost, however, for those not comfortable
with the R language itself. R can act as an intermediate between
functionality and high-level languages. A number of libraries have
been created that interface to R from other languages, either
providing a form of RPC, through RSOAP or Rserve, or a call
stack interface calling into the R-shared library and executing R
commands, for example, RPy for Python, RSPerl for Perl, RSRuby
for Ruby, and JRI for Java. Of the last call stack approaches, RPy
currently has the most complete implementation; see also [2].
In this chapter, we compare different approaches for invoking
full R functionality from another language. To test cross-language
calling, we elected to demonstrate codon translation. Codon-to-
protein amino acid translation is representative for a relatively sim-
ple computation that potentially happens thousands of times with
genome-sized data. Every Bio* project includes such a translation
function, so it is a fair way to test for language interoperability and
performance. For data, we use a WormBase [30] C. elegans cDNA
FASTA file (33 Mb), containing 24,652 nucleotide sequences,
predicted to translate to protein (Fig. 1).
754 Raoul J. P. Bonnal et al.
Fig. 1 Throughput of mRNA to protein translation using combinations of cross-language calling with a range of
programming resources. WormBase C. elegans predicted protein coding DNA that was parsed in FASTA format
and translated into amino acids. Tests were executed inside a container. Different file sizes were used
containing 500, 1000, 5000, 15,000, and 25,000 sequences (X-axis) and the number of sequences processed
per seconds (Y-axis log10 scale). Measurements were taken on an AMD Opteron(TM) 6128 8 cores at 2.0 GHz,
4 sockets 8 cores, with 512 GB RAM DDR3 ECC, and an HDD SATA of 2 TB. Broadly the figure shows that
sustained throughput is reached quickly and flattens out. R-Biostrings performs poorly at 285 Seq/s, while
R-GeneR and Rserve (Python+Rserve+GeneR) perform at the level of native Bio* libraries, respectively,
658 Seq/s and 660 Seq/s. The cross-language Ruby-FFI at 6256 Seq/s calls EMBOSS C translation and
outperforms all others
2.1.1 Using GeneR with “The R/Bioconductor GeneR package [31] supports fast codon
Plain R translation with the strTranslate function implemented in C.”
GeneR supports the eukaryotic code and other major encoding
standards. R usage is:
library(GeneR)
strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattg-
gaaattttgt")
[1] "MSMVRNVSNQSEKLEIL"
fasta = ’dna.fa’
library(GeneR)
idx = indexFasta(fasta)
Sharing Programming Resources Between Bio* Projects 755
2.1.2 Calling into R from One strategy for bridging between languages is to use R as a
Other Languages with RPC network server and invoke remote procedure calls (RPC) over the
network.
1. SOAP
SOAP allows processes to communicate using XML over
HTTP in a client/server setup. SOAP is an operating system
756 Raoul J. P. Bonnal et al.
import pyRserve
conn = pyRserve.connect()
conn.eval(’library(GeneR)’)
conn.eval(’strTranslate("atgtcaatggtaagaaatgtatcaaatcagagc-
gaaaaattggaaattttgt")’)
’MSMVRNVSNQSEKLEIL’
2.1.3 Calling into R from Another strategy for bridging language is to use a native call stack,
Other Languages with the i.e., data does not get transferred over the network. RPy2 executes
Call Stack Approach R code from within Python over a local call stack [2]. Invoking the
same GeneR functions from Python:
2.3 Using the JVM The Java virtual machine (JVM) is a “bytecode” standard that
for Cross-Language represents a form of computer intermediate language. This lan-
Support guage conceptually represents the instruction set of a stack-
oriented capability architecture. This intermediate language, or
“bytecode,” is not tied to Java specifically, and in the last 10 years,
a number of languages have appeared which target the JVM,
including JRuby (Ruby on the JVM), Jython (Python on the
JVM), Groovy [33], Clojure [34], and Scala [35]. These languages
also compile into bytecode and share the same JVM stack. The
shared JVM stack allows transparent function calling between dif-
ferent languages.
758 Raoul J. P. Bonnal et al.
import org.biojava.nbio.core.sequence.transcription.Tran-
scriptionEngine
import org.biojava.nbio.core.sequence._
2.4 Shared C Library EMBOSS is a free and OSS analysis package specially developed for
Cross-Calling Using the needs of the molecular biology user community, mostly written
EMBOSS Codon in C [9].
Translation
print(seq)
MSMVRNVSNQSEKLEILX
require ’ffi’
module Emboss
extend FFI::Library
ffi_lib "./emboss.so"
attach_function :ajTrnNewI, [:int], :pointer
attach_function :ajSeqNewNameC, [:pointer, :pointer], :
pointer
attach_function :ajTrnSeqOrig, [:pointer, :pointer, :int], :
pointer
attach_function :ajSeqGetSeqCopyC, [:pointer], :string
end
trnTable = Emboss.ajTrnNewI(1)
ajpseq = Emboss.ajSeqNewNameC("atgtcaatggtaagaaatgtatcaaatca-
gagcgaaaaattggaaattttgt", "Test sequence")
ajpseqt = Emboss.ajTrnSeqOrig(trnTable,ajpseq,1)
aa = Emboss.ajSeqGetSeqCopyC(ajpseqt)
print aa,"\n"
MSMVRNVSNQSEKLEILX
2.5 Calling Program Calling program to program is far more common than you may
to Program think because even when you run a program in a shell, such as Bash,
you are calling program to program. You can invoke EMBOSS
from the command line:
760 Raoul J. P. Bonnal et al.
Fig. 2 Number of seconds needed for processing mRNA to protein translation using cross-language calling
with a range of programming resources. See Fig. 1 for the setup. The figure shows that for all the
implementations, the time increases linearly with the number of sequences in input. R-Biostrings performs
poorly with an upstart of 6.50 s and the highest slope. The cross-language Ruby-FFI, Python FFI, and Python-
EMBOSS with an upstart slightly higher than Java have a very minimal slope; Ruby-FFI has a nearly
constant time
(SOAP) offer functionality over http(s) and can be used from any
programming language. Here a Ruby example of using TogoWS:
Web services can harness a lot of power because they use large
databases and access up-to-date information. As an example, let’s
generate RDF from above entry:
<http://rdf.ncbi.nlm.nih.gov/pubmed/16381885> medline:pmid
"16381885" ;
762 Raoul J. P. Bonnal et al.
rdfs:label "pmid:16381885" ;
dc:identifier "16381885" ;
medline:own "NLM" ;
3 Discussion
4 Questions
1. Install the Docker container and run different tests. Can you
replicate the differences of throughput statistics?
2. Why are network protocols such as Rserve slower than native
call stack approaches?
3. What are possible advantages of using a virtual machine, such as
the JVM?
4. If you were to bridge between your favorite language and an R
library, what options do you have?
Acknowledgments
References
1. Gentleman RC, Carey VJ, Bates DM et al 15. Edgar RC (2004) MUSCLE: a multiple
(2004) Bioconductor: open software develop- sequence alignment method with reduced
ment for computational biology and bioinfor- time and space complexity. BMC Bioinformat-
matics. Genome Biol 5(10):R80 ics 5:113
2. Gautier L (2010) An intuitive Python interface 16. Altschul SF, Madden TL, Schaffer AA et al
for Bioconductor libraries demonstrates the (1997) Gapped BLAST and PSI-BLAST: a
utility of language translators. BMC Bioinfor- new generation of protein database search pro-
matics 11(Suppl 12):S11 grams. Nucleic Acids Res 25(17):3389–3402
3. Urbanek S (2003) Rserve a fast way to provide 17. Ronquist F, Huelsenbeck JP (2003) MrBayes
R functionality to applications. In Proceedings 3: Bayesian phylogenetic inference under
of the 3rd International Workshop on mixed models. Bioinformatics 19
Distributed Statistical Computing (DSC (12):1572–1574
2003), Vienna, Austria 18. Box D, Ehnebuske D, Kakivaya G et al (2000)
4. Urbanek S (2009) How to talk to strangers: Simple object access protocol (SOAP) 1.1
ways to leverage connectivity between R, Java 19. St Laurent S, Johnston J, Dumbill E (2001)
and objective C. Comput Stat 24:303–311 Programming Web services with XML-RPC.
5. Stajich JE, Block D, Boulez K et al (2002) The pub-ORA, 213p
Bioperl toolkit: Perl modules for the life 20. Richardson L, Ruby S (2007) Restful web ser-
sciences. Genome Res 12(10):1611–1618 vices. pub-ORA, xxiv + 419p
6. Cock PJ, Antao T, Chang JT et al (2009) Bio- 21. Muller J, Lorenz M, Geller F, Zeier A, Plattner
python: freely available Python tools for H (2010) Assessment of communication pro-
computational molecular biology and bioinfor- tocols in the EPC network-replacing textual
matics. Bioinformatics 25(11):1422–1423 SOAP and XML with binary google protocol
7. Goto N, Prins P, Nakao M et al (2010) Bior- buffers encoding. In: Industrial Engineering
uby: bioinformatics software for the Ruby pro- and Engineering Management (IE\&EM),
gramming language. Bioinformatics 26 2010 IEEE 17Th International Conference
(20):2617–2619 on. IEEE, New York, NY, pp 404–409
8. Holland RC, Down TA, Pocock M et al (2008) 22. Agarwal A, Slee M, Kwiatkowski M (2007)
BioJava: an open-source framework for bioin- Thrift: scalable cross-language services
formatics. Bioinformatics 24(18):2096–2097 implementation
9. Rice P, Longden I, Bleasby A (2000) 23. Hintjens P (2013) Zeromq: messaging for
EMBOSS: the european molecular biology many applications. O’Reilly Media, Sebastopol,
open software suite. Trends Genet 16 CA, 516p
(6):276–277 24. Beazley D (1996) SWIG: an easy to use tool for
10. Dutheil J, Gaillard S, Bazin E et al (2006) Bio+ integrating scripting languages with C and C++.
+: a set of C++ libraries for sequence analysis, Proceedings of the 4th conference on USENIX
phylogenetics, molecular evolution and popu- Tcl/Tk Workshop, 1996-Volume 4, USENIX
lation genetics. BMC Bioinformatics 7:188 Association, 15p
11. Yang Z (1997) PAML: a program package for 25. Development Core Team R (2010) R: a lan-
phylogenetic analysis by maximum likelihood. guage and environment for statistical
Comput Appl Biosci 13(5):555–556 computing
12. Eddy SR (2008) A probabilistic model of local 26. Paradis E, Claude J, Strimmer K (2004) APE:
sequence alignment that simplifies statistical analyses of phylogenetics and evolution in R
significance estimation. PLoS Comput Biol 4 language. Bioinformatics 20(2):289–290
(5):e1000069 27. Charif D, Thioulouse J, Lobry JR, Perriere G
13. Larkin MA, Blackshields G, Brown NP et al (2005) Online synonymous codon usage ana-
(2007) Clustal W and clustal X version 2.0. lyses with the ade4 and seqinR packages. Bio-
Bioinformatics 23(21):2947–2948 informatics 21(4):545–547
14. Katoh K, Kuma K, Toh H, Miyata T (2005) 28. Arends D, Prins P, Jansen RC, Broman KW
MAFFT version 5: improvement in accuracy of (2010) R/qtl: high-throughput multiple
multiple sequence alignment. Nucleic Acids QTL mapping. Bioinformatics 26
Res 33(2):511–518 (23):2990–2992
766 Raoul J. P. Bonnal et al.
29. Yandell BS, Mehta T, Banerjee S et al (2007) 35. Odersky M, Altherr P, Cremet V et al (2004)
R/qtlbim: QTL with Bayesian interval An overview of the Scala programming lan-
mapping in experimental crosses. Bioinformat- guage. LAMP-EPFL. (IC/2004/64)
ics 23(5):641–643 36. Sloan Z, Arends D, Broman KW et al (2016)
30. Harris TW, Antoshechkin I, Bieri T et al (2010) Genenetwork: framework for web-based
WormBase: a comprehensive resource for nem- genetics. J Open Source Soft 1(25):2
atode research. Nucleic Acids Res 38(Database 37. Berman HM, Battistuz T, Bhat TN et al (2002)
issue):D463–D467 The protein data bank. Acta Crystallogr D Biol
31. Cottret L, Lucas A, Marrakchi E et al GeneR: R Crystallogr 58(Pt 6, 1):899–907
for genes and sequences analysis 38. Finn RD, Mistry J, Tate J et al (2010) The
32. Warnes G (2004) RSOAP provides a SOAP Pfam protein families database. Nucleic Acids
interface for the open-source statistical package Res 38(Database Issue):D211–D222
R 39. Kanehisa M, Goto S (2000) KEGG: kyoto
33. Koenig D, Glover A, King P, Laforge G, Skeet J encyclopedia of genes and genomes. Nucleic
(2007) Groovy in action. Manning Publica- Acids Res 28(1):27–30
tions Co, Greenwich, CT 40. Bairoch A, Apweiler R, Wu CH et al (2005) The
34. Halloway S (2009) Programming Clojure. universal protein resource (UniProt). Nucleic
Pragmatic Bookshelf, Raleigh, NC Acids Res 33(Database issue):D154–D159
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
INDEX
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910,
https://doi.org/10.1007/978-1-4939-9074-0, © The Author(s) 2019
767
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
768 Index
Biased gene conversion (BGC) .........197, 346–347, 385, nucleus .....................................................4, 6, 7, 9, 10,
387–388, 464 12, 15, 17, 21, 22
Bi-directional best hit (BBH) ......................150, 152–155 size ............................................................................... 9
Bigrams ................................................................. 487–489 Cellulose ........................................................................ 641
Bindings.............................................................10, 12, 14, Central limit theorem ..................................................... 38
18–20, 127, 138, 179, 185, 194, 381, 389, Centroided data ............................................................ 261
449, 483, 486, 507–509, 521, 640, 642, 749, Centromere ..................................................................... 12
751, 752, 762 Chaperone ..................................................................... 185
Binomial distribution, see Distribution, binomial Chapman-Kolmogorov equations............................52, 80
Bio++ ............................................................................. 749 Chimeric ..................................................... 271, 272, 280,
Bioconductor.......................................627, 749, 753–755 281, 292, 509
BioJava .................................................749, 752, 757, 758 ChIP-seq........................................................................ 545
Biological membrane .................................................... 4, 5 Chloroplast ............................................................. 15, 338
Biological variation .............................................. 647, 648 Chromatid ............................................................ 186, 509
BioMart ......................................................................... 392 Chromatin .......................................................12–14, 512,
BioMoby............................................................... 682, 683 514, 519, 520, 637
Bionode ......................................................................... 755 Chromoplast.................................................................... 15
BioPerl ......................................................... 749, 752, 757 Chromosome
Bio programming...................58, 59, 274, 298, 747–764 condensation ............................................................. 10
Bio projects........................................................... 747–764 rearrangement ......................................................... 644
BioPython.......................... 749, 752, 756–757, 759, 763 Ciliates ..............................................................22, 23, 285
BioRuby............................. 643, 749, 752, 757, 759, 761 Ciona intestinalis........................................................... 184
Bipartite graph ........................... 291, 294–298, 300, 301 Circular DNA molecules ..........................................15, 16
Birth-death .................................................................... 475 Cis ......................................................................... 319, 327
model/process ..................... 162, 313, 317, 327, 475 Cis-acting, regulatory elements.................................... 640
BLAST ........................................................ 127–129, 131, Clinical trial .......................................................... 605–628
132, 141, 152, 154, 188, 189, 192, 194, 196, Clique ....................................................66, 156, 171, 282
198, 273, 275–278, 280, 282, 283, 295, 297, Clonality ..................................................... 333, 335, 337,
299, 592, 593, 597, 599, 600, 609, 625, 643, 340, 341, 354, 357, 358
727, 740, 750, 762, 763 Cloud computing....................... 106–108, 725, 726, 731
Bonferroni correction .........................442, 444, 445, 646 ClustalW ........................................................................ 750
Boot-split distance (BSD).......................... 243, 245–255, Cluster computing ........................................................ 724
257, 260, 262, 266, 734 Clustering ................................................... 152–157, 162,
Bootstrap/Bootstrapping ....................................... 44, 45, 170, 188, 190, 196, 261–263, 278–280, 285,
74–75, 106, 107, 141, 159, 161, 228–231, 292, 297, 354, 496, 497, 619–621, 625
233, 243, 245–255, 261, 410–412, 421, 734 Clusters of orthologous gene(s), see Clusters of
Branch-site (codon) models ......................................... 379 orthologous groups
Broad-platform evolutionary analysis general likelihood Clusters of orthologous groups (COGs)....................153,
evaluator (BEAGLE).......................... 691–720 154, 156, 171, 244, 245, 260, 262, 273, 278,
BSD, see Boot-split distance (BSD) 292, 614
CNV, see Copy number variation
C Coalescence .................................96, 213, 215, 217, 221,
Caenorhabditis elegans (C. elegans) 222, 226, 232, 335, 354, 356, 558, 559, 561,
Call stack................... 750–753, 756, 757, 759, 762, 764 562, 564, 578–581, 583
Coalescent model ....................... 211–233, 559, 561, 713
Cancer.............................................................16, 390, 761
Caulobacter ...............................................................6, 620 Coarse grained............................................................... 729
Causality .......................................................550, 647–649 Coding gene .................................................................... 10
Codon
C/C++ (programming language) ...................... 747, 748,
752, 753 model .......................................................83, 103, 106,
cDNA...................................................637, 639, 643, 753 377–387, 391, 428, 434, 439, 447, 451, 464,
694, 697–699, 701–704, 711
Cell
cycle ............................................................................. 6 translation .................... 749, 752, 753, 756, 758–759
division......................................................6, 9, 12, 292 usage bias ................................................................. 391
membrane....................................................... 4, 5, 380 Co-estimation of alignment and phylogeny ................ 431
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
Index 769
Co-evolution ..................................................24, 345, 380 D
Colinearity ...........................................126, 130, 133, 134
Common disease DAG, see Directed acyclic graph
common variant (CDCV).............534, 538, 539, 548 Darwin, C. .................................................. 100, 169, 211,
rare variant (CDRV) .....................534, 538, 539, 549 243, 399–423
Communicating sequential processes (CSP) ...... 730, 740 Darwin core..................................................169, 399–423
Communities .....................................................7, 23, 271, Data integration .................................658–661, 674, 675,
272, 278, 279, 285, 294, 300, 301, 607, 741, 678, 681–684, 724
749, 750, 753 Dating ............................................................97–102, 226,
Comparative 309–329, 355–356
genomics........................................................ 143, 169, de Bruijn graph ............................................................. 134
177–200, 243, 373 Decode..............................................................59, 60, 580
metagenomics........................................ 294, 616, 618 Decoding problem .......................................................... 58
Complementary base pair ................................................. 9 Deep sequencing technology (RNA-Seq) ......... 192, 196,
Complex diseases .......................................................... 534 645, 649
Complex trait ......................................548, 635, 636, 640 Defense mechanism ...................................................... 641
Composite-likelihood (CL) .......................................... 579 Deleterious mutation .......................................... 179, 336,
Comprehensive R archive network (CRAN)............... 753 340, 343, 347, 352, 357, 358, 537
Computational complexity ...................66, 137, 441, 694 Deletion ........................................................ 60, 121, 167,
Concatenation ............................................ 212, 214–216, 212, 353, 472, 511, 515, 518, 521, 648
218, 223, 228, 232, 260, 261 of domains ...................................................... 490, 493
Conditional De novo ...................................................... 187–189, 196,
independence.................................40, 41, 62, 63, 216 281, 374, 473, 475, 492, 507, 519, 555, 626
probability .....................................40, 63, 66, 89, 570 Dependency................................................ 21, 40, 57, 67,
random field ............................................................ 580 87, 106, 161, 389, 732, 738, 742, 763, 764
Conservation .............................................. 133, 136, 138, Differential
165–166, 259, 288, 289, 385, 392, 458, 487, gene expression ....................................................... 645
506, 511, 598 gene loss ......................................................... 154, 155
Conserved Diploid...................................................... 34, 42, 53, 105,
synteny ..................................................................... 166 226, 332, 333, 335, 343, 351, 352, 355,
Consistency-objective function ........................... 135, 167 514–516, 559, 578, 583
Constraint..........................................................21, 47, 83, Directed acyclic graph (DAG)..................................62–66
100–102, 197, 217, 275, 293, 317, 325, 355, Directional selection, see Selection, directional
376, 377, 382, 389, 416, 448, 470, 481, 496, Direct repeats (DRs) ................................... 180, 181, 184
500, 568, 724 Disease ............................................. 8, 11, 23, 24, 41, 54,
Convergent evolution .......................................... 220, 222 142, 185, 294, 390, 509, 533–550, 606–610,
Copy number variant/variation (CNV) ............. 509, 546 612–613, 620, 632, 635, 636, 640–642, 649,
Correlated................................................. 4, 99–101, 165, 656, 671, 680, 681, 703, 709, 735
315, 317, 353, 389, 411, 438, 534, 544, 545, Disease associated.......................................................... 127
601, 621, 640, 647, 648, 710 Distribution
Correlation structure ........................................... 388, 646 Beta ................................................................... 46, 381
CpG island .................................................................58, 59 binomial ............................................... 36, 37, 39, 448
CpG nucleotides ............................................................. 59 dirichlet............................................................. 46, 318
CRISPR-Cas .................................................................... 19 exponential ................................................................ 56
Cross-hybridization....................................................... 645 gamma ...................................... 68, 92, 312, 377, 558
Cross-language adapters ............................................... 752 Gaussian...............................................................40, 64
Crossover ..................................................... 387, 471, 611 multinomial ..............................................39, 404, 406
Cross-platform experiments ................................ 733, 753 Poisson..................................................36–38, 48, 568
Cross validation (CV) ......................................85, 92, 610 Divergence
C-terminus/terminal ..........................179, 471, 482, 486 of sequences........................................... 355, 449, 496
Cytoplasmic inheritance................................................ 345 of species................................................ 221, 309, 585
Cytoscape............................................................. 274, 281, DNA
284, 296, 299 double strand breaks ............................................... 346
methylation..........................................................17, 19
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
770 Index
DNA (cont.) Environmental.......................................... 18, 20, 22, 102,
polymerase ........................................................ 11, 187 109, 219, 284, 285, 287, 290, 296, 298, 357,
repair ...................................................... 343, 356, 387 358, 377, 473, 512–515, 550, 591–603, 608,
replication ....................................................... 9, 11, 12 622, 639, 645, 648
sequencing .............................. 36, 121, 374, 591, 606 factor ........................................................................ 648
transposons ......... 180, 182, 186, 187, 195, 351, 471 Epigenetics ................................. 17, 19–20, 23, 507, 550
dN/dS ............................................... 334, 337, 338, 340, Epigenomics ......................................................... 513, 550
342, 346, 356, 357, 360, 378, 390, 403, 405, Episodic selection, see Selection, episodic
428, 430., 434, 435, 439, 440, 443, 444, 451, Epistasis ........................................................................... 24
455, 456, 466, 643, 701, 724, 737 Erlang .......................................................... 729, 730, 751
See also Nonsynonymous to synonymous rates ratio Error handling and exceptions ..................................... 750
DOI, see Digital Object Identifier Euchromatin......................................................... 510, 516
Domain architecture ............................................ 469–500 Eukaryotes ............................................ 6–16, 18–20, 220,
Dosage compensation..................................................... 23 221, 271, 280, 281, 283, 290, 292, 351,
Drosophila melanogaster (D. melanogaster) 476–479, 484, 485, 487–489, 496, 500
Drug resistance.................... 15, 179, 293, 389, 595, 607 Eukaryotic cell .......................................4, 5, 7, 12, 14, 15
Duplication European Molecular Biology Open Software Suite
asymmetric...................................................... 123, 124 (EMBOSS)......................................... 749, 752,
gene....................................................... 149, 151, 158, 754, 758–760, 762
161, 162, 167, 168, 220, 222, 225, 484, 486, Euryarchaeota................................................................ 265
490, 491, 493, 495, 521, 639, 641 Eutherians............................................................. 185, 379
segmental ........................................................ 123, 644 Evolutionary
symmetric ................................................................ 123 biology .................................... 53, 197, 637, 747, 753
tandem ..................................................................... 123 distance ..........................................127, 155, 165, 293
whole-genome (WGD)........................ 123, 166, 170, homology............................................... 122, 133, 506
490, 491 model
Dynamic programming (DP) ..........................58, 59, 491 amino acid ................................................ 131, 701
codon ................................................................. 430
E F81..................................................................... 430
Ecology .......................................272, 283, 335, 359, 618 General time reversible (GTR).................429–430
Ecosystems............................................................ 515, 736 HKY/HKY85.................................................... 430
K80 .................................................................... 430
Ectopic recombination ............................... 507, 509, 521
Effective population size..................................... 104, 213, prior ......................................................................... 644
224, 331, 333, 335–337, 352, 392, 416, 520, signature .................................................................. 127
Exonization .........................................472, 478, 492, 509
522, 536, 560, 563, 580, 582, 585
Effector ................................................................. 641, 642 Exon shuffling .....................................471, 479, 490, 495
EggNOG .................................................... 154, 157, 165, Expectation-maximization (EM) .......48–51, 61, 65, 557
Experimental
244, 594, 595, 614, 656
EM algorithm, see Algorithm, expectation maximization design.............................................................. 624, 641
EMBOSS, see European Molecular Biology Open population ..................................... 636, 644–646, 648
Expression
Software Suite
Emission probability ........................................57, 60, 580 pattern............................................................. 513, 636
Empirical QTL (eQTL) ......................................... 637, 639, 645
Bayes ..................................................... 381, 392, 409, trait.................................................................. 637, 646
412, 449, 454
F
codon model(s) ....................................................... 378
Encyclopedia of Life (EOL) ............................................. 5 False discovery rate (FDR) .................380, 458, 459, 646
3’ end ................................................................ 8, 184–186 False positive (error) ..................................................... 648
5’end ............................................................................8, 11 Fine grained................................ 495, 694–696, 729, 750
Endosymbiosis.............................................................7, 14 Fixation ............................................ 7, 22, 103, 104, 334,
Enhancer.................................................. 17, 18, 197, 521 337, 341–343, 346, 356, 377, 385, 387, 400,
Ensembl ...................................................... 135, 159, 160, 404, 415, 418–420, 475, 520
169, 374, 392, 676, 677 probability .................................53, 54, 103–105, 341
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
Index 771
Fixed effect models ....................................................... 382 prediction............................................... 138, 392, 494
Focused promoter ........................................ 18, 179, 185, regulation.......................................... 18, 23, 178, 187,
186, 507, 508 389, 639, 647
Forest of Life (FOL) ............................................ 241–266 remodeling .............................................................. 281
Forward algorithm ...........................................59, 60, 580 tree ......................................... 95, 152, 153, 157–162,
Fossil ............................97, 101, 212, 226, 310, 325–328 166, 167, 169, 212–222, 225–233, 491–494,
calibration .............. 99, 101, 310–314, 317, 323–328 564, 566, 670
Frameshift ...................................................................... 389 Genealogy ................................................... 212, 216, 556,
Functional 557, 561, 562, 568, 579, 581–583, 714, 715
analysis ............................................................ 124, 603 GeneR ...........................................................754–757, 763
relationship .............................................................. 647 Generalized linear model (GLM) ...................... 646, 700,
Fusion ................................................ 272, 281, 283, 298, 705–711, 714
343, 472, 486, 491, 492, 494 General transcription factor................................ 486, 507,
519, 521, 641
G Genetic(s)
Gag.........................................................10, 180, 182, 194 algorithm (GA) ............................................. 138, 384,
Galaxy .......................................................... 627, 727, 736 460, 462, 463
code...............................................377, 378, 403, 429,
Gametes ...................................................... 332, 333, 335,
341, 343, 352 433, 438, 443, 451, 455
Gammaproteobacteria ......................................... 293, 622 draft.......................................................................... 514
Gap drift ..........................................................54, 342, 352,
353, 514, 535, 556
filling ............................................................... 186, 190
GARLI ........................................................................... 693 variation .............................................. 22, 23, 35, 513,
GC-content ....................................... 293, 334, 346, 355, 534, 556, 639, 645, 648
Genic selection ..................................................... 103–106
385, 387, 512
Gene Genome
accelerated ...................................................... 376–377 content........................................... 294, 352, 505–523
evolution............................................... 127, 187, 283,
cluster.................................................... 244, 280, 284,
285, 287, 377, 644 331–360, 375–376, 475, 505–523
conserved .............................................. 126, 165, 262, function .......................................................... 509, 522
282, 287, 392, 606 networks .................................................................. 290
segmentation ........................................................... 149
conversion............................................. 197, 343, 346,
357, 375, 385, 387–388, 464, 471 sequencing ........................................ 6, 109, 121–125,
duplication............................................ 149, 151, 158, 127, 138, 141, 142, 152, 187, 293, 301, 374,
506, 510, 522, 546, 549, 580, 644, 710, 711
161, 162, 167, 168, 220, 222, 225, 484, 486,
490, 492, 493, 495, 521, 639, 641 size .............................................................10, 12, 189,
evolution............................... 161, 170, 178, 639, 649 351–353, 475, 511, 512, 514
structure.............................................. 8–16, 187, 353,
expression ..............................................13, 17–20, 22,
38, 68, 464, 474, 506–508, 513, 521, 550, 359, 509, 522
636, 637, 639, 640, 643, 645, 648–649, Genome-wide association studies (GWAS) ....... 538–545,
547–550, 649
656–658, 662, 669, 672, 677, 678, 749, 753
family .......................................................12, 150, 157, Genomic
166, 167, 277, 278, 282, 283, 285, 286, 288, rearrangements ........................................................ 506
291, 294–297, 345, 475, 644 signature .................................................................. 355
Genotype .................................................. 4, 8, 17, 34, 35,
fission ..................................................... 472, 491, 492
flow ....................................................... 102, 217, 219, 42–44, 46, 105, 332, 335, 337, 344, 533,
220, 223, 224, 227, 232, 272, 561–568, 579 539–541, 544–547, 550, 607, 636, 638, 646,
648, 649
fusion ........................... 281–283, 472, 491, 492, 494
loss ................................. 15, 152, 154, 155, 162, 167 Germline mutation........................................................ 343
network................................................. 275, 291, 292, Germline nucleus .................................................... 4, 6, 7,
9, 10, 12, 13, 15, 17, 21, 22, 283, 359, 509,
299, 637, 640, 647
Omnibus .................................................................. 409 519, 639
ontology (GO) ..................... 484, 488, 640, 664, 681 Ghost QTL detection (between two QTL in coupling
order ............................................................... 126, 659 phase) ........................................................... 646
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
772 Index
Gibbs sampling................................................................ 66 I
GO, see Gene, ontology (GO)
Grammar............................................................... 492, 667 Illegitimate recombination ........................................... 472
Graphics processing card (GPU)........................ 107, 391, Illumina........................................................ 539, 545, 611
693–704, 706–708, 713–720, 725, 730, 731 Incomplete lineage sorting ........................ 157, 213–215,
GridEngine .................................................................... 738 218, 232, 242, 385, 564, 580, 581, 585
Incongruence ...................................................... 158, 221,
H 224–225, 261, 460, 564, 566
Inconsistency score (IS)....................................... 255, 266
Haploid ....................................................... 226, 332, 333, Independence ............................................. 24, 40, 41, 57,
353, 416, 558, 578, 580, 583, 640 62, 63, 166, 167, 216, 375, 568, 580
segregants ................................................................ 637 Information content ............................................... 88, 92,
Haplotype ................................................... 336, 356, 545, 399–401, 408–412, 414, 421
546, 578–580, 636 Inhibitors .............................................................. 380, 643
Hardy-Weinberg Initiation of DNA replication.....................................9, 12
equilibrium ........................................ 35, 42, 540, 542 Innate immune system.................................................. 641
model ............................................................ 34, 42–47 In-paralog .................................................... 151, 155, 157
Haskell, E. ................................................... 729, 730, 749 Insertion, see Scoring of edit events
Hessian..........................................................313–316, 328 Insertion of domains...........................281, 472, 490, 493
Heterochromatin ................................510, 512, 519, 520 Instantaneous rate matrix .....................79, 375, 416, 567
HGT, see Horizontal gene transfer Insulator ................................................................. 17, 508
Hidden Markov model (HMM) ...................... 57–63, 68, Interacting genes......................................... 380, 641, 643
154, 163, 169, 385, 470, 546, 557, 572–574, Interaction .................................................. 3, 6, 8, 18, 23,
576, 578–580 165, 169, 273, 325, 375, 381, 402, 482, 489,
Hidden paralogy......................................... 122, 123, 128, 514, 515, 520, 523, 550, 601, 607, 635, 640,
143, 149–171, 225, 242, 670 641, 643, 644, 646, 648, 714, 735
Hierarchical Interaction network
clustering ................................................157, 619–621 degree distribution......................................... 474, 484
group .............................................154, 157, 168, 170 Interoperability...........................661, 664, 681, 684, 753
High(-)performance computing (HPC) ........... 106–108, Interspersed repeats ............................................. 185, 510
164, 646, 691–720, 725, 727, 728, 731, Introgression .............................................. 219, 221, 223,
736, 742 283, 298, 301, 340
Histone(s).................................................... 10, 13, 14, 20 Intron................................................. 184, 186, 231, 389,
modification....................................... 17, 18, 507, 509 390, 471, 472, 490, 492, 508
HIV ............................................................... 10, 108, 428, Inversion .............................. 89, 124, 137, 139, 142, 509
437, 443, 445, 446, 465, 466 Inverted repeats........................................... 179, 180, 186
HMM, see Hidden Markov model Iron-sulphur clusters....................................................... 14
HMMER .............................................................. 724, 750 Isochores............................................................... 346, 387
HOGENOM ............................................... 159, 161, 280
Homology (homologous) J
pairs of chromosomes ................................... 151, 333,
355, 359, 509, 583 Jaccard coefficient ....................................... 250, 252, 253
proteins .................................................. 278, 297, 400 Jackknife ........................................................................ 225
recombination (HR) ................................22, 180, 460 Java............................................. 189, 429, 498, 717–719,
sequences ........................................60, 128, 139, 164, 730, 740, 747–753, 756–758, 760, 763
402, 428, 459, 556, 567 Java virtual machine (JVM) ................................ 739, 751,
Horizontal gene transfer (HGT) ..............................6, 22, 752, 757, 758, 763
170, 220–222, 225, 227, 241–243, 255, Job scheduler........................................................ 717, 726
257–266, 271, 272, 290, 293, 294, 301, 359 JRI......................................................................... 749, 753
Host-pathogen ........................... 635, 641, 643–644, 648 Junk DNA ..................................................................... 178
HTTP protocol .................................................... 662, 751 JVM, see Java virtual machine
HyPhy .................................................386, 429–433, 435, Jython ................................................................... 751, 757
436, 438, 439, 441, 443, 445, 447, 449, 451,
K
452, 454, 461–465, 735
KEGG .........................................380, 594, 595, 614, 763
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
Index 773
Keyword search ........................................... 672, 675, 685 Marker map .......................................................... 636, 645
Knowledge representation ............................................ 684 Markov
Kullback-Leibler (KL) distance ...................................... 87 chain......................................... 51–59, 62, 63, 65, 68,
90, 385, 455, 569, 570, 572, 692, 704, 729
L chain Monte Carlo (MCMC)..................... 65, 90–92,
Landsberg erecta (Ler) ................................................. 644 95, 96, 100, 106, 161, 310, 313–325, 327,
Last universal common ancestor (LUCA)................... 475 328, 448, 455, 456, 692, 693, 711–713, 716
Lateral gene transfer (LGT), see Horizontal gene transfer clustering ................................................................. 156
model .....................................................57–62, 65, 68,
Leucine-rich-repeat (LRR) .................................. 641, 642
Likelihood 138, 165, 375, 377, 470, 497, 546, 557, 572,
composite (CL) ....................................................... 579 578, 580, 724
(see also Evolutionary, model)
function ....................................................... 42–44, 48,
49, 83, 84, 86, 106, 107, 233, 694, 697 Mass spectrometry ........................................................ 611
ratio test (LRT) ..........................74, 85–87, 228–231, Mating system ..................................................... 343, 344,
347, 352, 353, 511, 512, 514, 516, 522
233, 327, 376, 390, 392, 409, 411, 433, 434,
437, 440, 442, 444, 445, 447, 448, 452, 454 Maximum
Lineage specific likelihood (ML).................................... 42, 45, 86, 88,
89, 134, 142, 153, 155, 167, 196, 226, 313,
gene duplications ................................. 197, 377, 384,
385, 637, 639 316, 375, 377, 383, 400, 407, 421, 447, 448,
test(s) ............................................................. 197, 437, 460, 576, 643, 724
441, 442, 637 likelihood estimate/estimation (MLE).............42–48,
64, 85, 86, 90, 403, 406, 407, 409–411, 418
Linear Gaussian model ................................................... 64
Linkage ....................................................... 188, 215, 280, parsimony (see Parsimony)
334–336, 342, 343, 354, 356, 534, 580, 620, McDonald-Kreitman test (MK test) ............................ 342
Measures of correctness ............... 74, 161, 326, 681, 749
636, 637, 639, 640, 646, 648
Linkage disequilibrium (LD) .................... 334, 335, 354, Mechanistic model ............ 403, 404, 407, 408, 415, 422
356, 534, 535, 544–546, 636 MEGAN ....................591–603, 610, 612, 614, 625–627
Meiosis ................................................................. 332–334,
Linked data ........................................................... 661, 683
Lipid bilayer....................................................................... 7 336, 341, 343, 352, 354, 358, 359, 387
Local probability distribution (LPD)..........62–64, 66, 69 Meloidogyne hapla
Loci/locus ................................................ 6, 8, 19, 22, 42, Membrane-bound organelles ...............4, 5, 9, 11, 12, 14
Message passing interface (MPI) ....................... 106, 461,
46, 105, 193, 214, 217, 219, 220, 223,
225–229, 231, 232, 335–337, 341, 356, 373, 646, 727, 729, 730
519, 536, 548–550, 556, 557, 561, 568–571, Messenger RNA (mRNA) ....................... 16, 18, 19, 180,
185, 186, 389, 464, 472, 507–509, 754, 760
584, 637, 639, 644, 648, 649
LOFT ............................................................................. 159 Metabolic pathways................................................ 14, 640
Long-branch-attraction (LBA)............... 76, 83, 218, 242 Metabolite QTL (mQTL) .......................... 637, 645, 647
Metabolites ..........................................637, 638, 645, 647
Long interspersed nucleotide element-1 (LINE1) ..... 510
Long-reads..................................................................... 517 Metagenomics .............................................. 23, 286, 287,
Long-read sequencing .................................................. 517 289, 294, 296, 591–600, 603, 606, 608–610,
615–618, 624–628
M Methyl-DNA immunoprecipitation ............................. 637
Metropolis-Hastings algorithm............................. 65, 712
Machine learning.......................................... 96, 107, 109, Microarray .......................................................33, 38, 636,
164, 192, 196, 681, 731, 733 637, 639, 640, 645, 649
Macro language............................................................. 752 Microbiome .....................................................23, 24, 286,
Macronucleus ............................................................22, 23 591, 606, 608, 611, 614, 616, 617, 619, 621,
MANOVA ..................................................................... 624 623, 624
Mapping power ............................................................. 646 Micronucleus .............................................................22, 23
MapReduce........................................................... 725, 726 Microorganism ....................................591, 599, 605–607
Marginalization .................................................. 40, 41, 50 MicroRNAs ...........................................19, 389, 508, 514
Marker ............................................. 17, 19, 20, 133, 221, Microsatellites.................................................................. 13
231, 293, 354, 544, 548, 550, 606, 613, 614, Mimivirus........................................................................... 5
636, 645, 646 Mining ........................................................................... 731
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
774 Index
Mitochondrion ....................................................... 14, 292 Non-coding .......................................................11, 19, 67,
Mitosis ........................................................................... 333 127, 138–140, 142, 178, 215, 232, 376, 471,
Mobile genetic elements.................................20–21, 178, 473, 492, 703
179, 272, 293, 294, 299, 472 Nonsynonymous mutation........................................... 405
Model misspecification ................................ 76, 400, 401, Nonsynonymous to synonymous rate ratio
408, 412–414, 418 (dN/dS) ....................................................... 378
Model organism ..................................190, 636, 644, 649 Normal distribution .......................................... 38–40, 44,
Molecular clock 88, 310, 313, 317, 328, 410, 417, 624
correlated ...........................................................99–101 Normalization ......................................90, 251, 253, 284,
evolution.................................................................... 97 285, 411, 432, 615, 683
hypothesis .................................................................. 97 NP-complete ............................................... 153, 156, 169
local ............................................................................ 99 Nucleoid .......................................................................... 11
strict ..................................................................... 97–99 Nucleomorph ............................................................15, 21
uncorrelated ............................................................ 101 Nucleosome...............................................................10, 13
Most recent common ancestor (MRCA)........... 122–124, Nucleotide binding site leucine rich repeat domain
217, 559–562 (NB-LRR) .................................................... 642
MP-EST...................................................... 217, 218, 221, NUTs, see Nearly Universal Trees
226, 229, 231, 232
mRNA, see Messenger RNA O
Multifurcation ............................................................... 159 Olfactory receptor ......................................................... 161
Multilocus............................................................ 216, 223, Oligomer ....................................................................... 188
228, 354, 568
OMA, see Orthologous Matrix
Multiple QTL Mapping (MQM) ............... 646, 647, 649 Ontology
Multiple sequence alignment (MSA) .............66, 80, 189, (-)based data access (OBDA) ........................ 676, 681
280, 282, 392, 470, 693
evolution (EO) ........................................................ 664
Multispecies coalescent (MSC) ........................... 216–233 gene (GO) ..................................................... 484, 488,
Multivariate analysis ............................................. 621, 624 640, 664, 670, 681
Mus musculus....................................................... 507, 636
Open reading frame (ORF)................................ 180, 181,
Mutant alleles .............................................. 535, 536, 644 184–186, 194, 195, 286, 511, 625
Mutation Open source software (OSS) ............................... 749, 758
accumulation .................................344, 353, 512, 518 OpenStack ..................................................................... 725
rate ......................................... 36, 68, 78, 79, 81, 215,
Operational taxonomy unit (OTU) ................... 219, 290,
224, 336, 337, 343, 344, 352, 385, 387, 419, 612, 625
429, 464, 535, 536, 561, 563, 575–577, 584 Optimization .....................................................49, 50, 58,
83–84, 88, 89, 95, 159, 163, 280, 377, 429,
N
442, 448, 460, 681, 693
Naive Bayes................................................................50–51 ORF, see Open reading frame
Natural population.....................340, 354, 518, 636, 646 Origins of DNA replication................................. 9, 11, 12
Natural selection, see Selection, Natural Ortholog.............................................122–124, 127, 143,
Nearly Universal Trees (NUTs) ........244–245, 255, 256, 149–158, 162–169, 171, 242, 244, 260, 379,
259–266 656, 670, 677, 679, 680
Negating QTL (QTL in repulsion phase) ................... 646 Orthologous Matrix (OMA) .............................. 154, 157,
Nematode .......................... 333, 336, 347, 350, 353, 641 164, 169, 279, 280, 656–658, 661, 664,
Network 667–672, 675–677, 679, 680, 684–686
analyzer ............... 272, 274, 275, 277, 287, 289, 649 Orthology
hubs ...................................... 292, 293, 474, 482–484 many-to-many ................................................ 151, 155
inference .................................................638, 647–648 map ................................................133, 134, 140, 142
Neutrality test.............................344, 437, 440, 448, 459 one-to-many .......................................... 150, 151, 155
Next(-)generation sequencing (NGS) ............... 3, 33, 37, one-to-one............................................ 123, 126, 128,
108, 187, 189–192, 196, 374, 506, 516–518, 133, 134, 151, 168
523, 547, 591, 596, 597, 606, 609 positional ................................................................. 123
NGS, see Next(-)generation sequencing prediction....................................... 126, 140, 163–166
OrthoMCL ..........................................154, 156, 157, 279
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
Index 775
OTU, see Operational taxonomy unit Phylogenetic(s)
Outcrossing ..................................................331–360, 511 footprinting ............................................................. 127
Out-paralogy ................................................................. 151 hidden Markov model (phylo-HMM)............ 68, 385
Overlap(ing) network.................................................. 224, 243, 301
graph ........................................................................ 188 outliers ............................................................ 223–227
layout consensus...................................................... 471 tree ............................................................... 66–68, 90,
reading frames ................................................ 131, 597 96, 109, 135, 193, 211–233, 241–266, 311,
OWL, see Web Ontology Language 314, 319, 430, 515, 581, 693, 727, 729
Oxidative phosphorylation .......................................13, 14 Phylogenetic analysis by maximum likelihood
(PAML) ....................................... 73, 310, 386,
P 392, 420, 643–644, 647, 724, 726, 727, 734,
735, 737, 750
Pairwise alignment ....................................... 59, 125, 128,
135, 136, 138, 141, 189, 195, 273, 275, 277 Phylogeny .............................................75, 108, 149, 152,
Parallelization ............................................. 106, 692, 694, 166, 167, 180, 211–213, 232, 293, 309–311,
313, 317, 321, 325, 327, 328, 383, 384, 386,
695, 697, 698, 725–731, 741
Paralog ........................................................ 123, 149–153, 387, 391, 428, 431, 435, 436, 438, 441, 442,
155–157, 165, 166, 169, 222, 246, 670 447, 448, 450, 457, 460, 463, 465, 493, 516,
556, 581, 701, 704, 706, 714
Paralogy ...................................................... 122, 123, 128,
143, 149–171, 225, 242, 670 Phylogeny-aware alignment ......................................... 592
Parameter confounding ................................................ 401 Phylo(-)HMM, see Phylogenetic hidden Markov model
Parametric PhyML ..................................................... 89, 95, 106, 693
Phytophthora infestans
(codon) model ............................................... 377–385
(see also Evolutionary, model) Pipeline ...................................... 190, 195–196, 274, 295,
test................................................................... 229, 230 387, 432, 517, 592–597, 599, 602, 608, 609,
612, 614, 627, 628, 724, 726, 727, 739, 741
Parent.....................................................62, 522, 644, 693
Parrot native compiler interface ................................... 751 Piwi-interacting RNA (piRNA)............................. 23, 519
Parsimony ........................................................72–96, 108, Piwi proteins.................................................................. 519
Plant resistance ..................................................... 641, 643
158–162, 167, 484, 489, 491, 492, 494
Pathogen..........................................................22, 54, 380, Plant resistance genes (R-genes) ......................... 641–642
388, 447, 459, 511, 610, 621, 640–644, 649, Plasmid ...............................................13, 15–16, 19, 179,
709, 735 280, 290, 292, 293
Plasmodium
Pattern
discovery ......................................................... 178, 356 Plasticity......................................................................... 639
significance...................................................... 126–127 Plastid ................................................4, 7, 14–15, 21, 283
Pleiotropic effect ........................................................... 647
Peeling algorithm ............................................................ 66
Peptide ............................................................................... 9 Ploidy ............................................................226, 351–352
Perl ....................................................................... 194, 299, Poisson
distribution ...........................................36–38, 48, 568
496, 727, 740, 747–753
Permutation genetics .................................................................... 568
pattern...................................................................... 624 genomics......................................................... 561, 568
model ....................................................................... 580
strategy..................................................................... 646
Permutational multivariate analysis of variance process ....................................................................... 56
(PERMANOVA) ......................................... 624 random field ................................................... 558, 580
Pfam ....................................................470, 476, 479–481, size .................................................................. 561, 568
Polyadenylation .................................................... 508, 521
483–485, 487, 488, 494, 496–498, 763
Phenomenological Polymorphism ............................................ 218, 336–338,
load ....................................... 401, 402, 415, 418–422 340, 342, 352, 357, 374, 385, 466, 533, 538,
581, 584, 585, 636, 639, 640, 645, 648
model ....................................................................... 401
Phenotype...........................................4, 8, 13, 17–19, 23, frequencies............................................. 385, 538, 636
294, 436, 533, 534, 541, 607, 636–638, 644, Polymorphism-aware phylogenetic model
(PoMo) ............................................... 385, 386
646, 647, 649
Phybase ................................................226, 227, 229, 233 Population
Phylodynamics...................................................... 691–720 dynamics .................................................402, 505–522
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
776 Index
Population (cont.) ProDom .................................................... 471, 497
genetics ........................................................ 53, 71, 95, SCOP ..................... 470, 471, 478, 479, 488, 497
97, 101–104, 223, 225, 331, 333–337, 343, SMART.............................................470, 487, 497
354, 356, 360, 392, 415, 416, 522, 534, 535, domain ............................................................ 469–500
555–586, 637 neighbor pair ......................................... 481, 487, 489
genomics.................................................356, 555–586 order ...................................................... 471, 486, 491
simulator .................................................................. 139 promiscuity/versatility ................................... 487–489
size ........................................................ 104, 213, 214, QTL ( pQTL) ........................................ 637, 645, 647
222, 224, 331, 333, 335–337, 351, 352, 392, sequence .........................................59, 220, 278, 281,
416, 512, 520, 522, 536, 537, 555, 556, 560, 296, 299, 378, 381, 461, 470
561, 563–565, 568, 580–582, 584, 585, 645, structure.......................................................... 475, 492
648, 710 triplets .................................................... 471, 479, 481
Positive selection, see Selection, positive Protein-coding gene .......................................10, 18, 103,
Posterior 133, 138, 190, 311, 326, 328, 376, 378, 388,
decoding .....................................................59, 60, 580 390, 400, 402, 414
probability ................................................... 41, 45, 46, Protein-protein interaction(s) ............................ 165, 169,
89, 90, 409, 410, 430, 450, 456, 705 273, 381, 640, 735
probability distribution .................................... 91, 715 Proteome ........................................... 165, 167, 272, 280,
Power law distribution...............474, 476, 477, 483, 485 285, 473, 474, 476, 477, 479, 480, 483–485,
Preferential attachment.......................474, 483, 493, 500 487, 488, 495, 498, 499
Primary chromosome...................................................... 13 Pruning ........................................... 80, 95, 107, 245, 246
Primates ...................................................... 143, 182, 185, Pseudogene ................................................. 180, 186, 356
186, 193, 194, 311, 312, 321, 328, 341, 374, Punctuated equilibrium ................................................ 475
382, 392, 428, 433, 435, 557, 584, 585 Purifying selection, see Selection, purifying
Primer ....................................................71, 606, 611, 663 Python ................................................................. 274, 465,
Principal 517, 727, 728, 733, 739, 740, 747–754,
components analysis (PCA)...................542, 621–622 756–760, 763
coordinates analysis (PCoA)......... 601, 602, 621–622
Prior Q
conjugate .............................................................46, 47 QIIME.................................................612, 614, 625, 627
distribution .................................................. 89, 90, 96, Quality control (QC)................. 539, 541–542, 592, 625
100, 101, 327, 410, 455, 583
Quantitative
Profile HMM.............................................................60, 61 phenotypes............................................................... 637
Progressive alignment ................................................... 135 trait loci (QTL) .....................................342, 636–638,
Prokaryotes........................................ 4, 9, 10, 12, 13, 15,
640, 643–649, 753
18–20, 22, 170, 221, 242, 243, 258–260, Query processing ................................659, 662, 668–669
264–266, 272, 280, 281, 285, 293, 297, 478,
484, 488, 518 R
Prokaryotic cell...............................................5, 10, 14, 15
Promoter .........................................................17, 18, 179, R (statistical language).................................................. 646
185, 186, 507, 508 Random
Protein effect ..................................... 382, 414, 417, 441, 448
architecture ..................................................... 469–500 forest ........................................................................ 195
combination .................................................. 471, 472, variable ............................................ 34–42, 45, 46, 48,
478, 479, 481, 483–484, 486, 489, 493 50, 51, 57, 62–65, 67, 68, 90, 559, 622, 692
complex.................................................. 475, 477, 494 Rate
databases (of) deletion........................................... 512, 518, 521
ADDA....................................................... 471, 497 (of) excision ............................................................. 347
CATH ............................. 470, 471, 478, 481, 497 (of) fusion to fission....................................... 486, 491
Conserved Domain Database........................... 486 heterogeneity....................67, 68, 354, 377, 461, 701
Gene3D family .................................470, 478, 497 (of) insertion ......................................... 512, 518, 521
InterPro ............................................470, 478, 497 shift .......................................................................... 377
Pfam ......................................................... 470, 476, (of) transposition................................... 517, 518, 520
479–481, 483–485, 487, 488, 494, 496–498 RBH, see Reciprocal best hit
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
Index 777
RDF, see Resource description framework RFLP, see Restriction fragment length polymorphism
RDF Schema (RDFS) ..................................661, 665–667 R-gene .................................................641–644, 649, 754
Reactivity .............................................................. 647, 648 Ribosomal RNA (rRNA) ..................................6, 18, 606,
Rearrangement .............................. 20, 95, 122–126, 128, 611–615
130, 132, 134, 135, 137, 139, 151, 189, 230, Ribosome....................................................................... 262
233, 273, 281, 351, 352, 506, 509, 583, 644 RIL, see Recombinant inbred line
Reasoning ..................................................... 81, 161, 213, RNA
647, 666, 678, 729, 730 polymerase ................................................18, 185, 438
Reassortment................................................................... 10 RNA-seq .....................................192, 196, 637, 645, 649
Recessive (or lethal) alleles .................................. 341, 357 RPM1 ............................................................................ 642
Recombinant inbred line (RIL) .........636, 639, 644, 648 RPS2 .............................................................................. 642
Recombination ......................................... 22, 24, 68, 179, RPy..............................................749, 753, 756, 757, 763
180, 184, 216, 217, 223, 332–341, 343, 344, rRNA, see Ribosomal RNA
346, 354–357, 382, 387–388, 427–466, 471, Rserve ................................................ 749, 751, 753, 754,
472, 489, 492, 493, 495, 507, 509, 512, 521, 756, 762, 764
546, 556, 561, 562, 564, 568–569, 571, 572, RSOAP .......................................................................... 753
574, 576–578, 580–582, 584, 636, 645 RSPerl ............................................................................ 753
Redundancy................................................................... 107 RSRuby ................................................................. 749, 753
Regulation .....................................................6, 17–21, 23, 2R-WGDs, see Two-round whole genome duplications
178, 187, 283, 389, 479, 482, 507, 515,
518–520, 522, 637, 639, 647 S
Regulator .............................................................. 637–639
16S ..........................................6, 606, 612–615, 625, 627
Regulatory Saccharomyces cerevisiae (S. cerevisiae)................ 349, 354,
element .............................................. 17, 18, 127, 640 391, 476, 477, 485, 507, 637, 639
genomic region ........................................20, 138, 506
Scaffolding ..................................................................... 603
mechanism ...........................................................18, 21 Scala ............................................729, 730, 751, 757, 758
Relational database (RDB) ................................. 657–661, S. cerevisiae, see Saccharomyces cerevisiae
663, 669, 670, 672–674, 677, 678
Scoring
Relative rate test .............................................................. 98 matrix ......................................................................... 60
Reliability ................................................ 74–75, 101, 167, scheme .............................................................. 60, 142
192, 232, 400, 412, 414, 545, 602 Secondary chromosome ................................................. 11
RELL, see Resampling of estimated log-likelihoods
SEED subsystem ..........................................594–596, 614
Remote procedure call (RPC) ............................ 750–753, Segmental duplication ......................................... 123, 644
755–756, 762, 763 Segregating/segregation ................................6, 333, 334,
Repeat ...................................................87, 140, 180, 181,
337, 341–343, 574, 581
185, 188, 194–196, 472, 483, 495, 496, 510 Selection
Repetitive DNA/element ..................................... 13, 140, adaptive .................................................................... 345
188–189, 194
balancing............................... 106, 222, 345, 428, 447
Replication.........................................................3, 6, 9, 11, coefficient ............................................. 105, 337, 385,
12, 20, 21, 24, 180, 187, 194, 287, 466, 506, 390–392, 415, 416, 534
514, 515, 544, 547, 645
directional ...................................... 222, 335–336, 428
Representational state transfer (REST) ....................... 751 genic................................................................ 103–106
Resampling of estimated log-likelihoods (RELL)......... 74 positive............... 105, 106, 197, 338, 342, 344–346,
Residual variance ........................................................... 646 378–384, 386–390, 392, 400, 401, 404, 405,
Resolution schema ............................................... 249–251
408–414, 417, 418, 421, 433, 435, 440–443,
Resource Description Framework (RDF)...................657, 445–449, 451–454, 456, 458, 464, 466, 535,
660, 661, 663–668, 671–673, 675, 678, 679, 537, 638, 641, 643, 644, 648, 724, 735
683, 751, 761
pressure ................................................. 344, 375, 412,
Restriction fragment length polymorphism 413, 421–423, 435, 437, 438, 441, 445, 496,
(RFLP) ................................................ 636, 637 648, 735
Retrogenes............................................................ 182, 186
purifying/negative ............................... 197, 345, 348,
Retroposition........................................................ 185, 186 378–382, 388–390, 428, 440, 448, 451–454,
Retrotransposons .................................................. 20, 180, 458, 464, 535
182–185, 195, 345, 351, 472, 518–521 strength..........................................427, 437, 535, 701
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
778 Index
Selection-mutation model (MutSel) .................. 386, 392, Speciation ................................................... 122, 149–152,
402, 404, 415–416, 418, 422, 423 155, 157–160, 162, 166, 170, 220, 223, 226,
Selection, natural.............................................22, 24, 217, 243, 357, 385, 492, 513, 556, 559–561,
220, 222, 225, 227, 333, 375, 384, 427, 510, 563–566, 568, 569, 581–586, 713
514, 520, 535 Species
Selective sweep ...........................222, 336, 341, 582, 639 delimitation .................................................... 218–220
Selfing ..................................................331–360, 511, 516 tree ..........................................................213, 217–222
Selfish elements ...................................344, 347, 351, 352 Specificity .......................................................17, 139–140,
Self-regulation ...................................................... 518–519 167, 195, 458, 483, 486, 641, 642
Semantic Spirochetes ...................................................................... 11
query ............................................................... 668–669 Splice
web................................................................. 657, 658, alignment ................................................................. 197
660–672, 675, 677, 678, 681–684, 751 site ...........................................................389, 506–507
Sequence Splicing ........................................................ 186, 389, 508
alignment (see Multiple sequence alignment; Pairwise Split distance (SD) ..................................... 242, 247–250,
alignment) 253, 255, 256, 266
assembly .................................. 13, 274, 506, 595, 596 Statistical
reads ....................................................... 188, 615, 625 alignment ................................................................. 138
similarity network (SSN) ......................273–290, 295, inference ....................................42–48, 309, 574–577
297–299 model(ing).............................................33–42, 46, 48,
Sequencing error correction...................... 557, 558, 569, 52, 57, 59, 326–327, 375, 448, 543, 692
582–583, 592 power .................................................... 162, 408, 435,
Sex 544, 636, 643, 649
chromosome..................................140, 335, 344, 582 significance................... 377, 429, 433, 437, 442, 646
Shared libraries ....................................696, 751, 752, 759 Stop codon .................................................. 416, 472, 490
Short Strain.....................................................10, 190, 438, 449,
read ....................................................... 187, 189, 190, 513, 518, 637, 639, 642, 644, 645
192, 196, 517, 591–593, 596, 598, 602, 603 Streptomyces ................................................................... 11
sequence repeat ..................................... 188, 615, 625 Structural
Shotgun sequencing........................36–38, 606, 613–615 benchmarks..................................................... 139, 192
Signaling ............................ 470–471, 482, 492, 641, 642 variation .......................................................... 191, 583
Silencer............................................................................. 18 Structure of DNA ................................................ 9, 13–16
Simple Object Access Protocol (SOAP) ............ 626, 751, Structure-preserving ..................................................... 486
755–756, 761 Study design ........................................................... 48, 611
Simplified Wrapper and Interface Generator Subgraph..................... 65, 156, 169, 171, 289, 301, 679
(SWIG)......................................................... 752 Substitution
Simulating matrix ......................................................................... 60
populations (see Population, simulator) model(s) (see Evolutionary, model)
trees.......................................................................... 226 scoring .............................................................. 48, 134
Single nucleotide polymorphism (SNP) ............ 218, 232, Supercluster ................................................................... 644
539, 542, 543, 545–549, 636, 639, 645, 646 Supercoiling..................................................................... 10
Sister chromatids ........................................................... 186 Supermatrix ................................ 212, 214–216, 328–329
Site frequency spectrum ...................................... 558, 580 Supertree .................................... 213, 255, 258, 264–266
Site-specific Support vector machine (SVM) ....................96, 163, 195
fitness landscape .................................... 404, 405, 414 Supradomains ....................................................... 473, 486
test (for selection) .......................................... 401–402 Susceptible ......................... 218, 227, 231, 636, 642, 644
Small-interfering RNA (siRNA)......................19, 21, 519 SVM, see Support vector machine
SNP, see Single nucleotide polymorphism Sweep ..........................................222, 336, 341, 582, 639
Solute carrier protein ................................................8, 541 SWIG, see Simplified wrapper and interface generator
Somatic Symbiont .............................................................. 297, 621
mutation .................................................................. 518 Synonymous mutation/substitution/change ............337,
nucleus .............................................................. 23, 518 378, 388–390, 403, 408, 415, 428, 453, 643,
SPARQL .............................................656, 661, 668–669, 701, 724
671, 672, 674, 675, 681, 683–686 Synteny ................................................166, 387, 583, 640
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
Index 779
T class ...............................179–187, 194–196, 506, 514
copy number..................................349, 513, 514, 516
TAMBIS, see Transparent access to multiple diversity...................................................178, 509–513
bioinformatics services families ..........................................184, 192, 194, 197,
Tandem 349, 350, 510, 512, 517, 520, 522
array(s) ....................................................................... 13 fixation ..................................................................... 510
duplication(s) ................................................... 11, 123 mating system....................... 511, 512, 514, 516, 522
repeat(s) ....................................................11, 185, 510 regulation................................................................. 518
Target gene.....................................................16, 637, 640 Transposition..........................................13, 20, 179, 180,
Taxon ....................................................93, 161, 221, 222, 186, 195, 196, 347, 506, 510–513, 515,
225, 325–326, 328, 446, 594, 598, 599, 601, 517–520, 522
602, 665, 669, 685 Transposition in Transposition (TinT) ............... 196, 199
Taxonomic Transversion ................................. 56, 74, 77, 81, 82, 131
analysis ................................................... 602, 610, 614 Tree
group .............................................221, 338, 348, 350 of life (ToL) .............................................97–102, 149,
Taylor approximation........................................... 106, 313 187, 221, 243, 244, 261–262, 266, 281, 286,
Telomeres ............................................11, 12, 14, 15, 387 309, 489, 491
Termination of DNA replication.................................. 606 reconciliation ........................................ 152, 158, 159,
Tetraploid ...................................................................... 356 161, 162, 170, 242
TinT, see Transposition in transposition rooted ......................................... 72, 93, 94, 242, 243
Toporthology ............................. 123–125, 129, 133, 134 search ......................................................................... 95
Trade-off..................................................... 84–85, 88, 95, topology................................................ 66, 68, 76, 81,
156, 168, 598, 748 84, 88, 93–96, 101, 161, 166, 212–215, 218,
Training sample ............................................................. 619 225–227, 229, 233, 266, 310, 315, 325–326,
Trait ...........................................178, 331, 345, 353, 357, 328, 402, 430, 459, 515
512, 533, 534, 543, 544, 547–549, 635–638, ultrametric ..................................................97, 98, 255
640, 641, 646–649, 700, 704–708, 714, 719 unrooted ....................................................93, 94, 242,
Trans .............................................................................. 642 244, 255–257, 314, 383
trans-acting.................................................................... 640 Triple..........................................213, 379, 400, 415, 419,
regulatory elements.............................................17, 18 422, 663, 664, 668, 669, 672, 679, 685, 686
trans-band ............................................................ 637, 639 tRNA, see Transfer RNA
Transcript........................................... 197, 389, 507, 509, Two color cDNA microarray............................... 637, 639
637, 639, 640, 645, 647
Transcription U
factor ................................................18, 486, 519, 641
factor binding sites ........................127, 507, 508, 521 UCSC Genome Browser .................................... 136, 142,
Transcriptome assembly................................................ 196 192, 374, 392
Transfer RNA (tRNA) ..........................18, 182, 185, 389 Ultraconserved .............................................................. 127
Transition ............................................... 8, 54–56, 60, 61, elements (UCEs)/regions (UCRs)/conserved
74, 77, 81, 82, 131, 219, 283, 333, 340, 347, nongenic sequences (CNSs) .............. 127, 280
351–353, 355–357, 359, 403, 437, 496, 537, Uncertainty................................................. 34, 43, 44, 48,
569–571, 704, 705, 714 89, 90, 99, 101, 102, 108, 142, 155, 159–161,
Transition probability .......................................52, 53, 55, 218, 325, 326, 329, 355, 382, 384, 410, 452,
57, 60, 65, 375–376, 579, 695, 715 453, 493, 547
Transition/transversion (rate).................... 313, 379, 701 Unequal crossing over .................................................. 351
Transition-transversion rate ratio ................................. 378 Uniform Resource
Translation.................................................... 9, 12, 17, 19, Identifier (URI)......................................661–666, 679
20, 260, 262, 389, 476, 507, 509, 678, 679, Locator (URL) ..............................497, 498, 661, 663
735, 752, 753, 757, 758, 760, 763 Name (URN) ................................................. 662–663
Translocation ............................................... 140, 179, 509 Uniparental inheritance ....................................... 345, 358
Transparent Access to Multiple Bioinformatics Services UniProt....................................................... 471, 476, 479,
(TAMBIS) .................................................... 681 496, 497, 656, 658, 660, 661, 664, 669,
Transposable element (TE/transposon) 671–673, 677, 679, 680, 682, 684–686, 763
activity............................................ 511–513, 520, 522 Untranslated regions (UTR) ............................... 509, 511
URI, see Uniform resource identifier
EVOLUTIONARY GENOMICS: STATISTICAL AND COMPUTATIONAL METHODS
780 Index
URL, see Uniform resource locator Web-services ............................................... 430, 628, 682,
UTR, see Untranslated regions 750, 751, 760–762
WGD, see Whole genome duplication
V Whole genome duplication (WGD) ......... 123, 166, 170,
490–491
Vertebrates.................................................... 99, 184, 186,
332, 333, 375, 376, 473, 479, 490, 511, 520, Witness(es) of non-orthology ............................. 155, 156
521, 701 Wolbachia ...................................................................... 350
Viral evolution........................................................ 22, 295 WormBase............................................................. 753, 754
Wright-Fisher
Virion .................................................................. 5, 10, 287
Virtualization........................................................ 725, 726 population ................................................53, 217, 559
Virtual machine (VM) ............................... 725, 726, 731, process ................................................................. 53–54
738, 751, 763, 764
X
Virus.......................................................... 5, 9, 10, 22, 98,
108, 221, 281, 287, 289, 290, 292, 297, 388, Xenolog(s) ..................................................................... 478
392, 433, 437, 440, 445, 447, 476, 477, 485, XML.....................................................667, 751, 755, 756
519, 710 xQTL .................................................................... 637–649
Viterbi ............................................................... 58–60, 580
VM, see Virtual machine Y
Yeast ............................................343, 347, 354, 490, 639
W
Wald confidence interval................................................. 44 Z
Wassilewskijai (Ws)........................................................ 644 Zinc (Zn) finger protein ............................................... 185
Watson-Crick pairings...............................................9, 387
Zygote.............................................................20, 332, 333
Web Ontology Language (OWL) ....................... 666–667