You are on page 1of 24

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/337959323

Phylogenomics — principles, opportunities and pitfalls of big‐data


phylogenetics

Article  in  Systematic Entomology · December 2019


DOI: 10.1111/syen.12406

CITATIONS READS

47 6,702

2 authors:

Andrew D. Young Jessica Gillung


University of Guelph McGill University
20 PUBLICATIONS   270 CITATIONS    44 PUBLICATIONS   353 CITATIONS   

SEE PROFILE SEE PROFILE

All content following this page was uploaded by Jessica Gillung on 28 January 2020.

The user has requested enhancement of the downloaded file.


Systematic Entomology (2019), DOI: 10.1111/syen.12406

OPINION

Phylogenomics – principles, opportunities and pitfalls


of big-data phylogenetics
1
A N D R E W D . Y O U N G † and J E S S I C A P . G I L L U N G 2
1
California Department of Food and Agriculture, Plant Pest Diagnostics Center, Sacramento, CA, U.S.A. and 2 Department of Natural
Resource Sciences, McGill University, Sainte-Anne-de-Bellevue, Canada

Introduction and tools that are central to phylogenomics, with an emphasis


on the appropriate application of techniques useful for phylo-
Phylogenetics is the science of reconstructing the evolution- genetic analysis of genomic data. We focus on the sequencing
ary history of life on Earth. Traditionally, phylogenies were technologies and statistical methods for phylogeny estimation,
constructed using morphological data only, but the introduc- and the software implementing these methods and their appli-
tion of Sanger sequencing and PCR in the late 1970s enabled cation to large molecular datasets. We also discuss the tools
genetic information to be incorporated into phylogenetic and tradeoffs for improving the accuracy of phylogenomic anal-
analyses. Early phylogenetic studies employing multilocus yses, including the biological and methodological sources of
analyses contributed greatly to our knowledge of phylogenetic systematic error in phylogeny estimation. Finally, we provide
history and challenged some well-established views of the a glossary of commonly encountered terms used in phyloge-
relationships among many groups of plants and animals. Since nomics that may be useful for those entering the field and hop-
the publication of these pioneering studies, significant method- ing to sort through the multitude of methods, analytical tools
ological advances in both sequencing and analytical techniques and terminology inherent to this relatively new, but rapidly
have been made, and molecular phylogenies are now broadly advancing field.
accepted to represent robust hypotheses of organismal relation-
ships. Next-generation sequencing techniques, developed in the
mid-2000s, revolutionized DNA sequencing and led to a dra- What is phylogenomics?
matic reduction in sequencing cost per nucleotide and a sharp
increase in data generation speed. As a result, the generation The word ‘phylogenomics’ was first introduced in the con-
of unprecedented amounts of sequence data for both model text of prediction of gene function for genome-scale data
and nonmodel organisms has become affordable. This develop- (Eisen, 1998), and soon after in the context of phyloge-
ment has transformed the field of molecular phylogenetics into netic inference (O’Brien & Stanyon, 1999). The discipline
phylogenomics – where genome-scale data are obtained from of phylogenomics owes its existence to the advances made
multiple samples at once at a much reduced cost (Mardis, 2011). in DNA sequencing technology over the past two decades
The phylogenomic pipeline can be very complex, present- (Metzker, 2010). It comprises several areas of research at
ing an overwhelming array of methodologies available for the the interface between molecular and evolutionary biology
acquisition, manipulation, analysis and interpretation of massive and has two major goals: (i) to infer phylogenetic relation-
datasets. Researchers also have to overcome the challenges of ships between taxa and gain insights into the mechanisms of
sequencing strategy design, identification of orthologous loci, molecular evolution; and (ii) to use multispecies phylogenetic
model selection and phylogeny estimation. This can be par- comparisons to infer putative functions for DNA or protein
ticularly daunting for researchers new to the field – both stu- sequences.
dents and established scientists – who wish to delve into novel Traditional Sanger sequencing studies include relatively few
methods and data to reconstruct the evolution of their study loci and are therefore limited by stochastic or sampling error.
group. Here we present an entry-level overview of the theory As there is a relatively small number of phylogenetically infor-
mative characters available in one or a few genes, this random
Correspondence: Jessica P. Gillung, Department of Natu-
ral Resource Sciences, McGill University, 21111 Lakeshore ‘noise’ influences the inference of backbone nodes, potentially
Road, Sainte-Anne-de-Bellevue, QC H9X 3V9, Canada. E-mail: leading to poorly resolved or poorly supported phylogenetic
jpg.bio@gmail.com trees. This problem can be addressed successfully by using much
larger amounts of sequence data. Modern phylogenomic analy-
†Present address: School of Environmental Sciences, University of ses, which take advantage of hundreds to thousands of loci from
Guelph, Guelph, Ontario, N1G 2W1, Canada. across the genome, are, on average, orders of magnitude larger

© 2019 The Royal Entomological Society 1


2 A. D. Young and J. P. Gillung

than traditional Sanger sequencing datasets. The size of these studies and continuously address larger and more complex
datasets therefore significantly reduces the impact of stochas- questions. This philosophy encompasses the sharing of data and
tic error and data availability as a limiting factor, offering great code used to produce the analysis, as well as open archiving of all
promise for resolving historically recalcitrant nodes in the tree raw data (Mork, 2015; Shade & Teal, 2015). Data provenance,
of life. the recording of the input and transformation of information
High-throughput sequencing technologies [also called used to generate a result, is a key issue in reproducibility. Several
next-generation sequencing (NGS)] (Fig. 1) have yielded recommendations and guidelines to promote the best practices
genome-scale data in immense quantities. Next-generation in reproducibility and data management in phylogenomics
sequencing technologies differ fundamentally from the Sanger and bioinformatics have been proposed (Cranston et al., 2014;
method in that they allow for massively parallel DNA sequenc- Magee, 2014; Debiasse & Ryan, 2019), and many tools for
ing, providing extremely high throughput from multiple samples ensuring provenance and curation of both data and methods have
simultaneously and at a much reduced cost (Mardis, 2011). been developed (e.g. Dunn, 2013; Oakley, 2014; Szitenberg,
Millions to billions of DNA nucleotides can be sequenced in 2015).
parallel, yielding orders of magnitude more data and mini-
mizing the need for the fragment-cloning methods that are To ensure the best practices in phylogenomics and bioin-
used with Sanger sequencing (Fig. 1). Recent progress in formatics, it is vital that reproducibility checkpoints are
NGS technology and the rapid development of bioinformat- enforced – places in a workflow devoted to scrutinizing its
ics tools now allow research groups of any size to generate integrity, so results are validated across multiple iterations to
large amounts of genomic sequences for organisms of interest. ensure consistency of results.
High-throughput sequencing can be used for whole-genome
Additionally, adopting an iterative, branching workflow to sys-
sequencing (Lam, 2012), whole-transcriptome shotgun sequenc-
tematically explore the methodological space is crucial. Linear
ing (also called RNA sequencing, RNA-seq, or transcriptomics;
methodology, with experimental and computational procedures
Wang, 2009), whole-exome sequencing (Rabbani, 2014), and
lined up one after the other, as presented in most published
reduced-representation genome sequencing (also called target
studies, is rarely the reality of phylogenetic analysis. Instead,
enrichment) (e.g., Faircloth, 2012; Lemmon, 2012).
estimating phylogenetic trees is more often than not a messy
Table 1 summarizes the most commonly used sequencing
enterprise, and a systematic exploration of the methods and data
technologies in phylogenomics. For more details on these
is recommended in order to select the best tools and pipelines
different technologies see the Beginner’s Handbook of Next
to answer the question at hand. Finally, for good provenance of
Generation Sequencing by Genohub (https://genohub.com/
experimental procedures and computational tools utilized in a
next-generation-sequencing-handbook/) (see also Ambardar,
particular study, it is highly recommended that comprehensive
2016; Besser et al., 2018, and references therein). Choosing
notes are kept throughout the process. In particular, keeping a
the appropriate sequencing technology for a phylogenomic
‘readme’ file at every step can be extremely helpful in keeping
study has important effects on downstream workflows, espe-
track of the software versions used, parameter values utilized,
cially in terms of read length, as library preparation in some
phylogenomic techniques (e.g. ultraconserved elements and goals of each step and how they relate to the software utilized, as
anchored hybrid enrichment, discussed later) requires a read size well as indication of data format changes. All these can greatly
selection step. contribute to standardization and ease of downstream efforts
(Shade & Teal, 2015).
Phylogenomic data are a precious scientific resource: molec-
A road map for the phylogenomic workflow ular sequence alignments and phylogenies are expensive to
generate, difficult to replicate and have seemingly infinite poten-
Strict experimental reproducibility is an integral – albeit tial for synthesis and reuse. For most phylogenomic analyses,
uncommon – aspect of biological sciences, mainly due to var- phylogeneticists are faced with a large combination of algo-
ied technical challenges with implementation and curation of rithms, models and data manipulation techniques. To address
experimental methods and procedures. Despite the importance this issue, here we present a flowchart containing the major steps
of phylogenetic analyses to most fields of biology, the repro- and tools utilized in phylogenomics (Fig. 2). The flowchart is
ducibility of phylogenetic experiments can be very low, with not meant to be exhaustive, but merely a visualization of the
an estimated 60% of published phylogenetic analyses being commonly utilized methodologies and pipelines in recent
‘lost to science’ due to the unavailability of the underlying data phylogenomic studies.
and methods (Magee, 2014). Published phylogenetic studies
can be difficult or impossible to replicate or expand upon, as
the utilized analytical software, software versions, software Taxon sampling: a crucial first step in phylogenomics
parameters, dependencies and operating system versions can be
very challenging to uncover or recreate. Taxon sampling is of extreme importance for phylogenetic infer-
The promotion of open science and reproducible research can ence, and increased sampling of taxa – coupled with increased
create a more productive and responsible scientific culture in sampling of loci – is commonly advocated as a solution to
phylogenomics, enabling researchers to build upon previous resolving recalcitrant nodes of the tree of life. Ideally, sampling

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 3

Fig. 1. Comparison of Sanger sequencing (left) and Illumina next-generation sequencing (right). A flow cell is a glass slide with lanes, each lane being
a channel coated with two types of oligos, which are complementary to the adapters ligated in the DNA fragments. dsDNA, double-stranded DNA; F,
forward DNA strand; R, reverse DNA strand; ssDNA, single-stranded DNA. [Colour figure can be viewed at wileyonlinelibrary.com].

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


4 A. D. Young and J. P. Gillung

of both taxa and sequences should be increased at the same pace,

Indels, deletions
but the advances in high-throughput sequencing have caused

Read length,
up to 100 kb
increases in gene sampling to far outpace taxon sampling. As

Nanopore

portability
greater amounts of data are incorporated into phylogenetic stud-
Oxford

0.1–1

5–40
ies, new evidence and hypotheses regarding relationships among

SR
taxa can emerge, and placement of lineages within clades can
change dramatically. Taxon sampling can thus greatly influence

Speed, read length


hypotheses supported by phylogenetic inference (Rosenberg &
Kumar, 2003; Nabhan & Sarkar, 2012).
Biosciences

up to 60 kb Taxon selection meant to address a specific research question


0.5–10
Pacific

should take place early in a phylogenomic study. ‘Sufficient’

Indels
taxon sampling is always dependent on the questions being
SR
13
addressed. Ideally, in order to unravel the phylogeny of an
entire taxonomic unit, most, if not all, subordinate taxa in that
Speed, read length
unit should be sampled. Even though increasing the number
homopolymers

of taxa results in a more complex computational problem for


Ion Torrent

phylogenetic analysis, it has been demonstrated that denser


200–400

Indels,

taxon sampling improves phylogenetic accuracy (Heath et al.,


1–15

1–2
SR

2008).
Taxon sampling, however, can be greatly limited by the phy-
logenomic method of choice. Transcriptomics, for instance,
High throughput

requires specimens collected and stored directly into liquid


Substitutions

nitrogen or RNAlater, whereas other sequencing methods,


2000–6000
Table 1. Comparison of different next-generation sequencing (NGS) platforms. SR, single read; PE, paired-end read.

NovaSeq

including target enrichment, and shotgun and exome sequenc-


0.2–0.8
SR, PE

ing, will require molecular-grade specimens, preferably pre-


250

served in high-grade ethyl ethanol and stored in a laboratory


ultrafreezer. A notable exception to this is target enrichment of
ultraconserved elements (UCEs), a method that can successfully
Throughput, low

generate phylogenomic data from old, pinned insect museum


Substitutions
1000–1800

specimens (Blaimer, 2016).


error rate
SR, PE

Genome-scale projects may be particularly vulnerable to sys-


HiSeq

150

tematic error caused by nonproportional phylogenetic sampling.


0.2

As dataset size increases, so does the accumulation of non-


random systematic error and accompanying nonphylogenetic
Substitutions

Throughput

signal (Jeffroy, 2006). Bayesian analyses of macroevolutionary


NextSeq

10–120

patterns – including divergence-time estimation, ancestral state


SR, PE

reconstruction, and diversification rate estimation – assume pro-


150

0.8

portional sampling of lineages within a clade, and deviations


from it may potentially lead to biases (Stadler, 2009). However,
Substitutions

Read length

some implementations enable ‘corrections’ for uneven taxon


Illumina

sampling (e.g. revbayes implements corrections for birth-death


SR, PE
MiSeq

10–15

and various diversification rate models, except for fossilized


300

0.8

birth-death).
Before sequencing new specimens, it is also worth evaluat-
ing previously sequenced resources. The National Center for
Read accuracy

Biotechnology Information’s Sequence Read Archive (NCBI


Substitutions
Up to 1 kb

and length

SRA) contains user-uploaded raw sequence data and align-


c. 0.0005

0.001–1
Sanger

ment information from high-throughput sequencing projects


(Leinonen, 2011). Other resources include FlyBase (Thurmond,
SR

2019), a large database of Drosophila genes and genomes,


WormBase (https://www.wormbase.org), containing genomic
data of Caenorhabditis elegans and related nematodes, and the
Error rate (%)
range per run
Throughput

Read length

UCSC Genome Browser (Kent et al., 2002), a large repository


Advantages
Error type
Read type

of mostly vertebrate genomes. Utilizing sequences from these


databases can save money and/or increase taxon sampling in
(Gb)

ongoing phylogenomic projects.

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 5

Acquisition of DNA sequences: from specimens


to plastic tubes

For a comprehensive overview of insect DNA methods,


see Moreau (2014), which offers a detailed description of
DNA extraction methods using either commercial kits or
phenol/chloroform protocols.
After DNA extraction, specimens should be deposited in
publicly accessible collections in association with their unique
identifier, and publications utilizing these data should always
include unique identifier, repository and specimen metadata
(including specimen collector, date and method of collection,
and geographic origin).
Vouchering specimens with unique identifiers (alphanumeric
database number) is crucial for all phylogenetic projects. There-
fore, nondestructive or partially destructive DNA extraction
methods should be used whenever possible, and in these cases,
the extracted specimen itself becomes the voucher. By con-
trast, when nondestructive DNA extraction is not possible,
such as in transcriptomic projects or small-bodied organisms,
a photographic voucher can be associated with the sequence
data. Moreover, when the destroyed specimen is part of a
sample of conspecifics (e.g. in communal or social insects),
another specimen from the same sample can serve as a voucher,
provided it is made clear that it is not the extracted speci-
men. Properly vouchering specimens used for DNA extraction
greatly increases reproducibility by alleviating issues related
to sample identity and unstable taxonomy (Pleijel et al., 2008;
Turney, 2015).

A vast array of techniques to generate phylogenomic


data

Although large phylogenomic datasets have become increas-


ingly more accessible and cost-efficient in recent years, it is now
widely accepted that simply increasing the amount of sequence
data will not unambiguously resolve some of the most diffi-
cult nodes in the tree of life, mainly due to systematic error
from nonphylogenetic signal or model inadequacy. Appropri-
ate locus selection is therefore crucial in phylogenomics, but
knowledge of the best molecular markers for resolving diffi-
cult branches at various evolutionary depths is still incipient.
Questions still remain about whether to use coding or noncod-
ing sequence data, conserved or highly variable loci, and long or
short alignments (Betancur-R. et al., 2014; Edwards et al., 2016;
Chen et al., 2017). Therefore, one of the most critical decisions
in a phylogenomic project is the sequencing method to be uti-
lized, a decision that must be made a priori as each method will
result in different types of genomic data sequenced. Different
methods have their own characteristics, advantages, and limita-
tions, including cost-effectiveness, ease of use, sample quality
Fig. 2. A generalized phylogenomic project workflow. WGSS,
required, and downstream data filtering and analysis workflow.
whole-genome shotgun sequencing; AHE, anchored hybrid enrichment;
UCE, ultraconserved element; ML, maximum likelihood. [Colour Phylogenomic sequencing methods (Table 2) can be broadly
figure can be viewed at wileyonlinelibrary.com]. subdivided into shotgun sequencing and target enrichment
sequencing. Shotgun sequencing is the process of sequencing

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


6 A. D. Young and J. P. Gillung

Table 2. Main characteristics of different phylogenomics sequencing methods. UCE, ultraconserved element; AHE, anchored hybrid enrichment;
nrDNA nuclear ribosomal DNA.

Whole-genome
UCEs AHE RNA-seq shotgun Genome skimming
Missing data Low Low Medium Very low if high Low
among species coverage
Type and quality High-grade, alcohol High-grade, alcohol Must be preserved High-grade, alcohol High-grade, alcohol
of specimens preserved best; preserved best; specifically for RNA preserved best; preserved best;
needed museum specimens museum specimens work in liquid museum specimens museum specimens
often acceptable sometimes acceptable nitrogen or RNAlater sometimes acceptable often acceptable

DNA template Low Medium High High Low


quality needed
Genome data Targeted nuclear loci; Targeted nuclear and Randomly sequenced Nearly complete Partial or complete
obtained partial or complete mitochondrial loci; loci, vast majority of genome; symbiont mitochondrial
mitochondrial partial or complete nuclear genome; genomes as by-catch. genome; nrDNA;
genomes as by-catch. mitochondrial symbiont loci as Coding and symbiont genomes as
Symbiont loci could genomes as by-catch. by-catch. Coding noncoding sequences by-catch. Coding and
be targeted in the Symbiont loci could sequences only noncoding sequences
probe set. Coding and be targeted in the
noncoding sequences probe set. Mostly
coding sequences

Taxonomic levels All levels from Mostly deep levels Deep levels All levels from All levels from
for phylogenetic shallow to deep shallow to deep shallow to deep
relationships
Relative cost Medium-low Medium High Medium-high, Low
depending on the
coverage and depth

from the entire fragmented genome at random, returning part or Sequencing depth, or the average number of times an individ-
all of the genome depending on the sequencing depth achieved, ual base in the genome is sequenced, is a key concept in shotgun
whereas target enrichment uses bidirectional probes (analogous sequencing. Because the genome is sequenced at random,
to primers in Sanger sequencing) to recover only genomic multiple-copy regions of the genome (i.e. mitochondrial, ribo-
regions of interest. Popular methods of shotgun sequencing somal, and plastid DNA) are sequenced more frequently than
include genome skimming, whole-genome shotgun sequencing single-copy regions. Therefore, when a genome is sequenced at
and transcriptome sequencing (i.e. RNA-seq). Popular methods a relatively shallow depth, only fragments from multiple-copy
of target enrichment for phylogenetics include anchored hybrid regions of the genome are sequenced in sufficient quantities to
enrichment (AHE) (Lemmon, 2012) and UCEs (McCormack be successfully recovered. Shallow-depth whole-genome shot-
et al., 2012; Faircloth et al., 2012) [see also Mandel (2014) gun sequencing is also called genome skimming (Straub et al.,
for an alternative method developed for plants in the family 2012), a time and cost-efficient method of sequencing mostly
Compositae]. These techniques are reviewed briefly in Table 2 mitochondrial, ribosomal and plastid DNA. Conversely, when
and have been covered in more detail elsewhere (e.g. Lemmon near-complete genomes are desired from whole-genome shot-
& Lemmon, 2013; McCormack, 2013; Wen et al., 2015; Zhang gun sequencing, a much greater sequencing depth is required
et al., 2019). in order to sequence sufficient numbers of fragments from
single-copy regions of the genome.
By contrast, in RNA-seq (or transcriptomics) the extracted
Shotgun sequencing mRNA is used as a template to generate reverse-complement
DNA. This reverse-complement DNA is then sequenced, result-
Shotgun sequencing (Fig. 3) involves fragmenting template ing in data generated from only the genomic regions undergo-
DNA into short pieces, which are then randomly sequenced ing active transcription at the time of tissue preservation. This
to obtain reads. Next, various methods and software are used method is therefore not only a genome-reduction strategy, but
to overlap different reads and assemble them into a longer also facilitates the comparison of transcription activity between
DNA sequence called a contig. RNA-seq can be considered a individual tissues, life stages, rearing conditions, etc. One of
special form of shotgun sequencing, where whole mRNA is the major drawbacks of transcriptomics is the high tissue qual-
first extracted and reverse-transcribed into reverse-complement ity required – specimens must be flash-frozen in liquid nitro-
DNA, which is then sequenced. gen or collected directly into RNAlater, thus precluding the

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 7

(A)

(B)

(C)

Fig. 3. Comparison of main shotgun sequencing methods used in phylogenomics. [Colour figure can be viewed at wileyonlinelibrary.com].

Fig. 4. Comparison of target enrichment sequencing methods commonly used in phylogenomics. [Colour figure can be viewed at
wileyonlinelibrary.com].

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


8 A. D. Young and J. P. Gillung

utilization of specimens already available in tissue collections which a substantial amount of quality control is applied such
or museums. that target loci are single copy and have the appropriate amount
of sequence variation to ensure both phylogenetic accuracy and
efficient enrichment (Lemmon, 2012). Full probe kits target
Target enrichment sequencing 500–800 protein-coding loci on average, and marker genes
(also called traditional or legacy genes, e.g. COI) are often
Targeted sequence capture, or target enrichment (Fig. 4), is included in the target pool of loci, which facilitates integration
an umbrella term for multiple efficient, cost-effective methods with previous Sanger sequencing phylogenetic studies.
for generating phylogenomic datasets for nonmodel organisms. Ultraconserved elements, in turn, are highly conserved regions
These methods effectively reduce genomic DNA complexity of the genome shared among evolutionarily distant taxa. As
through the use of short (60–120 bp), single-stranded nucleotide universal genetic markers for particular taxa of interest, UCE
baits or probes that hybridize with template sequences, thus data are useful for reconstructing the evolutionary history and
enabling the recovery of particular sequences of interest with population-level relationships of many organisms. In short,
high coverage. As a result, mostly genomic regions of interest UCEs are identified by aligning several genomes to each other,
are recovered, although nontarget DNA (including mitochon- with subsequent detection and filtering of areas of very high
drial and symbiont sequences) can be present in the result- (95–100%) sequence conservation across all taxa of interest.
ing set of reads. Multiple samples can be multiplexed and There are a number of different ways of identifying UCEs for
sequenced together, which enables the generation of DNA use as genetic markers and designing baits to target them, but
sequence data for hundreds of loci from over 100 samples the most commonly used approach in animal phylogenetics
simultaneously. There are two main methods of target enrich- was described in detail in Faircloth (2017). Ultraconserved
ment commonly used in animal phylogenomics: AHE (Lemmon elements have been shown to perform well when collecting data
et al., 2012) and UCEs (McCormack et al., 2012). Both methods from museum samples (Blaimer et al., 2016; McCormack et al.,
are reduced-representation approaches that rely on the utiliza- 2016), which can greatly facilitate expansion of taxon sampling
tion of a phylogenetically informative subset of the study organ- as sequencing is no longer restricted to fresh specimens. Despite
isms’ genomes. Other target enrichment methods have been pro- older specimens producing fewer and shorter loci in general, it
posed, with varied locus selection criteria, including Hyb-Seq is still possible to retrieve hundreds of markers from relatively
(Weitemier et al., 2014), Compositae COS loci (Mandel et al., old specimens (McCormack et al., 2016).
2014) and RELEC (Karin, 2019), but AHE and UCEs have thus A major advantage of UCEs over AHE data is phyluce, an
far been the most commonly used methods in phylogenomic user-friendly and open-source software pipeline developed for
studies of animals. the processing and analysis of UCE data (Faircloth, 2016). phy-
In these methods, probes hybridize in-solution with targeted luce contains several software packages and tutorials that are
anchor sequences, which are subsequently enriched. Both the extremely helpful and accessible, especially to beginners. These
genomic regions targeted by the probes and their flanking resources, however, require some familiarity with working in
regions are sequenced, such that both conserved and more the command-line environment of a Unix or Unix-like system.
variable (and thus more phylogenetically informative) regions For a comprehensive and user-friendly guide to working on
are sequenced at once. These reduced-representation methods the command line see the Happy Belly Bioinformatics tutori-
enable researchers not only to capture the same sets of loci als (https://astrobiomike.github.io/unix/). For a comprehensive
across all taxa of interest, but also to exclude repetitive or and informative overview of the theory and practice of UCEs
phylogenetically misleading regions of the genome, including for arthropod phylogenomics, see Zhang et al. (2019).
pseudogenes and paralogues. A great benefit of consistently
using the same sets of markers across studies is that it allows
for meta-analyses as more phylogenomic data accumulate over Building phylogenomic data: quality control
time. By contrast, for RNA-seq datasets to be consistent, the and assembly of raw reads
transcriptomes must be gathered from the same tissue types, and
it can be challenging to accurately assess orthologues and align Performing quality-control on raw reads obtained from
different isoforms during analysis. high-throughput sequencing is a crucial, yet sometimes over-
Anchored hybrid enrichment sequencing targets mainly looked step. Reads should ideally be inspected for sequence
protein-coding regions of the genome, meaning that enriched quality, guanine-cytosine (GC) content, presence of adapter
loci comprise mostly coding, and in some cases intronic or sequences and read errors (i.e. base calling errors and small
other genomic elements (e.g. untranslated region). This means insertions/deletions). Several programs have been proposed
that phylogenetic data can be more easily coded and analysed for quality control of NGS data, including fastqc for Illu-
both as nucleotides and as amino acids. Candidate target loci mina reads (Andrews, 2010), and ngsqc for reads from all
are identified using genomes, transcriptomes or raw genomic other platforms (Dai et al., 2010). These two resources offer
reads from two or more reference species. Then, sequences for a means to detect and visualize potential errors, after which
the target loci for each reference species to be included in the a complementary software, such as trimmomatic (Bolger
probe kit design are isolated, and an alignment for each locus is et al., 2014), can be used to trim low-quality bases and remove
generated. Probes are developed based on these alignments, for adapter contamination. While trimmomatic offers an all-in-one

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 9

solution to sequence quality control, being the most commonly As assembly is a major step in any phylogenomic analysis,
used in animal phylogenomics, other similar programs have problems at this stage (incomplete assembly, assembly errors
been proposed. cutadapt, for instance, is used only for read and redundancy) can lead to difficulties in downstream work-
trimming, but is especially useful for working with sequences flows, including orthologue and paralogue identification, align-
obtained from Applied Biosystems’ SOliD sequencer (Martin, ment and matrix construction. These problems increase the
2011). Likewise, scythe is a Bayesian approach for removing amount of missing data in the final aligned matrix, ultimately
adapters from the 3′ end of sequences, where read quality is limiting the amount of useful data. For the best assembly,
often degraded (https://github.com/vsbuffalo/scythe). sickle high coverage, long read lengths, and good read quality are
is a standalone program used only for read quality trimming all needed. However, current sequencing technologies do not
(Joshi & Fass, 2011). Finally, htstream consists of a com- inherently possess all three; for instance, Illumina sequencing
prehensive quality control pipeline that allows streaming from results in good-quality reads, but short length, while pacbio
application to application (https://github.com/ibest/HTStream). sequencing generates very long reads, but with low quality
The software can simultaneously handle both single-end and (see Table 1).
paired-end reads and enables process parallelization.
Once quality control has been performed on raw reads, the
next step is the assembly of contigs. Sequence assembly refers
to aligning and merging small DNA fragments obtained from Assessing common ancestry in molecular sequence
a high-throughput sequencing platform in order to recon- data: a case for orthology
struct longer DNA sequences. Sequence assembly is necessary
Phylogenetic relationships should always be estimated based
because whole genomes cannot be sequenced in one go, but
on sequences that are related by orthology, i.e. whose common
rather small pieces of DNA of approximately 20 000–30 000
ancestor diverged as a result of speciation (orthologues), rather
bases in length are sequenced at a time, depending on the
than a duplication event (paralogues) (Fitch, 1970). Genes
technology used. These short DNA pieces are called reads, and
arising from duplication events complicate the inference of a
these reads are then assembled into longer DNA sequences
species tree based on concatenated gene alignments, because
called contigs.
There are two main techniques of genomic assembly: de the gene tree describing the relationships among paralogues may
novo and reference-based. De novo assembly methods consist differ from the species tree.
of constructing, simplifying and resolving de Bruijn graphs to Orthology assessment has become a central problem for evo-
extract contigs, and no reference genome is needed [see Com- lutionary and molecular biologists. In phylogenetic inference, it
peau (2011) for a general introduction to de Bruijn graphs]. In is assumed a priori that the loci used to infer evolutionary rela-
the case of reference-based methods, a previously assembled tionships are orthologues, and the violation of this assumption
genome is used as a reference to which sequenced reads are can lead to phylogenetic error.
independently aligned. Ultimately, almost every read is placed
at its most likely position, and in contrast to de novo assem- There are two main categories of orthology prediction:
bly, no synergies between reads occur. De novo assembly is (i) similarity or graph-based methods (Altenhoff & Dessi-
often the preferred method in phylogenomics, as it does not moz, 2009); and (ii) phylogeny-based methods (Gabaldón,
require a fully assembled reference genome. A multitude of 2008). Both methods can be computationally intensive, as
methods for de novo assembly have been developed, and the the exhaustive pairwise comparisons for blast-based meth-
field is constantly evolving. Performance of different methods ods and the tree inference for the phylogeny-based meth-
depends greatly on data type and species to be assembled, and ods impose great computational burden. However, compara-
each method has its own set of tradeoffs, including computer tive studies have shown that phylogeny-based approaches have
memory needed and computational complexity exhibited. The much higher specificity, particularly for detecting orthologues
most commonly used programs for de novo assembly in phy- among distantly related taxa (Chen et al., 2007; Altenhoff
logenomics include velvet (Zerbino & Birney, 2008), trinity & Dessimoz, 2009).
(Grabherr, 2011), soapdenovo (Li et al., 2010), spades (Banke- In similarity-based methods, loci are partitioned into groups
vich, 2012) and abyss 2.0 (Jackman et al., 2017). For a compre- of orthologues based on the assumption that genes that share
hensive comparison of them, see Narzisi & Mishra (2011) and similar sequences are orthologous. The precise definition of
Hölzer & Marz (2019). Finally, the software atram 2.0 (Allen a group of orthologous genes, however, depends on the par-
et al., 2015) enables easy manipulation of whole-genome, ticular method used, and several methods and algorithms to
genome-skimming and transcriptomic data. The program imple- cluster orthologous genes have been proposed, including blast
ments most of the aforementioned assemblers in a pipeline that searches, Markov clustering, spectral clustering and hierarchical
enables automation and parallelization of assembly tasks. In orthology. Several similarity-based methods are commonly used
addition, the program performs targeted assembly of contigs of in phylogenetics and functional genomics, including orthomcl
interest, which can greatly reduce the computational scale of (Li, 2003), cog (Tatusov et al., 2003), inparanoid (O’Brien,
the assembly problem compared with full genome/transcriptome 2005), oma (Roth, 2008), proteinortho (Lechner et al.,
de novo assembly, as well as avoiding errors introduced during 2011), eggnog (Powell et al., 2012), orthofinder (Emms &
whole-genome/transcriptome assembly. Kelly, 2015), orthograph (Petersen et al., 2017) and orthodb

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


10 A. D. Young and J. P. Gillung

(Kriventseva et al., 2019). By contrast, phylogeny-based meth- Describing the process of molecular evolution:
ods aim to identify orthologues by inferring the lowest common model selection and partitioning
ancestor relationships between genes, which is performed by
estimating a gene tree and identifying its speciation and duplica- Evolutionary models are at the centre of molecular sequence
tion events using reconciliation with a species tree (Ullah et al., data analyses. Parameter inference, whether performed within
2015). Commonly used phylogeny-based methods of orthology a maximum likelihood (ML) or Bayesian inference (BI) frame-
prediction in phylogenomics include loft (van der Heijden work, relies on the explicit definition of the substitution process,
et al., 2007), coco-cl (Jothi et al., 2006), orthologid (Chiu and different models and parameters account for different pro-
et al., 2006) and phylotreepruner (Kocot et al., 2013). For cesses of evolution.
a comprehensive review and comparison of different orthology
Selecting the model that best describes the evolutionary
prediction methods, see Salichos & Rokas (2011) and Trachana
process that generated the data at hand is of utmost importance,
et al. (2011).
and failure to do so can lead to erroneous estimates of topology
and branch lengths.
Construction of a phylogenomic dataset: alignment Model selection is most commonly performed under a fre-
and concatenation quentist framework, where the fit of the data to each substitu-
tion model (together with the substitution model parameters,
Following assembly and orthology prediction, the next step in a tree topology, and branch lengths) is assessed through itera-
phylogenomic workflow is the alignment of sequence data. For tive optimizations of the likelihood function, where ML scores
phylogenomic-scale projects, two of the most commonly used of each model are compared through one of several possible
multiple-sequence alignment programs are clustal (Larkin criteria. It is worth mentioning that the partition scheme is
et al., 2007) and mafft (Katoh & Standley, 2013). Broadly, part of the model, and so selecting a partition scheme is also
both methods produce a pairwise distance matrix from input part of model selection. Some of the most widely used crite-
sequences, construct an initial guide tree from the distance ria include the hierarchical and dynamic likelihood ratio tests
matrix, and then align sequences based on the estimated tree. (hLRT and dLRT, respectively) (Posada & Crandall, 2001), the
The first alignment produced is then scored according to a Akaike information criterion (AIC; Akaike, 1974), the corrected
specified criterion, and used to produce a new guide tree, which AIC (AICc; Sugiura, 1978), the Bayesian information crite-
is in turn used to produce an updated alignment that is then rion (BIC) (Schwarz, 1978), and the decision-theory criterion
rescored, until a best-scoring final alignment is reached in an (DT) (Minin et al., 2003). Current widely used model selec-
iterative process. tion programs available for phylogenomics are based on the fre-
mafft L-INS-i, E-INS-i and G-INS-i algorithms are most quentist framework and employ several of the aforementioned
appropriate for phylogenomic datasets as they were designed information criteria. Most commonly used programs in phyloge-
to handle a large number of sequences at once. If sequences nomics include partitionfinder2 (Lanfear et al., 2016) and
have full length but low similarity, then the global option is modelfinder (Kalyaanamoorthy et al., 2017) (which is imple-
appropriate (G-INS-i). Alternatively, if sequences have one mented in the program iqtree; Nguyen et al., 2015). These
similar core domain in between flanking unalignable regions, methods scale relatively well for phylogenomic-scale datasets
then the local affine gap option works best (L-INS-i). Finally, (using the rcluster algorithm in partitionfinder; Lanfear
if sequence data are composed of multiple alignable domains et al., 2014) and require a concatenated dataset coupled with a
with varied length, then the local generalized affine gap scheme partition file for data partitioning as input.
is more appropriate (E-INS-i). For most phylogenomic datasets, Alternatively, under the Bayesian approach, model selection
mafft E-INS-i algorithm is probably more appropriate, as it can be performed using the ratio of the marginal likelihoods of
makes the fewest assumptions about the data, and is better two models (i.e. Bayes factor; Goodman, 1999). The ratio of the
at handling unalignable sections of data within a sequence. marginal likelihoods quantifies the strength of evidence for one
However, several widely used alignment pipelines (including model being more appropriate to describe the data than the other.
phyluce) utilize the L-INS-i algorithm by default. For small to Several methods that estimate the Bayes factors or the marginal
medium-sized projects, both mafft and clustal have online likelihood for model selection in phylogenetic analyses have
servers where sequence data can be uploaded and aligned been proposed, with variable tradeoffs between computation
remotely. times and accuracy (Suchard, 2001; Huelsenbeck & Rannala,
Finally, individually aligned loci can be concatenated into a 2004; Lartillot & Philippe, 2006; Fan et al., 2010).
supermatrix for use in concatenation-based analyses (see later). Protein-coding genes can be analysed as amino acids,
Several software packages and scripts facilitate concatenation nucleotides or codons, and choosing which data type to analyse
of aligned loci, including scafos (Roure et al., 2007), phyluce in phylogenomics is a challenge that can significantly affect the
(Faircloth, 2016), and amas (Borowiec, 2016). These programs results. The statistical analysis of amino acid and nucleotide
generate a partition file delimiting the boundaries of individ- data are fundamentally different (Huelsenbeck et al., 2008).
ual loci within the supermatrix, which are then used for both Model parameters of nucleotide models are estimated from
partitioning and model selection in downstream phylogenetic the data, while in amino acid models most of the parameters
analysis. are fixed to specific values estimated from a large number

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 11

of sequences and predetermined by the model (e.g. Dayhoff different models of evolution are applied to individual loci, the
et al., 1978; Jones et al., 1992; Le et al., 2012; etc.) (but see tree is linked across all partitions and a single species tree results
+F variants of models implemented in partitionfinder2 from the analysis. This assumption, however, may be incorrect,
and iqtree, where stationary frequencies are estimated from as biological processes such as hybridization and incomplete
the data). Although the fixed amino acid models reduce the lineage sorting (ILS) can lead to discrepancy between gene trees
number of free parameters to be estimated, leading to less and the species tree (Degnan & Rosenberg, 2006; Edwards,
computationally intensive analyses, it is possible that even 2009). To overcome this issue, coalescence-based approaches
the best-fitting fixed amino acid model is not appropriate have been proposed as an alternative, where gene trees are
for the data at hand. Several studies have shown that using allowed to vary across genes, and the assumption that all
an inadequate model may result in erroneous estimates of genes have the same history is relaxed (Liu et al., 2015). Of
phylogeny (e.g. Buckley, 2002; Lemmon & Moriarty, 2004; these approaches, the multispecies coalescent model (Rannala
Cooper, 2014), and proper model selection is an integral & Yang, 2003) is the most popular for phylogenomic data
part of the phylogenetic reconstruction procedure, especially analysis.
when analysing amino acids. In the case of nucleotides, it
has been suggested that using the most parameter-rich model
(GTR + I + G) instead of conducting thorough model selec- Concatenation-based analyses
tion leads to phylogenies as accurate as those obtained when
model selection is performed. This implies that the stan- Commonly used ML software includes garli (Zwickl, 2006),
dard practice of model selection before phylogeny inference raxml (Stamatakis, 2014), iq-tree (Nguyen et al., 2015), and
is unnecessary when a complex and parameter-rich model examl (Kozlov et al., 2015). raxml and iqtree are both
is applied to nucleotide data (Abadi et al., 2019). However, relatively user-friendly, with well-annotated manuals available
it is worth mentioning that combining a gamma distribution on their respective webpages. iqtree, in particular, is extremely
(+G; Yang, 1993) with a proportion of invariant sites (+I; Fitch well-documented and includes a variety of tutorials available on
& Margoliash, 1967) to account for among-site rate hetero- the developer’s website. examl and garli, in turn, require more
geneity has been highly criticized, mainly because some of familiarity with basic scripting and working in command-line
the parameters of the +I and + G models cannot be optimized environments. examl is well suited to the analysis of extremely
independently of each other (Yang, 2006; Sullivan, 1999; May- large datasets, while garli is notable for including a wide range
rose, 2005; Jia et al., 2014). Thus, it is more advisable to apply of models of codon evolution.
a GTR + G mixture model for each partition, as opposed to a The most well-known BI software includes mrbayes (Ron-
GTR + I + G. quist et al., 2012b), exabayes (Aberer et al., 2014), phy-
lobayes (Lartillot et al., 2013), revbayes (Höhna et al., 2016),
beast (Suchard et al., 2018), and beast2 (Bouckaert et al.,
Estimating evolutionary history: phylogenetic trees 2019). mrbayes is a well-established program, with very helpful
documentation and many online tutorials available. phylobayes
Model-based approaches have dominated phylogenetic studies implements the CAT model, a site-heterogeneous mixture model
in recent decades, with ML and BI used most widely. Parsi- developed to account for site-specific features of protein evo-
mony analysis, by contrast, is not suitable for phylogenetic lution (Le et al., 2008). revbayes, in turn, allows for greater
analysis of molecular data, as it is prone to misinterpretation user control over individual parameters but may require more
of identical yet homoplastic character states as homologous, significant time investment from new users, at least initially.
resulting in incorrect grouping of taxa with increasing support beast and beast2 are primarily used to co-estimate a species
as additional data are included, a property referred to as incon- tree and divergence times and are discussed in more detail sub-
sistency (Felsenstein, 1978). This problem is most prevalent sequently. Of all these programs, the most appropriate for large
in the so-called ‘Felsenstein zone’, a region of theoretical tree phylogenomic datasets are revbayes and exabayes, which are
space in which divergent nonsister taxa with parallel charac- particularly suitable for mass parallelization and multithreading
ter state transformations are joined by long-branch attraction on computer clusters.
(Huelsenbeck & Hillis, 1993). Moreover, parsimony methods
fail to make explicit an underlying model of evolution. For
an in-depth review of the strengths and weaknesses of par- Coalescent-based analyses
simony and model-based phylogenetic methods, see Yang &
Rannala (2012), and for an introduction to Bayesian statistics The most commonly used coalescent-based methods in
in phylogenetics see Nascimento et al. (2017). phylogenomics are called ‘shortcut’ or summary coalescence
Broadly, model-based phylogenetic methods can be divided approaches [e.g. stem (Kubatko et al., 2009), mdc (Than &
into concatenation (or supermatrix) and coalescent-based Nakhleh, 2009), steac (Liu, 2009), star (Liu, 2009), mp-est
approaches. Concatenation-based methods infer phylogenies (Liu, 2010), svdquartets (Chifman & Kubatko, 2014) and
from a single combined gene matrix and assume that all genes astral (Zhang, 2018)], wherein gene trees and the species
have the same, or very similar, evolutionary histories (Huelsen- tree are not co-estimated. In fact, these methods do not analyse
beck et al., 1996). Even when the dataset is partitioned so that sequence data directly, and instead use individual gene trees

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


12 A. D. Young and J. P. Gillung

as inputs to calculate a single species tree while accounting alignment). This means that only those sites that are assumed
for variation across genealogies. Coalescence methods that to evolve under a single model of evolution are grouped during
co-estimate gene trees and the species tree, such as *beast analysis. Partitioning allows the incorporation of heterogeneity
(Heled & Drummond, 2010), are more desirable, but currently in models of molecular evolution, freeing parameter values from
are not computationally tractable for large datasets (see Liu being joint estimates across all of the data in a particular dataset.
et al., 2015 for a comprehensive review). astral and svdquar- Model selection as traditionally performed in phylogenetics and
tets, in particular, are fast and user-friendly, and have been phylogenomics allows for the identification of the best-fitting
shown to be both statistically consistent under the multispecies model among a pool of candidate models, which is then applied
coalescent model and to perform well under high ILS conditions to each data partition separately. However, this procedure only
(Chou et al., 2015). enables the selection of an appropriate model when compared
However, these summary coalescent methods have been criti- with the other candidate models (i.e. relative model fit), but
cized (Kubatko & Degnan, 2007), as they are unable to account it does not inform us whether the best-fitting model is indeed
for causes of gene tree – species tree conflict other than ILS. appropriate for the data or not (i.e. absolute model fit). Absolute
Of primary concern is recombination, as it violates one of the model fit tests employing posterior predictive simulation (and
primary assumptions of the multispecies coalescent model and related approaches) have the potential to fill an important gap in
is likely to have occurred in long gene fragments. Conversely, phylogenetic methodology by assessing a model’s absolute fit to
if analyses are limited to short loci where recombination is a given dataset (Bollback, 2002; Brown, 2014). Assessing the fit
unlikely, gene-tree estimation error can increase (Springer & of different models to empirical data in such a way that the data
Gatesy, 2014). A possible workaround to this issue, known as can reject the fit of all models should be a fundamental step in the
statistical binning, was proposed by Mirarab et al. (2014), where phylogenomics pipeline. These methods are being implemented
genes with similar gene-tree topologies are grouped together in a growing number of phylogenetic software, making them
and analysed through concatenation, before using these concate- easier to apply as a routine step in phylogenetic analyses [e.g.
nated supergene trees as inputs for coalescent analyses. phylobayes (Lartillot et al., 2009) and revbayes (Höhna et al.,
2016)]. Posterior prediction allows for the removal of portions
of the data that cannot be adequately modelled, thus improving
I have a phylogeny! What now? overall accuracy of phylogenetic estimations (Doyle, 2015).
Evaluating the reliability of a given phylogenetic tree is just
Phylogenomic studies have greatly improved our understanding as important as the phylogenetic estimate itself. Measures of
of the tree of life and show remarkable promise for resolving branch support indicate which parts of the tree have greater
incongruences in previous analyses based on limited numbers of credibility when interpreting the evolution of a group and can
loci. As previously mentioned, large phylogenomic datasets are pinpoint outstanding questions where data collection is needed
not as susceptible to stochastic (or sampling) error as analyses to resolve remaining uncertainties. Assessing the uncertainty
of datasets with fewer characters per taxon, and generally result in individual clades on a tree has the benefit that it allows
in well-resolved and well-supported estimates of phylogeny. the researcher to evaluate specific hypotheses of monophyly.
However, phylogenomic analyses are vulnerable to systematic The reliability of a phylogenomic hypothesis can be assessed
error – biases caused by a failure of the method rather than a using frequentist (i.e. ML) and Bayesian approaches. In the ML
lack of data (Kumar et al., 2012). In fact, adding more data framework, support values are estimated using nonparametric
can exacerbate the effect of systematic error by increasing bootstrapping (Efron, 1979; Felsenstein, 1985), a procedure
support for erroneous relationships (Brown & Thomson, 2016). that involves the random resampling (with replacement) of
Systematic error can arise as a result of improper modelling characters (i.e. columns in an alignment of DNA sequences)
of biological phenomena (e.g. heterotachy, compositional bias, from the original data to generate pseudo-replicate data matrices
rate heterogeneity, ILS, horizontal gene transfer; Romiguier identical in size to the original matrix. These pseudo-replicates
et al., 2016) or methodological issues (e.g. poor model selec- are then subjected to the same phylogenetic analyses as the
tion, biased taxon or gene sampling, poor orthology inference; original dataset. Bootstrap support for a group of interest is
Hosner et al., 2016), both of which can lead to biased or calculated as the proportion of times that the group is obtained in
incorrect parameter estimates in phylogenomic analyses. the pseudo-replicates; for instance, if a group appeared in 95% of
the analyses of the bootstrapped data matrices, then its bootstrap
It is critical to examine phylogenomic analyses for evidence of
support value is c. 95%.
systematic error, even when the datasets are large and support
While posterior probabilities have a clear-cut interpretation,
for phylogenetic relationships is high.
representing the probability that the phylogenetic estimate is
Poor model fit seems to be one of the major sources of sys- true given the model, the priors and the data (Huelsenbeck
tematic bias in phylogenomics. When the model fails to account et al., 2002), bootstrap values are not as easily interpreted in
for important features of the data, phylogenetic inferences and phylogenetics (Huelsenbeck & Rannala, 2004). Nonetheless,
measures of confidence (i.e. branch support) can be inaccurate. most practising systematists erroneously interpret bootstrap
One way of incorporating model complexity and thus increas- support as the probability that the clade is correct – even though
ing model fit to phylogenomic analysis is data partitioning (i.e. a high bootstrap proportion is probably associated with correct
when different models are applied to different portions of the clades more often than low bootstrap proportions. In fact, only

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 13

the Bayesian approach directly addresses the probability that a Beyond topologies: unravelling patterns
clade is correct, conditional on the observations. In other words, of biodiversity
the posterior probability of a tree can be interpreted as the
probability that the tree is correct given the data and the model For those interested in investigating additional questions besides
(Huelsenbeck et al., 2001), but bootstrap values cannot. estimating the evolutionary relationships of taxa of interest,
Both posterior probabilities and bootstrap values are sensitive there is a wide range of supplemental analyses available, includ-
to model misspecification and can be highly inflated when the ing divergence time estimation, ancestral state reconstruction,
model is oversimplified (e.g. when an overly simplistic model patterns of diversification, biogeography, and others.
is utilized, or a concatenated alignment is not adequately
partitioned).
Divergence time estimation
The sensitivity of posterior probabilities, however, seems to
be greater than the sensitivity of bootstrap values (Huelsen- Divergence time estimation is concerned with establishing the
beck & Rannala, 2004; Lemmon & Moriarty, 2004; Brown & timeline of evolution of a particular group, therefore allow-
Lemmon, 2007; Brown & Thomson, 2016). Bayesian support ing cladogenesis (i.e. speciation) to be correlated with past
values were shown to be consistently higher than bootstrap geological events, changes in climate, mass extinctions or the
values for the same clades both in simulations and in empirical diversification of other organisms. Divergence time estimation is
analyses (Erixon et al., 2003; Huelsenbeck & Rannala, 2004). a complicated, but often essential, step for many phylogenomic
Posterior probabilities are more sensitive to underspecification projects. Accurate estimates of species divergence times are vital
of the evolutionary model than to overspecification, meaning to understanding historical biogeography, estimating diversifi-
that it is more detrimental to phylogeny estimation to use an cation rates and identifying the causes of variation in rates of
oversimplified model as opposed to an overly complex one molecular evolution.
(Huelsenbeck & Rannala, 2004). Thus, appropriate model fit Originally, phylogenies were dated by assuming a constant
is necessary for both reliable phylogeny estimation and inter- molecular clock, with mutation or substitution rate estimated
pretation of results, especially in the presence of confounding with reference to the fossil record (Zuckerkandl & Pauling,
factors – the methodological and biological phenomena that 1962). Since then, new and more sophisticated methods have
cause model violation. In the last few years, much effort has been introduced, as it has become clear that the rate of molecu-
been devoted to uncovering and understanding the effects lar evolution varies significantly over time and among lineages.
of confounding factors in phylogenomic reconstruction (see Thus, it is now standard practice to accommodate such rate vari-
Table 3). When the model fails to account for important fea- ation using relaxed-clock models (Drummond et al., 2006; Yang
tures of the data, inferences and measures of confidence can be & Rannala, 2006). There are two main alternative approaches to
inaccurate, and as the complexity of datasets grows with size, estimating divergence times in phylogenomics: node-dating and
so does the potential for poor model fit to bias phylogenetic tip-dating methods. In node-dating approaches, the age and phy-
inference. logenetic placement of fossil species are used to calibrate partic-
Given the unreliability of branch support values under some ular nodes in the phylogeny, and those calibration points are then
circumstances, it is desirable to perform additional analyses utilized to estimate the ages of the remaining nodes. In tip-dating
to verify the robustness of phylogenetic estimates obtained approaches (e.g. implementing the fossilized birth-death topol-
with either ML or BI. One such approach involves the cal- ogy and node-age prior; Heath et al., 2014), multiple fossils per
culation of gene and site concordance factors, which allow lineage or node can be utilized, and fossil placement is estimated
for the assessment of the fraction of loci and alignment sites based on their ages, without the necessity for morphological data
supporting a particular branch (Ané et al., 2007; Minh et al., or prior age densities on fossils. Total-evidence dating, proposed
2018). Importantly, this procedure highlights underlying dis- by Ronquist et al. (2012a), is a special case of tip-dating where
agreement or discordance in phylogenomic datasets, serving fossils are included in the analysis along with the extant taxa,
as a useful complement to bootstrap values and posterior so the phylogenetic tree, divergence times and fossil placement
probabilities, especially for large phylogenomic datasets. are jointly estimated by combining morphological and molecu-
This method is implemented in the software iq-tree, and a lar data. There are advantages and tradeoffs for both approaches,
user-friendly tutorial is freely available from the developer’s but node-dating approaches have been recently criticized mostly
website. Another method to assess the statistical robustness due to unpredictable and undesirable interactions among fossil
of phylogenomic estimates is four-cluster likelihood-mapping, calibration densities (i.e. user-specified prior ages) (Brown &
an approach that reveals the proportion of taxon quartets in a Smith, 2018) [see Arcila et al. (2015), Matzke & Wright (2016)
data matrix that supports each of the three alternative topolo- and Smith et al. (2018a) for a review of the major concepts and
gies around a specific branch of interest (Strimmer & von issues related to divergence dating in phylogenomics].
Haeseler, 1997). This approach allows for hypothesis-testing In node-dating approaches, estimating divergence times
of alternative relationships and the visualization of the phy- requires a phylogenetic tree and associated branch length
logenetic information contained in a sequence alignment. information, and a list of fossil taxa of reliable age and phylo-
Four-cluster likelihood-mapping can also be easily carried out genetic placement to be used as calibration points. Depending
in iq-tree. on the focal organism, there may be little information in the

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


14 A. D. Young and J. P. Gillung

Table 3. Major confounding factors in phylogenomic analyses and the strategies to mitigate them

Sources of systematic
error Strategies to overcome Main references

Compositional Remove the loci or taxa exhibiting extreme deviation in Jeffroy et al. (2006); Philippe et al. (2011); Duchêne
heterogeneity composition. Alternatively, explicitly modelling et al. (2017); Borowiec (2019)
compositional heterogeneity may be more satisfactory.
Missing data Design smaller (i.e. with fewer loci) but more complete Roure et al. (2013); Hosner et al. (2016); Kocot et al.
datasets, as opposed to larger (i.e. with more loci) but (2017); Smith et al. (2018b)
sparser datasets. Analyse subsamples of the data to see
if missing data cause conflict.
Branch-length Perform analysis with and without taxa and genes most Nosenko (2013); Struck (2014); Kück & Wägele
heterogeneity likely to be susceptible to long-branch attraction (e.g. (2015); Qu et al. (2017)
remove rapidly evolving sites or taxa).
Paralogy Filter and remove paralogous loci using an appropriate Struck (2013); Betancur-R (2014); Siu-Ting (2019)
method for orthology prediction. Carefully examine
gene topologies to detect paralogues.
Heterotachy Removal of most varied sites from alignment. Use Zhong (2011); Bouckaert & Lockhart (2015); Kocot
evolutionary models that explicitly model heterotachy et al. (2017); Crotty et al. (2019)
(e.g. ghost, implemented in iqtree).
Gene tree Use coalescence approaches. Perform statistical Betancur-R (2013); Copetti (2017); Richards (2018)
heterogeneity binning, tree concordance analysis.

fossil record, such that obtaining fossils with reliable age their common ancestors. These methods enable the estima-
and phylogenetic placement may be challenging. Therefore, tion of the number of times a particular phenotype evolved,
palaeontological literature should be carefully reviewed so the approximate timing of major evolutionary events, and
that fossil placement is critically evaluated. See Parham et al. missing phenotypic characteristics of fossil species. Ancestral
(2012) for a discussion on the best practices for fossil selection state reconstruction can also help to contextualize correlated
for divergence time estimation studies. A valuable resource shifts between phenotypic and ecological characters, as well as
for generating a list of potential fossil taxa is the Fossilworks perform phylogenetic prediction, in which phenotypic values
Paleobiology Database (https://fossilworks.org), a repository for unobserved or incompletely sampled taxa can be estimated.
of fossil species containing both their age and phylogenetic Finally, ancestral state reconstruction can be used to recover
placement. ancestral DNA sequences, the ancestral amino acid sequence of
Divergence time estimation in phylogenomics is generally a protein, the composition of a genome (including gene order
performed using BI and penalized likelihood methods. Com- and copy number), as well as the geographic range of an ances-
monly used Bayesian divergence time estimation software tral population or species. However, as with other types of anal-
includes mrbayes (Ronquist et al., 2012b; Zhang, 2016), yses detailed here, it is important to carefully consider the scien-
mcmctree (part of the paml package; Xu & Yang, 2013), tific question that is being asked and which analyses and meth-
phylobayes (Lartillot et al., 2013), beast (Suchard et al., ods are suitable to answering it. It is not advisable to tack on
2018), and beast2 (Bouckaert et al., 2019). mrbayes, beast analyses such as ancestral state reconstruction unless they con-
and beast2 enable the co-estimation of the phylogeny and tribute to answering the question(s) the study seeks to address.
divergence times and allow for the implementation of both This is doubly true when said analyses are performed without
node-dating and tip-dating approaches, including total-evidence due diligence (e.g. using parsimony mapping when model-based
dating (where morphological data can be used for fossil place- techniques are available, not performing model testing, etc.).
ment). Comprehensive and user-friendly tutorials are available Ancestral state reconstruction in phylogenomics can be per-
for mrbayes (Zhang, 2017), mcmctree (https://github.com/ formed in a ML or BI framework. Analyses using ML can be
mariodosreis/divtime) and beast2 (Barido-Sottani et al., 2017). performed in bayestraits (Pagel et al., 2004), the ‘ace’ func-
Divergence time estimation using penalized-likelihood include tion in the r package ape (Paradis, 2004), mesquite (Maddison
‘r8s’ (Sanderson, 2003), ‘treePL’ (Smith & O’Meara, 2012) & Maddison, 2018), the ‘FastAnc’ function in the r package
and the ‘chronos’ function of the R package ape (Paradis, 2004; phytools (Revell, 2012), and the ‘rayDISC’ function in the
Paradis, 2013). r package corhmm (Beaulieu et al., 2013). Analysis under BI
can be performed in bayestraits, revbayes and mesquite.
These methods take a phylogenetic tree with branch length
Ancestral state reconstruction information and a trait file containing the character states for
each taxon as input, and transition rates and ancestral states for
Ancestral state reconstruction is the extrapolation back in two or more characters are then calculated. mesquite and ‘ray-
time from observed characters of individuals or species to DISC’ can only analyse discrete (i.e. categorical) characters and

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 15

implement only two to three models of trait evolution, while methods for assessing the relative fit of models of lineage diver-
bayestraits and revbayes enable much greater flexibility, sification, and phylogeny simulation for the generation of null
allowing the analysis of both discrete and continuously varying hypotheses (Höhna, 2015).
traits, and implementing a variety of models of character evolu- It has been shown that incomplete and biased taxon sampling
tion. In addition, these implementations can be used in a single can cause unpredictable effects on inferences of diversification
phylogenetic tree (so only uncertainty about model parameters rates, particularly in large empirical datasets (Höhna, 2014; Title
is explored), or it can be applied to a posterior sample of trees, & Rabosky, 2017). For instance, Sun et al. (2019) found out
in which model parameters are jointly estimated and phyloge- that, as compared with full empirical sampling, representative
netic uncertainty is taken into account. Bayesian character state and random sampling schemes either under- or overestimate
estimation is preferred over ML approaches as it accounts for speciation rates at varying degrees, depending on the method
uncertainty in both model parameters and phylogeny (Pagel, and sampling scheme utilized.
2004; Vanderpoorten & Goffinet, 2006). For a review and com-
Caution in the inference and interpretation of diversification
parison of the methods of character state reconstruction, see
estimates is necessary, especially when using very sparse or
Royer-Carenzi (2013), Joy et al. (2016) and Royer-Carenzi &
incomplete taxonomic sampling – regardless of method utilized.
Didier (2016).
Because of this, it is also advisable to perform sensitivity
analyses to verify whether missing data might cause bias in the
Diversification rate estimation empirical results.

Rates of speciation and extinction vary greatly through time


and among lineages, leading to extensive heterogeneity in Biogeography
species richness across the tree of life. By linking diversification
rate to traits of interest, evolutionary biologists can explore a Historical biogeography is concerned with understanding the
variety of hypotheses in a quantitative manner. For instance, distributional patterns of organisms across the globe, as well as
if a clade of interest possesses a synapomorphy (i.e. novel the processes that have produced such distributions. In recent
evolutionary trait) that constitutes a putative key innovation, years, biogeography has transitioned from parsimony-based
diversification rate analysis can be used to determine whether approaches (i.e. cladistic biogeography; Parenti, 2006) to a
that clade actually diversified at a higher rate than related statistical framework, where parametric models are used to
lineages. Several approaches have been proposed to estimate explore a much broader range of hypotheses as compared with
lineage-specific rates of diversification, which can be broadly parsimony-only approaches (Ree & Sanmartín, 2009).
subdivided into two main categories: model-based approaches Since then, several biogeographic models have been proposed,
(i.e. methods that utilize models that are parameterized with including statistical dispersal-vicariance analysis (S-DIVA;
speciation and extinction rates) and nonmodel-based approaches Ronquist, 1997; Yu et al., 2010), dispersal – extinction –
(i.e. methods that simply rely on branch lengths and splitting cladogenesis (DEC; Ree & Smith, 2008), bayarea (Landis
events) (Wiens, 2017). et al., 2013) and Bayesian binary Markov chain Monte Carlo
Model-based approaches are dominant in phylogenomics. (MCMC) analysis (BBM; Yu et al., 2015). In S-DIVA, ances-
For example, Bayesian analysis of macroevolutionary mixtures tral distributions in a phylogeny are reconstructed based on a
(bamm; Rabosky, 2014) is widely used in empirical studies. One three-dimensional cost matrix, in which extinctions and dis-
of the major advantages of bamm over nonmodel methods is persals cost more than vicariance events. The biogeographic
that it can accommodate heterogeneity in the diversification rate histories with the lowest cost are selected, while accounting for
through time and among lineages in a clade. bamm simulates a both phylogenetic and DIVA optimization uncertainty. In turn,
posterior distribution of macroevolutionary rate shift configura- DEC assigns probabilities to various range-changing events,
tions given a phylogeny of interest and then extracts marginal and these event probabilities are then used to calculate the
rates of speciation and extinction for individual taxa based on likelihood of the observed geographic range at the tips of the
this distribution. However, studies have shown that bamm anal- tree. Finally, the values of the parameters and ancestral states
yses can result in problematic estimates of both diversifica- that confer maximum probability given the data are selected
tion rates and rate shifts (Moore et al., 2016; Meyer & Wiens, after ML optimization. BBM is primarily designed for recon-
2018). Estimation of diversification rates can also be carried structing ancestral distributions of nodes in a phylogenetic tree,
out in medusa (Alfaro, 2009), the r package rpanda (Morlon using simple probabilistic models and a fixed topology. Finally,
et al., 2016) and revbayes, which offers a variety of different bayarea estimates biogeographic history by marginalizing
hypothesis-testing options and diversification rate models in a over all possible biogeographic scenarios given the current dis-
Bayesian framework. Additionally, the r package tess imple- tributions of taxa, estimating model parameters using MCMC,
ments a variety of statistical approaches for inferring rates of and comparing candidate biogeographic models using Bayes
lineage diversification from empirical phylogenetic trees under factors.
various stochastic branching process models (Höhna, 2013). These models, as well as their variations, are commonly
Some major features of tess include the ability to implement used in phylogenomics, and are implemented in a ML frame-
various methods of incomplete taxon sampling, robust Bayesian work in the program lagrange (Ree et al., 2005) and the r

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


16 A. D. Young and J. P. Gillung

package biogeobears (Matzke, 2013), and under BI in the pro- fossil placement in a quantitative and comparative framework.
grams bayarea (Landis, 2013) and rasp (Yu et al., 2015). bio- The analysis of combined morphological and molecular charac-
geobears is particularly flexible, enabling both the probabilistic ters for both living and extinct taxa in a model-based framework
inference of historical biogeography (i.e. inference of ances- is an ideal approach for both phylogeny estimation and fossil
tral geographic ranges on a phylogeny) and the comparison placement, which probably results in more robust divergence
of different models of range evolution (i.e. model selection in time estimation and character state reconstruction.
biogeography). In addition, biogeobears allows greater user Contrary to popular belief, novel phylogenomic studies can be
flexibility to relax parameter values and to introduce new performed relatively inexpensively. Some early phylogenomic
free parameters, including new cladogenic events such as the studies performed somewhat simple data analyses, so down-
founder-event speciation (+J). loading, combining and reanalysing published datasets using
several of the described techniques is a cost-effective way to
get acquainted with the field. By interrogating existing data
The future of phylogenomics in more detail than in the original publication, novel insights
can be obtained (and published) without the need for expensive
The rate at which genomic data can be obtained has increased sequencing. Furthermore, data scientists could develop new ana-
exponentially in recent years, providing unprecedented opportu- lytical tools or data-processing methods using existing datasets
nities for the reconstruction of molecular phylogenies. In most to demonstrate their effectiveness and ability to analyse empiri-
cases, the use of phylogenomic-scale data reduces the effects of cal data. By improving even a single step in the phylogenomics
stochastic error in phylogenetic analysis. However, this influx pipeline (see Fig. 2), bioinformaticians can develop novel and
of data increases the potential for systematic error, and methods highly applicable tools for data processing.
and protocols for addressing this issue are still in development. Overall, the future of phylogenomic analysis is extremely
Current models of evolution may be overly simplistic in many bright, with many independent avenues of exploration available.
cases, and model misspecification, regarded as insignificant in Through awareness of the pitfalls of systematic error due to
the Sanger era of phylogenetic reconstruction, may adversely either methodological or biological phenomena, increasingly
affect phylogenomic-scale analyses (Philippe et al., 2011). Sev- robust hypotheses of evolution can be proposed. Likewise, by
eral approaches to alleviate the effects of systematic bias in carefully considering taxon selection, experimental protocols
phylogenomics have been proposed, including tests of composi- and data interpretation, advances in phylogenomic protocols can
tional heterogeneity (Duchêne et al., 2017), posterior predictive greatly contribute to resolving the tree of life.
approaches to assess model fit (Bollback, 2002; Doyle et al.,
2015), and approaches to evaluate the sensitivity of results to
removal of sites or loci likely to introduce bias (Chen et al., Glossary of the basic concepts in phylogenomics
2015; Goremykin et al., 2015) and to explore genealogical con-
cordance among different regions of the genome (Minh et al., Adapters. Short, synthetic, single-stranded DNA molecules
2018). that are ligated to the ends of DNA fragments of interest.
These allow target DNA fragments to bind to a flow cell
Most, if not all, of these approaches should become a part of
for amplification in a NGS platform. Adapters are indexed or
the standard phylogenomics toolkit, and focus should be given
‘barcoded’ with a short, unique sequence usually six to 10
to interpreting, integrating and summarizing across different
bases long, which enables multiple samples to be sequenced
datasets and tree-inference approaches for drawing major phy-
simultaneously (multiplexing). After sequencing, these indexes
logenetic conclusions.
are used to associate individual reads back to their correct
Coalescence-based phylogenetic analysis, which addresses sample (demultiplexing).
the issues of ILS, is currently gaining ground in phyloge- Coalescence. A point in past evolutionary time where different
netic circles, but there are valid criticisms of the summary alleles originated from a single common ancestor gene variant.
(or ‘shortcut’) coalescent approaches used most commonly Compositional heterogeneity. Variation in the equilibrium rate
in phylogenomic-scale datasets (Kubatko & Degnan, 2007). of individual bases or amino acids between taxa. Not accounting
If future implementations of true coalescent analysis – with for compositional heterogeneity in phylogenetic analyses can
co-estimation of phylogeny and coalescence time – become lead to tree estimation error.
viable, this could largely address some of the major sources Contig. A contiguous sequence of DNA constructed by merging
of systematic error. Until then, coalescent analysis is extremely (or assembly) at least two individual DNA sequence reads. With
useful, mostly by showcasing gene tree conflict, but should prob- traditional Sanger sequencing, contigs are typically constructed
ably be performed alongside traditional ML or BI analyses. from a forward and reverse read of the same gene region,
Unfortunately, current molecular phylogenies are often whereas in NGS methods, contigs are typically assembled using
entirely divorced from morphological analyses, making it large numbers of partially overlapping reads in both the forward
difficult to link current literature with both past phylogenetic and reverse directions.
analyses and the natural history of organisms themselves. De novo assembly. The processes of constructing contigs and
In addition, as morphology is often our only link to extinct scaffolds without the use of a pre-existing reference genome
species, morphological data should ideally be used to infer from a related organism. The methods used by de novo assembly

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 17

software are varied, but the most common type for short reads is reads can later be identified and associated with their respective
assembly by de Bruijn graphs. species.
Felsenstein zone. A region of theoretical tree space where Nonparametric bootstrap. A procedure where replicate
long branch attraction is especially pervasive. This leads to the bootstrap datasets are generated by randomly sampling (with
artificial grouping of two or more taxa that are situated in long replacement) characters (i.e. columns) from a sequence matrix.
branches. These replicates are then analysed phylogenetically under the
Gene tree. A phylogenetic tree based on data (or information) same parameters used for the original data, with bootstrap val-
from a single locus. ues corresponding to the proportion of trees/replicates in which
Gene tree estimation error. The inaccuracy in the topology of a group is recovered. Bootstrap values are thought to correspond
individual gene trees. This can be caused by either systematic to assessments of confidence (or support) for each clade of an
error (e.g. model misspecification) or stochastic error (i.e. insuf- observed tree, but they nonetheless have no straightforward
ficient phylogenetic signal leading to inaccurate tree topologies). interpretation in a phylogenetic context.
Gene tree – gene tree conflict. The mismatch between individ- Orthologue. A gene that is homologous to another gene in a
ual gene tree topologies constructed from different loci from a related species and arose through speciation. Orthologous genes
single organism. This phenomenon is most commonly caused usually retain the same or similar functions in related organisms,
by incomplete lineage sorting, where only a random subset of and therefore are under similar selection pressures.
alleles for each gene within an organism is passed on during a Paralogue. A gene that is homologous to another gene in either
speciation event. the same or a related species and arose through gene duplication
Heterotachy. Variation in the evolutionary rate of individual events. Paralogous genes often obtain new functions within a
genes, gene regions or taxa over time. Not accounting for genome, such that they may undergo entirely different selection
heterotachy in phylogenetic analysis can lead to tree estimation pressures when compared with the original gene from which
error. they descended.
Horizontal gene transfer (HGT). The transfer of genetic Parsimony. A method to choose among possible phylogenetic
material from one organism to another that is not its offspring. trees, which states that the phylogeny implying the simplest (i.e.
HGT causes unrelated organisms to share the same, or similar, the least number of changes in character states) hypothesis is
genetic material, thus misleading phylogeny estimation. the best.
Incomplete lineage sorting (ILS). The random inheritance of Posterior probability. In Bayesian phylogenetics, the posterior
only some gene variants (or alleles) by a new species from a probability of a clade is the probability that the relationship is
founding population during speciation events. ILS can cause correct, given the data and assuming that the model is correct.
mismatches between the species tree and individual gene trees. The posterior probability of each clade is estimated based on
Incongruence. Two or more phylogenetic trees are said to the frequency at which that clade is resolved among the whole
be incongruent when they exhibit conflicting topologies or sample of posterior trees (i.e. across all possible phylogenetic
branching orders and cannot be superimposed. This implies that trees).
at least one node present in one tree is not found in the others, Read. A single-stranded DNA sequence that has been ‘read’ by
where it is replaced by alternative groupings of taxa. a DNA sequencer.
Kmer. A DNA sequence of length k, generally obtained from a Reciprocal blast search. blasting (basic local alignment
sequencing platform. search tool) is a procedure where a given locus is compared with
Long branch attraction (LBA). When two (or more) lineages a predefined database of loci obtained from other organisms,
have much longer branches than the others, they tend to group with the best (most likely) hit being returned. In phylogenetics,
together irrespective of their true relationships (i.e. common reciprocal blast searches are generally used to assess gene
ancestry). orthology. In a reciprocal blast search, the best hit is then
Missing data. Character information that is not sampled for blasted against all loci from the original organism. If the best
some taxa in a character matrix. In the context of phylogenomic hit from this reciprocal blast search is the same locus that was
analyses, missing data occur as loci or portions of loci are not used for the original query, then the two loci are likely to be
sampled for particular taxa in an alignment. orthologous. If a different locus is returned as the best hit by a
Model of sequence evolution. A statistical description of the significant margin, then the two loci are regarded as paralogous.
process of substitution in nucleotide or amino acid sequences. Reference-guided assembly. Also called referenced assembly,
Complex models better approximate the evolutionary process this procedure aligns individual reads to a pre-existing reference
but at the expense of more parameters and computational time. genome from a related organism to construct contigs and
As more complex, and thus more parameter-rich, models require scaffolds. The main limitation of reference-guided assemblies
more data to properly describe the process of evolution, they is the reliance on existing annotated genomes from related taxa.
have become more reliable with the advent of phylogenomic Depending on the study organism and the phylogenetic diversity
datasets. of taxa sampled, this technique may be more or less viable for
Multiplexing. A procedure that allows large numbers of DNA phylogenomic studies.
libraries to be pooled and sequenced simultaneously during Saturation. An alignment is saturated when sequences have
a single run on a high-throughput instrument (e.g. Illumina). undergone so many multiple substitutions that apparent
Individual barcode sequences are added to each library so that distances largely underestimate the real genetic distances.

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


18 A. D. Young and J. P. Gillung

Phylogenetic inference works best with datasets that are only Akaike, H. (1974) A New Look at the Statistical Model Identification.
partially saturated. Due to their reduced number of possible Selected Papers of Hirotugu Akaike (ed. by E. Parzen, K. Tanabe and
states (four bases), nucleotide sequences saturate more rapidly G. Kitagawa), pp. 215–222. Springer, New York, New York.
Alfaro, M.E., Santini, F., Brock, C. et al. (2009) Nine exceptional
than protein sequences (20 amino acids). In addition, third
radiations plus high turnover explain species diversity in jawed
codon positions saturate more rapidly than first and second vertebrates. Proceedings of the National Academy of Sciences, 106,
positions. 13410–13414.
Scaffold. A portion of a genome reconstructed by merging Allen, J.M., Huang, D.I., Cronk, Q.C. & Johnson, K.P. (2015) aTRAM -
multiple contigs together, which are often separated by gaps of automated target restricted assembly method: a fast method for assem-
known length. Ideally, complete scaffolds represent entire gene bling loci across divergent taxa from next-generation sequencing data.
regions, chromosomes, etc. BMC Bioinformatics, 16, 98.
Sequencing depth. The number of times individual bases Altenhoff, A.M. & Dessimoz, C. (2009) Phylogenetic and functional
assessment of Orthologs inference projects and methods. PLoS
are sequenced. To assemble contigs of single-copy loci in a
Computational Biology, 5, e1000262.
genome, greater sequencing depth is required, as multiple-copy Ambardar, S., Gupta, R., Trakroo, D., Lal, R. & Vakhlu, J. (2016) High
loci will be sequenced repeatedly before sufficient reads from throughput sequencing: an overview of sequencing chemistry. Indian
single-copy areas of the genome are obtained. Journal of Microbiology, 56, 394–404.
Species tree. A phylogenetic tree that shows the evolutionary Andrews, S. (2010) FastQC: a quality control tool for high through-
relationships between a group of species. put sequence data. URL http://www.bioinformatics.babraham.ac.uk/
Stochastic error. The error in phylogenetic estimation caused projects/fastqc. [Accessed on 23 Nov 2019].
by the finite length of the sequences used in the reconstruction. Ané, C., Larget, B., Baum, D.A., Smith, S.D. & Rokas, A. (2007)
Bayesian estimation of concordance among gene Trees. Molecular
As the size of the sequences increases, the magnitude of the error
Biology and Evolution, 24, 412–426.
decreases. This is why phylogenomic datasets are thought to be Arcila, D., Alexander Pyron, R., Tyler, J.C., Ortí, G. & Betancur-R., R.
almost immune to stochastic error. (2015) An evaluation of fossil tip-dating versus node-age calibrations
Systematic error (or bias). Inaccuracy in phylogenetic recon- in tetraodontiform fishes (Teleostei: Percomorphaceae). Molecular
struction that is a result of the failure of the reconstruction Phylogenetics and Evolution, 82, 131–145.
method to fully account for the properties of the data. Bankevich, A., Nurk, S., Antipov, D. et al. (2012) SPAdes: a new
Tiling. In the context of NGS, tiling refers to an overlap genome assembly algorithm and its applications to single-cell
sequencing. Journal of Computational Biology, 19, 455–477.
between short DNA sequences (e.g. probes/baits or reads).
Barido-Sottani, J., Bošková, V., Plessis, L.D. et al. (2017) Taming the
Target enrichment probes are often designed in a tiled fashion in BEAST – A community teaching material resource for BEAST 2.
order to maximize the chances of successfully hybridizing with Systematic Biology, 67, 170–174.
DNA fragments from across the locus of interest and in all target Beaulieu, J.M., O’Meara, B.C. & Donoghue, M.J. (2013) Identifying
organisms. Also, during contig assembly, many individual reads hidden rate changes in the evolution of a binary morphological
are overlapped, or ‘tiled’, in order to build a consensus of reads character: the evolution of plant habit in Campanulid angiosperms.
at each base position. Systematic Biology, 62, 725–737.
Besser, J., Carleton, H.A., Gerner-Smidt, P., Lindsey, R.L. & Trees, E.
(2018) Next-generation sequencing technologies and their application
to the study and control of bacterial infections. Clinical Microbiology
Acknowledgements
and Infection, 24, 335–341.
Betancur-R, R., Li, C., Munroe, T.A., Ballesteros, J.A. & Ortí, G. (2013)
Thank you to Shaun Winterton for his support and encourage- Addressing gene tree discordance and non-stationarity to resolve a
ment (i.e. productive nagging) during the preparation of this multi-locus phylogeny of the flatfishes (Teleostei: Pleuronectiformes).
manuscript. Thank you to Ziad Khouri for his valuable com- Systematic Biology, 62, 763–785.
ments and suggestions on an early version. Thank you to the Betancur-R, R., Naylor, G.J.P. & Ortí, G. (2014) Conserved genes,
reviewers, Dr Steve Cameron and Dr Phil Ward, whose insight- sampling error, and Phylogenomic inference. Systematic Biology, 63,
ful comments greatly improved the text. The authors declare no 257–262.
conflict of interest. Work by ADY was conducted on the tradi- Blaimer, B.B., Lloyd, M.W., Guillory, W.X. & Brady, S.G. (2016)
tional territory of the Patwin and Nisenan peoples, while JPG’s Sequence capture and phylogenetic utility of genomic ultraconserved
elements obtained from pinned insect specimens. PLoS One, 11,
research was carried out on the traditional territory of Cayuga
e0161531.
and Onondaga Nations. The authors are grateful to have the
Bolger, A.M., Lohse, M. & Usadel, B. (2014) Trimmomatic: a flexible
opportunity to work on their land. trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.
Bollback, J.P. (2002) Bayesian model adequacy and choice in phyloge-
netics. Molecular Biology and Evolution, 19, 1171–1180.
References Borowiec, M.L. (2016) AMAS: a fast tool for alignment manipulation
and computing of summary statistics. PeerJ, 4, e1660.
Abadi, S., Azouri, D., Pupko, T. & Mayrose, I. (2019) Model selection Borowiec, M.L., Rabeling, C., Brady, S.G., Fisher, B.L., Schultz, T.R. &
may not be a mandatory step for phylogeny reconstruction. Nature Ward, P.S. (2019) Compositional heterogeneity and outgroup choice
Communications, 10, 934. influence the internal phylogeny of the ants. Molecular Phylogenetics
Aberer, A.J., Kobert, K. & Stamatakis, A. (2014) ExaBayes: massively and Evolution, 134, 111–121.
parallel Bayesian tree inference for the whole-genome era. Molecular Bouckaert, R., Lockhart, P. (2015) Capturing heterotachy through
Biology and Evolution, 31, 2553–2556. multi-gamma site models. bioRxiv, 018101.

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 19

Bouckaert, R., Vaughan, T.G., Barido-Sottani, J. et al. (2019) BEAST Degnan, J.H. & Rosenberg, N.A. (2006) Discordance of species trees
2.5: an advanced software platform for Bayesian evolutionary analy- with their most likely gene trees. PLoS Genetics, 2, e68.
sis. PLoS Computational Biology, 15, e1006650. Doyle, V.P., Young, R.E., Naylor, G.J. & Brown, J.M. (2015) Can we
Brown, J.M. (2014) Predictive approaches to assessing the fit of identify genes with increased phylogenetic reliability? Systematic
evolutionary models. Systematic Biology, 63, 289–292. Biology, 64, 824–837.
Brown, J.M. & Lemmon, A.R. (2007) The importance of data parti- Drummond, A.J., Ho, S.Y.W., Phillips, M.J. & Rambaut, A. (2006)
tioning and the utility of Bayes factors in Bayesian Phylogenetics. Relaxed phylogenetics and dating with confidence. PLoS Biology, 4,
Systematic Biology, 56, 643–655. e88.
Brown, J.M. & Thomson, R.C. (2016) Bayes factors unmask highly Duchêne, D.A., Duchêne, S. & Ho, S.Y.W. (2017) New statistical criteria
variable information content, Bias, and extreme influence in Phyloge- detect phylogenetic Bias caused by compositional heterogeneity.
nomic analyses. Systematic Biology, 66, 517–530. Molecular Biology and Evolution, 34, 1529–1534.
Brown, J.W. & Smith, S.A. (2018) The past sure is tense: on interpreting Dunn, C.W., Howison, M. & Zapata, F. (2013) Agalma: an automated
phylogenetic divergence time estimates. Systematic Biology, 67, phylogenomics workflow. BMC Bioinformatics, 14, 330.
340–353. Edwards, S.V. (2009) Is a new and general theory of molecular
Buckley, T.R. (2002) Model misspecification and probabilistic tests of systematics emerging? Evolution, 63, 1–19.
topology: evidence from empirical data sets. Systematic Biology, 51, Edwards, S.V., Xi, Z., Janke, A. et al. (2016) Implementing and
509–523. testing the multispecies coalescent model: a valuable paradigm
Chen, F., Mackey, A.J., Vermunt, J.K. & Roos, D.S. (2007) Assessing for phylogenomics. Molecular Phylogenetics and Evolution, 94,
performance of Orthology detection strategies applied to eukaryotic 447–462.
genomes. PLoS One, 2, e383. Efron, B. (1979) Bootstrap methods: another look at the Jackknife. The
Chen, M.-Y., Liang, D. & Zhang, P. (2015) Selecting question-specific Annals of Statistics, 7, 1–26.
genes to reduce incongruence in Phylogenomics: a case study Eisen, J.A. (1998) Phylogenomics: improving functional predictions for
of jawed vertebrate backbone phylogeny. Systematic Biology, 64, uncharacterized genes by evolutionary analysis. Genome Research, 8,
1104–1120. 163–167.
Chen, M.-Y., Liang, D. & Zhang, P. (2017) Phylogenomic resolution of Emms, D.M. & Kelly, S. (2015) OrthoFinder: solving fundamen-
the phylogeny of Laurasiatherian mammals: exploring phylogenetic
tal biases in whole genome comparisons dramatically improves
signals within coding and noncoding sequences. Genome Biology and
orthogroup inference accuracy. Genome Biology, 16, 157.
Evolution, 9, 1998–2012.
Erixon, P., Svennblad, B., Britton, T. & Oxelman, B. (2003) Reliability
Chifman, J. & Kubatko, L. (2014) Quartet inference from SNP data
of Bayesian posterior probabilities and bootstrap frequencies in
under the coalescent model. Bioinformatics, 30, 3317–3324.
Phylogenetics. Systematic Biology, 52, 665–673.
Chiu, J.C., Lee, E.K., Egan, M.G., Sarkar, I.N., Coruzzi, G.M. &
Faircloth, B.C. (2016) PHYLUCE is a software package for the analysis
DeSalle, R. (2006) OrthologID: automation of genome-scale ortholog
of conserved genomic loci. Bioinformatics, 32, 786–788.
identification within a parsimony framework. Bioinformatics, 22,
Faircloth, B.C., McCormack, J.E., Crawford, N.G., Harvey, M.G.,
699–707.
Brumfield, R.T. & Glenn, T.C. (2012) Ultraconserved elements
Chou, J., Gupta, A., Yaduvanshi, S., Davidson, R., Nute, M., Mirarab, S.
anchor thousands of genetic markers spanning multiple evolutionary
& Warnow, T. (2015) A comparative study of SVDquartets and other
timescales. Systematic Biology, 61, 717–726.
coalescent-based species tree estimation methods. BMC Genomics,
Fan, Y., Wu, R., Chen, M.-H., Kuo, L. & Lewis, P.O. (2010) Choosing
16, S2.
among partition models in Bayesian phylogenetics. Molecular Biol-
Compeau, P.E., Pevzner, P.A. & Tesler, G. (2011) How to apply
de Bruijn graphs to genome assembly. Nature Biotechnology, 29, ogy and Evolution, 28, 523–532.
987–991. Felsenstein, J. (1978) Cases in which parsimony or compatibility
Cooper, E.D. (2014) Overly simplistic substitution models obscure methods will be positively misleading. Systematic Zoology, 27,
green plant phylogeny. Trends in Plant Science, 19, 576–582. 401–410.
Copetti, D., Búrquez, A., Bustamante, E. et al. (2017) Extensive gene Felsenstein, J. (1985) Confidence limits on phylogenies: an approach
tree discordance and hemiplasy shaped the genomes of north Ameri- using the bootstrap. Evolution, 39, 783–791.
can columnar cacti. Proceedings of the National Academy of Sciences Fitch, W.M. (1970) Distinguishing homologous from analogous pro-
of the USA, 114, 12003–12008. teins. Systematic Zoology, 19, 99–113.
Cranston, K., Harmon, L.J., O’Leary, M.A. & Lisle, C. (2014) Best Fitch, W.M. & Margoliash, E. (1967) A method for estimating the
practices for data sharing in phylogenetic research. PLoS Currents, number of invariant amino acid coding positions in a gene using
6, ecurrents.tol.bf01eff4a6b60ca4825c69293dc59645. cytochrome c as a model case. Biochemical Genetics, 1, 65–71.
Crotty, S.M., Minh, B.Q., Bean, N.G., Holland, B.R., Tuke, J., Jer- Gabaldón, T. (2008) Large-scale assignment of orthology: back to
miin, L.S., von Haeseler, A. (2019) GHOST: Recovering Historical phylogenetics? Genome Biology, 9, 235.
Signal from Heterotachously-evolved Sequence Alignments. bioRxiv, Goodman, S.N. (1999) Toward evidence-based medical statistics. 2: the
174789. Bayes factor. Annals of Internal Medicine, 130, 1005–1013.
Dai, M., Thompson, R.C., Maher, C. et al. (2010) NGSQC: Goremykin, V.V., Nikiforova, S.V., Cavalieri, D., Pindo, M. & Lockhart,
cross-platform quality analysis pipeline for deep sequencing data. P. (2015) The root of flowering plants and total evidence. Systematic
BMC Genomics, 11, S7. Biology, 64, 879–891.
Dayhoff, M.O., Schwartz, R. & Orcutt, B. (1978) A model of evolu- Grabherr, M.G., Haas, B.J., Yassour, M. et al. (2011) Full-length
tionary change in proteins. Atlas of Protein Sequence and Structure transcriptome assembly from RNA-Seq data without a reference
(ed. by M.O. Dayhoff), pp. 345–352. National Biomedical Research genome. Nature Biotechnology, 29, 644–652.
Foundation, Silver Spring, Maryland. Heath, T.A., Hedtke, S.M. & Hillis, D.M. (2008) Taxon sampling and
Debiasse, M.B. & Ryan, J.F. (2019) Phylotocol: promoting transparency the accuracy of phylogenetic analyses.
and overcoming Bias in Phylogenetics. Systematic Biology, 68, Heath, T.A., Huelsenbeck, J.P. & Stadler, T. (2014) The fossilized
672–678. birth–death process for coherent calibration of divergence-time

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


20 A. D. Young and J. P. Gillung

estimates. Proceedings of the National Academy of Sciences, 111, Karin, B.R., Gamble, T. & Jackman, T.R. (2019) Optimizing phyloge-
E2957–E2966. nomics with rapidly evolving long exons: comparison with anchored
Heled, J. & Drummond, A.J. (2010) Bayesian inference of species hybrid enrichment and Ultraconserved elements. bioRxiv, 672238.
trees from multilocus data. Molecular Biology and Evolution, 27, Katoh, K. & Standley, D.M. (2013) MAFFT multiple sequence align-
570–580. ment software version 7: improvements in performance and usability.
Höhna, S. (2013) Fast simulation of reconstructed phylogenies under Molecular Biology and Evolution, 30, 772–780.
global time-dependent birth–death processes. Bioinformatics, 29, Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H.,
1367–1374. Zahler, A.M. & Haussler, D. (2002) The human genome browser at
Höhna, S. (2014) Likelihood inference of non-constant diversification UCSC. Genome Research, 12, 996–1006.
rates with incomplete taxon sampling. PLoS One, 9, e84184. Kocot, K.M., Citarella, M.R., Moroz, L.L. & Halanych, K.M. (2013)
Höhna, S., Landis, M.J., Heath, T.A. et al. (2016) RevBayes: Bayesian PhyloTreePruner: a phylogenetic tree-based approach for selection
phylogenetic inference using graphical models and an interactive of orthologous sequences for phylogenomics. Evolutionary Bioinfor-
model-specification language. Systematic Biology, 65, 726–736. matics, 9, 429–435.
Höhna, S., May, M.R. & Moore, B.R. (2015) Phylogeny Simulation and Kocot, K.M., Struck, T.H., Merkel, J., Waits, D,S., Todt, C., Brannock,
Diversification Rate Analysis with TESS, 98. P.M., Weese, D.A., Cannon, J.T., Moroz, L.L., Lieb, B., Halanych,
Hölzer, M. & Marz, M. (2019) De novo transcriptome assembly: K.M. (2017) Phylogenomics of Lophotrochozoa with consideration
a comprehensive cross-species comparison of short-read RNA-Seq of systematic error. Systematic Biology, 66, 256–282, 2.
assemblers. GigaScience, 8, giz039. Kozlov, A.M., Aberer, A.J. & Stamatakis, A. (2015) ExaML version 3:
Hosner, P.A., Faircloth, B.C., Glenn, T.C., Braun, E.L. & Kimball, a tool for phylogenomic analyses on supercomputers. Bioinformatics,
R.T. (2016) Avoiding missing data biases in phylogenomic inference: 31, 2577–2579.
an empirical study in the Landfowl (Aves: Galliformes). Molecular Kriventseva, E.V., Kuznetsov, D., Tegenfeldt, F., Manni, M., Dias, R.,
Biology and Evolution, 33, 1110–1125. Simão, F.A. & Zdobnov, E.M. (2019) OrthoDB v10: sampling the
Huelsenbeck, J.P., Bull, J.J. & Cunningham, C.W. (1996) Combining diversity of animal, plant, fungal, protist, bacterial and viral genomes
data in phylogenetic analysis. Trends in Ecology & Evolution, 11, for evolutionary and functional annotations of orthologs. Nucleic
152–158. Acids Research, 47, D807–D811.
Huelsenbeck, J.P. & Hillis, D.M. (1993) Success of phylogenetic Kubatko, L.S., Carstens, B.C. & Knowles, L.L. (2009) STEM: species
methods in the four-taxon case. Systematic Biology, 42, 247–264. tree estimation using maximum likelihood for gene trees under
Huelsenbeck, J.P., Joyce, P., Lakner, C. & Ronquist, F. (2008) Bayesian coalescence. Bioinformatics, 25, 971–973.
analysis of amino acid substitution models. Philosophical Transac- Kubatko, L.S. & Degnan, J.H. (2007) Inconsistency of phylogenetic
tions of the Royal Society B: Biological Sciences, 363, 3941–3953. estimates from concatenated data under coalescence. Systematic
Huelsenbeck, J.P., Larget, B., Miller, R.E. & Ronquist, F. (2002) Biology, 56, 17–24.
Potential applications and pitfalls of Bayesian inference of phylogeny. Kück, P.J. & Wägele, W. (2015) Plesiomorphic character states cause
Systematic Biology, 51, 673–688. systematic errors in molecular phylogenetic analyses: a simulation
Huelsenbeck, J.P. & Rannala, B. (2004) Frequentist properties of study. Cladistics, 32, 461–478.
Bayesian posterior probabilities of phylogenetic trees under simple Kumar, S., Filipski, A.J., Battistuzzi, F.U., Kosakovsky Pond, S.L. &
and complex substitution models. Systematic Biology, 53, 904–913. Tamura, K. (2012) Statistics and truth in phylogenomics. Molecular
Huelsenbeck, J.P., Ronquist, F., Nielsen, R. & Bollback, J.P. (2001) Biology and Evolution, 29, 457–472.
Bayesian inference of phylogeny and its impact on evolutionary Lam, H.Y., Clark, M.J., Chen, R. et al. (2012) Performance comparison
biology. Science, 294, 2310–2314. of whole-genome sequencing platforms. Nature Biotechnology, 30,
Jackman, S.D., Vandervalk, B.P., Mohamadi, H. et al. (2017) ABySS 78–82.
2.0: resource-efficient assembly of large genomes using a bloom filter. Landis, M.J., Matzke, N.J., Moore, B.R. & Huelsenbeck, J.P. (2013)
Genome Research, 27, 768–777. Bayesian analysis of biogeography when the number of areas is large.
Jeffroy, O., Brinkmann, H., Delsuc, F. & Philippe, H. (2006) Phy- Systematic Biology, 62, 789–804.
logenomics: the beginning of incongruence? Trends in Genetics, 22, Lanfear, R., Calcott, B., Kainer, D., Mayer, C. & Stamatakis, A. (2014)
225–231. Selecting optimal partitioning schemes for phylogenomic datasets.
Jia, F., Lo, N. & Ho, S.Y.W. (2014) The impact of modelling rate BMC Evolutionary Biology, 14, 82.
heterogeneity among sites on phylogenetic estimates of intraspecific Lanfear, R., Frandsen, P.B., Wright, A.M., Senfeld, T. & Calcott,
evolutionary rates and timescales. PLoS One, 9, e95722. B. (2016) PartitionFinder 2: new methods for selecting partitioned
Jones, D.T., Taylor, W.R. & Thornton, J.M. (1992) The rapid generation models of evolution for molecular and morphological phylogenetic
of mutation data matrices from protein sequences. Computer Appli- analyses. Molecular Biology and Evolution, 34, 772–773.
cations in the Biosciences, 8, 275–282. Larkin, M.A., Blackshields, G., Brown, N.P. et al. (2007) Clustal W and
Joshi, N.A. & Fass, J.N. (2011) Sickle: A sliding-window, adaptive, Clustal X version 2.0. Bioinformatics, 23, 2947–2948.
quality-based trimming tool for FastQ files. URL https://github.com/ Lartillot, N., Lepage, T. & Blanquart, S. (2009) PhyloBayes 3: a
najoshi/sickle. [Accessed on 23 Nov 2019]. Bayesian software package for phylogenetic reconstruction and
Jothi, R., Zotenko, E., Tasneem, A. & Przytycka, T.M. (2006) molecular dating. Bioinformatics, 25, 2286–2288.
COCO-CL: hierarchical clustering of homology relations based Lartillot, N. & Philippe, H. (2006) Computing Bayes factors using
on evolutionary correlations. Bioinformatics, 22, 779–788. thermodynamic integration. Systematic Biology, 55, 195–207.
Joy, J.B., Liang, R.H., McCloskey, R.M., Nguyen, T. & Poon, A.F.Y. Lartillot, N., Rodrigue, N., Stubbs, D. & Richer, J. (2013) PhyloBayes
(2016) Ancestral reconstruction. PLoS Computational Biology, 12, MPI: phylogenetic reconstruction with infinite mixtures of profiles in
e1004763. a parallel environment. Systematic Biology, 62, 611–615.
Kalyaanamoorthy, S., Minh, B.Q., Wong, T.K., von Haeseler, A. & Le, S.Q., Dang, C.C. & Gascuel, O. (2012) Modeling protein evolution
Jermiin, L.S. (2017) ModelFinder: fast model selection for accurate with several amino acid replacement matrices depending on site rates.
phylogenetic estimates. Nature Methods, 14, 587–589. Molecular Biology and Evolution, 29, 2921–2936.

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 21

Le, S.Q., Gascuel, O. & Lartillot, N. (2008) Empirical profile mix- phylogeography and phylogenetics. Molecular Phylogenetics and
ture models for phylogenetic reconstruction. Bioinformatics, 24, Evolution, 66, 526–538.
2317–2323. McCormack, J.E., Tsai, W.L.E. & Faircloth, B.C. (2016) Sequence
Lechner, M., Findeiss, S., Steiner, L., Marz, M., Stadler, P.F. & capture of ultraconserved elements from bird museum specimens.
Prohaska, S.J. (2011) Proteinortho: detection of (co-)orthologs in Molecular Ecology Resources, 16, 1189–1203.
large-scale analysis. BMC Bioinformatics, 12, 124. Metzker, M.L. (2010) Sequencing technologies – the next generation.
Leinonen, R., Sugawara, H., Shumway, M. & International Nucleotide Nature Reviews Genetics, 11, 31–46.
Sequence Database Collaboration (eds) (2011) The sequence read Meyer, A.L.S. & Wiens, J.J. (2018) Estimating diversification rates for
archive. Nucleic Acids Research, 39, D19–D21. higher taxa: BAMM can give problematic estimates of rates and rate
Lemmon, A.R., Emme, S.A. & Lemmon, E.M. (2012) Anchored shifts. Evolution, 72, 39–53.
hybrid enrichment for massively high-throughput phylogenomics. Minh, B.Q., Hahn, M. & Lanfear, R. (2018) New methods to calculate
Systematic Biology, 61, 727–744. concordance factors for phylogenomic datasets. bioRxiv, 487801.
Lemmon, A.R. & Moriarty, E.C. (2004) The importance of proper Minin, V., Abdo, Z., Joyce, P. & Sullivan, J. (2003) Performance-based
model assumption in Bayesian Phylogenetics. Systematic Biology, 53, selection of likelihood models for phylogeny estimation. Systematic
278–298. Biology, 52, 674–683.
Lemmon, E.M. & Lemmon, A.R. (2013) High-throughput genomic Mirarab, S., Bayzid, M.S., Boussau, B. & Warnow, T. (2014) Statistical
data in systematics and Phylogenetics. Annual Review of Ecology, binning improves species tree estimation in the presence of gene tree
Evolution, and Systematics, 44, 99–121. incongruence. Science, 346, 1250463.
Li, L., Stoeckert, C.J. & Roos, D.S. (2003) OrthoMCL: identification Moore, B.R., Höhna, S., May, M.R., Rannala, B. & Huelsenbeck, J.P.
of ortholog groups for eukaryotic genomes. Genome Research, 13, (2016) Critically evaluating the theory and performance of Bayesian
2178–2189. analysis of macroevolutionary mixtures. Proceedings of the National
Li, R., Zhu, H., Ruan, J. et al. (2010) De novo assembly of human Academy of Sciences, 113, 9569–9574.
genomes with massively parallel short read sequencing. Genome Moreau, C.S. (2014) A practical guide to DNA extraction, PCR, and
Research, 20, 265–272. gene-based DNA sequencing in insects. Halteres, 5, 32–42.
Liu, L., Wu, S. & Yu, L. (2015) Coalescent methods for estimating Mork, R., Martin, P. & Zhao, Z. (2015) Contemporary challenges for
species trees from phylogenomic data. Journal of Systematics and data-intensive scientific workflow management systems. Proceedings
Evolution, 53, 380–390. of the 10th Workshop on Workflows in Support of Large-Scale Science
Liu, L., Yu, L. & Edwards, S.V. (2010) A maximum pseudo-likelihood - WORKS ’15 (ed. by J. Montagnat and I. Taylor), pp. 1–11. ACM
approach for estimating species trees under the coalescent model. Press, Austin, Texas.
BMC Evolutionary Biology, 10, 302. Morlon, H., Lewitus, E., Condamine, F.L., Manceau, M., Clavel, J.
Liu, L., Yu, L., Pearl, D.K. & Edwards, S.V. (2009) Estimating species & Drury, J. (2016) RPANDA: an R package for macroevolutionary
phylogenies using coalescence times among sequences. Systematic analyses on phylogenetic trees. Methods in Ecology and Evolution, 7,
Biology, 58, 468–477. 589–597.
Maddison, W.P. & Maddison, D.R. (2018) Mesquite: a modular sys- Nabhan, A.R. & Sarkar, I.N. (2012) The impact of taxon sampling
tem for evolutionary analysis. URL http://www.mesquiteproject.org. on phylogenetic inference: a review of two decades of controversy.
[Accessed on 23 Nov 2019]. Briefings in Bioinformatics, 13, 122–134.
Magee, A.F., May, M.R. & Moore, B.R. (2014) The Dawn of open Narzisi, G. & Mishra, B. (2011) Comparing De novo genome assembly:
access to phylogenetic data. PLoS One, 9, e110268. the long and short of it. PLoS One, 6, e19175.
Mandel, J.R., Dikow, R.B., Funk, V.A. et al. (2014) A target enrichment Nascimento, F.F., dos Reis, M. & Yang, Z. (2017) A biologist’s guide
method for gathering phylogenetic information from hundreds of loci: to Bayesian phylogenetic analysis. Nature Ecology and Evolution, 1,
an example from the compositae. Applications in Plant Sciences, 2, 1446–1454.
1300085. Nguyen, L.-T., Schmidt, H.A., von Haeseler, A. & Minh, B.Q. (2015)
Mardis, E.R. (2011) A decade’s perspective on DNA sequencing IQ-TREE: a fast and effective stochastic algorithm for estimating
technology. Nature, 470, 198–203. maximum-likelihood phylogenies. Molecular Biology and Evolution,
Martin, M. (2011) Cutadapt removes adapter sequences from 32, 268–274.
high-throughput sequencing reads. EMBnet Journal, 17, 10–12. Nosenko, T., Schreiber, F., Adamska, M. et al. (2013) Deep metazoan
Matzke, N.J. (2013) Probabilistic historical biogeography: new models phylogeny: when different genes tell different stories. Molecular
for founder-event speciation, imperfect detection, and fossils allow Phylogenetics and Evolution, 67, 223–233.
improved accuracy and model-testing. Frontiers of Biogeography, 5, O’Brien, K.P., Remm, M. & Sonnhammer, E.L.L. (2005) Inparanoid:
242–248. a comprehensive database of eukaryotic orthologs. Nucleic Acids
Matzke, N.J. & Wright, A. (2016) Inferring node dates from tip dates Research, 33, D476–D480.
in fossil Canidae: the importance of tree priors. Biology Letters, 12, O’Brien, S.J. & Stanyon, R. (1999) Phylogenomics: ancestral primate
20160328. viewed. Nature, 402, 365–366.
Mayrose, I., Friedman, N. & Pupko, T. (2005) A gamma mixture model Oakley, T.H., Alexandrou, M.A., Ngo, R., Pankey, M.S., Churchill,
better accounts for among site rate heterogeneity. Bioinformatics, 21, C.K.C., Chen, W. & Lopker, K.B. (2014) Osiris: accessible and
ii151–ii158. reproducible phylogenetic and phylogenomic analyses within the
McCormack, J.E., Faircloth, B.C., Crawford, N.G., Gowaty, P.A., Brum- galaxy workflow management system. BMC Bioinformatics, 15, 230.
field, R.T. & Glenn, T.C. (2012) Ultraconserved elements are novel Pagel, M., Meade, A. & Barker, D. (2004) Bayesian estimation of
phylogenomic markers that resolve placental mammal phylogeny ancestral character states on phylogenies. Systematic Biology, 53,
when combined with species-tree analysis. Genome Research, 22, 673–684.
746–754. Paradis, E. (2013) Molecular dating of phylogenies by likelihood
McCormack, J.E., Hird, S.M., Zellmer, A.J., Carstens, B.C. & Brum- methods: a comparison of models and a new information criterion.
field, R.T. (2013) Applications of next-generation sequencing to Molecular Phylogenetics and Evolution, 67, 436–444.

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


22 A. D. Young and J. P. Gillung

Paradis, E., Claude, J. & Strimmer, K. (2004) APE: analyses of Ronquist, F., Teslenko, M., Van Der Mark, P. et al. (2012b) MrBayes 3.2:
phylogenetics and evolution in R language. Bioinformatics, 20, efficient Bayesian phylogenetic inference and model choice across a
289–290. large model space. Systematic Biology, 61, 539–542.
Parenti, L.R. (2006) Common cause and historical biogeography. Bio- Rosenberg, M.S. & Kumar, S. (2003) Taxon sampling, bioinformatics,
geography in a Changing World. Fifth Biennial Conference of the and phylogenomics. Systematic Biology, 52, 119–124.
Systematics Association (ed. by M.C. Ebach and R.S. Tangney), pp. Roth, A.C.J., Gonnet, G.H. & Dessimoz, C. (2008) Algorithm of OMA
61–82. CRC Press, London. for large-scale orthology inference. BMC Bioinformatics, 9, 518.
Parham, J.F., Donoghue, P.C.J., Bell, C.J. et al. (2012) Best practices for Roure, B., Baurain, D. & Philippe, H. (2013) Impact of missing data
justifying fossil calibrations. Systematic Biology, 61, 346–359. on phylogenies inferred from empirical phylogenomic data sets.
Petersen, M., Meusemann, K., Donath, A. et al. (2017) Orthograph: a Molecular Biology and Evolution, 30, 197–214.
versatile tool for mapping coding nucleotide sequences to clusters of Roure, B., Rodriguez-Ezpeleta, N. & Philippe, H. (2007) SCaFoS:
orthologous genes. BMC Bioinformatics, 18, 111. a tool for selection, concatenation and fusion of sequences for
Philippe, H., Brinkmann, H., Lavrov, D.V., Littlewood, D.T.J., Manuel, phylogenomics. BMC Evolutionary Biology, 7(Suppl 1), S2.
M., Wörheide, G., Baurain, D. (2011) Resolving difficult phyloge- Royer-Carenzi, M. & Didier, G. (2016) A comparison of ancestral
netic questions: why more sequences are not enough. PLoS Biology, state reconstruction methods for quantitative characters. Journal of
9, e1000602, 3. Theoretical Biology, 404, 126–142.
Pleijel, F., Jondelius, U., Norlinder, E. et al. (2008) Phylogenies without Royer-Carenzi, M., Pontarotti, P. & Didier, G. (2013) Choosing the
roots? A plea for the use of vouchers in molecular phylogenetic best ancestral character state reconstruction method. Mathematical
studies. Molecular Phylogenetics and Evolution, 48, 369–371. Biosciences, 242, 95–109.
Posada, D. & Crandall, K.A. (2001) Selecting the best-fit model of Salichos, L. & Rokas, A. (2011) Evaluating ortholog prediction algo-
nucleotide substitution. Systematic Biology, 50, 580–601. rithms in a yeast model clade. PLoS One, 6, e18755.
Powell, S., Szklarczyk, D., Trachana, K. et al. (2012) eggNOG v3.0: Sanderson, M.J. (2003) r8s: inferring absolute rates of molecular
orthologous groups covering 1133 organisms at 41 different taxo- evolution and divergence times in the absence of a molecular clock.
nomic ranges. Nucleic Acids Research, 40, D284–D289. Bioinformatics, 19, 301–302.
Qu, X.J., Jin, J.J., Chaw, S.M., Li, D.Z. & Yib, T.S. (2017) Multiple Schwarz, G. (1978) Estimating the dimension of a model. The Annals of
measures could alleviate long-branch attraction in phylogenomic Statistics, 6, 461–464.
reconstruction of Cupressoideae (Cupressaceae). Scientific Reports, Shade, A. & Teal, T.K. (2015) Computing workflows for biologists: a
7, 41005. roadmap. PLoS Biology, 13, e1002303.
Rabbani, B., Tekin, M. & Mahdieh, N. (2014) The promise of Siu-Ting, K., Torres-Sánchez, M., San Mauro, D. et al. (2019) Inad-
whole-exome sequencing in medical genetics. Journal of Human vertent paralog inclusion drives artifactual topologies and timetree
Genetics, 59, 5–15. estimates in Phylogenomics. Molecular Biology and Evolution, 36,
Rabosky, D.L. (2014) Automatic detection of key innovations, rate 1344–1356.
shifts, and diversity-dependence on phylogenetic trees. PLoS One, 9, Smith, B.T., Mauck III, W.M., Benz, B., Andersen, M.J. (2018b) Uneven
e89543. missing data skews phylogenomic relationships within the lories and
Rannala, B. & Yang, Z. (2003) Bayes estimation of species divergence lorikeets. bioRxiv, 398297.
times and ancestral population sizes using DNA sequences from Smith, S.A., Brown, J.W. & Walker, J.F. (2018a) So many genes, so
multiple loci. Genetics, 164, 1645–1656. little time: a practical approach to divergence-time estimation in the
Ree, R.H., Moore, B.R., Webb, C.O. & Donoghue, M.J. (2005) A genomic era. PLoS One, 13, e0197433.
likelihood framework for inferring the evolution of geographic range Smith, S.A. & O’Meara, B.C. (2012) treePL: divergence time estimation
on phylogenetic trees. Evolution, 59, 2299–2311. using penalized likelihood for large phylogenies. Bioinformatics, 28,
Ree, R.H. & Sanmartín, I. (2009) Prospects and challenges for para- 2689–2690.
metric models in historical biogeographical inference. Journal of Bio- Springer, M.S. & Gatesy, J. (2014) Land plant origins and coalescence
geography, 36, 1211–1220. confusion. Trends in Plant Science, 19, 267–269.
Ree, R.H. & Smith, S.A. (2008) Maximum likelihood inference of geo- Stadler, T. (2009) On incomplete sampling under birth–death models
graphic range evolution by dispersal, local extinction, and Cladogen- and connections to the sampling-based coalescent. Journal of Theo-
esis. Systematic Biology, 57, 4–14. retical Biology, 261, 58–66.
Revell, L.J. (2012) Phytools: an R package for phylogenetic comparative Stamatakis, A. (2014) RAxML version 8: a tool for phylogenetic
biology (and other things). Methods in Ecology and Evolution, 3, analysis and post-analysis of large phylogenies. Bioinformatics, 30,
217–223. 1312–1313.
Richards, E.J., Brown, J.M., Barley, A.J., Chong, R.A. & Thomson, R.C. Straub, S.C.K., Parks, M., Weitemier, K., Fishbein, M., Cronn, R.C.
(2018) Variation across mitochondrial gene Trees provides evidence & Liston, A. (2012) Navigating the tip of the genomic iceberg:
for systematic error: how much gene tree variation is biological? next-generation sequencing for plant systematics. American Journal
Systematic Biology, 67, 847–860. of Botany, 99, 349–364.
Romiguier, J., Cameron, S.A., Woodard, S.H., Fischman, B.J., Keller, Strimmer, K. & Haeseler, A.v. (1997) Likelihood-mapping: a simple
L. & Praz, C.J. (2016) Phylogenomics controlling for base composi- method to visualize phylogenetic content of a sequence alignment.
tional Bias reveals a single origin of Eusociality in Corbiculate bees. Proceedings of the National Academy of Sciences, 94, 6815–6819.
Molecular Biology and Evolution, 33, 670–678. Struck, T.H. (2013) The impact of paralogy on phylogenomic stud-
Ronquist, F. (1997) Dispersal–vicariance analysis: a new approach to ies – a case study on annelid relationships. PLoS One, 8, e62892.
the quantification of historical biogeography. Systematic Biology, 46, Struck, T.H. (2014) TreSpEx-detection of misleading signal in phyloge-
195–203. netic reconstructions based on tree information. Evolutionary Bioin-
Ronquist, F., Klopfstein, S., Vilhelmsen, L., Schulmeister, S., Murray, formatics Online, 10, 51–67.
D.L. & Rasnitsyn, A.P. (2012a) A total-evidence approach to dating Suchard, M.A., Lemey, P., Baele, G., Ayres, D.L., Drummond, A.J. &
with fossils, applied to the early radiation of the hymenoptera. Rambaut, A. (2018) Bayesian phylogenetic and phylodynamic data
Systematic Biology, 61, 973–999. integration using BEAST 1.10. Virus Evolution, 4, vey016.

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406


Understanding phylogenomics 23

Suchard, M.A., Weiss, R.E. & Sinsheimer, J.S. (2001) Bayesian selec- Wen, J., Egan, A.N., Dikow, R.B. & Zimmer, E.A. (2015) Utility of
tion of continuous-time Markov chain evolutionary models. Molecu- transcriptome sequencing for phylogenetic inference and character
lar Biology and Evolution, 18, 1001–1013. evolution. Next-generation Sequencing in Plant Systematics
Sugiura, N. (1978) Further analysts of the data by akaike’s information Wiens, J.J. (2017) What explains patterns of biodiversity across the tree
criterion and the finite corrections: further analysts of the data by of life? BioEssays, 39, 1600128.
akaike’s. Communications in Statistics, 7, 13–26. Xu, B. & Yang, Z. (2013) PAMLX: a graphical user interface for PAML.
Sullivan, J., Swofford, D.L. & Naylor, G.J.P. (1999) The effect of Molecular Biology and Evolution, 30, 2723–2724.
taxon sampling on estimating rate heterogeneity parameters of Yang, Z. (1993) Maximum-likelihood estimation of phylogeny from
maximum-likelihood models. Molecular Biology and Evolution, 16, DNA sequences when substitution rates differ over sites. Molecular
1347–1347, 1356. Biology and Evolution, 10, 1396–1401.
Sun, M., Folk, R.A., Gitzendanner, M.A., Guralnick, R.P., Soltis, Yang, Z. (2006) Computational Molecular Evolution. Oxford University
P.S., Chen, Z., Soltis, D.E. (2019) Estimating rates and patterns of Press, New York, New York.
diversification with incomplete sampling: a case study in the rosids. Yang, Z. & Rannala, B. (2006) Bayesian estimation of species diver-
bioRxiv, 749325. gence times under a molecular clock using multiple fossil cali-
Szitenberg, A., John, M., Blaxter, M.L. & Lunt, D.H. (2015) Repro- brations with soft bounds. Molecular Biology and Evolution, 23,
Phylo: an environment for reproducible Phylogenomics. PLoS Com- 212–226.
putational Biology, 13, 11, 9, e1004447. Yang, Z. & Rannala, B. (2012) Molecular phylogenetics: principles and
Tatusov, R.L., Fedorova, N.D., Jackson, J.D. et al. (2003) The COG practice. Nature Reviews Genetics, 13, 303–314.
database: an updated version includes eukaryotes. BMC Bioinformat- Yu, Y., Harris, A.J., Blair, C. & He, X. (2015) RASP (reconstruct
ics, 4, 41. ancestral state in phylogenies): a tool for historical biogeography.
Than, C. & Nakhleh, L. (2009) Species tree inference by minimizing Molecular Phylogenetics and Evolution, 87, 46–49.
deep coalescences. PLoS Computational Biology, 5, e1000501. Yu, Y., Harris, A.J. & He, X. (2010) S-DIVA (statistical dispersal-
Thurmond, J., Goodman, J.L., Strelets, V.B. et al. (2019) FlyBase 2.0: Vicariance analysis): a tool for inferring biogeographic histories.
the next generation. Nucleic Acids Research, 47, D759–D765. Molecular Phylogenetics and Evolution, 56, 848–850.
Title, P.O. & Rabosky, D.L. (2017) Do macrophylogenies yield stable Zerbino, D.R. & Birney, E. (2008) Velvet: algorithms for de novo
macroevolutionary inferences? An example from Squamate reptiles. short read assembly using de Bruijn graphs. Genome Research, 18,
Systematic Biology, 66, 843–856. 821–829.
Trachana, K., Larsson, T.A., Powell, S., Chen, W.-H., Doerks, T., Zhang, C. (2017) Molecular Clock Dating using MrBayes. arXiv,
Muller, J. & Bork, P. (2011) Orthology prediction methods: a quality 1603.05707.
assessment using curated protein families. BioEssays, 33, 769–780. Zhang, C., Rabiee, M., Sayyari, E. & Mirarab, S. (2018) ASTRAL-III:
Turney, S., Cameron, E.R., Cloutier, C.A. & Buddle, C.M. (2015) polynomial time species tree reconstruction from partially resolved
Non-repeatable science: assessing the frequency of voucher specimen gene trees. BMC Bioinformatics, 19, 153.
deposition reveals that most arthropod research cannot be verified. Zhang, C., Stadler, T., Klopfstein, S., Heath, T.A. & Ronquist, F. (2016)
PeerJ, 3, e1168. Total-evidence dating under the fossilized birth–death process. Sys-
Ullah, I., Sjöstrand, J., Andersson, P., Sennblad, B. & Lagergren, J. tematic Biology, 65, 228–249.
(2015) Integrating sequence evolution into probabilistic Orthology Zhang, M.Y., Williams, J.L. & Lucky, A. (2019) Understanding UCEs: a
analysis. Systematic Biology, 64, 969–982. comprehensive primer on using ultraconserved elements for arthropod
van der Heijden, R.T.J.M., Snel, B., van Noort, V. & Huynen, M.A. phylogenomics. Insect Systematics and Diversity, 3, 1–12.
(2007) Orthology prediction at scalable resolution by phylogenetic Zhong, B., Deusch, O., Goremykin, V.V. et al. (2011) Systematic error
tree analysis. BMC Bioinformatics, 8, 83. in seed plant phylogenomics. Genome Biology and Evolution, 3,
Vanderpoorten, A. & Goffinet, B. (2006) Mapping uncertainty and phy- 1340–1348.
logenetic uncertainty in ancestral character state reconstruction: an Zuckerkandl, E. & Pauling, L. (1962) Molecular disease, evolution and
example in the Moss genus Brachytheciastrum. Systematic Biology, genetic heterogeneity. Horizons in Biochemistry (ed. by M. Kasha and
55, 957–971. B. Pullman), pp. 189–225. Academic Press, New York, New York.
Wang, Z., Gerstein, M. & Snyder, M. (2009) RNA-Seq: a revolutionary Zwickl, D.J. (2006) GARLI: genetic algorithm for rapid likelihood
tool for transcriptomics. Nature Reviews Genetics, 10, 57–63. inference. PhD Dissertation, The University of Texas at Austin, Austin
Weitemier, K., Straub, S.C.K., Cronn, R.C., Fishbein, M., Schmickl,
R., McDonnell, A. & Liston, A. (2014) Hyb-Seq: combining target
enrichment and genome skimming for plant phylogenomics. Appli- Accepted 30 October 2019
cations in Plant Science, 2, 1400042.

© 2019 The Royal Entomological Society, Systematic Entomology, doi: 10.1111/syen.12406

View publication stats

You might also like