
Resource

https://doi.org/10.1038/s41564-021-00918-8

A standardized archaeal taxonomy for the Genome Taxonomy Database
Christian Rinke1 ✉, Maria Chuvochina1, Aaron J. Mussig1, Pierre-Alain Chaumeil1, Adrián A. Davín1, David W. Waite2, William B. Whitman3, Donovan H. Parks1 and Philip Hugenholtz1 ✉

The accrual of genomic data from both cultured and uncultured microorganisms provides new opportunities to develop systematic taxonomies based on evolutionary relationships. Previously, we established a bacterial taxonomy through the Genome Taxonomy Database. Here, we propose a standardized archaeal taxonomy that is derived from a 122-concatenated-protein phylogeny that resolves polyphyletic groups and normalizes ranks based on relative evolutionary divergence. The resulting archaeal taxonomy, which forms part of the Genome Taxonomy Database, is stable for a range of phylogenetic variables including marker gene selection, inference methods, corrections for rate heterogeneity and compositional bias, tree rooting scenarios and expansion of the genome database. Rank normalization is shown to robustly correct for substitution rates varying up to 30-fold using simulated datasets. Taxonomic curation follows the rules of the International Code of Nomenclature of Prokaryotes while taking into account proposals to formally recognize the rank of phylum and to use genome sequences as type material. This taxonomy is based on 2,392 archaeal genomes, 93.3% of which required one or more changes to their existing taxonomy, mainly owing to incomplete classification. We identify 16 archaeal phyla and reclassify 3 major monophyletic units from the former Euryarchaeota and one phylum that unites the Thaumarchaeota–Aigarchaeota–Crenarchaeota–Korarchaeota (TACK) superphylum into a single phylum.

Carl Woese’s discovery of the Archaea, originally termed Archaebacteria1, in 1977 gave rise to the recognition of a new domain of life2 and fundamentally changed our view of cellular evolution on Earth. Over time an increasing number of Archaea have been described, initially from extreme environments but subsequently from soils, oceans, fresh water and animal guts, highlighting the global importance of this domain2. Since their recognition, Archaea have been classified primarily by genotype, that is, small subunit (SSU) ribosomal RNA (rRNA) gene sequences. Therefore, compared to Bacteria, they suffer less from historical misclassifications based on phenotypic properties3. Using the SSU rRNA gene, Woese initially described two major lines of archaeal descent, the Euryarchaeota and the Crenarchaeota4, and in the following years, all newly discovered archaeal lineages were added to these two main groups. Non-extremophilic Archaea were generally classified as Euryarchaeota, which, together with the discovery of novel extremophile euryarchaeotes, contributed to a considerable expansion of this lineage (for recent reviews see Adam et al.5, Baker et al.6 and Spang et al.7 and references therein). Eventually, two new archaeal phyla were proposed based on phylogenetic novelty of their SSU rRNA sequences: the Korarchaeota8, recovered from hot springs in Yellowstone National Park, and the nanosized, symbiotic Nanoarchaeota9, co-cultured from a submarine hot vent. By the mid-2000s, archaeal classification had begun to leverage genome sequences, and the first sequenced mesophilic crenarchaeote, Candidatus Cenarchaeum symbiosum, belonging to Marine Group I (ref. 10), was used to argue that mesophilic Crenarchaeota should be in a separate phylum from hyperthermophilic Crenarchaeota, for which the name Thaumarchaeota was eventually proposed11. Subsequently, the field experienced a burst in availability of genomic data due to the substantial acceleration of culture-independent genome recovery driven by improvements in high-throughput sequencing5,7. This resulted in the description and naming of several new archaeal lineages, including the candidate phyla Aigarchaeota12, Geoarchaeota13 and Bathyarchaeota14, previously reported as members of the Crenarchaeota based on SSU rRNA data. Some of the proposed phyla were met with criticism, such as the Geoarchaeota, which was considered to be a member of the order Thermoproteales rather than a novel phylum15. In 2011, the TACK superphylum was proposed, originally comprising the Thaumarchaeota, Aigarchaeota, Crenarchaeota and Korarchaeota16, and more recently adding the Bathyarchaeota14 and Verstraetearchaeota17; in essence, TACK comprised all lineages except Euryarchaeota5 and Nanoarchaeota9.

New archaeal lineages were also described outside the Euryarchaeota and TACK and were ultimately classified into two superphyla, DPANN and Asgard18,19. DPANN as originally proposed was based on five phyla, Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaeota18, but now also includes the Micrarchaeota, Woesearchaeota, Pacearchaeota, Altiarchaeota and Huberarchaeota5,20–23. The Asgard archaea are notable for their inferred sister relationship to the eukaryotes and were originally proposed based on the phyla Lokiarchaeota, Thorarchaeota, Odinarchaeota and Heimdallarchaeota19,24,25, followed by the Helarchaeota26. The net result of these cumulative activities is that archaeal classification at higher ranks is currently very uneven. The Euryarchaeota absorbed novel lineages and grew into a phylogenetic behemoth, whereas the

1Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Australia. 2School of Biological Sciences, The University of Auckland, Auckland, New Zealand. 3Department of Microbiology, University of Georgia, Athens, GA, USA. ✉e-mail: c.rinke@uq.edu.au; p.hugenholtz@uq.edu.au

946 Nature Microbiology | VOL 6 | July 2021 | 946–959 | www.nature.com/naturemicrobiology


Crenarchaeota were split into multiple subordinate taxa including several shallow lineages, such as Geoarchaeota, which, despite their phylogenetic disproportion, have been given the same rank of phylum. Attempts to rectify this taxonomic bias have included proposals to reclassify TACK as a single phylum termed Proteoarchaeota27 and to introduce a new taxonomic rank above the class level that would generate several superclasses within the Euryarchaeota5,28. Most proposing and naming of new phyla has occurred outside the International Code of Nomenclature of Prokaryotes29 (ICNP) because the ICNP does not consider uncultured microorganisms nor recognize the ranks of phylum and superphylum. However, proposals have been made to include the rank of phylum30 and to allow gene sequences to serve as type material31, which would contribute to taxonomic stability32. Currently, uncultured Archaea and Bacteria can be provisionally named using Candidatus status33. However, these names have no formal standing in nomenclature, do not have priority and often contradict the current ICNP rules or are otherwise problematic34. The waters have been further muddied by the proposal of names for higher taxa without designation of lower ranks and type material, which leaves them without nomenclatural anchors and with their circumscription subject to dispute if they become polyphyletic32. In addition to these higher-level classification issues, the current archaeal taxonomy suffers from the same phylogenetic inconsistencies observed in the Bacteria, such as polyphyletic taxa (for example, class Methanomicrobia; see Removal of polyphyletic groups and rank normalization), but to a lesser degree than the Bacteria due to the early integration of phylogeny and the relatively small size of the archaeal dataset. More problematic is the widespread incomplete classification of environmental archaeal sequences in the published literature, which is reflected in the NCBI taxonomy, where sequences are often only assigned to a candidate phylum with no subordinate rank names. This high degree of incomplete classification is likely due to a natural hesitancy to create novel genera and intermediate taxa for groups lacking isolated representatives. Further, it is our opinion that biological classification should be based on an evolutionary framework that takes into account differing rates of evolution such that ancestral taxa of a given rank co-existed in time and are therefore directly comparable35. Currently, differences in evolutionary rates are not uniformly taken into account across archaeal taxa in the NCBI taxonomy.

For the domain Bacteria, we addressed these long-standing taxonomic issues and inconsistencies by proposing a standardized taxonomy referred to as the Genome Taxonomy Database (GTDB; gtdb.ecogenomic.org). The GTDB normalizes rank assignments using relative evolutionary divergence (RED) in a concatenated protein phylogeny that takes into account differing evolutionary rates, followed by an extensive automated and manual taxonomy curation process. This approach resulted in taxonomic changes to over half of the nearly 95,000 bacterial genomes analysed35 and has recently been extended to a complete classification from domain to species36. Here we present a GTDB taxonomy for the domain Archaea (release R04-RS89), comprising 2,392 genomes from cultured and uncultured organisms. This taxonomic release circumscribes 16 phyla, including 3 phyla from major monophyletic units of the Euryarchaeota and one phylum resulting from the amalgamation of the TACK superphylum. The archaeal taxonomy is publicly available at the GTDB website (https://gtdb.ecogenomic.org/), and we invite community engagement and feedback on the taxonomy through an online forum (https://forum.gtdb.ecogenomic.org).

Results

Reference genome tree and decoration with the NCBI taxonomy. The archaeal GTDB (04-RS89) comprises 2,392 quality-filtered genomes obtained from RefSeq/GenBank release 89 (ref. 37 and Methods). Genomes were clustered into species units based on average nucleotide identity (Methods), resulting in 1,248 species36. A representative sequence of each species (that is, the genome of the type strain or the highest quality genome of an uncultivated species38), was then used for phylogenomic analyses (Supplementary Table 1). The protein sequences for up to 122 conserved single-copy ubiquitous archaeal genes were recovered from each genome39 (Supplementary Table 2 and Supplementary Fig. 1), aligned, tested for taxonomic congruence, concatenated into a supermatrix and trimmed to 5,124 columns (Methods). A reference tree (ar122.r89) was inferred using the C10 protein mixture model with the posterior mean site frequency (PMSF) approximation40 implemented in IQ-TREE (ref. 41). The PMSF model is a faster approximation of finite mixture models, which capture the heterogeneity in the amino acid substitution process between sites, and is able to mitigate long-branch attraction artefacts40. Additional trees were inferred from different alignments and with alternative inference methods to assess the robustness of the GTDB taxonomy (see Robustness of GTDB archaeal taxonomy). The reference tree was initially decorated with taxon names obtained from the NCBI taxonomy42 standardized to seven canonical ranks, as previously described35. Strikingly, many archaeal genomes had

Fig. 1 | Comparison of rank-normalized archaeal GTDB and NCBI taxonomies. a,b, RED of taxa defined by the NCBI taxonomy (a) and the curated
GTDB taxonomy (04-RS89) (b). Each data point (black circle) represents a taxon distributed according to its RED value (x axis) and its rank (y axis).
The fill colours of the circle (blue, grey or orange) indicate that a taxon is monophyletic, operationally monophyletic (defined as having an F measure
>0.95) or polyphyletic, respectively, in the underlying genome tree. An overlaid histogram shows the relative abundance of monophyletic, operationally
monophyletic and polyphyletic taxa for each 0.025 RED interval. Blue bars show the median RED value, and the black bars on either side show the RED
interval (±0.1) for each rank. Note that in the NCBI taxonomy the values of the higher ranks (order and above) are very unevenly distributed, to the point
that the medians are out of order; that is, the median RED value for classes was higher than the median value for orders. The GTDB taxonomy uses the
RED value to resolve over- and under-classified taxa by moving them to a new interior node (horizontal shift in plot) or by assigning them to a new rank
(vertical shift in plot). Only monophyletic or operationally monophyletic taxa were used to calculate the median RED values for each rank. In addition, only
taxa with a minimum of two children (for example, a phylum with two or more classes or a class with two or more orders) were considered for the GTDB
tree (‘--min_children 2’); however, a more lenient approach (‘--min_children 0’) was necessary for the NCBI tree since none of the NCBI phyla had the
required minimum of two classes, with the exception of the Euryarchaeota. Note that the phylum Crenarchaeota is not displayed in the NCBI plot, since
all genomes in this NCBI phylum are assigned to the class Thermoprotei, resulting in a single node decorated as ‘p__Crenarchaeota; c__Thermoprotei’
(Tpr). Also, Korarchaeota are only represented by a single species, Korarchaeum cryptofilum, in GTDB 04-RS89, and hence there is no internal node to be
displayed in this plot. RED values were calculated based on the ar122.r89 tree, inferred from 122 concatenated proteins, decorated with either the NCBI
or GTDB taxonomy. c, Rank comparison of GTDB and NCBI taxonomies. Shown are changes in GTDB compared to the NCBI taxonomic assignments
across 2,392 archaeal genomes from RefSeq/GenBank release 89. Note that the 153 UBA genomes passing quality control (QC) are not included
(2,392–153 = 2,239) since they had no NCBI taxonomy assignment. In the bars on the left, a taxon is shown as unchanged if its name was identical in both
taxonomies, as a passive change if the GTDB taxonomy provided name information absent in the NCBI taxonomy (missing names) or as an active change
if the name was different between the two taxonomies. The right bar shows the changes of the entire tax string (consisting of seven ranks) per genome,
indicating that most genomes had both active and passive changes in their taxonomy.
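The three categories used in panel c (unchanged, passive change, active change) can be made concrete with a short sketch. The Python below is illustrative only and is not the authors' pipeline: the function names and the toy taxon names are hypothetical, and a rank for which NCBI provides no name is assumed to be represented as None.

```python
from typing import Dict, Optional

# The seven canonical ranks that make up a full taxonomy string.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def classify_rank(ncbi_name: Optional[str], gtdb_name: str) -> str:
    """Compare one rank of one genome between the two taxonomies."""
    if ncbi_name is None:
        return "passive"      # GTDB supplies a name that NCBI lacks
    if ncbi_name == gtdb_name:
        return "unchanged"    # identical name in both taxonomies
    return "active"           # the name differs between the taxonomies


def classify_genome(ncbi: Dict[str, Optional[str]], gtdb: Dict[str, str]) -> str:
    """Summarize the changes over the whole taxonomy string of one genome."""
    kinds = {classify_rank(ncbi.get(rank), gtdb[rank]) for rank in RANKS}
    kinds.discard("unchanged")
    if not kinds:
        return "unchanged"
    if kinds == {"active"}:
        return "only active"
    if kinds == {"passive"}:
        return "only passive"
    return "active and passive"


# Hypothetical genome: NCBI gives only phylum, genus and species names.
ncbi_toy = {"domain": "Archaea", "phylum": "Phylum_X", "class": None,
            "order": None, "family": None,
            "genus": "Genus_Y", "species": "Genus_Y sp1"}
gtdb_toy = {"domain": "Archaea", "phylum": "Phylum_Z", "class": "Class_Z",
            "order": "Order_Z", "family": "Family_Z",
            "genus": "Genus_Y", "species": "Genus_Y sp1"}

summary = classify_genome(ncbi_toy, gtdb_toy)
```

In this toy case the phylum name differs (active change) and the class, order and family names are missing in NCBI (passive changes), so the genome falls into the "active and passive" group of the right-hand bar.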


no canonical rank information beyond their phylum affiliation (31.0%), which is partly offset by the extensive use of names with no rank in the NCBI taxonomy (Supplementary Fig. 2a). This was particularly apparent for DPANN phyla, which are almost entirely lacking in information in the family to class ranks (Supplementary Fig. 2b).

Fig. 1 graphic (not reproduced). Panels a and b: x axis, relative evolutionary divergence (0 to 1.0); y axis, rank (no. taxa). Panel a (NCBI taxonomy): phylum (10), class (8), order (14), family (26), genus (88), species (15). Labelled taxa in panel a: Ae, phylum Ca. Aenigmarchaeota; Bat, phylum Ca. Bathyarchaeota; Dia, phylum Ca. Diapherotrites; Eur, phylum Euryarchaeota; Hei, phylum Ca. Heimdallarchaeota; Mar, phylum Ca. Marsarchaeota; Mic, phylum Ca. Micrarchaeota; Tha, phylum Thaumarchaeota; Tho, phylum Ca. Thorarchaeota; Woe, phylum Ca. Woesearchaeota; Tpr, class Thermoprotei; Tpl, class Thermoplasmata; Thm, genus Thermofilum. Panel b (GTDB taxonomy): phylum (5), class (18), order (36), family (76), genus (174). Labelled taxa in panel b: Asg, p__Asgardarchaeota; Met, p__Methanobacteriota (Euryarchaeota); Mar, o__Marsarchaeales; Tpr, c__Thermoprotei; Tpl, c__Thermoplasmata; Thm, g__Thermofilum; ThA, g__Thermofilum_A. Panel c: y axis, genomes (%) (2,239 total), 0 to 100; x axis, phylum, class, order, family, genus, species and per genome. Per-rank legend: active change, passive change, unchanged. Per-genome legend: only passive, active and passive, only active, unchanged.

Removal of polyphyletic groups and rank normalization. Approximately 10% of NCBI taxa above the rank of species (26 of 252) could not be reproducibly resolved as monophyletic or operationally monophyletic in the reference tree (the latter is defined as having an F measure ≥0.95 (ref. 35)) (Supplementary Table 3 and Methods). These include the phyla Nanoarchaeota, Aenigmarchaeota and Woesearchaeota, which were intermingled with each other and with unclassified archaeal genomes (Supplementary Fig. 3). Another prominent example is the class Methanomicrobia, which comprised three orders in NCBI, two of which (Methanosarcinales and Methanocellales) were clearly separated from the type order Methanomicrobiales (Supplementary Fig. 4). To resolve these cases, the lineage containing the nomenclature type retained the name. Where possible, all other groups were renamed following the ICNP and recent proposals to modify the code (Methods). For example, the Methanomicrobia were resolved in GTDB by reserving the name for the lineage containing the type genus of the type order of the class (that is, Methanomicrobium) and by reclassifying the two remaining orders into their own classes, Methanosarcinia class. nov. and Methanocellia class. nov. (Supplementary Fig. 4). These and other ICNP-based rules used to standardize the curation process above the rank of genus are summarized in a decision tree (Supplementary Note 1 and Methods). When genera or species needed to be subdivided due to polyphyly, existing names were retained as placeholders with alphabetical suffixes for groups lacking nomenclature types. For example, the genus Thermococcus is polyphyletic because it also comprises species of the genus Pyrococcus. This polyphyly was resolved by retaining the name for the monophyletic group containing the type species, Thermococcus celer, and by assigning alphabetical suffixes to three basal groups comprising other Thermococcus genomes (Supplementary Fig. 5). Note that Thermococcus chitonophagus was transferred to Pyrococcus according to the GTDB reclassification due to its proximity to the type species of this genus (Supplementary Fig. 5).

Taxonomic ranks were normalized using the RED metric. This method linearly interpolates the inferred phylogenetic distances between the last common ancestor (set to RED = 0) and all extant taxa (RED = 1) providing an approximation of relative time of divergence35. We tested the tolerance of the RED approach to increasing differences in substitution rates between lineages using simulated datasets. The approach was robust to variable rates up to 30-fold different (Supplementary Note 2, Extended Data Figs. 1 and 2), which is likely substantially higher than naturally occurring variation between prokaryotic taxa43. Rank distributions in the NCBI-decorated reference tree were extremely broad, highlighting severely under- and over-classified outlier taxa (Fig. 1a). Rank distributions were normalized35 by systematically reclassifying outliers either by reassignment to a new rank with associated nomenclatural changes for Latin names or by moving names to new interior nodes in the tree (Fig. 1b). This resulted in large movements of the median RED values of the higher ranks (order and above) to produce a normalized distribution (Fig. 1b). Over half (56.4%) of all archaeal NCBI taxon names had to be changed in the GTDB taxonomy, with the largest percentage of changes occurring at the phylum level (76.6%) (Fig. 1c). Examples of changes at lower ranks due to RED normalization include the genus Methanobrevibacter, which was divided into five genus-level groups: Methanobrevibacter, which includes the type species, and four genera with alphabetical suffixes (Methanobrevibacter_A to _D; Supplementary Fig. 6). GTDB names were assigned to nodes with high bootstrap support (mean 98.5%) to ensure taxonomic stability with only a small number of exceptions (bootstrap support < 90%; Supplementary Table 4). Overall, 93.3% of the 2,239 archaeal genomes present in NCBI release 89 had one or more changes in their taxonomic assignments (Fig. 1c).

Robustness of GTDB archaeal taxonomy. We tested the robustness of the monophyly and rank normalization of the proposed taxonomy in relation to a number of standard phylogenetic variables including marker genes, inference method, compositional bias, fast-evolving sites, increasing number of genomes and rooting of the tree. Comparing tree similarities indicated that marker choice had a stronger influence on the tree topology than did inference methods and substitution models (Supplementary Fig. 7 and Supplementary Fig. 8). However, for all subsequent comparisons we focused on the robustness of the taxonomy, not the overall consistency of tree topologies, as only a subset of interior nodes (69.9%) in the reference tree were used for taxon classification (Supplementary Table 5).

Markers. As expected, individual protein phylogenies of the 122 markers had lower phylogenetic resolution than the reference tree, particularly at the ranks of class and phylum (Supplementary Fig. 9a,b). However, 78.5% of the GTDB taxa with two or more representatives above the rank of species were still recovered as monophyletic groups in ≥50% of single protein trees, and on average, taxa were resolved as monophyletic in 74.1% of the single protein trees (Supplementary Fig. 9c,d). We also compared the GTDB taxonomy to SSU rRNA gene trees due to their historical importance in defining archaeal taxa. However, this was complicated by the absence of this gene (>900 nucleotides after quality-trimming) in almost half of the species representatives (578 of 1,248; 46.3%). The majority (94.1%) of species representatives lacking SSU rRNA sequences were draft metagenome-assembled genome (MAG) assemblies (Supplementary Table 6), which often lack this gene due to the difficulties of correctly assembling and binning rRNA repeats in metagenomic datasets39,44. Over 84% of the GTDB taxa with two or more SSU rRNA representatives were operationally monophyletic in the SSU rRNA tree, with loss of monophyly most pronounced in the higher ranks (Supplementary Fig. 10), as seen for the single protein phylogenies. Three alternative concatenated protein marker datasets were assessed: rp1 (16 ribosomal proteins, Supplementary Table 7 (ref. 45)), rp2 (23 ribosomal proteins, Supplementary Table 7 (ref. 18)) and 53 recently proposed archaeal markers (ar.53, Supplementary Table 8 (ref. 46)). The great majority (≥95%) of GTDB taxa above the rank of genus with two or more representatives were recovered as monophyletic groups in the three alternative marker set trees inferred using the same C10 PMSF mixture model as the reference tree (Extended Data Fig. 3 and Supplementary Fig. 11)18. The small percentage of taxa not resolved in the alternative trees (Extended Data Fig. 4, Supplementary Fig. 11 and Supplementary Table 9) were well supported in the reference tree (average bootstrap support 90.3 ± 8.8%; Supplementary Fig. 12). However, these taxa were resolved as operationally monophyletic in a lower proportion of individual protein trees than other GTDB taxa (33.7 ± 14.7% versus 77.1 ± 18.3%). The RED distributions of the GTDB taxa were comparable in the reference, rp1 and rp2 trees, whereas the SSU rRNA tree had a substantially broader distribution (Fig. 2a,e and Supplementary Fig. 13), likely reflecting undersampling of the topology, lower resolution of single marker genes relative to concatenated marker sets and, potentially, compositional bias in the SSU rRNAs of thermophiles47.

Inference methods and models. We overlaid the IQ-TREE-based taxonomy onto trees inferred with different models (for example, C20 and C60; Supplementary Table 10) and phylogenetic inference tools, including FastTree and ExaML (maximum likelihood), PhyloBayes (Bayesian) and ASTRAL (supertree). All methods were applied to the 122-archaeal-marker set with the exception of the ASTRAL supertree, which was also applied to a 253-concatenated-marker set subsampled from the PhyloPhlAn dataset48. Overall, the GTDB taxonomy was remarkably consistent with comparable RED distributions for taxa at each investigated rank (Fig. 2a,d and Supplementary


Fig. 2 graphic (not reproduced). Each plot shows RED distributions per rank (p, c, o, f, g) relative to the rank median (x axis, −0.2 to 0.2), with the number of taxa per rank given on the right. Panel a: ar.122.r89 (IQTREE C10 PMSF) (2.1); IQTREE C10 (2.3); IQTREE C60 (2.5); 32kAA IQTREE C10 PMSF (17). Panel b: BMGE IQTREE C10 PMSF (11.2); Tr. 20% IQTREE C10 PMSF (15.3.4); Tr. 40% IQTREE C10 PMSF (15.3.8); DIVVIER IQTREE C10 PMSF (11.5). Panel c: Recoding IQTREE C60 SR4 (19.1); Recoding BMGE IQTREE C60 SR4 (19.2); SlowFaster 20% IQTREE C10 (15.2.2); SlowFaster 40% IQTREE C10 (15.2.4). Panel d: FastTree2 WAGG (1); ExaML JTT G (12); PhyloBayes CAT (10). Panel e: RP1 IQTREE C10 PMSF (5); RP2 IQTREE C10 PMSF (7); SSU IQTREE (14.3); ar53 IQTREE C10 (20.1). Panel f: ASTRAL 122 (18.1); ASTRAL 253 (18.2). Colour legend: percentage of polyphyletic taxa per rank (0%; >0% to <10%; >10% to <20%; >20%).

Fig. 2 | Comparison of marker sets, inference methods and models. Phylogenetic trees inferred with different methods, from varying concatenated
alignments or via supertree approaches were decorated with the GTDB 04-RS89 taxonomy. RED distributions for taxa at each rank (p, phylum; c, class; o,
order; f, family; g, genus) are shown (y axis) relative to the median RED value of the rank (x axis). The number of taxa is provided on the right-hand side of
the plots. The legend indicates the percentage of polyphyletic taxa per rank, defined as an F measure <0.95. Note that only taxa with two or more genomes
were included. a, Trees inferred with IQ-TREE from a concatenated alignment of the 122 GTDB markers with ~5,000 alignment columns using different
profile mixture models (C10 PMSF, C10 and C60) and from the untrimmed 32kAA alignment. b, Trees inferred from a modified concatenated alignment
of the 122 GTDB markers to account for compositional bias, including stationary (BMGE) and progressive trimming (Tr. 20%, Tr. 40%) of heterogeneous
sites and clustering of sites with shared homology (Divvier). c, Trees inferred from a modified concatenated alignment of the 122 GTDB markers including
recoding into four character states (C60 SR4), recoding and stationary trimming (BMGE C60 SR4) and removal of 20% and 40% of the fastest-evolving
sites (SlowFaster 20% and 40%). Note that, due to technical limitations, a reduced order-dereplicated genome set was used for SlowFaster, allowing
the evaluation of only the phylum and class ranks. d, Trees inferred from 122 marker alignments using different inference software and models, including
FastTree2, ExaML and PhyloBayes. Note that due to computational constraints PhyloBayes was calculated from the order-dereplicated dataset, allowing the
evaluation of only the ranks phylum and class. e, Trees inferred from alternative markers, including 16 ribosomal proteins (rp1), 23 ribosomal proteins (rp2),
SSU rRNA genes (SSU) and a set of 53 marker proteins (ar53). f, Trees inferred with the ASTRAL supertree approach using 122 and 253 marker proteins.
More details about inference models and methods are given in Supplementary Table 10. The number in parentheses following each tree name (for example,
‘(2.1)’) refers to the number of this tree in the Supplementary Information, including Supplementary Table 10. The violin plots include a marker for the
median and a box indicating the first and third quartiles.
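The monophyletic, operationally monophyletic and polyphyletic categories used in Figs. 1 and 2 rest on an F measure per taxon. The text does not spell out its formula, so the sketch below makes an assumption: the F measure is taken to be the harmonic mean of precision and recall of a taxon's genomes against the best-matching clade in the tree. The nested-tuple tree encoding, the function names and the toy genome labels are all hypothetical.

```python
from typing import FrozenSet, List, Set


def clades(tree) -> List[FrozenSet[str]]:
    """Leaf sets of every node in a nested-tuple tree (leaves are strings)."""
    found: List[FrozenSet[str]] = []

    def walk(node) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*[walk(child) for child in node])
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def best_f_measure(members: Set[str], tree) -> float:
    """Highest F measure of the taxon over all clades (assumed definition)."""
    best = 0.0
    for clade in clades(tree):
        shared = len(members & clade)
        if shared == 0:
            continue
        precision = shared / len(clade)    # fraction of the clade in the taxon
        recall = shared / len(members)     # fraction of the taxon in the clade
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def classify(members: Set[str], tree, threshold: float = 0.95) -> str:
    f = best_f_measure(members, tree)
    if f == 1.0:
        return "monophyletic"
    if f >= threshold:
        return "operationally monophyletic"
    return "polyphyletic"


# Toy tree: 20 genomes of one taxon plus one intruder, and a two-genome outgroup.
core = [f"g{i}" for i in range(20)]
toy_tree = (tuple(core + ["intruder"]), ("out1", "out2"))

status_core = classify(set(core), toy_tree)
status_out = classify({"out1", "out2"}, toy_tree)
status_mixed = classify({"intruder", "out1"}, toy_tree)
```

Here the 20-genome taxon shares its best clade with a single intruder, giving precision 20/21 and recall 1, hence F = 40/41 (about 0.976), which clears the 0.95 threshold without being strictly monophyletic.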

Fig. 13). All GTDB taxa were recovered as monophyletic or operationally monophyletic (Extended Data Fig. 3 and Supplementary Fig. 11) for all supermatrix methods, regardless of the underlying inference algorithm or model. The ASTRAL supertrees, inferred using the most divergent approach tested, recovered 96% of GTDB taxa with two or more representatives as monophyletic or operationally monophyletic (Extended Data Fig. 3 and Supplementary Fig. 11). The only major inconsistency affected the Euryarchaeota,

Fig. 3 graphic (not reproduced). Panels a, b and c each show an upper plot of RED (x axis, 0 to 1.0) against rank (no. taxa): phylum (5), class (18), order (36), family (76), genus (174). Below each plot is a tree (scale bar, 0.10) with an "Eury" label and the phyla Halobacteriota, Thermoplasmatota, Methanobacteriota, Hydrothermarchaeota, Hadarchaeota, Thermoproteota, Asgardarchaeota, Nanoarchaeota, Nanohaloarchaeota, EX4484-52, Huberarchaeota, Aenigmarchaeota, UAP2, Micrarchaeota, Iainarchaeota and Altarchaeota.

Fig. 3 | Impact of different rooting scenarios on RED intervals. a–c, The rooting approach implemented in GTDB (a), which calculates the RED as
the median of all possible rootings of phyla with at least two classes (red arrows), is compared to a fixed root between the DPANN superphylum (red
arrow) and the remaining Archaea (b) and to a fixed root within the NCBI phylum Euryarchaeota, which translates to a root between the two phyla
Thermoplasmatota and Halobacteriota (red arrow) and the rest of the Archaea in the GTDB taxonomy (c). In the upper RED plot each data point (black
circle) represents a taxon distributed according to its RED value (x axis) and its rank (y axis). An overlaid histogram shows the relative abundance of taxa
for each 0.025 RED interval, a blue bar shows the median RED value and two black bars on either side show the RED interval (±0.1) for each rank. Note
that, overall, the ranks can be distinguished based on their RED value, regardless of the applied rooting scenario. Furthermore, RED values are relative and
should not be directly compared between plots, as they are dataset specific. Rather, the distribution of RED values is the key metric; that is, the distance
(positive or negative) from the median of the RED value for each rank (ΔRED). The trees include a label highlighting the corresponding NCBI phylum
Euryarchaeota (Eury) as a point of reference. The scale bars indicate 0.1 substitutions.
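To illustrate how RED and ΔRED values of this kind can be obtained, the sketch below computes RED on a toy rooted tree. It rests on assumptions not stated in this text: the interpolation formula RED(n) = p + (d/u)(1 − p), with p the RED of the parent, d the branch length to the parent and u the mean distance from the parent to the extant taxa below n, is recalled from the earlier GTDB bacterial work (ref. 35) and should be checked against that source. The sketch uses one fixed root rather than the median over several rootings, and the Node class, function names and branch lengths are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    length: float = 0.0                      # branch length to the parent
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node: Node) -> List[float]:
    """Distances from this node to each of its descendant leaves."""
    if not node.children:
        return [0.0]
    return [child.length + d
            for child in node.children
            for d in leaf_distances(child)]


def assign_red(node: Node, parent_red: Optional[float] = None,
               red: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Root gets RED = 0, leaves RED = 1, interior nodes are interpolated."""
    if red is None:
        red = {}
    if parent_red is None:
        red[node.name] = 0.0
    elif not node.children:
        red[node.name] = 1.0
    else:
        dists = leaf_distances(node)
        u = node.length + sum(dists) / len(dists)   # parent-to-leaf mean
        red[node.name] = parent_red + (node.length / u) * (1.0 - parent_red)
    for child in node.children:
        assign_red(child, red[node.name], red)
    return red


def delta_red(red_by_taxon: Dict[str, float]) -> Dict[str, float]:
    """Signed distance of each taxon from the median RED of its rank."""
    mid = median(red_by_taxon.values())
    return {taxon: value - mid for taxon, value in red_by_taxon.items()}


# Toy tree: lineage C has branch lengths twice those of lineage A.
toy = Node("root", 0.0, [
    Node("A", 0.5, [Node("a1", 0.5), Node("a2", 0.5)]),
    Node("B", 0.2, [Node("b1", 0.8), Node("b2", 0.8)]),
    Node("C", 1.0, [Node("c1", 1.0), Node("c2", 1.0)]),
])

red = assign_red(toy)
same_rank = {name: red[name] for name in ("A", "B", "C")}
deltas = delta_red(same_rank)
outliers = [name for name, d in deltas.items() if abs(d) > 0.1]
```

In this toy tree A and C receive the same RED (0.5) even though C accumulates substitutions twice as fast, which is the rate correction the metric is meant to provide. B sits at 0.2, outside the ±0.1 interval around the median, and would be the kind of taxon flagged for a rank change or a move to a different interior node.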

with 48 taxa being placed outside of this phylum in both supertrees (Supplementary Fig. 14e and Supplementary Fig. 15). All phylogenetic trees inferred in this study are summarized in Supplementary Table 10 and provided in Supplementary Data 1.

Compositional bias and fast-evolving sites. To test for possible compositional bias we employed tools for character trimming and for clustering of high confidence positional homology (Methods) that have been shown to alleviate long-branch attraction artefacts and to increase tree accuracy for alignments of distantly related sequences49,50. Decorated maximum likelihood (ML) trees calculated from the trimmed or clustered alignments were in strong taxonomic agreement with the reference tree, showing comparable RED values and recovering >96.5% of GTDB taxa with two or more representatives as monophyletic or operationally monophyletic (Fig. 2b,c, Extended Data Fig. 3 and Supplementary Figs. 11 and 13). Less than 4% of genomes had conflicting taxonomic assignments at any rank above species (Supplementary Table 11), with the largest difference being observed in the class Methanosarcinia (18 taxa) and the order Desulfurococcales (46 taxa) for the character-trimmed alignment (Supplementary Fig. 14d). The clustered alignment resulted in fewer differences, with a total of seven conflicting taxa (Supplementary Fig. 14d).

Analysis of increasing genome database size. An important variable in producing a genome-based taxonomy is the changing number of genome sequences, which will continue to increase in future and will likely impact the underlying tree topology. To assess the robustness of the GTDB taxonomy to this variable, we recapitulated the expansion of the archaeal genome database since 2015 by subsampling the 1,248 archaeal genomes according to their NCBI release date, resulting in 362, 528, 1,035 and 1,183 taxa for the years 2015, 2016, 2017 and 2018, respectively. Comparison of the taxonomy derived from these antecedent trees to the reference tree revealed that >99% of all named taxa were recovered as monophyletic groups, demonstrating that addition of new branches to the reference tree did not destabilize the subset of robust interior nodes used for GTDB taxonomic assignments. This suggests that the GTDB taxonomy should be stable with increasing dataset size in future releases.

Rooting effects. Rank normalization is sensitive to the placement of the tree root, which defines the last common ancestor (set to RED = 0)35,38, and can therefore potentially influence the resulting taxonomy. Since the rooting of both the archaeal and bacterial domains remains contested, GTDB uses an operational approach whereby the median of multiple plausible rootings is considered35. We assessed the effect of fixed rooting of the archaeal domain on

Nature Microbiology | VOL 6 | July 2021 | 946–959 | www.nature.com/naturemicrobiology 951


Resource NATuRE MICRobIoloGy

[Figure 4 image. Ring legend: GTDB phyla, NCBI phyla. Tree scale: 0.1]

Fig. 4 | Rank-normalized archaeal GTDB taxonomy. Species representatives ar122.r89 ML tree, scaled by replacing branch lengths with RED values, and
decorated with the archaeal GTDB taxonomy R04-RS89. The outer blue ring denotes the rank-normalized phyla, and the inner light blue clades indicate
the classes in the rank-normalized GTDB taxonomy. Classes with ten or more taxa are labelled, and classes with order-level divergence are indicated with
both class and order names. The two GTDB phyla consisting of only a single species each, namely Huberarchaeota and EX4484-52, are highlighted by
red branches indicating their uncertain placement in the ar122.r89 tree. The inner orange ring denotes the r89 NCBI phyla with two or more taxa. The
NCBI superphyla TACK and DPANN are indicated by grey arcs. Abbreviations are the following: Bat (Ca. Bathyarchaeota), M (Ca. Marsarchaeota), V (Ca.
Verstraetearchaeota), T (Ca. Thorarchaeota), L (Ca. Lokiarchaeota), H (Ca. Heimdallarchaeota), Woe (Ca. Woesearchaeota), P (Ca. Parvarchaeota), N
(Nanoarchaeota), Nh (Ca. Nanohaloarchaeota), A (Ca. Aenigmarchaeota), M (Ca. Micrarchaeota), D (Ca. Diapherotrites). Bootstrap values over 90% are
indicated by blue dots. Scale bar indicates 0.1 RED.
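The RED scaling mentioned in the caption can be sketched on a toy tree. The interpolation rule used here is our reading of the RED definition published with the bacterial GTDB taxonomy (ref. 35) and is not restated in this article: the root has RED = 0, and a node n with parent RED p receives p + (d/u) × (1 − p), where d is the branch length from the parent to n and u is the mean distance from the parent to the leaves below n. The `Node` class, the function names and the toy branch lengths are hypothetical, and the sketch uses a single fixed root rather than the median over multiple rootings that GTDB applies.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    name: str
    length: float = 0.0                      # branch length to the parent
    children: List["Node"] = field(default_factory=list)

def leaf_depths(node: Node) -> List[float]:
    """Distances from node down to each of its descendant leaves."""
    if not node.children:
        return [0.0]
    return [child.length + d
            for child in node.children
            for d in leaf_depths(child)]

def relative_evolutionary_divergence(root: Node) -> Dict[str, float]:
    """Assign RED to every node of a rooted tree (root = 0, leaves = 1)."""
    red = {root.name: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        p = red[parent.name]
        for child in parent.children:
            # Mean distance from the parent to the leaves below this child.
            depths = [child.length + d for d in leaf_depths(child)]
            u = sum(depths) / len(depths)
            red[child.name] = p + (child.length / u) * (1.0 - p)
            stack.append(child)
    return red

# Toy tree with hypothetical branch lengths.
toy = Node("root", children=[
    Node("A", 1.0, [Node("a1", 1.0), Node("a2", 1.0)]),
    Node("b", 2.0),
])
red = relative_evolutionary_divergence(toy)
```

In this toy tree the internal node `A` sits halfway between root and tips, so it receives a RED of 0.5, while every leaf receives 1.0.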

the taxonomy by testing two recently proposed archaeal root placements: the first being within the Euryarchaeota (as defined in the NCBI taxonomy)51 and the second, between the DPANN superphylum and all other Archaea52. Overall, RED values were stable across the tested rooting scenarios and the intervals defining taxonomic ranks in GTDB were largely preserved (Fig. 3). As expected, a fixed root caused the taxa within the rooted lineages to be drawn closer to the root, although it is important to note that absolute RED values should not be compared between trees or even between different rootings of the same tree. Rather, the distribution of labelled nodes within a rank is the key metric, and by this metric all taxa in the rooted lineages were within their expected RED assigned rank intervals (Supplementary Fig. 16 and Fig. 3).

Proposal of revised taxa based on the GTDB archaeal taxonomy. After resolving polyphyletic groups and normalizing ranks, the archaeal GTDB taxonomy (release R04-RS89) comprises 16 phyla, 36 classes, 96 orders, 238 families, 534 genera and 1,248 species (Supplementary Table 12). This entailed the proposal of 13 new taxa and 32 new Candidatus taxa above the rank of genus includ-

Table 1 | Correspondence between the 04-RS89 GTDB taxonomy and RefSeq89 equivalent NCBI taxonomy
GTDB phylum | GTDB class; order | NCBI phylum | NCBI class
p__Halobacteriota | c__Archaeoglobi | Euryarchaeota | Archaeoglobi
p__Halobacteriota | c__Halobacteria | Euryarchaeota | Halobacteria
p__Halobacteriota | c__Methanomicrobia | Euryarchaeota | Methanomicrobia
p__Halobacteriota | c__Methanonatronarchaeia | Euryarchaeota | Methanonatronarchaeia
p__Halobacteriota | c__Methanosarcinia | Euryarchaeota | Methanomicrobia
p__Thermoplasmatota | c__E2 | Euryarchaeota [Ca. Thermoplasmatota] | Thermoplasmata
p__Thermoplasmatota | c__Poseidoniia | Euryarchaeota [Ca. Thermoplasmatota] | n.a. [Ca. Poseidoniia]
p__Thermoplasmatota | c__Thermoplasmata | Euryarchaeota [Ca. Thermoplasmatota] | Thermoplasmata
p__Methanobacteriota | c__Methanobacteria | Euryarchaeota | Methanobacteria
p__Methanobacteriota | c__Methanococci | Euryarchaeota | Methanococci
p__Methanobacteriota | c__Methanopyri | Euryarchaeota | Methanopyri
p__Methanobacteriota | c__Thermococci | Euryarchaeota | Thermococci
p__Hadarchaeota | c__Hadarchaeia | Euryarchaeota | Hadesarchaea
p__Hydrothermarchaeota | c__Hydrothermarchaeia | Euryarchaeota [Ca. Hydrothermarchaeota] | n.a.
p__Thermoproteota | c__Bathyarchaeia | Ca. Bathyarchaeota | n.a.
p__Thermoproteota | c__Korarchaeia | Ca. Korarchaeota | n.a.
p__Thermoproteota | c__Methanomethylicia | Ca. Verstraetearchaeota | n.a.
p__Thermoproteota | c__Nitrososphaeria | Thaumarchaeota | Nitrososphaeria
p__Thermoproteota | c__Nitrososphaeria; o__Caldarchaeales^a | Thaumarchaeota | n.a.
p__Thermoproteota | c__Thermoproteia | Crenarchaeota | Thermoprotei
p__Thermoproteota | c__Thermoproteia; o__Geoarchaeales^b | Crenarchaeota | n.a.
p__Asgardarchaeota | c__Heimdallarchaeia | Ca. Heimdallarchaeota | n.a.
p__Asgardarchaeota | c__Lokiarchaeia | Ca. Lokiarchaeota | n.a.
p__Asgardarchaeota | c__Lokiarchaeia; o__Helarchaeales | Ca. Helarchaeota | n.a.
p__Nanoarchaeota | c__Nanoarchaeia | Ca. Woesearchaeota/Ca. Nanoarchaeota | n.a.
p__Nanoarchaeota | c__Nanoarchaeia; o__Pacearchaeales^c | Ca. Pacearchaeota^f | n.a.
p__Nanoarchaeota | c__Nanoarchaeia; o__Parvarchaeales | Ca. Parvarchaeota | n.a.
p__Nanoarchaeota | c__Nanoarchaeia; o__Woesearchaeales^d | Ca. Woesearchaeota | n.a.
p__Nanohaloarchaeota | c__Nanosalinia | Euryarchaeota | Nanohaloarchaea
p__Aenigmarchaeota^e | c__Aenigmarchaeia^e | Ca. Aenigmarchaeota | n.a.
p__Micrarchaeota | c__Micrarchaeia | Ca. Micrarchaeota | n.a.
p__Iainarchaeota | c__Iainarchaeia | Ca. Diapherotrites | n.a.
p__Altarchaeota | c__Altarchaeia | Ca. Altiarchaeota | n.a.
p__Huberarchaeota | c__Huberarchaeia | Ca. Huberarchaea | n.a.
Named GTDB phyla, major classes and selected orders are listed with their corresponding NCBI taxonomy. Note that, in cases where GTDB and NCBI lineages are not an exact match, the NCBI lineage with the highest number of matching taxa is provided. n.a., not assigned, meaning no rank has been assigned to this lineage in the NCBI taxonomy. Further details regarding the correspondence between NCBI and GTDB taxa are provided in Supplementary Tables 27 and 28. Names that have been updated in the NCBI taxonomy since release RefSeq89 (13 July 2018) up until 12 March 2021 are provided in square brackets.
^a Proposed as ‘Ca. Aigarchaeota’ by Nunoura et al.12.
^b Proposed as ‘Ca. Geoarchaeota’ by Kozubal et al.13. Note that the name has been corrected to ‘o__Gearchaeales’ in GTDB 05-RS95.
^c Proposed as ‘Ca. Pacearchaeota’ by Castelle et al.21.
^d Proposed as ‘Ca. Woesearchaeota’ by Castelle et al.21.
^e Names have been corrected to ‘Aenigmatarchaeota’ and ‘Aenigmatarchaeia’ in GTDB 05-RS95.
^f Note that the rank of this lineage is defined as ‘clade’ in NCBI.
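The rank prefixes used in Table 1 (‘p__’, ‘c__’, ‘o__’) and the ‘n.a.’ entries for unassigned NCBI ranks suggest how a partially ranked lineage can be brought into a seven-rank form. The following is a minimal sketch only: the semicolon-delimited output, the ‘d__’ domain prefix, the function name and the example lineage (including its ‘clade’ entry) are assumptions for illustration and are not taken from the GTDB code base.

```python
# Seven standard ranks with their prefixes, from domain down to species.
RANK_PREFIXES = [
    ("domain", "d__"), ("phylum", "p__"), ("class", "c__"), ("order", "o__"),
    ("family", "f__"), ("genus", "g__"), ("species", "s__"),
]

def standardize_lineage(lineage: dict) -> str:
    """Prefix standard ranks, leave missing ranks as bare prefixes and
    drop any non-standard rank (for example 'clade')."""
    parts = []
    for rank, prefix in RANK_PREFIXES:
        name = lineage.get(rank) or ""       # missing rank -> bare prefix
        parts.append(prefix + name)
    return ";".join(parts)

# Hypothetical NCBI-style lineage with a non-standard rank and missing lower ranks.
example = {
    "domain": "Archaea",
    "clade": "TACK group",
    "phylum": "Thaumarchaeota",
    "class": "Nitrososphaeria",
}
standardized = standardize_lineage(example)
```

Here the non-standard ‘clade’ entry is dropped and the missing order, family, genus and species ranks are left as bare prefixes, so gaps stay visible rather than being silently omitted.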

ing five novel species combinations and three novel Candidatus species combinations (Supplementary Tables 13–16). We also used 25 Latin names without standing in nomenclature as placeholders in the GTDB taxonomy to preserve literature continuity, as we were unable to propose them mostly due to the absence of a designated nomenclature type (Supplementary Table 17). The extensive rearrangement of phyla to normalize phylogenetic depth resulted in both division and amalgamation of release 89 NCBI phyla (Fig. 4). For example, the phylum Euryarchaeota was divided into five separate phyla in the GTDB taxonomy due to its anomalous depth (Figs. 1a and 4). For two of the five phyla, names have previously been proposed: Methanobacteriota53 and Hydrothermarchaeota54. The phylum Hadarchaeota and its subordinate taxa are proposed in this study (Supplementary Table 15), named after the genus of the previously proposed species Ca. Hadarchaeum yellowstonense32. Names for the other two phyla are formally proposed here based on


[Figure 5 image: cladogram of the GTDB phylum Thermoproteota]

Fig. 5 | Reclassification of the Thaumarchaeota. Cladogram of a subtree of the ar122.r89 reference tree showing the GTDB phylum Thermoproteota with the classes Korarchaeia, Thermoprotei, Methanomethylicia (former phylum Ca. Verstraetearchaeota), Bathyarchaeia and Nitrososphaeria (former phylum Ca. Thaumarchaeota). The validly published class Nitrososphaeria (light blue) was emended in GTDB to include all taxa assigned to the phylum Thaumarchaeota in the NCBI RefSeq89 taxonomy. The type species of this lineage is Nitrososphaera viennensis56, which serves as the type for higher taxa including the genus Nitrososphaera, the family Nitrososphaeraceae, the order Nitrososphaerales and the class Nitrososphaeria. The genome of the N. viennensis type strain EN76T ( = DSM 26422T = JCM 19564T) is highlighted in white. Arrow points to the ar122.r89 reference tree taxa not shown in this figure. The phylogenetic tree was annotated and coloured with the online tool ‘Interactive Tree Of Life’80.

validly published class names within each lineage: Halobacteriota phyl. nov. (after the class Halobacteria) and Thermoplasmatota phyl. nov. (after the class Thermoplasmata; Table 1 and Supplementary Table 13). Note that the name Euryarchaeota was not retained in the GTDB taxonomy for two reasons. Firstly, a nomenclatural type was not designated in the original proposal4, which means the name would be illegitimate if the rank of phylum is introduced into the ICNP, since a type is a requirement for validation of a name. Secondly, the substantial changes made to this phylum might have introduced confusion if the name had been retained for one of the five newly circumscribed phyla. It is possible that the name Euryarchaeota could be reintroduced as a superphylum; however, this is outside the scope of the GTDB taxonomy, which only uses canonical ranks.

The TACK superphylum16 was reclassified as a single phylum based on rank normalization and its robust monophyly, for which we propose to use the effectively published name Thermoproteota53 based on the earliest validly described class for any TACK phylum, the Thermoprotei55 (Table 1 and Supplementary Table 13). Note that the phylum name could alternatively have been derived from the earliest genus name in the TACK superphylum, Sulfolobus; however, this genus is currently a member of the class Thermoprotei.

The RED-based reorganization of this lineage requires the unification of the NCBI-defined TACK phyla as previously proposed27. Thus, the Thaumarchaeota was reclassified as a class-level lineage, for which we propose an emended description of the only validly described class in this lineage, the Nitrososphaeria56 (Fig. 5). According to its RED-based rank and concatenated protein phylogeny, the Aigarchaeota constitutes an order-level lineage within the Nitrososphaeria, for which we propose the order Caldarchaeales ord. nov. (Supplementary Tables 15 and 18), derived from the species Ca. Caldarchaeum subterraneum12. We propose to reclassify the Crenarchaeota as an emended description of the class Thermoprotei and the Korarchaeota as the Korarchaeia class. nov. (Supplementary Table 15) based on subordinate taxa named after the genus of the species Ca. Korarchaeum cryptofilum57. In addition, the candidate phyla Verstraetearchaeota17 and Bathyarchaeota14 were unified with the Thermoproteota based on their robust phylogenetic affiliation. After unification, both lineages represent classes in the Thermoproteota for which the names Ca. Methanomethylicia58 from Ca. Methanomethylicus mesodigestum17 and the GTDB placeholder name Bathyarchaeia (Supplementary Table 17) were implemented until type material is assigned for the latter lineage32. Note that the name Crenarchaeota was not retained in the GTDB taxonomy (that is, was not used instead of the name Thermoproteota) to avoid confusion over its recircumscription relative to the NCBI taxonomy and because the name would be illegitimate if the rank of phylum were introduced into the Code4 as per the case made for the Euryarchaeota above.

Similar to TACK, the recently described Asgard superphylum19 was reclassified as a single phylum, for which we use Asgardarchaeota as a placeholder name (Supplementary Table 17) to retain its identity until a new name derived from type material is proposed for this lineage32, which potentially could be the recently described co-culture, Ca. Prometheoarchaeon syntrophicum59. By contrast, the DPANN superphylum18 still comprises multiple (nine) phyla after rank normalization, in part due to a lack of stable deeper interior nodes for naming (Fig. 4). Many of these phyla are represented by only a few genomes on long branches in release 04-RS89 (Supplementary Fig. 17). We expect that the DPANN phyla will undergo further reclassification in future GTDB releases when additional genomes become available to populate this region of the tree; however, even if some of these phyla were unified, DPANN cannot be collapsed into a single phylum according to rank normalization.

Unlike the major changes required at the phylum level, archaeal taxa with lower rank information (species to class) were more stable, with an average of only 11% name changes. Notable examples include the well-known genus Sulfolobus, which was reported early on as potentially polyphyletic60 and is comprised of strains differing in their metabolic repertoire and genome size61. The concatenated protein tree confirms that Sulfolobus is not monophyletic, being interspersed with species belonging to the genera Acidianus, Metallosphaera and Sulfurisphaera. We resolved this polyphyly by dividing the group into four separate genera, two of which have since been reclassified consistent with our subdivision (Supplementary Fig. 18). This reclassification reflects previously reported differences between Sulfolobus species, including a high number of transposable elements in Sulfolobus_A species (S. islandicus and S. solfataricus), which are mostly absent in the type species of the genus, S. acidocaldarius61. An example of a more complicated situation is the taxonomy of genera belonging to the halophilic family Natrialbaceae. Some of these genera have been reported as polyphyletic, such as Natrinema and Haloterrigena62, or are polyphyletic in published trees without associated comment, including Halopiger63 and Natronolimnobius64. We confirmed this polyphyly in the concatenated protein tree, which required extensive reclassification guided by the type species of each genus (Supplementary Fig. 19 and Supplementary Table 14).

Discussion
We present the GTDB for the domain Archaea, with the aim of providing researchers with a phylogenetically congruent and rank-normalized classification based on well-supported nodes in a phylogenomic tree of 122 concatenated conserved single-copy marker proteins (Supplementary Table 2). Unlike the bacterial GTDB reference tree, which contains ~19-fold more species representatives38, the relatively modest number of publicly available archaeal genomes (representing only 1,248 species) allowed us to use IQ-TREE with a protein mixture model that captures substitution heterogeneity between sites and can mitigate long-branch attraction artefacts40.

Our proposed taxonomy is stable under a range of standard phylogenetic variables, including alternate marker genes, inference methods, tree rooting scenarios and expansion of the genome database, due in part to the inherent flexibility of rank designations within RED intervals (Fig. 1). We endeavoured to preserve the existing archaeal taxonomy wherever possible; however, most (93.3%) of the 2,392 archaeal genomes in GTDB had one or more changes to their classification compared with the NCBI taxonomy (Fig. 1c). This high percentage of modifications, compared to 58% reported for genomes in the bacterial GTDB35, is attributable, firstly, to extreme unevenness in archaeal ranks, particularly at the phylum level. For example, the division of the NCBI-defined Euryarchaeota alone affected more than two-thirds of all archaeal genomes (Fig. 4). Secondly, widespread missing rank information in NCBI RefSeq89, particularly amongst as-yet-uncultivated lineages, required numerous passive (rank-filling) changes to the taxonomy (Fig. 1). And thirdly, the Bacteria, unlike the Archaea, have highly sampled species and genera that did not require taxonomic changes, including for example more than 12,000 Streptococcus genomes in release 04-RS89, which effectively reduces the percentage of bacterial genomes with taxonomic differences to NCBI.

The 7-fold difference in the number of archaeal and bacterial phyla (16 versus 112; release 04-RS89) is notable and raises the question of whether GTDB-defined bacterial and archaeal phyla are comparable. Strictly speaking, they are not, as each taxonomy was developed independently and shaped around the existing NCBI taxonomy as far as possible. Early attempts to produce a prokaryotic concatenated protein tree upon which to base a combined taxonomy distorted RED values to the point of being unusable. In the case of the Archaea, we also experimented with moving the phylum and class rank intervals to the right (away from the root) since the starting distributions were so broad (Fig. 1a), which more than doubled the number of archaeal phyla. However, this also resulted in an even greater departure from the starting NCBI taxonomy and hence was not pursued further. Perhaps the most compelling evidence that the current ratio of phyla may reflect true biological diversity is the fact that there are ~19-fold more average nucleotide identity (ANI)-defined bacterial than archaeal species (23,458 versus 1,248; release 04-RS89).

The archaeal GTDB taxonomy and associated tree and alignment files are available online (https://gtdb.ecogenomic.org) and via linked third party tools such as AnnoTree (ref. 65). Users can classify their own genomes against the archaeal GTDB taxonomy using GTDB-Tk (ref. 66). We envisage that in the short term, the archaeal taxonomy will scale with genome deposition in the public repositories based on current accumulation rates (Supplementary Fig. 20), and indeed, during the review of this manuscript, the next archaeal GTDB release (05-RS95) comprised a manageable increase of 424 archaeal species. However, in the longer term, more efficient phylogenomic tools will be needed to scale with increasingly large genomic datasets and to allow for biologically meaningful phylogenetic inferences. We also expect that the less stable parts of the concatenated protein tree, such as the DPANN phyla, will become more robust with additional genome sequence representatives.
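The fold differences quoted in the Discussion can be checked with a few lines of arithmetic. The counts are those stated in the text for release 04-RS89 and the 2,392 quality-filtered genomes. The variable names are ours, and the approximate number of reclassified genomes is derived here from the reported 93.3% rather than reported by the authors.

```python
# Counts as stated in the text for GTDB release 04-RS89.
archaeal_phyla, bacterial_phyla = 16, 112
archaeal_species, bacterial_species = 1_248, 23_458
archaeal_genomes, changed_fraction = 2_392, 0.933

phylum_ratio = bacterial_phyla / archaeal_phyla        # the "7-fold" difference
species_ratio = bacterial_species / archaeal_species   # the "~19-fold" difference
changed_genomes = round(archaeal_genomes * changed_fraction)  # derived estimate

print(f"phyla: {phylum_ratio:.1f}-fold, species: {species_ratio:.1f}-fold, "
      f"genomes with taxonomy changes: ~{changed_genomes}")
```

The species ratio comes out slightly below 19, consistent with the approximate value given in the text.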


Methods
Genome dataset. For the archaeal GTDB taxonomy R04-RS89, we obtained 2,661 archaeal genomes from RefSeq/GenBank release 89 and augmented them with 187 phylogenetically diverse MAGs (Supplementary Note 3) derived from the Sequence Read Archive37 (SRA) as part of a large genome recovery study by the authors39, resulting in 2,848 genomes. This dataset was refined by applying a quality threshold (completeness − 5 × contamination > 50%) using lineage-specific markers implemented in CheckM (refs. 36,67) and by screening out genomes that contain (1) <40% of the 122 archaeal GTDB marker genes, (2) more than 100,000 ambiguous bases and (3) more than 1,000 contigs, and those that have an N50 of <5 kilobases (kb). The filtered genomes were manually inspected, and four exceptions were made for genomes that did not pass the QC but were the only named representatives for the corresponding lineages (Aenigmarchaeum, Lokiarchaeota, Nanosalinarum and Parvarchaeum; Supplementary Table 19). This approach left 2,392 genomes to form species clusters (see Species clusters), resulting in a total of 1,248 species-representative genomes for the downstream analysis (Supplementary Table 1). The 456 genomes that did not pass QC are still searchable on the GTDB website (https://gtdb.ecogenomic.org/) and are listed in Supplementary Table 20.

NCBI taxonomy. The NCBI taxonomy of all representative genomes of R04-RS89 was obtained from the NCBI Taxonomy FTP site on 16 July 2018. The NCBI taxonomy was standardized to seven ranks (species to domain) by identifying missing standard ranks and filling these gaps with rank prefixes and by removing non-standard ranks35. All standard ranks were prefixed with rank identifiers (for example, ‘p__’ for phylum) as previously described68.

Phylogenomic marker set. Archaeal multiple sequence alignments (MSAs) were created through the concatenation of 122 phylogenetically informative markers comprised of proteins or protein domains specified in the Pfam v27 or TIGRFAMs v15.0 databases. The 122 archaeal marker proteins were selected based on the criteria described in Parks et al.39. In brief, this included being present in ≥90% of archaeal genomes and, when present, single-copy in ≥95% of genomes. Only genomes comprising ≤200 contigs with an N50 of ≥20 kb and with CheckM completeness and contamination estimates of ≥95% and ≤5%, respectively, were considered (Supplementary Note 4). Phylogenetically informative proteins were determined by filtering ubiquitous proteins whose gene trees had poor congruence with a set of subsampled concatenated genome trees39. Gene calling was performed with Prodigal v2.6.3, and markers were identified and aligned using HMMER v3.1b1. The presence or absence of the 122 protein markers in the 1,248 species representatives is provided in Supplementary Table 21. The marker proteins were concatenated into an MSA of 32,500 columns, which we refer to as ‘untrimmed 32kAA alignment’ in this manuscript. To remove sites with weak phylogenetic signals, we created an amino acid alignment by trimming columns represented in <50% of the genomes and columns with less than 25% or more than 95% amino acid consensus, resulting in an initial 27,000 amino acid alignment. Here, the term ‘consensus’ refers to the number of taxa with the same residue in a given sequence alignment column; for example, a maximum consensus of 95% means that a maximum of 95% of all taxa can have the same residue in an alignment column in order to be considered for the trimmed alignment. To reduce computational requirements, the alignment was further trimmed by randomly selecting 42 amino acids from the remaining columns of each marker. The resulting ar122.r89 MSA included a total of 5,124 (42 × 122) columns. This MSA filtering methodology is implemented in the ‘align’ method of GTDB-Tk v1.0.2 (ref. 66).

Alternative marker sets. Two alternative MSAs resembling previously published datasets were created through the concatenation of either 16 or 23 ribosomal proteins, termed datasets rp1 (ref. 45) and rp2 (ref. 18), respectively. After trimming columns represented by <50% of the genomes and with an amino acid consensus <25%, the resulting MSAs spanned 1,174 and 2,377 amino acids for rp1 and rp2, respectively.
In addition, we compared the GTDB taxonomy derived from the ar122.r89 tree to a recently created set of trees based on an MSA of 56 evaluated archaeal markers46. We found that 95.3–99.5% of GTDB taxa with more than one genome were recovered as operationally monophyletic in trees inferred from the best 25–75% of the markers in this study (Supplementary Tables 22 and 23 and Supplementary Note 5). Next, we created an MSA from all markers supplied with a hidden Markov model (HMM) (53 out of 56) (ar.53; Supplementary Table 8)46, by concatenating individual alignments for each of the 53 HMMs for all 1,248 species representatives, using Pfam 33.1 and TIGRFAMs 15.0. The resulting concatenated MSA of 13,451 amino acids was used without further trimming steps to infer phylogenetic trees or to apply tools addressing compositional bias.

SSU rRNA gene. Archaeal SSU rRNA genes were identified from the 1,248 archaeal R04-RS89 GTDB genomes using nhmmer v3.1b2 (ref. 69) with the SSU rRNA model (RF00177) from the RFAM database70. Only the longest sequence was retained for each genome. The resulting sequences were aligned with SSU-ALIGN 0.1.1 (ref. 71), and regions of low posterior probabilities, which are indicative of high alignment ambiguity, were pruned with ssu-mask (SSU-ALIGN 0.1.1). The alignment was trimmed to remove leading and trailing columns represented by <70% of taxa and to filter sequences <900 bp (trimSeqs.py v0.0.1; https://github.com/Ecogenomics/scripts).

Species clusters. The 2,392 archaeal genomes were formed into species clusters as previously described38. Briefly, a representative genome was selected for each of the 380 validly or effectively published archaeal species with one or more genomes passing QC, and genomes were assigned to these representatives using ANI and alignment fraction (AF) criteria. Genomes were assigned to the closest representative genome for which they have an ANI of ≥95% with an AF of at least 65%, except if two representatives had an ANI > 95%. In such cases, the ANI radius of a representative was set to the value of the closest representative, up to a maximum of 97%. Species representatives having an ANI > 97% were considered synonyms. For example, Ferroplasma acidiphilum is represented by the genome GCF_002078355.1, which is assembled from the type strain of this species. The closest GTDB representative to this F. acidiphilum genome is GCF_000152265.2, which is assembled from the type strain of Ferroplasma acidarmanus. The ANI between these two genomes is 96.95%. Consequently, to preserve both the F. acidiphilum and F. acidarmanus names in the GTDB, the ANI circumscription radius of these two species is set to 96.95% instead of the default 95% (see Parks et al.38, which illustrates and describes this methodology).
Genomes not assigned to one of these 380 species were formed into 868 de novo species clusters, each specified by a single representative genome. Representative genomes were selected based on assembly quality with preference given to isolate genomes. Of the 868 de novo species clusters, the representative genomes of 70 were unnamed isolates, 775 were MAGs and 23 were single amplified genomes (SAGs).

Accounting for compositional bias. Each of the 122 untrimmed GTDB 04-RS89 archaeal single-copy marker protein alignments was filtered individually using BMGE 1.12 (ref. 50) and Divvier 1.0 (ref. 49). BMGE was executed using ‘-t AA -s FAST -h 0.55 -m BLOSUM30’, and Divvier was run using the recommended options (‘-divvy -mincol 4 -divvygap’). Processing untrimmed protein alignments of individual markers ensures that all protein positions are considered when accounting for compositional bias. Next, the filtered marker alignments were concatenated into one MSA supermatrix each for BMGE and Divvier, whereby previously removed gap-only sequences were added again in the corresponding positions. Finally, the MSAs were trimmed according to the GTDB criteria mentioned above to a length of 7,859 amino acids (BMGE) and 32,061 amino acids (Divvier) and used for phylogenetic inferences. The reasoning for and background on how to account for compositional bias are provided in Supplementary Note 6.

Phylogenetic inference. Phylogenomic trees were inferred with FastTree2 (ref. 72), ExaML (ref. 73), IQ-TREE (ref. 41) and PhyloBayes (ref. 74) on different alignments and with a range of models (Supplementary Table 10). Note that we chose IQ-TREE as the GTDB standard inference, since it scales with our dataset and allows mixture models (see IQ-TREE) and because a previous study concluded that for concatenation-based species tree inference, IQ-TREE consistently achieved the best-observed likelihoods for all datasets, compared to RAxML/ExaML and FastTree (ref. 75). More detail about each inference program is provided below.

FastTree. FastTree v2.1.9 was executed in multithreaded mode with the WAG + GAMMA parameters.

IQ-TREE. IQ-TREE was executed employing mixture models such as models C10 through C60 (ref. 76) and PMSF, a faster approximation of these models40. True C10–C60 trees are computationally more demanding, and memory requirements tend to increase with the number of components in the mixture, which range from 10 to 60 (Supplementary Table 10). We therefore opted for the faster PMSF model, specifically, C10 PMSF, to calculate the ar122.r89 tree. The tree was calculated with IQ-TREE v1.6.12 based on the C10 mixture model and a starting tree (‘-ft’), inferred by FastTree v2.1.9 as described above, to invoke the faster PMSF approximation with the following settings: ‘-m LG+C10+F+G -ft <starting tree>’.

ExaML. ExaML trees (gamma, JTT) were calculated from ten different starting trees with random seeds using the mpi version 3.0.20 (settings ‘-m GAMMA’), whereby the tree with the highest likelihood score was retained.

PhyloBayes. To accommodate the computationally demanding Bayesian inferences, we subsampled our dataset by reducing the number of taxa to one representative per order, resulting in 96 taxa. The order representatives were selected by removing genomes with a quality score (CheckM completeness − 5 × CheckM contamination) of <50, <50% of the 122 archaeal marker genes, an N50 < 4 kb, >2,500 contigs or >1,500 scaffolds. From the remaining genomes, the highest quality genome was selected, giving preference to, in the following order, (1) NCBI reference genomes annotated as ‘complete’ at NCBI, (2) NCBI reference genomes, (3) complete NCBI representative genomes, (4) NCBI representative genomes and (5) GTDB representative genomes. Supplementary Table 24 indicates which of

956 Nature Microbiology | VOL 6 | July 2021 | 946–959 | www.nature.com/naturemicrobiology


these categories a given genome falls into. The Bayesian trees were inferred with PhyloBayes-MPI version 1.10.2 with the following settings: ‘-cat -gtr -x 10–1 -dgam 4’. Three independent chains were run and tested for convergence (maxdiff = 0.13) using bpcomp (part of pb_mpi), whereby the first 1,000 trees were eliminated as burn-in and the remaining trees were sampled every ten trees.

Supertree. Supertrees were calculated from (1) 122 GTDB markers and (2) 253 markers that were present in >10% of archaeal genomes and were extracted from the PhyloPhlAn dataset (ref. 48).
For the 122-marker set, a guide tree was inferred for each marker gene via approximate ML using FastTree v2.1.9 with the WAG model and gamma distributed rate heterogeneity. Subsequently, a complete ML tree was inferred using IQ-TREE v1.6.9 with the C10 + PMSF model and the guide tree with 100 non-parametric bootstrap samplings to assess node support. A consensus tree was constructed from the individual gene trees using ASTRAL v5.6.3.
For the 253-marker set, we first identified 400 conserved, single-copy gene orthologues from the PhyloPhlAn dataset (ref. 48) using PhyloPhlAn v0.99. Markers present in less than 10% of all archaeal genomes were identified and removed, resulting in a final marker set of 253 protein markers. For each marker, multiple sequence alignment was performed using MAFFT-LINSI v7.221 (ref. 77), and the resulting amino acid alignment was filtered using trimAl 1.2rev59, removing columns present in less than half of all taxa in the alignment. Tree inference was performed using the same approach as for the 122-marker set.

SSU trees. Trees from the trimmed SSU gene alignments (SSU rRNA gene) were inferred with IQ-TREE, whereby the substitution model was determined by IQ-TREE's ModelFinder to be SYM+R10.

Taxonomic assignment and rank standardization. The assignment of higher taxonomic ranks was normalized based on the RED calculated from the ar122.r89 tree using PhyloRank (v0.0.37; https://github.com/dparks1134/PhyloRank/) as described previously (ref. 35). In brief, PhyloRank linearly interpolates the RED values of internal nodes according to lineage-specific rates of evolution under the constraints of the root being defined as zero and the RED of all present taxa being defined as one. To account for the influence of the root placement on RED values, PhyloRank roots a tree multiple times at the midpoint of each phylum with two or more classes. In the case of the ar122.r89 reference tree, PhyloRank identified five GTDB phyla (Halobacteriota, Thermoplasmatota, Methanobacteriota, Thermoproteota and Asgardarchaeota) with two or more classes to be used for the multiple outgroup rooting approach. The RED of a taxon is then calculated as the median RED over all these tree rootings, excluding the tree in which the taxon was the outgroup. The RED intervals for each rank were defined as the median RED value ± 0.1 to serve as a guide for the normalization of taxonomic ranks from genus to phylum in GTDB. The rank of species was assigned using ANI and AF criteria (Species clusters). Note that the application of names above the rank of genus in the GTDB taxonomy is manually curated where required (that is, where naming ambiguity exists due to reclassification changes (taxa split, union or transfer) or where a new name needs to be assigned). The curation team follows a decision tree for the manual curation workflow (Supplementary Fig. 21, Extended Data Fig. 5 and Supplementary Note 1).

Assessment of phylogenetic congruence. We measured the phylogenetic tree similarity by applying the normalized Robinson–Foulds distance (RF) of each alternative tree compared to the ar122.r89 tree. The RF is defined as the number of splits that are present in one tree but not in the other one and vice versa (ref. 78). The normalized RF is a relative measure obtained by dividing the calculated RF by the maximal RF. The resulting distance is a value between 0% and 100% that can be interpreted as the percentage of different or missing splits in the alternative trees compared to the ar122.r89 tree (ref. 79). RF calculations and tree comparisons with lineage resolution were carried out with the visualization tool metatree (https://github.com/aaronmussig/metatree).

Assessment of taxonomic congruence. The congruence of the GTDB taxonomy in different trees was assessed as (1) the percentage of taxa identified as monophyletic, operationally monophyletic (defined as having an F measure ≥0.95) or polyphyletic, (2) the RED distributions for taxa at each rank relative to the median RED value of that rank and (3) the number of genomes with identical or conflicting taxonomic assignments between compared trees. To carry out (1), each taxon was placed on the node with the highest resulting F measure. The F measure is defined as the harmonic mean of precision and recall and has been proposed for decorating trees with a donor taxonomy (ref. 68). Note that we introduced the term operationally monophyletic (F measure ≥0.95) because otherwise a few incongruent genomes can cause a large number of polyphyletic taxa. For a detailed explanation of the F measure, a note on how to compare RED values and information concerning the stability between releases, see Supplementary Note 7 and Supplementary Table 25. RED values for all taxa across all decorated phylogenetic trees are provided in Supplementary Table 26.

Correspondence between NCBI and GTDB taxa. To provide a comparison between the NCBI and the GTDB taxonomy, based on the ar122.r89 tree, we provide Supplementary Table 27 linking NCBI taxa to corresponding GTDB taxa. Please note that in many cases this is not a 1:1 correspondence due to reasons including polyphyletic taxa in the NCBI taxonomy, missing assignments in the NCBI taxonomy and updated rank assignments in the GTDB taxonomy. An example is the NCBI phylum Euryarchaeota. The 1,629 genomes contained in this phylum were assigned to ten phyla in the GTDB 04-RS89 taxonomy (Supplementary Table 27). GTDB and NCBI taxonomies and all available metadata for the 2,392 archaeal r89 genomes are provided in Supplementary Table 28. The GTDB online browser ‘taxon history’ also allows the comparison between GTDB and NCBI taxa.

Tracking changes in the GTDB taxonomy. While we aim to keep changes in the GTDB taxonomy to a minimum, updates are required when, for example, a name is proposed in the literature for a taxon that was previously only associated with a placeholder name, or when changes in taxonomic opinion result in nomenclatural changes. GTDB features a taxon history function (https://gtdb.ecogenomic.org/taxon_history), which allows users to review the complete history of a given taxon and nomenclatural consequences, with the option of including the corresponding NCBI taxonomy. Furthermore, the NCBI taxonomy is displayed prominently at the top of every genome page in GTDB (see, for example, C. symbiosum https://gtdb.ecogenomic.org/genomes?gid=GCA_000200715.1); hence, previous classifications no longer used by GTDB will always be highly visible.

Defining and updating the highest quality genomes in GTDB. Each GTDB species is defined by a single representative genome, and species assignments are established by considering the ANI and AF to these representative genomes (see Species clusters) and by selecting the highest quality genome. Species representatives are re-evaluated for each GTDB release with an emphasis on retaining representatives so they can serve as effective nomenclatural type material. However, the goal of stable representatives must be balanced with the desire to use high-quality genomes as representatives, the incorporation of changing taxonomic opinion and identified errors in genome classification or assembly.
GTDB representatives are updated according to two primary principles: (1) representatives should be assembled from the type strain of a species whenever possible and (2) representatives should only be replaced by assemblies of suitably higher overall quality. These two principles are quantitatively defined by the balanced ANI score (BAS), which is defined as 0.5 × (ANI score) + 0.5 × (quality score), where the ANI score is 100 − 20 × (100 − ANI to current representative) and the quality score is defined by the criteria given in Supplementary Table 29. According to these principles, an existing representative is only replaced by a new representative if it has a BAS at least ten points higher than the BAS of the current representative. Intuitively, the BAS achieves the goal of stable representatives by requiring a new representative to be of increasingly higher quality (as defined by the quality score) the more dissimilar it is from the current representative (as defined by the ANI score). In addition, representatives are updated whenever the underlying assembly is updated at NCBI or in cases when genome assemblies are removed from NCBI.

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The GTDB taxonomy is available at the GTDB website (https://gtdb.ecogenomic.org/), including the ar122.r89 tree and the GTDB and NCBI taxonomic assignments for all 2,392 archaeal genomes in GTDB 04-RS89. Genome assemblies are available from the NCBI Assembly database (BioProject: PRJNA593905). All GTDB decorated phylogenetic trees are provided as Newick files in Supplementary Data 1. The SR4 model used for data recoded inferences is provided in Supplementary Data 2. Source data are provided with this paper.

Code availability
The standalone tool GTDB-Tk, which enables researchers to classify their own genomes according to the GTDB taxonomy, is available from GitHub (https://github.com/Ecogenomics/GTDBTk/) and through KBase (https://kbase.us/applist/apps/kb_gtdbtk/run_kb_gtdbtk/release). Taxonomic assignment and rank standardization were carried out based on the RED calculated using PhyloRank v0.0.37, which is available from GitHub (https://github.com/dparks1134/PhyloRank/).

Received: 10 March 2020; Accepted: 10 May 2021;
Published online: 21 June 2021

References
1. Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA 74, 5088–5090 (1977).
2. Gribaldo, S. & Brochier-Armanet, C. The origin and evolution of Archaea: a state of the art. Philos. Trans. R. Soc. Lond. B Biol. Sci. 361, 1007–1022 (2006).

3. Zuo, G., Xu, Z. & Hao, B. Phylogeny and taxonomy of Archaea: a comparison of the whole-genome-based CVTree approach with 16S rRNA sequence analysis. Life 5, 949–968 (2015).
4. Woese, C. R., Kandler, O. & Wheelis, M. L. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl Acad. Sci. USA 87, 4576–4579 (1990).
5. Adam, P. S., Borrel, G., Brochier-Armanet, C. & Gribaldo, S. The growing tree of Archaea: new perspectives on their diversity, evolution and ecology. ISME J. https://doi.org/10.1038/ismej.2017.122 (2017).
6. Baker, B. J. et al. Diversity, ecology and evolution of Archaea. Nat. Microbiol. 5, 887–900 (2020).
7. Spang, A., Caceres, E. F. & Ettema, T. J. G. Genomic exploration of the diversity, ecology, and evolution of the archaeal domain of life. Science 357, eaaf3883 (2017).
8. Barns, S. M., Delwiche, C. F., Palmer, J. D. & Pace, N. R. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl Acad. Sci. USA 93, 9188–9193 (1996).
9. Huber, H. et al. A new phylum of Archaea represented by a nanosized hyperthermophilic symbiont. Nature 417, 63–67 (2002).
10. Hallam, S. J. et al. Genomic analysis of the uncultivated marine crenarchaeote Cenarchaeum symbiosum. Proc. Natl Acad. Sci. USA 103, 18296–18301 (2006).
11. Brochier-Armanet, C., Boussau, B., Gribaldo, S. & Forterre, P. Mesophilic crenarchaeota: proposal for a third archaeal phylum, the Thaumarchaeota. Nat. Rev. Microbiol. 6, 245–252 (2008).
12. Nunoura, T. et al. Insights into the evolution of Archaea and eukaryotic protein modifier systems revealed by the genome of a novel archaeal group. Nucleic Acids Res. 39, 3204–3223 (2011).
13. Kozubal, M. A. et al. Geoarchaeota: a new candidate phylum in the Archaea from high-temperature acidic iron mats in Yellowstone National Park. ISME J. 7, 622–634 (2013).
14. Meng, J. et al. Genetic and functional properties of uncultivated MCG Archaea assessed by metagenome and gene expression analyses. ISME J. 8, 650–659 (2014).
15. Guy, L., Spang, A., Saw, J. H. & Ettema, T. J. G. ‘Geoarchaeote NAG1’ is a deeply rooting lineage of the archaeal order Thermoproteales rather than a new phylum. ISME J. 8, 1353–1357 (2014).
16. Guy, L. & Ettema, T. J. G. The archaeal ‘TACK’ superphylum and the origin of eukaryotes. Trends Microbiol. 19, 580–587 (2011).
17. Vanwonterghem, I. et al. Methylotrophic methanogenesis discovered in the archaeal phylum Verstraetearchaeota. Nat. Microbiol. 1, 16170 (2016).
18. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
19. Zaremba-Niedzwiedzka, K. et al. Asgard Archaea illuminate the origin of eukaryotic cellular complexity. Nature 541, 353–358 (2017).
20. Baker, B. J. et al. Enigmatic, ultrasmall, uncultivated Archaea. Proc. Natl Acad. Sci. USA 107, 8806–8811 (2010).
21. Castelle, C. J. et al. Genomic expansion of domain Archaea highlights roles for organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 16, 690–701 (2015).
22. Probst, A. J. et al. Differential depth distribution of microbial function and putative symbionts through sediment-hosted aquifers in the deep terrestrial subsurface. Nat. Microbiol. 3, 328–336 (2018).
23. Probst, A. J. et al. Biology of a widespread uncultivated archaeon that contributes to carbon fixation in the subsurface. Nat. Commun. 5, 5497 (2014).
24. Seitz, K. W., Lazar, C. S., Hinrichs, K.-U., Teske, A. P. & Baker, B. J. Genomic reconstruction of a novel, deeply branched sediment archaeal phylum with pathways for acetogenesis and sulfur reduction. ISME J. 10, 1696–1705 (2016).
25. Spang, A. et al. Complex Archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
26. Seitz, K. W. et al. Asgard Archaea capable of anaerobic hydrocarbon cycling. Nat. Commun. 10, 1822 (2019).
27. Petitjean, C., Deschamps, P., López-García, P. & Moreira, D. Rooting the domain Archaea by phylogenomic analysis supports the foundation of the new kingdom Proteoarchaeota. Genome Biol. Evol. 7, 191–204 (2014).
28. Petitjean, C., Deschamps, P., López-García, P., Moreira, D. & Brochier-Armanet, C. Extending the conserved phylogenetic core of Archaea disentangles the evolution of the third domain of life. Mol. Biol. Evol. 32, 1242–1254 (2015).
29. Parker, C. T., Tindall, B. J. & Garrity, G. M. International Code of Nomenclature of Prokaryotes. Int. J. Syst. Evol. Microbiol. 69, S1–S111 (2019).
30. Oren, A. et al. Proposal to include the rank of phylum in the International Code of Nomenclature of Prokaryotes. Int. J. Syst. Evol. Microbiol. 65, 4284–4287 (2015).
31. Whitman, W. B. Modest proposals to expand the type material for naming of prokaryotes. Int. J. Syst. Evol. Microbiol. 66, 2108–2112 (2016).
32. Chuvochina, M. et al. The importance of designating type material for uncultured taxa. Syst. Appl. Microbiol. 42, 15–21 (2019).
33. Murray, R. G. E. & Stackebrandt, E. Taxonomic note: implementation of the provisional status Candidatus for incompletely described procaryotes. Int. J. Syst. Evol. Microbiol. 45, 186–187 (1995).
34. Oren, A. A plea for linguistic accuracy—also for Candidatus taxa. Int. J. Syst. Evol. Microbiol. 67, 1085–1094 (2017).
35. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
36. Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079–1086 (2020).
37. Haft, D. H. et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 46, D851–D860 (2018).
38. Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079–1086 (2020).
39. Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
40. Wang, H.-C., Minh, B. Q., Susko, E. & Roger, A. J. Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation. Syst. Biol. 67, 216–235 (2018).
41. Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
42. Federhen, S. The NCBI Taxonomy database. Nucleic Acids Res. 40, D136–D143 (2012).
43. Marin, J., Battistuzzi, F. U., Brown, A. C. & Hedges, S. B. The timetree of prokaryotes: new insights into their evolution and speciation. Mol. Biol. Evol. 34, 437–446 (2017).
44. Sieber, C. M. K. et al. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat. Microbiol. 3, 836–843 (2018).
45. Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
46. Dombrowski, N. et al. Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution. Nat. Commun. 11, 3939 (2020).
47. Galtier, N. & Lobry, J. R. Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol. 44, 632–636 (1997).
48. Segata, N., Börnigen, D., Morgan, X. C. & Huttenhower, C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat. Commun. 4, 2304 (2013).
49. Ali, R. H., Bogusz, M. & Whelan, S. Identifying clusters of high confidence homologies in multiple sequence alignments. Mol. Biol. Evol. 36, 2340–2351 (2019).
50. Criscuolo, A. & Gribaldo, S. BMGE (block mapping and gathering with entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 10, 210 (2010).
51. Raymann, K., Brochier-Armanet, C. & Gribaldo, S. The two-domain tree of life is linked to a new root for the Archaea. Proc. Natl Acad. Sci. USA 112, 6670–6675 (2015).
52. Williams, T. A. et al. Integrative modeling of gene and genome evolution roots the archaeal tree of life. Proc. Natl Acad. Sci. USA 114, E4602–E4611 (2017).
53. Whitman, W. B. et al. Proposal of the suffix –ota to denote phyla. Addendum to ‘Proposal to include the rank of phylum in the International Code of Nomenclature of Prokaryotes’. Int. J. Syst. Evol. Microbiol. 68, 967–969 (2018).
54. Jungbluth, S. P., Amend, J. P. & Rappé, M. S. Metagenome sequencing and 98 microbial genomes from Juan de Fuca Ridge flank subsurface fluids. Sci. Data 4, sdata201737 (2017).
55. Reysenbach, A.-L. Class I. Thermoprotei class. nov. in Bergey’s Manual of Systematic Bacteriology Volume 1: The Archaea and the Deeply Branching and Phototrophic Bacteria (eds Garrity, G. et al.) 169–210 (Springer Verlag, 2001).
56. Stieglmeier, M. et al. Nitrososphaera viennensis gen. nov., sp. nov., an aerobic and mesophilic, ammonia-oxidizing archaeon from soil and a member of the archaeal phylum Thaumarchaeota. Int. J. Syst. Evol. Microbiol. 64, 2738–2752 (2014).
57. Elkins, J. G. et al. A korarchaeal genome reveals insights into the evolution of the Archaea. Proc. Natl Acad. Sci. USA 105, 8102–8107 (2008).
58. Oren, A., Garrity, G. M., Parker, C. T., Chuvochina, M. & Trujillo, M. E. Lists of names of prokaryotic Candidatus taxa. Int. J. Syst. Evol. Microbiol. https://doi.org/10.1099/ijsem.0.003789 (2020).
59. Imachi, H. et al. Isolation of an archaeon at the prokaryote–eukaryote interface. Nature 577, 519–525 (2020).
60. Fuchs, T., Huber, H., Burggraf, S. & Stetter, K. O. 16S rDNA-based phylogeny of the archaeal order Sulfolobales and reclassification of Desulfurolobus ambivalens as Acidianus ambivalens comb. nov. Syst. Appl. Microbiol. 19, 56–60 (1996).
61. Quehenberger, J., Shen, L., Albers, S.-V., Siebers, B. & Spadiut, O. Sulfolobus—a potential key organism in future biotechnology. Front. Microbiol. 8, 2474 (2017).

62. Minegishi, H. et al. Further refinement of the phylogeny of the Halobacteriaceae based on the full-length RNA polymerase subunit B′ (rpoB′) gene. Int. J. Syst. Evol. Microbiol. 60, 2398–2408 (2010).
63. Sorokin, D. Y. et al. Natronolimnobius sulfurireducens sp. nov. and Halalkaliarchaeum desulfuricum gen. nov., sp. nov., the first sulfur-respiring alkaliphilic Haloarchaea from hypersaline alkaline lakes. Int. J. Syst. Evol. Microbiol. 69, 2662–2673 (2019).
64. Sorokin, D. Y. et al. Sulfur respiration in a group of facultatively anaerobic natronoarchaea ubiquitous in hypersaline soda lakes. Front. Microbiol. 9, 2359 (2018).
65. Mendler, K. et al. AnnoTree: visualization and exploration of a functionally annotated microbial tree of life. Nucleic Acids Res. https://doi.org/10.1093/nar/gkz246 (2019).
66. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics https://doi.org/10.1093/bioinformatics/btz848 (2019).
67. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
68. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of Bacteria and Archaea. ISME J. 6, 610–618 (2012).
69. Wheeler, T. J. & Eddy, S. R. nhmmer: DNA homology search with profile HMMs. Bioinformatics 29, 2487–2489 (2013).
70. Kalvari, I. et al. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 46, D335–D342 (2018).
71. Nawrocki, E. Structural RNA Homology Search and Alignment Using Covariance Models PhD thesis, Washington Univ. St Louis (2009).
72. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
73. Kozlov, A. M., Aberer, A. J. & Stamatakis, A. ExaML version 3: a tool for phylogenomic analyses on supercomputers. Bioinformatics 31, 2577–2579 (2015).
74. Lartillot, N. & Philippe, H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21, 1095–1109 (2004).
75. Zhou, X., Shen, X.-X., Hittinger, C. T. & Rokas, A. Evaluating fast maximum likelihood-based phylogenetic programs using empirical phylogenomic data sets. Mol. Biol. Evol. 35, 486–503 (2018).
76. Quang, L. S., Gascuel, O. & Lartillot, N. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics 24, 2317–2323 (2008).
77. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
78. Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).
79. Kupczok, A., Schmidt, H. A. & von Haeseler, A. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms Mol. Biol. 5, 37 (2010).
80. Letunic, I. & Bork, P. Interactive tree of life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).

Acknowledgements
We thank B. Kemish and D. Senanayake for system administration support, P. Yilmaz for stimulating discussions on archaeal taxonomy and the GTDB user community for their feedback. We also thank the Australian Centre for Ecogenomics (ACE) at The University of Queensland and the New Zealand eScience Infrastructure (NeSI) for providing high-performance computing facilities. The project was supported by an Australian Research Council (ARC) Future Fellowship (FT170100213) awarded to C.R. and by an Australian Research Council Laureate Fellowship (FL150100038) awarded to P.H. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author contributions
P.H., C.R. and D.H.P. conceived the archaeal GTDB study and designed experiments. C.R., A.J.M., D.H.P. and D.W.W. performed phylogenetic inferences. D.H.P. and P.-A.C. calculated rank normalizations. D.H.P., P.-A.C. and A.J.M. created the GTDB web interface and underlying databases. M.C., C.R., P.H. and W.B.W. curated the GTDB taxonomy with input from all co-authors and the scientific community. A.A.D. performed the RED simulation analysis. C.R. and P.H. wrote the manuscript with contributions from all co-authors.

Competing interests
The authors declare no competing interests.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41564-021-00918-8.
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41564-021-00918-8.
Correspondence and requests for materials should be addressed to C.R. or P.H.
Peer review information Nature Microbiology thanks the anonymous reviewers for their contribution to the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature Limited 2021


Extended Data Fig. 1 | Implementing the shifting substitution rate model (SSR). a, Timetree, that is, a tree scaled to time. A grid has been overlaid on
top of the tree to delineate the length of the individual branch segments. b, An array of substitution rate multipliers, ordered from lowest to highest. In this
specific example, the slowest evolving lineage evolves 16 times more slowly than the fastest evolving lineage. c, The result of running the SSR model on the
tree in a. Every circle represents a shift in the substitution rate, whereby the colour corresponds to the substitution rate multipliers shown in b. At the start
of the simulation the model starts with the grey circle. How quickly the changes take place depends on the shifting substitution rate parameter. d, The result
of taking the tree in a and scaling it according to the active substitution rate multiplier in every branch at every branch segment. For instance, we can see
that the single segment in dark red (×4) has the same length as the two prior segments in light red (×2) on the same branch.
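
The scaling step in panel d can be illustrated with a minimal sketch. It assumes that each branch is stored as a list of time segments, each paired with the rate multiplier active in that segment. The function name `scale_branch` and the example multiplier array are hypothetical and are not taken from the simulation code used in the study; the array only mirrors the 16-fold spread mentioned for panel b.

```python
from typing import List, Tuple

# Hypothetical multiplier array in the spirit of panel b: ordered from lowest
# to highest, with a 16-fold spread between slowest and fastest (assumed values).
RATE_MULTIPLIERS = [0.25, 0.5, 1.0, 2.0, 4.0]


def scale_branch(segments: List[Tuple[float, float]]) -> List[float]:
    """Convert time segments into substitution-scaled segment lengths.

    Each segment is (duration_in_time_units, active_rate_multiplier).
    """
    return [duration * multiplier for duration, multiplier in segments]


if __name__ == "__main__":
    # One branch with three equal time segments: two at x2, then one at x4.
    branch = [(1.0, 2.0), (1.0, 2.0), (1.0, 4.0)]
    scaled = scale_branch(branch)
    print(scaled)                         # [2.0, 2.0, 4.0]
    print(scaled[2] == sum(scaled[:2]))   # True
```

As in the caption's example, the single ×4 segment ends up as long as the two preceding ×2 segments combined.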


Extended Data Fig. 2 | Impact of variable evolutionary rates on the RED approach. Each panel shows the true ranking of every inner node of the species
tree (which we can directly obtain from the simulation) on the x-axis, and the inferred ranking of every node, resulting from modifying the branch lengths
of the tree using the SSR model and subsequently applying the RED algorithm to recover an ultrametric tree, on the y-axis. The main diagonal therefore
corresponds to the proportion of nodes for the ranking given by the column that have been correctly classified. Each panel shows the results of using
a different shifting substitution rate, from 0.1 to 0.5, and 1 (events/speciation). These results show that, as expected, higher numbers of changes in the
substitution rates impact the performance of the RED approach to a higher degree. However, the levels of accuracy remain high even for the most extreme
cases.
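
For orientation, the RED interpolation evaluated in these simulations can be sketched as follows. This is a minimal illustration, assuming the interpolation rule described for PhyloRank in ref. 35, RED(n) = p + (d/u) × (1 − p), where p is the RED of the parent node, d is the branch length to the parent and u is the mean distance from the parent to the extant taxa below n. The `Node` class and the function names are hypothetical, branch lengths are assumed to be positive, and the sketch uses a single fixed root rather than the median over multiple rootings.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class Node:
    name: str
    length: float = 0.0                      # branch length to the parent
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node: Node) -> List[float]:
    """Distances from `node` to every leaf below it."""
    if not node.children:
        return [0.0]
    return [c.length + d for c in node.children for d in leaf_distances(c)]


def assign_red(root: Node) -> Dict[str, float]:
    """Return RED per node name, with the root at 0 and all leaves at 1."""
    red = {root.name: 0.0}

    def visit(parent: Node) -> None:
        p = red[parent.name]
        for child in parent.children:
            if not child.children:
                red[child.name] = 1.0        # extant taxa are fixed at one
            else:
                d = child.length
                # mean distance from the parent to the leaves below `child`
                u = mean(d + x for x in leaf_distances(child))
                red[child.name] = p + (d / u) * (1.0 - p)
                visit(child)

    visit(root)
    return red


# Toy tree with unequal rates below the internal node X (assumed values).
toy = Node("root", children=[
    Node("A", 2.0),
    Node("X", 1.0, [Node("B", 1.0), Node("C", 3.0)]),
])
red_values = assign_red(toy)
```

In the toy tree, node X sits one unit below the root and its leaves are on average three units from the root, so X receives a RED of 1/3 even though its two descendant branches differ in length.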


Extended Data Fig. 3 | Average number of monophyletic (green bars), operationally monophyletic (yellow bars) and polyphyletic (orange bars) taxa
across higher ranks (phylum, class, and order) in percent. Shown are GTDB taxonomy decorated phylogenetic trees inferred with different methods, from
a range of markers, from alignments trimmed to reduce compositional bias and fast evolving sites, and from alignments created as part of the simulated
database expansion. Note that only taxa with two or more genomes were included, and that the data set (order representatives) used for PhyloBayes
restricts the analysis to the ranks of phylum and class. Details for each inferred tree are provided in Supplementary Table 10. Percentages for all ranks are
shown in Supplementary Fig. 11. Monophyly and operational monophyly were determined based on the F measure of decorated internal nodes.
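
The F measure classification behind this figure can be sketched as below. The sketch assumes that a taxon and each candidate node are represented as sets of genome identifiers, and that strict monophyly corresponds to an F measure of exactly 1; the 0.95 threshold for operational monophyly is the one given in the Methods. The function names and example identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple


def f_measure(taxon: Set[str], node_leaves: Set[str]) -> float:
    """Harmonic mean of precision and recall for a taxon placed on a node."""
    shared = len(taxon & node_leaves)
    if shared == 0:
        return 0.0
    precision = shared / len(node_leaves)   # fraction of the node that is the taxon
    recall = shared / len(taxon)            # fraction of the taxon under the node
    return 2 * precision * recall / (precision + recall)


def classify(taxon: Set[str], nodes: Dict[str, Set[str]]) -> Tuple[str, float, str]:
    """Place the taxon on the best-scoring node and label the outcome."""
    best = max(nodes, key=lambda n: f_measure(taxon, nodes[n]))
    score = f_measure(taxon, nodes[best])
    if score == 1.0:                        # assumed criterion for strict monophyly
        label = "monophyletic"
    elif score >= 0.95:
        label = "operationally monophyletic"
    else:
        label = "polyphyletic"
    return best, score, label


# Toy example: 20 genomes in the taxon, one foreign genome under node n1.
taxon = {f"g{i}" for i in range(20)}
nodes = {"n1": taxon | {"other"}, "n2": {f"g{i}" for i in range(10)}}
best_node, score, label = classify(taxon, nodes)
```

Here node n1 has a precision of 20/21 and a recall of 1, giving an F measure of 40/41 (about 0.976), so a single incongruent genome leaves the taxon operationally monophyletic rather than polyphyletic.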


Extended Data Fig. 4 | Higher rank GTDB lineages not resolved with alternative inference methods. Shown are taxa that were not recovered as
monophyletic or operationally monophyletic (green) and hence were polyphyletic (red; F measure < 0.95) in at least one of the different alignments and
inference methods. The bootstrap support for each taxon in the ar.122.r89 reference tree is given in the last column ‘BS in ar.122.r89 tree’.


Extended Data Fig. 5 | Examples of application of names using the manual curation workflow. Provided are five examples of taxon names that have
been updated in GTDB following the manual curation workflow. Each example is shown in a distinct colour: Ca. Thaumarchaeota (red), Ca.
Diapherotrites (blue), Ca. Bathyarchaeota (green), Ca. Verstraetearchaeota (purple), and Ca. Korarchaeota (orange). For example, Ca. Thaumarchaeota
(red) has no designated nomenclature type and has no lower-ranking taxon based on the same stem as the taxon. Furthermore, it has been united with
another taxon of the same rank in GTDB, which resulted in a name being chosen based on priority, in this case Thermoproteota. *Nomenclature type of the
taxon (for ranks above genus) is defined as one of its subordinate taxa with which the name is permanently associated.



nature research | reporting summary
Corresponding author(s): Christian Rinke
Last updated by author(s): 28-04-2021

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see our Editorial Policies and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.

A description of all covariates tested


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)

For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.

For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.

Software and code


Policy information about availability of computer code
Data collection For the archaeal GTDB taxonomy R04-RS89, 2,661 archaeal genomes were obtained from RefSeq/GenBank release 89.
The NCBI taxonomy of all representative genomes of R04-RS89 was obtained from the NCBI Taxonomy FTP site on July 16th, 2018.

Data analysis The standalone tool GTDB-Tk, which enables researchers to classify their own genomes according to the GTDB taxonomy, is available at
GitHub (https://github.com/Ecogenomics/GTDBTk/) and through KBase (https://kbase.us/applist/apps/kb_gtdbtk/run_kb_gtdbtk/release).
Taxonomic assignment and rank standardisation were carried out based on the relative evolutionary divergence (RED) calculated using
PhyloRank v0.0.37, which is available at GitHub (https://github.com/dparks1134/PhyloRank/).
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and
reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
April 2020

All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

The GTDB taxonomy is available at the GTDB website (https://gtdb.ecogenomic.org/), including the ar.122.r89 tree and the GTDB and NCBI taxonomic assignments
for all 2,392 archaeal genomes in GTDB R04-RS89.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size All 2,661 archaeal genomes from RefSeq/GenBank release 89 were included in the analysis.

Data exclusions This data set was refined by applying a quality threshold (completeness - 5 × contamination > 50%) using lineage-specific markers implemented
in CheckM and by screening out genomes which contain <40% of the 122 archaeal GTDB marker genes, more than 100,000 ambiguous bases,
more than 1,000 contigs, and which have an N50 <5 kb. This approach left 2,392 genomes to form species clusters (see below), resulting in a
total of 1,248 species representative genomes for the downstream analysis (Table S1). The 456 genomes which did not pass QC are still
searchable on the GTDB website (https://gtdb.ecogenomic.org/) and are listed in Table S20.
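
The exclusion criteria above can be expressed as a short filter. The following is a minimal sketch, not the GTDB pipeline: the record layout and field names are hypothetical, completeness and contamination are assumed to be CheckM percentages, the N50 cutoff of 5 kb is taken as 5,000 bp, and a genome is assumed to be excluded if it fails any single criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenomeStats:
    """Hypothetical per-genome summary; field names are illustrative only."""
    accession: str
    completeness: float      # CheckM completeness, percent
    contamination: float     # CheckM contamination, percent
    marker_fraction: float   # fraction (0-1) of the 122 archaeal markers present
    ambiguous_bases: int
    contigs: int
    n50: int                 # base pairs


def passes_qc(g: GenomeStats) -> bool:
    """Apply the thresholds quoted in the Data exclusions entry."""
    quality = g.completeness - 5 * g.contamination
    return (
        quality > 50                      # completeness - 5 x contamination > 50%
        and g.marker_fraction >= 0.40     # screen out <40% of marker genes
        and g.ambiguous_bases <= 100_000  # screen out >100,000 ambiguous bases
        and g.contigs <= 1_000            # screen out >1,000 contigs
        and g.n50 >= 5_000                # screen out N50 < 5 kb
    )


def filter_genomes(genomes: List[GenomeStats]) -> List[GenomeStats]:
    """Keep only genomes that satisfy every criterion."""
    return [g for g in genomes if passes_qc(g)]


if __name__ == "__main__":
    demo = [
        GenomeStats("demo_A", 95.0, 2.0, 0.90, 0, 120, 40_000),    # quality 85
        GenomeStats("demo_B", 70.0, 5.0, 0.80, 0, 300, 12_000),    # quality 45
        GenomeStats("demo_C", 98.0, 1.0, 0.35, 0, 50, 100_000),    # too few markers
    ]
    print([g.accession for g in filter_genomes(demo)])
```

The demo accessions and values are invented for illustration; only the first record passes all criteria.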

Replication Bootstrap values were obtained by inferring 100 bootstrap trees.

Randomization The alignment matrix columns were randomly sub-sampled for bootstrap trees.
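
As an illustration of the replication and randomization entries, the sketch below generates resampled alignments for bootstrap trees. It assumes the standard non-parametric bootstrap, in which columns are drawn with replacement until the original alignment length is reached; the entry above states only that columns were randomly sub-sampled, so the sampling scheme, the function names and the dictionary-based alignment format are assumptions.

```python
import random
from typing import Dict, List


def bootstrap_columns(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement (assumed scheme)."""
    n_cols = len(next(iter(alignment.values())))
    # The same column indices are applied to every sequence so that
    # each resampled column remains a genuine alignment column.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def bootstrap_replicates(
    alignment: Dict[str, str],
    n_replicates: int = 100,   # 100 bootstrap trees are reported above
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Return n_replicates resampled alignments, reproducible via the seed."""
    rng = random.Random(seed)
    return [bootstrap_columns(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    toy = {"genome_1": "MKVLA", "genome_2": "MRVLA", "genome_3": "MKILG"}
    replicates = bootstrap_replicates(toy, n_replicates=3, seed=42)
    for rep in replicates:
        print(rep)
```

Each replicate would then be passed to the tree inference program, and the support for a node is the percentage of replicate trees that recover it.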

Blinding Blinding is not applicable to archaeal taxonomy.

Reporting for specific materials, systems and methods


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems (n/a | Involved in the study)
Antibodies
Eukaryotic cell lines
Palaeontology and archaeology
Animals and other organisms
Human research participants
Clinical data
Dual use research of concern

Methods (n/a | Involved in the study)
ChIP-seq
Flow cytometry
MRI-based neuroimaging

Antibodies
Antibodies used Describe all antibodies used in the study; as applicable, provide supplier name, catalog number, clone name, and lot number.

Validation Describe the validation of each primary antibody for the species and application, noting any validation statements on the
manufacturer’s website, relevant citations, antibody profiles in online databases, or data provided in the manuscript.

Eukaryotic cell lines


Policy information about cell lines
Cell line source(s) State the source of each cell line used.

Authentication Describe the authentication procedures for each cell line used OR declare that none of the cell lines used were authenticated.

Mycoplasma contamination Confirm that all cell lines tested negative for mycoplasma contamination OR describe the results of the testing for
mycoplasma contamination OR declare that the cell lines were not tested for mycoplasma contamination.

Commonly misidentified lines Name any commonly misidentified cell lines used in the study and provide a rationale for their use.
(See ICLAC register)

Palaeontology and Archaeology



Specimen provenance Provide provenance information for specimens and describe permits that were obtained for the work (including the name of the
issuing authority, the date of issue, and any identifying information).

Specimen deposition Indicate where the specimens have been deposited to permit free access by other researchers.

Dating methods If new dates are provided, describe how they were obtained (e.g. collection, storage, sample pretreatment and measurement), where
they were obtained (i.e. lab name), the calibration program and the protocol for quality assurance OR state that no new dates are
provided.

Tick this box to confirm that the raw and calibrated dates are available in the paper or in Supplementary Information.

Ethics oversight Identify the organization(s) that approved or provided guidance on the study protocol, OR state that no ethical approval or guidance
was required and explain why not.
Note that full information on the approval of the study protocol must also be provided in the manuscript.

Animals and other organisms


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research
Laboratory animals For laboratory animals, report species, strain, sex and age OR state that the study did not involve laboratory animals.

Wild animals Provide details on animals observed in or captured in the field; report species, sex and age where possible. Describe how animals were
caught and transported and what happened to captive animals after the study (if killed, explain why and describe method; if released,
say where and when) OR state that the study did not involve wild animals.

Field-collected samples For laboratory work with field-collected samples, describe all relevant parameters such as housing, maintenance, temperature,
photoperiod and end-of-experiment protocol OR state that the study did not involve samples collected from the field.

Ethics oversight Identify the organization(s) that approved or provided guidance on the study protocol, OR state that no ethical approval or guidance
was required and explain why not.
Note that full information on the approval of the study protocol must also be provided in the manuscript.

Human research participants


Policy information about studies involving human research participants
Population characteristics Describe the covariate-relevant population characteristics of the human research participants (e.g. age, gender, genotypic
information, past and current diagnosis and treatment categories). If you filled out the behavioural & social sciences study
design questions and have nothing to add here, write "See above."

Recruitment Describe how participants were recruited. Outline any potential self-selection bias or other biases that may be present and
how these are likely to impact results.

Ethics oversight Identify the organization(s) that approved the study protocol.

Note that full information on the approval of the study protocol must also be provided in the manuscript.

Clinical data
Policy information about clinical studies
All manuscripts should comply with the ICMJE guidelines for publication of clinical research and a completed CONSORT checklist must be included with all submissions.

Clinical trial registration Provide the trial registration number from ClinicalTrials.gov or an equivalent agency.

Study protocol Note where the full trial protocol can be accessed OR if not available, explain why.

Data collection Describe the settings and locales of data collection, noting the time periods of recruitment and data collection.

Outcomes Describe how you pre-defined primary and secondary outcome measures and how you assessed these measures.

Dual use research of concern


Policy information about dual use research of concern

Hazards

Could the accidental, deliberate or reckless misuse of agents or technologies generated in the work, or the application of information presented
in the manuscript, pose a threat to:
No Yes
Public health
National security
Crops and/or livestock
Ecosystems
Any other significant area

Experiments of concern
Does the work involve any of these experiments of concern:
No Yes
Demonstrate how to render a vaccine ineffective
Confer resistance to therapeutically useful antibiotics or antiviral agents
Enhance the virulence of a pathogen or render a nonpathogen virulent
Increase transmissibility of a pathogen
Alter the host range of a pathogen
Enable evasion of diagnostic/detection modalities
Enable the weaponization of a biological agent or toxin
Any other potentially harmful combination of experiments and agents

ChIP-seq
Data deposition
Confirm that both raw and final processed data have been deposited in a public database such as GEO.
Confirm that you have deposited or provided access to graph files (e.g. BED files) for the called peaks.

Data access links (may remain private before publication) For "Initial submission" or "Revised version" documents, provide reviewer access links. For your
"Final submission" document, provide a link to the deposited data.

Files in database submission Provide a list of all files available in the database submission.

Genome browser session (e.g. UCSC) Provide a link to an anonymized genome browser session for "Initial submission" and "Revised version" documents only, to
enable peer review. Write "no longer applicable" for "Final submission" documents.

Methodology
Replicates Describe the experimental replicates, specifying number, type and replicate agreement.

Sequencing depth Describe the sequencing depth for each experiment, providing the total number of reads, uniquely mapped reads, length of reads and
whether they were paired- or single-end.

Antibodies Describe the antibodies used for the ChIP-seq experiments; as applicable, provide supplier name, catalog number, clone name, and lot
number.

Peak calling parameters Specify the command line program and parameters used for read mapping and peak calling, including the ChIP, control and index files
used.

Data quality Describe the methods used to ensure data quality in full detail, including how many peaks are at FDR 5% and above 5-fold enrichment.

Software Describe the software used to collect and analyze the ChIP-seq data. For custom code that has been deposited into a community
repository, provide accession details.
Flow Cytometry



Plots
Confirm that:
The axis labels state the marker and fluorochrome used (e.g. CD4-FITC).
The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers).
All plots are contour plots with outliers or pseudocolor plots.
A numerical value for number of cells or percentage (with statistics) is provided.

Methodology
Sample preparation Describe the sample preparation, detailing the biological source of the cells and any tissue processing steps used.

Instrument Identify the instrument used for data collection, specifying make and model number.

Software Describe the software used to collect and analyze the flow cytometry data. For custom code that has been deposited into a
community repository, provide accession details.

Cell population abundance Describe the abundance of the relevant cell populations within post-sort fractions, providing details on the purity of the
samples and how it was determined.

Gating strategy Describe the gating strategy used for all relevant experiments, specifying the preliminary FSC/SSC gates of the starting cell
population, indicating where boundaries between "positive" and "negative" staining cell populations are defined.

Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information.

Magnetic resonance imaging


Experimental design
Design type Indicate task or resting state; event-related or block design.

Design specifications Specify the number of blocks, trials or experimental units per session and/or subject, and specify the length of each trial
or block (if trials are blocked) and interval between trials.

Behavioral performance measures State number and/or type of variables recorded (e.g. correct button press, response time) and what statistics were used
to establish that the subjects were performing the task as expected (e.g. mean, range, and/or standard deviation across
subjects).

Acquisition
Imaging type(s) Specify: functional, structural, diffusion, perfusion.

Field strength Specify in Tesla

Sequence & imaging parameters Specify the pulse sequence type (gradient echo, spin echo, etc.), imaging type (EPI, spiral, etc.), field of view, matrix size,
slice thickness, orientation and TE/TR/flip angle.

Area of acquisition State whether a whole brain scan was used OR define the area of acquisition, describing how the region was determined.

Diffusion MRI Used Not used

Preprocessing
Preprocessing software Provide detail on software version and revision number and on specific parameters (model/functions, brain extraction,
segmentation, smoothing kernel size, etc.).

Normalization If data were normalized/standardized, describe the approach(es): specify linear or non-linear and define image types used for
transformation OR indicate that data were not normalized and explain rationale for lack of normalization.

Normalization template Describe the template used for normalization/transformation, specifying subject space or group standardized space (e.g.
original Talairach, MNI305, ICBM152) OR indicate that the data were not normalized.

Noise and artifact removal Describe your procedure(s) for artifact and structured noise removal, specifying motion parameters, tissue signals and
physiological signals (heart rate, respiration).

Volume censoring Define your software and/or method and criteria for volume censoring, and state the extent of such censoring.



Statistical modeling & inference
Model type and settings Specify type (mass univariate, multivariate, RSA, predictive, etc.) and describe essential details of the model at the first and
second levels (e.g. fixed, random or mixed effects; drift or auto-correlation).

Effect(s) tested Define precise effect in terms of the task or stimulus conditions instead of psychological concepts and indicate whether
ANOVA or factorial designs were used.

Specify type of analysis: Whole brain ROI-based Both


Statistic type for inference Specify voxel-wise or cluster-wise and report all relevant parameters for cluster-wise methods.
(See Eklund et al. 2016)

Correction Describe the type of correction and how it is obtained for multiple comparisons (e.g. FWE, FDR, permutation or Monte Carlo).

Models & analysis


n/a Involved in the study
Functional and/or effective connectivity
Graph analysis
Multivariate modeling or predictive analysis

Functional and/or effective connectivity Report the measures of dependence used and the model details (e.g. Pearson correlation, partial correlation,
mutual information).

Graph analysis Report the dependent variable and connectivity measure, specifying weighted graph or binarized graph,
subject- or group-level, and the global and/or node summaries used (e.g. clustering coefficient, efficiency,
etc.).

Multivariate modeling and predictive analysis Specify independent variables, features extraction and dimension reduction, model, training and evaluation
metrics.
