https://doi.org/10.1038/s41564-021-00918-8
The accrual of genomic data from both cultured and uncultured microorganisms provides new opportunities to develop systematic taxonomies based on evolutionary relationships. Previously, we established a bacterial taxonomy through the Genome Taxonomy Database. Here, we propose a standardized archaeal taxonomy that is derived from a 122-concatenated-protein phylogeny that resolves polyphyletic groups and normalizes ranks based on relative evolutionary divergence. The resulting archaeal taxonomy, which forms part of the Genome Taxonomy Database, is stable for a range of phylogenetic variables including marker gene selection, inference methods, corrections for rate heterogeneity and compositional bias, tree rooting scenarios and expansion of the genome database. Rank normalization is shown to robustly correct for substitution rates varying up to 30-fold using simulated datasets. Taxonomic curation follows the rules of the International Code of Nomenclature of Prokaryotes while taking into account proposals to formally recognize the rank of phylum and to use genome sequences as type material. This taxonomy is based on 2,392 archaeal genomes, 93.3% of which required one or more changes to their existing taxonomy, mainly owing to incomplete classification. We identify 16 archaeal phyla and reclassify 3 major monophyletic units from the former Euryarchaeota and one phylum that unites the Thaumarchaeota–Aigarchaeota–Crenarchaeota–Korarchaeota (TACK) superphylum into a single phylum.
Carl Woese’s discovery of the Archaea, originally termed Archaebacteria1, in 1977 gave rise to the recognition of a new domain of life2 and fundamentally changed our view of cellular evolution on Earth. Over time an increasing number of Archaea have been described, initially from extreme environments but subsequently from soils, oceans, fresh water and animal guts, highlighting the global importance of this domain2. Since their recognition, Archaea have been classified primarily by genotype, that is, small subunit (SSU) ribosomal RNA (rRNA) gene sequences. Therefore, compared to Bacteria, they suffer less from historical misclassifications based on phenotypic properties3. Using the SSU rRNA gene, Woese initially described two major lines of archaeal descent, the Euryarchaeota and the Crenarchaeota4, and in the following years, all newly discovered archaeal lineages were added to these two main groups. Non-extremophilic Archaea were generally classified as Euryarchaeota, which, together with the discovery of novel extremophile euryarchaeotes, contributed to a considerable expansion of this lineage (for recent reviews see Adam et al.5, Baker et al.6 and Spang et al.7 and references therein). Eventually, two new archaeal phyla were proposed based on phylogenetic novelty of their SSU rRNA sequences: the Korarchaeota8, recovered from hot springs in Yellowstone National Park, and the nanosized, symbiotic Nanoarchaeota9, co-cultured from a submarine hot vent. By the mid-2000s, archaeal classification had begun to leverage genome sequences, and the first sequenced mesophilic crenarchaeote, Candidatus Cenarchaeum symbiosum, belonging to Marine Group I (ref. 10), was used to argue that mesophilic Crenarchaeota should be in a separate phylum from hyperthermophilic Crenarchaeota, for which the name Thaumarchaeota was eventually proposed11. Subsequently, the field experienced a burst in availability of genomic data due to the substantial acceleration of culture-independent genome recovery driven by improvements in high-throughput sequencing5,7. This resulted in the description and naming of several new archaeal lineages, including the candidate phyla Aigarchaeota12, Geoarchaeota13 and Bathyarchaeota14, previously reported as members of the Crenarchaeota based on SSU rRNA data. Some of the proposed phyla were met with criticism, such as the Geoarchaeota, which was considered to be a member of the order Thermoproteales rather than a novel phylum15. In 2011, the TACK superphylum was proposed, originally comprising the Thaumarchaeota, Aigarchaeota, Crenarchaeota and Korarchaeota16, and more recently adding the Bathyarchaeota14 and Verstraetearchaeota17; in essence, TACK comprised all lineages except Euryarchaeota5 and Nanoarchaeota9.
New archaeal lineages were also described outside the Euryarchaeota and TACK and were ultimately classified into two superphyla, DPANN and Asgard18,19. DPANN as originally proposed was based on five phyla, Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaeota18, but now also includes the Micrarchaeota, Woesearchaeota, Pacearchaeota, Altiarchaeota and Huberarchaeota5,20–23. The Asgard archaea are notable for their inferred sister relationship to the eukaryotes and were originally proposed based on the phyla Lokiarchaeota, Thorarchaeota, Odinarchaeota and Heimdallarchaeota19,24,25, followed by the Helarchaeota26. The net result of these cumulative activities is that archaeal classification at higher ranks is currently very uneven. The Euryarchaeota absorbed novel lineages and grew into a phylogenetic behemoth, whereas the
1Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Australia. 2School of Biological Sciences, The University of Auckland, Auckland, New Zealand. 3Department of Microbiology, University of Georgia, Athens, GA, USA. ✉e-mail: c.rinke@uq.edu.au; p.hugenholtz@uq.edu.au
Fig. 1 | Comparison of rank-normalized archaeal GTDB and NCBI taxonomies. a,b, RED of taxa defined by the NCBI taxonomy (a) and the curated
GTDB taxonomy (04-RS89) (b). Each data point (black circle) represents a taxon distributed according to its RED value (x axis) and its rank (y axis).
The fill colours of the circle (blue, grey or orange) indicate that a taxon is monophyletic, operationally monophyletic (defined as having an F measure
>0.95) or polyphyletic, respectively, in the underlying genome tree. An overlaid histogram shows the relative abundance of monophyletic, operationally
monophyletic and polyphyletic taxa for each 0.025 RED interval. Blue bars show the median RED value, and the black bars on either side show the RED
interval (±0.1) for each rank. Note that in the NCBI taxonomy the values of the higher ranks (order and above) are very unevenly distributed, to the point
that the medians are out of order; that is, the median RED value for classes was higher than the median value for orders. The GTDB taxonomy uses the
RED value to resolve over- and under-classified taxa by moving them to a new interior node (horizontal shift in plot) or by assigning them to a new rank
(vertical shift in plot). Only monophyletic or operationally monophyletic taxa were used to calculate the median RED values for each rank. In addition, only
taxa with a minimum of two children (for example, a phylum with two or more classes or a class with two or more orders) were considered for the GTDB
tree (‘--min_children 2’); however, a more lenient approach (‘--min_children 0’) was necessary for the NCBI tree since none of the NCBI phyla had the
required minimum of two classes, with the exception of the Euryarchaeota. Note that the phylum Crenarchaeota is not displayed in the NCBI plot, since
all genomes in this NCBI phylum are assigned to the class Thermoprotei, resulting in a single node decorated as ‘p__Crenarchaeota; c__Thermoprotei’
(Tpr). Also, Korarchaeota are only represented by a single species, Korarchaeum cryptofilum, in GTDB 04-RS89, and hence there is no internal node to be
displayed in this plot. RED values were calculated based on the ar122.r89 tree, inferred from 122 concatenated proteins, decorated with either the NCBI
or GTDB taxonomy. c, Rank comparison of GTDB and NCBI taxonomies. Shown are changes in GTDB compared to the NCBI taxonomic assignments
across 2,392 archaeal genomes from RefSeq/GenBank release 89. Note that the 153 UBA genomes passing quality control (QC) are not included
(2,392–153 = 2,239) since they had no NCBI taxonomy assignment. In the bars on the left, a taxon is shown as unchanged if its name was identical in both
taxonomies, as a passive change if the GTDB taxonomy provided name information absent in the NCBI taxonomy (missing names) or as an active change
if the name was different between the two taxonomies. The right bar shows the changes of the entire tax string (consisting of seven ranks) per genome,
indicating that most genomes had both active and passive changes in their taxonomy.
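The change categories in panel c can be sketched in a few lines of Python. This is a minimal illustration of the definitions in the caption (unchanged, passive, active), not the authors' script; the function names, the use of empty names such as 'o__' for missing ranks and the example strings below the phylum and class ranks are assumptions made for illustration.

```python
# Rank prefixes for the seven canonical ranks (domain to species).
RANK_PREFIXES = ("d__", "p__", "c__", "o__", "f__", "g__", "s__")


def parse_taxonomy(tax_string):
    """Split a string such as 'd__Archaea;p__X;...' into {prefix: name}."""
    names = {prefix: "" for prefix in RANK_PREFIXES}
    for field in tax_string.split(";"):
        field = field.strip()
        prefix, name = field[:3], field[3:]
        if prefix in names:
            names[prefix] = name
    return names


def classify_change(ncbi_name, gtdb_name):
    """Classify one rank following the caption's three categories."""
    if ncbi_name == gtdb_name:
        return "unchanged"
    if not ncbi_name:
        # GTDB supplies a name that is missing in NCBI.
        return "passive"
    return "active"


def compare_taxonomies(ncbi_string, gtdb_string):
    """Return the change category for each of the seven ranks."""
    ncbi = parse_taxonomy(ncbi_string)
    gtdb = parse_taxonomy(gtdb_string)
    return {p: classify_change(ncbi[p], gtdb[p]) for p in RANK_PREFIXES}


def summarize_genome(changes):
    """Collapse per-rank categories into the per-genome category."""
    kinds = set(changes.values()) - {"unchanged"}
    if not kinds:
        return "unchanged"
    if kinds == {"active"}:
        return "only active"
    if kinds == {"passive"}:
        return "only passive"
    return "active and passive"


if __name__ == "__main__":
    # Hypothetical strings: names below class rank are placeholders.
    ncbi = "d__Archaea;p__Euryarchaeota;c__Halobacteria;o__;f__;g__;s__"
    gtdb = "d__Archaea;p__Halobacteriota;c__Halobacteria;o__O1;f__F1;g__G1;s__G1 sp1"
    changes = compare_taxonomies(ncbi, gtdb)
    print(changes)
    print(summarize_genome(changes))
```

In this example the phylum differs between the two strings (active change) and the order to species names are absent from the NCBI string (passive changes), so the genome falls in the 'active and passive' per-genome category.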
no canonical rank information beyond their phylum affiliation (31.0%), which is partly offset by the extensive use of names with no rank in the NCBI taxonomy (Supplementary Fig. 2a). This was particularly apparent for DPANN phyla, which are almost entirely lacking in information in the family to class ranks (Supplementary Fig. 2b).
[Fig. 1 graphic not reproduced. Panels a (NCBI) and b (GTDB): x axis, relative evolutionary divergence (0 to 1.0); y axis, rank (no. taxa). Ranks legible in panel a: Phylum (10), Class (8), Order (14), Genus (88), Species (15). Ranks legible in panel b: Phylum (5), Class (18), Order (36), Family (76), Genus (174). Panel a labels: Ae, phylum Ca. Aenigmarchaeota; Bat, phylum Ca. Bathyarchaeota; Dia, phylum Ca. Diapherotrites; Eur, phylum Euryarchaeota; Hei, phylum Ca. Heimdallarchaeota; Mar, phylum Ca. Marsarchaeota; Mic, phylum Ca. Micrarchaeota; Tha, phylum Thaumarchaeota; Tho, phylum Ca. Thorarchaeota; Woe, phylum Ca. Woesearchaeota. Panel b labels: Asg, p__Asgardarchaeota; Met, p__Methanobacteriota (Euryarchaeota); Mar, o__Marsarchaeales; Tpr, c__Thermoprotei; Tpl, c__Thermoplasmata; Thm, g__Thermofilum; ThA, g__Thermofilum_A. Panel c: y axis, genomes (%) (2,239 total); x axis, Phylum, Class, Order, Family, Genus, Species and Per genome; legend, unchanged, passive change and active change per rank, and unchanged, only passive, only active, and active and passive per genome.]
[Fig. 2 graphic not reproduced. Panel a tree titles: ar.122.r89 (IQTREE C10 PMSF) (2.1); IQTREE C10 (2.3); IQTREE C60 (2.5); 32kAA IQTREE C10 PMSF (17). Number of taxa printed beside these four plots: p 5, 5, 5, 5; c 18, 18, 17, 18; o 36, 36, 36, 36; f 76, 76, 76, 76; g 174, 174, 174, 174. Titles and values for the remaining panels could not be recovered from the extracted text.]
Fig. 2 | Comparison of marker sets, inference methods and models. Phylogenetic trees inferred with different methods, from varying concatenated
alignments or via supertree approaches were decorated with the GTDB 04-RS89 taxonomy. RED distributions for taxa at each rank (p, phylum; c, class; o,
order; f, family; g, genus) are shown (y axis) relative to the median RED value of the rank (x axis). The number of taxa is provided on the right-hand side of
the plots. The legend indicates the percentage of polyphyletic taxa per rank, defined as an F measure <0.95. Note that only taxa with two or more genomes
were included. a, Trees inferred with IQ-TREE from a concatenated alignment of the 122 GTDB markers with ~5,000 alignment columns using different
profile mixture models (C10 PMSF, C10 and C60) and from the untrimmed 32kAA alignment. b, Trees inferred from a modified concatenated alignment
of the 122 GTDB markers to account for compositional bias, including stationary (BMGE) and progressive trimming (Tr. 20%, Tr. 40%) of heterogeneous
sites and clustering of sites with shared homology (Divvier). c, Trees inferred from a modified concatenated alignment of the 122 GTDB markers including
recoding into four character states (C60 SR4), recoding and stationary trimming (BMGE C60 SR4) and removal of 20% and 40% of the fastest-evolving
sites (SlowFaster 20% and 40%). Note that, due to technical limitations, a reduced order-dereplicated genome set was used for SlowFaster, allowing
the evaluation of only the phylum and class ranks. d, Trees inferred from 122 marker alignments using different inference software and models, including
FastTree2, ExaML and PhyloBayes. Note that due to computational constraints PhyloBayes was calculated from the order-dereplicated dataset, allowing the
evaluation of only the ranks phylum and class. e, Trees inferred from alternative markers, including 16 ribosomal proteins (rp1), 23 ribosomal proteins (rp2),
SSU rRNA genes (SSU) and a set of 53 marker proteins (ar53). f, Trees inferred with the ASTRAL supertree approach using 122 and 253 marker proteins.
More details about inference models and methods are given in Supplementary Table 10. The number in parentheses following each tree name (for example,
‘(2.1)’) refers to the number of this tree in the Supplementary Information, including Supplementary Table 10. The violin plots include a marker for the
median and a box indicating the first and third quartiles.
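The F measure threshold used in the captions of Figs. 1 and 2 can be illustrated as follows. This is a sketch under stated assumptions, not the implementation used for the paper: it assumes the F measure is the harmonic mean of precision and recall of a taxon with respect to a clade, that the best-scoring clade in the tree is used, and that a value of exactly 0.95 counts as polyphyletic (the captions only define >0.95 and <0.95). The function names and the representation of a tree as a list of leaf sets are hypothetical.

```python
def f_measure(clade_genomes, taxon_genomes):
    """Harmonic mean of precision and recall of a taxon at one clade."""
    clade = set(clade_genomes)
    taxon = set(taxon_genomes)
    shared = len(clade & taxon)
    if shared == 0:
        return 0.0
    precision = shared / len(clade)   # share of the clade that is the taxon
    recall = shared / len(taxon)      # share of the taxon inside the clade
    return 2 * precision * recall / (precision + recall)


def best_f_measure(clades, taxon_genomes):
    """Highest F measure over all clades (leaf sets) of a tree."""
    return max(f_measure(clade, taxon_genomes) for clade in clades)


def monophyly_status(clades, taxon_genomes, threshold=0.95):
    """Label a taxon as in the figure legends."""
    best = best_f_measure(clades, taxon_genomes)
    if best == 1.0:
        return "monophyletic"
    if best > threshold:
        return "operationally monophyletic"
    return "polyphyletic"


if __name__ == "__main__":
    # Hypothetical taxon of 40 genomes; one genome falls outside the main clade.
    taxon = {f"g{i}" for i in range(40)}
    main_clade = {f"g{i}" for i in range(39)}
    stray_clade = {"g39", "other"}
    print(monophyly_status([main_clade, stray_clade], taxon))
```

With 39 of 40 genomes in one clade, precision is 1 and recall is 0.975, so the F measure is above 0.95 and the taxon would be treated as operationally monophyletic.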
Fig. 13). All GTDB taxa were recovered as monophyletic or operationally monophyletic (Extended Data Fig. 3 and Supplementary Fig. 11) for all supermatrix methods, regardless of the underlying inference algorithm or model. The ASTRAL supertrees, inferred using the most divergent approach tested, recovered 96% of GTDB taxa with two or more representatives as monophyletic or operationally monophyletic (Extended Data Fig. 3 and Supplementary Fig. 11). The only major inconsistency affected the Euryarchaeota,
[Fig. 3 graphic not reproduced. RED plots for panels a to c: x axis, RED (0 to 1.0); y axis, rank (no. taxa): Phylum (5), Class (18), Order (36), Family (76), Genus (174). Trees: scale bars, 0.10; the phylum Halobacteriota is labelled in each panel.]
Fig. 3 | Impact of different rooting scenarios on RED intervals. a–c, The rooting approach implemented in GTDB (a), which calculates the RED as
the median of all possible rootings of phyla with at least two classes (red arrows), is compared to a fixed root between the DPANN superphylum (red
arrow) and the remaining Archaea (b) and to a fixed root within the NCBI phylum Euryarchaeota, which translates to a root between the two phyla
Thermoplasmatota and Halobacteriota (red arrow) and the rest of the Archaea in the GTDB taxonomy (c). In the upper RED plot each data point (black
circle) represents a taxon distributed according to its RED value (x axis) and its rank (y axis). An overlaid histogram shows the relative abundance of taxa
for each 0.025 RED interval, a blue bar shows the median RED value and two black bars on either side show the RED interval (±0.1) for each rank. Note
that, overall, the ranks can be distinguished based on their RED value, regardless of the applied rooting scenario. Furthermore, RED values are relative and
should not be directly compared between plots, as they are dataset specific. Rather, the distribution of RED values is the key metric; that is, the distance
(positive or negative) from the median of the RED value for each rank (ΔRED). The trees include a label highlighting the corresponding NCBI phylum
Euryarchaeota (Eury) as a point of reference. The scale bars indicate 0.1 substitutions.
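The ΔRED metric described in this caption can be sketched as below: the median RED is taken per rank, each taxon's signed distance from that median is computed, and taxa are flagged when they fall outside the ±0.1 interval. The RED values in the example are invented for illustration, and the sketch uses every supplied taxon to compute the median, whereas the Fig. 1 caption states that only monophyletic or operationally monophyletic taxa contribute. Function names are hypothetical.

```python
from statistics import median


def rank_medians(taxa):
    """Median RED per rank; taxa is a list of (name, rank, red) tuples."""
    by_rank = {}
    for _name, rank, red in taxa:
        by_rank.setdefault(rank, []).append(red)
    return {rank: median(values) for rank, values in by_rank.items()}


def delta_red(taxa, interval=0.1):
    """Return {name: (delta, within_interval)} relative to the rank median."""
    medians = rank_medians(taxa)
    report = {}
    for name, rank, red in taxa:
        delta = red - medians[rank]
        report[name] = (delta, abs(delta) <= interval)
    return report


if __name__ == "__main__":
    # Invented RED values for four hypothetical classes.
    taxa = [
        ("c__A", "class", 0.30),
        ("c__B", "class", 0.34),
        ("c__C", "class", 0.36),
        ("c__D", "class", 0.52),
    ]
    for name, (delta, ok) in delta_red(taxa).items():
        print(f"{name}\tdelta_RED={delta:+.2f}\twithin_interval={ok}")
```

Because only the distance from the rank median is reported, the same check can be applied to differently rooted trees even though their absolute RED values are not comparable.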
with 48 taxa being placed outside of this phylum in both supertrees (Supplementary Fig. 14e and Supplementary Fig. 15). All phylogenetic trees inferred in this study are summarized in Supplementary Table 10 and provided in Supplementary Data 1.

Compositional bias and fast-evolving sites. To test for possible compositional bias we employed tools for character trimming and for clustering of high confidence positional homology (Methods) that have been shown to alleviate long-branch attraction artefacts and to increase tree accuracy for alignments of distantly related sequences49,50. Decorated maximum likelihood (ML) trees calculated from the trimmed or clustered alignments were in strong taxonomic agreement with the reference tree, showing comparable RED values and recovering >96.5% of GTDB taxa with two or more representatives as monophyletic or operationally monophyletic (Fig. 2b,c, Extended Data Fig. 3 and Supplementary Figs. 11 and 13). Less than 4% of genomes had conflicting taxonomic assignments at any rank above species (Supplementary Table 11), with the largest difference being observed in the class Methanosarcinia (18 taxa) and the order Desulfurococcales (46 taxa) for the character-trimmed alignment (Supplementary Fig. 14d). The clustered alignment resulted in fewer differences, with a total of seven conflicting taxa (Supplementary Fig. 14d).

Analysis of increasing genome database size. An important variable in producing a genome-based taxonomy is the changing number of genome sequences, which will continue to increase in future and will likely impact the underlying tree topology. To assess the robustness of the GTDB taxonomy to this variable, we recapitulated the expansion of the archaeal genome database since 2015 by subsampling the 1,248 archaeal genomes according to their NCBI release date, resulting in 362, 528, 1,035 and 1,183 taxa for the years 2015, 2016, 2017 and 2018, respectively. Comparison of the taxonomy derived from these antecedent trees to the reference tree revealed that >99% of all named taxa were recovered as monophyletic groups, demonstrating that addition of new branches to the reference tree did not destabilize the subset of robust interior nodes used for GTDB taxonomic assignments. This suggests that the GTDB taxonomy should be stable with increasing dataset size in future releases.

Rooting effects. Rank normalization is sensitive to the placement of the tree root, which defines the last common ancestor (set to RED = 0)35,38, and can therefore potentially influence the resulting taxonomy. Since the rooting of both the archaeal and bacterial domains remains contested, GTDB uses an operational approach whereby the median of multiple plausible rootings is considered35. We assessed the effect of fixed rooting of the archaeal domain on
[Fig. 4 graphic not reproduced. Legend entries: GTDB phyla, NCBI phyla. GTDB phylum labels legible in the extracted text: Thermoproteota, Asgardarchaeota, Nanoarchaeota, Nanohaloarchaeota, Aenigmarchaeota, UAP2, Hadarchaeota, Hydrothermarchaeota, Micrarchaeota, Iainarchaeota, Altarchaeota, Methanobacteriota, Halobacteriota and Thermoplasmatota. Most class and order labels set along the tree arcs were not recoverable.]
Fig. 4 | Rank-normalized archaeal GTDB taxonomy. Species representatives ar122.r89 ML tree, scaled by replacing branch lengths with RED values, and
decorated with the archaeal GTDB taxonomy R04-RS89. The outer blue ring denotes the rank-normalized phyla, and the inner light blue clades indicate
the classes in the rank-normalized GTDB taxonomy. Classes with ten or more taxa are labelled, and classes with order-level divergence are indicated with
both class and order names. The two GTDB phyla consisting of only a single species each, namely Huberarchaeota and EX4484-52, are highlighted by
red branches indicating their uncertain placement in the ar122.r89 tree. The inner orange ring denotes the r89 NCBI phyla with two or more taxa. The
NCBI superphyla TACK and DPANN are indicated by grey arcs. Abbreviations are the following: Bat (Ca. Bathyarchaeota), M (Ca. Marsarchaeota), V (Ca.
Verstraetearchaeota), T (Ca. Thorarchaeota), L (Ca. Lokiarchaeota), H (Ca. Heimdallarchaeota), Woe (Ca. Woesearchaeota), P (Ca. Parvarchaeota), N
(Nanoarchaeota), Nh (Ca. Nanohaloarchaeota), A (Ca. Aenigmarchaeota), M (Ca. Micrarchaeota), D (Ca. Diapherotrites). Bootstrap values over 90% are
indicated by blue dots. Scale bar indicates 0.1 RED.
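Scaling a tree by RED, as done for this figure, requires a RED value for every node. The sketch below computes RED on a fixed rooted tree with the recurrence RED(n) = p + (d/u)(1 − p), where p is the parent's RED, d is the branch length to the parent and u is the mean distance from the parent to the leaves below n; the root is set to 0 and leaves to 1. This recurrence is the definition recalled from the earlier GTDB work cited in this paper (refs. 35,38) and is not restated in this section, so it should be read as an assumption. The `Node` class and function names are hypothetical, and the operational rooting used by GTDB (median over multiple rootings) is not implemented.

```python
class Node:
    """Minimal rooted-tree node; length is the branch length to the parent."""

    def __init__(self, name, length=0.0, children=None):
        self.name = name
        self.length = length
        self.children = children or []
        self.red = None


def leaf_distances(node):
    """Distances from node to each of its descendant leaves."""
    if not node.children:
        return [0.0]
    distances = []
    for child in node.children:
        distances.extend(child.length + d for d in leaf_distances(child))
    return distances


def assign_red(node, parent_red=None):
    """Assign RED to every node: root 0, leaves 1, interior by interpolation."""
    if parent_red is None:
        node.red = 0.0
    elif not node.children:
        node.red = 1.0
    else:
        below = leaf_distances(node)
        # u: mean path length from the parent to the leaves below this node.
        # Assumes the tree has non-zero branch lengths, so u > 0.
        u = node.length + sum(below) / len(below)
        node.red = parent_red + (node.length / u) * (1.0 - parent_red)
    for child in node.children:
        assign_red(child, node.red)


if __name__ == "__main__":
    # Toy tree with unequal rates below clade A (leaf branches 0.25 and 0.75).
    a1 = Node("a1", 0.25)
    a2 = Node("a2", 0.75)
    clade_a = Node("A", 0.5, [a1, a2])
    b = Node("b", 1.0)
    root = Node("root", 0.0, [clade_a, b])
    assign_red(root)
    for n in (root, clade_a, a1, a2, b):
        print(n.name, n.red)
```

In the toy tree the two leaves below clade A sit at different distances from it, yet both receive RED = 1 and clade A is placed at RED = 0.5, which illustrates how the interpolation averages over lineage-specific rates.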
the taxonomy by testing two recently proposed archaeal root placements: the first being within the Euryarchaeota (as defined in the NCBI taxonomy)51 and the second, between the DPANN superphylum and all other Archaea52. Overall, RED values were stable across the tested rooting scenarios and the intervals defining taxonomic ranks in GTDB were largely preserved (Fig. 3). As expected, a fixed root caused the taxa within the rooted lineages to be drawn closer to the root, although it is important to note that absolute RED values should not be compared between trees or even between different rootings of the same tree. Rather, the distribution of labelled nodes within a rank is the key metric, and by this metric all taxa in the rooted lineages were within their expected RED assigned rank intervals (Supplementary Fig. 16 and Fig. 3).

Proposal of revised taxa based on the GTDB archaeal taxonomy. After resolving polyphyletic groups and normalizing ranks, the archaeal GTDB taxonomy (release R04-RS89) comprises 16 phyla, 36 classes, 96 orders, 238 families, 534 genera and 1,248 species (Supplementary Table 12). This entailed the proposal of 13 new taxa and 32 new Candidatus taxa above the rank of genus including five novel species combinations and three novel Candidatus species combinations (Supplementary Tables 13–16). We also used 25 Latin names without standing in nomenclature as placeholders in the GTDB taxonomy to preserve literature continuity, as we were unable to propose them mostly due to the absence of a designated nomenclature type (Supplementary Table 17). The extensive rearrangement of phyla to normalize phylogenetic depth resulted in both division and amalgamation of release 89 NCBI phyla (Fig. 4). For example, the phylum Euryarchaeota was divided into five separate phyla in the GTDB taxonomy due to its anomalous depth (Figs. 1a and 4). For two of the five phyla, names have previously been proposed: Methanobacteriota53 and Hydrothermarchaeota54. The phylum Hadarchaeota and its subordinate taxa are proposed in this study (Supplementary Table 15), named after the genus of the previously proposed species Ca. Hadarchaeum yellowstonense32. Names for the other two phyla are formally proposed here based on
[Fig. 5 graphic not reproduced. Cladogram tip and clade labels (for example, f__Nitrosocaldaceae and o__UBA164) were set along the tree arcs and are mostly not recoverable from the extracted text.]
Fig. 5 | Reclassification of the Thaumarchaeota. Cladogram of a subtree of the ar122.r89 reference tree showing the GTDB phylum Thermoproteota with the
classes Korarchaeia, Thermoprotei, Methanomethylicia (former phylum Ca. Verstraetearchaeota), Bathyarchaeia and Nitrososphaeria (former phylum Ca.
Thaumarchaeota). The validly published class Nitrososphaeria (light blue) was emended in GTDB to include all taxa assigned to the phylum Thaumarchaeota
in the NCBI RefSeq89 taxonomy. The type species of this lineage is Nitrososphaera viennensis56, which serves as the type for higher taxa including the genus
Nitrososphaera, the family Nitrososphaeraceae, the order Nitrososphaerales and the class Nitrososphaeria. The genome of the N. viennensis type strain EN76T
(= DSM 26422T = JCM 19564T) is highlighted in white. Arrow points to the ar122.r89 reference tree taxa not shown in this figure. The phylogenetic tree was
annotated and coloured with the online tool ‘Interactive Tree Of Life’80.
validly published class names within each lineage: Halobacteriota phyl. nov. (after the class Halobacteria) and Thermoplasmatota phyl. nov. (after the class Thermoplasmata; Table 1 and Supplementary Table 13). Note that the name Euryarchaeota was not retained in the GTDB taxonomy for two reasons. Firstly, a nomenclatural type was not designated in the original proposal4, which means the name would be illegitimate if the rank of phylum is introduced into the ICNP, since a type is a requirement for validation of a name. Secondly, the substantial changes made to this phylum might have introduced confusion if the name had been retained for one of the five newly circumscribed phyla. It is possible that the name Euryarchaeota could be reintroduced as a superphylum; however, this is outside the scope of the GTDB taxonomy, which only uses canonical ranks.
The TACK superphylum16 was reclassified as a single phylum based on rank normalization and its robust monophyly, for which we propose to use the effectively published name Thermoproteota53 based on the earliest validly described class for any TACK phylum, the Thermoprotei55 (Table 1 and Supplementary Table 13). Note that the phylum name could alternatively have been derived from the earliest genus name in the TACK superphylum, Sulfolobus; however, this genus is currently a member of the class Thermoprotei.
Methods

Genome dataset. For the archaeal GTDB taxonomy R04-RS89, we obtained 2,661 archaeal genomes from RefSeq/GenBank release 89 and augmented them with 187 phylogenetically diverse MAGs (Supplementary Note 3) derived from the Sequence Read Archive37 (SRA) as part of a large genome recovery study by the authors39, resulting in 2,848 genomes. This dataset was refined by applying a quality threshold (completeness − 5 × contamination > 50%) using lineage-specific markers implemented in CheckM (refs. 36,67) and by screening out genomes that contain (1) <40% of the 122 archaeal GTDB marker genes, (2) more than 100,000 ambiguous bases and (3) more than 1,000 contigs, and those that have an N50 < 5 kilobases (kb). The filtered genomes were manually inspected, and four exceptions were made for genomes that did not pass the QC but were the only named representatives for the corresponding lineages (Aenigmarchaeum, Lokiarchaeota, Nanosalinarum and Parvarchaeum; Supplementary Table 19). This approach left 2,392 genomes to form species clusters (see Species clusters), resulting in a total of 1,248 species-representative genomes for the downstream analysis (Supplementary Table 1). The 456 genomes that did not pass QC are still searchable on the GTDB website (https://gtdb.ecogenomic.org/) and are listed in Supplementary Table 20.

NCBI taxonomy. The NCBI taxonomy of all representative genomes of R04-RS89 was obtained from the NCBI Taxonomy FTP site on 16 July 2018. The NCBI taxonomy was standardized to seven ranks (species to domain) by identifying missing standard ranks and filling these gaps with rank prefixes and by removing non-standard ranks35. All standard ranks were prefixed with rank identifiers (for example, ‘p__’ for phylum) as previously described68.

Phylogenomic marker set. Archaeal multiple sequence alignments (MSAs) were created through the concatenation of 122 phylogenetically informative markers comprising proteins or protein domains specified in the Pfam v27 or TIGRFAMs v15.0 databases. The 122 archaeal marker proteins were selected based on the criteria described in Parks et al.39. In brief, this included being present in ≥90% of archaeal genomes and, when present, single-copy in ≥95% of genomes. Only genomes comprising ≤200 contigs with an N50 of ≥20 kb and with CheckM completeness and contamination estimates of ≥95% and ≤5%, respectively, were considered (Supplementary Note 4). Phylogenetically informative proteins were determined by filtering ubiquitous proteins whose gene trees had poor congruence with a set of subsampled concatenated genome trees39. Gene calling was performed with Prodigal v2.6.3, and markers were identified and aligned using HMMER v3.1b1. The presence or absence of the 122 protein markers in the 1,248 species representatives is provided in Supplementary Table 21. The marker proteins were concatenated into an MSA of 32,500 columns, which we refer to as ‘untrimmed 32kAA alignment’ in this manuscript. To remove sites with weak phylogenetic signals, we created an amino acid alignment by trimming columns represented in <50% of the genomes and columns with less than 25% or more than 95% amino acid consensus, resulting in an initial 27,000 amino acid alignment. Here, the term ‘consensus’ refers to the number of taxa with the same residue in a given sequence alignment column; for example, a maximum consensus of 95% means that a maximum of 95% of all taxa can have the same residue in an alignment column in order to be considered for the trimmed alignment. To reduce computational requirements, the alignment was further trimmed by randomly selecting 42 amino acids from the remaining columns of each marker. The resulting ar122.r89 MSA included a total of 5,124 (42 × 122) columns. This MSA filtering methodology is implemented in the ‘align’ method of GTDB-Tk v1.0.2 (ref. 66).

Alternative marker sets. Two alternative MSAs resembling previously published datasets were created through the concatenation of either 16 or 23 ribosomal proteins, termed datasets rp1 (ref. 45) and rp2 (ref. 18), respectively. After trimming columns represented by <50% of the genomes and with an amino acid consensus <25%, the resulting MSA spanned 1,174 and 2,377 amino acids for rp1 and rp2, respectively.
In addition, we compared the GTDB taxonomy derived from the ar122.r89 tree to a recently created set of trees based on an MSA of 56 evaluated archaeal markers46. We found that 95.3–99.5% of GTDB taxa with more than one genome were recovered as operationally monophyletic in trees inferred from the best 25–75% of the markers in this study (Supplementary Tables 22 and 23 and

The alignment was trimmed to remove leading and trailing columns represented by <70% of taxa and to filter sequences <900 bp (trimSeqs.py v0.0.1; https://github.com/Ecogenomics/scripts).

Species clusters. The 2,392 archaeal genomes were formed into species clusters as previously described38. Briefly, a representative genome was selected for each of the 380 validly or effectively published archaeal species with one or more genomes passing QC, and genomes were assigned to these representatives using ANI and alignment fraction (AF) criteria. Genomes were assigned to the closest representative genome for which they have an ANI of ≥95% with an AF of at least 65%, except if two representatives had an ANI > 95%. In such cases, the ANI radius of a representative was set to the value of the closest representative, up to a maximum of 97%. Species representatives having an ANI > 97% were considered synonyms. For example, Ferroplasma acidiphilum is represented by the genome GCF_002078355.1, which is assembled from the type strain of this species. The closest GTDB representative to this F. acidiphilum genome is GCF_000152265.2, which is assembled from the type strain of Ferroplasma acidarmanus. The ANI between these two genomes is 96.95%. Consequently, to preserve both the F. acidiphilum and F. acidarmanus names in the GTDB, the ANI circumscription radius of these two species is set to 96.95% instead of the default 95% (see Parks et al.38, which illustrates and describes this methodology).
Genomes not assigned to one of these 380 species were formed into 868 de novo species clusters, each specified by a single representative genome. Representative genomes were selected based on assembly quality with preference given to isolate genomes. Of the 868 de novo species clusters, the representative genomes of 70 were unnamed isolates, 775 were MAGs and 23 were single amplified genomes (SAGs).

Accounting for compositional bias. Each of the 122 untrimmed GTDB 04-RS89 archaeal single-copy marker protein alignments was filtered individually using BMGE 1.12 (ref. 50) and Divvier 1.0 (ref. 49). BMGE was executed using ‘-t AA -s FAST -h 0.55 -m BLOSUM30’, and Divvier was run using the recommended options (‘-divvy -mincol 4 -divvygap’). Processing untrimmed protein alignments of individual markers ensures that all protein positions are considered when accounting for compositional bias. Next, each of the filtered marker gene alignments was concatenated into single MSA supermatrices for BMGE and Divvier, whereby previously removed gap-only sequences were added again in the corresponding positions. Finally, the MSA was trimmed according to GTDB criteria mentioned above to a length of 7,859 amino acids (BMGE) and 32,061 amino acids (Divvier) and used for phylogenetic inferences. The reasoning for and background on how to account for compositional bias are provided in Supplementary Note 6.

Phylogenetic inference. Phylogenomic trees were inferred with FastTree2 (ref. 72), ExaML (ref. 73), IQ-TREE (ref. 41) and PhyloBayes (ref. 74) on different alignments and with a range of models (Supplementary Table 10). Note that we chose IQ-TREE as the GTDB standard inference, since it scales with our dataset and allows mixture models (see IQ-TREE) and because a previous study concluded that for concatenation-based species tree inference, IQ-TREE consistently achieved the best-observed likelihoods for all datasets, compared to RAxML/ExaML and FastTree (ref. 75). More detail about each inference program is provided below.

FastTree. FastTree v2.1.9 was executed in multithreaded mode with the WAG + GAMMA parameters.

IQ-TREE. IQ-TREE was executed employing mixture models such as models C10 through C60 (ref. 76) and PMSF, a faster approximation of these models40. True C10–C60 trees are computationally more demanding, and memory requirements tend to increase with the number of components in the mixture, which range from 10 to 60 (Supplementary Table 10). We therefore opted for the faster PMSF model, specifically, C10 PMSF, to calculate the ar122.r89 tree. The tree was calculated with IQ-TREE v1.6.12 based on the C10 mixture model and a starting tree (‘-ft’), inferred by FastTree v2.1.9 as described above, to invoke the faster PMSF approximation with the following settings: ‘-m LG+C10+F+G -ft <starting tree>’.

ExaML. ExaML trees (gamma, JTT) were calculated from ten different starting
Supplementary Note 5). Next, we created an MSA from all markers supplied with a trees with random seeds using the mpi version 3.0.20 (settings ‘-m GAMMA’),
hidden Markov model (HMM) (53 out of 56) (ar.53; Supplementary Table 8)46, by whereby the tree with the highest likelihood score was retained.
concatenating individual alignments for each of the 53 HMMs for all 1,248 species
representatives, using pfam 33.1 and tigrfam 15.0. The resulting concatenated PhyloBayes. To accommodate the computationally demanding Bayesian inferences,
MSA of 13,451 amino acids was used without further trimming steps to infer we subsampled our dataset by reducing the number of taxa to one representative
phylogenetic trees or to apply tools addressing compositional bias. per order, resulting in 96 taxa. The order representatives were selected by removing
genomes with a quality score (CheckM completenes − 5 × CheckM contamination)
SSU rRNA gene. Archaeal SSU rRNA genes were identified from the 1,248 of <50, <50% of the 122 archaeal marker genes, an N50 < 4 kb, >2,500 contigs
archaeal R04-RS89 GTDB genomes using nhmmer v3.1b2 (ref. 69) with the SSU or >1,500 scaffolds. From the remaining genomes, the highest quality genome
rRNA model (RF00177) from the RFAM database70. Only the longest sequence was was selected, giving preference to, in the following order, (1) NCBI reference
retained for each genome. The resulting sequences were aligned with SSU-ALIGN genomes annotated as ‘complete’ at NCBI, (2) NCBI reference genomes, (3)
0.1.1 (ref. 71), and regions of low posterior probabilities, which are indicative of complete NCBI representative genomes, (4) NCBI representative genomes and
high alignment ambiguity, were pruned with ssu-mask (SSU-ALIGN 0.1.1). (5) GTDB representative genomes. Supplementary Table 24 indicates which of
Assessment of taxonomic congruence. The congruence of the GTDB taxonomy Code availability
in different trees was assessed as (1) the percentage of taxa identified as The standalone tool GTDB-Tk, which enables researchers to classify their own
monophyletic, operationally monophyletic (defined as having an F measure genomes according to the GTDB taxonomy, is available from GitHub (https://
≥0.95) or polyphyletic, (2) the RED distributions for taxa at each rank relative to github.com/Ecogenomics/GTDBTk/) and through KBase (https://kbase.us/
the median RED value of that rank and (3) the number of genomes with identical applist/apps/kb_gtdbtk/run_kb_gtdbtk/release). Taxonomic assignment and rank
or conflicting taxonomic assignments between compared trees. To carry out (1), standardization were carried out based on the RED calculated using PhyloRank
each taxon was placed on the node with the highest resulting F measure. The F v0.0.37, which is available from GitHub (https://github.com/dparks1134/
measure is defined as the harmonic mean of precision and recall and has been PhyloRank/).
proposed for decorating trees with a donor taxonomy68. Note that we introduced
the term operationally monophyletic (F measure ≥0.95) because otherwise a Received: 10 March 2020; Accepted: 10 May 2021;
few incongruent genomes can cause a large number of polyphyletic taxa. For a Published online: 21 June 2021
detailed explanation of the F measure, a note on how to compare RED values
and information concerning the stability between releases see Supplementary
Note 7 and Supplementary Table 25. RED values for all taxa across all decorated References
phylogenetic trees are provided in Supplementary Table 26. 1. Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain:
the primary kingdoms. Proc. Natl Acad. Sci. USA 74, 5088–5090 (1977).
Correspondence between NCBI and GTDB taxa. To provide a comparison 2. Gribaldo, S. & Brochier-Armanet, C. The origin and evolution of Archaea: a
between the NCBI and the GTDB taxonomy, based on the ar122.r89 tree, we state of the art. Philos. Trans. R. Soc. Lond. B Biol. Sci. 361, 1007–1022 (2006).
Extended Data Fig. 1 | Implementing the shifting substitution rate (SSR) model. a, Timetree, that is, a tree scaled to time. A grid has been overlaid on
top of the tree to delineate the length of the individual branch segments. b, An array of substitution rate multipliers, ordered from lowest to highest. In this
specific example, the slowest evolving lineage evolves 16 times more slowly than the fastest evolving lineage. c, The result of running the SSR model on the
tree in a. Every circle represents a shift in the substitution rate, whereby the colour corresponds to the substitution rate multipliers shown in b. At the start
of the simulation the model starts with the grey circle. How quickly the changes take place depends on the shifting substitution rate parameter. d, The result
of taking the tree in a and scaling it according to the active substitution rate multiplier in every branch at every branch segment. For instance, we can see
that the single segment in dark red (×4) has the same length as the two prior segments in light red (×2) on the same branch.
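The scaling step in panel d can be illustrated with a minimal Python sketch. It assumes that each branch of the timetree is stored as a list of grid segments, each with a duration and the rate multiplier active on it; the `Segment` type and the function name are hypothetical and are not taken from the authors' simulation code.

```python
from typing import List, Tuple

# One grid segment of a branch: (duration in time units, active rate multiplier).
# This representation is an assumption made for illustration only.
Segment = Tuple[float, float]


def scaled_branch_length(segments: List[Segment]) -> float:
    """Sum of segment durations, each scaled by its active rate multiplier."""
    return sum(duration * multiplier for duration, multiplier in segments)


# Two unit-length segments evolving at x2, and one unit-length segment at x4.
light_red = [(1.0, 2.0), (1.0, 2.0)]
dark_red = [(1.0, 4.0)]

print(scaled_branch_length(light_red))  # 4.0
print(scaled_branch_length(dark_red))   # 4.0
```

Under this representation, one segment at ×4 contributes the same scaled length as two equally long segments at ×2, which matches the example given in the caption.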
Extended Data Fig. 2 | Impact of variable evolutionary rates on the RED approach. Each panel shows the true ranking of every inner node of the species
tree (which we can directly obtain from the simulation) on the x-axis, and the inferred ranking of every node, resulting from modifying the branch lengths
of the tree using the SSR model and subsequently applying the RED algorithm to recover an ultrametric tree, on the y-axis. The main diagonal therefore
corresponds to the proportion of nodes of the rank given by the column that have been correctly classified. Each panel shows the results for
a different shifting substitution rate, from 0.1 to 0.5, and 1 (events/speciation). These results show that, as expected, higher numbers of changes in the
substitution rates impact the performance of the RED approach to a higher degree. However, the levels of accuracy remain high even for the most extreme
cases.
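The RED calculation itself is not spelled out in this caption. The following Python sketch assumes the commonly described definition: the root has RED 0, and any other node has RED = p + (d/u)(1 − p), where p is the RED of its parent, d is the branch length to the parent and u is the mean distance from the parent to the tips below the node. The `Node` class and function names are hypothetical and do not reproduce PhyloRank; node names are assumed to be unique and branch lengths positive.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    length: float = 0.0  # branch length to the parent (unused for the root)
    children: List["Node"] = field(default_factory=list)


def tip_depths(node: Node) -> List[float]:
    """Distances from `node` down to each of its descendant tips."""
    if not node.children:
        return [0.0]
    return [c.length + d for c in node.children for d in tip_depths(c)]


def red_values(root: Node) -> Dict[str, float]:
    """Assign RED to every node in a pre-order traversal (assumed definition)."""
    red = {root.name: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        p = red[parent.name]
        for child in parent.children:
            # u: mean distance from the parent to all tips below this child.
            depths = [child.length + d for d in tip_depths(child)]
            u = sum(depths) / len(depths)
            red[child.name] = p + (child.length / u) * (1.0 - p)
            stack.append(child)
    return red


# Small ultrametric example: ((a1:1,a2:1)A:1,b:2)root
tree = Node("root", children=[
    Node("A", 1.0, [Node("a1", 1.0), Node("a2", 1.0)]),
    Node("b", 2.0),
])
red = red_values(tree)
print(red["A"])  # 0.5
```

With this definition every tip receives a RED of 1, so internal nodes are placed on a common 0 to 1 scale regardless of lineage-specific rates.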
Extended Data Fig. 3 | Average percentage of monophyletic (green bars), operationally monophyletic (yellow bars) and polyphyletic (orange bars) taxa
across higher ranks (phylum, class and order). Shown are GTDB taxonomy decorated phylogenetic trees inferred with different methods, from
a range of markers, from alignments trimmed to reduce compositional bias and fast evolving sites, and from alignments created as part of the simulated
database expansion. Note that only taxa with two or more genomes were included, and that the dataset (order representatives) used for PhyloBayes
restricts the analysis to the ranks of phylum and class. Details for each inferred tree are provided in Supplementary Table 10. Percentages for all ranks are
shown in Supplementary Fig. 11. Monophyly and operational monophyly were determined based on the F measure of decorated internal nodes.
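The F-measure classification used for these bars can be sketched as follows. The sketch takes the F measure as the harmonic mean of precision and recall and uses the 0.95 threshold for operational monophyly; treating an F measure of exactly 1 as monophyletic is an assumption, and the function names and the dictionary of candidate nodes are hypothetical.

```python
from typing import Dict, Set, Tuple


def f_measure(taxon: Set[str], clade: Set[str]) -> float:
    """Harmonic mean of precision and recall of a clade with respect to a taxon."""
    shared = len(taxon & clade)
    if shared == 0:
        return 0.0
    precision = shared / len(clade)  # fraction of the clade belonging to the taxon
    recall = shared / len(taxon)     # fraction of the taxon found in the clade
    return 2 * precision * recall / (precision + recall)


def classify_taxon(taxon: Set[str], clades: Dict[str, Set[str]],
                   threshold: float = 0.95) -> Tuple[str, float, str]:
    """Place the taxon on the node with the highest F measure and label it."""
    best_node, best_f = max(
        ((node, f_measure(taxon, leaves)) for node, leaves in clades.items()),
        key=lambda item: item[1],
    )
    if best_f == 1.0:  # assumption: strict monophyly means a perfect match
        status = "monophyletic"
    elif best_f >= threshold:
        status = "operationally monophyletic"
    else:
        status = "polyphyletic"
    return best_node, best_f, status


# A taxon of 20 genomes whose best clade also contains one foreign genome.
taxon = {f"g{i}" for i in range(1, 21)}
clades = {"n1": taxon | {"x"}, "n2": {"g1", "g2"}}
print(classify_taxon(taxon, clades))
```

In this example a single incongruent genome lowers the F measure to 40/41, which stays above 0.95, so the taxon is counted as operationally monophyletic rather than polyphyletic.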
Extended Data Fig. 4 | Higher rank GTDB lineages not resolved with alternative inference methods. Shown are taxa that were not recovered as
monophyletic or operationally monophyletic (green) and hence were polyphyletic (red; F measure < 0.95) in at least one of the different alignments and
inference methods. The bootstrap support for each taxon in the ar.122.r89 reference tree is given in the last column ‘BS in ar.122.r89 tree’.
Extended Data Fig. 5 | Examples of application of names using the manual curation workflow. Provided are five examples of taxon names that have
been updated in GTDB following the manual curation workflow. Each example is shown in a distinct colour: Ca. Thaumarchaeota (red), Ca.
Diapherotrites (blue), Ca. Bathyarchaeota (green), Ca. Verstraetearchaeota (purple) and Ca. Korarchaeota (orange). For example, Ca. Thaumarchaeota
(red) has no designated nomenclature type and has no lower-ranking taxon based on the same stem as the taxon. Furthermore, it has been united with
another taxon of the same rank in GTDB, which resulted in a name being chosen based on priority, in this case Thermoproteota. *The nomenclature type of a
taxon (for ranks above genus) is defined as one of its subordinate taxa with which the name is permanently associated.
Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see our Editorial Policies and the Editorial Policy Checklist.
Data analysis The standalone tool GTDB-Tk, which enables researchers to classify their own genomes according to the GTDB taxonomy, is available at
GitHub (https://github.com/Ecogenomics/GTDBTk/) and through KBase (https://kbase.us/applist/apps/kb_gtdbtk/run_kb_gtdbtk/release).
Taxonomic assignment and rank standardisation were carried out based on the relative evolutionary divergence (RED) calculated using
PhyloRank v0.0.37, which is available at GitHub (https://github.com/dparks1134/PhyloRank/).
Data
The GTDB taxonomy is available at the GTDB website (https://gtdb.ecogenomic.org/), including the ar122.r89 tree and the GTDB and NCBI taxonomic assignments
for all 2,392 archaeal genomes in GTDB 04-RS89.
Data exclusions This dataset was refined by applying a quality threshold (completeness − 5 × contamination >50%) using lineage-specific markers implemented
in CheckM and by screening out genomes that contain <40% of the 122 archaeal GTDB marker genes, more than 100,000 ambiguous bases or
more than 1,000 contigs, or that have an N50 <5 kb. This approach left 2,392 genomes to form species clusters (see below), resulting in a
total of 1,248 species representative genomes for the downstream analysis (Table S1). The 456 genomes that did not pass QC are still
searchable on the GTDB website (https://gtdb.ecogenomic.org/) and are listed in Table S20.
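The exclusion criteria above can be written as a single filter. This is a minimal Python sketch that assumes per-genome statistics are already available (for example from CheckM output); the `GenomeStats` fields and function names are hypothetical, and the boundary handling (strictly greater than 50 for the quality score, inclusive limits elsewhere) is one reading of the stated thresholds.

```python
from dataclasses import dataclass

N_MARKERS = 122  # archaeal GTDB marker genes


@dataclass
class GenomeStats:
    completeness: float    # CheckM completeness, percent
    contamination: float   # CheckM contamination, percent
    markers_found: int     # number of the 122 markers detected
    ambiguous_bases: int
    contigs: int
    n50: int               # in base pairs


def quality_score(g: GenomeStats) -> float:
    """Completeness minus five times contamination."""
    return g.completeness - 5 * g.contamination


def passes_qc(g: GenomeStats) -> bool:
    """Apply the data-exclusion thresholds listed above."""
    return (
        quality_score(g) > 50
        and 100 * g.markers_found / N_MARKERS >= 40
        and g.ambiguous_bases <= 100_000
        and g.contigs <= 1_000
        and g.n50 >= 5_000
    )


good = GenomeStats(95.0, 2.0, 110, 0, 120, 45_000)
poor = GenomeStats(60.0, 3.0, 110, 0, 120, 45_000)
print(passes_qc(good), passes_qc(poor))  # True False
```

The second genome fails only on the quality score (60 − 5 × 3 = 45), which shows how strongly contamination is weighted relative to completeness.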
Randomization The alignment matrix columns were randomly sub-sampled for bootstrap trees.