
Articles

https://doi.org/10.1038/s41477-019-0588-4

The hornwort genome and early land plant evolution
Jian Zhang1,18, Xin-Xing Fu1,2,18, Rui-Qi Li1,18, Xiang Zhao3,18, Yang Liu4,5,18, Ming-He Li6,18,
Arthur Zwaenepoel7,8,18, Hong Ma9, Bernard Goffinet10, Yan-Long Guan11, Jia-Yu Xue12,
Yi-Ying Liao4,13, Qing-Feng Wang13, Qing-Hua Wang1, Jie-Yu Wang6,14, Guo-Qiang Zhang6,
Zhi-Wen Wang3, Yu Jia1, Mei-Zhi Wang1, Shan-Shan Dong4, Jian-Fen Yang4, Yuan-Nian Jiao1,
Ya-Long Guo1, Hong-Zhi Kong1, An-Ming Lu1, Huan-Ming Yang5, Shou-Zhou Zhang4,19*,
Yves Van de Peer7,8,15,16,19*, Zhong-Jian Liu6,14,17,19* and Zhi-Duan Chen1,13,19*

Hornworts, liverworts and mosses are three early diverging clades of land plants, and together comprise the bryophytes. Here,
we report the draft genome sequence of the hornwort Anthoceros angustus. Phylogenomic inferences confirm the monophyly of
bryophytes, with hornworts sister to liverworts and mosses. The simple morphology of hornworts correlates with low genetic
redundancy in plant body plan, while the basic transcriptional regulation toolkit for plant development has already been established
in this early land plant lineage. Although the Anthoceros genome is small and characterized by minimal redundancy,
expansions are observed in gene families related to RNA editing, UV protection and desiccation tolerance. The genome of
A. angustus bears the signatures of horizontally transferred genes from bacteria and fungi, in particular of genes operating in
stress-response and metabolic pathways. Our study provides insight into the unique features of hornworts and their molecular
adaptations to live on land.

Land plants (Embryophyta) probably originated in the early Palaeozoic1, initiating the colonization of the terrestrial habitat. Because bryophytes (hornworts, liverworts and mosses) emerged from the early split in the diversification of land plants, they are key to the study of early land plant evolution (Supplementary Note 1.1). Unlike other extant land plants, the vegetative body of bryophytes is the haploid gametophyte, the sporophyte is always unbranched and permanently attached to the maternal plant, and both generations lack lignified vascular tissue2. Bryophytes occur in nearly all terrestrial habitats on all continents but are absent from marine environments3.

With only 200–250 species worldwide, the diversity of hornworts is much lower than that of the other six extant lineages of embryophytes (angiosperms, gymnosperms, ferns, lycophytes, mosses and liverworts)4. Long considered sister to all other land plants, or sister to all extant vascular plants, hornworts have recently been resolved as sister to the setaphytes (that is, the mosses and liverworts) within monophyletic bryophytes1,5–8. Still, hornworts possess a series of distinct features9. For instance, most hornworts have chloroplasts with CO2-concentrating pyrenoids, which have not been found in any other land plants but are widespread among green algae10. Other unusual features of hornworts include the persistent basal meristem in the sporophyte and mucilage-filled cavities for colonial symbionts on the gametophyte11. Most hornworts form tight symbiotic relationships with cyanobacteria12 and fungal endophytes (especially Glomeromycota and Mucoromycotina)13.

Here, we present the draft genome of A. angustus Steph. (Anthocerotaceae) (see Methods, Supplementary Figs. 1 and 2, and Supplementary Note 1.2). Completion of this high-quality hornwort genome complements previously sequenced representatives of the mosses (Physcomitrella patens14) and liverworts (Marchantia polymorpha15) and provides a unique opportunity to revisit bryophyte phylogeny, early land plant evolution and the adaptation of plants to live on land.

Genome assembly and annotation
We sequenced the genome of A. angustus (a single individual of unknown sex from the dioecious species) using a combination

1State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China. 2University of Chinese Academy
of Sciences, Beijing, China. 3PubBio-Tech Services Corporation, Wuhan, China. 4Key Laboratory of Southern Subtropical Plant Diversity, Fairy Lake Botanical
Garden, Shenzhen & Chinese Academy of Science, Shenzhen, China. 5BGI-Shenzhen, Shenzhen, China. 6Key Laboratory of National Forestry and Grassland
Administration for Orchid Conservation and Utilization at College of Landscape Architecture, Fujian Agriculture and Forestry University, Fuzhou, China.
7Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium. 8VIB Center for Plant Systems Biology, Ghent, Belgium. 9Department
of Biology, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, USA. 10Department of Ecology and Evolutionary Biology,
University of Connecticut, Storrs, CT, USA. 11Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of
Sciences, Kunming, China. 12Center for Plant Diversity and Systematics, Institute of Botany, Jiangsu Province and Chinese Academy of Sciences, Nanjing, China.
13Sino–Africa Joint Research Center, Chinese Academy of Sciences, Wuhan, China. 14College of Forestry and Landscape Architecture, South China Agricultural
University, Guangzhou, China. 15Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, Pretoria, South Africa.
16College of Horticulture, Nanjing Agricultural University, Nanjing, China. 17Fujian Colleges and Universities Engineering Research Institute of Conservation and
Utilization of Natural Bioresources, College of Forestry, Fujian Agriculture and Forestry University, Fuzhou, China. 18These authors contributed equally: Jian
Zhang, Xin-Xing Fu, Rui-Qi Li, Xiang Zhao, Yang Liu, Ming-He Li, Arthur Zwaenepoel. 19These authors jointly supervised this work: Shou-Zhou Zhang, Yves Van
de Peer, Zhong-Jian Liu, Zhi-Duan Chen. *e-mail: shouzhouz@126.com; yves.vandepeer@psb.vib-ugent.be; zjliu@fafu.edu.cn; zhiduan@ibcas.ac.cn

Nature Plants | VOL 6 | February 2020 | 107–118 | www.nature.com/natureplants 107



of Illumina and Oxford Nanopore high-throughput sequencing systems (see Methods). We generated 126.53 Gb raw reads from Illumina and 63.61 Gb raw reads from Nanopore sequencing platforms, and retained 17.10 Gb and 3.78 Gb, respectively, after filtering, error-correction and decontamination (see Methods, Supplementary Figs. 2–4 and Supplementary Tables 1–3). Finally, we obtained an optimized assembly of 119 Mb with a contig N50 length of 796.64 kb and a scaffold N50 length of 1.09 Mb (Table 1 and Supplementary Table 4). Approximately 97.66% of the vegetative gametophyte transcriptome data for A. angustus genome annotation can be mapped to the assembled genome (Supplementary Table 5). Repeat sequences comprise 64.21% of the assembled genome, with transposable elements (TEs) being the major component (Table 1 and Supplementary Tables 6 and 7). Among the TEs, long terminal repeats (LTRs) are the most abundant (Supplementary Table 7). We used a combination of de novo, homology-based and RNA sequence-based predictions to obtain gene models for the A. angustus genome (Supplementary Table 8). In total, we predicted 14,629 protein-coding genes with an average coding-sequence length of 1.31 kb and an average of 4.81 exons per gene (Table 1, Supplementary Fig. 5 and Supplementary Table 8). About 85% of these predicted genes have their best hits on plant sequences from the National Center for Biotechnology Information (NCBI) non-redundant database (Supplementary Fig. 6), and 78.39% were functionally annotated through Swissprot, TrEMBL, Pfam, gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Supplementary Tables 9 and 32). Our annotation captured 89.64% of the 956 genes in the BUSCO plantae dataset16 (85.04% complete gene models plus 4.60% fragmented gene models), compared with 93.51% and 92.15% captured in P. patens14 and M. polymorpha15, respectively (Supplementary Table 10). In addition to protein-coding genes, we also identified 30 known mature microRNAs (miRNAs), 180 novel mature miRNAs, 347 transfer RNAs, 94 ribosomal RNAs and 83 small nuclear RNAs (snRNAs) in the A. angustus genome (Supplementary Table 11). Nine mature miRNA sequences that appear conserved among land plants (miR156/157, miR159/319, miR160, miR165/166, miR170/171, miR408, miR477, miR535 and miR536)17 were also found in A. angustus (Supplementary Table 12).

Comparative genomic analysis
For sequence similarity-based clustering of homologues, we used the predicted proteomes of A. angustus and 18 other green plants with fully-sequenced genomes (that is, 11 other land plants, two charophyte green algae and five chlorophyte green algae; Supplementary Table 13). Genes of A. angustus are distributed among 7,644 gene families that are shared with other plants, and 497 gene families that appear to be unique to A. angustus (Fig. 1a and Supplementary Table 14). In the shared gene families, most A. angustus genes (that is, 9,680) cluster with land plant genes, and only a very small number (that is, 107) specifically cluster with green algae genes (Supplementary Fig. 7). The gene families unique to A. angustus are enriched in various biosynthetic categories (for example, terpenoid and zeatin) and various activity categories (for example, nutrient reservoir activity and catechol oxidase activity) (Supplementary Tables 15 and 16).

Phylogenetic inferences from 85 single-copy nuclear genes sampled for A. angustus and 18 other green plants resolve hornworts (A. angustus), mosses (P. patens) and liverworts (M. polymorpha) as a monophyletic group, with hornworts sister to mosses and liverworts, which agrees with inferences from 852 nuclear genes sampled from 103 plant species1 (Fig. 1b, Supplementary Figs. 8 and 9, Supplementary Table 17 and Supplementary Notes 2.1 and 2.2). The divergence (Supplementary Figs. 10 and 11, Supplementary Tables 18–20 and Supplementary Note S2.3) of the extant crown group of hornworts is estimated at 275.62 million years ago (Ma) (95% highest posterior density, 179.3–384.6 Ma) (middle Carboniferous–early Jurassic) (Supplementary Fig. 11 and Supplementary Table 20), which is comparable to the crown age of hornworts estimated based on two organellar sequences from 77 hornworts and 11 other land plants10. These estimates are thus older than those inferred from the fossil record, considering that the oldest putative hornwort fossil is a spore from the Lower Cretaceous Baqueró Formation, Argentina (from 145 to 100 Ma) that resembles the spores of extant Anthoceros18.

Comparative genomics shows that the genome of A. angustus has lost many gene families (that is, 2,145) and made comparatively modest gains (that is, 497) (Fig. 1b). A similar trend characterizes the genome of Marchantia and of the ancestor common to all bryophytes, whereas P. patens has gained more families (that is, 1,334) than it has lost (that is, 1,248; Fig. 1b). Thus, bryophyte genomes may not only harbour a number of genes and gene families comparable to those of vascular plants and in particular seed plants (Fig. 1b) but may also be highly dynamic through evolutionary time.

Many, if not most, land plants harbour genomic signatures of ancient whole-genome duplication (WGD)19. However, like that of Marchantia15, the genome of Anthoceros lacks evidence of having undergone a WGD (Fig. 1c, Supplementary Fig. 13 and Supplementary Note 3.1), which confirms the hypothesis drawn previously from the analysis of transcriptomic data20. The chromosomal arrangement of genes is poorly conserved among the three bryophyte lineages (Supplementary Fig. 14a,b and Supplementary Note 3.2), which likely reflects the ancient divergence of these different lineages of bryophytes. For example, the longest co-linear block corresponds to a mere five anchor pairs for both A. angustus versus P. patens and A. angustus versus M. polymorpha, whereas within the A. angustus genome, the largest co-linear segment consists of six anchor pairs (Supplementary Fig. 14).

The A. angustus genome contains a much lower percentage of multi-copy gene families than of single-copy gene families, implying low genetic redundancy (Supplementary Table 17), which is similar to what has been observed for the liverwort Marchantia15.

Transcription factors
The A. angustus genome comprises 333 putative transcription factor (TF) genes covering 61 families, a number that is highly similar

Table 1 | Assembly and annotation statistics of the draft genome of A. angustus
Assembly features
 Total length of scaffolds (bp)      119,333,152
 Longest scaffold (bp)               3,809,330
 N50 of scaffold (bp)                1,092,075
 Total length of contigs (bp)        119,122,644
 Longest contig (bp)                 3,254,985
 N50 of contig (bp)                  796,636
 GC ratio (%)                        49.60
Genome annotation
 Number of protein-coding genes      14,629
 Average gene/CDS length (bp)        1,972.11/1,313.24
 Average exon/intron length (bp)     272.63/172.61
 Average exons per gene              4.81
 Average introns per gene            3.81
 Total size of TEs (bp)              72,224,921
 TEs in genome (%)                   60.52
CDS, coding sequence.
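For readers unfamiliar with the N50 statistic reported for the scaffolds and contigs in Table 1, the following is a minimal Python sketch of the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name `n50` and the toy lengths are illustrative assumptions; the individual scaffold lengths of the A. angustus assembly are not given in the text, so the example does not reproduce the published values.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (bp)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # half of the assembly is covered
            return length


# Hypothetical scaffold lengths, for illustration only.
toy_scaffolds = [400, 300, 200, 100]
print(n50(toy_scaffolds))  # 300
```

With the toy input the total is 1,000 bp, and the two longest sequences (400 + 300) are the first to cover at least 500 bp, so the N50 is 300.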

[Fig. 1 graphic. Panel a: Venn diagram of gene families in A. angustus, Setaphyta, Tracheophyta, Charophyta and Chlorophyta. Panel c: KS distributions (x axis, KS, 0–5; y axis, duplication events, 0–700) for P. patens anchors, A. angustus anchors, and the pairwise comparisons A. angustus versus M. polymorpha, A. angustus versus P. patens and M. polymorpha versus P. patens. Panel b: gene-family gain/loss tree; its values are tabulated below.]

b, terminal branches
Clade         Species                      Gain/loss        Gene families  Orphans  Genes
Tracheophyta  Arabidopsis thaliana         +1,078/–875      12,301         4,206    27,411
Tracheophyta  Genlisea aurea               +258/–2,296      10,060         4,434    17,685
Tracheophyta  Vitis vinifera               +593/–962        12,399         2,172    25,676
Tracheophyta  Oryza sativa                 +1,300/–1,170    11,950         4,647    27,912
Tracheophyta  Phalaenopsis equestris       +1,091/–1,271    11,640         9,156    29,431
Tracheophyta  Zostera marina               +581/–2,222      10,470         3,888    20,421
Tracheophyta  Amborella trichopoda         +1,092/–1,430    11,800         8,326    26,846
Tracheophyta  Picea abies                  +1,689/–3,492    8,821          5,748    26,437
Tracheophyta  Selaginella moellendorffii   +1,660/–2,181    9,585          4,384    22,273
Bryophyta     Physcomitrella patens        +1,334/–1,248    9,566          9,722    35,796
Bryophyta     Marchantia polymorpha        +565/–1,101      8,944          6,292    19,287
Bryophyta     Anthoceros angustus          +497/–2,145      8,141          2,997    14,629
Charophyta    Chara braunii                +618/–2,955      6,252          5,220    22,776
Charophyta    Klebsormidium nitens         +809/–969        8,302          3,807    16,044
Chlorophyta   Volvox carteri               +391/–780        7,934          3,978    14,434
Chlorophyta   Chlamydomonas reinhardtii    +639/–293        8,669          5,991    17,741
Chlorophyta   Ulva mutabilis               +467/–1,906      5,363          5,289    12,924
Chlorophyta   Coccomyxa subellipsoidea     +251/–1,156      5,850          2,614    9,629
Chlorophyta   Chlorella variabilis         +245/–1,143      5,857          2,629    9,780

b, internal nodes (boxed gene-family counts, with gain/loss on the branch leading to each node)
Node (descendants)                                   Gain/loss       Gene families
All 19 species (root)                                                6,420
Chlorophyta                                          +1,048/–0       7,468
 C. subellipsoidea + C. variabilis                   +175/–888       6,755
 U. mutabilis + V. carteri + C. reinhardtii          +179/–845       6,802
 V. carteri + C. reinhardtii                         +2,045/–524     8,323
Charophyta + land plants                             +2,042/–0       8,462
 C. braunii + land plants                            +497/–370       8,589
 Land plants                                         +2,017/–296     10,310
 Bryophyta                                           +238/–759       9,789
 P. patens + M. polymorpha                           +138/–447       9,480
 Tracheophyta                                        +376/–580       10,106
 P. abies + angiosperms                              +1,228/–710     10,624
 Angiosperms                                         +1,923/–409     12,138
 Angiosperms except A. trichopoda                    +988/–210       12,916
 Z. marina + P. equestris + O. sativa                +127/–932       12,111
 P. equestris + O. sativa                            +130/–421       11,820
 V. vinifera + G. aurea + A. thaliana                +495/–643       12,768
 G. aurea + A. thaliana                              +43/–713        12,098
Fig. 1 | Comparative genomic analysis of A. angustus and 18 other plant species. a, Comparison of the number of gene families identified by OrthoMCL.
The Venn diagram shows the shared and unique gene families in A. angustus, Setaphyta, Tracheophyta, Charophyta and Chlorophyta. The gene-family number
is listed in each of the components. b, Gene-family gain (+)/loss (−) among 19 green plants. The numbers of gained (blue) and lost (red) gene families are
shown above the branches. The boxed number indicates the gene-family size at each node. The number of gene families, orphans (single-copy gene families)
and number of predicted genes is indicated next to each species. c, Comparison of whole paranome, anchor pair and one-to-one orthologue distribution of the
number of synonymous substitutions per synonymous site (KS) across the three bryophyte species (P. patens, M. polymorpha and A. angustus).
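The gain/loss labels in Fig. 1b follow simple bookkeeping: the number of gene families at a node or tip equals the number at its parent node plus the families gained and minus the families lost along the connecting branch. The short Python sketch below checks this for the bryophyte branches. The gain/loss values and tip sizes are those shown in the figure; the pairing of each branch with its parent node (9,789 for the bryophyte ancestor, 9,480 for the moss plus liverwort ancestor) is a reading of the figure layout and should be treated as an assumption, as should the helper name `descendant_size`.

```python
def descendant_size(parent_size, gained, lost):
    """Gene families at a descendant, given its parent and the branch changes."""
    return parent_size + gained - lost


# branch: (assumed parent node size, gained, lost, size reported in Fig. 1b)
bryophyte_branches = {
    "Anthoceros angustus": (9789, 497, 2145, 8141),
    "moss + liverwort ancestor": (9789, 138, 447, 9480),
    "Physcomitrella patens": (9480, 1334, 1248, 9566),
    "Marchantia polymorpha": (9480, 565, 1101, 8944),
}

for branch, (parent, gained, lost, reported) in bryophyte_branches.items():
    computed = descendant_size(parent, gained, lost)
    status = "consistent" if computed == reported else "mismatch"
    print(f"{branch}: {parent} + {gained} - {lost} = {computed} ({status})")
```

Under that reading all four branches are internally consistent, for example 9,789 + 497 − 2,145 = 8,141 for A. angustus.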

to that of the other two bryophyte genomes (Supplementary Fig. 15, Supplementary Table 21 and Supplementary Note 4.1). The diversity of TF genes in extant plants is rather stable (Supplementary Fig. 15) and resulted from two ancient bursts of TF families during the diversification of green plants: one concomitant with the origin of streptophytes and the other with the transition to land15,21. In plants, genes encoding TFs are among the most highly retained following polyploidy22, a pattern reflected in the comparison of the three bryophyte genomes14,15. A. angustus and M. polymorpha, whose genomes did not undergo WGDs, hold a small number of


TF genes compared with P. patens, which experienced at least one WGD in its ancestry, resulting in a substantially larger number of TF genes (Supplementary Fig. 15). This supports the hypothesis that WGD is an important mechanism for expansion of TF families23.

Phylogenetic analyses of 24 gene families contributing to the development of plant body plans or adaptation to the terrestrial environment, including 16 TF gene families24,25 (Fig. 2a, Supplementary Figs. 16–54, Supplementary Table 22 and Supplementary Note 4.2), confirm that a considerable number of genes, such as genes involved in gametophyte or sporophyte development, haploid–diploid transition, meristem development, filamentous growth, photomorphogenesis and auxin signalling (Fig. 2), composed the genetic toolkit of plants before the conquest of land26. In particular, the TF genes for filamentous growth and auxin signalling arose in charophyte green algae27,28 (Fig. 2b), which are thought to be the closest living relatives to extant land plants, implying the preliminary establishment of a relatively more complex body plan in these basal streptophytes for plant terrestrial adaptation29. Furthermore, a set of genes underlying key morphological innovations for terrestrial adaptation probably evolved along with the colonization of land30,31 (Fig. 2b), such as SMF and ICE for stomatal development (Supplementary Figs. 29 and 30), APB, CLE and CLV1 for 3D growth (Supplementary Figs. 36 and 50–52), and VNS for water-conducting-cell development (Supplementary Fig. 38). The sporophyte morphology of bryophytes is relatively simple, and many of the genes involved in the elaborate regulation of embryogenesis32, such as FUS3, LEC1, LEC2, NF-YA1/9 and NF-YA3/5/6/8, are absent in A. angustus, Marchantia and Physcomitrella (Fig. 2a and Supplementary Figs. 39–41). The ABI3 genes that mainly function in embryo maturation and seed desiccation tolerance in flowering plants are present in bryophytes, and have roles in desiccation tolerance in their vegetative tissues33.

In A. angustus, most genes involved in the development of plant body plans have a single copy, and a few A. angustus TF gene families even lost a subset of duplicates (Fig. 2a and Supplementary Figs. 16–52). For example, in the bHLH family, the class I RSL gene that controls the development of rhizoids and root hairs, thought to have been important for the colonization of land34, is present in the A. angustus genome, whereas the class II RSL genes responsible for regulating protonema differentiation in P. patens or root hair elongation in A. thaliana by auxin35 are absent (Supplementary Fig. 27 and Supplementary Note 4.2). The lack of class II RSL genes in A. angustus might be related to the morphological simplification of this species with respect to tip-growing filamentous structures2. For the KNOX genes from the homeobox gene family, the A. angustus genome retains one class II KNOX gene for haploid-to-diploid morphological transition36, but lacks class I KNOX genes (Supplementary Fig. 23), whose activity is necessary for seta extension in the sporophytes in P. patens37. The absence of this gene might be linked to the absence of setae in hornworts2. The genome of A. angustus also holds few type II MIKCc MADS-box, class B ARF, NCARF and short PIN genes, as a result of gene losses suggested by our phylogenetic analysis (Supplementary Figs. 17, 42, 45 and Supplementary Note 4.2). The class II RSL, class B ARF, NCARF and short PIN genes all have auxin-related functions (Supplementary Note 4.2). Since these auxin-related genes were consistently lost in A. angustus, this hornwort species possesses the simplest auxin molecular toolkit among all land plants investigated so far38. Thus, like the liverwort M. polymorpha15, A. angustus exhibits low redundancy for genes shaping the plant body plan (Fig. 2b). Such a limited toolkit may be characteristic of the ancestor to bryophytes and hence, perhaps, of the earliest land plants with a dominant thalloid gametophyte, and provides a foundation for explaining the architectural simplicity of these plants. By contrast, the genome of P. patens, which develops a leafy stem, has the most TF genes involved in the development of plant body plans among the compared bryophytes (Fig. 2b). Although the genome of A. angustus seems poor in genes composing the network underlying the development of its body plan, the TF gene families linked to responses to terrestrial environmental stimuli exhibit lineage-specific gene expansions in A. angustus, namely, the LISCL genes for mycorrhizal signalling in the GRAS gene family39 (Supplementary Fig. 53) and the clade SIP1 for ABA signalling under water stress in the Trihelix gene family40 (Supplementary Fig. 54).

Gene-family expansion
Besides two TF gene families, the A. angustus genome harbours a variety of other uniquely expanded gene families (Supplementary Fig. 55). The genome comprises a very large number of pentatricopeptide repeat (PPR) genes for plant organellar RNA processing41, accounting for approximately 7.90% of the predicted protein-coding genes. The expanded PPR genes are PLS-class PPR genes (Supplementary Fig. 55, Supplementary Tables 23 and 24 and Supplementary Note 5.1). Most of the PLS-class PPR proteins in A. angustus were predicted to be localized in the mitochondrion or chloroplast (Supplementary Table 24). The expansion of the PLS-class PPR genes correlates with the large number of RNA editing sites estimated in the organellar genomes of A. angustus (Supplementary Table 23). Our findings add further support to the hypothesis that an increase in the number of both RNA editing sites and PPR genes (especially the PLS-class PPR) occurred after the separation of land plants from green algae41,42 (Supplementary Table 23). The reduced number of PPR genes and absence of RNA editing in marchantiid liverworts are most probably secondary losses (Supplementary Table 23), as the organellar RNA editing and plant-specific extensions of PPR genes were also found in jungermanniid liverworts43. Through RNA editing, the PPR proteins could act as ‘repair’ factors that alleviate DNA damage caused by increased UV exposure in terrestrial environments41. Other stress-response gene families have also expanded in A. angustus, such as cupin and cytochrome P450 (CYP) (Supplementary Fig. 55). Two groups of cupin (PF00190) proteins, monocupins and bicupins, can be recognized on the basis of the number of cupin domains44. In A. angustus, the cupin gene family has undergone a significant expansion (Supplementary Table 25) such that it comprises more bicupin genes than any other plant (Fig. 3a, Supplementary Figs. 56 and 57, Supplementary Table 25 and Supplementary Note 5.2). Expansion of the cupin gene family in A. angustus resulted mainly from tandem gene duplications (Fig. 3b,c and Supplementary Note 5.2). Since bicupins (that is, 11S and 7S seed storage proteins) are desiccation-tolerant proteins in higher land plants44, the large number of bicupin genes in A. angustus could indicate adaptation for coping with drought stress in the terrestrial environment. The large number of A. angustus-specific monocupin genes are homologous to the P. patens PpGLP6 gene (XP_001782709.1) (Supplementary Fig. 57 and Supplementary Note 5.2), which encodes a protein with manganese-containing extracellular superoxide dismutase (SOD) activity to respond to oxidative stress in terrestrial environments45. The CYP genes for primary and secondary metabolism have also expanded in A. angustus (Supplementary Fig. 55 and Supplementary Note 5.3). For instance, the subfamilies CYP71 and CYP85 contain 56 and 46 genes, respectively (Supplementary Figs. 58–61 and Supplementary Tables 26 and 27). The A. angustus CYP genes were assigned to 28 KEGG pathways, of which ‘flavonoid 3′-monooxygenase/flavonoid 3′,5′-hydroxylase’ and ‘abscisic acid 8′-hydroxylase’ were the most representative (Supplementary Table 28). Within the CYP71 gene subfamily, genes homologous to flavonoid 3′-hydroxylase (monooxygenase) (F3′H) or flavonoid 3′,5′-hydroxylase (F3′5′H) genes that are involved in flavonoid biosynthesis46 are highly expanded in A. angustus (Supplementary Fig. 59 and Supplementary Note 5.3). Because flavonoids have an important role in UV-B protection46, the expansion of flavonoid biosynthesis related genes in A. angustus might

[Fig. 2 graphic. Panel a: heatmap of TF gene copy number (colour key: 0, 1, 2, 3, ≥4) in ten green plants: A. thaliana, A. trichopoda and S. moellendorffii (vascular plants), P. patens (mosses), M. polymorpha (liverworts), A. angustus (hornworts), K. nitens and C. braunii (charophytes), and C. reinhardtii and V. carteri (chlorophytes). Rows (family, clade): MADS-box (MIKCc, MIKC*, M), TCP (TCP-I, TCP-II), LFY, RWP-RK (RKD), homeobox (BELL, KNOX-II, KNOX-I, WOX, HD-Zip-III), bHLH (RSL-I, RSL-II, LRL, PIF, SMF, ICE), bZIP (HY5), NF-YC (NF-YC2/3/9), B3-ARF (ARF-A, ARF-B, ARF-C), AP2 (euANT/APB), NAC (VNS), B3-LAV (ABI3, FUS3, LEC2), NF-YB (LEC1) and NF-YA (NF-YA1/9, NF-YA3/5/6/8). Function labels: gametophyte/sporophyte development, haploid–diploid transition, meristem development, filamentous growth, photomorphogenesis, auxin signalling, stomatal development, 3D growth, water-conducting cell development and embryogenesis. Panel b: schematic running from the green algal ancestor through charophyte green algae and the move to land to bryophytes and vascular plants, labelled with multicellularity (gametophyte), filamentous growth, auxin signalling, diploid development, photomorphogenesis, multicellular sporophyte, embryogenesis, 3D patterning, stomata, dominant gametophyte, vasculature and dominant sporophyte. Hornworts (A. angustus) and liverworts (M. polymorpha) are each annotated with single-copy TFs, gene loss and simple gametophyte; mosses (P. patens) with multiple-copy TFs and complex gametophyte.]

Fig. 2 | Major TFs for plant body plan and evolutionary innovations within plants. a, Overview of the number of major TFs for plant body plan in ten
green plants. The colour key on the upper left of the heatmap denotes the TF numbers. b, Major innovations in plants and evolutionary features of three
bryophyte lineages.


a
Species                      No. of bicupins  No. of monocupins  Total
Klebsormidium nitens         9                15                 24
Anthoceros angustus          31               48                 79
Marchantia polymorpha        1                104                105
Physcomitrella patens        0                20                 20
Selaginella moellendorffii   13               50                 63
Picea abies                  15               31                 46
Amborella trichopoda         8                36                 44
Arabidopsis thaliana         10               33                 43
Oryza sativa                 21               43                 64
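The counts in Fig. 3a can be cross-checked in a few lines: each total should equal bicupins plus monocupins, and the claim that A. angustus has more bicupin genes than any other sampled plant can be read straight from the data. The Python sketch below uses only the nine rows of panel a; the variable names are illustrative.

```python
# species: (bicupins, monocupins, reported total), copied from Fig. 3a
cupins = {
    "Klebsormidium nitens": (9, 15, 24),
    "Anthoceros angustus": (31, 48, 79),
    "Marchantia polymorpha": (1, 104, 105),
    "Physcomitrella patens": (0, 20, 20),
    "Selaginella moellendorffii": (13, 50, 63),
    "Picea abies": (15, 31, 46),
    "Amborella trichopoda": (8, 36, 44),
    "Arabidopsis thaliana": (10, 33, 43),
    "Oryza sativa": (21, 43, 64),
}

# Every reported total should be the sum of the two classes.
totals_consistent = all(b + m == total for b, m, total in cupins.values())

# Species with the most bicupins and with the largest cupin family overall.
most_bicupins = max(cupins, key=lambda species: cupins[species][0])
largest_family = max(cupins, key=lambda species: cupins[species][2])

print(totals_consistent)  # True
print(most_bicupins)      # Anthoceros angustus
print(largest_family)     # Marchantia polymorpha
```

The check separates two statements that are easy to conflate: A. angustus leads in bicupins (31), whereas the largest cupin family overall in this sample belongs to M. polymorpha (105, almost all monocupins).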

[Fig. 3b,c graphic: phylogenetic trees of bicupins (b) and monocupins (c) with scale bars of 0.6 and 0.5, and tandemly duplicated A. angustus genes drawn at their positions (kb) on Scaffold6, Scaffold13, Scaffold15, Scaffold22, Scaffold24, Scaffold34, Scaffold39, Scaffold43, Scaffold44, Scaffold64, Scaffold93, Scaffold95 and Scaffold108. Genes shown: AANG001908, AANG001910, AANG001913, AANG001914, AANG001916, AANG001917, AANG004394, AANG004395, AANG005009, AANG005010, AANG005011, AANG006111, AANG006112, AANG006497, AANG006501, AANG008040, AANG008042, AANG008775, AANG008776, AANG008777, AANG009253, AANG009255, AANG009258, AANG009264, AANG009265, AANG009342, AANG009343, AANG011559, AANG011560, AANG011562, AANG011563, AANG011564, AANG013184, AANG013185, AANG013297, AANG013300, AANG013304, AANG013306, AANG013591 and AANG013594.]

Fig. 3 | Expansion of cupin gene family in A. angustus. a, A summary of the number of cupin genes from nine species based on a Pfam search of cupin_1
domain (PF00190). b,c, Phylogenetic trees show cupin genes in nine plant genomes: bicupins (b) and monocupins (c). The colour of each branch
corresponds to the background colour for each species in a. The tandem duplicated gene clusters are ordered and shown on scaffolds of the A. angustus
genome. The scale bars in the trees show the number of amino acid substitutions per site.
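To make the notion of a tandem duplicated gene cluster concrete, here is a minimal Python sketch of one common way to call such clusters from gene order. It is not the procedure used in the paper, whose criteria are described in Supplementary Note 5.4 and are not reproduced here. The rule applied below (same gene family, same scaffold, at most one intervening gene) is an assumption chosen for illustration, and every gene name, scaffold name and position in the example is invented.

```python
from collections import defaultdict


def tandem_clusters(genes, max_intervening=1):
    """Group same-family genes that sit close together on one scaffold.

    genes: iterable of (scaffold, rank, gene_id, family), where rank is the
    gene's ordinal position along the scaffold.
    Returns a list of clusters, each a list of gene IDs in scaffold order.
    """
    by_scaffold_family = defaultdict(list)
    for scaffold, rank, gene_id, family in genes:
        by_scaffold_family[(scaffold, family)].append((rank, gene_id))

    clusters = []
    for members in by_scaffold_family.values():
        members.sort()
        current = [members[0]]
        for previous, following in zip(members, members[1:]):
            # Number of other genes lying between the two family members.
            if following[0] - previous[0] - 1 <= max_intervening:
                current.append(following)
            else:
                if len(current) > 1:
                    clusters.append([gene_id for _, gene_id in current])
                current = [following]
        if len(current) > 1:
            clusters.append([gene_id for _, gene_id in current])
    return clusters


# Hypothetical gene order, for illustration only.
toy_genes = [
    ("scaffoldA", 1, "g1", "cupin"),
    ("scaffoldA", 2, "g2", "cupin"),
    ("scaffoldA", 3, "g3", "other"),
    ("scaffoldA", 4, "g4", "cupin"),
    ("scaffoldA", 9, "g5", "cupin"),
    ("scaffoldB", 1, "g6", "cupin"),
]
print(tandem_clusters(toy_genes))  # [['g1', 'g2', 'g4']]
```

In the toy input, g4 joins the cluster because only one unrelated gene separates it from g2, while g5 and g6 remain singletons. Tightening the rule to `max_intervening=0` would shrink the cluster to g1 and g2, which shows how sensitive a reported tandem fraction can be to the chosen threshold.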

again represent a molecular adaptation to life in the terrestrial environment. Among the CYP85 genes, the genes homologous to abscisic acid 8′-hydroxylase genes involved in abscisic acid catabolism during drought stress response47 are also uniquely abundant in A. angustus (Supplementary Fig. 60 and Supplementary Note 5.3), and may account for the high desiccation tolerance of A. angustus. Like the cupin gene family, many of the above expanded gene families occur in tandem arrays (Supplementary Table 29). At least 9.82% of protein-coding genes in A. angustus form ‘tandem’ clusters in the genome (Supplementary Table 30 and Supplementary Note 5.4), compared with only 1% in P. patens14 and 5.9% in M. polymorpha15.

CO2-concentrating mechanism
Hornworts are the only extant land plant lineage harbouring a pyrenoid-based CO2-concentrating mechanism (CCM) similar to that of green algae9,48 (Supplementary Note 6.1), for which the key components have been identified49. To clarify whether the CCM components of green algae have orthologues in hornworts and other land plants, we searched the A. angustus genome and other plant genomes or transcriptomes with reference to the CCM genes from the chlorophyte green alga Chlamydomonas reinhardtii49,50 (Supplementary Figs. 62–71 and Supplementary Note 6.2). A. angustus and all other green plants harbour orthologues of CAH1/2, whose expression is modulated by external inorganic carbon concentration; of CemA, which maintains stromal pH balance; of LCI11, which mediates the entry of HCO3− into the thylakoid lumen; and of RCA1 and RBCS1/2, which regulate CO2 fixation by Rubisco (Supplementary Figs. 62, 65 and 69–72). By contrast, orthologues of CCP1/2, which mediate the entry of HCO3− into the chloroplast stroma, and of EPYC1, which regulates CO2 fixation by Rubisco, were only present in chlorophyte green algae (Supplementary Figs. 67 and 72 and Supplementary Note 6.2). The three inorganic carbon transporters (HLA3, LCI1 and LCIA-like genes) only occur in bryophytes and green algae, whereas the A. angustus genome lacks the related orthologues (Supplementary Figs. 63, 66 and 72 and Supplementary Note 6.2). Unexpectedly, the three kinds of carbonic anhydrases (CAH3, CAH9 and LCIB/C), which are essential components of CCM, are conserved in non-angiosperm land

[Fig. 4 graphic, panels a–d: maximum-likelihood trees placing the A. angustus sequences AANG004679 (a), AANG003233, AANG012156 and AANG003219 (b), AANG011893 (c) and AANG008038 (d) among bacterial, fungal, archaeal, algal, metazoan, stramenopile and other bryophyte homologues.]

Fig. 4 | Phylogenetic affinities of genes horizontally transferred to A. angustus. a, Phylogenetic tree of glyoxalase (PF13468). b, Phylogenetic tree of
NAD-binding dehydrogenase (PF08635). c, Phylogenetic tree of glucuronyl hydrolase (PF07470). d, Phylogenetic tree of DNA methyltransferase
(PF02870 and PF01035). The stars indicate that the Anthoceros sequence or bryophyte sequences formed a monophyletic clade with homologues of
the putative HGT donor, reflecting Anthoceros-specific or bryophyte-specific HGT events. Maximum-likelihood bootstrap support values ≥50% are shown
above the branches. Red, hornworts and other bryophytes; cyan, green algae; grey, metazoans; orange, stramenopiles; blue, bacteria; yellow, fungi; purple,
archaea. Homologues from kingdoms other than that of the putative HGT donor are used as the outgroup. The scale bars in the trees show
the number of amino acid substitutions per site.

plants and green algae (Supplementary Figs. 62, 64, 68 and 72). The A. angustus genome retains the orthologues of both LCIB/C and CAH3 genes, but has no copy of CAH9 (Supplementary Fig. 72). Besides green algae, the essential CCM components occur in both hornworts and other non-angiosperm land plants that lack pyrenoids (Supplementary Fig. 72). This implies that the CCM could be an ancestral mechanism of CO2 fixation by plants, and that pyrenoids for CCM are homologous between hornworts and green algae, whereas both CCM components and pyrenoids have undergone multiple losses in land plants in response to atmospheric changes in terrestrial environments10,48.

Horizontal gene transfer
Horizontal gene transfer (HGT) from bacteria or fungi has been reported for both the moss P. patens51 and the liverwort M. polymorpha15. Consistent with those observations, the taxonomic distribution of BLASTP hits, followed by careful phylogenetic analysis and manual inspection, suggested that 19 genes from 14 families originated from HGTs from either bacteria or fungi (Supplementary Fig. 6 and Supplementary Note 7.1). Nine families were acquired from bacterial donors: Actinobacteria (three gene families), Alphaproteobacteria (two gene families), Bacteroidetes (two gene families), Firmicutes (one gene family) and Verrucomicrobia
(one gene family). Five families were acquired from fungi, belonging to Ascomycota, Basidiomycota, hornwort-symbiotic Chytridiomycota or Mucoromycota13 (Fig. 4a,b, Supplementary Figs. 73–84 and 86 and Supplementary Table 31). The detection of specific HGT in all three fully sequenced bryophytes is remarkable, and is probably related to the fact that these organisms form symbioses with diverse bacteria and fungi, which, together with the weakly protected tissues in the early developmental stages in the life cycle of these plants, provide the possibility for HGT51. In addition, we found that two families originating from HGT from bacteria are shared by the three bryophyte lineages, and one originating from an HGT from fungi is shared between hornworts and liverworts only (Fig. 4c,d, Supplementary Figs. 85 and 86, Supplementary Table 31 and Supplementary Note 7.2). The HGT genes mentioned above (SCUO value 0.2127) exhibit a significantly more biased codon-usage pattern than non-HGT genes (SCUO value 0.1595) (Supplementary Fig. 87a), which may be linked to their higher GC content (57.58%) compared with non-HGT genes (53.26%) (Supplementary Fig. 87b).
The HGT-derived genes in A. angustus mainly contribute to metabolic processes, oxidation–reduction and stress response (Supplementary Table 31). Some transferred genes related to carbohydrate metabolism are predicted to encode glucuronyl (AANG011893) and glycosyl hydrolases (AANG004297) (Fig. 4c, Supplementary Fig. 79 and Supplementary Table 31), which function in cell wall synthesis and modification and might extend the metabolic flexibility of A. angustus in changing environments52. The Alphaproteobacteria-derived gene AANG004679 encodes glyoxalase, which is related to drought stress tolerance53 (Fig. 4a). The Actinobacteria-derived DNA methyltransferase genes that are present only in the three groups of bryophytes are related to DNA repair54 (Fig. 4d). The hornworts and liverworts share the fungi-derived terpene synthase-like (MTPSL) genes (Supplementary Fig. 85). Terpene synthases are pivotal enzymes for the biosynthesis of terpenoids, which serve as chemical defences against herbivores and pathogens55. Some horizontally transferred genes in A. angustus, such as NAD-binding dehydrogenase (Fig. 4b) and MTPSL genes (Supplementary Fig. 85), underwent subsequent gene duplications. The results suggest that the acquisition of foreign genes might have provided additional means for environmental adaptation during evolution of the hornwort lineage.

Conclusions
As land pioneers, the three bryophyte groups form a well-supported monophyletic lineage, with hornworts sister to liverworts and mosses. The genome of the hornwort A. angustus shows no evidence of WGDs and low genetic redundancy for networks underlying plant body plan, which may be congruent with an overall simple body plan. Hornworts have retained the essential components of CCM found in green algae in response to the atmospheric changes in terrestrial environments. Meanwhile, the gene inventory in A. angustus expanded mainly through tandem duplication and HGT. In particular, the expansion of specific gene families and the acquisition of foreign genes have provided additional metabolic abilities in hornworts that probably facilitated their survival in a terrestrial environment. Together, our results indicate how the draft genome of A. angustus provides a useful model for studying early land plant evolution and the mechanism of plant terrestrial adaptation.

Methods
Sample preparation and sequencing. The natural populations of A. angustus Steph. were collected from Jinping County, Yunnan Province, China. The voucher specimen has been deposited at the herbarium, Institute of Botany, Chinese Academy of Sciences, Beijing, China with collection number W1879-2010-01-18. The sporophytes of A. angustus were detached from the gametophytes, sterilized in 10% sodium hypochlorite and subsequently rinsed with distilled water56. The sporangium was opened and the spores were homogenized and spread onto 1/2 KnopII agar medium57 in Petri dishes (Supplementary Fig. 1b). The culture temperature was between 21 °C and 25 °C. Spores germinated within a couple of days, and then the sporelings started to grow. After approximately three to four weeks, the gametophyte started to grow (Supplementary Fig. 1c,d). Since spores are aposymbiotic, we did not observe colonization of mucilage-filled cavities by cyanobacteria on the A. angustus gametophyte during sterile culture. A gametophyte from a single spore was selected and cultured by asexual propagation. Tissue obtained from subculture was used for genome and RNA sequencing. We tried to induce sexual reproduction by dropping the growth temperature of gametophyte cultures to 10 °C and 16 °C, respectively; however, they have not yet produced reproductive organs. Therefore, the sequenced A. angustus is indeed a single-sex individual, which was sequenced at the gametophyte phase of its life cycle.
Genomic DNA was isolated using the Plant DNAzol reagent for genomic DNA extraction (Life Technologies) according to the manufacturer’s protocols. For whole-genome shotgun sequencing, ten sequencing libraries with insert sizes ranging from 170 bp to 40 kb were generated (Supplementary Table 1). Sequencing libraries were constructed using a library construction kit (Illumina). All libraries were sequenced on the Illumina HiSeq 2000 platform. Raw sequencing reads were trimmed with Trimmomatic (v.0.33)58. Only high-quality reads with a total length of 126,532,381,412 bp were used for further analysis (Supplementary Table 1). For Oxford Nanopore sequencing, we constructed a genomic DNA library using the ONT 1D ligation sequencing kit (SQK-LSK108) according to the manufacturer’s instructions. The sequencing used a single 1D flow cell on a PromethION sequencer (Oxford Nanopore Technologies). A total of 63,614,292,295 bp of raw reads were generated, of which 36,070,452,175 bp were retained for further analysis after filtering and trimming (Supplementary Table 3).
Total RNA was extracted using the PureLink Plant RNA reagent (Life Technologies) and further purified using TRIzol reagent (Invitrogen). For transcriptome sequencing (RNA sequencing), libraries with insert sizes ranging from 200 bp to 500 bp were constructed using the mRNA-Seq Prep Kit (Illumina) and then sequenced using the Illumina HiSeq 2000 platform. For small-RNA sequencing, the library was generated from an RNA sample using the TruSeq Small RNA Preparation kit (Illumina) and sequenced on the Illumina HiSeq 2500 platform.

Decontamination. The GC content versus k-mer frequency distribution pattern of the Illumina raw reads (Supplementary Table 1) after trimming presented two large groups: one group with a low k-mer frequency (<50) and a wide GC content distribution range (median of 0.7), and the other group with a high k-mer frequency (60–165) and a concentrated GC content distribution range (median of 0.5) (Supplementary Fig. 2a). The BLASTN results against the NCBI nucleotide database revealed that the former sequences were mainly from a variety of bacteria and the latter were the real genome sequences of A. angustus. We also investigated the k-mer distributions of the raw reads from the other two published hornwort genomic sequences, A. agrestis (accession: ERX714368)59 and Anthoceros punctatus (accession: SRX538621)60, and found a distribution pattern similar to that of A. angustus, containing two groups, one for the contaminant sequences and the other for sequences of the plant itself (Supplementary Fig. 2c,d). Because external bacterial contamination from the laboratory causes A. angustus to turn yellow and die during culturing, and all three Anthoceros species grown in axenic culture still show the same bacterial contamination (Supplementary Fig. 2a,c,d), we infer that these contaminants are symbiotic bacteria of Anthoceros that might accompany spores hidden in the sterilized sporangium. Furthermore, we performed DAPI staining61 to investigate the distribution of symbiotic bacteria in A. angustus. The gametophytes were stained with 0.2 mg l^−1 DAPI (4′,6-diamidino-2-phenylindole dihydrochloride; Sigma, cat. no. D9564) for five minutes. The stained gametophytes were washed three times, and then observed using confocal microscopy. Bacterial micro-colonies were observed on the outer surface, as well as in the intercellular space, of the gametophytes of A. angustus (Supplementary Fig. 3). Based on the GC content versus k-mer frequency distribution pattern of the Illumina raw reads and the result of the DAPI staining, we inferred that some bacterial sequences remained in the genome sequencing data of A. angustus. To remove them, we performed a series of decontamination steps. After computing the k-mer frequency distribution, we chose the high-abundance k-mer depth range (60–165) and retained the corresponding reads for further analysis. This treatment yielded filtered reads with a total length of 17,099,027,576 bp (Supplementary Table 2). The distribution pattern of GC content versus k-mer frequency of the A. angustus filtered reads is depicted in Supplementary Fig. 2b, which shows a single group with a sequencing depth of approximately 150×. Furthermore, we performed error correction of the filtered Nanopore reads with the decontaminated Illumina reads using NextDenovo (v.2.0)62, resulting in 9,247,957,448 bp of corrected reads (Supplementary Table 3). Through MEGABLAST against the NCBI nucleotide database, we further removed 5,463,972,682 bp of prokaryotic or organellar sequences, and finally obtained 3,783,984,766 bp of clean reads with a sequencing depth of approximately 35× (Supplementary Table 3). In total, approximately 185× coverage was obtained.
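
The k-mer-depth filter described under Decontamination (retaining reads whose k-mers fall in the high-abundance range, 60–165 in the paper) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the value of k, the toy reads and the rule of summarising each read by its median k-mer depth are all assumptions made for illustration, since the source does not state how reads were assigned to a depth range.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def filter_reads_by_kmer_depth(reads, k, min_depth, max_depth):
    """Keep reads whose median k-mer depth lies in [min_depth, max_depth]."""
    # Count every k-mer across the whole read set.
    depth = Counter()
    for read in reads:
        depth.update(kmers(read, k))

    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:  # read shorter than k: no depth information
            continue
        # Assumed rule (not from the paper): summarise a read by its median k-mer depth.
        if min_depth <= median(depth[km] for km in read_kmers) <= max_depth:
            kept.append(read)
    return kept


# Toy data: five copies of a "genome" read and one low-depth "contaminant" read.
toy_reads = ["ACGTACGGTCA"] * 5 + ["TTTTGGGCCCA"]
kept_reads = filter_reads_by_kmer_depth(toy_reads, k=4, min_depth=3, max_depth=10)
```

In this toy set the contaminant read shares no 4-mers with the genome reads, so its k-mers all have depth 1 and it is dropped, mirroring the low-frequency bacterial group described in the text. Real data would call for a dedicated k-mer counter and canonical (strand-independent) k-mers.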

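The read totals reported under Decontamination can be cross-checked with a few lines of arithmetic. The sketch below uses only figures given in the text, together with the formula and k-mer values stated under Genome size estimation (total k-mers 14,092,039,150; peak depth 132). Two points are assumptions: that depth is computed against the k-mer-based haploid genome size (the source does not say which genome size was used), and the function names, which are illustrative.

```python
def genome_size_from_kmers(total_kmers, peak_depth):
    """Genome size = total number of k-mers / expected (peak) k-mer depth."""
    return round(total_kmers / peak_depth)


def sequencing_depth(total_bases, genome_size):
    """Average depth = sequenced bases / genome size."""
    return total_bases / genome_size


# Nanopore bookkeeping: corrected reads minus removed prokaryotic/organellar sequence.
corrected_nanopore_bp = 9_247_957_448
removed_bp = 5_463_972_682
clean_nanopore_bp = corrected_nanopore_bp - removed_bp

# k-mer-based haploid genome size, from the values in "Genome size estimation".
genome_size_bp = genome_size_from_kmers(14_092_039_150, 132)

# Assumption: depth is measured against the k-mer-based genome size estimate.
nanopore_depth = sequencing_depth(clean_nanopore_bp, genome_size_bp)
print(clean_nanopore_bp, genome_size_bp, round(nanopore_depth, 1))
```

The subtraction reproduces the 3,783,984,766 figure exactly, which indicates that this value is a base count rather than a number of reads, and the resulting depth rounds to the reported 35×.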
Genome size estimation. To estimate the genome size of A. angustus, we used clean Illumina reads to calculate the k-mer distribution. According to the Lander–Waterman theory63, the genome size can be determined by dividing the total number of k-mers by the peak value of the k-mer distribution. Because we sequenced the haploid gametophyte of A. angustus, only one peak was found in the k-mer distribution. The total number of k-mers was 14,092,039,150, and the peak was at a depth of 132 (Supplementary Fig. 4). The peak was used as the expected k-mer depth and substituted into the formula genome size = total number of k-mers / expected k-mer depth, giving an estimated haploid genome size of 106,757,872 bp (Supplementary Fig. 4).

Genome assembly and assessment. The clean Nanopore reads after filtering and decontamination were assembled with wtdbg (v.1.2.8). After finishing the pre-assembly (148 Mb), iterative polishing was conducted using Pilon (v.1.22)64, in which clean Illumina reads were aligned to the pre-assembled contigs. The pre-assembled contig sequences were searched against the NCBI nucleotide database using MEGABLAST to further remove prokaryotic sequences or organellar DNA. A total of approximately 29 Mb of data were removed. Further, we combined the final pre-assembled contig sequences from Nanopore sequencing and clean paired-read data from Illumina sequencing into scaffolds using the SSPACE (v.3.0)65 tool (Supplementary Table 4). Genome assembly completeness was assessed with BUSCO (v.3)16 against the plantae database of 956 single-copy orthologues, with a BLAST threshold E-value of 1 × 10^−5 (Supplementary Table 10).

Transcriptome assembly and mapping. We used Trimmomatic58 to remove adaptors from the raw reads of transcriptome sequences and filter out low-quality reads before assembly. The resulting high-quality reads were de novo assembled and annotated using Trinity (v.2.5.1)66. For genes with more than one transcript, the longest transcript was chosen as the unigene and used to predict open reading frames (ORFs) using TransDecoder (v.5.0.2) (https://github.com/TransDecoder/TransDecoder/wiki). Finally, we obtained 39,044 unigenes, 26,805 of which had predicted ORFs. To further validate the genome assembly, the transcriptome was compared to the reference assembly using BLASTN, with an E-value <1 × 10^−5. Of the 26,805 transcripts (>200 bp), 97.66% were successfully mapped back to the final assembled genome (Supplementary Table 5).

Repeat prediction. Tandem Repeats Finder (v.4.09)67 was used to search for tandem repeats in the A. angustus genome. Both homology-based and de novo approaches were used to search for TEs. In the homology-based approach, we used RepeatMasker (v.4.1.0)67 and RepeatProteinMask68 with the Repbase69 database of known repeat sequences to search for the TEs in the A. angustus genome. In the de novo approach, we used LTR_FINDER (v.1.0.2)70, PILER (v.1.3.4)71 and RepeatModeler (v.1.0.3)72 to construct a de novo repeat sequence database for A. angustus and then used RepeatMasker to search for repeats in the genome. All the repeats identified by the different methods were combined into the final repeat annotation after removing redundant repeats. The predicted repeats covered 64.21% of the genome sequence (Supplementary Table 6). The categories of predicted TEs in the A. angustus genome are summarized in Supplementary Table 7.

Genome annotation. To predict protein-coding genes, three approaches were used: (1) de novo gene prediction, (2) homology-based prediction, and (3) RNA-sequencing annotation. For de novo prediction, AUGUSTUS (v.2.5.5)73 and GlimmerHMM (v.3.0.1)74 were applied to predict genes. For homology-based prediction, we mapped the protein sequences of five published green plant genomes (Arabidopsis thaliana, Selaginella moellendorffii, P. patens, M. polymorpha and Klebsormidium nitens) onto the A. angustus genome using TBLASTN, with a threshold E-value of 1 × 10^−5, and then used GeneWise (v.2.4.1)75 to predict gene structures. The de novo set and the five homologue-based results were integrated into a consensus gene set using MAKER (v.1.0)76 (Supplementary Table 8). To supplement and improve the gene set, we aligned the RNA-sequencing data to the genome using TopHat (v.2.1.1)77, and the alignments were used as input for Cufflinks (v.2.2.1)78 with default parameters. We manually combined the MAKER gene set and ORFs of transcripts to form the final gene set, which contains 14,629 genes (Supplementary Table 8).
The A. angustus predicted genes were aligned against the sequences in the NCBI non-redundant protein database using BLASTP79 (E-value <1 × 10^−5). The sources of A. angustus genes were classified according to the NCBI taxonomy categories of the best BLAST hits (Supplementary Fig. 6). Functional annotation of these predicted genes was obtained by aligning their protein sequences against the sequences in public protein databases using BLASTP79 (E-value <1 × 10^−5, identity >30% and coverage >70%, excluding annotations only characterized as hypothetical or predicted protein), including SwissProt80, TrEMBL80, Pfam81, GO82 and KEGG83 (Supplementary Tables 9 and 32).

Identification of non-coding RNA genes. To obtain a reliable profile of A. angustus miRNAs, we searched small-RNA sequencing reads that mapped to the A. angustus draft genome against miRNA sequences of A. thaliana, Oryza sativa, S. moellendorffii, P. patens and C. reinhardtii from miRBase (http://www.mirbase.org/) to predict the known miRNAs. The mapped reads were also used to identify novel miRNAs using miREvo (v.1.2)84 software. The tRNA genes were searched by tRNAscan-SE (v.1.3.1)85. The rRNA genes were predicted by aligning plant rRNA sequences from NCBI (A. thaliana and Anthoceros agrestis) to the A. angustus genome by BLASTN. The snRNA genes were predicted using INFERNAL (v.1.1)86 to search the Rfam database.

Gene-family identification. To construct the dataset for gene-family clustering, the protein-coding genes from the genomes of A. angustus and 18 other green plants were used, including those of seven angiosperms (A. thaliana, Genlisea aurea, Vitis vinifera, O. sativa, Phalaenopsis equestris, Zostera marina and Amborella trichopoda), one gymnosperm (Picea abies), one lycophyte (S. moellendorffii), two bryophytes (moss P. patens and liverwort M. polymorpha), two charophytes (Chara braunii and K. nitens) and five chlorophytes (Volvox carteri, Chlamydomonas reinhardtii, Ulva mutabilis, Coccomyxa subellipsoidea and Chlorella variabilis) (Supplementary Table 13). We chose the longest transcript to represent each gene and removed mitochondrial and chloroplast genes. After performing an all-against-all BLASTP search with a threshold E-value of 1 × 10^−5, identity >30% and coverage >30%, orthogroups or putative gene families or subfamilies were identified using OrthoMCL (v.2.0)87, on the basis of a collection of 397,132 predicted protein-coding genes from the above 19 Viridiplantae genomes. A 5-way comparison of A. angustus, Setaphyta (M. polymorpha and P. patens), Tracheophyta (vascular plants) (A. thaliana, V. vinifera, O. sativa, Z. marina, P. equestris, A. trichopoda, P. abies, G. aurea and S. moellendorffii), Charophyta (C. braunii and K. nitens) and Chlorophyta (V. carteri, C. reinhardtii, U. mutabilis, C. subellipsoidea and C. variabilis) is shown in Fig. 1a. For A. angustus-specific gene families, we conducted GO and KEGG enrichment analyses via an enrichment pipeline (https://sourceforge.net/projects/enrichmentpipeline/).

Phylogenomics. We extracted 85 single-copy gene families shared by 19 Viridiplantae for phylogenomic analysis (Supplementary Note 2.1). The amino acid sequences of each single-copy gene family were aligned with MAFFT (v.7)88, and the nucleotide alignments were generated separately with TranslatorX (v.0.9)89 on the basis of the corresponding amino acid translation. The amino acid data, the complete nucleotide data, the first and second codon positions, and the third codon positions were each concatenated into super-matrices. These data matrices were used for maximum likelihood phylogenetic analyses by RAxML (v.7.2.3)90 with the GTR + Γ and JTT models for nucleotide and amino acid data, respectively. For each analysis, the bootstrap support was estimated based on 300 pseudoreplicates using a GTR + CAT approximation. To estimate the degree of substitutional saturation for the four concatenated datasets mentioned above (Supplementary Note 2.2), we plotted the uncorrected p-distances against the inferred distances using the method described by Forterre and Philippe91. The level of saturation was estimated by computing the slope of the regression line in the plot; the shallower the slope, the greater the degree of saturation. The maximum composite likelihood method was used to calculate the inferred distances for nucleotide data and Poisson correction was used to calculate the inferred distances for the amino acid data.
To improve the taxon sampling in bryophytes for divergence time estimation, the transcriptome sequences of 22 other bryophytes were downloaded from the 1KP database92 (http://www.onekp.com/public_data.html) and used in subsequent analyses (Supplementary Table 18 and Supplementary Note 2.3). The divergence time was estimated using the MCMCTree program in the PAML package (v.4.7)93 under the nucleotide general time reversible (GTR) substitution model and with the independent rate model as the molecular clock model. The Markov chain Monte Carlo (MCMC) process consisted of 500,000 burn-in iterations and 1,500,000 sampling iterations (1 sample per 150 iterations). The analysis was run twice with the same parameters to obtain a stable result. We applied nine node constraints in the age estimate (Supplementary Fig. 10). The minimum and maximum constraints for each node are shown in Supplementary Table 19.
Gene-family sizes were inferred from the gene-family profile obtained by the program OrthoMCL. The minimum ancestral gene families were estimated using the DOLLOP program included in the PHYLIP package (v.3.695)94 to determine gene-family gains and losses. There are 8,141 gene families in the A. angustus genome, 8,944 in M. polymorpha and 9,566 in P. patens, and 9,789 ancestral families in the ancestral bryophyte lineage (Fig. 1b).

KS distribution and co-linearity analysis. All KS distributions were constructed using wgd (v.3.0)95 with default settings. The M. polymorpha and P. patens genome data were acquired from the PLAZA resource96. Pairwise co-linearity analyses within and between A. angustus, M. polymorpha and P. patens were conducted using I-ADHoRe (v.3.0)97 with the following parameter settings: gap_size = 30, cluster_gap = 35, q_value = 0.75, prob_cutoff = 0.01, anchor_points = 3, alignment_method = ‘gg2’, level_2_only = ‘false’, table_type = ‘family’ and multiple_hypothesis_correction = ‘FDR’. Within-genome co-linearity analyses were based on the paralogous families inferred with wgd, whereas the between-genome co-linearity analyses were conducted using gene families inferred with OrthoFinder using default settings.

Analysis of TFs. We used the genome-wide TF prediction program iTAK (v.1.7)98 (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi) with default parameters to
preliminarily identify TFs in the above 19 Viridiplantae (Supplementary Tables 13 4. Christenhusz, M. J. M. & Byng, J. W. The number of known plants species
and 22). The reconstruction of the ancestral state for the individual TF family was in the world and its annual increase. Phytotaxa 261, 201–217 (2016).
performed using Mesquite (v.3.51)99 (http://mesquiteproject.org/), and the most 5. Qiu, Y. L. et al. The deepest divergences in land plants inferred
parsimonious assumption was taken. from phylogenomic evidence. Proc. Natl Acad. Sci. USA 103,
15511–15516 (2006).
Phylogenetic analysis of gene families. Generally, HMMER search100 with a 6. Wickett, N. J. et al. Phylotranscriptomic analysis of the origin and
domain profile or BLAST search using known protein sequences from other plants early diversification of land plants. Proc. Natl Acad. Sci. USA 111,
as queries was performed to retrieve the sequences from the A. angustus genome E4859–E4868 (2014).
(Supplementary Notes 4–6). The results of TF prediction by iTAK98 were used as 7. Cox, C. J., Li, B., Foster, P. G., Embley, T. M. & Civan, P. Conflicting
references. Multiple sequence alignments were performed using the MAFFT88 phylogenies for early land plants are caused by composition biases among
program (https://mafft.cbrc.jp/alignment/software/). The maximum-likelihood synonymous substitutions. Syst. Biol. 63, 272–279 (2014).
phylogenetic trees were implemented with RAxML-HPC2 on XSEDE101 through 8. Liu, Y., Cox, C. J., Wang, W. & Goffinet, B. Mitochondrial phylogenomics of
the CIPRES Science Gateway (v.3.3) (https://www.phylo.org/), estimating branch early land plants: mitigating the effects of saturation, compositional
support values by bootstrap iterations with 1,000 replicates. heterogeneity, and codon-usage bias. Syst. Biol. 63, 862–878 (2014).
9. Villarreal, J. C. & Renzaglia, K. S. The hornworts: important advancements in early land plant evolution. J. Bryol. 37, 157–170 (2015).
10. Villarreal, J. C. & Renner, S. S. Hornwort pyrenoids, carbon-concentrating structures, evolved and were lost at least five times during the last 100 million years. Proc. Natl Acad. Sci. USA 109, 18873–18878 (2012).
11. Renzaglia, K. S., Villarreal, J. C. & Duff, R. J. in Bryophyte Biology (eds Goffinet, B. & Shaw, J.) 139–171 (Cambridge Univ. Press, 2009).
12. Adams, D. G. & Duggan, P. S. Cyanobacteria–bryophyte symbioses. J. Exp. Bot. 59, 1047–1058 (2008).
13. Desirò, A., Duckett, J. G., Pressel, S., Villarreal, J. C. & Bidartondo, M. I. Fungal symbioses in hornworts: a chequered history. Proc. R. Soc. B 280, 1759 (2013).
14. Rensing, S. A. et al. The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319, 64–69 (2008).

Gene-family expansion identification. To understand gene-family expansion or contraction in A. angustus compared with that in 18 other green plants, the mean gene-family size was calculated for all gene families (excluding orphans and species-specific families). The number of genes per species for each family was transformed into a matrix of z-scores to centre and normalize the data. The first 100 families with the largest gene-family size in A. angustus were selected (Supplementary Fig. 55). The clustering and visualization were performed using Genesis (v.3.0)102. The functional annotation of each family was predicted on the basis of sequence similarity to entries in the Pfam protein domain database, where more than 30% of proteins in the family share the same protein domain. Transposon-derived gene families were removed because the distribution of such families is likely to be a consequence of the gene models derived from a repeat-masked genome sequence and therefore may be artefactual103.
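
The normalization and selection steps of the gene-family expansion analysis can be illustrated with a short sketch. It computes z-scores from a family-by-species count table, ranks families by their size in the focal species, and applies the "more than 30% of proteins share a domain" rule. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the toy counts and the placeholder family and domain labels are all hypothetical, and z-scores are assumed to be computed per family across species with the population standard deviation.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_rows(counts):
    """Z-score each family's gene counts across species (assumed row-wise)."""
    scores = {}
    for family, per_species in counts.items():
        values = list(per_species.values())
        mu, sigma = mean(values), pstdev(values)
        # A family with identical counts in every species has zero variance.
        scores[family] = {
            species: (n - mu) / sigma if sigma > 0 else 0.0
            for species, n in per_species.items()
        }
    return scores


def largest_families(counts, focal, top_n=100):
    """Return the top_n families ranked by gene count in the focal species."""
    ranked = sorted(counts, key=lambda fam: counts[fam][focal], reverse=True)
    return ranked[:top_n]


def dominant_domain(domains_per_protein, min_fraction=0.30):
    """Return the domain carried by more than min_fraction of proteins, else None."""
    tally = Counter(d for domains in domains_per_protein for d in set(domains))
    if not tally:
        return None
    domain, hits = tally.most_common(1)[0]
    return domain if hits / len(domains_per_protein) > min_fraction else None


# Toy data: hypothetical family names, species labels and gene counts.
counts = {
    "famA": {"A_angustus": 12, "P_patens": 4, "M_polymorpha": 2},
    "famB": {"A_angustus": 6, "P_patens": 6, "M_polymorpha": 6},
    "famC": {"A_angustus": 3, "P_patens": 9, "M_polymorpha": 3},
}
z = zscore_rows(counts)
top = largest_families(counts, focal="A_angustus", top_n=2)
label = dominant_domain([{"dom1"}, {"dom1", "dom2"}, {"dom3"}])
```

Clustering and heat-map drawing, which the study performed in Genesis, are left out; the z-score table produced here is the kind of input such a tool would take.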
15. Bowman, J. L. et al. Insights into land plant evolution garnered from the Marchantia polymorpha genome. Cell 171, 287–304 (2017).
16. Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).

Tandem duplication definition. Genes were defined as tandemly arrayed genes if they belonged to the same family, were located within 100 kb of each other, and were separated by zero, one or fewer, five or fewer, or ten or fewer non-homologous intervening ‘spacer’ genes104. Four sets of tandem gene definitions were therefore analysed.
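
The tandem-array criteria (same family, within 100 kb, at most k spacer genes) can be sketched as a single pass over position-sorted genes. The following is an illustrative sketch rather than the authors' code: the tuple layout, the function name and the toy coordinates are hypothetical, distance is assumed to be measured between gene start positions, and consecutive family members are assumed to be chained into one array.

```python
from collections import defaultdict


def tandem_arrays(genes, max_spacers=0, max_distance=100_000):
    """Group same-family genes into tandem arrays.

    genes: iterable of (gene_id, chromosome, start, family) tuples.
    Consecutive family members are linked when at most max_spacers other
    genes lie between them and their starts are within max_distance bp.
    """
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[gene[1]].append(gene)

    arrays = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g[2])
        last_seen = {}  # family -> (gene index, start, array being built)
        for idx, (gene_id, _chrom, start, family) in enumerate(chrom_genes):
            prev = last_seen.get(family)
            if (prev is not None
                    and idx - prev[0] - 1 <= max_spacers
                    and start - prev[1] <= max_distance):
                array = prev[2]
                array.append(gene_id)
            else:
                array = [gene_id]
                arrays.append(array)
            last_seen[family] = (idx, start, array)
    # Singletons are not tandem arrays.
    return [array for array in arrays if len(array) > 1]


# Toy annotation (hypothetical identifiers and coordinates).
genes = [
    ("g1", "chr1", 1_000, "F1"),
    ("g2", "chr1", 5_000, "F1"),
    ("g3", "chr1", 9_000, "F2"),
    ("g4", "chr1", 12_000, "F1"),
    ("g5", "chr1", 400_000, "F1"),
    ("g6", "chr2", 1_000, "F1"),
]
# The four spacer thresholds named in the text.
results = {k: tandem_arrays(genes, max_spacers=k) for k in (0, 1, 5, 10)}
```

Because the previous member of the same family is tracked, every gene lying between two linked members is by construction non-homologous, which matches the notion of a spacer used above.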
17. Axtell, M. J. & Bowman, J. L. Evolution of plant microRNAs and their targets. Trends Plant Sci. 13, 343–349 (2008).
18. Archangelsky, S. & Villar de Seone, L. Estudios palinógicos de la formación Baqueró (Cretácico), provincia de Santa Cruz, Argentina. Ameghiniana 35, 7–19 (1996).
19. Van de Peer, Y., Mizrachi, E. & Marchal, K. The evolutionary significance of polyploidy. Nat. Rev. Genet. 18, 411–424 (2017).
20. Lang, D. et al. The Physcomitrella patens chromosome-scale assembly reveals moss genome structure and evolution. Plant J. 93, 515–533 (2018).
21. Catarino, B., Hetherington, A. J., Emms, D. M., Kelly, S. & Dolan, L. The stepwise increase in the number of transcription factor families in the Precambrian predated the diversification of plants on land. Mol. Biol. Evol. 33, 2815–2819 (2016).
22. Cheng, F. et al. Gene retention, fractionation and subgenome differences in polyploid plants. Nat. Plants 4, 258–268 (2018).
23. Lang, D. et al. Genome-wide phylogenetic comparative analysis of plant transcriptional regulation: a timeline of loss, gain, expansion, and correlation with complexity. Genome Biol. Evol. 2, 488–503 (2010).
24. Sakakibara, K. Technological innovations give rise to a new era of plant evolutionary developmental biology. Adv. Bot. Res. 78, 3–35 (2016).

HGT event identification. In this study, we used two different strategies to identify candidates for A. angustus-specific and bryophyte-specific HGTs. For A. angustus-specific HGTs, we submitted 14,629 predicted coding genes of A. angustus to a BLASTP search against the NCBI protein database (E-value cutoff of 1 × 10⁻⁵) (Supplementary Note 7.1). The proteins with the best BLAST hits in bacterial or fungal sequences were extracted. After sequences without support of transcript evidence were excluded, a series of parameters were used to filter the candidates (Supplementary Note 7.1). For the bryophyte-specific HGT, we extracted gene families that are common to at least two of the three members of bryophytes (moss P. patens, liverwort M. polymorpha and hornwort A. angustus). To preliminarily determine whether these clusters are HGT candidates, we submitted the corresponding A. angustus members of each cluster to the NCBI protein database for BLASTP search and checked the taxonomy report of the top 1,000 BLAST hits (Supplementary Note 7.2). The homologues of published HGTs in P. patens51 and M. polymorpha15 were also investigated in the A. angustus genome. All candidate HGTs were subjected to phylogenetic analysis for verification. Synonymous codon-usage order values and GC contents of HGT and non-HGT genes were calculated by CodonO105.
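
The first screening steps of the HGT search (E-value cutoff, best hit in bacteria or fungi, transcript support, and families shared by at least two bryophytes) can be sketched as simple filters. This is only a sketch under stated assumptions: the record layout, function names, taxon labels and toy hits are hypothetical, the "best" hit is assumed to be the one with the highest bit score, and the further filtering parameters of Supplementary Note 7.1 and the phylogenetic verification are not reproduced.

```python
E_VALUE_CUTOFF = 1e-5            # cutoff stated in the text
DONOR_GROUPS = {"Bacteria", "Fungi"}
BRYOPHYTES = {"P_patens", "M_polymorpha", "A_angustus"}


def best_hits(hits):
    """Keep the highest-scoring hit per query among hits passing the cutoff.

    hits: iterable of (query, subject_group, evalue, bitscore) tuples.
    """
    best = {}
    for query, group, evalue, bitscore in hits:
        if evalue > E_VALUE_CUTOFF:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (group, evalue, bitscore)
    return best


def hgt_candidates(hits, expressed):
    """Queries whose best hit is bacterial or fungal and that have transcript support."""
    return sorted(
        query
        for query, (group, _evalue, _score) in best_hits(hits).items()
        if group in DONOR_GROUPS and query in expressed
    )


def shared_bryophyte_families(members, min_species=2):
    """Families present in at least min_species of the three bryophytes.

    members: mapping of family -> set of species with at least one member.
    """
    return sorted(
        family for family, species in members.items()
        if len(species & BRYOPHYTES) >= min_species
    )


# Toy BLAST summary (hypothetical gene identifiers and scores).
hits = [
    ("Aa1", "Bacteria", 1e-30, 200.0),
    ("Aa1", "Viridiplantae", 1e-10, 90.0),
    ("Aa2", "Viridiplantae", 1e-50, 300.0),
    ("Aa2", "Fungi", 1e-20, 150.0),
    ("Aa3", "Fungi", 1e-3, 40.0),     # fails the E-value cutoff
    ("Aa4", "Fungi", 1e-40, 250.0),   # no transcript evidence
]
candidates = hgt_candidates(hits, expressed={"Aa1", "Aa2", "Aa3"})
shared = shared_bryophyte_families({
    "fam1": {"P_patens", "A_angustus"},
    "fam2": {"A_angustus", "A_thaliana"},
    "fam3": {"P_patens", "M_polymorpha", "A_angustus"},
})
```

Candidates that survive such a screen would still need the phylogenetic verification described in the text before being called HGTs.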
25. Szövényi, P., Waller, M. & Kirbis, A. Evolution of the plant body plan. Curr. Top. Dev. Biol. 131, 1–34 (2019).
26. Floyd, S. K. & Bowman, J. L. The ancestral developmental tool kit of land plants. Int. J. Plant Sci. 168, 1–35 (2007).
27. Hori, K. et al. Klebsormidium flaccidum genome reveals primary factors for plant terrestrial adaptation. Nat. Commun. 5, 3978 (2014).
28. Nishiyama, T. et al. The Chara genome: secondary complexity and implications for plant terrestrialization. Cell 174, 448–464 (2018).
29. Wodniok, S. et al. Origin of land plants: do conjugating green algae hold the key? BMC Evol. Biol. 11, 104 (2011).
30. Ishizaki, K. Evolution of land plants: insights from molecular studies on basal lineages. Biosci. Biotechnol. Biochem. 81, 73–80 (2017).
31. Rensing, S. A. Great moments in evolution: the conquest of land by plants. Curr. Opin. Plant Biol. 42, 49–54 (2018).

Reporting Summary. Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The A. angustus genome project has been deposited at the NCBI under the BioProject number PRJNA543716. The genome sequencing data were deposited in the Sequence Read Archive database under the accession number SRR9696346. The A. angustus transcriptome project has been deposited at the NCBI under BioProject PRJNA543724. The transcriptome sequencing data were deposited in the Sequence Read Archive database under the accession number SRR9662965. The assembled genome sequences, gene models and miRNA data are available via DRYAD (https://doi.org/10.5061/dryad.msbcc2ftv). All data that support the findings of this study are also available from the corresponding authors upon request.
Received: 12 June 2019; Accepted: 20 December 2019; Published online: 10 February 2020

References
1. Puttick, M. N. et al. The interrelationships of land plants and the nature of the ancestral embryophyte. Curr. Biol. 28, 733–745 (2018).
2. Goffinet, B. & Buck, W. R. in The Evolution of Plant Form (eds Ambrose, B. & Purruganan, M.) 51–90 (Wiley–Blackwell, 2013).
3. von Konrat, M., Shaw, A. J. & Renzaglia, K. S. A special issue of Phytotaxa dedicated to bryophytes: the closest living relatives of early land plants. Phytotaxa 9, 5–10 (2010).
32. Braybrook, S. A. & Harada, J. J. LECs go crazy in embryo development. Trends Plant Sci. 13, 624–630 (2008).
33. Takezawa, D., Komatsu, K. & Sakata, Y. ABA in bryophytes: how a universal growth regulator in life became a plant hormone? J. Plant Res. 124, 437–453 (2011).
34. Proust, H. et al. RSL class I genes controlled the development of epidermal structures in the common ancestor of land plants. Curr. Biol. 26, 93–99 (2016).
35. Pires, N. D. et al. Recruitment and remodeling of an ancient gene regulatory network during land plant evolution. Proc. Natl Acad. Sci. USA 110, 9571–9576 (2013).

116 Nature Plants | VOL 6 | February 2020 | 107–118 | www.nature.com/natureplants


36. Sakakibara, K. et al. KNOX2 genes regulate the haploid-to-diploid morphological transition in land plants. Science 339, 1067–1070 (2013).
37. Coudert, Y., Novák, O. & Harrison, C. J. A KNOX-cytokinin regulatory module predates the origin of indeterminate vascular plants. Curr. Biol. 29, 2743–2750 (2019).
38. Mutte, S. K. et al. Origin and evolution of the nuclear auxin response system. eLife 7, e33399 (2018).
39. Cenci, A. & Rouard, M. Evolutionary analyses of GRAS transcription factors in angiosperms. Front. Plant Sci. 8, 273 (2017).
40. Kaplan-Levy, R. N., Brewer, P. B., Quon, T. & Smyth, D. R. The trihelix family of transcription factors—light, stress and development. Trends Plant Sci. 17, 163–171 (2012).
41. Fujii, S. & Small, I. The evolution of RNA editing and pentatricopeptide repeat genes. N. Phytol. 191, 37–47 (2011).
42. Cheng, S. et al. Redefining the structural motifs that determine RNA binding and RNA editing by pentatricopeptide repeat proteins in land plants. Plant J. 85, 532–547 (2016).
43. Rüdinger, M., Polsakiewicz, M. & Knoop, V. Organellar RNA editing and plant-specific extensions of pentatricopeptide repeat proteins in jungermanniid but not in marchantiid liverworts. Mol. Biol. Evol. 25, 1405–1414 (2008).
44. Dunwell, J. M., Khuri, S. & Gane, P. J. Microbial relatives of the seed storage proteins of higher plants: conservation of structure and diversification of function during evolution of the cupin superfamily. Microbiol. Mol. Biol. Rev. 64, 153–179 (2000).
45. Nakata, M. et al. Germin-like protein gene family of a moss, Physcomitrella patens, phylogenetically falls into two characteristic new clades. Plant Mol. Biol. 56, 381–395 (2004).
46. Pollastri, S. & Tattini, M. Flavonols: old compounds for old roles. Ann. Bot. 108, 1225–1233 (2011).
47. Sakata, Y., Komatsu, K. & Takezawa, D. in Progress in Botany (ed. Lüttge, U.) 57–96 (Springer-Verlag, 2014).
48. Hanson, D. T., Renzaglia, K. & Villareal, J. C. in Photosynthesis of Bryophytes and Early Land Plants (eds Hanson, D. T. & Rice, S. K.) 95–111 (Springer, 2014).
49. Meyer, M. & Griffiths, H. Origins and diversity of eukaryotic CO2-concentrating mechanisms: lessons for the future. J. Exp. Bot. 64, 769–786 (2013).
50. Mackinder, L. C. M. A spatial interactome reveals the protein organization of the algal CO2-concentrating mechanism. Cell 171, 133–147 (2017).
51. Yue, J., Hu, X., Sun, H., Yang, Y. & Huang, J. Widespread impact of horizontal gene transfer on plant colonization of land. Nat. Commun. 3, 1152 (2012).
52. Foflonker, F. et al. Genome of the halotolerant green alga Picochlorum sp. reveals strategies for thriving under fluctuating environmental conditions. Environ. Microbiol. 17, 412–426 (2015).
53. Hasanuzzaman, M. et al. Coordinated actions of glyoxalase and antioxidant defense systems in conferring abiotic stress tolerance in plants. Int. J. Mol. Sci. 18, 200 (2017).
54. Finnegan, E. J. & Kovac, K. A. Plant DNA methyltransferases. Plant Mol. Biol. 43, 189–201 (2000).
55. Jia, Q. et al. Microbial-type terpene synthase genes occur widely in nonseed land plants, but not in seed plants. Proc. Natl Acad. Sci. USA 113, 12328–12333 (2016).
56. Duckett, J. G. et al. In vitro cultivation of bryophytes: a review of practicalities, problems, progress and promise. J. Bryol. 26, 3–20 (2004).
57. Kugita, M. et al. The complete nucleotide sequence of the hornwort (Anthoceros formosae) chloroplast genome: insight into the earliest land plants. Nucleic Acids Res. 31, 716–721 (2003).
58. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
59. Szövényi, P. et al. Establishment of Anthoceros agrestis as a model species for studying the biology of hornworts. BMC Plant Biol. 15, 98 (2015).
60. Li, F. et al. Horizontal transfer of an adaptive chimeric photoreceptor from bryophytes to ferns. Proc. Natl Acad. Sci. USA 111, 6672–6677 (2014).
61. Mergaert, P. et al. Eukaryotic control on bacterial cell cycle and differentiation in the Rhizobium–legume symbiosis. Proc. Natl Acad. Sci. USA 103, 5230–5235 (2006).
62. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
63. Arratia, R., Lander, E. S., Tavaré, S. & Waterman, M. S. Genomic mapping by anchoring random clones: a mathematical analysis. Genomics 11, 806–827 (1991).
64. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
65. Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578–579 (2011).
66. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).
67. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
68. Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 25, 4.10.1–4.10.14 (2009).
69. Jurka, J. et al. Repbase update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467 (2005).
70. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
71. Edgar, R. C. & Myers, E. W. PILER: identification and classification of genomic repeats. Bioinformatics 21(Suppl. 1), i152–i158 (2005).
72. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl. 1), i351–i358 (2005).
73. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–W439 (2006).
74. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
75. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
76. Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinf. 12, 491 (2011).
77. Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009).
78. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).
79. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinf. 10, 421 (2009).
80. Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).
81. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).
82. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
83. Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
84. Wen, M., Shen, Y., Shi, S. & Tang, T. miREvo: an integrative microRNA evolutionary analysis platform for next-generation sequencing experiments. BMC Bioinf. 13, 140 (2012).
85. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
86. Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
87. Li, L., Stoeckert, C. J. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).
88. Katoh, K., Kuma, K. I., Toh, H. & Miyata, T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33, 511–518 (2005).
89. Abascal, F., Zardoya, R. & Telford, M. J. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 38, W7–W13 (2010).
90. Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006).
91. Forterre, P. & Philippe, H. Where is the root or the universal tree of life? Bioessays 21, 871–879 (1999).
92. Matasci, N. et al. Data access for the 1,000 Plants (1KP) project. Gigascience 3, 17 (2014).
93. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
94. Felsenstein, J. PHYLIP: phylogenetic inference program v.3.6 (Univ. of Washington, 2005).
95. Zwaenepoel, A. & Van de Peer, Y. wgd—simple command line tools for the analysis of ancient whole genome duplications. Bioinformatics 35, 2153–2155 (2018).
96. Van Bel, M. et al. PLAZA 4.0: an integrative resource for functional, evolutionary and comparative plant genomics. Nucleic Acids Res. 46, D1190–D1196 (2018).
97. Proost, S. et al. i-ADHoRe 3.0―fast and sensitive detection of genomic homology in extremely large data sets. Nucleic Acids Res. 40, e11 (2012).
98. Zheng, Y. et al. iTAK: a program for genome-wide prediction and classification of plant transcription factors, transcriptional regulators, and protein kinases. Mol. Plant 9, 1667–1670 (2016).

99. Maddison, W. P. & Maddison, D. R. Mesquite: a modular system for evolutionary analysis v.2.75 (Mesquite Project, 2011).
100. Madera, M. & Gough, J. A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 30, 4321–4328 (2002).
101. Stamatakis, A., Hoover, P. & Rougemont, J. A rapid bootstrap algorithm for the RAxML Web servers. Syst. Biol. 57, 758–771 (2008).
102. Sturn, A., Quackenbush, J. & Trajanoski, Z. Genesis: cluster analysis of microarray data. Bioinformatics 18, 207–208 (2002).
103. Martens, C., Vandepoele, K. & Van de Peer, Y. Whole-genome analysis reveals molecular innovations and evolutionary transitions in chromalveolate species. Proc. Natl Acad. Sci. USA 105, 3427–3432 (2008).
104. Hanada, K. et al. Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol. 148, 993–1003 (2008).
105. Angellotti, M. C., Bhuiyan, S. B., Chen, G., Wan, X. & Wan, X. CodonO: codon usage bias analysis within and across genomes. Nucleic Acids Res. 35, W132–W136 (2007).

Acknowledgements
We thank P. R. Crane, S. Ge, D.-Y. Hong, J.-L. Huang, J.-J. Qin, Y.-L. Qiu, J.-C. Villarreal, T. Wan and X.-Q. Wang for useful advice and discussions and L. Zhang for providing plant pictures. We dedicate the paper to Yang Zhong in memory of his support and valuable suggestions on this project. This work was supported by Sino–Africa Joint Research Center, Chinese Academy of Sciences, CAS International Research and Education Development Program (SAJC201613), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB31000000 and XDA19050103), National Natural Science Foundation of China (NNSF 31590822), Shenzhen Fairy Lake Botanical Garden, State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, and Key Laboratory of National Forestry and Grassland Administration for Orchid Conservation and Utilization at College of Landscape Architecture, Fujian Agriculture and Forestry University. Y.V.d.P. acknowledges support from the European Union Seventh Framework Programme (FP7/2007-2013) under European Research Council Advanced Grant Agreement 322739–DOUBLEUP.

Author contributions
Z.-D.C., Z.-J.L., Y.V.d.P. and S.-Z.Z. conceived the paper; Z.-D.C., Z.-J.L. and S.-Z.Z. managed the project; J.Z., X.-X.F., Y.L., Z.-J.L., A.Z., Y.V.d.P. and Z.-D.C. wrote the manuscript; R.-Q.L., J.-F.Y., Y.-Y.L., Q.-H.W., S.-Z.Z. and M.-Z.W. collected and cultured the plant material; R.-Q.L. and M.-H.L. sequenced and processed the raw data; X.Z. and Z.-W.W. assembled and annotated the genome; Y.L. and J.Z. performed phylogenetic analysis; J.Z. and X.-X.F. analysed gene families; X.-X.F. and J.Z. identified HGT; A.Z. and Y.V.d.P. conducted WGD analysis; Y.-L. Guan conducted DAPI staining analysis; J.-Y.X. conducted codon-usage bias analysis; M.-H.L., G.-Q.Z. and J.-Y.W. conducted transcriptome sequencing and analysis; S.-S.D. and Y.L. conducted the RNA-editing-site analysis in organellar genomes; H.M., Q.-F.W., B.G., Y.J., Y.-N.J., Y.-L. Guo, H.-Z.K., A.-M.L. and H.-M.Y. contributed substantially to revisions. All authors commented on the manuscript.

Competing interests
The authors declare no competing financial interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41477-019-0588-4.
Correspondence and requests for materials should be addressed to Z.-D.C., Z.-J.L., Y.V.d.P. or S.-Z.Z.
Peer review information Nature Plants thanks Burkhard Becker and the other, anonymous, reviewers for their contribution to the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2020



nature research | reporting summary
Corresponding author(s): Zhi-Duan Chen, Zhong-Jian Liu, Yves Van de Peer, and Shou-Zhou Zhang
Last updated by author(s): Dec 17, 2019

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.

A description of all covariates tested


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)

For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.

For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.

Software and code


Policy information about availability of computer code
Data collection 1. We constructed libraries with insert sizes from 170 bp to 40 kb for whole-genome shotgun sequencing using Illumina HiSeq 2000.
2. We also constructed a genomic DNA library for Oxford Nanopore sequencing.

Data analysis The software used is listed as follows: BLASTP (ncbi-BLAST v2.2.28), BLASTN (ncbi-BLAST v2.2.28), TBLASTN (ncbi-BLAST v2.2.28), Nextdenovo (V2.0), Pilon (v1.22), SSPACE (v3.0), BUSCO (v3), Trimmomatic (v0.33), Trinity (v2.5.1), TransDecoder (v5.0.2), Tandem Repeats Finder (v4.09), RepeatMasker (v4.1.0), LTR_FINDER (v1.0.2), PILER (v1.3.4), RepeatModeler (v1.0.3), AUGUSTUS (v2.5.5), GlimmerHMM (v3.0.1), GeneWise (v2.4.1), MAKER (v1.0), TopHat (v2.1.1), Cufflinks (v2.2.1), miREvo (v1.2), tRNAscan-SE (v1.3.1), INFERNAL (v1.1), OrthoMCL (v2.0), MAFFT (version 7), TranslatorX (v0.9), RAxML (v7.2.3), PAML (v4.7), PHYLIP (v3.695), wgd (v3.0), I-ADHoRe 3.0, iTAK (version 1.7), Mesquite (version 3.51), HMMER (v 3.1b2), CIPRES Science Gateway (V. 3.3), Genesis (v3.0), CodonO.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers.
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:

- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

The A. angustus genome project has been deposited at the NCBI under the BioProject number PRJNA543716. The genome sequencing data were deposited in the Sequence Read Archive (SRA) database under the accession number SRR9696346. The A. angustus transcriptome project has been deposited at the NCBI under BioProject PRJNA543724. The transcriptome sequencing data were deposited in the Sequence Read Archive (SRA) database under the accession number SRR9662965. The assembled genome sequences, gene models and miRNA data are available via DRYAD (https://doi.org/10.5061/dryad.msbcc2ftv). All data that support the findings of this study are also available from the corresponding authors upon request.



Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size We sequenced a single hornwort plant, and no statistical methods were used to predetermine sample sizes. For comparative genome analyses, the gene sequences of Anthoceros angustus and 18 other plant species were used (Supplementary Table 13), including seven angiosperms (Arabidopsis thaliana, Genlisea aurea, Vitis vinifera, Oryza sativa, Phalaenopsis equestris, Zostera marina and Amborella trichopoda), one gymnosperm (Picea abies), one lycophyte (Selaginella moellendorffii), three bryophytes (Physcomitrella patens, Marchantia polymorpha and Anthoceros angustus), two charophytes (Chara braunii and Klebsormidium nitens) and five chlorophytes (Volvox carteri, Chlamydomonas reinhardtii, Ulva mutabilis, Coccomyxa subellipsoidea and Chlorella variabilis). This sampling covers all the major lineages of green plants and represents the backbone of green plant evolution.

Data exclusions Lines 416-456, 466-472: Prokaryotic and organellar sequences were removed from the sequencing data and the pre-assembled genome data, because the genome sequencing data contained sequences of both kinds. Excluding contaminating foreign DNA and organellar sequences is a prerequisite for accurate genome assembly. After selection of high-abundance k-mer reads, error correction and a MEGABLAST check, 3.78 Gb of high-quality clean Nanopore reads remained for A. angustus genome assembly.
Lines 521-522: We excluded annotations characterized only as hypothetical/predicted protein, since these proteins cannot be regarded as functionally annotated.
Lines 542-543: During the comparative analysis, we chose the longest transcript to represent each gene and removed mitochondrial and chloroplast genes, since the genome datasets used include multiple transcripts and organellar genes that might complicate the comparative analysis.
Lines 617-618: The mean gene family size was calculated for all gene families, excluding orphans and species-specific families, since these genes are unique to individual species and do not have orthologs in other species for comparison.
Lines 624-627: During the gene family expansion identification, transposon-derived gene families were removed, since the distribution of such families is likely to be a consequence of the gene models derived from a repeat-masked genome sequence and therefore may be artefactual.
Lines 638-640: Sequences without support of transcript evidence were excluded from the HGT candidates, since these sequences might be contaminants rather than genuine HGT genes.

Replication The spore germination experiment was repeated three times independently. The DAPI staining experiment was repeated three times
independently.

Randomization We selected spores at random for the germination experiments. We selected regions of the gametophytes at random for DAPI staining.

Blinding We sequenced a single hornwort plant, and no control group was involved. Blinding is not applicable to this study.
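
The longest-transcript rule described under Data exclusions (one representative transcript per gene, organellar genes dropped) can be sketched in a few lines. This is a hedged illustration only: the tuple layout, the function name, the `organellar` argument and the toy sequences are hypothetical and do not come from the study.

```python
def longest_transcripts(transcripts, organellar=frozenset()):
    """Keep the longest transcript per gene, skipping organellar genes.

    transcripts: iterable of (gene_id, transcript_id, sequence) tuples.
    organellar: gene identifiers to exclude (mitochondrial and chloroplast).
    Ties are resolved in favour of the transcript encountered first.
    """
    best = {}
    for gene_id, transcript_id, sequence in transcripts:
        if gene_id in organellar:
            continue
        if gene_id not in best or len(sequence) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, sequence)
    return best


# Toy records (hypothetical identifiers and sequences).
records = [
    ("g1", "g1.t1", "ATGAAA"),
    ("g1", "g1.t2", "ATGAAACCC"),
    ("g2", "g2.t1", "ATGTTT"),
    ("cpA", "cpA.t1", "ATGGGGCCC"),
]
representatives = longest_transcripts(records, organellar={"cpA"})
```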

Reporting for specific materials, systems and methods


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems
n/a Involved in the study
Antibodies
Eukaryotic cell lines
Palaeontology
Animals and other organisms
Human research participants
Clinical data

Methods
n/a Involved in the study
ChIP-seq
Flow cytometry
MRI-based neuroimaging
