Professional Documents
Culture Documents
doi: 10.1093/bib/bbx120
Advance Access Publication Date: 23 September 2017
Paper
Abstract
Microbiome research has grown rapidly over the past decade, with a proliferation of new methods that seek to make sense
of large, complex data sets. Here, we survey two of the primary types of methods for analyzing microbiome data: read clas-
sification and metagenomic assembly, and we review some of the challenges facing these methods. All of the methods rely
on public genome databases, and we also discuss the content of these databases and how their quality has a direct impact
on our ability to interpret a microbiome sample.
Florian P. Breitwieser is a postdoctoral fellow at the Center for Computational Biology at Johns Hopkins School of Medicine. His research interests include
metagenomics classification and visualization methods and their application to infectious disease diagnosis.
Jennifer Lu is a Biomedical Engineering PhD student in Steven Salzberg’s laboratory at the Center for Computational Biology at Johns Hopkins University.
Her research focuses on computational genomics and the usage of sequencing for diagnosing microbial infections relating to human health and diseases.
Steven L. Salzberg is the Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science and Biostatistics at Johns Hopkins University.
His laboratory conducts research on DNA and RNA sequence analysis including genome assembly, transcriptome assembly, sequence alignment and
metagenomics.
Submitted: 7 June 2017; Received (in revised form): 22 August 2017
C The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
V
1125
1126 | Breitwieser et al.
sequencing. Because it only requires sequence from a single chimeric sequences, clustering of reads into ‘Operational
gene, this strategy provides a cost-effective means to identify a Taxonomic Units’ (OTUs) based on sequence similarity and clas-
wide range of organisms. Metagenomics refers to the random sification of the OTUs [13–20]. An alternative approach before
‘shotgun’ sequencing of microbial DNA, without selecting any clustering of reads into OTUs is their direct classification using
particular gene [2]. Both metataxonomics and metagenomics metagenomics classifiers (see section on ‘metagenomics classi-
can provide information on the species composition of a micro- fication’ and Table 3), as recently compared in [21]. The rest of
biome. Another strategy, metatranscriptomics, attempts to cap- this review will focus on metagenomics methods; for further
ture and sequence all of the RNA in a sample, which can help discussion of metataxonomic methods, see [22–25].
create a profile of all genes that are actively being transcribed, Marker gene sequencing does have some drawbacks, which
and may also provide a picture of the relative abundance of explains (in part) the rising popularity of metagenomics. First,
those genes [3]. marker gene-based methodologies do not capture viruses,
Complementary approaches that are becoming increasingly which have no conserved genes analogous to 16S or 18S rRNA
popular in microbiome research, but are not further covered in genes. The use of the 16S rRNA gene itself is imperfect as well:
this review, include metaproteomics and metametabolomics for the recently described Candidate Phyla Radiation, which
[4–6]. Metaproteomics uses mass spectrometry techniques, e.g. comprises up to 15% of the bacterial domain [26], it was esti-
liquid chromatography-coupled tandem mass spectrometry, to mated that >50% of the organisms evaded detection with clas-
generate profiles of protein expression and posttranslational sical 16S amplicon sequencing [27]. The short reads produced
modifications of proteins [5]. Typically, genome sequences by next-generation sequencers further limit analysis at the spe-
are required for the mapping of generated mass spectra to pro- cies level, although full-length 16S rRNA gene sequencing using
teins, and thus, this field also depends on metagenomics. long-read sequencers from Pacific Biosciences or Oxford
Metametabolomics attempts to create profiles of metabolites, Nanopore might help overcome this limitation [28]. The meth-
usually also created using mass spectrometry [6]. Mass spec- odology of an experiment and laboratory-specific factors can
trometry is more expensive and experimentally challenging also limit the effectiveness of marker gene sequencing
than sequencing, although the field is making continual tech- approaches, although the same caveat applies to metagenomics
nical improvements [4]. Integrating the data of all these differ- [29–32].
ent ‘meta-omics’ approaches is challenging, but it can yield
insights not found by looking at just one type of data [7].
Metataxonomics is an invaluable tool for microbial ecology.
Metagenomic analysis
rRNA gene sequences are the most widely used marker se- Many strategies can be used for analysis of metagenomics shot-
quences; these include the 16S rRNA gene for bacteria, the 18S gun data (Figure 1). A common first step is to run a variety of
rRNA gene for eukaryotes, and the internal transcribed spacer computational tools for quality control, which identify and re-
(ITS) regions of the fungal ribosome for fungi [8, 9]. These move low-quality sequences and contaminants. These include
markers work well for phylogenetic profiling because they are programs such as FastQC [33], Cutadapt [34], BBDuk [35] and
ubiquitously present in the population, they have hypervariable Trimmomatic [36] (Table 2). FastQ Screen [37] matches reads
regions that differentiate species and they are flanked by con- against multiple reference genomes such as human, mouse,
served regions that can be targeted by ‘universal’ primers [8]. A Escherichia coli and yeast, and can provide a quick overview of
major advantage of rRNA analysis is that databases such as where the reads align. Diginorm [38], implemented in the khmer
Greengenes [10], RDP [11] and SILVA [12] contain genes from package [39], can be used to reduce redundancy of reads in
millions of species, making them far more comprehensive than high-depth areas by down-sampling reads, and thus normalize
genome databases, which contain tens of thousands of species. coverage and make subsequent analyses computationally
The workflow for 16S analysis typically includes quality filter- cheaper. MultiQC [40] aggregates quality control reports from
ing, error correction (sometimes called de-noising), removal of multiple samples into a single report that can be viewed more
A review of metagenomics methods | 1127
FastQC Quality control tool showing statics such as quality [33] http://www.bioinformatics.babraham.ac.uk/pro
values, sequence length distribution and GC con- jects/fastqc/
tent distribution
FastQ Screen Screen a library against sequence databases to see if [37] http://www.bioinformatics.babraham.ac.uk/pro
composition of library matches expectations jects/fastq_screen
BBtools BBDuk trims and filters reads using k-mers and en- [35] http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-
tropy information. BBNorm normalizes coverage user-guide/
by down-sampling reads (digital normalization)
Trimmomatic Flexible read trimming tool for Illumina data [36] http://www.usadellab.org/cms/?page¼trimmomatic
Cutadapt Find and remove adapter sequences, primers, poly- [34] https://cutadapt.readthedocs.io
A tails and other types of unwanted sequence
khmer/diginorm Tools for k-mer error trimming of reads and digital [38, 39] http://khmer.readthedocs.io
normalization of samples
MultiQC Summarize results from different analysis (such as [40] http://multiqc.info
FastQC) into one report
Note: Most of these tools can also be used for other types of genome sequence data, e.g. whole-genome or RNA-seq data.
easily. If the microbiome comes from a host with a sequenced every read is a form of binning because it groups reads into bins
genome, such as human, it is useful to identify and filter out corresponding to their taxon ID. Binning can also be done using
host reads before further analysis. Alternatively, some taxo- other properties such as composition and co-abundance pro-
nomic classifiers can include the host genome in their files, although those methods typically require assembly of
databases. reads into longer contigs, which provide better statistics for
After quality control, the reads can either be assembled into profiling [41]. (See [42] for a review of binning methods.) When
longer contiguous sequences called contigs or passed directly to the analysis only returns the estimated abundances of the dif-
taxonomic classifiers (Figure 1). Taxonomic classification of ferent taxa (instead of a classification of each read), we call it
1128 | Breitwieser et al.
taxonomic profiling. The choice of assembly-based analyses be used as evidence that a member of the associated clade is
versus direct taxonomic classification of reads depends on the present. This allows faster assignment because the database,
research question. even with a million or more genes (as in MetaPhlAn [81]), is far
Direct taxonomic classification is useful for quantitative smaller than a database containing the full genomes for all spe-
community profiling and identification of organisms with close cies. The assignment can then be made with fast, sensitive
relatives in the database. Compared with marker gene-based aligners, such as Bowtie2 [85] used by MetaPhlAn and HMMER
community profiling, metagenomic shotgun sequencing allevi- [86] used by Phylosift [87] and mOTU [82]. GOTTCHA [76] gener-
ates biases from primer choice and enables the detection of or- ates a database with unique genome signatures based on
ganisms across all domains of life, assuming that DNA can be unique 24 base-pair fragments, which it indexes with bwa-mem
extracted from the target environment. Researchers can quan- [88]. GOTTCHA can output either binary classification (pres-
tify the structure of microbial communities using ecological and ence/absence calls) or a taxonomic profile, which is based on
biogeographic measures such as species diversity, richness and coverage of the genomic signatures, The use of single-copy
uniformity of the communities [22, 43]. In clinical microbiology, marker genes should in principle make abundance estimation
the focus is often on the presence or absence of infectious more precise, although it is impossible to know the copy num-
pathogens, which can be identified by matching reads against a ber of a gene for a species with an incomplete genome. Because
stores the list of source genomes with each k-mer instead of K-mers can also be represented in de Bruijn graphs. Kallisto
their lowest common taxonomic ancestor. LMAT includes mi- [67, 95], which was originally developed for RNA-Seq analysis,
crobial draft genomes as well as eukaryotic microbes in its uses a colored de Bruijn graph [96] in which each edge (i.e. k-
‘Grand’ database, which requires 500 GB RAM and classifies mer) is assigned a set of ‘colors’, where a color encodes a gen-
more reads than a database without draft genomes. LMAT-ML ome in which the k-mer has been found. Given a sample read,
(for Marker Library) [78] implements more stringent k-mer prun- Kallisto finds approximately matching paths in the colored de
ing to retain only informative and nonoverlapping k-mers, Bruijn graph, an approach the authors term ‘pseudo-alignment’.
which reduces the memory requirements to just 16 GB. After mapping, each read has a set of genomes associated with
1130 | Breitwieser et al.
it. Kallisto then infers strain abundances using an expectation– sorts all of the resulting protein fragments by length and com-
maximization (EM) algorithm [67]. k-SLAM [68] is a novel k-mer- pares each against the protein database, longest to shortest,
based approach that uses local sequence alignments and finding and returning maximum exact matches.
pseudo-assembly, which generates contigs that can lead to
more specific assignments.
Centrifuge [80] is a fast and accurate metagenomics classifier Metagenomic assembly
using the Burrows–Wheeler transform (BWT) and an FM-index Illumina sequencing technology, which is the most widely used
to store and index the genome database. This strategy uses only sequencing method for metagenomics experiments today, gen-
about one-tenth the space of a Kraken index for the same data- erates read lengths in the range of 100–250 bp, with a typical
base. Centrifuge also implements a feature that combines sequencing run producing tens of millions of reads.
shared sequences from closely related genomes using MUMmer Metagenomics experiments might generate hundreds of mil-
[97]. This greatly reduces redundancy for species where dozens lions or even billions of reads from a single sample. Depending
of strains have been sequenced, further reducing the size of the on number of reads and the complexity of the microbial species
index data structure. in the sample, some genomes might be sequenced deeply,
MEGAN6/MEGAN-CE [73] and taxator-tk [79] both use the
corresponding to different organisms) based on coverage infor- (Iterative De Bruijn Graph Assembler) [125] first implemented
mation and graph connectivity. MetaVelvet-SL [105] improves this approach going from small k’s to large k’s, replacing reads
the splitting of chimeric nodes—nodes that are shared between with preassembled contigs at each iteration. IDBA-UD [107] is a
subgraphs of closely related species—and thus generates longer version of the IDBA assembler modified to tolerate uneven
scaffolds than MetaVelvet. Ray Meta, conversely, constructs depth of coverage, as occurring in single-cell and metagenomics
contigs by a heuristics-guided graph traversal. sequencing experiments. IDBA-UD first generates a de Bruijn
The choice of k is important for single k-mer de Bruijn graph graph from the reads using small k-mers (by default k ¼ 20),
assemblers. Small k’s are more sensitive in making connections, and—after error correction—extracts contigs that are used as
but fail to resolve repeats. Large k’s may miss connections and ‘reads’ in the graph construction with the next-higher k-mer
are more sensitive to sequencing errors, but usually create lon- size. IDBA-UD detects erroneous k-mers and k-mers from differ-
ger contigs. Most current metagenomics assemblers thus gener- ent genomes by looking at deviations from the average multipli-
ate contigs from iteratively constructed and refined de Bruijn city of k-mers in a contig. This local thresholding allows IDBA-
graphs using multiple k-mer lengths. The IDBA assembler UD to more accurately decompose the de Bruijn graph.
1132 | Breitwieser et al.
MetaSPAdes [103] is an extension of the SPAdes assembler Taxonomy-independent binning does not require prior
[102], which was originally developed for bacterial genome and knowledge about the genomes in a sample, but relies on fea-
single-cell sequencing assembly. SPAdes/MetaSPAdes use an tures inherent to the sequence set. Composition-based binning
approach similar to IDBA with iterative de Bruijn graph refine- is based on the observation that overall genome composition in
ment, but keeping the complete read information together with terms of G/C content and di- and higher-order nucleotide fre-
preassembled contigs at each step. MetaSPAdes implements quencies vary between organisms and are often characteristic
various heuristics for graph simplification, filtering and storage of taxonomic lineages [132]. Clustering then can be done on se-
to allow the assembly of large metagenomics data sets. quence composition ‘fingerprints’ of the contigs [133].
Importantly, MetaSPAdes uses ‘strain-contigs’ to inform the as- MetaCluster [118, 119] bins reads by first grouping them based
sembly of high-quality consensus backbone sequences, which on long unique k-mers (k > 36) and merging groups based on
are often longer than contigs from other assemblers [126]. tetranucleotide or pentanucleotide frequency distribution.
Megahit [101] is a fast assembler that uses a range of k-mers MetaCluster 5.0 further uses 16-mer frequencies in a second
for iteratively improving the assembly. Megahit (which works round to bin contigs from low-abundance species in complex
for both metagenomics and single-genome sequencing data) samples. VizBin [115] uses a dimensionality reduction mechan-
uses a memory-efficient succinct de Bruijn graph representa- ism based on self-organizing maps to visualize as well as cluster
protein database to assign names and functions wherever pos- M. tuberculosis complex. A well-known example of historic mis-
sible. Anvi’o [110] is another pipeline that combines assembly, placement is Shigella [140], a genus that clearly falls within the
alignment, binning and classification results in an interactive E. coli species with ANIs above 97%—much higher than the ANIs
interface that allows one to refine the binning and assembly. of, for example Escherichia fergusonii to E. coli of about 93%.
MOCAT2 [109] integrates read filtering, taxonomic profiling with The consequence of this variability for computational classi-
mOTU [82], assembly, gene prediction and annotation to output fiers is that at the species or genus rank, different levels of se-
taxonomic as well as functional profiles of metagenomics quence similarity in different parts of the taxonomic tree have a
samples. different meaning, making it impossible, for some taxa, to de-
sign consistent rules assigning reads or contigs (even long ones)
to a species, and there is clearly no fixed percent-identity
threshold that can be used to group sequences into the same
Microbial taxonomy and genome resources
species or genus.
and their impact on classification The fungal taxonomy sometimes has two species and tax-
Almost all of the methods described here rely on a database of onomy IDs for the same organism. Fungi can have both teleo-
genomes and on taxonomy of species. The accuracy and reli- morphic (sexual reproductive stage) and anamorphic (asexual
together sequences or compute the lowest common ancestor. Table 5. Number of entries in commonly used reference databases
Especially when using the BLAST nr/nt or nr databases, it may
Domain Level Draft genomes Complete genomes1
be useful to filter unclassified sequences, or include only micro-
bial taxa, as is done by the kaiju classifier [69] when including GenBank RefSeq GenBank RefSeq
eukaryotes from nr.
Taxonomy changes. One solution to some of the problems Archaea Entries 859 351 260 (20) 225 (12)
just listed is to rename or move the species in the microbial tax- Species 695 204 209 (14) 178 (7)
onomy. This does happen somewhat frequently, but the new Bacteria Entries 89 730 78 783 7314 (1346) 6973 (1066)
names do not automatically percolate outward to every re- Species 19 078 11 217 2677 (542) 2586 (406)
Fungi Entries 1897 191 28 (414) 7 (38)
source that has downloaded the genomes from GenBank. As a
Species 997 190 17 (68) 7 (36)
result, some benchmark genome sets used in metagenomics
Protists Entries 430 47 2 (49) 2 (27)
comparisons [148] have become outdated because some of the
Species 226 47 2 (38) 2 (26)
organisms have new names. This in turn can lead to mistaken
Viruses Entries 3 3 0 (0) 7214 (22)
conclusions when later studies download and reuse the data
Species 1 3 0 (0) 7073 (22)
without going back to retrieve the original genomes from
various reasons why genomes may be excluded from RefSeq, genomes are contaminated with fragments of sequence
e.g. the assemblies are too highly fragmented. For bacteria, cur- from other species, which presents challenges for meta-
rently, the most common reason that a GenBank genome is not genomic analysis.
included is that it is derived from a metagenome (about half of • Microbial taxonomy is rapidly changing in the genome
the excluded genomes). Note that this is a current policy and in- era, with many species being renamed and grouped
clusion criteria may change in the future. The rate of inclusion into different clades.
into RefSeq has been much slower for eukaryotic microbes; cur-
rently, it contains only 191 of 1897 fungal genome assemblies.
RefSeq also includes the viral domain, for which it validates and
indexes one viral genome per species (and sometimes per sero- Funding
type). As of May 2017, there are >7000 viral genomes in RefSeq. This work was supported in part by the US National Institutes of
In addition, the NCBI Viral Genomes Resource (https://www. Health (NIH grants R01-HG006677 and R01-GM083873) and by
ncbi.nlm.nih.gov/genome/viruses/) [156] provides links to other the US Army Research Office (grant W911NF-1410490).
validated viral genomes that are ‘neighbors’ (i.e. strains) of viral
species in RefSeq.
16. Li W, Godzik A. Cd-hit: a fast program for clustering and 38. Titus Brown C, Howe A, Zhang Q, et al. A reference-free algo-
comparing large sets of protein or nucleotide sequences. rithm for computational normalization of shotgun sequenc-
Bioinformatics 2006;22:1658–9. ing data. arXiv e-prints 2012.
17. Mahe F, Rognes T, Quince C, et al. Swarm: robust and fast 39. Crusoe MR, Alameldin HF, Awad S, et al. The khmer software
clustering method for amplicon-based studies. PeerJ 2014;2: package: enabling efficient nucleotide sequence analysis,
e593. F1000Rese 2015;4:900.
18. Callahan BJ, McMurdie PJ, Rosen MJ, et al. DADA2: high- 40. Ewels P, Magnusson M, Lundin S, et al. MultiQC: summarize
resolution sample inference from Illumina amplicon data. analysis results for multiple tools and samples in a single re-
Nat Methods 2016;13:581–3. port. Bioinformatics 2016;32:3047–8.
19. Callahan BJ, Sankaran K, Fukuyama JA, et al. Bioconductor 41. Sangwan N, Xia F, Gilbert JA. Recovering complete and draft
workflow for microbiome data analysis: from raw reads to population genomes from metagenome datasets.
community analyses. F1000Res 2016;5:1492. Microbiome 2016;4:8.
20. McMurdie PJ, Holmes S. phyloseq: an R package for reprodu- 42. Mande SS, Mohammed MH, Ghosh TS. Classification of
cible interactive analysis and graphics of microbiome cen- metagenomic sequences: methods and challenges. Brief
sus data. PLoS One 2013;8:e61217. Bioinform 2012;13:669–81.
57. Tan B, de Araujo E Silva R, Rozycki T, et al. Draft genome se- 78. Gardner SN, Ames SK, Gokhale MB, et al. Searching more
quences of three Smithella spp. obtained from a methano- genomic sequence with less memory for fast and accurate
genic alkane-degrading culture and oil field produced water. metagenomic profiling. bioRxiv 2016.
Genome Announc 2014;2:e01085-14. 79. Droge J, Gregor I, McHardy AC. Taxator-tk: precise taxonomic
58. Tan B, Nesbo C, Foght J. Re-analysis of omics data indicates assignment of metagenomes by fast approximation of evolu-
Smithella may degrade alkanes by addition to fumarate tionary neighborhoods. Bioinformatics 2015;31:817–24.
under methanogenic conditions. ISME J 2014;8:2353–6. 80. Kim D, Song L, Breitwieser FP, et al. Centrifuge: rapid and
59. Wawrik B, Marks CR, Davidova IA, et al. Methanogenic paraf- sensitive classification of metagenomic sequences. Genome
fin degradation proceeds via alkane addition to fumarate by Res 2016;26:1721–9.
‘Smithella’ spp. mediated by a syntrophic coupling with 81. Truong DT, Franzosa EA, Tickle TL, et al. MetaPhlAn2 for
hydrogenotrophic methanogens. Environ Microbiol 2016;18: enhanced metagenomic taxonomic profiling. Nat Methods
2604–19. 2015;12:902–3.
60. Nobu MK, Narihiro T, Rinke C, et al. Microbial dark matter 82. Sunagawa S, Mende DR, Zeller G, et al. Metagenomic species
ecogenomics reveals complex synergistic networks in a profiling using universal phylogenetic marker genes. Nat
methanogenic bioreactor. ISME J 2015;9:1710–22. Methods 2013;10:1196–9.
103. Nurk S, Meleshko D, Korobeynikov A, et al. metaSPAdes: a 123. Simao FA, Waterhouse RM, Ioannidis P, et al. BUSCO: assess-
new versatile metagenomic assembler. Genome Res 2017;27: ing genome assembly and annotation completeness with
824–34. single-copy orthologs. Bioinformatics 2015;31:3210–2.
104. Boisvert S, Raymond F, Godzaridis E, et al. Ray Meta: scalable 124. Zerbino DR, Birney E. Velvet: algorithms for de novo short
de novo metagenome assembly and profiling. Genome Biol read assembly using de Bruijn graphs. Genome Res 2008;18:
2012;13:R122. 821–9.
105. Afiahayati Sato K, Sakakibara Y. MetaVelvet-SL: an exten- 125. Peng Y, Leung HC, Yiu SM, et al. IDBA–a practical iterative de
sion of the Velvet assembler to a de novo metagenomic as- Bruijn graph de novo assembler. In: 14th Annual International
sembler utilizing supervised learning. DNA Res 2015;22: Conference, RECOMB 2010, Lisbon, Portugal, 25-28 April 2010. In
69–77. Research in Computational Molecular Biology. Springer-
106. Namiki T, Hachiya T, Tanaka H, et al. MetaVelvet: an exten- Verlag, Berlin Heidelberg, 2010; vol. 6044, 426–40.
sion of Velvet assembler to de novo metagenome assembly 126. Sczyrba A, Hofmann P, Belmann P, et al. Critical assessment
from short sequence reads. Nucleic Acids Res 2012;40:e155. of metagenome interpretation—a benchmark of computa-
107. Peng Y, Leung HC, Yiu SM, et al. IDBA-UD: a de novo assem- tional metagenomics software. bioRxiv 2017.
bler for single-cell and metagenomic sequencing data with 127. Bowe A, Onodera T, Sadakane K, et al. Succinct de Bruijn
144. Murray RG, Stackebrandt E. Taxonomic note: implementa- 150. Roux S, Hallam SJ, Woyke T, et al. Viral dark matter and
tion of the provisional status Candidatus for incompletely virus-host interactions resolved from publicly available mi-
described procaryotes. Int J Syst Bacteriol 1995;45:186–7. crobial genomes. Elife 2015;4:e08490.
145. Konstantinidis KT, Rosselló-Móra R. Classifying the unculti- 151. Simmonds P, Adams MJ, Benko M, et al. Consensus state-
vated microbial majority: a place for metagenomic data in ment: virus taxonomy in the age of metagenomics. Nat Rev
the Candidatus proposal. Syst Appl Microbiol 2015;38:223–30. Microbiol 2017;15:161–8.
146. Parker CT, Tindall BJ, Garrity GM. International code of no- 152. Simmonds P. Methods for virus classification and the chal-
menclature of prokaryotes. Int J Syst Evol Microbiol 2015. doi: lenge of incorporating metagenomic sequence data. J Gen
10.1099/ijsem.0.000778. Virol 2015;96:1193–206.
147. Federhen S, Clark K, Barrett T, et al. Toward richer metadata 153. Benson DA, Cavanaugh M, Clark K, et al. GenBank. Nucleic
for microbial sequences: replacing strain-level NCBI tax- Acids Res 2017;45:D37–42.
onomy taxids with BioProject, BioSample and Assembly re- 154. Merchant S, Wood DE, Salzberg SL. Unexpected cross-
cords. Stand Genomic Sci 2014;9:1275–7. species contamination in genome sequencing projects. PeerJ
148. Mende DR, Waller AS, Sunagawa S, et al. Assessment of 2014;2:e675.
metagenomic assembly using simulated next generation 155. Tatusova T, Ciufo S, Federhen S, et al. Update on RefSeq micro-