Professional Documents
Culture Documents
Abstract 1 Introduction
One emerging application of pyrosequencing [1] is
Pyrosequencing of small subunit ribosomal microbial community profiling using small subunit
RNA amplicons (pyrotags) is rapidly gaining (SSU) rRNA gene PCR amplicons [2]. In this appli-
popularity as the method of choice for pro- cation each read or pyrotag represents a member of
filing microbial communities because it pro- the community and analysis typically involves clus-
vides deep coverage with low cost. However, tering and classification to deduce community struc-
the large amount of data, and errors associ- ture. Recently we and others demonstrated that
ated with the sequencing technology present pyrosequencing errors, in particular homopolymer
significant analytical challenges. Here we de- length errors, can lead to inflated diversity estimates
scribe PyroTagger, a computational pipeline [3, 4]. To avoid interpreting sequencing errors as
for pyrotag analysis. The pipeline consists naturally occurring populations, we recommended
of read quality filtering and length trimming, accuracy trimming of sequences to 0.2% per-base
dereplication, clustering at 97% sequence error probability and clustering reads at a 97% se-
identity, classification and dataset partition- quence identity threshold [4]. The PCR also can in-
ing based on barcodes. To speed up the troduce base errors and more importantly, chimeras
rate-limiting clustering step we developed a of two or more DNA templates that need to be re-
novel purpose-built algorithm called pyro- moved using chimera detection software [5]. Here we
clust. PyroTagger is highly scalable, capa- describe a fast, accurate pipeline for pyrotag clas-
ble of processing hundreds of thousands of sification called PyroTagger that implements these
reads within minutes on a single CPU. Ver- recommendations.
sion 1.0 is available for public use at http:
//pyrotagger.jgi-psf.org/. 2 Methods
PyroTagger requires 3 files to operate; correspond-
ing fasta and qual files from a 454 pyrosequencing
∗ corresponding author run and an associated mapping file that contains the
This article was recommended by names and corresponding barcode-primer sequences
Vanja Klepac-Ceraj as Technical interest of the multiplexed samples. Several sets of these
Morgan N Price as Technical interest
Dirk Gevers as Technical interest
files can be uploaded via web-interface if compar-
Submitted on November 21, 2009 isons need to be made between runs. Compression
Accepted on February 21, 2010 of large files is recommended to reduce upload times
trimmed reads is achieved using hashing, with se- formed in two stages; first a fast search using a word
quences as hash keys and occurrences as hash val- length of 90, then a second slower search using a
ues. This step is implemented to occur together with word length of 30 for clusters lacking matches in
length trimming but is separated here for clarity. the first round. Classified reads are separated into
The non-redundant sequences are sorted and writ- their respective samples using the barcodes provided
ten to an output file in the order of abundance in in the original mapping file. Putatively chimeric
the dataset, most abundant sequences first. This is clusters are identified as sequences having a best
important since the following clustering stage will Blast alignment <90% of the trimmed read length
assume that correct sequences are more abundant to the reference database, >90% sequence identity
than sequence error variants and will therefore ap- to the best Blast match and cluster size ≤2. This
pear earlier in the non-redundant sequence file. provides a conservative estimate of chimeras in the
Clustering. Dereplicated reads are clustered at dataset which are particularly prevalent in lower
a 97% sequence identity threshold and the most abundance clusters [3]. Putative chimeric clusters
abundant unique sequence is used as a cluster rep- are flagged in the cluster classification output files
resentative. We previously established that a 97% (see below). Higher-level taxonomic information is
threshold combined with read quality trimming pro- collated for each sample by summing the number of
duces reasonable estimates of microbial diversity [4]. reads belonging to a given phylum, excluding puta-
Since clustering is the computational bottleneck in tively chimeric reads.
the pipeline, we developed a new two-stage algo- Output files. Six compressed text files are
rithm called Pyroclust to scale with dataset size. emailed to the user: i) cluster classification.xls;
Most algorithms for sequence comparison either at- number of reads for, and classification of, each
tempt to find the best matching hits to a query se- 97% cluster separated by sample, ii) clus-
quence in a database or to find the best alignment ter classification percent.xls; same as output (i)
between two sequences. In contrast, Pyroclust only but counts are replaced by percentages, iii)
attempts to group sequences within a user-specified phylum classification.xls; number of reads for
distance, which does not require finding the best each phylum separated by sample, iv) phy-
match or computing the complete pairwise align- lum classification percent.xls; same as output (iii)
ment. It achieves this by taking advantage of par- but counts are replaced by percentages, v) clus-
ticular properties of the pyrotag data and analysis ter representatives.fasta; the most abundant unique
pipeline, specifically identical sequence length, 5’ po- representative sequence for each 97% cluster and vi)
sitional homology of the reads and a high cluster- clusters2reads.txt; a full mapping of read assign-
ing threshold. The algorithm proceeds in two stages ments to clusters. In addition, by clicking on the
and is summarized in Figure 2. In future releases, provided link, users can access a log file that con-
we may modify or replace this algorithm to improve tains metrics on the progress of their run, such as
performance as datasets become larger. the current stage of the pipeline execution, number
Classification and sample partitioning. Rep- of input reads and the number of reads that pass
resentative sequences from each cluster are classified quality filtering.
by comparison to the greengenes [8] and silva [9] Benchmarking. Performance benchmarking
databases for bacteria/archaea and eukaryotes re- was done on a single CPU of an AMD Opteron 2Gz
spectively. The full taxonomy string for each refer- machine with 12Gb of RAM using a small artificial
ence sequence is exported from the greengenes and community dataset of 46,341 454-FLX reads [3] and
silva databases to allow taxonomic identification of a larger termite hindgut dataset of 229,222 454-FLX
pyrotag sequences. The current version of Pyro- reads [7] trimmed to 225 bp. Both datasets are avail-
Tagger uses BlastN [10] to identify the top hit and able as supplementary information and at http://
its associated taxonomy for each 97% cluster repre- pyrotagger.jgi-psf.org/sequences/. Input file
sentative. To improve performance, BlastN is per- uploading times were measured for a remote DSL
Figure 2: Pseudocode summarizing the Pyroclust algorithm. The current implementation uses words of 6 and 80 nu-
cleotides for short and long words respectively. The source code of the perl implementation of this algorithm is available
for download as part of the PyroTagger pipeline at http://pyrotagger.jgi-psf.org/release/.
the sequence comparisons required for clustering. Accuracy. Classification accuracy has been ex-
To address this bottleneck, we developed a 2-stage tensively addressed elsewhere [12]. To test the
clustering algorithm, Pyroclust, which avoids most accuracy of cluster richness and abundance esti-
pairwise sequence comparisons (Figure 2). The mates, we used the reference dataset of Quince et
complexity of the Pyroclust algorithm depends on al [3]. This is the 46,341 454-FLX read dataset
the number of initial sequences N, number of clus- also used for performance testing (above), which
ters C, sequence length L and a permissible num- was obtained by sequencing a non-equimolar mix of
ber of mismatches M. In the first coarse clustering 90 rRNA clones. Accuracy was determined by co-
stage of the algorithm, the worst case scenario is analyzing the Sanger reference and pyrotag datasets
that every sequence is aligned to every cluster, cre- with PyroTagger to ensure comparability of clus-
ating O(NCLM) complexity. However, in practice ters. The cluster richness and abundance of the
most sequences are not aligned or aligned only to a Sanger clones was independently verified using a
small number of clusters, nearing O(NLM) complex- tree-based assessment of clusters (ARB database
ity. The reason for this is twofold; i) most sequences is available at http://pyrotagger.jgi-psf.org/
do not share long words and thus will not be di- sequences/quince) with weighting applied to ac-
rectly compared to each other, and ii) once a match count for the variable clone concentrations in the
is found, the query sequence is recruited to a cluster mixture. The PyroTagger and tree-based analysis
and only the cluster centroid will be compared from produced 34 and 35 clusters respectively from the 90
this point on. This results in approximately linear reference clones using a 97% identity threshold. Py-
scaling by dataset size. Although the larger dataset roTagger analysis of the 46,341 pyrotags produced
is 5X as large as the smaller dataset tested, the pro- a richness estimate of 39 clusters of which 34 had
cessing time for stage 1 was only ∼2X longer for the matches to Sanger sequences according to the more
larger dataset (Figure 3). accurate tree-based method (Figure 4). Of the 5
In the second fine clustering stage, every coarse clusters with no Sanger match, three were manually
cluster must be aligned to all others, adding verified chimeras that eluded the chimera detection
O(C2 LM) complexity. Therefore, the running time script, and the other two were the result of cluster
of stage 2 is highly dependent on the cluster (species) splitting of borderline cases. No spurious clusters
richness of the sample. For species-rich samples due to low quality sequences were produced indi-
where C is high, stage 2 may become more computa- cating that the quality filtering step is efficient. In
tionally intensive than stage 1. For the two datasets future versions of PyroTagger we will explore the
tested, stage 2 times were substantially shorter than possibility of refining the chimera detection compo-
stage 1 (Figure 3) because cluster richness was nent of the pipeline. The abundance estimates of
moderate in both cases. the weighted Sanger and pyrotag datasets were con-
sistent with abundant and rare ’populations’ being
The final classification and sample partitioning correlated (R2 =0.966; Figure 4) and therefore Py-
module uses Blast and like the initial steps, does roTagger provides a sound prediction of abundance
not impact performance, typically taking two min- distribution in this artificial community.
utes against the greengenes and silva databases.
The performance outlook for PyroTagger looks In summary, PyroTagger will quickly and accu-
sound at least in the near term as it can analyze rately analyze amplicon pyrosequencing datasets up
millions of reads (N) with a comparable length of up to at least hundreds of thousands of reads on a sin-
to 450 bp (L) in a few hours on a single CPU (data gle processor. PyroTagger is not region-specific and
not shown). However, as the sequencing technolo- can be used on any amplified segment of the SSU
gies continue to improve, both N and L will increase rRNA gene. It also can be used on other highly con-
and the performance of PyroTagger will need to be served genes, such as large subunit rRNAs, provided
reevaluated. an appropriate reference database is supplied. The
5 Author contributions
VK performed experiments; VK and PH conceived
experiments; VK and PH analyzed data; VK and
PH wrote the paper.
6 Conflicts of interest
VK, PH declared no conflict of interest.
References
[1] Margulies, M. and Egholm, M. and Altman,
Figure 4: Rank abundance plots of 97% clusters pro- W.E. and Attiya, S. and Bader, J.S. and Be-
duced from 454-FLX data (blue) and 90 associated ref- mben, L.A. and Berka, J. and Braverman, M.S.
erence Sanger sequences (red). and Chen, Y.J. and Chen, Z. and Dewell, S.B.
and Du, L. and Fierro, J.M. and Gomes, X.V.
and Godwin, B.C. and He, W. and Helge-
sen, S. and Ho, C.H. and Ho, C.H. and Irzyk,
program is available as a web-based application at G.P. and Jando, S.C. and Alenquer, M.L. and
http://pyrotagger.jgi-psf.org/ and the source Jarvie, T.P. and Jirage, K.B. and Kim, J.B.
code can be downloaded from http://pyrotagger. and Knight, J.R. and Lanza, J.R. and Leamon,
jgi-psf.org/release/. J.H. and Lefkowitz, S.M. and Lei, M. and Li, J.
and Lohman, K.L. and Lu, H. and Makhijani,
V.B. and McDade, K.E. and McKenna, M.P.
and Myers, E.W. and Nickerson, E. and Nobile,
J.R. and Plant, R. and Puc, B.P. and Ronan,
4 Acknowledgments M.T. and Roth, G.T. and Sarkis, G.J. and Si-
mons, J.F. and Simpson, J.W. and Srinivasan,
M. and Tartaro, K.R. and Tomasz, A. and Vogt,
We thank Amrita Pati and Ed Kirton for assis-
K.A. and Volkmer, G.A. and Wang, S.H. and
tance with writing the pseudocode (Fig. 2) and
Wang, Y. and Weiner, M.P. and Yu, P. and
Chris Quince for supplying the artificial commu-
Begley, R.F. and Rothberg, J.M. Genome se-
nity data and clone frequencies, and members of the
quencing in microfabricated high-density picol-
Microbial Ecology Program at JGI for beta-testing
itre reactors. Nature, 437(7057):376–80, 2005.
PyroTagger. VK was supported in part by NSF
grant OPP0632359. This work was performed un- [2] Sogin, M.L. and Morrison, H.G. and Huber,
der the auspices of the US Department of Energy’s J.A. and Mark Welch, D. and Huse, S.M. and
Office of Science, Biological and Environmental Re- Neal, P.R. and Arrieta, J.M. and Herndl, G.J.
search Program, and by the University of California, Microbial diversity in the deep sea and the un-
Lawrence Berkeley National Laboratory under con- derexplored “rare biosphere”. Proc. Natl. Acad.
tract No. DE-AC02-05CH11231, Lawrence Liver- Sci. U.S.A., 103(32):12115–20, 2006.
more National Laboratory under Contract No. DE-
AC52-07NA27344, and Los Alamos National Labo- [3] Quince, C. and Lanzén, A. and Curtis, T.P.
ratory under contract No. DE-AC02-06NA25396. and Davenport, R.J. and Hall, N. and Head,
I.M. and Read, L.F. and Sloan, W.T. Accurate [11] Ludwig, W. and Strunk, O. and Westram, R.
determination of microbial diversity from 454 and Richter, L. and Meier, H. and Yadhuku-
pyrosequencing data. Nat. Methods, 6(9):639– mar, . and Buchner, A. and Lai, T. and Steppi,
41, 2009. S. and Jobb, G. and Förster, W. and Brettske,
I. and Gerber, S. and Ginhart, A.W. and Gross,
[4] Kunin, V. and Engelbrektson, A. and Ochman, O. and Grumann, S. and Hermann, S. and Jost,
H. and Hugenholtz, P. Wrinkles in the rare bio- R. and König, A. and Liss, T. and Lüssmann,
sphere: pyrosequencing errors lead to artificial R. and May, M. and Nonhoff, B. and Reichel,
inflation of diversity estimates. Environ. Micro- B. and Strehlow, R. and Stamatakis, A. and
biol., 10(3), 2009. Stuckmann, N. and Vilbig, A. and Lenke, M.
and Ludwig, T. and Bode, A. and Schleifer,
[5] Huber, T. and Faulkner, G. and Hugenholtz, P.
K.H. Arb: a software environment for sequence
Bellerophon: a program to detect chimeric se-
data. Nucleic Acids Res., 32(4):1363–71, 2004.
quences in multiple sequence alignments. Bioin-
formatics, 20(14):2317–9, 2004. [12] Liu, Z. and DeSantis, T.Z. and Andersen, G.L.
and Knight, R. Accurate taxonomy assign-
[6] Chou, H.H. and Holmes, M.H. Dna sequence
ments from 16s rrna sequences produced by
quality trimming and vector removal. Bioinfor-
highly parallel pyrosequencers. Nucleic Acids
matics, 17(12):1093–104, 2001.
Res., 36(18):e120, 2008.
[7] Engelbrektson, A. and Kunin, V. and
Wrighton, K.C. and Zvenigorodsky, N.
and Chen, F. and Ochman, H. and Hugenholtz,
P. Experimental factors affecting pcr-based
estimates of microbial species richness and
evenness. ISME J, 2010.