Professional Documents
Culture Documents
Sequence analysis
Fig. 1. Conventional seeds used by ELAND, SOAP and MAQ divide a 32 bp Fig. 2. The single periodic spaced seed full sensitive to two mismatches
read into four substrings. For any alignment within two mismatches, at least over a 32 bp read. For any alignment within two mismatches, at least one
one of six pairs of substrings will match exactly. This method requires three out of the seven subsequences will match exactly. This seed is composed of
hash tables and six lookups for each read and direction (forward or reverse repeating the pattern (111*1**).
complement).
seed with weight (number of ‘care’ positions) 11. Mathematical
a free color error. Periodic seeds allow PerM to maximize efficiency results (Buhler et al., 2003; Xu et al., 2006) have shown that spaced
2515
Table 1. The periodic spaced seed, applied to a read and slid through Table 2. The maximum weights of patterns that are full sensitivity to x SNPs
positions 8–14 six times, covers all the 21 pair of positions exactly once and k free mismatches
periodic seeds store a single index for each locus, requiring only 4.5 bytes work of Kucherov et al. (2005). Most importantly, generalizability drastically
per base to index the entire genome. Running time is dependent linearly on simplifies the seed-design algorithm and provides the framework for variable
the number of queries or slides required of a seed but can be influenced seed extension, which provides a significant increase in seed weight while
exponentially by the seed weight (number of ‘care’ positions). Specifically, maintaining a single index table.
2516
x=1,k=1 1,1
Weight−length ratio
k=3
x=2,k=0 W=19 1313**1***0200**1***1313**0***0200
0.7
k=4
W=18 *3131**2***2000**3***3130**2***200
W=17 **1313**0***0003**1***1300**0***00
0.4
...
...
W=14 ********0002**0***1313**0***0002**
0.1
W=14 *********0020**3***3131**0***0020*
6 7 8 9 11 13 15 17
Length of periodic spaced seed patterns Fig. 4. This figure shows the extension of periodic spaced seed S1,1 ,
composed of the repeating pattern (1111∗∗1∗∗∗), to multiple variable-
Fig. 3. This figure shows the optimal weight–length ratios for different weight spaced seeds that are applied to a 34-color SOLiD reads where 0,
pattern lengths. 1, 2 and 3 represent the four colors. The original seed is shown in the black
boxes, and the extended variable-weight seeds are highlighted at the tail by
2517
Table 3. PerM’s single periodic spaced seeds for SOLiD 34-color reads the human genome efficiently into 4.5 bytes per base, and load it to
14 GB of memory, without the swapping of index tables between
Seed Seed patterns parenthesized Seed disk and memory.
name according to their repeats weight
3.2.2 Running time The total running time of a mapping project
2 (111∗1∗∗)(111∗1∗∗)(111∗1∗∗)(111∗1∗∗) 16 can be divided into two major components:
S1,1 (1111∗∗1∗∗∗)(1111∗∗1∗∗∗)(1111∗) 14
3 (111∗1∗∗1∗∗∗)(111∗1∗∗1∗∗∗)(11) 12 (1) Preprocessing: the time to preprocess the reference genome
S2,0 (1111∗∗1∗∗∗∗)(1111∗∗1∗∗∗∗)(11) 12
(or the reads set) into one or more index tables.
4 (11∗∗∗1∗∗∗∗)(11∗∗∗1∗∗∗∗)(11∗∗∗) 8 (2) Mapping: the total time to find matches in the index tables for
all queried subsequences, and the time to examine all matches
using the full read-genome substring alignments.
Table 4. Three seed families are compared in their ability to map 34-color
PerM’s requirement of a single index table results in faster
SOLiD reads to a preprocessed human genome preprocessing time than methods which use the conventional multi-
seed, multi-table approach. Mapping time consists of two parts:
Seed No. of index No. of queries Seed Extended E(Random Hits) the time to query each seed-induced subsequence and to validate
matches which result in true alignments, and the time to examine
2518
Table 5. The results of mapping 5 million 34-color SOLiD reads to the Table 7. Running time comparison of mapping the Illumina reads with
whole human genome different read lengths and seeds to the whole human genome
The default mode of Bowtie is equivalent to -k 1. The -v k mode is set with -a –best –
Table 6. Running time comparison of mapping the 35 bp SOLiD reads to strata. The tests are performed on Sun, X4600, Opteron, 2.6 GHz, using 15 GB single
the whole human genome node and thread.
2519
Table 8. Comparing of PerM and SOCS in chromosome-level reference Golomb Ruler Design problem, considered likely to be NP-hard.
Thus, although we cannot guarantee optimality over the entire search
Full sensitivity PerM SOCS space, we propose three methods to design high weight multiple
seeds: a constrained exhaustive search, a reduction to the integer
Running time Weight Running time Weight programming problem and a ‘tuples-grouping’ algorithm. These
methods and additional performance analysis and experiments are
2 color mis 11 min 46 s 16–20 14 min 30 s 11 discussed in detail in the Supplementary Material.
1 base + 1 color mis 23 min 0 s 14–19
3 color mis 32 min 41 s 12–16 2 h 20 min 8
The running time includes preprocessing and I/O. The memory usage of both the ACKNOWLEDGEMENTS
programs is <2 GB. The tests are performed on Sun, X4600, Opteron, 2.6 GHz, using The authors would like to thank Dr Zhenyu Xuan and Dr Richard
single node and thread.
McCombie from the Genome Center of Cold Spring Harbor
Laboratory and Dr Andrew Smith at USC for providing us with data
read length or sensitivity level. As shown in Table 7, long reads and helpful suggestions. We would also like to thank Zheng Zhang
allow PerM’s extended seeds to significantly speed up mapping. from ABI as well as graduate students Dazhe Meng and Zach Frazier
from USC for their helpful discussion.
3.3.4 Comparison: PerM with SOCS PerM was compared with
2520
Ma,B. and Yao,H. (2008) Seed optimization is no easier than optimal Golomb Quinlan,A.R. et al. (2008) Pyrobayes: an improved base caller for SNP discovery in
ruler design. In Asia Pacific Bioinformatics Conference (APBC). Available at pyrosequences. Nat. Methods, 5, 179–181.
http://sunflower.kuicr.kyoto-u.ac.jp/apbc2008/acceptpaper.html paper 19. Ronaghi,M. et al. (1998) A sequencing method based on real-time pyrophosphate.
Ma,B. et al. (2002) PatternHunter: faster and more sensitive homology search. Science, 281, 363–363.
Bioinformatics, 18, 440–445. Rumble,S.M. et al. (2009) SHRiMP: accurate mapping of short color-space reads. PLoS
Malaysian Genomics Resource Center (2009) SXOligoSearch. World Wide Comput. Biol., 5, e1000386.
Web electronic publication. Available at http://synasite.mgrc.com.my:8080/sxog/ Shendure,J. and Ji,H. (2008) Next-generation DNA sequencing. Nat. Biotechnol., 26,
NewSXOligoSearch.php 1135–1145.
Mardis,E.R. (2007) ChIP-seq: welcome to the new frontier. Nat. Methods, 4, 613–614. Smith,A.D. et al. (2008) Using quality scores and longer reads improves accuracy of
Marioni,J.C. et al. (2008) RNA-seq: an assessment of technical reproducibility and Solexa read mapping. BMC Bioinformatics, 9, 128–128.
comparison with gene expression arrays. Genome Res., 18, 1509–1517. Sun,Y. and Buhler,J. (2005) Designing multiple simultaneous seeds for DNA similarity
Myers,G. (1999) A fast bit-vector algorithm for approximate string matching based on search. J. Comput. Biol., 12, 847–861.
dynamic programming. J. ACM, 46, 395–415. The MarthLab (2009) Mosaik. World Wide Web electronic publication.
Nicolas, F. and Rivals, E. (2008). Hardness of optimal spaced seed design. J. Comput. Xu,J. et al. (2006) Optimizing multiple spaced seeds for homology search. J. Comput.
Syst. Sci., 74, 831–849. Biol., 13, 1355–1368.
Noé,L. and Kucherov,G. (2004) Improved hit criteria for DNA local alignment. BMC Yang,I.-H. et al. (2004) Efficient methods for generating optimal single and multiple
Bioinformatics, 5, 149–149. spaced seeds. In BIBE ’04: Proceedings of the 4th IEEE Symposium on
Ondov,B.D. et al. (2008) Efficient mapping of applied biosystems solid sequence data Bioinformatics and Bioengineering. IEEE Computer Society, Washington, DC,
to a reference genome for functional genomic applications. Bioinformatics, 24, p. 411.
2776–2777.
2521