2013 July 20 Hitseq
1/15
GENOME ASSEMBLY
Sequenced reads: overlapping sub-sequences, covering the genome redundantly.
Assembly: hypothesis of the genome.
[Figure: reads, contigs, and the resulting assembly; high-quality vs low-quality assemblies]
2/15
MOTIVATION
Read: ACTGATGAC
k-mers (k=3): ACT, CTG, TGA, GAT, ATG, TGA, GAC
Practical issue: assemblers rely on the user to set the parameter k.
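The decomposition of a read into k-mers shown on this slide can be sketched in a few lines of Python. The function name `kmers` is an illustrative choice, not part of any assembler.

```python
def kmers(seq, k):
    """Return all k-mers of seq in order of occurrence, repeats included."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmers("ACTGATGAC", 3))
# ['ACT', 'CTG', 'TGA', 'GAT', 'ATG', 'TGA', 'GAC']
```

A read of length L yields L - k + 1 k-mers, so larger values of k give fewer k-mers per read.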
3/15
MOTIVATION: OPTIMAL k NEEDED
Total length and contiguity (NG50) of chr. 14 (88 Mbp) assemblies
NG50: maximum $\ell$ such that $\sum_{|\mathrm{contig}_i| \geq \ell} |\mathrm{contig}_i|$ is larger than $|\mathrm{genome}|/2$
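As a minimal sketch of the NG50 definition above: sort contigs from longest to shortest and accumulate lengths until the running total exceeds half the genome size. The function name `ng50` and the convention of returning 0 when the threshold is never reached are assumptions made for this example.

```python
def ng50(contig_lengths, genome_size):
    """Largest length L such that contigs of length >= L sum to more
    than half of genome_size; 0 if the assembly is too small."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total > genome_size / 2:
            return length
    return 0


print(ng50([50, 40, 30, 20, 10], 200))  # 30
```

Unlike N50, the threshold depends on the genome size rather than on the assembly size, which makes values comparable across assemblies of different total length.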
Illumina 100bp paired-end 70x coverage, assembled by Velvet with several values of k
[Figure: two panels, assembly size and NG50, each plotted against k]
5/15
EXISTING METHODS TO ESTIMATE BEST k
5/15
HYPOTHESIS FOR THE OPTIMAL k
[Figure: sequenced k-mers along the genome, with repeats marked]
Ideal world: single contig.
Missing k-mers break contigs.
In DNA/RNA/metaDNA/metaRNA assembly:
- small k: less chance of missing k-mers
- large k: fewer repetitions shorter than k
- Also, larger k-mers: more likely to contain errors (unusable k-mers)
Our hypothesis: use the largest k-mer size possible (to avoid repetitions), such that the genome is sufficiently covered by k-mers.
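The "missing k-mers" side of this trade-off can be illustrated on a toy example. The sketch below counts how many distinct k-mers of a known genome also occur in a set of reads, for several values of k. The genome, the reads, and the function names are made up for illustration; in practice the genome is unknown and this quantity has to be estimated from the reads alone.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def covered_genomic_kmers(genome, reads, k):
    """Return (distinct genomic k-mers seen in reads, distinct genomic k-mers)."""
    genome_kmers = kmer_set(genome, k)
    read_kmers = set()
    for read in reads:
        read_kmers |= kmer_set(read, k)
    return len(genome_kmers & read_kmers), len(genome_kmers)


# Toy data (hypothetical): three error-free reads of length 6, overlapping by 2.
genome = "ACGTTGCAAGGCTA"
reads = [genome[0:6], genome[4:10], genome[8:14]]

for k in (3, 4, 7):
    print(k, covered_genomic_kmers(genome, reads, k))
# 3 (12, 12)
# 4 (9, 11)
# 7 (0, 8)
```

With k = 3 every genomic k-mer is present in some read. At k = 4 the k-mers spanning read boundaries go missing, and at k = 7 no read is long enough to contain a k-mer at all.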
6/15
k-MER HISTOGRAMS
7/15
DISSECTION OF A k-MER HISTOGRAM
Chr 14 (≈ 88 Mbp) GAGE dataset; histogram k = 21
[Figure: k-mer histogram for k = 21, with regions labeled "Erroneous k-mers" and "Genomic non-repeated k-mers"]
Genomic area ≈ number of distinct k-mers covering the genome ≈ size of the assembly
→ How to determine this area exactly?
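For reference, a k-mer histogram of this kind records, for each abundance value, how many distinct k-mers occur that many times in the reads. A minimal in-memory sketch follows. The function name is illustrative, and the sketch ignores reverse complements, which real k-mer counters usually merge into canonical k-mers.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each abundance to the number of distinct k-mers with that abundance."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


print(dict(kmer_histogram(["ACTGATGAC"], 3)))
# {1: 5, 2: 1}
```

In the example, TGA occurs twice and the five other 3-mers occur once each.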
8/15
HISTOGRAM MODEL
[Figure: k-mer histogram]
Erroneous k-mers: Pareto distribution with shape $\alpha$, $\mathrm{pdf} = \alpha / x^{\alpha+1}$
Genomic k-mers: mixture of $n$ Gaussians, weighted by a Zeta distribution of shape $s$:
$w_1 X_1 + \ldots + w_n X_n$, with $X_i \sim \mathcal{N}(i\mu_1, (i\sigma_1)^2)$ and $P(w_i = k) = k^{-s}/\zeta(s)$
Full model: mixture weighted by $(p_e, 1 - p_e)$.
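One possible reading of this model as a density function is sketched below, using only the standard library. Several choices here are assumptions rather than details from the slides: the Pareto component has minimum 1, the genomic part is treated as a mixture density in which the i-th Gaussian has weight proportional to $i^{-s}$, and the Zeta weights are renormalized over the $n$ components instead of being divided by $\zeta(s)$. The parameter values in the usage line are arbitrary.

```python
import math


def pareto_pdf(x, alpha):
    """Pareto density with minimum 1 (assumption): alpha / x**(alpha + 1)."""
    return alpha / x ** (alpha + 1) if x >= 1 else 0.0


def normal_pdf(x, mean, sd):
    """Gaussian density N(mean, sd**2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def zeta_weights(n, s):
    """Weights proportional to i**(-s), renormalized over i = 1..n (assumption)."""
    raw = [i ** (-s) for i in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def genomic_pdf(x, mu1, sigma1, n, s):
    """Mixture of n Gaussians; the i-th has mean i*mu1 and std dev i*sigma1."""
    return sum(
        w * normal_pdf(x, i * mu1, i * sigma1)
        for i, w in enumerate(zeta_weights(n, s), start=1)
    )


def full_pdf(x, p_e, alpha, mu1, sigma1, n, s):
    """Erroneous and genomic components mixed with weights (p_e, 1 - p_e)."""
    return p_e * pareto_pdf(x, alpha) + (1 - p_e) * genomic_pdf(x, mu1, sigma1, n, s)


# Arbitrary illustrative parameters, not fitted to any dataset.
print(full_pdf(30.0, p_e=0.1, alpha=2.0, mu1=30.0, sigma1=5.0, n=3, s=1.5))
```

The i-th Gaussian corresponds to k-mers present i times in the genome, which is why its mean is i times the single-copy mean.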
9/15
SEEN SO FAR
To find the optimal k, one can compare histograms for different values of k.
[Figure: three k-mer histograms, for k = 21, k = 41, and k = 81; y-axis: number of k-mers]
10/15
SAMPLING HISTOGRAMS
Computing exact k-mer histograms is expensive (= k-mer counting).
[Figure: number of k-mers (log scale) against abundance]
- Chr 14 (≈ 88 Mbp), k = 41
- continuous line = exact histogram
- dots = sampled histogram
- sampling errors are visible for low numbers of k-mers (log scale)
11/15
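One way to build a sampled histogram is sketched below. It assumes that k-mers are selected by a hash of their sequence and that histogram values are scaled back up by the sampling rate. The slides do not spell out the sampling scheme, so both the scheme and the names `sampled_histogram` and `rate` are assumptions for illustration.

```python
import zlib
from collections import Counter


def sampled_histogram(reads, k, rate):
    """Approximate k-mer histogram from roughly 1 in `rate` distinct k-mers.

    A k-mer is kept when its CRC32 hash is divisible by `rate`, so every
    occurrence of a kept k-mer is counted and its abundance stays exact.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if zlib.crc32(kmer.encode()) % rate == 0:
                counts[kmer] += 1
    histogram = Counter(counts.values())
    # Scale the number of distinct k-mers back up by the sampling rate.
    return {abundance: n * rate for abundance, n in histogram.items()}


# With rate=1 nothing is skipped and the histogram is exact.
print(sampled_histogram(["ACTGATGAC"], 3, rate=1))
```

Sampling by hash rather than by position means only a fraction of the distinct k-mers has to be stored. Abundance values shared by few k-mers are estimated from very few samples, which is consistent with the sampling errors visible at low counts.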
TOOLS, DATASETS
12/15
KMERGENIE RESULTS: ACCURACY
Predicted best k and predicted assembly size vs actual assembly size and
NG50 for 3 organisms (GAGE benchmark).
[Figure: for each of the 3 organisms, predicted size and NG50 plotted against k]
13/15
CONCLUSION / PERSPECTIVES
Perspectives:
- Increase robustness (high-coverage, longer reads)
- Improve statistical model
- Estimation of Velvet’s cov_cutoff ⇒ zero-parameter assembler
- Extract information from histograms for transcriptome and meta-genomes
14/15
USING KMERGENIE
15/15