2013 July 20 Hitseq

Informed and automated k-mer size selection for genome assembly
Rayan Chikhi, Paul Medvedev
Pennsylvania State University
HiTSeq - July 2013
GENOME ASSEMBLY
Genome assembly is the technique used to reconstruct genome sequences from DNA sequencing.
- Sequenced reads: overlapping sub-sequences, covering the genome redundantly
- Assembly: hypothesis of the genome
[Figure: reads and contigs; high-quality vs low-quality assemblies]
MOTIVATION
Bioinformaticians routinely run assemblers (Allpaths-LG, SOAPdenovo2, Velvet, ...) to study novel organisms.
Most assemblers cut reads into k-mers (de Bruijn graph method).
read: ACTGATGAC
k-mers (k=3): ACT, CTG, TGA, GAT, ATG, TGA, GAC
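As a minimal illustration of cutting a read into k-mers, the sketch below slides a window of length k along the read. The function name `kmers` is chosen here for illustration only; real assemblers typically also handle reverse complements, which this sketch ignores.

```python
def kmers(read, k):
    """Return every length-k substring of the read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read from the slide, cut into 3-mers.
example = kmers("ACTGATGAC", 3)
print(example)  # ['ACT', 'CTG', 'TGA', 'GAT', 'ATG', 'TGA', 'GAC']
```

A read of length L yields L - k + 1 k-mers, so a larger k gives fewer k-mers per read.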
Practical issue: assemblers rely on the user to set the parameter k.
→ What could go wrong if k is incorrectly set?
MOTIVATION: OPTIMAL k NEEDED
Total length and contiguity (NG50) of chr. 14 (88 Mbp) assemblies
Illumina 100 bp paired-end, 70x coverage, assembled by Velvet with several values of k
NG50: the maximum $\ell$ such that $\sum_{|contig_i| \geq \ell} |contig_i| > |genome| / 2$
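The NG50 definition on this slide can be computed directly from a list of contig lengths. The sketch below is one straightforward way to do it, not the code used for the plots; the function name `ng50` and the convention of returning 0 when the contigs never cover half of the genome are assumptions made here.

```python
def ng50(contig_lengths, genome_size):
    """Largest length l such that contigs of length >= l sum to more than genome_size / 2.

    Returns 0 if the contigs never cover more than half of the genome
    (a convention assumed for this sketch).
    """
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered > genome_size / 2:
            return length
    return 0


# Toy contig lengths (made-up numbers): half of the genome is 100.
print(ng50([50, 40, 30, 20, 10], genome_size=200))  # 30
```

Unlike N50, the threshold is half of the genome size rather than half of the assembly size, so assemblies of different total length stay comparable.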
[Figure: two panels, assembly size vs k and NG50 vs k]
Fact: Genome assembly is not robust with respect to k.
Our motivation: help bioinformaticians obtain the best possible assembly by finding the optimal k automatically
EXISTING METHODS TO ESTIMATE BEST k
Velvetk: without looking at the data:
$k_{optim} = \arg\min_k \left| \frac{N_k}{G} - C \right|$
where:
- $N_k$ is the total number of k-mers in the reads,
- $G$ is the estimated genome size, and
- $C$ is the desired target coverage.
Does not know about genome complexity and error rate.
VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50:
$k_{optim} = \arg\max_k \mathrm{N50}_k$
Takes on the order of CPU-years for mammalian genomes.
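The Velvetk-style rule above amounts to a one-line argmin once N_k is known. The sketch below is an illustration of the formula, not Velvetk's actual code: the function names are hypothetical, and N_k is computed from read lengths alone, assuming each read of length L contributes L - k + 1 k-mers.

```python
def total_kmers(read_lengths, k):
    """N_k: total number of k-mers in the reads (reads shorter than k contribute none)."""
    return sum(length - k + 1 for length in read_lengths if length >= k)


def coverage_based_k(read_lengths, genome_size, target_coverage, k_values):
    """Pick the k whose k-mer coverage N_k / G is closest to the target coverage C."""
    return min(
        k_values,
        key=lambda k: abs(total_kmers(read_lengths, k) / genome_size - target_coverage),
    )


# Toy input (made-up numbers): 1000 reads of 100 bp, genome of 1000 bp, target coverage 60.
chosen = coverage_based_k([100] * 1000, genome_size=1000, target_coverage=60,
                          k_values=range(21, 92, 2))
print(chosen)  # 41
```

Only read lengths, G and C enter the computation, which is why this rule cannot react to repeats or sequencing errors.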
Actually, most of the time:
- Bioinformaticians run [assembler] many times with k = 21, ..., 91, or
- “Our colleagues had good results with k = 51 on [some other bacterial dataset]”.
HYPOTHESIS FOR THE OPTIMAL k
In DNA/RNA/metaDNA/metaRNA assembly:
[Figure: sequenced k-mers along a genome containing two copies of a repeat. Ideal world: single contig. Missing k-mers break contigs. Repetitions also break contigs and reduce total assembly size.]
- small k: less chance of missing k-mers
- large k: fewer problematic repetitions, since repeats shorter than k no longer break contigs
- Also, larger k-mers are more likely to contain errors (unusable k-mers)
Our hypothesis: use the largest k-mer size possible (to avoid repetitions), such that the genome is still sufficiently covered by k-mers.
→ So, when are sufficiently many (non-erroneous) k-mers seen?
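To make the point about errors in larger k-mers concrete, the sketch below uses a deliberately simple model that is an assumption of this example, not something stated in the talk: every base is wrong independently with the same probability, so a k-mer is error-free with probability (1 - e)^k.

```python
def error_free_fraction(k, per_base_error_rate):
    """Probability that a k-mer has no sequencing error, assuming independent per-base errors."""
    return (1.0 - per_base_error_rate) ** k


# Hypothetical 1% per-base error rate; the usable fraction shrinks as k grows.
for k in (21, 41, 81):
    print(k, round(error_free_fraction(k, 0.01), 3))
```

Under this model the usable fraction of k-mers decreases geometrically with k, which is the pressure against simply picking the largest k.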
k-MER HISTOGRAMS
Common practice: compute the k-mer abundance histogram.
- x axis: abundance
- y axis: number of k-mers having abundance x (seen x times)
Example reads dataset:
    ACTCA
    GTCA
3-mers: ACT, CTC, TCA, GTC, TCA
Abundance of each distinct 3-mer:
    ACT: 1
    CTC: 1
    TCA: 2
    GTC: 1
3-mer abundance histogram:
    x  y
    1  3
    2  1
    3  0
    4  0
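The toy example above can be reproduced in a few lines. This is an in-memory sketch for small inputs only (dedicated k-mer counters such as Jellyfish or DSK exist precisely because this approach does not scale); the function names are chosen here for illustration.

```python
from collections import Counter


def kmer_abundances(reads, k):
    """Count how many times each distinct k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance x to the number of distinct k-mers seen exactly x times."""
    return Counter(counts.values())


counts = kmer_abundances(["ACTCA", "GTCA"], 3)
histogram = abundance_histogram(counts)
print(dict(counts))     # {'ACT': 1, 'CTC': 1, 'TCA': 2, 'GTC': 1}
print(dict(histogram))  # {1: 3, 2: 1}
```

Abundances with no k-mers (3 and 4 in the table) are simply absent from the dictionary.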
For a dataset and a value of k, methods that build histograms already exist (k-mer counting, e.g. Jellyfish, DSK, ...).
DISSECTION OF A k-MER HISTOGRAM
Chr 14 (≈ 88 Mbp) GAGE dataset; histogram for k = 21
[Figure: k-mer abundance histogram for k = 21 with three labeled regions: erroneous k-mers; genomic non-repeated k-mers; genomic repeated k-mers, sequencing artifacts, ...]
Genomic area ≈ number of distinct k-mers covering the genome ≈ size of the assembly
→ How to determine exactly this area?
HISTOGRAM MODEL
We use Quake’s model [DR Kelley 2010]:
- Erroneous k-mers: Pareto distribution with shape $\alpha$:
  $\mathrm{pdf}(x) = \frac{\alpha}{x^{\alpha+1}}$
- Genomic k-mers: mixture of $n$ Gaussians, weighted by a Zeta distribution of shape $s$:
  $w_1 X_1 + \ldots + w_n X_n$
  $X_i \sim \mathcal{N}(i\mu_1, (i\sigma_1)^2)$
  $P(w_i = k) = k^{-s} / \zeta(s)$
- Full model: mixture weighted by $(p_e, 1 - p_e)$.
Numerical optimization (R) is used to fit the model to actual histograms.
[Figure: k = 21 histogram with the model components]
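To show how the pieces of this model combine, here is a hedged sketch that only evaluates the mixture density for given parameters; it does not perform the fit, which the slide says is done by numerical optimization in R. All function names and the example parameter values are hypothetical, and two simplifications are assumed: the Zeta weights are truncated to n components and renormalized to sum to 1, and the Pareto scale is fixed at 1 as in the pdf shown above.

```python
import math


def pareto_pdf(x, alpha):
    """Pareto density with scale 1: alpha / x^(alpha + 1) for x >= 1, else 0."""
    return alpha / x ** (alpha + 1) if x >= 1 else 0.0


def normal_pdf(x, mean, sd):
    """Gaussian density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def zeta_weights(n, s):
    """Weights proportional to i^(-s) for i = 1..n, renormalized (truncation is an assumption)."""
    raw = [i ** -s for i in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def genomic_pdf(x, mu1, sigma1, s, n):
    """Mixture of n Gaussians; component i has mean i * mu1 and sd i * sigma1."""
    weights = zeta_weights(n, s)
    return sum(w * normal_pdf(x, i * mu1, i * sigma1)
               for i, w in enumerate(weights, start=1))


def model_pdf(x, p_e, alpha, mu1, sigma1, s, n=10):
    """Full model: erroneous part with weight p_e, genomic part with weight 1 - p_e."""
    return p_e * pareto_pdf(x, alpha) + (1.0 - p_e) * genomic_pdf(x, mu1, sigma1, s, n)


# Hypothetical parameters: main genomic peak at abundance 30.
params = dict(p_e=0.1, alpha=2.0, mu1=30.0, sigma1=5.0, s=2.0)
print(model_pdf(30, **params) > model_pdf(15, **params))  # True
```

Component i of the genomic mixture models k-mers that occur i times in the genome, which is why its mean and standard deviation scale with i.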
SEEN SO FAR
- Genome is sufficiently covered by k-mers =⇒ good k value
- Requires knowing the number of genomic k-mers
- Can be estimated with a k-mer histogram and the Quake model
To find the optimal k, one can compare histograms for different values of k.
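Comparing values of k then reduces to an argmax over per-k estimates of the number of genomic k-mers. The sketch below shows only that selection step. Note that the estimator used here is a crude abundance cutoff substituted for the Quake model fit, purely to keep the example short; the function names, the cutoff, and the toy histograms are all made up for illustration.

```python
def genomic_kmers_cutoff_estimate(histogram, min_abundance):
    """Crude stand-in for the model fit: distinct k-mers with abundance >= min_abundance."""
    return sum(n for abundance, n in histogram.items() if abundance >= min_abundance)


def best_k(genomic_kmers_by_k):
    """Return the k with the largest estimated number of genomic k-mers."""
    return max(genomic_kmers_by_k, key=genomic_kmers_by_k.get)


# Toy histograms {abundance: number of distinct k-mers}; made-up numbers, not GAGE data.
toy_histograms = {
    21: {1: 900, 20: 100, 40: 10},
    41: {1: 950, 15: 130, 30: 5},
    81: {1: 990, 3: 60, 6: 2},
}
estimates = {k: genomic_kmers_cutoff_estimate(h, min_abundance=5)
             for k, h in toy_histograms.items()}
chosen_k = best_k(estimates)
print(estimates, chosen_k)  # {21: 110, 41: 135, 81: 2} 41
```

In the toy data the largest k loses most of its genomic k-mers to low abundance, so an intermediate k wins.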
[Figure: three k-mer abundance histograms, for k = 21, k = 41 and k = 81 (x axis: abundance; y axis: number of k-mers)]
Chr 14 (≈ 88 Mbp) GAGE dataset; histograms for three values of k
→ Issue: computing a single histogram (using k-mer counting) is time and memory expensive
SAMPLING HISTOGRAMS
Computing exact k-mer histograms is expensive (= k-mer counting).

Organism       CPU time per k value            Memory usage of
               DSK         Sampling method     sampling method (GB)
S. aureus      2 min       11 sec              0.1
chr14          48 min      7 min               0.1
B. impatiens   7.5 hours   1.2 hours           0.4
We developed a fast and memory-efficient histogram sampling technique.
Sample 1 k-mer out of r, in k-mer space (the same k-mer seen in two different reads will be either consistently sampled or consistently ignored).
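One common way to sample consistently in k-mer space is to decide from a hash of the k-mer itself, so every occurrence of a given k-mer gets the same decision. The sketch below illustrates that idea only; the slides do not say which hash KmerGenie uses, so the choice of `hashlib.blake2b`, the function names, and the rescaling of counts by r are assumptions of this example. Reverse complements are ignored.

```python
import hashlib
from collections import Counter


def is_sampled(kmer, r):
    """Keep a k-mer iff its hash is divisible by r (about 1 k-mer in r, chosen by content)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % r == 0


def sampled_histogram(reads, k, r):
    """Abundance histogram built from sampled k-mers only, rescaled by r."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if is_sampled(kmer, r):
                counts[kmer] += 1
    histogram = Counter(counts.values())
    # Each sampled k-mer stands for roughly r distinct k-mers of the same abundance.
    return {abundance: n * r for abundance, n in histogram.items()}


# With r = 1 every k-mer is kept, so the result equals the exact histogram.
print(sampled_histogram(["ACTCA", "GTCA"], k=3, r=1))  # {1: 3, 2: 1}
```

Because a sampled k-mer is counted at every occurrence, its abundance is exact; only the number of k-mers per abundance is estimated, and memory drops by roughly a factor of r.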
[Figure: exact vs sampled k-mer abundance histogram (x axis: abundance; y axis: number of k-mers)]
- Chr 14 (≈ 88 Mbp), k = 41
- continuous line = exact histogram
- dots = sampled histogram
- sampling errors are visible for low numbers of k-mers (log scale)
TOOLS, DATASETS
Software: KmerGenie (http://kmergenie.bx.psu.edu)
Evaluation on actual datasets from GAGE (assembly benchmark) [Salzberg 2011]:

Dataset           S. aureus   human chr 14   B. impatiens
Genome size       2.9 Mbp     88 Mbp         250 Mbp
Coverage          167x        70x            247x
Avg read length   101 bp      101 bp         124 bp

A typical assembler was selected for each dataset and run for every value of k: Velvet and SOAPdenovo2 [Zerbino 2008, Luo 2013].
KMERGENIE RESULTS: ACCURACY
Predicted best k and predicted assembly size vs actual assembly size and NG50 for 3 organisms (GAGE benchmark).
[Figure: three columns of panels: S. aureus (Velvet), Chr. 14 (Velvet), B. impatiens (SOAPdenovo2). For each dataset: assembly size vs k with the predicted size marked, and NG50 vs k.]
Vertical lines correspond to the predicted best k.
CONCLUSION / PERSPECTIVES
- KmerGenie helps choose the k-mer size for de novo assembly
- Experiments: choices are close to the best possible
- Methods:
  - Best k maximizes the number of genomic k-mers
  - Quake’s statistical model
  - Efficient k-mer histogram sampling
Perspectives:
- Increase robustness (high coverage, longer reads)
- Improve statistical model
- Estimation of Velvet’s cov_cutoff =⇒ zero-parameter assembler
- Extract information from histograms for transcriptomes and metagenomes
USING KMERGENIE
    curl http://kmergenie.bx.psu.edu/kmergenie-1.5397.tar.gz | tar xz
    cd kmergenie-1.5397
    make
Usage for a single file:
    ./kmergenie reads.fastq
Usage for a list of files:
    ls -1 *.fastq > list_reads
    ./kmergenie list_reads
It returns:
    best k: 47
as well as a set of k-mer histograms to visualize.

Thank you for your attention!
