You are on page 1of 20

Informed and automated k -mer size selection

for genome assembly

Rayan Chikhi, Paul Medvedev

Pennsylvania State University

HiTSeq - July 2013

1/15
G ENOME ASSEMBLY

Genome assembly is the technique used to reconstruct genome sequences


from DNA sequencing.

sequenced
reads:
overlapping
sub-sequences,
covering
the genome
redundantly
read

contig
assembly
hypothesis of
the genome

high-quality
vs
low-quality
assemblies

2/15
M OTIVATION

Bioinformaticians routinely run assemblers (Allpaths-LG, Soapdenovo2, Velvet, . . . ) to


study novel organisms.
Most assemblers cut reads into k -mers (de Bruijn graph method).

read ACTGATGAC
ACT
CTG
TGA
k-mers
GAT
(k=3)
ATG
TGA
GAC
Practical issue: assemblers rely on the user to set the parameter k .

→ What could go wrong if k is incorrectly set?

3/15
M OTIVATION : OPTIMAL k NEEDED
Total length and contiguity (NG50) of chr. 14 (88 Mbp) assemblies
NG50: maximum ` such that (
P
|contigi |≥` |contigi |) larger than |genome|/2
Illumina 100bp paired-end 70x coverage, assembled by Velvet with several values of k
8.5e+07

Assembly size
8.0e+07

40 60 80

4000
NG50

2000

0
40 60 80

Fact: Genome assembly is not robust with respect to k .


Our motivation: help bioinformaticians obtain the best possible assembly by
finding optimal k automatically
4/15
E XISTING METHODS TO ESTIMATE BEST k

Velvetk: without looking at the data:


Nk
koptim = argmink (| − C|)
G
where:
Nk (total number of k -mers in the reads),
G (estimated genome size) and
C (desired target coverage).
Does not know about genome complexity and error rate.

VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and


examines N50.
koptim = argmaxk (N50k )
Takes in the order of CPU-years for mammalian genomes.

5/15
E XISTING METHODS TO ESTIMATE BEST k

Velvetk: without looking at the data:


Nk
koptim = argmink (| − C|)
G
where:
Nk (total number of k -mers in the reads),
G (estimated genome size) and
C (desired target coverage).
Does not know about genome complexity and error rate.

VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and


examines N50.
koptim = argmaxk (N50k )
Takes in the order of CPU-years for mammalian genomes.

Actually, most of the time:


- Bioinformaticians run [assembler] many times with k = 21, . . . , 91, or
- “Our colleagues had good results with k = 51 on [some other bacterial
dataset]”.

5/15
H YPOTHESIS FOR THE OPTIMAL k

sequenced
k-mers

ideal world:
single contig

In DNA/RNA/metaDNA/metaRNA assembly:

6/15
H YPOTHESIS FOR THE OPTIMAL k

sequenced
k-mers

ideal world:
single contig

missing k-mers
break contigs

In DNA/RNA/metaDNA/metaRNA assembly:
- small k : less chance of missing k -mers

6/15
H YPOTHESIS FOR THE OPTIMAL k

repeat repeat

sequenced
k-mers

ideal world:
single contig

missing k-mers
break contigs

repetitions also break


contigs and reduce
total assembly size

In DNA/RNA/metaDNA/metaRNA assembly:
- small k : less chance of missing k -mers
- large k : less repetitions shorter than k

6/15
H YPOTHESIS FOR THE OPTIMAL k

repeat repeat

sequenced
k-mers

ideal world:
single contig

missing k-mers
break contigs

repetitions also break


contigs and reduce
total assembly size

In DNA/RNA/metaDNA/metaRNA assembly:
- small k : less chance of missing k -mers
- large k : less repetitions shorter than k
- Also, larger k -mers: more likely to contain errors (unusable k -mers)

Our hypothesis: use the largest k -mer size possible (to avoid repetitions), such that the
genome is sufficiently covered by k -mers.

→ So, when are sufficiently many (non-erroneous) k -mers seen?

6/15
k - MER HISTOGRAMS

Common practice: compute the k -mer abundance histogram.


- x axis: abundance
- y axis: number of k -mers having abundance x (seen x times)

Abundance of each distinct 3-mer:


Example reads dataset: ACT: 1
ACTCA CTC: 1
GTCA TCA: 2
3-mers: GTC: 1
ACT 3-mer abundance:
CTC x y
TCA 1 3
GTC 2 1
TCA 3 0
4 0
For a dataset and a value of k , methods that build histograms already exist (k -mer
counting, e.g. Jellyfish, DSK, . . .).

7/15
D ISSECTION OF A k - MER HISTOGRAM
Chr 14 (≈ 88 Mbp) GAGE dataset; histogram k = 21

k = 21

1e+09
Erroneous k-mers

1e+07
Genomic non-repeated k-mers

Genomic repeated k-mers,


sequencing artifacts, ..
1e+05

0 20 40 60 80 120
Genomic area

number of distinct k -mers covering the genome

size of the assembly
→ How to determine exactly this area?
8/15
H ISTOGRAM MODEL

We use Quake’s model: [DR Kelley 2010]


Erroneous k -mers Pareto distribution with
shape α,
α k = 21

1e+09
pdf =
x α+1
Genomic k -mers Mixture of n Gaussians,

1e+07
weighted by a Zeta distribution
of shape s:

1e+05
w1 X1 + . . . + wn Xn

Xi ∼ N (iµ1 , (iσ1 )2 )
0 20 40 60 80 120
P(wi = k ) = k −s /ζ(s)
Full model Mixture weighted by
(pe , 1 − pe ).

Numerical optimization (R) is used to fit the model to actual histograms.

9/15
S EEN SO FAR

- Genome is sufficiently covered by k -mers =⇒ good k value


- Requires to know the number of genomic k -mers
- Can be estimated with a k -mer histogram and the Quake model

To find the optimal k , one can compare histograms for different values of k .

k = 21 k = 41 k = 81
1e+09

5e+08
1e+04 1e+06 1e+08
Number of kmers

Number of kmers

Number of kmers
1e+07

5e+06
1e+05

5e+04
0 20 40 60 80 120 0 20 40 60 80 120 0 20 40 60 80 120

Abundance Abundance Abundance

Chr 14 (≈ 88 Mbp) GAGE dataset; histograms for three values of k

→ Issue: computing a single histogram (using k -mer counting) is time and


memory expensive

10/15
S AMPLING HISTOGRAMS
Computing exact k -mer histograms is expensive (= k -mer counting).

Organism CPU time per k value


DSK
S. aureus 2min
chr14 48min
B. impatiens 7.5hour

11/15
S AMPLING HISTOGRAMS
Computing exact k -mer histograms is expensive (= k -mer counting).

Organism CPU time per k value Memory usage of


DSK Sampling method Sampling method (GB)
S. aureus 2min 11sec 0.1
chr14 48min 7min 0.1
B. impatiens 7.5hour 1.2hour 0.4

We developed a fast and memory-efficient histogram sampling technique.


Sample 1 k -mer out of r , in k -mer space (the same k -mer seen in two
different reads will be either consistently sampled, either consistently ignored)


1e+08

- Chr 14 (≈ 88 Mbp) k = 41
Number of k−mers

● ●●●●●
- continuous line = exact histogram
●●●● ●●●●●
●●● ●●
●● ●●
- dots = sampled histogram
1e+06

● ●●● ●●
●●●●● ●●


●●


●●
●●
- sampling errors are visible for low
●●●●
●●●
●●●●●● ●
●●● ● ●
number of k -mers (log scale)
● ●●●●●●●●●
1e+04

● ● ●● ●
●●●● ●● ● ● ● ●
● ● ●● ●
●●●●●● ●●●● ●
0 20 40 60 80 100 ●
● ●
Abundance 11/15
TOOLS , DATASETS

Software: KmerGenie (http://kmergenie.bx.psu.edu)

Evaluation on actual datasets from GAGE (assembly benchmark):


[Salzberg 2011]

Dataset S. aureus human chr 14 B. impatiens


Genome size 2.9 Mbp 88 Mbp 250 Mbp
Coverage 167x 70x 247x
Avg read length 101 bp 101 bp 124 bp

Selected a typical assembler for each dataset, executed ∀k :


Velvet and SOAPdenovo2 [Zerbino 2008, Luo 2013]

12/15
K MER G ENIE R ESULTS : ACCURACY

Predicted best k and predicted assembly size vs actual assembly size and
NG50 for 3 organisms (GAGE benchmark).

S. aureus Chr. 14 B. impatiens


Predicted Predicted Predicted
Velvet Velvet SOAPdenovo2
Predicted size

Predicted size

Predicted size
3500000 9.0e+07
3.0e+08
3000000 8.5e+07
2.5e+08
2500000 8.0e+07

20 40 60 20 40 60 80 20 40 60 80
20000 10000
4000
10000 5000
NG50

NG50

NG50
2000

0 0 0
20 40 60 20 40 60 80 20 40 60 80

k k k

vertical lines corresponds to predicted best k

13/15
C ONCLUSION / PERSPECTIVES

- KmerGenie helps choose the k-mer size for de novo assembly


- Experiments: choices are close to the best possible
- Methods:
I Best k maximizes the number of genomic k -mers
I Quake’s statistical model
I Efficient k -mer histogram sampling

Perspectives:
- Increase robustness (high-coverage, longer reads)
- Improve statistical model
- Estimation of Velvet’s cov_cutoff =⇒ zero-parameter assembler
- Extract information from histograms for transcriptome and
meta-genomes

14/15
U SING K MER G ENIE

curl http://kmergenie.bx.psu.edu/kmergenie-1.5397.tar.gz | tar xz


cd kmergenie-1.5397
make

Usage for a single file:


./kmergenie reads.fastq

Usage for a list of files:


ls -1 *.fastq > list_reads
./kmergenie list_reads
It returns:
best k: 47

As well as a set of kmer histograms to visualize.


Thank you for your attention!

15/15

You might also like