Professional Documents
Culture Documents
Adam M. Phillippy
April 30, 2019
@aphillippy
Slides courtesy of:
Michael Schatz
Feb 4, 2019
Lecture 3: Applied Comparative Genomics
Second Generation Sequencing
1. Attach
2. Amplify
Illumina HiSeq 2000
Sequencing by Synthesis
>60Gbp / day
3. Image
http://en.wikipedia.org/wiki/FASTQ_format
Typical sequencing coverage
Coverage
6
5
4
3
2
1
Contig
Reads
Key properties:
• The standard deviation is the
square root of the mean.
• For mean > 5, well approximated
by a normal distribution
Normal Approximation
36-2*sqrt(36) = 24
Picard: http://picard.sourceforge.net
Beware of (Systematic) Errors
2. Practical issues
• Coverage, read length, errors, and repeats
It was the
It was
best
the
of besttimes,
of times,
it was
it was
the worst
the worstofoftimes,
times,ititwas
wasthe
the age
ageofofwisdom,
wisdom,ititwas
wasthe age
the of
agefoolishness,
of foolishness,
… …
It was the
It wasbest of times,
the best it was
of times, thethe
it was worst of times,
worst it it was
of times, was the
the age
age of wisdom, it it
of wisdom, waswas the
the age
age of of foolishness,
foolishness, … …
It was It the
wasbest
the of times,
best it was
of times, thethe
it was worst of times,
worst of times, itit was
was the
the age
age of wisdom,
of wisdom, it was
it was thethe age
age of foolishness,
of foolishness, … …
It It the
was wasbest
the best of times,
of times, it was
it was thethe worst
worst of of times, it was the age of of
wisdom, it it
wisdom, was the
was age of
the agefoolishness, … …
of foolishness,
times, it was the worst The repeated sequence make the correct
was the age of wisdom, reconstruction ambiguous
was the age of foolishness, • It was the best of times, it was the [worst/age]
was the best of times,
worst of times, it was How long will it take to compute the overlaps?
de Bruijn Graph Construction
• Gk = (V,E)
• V = Length-k sub-fragments
• E = Directed edges between consecutive sub-fragments
• Sub-fragments overlap by k-1 words
worst of times, it
times, it was the
of wisdom, it was
foolishness worst
spring of hope
belief
epoch of
light
season of incredulity
darkness
Repeats, repeats, repeats…
Repeat
Repeats, repeats, repeats…
Repeat
●
●
●● ● ●●
● ● ● ●
●
● ●
● ●●●
●● ●●
●
●●● ● ●
●●
●●
● ● ● ● ● ● ● ● ●●
● ●
●
●
●● ● ●● ●
●● ● ● ●
●
●
● ●●
●
● ●
●
● ●
●
●
●●
●● ● ●●●
●● ●
● ●
●
●
●●● ●● ● ●
●●● ● ●●● ● ●
● ●● ● ●●●
● ●
●●●
● ● ●
●●●● ●
●●
●
● ● ●
● ●
●●
●●
● ● ●●●
●
●●
●
●●
●●
● ●●
●●● ● ●●
●●●●●●●
● ●● ●●● ●●
●●● ●●● ●● ●●●
●●●●● ● ●●●●●
●
4e+06
4e+06
● ● ●●● ● ● ● ●●● ●● ● ●
4e+06
●●●● ●● ● ● ● ● ●● ●● ●
● ● ● ● ● ● ●● ●● ●
●
●
●
●
●●●● ●● ●
●● ●
●●●●
● ●
● ●●● ●
●
●●●
●●
●●●
●●
●● ●
●●
●●
●●●
●
●●
●
● ●● ●●●
● ●● ●●●● ●●
●● ●● ●●●
●●
●●● ●
●
●
● ●
●
●●
●
● ●●
● ● ●
●● ● ●●
●●● ●
●●● ● ●
● ● ● ● ● ● ● ● ●● ● ●
●●
●● ●● ●● ●
●●
● ●
●●
●
●● ●● ●●
●● ●
●● ●
●● ●●
●● ●● ●●
●● ●●●
●●●●● ●●
●
●●
●
● ● ●
●
●●
●●
●●
●●●● ●●●●
●●●
● ●● ● ●●
● ● ●● ● ●●
●
●
●
●
●●●
●
●●
●
●
●●
●●
●●
●● ●●
●
●●
● ●
●
●● ●●
● ● ●●● ●●● ●●●●
●● ●●● ●●● ●
●
●●
● ●●●
●
● ●●●●
●
●
●● ●
●●
●● ● ● ● ●●●
● ● ● ●●●●● ●● ● ●●
● ●●●●
●● ●
●●●● ●
●
● ● ●●●● ●
●●
●
●
●●
●●●
●
● ● ●● ●●● ● ●●●●
● ●●
● ●
●●●
●
●
●●
●●● ● ●●
●●
●● ●
●
●
●
●●● ●
●
●
●
●●
●●
●
●
●●●
● ●
● ●●
● ●●
●●●
●●●● ●
●● ●
●● ●● ●
●
●
●●●●
●
● ●
●
●●
●●●
●●
● ●●●
●●● ● ● ● ●
● ● ● ● ●
●
●●●
●● ●●
●●●
●● ●●●
● ●
●●
●●
●●
● ●●
● ●● ● ●
●●●
●●●●●
●● ●●
●● ●●●●● ●
●
●● ●● ●● ● ●
●● ●
●
●●●●●●●
●●●●●
●● ●●
●
●●
● ●●
●●●●
●
●●●
●●●
●● ●●
● ●●
● ●●
●
●●● ●
●●
● ●● ●●●●
●●●● ●●●●●●
● ●●
● ● ●●●● ●●●● ●● ● ●●●
●●● ●
● ● ● ●●
● ●●●
● ●●● ●● ● ● ● ● ● ● ●
●
●●●
● ●●●● ●●
● ●● ●●● ● ●
●●● ●●
●●
●●●● ●●
● ●● ●●●●●● ●● ●
●●●● ●●
●●●● ●●●●●●●●●●●
● ●●
● ●●●
●● ●● ●●●●
● ●●● ●●
●
●
●
●
●
●
●●
●●
●●
● ●
●
●
●
●
●
●●
●● ●●
●●
●
●
●●
●
●
●
●● ●●● ●●● ●
●●●
●
●
●
●●
●
●●●
●
● ●●
●●
●
●●●●
●●
●●
●●
●
●●
●●●
●●
●
●
●
●
●
●
●
●
●
●●
●●●●
●
●
●●
●
●●●● ●●
●●
●
●
●
●
●
●●
●●
●
●
● ●●
●●●
●
●●
●
● ●●
●
●
●
●● ●●
●● ●●
●
●●
●●
●●
●
●● ●
●●
●
●
●●
●
●
●
●
●
●●
●
●●
●
● ●● ●●● ● ●●●●
●
●●
●
● ●
●●●
●● ●
●●
●●●
●●● ●
●
●
●
●
●●●●●
●
●●● ●
●●
●
●
●●
●●
●●
●
●
●
●
●●
●
●● ●
●●● ●
●●
●
●● ●
●
●
●
●
●
●
●●●
●
●
●●●
●
●
●
●● ●
●
●
●
●●
●
●
●
●●
●●●●
●●
●
●●● ●●
● ●
● ●
● ●●
●
●●
●
●
●
●●● ●●
● ●
● ●
● ● ● ● ● ●
●● ●
●●
●●●● ● ●●●● ● ●●●●● ● ● ●●
● ● ●●● ● ●● ● ● ● ● ● ●●● ● ●● ● ●
● ●
●● ●●● ●●● ● ● ● ●●● ●● ●● ● ● ●● ●●● ● ●●●● ●●● ●
● ●● ●●
●● ● ●● ●●●● ●
●● ●●● ●●● ● ● ●● ● ●●●●●● ● ●●
●●
●●
● ●●● ●● ●●● ● ●●●●
●● ●●● ●
● ●● ●●
●● ● ●
●● ●●●
● ●● ● ●
●
● ● ● ●● ● ● ● ●
● ●● ● ● ●
●●
●
●●●
●● ●●
●
●●●
●
●● ●●
●
●●●●
●
● ●
●
●
● ●●
●● ● ● ●
●●
●
●●
●
●●
●●●●
●
●
●
●●
●●●
●●
● ●● ●
●● ●●
●●●●
●
● ●
●
●●●●
●
●●
●●●
●
●●●●
●●●●
●
●●
● ●
●
●●
●●● ●●●
●●●
●
●●
●
● ●
●●●
●●
●●
● ●●
●●
●●●
●
●
●
●●●
●●●●
●●
●● ●●●
● ●● ●●● ● ● ● ●● ●
● ●
●● ●● ●● ●●●●●
●● ●
●●●●●●●● ●●
●
●● ●●●
●● ●
●●●●●●●
● ● ●●●
●●●
●
● ●
●
●●●
●
●
●●
●●
●
● ●●
●●●● ●●
●
●●
●
●●
●
●●●
●●
●
● ●●
●●
●●
● ●
●●
●●
●
●
●
●
●●
●
●●●
●●●
●●●
●
●
●
●●
●●
●●
●
●●
●●●
●●●
●●● ●
● ●●
●●● ●●
●
●●
●
●● ●●
●
●● ●
● ●●
●
● ●●
●●
●●●
● ●●●●●
● ●
●●
●●●●
●
●●●●●
●
●
● ●
●
●
●●
●
●
●
●
●● ●
●
●●
● ●
●
●●●●
●
●●●●
● ●
● ● ●●
●●● ●●
●
● ●
●
● ●
●
● ●●
●
●● ●
● ● ●
● ●
●
●
● ● ●●
● ●
●
●●●
● ●●●●● ●●●● ●●● ● ●
●●● ●●
●●●●●
● ●●
●● ●●●
●●●●
● ●●● ●●●●●●
● ●●●●● ●●●●
●●●●●●●
●
●
●● ●●●
● ●●
●● ● ●●●●
● ● ● ●●
●●●
● ● ●
●●●● ● ●●● ●● ● ●●● ● ● ● ●●● ●●●● ● ● ●●●
● ● ● ● ●●
● ● ●●●● ●●● ●●● ● ●●●● ● ●● ●● ●● ● ● ● ●● ● ●● ● ● ● ●●
●
● ● ●
● ●● ●
● ● ● ●●●● ● ● ●●●
● ● ● ●●●● ● ●
●
●●
●●●●●
●●
●● ●●
●● ●●
●
●
●●
●
●●●●
●
●● ●●
●●
● ●●●●●●
●●
●●●
● ●
●●●●●
●●
●
● ●● ●●●●
●●●
●●
●● ●
●●●●●
●● ● ●
●● ●
●●
●●●●●●●
●
● ●●●
● ●
●
●●●
●
●●●●
●●
●●●●●
●●
●●
●
●
● ●
●●●●
●● ●●
●● ●
●●
●
●●●●
●●
●●●
●
●●
●
● ● ●●
●● ●
●
●● ●
●● ● ● ●●●●
● ● ● ●●●
● ●● ●● ●●
● ●● ●●● ●● ●●●●
●●
● ●
● ●
● ●
●
● ●● ●●●
●●● ● ●●● ● ●
● ●
●
●
● ●● ●●
● ●
●● ● ● ●
● ●
● ●●
●●●
2e+06
2e+06
●
●●●● ● ●●● ●● ● ●●● ● ●● ●●●●●● ● ●●●
● ●●●
●●● ● ● ●●●● ● ●● ●●
●● ● ●
2e+06
●
●
●●
●●● ●●●● ●●
● ●● ●●● ●
●●●
●●●
●●
●
●●
●
●●●●
●●
●●●●
●●
●
●●
●●●
●●
●●
●●
●
●
●
●
●
●
●
●●●
● ●●
●
●
●●● ●
●
●
●●
●●
●●
● ● ●●●
●
●●
●
● ●
●
●●●
● ●●
● ●●
●●●●
●● ●●
●
● ●●●
●●●
●
●●
●
●●●
●
●●
●●
●
● ●
●●●●●
●●
●
●●●●●
●
●●●
●●●
● ●
● ●●
●● ●
● ●
●● ● ●
●
●●●●
●●
●●
●
●
●●●
● ●●
● ●
●● ● ●●
●
●●
● ● ●● ●
●
●●●
●● ●
●●●●
● ● ●●
●●●●●●●
●●●● ●●● ●●●● ● ●●
●●● ●● ●● ● ● ●●
● ● ●● ●●
●● ●
●●●
● ●
●●● ●●
●
● ● ●● ●●
●●●● ● ●●● ●● ● ●●● ● ● ● ●●●
●●●●● ●● ●●●● ● ●●
●● ●
● ●●●● ●● ● ●●● ● ●●●● ● ●● ●●
●● ● ●
● ●
●
●
●
●
●
●
● ●
●
●●
●
●●● ●
●
●●● ●
●
● ●
●
● ●●
●●
●
●
● ● ●
● ● ● ● ●
●
●
●
●
●● ●●●● ●●
● ●●
● ●●
●● ● ● ●●●●●● ●
●●
●●●● ●
●● ● ●● ●● ●●●
●●●●● ●● ●●● ●
●●●
● ● ●●●
●● ●
●●
● ●
●● ●●●● ●●
● ● ● ● ●● ●
●
●
●●
●●
●●
●●
●
●
●●●
●●●● ●●
● ●●
● ●
●●
●●●
●●
●●●
●● ●
●●●●
●●●●
●●
●
● ● ●●
●●●●
● ●●
●●
●
●●
●
●●
●●●
●
●●●
●●
●●●
●
●
● ●
●
●●
●●● ●
●
●
●●●●
●●
●●●
●
● ●
●
●●
●
●●
●●
●
● ●●
● ●●
●●
● ●
●●●●
●●
●
●●●
●●
● ●
●●●
●
● ●
●●
●●●
●●
●●● ●
●
●
● ●
●●●
●●●
●● ●● ●
●●
●● ●
●
●
●●
●●●●
●
●
●● ●
● ● ●●
●
●●● ● ●
● ●
●
●
●
●
●●
● ●●●
● ●
●● ●● ● ●
●●
●
●
●
●●
●
●
●
●
● ●
●
● ● ●●
●
●●●
● ●●●● ●●
● ●● ●● ● ●
●●● ●●●●
●●● ●●●
●● ●
●●
●●●
● ●●● ●
● ●●●
●
●●●
● ●●
●●
●
● ●
●●●
●● ●
●●
●
●●●●
●●
●●●
●●●●●
● ●●
● ●
●●●●● ●●●●
● ● ●
●●●
● ● ●●
● ● ●
● ●
●
●● ●●
●● ●● ●●
●●
●●●
● ●●●●●
●●●●
●●
● ●●●●●
● ●
●
●
●
● ●
●
●
●● ● ●●
●
●
●●
●● ●●
●● ● ●
● ●
●
●●●●
●
●● ●●●
●
●●●
●● ●
● ●●
●●● ●
●
●● ●
●● ●●
● ● ●
●
●● ● ●● ●
● ●
● ● ● ● ●
● ● ●
●
●●
●●
●
●
●●
●●
●●
●●
●●
●●
● ●●
●●●● ●●
●
● ●
●
●●
●●
●●●●
●●
●●●
●
●
●
●●
●
●●●●
●● ●
●●● ●
●●
●●
●
●●
●●
●
● ●●
● ●●
●●● ●
●●
●
●
●●
●●
●
●●
●
●
●●
●●
●
●
● ●
●
●●●
● ●
●
●●● ●●
● ●
●●
●●
●●● ●
●●
●● ●●
●
●●
● ●
● ●
●●
● ●●
● ● ●●●●
●
● ●
●●
●●
●●
●
●●
●
●
●
●● ●
●●●●●
●
●
●●
●
●
●
●●●●●●
●
●
●●
●
●
●●
●
●●●
● ●●●
●●
●
●●●
●●
●
●●
●
●●
●
●
●
●
●●
●
●
●●
● ●●
●●
●●
●●
●
●● ●●
●
●●
●
● ● ●
●● ●
● ●
●
●
● ●
● ●
●
●
●
●● ●●
●
● ● ●
● ●
●
●
●
●
● ●
●
● ●
●
● ● ●● ●● ● ●
●
●
●
●
●
●
●
●●
●
●●
●●
●
●
●●
●
●
● ●●
●●●● ●
●●
●●●
●●
●
●●● ●●● ●
●●●
●●●
●●
●●
●
●●
● ●
●●●●●●
●●●
●●●●
● ● ● ●●●● ●●
●
●●
●●
●●
●
●●
●
● ●●
●
●
●●
● ●●
●
●
●
● ●●
●●●
● ●●● ●●●●●
●●● ●●
●●
● ●
●●
●
●● ●●
●
●●
● ●●
●●●
●
●
●●
● ●●●
●●●●
●
●
●●●●
●
●●
●●
●
●
●
●
●●●
●●●●●
●●●
●
●●●
●●
●
●
●●
●
●● ●
●
●
●
●●●
●
●●
●
●● ●
● ●
●
●●●
●● ●
●
●
●
●
●
●
●●●
●●●
●●●
●
●●
●
●
●
●●
● ●●
●●●●●
●●●●●●●●
●
● ●●
● ●
●● ●●
●
● ●●
●● ● ● ●●
● ● ●●
●
●●
● ●●● ●
●
●
●
●
●● ●
●
●
●
●●
● ● ●
●
●
● ● ●
● ● ●
●
●
●●● ● ●●● ●●● ●● ●● ● ●●● ●●● ●● ● ● ● ●● ●● ●● ●●
●
●●● ●
●●● ● ●●
●●●●
● ●●● ● ●●
●● ●
●●●●●●● ●● ●●●●●● ●● ●● ●●
●● ●●●● ●●●● ●●●●
●●●
● ●●●● ●● ●
●●● ●● ● ●●
●● ●● ●●
●●
● ●●
●
●●
●
● ●●● ●●
● ●
●
●● ● ●
●
●
●●
●
●● ●
●●●●
●
●
●●
● ●●
●● ●
●●
● ●●● ●
●●
●● ●
●●● ●
●●
●●●●●
●
● ●●●
●
● ●● ●
●●
● ●
●
●●●
●● ●●●●●
●● ● ●●
●● ●
●●
●●●●●●
● ●●●
●●
● ●●
●●●●
● ●
●●●●
●●
● ●●●
● ●●
● ●●●
●●
● ●●●●●
●●●
●
●
● ●
●
●
●
● ●
●
● ●● ●● ● ●● ●
● ●
● ● ●
●
●
● ●
● ● ●
● ● ● ●●
●●
●
●
●●
●●
●
●
●● ●●● ●●● ●●●
● ●
●●● ●
● ●
●● ●
● ●●
●● ●●
●● ● ●●●●●
●● ●
● ●●● ● ●●
●●●●●●
●●
●●●● ● ●●● ●● ● ●●● ● ● ● ●●●
●
●●
●
●● ●●●● ●
● ● ● ●●● ●●●● ●
●● ● ●●●
●●●●
●●
●
●
●
● ●●
●● ●
●
●●●●
●
●●●
● ● ●●●● ●● ● ●●● ● ●●●● ● ●● ●●
●
● ●●●
● ●
●●
●
●
●●
● ●
● ● ●● ●● ● ● ●●
●●● ● ●
●● ● ●●
● ●
●
●
● ● ●● ● ● ● ●●
●
●●● ●●●
●● ● ●
●●
● ●●
●●●● ●●●●●●●
●●●
● ●●
●●●● ●●
●● ●●●
● ●●
●●
●● ●●
●●●● ●●
● ●
●● ●●●
●● ●●●●
●●
●
● ●●
●●●
●
●● ●●●
●●
●●
●● ●
●●●●
●●●
●●●●●
●
●●●
●● ●●
● ●● ● ●●
●
●●
● ●●●
● ●
● ●
● ●
●●
●●● ●
●
● ●●● ●
●● ●●● ●
●
● ●● ●●
●●
●
● ● ● ●●●●● ●●
● ●
●
●●●● ● ●●●● ●●● ●● ●●●● ● ●●
●●● ● ●●●●
●●●
●
●●● ●
●
●
●
●●
● ●
●●
●● ●
●
●● ●
●
●
●●● ● ●●● ● ●● ● ●●
0e+00
0e+00
0e+00
●●● ● ●
●●
●●
● ●● ●●
● ●●●
●● ●●●●
●●●● ●●
●●●●●
● ●●
●●
● ●● ●●●
● ●●
●● ● ●● ●● ●●
●● ●●●●●●●●●●●
●
●
● ●●●
● ●
●● ●●
●
● ●● ●●
●● ●●● ●●
●
●
●
● ●●
●●
●●
● ●
●
●
● ● ●
● ●
●
● ●
●
●
●
●
● ●● ●● ●●
● ● ●●
●●●
●●
●
●●●
●
●
●●
●
●●
● ●
●
●●
●
●●●●
●
● ●● ●
●
●●
●
●
●
●●
●
● ●
●
●
●
● ● ●● ●●● ● ●●●●
●●
●● ● ●●
●●●● ●●
●●
● ● ●
● ●●●
●●
●
●
●
●
●
●
●●●●
● ●
●
●
●
●●●● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●● ●●
●● ● ● ●●● ● ●
●● ● ● ● ● ●
● ● ●●●●● ●●
● ●●
●● ●
● ● ●
● ● ●
● ●
●
●
●● ● ● ●●● ●● ● ●●● ● ● ● ●●●
● ●
● ●●●● ●●
● ●●●●
● ●●●● ●●●●●
●●●●●
● ●●
● ●● ●●●
● ● ●●●● ●● ● ●●● ● ●●●● ● ●● ●●
●●● ●●● ●
●● ●●●●● ●●●●●●●●●●●
● ●●●
● ●●●
●● ● ●●●●
●
●
●
●
●●●●●●●● ● ● ●●
● ● ● ●● ● ●●
●
● ●●
● ●
● ●
●●● ●
● ●
● ●
● ●
● ●● ●●●● ●●
● ●●●●
● ●●● ● ●● ●
●
●●●●●●
● ●●
●●●●●
●●●●●●● ●
● ●●●
●●● ●
● ●●
●● ●●●●●
●●●●
●●●●
● ●●●●●
●
● ●●
● ●
● ●●●●
● ●●
●● ● ●●●●●
●●●
●
●
●
20 161 171
0 0 16
Reducing assembly complexity of microbial genomes with single-
molecule sequencing. Koren et al. (2013) Genome Biology.
E. coli (k=50)
B
ARBRCRD
A R D or
ARCRBRD
C
1000
A 300 100 45 45 30 20 15 15 10 . . . . .
B 45 45 30 20 15 15 10 . . . . . 3
A
• Better sequence
300 for all100downstream
45 45 30 20 analysis
15 15 10 . . . . .
ATTA
GATT
TACA
TTAC
Pop Quiz 1
Assemble these reads using a de Bruijn graph approach (k=3):
GAT
ATTA: ATT -> TTA
ATT
GATT: GAT -> ATT
TTA
TACA: TAC -> ACA
TAC
TTAC: TTA -> TAC
ACA
GATTACA
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
ACGA
ACGT
ATAC
CGAC
CGTA
GACG
GTAT
TACG
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
ACGA
ACGT CGA
ATAC
ACG
CGAC
CGTA
GACG
GTAT
TACG
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
ACGA
ACGT CGA
ATAC
ACG
CGAC
CGTA CGT
GACG
GTAT
TACG
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
ACGA GAC
ACGT CGA
ATAC
ACG
CGAC
CGTA CGT
GACG
GTAT
TACG
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
ACGA GAC
ACGT CGA
ATAC
ACG
CGAC
CGTA CGT
GACG
GTAT
TACG
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
ACGA GAC
ACGT CGA
ATAC
ACG
CGAC
CGTA CGT
GACG GTA
GTAT
TACG
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
ACGA GAC
ACGT CGA
ATAC
ACG
CGAC
CGTA CGT
GACG GTA
GTAT TAT
TACG
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
ACGA GAC
ACGT TAC CGA
ATAC
ACG
CGAC
CGTA CGT
GACG GTA
GTAT TAT
TACG
Pop Quiz 2
Assemble these reads using a de Bruijn graph approach (k=3):
• Metagenomes
• Sequencing assays
• Structural variations
• Transcript assembly
• …
Why are genomes hard to assemble?
1. Biological:
• (Very) High ploidy, heterozygosity, repeat content
2. Sequencing:
• (Very) large genomes, imperfect sequencing
3. Computational:
• (Very) Large genomes, complex structure
4. Accuracy:
• (Very) Hard to assess correctness
Assembling a Genome
1. Shear & Sequence DNA
1M
Contig Length
dog N50 +
100k
dog mean +
Expected Contig Length (bp)
panda N50 +
panda mean +
10k
Expected
1k
1000 bp
710 bp
250 bp
100
100 bp
52 bp
30 bp
0 5 10 15 20 25 30 35 40
Read Coverage
Read Coverage
High coverage is required Reads & mates must be longer Errors obscure overlaps
– Oversample the genome to ensure than the repeats – Reads are assembled by finding
every base is sequenced with long – Short reads will have false overlaps kmers shared in pair of reads
overlaps between reads forming hairball assembly graphs – High error rate requires very short
– Biased coverage will also fragment – With long enough reads, assemble seeds, increasing complexity and
assembly entire chromosomes into contigs forming assembly hairballs
0.015
Coverage
!
6
5
Error k- !
!
!!!
! !
!
!
mers
0.010
4
! !
! !
3 ! !
True k-
Density
2 ! !
1
!
!
!
!
mers
0.005
! !
! !
Contig
! !
! !
! !
! ! !
! !
! !
! !
Reads ! !
! !!
0.000
! !!! !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!! !!!!!!!!!!!!
0 20 40 60 80 100
Coverage
Even though the reads are not assembled or aligned (or reference available),
Kmer counting is an effective technique to estimate coverage & other genome properties
Quake: quality-aware detection and correction of sequencing reads.
Kelley, DR, Schatz, MC, Salzberg SL (2010) Genome Biology. 11:R116
Heterozygous Kmer Profiles
Error k- !
!!!
! !
!
mers
! !
0.010
! !
! !
! !
True k-
Density
! !
! ! mers
! !
0.005
! !
! !
! !
! !
! !
! ! !
! !
! !
! !
! !
! !!
0.000
! !!!
!
!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!! !!!!!!!!!!!!
0 20 40 60 80 100
Coverage
worst of times, it
(Chaisson,
2009)
Repetitive regions
Repeat Type Definition / Example Prevalence
Mate-pair sequencing
• Circularize long molecules (1-10kbp), shear into fragments, & sequence
• Mate failures create short paired-end reads
10kbp
10kbp
circle
2x100 @ 300bp (innies)
Scaffolding
• Initial contigs (aka unipaths, unitigs)
terminate at B
• Coverage gaps: especially extreme GC
• Conflicts: errors, repeat boundaries A D
R
• Use mate-pairs to resolve correct order
through assembly graph C
• Place sequence to satisfy the mate constraints
• Mates through repeat nodes are tangled
AGA TTA A Z
GAA GTT B Y
AAG AGT R1 R2
TAA GTC X C
ATA TCC
W D