Lecture 5
Genome assembly I
The good old days
Administrivia
Homework 1 – on the website today, due Friday;
homework policy
[Figure: de novo assembly versus reference-guided assembly of reads s1 to s6]
de novo sequence assembly
Overlap: the most CPU- and memory-demanding stage
Phrap: "banded" alignment of reads around k-mer matches; tolerates alignment mismatches at low-quality bases
Phusion: groups reads sharing >= 11 k-mers of 17 bases
Celera: k-mer seed-and-extend alignment of reads
Arachne: 24-mer seed-and-extend alignment of reads
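A minimal sketch of k-mer based overlap detection, in the spirit of the Phusion rule above (group reads sharing >= 11 k-mers of 17 bases). This is not Phusion's actual implementation: the function names `kmers` and `candidate_overlaps` are hypothetical, only distinct k-mers are counted, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_overlaps(reads, k=17, min_shared=11):
    """Map (read_a, read_b) -> number of shared distinct k-mers.

    reads is a dict of read name -> sequence. Only pairs sharing at
    least min_shared k-mers are returned. Assumption: forward strand
    only, exact k-mer matches.
    """
    # Inverted index: k-mer -> names of the reads that contain it.
    index = defaultdict(set)
    for name, seq in reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    # Every pair of reads listed under the same k-mer shares that k-mer.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

The inverted index avoids comparing every read against every other read directly, which is one reason k-mer seeding is used in this stage. The surviving pairs would then go to a real alignment step.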
Generate alignments between reads s1 to s6
Find connected components
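Finding connected components from pairwise alignments can be sketched with a union-find structure. The function name `connected_components` and the overlap pairs in the usage example are hypothetical; they are not the overlaps drawn on the slide.

```python
def connected_components(reads, overlaps):
    """Group reads into components linked by pairwise overlaps.

    reads is an iterable of read names; overlaps is an iterable of
    (read_a, read_b) pairs. Returns a sorted list of sorted groups.
    """
    reads = list(reads)
    parent = {r: r for r in reads}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two components joined by each overlap.
    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in reads:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical overlaps, for illustration only.
example = connected_components(
    ["s1", "s2", "s3", "s4", "s5", "s6"],
    [("s1", "s2"), ("s2", "s3"), ("s4", "s5")],
)
print(example)
```

Each component is a candidate contig: reads in different components share no alignment and can be laid out independently.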
consensus
PHRAP
Consensus base is the base with the highest quality score
Quality score for the position is based on the quality scores of all reads
PCAP/CAP3
Sum up the quality scores for each base; take the base with the highest sum
Quality score for the position: highest sum minus all other sums
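The two consensus rules can be compared on a single alignment column. This is a sketch under assumptions, not the code of either tool: a column is taken to be a list of (base, quality) pairs, "all other sums" is read as the total of the per-base sums for the non-winning bases, and the function names are hypothetical. The Phrap-style function only picks the base; how Phrap derives the position quality is not specified on the slide and is not modelled here.

```python
from collections import defaultdict


def cap3_style_consensus(column):
    """Return (base, quality) for one alignment column.

    The base with the highest summed quality wins; its quality is that
    sum minus the summed quality of all other bases (an interpretation
    of the slide's rule).
    """
    sums = defaultdict(int)
    for base, qual in column:
        sums[base] += qual
    best = max(sums, key=sums.get)
    others = sum(total for base, total in sums.items() if base != best)
    return best, sums[best] - others


def phrap_style_base(column):
    """Return the base carried by the single highest-quality read."""
    return max(column, key=lambda pair: pair[1])[0]


# Made-up column: two reads say A, one says C, one says G.
column = [("A", 30), ("A", 20), ("C", 40), ("G", 5)]
print(cap3_style_consensus(column))  # ('A', 5)
print(phrap_style_base(column))      # C
```

In this made-up column the two rules disagree: the summed evidence favours A, while the single best read says C.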
Reference-guided sequence assembly
[Figure: reads s1 to s6 placed along the reference sequence]
Advantages: (much) faster; (much) less memory
Disadvantages: indels/rearrangements; lack of a closely related reference; bias towards reference similarity
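A rough sketch of how reads might be placed on a reference: index the reference k-mers once, then look up each read's k-mers and keep the ungapped placement with the fewest mismatches. The function name `place_reads`, its parameters, and the toy sequences are all hypothetical; real read mappers use gapped alignment and handle both strands.

```python
def place_reads(reference, reads, k=8, max_mismatches=2):
    """Map read name -> (start, mismatches) on the reference, or None.

    Assumptions: forward strand only, ungapped placement, exact k-mer
    seeds. reads is a dict of read name -> sequence.
    """
    # Index every k-mer of the reference by its start position(s).
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)

    placements = {}
    for name, seq in reads.items():
        best = None
        for offset in range(len(seq) - k + 1):
            for pos in index.get(seq[offset:offset + k], []):
                start = pos - offset
                # Skip placements that hang off either end.
                if start < 0 or start + len(seq) > len(reference):
                    continue
                window = reference[start:start + len(seq)]
                mismatches = sum(a != b for a, b in zip(seq, window))
                if mismatches <= max_mismatches and (
                    best is None or mismatches < best[1]
                ):
                    best = (start, mismatches)
        placements[name] = best
    return placements


# Toy data, for illustration only.
reference = "ACGTTGCAAGCTTAGGCTAACGGATCC"
reads = {
    "s1": "ACGTTGCAAGCT",  # exact copy of reference[0:12]
    "s2": "CTTAGGGTAACG",  # reference[10:22] with one substitution
    "s3": "TTTTTTTTTTTT",  # shares no 5-mer with the reference
}
print(place_reads(reference, reads, k=5))
```

Each read is handled independently against one fixed index, which is where the speed and memory advantages come from. Because the placement is ungapped and seeded on the reference, a read spanning an indel or rearrangement, or one from a region missing in the reference, may go unplaced, in line with the disadvantages listed above.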