
BME 130 – Genomes

Lecture 5

Genome assembly I
The good old days
Administrivia
Homework 1 – on the website today, due Friday;
homework policy

Student-led paper discussion; choose groups and pick paper

Guest lecture Friday – Bob Kuhn will demo the UCSC genome browser
Genomics in the news
Genomic Fossils Calibrate the Long-Term
Evolution of Hepadnaviruses

Citation: Gilbert C, Feschotte C (2010) Genomic Fossils Calibrate the Long-Term Evolution of Hepadnaviruses. PLoS Biol 8(9): e1000495. doi:10.1371/journal.pbio.1000495
Figure 4.10 Genomes 3 (© Garland Science 2007)
Figure 4.10 part 1 of 2 Genomes 3 (© Garland Science 2007)
Figure 4.10 part 2 of 2 Genomes 3 (© Garland Science 2007)
Sequence assembly
Stages: overlap, layout, consensus

[Figure: reads s1 to s6 passing through the overlap, layout, and consensus stages, shown for de novo assembly and for reference-guided assembly, where the reads are placed along a reference sequence]
de novo sequence assembly
overlap
Most CPU and memory demanding stage

Phrap: “banded” alignment of reads around k-mer matches; tolerates alignment mismatches at low-quality bases
Phusion: group reads sharing >= 11 k-mers of 17 bases
Celera: k-mer seed-and-extend alignment of reads
Arachne: 24-mer seed-and-extend alignment of reads
newbler: flowgram similarities (?)

[Figure: reads s1 to s6 compared against each other]
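The k-mer idea behind several of these tools can be sketched in a few lines. This is a toy illustration in the spirit of the Phusion rule above (group reads that share at least some number of k-mers), not the actual implementation of any of the listed assemblers. The function names, the toy reads, and the small k are assumptions made for the example, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all distinct k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_overlaps(reads, k=17, min_shared=11):
    """Map each read pair to its number of shared distinct k-mers.

    Only pairs sharing at least min_shared k-mers are returned.
    reads is a dict {read_name: sequence}. The defaults mirror the
    numbers on the slide; everything else is a hypothetical sketch.
    """
    # Index: k-mer -> names of the reads that contain it.
    index = defaultdict(set)
    for name, seq in reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    # Count shared k-mers only for reads that co-occur in the index,
    # which avoids comparing every read against every other read.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy reads: s1 and s2 overlap by 6 bases ("CATGCC"); s3 is unrelated.
reads = {
    "s1": "ACGTTGCATGCC",
    "s2": "CATGCCGATAGC",
    "s3": "TTTTTTTTTTTT",
}
print(candidate_overlaps(reads, k=4, min_shared=3))
```

Pairs that pass this filter would then be handed to a real alignment step; the k-mer index only narrows down which pairs are worth aligning.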


de novo sequence assembly
Generate alignments
Find connected components

[Figure: alignments between reads s1 to s6 and the connected components they form]

Wide range of strategies for the layout stage, many using mate-pair information
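The "find connected components" step can be sketched as a graph traversal: reads are nodes, accepted alignments are edges, and each connected component is a group of reads that may belong to the same contig. The overlap list below is invented for illustration and is not read off the slide figure; real layout stages do considerably more than this (ordering, orientation, mate-pair constraints).

```python
from collections import defaultdict, deque


def connected_components(reads, overlaps):
    """Group reads into connected components of the overlap graph.

    reads: iterable of read names.
    overlaps: iterable of (read_a, read_b) pairs that aligned.
    Returns a list of sorted name lists, one per component.
    """
    # Build an undirected adjacency map from the overlap pairs.
    graph = defaultdict(set)
    for a, b in overlaps:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in reads:
        if start in seen:
            continue
        # Breadth-first search from each not-yet-visited read.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical overlaps among six reads.
reads = ["s1", "s2", "s3", "s4", "s5", "s6"]
overlaps = [("s1", "s2"), ("s2", "s5"), ("s3", "s4"), ("s4", "s6")]
print(connected_components(reads, overlaps))
```

A read with no overlaps ends up as a component of size one, which corresponds to a singleton that no contig can absorb.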
de novo sequence assembly
consensus

PHRAP
Consensus base is the base with the highest quality score
Quality score for the position is based on the quality scores of all reads

PCAP/CAP3
Sum up quality scores for each base; take the base with the highest sum
Quality score for the position: highest sum minus all other sums

[Figure: reads s1 to s6 stacked in their layout, with the consensus called column by column]
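The PCAP/CAP3 rule as stated on this slide is simple arithmetic on one alignment column, so a small worked sketch may help. It follows the slide's wording only and is not taken from the PCAP or CAP3 source code; the function name, the input format, and the alphabetical tie-break are assumptions, and gaps are not handled.

```python
from collections import defaultdict


def consensus_column(column):
    """Call one consensus base from an alignment column.

    column: list of (base, quality) tuples, one per read covering
    the position. Returns (consensus_base, consensus_quality), where
    the quality is the highest per-base sum minus all other sums.
    """
    # Sum the quality scores separately for each observed base.
    sums = defaultdict(int)
    for base, quality in column:
        sums[base] += quality

    # Highest sum wins; ties go to the alphabetically first base
    # (an arbitrary choice made for this sketch).
    best = max(sorted(sums), key=sums.get)
    others = sum(sums.values()) - sums[best]
    return best, sums[best] - others


# Two reads say A (30 + 20 = 50), one read says C (10): A wins with 50 - 10.
print(consensus_column([("A", 30), ("A", 20), ("C", 10)]))  # ('A', 40)
# G sums to 35, T to 25: G wins with 35 - 25.
print(consensus_column([("G", 15), ("T", 25), ("G", 20)]))  # ('G', 10)
```

Note how a single high-quality dissenting read lowers the consensus quality sharply, which is the point of subtracting the competing sums.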
Reference-guided sequence assembly

[Figure: reads s1 to s6 aligned along the reference sequence]

Advantages:
(much) faster
(much) less memory

Disadvantages:
Indels/rearrangements
Lack of closely related reference
Bias towards reference similarity

Pop M et al., “Comparative Genome Assembly”, Brief Bioinform. 2004 Sep;5(3):237-48.
Why is this called a sequence
gap and not a physical gap?

Figure 4.11a Genomes 3 (© Garland Science 2007)


Closing a physical gap means
finding a physical clone to
sequence that will span the gap
Genomic DNA is the
template for this PCR

Figure 4.11b Genomes 3 (© Garland Science 2007)


Chromosome walking
(is slow)

Figure 4.12 Genomes 3 (© Garland Science 2007)


PCR from clone library
Insert 1 connects to who?

Figure 4.13 Genomes 3 (© Garland Science 2007)


Figure 4.14 Genomes 3 (© Garland Science 2007)
Figure 4.15 Genomes 3 (© Garland Science 2007)
Figure 4.15a Genomes 3 (© Garland Science 2007)
Figure 4.15b Genomes 3 (© Garland Science 2007)
Figure 4.15c Genomes 3 (© Garland Science 2007)
Figure 4.15d Genomes 3 (© Garland Science 2007)
Assembly can be validated by
mate-pair information
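One way to picture this check: the two reads of a mate pair come from opposite ends of the same insert, so in a correct assembly they should land on opposite strands at roughly the expected insert distance. The sketch below assumes that simplified model; the data layout, the read names, the insert size, and the tolerance are all hypothetical, and distance is approximated by the difference in start coordinates.

```python
def check_mate_pairs(placements, pairs, expected_insert, tolerance):
    """Flag mate pairs whose placement disagrees with the library.

    placements: {read_name: (contig, start, strand)} from the assembly.
    pairs: list of (read_a, read_b) mate pairs.
    Returns a list of (read_a, read_b, reason) for suspicious pairs.
    """
    problems = []
    for a, b in pairs:
        contig_a, pos_a, strand_a = placements[a]
        contig_b, pos_b, strand_b = placements[b]
        if contig_a != contig_b:
            # Not necessarily an error: such pairs can link contigs.
            problems.append((a, b, "different contigs"))
        elif strand_a == strand_b:
            problems.append((a, b, "same strand"))
        elif abs(abs(pos_b - pos_a) - expected_insert) > tolerance:
            problems.append((a, b, "unexpected distance"))
    return problems


# Hypothetical placements for a library with roughly 2 kb inserts.
placements = {
    "m1f": ("c1", 100, "+"), "m1r": ("c1", 2050, "-"),
    "m2f": ("c1", 500, "+"), "m2r": ("c1", 9000, "-"),
    "m3f": ("c1", 300, "+"), "m3r": ("c1", 2300, "+"),
}
pairs = [("m1f", "m1r"), ("m2f", "m2r"), ("m3f", "m3r")]
print(check_mate_pairs(placements, pairs, expected_insert=2000, tolerance=200))
```

Many flagged pairs clustered in one region would suggest a misassembly there, while isolated ones are more likely chimeric clones or mislabeled reads.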

Figure 4.16 Genomes 3 (© Garland Science 2007)


Figure 4.16a Genomes 3 (© Garland Science 2007)
Figure 4.16b Genomes 3 (© Garland Science 2007)
Figure 4.17a Genomes 3 (© Garland Science 2007)
Figure 4.17b Genomes 3 (© Garland Science 2007)
Figure 4.18 Genomes 3 (© Garland Science 2007)
