Figure: genomic segment with two ~500 bp reads
CS262 Lecture 9, Win07, Batzoglou
Reconstructing the Sequence
(Fragment Assembly)
reads
Lander-Waterman model:
Assuming a uniform distribution of reads, coverage C = 10 results in about 1 gapped region per 1,000,000 nucleotides
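A minimal sketch of the arithmetic behind this claim, assuming the usual Lander-Waterman approximation: N reads of length L cover a genome of length G at coverage C = NL/G, and the expected number of gaps is roughly N * exp(-C). The 500 bp read length is taken from the figure label above; the function names are hypothetical, and the minimum overlap needed to join two reads is ignored.

```python
import math


def expected_gaps(genome_len, read_len, coverage):
    """Approximate expected number of gaps: N * exp(-C), with N = C * G / L."""
    n_reads = coverage * genome_len / read_len
    return n_reads * math.exp(-coverage)


def uncovered_fraction(coverage):
    """Probability that a given base is covered by no read: exp(-C)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Assumed values: 1,000,000 nt of genome, 500 bp reads, coverage C = 10.
    gaps = expected_gaps(genome_len=1_000_000, read_len=500, coverage=10)
    print(f"expected gaps per 1,000,000 nt at C=10: {gaps:.2f}")
```

With these assumed numbers the estimate comes out just under one gap per 1,000,000 nucleotides, in line with the figure quoted above.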
Repeats
Bacterial genomes: 5%
Mammals: 50%
Repeat types:
3×10^9 nucleotides
Error!
Glued together two distant regions
What can we do about repeats?
3×10^9 nucleotides
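Two copies of a repeat R, with flanking regions A, B, C, D:
A R B and C R D (genome is ARB, CRD)
or
A R D and C R B (genome is ARD, CRB) ?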
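A small sketch of why the two arrangements cannot be told apart from short reads alone. It assumes error-free reads of one fixed length and made-up toy sequences for A, B, C, D and R; all names are hypothetical. When no read is long enough to reach from one flank across R into the other flank, both arrangements yield exactly the same set of reads.

```python
def read_set(loci, read_len):
    """All error-free reads (substrings) of length read_len from each locus."""
    return {
        locus[i:i + read_len]
        for locus in loci
        for i in range(len(locus) - read_len + 1)
    }


# Toy sequences, invented for illustration only.
A, B, C, D = "AAGT", "TGCA", "GATC", "ACGT"
R = "CCGGTTAA"  # the repeat, 8 bases long

arrangement_1 = [A + R + B, C + R + D]  # ARB, CRD
arrangement_2 = [A + R + D, C + R + B]  # ARD, CRB

# Reads no longer than len(R) + 1 never touch both flanks of a repeat copy.
short_reads_agree = read_set(arrangement_1, 8) == read_set(arrangement_2, 8)

# Reads that span the whole repeat plus a base on each side do differ.
long_reads_agree = read_set(arrangement_1, 10) == read_set(arrangement_2, 10)

if __name__ == "__main__":
    print("short reads identical:", short_reads_agree)
    print("long reads identical:", long_reads_agree)
```

In this toy setting only reads (or other linking information) that bridge the repeat separate ARB, CRD from ARD, CRB.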
3×10^9 nucleotides
1. Hierarchical – Clone-by-clone
i. Break genome into many long pieces
ii. Map each long piece onto the genome
iii. Sequence each piece with shotgun
map
genome
Goal:
Methods:
• Hybridization
• Digestion
p1 pn
Double digestion:
Cut with enzyme A, enzyme B, then enzymes A + B
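As an illustration of the data a double digestion produces, the sketch below computes the fragment lengths for enzyme A alone, enzyme B alone, and both together on a linear molecule. The cut positions and function names are hypothetical. Mapping is the inverse problem: recovering the cut positions from the three lists of lengths.

```python
def fragment_lengths(molecule_len, cut_sites):
    """Sorted fragment lengths after cutting a linear molecule at cut_sites."""
    bounds = [0] + sorted(set(cut_sites)) + [molecule_len]
    return sorted(right - left for left, right in zip(bounds, bounds[1:]))


def double_digest(molecule_len, sites_a, sites_b):
    """Fragment lengths for enzyme A alone, enzyme B alone, and A + B together."""
    return (
        fragment_lengths(molecule_len, sites_a),
        fragment_lengths(molecule_len, sites_b),
        fragment_lengths(molecule_len, list(sites_a) + list(sites_b)),
    )


if __name__ == "__main__":
    # Hypothetical 100 bp molecule: A cuts at 20 and 70, B cuts at 50.
    only_a, only_b, both = double_digest(100, [20, 70], [50])
    print("A:", only_a)
    print("B:", only_b)
    print("A + B:", both)
```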
Fragment Assembly
(in whole-genome shotgun sequencing)
Given N reads, where N ~ 30 million, we need to use a linear-time algorithm
1. Find overlapping reads
read: a 500-900 long word that comes out of the sequencer
TACA TAGATTACACAGATTAC T GA
|| ||||||||||||||||| | ||
TAGT TAGATTACACAGATTAC TAGA
• Caveat: repeats
A k-mer that occurs N times causes O(N^2) read/read comparisons
ALU k-mers could cause up to 1,000,000^2 comparisons
• Solution:
Discard all k-mers that occur “too often”
• Set cutoff to balance sensitivity/speed tradeoff, according to genome at
hand and computing resources available
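The k-mer filtering idea above can be sketched as follows. This is an illustrative toy, not any assembler's actual implementation: the function name, the parameter `max_occ`, and the choice to count the number of reads containing a k-mer (rather than total occurrences) are assumptions. Reads are indexed by k-mer, k-mers seen in more than `max_occ` reads are skipped, and only reads sharing a surviving k-mer become candidate pairs for alignment.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k, max_occ):
    """Pairs of read indices that share a k-mer seen in at most max_occ reads."""
    index = defaultdict(set)  # k-mer -> indices of reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)

    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_occ:
            continue  # k-mer occurs "too often": likely repeat, skip it
        # Quadratic only in the size of this (bounded) bucket.
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


if __name__ == "__main__":
    toy_reads = ["TAGATTACACAG", "TTACACAGATTA", "GGGGCCCCGGGG"]
    print(candidate_pairs(toy_reads, k=8, max_occ=10))
```

Lowering `max_occ` bounds the work per k-mer at the cost of missing overlaps that are only supported by frequent k-mers, which is the sensitivity/speed tradeoff described above.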
1. Find Overlapping Reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA TAG-TTACACAGATTATTGA
insert A
replace T with C
correlated errors: probably caused by repeats; disentangle overlaps
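One simple way to picture the correction step is a per-column majority vote over reads that are already aligned. This is a simplification for illustration (real assemblers also weigh base quality scores); the function name is hypothetical and the rows are adapted from the alignment above, with '-' marking a gap.

```python
from collections import Counter


def column_consensus(rows):
    """Majority vote per column; a winning '-' (gap) is dropped from the output."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    consensus = []
    for column in zip(*rows):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    aligned = [
        "TAGATTACACAGATTACTGA",
        "TAGATTACACAGATTACTGA",
        "TAGATTACACAGATTATTGA",  # isolated T where the other reads have C
        "TAGATTACACAGATTACTGA",
        "TAG-TTACACAGATTACTGA",  # missing A
    ]
    print(column_consensus(aligned))
```

A vote like this only fixes isolated errors. When the same difference shows up in several reads (correlated errors), the reads probably come from different repeat copies and the overlaps need to be separated rather than voted away.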
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
Note: of course, we don’t know the “color” of these nodes
repeat region
Unique Contig
Overcollapsed Contig
repeat region
b
a
Unitigs:
Gene Myers, 95
Repeats, errors, and contig lengths
• Repeats whose copies differ by more base pairs than the sequencing error rate are OK
We throw out overlaps between two reads in different copies of the repeat
Normal density
Too dense
Overcollapsed
Inconsistent links
Overcollapsed?
supercontig
(aka scaffold)
3. Link Contigs into Supercontigs
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
• PHRAP
• Early assembler, widely used, good model of read errors
• Overlap O(n^2) -> layout (no mate pairs) -> consensus
• Celera
• First assembler to handle large genomes (fly, human, mouse)
• Overlap -> layout -> consensus
• Arachne
• Public assembler (mouse, several fungi)
• Overlap -> layout -> consensus
• Phusion
• Overlap -> clustering -> PHRAP -> assemblage -> consensus
• Euler
• Indexing -> Euler graph -> layout by picking paths -> consensus
Quality of assemblies
• 2001 – present
human (3Gbp), mouse (2.5Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes
"That is impossible, and a bad idea anyway" (Phil Green)
Gene Myers
Genomes Sequenced
• http://www.genome.gov/10002154