
DNA Sequencing

Method to sequence longer regions

genomic segment

cut many times at random (Shotgun)

Get one or two reads from each segment

~500 bp ~500 bp
CS262 Lecture 9, Win07, Batzoglou
Reconstructing the Sequence
(Fragment Assembly)

reads

Cover region with ~7-fold redundancy (7X)

Overlap reads and extend to reconstruct the original genomic region



Definition of Coverage

Length of genomic segment: L


Number of reads: n
Length of each read: l

Definition: Coverage C=nl/L

How much coverage is enough?

Lander-Waterman model:
Assuming a uniform distribution of reads, C=10 results in about 1 gapped
region per 1,000,000 nucleotides
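
As a rough check of these numbers, the sketch below computes coverage from the definition above and estimates the number of gaps with a simplified Poisson form of the Lander-Waterman model. It assumes read starts are uniform and ignores the minimum detectable overlap. The function names and the 500 bp read length are illustrative choices, not part of the lecture.

```python
import math

def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n*l / L."""
    return n_reads * read_len / genome_len

def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Expected number of gaps under a simplified Lander-Waterman model.

    Assumption: read starts are uniform, so a read ends its contig
    (no other read starts inside it) with probability about e^-C.
    """
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)

# 10X coverage of a 1,000,000-nucleotide region with 500 bp reads
L, l = 1_000_000, 500
n = 10 * L // l
print(coverage(n, l, L))       # 10.0
print(expected_gaps(n, l, L))  # a little under 1 gap
```

With these assumed values the estimate comes out close to the one gapped region per 1,000,000 nucleotides quoted above.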

Repeats
Bacterial genomes: 5%
Mammals: 50%
Repeat types:

• Low-Complexity DNA (e.g. ATATATATACATA…)

• Microsatellite repeats (a_1…a_k)^N where k ~ 3-6
  (e.g. CAGCAGTAGCAGCACCAG)
• Transposons
  - SINE (Short Interspersed Nuclear Elements)
    e.g., ALU: ~300 bp long, 10^6 copies
  - LINE (Long Interspersed Nuclear Elements)
    ~4,000 bp long, 200,000 copies
  - LTR retroposons (Long Terminal Repeats (~700 bp) at each end)
    cousins of HIV

• Gene Families: genes duplicate & then diverge (paralogs)

• Recent duplications: ~100,000 bp long, very similar copies



Sequencing and Fragment Assembly
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT

3×10^9 nucleotides

50% of human DNA is composed of repeats

Error!
Glued together two distant regions

What can we do about repeats?

Two main approaches:


• Cluster the reads

• Link the reads



Sequencing and Fragment Assembly
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT

3×10^9 nucleotides

[Figure: a repeat R occurs twice, flanked by A and B in one copy and by C and D in the other]

ARB, CRD
or
ARD, CRB ?



Strategies for whole-genome sequencing

1. Hierarchical – Clone-by-clone
i. Break genome into many long pieces
ii. Map each long piece onto the genome
iii. Sequence each piece with shotgun

Example: Yeast, Worm, Human, Rat

2. Online version of (1) – Walking


i. Break genome into many long pieces
ii. Start sequencing each piece with shotgun
iii. Construct map as you go

Example: Rice genome

3. Whole genome shotgun

One large shotgun pass on the whole genome

Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog



Hierarchical Sequencing



Hierarchical Sequencing Strategy
a BAC clone

map
genome

1. Obtain a large collection of BAC clones


2. Map them onto the genome (Physical Mapping)
3. Select a minimum tiling path
4. Sequence each clone in the path with shotgun
5. Assemble
6. Put everything together
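
Step 3 can be sketched as a greedy interval cover. This is a minimal illustration, assuming the physical map already places each clone as a (name, start, end) interval on the genome. That representation and the function name are hypothetical, and real tiling paths also require some overlap between neighbouring clones.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy interval cover.

    clones: list of (name, start, end) half-open intervals [start, end).
    Returns the names of a smallest set of clones covering
    [region_start, region_end); raises ValueError if the map has a gap.
    """
    clones = sorted(clones, key=lambda c: c[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting at or before the covered point,
        # take the one that reaches farthest to the right.
        while i < len(clones) and clones[i][1] <= covered:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in the map at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path

clones = [("A", 0, 150), ("B", 100, 250), ("C", 120, 300),
          ("D", 280, 400), ("E", 50, 200)]
print(minimum_tiling_path(clones, 0, 400))  # ['A', 'C', 'D']
```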



Methods of physical mapping

Goal:

Make a map of the locations of each clone relative to one another


Use the map to select a minimal set of clones to sequence

Methods:

• Hybridization
• Digestion



1. Hybridization

p1 pn

Short words, the probes, attach to complementary words

1. Construct many probes


2. Treat each BAC with all probes
3. Record which ones attach to it
4. Same words attaching to BACs X, Y → X and Y overlap
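
Steps 3 and 4 can be sketched as a set intersection over the recorded probe hits. The data layout (a dict from clone name to the set of probes that attached) and the `min_shared` threshold are assumptions for illustration. A probe that falls inside a repeat would suggest a spurious overlap, which this sketch does not handle.

```python
from itertools import combinations

def overlapping_clones(probe_hits, min_shared=1):
    """probe_hits: dict mapping clone name -> set of probes attached to it.

    Returns (clone_x, clone_y, shared_probes) for every pair of clones
    that share at least min_shared probes.
    """
    pairs = []
    for x, y in combinations(sorted(probe_hits), 2):
        shared = probe_hits[x] & probe_hits[y]
        if len(shared) >= min_shared:
            pairs.append((x, y, shared))
    return pairs

hits = {
    "X": {"p1", "p2", "p5"},
    "Y": {"p2", "p5", "p7"},
    "Z": {"p9"},
}
print(overlapping_clones(hits))  # X and Y share p2 and p5
```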



2. Digestion

Restriction enzymes cut DNA where specific words appear

1. Cut each clone separately with an enzyme


2. Run fragments on a gel and measure length
3. Clones C_a, C_b both have fragments of lengths { l_i, l_j, l_k } → C_a and C_b overlap

Double digestion:
Cut with enzyme A, enzyme B, then enzymes A + B



Some Terminology
insert      a fragment that was incorporated in a circular genome,
            and can be copied (cloned)

vector      the circular genome (host) that incorporated the fragment

BAC         Bacterial Artificial Chromosome, a type of insert–vector
            combination, typically of length 100-200 kb

read        a 500-900 bp long word that comes out of a sequencing machine

coverage    the average number of reads (or inserts) that cover a
            position in the target DNA piece

shotgun     the process of obtaining many reads from random locations
sequencing  in DNA, to detect overlaps and assemble

Whole Genome Shotgun Sequencing
genome

cut many times at random

plasmids (2 – 10 Kbp)
cosmids (40 Kbp)

forward-reverse paired reads, known dist

~500 bp ~500 bp

Fragment Assembly
(in whole-genome shotgun sequencing)



Fragment Assembly

Given N reads…
Where N ~ 30 million…

We need to use a linear-time algorithm



Steps to Assemble a Genome

1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence
   ..ACGATTACAATAGGTT..

Some Terminology

read          a 500-900 bp long word that comes out of sequencer

mate pair     a pair of reads from two ends of the same insert fragment

contig        a contiguous sequence formed by several overlapping reads
              with no gaps

supercontig   an ordered and oriented set of contigs, usually by mate
(scaffold)    pairs

consensus     sequence derived from the multiple alignment of reads
sequence      in a contig

1. Find Overlapping Reads
[Figure: each read is cut into its overlapping k-mers (here k = 9), listed as (read, pos., word, orient.) tuples; the list is then sorted by word into (word, read, orient., pos.) tuples, so that reads sharing a word end up adjacent. Example reads: aaactgcagtacggatct, gggcccaaactgcagtac, gtacggatctactacaca]

1. Find Overlapping Reads

• Find pairs of reads sharing a k-mer, k ~ 24


• Extend to full alignment – throw away if not >98% similar

TACA TAGATTACACAGATTAC T GA
|| ||||||||||||||||| | ||
TAGT TAGATTACACAGATTAC TAGA

• Caveat: repeats
  - A k-mer that occurs N times causes O(N^2) read/read comparisons
  - ALU k-mers could cause up to 1,000,000^2 comparisons
• Solution:
  - Discard all k-mers that occur “too often”
    • Set cutoff to balance sensitivity/speed tradeoff, according to genome at
      hand and computing resources available
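
The k-mer step above can be sketched with a hash index. This is a minimal illustration, not the lecture's implementation: it indexes only the forward strand, counts how many reads contain a k-mer rather than total occurrences, and leaves the full alignment and the >98% similarity check to a later stage. The function name and the `max_occurrences` default are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(reads, k=24, max_occurrences=50):
    """Return pairs of read indices that share at least one k-mer.

    k-mers found in more than max_occurrences reads are discarded
    as likely repeats.
    """
    index = defaultdict(set)  # k-mer -> indices of reads containing it
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)

    pairs = set()
    for ids in index.values():
        if len(ids) > max_occurrences:
            continue  # occurs "too often": skip the quadratic blow-up
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
print(candidate_overlaps(reads, k=9))  # pairs (0, 1) and (0, 2)
```

Lowering `max_occurrences` trades sensitivity for speed: with a cutoff of 1 on the three reads above, every shared k-mer is discarded and no candidate pairs remain.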

1. Find Overlapping Reads

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA



1. Find Overlapping Reads
• Correct errors using multiple alignment

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA   replace T with C
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA   insert A

Correlated errors, probably caused by repeats → disentangle overlaps:

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA

is separated into

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA

In practice, error correction removes up to 98% of the errors

2. Merge Reads into Contigs
• Overlap graph:
  - Nodes: reads r_1 … r_n
  - Edges: overlaps (r_i, r_j, shift, orientation, score)

Reads that come from two regions of the genome (blue and red) that contain the same repeat

Note: of course, we don’t know the “color” of these nodes



2. Merge Reads into Contigs

repeat region

Unique Contig

Overcollapsed Contig

We want to merge reads up to potential repeat boundaries



2. Merge Reads into Contigs

repeat region

• Ignore non-maximal reads


• Merge only maximal reads into contigs

2. Merge Reads into Contigs

• Remove transitively inferable overlaps
  - If read r overlaps to the right reads r1, r2, and r1 overlaps r2,
    then (r, r2) can be inferred by (r, r1) and (r1, r2)

[Figure: reads r, r1, r2, r3 laid out left to right]
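
The rule above can be sketched on a plain set of directed edges. As an assumption for brevity, an edge (a, b) only records that read b overlaps read a to the right; the shift, orientation and score fields of the overlap graph are left out, and so is any check that the shifts along the two-step path are consistent.

```python
def remove_transitive_overlaps(edges):
    """edges: set of (a, b), meaning read b overlaps read a to the right.

    Returns the edges that cannot be inferred from a two-step
    path a -> m -> b in the original edge set.
    """
    right = {}
    for a, b in edges:
        right.setdefault(a, set()).add(b)

    kept = set()
    for a, b in edges:
        inferable = any(b in right.get(m, ()) for m in right[a] if m != b)
        if not inferable:
            kept.add((a, b))
    return kept

edges = {("r", "r1"), ("r", "r2"), ("r1", "r2"), ("r1", "r3"), ("r2", "r3")}
print(sorted(remove_transitive_overlaps(edges)))
# [('r', 'r1'), ('r1', 'r2'), ('r2', 'r3')]
```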



2. Merge Reads into Contigs

[Figure: two branch points in the read layout, labelled a and b: repeat boundary, or sequencing error?]

• Ignore “hanging” reads, when detecting repeat boundaries



Overlap graph after forming contigs

Unitigs:
Gene Myers, 95

Repeats, errors, and contig lengths

• Repeats shorter than read length are easily resolved
  - A read that spans across a repeat disambiguates the order of the flanking regions

• Repeats with more base pair diffs than sequencing error rate are OK
  - We throw away overlaps between two reads in different copies of the repeat

• To make the genome appear less repetitive, try to:
  - Increase read length
  - Decrease sequencing error rate

Role of error correction:
Discards up to 98% of single-letter sequencing errors
→ decreases error rate
→ decreases effective repeat content
→ increases contig length

2. Merge Reads into Contigs

• Insert non-maximal reads whenever unambiguous



3. Link Contigs into Supercontigs

Normal density

Too dense
→ Overcollapsed

Inconsistent links
→ Overcollapsed?



3. Link Contigs into Supercontigs

Find all links between unique contigs

Connect contigs incrementally, if ≥ 2 links

supercontig
(aka scaffold)
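
The linking rule can be sketched with a link counter and a union-find structure. This is only a grouping sketch under stated assumptions: each entry of `mate_links` is a hypothetical (contig, contig) pair for one mate pair whose reads land in two different unique contigs, and the ordering and orientation of contigs inside a supercontig are not computed.

```python
from collections import Counter

def link_contigs(mate_links, min_links=2):
    """Group contigs into supercontigs when joined by >= min_links mate pairs."""
    links = list(mate_links)
    # Count links per unordered contig pair.
    counts = Counter(tuple(sorted(pair)) for pair in links if pair[0] != pair[1])

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        find(a)
        find(b)
    for (a, b), count in counts.items():
        if count >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for contig in list(parent):
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(groups.values(), key=sorted)

links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
         ("c4", "c5"), ("c4", "c5"), ("c4", "c5")]
print(link_contigs(links))  # c1+c2 joined, c3 alone, c4+c5 joined
```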

3. Link Contigs into Supercontigs

Fill gaps in supercontigs with paths of repeat contigs



4. Derive Consensus Sequence

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA


TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive multiple alignment from pairwise read alignments

Derive each consensus base by weighted voting

(Alternative: take maximum-quality letter)
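
Weighted voting over an existing multiple alignment can be sketched as below. The input conventions are assumptions: rows are equal-length strings, '-' marks a gap, a space marks a column the read does not cover, and `qualities` holds hypothetical per-base weights (for example, quality scores). Without weights the sketch reduces to a plain majority vote.

```python
from collections import defaultdict

def consensus(aligned_reads, qualities=None):
    """Derive a consensus string by (weighted) voting per column."""
    result = []
    for col in range(len(aligned_reads[0])):
        votes = defaultdict(float)
        for row, read in enumerate(aligned_reads):
            base = read[col]
            if base == " ":
                continue  # this read does not cover the column
            weight = 1.0 if qualities is None else qualities[row][col]
            votes[base] += weight
        if not votes:
            continue
        winner = max(votes, key=votes.get)
        if winner != "-":  # a winning gap contributes no letter
            result.append(winner)
    return "".join(result)

rows = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
]
print(consensus(rows))  # TAGATTACACAGATTACTGA
```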



Some Assemblers

• PHRAP
  - Early assembler, widely used, good model of read errors
  - Overlap O(n^2) → layout (no mate pairs) → consensus
• Celera
  - First assembler to handle large genomes (fly, human, mouse)
  - Overlap → layout → consensus
• Arachne
  - Public assembler (mouse, several fungi)
  - Overlap → layout → consensus
• Phusion
  - Overlap → clustering → PHRAP → assemblage → consensus
• Euler
  - Indexing → Euler graph → layout by picking paths → consensus

Quality of assemblies

Celera’s assemblies of human and mouse


Quality of assemblies—mouse



Quality of assemblies—mouse

Terminology: N50 contig length


If we sort contigs from largest to smallest and start
covering the genome in that order, N50 is the length
of the contig that just covers the 50th percentile.
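
The definition can be sketched directly. One assumption to note: the 50% mark is taken against the total length of the contigs supplied, since the true genome size may not be known.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total first reaches
    half of the summed contig length, scanning largest to smallest."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

print(n50([100, 80, 60, 40, 20]))  # 80
```

Here the contigs sum to 300, and the two largest (100 + 80) are the first to reach half of that, so the N50 contig length is 80.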



Quality of assemblies—rat



History of WGA
• 1982: λ-virus, 48,502 bp

• 1995: h-influenzae, 1 Mbp

• 2000: fly, 100 Mbp

• 2001 – present
  - human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee,
    several fungal genomes

1997
Gene Myers: “Let’s sequence the human genome with the shotgun strategy”
Phil Green: “That is impossible, and a bad idea anyway”

Genomes Sequenced

• http://www.genome.gov/10002154
