Fragment Assembly
Dr. Zoya Khalid
Zoya.khalid@nu.edu.pk
DNA Sequencing
• DNA sequencing is the process of determining the precise order
of nucleotides within a DNA molecule.
• It includes any method or technology used to determine the
order of the four bases (adenine, guanine, cytosine, and thymine) in a
strand of DNA.
• The advent of rapid DNA sequencing methods has greatly accelerated
biological and medical research and discovery, for example in:
• Medical Diagnosis
• Forensic Biology
Contd.
• Sequencing an entire genome
(all of an organism’s DNA)
remains a complex task.
• It requires breaking the DNA
of the genome into many
smaller pieces, sequencing the
pieces, and assembling the
sequences into a single long
"consensus" sequence.
Latest technologies
Next Generation Sequencing (NGS)
• 454
• Sequencing by synthesis
• Light emitted and detected on addition of a nucleotide by polymerase
• 400-600 Mb / 10 hour run
• Illumina
• Also sequencing by synthesis
• ~100 Gb/day on one machine
• Uses fluorescently-labeled reversible nucleotide terminators
• Like Sanger, but detects added nucleotides with laser after each step
Latest technologies
• Pacific Biosciences:
• Sequencing by synthesis
• Single molecule sequencing
• Detects addition of single fluorescently-labeled nucleotides by an immobilized
DNA polymerase
• Real-time: reads bases at the rate of DNA polymerase
• 4 hours for sequencing with reads up to 60kb long
When to use Sanger sequencing
Assemble reads
Fragment Assembly
• Problem:
Technology exists to cut DNA into random pieces and to produce enough copies
of each piece for sequencing.
Typical approach:
Sample fragments from these copies and sequence them. This leaves us with the
problem of assembling the pieces.
Computational Task: Fragment Assembly
• To sequence a DNA molecule is to obtain the string of bases that it
contains. In large-scale DNA sequencing we have a long target DNA
molecule (thousands of bp) that we want to sequence.
[Figure: the two strands of the target molecule, one labeled 5' to 3' and the other 3' to 5']
• It is like a puzzle: we do not know which letter from the set
{A, C, G, T} is written on each card, but we do know that cards in the same
position on opposite strands form a complementary pair.
Contd.
• Our goal is to obtain the letters using certain hints, which are
(approximate) substrings of the rows. The long sequence to
reconstruct is called the target.
• In the biological problem, we know the length of the target sequence
approximately, within 10% or so. It is impossible to sequence the
whole molecule directly. However, we may instead get a piece of the
molecule starting at a random position in one of the strands and
sequence it in the canonical (5' → 3') direction for a certain length.
• Each such sequence is called a fragment. It corresponds to a substring
of one of the strands of the target molecule, but we do not know
which strand or its position relative to the beginning of the strand. In
addition, it may contain errors.
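To make the fragment model concrete, here is a minimal Python sketch of drawing fragments from a random strand at a random position, read 5' → 3'. The function names, the fixed fragment length, and the absence of sequencing errors are illustrative assumptions, not part of the lecture.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the opposite strand, written in its own 5' -> 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def sample_fragment(target, length, rng=random):
    """Pick a strand and a start position at random and read `length` bases.

    Assumes error-free reads and length <= len(target); both are
    simplifications made for illustration only.
    """
    strand = rng.choice([target, reverse_complement(target)])
    start = rng.randrange(len(strand) - length + 1)
    return strand[start:start + length]

target = "TTACCGTGC"
fragments = [sample_fragment(target, 4) for _ in range(5)]
print(fragments)  # five random 4-base fragments; the output varies per run
```

Because the strand is chosen at random, an assembler has to consider each fragment both as given and as its reverse complement.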
Shotgun sequence assembly
• In genetics, shotgun sequencing is a method used for sequencing random DNA strands.
• The chain termination method of DNA sequencing ("Sanger sequencing") can only be
used for short DNA strands of 100 to 1000 base pairs. Due to this size limit, longer
sequences are subdivided into smaller fragments that can be sequenced separately, and
these sequences are assembled to give the overall sequence.
• There are two principal methods for this fragmentation and sequencing process. Primer
walking (or "chromosome walking") progresses through the entire strand piece by piece,
whereas shotgun sequencing is a faster but more complex process that uses random
fragments.
• In shotgun sequencing, DNA is broken up randomly into numerous small segments, which
are sequenced using the chain termination method to obtain reads. Multiple overlapping
reads for the target DNA are obtained by performing several rounds of this
fragmentation and sequencing. Computer programs then use the overlapping ends of
different reads to assemble them into a continuous sequence.
Shotgun Sequence Assembly
• By using the shotgun method, we obtain a large number of fragments and
then we try to reconstruct the target molecule's sequence based on
fragment overlap.
• The problem is then to deduce the whole sequence of the target DNA
molecule. Because we have a collection of fragments to put together, this
task is known as fragment assembly
– TTACCGTGC
The fragment assembly problem
• Given: A set of reads (strings) {s1, s2, … , sn}
• Do: Determine a large string s that “best explains” the reads
Paul E. Black, "greedy algorithm", in Dictionary of Algorithms and Data Structures [online],
Paul E. Black, ed., U.S. National Institute of Standards and Technology. 2 February 2005.
http://www.itl.nist.gov/div897/sqg/dads/HTML/greedyalgo.html
How to find the most overlapping Strings
• The greedy algorithm looks simple but is actually difficult to
perform. The difficulty lies in finding the most overlapping strings in a
given set of strings. A naive algorithm checks every ordered pair of strings
and, for each pair, every possible overlap length.
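A naive search for the most overlapping pair can be sketched in Python as follows. The helper names `overlap` and `most_overlapping_pair` are hypothetical, and the sketch assumes error-free reads with exact suffix/prefix matches.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def most_overlapping_pair(reads):
    """Try every ordered pair of distinct reads and keep the largest overlap.

    Ties are broken in favor of the pair found first.
    """
    best = (0, None, None)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b)
                if k > best[0]:
                    best = (k, a, b)
    return best

reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
print(most_overlapping_pair(reads))  # (2, 'ACG', 'CGA')
```

With n reads of length L this examines n(n-1) ordered pairs and up to L overlap lengths per pair, which is why the step is expensive for large read sets.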
• Example
• Reads:
{ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG}
[Figure: example directed graph on vertices 1 to 4, illustrating vertex degree]
degree(v2) = 3
indegree(v2) = 1
outdegree(v2) = 2
Overlap graph
• For a set of sequence reads S, construct a directed
weighted graph G = (V,E,w)
• with one vertex per read (vi corresponds to si)
• edges between all vertices (a complete graph)
• w(vi,vj) = overlap(si,sj) = length of longest suffix of si that is a prefix
of sj
Overlap graph example
• Let S = {AGA, GAT, TCG, GAG}
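[Figure: overlap graph on the vertices AGA, GAT, TCG, GAG]
Nonzero edge weights: AGA -> GAT 2, AGA -> GAG 2, GAG -> AGA 2, GAT -> TCG 1,
TCG -> GAT 1, TCG -> GAG 1, GAG -> GAT 1. All other edges have weight 0.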
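The overlap graph for this example can be built with a short Python sketch. The dictionary representation and the function names are illustrative choices, and exact (error-free) matching is assumed.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads):
    """Complete directed graph as a dict: (s_i, s_j) -> overlap(s_i, s_j)."""
    return {(a, b): overlap(a, b) for a in reads for b in reads if a != b}

graph = overlap_graph(["AGA", "GAT", "TCG", "GAG"])
for (a, b), weight in graph.items():
    if weight > 0:
        print(f"{a} -> {b}: {weight}")
```

Only the nonzero edges are printed; the dictionary itself still holds all 12 ordered pairs, matching the complete-graph definition.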
Assembly as Hamiltonian Path
• Hamiltonian path in the overlap graph defines a superstring for the set of
fragments
• Hamiltonian Path: path through graph that visits each vertex exactly once
• NP-complete
[Figure: overlap graph on AGA, GAT, TCG, GAG with a Hamiltonian path marked]
The path AGA -> GAT -> TCG -> GAG gives the superstring AGATCGAG
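How a Hamiltonian path spells a superstring can be sketched in Python: each read after the first contributes only the characters beyond its overlap with the previous read. The function name `spell_path` is hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def spell_path(path):
    """Merge consecutive reads on a path, writing each overlap only once."""
    superstring = path[0]
    for prev, nxt in zip(path, path[1:]):
        superstring += nxt[overlap(prev, nxt):]
    return superstring

print(spell_path(["AGA", "GAT", "TCG", "GAG"]))  # AGATCGAG
```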
Shortest superstring as TSP
• minimize superstring length → minimize Hamiltonian path length in
overlap graph with edge weights negated
[Figure: overlap graph on AGA, GAT, TCG, GAG with negated edge weights]
Path: GAGATCG
Path length: -5
String length: 7
• Find the shortest path which visits every vertex exactly once
• NP-complete
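For a handful of reads the optimum can still be found by brute force, which also shows why the approach does not scale: all n! vertex orders are tried. This is a sketch under the assumptions that reads are error-free and that no read is a substring of another; the function name is hypothetical.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def shortest_superstring_bruteforce(reads):
    """Spell the superstring of every Hamiltonian path; keep the shortest."""
    best = None
    for order in permutations(reads):
        candidate = order[0]
        for prev, nxt in zip(order, order[1:]):
            candidate += nxt[overlap(prev, nxt):]
        if best is None or len(candidate) < len(best):
            best = candidate
    return best

reads = ["AGA", "GAT", "TCG", "GAG"]
best = shortest_superstring_bruteforce(reads)
print(best, len(best))  # a superstring of length 7
```

For this read set more than one order is optimal: GAGATCG and TCGAGAT both have length 7, so which one is returned depends on the order in which the permutations are tried.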
Hamiltonian Path
1. Does a Hamiltonian path in the overlap graph define a superstring for the set of
fragments?
Yes
2. Is the converse true? Does a superstring define a Hamiltonian path in the overlap graph?
No, since a superstring can contain arbitrary characters that are not
present in any fragment
Greedy Algorithm Examples
• Kruskal’s Algorithm for Minimum Spanning Tree
• Minimum spanning tree: a set of n-1 edges that connects a graph of n vertices
without any cycles and that has minimal total weight
• Kruskal’s algorithm adds the edge that connects two components with the
smallest weight at each step without introducing a cycle
• Proven to give an optimal solution
Greedy Algorithm Examples
[Figure: example graph on vertices 0 to 8]
Weight  Src  Dest
1       7    6
2       8    2
2       6    5
4       0    1
4       2    5
6       8    6
7       2    3
7       7    8
8       0    7
8       1    2
9       3    4
10      5    4
11      1    7
14      3    5
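Kruskal's algorithm on the fourteen weighted edges of this example graph can be sketched in Python as follows. The union-find helper and the function name `kruskal` are illustrative choices rather than anything prescribed by the slides.

```python
def kruskal(num_vertices, edges):
    """edges: list of (weight, u, v). Returns (mst_edges, total_weight)."""
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, total = [], 0
    for weight, u, v in sorted(edges):      # smallest weight first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # edge joins two components: no cycle
            parent[root_u] = root_v
            mst.append((weight, u, v))
            total += weight
            if len(mst) == num_vertices - 1:
                break
    return mst, total

EDGES = [(1, 7, 6), (2, 8, 2), (2, 6, 5), (4, 0, 1), (4, 2, 5),
         (6, 8, 6), (7, 2, 3), (7, 7, 8), (8, 0, 7), (8, 1, 2),
         (9, 3, 4), (10, 5, 4), (11, 1, 7), (14, 3, 5)]

mst, total = kruskal(9, EDGES)
print(total)  # 37
```

The edges 8-6, 7-8 and 1-2 are skipped because both endpoints are already in the same component, leaving 8 edges for the 9 vertices.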
The Greedy Algorithm
• Let G be a graph with fragments as vertices, and no edges to start
• Create a queue, Q, of overlap edges, with edges in order of increasing weight
• While G is disconnected:
• Sort all the edges in increasing order of their weight
• Pick the smallest edge. If it does not form a cycle, include it; otherwise discard it.
• Repeat until there are (v-1) edges
[Figure: overlap graph on AGA, GAT, TCG, GAG with negated edge weights]
The Greedy Algorithm
• GAG -> AGA -2
• AGA -> GAG -2
• AGA -> GAT -2
• GAG -> GAT -1
• TCG -> GAT -1
• TCG -> GAG -1
• GAT -> TCG -1
[Figure: overlap graph on AGA, GAT, TCG, GAG with negated edge weights]
Path: GAGATCG
Path length: -5
String length: 7
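The greedy procedure can be sketched in Python as below. Beyond the cycle check stated on the slide, this sketch also assumes that each read may have at most one chosen outgoing edge and one chosen incoming edge, so that the selected edges form a single path; that constraint and all function names are assumptions made for illustration. Sorting by decreasing overlap is equivalent to sorting the negated weights in increasing order.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(reads):
    """Kruskal-style greedy path construction on the overlap graph."""
    if not reads:
        return ""
    n = len(reads)
    edges = [(overlap(reads[i], reads[j]), i, j)
             for i in range(n) for j in range(n) if i != j]
    edges.sort(key=lambda e: -e[0])          # largest overlap first (stable)

    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    successor, has_pred = {}, set()
    for weight, i, j in edges:
        if i in successor or j in has_pred:  # assumed path constraint
            continue
        root_i, root_j = find(i), find(j)
        if root_i == root_j:                 # edge would close a cycle
            continue
        parent[root_i] = root_j
        successor[i] = (j, weight)
        has_pred.add(j)
        if len(successor) == n - 1:
            break

    # Walk the path from the only read without a predecessor.
    current = next(i for i in range(n) if i not in has_pred)
    result = reads[current]
    while current in successor:
        current, weight = successor[current]
        result += reads[current][weight:]
    return result

print(greedy_assemble(["AGA", "GAT", "TCG", "GAG"]))  # GAGATCG
```

On this input the sketch accepts AGA -> GAT, GAG -> AGA and GAT -> TCG, for a total overlap of 5 and a string of length 7. Greedy selection is a heuristic and is not guaranteed to find the shortest superstring in general.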