
Lecture 5

Fragment Assembly
Dr. Zoya Khalid
Zoya.khalid@nu.edu.pk
DNA Sequencing
• DNA sequencing is the process of determining the precise order
of nucleotides within a DNA molecule.
• It includes any method or technology that is used to determine the
order of the four bases (adenine, guanine, cytosine, and thymine) in a
strand of DNA.
• The advent of rapid DNA sequencing methods has greatly accelerated
biological and medical research and discovery:
• Medical Diagnosis
• Forensic Biology
Contd…
• Sequencing an entire genome
(all of an organism’s DNA)
remains a complex task.
• It requires breaking the DNA
of the genome into many
smaller pieces, sequencing the
pieces, and assembling the
sequences into a single long
"consensus."
Latest technologies
• 454
Next Generation Sequencing (NGS)
• Sequencing by synthesis
• Light emitted and detected on addition of a nucleotide by polymerase
• 400-600 Mb / 10 hour run

• Illumina
• Also sequencing by synthesis
• ~100 Gb/day on one machine
• Uses fluorescently-labeled reversible nucleotide terminators
• Like Sanger, but detects added nucleotides with laser after each step
Latest technologies
• Pacific Biosciences:
• Sequencing by synthesis
• Single molecule sequencing
• Detects addition of single fluorescently-labeled nucleotides by an immobilized
DNA polymerase
• Real-time: reads bases at the rate of DNA polymerase
• 4 hours for sequencing with reads up to 60kb long
When to use Sanger sequencing

Next generation sequencing technology is now preferred for certain jobs.


Those include:
•sequencing more than 100 genes simultaneously
•expanding the number of targets to find novel variants

Sanger sequencing is still a good choice when:


•sequencing single genes
•sequencing amplicon targets up to 100 base pairs
•sequencing 96 samples or fewer
•identifying microbes
•analyzing fragments
•analyzing short tandem repeats (STRs)
Shotgun Sequencing Fragment Assembly
Multiple copies of sample DNA

Randomly fragment DNA

Sequence sample of fragments

Assemble reads
Fragment Assembly
• Problem:

• With current technology it is impossible to directly sequence DNA
fragments longer than a few hundred bases.

• There is technology to cut DNA into random pieces and to produce enough
copies of each piece.

Typical approach:
Sample fragments from these pieces and sequence them. This leaves us with the
problem of assembling the pieces.
Computational Task: Fragment Assembly
• To sequence a DNA molecule is to obtain the string of bases that it
contains. In large-scale DNA sequencing we have a long target DNA
molecule (thousands of bp) that we want to sequence.

[Figure: the two strands of the target, one running 5' to 3' and the other 3' to 5']

• It is like a puzzle: we do not know which letter from the set
{A, C, G, T} is written on each card, but we do know that cards in the same
position on opposite strands form a complementary pair.
Contd…
• Our goal is to obtain the letters using certain hints, which are
(approximate) substrings of the rows. The long sequence to
reconstruct is called the target.
• In the biological problem, we know the length of the target sequence
approximately, within 10% or so. It is impossible to sequence the
whole molecule directly. However, we may instead get a piece of the
molecule starting at a random position in one of the strands and
sequence it in the canonical (5' → 3') direction for a certain length.
• Each such sequence is called a fragment. It corresponds to a substring
of one of the strands of the target molecule, but we do not know
which strand or its position relative to the beginning of the strand. In
addition, it may contain errors.
Shotgun sequence assembly
• In genetics, shotgun sequencing is a method used for sequencing random DNA strands.

• The chain termination method of DNA sequencing ("Sanger sequencing") can only be
used for short DNA strands of 100 to 1000 base pairs. Due to this size limit, longer
sequences are subdivided into smaller fragments that can be sequenced separately, and
these sequences are assembled to give the overall sequence.
• There are two principal methods for this fragmentation and sequencing process. Primer
walking (or "chromosome walking") progresses through the entire strand piece by piece,
whereas shotgun sequencing is a faster but more complex process that uses random
fragments.
• In shotgun sequencing, DNA is broken up randomly into numerous small segments, which
are sequenced using the chain termination method to obtain reads. Multiple overlapping
reads for the target DNA are obtained by performing several rounds of this
fragmentation and sequencing. Computer programs then use the overlapping ends of
different reads to assemble them into a continuous sequence.
Shotgun Sequence Assembly
• By using the shotgun method, we obtain a large number of fragments and
then we try to reconstruct the target molecule's sequence based on
fragment overlap.

• Depending on experimental factors, fragment length can be as low as 200
or as high as 700 bases. Typical problems involve target sequences 30,000 to
100,000 base pairs long, and the total number of fragments is in the range 500
to 2000.

• The problem is then to deduce the whole sequence of the target DNA
molecule. Because we have a collection of fragments to put together, this
task is known as fragment assembly

• We note that it suffices to determine one of the strands of the original
molecule, since the other can be readily obtained because of the
complementary pair rule.
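The complementary pair rule can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` is an assumption, not something from the lecture, and the input is taken to contain only the letters A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written in its own 5' -> 3' direction."""
    # Complement every base, then reverse because the strands run antiparallel.
    return strand.translate(COMPLEMENT)[::-1]

print(reverse_complement("TTACCGTGC"))  # prints GCACGGTAA
```

Applying the function twice returns the original strand, which is why assembling either strand is enough.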
Example
Input:
ACCGT
CGTGC
TTAC
TACCGT
The output should contain approximately 10 bases:
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
_________
TTACCGTGC
Contd…
• We try to align in the same column bases that are equal. The only
guidance to assembly, apart from the approximate size of the target,
are the overlaps between fragments.
• By overlap here we mean the fact that sometimes the end part of a
fragment is similar to the beginning of another, as with the first and
second sequences above.
• By positioning fragments so that they align well with each other we
get a layout, which can be seen as a multiple alignment of the
fragments.
Contd…
• The sequence below the line is the consensus sequence, or simply
consensus, and is the answer to our problem. The consensus is
obtained by taking a majority vote among all bases in each column.
• In this example, every column is unanimous, so computing the
consensus is straightforward. This answer has nine bases, which is
close to the given target length of 10, and contains each fragment as
an exact substring.
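The majority vote can be sketched as follows. The layout is given here as (offset, fragment) pairs read off the example by hand. Finding those offsets is the hard part of assembly and is not attempted. The function name `consensus` is an assumption, and every column is assumed to be covered by at least one fragment.

```python
from collections import Counter

def consensus(layout):
    """layout: list of (offset, fragment) pairs. Returns the majority-vote string."""
    width = max(offset + len(fragment) for offset, fragment in layout)
    columns = [Counter() for _ in range(width)]
    for offset, fragment in layout:
        for i, base in enumerate(fragment):
            columns[offset + i][base] += 1
    # Assumes every column is covered; most_common(1) gives the top-voted base.
    return "".join(column.most_common(1)[0][0] for column in columns)

# Offsets taken from the example layout (hand-placed, not computed).
layout = [(2, "ACCGT"), (4, "CGTGC"), (0, "TTAC"), (1, "TACCGT")]
print(consensus(layout))  # prints TTACCGTGC
```

Because each column is decided by a vote, a single wrong base in one fragment is outvoted wherever at least two other fragments cover that column correctly.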
Complications
• Real problem instances are very large.
• Apart from this fact, several other complications exist that make the
problem much harder than the small example we saw.
• The main factors that add to the complexity of the problem are errors,
which include substitutions, insertions and deletions.
Errors
• Base substitution / mutation
• Insertions
• Deletions
• Input:
ACCGT CGTGC TTAC TGCCGT
Output:
--ACCGT--
----CGTGC
TTAC-----
-TGCCGT--
_________
TTACCGTGC
In this instance there was a substitution error in the second position of the
last fragment, where A was replaced by G. The consensus is still correct
because of majority voting.
The fragment assembly problem
• Given: A set of reads (strings) {s1, s2, … , sn}
• Do: Determine a large string s that “best explains” the reads

• What do we mean by “best explains”?


• What assumptions might we require?
Shortest Common Superstring
• One of the first attempts to formalize fragment assembly was through
a string problem in which we seek the shortest superstring of a
collection of given strings.
• Accordingly, this is called the Shortest Common Superstring problem,
or SCS.
• Although this model has serious shortcomings in representing the
fragment assembly problem (it does not account for errors, for
instance), the techniques used to tackle the resulting computational
problem have application in other models as well.
Shortest superstring problem
• Objective: Find a string s such that
• all reads s1, s2, … , sn are substrings of s
• s is as short as possible
• Assumptions:
• Reads are 100% accurate
• Identical reads must come from the same location on the genome
• “best” = “simplest”
• PROBLEM: SHORTEST COMMON SUPERSTRING (SCS)

• INPUT: A collection T of strings, where no string is a substring of another.

• OUTPUT: A shortest string that contains every string in T as a
substring.
Algorithms for shortest Superstring
• Finding the shortest superstring is NP-hard (its decision version is
NP-complete), but a greedy approach gives an approximate solution.

• Input: A set of strings S

• T = S
• While |T| > 1 do
• Let a and b be the two most overlapping strings in T
• Replace a and b with the string obtained by overlapping a and b
• The single string left in T is a superstring of S (not necessarily the shortest)
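A minimal Python sketch of this greedy merging loop follows. The helper names `overlap` and `greedy_superstring` are assumptions, and the five reads are only illustrative. The reads are assumed to be error-free, distinct, and not substrings of one another.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Repeatedly merge the pair with the largest overlap until one string is left."""
    pool = list(reads)
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        # Glue the two strings together, writing the shared part only once.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0]

reads = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
print(greedy_superstring(reads))
```

The result always contains every read, but it is only guaranteed to be a superstring, not the shortest one.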
Greedy Algorithms
• Definition: An algorithm that always takes the best
immediate, or local, solution while finding an answer.
• Greedy algorithms find the overall, or globally,
optimal solution for some optimization problems, but
may find less-than-optimal solutions for some
instances of other problems.

Paul E. Black, "greedy algorithm", in Dictionary of Algorithms and Data Structures [online],
Paul E. Black, ed., U.S. National Institute of Standards and Technology. 2 February 2005.
http://www.itl.nist.gov/div897/sqg/dads/HTML/greedyalgo.html
How to find the most overlapping Strings
• The above greedy algorithm looks simple but is actually difficult to
perform. The difficulty lies in how to find the most overlapping strings in the
given set of strings. Below is the naive algorithm that finds them:

• Check the maximum overlap of strings s1 and s2 by

a. Checking if a suffix of s1 matches a prefix of s2, by comparing the
last i characters of s1 with the first i characters of s2

b. Checking if a prefix of s1 matches a suffix of s2, by comparing the
first i characters of s1 with the last i characters of s2
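The two checks can be sketched like this. Both function names are assumptions made for the illustration. Check b is check a with the arguments swapped, so one helper covers both.

```python
def suffix_prefix_overlap(s1, s2):
    """Largest i such that the last i characters of s1 equal the first i of s2."""
    best = 0
    for i in range(1, min(len(s1), len(s2)) + 1):
        if s1[-i:] == s2[:i]:
            best = i
    return best

def max_overlap(s1, s2):
    """Try both orientations and return (overlap length, merged string)."""
    a = suffix_prefix_overlap(s1, s2)  # check a: s1 followed by s2
    b = suffix_prefix_overlap(s2, s1)  # check b: s2 followed by s1
    if a >= b:
        return a, s1 + s2[a:]
    return b, s2 + s1[b:]

print(max_overlap("ACCGT", "CGTGC"))  # prints (3, 'ACCGTGC')
print(max_overlap("CGTGC", "ACCGT"))  # prints (3, 'ACCGTGC')
```

Each call costs time quadratic in the read length in the worst case, and the greedy loop calls it for every pair of strings, which is why this step dominates the running time.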
Shortest Superstring Problem
• Given a list of strings where no string is a substring of another, find the
shortest string that contains each string in the list as a substring

• Example
• Reads:
{ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG}

• What is the shortest superstring you can come up with?


• TCGACGCGTA (length 10)
Example
• Input: [CATGC, CTAAGT, GCTA, TTCA, ATGCATC]

• Find the shortest superstring.


Shortest Superstring with Graph Theory

• The greedy algorithm will not always lead to an optimal solution.

• To guarantee an optimal solution we would have to try every possible
ordering of the n input strings.
• If S contains n strings, n! orderings are possible.
• Greedy instead makes a series of decisions; each decision chooses the option
that reduces the length of the eventual superstring the most.
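For a handful of reads, trying all n! orderings is feasible. The sketch below does so under stated assumptions. The helper names are illustrative, the reads are error-free, and no read is a substring of another, so some ordering with maximal consecutive overlaps gives a shortest superstring.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_in_order(order):
    """Chain the strings in the given order, overlapping each consecutive pair maximally."""
    result = order[0]
    for prev, nxt in zip(order, order[1:]):
        result += nxt[overlap(prev, nxt):]
    return result

def brute_force_superstring(reads):
    """Try all n! orderings and keep the shortest merged string."""
    return min((merge_in_order(p) for p in permutations(reads)), key=len)

reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
best = brute_force_superstring(reads)
print(len(best), best)
```

With 8 reads this already examines 40,320 orderings, and the count grows factorially, so the approach does not scale to real read sets.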
Graph Basics
• A graph (G) consists of vertices (V) and edges (E)
G = (V,E)
• Edges can either be directed (directed graphs)

[Figure: a directed graph on vertices 1 to 4]

• or undirected (undirected graphs)

[Figure: an undirected graph on vertices 1 to 4]
Vertex degrees
• The degree of a vertex: the # of edges incident to that vertex
• For directed graphs, we also have the notion of
• indegree: The number incoming edges
• outdegree: The number of outgoing edges

[Figure: a directed graph on vertices 1 to 4]
degree(v2) = 3
indegree(v2) = 1
outdegree(v2) = 2
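The three degree counts can be computed from an edge list as sketched below. The edges are hypothetical: the slide's figure is not reproduced here, so they were simply chosen so that vertex 2 has the values quoted above.

```python
from collections import Counter

# Hypothetical directed edges (source, target), not taken from the slide's figure.
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]

outdegree = Counter(src for src, _ in edges)
indegree = Counter(dst for _, dst in edges)

def degree(v):
    """Total number of edges incident to v (incoming plus outgoing)."""
    return indegree[v] + outdegree[v]

print(degree(2), indegree[2], outdegree[2])  # prints 3 1 2
```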
Overlap graph
• For a set of sequence reads S, construct a directed
weighted graph G = (V,E,w)
• with one vertex per read (vi corresponds to si)
• edges between all vertices (a complete graph)
• w(vi,vj) = overlap(si,sj) = length of longest suffix of si that is a prefix
of sj
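As a rough sketch, the weighted overlap graph can be stored as a dictionary keyed by ordered pairs of reads. The function names and the four short reads are illustrative assumptions, and the reads are assumed to be distinct.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads):
    """Return {(si, sj): overlap(si, sj)} for every ordered pair of distinct reads."""
    return {(a, b): overlap(a, b) for a in reads for b in reads if a != b}

reads = ["AGA", "GAT", "TCG", "GAG"]
graph = overlap_graph(reads)
for (a, b), weight in graph.items():
    if weight > 0:
        print(f"{a} -> {b}: {weight}")
```

Building the graph this way compares all n(n-1) ordered pairs, which is acceptable for a slide-sized example but not for real read sets.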
Overlap graph example
• Let S = {AGA, GAT, TCG, GAG}

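[Figure: overlap graph on the vertices AGA, GAT, TCG, GAG]
Edge weights:
AGA -> GAT 2
AGA -> GAG 2
GAG -> AGA 2
GAT -> TCG 1
TCG -> GAT 1
TCG -> GAG 1
GAG -> GAT 1
all other edges 0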
Assembly as Hamiltonian Path
• Hamiltonian path in the overlap graph defines a superstring for the set of
fragments
• Hamiltonian Path: path through graph that visits each vertex exactly once
• NP-complete

[Figure: the overlap graph for S = {AGA, GAT, TCG, GAG}]

Path: AGAGATCG (AGA -> GAG -> GAT -> TCG)
Path: AGATCGAG (AGA -> GAT -> TCG -> GAG)
Shortest superstring as TSP
• Minimize superstring length ⇒ minimize Hamiltonian path length in the
overlap graph with edge weights negated

[Figure: the overlap graph for {AGA, GAT, TCG, GAG} with negated edge weights]
Path: GAGATCG
Path length: -5
String length: 7

• This is essentially the Traveling Salesman Problem (also NP-complete)
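The link between path weight and string length can be checked by exhaustive search on the four-read example. This is a sketch with assumed helper names, and it is feasible only because n is tiny. The string length equals the total read length plus the (negative) path weight.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def path_weight(order):
    """Sum of negated overlaps along consecutive vertices of the path."""
    return sum(-overlap(a, b) for a, b in zip(order, order[1:]))

def min_hamiltonian_path(reads):
    """Exhaustively search all vertex orders for the minimum-weight path."""
    best = min(permutations(reads), key=path_weight)
    return best, path_weight(best)

reads = ["AGA", "GAT", "TCG", "GAG"]
order, weight = min_hamiltonian_path(reads)
string_length = sum(len(r) for r in reads) + weight
print(order, weight, string_length)
```

Two orderings tie at weight -5 here (GAG -> AGA -> GAT -> TCG and TCG -> GAG -> AGA -> GAT), so which one is returned depends on the input order. Both give a string of length 12 - 5 = 7.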


Minimized Hamiltonian Path
The Hamiltonian path approach to solving the shortest superstring problem is not an
efficient one, because finding the minimum-weight Hamiltonian path is NP-hard
(you can reduce HAMPATH to it).

Unfortunately, no efficient exact approach to the shortest superstring problem is
known, because the problem is itself NP-hard.

• Find the shortest path which visits every vertex exactly once
• NP-complete
Hamiltonian Path
1. A Hamiltonian path in the overlap graph defines a superstring for the set of
fragments. Is the converse true?
2. Does a superstring define a Hamiltonian path in the overlap graph?

No, since a superstring can contain arbitrary characters that are not
present in any fragment.

3. Does a shortest superstring define a Hamiltonian path in the overlap graph?

Yes
Greedy Algorithm Examples
• Kruskal’s Algorithm for Minimum Spanning Tree
• Minimum spanning tree: a set of n-1 edges that connects a graph of n vertices
without any cycles and that has minimal total weight
• Kruskal’s algorithm adds the edge that connects two components with the
smallest weight at each step without introducing a cycle
• Proven to give an optimal solution
Greedy Algorithm Examples
[Figure: weighted graph on vertices 0 to 8]
Weight Src Dest
1 7 6
2 8 2
2 6 5
4 0 1
4 2 5
6 8 6
7 2 3
7 7 8
8 0 7
8 1 2
9 3 4
10 5 4
11 1 7
14 3 5
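Kruskal's algorithm on this edge list can be sketched with a small union-find structure. The edge list below is my reading of the table, which is an assumption: 14 edges on vertices 0 to 8, with the stray rows of vertex labels from the figure left out. The function name `kruskal` is also illustrative.

```python
def kruskal(num_vertices, edges):
    """edges: list of (weight, u, v). Returns the edges of a minimum spanning tree."""
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the component representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # different components, so adding the edge makes no cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree

edges = [(1, 7, 6), (2, 8, 2), (2, 6, 5), (4, 0, 1), (4, 2, 5), (6, 8, 6), (7, 2, 3),
         (7, 7, 8), (8, 0, 7), (8, 1, 2), (9, 3, 4), (10, 5, 4), (11, 1, 7), (14, 3, 5)]
tree = kruskal(9, edges)
print(len(tree), sum(weight for weight, _, _ in tree))  # prints 8 37
```

The tree has n - 1 = 8 edges. Its total weight of 37 is optimal regardless of how ties between equal-weight edges are broken.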
The Greedy Algorithm
• Let G be a graph with fragments as
vertices, and no edges to start
• Create a queue, Q, of overlap edges,
with edges in order of increasing weight
• While G is disconnected
• Sort all the edges in increasing order of
their weight
• Pick the smallest edge and if it is not
forming a cycle include this edge. Else
discard it.
• Repeat until there are (v-1) edges
[Figure: the overlap graph for {AGA, GAT, TCG, GAG} with negated edge weights]
The Greedy Algorithm
• GAG -> AGA -2
• AGA -> GAG -2
• AGA -> GAT -2
• GAG -> GAT -1
• TCG -> GAT -1
• TCG -> GAG -1
• GAT -> TCG -1
[Figure: the overlap graph with negated edge weights]
Path: GAGATCG
Path length: -5
String length: 7
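The edge-picking procedure can be sketched as below. One point is an assumption on my part rather than something the slides state: besides rejecting cycles, the sketch also rejects an edge whose source already has an outgoing edge or whose target already has an incoming edge, so that the chosen edges form a single path. Ties between equal weights are broken by input order, and all names are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_path(reads):
    """Pick negated-overlap edges in increasing weight order; return the spelled string."""
    edges = [(-overlap(a, b), a, b) for a in reads for b in reads if a != b]
    edges.sort(key=lambda edge: edge[0])  # stable sort: ties keep input order
    successor, has_predecessor = {}, set()
    component = {r: r for r in reads}

    def find(r):
        while component[r] != r:
            r = component[r]
        return r

    for _, a, b in edges:
        if a in successor or b in has_predecessor or find(a) == find(b):
            continue  # the edge would branch the path or close a cycle
        successor[a] = b
        has_predecessor.add(b)
        component[find(a)] = find(b)
        if len(successor) == len(reads) - 1:
            break
    # Walk the path from the only vertex without a predecessor.
    node = next(r for r in reads if r not in has_predecessor)
    text = node
    while node in successor:
        nxt = successor[node]
        text += nxt[overlap(node, nxt):]
        node = nxt
    return text

print(greedy_path(["AGA", "GAT", "TCG", "GAG"]))  # prints GAGATCG
```

The outcome depends on how ties among the -2 edges are broken. If AGA -> GAG is taken first, the other two -2 edges are blocked and the best reachable result is a length 8 string such as AGAGATCG.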
