Chapter 5

How Do We Compare Biological Sequences?
Dynamic Programming and Divide-and-Conquer Algorithms
Phillip Compeau and Pavel Pevzner.

Bioinformatics Algorithms: An Active Learning Approach
©2018 by Compeau and Pevzner. All rights reserved.
How Do We Compare Biological Sequences
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment
RNA Tie Club
DNA → RNA → Protein
Figure labels: ribosome; genetic code (RNA Tie Club).

From Genetic Code to Non-Ribosomal Code
DNA → RNA → Protein → Peptide
Figure labels: ribosome; non-ribosomal peptide synthetase; genetic code (RNA Tie Club); non-ribosomal code (Marahiel).

NRP Synthetase: A Giant Molecular Assembly Line
Adenylation domains
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
The NRP synthetase adds one amino acid at a time.
These Three A-domains Do Not Look Similar
YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTEFINHYGPTEATIGA
AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYIYEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSAPTMISSLEILFAAGDRLSSQDAILARRAVGSGVYNAYGPTENTVLS

These Three A-domains Do Not Look Similar
AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYIYEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
just 3 conservative columns

Do they look similar now?
-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYIYEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
11 conservative columns

And now?
YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA
-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS
19 conservative columns!

Red Positions Encode Conservative Core of A-domains
Which positions are responsible for encoding different amino acids Asp, Orn, Val?

Blue Positions in A-domains Define Non-Ribosomal Code
LTKVGHIG Asp
VGEIGSID Orn
AWMFAAVL Val

Cystic Fibrosis
• Cystic fibrosis (CF): An often
fatal disease which affects the
respiratory system and
produces an abnormally large
amount of mucus.
– Mucus is a slimy material
that coats epithelial
surfaces and is secreted
into fluids such as saliva.

Approximately 1 in 25 Humans Carry a Faulty CF Gene
• In the early 1980s, biologists hypothesized that CF is caused by mutations in an unidentified gene.
• When BOTH parents carry a faulty gene, there is a 25% chance that their child will have cystic fibrosis.
Where Is the Cystic Fibrosis Gene?
• In the late 1980s, biologists narrowed the search for the CF gene to a small region on chromosome 7.
• One of these genes was similar to ATP binding proteins that act as transport channels responsible for secretion.
• Hint: cystic fibrosis involves sweat secretion with abnormally high sodium levels.
Adenosine Triphosphate (ATP) transports chemical energy within cells.
CFTR: Cystic Fibrosis Transmembrane Conductance Regulator
• The CFTR protein controls the flow of ions in and out of cells inside the lungs.


The Alignment Game
A T G T T A T A
A T C G T C C
Alignment Game (maximizing the number of points). On each turn, do one of the following:
• Remove the 1st symbol from each sequence: 1 point if the symbols match, 0 points if they don’t match
• Remove the 1st symbol from one of the sequences: 0 points

The Alignment Game
A T G T T A T A
A T C G T C C
+1

The Alignment Game
A T G T T A T A
A T C G T C C
+1+1

The Alignment Game
A T - G T T A T A
A T C G T C C
+1+1

The Alignment Game
A T - G T T A T A
A T C G T C C
+1+1 +1

The Alignment Game
A T - G T T A T A
A T C G T C C
+1+1 +1+1

The Alignment Game
A T - G T T A T A
A T C G T - C C
+1+1 +1+1

The Alignment Game
A T - G T T A T A
A T C G T - C - C
+1+1 +1+1

The Alignment Game
A T - G T T A T A
A T C G T - C - C
+1+1 +1+1 =4

What Is the Sequence Alignment?
matches insertions deletions mismatches
A T - G T T A T A
A T C G T - C - C
+1+1 +1+1 =4
Alignment of two sequences is a two-row matrix:
1st row: symbols of the 1st sequence (in order) interspersed by “-”
2nd row: symbols of the 2nd sequence (in order) interspersed by “-”
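
The score of the Alignment Game can be read directly off the two rows of an alignment. The following is a minimal sketch in Python; the function name `alignment_game_score` is chosen here for illustration and does not come from the slides.

```python
def alignment_game_score(row1, row2):
    """Count the matching columns of a two-row alignment (gap symbol '-')."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    # A column earns 1 point only when both rows hold the same non-gap symbol.
    return sum(1 for a, b in zip(row1, row2) if a == b and a != "-")


print(alignment_game_score("AT-GTTATA", "ATCGT-C-C"))  # 4
```

The four matching columns spell ATGT, the common subsequence discussed on the next slide.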

Longest Common Subsequence
A T - G T T A T A
A T C G T - C - C
Matches in alignment of two sequences (ATGT) form their
Common Subsequence
Longest Common Subsequence Problem: Find a longest
common subsequence of two strings.
• Input: Two strings.
• Output: A longest common subsequence of these
strings.

From Manhattan to a Grid Graph
Walk from the source to the sink (only in the South ↓ and East → directions) and visit the maximum number of attractions.

Manhattan Tourist Problem
Manhattan Tourist Problem: Find a longest path in a rectangular city grid.
• Input: A weighted rectangular grid.
• Output: A longest path from the source to the sink in the grid.

Greedy algorithm?
Figure: a 5 × 5 city grid with weighted edges. A greedy path from the source (running totals 0, 3, 5, 9, 13, 15, 19, 20, 23) reaches the sink with total weight 23.
From a regular to an irregular grid
Figure: the city grid with extra edges added, so that a node may have more than two incoming edges.
Search for Longest Paths in a Directed Graph
Longest Path in a Directed Graph Problem: Find a longest path between two nodes in an edge-weighted directed graph.
• Input: An edge-weighted directed graph with source and sink nodes.
• Output: A longest path from source to sink in the directed graph.

Do You See a Connection between
the Manhattan Tourist and the Alignment Game?
A T - G T T A T A
A T C G T - C - C
↘ ↘ → ↘ ↘ ↓ ↘ ↓ ↘

alignment → path
A T - G T T A T A
A T C G T - C - C
↘ ↘ → ↘ ↘ ↓ ↘ ↓ ↘
Figure: the alignment drawn as a path in a grid whose rows are labeled by ATGTTATA and whose columns are labeled by ATCGTCC.

path → alignment
A T G T T - A T A
- - A T C G T C C
↓ ↓ ↘ ↘ ↘ → ↘ ↘ ↘

highest-scoring alignment = longest path in a properly built Manhattan

How to build a Manhattan for the Alignment Game and the Longest Common Subsequence Problem?
Diagonal red edges correspond to matching symbols and have score 1.

The Change Problem
Change Problem: Find the minimum number of coins
needed to make change.
• Input: An integer money and an array of positive integers (coin_1, ..., coin_d).
• Output: The minimum number of coins with denominations (coin_1, ..., coin_d) that changes money.

Changing Money in a Greedy Way
GreedyChange(money)
    change ← empty collection of coins
    while money > 0
        coin ← largest denomination that does not exceed money
        add coin to change
        money ← money − coin
    return change

Changing Money in Tanzania
40 cents = 25+10+5
Greedy

Changing Money in Tanzania: GreedyChange Fails
40 cents = 25+10+5 = 20+20

Greedy is not Optimal
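
The failure of GreedyChange can be reproduced with a short Python sketch. The denomination list (25, 20, 10, 5, 1) is an assumption made for illustration: the slide shows only the coins 25, 20, 10, and 5 in use.

```python
def greedy_change(money, coins):
    """Repeatedly take the largest coin that does not exceed the remaining money."""
    change = []
    while money > 0:
        usable = [c for c in coins if c <= money]
        if not usable:
            raise ValueError("cannot make change with these denominations")
        coin = max(usable)
        change.append(coin)
        money -= coin
    return change


tanzanian_coins = [25, 20, 10, 5, 1]  # assumed denominations
print(greedy_change(40, tanzanian_coins))  # [25, 10, 5]
```

Greedy uses three coins, although 20 + 20 changes 40 cents with two.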

Recursive Change
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins ?
MinNumCoins(9)= ?

Recursive Change
To find MinNumCoins(9), we first need MinNumCoins(3), MinNumCoins(4), and MinNumCoins(8):
MinNumCoins(9) = min { MinNumCoins(9-6) + 1 = MinNumCoins(3) + 1
                       MinNumCoins(9-5) + 1 = MinNumCoins(4) + 1
                       MinNumCoins(9-1) + 1 = MinNumCoins(8) + 1 }
MinNumCoins(3) = ?  MinNumCoins(4) = ?  MinNumCoins(8) = ?

Recursive Change
For the denominations 6, 5, and 1:
MinNumCoins(money) = min { MinNumCoins(money-6) + 1
                           MinNumCoins(money-5) + 1
                           MinNumCoins(money-1) + 1 }

Recursive Change
For arbitrary denominations (coin_1, ..., coin_d):
MinNumCoins(money) = min { MinNumCoins(money-coin_1) + 1
                           ...
                           MinNumCoins(money-coin_d) + 1 }
RecursiveChange
RecursiveChange(money, coins)
    if money = 0
        return 0
    MinNumCoins ← infinity
    for i ← 1 to |coins|
        if money ≥ coin_i
            NumCoins ← RecursiveChange(money − coin_i, coins)
            if NumCoins + 1 < MinNumCoins
                MinNumCoins ← NumCoins + 1
    return MinNumCoins

How Fast is RecursiveChange?

The Recursive Tree
Figure: the recursion tree for changing 76 cents with denominations 6, 5, and 1. The root 76 has children 70, 71, and 75; their children are 64, 65, 69; 65, 66, 70; and 69, 70, 74; and so on.
The optimal coin combination for 69 cents is computed 6 times!
The optimal coin combination for 30 cents is computed trillions of times!
Changing Money by Dynamic Programming
Hint. Wouldn’t it be nice to know all the values of
MinNumCoins(money − coin_i)
by the time we need to compute
MinNumCoins(money)?
Instead of the time-consuming calls
RecursiveChange(money − coin_i, coins)
we would simply look up the values of
MinNumCoins(money − coin_i).
(Pictured: Richard Bellman)

Changing Money by Dynamic Programming
money 0 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins 0 1 2 3 4 1 1 2 3 ?

DPChange
DPChange(money, coins)
    MinNumCoins(0) ← 0
    for m ← 1 to money
        MinNumCoins(m) ← infinity
        for i ← 1 to |coins|
            if m ≥ coin_i
                if MinNumCoins(m − coin_i) + 1 < MinNumCoins(m)
                    MinNumCoins(m) ← MinNumCoins(m − coin_i) + 1
    return MinNumCoins(money)
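
DPChange translates almost line by line into Python. This is a minimal sketch; the name `dp_change` and the use of a list indexed by the amount of money are choices made here, not part of the slides.

```python
def dp_change(money, coins):
    """Minimum number of coins needed to change `money`, filled in bottom-up."""
    INF = float("inf")
    # min_num_coins[m] holds MinNumCoins(m); MinNumCoins(0) = 0.
    min_num_coins = [0] + [INF] * money
    for m in range(1, money + 1):
        for coin in coins:
            if m >= coin and min_num_coins[m - coin] + 1 < min_num_coins[m]:
                min_num_coins[m] = min_num_coins[m - coin] + 1
    return min_num_coins[money]


print(dp_change(9, [6, 5, 1]))  # 4
```

Each value is computed once from values already in the table, so the running time is proportional to money · |coins| rather than to the size of the recursion tree.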

“Programming” in “Dynamic Programming”
Has Nothing to Do with Programming!
Richard Bellman developed this idea in the 1950s while working on an Air Force project.
At that time, his approach seemed completely impractical.
He wanted to hide from the Secretary of Defense that he was really doing math.
“…What name could I choose? I was interested in planning but planning, is not a good word for various reasons. I decided therefore to use the word, “programming” and I wanted to get across the idea that this was dynamic. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”

There are only 2 ways to arrive at the sink: by moving South ↓ or by moving East →.
Figure: the weighted city grid with the two edges entering the sink marked “South or East?”
South or East?
SouthOrEast(n, m)
    if n = 0 and m = 0
        return 0
    x ← -infinity, y ← -infinity
    if n > 0
        x ← SouthOrEast(n-1, m) + weight of edge “↓” into (n, m)
    if m > 0
        y ← SouthOrEast(n, m-1) + weight of edge “→” into (n, m)
    return max{x, y}

Figure sequence: the scores of the nodes are filled in one by one, starting with the first row (0, 3, 5, 9, 9) and the first column (0, 1, 5, 9, 14).
South or East? At (1,1): 1+3 > 3+0, so the score of (1,1) is 4 and we arrived at (1,1) by the → edge.
South or East? At (2,1): 5+0 < 4+6, so the score of (2,1) is 10 and we arrived at (2,1) by the ↓ edge.
South or East? At (1,2): 5+2 > 4+2, so the score of (1,2) is 7 and we arrived at (1,2) by the ↓ edge.
Completed table of scores (rows i = 0..4, columns j = 0..4):
 0  3  5  9  9
 1  4  7 13 15
 5 10 17 20 24
 9 14 22 22 25
14 20 30 32 34
Backtracking pointers: the best way to get to each node.
Dynamic Programming Recurrence
s_{i,j}: the length of a longest path from (0,0) to (i,j)
s_{i,j} = max { s_{i-1,j} + weight of edge “↓” into (i,j)
                s_{i,j-1} + weight of edge “→” into (i,j) }
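
The recurrence can be run on the city grid from the slides. The sketch below is a hedged illustration: the names `manhattan_tourist`, `DOWN`, and `RIGHT` are introduced here, and the weights are transcribed from the slide figures. The last → weight in the third row is taken to be 4 rather than the 3 that appears in the extracted grid, because 4 is the value consistent with the scores 24 and 25 in the completed table; that choice is an assumption.

```python
def manhattan_tourist(down, right):
    """Return the table s, where s[i][j] is the longest path length from (0,0) to (i,j).

    down[i][j]  is the weight of the edge from (i, j) to (i+1, j);
    right[i][j] is the weight of the edge from (i, j) to (i, j+1).
    """
    n, m = len(down), len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only "↓" edges
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):          # first row: only "→" edges
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s


DOWN = [[1, 0, 2, 4, 3],
        [4, 6, 5, 2, 1],
        [4, 4, 5, 2, 1],
        [5, 6, 8, 5, 3]]
RIGHT = [[3, 2, 4, 0],
         [3, 2, 4, 2],
         [0, 7, 3, 4],   # last weight assumed to be 4 (see the note above)
         [3, 3, 0, 2],
         [1, 3, 2, 2]]

print(manhattan_tourist(DOWN, RIGHT)[-1][-1])  # 34
```

The longest path has length 34, compared with 23 for the greedy path.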


How does the recurrence change for this graph?
Figure: the irregular grid, in which a node may have more than two incoming edges.
s_a = max over all predecessors b of node a { s_b + weight of edge from b to a }
Example: a node with 4 incoming edges has 4 choices: 5+2, 3+7, 5+4, and 4+2. The best choice is 3+7 = 10.
Figure: the irregular grid with all scores filled in; the sink receives the score 29.
Dynamic Programming Recurrence for the
Alignment Graph
s_{i,j} = max { s_{i-1,j} + weight of edge “↓” into (i,j)
                s_{i,j-1} + weight of edge “→” into (i,j)
                s_{i-1,j-1} + weight of edge “↘” into (i,j) }
red edges ↘: weight 1
other edges: weight 0

Longest Common Subsequence Problem
s_{i,j} = max { s_{i-1,j} + 0
                s_{i,j-1} + 0
                s_{i-1,j-1} + 1, if v_i = w_j }
red edges ↘: weight 1


Figure: backtracking pointers for the Longest Common Subsequence in the alignment graph of ATGTTATA (rows) and ATCGTCC (columns); red edges ↘ have weight 1.
Computing Backtracking Pointers
s_{i,j} ← max { s_{i,j-1} + 0
                s_{i-1,j} + 0
                s_{i-1,j-1} + 1, if v_i = w_j }
backtrack_{i,j} ← “→”, if s_{i,j} = s_{i,j-1}
                  “↓”, if s_{i,j} = s_{i-1,j}
                  “↘”, if s_{i,j} = s_{i-1,j-1} + 1 and v_i = w_j

Why did we store the backtracking pointers?
Figure: the city grid with its scores and backtracking pointers.
What is the optimal alignment path?
Figure: following the backtracking pointers from the sink (score 34) back to the source traces a longest path.
Using Backtracking Pointers to Compute LCS
OutputLCS(backtrack, v, i, j)
    if i = 0 or j = 0
        return
    if backtrack_{i,j} = “→”
        OutputLCS(backtrack, v, i, j-1)
    else if backtrack_{i,j} = “↓”
        OutputLCS(backtrack, v, i-1, j)
    else
        OutputLCS(backtrack, v, i-1, j-1)
        output v_i
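
The two steps, computing backtracking pointers and then following them, can be sketched in Python as follows. The names `lcs_backtrack` and `output_lcs` are chosen here, and the backtracking is written as a loop instead of a recursion (a substitution made in this sketch to avoid Python's recursion limit on long strings).

```python
def lcs_backtrack(v, w):
    """Fill the score table s and the table of backtracking pointers."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    backtrack = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The diagonal edge is usable only when the symbols match.
            diag = s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else -1
            s[i][j] = max(s[i][j - 1], s[i - 1][j], diag)
            if s[i][j] == s[i][j - 1]:
                backtrack[i][j] = "→"
            elif s[i][j] == s[i - 1][j]:
                backtrack[i][j] = "↓"
            else:
                backtrack[i][j] = "↘"
    return s, backtrack


def output_lcs(backtrack, v, i, j):
    """Follow the pointers from (i, j) back to the border and collect the matches."""
    symbols = []
    while i > 0 and j > 0:
        if backtrack[i][j] == "→":
            j -= 1
        elif backtrack[i][j] == "↓":
            i -= 1
        else:
            symbols.append(v[i - 1])
            i, j = i - 1, j - 1
    return "".join(reversed(symbols))


v, w = "ATGTTATA", "ATCGTCC"
s, backtrack = lcs_backtrack(v, w)
print(s[len(v)][len(w)], output_lcs(backtrack, v, len(v), len(w)))  # 4 ATGT
```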

Computing Scores of ALL Predecessors
Figure: a small weighted graph in which the score of a node (marked “?”) cannot be computed until the scores of all its predecessors are known.
s_a = max over ALL predecessors b of node a { s_b + weight of edge from b to a }

A Vicious Cycle
Figure: the unknown scores depend on one another along a directed cycle, so none of them can be computed first.

In What Order Should We Explore Nodes of the Graph?
s_a = max over ALL predecessors b of node a { s_b + weight of edge from b to a }
• By the time a node is analyzed, the scores of all its predecessors should already be computed.
• If the graph has a directed cycle, this condition is
impossible to satisfy.
• Directed Acyclic Graph (DAG): a graph without directed
cycles.
Topological Ordering
• Topological Ordering: Ordering of nodes of a DAG on a line
such that all edges go from left to right.
• Theorem: Every DAG has a topological ordering.

LongestPath
LongestPath(Graph, source, sink)
    for each node a in Graph
        s_a ← -infinity
    s_source ← 0
    topologically order Graph
    for each node a (from source to sink in topological order)
        s_a ← max over all predecessors b of node a { s_b + weight of edge from b to a }
    return s_sink
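
A minimal Python sketch of LongestPath follows. The edge-list input format, the function name `longest_path`, and the small example graph are assumptions made for illustration. The topological order is built with Kahn's algorithm, and each node pushes its score forward to its successors, which computes the same maximum as pulling from all predecessors.

```python
from collections import defaultdict, deque


def longest_path(edges, source, sink):
    """Length of a longest path from source to sink; edges are (b, a, weight) triples."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = {source, sink}
    for b, a, weight in edges:
        successors[b].append((a, weight))
        indegree[a] += 1
        nodes.update((a, b))

    # Kahn's algorithm: repeatedly remove nodes that have no unprocessed predecessors.
    queue = deque(x for x in nodes if indegree[x] == 0)
    order = []
    while queue:
        b = queue.popleft()
        order.append(b)
        for a, _ in successors[b]:
            indegree[a] -= 1
            if indegree[a] == 0:
                queue.append(a)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle, so no topological ordering exists")

    s = {x: float("-inf") for x in nodes}
    s[source] = 0
    for b in order:
        if s[b] == float("-inf"):
            continue  # b is not reachable from the source
        for a, weight in successors[b]:
            s[a] = max(s[a], s[b] + weight)
    return s[sink]


example = [("a", "b", 5), ("a", "c", 6), ("b", "c", 2), ("c", "d", 4), ("b", "d", 1)]
print(longest_path(example, "a", "d"))  # 11
```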


Current (Primitive) Scoring
#matches

Mismatches and Indel Penalties
#matches − μ · #mismatches − σ · #indels
A T - G T T A T A
A T C G T - C - C
+1+1-2+1+1-2-3-2-3 = -8 (here μ = 3 and σ = 2)

Scoring matrix:
   A   C   G   T   −
A  +1  −μ  −μ  −μ  −σ
C  −μ  +1  −μ  −μ  −σ
G  −μ  −μ  +1  −μ  −σ
T  −μ  −μ  −μ  +1  −σ
−  −σ  −σ  −σ  −σ

Even more general scoring matrix:
   A   C   G   T   −
A  +1  −3  −5  −1  −3
C  −4  +1  −3  −2  −3
G  −9  −7  +1  −1  −3
T  −3  −5  −8  +1  −4
−  −4  −2  −2  −1
Scoring Matrices for Amino Acid Sequences
Y (Tyr) often mutates into F (score +7) but rarely mutates into P (score −5).

Alignment Graph
s_{i,j} = max { s_{i-1,j} − σ
                s_{i,j-1} − σ
                s_{i-1,j-1} + 1, if v_i = w_j
                s_{i-1,j-1} − μ, if v_i ≠ w_j }

Alignment Graph
s_{i,j} = max { s_{i-1,j} + score(v_i, -)
                s_{i,j-1} + score(-, w_j)
                s_{i-1,j-1} + score(v_i, w_j) }

Global Alignment
Global Alignment Problem: Find the highest-scoring alignment between two strings by using a scoring matrix.
• Input: Strings v and w as well as a matrix score.
• Output: An alignment of v and w whose alignment score (as defined by the scoring matrix score) is maximal among all possible alignments of v and w.
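
As an illustration, the sketch below scores a given alignment and computes the optimal global alignment score for the simple scoring scheme with match premium +1, mismatch penalty μ, and indel penalty σ. The function names and default parameter values are assumptions of this sketch; a full scoring matrix could replace the three cases.

```python
def score_alignment(row1, row2, mu, sigma):
    """Score a two-row alignment: +1 per match, -mu per mismatch, -sigma per indel."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total -= sigma
        elif a == b:
            total += 1
        else:
            total -= mu
    return total


def global_alignment_score(v, w, mu=1, sigma=2):
    """Score of a highest-scoring global alignment of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] - sigma      # v_1..v_i aligned against gaps only
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] - sigma      # w_1..w_j aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(s[i - 1][j] - sigma, s[i][j - 1] - sigma, diag)
    return s[n][m]


print(score_alignment("AT-GTTATA", "ATCGT-C-C", mu=3, sigma=2))  # -8
```

An optimal alignment itself can be recovered by storing backtracking pointers, exactly as for the Longest Common Subsequence Problem.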

Homeobox Genes
• Two genes in different species
may be similar over short
conserved regions and dissimilar
over remaining regions.
• Homeobox genes have a short

region called the homeodomain
that is highly conserved among
species.
• A global alignment may not find the
homeodomain because it would try to
align the entire sequence.

Which Alignment is Better?
• Alignment 1: score = 22 (matches) - 20 (indels)=2.
GCC-C-AGT--TATGT-CAGGGGGCACG--A-GCATGCAGA-
GCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT
• Alignment 2: score = 17 (matches) - 30 (indels)=-13.

---G----C-----C--CAGTTATGTCAGGGGGCACGAGCATGCAGA
GCCGCCGTCGTTTTCAGCAGTTATGTCAG-----A------T-----

The region of Alignment 2 in which CAGTTATGTCAG is aligned to CAGTTATGTCAG without any indels is a local alignment.

Figure: the alignment graph of the two strings, showing the path of the global alignment (Alignment 1) and the path of the local alignment (Alignment 2).

Local Alignment = Global Alignment in a Subrectangle
Compute a global alignment within each rectangle to get a local alignment.

Local Alignment Problem
Local Alignment Problem: Find the highest-scoring local alignment between two strings.
• Input: Strings v and w as well as a matrix score.
• Output: Substrings of v and w whose global alignment score (as defined by the matrix score) is maximal among all global alignments of all substrings of v and w.

Free Taxi Rides!
Figure: the alignment graph with the paths of the global alignment and the local alignment. Free taxi rides take us from the source to the start of the local alignment and from its end to the sink at no cost.
What Do Free Taxi Rides Mean in Terms of the Alignment Graph?

Building Manhattan for the Local Alignment Problem
How many edges have we added?

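Dynamic Programming for the Local Alignment
s_{i,j} = max { weight of edge from (0,0) to (i,j)
                s_{i-1,j} + score(v_i, -)
                s_{i,j-1} + score(-, w_j)
                s_{i-1,j-1} + score(v_i, w_j) }

Dynamic Programming for the Local Alignment
s_{i,j} = max { 0
                s_{i-1,j} + score(v_i, -)
                s_{i,j-1} + score(-, w_j)
                s_{i-1,j-1} + score(v_i, w_j) }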
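
A minimal sketch of the local alignment score under the simple scheme with match premium +1, mismatch penalty μ, and indel penalty σ (the function name and default parameters are assumptions of this sketch). The 0 in the recurrence is the free taxi ride from the source to (i,j); taking the best score over all nodes is the free ride from that node to the sink.

```python
def local_alignment_score(v, w, mu=1, sigma=1):
    """Score of a highest-scoring local alignment of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            # 0 corresponds to the zero-weight edge from the source to (i, j).
            s[i][j] = max(0, s[i - 1][j] - sigma, s[i][j - 1] - sigma, diag)
            best = max(best, s[i][j])  # free ride from (i, j) to the sink
    return best


print(local_alignment_score("AAAGGG", "TTGGGTT"))  # 3
```

Only the shared substring GGG contributes to the score; the dissimilar flanks are skipped for free.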


Scoring Gaps
• We previously assigned a fixed penalty σ to
each indel.
• However, this fixed penalty may be too severe
for a series of 100 consecutive indels.
• A series of k indels often represents a single
evolutionary event (gap) rather than k events:
two gaps (lower score):
GATCCAG
GA-C-AG
a single gap (higher score):
GATCCAG
GA--CAG

More Adequate Gap Penalties
Affine gap penalty for a gap of length k: σ+ε·(k-1)
σ - the gap opening penalty

ε - the gap extension penalty
σ > ε, since starting a gap should be penalized
more than extending it.

Modelling Affine Gap Penalties by Long Edges

Building Manhattan with Affine Gap Penalties
Figure: long edges that model gaps, labeled σ+ε for a gap of length 2 and σ+ε·2 for a gap of length 3.
We have just added O(n^3) edges to the graph…
Building Manhattan on 3 levels
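upper level (insertions)
middle level (matches/mismatches)
bottom level (deletions)

How can we emulate this path in the 3-level Manhattan?
lower_{i,j} = max { lower_{i-1,j} − ε
                    middle_{i-1,j} − σ }
upper_{i,j} = max { upper_{i,j-1} − ε
                    middle_{i,j-1} − σ }
middle_{i,j} = max { lower_{i,j}
                     middle_{i-1,j-1} + score(v_i, w_j)
                     upper_{i,j} }
Edges within the bottom and upper levels cost ε, edges leaving the middle level cost σ, and edges returning to the middle level have weight 0.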
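
The three-level recurrence can be sketched in Python as follows. The function name, the simple match/mismatch scoring, and the default penalties are assumptions made for illustration; any scoring matrix could be substituted for the diagonal term.

```python
def affine_gap_score(v, w, match=1, mismatch=1, sigma=3, epsilon=1):
    """Global alignment score with affine gap penalty sigma + epsilon * (k - 1)."""
    NEG = float("-inf")
    n, m = len(v), len(w)
    lower = [[NEG] * (m + 1) for _ in range(n + 1)]   # paths ending in a deletion
    upper = [[NEG] * (m + 1) for _ in range(n + 1)]   # paths ending in an insertion
    middle = [[NEG] * (m + 1) for _ in range(n + 1)]
    middle[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:   # extend a gap (epsilon) or open one from the middle level (sigma)
                lower[i][j] = max(lower[i - 1][j] - epsilon, middle[i - 1][j] - sigma)
            if j > 0:
                upper[i][j] = max(upper[i][j - 1] - epsilon, middle[i][j - 1] - sigma)
            if i > 0 or j > 0:
                diag = NEG
                if i > 0 and j > 0:
                    step = match if v[i - 1] == w[j - 1] else -mismatch
                    diag = middle[i - 1][j - 1] + step
                # Returning to the middle level from lower or upper costs nothing.
                middle[i][j] = max(lower[i][j], diag, upper[i][j])
    return middle[n][m]


print(affine_gap_score("GATCCAG", "GACAG"))  # 1
```

With σ = 3 and ε = 1, the single gap in GA--CAG costs 4 and the five matches earn 5, whereas two separate gaps would cost 6.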

• From Manhattan to an Arbitrary Directed Acyclic Graph

Middle Column of the Alignment
middle column (middle = #columns/2)
Middle Node of the Alignment
middle node (a node where an optimal alignment path crosses the middle column)
Divide and Conquer Approach to Sequence Alignment
AlignmentPath(source, sink)
    find MiddleNode
    AlignmentPath(source, MiddleNode)
    AlignmentPath(MiddleNode, sink)
The only problem left is how to find this middle node in linear space!
Computing Alignment Score in Linear Space
Finding the longest path in the alignment graph requires storing all backtracking pointers: O(nm) memory.
Finding the length of the longest path in the alignment graph does not require storing any backtracking pointers: O(n) memory.

Recycling the Columns in the Alignment Graph
      A  C  G  G  A  A
   0  0  0  0  0  0  0
A  0  1  1  1  1  1  1
T  0  1  1  1  1  1  1
T  0  1  1  1  1  1  1
C  0  1  2  2  2  2  2
A  0  1  2  2  2  3  3
A  0  1  2  2  2  3  4
Can We Find the Middle Node without Constructing the Longest Path?
i-path: a longest path among paths that visit the i-th node in the middle column.
Figure: a 4-path, which visits the node (4, middle) in the middle column.
Can We Find the Lengths of All i-paths?
length(i): length of an i-path. For example, length(0) = 2 and length(4) = 4.
length(i) = fromSource(i) + toSink(i)
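
The identity length(i) = fromSource(i) + toSink(i) leads to a linear-space way of finding the middle node. The sketch below uses the LCS scoring of the slides' example (rows ATTCAA, columns ACGGAA); the function names are chosen here, and toSink is computed by running the same column recycling on the reversed strings.

```python
def last_column_scores(v, w):
    """LCS scores of every prefix of v against all of w, keeping only two columns."""
    column = [0] * (len(v) + 1)
    for j in range(1, len(w) + 1):
        new = [0] * (len(v) + 1)
        for i in range(1, len(v) + 1):
            diag = column[i - 1] + (1 if v[i - 1] == w[j - 1] else 0)
            new[i] = max(new[i - 1], column[i], diag)
        column = new  # recycle: the previous column is no longer needed
    return column


def middle_node(v, w):
    """Return (i, middle, lengths), where (i, middle) maximizes length(i)."""
    middle = len(w) // 2
    from_source = last_column_scores(v, w[:middle])
    # toSink(i) is the score of the suffix v[i:] against w[middle:], computed on reversed strings.
    to_sink = last_column_scores(v[::-1], w[middle:][::-1])[::-1]
    lengths = [a + b for a, b in zip(from_source, to_sink)]
    i = max(range(len(lengths)), key=lengths.__getitem__)
    return i, middle, lengths


i, middle, lengths = middle_node("ATTCAA", "ACGGAA")
print(i, middle, lengths[0], lengths[4])  # 4 3 2 4
```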
Computing fromSource and toSink
fromSource(i) is computed in the left half of the alignment graph (columns 0 to 3); toSink(i) is computed in the right half (columns 3 to 6). Rows are labeled by ATTCAA.

      fromSource          toSink
      0  1  2  3          3  4  5  6
      0  0  0  0          2  2  1  0
A     0  1  1  1          2  2  1  0
T     0  1  1  1          2  2  1  0
T     0  1  1  1          2  2  1  0
C     0  1  2  2          2  2  1  0
A     0  1  2  2          1  1  1  0
A     0  1  2  2          0  0  0  0

How Much Time Did It Take to Find the Middle Node?
Computing fromSource(i) takes time proportional to area/2, and so does computing toSink(i): area/2 + area/2 = area.
Laughable Progress: O(nm) Time to Find ONE Node!
How much time would it take to conquer 2 subproblems?
Each subproblem can be conquered in time proportional to its area: area/4 + area/4 = area/2.

Laughable Progress: O(nm + nm/2) Time to Find THREE Nodes!
How much time would it take to conquer 4 subproblems?
Each subproblem can be conquered in time proportional to its area: area/8 + area/8 + area/8 + area/8 = area/4.

O(nm + nm/2 + nm/4) Time to Find NEARLY ALL Nodes!
How much time would it take to conquer ALL subproblems?
area + area/2 + area/4 + area/8 + area/16 + … < 2·area

Total Time:
area + area/2 + area/4 + area/8 + area/16 + …
1st pass: 1 area
2nd pass: 1/2
3rd pass: 1/4
4th pass: 1/8
1 + ½ + ¼ + ... < 2
The Middle Edge
Middle Edge: an edge in an optimal alignment path starting at the middle node.

The Middle Edge Problem
Middle Edge in Linear Space Problem. Find a middle edge in the alignment graph in linear space.
• Input: Two strings and matrix score.
• Output: A middle edge in the alignment graph of these strings (as defined by the matrix score).
Recursive LinearSpaceAlignment
LinearSpaceAlignment(top, bottom, left, right)
    if left = right
        return alignment formed by bottom − top edges “↓”
    middle ← ⌊(left + right)/2⌋
    midNode ← MiddleNode(top, bottom, left, right)
    midEdge ← MiddleEdge(top, bottom, left, right)
    LinearSpaceAlignment(top, midNode, left, middle)
    output midEdge
    if midEdge = “→” or midEdge = “↘”
        middle ← middle + 1
    if midEdge = “↓” or midEdge = “↘”
        midNode ← midNode + 1
    LinearSpaceAlignment(midNode, bottom, middle, right)
• From Manhattan to an Arbitrary Directed Acyclic Graph

From Pairwise to Multiple Alignment
• Up until now we have only

tried to align two sequences.
• A faint (and statistically
insignificant) similarity
between two sequences
becomes significant if it is
present in many other
sequences.
• Multiple alignments can
reveal subtle similarities that
pairwise alignments do not
reveal.

Alignment of Three A-domains



Generalizing Pairwise to Multiple Alignment
• Alignment of 2 sequences is a 2-row matrix.

• Alignment of 3 sequences is a 3-row matrix
A T - G C G -
A - C G T - A
A T C A C - A
• Our scoring function should score alignments with

conserved columns higher.

Alignments = Paths in 3-D
• Alignment of ATGC, AATC, and ATGC, with the number of symbols used up to each position written above each row:
0 1 1 2 3 4
  A - T G C
0 1 2 3 3 4
  A A T - C
0 0 1 2 3 4
  - A T G C

(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

2-D Alignment Cell versus 3-D Alignment Cell
Figure: a 2-D alignment cell next to a 3-D alignment cell. The 3-D cell has the vertices
(i-1,j-1,k-1) (i-1,j,k-1) (i-1,j-1,k) (i-1,j,k)
(i,j-1,k-1) (i,j,k-1) (i,j-1,k) (i,j,k)

Multiple Alignment: Dynamic Programming
• δ(x, y, z) is an entry in the 3-D scoring matrix.

Multiple Alignment: Running Time
• For 3 sequences of length n, the run time is proportional to 7n^3.
• For a k-way alignment, build a k-dimensional Manhattan graph with
– n^k nodes
– most nodes have 2^k − 1 incoming edges.
– Runtime: O(2^k n^k)
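
For three sequences, each node (i,j,k) has up to 7 predecessors, one for every non-empty choice of which sequences contribute a symbol to the column. The sketch below is a hedged illustration with a deliberately simple column score, assumed here rather than taken from the slides: a column scores 1 if all three symbols are present and equal, and 0 otherwise.

```python
from itertools import product


def three_way_alignment_score(u, v, w):
    """Best score of a 3-way alignment when only fully conserved columns score 1."""
    # The 7 moves: every non-zero choice of (di, dj, dk) in {0, 1}^3.
    moves = [step for step in product((0, 1), repeat=3) if any(step)]
    s = {(0, 0, 0): 0}
    for i in range(len(u) + 1):
        for j in range(len(v) + 1):
            for k in range(len(w) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = float("-inf")
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if min(pi, pj, pk) < 0:
                        continue  # this predecessor lies outside the grid
                    conserved = (di and dj and dk
                                 and u[i - 1] == v[j - 1] == w[k - 1])
                    best = max(best, s[pi, pj, pk] + (1 if conserved else 0))
                s[i, j, k] = best
    return s[len(u), len(v), len(w)]


print(three_way_alignment_score("ATGC", "AATC", "ATGC"))  # 3
```

The triple loop with 7 moves per node is where the 7n^3 running time comes from.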

Multiple Alignment Induces Pairwise
Alignments
Every multiple alignment induces pairwise alignments:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induced pairwise alignments:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
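
Inducing a pairwise alignment amounts to keeping two rows of the multiple alignment and dropping every column in which both rows hold a gap. A minimal sketch (the function name `induced_pairwise` is chosen here):

```python
def induced_pairwise(rows, a, b):
    """Pairwise alignment of rows a and b induced by a multiple alignment."""
    # Keep a column unless both selected rows have a gap in it.
    columns = [(x, y) for x, y in zip(rows[a], rows[b]) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in columns), "".join(y for _, y in columns)


multiple = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(induced_pairwise(multiple, 0, 1))  # ('ACGCGG-C', 'ACGC-GAG')
```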

Idea: Construct Multiple from Pairwise Alignments
Given a set of arbitrary pairwise alignments, can

we construct a multiple alignment that induces
them?
AAAATTTT---- ----AAAATTTT TTTTGGGG----
----TTTTGGGG GGGGAAAA---- ----GGGGAAAA

Profile Representation of Multiple Alignment
   -  A  G  G  C  T  A  T  C  A  C  C  T  G
   T  A  G  -  C  T  A  C  C  A  -  -  -  G
   C  A  G  -  C  T  A  C  C  A  -  -  -  G
   C  A  G  -  C  T  A  T  C  A  C  -  G  G
   C  A  G  -  C  T  A  T  C  G  C  -  G  G
A  0  1  0  0  0  0  1  0  0 .8  0  0  0  0
C .6  0  0  0  1  0  0 .4  1  0 .6 .2  0  0
G  0  0  1 .2  0  0  0  0  0 .2  0  0 .4  1
T .2  0  0  0  0  1  0 .6  0  0  0  0 .2  0
- .2  0  0 .8  0  0  0  0  0  0 .4 .8 .4  0
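
Each entry of the profile is the fraction of rows that carry a given symbol in a given column. The sketch below recomputes the profile for the five-row alignment above; the function name `profile` and the dictionary layout are choices made here.

```python
def profile(alignment, alphabet="ACGT-"):
    """Frequency of every symbol of the alphabet in every column of the alignment."""
    t = len(alignment)
    width = len(alignment[0])
    return {symbol: [sum(row[c] == symbol for row in alignment) / t for c in range(width)]
            for symbol in alphabet}


alignment = ["-AGGCTATCACCTG",
             "TAG-CTACCA---G",
             "CAG-CTACCA---G",
             "CAG-CTATCAC-GG",
             "CAG-CTATCGC-GG"]
p = profile(alignment)
print(p["C"][0], p["T"][7], p["-"][3])  # 0.6 0.6 0.8
```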

Aligning Sequence Against Sequence
• In the past we were aligning a sequence against a sequence.

Aligning Sequence Against Profile
– Can we align a sequence against a profile?

Aligning Profile Against Profile
– Can we align a profile against a profile?
   T  A  G  -  C  T  A  C  C  A  -  -  -  G
   C  A  G  -  C  T  A  C  C  A  -  -  -  G
A  0  1  0  0  0  0  1  0  0 .8  0  0  0  0
C .6  0  0  0  1  0  0 .4  1  0 .6 .2  0  0
G  0  0  1 .2  0  0  0  0  0 .2  0  0 .4  1
T .2  0  0  0  0  1  0 .6  0  0  0  0 .2  0
- .2  0  0 .8  0  0  0  0  0  0 .4 .8 .4  0

Multiple Alignment: Greedy Approach
• Choose the most similar sequences and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k − 2 sequences and 1 profile.
• Iterate

Greedy Approach: Example
• Sequences: GATTCA, GTCTGA, GATATT, GTCAGC.
• 6 pairwise alignments (premium for match +1, penalties for indels and mismatches −1)
s2 GTCTGA                  s1 GATTCA--
s4 GTCAGC (score = 2)      s4 G-T-CAGC (score = 0)
s1 GAT-TCA                 s2 G-TCTGA
s2 G-TCTGA (score = 1)     s3 GATAT-T (score = -1)
s1 GAT-TCA                 s3 GAT-ATT
s3 GATAT-T (score = 1)     s4 G-TCAGC (score = -1)
Greedy Approach: Example
• Since s2 and s4 are closest, we consolidate them
into a profile:
s2 GTCTGA
s2,4 = GTCt/aGa/c
s4 GTCAGC
• New set of 3 sequences to align:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c

