
How Do We Compare Biological Sequences?

Dynamic Programming and Divide-and-Conquer Algorithms

Phillip Compeau and Pavel Pevzner.


Bioinformatics Algorithms: An Active Learning Approach
©2018 by Compeau and Pevzner. All rights reserved.
How Do We Compare Biological Sequences
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment

RNA Tie Club

ribosome

DNA → RNA → Protein


genetic code

RNA tie club


From Genetic Code to Non-Ribosomal Code

non-ribosomal
ribosome peptide synthetase

DNA → RNA → Protein → Peptide


genetic code non-ribosomal code

RNA tie club Marahiel


NRP Synthetase: A Giant Molecular Assembly Line

Adenylation domains

A1 A2 A3 A4 A5 A6 A7 A8 A9 A10

NRP synthetase adds one amino acid at a time
These Three A-domains Do Not Look Similar

YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTEFINHYGPTEATIGA

AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYIYEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS

IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSAPTMISSLEILFAAGDRLSSQDAILARRAVGSGVYNAYGPTENTVLS

just 3 conservative columns

Do they look similar now?

YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTEFINHYGPTEATIGA

-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYIYEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS

IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSAPTMISSLEILFAAGDRLSSQDAILARRAVGSGVYNAYGPTENTVLS

11 conservative columns

And now?

YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA

-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS

IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS

19 conservative columns!

Red Positions Encode Conservative Core of A-domains

YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA

-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS

IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS

Which positions are responsible for encoding different amino acids Asp, Orn, Val?

Blue Positions in A-domains Define Non-Ribosomal Code

YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA

-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS

IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS

LTKVGHIG Asp
VGEIGSID Orn
AWMFAAVL Val

Cystic Fibrosis
• Cystic fibrosis (CF): An often
fatal disease which affects the
respiratory system and
produces an abnormally large
amount of mucus.
– Mucus is a slimy material
that coats epithelial
surfaces and is secreted
into fluids such as saliva.

Approximately 1 in 25 Humans Carry a Faulty CF Gene

• In the early 1980s, biologists hypothesized that CF is
caused by mutations in an unidentified gene.

• When BOTH parents carry a faulty gene, there is a 25%
chance that their child will have cystic fibrosis.
Where Is the Cystic Fibrosis Gene?
• In the late 1980s, biologists
narrowed the search for the CF
gene to a small region on
chromosome 7.
• One of these genes was similar
to ATP binding proteins that act
as transport channels
responsible for secretion.
• Hint: cystic fibrosis involves
sweat secretion with abnormally
high sodium levels.

Adenosine triphosphate (ATP) transports chemical energy within cells.

[Figure: chromosome 7]
CFTR: Cystic Fibrosis Transmembrane Conductance Regulator

• The CFTR protein controls the flow of ions in and out of cells
inside the lungs.
How Do We Compare Biological Sequences
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment

The Alignment Game

A T G T T A T A
A T C G T C C

Alignment Game (maximizing the number of points):

• Remove the 1st symbol from each sequence
– 1 point if the symbols match, 0 points if they don’t match
• Remove the 1st symbol from one of the sequences
– 0 points

The Alignment Game

A T G T T A T A
A T C G T C C
+1

The Alignment Game

A T G T T A T A
A T C G T C C
+1+1

The Alignment Game

A T - G T T A T A
A T C G T C C
+1+1

The Alignment Game

A T - G T T A T A
A T C G T C C
+1+1 +1

The Alignment Game

A T - G T T A T A
A T C G T C C
+1+1 +1+1

The Alignment Game

A T - G T T A T A
A T C G T - C C
+1+1 +1+1

The Alignment Game

A T - G T T A T A
A T C G T - C - C
+1+1 +1+1

The Alignment Game

A T - G T T A T A
A T C G T - C - C
+1+1 +1+1 =4

What Is the Sequence Alignment?
matches insertions deletions mismatches

A T - G T T A T A
A T C G T - C - C
+1+1 +1+1 =4

Alignment of two sequences is a two-row matrix:

1st row: symbols of the 1st sequence (in order) interspersed by “-”
2nd row: symbols of the 2nd sequence (in order) interspersed by “-”

Longest Common Subsequence

A T - G T T A T A
A T C G T - C - C
The matches in an alignment of two sequences (ATGT) form their
Common Subsequence.

Longest Common Subsequence Problem: Find a longest
common subsequence of two strings.
• Input: Two strings.
• Output: A longest common subsequence of these strings.
How Do We Compare Biological Sequences
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment

From Manhattan to a Grid Graph

Walk from the source to the sink (only in the South ↓ and East →
directions) and visit the maximum number of attractions.
Manhattan Tourist Problem

Manhattan Tourist Problem: Find a longest path in a
rectangular city grid.
• Input: A weighted rectangular grid.
• Output: A longest path from the source to the sink in
the grid.
Greedy algorithm?

[Figure: a weighted rectangular city grid with a source in the top-left corner and a sink in the bottom-right corner]
Greedy algorithm?

[Figure: the same grid with a greedy path highlighted; the running totals along the greedy path end at 23]
From a regular to an irregular grid

[Figure: the city grid with additional weighted edges, so that some nodes have more than two incoming edges]
Search for Longest Paths in a Directed Graph

Longest Path in a Directed Graph Problem: Find a
longest path between two nodes in an edge-weighted
directed graph.
• Input: An edge-weighted directed graph with
source and sink nodes.
• Output: A longest path from source to sink in
the directed graph.
Do You See a Connection between
the Manhattan Tourist and the Alignment Game?

A T - G T T A T A
A T C G T - C - C
↘ ↘ → ↘ ↘ ↓ ↘ ↓ ↘

alignment → path?

A T - G T T A T A
A T C G T - C - C
↘ ↘ → ↘ ↘ ↓ ↘ ↓ ↘

[Figure: the alignment drawn as a path through a grid whose columns are labeled A T C G T C C and whose rows are labeled A T G T T A T A]
path → alignment?

A T G T T - A T A
- - A T C G T C C
↓ ↓ ↘ ↘ ↘ → ↘ ↘ ↘

highest-scoring alignment = longest path in a properly built Manhattan

[Figure: a path through the grid with columns A T C G T C C and rows A T G T T A T A]
How to build a Manhattan for the Alignment Game and the
Longest Common Subsequence Problem?

Diagonal red edges correspond to matching symbols and have score 1.

[Figure: the grid with columns A T C G T C C and rows A T G T T A T A, with a red diagonal edge wherever the row and column symbols match]
How Do We Compare Biological Sequences
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment

The Change Problem
Change Problem: Find the minimum number of coins
needed to make change.
• Input: An integer money and an array of positive
integers (coin_1, ..., coin_d).
• Output: The minimum number of coins with
denominations (coin_1, ..., coin_d) that changes money.
Changing Money in a Greedy Way

GreedyChange(money)
    change ← empty collection of coins
    while money > 0
        coin ← largest denomination that does not exceed money
        add coin to change
        money ← money − coin
    return change
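
The greedy strategy above can be sketched in Python as follows. This is only an illustration: the function name `greedy_change` and the denomination list (25, 20, 10, 5, 1) are assumptions chosen to match the 40-cent example on the next slides.

```python
def greedy_change(money, coins):
    """Repeatedly take the largest coin that does not exceed the remaining amount."""
    change = []
    while money > 0:
        usable = [c for c in coins if c <= money]
        if not usable:
            # No coin fits, so exact change is impossible with these denominations.
            raise ValueError("cannot make exact change with these coins")
        coin = max(usable)
        change.append(coin)
        money -= coin
    return change


# Assumed denominations: greedy uses three coins although 20 + 20 needs only two.
print(greedy_change(40, [25, 20, 10, 5, 1]))  # [25, 10, 5]
```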

Changing Money in Tanzania

40 cents = 25+10+5
Greedy

Changing Money in Tanzania: GreedyChange Fails

40 cents = 25+10+5 = 20+20


Greedy is not Optimal

Recursive Change
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins ?

MinNumCoins(9)= ?

Recursive Change
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins ? ? ? ?

MinNumCoins(9)= ?

Recursive Change
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins ? ? ? ?

$$
\text{MinNumCoins}(9) = \min \begin{cases}
\text{MinNumCoins}(9-6) + 1 = \text{MinNumCoins}(3) + 1 \\
\text{MinNumCoins}(9-5) + 1 = \text{MinNumCoins}(4) + 1 \\
\text{MinNumCoins}(9-1) + 1 = \text{MinNumCoins}(8) + 1
\end{cases}
$$
Recursive Change
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins ? ? ? ?

MinNumCoins(3)=
MinNumCoins(4)=
MinNumCoins(8)=
?

Recursive Change
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins ? ? ? ? ? ?

MinNumCoins(3)=
MinNumCoins(4)=
MinNumCoins(8)=
?

Recursive Change
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins ? ? ? ? ? ?

$$
\text{MinNumCoins}(money) = \min \begin{cases}
\text{MinNumCoins}(money-6) + 1 \\
\text{MinNumCoins}(money-5) + 1 \\
\text{MinNumCoins}(money-1) + 1
\end{cases}
$$
Recursive Change
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins ? ? ? ? ? ?

$$
\text{MinNumCoins}(money) = \min \begin{cases}
\text{MinNumCoins}(money - coin_1) + 1 \\
\quad \vdots \\
\text{MinNumCoins}(money - coin_d) + 1
\end{cases}
$$
RecursiveChange
RecursiveChange(money, coins)
    if money = 0
        return 0
    MinNumCoins ← infinity
    for i ← 1 to |coins|
        if money ≥ coin_i
            NumCoins ← RecursiveChange(money − coin_i, coins)
            if NumCoins + 1 < MinNumCoins
                MinNumCoins ← NumCoins + 1
    return MinNumCoins
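
A direct Python transcription of this pseudocode might look like the sketch below. The name `recursive_change` is an assumption, and the function is only practical for small amounts because it recomputes the same subproblems many times.

```python
import math


def recursive_change(money, coins):
    """Minimum number of coins that change `money`, by plain recursion."""
    if money == 0:
        return 0
    min_num_coins = math.inf
    for coin in coins:
        if money >= coin:
            # Use one `coin`, then change the remainder optimally.
            num_coins = recursive_change(money - coin, coins)
            if num_coins + 1 < min_num_coins:
                min_num_coins = num_coins + 1
    return min_num_coins


print(recursive_change(9, [6, 5, 1]))  # 4
```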

How Fast Is RecursiveChange?

The Recursive Tree

[Figure: the recursion tree of RecursiveChange(76) with denominations 6, 5, and 1; each node calls itself on money − 6, money − 5, and money − 1]

level 0: 76
level 1: 70 71 75
level 2: 64 65 69 | 65 66 70 | 69 70 74
level 3: 58 59 63 | 59 60 64 | 63 64 68 | 59 60 64 | 60 61 65 | 64 65 69 | 63 64 68 | 64 65 69 | 68 69 73
the optimal coin combination for 69 cents is computed 6 times!


the optimal coin combination for 30 cents is computed trillions of times!
Changing Money by Dynamic Programming
Hint. Wouldn’t it be nice to know all the values of
MinNumCoins(money − coin_i)
by the time we need to compute
MinNumCoins(money)?

Instead of the time-consuming calls
RecursiveChange(money − coin_i, coins)
we would simply look up the values of
MinNumCoins(money − coin_i).

[Photo: Richard Bellman]
Changing Money by Dynamic Programming
Given the denominations 6, 5, and 1, what is the minimum
number of coins needed to change 9 cents?
money 0 1 2 3 4 5 6 7 8 9 10 11 12
MinNumCoins 0 1 2 3 4 1 1 2 3 ?

DPChange

DPChange(money, coins)
    MinNumCoins(0) ← 0
    for m ← 1 to money
        MinNumCoins(m) ← infinity
        for i ← 1 to |coins|
            if m ≥ coin_i
                if MinNumCoins(m − coin_i) + 1 < MinNumCoins(m)
                    MinNumCoins(m) ← MinNumCoins(m − coin_i) + 1
    return MinNumCoins(money)
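
The same table-filling idea, sketched in Python. The name `dp_change` is an assumption; the list `min_num_coins` plays the role of the MinNumCoins table, so every value is computed once and then looked up.

```python
import math


def dp_change(money, coins):
    """Minimum number of coins that change `money`, filling a table bottom-up."""
    min_num_coins = [0] + [math.inf] * money
    for m in range(1, money + 1):
        for coin in coins:
            # Look up the already computed answer for m - coin.
            if m >= coin and min_num_coins[m - coin] + 1 < min_num_coins[m]:
                min_num_coins[m] = min_num_coins[m - coin] + 1
    return min_num_coins[money]


print(dp_change(9, [6, 5, 1]))  # 4
```

Unlike the recursive version, this runs in time proportional to money · |coins|.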

“Programming” in “Dynamic Programming”
Has Nothing to Do with Programming!
Richard Bellman developed this idea in the 1950s while working on
an Air Force project.

At that time, his approach seemed completely impractical.

He wanted to hide from the Secretary of Defense that he was really
doing math.

“…What name could I choose? I was interested in planning but
planning, is not a good word for various reasons. I decided therefore
to use the word, “programming” and I wanted to get across the
idea that this was dynamic. It was something not even a
Congressman could object to. So I used it as an umbrella for my
activities.”
How Do We Compare Biological Sequences
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment

There are only 2 ways to arrive at the sink:
by moving South ↓ or by moving East →

South or East?

[Figure: the weighted city grid with the two edges entering the sink highlighted]
South or East?

SouthOrEast(n, m)
    if n = 0 and m = 0
        return 0
    x ← −infinity, y ← −infinity
    if n > 0
        x ← SouthOrEast(n−1, m) + weight of edge “↓” into (n, m)
    if m > 0
        y ← SouthOrEast(n, m−1) + weight of edge “→” into (n, m)
    return max{x, y}

[Figure sequence: filling in the length of a longest path to every node (i, j) of the city grid, where i is the row and j is the column]

• Row 0 and column 0 can each be reached in only one way: row 0 is 0, 3, 5, 9, 9 and column 0 is 0, 1, 5, 9, 14.
• South or East into (1,1)? 1+3 > 3+0, so the value is 4 and we arrive at (1,1) by the East edge.
• South or East into (2,1)? 5+0 < 4+6, so the value is 10 and we arrive at (2,1) by the South edge.
• Column 1 becomes 3, 4, 10, 14, 20.
• South or East into (1,2)? 5+2 > 4+2, so the value is 7.
• The completed table:

 0  3  5  9  9
 1  4  7 13 15
 5 10 17 20 24
 9 14 22 22 25
14 20 30 32 34

Backtracking pointers: the best way to get to each node

[Figure: the grid with only the chosen incoming edge kept at each node]
Dynamic Programming Recurrence

s_{i,j}: the length of a longest path from (0,0) to (i,j)

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + \text{weight of edge } \downarrow \text{ into } (i,j) \\
s_{i,j-1} + \text{weight of edge } \rightarrow \text{ into } (i,j)
\end{cases}
$$
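
A minimal Python sketch of this recurrence follows. The function name `manhattan_tourist` and the two weight matrices `down` and `right` are assumptions about how the grid could be stored; the sample weights are the top-left corner of the grid shown on the slides.

```python
def manhattan_tourist(down, right):
    """Length of a longest path from (0, 0) to the sink of a weighted grid.

    down[i-1][j]  is the weight of the vertical edge into (i, j)    (assumed layout).
    right[i][j-1] is the weight of the horizontal edge into (i, j)  (assumed layout).
    """
    n = len(down)        # number of rows of vertical edges
    m = len(right[0])    # number of horizontal edges in each row
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row can be reached in only one way.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    # Every other node takes the better of arriving from the north or the west.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]


down = [[1, 0, 2], [4, 6, 5]]
right = [[3, 2], [3, 2], [0, 7]]
print(manhattan_tourist(down, right))  # 17
```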

How Do We Compare Biological Sequences
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment

How does the recurrence change for this graph?

[Figure: the irregular city grid, in which some nodes have more than two incoming edges]
$$
s_a = \max_{\text{all predecessors } b \text{ of node } a} \{\, s_b + \text{weight of edge from } b \text{ to } a \,\}
$$

4 choices for the node marked “?”: 5+2, 3+7, 5+4, 4+2. The maximum is 3+7 = 10.

[Figure: the irregular grid with the length of a longest path computed at every node]
Dynamic Programming Recurrence for the
Alignment Graph
s_{i,j}: the length of a longest path from (0,0) to (i,j)

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + \text{weight of edge } \downarrow \text{ into } (i,j) \\
s_{i,j-1} + \text{weight of edge } \rightarrow \text{ into } (i,j) \\
s_{i-1,j-1} + \text{weight of edge } \searrow \text{ into } (i,j)
\end{cases}
$$

red edges ↘ – weight 1
other edges – weight 0
Dynamic Programming Recurrence for the
Longest Common Subsequence Problem
s_{i,j}: the length of a longest path from (0,0) to (i,j)

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + 0 \\
s_{i,j-1} + 0 \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{cases}
$$

red edges ↘ – weight 1
other edges – weight 0
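
As a small illustration of this recurrence, the sketch below fills the table s and returns the length of a longest common subsequence. The function name `lcs_length` is an assumption; note that Python strings are 0-indexed, so v_i corresponds to `v[i - 1]`.

```python
def lcs_length(v, w):
    """Length of a longest common subsequence of strings v and w."""
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            # Vertical and horizontal edges have weight 0.
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            # A diagonal edge of weight 1 exists only when the symbols match.
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    return s[len(v)][len(w)]


print(lcs_length("ATGTTATA", "ATCGTCC"))  # 4
```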

backtracking pointers for the Longest Common Subsequence

red edges ↘ – weight 1
other edges – weight 0

[Figure: the alignment graph with columns A T C G T C C and rows A T G T T A T A, showing the backtracking pointer chosen at each node]
Computing Backtracking Pointers

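$$
s_{i,j} \leftarrow \max \begin{cases}
s_{i,j-1} + 0 \\
s_{i-1,j} + 0 \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{cases}
$$

$$
backtrack_{i,j} \leftarrow \begin{cases}
\rightarrow, & \text{if } s_{i,j} = s_{i,j-1} \\
\downarrow, & \text{if } s_{i,j} = s_{i-1,j} \\
\searrow, & \text{if } s_{i,j} = s_{i-1,j-1} + 1
\end{cases}
$$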
Why did we store the backtracking pointers?

[Figure: the city grid with the longest-path lengths and the backtracking pointer kept at each node]
What is the optimal alignment path?

[Figure: the same grid with the path from the sink back to the source, following the backtracking pointers, highlighted]
Using Backtracking Pointers to Compute LCS

OutputLCS(backtrack, v, i, j)
    if i = 0 or j = 0
        return
    if backtrack_{i,j} = “→”
        OutputLCS(backtrack, v, i, j−1)
    else if backtrack_{i,j} = “↓”
        OutputLCS(backtrack, v, i−1, j)
    else
        OutputLCS(backtrack, v, i−1, j−1)
        output v_i
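
Putting the backtracking pointers and OutputLCS together, one possible Python sketch is shown below. The names `lcs_backtrack` and `output_lcs` are assumptions, and the recursive version returns the subsequence as a string instead of printing symbols one by one.

```python
def lcs_backtrack(v, w):
    """Fill the table s and record a backtracking pointer for every node."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    backtrack = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i][j - 1], s[i - 1][j])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
            # Same order of checks as on the slide: east, south, diagonal.
            if s[i][j] == s[i][j - 1]:
                backtrack[i][j] = "→"
            elif s[i][j] == s[i - 1][j]:
                backtrack[i][j] = "↓"
            else:
                backtrack[i][j] = "↘"
    return backtrack


def output_lcs(backtrack, v, i, j):
    """Follow the pointers from (i, j) back to the border and collect matches."""
    if i == 0 or j == 0:
        return ""
    if backtrack[i][j] == "→":
        return output_lcs(backtrack, v, i, j - 1)
    if backtrack[i][j] == "↓":
        return output_lcs(backtrack, v, i - 1, j)
    return output_lcs(backtrack, v, i - 1, j - 1) + v[i - 1]


v, w = "ATGTTATA", "ATCGTCC"
print(output_lcs(lcs_backtrack(v, w), v, len(v), len(w)))  # ATGT
```

For long strings the recursion depth can reach |v| + |w|, so an iterative walk over the pointers may be preferable in practice.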

Computing Scores of ALL Predecessors
[Figure: a small weighted graph in which the score of the node marked “?” depends on the scores of all of its predecessors]

$$
s_a = \max_{\text{ALL predecessors } b \text{ of node } a} \{\, s_b + \text{weight of edge from } b \text{ to } a \,\}
$$
A Vicious Cycle
[Figure: the same graph, in which the nodes marked “?” form a directed cycle, so each score depends on a score that has not been computed yet]
In What Order Should We Explore Nodes of the Graph?

$$
s_a = \max_{\text{ALL predecessors } b \text{ of node } a} \{\, s_b + \text{weight of edge from } b \text{ to } a \,\}
$$

• By the time a node is analyzed, the scores of all its
predecessors should already be computed.
• If the graph has a directed cycle, this condition is
impossible to satisfy.
• Directed Acyclic Graph (DAG): a graph without directed
cycles.
Topological Ordering
• Topological Ordering: Ordering of nodes of a DAG on a line
such that all edges go from left to right.

• Theorem: Every DAG has a topological ordering.

LongestPath
LongestPath(Graph, source, sink)
    for each node a in Graph
        s_a ← −infinity
    s_source ← 0
    topologically order Graph
    for each node a (from source to sink in topological order)
        s_a ← max over all predecessors b of node a {s_b + weight of edge from b to a}
    return s_sink
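
One way to sketch LongestPath in Python is with the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later). The function name `longest_path`, the predecessor-dictionary representation, and the small example graph are all assumptions made for illustration.

```python
import math
from graphlib import TopologicalSorter


def longest_path(predecessors, source, sink):
    """Length of a longest path from source to sink in a DAG.

    `predecessors` maps each node a to a dict {b: weight of edge from b to a}.
    """
    nodes = set(predecessors)
    for preds in predecessors.values():
        nodes.update(preds)
    s = {node: -math.inf for node in nodes}
    s[source] = 0
    # TopologicalSorter expects node -> iterable of predecessors.
    order = TopologicalSorter({a: set(p) for a, p in predecessors.items()}).static_order()
    for a in order:
        for b, weight in predecessors.get(a, {}).items():
            s[a] = max(s[a], s[b] + weight)
    return s[sink]


# Hypothetical graph: edges a→b (5), a→c (2), c→b (4), b→d (1), c→d (7).
graph = {"a": {}, "b": {"a": 5, "c": 4}, "c": {"a": 2}, "d": {"b": 1, "c": 7}}
print(longest_path(graph, "a", "d"))  # 9
```

If the graph contains a directed cycle, `static_order()` raises `graphlib.CycleError`, which matches the observation that no valid order exists in that case.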

How Do We Compare Biological Sequences
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment

Current (Primitive) Scoring
#matches

Mismatches and Indel Penalties
#matches − μ · #mismatches − σ · #indels

A T - G T T A T A
A T C G T - C - C
+1+1-2+1+1-2-3-2-3 = -8

(In this example each mismatch costs μ = 3 and each indel costs σ = 2.)
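
The scoring rule can be checked with a few lines of Python. This is a sketch: the function name `alignment_score` is an assumption, and the values μ = 3 and σ = 2 are inferred from the per-column scores in the example.

```python
def alignment_score(row1, row2, mu, sigma):
    """Score = #matches - mu * #mismatches - sigma * #indels."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score -= sigma      # insertion or deletion
        elif a == b:
            score += 1          # match
        else:
            score -= mu         # mismatch
    return score


# 4 matches, 2 mismatches, 3 indels: 4 - 2*3 - 3*2
print(alignment_score("AT-GTTATA", "ATCGT-C-C", mu=3, sigma=2))  # -8
```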

Scoring matrix:

     A    C    G    T    −
A   +1   −μ   −μ   −μ   −σ
C   −μ   +1   −μ   −μ   −σ
G   −μ   −μ   +1   −μ   −σ
T   −μ   −μ   −μ   +1   −σ
−   −σ   −σ   −σ   −σ

Even more general scoring matrix:

     A    C    G    T    −
A   +1   −3   −5   −1   −3
C   −4   +1   −3   −2   −3
G   −9   −7   +1   −1   −3
T   −3   −5   −8   +1   −4
−   −4   −2   −2   −1
Scoring Matrices for Amino Acid Sequences

Y (Tyr) often mutates into F (score +7)
but rarely mutates into P (score −5)

[Figure: an amino acid scoring matrix with the entries −5 and 7 highlighted]
Dynamic Programming Recurrence for the
Alignment Graph

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + \text{weight of edge } \downarrow \text{ into } (i,j) \\
s_{i,j-1} + \text{weight of edge } \rightarrow \text{ into } (i,j) \\
s_{i-1,j-1} + \text{weight of edge } \searrow \text{ into } (i,j)
\end{cases}
$$
Dynamic Programming Recurrence for the
Alignment Graph
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu, & \text{if } v_i \neq w_j
\end{cases}
$$
Dynamic Programming Recurrence for the
Alignment Graph

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + score(v_i, -) \\
s_{i,j-1} + score(-, w_j) \\
s_{i-1,j-1} + score(v_i, w_j)
\end{cases}
$$
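
The recurrence with a general scoring function can be sketched in Python as below. The names `global_alignment` and `simple_score` are assumptions, and `simple_score` (match +1, mismatch −μ, indel −σ with default μ = σ = 1) is only a stand-in for a real scoring matrix.

```python
def simple_score(a, b, mu=1, sigma=1):
    """Hypothetical scoring function: '-' denotes a gap symbol."""
    if a == "-" or b == "-":
        return -sigma
    return 1 if a == b else -mu


def global_alignment(v, w, score):
    """Return (best score, aligned v, aligned w) for strings v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + score(v[i - 1], "-")
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + score("-", w[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + score(v[i - 1], "-"),
                          s[i][j - 1] + score("-", w[j - 1]),
                          s[i - 1][j - 1] + score(v[i - 1], w[j - 1]))
    # Walk back from the sink, re-deriving which edge produced each value.
    row_v, row_w = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + score(v[i - 1], w[j - 1]):
            row_v.append(v[i - 1]); row_w.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + score(v[i - 1], "-"):
            row_v.append(v[i - 1]); row_w.append("-"); i -= 1
        else:
            row_v.append("-"); row_w.append(w[j - 1]); j -= 1
    return s[n][m], "".join(reversed(row_v)), "".join(reversed(row_w))


print(global_alignment("AT", "A", simple_score))  # (0, 'AT', 'A-')
```

The traceback here recomputes the winning edge instead of storing backtracking pointers; storing pointers while filling the table would work equally well.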

Global Alignment

Global Alignment Problem: Find the highest-scoring
alignment between two strings by using a scoring matrix.

• Input: Strings v and w as well as a matrix score.

• Output: An alignment of v and w whose alignment
score (as defined by the scoring matrix score) is
maximal among all possible alignments of v and w.
Homeobox Genes
• Two genes in different species
may be similar over short
conserved regions and dissimilar
over remaining regions.

• Homeobox genes have a short
region called the homeodomain
that is highly conserved among
species.
• A global alignment may not find the
homeodomain because it would try to
align the entire sequence.
Which Alignment is Better?
• Alignment 1: score = 22 (matches) - 20 (indels)=2.

GCC-C-AGT--TATGT-CAGGGGGCACG--A-GCATGCAGA-
GCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT

• Alignment 2: score = 17 (matches) - 30 (indels)=-13.


---G----C-----C--CAGTTATGTCAGGGGGCACGAGCATGCAGA
GCCGCCGTCGTTTTCAGCAGTTATGTCAG-----A------T-----

In Alignment 2, the region where CAGTTATGTCAG aligns to itself without indels is a local alignment.
[Figure: alignment graph of the two strings, with the global alignment (Alignment 1) and the local alignment (Alignment 2) drawn as paths.]
Local Alignment = Global Alignment in a Subrectangle

Compute a global alignment within each rectangle to get a local alignment.
Local Alignment Problem

Local Alignment Problem: Find the highest-scoring local alignment between two strings.

• Input: Strings v and w as well as a matrix score.

• Output: Substrings of v and w whose global alignment score (as defined by the matrix score) is maximal among all global alignments of all substrings of v and w.
Free Taxi Rides!
[Figure: the global and local alignment paths in the alignment graph.]

What Do Free Taxi Rides Mean in Terms of the Alignment Graph?
Building Manhattan for the Local Alignment Problem

How many edges have we added?


Dynamic Programming for the Local Alignment

$$
s_{i,j} = \max
\begin{cases}
\text{weight of edge from } (0,0) \text{ to } (i,j) \\
s_{i-1,j} + \text{weight of edge } \downarrow \text{ into } (i,j) \\
s_{i,j-1} + \text{weight of edge } \rightarrow \text{ into } (i,j) \\
s_{i-1,j-1} + \text{weight of edge } \searrow \text{ into } (i,j)
\end{cases}
$$
Dynamic Programming for the Local Alignment

$$
s_{i,j} = \max
\begin{cases}
0 \\
s_{i-1,j} + \text{weight of edge } \downarrow \text{ into } (i,j) \\
s_{i,j-1} + \text{weight of edge } \rightarrow \text{ into } (i,j) \\
s_{i-1,j-1} + \text{weight of edge } \searrow \text{ into } (i,j)
\end{cases}
$$
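The extra 0 in the recurrence is the free taxi ride from the source: a path may start anywhere at no cost, and it may end at the best-scoring node. The sketch below shows one way to code this; `local_alignment` and `unit_score` are hypothetical names, and the +1/-1/-1 scoring is an assumption made for the example.

```python
def unit_score(a, b, gap="-"):
    """Assumed scoring: +1 match, -1 mismatch, -1 indel."""
    if a == gap or b == gap:
        return -1
    return 1 if a == b else -1


def local_alignment(v, w, score, gap="-"):
    """Return (best local score, aligned substring of v, aligned substring of w)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]  # None = free ride from source
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0, None),
                (s[i - 1][j] + score(v[i - 1], gap), "down"),
                (s[i][j - 1] + score(gap, w[j - 1]), "right"),
                (s[i - 1][j - 1] + score(v[i - 1], w[j - 1]), "diag"),
            ]
            s[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
            if s[i][j] > best:             # free ride from (i, j) to the sink
                best, best_pos = s[i][j], (i, j)
    # Backtrack from the best node until a free ride from the source is reached.
    row_v, row_w = [], []
    i, j = best_pos
    while back[i][j] is not None:
        move = back[i][j]
        if move == "diag":
            row_v.append(v[i - 1]); row_w.append(w[j - 1]); i -= 1; j -= 1
        elif move == "down":
            row_v.append(v[i - 1]); row_w.append(gap); i -= 1
        else:
            row_v.append(gap); row_w.append(w[j - 1]); j -= 1
    return best, "".join(reversed(row_v)), "".join(reversed(row_w))


v = "GCCCAGTTATGTCAGGGGGCACGAGCATGCAGA"
w = "GCCGCCGTCGTTTTCAGCAGTTATGTCAGAT"
print(local_alignment(v, w, unit_score))
```

Only the nodes with a non-None pointer are visited during backtracking, so the reported rows cover just the conserved region rather than the full strings.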
Scoring Gaps
• We previously assigned a fixed penalty σ to
each indel.
• However, this fixed penalty may be too severe
for a series of 100 consecutive indels.
• A series of k indels often represents a single
evolutionary event (gap) rather than k events:

two gaps (lower score)      a single gap (higher score)
GATCCAG                     GATCCAG
GA-C-AG                     GA--CAG
More Adequate Gap Penalties
Affine gap penalty for a gap of length k: σ + ε·(k-1)

σ: the gap opening penalty
ε: the gap extension penalty

σ > ε, since starting a gap should be penalized more than extending it.
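A small arithmetic sketch makes the effect of the affine penalty concrete. The values σ = 5 and ε = 1 are assumptions chosen only for illustration, as is the function name `affine_gap_penalty`.

```python
def affine_gap_penalty(k, sigma, epsilon):
    """Penalty sigma + epsilon * (k - 1) for a single gap of length k >= 1."""
    if k < 1:
        raise ValueError("a gap has length at least 1")
    return sigma + epsilon * (k - 1)


sigma, epsilon = 5, 1  # assumed values with sigma > epsilon

# GA-C-AG: two separate gaps of length 1.
two_gaps = 2 * affine_gap_penalty(1, sigma, epsilon)
# GA--CAG: one gap of length 2.
single_gap = affine_gap_penalty(2, sigma, epsilon)

print(two_gaps, single_gap)
```

With any σ > ε the single gap is penalized less than the two separate gaps, which is why GA--CAG receives the higher score.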
Modelling Affine Gap Penalties by Long Edges

Building Manhattan with Affine Gap Penalties
[Figure: long edges added to the alignment graph, with penalty σ+ε for a gap of length 2 and σ+ε∙2 for a gap of length 3.]

We have just added O(n^3) edges to the graph…
Building Manhattan on 3 levels

• upper level (insertions)
• middle level (matches/mismatches)
• bottom level (deletions)
How can we emulate this path in the 3-level Manhattan?

$$
\text{lower}_{i,j} = \max
\begin{cases}
\text{lower}_{i-1,j} - \varepsilon \\
\text{middle}_{i-1,j} - \sigma
\end{cases}
$$

$$
\text{upper}_{i,j} = \max
\begin{cases}
\text{upper}_{i,j-1} - \varepsilon \\
\text{middle}_{i,j-1} - \sigma
\end{cases}
$$

$$
\text{middle}_{i,j} = \max
\begin{cases}
\text{lower}_{i,j} \\
\text{middle}_{i-1,j-1} + \text{score}(v_i, w_j) \\
\text{upper}_{i,j}
\end{cases}
$$
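The three recurrences translate almost line by line into three tables. The following sketch computes only the optimal score. The name `affine_alignment_score`, the helper `match_score`, and the parameter values in the usage line are assumptions for illustration.

```python
NEG_INF = float("-inf")


def match_score(a, b):
    """Assumed symbol scoring: +1 match, -1 mismatch."""
    return 1 if a == b else -1


def affine_alignment_score(v, w, score, sigma, epsilon):
    """Best global alignment score with gap penalty sigma + epsilon * (k - 1)."""
    n, m = len(v), len(w)
    lower = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # paths ending in a deletion
    middle = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # any path
    upper = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # paths ending in an insertion
    middle[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:  # extend a gap (epsilon) or open one from the middle level (sigma)
                lower[i][j] = max(lower[i - 1][j] - epsilon, middle[i - 1][j] - sigma)
            if j > 0:
                upper[i][j] = max(upper[i][j - 1] - epsilon, middle[i][j - 1] - sigma)
            diag = NEG_INF
            if i > 0 and j > 0:
                diag = middle[i - 1][j - 1] + score(v[i - 1], w[j - 1])
            middle[i][j] = max(lower[i][j], diag, upper[i][j])
    return middle[n][m]


print(affine_alignment_score("GATCCAG", "GACAG", match_score, sigma=3, epsilon=1))
```

Each node now exists on three levels, so the graph has O(nm) nodes and edges instead of the O(n^3) long edges.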
Middle Column of the Alignment
middle column (middle = #columns/2)
Middle Node of the Alignment
middle node (a node where an optimal alignment path crosses the middle column)
Divide and Conquer Approach to Sequence Alignment

AlignmentPath(source, sink)
    find MiddleNode
    AlignmentPath(source, MiddleNode)
    AlignmentPath(MiddleNode, sink)

The only problem left is how to find this middle node in linear space!
Computing Alignment Score in Linear Space

Finding the longest path in the alignment graph requires storing all backtracking pointers: O(nm) memory.

Finding the length of the longest path in the alignment graph does not require storing any backtracking pointers: O(n) memory.
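Computing only the score needs just the previous column of the table. The sketch below keeps two columns at a time; the function name `last_column_scores` and the LCS-style helper `lcs_score` (match 1, everything else 0) are assumptions for the example.

```python
def lcs_score(a, b, gap="-"):
    """Assumed LCS-style scoring: 1 for a match, 0 for mismatches and indels."""
    return 1 if (a == b and a != gap) else 0


def last_column_scores(v, w, score, gap="-"):
    """Scores s(i, len(w)) for all i, storing only two columns at a time."""
    n = len(v)
    col = [0] * (n + 1)
    for i in range(1, n + 1):                  # column 0: only "down" edges
        col[i] = col[i - 1] + score(v[i - 1], gap)
    for c in w:                                # recycle the column for each symbol of w
        new = [col[0] + score(gap, c)]
        for i in range(1, n + 1):
            new.append(max(new[i - 1] + score(v[i - 1], gap),   # down
                           col[i] + score(gap, c),              # right
                           col[i - 1] + score(v[i - 1], c)))    # diagonal
        col = new
    return col


column = last_column_scores("ATTCAA", "ACGGAA", lcs_score)
print(column[-1])  # score of the best alignment of the full strings
```

Memory use is proportional to the length of v, regardless of the length of w.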
Recycling the Columns in the Alignment Graph
        A  C  G  G  A  A
     0  0  0  0  0  0  0
  A  0  1  1  1  1  1  1
  T  0  1  1  1  1  1  1
  T  0  1  1  1  1  1  1
  C  0  1  2  2  2  2  2
  A  0  1  2  2  2  3  3
  A  0  1  2  2  2  3  4
Can We Find the Middle Node without Constructing the Longest Path?

i-path: a longest path among paths that visit the i-th node in the middle column.

[Figure: a 4-path, which visits the node (4, middle) in the middle column.]
Can We Find the Lengths of All i-paths?

length(i): length of an i-path

i          0  1  2  3  4  5  6
length(i)  2  3  3  3  4  3  2

For example, length(0) = 2 and length(4) = 4.

length(i) = fromSource(i) + toSink(i)
Computing fromSource and toSink

      fromSource(i)        toSink(i)
         A  C  G              G  A  A
      0  0  0  0           2  2  1  0
  A   0  1  1  1           2  2  1  0
  T   0  1  1  1           2  2  1  0
  T   0  1  1  1           2  2  1  0
  C   0  1  2  2           2  2  1  0
  A   0  1  2  2           1  1  1  0
  A   0  1  2  2           0  0  0  0
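The identity length(i) = fromSource(i) + toSink(i) suggests a direct way to find a middle node: run the column-recycling computation from the source up to the middle column, run it again on the reversed strings from the sink, and add the two columns. The sketch below does this for LCS scoring; `lcs_last_column` and `middle_node` are hypothetical names.

```python
def lcs_last_column(v, w):
    """LCS lengths of every prefix of v against all of w, in O(len(v)) space."""
    col = [0] * (len(v) + 1)
    for c in w:
        new = [0] * (len(v) + 1)
        for i in range(1, len(v) + 1):
            new[i] = max(new[i - 1], col[i], col[i - 1] + (1 if v[i - 1] == c else 0))
        col = new
    return col


def middle_node(v, w):
    """Return (row of a middle node, list of length(i) for every row i)."""
    middle = len(w) // 2
    from_source = lcs_last_column(v, w[:middle])
    # Paths to the sink are paths from the source in the reversed graph.
    to_sink = lcs_last_column(v[::-1], w[middle:][::-1])[::-1]
    lengths = [a + b for a, b in zip(from_source, to_sink)]
    best_row = max(range(len(lengths)), key=lengths.__getitem__)
    return best_row, lengths


row, lengths = middle_node("ATTCAA", "ACGGAA")
print(row, lengths)
```

Each half of the computation touches about half of the table, which is the area/2 + area/2 accounting used to bound the running time.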
How Much Time Did It Take to Find the Middle Node?

Computing fromSource(i) takes time proportional to area/2, and computing toSink(i) takes time proportional to area/2:

area/2 + area/2 = area
Laughable Progress: O(nm) Time to Find ONE Node!
How much time would it take to conquer 2 subproblems?

Each subproblem can be conquered in time proportional to its area: area/4 + area/4 = area/2
Laughable Progress: O(nm+nm/2) Time to Find THREE Nodes!
How much time would it take to conquer 4 subproblems?

Each subproblem can be conquered in time proportional to its area: area/16 + area/16 + area/16 + area/16 = area/4
O(nm+nm/2+nm/4) Time to Find NEARLY ALL Nodes!
How much time would it take to conquer ALL subproblems?

area + area/2 + area/4 + area/8 + area/16 + … < 2·area
Total Time: area + area/2 + area/4 + area/8 + area/16 + …

1st pass: 1 area
2nd pass: 1/2
3rd pass: 1/4
4th pass: 1/8

1 + ½ + ¼ + ... < 2
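The bound follows from the finite geometric sum. The derivation below is a sketch under the slides' simplification that each pass costs half of the previous one; T is an illustrative symbol for the number of passes after the first.

```latex
\text{area}\cdot\sum_{t=0}^{T}\frac{1}{2^{t}}
  = \text{area}\cdot\left(2 - \frac{1}{2^{T}}\right)
  < 2\cdot\text{area}
```

Since area = nm, the divide-and-conquer alignment still runs in O(nm) time while using only linear space.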
The Middle Edge
Middle Edge: an edge in an optimal alignment path starting at the middle node.
The Middle Edge Problem
Middle Edge in Linear Space Problem: Find a middle edge in the alignment graph in linear space.

• Input: Two strings and matrix score.

• Output: A middle edge in the alignment graph of these strings (as defined by the matrix score).
Recursive LinearSpaceAlignment
LinearSpaceAlignment(top, bottom, left, right)
    if left = right
        return alignment formed by bottom-top edges “↓”
    middle ← ⌊(left+right)/2⌋
    midNode ← MiddleNode(top, bottom, left, right)
    midEdge ← MiddleEdge(top, bottom, left, right)
    LinearSpaceAlignment(top, midNode, left, middle)
    output midEdge
    if midEdge = “→” or midEdge = “↘”
        middle ← middle+1
    if midEdge = “↓” or midEdge = “↘”
        midNode ← midNode+1
    LinearSpaceAlignment(midNode, bottom, middle, right)
From Pairwise to Multiple Alignment

• Up until now we have only tried to align two sequences.
• A faint (and statistically insignificant) similarity between two sequences becomes significant if it is present in many other sequences.
• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.
Alignment of Three A-domains

YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA

-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS

IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS

Generalizing Pairwise to Multiple Alignment

• Alignment of 2 sequences is a 2-row matrix.

• Alignment of 3 sequences is a 3-row matrix.

A T - G C G -
A - C G T - A
A T C A C - A

• Our scoring function should score alignments with conserved columns higher.
Alignments = Paths in 3-D

• Alignment of ATGC, AATC, and ATGC

0  1  1  2  3  4    #symbols up to a given position
   A  -- T  G  C

0  1  2  3  3  4
   A  A  T  -- C

0  0  1  2  3  4
   -- A  T  G  C

(0,0,0)→(1,1,0)→(1,2,1)→(2,3,2)→(3,3,3)→(4,4,4)
2-D Alignment Cell versus 3-D Alignment Cell
[Figure: a 2-D alignment cell next to a 3-D alignment cell with corners (i-1,j-1,k-1), (i-1,j,k-1), (i-1,j-1,k), (i-1,j,k), (i,j-1,k-1), (i,j,k-1), (i,j-1,k), and (i,j,k).]
Multiple Alignment: Dynamic Programming

• δ(x, y, z) is an entry in the 3-D scoring matrix.

Multiple Alignment: Running Time

• For 3 sequences of length n, the run time is proportional to 7n^3.

• For a k-way alignment, build a k-dimensional Manhattan graph with
– n^k nodes
– most nodes have 2^k – 1 incoming edges.
– Runtime: O(2^k n^k)
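The factor 7 = 2^3 – 1 is the number of incoming edges of a node in the 3-D graph. The sketch below enumerates these moves and fills the 3-D table for a deliberately simple scoring, assumed here for illustration: a column scores 1 only when all three symbols match, and 0 otherwise. The name `multiple_lcs_score` is hypothetical.

```python
from itertools import product


def multiple_lcs_score(u, v, w):
    """Longest common subsequence length of three strings via a 3-D table."""
    s = [[[0] * (len(w) + 1) for _ in range(len(v) + 1)] for _ in range(len(u) + 1)]
    # All 2^3 - 1 = 7 ways to advance in at least one of the three strings.
    steps = [d for d in product((0, 1), repeat=3) if any(d)]
    # Lexicographic order guarantees every predecessor is filled in first.
    for i, j, k in product(range(len(u) + 1), range(len(v) + 1), range(len(w) + 1)):
        best = 0
        for di, dj, dk in steps:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue  # this predecessor lies outside the graph
            conserved = (di, dj, dk) == (1, 1, 1) and u[pi] == v[pj] == w[pk]
            best = max(best, s[pi][pj][pk] + (1 if conserved else 0))
        s[i][j][k] = best
    return s[len(u)][len(v)][len(w)]


print(multiple_lcs_score("ATGC", "AATC", "ATGC"))
```

The triple loop visits about n^3 nodes and examines 7 predecessors for each, matching the 7n^3 estimate.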
Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments:

AC-GCGG-C
AC-GC-GAG
GCCGC-GAG

ACGCGG-C    AC-GCGG-C    AC-GCGAG
ACGC-GAG    GCCGC-GAG    GCCGCGAG
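An induced pairwise alignment is obtained by keeping two rows of the multiple alignment and deleting every column in which both rows contain a gap. A short sketch, with the hypothetical function name `induced_pairwise`:

```python
from itertools import combinations


def induced_pairwise(rows, gap="-"):
    """Map each pair of row indices to the pairwise alignment it induces."""
    result = {}
    for a, b in combinations(range(len(rows)), 2):
        # Drop columns where both selected rows have a gap symbol.
        cols = [(x, y) for x, y in zip(rows[a], rows[b]) if not (x == gap and y == gap)]
        result[(a, b)] = ("".join(x for x, _ in cols), "".join(y for _, y in cols))
    return result


multiple = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
for pair, (top, bottom) in induced_pairwise(multiple).items():
    print(pair, top, bottom)
```

The converse direction is the harder question: arbitrary pairwise alignments need not be compatible with any single multiple alignment.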
Idea: Construct Multiple from Pairwise Alignments

Given a set of arbitrary pairwise alignments, can we construct a multiple alignment that induces them?

AAAATTTT----    ----AAAATTTT    TTTTGGGG----
----TTTTGGGG    GGGGAAAA----    ----GGGGAAAA
Profile Representation of Multiple Alignment

     -   A   G   G   C   T   A   T   C   A   C   C   T   G
     T   A   G   -   C   T   A   C   C   A   -   -   -   G
     C   A   G   -   C   T   A   C   C   A   -   -   -   G
     C   A   G   -   C   T   A   T   C   A   C   -   G   G
     C   A   G   -   C   T   A   T   C   G   C   -   G   G

A    0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C   .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G    0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T   .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
-   .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0
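Each profile entry is the fraction of rows that show a given symbol in a given column. The sketch below recomputes the profile from the five aligned rows; the function name `profile` and its `alphabet` parameter are assumptions for illustration.

```python
from collections import Counter


def profile(rows, alphabet="ACGT-"):
    """Return {symbol: [frequency in column 0, frequency in column 1, ...]}."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    table = {symbol: [] for symbol in alphabet}
    for column in zip(*rows):          # iterate over alignment columns
        counts = Counter(column)
        for symbol in alphabet:
            table[symbol].append(counts[symbol] / len(rows))
    return table


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
for symbol, frequencies in profile(rows).items():
    print(symbol, frequencies)
```

Every column of the resulting table sums to 1, since each row contributes exactly one symbol per column.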
Aligning Profile Against Profile

• In the past we were aligning a sequence against a sequence.
– Can we align a sequence against a profile?
– Can we align a profile against a profile?
Multiple Alignment: Greedy Approach
• Choose the most similar sequences and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k – 2 sequences and 1 profile.
• Iterate.
Greedy Approach: Example
• Sequences: GATTCA, GTCTGA, GATATT, GTCAGC.

• 6 pairwise alignments (premium for match +1, penalties for indels and mismatches -1):

s2 GTCTGA                  s1 GATTCA--
s4 GTCAGC (score = 2)      s4 G-T-CAGC (score = 0)

s1 GAT-TCA                 s2 G-TCTGA
s2 G-TCTGA (score = 1)     s3 GATAT-T (score = -1)

s1 GAT-TCA                 s3 GAT-ATT
s3 GATAT-T (score = 1)     s4 G-TCAGC (score = -1)
Greedy Approach: Example
• Since s2 and s4 are closest, we consolidate them into a profile:

s2   GTCTGA
s4   GTCAGC
s2,4 = GTCt/aGa/c

• New set of 3 sequences to align:

s1   GATTCA
s3   GATATT
s2,4 GTCt/aGa/c
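The first greedy step, choosing the closest pair, can be sketched by scoring every pair with a global alignment and taking the maximum. The names `global_score` and `closest_pair` are hypothetical; the scoring (+1 match, -1 mismatch, -1 indel) follows the example. Merging the chosen pair into a profile and iterating is not shown.

```python
from itertools import combinations


def global_score(v, w, match=1, mismatch=-1, indel=-1):
    """Optimal global alignment score, computed with one recycled column."""
    col = [i * indel for i in range(len(v) + 1)]
    for c in w:
        new = [col[0] + indel]
        for i in range(1, len(v) + 1):
            diag = col[i - 1] + (match if v[i - 1] == c else mismatch)
            new.append(max(new[i - 1] + indel, col[i] + indel, diag))
        col = new
    return col[-1]


def closest_pair(seqs):
    """Indices (0-based) of the two sequences with the highest pairwise score."""
    return max(combinations(range(len(seqs)), 2),
               key=lambda pair: global_score(seqs[pair[0]], seqs[pair[1]]))


seqs = ["GATTCA", "GTCTGA", "GATATT", "GTCAGC"]  # s1, s2, s3, s4
i, j = closest_pair(seqs)
print(f"closest pair: s{i + 1} and s{j + 1}, score {global_score(seqs[i], seqs[j])}")
```

Scoring all pairs takes k(k-1)/2 pairwise alignments for k sequences, which is far cheaper than the k-dimensional dynamic program.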
