[Figure: the ribosome versus the non-ribosomal peptide (NRP) synthetase. Adenylation domains A1 to A10 are labeled; caption: adds one amino acid at a time.]
Bioinformatics Algorithms: An Active Learning Approach.
Copyright 2018 Compeau and Pevzner.
These Three A-domains Do Not Look Similar
YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTEFINHYGPTEATIGA
AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYIYEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSAPTMISSLEILFAAGDRLSSQDAILARRAVGSGVYNAYGPTENTVLS
YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTEFINHYGPTEATIGA
-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYIYEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSAPTMISSLEILFAAGDRLSSQDAILARRAVGSGVYNAYGPTENTVLS
11 conservative columns
YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA
-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS
19 conservative columns!
Which positions are responsible for encoding different amino acids Asp, Orn, Val?
LTKVGHIG Asp
VGEIGSID Orn
AWMFAAVL Val
A T G T T A T A
A T C G T C C

A T - G T T A T A
A T C G T - C - C
+1+1  +1+1          = 4
1st row: symbols of the 1st sequence (in order) interspersed by “-”
2nd row: symbols of the 2nd sequence (in order) interspersed by “-”
A T - G T T A T A
A T C G T - C - C
The matches in an alignment of two sequences (here ATGT) form a common subsequence of the two sequences.
Longest Common Subsequence Problem: Find a longest
common subsequence of two strings.
• Input: Two strings.
• Output: A longest common subsequence of these
strings.
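The link between an alignment and a common subsequence can be checked with a few lines of Python. This is a minimal sketch: the function names `common_subsequence` and `is_subsequence` are illustrative, not from the slides, and the two rows are the alignment shown above.

```python
def common_subsequence(row1, row2):
    """Collect the symbols in columns where both alignment rows agree."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    return "".join(a for a, b in zip(row1, row2) if a == b and a != "-")


def is_subsequence(sub, text):
    """Check that the symbols of sub occur in text in the same order."""
    remaining = iter(text)
    return all(symbol in remaining for symbol in sub)


row1 = "AT-GTTATA"
row2 = "ATCGT-C-C"
matches = common_subsequence(row1, row2)
print(matches, len(matches))  # the matched columns and the alignment score
```

Removing the gap symbols from each row recovers the original strings, and `is_subsequence(matches, ...)` holds for both of them.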
How Do We Compare Biological Sequences?
• From Sequence Comparison to Biological Insights
• The Alignment Game and the Longest Common Subsequence
• The Manhattan Tourist Problem
• The Change Problem
• Dynamic Programming and Backtracking Pointers
• From Manhattan to the Alignment Graph
• From Global to Local Alignment
• Penalizing Insertions and Deletions in Sequence Alignment
• Space-Efficient Sequence Alignment
• Multiple Sequence Alignment
[Figure: a Manhattan-style grid with weighted edges. Does a greedy algorithm work?]
[Figure: the same grid with the greedy path marked; its running totals are 0, 3, 5, 9, 13, 15, 19, 20, 23.]
[Figure: from a regular to an irregular grid.]
Search for Longest Paths in a Directed Graph
A T - G T T A T A
A T C G T - C - C
↘ ↘ → ↘ ↘ ↓ ↘ ↓ ↘
GreedyChange(money)
    change ← empty collection of coins
    while money > 0
        coin ← largest denomination that does not exceed money
        add coin to change
        money ← money − coin
    return change
Greedy: 40 cents = 25 + 10 + 5
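GreedyChange translates almost line for line into Python. The sketch below reproduces the 40-cent example; the second call uses hypothetical denominations (4, 3, 1), which are not from the slides, to illustrate a case where the greedy choice is not optimal.

```python
def greedy_change(money, denominations):
    """Repeatedly take the largest coin that does not exceed the remaining money."""
    change = []
    while money > 0:
        usable = [coin for coin in denominations if coin <= money]
        if not usable:
            raise ValueError("cannot make change with these denominations")
        coin = max(usable)
        change.append(coin)
        money -= coin
    return change


print(greedy_change(40, [25, 10, 5, 1]))  # the example from the slide
# Hypothetical denominations: greedy uses three coins here although 3 + 3 needs only two.
print(greedy_change(6, [4, 3, 1]))
```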
MinNumCoins(9) = ?
$$
\text{MinNumCoins}(9) = \min
\begin{cases}
\text{MinNumCoins}(9-6) + 1 = \text{MinNumCoins}(3) + 1 \\
\text{MinNumCoins}(9-5) + 1 = \text{MinNumCoins}(4) + 1 \\
\text{MinNumCoins}(9-1) + 1 = \text{MinNumCoins}(8) + 1
\end{cases}
$$
MinNumCoins(3) = ?   MinNumCoins(4) = ?   MinNumCoins(8) = ?
$$
\text{MinNumCoins}(money) = \min
\begin{cases}
\text{MinNumCoins}(money-6) + 1 \\
\text{MinNumCoins}(money-5) + 1 \\
\text{MinNumCoins}(money-1) + 1
\end{cases}
$$
$$
\text{MinNumCoins}(money) = \min
\begin{cases}
\text{MinNumCoins}(money-coin_1) + 1 \\
\quad\vdots \\
\text{MinNumCoins}(money-coin_d) + 1
\end{cases}
$$
[Figure: tree of recursive calls for the Change Problem with coins 6, 5, and 1; the same amounts (for example 65, 69, and 70) appear repeatedly.]
Instead of making the recursive call RecursiveChange(money − coin_i, Coins, d),
we would simply look up the value of
MinNumCoins(money − coin_i)
DPChange(money, coins)
    MinNumCoins(0) ← 0
    for m ← 1 to money
        MinNumCoins(m) ← infinity
        for i ← 1 to |coins|
            if m ≥ coin_i
                if MinNumCoins(m − coin_i) + 1 < MinNumCoins(m)
                    MinNumCoins(m) ← MinNumCoins(m − coin_i) + 1
    return MinNumCoins(money)
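A direct Python rendering of DPChange is sketched below, assuming the coins (6, 5, 1) used in the MinNumCoins(9) example. The name `dp_change` is illustrative.

```python
import math


def dp_change(money, coins):
    """Fill MinNumCoins(0..money) bottom-up and return MinNumCoins(money)."""
    min_num_coins = [0] + [math.inf] * money
    for m in range(1, money + 1):
        for coin in coins:
            if m >= coin and min_num_coins[m - coin] + 1 < min_num_coins[m]:
                min_num_coins[m] = min_num_coins[m - coin] + 1
    return min_num_coins[money]


for amount in (3, 4, 8, 9):
    print(amount, dp_change(amount, [6, 5, 1]))
```

Each value of MinNumCoins is computed once and then only looked up, which is the point of replacing the recursion with a table.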
There are only 2 ways to arrive at the sink: by moving South ↓ or by moving East →.
[Figure: the Manhattan grid, with the South and East edges entering the sink.]
South or East?
SouthOrEast(n, m)
    if n = 0 and m = 0
        return 0
    x ← −infinity, y ← −infinity
    if n > 0
        x ← SouthOrEast(n−1, m) + weight of edge “↓” into (n, m)
    if m > 0
        y ← SouthOrEast(n, m−1) + weight of edge “→” into (n, m)
    return max{x, y}
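The same South-or-East choice can be computed bottom-up instead of recursively. The sketch below is one way to do it, under stated assumptions: `down[i][j]` is the weight of the edge from (i, j) to (i+1, j), `right[i][j]` is the weight of the edge from (i, j) to (i, j+1), and the weights are read off the slide grid. One East weight in row 2 is taken as 4 because the slide's node values go from 20 to 24 along that edge, although the extracted text shows 3 there.

```python
def manhattan_tourist(down, right):
    """Return the table s, where s[i][j] is the length of a longest path to (i, j)."""
    n = len(down)       # number of South steps from source to sink
    m = len(right[0])   # number of East steps from source to sink
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only South moves
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):          # first row: only East moves
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],    # arrive by South
                          s[i][j - 1] + right[i][j - 1])   # arrive by East
    return s


# Edge weights as read from the slide grid (see the assumption above).
down = [[1, 0, 2, 4, 3], [4, 6, 5, 2, 1], [4, 4, 5, 2, 1], [5, 6, 8, 5, 3]]
right = [[3, 2, 4, 0], [3, 2, 4, 2], [0, 7, 3, 4], [3, 3, 0, 2], [1, 3, 2, 2]]
table = manhattan_tourist(down, right)
for row in table:
    print(row)
```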
[Figure sequence: the Manhattan grid is filled in node by node; each node stores the length of a longest path from the source.]
South or East? At (1,1): 1+3 > 3+0, so the value is 4. We arrived at (1,1) by the bold edge.
South or East? At (2,1): 5+0 < 4+6, so the value is 10. We arrived at (2,1) by the bold edge.
South or East? At (1,2): 5+2 > 4+2, so the value is 7.
Completed grid of node values:
 0   3   5   9   9
 1   4   7  13  15
 5  10  17  20  24
 9  14  22  22  25
14  20  30  32  34
Backtracking pointers: the best way to get to each node.
[Figure: the grid with a backtracking pointer kept at each node.]
Dynamic Programming Recurrence
[Figure: the irregular grid.]
$$
s_a = \max_{\text{all predecessors } b \text{ of node } a} \{\, s_b + \text{weight of edge from } b \text{ to } a \,\}
$$
4 choices for the node marked "?": 5+2, 3+7, 5+4, 4+2
[Figure: the irregular grid with the node values filled in.]
Dynamic Programming Recurrence for the
Alignment Graph
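$s_{i,j}$: the length of a longest path from (0,0) to (i,j)
$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} + 0 \\
s_{i,j-1} + 0 \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{cases}
$$
$$
\text{backtrack}_{i,j} \leftarrow
\begin{cases}
\rightarrow, & \text{if } s_{i,j} = s_{i,j-1} \\
\downarrow, & \text{if } s_{i,j} = s_{i-1,j} \\
\searrow, & \text{if } s_{i,j} = s_{i-1,j-1} + 1
\end{cases}
$$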
[Figure: the Manhattan grid with node values and backtracking pointers.]
What is the optimal alignment path?
[Figure: backtracking pointers for the Longest Common Subsequence, in the alignment graph with ATGTTATA along the rows and ATCGTCC along the columns.]
Using Backtracking Pointers to Compute LCS
OutputLCS(backtrack, v, i, j)
    if i = 0 or j = 0
        return
    if backtrack_{i,j} = “→”
        OutputLCS(backtrack, v, i, j−1)
    else if backtrack_{i,j} = “↓”
        OutputLCS(backtrack, v, i−1, j)
    else
        OutputLCS(backtrack, v, i−1, j−1)
        output v_i
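The recurrence for the alignment graph and OutputLCS fit together as in the following sketch. It is an illustration rather than the authors' code: `lcs_backtrack` and `output_lcs` are assumed names, and `output_lcs` walks the pointers with a loop instead of recursion, so long strings do not hit Python's recursion limit.

```python
def lcs_backtrack(v, w):
    """Fill s[i][j] (longest path to (i, j)) and one backtracking pointer per node."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    backtrack = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The diagonal edge only exists (with weight 1) when the symbols match.
            diagonal = s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else -1
            s[i][j] = max(s[i][j - 1], s[i - 1][j], diagonal)
            if s[i][j] == s[i][j - 1]:
                backtrack[i][j] = "→"
            elif s[i][j] == s[i - 1][j]:
                backtrack[i][j] = "↓"
            else:
                backtrack[i][j] = "↘"
    return s, backtrack


def output_lcs(backtrack, v, i, j):
    """Follow the pointers from (i, j) back to the border, collecting matched symbols."""
    symbols = []
    while i > 0 and j > 0:
        if backtrack[i][j] == "→":
            j -= 1
        elif backtrack[i][j] == "↓":
            i -= 1
        else:
            symbols.append(v[i - 1])
            i -= 1
            j -= 1
    return "".join(reversed(symbols))


v, w = "ATGTTATA", "ATCGTCC"
s, backtrack = lcs_backtrack(v, w)
lcs = output_lcs(backtrack, v, len(v), len(w))
print(s[len(v)][len(w)], lcs)
```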
[Figure sequence: a small directed graph with weighted edges; the node values, first marked "?", are filled in one node at a time.]
$$
s_a = \max_{\text{ALL predecessors } b \text{ of node } a} \{\, s_b + \text{weight of edge from } b \text{ to } a \,\}
$$
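Because the recurrence needs the values of all predecessors of a node before the node itself, the nodes can be processed in a topological order. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the small graph and its weights are hypothetical and are not the graph from the figure.

```python
import math
from graphlib import TopologicalSorter


def longest_path_length(edges, source, sink):
    """edges maps (b, a) to the weight of the edge from b to a in a DAG."""
    predecessors = {}
    for (b, a), weight in edges.items():
        predecessors.setdefault(a, {})[b] = weight
        predecessors.setdefault(b, {})
    # static_order() lists every node after all of its predecessors.
    order = TopologicalSorter(
        {node: preds.keys() for node, preds in predecessors.items()}
    ).static_order()
    s = {node: -math.inf for node in predecessors}
    s[source] = 0
    for a in order:
        for b, weight in predecessors[a].items():
            s[a] = max(s[a], s[b] + weight)
    return s[sink]


# Hypothetical weighted DAG, used only to exercise the recurrence.
edges = {("a", "b"): 4, ("a", "c"): 1, ("b", "c"): 2, ("b", "d"): 1, ("c", "d"): 6}
print(longest_path_length(edges, "a", "d"))
```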
Scoring matrix
     A    C    G    T    −
A   +1   −μ   −μ   −μ   −σ
C   −μ   +1   −μ   −μ   −σ
G   −μ   −μ   +1   −μ   −σ
T   −μ   −μ   −μ   +1   −σ
−   −σ   −σ   −σ   −σ

Even more general scoring matrix
     A    C    G    T    −
A   +1   −3   −5   −1   −3
C   −4   +1   −3   −2   −3
G   −9   −7   +1   −1   −3
T   −3   −5   −8   +1   −4
−   −4   −2   −2   −1
Scoring Matrices for Amino Acid Sequences
[Figure: an amino acid scoring matrix, with entries such as -5 and 7.]
$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} + \text{score}(v_i, -) \\
s_{i,j-1} + \text{score}(-, w_j) \\
s_{i-1,j-1} + \text{score}(v_i, w_j)
\end{cases}
$$
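With the simple scoring matrix (match +1, mismatch −μ, indel −σ), this recurrence can be sketched as follows. The function name and the default values μ = σ = 1 are assumptions made for illustration; setting μ = σ = 0 turns the score into the length of a longest common subsequence.

```python
def global_alignment_score(v, w, match=1, mu=1, sigma=1):
    """Score of an optimal global alignment of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # v_1..v_i aligned against gaps only
        s[i][0] = s[i - 1][0] - sigma
    for j in range(1, m + 1):          # w_1..w_j aligned against gaps only
        s[0][j] = s[0][j - 1] - sigma
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = match if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(s[i - 1][j] - sigma,          # score(v_i, -)
                          s[i][j - 1] - sigma,          # score(-, w_j)
                          s[i - 1][j - 1] + diagonal)   # score(v_i, w_j)
    return s[n][m]


print(global_alignment_score("ATGTTATA", "ATCGTCC", mu=0, sigma=0))
print(global_alignment_score("GATTCA", "GTCTGA"))
```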
GCC-C-AGT--TATGT-CAGGGGGCACG--A-GCATGCAGA-
GCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT
Global alignment
Compute a Global Alignment within each rectangle to get a Local Alignment
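[Figure: three-level alignment graph with an upper level (insertions), a middle level (matches/mismatches), and a bottom level (deletions); edge labels include σ, ε, 0, and σ+ε.]
$$
\text{lower}_{i,j} = \max
\begin{cases}
\text{lower}_{i-1,j} - \varepsilon \\
\text{middle}_{i-1,j} - \sigma
\end{cases}
$$
$$
\text{upper}_{i,j} = \max
\begin{cases}
\text{upper}_{i,j-1} - \varepsilon \\
\text{middle}_{i,j-1} - \sigma
\end{cases}
$$
$$
\text{middle}_{i,j} = \max
\begin{cases}
\text{lower}_{i,j} \\
\text{middle}_{i-1,j-1} + \text{score}(v_i, w_j) \\
\text{upper}_{i,j}
\end{cases}
$$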
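The three recurrences can be evaluated together in one pass over the grid, as in the sketch below. The parameter values (σ = 3 for opening a gap, ε = 1 for extending it, match +1, mismatch −1) are assumptions chosen for the example, not values from the slides.

```python
import math


def affine_gap_score(v, w, sigma=3, epsilon=1, match=1, mismatch=1):
    """Optimal global alignment score with gap opening sigma and gap extension epsilon."""
    n, m = len(v), len(w)
    neg = -math.inf
    lower = [[neg] * (m + 1) for _ in range(n + 1)]   # paths ending in a deletion
    upper = [[neg] * (m + 1) for _ in range(n + 1)]   # paths ending in an insertion
    middle = [[neg] * (m + 1) for _ in range(n + 1)]  # paths ending in a match/mismatch
    middle[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                lower[i][j] = max(lower[i - 1][j] - epsilon, middle[i - 1][j] - sigma)
            if j > 0:
                upper[i][j] = max(upper[i][j - 1] - epsilon, middle[i][j - 1] - sigma)
            diagonal = neg
            if i > 0 and j > 0:
                step = match if v[i - 1] == w[j - 1] else -mismatch
                diagonal = middle[i - 1][j - 1] + step
            middle[i][j] = max(lower[i][j], diagonal, upper[i][j])
    return middle[n][m]


# Two matches plus one gap of length 2, which costs sigma + epsilon.
print(affine_gap_score("AAAA", "AA"))
```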
middle column (middle = #columns/2)
Middle Node of the Alignment
[Figure: alignment graph with columns labeled A C G G A A.]
middle node: a node where an optimal alignment path crosses the middle column
Divide and Conquer Approach to Sequence Alignment
AlignmentPath(source, sink)
    find MiddleNode
Longest path lengths in the alignment graph of ATTCAA (rows) and ACGGAA (columns):
      A  C  G  G  A  A
   0  0  0  0  0  0  0
A  0  1  1  1  1  1  1
T  0  1  1  1  1  1  1
T  0  1  1  1  1  1  1
C  0  1  2  2  2  2  2
A  0  1  2  2  2  3  3
A  0  1  2  2  2  3  4
Can We Find the Middle Node without Constructing the Longest Path?
i-path: a longest path among paths that visit the i-th node in the middle column
Can We Find The Lengths of All i-paths?
length(i): length of an i-path
length(0) = 2, length(4) = 4
[Figure: i-paths in the alignment graph of ATTCAA (rows) and ACGGAA (columns).]
i:          0  1  2  3  4  5  6
length(i):  2  3  3  3  4  3  2
length(i) = fromSource(i) + toSink(i)
Computing fromSource and toSink
fromSource (columns 0 to 3)      toSink (columns 3 to 6)
      A  C  G                        G  A  A
   0  0  0  0                     2  2  1  0
A  0  1  1  1                  A  2  2  1  0
T  0  1  1  1                  T  2  2  1  0
T  0  1  1  1                  T  2  2  1  0
C  0  1  2  2                  C  2  2  1  0
A  0  1  2  2                  A  1  1  1  0
A  0  1  2  2                  A  0  0  0  0
Each of the two halves takes time proportional to area/2: area/2 + area/2 = area
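The lengths of all i-paths can be computed while keeping only one column of the table in memory at a time. The sketch below assumes longest-common-subsequence edge weights, as in the tables above; `lcs_last_column` and `i_path_lengths` are illustrative names. It obtains toSink by running the same column computation on the reversed strings.

```python
def lcs_last_column(v, w):
    """Last column of the LCS table: entry i is the LCS length of v[:i] and w."""
    column = [0] * (len(v) + 1)
    for symbol in w:
        new_column = [0] * (len(v) + 1)
        for i in range(1, len(v) + 1):
            diagonal = column[i - 1] + 1 if v[i - 1] == symbol else 0
            new_column[i] = max(new_column[i - 1], column[i], diagonal)
        column = new_column
    return column


def i_path_lengths(v, w):
    """length(i) = fromSource(i) + toSink(i) for every node i of the middle column."""
    middle = len(w) // 2
    from_source = lcs_last_column(v, w[:middle])
    # toSink(i) is the LCS length of the suffixes v[i:] and w[middle:],
    # computed on the reversed strings and then read in reverse order.
    to_sink = lcs_last_column(v[::-1], w[middle:][::-1])[::-1]
    return [a + b for a, b in zip(from_source, to_sink)]


lengths = i_path_lengths("ATTCAA", "ACGGAA")
middle_node = max(range(len(lengths)), key=lengths.__getitem__)
print(lengths, middle_node)
```

The row with the largest length(i) gives a middle node; here that is row 4 of the middle column.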
Each subproblem can be conquered in time proportional to its area:
area/4 + area/4 = area/2
area/8 + area/8 + area/8 + area/8 = area/4
Total time:
area + area/2 + area/4 + area/8 + area/16 + … < 2·area
since 1 + ½ + ¼ + … < 2
The Middle Edge
[Figure: alignment graph with TACTTAA along the rows and GAGCAAT along the columns.]
Middle Edge: an edge in an optimal alignment path starting at the middle node
YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA
-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS
A T - G C G -
A - C G T - A
A T C A C - A
Each row of the alignment, with the number of its symbols used up to each column:
A - T G C      0 1 1 2 3 4
A A T - C      0 1 2 3 3 4
- A T G C      0 0 1 2 3 4
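[Figure: a cell of the 3-D alignment graph next to its 2-D counterpart, with the nodes (i-1,j-1,k), (i-1,j,k), (i,j-1,k-1), (i,j,k-1), (i,j-1,k), and (i,j,k) labeled.]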
   -  A  G  G  C  T  A  T  C  A  C  C  T  G
   T  A  G  -  C  T  A  C  C  A  -  -  -  G
   C  A  G  -  C  T  A  C  C  A  -  -  -  G
   C  A  G  -  C  T  A  T  C  A  C  -  G  G
   C  A  G  -  C  T  A  T  C  G  C  -  G  G
A  0  1  0  0  0  0  1  0  0 .8  0  0  0  0
C .6  0  0  0  1  0  0 .4  1  0 .6 .2  0  0
G  0  0  1 .2  0  0  0  0  0 .2  0  0 .4  1
T .2  0  0  0  0  1  0 .6  0  0  0  0 .2  0
- .2  0  0 .8  0  0  0  0  0  0 .4 .8 .4  0
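The profile under the alignment is just the frequency of each symbol in each column. A minimal sketch, using the five aligned rows shown above (the function name `profile` and the alphabet ordering are choices made here):

```python
from collections import Counter


def profile(alignment, alphabet="ACGT-"):
    """Frequency of every symbol of the alphabet in every column of the alignment."""
    rows = len(alignment)
    result = {symbol: [] for symbol in alphabet}
    for column in zip(*alignment):
        counts = Counter(column)
        for symbol in alphabet:
            result[symbol].append(counts[symbol] / rows)
    return result


alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
matrix = profile(alignment)
for symbol, frequencies in matrix.items():
    print(symbol, frequencies)
```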
s1 GAT-TCA
s2 G-TCTGA (score = 1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)
Greedy Approach: Example
• Since s2 and s4 are closest, we consolidate them into a profile:
s2 GTCTGA
s4 GTCAGC
s2,4 = GTCt/aGa/c
• New set of 3 sequences to align:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
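The pairwise scores and the consolidation step can be reproduced with the sketch below. The scoring scheme (match +1, mismatch −1, indel −1) is an assumption: it is not stated on the slides, but it reproduces the four pairwise scores listed above. The helper names are illustrative.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, indel=-1):
    """Score two aligned rows column by column."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def consolidate(row1, row2):
    """Merge two aligned rows, writing disagreeing columns as x/y in lower case."""
    return "".join(a if a == b else f"{a.lower()}/{b.lower()}"
                   for a, b in zip(row1, row2))


print(alignment_score("GAT-TCA", "G-TCTGA"))   # s1 vs s2
print(alignment_score("GTCTGA", "GTCAGC"))     # s2 vs s4, aligned without gaps
print(consolidate("GTCTGA", "GTCAGC"))         # the consolidated row s2,4
```

Under this scheme the gap-free alignment of s2 and s4 scores higher than the four pairs listed, which is consistent with choosing them as the closest pair.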