
BLOSUM

BLOcks Substitution Matrix


• The number in the matrix name is an index denoting the level of clustering
• BLOSUM62: blocks clustered at the 62% identity level
• Sequences that are identical in more than x% of their amino acid
positions are eliminated to avoid bias toward certain proteins
• This can be done either by removing sequences from the block or by
grouping similar sequences into a cluster
• The matrix built from blocks with no more than x% similarity is called
BLOSUMx (so blocks with no more than 60% similarity give BLOSUM60)

Steps of Building up BLOSUM Matrix
Seq 1 A A B C D - - - B B C D A
Seq 2 D A B C D - A - B B C B B
Seq 3 B B B C D B A - B C C A A
Seq 4 A A A C D C - D C B C D B
Seq 5 C C B A D B - D B B D C C
Seq 6 A A A C A - - - B B C C C

• Calculate the log-odds ratio from the columns of each block.
• This is done by counting the pairs of amino acids in each
column of the multiple alignment.
e.g. for the column A A B A C A (the second column of the alignment above):
AA pairs = 6; AB pairs = 4; AC pairs = 4; BC pairs = 1;
BB pairs = 0; CC pairs = 0
• Hence the column contributes all of its pairs, i.e.
6 + 4 + 4 + 1 + 0 + 0 = 15 = 6(6 - 1)/2
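The pair counting for one column can be sketched in a few lines of Python. This is only an illustration: the function name `count_pairs` is hypothetical, and skipping gap characters is an assumption, since the slides do not say how gaps are handled.

```python
from collections import Counter
from itertools import combinations

def count_pairs(column):
    """Count unordered residue pairs in one alignment column.

    Assumption: gap characters ('-') are simply skipped.
    """
    residues = [r for r in column if r != "-"]
    pairs = Counter()
    for a, b in combinations(residues, 2):
        # Sort so that (A, B) and (B, A) are counted as the same pair.
        pairs[tuple(sorted((a, b)))] += 1
    return pairs

pairs = count_pairs("AABACA")
print(dict(pairs))
print(sum(pairs.values()))  # n(n - 1)/2 pairs for n residues
```

For the column A A B A C A this gives the same six AA pairs, four AB pairs, four AC pairs and one BC pair as the hand count.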
Steps of Building up BLOSUM Matrix
Generally speaking, for each pair of amino acids i and j and for each column k of each
block we have:
For like comparisons (i = j): $C_{ii}^{(k)} = \binom{n_i}{2}$ pairs
For unlike comparisons (i ≠ j): $C_{ij}^{(k)} = n_i n_j$, where $n_i$ is the number of times residue i was observed in the column
In the last stage the results are normalized according to the following definitions.
The counts for each column are summed across columns: $C_{ij} = \sum_k C_{ij}^{(k)}$
The pair frequencies are normalized so that their sum becomes 1:
$T = \sum_{i \ge j} C_{ij} = w \, \frac{n(n-1)}{2}$, where w = number of columns and n = number of sequences
$q_{ij}$ is the observed probability for a pair of amino acids in the same column to be
i and j, and is given by $q_{ij} = \frac{C_{ij}}{T}$
Example, for the block
Seq 1 A A B C D - - - B B C D A
Seq 2 D A B C D - A - B B C B B
Seq 3 B B B C D B A - B C C A A
Seq 4 A A A C D C - D C B C D B
Seq 5 C C B A D B - D B B D C C
Seq 6 A A A C A - - - B B C C C
$q_{AB} = \frac{4 + 8 + 0 + 0 + 0 + 0 + 0}{7 \cdot \frac{(6)(5)}{2}} = \frac{12}{105}$
Steps of Building up BLOSUM Matrix
Calculating the denominator of the log-odds ratio: the probability of occurrence
of the i-th residue in an (i, j) pair is
$P_i = q_{ii} + \sum_{j \ne i} \frac{q_{ij}}{2}$
Assuming independence, the pairs should occur with the expected frequencies
$e_{ii} = P_i^2$ for i = j and $e_{ij} = 2 P_i P_j$ for i ≠ j
The log-odds matrix is $S_{ij} = \log_2 \frac{q_{ij}}{e_{ij}}$. The final result is $2 S_{ij}$ rounded to the nearest integer.
This value is stored in the (i, j) entry of the BLOSUM matrix.
If the observed frequency of a pair of amino acids is equal to the expected
frequency then $S_{ij} = 0$; if it is less than expected then $S_{ij} < 0$;
and if it is more than expected then $S_{ij} > 0$.

Quantities to calculate, in order: (1) $C_{ij}$, (2) $T$, (3) $q_{ij}$, (4) $P_i$, (5) $e_{ij}$, (6) the log-odds ratio $S_{AA}$
Ques: Find the BLOSUM value of AA for the given sequences
Seq 1 A A I
Seq 2 S A L
Seq 3 T A L
Seq 4 T A V
Seq 5 A A L
Ans: First calculate the $C_{ij}$.
The different letters present in the sequences are A, S, T, I, L, V or,
in alphabetical order, A, I, L, S, T, V.

Values of $C_{ij}$ in matrix form

     A    I    L    S    T    V
A   11
I    0    0
L    0    3    3
S    2    0    0    0
T    4    0    0    2    1
V    0    1    3    0    0    0

$T = \sum_{i \ge j} C_{ij} = w \, \frac{n(n-1)}{2} = 3 \cdot \frac{5(5-1)}{2} = 3 \cdot \frac{20}{2} = 3(10) = 30$
Calculate the matrix for $q_{ij}$: $q_{ij} = \frac{C_{ij}}{T}$, with $C_{ij}$ taken from the count matrix and T = 30

     A       I       L       S       T       V
A   11/30
I    0       0
L    0       3/30    3/30
S    2/30    0       0       0
T    4/30    0       0       2/30    1/30
V    0       1/30    3/30    0       0       0

Diagonal entries: $q_{AA} = 11/30$, $q_{II} = 0/30$, $q_{LL} = 3/30$, $q_{SS} = 0/30$, $q_{TT} = 1/30$, $q_{VV} = 0/30$

In decimals (rounded to three places):

     A       I       L       S       T       V
A   0.367
I   0       0
L   0       0.1     0.1
S   0.067   0       0       0
T   0.133   0       0       0.067   0.033
V   0       0.033   0.1     0       0       0
Calculate $P_i$, i.e. $P_A$, $P_I$, $P_L$, $P_S$, $P_T$ and $P_V$, using $P_i = q_{ii} + \sum_{j \ne i} \frac{q_{ij}}{2}$

$P_A = \frac{11}{30} + \frac{1}{2}\left(\frac{2}{30} + \frac{4}{30}\right) = \frac{22 + 6}{30} \cdot \frac{1}{2} = \frac{28}{30} \cdot \frac{1}{2} = \frac{14}{30} \approx 0.467$
$P_I = \frac{0}{30} + \frac{1}{2}\left(\frac{3}{30} + \frac{1}{30}\right) = \frac{4}{30} \cdot \frac{1}{2} = \frac{2}{30} \approx 0.067$
$P_L = \frac{3}{30} + \frac{1}{2}\left(\frac{3}{30} + \frac{3}{30}\right) = \frac{12}{30} \cdot \frac{1}{2} = \frac{6}{30} = 0.2$
$P_S = \frac{0}{30} + \frac{1}{2}\left(\frac{2}{30} + \frac{2}{30}\right) = \frac{4}{30} \cdot \frac{1}{2} = \frac{2}{30} \approx 0.067$
$P_T = \frac{1}{30} + \frac{1}{2}\left(\frac{4}{30} + \frac{2}{30}\right) = \frac{8}{30} \cdot \frac{1}{2} = \frac{4}{30} \approx 0.133$
$P_V = \frac{0}{30} + \frac{1}{2}\left(\frac{1}{30} + \frac{3}{30}\right) = \frac{4}{30} \cdot \frac{1}{2} = \frac{2}{30} \approx 0.067$
Calculate the matrix for $e_{ij}$: $e_{ii} = P_i^2$ for i = j and $e_{ij} = 2 P_i P_j$ for i ≠ j, using
P_A = 14/30, P_I = 2/30, P_L = 6/30, P_S = 2/30, P_T = 4/30, P_V = 2/30

     A                I               L               S               T               V
A   (14/30)^2
I   2(14/30)(2/30)   (2/30)^2
L   2(14/30)(6/30)   2(2/30)(6/30)   (6/30)^2
S   2(14/30)(2/30)   2(2/30)(2/30)   2(6/30)(2/30)   (2/30)^2
T   2(14/30)(4/30)   2(2/30)(4/30)   2(6/30)(4/30)   2(2/30)(4/30)   (4/30)^2
V   2(14/30)(2/30)   2(2/30)(2/30)   2(6/30)(2/30)   2(2/30)(2/30)   2(4/30)(2/30)   (2/30)^2


Log-odds ratio: $S_{ij} = \log_2 \frac{q_{ij}}{e_{ij}}$, so
$S_{AA} = \log_2 \frac{11/30}{(14/30)^2} = \log_2 1.6837 = 0.7516$
BLOSUM value for AA (the first diagonal element of the BLOSUM matrix)
= round(2(0.7516)) = 2
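The six steps of the worked example (C_ij, T, q_ij, P_i, e_ij, S_ij) can be checked with a short Python sketch. The function name `blosum_scores` and its return shape are hypothetical. It assumes a gap-free block and skips pairs that were never observed, since their log-odds value is undefined.

```python
from collections import Counter
from itertools import combinations
from math import log2

def blosum_scores(block):
    """Return (C, q, P, scores) for a gap-free block of equal-length sequences."""
    n, w = len(block), len(block[0])
    counts = Counter()                      # C_ij summed over all columns
    for k in range(w):
        column = [seq[k] for seq in block]
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = w * n * (n - 1) // 2            # T = w * n(n - 1)/2
    q = {pair: c / total for pair, c in counts.items()}
    p = Counter()                           # P_i = q_ii + sum_{j != i} q_ij / 2
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]   # e_ij
        scores[(a, b)] = round(2 * log2(freq / expected))
    return counts, q, dict(p), scores

block = ["AAI", "SAL", "TAL", "TAV", "AAL"]
counts, q, p, scores = blosum_scores(block)
print(counts[("A", "A")], round(p["A"], 4), scores[("A", "A")])
```

For the five sequences of the question this reproduces C_AA = 11, P_A = 14/30 and a BLOSUM value of 2 for AA.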
Dot plots

• Two sequences of similar or variable length
• Write each letter of one sequence in a row
• Write each letter of the other sequence in a column
• Fill in each box where the letter of sequence 1 matches the letter of sequence 2
• Interpret the plot


Example: dot plot of two sequences

Seq 1 G T A C A T G
Seq 2 T A G A T G

  G T A C A T G
T . x . . . x .
A . . x . x . .
G x . . . . . x
A . . x . x . .
T . x . . . x .
G x . . . . . x
Example: dot plot of two longer sequences
Seq 1 G A T T C T A T C T A A C T
Seq 2 G T T C T A T T C T A A C

[Figure: dot-plot grid with Seq 1 across the top and Seq 2 down the side]
Dot plots with thresholds

• If you colour in all cells with an identical letter, some dots
may be due to chance similarities.
• Therefore, it is common to use a threshold to decide whether
to plot a 'dot' in a cell.
• A window of a certain size (e.g. window size = 3) is moved up
all possible diagonals, one by one.
• A score is calculated for each position of the window on a
diagonal: the number of identical letters in the window.
• If the score is equal to or above the threshold (e.g. threshold
= score of 2), all the cells in the window are coloured in.
• The values for the window size and threshold of the dot plot
are chosen by trial and error.
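The window-and-threshold rule described above can be sketched as follows. The names `dot_plot` and `render` are hypothetical, and the plain dot plot is treated here as the special case of window size 1 and threshold 1.

```python
def dot_plot(seq1, seq2, window=1, threshold=1):
    """Return a grid of booleans (rows: seq2, columns: seq1) marking dots."""
    rows, cols = len(seq2), len(seq1)
    grid = [[False] * cols for _ in range(rows)]
    for i in range(rows - window + 1):
        for j in range(cols - window + 1):
            # Score = number of identical letters inside the diagonal window.
            score = sum(seq2[i + k] == seq1[j + k] for k in range(window))
            if score >= threshold:
                for k in range(window):     # colour every cell in the window
                    grid[i + k][j + k] = True
    return grid

def render(grid, seq1, seq2):
    """Format the grid as text, 'x' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq1)]
    for letter, row in zip(seq2, grid):
        lines.append(letter + " " + " ".join("x" if cell else "." for cell in row))
    return "\n".join(lines)

plain = dot_plot("GTACATG", "TAGATG")
print(render(plain, "GTACATG", "TAGATG"))

filtered = dot_plot("GATTCTATCTAACT", "GTTCTATTCTAAC", window=3, threshold=2)
print(render(filtered, "GATTCTATCTAACT", "GTTCTATTCTAAC"))
```

Note that with a window, a mismatching cell can still be coloured when the window around it reaches the threshold, exactly as the rule states.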
Seq 1 G A T T C T A T C T A A C T
Seq 2 G T T C T A T T C T A A C

[Figure: two dot-plot grids with Seq 1 across the top and Seq 2 down the side]

Seq 1 G A T T C T A T ‐ T C T A A C T
Seq 2 G ‐ T T C T A T C T C T A A C ‐
Advantages
• Good for identification of long regions of strong similarity
• Easy to make and interpret
• Can be used for any length sequence

Disadvantages
• Need to find best window size
• Graphical representation doesn’t give information about
mutation
Needleman-Wunsch (global alignment)
We want to align two sequences x1, x2, …, xn and y1, y2, …, ym and
create an (n + 1) × (m + 1) matrix F where
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + S_{ij} & \text{(match/mismatch in the diagonal)} \\ F(i-1,j) - d & \text{(gap in sequence 1)} \\ F(i,j-1) - d & \text{(gap in sequence 2)} \end{cases}$$
with $F(0,0) = 0$, $F(i,0) = -id$, $F(0,j) = -jd$
This is the recursive relation of the dynamic programming algorithm.
In the tabular computation we start in cell (0, 0) and calculate one
row at a time. In each cell (i, j) we keep a pointer to the optimal
previous position, given the current one.
Example: Align two sequences globally: GAATTCAGTTA and GGATCGA
Answer:
Seq 1 G A A T T C A G T T A
Seq 2 G G A T C G A

Length of Seq 1 is M = 11 and length of Seq 2 is N = 7

The simple scoring scheme is assumed, where
Sij = 1 if the residue at position i of Seq 1 is the same as the residue at
position j of Seq 2 (match score)
Sij = 0 otherwise (mismatch score)
d = 0 (gap penalty)

The match values Sij, to be used in the next step:
   G  A  A  T  T  C  A  G  T  T  A
G  1  0  0  0  0  0  0  1  0  0  0
G  1  0  0  0  0  0  0  1  0  0  0
A  0  1  1  0  0  0  1  0  0  0  1
T  0  0  0  1  1  0  0  0  1  1  0
C  0  0  0  0  0  1  0  0  0  0  0
G  1  0  0  0  0  0  0  1  0  0  0
A  0  1  1  0  0  0  1  0  0  0  1
Three steps in dynamic programming
1. Initialization
2. Matrix fill (Scoring)
3. Traceback (Alignment)

1. Initialization: Create a matrix with M + 1 columns and N + 1 rows; the first row and first column are filled with 0
Seq 1 G A A T T C A G T T A
Seq 2 G G A T C G A
      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0
G  0
A  0
T  0
C  0
G  0
A  0
2. Matrix fill (Scoring)
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + S_{ij} & \text{(match/mismatch in the diagonal)} \\ F(i-1,j) - d & \text{(gap in sequence 1)} \\ F(i,j-1) - d & \text{(gap in sequence 2)} \end{cases}$$
For position (1, 1): $S_{1,1} = 1$ (since G is present in both sequences) and the gap penalty is d = 0. Thus
$F(1,1) = \max\{F(0,0) + 1,\ F(0,1) - 0,\ F(1,0) - 0\} = \max\{0 + 1,\ 0 - 0,\ 0 - 0\} = \max\{1, 0, 0\} = 1$
1 is the largest value, so it is placed in position (1, 1).
2. Matrix fill (Scoring)
Applying the recurrence $F(i,j) = \max\{F(i-1,j-1) + S_{ij},\ F(i-1,j) - d,\ F(i,j-1) - d\}$ to every cell gives:

      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  1  1  1  1  1  1  1  1  1  1  1
G  0  1  1  1  1  1  1  1  2  2  2  2
A  0  1  2  2  2  2  2  2  2  2  2  3
T  0  1  2  2  3  3  3  3  3  3  3  3
C  0  1  2  2  3  3  4  4  4  4  4  4
G  0  1  2  2  3  3  4  4  5  5  5  5
A  0  1  2  3  3  3  4  5  5  5  5  6
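The matrix fill can be reproduced with a short Python sketch. The function name `fill_matrix` is hypothetical. It follows the convention used in this example, with the first row and column initialised to 0, and it takes the gap value as a number that is added (0 here).

```python
def fill_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill F with zero-initialised borders.

    seq1 runs along the columns, seq2 along the rows.
    `gap` is added to the score, so a penalty is passed as a negative number.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal: match/mismatch
                          F[i - 1][j] + gap,     # from above: gap
                          F[i][j - 1] + gap)     # from the left: gap
    return F

F = fill_matrix("GAATTCAGTTA", "GGATCGA")
for row in F:
    print(" ".join(f"{v:2d}" for v in row))
print("score:", F[-1][-1])
```

With match = 1, mismatch = 0 and gap = 0 the bottom-right cell holds the score 6.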
3. Traceback (Alignment): It begins at position (M, N) of the filled matrix,
the bottom-right cell, which holds the maximal score. Here it is 6.

Each traceback step is one of:
• a match or a substitution
• either a deletion in Seq 1 or an insertion in Seq 2
• either an insertion in Seq 1 or a deletion in Seq 2

Seq 1 G A A T T C A G T T A
Seq 2 G G A - T C - G - - A
3. Traceback (Alignment), alternative path: a second traceback through the
same matrix, also starting from the maximal score of 6, gives:

Seq 1 G - A A T T C A G T T A
Seq 2 G G - A - T C - G - - A
An Advanced Scoring Scheme
The sequences are now treated with an advanced scoring scheme, where
the length of Sequence 1 is M and the length of Sequence 2 is N:

1. Sij = 2 if the residue at position i of Seq 1 is the same as the
residue at position j of Seq 2 (match score)
2. Sij = -1 otherwise (mismatch score)
3. W = -2 (gap penalty)
                     The Simple Scoring Scheme    An Advanced Scoring Scheme
Sij (match score)    1                            2
Sij (mismatch score) 0                            -1
D/W (gap penalty)    0                            -2
A match means that the residue at position i of Seq 1 is the same as the residue at position j of Seq 2.
Three steps in dynamic programming
1. Initialization
2. Matrix fill (Scoring)
3. Traceback (Alignment)
Example: Align two sequences globally: GAATTCAGTTA and GGATCGA
Answer:
Seq 1 G A A T T C A G T T A
Seq 2 G G A T C G A
1. Initialization: Create a matrix with M + 1 columns and N + 1 rows
      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0
G  0
A  0
T  0
C  0
G  0
A  0
2. Matrix fill (Scoring)
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + S_{ij} & \text{(match/mismatch in the diagonal)} \\ F(i-1,j) - d & \text{(gap in sequence 1)} \\ F(i,j-1) - d & \text{(gap in sequence 2)} \end{cases}$$
For position (1, 1): $S_{1,1} = 2$ (since G is present in both sequences) and the gap penalty is W = -2 (d = 2). Thus
$F(1,1) = \max\{F(0,0) + 2,\ F(0,1) - 2,\ F(1,0) - 2\} = \max\{0 + 2,\ 0 - 2,\ 0 - 2\} = \max\{2, -2, -2\} = 2$
2 is the largest value, so it is placed in position (1, 1).
2. Matrix fill (Scoring)
Applying the recurrence $F(i,j) = \max\{F(i-1,j-1) + S_{ij},\ F(i-1,j) - d,\ F(i,j-1) - d\}$ to every cell gives:

      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  2  0 -1 -1 -1 -1 -1  2  0 -1 -1
G  0  2  1 -1 -2 -2 -2 -2  1  1 -1 -2
A  0  0  4  3  1 -1 -3  0 -1  0  0  1
T  0 -1  2  3  5  3  1 -1 -1  1  2  0
C  0 -1  0  1  3  4  5  3  1 -1  0  1
G  0  2  0 -1  1  2  3  4  5  3  1 -1
A  0  0  4  2  0  0  1  5  3  4  2  3
3. Traceback (Alignment): It begins at position (M, N) of the filled matrix,
the bottom-right cell. Here the score is 3.

Each traceback step is one of:
• a match or a substitution
• either a deletion in Seq 1 or an insertion in Seq 2
• either an insertion in Seq 1 or a deletion in Seq 2

Seq 1 G A A T T C A G T T A
Seq 2 G G A T - C - G - - A
3. Traceback (Alignment), alternative path: a second traceback through the
same matrix, also ending with the score of 3, gives:

Seq 1 G A A T T C A G T T A
Seq 2 G G A - T C - G - - A
Alignments obtained with the Simple Scoring Scheme
Seq 1 G A A T T C A G T T A
Seq 2 G G A - T C - G - - A

Seq 1 G - A A T T C A G T T A
Seq 2 G G - A - T C - G - - A

Alignments obtained with the Advanced Scoring Scheme
Seq 1 G A A T T C A G T T A
Seq 2 G G A T - C - G - - A

Seq 1 G A A T T C A G T T A
Seq 2 G G A - T C - G - - A
Test to make sure the alignments give a valid score
Remembering that the advanced scoring scheme is +2 for a match, -1 for a mismatch
and -2 for a gap, both alignments can be tested to make sure that they
result in a score of 3.
Seq 1   G  A  A  T  T  C  A  G  T  T  A
Seq 2   G  G  A  -  T  C  -  G  -  -  A
Score  +2 -1 +2 -2 +2 +2 -2 +2 -2 -2 +2   = 12 - 1 - 8 = 3

Seq 1   G  A  A  T  T  C  A  G  T  T  A
Seq 2   G  G  A  T  -  C  -  G  -  -  A
Score  +2 -1 +2 +2 -2 +2 -2 +2 -2 -2 +2   = 12 - 1 - 8 = 3
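The same check can be automated by summing the column scores of a finished alignment. This is a minimal sketch. The function name `score_alignment` is hypothetical, and gaps are assumed to be written as '-'.

```python
def score_alignment(row1, row2, match=2, mismatch=-1, gap=-2):
    """Sum the per-column scores of two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap          # a residue aligned to a gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

first = score_alignment("GAATTCAGTTA", "GGA-TC-G--A")
second = score_alignment("GAATTCAGTTA", "GGAT-C-G--A")
print(first, second)
```

Both alignments contain six matches, one mismatch and four gaps, so both score 3 under the advanced scheme.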
Three steps in dynamic programming
1. Initialization
2. Matrix fill (Scoring)
3. Traceback (Alignment)
Example: Align two sequences globally: AGCT and GCT

Answer:
Seq 1 A G C T
Seq 2 G C T
Rule:
1. Match = 1
2. Mismatch = -1
3. Gap = -2
4. Box beside (+ gap)
5. Box on top (+ gap)
6. Diagonal box (+ match/mismatch)

       A   G   C   T
   0  -2  -4  -6  -8
G -2  -1  -1  -3  -5
C -4  -3  -2   0  -2
T -6  -5  -4  -2   1

Diagonal contributions in the first column: cell (G, A): 0 + (-1) = -1; cell (C, A): -2 + (-1) = -3
3. Traceback (Alignment)

Each traceback step is one of:
• a match or a substitution
• either a deletion in Seq 1 or an insertion in Seq 2
• either an insertion in Seq 1 or a deletion in Seq 2

Seq 1 A G C T
Seq 2 - G C T
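All three steps for this small example can be put together in one Python sketch. The function name `needleman_wunsch` is hypothetical. The first row and column are initialised with multiples of the gap value, as in the matrix of this example, and when several paths are optimal the sketch arbitrarily prefers the diagonal move.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment; x runs along the columns, y along the rows.

    Returns (score, aligned_x, aligned_y).
    """
    rows, cols = len(y) + 1, len(x) + 1
    # 1. Initialization: borders hold accumulated gap scores.
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    # 2. Matrix fill.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # 3. Traceback from the bottom-right cell to (0, 0).
    ax, ay = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if y[i - 1] == x[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:       # diagonal move
                ax.append(x[j - 1]); ay.append(y[i - 1])
                i -= 1; j -= 1
                continue
        if j > 0 and F[i][j] == F[i][j - 1] + gap:   # move left: gap in y
            ax.append(x[j - 1]); ay.append("-")
            j -= 1
        else:                                        # move up: gap in x
            ax.append("-"); ay.append(y[i - 1])
            i -= 1
    return F[-1][-1], "".join(reversed(ax)), "".join(reversed(ay))

score, top, bottom = needleman_wunsch("AGCT", "GCT")
print(score)
print(top)
print(bottom)
```

For AGCT and GCT this yields the score 1 and the alignment AGCT over -GCT shown above.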
Smith-Waterman (Local Alignment)

Sometimes we want to find the conserved region in a protein domain
rather than align the entire sequences.
This method is useful for comparing the following:
1. Protein sequences that share a common motif (conserved
pattern) or domain (independently folded unit) but differ
elsewhere.
2. Protein sequences against genomic DNA sequences (long
stretches of uncharacterized sequences).
3. DNA sequences that share a similar motif but differ elsewhere.
4. It is sensitive when comparing highly diverged sequences.
Three steps in dynamic programming
1. Initialization: The first row and first column are initialized with 0's
2. Matrix fill (Scoring): Calculate the F(i, j) value
3. Traceback (Alignment): Start the traceback in the cell with the
highest score and continue until a cell with score 0 is reached
Example: Align two sequences locally: AAGA and TTAAG
Answer:
Seq 1 A A G A
Seq 2 T T A A G
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + S(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
1. Initialization: Create a matrix with M + 1 columns and N + 1 rows
      A  A  G  A
   0  0  0  0  0
T  0
T  0
A  0
A  0
G  0

2. Matrix fill (Scoring)
      A  A  G  A
   0  0  0  0  0
T  0  0  0  0  0
T  0  0  0  0  0
A  0  1  1  0  1
A  0  1  2  0  1
G  0  0  0  3  1

3. Alignment
Seq 1 A A G
Seq 2 A A G
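The local alignment above can be reproduced with the following Python sketch. The function name `smith_waterman` is hypothetical. The scores match = 1, mismatch = -1 and gap = -2 are an assumption: the slide does not state them, but these values reproduce the filled matrix.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Local alignment; x runs along the columns, y along the rows.

    Returns (best_score, aligned_x, aligned_y, F).
    """
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]      # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            # The extra 0 lets a new local alignment start in any cell.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the highest-scoring cell until a cell with score 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if y[i - 1] == x[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[j - 1]); ay.append(y[i - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i][j - 1] + gap:
            ax.append(x[j - 1]); ay.append("-")
            j -= 1
        else:
            ax.append("-"); ay.append(y[i - 1])
            i -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay)), F

best, top, bottom, F = smith_waterman("AAGA", "TTAAG")
print(best)
print(top)
print(bottom)
```

The highest score, 3, sits in the G/G cell, and the traceback recovers AAG aligned with AAG.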
Needleman-Wunsch vs Smith-Waterman

1. Needleman-Wunsch: global alignment. Smith-Waterman: local alignment.
2. Needleman-Wunsch: residue alignment score may be positive or negative. Smith-Waterman: requires the alignment score for a pair of residues to be ≥ 0.
3. Needleman-Wunsch: no gap penalty required. Smith-Waterman: requires a gap penalty to work effectively.
4. Needleman-Wunsch: a negative score weight must be given to mismatches so that the score drops as more and more mismatches are added. Smith-Waterman: no such score is assigned.
5. Needleman-Wunsch: compares sequences and gives the best overall alignment. Smith-Waterman: finds regions of ungapped sequence with a high degree of similarity.
