You are on page 1of 61

Pairwise Sequence

Alignment
DS201: Bioinformatics
September 26, 2022
Debnath Pal
Pairwise Sequence Alignment
It is the process of lining up two sequences to achieve maximal
levels of identity (and conservation, in the case of amino acid
sequences) for the purpose of assessing the degree of similarity and
the possibility of homology.
Identity: The extent to which two (nucleotide or amino acid)
sequences are invariant.
Similarity: Two residues are considered similar if their substitution
does not affect the function of the protein.
Homology: Similarity attributed to descent from a common
ancestor.
Used to:-
decide if two proteins (or genes) are related structurally or functionally
identify domains or motifs that are shared between proteins
Pairwise alignment of retinol-binding
protein and β-lactoglobulin
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
. ||| | . |. . . | : .||||.:| :
1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin

51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
: | | | | :: | .| . || |: || |.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin

98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP


|| ||. | :.|||| | . .|
94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
Somewhat Identity Very
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
. | | similar
| : || .
(bar) | || | similar
(one dot) (two dots)
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Ancestral sequence
ACCCTAC

A no change A
C single substitution C --> A
C multiple substitutions C --> A --> T
C --> G coincidental substitutions C --> A
T --> A parallel substitutions T --> A
A --> C --> T convergent substitutions A --> T
C back substitution C --> T --> C

Sequence 1 Sequence 2
ACCGATC AATAATC
Li (1997) p.70
Biochemical and Physical
Properties of Amino acids

tiny
P
aliphatic C S+S small
G
I A G S
V C SH N
L D
T
hydrophobic M Y K E
F Q
W H R
positive
aromatic polar
charged
Observation Result

Modular nature of proteins:


Different proteins may have
proteins keep reusing ⇒
similar structure or sequence
domains
Performing alignment of
Protein is more informative protein sequences from gene

(20 vs 4 characters) sequences may provide robust
result
These are more likely to be
Many amino acids share
⇒ substituted by one other in a
related biophysical properties
protein sequences
Codons are degenerate:
Protein sequences are more
changes in the third position
⇒ conserved compared to gene
often do not alter the amino
sequences
acid that is specified
Global and Local Alignment

Global Alignment Local Alignment


Global alignment Local alignment

Extends from one end of each Finds optimally matching


sequence to the other regions within two sequences

Suitable for database searches


Suitable for aligning closely
such as BLAST, useful for finding
related sequences of similar
domains (or limited regions of
length
homology) within sequences.

Sensitive to modular nature of


Sensitive to gap penalties
proteins

Algorithm: Needleman and Algorithm: Smith and Waterman


Wunsch (1970) (1981)

e.g. Proteins having a common


proteins sharing a common
ancestor and being conserved
domain
over their whole sequence
Dotplot
M A G M A G A M E

M ● ● ●

A ● ●
Repeats

G ● ●

M ● ● ●

A ● ● ●
Palindrome
G ● ● Inversion
A ● ● ●

M ● ● ●

E ●
Real Dotplot
Insertions or Deletions (Indels) = Gaps
M A G C A D E T M E

M ● ●

A ● ●

G ●

C ●

A ● ●

T ●

M ● ●

E ● ●
Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gaps are caused by
• Different lengths of sequences
• Regions of insertions and deletions
• Notion of gaps (denoted by ‘-’)
A L I G N M E N T
| | | | | | |
- L I G A M E N T
• Since a single mutational event may cause the insertion or
deletion of more than one residue, the presence of a gap is
ascribed more significance than the length of the gap. Thus there
are separate penalties for gap creation and gap extension.
Gap Penalties
Dynamic Programming
• Solve optimization problems by dividing the problem into
independent sub-problems
• Sequence alignment has optimal substructure property
• Sub-problem: alignment of prefixes of two sequences
• Each sub-problem is computed once and stored in a matrix
• Optimal score: built upon optimal alignment computed to that
point
• Aligns two sequences beginning at ends, attempting to align all
possible pairs of characters
• Scoring scheme for matches, mismatches, gaps
• Highest set of scores defines optimal alignment between
sequences
• Match score: DNA – exact match; Amino Acids – mutation
probabilities
Dynamic Programming Example
Sequence #1: GAATTCAGTTA; M = 11
Sequence #2: GGATCGA; N = 7

• s(aibj) = +5 if ai = bj (match score)


• s(aibj) = -3 if aibj (mismatch score)
• w = -4 (gap penalty)
View of the DP Matrix
• M+1 rows, N+1 columns
Global Alignment
(Needleman-Wunsch)
• Attempts to align all residues of two sequences

• INITIALIZATION: First row and first column set


• Si,0 = w * i
• S0,j = w * j
Initialized Matrix
(Needleman-Wunsch)
Matrix Fill
(Global Alignment)

Si,j = MAXIMUM[
Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),
Si,j-1 + w (gap in sequence #1),
Si-1,j + w (gap in sequence #2)
]
Matrix Fill (Global Alignment)
• S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 - 4] = MAX[5, -8, -8]
Matrix Fill (Global Alignment)
• S1,2 = MAX[S0,1 -3, S1,1 - 4, S0,2 - 4] = MAX[-4 - 3, 5 – 4, -8 – 4] = MAX[-7, 1,
-12] = 1
Matrix Fill (Global Alignment)
Filled Matrix (Global Alignment)
Trace Back (Global Alignment)
• maximum global alignment score = 11 (value in the
lower right hand cell).
• Traceback begins in position SM,N; i.e. the position
where both sequences are globally aligned.
• At each cell, we look to see where we move next
according to the pointers.
Trace Back (Global Alignment)
Global Trace Back

G A A T T C A G T T A
| | | | | |
G G A – T C – G - — A
Local Alignment
• Smith-Waterman: obtain highest scoring local
match between two sequences

• Requires 2 modifications:
• Negative scores for mismatches
• When a value in the score matrix becomes negative,
reset it to zero (begin of new alignment)
Local Alignment Initialization
• Values in row 0 and column 0 set to 0.
Matrix Fill
(Local Alignment)

Si,j = MAXIMUM[
Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),
Si,j-1 + w (gap in sequence #1),
Si-1,j + w (gap in sequence #2),
0
]
Matrix Fill
(Local Alignment)
• S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 – 4,0] = MAX[5, -4, -4, 0] = 5
Matrix Fill (Local Alignment)
• S1,2 = MAX[S0,1 -3, S1,1 - 4, S0,2 – 4, 0] = MAX[0 - 3, 5 – 4, 0 – 4, 0] = MAX[-
3, 1, -4, 0] = 1
Matrix Fill (Local Alignment)
S1,3 = MAX[S0,2 -3, S1,2 - 4, S0,3 – 4, 0] = MAX[0 - 3, 1 – 4, 0 – 4, 0] =
MAX[-3, -3, -4, 0] = 0
Filled Matrix
(Local Alignment)
Trace Back (Local Alignment)
• maximum local alignment score for the two
sequences is 14

• found by locating the highest values in the score


matrix

• 14 is found in two separate cells, indicating multiple


alignments producing the maximal alignment score
Trace Back (Local Alignment)
• Traceback begins in the position with the highest
value.

• At each cell, we look to see where we move next


according to the pointers

• When a cell is reached where there is not a pointer


to a previous cell, we have reached the beginning
of the alignment
Trace Back (Local Alignment)
Trace Back (Local Alignment)
Trace Back (Local Alignment)
Substitution Matrix
Substitution Matrix
• A substitution matrix contains values proportional
to the probability that amino acid i mutates into
amino acid j for all pairs of amino acids.
• Substitution matrices are constructed by
assembling a large and diverse sample of verified
pairwise alignments (or multiple sequence
alignments) of amino acids.
• Substitution matrices should reflect the true
probabilities of mutations occurring through a
period of evolution.
• The two major types of substitution matrices are
PAM and BLOSUM.
Nucleic Acid Scoring
Matrices
• Two mutation models:
• Uniform mutation rates (Jukes-Cantor)
• Two separate mutation rates (Kimura)
• Transitions
• Transversions
Percent Accepted Mutation (PAM
or Dayhoff) Matrices
• Studied by Margaret Dayhoff
• Amino acid substitutions
• Alignment of common protein sequences
• 1572 amino acid substitutions
• 71 groups of protein, 85% identity
• PAM also defines a time unit, where 1 PAM is the time in
which 1/100 amino acids are expected to undergo a
mutation
• “Accepted” mutations – do not negatively affect a
protein’s fitness
normalized probabilities multiplied by 10000
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val

A R N D C Q E G H I L K M F P S T W Y V

A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18

R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1

N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1

D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1

C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2

Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1

E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2

G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5

H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1

I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33

L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15

K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1

M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4

F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0

P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2

S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2

T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9

W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0

Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1

V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901
PAM Matrices
• PAM matrices converted to log-odds matrix
• Calculate odds ratio for each substitution
• Taking scores in previous matrix
• Divide by frequency of amino acid
• Convert ratio to log10 and multiply by 10
• Take average of log odds ratio for converting A to B and converting
B to A
• Result: Symmetric matrix
Log odds ratio
𝑝𝑖𝑗
𝑆𝑖𝑗 = 𝜆 log
𝑞𝑖 × 𝑞𝑗
𝑝𝑖𝑗 = probability of two amino acids 𝑖 and 𝑗 replacing
each other
𝜆 = scaling factor to get integer 𝑆𝑖𝑗
𝑞𝑖 and 𝑞𝑗 are the background probabilities of finding
the amino acids 𝑖 and 𝑗 in any protein sequence. qi
and qj are calculated from the Count Matrix for
PAM1.
log odds matrix
PAM1 Matrix
BLOcks SUbstitution Matrix
(BLOSUM)
• BLOSUM62 is a matrix calculated from comparisons
of sequences with no less than 62% divergence.
• All BLOSUM matrices are based on observed
alignments; they are not extrapolated from
comparisons of closely related proteins.
• When converting to log odd scores the logarithm is
taken with base 2
Relation between PAM and
BLOSUM

More conserved Less conserved

Rat versus Rat versus


mouse RBP bacterial
lipocalin
Conclusion
• Substitution matrices and gap penalties introduce biological
information into alignment algorithms
• Assumptions:
• Rate of evolution is constant
• Identical residues are equal
• The relevance of alignment must be assessed with a statistical
score.
• Sequences sharing less than 20% identity are difficult to align:
• Twilight zone (Doolittle, 1986)
• Alignments may appear plausible to the eye but need to be statistically
significant.
• Other methods like Profiles need to explored
• Programs
• LALIGN
• PRSS
Conformational similarity weight
matrix
A 10
Sequence alignment approach
R 6.6 10
N 0 0 10 to pick up conformationally
D 0 0 9 10 similar protein fragments
C 0 0 0 0 10
Q 6.6 9 0 0 0 10
E 9 6.6 0 0 0 6.6 10
G 0 0 0 0 0 0 0 10
H 0 5 5 6.6 9 5 0 0 10
I 0 0 0 0 0 0 0 0 0 10
L 5 6.6 0 0 0 9 5 0 0 0 10
K 6.6 9 0 5 0 9 9 0 0 0 9 10
M 5 9 0 0 0 5 0 0 0 0 6.6 5 10
F 0 9 0 0 5 6.6 0 0 5 0 5 5 6.6 10
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
S 0 5 6.6 6.6 9 0 0 0 9 0 0 5 0 5 0 10
T 0 5 0 0 6.6 5 0 0 5 0 0 5 0 5 0 6.6 10
W 0 6.6 0 0 5 5 0 0 5 0 5 0 6.6 9 0 5 5 10
Y 0 5 0 0 5 5 0 0 6.6 0 0 0 0 9 0 6.6 9 5 10
V 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 10
A R N D C Q E G H I L K M F P S T W Y V
Hydrophobicity scoring matrix
A 100
Evolutionary conservation of both
R 13 100
N 60 53 100 the hydrophilic and hydrophobic
D 33 81 73 100 nature of transmembrane residues
C 98 11 58 30 100
Q 64 49 96 68 62 100
E 39 74 79 94 36 74 100
G 96 17 64 36 94 68 43 100
H 71 42 89 61 69 93 68 75 100
I 91 4 51 23 93 55 29 87 62 100
L 93 6 52 25 95 57 31 89 64 98 100
K 35 78 75 98 33 71 96 39 64 26 27 100
M 89 2 49 21 91 53 27 85 60 98 96 24 100
F 87 0 47 19 89 51 26 83 58 96 94 22 98 100
P 89 24 71 44 86 76 50 93 83 79 81 46 78 76 100
S 94 19 66 39 91 71 45 98 78 84 86 41 83 81 95 100
T 98 16 63 35 95 67 41 99 74 88 90 38 86 84 91 96 100
W 98 11 58 31 99 63 37 94 69 93 94 33 91 89 87 92 96 100
Y 94 19 66 38 92 70 44 98 77 85 87 41 83 81 94 99 97 93 100
V 94 7 54 26 96 58 33 90 65 97 99 29 95 93 83 88 91 96 88 100
A R N D C Q E G H I L K M F P S T W Y V
Isomorphicity of replacements
A 100
R -2 100
Predicting isomorphic residue
N -8 11 100 replacements for protein design
D 3 2 32 100
C 4 -12 -7 -20 100
Q 0 26 12 8 -8 100
E 12 11 13 34 -18 37 100
G -3 -6 10 8 -1 -10 -5 100
H -10 0 8 10 3 -4 -8 -8 100
I -17 -25 -32 -39 12 -20 -36 -21 3 100
L -6 -24 -35 -40 15 -26 -40 -16 -9 46 100
K 2 24 17 10 -14 26 24 -11 -4 -17 -33 100
M -2 0 -12 -16 2 -12 -17 -10 -8 18 24 -19 100
F -8 -11 -23 -31 11 -22 -40 -12 -10 32 36 -26 20 100
P -2 -4 8 1 -3 0 -7 4 -2 -18 -16 -2 -10 -2 100
S 10 -10 1 7 -6 -7 -10 9 -3 -17 -11 -10 -10 0 15 100
T -3 -5 12 11 -4 -10 -12 -13 -7 -5 -11 -8 -8 4 2 24 100
W -12 -7 -6 -20 8 -3 -10 -13 -6 12 8 -10 12 8 -9 -18 -8 100
Y -18 0 -13 -22 5 -23 -28 -18 12 29 16 -21 8 24 -9 -16 -2 6 100
V -12 -18 -30 -32 8 -12 -20 -14 -10 46 26 -14 10 16 -18 -16 -4 14 17 100
A R N D C Q E G H I L K M F P S T W Y V

You might also like