“continuing..”
(5th week)
• The overall goal of pairwise sequence alignment is to find the best
pairing of two sequences.
• Here the aim is to find the best possible alignment across the entire
length of the two sequences.
         H   E   A   G   A   W   G   H   E   E
     0  -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P   -8  -2  -8 -16 -24 -33 -42 -49 -57 -65 -73
A  -16 -10  -3  -4 -12 -19 -28 -36 -44 -52 -60
W  -24 -18 -11  -6  -7 -15  -4 -12 -21 -29 -37
H  -32 -14 -18 -13  -8  -9 -12  -6  -2 -11 -19
E  -40 -22  -8 -16 -16  -9 -12 -14  -6   4  -5
A  -48 -30 -16  -3 -11 -11 -12 -12 -14  -4   2
E  -56 -38 -24 -11  -6 -12 -14 -15 -12  -8   2
The value in the final cell is the best score for the alignment
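The matrix fill for a global alignment can be sketched in a few lines of Python. This is a minimal illustration, not the slides' exact method: the gap penalty of -8 is inferred from the first row and column of the matrix above, while the match and mismatch values are placeholder assumptions, because the substitution scores used on the slide are not shown. The function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-8):
    """Fill a global-alignment score matrix (rows follow a, columns follow b).

    match and mismatch are assumed placeholder scores; gap is a linear penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]

    # Leading gaps are penalised, so the first row and column grow by `gap`.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,     # gap in b
                F[i][j - 1] + gap,     # gap in a
            )
    return F


F = needleman_wunsch("PAWHEAE", "HEAGAWGHEE")
print(F[0])        # first row: multiples of the gap penalty
print(F[-1][-1])   # final cell: score of the best global alignment
```

The score is always read from the bottom-right cell, which is why the whole length of both sequences contributes to it.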
Drawback of the Needleman–Wunsch algorithm
Because the algorithm maximizes the score of the full-length
alignment, it risks missing the best local similarity.
This strategy is only suitable for aligning two closely
related sequences of approximately the same length.
Local Alignment
• Does not assume that the two sequences in question have similarity over their entire
length.
• It only finds local regions with the highest level of similarity between the two
sequences and aligns these regions without regard for the alignment of the rest of
the sequences.
• Used for aligning more divergent sequences with the goal of searching for
regions of local similarity.
gap penalties.
Example
Linear gap model: Match = 4, Mismatch = -2, Gap = -1
Q: E Q L L K A L E F K L
P: K V L E F G Y
Initialization: the first row and the first column are set to 0.
   -  E  Q  L  L  K  A  L  E  F  K  L
-  0  0  0  0  0  0  0  0  0  0  0  0
K  0
V  0
L  0
E  0
F  0
G  0
Y  0
Example
Matrix fill: each cell takes the maximum of 0, the diagonal cell plus the
match or mismatch score, and the cell above or to the left plus the gap penalty.
   -  E  Q  L  L  K  A  L  E  F  K  L
-  0  0  0  0  0  0  0  0  0  0  0  0
K  0  0  0  0  0  4  3  2  1  0  4  3
V  0  0  0  0  0  3  2  1  0  0  3  2
L  0  0  0  4  4  3  2  6  5  4  3  7
E  0  4  3  3  3  2  1  5 10  9  8  7
F  0  3  2  2  2  1  0  4  9 14 13 12
G  0  2  1  1  1  0  0  3  8 13 12 11
Y  0  1  0  0  0  0  0  2  7 12 11 10
Example
Alignment (traced back from the highest-scoring cell, 14)
Q: K A - L E F
P: K - V L E F
Example
Alignment (an alternative traceback with the same score, 14)
Q: K - A L E F
P: K V - L E F
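The local alignment example above (Match = 4, Mismatch = -2, Gap = -1) can be reproduced with a short Python sketch. The function name and the tie-breaking rule are assumptions made for illustration: ties in the traceback are resolved in favour of the diagonal move, so this sketch returns the ungapped alignment KALEF / KVLEF, which also scores 14 and is a third co-optimal alternative to the two gapped alignments shown on the slides.

```python
def local_align(q, p, match=4, mismatch=-2, gap=-1):
    """Local alignment with a linear gap penalty (columns follow q, rows follow p)."""
    rows, cols = len(p) + 1, len(q) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if p[i - 1] == q[j - 1] else mismatch
            # Scores never drop below 0, so an alignment can restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    aligned_q, aligned_p = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if p[i - 1] == q[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:      # diagonal move preferred in ties
            aligned_q.append(q[j - 1]); aligned_p.append(p[i - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:      # gap in q
            aligned_q.append("-"); aligned_p.append(p[i - 1])
            i -= 1
        else:                                   # gap in p
            aligned_q.append(q[j - 1]); aligned_p.append("-")
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_p)), H


score, aq, ap, H = local_align("EQLLKALEFKL", "KVLEFGY")
print(score)
print("Q:", aq)
print("P:", ap)
```

Changing the order of the checks in the traceback loop selects one of the gapped alignments instead; all of them end in the same cell with the same score.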
Scoring Matrices
• Substitution Matrix
    A   T   C   G
A  20   5   5  10
T   5  20  10   5
C   5  10  20   5
G  10   5   5  20
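As a small illustration of how such a substitution matrix is used, the sketch below scores an ungapped alignment of two DNA strings by summing the matrix entry for each aligned pair. The column order A, T, C, G is assumed from the symmetry of the matrix, and the names `SUBSTITUTION` and `ungapped_score` are hypothetical.

```python
BASES = "ATCG"
SCORES = [
    [20, 5, 5, 10],   # A
    [5, 20, 10, 5],   # T
    [5, 10, 20, 5],   # C
    [10, 5, 5, 20],   # G
]

# Lookup table: (base in sequence 1, base in sequence 2) -> score.
SUBSTITUTION = {
    (a, b): SCORES[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}


def ungapped_score(seq1, seq2):
    """Sum the substitution scores of two equal-length, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(SUBSTITUTION[a, b] for a, b in zip(seq1, seq2))


print(ungapped_score("ATCG", "ATCG"))  # four identical pairs
print(ungapped_score("AT", "GC"))      # A-G and T-C pairs
```

In this matrix the A-G and T-C pairs score higher than the other mismatches, so some substitutions are penalised less than others.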
Scoring Matrices
Amino acid
Use as reference a 20 × 20 substitution matrix,
representing the 20 amino acids found in proteins.
Seq1: T H I S I S A S E Q U E N C E
Seq2: T H A T _ _ _ S E Q U E N C E
Isoleucine – Alanine: both hydrophobic
Serine – Threonine : both polar
• Scoring pairs of amino acids:
– with similar properties: higher scores
– with different properties: lower scores
Positive Score
Zero Score
Negative Score
• frequency of amino acid substitutions found in the homologous sequence data set is
Amino acid groups: small and nonpolar; polar and acidic; basic; large and hydrophobic; aromatic
Different types of substitution matrices are used, based on:
– The number of mutations required to convert one amino acid into
the other
– Similarities in physicochemical properties.
PAM matrix
PAM: “Point Accepted Mutation”
“NATURAL SELECTION”
• Derived from global alignments of closely related sequences.
• The 250 in PAM250 stands for 250 accepted mutations per 100 residues.
• The underlying data were gathered from 71 sets of aligned, closely related protein sequences.
PAM120
• PAM120 corresponds to 120 accepted mutations per 100 residues.
determining those processes which are responsible for these mutations and also
• Another advantage, from a statistical point of view, is that the PAM table is
constructed from data derived from sequences; the table can then provide
information about the changes in the structure of an amino acid residue after a
replacement.
Disadvantages of PAM matrix
• BLOSUM matrices are used to score alignments between evolutionarily divergent protein
sequences.
not have gaps in the sequence alignment) and then counted the relative frequencies of amino
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from
database)
• Amino acid pair frequencies are calculated by summing over all possible pairs in
each block
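The pair-counting step can be sketched as follows. This is only an illustration of "summing over all possible pairs" within one ungapped block: it counts the raw pairs column by column and leaves out the later steps (sequence clustering, conversion to frequencies and scores). The function name and the toy block are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs in every column of an ungapped block.

    `block` is a list of equal-length aligned sequences.
    """
    counts = Counter()
    for column in zip(*block):                 # walk the block column by column
        for a, b in combinations(column, 2):   # every pair of sequences
            counts[tuple(sorted((a, b)))] += 1
    return counts


toy_block = ["AC", "AC", "SC"]   # three sequences, two columns
counts = pair_counts(toy_block)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

A column with n sequences contributes n(n-1)/2 pairs, so the toy block above yields six pairs in total.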
DISADVANTAGE
• Restricted to a subset of conserved domains
Which matrix to use?
– Depends on the problem properties,
• Scoring matrix (BLOSUM-62)
• Matrix S of optimal scores of sub-sequence alignments.