Sequence Alignment: "Continuing.." (5th Week)

• The overall goal of pairwise sequence alignment
is to find the best pairing of two sequences.
One sequence needs to be shifted relative to the other to
find the position where the maximum number of matches occurs.
• Two alignment strategies are often
used: global alignment and local alignment.
Global Alignment
Attempts to align every residue in every sequence; most useful
when the sequences in the query set are similar and of roughly equal size.
The aim is to find the best possible alignment across the entire length of
the two sequences.
For divergent sequences and sequences of variable length, this method
may not be able to generate optimal results.
Dynamic programming is a method for
solving complex problems by breaking them
down into simpler subproblems.
Needleman–Wunsch algorithm
• A smart way to reduce the massive number of possibilities
that need to be considered, while still guaranteeing that the
best solution will be found (Saul Needleman and Christian
Wunsch, 1970).
• The basic idea is to build up the best alignment by using
optimal alignments of smaller subsequences.
• The Needleman–Wunsch algorithm is an example of
dynamic programming, a discipline invented by Richard
Bellman (an American mathematician) in 1953.
Needleman–Wunsch algorithm
Example
1. Initialization
2. Fill
3. Trace-back
Optimal alignment:
HEAGAWGHE-E
-P--AW-HEAE
score(H,P) = -2, score(E,P) = 0, score(E,A) = -1, score(H,A) = -2
gap penalty = -8 (linear)
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -8  -16  -24  -33  -42  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -19  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -4  -12  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -12   -6   -2  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -14   -6    4   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -14   -4    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -8    2
The value in the final cell is the best score for the alignment.
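The three steps (initialization, fill, trace-back) can be sketched in Python as follows. This is a minimal illustration, not the slide's exact setup: the full substitution matrix behind the example is not given here, so the sketch assumes a hypothetical `simple_score` function (match = 1, mismatch = -1), a gap penalty of -1, and two short made-up sequences.

```python
def needleman_wunsch(x, y, score, gap=-8):
    """Global alignment of x and y with a linear gap penalty.

    score(a, b) returns the substitution score for residues a and b.
    Returns (best_score, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)

    # 1. Initialization: first row and column hold cumulative gap penalties.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # 2. Fill: each cell is the best of diagonal, up and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x[i-1] with y[j-1]
                F[i - 1][j] + gap,                            # x[i-1] against a gap
                F[i][j - 1] + gap,                            # y[j-1] against a gap
            )

    # 3. Trace-back: walk from the final cell back to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def simple_score(a, b):
    """Hypothetical scoring scheme (assumption): match = 1, mismatch = -1."""
    return 1 if a == b else -1


best, top, bottom = needleman_wunsch("ACGT", "AGT", simple_score, gap=-1)
print(best)    # 2
print(top)     # ACGT
print(bottom)  # A-GT
```

When several alignments share the best score, this sketch reports only one of them, because the trace-back tries the diagonal move first.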
Drawback of the Needleman–Wunsch algorithm
The drawback of focusing on getting a
maximum score for the full-length sequence
alignment is the risk of missing the best local
similarity. This strategy is only suitable for
aligning two closely related sequences that
are of roughly the same length.
Local Alignment
• Does not assume that the two sequences in question are similar over their entire
length.
• It only finds the local regions with the highest level of similarity between the two
sequences and aligns these regions without regard for the alignment of the rest of
the sequences.
• Used for aligning more divergent sequences, with the goal of searching for
conserved patterns in DNA or protein sequences.

Smith–Waterman algorithm
• A very simple modification of Needleman–Wunsch.
• The Smith–Waterman algorithm is a well-known algorithm for
performing local sequence alignment; that is, for determining similar
regions between two nucleotide or protein sequences.
• Instead of looking at the total sequence, the Smith–Waterman algorithm
compares segments of all possible lengths and optimizes the similarity
measure.
• The edges of the matrix are initialized to 0 instead of increasing
gap penalties.
• No cell score is ever less than 0 (a negative value is replaced by 0), and no pointer is
recorded unless the score is greater than 0.
• The trace-back starts from the highest score in the matrix
(rather than at the end of the matrix) and ends at a score of 0
(rather than the start of the matrix).

Identification of regional sequence similarity
may be of greater significance than finding a
match that includes all residues.
• Use Smith–Waterman if:
– The level of divergence between the two sequences to
be aligned is not easily known
– The two sequences may
be of unequal length
Example
Linear gap model: Match = 4, Mismatch = -2, Gap = -1
Q: E Q L L K A L E F K L
P: K V L E F G Y
Empty matrix (Q across the top, P down the side):
   -  E  Q  L  L  K  A  L  E  F  K  L
-
K
V
L
E
F
G
Y
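Example
Step 1: initialization (edges set to 0)
Linear gap model: Match = 4, Mismatch = -2, Gap = -1
Q: E Q L L K A L E F K L
P: K V L E F G Y
   -  E  Q  L  L  K  A  L  E  F  K  L
-  0  0  0  0  0  0  0  0  0  0  0  0
K  0
V  0
L  0
E  0
F  0
G  0
Y  0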
Example
Step 2: fill
Linear gap model: Match = 4, Mismatch = -2, Gap = -1
Q: E Q L L K A L E F K L
P: K V L E F G Y
   -  E  Q  L  L  K  A  L  E  F  K  L
-  0  0  0  0  0  0  0  0  0  0  0  0
K  0  0  0  0  0  4  3  2  1  0  4  3
V  0  0  0  0  0  3  2  1  0  0  3  2
L  0  0  0  4  4  3  2  6  5  4  3  7
E  0  4  3  3  3  2  1  5 10  9  8  7
F  0  3  2  2  2  1  0  4  9 14 13 12
G  0  2  1  1  1  0  0  3  8 13 12 11
Y  0  1  0  0  0  0  0  2  7 12 11 10
Example
Alignment
Q: E Q L L K A L E F K L
P: K V L E F G Y
Local alignment (score 14):
Q: K A - L E F
P: K - V L E F
Example
Alignment
Q: E Q L L K A L E F K L
P: K V L E F G Y
Alternative local alignment with the same score (14):
Q: K - A L E F
P: K V - L E F
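The example can be reproduced with a short Python sketch of Smith–Waterman, using the parameters from the slides (Match = 4, Mismatch = -2, Gap = -1). The function name `smith_waterman` and its return layout are illustrative choices, not part of the original material.

```python
def smith_waterman(q, p, match=4, mismatch=-2, gap=-1):
    """Local alignment of q (columns) and p (rows) with a linear gap model.

    Returns (best_score, H, aligned_q, aligned_p), where H is the score matrix.
    """
    rows, cols = len(p) + 1, len(q) + 1
    # Edges are initialized to 0 instead of increasing gap penalties.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: scores are never allowed to drop below 0.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if p[i - 1] == q[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace-back: start at the highest score and stop at a score of 0.
    i, j = best_pos
    aligned_q, aligned_p = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if p[i - 1] == q[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:      # diagonal: residue against residue
            aligned_q.append(q[j - 1])
            aligned_p.append(p[i - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:      # up: gap in q
            aligned_q.append("-")
            aligned_p.append(p[i - 1])
            i -= 1
        else:                                   # left: gap in p
            aligned_q.append(q[j - 1])
            aligned_p.append("-")
            j -= 1
    return best, H, "".join(reversed(aligned_q)), "".join(reversed(aligned_p))


score, H, aq, ap = smith_waterman("EQLLKALEFKL", "KVLEFGY")
print(score)  # 14
print(aq)     # KALEF
print(ap)     # KVLEF
```

Because this sketch tries the diagonal move first during trace-back, it reports the ungapped alignment KALEF / KVLEF (4 - 2 + 4 + 4 + 4 = 14). It has the same score as the two gapped alignments shown in the slides; all three are co-optimal paths through the same matrix.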
Scoring Matrices
• Substitution Matrix
A set of values that quantifies the likelihood of one residue
being substituted by another in an alignment, defined
for all possible pairs of residues.
The choice of which substitution matrix to use is not trivial
because there is no one correct scoring scheme for all
circumstances.
In the process of evolution, from one generation to the
next, the amino acid sequences of an organism's
proteins are gradually altered through the action of
DNA mutations.
For example, the sequence
ALEIRYLRD could mutate into the sequence
ALEINYLRD in one step, and possibly into
AQEINYQRD over a longer period of evolutionary
time.
Scoring Matrices
Nucleotide:
Scoring matrices for nucleotide sequences are
relatively simple.
Match: positive value or high score
Mismatch: negative value or low score
Example:
     A   T   C   G
A   20   5   5  10
T    5  20  10   5
C    5  10  20   5
G   10   5   5  20
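A lookup like this is easy to express in code. The sketch below stores the example matrix as a nested dictionary and sums the scores of an ungapped alignment; the names `NUC_SCORES` and `score_ungapped` and the test sequences are hypothetical. Note that in this example matrix the A/G and C/T mismatches score 10 while the other mismatches score 5.

```python
# Example nucleotide scoring matrix from the slide, as a nested dictionary.
NUC_SCORES = {
    "A": {"A": 20, "T": 5, "C": 5, "G": 10},
    "T": {"A": 5, "T": 20, "C": 10, "G": 5},
    "C": {"A": 5, "T": 10, "C": 20, "G": 5},
    "G": {"A": 10, "T": 5, "C": 5, "G": 20},
}


def score_ungapped(seq1, seq2, matrix=NUC_SCORES):
    """Sum the pairwise scores of two equal-length sequences (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[a][b] for a, b in zip(seq1, seq2))


print(score_ungapped("ATCG", "ATCG"))  # 80 (four matches)
print(score_ungapped("ATCG", "GTCA"))  # 60 (10 + 20 + 20 + 10)
```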
Scoring Matrices
Amino acid:
Use as reference a 20 × 20 substitution matrix,
representing the 20 amino acids found in proteins.
• Physicochemical properties of amino acid
residues
• Likelihood of certain residues being
substituted among true homologous sequences
There are essentially two types of amino
acid substitution matrices.
• One type is based on the interchangeability of
the genetic code or on amino acid properties (less accurate).
• The other type is derived from empirical studies of amino
acid substitutions.
Thus, the empirical approach has gained the
most popularity in sequence alignment
applications.
• Physical and chemical characteristics
– V → I: both small, both hydrophobic;
conservative substitution, small penalty
– V → K: small → large, hydrophobic →
charged; large penalty
– Requires some expert knowledge and
judgement
• Empirical methods
– How often does the substitution V → I
occur in proteins that are known to be
related?
• Scoring matrices: PAM and BLOSUM
Scoring Alignments and Substitution Matrices
• Genuine matches may not be identical:
Seq1: T H I S I S A S E Q U E N C E
Seq2: T H A T _ _ _ S E Q U E N C E
Isoleucine – Alanine: both hydrophobic
Serine – Threonine: both polar
• Scoring pairs of amino acids:
– With similar properties → higher scores
– With different properties → lower scores
Positive Score
• The frequency of the amino acid substitution in a data set of homologous sequences is
greater than would have occurred by random chance.
Zero Score
• The frequency of the amino acid substitution in the homologous sequence data set is
equal to that expected by chance.
Negative Score
• The frequency of the amino acid substitution in the homologous sequence data set is
less than would have occurred by random chance.
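The sign rule above is what a log-odds score produces: the logarithm of the observed frequency divided by the frequency expected by chance. The sketch below illustrates this; the function name, the base-2 logarithm, the scaling factor of 2 and the example frequencies are all assumptions made for illustration, not values from the slides.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected).

    The base-2 logarithm and scale of 2 are assumptions for this sketch.
    """
    return round(scale * math.log2(observed_freq / expected_freq))


# Hypothetical frequencies, chosen only to show the three cases.
print(log_odds_score(0.02, 0.005))   # observed more often than chance: positive
print(log_odds_score(0.01, 0.01))    # observed as often as chance: zero
print(log_odds_score(0.0025, 0.01))  # observed less often than chance: negative
```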

Substitution Matrix
[Figure: substitution matrix with residues grouped as small and polar; small and
nonpolar; polar and acidic; basic; large and hydrophobic; aromatic]
Different types of substitution matrices are used, based on:
– The number of mutations required to convert one amino acid into
the other
– Similarities in physicochemical properties.
PAM matrix
PAM: “Point Accepted Mutation”, also read as
“accepted point mutation”
This matrix is calculated by observing the differences in closely
related proteins. The unit is accepted point mutations per 100 residues:
250 PAM → 250 mutations per 100 residues
Because very closely related homologs were used, the observed
mutations were not expected to significantly change the common function of
the proteins; they had been accepted by “NATURAL SELECTION”.
• Derived from global alignments of closely related sequences.
• Matrices for greater evolutionary distances are extrapolated from
those for lesser ones.
• The number with the matrix (PAM40, PAM100) refers to the
evolutionary distance; greater numbers are greater distances.
• Does not take into account different evolutionary rates between
conserved and non-conserved regions.
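The extrapolation step can be pictured as repeated matrix multiplication: a mutation probability matrix for a short evolutionary distance is multiplied by itself to model a longer one. The sketch below uses a made-up 3-letter alphabet and a made-up matrix `TOY_PAM1` in which each residue has a 1% chance of changing per step; real PAM matrices are 20 × 20, are estimated from data, and are then converted to log-odds scores, none of which is shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy mutation probability matrix (assumption): 99% unchanged per step,
# with the remaining 1% split evenly between the two other residues.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(value, 3) for value in row])
```

In this toy model a residue still has a better-than-random chance of being unchanged after 250 steps, because later mutations can hit positions that have already changed or can revert earlier changes. This is why 250 accepted mutations per 100 residues does not mean that every position differs.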
PAM 250
• One of the most widely used PAM matrices.
• The PAM 250 substitution matrix is used for distantly related sequences.
• PAM 250 corresponds to 250 accepted mutations per 100 residues.
• The underlying data were gathered from 71 sets of aligned
sequences and extrapolated up to the level of 250 amino acid replacements per
100 residues.
PAM120
• PAM120 corresponds to 120 accepted mutations per 100 residues.
• The PAM 120 substitution matrix is used for
closely related sequences and it is best for
general alignments.
Example scores:
PAM120: A-R = -3, R-R = 6, N-R = -1, X-Z = -1, Y-R = -5
PAM250: A-R = -2, R-R = 6, N-R = 0, X-Z = -1, Y-R = -4
Advantages of PAM matrix
• PAM tables are based on observed mutations, so they can be extremely helpful in
determining the processes responsible for these mutations and can also
provide criteria for selecting and fixing a mutation in a population.
• From a statistical point of view, because a PAM table is constructed from
sequence data, it can provide
information about how amino acid residues are expected to change after a
given number of mutations.
• PAM tables provide an empirical determination of conservative
replacements.
Disadvantages of PAM matrix
• It assumes that all types of mutations are
distributed uniformly across proteins.
• It uses data from closely related proteins to
infer relationships between very different
proteins.
BLOSUM matrix
• BLOSUM stands for BLOcks SUbstitution Matrix.
• BLOSUM matrices are used to score alignments between evolutionarily divergent protein
sequences.
• They use mutation data from highly conserved local regions.
• They are based on local alignments.
BLOSUM 62 → 62% identity
• The authors scanned the BLOCKS database for very conserved regions of protein families (that do
not have gaps in the sequence alignment) and then counted the relative frequencies of amino
acids and their substitution probabilities.
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from
comparisons of closely related proteins like the PAM matrices.
• Target frequencies are identified directly instead of by
extrapolation.
• Sequences more than x% identical within the block
where substitutions are being counted are grouped
together and treated as a single sequence:
– BLOSUM 50: >= 50% identity
– BLOSUM 62: >= 62% identity
• Uses blocks of protein sequence fragments from different families (the BLOCKS
database)
• Amino acid pair frequencies are calculated by summing over all possible pairs in a
block
• Different evolutionary distances are incorporated into this scheme with a
clustering procedure (identity over a particular threshold = same cluster)
• Similar idea to PAM matrices
• Probabilities are estimated from blocks of sequence fragments
• Blocks represent structurally conserved regions
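The pair-counting step can be sketched in a few lines of Python. The block below is a made-up three-sequence example, and `count_pairs` is a hypothetical helper; the clustering of sequences above the identity threshold and the conversion of counts into scores are deliberately left out.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):                 # one column of the block at a time
        for a, b in combinations(column, 2):   # every pair of sequences
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy block (assumption): three aligned fragments of equal length, no gaps.
toy_block = ["AVL", "AIL", "SVL"]
pair_counts = count_pairs(toy_block)
for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Each column of three sequences contributes three pairs, so the counts for this block sum to nine. Dividing each count by that total would give the observed pair frequencies.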

BLOSUM 62
BLOSUM62 / BLOSUM50
• Which of the two works better depends mainly on how distant the
relationships between the sequences are.
• BLOSUM50 will work better for more distant relationships.
• BLOSUM62 will work better for closer relationships.

ADVANTAGE
• All BLOSUM matrices are calculated from
observed alignments; they are not extrapolated.
DISADVANTAGE
• Restricted to a subset of conserved domains.
Which matrix to use?
– Depends on the properties of the problem
– Distantly related sequences: PAM 250 or BLOSUM 50
– Closely related sequences: PAM 120 or BLOSUM 80
– The length of the sequence is important
– Short sequences → PAM 40 or BLOSUM 80
– Long sequences → PAM 250 or BLOSUM 50

• BLOSUM – 62 and PAM 120
PAM | BLOSUM
Built from global alignments | Built from local alignments
Built from a small amount of data | Built from a vast amount of data
Counting is based on minimum replacement or maximum parsimony | Counting is based on groups of related sequences counted as one
Performs better for finding global alignments and remote homologs | Better for finding local alignments
Higher PAM series means more divergence | Lower BLOSUM series means more divergence
EXAMPLE
DYNAMIC PROGRAMMING
• Sequences:
– X = THISLINE, Y = ISALIGNED
• Gaps:
– Linear gap penalty (E = 8)
• Scoring matrix:
– BLOSUM-62
• Matrix S of optimal scores of sub-sequence
alignments.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Three options when scoring a cell (here for I and H):
1. I -- H
2. I -- gap
3. gap -- H
Scores used:
S(I, T) = -1, S(I, H) = -3, S(I, gap) = -8, S(gap, H) = -8
Recurrence relation:
\[
F(i,j) = \max \begin{cases}
F(i,j-1) + s(\text{gap}, Y(j)) \\
F(i-1,j) + s(X(i), \text{gap}) \\
F(i-1,j-1) + s(X(i), Y(j))
\end{cases}
\]
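As a worked check of the recurrence, the first two cells that involve I (the first residue of Y) can be computed by hand or with the short sketch below. It assumes F(0,0) = 0 and that the first row and column are filled with multiples of the gap penalty, as in the Needleman–Wunsch initialization; the dictionary layout is an illustrative choice.

```python
GAP = -8  # linear gap penalty, E = 8

# BLOSUM-62 scores quoted in the example.
S = {("T", "I"): -1, ("H", "I"): -3}

# F[(i, j)]: i indexes X = THISLINE, j indexes Y = ISALIGNED.
# Assumed initialization: cumulative gap penalties along both edges.
F = {(0, 0): 0, (1, 0): GAP, (2, 0): 2 * GAP, (0, 1): GAP}

# Cell for X(1) = T and Y(1) = I.
F[(1, 1)] = max(
    F[(1, 0)] + GAP,            # gap aligned with I
    F[(0, 1)] + GAP,            # T aligned with gap
    F[(0, 0)] + S[("T", "I")],  # T aligned with I
)

# Cell for X(2) = H and Y(1) = I.
F[(2, 1)] = max(
    F[(2, 0)] + GAP,            # gap aligned with I
    F[(1, 1)] + GAP,            # H aligned with gap
    F[(1, 0)] + S[("H", "I")],  # H aligned with I
)

print(F[(1, 1)])  # -1, from max(-16, -16, -1)
print(F[(2, 1)])  # -9, from max(-24, -9, -11)
```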
