
SEQUENCE ALIGNMENT

“continuing..”
(5th week)
• The overall goal of pairwise sequence alignment
is to find the best pairing of two sequences.

One sequence needs to be shifted relative to the other to
find the position where maximum matches are found.

• There are two different alignment strategies that are often
used: Global alignment and Local alignment
Global Alignment
Attempts to align every residue in every sequence; most useful
when the sequences in the query set are similar and of roughly equal size.

The aim is to find the best possible alignment across the entire length of
the two sequences.

For divergent sequences and sequences of variable length, this method
may not be able to generate optimal results.
Dynamic programming is a method for
solving complex problems by breaking them
down into simpler subproblems.
Needleman–Wunsch algorithm
• A smart way to reduce the massive number of possibilities
that need to be considered, yet still guarantees that the
best solution will be found (Saul Needleman and Christian
Wunsch, 1970).

• The basic idea is to build up the best alignment by using
optimal alignments of smaller subsequences.

• The Needleman–Wunsch algorithm is an example of
dynamic programming, a discipline invented by Richard
Bellman (an American mathematician) in 1953.
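
As a rough sketch of how the table of sub-alignment scores can be built, the Python below fills a Needleman–Wunsch matrix with a linear gap penalty. The function name `nw_matrix` and the simple match/mismatch scores (5 and -4) are assumptions made for illustration only; a real protein alignment would look each pair up in a substitution matrix instead.

```python
def nw_matrix(x, y, match=5, mismatch=-4, gap=-8):
    """Fill the Needleman-Wunsch score matrix for x (columns) and y (rows)."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    # Initialization: the first row and column accumulate gap penalties.
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    # Fill: each cell takes the best of the three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch  # toy scoring (assumption)
            F[i][j] = max(F[i - 1][j - 1] + s,  # align x[j-1] with y[i-1]
                          F[i - 1][j] + gap,    # y[i-1] against a gap in x
                          F[i][j - 1] + gap)    # x[j-1] against a gap in y
    return F

F = nw_matrix("HEAGAWGHEE", "PAWHEAE")
print(F[0])       # first row: 0, -8, -16, ... (multiples of the gap penalty)
print(F[-1][-1])  # score of the best global alignment under these toy scores
```

Because the toy scores differ from the substitution scores used on the slides, only the first row and column are expected to agree with a hand-filled table.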
Needleman–Wunsch algorithm
Example
1. Initialization
2. Fill
3. Trace-back
Optimal alignment:
HEAGAWGHE-E
-P--AW-HEAE

score(H,P) = -2, score(E,P) = 0, score(E,A) = -1, score(H,A) = -2
gap penalty = -8 (linear)

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -8  -16  -24  -33  -42  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -19  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -4  -12  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -12   -6   -2  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -14   -6    4   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -14   -4    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -8    2

The value in the final cell is the best score for the alignment
Drawback of Needleman–Wunsch algorithm
The drawback of focusing on getting a
maximum score for the full-length sequence
alignment is the risk of missing the best local
similarity. This strategy is only suitable for
aligning two closely related sequences that
are of roughly the same length.
Local Alignment
• Does not assume that the two sequences in question have similarity over their entire length.

• It only finds local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the rest of the sequence regions.

• Used for aligning more divergent sequences with the goal of searching for conserved patterns in DNA or protein sequences.


Smith–Waterman algorithm
• A very simple modification of Needleman–Wunsch.

• The Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences.

• Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

• The edges of the matrix are initialized to 0 instead of increasing gap penalties.

• No cell score is ever less than 0, and no pointer is recorded unless the score is greater than 0.

• The trace-back starts from the highest score in the matrix (rather than at the end of the matrix) and ends at a score of 0 (rather than the start of the matrix).
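
The three modifications listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `smith_waterman`, the default scores (match 4, mismatch -2, linear gap -1, the same values as the worked example in these slides) and the tie-breaking order in the trace-back are assumptions.

```python
def smith_waterman(q, p, match=4, mismatch=-2, gap=-1):
    """Return (best score, aligned q, aligned p) for one optimal local alignment."""
    rows, cols = len(p) + 1, len(q) + 1
    H = [[0] * cols for _ in range(rows)]  # edges stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if p[i - 1] == q[j - 1] else mismatch
            H[i][j] = max(0,                    # scores never drop below 0
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace-back: start at the highest score, stop at a cell scoring 0.
    i, j = best_pos
    out_q, out_p = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if p[i - 1] == q[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:      # ties prefer the diagonal (assumption)
            out_q.append(q[j - 1])
            out_p.append(p[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:      # residue of p against a gap
            out_q.append("-")
            out_p.append(p[i - 1])
            i -= 1
        else:                                   # residue of q against a gap
            out_q.append(q[j - 1])
            out_p.append("-")
            j -= 1
    return best, "".join(reversed(out_q)), "".join(reversed(out_p))

score, aligned_q, aligned_p = smith_waterman("EQLLKALEFKL", "KVLEFGY")
print(score, aligned_q, aligned_p)  # 14 KALEF KVLEF
```

With the diagonal preferred on ties, the sketch reports the ungapped alignment KALEF / KVLEF. Gapped alignments of the same region reach the same score of 14, so which one is printed depends only on the tie-breaking rule.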


Identification of regional sequence similarity
may be of greater significance than finding a
match that includes all residues.
• Use Smith-Waterman if:
– The divergence level between the two sequences to
be aligned is not easily known
– The sequence lengths of the two sequences may
also be unequal
Example
Linear gap model: Match = 4, Mismatch = -2, Gap = -1
Q: E Q L L K A L E F K L
P: K V L E F G Y

- E Q L L K A L E F K L
-
K
V
L
E
F
G
Y
Example
Linear gap model: Match = 4, Mismatch = -2, Gap = -1
Q: E Q L L K A L E F K L
P: K V L E F G Y

- E Q L L K A L E F K L
- 0 0 0 0 0 0 0 0 0 0 0 0
K 0
V 0
L 0
E 0
F 0
G 0
Y 0
Example
Linear gap model: Match = 4, Mismatch = -2, Gap = -1
Q: E Q L L K A L E F K L
P: K V L E F G Y

    -   E   Q   L   L   K   A   L   E   F   K   L
-   0   0   0   0   0   0   0   0   0   0   0   0
K   0   0   0   0   0   4   3   2   1   0   4   3
V   0   0   0   0   0   3   2   1   0   0   3   2
L   0   0   0   4   4   3   2   6   5   4   3   7
E   0   4   3   3   3   2   1   5  10   9   8   7
F   0   3   2   2   2   1   0   4   9  14  13  12
G   0   2   1   1   1   0   0   3   8  13  12  11
Y   0   1   0   0   0   0   0   2   7  12  11  10
Example
Alignment
Q: E Q L L K A L E F K L
P: K V L E F G Y

Local alignment (score 14):
Q: K A - L E F
P: K - V L E F
Example
Alignment
Q: E Q L L K A L E F K L
P: K V L E F G Y

Alternative local alignment (same score, 14):
Q: K - A L E F
P: K V - L E F
Scoring Matrices
• Substitution Matrix

A set of values quantifying the likelihood of one residue
being substituted by another in an alignment, defined
for all possible pairs of residues.

The choice of which substitution matrix to use is not trivial
because there is no one correct scoring scheme for all
circumstances.
In the process of evolution, from one generation to the
next the amino acid sequences of an organism's
proteins are gradually altered through the action of
DNA mutations.
For example, the sequence

ALEIRYLRD could mutate into the sequence

ALEINYLRD in one step, and possibly

AQEINYQRD over a longer period of evolutionary time.
Scoring Matrices
Nucleotide:
Scoring matrices for nucleotide sequences are
relatively simple.

Match: positive value or high score

Mismatch: negative value or low score
Example:
     A   T   C   G
A   20   5   5  10
T    5  20  10   5
C    5  10  20   5
G   10   5   5  20
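
To show how such a table is used, the sketch below stores the example values in a dictionary and sums the pair scores of two gap-free sequences. The names `SCORE` and `ungapped_score` are illustrative assumptions, and gaps are deliberately left out.

```python
# Scores copied from the example table: identical bases score 20,
# A<->G and C<->T score 10, every other pair scores 5.
BASES = "ATCG"
TABLE = [
    [20, 5, 5, 10],   # A
    [5, 20, 10, 5],   # T
    [5, 10, 20, 5],   # C
    [10, 5, 5, 20],   # G
]
SCORE = {(a, b): TABLE[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}

def ungapped_score(seq1, seq2):
    """Sum the pair scores of two equal-length, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(SCORE[(a, b)] for a, b in zip(seq1, seq2))

print(ungapped_score("ATCG", "ATCG"))  # 80
print(ungapped_score("ATCG", "GTCA"))  # 10 + 20 + 20 + 10 = 60
```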
Scoring Matrices
Amino acid
Uses as reference a 20 × 20 substitution matrix,
representing the 20 amino acids found in proteins. The scores reflect:

• Physicochemical properties of amino acid residues
• Likelihood of certain residues being
substituted among true homologous sequences
There are essentially two types of amino
acid substitution matrices.
• One type is based on the interchangeability of
the genetic code or amino acid properties (less accurate).

• The other type is derived from empirical studies of amino
acid substitutions.

Thus, the empirical approach has gained the
most popularity in sequence alignment
applications.
• Physical and chemical characteristics
– V  I – Both small, both hydrophobic,
conservative substitution, small penalty
– V  K – Small  large, hydrophobic 
charged, large penalty
– Requires some expert knowledge and
judgement
• Empirical methods
– How often does the substitution V  I
occur in proteins that are known to be
related?
• Scoring matrices: PAM and BLOSUM
Scoring Alignments and Substitution Matrices

• Genuine matches may not be identical:

Seq1: T H I S I S A S E Q U E N C E
Seq2: T H A T _ _ _ S E Q U E N C E
Isoleucine – Alanine: both hydrophobic
Serine – Threonine: both polar
• Scoring pairs of amino acids:
– With similar properties → higher scores
– With different properties → lower scores
Positive Score

• The frequency of amino acid substitutions found in a data set of homologous sequences is greater than would have occurred by random chance.

Zero Score

• The frequency of amino acid substitutions found in the homologous sequence data set is equal to that expected by chance.

Negative Score

• The frequency of amino acid substitutions found in the homologous sequence data set is less than would have occurred by random chance.
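
The three cases above match the sign of a log-odds ratio: the logarithm of the observed pair frequency divided by the frequency expected by chance. The snippet below is a simplified sketch; the function name `log_odds`, the choice of base 2, and all frequencies are hypothetical and are not taken from any published matrix.

```python
import math

def log_odds(observed_pair_freq, freq_a, freq_b):
    """log2 of (observed pair frequency / frequency expected by chance)."""
    expected = freq_a * freq_b  # chance of seeing the pair if residues were independent
    return math.log2(observed_pair_freq / expected)

# Hypothetical frequencies, chosen only to show the three cases.
print(log_odds(0.02, 0.1, 0.1))   # observed > expected -> positive
print(log_odds(0.01, 0.1, 0.1))   # observed = expected -> zero (up to rounding)
print(log_odds(0.005, 0.1, 0.1))  # observed < expected -> negative
```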


Substitution Matrix
[Figure: substitution matrix with amino acids grouped as small and polar; small and nonpolar; polar and acidic; basic; large and hydrophobic; aromatic]
Different types of substitution matrices are being used based on:
– The number of mutations required for conversion of one amino acid to
the other
– Similarities in physicochemical properties.
PAM matrix
PAM: “Point Accepted Mutation” (“accepted point mutation”)

This matrix is calculated by observing the differences in closely
related proteins. The unit is accepted point mutations per 100 residues.

250 PAM → 250 mutations per 100 residues

Because very closely related homologs were used, the observed
mutations were not expected to significantly change the common function of
the proteins.

“NATURAL SELECTION”
• Derived from global alignments of closely related sequences.

• Matrices for greater evolutionary distances are extrapolated from
those for lesser ones.

• The number with the matrix (PAM40, PAM100) refers to the
evolutionary distance; greater numbers are greater distances.

• Does not take into account different evolutionary rates between
conserved and non-conserved regions.
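
The extrapolation step can be pictured as repeated multiplication of a one-step mutation probability matrix by itself. The sketch below uses a toy two-letter alphabet with a hypothetical 1% change per step; the helper names `mat_mul` and `mat_power` and all numbers are assumptions, whereas a real PAM1 matrix is 20 × 20 and estimated from data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise matrix m to the integer power n >= 1 by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

# Toy "PAM1": each letter stays the same with probability 0.99 (hypothetical).
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250[0])  # probabilities of staying vs. changing after 250 steps
```

After 250 steps the off-diagonal probability is far larger than in the one-step matrix, which is why matrices with higher numbers suit more divergent sequences.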
PAM 250
• One of the most widely used PAM matrices.

• The PAM 250 substitution matrix is used for distantly related sequences.

• PAM 250 corresponds to 250 accepted mutations per 100 residues.

• The data were gathered from 71 sets of aligned sequences and extrapolated up to the level of 250 amino acid replacements per 100 residues.
PAM120
• PAM120 corresponds to 120 accepted mutations per 100 residues.

• The PAM 120 substitution matrix is used for
closely related sequences and it is best for
general alignments.
PAM120 A-R=-3, R-R=6, N-R=-1, X-Z=-1, Y-R=-5
PAM250 A-R=-2, R-R=6, N-R=0, X-Z=-1, Y-R=-4
Advantages of PAM matrix
• PAM tables are based on observed mutations, so they can be extremely helpful in determining the processes responsible for these mutations, and they also provide criteria for the selection and fixation of a mutation in a population.

• Another advantage, from a statistical point of view, is that a PAM table is constructed from sequence data, so it can provide information about the changes expected at an amino acid residue after a given number of mutations.

• PAM tables provide an empirical determination of conservative replacements.
Disadvantages of PAM matrix

• It assumes that all types of mutations are
distributed uniformly across proteins.

• It uses data from closely related proteins to
infer relationships between very different
proteins.
BLOSUM matrix
• BLOSUM stands for BLOcks of Amino Acid SUbstitution Matrix.

• BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.

• They use mutation data from highly conserved local regions.

• They are based on local alignments.

BLOSUM 62 → 62% identity

• The authors (Henikoff and Henikoff) scanned the BLOCKS database for very conserved regions of protein families (that do not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities.

• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM matrices.


• Target frequencies are identified directly instead of by
extrapolation.

• Sequences more than x% identical within the block
where substitutions are being counted are grouped
together and treated as a single sequence

– BLOSUM 50 : >= 50% identity

– BLOSUM 62 : >= 62 % identity
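
The grouping rule can be sketched as a simple single-linkage clustering over the rows of an ungapped block. The block contents, the function names `percent_identity` and `cluster_block`, and the use of single linkage are assumptions made for illustration; the sketch does not reproduce the original BLOSUM construction code.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length block rows."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_block(seqs, threshold):
    """Group sequences linked by at least `threshold` percent identity (single linkage)."""
    clusters = []
    for s in seqs:
        # Existing clusters that contain at least one sequence similar enough to s.
        linked = [c for c in clusters
                  if any(percent_identity(s, t) >= threshold for t in c)]
        merged = [s]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Hypothetical ungapped block of four short fragments.
block = ["AVLKG", "AVLKA", "AVMRG", "TWPQS"]
print(cluster_block(block, 62))  # AVLKG and AVLKA (80% identical) share a cluster
print(cluster_block(block, 50))  # AVMRG (60% identical to AVLKG) joins them
```

A lower threshold merges more sequences into each cluster, so fewer near-identical pairs are counted and the resulting matrix reflects more divergent relationships.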


• Use blocks of protein sequence fragments from different families (the BLOCKS database)

• Amino acid pair frequencies calculated by summing over all possible pairs in a block

• Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over particular threshold = same cluster)

• Similar idea to PAM matrices

• Probabilities estimated from blocks of sequence fragments

• Blocks represent structurally conserved regions
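
Counting pairs column by column might look like the sketch below. The three-row block is hypothetical, the name `pair_counts` is an assumption, and the cluster weighting described above is left out to keep the example short.

```python
from collections import Counter
from itertools import combinations

def pair_counts(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):              # walk the block column by column
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts

# Hypothetical block of three fragments, two columns wide.
counts = pair_counts(["AV", "AI", "SV"])
total = sum(counts.values())
for pair, n in sorted(counts.items()):
    print(pair, n, n / total)               # pair, count, relative frequency
```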


BLOSUM 62
BLOSUM62/ BLOSUM50
• Whether BLOSUM62 or BLOSUM50 is better depends mainly on how distant the relationships between the sequences are.

• BLOSUM50 will work better for more distant relationships.

• BLOSUM62 will work better for closer relationships.


ADVANTAGE
• All BLOSUM matrices are calculated from
observed alignments; they are not extrapolated.

DISADVANTAGE
• Restricted to a subset of conserved domains
Which matrix to use ?
– Depends on the problem properties,

– Distantly related sequences: PAM 250, BLOSUM 50

– Closely related sequences: PAM 120, BLOSUM 80

– The length of the sequence is important

– Short sequences → PAM 40 or BLOSUM 80

– Long sequences → PAM 250 or BLOSUM 50

• BLOSUM 62 and PAM 120
PAM                                      | BLOSUM
Built from global alignments             | Built from local alignments
Built from a small amount of data        | Built from a vast amount of data
Counting is based on minimum replacement | Counting is based on groups of related
or maximum parsimony                     | sequences counted as one
Performs better for finding global       | Better for finding local alignments
alignments and remote homologs           |
Higher PAM series means more divergence  | Lower BLOSUM series means more divergence
EXAMPLE
DYNAMIC PROGRAMMING
• Sequences:
– X = THISLINE, Y = ISALIGNED
• Gaps:
– Linear gap penalty (E=8)

• Scoring matrix: BLOSUM-62
• Matrix S of optimal scores of sub-sequence
alignments.

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
S(I, T) = -1,
1. I -- H
2. I -- gap
3. gap -- H

S(I, H) = -3, S(I, gap) = -8, S(gap, H) = -8

Recurrence relation:

F(i,j) = max [ F(i, j-1) + s(gap, Y(j)),
               F(i-1, j) + s(X(i), gap),
               F(i-1, j-1) + s(X(i), Y(j)) ]
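
As a small worked check of the recurrence, the snippet below computes the first two interior cells for X = THISLINE and Y = ISALIGNED using only the scores quoted above (S(I, T) = -1, S(I, H) = -3, gap = -8). The helper name `cell_score` and its argument names are assumptions introduced for illustration.

```python
def cell_score(f_diag, f_prev_i, f_prev_j, sub, gap=-8):
    """Best of the three moves into F(i,j).

    f_diag   = F(i-1, j-1), extended by the substitution score s(X(i), Y(j))
    f_prev_i = F(i-1, j),   extended by a gap
    f_prev_j = F(i, j-1),   extended by a gap
    """
    return max(f_diag + sub, f_prev_i + gap, f_prev_j + gap)

# Border cells are multiples of the gap penalty: 0, -8, -16, ...
# F(1,1): T (from X) against I (from Y), with S(I, T) = -1.
f_11 = cell_score(f_diag=0, f_prev_i=-8, f_prev_j=-8, sub=-1)
# F(2,1): H (from X) against I (from Y), with S(I, H) = -3.
f_21 = cell_score(f_diag=-8, f_prev_i=f_11, f_prev_j=-16, sub=-3)
print(f_11, f_21)  # -1 -9
```

For F(2,1) the three candidates are -11 (aligning I with H), -9 (extending F(1,1) with a gap) and -24, so the gap move wins.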
