
Lecture 09

Sequence alignment

Scoring Matrices
Dynamic Programming Solution
Scoring matrices

Protein scoring matrices – PAM and BLOSUM

Nucleic acid scoring matrices – Jukes Cantor and Kimura models


PROTEIN Scoring Matrices
Percent Accepted Mutation
(PAM or Dayhoff) Matrices
•  Studied by Margaret Dayhoff

•  Amino acid substitutions were estimated using:
–  Alignment of common protein sequences
–  1572 amino acid substitutions/changes in 71 groups of proteins that are at least 85% similar

•  Accepted mutations are those that do not negatively affect a protein's fitness
•  Similar sequences were organized into phylogenetic trees

•  The number of changes of each amino acid into every other amino acid was counted (changes accepted without affecting function)

•  Relative mutability was evaluated by counting the number of changes of each amino acid and dividing by a normalization factor (this normalizes the data for variations in amino acid composition, mutation rate, and sequence length)

•  The amino acid exchange counts and mutability values were used to generate a 20 x 20 mutation probability matrix representing all possible amino acid changes.
•  Since each change is assumed to be independent of previous mutational events, the PAM1 matrix can be multiplied by itself N times to give the transition matrix for sequences separated by N PAM units (N accepted mutations per 100 amino acids).

•  PAM1 is 1 accepted mutation per 100 amino acids;

•  PAM10 is 10 accepted mutations per 100 amino acids;

•  PAM250 is 250 accepted mutations per 100 amino acids, and so on. Values above 100 are possible because the same site can change more than once.
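The extrapolation from PAM1 to PAM-N can be sketched as repeated matrix multiplication. The block below is a minimal illustration only: it assumes a made-up 3-letter alphabet and a hypothetical `TOY_PAM1` matrix in place of the real 20 x 20 Dayhoff matrix, and `mat_mul` and `pam_n` are illustrative helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pam_n(pam1, n):
    """Return pam1 multiplied by itself until it is raised to the n-th power (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical toy matrix (not Dayhoff's data): row i holds the probabilities
# that residue i is unchanged or replaced by each other residue after 1 PAM.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam250 = pam_n(TOY_PAM1, 250)
for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

Each row of the result still sums to 1, while the diagonal (the probability of a residue being unchanged) shrinks as N grows.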
PAM250 matrix

Source – David Mount - Bioinformatics


BLOSUM
BLOck SUbstitution Matrices

Introduced by Henikoff and Henikoff, 1992.

Derived from a database of local alignments of relatively distantly related protein sequences.

They looked at a large set of approximately 2000 amino acid patterns organized into blocks, which are conserved regions within protein families as identified by the protein database, PROSITE.

The blocks that were studied were also signatures of a protein family, indicating that members of the family could be found by searching for these blocks.
To deal with overrepresentation of amino acid substitutions occurring in the most closely related members of a family, closely related sequences in a block are grouped together and counted as one sequence.

Sequences that were at least 60% identical were grouped together to form the BLOSUM-60 matrix; sequences at least 80% identical were grouped together to form the BLOSUM-80 matrix, etc.
Blosum-45 Matrix (lower triangle)

    G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C
G   7
P  -2   9
D  -1  -1   7
E  -2   0   2   6
N   0  -2   2   0   6
H  -2  -2   0   0   1  10
Q  -2  -1   0   2   0   1   6
K  -2  -1   0   1   0  -1   1   5
R  -2  -2  -1   0   0   0   1   3   7
S   0  -1   0   0   1  -1   0  -1  -1   4
T  -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A   0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M  -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V  -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I  -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L  -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F  -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y  -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W  -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C  -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12

Source – David Mount - Bioinformatics


Major Differences between PAM and BLOSUM

PAM                                                            | BLOSUM
Built from global alignments                                   | Built from local alignments
Built from small amount of data                                | Built from vast amount of data
Counting is based on minimum replacement or maximum parsimony  | Counting based on groups of related sequences counted as one
Perform better for finding global alignments and remote homologs | Better for finding local alignments
Higher PAM series means more divergence                        | Lower BLOSUM series means more divergence
NUCLEIC ACID Scoring Matrices
Nucleic Acid Scoring Matrices
•  Two mutation models (models of nucleotide evolution)
–  Uniform mutation rates (Jukes-Cantor)
–  Two separate mutation rates (Kimura): one for transitions and one for transversions
•  Generally, the rate of transitions is thought to be higher than the rate of transversions.
DNA Mutations

PURINES: A, G
PYRIMIDINES: C, T

Transitions: A↔G; C↔T

Transversions: A↔C, A↔T, C↔G, G↔T
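The purine/pyrimidine rule above can be sketched in code. This is a minimal illustration, not the Jukes-Cantor or Kimura model itself: it only mirrors the idea of one shared mismatch score versus separate transition and transversion scores, and the score values and the names `classify` and `scoring_matrix` are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Label a pair of nucleotides as match, transition, or transversion."""
    if x == y:
        return "match"
    # A change within the purines or within the pyrimidines is a transition.
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def scoring_matrix(match, transition, transversion):
    """Build a dict-of-dicts score table over A, C, G, T."""
    scores = {"match": match, "transition": transition,
              "transversion": transversion}
    return {x: {y: scores[classify(x, y)] for y in "ACGT"} for x in "ACGT"}


# Hypothetical score values, chosen only for illustration.
uniform = scoring_matrix(match=1, transition=-1, transversion=-1)
two_rate = scoring_matrix(match=1, transition=-1, transversion=-2)

print(classify("A", "G"), classify("A", "C"))   # transition transversion
print(two_rate["C"]["T"], two_rate["C"]["G"])   # -1 -2
```

Penalizing transversions more heavily than transitions reflects the assumption that transitions occur at a higher rate.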
GAP Penalties
(General and Affine)
Pairwise alignment of retinol-binding protein (RBP) and β-lactoglobulin

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
. ||| | . |. . . | : .||||.:| :
1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin

51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
: | | | | :: | .| . || |: || |.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin

98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP


|| ||. | :.|||| | . .|
94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP


. | | | : || . | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Linear GAPS

The scoring matrices seen so far assume a linear gap penalty, where each gap position is given the same penalty score.

This score is almost always negative, so that an alignment with fewer gaps is favored over an alignment with more gaps.

The overall penalty for one large gap is the same as for many small gaps of the same total length.
Constant GAPS

Constant gap penalties are the simplest type of gap penalty.

The only parameter, d, is added to the alignment score when the gap is first opened.

This means that any gap receives the same penalty, regardless of its size.

By contrast, a penalty that grows with every gap position can be too severe for a series of 100 consecutive indels.
GAP penalties
Large insertions or deletions might result from a single event.

In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events.
GAPS
Over evolutionary time, it is more likely that a contiguous block of residues has been inserted or deleted in a certain region than that many separate single-residue events occurred.

For example, it is more likely to have 1 gap of length k than k gaps of length 1.

Therefore, a better scoring scheme would be

•  an initial higher penalty for opening a gap (i.e., new gaps)
•  a smaller penalty for extending the gap.
Affine Gap Penalty
w_x = g + r(x - 1)

w_x: total gap penalty for a gap of length x
g: gap open penalty
r: gap extend penalty
x: gap length

•  The gap penalty needs to be chosen relative to the score matrix so that:
–  Gaps will not be excluded from the alignment
–  Gaps don't propagate throughout the alignment
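The three gap models discussed so far can be compared side by side. The sketch below treats penalties as positive costs and uses hypothetical values for d, g and r; the function names are illustrative.

```python
def linear_gap(x, d):
    """Every gap position costs d, so the total grows with the gap length x."""
    return d * x


def constant_gap(x, d):
    """A gap costs d once, regardless of its length x."""
    return d if x > 0 else 0


def affine_gap(x, g, r):
    """w_x = g + r(x - 1): g for the first position, r for each further one."""
    return g + r * (x - 1) if x > 0 else 0


# Hypothetical penalty values, chosen only for illustration.
g, r = 10, 1

# Affine: one gap of length 5 is much cheaper than five gaps of length 1.
print(affine_gap(5, g, r), 5 * affine_gap(1, g, r))   # 14 50

# Linear: both cases cost the same.
print(linear_gap(5, g), 5 * linear_gap(1, g))         # 50 50

# Constant: a gap of length 100 costs the same as a gap of length 1.
print(constant_gap(100, g), constant_gap(1, g))       # 10 10
```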
Accounting for GAPS

The extension penalty is kept smaller than the opening penalty because you do not want to add too much of a penalty for further extending the gap once it is opened.
Affine Gap Penalty

Reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges in the alignment grid, that is, to runs of consecutive gap positions.
How to measure significance of sequence alignment?
Significance of Alignment
•  When two sequences of length m and n are not obviously similar but show an alignment, it becomes necessary to assess the significance of the alignment.

•  Determine the probability of the alignment occurring at random
–  Sequence 1: length m
–  Sequence 2: length n
Scores
Raw Score

The score of an alignment, S, is calculated as the sum of substitution and gap scores.

Substitution scores are given by a table (PAM, BLOSUM).

Gap scores are typically calculated from G, the gap opening penalty, and L, the gap extension penalty.

For a gap of length n, the gap cost would be G + Ln.

The choice of gap costs, G and L, is empirical, but it is customary to choose a high value for G and a low value for L.
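A raw score of this kind can be sketched for an alignment given as two gapped strings of equal length. The block below is a minimal illustration: `toy_substitution` is a hypothetical stand-in for a PAM or BLOSUM lookup, the values of G and L are made up, and the alignment is assumed to contain no column with a gap in both rows.

```python
def raw_score(row_a, row_b, substitution, gap_open, gap_extend):
    """Sum substitution scores and subtract G + L*n for each gap of length n."""
    assert len(row_a) == len(row_b)
    score = 0
    prev_gap_in = None  # row holding the gap currently being extended, if any
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_in = "a" if a == "-" else "b"
            if gap_in != prev_gap_in:
                score -= gap_open   # G, charged once when a gap starts
            score -= gap_extend     # L, charged for every gap position
            prev_gap_in = gap_in
        else:
            score += substitution(a, b)
            prev_gap_in = None
    return score


def toy_substitution(a, b):
    """Hypothetical scores: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


# Four matches (+8), one gap of length 1 (5 + 1) and one of length 2 (5 + 2).
s = raw_score("ACGT--A", "A-GTTTA", toy_substitution, gap_open=5, gap_extend=1)
print(s)  # -5
```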
E-Value

Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
Probability of Alignment Score
•  Expected # of alignments with score at least S (E-value):

E = K m n e^(-λS)
–  m, n: lengths of the sequences
–  K, λ: statistical parameters that act as natural scales for the search space size and the scoring system, respectively.
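The E-value formula can be evaluated directly. In the sketch below the values of K and λ are hypothetical placeholders (in practice they depend on the scoring system), and `e_value` is an illustrative function name.

```python
import math


def e_value(score, m, n, k, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


# Hypothetical parameter values, chosen only for illustration.
K, LAMBDA = 0.1, 0.3

for s in (30, 50, 70):
    print(s, e_value(s, m=250, n=1_000_000, k=K, lam=LAMBDA))
```

E falls exponentially as the score S rises and grows in proportion to the search space size m x n, so the same raw score is less significant against a larger database.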
A search from a database

If hits are represented as "positives" and non-hits as "negatives":

a true positive is a hit with a real biological relationship to the query, and a false positive is a hit without such a relationship;

a true negative is a non-hit with no real biological relationship to the query, and a false negative is a non-hit with a real biological relationship to the query.
Sensitivity & Specificity

Sensitivity measures the proportion of the real biological relationships in the database that were detected as hits in the search.

Sn = (Number of true positive hits) / (Number of true positive hits + Number of false negative hits)

Specificity is the proportion of the hits that correspond to real biological relationships. It is the ability to reject false positive matches. The most specific search will return only true matches.

Sp = (Number of true positive hits) / (Number of true positive hits + Number of false positive hits)
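Both measures follow directly from the hit counts. The sketch below uses hypothetical counts for a single search; note that Sp as defined here (true positives over all hits) is the quantity often called precision or positive predictive value in other fields.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): fraction of real relationships that were hit."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp = TP / (TP + FP): fraction of hits that are real relationships."""
    return tp / (tp + fp)


# Hypothetical search outcome: 100 hits, of which 80 are real,
# and 40 real relationships in the database that were missed.
tp, fp, fn = 80, 20, 40

print(round(sensitivity(tp, fn), 3))  # 0.667
print(round(specificity(tp, fp), 3))  # 0.8
```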
