Sequence Alignment: Scoring Matrices

Lecture 09
Sequence alignment
Scoring Matrices
Dynamic Programming Solution
Scoring matrices
Protein scoring matrices – PAM and BLOSUM
Nucleic acid scoring matrices – Jukes-Cantor and Kimura models

PROTEIN Scoring Matrices
Percent Accepted Mutation
(PAM or Dayhoff) Matrices
•  Introduced by Margaret Dayhoff
•  Amino acid substitutions were estimated using:
–  Alignment of common protein sequences
–  1572 amino acid substitutions (changes) in 71 groups
of proteins that are at least 85% similar
•  Accepted mutations are changes that do not negatively affect
a protein's fitness, i.e., changes that occur without affecting the function
•  Similar sequences were organized into phylogenetic trees
•  The number of changes of each amino acid into every other amino acid
was counted
•  Relative mutability was evaluated by counting the number of changes of
each amino acid and dividing by a normalization factor
(this normalizes the data for variations in amino acid
composition, mutation rate, and sequence length)
•  The amino acid exchange counts and mutability values were used to
generate a 20 x 20 mutation probability matrix representing all
possible amino acid changes.
•  Since the changes are assumed to be independent of previous
mutational events, the PAM1 matrix can be multiplied by
itself N times to give the transition matrix for
sequences separated by N accepted mutations per 100 amino acids.
•  PAM1 is 1 accepted mutation per 100 amino acids;
•  PAM10 is 10 accepted mutations per 100 amino acids;
•  PAM250 is 250 accepted mutations per 100 amino acids
(a site may change more than once), and so on.
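The repeated multiplication of PAM1 can be sketched in a few lines. This is only an illustration: the 3-letter matrix below is a made-up stand-in (not Dayhoff's 20 x 20 data), and the names `mat_mul`, `mat_pow` and `TOY_PAM1` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply m by itself `power` times, starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical 3-letter "PAM1-like" matrix (assumption, NOT real PAM data).
# Row i, column j: probability that letter i is replaced by letter j in one
# step; each row sums to 1 and each letter changes with probability <= 1%.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

After many steps the diagonal entries shrink well below their PAM1 values, which is why higher PAM numbers suit more divergent sequences.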
PAM250 matrix
Source – David Mount - Bioinformatics

BLOSUM
BLOck SUbstitution Matrices
Introduced by Henikoff and Henikoff, 1992.
Derived from a database of local alignments of relatively distantly
related protein sequences.
They looked at a large set of approximately 2000 amino acid
patterns organized into blocks, which are conserved regions within
protein families as identified by the protein database PROSITE.
The blocks that were studied were also signatures of a protein
family, indicating that members of the family could be found by
searching for these blocks.
BLOSUM
To deal with overrepresentation of amino acid
substitutions occurring in the most closely related
members of a family, similar sequences within a
block are grouped and counted as one sequence.
Sequences that were at least 60% identical
were grouped together to form the BLOSUM60 matrix;
sequences at least 80% identical were grouped together to form
the BLOSUM80 matrix, etc.
BLOSUM45 matrix (lower triangle)
G   7
P  -2   9
D  -1  -1   7
E  -2   0   2   6
N   0  -2   2   0   6
H  -2  -2   0   0   1  10
Q  -2  -1   0   2   0   1   6
K  -2  -1   0   1   0  -1   1   5
R  -2  -2  -1   0   0   0   1   3   7
S   0  -1   0   0   1  -1   0  -1  -1   4
T  -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A   0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M  -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V  -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I  -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L  -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F  -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y  -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W  -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C  -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
    G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C
Source – David Mount - Bioinformatics
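A lower-triangular matrix like the one above is symmetric, so a lookup only needs to order the two residues. The sketch below copies just the G, P, D and E entries from the BLOSUM45 table on this slide; the function names `score` and `ungapped_score` are hypothetical, and a real program would load the full 20 x 20 table.

```python
# Residue order and lower-triangle values copied from the slide's
# BLOSUM45 table, restricted to four residues to keep the example short.
ORDER = "GPDE"
LOWER = [
    [7],             # G
    [-2, 9],         # P
    [-1, -1, 7],     # D
    [-2, 0, 2, 6],   # E
]


def score(a, b):
    """Substitution score for residues a and b (symmetric lookup)."""
    i, j = ORDER.index(a), ORDER.index(b)  # ValueError for other residues
    if i < j:
        i, j = j, i  # the table only stores row >= column
    return LOWER[i][j]


def ungapped_score(seq1, seq2):
    """Sum of substitution scores over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_score("GPDE", "GPED"))  # 7 + 9 + 2 + 2 = 20
```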

Major Differences between PAM and BLOSUM
PAM | BLOSUM
Built from global alignments | Built from local alignments
Built from a small amount of data | Built from a vast amount of data
Counting is based on minimum replacement or maximum parsimony | Counting is based on groups of related sequences counted as one
Performs better for finding global alignments and remote homologs | Better for finding local alignments
Higher PAM series means more divergence | Lower BLOSUM series means more divergence
NUCLEIC ACID Scoring Matrices
Nucleic Acid Scoring Matrices
•  Two mutation models (models of nucleotide evolution)
–  Uniform mutation rates (Jukes-Cantor)
–  Two separate mutation rates (Kimura)
•  Transitions
•  Transversions
Generally, the rate of transitions is thought to be
higher than the rate of transversions.
DNA Mutations
PURINES: A, G
PYRIMIDINES: C, T
Transitions (purine↔purine or pyrimidine↔pyrimidine): A↔G; C↔T
Transversions (purine↔pyrimidine): A↔C, A↔T, C↔G, G↔T
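The purine/pyrimidine rule can be turned into a small nucleotide scoring table. This is a minimal sketch: the score values (1, -1, -2) are arbitrary assumptions chosen for illustration, not values from the Jukes-Cantor or Kimura papers, and `classify` and `make_matrix` are hypothetical names.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Label a nucleotide pair as match, transition or transversion."""
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError("unknown nucleotide: %r" % base)
    if a == b:
        return "match"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def make_matrix(match, transition, transversion):
    """Build a dict mapping (a, b) to a score for all 16 nucleotide pairs."""
    scores = {"match": match, "transition": transition,
              "transversion": transversion}
    return {(a, b): scores[classify(a, b)] for a in "ACGT" for b in "ACGT"}


# One rate for every substitution (Jukes-Cantor style), illustrative scores.
uniform = make_matrix(1, -1, -1)
# Transversions penalized more than transitions (Kimura style).
two_rate = make_matrix(1, -1, -2)

print(two_rate[("A", "G")], two_rate[("A", "C")])  # -1 -2
```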
GAP Penalties
(General and Affine)
Pairwise alignment of retinol-binding protein (RBP)
and β-lactoglobulin
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Linear GAPS
The scoring schemes seen so far assume a linear gap
penalty, where each gap position is given the same penalty score.
This is almost always negative, so that the alignment with
fewer gaps is favored over the alignment with more gaps.
The overall penalty for one large gap is the same as for
many small gaps of the same total length.
Constant GAPS
Constant gap penalties are the simplest type of gap
penalty.
The only parameter, d, is added to the alignment score
when the gap is first opened.
This means that any gap receives the same penalty,
regardless of its size.
By contrast, a linear penalty can be too severe for a series
of 100 consecutive indels.
GAP penalties
Large insertions or deletions might result from a single event.
In nature, a series of k indels often comes as a single event rather
than as a series of k single-nucleotide events.
GAPS
However, over evolutionary time, it is more likely that a
contiguous block of residues has been inserted or deleted in a
certain region.
For example, it is more likely to have 1 gap of length k than k
gaps of length 1.
Therefore, a better scoring scheme would be
•  an initial higher penalty for opening a gap (i.e., for each new gap)
•  a smaller penalty for extending the gap.
Affine Gap Penalty
w_x = g + r(x - 1)
w_x: total gap penalty
g: gap open penalty
r: gap extend penalty
x: gap length
•  The gap penalty needs to be chosen relative to the
score matrix so that:
–  Gaps will not be excluded from the alignment
–  Gaps don't propagate throughout the alignment
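The linear, constant and affine models can be compared side by side. This is a minimal sketch in which penalties are written as positive magnitudes to be subtracted from the alignment score; the parameter values (d = 2, d = 5, g = 11, r = 1) are arbitrary assumptions for illustration, and the function names are hypothetical.

```python
def linear_gap(x, d):
    """Linear model: every gap position costs d."""
    return d * x


def constant_gap(x, d):
    """Constant model: any gap costs d, regardless of its length."""
    return d if x > 0 else 0


def affine_gap(x, g, r):
    """Affine model from the slide: w_x = g + r(x - 1)."""
    return g + r * (x - 1) if x > 0 else 0


# One gap of length 10 versus ten separate gaps of length 1.
print(linear_gap(10, d=2), 10 * linear_gap(1, d=2))          # 20 20
print(constant_gap(10, d=5), 10 * constant_gap(1, d=5))      # 5 50
print(affine_gap(10, g=11, r=1), 10 * affine_gap(1, g=11, r=1))  # 20 110
```

Only the constant and affine models prefer the single long gap; the linear model cannot tell the two cases apart.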
Accounting for GAPS
The gap extension penalty is set lower than the gap opening penalty
because you do not want to add too much of a penalty for further
extending the gap once it is opened.
Affine Gap Penalty
Reduced penalties (as compared to naïve scoring) are given to
runs of horizontal and vertical edges in the alignment grid,
which correspond to gaps.
How to measure significance of
sequence alignment?
Significance of Alignment
•  When two sequences of length m and n are
not obviously similar but show an alignment,
it becomes necessary to assess the
significance of the alignment.
•  Determine the probability of the alignment occurring
at random
–  Sequence 1: length m
–  Sequence 2: length n
Scores
Raw Score
The score of an alignment, S, is calculated as the sum of
substitution and gap scores.
Substitution scores are given by a table (PAM, BLOSUM).
Gap scores are typically calculated from G, the gap
opening penalty, and L, the gap extension penalty.
For a gap of length n, the gap cost would be G + Ln.
(Under this convention the first gap position costs G + L,
whereas the affine formula w_x = g + r(x - 1) charges g for it;
both conventions are in use.)
The choice of gap costs G and L is empirical, but it is
customary to choose a high value for G and a low value for
L.
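A raw score can be computed directly from two aligned rows. The sketch below charges G + Ln for each gap of length n, as described above; the match/mismatch scores in `toy_subst` are an assumed stand-in for a PAM or BLOSUM lookup, and the values G = 10 and L = 1 are illustrative only.

```python
def toy_subst(a, b):
    """Hypothetical substitution score: +5 for a match, -4 otherwise."""
    return 5 if a == b else -4


def raw_score(row1, row2, subst, gap_open, gap_extend):
    """Sum substitution scores and subtract G + L*n for each gap run."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    gap_side = None  # which row the current gap run is in, if any
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("column with two gap characters")
        if a == "-" or b == "-":
            side = 1 if a == "-" else 2
            if side != gap_side:
                total -= gap_open  # a new gap is opened: pay G once
                gap_side = side
            total -= gap_extend    # every gap position pays L
        else:
            total += subst(a, b)
            gap_side = None
    return total


# Four matches (4 * 5 = 20) and one gap of length 2 (10 + 2 * 1 = 12).
print(raw_score("ACGTTA", "AC--TA", toy_subst, gap_open=10, gap_extend=1))  # 8
```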
E-Value
Expectation value. The number of different
alignments with scores equivalent to or better than S
that are expected to occur in a database search by
chance. The lower the E value, the more significant
the score.
Probability of Alignment Score
•  Expected number of alignments with score at least
S (E-value):
E = K m n e^(-λS)
–  m, n: lengths of the sequences
–  K, λ: statistical parameters that act as natural scales
for the search space size and the scoring
system, respectively.
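The E-value formula is easy to evaluate numerically. In this sketch the values of K and λ are placeholders chosen for illustration (in practice they depend on the scoring matrix and gap costs), and `e_value` is a hypothetical function name.

```python
import math


def e_value(score, m, n, k, lam):
    """Expected number of chance alignments scoring at least `score`:
    E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


# Assumed parameters, for illustration only.
K = 0.1
LAMBDA = 0.3

for s in (20, 40, 60):
    print(s, e_value(s, m=250, n=300, k=K, lam=LAMBDA))
```

E falls exponentially as the score S increases and grows in proportion to the search space m x n, so the same raw score is less significant when longer sequences are compared.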
A search from a database
If hits are represented as "positives" and non-hits as
"negatives":
a true positive is a hit with a real biological relationship to the query,
and a false positive is a hit without such a relationship;
a true negative is a non-hit with no real biological
relationship to the query, and a false negative is a non-hit
with a real biological relationship to the query.
Sensitivity & Specificity
Sensitivity measures the proportion of the real biological
relationships in the database that were detected as hits in
the search.
Sn = (Number of true positive hits) /
(Number of true positive hits + Number of false negative hits)
Specificity, as defined here, is the proportion of the hits that
correspond to real biological relationships (elsewhere this quantity
is often called precision or positive predictive value). It is the
ability to reject false positive matches. The most specific search
will return only true matches.
Sp = (Number of true positive hits) /
(Number of true positive hits + Number of false positive hits)
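Both ratios can be computed from the hit counts of a search. The counts used below are invented for illustration, and the function names are hypothetical; the formulas follow the definitions of Sn and Sp given on this slide.

```python
def sensitivity(true_pos, false_neg):
    """Sn = TP / (TP + FN): share of real relationships that were found."""
    if true_pos + false_neg == 0:
        raise ValueError("no real relationships in the database")
    return true_pos / (true_pos + false_neg)


def specificity(true_pos, false_pos):
    """Sp = TP / (TP + FP): share of hits that are real (slide's definition)."""
    if true_pos + false_pos == 0:
        raise ValueError("the search returned no hits")
    return true_pos / (true_pos + false_pos)


# Made-up search result: 40 true hits, 10 false hits, 20 missed relatives.
print(sensitivity(true_pos=40, false_neg=20))
print(specificity(true_pos=40, false_pos=10))
```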
