
Amino acid substitution matrices

• Amino acids have different biochemical and physical
properties that influence their relative replaceability in
evolution
• Scoring matrices reflect:
– probabilities of mutual substitutions
– the probability of occurrence of each amino acid
• Widely used scoring matrices:
– PAM
– BLOSUM
Amino acid substitution matrices
• Certain amino acid substitutions commonly occur in
related proteins from different species.
• Because a protein still functions with these
substitutions, the substituted amino acids are
compatible with structure and function.
• Knowing types of changes that are most and least
common in a large number of proteins can assist with
predicting alignments for any set of protein sequences.
• If ancestor relationships among a group of proteins
are assessed, the most likely amino acid changes that
occurred during evolution can be predicted.
Point Accepted Mutation (PAM) Matrices
[Dayhoff substitution matrices]

• The first systematic method for deriving amino acid substitution
matrices was developed by Dayhoff et al. (1978), Atlas of Protein
Sequence and Structure. These widely used substitution matrices are
frequently called Dayhoff, MDM (Mutation Data Matrix), or PAM (Point
Accepted Mutation, also expanded as Percent Accepted Mutation) matrices.
• PAM approach: estimate the probability that b was substituted for a
in a given measure of evolutionary distance.
• KEY IDEA: trusted alignments of closely related sequences
provide information about biologically permissible mutations.
Point Accepted Mutation (PAM) Matrices
[Dayhoff substitution matrices]

• This family of matrices lists the likelihood of change from one
amino acid to another in homologous protein sequences
during evolution.
• Each matrix gives the changes expected for a given period of
evolutionary time, evidenced by decreased sequence similarity
as genes that encoded the same protein diverge with
increased evolutionary time.
• This leads to two possibilities:
– One matrix gives the changes expected in homologous
proteins that have diverged only a small amount from each
other in a relatively short period of time (about 50% similar)
– Another matrix gives the changes expected in proteins that have
diverged over a much longer period, leaving only 20%
similarity.
…How PAM matrix is derived
• In deriving the PAM matrices, each change in the
current amino acid at a particular site is assumed to
be independent of previous mutational events at that
site
• Thus, the probability of change of any amino acid ‘a’
to amino acid ‘b’ is the same, regardless of the
position of amino acid ‘a’ in a sequence.
– Based on Markov model (simple) which is
characterized by a series of changes of state in a
system such that a change from one state to
another does not depend on the previous history of
the state.
How PAM matrix is derived.. AA index
• To prepare the Dayhoff PAM matrices (Dayhoff 1978),
amino acid substitutions that occurred in a group of
evolving proteins were estimated using 1572 changes
in 71 groups of protein sequences that were at least
85% similar.
• Because these changes were observed in closely
related proteins (>85% similar), they represented
amino acid substitutions that do not significantly
change the function of protein
• Hence they are called “accepted mutations”, defined
as amino acid changes accepted by natural selection
…How PAM matrix is derived
• To develop a single-letter code for the amino acids, Dr.
Dayhoff attempted to make the code as easy to remember as
possible. Of course, if the name of each amino acid began
with a different letter, the code would be simple indeed. For 6
of the amino acids, the first letter of the name is unique,
making the code simple.
• Cysteine Cys C (First letter)
• For the other amino acids, the first letter of the name is not
unique to a single amino acid, so Dr. Dayhoff assigned the
letters A, G, L, P and T to the amino acids Alanine, Glycine,
Leucine, Proline and Threonine, respectively, which occur
more frequently in proteins than do the other amino acids
having the same first letters.
…How PAM matrix is derived
• Some of the other amino acids are phonetically suggestive.
Arginine R aRginine
• For the remaining 5 amino acids, Dr. Dayhoff was reaching
somewhat to find an easy-to-remember connection between the
single letter and the amino acid. She assigned aspartic acid,
asparagine, glutamic acid and glutamine the letters D, N, E and Q,
respectively, noting that D and N are nearer the beginning of the
alphabet than E and Q, and that Asp is smaller than Glu, while Asn is
smaller than Gln.
• By the time Dr. Dayhoff got to lysine, there were not too many letters
left, so she used the letter K, explaining that K is at least near L in
the alphabet.
…How PAM matrix is derived
First step: Pair Exchange Frequencies
• A PAM (Percent Accepted Mutation) unit is one
accepted point mutation on the path between two
sequences, per 100 residues.

• In order to identify accepted point mutations, a complete
phylogenetic tree including all ancestral sequences has to
be constructed.
• To avoid a large degree of ambiguity in this step, Dayhoff
and colleagues restricted their analysis to sequence families
with more than 85% identity.
First step: Pair Exchange Frequencies
• For each of the observed and inferred sequences, the
amino acid pair exchanges are tabulated into a 20x20
matrix. It is assumed that the likelihood of an amino acid X
being replaced by an amino acid Y is the same as that of Y
replacing X. Hence the matrix is constructed symmetrically.
• Aij is the number of accepted mutations observed where
amino acid i replaces amino acid j.
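The tabulation step can be sketched in Python. This is only an illustration under simplifying assumptions, not Dayhoff's actual procedure: the ancestor/descendant pairs are assumed to be already available as equal-length aligned strings (the hypothetical `aligned_pairs` list), and every exchange is counted in both directions so that the matrix is symmetric.

```python
from collections import defaultdict

def count_exchanges(aligned_pairs):
    """Tabulate symmetric pair-exchange counts A[i][j] from aligned sequence pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq_x, seq_y in aligned_pairs:
        if len(seq_x) != len(seq_y):
            raise ValueError("aligned sequences must have equal length")
        for x, y in zip(seq_x, seq_y):
            if x != y:  # only positions that differ are exchanges
                counts[x][y] += 1
                counts[y][x] += 1  # X replacing Y is assumed as likely as Y replacing X
    return counts

# Hypothetical toy data: one inferred ancestor compared with two descendants.
aligned_pairs = [("ADA", "ADB"), ("ADA", "AEA")]
A = count_exchanges(aligned_pairs)
print(A["A"]["B"], A["B"]["A"], A["D"]["E"])
```

With real data the pairs would come from neighbouring nodes of the reconstructed phylogenetic tree rather than from a hand-written list.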
Second step: Frequencies of Occurrence
• If the properties of amino acids differ and if they occur with
different frequencies, all statements we can make about the
average properties of sequences will depend on the
frequencies of occurrence of the individual amino acids.
• These frequencies of occurrence are approximated by the
frequencies of observation.
• They are the number of occurrences of a given amino acid
divided by the total number of amino acids observed.
Third step: Relative Mutabilities
• Relative mutabilities are evaluated by counting, in each
group of related sequences, the number of changes of
each amino acid and dividing this number by a
normalizing factor.
• This factor is the product of the frequency of occurrence of
the amino acid in the group of sequences being analyzed
and the number of all amino acid changes that occurred in
that group per 100 sites.
Third step: Relative Mutabilities

Aligned sequences:   A D A
                     A D B

Amino acid                                        A      B      D
Observed changes                                  1      1      0
Frequency of occurrence (in total composition)    3      1      2
Relative mutability                               0.33   1      0
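The toy calculation above can be reproduced with a few lines of Python. This is a minimal sketch for a single aligned pair: it divides the number of changes by the raw occurrence count, as in the toy table, and leaves out the additional scaling by total changes per 100 sites used in the full method. The function name `relative_mutability` is chosen here for illustration.

```python
from collections import Counter

def relative_mutability(seq_x, seq_y):
    """Changes per occurrence for each amino acid in one aligned pair."""
    occurrences = Counter(seq_x + seq_y)  # total composition of both sequences
    changes = Counter()
    for x, y in zip(seq_x, seq_y):
        if x != y:
            changes[x] += 1  # a change is counted for both residues involved
            changes[y] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in sorted(occurrences)}

mutability = relative_mutability("ADA", "ADB")
print({aa: round(m, 2) for aa, m in mutability.items()})
# prints {'A': 0.33, 'B': 1.0, 'D': 0.0}
```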


Amino acid frequencies (Frequency of Occurrence):

      1978    1991
L     0.085   0.091
A     0.087   0.077
G     0.089   0.074
S     0.070   0.069
V     0.065   0.066
E     0.050   0.062
T     0.058   0.059
K     0.081   0.059
I     0.037   0.053
D     0.047   0.052
R     0.041   0.051
P     0.051   0.051
N     0.040   0.043
Q     0.038   0.041
F     0.040   0.040
Y     0.030   0.032
M     0.015   0.024
H     0.034   0.023
C     0.033   0.020
W     0.010   0.014

The frequencies in the middle column are taken from Dayhoff
(1978); the frequencies in the right column are taken from the
1991 recompilation of the mutation matrices, representing a
database of observations that is approximately 40 times larger
than that available to Dayhoff.
Third step: Relative Mutabilities
• To obtain a complete picture of the mutational process,
the amino-acids that do not mutate are also taken into
account i.e., what is the chance, on average, that a given
amino acid will mutate at all.

• Based on the relative mutability scores of the amino acids,
Asn, Ser, Asp and Glu were observed to be the most mutable
amino acids, and Cys and Trp the least mutable.
Example: Phe - Tyr
• Of 1572 observed amino acid changes, there were 260
changes between Phe and Tyr
• These numbers were multiplied by (a) the mutability of Phe
and (b) the fraction of Phe to Tyr changes over all
changes of Phe to another amino acid, to obtain the
mutation probability score of Phe to Tyr
• A similar score was obtained for changes of Tyr to Phe
Example: Phe - Tyr
• The resulting scores were summed and divided by a
normalizing factor such that their sum represents a
probability of change of 1% (PAM1), extrapolated to 250% (PAM250)
• Score for changing Phe to Tyr was 0.15
• Frequency of Phe occurrence in sequence data was 0.04
• Score for changing Tyr to Phe was 0.20
• Frequency of Tyr occurrence in sequence data was 0.03
• These changes can include both forward and reverse changes, i.e.,
Phe → Tyr as well as Tyr → Phe
Example: Phe - Tyr
• The odds ratio for Phe to Tyr would be
• 0.15/0.04 = 3.75
• Converting to a logarithm to the base 10 (log10 3.75 = 0.57)
• And multiplying by 10 to remove fractional values gives
5.7
• The odds ratio for Tyr to Phe would be
• 0.20/0.03 = 6.7; the log of this number is 0.83, which
multiplied by 10 gives 8.3
• The average of 5.7 and 8.3 is 7, the log-odds score for the Phe/Tyr pair
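The arithmetic of this example can be checked with a short Python sketch. The function name `log_odds_score` and its parameter names are illustrative choices; the four input numbers are the ones quoted in the example. Unrounded arithmetic gives 8.2 rather than 8.3 for Tyr to Phe, because the slide first rounds 6.67 to 6.7; the averaged score still rounds to 7.

```python
import math

def log_odds_score(change_score, residue_freq):
    """10 * log10 of the ratio between a change score and a residue frequency."""
    return 10 * math.log10(change_score / residue_freq)

phe_to_tyr = log_odds_score(0.15, 0.04)  # 0.15 / 0.04 = 3.75
tyr_to_phe = log_odds_score(0.20, 0.03)  # 0.20 / 0.03 = 6.67
symmetric_score = round((phe_to_tyr + tyr_to_phe) / 2)
print(round(phe_to_tyr, 1), round(tyr_to_phe, 1), symmetric_score)
```

Averaging the two directions is what makes the final scoring matrix symmetric.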
Formulation of PAM matrix

• The amino acid exchange counts and mutability values
were then used to generate a 20 x 20 mutation probability
matrix representing all possible amino acid changes
• Amino acids are grouped according to the chemistry of the side
group:
• C – Sulfhydryl
• STPAG – Small hydrophilic
• NDEQ – Acid, acid amide and hydrophilic
• HRK – Basic
• MILV – Small hydrophobic
• FYW – Aromatic
• Sign of the log-odds score:
• + Ancestor probability is greater
• 0 Probability of ancestry and of chance is the same
• - Alignment more likely by chance than ancestry
• The type of question that can be answered
is:
• “Suppose I start with a given polypeptide sequence
M at time t, and observe the evolutionary changes in
the sequence until 1% of all amino acid residues
have undergone substitutions at time t+n. Let the
new sequence at time t+n be called M’. What is the
probability that a residue of type j in M will be
replaced by i in M’?”
Constructing BLOSUM Matrices

Blocks Substitution Matrices


BLOSUM matrices
• Blocks Substitution Matrix. Scores for
each position are obtained from frequencies of
substitutions in blocks of local alignments
of protein sequences [Henikoff & Henikoff
1992].
• For example BLOSUM62 is derived from
sequence alignments with no more than
62% identity.
BLOSUM Scoring Matrices

• BLOck SUbstitution Matrix


• Based on comparisons of blocks of sequences derived
from the Blocks database
• The Blocks database contains multiply aligned ungapped
segments corresponding to the most highly conserved
regions of proteins (local alignment versus global
alignment)
• BLOSUM matrices are derived from blocks whose sequences are
clustered at the percent identity given by the BLOSUM matrix
number (e.g., 62% for BLOSUM62)
Conserved blocks in alignments

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
Collecting substitution statistics
1. Count amino acid pairs in each column;
e.g., for a column containing A, A, B, A, C, A:
– 6 AA pairs, 4 AB pairs, 4 AC, 1 BC, 0 BB, 0 CC.
– Total = 6+4+4+1 = 15
2. Normalize results to obtain probabilities
(pX’s and qXY’s)
3. Compute log-odds score matrix from
probabilities:
s(X,Y) = log (qXY / (pX pY))
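The counting in step 1 can be sketched in Python. This is a minimal illustration for one column only; the function name `count_column_pairs` is a hypothetical helper, and the column used (four A, one B, one C) is the one that yields the counts listed in the example.

```python
from collections import Counter
from itertools import combinations

def count_column_pairs(column):
    """Count unordered residue pairs among all sequences in one block column."""
    pairs = Counter()
    for x, y in combinations(column, 2):
        pairs["".join(sorted((x, y)))] += 1  # "BA" and "AB" are the same pair
    return pairs

column = "AABACA"  # four A, one B, one C
pair_counts = count_column_pairs(column)
total = sum(pair_counts.values())
print(dict(pair_counts), total)
```

A column of n sequences always contributes n(n-1)/2 pairs, here 6 x 5 / 2 = 15. Summing such counters over every column of every block gives the observed pair frequencies qXY after division by the grand total.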
Estimation of a BLOSUM matrix
• The BLOCKS database contains local multiple gap-free
alignments of proteins.
• All pairs of amino acids in each column of each BLOCK are
compared, and the observed pair frequencies are noted
(e.g., A aligned with A makes up 1.5% of all pairs; A aligned
with C makes up 0.01% of all pairs, etc.)
• Expected pair frequencies are computed from single amino
acid frequencies (e.g., fA,C = fA x fC = 7% x 3% = 0.21%).
• For each amino acid pair the substitution scores are
essentially computed as:

SA,C = log10 (Pair-freq(obs) / Pair-freq(expected)) = log10 (0.01% / 0.21%) = -1.3

Example block:

ID FIBRONECTIN_2; BLOCK
COG9_CANFA GNSAGEPCVFPFIFLGKQYSTCTREGRGDGHLWCATT
COG9_RABIT GNADGAPCHFPFTFEGRSYTACTTDGRSDGMAWCSTT
FA12_HUMAN LTVTGEPCHFPFQYHRQLYHKCTHKGRPGPQPWCATT
HGFA_HUMAN LTEDGRPCRFPFRYGGRMLHACTSEGSAHRKWCATTH
MANR_HUMAN GNANGATCAFPFKFENKWYADCTSAGRSDGWLWCGTT
MPRI_MOUSE ETDDGEPCVFPFIYKGKSYDECVLEGRAKLWCSKTAN
PB1_PIG AITSDDKCVFPFIYKGNLYFDCTLHDSTYYWCSVTTY
SFP1_BOVIN ELPEDEECVFPFVYRNRKHFDCTVHGSLFPWCSLDAD
SFP3_BOVIN AETKDNKCVFPFIYGNKKYFDCTLHGSLFLWCSLDAD
SFP4_BOVIN AVFEGPACAFPFTYKGKKYYMCTRKNSVLLWCSLDTE
SP1_HORSE AATDYAKCAFPFVYRGQTYDRCTTDGSLFRISWCSVT
COG2_CHICK GNSEGAPCVFPFIFLGNKYDSCTSAGRNDGKLWCAST
COG2_HUMAN GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATT
COG2_MOUSE GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT
COG2_RABIT GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATS
COG2_RAT GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT
COG9_BOVIN GNADGKPCVFPFTFQGRTYSACTSDGRSDGYRWCATT
COG9_HUMAN GNADGKPCQFPFIFQGQSYSACTTDGRSDGYRWCATT
COG9_MOUSE GNGEGKPCVFPFIFEGRSYSACTTKGRSDGYRWCATT
COG9_RAT GNGDGKPCVFPFIFEGHSYSACTTKGRSDGYRWCATT
FINC_BOVIN GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT
FINC_HUMAN GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT
FINC_RAT GNSNGALCHFPFLYSNRNYSDCTSEGRRDNMKWCGTT
MPRI_BOVIN ETEDGEPCVFPFVFNGKSYEECVVESRARLWCATTAN
MPRI_HUMAN ETDDGVPCVFPFIFNGKSYEECIIESRAKLWCSTTAD
PA2R_BOVIN GNAHGTPCMFPFQYNQQWHHECTREGREDNLLWCATT
PA2R_RABIT GNAHGTPCMFPFQYNHQWHHECTREGRQDDSLWCATT
Constructing a BLOSUM matrix
1. Counting mutations
2. Tallying mutation frequencies
3. Matrix of mutation probabilities
4. Calculate abundance of each
residue (marginal probability)
5. Obtaining a BLOSUM matrix
Constructing BLOSUM r
• To avoid bias in favor of a certain protein, first eliminate
sequences that are more than r% identical
• The elimination is done by either
– removing sequences from the block, or
– finding a cluster of similar sequences and replacing it by a new
sequence that represents the cluster.
• BLOSUM r is the matrix built from blocks with no more than r%
identity
– E.g., BLOSUM62 is the matrix built using sequences with no more than
62% identity.
– Note: BLOSUM 62 is the default matrix for protein BLAST
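The clustering option can be sketched in Python as single-linkage grouping of block rows by percent identity. This is only an illustration under stated assumptions: the rows are ungapped and of equal length, the helper names `percent_identity` and `cluster_block` are hypothetical, and the way Henikoff & Henikoff weight each cluster when counting pairs is not reproduced here.

```python
def percent_identity(seq_x, seq_y):
    """Percentage of aligned positions with identical residues (ungapped rows)."""
    matches = sum(x == y for x, y in zip(seq_x, seq_y))
    return 100.0 * matches / len(seq_x)

def cluster_block(sequences, r):
    """Group rows so that any two rows more than r% identical share a cluster."""
    clusters = []
    for seq in sequences:
        linked, unlinked = [], []
        for cluster in clusters:
            if any(percent_identity(seq, member) > r for member in cluster):
                linked.append(cluster)
            else:
                unlinked.append(cluster)
        # The new row joins, and thereby merges, every cluster it is linked to.
        merged = [seq] + [member for cluster in linked for member in cluster]
        clusters = unlinked + [merged]
    return clusters

# Hypothetical toy block: the first two rows are 90% identical, the third is distant.
block = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
clusters = cluster_block(block, r=62)
print(len(clusters), sorted(len(c) for c in clusters))
```

Each resulting cluster would then be treated as a single sequence when pairs are counted, so that many near-identical copies of one protein do not dominate the statistics.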
Obtaining BLOSUM62 Matrix

Sij = 2 · log2 ( pij / (pi pj) )
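The scaling in this formula (a base-2 logarithm multiplied by 2, giving scores in half-bit units) can be sketched in Python. The function name `half_bit_score` is an illustrative choice, rounding to the nearest integer is assumed because published matrices hold integer scores, and the input values are illustrative numbers rather than real BLOSUM62 counts.

```python
import math

def half_bit_score(p_ij, p_i, p_j):
    """Sij = 2 * log2(pij / (pi * pj)), rounded to the nearest integer."""
    return round(2 * math.log2(p_ij / (p_i * p_j)))

# A pair seen less often than expected by chance gets a negative score:
# observed 0.01% against expected 7% x 3% = 0.21%.
rare_pair = half_bit_score(0.0001, 0.07, 0.03)
# A pair seen more often than expected by chance gets a positive score:
# observed 2% against expected 7% x 7% = 0.49%.
common_pair = half_bit_score(0.02, 0.07, 0.07)
print(rare_pair, common_pair)
```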
PAM & BLOSUM
The PAM family

• PAM matrices are based on global alignments of closely
related proteins.
• The PAM1 is the matrix calculated from comparisons of
sequences with no more than 1% divergence.
• Other PAM matrices are extrapolated from PAM1.
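Under the Markov assumption, extrapolation amounts to applying the one-step mutation probability matrix repeatedly, so PAM250 corresponds to the PAM1 matrix multiplied by itself 250 times. The sketch below illustrates this with a hypothetical three-letter alphabet; the matrix values are invented for illustration and the helper names `mat_mul` and `extrapolate` are not from any library.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def extrapolate(m1, n_steps):
    """Apply the one-step matrix n_steps times (Markov assumption)."""
    n = len(m1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n_steps):
        result = mat_mul(m1, result)
    return result

# Hypothetical values: entry [i][j] is the probability that residue j is
# replaced by residue i over one unit. Columns sum to 1 and 99% of residues
# stay unchanged, mimicking PAM1.
M1 = [[0.990, 0.004, 0.006],
      [0.006, 0.990, 0.004],
      [0.004, 0.006, 0.990]]

M250 = extrapolate(M1, 250)
column_sums = [sum(M250[i][j] for i in range(3)) for j in range(3)]
print([round(s, 6) for s in column_sums], round(M250[0][0], 3))
```

Each column still sums to 1 after extrapolation, while the probability of a residue remaining unchanged drops well below the one-step value of 0.99.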
PAM & BLOSUM
The BLOSUM family

• BLOSUM matrices are based on local alignments.
• BLOSUM 62 is a matrix calculated from comparisons of
sequences with no more than 62% identity.
• All BLOSUM matrices are based on observed alignments; they
are not extrapolated from comparisons of closely related
proteins.
• BLOSUM 62 is the default matrix in BLAST 2.0. Though it is
tailored for comparisons of moderately distant proteins, it
performs well in detecting closer relationships. A search for
distant relatives may be more sensitive with a different matrix.
PAM & BLOSUM

[Figure: rat versus mouse protein; rat versus bacterial protein]

BLOSUM matrices with higher numbers and PAM matrices with low
numbers are both designed for comparisons of closely related
sequences.
BLOSUM matrices with low numbers and PAM matrices with high
numbers are designed for comparisons of distantly related proteins.
If distant relatives of the query sequence are specifically being sought,
the matrix can be tailored to that type of search.