Amino acid substitution matrices

• Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution
• Scoring matrices reflect:
– probabilities of mutual substitutions
– the probability of occurrence of each amino acid
• Widely used scoring matrices:
– PAM
– BLOSUM

Amino acid substitution matrices
• Certain amino acid substitutions commonly occur in related proteins from different species.
• Because a protein still functions with these substitutions, the substituted amino acids are compatible with protein structure and function.

• Knowing the types of changes that are most and least common in a large number of proteins can assist with predicting alignments for any set of protein sequences.
• If ancestor relationships among a group of proteins are assessed, the most likely amino acid changes that occurred during evolution can be predicted.

Point Accepted Mutation (PAM) Matrices [Dayhoff substitution matrices]

• The first systematic method to derive amino acid substitution matrices was developed by Dayhoff et al. (1978), Atlas of Protein Sequence and Structure. These widely used substitution matrices are frequently called Dayhoff, MDM (Mutation Data Matrix), or PAM (Percent Accepted Mutation) matrices.
• PAM approach: estimate the probability that amino acid b was substituted for amino acid a over a given measure of evolutionary distance.
• KEY IDEA: trusted alignments of closely related sequences provide information about biologically permissible mutations.

Point Accepted Mutation (PAM) Matrices [Dayhoff substitution matrices]

• This family of matrices lists the likelihood of change from one amino acid to another in homologous protein sequences during evolution.
• Each matrix gives the changes expected for a given period of evolutionary time, evidenced by decreasing sequence similarity as genes encoding the same protein diverge over time.
• This leads to two possibilities:
– One matrix gives the changes expected in homologous proteins that have diverged only a small amount over a relatively short period of time (about 50% similar)
– Another matrix gives the changes expected in proteins that have diverged over a much longer period, leaving only 20% similarity.

…How PAM matrix is derived
• In deriving the PAM matrices, each change in the current amino acid at a particular site is assumed to be independent of previous mutational events at that site
• Thus, the probability of change of any amino acid ‘a’ to amino acid ‘b’ is the same, regardless of the position of amino acid ‘a’ in a sequence.
– Based on a (simple) Markov model, which is characterized by a series of changes of state in a system such that a change from one state to another does not depend on the previous history of the state.

How PAM matrix is derived.. AA index
• To prepare the Dayhoff PAM matrices (Dayhoff 1978), amino acid substitutions that occurred in a group of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar.
• Because these changes were observed in closely related proteins (>85% similar), they represent amino acid substitutions that do not significantly change the function of the protein.
• Hence they are called “accepted mutations”, defined as amino acid changes accepted by natural selection.

…How PAM matrix is derived
• To develop a single-letter code for the amino acids, Dr. Dayhoff attempted to make the code as easy to remember as possible. If the name of each amino acid began with a different letter, the code would be simple indeed. For 6 of the amino acids, the first letter of the name is unique, making the code simple.
• Example: Cysteine, Cys, C (first letter)
• For the other amino acids, the first letter of the name is not unique to a single amino acid, so Dr. Dayhoff assigned the letters A, G, L, P and T to the amino acids Alanine, Glycine, Leucine, Proline and Threonine, respectively, which occur more frequently in proteins than do the other amino acids having the same first letters.

…How PAM matrix is derived
• Some of the other amino acids are phonetically suggestive.
• Example: Arginine, R (aRginine)
• For the remaining 5 amino acids, Dr. Dayhoff was reaching somewhat to find an easy-to-remember connection between the single letter and the amino acid. She assigned aspartic acid, asparagine, glutamic acid and glutamine the letters D, N, E and Q, respectively, noting that D and N are nearer the beginning of the alphabet than E and Q, and that Asp is smaller than Glu, while Asn is smaller than Gln.
• By the time Dr. Dayhoff got to lysine, there were not too many letters left, so she used the letter K, explaining that K is at least near L in the alphabet.

…How PAM matrix is derived

First step: Pair Exchange Frequencies


A PAM (Percent accepted mutation) is one accepted point mutation on the path between two sequences, per 100 residues.

In order to identify accepted point mutations, a complete phylogenetic tree including all ancestral sequences has to be constructed.

To avoid a large degree of ambiguities in this step, Dayhoff and colleagues restricted their analysis to sequence families with more than 85% identity.

First step: Pair Exchange Frequencies
• For each of the observed and inferred sequences, the amino acid pair exchanges are tabulated into a 20x20 matrix. It is assumed that the likelihood of an amino acid X being replaced by an amino acid Y is the same as that of Y replacing X. Hence the matrix is constructed symmetrically. Aij is the number of accepted mutations observed where amino acid i replaces amino acid j.

Second step: Frequencies of Occurrence
• If the properties of amino acids differ and if they occur with different frequencies, all statements we can make about the average properties of sequences will depend on the frequencies of occurrence of the individual amino acids.
• These frequencies of occurrence are approximated by the frequencies of observation.
• They are the number of occurrences of a given amino acid divided by the total number of amino acids observed.

Third step: Relative Mutabilities
• Relative mutabilities are evaluated by counting, in each group of related sequences, the number of changes of each amino acid and dividing this number by a normalizing factor.
• This factor is the product of the frequency of occurrence of the amino acid in the group of sequences being analyzed and the number of mutations per 100 sites in that group.

Third step: Relative Mutabilities
Aligned sequences:
A D A
A D B

Amino acid   Observed changes   Frequency of occurrence (in total composition)   Relative mutability
A            1                  3                                                0.33
B            1                  1                                                1
D            0                  2                                                0
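The toy table above can be reproduced with a few lines of Python. This is only a sketch of the simplified calculation shown on the slide (changes divided by number of occurrences), not Dayhoff's full procedure; the function name `relative_mutability` is a hypothetical helper introduced for illustration.

```python
from collections import Counter

def relative_mutability(seq_a, seq_b):
    """Changes / occurrences for each amino acid in two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    # Each residue in either sequence counts as one occurrence.
    occurrences = Counter(seq_a) + Counter(seq_b)
    changes = Counter()
    for a, b in zip(seq_a, seq_b):
        if a != b:
            # A mismatched column counts as one change for both residues.
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in sorted(occurrences)}

result = relative_mutability("ADA", "ADB")
for aa, value in result.items():
    print(aa, round(value, 2))
```

For the two aligned sequences A D A and A D B, this yields the relative mutabilities listed in the table for A, B and D.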

Amino acid frequencies (Frequency of Occurrence):

Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014

The frequencies in the middle column are taken from Dayhoff (1978), the frequencies in the right column are taken from the 1991 recompilation of the mutation matrices representing a database of observations that is approximately 40 times larger than that available to Dayhoff.

Third step: Relative Mutabilities
• To obtain a complete picture of the mutational process, the amino acids that do not mutate are also taken into account, i.e., what is the chance, on average, that a given amino acid will mutate at all.
• Based on the relative mutability scores of the amino acids, Asn, Ser, Asp and Glu were observed to be the most mutable amino acids, and Cys and Trp were the least mutable.

Example: Phe - Tyr
• Of 1572 observed amino acid changes, there were 260 changes between Phe and Tyr
• These numbers were multiplied by (a) the mutability of Phe and (b) the fraction of Phe to Tyr changes over all changes of Phe to another amino acid, to obtain the mutation probability score of Phe to Tyr
• A similar score was obtained for changes of Tyr to Phe

Example: Phe - Tyr
• The resulting scores were summed and divided by a normalizing factor such that their sum represents a probability of change of 1% (1% → 250%, i.e. PAM1 extrapolated to PAM250)
• Score for changing Phe to Tyr was 0.15
• Frequency of Phe occurrence in sequence data was 0.04
• Score for changing Tyr to Phe was 0.20
• Frequency of Tyr occurrence in sequence data was 0.03
• These changes can include both forward and reverse, i.e., Phe → Tyr as well as Tyr → Phe


Example: Phe - Tyr
• Odds ratio for Phe to Tyr (score divided by frequency of occurrence): 0.15/0.04 = 3.75
• Converting to a logarithm to the base 10: log10 3.75 = 0.57
• Multiplying by 10 to remove fractional values: 5.7
• Odds ratio for Tyr to Phe: 0.20/0.03 = 6.7
• Log of this number: 0.83
• Multiplied by 10: 8.3
• The average of 5.7 and 8.3 is 7
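The Phe/Tyr arithmetic can be checked with a short script. It is a sketch that uses only the four numbers quoted in the example; the helper name `log_odds_x10` is hypothetical.

```python
import math

def log_odds_x10(mutation_score, frequency):
    """10 * log10 of (mutation score / frequency of occurrence)."""
    return 10 * math.log10(mutation_score / frequency)

phe_to_tyr = log_odds_x10(0.15, 0.04)   # about 5.7
tyr_to_phe = log_odds_x10(0.20, 0.03)   # about 8.2 without intermediate rounding
# The slide rounds 0.20/0.03 to 6.7 first, which gives 8.3 instead of 8.2;
# the averaged, rounded matrix entry is the same either way.
matrix_entry = round((phe_to_tyr + tyr_to_phe) / 2)
print(round(phe_to_tyr, 1), round(tyr_to_phe, 1), matrix_entry)
```

Averaging the forward and reverse values is what makes the final log-odds matrix symmetric.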

Formulation of PAM matrix

• The amino acid exchange counts and mutability values were then used to generate a 20 x 20 mutation probability matrix representing all possible amino acid changes

• Amino acids are grouped according to the chemistry of the side group:
– C: sulfhydryl
– STPAG: small hydrophilic
– NDEQ: acid, acid amide and hydrophilic
– HRK: basic
– MILV: small hydrophobic
– FYW: aromatic
• Sign of a score:
– +: ancestor probability is greater
– 0: probability of ancestry and of chance is the same
– -: alignment more by chance than ancestry


Possible type of questions that can be answered are: “Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?”

Constructing BLOSUM Matrices
Blocks Substitution Matrices

BLOSUM matrices
• Blocks Substitution Matrix. Scores for each position are obtained from frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff 1992].
• For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.

BLOSUM Scoring Matrices
• BLOck SUbstitution Matrix
• Based on comparisons of blocks of sequences derived from the Blocks database
• The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment versus global alignment)
• BLOSUM matrices are derived from blocks whose level of sequence identity corresponds to the BLOSUM matrix number

Conserved blocks in alignments
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Collecting substitution statistics
1. Count amino acid pairs in each column; e.g., for the column A A B A C A:
– 6 AA pairs, 4 AB pairs, 4 AC, 1 BC, 0 BB, 0 CC.
– Total = 6+4+4+1 = 15

2. Normalize results to obtain probabilities (pX’s and qXY’s)
3. Compute log-odds score matrix from probabilities: s(X,Y) = log (qXY / (pX pY))
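The three steps can be sketched in Python for the single example column A A B A C A. The function names are hypothetical, the log base (2) is an assumption since the slide only writes "log", and the score follows the slide's formula literally; Henikoff and Henikoff's published procedure uses 2·pX·pY as the expected frequency for unordered pairs with X ≠ Y.

```python
import math
from collections import Counter
from itertools import combinations

def column_pair_counts(column):
    """Step 1: count unordered residue pairs within one alignment column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        counts["".join(sorted((x, y)))] += 1
    return counts

def pair_and_residue_probs(counts):
    """Step 2: pair probabilities qXY and residue probabilities pX."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}
    p = {}
    for (x, y), prob in q.items():
        if x == y:
            p[x] = p.get(x, 0.0) + prob
        else:
            # A mixed pair contributes half of its probability to each residue.
            p[x] = p.get(x, 0.0) + prob / 2
            p[y] = p.get(y, 0.0) + prob / 2
    return q, p

def log_odds(q, p):
    """Step 3: s(X,Y) = log2(qXY / (pX * pY)), for pairs that were observed."""
    return {pair: math.log2(prob / (p[pair[0]] * p[pair[1]]))
            for pair, prob in q.items() if prob > 0}

counts = column_pair_counts("AABACA")
q, p = pair_and_residue_probs(counts)
scores = log_odds(q, p)
print(dict(counts))
```

Pairs that never occur (BB and CC here) have no finite log-odds score in this sketch and are simply left out.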

Estimation of a BLOSUM matrix
• The BLOCKS database contains local multiple gap-free alignments of proteins.
• All pairs of amino acids in each column of each block are compared, and the observed pair frequencies are noted (e.g., A aligned with A makes up 1.5% of all pairs; A aligned with C makes up 0.01% of all pairs, etc.)
• Expected pair frequencies are computed from single amino acid frequencies (e.g., fA,C = fA x fC = 7% x 3% = 0.21%).
• For each amino acid pair, the substitution score is essentially computed as the logarithm of the ratio of the observed to the expected pair frequency.
ID FIBRONECTIN_2; BLOCK
COG9_CANFA  GNSAGEPCVFPFIFLGKQYSTCTREGRGDGHLWCATT
COG9_RABIT  GNADGAPCHFPFTFEGRSYTACTTDGRSDGMAWCSTT
FA12_HUMAN  LTVTGEPCHFPFQYHRQLYHKCTHKGRPGPQPWCATT
HGFA_HUMAN  LTEDGRPCRFPFRYGGRMLHACTSEGSAHRKWCATTH
MANR_HUMAN  GNANGATCAFPFKFENKWYADCTSAGRSDGWLWCGTT
MPRI_MOUSE  ETDDGEPCVFPFIYKGKSYDECVLEGRAKLWCSKTAN
PB1_PIG     AITSDDKCVFPFIYKGNLYFDCTLHDSTYYWCSVTTY
SFP1_BOVIN  ELPEDEECVFPFVYRNRKHFDCTVHGSLFPWCSLDAD
SFP3_BOVIN  AETKDNKCVFPFIYGNKKYFDCTLHGSLFLWCSLDAD
SFP4_BOVIN  AVFEGPACAFPFTYKGKKYYMCTRKNSVLLWCSLDTE
SP1_HORSE   AATDYAKCAFPFVYRGQTYDRCTTDGSLFRISWCSVT
COG2_CHICK  GNSEGAPCVFPFIFLGNKYDSCTSAGRNDGKLWCAST
COG2_HUMAN  GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATT
COG2_MOUSE  GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT
COG2_RABIT  GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATS
COG2_RAT    GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT
COG9_BOVIN  GNADGKPCVFPFTFQGRTYSACTSDGRSDGYRWCATT
COG9_HUMAN  GNADGKPCQFPFIFQGQSYSACTTDGRSDGYRWCATT
COG9_MOUSE  GNGEGKPCVFPFIFEGRSYSACTTKGRSDGYRWCATT
COG9_RAT    GNGDGKPCVFPFIFEGHSYSACTTKGRSDGYRWCATT
FINC_BOVIN  GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT
FINC_HUMAN  GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT
FINC_RAT    GNSNGALCHFPFLYSNRNYSDCTSEGRRDNMKWCGTT
MPRI_BOVIN  ETEDGEPCVFPFVFNGKSYEECVVESRARLWCATTAN
MPRI_HUMAN  ETDDGVPCVFPFIFNGKSYEECIIESRAKLWCSTTAD
PA2R_BOVIN  GNAHGTPCMFPFQYNQQWHHECTREGREDNLLWCATT
PA2R_RABIT  GNAHGTPCMFPFQYNHQWHHECTREGRQDDSLWCATT

$$S = \log_{10} \frac{\text{Pair-freq(obs)}}{\text{Pair-freq(expected)}}$$

$$S_{A,C} = \log_{10} \frac{0.01\%}{0.21\%} = -1.3$$

Constructing a BLOSUM matrix
1. Counting mutations
2. Tallying mutation frequencies
3. Matrix of mutation probabilities
4. Calculate abundance of each residue (marginal probability)
5. Obtaining a BLOSUM matrix

Constructing BLOSUM r
• To avoid bias in favor of a certain protein, first eliminate sequences that are more than r% identical
• The elimination is done by either
– removing sequences from the block, or
– finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster.

• BLOSUM r is the matrix built from blocks with no more than r% similarity
– E.g., BLOSUM62 is the matrix built using sequences with no more than 62% similarity.
– Note: BLOSUM62 is the default matrix for protein BLAST
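The clustering option can be sketched as follows. This is an illustrative single-linkage grouping under stated assumptions, not the exact procedure used to build the published matrices: the function names are hypothetical, the segments are assumed to be ungapped and of equal length, and the threshold is "more than r% identical" as worded above.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length segments."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def cluster_block(segments, r):
    """Group segments so that any two that are > r% identical share a cluster."""
    clusters = []
    for seg in segments:
        hits, rest = [], []
        for cluster in clusters:
            linked = any(percent_identity(seg, member) > r for member in cluster)
            (hits if linked else rest).append(cluster)
        # Merge the new segment with every cluster it is linked to.
        merged = [seg] + [member for cluster in hits for member in cluster]
        clusters = rest + [merged]
    return clusters

toy_block = ["AAAAAAAAAA", "AAAAAAAAAB", "CCCCCCCCCC"]  # made-up segments
clusters = cluster_block(toy_block, 62)
print(clusters)
```

Each resulting cluster would then be treated as a single sequence when pairs are counted, so that many near-identical entries do not dominate the statistics.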

Obtaining BLOSUM62 Matrix
$$S_{ij} = 2 \cdot \log_2 \frac{p_{ij}}{p_i \, p_j}$$
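A minimal sketch of this scaling in Python is given below. The function name `half_bit_score` is hypothetical, the input probabilities are made-up values chosen only to show the behaviour, and rounding to the nearest integer is assumed here because the published matrix contains integer scores.

```python
import math

def half_bit_score(p_ij, p_i, p_j):
    """S_ij = 2 * log2(p_ij / (p_i * p_j)), rounded to the nearest integer."""
    return round(2 * math.log2(p_ij / (p_i * p_j)))

# Pair seen exactly as often as expected by chance: score 0.
print(half_bit_score(0.25, 0.5, 0.5))
# Pair seen twice as often as expected: score +2 (one bit = two half-bits).
print(half_bit_score(0.5, 0.5, 0.5))
# Pair seen half as often as expected: score -2.
print(half_bit_score(0.125, 0.5, 0.5))
```

The factor of 2 expresses the score in half-bit units, which keeps more resolution after rounding than whole bits would.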

PAM & BLOSUM
The PAM family

• PAM matrices are based on global alignments of closely related proteins.
• The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
• Other PAM matrices are extrapolated from PAM1.
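Under the Markov assumption used to derive the PAM matrices, extrapolation amounts to applying the PAM1 mutation probability matrix repeatedly, so the matrix for n PAM units is the n-th power of PAM1. The sketch below illustrates this with a made-up 3-letter alphabet; `toy_pam1` is not real Dayhoff data, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy mutation probability matrix: each row sums to 1, about 1% change per step.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]
toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

After many steps the diagonal entries shrink and the off-diagonal entries grow, which reflects the loss of similarity between distantly related sequences.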

PAM & BLOSUM
The BLOSUM family
• BLOSUM matrices are based on local alignments.
• BLOSUM 62 is a matrix calculated from comparisons of sequences with no more than 62% identity.
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
• BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

PAM & BLOSUM

Rat versus mouse protein

Rat versus Bacterial protein

BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences. BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins. If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.