Uploaded by shiv_1987

It is part of biostatistics and bioinformatics.

• Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution

• Scoring matrices reflect:

– probabilities of mutual substitutions

– the probability of occurrence of each amino acid

• Widely used scoring matrices:

– PAM

– BLOSUM

Amino acid substitution matrices

• Certain amino acid substitutions commonly occur in

related proteins from different species.

• Because a protein still functions with these substitutions, the substituted amino acids are compatible with its structure and function.

• Knowing the types of changes that are most and least common in a large number of proteins can assist with predicting alignments for any set of protein sequences.

• If ancestor relationships among a group of proteins

are assessed, the most likely amino acid changes that

occurred during evolution can be predicted.

Point Accepted Mutation (PAM) Matrices

[Dayhoff substitution matrices]

• Early work on deriving these matrices was done by Dayhoff et al. (1978), Atlas of Protein Sequence and Structure. These widely used substitution matrices are frequently called Dayhoff, MDM (Mutation Data Matrix), or PAM (Percent Accepted Mutation) matrices.

PAM approach: estimate the probability that b was substituted for a

in a given measure of evolutionary distance.

KEY IDEA: trusted alignments of closely related sequences

provide information about biologically permissible mutations.

Point Accepted Mutation (PAM) Matrices

[Dayhoff substitution matrices]

• PAM matrices list the likelihood of change from one amino acid to another in homologous protein sequences during evolution.

• Each matrix gives the changes expected for a given period of

evolutionary time, evidenced by decreased sequence similarity

as genes that encoded the same protein diverge with

increased evolutionary time.

• This leads to two possibilities:

– One matrix gives the changes expected in homologous

proteins that have diverged only a small amount from each

other in a relatively short period of time (about 50% similar)

– Another matrix gives the changes expected in proteins that have diverged over a much longer period, leaving only 20% similarity.

…How PAM matrix is derived

• In deriving the PAM matrices, each change in the

current amino acid at a particular site is assumed to

be independent of previous mutational events at that

site

• Thus, the probability of change of any amino acid ‘a’

to amino acid ‘b’ is the same, regardless of the

position of amino acid ‘a’ in a sequence.

– This is based on a simple Markov model, which is characterized by a series of changes of state in a system such that a change from one state to another does not depend on the previous history of the state.

How PAM matrix is derived.. AA index

• To prepare the Dayhoff PAM matrices (Dayhoff 1978), amino acid substitutions that occurred in a group of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar.

• Because these changes were observed in closely related proteins (>85% similar), they represented amino acid substitutions that do not significantly change the function of the protein.

• Hence they are called "accepted mutations", defined as amino acid changes accepted by natural selection.

…How PAM matrix is derived

• To develop a single-letter code for the amino acids, Dr.

Dayhoff attempted to make the code as easy to remember as

possible. Of course, if the name of each amino acid began

with a different letter, the code would be simple indeed. For 6

of the amino acids, the first letter of the name is unique,

making the code simple.

• Cysteine Cys C (first letter)
• Histidine His H
• Isoleucine Ile I
• Methionine Met M
• Serine Ser S
• Valine Val V

• For the other amino acids, the first letter of the name is not

unique to a single amino acid, so Dr. Dayhoff assigned the

letters A, G, L, P and T to the amino acids Alanine, Glycine,

Leucine, Proline and Threonine, respectively, which occur

more frequently in proteins than do the other amino acids

having the same first letters.

…How PAM matrix is derived

• Some of the other amino acids are phonetically suggestive.

• Arginine Arg R (aRginine)

• For the remaining 5 amino acids, Dr. Dayhoff was reaching

somewhat to find an easy-to-remember connection between the

single letter and the amino acid. She assigned aspartic acid,

asparagine, glutamic acid and glutamine the letters D, N, E and Q,

respectively, noting that D and N are nearer the beginning of the

alphabet than E and Q, and that Asp is smaller than Glu, while Asn is

smaller than Gln.

• By the time Dr. Dayhoff got to lysine, there were not too many letters

left, so she used the letter K, explaining that K is at least near L in

the alphabet.

…How PAM matrix is derived

First step: Pair Exchange Frequencies

• Definition: A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues.

• To tabulate the exchanges, a phylogenetic tree including all ancestral sequences has to be constructed.

• Dayhoff and colleagues restricted their analysis to sequence families with more than 85% identity.

First step: Pair Exchange Frequencies

• For each of the observed and inferred sequences, the amino acid pair exchanges are tabulated into a 20x20 matrix. It is assumed that the likelihood of an amino acid X being replaced by an amino acid Y is the same as that of Y replacing X. Hence the matrix is constructed symmetrically.

• Each matrix entry counts how often amino acid i replaces amino acid j.
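
The tabulation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `pair_exchange_counts` and the toy sequences are hypothetical, and each pair is treated as an already aligned (ancestor, descendant) edge of a tree rather than being inferred from real data.

```python
from collections import Counter

def pair_exchange_counts(aligned_pairs):
    """Tabulate symmetric exchange counts from aligned (ancestor, descendant) pairs.

    Hypothetical helper for illustration; real PAM counts come from
    sequences inferred on a phylogenetic tree.
    """
    counts = Counter()
    for ancestor, descendant in aligned_pairs:
        if len(ancestor) != len(descendant):
            raise ValueError("sequences must be aligned to equal length")
        for a, b in zip(ancestor, descendant):
            if a != b:
                # Count the exchange in both directions so the matrix is symmetric.
                counts[(a, b)] += 1
                counts[(b, a)] += 1
    return counts

# Toy data (not real proteins): two tree edges.
counts = pair_exchange_counts([("ADA", "ADB"), ("ADB", "AEB")])
print(counts[("A", "B")], counts[("B", "A")], counts[("D", "E")])
```

Counting every exchange in both directions is what makes the resulting table symmetric, matching the assumption that X replacing Y is as likely as Y replacing X.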

Second step: Frequencies of Occurrence

• If the properties of amino acids differ and if they occur with different frequencies, all statements we can make about the average properties of sequences will depend on the frequencies of occurrence of the individual amino acids.

• These frequencies of occurrence are approximated by the frequencies of observation.

• They are the number of occurrences of a given amino acid divided by the total number of amino acids observed.
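
As a minimal sketch of this step, the frequency of each amino acid is its count divided by the total count. The function name `occurrence_frequencies` and the toy sequences are assumptions made for illustration only.

```python
from collections import Counter

def occurrence_frequencies(sequences):
    """Return the observed frequency of each residue across all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Toy sequences (hypothetical): A occurs 3 times, D twice, B once.
freqs = occurrence_frequencies(["ADA", "ADB"])
print(freqs)
```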

Third step: Relative Mutabilities

• Relative mutabilities are evaluated by counting, in each group of related sequences, the number of changes of each amino acid and by dividing this number by a normalizing factor.

• This factor is the product of the frequency of occurrence of the amino acid in the group of sequences being analyzed and the number of amino acid changes that took place in that group per 100 sites.

Third step: Relative Mutabilities

Aligned sequences         A D A
                          A D B

Amino acids               A   B   D
Observed changes          1   1   0
Frequency of occurrence   3   1   2   (in total composition)
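
The toy alignment above can be worked through in code. This is a simplified sketch: the function name `relative_mutabilities` is hypothetical, and the ratio is just changes divided by occurrences, leaving out any further normalization such as scaling by the number of changes per 100 sites in the group.

```python
def relative_mutabilities(seq_a, seq_b):
    """Changes per occurrence for each residue in a two-sequence alignment."""
    changes, occurrences = {}, {}
    for a, b in zip(seq_a, seq_b):
        for residue in (a, b):
            occurrences[residue] = occurrences.get(residue, 0) + 1
            changes.setdefault(residue, 0)
        if a != b:
            # A mismatched column counts as one change for each residue involved.
            changes[a] += 1
            changes[b] += 1
    return {r: changes[r] / occurrences[r] for r in occurrences}

mutability = relative_mutabilities("ADA", "ADB")
print(mutability)
```

With these inputs A changes once in three occurrences, B once in one occurrence, and D never changes, which matches the counts tabulated above.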

Amino acid frequencies (Frequency of Occurrence):

Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014

The frequencies in the middle column are taken from Dayhoff (1978); the frequencies in the right column are taken from the 1991 recompilation of the mutation matrices, representing a database of observations that is approximately 40 times larger than that available to Dayhoff.

Third step: Relative Mutabilities

• To obtain a complete picture of the mutational process,

the amino-acids that do not mutate are also taken into

account i.e., what is the chance, on average, that a given

amino acid will mutate at all.

• Asn, Ser, Asp and Glu were observed to be the most mutable amino acids; Cys and Trp were the least mutable.

Example: Phe - Tyr

• Of 1572 observed amino acid changes, there were 260

changes between Phe and Tyr

• These numbers were multiplied by (a) the mutability of Phe and (b) the fraction of Phe to Tyr changes over all changes of Phe to another amino acid, to obtain the mutation probability score of Phe to Tyr

• A similar score was obtained for changes of Tyr to Phe

Example: Phe - Tyr

• The resulting scores were summed up and divided by a normalizing factor such that their sum represents a probability of change of 1% (PAM1); the corresponding figure for PAM250 is 250%

• Score for changing Phe to Tyr was 0.15

• Frequency of Phe occurrence in sequence data was 0.04

• Score for changing Tyr to Phe was 0.20

• Frequency of Tyr occurrence in sequence data was 0.03

• These changes can include both forward and reverse, i.e., Phe → Tyr as well as Tyr → Phe

Example: Phe - Tyr

• The relative frequency of change (odds ratio) of Phe to Tyr would be 0.15/0.04 = 3.75

• Converting to a log to the base 10 (log10 3.75 = 0.57)

• And multiplying it by 10 to remove fractional values = 5.7

• The relative frequency of change of Tyr to Phe would be 0.20/0.03 = 6.7; the log of this number = 0.83, which multiplied by 10 gives 8.3

• The average of 5.7 and 8.3 is 7, the Phe/Tyr entry in the log-odds matrix
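
The arithmetic of this example can be checked with a short script. The function name `log_odds_score` is hypothetical; the input values are the ones quoted on the slide. Using the unrounded ratio 0.20/0.03 gives about 8.2 rather than 8.3 for Tyr to Phe, but the rounded average is still 7.

```python
import math

def log_odds_score(change_score, frequency):
    """Ten times the base-10 log of the change score relative to the residue frequency."""
    return 10 * math.log10(change_score / frequency)

phe_to_tyr = log_odds_score(0.15, 0.04)  # ratio 3.75
tyr_to_phe = log_odds_score(0.20, 0.03)  # ratio about 6.67

# Average the two directions to obtain one symmetric matrix entry.
symmetric_score = round((phe_to_tyr + tyr_to_phe) / 2)
print(round(phe_to_tyr, 1), round(tyr_to_phe, 1), symmetric_score)
```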

Formulation of PAM matrix

• The resulting values were then used to generate a 20 x 20 mutation probability matrix representing all possible amino acid changes

• Amino acids are grouped according to chemistry of the side group:

• C – Sulfhydryl
• STPAG – Small hydrophilic
• NDEQ – Acid, acid amide and hydrophilic
• HRK – Basic
• MILV – Small hydrophobic
• FYW – Aromatic

• Meaning of the sign of a score:

+ Probability of ancestry is greater
0 Probability of ancestry and of alignment by chance is the same
- Alignment is more likely by chance than by ancestry

• Possible type of questions that can be answered

are:

• “Suppose I start with a given polypeptide sequence

M at time t, and observe the evolutionary changes in

the sequence until 1% of all amino acid residues

have undergone substitutions at time t+n. Let the

new sequence at time t+n be called M’. What is the

probability that a residue of type j in M will be

replaced by i in M’?”

Constructing BLOSUM Matrices

BLOSUM matrices

• Blocks Substitution Matrix. Scores for each position are obtained from frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff 1992].

• For example BLOSUM62 is derived from

sequence alignments with no more than

62% identity.

BLOSUM Scoring Matrices

• Based on comparisons of blocks of sequences derived

from the Blocks database

• The Blocks database contains multiply aligned ungapped

segments corresponding to the most highly conserved

regions of proteins (local alignment versus global

alignment)

• BLOSUM matrices are derived from blocks whose level of sequence identity corresponds to the BLOSUM matrix number (e.g., BLOSUM62)

Conserved blocks in alignments

AABCDA...BBCDA

DABCDA.A.BBCBB

BBBCDABA.BCCAA

AAACDAC.DCBCDB

CCBADAB.DBBDCC

AAACAA...BBCCC

Collecting substitution statistics

1. Count amino acid pairs in each column; e.g., for a column containing A, A, B, A, C, A:

– 6 AA pairs, 4 AB pairs, 4 AC, 1 BC, 0 BB, 0 CC.

– Total = 6+4+4+1 = 15

2. Normalize results to obtain probabilities (pX's and qXY's)

3. Compute log-odds score matrix from probabilities:

$$s(X,Y) = \log \frac{q_{XY}}{p_X \, p_Y}$$
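
As a quick check of the counting step, the sketch below enumerates all residue pairs in a single column containing A, A, B, A, C, A and normalizes the counts. The function name `column_pair_counts` is hypothetical, and a real BLOSUM calculation would accumulate counts over every column of every block.

```python
from collections import Counter
from itertools import combinations

def column_pair_counts(column):
    """Count unordered residue pairs among all sequences in one alignment column."""
    return Counter("".join(sorted(pair)) for pair in combinations(column, 2))

counts = column_pair_counts("AABACA")
total = sum(counts.values())

# Normalize the pair counts to observed pair probabilities (the qXY values).
q = {pair: n / total for pair, n in counts.items()}
print(dict(counts), total)
```

A column of six residues always yields 6 x 5 / 2 = 15 pairs, which is why the counts sum to 15.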

Estimation of a BLOSUM matrix

• The BLOCKS database contains local multiple gap-free alignments of proteins.

• The amino acid pairs in each column of each BLOCK are compared, and the observed pair frequencies are noted (e.g., A aligned with A makes up 1.5% of all pairs; A aligned with C makes up 0.01% of all pairs, etc.)

• Expected pair frequencies are computed from single amino acid frequencies (e.g., fA,C = fA x fC = 7% x 3% = 0.21%).

• For each amino acid pair the substitution scores are essentially computed as:

$$S_{A,C} = \log_{10} \frac{\text{Pair-freq(obs)}}{\text{Pair-freq(expected)}} = \log_{10} \frac{0.01\%}{0.21\%} = -1.3$$

Example block:

ID FIBRONECTIN_2; BLOCK
COG9_CANFA   GNSAGEPCVFPFIFLGKQYSTCTREGRGDGHLWCATT
COG9_RABIT   GNADGAPCHFPFTFEGRSYTACTTDGRSDGMAWCSTT
FA12_HUMAN   LTVTGEPCHFPFQYHRQLYHKCTHKGRPGPQPWCATT
HGFA_HUMAN   LTEDGRPCRFPFRYGGRMLHACTSEGSAHRKWCATTH
MPRI_MOUSE   ETDDGEPCVFPFIYKGKSYDECVLEGRAKLWCSKTAN
PB1_PIG      AITSDDKCVFPFIYKGNLYFDCTLHDSTYYWCSVTTY
SFP1_BOVIN   ELPEDEECVFPFVYRNRKHFDCTVHGSLFPWCSLDAD
SFP3_BOVIN   AETKDNKCVFPFIYGNKKYFDCTLHGSLFLWCSLDAD
SFP4_BOVIN   AVFEGPACAFPFTYKGKKYYMCTRKNSVLLWCSLDTE
SP1_HORSE    AATDYAKCAFPFVYRGQTYDRCTTDGSLFRISWCSVT
COG2_CHICK   GNSEGAPCVFPFIFLGNKYDSCTSAGRNDGKLWCAST
COG2_HUMAN   GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATT
COG2_MOUSE   GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT
COG2_RABIT   GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATS
COG2_RAT     GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT
COG9_BOVIN   GNADGKPCVFPFTFQGRTYSACTSDGRSDGYRWCATT
COG9_HUMAN   GNADGKPCQFPFIFQGQSYSACTTDGRSDGYRWCATT
COG9_MOUSE   GNGEGKPCVFPFIFEGRSYSACTTKGRSDGYRWCATT
COG9_RAT     GNGDGKPCVFPFIFEGHSYSACTTKGRSDGYRWCATT
FINC_BOVIN   GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT
FINC_HUMAN   GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT
FINC_RAT     GNSNGALCHFPFLYSNRNYSDCTSEGRRDNMKWCGTT
MPRI_BOVIN   ETEDGEPCVFPFVFNGKSYEECVVESRARLWCATTAN
MPRI_HUMAN   ETDDGVPCVFPFIFNGKSYEECIIESRAKLWCSTTAD
PA2R_BOVIN   GNAHGTPCMFPFQYNQQWHHECTREGREDNLLWCATT
PA2R_RABIT   GNAHGTPCMFPFQYNHQWHHECTREGRQDDSLWCATT
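
The A/C example can be reproduced numerically. This sketch assumes a base-10 logarithm, which is the base that yields the quoted value of -1.3; the function name `log_odds` is hypothetical.

```python
import math

def log_odds(observed_pair_freq, freq_x, freq_y):
    """Base-10 log of the observed pair frequency over the frequency expected by chance."""
    expected = freq_x * freq_y  # 7% x 3% = 0.21% in the example
    return math.log10(observed_pair_freq / expected)

# Observed A/C pair frequency of 0.01%, single-residue frequencies of 7% and 3%.
s_ac = log_odds(0.0001, 0.07, 0.03)
print(round(s_ac, 1))
```

A negative score means the pair is seen less often than chance alone would predict, so aligning A with C is penalized.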

Constructing a BLOSUM matrix

1. Counting mutations

2. Tallying mutation frequencies

3. Matrix of mutation probabilities

4. Calculate abundance of each residue (marginal probability)

5. Obtaining a BLOSUM matrix

Constructing BLOSUM r

• To avoid bias in favor of a certain protein, first eliminate

sequences that are more than r% identical

• The elimination is done by either

– removing sequences from the block, or

– finding a cluster of similar sequences and replacing it by a new

sequence that represents the cluster.

• BLOSUM r is the matrix built from blocks with no more than r% similarity

– E.g., BLOSUM62 is the matrix built using sequences with no more than

62% similarity.

– Note: BLOSUM 62 is the default matrix for protein BLAST
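
The first elimination option (removing sequences from the block) can be sketched as a greedy filter. This is only one possible reading of the step: the function names, the 80% threshold and the short toy sequences are assumptions for illustration, and the cluster-and-replace option is not shown.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length sequences match."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def filter_block(block, r):
    """Keep a sequence only if it is at most r% identical to every sequence already kept."""
    kept = []
    for seq in block:
        if all(percent_identity(seq, other) <= r for other in kept):
            kept.append(seq)
    return kept

# Toy ungapped block (hypothetical fragments, 7 residues each).
block = ["CVFPFIF", "CVFPFTF", "CHFPFQY", "CMFPFQY"]
kept = filter_block(block, 80)
print(kept)
```

Here the second and fourth sequences are dropped because each is about 86% identical to a sequence already kept, so near-duplicates do not dominate the pair counts.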

Obtaining BLOSUM62 Matrix

$$S_{ij} = 2 \cdot \log_2 \frac{p_{ij}}{p_i \, p_j}$$
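
A minimal sketch of this scoring formula follows, with scores rounded to the nearest integer as in published matrices. The function name `half_bit_score` and the input probabilities are illustrative assumptions, not values taken from the Blocks database.

```python
import math

def half_bit_score(p_ij, p_i, p_j):
    """Score in half-bit units: 2 * log2(observed / expected), rounded to an integer."""
    return round(2 * math.log2(p_ij / (p_i * p_j)))

# A pair seen four times more often than chance scores +4.
print(half_bit_score(0.01, 0.05, 0.05))
# A pair seen far less often than chance gets a negative score.
print(half_bit_score(0.0001, 0.07, 0.03))
```

The factor of 2 expresses the score in half-bits, which keeps more resolution after rounding than whole bits would.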

PAM & BLOSUM

The PAM family

• PAM matrices are based on global alignments of closely related proteins.

• The PAM1 is the matrix calculated from comparisons of

sequences with no more than 1% divergence.

• Other PAM matrices are extrapolated from PAM1.
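
Extrapolation is commonly done by multiplying the PAM1 mutation probability matrix by itself, which follows from the Markov assumption: PAM n is PAM1 raised to the n-th power. The sketch below uses a hypothetical two-letter alphabet and made-up probabilities purely to show the mechanics; a real PAM1 matrix is 20 x 20.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Toy "PAM1": each residue has a 1% chance of changing per PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250[0])
```

After 250 steps the toy matrix is close to 50/50 in each row, illustrating how repeated substitutions at the same site erase the original signal at large evolutionary distances.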

PAM & BLOSUM

The BLOSUM family

• BLOSUM 62 is a matrix calculated from comparisons of sequences with no more than 62% identity.

• All BLOSUM matrices are based on observed alignments; they

are not extrapolated from comparisons of closely related

proteins.

• BLOSUM 62 is the default matrix in BLAST 2.0. Though it is

tailored for comparisons of moderately distant proteins, it

performs well in detecting closer relationships. A search for

distant relatives may be more sensitive with a different matrix.

PAM & BLOSUM

BLOSUM matrices with higher numbers and PAM matrices with low

numbers are both designed for comparisons of closely related

sequences.

BLOSUM matrices with low numbers and PAM matrices with high

numbers are designed for comparisons of distantly related proteins.

If distant relatives of the query sequence are specifically being sought,

the matrix can be tailored to that type of search.
