
Bioinformatics (BIO213)

Session 4

Slide content: Various textbooks, Internet sources


When two proteins are aligned, what
scores should they be assigned?
Human beta globin (query) and Myoglobin (subject)

How did we get these scores?
• Based on Margaret Dayhoff’s (1966, 1978) model of protein evolution, which led to the
development of the PAM scoring matrices (log-odds scoring matrices).
• She studied 34 protein superfamilies, ranging from highly conserved proteins to proteins
with high rates of mutation acceptance.

• This provided the basis of a quantitative scoring system for alignments between any
proteins (closely or distantly related).
• BLOSUM matrices were later developed by Steven
Henikoff and Jorja Henikoff.
• Most alignment and database searching methods
such as BLAST and HMMER depend in some form
upon the evolutionary insights of the Dayhoff model.

Interesting reads: profiles of Dayhoff and the Henikoffs
Dayhoff’s log-odds scoring scheme

• Mij : The mutation probability matrix.


• Gives the probability of amino acid j → amino acid i in a given evolutionary
interval.
• fi : The normalized frequency of amino acids in a protein.
• Gives the probability that amino acid i will occur at a given amino acid position
by chance.
Dayhoff’s log-odds scoring scheme
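The formula this slide refers to combines the two quantities defined above: the score for a substitution is the log-odds ratio s(i, j) = 10 · log10(Mij / fi). A minimal numeric sketch (the probabilities below are made up for illustration, not real PAM values):

```python
import math

def log_odds_score(M_ij, f_i):
    # Dayhoff's score: 10 * log10(observed substitution probability M_ij
    # over the chance probability f_i), rounded to an integer as in the
    # published PAM matrices.
    return round(10 * math.log10(M_ij / f_i))

# Hypothetical values: the substitution is seen twice as often as chance.
print(log_odds_score(0.02, 0.01))   # 3  (positive: seen more often than chance)
print(log_odds_score(0.005, 0.01))  # -3 (negative: seen less often than chance)
```

A positive score thus means the pairing is favoured by evolution; a negative score means it is disfavoured.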

A few new terms
• Odds ratio/Log-odds score
• Markov chain
• Transition matrix
Odds vs probability in a nutshell
• Probability: the number of times an event occurs divided by the total
number of opportunities for it to occur.
E.g., in a deck of 52 cards, there are 13 spades.
• The probability of drawing a card randomly from the deck and
getting a spade is 13/52 = 0.25 = 25%.

• Odds: the ratio of the probability that an event occurs over the
probability that it won’t.
• In the spades example, the probability of drawing a spade is 0.25. The
probability of not drawing a spade is 1 - 0.25 = 0.75.
• So, the odds are 0.25/0.75 = 0.33, i.e., odds of 1 to 3 (1:3).
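The spade example, as a few lines of Python:

```python
# Probability vs odds for drawing a spade from a standard deck.
n_spades, n_cards = 13, 52

probability = n_spades / n_cards        # favourable / total = 0.25
odds = probability / (1 - probability)  # P(event) / P(no event) = 1:3

print(probability)       # 0.25
print(round(odds, 2))    # 0.33
```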

Odds ratio
• It is the probability of one outcome over the probability of
another.
• The odds ratio is the ratio of the two odds:

Odds Ratio = (a/b) / (c/d) = ad / bc

• A measure of how strongly an event is associated with an exposure


Event: the rate of lung cancer, Exposed: group of smokers, Non-exposed: non-smokers,

If 17 smokers have lung cancer (a) and 83 smokers do not have lung cancer (b).

1 non-smoker has lung cancer (c), and 99 non-smokers do not have lung cancer (d).

What are the odds that smoking and lung cancer are associated?

Odds ratio is calculated as follows.

I. Odds, exposed gp = a/b = 17/83 ≈ 0.205

II. Odds, non-exposed gp = c/d = 1/99 ≈ 0.01

III. Odds Ratio = 0.205/0.01 = 20.5

Thus, smokers have about 20 times the odds of having lung cancer compared with non-smokers.
Is this significant? 
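The 2 × 2 calculation above is easy to check directly. Note that computing ad/bc without rounding gives ≈ 20.3; the slide’s 20.5 comes from rounding the two odds to 0.205 and 0.01 first.

```python
# Odds ratio for the smoking example: counts a, b, c, d from the slide.
a, b = 17, 83   # smokers with / without lung cancer
c, d = 1, 99    # non-smokers with / without lung cancer

odds_exposed = a / b            # ~0.205
odds_unexposed = c / d          # ~0.0101
odds_ratio = (a * d) / (b * c)  # same as odds_exposed / odds_unexposed

print(round(odds_exposed, 3))   # 0.205
print(round(odds_ratio, 1))     # 20.3
```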
Odds Ratio Confidence Interval
• Significance of OR is determined by CI
• CI: gives an expected range for the true OR for the population to fall within.
• Formula for 95% CI:
• Upper 95% CI = exp[ln(OR) + 1.96 sqrt(1/a + 1/b + 1/c + 1/d)] 
• Lower 95% CI = exp[ln(OR) - 1.96 sqrt(1/a + 1/b + 1/c + 1/d)]
• OR > 1 ⇒ odds of the event in the exposed gp > in the non-exposed gp.
• OR < 1 ⇒ odds of the event in the exposed gp < in the non-exposed gp. 
• OR = 1 ⇒ odds of the event in the exposed and the non-exposed gp are the same.
• If the CI includes 1, then the calculated OR is not considered statistically significant

Assignment: Calculate the CI for the example in the previous slide and report whether
the association between smoking and lung cancer is significant.
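The CI formula above is easy to wrap in a helper. The sketch below uses deliberately different toy counts, so it does not give away the assignment:

```python
import math

def or_confidence_interval(a, b, c, d, z=1.96):
    """Odds ratio of a 2x2 table and its 95% confidence interval,
    using the formula from the slide (z = 1.96 for 95%)."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of ln(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Toy counts (not the assignment's numbers):
or_, lo_ci, hi_ci = or_confidence_interval(10, 90, 5, 95)
# Here the interval straddles 1, so this toy OR is not significant.
print(round(or_, 2), lo_ci < 1 < hi_ci)   # 2.11 True
```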
A few new terms
• Odds ratio/Log-odds score
• Markov chain
• Transition matrix
Markov chain
• A Markov chain (Andrey Markov) is a stochastic model describing a sequence of
possible events in which the probability of each event depends only on the state of
the previous event.
• Since the probability distribution is obtained solely by observing transitions from
the (n-1)th event → nth event, Markov processes are memoryless.

For a single cell that can transition among three states: growth (G), mitosis (M), and arrest (A).
T: the transition matrix
T^n: n transitions applied; for large enough n this approximates the steady-state limiting distribution
Example:

• Assume 2 states: Rainy and Sunny

• Assume that there is an inherent transition in this process, i.e., the
current weather has some bearing on the next day’s weather.
• If today is a rainy day, what is the likelihood that it is going to be sunny
tomorrow?
• You collect weather data over several years and, from the observed
transitions, estimate the chance of a sunny day occurring after a rainy day.
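Estimating such a transition probability from data is just counting. A sketch on a made-up weather record (R = rainy, S = sunny):

```python
# Estimate P(Sunny tomorrow | Rainy today) from an observed sequence
# of daily weather (hypothetical data, not real observations).
weather = "RRSSSRSRRSSSSRRS"

rainy_days = 0     # rainy days that have a recorded next day
rain_to_sun = 0    # rainy days followed by a sunny day
for today, tomorrow in zip(weather, weather[1:]):
    if today == "R":
        rainy_days += 1
        if tomorrow == "S":
            rain_to_sun += 1

p_sun_after_rain = rain_to_sun / rainy_days
print(rain_to_sun, rainy_days)   # 4 7
```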
Transition matrix
• The probability distribution of state transitions is typically represented as the
Markov chain’s transition matrix. 
• If the Markov chain has N possible states, the matrix will be an N x N matrix,
such that entry (i, j) is the probability of transitioning from state i to state j.
• Additionally, the entries in each row must add up to exactly 1, i.e., the
transition matrix must be a stochastic matrix. This makes complete
sense, since each row represents its own probability distribution.
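A minimal sketch of such a stochastic matrix for the two-state weather chain (the probabilities are made up):

```python
# A 2-state weather Markov chain with hypothetical probabilities.
# Row i holds the distribution of tomorrow's state given today's state i,
# so every row must sum to exactly 1 (a stochastic matrix).
states = ["Rainy", "Sunny"]
T = [
    [0.6, 0.4],   # today Rainy -> P(Rainy tomorrow), P(Sunny tomorrow)
    [0.2, 0.8],   # today Sunny -> P(Rainy tomorrow), P(Sunny tomorrow)
]

for row in T:
    assert abs(sum(row) - 1.0) < 1e-12   # stochastic-matrix check

# One step of the chain: tomorrow's distribution from today's.
today = [1.0, 0.0]   # it is rainy today
tomorrow = [sum(today[i] * T[i][j] for i in range(2)) for j in range(2)]
print(tomorrow)      # [0.6, 0.4]
```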
Dayhoff’s log-odds scoring scheme

• Mij : The mutation probability matrix.


• Gives the probability of amino acid j → amino acid i in a given evolutionary
interval.
• fi : The normalized frequency of amino acids in a protein.
• Gives the probability that amino acid i will occur at a given amino acid position
by chance.
Mij : The mutation probability matrix over an evolutionary interval
• Evolutionary interval: one PAM (accepted point mutation), defined in terms of % amino acid
divergence and not in units of years
• PAM 1 is defined as the unit of evolutionary divergence in which 1% of the amino acids
have been changed between the two protein sequences.
• 1% divergence of protein sequence may occur over vastly different time frames for protein
families that undergo substitutions at different rates.

Transition probability matrix: Sum of each column is 100%


Mij : The mutation probability matrix over an evolutionary interval

• Diagonal elements: Mjj = 1 - λ·mj

• Nondiagonal elements: Mij = (λ · mj · Aij) / Σi Aij

• Aij: an element of the accepted point mutation matrix, taken from empirical data
(e.g., the substitution count for the original alanine → arginine).
• λ: a proportionality constant
• mj: the mutability of the jth amino acid
• Amino acid substitutions make sense in reference to the genetic code: common accepted
substitutions tend to require only a single-nucleotide change (or at most two nucleotide changes).
• For example, aspartic acid is encoded by GAU or GAC, and changing the third
position to either A or G causes the codon to encode a glutamic acid
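Putting the diagonal and nondiagonal definitions together, the whole M matrix can be sketched for a toy alphabet. Everything below (the 3-letter alphabet, the counts A, the mutabilities m, and λ) is made up for illustration, not real Dayhoff data; only the formulas Mjj = 1 - λ·mj and Mij = λ·mj·Aij / Σi Aij come from the slide.

```python
# Toy sketch of building Dayhoff's mutation probability matrix M.
AAS = ["A", "R", "N"]                        # toy 3-letter alphabet
A = {                                        # symmetric accepted-mutation counts
    ("A", "R"): 30, ("R", "A"): 30,          # (off-diagonal only)
    ("A", "N"): 10, ("N", "A"): 10,
    ("R", "N"): 20, ("N", "R"): 20,
}
m = {"A": 0.01, "R": 0.02, "N": 0.03}        # toy relative mutabilities
lam = 1.0                                    # proportionality constant

M = {}
for j in AAS:
    col_total = sum(A[(i, j)] for i in AAS if i != j)   # Σi Aij
    for i in AAS:
        if i == j:
            M[(i, j)] = 1 - lam * m[j]                      # diagonal
        else:
            M[(i, j)] = lam * m[j] * A[(i, j)] / col_total  # nondiagonal

# Each column of M sums to 1: it is a probability distribution over
# what amino acid j can become in one evolutionary interval.
for j in AAS:
    assert abs(sum(M[(i, j)] for i in AAS) - 1.0) < 1e-12
```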
Mij : The mutation probability matrix over an evolutionary interval
• Nondiagonal elements: Mij = (λ · mj · Aij) / Σi Aij

• λ: a proportionality constant, chosen to correspond to an evolutionary distance of 1 PAM.
• Making λ larger models a greater evolutionary distance; for example, one could try to
obtain PAM2, PAM3, or PAM4 by using a larger λ.
• This approach fails for greater evolutionary distances (such as PAM250).
• PAM250: 250 accepted changes per 100 aligned residues.
• i.e., the problem is that adjusting λ does not account for multiple substitutions at the same site.
• Dayhoff et al. instead multiplied the PAM1 matrix by itself, up to hundreds of
times, to obtain the other PAM matrices (PAM250 = PAM1 multiplied by itself 250 times).
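The matrix-multiplication approach can be sketched with a toy two-letter "PAM1-like" column-stochastic matrix (the numbers are made up, not real PAM values). Raising it to higher powers models greater distances, and at very large powers the columns converge toward the background distribution:

```python
# Higher PAM matrices via repeated multiplication of a PAM1-like matrix.
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(M, n):
    result = M
    for _ in range(n - 1):          # multiply M by itself n-1 more times
        result = matmul(result, M)
    return result

PAM1 = [[0.99, 0.02],               # toy values; each column sums to 1
        [0.01, 0.98]]

PAM2 = matrix_power(PAM1, 2)        # two evolutionary intervals
PAM250 = matrix_power(PAM1, 250)    # a large evolutionary distance
# At large n the columns become nearly identical: the chance of seeing a
# given letter no longer depends on what it started as (the PAM = inf limit).
```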
PAM250
• Applies to an evolutionary distance where proteins share about 20% amino
acid identity
• At this evolutionary distance, only one in five amino acid residues remains
unchanged.
What do different PAM matrices mean?
• PAM = 0 is the identity matrix, because no amino acids have changed.
• For PAM = ∞, the likelihood of any amino acid being present no longer depends
on the starting amino acid; it is simply the background probability.
Dayhoff’s log-odds scoring scheme

Next class:
• Transition matrices to scoring matrices
• BLOSUM matrices
• Example problems
• Local sequence alignment: Smith-Waterman algorithm
