
Bioinformatics (BIO213)

Session 4

Slide content: Various textbooks, Internet sources


When two proteins are aligned, what
scores should they be assigned?
Human beta globin (query) and Myoglobin (subject)

How did we get these scores?
• Based on Margaret Dayhoff’s (1966, 1978) model of protein evolution, which led to the
development of the PAM scoring matrices (log-odds scoring matrices).
• She studied 34 protein superfamilies, ranging from highly conserved proteins to proteins
with high rates of mutation acceptance.

• This provided the basis of a quantitative scoring system for alignments between any
proteins (closely or distantly related).
• BLOSUM matrices were later developed by Steven
Henikoff and Jorja Henikoff.
• Most alignment and database searching methods
such as BLAST and HMMER depend in some form
upon the evolutionary insights of the Dayhoff model.

Interesting reads: profiles of Dayhoff and the Henikoffs
Dayhoff’s log-odds scoring scheme

• Mij : The mutation probability matrix.


• Gives the probability of amino acid j → amino acid i in a given evolutionary
interval.
• fi : The normalized frequency of amino acids in a protein.
• Gives the probability that amino acid i will occur at a given amino acid position
by chance.
Dayhoff’s log-odds scoring scheme
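The formula this slide refers to combines the two quantities defined above: the score for a substitution is the log-odds ratio s(i, j) = 10 · log10(Mij / fi). A minimal numeric sketch (the probabilities below are made up for illustration, not real PAM values):

```python
import math

def log_odds_score(M_ij, f_i):
    # Dayhoff's score: 10 * log10(observed substitution probability M_ij
    # over the chance probability f_i), rounded to an integer as in the
    # published PAM matrices.
    return round(10 * math.log10(M_ij / f_i))

# Hypothetical values: the substitution is seen twice as often as chance.
print(log_odds_score(0.02, 0.01))   # 3  (positive: seen more often than chance)
print(log_odds_score(0.005, 0.01))  # -3 (negative: seen less often than chance)
```

A positive score thus means the pairing is favoured by evolution; a negative score means it is disfavoured.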

A few new terms
• Odds ratio/Log-odds score
• Markov chain
• Transition matrix
Odds vs probability in a nutshell
• Probability: the number of times an event occurs divided by the total
number of opportunities for it to occur.
E.g., in a deck of 52 cards, there are 13 spades.
• The probability of drawing a card randomly from the deck and
getting a spade is 13/52 = 0.25 = 25%.

• Odds: the ratio of the probability that an event occurs over the
probability that it won’t.
• In the spades example, the probability of drawing a spade is 0.25. The
probability of not drawing a spade is 1 - 0.25 = 0.75.
• So, the odds are 0.25/0.75 = 0.33, i.e., odds of 1 to 3 (1:3).
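The spade example, as a few lines of Python:

```python
# Probability vs odds for drawing a spade from a standard deck.
n_spades, n_cards = 13, 52

probability = n_spades / n_cards        # favourable / total = 0.25
odds = probability / (1 - probability)  # P(event) / P(no event) = 1:3

print(probability)       # 0.25
print(round(odds, 2))    # 0.33
```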

Odds ratio
• It is the probability of one outcome over the probability of
another.
• The odds ratio is the ratio of the two odds:

Odds Ratio = (a/b) / (c/d) = ad / bc

• A measure of how strongly an event is associated with an exposure


Event: the rate of lung cancer, Exposed: group of smokers, Non-exposed: non-smokers,

If 17 smokers have lung cancer (a) and 83 smokers do not have lung cancer (b).

1 non-smoker has lung cancer (c), and 99 non-smokers do not have lung cancer (d).

What are the odds that smoking and lung cancer are associated?

Odds ratio is calculated as follows.

I. Odds, exposed gp = a/b = 17/83 ≈ 0.205

II. Odds, non-exposed gp = c/d = 1/99 ≈ 0.01

III. Odds Ratio = 0.205/0.01 = 20.5

Thus, smokers have about 20 times the odds of having lung cancer compared with non-smokers.
Is this significant? 
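The 2 × 2 calculation above is easy to check directly. Note that computing ad/bc without rounding gives ≈ 20.3; the slide’s 20.5 comes from rounding the two odds to 0.205 and 0.01 first.

```python
# Odds ratio for the smoking example: counts a, b, c, d from the slide.
a, b = 17, 83   # smokers with / without lung cancer
c, d = 1, 99    # non-smokers with / without lung cancer

odds_exposed = a / b            # ~0.205
odds_unexposed = c / d          # ~0.0101
odds_ratio = (a * d) / (b * c)  # same as odds_exposed / odds_unexposed

print(round(odds_exposed, 3))   # 0.205
print(round(odds_ratio, 1))     # 20.3
```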
Odds Ratio Confidence Interval
• Significance of OR is determined by CI
• CI: gives an expected range for the true OR for the population to fall within.
• Formula for 95% CI:
• Upper 95% CI = exp[ln(OR) + 1.96 sqrt(1/a + 1/b + 1/c + 1/d)] 
• Lower 95% CI = exp[ln(OR) - 1.96 sqrt(1/a + 1/b + 1/c + 1/d)]
• OR > 1 ⇒ odds of the event in the exposed gp > in the non-exposed gp.
• OR < 1 ⇒ odds of the event in the exposed gp < in the non-exposed gp. 
• OR = 1 ⇒ odds of the event in the exposed and the non-exposed gp are the same.
• If the CI includes 1, then the calculated OR is not considered statistically significant

Assignment: Calculate the CI for the example in the previous slide and report whether
the association between smoking and lung cancer is significant.
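The CI formula above is easy to wrap in a helper. The sketch below uses deliberately different toy counts, so it does not give away the assignment:

```python
import math

def or_confidence_interval(a, b, c, d, z=1.96):
    """Odds ratio of a 2x2 table and its 95% confidence interval,
    using the formula from the slide (z = 1.96 for 95%)."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of ln(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Toy counts (not the assignment's numbers):
or_, lo_ci, hi_ci = or_confidence_interval(10, 90, 5, 95)
# Here the interval straddles 1, so this toy OR is not significant.
print(round(or_, 2), lo_ci < 1 < hi_ci)   # 2.11 True
```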
A few new terms
• Odds ratio/Log-odds score
• Markov chain
• Transition matrix
Markov chain
• A Markov chain (Andrey Markov) is a stochastic model describing a sequence of
possible events in which the probability of each event depends only on the state of
the previous event.
• Since the probability distribution is obtained solely by observing transitions from
the (n-1)th event → nth event, Markov processes are memoryless.

For a single cell that can transition among three states: growth (G), mitosis (M), and arrest (A).
T: the transition matrix
T^n: n transitions applied; for large enough n this approximates the steady-state limiting distribution
Example:

• Assume 2 states: Rainy and Sunny

• Assume that there is an inherent transition in this process, i.e., the
current weather has some bearing on the next day’s weather.
• If today is a rainy day, what is the likelihood that it is going to be sunny
tomorrow?
• You collect weather data over several years and, from the observed
transitions, estimate the chance of a sunny day occurring after a rainy day.
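Estimating such a transition probability from data is just counting. A sketch on a made-up weather record (R = rainy, S = sunny):

```python
# Estimate P(Sunny tomorrow | Rainy today) from an observed sequence
# of daily weather (hypothetical data, not real observations).
weather = "RRSSSRSRRSSSSRRS"

rainy_days = 0     # rainy days that have a recorded next day
rain_to_sun = 0    # rainy days followed by a sunny day
for today, tomorrow in zip(weather, weather[1:]):
    if today == "R":
        rainy_days += 1
        if tomorrow == "S":
            rain_to_sun += 1

p_sun_after_rain = rain_to_sun / rainy_days
print(rain_to_sun, rainy_days)   # 4 7
```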
Transition matrix
• The probability distribution of state transitions is typically represented as the
Markov chain’s transition matrix. 
• If the Markov chain has N possible states, the matrix will be an N x N matrix,
such that entry (i, j) is the probability of transitioning from state i to state j.
• Additionally, the entries in each row must add up to exactly 1, i.e., the
transition matrix must be a stochastic matrix. This makes complete
sense, since each row represents its own probability distribution.
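A minimal sketch of such a stochastic matrix for the two-state weather chain (the probabilities are made up):

```python
# A 2-state weather Markov chain with hypothetical probabilities.
# Row i holds the distribution of tomorrow's state given today's state i,
# so every row must sum to exactly 1 (a stochastic matrix).
states = ["Rainy", "Sunny"]
T = [
    [0.6, 0.4],   # today Rainy -> P(Rainy tomorrow), P(Sunny tomorrow)
    [0.2, 0.8],   # today Sunny -> P(Rainy tomorrow), P(Sunny tomorrow)
]

for row in T:
    assert abs(sum(row) - 1.0) < 1e-12   # stochastic-matrix check

# One step of the chain: tomorrow's distribution from today's.
today = [1.0, 0.0]   # it is rainy today
tomorrow = [sum(today[i] * T[i][j] for i in range(2)) for j in range(2)]
print(tomorrow)      # [0.6, 0.4]
```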
Dayhoff’s log-odds scoring scheme

• Mij : The mutation probability matrix.


• Gives the probability of amino acid j → amino acid i in a given evolutionary
interval.
• fi : The normalized frequency of amino acids in a protein.
• Gives the probability that amino acid i will occur at a given amino acid position
by chance.
Mij : The mutation probability matrix over an evolutionary interval
• Evolutionary interval: one PAM (accepted point mutation), defined in terms of % amino acid
divergence and not in units of years
• PAM 1 is defined as the unit of evolutionary divergence in which 1% of the amino acids
have been changed between the two protein sequences.
• 1% divergence of protein sequence may occur over vastly different time frames for protein
families that undergo substitutions at different rates.

Transition probability matrix: Sum of each column is 100%


Mij : The mutation probability matrix over an evolutionary interval

• Diagonal elements: Mjj = 1 - λ·mj

• Nondiagonal elements: Mij = (λ · mj · Aij) / Σi Aij

• Aij: an element of the accepted point mutation matrix, taken from empirical data
(e.g., the substitution count for the original alanine → arginine).
• λ: a proportionality constant
• mj: the mutability of the jth amino acid
• Amino acid substitutions make sense in reference to the genetic code: common accepted
substitutions tend to require only a single-nucleotide change (or at most two nucleotide changes).
• For example, aspartic acid is encoded by GAU or GAC, and changing the third
position to either A or G causes the codon to encode a glutamic acid
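Putting the diagonal and nondiagonal definitions together, the whole M matrix can be sketched for a toy alphabet. Everything below (the 3-letter alphabet, the counts A, the mutabilities m, and λ) is made up for illustration, not real Dayhoff data; only the formulas Mjj = 1 - λ·mj and Mij = λ·mj·Aij / Σi Aij come from the slide.

```python
# Toy sketch of building Dayhoff's mutation probability matrix M.
AAS = ["A", "R", "N"]                        # toy 3-letter alphabet
A = {                                        # symmetric accepted-mutation counts
    ("A", "R"): 30, ("R", "A"): 30,          # (off-diagonal only)
    ("A", "N"): 10, ("N", "A"): 10,
    ("R", "N"): 20, ("N", "R"): 20,
}
m = {"A": 0.01, "R": 0.02, "N": 0.03}        # toy relative mutabilities
lam = 1.0                                    # proportionality constant

M = {}
for j in AAS:
    col_total = sum(A[(i, j)] for i in AAS if i != j)   # Σi Aij
    for i in AAS:
        if i == j:
            M[(i, j)] = 1 - lam * m[j]                      # diagonal
        else:
            M[(i, j)] = lam * m[j] * A[(i, j)] / col_total  # nondiagonal

# Each column of M sums to 1: it is a probability distribution over
# what amino acid j can become in one evolutionary interval.
for j in AAS:
    assert abs(sum(M[(i, j)] for i in AAS) - 1.0) < 1e-12
```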
Mij : The mutation probability matrix over an evolutionary interval
• Nondiagonal elements: Mij = (λ · mj · Aij) / Σi Aij

• λ: a proportionality constant, chosen to correspond to an evolutionary distance of 1 PAM.
• Making λ larger models a greater evolutionary distance; for example, one could try to
obtain PAM2, PAM3, or PAM4 by using a larger λ.
• This approach fails for greater evolutionary distances (such as PAM250).
• PAM250: 250 accepted changes per 100 aligned residues.
• i.e., the problem is that adjusting λ does not account for multiple substitutions at the same site.
• Dayhoff et al. instead multiplied the PAM1 matrix by itself, up to hundreds of
times, to obtain the other PAM matrices (PAM250 = PAM1 multiplied by itself 250 times).
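The matrix-multiplication approach can be sketched with a toy two-letter "PAM1-like" column-stochastic matrix (the numbers are made up, not real PAM values). Raising it to higher powers models greater distances, and at very large powers the columns converge toward the background distribution:

```python
# Higher PAM matrices via repeated multiplication of a PAM1-like matrix.
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(M, n):
    result = M
    for _ in range(n - 1):          # multiply M by itself n-1 more times
        result = matmul(result, M)
    return result

PAM1 = [[0.99, 0.02],               # toy values; each column sums to 1
        [0.01, 0.98]]

PAM2 = matrix_power(PAM1, 2)        # two evolutionary intervals
PAM250 = matrix_power(PAM1, 250)    # a large evolutionary distance
# At large n the columns become nearly identical: the chance of seeing a
# given letter no longer depends on what it started as (the PAM = inf limit).
```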
PAM250
• Applies to an evolutionary distance where proteins share about 20% amino
acid identity
• At this evolutionary distance, only one in five amino acid residues remains
unchanged.
What do different PAM matrices mean?
• PAM = 0 is the identity matrix, because no amino acids have changed.
• For PAM = ∞, the likelihood of any amino acid being present no longer depends
on the starting amino acid; it is simply the background probability.
Dayhoff’s log-odds scoring scheme

Next class:
• Transition matrices to scoring matrices
• BLOSUM matrices
• Example problems
• Local sequence alignment: Smith-Waterman algorithm
