Session 4
How did we get these scores?
• Based on Margaret Dayhoff's (1966, 1978) model of protein evolution, which led to the
development of PAM scoring matrices (log-odds scoring matrices)
• She studied 34 protein superfamilies ranging from conserved proteins to proteins
with high rates of mutation acceptance.
Interesting reads: profiles of Dayhoff and the Henikoffs
Dayhoff’s log-odds scoring scheme
A few new terms
• Odds ratio/Log-odds score
• Markov chain
• Transition matrix
Odds vs probability in a nutshell
• Probability: the number of ways an event can happen divided by the total number of
possible outcomes (in epidemiology: the number of people who have the event divided by
the total number of people at risk of having that event).
E.g., in a deck of 52 cards, there are 13 spades.
• The probability (or risk) of drawing a card at random from the deck and
getting a spade is 13/52 = 0.25 = 25%.
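To set up the contrast with the odds used on the following slides, the card example can be worked both ways. This is only a minimal sketch: it assumes the usual definition odds = p / (1 - p), and the variable names are illustrative.

```python
from fractions import Fraction

spades = 13
deck = 52

# Probability: favourable outcomes over all possible outcomes.
probability = Fraction(spades, deck)

# Odds: favourable outcomes over unfavourable outcomes.
odds = Fraction(spades, deck - spades)

# The two quantities are linked by odds = p / (1 - p).
assert odds == probability / (1 - probability)

print(f"probability of a spade = {probability} ({float(probability):.2f})")
print(f"odds of a spade        = {odds} ({float(odds):.2f})")
```

Exact fractions are used so that the relation between odds and probability can be checked without rounding.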
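Odds ratio
• Odds: the probability of one outcome over the probability of
another (e.g., of an event happening over it not happening).
• The odds ratio (OR) is the ratio of two odds: the odds of the event in the exposed
group over the odds of the event in the non-exposed group.
Odds Ratio = (a/b) / (c/d) = ad / bc
If 17 smokers have lung cancer (a) and 83 smokers do not have lung cancer (b),
and 1 non-smoker has lung cancer (c) and 99 non-smokers do not have lung cancer (d),
what are the odds that smoking and lung cancer are associated?
I. Odds in the exposed gp = a/b = 17/83 = 0.205
II. Odds in the non-exposed gp = c/d = 1/99 = 0.0101
III. Odds Ratio = 0.205 / 0.0101 ≈ 20.3
Thus, smokers have about 20 times the odds of having lung cancer compared with non-smokers.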
Is this significant?
Odds Ratio Confidence Interval
• Significance of OR is determined by CI
• CI: gives an expected range for the true OR for the population to fall within.
• Formula for 95% CI:
• Upper 95% CI = exp[ln(OR) + 1.96 sqrt(1/a + 1/b + 1/c + 1/d)]
• Lower 95% CI = exp[ln(OR) - 1.96 sqrt(1/a + 1/b + 1/c + 1/d)]
• OR > 1 ⇒ odds of the event in the exposed gp > in the non-exposed gp.
• OR < 1 ⇒ odds of the event in the exposed gp < in the non-exposed gp.
• OR = 1 ⇒ odds of the event in the exposed and the non-exposed gp are the same.
• If the CI includes 1, the calculated OR is not considered statistically significant.
Assignment: Calculate the 95% CI for the example in the previous slide and report whether
the odds ratio for lung cancer in smokers is statistically significant.
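The OR and 95% CI formulas above can be checked with a short script. This is a sketch under the slide's own definitions (a, b, c, d as in the smoking example); the function names `odds_ratio` and `odds_ratio_ci95` are hypothetical helpers, not part of any library.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds in the exposed group (a/b) over odds in the non-exposed group (c/d)."""
    return (a / b) / (c / d)


def odds_ratio_ci95(a, b, c, d):
    """95% CI: exp[ln(OR) -/+ 1.96 * sqrt(1/a + 1/b + 1/c + 1/d)]."""
    log_or = math.log(odds_ratio(a, b, c, d))
    half_width = 1.96 * math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - half_width), math.exp(log_or + half_width)


# Counts from the smoking example: a, b = smokers with/without lung cancer,
# c, d = non-smokers with/without lung cancer.
a, b, c, d = 17, 83, 1, 99

or_value = odds_ratio(a, b, c, d)
lower, upper = odds_ratio_ci95(a, b, c, d)

# The OR is considered significant only if the interval excludes 1.
significant = not (lower <= 1.0 <= upper)

print(f"OR = {or_value:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("significant" if significant else "not significant")
```

Note that the interval is computed on the log scale and then exponentiated, which is why it is not symmetric around the OR.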
Markov chain
• A Markov chain (named after Andrey Markov) is a stochastic model describing a sequence of
possible events in which the probability of each event depends only on the state attained in
the previous event.
• Since the probability distribution is obtained solely by observing transitions from
the (n-1)th event to the nth event, Markov processes are memoryless.
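Example: a single cell that can transition among three states: growth (G), mitosis (M) and arrest (A).
• T: the transition matrix, holding the one-step transition probabilities between states
• T^n: T multiplied by itself n times, where n is the number of transitions; for a large
enough n it approximates the steady-state (limiting) distribution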
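The three-state cell example can be sketched numerically. The transition probabilities below are hypothetical (they are not taken from the slides), and the matrix is assumed to be row-stochastic: row = current state, column = next state. The helper `steps_to_steady_state` is an illustrative name; it multiplies T by itself until all rows agree, which is one way to read "number of transitions required to approximate the steady state".

```python
STATES = ["G", "M", "A"]

# Hypothetical one-step transition probabilities (not from the slides).
# Row = current state, column = next state; each row sums to 1.
T = [
    [0.90, 0.07, 0.03],  # from G
    [0.60, 0.30, 0.10],  # from M
    [0.20, 0.00, 0.80],  # from A
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def steps_to_steady_state(t, tol=1e-6, max_steps=10_000):
    """Return (n, row) for the first n at which all rows of T^n agree within tol."""
    power = t  # T^1
    for n in range(1, max_steps + 1):
        first = power[0]
        if all(abs(row[j] - first[j]) < tol for row in power for j in range(len(first))):
            return n, first
        power = matmul(power, t)
    raise RuntimeError("no steady state reached within max_steps")


n, steady = steps_to_steady_state(T)
print(f"rows of T^n agree after n = {n} transitions")
for state, p in zip(STATES, steady):
    print(f"  P({state}) = {p:.4f}")
```

Once the rows of T^n are identical, the starting state no longer matters, which is the memoryless behaviour carried to its limit.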
• Diagonal elements: Mjj = 1 - λ mj
• Nondiagonal elements: Mij = λ mj Aij / Σi Aij
(the probability that the original amino acid j is replaced by amino acid i)
• Aij: an element of the accepted point mutation matrix from empirical data
(e.g., the substitution value for the original alanine → arginine).
• λ: a proportionality constant
• mj: the mutability of the jth amino acid
• Amino acid substitutions in reference to the genetic code: common amino acid substitutions
tend to require only a single‐nucleotide change (others require 2 nucleotide changes, etc.).
• For example, aspartic acid is encoded by GAU or GAC, and changing the third
position to either A or G causes the codon to encode glutamic acid.
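The construction of a mutation probability matrix from Aij, mj and λ can be sketched on a toy alphabet. Everything numeric here is hypothetical: a three-letter alphabet, made-up counts, mutabilities and background frequencies `f`. The sketch assumes the form Mij = λ mj Aij / Σi Aij off the diagonal and Mjj = 1 - λ mj on it, and it assumes λ is fixed so that 1% of residues are expected to change (1 PAM).

```python
AMINO = ["X", "Y", "Z"]  # hypothetical three-letter alphabet

# Hypothetical accepted point mutation counts; A[i][j] == A[j][i], diagonal unused.
A = [
    [0, 30, 10],
    [30, 0, 20],
    [10, 20, 0],
]
m = [0.8, 1.2, 1.0]  # hypothetical relative mutabilities m_j
f = [0.5, 0.3, 0.2]  # hypothetical background frequencies f_j

# Assumed normalisation: 1 PAM means 1% of residues are expected to change,
# so lambda solves sum_j f_j * lambda * m_j = 0.01.
lam = 0.01 / sum(fj * mj for fj, mj in zip(f, m))

size = len(AMINO)
col_totals = [sum(A[i][j] for i in range(size)) for j in range(size)]

# M[i][j]: probability that original residue j (column) is replaced by residue i (row).
M = [[0.0] * size for _ in range(size)]
for j in range(size):
    for i in range(size):
        if i == j:
            M[i][j] = 1.0 - lam * m[j]
        else:
            M[i][j] = lam * m[j] * A[i][j] / col_totals[j]

for name, row in zip(AMINO, M):
    print(name, " ".join(f"{value:.5f}" for value in row))
```

Each column sums to 1, because the off-diagonal entries of column j add up to λ mj, exactly the amount removed from the diagonal.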
Mij: the mutation probability matrix over an evolutionary interval
• Diagonal elements: Mjj = 1 - λ mj
• Nondiagonal elements: Mij = λ mj Aij / Σi Aij
• λ: a proportionality constant
• λ is chosen to correspond to an evolutionary distance of 1 PAM.
• As we make λ larger, we model a greater evolutionary distance. For
example, we get PAM2, PAM3, or PAM4 by using a progressively larger λ.
• This approach fails for greater evolutionary distances (such as PAM250).
• PAM250: 250 changes occur in two aligned sequences of length 100.
• The problem is that adjusting λ does not account for multiple substitutions at the same position.
• Dayhoff et al. instead multiplied the PAM1 matrix by itself, up to hundreds of
times, to obtain other PAM matrices (PAM250 = PAM1 multiplied by itself 250 times).
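Why scaling λ breaks down while repeated multiplication does not can be seen on a toy model. The counts, mutabilities and frequencies below are hypothetical (a three-residue alphabet, not Dayhoff's data), and `mutation_matrix`, `matmul` and `matrix_power` are illustrative helper names. The 1 PAM normalisation (1% expected change) is an assumption of this sketch.

```python
# Hypothetical 3-residue toy model (values are not from Dayhoff's data).
A = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]  # symmetric accepted-mutation counts
m = [0.8, 1.2, 1.0]                          # relative mutabilities
f = [0.5, 0.3, 0.2]                          # background frequencies
SIZE = 3


def mutation_matrix(lam):
    """Columns are original residues: diagonal 1 - lam*m[j], rest lam*m[j]*A[i][j]/total."""
    totals = [sum(A[i][j] for i in range(SIZE)) for j in range(SIZE)]
    return [
        [1.0 - lam * m[j] if i == j else lam * m[j] * A[i][j] / totals[j]
         for j in range(SIZE)]
        for i in range(SIZE)
    ]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(SIZE)) for j in range(SIZE)]
        for i in range(SIZE)
    ]


def matrix_power(matrix, n):
    """Multiply matrix by itself n times, starting from the identity."""
    result = [[1.0 if i == j else 0.0 for j in range(SIZE)] for i in range(SIZE)]
    for _ in range(n):
        result = matmul(result, matrix)
    return result


lam_1pam = 0.01 / sum(fj * mj for fj, mj in zip(f, m))  # assumed 1% expected change
pam1 = mutation_matrix(lam_1pam)

scaled = mutation_matrix(250 * lam_1pam)  # "just make lambda larger"
pam250 = matrix_power(pam1, 250)          # PAM1 multiplied by itself 250 times

print("diagonal with 250 x lambda:", [round(scaled[j][j], 3) for j in range(SIZE)])
print("diagonal of PAM1^250:      ", [round(pam250[j][j], 3) for j in range(SIZE)])
```

With λ scaled by 250 the diagonal entries 1 - λ mj drop below zero, so they are no longer probabilities; the matrix power keeps every entry between 0 and 1 because repeated changes at the same position are folded in at each step.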
PAM250
• Applies to an evolutionary distance where proteins share about 20% amino
acid identity
• At this evolutionary distance, only one in five amino acid residues remains
unchanged.
What do different PAM matrices mean?
• For PAM = 0, the matrix is a unit diagonal (identity) matrix, because no amino acids have changed.
• For PAM = ∞, there is an equal likelihood of any amino acid being present
(background probability).
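The two limits can be illustrated with a toy PAM1-like matrix. All values are hypothetical, and the matrix is deliberately built with off-diagonal entries eps * A[i][j] / f[j] (a simplification assumed here, with symmetric A) so that the background frequencies `f` are exactly its stationary distribution. Under that assumption, zero multiplications give the identity matrix and very many multiplications make every column equal to `f`.

```python
# Hypothetical toy data: symmetric counts and background frequencies.
A = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]
f = [0.5, 0.3, 0.2]
SIZE = 3

# Assumed scaling: 1% of residues expected to change per step.
eps = 0.01 / sum(A[i][j] for i in range(SIZE) for j in range(SIZE) if i != j)

# Column j describes the fate of original residue j.
pam1 = [
    [
        1.0 - eps * sum(A[k][j] for k in range(SIZE)) / f[j] if i == j
        else eps * A[i][j] / f[j]
        for j in range(SIZE)
    ]
    for i in range(SIZE)
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(SIZE)) for j in range(SIZE)]
        for i in range(SIZE)
    ]


def matrix_power(matrix, n):
    """Multiply matrix by itself n times, starting from the identity."""
    result = [[1.0 if i == j else 0.0 for j in range(SIZE)] for i in range(SIZE)]
    for _ in range(n):
        result = matmul(result, matrix)
    return result


pam0 = matrix_power(pam1, 0)       # no evolutionary change
pam_far = matrix_power(pam1, 5000) # stand-in for a very large PAM distance

print("PAM0 diagonal:", [pam0[j][j] for j in range(SIZE)])
for row in pam_far:
    print(" ".join(f"{value:.4f}" for value in row))
```

In the printed large-distance matrix each row repeats one value across all columns: the chance of seeing residue i no longer depends on the original residue and equals its background frequency.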
Dayhoff’s log odds scoring scheme
Next class:
• Transition matrices to scoring matrices
• BLOSUM matrices
• Example problems
• Local sequence alignment: Smith-Waterman algorithm