Professional Documents
Culture Documents
Modified from:http://www.cs.iastate.edu/~cs544/Lectures/lectures.html
Nucleotide frequencies in the
human genome
A C T G
29.5 20.4 20.5 29.6
CpG Islands
Written CpG to
distinguish from
a C≡G base pair)
0.01
0.2
The occasionally dishonest casino
• Known:
– The structure of the model
– The transition probabilities
• Hidden: What the casino did
– FFFFFLLLLLLLFFFF...
• Observable: The series of die tosses
– 3415256664666153...
• What we must infer:
– When was a fair die used?
– When was a loaded one used?
• The answer is a sequence
FFFFFFFLLLLLLFFF...
Making the inference
• Model assigns a probability to each explanation of the
observation:
P(326|FFL)
= P(3|F)·P(F→F)·P(2|F)·P(F→L)·P(6|L)
= 1/6 · 0.99 · 1/6 · 0.01 · ½
• Maximum Likelihood: Determine which explanation is most likely
– Find the path most likely to have produced the observed
sequence
• Total probability: Determine probability that observed
sequence was produced by the HMM
– Consider all paths that could have produced the observed
sequence
Notation
• x is the sequence of symbols emitted by model
– xi is the symbol emitted at time i
• A path, π, is a sequence of states
– The i-th state in π is πi
• akr is the probability of making a transition from
state k to state r:
akr = Pr(π i = r | π i −1 = k )
2 2 2 … 2
0 0
… … … …
K K K … K
x1 x2 x3 xL
L
Pr(x , π ) = a0π 1 ∏ eπ i (xi ) ⋅ aπ i π i +1
i =1
The occasionally dishonest casino
x = x1 , x 2 , x 3 = 6,2,6
Pr(x , π (1) ) = a0F eF (6)aFF eF (2)aFF eF (6)
π (1) = FFF 1 1 1
= 0.5 × × 0.99 × × 0.99 ×
6 6 6
≈ 0.00227
• Termination:
Pr(x , π * ) = max(v k (L)ak 0 )
k
B 1 0 0 0
• Termination:
Pr(x ) = ∑fk (L)ak 0
k
The Backward Algorithm
• Initialization (i = L)
• Termination:
Posterior Decoding
• How likely is it that my observation comes
from a certain state?
Note: These
sequences
could lead to
other paths.
Source: http://www.csit.fsu.edu/~swofford/bioinformatics_spring05/
Pfam
• “A comprehensive collection of protein
domains and families, with a range of well-
established uses including genome
annotation.”
• Each family is represented by two multiple
sequence alignments and two profile-Hidden
Markov Models (profile-HMMs).
• A. Bateman et al. Nucleic Acids Research
(2004) Database Issue 32:D138-D141
Lab 5
I1 I2 I3 I4
M1 M2 M3
D1 D2 D3
Some recurrences
aBM1 ⋅v B (i − 1)
v M1 (i ) = eM1 (xi ) ⋅ max
aI1M1 ⋅v I1 (i − 1)
v I1 (i ) = eI1 (xi ) ⋅ aBI1 ⋅v B (i − 1)
vD (i ) = eD ( −) ⋅ aBD ⋅v B (i )
1 1 1
I1 I2 I3 I4
M1 M2 M3
D1 D2 D3
More recurrences
aI M ⋅v I (i − 1)
2 2 2
v M2 (i ) = eM2 (xi ) ⋅ max aM1M2 ⋅v M1 (i − 1)
a
D1M2 ⋅vD1 (i − 1)
v I2 (i ) = eI2 (xi ) ⋅ aM1I2 ⋅v M1 (i − 1)
vD (i ) = eD ( −) ⋅ aM D ⋅v M (i )
2 2 1 2 1 I1 I2 I3 I4
M1 M2 M3
D1 D2 D3
ε T A G ε
Begin 1 0 0 0 0
M1 0 0.35
M2 0 0.04
M3 0 0
I1 0 0.025
I2 0 0
I3 0 0
I4 0 0
D1 0.2 0
D2 0 0.07
D3 0 0
End 0 0