Motif Discovery: What Are Motifs?
Profile HMMs can also be seen as descriptions of motifs. They do not describe a fixed
string, but allow some positions to have different characters and also to have insertions
and deletions.
For this note, we will use weight matrices: motifs that are a little more general than exact
strings and a little less general than profile HMMs.
Weight matrices
Weight matrices describe motifs as strings of a fixed length, l, and assume that the
characters at each position in the string are independent of each other. That is, the
probability of the character seen at position i does not depend on the characters at
positions before or after i. A motif can therefore be defined by a matrix of position-wise
probabilities, or weights; hence the name.
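As a concrete sketch (the DNA alphabet, the example numbers, and the function name are only for illustration and not part of the note), a weight matrix can be stored as one probability distribution per position, and Pr(s | w) is then just the product of the position-wise entries:

    # A minimal sketch: a weight matrix over the DNA alphabet, stored as one
    # probability distribution per motif position.  The numbers are invented.
    w = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},     # position 1
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},     # position 2
        {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, # position 3
    ]
    l = len(w)  # the fixed motif length

    def motif_prob(s, w):
        """Pr(s | w): product of position-wise probabilities (positions independent)."""
        assert len(s) == len(w)
        p = 1.0
        for k, a in enumerate(s):
            p *= w[k][a]
        return p

    print(motif_prob("AGT", w))  # 0.7 * 0.7 * 0.25 = 0.1225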
If we want to know whether a given string was most likely generated by the motif model
or the background, we don’t quite have enough information from the probabilities above.
They tell us how likely a given string is to be produced by the models, compared to other
strings the models could produce. That is, they tell us the probabilities Pr(s | w) and Pr(s |
b) but not the probabilities Pr(w | s) and Pr(b | s).
To get the latter probabilities, we also need to know, a priori, how likely it is that a string is
from a motif or not. The motif might be very likely to generate the string s, so Pr(s | w) is
high, but if the motif is seen very rarely the posterior probability that it was the motif that
generated s can still be low. If Pr(w) is the probability of seeing the motif anywhere, and
Pr(b) = 1 - Pr(w) the probability of seeing the background, then

Pr(w | s) = Pr(s | w) Pr(w) / Pr(s)

where Pr(s) = Pr(s | w) Pr(w) + Pr(s | b) Pr(b).
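For intuition, here is a small numeric sketch with invented numbers: even when Pr(s | w) is much larger than Pr(s | b), a small prior Pr(w) can keep the posterior Pr(w | s) modest.

    # Hypothetical numbers: the motif explains s far better than the background,
    # but motif occurrences are rare a priori.
    p_s_given_w = 1e-3   # Pr(s | w)
    p_s_given_b = 1e-5   # Pr(s | b)
    p_w = 0.001          # prior Pr(w)
    p_b = 1.0 - p_w      # Pr(b) = 1 - Pr(w)

    # Total probability of the string, then Bayes' rule for the posterior.
    p_s = p_s_given_w * p_w + p_s_given_b * p_b
    p_w_given_s = p_s_given_w * p_w / p_s
    print(p_w_given_s)  # about 0.09: still fairly unlikely to be a motif occurrence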
To compute the probability Pr(s) of a whole sequence under the combined model, let F[n] denote the
probability Pr(s[1..n]) of its first n characters, filled in by a recursion over n. This recursion is
similar in purpose to the Forward algorithm you know from HMMs, and to get the probability of the
entire sequence you simply look in the final entry of the table: Pr(s) = F[L].
In exactly the same way, you can define a Backward recursion B[n] for the probability
Pr(s[n..L]) with

B[n] = Pr(b) b[s[n]] B[n+1] + Pr(w) Pr(s[n..n+l-1] | w) B[n+l].
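Here is a sketch of both recursions in Python. The forward recursion is not spelled out above, so its exact form here is an assumption: it is written as the mirror image of the backward one. Indices are 0-based in the code, with F[n] = Pr(s[:n]) and B[n] = Pr(s[n:]), so Pr(s) = F[L] = B[0]; all numbers are invented.

    def motif_prob(sub, w):
        """Pr(sub | w): product of the weight-matrix entries, one per position."""
        p = 1.0
        for k, a in enumerate(sub):
            p *= w[k][a]
        return p

    def forward(s, w, b, p_w):
        """F[n] = Pr(s[:n]); each step emits either one background character
        (probability p_b) or a whole motif occurrence of length l (probability p_w)."""
        l, L, p_b = len(w), len(s), 1.0 - p_w
        F = [0.0] * (L + 1)
        F[0] = 1.0
        for n in range(1, L + 1):
            F[n] = p_b * b[s[n - 1]] * F[n - 1]
            if n >= l:
                F[n] += p_w * motif_prob(s[n - l:n], w) * F[n - l]
        return F

    def backward(s, w, b, p_w):
        """B[n] = Pr(s[n:]); the backward recursion from the note, 0-based."""
        l, L, p_b = len(w), len(s), 1.0 - p_w
        B = [0.0] * (L + 1)
        B[L] = 1.0
        for n in range(L - 1, -1, -1):
            B[n] = p_b * b[s[n]] * B[n + 1]
            if n + l <= L:
                B[n] += p_w * motif_prob(s[n:n + l], w) * B[n + l]
        return B

    # Toy example with made-up numbers; F[L] and B[0] agree on Pr(s).
    b = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    w = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
    s = "TAGA"
    F, B = forward(s, w, b, p_w=0.1), backward(s, w, b, p_w=0.1)
    print(F[len(s)], B[0])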
Training
Below we will not cover all aspects of training motifs, since a lot of the techniques
typically used are beyond the scope of this class, but we will see some basic ideas.
Given n observed motif occurrences s_1, ..., s_n, the weight matrix is estimated by maximizing
the likelihood Pr(s_1, ..., s_n | w), which, since we assume that the positions in the motif are
independent, can be written

Pr(s_1, ..., s_n | w) = ∏_{i=1..n} Pr(s_i | w) = ∏_{i=1..n} ∏_{k=1..l} w[s_i[k], k] = ∏_{k=1..l} ∏_{i=1..n} w[s_i[k], k]
and optimized column-wise by setting w[a, k] = n(a, k) / n, where n(a, k) denotes the number of
a's seen at position k across the n sequences. (To see this, just observe that each column is an
independent multinomial distribution.)
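A short sketch of that column-wise estimate, on a handful of invented, already-aligned motif occurrences:

    from collections import Counter

    # Hypothetical aligned motif occurrences, one per sequence, all of length l.
    occurrences = ["ACGT", "ACGA", "TCGT", "ACGT"]
    n, l = len(occurrences), len(occurrences[0])

    # w[a, k] = n(a, k) / n: each column is an independent multinomial estimate.
    w = []
    for k in range(l):
        counts = Counter(s[k] for s in occurrences)
        w.append({a: counts[a] / n for a in "ACGT"})

    print(w[0])  # {'A': 0.75, 'C': 0.0, 'G': 0.0, 'T': 0.25}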
If we condition on where in the sequences the motifs are found, and assume that there is
exactly one motif per sequence, we do not need the full generative model for the
sequences but only the probabilities Pr(s[k] | b) and Pr(s | w). All characters before and
after the motif occurrence are explained by the background, and the motif occurrence
itself, of course, by the motif model.
We will typically not know where the motif occurrences are in the sequences, so we need
to estimate this. Often we are not even interested in knowing where they are, since we
only want to train the weight matrix from the sequences to use it elsewhere, but we still
need to deal with their locations in the sequences as nuisance parameters.
We could approach the problem by estimating w and the vector of start indices, i, jointly,
maximizing Pr(D | i, w) over i and w, which would be similar to the Viterbi
training you saw for HMMs, or we could sum over all the possibilities for i to get the
marginal likelihood for w
Pr(D | w) = Σ_i Pr(D | i, w) Pr(i)
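A sketch of this marginal likelihood for a single sequence, assuming exactly one occurrence per sequence and a uniform prior Pr(i) over start positions (both assumptions, and all numbers and function names, are just for the example):

    def motif_prob(sub, w):
        """Pr(sub | w) for a window of exactly the motif length."""
        p = 1.0
        for k, a in enumerate(sub):
            p *= w[k][a]
        return p

    def seq_prob_given_start(s, i, w, b):
        """Pr(s | i, w, b): background before and after, motif from i to i + l - 1."""
        l = len(w)
        p = motif_prob(s[i:i + l], w)
        for a in s[:i] + s[i + l:]:
            p *= b[a]
        return p

    def marginal_likelihood(s, w, b):
        """Pr(s | w, b) = sum over i of Pr(s | i, w, b) Pr(i), with Pr(i) uniform."""
        l = len(w)
        starts = range(len(s) - l + 1)
        return sum(seq_prob_given_start(s, i, w, b) for i in starts) / len(starts)

    b = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    w = [{"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05},
         {"A": 0.05, "C": 0.8, "G": 0.1, "T": 0.05}]
    print(marginal_likelihood("TTACG", w, b))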
Remember now that expectation really just means an average weighted by a probability
distribution, so
E_{i | D, w_t, b_t}[Pr(D, i | w, b)] = Σ_i Pr(i | D, w_t, b_t) Pr(D, i | w, b).
If we couldn't actually sum over all possible i values, but instead saw samples of them, we
would estimate the expectation by just taking the average. The law of large numbers
assures us that we would get close to the real expectation if we had enough samples, but
of course we would only get exactly the right number if we sampled an infinite number of times.
In the actual algorithm we don’t need to sample an infinite number of observations
because we can directly compute the expectation, but for the intuition of the algorithm let
us just pretend that we sample observations of the sequences.
So we pretend that we sample a number of motif positions i_m ~ Pr(i | D, b_t, w_t), for m =
1, ..., M, and then our estimate of the expectation would be
E_{i | D, w_t, b_t}[Pr(D, i | w, b)] ≈ (1/M) Σ_{m=1..M} Pr(D, i_m | w, b).
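A toy sketch of that idea, using an invented posterior over three candidate start positions: the Monte Carlo average over sampled i_m approaches the exact posterior-weighted sum as M grows.

    import random

    random.seed(0)

    # An invented posterior Pr(i | D, w_t, b_t) over three candidate start positions,
    # and the corresponding values Pr(D, i | w, b) we want to average.
    posterior = {0: 0.2, 1: 0.7, 2: 0.1}
    joint = {0: 1e-6, 1: 5e-6, 2: 2e-6}

    # Exact expectation: sum over i of Pr(i | D, w_t, b_t) Pr(D, i | w, b).
    exact = sum(posterior[i] * joint[i] for i in posterior)

    # Monte Carlo estimate: sample i_m from the posterior and average Pr(D, i_m | w, b).
    M = 100_000
    samples = random.choices(list(posterior), weights=list(posterior.values()), k=M)
    estimate = sum(joint[i] for i in samples) / M

    print(exact, estimate)  # the two numbers should be close for large M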
The reason we go this roundabout way is that you already know how to pick the most likely
values (that is, the maximum likelihood estimates) of w and b if you had a number of
observations from the background and motif models.
If you had observed N characters from the background model, you would say that the
probability of seeing character a from the background process would be b[a] = N(a)/N,
where N(a) is the number of times you saw a. Similarly, you would say that the best probability
for seeing character a at position k in the motif is w[a, k] = N(a, k)/N, if you saw a at
position k a total of N(a, k) times and saw the motif N times.
In other words, had you observed samples of the process, you would know how to find the
maximum likelihood estimates. They would only be estimates, since the randomness in your
observations means you could be slightly off the true parameters, but you know how to
maximize the likelihood in this case.
Optimizing through samples of the process is essentially the training by counting you did for
HMMs. There you assume that you have a number of observations from the process and
you use these observations to pick the most likely parameters.
We don't need to sample in this algorithm, because we can directly compute how often
you would expect to see, say, character a at position k in the motif. Using these expected
values instead of sampled values gives us exactly the values that maximize Q.
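To tie the pieces together, here is a hedged sketch of one such update in the style of expectation-maximization, under the assumptions used above (one occurrence per sequence, a uniform prior on the start position, and a fixed background); the sequences, the starting matrix, and the helper names are invented. The E-step computes the posterior over start positions and accumulates the expected counts N(a, k); the M-step normalizes them into a new weight matrix.

    ALPHABET = "ACGT"

    def motif_prob(sub, w):
        """Pr(sub | w) for a window of the motif length."""
        p = 1.0
        for k, a in enumerate(sub):
            p *= w[k][a]
        return p

    def seq_prob_given_start(s, i, w, b):
        """Pr(s | i, w, b): background outside the motif, weight matrix inside."""
        l = len(w)
        p = motif_prob(s[i:i + l], w)
        for a in s[:i] + s[i + l:]:
            p *= b[a]
        return p

    def em_step(seqs, w, b):
        """One E-step plus M-step: expected counts N(a, k), then w[a, k] = N(a, k) / n."""
        l = len(w)
        expected = [{a: 0.0 for a in ALPHABET} for _ in range(l)]
        for s in seqs:
            starts = list(range(len(s) - l + 1))
            likes = [seq_prob_given_start(s, i, w, b) for i in starts]
            total = sum(likes)
            for i, like in zip(starts, likes):
                post = like / total                # Pr(i | s, w, b), uniform prior on i
                for k in range(l):
                    expected[k][s[i + k]] += post  # expected count N(a, k)
        n = len(seqs)
        return [{a: expected[k][a] / n for a in ALPHABET} for k in range(l)]

    # Invented data and a nearly uniform starting matrix.
    seqs = ["TTACGT", "ACGTTT", "GTACGA", "TTTACG"]
    b = {a: 0.25 for a in ALPHABET}
    w = [{"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
         {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
         {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2}]

    for _ in range(20):  # iterate until the matrix (roughly) stops changing
        w = em_step(seqs, w, b)
    print(w)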