
Motif discovery

What are motifs?


Sequences that occur unusually often in a DNA sequence typically do so because their occurrence has some functional effect. One example is regulatory elements, where the expression of a gene is determined by certain proteins that bind to DNA close to the genes they regulate. The genes they affect are the genes where the proteins bind, and it is the DNA sequence near these genes that determines whether they bind there or not.
What we will formally consider a motif when we derive algorithms for their discovery varies from application to application. The simplest form would just be a fixed string. A string that occurs more frequently than expected by chance is likely to have a function and is therefore interesting to find. Exact string matches are not that common in biology, though, and more often we are interested in classes of similar, but not identical, strings. Below is an example of this, visualized as a so-called sequence logo. Here we have a sequence of length 50 where an instance of the motif should match on most of the positions, but at some positions it can match with more than one character.

Profile HMMs can also be seen as descriptions of motifs. They do not describe a fixed
string, but allow some positions to have different characters and also to have insertions
and deletions.

For this note, we will use weight matrices: motifs that are a little more general than exact strings and a little less general than profile HMMs.
Weight matrices
Weight matrices describe motifs as strings of a fixed length, l, and assume that the characters at each position in the strings are independent of each other. That is, the probability of the character seen at position i does not depend on the characters at positions before or after i. A motif can therefore be defined by a matrix of position-wise probabilities, or weights, hence the name.

Probabilities of emitting from a motif or not


For a weight matrix, w, each column corresponds to one position in the motif, and the rows of that column give us the probability of each character. The probability of seeing character a at position k is therefore w[a,k]. Since we assume independence between positions, the probability of generating a string, s, is just the product of the position-wise probabilities:
Pr(s | w) = \prod_{k=1}^{l} w[s[k], k]

To be able to compare occurrences of motifs to just random strings - remember we said that motifs are something that occurs more often than expected by chance - we need a model of “random strings” to compare with. A simple model here is to assume that all strings that are not in the motif also have independent positions - so the probability of the character that occurs at one position does not depend on which characters occurred at other positions - and to assume that all positions have the same probability of seeing a given character. Let b[a] be the probability of seeing character a; then the probability of seeing a string of length l from the background probability, i.e. not the motif, is given by
Pr(s | b) = \prod_{k=1}^{l} b[s[k]]
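To make these two probabilities concrete, here is a minimal sketch in Python. The data layout (a dict of per-position probability lists for w and a dict of character probabilities for b) and the function names are illustrative choices, not something prescribed by the note.

# Minimal sketch: Pr(s | w) and Pr(s | b) for a string s of the motif length l.

def pr_given_motif(s, w):
    """Pr(s | w) = product over k of w[s[k], k]."""
    p = 1.0
    for k, a in enumerate(s):
        p *= w[a][k]
    return p

def pr_given_background(s, b):
    """Pr(s | b) = product over k of b[s[k]]."""
    p = 1.0
    for a in s:
        p *= b[a]
    return p

# Made-up numbers for a motif of length l = 3:
w = {"A": [0.7, 0.1, 0.1], "C": [0.1, 0.7, 0.1],
     "G": [0.1, 0.1, 0.7], "T": [0.1, 0.1, 0.1]}
b = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(pr_given_motif("ACG", w), pr_given_background("ACG", b))  # 0.343 vs ~0.0156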

If we want to know whether a given string was most likely generated by the motif model
or the background, we don’t quite have enough information from the probabilities above.
They tell us how likely a given string is to be produced by the models, compared to other
strings the models could produce. That is, they tell us the probabilities Pr(s | w) and Pr(s |
b) but not the probabilities Pr(w | s) and Pr(b | s).
To get the latter probabilities, we also need to know, a priori, how likely it is that a string is
from a motif or not. The motif might be very likely to generate the string s, so Pr(s | w) is
high, but if the motif is seen very rarely the posterior probability that it was the motif that
generated s can still be low. If Pr(w) is the probability of seeing the motif anywhere, and Pr(b) = 1 - Pr(w) the probability of seeing the background, then

Pr(w | s) = \frac{Pr(s | w) Pr(w)}{Pr(s)}

where Pr(s) = Pr(s | w) Pr(w) + Pr(s | b) Pr(b).
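Continuing the sketch above, the posterior follows directly from the two emission probabilities and the prior Pr(w); the numbers in the example are made up to illustrate the point about rare motifs.

def posterior_motif(pr_s_w, pr_s_b, prior_w):
    """Pr(w | s) = Pr(s | w) Pr(w) / (Pr(s | w) Pr(w) + Pr(s | b) Pr(b))."""
    prior_b = 1.0 - prior_w
    pr_s = pr_s_w * prior_w + pr_s_b * prior_b
    return pr_s_w * prior_w / pr_s

# A string the motif explains well (Pr(s | w) = 0.343, Pr(s | b) ~ 0.0156) still
# gets a modest posterior if the motif itself is rare, say Pr(w) = 0.01:
print(posterior_motif(0.343, 0.015625, 0.01))  # ~0.18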


A generative model from background and motif models
In the applications where we are interested in motifs, we are typically considering longer
sequences, s, of some length L longer than the motif length, l. To deal with these
sequences, we will need a model for them as well.
One such model is this. We imagine that we repeatedly sample a model to generate a sub-sequence from. With probability Pr(b) we choose the background model and generate a single character from it, so a is generated with probability b[a]. With probability Pr(w) we choose the motif and generate a sequence a_1 a_2 a_3 ... a_l of length l, with probability w[a_1,1] w[a_2,2] w[a_3,3] ... w[a_l,l].
We can compute the probability of generating a sequence s from this model recursively. If
we let F[n] denote the probability Pr(s[1..n]), then
F[n] = Pr(b) b[s[n]] F[n-1] + Pr(w) Pr(s[n-l+1..n] | w) F[n-l]
with some appropriate border conditions at the beginning of s.

This recursion is similar in purpose to the Forward algorithm you know from HMMs, and
to get the probability of the entire sequence you simply look in the final entry of the table:
Pr(s) = F[L].
In exactly the same way, you can define a backward recursion B[n] for the probability Pr(s[n..L]) with
B[n] = Pr(b) b[s[n]] B[n+1] + Pr(w) Pr(s[n..n+l-1] | w) B[n+l].
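A sketch of how both recursions could be implemented is shown below, using the boundary conventions F[0] = 1 and B[L+1] = 1 and skipping the motif term whenever a full window of length l does not fit. These conventions are one reasonable reading of the "appropriate border conditions" mentioned above, not the only one; positions are 1-indexed in the math and 0-indexed in the Python string.

def motif_window_prob(s, start, w, l):
    """Pr of the length-l window of s beginning at Python index start under the motif w."""
    p = 1.0
    for k in range(l):
        p *= w[s[start + k]][k]
    return p

def forward(s, w, b, prior_w, l):
    """F[n] = Pr(s[1..n]); F has length L+1 with F[0] = 1."""
    prior_b = 1.0 - prior_w
    L = len(s)
    F = [0.0] * (L + 1)
    F[0] = 1.0
    for n in range(1, L + 1):
        F[n] = prior_b * b[s[n - 1]] * F[n - 1]
        if n >= l:  # a motif ending at position n fits
            F[n] += prior_w * motif_window_prob(s, n - l, w, l) * F[n - l]
    return F

def backward(s, w, b, prior_w, l):
    """B[n] = Pr(s[n..L]); B has length L+2 with B[L+1] = 1."""
    prior_b = 1.0 - prior_w
    L = len(s)
    B = [0.0] * (L + 2)
    B[L + 1] = 1.0
    for n in range(L, 0, -1):
        B[n] = prior_b * b[s[n - 1]] * B[n + 1]
        if n + l - 1 <= L:  # a motif starting at position n fits
            B[n] += prior_w * motif_window_prob(s, n - 1, w, l) * B[n + l]
    return B

With these conventions, F[L] and B[1] both compute the probability of the full sequence, which gives a convenient sanity check.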

Finding occurrences of the motif


With this legwork done, we can now ask questions such as “what is the probability that the motif is found at index n of s?”.
If we let {n} denote all sequences generated with an occurrence of the motif at index n, then that question can be formally specified as finding the probability Pr({n} | s), that is, given that the observed string is s, what is the probability that we observed a sequence that had the motif at index n.
Since Pr({n} | s) = Pr({n}, s) / Pr(s), where we know that Pr(s) = F[L], we simply need to compute the joint probability of generating the sequence s while using the motif at index n. This is simply the probability of generating s[1..n-1] from the full model, then selecting the motif and generating s[n..n+l-1] from it, and then generating the remaining string, that is the string s[n+l..L], from the general model. In other words
Pr({n}, s) = F[n-1] Pr(w) Pr(s[n..n+l-1] | w) B[n+l].
We can compute the probability Pr({n} | s) for all indices in s and classify as motif occurrences all indices where this probability is higher than 50%, since these are the indices where the motif is more likely to have occurred than the background sequence process.
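Building on the forward, backward, and motif_window_prob functions from the sketch above, the per-position posteriors could be computed as follows; the 50% cut-off from the text is noted in the comment.

def motif_posteriors(s, w, b, prior_w, l):
    """Pr({n} | s) for every start position n = 1..L-l+1 where a full motif fits."""
    F = forward(s, w, b, prior_w, l)
    B = backward(s, w, b, prior_w, l)
    L = len(s)
    pr_s = F[L]  # = Pr(s)
    posts = {}
    for n in range(1, L - l + 2):
        joint = F[n - 1] * prior_w * motif_window_prob(s, n - 1, w, l) * B[n + l]
        posts[n] = joint / pr_s  # Pr({n}, s) / Pr(s)
    return posts

# Positions n with posts[n] > 0.5 would be classified as motif occurrences.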

Training
Below we will not cover all aspects of training motifs, since a lot of the techniques
typically used are beyond the scope of this class, but we will see some basic ideas.

Training the weight matrix from motif occurrences


If we have a set of sequences S = {s_1, s_2, …, s_n}, all generated independently from the motif, i.e. s_i ~ Pr(s | w), then we can train w by maximizing the likelihood of w given S
L(w | S) = Pr(S | w) = \prod_{i=1}^{n} Pr(s_i | w)

which, since we assume that the positions in the motif are independent, can be written
\prod_{i=1}^{n} Pr(s_i | w) = \prod_{i=1}^{n} \prod_{k=1}^{l} w[s_i[k], k] = \prod_{k=1}^{l} \prod_{i=1}^{n} w[s_i[k], k]

and optimized column-wise by setting w[a,k] = n(a,k) / n, where n(a,k) denotes the number of times a is seen at index k in the n sequences. (To see this, just observe that each column is an independent multinomial distribution.)
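A small sketch of this counting estimator, assuming the motif occurrences are given as equal-length strings; the optional pseudocount is my own addition to avoid zero probabilities and is not part of the derivation above.

def train_weight_matrix(motif_strings, alphabet="ACGT", pseudocount=0.0):
    """w[a, k] = (n(a, k) + pseudocount) / (n + |alphabet| * pseudocount)."""
    n = len(motif_strings)
    l = len(motif_strings[0])
    counts = {a: [0.0] * l for a in alphabet}
    for s in motif_strings:
        for k, a in enumerate(s):
            counts[a][k] += 1.0
    denom = n + len(alphabet) * pseudocount
    return {a: [(c + pseudocount) / denom for c in counts[a]] for a in alphabet}

print(train_weight_matrix(["ACG", "ACG", "ATG", "CCG"]))  # e.g. w[A,1] = 3/4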

Getting the probability of seeing a motif versus seeing the background


To get the probabilities for a priori seeing a sequence from the background process or the
motif process, the probabilities Pr(b) and Pr(w), we can again maximize the likelihood of
the sequence, assuming we know how many times we see the motif in the sequence.
If we have seen the motif m times, we must have seen the background process M = L - ml times, and the maximum likelihood estimates for these probabilities are Pr(w) = m/(M+m) and Pr(b) = M/(M+m).
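As a quick worked example with made-up numbers: if a sequence of length L = 1000 contains m = 5 occurrences of a motif of length l = 10, the background accounts for M = 1000 - 50 = 950 positions, so Pr(w) = 5/955 ≈ 0.005 and Pr(b) = 950/955 ≈ 0.995.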
NB! In the paper you will see Pr(w) used in a different way. There it denotes the
probability of the weights in the matrix being particular values. This is a Bayesian
approach to training w from data and beyond the scope of this class.
(More) general case
Consider finally a more general case where we have n sequences of varying lengths and we know that each sequence has exactly one occurrence of the motif.

If the sequences have lengths L_1, L_2, …, L_n, with L = L_1 + L_2 + … + L_n, we know we have n motif occurrences taking up nl of the total length, leaving M = L - nl for the background model, from which we can get Pr(b) and Pr(w) as above.
We will assume that the n sequences are independent, so the probability/likelihood of the
entire data set is the product of the probabilities/likelihoods of the individual sequences.
Pr(D | i, w) = \prod_{k=1}^{n} Pr(s_k | i_k, w).

If we condition on where in the sequences the motifs are found, and on each sequence having exactly one motif, we do not need the full generative model for the sequences but only the probabilities Pr(s[k] | b) and Pr(s | w). All characters before and after the motif occurrence will be explained by the background, and the motif occurrence itself of course by the motif model.
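A sketch of the per-sequence term Pr(s_k | i_k, w) under this factorization is shown below, reusing the motif_window_prob helper from the earlier sketch; motif starts are taken to be 1-indexed, matching the notation in the note.

def pr_seq_given_position(s, i, w, b, l):
    """Pr(s | i, w, b): background before and after the motif, motif model at start i."""
    p = 1.0
    for pos, a in enumerate(s, start=1):
        if pos < i or pos >= i + l:  # positions outside the motif window
            p *= b[a]
    return p * motif_window_prob(s, i - 1, w, l)

def pr_dataset_given_positions(seqs, starts, w, b, l):
    """Pr(D | i, w, b) as a product over the independent sequences."""
    p = 1.0
    for s, i in zip(seqs, starts):
        p *= pr_seq_given_position(s, i, w, b, l)
    return p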
We will typically not know where the motif occurrences are in the sequences, so we need to estimate this. Often we are not interested in knowing where they are either, since we only want to train the weight matrix from the sequences to use it elsewhere, but we still need to deal with their locations in the sequences as nuisance parameters.
We could approach the problem by estimating w and the vector of start indices, i, jointly, maximizing Pr(D | i, w) over both, which would be similar to the Viterbi training you saw for HMMs, or we could sum over all the possibilities for i to get the marginal likelihood for w
Pr(D | w) = \sum_{i} Pr(D | i, w) Pr(i)

which is similar to the Baum-Welch approach you saw for HMMs.


Assuming that the motifs are equally likely to occur at any position in the sequences,
specifying Pr(i) is straightforward, and so is computing Pr(D | w). Optimizing it with
respect to w, however, is not, and numerical optimization is necessary.
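Computing the marginal likelihood itself is simple under a uniform Pr(i); a sketch follows, building on pr_seq_given_position from above. The difficulty, as the text notes, lies in maximizing this quantity over w, not in evaluating it.

def pr_seq_marginal(s, w, b, l):
    """Pr(s | w, b) = sum over starts i of Pr(s | i, w, b) Pr(i), with Pr(i) uniform."""
    n_positions = len(s) - l + 1
    total = sum(pr_seq_given_position(s, i, w, b, l) for i in range(1, n_positions + 1))
    return total / n_positions

def pr_dataset_marginal(seqs, w, b, l):
    """Pr(D | w, b) as the product of per-sequence marginal likelihoods."""
    p = 1.0
    for s in seqs:
        p *= pr_seq_marginal(s, w, b, l)
    return p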
One such approach is the expectation-maximization (EM) algorithm. As you may recall from
the lectures on HMMs, this is an algorithm for maximizing the likelihood Pr(X | θ) in cases where you can calculate Pr(X, Z | θ) but where you have not observed Z and need to sum over it. In this case, θ would be our vector b and matrix w, X would be our sequences, D,
and Z would be the indices where the motifs are found. Since i does not depend on w,
optimizing Pr(D | i, w) with respect to w is the same as optimizing Pr(D, i | w), so the EM
algorithm is applicable here. (I’ve left out b in these probabilities since the paper does that,
but it should really be considered as well).
The idea in the EM algorithm is to iteratively update the parameters, which for this application would be b and w, and in each iteration we define a function Q(w, b | w_t, b_t), where w_t and b_t are the parameters from iteration t. Q should be the expected (log)likelihood, and the parameters for iteration t+1 should be the parameters that maximize Q.
The paper hints at how the EM algorithm would look for this application. For the HMMs you saw the math worked out in detail. Below I will instead give some intuition about how it works, to present it to you in a slightly different way.
We will use as Q the expected likelihood. That is, we have the likelihood of the parameters, or the probability of the observations given the parameters, Pr(D, i | w, b), and Q is the expected value of this if we consider i a random variable

Q(w, b | w_t, b_t) = E_{i | D, w_t, b_t}[Pr(D, i | w, b)].

Remember now that expectation really just means an average weighted by a probability distribution, so
E_{i | D, w_t, b_t}[Pr(D, i | w, b)] = \sum_{i} Pr(i | D, w_t, b_t) Pr(D, i | w, b).

If we couldn’t actually sum over all possible i values, but instead saw samples of them, we
would estimate the expectation by just taking the average. The law of large numbers
ensures that we would get close to the real expectation if we had enough samples, but
of course we would only get the right number if we sampled an infinite number of times.
In the actual algorithm we don’t need to sample an infinite number of observations
because we can directly compute the expectation, but for the intuition of the algorithm let
us just pretend that we sample observations of the sequences.
So we pretend that we sample a number of motif positions i_m ~ Pr(i | D, b_t, w_t), for m = 1,...,M, and then our estimate of the expectation would be
E_{i | D, w_t, b_t}[Pr(D, i | w, b)] \approx \frac{1}{M} \sum_{m=1}^{M} Pr(D, i_m | w, b).

The reason we go this roundabout way is that you already know how to pick the most likely values (in other words, the maximum likelihood estimates) of w and b if you had a number of observations from the background and motif models.
If you had observed N characters from the background model, you would say that the
probability of seeing character a from the background process would be b[a] = N(a)/N,
where N(a) is the number of times you saw a. Similarly, you would say the best probability
for seeing character a at position k in the motif would be w[a,k] = N(a,k)/N if you saw a at
position k N(a,k) times and you saw the motif N times.
In other words, had you observed samples of the process, you would know how to find the maximum likelihood estimates. They would only be estimates, since the randomness in your observations means you could be slightly off the true parameters, but you know how to maximize the likelihood in this case.
Optimizing through samples of the process is essentially the training by counting you did for
HMMs. There you assume that you have a number of observations from the process and
you use these observations to pick the most likely parameters.
We don’t need to sample in this algorithm, because we can directly compute how often you would expect to see, say, character a at position k in the motif. Using these expected values instead of sampled values will give us the exact values that maximize Q.
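To tie the intuition together, below is a minimal sketch of one EM iteration for this one-occurrence-per-sequence setting with a uniform Pr(i). The E-step computes the position posteriors Pr(i | s, w_t, b_t) for every sequence, and the M-step uses expected counts where the sampled counts would otherwise go. The function name, the pseudocount, and the data layout are illustrative choices, not taken from the paper.

def em_iteration(seqs, w, b, l, alphabet="ACGT", pseudocount=1.0):
    """One EM update of (w, b) for sequences with exactly one motif occurrence each."""
    # Expected counts, initialized with a small pseudocount to avoid zeros.
    motif_counts = {a: [pseudocount] * l for a in alphabet}
    bg_counts = {a: pseudocount for a in alphabet}
    for s in seqs:
        # E-step: with a uniform prior on the start position, the posterior of a
        # window is proportional to prod_k w[s[i+k], k] / b[s[i+k]], because the
        # background factors outside the window cancel in the normalization.
        n_positions = len(s) - l + 1
        weights = []
        for i in range(n_positions):
            ratio = 1.0
            for k in range(l):
                a = s[i + k]
                ratio *= w[a][k] / b[a]
            weights.append(ratio)
        total = sum(weights)
        posts = [x / total for x in weights]
        # Accumulate expected counts: every character starts in the background,
        # and the expected motif occupancy is moved over to the motif counts.
        for a in s:
            bg_counts[a] += 1.0
        for i, z in enumerate(posts):
            for k in range(l):
                a = s[i + k]
                motif_counts[a][k] += z
                bg_counts[a] -= z
    # M-step: normalize the expected counts into new parameters.
    new_w = {a: [0.0] * l for a in alphabet}
    for k in range(l):
        col_total = sum(motif_counts[a][k] for a in alphabet)
        for a in alphabet:
            new_w[a][k] = motif_counts[a][k] / col_total
    bg_total = sum(bg_counts.values())
    new_b = {a: bg_counts[a] / bg_total for a in alphabet}
    return new_w, new_b

Iterating this update until the parameters stop changing gives a local maximum of the marginal likelihood; as with Baum-Welch for HMMs, different starting points can end up in different local maxima, so several random restarts are commonly used.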
