Non Unif Rand Gen
Alain Denise
Université Paris-Sud 11
All content following this page was uploaded by Alain Denise on 29 May 2014.
Abstract
Let $L$ be an algebraic language on an alphabet $X = \{x_1, x_2, \ldots, x_k\}$, and $n$ a positive integer. We consider the problem of generating at random words of $L$ with respect to a given distribution of the number of occurrences of the letters. We consider two variants of the problem. In the first one, a vector of natural numbers $(n_1, n_2, \ldots, n_k)$ such that $n_1 + n_2 + \cdots + n_k = n$ is given, and the words must be generated uniformly among the set of words of $L$ which contain exactly $n_i$ letters $x_i$ ($1 \le i \le k$). The second variant consists, given a vector $v = (v_1, \ldots, v_k)$ of positive real numbers such that $v_1 + \cdots + v_k = 1$, in generating at random words among the whole set of words of $L$ of length $n$, in such a way that the expected number of occurrences of any letter $x_i$ equals $n v_i$ ($1 \le i \le k$), and two words having the same distribution of letters have the same probability of being generated. For this purpose, we design and study two variants of the recursive method which is classically employed for the uniform generation of combinatorial structures. This type of "controlled" non-uniform generation is of great interest in the statistical study of genomic sequences.
1 Introduction
The problem of uniform random generation of combinatorial structures has been extensively studied in the past few years. To our knowledge, random generation according to a given distribution has received much less attention (except for random numbers, for which there is an abundant literature; see [3] for example). We are interested here in a problem of this type. Let $L$ be a language on an alphabet $X = \{x_1, x_2, \ldots, x_k\}$, and $n$ an integer. Let us denote by $L_n$ the set of words of $L$ of length $n$. The problem consists in generating words of $L_n$ while respecting a distribution of the letters given by a vector of $k$ positive numbers. We consider two variants:
1. Generation in exact frequencies. The distribution of the number of letters of any word must respect exactly a given vector of integers $(n_1, \ldots, n_k)$. In other words, we generate words uniformly at random in the subset of $L_n$ constituted of all the words $w \in L_n$ such that $|w|_{x_i} = n_i$ for all $i \in \{1, 2, \ldots, k\}$.
2. Generation in expected frequencies. The words must respect on average a distribution given by a vector $v = (v_1, \ldots, v_k)$ such that $v_1 + \cdots + v_k = 1$. More precisely, we generate words at random in such a way that
(a) any word of $L$ has a non-zero probability of being generated;
(b) for all $i \in \{1, 2, \ldots, k\}$, the frequency of the letter $x_i$ in the words is equal (or asymptotically equivalent) to $v_i$: if $p(w)$ is the probability that the word $w$ is generated by the algorithm, one must have convergence for the frequencies,
$$\frac{1}{n} \sum_{w \in L_n} |w|_{x_i}\, p(w) \sim v_i;$$
(c) two words having the same distribution of letters have the same probability of being generated.
These generation schemes lead to applications in genomics. An important problem in the field of analysis of genomes consists in determining whether some properties observed in genomic sequences are biologically significant or not. The main idea is as follows: if a property observed in natural sequences is really relevant from a biological point of view, one should not observe it significantly in random sequences. Thus for example, in order to evaluate the biological significance of similarities between protein sequences of different organisms, one compares the scores of their alignments with scores obtained on random sequences [15, 14]. Similarly, the comparison of the frequencies of certain motifs in natural and random sequences can contribute to determining whether these motifs are biologically relevant [7, 22, 23].
Traditionally, sequences are generated according to purely statistical considerations. The foundations of these models and the first generation algorithms were described in [8]. The parameters which are taken into account are the frequencies of nucleotides (letters in DNA) or oligonucleotides (factors) of fixed length $l$ observed in a natural sequence which is taken as reference. Thus for example, for $l = 2$ and the natural sequence aatgtaacgt, the frequencies are aa = 2, at = 1, tg = 1, gt = 2, ta = 1, ac = 1 and cg = 1. Random sequences are generated with respect to these frequencies, either exactly (generation in exact frequencies) or on average (generation in expected frequencies).
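The factor counts of the example can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `factor_counts` is ours and does not come from the cited works.

```python
from collections import Counter

def factor_counts(sequence: str, l: int) -> Counter:
    """Count every factor (contiguous substring) of length l, overlaps included."""
    return Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))

# Reference sequence and factor length taken from the example in the text.
counts = factor_counts("aatgtaacgt", 2)
```

Overlapping occurrences are counted, which is what makes aa and gt appear twice in the reference sequence.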
In this last case, the generation is carried out according to a Markov chain of order $l - 1$: the sequence is generated letter by letter and, at any step, the probability of generating a given letter depends on the $l - 1$ previous letters. The process is clearly linear in the length of the sequence. In fact, in this model, recent works [17, 20] make it possible to compute analytically some parameters (expected number of occurrences, variance, etc.) concerning the frequencies of appearance of given motifs in random sequences. This avoids the generation of a large number of sequences, and gives exact values of the required parameters. However, random generation remains useful when the studied properties do not relate to relatively simple motifs.
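A minimal sketch of such a Markov generator is given below. It assumes $l \ge 2$, estimates the transition probabilities from the factor counts of a reference sequence, and starts from the first $l - 1$ letters of the reference; this starting rule and the function names are our own assumptions, not part of the cited algorithms.

```python
import random
from collections import Counter, defaultdict

def transition_table(reference: str, l: int):
    """Map each context of l-1 letters to the counts of the letters that follow it."""
    table = defaultdict(Counter)
    for i in range(len(reference) - l + 1):
        factor = reference[i:i + l]
        table[factor[:-1]][factor[-1]] += 1
    return table

def markov_sequence(reference: str, l: int, length: int, rng: random.Random) -> str:
    """Generate `length` letters with a Markov chain of order l-1 (requires l >= 2)."""
    table = transition_table(reference, l)
    seq = list(reference[:l - 1])        # assumption: start from the reference prefix
    while len(seq) < length:
        context = "".join(seq[-(l - 1):])
        followers = table.get(context)
        if not followers:                # context never followed by a letter in the reference
            break
        letters, weights = zip(*followers.items())
        seq.append(rng.choices(letters, weights=weights)[0])
    return "".join(seq[:length])

sample = markov_sequence("aatgtaacgt", 2, 30, random.Random(0))
```

Each step costs a bounded amount of work for a fixed alphabet, which matches the linear behaviour stated above.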
Random generation in exact frequencies is a more difficult problem if one wishes the generation to be uniform among all the allowed sequences. This problem is solved in linear expected time in [13]. One of the main ideas is the fact, stated in [8], that it can be reduced to the generation of a random Eulerian trail in a directed graph. An implementation of the algorithm is presented in [1].
These methods do not handle syntactic constraints: the words are generated in $X^*$. But it is of interest to generate words in particular languages because genomic sequences can be syntactically constrained. In this context, the aim of our work is to generate more "realistic" random sequences by taking into account syntactic criteria as well as statistical ones.
Our approach is based on the so-called recursive method, which was initiated by Nijenhuis and Wilf [18] and then generalized and formalized by Flajolet, Zimmermann and Van Cutsem [9]. Section 2 is devoted to a short presentation of this methodology within the framework of algebraic languages. We present in section 3 a simple adaptation which makes it possible to generate words in exact frequencies of the letters. In section 4, we focus on generation in expected frequencies.
2 Uniform generation
The general methodology of uniform random generation of decomposable structures (which include algebraic languages) is presented in detail in [9]. Some variations which deal with the special case of context-free languages are studied in [16] and [10]. We present here a quite simple version of the method. The reader will find more powerful alternatives in the referenced papers.
The starting point is a non-ambiguous context-free grammar in Chomsky Normal Form: any right member of a rule is either the empty word $\varepsilon$, or a letter of $X$, or a non-terminal symbol, or a product (concatenation) of two non-terminal symbols. Moreover, any non-terminal symbol can be the left member of at most two rules; and in this case each right member is a single non-terminal symbol.
The first stage consists in counting words: for any non-terminal symbol $T$ and for all $0 \le j \le n$, the number $T(j)$ of words of length $j$ which derive from $T$ is computed (recall that $n$ is the length of the words to be generated). This can be done by using recurrence relations which result directly from the rules of the grammar:
$$T \to \varepsilon \;\Rightarrow\; T(0) = 1, \tag{1}$$
$$T \to x_i \;\Rightarrow\; T(1) = 1, \tag{2}$$
$$T \to T' \mid T'' \;\Rightarrow\; T(n) = T'(n) + T''(n), \tag{3}$$
$$T \to T'\,T'' \;\Rightarrow\; T(n) = \sum_{n' + n'' = n} T'(n')\, T''(n''). \tag{4}$$
Note that, since the generating series of any context-free language is holonomic, these same coefficients can be calculated by using linear recurrences. This leads to faster computations (see [10] for example).
This preprocessing stage is done only once, whatever the number of words of size $n$ (or less) that one wishes to generate. A random word is generated by carrying out a succession of derivations starting from the axiom of the grammar. At each step, a derivation is chosen with the adequate probability. Let us suppose for example that, at a given step of generation of a word of length $j$, one has to choose a rewriting rule for the symbol $T$. If $T \to T' \mid T''$, one chooses to generate a word of length $j$ either deriving from $T'$ with probability $T'(j)/T(j)$, or deriving from $T''$ with probability $T''(j)/T(j)$. If $T \to T'\,T''$, one chooses an integer $h$ with probability $T'(h)\,T''(j-h)/T(j)$, and then one generates a word of length $h$ deriving from $T'$, concatenated to a word of length $j - h$ deriving from $T''$. Details of the process are given in [9].
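The two stages (counting, then generation) can be sketched as follows. The grammar is a hypothetical example of ours, Dyck words $S \to \varepsilon \mid aSbS$ cut into binary rules; it is not taken from the paper, and the code is the simple quadratic version of the method, not the optimized algorithms of the references.

```python
import random
from functools import lru_cache

# Hypothetical example grammar (not from the paper): Dyck words S -> eps | a S b S,
# split so that every rule is eps, a letter, a union or a product of two symbols.
RULES = {
    "S": ("union", "E", "P"),
    "E": ("eps",),
    "P": ("prod", "A", "Q"),
    "Q": ("prod", "S", "R"),
    "R": ("prod", "B", "S"),
    "A": ("letter", "a"),
    "B": ("letter", "b"),
}

@lru_cache(maxsize=None)
def count(sym: str, n: int) -> int:
    """T(n): number of words of length n deriving from sym, via recurrences (1)-(4)."""
    kind, *args = RULES[sym]
    if kind == "eps":
        return 1 if n == 0 else 0
    if kind == "letter":
        return 1 if n == 1 else 0
    if kind == "union":
        return count(args[0], n) + count(args[1], n)
    total = 0
    for h in range(n + 1):
        left = count(args[0], h)
        if left:  # skipping zero terms also keeps the recursion well-founded here
            total += left * count(args[1], n - h)
    return total

def generate(sym: str, n: int, rng: random.Random) -> str:
    """Draw uniformly a word of length n deriving from sym (requires count(sym, n) > 0)."""
    kind, *args = RULES[sym]
    if kind == "eps":
        return ""
    if kind == "letter":
        return args[0]
    r = rng.randrange(count(sym, n))
    if kind == "union":
        return generate(args[0] if r < count(args[0], n) else args[1], n, rng)
    for h in range(n + 1):  # product rule: pick h with probability T'(h) T''(n-h) / T(n)
        left = count(args[0], h)
        weight = left * count(args[1], n - h) if left else 0
        if r < weight:
            return generate(args[0], h, rng) + generate(args[1], n - h, rng)
        r -= weight
    raise AssertionError("unreachable when count(sym, n) > 0")

word = generate("S", 10, random.Random(1))
```

The counts are exact Python integers, so their size grows with $n$; this is the bit-complexity issue that arises with large coefficients.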
The best algorithms derived from this method carry out the generation of a word of length $n$ in $O(n)$ arithmetic operations after a preprocessing stage in $O(n^2)$ [16], or the generation of a word in $O(n \log n)$ arithmetic operations after a preprocessing stage in $O(n)$ [10]. Note however that the coefficients which are computed are extremely large, so the bit complexity is a more accurate measure for these algorithms, as discussed in [2].
The above method obviously applies to rational languages. However, in this particular case, a slightly different approach makes it possible to carry out a generation in $O(n)$ arithmetic operations with a preprocessing stage in $O(n)$ too [12]. We consider a regular grammar of the language, in which any rule has the shape $T \to T_1 \mid \ldots \mid T_m$, where either $T_i = xT_j$ or $T_i = x$ ($x \in X$), or $T_i = \varepsilon$. The $T(\cdot)$'s are computed using the following recurrences:
$$T \to T_1 \mid \ldots \mid T_m \;\Rightarrow\; T(n) = T_1(n) + \cdots + T_m(n) \tag{5}$$
where
$$T_i = xT_j \;\Rightarrow\; T_i(n) = T_j(n-1), \tag{6}$$
$$T_i = x \;\Rightarrow\; T_i(n) = 1 \text{ if } n = 1;\; T_i(n) = 0 \text{ otherwise}, \tag{7}$$
$$T_i = \varepsilon \;\Rightarrow\; T_i(n) = 1 \text{ if } n = 0;\; T_i(n) = 0 \text{ otherwise}. \tag{8}$$
The generation stage is done as in the general case. Here, each step simply consists in choosing a letter with an adequate probability. This is equivalent to building a path of length $n$ in a deterministic finite-state automaton of the language.
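The automaton view lends itself to a short sketch. The automaton below is a hypothetical example of ours (words over $\{a, b\}$ without the factor aa); the table plays the role of recurrences (5)-(8) and each generation step chooses one letter.

```python
import random

# Hypothetical automaton (not from the paper): words over {a, b} without the factor "aa".
# State 0: the previous letter is not 'a'; state 1: the previous letter is 'a'.
DELTA = {0: {"a": 1, "b": 0}, 1: {"b": 0}}
ACCEPTING = {0, 1}

def count_table(n: int):
    """table[m][q] = number of accepted words of length m read from state q."""
    table = [{q: int(q in ACCEPTING) for q in DELTA}]
    for _ in range(n):
        prev = table[-1]
        table.append({q: sum(prev[r] for r in DELTA[q].values()) for q in DELTA})
    return table

def generate(n: int, rng: random.Random, start: int = 0) -> str:
    """Uniform random word of length n: one letter choice per step."""
    table = count_table(n)
    state, letters = start, []
    for remaining in range(n, 0, -1):
        r = rng.randrange(table[remaining][state])
        for letter, target in DELTA[state].items():
            ways = table[remaining - 1][target]   # completions after reading `letter`
            if r < ways:
                letters.append(letter)
                state = target
                break
            r -= ways
    return "".join(letters)
```

Both the table and the generation use a number of arithmetic operations linear in $n$ for a fixed automaton.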
3 Generation in exact frequencies
The principle of the method that we describe here is a natural extension of the general outline given in the previous section. The method is used implicitly in works like [5], where the problem of randomly generating structures while fixing more than one parameter is addressed. Our goal is simply to formalize the method within the framework of the generation of words with fixed numbers of letters in context-free languages.
As above, the grammar is assumed to be in Chomsky Normal Form. For any non-terminal symbol $T$, we denote by $T(j_1, j_2, \ldots, j_k)$ the number of words which derive from $T$ and which contain $j_1$ letters $x_1$, $j_2$ letters $x_2$, ..., $j_k$ letters $x_k$. The preprocessing stage consists in computing a table of the $T(j_1, j_2, \ldots, j_k)$ for $0 \le j_1 \le n_1$, ..., $0 \le j_k \le n_k$. We use the following recurrences:
$$T \to \varepsilon \;\Rightarrow\; T(0, \ldots, 0) = 1, \tag{9}$$
$$T \to x_i \;\Rightarrow\; T(0, \ldots, 0, 1, 0, \ldots, 0) = 1 \quad \text{(the 1 being in position } i\text{)}, \tag{10}$$
$$T \to T' \mid T'' \;\Rightarrow\; T(n_1, \ldots, n_k) = T'(n_1, \ldots, n_k) + T''(n_1, \ldots, n_k), \tag{11}$$
$$T \to T'\,T'' \;\Rightarrow\; T(n_1, n_2, \ldots, n_k) = \sum_{\substack{n'_1 + n''_1 = n_1 \\ n'_2 + n''_2 = n_2 \\ \cdots \\ n'_k + n''_k = n_k}} T'(n'_1, n'_2, \ldots, n'_k)\, T''(n''_1, n''_2, \ldots, n''_k). \tag{12}$$
As above, each step of the generation stage consists in choosing a rewriting rule of the current symbol $T$. Suppose that, at a given step of generation of a word of distribution $\mathbf{j} = (j_1, \ldots, j_k)$, one has to choose a rewriting rule for the symbol $T$. If $T \to T' \mid T''$, one generates a word of distribution $\mathbf{j}$ deriving from $T'$ with probability $T'(\mathbf{j})/T(\mathbf{j})$, or deriving from $T''$ with probability $T''(\mathbf{j})/T(\mathbf{j})$. If $T \to T'\,T''$, one chooses a vector $\mathbf{h} = (h_1, \ldots, h_k)$ with probability $T'(\mathbf{h})\,T''(\mathbf{j} - \mathbf{h})/T(\mathbf{j})$, then one generates a word deriving from $T'$ having distribution $\mathbf{h}$, concatenated to a word deriving from $T''$ having distribution $\mathbf{j} - \mathbf{h}$.
The rational case is similar:
$$T \to T_1 \mid \ldots \mid T_m \;\Rightarrow\; T(n_1, \ldots, n_k) = T_1(n_1, \ldots, n_k) + \cdots + T_m(n_1, \ldots, n_k)$$
where
$$T_i = x_m T_j \;\Rightarrow\; T_i(n_1, \ldots, n_m, \ldots, n_k) = T_j(n_1, \ldots, n_m - 1, \ldots, n_k),$$
$$T_i = x_m \;\Rightarrow\; T_i(0, \ldots, 0, 1, 0, \ldots, 0) = 1 \quad \text{(with } n_m = 1\text{)},$$
$$T_i = \varepsilon \;\Rightarrow\; T_i(0, \ldots, 0) = 1.$$
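For the rational case, the counting table indexed by letter counts and the corresponding generation can be sketched as below. The automaton (no factor aa over $\{a, b\}$) is again a hypothetical example of ours, and memoization stands in for the explicit table.

```python
import random
from functools import lru_cache

DELTA = {0: {"a": 1, "b": 0}, 1: {"b": 0}}   # hypothetical automaton: no factor "aa"
ACCEPTING = {0, 1}
ALPHABET = ("a", "b")

@lru_cache(maxsize=None)
def count(state: int, remaining: tuple) -> int:
    """Accepted words read from `state` with exactly remaining[i] letters ALPHABET[i]."""
    if not any(remaining):
        return int(state in ACCEPTING)
    total = 0
    for i, letter in enumerate(ALPHABET):
        if remaining[i] and letter in DELTA[state]:
            rest = remaining[:i] + (remaining[i] - 1,) + remaining[i + 1:]
            total += count(DELTA[state][letter], rest)
    return total

def generate(freqs: tuple, rng: random.Random, start: int = 0) -> str:
    """Uniform word with the exact letter counts `freqs` (requires count(start, freqs) > 0)."""
    state, remaining, letters = start, tuple(freqs), []
    while any(remaining):
        r = rng.randrange(count(state, remaining))
        for i, letter in enumerate(ALPHABET):
            if not (remaining[i] and letter in DELTA[state]):
                continue
            rest = remaining[:i] + (remaining[i] - 1,) + remaining[i + 1:]
            ways = count(DELTA[state][letter], rest)
            if r < ways:
                letters.append(letter)
                state, remaining = DELTA[state][letter], rest
                break
            r -= ways
    return "".join(letters)
```

The memo table holds one entry per state and per vector of remaining counts, so its size grows with the product of the required letter counts.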
The bottleneck of this method lies in its high complexity in time and memory. Indeed, it requires the computation of a table of $\Theta(n_1 n_2 \cdots n_k)$ values. Moreover, in the non-rational case, the multiple convolutions of formula (12) imply a very slow generation stage too. On the other hand, if the language is rational, the generation is carried out in linear arithmetic complexity (but the problem of the computation of the table remains). Thus the method is feasible for an alphabet of limited size. It can also be extended to an alphabet of any size, provided that the number of constraints is limited. For example, instead of fixing $(n_1, n_2, \ldots, n_k)$, one can fix only $(n_1, \ldots, n_j)$ with $j < k$.
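4 Generation in expected frequencies
As stated in the introduction, any word of $L_n$ must have a non-zero probability of being generated,
and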
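$$\frac{1}{n} \sum_{w \in L_n} |w|_{x_i}\, p(w) \sim v_i \qquad \forall i \in \{1, 2, \ldots, k\}. \tag{14}$$
Moreover, we force words having the same letter distribution to be equally generated:
$$\big(|w|_{x_i} = |w'|_{x_i} \;\; \forall i \in \{1, 2, \ldots, k\}\big) \;\Rightarrow\; p(w) = p(w'). \tag{15}$$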
Our method consists in assigning a weight to each letter of the alphabet. For this purpose, we define a function $\pi : X \to \mathbb{R}^+$, which we call a weighting function. It naturally extends to $X^*$:
$$\pi(w) = \prod_{1 \le i \le k} \pi(x_i)^{|w|_{x_i}},$$
just as it extends to any finite language, in particular to $L_n$:
$$\pi(L_n) = \sum_{w \in L_n} \pi(w).$$
If each word $w \in L_n$ is generated with probability
$$p(w) = \frac{\pi(w)}{\pi(L_n)}, \tag{16}$$
then the expected number of occurrences of a given letter in a random sample will increase with the weight put on this letter in comparison with the others.
Let us summarize our methodology:
1. Find a weighting function $\pi$ satisfying (14);
2. $\pi$ having been fixed, design a generation algorithm which satisfies (16).
Consider the second step. Let $\pi$ be given. In order to generate words with the required distribution (16), we use the methodology presented in section 2, but now the rule
$$T \to x_i \;\Rightarrow\; T(1) = \pi(x_i)$$
replaces rule (2). For the particular case of rational languages, the rules
$$T_i = xT_j \;\Rightarrow\; T_i(n) = \pi(x)\, T_j(n-1),$$
$$T_i = x \;\Rightarrow\; T_i(n) = \pi(x) \text{ if } n = 1;\; T_i(n) = 0 \text{ otherwise}$$
replace (6) and (7).
The generation procedure works exactly in the same manner as the uniform one of section 2. In this way, the probability that a word $w$ occurs will be proportional to its weight $\pi(w)$.
Observe that our methodology is equivalent to generating words uniformly at random from an ambiguous grammar. In this way, the probability of each word is proportional to the number of derivations which lead to it in the original unambiguous grammar, times its weight $\pi(w)$.
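Now, we have to find a weighting function $\pi$ satisfying (14). Let us consider the following generating function:
$$L_\pi(t, \mathbf{x}) = \sum_{w \in L} \pi(w)\, t^{|w|}\, x_1^{|w|_{x_1}} \cdots x_k^{|w|_{x_k}},$$
where $\mathbf{x} = (x_1, x_2, \ldots, x_k)$.
For each $i \in \{1, \ldots, k\}$, the expected number of occurrences of the letter $x_i$ in the words generated by the algorithm is equal to $n\,\mu_i(\pi)$, where
$$\mu_i(\pi) = \frac{\sum_{w \in L_n} |w|_{x_i}\, \pi(w)}{n\, \pi(L_n)} = \frac{[t^n]\, \Phi_{\pi, x_i}(t)}{[t^n]\, \Phi_\pi(t)} \tag{17}$$
with
$$\Phi_{\pi, x_i}(t) = \frac{\partial L_\pi(t, \mathbf{x})}{\partial x_i}(t, \mathbf{1}) \quad \text{and} \quad \Phi_\pi(t) = t\, \frac{\partial L_\pi(t, \mathbf{x})}{\partial t}(t, \mathbf{1}).$$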
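For a rational language, the two sums in (17) can be computed exactly by dynamic programming on an automaton. The sketch below uses a hypothetical automaton of ours (no factor aa over $\{a, b\}$) and exact rational arithmetic; the function name `frequency` is our own.

```python
from fractions import Fraction

DELTA = {0: {"a": 1, "b": 0}, 1: {"b": 0}}   # hypothetical automaton: no factor "aa"
ACCEPTING = {0, 1}

def frequency(letter, n, weight, start=0):
    """mu(pi) for `letter`: sum of |w|_letter * pi(w) over L_n, divided by n * pi(L_n)."""
    total = {q: Fraction(int(q in ACCEPTING)) for q in DELTA}   # sum of pi(w)
    marked = {q: Fraction(0) for q in DELTA}                    # sum of |w|_letter * pi(w)
    for _ in range(n):
        new_total, new_marked = {}, {}
        for q, edges in DELTA.items():
            new_total[q] = sum((weight[x] * total[r] for x, r in edges.items()),
                               Fraction(0))
            new_marked[q] = sum(
                (weight[x] * (marked[r] + (total[r] if x == letter else 0))
                 for x, r in edges.items()),
                Fraction(0),
            )
        total, marked = new_total, new_marked
    return marked[start] / (n * total[start])

mu_a = frequency("a", 3, {"a": 2, "b": 1})
```

For these weights and $n = 3$ the weighted total is 11 and the weighted number of letters a is 14, so `mu_a` equals 14/33.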
Because the computation of $\mu_i(\pi)$ involves computing the coefficients of the above generating functions, we treat separately the algebraic case (subsection 4.1) and the rational case (subsection 4.2).
4.1 The algebraic case
From a more general theorem due to Drmota [4, Theorem 1, p. 107], we immediately state
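Theorem 1 Let $\mathbf{y} = \mathbf{F}(t, \mathbf{y}, \mathbf{x})$ be a set of equations from an algebraic grammar with a weighting function $\pi$, where $\mathbf{y} = (y_1, y_2, \ldots, y_N)$ is the vector of generating functions for the non-terminal symbols and $\mathbf{F}(t, \mathbf{y}, \mathbf{x}) = (F_1(t, \mathbf{y}, \mathbf{x}), \ldots, F_N(t, \mathbf{y}, \mathbf{x}))$. Under the following assumptions:
- $\mathbf{F}(0, \mathbf{y}, \mathbf{x}) = 0$ and $\mathbf{F}(t, 0, \mathbf{x}) \neq 0$;
- the system of equations is not linear in $\mathbf{y}$;
- the system of equations is of simple type: there exists a set of $N$ $(k+1)$-dimensional cones $C_i \subseteq \mathbb{R}^{k+1}$ such that, for any $(n, m_1, \ldots, m_k) \in C_j$ with $n, m_1, \ldots, m_k$ large enough, the coefficient of the term $t^n x_1^{m_1} \cdots x_k^{m_k}$ is non-zero;
- the grammar is strongly connected: each non-terminal symbol can be reached from any other non-terminal symbol;
we have the following: if there exists a positive solution $(\mathbf{y}_0, t_0)$ of the system of equations
$$\begin{cases} \mathbf{y} = \mathbf{F}(t, \mathbf{y}, \mathbf{1}) \\ 0 = \det\big(I - \mathbf{F}_{\mathbf{y}}(t, \mathbf{y}, \mathbf{1})\big) \end{cases} \tag{18}$$
where $\mathbf{F}_{\mathbf{y}}$ is the matrix $\left(\dfrac{\partial F_i}{\partial y_j}\right)_{1 \le i, j \le N}$, then
$$\mu_i(\pi) = -\frac{1}{t(\mathbf{1})}\, \frac{\partial t}{\partial x_i}(\mathbf{1})$$
such that $t(\mathbf{1}) = t_0$.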
This theorem can help us to solve our problem, by taking the generating function $L_\pi(t, \mathbf{x})$ as the first component of the vector $\mathbf{y}$.
However, Theorem 1 does not apply to rational languages, since in this case the system of equations is linear.
Consider a toy example, namely generating rooted plane trees (see [19]) with $n$ edges such that the expected number of leaves is $\mu n$ ($0 < \mu < 1$). A classical bijection turns this problem into the problem of generating words of length $2n$ belonging to the language defined by the grammar $S \to aSbS + aSb + cdS + cd$. The factor $cd$ represents a leaf of the tree. We can now change the expected number of leaves in a random tree by putting a positive real weight $\pi_1$ on the letter $c$: $S \to aSbS + aSb + \pi_1 cdS + \pi_1 cd$. This weighted grammar meets the conditions of Theorem 1 and we obtain
$$\begin{cases} F(t, y, x_1) = t y^2 + t y + \pi_1 x_1 t y + \pi_1 x_1 t \\ F_y(t, y, x_1) = 2 t y + t + \pi_1 x_1 t, \end{cases}$$
where $t$ marks each pair of letters (that is, each edge) and $x_1$ marks the letter $c$.
This means that in order to generate a word of $2n$ letters (resp. a tree with $n$ edges) having $\mu n$ letters $c$ (resp. leaves) on average, we have to take $\pi_1 = (\mu/(1-\mu))^2$.
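The value of $\pi_1$ can be checked numerically without any generation. The sketch below relies on a classical fact that is not stated in the paper: the number of rooted plane trees with $n$ edges and $k$ leaves is the Narayana number $\frac{1}{n}\binom{n}{k}\binom{n}{k-1}$. The function name and the test values ($\mu = 3/10$, $n = 400$) are our own choices.

```python
from fractions import Fraction
from math import comb

def expected_leaf_ratio(n: int, weight: Fraction) -> Fraction:
    """Expected (number of leaves)/n over plane trees with n edges, each leaf weighted."""
    # Narayana numbers: plane trees with n edges and k leaves, for k = 1..n.
    narayana = [comb(n, k) * comb(n, k - 1) // n for k in range(1, n + 1)]
    total = sum(N * weight**k for k, N in enumerate(narayana, start=1))
    leaves = sum(k * N * weight**k for k, N in enumerate(narayana, start=1))
    return leaves / (n * total)

mu = Fraction(3, 10)
pi_1 = (mu / (1 - mu)) ** 2          # weight given by the formula in the text
ratio = float(expected_leaf_ratio(400, pi_1))
```

For finite $n$ the ratio only approaches $\mu$; in the uniform case $\pi_1 = 1$ it equals $(n+1)/(2n)$ exactly, which tends to $\mu = 1/2$.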
Let us say that this example is particularly simple. In general, we may be unable to solve the system (18), since we cannot use numerical techniques because of the unknown function $\pi$. Thus, our method fails if we try to generate words of the language defined in [4, p. 112] with a given expected number of letters $a$ and $b$, since we have to solve an algebraic equation of degree five whose coefficients are functions of the unknown weights.
Moreover, Theorem 1 only applies to strongly connected grammars, which is a strong restriction.
4.2 The rational case
In this section, we show that we can solve the problem of generating words according to given frequencies for a whole class of rational languages. If $L$ is a rational language, the functions $\Phi_{\pi, x_i}(t)$ and $\Phi_\pi(t)$ defined in (17) are rational fractions. The $n$-th coefficients of their Taylor series at $t = 0$ are a function of the zeros of their denominators (see for example [11, p. 360]). If we are only interested in the expected number of letters in the words of length $n$ as $n$ goes to infinity, we just need to know the dominant singularity of each series, that is the real zero of smallest modulus of their denominator, say $s$, with a multiplicity $\alpha$. Let us recall that from Pringsheim's theorem, at least one of the dominant singularities of the generating function of any rational language is real positive.
However, in our problem, the denominator depends on the unknown variables $\pi(x_i)$. Hence, we encounter a priori the same difficulty as in the algebraic case: that of solving a possibly high degree algebraic equation in the variable $s$ with unknown coefficients, in other words without the help of numerical approximation. However, thanks to the following proposition, we will be able to find a solution.
Proposition 2 Let $L$ be a rational language defined on the alphabet $X = \{x_1, x_2, \ldots, x_k\}$. Let $\pi$ be a weighting function on $X$. Assume that $s$ is the unique real pole of smallest modulus of $\Phi_\pi(t)$ (resp. of $\Phi_{\pi, x_i}(t)$), and $\alpha$ its multiplicity. It follows that $\mu_i(s\pi) = \mu_i(\pi)$ for any $i \in \{1, \ldots, k\}$, and the unique real pole of smallest modulus of $\Phi_{s\pi}(t)$ (resp. of $\Phi_{s\pi, x_i}(t)$) is 1, with multiplicity $\alpha$.
Proof:
Under the assumption of Proposition 2, the denominator $Q(t)$ of $\Phi_\pi(t)$ can be written as $Q(t) = (t - s)^\alpha P(t)$, where the modulus of each root of $P(t)$ is strictly greater than $s$. Hence, the denominator of $\Phi_{s\pi}(t)$ is of the form $Q(ts) = (ts - s)^\alpha P(ts)$, and it follows that 1 is the unique root of smallest modulus of $Q(ts)$, with multiplicity $\alpha$.
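The invariance $\mu_i(s\pi) = \mu_i(\pi)$ used in the proposition holds for any positive factor $s$, because every word of length $n$ has its weight multiplied by $s^n$. The brute-force check below illustrates this on a hypothetical language of ours (words of length 6 over $\{a, b\}$ without the factor aa), with arbitrary weights and an arbitrary factor.

```python
from fractions import Fraction
from itertools import product
from math import prod

def weight_of(word, pi):
    """pi(w): product of the weights of the letters of w."""
    return prod(pi[x] for x in word)

def frequencies(words, pi):
    """mu_i(pi) for every letter, over a finite set of words of the same length."""
    n = len(words[0])
    total = sum(weight_of(w, pi) for w in words)
    return {x: sum(w.count(x) * weight_of(w, pi) for w in words) / (n * total)
            for x in pi}

n = 6
words = ["".join(w) for w in product("ab", repeat=n) if "aa" not in "".join(w)]
pi = {"a": Fraction(2), "b": Fraction(1)}
s = Fraction(1, 3)                        # arbitrary positive scaling factor
scaled = {x: s * p for x, p in pi.items()}
```

Only the location of the poles changes under the rescaling; the frequencies computed from `pi` and from `scaled` are identical.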
The above proposition means that if there exists a weighting function leading to the required frequencies, then there exists a weighting function $\pi$, leading to the required frequencies, such that each series is singular at 1.
Remark 3 The assumption of Proposition 2 is quite natural, since if for some $i$, $\Phi_{\pi, x_i}(t)$ and $\Phi_\pi(t)$ did not have the same dominant pole with the same multiplicity, this would imply that the frequency of the letter $x_i$ tends to zero as $n$ tends to infinity.
We shall now derive sufficient conditions for the series involved in Proposition 2 to have the same dominant pole with the same multiplicity. This is the purpose of the following theorem.
Theorem 4 Let $L$ be a rational language defined on the alphabet $X = \{x_1, \ldots, x_k\}$ and $A$ the minimal deterministic finite automaton which recognizes $L$. Let $C = \{C_1, \ldots, C_f\}$ be the set of non-trivial strongly connected components of $A$. Assume moreover that for any $i \in \{1, \ldots, f\}$, $C_i$ is aperiodic (in the sense of Markov chains). Under this assumption, for any weighting function $\pi$, we have
the generating function of $L$ according to the length and the letter $x_i$. Let $T(t, u) = t\,T(u)$, where $T(u)$ is the fundamental matrix associated with $A$: $T_{l,m}(u)$ is equal to $u$ if there is an edge labelled by $x_i$ from state $l$ to state $m$, to 1 if there is an unlabelled edge from state $l$ to state $m$, and to 0 otherwise. Then, since $A$ is strongly connected and aperiodic, $T(t, 1)$ is irreducible and primitive. It follows from the Perron-Frobenius theorem that $L_1(t, 1)$ has a unique dominant simple pole $s$ (the inverse of the greatest eigenvalue of the fundamental matrix). Hence (see [19]) $C(t, u) = \det(\mathrm{Id} - T(t, u))$ has a unique root
$$\rho(u) = s - \frac{C_1}{C_2}\,(u - 1) + O\big((u - 1)^2\big) \quad \text{with} \quad C_1 = \frac{\partial C}{\partial u}(t, u)\Big|_{(s,1)} \text{ and } C_2 = \frac{\partial C}{\partial t}(t, u)\Big|_{(s,1)}.$$
Remark 5 If $f > 1$, the dominant pole may not be simple ($\alpha > 1$). Hence, we have to use the formula of [21, Theorem 4.1] for computing the asymptotics of the coefficients of the series.
Remark 6 If $A$ has a unique non-trivial connected component, aperiodicity is not necessary. If $A$ has period $k$, then we shall consider the words of length $kn$, and thus achieve primitivity.
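Remark 7 The assumption that each strongly connected component is aperiodic is necessary if $f > 1$. To see this, consider the languages
$$L^{(1)} = \varepsilon + ab\,L^{(1)} + bb\,L^{(1)}, \qquad L^{(2)} = \varepsilon + aa\,L^{(2)} + b\,L^{(2)}, \qquad L = L^{(1)} + L^{(2)}\, b^{\star}$$
$$\left\{ \frac{1}{\sqrt{ab + b^2}},\; \frac{-1}{\sqrt{ab + b^2}} \right\};$$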
Using the above results, we shall design an algorithm to find a weighting function $\pi$ satisfying the required frequencies (14).
1. Compute the series $L_\pi(t, \mathbf{x})$, $\Phi_\pi(t)$ and $\Phi_{\pi, x_i}(t)$ for $i \in \{1, \ldots, k\}$ (in which the $\pi(x_i)$ are unknown
where $D_\pi(t, \mathbf{x})$ is the denominator of the series $L_\pi(t, \mathbf{x})$ (the last equation forces 1 to be a pole of the series).
4. Look for a solution of the above system such that
- each $\pi(x_i)$ is (strictly) positive;
- 1 is the pole of smallest modulus of $L_\pi(t, \mathbf{1})$. (We have forced 1 to be a pole, but not a dominant one!)
Observe that the system can be solved with numerical techniques (using GB [6] for example), since the number of equations is equal to the number of unknown variables. If we do not find a solution that satisfies step 4 of the algorithm, this means, from Proposition 2, that no such solution exists.
Let us show how the algorithm works by taking, for example, a language involved in the study of the genome. One calls ORF ("Open Reading Frame") a sequence in a chromosome that is a putative protein-coding gene. In organisms with compact genomes (genes without introns), one can approximate the language that describes ORFs by the following rational language $L$,
$$L = atg\,\big((X^3 \setminus \{taa, tag, tga\})^{\star}\big)\,(taa + tag + tga),$$
where $X = \{a, c, g, t\}$. The minimal automaton of $L$ has a unique strongly connected component, which contains all the letters of $X$.
From the regular expression above, we find the generating function
$$L(x, a, c, g, t) = \frac{atg\,x\,(2\,tga\,x + taa\,x)}{1 - (a + c + g + t)^3\,x + taa\,x + 2\,tga\,x}$$
where the coefficient of $x^n$ is the number of words of length $3n$, in order to respect the primitivity condition (Remark 6). At $a = c = g = t = 1$, the series $L(x, a, c, g, t)$ has a unique pole of smallest modulus, $1/61$, which is additionally simple. Since the automaton of $L$ has only one non-trivial connected component, we are in the case of the first part of the proof of Theorem 4. Hence, taking for instance the trivial weighting function, say $\pi_0 = 1$, we find that the pole of $L(x, a, 1, 1, 1)$ is $\rho(a) = \dfrac{1}{a^3 + 8a^2 + 25a + 27}$.
Expanding $\rho(a)$ at $a = 1$, we find
$$\rho(a) = \frac{1}{61} - \frac{44}{3721}\,(a - 1) + O\big((a - 1)^2\big) \quad \text{and} \quad \mu_a = 44/183.$$
In an analogous way, the frequencies of the other letters of $X$ are $\mu_c = 16/61$; $\mu_g = 46/183$; $\mu_t = 15/61$. This means for example that a uniform random word of $L$ of length $n$ has $44n/183$ letters $a$ on average.
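These four frequencies can be cross-checked by direct enumeration. The sketch below makes an assumption of ours: for long words the fixed prefix atg and the final stop codon are negligible, so the letter frequencies are those of the repeated part, where each of the 61 allowed codons appears with probability proportional to the product of the weights of its letters. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

STOP_CODONS = {"taa", "tag", "tga"}
SENSE_CODONS = ["".join(c) for c in product("acgt", repeat=3)
                if "".join(c) not in STOP_CODONS]

def codon_letter_frequencies(weight):
    """Letter frequencies inside the repeated (X^3 minus stop codons) part of an ORF."""
    def codon_weight(codon):
        return weight[codon[0]] * weight[codon[1]] * weight[codon[2]]
    total = sum(codon_weight(c) for c in SENSE_CODONS)
    return {x: sum(c.count(x) * codon_weight(c) for c in SENSE_CODONS) / (3 * total)
            for x in "acgt"}

# Trivial weighting function: every letter has weight 1.
uniform = codon_letter_frequencies({x: Fraction(1) for x in "acgt"})
```

With the trivial weighting the enumeration gives 44/183, 16/61, 46/183 and 15/61 for a, c, g and t, in agreement with the values obtained from the expansion of the pole.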
We are now looking for a weighting function such that the frequencies of the nucleotides are the same as those of chromosome III of Saccharomyces cerevisiae (brewer's yeast), say $v = (0.307, 0.193, 0.193, 0.307)$. Proceeding as above, but now with the series
$$L_\pi(x, a, c, g, t) = \frac{\pi(a)a\,\pi(t)t\,\pi(g)g\,x\,\big(2\,\pi(t)t\,\pi(g)g\,\pi(a)a\,x + \pi(t)t\,\pi(a)a\,\pi(a)a\,x\big)}{1 - \big(\pi(a)a + \pi(c)c + \pi(g)g + \pi(t)t\big)^3 x + \pi(t)t\,\pi(a)a\,\pi(a)a\,x + 2\,\pi(t)t\,\pi(g)g\,\pi(a)a\,x}$$
we express (putting $s = 1$) $\mu_a(\pi)$, $\mu_c(\pi)$, $\mu_g(\pi)$, $\mu_t(\pi)$ as functions of $\pi(a)$, $\pi(c)$, $\pi(g)$, $\pi(t)$.
With the help of the computer algebra software Maple, we solve the system
$$\begin{cases}
\mu_a(\pi) = 0.307 \\
\mu_c(\pi) = 0.193 \\
\mu_g(\pi) = 0.193 \\
\mu_t(\pi) = 0.307 \\
1 - \big(\pi(a) + \pi(c) + \pi(g) + \pi(t)\big)^3 + \pi(t)\pi(a)\pi(a) + 2\,\pi(t)\pi(g)\pi(a) = 0
\end{cases}$$
and we find the following weighting function: $\pi(a) = 0.31632$; $\pi(c) = 0.16991$; $\pi(g) = 0.18396$; $\pi(t) = 0.32707$. One can observe that the weights of the letters are close to the frequencies. This is because the language $L$ is not much constrained, in the sense that only a few patterns are forbidden. Hence, if we take $\pi = v$, we find $\mu_a(\pi) = 0.29484\ldots$; $\mu_c(\pi) = 0.20648\ldots$; $\mu_g(\pi) = 0.19351\ldots$; $\mu_t(\pi) = 0.30519\ldots$, which differ from $v$ by at most 7%.
5 Conclusion
As far as we can see, we present for the first time algorithms which take into account both statistical and syntactic criteria.
The generation algorithm according to exact frequencies presented in section 3 has a restricted sphere of activity, since the complexity increases strongly with the number of statistical parameters. Moreover, the algorithm does not make it possible to deal with the frequencies of patterns. However, it really works if one is interested in few parameters. For example, in the field of the study of genomic sequences, one can be interested in generating nucleotide sequences with a given expected frequency of letters C + G, which is of interest in some analyses.
The method of section 4 for generating according to expected frequencies is much more efficient. We compute the weighting function once and for all, and then the complexity is identical to that of uniform generation algorithms. In the case of rational languages, we can always find a weighting function with the help of numerical techniques [6]. The same method applies for generating according to expected pattern frequencies (see [17]). Unfortunately, in the algebraic case we are only able to deal with languages whose functional equations have low degree.
Acknowledgements
We thank Jean-Guy Penaud, Gilles Schaeffer, Edward Trifonov and Paul Zimmermann for fruitful discussions. We also thank Marie-France Sagot for her help.
References
[1] E. Coward. Shufflet: Shuffling sequences while conserving the k-let counts. Bioinformatics, 15(12):1058–1059, 1999.
[2] A. Denise and P. Zimmermann. Uniform random generation of decomposable structures using floating-point arithmetic. Theoretical Computer Science, 218:233–248, 1999.
[3] L. Devroye. Non-uniform random variate generation. Springer-Verlag, 1986.
[4] M. Drmota. Systems of functional equations. Random Structures and Algorithms, 10:103–124, 1997.
[5] I. Dutour and J.-M. Fedou. Object grammars and random generation. Discrete Mathematics and Theoretical Computer Science, 2:47–61, 1998.
[6] J.-C. Faugère. GB. http://calfor.lip6.fr/GB.html.
[7] J.W. Fickett. ORFs and genes: how strong a connection? J Comput Biol, 2(1):117–123, 1995.
[8] W.M. Fitch. Random sequences. Journal of Molecular Biology, 163:171–176, 1983.
[9] Ph. Flajolet, P. Zimmermann, and B. Van Cutsem. A calculus for the random generation of labelled combinatorial structures. Theoretical Computer Science, 132:1–35, 1994.
[10] M. Goldwurm. Random generation of words in an algebraic language in linear binary space. Information Processing Letters, 54:229–233, 1995.
[11] R.L. Graham, D.E. Knuth, and O. Patashnik. Concrete Mathematics. Addison Wesley, 2nd edition, 1997. French translation: Mathématiques concrètes, International Thomson Publishing, 1998.
[12] T. Hickey and J. Cohen. Uniform random generation of strings in a context-free language. SIAM J. Comput., 12(4):645–655, 1983.
[13] D. Kandel, Y. Matias, R. Unger, and P. Winkler. Shuffling biological sequences. Discrete Applied Mathematics, 71:171–185, 1996.
[14] D.J. Lipman and W.R. Pearson. Rapid and sensitive protein similarity searches. Science, 227:1435–1441, 1985.
[15] D.J. Lipman, W.J. Wilbur, T.F. Smith, and M.S. Waterman. On the statistical significance of nucleic acid similarities. Nucleic Acids Research, 12:215–226, 1984.
[16] H. G. Mairson. Generating words in a context free language uniformly at random. Information Processing Letters, 49:95–99, 1994.
[17] P. Nicodème, B. Salvy, and Ph. Flajolet. Motif statistics. In European Symposium on Algorithms-ESA99, volume 1643, pages 194–211. Lecture Notes in Computer Science, 1999.
[18] A. Nijenhuis and H.S. Wilf. Combinatorial algorithms. Academic Press, New York, 2nd edition, 1978.
[19] Ph. Flajolet and R. Sedgewick. The average case analysis of algorithms: Multivariate asymptotics and limit distribution. RR Inria, Number 3162, 1997.
[20] M. Régnier. A unified approach to word occurrence probabilities. Discrete Applied Mathematics, 2000. To appear.
[21] R. Sedgewick and Ph. Flajolet. An introduction to the analysis of algorithms. Addison Wesley, 1996. French translation: Introduction à l'analyse des algorithmes, International Thomson Publishing, 1996.
[22] M. Termier and A. Kalogeropoulos. Discrimination between fortuitous and biologically constrained Open Reading Frames in DNA sequences of Saccharomyces cerevisiae. Yeast, 12:369–384, 1996.
[23] A. Vanet, L. Marsan, and M.-F. Sagot. Promoter sequences and algorithmical methods for identifying them. Res. Microbiol., 150:779–799, 1999.