
Random generation of words of context-free languages according to the frequencies of letters

Conference Paper · January 2000


DOI: 10.1007/978-3-0348-8405-1_10

Random generation of words of algebraic languages
according to the frequencies of the letters
Alain Denise*   Olivier Roques†   Michel Termier‡

Abstract
Let $L$ be an algebraic¹ language on an alphabet $X = \{x_1, x_2, \ldots, x_k\}$, and $n$ a positive
integer. We consider the problem of generating at random words of $L$ with respect to a
given distribution of the number of occurrences of the letters. We consider two alternatives
of the problem. In the first one, a vector of natural numbers $(n_1, n_2, \ldots, n_k)$ such that
$n_1 + n_2 + \cdots + n_k = n$ is given, and the words must be generated uniformly among the set of
words of $L$ which contain exactly $n_i$ letters $x_i$ ($1 \le i \le k$). The second alternative consists,
given a vector $v = (v_1, \ldots, v_k)$ of positive real numbers such that $v_1 + \cdots + v_k = 1$, in
generating at random words among the whole set of words of $L$ of length $n$, in such a way that
the expected number of occurrences of any letter $x_i$ equals $n v_i$ ($1 \le i \le k$), and two words
having the same distribution of letters have the same probability of being generated. For this
purpose, we design and study two alternatives of the recursive method which is classically
employed for the uniform generation of combinatorial structures. This type of "controlled"
non-uniform generation is of great interest in the statistical study of genomic sequences.

1 Introduction
The problem of uniform random generation of combinatorial structures has been extensively
studied in the past few years. To our knowledge, random generation according to a given
distribution has been much less treated (except for random numbers, for which one finds an
abundant literature; see [3] for example). We are interested here in a problem of this type.
Let $L$ be a language on an alphabet $X = \{x_1, x_2, \ldots, x_k\}$, and $n$ an integer. Let us denote
by $L_n$ the set of words of $L$ of length $n$. The problem consists in generating words of $L_n$ while
respecting a distribution of the letters given by a vector of $k$ positive numbers. We consider
two alternatives:
1. Generation in exact frequencies. The distribution of the number of letters of any word
must respect exactly a given vector of integers $(n_1, \ldots, n_k)$. In other words, we generate
words uniformly at random in the subset of $L_n$ constituted of all the words $w \in L_n$ such
that $|w|_{x_i} = n_i$ for all $i \in \{1, 2, \ldots, k\}$.

2. Generation in expected frequencies. The words must respect on average a distribution
given by a vector $v = (v_1, \ldots, v_k)$ such that $v_1 + \cdots + v_k = 1$. More precisely, we generate
words at random in such a way that
(a) any word of $L$ has a non zero probability of being generated;
(b) for all $i \in \{1, 2, \ldots, k\}$, the frequency of the letter $x_i$ in the words is equal (or
asymptotically equivalent) to $v_i$: if $p(w)$ is the probability that the word $w$ is generated by
the algorithm, one must have convergence of the frequencies,
$$\frac{1}{n} \sum_{w \in L_n} |w|_{x_i}\, p(w) \sim v_i ;$$

* LRI, UMR CNRS 8623, Université Paris-Sud XI. Alain.Denise@lri.fr
† LaBRI, UMR CNRS 5800, Université Bordeaux I. roques@labri.u-bordeaux.fr
‡ IGM, UMR CNRS 8621, Université Paris-Sud XI. termier@igmors.u-psud.fr
¹ We use the term algebraic instead of context-free because we are interested in algebraic
properties of the series.

(c) two words having the same distribution of letters have the same probability of being
generated.

These generation schemes lead to applications in genomics. An important problem in the
field of genome analysis consists in determining whether some properties observed in genomic
sequences are biologically significant or not. The main idea is as follows: if a property observed
in natural sequences is really relevant from a biological point of view, one should not observe
it significantly in random sequences. Thus for example, in order to evaluate the biological
significance of similarities between protein sequences of different organisms, one compares the
scores of their alignments with scores obtained on random sequences [15, 14]. Similarly, the
comparison of the frequencies of certain motifs in natural and random sequences can contribute
to determining whether these motifs are biologically relevant [7, 22, 23].
Traditionally, sequences are generated according to purely statistical considerations. The
foundations of these models and the first generation algorithms were described in [8]. The
parameters which are taken into account are the frequencies of nucleotides (letters in DNA)
or oligonucleotides (factors) of fixed length $l$ observed in a natural sequence which is taken as
reference. Thus for example, for $l = 2$ and the natural sequence aatgtaacgt, the frequencies are
aa = 2, at = 1, tg = 1, gt = 2, ta = 1, ac = 1 and cg = 1. Random sequences are generated
with respect to these frequencies, either exactly (generation in exact frequencies) or on average
(generation in expected frequencies).
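As a small illustration of the reference statistics described above, the following sketch counts the factors of length $l$ of a sequence. The function name `factor_counts` is hypothetical (it does not come from the paper), and the example sequence is read here as aatgtaacgt.

```python
from collections import Counter


def factor_counts(sequence, l):
    """Count every factor (contiguous substring) of length l in sequence."""
    return Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))


# Dinucleotide counts (l = 2) for the example sequence of the text.
counts = factor_counts("aatgtaacgt", 2)
print(dict(counts))
```

The resulting counts are the parameters that an exact or expected frequency generator would then be asked to respect.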
In this last case, the generation is carried out according to a Markov chain of order $l - 1$:
the sequence is generated letter by letter and, at any step, the probability of generating a given
letter depends on the $l - 1$ previous letters. The process is clearly linear in the length
of the sequence. In fact, in this model, recent works [17, 20] make it possible to compute
analytically some parameters (expected number of occurrences, variance, etc.) concerning the
frequencies of appearance of given motifs in random sequences. This avoids the generation of a
large number of sequences, and gives exact values of the required parameters. However, random
generation remains useful when the studied properties do not relate to relatively simple motifs.
Random generation in exact frequencies is a more difficult problem if one wishes the
generation to be uniform among all the allowed sequences. This problem is solved in linear expected
time in [13]. One of the main ideas is the fact, stated in [8], that it can be reduced to the
generation of a random Eulerian trail in a directed graph. An implementation of the algorithm
is presented in [1].
These methods do not handle syntactic constraints: the words are generated in $X^*$. But
it is of interest to generate words in particular languages because genomic sequences can be
syntactically constrained. In this context, the aim of our work is to generate more "realistic"
random sequences by taking into account syntactic criteria as well as statistical ones.
Our approach is based on the so-called recursive method, which was initiated by Nijenhuis
and Wilf [18] and then generalized and formalized by Flajolet, Zimmermann and Van Cutsem [9].
Section 2 is devoted to a short presentation of this methodology within the framework of
algebraic languages. We present in section 3 a simple adaptation which makes it possible to
generate words in exact frequencies of the letters. In section 4, we focus on generation in
expected frequencies.

2 Uniform generation
The general methodology of uniform random generation of decomposable structures (which
includes algebraic languages) is presented in detail in [9]. Some variations which deal with the
special case of context-free languages are studied in [16] and [10]. We present here a quite simple
version of the method. The reader will find more powerful alternatives in the referenced papers.
The starting point is a non-ambiguous context-free grammar in Chomsky Normal Form: any
right member of a rule is either the empty word $\varepsilon$, or a letter of $X$, or a non-terminal symbol, or
a product (concatenation) of two non-terminal symbols. Moreover, any non-terminal symbol can
be the left member of at most two rules; and in this case each right member is a single
non-terminal symbol.
The first stage consists in counting words: for any non-terminal symbol $T$ and for all
$0 \le j \le n$, the number $T(j)$ of words of length $j$ which derive from $T$ is computed (recall that
$n$ is the length of the words to be generated). This can be done by using recurrence relations
which result directly from the rules of the grammar:
$$T \to \varepsilon \;\Rightarrow\; T(0) = 1, \tag{1}$$
$$T \to x_i \;\Rightarrow\; T(1) = 1, \tag{2}$$
$$T \to T' \mid T'' \;\Rightarrow\; T(n) = T'(n) + T''(n), \tag{3}$$
$$T \to T'\,T'' \;\Rightarrow\; T(n) = \sum_{n' + n'' = n} T'(n')\, T''(n''). \tag{4}$$

Note that, since the generating series of any context-free language is holonomic, these same
coefficients can be calculated by using linear recurrences. This leads to faster computations
(see [10] for example).
This preprocessing stage is done only once, whatever the number of words of size $n$ (or
less) that one wishes to generate. A random word is generated by carrying out a succession
of derivations starting from the axiom of the grammar. At each step, a derivation is chosen
with the adequate probability. Let us suppose for example that, at a given step of generation
of a word of length $j$, one has to choose a rewriting rule for the symbol $T$. If $T \to T' \mid T''$,
one chooses to generate a word of length $j$ either deriving from $T'$ with probability $T'(j)/T(j)$,
or deriving from $T''$ with probability $T''(j)/T(j)$. If $T \to T'\,T''$, one chooses an integer $h$ with
probability $T'(h)\,T''(j-h)/T(j)$, and then one generates a word of length $h$ deriving from $T'$,
concatenated to a word of length $j - h$ deriving from $T''$. Details of the process are given in [9].
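The counting and generation stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the optimized algorithms of [16] or [10]: the grammar `GRAMMAR` is a hypothetical example (non-empty well-parenthesized words over {a, b}, written with binary rules), no symbol derives the empty word so that the convolution (4) only ranges over non-empty parts, and the names `count` and `generate` are ours.

```python
import random
from functools import lru_cache

# Hypothetical epsilon-free, unambiguous grammar for S -> ab | aSb | abS | aSbS,
# rewritten so that every rule is a letter, a union or a product of two symbols.
GRAMMAR = {
    "S":  ("union", "U1", "U2"),
    "U1": ("union", "X1", "X2"),
    "U2": ("union", "X3", "X4"),
    "X1": ("concat", "A", "B"),    # ab
    "X2": ("concat", "A", "SB"),   # aSb
    "X3": ("concat", "X1", "S"),   # abS
    "X4": ("concat", "X2", "S"),   # aSbS
    "SB": ("concat", "S", "B"),
    "A":  ("letter", "a"),
    "B":  ("letter", "b"),
}


@lru_cache(maxsize=None)
def count(symbol, j):
    """Number T(j) of words of length j deriving from symbol (rules (2)-(4))."""
    rule = GRAMMAR[symbol]
    if rule[0] == "letter":
        return 1 if j == 1 else 0
    if rule[0] == "union":
        return count(rule[1], j) + count(rule[2], j)
    # Product rule: both parts are non-empty because the grammar is epsilon-free.
    return sum(count(rule[1], h) * count(rule[2], j - h) for h in range(1, j))


def generate(symbol, j, rng):
    """Draw uniformly at random a word of length j deriving from symbol."""
    rule = GRAMMAR[symbol]
    if rule[0] == "letter":
        return rule[1]
    r = rng.randrange(count(symbol, j))          # uniform integer in [0, T(j))
    if rule[0] == "union":
        # Choose T' with probability T'(j) / T(j), otherwise T''.
        return generate(rule[1] if r < count(rule[1], j) else rule[2], j, rng)
    for h in range(1, j):
        # The split h is chosen with probability T'(h) T''(j - h) / T(j).
        block = count(rule[1], h) * count(rule[2], j - h)
        if r < block:
            return generate(rule[1], h, rng) + generate(rule[2], j - h, rng)
        r -= block
    raise AssertionError("unreachable when count(symbol, j) > 0")


print(generate("S", 10, random.Random(0)))
```

Memoizing `count` plays the role of the preprocessing table; `generate` must only be called with a length for which `count` is positive.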
The best algorithms derived from this method carry out the generation of a word of length
$n$ in $O(n)$ arithmetic operations after a preprocessing stage in $O(n^2)$ [16], or the generation of a
word in $O(n \log n)$ arithmetic operations after a preprocessing stage in $O(n)$ [10]. Note however
that the coefficients which are computed are extremely large, so the bit complexity is a more
accurate measure for these algorithms, as discussed in [2].
The above method applies obviously to rational languages. However, in this particular case,
a slightly different approach makes it possible to carry out a generation in $O(n)$ arithmetic
operations with a preprocessing stage in $O(n)$ too [12]. We consider a regular grammar of the
language, in which any rule has the shape $T \to T_1 \mid \ldots \mid T_m$, where either $T_i = xT_j$ or $T_i = x$
($x \in X$), or $T_i = \varepsilon$. The $T(\cdot)$'s are computed using the following recurrences:
$$T \to T_1 \mid \ldots \mid T_m \;\Rightarrow\; T(n) = T_1(n) + \cdots + T_m(n) \tag{5}$$
where
$$T_i = xT_j \;\Rightarrow\; T_i(n) = T_j(n-1), \tag{6}$$
$$T_i = x \;\Rightarrow\; T_i(n) = 1 \text{ if } n = 1;\quad T_i(n) = 0 \text{ otherwise}, \tag{7}$$
$$T_i = \varepsilon \;\Rightarrow\; T_i(n) = 1 \text{ if } n = 0;\quad T_i(n) = 0 \text{ otherwise}. \tag{8}$$
The generation stage is done as in the general case. Here, each step simply consists in choosing
a letter with an adequate probability. This is equivalent to building a path of length $n$ in a
deterministic finite-state automaton of the language.
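The path-building view can be sketched on a small automaton. Everything below is an illustrative assumption rather than the algorithm of [12]: the automaton `DELTA` (words over {a, b} without the factor bb) is a hypothetical example, and `suffix_counts` tabulates, for each state, the number of accepted words of each length, which is the automaton counterpart of recurrences (5)-(8).

```python
import random

# Hypothetical deterministic automaton: words over {a, b} with no factor "bb".
DELTA = {0: {"a": 0, "b": 1}, 1: {"a": 0}}
ACCEPTING = {0, 1}
START = 0


def suffix_counts(n):
    """table[m][q] = number of words of length m accepted when read from state q."""
    table = [{q: int(q in ACCEPTING) for q in DELTA}]
    for _ in range(n):
        prev = table[-1]
        table.append({q: sum(prev[r] for r in DELTA[q].values()) for q in DELTA})
    return table


def generate(n, rng, table=None):
    """Draw uniformly at random an accepted word of length n, letter by letter."""
    table = table or suffix_counts(n)
    state, word = START, []
    for m in range(n, 0, -1):
        r = rng.randrange(table[m][state])
        for letter, nxt in DELTA[state].items():
            # The letter is chosen with probability table[m-1][nxt] / table[m][state].
            if r < table[m - 1][nxt]:
                word.append(letter)
                state = nxt
                break
            r -= table[m - 1][nxt]
    return "".join(word)


print(generate(20, random.Random(1)))
```

The table has $O(n)$ entries per state and each generated letter costs a bounded number of arithmetic operations, in line with the linear bounds quoted above.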

3 Generation with respect to exact frequencies

Let $(n_1, n_2, \ldots, n_k)$ be a vector of integers such that $n_1 + n_2 + \cdots + n_k = n$. Our goal is to
generate uniformly at random a word of $L_n$ which contains exactly $n_i$ letters $x_i$, for all
$1 \le i \le k$.

The principle of the method that we describe here is a natural extension of the general outline
given in the previous section. The method is used implicitly in works like [5], where the problem
of randomly generating structures while fixing more than one parameter is addressed. Our goal
is simply to formalize the method within the framework of the generation of words with fixed
numbers of letters in context-free languages.
As above, the grammar is supposed to be in Chomsky Normal Form. For any non-terminal
symbol $T$, we denote by $T(j_1, j_2, \ldots, j_k)$ the number of words which derive from $T$ and
which contain $j_1$ letters $x_1$, $j_2$ letters $x_2$, ..., $j_k$ letters $x_k$. The preprocessing stage consists in
computing a table of the $T(j_1, j_2, \ldots, j_k)$ for $0 \le j_1 \le n_1$, ..., $0 \le j_k \le n_k$. We use the following
recurrences:
$$T \to \varepsilon \;\Rightarrow\; T(0, \ldots, 0) = 1, \tag{9}$$
$$T \to x_i \;\Rightarrow\; T(0, \ldots, 0, 1, 0, \ldots, 0) = 1 \quad \text{(the 1 in position } i\text{)}, \tag{10}$$
$$T \to T' \mid T'' \;\Rightarrow\; T(n_1, \ldots, n_k) = T'(n_1, \ldots, n_k) + T''(n_1, \ldots, n_k), \tag{11}$$
$$T \to T'\,T'' \;\Rightarrow\; T(n_1, n_2, \ldots, n_k) = \sum_{\substack{n'_1 + n''_1 = n_1 \\ n'_2 + n''_2 = n_2 \\ \cdots \\ n'_k + n''_k = n_k}} T'(n'_1, n'_2, \ldots, n'_k)\, T''(n''_1, n''_2, \ldots, n''_k). \tag{12}$$


As above, each step of the generation stage consists in choosing a rewriting rule of the current
symbol $T$. Suppose that, at a given step of generation of a word of distribution $\mathbf{j} = (j_1, \ldots, j_k)$,
one has to choose a rewriting rule for the symbol $T$. If $T \to T' \mid T''$, one generates a word of
distribution $\mathbf{j}$ deriving from $T'$ with probability $T'(\mathbf{j})/T(\mathbf{j})$, or deriving from $T''$ with probability
$T''(\mathbf{j})/T(\mathbf{j})$. If $T \to T'\,T''$, one chooses a vector $\mathbf{h} = (h_1, \ldots, h_k)$ with probability
$T'(\mathbf{h})\,T''(\mathbf{j} - \mathbf{h})/T(\mathbf{j})$, then one generates a word deriving from $T'$ having distribution $\mathbf{h}$,
concatenated to a word deriving from $T''$ having distribution $\mathbf{j} - \mathbf{h}$.
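The rational case is similar:
$$T \to T_1 \mid \ldots \mid T_m \;\Rightarrow\; T(n_1, \ldots, n_k) = T_1(n_1, \ldots, n_k) + \cdots + T_m(n_1, \ldots, n_k)$$
where
$$T_i = x_m T_j \;\Rightarrow\; T_i(n_1, \ldots, n_m, \ldots, n_k) = T_j(n_1, \ldots, n_m - 1, \ldots, n_k),$$
$$T_i = x_m \;\Rightarrow\; T_i(0, \ldots, 0, 1, 0, \ldots, 0) = 1 \quad \text{(that is, } n_m = 1 \text{ and all other entries are } 0\text{)},$$
$$T_i = \varepsilon \;\Rightarrow\; T_i(0, \ldots, 0) = 1.$$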
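The rational case with exact frequencies can be sketched by indexing the counts with the vector of letters that remain to be produced. This is a minimal illustration under our own assumptions: the automaton (words over {a, b} without the factor bb) is a hypothetical example, and `count` is memoized instead of being tabulated in advance.

```python
import random
from functools import lru_cache

ALPHABET = ("a", "b")
# Hypothetical deterministic automaton: words over {a, b} with no factor "bb".
DELTA = {0: {"a": 0, "b": 1}, 1: {"a": 0}}
ACCEPTING = {0, 1}


def _spend(remaining, i):
    """Vector of remaining letters after producing one letter ALPHABET[i]."""
    return remaining[:i] + (remaining[i] - 1,) + remaining[i + 1:]


@lru_cache(maxsize=None)
def count(state, remaining):
    """Accepted words read from state with exactly remaining[i] letters ALPHABET[i]."""
    if sum(remaining) == 0:
        return int(state in ACCEPTING)
    total = 0
    for i, letter in enumerate(ALPHABET):
        if remaining[i] > 0 and letter in DELTA[state]:
            total += count(DELTA[state][letter], _spend(remaining, i))
    return total


def generate(remaining, rng, state=0):
    """Draw uniformly at random an accepted word with the given letter counts."""
    word = []
    while sum(remaining) > 0:
        r = rng.randrange(count(state, remaining))
        for i, letter in enumerate(ALPHABET):
            if remaining[i] == 0 or letter not in DELTA[state]:
                continue
            rest = _spend(remaining, i)
            c = count(DELTA[state][letter], rest)
            if r < c:
                word.append(letter)
                state, remaining = DELTA[state][letter], rest
                break
            r -= c
    return "".join(word)


print(count(0, (3, 2)), generate((6, 4), random.Random(7)))
```

The memo holds one entry per state and per vector below $(n_1, \ldots, n_k)$, so its size grows like the product of the $n_i$; `generate` must only be called when `count` is positive.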

The bottleneck of this method lies in its high complexity in time and memory. Indeed,
it requires the computation of a table of $\Theta(n_1 n_2 \cdots n_k)$ values. Moreover, in the non-rational
case, the multiple convolutions of formula (12) imply a very slow generation stage too. On
the other hand, if the language is rational, the generation is carried out in linear arithmetic
complexity (but the problem of the computation of the table remains). Thus the method is
feasible for an alphabet of limited size. It can also be extended to an alphabet of any size, provided
that the number of constraints is limited. For example, instead of fixing $(n_1, n_2, \ldots, n_k)$, one
can fix only $(n_1, \ldots, n_j)$ with $j < k$.

4 Generation with respect to expected frequencies

In this section, we consider the problem of generating words of $L_n$ at random according to
some probability distribution such that each word is generated with positive probability and the
vector of expected frequencies is equal to a given vector $v$:
$$p(w) > 0 \quad \forall w \in L_n \tag{13}$$
and
$$\frac{1}{n} \sum_{w \in L_n} |w|_{x_i}\, p(w) \sim v_i \quad \forall i \in \{1, 2, \ldots, k\}. \tag{14}$$

Moreover, we force words having the same letter distribution to be generated with equal
probability:
$$\bigl(|w|_{x_i} = |w'|_{x_i} \;\; \forall i \in \{1, 2, \ldots, k\}\bigr) \;\Rightarrow\; p(w) = p(w'). \tag{15}$$
Our method consists in assigning a weight to each letter of the alphabet. For this purpose, we
define a function $\pi : X \to \mathbb{R}^+$, which we call a weighting function. It naturally acts on $X^*$:
$$\pi(w) = \prod_{1 \le i \le k} \pi(x_i)^{|w|_{x_i}},$$
just as it acts on any finite language, in particular on $L_n$:
$$\pi(L_n) = \sum_{w \in L_n} \pi(w).$$


Hence, if the algorithm generates each word $w \in L_n$ with the probability
$$p(w) = \frac{\pi(w)}{\pi(L_n)} \quad \forall w \in L_n, \tag{16}$$

then the expected number of occurrences of a given letter in a random sample will increase with
the weight put on this letter in comparison with the others.
Let us summarize our methodology:
1. Find a weighting function $\pi$ satisfying (14);
2. $\pi$ having been fixed, design a generation algorithm which satisfies (16).

Let us look at the second step. Let $\pi$ be given. In order to generate words with the required
distribution (16), we use the methodology presented in section 2, but now the rule
$$T \to x_i \;\Rightarrow\; T(1) = \pi(x_i)$$
replaces rule (2). For the particular case of rational languages, the rules
$$T_i = xT_j \;\Rightarrow\; T_i(n) = \pi(x)\, T_j(n-1)$$
$$T_i = x \;\Rightarrow\; T_i(n) = \pi(x) \text{ if } n = 1;\quad T_i(n) = 0 \text{ otherwise}$$
replace (6) and (7).
The generation procedure works exactly in the same manner as the uniform one of section 2.
In this way, the probability that a word $w$ occurs will be proportional to its weight $\pi(w)$.
Observe that our methodology is equivalent to generating words uniformly at random from
an ambiguous grammar. In this way, the probability of each word is proportional to the number
of derivations which lead to it in the original unambiguous grammar, times its weight $\pi(w)$.
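The weighted variant of the rational case can be sketched as follows. This is an illustration under our own assumptions: the automaton (words over {a, b} without the factor bb) and the weights in `WEIGHT` are hypothetical, floating-point numbers stand in for the exact arithmetic a careful implementation might prefer, and `exact_frequency` is a brute-force check of distribution (16) that is only usable for small $n$.

```python
import random
from itertools import product

# Hypothetical deterministic automaton: words over {a, b} with no factor "bb".
DELTA = {0: {"a": 0, "b": 1}, 1: {"a": 0}}
ACCEPTING = {0, 1}
WEIGHT = {"a": 1.0, "b": 3.0}   # hypothetical weighting function pi


def weighted_counts(n):
    """table[m][q] = total weight of the accepted words of length m read from q."""
    table = [{q: float(q in ACCEPTING) for q in DELTA}]
    for _ in range(n):
        prev = table[-1]
        table.append({q: sum(WEIGHT[x] * prev[r] for x, r in DELTA[q].items())
                      for q in DELTA})
    return table


def generate(n, rng):
    """Draw a word w of length n with probability pi(w) / pi(L_n)."""
    table = weighted_counts(n)
    state, word = 0, []
    for m in range(n, 0, -1):
        r = rng.random() * table[m][state]
        choices = list(DELTA[state].items())
        for letter, nxt in choices:
            share = WEIGHT[letter] * table[m - 1][nxt]
            # The last choice also absorbs floating-point rounding.
            if r < share or (letter, nxt) == choices[-1]:
                word.append(letter)
                state = nxt
                break
            r -= share
    return "".join(word)


def accepted(word):
    state = 0
    for x in word:
        if x not in DELTA[state]:
            return False
        state = DELTA[state][x]
    return state in ACCEPTING


def exact_frequency(n, letter):
    """Expected frequency of letter under (16), by enumeration of L_n."""
    num = den = 0.0
    for letters in product("ab", repeat=n):
        if accepted(letters):
            w = 1.0
            for x in letters:
                w *= WEIGHT[x]
            num += w * letters.count(letter)
            den += w
    return num / (n * den)


print(generate(8, random.Random(42)), exact_frequency(8, "b"))
```

Multiplying the step probabilities along a path gives $\pi(w)/\pi(L_n)$, because the intermediate totals cancel.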
Now, we have to find a weighting function $\pi$ satisfying (14). Let us consider the following
generating function:
$$L_\pi(t, \mathbf{x}) = \sum_{w \in L} \pi(w)\, t^{|w|}\, x_1^{|w|_{x_1}} \cdots x_k^{|w|_{x_k}},$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_k)$.

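For each $i \in \{1, \ldots, k\}$, the expected number of occurrences of the letter $x_i$ in the words
generated by the algorithm is equal to $n\,\mu_i(\pi)$, where
$$\mu_i(\pi) = \frac{\sum_{w \in L_n} |w|_{x_i}\, \pi(w)}{n\, \pi(L_n)} = \frac{[t^n]\, \Lambda_{\pi,x_i}(t)}{[t^n]\, \Lambda_\pi(t)} \tag{17}$$
with
$$\Lambda_{\pi,x_i}(t) = \frac{\partial L_\pi(t, \mathbf{x})}{\partial x_i}(t, \mathbf{1}) \quad \text{and} \quad \Lambda_\pi(t) = t\, \frac{\partial L_\pi(t, \mathbf{x})}{\partial t}(t, \mathbf{1}).$$
Because the computation of $\mu_i(\pi)$ involves computing the coefficients of the above generating
functions, we treat separately the algebraic case (subsection 4.1) and the rational case
(subsection 4.2).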

4.1 The algebraic case

From a more general theorem due to Drmota [4, Theorem 1, p. 107], we immediately state

Theorem 1 Let $\mathbf{y} = \mathbf{F}(t, \mathbf{y}, \mathbf{x})$ be a set of equations from an algebraic grammar with a weighting
function $\pi$, where $\mathbf{y} = (y_1, y_2, \ldots, y_N)$ is the vector of generating functions for the non-terminal
symbols and $\mathbf{F}(t, \mathbf{y}, \mathbf{x}) = (F_1(t, \mathbf{y}, \mathbf{x}), \ldots, F_N(t, \mathbf{y}, \mathbf{x}))$. Under the following assumptions:
• $\mathbf{F}(0, \mathbf{y}, \mathbf{x}) = 0$ and $\mathbf{F}(t, 0, \mathbf{x}) \neq 0$;
• the system of equations is not linear in $\mathbf{y}$;
• the system of equations is of simple type: there exists a set of $N$ $(k+1)$-dimensional
cones $C_j \subseteq \mathbb{R}^{k+1}$ such that, for any $(n, m_1, \ldots, m_k) \in C_j$ with $n$, $m_1$, ..., $m_k$ large
enough, the coefficient of the term $t^n x_1^{m_1} \cdots x_k^{m_k}$ is non zero;
• the grammar is strongly connected: each non-terminal symbol can be reached from any
other non-terminal symbol.
We have the following:
If there exists a positive solution $(\mathbf{y}_0, t_0)$ of the system of equations
$$\begin{cases} \mathbf{y} = \mathbf{F}(t, \mathbf{y}, \mathbf{1}) \\ 0 = \det(I - \mathbf{F}_{\mathbf{y}}(t, \mathbf{y}, \mathbf{1})) \end{cases}$$
where $\mathbf{F}_{\mathbf{y}}$ is the matrix $\left(\frac{\partial F_i}{\partial y_j}\right)_{1 \le i,j \le N}$, then
$$\mu_i(\pi) = -\frac{1}{t(\mathbf{1})}\, \frac{\partial t}{\partial x_i}(\mathbf{1})$$
for any $1 \le i \le k$, where $t(\mathbf{x})$ is a solution of the system of equations
$$\begin{cases} \mathbf{y} = \mathbf{F}(t, \mathbf{y}, \mathbf{x}) \\ 0 = \det(I - \mathbf{F}_{\mathbf{y}}(t, \mathbf{y}, \mathbf{x})) \end{cases} \tag{18}$$
such that $t(\mathbf{1}) = t_0$.

This theorem can help us to solve our problem, by taking the generating function $L_\pi(t, \mathbf{x})$
as the first component of the vector $\mathbf{y}$.
However, Theorem 1 doesn't apply to rational languages, since in this case the system of
equations is linear.
Consider a toy example, namely generating rooted plane trees (see [19]) with $n$ edges such
that the expected number of leaves is $\mu n$ ($0 < \mu < 1$). A classical bijection turns this problem
into the problem of generating words of length $2n$ belonging to the language defined by the
grammar $S \to aSbS + aSb + dS + d$. The factor $d$ represents a leaf of the tree. We can now
change the expected number of leaves in a random tree by putting a positive real weight $\pi_1$ on
the letter $d$: $S \to aSbS + aSb + \pi_1 dS + \pi_1 d$. This weighted grammar meets the conditions of
Theorem 1 and we obtain
$$\begin{cases} F(t, y, x_1) = xy^2 + xy + \pi_1 z_1 xy + \pi_1 z_1 x \\ F_y(t, y, x_1) = 2xy + x + \pi_1 z_1 x. \end{cases}$$

Solving the system (18), we find
$$\pi_1 = \left(\frac{\mu}{1 - \mu}\right)^2 \quad \text{with } 0 < \mu < 1.$$

This means that in order to generate a word with $2n$ letters (resp. a tree with $n$ edges) having
$\mu n$ letters $d$ (resp. leaves) on average, we have to take $\pi_1 = (\mu/(1-\mu))^2$.
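The value of $\pi_1$ can be checked numerically without solving system (18). The sketch below relies on a classical fact that is not stated in the text: the number of rooted plane trees with $n$ edges and $k$ leaves is the Narayana number $\frac{1}{n}\binom{n}{k}\binom{n}{k-1}$. Under that assumption, it computes the exact expected proportion of leaves when each leaf carries the weight $(\mu/(1-\mu))^2$; the function name is ours.

```python
from math import comb


def expected_leaf_fraction(n, mu):
    """Expected (number of leaves) / n for plane trees with n edges, each tree
    being weighted by pi ** (number of leaves) with pi = (mu / (1 - mu)) ** 2."""
    pi = (mu / (1 - mu)) ** 2
    # Narayana numbers count the trees with n edges and k leaves (assumption above).
    weights = [comb(n, k) * comb(n, k - 1) // n * pi ** k for k in range(1, n + 1)]
    mean_leaves = sum(k * w for k, w in zip(range(1, n + 1), weights)) / sum(weights)
    return mean_leaves / n


for mu in (0.3, 0.5, 0.7):
    print(mu, expected_leaf_fraction(200, mu))
```

For a finite $n$ the computed proportion differs from $\mu$ by a term of order $1/n$, which is consistent with the asymptotic nature of condition (14).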
Let us say that this example is particularly simple. In general, we may be unable to solve
the system (18), since we can't use numerical techniques because of the unknown function $\pi$.
Thus, our method fails if we try to generate words of the language defined in [4, p. 112] with a
given expected number of letters $a$ and $b$, since we have to solve an algebraic equation of degree
five whose coefficients are functions of the unknown weights.
Moreover, Theorem 1 only applies to strongly connected grammars, which is a strong
restriction.

4.2 The rational case

In this section, we show that we can solve the problem of generating words according to given
frequencies for a whole class of rational languages. If $L$ is a rational language, the functions
$\Lambda_{\pi,x_i}(t)$ and $\Lambda_\pi(t)$ defined in (17) are rational fractions. The $n$-th coefficients of their Taylor
series at $t = 0$ are a function of the zeros of their denominators (see for example [11, p. 360]).
If we are only interested in the expected number of letters in the words of length $n$ as $n$ goes
to infinity, we just need to know the dominant singularity of each series, that is the real zero of
smallest modulus of their denominator, say $s$, with a multiplicity $\nu$. Let us recall that from
Pringsheim's theorem, at least one of the dominant singularities of the generating function of
any rational language is real positive.
However, in our problem, the denominator depends on the unknown variables $\pi(x_i)$. Hence,
we encounter a priori the same difficulty as in the algebraic case: that of solving a possibly high
degree algebraic equation in the variable $s$ with unknown coefficients, in other words without
the help of numerical approximation. However, thanks to the following proposition, we will be
able to find a solution.
Proposition 2 Let $L$ be a rational language defined on the alphabet $X = \{x_1, x_2, \ldots, x_k\}$. Let
$\pi$ be a weighting function on $X$. Assume that $s$ is the unique real pole of smallest modulus of
$\Lambda_\pi(t)$ (resp. of $\Lambda_{\pi,x_i}(t)$), and $\nu$ its multiplicity. It follows that $\mu_i(s\pi) = \mu_i(\pi)$ for any
$i \in \{1, \ldots, k\}$, and the unique real pole of smallest modulus of $\Lambda_{s\pi}(t)$ (resp. of $\Lambda_{s\pi,x_i}(t)$)
is 1, with multiplicity $\nu$.

Proof:
Under the assumption of Proposition 2 the denominator $Q(t)$ of $\Lambda_\pi(t)$ can be written as
$Q(t) = (t - s)^\nu P(t)$ where the modulus of each root of $P(t)$ is strictly greater than $s$. Hence, the
denominator of $\Lambda_{s\pi}(t)$ is of the form $Q(ts) = (ts - s)^\nu P(ts)$ and it follows that 1 is the unique
root of smallest modulus of $Q(ts)$, with multiplicity $\nu$.
The above proposition means that if there exists a weighting function leading to the required
frequencies, then there exists a weighting function $\pi$, leading to the required frequencies, such that
each series is singular at 1.
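The invariance used in Proposition 2 can be checked directly from definition (16). The short derivation below is added as a reading aid; it assumes that $s\pi$ denotes the weighting function $x \mapsto s\,\pi(x)$, which is our reading of the statement.

```latex
\begin{align*}
(s\pi)(w) &= \prod_{1 \le i \le k} \bigl(s\,\pi(x_i)\bigr)^{|w|_{x_i}} = s^{|w|}\,\pi(w), \\
p_{s\pi}(w) &= \frac{s^{n}\,\pi(w)}{\sum_{w' \in L_n} s^{n}\,\pi(w')}
             = \frac{\pi(w)}{\pi(L_n)} = p_{\pi}(w) \qquad (w \in L_n), \\
L_{s\pi}(t, \mathbf{x}) &= \sum_{w \in L} \pi(w)\,(st)^{|w|}\,
             x_1^{|w|_{x_1}} \cdots x_k^{|w|_{x_k}} = L_{\pi}(st, \mathbf{x}).
\end{align*}
```

Rescaling all the weights by $s$ therefore changes no probability on $L_n$, while a pole of $L_\pi$ at $t = s$ becomes a pole of $L_{s\pi}$ at $t = 1$.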

Remark 3 The assumption of Proposition 2 is quite natural, since if for some $i$, $\Lambda_{\pi,x_i}(t)$ and
$\Lambda_\pi(t)$ didn't have the same dominant pole with the same multiplicity, this would imply that the
frequency of the letter $x_i$ tends to zero as $n$ tends to infinity.

We shall now derive sufficient conditions for the series involved in Proposition 2 to have the
same dominant pole with the same multiplicity. This is the purpose of the following theorem.
Theorem 4 Let $L$ be a rational language defined on the alphabet $X = \{x_1, \ldots, x_k\}$ and $A$ the
finite minimal deterministic automaton which recognizes $L$. Let $C = \{C_1, \ldots, C_f\}$ be the set of
non-trivial strongly connected components of $A$. Assume moreover that for any $i \in \{1, \ldots, f\}$,
$C_i$ is aperiodic (in the sense of Markov chains). Under this assumption, for any weighting
function $\pi$, we have:

• $L_\pi(t, \mathbf{1})$ has a unique pole of smallest modulus and multiplicity $\nu$;

• $\mu_i(\pi) > 0$ for any $i \in \{1, \ldots, k\}$ as $n$ goes to infinity.


Sketch of the proof:
Let us take for instance the trivial weighting function $\pi = 1$. First assume that $A$ is strongly
connected. Let $u = x_i$ for some $i \in \{1, \ldots, k\}$ and let
$$F(t, u) = \frac{B(t, u)}{C(t, u)} = L_1(t, (1, \ldots, 1, u, 1, \ldots, 1))$$
be the generating function of $L$ according to the length and the letter $x_i$. Let $T(t, u) = t\,T(u)$ where
$T(u)$ is the fundamental matrix associated to $A$, where $T_{l,m}(u)$ is either equal to $u$ if there is
an edge labelled by $x_i$ from state $l$ to state $m$, or to 1 if there is an unlabelled edge from
state $l$ to state $m$, or 0 otherwise. Then, since $A$ is strongly connected and aperiodic, $T(z, 1)$ is
irreducible and primitive. It follows from the Perron-Frobenius theorem that $L_1(t, \mathbf{1})$ has a unique
dominant simple pole $s$ (the inverse of the greatest eigenvalue of the fundamental matrix). Hence,
see [19], $C(t, u) = \det(\mathrm{Id} - T(t, u))$ has a unique root
$$\rho(u) = s - \frac{C_1}{C_2}(u - 1) + O((u - 1)^2) \quad \text{with } C_1 = \frac{\partial C}{\partial u}(t, u)\Big|_{(s,1)} \text{ and } C_2 = \frac{\partial C}{\partial t}(t, u)\Big|_{(s,1)}$$
which is analytic at $u = 1$ and such that $\rho(1) = s$. From this fact, the frequency of $x_i$ is
asymptotically
$$\frac{C_1}{C_2\, s},$$
which is strictly positive since $\rho(u)$ is not constant.
Let us return to the less restrictive conditions of Theorem 4, where the automaton can
have more than one aperiodic strongly connected component. Then, the automaton can be
represented as a DAG (Directed Acyclic Graph) with $f$ states, the $f$ non-trivial strongly
connected components, with an edge between state $C_i$ and $C_j$ if there is a path in $A$ between
a state $x \in C_i$ and a state $y \in C_j$.
Hence each word belonging to $L$ can be written $w = w_1 v_1 \cdots w_l v_l$ where $l \le f$ and, for
$i \in \{1, \ldots, l\}$, the $w_i$'s lie in different connected components and the $v_i$'s are the words which
correspond to some edges of the DAG. Now, since we have just proved that the expected
frequency of each letter in each aperiodic strongly connected component is positive and since
the length of the $v_i$'s is constant, it follows that the frequency of each letter in a word belonging
to $L$ is positive. Observe now that a weighting function doesn't change the shape of the
automaton. One can just think about weights added on the edges. Hence, if Theorem 4 applies
to some language $L_1$, it applies to any weighted language $L_\pi$. Let us say for completeness that
under a so-called variability condition in [19], a Gaussian limit law holds for the coefficients of
$F(t, u)$.

Remark 5 If $f > 1$, the dominant pole may not be simple ($\nu > 1$). Hence, we have to use the
formula of [21, Theorem 4.1] for computing the asymptotics of the coefficients of the series.

Remark 6 If $A$ has a unique non-trivial connected component, aperiodicity is not necessary. If
$A$ has period $k$, then we shall consider the words of length $kn$, and thus achieve primitivity.
Remark 7 The assumption that each strongly connected component is aperiodic is necessary if
$f > 1$. To see this, consider the languages
$$L^{(1)} = \varepsilon + ab\,L^{(1)} + bb\,L^{(1)}, \qquad L^{(2)} = \varepsilon + aa\,L^{(2)} + b\,L^{(2)}, \qquad L = L^{(1)} + L^{(2)} b^{*}$$
whose generating functions are
$$L^{(1)}(t, a, b) = \frac{1}{1 - abt^2 - b^2t^2}, \quad L^{(2)}(t, a, b) = \frac{1}{1 - a^2t^2 - bt}, \quad L(t, a, b) = L^{(1)}(t, a, b) + L^{(2)}(t, a, b)\, \frac{1}{1 - bt}.$$

Then $L^{(1)}(t, a, b)$ has two dominant poles,
$$\left\{ \frac{1}{\sqrt{ab + b^2}},\; -\frac{1}{\sqrt{ab + b^2}} \right\},$$
and $L^{(2)}(t, a, b)$ has only one dominant pole,
$$\frac{1}{2}\, \frac{-b + \sqrt{b^2 + 4a^2}}{a^2}.$$
Since $\frac{\sqrt{5} - 1}{2} < \frac{\sqrt{2}}{2}$, the number of words in $L$ of length $n$ is of order
$\left(\frac{\sqrt{5} + 1}{2}\right)^n$ and $L$ has a unique dominant pole. But if we take for example a weighting
function $\pi$ such that $\pi(a) = 1$ and $\pi(b) = 10$, then $L_\pi$ is dominated by the words belonging to
$L^{(1)}_\pi$, since $\frac{\sqrt{104}}{2} - 5 > \frac{\sqrt{110}}{110}$, and hence has two poles of smallest modulus, which is in
contradiction with Theorem 4.
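The two comparisons of Remark 7 are easy to check numerically. The sketch below evaluates the pole formulas given above for the variables $a$ and $b$ (the function name and the returned tuple are our own conventions), first for the unweighted language and then for the weights 1 and 10.

```python
from math import sqrt


def smallest_pole_moduli(a, b):
    """Moduli of the smallest poles of L1, of L2 and of 1 / (1 - b t)."""
    pole1 = 1 / sqrt(a * b + b * b)                        # 1 - (ab + b^2) t^2 = 0, poles +/- pole1
    pole2 = (-b + sqrt(b * b + 4 * a * a)) / (2 * a * a)   # 1 - a^2 t^2 - b t = 0
    return pole1, pole2, 1 / b


print(smallest_pole_moduli(1, 1))    # unweighted case: the pole of L2 is the smaller one
print(smallest_pole_moduli(1, 10))   # weights (1, 10): the pair of poles of L1 is the smaller one
```

In the second case the two poles of $L^{(1)}$ of equal modulus become dominant, which is the situation excluded by the aperiodicity assumption.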

Corollary 8 Under the assumption of Theorem 4, each series $\Lambda_{\pi,x_i}(t)$ and $\Lambda_\pi(t)$, for
$i \in \{1, \ldots, k\}$, has a unique pole of smallest modulus and multiplicity $\nu + 1$.

Using the above results, we shall design an algorithm in order to find a weighting function
satisfying the required frequencies (14).
1. Compute the series $L_\pi(t, \mathbf{x})$, $\Lambda_\pi(t)$ and $\Lambda_{\pi,x_i}(t)$ for $i \in \{1, \ldots, k\}$ (in which the $\pi(x_i)$ are
unknown variables). Let $s$ be the unique dominant pole of these series.
2. Put $s = 1$ and take $\nu$ as mentioned in Theorem 4. Compute the asymptotic values, say
$\bar{\mu}_i(\pi)$, of each $\mu_i(\pi)$ according to (17) and using [21, Theorem 4.1]. This can be done
since Corollary 8 ensures that each series has a unique pole.
3. Solve the following algebraic system in the unknown variables $(\pi(x_1), \ldots, \pi(x_k))$:
$$\begin{cases} \bar{\mu}_1(\pi) = v_1 \\ \quad \vdots \\ \bar{\mu}_{k-1}(\pi) = v_{k-1} \\ D_\pi(1, \mathbf{1}) = 0 \end{cases}$$
where $D_\pi(t, \mathbf{x})$ is the denominator of the series $L_\pi(t, \mathbf{x})$ (the last equation forces 1 to be a
pole of the series).
4. Look for a solution of the above system such that
• each $\pi(x_i)$ is (strictly) positive;
• 1 is the pole of smallest modulus of $L_\pi(t, \mathbf{1})$. (We have forced 1 to be a pole, but not
a dominant one!)
Observe that the system can be solved with numerical techniques (using GB [6] for example)
since the number of equations is equal to the number of unknown variables. If we don't find a
solution that satisfies step 4 of the algorithm, this means, from Proposition 2, that none
exists.
Let us show how the algorithm works by taking, for example, a language involved in the
study of the genome. One calls ORF ("Open Reading Frame") a sequence in a chromosome that
is a putative protein-coding gene. In organisms with compact genomes (genes without introns),
one can approximate the language that describes ORFs by the following rational language $L$,
$$L = atg\,\bigl((X^3 \setminus \{taa, tag, tga\})^{*}\bigr)\,(taa + tag + tga),$$
where $X = \{a, c, g, t\}$. The minimal automaton of $L$ has a unique strongly connected
component, which contains all the letters of $X$.
From the regular expression above, we find the generating function
$$L(x, a, c, g, t) = \frac{atg\,x\,(2\,tga\,x + taa\,x)}{1 - (a + c + g + t)^3 x + taa\,x + 2\,tga\,x}$$
where the coefficient of $x^n$ is the number of words of length $3n$, in order to respect the primitivity
condition (Remark 6). The series $L(x, a, c, g, t)$ has a unique pole of smallest modulus, $\frac{1}{61}$, which
is additionally simple. Since the automaton of $L$ has only one non-trivial connected component,
we are in the case of the first part of the proof of Theorem 4. Hence, taking for instance the
trivial weighting function, say $\pi_0 = 1$, we find that the pole of $L(x, a, 1, 1, 1)$ is
$\rho(a) = \frac{1}{a^3 + 8a^2 + 25a + 27}$.
Expanding $\rho(a)$ at $a = 1$, we find
$$\rho(a) = \frac{1}{61} - \frac{44}{3721}(a - 1) + O((a - 1)^2) \quad \text{and} \quad \mu_a = 44/183.$$
In an analogous way, the frequencies of the other letters of $X$ are $\mu_c = 16/61$, $\mu_g = 46/183$,
$\mu_t = 15/61$. This means for example that a uniform random word of $L$ has $44n/183$ letters
$a$ on average.
We are now looking for a weighting function such that the frequencies of the nucleotides
are the same as those of chromosome III of Saccharomyces cerevisiae (brewers' yeast), say
$v = (0.307, 0.193, 0.193, 0.307)$. Proceeding as above, but now with the series
$$L_\pi(x, a, c, g, t) = \frac{\pi(a)a\,\pi(t)t\,\pi(g)g\,x\,\bigl(2\,\pi(t)t\,\pi(g)g\,\pi(a)a\,x + \pi(t)t\,\pi(a)a\,\pi(a)a\,x\bigr)}{1 - \bigl(\pi(a)a + \pi(c)c + \pi(g)g + \pi(t)t\bigr)^3 x + \pi(t)t\,\pi(a)a\,\pi(a)a\,x + 2\,\pi(t)t\,\pi(g)g\,\pi(a)a\,x}$$
we express (putting $s = 1$) $\bar{\mu}_a(\pi)$, $\bar{\mu}_c(\pi)$, $\bar{\mu}_g(\pi)$, $\bar{\mu}_t(\pi)$ as functions of $\pi(a)$, $\pi(c)$, $\pi(g)$, $\pi(t)$.
With the help of the computer algebra software Maple, we solve the system
$$\begin{cases} \bar{\mu}_a(\pi) = 0.307 \\ \bar{\mu}_c(\pi) = 0.193 \\ \bar{\mu}_g(\pi) = 0.193 \\ \bar{\mu}_t(\pi) = 0.307 \\ 1 - (\pi(a) + \pi(c) + \pi(g) + \pi(t))^3 + \pi(t)\pi(a)\pi(a) + 2\,\pi(t)\pi(g)\pi(a) = 0 \end{cases}$$
and we find the following weighting function: $\pi(a) = 0.31632$; $\pi(c) = 0.16991$; $\pi(g) = 0.18396$;
$\pi(t) = 0.32707$. One can observe that the weights of the letters are close to the frequencies.
This is because the language $L$ is not much constrained, in the sense that only few patterns
are forbidden. Hence, if we take $\pi = v$, we find $\bar{\mu}_a(\pi) = 0.29484\ldots$; $\bar{\mu}_c(\pi) = 0.20648\ldots$;
$\bar{\mu}_g(\pi) = 0.19351\ldots$; $\bar{\mu}_t(\pi) = 0.30519\ldots$, which differ from $v$ by at most 7%.
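Two of the numerical claims of this example can be cross-checked with a simplified model, which is our own assumption and not the series computation of the text: in a long ORF almost all letters belong to the 61 codons that are not stop codons, and under a weighting function these codons appear independently with probability proportional to their weight. The asymptotic frequency of a letter is then a weighted average over those codons. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

STOPS = {"taa", "tag", "tga"}


def asymptotic_frequencies(weight):
    """Letter frequencies in a long sequence of non-stop codons, each codon being
    drawn with probability proportional to the product of its letter weights."""
    sense = ["".join(c) for c in product("acgt", repeat=3) if "".join(c) not in STOPS]

    def codon_weight(codon):
        result = 1
        for x in codon:
            result *= weight[x]
        return result

    total = sum(codon_weight(c) for c in sense)
    return {x: sum(codon_weight(c) * c.count(x) for c in sense) / (3 * total)
            for x in "acgt"}


uniform = asymptotic_frequencies({x: Fraction(1) for x in "acgt"})
print(uniform)   # exact rationals for the trivial weighting function

v = {"a": 0.307, "c": 0.193, "g": 0.193, "t": 0.307}
print(asymptotic_frequencies(v))   # frequencies obtained when the weights are set to v
```

With the trivial weighting function this model gives exactly 44/183, 16/61, 46/183 and 15/61, and with $\pi = v$ it gives values that agree with those quoted above to about three decimal places.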


5 Conclusion
As far as we can see, we present for the first time algorithms which take into account both
statistical and syntactic criteria.
The generation algorithm according to exact frequencies presented in section 3 has a
restricted sphere of activity since the complexity increases strongly with the number of statistical
parameters. Moreover, the algorithm doesn't allow one to deal with the frequencies of patterns.
However, it really works if one is interested in few parameters. For example, in the field of
the study of genomic sequences, one can be interested in generating nucleotides with a given
expected frequency of letters C + G, which is of interest in some analyses.
The method of section 4 for generating according to expected frequencies is much more
efficient. We compute once and for all the weighting function, and then the complexity is identical
to that of uniform generation algorithms. In the case of rational languages, we can always find a
weighting function with the help of numerical techniques [6]. The same method applies for
generating according to expected pattern frequencies (see [17]). Unfortunately, in the algebraic
case we are only able to deal with languages having low degree functional equations.

Acknowledgements
We thank Jean-Guy Penaud, Gilles Schaeffer, Edward Trifonov and Paul Zimmermann for
fruitful discussions. We also thank Marie-France Sagot for her help.

References
[1] E. Coward. Shufflet: Shuffling sequences while conserving the k-let counts. Bioinformatics,
15(12):1058-1059, 1999.
[2] A. Denise and P. Zimmermann. Uniform random generation of decomposable structures
using floating-point arithmetic. Theoretical Computer Science, 218:233-248, 1999.
[3] L. Devroye. Non-uniform random variate generation. Springer-Verlag, 1986.
[4] M. Drmota. Systems of functional equations. Random Structures and Algorithms,
10:103-124, 1997.
[5] I. Dutour and J.-M. Fédou. Object grammars and random generation. Discrete Mathematics
and Theoretical Computer Science, 2:47-61, 1998.
[6] J.-C. Faugère. GB. http://calfor.lip6.fr/GB.html.
[7] J.W. Fickett. ORFs and genes: how strong a connection? J Comput Biol, 2(1):117-123,
1995.
[8] W.M. Fitch. Random sequences. Journal of Molecular Biology, 163:171-176, 1983.
[9] Ph. Flajolet, P. Zimmermann, and B. Van Cutsem. A calculus for the random generation
of labelled combinatorial structures. Theoretical Computer Science, 132:1-35, 1994.
[10] M. Goldwurm. Random generation of words in an algebraic language in linear binary space.
Information Processing Letters, 54:229-233, 1995.
[11] R.L. Graham, D.E. Knuth, and O. Patashnik. Concrete Mathematics. Addison Wesley, 2
edition, 1997. French translation: Mathématiques concrètes, International Thomson
Publishing, 1998.

[12] T. Hickey and J. Cohen. Uniform random generation of strings in a context-free language.
SIAM J. Comput., 12(4):645-655, 1983.
[13] D. Kandel, Y. Matias, R. Unger, and P. Winkler. Shuffling biological sequences. Discrete
Applied Mathematics, 71:171-185, 1996.
[14] D.J. Lipman and W.R. Pearson. Rapid and sensitive protein similarity searches. Science,
227:1435-1441, 1985.
[15] D.J. Lipman, W.J. Wilbur, T.F. Smith, and M.S. Waterman. On the statistical significance
of nucleic acid similarities. Nucleic Acids Research, 12:215-226, 1984.
[16] H. G. Mairson. Generating words in a context free language uniformly at random.
Information Processing Letters, 49:95-99, 1994.
[17] P. Nicodème, B. Salvy, and Ph. Flajolet. Motif statistics. In European Symposium on
Algorithms-ESA99, volume 1643, pages 194-211. Lecture Notes in Computer Science, 1999.
[18] A. Nijenhuis and H.S. Wilf. Combinatorial algorithms. Academic Press, New York, 2
edition, 1978.
[19] Ph. Flajolet and R. Sedgewick. The average case analysis of algorithms: Multivariate
asymptotics and limit distribution. RR Inria, Number 3162, 1997.
[20] M. Régnier. A unified approach to word occurrence probabilities. Discrete Applied
Mathematics, 2000. To appear.
[21] R. Sedgewick and Ph. Flajolet. An introduction to the analysis of algorithms. Addison
Wesley, 1996. French translation: Introduction à l'analyse des algorithmes, International
Thomson Publishing, 1996.
[22] M. Termier and A. Kalogeropoulos. Discrimination between fortuitous and biologically
constrained Open Reading Frames in DNA sequences of Saccharomyces cerevisiae. Yeast,
12:369-384, 1996.
[23] A. Vanet, L. Marsan, and M.-F. Sagot. Promoter sequences and algorithmical methods for
identifying them. Res. Microbiol., 150:779-799, 1999.
