You are on page 1of 16

PROPERTIES OF DETERMINISTIC TOP DOWN GRAMMARS

D.J. Rosenkrantz & R.E. Stearns

General Electric Research & Development Center


Schenectady, N.Y.

Abstract G by a four-tuple (T,N,P,S) where T is the


finite terminal alphabet, N is the finite
The class of context free granmmrs non-terminal alphabet,
that can be determinlstically parsed in a P is a finite set
top down manner with a fixed amount of of symbols each of which represents a
look-ahead is investigated. These grammars, production that we write in the form A
called LL(k) grammars where k is the w where A is in N and w in (N U T)*,
amount of look-ahead are first defined and S in N is the starting symbol.
and a procedure is given for determining
if a context free grammar is LL(k) for a For 71 and 72 in (N U T)*, we write
given value of k. It is shown that ~-rules
71 - 72 if and only if there exit ~i and
can be eliminated from an LL(k) grammar,
at the cost of increasing the value of k ~02 in (N U T)* and production A - 7 in P
by one, and a description is given of a such that 71 = ~PlA~02 and 72 = ~017~2. We
canonical pushdown machine for recognizing
write 71 ~L 72 if in addition ~i is in T*.
LL(k) languages. It is shown that for
each value of k there are LL(k+i) lan- We let "=" represent the transitive com-
guages that are not LL(k) languages. It pletion of "-~" and "=L" the ~'reflexive"
is shown that the equivalence problem is
transitive completion of ''~ " Intu-
decidable for LL(k) grammars. Additional L "
properties are also given. itively 71 = 72 means that 72 can be

In troduc t ion derived from 71 using productions in P


and 71 =L 72 means that 72 can be obtained
The class of context free grammars
that can be parsed in a top down manner from a left derivation.
without backtrack is of interest because
the parsing can be done quickly and the For any 7 in (N U T)*, we let L(7)
type of syntax directed transductions = [w in T* I 7 = w}. The language gener-
ated by G is L(S). This language will
which can be performed over such grammars
sometimes be written as L(G). If produc-
by a deterministic pushdown machine is
tion p is A -" 7, we write Lp(A) = L(7).
fairly general [L&S]. The object of this
paper is to study these grammars.
For a given word w and non-negative
integer k, we define w/k to be w if the
More specifically, we study the LL(k)
length of w is less than or equal to k
•rammars defined by Lewis and Stearns in
L&S]. A number of decision procedures
and we define w/k to be the string con-
sisting of the first k symbols of w if w
are given, including the testing of a has more than k symbols.
grammar for the LL(k) property and the
testing of two LL(k) grammars for equiv- If R is a set of words, let
alence. Methods are given for obtaining
canonic forms which inherit the LL(k) R/k = {w/k I w in R}.
property. Some of the results were stated
previously in [L&S] without proofs. If A is a non-terminal, w is a word
in (N U T)* and p the name of a production
We represent a context free granmmr in P, we write

-z6s-
A = w (p)
2T*/k to represent the set of all subsets
if and only if w can be derived from A of T*/k. We use the term structurally
after first applying production p. equivalent as in [P&U] to mean that two
grammars generate the same strings and
Definition i: A grarmnar G = (T,N,P,S) is the same trees (with the intermediate
said to be an LL(k) grammar for some posi- nodes unlabeled) for these strings.
tive integer k if and only if given
Construction i: Given a g r a m a r G = (T,
i) a word w in T* such that lwl
k; N,P,S) we begin the construction of a
grammar G' = (T,N',P',S') by letting
2) a non-terminal A in N; T" = T x 2T*/k
3) a word w I in T*;
N" = N x 2 T*/k
there is at most one production p in P
such that for some w 2 and w 3 in T*, s" = (s,[~})

4) S = WlAW3; P" = P x 2T*/k

5) A = w2 (p); where symbol (p,R) represents the pro-


duct ion
6) (w2w3)Ik = w.
(A,R) - (An,Rn) ... (Ai,Ri)
Stated informally in terms of par- where A -~A n ... A 1 is the production p
sing, an LL(k) grammar is a context free
and Ri+ 1 satisfies
grammar such that for any word in its
language, each production in its deriva-
tion can be identified with certainty by Ri+ I = (L(A i ... Ai)R)Ik
inspecting the word from its beginning
for all n > i > 0. If n = 0, the right-
(left end) to the k-th symbol beyond the
hand sides of ~he productions are under-
beginning of the production. Thus when a
stood to represent (. The condition for
nonterminal is to be expanded during a
R I reduces to
top down parse, the portion of the input
string which has been processed so far
R I = R/k = R.
plus the next k input symbols determine
which production must be used for the
This gives us a grammar G" = (T",N",P",S").
nonterminal. Thus the parse can proceed
without backtrack. Conversely, wl, A,
Remove all the symbols from N" and
and w constitute the only information
all productions from P"which cannot be
available at that point in a left-to-right
used in deriving a terminal string from
top-down parse.
S". Finally, replace each occurrence of
terminal (a,R) by terminal a. Letting
Any context free language that has
N' be the new nonterminal set~ P' the
a LL(k) grammar can be recognized (top-
new production set, and S' be the start-
down) by a (deterministic) push-down
ing symbol S", we obtain
machine [L&S]. The machine uses a pre-
dictive recognition scheme [OET] in a
G' = (T,N',P',S').
manner that 'Uses" the grammar in the
recognition process and can be said to
Two lemmas are now given which
"recognize" each production. An LL(k)
clarify the relation between G and G'.
grammar is always an LR(k) grammar as
defined in [KN].
Lermna i: The G' of Construction i is a
grammar structurally equivalent to the
Test for LL(k> original grammar G.
In this section, we give a construc- Proof: Given a derivation in G', a
tion which is basic to our LL(k) test and corresponding derivation in G is obtained
then we give the test. In what follows, by replacing each nonterminal (A,R) by A.
we use the standard mathematical notation

-166-
Given a derivation from S in G, a elements, the corollary indicates that the
corresponding derivation from S" in G" set N' may be much smaller. If, for
(where G" is defined in the construction) example, G contained no ( productions, N'
is obtained if, instead of applying could not have more than INI " IN u T U
production p to an instance of A, one [¢}I k elements since the first k elements
applies (p,R) to the corresponding (A,R). of string ? would determine R. Thus, a
Since all nonterminals used must, by much more practical approach to deriving
their very use, be in N' and all produc- G' is to generate it directly from G
tions used must be in P' (after replacing without constructing all of G".
each (t,R) for t in T by t) we obtain
a corresponding derivation in G'. We are now in a position to state the
LL(k) test.
Corollary i: L(p,R~((A,R)). = Lp(A) for
Test: Given a grammar G = (T,N,P,S) and
(A,R) in N'. given an integer k, construct the
grammar G' of Construction I. Then for
Proof: As with S' and S, derivations each w in T*/k and (A,R) in N', test to
from (A,R) and A can be made to correspond. see if there is at most one p in P such
that
Lemma 2: Given G and G' as defined in w is in (Lp (A)R)/k.
Construction I, then for all (A,R) in N'
and %0 and ql in (N' U T)*, This last expression is equal to ((Lp(A)
S' =L ~(A,R)~ implies R = L(q~)/k. /k)R)/k which is certainly computable.
If all w and (A,R) pass the test, then
Proo__~: We will prove the result by the grammar is LL(k) ; otherwise it is
induction on the length of a leftmost not.
derivation. It certainly holds for the
zero length derivation since the initial To prove that this test works, we need
string is (S,{~}) and [¢} = L(()/k. Now a couple of lemmas relating the LL(k)
suppose that it is true for some property with left derivations.
S = w(A,R)? 1 for w in T* and let (p,R)
be La production which applies to (A,R) Lenmm 3: If G = (T,N,P,S) is an LL(k)
from which we obtain grammar, then for all w I in T*, A in N,
S =n w(A'R)~Yi "*L W(An'Rn) "'" (Ai'RI)~Yi" and w in T*/k, there exists a ~ in (N U T)*
such that for all w 2 and w 3 in T*, the
The lenmm certainly holds for occurrences
of nonterminals in ~YI since the string three relations

following them is the same as before. It i) S = WlAW 3


also holds for nonterminals in w since
there are none. Thus, it remains to be 2) A = w2
shown that it holds for the nonterminals
(Ai,Ri). But this is true by construction 3) (w2w3)/k = w
since
imply the two relations
Ri+ 1 = (L(A i ... Ai)R)/k
4) S =L WlAY
= (L(A i ... A l) L(Yl)/k)/k
5) ~ = w 3.
= L(A i ... AiY1)/k. Furthermore, if i, 2, and 3 are satisfied
~llus the lemma is true by induction. for some w 2 and w3, then there is only
one y satisfying 4 and 5.
Coro!lary 2: The number [ N' I is
bounded by IN I times the number of sets Proof: Suppose that w 2 and w 3 do exist
of the form L(y)/k for ~ in (N U T)*.
which satisfy i, 2, and 3. Because WlW2W3
Although the intermediate set N" in has a derivation which includes WlAW3,
the construction has at least
INI • z ITlk there must be a ~ in (N U T)* such that

-167-
S =L WlAT and ~ = w 3 (which are conditions satisfy conditions 4, 5, and 6 of the
! definition. Therefore, we conclude that
4 and 5). Now consider another pair w 2
= ?' and then 4 and 5 hold for all
and w~ satisfying i, 2, and 3. There must choices of w 2 and w 3.
also Be a ?' in (N U T)* such that
'
S =L WlAq~' and q~' = w 3. If ~ ~ ~' then
The last statement of the lemma
the two left derivation sequences must follows as a special case of the above by
differ. Let ~IB~ for Wl in T*, B in N, taking w 2' = w 2 and w 3' = w 3.
and ~ in (N U T)* be the last word for
which the two strings are the same. Word Le~mna 4: A grammar (T,N,P,S) is LL(k) if
Wl must of course be a prefix of w I since and only if for each w I and w in T*, A in
WlA is the beginning of the words being N and T in (N U T)* such that
derived. Symbolically, we now have
S =L WlAW'
6) S =L Wl B£° there exist at most one production p in P
and such that
7) w I = ~i v for some v in T*. w is in (Lp(A) L(~))/k.
Since ~iB~ is the departure point in the Proof: Suppose there are two such
two derivations, there are two productions productions. Then each production has a
B - ~ and B - ~' such that w 2 and w 3 in T* such that A = w2(P) and
8) ~ = vw2w 3 = w 3 (hence S = WlAW 3) which is a
and violation of the L L ~ ) definition.
9) ~*'~ = vw2w
' 3' Conversely, assume the LL(k) defini-
!
tion is violated by some Wl, w2, w3, w2,
We will now show a violation of the LL(k)
definition between the two productions wl, w in T* and A in N and distinct p and
for B. Because of 8, there are ~2 and w3
p' in P. l~tting w~B T be the first point
in T* such that where the left derigatlons differ,
WlW = w{u for some u. Thus, there are
i0) B-~=~ 2
q and q' such that u/k is in
ii) ~ = w3'
and w2w 3 = vw2w 3 . (Lq(B)L(?))/k n (nq, (B)L(?))/k.

Similarly by 9, there are wl and w~ in T* Thus, the lemma is proven.


such that
12) B -" ~' = w~ ,
Corollary 3: An LL(k) grammar is unam-
13) ~=wl '
b iguous.
and w2w 3 = v w ~ I . Because of 3, Proof: The lemma says that in each step
of a left derivation, there is at most
! !
f62~3)/k = (vw2w3)/k = (vw2w3)/k one production which will enable one to
reach the desired terminal string.

The significance of Lemma 4 is that


letting the choice of p can be obtained from a
I--I finite amount of information, namely A
14) w =(w2w3)/k = ~2WB)/k , and L(?)/k. The construction has
given us a method of computing the L(q~)/k
relations 6, i0, ii, 12, 13, and 14 as we go along. We are now ready to
violate the LL(k) definition as we have verify the test.
two productions B - ~ and B - ~' which

-168-
Theorem i: Given a context free grammar 5. (w2w3)/k = w.
G = (T,N,P,S) and given an integer k, one
can decide if the grammar is LL(k). The only difference between this
definition and that of an LL(k) grammar
Proof: We show that the test given earlier is the quantifier "for all Wl" has been
in this section works. moved within the scope of the "there exist
at most one p". Thus, strong LL(k)
By Lenmna 2, the nonterminals (A,R) of grammars are a special case of LL(k).
N' represent all the possible A in N and R Intuitively, they are grammars where one
in T*/k such that R = L(?)/k and can parse correctly knowing only that one
S =L WlA7 for some w I in T* and ~ in is looking for a given nonterminal and
(N U T)*. The test is therefore a test of knowing the next k input. The power of
whether the condition of Lemma 4 holds these grammars is expressed by the follow-
and is therefore an LL(k) test. ing:

Strong LL(k) Grammars Theorem 2: Given an LL(k) grammar


= (T,N,P,S), one can find a structurally
In this section we define the concept equivalent strong LL(k) grammar using
of a strong LL(k) grammar. The power of Construction i of the previous section.
these grammars will be shown to be
structurally equivalent to LL(k) grammars. Proof: We already know that the construc-
We consider these strong grammars more as tion gives a structurally equivalent
a normal form rather than as a class for grammar (Lenmna I). If S' = Wl(A,R)w 3 for
separate study. w I and w 3 in T*, we know that S' =L Wl
For grammar G = (T,N,P,S) and non- (A,R)7 for some 7 in (N' U T ~ s u c h that
terminal A in N, let = w 3. For each w in T*/k, we know
from Lemma 4 that there is at most one p
R(A) = {w in T* I s = WlAW for some w I such that w is in (Lp((A,R))L(~))/k.
in T*}.
For positive integer k, let By the last statement in Lemma 3, we
know that ~ is independent of w I and
Rk(A ) = R(A)/k. hence R = Rk((A,R)) which we know is
Now R(A) is itself a context free language equal to L(7)/k by Lemma 2. Hence, there
with a grammar easily obtained from G. The is at most one p such that
set Rk(A) is then certainly computable. w is in (Lp((A,R))Rk((A,R)))/k

For fixed k, we wish to consider which we observed is an equivalent state-


granmmrs which satisfy the property that ment of the strong LL(k) property. Thus,
for any A in N and w in T** there is at the theorem is proved.
most one p such that (LD(A)Rk(A))/k
Role of (-Rules
contains w. We call these strong LL(k)
grammars. Formulating this concept with- We use ( to represent the null or
out reference to Rk, we get the following: length zero string. A production is
called an c-rule if its right hand side
Definition 2: A grammar G = (T,N,P,S) is is (.
said to be a strong LL(k) grammar for some
position integer k if and only if given Theorem 3: Given an LL(k) granmmr G =
i. a word w in T* such that lw[ ~ k; (T,N,P,S), an LL(k+i) grammar without
(-rules can be constructed which
2. a nonterminal A in N; generates the language L(G) - [(}.
there is at most one production p in P
such that for some Wl, w 2 and w 3 in T*, Proof: We will obtain the desired
grammar by rewritting G in two stages.
I
----I ! !
3. S = WlAW3; For a given grammar a' (T,N,P,S), we
will call an element A of N' U T'
4. A = w2 (p);

-169-
nullable if L(A) ~ {c} and call A non- The starting symbol S I is taken to be S
nullable otherwise. In particular,
if S is non-nullable and to be S' if S is
this means that terminals are non-nullable.
nullable.
The first step is to rewrite G so that the
first symbol on the righthand side of a
Each production in Pi corresponds to
non-E-rule is non-nullable. This will be
done in such a way as to preserve the a derivation in G. For example, A - A 2
LL(k) property. The new grammar w i l l be
obtained from the old by the advance • .. AnB I ... B m corresponds to the produc-
substitution of E-derivations into the tion A - A I ... AnB I ... B m followed by
various strings of leading nullable
the derivation of A I = E. Thus, a
symbols that occur on the righthand side
of productions in P. This preserves the derivation in G I certainly corresponds to
LL(k) property because the look-ahead of
a derivation in G. Conversely, given a
k determines precisely w h i c h initial E-
derivation tree in G, a derivation for G I
derivations should be applied• Readers
is obtained by successively deleting all
who are not interested in the details of
leftmost branches w h i c h result in E and
this step may skip ahead to the descrip-
replacing leftmost occurrences of other
tion of the second step.
nullable nonterminals by their non-nullable
counterparts.
For each nullable symbol A of G, w e
will add a new nonterminal symbol A' to
To verify that G I is LL(k), suppose
the nonterminal set; A' w i l l have the
property that L(A') = L(A) - {E]. Let- that w e are given w I in T*, A ° in Ni, and
ting G I = (T, Ni,Pi,S I) be the grammar, w in T*/k satisfying conditions I, 2, and
w e are trying to construct, our new non- 3 of Definition i. The corresponding
terminal set is described symbolically symbols in G determine a production
as A ~ A I ... AnB I ... B m as described above

N I = NU [A' i A is nullable in G}. (where A = A or A'), and it is clear


o
under the correspondence of derivations
Each production of P can be expressed in
that any production in PI satisfying 4
the form:
and 5 of Definition i must be one of the
A -~ A I ... AnB I ... B m productions obtained from this. Further-
more, it is possible to determine w h i c h
where ~ ..., A n are nullable, B I is non- of the leading A i must go into ¢ and w h i c h
nullable if m>0, m and n are non-negative is the first symbol to not go into ¢ as
integers, and where the case n=0 is the various c-derivations w h i c h d o this
interpreted to mean that A I ... A n = ¢ are determined without further lookahead.
This information then determines the one
and the case m=O to mean B I ... B n = E.
possible production derived from A - A I
For each such production we let Pi contain
. . . .AnB
.. I B m that satisfies Definition
the productions. i.

A - A~ A 2 "" An B I "'" Bm We will now give a procedure for con-


verting G I into an equivalent grammar G 2
A - A 2' . . .AnB
. I. . Bn without E-rules. We will assume that each
nonterminal of G I generates a non-null
A-A' B I ... B terminal string• A nonterminal A w h i c h
n m
does not have this property is easily
A - B I ... B m . removed by deletion (if L(A) is empty) or
by substitution (if L(A) = {E}) without
Furthermore, if A is nullable, w e let affecting the LL(k) property. Let V be
Pi contain these same productions w i t h the nullable symbols of G 1 and let V 1¢ be
A' on the lefthand side instead of A. the non-nullable symbols. Let
If, however, m=0, the production V = ViV*. Any word ? in V I(N 1 DT)*
A' -~ B I ... Bm (i.e. A '~ ¢) is omitted. has a unique representation as a word in

-170-
V + and we let ~(y) represent this word. A = w where w is a non-null element of
For example, letting A represent symbols n o o
of V¢ and B represent symbols of Vi, L(Ai)). We let T be the terminal set of
G^ and let N^ = V' - T be the nonterminal
L z
set. The s t a r t i n g symbol w i l l be S 1
~(BiB2A3A4B5A6A7) = [BI ] [B2A3A4] [BsA6 ] [B7]
(sometimes written [Sl]). Finally, let
where the square brackets limit the words
of V. Thus, the sequence of nullable non- the p r o d u c t i o n s e t P2 f o r G2 be the s e t
terminals that can be generated in a left- of productions determined by the following
most derivation by G I are combined with three rules :
the preceding non-nullable symbol. Ele-
ments of V that are strings of length one Rule i: If B in N I and y in V* are such
E
are considered to be elements of V I i.e. that [By] is in V' and if B - YI is a
[A] = A for A in V. production of Pi' then P2 has the produc-

The overall plan of the construction tion :


is to have a left derivation S I = L ~F in [By] --, o~(yl~, ) .
G I correspond to a left derivation IS I] Rule 2: If b in T, A in V¢, and Yl and
=L ~(~) in G 2. Steps in the G I derivation
Y2 in V* are such that [bYiAY 2] is in V'
which involve an E-rule will be combined E
into a non E-step in order to avoid E- and if A - y is a non-E-rule of PI' then
rules for G 2. This approach involves a P2 has the production
small discrepency in timing as a deriva-
tion such as [DYiA~/2] -" [b] ~(yq~2).
Si =L blb2AIB2
Rule 3: If b in T and w in V + are such
' ' E
(where b I and b 2 are in T,A I in V(, and
that [by] i s in V ' , then P1 has p r o d u c t i o n
B 2 in V I) represents the situation after
2 plus the look-ahead inputs have been [b@] - [b].
considered and
A production obtained from Rule I is
[Sl] =L [bl][D2Ai][B2] used in G^g whenever the corresponding
rule is used in P.. A production obtained
represents the situation where only i from Rule 2 is useld w h e n y i = ¢ followed
plus the look-ahead inputs have been
considered. Thus, to get the same by A - y is applied. In this manner, left
information, the look-ahead for process- derivations in G I and G 2 are made to
ing G 2 will need to be one larger than correspond to each other and the equivalence
the look-ahead for processing G I. In of G I and G 2 obtained.
other words, when a decision as to which
production for [b2A I] should be used, the To see that G 2 is LL(k+i), assume
that we are given w I in T*, w in T*/k+l,
b 2 plus the next k input symbols may be and a nonterminal of G o satisfying
needed to determine whether or not A I conditions i, 2, and 3 o~ Definition i.
will be expanded into (. If the nonterminal is of the form [By]
where B is in Ni, then the only production
We now give the construction in more which can be applied is the one obtained
detail. Let V' be the set of elements of via Rule i from the production of Pi
V which occur in some word ~(~) for some
y in (NiUT)* such that S I =L Y. Any ele- which can be applied to B. If the non-
terminal has the form [bA I ... A n ] as in
ment in V' of the form [BA I ... A n ] must
Rules 2 and 3, then w must have the form
have distinct A i for otherwise G I would bw 2 for w 2 in T*/k. If it does not have
be ambiguous. (If A. were repeated, this form, no rule can be applied. If w
l
there would be two derivatzons of A I ... does have this form, then Wlb and w 2

-171-
determine which of the leading A i must a nonterminal cannot be distinguished by
the next k input symbols, then there would
be eliminated with c-derivations and (if
be two corresponding productions of the
all A. are not so eliminated) which non
l original grammar which could not be dis-
c-rule to apply to the next A.. Thus, tinguished by the next k input symbols.
l
the grammar is LL(k+i) and the theorem is Corollary 4: Given an LL(k) grammar G
proven. with c-rules, a strong LL(k+I) grammar in
Greibach normal form can be obtained for
A nonterminal symbol, A, is said to L(G)- [¢].
be left recursive if and only if A = Aw
for some w in T* and L(A) ~ @. Proof: From Theorem 3, an LL(k+i) grammar
without c-rules can be obtained for L(G)
Lemma 5: An LL(k) grammar G can have no - [¢], and from Theorem 4, an LL(k+i) gram-
left recursive nonterminals. mar in Grelbach normal form can then be
obtained. Construction i preserves this
Proof: Assume that an L L ~ ) grammar has form and the result is strong LL(k) by
a left recursive symbol. Then for some Theorem 2.
nonterminal A, A = Ay (p) and A ~ x (p) Theorem 5: Given an LL(k+i) grammar with-
where x and y are in T*, and p and p' out c-rules for k ~ i, there exists an
are different productions. Because G is LL(k) granmmr with c-rules for the same
unambiguous, y ~ ¢. Furthermore, S = uAv language.
for some u and v. Now consider the
derivations Proof : From Theorem 4, the grammar can be
rewritten so that it is in Greibach normal
S = uAv = uAykv = uxykv form, and is still LL(k+i). This granmuar
will now be rewritten so that it is LL(k)
and S = uAv = uAykv = uAyk+lv = uxyk+lv with c-rules. If there is more than one
production for nonterminal A with terminal
Thus S = uAykv, A = xy (p), A = x (p'), symbol a as the first symbol on the right
hand side of the production, then a new
and (xyk+Iv)/k = (xykv)/k. nonterminal, (A,a) will be introduced.
Let the set of productions for A with a as
Therefore, since the granmuar is LL(k), the first symbol on the righthand side be
p=p', and there cannot be a left recursive
nonterminal. A-aw 1
A - aw 2
A granmmr is said to be in Greibach
normal form [GR] if the righthand side of
every production begins with a terminal
A-.aw
symbol. n
Then in the new grammar these productions
Theorem 4: Given an LL(k) grammar without
will be replaced by:
c-rules, another LL(k) grammar in Greibach
normal form can be obtained for the same A ~ a(A,a)
language.
(A,a) - w I
Proof: For nonterminals A and B let > be (A,a) - w 2
the transitive relation defined by A > B
if A = B~ for some ~0. From Lemma 5 it
I

cannot be true that A > A; therefore, the


nonterminals can be arranged in a linear (A,a) ~w
n
order Ai, .. , A such that for i ~ j, it
is not true that A. > A.. The grammar Note that if one of the original productions
can now be rewrittenZin n3steps. In the is A - a, then the new grammar will contain
the production A - c.
i-th step, each occurrence of A. as the
first symbol on the right hand slide of a Each of the productions in the new
production is replaced by all the produc- grammar for one of the original nonterminals
tions for A. (each of which begins with a begins with a distinct terminal symbol and
terminal s y ~ o l ) . The rewritten grammar therefore the next input symbol distinguish-
is LL(k) since if two new productions for es between these productions. Since the next

-172-
k+l input symbols distinguisn between the control unit. The stack manipulation con-
original productions for ~, th8 next k sym- sists of two cases.
bols after the a distinguish between the
productions for (A,a) in the new grarmnar. Case i: If the top tape symbol is a non-
Thus the new granmmr is LL(k). terminal A, then there is at most one
The class of LL(1) granmnars in Grei- production p such that w is in (Lp(A)R~L.
bach normal form are the simple grammars of (A) ~k)/k. If there is no such production,
Korenjak and Hopcroft [K&H].
the machine stops and rejects the sequence.
Canonical Pushdown Machines If it has such a production, it is of the
form A - aT where ~ is in (N U T)*, and
We will assume throughout this
the machine replaces A by T.
section that G = (T,N,P,S) is a strong
LL(k) grammar in Greibach normal form.
Case 2: If the top stack symbol is a
We know from Corollary 4 that any LL(k')
terminal, then it is popped off if it
grammar can be put into this form for
matches the first symbol of w; otherwise
some k satisfying k ~ k' + i. We will
the machine stops and the sequence is
describe a (deterministic) pushdown
rejected.
machine which recognizes L(G).
When the machine reaches a configura-
The input set for the machine is
tion with no tape symbols, the machine
T U [ ~] where ~ is an end of tape marker
stops and accepts the sequence if and only
not in T. The machine is designed to
accept sequences from the language follow- if the control word (i.e. inputs r+l to
r+k-l) is ~_k-l.
ed by k-I end markers.
The machine operates in close corres-
It is important for later proofs to
observe that the machine has no c-moves pondence to a leftmost derivation in the
and stops after reading in k-i end markers. granmmr. This relationship is described
The pushdown control could, of course, be by the following:
redesigned so as not to read beyond the
first end marker, and the machine would Lemma 6: Let M be the canonical pushdown
terminate its recognition with k-2 c- machine for a strong LL(k) grammar G = (T,
moves. When k=l, there is no need for N,P,S) in Greibach normal form. If M
reaches a configuration with tape ? in
the end marker.
(N U T)* and look-ahead word w of length
The finite state control has enough k-i after input sequence WlW has been
memory to store an input string of length processed, then
k-i and to perform such obvious tasks as
reading in the first k-i inputs. After i) S = L wit
the (r+k-l)-th input symbol has been
processed, the word stored in the finite 2) For all w 3 in T* such that
state control is the string consisting of
the (r+l)-th input through the (r+k-l)-th S = WlW 3 and (w3 ~k)/k-i = w, w3
input. satisfies the relation T = w 3.

Initially, the machine reads the Proof: The proof is similar in spirit to
first k-i input symbols and stores them the proof of Lemma 3 so we present the
as the word in the finite control. The logic more briefly. It is evident from
pushdown tape is initialized with the the construction that S =L w i t and all that
starting symbol S. The tape alphabet
needs to be shown is that T generates all
will be N U T. the w 3 satisfying the conditions of 2).
If the machine has not stopped after Assume that w 3 satisfies S = WlW 3 and
r+k-i inputs have been processed, the w 3 ~ k / k - I = w but not T @ w 3. There must
machine reads in the (r+k)-th input, uses
the top tape symbol and the word w formed be a w I' in T*, A in N, and T' in (N U T)*
by the (r+l)-th through the (r+k)-th input such that S =L w~AT' is the last configura-
symbols to manipulate the stack as
described below, and stores the (r+k+2)- tion before the left derivations of
th through the (r+k)-th symbols in the S =L w i t and S =L WlW3 diverge. Word

-173-
w I' cannot have the form WlX because the (S,ac) - A
only such configuration in the left (A,bb) -" (
derivation of Wl? is Wlq~ itself and we
(A,cc) - (
are assuming w 3 cannot be derived from ~.
(A,b~-) -~ (
Therefore, there is an input word x of
(A,c~) - (
length k such that WlX is a prefix of both
WlW and W l w 3 ~ k . But since x has k sym- Assume now that [anbn} U [anc n} can be
bols, the choice of production to apply at generated by an LL(k) grarmuar. First
WlAY' consistent with x is unique by the rewrite the grarmnar in Greibach normal
n-k
form. Now note that for each k, S =, a
LL(k) property, contrary to the assump- Tn ' ~n = akbn' and ~n = akcn" FurtHermore
tion that the derivations differ. Thus
the lemma is proved by contradiction. for n I ~ n2, ~n I ~ Wn2, or else the
When one applies Construction I to an
LL(k) grammar to make it strong and then
applies the construction of the machine
grammar could have a derivation of the
just given, one obtains a construction h

which is essentially the same as that nen 2


form S = a Li-k--?~i a D . Thus for some
given in Appendix I of [L&S].
value of n, the length of ?n is > k+2.
This lemma says in effect that the
pushdown tape always has the necessary Furthermore in the derivations ?n = akbn
information to recognize any legitimate
and qKn = akcn' since the grammar has no
extensions. The case w 3 = ( implies
(-rules, at most the first k symbols of
= ( so the machine does recognize all
?n can be expanded into a's. Thus each of
words in the language. Since it
obviously only accepts words in the lang- the last two symbols of ?n can be expanded
uage, we have proven : into both b's and c's. Thus qtn = akbmlc m2,
Theroem 6: The canonical pushdown machine and the language cannot be generated by any
for a strong LL(k) grammar in Greibach LL(k) gr anm~ar.
normal form recognizes the language
generated by that grammar. Hierarchy of LL(k) Lan~uaEes

It should be pointed out that push- In this section it will be shown that
down machines of the type described for every k ~ I, the class of languages
above can recognize languages that cannot generated by LL(k) grammars is properly
be generated by an LL(k) grammar for any contained within the class generated by
k. For instance, the language [anbn] U LL (k+l) grammars.
[anon}, which it will be shown has no
First, for each k ~ i, consider the
LL(k) granmuar, can be recognized by the
language [an(bkd+b+cc) n I n ~ i}. Here
pushdown machine described as follows.
"+" denotes the "or" operation. This
This machine operates on the basis of a
language can be generated by the following
word of length 2 and its operation is
LL(k) grammar with c-rules.
described by a set of rules of the form
(A,w) - ~, which mean that if A is the S - aSA
top stack symbol and w is in the stored
S -aA
input string, then A is replaced by q(
and another input symbol is read in. A ~ cc
Combinations of stack symbol and w not A -.bB
shown below result in rejection of the
input string. The initial stack symbol B-¢
is S.
B - bk'id
(S,aa) - SA
However, this language cannot be
(S,ab) - A
generated by an LL(k) grammar with (-rules.

-174-
Lemma 7: There exists no LL(k) grammar
bn3/k-i = bk-l. Since furthermore bk'Idb n3
without c-rules for the language [an(bkd+
no+k'l nl+n2 k i n3
b+cc) n I n • i} where k • i. /k-i = b k'l and a b b " db

Proof: Assume that there exists such a is in the language, it follows by Lemma 6
grammar. We may assume that it is a strong that Y2 = bk'idbn3"
LL(k) grammar G = (N,T,P,S) in Greibach
normal form and we let M be the canonical
pushdown machine associated with this
Having obtained the relations
granmmr. There must be an integer n such
o
n +k-i ?I = ak'ibnl' Z = cm, and
that the input sequence a o causes the
machine to have a pushdown tape with at n
least 2k-I symbols. (This is because M qt2 = b k-I db 3 we can conclude
must distinguish all pairs
nI n2 n ° +k-I b nlc mb k_idbn 3
a and a for n I ~ n2). Thus, the S=a
resulting tape has the form yiZY2 where but this string is clearly not in the
~i and ~2 in (N U T)* satisfy l~iI • k-i language since it contains a d not pre-
fixed by bk. Therefore, the language
and I~21 • k-i and Z is an element of
cannot be obtained by an LL(k) grammar
NUT. without c-rules.
n
By Lemma 6, S =L a °yiZy 2. Also by Theorem 7: For every k • I the class of
languages generated by LL(k) grammars
Lemma 6,
w i t h o u t c-rules is properly contained
within the class of languages generated
+k-I k-i by LL(k+I) grammars without c-rules.
?iZq,2 = ak'ibn° . and ylZql2 = a
2 (n o +k +i )
c Proof: For each k • i, the language

Since G has no c-rules and since I?iI • [an(bkd+b+cc) n I n • i}


k-l, there must exist four numbers nl, n2, cannot be generated by an LL(k) grananar
without c-rules. The smallest class in
n3, and m such that
the hierarchy is that of the simple
languages. By virtue of Theorem 5, the
qlI = ak'ibnl ' other classes correspond to classes in
the hierarchy of languages generated by
n2 unrestricted LL(k) grammars. The exist-
Z = b and Z = cm
ence of this latter hierarchy was first
~2 = bn3 observed by Dr. Kurki-Suomio.

Decidability of the Equivalence Problem


and nl+n2+n 3 = no+k-I
In this section it will be shown
Since [y2] • k-I and G has no c-rules, we
that it is decidable if two LL(k) grammars
know also that n 3 m k-I and n 2 • i. generate the same language. We will begin
with some definitions.
By Lemma 6, input sequence
Let G = (T,N,P,S) be a context free
a
n o +k- ibnl+n2+k- i grammar and ? be a string in (N U T)*. We
define T(?), the thickness of ql, as the
results in the tape Y2 since length of the shortest terminal string
that can be generated from % i.e. ~(q~) =
n +k-I nl+n 2 n3 min[nl there exists x such that ? = x and
S =L a o b ql2, ql2 = b and n = Ixl]. Note that T(qIl~2) = T(~I) +

-175-
• ) The canonical pushdown machine for a
super strong LL(k) grammar will stop and
For w in T*~* and qt in (N. U T)*, reject if and only if it discovers a tape
let S(?,w) = [x I ? = x and x ~ = wy for ?, look-ahead word w, and input a such
some i m 0 and y in T*}. If S(?,w) is not that S(wa,v) = ~.
empty, let
To prove the theorem, we need only
•w(?) = min {n ] n = Ixl for x in $(?,w)}. give an equivalence test for two super
strong LL(k) granmmrs G I = (TI,Ni,Pi,S I)
Lemma 8: If grammar G is in Greibach
normal form, t is the maximum thickness and G 2 = (T2,N2,P2,S2). We let M I and M 2
of the righthand sides of productions in be the corresponding canonical determinis-
P, w in T*~* is of length m, ? is in tic pushdown machines. We will describe
(N U T)*, and S ( % w ) is non-empty, then a third deterministic pushdown machine,
M 3 with the property that L(Gi) = L(G 2)
T(?) ~ tw(?) ~ ~(?) + m(t-l)
if and only if M 3 accepts the set of all
Proof: Clearly Tw(q~) m ~(?). Since G is strings.

in Greibach normal form, it requires the Mq will attempt to simultaneously


application of at most m productions to simulate M I and M 2 by having a two track
convert ql~ i into a string of the form
pushdown tape, one track for each tape
w~. Each of these productions replaces
of the simulated machine. For a symbol
a nonterminal (whose thickness is at
which can appear on the tape of M I or M 2
least i) by the righthand side of a
production (whose thickness is at most t), that has thickness T, M 3 will have a
thereby increasing the thickness of the
"symbol" that can occupy T squares on the
intermediate string by t-l. Thus,
corresponding track of its tape. Assume
T (~) ~ ~(V) + m(t-l). that after reading in w I (followed by
w
Theorem 8: It is decidable whether or not look-ahead word w of length k-l), M I has
two LL(k) grammars generate the same lang- v on its tape, M 2 has ~ on its tape, and
uage.
neither machine is in the rejecting state.
Then T(v) will be the size (number of
Proof: For purposes of this proof, we will
call strong LL(k) grammar G = (T,N,P,S) squares) occupies by the string on the
super stron ~ if i) all its productions first track of M 3 that corresponds to v,
have the form A - a~ for some A in N, a and T(~) will be the size of the string
in T, and ~ in N* and 2) there exist a on the second track of M 3 that corresponds
func t ion to ~.
f:N ~ 2 T * ~ k / k s u e h that for all
The key observation is that the
w in T * ~ k / k , w I in T* and A~ in N* lengths T(v) and ~(v) will differ by at
most 2(k-l)(t-l) whenever L(G I) = L(G2),
satisfying S =L WlA?' S ~ ? , w ) is nonempty
where t is the maximum thickness of the
if and only if w is in f(A). If an LL(k) righthand sides of the productions in
grammar satisfies condition i, it can be Pi U P2"
made to satisfy condition 2 by applying
Construction i. The function f is given To verify this last observation, we
simply by f((A,R)) = R~k/k where (A,R) is note that L(G I) = L(G 2) implies that
a nonterminal expressed in the notation of
Tw(V ) = Tw(U), for if to the contrary
the construction. An LL(k) grammar
satisfying condition 1 is easily obtained
T (v) > Tw(~) , the minimum continuation
from a LL(k) Greibach normal form grammar w
by introducing nonterminals A a and of v acceptable by M 1 would not be
productions A a -~ a for a in T and then acceptable to M 2. From the fact that
replacing occurrences of a by A a where (v) = ~w(~), it follows from Lemma 8
w
required to obtain the form expressed by i.

-176-
that 17(v)- ~(~) I ~ 2(k-l)(t-l). Properties of LL(k) Grammars
M 3 will now be described in greater In this section, various closure and
undecidability properties of LL(k) grammars
detail. It has a two track tape as will be given. These results are primarily
described above, but the top 2(k-l)(t-l)+l of a negative nature.
cells of the tape are kept in the finite
state control unit. Thus, if the differ- Theorem 9: If the finite union of dis-
ence in thickness of the two simulated joint LL(k) languages is regular, then all
tapes is less than this amount, M 3 has the languages are regular.
access to the top of both tracks of the
simulating tape. M 3 simulates both M I Proof: Let T be the combined terminal
vocabulary of the languages. It is suffi-
and M 2 as long as neither one has entered cient to prove the results for the regular
the rejecting state and the difference in set T* since if the union is a regular set
thickness between the two tapes being R, one can add the LL(1) language T*-R to
simulated is ~ 2(k-l)(t-l). the given set of LL(k) languages in order
to get a disjoint union which gives T*.
Machine M 3 is designed to accept Letting n be the number of languages in
all input sequences until one of the the union, we let M. for 1 ~ i ~ n repre-
following three things occur: sent the canonical ~ushdown machines for
these languages. We will call a machine
i) the difference in thickness configuration viable if there exists a
between the two tapes become sequence of inputs which causes the machine
greater than 2(k-l)(t-l); to enter an accepting configuration. To
prove the theorem, we will derive an upper
2) an input sequence causes exactly bound on the tape langths of the viable
one of the machines M I or M 2 to configurations that can occur when input
sequences are applied to the starting
stop in a rejecting state;
configurations of the M i. Thus the
languages will be shown to be recognized
3) one of the machines accepts a by a finite set of configurations and
sequence and the other does not. hence be regular.
If Mq rejects because of reason i, Suppose that input sequence w I in T*
we know that the machine with the is applied to each machine M s causing it
shorter tape can accept a short sequence to enter configuration C~ and let V be the
which the other machine cannot and hence subset of these configurations which are
L(G I) ~ L(G2). If M 3 rejects because viable. Since a sequence of k end markers
of the second reason, we know that the must be accepted by one of the C. and
machine which stopped in a rejecting since the canonical M. do not made (-moves
state cannot accept any continuation of and accept only with elmpty stacks, this C.
the input sequence whereas the other will be in V and will have tape length ~i I
machine can. The languages are obviously which is less than or equal to k. Now
different if rejection is for the last suppose that we have found a j element set
reason. Conversely, if there is an input V. which is a proper subset of V and which
sequence in L(G I) which is not in L(G2) , has configurations with tape lengths of
maximum size ~'3" Let ~j+l be the shortest
then the application of that sequence to
M 3 will obviously cause M 3 to reject for sequence in T * ~ k which does not cause a
configuration in Vj to enter an accepting
one of these reasons.
configuration. (Such a sequence exists
The problem has now been reduced because V. is a proper subset of V and the
to deciding if deterministic pushdown 3
languages are disjoint). Some configura-
machine M 3 rejects any sequences (or if
tion C i in V-Vj must accept this shortest
the complement machine accepts any). This
sequence and must therefore have tape of
is a well-known decidable question [CG&G].
length less than or equal to P]+i and
therefore Vj+ I = V.j U [C.}i is a j+l

-177-
element subset of V with tapes of length I n,m m i} are two LL(1) languages, whose
~j+l or less. We have defined by induction intersection is L3, which is not LL(k).
bounds ~i and sets V i for all i such that L 8 = [bnan + c n a n i n m i } is an LL(1)
I ~ i ~ IV I. Thus if the ~. can be bound-
l language whose reversal is L3, and is not
ed independent of Wl, the tape lengths of EL(k).
all reachable viabl~ configurations will
be bounded and each machine will accept a From Theorem 9, the language L 9 =
regular set. [ambn I i ~ n ~ m} is not an eL(k) language
since its union with the LL(1) language L I
We have already observed that ~i
is the finite state language [ambn I m,n
has a bound independent of Wl; namely k. ml }. However, L 9 is the concatenation
Suppose that a bound b. has Been found
3 of a* and [anbn I n m i}, each of which
for all the possible ~. resulting from all is an LL(1) language. Therefore the set
3
choices of w I. The possible bj+ 1 are of LL(k) languages is not closed under
concatenation.
determined from the possible V. which can
3
arise. But the V. which arise have tape The language Li0 = [dambm+l + e a m ¢ m+l
lengths of bj or 3 less and there are only I m m 0} is an LL(1) language, but its
a finite number of such Vj. Thus we may 1~n~ge under the homomorphism that converts
choose b~+
I ~ to be the max~num Zj+i over d and e to a while leaving a,b, and c un-
changed is the non-LL(k) language L 3.
the range of possible V.. Thus bounds b.
j z
are established by induction and the Some undecidability properties will
theorem proved. now be given.

Corollary 5: The complement of a non- Theorem Ii: Given a context free grammar,
regular LL(k) language is never LL(k). it is undecidable whether or not there
exists a k such that the g r a m a r is LL(k).
Qorollary 4: Theorem 9 generalizes to
languages recognized by deterministic Pro0f: Consider a Turing machine M with
pushdown machines which accept only with some initial finite string on its tape.
an empty stack and do not make c-moves. An instantaneous description [Davis] for
the Turing machine is a string that
Theorem I0: The LL(k) languages are not indicates the current tape contents, in-
closed under complementation, union, inter- ternal state, and position of the head on
section, reversal, concatenation, or c-free the tape.
homomorphisms.
Let c be a symbol that cannot appear
Proof: Korenjak and Hopcroft [K&H] give in an instantaneous description and for a
examples of simple languages whose closure string x let x r denote the reversal of x.
under most of these operations produces An LL(1) grammar G I with sentence symbol
non-LL(k) languages. S I can be obtained whose sentences are of
r r
the form Po c Pl c P2 "'" c P2i-i c P2i c
Since the language L 1 = [ a m b n I i <
m~n} is an LL(1) language which is not ... c P2n where Po is the initial instant-
finite state, its complement, Lg, is not aneous description and for each i between
LL(k) by Corollary 5. The language L o = i and n, p^. is the instantaneous descrip-
[anbn + anon I n e i} was shown to be = tion that ~ s u l t s from instantaneous des-
a non-LL(k) language in an earlier section. cription P2i-i by one move of the machine.
However, L^ is the union of two LL(1)
languages, ~ L4 = [anbn I n m i} and Another LL(1) grammar G 2 with sen-
tence symbol S 2 can be written such that
L 5 = anon I n ~ I}. The languages L 6 =
its sentences are of the form qo c q~ c
[an(b + b) n I n ~ i} and L 7 = [anbm U ant m r
q2 "'" c q2i c q2i+l c ... c q2n where

-178-
for each i between 0 and n-l~ q2i+l is the Prqof: The problem of computing the look-
ahead required to determine which production
instantaneous description that results from to apply is very similar to the problem in
instantaneous description q2i by one move [L&S] of computing the "distinction index"
of the machine. of two occurrences of a nonterminal and it
is a fairly straightforward exercise to
Now let G 3 be the grammar with sen- reduce the first problem to the second.
tence symbol S R whose productions include As there is little insight to be gained
all the productions of G I and G 2 plus the in repeating the relevant definitions and
two new ones S 3 - S I and S 3 - S 2. techniques from [L&S], we omit further
detail.
A string of the form
r Theorem 13: It is undecldable whether or
r ° c r I c r 2 c ... c r2n not an arbitrary context free grammar
generates an LL(k) language, even for a
is a prefix of strings in both L(Gi) and fixed k.
L(G 2) if and only if ro, rl, r2, ..., rn
Proof: The proof of the corresponding
is the sequence of instantaneous des- theorem [K&H] for simple grammars is
criptions produced by the Turlng machine valid for LL(k) grammars.
when it starts with Pc" Therefore, if
the machine does not halt, then there However, given an arbitrary context
are arbitrarily long sequences that are free granmmr, it is decidable [P&U] if
prefixes of sentences in L(G I) and L(G2) , there exists an LL(1) grammar without
(-rules that is structurally equivalent
no choice can be made between productions to the original one.
S 3 - S I and S 3 - S 2 by looking ahead a
finite amount, and G 3 is not an LL(k) AcknowlgdKment
grammar for any k.
The authors are grateful to P. M.
On the other hand, if the machine l~wis for valuable discussion and en-
does halt, then there is a bound on the couragement.
length of any prefix of strings in both
L(Gi) and L(G2) , namely the length of the References
series of instantaneous descriptions that
lead to the halting condition. Therefore, [Davis ] - Davis, M., Cgmputabi!ity and
by looking ahead this amount it is always Unsolyability, McGraw-Hill,
possible to choose between productions New York, 1958.
S 3 ~ S I and S 3 - S 2, Any subsequent
E~G] - Ginsburg, S. and Greibach, S.,
choice between productions can be made on "Deterministic Context-Free
the basis of the next input symbol. Languages," Informatlon & Control,
Therefore, if the machine halts, G 3 is an 9,6 (December 1966) 620-648.
LL(k) grammar for some k.
[GR] - Greibach, S., "A New Normal-Form
Since G 3 is L L ~ ) if and only if the Theorem for Context-Free Phrase
Structure Grammars," J. ACM, 12,
machine halts, which is undecidable, it is 1 (Jan. 1965)42-52.
undecidable if there exists a k such that
G 3 is LL(k). [K&H] - Korenjak, A.J., and Hopcroft, J.E.
"Simple Deterministic Languages,"
Although given a general context IEEE Conference Record of Seventh
free grammar, it is decidable if there Annual Symposium on Switchin K and
exists a k such that it is LL(k), given Automata Theory, IEEE Pub. No.
an LR(k) grammar, the problem is un- 16-C-40, Oct. 1966, pp. 36-46.
decidable.
Theorem 12: Given an LR(k) grammar of [KN] - Knuth, D.E., "On the Translation
known k, it is decidable if there exists Of Languages from Left to Right,"
a k' such that the grammar is LL(k'). Information & Control, 8,6

-179-
CDecember 1965) 607-639.

[L&S] - Lewis, P.M. II, and Stearns, R.E.,


"Syntax-directed Transductions,"
J. ACM 15, 3 (July 1968), 465-488.

[OETT]- Oettinger, A. Automatic Syntactic


Analysis and the Pushdown Store,
Jacobson, R. (Ed.), Structure
of Language and Its Mathematical
Concepts, Proc. 12th Symposium in
Appl. Math., Amer. Math. Soc.,
Providence, R. I., 1961, pp. 104-
129.

[P&U] - Paull, M.C., and Unger, S.H.,


"Structural Equivalence of LL-k
Grammars," IEEE Conference Record
of Ninth Annual Symposium on
Switchinx and Automata Theory,
IEEE Pub. No. 68-C-50-C, Oct.
1968, pp. 176-186.

-180-

You might also like