You are on page 1of 11
‘Theoretical Computer Science 3 (1976) 359-369, @ North-Holland Publishing Company A STRONG PUMPING LEMMA FOR CONTEXT-FREE LANGUAGES David S. WISE Computer Science Department, Indiana University, Bloomington, IN 47401, US.A Communicated by Michael A. Harrison Received January 1975 Revised August 1976 Abstract. A context-free language is shown to be equivalent to a set of sentences describable by sequences of strings related by finite substitutions on finite domains, and vice-versa, As a result, 2 necessary and sufficient version of the Classic Pumping Lemma is established. This result provides a guaranteed method of proving that a language is not context-free when such is the case, An example is given of a language which neither the Classic Pumping Lemma nor Parikh's ‘Theorem can show to be non-context-free, although Ogden’s Lemma can. The main result also establishes {a"ba*} as a language which is not in the Boolean closure of deterministic context-free languages. 1, Introduction One of the most useful results about regular languages is Nerode’s Theorem [8], which yields a “sure-fire” scheme for proving either that a language is regular (by presenting a finite state automaton) or that it is not (by violating any finite right congruence on 3*) The standard technique for establishing that a language is context-free is to present a context-free grammar which generates it or a pushdown automaton which accepts it. If it is not context-free, that Classic Pumping Lemma [2] or Parikh’s ‘Theorem [7] often can establish the fact, but they are not guaranteed to do so, as will be seen. The characterization of context-free languages by non-deterministic pushdown automata does not solve the problem because of the difficulties in establishing constraints on arbitrary non-deterministic computations, In this paper context-free languages are characterized by three finite substitu- tions on a finite domain (closely related to self-embedding nonterminals), such that a sentence is in a language precisely when a finite sequence of strings exists which are related by these substitutions in a manner reminiscent of the Pumping Lemma (Property 3 below). The domain is analogous to the finite set of partition blocks in Nerode’s Theorem. The main theorem (Theorem 2) also establishes a Pumping ‘Lemma for sentential forms (Property 2) as equally powerful. 359 360 DS. Wise Two applications are presented which demonstrate the power of the results. Using Property 2, Theorem 4 establishes that {a"b%c' |p,qr2=0 and p~qX rp} is not a context-free language whereas the Classic Pumping Lemma and Paraikh’s Theorem fail to do so. Theorem 5, which is not directly obtainable from characterization of context-free languages in terms of grammars or machines, states that {a"ba""} is not expressible as a finite intersection of context-free languages. It was a corollary to Theorem 5, the fact that this language is not in the Boolean closure of deterministic languages, which originally motivated this work. 2, Definitions The notation generally follows Aho and Ullman [1]. If 5 is a vocabulary, 3* denotes the set of strings on 3, and 3* denotes the set of non-empty strings on 3. A grammar is a quadruple (N, 3, P,S) where N is a set of nonterminals, 3 is a terminal alphabet such that NO E=$, SENUS is the start symbol, and PC(NU{S})"x (NUS)? is the set of productions. A grammar is context-free if PC(NU{S})x(NUS)*. Lower case Latin letters denote characters in 3 if early in the alphabet and strings in 3'* if at the end of the alphabet. Upper case Latin letters usually denote characters in N and upper case Greek letters will denote auxiliary alphabets. Lower case Greek letters denote arbitrary strings. Of special note is e, denoting the empty string. The length of a string @ is written |a|;|e|=0. The derives relation =3> applies between two strings, for example a => 8, when a production of G applies to @ and results in B. Often a production (A,B) P is displayed as A — B, with G understood. A derivation of o, from a» is a sequence of strings oo, 0%,-..,0, such that oj-; => 9; for all O0 is denoted by =>", and its reflexive transitive closure is denoted by =3>*. Note that if we say A =>*o then the derivation of o can proceed without regard to the context in which A appears. If, however, 3,A5:=3>* 8,06: it is not necessarily true that A =5>*. The set of sentential forms of G, denoted SF(G), is {a | S =>* o}. The language of G, denoted L(G), is SF(G)NE*, A finite substitution f is a mapping of some finite set I” onto finite subsets of A * for some finite set A. The mapping f may be extended to strings in the natural manner: f(e)=e and f(Aa)= f(A)f(a) for A ET,a Er. A set § of n-tuples of non-negative numbers is said to be linear if there is an integer k >0 and n-tuples vo,...,¥% such that S={vo+ 3i.1(mu)|m 0 are integers}. A set of n-tuples is semilinear if it is a finite union of linear sets. A Parikh mapping [9] q is a mapping of z €3* into a |Z|-tuple of non-negative A strong pumping lemma for context-free languages 361 integers defined by q(z)=(# a(2),-.4 #as(z)) where # 4 (z) is the number of times a, € occurs in z. For L CE*, define q(L)= {q(z)|z EL}. ‘A nonterminal A is cyclic if A =3> *A and any derivation by which A => *A isa cycle, A nonterminal A is self-embedding in a context-free grammar G if A => ‘BAy and By#e, Other authors restrict B# e# y. The new definition specifies a somewhat larger class of nonterminals, elsewhere described as “recur- sive but not because of a cycle,” which characterize grammars that ate necessarily context-free, as we shall see. A production A —> a is said to be self-embedded at the jth step of a derivation 01 => 02 => +++ => o, if for 1 *BAy, By# e. Intui- tively, a production is self-embedded if its left part has already generated a self-embedding at that point in the sentential form. A self-embedding chain from A isa derivation A =5> *BAy with no self-embedded productions. The derivation tree of any self-embedding chain is bounded in depth by | N’| and the degree of any node in such a tree is bounded by the length of the longest production, so for any context-free grammar there are only a finite number of self-embedding chains. 3. Results The usual tactic for proving that a language is not context-free is to obtain a contradiction of Bar-Hillel’s “pumping” Lemma [2] (the ‘Classic Pumping Lemma”), Parikh’s “semilinear” Theorem [7], or Ogden’s Lemma [6] (an extended version of the Classic Pumping Lemma). Often the given language is intersected with a regular set or transformed by a gsm mapping before one of these techniques is applied. If a language is context-free then the conditions stated in the Pumping Lemma, Parikh’s Theorem, and Ogden’s Lemma are necessarily satisfied, but none of them are known to guarantee that a language is context-free. Therefore, there is no guarantee that these statements will generate a contradiction if it is improperly assumed that a language or its transformed image is context-free. Ogden’s Lemma is, however, more powerful than the Classic Pumping Lemma or Parikh’s Theorem and may characterize context-free languages. Theorem 1. (Ogden’s Lemma) [6]. For each context-free grammar G = (N, 3, P,S) there is an integer k such that for any word z € L(G), if any k or more distinct positions in z are designated as distinguished, then there is some A € N, and strings u,v, w,x,y €Z* such that () S=>*uAy; A=>* Ax; A=>*w; uvwry i) w contains at least one distinguished position. (iii) Either u and v both contain distinguished positions or x and y both do, (iv) vwx contains at most k distinguished positions. 362 DS. Wise I know of no non-context-free language which displays the property cited for any of its grammars, but it is not known whether satisfying (j)-(iv) of Theorem 1 is sufficient to establish that a language is context-free. The emphasis of the theorem is on “distinguished positions”, yet it is unclear why a grammar which satisfies ()-Civ) might necessarily describe a context-tree language. In an attempt to capture the essence of the context-free language property, we shall prove the following three statements to be equivalent. 1. L is context-free, 2. There is an unrestricted grammar G and an integer k such that L = L(G) and if 7 € SF(G),|o|>k, then o may be rewritten as o = vowyy where w7 ¢, ux # &, |vax |< k, and there is a nonterminal A in G such that S=>* vAy, A =>" vAx, and A =>*w, 3. LCE* and there exist: a finite alphabet I’ disjoint from ¥ and a distinguished S EP U 5, a substitution, h, mapping I” U{S} onto finite subsets of (I" U X)* whose domain is extended to 3 by defining h(a) = {a} for a © 5 and extended thence to strings on E UF U{S} in the usual manner, and two substitutions, f and g, mapping I U{S} onto finite subsets of (7 UE)* such that e€ f(C)g(C) for all CET but f(S)={e}= g(S), such that whenever ¥EL there is a finite sequence 00, 01,025... 0m of strings in (S U)* such that S= 00, 2 =o, and for 1" BAy in some context-free grammar G for L. If p is a substitution mapping each nonterminal A into the set of triples which represent self-embedding chains for A in G then F(C)= PB), g(C)= p(y), and h(C)={p(e)| A =>*o with no self-embedded productions}. Property 3 concentrates our attention upon the finiteness of self- embedding chains, which is somewhat analogous to the finiteness of congruence classes in Nerode’s Theorem [8]. Theorem 2. (Strong Pumping Lemma). Properties 1,2, and 3 above are equivalent. Proof. Property 1 =3 Property 2. If G =(N, 3, P,S) is context-free, construct N’, Z', and P’ by priming all characters. Then G’=(N'US', NUE U{S}, P'U{A'>+ A|A ENUSU{S)}, S') is context-free and L(G')= SF(G). If @ € SF(G) then Theorem 1 can be applied (using G’) whenever |o|>k if all A strong pumping lemma for context-free languages 363 positions are distinguished. The trivial homomorphism from SF(G') to SF(G) establishes 2. Property 2 =3> Property 3. Given k and G =(N, 3, P, S’) as described in 2, for every AEN define p(A)={(8,A,y)|By€(NUZ)", A=>*BAy, and 0< |By| k) may be excluded by the length restriction. Since Property 2 applies to every sentential form of sufficient length, however, we shall be able to describe some derivation for every sentence in L in terms of the p mappings. Since G is unrestricted, p(A) may not be effectively constructable, but it does exist and is finite because of the bound k. Define $=(¢,S',e) and P= U,ewp(A). I” is clearly a finite set. Define p(a)={a} for aX and extend p to a length-preserving string substitution on (N UX)*in the natural manner. The substitutions f, g, and h defined as follows are also finite: MBAY)= VU Plo) ieee h(a)={a} for a3, $(BA,7))= P(A), and 8((8, 4,1) = ty) Let z LL. If|z|k. Beginning with z apply Property 2 repeatedly to get a sequence of sentential forms 2 =y--4f: such that [Zi * Ax, and A, =>" @, for all 10<|a,| and | ya,x; |< k, so {%-:|<|G| and the sequence is finite: m <|z|. If we assume that there exists a string oj = 5%, © p(G) such that , € p(v,); % € P(y); & € p(o)s % &p0a)s and d; & p(¥,) (and this is trivially true for j = m) then we may easily construct 1-1 = 8)(4y An xi) E P(E) We still have 5, € p(v,) and dy € p(w), and by definition (v, A), x)) © p(A;) because [ywxl* wy |ay| Property 1. Suppose that 5, I", S, f, g, and A are given as in 3, defining a language L. Let G =(I,3, P,S) where P is constructed as follows: = Ad, (A > Ba A > BAY |BEf(A),a Eh(A), y €8(A)} Now suppose z € L(G) and consider its derivation S => oy => ++: => on = z, This sequence of sentential forms satisfies the requirements of Property 3 on the sequence of a, so that z © L, On the other hand, if z © L the sequence of a; (which necessarily exists) describes a derivation of z in G. Hence L(G)=L, so L is context free, O Note that the grammar constructed immediately above may have one e- production, because there is no restriction preventing e € h(C) for CET; ife EL then e € f(S)h(S)g(S). Corollary 1, (The Classic Pumping Lemma) [2]. If L is context-free, then there exist integers m and n such that when z © L. |z|>m then z may be written z = uvwxy where |vwx|3 may be rewritten as z= uvwxy where |vwx|<3, ox#e and uv'wx'y EL, for all 120. Proof. The Parikh mapping of L, is a union of six linear sets of triples, corresponding to the six ways of ordering three distinct integers. The linear set corresponding to the case in which the integers (p, q,r) are in decreasing order is generated by {(2,1,0)+i(1, 1, 1) + (J, 1,0)+(1,0,0)|i,j,k #0}. The other five linear sets are generated by uniformly permuting the coordinates of all vectors in this set. Each a?b‘c’ L, is a member of the linear set identified by the sorting of (2.9.7). The classic pumping criterion always applies to a “le L,when #[a] + #[b]+ #[c]>3. Choose +E ¥= {a,b,c} such that #{1] is largest, and then choose 13 it follows that k= #[t]. Let o= 1, w= 6, x=, and u and y be appropriate so that uvy =z. It follows easily that |owx|<3 and uv'wx'y EL; for all i=0. Finally we must establish that L, is not context-free, Theorem 1 yields an easy contradiction to the assumption that it is by considering a*b***!c**™" with the first k positions distinguished. Any factorization must pump a’s, or a’s and b's, or a's and c's, If there are precisely q a's in vx then 1*uAy and A =>" vAx => * owe, Proof, As in the proof of Theorem 2 apply Property 2 repeatedly to obtain a sequence of sentential forms z = f,,...,¢1 and nonterminals A; € N such that for 1" wAyy, and A) =>* 0; () ae, vx e, and | nowy, | k. We will begin by determining how some {; might satisfy the claim. Observe that although (2) is not a context-free derivation the derivation (4) of yiax; from A, is independent of the context surrounding A, (Its precisely this fact which allowed us to pass from Property 2 back to Property 1 in Theorem 2.) It any y, or x, is, or is used (in derivation (2)) to derive, a string containing an a then it is, or is used to derive, a string composed entirely of a's. Similarly for a string on b or on ¢. If any » or xi is used to derive some string of a’s then there is some later sentential form g in which yx; a* and 1<|»,x,|*uAy, A) =>" vAx => * owe, and 1S #, (ux) = #. (1 vlek0. (In fact />|N| because of the definition of n; this fact reappears in the following argument.) Restricting i tuAyy and A, =>" vA,x =>" ows, where (p - q)<#. (ux) 0. Let m now be #,(vx) and notice that m clearly divides n!, Choose i=n!/m or i=2n!/m depending on whether #,(ux)=0 or #.(vx)=0. Then one of the following two sentences is shown to be in L, (where d is a value which doesn’t matte: antnl pret gntanivd or nant prvnted gnednt, a c Both of these sentences violate the definition of L; and, therefore, contradict the existence of G. Gi The esoteric flavor of Property 3 is particularly useful for theorems like the following. Theorem 5. © L2={a"ba™" | m,n >0} cannot be expressed as the intersection of any finite member of context-free languages. Proof. This (by contradiction) follows from a choice of m and n based on the 368 DS. Wise finite union of finite I" implied by Property 3 applied to any hypothetical selection of context-free grammars. Corollary 2, La is not a member of the Boolean closure of the deterministic context-free languages [4] Proof. If L, were in that class then it could be expressed as some Boolean combination of deterministic languages in conjunctive normal form. Each conjunct would necessarily be context-free because the class of deterministic languages is closed on complementation, because each deterministic language is context-free, and because context-free languages are closed on union. Theorem 5 does the rest. O 5. Conclusions ‘The examples of the last section demonstrate that Corollary 1 and Theorem 3 do not characterize context-free languages. However, this weakness does not appear in Theorem 1 (which may indeed be necessary and sufficient). Theorem 2 shows that the essence of context-free languages is a pumping property of a finite nature which may appear at different points in the sentence, The pumping may be characterized by the self-embedding chains of a grammar. Alternatively, it may be expressed as a set of finite substitutions on a finite domain, avoiding the terminology of grammati- cal derivations. ‘Therefore, the universal strategy for proving that a language is not context-free (when such is the case) is to assume it is characterized by Property 2 or Property 3 and search for the guaranteed contradiction. Acknowledgement 1 am grateful for the comments of Jan van Leeuwen, of Michael Harrison, and of the referees. Particular thanks are due to John C. Beatty who is responsible for the course of the proof that L; is not context-free using Property 2. References (1) AV. Aho and J.D. Ullman, The Theory of Parsing, Translation and Compiting 1: Parsing (Prentice-Hall, Englewood Clits 1972) (2) ¥. Bar-Hile, M, Perles and E. Shamir, On formal properties of simple phrase structure grammars, Z. Phonetik Sprachwiss. Kommunikat 14 (1961) 143-172. Also ia Y. Bar-Hillel, Language and Information (Addison-Wesley, Reading 1964) [3] N.Chomsky, On certain formal properties of grammars, Information and Control 2 (1988) 137-167. A strong pumping lemma for context-free languages 369 [4] S. Ginsburg and 8. Greibach, Deter: (21966) 620-648, {5} J. van Leeuwen, A generalization of Parikh’s Theorem in formal language theory, Pro. 2nd Collo, on Automata, Languages and Programming, Springer Lecture Notes in Computer Science td (1974) 17-26. [6] W. Ogden, A helpful esult for proving inherent ambiguity, Math. Systems Theory 21968) 191-194 [7] R., Parikh, On context-free languages, J. Assoc. Comput, Mach, 13 (1966) S70-581. {8} M.O. Rabin and D. Scott, Finite automata and their decision problems, IBM J. Res. Develop. 3 (1959) 114-125, Also in E.F. Moore (Bd), Sequential Machines (Addison-Wesley, Reading 1961), [9] A, Salomaa, Formal Languages (Academic Press, New York 1973). istic context-free languages, Information and Control 9

You might also like