
# Automata term paper: Ambiguity and Membership algorithms in Context-Free Grammars

## Introduction

When using context-free grammars to describe formal languages, one has to be aware of potential ambiguity in the grammars, that is, the situation where a string may be parsed in multiple ways, leading to different parse trees. We propose a technique for detecting ambiguities in a given grammar. As the problem is in general undecidable, which was shown by Cantor [6], Floyd [14], and Chomsky and Schützenberger [8], we resort to conservative approximation. This means that for some grammars our analysis is able to guarantee that they are unambiguous, whereas for others it cannot give certain answers.

## Related Work

Our approach is related to building an LR(k) parse table for the given grammar and checking for conflicts. The LR(k) condition has, since its discovery by Knuth in 1965, been known as a powerful test for unambiguity [21].

## Contributions

Despite decades of work on parsing techniques, which in many cases involve the problem of grammar ambiguity, we have been unable to find tools that are applicable to grammars in the areas mentioned above.

• We observe that there is a simple linguistic characterization of grammar ambiguity. This allows us to shift from reasoning about grammar derivations to reasoning about purely linguistic properties, such as language inclusion and approximations. This results in a framework called ACLA (Ambiguity Checking with Language Approximations).

• We show how Mohri and Nederhof's regular approximation technique for context-free grammars [25] can be adapted in a local manner using the ACLA framework to detect many common sources of ambiguity, including ones that involve palindromic structures. Also, a simple grammar unfolding transformation can be used to improve the precision of the approximation. The flexibility of the framework is additionally substantiated by presenting other approximation techniques that can be combined with the regular approximation to improve analysis performance.
• We demonstrate that ACLA can handle real-world grammars of varying complexity taken from the bioinformatics literature on RNA analysis, acquitting the unambiguous grammars and pinpointing the sources of ambiguity, with shortest possible examples as witnesses, in grammars that are in fact ambiguous.

## Overview

We begin in Section 2 by giving a characterization of grammar ambiguity that allows us to reason about the language of the nonterminals in the grammar rather than the structure of the grammar. In particular, we reformulate the ambiguity problem in terms of language intersection and overlap operations.

## Context-free grammar and ambiguity

A context-free grammar (CFG) G is defined by G = (N, Σ, s, π) where N is a finite set of nonterminals, Σ is a finite set of alphabet symbols (or terminals), s ∈ N is the start nonterminal, and π : N → P(E*) is the finite production function where E = Σ ∪ N. We write αnω ⇉ αθω when θ ∈ π(n) and α, ω ∈ E*, and ⇉* is the reflexive transitive closure of ⇉. We assume that every nonterminal n ∈ N is reachable from s and derives some string: ∃ x, y, z ∈ Σ* : s ⇉* xnz ⇉* xyz. The language of a sentential form α ∈ E* is L_G(α) = {x ∈ Σ* | α ⇉* x}, and the language of G is L(G) = L_G(s).

A context-free grammar (CFG), sometimes also called a phrase structure grammar, is a grammar that naturally generates a formal language in which clauses can be nested inside clauses arbitrarily deeply, but where grammatical structures are not allowed to overlap. CFGs can be expressed in Backus–Naur Form, or BNF. In terms of production rules, every production of a context-free grammar is of the form V → w, where V is a single nonterminal symbol and w is a string of terminals and/or nonterminals (w can be empty).

### Formal definitions

A context-free grammar G is defined by the 4-tuple G = (V, Σ, R, S), where

1. V is a finite set; each element v ∈ V is called a non-terminal character or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by G.
2. Σ is a finite set of terminals, disjoint from V, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar G.
3. R is a relation from V to (V ∪ Σ)* such that ∃ w ∈ (V ∪ Σ)* : (S, w) ∈ R. These relations are called productions or rewrite rules.
4. S is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of V.

R is a finite set. The members of R are called the rules or productions of the grammar. The asterisk represents the Kleene star operation.

### Rule application

For any strings u, v ∈ (V ∪ Σ)*, we say u yields v, written as u ⇉ v, if ∃ (α, β) ∈ R and u1, u2 ∈ (V ∪ Σ)* such that u = u1 α u2 and v = u1 β u2. Thus, v is the result of applying the rule (α, β) to u.

### Repetitive rule application

For any u, v ∈ (V ∪ Σ)*, we write u ⇉* v (the notation differs in some textbooks) if ∃ u1, u2, ..., uk ∈ (V ∪ Σ)*, k ≥ 0, such that u ⇉ u1 ⇉ u2 ⇉ ... ⇉ uk ⇉ v.

### Context-free language

The language of a grammar G = (V, Σ, R, S) is the set L(G) = {w ∈ Σ* : S ⇉* w}.
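The assumption stated above, that every nonterminal is reachable from the start nonterminal and derives some terminal string, can be checked mechanically. The following is a minimal sketch, not part of the original text. The dictionary encoding of the production function and the names `productive`, `reachable` and `is_well_formed` are assumptions made for this illustration.

```python
from collections import deque

# Hypothetical encoding: the keys of `pi` are the nonterminals, every other
# symbol is a terminal. Each production body is a tuple of symbols, and the
# empty tuple () stands for the empty string.

def productive(pi):
    """Return the nonterminals that derive at least one terminal string."""
    done = set()
    changed = True
    while changed:
        changed = False
        for n, bodies in pi.items():
            if n in done:
                continue
            # n is productive if some body consists only of terminals and
            # nonterminals already known to be productive.
            if any(all(s not in pi or s in done for s in body) for body in bodies):
                done.add(n)
                changed = True
    return done

def reachable(pi, start):
    """Return the nonterminals reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        for body in pi[n]:
            for s in body:
                if s in pi and s not in seen:
                    seen.add(s)
                    queue.append(s)
    return seen

def is_well_formed(pi, start):
    """Check that every nonterminal is both reachable and productive."""
    nonterminals = set(pi)
    return reachable(pi, start) == nonterminals and productive(pi) == nonterminals

# S -> a S b | (empty): every nonterminal is reachable and productive.
print(is_well_formed({"S": [("a", "S", "b"), ()]}, "S"))
# X is neither reachable from S nor productive.
print(is_well_formed({"S": [("a",)], "X": [("X",)]}, "S"))
```

Both checks are simple fixed-point computations, so they terminate on any finite grammar.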

A language L is said to be a context-free language (CFL) if there exists a CFG G such that L = L(G).

## Introduction

A context-free grammar G is unambiguous if it does not have two different derivation trees for any word. A context-free language is unambiguous if it is generated by an unambiguous context-free grammar. Context-free grammars and languages are ambiguous if they are not unambiguous. Ambiguous context-free languages are also called inherently ambiguous. Such inherently ambiguous context-free languages do exist.

The ambiguity of a word w with respect to a given context-free grammar G is the number of different derivation trees for w generated by G. Ambiguous context-free grammars and languages can be distinguished by their degree of ambiguity, that is, the least upper bound for the ambiguity which a word can have. A context-free grammar is k-ambiguous if k is the least upper bound for the ambiguity of the generated words. A context-free language is k-ambiguous if it is generated by a k-ambiguous context-free grammar but by no (k−1)-ambiguous grammar.
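The ambiguity of a single word, as defined above, can be computed by counting derivation trees. The sketch below is an illustration added here under a simplifying assumption: the grammar is in Chomsky normal form, so every body is either one terminal or two nonterminals. That keeps the number of trees per word finite. The function name `count_trees` and the grammar encoding are hypothetical.

```python
from functools import lru_cache

def count_trees(rules, start, word):
    """Count the derivation trees for `word` rooted at `start`.

    Assumed encoding: `rules` maps a nonterminal to a list of bodies, where a
    body is either a 1-tuple holding a terminal or a 2-tuple of nonterminals.
    """
    @lru_cache(maxsize=None)
    def count(nt, i, j):
        # Number of trees rooted at nt whose frontier is word[i:j].
        total = 0
        for body in rules[nt]:
            if len(body) == 1:
                if j - i == 1 and word[i] == body[0]:
                    total += 1
            else:
                left, right = body
                # Try every split point of word[i:j] into two non-empty parts.
                for k in range(i + 1, j):
                    total += count(left, i, k) * count(right, k, j)
        return total

    return count(start, 0, len(word)) if word else 0

# S -> S S | a is ambiguous: "aaa" has two derivation trees.
ambiguous = {"S": [("S", "S"), ("a",)]}
print(count_trees(ambiguous, "S", "aaa"))    # 2

# S -> A S | a, A -> a generates the same words with one tree each.
unambiguous = {"S": [("A", "S"), ("a",)], "A": [("a",)]}
print(count_trees(unambiguous, "S", "aaa"))  # 1
```

A count above 1 for any word shows that the grammar is ambiguous. A count of 1 for the words tried so far proves nothing in general, which is consistent with the undecidability of the problem.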

## Production Types

Let G = (N, Σ, P, S) be a context-free grammar. A production [A; α] ∈ P can also be denoted as A → α or [A → α]. It is called

• a chain production if α ∈ N,
• a terminal production if α ∈ Σ*,
• an ε-production if α = ε,
• a Greibach production if α ∈ ΣN*,
• a linear production if α ∈ Σ*(N ∪ {ε})Σ*,
• a right linear production if α ∈ Σ*(N ∪ {ε}).

The grammar G is ε-free if no tree in G \ {[S; ε]} contains ε-productions. It is in Greibach normal form, linear, or right linear if no tree in G \ {[S; ε]} contains a production which is not a Greibach production, linear production, or right linear production, respectively.

## Ambiguity

We can regard trees as circuits with the root as an input line and the leaves as output lines. The tree expansion allows us to "plug" the root A of a tree into a leaf of another, provided that the leaf is of the "appropriate type", i.e., it is labelled with A. For the question of how trees can be connected, only the roots and frontiers are relevant, while the internal structure can be considered a black box.

### Ambiguity of Context-Free Grammars

Throughout this section let G = (N, Σ, P, S) be a context-free grammar. Usually the ambiguity of a word w ∈ Σ* generated by G is defined as the number of derivation trees with the frontier w. Since the root of each derivation tree is labelled S, we can also consider the ambiguity of w as the number of derivation trees with interface [S; w]. Moreover, we consider the set of G's derivation trees as a formal language over T_{N∪Σ}.

A sketch of an algorithm that enumerates the strings generated by a grammar (in rather hopeless pseudocode, I'm afraid):

    generate(g):
        if g is empty:
            yield ""
        otherwise if g is a terminal a:
            yield "a"
        otherwise if g is a single nonterminal:
            for c in every construction of g:
                start a generator for generate(c)
            until all generators are exhausted:
                looping over each nonexhausted generator gen:
                    yield "a" where a = next(gen)
        otherwise if g is a pair of symbols m and n:
            for c in every construction of m:
                start a generator in set 1 for generate(c)
            for d in every construction of n:
                start a generator in set 2 for generate(d)
            until all in set 1 or all in set 2 are exhausted:
                loop over all pairs gen1, gen2 of nonexhausted in set 1 and set 2:
                    yield "ab" where a = next(gen1) and b = next(gen2)

Assuming the grammar has been converted so that each construction consists of zero to two symbols, this will run a breadth-first search over the tree of all parse trees of the grammar.
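The breadth-first enumeration sketched in pseudocode above can be made runnable in a slightly different form. The version below is an illustrative variant, not the original algorithm. It replaces the interleaved generators with an explicit queue of sentential forms and always expands the leftmost nonterminal. The grammar encoding and the name `generate` are assumptions made for this example.

```python
from collections import deque
from itertools import islice

def generate(rules, start):
    """Yield terminal strings in breadth-first order of leftmost derivations.

    Assumed encoding: `rules` maps each nonterminal to a list of bodies
    (tuples of symbols); symbols that are not keys of `rules` are terminals.
    A string with k distinct leftmost derivations is yielded k times.
    """
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the position of the leftmost nonterminal, if there is one.
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:
            yield "".join(form)  # only terminals left: a generated word
            continue
        # Replace that nonterminal by each of its bodies, one step deeper.
        for body in rules[form[idx]]:
            queue.append(form[:idx] + tuple(body) + form[idx + 1:])

# S -> a S b | (empty)
balanced = {"S": [("a", "S", "b"), ()]}
print(list(islice(generate(balanced, "S"), 3)))   # ['', 'ab', 'aabb']

# S -> S S | a: the repeated "aaa" exposes two leftmost derivations.
ambiguous = {"S": [("S", "S"), ("a",)]}
print(list(islice(generate(ambiguous, "S"), 4)))  # ['a', 'aa', 'aaa', 'aaa']
```

Because the language is usually infinite, the generator has to be cut off by the caller, here with `islice`. For grammars with unproductive cycles such as S → S the queue can grow without ever yielding a word.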
