Context Free Languages Ambiguity Parsing Algorithm For CFGS: Dr. Mohammad Ahmad

Damascus University
Context Free Languages

Ambiguity
Parsing algorithm for CFGs
Dr. Mohammad Ahmad

Context-free grammar
A → 0A1 A, B are variables

A→B 0, 1, # are terminals
B→# A is the start variable
A  0A1  00A11  000A111

 000B111  000#111
this is a derivation
Context-free grammar
• A context-free grammar (CFG) is (V, S, R, S) where
– V is a finite set of variables or non-terminals
 S is a finite set of terminals (V S = )
– R is a set of productions or substitution rules of the form
A→a
where A is a variable V and a is a string of variables and
terminals
– S is a variable called the start variable
The grammar of English
SENTENCE
NOUN-PHRASE VERB-PHRASE
CMPLX-VERB
PREP-PHRASE NOUN-PHRASE
CMPLX-NOUN CMPLX-NOUN CMPLX-NOUN
ART NOUN PREP ART NOUN VERB ART NOUN
a girl with a flower likes the boy

The grammar of English
SENTENCE → NOUN-PHRASE VERB-PHRASE ARTICLE → a
NOUN-PHRASE → CMPLX-NOUN ARTICLE → the
NOUN-PHRASE → CMPLX-NOUN PREP-PHRASE NOUN → boy
VERB-PHRASE → CMPLX-VERB NOUN → girl
VERB-PHRASE → CMPLX-VERB PREP-PHRASE NOUN → flower
PREP-PHRASE → PREP CMPLX-NOUN VERB → likes
CMPLX-NOUN → ARTICLE NOUN VERB → touches
CMPLX-VERB → VERB NOUN-PHRASE VERB → sees
CMPLX-VERB → VERB PREP → with
variables: SENTENCE, NOUN-PHRASE, …

terminals: a, the, boy, girl, flower, likes, touches, sees, with
start variable: SENTENCE
This grammar describes (a part of) English

Derivations in English
(1) SENTENCE → NOUN-PHRASE VERB-PHRASE (10) ARTICLE → a
(2) NOUN-PHRASE → CMPLX-NOUN (11) ARTICLE → the
(3) NOUN-PHRASE → CMPLX-NOUN PREP-PHRASE (12) NOUN → boy
(4) VERB-PHRASE → CMPLX-VERB (13) NOUN → girl
(5) VERB-PHRASE → CMPLX-VERB PREP-PHRASE (14) NOUN → flower
(6) PREP-PHRASE → PREP CMPLX-NOUN (15) VERB → likes
(7) CMPLX-NOUN → ARTICLE NOUN (16) VERB → touches
(8) CMPLX-VERB → VERB NOUN-PHRASE (17) VERB → sees
(9) CMPLX-VERB → VERB (18) PREP → with
SENTENCE  NOUN-PHRASE VERB-PHRASE (1)

 CPLX-NOUN VERB-PHRASE (2)
 ARTICLE NOUN VERB-PHRASE (7)
 a NOUN VERB-PHRASE (10)
 a boy VERB-PHRASE (12)
 a boy CPLX-VERB (4)
 a boy VERB (9)
 a boy sees (17)
Grammars for programming languages
bash-3.2$ python
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
>>> (2+3)*5
25
E→E+E
E  E*E
E→E*E
 (E) * E
E → (E)
 (E + E) * E
E→0
 (2 + E) * E
E→1
 (2 + 3) * E
…
E→9
 (2 + 3) * 5
Variables: E meaning: “add 2 and 3,

Terminals: +*()0123456789 and then multiply by 5”
Notation and conventions
E→E+E N → 0N Variables: E, N
E→E*E N → 1N Terminals: +, *, (, ), 0, 1
E → (E) N→0 Start variable: E
E→N N→1
shorthand: conventions:
E → E + E | E * E | (E) | N Variables in UPPERCASE
N → 0N | 1N | 0 | 1 Start variable comes first
Derivation
• A derivation is a sequential application of productions:
E E*E
 (E)* E E → E + E | E * E | (E) | N
 (E)* N N → 0N | 1N | 0 | 1
 (E + E)* 1
derivation
 (E + E)* 1
 (E + N)* 1
 (N + N)* 1
ab b obtained from a
 (N + 1N)* 1
in one production
 (N + 10)* 1 * b
a b obtained from a
 (1 + 10)* 1 in zero or more
* (1 + 10)* 1
E  productions
Context-free languages
• The language of a CFG is the set of all strings of
terminals that can be derived from the start variable
* w}
L(G) = {w : w  S* and S 
• Questions we will ask:
I give you a CFG, what is the language?
I give you a language, write a CFG for it

Analysis example 1
A → 0A1 | B
B→#
L(G) = {0n#1n: n ≥ 0}
• Can you derive:

00#11 A  0A1  00A11  00B11  00#11
# A B #
00#111 No, there is an uneven number of 0s and 1s
00##11 No, there are too many #

Analysis example 1
A → 0A1 | B variables: A, B
B→# terminals: 0, 1, #
start variable: A
• Can you derive: 00#11 00##11

00#111 #
• What is the language of this CFG?

L = {0n#1n: n ≥ 0}
Analysis example 2
S → SS | (S) | 
• Can you derive

S  (S) (2) S  (S)
 () (3)  (SS)
 ((S)S)
 ((S)(S))
 (()(S))
 (()())
() (()())
Parse trees
S → SS | (S) | 
• A parse tree gives a more compact representation:

S  (S) S
 (SS)
( S )
 ((S)S)
 ((S)(S))
S S
 (()(S))
 (()()) ( S )( S )
(()())  
Parse trees
• One parse tree can represent several derivations
S  (S) S  (S)
 (SS)  (SS)
 ((S)S)  (S(S))
 ((S)(S)) S  ((S)(S))
 (()(S))  (()(S))
 (()()) ( S )  (()())
S  (S) S S S  (S)
 (SS) ( S )( S )  (SS)
 ((S)S)  (S(S))
 (()S)    (S())
 (()(S))  ((S)())
 (()())  (()())
Analysis example 2
S → SS | (S) | 
• Can you derive
(()() No, because there is an uneven

number of ( and )
())()) No, because there is a prefix

with an excess of )
Analysis example 2
S → SS | (S) |  L(G) = {w:

w has the same number of ( and )
no prefix of w has more )than(}
S
S S Parsing rules:
S Divide w up in blocks with

same number of ( and )
S S
S Each block is in L(G)
S S 
 
( ( ) ( ) ) ( ) Parse each block recursively
Design example 1
L = {0n1n | n  0}
These strings have recursive structure:

000000111111
0000011111
00001111
000111
0011
01

S → 0S1| 
Design example 2
L = numbers without leading zeros
0, 109, 2, 23 , 01, 003

allowed not allowed
S → 0|LN 1052870032
N → ND|
any number N
D → 0|L
leading digit L
L → 1|2|3|4|5|6|7|8|9
Design examples
L = {0n1n0m1m | n  0, m  0} 010011
00110011
These strings have two parts: 000111
L = L1L2
L1 = {0n1n | n  0}
L2 = {0m1m | m  0}
S → S 1S 1
rules for L1: S1 → 0S11|  S1 → 0S11 | 
L2 is the same as L1
Design examples
L = {0n1m0m1n | n  0, m  0} 011001
0011
These strings have nested structure: 1100
00110011
outer part: 0n1n
inner part: 1m0m
S → 0S1|I
I → 1I0 | 
Design examples
L = {x: x has two 0-blocks with same number of 0s}
01011, 001011001, 10010101001 01001000, 01111

allowed not allowed
10010011010010110 A: , or ends in 1
initial part middle part final part C: , or begins with 1
A B C
Design examples
10010011010010110 A: , or ends in 1
A B C C: , or begins with 1
U: any string
S → ABC B has recursive structure:
A →  | U1
U → 0U | 1U |  00110100
C →  | 1U D
B → 0D0 | 0B0 same number of 0s
D → 1U1 | 1 at least one 0
D: begins and ends in 1

Context-free versus regular
• Write a CFG for the language (0 + 1)*111
S → U111
U → 0U | 1U | 
• Can you do so for every regular language?
Every regular language is context-free
regular
NFA DFA
expression
From regular to context-free
regular expression CFG
 grammar with no rules

 S→
a (alphabet symbol) S→a
E1 + E2 S → S 1 | S2
E1E2 S → S 1S 2
E1* S → SS1 | 
In all cases, S becomes the new start symbol

Context-free versus regular
• Is every context-free language regular?
S → 0S1 |  L = {0n1n: n ≥ 0}
Is context-free but not regular
regular context-free
Ambiguity
Parsing algorithm for CFGs
Ambiguity
• A grammar is ambiguous if some strings have more
than one parse tree
E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2
E E
E + E E * E
1+2*2
V E * E E + E V
1 V V V V 2
2 2 =5 1 2 =6
Disambiguation
• Sometimes we can rewrite the grammar to remove
the ambiguity
E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2
same precedence!
F F
T T
Divide expression
into terms and factors F F
2 * (1 + 2 * 2)
Disambiguation
E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2
An expression is a sum of
one or more terms E→T|E+T
Each term is a product of

T→F|T*F
one or more factors
Each factor is a parenthesized

F → (E) | 1 | 2
expression or a number
Parsing example
E E→T|E+T
T→F|T*F
E F → (E) | 1 | 2
+ T
T F
T * F
F ( E )
E + T
E + T T * F
T F F
F
2 * (1 + 1 + 2 * 2) + 1
Disambiguation
• Disambiguation is not always possible
– There exist inherently ambiguous languages
– There is no general procedure for disambiguation
• In programming languages, ambiguity comes from

precedence rules, and we can do like in example
• In English, ambiguity is sometimes a problem:
He ate the cookies on the floor

Parsing
S → 0S1 | 1S0S1 | T input: 00111

T→S|
• Do we have a method for building a parse tree?

• Can we tell if the parse tree is unique?
First attempt
S → 0S1 | 1S0S1 | T
x = 00111
T→S|
• Maybe we can try all possible derivations:
S 0S1 00S11
01S0S11
0T1
1S0S1 10S10S1 when do we stop?
...
T S

Problems
S → 0S1 | 1S0S1 | T
x = 00111
T→S|
• How do we know when to stop?
S 0S1 00S11
01S0S11 when do we stop?
0T1
1S0S1 10S10S1
...
Problems
S → 0S1 | 1S0S1 | T
x = 01011
T→S|
• Idea: Stop derivation when length exceeds |x|

• Not right because of -productions
S  0S1  01S0S11  01S011  01011
1 3 7 6 5
• We want to eliminate -productions

Problems
S → 0S1 | 1S0S1 | T
x = 00111
T→S|
• Loops among the variables (S → T → S) might make

us go forever
• We want to eliminate such loops
Removal of -productions
• A variable N is nullable if there is a derivation
* 
N
• How to remove -productions

 Find all nullable variables N
 For every production of the form A → aNb,
add another production A → ab
If N →  is a production, remove it
 If S is nullable, add the special production S → 
Example
• Find the nullable variables
grammar nullable variables

S → ACD B C D
A→ a
B→
C → ED | 
D → BC | b
E→b
 Find all nullable variables

Finding nullable variables
• To find nullable variables, we work backwards
– First, mark all variables A s.t. A →  as nullable
– Then, as long as there are productions of the form
A → A1… Ak
where all of A1,…, Ak are marked as nullable, mark A as
nullable
Eliminating -productions
S → ACD D→C
A→ a S → AD
B→ D→B
C → ED |  D→
D → BC | b S → AC
E→b S →A
C →E
nullable variables: B, C, D
 For every production of the form A → aNb,

add another production A → ab
If N →  is a production, remove it
Dealing with loops
• A unit production is a production of the form
A1 → A2
where A1 and A2 are both variables
• Example
grammar: unit productions:

S → 0S1 | 1S0S1 | T S T
T→S|R|
R → 0SR R
Removal of unit productions
• If there is a cycle of unit productions
A1 → A2 → ... → Ak → A1
delete it and replace everything with A1
• Example
S T
S → 0S1 | 1S0S1 | 
T S → 0S1 | 1S0S1
T →S|R| S→R|
R → 0SR R → 0SR
R
T is replaced by S in the {S, T} cycle

Removal of unit productions
• For other unit productions, replace every chain
A1 → A2 → ... → Ak → a
by productions A1 → a,... , Ak → a
• Example
S → 0S1 | 1S0S1 S → 0S1 | 1S0S1
|R| | 0SR | 
R → 0SR R → 0SR
S → R → 0SR is replaced by S → 0SR, R → 0SR

Recap
• After eliminating -productions and unit productions,
we know that every derivation
* a …a
S where a1, …, ak are terminals
1 k
doesn’t shrink in length and doesn’t go into cycles
• Exception: S → 
– We will not use this rule at all, except to check if   L
• Note
 -productions must be eliminated before unit productions
Example: testing membership
eliminate S →  | 01 | 101 | 0S1
S → 0S1 | 1S0S1 | T
T→S| unit, -prod |10S1 | 1S01 | 1S0S1
x = 00111
S 01, 101
0S1 0011, 01011
00S11 only strings of length ≥ 6
strings of length ≥ 6
10S1 10011, strings of length ≥ 6
1S01 10101, strings of length ≥ 6
1S0S1 only strings of length ≥ 6
Algorithm 1 for testing membership
• How to check if a string x ≠  is in L(G)
 Eliminate all -productions and unit productions

 Let X := S
 While some new rule R can be applied to X
Apply R to X
If X = x, you have found a derivation for x
If |X| > |x|, backtrack
 If no more rules can be applied to X, x is not in L
Practical limitations of Algorithm I
• This method can be very slow if x is long
G = CFG of the java programming language

x = code for a 200-line java program
algorithm might take about 10200 steps!
• There is a faster algorithm, but it requires that we do

some more transformations on the grammar
Chomsky Normal Form
• A CFG is in Chomsky Normal Form
if every production (except S → ) is
A → BC or A→a
• Convert to Chomsky Normal Form: Noam Chomsky
A → BcDE A → BCDE A → BX1

replace break up
C→c sequences
X1 → CX2
terminals
with new with new X2 → DE
variables variables
C→c
Algorithm 2 for testing membership
S → AB | BC SAC
A → BA | a – SAC
B → CC | b – B B
C → AB | a
SA B SC SA
B AC AC B AC
x = baaba
b a a b a
Idea:We generate each substring of x bottom up

Parse tree reconstruction
S → AB | BC SAC
A → BA | a – SAC
B → CC | b – B B
C → AB | a
SA B SC SA
B AC AC B AC
x = baaba
b a a b a
Tracing back the derivations, we obtain the parse tree

Cocke-Younger-Kasami algorithm
Input: Grammar G in CNF, string x = x1…xk

1k
For cells in last row
…
If there is a production A → xi
Put A in table cell ii 12 23
For cells st in other rows 11 22 kk
If there is a production A → BC x1 x2 … xk
where B is in cell sj and C is in cell jt
Put A in cell st
1 s j t k
Cell ij remembers all possible derivations of substring xi…xj

Context Free Languages Ambiguity Parsing Algorithm For CFGS: Dr. Mohammad Ahmad

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Context Free Languages Ambiguity Parsing Algorithm For CFGS: Dr. Mohammad Ahmad

Uploaded by

Copyright:

Available Formats

Damascus University

Context Free Languages

Dr. Mohammad Ahmad

A → 0A1 A, B are variables

A  0A1  00A11  000A111

CMPLX-NOUN CMPLX-NOUN CMPLX-NOUN

ART NOUN PREP ART NOUN VERB ART NOUN

a girl with a flower likes the boy

variables: SENTENCE, NOUN-PHRASE, …

This grammar describes (a part of) English

SENTENCE  NOUN-PHRASE VERB-PHRASE (1)

Variables: E meaning: “add 2 and 3,

• Questions we will ask:

I give you a CFG, what is the language?

I give you a language, write a CFG for it

• Can you derive:

00#111 No, there is an uneven number of 0s and 1s

00##11 No, there are too many #

• Can you derive: 00#11 00##11

• What is the language of this CFG?

• Can you derive

• A parse tree gives a more compact representation:

• Can you derive

(()() No, because there is an uneven

())()) No, because there is a prefix

S → SS | (S) |  L(G) = {w:

S Divide w up in blocks with

These strings have recursive structure:

L = numbers without leading zeros

0, 109, 2, 23 , 01, 003

L = {x: x has two 0-blocks with same number of 0s}

01011, 001011001, 10010101001 01001000, 01111

D: begins and ends in 1

• Can you do so for every regular language?

Every regular language is context-free

 grammar with no rules

In all cases, S becomes the new start symbol

Is context-free but not regular

Each term is a product of

Each factor is a parenthesized

• In programming languages, ambiguity comes from

• In English, ambiguity is sometimes a problem:

He ate the cookies on the floor

S → 0S1 | 1S0S1 | T input: 00111

• Do we have a method for building a parse tree?

• Maybe we can try all possible derivations:

• How do we know when to stop?

• Idea: Stop derivation when length exceeds |x|

• We want to eliminate -productions

• Loops among the variables (S → T → S) might make

• How to remove -productions

grammar nullable variables

 Find all nullable variables

 For every production of the form A → aNb,

grammar: unit productions:

T is replaced by S in the {S, T} cycle

S → R → 0SR is replaced by S → 0SR, R → 0SR

doesn’t shrink in length and doesn’t go into cycles

 Eliminate all -productions and unit productions

G = CFG of the java programming language

algorithm might take about 10200 steps!

• There is a faster algorithm, but it requires that we do