You are on page 1of 52

Damascus University

Context Free Languages


Ambiguity
Parsing algorithm for CFGs

Dr. Mohammad Ahmad


Context-free grammar

A → 0A1 A, B are variables


A→B 0, 1, # are terminals
B→# A is the start variable

A  0A1  00A11  000A111


 000B111  000#111
this is a derivation
Context-free grammar
• A context-free grammar (CFG) is (V, S, R, S) where
– V is a finite set of variables or non-terminals
 S is a finite set of terminals (V S = )
– R is a set of productions or substitution rules of the form
A→a
where A is a variable V and a is a string of variables and
terminals
– S is a variable called the start variable
The grammar of English

SENTENCE

NOUN-PHRASE VERB-PHRASE

CMPLX-VERB

PREP-PHRASE NOUN-PHRASE

CMPLX-NOUN CMPLX-NOUN CMPLX-NOUN

ART NOUN PREP ART NOUN VERB ART NOUN

a girl with a flower likes the boy


The grammar of English
SENTENCE → NOUN-PHRASE VERB-PHRASE ARTICLE → a
NOUN-PHRASE → CMPLX-NOUN ARTICLE → the
NOUN-PHRASE → CMPLX-NOUN PREP-PHRASE NOUN → boy
VERB-PHRASE → CMPLX-VERB NOUN → girl
VERB-PHRASE → CMPLX-VERB PREP-PHRASE NOUN → flower
PREP-PHRASE → PREP CMPLX-NOUN VERB → likes
CMPLX-NOUN → ARTICLE NOUN VERB → touches
CMPLX-VERB → VERB NOUN-PHRASE VERB → sees
CMPLX-VERB → VERB PREP → with

variables: SENTENCE, NOUN-PHRASE, …


terminals: a, the, boy, girl, flower, likes, touches, sees, with
start variable: SENTENCE

This grammar describes (a part of) English


Derivations in English
(1) SENTENCE → NOUN-PHRASE VERB-PHRASE (10) ARTICLE → a
(2) NOUN-PHRASE → CMPLX-NOUN (11) ARTICLE → the
(3) NOUN-PHRASE → CMPLX-NOUN PREP-PHRASE (12) NOUN → boy
(4) VERB-PHRASE → CMPLX-VERB (13) NOUN → girl
(5) VERB-PHRASE → CMPLX-VERB PREP-PHRASE (14) NOUN → flower
(6) PREP-PHRASE → PREP CMPLX-NOUN (15) VERB → likes
(7) CMPLX-NOUN → ARTICLE NOUN (16) VERB → touches
(8) CMPLX-VERB → VERB NOUN-PHRASE (17) VERB → sees
(9) CMPLX-VERB → VERB (18) PREP → with

SENTENCE  NOUN-PHRASE VERB-PHRASE (1)


 CPLX-NOUN VERB-PHRASE (2)
 ARTICLE NOUN VERB-PHRASE (7)
 a NOUN VERB-PHRASE (10)
 a boy VERB-PHRASE (12)
 a boy CPLX-VERB (4)
 a boy VERB (9)
 a boy sees (17)
Grammars for programming languages
bash-3.2$ python
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
>>> (2+3)*5
25

E→E+E
E  E*E
E→E*E
 (E) * E
E → (E)
 (E + E) * E
E→0
 (2 + E) * E
E→1
 (2 + 3) * E

E→9
 (2 + 3) * 5

Variables: E meaning: “add 2 and 3,


Terminals: +*()0123456789 and then multiply by 5”
Notation and conventions

E→E+E N → 0N Variables: E, N
E→E*E N → 1N Terminals: +, *, (, ), 0, 1
E → (E) N→0 Start variable: E
E→N N→1

shorthand: conventions:
E → E + E | E * E | (E) | N Variables in UPPERCASE
N → 0N | 1N | 0 | 1 Start variable comes first
Derivation
• A derivation is a sequential application of productions:
E E*E
 (E)* E E → E + E | E * E | (E) | N
 (E)* N N → 0N | 1N | 0 | 1
 (E + E)* 1

derivation
 (E + E)* 1
 (E + N)* 1
 (N + N)* 1
ab b obtained from a
 (N + 1N)* 1
in one production
 (N + 10)* 1 * b
a b obtained from a
 (1 + 10)* 1 in zero or more
* (1 + 10)* 1
E  productions
Context-free languages
• The language of a CFG is the set of all strings of
terminals that can be derived from the start variable
* w}
L(G) = {w : w  S* and S 

• Questions we will ask:

I give you a CFG, what is the language?

I give you a language, write a CFG for it


Analysis example 1

A → 0A1 | B
B→#
L(G) = {0n#1n: n ≥ 0}

• Can you derive:


00#11 A  0A1  00A11  00B11  00#11

# A B #

00#111 No, there is an uneven number of 0s and 1s

00##11 No, there are too many #


Analysis example 1

A → 0A1 | B variables: A, B
B→# terminals: 0, 1, #
start variable: A

• Can you derive: 00#11 00##11


00#111 #

• What is the language of this CFG?


L = {0n#1n: n ≥ 0}
Analysis example 2

S → SS | (S) | 

• Can you derive


S  (S) (2) S  (S)
 () (3)  (SS)
 ((S)S)
 ((S)(S))
 (()(S))
 (()())

() (()())
Parse trees

S → SS | (S) | 

• A parse tree gives a more compact representation:


S  (S) S
 (SS)
( S )
 ((S)S)
 ((S)(S))
S S
 (()(S))
 (()()) ( S )( S )
(()())  
Parse trees
• One parse tree can represent several derivations
S  (S) S  (S)
 (SS)  (SS)
 ((S)S)  (S(S))
 ((S)(S)) S  ((S)(S))
 (()(S))  (()(S))
 (()()) ( S )  (()())

S  (S) S S S  (S)
 (SS) ( S )( S )  (SS)
 ((S)S)  (S(S))
 (()S)    (S())
 (()(S))  ((S)())
 (()())  (()())
Analysis example 2

S → SS | (S) | 

• Can you derive

(()() No, because there is an uneven


number of ( and )

())()) No, because there is a prefix


with an excess of )
Analysis example 2

S → SS | (S) |  L(G) = {w:


w has the same number of ( and )
no prefix of w has more )than(}
S
S S Parsing rules:

S Divide w up in blocks with


same number of ( and )
S S
S Each block is in L(G)
S S 
 
( ( ) ( ) ) ( ) Parse each block recursively
Design example 1

L = {0n1n | n  0}

These strings have recursive structure:


000000111111
0000011111
00001111
000111
0011
01

S → 0S1| 
Design example 2

L = numbers without leading zeros

0, 109, 2, 23 , 01, 003


allowed not allowed

S → 0|LN 1052870032
N → ND|
any number N
D → 0|L
leading digit L
L → 1|2|3|4|5|6|7|8|9
Design examples

L = {0n1n0m1m | n  0, m  0} 010011
00110011
These strings have two parts: 000111

L = L1L2
L1 = {0n1n | n  0}
L2 = {0m1m | m  0}
S → S 1S 1
rules for L1: S1 → 0S11|  S1 → 0S11 | 
L2 is the same as L1
Design examples

L = {0n1m0m1n | n  0, m  0} 011001
0011
These strings have nested structure: 1100
00110011
outer part: 0n1n
inner part: 1m0m

S → 0S1|I
I → 1I0 | 
Design examples

L = {x: x has two 0-blocks with same number of 0s}

01011, 001011001, 10010101001 01001000, 01111


allowed not allowed

10010011010010110 A: , or ends in 1
initial part middle part final part C: , or begins with 1
A B C
Design examples

10010011010010110 A: , or ends in 1
A B C C: , or begins with 1
U: any string
S → ABC B has recursive structure:
A →  | U1
U → 0U | 1U |  00110100
C →  | 1U D
B → 0D0 | 0B0 same number of 0s
D → 1U1 | 1 at least one 0

D: begins and ends in 1


Context-free versus regular
• Write a CFG for the language (0 + 1)*111
S → U111
U → 0U | 1U | 

• Can you do so for every regular language?

Every regular language is context-free

regular
NFA DFA
expression
From regular to context-free
regular expression CFG

 grammar with no rules


 S→
a (alphabet symbol) S→a
E1 + E2 S → S 1 | S2
E1E2 S → S 1S 2
E1* S → SS1 | 

In all cases, S becomes the new start symbol


Context-free versus regular
• Is every context-free language regular?

S → 0S1 |  L = {0n1n: n ≥ 0}

Is context-free but not regular

regular context-free
Ambiguity
Parsing algorithm for CFGs
Ambiguity
• A grammar is ambiguous if some strings have more
than one parse tree
E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2

E E
E + E E * E
1+2*2
V E * E E + E V

1 V V V V 2
2 2 =5 1 2 =6
Disambiguation
• Sometimes we can rewrite the grammar to remove
the ambiguity
E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2
same precedence!

F F
T T
Divide expression
into terms and factors F F
2 * (1 + 2 * 2)
Disambiguation

E → E + E | E * E | (E) | N
N → 1N | 2N | 1 | 2

An expression is a sum of
one or more terms E→T|E+T

Each term is a product of


T→F|T*F
one or more factors

Each factor is a parenthesized


F → (E) | 1 | 2
expression or a number
Parsing example
E E→T|E+T
T→F|T*F
E F → (E) | 1 | 2
+ T
T F
T * F
F ( E )
E + T
E + T T * F
T F F
F
2 * (1 + 1 + 2 * 2) + 1
Disambiguation
• Disambiguation is not always possible
– There exist inherently ambiguous languages
– There is no general procedure for disambiguation

• In programming languages, ambiguity comes from


precedence rules, and we can do like in example

• In English, ambiguity is sometimes a problem:

He ate the cookies on the floor


Parsing

S → 0S1 | 1S0S1 | T input: 00111


T→S|

• Do we have a method for building a parse tree?


• Can we tell if the parse tree is unique?
First attempt

S → 0S1 | 1S0S1 | T
x = 00111
T→S|

• Maybe we can try all possible derivations:

S 0S1 00S11
01S0S11
0T1
1S0S1 10S10S1 when do we stop?
...
T S

Problems

S → 0S1 | 1S0S1 | T
x = 00111
T→S|

• How do we know when to stop?

S 0S1 00S11
01S0S11 when do we stop?
0T1
1S0S1 10S10S1
...
Problems

S → 0S1 | 1S0S1 | T
x = 01011
T→S|

• Idea: Stop derivation when length exceeds |x|


• Not right because of -productions
S  0S1  01S0S11  01S011  01011
1 3 7 6 5

• We want to eliminate -productions


Problems

S → 0S1 | 1S0S1 | T
x = 00111
T→S|

• Loops among the variables (S → T → S) might make


us go forever
• We want to eliminate such loops
Removal of -productions
• A variable N is nullable if there is a derivation

* 
N

• How to remove -productions


 Find all nullable variables N
 For every production of the form A → aNb,
add another production A → ab
If N →  is a production, remove it
 If S is nullable, add the special production S → 
Example
• Find the nullable variables

grammar nullable variables


S → ACD B C D
A→ a
B→
C → ED | 
D → BC | b
E→b

 Find all nullable variables


Finding nullable variables
• To find nullable variables, we work backwards
– First, mark all variables A s.t. A →  as nullable
– Then, as long as there are productions of the form
A → A1… Ak
where all of A1,…, Ak are marked as nullable, mark A as
nullable
Eliminating -productions

S → ACD D→C
A→ a S → AD
B→ D→B
C → ED |  D→
D → BC | b S → AC
E→b S →A
C →E
nullable variables: B, C, D

 For every production of the form A → aNb,


add another production A → ab
If N →  is a production, remove it
Dealing with loops
• A unit production is a production of the form
A1 → A2
where A1 and A2 are both variables
• Example

grammar: unit productions:


S → 0S1 | 1S0S1 | T S T
T→S|R|
R → 0SR R
Removal of unit productions
• If there is a cycle of unit productions
A1 → A2 → ... → Ak → A1
delete it and replace everything with A1

• Example
S T
S → 0S1 | 1S0S1 | 
T S → 0S1 | 1S0S1
T →S|R| S→R|
R → 0SR R → 0SR
R

T is replaced by S in the {S, T} cycle


Removal of unit productions
• For other unit productions, replace every chain
A1 → A2 → ... → Ak → a
by productions A1 → a,... , Ak → a

• Example
S → 0S1 | 1S0S1 S → 0S1 | 1S0S1
|R| | 0SR | 
R → 0SR R → 0SR

S → R → 0SR is replaced by S → 0SR, R → 0SR


Recap
• After eliminating -productions and unit productions,
we know that every derivation
* a …a
S where a1, …, ak are terminals
1 k

doesn’t shrink in length and doesn’t go into cycles

• Exception: S → 
– We will not use this rule at all, except to check if   L

• Note
 -productions must be eliminated before unit productions
Example: testing membership
eliminate S →  | 01 | 101 | 0S1
S → 0S1 | 1S0S1 | T
T→S| unit, -prod |10S1 | 1S01 | 1S0S1

x = 00111

S 01, 101
0S1 0011, 01011
00S11 only strings of length ≥ 6
strings of length ≥ 6
10S1 10011, strings of length ≥ 6
1S01 10101, strings of length ≥ 6
1S0S1 only strings of length ≥ 6
Algorithm 1 for testing membership
• How to check if a string x ≠  is in L(G)

 Eliminate all -productions and unit productions


 Let X := S
 While some new rule R can be applied to X
Apply R to X
If X = x, you have found a derivation for x
If |X| > |x|, backtrack
 If no more rules can be applied to X, x is not in L
Practical limitations of Algorithm I
• This method can be very slow if x is long

G = CFG of the java programming language


x = code for a 200-line java program

algorithm might take about 10200 steps!

• There is a faster algorithm, but it requires that we do


some more transformations on the grammar
Chomsky Normal Form
• A CFG is in Chomsky Normal Form
if every production (except S → ) is

A → BC or A→a

• Convert to Chomsky Normal Form: Noam Chomsky

A → BcDE A → BCDE A → BX1


replace break up
C→c sequences
X1 → CX2
terminals
with new with new X2 → DE
variables variables
C→c
Algorithm 2 for testing membership

S → AB | BC SAC
A → BA | a – SAC
B → CC | b – B B
C → AB | a
SA B SC SA
B AC AC B AC
x = baaba
b a a b a

Idea:We generate each substring of x bottom up


Parse tree reconstruction

S → AB | BC SAC
A → BA | a – SAC
B → CC | b – B B
C → AB | a
SA B SC SA
B AC AC B AC
x = baaba
b a a b a

Tracing back the derivations, we obtain the parse tree


Cocke-Younger-Kasami algorithm

Input: Grammar G in CNF, string x = x1…xk


1k
For cells in last row


If there is a production A → xi
Put A in table cell ii 12 23
For cells st in other rows 11 22 kk
If there is a production A → BC x1 x2 … xk
where B is in cell sj and C is in cell jt
Put A in cell st
1 s j t k

Cell ij remembers all possible derivations of substring xi…xj

You might also like