CS3300 - Compiler Design: Parsing
V. Krishna Nandivada, IIT Madras

These slides borrow liberal portions of text verbatim from Antony L. Hosking @ Purdue, Jens Palsberg @ UCLA and the Dragon book.

Copyright © 2019 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from hosking@cs.purdue.edu.
If S ⇒* β then β is said to be a sentential form of G.

L(G) = {w ∈ Vt* | S ⇒+ w}; w ∈ L(G) is called a sentence of G.

Note that L(G) = {β ∈ V* | S ⇒* β} ∩ Vt*.

In a BNF for a grammar, we represent
1 non-terminals with angle brackets or capital letters
2 terminals with typewriter font or underline
3 productions as in the example

This describes simple expressions over numbers and identifiers.

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019
Derivations

We can view the productions of a CFG as rewriting rules.
At each step, we choose a non-terminal to replace.
This choice can lead to different derivations.
Two are of particular interest:
  leftmost derivation: the leftmost non-terminal is replaced at each step
  rightmost derivation: the rightmost non-terminal is replaced at each step

Using our example CFG (for x + 2 * y):

⟨goal⟩ ⇒ ⟨expr⟩
       ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
       ⇒ ⟨id,x⟩⟨op⟩⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨expr⟩⟨op⟩⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩⟨op⟩⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩ * ⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩ * ⟨id,y⟩

We have derived the sentence x + 2 * y; we denote this ⟨goal⟩ ⇒* id + num * id.
Such a sequence of rewrites is a derivation or a parse.
The process of discovering a derivation is called parsing.
The previous example was a leftmost derivation.
Rightmost derivation

⟨goal⟩ ⇒ ⟨expr⟩
       ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
       ⇒ ⟨expr⟩⟨op⟩⟨id,y⟩
       ⇒ ⟨expr⟩ * ⟨id,y⟩
       ⇒ ⟨expr⟩⟨op⟩⟨expr⟩ * ⟨id,y⟩
       ⇒ ⟨expr⟩⟨op⟩⟨num,2⟩ * ⟨id,y⟩
       ⇒ ⟨expr⟩ + ⟨num,2⟩ * ⟨id,y⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩ * ⟨id,y⟩

Precedence

[Parse tree for the rightmost derivation: expr at the root expands to expr op expr, with the * applied to ⟨id,y⟩ near the root — addition ends up binding tighter than multiplication.]
Precedence

These two derivations point out a problem with the grammar.
It has no notion of precedence, or implied order of evaluation.
To add precedence takes additional machinery:

1 ⟨goal⟩   ::= ⟨expr⟩
2 ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
3            | ⟨expr⟩ - ⟨term⟩
4            | ⟨term⟩
5 ⟨term⟩   ::= ⟨term⟩ * ⟨factor⟩
6            | ⟨term⟩ / ⟨factor⟩
7            | ⟨factor⟩
8 ⟨factor⟩ ::= num
9            | id

This grammar enforces a precedence on the derivation:
  terms must be derived from expressions
  forces the "correct" tree

Now, for the string x + 2 * y:

⟨goal⟩ ⇒ ⟨expr⟩
       ⇒ ⟨expr⟩ + ⟨term⟩
       ⇒ ⟨expr⟩ + ⟨term⟩ * ⟨factor⟩
       ⇒ ⟨expr⟩ + ⟨term⟩ * ⟨id,y⟩
       ⇒ ⟨expr⟩ + ⟨factor⟩ * ⟨id,y⟩
       ⇒ ⟨expr⟩ + ⟨num,2⟩ * ⟨id,y⟩
       ⇒ ⟨term⟩ + ⟨num,2⟩ * ⟨id,y⟩
       ⇒ ⟨factor⟩ + ⟨num,2⟩ * ⟨id,y⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩ * ⟨id,y⟩

Again, ⟨goal⟩ ⇒* id + num * id, but this time, we build the desired tree.
Precedence

[Parse tree under the precedence grammar: goal → expr → expr + term, where the term derives term * factor; ⟨id,x⟩, ⟨num,2⟩ and ⟨id,y⟩ sit at the leaves, so multiplication now binds tighter than addition.]

Ambiguity

If a grammar has more than one derivation for a single sentential form, then it is ambiguous.

Example:
⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
         | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
         | other stmts

Consider deriving the sentential form:
  if E1 then if E2 then S1 else S2
It has two derivations.
This ambiguity is purely grammatical.
It is a context-free ambiguity.
Scanning vs. parsing

Where do we draw the line?

term ::= [a-zA-Z]([a-zA-Z] | [0-9])*
       | 0 | [1-9][0-9]*
op   ::= + | - | * | /
expr ::= (term op)* term

Parsing: the big picture

[Figure: tokens flow from the scanner into the parser.]
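The term/op definitions above can be written directly as regular expressions. A minimal sketch of the scanner side of this split, in Python (token names and the whitespace handling are illustrative, not from the slides):

```python
import re

# "term" and "op" from the slide, as Python regexes.
TERM = re.compile(r'[a-zA-Z][a-zA-Z0-9]*|0|[1-9][0-9]*')
OP = re.compile(r'[+\-*/]')

def tokenize(s):
    """Split s into (kind, lexeme) pairs; the parser then checks (term op)* term."""
    tokens, i = [], 0
    while i < len(s):
        if s[i].isspace():           # skip whitespace between tokens
            i += 1
            continue
        for kind, pat in (('term', TERM), ('op', OP)):
            m = pat.match(s, i)
            if m:
                tokens.append((kind, m.group()))
                i = m.end()
                break
        else:
            raise ValueError(f'bad character {s[i]!r}')
    return tokens
```

Everything below the regex level is the scanner's job; checking that the token stream matches `(term op)* term` is the parser's.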
Top-down parsers
  start at the root of the derivation tree and fill in
  pick a production and try to match the input
  may require backtracking
  some grammars are backtrack-free (predictive)

Bottom-up parsers
  start at the leaves and fill in
  start in a state valid for legal first tokens
  as input is consumed, change state to encode possibilities (recognize valid prefixes)
  use a stack to store both state and sentential forms

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar.
To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string:

1 At a node labelled A, select a production A → α and construct the appropriate child for each symbol of α
2 When a terminal is added to the fringe that doesn't match the input string, backtrack
3 Find the next node to be expanded (it must have a label in Vn)

The key is selecting the right production in step 1.
If the parser makes a wrong step, the "derivation" process does not terminate. Why is that bad?
Left-recursion

Eliminating left-recursion

Left factoring
Given a left-factored CFG, to eliminate left-recursion:

Input: Grammar G with no cycles and no ε-productions.
Output: Equivalent grammar with no left-recursion.
begin
  Arrange the non-terminals in some order A1, A2, ..., An;
  foreach i = 1 ... n do
    foreach j = 1 ... i−1 do
      Say the ith production is: Ai → Aj γ;
      and Aj → δ1 | δ2 | ... | δk;
      Replace the ith production by:
        Ai → δ1 γ | δ2 γ | ... | δk γ;
    Eliminate immediate left recursion in Ai;
end

Question:
By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token of lookahead?

Answer:
Given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:

{a^n 0 b^n | n ≥ 1} ∪ {a^n 1 b^2n | n ≥ 1}

Must look past an arbitrary number of a's to discover the 0 or the 1 and so determine the derivation.
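The inner step of the algorithm, eliminating immediate left recursion for one non-terminal, can be sketched in a few lines. This is an illustrative implementation (the dict-of-tuples grammar representation and the primed-name convention are assumptions, not from the slides):

```python
def eliminate_immediate_left_recursion(nt, alts):
    """Split A -> A alpha | beta into A -> beta A' ; A' -> alpha A' | eps.
    alts: list of alternatives, each a tuple of symbols; () denotes eps."""
    recursive = [a[1:] for a in alts if a and a[0] == nt]   # the alpha parts
    others = [a for a in alts if not a or a[0] != nt]       # the beta parts
    if not recursive:
        return {nt: alts}            # nothing to do
    new_nt = nt + "'"
    return {
        nt:     [b + (new_nt,) for b in others],            # A  -> beta A'
        new_nt: [r + (new_nt,) for r in recursive] + [()],  # A' -> alpha A' | eps
    }

# E -> E + T | T  becomes  E -> T E' ; E' -> + T E' | eps
g = eliminate_immediate_left_recursion('E', [('E', '+', 'T'), ('T',)])
```

The full algorithm above additionally substitutes Aj-alternatives into Ai-productions before calling this step, so that indirect recursion becomes immediate.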
Recursive descent parsing

int A()
begin
  foreach production of the form A → X1 X2 X3 ... Xk do
    for i = 1 to k do
      if Xi is a non-terminal then
        if (Xi() ≠ 0) then
          backtrack; break; // Try the next production
      else if Xi matches the current input symbol a then
        advance the input to the next symbol;
      else
        backtrack; break; // Try the next production
    if i == k + 1 then
      return 0; // Success
  return 1; // Failure
end

Backtracks in general – in practice it may not do much. How to backtrack?
A left-recursive grammar will lead to an infinite loop.
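For a grammar that is already predictive, the scheme above needs no backtracking at all: one procedure per non-terminal, each consuming input as it goes. A minimal hand-written recognizer for an illustrative right-recursive expression grammar (the grammar, class and method names are assumptions for the sketch):

```python
class Parser:
    """Recursive descent for  E -> T ('+' T)* ;  T -> 'id' ('*' 'id')* ."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else '$'

    def eat(self, tok):
        # Terminal case of the pseudocode: match or report an error.
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok}, got {self.peek()}")
        self.pos += 1

    def expr(self):              # E -> T ('+' T)*
        self.term()
        while self.peek() == '+':
            self.eat('+')
            self.term()

    def term(self):              # T -> id ('*' id)*
        self.eat('id')
        while self.peek() == '*':
            self.eat('*')
            self.eat('id')

    def parse(self):
        self.expr()
        self.eat('$')            # whole input must be consumed
        return True
```

Because each alternative starts with a distinct lookahead token, the parser never has to undo a choice; left recursion, by contrast, would make `expr` call itself without consuming input.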
Rather than writing recursive code, we build tables.
Why? Building tables can be automated, easily.
This is true for both top-down (LL) and bottom-up (LR) parsers.

[Figure: a parser generator turns the grammar into parsing tables; the parser is then driven by those tables and a stack.]

A table-driven parser also uses a stack – but mainly to remember part of the input string; there is no recursion.
FIRST

For a string of grammar symbols α, define FIRST(α) as:
  the set of terminals that begin strings derived from α:
  {a ∈ Vt | α ⇒* aβ}
  If α ⇒* ε then ε ∈ FIRST(α)
FIRST(α) contains the tokens valid in the initial position in α.

To build FIRST(X):
1 If X ∈ Vt then FIRST(X) is {X}
2 If X → ε then add ε to FIRST(X)
3 If X → Y1 Y2 ... Yk:
  1 Put FIRST(Y1) − {ε} in FIRST(X)
  2 ∀i : 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Yi−1)
    (i.e., Y1 ... Yi−1 ⇒* ε)
    then put FIRST(Yi) − {ε} in FIRST(X)
  3 If ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Yk) then put ε in FIRST(X)
Repeat until no more additions can be made.

FOLLOW

For a non-terminal A, define FOLLOW(A) as:
  the set of terminals that can appear immediately to the right of A in some sentential form
Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it.
A terminal symbol has no FOLLOW set.

To build FOLLOW(A):
1 Put $ in FOLLOW(⟨goal⟩)
2 If A → αBβ:
  1 Put FIRST(β) − {ε} in FOLLOW(B)
  2 If β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒* ε) then put FOLLOW(A) in FOLLOW(B)
Repeat until no more additions can be made.
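The "repeat until no more additions" phrasing is a fixed-point computation, which translates almost directly into code. A sketch of the FIRST construction (the grammar encoding and the `'eps'` marker are assumptions; FOLLOW follows the same fixed-point pattern):

```python
EPS = 'eps'

def first_sets(grammar, terminals):
    """Fixed-point FIRST computation.
    grammar: dict non-terminal -> list of alternatives (tuples of symbols)."""
    first = {nt: set() for nt in grammar}

    def first_of(sym):                      # rule 1: FIRST of a terminal is itself
        return {sym} if sym in terminals else first[sym]

    changed = True
    while changed:                          # repeat until no more additions
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                acc, all_eps = set(), True
                for sym in alt:             # rule 3: walk Y1 ... Yk
                    f = first_of(sym)
                    acc |= f - {EPS}
                    if EPS not in f:        # Yi cannot vanish: stop here
                        all_eps = False
                        break
                if all_eps:                 # rule 2 / 3.3: whole body derives eps
                    acc.add(EPS)
                if not acc <= first[nt]:
                    first[nt] |= acc
                    changed = True
    return first
```

For the right-recursive grammar E → T E', E' → + T E' | ε, T → id, this yields FIRST(E) = {id}, FIRST(E') = {+, ε}, FIRST(T) = {id}.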
Previous definition
A grammar G is LL(1) iff for all non-terminals A, each distinct pair of productions A → β and A → γ satisfy the condition FIRST(β) ∩ FIRST(γ) = φ.

What if A ⇒* ε?

Revised definition
A grammar G is LL(1) iff for each set of productions A → α1 | α2 | ... | αn:
1 FIRST(α1), FIRST(α2), ..., FIRST(αn) are all pairwise disjoint
2 If αi ⇒* ε then FIRST(αj) ∩ FOLLOW(A) = φ, ∀ 1 ≤ j ≤ n, i ≠ j.

Provable facts about LL(1) grammars:
1 No left-recursive grammar is LL(1)
2 No ambiguous grammar is LL(1)
3 Some languages have no LL(1) grammar
4 An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example
S → aS | a is not LL(1) because FIRST(aS) = FIRST(a) = {a}

S  → aS'
S' → aS' | ε

accepts the same language and is LL(1).
LL(1) parse table construction

Input: Grammar G
Output: Parsing table M
Method:
1 ∀ productions A → α:
  1 ∀a ∈ FIRST(α), add A → α to M[A, a]
  2 If ε ∈ FIRST(α):
    1 ∀b ∈ FOLLOW(A), add A → α to M[A, b]
    2 If $ ∈ FOLLOW(A) then add A → α to M[A, $]
2 Set each undefined entry of M to error

If ∃ M[A, a] with multiple entries then the grammar is not LL(1).
Note: recall a, b ∈ Vt, so a, b ≠ ε.

Example

Our long-suffering expression grammar:

1. S  → E        6.  T  → F T'
2. E  → T E'     7.  T' → *T
3. E' → +E       8.       | /T
4.     | −E      9.       | ε
5.     | ε       10. F  → num
                 11.     | id

     FIRST      FOLLOW         | id  num  +  −  *  /  $
S    num, id    $              |  1   1   −  −  −  −  −
E    num, id    $              |  2   2   −  −  −  −  −
E'   ε, +, −    $              |  −   −   3  4  −  −  5
T    num, id    +, −, $        |  6   6   −  −  −  −  −
T'   ε, *, /    +, −, $        |  −   −   9  9  7  8  9
F    num, id    +, −, *, /, $  | 11  10   −  −  −  −  −
id   id         −
num  num        −
*    *          −
/    /          −
+    +          −
−    −          −
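The two construction rules can be coded in a few lines once FIRST and FOLLOW are available. A sketch, demonstrated on the S → aS' repair from the earlier LL(1) example (the table representation and the conflict-reporting behaviour are assumptions of the sketch):

```python
def build_ll1_table(grammar, first, follow, terminals):
    """M[(A, a)] = the body of the production to apply; raises on a conflict."""
    EPS = 'eps'

    def first_of(alpha):                    # FIRST of a symbol string
        out = set()
        for sym in alpha:
            f = {sym} if sym in terminals else first[sym]
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)                        # every symbol can vanish
        return out

    table = {}
    for nt, alts in grammar.items():
        for alt in alts:
            f = first_of(alt)
            # Rule 1: FIRST(alpha); rule 2: FOLLOW(A) when eps in FIRST(alpha).
            lookaheads = (f - {EPS}) | (follow[nt] if EPS in f else set())
            for a in lookaheads:
                if (nt, a) in table:        # multiply-defined entry
                    raise ValueError(f"not LL(1): conflict in M[{nt},{a}]")
                table[(nt, a)] = alt
    return table

# S -> a S' ; S' -> a S' | eps
g = {'S': [('a', "S'")], "S'": [('a', "S'"), ()]}
tbl = build_ll1_table(g, {'S': {'a'}, "S'": {'a', 'eps'}},
                      {'S': {'$'}, "S'": {'$'}}, {'a'})
```

Running the same builder on the original S → aS | a would raise at M[S, a], which is exactly the multiply-defined-entry test for "not LL(1)".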
Another example of painful left-factoring

Here is a typical example where a programming language fails to be LL(1):

stmt       → assignment | call | other
assignment → id := exp
call       → id ( exp-list )

This grammar is not in a form that can be left factored. We must first replace assignment and call by the right-hand sides of their defining productions:

statement → id := exp | id ( exp-list ) | other

We left factor:

statement → id stmt' | other
stmt'     → := exp | ( exp-list )

See how the grammar obscures the language semantics.

Error recovery in Predictive Parsing

An error is detected when the terminal on top of the stack does not match the next input symbol or M[A, a] = error.

Panic mode error recovery: skip input symbols till a "synchronizing" token appears.
Q: How to identify a synchronizing token? Some heuristics:
  Put all symbols in FOLLOW(A) in the synchronizing set for the non-terminal A.
  Use the semicolon after a Stmt production: assignmentStmt; assignmentStmt;
What if a terminal on top of the stack cannot be matched?
  pop the terminal;
  issue a message that the terminal was inserted.
Q: How about error messages?
Recall

For a grammar G, with start symbol S, any string α such that S ⇒* α is called a sentential form.
  If α ∈ Vt*, then α is called a sentence in L(G).
  Otherwise it is just a sentential form (not a sentence in L(G)).
A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence.
A right-sentential form is a sentential form that occurs in the rightmost derivation of some sentence.

Goal:
Given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.
Reductions vs. derivations

Reduction:
At each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of the production.

Key decisions:
  When to reduce?
  What production rule to apply?

Recall: in a derivation, a non-terminal in a sentential form is replaced by the body of one of its productions.
A reduction is the reverse of a step in a derivation.

Bottom-up parsing is the process of "reducing" a string w to the start symbol.
Goal of bottom-up parsing: build the derivation tree in reverse.

Example

Consider the grammar
1 S → aABe
2 A → Abc
3   | b
4 B → d

and the input string abbcde:

Prod'n   Sentential form
3        a b bcde
2        a Abc de
4        aA d e
1        aABe
–        S

The trick appears to be scanning the input and finding valid sentential forms.
Handles

Informally, a "handle" is:
  a substring that matches the body of a production (not necessarily the first one), and
  reducing this handle represents one step of reduction (or reverse rightmost derivation).

[Figure: the parse tree for αβw with S at the root; the handle A → β sits at the frontier, with the unread terminals w to its right.]

Theorem:
If G is unambiguous then every right-sentential form has a unique handle.

Proof: (by definition)
1 G is unambiguous ⇒ the rightmost derivation is unique
2 ⇒ a unique production A → β is applied to take γi−1 to γi
3 ⇒ a unique position k at which A → β is applied
4 ⇒ a unique handle A → β
Example

Handle-pruning
Shift-reduce parsing

Shift-reduce parsers are simple to understand.
A shift-reduce parser has just four canonical actions:
1 shift – next input symbol is shifted onto the top of the stack
2 reduce – right end of handle is on top of stack; locate left end of handle within the stack; pop handle off stack and push appropriate non-terminal LHS
3 accept – terminate parsing and signal success
4 error – call an error recovery routine

Key insight: recognize handles with a DFA:
  DFA transitions shift states instead of symbols
  accepting states trigger reductions
May have shift-reduce conflicts.

LR parsing

The skeleton parser:

push s0
token ← next_token()
repeat forever
  s ← top of stack
  if ACTION[s, token] = "shift si" then
    push si
    token ← next_token()
  else if ACTION[s, token] = "reduce A → β" then
    pop |β| states
    s' ← top of stack
    push GOTO[s', A]
  else if ACTION[s, token] = "accept" then
    return
  else error()

"How many ops?": k shifts, l reduces, and 1 accept, where k is the length of the input string and l is the length of the reverse rightmost derivation.
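The skeleton above is runnable once tables exist. A sketch driven by hand-built SLR(1) tables for the toy grammar E → E + T | T, T → id (the table contents are worked out by hand for this sketch; a generator would normally produce them):

```python
# Production number -> (lhs, body length); 1: E->E+T, 2: E->T, 3: T->id.
PROD = {1: ('E', 3), 2: ('E', 1), 3: ('T', 1)}
ACTION = {
    (0, 'id'): ('s', 3), (4, 'id'): ('s', 3), (1, '+'): ('s', 4),
    (1, '$'): ('acc',),
    (2, '+'): ('r', 2), (2, '$'): ('r', 2),
    (3, '+'): ('r', 3), (3, '$'): ('r', 3),
    (5, '+'): ('r', 1), (5, '$'): ('r', 1),
}
GOTO = {(0, 'E'): 1, (0, 'T'): 2, (4, 'T'): 5}

def lr_parse(tokens):
    stack = [0]                            # push s0
    toks = iter(tokens)
    token = next(toks)                     # token <- next_token()
    while True:                            # repeat forever
        act = ACTION.get((stack[-1], token))
        if act is None:
            raise SyntaxError(f"unexpected {token}")
        if act[0] == 's':                  # shift s_i
            stack.append(act[1])
            token = next(toks)
        elif act[0] == 'r':                # reduce A -> beta
            lhs, n = PROD[act[1]]
            del stack[len(stack) - n:]     # pop |beta| states
            stack.append(GOTO[(stack[-1], lhs)])
        else:                              # accept
            return True
```

On `id + id $` this performs 3 shifts and 4 reduces before the accept, matching the k-shifts, l-reduces, 1-accept count.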
LR(k) grammars
LR(1) grammars are often used to construct parsers. We call these parsers LR(1) parsers.
  virtually all context-free programming language constructs can be expressed in an LR(1) form
  LR grammars are the most general grammars parsable by a deterministic, bottom-up parser
  efficient parsers can be implemented for LR(1) grammars
  LR parsers detect an error as soon as possible in a left-to-right scan of the input
  LR grammars describe a proper superset of the languages recognized by predictive (i.e., LL) parsers

LL(k): recognize the use of a production A → β after seeing the first k symbols derived from β
LR(k): recognize the handle β after seeing everything derived from β plus k lookahead symbols

Three common algorithms to build tables for an "LR" parser:
1 SLR(1)
  smallest class of grammars
  smallest tables (number of states)
  simple, fast construction
2 LR(1)
  full set of LR(1) grammars
  largest tables (number of states)
  slow, large construction
3 LALR(1)
  intermediate-sized set of grammars
  same number of states as SLR(1)
  canonical construction is slow and large
  better construction techniques exist
SLR vs. LR/LALR

LR(k) items
CLOSURE

Given an item [A → α • Bβ], its closure contains the item and any other items that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on its stack, the input should reduce to Bβ (or γ for some other item [B → •γ] in the closure).

function CLOSURE(I)
  repeat
    if [A → α • Bβ] ∈ I
      add [B → •γ] to I
  until no more items can be added to I
  return I

GOTO

Let I be a set of LR(0) items and X be a grammar symbol.
Then, GOTO(I, X) is the closure of the set of all items [A → αX • β] such that [A → α • Xβ] ∈ I.
If I is the set of valid items for some viable prefix γ, then GOTO(I, X) is the set of valid items for the viable prefix γX.
GOTO(I, X) represents the state after recognizing X in state I.

function GOTO(I, X)
  let J be the set of items [A → αX • β]
    such that [A → α • Xβ] ∈ I
  return CLOSURE(J)
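Both functions are short when an item is a `(lhs, body, dot)` triple. A sketch over a subset of the running grammar S → E$, E → E + T | T, T → id (the item encoding and the reduced grammar are assumptions of the sketch):

```python
def closure(items, grammar):
    """CLOSURE over LR(0) items; item = (lhs, body_tuple, dot_position)."""
    result = set(items)
    changed = True
    while changed:                          # until no more items can be added
        changed = False
        for lhs, body, dot in list(result):
            # If the dot sits before a non-terminal B, add [B -> . gamma].
            if dot < len(body) and body[dot] in grammar:
                for gamma in grammar[body[dot]]:
                    item = (body[dot], gamma, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)

def goto(items, X, grammar):
    """GOTO: advance the dot over X, then close."""
    moved = {(lhs, body, dot + 1)
             for lhs, body, dot in items
             if dot < len(body) and body[dot] == X}
    return closure(moved, grammar)

g = {'S': [('E', '$')], 'E': [('E', '+', 'T'), ('T',)], 'T': [('id',)]}
I0 = closure({('S', ('E', '$'), 0)}, g)     # state I0 of the CFSM
```

Here `goto(I0, 'E', g)` yields exactly the two-item state {S → E•$, E → E•+T}, matching I1 in the example that follows.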
We start the construction with the item [S' → •S$], where
  S' is the start symbol of the augmented grammar G'
  S is the start symbol of G
  $ represents EOF

To compute the collection of sets of LR(0) items:

function items(G')
  s0 ← CLOSURE({[S' → •S$]})
  C ← {s0}
  repeat
    for each set of items s ∈ C
      for each grammar symbol X
        if GOTO(s, X) ≠ φ and GOTO(s, X) ∉ C
          add GOTO(s, X) to C
  until no more item sets can be added to C
  return C

Example: for the grammar

1 S → E$
2 E → E + T
3   | T
4 T → id
5   | (E)

the sets of LR(0) items are:

I0: S → •E$         I4: E → E + T•
    E → •E + T      I5: T → id•
    E → •T          I6: T → (•E)
    T → •id             E → •E + T
    T → •(E)            E → •T
I1: S → E • $           T → •id
    E → E • +T          T → •(E)
I2: S → E$•         I7: T → (E•)
I3: E → E + •T          E → E • +T
    T → •id         I8: T → (E)•
    T → •(E)        I9: E → T•

[Figure: the corresponding CFSM, with transitions on E, T, id, (, ), + and $ among states 0–9.]
Constructing the LR(0) parsing table

LR(0) example
If the LR(0) parsing table contains any multiply-defined ACTION entries then G is not LR(0).
Two conflicts arise:
  shift-reduce: both shift and reduce possible in the same item set
  reduce-reduce: more than one distinct reduce action possible in the same item set

Conflicts can be resolved through lookahead in ACTION. Consider:
  A → ε | aα
  ⇒ shift-reduce conflict
  a:=b+c*d requires lookahead to avoid a shift-reduce conflict after shifting c (need to see * to give precedence over +)

SLR(1): add lookaheads after building the LR(0) item sets.
Constructing the SLR(1) parsing table:
1 construct the collection of sets of LR(0) items for G'
2 state i of the CFSM is constructed from Ii
  1 [A → α • aβ] ∈ Ii and GOTO(Ii, a) = Ij
    ⇒ ACTION[i, a] ← "shift j", ∀a ≠ $
  2 [A → α•] ∈ Ii, A ≠ S'
    ⇒ ACTION[i, a] ← "reduce A → α", ∀a ∈ FOLLOW(A)
  3 [S' → S • $] ∈ Ii
    ⇒ ACTION[i, $] ← "accept"
3 GOTO(Ii, A) = Ij
  ⇒ GOTO[i, A] ← j
4 set undefined entries in ACTION and GOTO to "error"
5 the initial state s0 of the parser is CLOSURE([S' → •S$])
From the previous example

Example: a grammar that is not LR(0)

LR(1) items
closure1(I)

Given an item [A → α • Bβ, a], its closure contains the item and any other items that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on its stack, the input should reduce to Bβ (or γ for some other item [B → •γ, b] in the closure).

function closure1(I)
  repeat
    if [A → α • Bβ, a] ∈ I
      add [B → •γ, b] to I, where b ∈ FIRST(βa)
  until no more items can be added to I
  return I

goto1(I)

Let I be a set of LR(1) items and X be a grammar symbol.
Then, GOTO(I, X) is the closure of the set of all items [A → αX • β, a] such that [A → α • Xβ, a] ∈ I.
If I is the set of valid items for some viable prefix γ, then GOTO(I, X) is the set of valid items for the viable prefix γX.
goto(I, X) represents the state after recognizing X in state I.

function goto1(I, X)
  let J be the set of items [A → αX • β, a]
    such that [A → α • Xβ, a] ∈ I
  return closure1(J)
Building the LR(1) item sets for grammar G

We start the construction with the item [S' → •S, $], where
  S' is the start symbol of the augmented grammar G'
  S is the start symbol of G
  $ represents EOF

To compute the collection of sets of LR(1) items:

function items(G')
  s0 ← closure1({[S' → •S, $]})
  C ← {s0}
  repeat
    for each set of items s ∈ C
      for each grammar symbol X
        if goto1(s, X) ≠ φ and goto1(s, X) ∉ C
          add goto1(s, X) to C
  until no more item sets can be added to C
  return C

Constructing the LR(1) parsing table

Build lookahead into the DFA to begin with:
1 construct the collection of sets of LR(1) items for G'
2 state i of the LR(1) machine is constructed from Ii
  1 [A → α • aβ, b] ∈ Ii and goto1(Ii, a) = Ij
    ⇒ ACTION[i, a] ← "shift j"
  2 [A → α•, a] ∈ Ii, A ≠ S'
    ⇒ ACTION[i, a] ← "reduce A → α"
  3 [S' → S•, $] ∈ Ii
    ⇒ ACTION[i, $] ← "accept"
3 goto1(Ii, A) = Ij
  ⇒ GOTO[i, A] ← j
4 set undefined entries in ACTION and GOTO to "error"
5 the initial state s0 of the parser is closure1([S' → •S, $])
Back to the previous example (∉ SLR(1))

S → L = R
  | R
L → *R
  | id
R → L

I0:  S' → •S, $         I1:  S' → S•, $
     S → •L = R, $      I2:  S → L• = R, $
     S → •R, $               R → L•, $
     L → •*R, =         I3:  S → R•, $
     L → •id, =         I4:  L → *•R, =$
     R → •L, $               R → •L, =$
     L → •*R, $              L → •*R, =$
     L → •id, $              L → •id, =$
I5:  L → id•, =$        I6:  S → L = •R, $
                             R → •L, $
                             L → •*R, $
                             L → •id, $
I7:  L → *R•, =$        I8:  R → L•, =$
I9:  S → L = R•, $      I10: R → L•, $
I11: L → *•R, $         I12: L → id•, $
     R → •L, $          I13: L → *R•, $
     L → •*R, $
     L → •id, $

I2 no longer has a shift-reduce conflict: reduce on $, shift on =.

Example: back to the SLR(1) expression grammar

In general, LR(1) has many more states than LR(0)/SLR(1):

1 S → E        4 T → T * F
2 E → E + T    5   | F
3   | T        6 F → id
               7   | (E)

LR(1) item sets (the lookahead sets show how states split; I0' and I0'' are the states reached by shifting "(" once and twice):

I0:                I0' (shifting "("):   I0'' (shifting "("):
S → •E, $          F → (•E), *+$         F → (•E), *+)
E → •E + T, +$     E → •E + T, +)        E → •E + T, +)
E → •T, +$         E → •T, +)            E → •T, +)
T → •T * F, *+$    T → •T * F, *+)       T → •T * F, *+)
T → •F, *+$        T → •F, *+)           T → •F, *+)
F → •id, *+$       F → •id, *+)          F → •id, *+)
F → •(E), *+$      F → •(E), *+)         F → •(E), *+)
Another example

LALR(1) parsing
Example

Reconsider:
0 S' → S
1 S  → CC
2 C  → cC
3    | d

I0: S' → •S, $       I3: C → c•C, cd     I6: C → c•C, $
    S  → •CC, $          C → •cC, cd         C → •cC, $
    C  → •cC, cd         C → •d, cd          C → •d, $
    C  → •d, cd      I4: C → d•, cd      I7: C → d•, $
I1: S' → S•, $       I5: S → CC•, $      I8: C → cC•, cd
I2: S  → C•C, $                          I9: C → cC•, $
    C  → •cC, $
    C  → •d, $

Merged states:
I36: C → c•C, cd$
     C → •cC, cd$
     C → •d, cd$
I47: C → d•, cd$
I89: C → cC•, cd$

state   ACTION            GOTO
        c    d    $       S   C
0       s36  s47  –       1   2
1       –    –    acc     –   –
2       s36  s47  –       –   5
36      s36  s47  –       –   8
47      r3   r3   r3      –   –
5       –    –    r1      –   –
89      r2   r2   r2      –   –

More efficient LALR(1) construction

Observe that we can:
  represent Ii by its basis or kernel: items that are either [S' → •S, $] or do not have • at the left of the RHS
  compute shift, reduce and goto actions for the state derived from Ii directly from its kernel

This leads to a method that avoids building the complete canonical collection of sets of LR(1) items.

Self reading: Section 4.7.5, Dragon book
Error recovery in shift-reduce parsers

Error recovery in yacc/bison/Java CUP
Parsing review

Recursive descent
  A hand-coded recursive descent parser directly encodes a grammar (typically an LL(1) grammar) into a series of mutually recursive procedures. It has most of the linguistic limitations of LL(1).
LL(k)
  An LL(k) parser must be able to recognize the use of a production after seeing only the first k symbols of its right-hand side.
LR(k)
  An LR(k) parser must be able to recognize the occurrence of the right-hand side of a production after having seen all that is derived from that right-hand side with k symbols of lookahead.

Grammar hierarchy

LR(k) > LR(1) > LALR(1) > SLR(1) > LR(0)
LL(k) > LL(1) > LL(0)
LR(0) > LL(0)
LR(1) > LL(1)
LR(k) > LL(k)