
CS3300 - Compiler Design

Parsing

V. Krishna Nandivada

IIT Madras

Acknowledgement

These slides borrow liberal portions of text verbatim from Antony L.
Hosking @ Purdue, Jens Palsberg @ UCLA and the Dragon book.

Copyright 2019 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request
permission to publish from hosking@cs.purdue.edu.

V.Krishna Nandivada (IIT Madras) CS3300 - Aug 2019 2 / 98

The role of the parser

source → scanner → tokens → parser → IR → code
(both scanner and parser report errors)

A parser
  performs context-free syntax analysis
  guides context-sensitive analysis
  constructs an intermediate representation
  produces meaningful error messages
  attempts error correction

For the next several classes, we will look at parser construction.

Syntax analysis by using a CFG

Context-free syntax is specified with a context-free grammar.
Formally, a CFG G is a 4-tuple (Vt, Vn, S, P), where:
  Vt is the set of terminal symbols in the grammar.
    For our purposes, Vt is the set of tokens returned by the scanner.
  Vn, the nonterminals, is a set of syntactic variables that denote
    sets of (sub)strings occurring in the language.
    These are used to impose a structure on the grammar.
  S is a distinguished nonterminal (S ∈ Vn) denoting the entire set
    of strings in L(G). This is sometimes called a goal symbol.
  P is a finite set of productions specifying how terminals and
    non-terminals can be combined to form strings in the language.
    Each production must have a single non-terminal on its
    left-hand side.

The set V = Vt ∪ Vn is called the vocabulary of G.
Notation and terminology

  a, b, c, . . . ∈ Vt
  A, B, C, . . . ∈ Vn
  U, V, W, . . . ∈ V
  α, β, γ, . . . ∈ V∗
  u, v, w, . . . ∈ Vt∗

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ.
Similarly, ⇒∗ and ⇒+ denote derivations of ≥ 0 and ≥ 1 steps.
If S ⇒∗ β then β is said to be a sentential form of G.

L(G) = {w ∈ Vt∗ | S ⇒+ w}; w ∈ L(G) is called a sentence of G.
Note, L(G) = {β ∈ V∗ | S ⇒∗ β} ∩ Vt∗.

Syntax analysis

Grammars are often written in Backus-Naur form (BNF).
Example:

  1  ⟨goal⟩ ::= ⟨expr⟩
  2  ⟨expr⟩ ::= ⟨expr⟩⟨op⟩⟨expr⟩
  3          |  num
  4          |  id
  5  ⟨op⟩   ::= +
  6          |  −
  7          |  ∗
  8          |  /

This describes simple expressions over numbers and identifiers.
In a BNF for a grammar, we represent
  1  non-terminals with angle brackets or capital letters
  2  terminals with typewriter font or underline
  3  productions as in the example

Derivations

We can view the productions of a CFG as rewriting rules.
At each step, we choose a non-terminal to replace.
This choice can lead to different derivations.
Two are of particular interest:
  leftmost derivation: the leftmost non-terminal is replaced at each step
  rightmost derivation: the rightmost non-terminal is replaced at each step

Using our example CFG (for x + 2 ∗ y):

  ⟨goal⟩ ⇒ ⟨expr⟩
         ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
         ⇒ ⟨id,x⟩⟨op⟩⟨expr⟩
         ⇒ ⟨id,x⟩ + ⟨expr⟩
         ⇒ ⟨id,x⟩ + ⟨expr⟩⟨op⟩⟨expr⟩
         ⇒ ⟨id,x⟩ + ⟨num,2⟩⟨op⟩⟨expr⟩
         ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨expr⟩
         ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

We have derived the sentence x + 2 ∗ y.
We denote this ⟨goal⟩ ⇒∗ id + num ∗ id.
Such a sequence of rewrites is a derivation or a parse.
The process of discovering a derivation is called parsing.
This example was a leftmost derivation.
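The rewriting view of derivations can be made concrete. Below is a minimal sketch (not from the slides; the dictionary encoding of the grammar is an illustrative choice) of a leftmost derivation as repeated string rewriting, driven by a list of production choices that mirrors the derivation above.

```python
# Toy expression grammar from the slide, as rewrite rules.
GRAMMAR = {
    "goal": [["expr"]],
    "expr": [["expr", "op", "expr"], ["num"], ["id"]],
    "op":   [["+"], ["-"], ["*"], ["/"]],
}
NONTERMINALS = set(GRAMMAR)

def apply_leftmost(form, choice):
    """Replace the leftmost non-terminal in `form` with the RHS of the
    production numbered `choice` for that non-terminal."""
    for i, sym in enumerate(form):
        if sym in NONTERMINALS:
            return form[:i] + GRAMMAR[sym][choice] + form[i + 1:]
    raise ValueError("no non-terminal left; the form is a sentence")

# Derive id + num * id, mirroring the leftmost derivation on the slide.
form = ["goal"]
for choice in [0, 0, 2, 0, 0, 1, 2, 2]:
    form = apply_leftmost(form, choice)
# form is now ["id", "+", "num", "*", "id"]
```

Parsing is the inverse problem: given only the final token list, discover which sequence of choices produced it.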
Rightmost derivation

For the string x + 2 ∗ y:

  ⟨goal⟩ ⇒ ⟨expr⟩
         ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
         ⇒ ⟨expr⟩⟨op⟩⟨id,y⟩
         ⇒ ⟨expr⟩ ∗ ⟨id,y⟩
         ⇒ ⟨expr⟩⟨op⟩⟨expr⟩ ∗ ⟨id,y⟩
         ⇒ ⟨expr⟩⟨op⟩⟨num,2⟩ ∗ ⟨id,y⟩
         ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
         ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id.

Precedence

[Parse tree for this derivation: ⟨goal⟩ at the root; the top-level
⟨expr⟩⟨op⟩⟨expr⟩ splits at ∗, with ⟨id,y⟩ on the right and
⟨id,x⟩ + ⟨num,2⟩ below the left ⟨expr⟩.]

Treewalk evaluation computes (x + 2) ∗ y:
the “wrong” answer! Should be x + (2 ∗ y).

Precedence

These two derivations point out a problem with the grammar.
It has no notion of precedence, or implied order of evaluation.
To add precedence takes additional machinery:

  1  ⟨goal⟩   ::= ⟨expr⟩
  2  ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
  3            |  ⟨expr⟩ − ⟨term⟩
  4            |  ⟨term⟩
  5  ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
  6            |  ⟨term⟩/⟨factor⟩
  7            |  ⟨factor⟩
  8  ⟨factor⟩ ::= num
  9            |  id

This grammar enforces a precedence on the derivation:
  terms must be derived from expressions
  forces the “correct” tree

Precedence

Now, for the string x + 2 ∗ y:

  ⟨goal⟩ ⇒ ⟨expr⟩
         ⇒ ⟨expr⟩ + ⟨term⟩
         ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨factor⟩
         ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨id,y⟩
         ⇒ ⟨expr⟩ + ⟨factor⟩ ∗ ⟨id,y⟩
         ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
         ⇒ ⟨term⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
         ⇒ ⟨factor⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
         ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id, but this time, we build the desired tree.
Precedence

[Parse tree with the new grammar: ⟨goal⟩ → ⟨expr⟩ → ⟨expr⟩ + ⟨term⟩;
the left ⟨expr⟩ derives ⟨id,x⟩ through ⟨term⟩ and ⟨factor⟩, and the
right ⟨term⟩ derives ⟨term⟩ ∗ ⟨factor⟩ = ⟨num,2⟩ ∗ ⟨id,y⟩.]

Treewalk evaluation computes x + (2 ∗ y).

Ambiguity

If a grammar has more than one derivation for a single sentential form,
then it is ambiguous.
Example:

  ⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
         |   if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
         |   other stmts

Consider deriving the sentential form:

  if E1 then if E2 then S1 else S2

It has two derivations.
This ambiguity is purely grammatical.
It is a context-free ambiguity.

Ambiguity

May be able to eliminate ambiguities by rearranging the grammar:

  ⟨stmt⟩      ::= ⟨matched⟩
              |   ⟨unmatched⟩
  ⟨matched⟩   ::= if ⟨expr⟩ then ⟨matched⟩ else ⟨matched⟩
              |   other stmts
  ⟨unmatched⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
              |   if ⟨expr⟩ then ⟨matched⟩ else ⟨unmatched⟩

This generates the same language as the ambiguous grammar, but
applies the common sense rule:
  match each else with the closest unmatched then
This is most likely the language designer's intent.

Ambiguity

Ambiguity is often due to confusion in the context-free specification.
Context-sensitive confusions can arise from overloading.
Example:

  a = f(17)

In many Algol/Scala-like languages, f could be a function or a
subscripted variable. Disambiguating this statement requires context:
  need values of declarations
  not context-free
  really an issue of type
Rather than complicate parsing, we will handle this separately.
Scanning vs. parsing

Where do we draw the line?

  term ::= [a-zA-Z]([a-zA-Z] | [0-9])∗
        |  0 | [1-9][0-9]∗
  op   ::= + | − | ∗ | /
  expr ::= (term op)∗ term

Regular expressions are used to classify:
  identifiers, numbers, keywords
  REs are more concise and simpler for tokens than a grammar
  more efficient scanners can be built from REs (DFAs) than from
  grammars
Context-free grammars are used to count:
  brackets: (), begin. . . end, if. . . then. . . else
  imparting structure: expressions
Syntactic analysis is complicated enough: the grammar for C has
around 200 productions. Factoring out lexical analysis as a separate
phase makes the compiler more manageable.

Parsing: the big picture

A parser generator system: a grammar is fed to a parser generator,
which produces parsing tables for a table-driven parser; the parser
consumes the scanner's tokens and produces IR for the rest of the
compiler.

Our goal is a flexible parser generator system.

Different ways of parsing: Top-down vs Bottom-up

Top-down parsers
  start at the root of the derivation tree and fill in
  pick a production and try to match the input
  may require backtracking
  some grammars are backtrack-free (predictive)

Bottom-up parsers
  start at the leaves and fill in
  start in a state valid for legal first tokens
  as input is consumed, change state to encode possibilities
  (recognize valid prefixes)
  use a stack to store both state and sentential forms

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with
the start or goal symbol of the grammar.
To build a parse, it repeats the following steps until the fringe of the
parse tree matches the input string:
  1  At a node labelled A, select a production A → α and construct the
     appropriate child for each symbol of α
  2  When a terminal is added to the fringe that doesn't match the
     input string, backtrack
  3  Find the next node to be expanded (must have a label in Vn)

The key is selecting the right production in step 1.
If the parser makes a wrong step, the “derivation” process does not
terminate. Why is it bad?
Left-recursion

Top-down parsers cannot handle left-recursion in a grammar.
Formally, a grammar is left-recursive if
  ∃A ∈ Vn such that A ⇒+ Aα for some string α

Our simple expression grammar is left-recursive.

Eliminating left-recursion

To remove left-recursion, we can transform the grammar.
Consider the grammar fragment:

  ⟨foo⟩ ::= ⟨foo⟩α
        |   β

where α and β do not start with ⟨foo⟩.
We can rewrite this as:

  ⟨foo⟩ ::= β⟨bar⟩
  ⟨bar⟩ ::= α⟨bar⟩
        |   ε

where ⟨bar⟩ is a new non-terminal.
This fragment contains no left-recursion.
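The ⟨foo⟩/⟨bar⟩ rewrite can be sketched directly. Below is a minimal, assumption-laden implementation (productions are lists of symbol strings; the fresh non-terminal name `A'` is assumed unused) of immediate left-recursion elimination for one non-terminal.

```python
def eliminate_immediate_left_recursion(grammar, a):
    """Rewrite A ::= A alpha | beta into A ::= beta A'; A' ::= alpha A' | eps.
    `grammar` maps a non-terminal to a list of right-hand sides, each a
    list of symbols; [] denotes the epsilon production."""
    recursive = [rhs[1:] for rhs in grammar[a] if rhs and rhs[0] == a]
    other = [rhs for rhs in grammar[a] if not rhs or rhs[0] != a]
    if not recursive:
        return grammar  # nothing to do
    a_new = a + "'"  # fresh non-terminal (assumed not already in use)
    grammar[a] = [beta + [a_new] for beta in other]
    grammar[a_new] = [alpha + [a_new] for alpha in recursive] + [[]]
    return grammar

# The left-recursive ⟨expr⟩ rules from the precedence grammar:
g = {"expr": [["expr", "+", "term"], ["expr", "-", "term"], ["term"]]}
eliminate_immediate_left_recursion(g, "expr")
# g["expr"] is now [["term", "expr'"]], and expr' carries the + / - tails.
```

This matches the slide's transformation with A = ⟨expr⟩, α ∈ {+⟨term⟩, −⟨term⟩}, β = ⟨term⟩.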

How much lookahead is needed?

We saw that top-down parsers may need to backtrack when they
select the wrong production.
Do we need arbitrary lookahead to parse CFGs?
  in general, yes
  use the Earley or Cocke-Younger-Kasami algorithms
Fortunately
  large subclasses of CFGs can be parsed with limited lookahead
  most programming language constructs can be expressed in a
  grammar that falls in these subclasses
Among the interesting subclasses are:
  LL(1): left to right scan, left-most derivation, 1-token lookahead; and
  LR(1): left to right scan, reversed right-most derivation, 1-token
  lookahead

Predictive parsing

Basic idea:
For any two productions A → α | β, we would like a distinct way of
choosing the correct production to expand.
For some RHS α ∈ G, define FIRST(α) as the set of tokens that
appear first in some string derived from α.
That is, for some w ∈ Vt∗, w ∈ FIRST(α) iff. α ⇒∗ wγ.
Key property:
Whenever two productions A → α and A → β both appear in the
grammar, we would like

  FIRST(α) ∩ FIRST(β) = φ

This would allow the parser to make a correct choice with a
lookahead of only one symbol!
Left factoring

What if a grammar does not have this property?
Sometimes, we can transform a grammar to have this property.

For each non-terminal A find the longest prefix α common to two or
more of its alternatives.
If α ≠ ε then replace all of the A productions

  A → αβ1 | αβ2 | · · · | αβn

with

  A → αA'
  A' → β1 | β2 | · · · | βn

where A' is a new non-terminal.
Repeat until no two alternatives for a single non-terminal have a
common prefix.

Example

There are two non-terminals to left factor:

  ⟨expr⟩ ::= ⟨term⟩ + ⟨expr⟩
         |   ⟨term⟩ − ⟨expr⟩
         |   ⟨term⟩
  ⟨term⟩ ::= ⟨factor⟩ ∗ ⟨term⟩
         |   ⟨factor⟩/⟨term⟩
         |   ⟨factor⟩

Applying the transformation:

  ⟨expr⟩  ::= ⟨term⟩⟨expr'⟩
  ⟨expr'⟩ ::= +⟨expr⟩
          |   −⟨expr⟩
          |   ε
  ⟨term⟩  ::= ⟨factor⟩⟨term'⟩
  ⟨term'⟩ ::= ∗⟨term⟩
          |   /⟨term⟩
          |   ε
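One round of the transformation can be sketched as follows (a sketch under the same encoding assumptions as before: RHSs are symbol lists, `[]` is ε, and the fresh name `A'` is assumed unused).

```python
def common_prefix(x, y):
    """Longest common prefix of two symbol lists."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return x[:n]

def left_factor_once(grammar, a):
    """One round of left factoring for A: pull out the longest prefix
    shared by two or more alternatives. Returns True if it rewrote."""
    alts = grammar[a]
    best = []
    for i in range(len(alts)):
        for j in range(i + 1, len(alts)):
            p = common_prefix(alts[i], alts[j])
            if len(p) > len(best):
                best = p
    if not best:
        return False  # no two alternatives share a prefix
    a_new = a + "'"  # fresh non-terminal (assumed unused)
    factored = [rhs[len(best):] for rhs in alts if rhs[:len(best)] == best]
    rest = [rhs for rhs in alts if rhs[:len(best)] != best]
    grammar[a] = [best + [a_new]] + rest
    grammar[a_new] = factored  # an empty RHS here is the eps-production
    return True

g = {"expr": [["term", "+", "expr"], ["term", "-", "expr"], ["term"]]}
left_factor_once(g, "expr")
```

As on the slide, "repeat until no two alternatives share a prefix" would just call `left_factor_once` over all non-terminals until every call returns False.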

Indirect left-recursion elimination

Given a left-factored CFG, to eliminate left-recursion:

  Input: Grammar G with no cycles and no ε-productions.
  Output: Equivalent grammar with no left-recursion.
  begin
    Arrange the non-terminals in some order A1, A2, · · · , An;
    foreach i = 1 · · · n do
      foreach j = 1 · · · i − 1 do
        Say the ith production is: Ai → Aj γ;
        and Aj → δ1 | δ2 | · · · | δk;
        Replace the ith production by:
          Ai → δ1 γ | δ2 γ | · · · | δk γ;
      Eliminate immediate left recursion in Ai;

Generality

Question:
  By left factoring and eliminating left-recursion, can we transform
  an arbitrary context-free grammar to a form where it can be
  predictively parsed with a single token lookahead?
Answer:
  Given a context-free grammar that doesn't meet our conditions, it
  is undecidable whether an equivalent grammar exists that does
  meet our conditions.
Many context-free languages do not have such a grammar:

  {a^n 0 b^n | n ≥ 1} ∪ {a^n 1 b^2n | n ≥ 1}

Must look past an arbitrary number of a's to discover the 0 or the 1
and so determine the derivation.
Recursive descent parsing

  int A()
  begin
    foreach production of the form A → X1 X2 X3 · · · Xk do
      for i = 1 to k do
        if Xi is a non-terminal then
          if (Xi() ≠ 0) then
            backtrack; break; // Try the next production
        else if Xi matches the current input symbol a then
          advance the input to the next symbol;
        else
          backtrack; break; // Try the next production
      if i EQ k + 1 then
        return 0; // Success
    return 1; // Failure

Backtracks in general; in practice it may not do much.
How to backtrack?
A left-recursive grammar will lead to an infinite loop.
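The pseudocode above translates almost directly into a recognizer. Here is a minimal sketch in Python (not from the slides) where backtracking is simply saving and restoring the input position; note that a committed non-terminal result is not retried, a limited form of backtracking that suffices for this grammar. The grammar is the left-factored expression grammar; a left-recursive one would recurse forever, as the slide warns.

```python
GRAMMAR = {
    "expr":   [["term", "expr'"]],
    "expr'":  [["+", "expr"], ["-", "expr"], []],  # [] is the eps-production
    "term":   [["factor", "term'"]],
    "term'":  [["*", "term"], ["/", "term"], []],
    "factor": [["num"], ["id"]],
}

def parse(nonterm, tokens, pos):
    """Try each production for `nonterm`; return the new input position
    on success, or None after all alternatives fail."""
    for rhs in GRAMMAR[nonterm]:
        p = pos  # remember where this alternative started (backtrack point)
        ok = True
        for sym in rhs:
            if sym in GRAMMAR:                       # non-terminal: recurse
                q = parse(sym, tokens, p)
                if q is None:
                    ok = False
                    break
                p = q
            elif p < len(tokens) and tokens[p] == sym:  # terminal: match
                p += 1
            else:
                ok = False                            # mismatch: backtrack
                break
        if ok:
            return p
    return None  # every production failed

def accepts(tokens):
    return parse("expr", tokens, 0) == len(tokens)
```

For example, `accepts(["id", "+", "num", "*", "id"])` succeeds, while `accepts(["id", "+"])` fails after backtracking out of the `+⟨expr⟩` alternative.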

Non-recursive predictive parsing

Now, a predictive parser looks like:

  source → scanner → tokens → table-driven parser → IR
  (the parser consults a stack and parsing tables)

Rather than writing recursive code, we build tables.
Why? Building tables can be automated, easily.

Table-driven parsers

A parser generator system often looks like:

  source → scanner → tokens → table-driven parser → IR
  (grammar → parser generator → parsing tables)

This is true for both top-down (LL) and bottom-up (LR) parsers.
This also uses a stack, but mainly to remember part of the input
string; no recursion.
FIRST

For a string of grammar symbols α, define FIRST(α) as:
  the set of terminals that begin strings derived from α:
    {a ∈ Vt | α ⇒∗ aβ}
  If α ⇒∗ ε then ε ∈ FIRST(α)
FIRST(α) contains the tokens valid in the initial position in α.
To build FIRST(X):
  1  If X ∈ Vt then FIRST(X) is {X}
  2  If X → ε then add ε to FIRST(X)
  3  If X → Y1 Y2 · · · Yk:
     1  Put FIRST(Y1) − {ε} in FIRST(X)
     2  ∀i : 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ · · · ∩ FIRST(Yi−1)
        (i.e., Y1 · · · Yi−1 ⇒∗ ε)
        then put FIRST(Yi) − {ε} in FIRST(X)
     3  If ε ∈ FIRST(Y1) ∩ · · · ∩ FIRST(Yk) then put ε in FIRST(X)
Repeat until no more additions can be made.

FOLLOW

For a non-terminal A, define FOLLOW(A) as
  the set of terminals that can appear immediately to the right of A
  in some sentential form
Thus, a non-terminal's FOLLOW set specifies the tokens that can
legally appear after it.
A terminal symbol has no FOLLOW set.
To build FOLLOW(A):
  1  Put $ in FOLLOW(⟨goal⟩)
  2  If A → αBβ:
     1  Put FIRST(β) − {ε} in FOLLOW(B)
     2  If β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒∗ ε) then put
        FOLLOW(A) in FOLLOW(B)
Repeat until no more additions can be made.
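Both constructions are fixpoint iterations, and can be sketched in a few lines. This is an illustrative implementation (the `"ε"` marker and the grammar encoding are assumptions of this sketch), run here on the left-factored expression grammar so the results can be checked against the table on a later slide.

```python
GRAMMAR = {
    "expr":   [["term", "expr'"]],
    "expr'":  [["+", "expr"], ["-", "expr"], []],  # [] is the eps-production
    "term":   [["factor", "term'"]],
    "term'":  [["*", "term"], ["/", "term"], []],
    "factor": [["num"], ["id"]],
}
EPS = "ε"

def first_sets(grammar):
    first = {a: set() for a in grammar}
    def first_of(symbols):
        """FIRST of a symbol string, using the per-symbol sets so far."""
        out = set()
        for sym in symbols:
            f = first[sym] if sym in grammar else {sym}  # terminal: {sym}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)  # every symbol was nullable (or the string is empty)
        return out
    changed = True
    while changed:  # iterate to a fixpoint
        changed = False
        for a, alts in grammar.items():
            for rhs in alts:
                new = first_of(rhs)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first, first_of

def follow_sets(grammar, start):
    first, first_of = first_sets(grammar)
    follow = {a: set() for a in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for a, alts in grammar.items():
            for rhs in alts:
                for i, b in enumerate(rhs):
                    if b not in grammar:
                        continue  # terminals have no FOLLOW set
                    beta = rhs[i + 1:]
                    f = first_of(beta)
                    # FIRST(beta) - {eps}, plus FOLLOW(A) if beta is nullable
                    new = (f - {EPS}) | (follow[a] if EPS in f else set())
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow
```

On this grammar, FIRST(⟨expr⟩) = {num, id}, FIRST(⟨term'⟩) = {∗, /, ε}, and FOLLOW(⟨factor⟩) = {+, −, ∗, /, $}.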

LL(1) grammars

Previous definition
  A grammar G is LL(1) iff. for all non-terminals A, each distinct pair
  of productions A → β and A → γ satisfy the condition
  FIRST(β) ∩ FIRST(γ) = φ.

What if A ⇒∗ ε?

Revised definition
  A grammar G is LL(1) iff. for each set of productions
  A → α1 | α2 | · · · | αn:
    1  FIRST(α1), FIRST(α2), . . . , FIRST(αn) are all pairwise disjoint
    2  If αi ⇒∗ ε then
       FIRST(αj) ∩ FOLLOW(A) = φ, ∀1 ≤ j ≤ n, i ≠ j.

If G is ε-free, condition 1 is sufficient.

LL(1) grammars

Provable facts about LL(1) grammars:
  1  No left-recursive grammar is LL(1)
  2  No ambiguous grammar is LL(1)
  3  Some languages have no LL(1) grammar
  4  An ε-free grammar where each alternative expansion for A begins
     with a distinct terminal is a simple LL(1) grammar.

Example
  S → aS | a is not LL(1) because FIRST(aS) = FIRST(a) = {a}

  S → aS'
  S' → aS' | ε

  accepts the same language and is LL(1).
LL(1) parse table construction

Input: Grammar G
Output: Parsing table M
Method:
  1  ∀ productions A → α:
     1  ∀a ∈ FIRST(α), add A → α to M[A, a]
     2  If ε ∈ FIRST(α):
        1  ∀b ∈ FOLLOW(A), add A → α to M[A, b]
        2  If $ ∈ FOLLOW(A) then add A → α to M[A, $]
  2  Set each undefined entry of M to error
If ∃M[A, a] with multiple entries then the grammar is not LL(1).
Note: recall a, b ∈ Vt, so a, b ≠ ε.

Example

Our long-suffering expression grammar:

  1. S  → E          6.  T  → FT'
  2. E  → TE'        7.  T' → ∗T
  3. E' → +E         8.       | /T
  4.      | −E       9.       | ε
  5.      | ε        10. F  → num
                     11.      | id

      FIRST          FOLLOW           id  num  +  −  ∗  /  $
  S   num, id        $                1   1    −  −  −  −  −
  E   num, id        $                2   2    −  −  −  −  −
  E'  ε, +, −        $                −   −    3  4  −  −  5
  T   num, id        +, −, $          6   6    −  −  −  −  −
  T'  ε, ∗, /        +, −, $          −   −    9  9  7  8  9
  F   num, id        +, −, ∗, /, $    11  10   −  −  −  −  −
  id  id             −
  num num            −
  ∗   ∗              −
  /   /              −
  +   +              −
  −   −              −
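The two-step method can be sketched directly. This sketch hardcodes the FIRST and FOLLOW sets from the table above (rather than recomputing them) and applies step 1 of the method; duplicate insertions into M would signal that the grammar is not LL(1).

```python
PROD = {  # production number -> (A, alpha)
    1: ("S", ["E"]),        2: ("E", ["T", "E'"]),
    3: ("E'", ["+", "E"]),  4: ("E'", ["-", "E"]),  5: ("E'", []),
    6: ("T", ["F", "T'"]),  7: ("T'", ["*", "T"]),  8: ("T'", ["/", "T"]),
    9: ("T'", []),          10: ("F", ["num"]),     11: ("F", ["id"]),
}
FIRST = {"S": {"num", "id"}, "E": {"num", "id"}, "E'": {"ε", "+", "-"},
         "T": {"num", "id"}, "T'": {"ε", "*", "/"}, "F": {"num", "id"}}
FOLLOW = {"S": {"$"}, "E": {"$"}, "E'": {"$"}, "T": {"+", "-", "$"},
          "T'": {"+", "-", "$"}, "F": {"+", "-", "*", "/", "$"}}

def first_of_rhs(alpha):
    """FIRST of a RHS string, from the per-symbol FIRST sets."""
    out = set()
    for sym in alpha:
        f = FIRST.get(sym, {sym})
        out |= f - {"ε"}
        if "ε" not in f:
            return out
    out.add("ε")
    return out

M = {}
conflicts = []
for n, (a, alpha) in PROD.items():
    f = first_of_rhs(alpha)
    # step 1.1: entries for FIRST(alpha); step 1.2: FOLLOW(A) if eps in it
    # (FOLLOW already contains $ where applicable)
    targets = (f - {"ε"}) | (FOLLOW[a] if "ε" in f else set())
    for tok in targets:
        if (a, tok) in M:
            conflicts.append((a, tok))  # multiply-defined: not LL(1)
        M[a, tok] = n
```

For this grammar `conflicts` stays empty and M reproduces the table, e.g. M[E', $] = 5 and M[T', ∗] = 7.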

Table-driven predictive parsing

Input: A string w and a parsing table M for a grammar G
Output: If w is in L(G), a leftmost derivation of w; otherwise, indicate
an error

  push $ onto the stack; push S onto the stack;
  inp points to the input tape;
  X = stack.top();
  while X ≠ $ do
    if X is inp then
      stack.pop(); inp++;
    else if X is a terminal then
      error();
    else if M[X, a] is an error entry then
      error();
    else if M[X, a] = X → Y1 Y2 · · · Yk then
      output the production X → Y1 Y2 · · · Yk;
      stack.pop();
      push Yk, Yk−1, · · · Y1 in that order;
    X = stack.top();

A grammar that is not LL(1)

  ⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
         |   if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
         |   . . .

Left-factored:

  ⟨stmt⟩  ::= if ⟨expr⟩ then ⟨stmt⟩ ⟨stmt'⟩ | . . .
  ⟨stmt'⟩ ::= else ⟨stmt⟩ | ε

Now, FIRST(⟨stmt'⟩) = {ε, else}
Also, FOLLOW(⟨stmt'⟩) = {else, $}
But, FIRST(⟨stmt'⟩) ∩ FOLLOW(⟨stmt'⟩) = {else} ≠ φ
On seeing else, there is a conflict between choosing
  ⟨stmt'⟩ ::= else ⟨stmt⟩ and ⟨stmt'⟩ ::= ε
⇒ grammar is not LL(1)!
The fix:
  Put priority on ⟨stmt'⟩ ::= else ⟨stmt⟩ to associate else with the
  closest previous then.
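The driver loop above can be sketched for the LL(1) expression grammar. This sketch stores the parse table directly as a dictionary mapping (non-terminal, lookahead) to a RHS; missing entries play the role of the error entries.

```python
M = {
    ("S", "id"): ["E"], ("S", "num"): ["E"],
    ("E", "id"): ["T", "E'"], ("E", "num"): ["T", "E'"],
    ("E'", "+"): ["+", "E"], ("E'", "-"): ["-", "E"], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "num"): ["F", "T'"],
    ("T'", "*"): ["*", "T"], ("T'", "/"): ["/", "T"],
    ("T'", "+"): [], ("T'", "-"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "num"): ["num"],
}
NONTERMS = {"S", "E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Return the list of productions used (a leftmost derivation),
    or None on error."""
    tokens = tokens + ["$"]
    stack = ["$", "S"]
    out = []
    i = 0
    x = stack[-1]
    while x != "$":
        if x == tokens[i]:              # terminal on top matches input
            stack.pop(); i += 1
        elif x not in NONTERMS:         # terminal mismatch
            return None
        elif (x, tokens[i]) not in M:   # error entry
            return None
        else:
            rhs = M[x, tokens[i]]
            out.append((x, rhs))        # output the production
            stack.pop()
            stack.extend(reversed(rhs)) # push Yk ... Y1, so Y1 is on top
        x = stack[-1]
    return out if tokens[i] == "$" else None
```

For example, `parse(["id", "+", "num"])` succeeds and emits the productions of the leftmost derivation; `parse(["id", "id"])` hits an error entry and returns None.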
Another example of painful left-factoring

Here is a typical example where a programming language fails to be
LL(1):

  stmt → assignment | call | other
  assignment → id := exp
  call → id (exp-list)

This grammar is not in a form that can be left factored. We must first
replace assignment and call by the right-hand sides of their defining
productions:

  statement → id := exp | id (exp-list) | other

We left factor:

  statement → id stmt' | other
  stmt' → := exp | (exp-list)

See how the grammar obscures the language semantics.

Error recovery in predictive parsing

An error is detected when the terminal on top of the stack does not
match the next input symbol or M[A, a] = error.
Panic mode error recovery
  Skip input symbols till a “synchronizing” token appears.
  Q: How to identify a synchronizing token?
Some heuristics:
  Put all symbols in FOLLOW(A) in the synchronizing set for the
  non-terminal A.
  Semicolon after a Stmt production: assignmentStmt; assignmentStmt;
If a terminal on top of the stack cannot be matched?
  pop the terminal.
  issue a message that the terminal was inserted.
Q: How about error messages?

Some definitions

Recall
  For a grammar G, with start symbol S, any string α such that
  S ⇒∗ α is called a sentential form.
  If α ∈ Vt∗, then α is called a sentence in L(G).
  Otherwise it is just a sentential form (not a sentence in L(G)).
A left-sentential form is a sentential form that occurs in the leftmost
derivation of some sentence.
A right-sentential form is a sentential form that occurs in the
rightmost derivation of some sentence.
An unambiguous grammar will have a unique leftmost/rightmost
derivation.

Bottom-up parsing

Goal:
  Given an input string w and a grammar G, construct a parse tree by
  starting at the leaves and working to the root.
Reductions vs Derivations

Reduction:
  At each reduction step, a specific substring matching the body of a
  production is replaced by the non-terminal at the head of the
  production.
Key decisions
  When to reduce?
  What production rule to apply?
Reductions vs Derivations
  Recall: In a derivation, a non-terminal in a sentential form is
  replaced by the body of one of its productions.
  A reduction is the reverse of a step in a derivation.
Bottom-up parsing is the process of “reducing” a string w to the start
symbol.
Goal of bottom-up parsing: build the derivation tree in reverse.

Example

Consider the grammar

  1  S → aABe
  2  A → Abc
  3    | b
  4  B → d

and the input string abbcde

  Prod'n.  Sentential Form
  3        a b bcde
  2        a Abc de
  4        aA d e
  1        aABe
  –        S

The trick appears to be scanning the input and finding valid
sentential forms.

Handles

Informally, a “handle” is
  a substring that matches the body of a production (not necessarily
  the first one), and
  reducing this handle represents one step of reduction (or reverse
  rightmost derivation).

[Figure: the parse tree for αβw, with S at the root deriving αAw and
the handle A → β at the frontier.]

Handles

Theorem:
  If G is unambiguous then every right-sentential form has a unique
  handle.
Proof: (by definition)
  1  G is unambiguous ⇒ rightmost derivation is unique
  2  ⇒ a unique production A → β applied to take γi−1 to γi
  3  ⇒ a unique position k at which A → β is applied
  4  ⇒ a unique handle A → β
Example

The left-recursive expression grammar (original form):

  1  ⟨goal⟩   ::= ⟨expr⟩
  2  ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
  3            |  ⟨expr⟩ − ⟨term⟩
  4            |  ⟨term⟩
  5  ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
  6            |  ⟨term⟩/⟨factor⟩
  7            |  ⟨factor⟩
  8  ⟨factor⟩ ::= num
  9            |  id

  Prod'n.  Sentential Form
  –        ⟨goal⟩
  1        ⟨expr⟩
  3        ⟨expr⟩ − ⟨term⟩
  5        ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩
  9        ⟨expr⟩ − ⟨term⟩ ∗ id
  7        ⟨expr⟩ − ⟨factor⟩ ∗ id
  8        ⟨expr⟩ − num ∗ id
  4        ⟨term⟩ − num ∗ id
  7        ⟨factor⟩ − num ∗ id
  9        id − num ∗ id

Handle-pruning

The process to construct a bottom-up parse is called handle-pruning.
To construct a rightmost derivation

  S = γ0 ⇒ γ1 ⇒ γ2 ⇒ · · · ⇒ γn−1 ⇒ γn = w

we set i to n and apply the following simple algorithm

  for i = n downto 1
    1  find the handle Ai → βi in γi
    2  replace βi with Ai to generate γi−1

This takes 2n steps, where n is the length of the derivation.

Stack implementation

One scheme to implement a handle-pruning, bottom-up parser is
called a shift-reduce parser.
Shift-reduce parsers use a stack and an input buffer:
  1  initialize stack with $
  2  Repeat until the top of the stack is the goal symbol and the input
     token is $
     a) find the handle
        if we don't have a handle on top of the stack, shift an input
        symbol onto the stack
     b) prune the handle
        if we have a handle A → β on the stack, reduce
          i)  pop |β| symbols off the stack
          ii) push A onto the stack

Example: back to x − 2 ∗ y

  1  S → E
  2  E → E + T
  3    | E − T
  4    | T
  5  T → T ∗ F
  6    | T / F
  7    | F
  8  F → num
  9    | id

  Stack                         Input          Action
  $                             id − num ∗ id  S
  $id                           − num ∗ id     R9
  $⟨factor⟩                     − num ∗ id     R7
  $⟨term⟩                       − num ∗ id     R4
  $⟨expr⟩                       − num ∗ id     S
  $⟨expr⟩ −                     num ∗ id       S
  $⟨expr⟩ − num                 ∗ id           R8
  $⟨expr⟩ − ⟨factor⟩            ∗ id           R7
  $⟨expr⟩ − ⟨term⟩              ∗ id           S
  $⟨expr⟩ − ⟨term⟩ ∗            id             S
  $⟨expr⟩ − ⟨term⟩ ∗ id                        R9
  $⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩                  R5
  $⟨expr⟩ − ⟨term⟩                             R3
  $⟨expr⟩                                      R1
  $⟨goal⟩                                      A
Shift-reduce parsing

Shift-reduce parsers are simple to understand.
A shift-reduce parser has just four canonical actions:
  1  shift: next input symbol is shifted onto the top of the stack
  2  reduce: right end of handle is on top of stack;
     locate left end of handle within the stack;
     pop handle off stack and push appropriate non-terminal LHS
  3  accept: terminate parsing and signal success
  4  error: call an error recovery routine
Key insight: recognize handles with a DFA:
  DFA transitions shift states instead of symbols
  accepting states trigger reductions
May have shift-reduce conflicts.

LR parsing

The skeleton parser:

  push s0
  token ← next_token()
  repeat forever
    s ← top of stack
    if action[s, token] = "shift si" then
      push si
      token ← next_token()
    else if action[s, token] = "reduce A → β" then
      pop |β| states
      s' ← top of stack
      push goto[s', A]
    else if action[s, token] = "accept" then
      return
    else error()

“How many ops?”: k shifts, l reduces, and 1 accept, where k is the
length of the input string and l is the length of the reverse rightmost
derivation.

Example tables

The grammar:

  1  S → E
  2  E → T + E
  3    | T
  4  T → F ∗ T
  5    | F
  6  F → id

  state        ACTION           GOTO
         id    +    ∗    $     E  T  F
  0      s4    –    –    –     1  2  3
  1      –     –    –    acc   –  –  –
  2      –     s5   –    r3    –  –  –
  3      –     r5   s6   r5    –  –  –
  4      –     r6   r6   r6    –  –  –
  5      s4    –    –    –     7  2  3
  6      s4    –    –    –     –  8  3
  7      –     –    –    r2    –  –  –
  8      –     r4   –    r4    –  –  –

Note: This is a simple little right-recursive grammar. It is not the
same grammar as in previous lectures.

Example using the tables

  Stack    Input           Action
  $0       id ∗ id + id$   s4
  $04      ∗ id + id$      r6
  $03      ∗ id + id$      s6
  $036     id + id$        s4
  $0364    + id$           r6
  $0363    + id$           r5
  $0368    + id$           r4
  $02      + id$           s5
  $025     id$             s4
  $0254    $               r6
  $0253    $               r5
  $0252    $               r3
  $0257    $               r2
  $01      $               acc
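The skeleton parser and these tables can be combined into a short driver. Below is a sketch that encodes the ACTION/GOTO tables above as dictionaries (missing entries are the error entries) and replays the trace; it returns the reduction sequence, i.e. the reverse rightmost derivation.

```python
# production number -> (LHS, length of RHS)
PROD = {1: ("S", 1), 2: ("E", 3), 3: ("E", 1),
        4: ("T", 3), 5: ("T", 1), 6: ("F", 1)}
ACTION = {
    0: {"id": "s4"},
    1: {"$": "acc"},
    2: {"+": "s5", "$": "r3"},
    3: {"+": "r5", "*": "s6", "$": "r5"},
    4: {"+": "r6", "*": "r6", "$": "r6"},
    5: {"id": "s4"},
    6: {"id": "s4"},
    7: {"$": "r2"},
    8: {"+": "r4", "$": "r4"},
}
GOTO = {0: {"E": 1, "T": 2, "F": 3},
        5: {"E": 7, "T": 2, "F": 3},
        6: {"T": 8, "F": 3}}

def lr_parse(tokens):
    """Skeleton LR driver: return the reduction sequence, or None on error."""
    tokens = tokens + ["$"]
    stack = [0]          # stack of states; state 0 is the initial state
    i = 0
    reductions = []
    while True:
        act = ACTION[stack[-1]].get(tokens[i])
        if act is None:
            return None                        # error entry
        if act == "acc":
            return reductions
        if act[0] == "s":                      # shift: push state, advance
            stack.append(int(act[1:]))
            i += 1
        else:                                  # reduce A -> beta
            n = int(act[1:])
            lhs, rhs_len = PROD[n]
            del stack[len(stack) - rhs_len:]   # pop |beta| states
            stack.append(GOTO[stack[-1]][lhs]) # push goto[s', A]
            reductions.append(n)
```

Running `lr_parse(["id", "*", "id", "+", "id"])` reproduces the table trace: reductions 6, 6, 5, 4, 6, 5, 3, 2, then accept.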
LR(k) grammars

Informally, we say that a grammar G is LR(k) if, given a rightmost
derivation

  S = γ0 ⇒ γ1 ⇒ γ2 ⇒ · · · ⇒ γn = w,

we can, for each right-sentential form in the derivation:
  1  isolate the handle of each right-sentential form, and
  2  determine the production by which to reduce
by scanning γi from left to right, going at most k symbols beyond the
right end of the handle of γi.

LR(k) grammars

Formally, a grammar G is LR(k) iff.:
  1  S ⇒∗rm αAw ⇒rm αβw, and
  2  S ⇒∗rm γBx ⇒rm αβy, and
  3  FIRSTk(w) = FIRSTk(y)
  ⇒ αAy = γBx

i.e., assume sentential forms αβw and αβy, with common prefix αβ
and common k-symbol lookahead FIRSTk(y) = FIRSTk(w), such that
αβw reduces to αAw and αβy reduces to γBx.
But, the common prefix means αβy also reduces to αAy, for the same
result. Thus αAy = γBx.

Why study LR grammars?

LR(1) grammars are often used to construct parsers.
We call these parsers LR(1) parsers.
  virtually all context-free programming language constructs can be
  expressed in an LR(1) form
  LR grammars are the most general grammars parsable by a
  deterministic, bottom-up parser
  efficient parsers can be implemented for LR(1) grammars
  LR parsers detect an error as soon as possible in a left-to-right
  scan of the input
  LR grammars describe a proper superset of the languages
  recognized by predictive (i.e., LL) parsers

LL(k): recognize use of a production A → β seeing the first k
symbols derived from β
LR(k): recognize the handle β after seeing everything derived from β
plus k lookahead symbols

LR parsing

Three common algorithms to build tables for an “LR” parser:
  1  SLR(1)
     smallest class of grammars
     smallest tables (number of states)
     simple, fast construction
  2  LR(1)
     full set of LR(1) grammars
     largest tables (number of states)
     slow, large construction
  3  LALR(1)
     intermediate sized set of grammars
     same number of states as SLR(1)
     canonical construction is slow and large
     better construction techniques exist
SLR vs. LR/LALR

An LR(1) parser for either Algol or Pascal has several thousand
states, while an SLR(1) or LALR(1) parser for the same language may
have several hundred states.

LR(k) items

The table construction algorithms use sets of LR(k) items or
configurations to represent the possible states in a parse.
An LR(k) item is a pair [α, β], where
  α is a production from G with a • at some position in the RHS,
  marking how much of the RHS of a production has already been
  seen
  β is a lookahead string containing k symbols (terminals or $)
Two cases of interest are k = 0 and k = 1:
  LR(0) items play a key role in the SLR(1) table construction
  algorithm.
  LR(1) items play a key role in the LR(1) and LALR(1) table
  construction algorithms.

Example

The • indicates how much of an item we have seen at a given state in
the parse:
  [A → •XYZ] indicates that the parser is looking for a string that can
  be derived from XYZ
  [A → XY • Z] indicates that the parser has seen a string derived
  from XY and is looking for one derivable from Z
LR(0) items: (no lookahead)
A → XYZ generates 4 LR(0) items:
  1  [A → •XYZ]
  2  [A → X • YZ]
  3  [A → XY • Z]
  4  [A → XYZ•]

The characteristic finite state machine (CFSM)

The CFSM for a grammar is a DFA which recognizes viable prefixes of
right-sentential forms:
  A viable prefix is any prefix that does not extend beyond the
  handle.
  It accepts when a handle has been discovered and needs to be
  reduced.
To construct the CFSM we need two functions:
  CLOSURE(I) to build its states
  GOTO(I, X) to determine its transitions
CLOSURE

Given an item [A → α • Bβ], its closure contains the item and any
other items that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on its stack, the input should
reduce to Bβ (or γ for some other item [B → •γ] in the closure).

  function CLOSURE(I)
    repeat
      if [A → α • Bβ] ∈ I
        add [B → •γ] to I
    until no more items can be added to I
    return I

GOTO

Let I be a set of LR(0) items and X be a grammar symbol.
Then, GOTO(I, X) is the closure of the set of all items [A → αX • β]
such that [A → α • Xβ] ∈ I.
If I is the set of valid items for some viable prefix γ, then GOTO(I, X)
is the set of valid items for the viable prefix γX.
GOTO(I, X) represents the state after recognizing X in state I.

  function GOTO(I, X)
    let J be the set of items [A → αX • β]
      such that [A → α • Xβ] ∈ I
    return CLOSURE(J)
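Both functions are small enough to sketch concretely. Here an item is a triple (lhs, rhs, dot-position), and the grammar is the one used in the LR(0) example that follows (S → E$, E → E+T | T, T → id | (E)); the tuple encoding is an illustrative choice.

```python
GRAMMAR = {
    "S": [("E", "$")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",), ("(", "E", ")")],
}

def closure(items):
    """CLOSURE(I): for each [A -> alpha . B beta] in I, add [B -> . gamma]."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:  # dot before a non-terminal
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)          # [B -> . gamma]
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, x):
    """GOTO(I, X): advance the dot over X, then take the closure."""
    moved = {(lhs, rhs, dot + 1)
             for (lhs, rhs, dot) in items
             if dot < len(rhs) and rhs[dot] == x}
    return closure(moved)

I0 = closure({("S", ("E", "$"), 0)})
# I0 has 5 items: S -> .E$, E -> .E+T, E -> .T, T -> .id, T -> .(E)
```

`goto(I0, "E")` then yields the two items S → E•$ and E → E•+T, i.e. state I1 on the next slide.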

Building the LR(0) item sets

We start the construction with the item [S′ → •S$], where
    S′ is the start symbol of the augmented grammar G′
    S is the start symbol of G
    $ represents EOF

To compute the collection of sets of LR(0) items:

function items(G′)
    s0 ← CLOSURE({[S′ → •S$]})
    C ← {s0}
    repeat
        for each set of items s ∈ C
            for each grammar symbol X
                if GOTO(s, X) ≠ ∅ and GOTO(s, X) ∉ C
                    add GOTO(s, X) to C
    until no more item sets can be added to C
    return C

LR(0) example

1 S → E$        4 T → id
2 E → E + T     5   | (E)
3   | T

Item sets:
I0 : S → •E$, E → •E + T, E → •T, T → •id, T → •(E)
I1 : S → E • $, E → E • +T
I2 : S → E$•
I3 : E → E + •T, T → •id, T → •(E)
I4 : E → E + T•
I5 : T → id•
I6 : T → (•E), E → •E + T, E → •T, T → •id, T → •(E)
I7 : T → (E•), E → E • +T
I8 : T → (E)•
I9 : E → T•

(The corresponding CFSM transition diagram over states 0–9 is not reproduced here.)
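Run on this grammar, the fixed-point construction above reproduces exactly these ten item sets. A self-contained sketch (encoding and names are my own):

```python
GRAMMAR = {
    "S": [("E", "$")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",), ("(", "E", ")")],
}
# Every symbol that can label a GOTO transition.
SYMBOLS = {sym for rhss in GRAMMAR.values() for rhs in rhss for sym in rhs}

def closure(items):
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for gamma in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)

def goto(items, x):
    return closure({(l, r, d + 1) for (l, r, d) in items
                    if d < len(r) and r[d] == x})

def item_sets():
    """Canonical collection: start from CLOSURE({[S -> .E$]}) and apply
    GOTO until no new sets appear."""
    collection = {closure({("S", ("E", "$"), 0)})}
    changed = True
    while changed:
        changed = False
        for s in list(collection):
            for x in SYMBOLS:
                t = goto(s, x)
                if t and t not in collection:
                    collection.add(t)
                    changed = True
    return collection

print(len(item_sets()))   # 10 states, matching I0-I9 above
```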

Constructing the LR(0) parsing table

1 construct the collection of sets of LR(0) items for G′
2 state i of the CFSM is constructed from Ii
    1 [A → α • aβ] ∈ Ii and GOTO(Ii, a) = Ij
      ⇒ ACTION[i, a] ← “shift j”
    2 [A → α•] ∈ Ii, A ≠ S′
      ⇒ ACTION[i, a] ← “reduce A → α”, ∀a
    3 [S′ → S$•] ∈ Ii
      ⇒ ACTION[i, a] ← “accept”, ∀a
3 GOTO(Ii, A) = Ij
  ⇒ GOTO[i, A] ← j
4 set undefined entries in ACTION and GOTO to “error”
5 initial state of parser s0 is CLOSURE({[S′ → •S$]})

LR(0) example

state |          ACTION           |  GOTO
      | id   (    )    +    $     | S  E  T
  0   | s5   s6   –    –    –     | –  1  9
  1   | –    –    –    s3   s2    | –  –  –
  2   | acc  acc  acc  acc  acc   | –  –  –
  3   | s5   s6   –    –    –     | –  –  4
  4   | r2   r2   r2   r2   r2    | –  –  –
  5   | r4   r4   r4   r4   r4    | –  –  –
  6   | s5   s6   –    –    –     | –  7  9
  7   | –    –    s8   s3   –     | –  –  –
  8   | r5   r5   r5   r5   r5    | –  –  –
  9   | r3   r3   r3   r3   r3    | –  –  –
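The table-filling step can also be sketched in code: read each state's actions off its items, and note that a grammar is LR(0) precisely when no state proposes conflicting actions. A self-contained sketch with my own encoding (the grammar and item machinery repeat the earlier fragments):

```python
GRAMMAR = {
    "S": [("E", "$")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",), ("(", "E", ")")],
}
TERMINALS = {"id", "(", ")", "+", "$"}
SYMBOLS = TERMINALS | set(GRAMMAR)

def closure(items):
    result, work = set(items), list(items)
    while work:
        _lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for gamma in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)

def goto(items, x):
    return closure({(l, r, d + 1) for (l, r, d) in items
                    if d < len(r) and r[d] == x})

def item_sets():
    collection = {closure({("S", ("E", "$"), 0)})}
    changed = True
    while changed:
        changed = False
        for s in list(collection):
            for x in SYMBOLS:
                t = goto(s, x)
                if t and t not in collection:
                    collection.add(t)
                    changed = True
    return collection

def is_lr0(collection):
    """No state may mix a reduce item with a shift item, or hold two reduces."""
    for state in collection:
        shifts = sum(1 for (_l, r, d) in state if d < len(r) and r[d] in TERMINALS)
        reduces = sum(1 for (_l, r, d) in state if d == len(r))
        if reduces > 1 or (reduces == 1 and shifts > 0):
            return False
    return True

print(is_lr0(item_sets()))   # True: the ACTION table above has no multiply-defined entries
```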


Conflicts in the ACTION table

If the LR(0) parsing table contains any multiply-defined ACTION
entries then G is not LR(0).
Two conflicts arise:
    shift-reduce: both shift and reduce possible in the same item set
    reduce-reduce: more than one distinct reduce action possible in the same item set
Conflicts can be resolved through lookahead in ACTION. Consider:
    A → ε | aα
    ⇒ shift-reduce conflict
Also, a := b + c ∗ d
requires lookahead to avoid a shift-reduce conflict after shifting c
(we need to see ∗ to give it precedence over +).

SLR(1): simple lookahead LR

Add lookaheads after building the LR(0) item sets.
Constructing the SLR(1) parsing table:
1 construct the collection of sets of LR(0) items for G′
2 state i of the CFSM is constructed from Ii
    1 [A → α • aβ] ∈ Ii and GOTO(Ii, a) = Ij
      ⇒ ACTION[i, a] ← “shift j”, ∀a ≠ $
    2 [A → α•] ∈ Ii, A ≠ S′
      ⇒ ACTION[i, a] ← “reduce A → α”, ∀a ∈ FOLLOW(A)
    3 [S′ → S • $] ∈ Ii
      ⇒ ACTION[i, $] ← “accept”
3 GOTO(Ii, A) = Ij
  ⇒ GOTO[i, A] ← j
4 set undefined entries in ACTION and GOTO to “error”
5 initial state of parser s0 is CLOSURE({[S′ → •S$]})
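Step 2.2 needs FOLLOW sets. A small fixed-point sketch for the running grammar (the encoding is my own; the shortcut of looking only at a symbol's own FIRST is valid here because this grammar has no ε-productions):

```python
GRAMMAR = {
    "S": [("E", "$")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",), ("(", "E", ")")],
}

def first(symbol, _seen=None):
    """FIRST of one grammar symbol (no epsilon-productions assumed)."""
    if symbol not in GRAMMAR:                 # terminal
        return {symbol}
    seen = _seen if _seen is not None else set()
    if symbol in seen:                        # guard against left recursion
        return set()
    seen.add(symbol)
    out = set()
    for rhs in GRAMMAR[symbol]:
        out |= first(rhs[0], seen)
    return out

def follow_sets():
    follow = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for lhs, rhss in GRAMMAR.items():
            for rhs in rhss:
                for i, sym in enumerate(rhs):
                    if sym not in GRAMMAR:
                        continue
                    # A -> ... sym t ...  adds FIRST(t); A -> ... sym adds FOLLOW(A)
                    new = first(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

# follow_sets()["E"] == follow_sets()["T"] == {"$", "+", ")"},
# matching the FOLLOW sets quoted on the next slide.
```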

From previous example

1 S → E$        FOLLOW(E) = FOLLOW(T) = {$, +, )}
2 E → E + T
3   | T
4 T → id
5   | (E)

state |          ACTION           |  GOTO
      | id   (    )    +    $     | S  E  T
  0   | s5   s6   –    –    –     | –  1  9
  1   | –    –    –    s3   acc   | –  –  –
  2   | –    –    –    –    –     | –  –  –
  3   | s5   s6   –    –    –     | –  –  4
  4   | –    –    r2   r2   r2    | –  –  –
  5   | –    –    r4   r4   r4    | –  –  –
  6   | s5   s6   –    –    –     | –  7  9
  7   | –    –    s8   s3   –     | –  –  –
  8   | –    –    r5   r5   r5    | –  –  –
  9   | –    –    r3   r3   r3    | –  –  –

Example: A grammar that is not LR(0)

1 S → E$          FOLLOW(E) = {+, ), $}
2 E → E + T       FOLLOW(T) = {+, ∗, ), $}
3   | T           FOLLOW(F) = {+, ∗, ), $}
4 T → T ∗ F
5   | F
6 F → id
7   | (E)

Its LR(0) item sets:
I0 : S → •E$, E → •E + T, E → •T, T → •T ∗ F, T → •F, F → •id, F → •(E)
I1 : S → E • $, E → E • +T
I2 : S → E$•
I3 : E → E + •T, T → •T ∗ F, T → •F, F → •id, F → •(E)
I4 : T → F•
I5 : F → id•
I6 : F → (•E), E → •E + T, E → •T, T → •T ∗ F, T → •F, F → •id, F → •(E)
I7 : E → T•, T → T • ∗F
I8 : T → T ∗ •F, F → •id, F → •(E)
I9 : T → T ∗ F•
I10 : F → (E)•
I11 : E → E + T•, T → T • ∗F
I12 : F → (E•), E → E • +T


Example: But it is SLR(1)

state |            ACTION             |    GOTO
      | +    ∗    id   (    )    $    | S  E   T   F
  0   | –    –    s5   s6   –    –    | –  1   7   4
  1   | s3   –    –    –    –    acc  | –  –   –   –
  2   | –    –    –    –    –    –    | –  –   –   –
  3   | –    –    s5   s6   –    –    | –  –   11  4
  4   | r5   r5   –    –    r5   r5   | –  –   –   –
  5   | r6   r6   –    –    r6   r6   | –  –   –   –
  6   | –    –    s5   s6   –    –    | –  12  7   4
  7   | r3   s8   –    –    r3   r3   | –  –   –   –
  8   | –    –    s5   s6   –    –    | –  –   –   9
  9   | r4   r4   –    –    r4   r4   | –  –   –   –
 10   | r7   r7   –    –    r7   r7   | –  –   –   –
 11   | r2   s8   –    –    r2   r2   | –  –   –   –
 12   | s3   –    –    –    s10  –    | –  –   –   –

Example: A grammar that is not SLR(1)

Consider:
S → L = R
  | R
L → ∗R
  | id
R → L

Its LR(0) item sets:
I0 : S′ → •S$, S → •L = R, S → •R, L → • ∗ R, L → •id, R → •L
I1 : S′ → S • $
I2 : S → L• = R, R → L•
I3 : S → R•
I4 : L → id•
I5 : L → ∗ • R, R → •L, L → • ∗ R, L → •id
I6 : S → L = •R, R → •L, L → • ∗ R, L → •id
I7 : L → ∗R•
I8 : R → L•
I9 : S → L = R•

Now consider I2 : = ∈ FOLLOW(R) (S ⇒ L = R ⇒ ∗R = R),
so on lookahead = the SLR(1) table has both “shift” and “reduce R → L”.

LR(1) items

Recall: An LR(k) item is a pair [α, β], where
    α is a production from G with a • at some position in the RHS,
    marking how much of the RHS of the production has been seen
    β is a lookahead string containing k symbols (terminals or $)
What about LR(1) items?
    All the lookahead strings are constrained to have length 1
    They look something like [A → X • YZ, a]

What is the point of the lookahead symbols?
    carry them along to choose the correct reduction when there is a choice
    lookaheads are pure bookkeeping, unless the item has • at the right end:
        in [A → X • YZ, a], a has no direct use
        in [A → XYZ•, a], a is useful
    this allows the use of grammars that are not uniquely invertible†
The point: For [A → α•, a] and [B → α•, b], we can decide between
reducing to A or B by looking at limited right context.

† No two productions have the same RHS.


closure1(I)

Given an item [A → α • Bβ, a], its closure contains the item and any
other items that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on its stack, the input should
reduce to Bβ (or γ for some other item [B → •γ, b] in the closure).

function closure1(I)
    repeat
        if [A → α • Bβ, a] ∈ I
            add [B → •γ, b] to I, where b ∈ FIRST(βa)
    until no more items can be added to I
    return I

goto1(I)

Let I be a set of LR(1) items and X be a grammar symbol.
Then, GOTO(I, X) is the closure of the set of all items
[A → αX • β, a] such that [A → α • Xβ, a] ∈ I.
If I is the set of valid items for some viable prefix γ, then GOTO(I, X) is
the set of valid items for the viable prefix γX.
goto1(I, X) represents the state after recognizing X in state I.

function goto1(I, X)
    let J be the set of items [A → αX • β, a]
        such that [A → α • Xβ, a] ∈ I
    return closure1(J)
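The b ∈ FIRST(βa) step is the only difference from the LR(0) closure. A sketch for the S → L = R grammar of the earlier example (the item encoding (lhs, rhs, dot, lookahead) is my own; FIRST is simplified because this grammar has no ε-productions):

```python
GRAMMAR = {
    "S'": [("S",)],
    "S": [("L", "=", "R"), ("R",)],
    "L": [("*", "R"), ("id",)],
    "R": [("L",)],
}

def first_of(seq):
    """FIRST of a symbol string; with no epsilon-productions only the
    first symbol of the string matters."""
    work, seen, out = [seq[0]], set(), set()
    while work:
        s = work.pop()
        if s not in GRAMMAR:
            out.add(s)                      # terminal
        elif s not in seen:
            seen.add(s)
            work.extend(rhs[0] for rhs in GRAMMAR[s])
    return out

def closure1(items):
    """LR(1) closure: for [A -> a.Bb, x] add [B -> .g, w] for w in FIRST(b x)."""
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot, look = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for w in first_of(rhs[dot + 1:] + (look,)):
                for gamma in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], gamma, 0, w)
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)

i0 = closure1({("S'", ("S",), 0, "$")})
print(len(i0))   # 8 items: L-items appear with both lookaheads = and $
```

Note how [S → •L = R, $] spawns [L → •id, =] (the symbol after L is =), while [R → •L, $] spawns [L → •id, $]; this is exactly the split that SLR(1), which lumps everything into FOLLOW(L), cannot express.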

Building the LR(1) item sets for grammar G

We start the construction with the item [S′ → •S, $], where
    S′ is the start symbol of the augmented grammar G′
    S is the start symbol of G
    $ represents EOF

To compute the collection of sets of LR(1) items:

function items(G′)
    s0 ← closure1({[S′ → •S, $]})
    C ← {s0}
    repeat
        for each set of items s ∈ C
            for each grammar symbol X
                if goto1(s, X) ≠ ∅ and goto1(s, X) ∉ C
                    add goto1(s, X) to C
    until no more item sets can be added to C
    return C

Constructing the LR(1) parsing table

Build the lookahead into the DFA to begin with.
1 construct the collection of sets of LR(1) items for G′
2 state i of the LR(1) machine is constructed from Ii
    1 [A → α • aβ, b] ∈ Ii and goto1(Ii, a) = Ij
      ⇒ ACTION[i, a] ← “shift j”
    2 [A → α•, a] ∈ Ii, A ≠ S′
      ⇒ ACTION[i, a] ← “reduce A → α”
    3 [S′ → S•, $] ∈ Ii
      ⇒ ACTION[i, $] ← “accept”
3 goto1(Ii, A) = Ij
  ⇒ GOTO[i, A] ← j
4 set undefined entries in ACTION and GOTO to “error”
5 initial state of parser s0 is closure1({[S′ → •S, $]})


Back to previous example (∉ SLR(1))

S → L = R
  | R
L → ∗R
  | id
R → L

Its LR(1) item sets:
I0 : S′ → •S, $
     S → •L = R, $      S → •R, $
     L → • ∗ R, =       L → •id, =
     R → •L, $
     L → • ∗ R, $       L → •id, $
I1 : S′ → S•, $
I2 : S → L• = R, $
     R → L•, $
I3 : S → R•, $
I4 : L → ∗ • R, = $
     R → •L, = $
     L → • ∗ R, = $     L → •id, = $
I5 : L → id•, = $
I6 : S → L = •R, $
     R → •L, $
     L → • ∗ R, $       L → •id, $
I7 : L → ∗R•, = $
I8 : R → L•, = $
I9 : S → L = R•, $
I10 : R → L•, $
I11 : L → ∗ • R, $
      R → •L, $
      L → • ∗ R, $      L → •id, $
I12 : L → id•, $
I13 : L → ∗R•, $

I2 no longer has a shift-reduce conflict: reduce on $, shift on =.

Example: back to SLR(1) expression grammar

In general, LR(1) has many more states than LR(0)/SLR(1):

1 S → E          4 T → T ∗ F
2 E → E + T      5   | F
3   | T          6 F → id
                 7   | (E)

LR(1) item sets:
I0 :                    I0′ (after shifting ( ):    I0″ (after shifting ( again):
S → •E, $               F → (•E), ∗ + $             F → (•E), ∗ + )
E → •E + T, + $         E → •E + T, + )             E → •E + T, + )
E → •T, + $             E → •T, + )                 E → •T, + )
T → •T ∗ F, ∗ + $       T → •T ∗ F, ∗ + )           T → •T ∗ F, ∗ + )
T → •F, ∗ + $           T → •F, ∗ + )               T → •F, ∗ + )
F → •id, ∗ + $          F → •id, ∗ + )              F → •id, ∗ + )
F → •(E), ∗ + $         F → •(E), ∗ + )             F → •(E), ∗ + )

I0′ and I0″ differ only in lookaheads, yet LR(1) keeps them as distinct states.
Another example

Consider:
0 S′ → S
1 S → CC
2 C → cC
3   | d

LR(1) item sets:
I0 : S′ → •S, $
     S → •CC, $
     C → •cC, c d      C → •d, c d
I1 : S′ → S•, $
I2 : S → C • C, $
     C → •cC, $        C → •d, $
I3 : C → c • C, c d
     C → •cC, c d      C → •d, c d
I4 : C → d•, c d
I5 : S → CC•, $
I6 : C → c • C, $
     C → •cC, $        C → •d, $
I7 : C → d•, $
I8 : C → cC•, c d
I9 : C → cC•, $

state |    ACTION     | GOTO
      | c    d    $   | S  C
  0   | s3   s4   –   | 1  2
  1   | –    –    acc | –  –
  2   | s6   s7   –   | –  5
  3   | s3   s4   –   | –  8
  4   | r3   r3   –   | –  –
  5   | –    –    r1  | –  –
  6   | s6   s7   –   | –  9
  7   | –    –    r3  | –  –
  8   | r2   r2   –   | –  –
  9   | –    –    r2  | –  –

LALR(1) parsing

Define the core of a set of LR(1) items to be the set of LR(0) items
derived by ignoring the lookahead symbols.
Thus, the two sets
    {[A → α • β, a], [A → α • β, b]}, and
    {[A → α • β, c], [A → α • β, d]}
have the same core.

Key idea:
    If two sets of LR(1) items, Ii and Ij, have the same core, we
    can merge the states that represent them in the ACTION and
    GOTO tables.

LALR(1) table construction

To construct LALR(1) parsing tables, we can insert a single step into
the LR(1) algorithm:
    (1.5) For each core present among the set of LR(1) items, find
    all sets having that core and replace these sets by their union.
    The goto function must be updated to reflect the replacement sets.

The resulting algorithm has large space requirements, as we are still
required to build the full set of LR(1) items.

The revised (and renumbered) algorithm:
1 construct the collection of sets of LR(1) items for G′
2 for each core present among the set of LR(1) items, find all sets
  having that core and replace these sets by their union (update the
  goto1 function incrementally)
3 state i of the LALR(1) machine is constructed from Ii
    1 [A → α • aβ, b] ∈ Ii and goto1(Ii, a) = Ij
      ⇒ ACTION[i, a] ← “shift j”
    2 [A → α•, a] ∈ Ii, A ≠ S′
      ⇒ ACTION[i, a] ← “reduce A → α”
    3 [S′ → S•, $] ∈ Ii ⇒ ACTION[i, $] ← “accept”
4 goto1(Ii, A) = Ij ⇒ GOTO[i, A] ← j
5 set undefined entries in ACTION and GOTO to “error”
6 initial state of parser s0 is closure1({[S′ → •S, $]})

Example

Reconsider:
0 S′ → S
1 S → CC
2 C → cC
3   | d

The LR(1) item sets are as on the previous slide; I3 and I6 share a
core, as do I4 and I7, and I8 and I9.

Merged states:
I36 : C → c • C, c d $
      C → •cC, c d $     C → •d, c d $
I47 : C → d•, c d $
I89 : C → cC•, c d $

state |      ACTION      | GOTO
      | c     d     $    | S  C
  0   | s36   s47   –    | 1  2
  1   | –     –     acc  | –  –
  2   | s36   s47   –    | –  5
 36   | s36   s47   –    | –  8
 47   | r3    r3    r3   | –  –
  5   | –     –     r1   | –  –
 89   | r2    r2    r2   | –  –

More efficient LALR(1) construction

Observe that we can:
    represent Ii by its basis or kernel:
        items that are either [S′ → •S, $]
        or do not have • at the left of the RHS
    compute shift, reduce and goto actions for the state derived from Ii
    directly from its kernel

This leads to a method that avoids building the complete canonical
collection of sets of LR(1) items.

Self reading: Section 4.7.5, Dragon book.
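The core test and the merge can be sketched directly (the item encoding (lhs, rhs, dot, lookahead) is my own); here I4 = {[C → d•, c/d]} and I7 = {[C → d•, $]} from the example above collapse into one state:

```python
from collections import defaultdict

def core(state):
    """Drop the lookahead component of every item."""
    return frozenset((l, r, d) for (l, r, d, _a) in state)

def merge_by_core(states):
    """Union the lookaheads of all states sharing a core (step 2 above)."""
    groups = defaultdict(set)
    for s in states:
        groups[core(s)] |= set(s)
    return [frozenset(g) for g in groups.values()]

i4 = frozenset({("C", ("d",), 1, "c"), ("C", ("d",), 1, "d")})
i7 = frozenset({("C", ("d",), 1, "$")})
merged = merge_by_core([i4, i7])
print(len(merged))   # 1: I4 and I7 become the single state I47
```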


The role of precedence

Precedence and associativity can be used to resolve shift/reduce
conflicts in ambiguous grammars:
    lookahead with higher precedence ⇒ shift
    same precedence, left associative ⇒ reduce
Advantages:
    more concise, albeit ambiguous, grammars
    shallower parse trees ⇒ fewer reductions
Classic application: expression grammars

With precedence and associativity, we can use:

E → E ∗ E
  | E / E
  | E + E
  | E − E
  | (E)
  | -E
  | id
  | num

This eliminates useless reductions (single productions).
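The two resolution rules are what yacc-style %left/%right declarations implement. A minimal sketch of the decision, comparing the operator on top of the stack with the lookahead (the table and names are illustrative, not any particular generator's API):

```python
# (precedence level, associativity) per operator; higher level binds tighter.
PREC = {"+": (1, "left"), "-": (1, "left"), "*": (2, "left"), "/": (2, "left")}

def resolve(stack_op, lookahead):
    stack_prec, _ = PREC[stack_op]
    look_prec, assoc = PREC[lookahead]
    if look_prec > stack_prec:
        return "shift"        # lookahead has higher precedence
    if look_prec < stack_prec:
        return "reduce"
    return "reduce" if assoc == "left" else "shift"   # same precedence

print(resolve("+", "*"))      # shift: * binds tighter than +
print(resolve("+", "+"))      # reduce: + is left associative
```

This is exactly the lookahead needed in the a := b + c ∗ d example earlier: after shifting c, seeing ∗ forces a shift rather than reducing b + c.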

Error recovery in shift-reduce parsers

The problem:
    encounter an invalid token
    bad pieces of tree hanging from the stack
    incorrect entries in the symbol table
We want to parse the rest of the file.
Restarting the parser:
    find a restartable state on the stack
    move to a consistent place in the input
    print an informative message to stderr (line number)

Error recovery in yacc/bison/Java CUP

The error mechanism:
    designated token error
    valid in any production
    error shows synchronization points
When an error is discovered, the parser:
    pops the stack until error is legal
    skips input tokens until it successfully shifts 3 (some default value) tokens
    error productions can have actions
This mechanism is fairly general.

Read the section on Error Recovery of the on-line CUP manual.


Example

Using error:

stmt_list : stmt
          | stmt_list ; stmt

can be augmented with error:

stmt_list : stmt
          | error
          | stmt_list ; stmt

This should:
    throw out the erroneous statement
    synchronize at “;” or “end”
    invoke yyerror("syntax error")
Other “natural” places for errors:
    all the “lists”: FieldList, CaseList
    missing parentheses or brackets (yychar)
    extra operator or missing operator

Left versus right recursion

Right recursion:
    needed for termination in predictive parsers
    requires more stack space
    right associative operators
Left recursion:
    works fine in bottom-up parsers
    limits required stack space
    left associative operators
Rule of thumb:
    right recursion for top-down parsers
    left recursion for bottom-up parsers

Left recursive grammar:
E → E + T | T
T → T ∗ F | F
F → (E) | Int

After left recursion removal:
E  → T E′
E′ → + T E′ | ε
T  → F T′
T′ → ∗ F T′ | ε
F  → (E) | Int

Parse the string 3 + 4 + 5.
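A sketch of parsing 3 + 4 + 5 with the transformed grammar, simplified to F → Int only (no ∗ or parentheses, my own simplification): although E′ is right recursive, threading the accumulated value through E′ keeps + left associative.

```python
# Recursive descent over E -> T E', E' -> + T E' | epsilon, with
# tokens pre-split into a list.
def parse(tokens):
    val, rest = parse_T(tokens)
    return parse_Eprime(val, rest)

def parse_T(tokens):
    return int(tokens[0]), tokens[1:]          # T -> Int (simplified)

def parse_Eprime(acc, tokens):
    if tokens and tokens[0] == "+":            # E' -> + T E'
        val, rest = parse_T(tokens[1:])
        return parse_Eprime(acc + val, rest)   # fold left: ((3 + 4) + 5)
    return acc                                 # E' -> epsilon

print(parse(["3", "+", "4", "+", "5"]))        # 12
```

The accumulator `acc` plays the role of the semantic value that a bottom-up parser would build naturally through the left recursive E → E + T.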

Parsing review

Recursive descent
    A hand-coded recursive descent parser directly encodes a
    grammar (typically an LL(1) grammar) into a series of mutually
    recursive procedures. It has most of the linguistic limitations of LL(1).
LL(k)
    An LL(k) parser must be able to recognize the use of a production
    after seeing only the first k symbols of its right hand side.
LR(k)
    An LR(k) parser must be able to recognize the occurrence of the
    right hand side of a production after having seen all that is derived
    from that right hand side with k symbols of lookahead.

Grammar hierarchy

LR(k) > LR(1) > LALR(1) > SLR(1) > LR(0)
LL(k) > LL(1) > LL(0)
LR(0) > LL(0)
LR(1) > LL(1)
LR(k) > LL(k)

