
Compiler Design

Chapter Three: Parsers (Syntactic Analyzers), Part Two

The objectives of this chapter are the following:


• Describe the basics of Parsing techniques (Top-Down Vs. Bottom-Up parsing).

• Describe recursive descent parsers and the LL(1) parser: explain First and Follow sets, how to find the First and Follow sets of a grammar, and how to construct LL(1) parse tables.

• Describe grammars that are not LL(1): removing ambiguity, removing left recursion, and removing nondeterminism (left factoring).

• Be familiar with LL(k) grammars.

• Describe bottom-up parsers: explain shift-reduce parsing as well as the LR parsers.

• Construct the LR parsers (LR (0), SLR (1), CLR (1) and LALR (1)): valid item and item closure,
transition diagrams, Action and GOTO tables.

Parsing and Parsers


➢ Once we have described the syntax of our programming language using a context-free grammar, the next step is to
determine if a string of tokens returned from the lexical analyzer could be derived from that context-free grammar.
➢ Determining if a sequence of tokens is syntactically correct is called parsing.

➢ Two main strategies:


o Top-down parsing: Start at the root of the parse tree and grow toward leaves (modern parsers: e.g.
JavaCC)
▪ Pick a production rule & try to match the input
▪ Bad “pick” → may need to backtrack
▪ Two types: recursive descent parsing and non-recursive (table-driven) LL(1) parsing.
o Bottom-up parsing: Start at the leaves and grow toward root (earlier parsers: e.g. yacc)
▪ As input is consumed, encode possible parse trees in an internal state
▪ Bottom-up parsers handle a large class of grammars.
▪ Also known as shift-reduce parsing.
▪ Two types: operator precedence parsers and LR parsers.

Recursive Descent Parsers


❖ Top-down parsers are usually implemented as a mutually recursive suite of functions that descend through a parse tree for the string, and as such are called “recursive descent parsers” (RDP).

❖ These parsers use a procedure for each nonterminal. The procedure looks at its input and decides which
production to apply for its nonterminal.

❖ Terminals in the body of the production are matched to the input at the appropriate time, while
nonterminals in the body result in calls to their procedure.

❖ Backtracking, in the case when the wrong production was chosen, is a possibility.

Example 1: A CFG and a CFG in its RDP form

A. A CFG not in its RDP form
Terminals = { id, num, while, print, >, {, }, ;, (, ) }
Non-Terminals = { S, E, B, L }
Rules = (1) S → print(E);
        (2) S → while (B) S
        (3) S → { L }
        (4) E → id
        (5) E → num
        (6) B → E > E
        (7) L → SL | ε
Start Symbol = S

B. A CFG in its RDP form
Terminals = { e, f, g, h, i }
Non-Terminals = { S', S, A, B, C, D }
Rules = (1) S → AB
        (2) S → Cf
        (3) A → ef
        (4) A → ε
        (5) B → hg
        (6) C → DD
        (7) C → fi
        (8) D → g
Start Symbol = S
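
A minimal sketch, in Python, of the kind of mutually recursive procedures a recursive descent parser for grammar B above would use. It assumes one token of lookahead to choose among alternatives; the helper names (peek, match, parse_S, ...) and the "$" end-of-input marker are illustrative choices, not part of the notes.

class ParseError(Exception):
    pass

def parser(tokens):
    """Parse a list of single-character terminals ending with the marker "$"."""
    pos = 0

    def peek():
        return tokens[pos]

    def match(t):
        nonlocal pos
        if peek() != t:
            raise ParseError(f"expected {t!r}, found {peek()!r}")
        pos += 1

    def parse_S():                      # S -> AB | Cf
        if peek() in ("e", "h"):        # First(AB) = {e, h}
            parse_A(); parse_B()
        elif peek() in ("f", "g"):      # First(Cf) = {f, g}
            parse_C(); match("f")
        else:
            raise ParseError(f"unexpected {peek()!r} while parsing S")

    def parse_A():                      # A -> ef | epsilon
        if peek() == "e":
            match("e"); match("f")
        # otherwise apply A -> epsilon; the lookahead should be in Follow(A) = {h}

    def parse_B():                      # B -> hg
        match("h"); match("g")

    def parse_C():                      # C -> DD | fi
        if peek() == "g":               # First(DD) = {g}
            parse_D(); parse_D()
        else:                           # First(fi) = {f}
            match("f"); match("i")

    def parse_D():                      # D -> g
        match("g")

    parse_S()
    match("$")                          # the whole input must be consumed
    return True

print(parser(list("efhg") + ["$"]))     # True: efhg is derivable from S

Each non-terminal gets one procedure, and the lookahead comparisons anticipate the First and Follow sets developed in the rest of this chapter.
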
LL(k) parsers
❖ LL(k) stands for Left-to-right input scan, Leftmost-derivation, k-symbol lookahead parsers.

❖ A grammar for which it is possible to choose the correct production with which to expand a given nonterminal by looking only at the next input symbol is called LL(1).

❖ These grammars allow us to construct a predictive parsing table that gives, for each
nonterminal and each lookahead symbol, the correct choice of production.

❖ Error correction can be facilitated by placing error routines in some or all of the table entries that have
no legitimate production.

❖ We first examine LL (1) parsers – LL parsers with one symbol lookahead

❖ To build the RDP, we first need to compute the “First” and “Follow” sets of the non-terminals in the CFG.

First Sets and Follow Sets


❖ FIRST and FOLLOW sets allow us to choose which production to apply, based on the next input symbol.

❖ Goal: Given productions A → α | β, the parser should be able to choose between α and β.

❖ How can the next input token help us decide?


Solution: FIRST sets

Informally: FIRST(A) is the set of tokens that could appear as the first symbol in a string derived from A.
Definition: x is in FIRST(A) if A ⇒* xγ for some (possibly empty) string γ of terminals and non-terminals.

First Sets
 The First set of a non-terminal A is the set of all terminals that can begin a string derived from A

 If the empty string ε can be derived from A, then ε is also in the First set of A.

For instance, given the CFG below ($ is an end-of-file marker, ε means empty string):
Terminals = { e, f, g , h, i }
Non-Terminals = {S', S, A, B, C, D }
Rules = (1) S → AB|Cf
(3) A → ef|ε
(5) B → hg
(6) C → DD|fi
(8) D → g
Start Symbol = S

 In this grammar, the set of all strings derivable from the start symbol S is {efhg, hg, fif, ggf}.

 Thus, First(S) = {e, h, f, g}, where e, h, f and g are the first terminals of the strings above, respectively.

 Similarly, we can derive the First sets of the remaining non-terminals, as well as the First sets of some strings of grammar symbols:

First(S) = {e, f, g, h}      First(DD) = {g}
First(A) = {e, ε}            First(AB) = {e, h}
First(B) = {h}               First(efB) = {e}
First(C) = {f, g}            First(AC) = {e, f, g}
First(D) = {g}               First(AA) = {e, ε}

Follow Sets
 For each non-terminal in a grammar, we can also create a Follow set.

 The Follow set for a non-terminal A in a grammar is the set of all terminals that could appear immediately after A in some derivation of a valid sentence.

 Take another look at the CFG shown in Example 1B above, what terminals can follow A in a derivation?

 Consider the derivation S$ → AB$ → Ahg$, since h follows A in this derivation, h is in the Follow
set of A. Note: $ is the end-of-file marker.

 What about the non-terminal D? Consider the partial derivation: S$ → Cf$ → DDf$
→ Dgf$.

 Since both f and g follow a D in this derivation, f and g are in the Follow set of D.

 The Follow sets for all non-terminals in the CFG are shown below:
Follow(S) = {$}
Follow(A) = {h}
Follow(B) = {$}
Follow(C) = {f}
Follow(D) = {f, g}

 To calculate the First set of a non-terminal A, we need to calculate the First set of a string of terminals and non-terminals, since the rules that define what a non-terminal can derive contain terminals and non-terminals.
 The First set of a string α of terminals and non-terminals can be defined recursively using the following two algorithms:

Algorithm to calculate First(α), for a string α of terminals and non-terminals

▪ If α = ε, then First(α) = {ε}
▪ If the first symbol in α is the terminal a, then First(α) = {a}
▪ If α = Aα' for some non-terminal A and (possibly empty) string α' of terminals and non-terminals:
  ▪ If First(A) does not contain ε, then First(α) = First(A)
  ▪ If First(A) does contain ε, then First(α) = (First(A) - {ε}) ∪ First(α')

Algorithm to calculate First sets for all non-terminals in a CFG G

1. For each non-terminal A in G, set First(A) = { }
2. For each rule A → α (where α is a string of terminals and non-terminals), add all elements of First(α) to First(A). That is:
   ❖ If α = ε, add ε to First(A)
   ❖ If the first character in α is the terminal a, then add a to First(A)
   ❖ If α = A1α' for some non-terminal A1, and First(A1) does not contain ε, then add all elements of First(A1) to First(A)
   ❖ If α = A1α' for some non-terminal A1, and First(A1) does contain ε, then add all elements of First(A1) (other than ε) to First(A), and recursively add all elements of First(α') to First(A)
3. If any changes were made in step 2, go back to step 2 and repeat
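
A sketch of the iterative algorithm above in Python; the dictionary-of-tuples grammar representation, the EPS marker and the function names are choices made here for illustration.

EPS = ""   # stands for the empty string epsilon

def first_of_string(alpha, first):
    """First set of a string alpha of terminals and non-terminals."""
    result = set()
    for sym in alpha:
        if sym not in first:              # a terminal begins every string derived from alpha
            result.add(sym)
            return result
        result |= first[sym] - {EPS}      # a non-terminal contributes its First set
        if EPS not in first[sym]:         # this symbol cannot vanish, so stop here
            return result
    result.add(EPS)                       # every symbol of alpha can derive epsilon
    return result

def first_sets(grammar):
    """grammar: dict mapping each non-terminal to a list of bodies (tuples of symbols)."""
    first = {A: set() for A in grammar}
    changed = True
    while changed:                        # step 3: repeat while any First set grows
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:           # step 2: add First(body) to First(A)
                new = first_of_string(body, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

# Grammar B of Example 1; () denotes an epsilon production.
grammar_B = {
    "S": [("A", "B"), ("C", "f")],
    "A": [("e", "f"), ()],
    "B": [("h", "g")],
    "C": [("D", "D"), ("f", "i")],
    "D": [("g",)],
}
first_B = first_sets(grammar_B)
print(first_B)    # e.g. First(S) = {'e', 'f', 'g', 'h'}, First(A) = {'e', ''}

The order in which the rules are visited only affects how many passes the loop makes, not the final sets.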

❖ Consider again the CFG of Example 1B above as our grammar G, and let us find its First sets.

o Initially, for each non-terminal A in G we set First(A) = { }, an empty set
o We then go through one iteration of the algorithm, which modifies the First sets as follows:

1. S → AB. Add { } to First(S) = { } (no change)
2. S → Cf. Add { } to First(S) = { } (no change)
3. A → ef. Add e to First(A) = {e}
4. A → ε. Add ε to First(A) = {e, ε}
5. B → hg. Add h to First(B) = {h}
6. C → DD. Add { } to First(C) = { } (no change)
7. C → fi. Add f to First(C) = {f}
8. D → g. Add g to First(D) = {g}

First sets so far: S = { }; A = {e, ε}; B = {h}; C = {f}; D = {g}

Note: Since there were 5 changes, we need another iteration.

1. S → AB. Add e, h to First(S) = {e, h}
2. S → Cf. Add f to First(S) = {e, h, f}
3. A → ef. Add e to First(A) = {e, ε} (no change)
4. A → ε. Add ε to First(A) = {e, ε} (no change)
5. B → hg. Add h to First(B) = {h} (no change)
6. C → DD. Add g to First(C) = {f, g}
7. C → fi. Add f to First(C) = {f, g} (no change)
8. D → g. Add g to First(D) = {g} (no change)

First sets so far: S = {e, f, h}; A = {e, ε}; B = {h}; C = {f, g}; D = {g}

Note: Since there were 3 changes, we need another iteration.

1. S → AB. Add e, h to First(S) = {e, f, h} (no change)
2. S → Cf. Add f, g to First(S) = {e, f, g, h}
3. A → ef. Add e to First(A) = {e, ε} (no change)
4. A → ε. Add ε to First(A) = {e, ε} (no change)
5. B → hg. Add h to First(B) = {h} (no change)
6. C → DD. Add g to First(C) = {f, g} (no change)
7. C → fi. Add f to First(C) = {f, g} (no change)
8. D → g. Add g to First(D) = {g} (no change)

First sets so far: S = {e, f, g, h}; A = {e, ε}; B = {h}; C = {f, g}; D = {g}

Note: Since there was 1 change, we need another iteration.

1. S → AB. Add e, h to First(S) = {e, f, g, h} (no change)
2. S → Cf. Add f, g to First(S) = {e, f, g, h} (no change)
3. A → ef. Add e to First(A) = {e, ε} (no change)
4. A → ε. Add ε to First(A) = {e, ε} (no change)
5. B → hg. Add h to First(B) = {h} (no change)
6. C → DD. Add g to First(C) = {f, g} (no change)
7. C → fi. Add f to First(C) = {f, g} (no change)
8. D → g. Add g to First(D) = {g} (no change)

First sets so far: S = {e, f, g, h}; A = {e, ε}; B = {h}; C = {f, g}; D = {g}

Note: Since there were no changes, we stop.

Note that within each iteration we can examine the rules in any order.

 If we examine the rules in a different order in each iteration, we will still achieve the same result, but it may take a different number of iterations.
Exercise: check whether examining the rules in the order 8, 7, 6, 5, 4, 3, 2, 1 requires fewer iterations.
Finding Follow Sets for Non-Terminals
 If the grammar contains the rule S → Aa, then a is in the Follow set of A, since a appears immediately after A.
 If the grammar contains the rules:

S →AB
B→a│b
then both a and b are in the Follow set of A, Why? Consider the following two partial derivations:
S → AB → Aa
S → AB → Ab
So, both a and b are in the Follow set of A.

If the grammar contains the rules:

S → ABC
B → a │ b │ ε
C → c │ d

then a, b, c and d are all in the Follow set of A. Why? Consider the following partial derivations:

S → ABC → AaC
S → ABC → AbC
S → ABC → AC → Ac
S → ABC → AC → Ad

Consider the grammar:
S → Ax
A → C
Then x is in the Follow sets of A and C. Why? S → Ax → Cx

Consider another grammar:
S → Ax
A → CD
D → ε
Then x is in the Follow sets of A, C and D. Why? S → Ax → CDx → Cx

 The above examples lead us to the following method for finding the follow sets for the non-terminals in a CFG

 We can calculate the Follow sets from the First sets by using the recursive algorithm given below:
Algorithm to calculate Follow sets for all Non-Terminals in CFG G

1. Place $ in follow(S) where S is the start symbol, and $ is the input right end marker.
2. If there is a production A → αBβ, then everything in First(β) except ε is in Follow(B).
3. If there is a production A → αB, or a production A → αBβ where First(β) contains ε, then everything in Follow(A) is in Follow(B).
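
A sketch of the three rules above in Python, reusing EPS, first_of_string, first_sets, grammar_B and first_B from the First-set sketch; the start_symbol parameter and the "$" end marker are the assumptions made here.

def follow_sets(grammar, start_symbol, first):
    follow = {A: set() for A in grammar}
    follow[start_symbol].add("$")                  # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in grammar:           # only non-terminals have Follow sets
                        continue
                    first_beta = first_of_string(body[i + 1:], first)
                    new = first_beta - {EPS}       # rule 2: First(beta) except epsilon
                    if EPS in first_beta:          # rule 3: beta can vanish, so
                        new |= follow[A]           # everything in Follow(A) is in Follow(B)
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

follow_B = follow_sets(grammar_B, "S", first_B)
print(follow_B)    # e.g. Follow(A) = {'h'}, Follow(D) = {'f', 'g'}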

Example: Consider the following CFG, for which we calculate the First sets of all non-terminals:

Terminals = { a, b, c, d }
Non-Terminals = { S', S, T, U, V }
Rules = (0) S' → S$
        (1) S → TU
        (2) T → aVa
        (3) T → ε
        (4) U → bVT
        (5) V → Ub
        (6) V → d
Start Symbol = S'

The First sets of the non-terminals are:
S' = {a, b}; S = {a, b}; T = {a, ε}; U = {b}; and V = {b, d}

Initially, for each non-terminal A in G we set Follow(A) = { }, an empty set.

1. S' → S$   Add {$} to Follow(S) = {$}
2. S → TU    Add First(U), {b}, to Follow(T) = {b}
             Add Follow(S), {$}, to Follow(U) = {$}
3. T → aVa   Add {a} to Follow(V) = {a}
4. T → ε     (no change)
5. U → bVT   Add First(T), {a}, to Follow(V) = {a}
             Add Follow(U), {$}, to Follow(T) = {b, $}
             Add Follow(U), {$}, to Follow(V) = {a, $} (since First(T) contains ε)
6. V → Ub    Add {b} to Follow(U) = {b, $}
7. V → d     (no change)

The Follow sets of the non-terminals so far are:
S' = { }; S = {$}; T = {b, $}; U = {b, $}; and V = {a, $}
Note: Since there were some changes, we need another iteration.

1. S' → S$   Add $ to Follow(S) = {$} (no change)
2. S → TU    Add First(U), {b}, to Follow(T) = {b, $} (no change)
             Add Follow(S), {$}, to Follow(U) = {b, $} (no change)
3. T → aVa   Add {a} to Follow(V) = {a, $} (no change)
4. T → ε     (no change)
5. U → bVT   Add First(T), {a}, to Follow(V) = {a, $} (no change)
             Add Follow(U), {b, $}, to Follow(T) = {b, $} (no change)
             Add Follow(U), {b, $}, to Follow(V) = {a, b, $} (since First(T) contains ε)
6. V → Ub    Add {b} to Follow(U) = {b, $} (no change)
7. V → d     (no change)

The Follow sets of the non-terminals so far are:
S' = { }; S = {$}; T = {b, $}; U = {b, $}; and V = {a, b, $}
Note: Since there was 1 change, we need another iteration.

1. S' → S$   Add $ to Follow(S) = {$} (no change)
2. S → TU    Add First(U), {b}, to Follow(T) = {b, $} (no change)
             Add Follow(S), {$}, to Follow(U) = {b, $} (no change)
3. T → aVa   Add {a} to Follow(V) = {a, b, $} (no change)
4. T → ε     (no change)
5. U → bVT   Add First(T), {a}, to Follow(V) = {a, b, $} (no change)
             Add Follow(U), {b, $}, to Follow(T) = {b, $} (no change)
             Add Follow(U), {b, $}, to Follow(V) = {a, b, $} (no change)
6. V → Ub    Add b to Follow(U) = {b, $} (no change)
7. V → d     (no change)

The Follow sets of the non-terminals are:
S' = { }; S = {$}; T = {b, $}; U = {b, $}; and V = {a, b, $}
Note: Since there were no changes, we stop.

LL (1) Parse Tables


 Once we have First and Follow sets for all non-terminals in the grammar, we can create a Parse Table
 A parse table is a blueprint for the creation of a recursive descent parser (RDP)
o The rows in the parse table are labeled with non-terminals and the columns are labeled with terminals
o Each entry in the parse table is either empty or contains a grammar rule
 The rule located at row S, column a of a parse table tells us which rule to apply when we are trying to parse
the non-terminal S, and the next symbol in the input is an a
 For instance, for grammar A of Example 1 above, the parse table is:

   | id      | num     | while          | print         | > | {       | }     | ; | ( | )
S  |         |         | S → while(B) S | S → print(E); |   | S → {L} |       |   |   |
E  | E → id  | E → num |                |               |   |         |       |   |   |
B  | B → E>E | B → E>E |                |               |   |         |       |   |   |
L  |         |         | L → SL         | L → SL        |   | L → SL  | L → ε |   |   |

 Once we have the parse table for a CFG, creating a recursive descent parser is easy.
o We need to write a function for each non-terminal S in the grammar.
o The row labeled S in the parse table will tell us exactly what the function parse S
needs to do.

Creating Parse Tables


A parse table is created as follows:
1. The rows of the parse table are labeled with the non-terminals of the grammar.
2. The columns of the parse table are labeled with the terminals of the grammar
3. Each entry of the parse table is either empty or contains a grammar rule.
o Place each rule of the form S → γ in row S, in each column in First(γ), where γ is a string of terminals and non-terminals.

o Place each rule of the form S → γ in row S, in each column in Follow(S), whenever First(γ) contains ε.
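
The two placement rules above translate almost directly into code; a sketch, reusing the helpers and sets from the earlier sketches (EPS, first_of_string, grammar_B, first_B, follow_B):

def ll1_table(grammar, first, follow):
    table = {}                                    # maps (non-terminal, terminal) -> body
    for A, bodies in grammar.items():
        for body in bodies:
            first_body = first_of_string(body, first)
            lookaheads = first_body - {EPS}       # rule goes under every column in First(body)
            if EPS in first_body:
                lookaheads |= follow[A]           # and under Follow(A) when the body can vanish
            for a in lookaheads:
                if (A, a) in table:               # a duplicate entry means the grammar is not LL(1)
                    raise ValueError(f"not LL(1): conflict at ({A}, {a})")
                table[(A, a)] = body
    return table

table_B = ll1_table(grammar_B, first_B, follow_B)
print(table_B[("S", "e")])    # ('A', 'B'): apply S -> AB when the lookahead is e
print(table_B[("A", "h")])    # (): apply A -> epsilon when the lookahead is h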

Consider again grammar B of Example 1. We have the First and Follow sets of each non-terminal:

1. S → AB   First(AB) = {e, h}.  S → AB goes in row S, columns e and h
2. S → Cf   First(Cf) = {f, g}.  S → Cf goes in row S, columns f and g
3. A → ef   First(ef) = {e}.     A → ef goes in row A, column e
4. A → ε    First(ε) = {ε} and Follow(A) = {h}.  A → ε goes in row A, column h
5. B → hg   First(hg) = {h}.     B → hg goes in row B, column h
6. C → DD   First(DD) = {g}.     C → DD goes in row C, column g
7. C → fi   First(fi) = {f}.     C → fi goes in row C, column f
8. D → g    First(g) = {g}.      D → g goes in row D, column g

Non-terminal | First        | Follow
S            | {e, f, g, h} | {$}
A            | {e, ε}       | {h}
B            | {h}          | {$}
C            | {f, g}       | {f}
D            | {g}          | {f, g}

The resulting parse table is shown below.

   | e      | f      | g      | h      | i
S  | S → AB | S → Cf | S → Cf | S → AB |
A  | A → ef |        |        | A → ε  |
B  |        |        |        | B → hg |
C  |        | C → fi | C → DD |        |
D  |        |        | D → g  |        |
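
With the parse table above, the non-recursive (table-driven) LL(1) parser mentioned earlier amounts to a small stack-driven loop. A sketch, with the table for grammar B written out by hand:

TABLE = {
    ("S", "e"): ["A", "B"], ("S", "h"): ["A", "B"],
    ("S", "f"): ["C", "f"], ("S", "g"): ["C", "f"],
    ("A", "e"): ["e", "f"], ("A", "h"): [],        # A -> ef and A -> epsilon
    ("B", "h"): ["h", "g"],
    ("C", "f"): ["f", "i"], ("C", "g"): ["D", "D"],
    ("D", "g"): ["g"],
}
NONTERMINALS = {"S", "A", "B", "C", "D"}

def ll1_parse(tokens):
    tokens = list(tokens) + ["$"]
    stack = ["$", "S"]                             # start symbol on top of the end marker
    pos = 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            body = TABLE.get((top, tokens[pos]))
            if body is None:
                return False                       # empty table entry: syntax error
            stack.extend(reversed(body))           # push the body, leftmost symbol on top
        elif top == tokens[pos]:
            pos += 1                               # terminal on top matches the input
        else:
            return False
    return pos == len(tokens)

print(ll1_parse("efhg"))   # True
print(ll1_parse("ggf"))    # True
print(ll1_parse("eg"))     # False
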
Example 2
Given the following CFG, create its LL(1) parse table:

Terminals = { id, num, (, ), ;, if, else, , }
Non-Terminals = { S, L, C, E }
Rules = (1) S → id(L);
        (2) S → if(E) S else S
        (3) L → ε
        (4) L → E C
        (5) C → ε
        (6) C → , E C
        (7) E → id
        (8) E → num
Start Symbol = S

First, find the First and Follow sets of each non-terminal:

Non-terminal | First        | Follow
S            | {id, if}     | {$, else}
L            | {id, num, ε} | {)}
C            | {,, ε}       | {)}
E            | {id, num}    | {), ,}

Given the above First and Follow sets, the parse table for the CFG is created as follows:

   | id         | num     | ( | )     | ; | if                 | else | ,
S  | S → id(L); |         |   |       |   | S → if(E) S else S |      |
L  | L → E C    | L → E C |   | L → ε |   |                    |      |
C  |            |         |   | C → ε |   |                    |      | C → , E C
E  | E → id     | E → num |   |       |   |                    |      |

Note that we only need to compute Follow sets for an LL(1) parser if at least one First set contains ε.
o Follow sets are only used in the creation of the parse table for rules of the form S → γ where First(γ) contains ε.
o Follow sets are not necessary if no such rule exists. However, if at least one such rule exists, then we still need to create the Follow sets of all non-terminals in the grammar.

Grammars That Are Not LL (1)


 If we can build an LL(1) parse table with no duplicate entries for a grammar, then we say that the grammar is LL(1).
o Unfortunately, not all grammars are LL (1). For instance, the following grammar is not LL (1)
grammar.
Terminals = { id, +, - , *, /, % }
Non-Terminals = {E}
Rules = (0) E → id
(1) E → E + E|E - E
(3) E → E * E|E / E |E % E
Start Symbol = E
 The parse table includes only one non-terminal E, but it has 6 entries in the id column.
o Hence, the above grammar is ambiguous and we cannot create an LL (1) parser for it.
o If we wish to be able to parse the grammar, we need to create an equivalent grammar that is not
ambiguous
   | + | - | * | / | % | id
E  |   |   |   |   |   | E → id, E → E + E, E → E - E, E → E * E, E → E / E, E → E % E
Removing Ambiguity
There are four ways in which ambiguity can creep into (get into) a CFG for a programming language:

1. Defining expressions: - the straightforward definition of expressions will often lead to ambiguity, such
as the one that we have seen in above.
2. Defining complex variables: - complex variables, such as instance variables in classes, fields in records or
structures, array subscripts and pointer references, can also lead to ambiguity. Example V → id|V.V
3. Overlap between specific and general cases: For example, CFG
Terminals = { id, +, - , *, /, % }
Non-Terminals = {E, T, F}
Rules =(0) E → E+T | E-T | T | id
(1) T→ T*F | T/F | T%F | F | id
(3) F→ (E) | id
Start Symbol=E

The string id has several leftmost derivations (and hence several parse trees): E → id, E → T → id, and E → T → F → id.

4. Nesting statements: - the most common instance of nesting statements causing ambiguity is the
infamous "dangling else", whose CFG is shown below:

S → if e then S else S | if e then S
S → a | b

With this CFG, a statement such as “if e then if e then a else b” has two parse trees.
 It is not always possible to remove ambiguity from a context-free grammar.
o There are some languages that are inherently ambiguous. That is, there exists a language L such that all CFGs that generate L are ambiguous.
 Inherent ambiguity is not a problem that compiler designers usually need to face.
i.e. no major programming language is inherently ambiguous.
o There is no algorithm that will always remove ambiguity from a context-free grammar
Removing Left Recursion
 An unambiguous grammar may still not be LL (1). Consider the unambiguous expression grammar below:

Terminals = { id, +, - , *, /, % }
Non-Terminals = {E, T, F}
Rules = (1) E → E +T
(2) E → E - T
(3) E → T
(4) T → T*F
(5) T → T/F
(6) T → F
(7) F → (E)
(8) F → id
Start Symbol = E
 Though this CFG is unambiguous, it is not LL(1). In order for a CFG to be LL(1), it must be possible to decide which rule to apply after looking at only the next input symbol. On seeing an id, we cannot tell whether we should apply rule (1), (2), or (3).
 The problem with this CFG is that rules (1), (2), (4) and (5) are left-recursive.
 A rule S → α (where S is a non-terminal and α is a string of terminals and non-terminals) is left-recursive if the first symbol in α is S.

 Note: No left-recursive grammar is LL (1)


 Consider the following CFG fragment:
(1) S → Sα
(2) S → β
 What strings can be derived from S? Consider the following partial derivations:
S → Sα → Sαα → Sααα → βααα

 Any string that can be derived from S will be a string that can be derived from β, followed by zero or more strings that can be derived from α. Using EBNF notation, we have:
S → β(α)*
 Using CFG notations, we have:
S → βS′
S′ → α S′
S′ → ε
We have removed the left-recursion in the above example!!
In general, the set of rules of the form:
S → Sα1 | Sα2 | Sα3| ...... |Sαn
S → β1 | β2 | β3 | …. |βn
Can be rewritten as:
S → β1S′│β2 S′│β3 S′│…. │βn S′
S′ → α1 S′│α2 S′│α3 S′│ .... │αn S′
S′ → ε
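
A sketch of this transformation in code, for immediate left recursion on a single non-terminal; the grammar is represented as lists of right-hand sides (lists of symbols), and the primed name S' is simply built by string concatenation:

def remove_immediate_left_recursion(nonterminal, bodies):
    recursive = [b[1:] for b in bodies if b and b[0] == nonterminal]      # the alpha parts
    others    = [b for b in bodies if not b or b[0] != nonterminal]       # the beta parts
    if not recursive:
        return {nonterminal: bodies}                                      # nothing to do
    new = nonterminal + "'"
    return {
        nonterminal: [beta + [new] for beta in others],                   # S  -> beta S'
        new: [alpha + [new] for alpha in recursive] + [[]],               # S' -> alpha S' | epsilon
    }

# The expression-grammar fragment E -> E+T | E-T | T:
print(remove_immediate_left_recursion("E", [["E", "+", "T"], ["E", "-", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], ['-', 'T', "E'"], []]}

The printed result matches the grammar derived by hand below.
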
Let’s take a closer look at the expression grammar:
E→E+T
E→E–T
E→T
Using the above transformation, we get the following CFG, which has no left-recursion:
E → TE'
E' → +TE'
E' → -TE'
E' → ε
Using EBNF notation, we have:
E → T((+T)│(-T))*
Left Factoring
❖ Even if a CFG is unambiguous and has no left-recursion, it still may not be LL (1).
❖ Consider the following two Fortran do statements and their Java/C equivalents:

Fortran:                          Java/C equivalent:
do var = initial, final           for (var=initial; var<=final; var++)
    loop body                     {
end do                                loop body
                                  }

and

Fortran:                          Java/C equivalent:
do var = initial, final, inc      for (var=initial; var<=final; var+=inc)
    loop body                     {
end do                                loop body
                                  }
❖ We can describe the Fortran do statement with the following CFG fragment:
S → do LS
L → id = exp, exp
L → id = exp, exp, exp

❖ This CFG is not LL(1). Why? Because there are two rules for L, and we cannot tell which rule to use by looking only at the next input symbol: both right-hand sides begin with id.
❖ We can fix this problem by left-factoring the similar section of the rule as follows:
S → do LS
L → id = exp, exp L'
L' → , exp
L' → ε
❖ Using EBNF notations, the Fortran do statement can also be written as follows:
S → do LS
L → id = exp, exp (, exp)?
❖ In general, if we have the following context-free grammar, where α, γ and the βk stand for strings of terminals and non-terminals:
S → αβ1 | αβ2 | αβ3 | ... | αβn | γ
We could left-factor it to get the CFG:
S → α S′ | γ
S′→ β1 | β2 | β3 | ... | βn
Using EBNF notations to get:
S → α(β1│β2│β3│…│βn) | γ
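
A sketch of one round of left factoring in the same style as the earlier sketches: alternatives of a non-terminal that share a leading symbol are grouped, their longest common prefix α is pulled out, and a new primed non-terminal is introduced for the differing tails (all names here are illustrative):

from collections import defaultdict

def left_factor(nonterminal, bodies):
    groups = defaultdict(list)
    for body in bodies:
        groups[body[0] if body else None].append(body)    # group by the leading symbol

    rules = {nonterminal: []}
    primes = 0
    for key, group in groups.items():
        if key is None or len(group) == 1:
            rules[nonterminal].extend(group)              # nothing shared: keep as is
            continue
        prefix = group[0]                                 # longest prefix common to the group
        for body in group[1:]:
            i = 0
            while i < len(prefix) and i < len(body) and prefix[i] == body[i]:
                i += 1
            prefix = prefix[:i]
        primes += 1
        new = nonterminal + "'" * primes
        rules[nonterminal].append(prefix + [new])                   # A  -> alpha A'
        rules[new] = [body[len(prefix):] for body in group]         # A' -> beta_k (possibly epsilon)
    return rules

# The Fortran do-statement rules for L:
print(left_factor("L", [["id", "=", "exp", ",", "exp"],
                        ["id", "=", "exp", ",", "exp", ",", "exp"]]))
# {'L': [['id', '=', 'exp', ',', 'exp', "L'"]], "L'": [[], [',', 'exp']]}
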
❖ Example 4 The following grammar abstracts the "dangling-else" problem:
S → i E t S | i E t S e S| α
E→b
Here i, t, and e stand for if, then, and else; E and S stand for "conditional
expression" and "statement." Left-factored, this grammar becomes:
S → i E t S S′ | α
S′ → e S | ε
E→b

Bottom-up Parsing
• Bottom-up parsing corresponds to the construction of a parse tree for an input token stream, beginning at the leaves (the bottom) and working up towards the root (the top).
• Example 5 Given the grammar:
E→T
T→T*F
T→F
F → id
Construct a bottom-up parse of the token stream id * id.


Reduction
➢ We can think of bottom-up parsing as the process of “reducing” a token string to the start symbol of the grammar.
➢ At each reduction, the token string matching the RHS of a production is replaced by the LHS non-
terminal of that production.

➢ The key decisions during bottom-up parsing are about when to reduce and about what
production to apply.
➢ For the expression grammar in Example 5, the input id * id is reduced to the start symbol through the following sequence of strings:
id * id, F * id, T * id, T * F, T, E
Shift-reduce Parsing
➢ Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols
and an input buffer holds the rest of the tokens to be parsed.
➢ We use $ to mark the bottom of the stack and also the end of the input. Initially, the stack is
empty, and the string w is on the input, as follows:
STACK INPUT
$ w$
➢ During a left-to-right scan of the input tokens, the parser shifts zero or more input tokens into
the stack, until it is ready to reduce a string β of grammar symbols on top of the stack.
➢ There are actually four possible actions a shift-reduce parser can make
• Shift: shift the next input token onto the top of the stack.
• Reduce: the right end of the string to be reduced must be at the top of the stack. Locate the left end of the string within the stack and decide with which non-terminal to replace that string.
• Accept: announce successful completion of parsing.
• Error: discover a syntax error and call an error recovery routine.
➢ The following trace steps through the actions a shift-reduce parser might take in parsing the input string id1 * id2 according to the expression grammar in Example 5.
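
One possible sequence of moves, reconstructed here as a sketch (the subscripts on id only distinguish the two tokens):

STACK          INPUT            ACTION
$              id1 * id2 $      shift
$ id1          * id2 $          reduce by F → id
$ F            * id2 $          reduce by T → F
$ T            * id2 $          shift
$ T *          id2 $            shift
$ T * id2      $                reduce by F → id
$ T * F        $                reduce by T → T * F
$ T            $                reduce by E → T
$ E            $                accept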

LR Parsers
➢ The most prevalent type of bottom-up parser today is based on a concept called LR(k) parsing;
➢ The "L" is for left-to-right scanning of the input, the "R" for constructing a rightmost derivation
in reverse, and the k for the number of input symbols of lookahead that are used in making parsing
decisions.

➢ LR(k) parsers are of interest for the following reasons:


o they are the most powerful class of deterministic bottom-up parsers using at most k look-ahead tokens.
o Deterministic parsers must uniquely determine the correct parsing action at each step; they cannot
back up or retry parsing actions.
o An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the
input
➢ We will cover 4 LR(k) parsers: LR (0), SLR (1), LR (1), and LALR (1) here.
➢ In building an LR Parser:
1) Create the Transition Diagram
2) From it, construct:
GOTO table: defines, for each non-terminal, the next state to enter after a reduction.
Action table: tells the parser whether to shift (S), reduce (R), accept (A) the source code, or signal a syntactic error (E).

➢ An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse.
➢ States represent sets of items.
LR (0) Item
➢ An LR(0) item of a grammar G is a production of G with a dot at some position in the body.
➢ The dot symbol ‧, in an item may appear anywhere in the right-hand side of a production.
➢ It marks how much of the production has already been matched.
➢ The production A → XYZ yields the four items below; the fourth one is called the final item.

A → ‧XYZ
A → X ‧ YZ
A → XY ‧ Z
A → XYZ ‧
➢ The production A → λ generates only one item, A → ‧.
LR (0) Item Closure
• If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by the 2 rules:
1) Initially, add every item in I to CLOSURE(I)
2) If A → α‧B β is in CLOSURE(I) and B → γ is a production, then add B → ‧γ to
CLOSURE(I), if it is not already there.
Apply this until no more new items can be added.

Example 6 E’ → E
E → E + T| T
T → T * F | F
F → (E) | id
I is the set of one item {E’ → ‧E}.

Find CLOSURE(I),
Solution:
o First, E’ → ‧E is put in CLOSURE(I) by rule 1.
o Then, we add the E-productions with dots at the left end:
▪ E → ‧E + T and E → ‧T.
o Now, there is a T immediately to the right of a dot in E → ‧T, so we add T → ‧T * F
and T → ‧F.
o Next, T → ‧F forces us to add: F → ‧(E) and F → ‧id.
Example 7 on closure
S→E$
E→E+T | T
T → ID | (E)
closure (S→‧E$) ={S→‧E$,
E→‧E+T,
E→‧T,
T→‧ID,
T→‧(E)}
The five items above form an item set called state s0.
➢ Therefore, the closure can be computed as follows:
SetOfItems Closure(I) {
J=I
repeat
for (each item A → α‧B β in J)
for (each production B → γ of G)
if (B → ‧ γ is not in J)
add B → ‧ γ to J;
until no more items are added to J;
return J;
} // end of Closure (I)
GOTO function
➢ The GOTO function is used to define the transitions in the LR(0) automaton for a grammar.
➢ GOTO(I, X), where I is a set of items and X is a grammar symbol, is defined to be the closure of the set of all items [A → αX‧β] such that [A → α‧Xβ] is in I.
➢ Example 8: - If I is the set of two items { [E' → E·] , [E → E· + T] } , then
GOTO (I, +) contains the items
E → E + ·T
T → ·T * F
T → ·F
F → · (E)
F → ·id
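
A sketch of CLOSURE and GOTO in Python for the expression grammar of Example 6; representing an item as a pair (rule index, dot position) is a choice made here for illustration.

GRAMMAR = [
    ("E'", ["E"]),              # 0: E' -> E
    ("E",  ["E", "+", "T"]),    # 1
    ("E",  ["T"]),              # 2
    ("T",  ["T", "*", "F"]),    # 3
    ("T",  ["F"]),              # 4
    ("F",  ["(", "E", ")"]),    # 5
    ("F",  ["id"]),             # 6
]
NONTERMINALS = {head for head, _ in GRAMMAR}

def closure(items):
    items = set(items)
    while True:
        new = set()
        for rule, dot in items:
            body = GRAMMAR[rule][1]
            if dot < len(body) and body[dot] in NONTERMINALS:
                # rule 2: add B -> .gamma for every production of the symbol after the dot
                for i, (head, _) in enumerate(GRAMMAR):
                    if head == body[dot]:
                        new.add((i, 0))
        if new <= items:            # nothing more can be added
            return items
        items |= new

def goto(items, symbol):
    # advance the dot over `symbol`, then take the closure of the result
    moved = {(rule, dot + 1)
             for rule, dot in items
             if dot < len(GRAMMAR[rule][1]) and GRAMMAR[rule][1][dot] == symbol}
    return closure(moved)

print(sorted(closure({(0, 0)})))            # Example 6: all seven items with the dot at the left end
print(sorted(goto({(0, 1), (1, 1)}, "+")))  # Example 8: E -> E+.T plus the T- and F-items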

Example 9: build the LR (0) parser for the following grammar


S’→ S $
S→ id
Solution: 1st draw the transition diagram

➢ Each state in the Transition Diagram, either signals a shift (‧ moves to right of a terminal) or
signals a reduce (reducing the RHS handle to LHS)
➢ 2nd, construct the LR(0) parsing table. The table contains two parts, the Action part and the GOTO part; the grammar symbols are used as column headers and the state numbers as row headers. The Action part has only terminals as column names, and the GOTO part has the non-terminals.
       | Action    | GOTO
States | id  | $   | S
0      | S1  |     | 2
1      | R1  | R1  |
2      |     | S3  |
3      |     | A   |
➢ The blanks in the table above indicate errors; S stands for shift, A for accept, and R1 for reduce by Rule 1.
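
A sketch of the table-driven LR parsing loop using this Example 9 table; the tuple encoding of the Action entries and the None padding for exhausted input are representation choices made here (the end marker $ is an ordinary symbol of this grammar, so it is shifted like any other token before state 3 accepts):

ACTION = {
    (0, "id"): ("shift", 1),
    (1, "id"): ("reduce", "S", 1),     # R1: reduce by S -> id (LR(0) reduces on any lookahead)
    (1, "$"):  ("reduce", "S", 1),
    (2, "$"):  ("shift", 3),
    (3, None): ("accept",),            # nothing is left to read once $ has been shifted
}
GOTO = {(0, "S"): 2}

def lr_parse(tokens):
    tokens = list(tokens) + ["$", None]          # None marks exhausted input
    states = [0]                                 # the parser stack holds states
    pos = 0
    while True:
        action = ACTION.get((states[-1], tokens[pos]))
        if action is None:
            return False                         # blank table entry: syntax error
        if action[0] == "shift":
            states.append(action[1])
            pos += 1
        elif action[0] == "reduce":
            _, lhs, length = action
            del states[len(states) - length:]        # pop one state per symbol of the RHS
            states.append(GOTO[(states[-1], lhs)])   # then follow the GOTO entry for the LHS
        else:
            return True                          # accept

print(lr_parse(["id"]))         # True
print(lr_parse(["id", "id"]))   # False: syntax error
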
➢ Example 10: build the LR(0) parser for the following grammar:
S → E $ r1
E → E+T r2
|T r3
T → id r4
| (E) r5.
Solution: LR (0) Transition Diagram


       | Action                  | GOTO
States | +  | id | (  | )  | $  | S | E | T
0      |    | S5 | S6 |    |    |   | 1 | 9
1      | S3 |    |    |    | S2 |   |   |
2      |    |    |    |    | A  |   |   |
3      |    | S5 | S6 |    |    |   |   | 4
4      | R2 | R2 | R2 | R2 | R2 |   |   |
5      | R4 | R4 | R4 | R4 | R4 |   |   |
6      |    | S5 | S6 |    |    |   | 7 | 9
7      | S3 |    |    | S8 |    |   |   |
8      | R5 | R5 | R5 | R5 | R5 |   |   |
9      | R3 | R3 | R3 | R3 | R3 |   |   |
Simple LR (1) / SLR (1) Parsing
• SLR (1) has the same Transition Diagram and GOTO table as LR (0) BUT with different
Action table because it looks ahead 1 token.
• SLR (1) parsers are built first by constructing Transition Diagram, then by computing Follow
set as SLR (1) look-ahead.
• The idea is: a handle (RHS) should NOT be reduced to N if the look-ahead token is NOT in Follow(N).
Example 11. S→ E $ r1
E→ E + T r2
|T r3
T→ ID r4
T→ ( E ) r5
Follow (S) = { $ }
Follow (E) = { ), +, $}
Follow (T) = { ), +, $}
Use the follow sets as look-ahead in reduction.

       | Action                  | GOTO
States | +  | id | (  | )  | $  | S | E | T
0      |    | S5 | S6 |    |    |   | 1 | 9
1      | S3 |    |    |    | S2 |   |   |
2      |    |    |    |    | A  |   |   |
3      |    | S5 | S6 |    |    |   |   | 4
4      | R2 |    |    | R2 | R2 |   |   |
5      | R4 |    |    | R4 | R4 |   |   |
6      |    | S5 | S6 |    |    |   | 7 | 9
7      | S3 |    |    | S8 |    |   |   |
8      | R5 |    |    | R5 | R5 |   |   |
9      | R3 |    |    | R3 | R3 |   |   |
• Example 12: The grammar below causes a shift-reduce conflict in its SLR(1) table:
S → A | xb r1,2
A → aAb | B r3,4
B→x r5
Use follow(S) = {$},
follow(A) = follow(B) = {b, $}
in the SLR (1) Transition Diagram next.

       | Action                     | GOTO
States | a   | b     | x   | $      | S | A | B
0      | S3  |       | S2  |        |   | 1 | 4
1      |     |       |     | R1     |   |   |
2      |     | S5/R5 |     | R5     |   |   |
3      | S3  |       | S7  |        |   | 6 | 4
4      |     | R4    |     | R4     |   |   |
5      |     |       |     | R2     |   |   |
6      |     | S8    |     |        |   |   |
7      |     | R5    |     | R5     |   |   |
8      |     | R3    |     | R3     |   |   |
State 2 (S5/R5) has a shift-reduce conflict:
When the look-ahead is ‘b’, the parser does not know whether to reduce by rule 5 (R5) or shift to state 5 (S5).
Solution: use the more powerful LR(1) method.

LR (1) Parsing

❖ The reason why the FOLLOW set does not work as well as one might wish is that it replaces the look-ahead of a single item of a rule N in a given LR state by the whole FOLLOW set of N, which is the union of the look-aheads of all alternatives of N in all states.
❖ Solution: Use LR (1), which is equivalent to LR (0) item + look-ahead
❖ LR (1) item sets are more discriminating:
❖ A look-ahead set is kept with each separate item, to be used to resolve conflicts when a
reduce item has been reached.
❖ This greatly increases the strength of the parser, but also the size of its tables.
❖ An LR(1) item is of the form:
[A → X1 … Xi ‧ Xi+1 … Xj, l]   where l belongs to Vt ∪ {λ}
Here l is the look-ahead, Vt is the vocabulary of terminals, and λ stands for the look-ahead after the end marker $.
❖ Rules for look-ahead sets:
1) initial item set: the look-ahead set of the initial item set S0 contains only one
token, the end-of-file token ($), the only token that follows the start symbol.
2) other item set:
Given P → α‧Nβ {σ}, we have
N → ‧γ {FIRST(β{σ}) } in the item set.
❖ The LR(1) look-ahead set FIRST(β{σ}) is computed as follows:
If β can produce λ (β →* λ), then FIRST(β{σ}) is FIRST(β) plus the tokens in {σ}, excluding λ;
otherwise, FIRST(β{σ}) just equals FIRST(β).
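
As a small sketch, the rule above can be written as a function, reusing EPS, first_of_string and first_B from the First-set sketch earlier; beta is the string after the non-terminal being expanded and sigma is the inherited look-ahead set:

def lr1_lookaheads(beta, sigma, first):
    first_beta = first_of_string(beta, first)
    if EPS in first_beta:                          # beta can produce the empty string
        return (first_beta - {EPS}) | set(sigma)   # so the inherited lookaheads carry over
    return first_beta                              # otherwise FIRST(beta) alone

# e.g. expanding D inside the item C -> . D D {f} of grammar B:
print(lr1_lookaheads(("D",), {"f"}, first_B))      # {'g'}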

❖ Unlike LR(0), which puts the reduce move in the entire row, and SLR(1), which puts the reduce move in the columns of the Follow set, CLR(1) puts the reduce move only in the columns of the look-ahead set.
Example 13: for the given grammar construct the CLR (1) parser:
S→ A | xb r1,2
A→ aAb | B r3,4
B→ x r5

       | Action                  | GOTO
States | a   | b   | x   | $    | S | A  | B
0      | S3  |     | S2  |      |   | 1  | 4
1      |     |     |     | R1   |   |    |
2      |     | S5  |     | R5   |   |    |
3      | S10 |     | S7  |      |   | 6  | 9
4      |     |     |     | R4   |   |    |
5      |     |     |     | R2   |   |    |
6      |     | S8  |     |      |   |    |
7      |     | R5  |     |      |   |    |
8      |     |     |     | R3   |   |    |
9      |     | R4  |     |      |   |    |
10     | S10 |     | S7  |      |   | 11 | 9
11     |     | S12 |     |      |   |    |
12     |     | R3  |     |      |   |    |

Look-ahead LR (1) /LALR (1) Parsing


• LALR (1) parser can be built by first constructing an LR (1) transition diagram and then
merging states.
• It differs from LR(1) only in that it merges the look-ahead components of items that have a common core.
• Consider states S and S’ below in LR (1):
s:  A → a‧ {b}          s': A → a‧ {c}
    B → a‧ {d}              B → a‧ {e}
s and s’ have common core:
A→a‧
B→a‧
So, we can merge the two states:
A→a‧ {b,c}
B→a‧ {d,e}
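
A sketch of the merging step in code: LR(1) items are represented as (core item, look-ahead) pairs, states with identical cores are grouped, and the look-aheads attached to each core item are unioned (the representation and names are illustrative):

from collections import defaultdict

def merge_lr1_states(states):
    """states: list of sets of (core_item, lookahead) LR(1) items."""
    def core(state):
        return frozenset(item for item, _ in state)

    groups = defaultdict(list)
    for index, state in enumerate(states):
        groups[core(state)].append(index)          # bucket LR(1) states by their core

    merged = []
    for members in groups.values():
        lookaheads = defaultdict(set)
        for i in members:
            for item, la in states[i]:
                lookaheads[item].add(la)           # union the look-aheads per core item
        merged.append({(item, la) for item, las in lookaheads.items() for la in las})
    return merged, groups

# The two states s and s' above:
s1 = {("A -> a.", "b"), ("B -> a.", "d")}
s2 = {("A -> a.", "c"), ("B -> a.", "e")}
merged, _ = merge_lr1_states([s1, s2])
print(merged)   # one state: A -> a. with {b, c} and B -> a. with {d, e}
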
Example 14: For the grammar below construct LALR (1) parser:
S → A | xb r1,2
A→ aAb | B r3,4
B→ x r5
Merge the states in the LR (1) Transition Diagram to get that of LALR (1).

LALR(1) State | CLR(1) States with Common Core
State 0       | State 0
State 1       | State 1
State 2       | State 2
State 3       | State 3, State 10
State 4       | State 4, State 9
State 5       | State 5
State 6       | State 6, State 11
State 7       | State 7
State 8       | State 8, State 12

       | Action                 | GOTO
States | a   | b   | x   | $   | S | A | B
0      | S3  |     | S2  |     |   | 1 | 4
1      |     |     |     | R1  |   |   |
2      |     | S5  |     | R5  |   |   |
3      | S3  |     | S7  |     |   | 6 | 4
4      |     | R4  |     | R4  |   |   |
5      |     |     |     | R2  |   |   |
6      |     | S8  |     |     |   |   |
7      |     | R5  |     |     |   |   |
8      |     | R3  |     | R3  |   |   |
Exercise
Given the grammar below:
E → T Op T r1
T →a r2
|b r3
Op → + r4
write: 1) the state transition diagram, and
2) the Action table and GOTO table
for the LR(0), SLR(1), LR(1) and LALR(1) bottom-up parsing methods, respectively.

Compiled by: Dawit K.