• Describe recursive descent parsers and the LL(1) parser: explain First and Follow sets, how to find the
First and Follow sets of a grammar, and how to construct LL(1) parse tables.
• Describe grammars that are not LL(1): removing ambiguity, removing left recursion and
removing nondeterminism (left factoring).
• Describe the bottom-up parsers: explain shift reduce parsing as well as the LR parsers.
• Construct the LR parsers (LR (0), SLR (1), CLR (1) and LALR (1)): valid item and item closure,
transition diagrams, Action and GOTO tables.
❖ These parsers use a procedure for each nonterminal. The procedure looks at its input and decides which
production to apply for its nonterminal.
❖ Terminals in the body of the production are matched to the input at the appropriate time, while
nonterminals in the body result in calls to their procedure.
❖ Backtracking, in the case when the wrong production was chosen, is a possibility.
❖ A grammar such that it is possible to choose the correct production with which to expand a given
nonterminal, looking only at the next input symbol, is called LL(1).
❖ These grammars allow us to construct a predictive parsing table that gives, for each
nonterminal and each lookahead symbol, the correct choice of production.
❖ Error correction can be facilitated by placing error routines in some or all of the table entries that have
no legitimate production.
❖ To build the recursive descent parser (RDP), we first need to create the “First” and “Follow” sets of
the non-terminals in the CFG.
❖ Goal: given productions A → α | β, the parser should be able to choose between α and β.
Informally: FIRST(A) is the set of tokens that could appear as the first symbol in a string derived from A.
Def: x is in FIRST(A) if A →* xγ for some string γ.
First Sets
The First set of a non-terminal A is the set of all terminals that can begin a string derived from A
If the empty string ε can be derived from A, then ε is also in the First set of A.
For instance, given the CFG below ($ is an end-of-file marker, ε means empty string):
Terminals = { e, f, g , h, i }
Non-Terminals = {S', S, A, B, C, D }
Rules = (1) S → AB|Cf
(3) A → ef|ε
(5) B → hg
(6) C → DD|fi
(8) D → g
Start Symbol = S
In this grammar, the set of all strings derivable from the non-terminal S is {efhg, hg, fif, ggf}.
Thus, First(S) = {e, h, f, g}, where e, h, f and g are the first terminals of the strings in the
above set, respectively.
Follow Sets
For each non-terminal in a grammar, we can also create a Follow set.
The Follow set for a non-terminal A in a grammar is the set of all terminals that could appear right after A
in a valid sentence while deriving it.
Take another look at the CFG shown in Example 1B above, what terminals can follow A in a derivation?
Consider the derivation S$ → AB$ → Ahg$, since h follows A in this derivation, h is in the Follow
set of A. Note: $ is the end-of-file marker.
What about the non-terminal D? Consider the partial derivation: S$ → Cf$ → DDf$
→ Dgf$.
Since both f and g follow a D in this derivation, f and g are in the Follow set of D.
The follow sets for all non-terminals in the CFG are shown below:
Follow(S) = {$}
Follow(A) = {h}
Follow(B) = {$}
Follow(C) = {f}
Follow(D) = {f, g}
To calculate the First set of a non-terminal A, we need to calculate the First set of a string of terminals and
non-terminals, since the rules that define what a non-terminal can derive contain terminals and non-terminals.
The First set of a string of terminals and non-terminals can be defined recursively using the following two
algorithms:
Algorithm to calculate First ( ), for a string of Terminals and Non-Terminals
Note that within each iteration we can examine the rules in any order.
If we examine the rules in a different order in each iteration, we will still achieve the same result,
but may take a different number of iterations.
Exercise: check whether examining the rules in the order 8, 7, 6, 5, 4, 3, 2, 1, 0 requires fewer iterations.
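The iterative calculation described above can be sketched in Python as follows. This is only a minimal sketch, not the algorithm from the original slides: the names RULES, first_of_string and compute_first are chosen here for illustration, and the grammar is the First Sets example (S, A, B, C, D) with an empty body standing for ε.

```python
EPS = "ε"

# Grammar from the First Sets example; each rule is (head, body).
RULES = [
    ("S", ["A", "B"]), ("S", ["C", "f"]),
    ("A", ["e", "f"]), ("A", []),          # empty body means A → ε
    ("B", ["h", "g"]),
    ("C", ["D", "D"]), ("C", ["f", "i"]),
    ("D", ["g"]),
]
NONTERMINALS = {head for head, _ in RULES}


def first_of_string(symbols, first):
    """First set of a string of terminals and non-terminals."""
    result = set()
    for sym in symbols:
        if sym not in NONTERMINALS:      # a terminal begins the rest of the string
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:        # sym cannot derive ε, so stop here
            return result
    result.add(EPS)                      # every symbol in the string can derive ε
    return result


def compute_first(rules):
    """Repeat passes over the rules until no First set changes."""
    first = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in rules:         # any rule order gives the same result
            new = first_of_string(body, first)
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


FIRST = compute_first(RULES)
for nt in sorted(FIRST):
    print(nt, sorted(FIRST[nt]))
```

Reordering RULES changes only the number of passes of the while loop, not the final sets.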
Finding Follow Sets for Non-Terminals
If the grammar contains the rule: S → Aa, then a is in the Follow set of A, since a appears immediately after A.
If the grammar contains the rules:
S →AB
B→a│b
then both a and b are in the Follow set of A. Why? Consider the following two partial derivations:
S → AB → Aa
S → AB → Ab
So, both a and b are in the Follow set of A.
The above examples lead us to the following method for finding the Follow sets of the non-terminals in a CFG.
We can calculate the Follow sets from the First sets by using the iterative algorithm given below:
Algorithm to calculate Follow sets for all Non-Terminals in CFG G
1. Place $ in follow(S) where S is the start symbol, and $ is the input right end marker.
2. If there is a production A → 𝛼𝐵𝛽, then everything in first(𝛽) except ε is in follow(𝐵).
3. If there is a production A → 𝛼𝐵, or a production A → 𝛼𝐵𝛽 where first(𝛽) contains ε, then everything in
follow(A) is in follow(B).
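The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are illustrative, the grammar is the earlier S, A, B, C, D example, and $ is placed directly in Follow(S) instead of adding a rule S' → S$.

```python
EPS, END = "ε", "$"

RULES = [
    ("S", ["A", "B"]), ("S", ["C", "f"]),
    ("A", ["e", "f"]), ("A", []),          # empty body means A → ε
    ("B", ["h", "g"]),
    ("C", ["D", "D"]), ("C", ["f", "i"]),
    ("D", ["g"]),
]
NONTERMINALS = {head for head, _ in RULES}


def first_of_string(symbols, first):
    """First set of a string of grammar symbols (ε if the string can vanish)."""
    result = set()
    for sym in symbols:
        if sym not in NONTERMINALS:
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result


def compute_first(rules):
    first = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            new = first_of_string(body, first)
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


def compute_follow(rules, first, start):
    follow = {nt: set() for nt in NONTERMINALS}
    follow[start].add(END)                            # step 1
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for i, sym in enumerate(body):
                if sym not in NONTERMINALS:
                    continue
                rest = first_of_string(body[i + 1:], first)
                new = rest - {EPS}                    # step 2
                if EPS in rest:                       # step 3 (also covers an empty rest)
                    new |= follow[head]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


FIRST = compute_first(RULES)
FOLLOW = compute_follow(RULES, FIRST, "S")
for nt in sorted(FOLLOW):
    print(nt, sorted(FOLLOW[nt]))
```

For this grammar the sketch reproduces the Follow sets listed earlier, for example Follow(D) = {f, g}.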
Example: Consider the following CFG, for which we calculate the First sets for all non-terminals:
Terminals = { a, b, c, d }
Non-Terminals = { S', S, T, U, V }
Rules = (0) S' → S$
(1) S → TU
(2) T → aVa
(3) T → ε
(4) U → bVT
(5) V → Ub
(6) V → d
Start Symbol = S'
The First sets of the non-terminals are:
Non-Terminal   First Set
S'             {a, b}
S              {a, b}
T              {a, ε}
U              {b}
V              {b, d}
Once we have the parse table for a CFG, creating a recursive descent parser is easy.
o We need to write a function for each non-terminal S in the grammar.
o The row labeled S in the parse table will tell us exactly what the function parseS
needs to do.
Consider again the CFG in slide no. 2 (b). Using the First and Follow sets of each non-terminal, we get the
following parse table:
      e         f         g         h         i
S     S → AB    S → Cf    S → Cf    S → AB
A     A → ef                        A → ε
B                                   B → hg
C               C → fi    C → DD
D                         D → g
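A recursive descent parser that follows the parse table above might look like the sketch below. It is an illustration only: the class and method names (Parser, parse_S and so on) are hypothetical, and each token is assumed to be a single character, with $ appended as the end marker.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, text):
        self.tokens = list(text) + ["$"]   # $ marks the end of the input
        self.pos = 0

    @property
    def look(self):
        return self.tokens[self.pos]

    def match(self, terminal):
        if self.look != terminal:
            raise ParseError(f"expected {terminal!r}, found {self.look!r}")
        self.pos += 1

    def parse_S(self):
        if self.look in ("e", "h"):        # row S, columns e and h: S → AB
            self.parse_A()
            self.parse_B()
        elif self.look in ("f", "g"):      # row S, columns f and g: S → Cf
            self.parse_C()
            self.match("f")
        else:
            raise ParseError(f"unexpected {self.look!r} in S")

    def parse_A(self):
        if self.look == "e":               # A → ef
            self.match("e")
            self.match("f")
        elif self.look == "h":             # A → ε, chosen because h is in Follow(A)
            pass
        else:
            raise ParseError(f"unexpected {self.look!r} in A")

    def parse_B(self):                     # B → hg
        self.match("h")
        self.match("g")

    def parse_C(self):
        if self.look == "f":               # C → fi
            self.match("f")
            self.match("i")
        elif self.look == "g":             # C → DD
            self.parse_D()
            self.parse_D()
        else:
            raise ParseError(f"unexpected {self.look!r} in C")

    def parse_D(self):                     # D → g
        self.match("g")


def accepts(text):
    parser = Parser(text)
    try:
        parser.parse_S()
        parser.match("$")
        return True
    except ParseError:
        return False


print([accepts(s) for s in ("efhg", "hg", "fif", "ggf", "ef")])
```

Each blank entry of the table corresponds to a branch that raises ParseError, which is where an error routine could be placed.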
Example 2
Given the following CFG, create the LL(1) parse table of the CFG:
Terminals = { id, num, (, ), ;, if, else, , }
Non-Terminals = { S, L, C, E }
Rules = (1) S → id(L);
(2) S → if(E) S else S
(3) L → ε
(4) L → E C
(5) C → ε
(6) C → , E C
(7) E → id
(8) E → num
Start Symbol = S'
Find the First and Follow sets of each non-terminal.
Non-terminal   First            Follow
S              {id, if}         {$, else}
L              {id, num, ε}     {)}
C              {,, ε}           {)}
E              {id, num}        {), ,}
Given the above First and Follow sets, the parse table for the CFG is created as follows:
     id            num        (    )        ;    if                    else   ,
S    S → id(L);                                  S → if(E) S else S
L    L → EC        L → EC          L → ε
C                                  C → ε                                      C → ,EC
E    E → id        E → num
Note that we only need to compute Follow sets for an LL(1) parser if at least one First set contains ε.
o Follow sets are only used in the creation of the parse table for rules of the form S → γ, where
First(γ) contains ε.
o Follow sets are not necessary if no such rule exists. However, if at least one such rule exists,
then we still need to create the Follow sets of all non-terminals in the grammar.
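The way First and Follow sets fill the parse table can be sketched in Python as follows. This is a minimal sketch, assuming the First and Follow sets of Example 2 are entered by hand; build_table and first_of_string are illustrative names, and an empty body stands for ε.

```python
EPS = "ε"

RULES = [
    ("S", ["id", "(", "L", ")", ";"]),
    ("S", ["if", "(", "E", ")", "S", "else", "S"]),
    ("L", []),                       # L → ε
    ("L", ["E", "C"]),
    ("C", []),                       # C → ε
    ("C", [",", "E", "C"]),
    ("E", ["id"]),
    ("E", ["num"]),
]

# First and Follow sets copied from the table in Example 2.
FIRST = {"S": {"id", "if"}, "L": {"id", "num", EPS},
         "C": {",", EPS}, "E": {"id", "num"}}
FOLLOW = {"S": {"$", "else"}, "L": {")"}, "C": {")"}, "E": {")", ","}}


def first_of_string(symbols):
    result = set()
    for sym in symbols:
        if sym not in FIRST:         # terminal
            result.add(sym)
            return result
        result |= FIRST[sym] - {EPS}
        if EPS not in FIRST[sym]:
            return result
    result.add(EPS)
    return result


def build_table(rules):
    table = {}
    for head, body in rules:
        first = first_of_string(body)
        lookaheads = first - {EPS}
        if EPS in first:             # only here are Follow sets needed
            lookaheads |= FOLLOW[head]
        for token in lookaheads:
            if (head, token) in table:
                raise ValueError(f"not LL(1): conflict at ({head}, {token})")
            table[head, token] = body
    return table


TABLE = build_table(RULES)
for (head, token), body in sorted(TABLE.items()):
    print(head, token, "→", " ".join(body) or EPS)
```

A ValueError from build_table signals two rules competing for the same table entry, which means the grammar is not LL(1).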
1. Defining expressions: - the straightforward definition of expressions will often lead to ambiguity, such
as the one that we have seen above.
2. Defining complex variables: - complex variables, such as instance variables in classes, fields in records or
structures, array subscripts and pointer references, can also lead to ambiguity. Example V → id|V.V
3. Overlap between specific and general cases: For example, CFG
Terminals = { id, +, -, *, /, %, (, ) }
Non-Terminals = {E, T, F}
Rules =(0) E → E+T | E-T | T | id
(1) T→ T*F | T/F | T%F | F | id
(3) F→ (E) | id
Start Symbol=E
Compiled by: Dawit K. 10
Compiler Design
The string id has several leftmost derivations (and hence several parse trees): E → id, E → T → id, and
E → T → F → id.
4. Nesting statements: - the most common instance of nesting statements causing ambiguity is the
infamous "dangling else".
Removing Left Recursion
Consider the following expression grammar:
Terminals = { id, +, -, *, /, (, ) }
Non-Terminals = {E, T, F}
Rules = (1) E → E +T
(2) E → E - T
(3) E → T
(4) T → T*F
(5) T → T/F
(6) T → F
(7) F → (E)
(8) F → id
Start Symbol = E
Though this CFG is unambiguous, it is not LL(1). In order for a CFG to be LL(1), it must
be possible to decide which rule to apply after looking at only the next input symbol.
On seeing an id, we cannot tell whether we should apply rule (1), (2), or (3).
The problem with this CFG is that rules (1), (2), (4) and (5) are left-recursive.
A rule S → α (where S is a non-terminal and α is a string of terminals and non-terminals) is left-recursive
if the first symbol in α is S.
Given the rules S → Sα and S → β, any string that can be derived from S will be a string that can be derived
from β followed by zero or more strings that can be derived from α. Using EBNF notation, we have:
S → β(α)*
Using CFG notations, we have:
S → βS′
S′ → α S′
S′ → ε
We have removed the left-recursion in the above example!!
In general, the set of rules of the form:
S → Sα1 | Sα2 | Sα3| ...... |Sαn
S → β1 | β2 | β3 | …. |βn
Can be rewritten as:
S → β1S′│β2 S′│β3 S′│…. │βn S′
S′ → α1 S′│α2 S′│α3 S′│ .... │αn S′
S′ → ε
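The general rewrite above can be sketched as a small Python function. This is an illustrative sketch only: it handles immediate left recursion for a single non-terminal, represents each body as a list of symbols (an empty list is ε), and names the new non-terminal by appending an apostrophe.

```python
def remove_left_recursion(head, bodies):
    """Rewrite S → Sα1 | ... | Sαn | β1 | ... | βm without immediate left recursion."""
    alphas = [body[1:] for body in bodies if body and body[0] == head]
    betas = [body for body in bodies if not body or body[0] != head]
    if not alphas:                       # nothing to do
        return {head: bodies}
    new = head + "'"
    return {
        head: [beta + [new] for beta in betas],               # S  → βi S'
        new: [alpha + [new] for alpha in alphas] + [[]],      # S' → αi S' | ε
    }


result = remove_left_recursion("E", [["E", "+", "T"], ["E", "-", "T"], ["T"]])
for head, bodies in result.items():
    for body in bodies:
        print(head, "→", " ".join(body) or "ε")
```

Applied to E → E+T | E-T | T, the function returns the rules E → T E' and E' → + T E' | - T E' | ε.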
Let’s take a closer look at the expression grammar:
E→E+T
E→E-T
E→T
Using the above transformation, we get the following CFG, which has no left-recursion:
E → TE'
E' → +TE'
E' → -TE'
E' → ε
Using EBNF notation, we have:
E → T((+T)│(-T))*
Left Factoring
❖ Even if a CFG is unambiguous and has no left-recursion, it still may not be LL (1).
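❖ Consider the following two Fortran do statements:
Fortran:                          Java/C equivalent:
do var = initial, final           for (var=initial; var<=final; var++)
do var = initial, final, inc      for (var=initial; var<=final; var+=inc)
❖ A CFG for these statements is:
S → do L S
L → id = exp, exp
L → id = exp, exp, exp
❖ This CFG is not LL(1). Why? Because there are two rules for L, and both begin with the same symbols. We
cannot tell which rule to use by looking only at the next input symbol.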
❖ We can fix this problem by left-factoring the similar section of the rule as follows:
S → do LS
L → id = exp, exp L'
L' → , exp
L' → ε
❖ Using EBNF notations, the Fortran do statement can also be written as follows:
S → do LS
L → id = exp, exp (, exp)?
❖ In general, if we have the following context-free grammar, where α, γ and βk stand for strings of terminals and
non-terminals:
S → αβ1 | αβ2 | αβ3 | ... | αβn | γ
We could left-factor it to get the CFG:
S → α S′ | γ
S′→ β1 | β2 | β3 | ... | βn
Using EBNF notation, we get:
S → α(β1│β2│β3│…│βn) | γ
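One pass of left factoring can be sketched in Python as follows. This is a simplified sketch under stated assumptions: alternatives are grouped by their first symbol, the longest common prefix of each group plays the role of α, and the new non-terminal is not factored again. The function names are illustrative, and the terminal a stands in for α ("other statement") in the dangling-else grammar.

```python
def common_prefix(bodies):
    """Longest prefix shared by every body in the list."""
    prefix = []
    for column in zip(*bodies):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


def left_factor(head, bodies):
    """Factor alternatives of `head` that start with the same symbol."""
    groups = {}
    for body in bodies:
        key = body[0] if body else None
        groups.setdefault(key, []).append(body)

    result = {head: []}
    count = 0
    for key, group in groups.items():
        if key is None or len(group) == 1:      # nothing shared: keep as is (γ)
            result[head].extend(group)
            continue
        count += 1
        new = head + "'" * count                # S', S'', ...
        alpha = common_prefix(group)
        result[head].append(alpha + [new])      # S  → α S'
        result[new] = [body[len(alpha):] for body in group]   # S' → β1 | ... | βn
    return result


factored = left_factor("S", [
    ["i", "E", "t", "S"],
    ["i", "E", "t", "S", "e", "S"],
    ["a"],
])
for head, bodies in factored.items():
    for body in bodies:
        print(head, "→", " ".join(body) or "ε")
```

For the grammar S → i E t S | i E t S e S | a, the result is S → i E t S S' | a with S' → ε | e S.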
❖ Example 4 The following grammar abstracts the "dangling-else" problem:
S → i E t S | i E t S e S| α
E→b
Here i, t, and e stand for if, then, and else; E and S stand for "conditional
expression" and "statement." Left-factored, this grammar becomes:
S → i E t S S′ | α
S′ → e S | ε
E→b
Bottom-up Parsing
• Bottom-up parsing corresponds to the construction of a parse tree for an input token string,
beginning at the leaves (the bottom) and working up towards the root (the top).
• Example 5 Given the grammar:
E→T
T→T*F
T→F
F → id
Construct a bottom-up parse of the token stream id * id.
Reduction
➢ We can think of bottom-up parsing as the process of “reducing” a token string to the start
symbol of the grammar.
➢ At each reduction, the token string matching the RHS of a production is replaced by the LHS non-
terminal of that production.
➢ The key decisions during bottom-up parsing are about when to reduce and about what
production to apply.
➢ The above figure illustrates a sequence of reductions; the grammar is the expression grammar in
Example 5. The reductions will be discussed in terms of the sequence of strings
id * id, F * id, T * id, T * F, T, E
Shift-reduce Parsing
➢ Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols
and an input buffer holds the rest of the tokens to be parsed.
➢ We use $ to mark the bottom of the stack and also the end of the input. Initially, the stack is
empty, and the string w is on the input, as follows:
STACK INPUT
$ w$
➢ During a left-to-right scan of the input tokens, the parser shifts zero or more input tokens into
the stack, until it is ready to reduce a string β of grammar symbols on top of the stack.
➢ There are actually four possible actions a shift-reduce parser can make
• Shift: shift the next input token onto the top of the stack.
• Reduce: the right end of the string to be reduced must be at the top of the stack.
Locate the left end of the string within the stack and decide what non-terminal to
replace that string.
• Accept: announce successful completion of parsing.
• Error: discover a syntax error and call an error recovery routine.
➢ A shift-reduce parser steps through a sequence of these actions when parsing the input string id1 * id2
according to the expression grammar in Example 5.
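The sequence of actions for id * id can be traced with the small Python sketch below. It is not a general shift-reduce parser: the decisions about when to reduce are hand-coded for the grammar of Example 5 (E → T, T → T * F | F, F → id), whereas a real LR parser takes them from a table. The function name shift_reduce is an assumption made for this illustration.

```python
def shift_reduce(tokens):
    """Return the list of actions taken while parsing `tokens`."""
    stack, rest, actions = [], list(tokens) + ["$"], []
    while True:
        look = rest[0]
        if stack[-3:] == ["T", "*", "F"]:
            stack[-3:] = ["T"]
            actions.append("reduce T → T * F")
        elif stack[-1:] == ["F"]:
            stack[-1:] = ["T"]
            actions.append("reduce T → F")
        elif stack[-1:] == ["id"]:
            stack[-1:] = ["F"]
            actions.append("reduce F → id")
        elif stack[-1:] == ["T"] and look == "$":
            stack[-1:] = ["E"]            # reduce to E only at the end of the input
            actions.append("reduce E → T")
        elif stack == ["E"] and look == "$":
            actions.append("accept")
            return actions
        elif look != "$":
            stack.append(rest.pop(0))     # shift the next token onto the stack
            actions.append(f"shift {look}")
        else:
            raise ValueError(f"syntax error with stack {stack}")


for step in shift_reduce(["id", "*", "id"]):
    print(step)
```

The stack contents after each reduction follow the sequence F * id, T * id, T * F, T, E given in the Reduction section.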
LR Parsers
➢ The most prevalent type of bottom-up parser today is based on a concept called LR(k) parsing.
➢ The "L" is for left-to-right scanning of the input, the "R" for constructing a rightmost derivation
in reverse, and the k for the number of input symbols of lookahead that are used in making parsing
decisions.
➢ An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse.
➢ States represent sets of items.
LR (0) Item
➢ An LR(0) item of a grammar G is a production of G with a dot at some position of the body.
➢ The dot symbol ‧ in an item may appear anywhere in the right-hand side of a production.
➢ It marks how much of the production has already been matched.
➢ The production A → XYZ yields the following four items; the fourth one is called the final item.
A → ‧XYZ
A → X ‧ YZ
A → XY ‧ Z
A → XYZ ‧
➢ The production A → λ generates only one item, A → ‧.
LR (0) Item Closure
• If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by the 2 rules:
1) Initially, add every item in I to CLOSURE(I)
2) If A → α‧B β is in CLOSURE(I) and B → γ is a production, then add B → ‧γ to
CLOSURE(I), if it is not already there.
Apply this until no more new items can be added.
Example 6 E’ → E
E → E + T| T
T → T * F | F
F → (E) | id
I is the set of one item {E’ → ‧E}.
Find CLOSURE(I),
Solution:
o First, E’ → ‧E is put in CLOSURE(I) by rule 1.
o Then, E-productions with dots at the left end:
▪ E → ‧E + T and E → ‧T.
o Now, there is a T immediately to the right of a dot in E → ‧T, so we add T → ‧T * F
and T → ‧F.
o Next, T → ‧F forces us to add: F → ‧(E) and F → ‧id.
Example 7 on closure
S→E$
E→E+T | T
T → ID | (E)
closure (S→‧E$) ={S→‧E$,
E→‧E+T,
E→‧T,
T→‧ID,
T→‧(E)}
The five items above form an item set called state s0.
➢ The closure can therefore be computed as follows:
SetOfItems Closure(I) {
    J = I;
    repeat
        for (each item A → α‧Bβ in J)
            for (each production B → γ of G)
                if (B → ‧γ is not in J)
                    add B → ‧γ to J;
    until no more items are added to J;
    return J;
} // end of Closure(I)
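A runnable version of the Closure pseudocode might look like the Python sketch below. The representation is an assumption made for illustration: an item is a tuple (head, body, dot position), and GRAMMAR holds the grammar of Example 6.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """Add B → ‧γ for every non-terminal B that appears right after a dot."""
    result = set(items)
    worklist = list(items)
    while worklist:                          # stops when no new item is added
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production in GRAMMAR[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


def show(item):
    head, body, dot = item
    symbols = list(body)
    symbols.insert(dot, "‧")
    return f"{head} → {' '.join(symbols)}"


I0 = closure({("E'", ("E",), 0)})
for line in sorted(show(item) for item in I0):
    print(line)
```

For I = {E' → ‧E} the sketch yields seven items: the starting item plus the two E-productions, the two T-productions and the two F-productions, each with the dot at the left end.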
GOTO function
➢ The GOTO function is used to define the transitions in the LR(0) automaton for a grammar.
➢ GOTO(I, X) = I′, where I is a set of items and X is a grammar symbol. I′ is defined to be the
closure of the set of all items [A → 𝛼𝑋‧𝛽] such that [A → 𝛼‧𝑋𝛽] is in I.
➢ Example 8: - If I is the set of two items { [E' → E·] , [E → E· + T] } , then
GOTO (I, +) contains the items
E → E + ·T
T → ·T * F
T → ·F
F → · (E)
F → ·id
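The GOTO computation of Example 8 can be reproduced with the following Python sketch. The item representation (head, body, dot position) and the function names are assumptions made for illustration; the grammar is the expression grammar of Example 6.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production in GRAMMAR[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {
        (head, body, dot + 1)
        for head, body, dot in items
        if dot < len(body) and body[dot] == symbol
    }
    return closure(moved)


I = {("E'", ("E",), 1), ("E", ("E", "+", "T"), 1)}   # {E' → E‧, E → E‧+T}
J = goto(I, "+")
print(len(J))
```

Only the item E → E‧+T has + after the dot, so the result is the closure of {E → E+‧T}, which contains the five items listed in Example 8.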
➢ Each state in the transition diagram either signals a shift (‧ moves to the right of a terminal) or
signals a reduce (reducing the RHS handle to the LHS).
➢ Second, construct the LR(0) parsing table. The table has two parts, Action and GOTO, with grammar
symbols as column headers and state numbers as row headers. The Action part has only
terminals as column names, and the GOTO part has only nonterminals.
States Action GOTO
Id $ S
0 S1 2
1 R1 R1
2 S3
3 A
➢ The blanks in the table above indicate errors; S stands for shift, A for accept, and R1 for reduce by Rule 1.
➢ Example 10: For the given grammar, build the LR(0) parser:
S → E $ r1
E → E+T r2
|T r3
T → id r4
| (E) r5.
Solution: LR (0) Transition Diagram
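The states and transitions of such a diagram can be computed with the Python sketch below. It is a minimal sketch, not the solution from the original slides: items are tuples (head, body, dot position), $ is treated as an ordinary terminal of rule r1, and the state numbers it assigns depend on the order in which symbols are explored, so they may differ from a hand-drawn diagram.

```python
GRAMMAR = {
    "S": [("E", "$")],                       # r1
    "E": [("E", "+", "T"), ("T",)],          # r2, r3
    "T": [("id",), ("(", "E", ")")],         # r4, r5
}


def closure(items):
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production in GRAMMAR[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


def goto(items, symbol):
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)


def canonical_collection(start_item):
    """Build all LR(0) item sets reachable from the start item."""
    states = [frozenset(closure({start_item}))]
    transitions = {}
    index = 0
    while index < len(states):               # states grows as new sets are found
        state = states[index]
        symbols = {b[d] for _, b, d in state if d < len(b)}
        for symbol in sorted(symbols):
            target = frozenset(goto(state, symbol))
            if target not in states:
                states.append(target)
            transitions[index, symbol] = states.index(target)
        index += 1
    return states, transitions


STATES, TRANSITIONS = canonical_collection(("S", ("E", "$"), 0))
print(len(STATES), "states,", len(TRANSITIONS), "transitions")
```

Transitions on terminals become shift entries in the Action part of the table, and transitions on nonterminals become GOTO entries.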
LR (1) Parsing
❖ The reason why the FOLLOW set does not work as well as one might wish is that:
o it replaces the look-ahead of a single item of a rule N in a given LR state by the
whole FOLLOW set of N,
o which is the union of all the look-aheads of all alternatives of N in all states.
❖ Solution: use LR(1) items, where an LR(1) item is an LR(0) item plus a look-ahead.
❖ LR(1) item sets are more discriminating:
❖ A look-ahead set is kept with each separate item, to be used to resolve conflicts when a
reduce item has been reached.
❖ This greatly increases the strength of the parser, but also the size of its tables.
❖ An LR(1) item is of the form:
A → X1…Xi ‧ Xi+1…Xj, l   where l belongs to Vt ∪ {λ}
l is the look-ahead, Vt is the vocabulary of terminals, and λ is the look-ahead after the end marker $.
❖ Rules for look-ahead sets:
1) initial item set: the look-ahead set of the initial item set S0 contains only one
token, the end-of-file token ($), the only token that follows the start symbol.
2) other item set:
Given P → α‧Nβ {σ}, we have
N → ‧γ {FIRST(β{σ}) } in the item set.
❖ The LR (1) look-ahead set FIRST(β{σ}) is:
If β can produce λ (β →* λ),
FIRST(β{σ}) is: FIRST(β) plus the tokens in {σ}, excludes λ.
else
FIRST(β{σ}) just equals FIRST(β);
❖ Unlike LR(0), which puts the reduce move in the entire row, and SLR(1), which puts the
reduce move under the FOLLOW set, CLR(1) puts the reduce move only under the look-ahead set.
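The look-ahead rule for LR(1) items can be sketched in Python as follows. This is an illustration under stated assumptions: an item is a tuple (head, body, dot position, look-ahead token), the grammar is S → A | xb, A → aAb | B, B → x with an added start rule S' → S, and all function names are hypothetical.

```python
EPS = "ε"

GRAMMAR = {
    "S'": [("S",)],                      # added start rule (an assumption)
    "S": [("A",), ("x", "b")],
    "A": [("a", "A", "b"), ("B",)],
    "B": [("x",)],
}


def first_of_string(symbols, first):
    result = set()
    for sym in symbols:
        if sym not in GRAMMAR:           # terminal, including $
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                new = first_of_string(body, first)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first


FIRST = compute_first()


def closure_lr1(items):
    """For P → α‧Nβ with look-ahead σ, add N → ‧γ with each token of FIRST(βσ)."""
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot, look = worklist.pop()
        if dot == len(body) or body[dot] not in GRAMMAR:
            continue
        # σ is a terminal, so FIRST(βσ) never contains ε
        lookaheads = first_of_string(body[dot + 1:] + (look,), FIRST)
        for production in GRAMMAR[body[dot]]:
            for token in lookaheads:
                item = (body[dot], production, 0, token)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


S0 = closure_lr1({("S'", ("S",), 0, "$")})
INNER = closure_lr1({("A", ("a", "A", "b"), 1, "$")})
print(len(S0), len(INNER))
```

In S0 every item carries the look-ahead $, while the items added after A → a‧Ab carry the look-ahead b, because b is what follows the inner A. A reduce by B → x is therefore entered only under the look-ahead stored with the item, not under the whole FOLLOW set of B.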
Example 13: For the given grammar, construct the CLR(1) parser:
S→ A | xb r1,2
A→ aAb | B r3,4
B→ x r5