
PANIMALAR ENGINEERING COLLEGE

Department of CSE
CS8602 - COMPILER DESIGN
UNIT II SYNTAX ANALYSIS
2.1. Role of Parser
2.1.1 Grammars
2.1.2 Error Handling
2.2. Context-free grammars
2.3. Writing a grammar
2.4. Top Down Parsing
2.4.1 General Strategies Recursive Descent Parser
2.4.2 Predictive Parser-LL(1) Parser
2.5. Shift Reduce Parser
2.6. LR Parser
2.7. LR(0) Items - Construction of SLR Parsing Table
2.8. Introduction to LALR Parser
2.9. Error Handling and Recovery in Syntax Analyzer
2.10. YACC.

SYNTAX ANALYSIS
Introduction:
• Syntax analysis is done by the parser.
• The parser detects whether the program is written following the grammar rules and reports syntax errors.
• It produces a parse tree from which intermediate code can be generated.
Advantages of grammar for syntactic specification:
1. A grammar gives a precise and easy-to-understand syntactic specification of a
programming language.
2. An efficient parser can be constructed automatically from a properly designed grammar.
3. A grammar imparts a structure to a source program that is useful for its translation into
object code and for the detection of errors.
4. New constructs can be added to a language more easily when there is a grammatical
description of the language.
2.1. Role of Parser
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar for the source language.
It reports any syntax errors in the program. It also recovers from commonly occurring errors
so that it can continue processing its input.

Figure 2.1 : Position of parser in compiler model

Role of the Parser:
• Parser builds the parse tree.
• Parser verifies the structure generated by the tokens based on the grammar (performs context-free syntax analysis).
• Parser helps to construct intermediate code.
• Parser produces appropriate error messages.
• Parser performs error recovery.
Issues :
Parser cannot detect errors such as:
1. Variable re-declaration.
2. Use of a variable before it is initialized.
3. Data type mismatch for an operation.
The above issues are handled by the Semantic Analysis phase.
Three General types of Parsers for grammars:
1. Universal parsing methods such as the Cocke-Younger-Kasami
algorithm and Earley's algorithm can parse any grammar. These general methods are,
however, too inefficient to use in production compilers.
2. Top-down Parsing builds parse trees from the top (root) to the bottom (leaves).
3. Bottom-up parsing starts from the leaves and works its way up to the root.
The last two methods are commonly used in compilers and the input to the parser is
scanned from left to right, one symbol at a time.
2.1.1 Grammars
Constructs that begin with keywords like while or int are relatively easy to parse, because the keyword guides the choice of the grammar production that must be applied to match the input. Grammars for expressions present more of a challenge, because of the associativity and precedence of operators. In the following expression grammar, E represents expressions consisting of terms separated by + signs, T represents terms consisting of factors separated by * signs, and F represents factors that can be either parenthesized expressions or identifiers.

E → E + T | T
T → T * F | F
F → (E) | id          _________(2.1)
The above grammar is used for Bottom up parsing but it cannot be used for Top down
parsing because it is left recursive.
2.1.2 Error Handling :
Programs can contain errors at many different levels. For example:
1. Lexical, such as misspelling of identifiers, keywords, or operators.
2. Syntactic, such as misplaced semicolons, extra or missing braces, or a case statement without an enclosing switch statement.
3. Semantic, such as type mismatches between operators and operands.
4. Logical, such as an assignment operator = instead of the comparison operator ==, or an infinitely recursive call.
Functions of error handler :
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs .

Error-Recovery Strategies:
Once an error is detected, the parser should recover from it. The following recovery strategies are used:
1. Panic-mode: On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as a semicolon or }, whose role in the source program is clear and unambiguous.
2. Phrase-level: When an error is identified, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction is to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon. Phrase-level replacement has been used in several error-repairing compilers, as it can correct any input string. Its major drawback is the difficulty it has in coping with situations in which the actual error occurred before the point of detection.
3. Error-productions: The common errors that might be encountered are anticipated, and the grammar for the language at hand is augmented with productions that generate the erroneous constructs.
4. Global-correction: Ideally, a compiler should make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string y, such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. These methods are in general too costly to implement in terms of time and space.
2.2. Context-free Grammars
Grammars are used to systematically describe the syntax of programming language
constructs like expressions and statements. Using
a syntactic variable stmt to denote statements and variable expr to denote
expressions, the production
stmt → if (expr) stmt else stmt
specifies the structure of this form of conditional statement.
2.2.1 The Formal Definition of a Context-Free Grammar
A context-free grammar G is defined by the 4-tuple G = (V, T, P, S) where
1. V is a finite set of non-terminals (variables).
2. T is a finite set of terminals.
3. S is the start symbol (S ∈ V).
4. The productions of a grammar specify the manner in which the terminals and
nonterminals can be combined to form strings. Each production consists of:
(a) A nonterminal called the head or left side of the production; this
production defines some of the strings denoted by the head.
(b) The symbol →. Sometimes ::= has been used in place of the arrow.
(c) A body or right side consisting of zero or more terminals and nonterminals. The
components of the body describe one way in which
strings of the nonterminal at the head can be constructed
Example 1: The following grammar in Fig.2.2 defines simple arithmetic expressions.
expression → expression + term
expression → expression - term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id
Fig.2.2: Grammar for simple arithmetic expressions
Here, Start symbol : expression
Terminal : + , - , * , / , ( , ) , id
Non Terminal : expression, term, factor
2.2.2 Notational Conventions
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *,- , / and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0 , 1 , . . . ,9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.
2. These symbols are non-terminals:
(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be used to represent
nonterminals for the constructs. For example, nonterminals for expressions, terms, and
factors are often represented by E, T, and F, respectively.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that
is, either nonterminals or terminals.
4. Lowercase letters late in the alphabet, u,v,..., z, represent (possibly empty) strings of
terminals.
5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols. Thus, a generic production can be written as A → α, where A is the head and α the body.
6. A set of productions A → α1, A → α2, …, A → αk with a common head A (call them A-productions) may be written A → α1 | α2 | … | αk.
7. Unless stated otherwise, the head of the first production is the start symbol.

Example: Using these conventions, the grammar of Example 1 can be rewritten concisely as
E → E + T | E - T | T
T → T * F | T / F | F
F → (E) | id
Start symbol : E
Terminal : + , - , * , / , ( , ) , id
Non Terminal : E , T , F
2.2.3 DERIVATIONS:
Derivation is a process that generates a valid string from the grammar by repeatedly replacing a non-terminal with the string on the right side of one of its productions.
Example : Consider the following grammar for arithmetic expressions :
E → E+E | E*E | ( E ) | - E | id
To generate a valid string - ( id+id ) from the grammar the steps are

1. E → - E
2. E → - ( E )
3. E → - ( E+E )
4. E → - ( id+E )
5. E → - ( id+id )
In the above derivation,
• E is the start symbol.
• -(id+id) is the required sentence (only terminals).
• Strings such as E, -E, -(E), . . . are called sentential forms.

For a general definition of derivation, consider a nonterminal A in the middle of a sequence of grammar symbols, as in αAβ, where α and β are arbitrary strings of grammar symbols. Suppose A → γ is a production. Then we write αAβ ⇒ αγβ. The symbol ⇒ means "derives in one step." When a sequence of derivation steps α1 ⇒ α2 ⇒ … ⇒ αn rewrites α1 to αn, we say α1 derives αn. The symbol ⇒* means "derives in zero or more steps"; likewise, ⇒+ means "derives in one or more steps."


If S ⇒* α, where S is the start symbol of a grammar G, we say that α is a sentential form of G. A sentential form may contain both terminals and nonterminals, and may be empty. A sentence of G is a sentential form with no nonterminals. A language that can be generated by a grammar is said to be a context-free language. If two grammars generate the same language, the grammars are said to be equivalent.
Types of derivations:
The two types of derivation are:
1. Left most derivation
2. Right most derivation.
In leftmost derivations, the leftmost non-terminal in each sentential form is always chosen first for replacement.
In rightmost derivations, the rightmost non-terminal in each sentential form is always chosen first for replacement. Rightmost derivations are sometimes called canonical derivations.
Example:
Given grammar G : E → E+E | E*E | ( E ) | - E | id
Sentence to be derived : – (id+id)
LEFTMOST DERIVATION RIGHTMOST DERIVATION
E→-E E→-E
E→-(E) E→-(E)
E → - ( E+E ) E → - (E+E )
E → - ( id+E ) E → - ( E+id )
E → - ( id+id ) E → - ( id+id )
Strings that appear in a leftmost derivation are called left sentential forms.
Strings that appear in a rightmost derivation are called right sentential forms.
Example: Consider the context-free grammar (CFG) G = ({S}, {a, b, c}, P, S) where P = {S → SbS | ScS | a}. Derive the string "abaca" by leftmost derivation and rightmost derivation.
Leftmost derivation for "abaca"
S =lm=> SbS
  =lm=> abS       (using rule S → a)
  =lm=> abScS     (using rule S → ScS)
  =lm=> abacS     (using rule S → a)
  =lm=> abaca     (using rule S → a)
Rightmost derivation for "abaca"
S =rm=> ScS
  =rm=> Sca       (using rule S → a)
  =rm=> SbSca     (using rule S → SbS)
  =rm=> Sbaca     (using rule S → a)
  =rm=> abaca     (using rule S → a)
Sentential forms:
Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-terminals or terminals, then α is called a sentential form of G.
Yield or frontier of tree:
Each interior node of a parse tree is a non-terminal. The leaves, which may be terminals or non-terminals, read from left to right form a sentential form; the sentential form read off the parse tree is called the yield or frontier of the tree.

2.2.4 Parse Trees and Derivations


A parse tree is a graphical representation of a derivation that filters out the
order in which productions are applied to replace nonterminals. Each interior node of a
parse tree represents the application of a production. The interior node is labeled with the
nonterminal A in the head of the production; the children of the node are labeled, from left
to right, by the symbols in the body of the production by which this A was replaced during
the derivation.
For example, the parse tree for -(id + id) in Fig. 2.3 results from the derivation of -(id + id) given above.
The leaves of a parse tree are labeled by nonterminals or terminals and, read from left to right, constitute a sentential form, called the yield or frontier of the tree.

Figure 2.3: Parse tree for –(id+id)

Figure 2.4: Sequence of parse trees for derivation -(id+id)
2.2.5 Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be ambiguous. Put another way, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for the same sentence. Thus, a grammar G is ambiguous if it has more than one parse tree (in LMD or in RMD) for at least one string.

Example: Given grammar G : E → E+E | E*E | ( E ) | - E | id

The sentence id+id*id has the following two distinct leftmost derivations:
Leftmost Derivations -1 Leftmost Derivations -2
E → E+ E E → E* E
E → id + E E→E+E*E
E → id + E * E E → id + E * E
E → id + id * E E → id + id * E
E → id + id * id E → id + id * id
The two corresponding parse trees are :

Figure 2.5: Two parse trees for id+id*id


The given grammar has two distinct leftmost derivations so it is said to be an ambiguous
grammar.
2.2.6 Verifying the Language Generated by a Grammar
A proof that a grammar G generates a language L has two parts:
1. Show that every string generated by G is in L.
2. Conversely, show that every string in L can indeed be generated by G.

2.2.7 Context-Free Grammars Versus Regular Expressions
Every construct that can be described by a regular expression can be described by a
grammar, but not vice-versa. Alternatively, every regular language is a context-free
language, but not vice-versa.
For example, the regular expression (a | b) * abb and the grammar
A0 → aA0 | bA0 | aA1
A1 → bA2
A2 → bA3
A3 → ε
describe the same language, the set of strings of a's and b's ending in abb.

Figure 2.6: NFA for (a|b)*abb

The rules to mechanically construct a grammar that recognizes the same language as a nondeterministic finite automaton (NFA) are given below:
1. For each state i of the NFA, create a nonterminal Ai.
2. If state i has a transition to state j on input a, add the production Ai → aAj. If state i goes to state j on input ε, add the production Ai → Aj.
3. If i is an accepting state, add Ai → ε.
4. If i is the start state, make Ai the start symbol of the grammar.
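The construction above is mechanical enough to express directly in code. The following is a minimal Python sketch (the function name nfa_to_grammar and the dictionary encoding of the NFA are illustrative assumptions, not from the text), applied to the NFA of Fig.2.6 for (a|b)*abb:

# A sketch of rules 1-4: build a grammar from an NFA's transition table.
def nfa_to_grammar(transitions, start, accepting):
    """transitions: dict mapping (state, symbol) -> set of next states."""
    states = {s for (s, _) in transitions} | {t for v in transitions.values() for t in v}
    productions = {f"A{i}": [] for i in sorted(states)}
    for (i, sym), targets in transitions.items():
        for j in targets:
            body = f"A{j}" if sym == "ε" else f"{sym}A{j}"   # rule 2
            productions[f"A{i}"].append(body)
    for i in accepting:
        productions[f"A{i}"].append("ε")                     # rule 3
    return f"A{start}", productions                          # rule 4: start symbol

# NFA of Fig.2.6 for (a|b)*abb: states 0..3, accepting state 3.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
start, prods = nfa_to_grammar(nfa, 0, {3})
for head, bodies in prods.items():
    print(head, "→", " | ".join(bodies))   # reproduces the grammar above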

2.3. Writing a grammar


Grammars are capable of describing most, but not all, of the syntax of programming languages. For instance, the requirement that identifiers be declared before they are used cannot be described by a context-free grammar. Therefore, the sequence of tokens accepted by a parser forms a superset of the programming language; subsequent phases of the compiler must analyze the output of the parser to ensure compliance with rules that are not checked by the parser.
2.3.1 Lexical Versus Syntactic Analysis
Everything that can be described by a regular expression can also be described by a
grammar. So, "Why use regular expressions to define the lexical syntax of a language?".
There are several reasons.
1. Separating the syntactic structure of a language into lexical and non lexical parts
provides a convenient way of modularizing the front end of a compiler into two
manageable-sized components.
2. The lexical rules of a language are frequently quite simple, and to describe them we do
not need a notation as powerful as grammars.
3. Regular expressions generally provide a more concise and easier-to-understand
notation for tokens than grammars.
4. More efficient lexical analyzers can be constructed automatically from regular
expressions than from arbitrary grammars.

Regular expressions are most useful for describing the structure of constructs such as
identifiers, constants, keywords, and white space. Grammars, on the other hand, are
most useful for describing nested structures such as balanced parentheses, matching
begin-end's, corresponding if-then-else's, and so on. These nested structures cannot be
described by regular expressions.

2.3.2 Eliminating Ambiguity


Ambiguity, where the grammar produces more than one parse tree under leftmost or rightmost derivation, can often be eliminated by rewriting the grammar.
Consider the following "dangling else" grammar,
stmt → if expr then stmt | if expr then stmt else stmt | other
Here "other" stands for any other statement. According to this grammar, the compound
conditional statement
if E1 then S1 else if E2 then S2 else S3

Figure 2.7: Parse tree for “if E1 then S1 else if E2 then S2 else S3”
But the grammar is ambiguous since the string
if E1 then if E2 then S1 else S2
has the following two parse trees for leftmost derivation:
Parse Tree-1

Figure 2.8(a) : Parse tree for “if E1 then if E2 then S1 else S2”
Parse Tree-2

Figure 2.8(b) : Parse tree for “if E1 then if E2 then S1 else S2”

In all programming languages with conditional statements of this form, the first parse tree
(Fig.2.8(a)) is preferred. The general rule is, "Match each else with the closest unmatched
then".
The unambiguous grammar for the above grammar is given below.

Figure 2.9 : Unambiguous grammar for it-then-else statements

2.3.3 Elimination of Left Recursion


• A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
• Top-down parsing methods cannot handle left-recursive grammars.
• A simple rule for eliminating the immediate left recursion in A → Aα | β is to replace it with
A → βA'
A' → αA' | ε
Example to eliminate Immediate left recursion:
Consider the following grammar for arithmetic expressions:
E → E+T | T
T → T*F | F
F → (E) | id
Answer: Eliminate the left-recursive E- and T-productions by applying the rule above.
First eliminate the left recursion for E:
E → TE’
E’ → +TE’ | ε
Then eliminate for T as
T → FT’
T’→ *FT’| ε
Thus the obtained grammar after eliminating left recursion is
E → TE’
E’ → +TE’| ε
T → FT’
T’ → *FT’| ε
F → (E) | id
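The two-step substitution above (A → Aα | β becomes A → βA', A' → αA' | ε) can be sketched in a few lines of Python; the grammar encoding (production bodies as symbol lists) and the function name are illustrative assumptions:

# One step of immediate-left-recursion elimination for a single nonterminal.
def eliminate_immediate_left_recursion(head, bodies):
    """bodies: list of production bodies, each a list of grammar symbols."""
    alphas = [b[1:] for b in bodies if b and b[0] == head]      # A → Aα parts
    betas  = [b for b in bodies if not b or b[0] != head]       # A → β parts
    if not alphas:
        return {head: bodies}                                   # nothing to do
    new = head + "'"
    return {
        head: [b + [new] for b in betas],                       # A  → βA'
        new:  [a + [new] for a in alphas] + [["ε"]],            # A' → αA' | ε
    }

# E → E + T | T from the example above:
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], ['ε']]}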

The above rule does not eliminate left recursion involving derivations of two or more
steps. For example, consider the grammar
S  Aa | b
A  Ac | Sd | ɛ
The nonterminal S is left recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive.

Algorithm 2.1 systematically eliminates left recursion from a grammar. It works if the grammar has no cycles or ε-productions.

ALGORITHM 2.1 : Eliminating left recursion


INPUT: Grammar G with no cycles or ε - productions.
OUTPUT: An equivalent grammar with no left recursion.
METHOD: Apply the algorithm to G. Note that the resulting non-left-recursive grammar may
have ε -productions.
Arrange the nonterminals in some order A1, A2, …, An.
for (each i from 1 to n) {
    for (each j from 1 to i-1) {
        Replace each production of the form Ai → Ajγ by the productions
            Ai → δ1γ | δ2γ | … | δkγ
        where Aj → δ1 | δ2 | … | δk are all the current Aj-productions
    }
    Eliminate the immediate left recursion among the Ai-productions
}
Example: Consider the following grammar and eliminate its left-recursive productions.
S→Aa|b
A→Ac|Sd|ε
There is no immediate left recursion. To expose it, substitute the S-productions into the A-production:
A → Ac | Aad | bd | ε

A → A c | A a d | b d | ε is replaced by
A → b d A' | A'
A' → c A' | a d A' | ε
Therefore, finally we obtain grammar without left recursion,
S→Aa|b
A → b d A' | A'
A' → c A' | a d A' | ε

Example: Consider the grammar
A → ABd | Aa | a
B → Be | b
The grammar without left recursion is
A → a A'
A' → B d A' | a A' | ε
B → b B'
B' → e B' | ε
Example: Eliminate left recursion from the given grammar. A→ A c | A a d | b d | b c
After removing left recursion, the grammar becomes,
A → b d A' | b c A'
A'→ c A' | a d A'
A'→ ε

2.3.4 Left factoring


• Left factoring is a process of factoring out the common prefixes of two or more production alternatives for the same nonterminal.
• Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive or top-down parsing.
• Consider the following grammar:
stmt → if expr then stmt else stmt | if expr then stmt
• On seeing input if, it is not clear to the parser which production to use. So we can perform left factoring, where the general form is:
If we have A → αβ1 | αβ2, then we replace it with
A → αA'
A' → β1 | β2
• For the if-then-else grammar, after left factoring:
stmt → if expr then stmt S'
S' → else stmt | ε
Algorithm
INPUT: Grammar G.
OUTPUT: An equivalent left-factored grammar.
METHOD: For each nonterminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε - i.e., there is a nontrivial common prefix - replace all of the A-productions
A → αβ1 | αβ2 | … | αβn | γ, where γ represents all alternatives that do not begin with α, by
A → αA' | γ
A' → β1 | β2 | … | βn
Here A' is a new nonterminal. Repeatedly apply this transformation until no two alternatives for a nonterminal have a common prefix.

Example: Eliminate left factors from the given grammar: S → T + S | T
After left factoring, the grammar becomes
S → TS'
S' → + S | ε
Example: Left factor the following grammar.
S → aBcDeF | aBcDgg | F,   B → x,   D → y,   F → z
After left factoring, the grammar becomes
S → aBcDS' | F
S' → eF | gg
B → x
D → y
F → z
Uses: Left factoring is used in the predictive (top-down) parsing technique, as the sketch below illustrates.
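One pass of the left-factoring transformation can be sketched as follows; the encoding of production bodies as symbol lists and the helper name left_factor are illustrative assumptions. Applied to the last example, it produces the factored grammar shown above.

# One left-factoring pass for a single nonterminal (a sketch).
def left_factor(head, bodies):
    best = []                             # longest prefix shared by >= 2 bodies
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            k = 0
            while (k < len(bodies[i]) and k < len(bodies[j])
                   and bodies[i][k] == bodies[j][k]):
                k += 1
            if k > len(best):
                best = bodies[i][:k]
    if not best:
        return {head: bodies}             # no nontrivial common prefix
    new = head + "'"
    factored = [b[len(best):] or ["ε"] for b in bodies
                if b[:len(best)] == best]                 # the β parts
    rest = [b for b in bodies if b[:len(best)] != best]   # the γ alternatives
    return {head: [best + [new]] + rest, new: factored}

g = left_factor("S", [["a","B","c","D","e","F"], ["a","B","c","D","g","g"], ["F"]])
print(g)   # S → aBcDS' | F,  S' → eF | gg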

2.4. Top Down Parsing


• Top-down parsing can be viewed as the problem of constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first).
• Top-down parsing can be viewed as finding a leftmost derivation for an input string.
• Parsers are generally distinguished by whether they work top-down (start with the grammar's start symbol and construct the parse tree from the top) or bottom-up (start with the terminal symbols that form the leaves of the parse tree and build the tree from the bottom).
• Top-down parsers include recursive-descent and LL parsers, while the most common forms of bottom-up parsers are LR parsers.

Figure 2.10 : Types of Bottom-Up and Top-Down Parser

Example: The sequence of parse trees in Fig.2.11 for the input id+id*id is a top-down parse according to the grammar given below.
E → TE'
E' → + TE' | ε
T → FT'
T' → * FT' | ε
F → (E) | id

Figure 2.11 : Top-down parse for id + id * id

LL(k) class: The class of grammars for which we can construct predictive parsers looking k symbols ahead in the input. The class for which one symbol of lookahead suffices is called LL(1), and the corresponding parser is an LL(1) parser.

2.4.1 General Strategies Recursive Descent Parser


These parsers use a procedure for each nonterminal. The procedure looks at its input and decides which production to apply for its nonterminal. Terminals in the body of the production are matched to the input at the appropriate time, while nonterminals in the body result in calls to their procedures.
void A()
{
1)    Choose an A-production, A → X1 X2 … Xk;
2)    for ( i = 1 to k )
      {
3)        if ( Xi is a nonterminal )
4)            call procedure Xi();
5)        else if ( Xi equals the current input symbol a )
6)            advance the input to the next symbol;
7)        else /* an error has occurred */;
      }
}
Figure 2.12: A typical procedure for a nonterminal in a top-down parser
General recursive-descent may require backtracking; that is, it may require repeated scans
over the input. However, backtracking is rarely needed to parse programming language
constructs. To allow backtracking, the code of Fig.2.12 needs to be modified. First, we
cannot choose a unique A-production at line (1), so we must try each of several productions
in some order. Then, failure at line (7) is not ultimate failure, but suggests only that we need
to return to line (1) and try another A-production. Only if there are no more A-productions
to try do we declare that an input error has been found. In order to try another A-
production, we need to be able to reset the input pointer to where it was when we first
reached line (1) . Thus, a local variable is needed to store this input pointer for future use.
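For the grammar of the example that follows (S → cAd, A → ab | a), a backtracking recursive-descent parser can be sketched as below; the input handling and function names are illustrative assumptions.

# A backtracking recursive-descent sketch for S → cAd, A → ab | a.
def parse(w):
    pos = 0
    def match(t):                       # consume terminal t if it is next
        nonlocal pos
        if pos < len(w) and w[pos] == t:
            pos += 1
            return True
        return False
    def A():
        nonlocal pos
        saved = pos                     # save the input pointer (line (1))
        if match('a') and match('b'):   # try the first A-production, A → ab
            return True
        pos = saved                     # failure: backtrack and try A → a
        return match('a')
    def S():
        return match('c') and A() and match('d')
    return S() and pos == len(w)

print(parse("cad"))    # True: A → ab fails on 'd', parser backtracks to A → a
print(parse("cabd"))   # True: A → ab succeeds without backtracking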
Example : Consider the grammar
S → cAd
A → ab | a
To construct a parse tree top-down for the input string w = cad.
Step 1:
• Initially create a tree with a single node labeled S. An input pointer points to 'c', the first symbol of w.
• Expand the tree with the production of S (Fig.2.13(a)).
Step 2:
• The leftmost leaf 'c' matches the first symbol of w, so advance the input pointer to the second symbol of w, 'a', and consider the next leaf 'A'. Expand A using the first alternative, as in Fig.2.13(b).
Step 3:
• The second symbol 'a' of w also matches the second leaf of the tree. So advance the input pointer to the third symbol of w, 'd'. But the third leaf of the tree is b, which does not match the input symbol d.
• Hence discard the chosen production and reset the pointer to the second position. This is called backtracking.

Figure 2.13: Steps in Top-Down Parsing for input cad


Step 4:
• Now try the second alternative for A, as in Fig.2.13(c).
• Having produced a parse tree for w, we halt and announce successful completion of parsing.
2.4.1.1 FIRST( ) & FOLLOW( )
• FIRST(α) is defined as the set of terminals that begin strings derived from α.
• If α ⇒* ε, then ε is also in FIRST(α).
• In predictive parsing, when we have A → α | β, if FIRST(α) and FIRST(β) are disjoint sets then we can select the appropriate A-production by looking at the next input symbol.
• FOLLOW(A), for any nonterminal A, is the set of terminals a that can appear immediately after A in some sentential form.
Algorithm to compute FIRST(X)
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ɛ can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) = {X}.
2. If X is a nonterminal and X → Y1Y2…Yk is a production for some k ≥ 1, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1); that is, Y1…Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X). For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2), and so on.
3. If X → ε is a production, then add ε to FIRST(X).
Algorithm to compute FOLLOW(A)
To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can
be added to any FOLLOW set.
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
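The two fixed-point computations above can be sketched together in Python for the non-left-recursive expression grammar used in the next example. The dict-of-lists grammar encoding is an illustrative assumption; an ε-body is written as the empty list [].

# FIRST/FOLLOW by fixed-point iteration (a sketch; encodings are assumptions).
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] stands for the ε-body
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
START = "E"

def first_of(symbols, first):
    """FIRST of a string of grammar symbols (rule 2, left to right)."""
    out = set()
    for x in symbols:
        fx = first[x] if x in GRAMMAR else {x}    # rule 1: FIRST(a) = {a}
        out |= fx - {"ε"}
        if "ε" not in fx:
            return out
    return out | {"ε"}                            # every symbol derived ε

first = {a: set() for a in GRAMMAR}
follow = {a: ({"$"} if a == START else set()) for a in GRAMMAR}  # FOLLOW rule 1
changed = True
while changed:                                    # iterate until nothing grows
    changed = False
    for a, bodies in GRAMMAR.items():
        for body in bodies:
            f = first_of(body, first)             # grow FIRST(A)
            if not f <= first[a]:
                first[a] |= f; changed = True
            for i, b in enumerate(body):          # grow FOLLOW (rules 2 and 3)
                if b in GRAMMAR:
                    tail = first_of(body[i+1:], first)
                    add = (tail - {"ε"}) | (follow[a] if "ε" in tail else set())
                    if not add <= follow[b]:
                        follow[b] |= add; changed = True

print(first)    # FIRST(E') = {'+', 'ε'}, FIRST(F) = {'(', 'id'}, ...
print(follow)   # FOLLOW(E) = {')', '$'}, FOLLOW(T) = {'+', ')', '$'}, ...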

Example : To compute FIRST and FOLLOW


Consider again the non-left-recursive grammar
E → TE'
E' → + TE' | ε
T → FT'
T' → * FT' | ε
F → (E) | id

1. FIRST(F) = FIRST(T) = FIRST(E) = {(, id}. To see why, note that the two productions for F
have bodies that start with these two terminal symbols, id and the left parenthesis. T
has only one production, and its body starts with F. Since F does not derive ε, FIRST(T)
must be the same as FIRST(F). The same argument covers FIRST(E) .
2. FIRST(E') = {+, ε }. The reason is that one of the two productions for E' has a body that
begins with terminal +, and the other's body is ε. Whenever a nonterminal derives ε, we
place ε in FIRST for that nonterminal.
3. FIRST(T') = {*, ε}. The reasoning is analogous to that for FIRST(E').
4. FOLLOW(E) = FOLLOW(E') = { ), $}. Since E is the start symbol, FOLLOW(E) must contain
$. The production body (E) explains why the right parenthesis is in FOLLOW(E) . For E' ,
note that this nonterminal appears only at the ends of bodies of E-productions. Thus,
FOLLOW(E') must be the same as FOLLOW(E) .
5. FOLLOW(T) = FOLLOW(T') = {+, ), $}. Notice that T appears in bodies only followed by E'. Thus, everything except ε that is in FIRST(E') must be in FOLLOW(T); that explains the symbol +. However, since FIRST(E') contains ε, and E' is the entire string following T in the bodies of the E-productions, everything in FOLLOW(E) must also be in FOLLOW(T). That explains the symbols $ and the right parenthesis. As for T', since it appears only at the ends of the T-productions, it must be that FOLLOW(T') = FOLLOW(T).
6. FOLLOW(F) = {+, *, ), $}. The reasoning is analogous to that for T in point (5).
2.4.2 Predictive Parser-LL(1) Parser
• Predictive parsing is a special case of recursive descent parsing where no backtracking is required.
• The key problem of predictive parsing is to determine the production to be applied for a non-terminal when there are alternatives.
• This parser looks up the production in a parsing table.
• The table-driven predictive parser has an input buffer, a stack, a parsing table and an output stream.
• Grammars for which we can create predictive parsers are called LL(1):
  • The first L means scanning the input from left to right.
  • The second L means producing a leftmost derivation.
  • The 1 stands for using one input symbol of lookahead at each step to make parsing decisions.

The class of LL(1) grammars is rich enough to cover most programming constructs, but care is needed in writing a suitable grammar. For example, no left-recursive or ambiguous grammar can be LL(1).
A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:
1. For no terminal a do both α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).
2.4.2.1 Transition Diagrams for Predictive Parsers:
• Transition diagrams are useful for visualizing predictive parsers.
• For example, the transition diagrams for nonterminals E and E' of the expression grammar are shown in Fig.2.14(a).
E → TE'
E' → + TE' | ε
T → FT'
T' → * FT' | ε
F → (E) | id
• To construct the transition diagram from a grammar, first eliminate left recursion and then left factor the grammar.
• Then, for each nonterminal A:
1. Create an initial and final (return) state.
2. For each production A → X1 X2 … Xk, create a path from the initial to the final state, with edges labeled X1, X2, …, Xk. If A → ε, the path is an edge labeled ε.
• Transition diagrams for predictive parsers differ from those for lexical analyzers:
• Parsers have one diagram for each nonterminal.
• The labels of edges can be tokens or nonterminals.
• A transition on a token (terminal) means that we take that transition if that token is the next input symbol.
• A transition on a nonterminal A is a call of the procedure for A.

Figure 2.14: Transition diagrams for nonterminals E and E' of Expression grammar

2.4.2.2 Predictive Parsing


The steps for constructing the predictive parsing table are as follows:
1. The grammar should be unambiguous.
2. Eliminate left recursion from the grammar.
3. Perform left factoring.
4. Find FIRST and FOLLOW.
5. Construct the predictive parsing table.
Already we have discussed the first four steps in the previous sections. Let us see the
algorithm for constructing the predictive parsing table.
Parsing table: It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a
terminal
Algorithm 2.2 : Construction of predictive parsing table
Input : Grammar G
Output: Parsing table M
Method: For each production A → α of the grammar, do the following:
1. For each terminal a in FIRST(α), add A → α to M[A, a].
2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A, b]. If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
If, after performing the above, there is no production in M[A, a], then set M[A, a] to error.
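Algorithm 2.2 translates almost line for line into code over the FIRST/FOLLOW sketch given earlier (GRAMMAR, first_of, first and follow are the same assumed names); a conflicting entry means the grammar is not LL(1).

# Build the LL(1) table M[A, a] (a sketch reusing the FIRST/FOLLOW code above).
def build_ll1_table(grammar, first, follow):
    table = {}                                    # (A, a) -> production body
    for a, bodies in grammar.items():
        for body in bodies:
            f = first_of(body, first)
            targets = (f - {"ε"}) | (follow[a] if "ε" in f else set())  # rules 1, 2
            for t in targets:
                if (a, t) in table:               # two entries: not LL(1)
                    raise ValueError(f"not LL(1): conflict at M[{a}, {t}]")
                table[(a, t)] = body
    return table                                  # absent entries mean "error"

M = build_ll1_table(GRAMMAR, first, follow)
print(M[("E", "id")])    # ['T', "E'"]  i.e. E → TE'
print(M[("E'", ")")])    # []           i.e. E' → ε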

Example : For the expression grammar, the above algorithm produces the parsing table in
Fig. 2.15. Blanks are error entries; nonblanks indicate a production with which to expand a
nonterminal.
Grammar:
E → E + T | T
T → T * F | F
F → (E) | id
Step 1: The grammar is unambiguous.
Step 2: The first two productions are left recursive. After eliminating left recursion, the grammar is
E → TE'
E' → + TE' | ε
T → FT'
T' → * FT' | ε
F → (E) | id
Step 3: There is no need for left factoring.
Step 4: Find FIRST and FOLLOW
Non-Terminal    FIRST      FOLLOW
E               ( , id     ) , $
E'              + , ε      ) , $
T               ( , id     + , ) , $
T'              * , ε      + , ) , $
F               ( , id     + , * , ) , $
Step 5: Construction of Predictive Parsing Table

Figure 2.15: Parsing table M for Expression Grammar


Consider production E → TE'. Since
FIRST(TE') = FIRST(T) = {(, id}
this production is added to M[E, (] and M[E, id]. Production E' → +TE' is added to M[E', +] since FIRST(+TE') = {+}. Since FOLLOW(E') = {), $}, production E' → ε is added to M[E', )] and M[E', $].
EXAMPLE: The following grammar, which abstracts the dangling-else problem, is given:
S → iEtSS' | a
S' → eS | ε
E → b
The parsing table for this grammar appears in Fig. 2.16. The entry for M[S', e] contains both S' → eS and S' → ε.

Figure 2.16: Parsing table M for Dangling Else Grammar
As the grammar is ambiguous, it is not LL(1): there are two entries for M[S', e].

2.4.2.3 Predictive Parsing Program


The predictive parser has an input, a stack, a parsing table and an output.

Figure 2.17: Model of a table-driven Predictive Parser


Input: Contains the string to be parsed, followed by right end marker $.
Stack: Contains a sequence of grammar symbols, preceded by $, the bottom of stack
marker. Initially the stack contains the start symbol of the grammar preceded by $.
Parsing Table: It is a two dimensional array M[A,a], where A is a non-terminal and a is a
terminal or $.
Output: Gives the output whether the string is valid or not.

Algorithm 2.3: Non recursive predictive parsing


INPUT: A string w and a parsing table M for grammar G.
OUTPUT: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
METHOD: Initially, the parser is in a configuration with w$ in the input buffer and the start
symbol S of G on top of the stack, above $. The following procedure uses the predictive
parsing table M to produce a predictive parse for the input.
set ip to point to the first symbol of w;
set X to the top stack symbol;
while ( X ≠ $ ) { /* stack is not empty */
    let a be the input symbol pointed to by ip;
    if ( X = a ) pop the stack and advance ip;
    else if ( X is a terminal ) error();
    else if ( M[X, a] is an error entry ) error();
    else if ( M[X, a] = X → Y1Y2…Yk ) {
        output the production X → Y1Y2…Yk;
        pop the stack;
        push Yk, …, Y2, Y1 onto the stack, with Y1 on top;
    }
    set X to the top stack symbol;
}
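Algorithm 2.3 in runnable form, driving the table M built by the previous sketch (GRAMMAR and M are the same assumed names; the printed productions form the leftmost derivation):

# Table-driven predictive parsing (a sketch of Algorithm 2.3).
def ll1_parse(tokens, table, start):
    stack = ["$", start]                     # start symbol on top of $
    tokens = tokens + ["$"]
    ip = 0
    while stack[-1] != "$":
        X, a = stack[-1], tokens[ip]
        if X == a:                           # top of stack matches input
            stack.pop(); ip += 1
        elif X not in GRAMMAR:               # terminal mismatch
            raise SyntaxError(f"expected {X}, found {a}")
        elif (X, a) not in table:            # error entry
            raise SyntaxError(f"no rule for M[{X}, {a}]")
        else:
            body = table[(X, a)]
            print(X, "→", " ".join(body) or "ε")   # output the production
            stack.pop()
            stack.extend(reversed(body))     # push Yk ... Y1, Y1 on top
    return tokens[ip] == "$"                 # all input consumed

print(ll1_parse(["id", "+", "id", "*", "id"], M, "E"))   # True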
Example: Consider the expression grammar whose parsing table is in Fig.2.15. On input id+id*id, the nonrecursive predictive parser of Algorithm 2.3 makes the sequence of moves in Fig.2.18. These moves correspond to a leftmost derivation.

Figure 2.18: Moves made by a predictive parser on input id + id * id

2.4.2.4 Error Recovery In Predictive Parsing


1. Panic mode
• Place all symbols in FOLLOW(A) into the synchronization set for nonterminal A: skip tokens until an element of FOLLOW(A) is seen, and pop A from the stack.
• Add to the synchronization set of a lower-level construct the symbols that begin higher-level constructs.
• Add the symbols in FIRST(A) to the synchronization set of nonterminal A.
• If a nonterminal can generate the empty string, then the production deriving ε can be used as a default.
• If a terminal on top of the stack cannot be matched, pop the terminal and issue a message saying that the terminal was inserted.

Figure 2.19: Synchronizing tokens added to the parsing table of Fig.2.15
On the erroneous input )id* + id, the parser and error recovery mechanism of Fig.2.19
behave as in Fig.2.20.

Figure 2.20: Parsing and error recovery moves made by a predictive parser
2. Phrase-level Recovery: Phrase-level error recovery is implemented by filling in the blank
entries in the predictive parsing table with pointers to error routines. These routines
may change, insert, or delete symbols on the input and issue appropriate error
messages. They may also pop from the stack.

BOTTOM UP PARSING:
Introduction:
A bottom-up parse corresponds to the construction of a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top). This is nothing but reducing a string w to the start symbol of the grammar. At each reduction step, a particular substring matching the right side of a production is replaced by the symbol on the left of that production; if the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse. A general style of bottom-up parsing is known as shift-reduce parsing.

Figure 2.21: A bottom-up parse for id * id
Reductions:
Bottom-up parsing is the process of "reducing" a string w to the start symbol of the
grammar. At each reduction step, a specific substring matching the body of a production is
replaced by the nonterminal at the head of that production. The key decisions during
bottom-up parsing are about when to reduce and about what production to apply, as the
parse proceeds.
Reduction for the string id * id with the expression grammar
E → E + T | T
T → T * F | F
F → (E) | id
The reduction sequence is:
id * id      reduce by F → id
F * id       reduce by T → F
T * id       reduce by F → id
T * F        reduce by T → T * F
T            reduce by E → T
E
where E is the start symbol of the grammar. The derivation traced out is a rightmost derivation; the reduction is an RMD in reverse.
Handle Pruning:
Bottom-up parsing during a left-to-right scan of the input constructs a rightmost
derivation in reverse. Informally, a "handle" is a substring that matches the body of a
production, and whose reduction represents one step along the reverse of a rightmost
derivation.
The handles during the parse of id1 * id2, according to the expression grammar, are shown in Fig.2.22.

Figure 2.22: Handles during a parse of id1 * id2

If S ⇒*rm αAw ⇒rm αβw, then production A → β in the position following α is a handle of αβw. Notice that the string w to the right of the handle must contain only terminal symbols. A rightmost derivation in reverse can be obtained by "handle pruning"; that is, we start with the string of terminals w to be parsed and prune one handle at each step until we reach the start symbol.
2.5. Shift Reduce Parser
Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar
symbols and an input buffer holds the rest of the string to be parsed. The handle always
appears at the top of the stack. The symbol $ is used to mark the bottom of the stack
and also the right end of the input.
Initially, the stack is empty, and the string w is on the input, as follow
STACK INPUT
$ w$
During a left-to-right scan of the input string, the parser shifts zero or more input
symbols onto the stack, until a handle β is on top of the stack. It then reduces β to the left
side of the appropriate production. The parser repeats this cycle until it has detected an
error or until the stack contains the start symbol and the input is empty:
STACK INPUT
$S $
Upon entering this configuration, the parser halts and announces successful completion of
parsing.
A shift-reduce parser can take four possible actions: (1) shift, (2) reduce, (3) accept, and (4) error. The primary operations are shift and reduce.
1. In a Shift action, the next input symbol is shifted onto the top of the stack.
2. In a Reduce action, the parser knows the right end of the handle is at the top of
the stack. It must then locate the left end of the handle within the stack and decide with
what non-terminal to replace the string.
3. In an Accept action, the parser announces successful completion of parsing.
4. In an Error action, the parser discovers that a syntax error has occurred and calls
an error recovery routine.
The actions a shift-reduce parser might take in parsing the input string id1 * id2 are shown below, according to the expression grammar E → E+E | E*E | (E) | -E | id.

STACK        INPUT          ACTION
$            id1 * id2 $    Shift
$ id1        * id2 $        Reduce by E → id
$ E          * id2 $        Shift
$ E *        id2 $          Shift
$ E * id2    $              Reduce by E → id
$ E * E      $              Reduce by E → E * E
$ E          $              Accept
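The stack mechanics of shift-reduce parsing are easy to code; the hard part is deciding when to reduce, which is exactly what the LR technique of Section 2.6 automates. In the sketch below the decisions are therefore supplied as an explicit action list (all names and encodings are illustrative assumptions), replaying the trace above.

# Shift-reduce mechanics with the decisions supplied externally (a sketch).
def run_shift_reduce(tokens, actions, productions):
    stack, ip = ["$"], 0
    tokens = tokens + ["$"]
    for act in actions:
        if act == "shift":
            stack.append(tokens[ip]); ip += 1
        else:                                   # act is a production number
            head, body = productions[act]
            assert stack[-len(body):] == body   # the handle is on top
            del stack[-len(body):]              # pop the handle ...
            stack.append(head)                  # ... and push the head
    return stack == ["$", "E"] and tokens[ip] == "$"

prods = {1: ("E", ["E", "*", "E"]), 2: ("E", ["id"])}
print(run_shift_reduce(["id", "*", "id"],
                       ["shift", 2, "shift", "shift", 2, 1], prods))   # True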
Conflicts during Shift-Reduce Parsing
There are context-free grammars for which shift-reduce parsing cannot be used. The two conflicts in shift-reduce parsing are:
1. Shift/Reduce Conflict: even knowing the entire stack contents and the next input symbol, the parser cannot decide whether to shift or to reduce.
2. Reduce/Reduce Conflict: even knowing the entire stack contents, the parser cannot decide which of several reductions to make.
Shift/Reduce Conflict:
Example 1: An ambiguous grammar can never be LR. Consider the dangling-else grammar:
stmt -> if expr then stmt | if expr then stmt else stmt | other
If we have a shift-reduce parser in configuration
STACK                          INPUT
... if expr then stmt          else ... $
we cannot tell whether if expr then stmt is the handle, no matter what appears
below it on the stack. Here there is a shift/reduce conflict. Depending on what follows the
else on the input, it might be correct to reduce if expr then stmt to stmt, or it might be
correct to shift else and then to look for another stmt to complete the alternative if expr
then stmt else stmt.
Example 2:
Grammar: E → E+E | E*E | id and the input id+id*id
Solution:
The shift/reduce conflict arises at configuration $ E+E with *id$ remaining; both choices lead to acceptance, but with different parses.

Choice 1: shift
Stack      Input        Action
$          id+id*id$    Shift
$ id       +id*id$      Reduce by E → id
$ E        +id*id$      Shift
$ E+       id*id$       Shift
$ E+id     *id$         Reduce by E → id
$ E+E      *id$         Shift
$ E+E*     id$          Shift
$ E+E*id   $            Reduce by E → id
$ E+E*E    $            Reduce by E → E*E
$ E+E      $            Reduce by E → E+E
$ E        $            Accept

Choice 2: reduce
Stack      Input        Action
$          id+id*id$    Shift
$ id       +id*id$      Reduce by E → id
$ E        +id*id$      Shift
$ E+       id*id$       Shift
$ E+id     *id$         Reduce by E → id
$ E+E      *id$         Reduce by E → E+E
$ E        *id$         Shift
$ E*       id$          Shift
$ E*id     $            Reduce by E → id
$ E*E      $            Reduce by E → E*E
$ E        $            Accept

Reduce/Reduce Conflict
Suppose the language invokes procedures by giving their names, with parameters
surrounded by parentheses, and that arrays are referenced by the same syntax. The
grammar might therefore have productions as shown below (the production numbers are referred to in the discussion that follows):
(1) stmt → id ( parameter_list )
(2) stmt → expr := expr
(3) parameter_list → parameter_list , parameter
(4) parameter_list → parameter
(5) parameter → id
(6) expr → id ( expr_list )
(7) expr → id
(8) expr_list → expr_list , expr
(9) expr_list → expr
A statement beginning with p( i, j) would appear as the token stream id(id, id) to the
parser. After shifting the first three tokens onto the stack, a shift-reduce parser would be in
configuration
STACK INPUT
. . . id (id , id ) . . .
It is evident that the id on top of the stack must be reduced, but by which
production? The correct choice is production (5) if p is a procedure, but production (7) if p is
an array. The stack does not tell which; information in the symbol table obtained from the
declaration of p must be used.
One solution is to change the token id in production (1) to procid and to have the lexical analyzer return the token name procid when it recognizes an identifier that is the name of a procedure. If we make this modification, then on processing p(i, j) the parser would be either in the configuration
STACK INPUT
… procid( id , id ) …
or in the configuration above. In the former case, we choose reduction by production (5), in
the latter case by production (7).
Example 2:
M → R + R | R + c | R
R → c
and the input string c+c.
Solution (the reduce/reduce conflict arises at configuration $ R+c):

Choice 1: reduce by R → c
Stack    Input    Action
$        c+c$     Shift
$ c      +c$      Reduce by R → c
$ R      +c$      Shift
$ R+     c$       Shift
$ R+c    $        Reduce by R → c
$ R+R    $        Reduce by M → R+R
$ M      $        Accept

Choice 2: reduce by M → R+c
Stack    Input    Action
$        c+c$     Shift
$ c      +c$      Reduce by R → c
$ R      +c$      Shift
$ R+     c$       Shift
$ R+c    $        Reduce by M → R+c
$ M      $        Accept

2.6. LR Parser
LR parser is a bottom-up syntax analysis technique that can be used to parse a large
class of context-free grammars. This technique is called as LR(k) parsing, the “L” is for left-
to-right scanning of the input, the “R” for constructing a right most derivation in reverse,
and “k” for the number of input symbols of lookahead that are used in making parsing
decisions. When (k) is omitted, k is assumed to be 1.
LR parsing is attractive for a variety of reasons:
• LR parsers can be constructed to recognize virtually all programming language constructs for which context-free grammars can be written.
• The LR-parsing method is the most general non-backtracking shift-reduce parsing method.
• The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.
• An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
The principal drawback of the LR method is that it is too much work to construct an LR parser by hand for a typical programming-language grammar. A specialized tool, an LR parser generator such as YACC, is available for this purpose.
2.6.1 Types of LR Parsers:
There are three main types of LR Parsers. They are
1. Simple LR Parser (SLR): the parsing table is constructed from the LR(0) automaton of the grammar.
2. Canonical LR Parser (CLR): this parser makes full use of the lookahead symbol(s). It uses a large set of items, called the LR(1) items.
3. Look-Ahead LR Parser (LALR): this method is based on the LR(0) sets of items, and has many fewer states than typical parsers based on the LR(1) items. By carefully introducing lookaheads into the LR(0) items, we can handle many more grammars with the LALR method than with the SLR method, and build parsing tables that are no bigger than the SLR tables. LALR is the method of choice in most situations.

2.6.2 The LR-Parsing Algorithm: A schematic of an LR parser is shown in Fig.2.23. It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts
(ACTION and GOTO). The driver program is the same for all LR parsers; only the parsing
table changes from one parser to another. The parsing program reads characters from an
input buffer one at a time. Where a shift-reduce parser would shift a symbol, an LR parser
shifts a state.

Figure 2.23: Model of an LR parser
The stack holds a sequence of states, s0 s1 … sm, where sm is on top. In the SLR method, the stack holds states from the LR(0) automaton; the canonical-LR and LALR methods are similar.
2.6.3 Structure of the LR Parsing Table
The parsing table consists of 2 parts: a parsing-action function ACTION and function GOTO.
1. The ACTION function takes as arguments a state i and a terminal a (or $, the input end
marker). The value of ACTION[i, a] can have one of four forms:
(a) Shift j, where j is a state. The action taken by the parser effectively shifts input a to the
stack, but uses state j to represent a.
(b) Reduce A  β: The action of the parser effectively reduces β on the top of the stack to
head A.
(c) Accept: The parser accepts the input and finishes parsing.
(d) Error: The parser discovers an error in its input and takes some corrective action.
2. The GOTO function maps sets of items to states: if GOTO(Ii, A) = Ij, then GOTO also maps state i and nonterminal A to state j.
Algorithm 2.4 : LR-parsing algorithm
INPUT: An input string w and an LR-parsing table with functions ACTION and GOTO for a
grammar G.
OUTPUT: If w is in L( G), the reduction steps of a bottom-up parse for W; otherwise, an error
indication.
METHOD: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input buffer. The parser then executes the program in Fig. 2.24.
let a be the first symbol of w$;
while (1) { /* repeat forever */
    let s be the state on top of the stack;
    if ( ACTION[s, a] = shift t ) {
        push t onto the stack;
        let a be the next input symbol;
    } else if ( ACTION[s, a] = reduce A → β ) {
        pop |β| symbols off the stack;
        let state t now be on top of the stack;
        push GOTO[t, A] onto the stack;
        output the production A → β;
    } else if ( ACTION[s, a] = accept ) break; /* parsing is done */
    else call error-recovery routine;
}
Figure 2.24: LR-parsing program
Example: Consider the parsing-action and goto functions of an LR parsing table for the expression grammar.
(1) E → E + T    (2) E → T     (3) T → T * F
(4) T → F        (5) F → (E)   (6) F → id

Figure 2.24: LR Parsing table for expression grammar


The codes for the actions in Fig.2.24 are:
1. si means shift and stack state i
2. rj means reduce by the production numbered j
3. acc means accept
4. blank means error
LR parsing example: On input id * id + id, the sequence of stack and input contents is shown in Fig.2.25.

Figure 2.25: Moves of an LR parser on id * id + id
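The driver of Algorithm 2.4 is short once the table exists. Below is a runnable sketch using the classic SLR(1) table for this expression grammar, hand-encoded to match the state numbering of Fig.2.24 (the Python encoding itself is an illustrative assumption).

# An LR driver (Algorithm 2.4) with the SLR(1) table for the grammar above.
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),   # number -> (head, body length)
         4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}
ACTION = {(0, "id"): ("s", 5), (0, "("): ("s", 4), (4, "id"): ("s", 5),
          (4, "("): ("s", 4), (6, "id"): ("s", 5), (6, "("): ("s", 4),
          (7, "id"): ("s", 5), (7, "("): ("s", 4), (1, "+"): ("s", 6),
          (1, "$"): ("acc",), (8, "+"): ("s", 6), (8, ")"): ("s", 11),
          (2, "*"): ("s", 7), (9, "*"): ("s", 7)}
for state, r in [(2, 2), (3, 4), (5, 6), (9, 1), (10, 3), (11, 5)]:
    for a in ["+", "*", ")", "$"]:                # the rj entries of Fig.2.24
        ACTION.setdefault((state, a), ("r", r))
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def lr_parse(tokens):
    stack, ip = [0], 0                            # stack of states, s0 on it
    tokens = tokens + ["$"]
    while True:
        entry = ACTION.get((stack[-1], tokens[ip]))
        if entry is None:
            raise SyntaxError(f"error in state {stack[-1]} on {tokens[ip]!r}")
        if entry[0] == "s":                       # shift: push state, advance
            stack.append(entry[1]); ip += 1
        elif entry[0] == "r":                     # reduce by production entry[1]
            head, size = PRODS[entry[1]]
            del stack[-size:]                     # pop |β| states
            stack.append(GOTO[(stack[-1], head)])
            print("reduce by production", entry[1])
        else:
            return True                           # accept

print(lr_parse(["id", "*", "id", "+", "id"]))     # reductions 6,4,6,3,2,6,4,1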


2.7. LR(0) Items
An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse. States represent sets of "items." An LR(0) item of a grammar G is a production of G with a dot at some position of the body. Thus, production A → XYZ yields the four items
A → ·XYZ
A → X·YZ
A → XY·Z
A → XYZ·
The production A → ε generates only one item, A → ·.
An item indicates how much of a production we have seen at a given point in the parsing process. For example, the item A → ·XYZ indicates that we hope to see a string derivable from XYZ next on the input. Item A → X·YZ indicates that we have just seen on the input a string derivable from X and that we hope next to see a string derivable from YZ. Item A → XYZ· indicates that we have seen the body XYZ and that it may be time to reduce XYZ to A.
One collection of sets of LR(0) items, called the canonical LR(0) collection, provides
the basis for constructing a deterministic finite automaton that is used to make parsing
decisions. Such an automaton is called an LR(0) automaton.
To construct the canonical LR(0) collection for a grammar, define an augmented grammar and two functions, CLOSURE and GOTO. If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a new start symbol S' and production S' → S. The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input.
Closure of Item Sets
If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by the two rules:
1. Initially, add every item in I to CLOSURE(I).
2. If A → α·Bβ is in CLOSURE(I) and B → γ is a production, then add the item B → ·γ to CLOSURE(I), if it is not already there. Apply this rule until no more new items can be added to CLOSURE(I).

SetOfItems CLOSURE(I) {
    J = I;
    repeat
        for ( each item A → α·Bβ in J )
            for ( each production B → γ of G )
                if ( B → ·γ is not in J )
                    add B → ·γ to J;
    until no more items are added to J on one round;
    return J;
}
The sets of items of interest fall into two classes:
1. Kernel items: the initial item S' → ·S, and all items whose dots are not at the left end.
2. Nonkernel items: all items with their dots at the left end, except for S' → ·S.
The Function GOTO
The second useful function is GOTO(I, X), where I is a set of items and X is a grammar symbol. GOTO(I, X) is defined to be the closure of the set of all items [A → αX·β] such that [A → α·Xβ] is in I.
Intuitively, the GOTO function is used to define the transitions in the LR(0) automaton for a grammar. The states of the automaton correspond to sets of items, and GOTO(I, X) specifies the transition from the state for I under input X.

SetOfItems GOTO(I, X) {
    initialize J to be the empty set;
    for ( each item [A → α·Xβ] in I )
        add item [A → αX·β] to set J;
    return CLOSURE(J);
}
The algorithm to construct C, the canonical collection of sets of LR(0) items for an augmented grammar G', is shown below.

void items(G') {
    C = CLOSURE({[S' → ·S]});
    repeat
        for ( each set of items I in C )
            for ( each grammar symbol X )
                if ( GOTO(I, X) is not empty and not in C )
                    add GOTO(I, X) to C;
    until no new sets of items are added to C on a round;
}
Figure 2.26: Computation of the canonical collection of sets of LR(0) items
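The three routines above fit in a short Python sketch, here applied to the augmented expression grammar of the example that follows. The item encoding as (production index, dot position) pairs, and the names PRODS and NONTERMS, are illustrative assumptions.

# CLOSURE, GOTO and items() for LR(0) sets (a sketch).
PRODS = [("E'", ("E",)),
         ("E", ("E", "+", "T")), ("E", ("T",)),
         ("T", ("T", "*", "F")), ("T", ("F",)),
         ("F", ("(", "E", ")")), ("F", ("id",))]
NONTERMS = {h for h, _ in PRODS}

def closure(items):
    items = set(items)                    # an item is (prod index, dot pos)
    while True:
        new = set()
        for p, d in items:
            body = PRODS[p][1]
            if d < len(body) and body[d] in NONTERMS:       # dot before B
                new |= {(q, 0) for q, (h, _) in enumerate(PRODS)
                        if h == body[d]}                    # add all B → ·γ
        if new <= items:
            return frozenset(items)
        items |= new

def goto(I, X):                           # advance the dot over X, then close
    return closure({(p, d + 1) for p, d in I
                    if d < len(PRODS[p][1]) and PRODS[p][1][d] == X})

def canonical_collection():
    C = [closure({(0, 0)})]               # I0 = CLOSURE({E' → ·E})
    symbols = {s for _, b in PRODS for s in b}
    for I in C:                           # C grows while we scan it (worklist)
        for X in symbols:
            J = goto(I, X)
            if J and J not in C:
                C.append(J)
    return C

print(len(canonical_collection()))        # 12 sets of items, I0 .. I11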
Example: Let us consider the grammar
(1) E → E + T    (2) E → T     (3) T → T * F
(4) T → F        (5) F → (E)   (6) F → id

FIRST and FOLLOW for the grammar:
Non-Terminal    FIRST     FOLLOW
E               ( , id    + , ) , $
T               ( , id    + , * , ) , $
F               ( , id    + , * , ) , $

The augmented grammar is
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
I0 = CLOSURE({E' → ·E}):
I0:  E' → ·E
     E → ·E+T      E → ·T
     T → ·T*F      T → ·F
     F → ·(E)      F → ·id
Then I1 = GOTO(I0, E) is
I1:  E' → E·
     E → E·+T
Then I2 = GOTO(I0, T) is
I2:  E → T·
     T → T·*F
Then I3 = GOTO(I0, F) is
I3:  T → F·
Then I4 = GOTO(I0, () is
I4:  F → (·E)
     E → ·E+T      E → ·T
     T → ·T*F      T → ·F
     F → ·(E)      F → ·id
Then I5 = GOTO(I0, id) is
I5:  F → id·
Then I6 = GOTO(I1, +) is
I6:  E → E+·T
     T → ·T*F      T → ·F
     F → ·(E)      F → ·id
Then I7 = GOTO(I2, *) is
I7:  T → T*·F
     F → ·(E)      F → ·id
Then I8 = GOTO(I4, E) is
I8:  F → (E·)
     E → E·+T
Then I9 = GOTO(I6, T) is
I9:  E → E+T·
     T → T·*F
Then I10 = GOTO(I7, F) is
I10: T → T*F·
Then I11 = GOTO(I8, )) is
I11: F → (E)·
The remaining GOTOs lead back to existing sets:
GOTO(I4, T) = I2, GOTO(I4, F) = I3, GOTO(I4, () = I4, GOTO(I4, id) = I5,
GOTO(I6, F) = I3, GOTO(I6, () = I4, GOTO(I6, id) = I5,
GOTO(I7, () = I4, GOTO(I7, id) = I5, GOTO(I8, +) = I6, GOTO(I9, *) = I7.

Figure 2.27: The GOTO Graph for Expression Grammar

2.7.1 Construction of SLR Parsing Table


Algorithm 2.5: Constructing an SLR-parsing table.
INPUT: An augmented grammar G' .
OUTPUT: The SLR-parsing table functions ACTION and GOTO for G'.
METHOD:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
(a) If [A → α·aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j." Here a must be a terminal.
(b) If [A → α·] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may not be S'.
(c) If [S' → S·] is in Ii, then set ACTION[i, $] to "accept."

If any conflicting actions result from the above rules, we say the grammar is not SLR(1). The algorithm fails to produce a parser in this case.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: If GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error."
5. The initial state of the parser is the one constructed from the set of items containing [S' → ·S].

The parsing table consisting of the ACTION and GOTO functions determined by Algorithm 2.5 is called the SLR(1) table for G. An LR parser using the SLR(1) table for G is called the SLR(1) parser for G, and a grammar having an SLR(1) parsing table is said to be SLR(1).
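Continuing the LR(0) sketch above, Algorithm 2.5 becomes the routine below; it needs a FOLLOW table for this grammar (hard-coded here) and raises an error on any conflicting entry. State numbers may come out in a different order than the textbook's I0..I11.

# SLR(1) table construction (a sketch of Algorithm 2.5, reusing PRODS,
# NONTERMS, goto and canonical_collection from the LR(0) sketch above).
FOLLOW = {"E": {"+", ")", "$"}, "T": {"+", "*", ")", "$"},
          "F": {"+", "*", ")", "$"}}

def build_slr_table(C):
    action, goto_tab = {}, {}
    def set_action(i, a, act):            # detect conflicting entries
        if action.get((i, a), act) != act:
            raise ValueError(f"not SLR(1): conflict at ACTION[{i}, {a}]")
        action[(i, a)] = act
    for i, I in enumerate(C):
        for p, d in I:
            head, body = PRODS[p]
            if d < len(body):                              # rule 2(a): shift
                a = body[d]
                if a not in NONTERMS:
                    set_action(i, a, ("shift", C.index(goto(I, a))))
            elif head == "E'":                             # rule 2(c): accept
                set_action(i, "$", ("accept",))
            else:                                          # rule 2(b): reduce
                for a in FOLLOW[head]:
                    set_action(i, a, ("reduce", p))
        for A in NONTERMS:                                 # rule 3: gotos
            J = goto(I, A)
            if J in C:
                goto_tab[(i, A)] = C.index(J)
    return action, goto_tab

ACTION_TAB, GOTO_TAB = build_slr_table(canonical_collection())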

Figure 2.28: Parsing Table for Expression Grammar
Example : SLR PARSING:
Every SLR(1) grammar is unambiguous, but there are many unambiguous grammars that are not SLR(1). Consider the grammar with productions
S → L = R | R
L → *R | id
R → L
Step 1: Augmented Grammar
S' → S
S → L = R | R
L → *R | id
R → L
Step 2: Find FIRST and FOLLOW
NT    FIRST     FOLLOW
S     * , id    $
L     * , id    = , $
R     * , id    = , $
Step 3: Constructing LR(0) Items

Figure 2.29: LR(0) Items for grammar


Figure 2.30: The GOTO Graph for grammar

Step 4: Constructing LR Parsing Table:


Consider the set of items I2. The first item in this set makes ACTION[2, =] "shift 6." Since FOLLOW(R) contains =, the second item sets ACTION[2, =] to "reduce R → L" (r5). Since there is both a shift and a reduce entry in ACTION[2, =], state 2 has a shift/reduce conflict on input symbol =.
        ACTION                          GOTO
     id      =        *      $       S    L    R
0    s3               s5             1    2    4
1                            acc
2            s6/r5           r5
3            r4              r4
4                            r2
5    s3               s5                  7    8
6    s3               s5                  7    9
7            r5              r5
8            r3              r3
9                            r1
Figure 2.31: The Parsing Table for grammar

The grammar above is not ambiguous. This shift/reduce conflict arises because the SLR parser construction method is not powerful enough to remember enough left context to decide what action the parser should take on input =, having seen a string reducible to L.
The canonical and LALR methods will succeed on a larger collection of grammars. Note, however, that there are unambiguous grammars for which every LR parser construction method will produce a parsing-action table with parsing-action conflicts.
Conflict example - Reduce/reduce conflict
Consider the grammar
S → AaAb
S → BbBa
A → ε
B → ε

I0 : S' → .S
     S → .AaAb
     S → .BbBa
     A → .
     B → .

Here FOLLOW(A) = {a, b} and FOLLOW(B) = {a, b}.
ACTION[0, a] : reduce by A → ε and reduce by B → ε, a reduce/reduce conflict.
ACTION[0, b] : reduce by A → ε and reduce by B → ε, a reduce/reduce conflict.

32

2.7.2 Canonical LR Parser:
To overcome the conflicts that occur in SLR parsing, we may use more powerful LR
parsing techniques such as Canonical LR parsing and LALR parsing. These techniques carry
more information in the state, which allows us to rule out some of the invalid
reductions.
The extra information is incorporated into the state by redefining items to include a
terminal symbol as a second component. The general form of an item becomes [A → α.β,
a], where A → αβ is a production and a is a terminal or the right endmarker $. We call
such an item an LR(1) item. The 1 refers to the length of the second component, called
the lookahead of the item.
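Concretely, and purely as an illustration (the field names below are our own choice, not prescribed by the theory), an LR(1) item can be represented in C by recording which production it refers to, where the dot stands, and the lookahead:

/* Sketch of an LR(1) item [A -> alpha . beta, a].
   An LR(0) item is the same structure without the lookahead field. */
struct lr1_item {
    int  prod;       /* index of the production A -> alpha beta         */
    int  dot;        /* dot position: 0 .. length of the right side     */
    char lookahead;  /* the terminal a, or '$' for the right endmarker  */
};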

2.7.2.1 Constructing LR(1) Sets of Items:
The method for building the collection of sets of valid LR(1) items is essentially the same
as for LR(0) items, with small modifications to the two procedures CLOSURE and GOTO.
SetOfItems CLOSURE(I) {
repeat
for ( each item [A->α.Bβ,a] in I )
for ( each production B->γ in G' )
for ( each terminal b in FIRST(βa) )
add [B->.γ,b] to set I;
until no more items are added to I;
return I;
}
SetOfItems GOTO(I, X) {
initialize J to be the empty set;
for ( each item [A->α.Xβ,a] in I )
add item [A->αX.β,a] to set J;
return CLOSURE( J);
}
void items(G') {
initialize C to CLOSURE({[S' -> .S, $]});
repeat
for ( each set of items I in C )
for ( each grammar symbol X )
if ( GOTO(I, X) is not empty and not in C )
add GOTO(I, X) to C;
until no new sets of items are added to C;
}
Figure 2.32: Sets-of-LR(1)-items construction for grammar G'

33
Algorithm 2.6 : Construction of the sets of LR(1) items
INPUT: An augmented grammar G' .
OUTPUT: The sets of LR(1) items that are the sets of items valid for one or
more viable prefixes of G'.
METHOD: The procedures CLOSURE and GOTO and the main routine items for constructing
the sets of items were shown above.
Example:
Consider the following grammar:
S  CC
C  cC I d
Augmented grammar is
S'  S
S  CC
C  cC I d
Find the FIRST and FOLLOW

NT    FIRST    FOLLOW
S     c, d     $
C     c, d     c, d, $

Computing LR(1) items

Compute the closure of {[S' → .S, $]}.
Match the item [S' → .S, $] against the item [A → α.Bβ, a] in the procedure CLOSURE. That is,
A = S', α = ε, B = S, β = ε, and a = $.
Function CLOSURE tells us to add [B → .γ, b] for each production B → γ and terminal b in
FIRST(βa).
In terms of the present grammar, B → γ must be S → CC, and since β is ε and a is $, b may
only be $. Thus we add [S → .CC, $].

Continue to compute the closure by adding all items [C → .γ, b] for b in FIRST(C$). That is,
matching [S → .CC, $] against [A → α.Bβ, a], we have A = S, α = ε, B = C, β = C, and a = $.
Since C does not derive the empty string, FIRST(C$) = FIRST(C). Since FIRST(C) contains the
terminals c and d, we add the items [C → .cC, c], [C → .cC, d], [C → .d, c] and [C → .d, d].
The initial set of items is
I0 : S'->.S, $
     S->.CC, $
     C->.cC, c/d
     C->.d, c/d

Goto(I0, S):   I1 : S'->S., $
Goto(I0, C):   I2 : S->C.C, $
                    C->.cC, $
                    C->.d, $
Goto(I0, c):   I3 : C->c.C, c/d
                    C->.cC, c/d
                    C->.d, c/d
Goto(I0, d):   I4 : C->d., c/d
Goto(I2, C):   I5 : S->CC., $
Goto(I2, c):   I6 : C->c.C, $
                    C->.cC, $
                    C->.d, $
Goto(I2, d):   I7 : C->d., $
Goto(I3, C):   I8 : C->cC., c/d
Goto(I3, c) = I3    Goto(I3, d) = I4
Goto(I6, C):   I9 : C->cC., $
Goto(I6, c) = I6    Goto(I6, d) = I7

Figure 2.33: Sets-of-LR(1)-items for Grammar S → CC, C → cC | d

34
The difference between I3 and I6 is only in the second component. The first component is
the same.

Figure 2.34: The GOTO graph for Grammar S → CC, C → cC | d
Every SLR(1) grammar is an LR(1) grammar, but for an SLR(1) grammar the canonical LR
parser may have more states than the SLR parser for the same grammar. The grammar of
the previous examples is SLR and has an SLR parser with seven states, compared with the
ten states of the canonical LR parser.
Algorithm 2.7 : Construction of canonical-LR parsing tables.
INPUT: An augmented grammar G’.
OUTPUT: The canonical-LR parsing table functions ACTION and GOTO for G'.
METHOD:
1. Construct C' = {I0, I1, ..., In}, the collection of sets of LR(1) items for G'.
2. State i of the parser is constructed from Ii. The parsing action for state i is determined as
follows.
(a) If [A -> α.aβ,b ] is in Ii and goto(Ii,a)= Ij, then set ACTION[i,a] to "shift j". Here a must be a
terminal.
(b) If [A -> α. ,a ] is in Ii, A ≠ S', then set action[i, a] to "reduce A ->α."
(c) If [S’ -> S. ,$] is in Ii, then set action[i, $] to "accept."
If any conflicting actions result from the above rules, we say the grammar is not LR(1). The
algorithm fails to produce a parser in this case.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: If
GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error."
5. The initial state of the parser is the one constructed from the set containing [S’->.S, $].

Canonical Parsing table construction:
1. To fill the shift actions: take all gotos in the LR(1) items constructed.
If goto(Ii, x) = Ij, where Ii and Ij are states and x is a terminal, then fill ACTION[i, x] = shift j.
Example: Now consider I0
S’->.S, $
S->.CC, $
C->.cC , c/d
C->.d , c/d

35
The third item yields ACTION[0, c] = shift 3 (i.e., s3) because goto(I0, c) = I3.
The fourth item yields ACTION[0, d] = shift 4 (i.e., s4) because goto(I0, d) = I4.
2. To fill the reduce actions: find the items in which the dot '.' is at the right end.
If Ii contains such an item (i.e., [A -> α., a] with A ≠ S'), then fill
ACTION[i, a] = reduce A -> α.
Example: Now consider I4
C->d., c/d
The item has the dot at the right end, so ACTION[4, c] = r3 and ACTION[4, d] = r3.
3. To fill the accept action: if [S' -> S., $] is in Ii, then set ACTION[i, $]
to accept. Example: Now consider I1
S' -> S., $
The set contains [S' -> S., $], so ACTION[1, $] = accept.
4. To fill the error action: all entries not defined are made "error".

Figure 2.35: Canonical parsing table for grammar S → CC, C → cC | d

2.8. Introduction to LALR Parser
The last parsing method, LALR (LookAhead-LR), is often used in practice because the
tables obtained by it are considerably smaller than the canonical LR tables. The SLR and
LALR tables for a grammar always have the same number of states, whereas the CLR table
would typically have many more. For example, for a language like Pascal the SLR and LALR
tables would have hundreds of states, but the CLR table would have several thousand.
Algorithm 2.8 : LALR table construction.
INPUT: An augmented grammar G'.
OUTPUT: The LALR parsing-table functions ACTION and GOT0 for G'.
METHOD:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items.
2. For each core present among the set of LR(1) items, find all sets having that core, and
replace these sets by their union.
3. Let C' = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. The parsing actions for state i are
constructed from Ji in the same manner as in Algorithm 2.7. If there is a parsing action
conflict, the algorithm fails to produce a parser, and the grammar is said not to be LALR(1).
4. The GOTO table is constructed as follows. If J is the union of one or more sets of LR(1)
items, that is, J = I1 U I2 U . . .U Ik , then the cores of GOTO(I1, X) , GOTO(I2, X) , . . . , GOTO(Ik,
X) are the same, since I1,I2, . . ,Ik all have the same core. Let K be the union of all sets of items
having the same core as GOTO(I1, X). Then GOTO(J,X) = K.
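The phrase "same core" in step 2 can be made concrete: two sets of LR(1) items have the same core when they contain exactly the same (production, dot-position) pairs, with the lookaheads ignored. A small sketch, reusing the lr1_item representation assumed earlier:

#include <stdbool.h>

struct lr1_item { int prod, dot; char lookahead; };

/* Does the set J (of size m) contain an item with core (prod, dot)? */
static bool has_core(const struct lr1_item *J, int m, int prod, int dot)
{
    for (int j = 0; j < m; j++)
        if (J[j].prod == prod && J[j].dot == dot)
            return true;
    return false;
}

/* I and J have the same core if each item's (prod, dot) pair occurs
   in the other set; the lookahead components play no part here.     */
bool same_core(const struct lr1_item *I, int n,
               const struct lr1_item *J, int m)
{
    for (int i = 0; i < n; i++)
        if (!has_core(J, m, I[i].prod, I[i].dot))
            return false;
    for (int j = 0; j < m; j++)
        if (!has_core(I, n, J[j].prod, J[j].dot))
            return false;
    return true;
}

Sets found equal under same_core are unioned by keeping one copy of each core and merging the lookahead sets; this is exactly how I3/I6, I4/I7 and I8/I9 collapse to I36, I47 and I89 in the example below.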
36
Let us again consider grammar
S->CC
C->cC
C->d
whose sets of LR(1) items are
I0 : S'->.S, $
     S->.CC, $
     C->.cC, c/d
     C->.d, c/d

Goto(I0, S):   I1 : S'->S., $
Goto(I0, C):   I2 : S->C.C, $
                    C->.cC, $
                    C->.d, $
Goto(I0, c):   I3 : C->c.C, c/d
                    C->.cC, c/d
                    C->.d, c/d
Goto(I0, d):   I4 : C->d., c/d
Goto(I2, C):   I5 : S->CC., $
Goto(I2, c):   I6 : C->c.C, $
                    C->.cC, $
                    C->.d, $
Goto(I2, d):   I7 : C->d., $
Goto(I3, C):   I8 : C->cC., c/d
Goto(I3, c) = I3    Goto(I3, d) = I4
Goto(I6, C):   I9 : C->cC., $
Goto(I6, c) = I6    Goto(I6, d) = I7
Figure 2.36: Sets-of-LR(1)-items for Grammar S → CC, C → cC | d
In the above LR(1) items, there are three pairs of sets of items that can be merged. I3 and
I6 are replaced by their union:
I36: C->c.C, c/d/$
C->.cC, c/d /$
C-> .d , c/d/$
I4 and I7 are replaced by their union:
I47: C->d., c/d/$
and I8 and I9 are replaced by their union:
I89: C->cC., c/d/$
The LALR action and goto functions for the condensed sets of items are

Figure 2.37: LALR parsing table for the grammar S → CC, C → cC | d
Conflict in LALR Parsing
It is possible that a merger will produce a reduce/reduce conflict, as the following example
shows.
Consider the grammar
S’  S
S  aAd | bBd | aBe | bAe
Ac
Bc
37
which generates the four strings acd, ace, bcd, and bce. The reader can check
that the grammar is LR(1) by constructing the sets of items. Upon doing so, we find the set of
items {[A → c., d], [B → c., e]} valid for the viable prefix ac and {[A → c., e], [B → c., d]} valid for
bc. Neither of these sets has a conflict, and their cores are the same. However, their union,
which is
A → c., d/e
B → c., d/e
generates a reduce/reduce conflict, since reductions by both A → c and B → c are called for
on inputs d and e.
Error Recovery in LR Parsing:
An LR parser will detect an error when it consults the parsing action table and finds an error
entry. Errors are never detected by consulting the goto table. An LR parser will announce an
error as soon as there is no valid continuation for the portion of the input thus far scanned.
A canonical LR parser will not make even a single reduction before announcing an error. SLR
and LALR parsers may make several reductions before announcing an error, but they will
never shift an erroneous input symbol onto the stack.

Panic-mode error recovery: We scan down the stack until a state s with a goto on a
particular nonterminal A is found. Zero or more input symbols are then discarded until a
symbol a is found that can legitimately follow A. The parser then stacks the state GOTO ( s,
A) and resumes normal parsing.
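A hedged sketch of this strategy in C, assuming the same table representation as in the earlier parser-loop sketch; NO_STATE, EOF_TOKEN, in_follow and the choice of the "major" nonterminal A are all assumptions made for illustration:

#include <stdbool.h>

extern int GOTO_TAB[][32];            /* goto table; NO_STATE means no entry */
extern int next_token(void);
extern bool in_follow(int A, int a);  /* is terminal a in FOLLOW(A)?         */

#define NO_STATE  (-1)
#define EOF_TOKEN 0

/* Pop states until one has a goto on nonterminal A, discard input
   until a token that may follow A, then push GOTO(s, A) and resume. */
void panic_recover(int *stack, int *top, int A, int *a)
{
    while (*top > 0 && GOTO_TAB[stack[*top]][A] == NO_STATE)
        (*top)--;                          /* scan down the stack      */
    while (*a != EOF_TOKEN && !in_follow(A, *a))
        *a = next_token();                 /* discard input symbols    */
    int s = stack[*top];
    stack[++*top] = GOTO_TAB[s][A];        /* stack GOTO(s, A), resume */
}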
Phrase-level recovery : It is implemented by examining each error entry in the LR parsing
table and deciding on the basis of language usage the most likely programmer error that
would give rise to that error. An appropriate recovery procedure can then be constructed;
presumably the top of the stack and/or first input symbols would be modified in a way
deemed appropriate for each error entry.

Consider again the expression grammar
E → E + E | E * E | (E) | id
The LR(0) items for the grammar are

Figure 2.38: LR(0) items for grammar E → E + E | E * E | (E) | id
38
The parsing table for the grammar E → E + E | E * E | (E) | id is

Figure 2.39: Parsing Table for grammar E → E + E | E * E | (E) | id

We have changed each state that calls for a particular reduction on some input symbols by
replacing error entries in that state by the reduction. This change has the effect of
postponing the error detection until one or more reductions are made, but the error will still
be caught before any shift move takes place.

The error routines are as follows.
e1 : This routine is called from states 0, 2, 4 and 5, all of which expect the beginning of an
operand, either an id or a left parenthesis. Instead, +, * , or the end of the input was found.
push state 3 (the goto of states 0, 2, 4 and 5 on id) ;
issue diagnostic "missing operand."
e2 : Called from states 0, 1 , 2, 4 and 5 on finding a right parenthesis.
remove the right parenthesis from the input;
issue diagnostic "unbalanced right parenthesis."
e3: Called from states 1 or 6 when expecting an operator, and an id or right parenthesis is
found.
push state 4 (corresponding to symbol +) onto the stack;
issue diagnostic "missing operator."
e4: Called from state 6 when the end of the input is found.
push state 9 (for a right parenthesis) onto the stack;
issue diagnostic "missing right parenthesis."

Figure 2.40: LR parsing table with error routines

39
On the erroneous input id + ) , the sequence of configurations entered by the parser is
shown in Fig. 2.41

Figure 2.41: Parsing and error recovery moves made by an LR parser


2.9. Error Handling and Recovery in Syntax Analyzer
The different strategies that a parser uses to recover from a syntactic error are:
1. Panic mode
2. Phrase level
3. Error productions
4. Global correction
Panic mode recovery:
On discovering an error, the parser discards input symbols one at a time until a
synchronizing token is found. The synchronizing tokens are usually delimiters, such as
semicolon or end. It has the advantage of simplicity and does not go into an infinite loop.
When multiple errors in the same statement are rare, this method is quite useful.
Phrase level recovery:
On discovering an error, the parser performs local correction on the remaining input that
allows it to continue.
Example: Insert a missing semicolon or delete an extraneous semicolon etc.
Error productions:
The parser is constructed using an augmented grammar with error productions. If an error
production is used by the parser, appropriate error diagnostics can be generated to indicate
the erroneous construct recognized in the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a
parse tree for a string y, such that the number of insertions, deletions and changes of tokens
is as small as possible. However, these methods are in general too costly in terms of time
and space.
Error Recovery In Predictive Parsing :
Refer Page Number 21 and 22
Error Recovery In LR Parsing:
Refer Page Number 39,40 and 41

2.10. YACC
YACC is an acronym for "Yet Another Compiler Compiler". It was originally developed in the
early 1970s by Stephen C. Johnson at AT&T Corporation and written in the B programming
language, but soon rewritten in C. It appeared as part of Version 3 Unix, and a full
description of Yacc was published in 1975.

40
Yacc is a computer program for the Unix operating system. It is an LALR parser generator:
it generates a parser, the part of a compiler that tries to make syntactic sense of the source
code, from an analytic grammar written in a notation similar
to BNF. Yacc itself used to be available as the default parser generator on most Unix
systems. The input to Yacc is a grammar with snippets of C code (called "actions") attached
to its rules. Its output is a shift-reduce parser in C that executes the C snippets associated
with each rule as soon as the rule is recognized. Typical actions involve the construction of
parse trees.
declarations
%%
translation rules
%%
supporting C routines

Figure 2.42: Creating an input/output translator with Yacc

Yacc specification of a simple desk calculator
%{
#include <ctype.h>
%}
%token DIGIT
%%
line : expr '\n' { printf("%d\n", $1); }
     ;
expr : expr '+' term { $$ = $1 + $3; }
     | term
     ;
term : term '*' factor { $$ = $1 * $3; }
     | factor
     ;
factor : '(' expr ')' { $$ = $2; }
     | DIGIT
     ;
%%
yylex()
{
    int c;
    c = getchar();
    if (isdigit(c))
    {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}

41
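Note that this specification supplies only yylex; when built as below, the Yacc library (-ly) provides default main() and yyerror() routines. A typical session might look like the following (the file name calc.y is our own assumption):

$ yacc calc.y
$ cc y.tab.c -ly
$ ./a.out
2+3*4
14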
DESIGN OF A SYNTAX ANALYZER FOR A SAMPLE LANGUAGE

Figure 2.43 : Design of Syntax Analyzer with Lex and Yacc


YACC (Yet Another Compiler Compiler).
• Automatically generates a parser for a context-free grammar (an LALR parser).
  - Allows syntax-directed translation by writing grammar productions and semantic actions.
  - LALR(1) is more powerful than LL(1).
• Works with lex: YACC calls yylex to get the next token.
  - YACC and lex must agree on the values for each token.
• Like lex, YACC pre-dates C++, so some constructs need workarounds when using C++.
• YACC file format:
declarations /* specify tokens, and non-terminals */
%%
translation rules /* specify grammar here */
%%
supporting C-routines
• The command "yacc yaccfile" produces y.tab.c, which contains a routine yyparse().
  - yyparse() calls yylex() to get tokens.
• yyparse() returns 0 if the program is grammatically correct, non-zero otherwise.
• YACC automatically builds a parser for the grammar (an LALR parser).
  - There may be shift/reduce and reduce/reduce conflicts when the grammar is not LALR.
  - In this case, you will need to modify the grammar to make it LALR in order for yacc to
    work properly.
• YACC tries to resolve conflicts automatically.
• Default conflict resolution (see the precedence sketch after this list):
  - shift/reduce --> shift
  - reduce/reduce --> first production in the state
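Rather than relying on these defaults, shift/reduce conflicts in an ambiguous expression grammar are normally resolved with Yacc's precedence declarations (%left, %right, %nonassoc). A minimal sketch of the declarations and rules sections; the token name NUMBER is our own:

%token NUMBER
%left '+' '-'      /* lowest precedence, left associative  */
%left '*' '/'      /* higher precedence, left associative  */
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '(' expr ')'
     | NUMBER
     ;
%%

With these declarations, Yacc resolves the shift/reduce conflicts so that * and / bind tighter than + and -, and all four operators associate to the left.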

42
Program to recognize a valid variable (identifier) which starts with a letter followed by any
number of letters or digits.
LEX
%{
#include"y.tab.h"
extern yylval;
%}
%%
[0-9]+ {yylval=atoi(yytext); return DIGIT;}
[a-zA-Z]+ {return LETTER;}
[ \t] ;
\n return 0;
. {return yytext[0];}
%%
YACC
%{
#include<stdio.h>
#include<stdlib.h>
%}
%token LETTER DIGIT
%%
variable: LETTER|LETTER rest
;
rest: LETTER rest
|DIGIT rest
|LETTER
|DIGIT
;
%%
main()
{
yyparse();
printf("The string is a valid variable\n");
}
int yyerror(char *s)
{
printf("this is not a valid variable\n");
exit(0);
}
OUTPUT
$lex p4b.l
$yacc -d p4b.y
$cc lex.yy.c y.tab.c -ll
$./a.out
input34
The string is a valid variable
$./a.out
89file
This is not a valid variable

43