
UNIT-II

SYNTAX ANALYSIS (PARSER)


ROLE OF PARSER:
Syntax analysis is the second phase of the compiler. It takes the stream of tokens produced by the
lexical analyzer as input and generates a syntax tree or parse tree as output.

For the input stream of tokens, the parser finds a derivation sequence and verifies whether the
string can be generated by the grammar of the source language.
To accomplish this task, the parser determines the structure of the program and classifies its
statements into entities such as declaration statements, control statements, operational statements, etc.
If any statement in the program violates the syntax rules of the programming language, the
parser reports errors. In order to produce an appropriate error message, it interacts with error
handling procedures and performs error recovery so that processing of the remaining
program can continue.
If no errors are identified by the parser, it produces the syntax tree as output.

CONTEXT-FREE GRAMMAR:
A Context-Free Grammar is defined as a quadruple G=(V,T,P,S), where
V - Set of non-terminals
T - Set of terminals
P - Set of production rules of the form A → α, where A ∈ V and α ∈ (V ∪ T)*
S - Starting symbol.
Example: Grammar for representing arithmetic expressions is represented as
E → E+E | E-E | E*E | E/E | (E) | id
where V = {E}
      T = {+, -, *, /, (, ), id}
      P = {E → E+E, E → E-E, E → E*E, E → E/E, E → (E), E → id}
      S = E
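For illustration (not part of the original notes), the quadruple above can be held in a few plain Python structures; the variable names below are arbitrary choices.

```python
# A minimal, hypothetical in-memory representation of G = (V, T, P, S)
# for the expression grammar above.
NON_TERMINALS = {"E"}                                  # V
TERMINALS = {"+", "-", "*", "/", "(", ")", "id"}       # T
PRODUCTIONS = {                                        # P: LHS -> list of RHS alternatives
    "E": [["E", "+", "E"], ["E", "-", "E"], ["E", "*", "E"],
          ["E", "/", "E"], ["(", "E", ")"], ["id"]],
}
START_SYMBOL = "E"                                     # S
```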
DERIVATION:
Two basic requirements for a grammar are :
1. To generate a valid string.
2. To recognize a valid string.
DERIVATION is the process of generating a valid string from the grammar by repeatedly
replacing a non-terminal with the string on the right-hand side of one of its productions.
TYPES OF DERIVATIONS:
There are two types of derivation. They are:
1. Leftmost derivation
2. Rightmost derivation.
LEFTMOST DERIVATION: At each step of the derivation process, the leftmost non-terminal is
replaced by the right-hand side of one of its productions.
RIGHTMOST DERIVATION: At each step of the derivation process, the rightmost non-terminal is
replaced by the right-hand side of one of its productions.
Example: Consider the Grammar G : E → E+E | E-E | E*E | E/E | (E) | id and let the string to be
derived be id+id*id.

LEFTMOST DERIVATION          RIGHTMOST DERIVATION
E ⇒ E+E                      E ⇒ E+E
  ⇒ id+E                       ⇒ E+E*E
  ⇒ id+E*E                     ⇒ E+E*id
  ⇒ id+id*E                    ⇒ E+id*id
  ⇒ id+id*id                   ⇒ id+id*id

Strings that appear in a leftmost derivation are called left sentential forms.
Strings that appear in a rightmost derivation are called right sentential forms.
SENTENTIAL FORMS: Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-
terminals or terminals, then α is called a sentential form of G.
DERIVATION TREE (OR) PARSE TREE (OR) SYNTAX TREE:
The graphical representation of the derivation of a particular string is called a derivation tree.
Each interior node of a parse tree is labeled by a non-terminal, and its children are labeled by the
terminals or non-terminals of the corresponding sentential form, read from left to right. The sentential
form obtained by reading the leaves of the parse tree from left to right is called the yield or frontier of the tree.
AMBIGUOUS GRAMMARS:
A Grammar is said to be Ambiguous if it produces more than one leftmost
derivation (or) more than one rightmost derivation (or) more than one parse tree for a particular string.
Example: Consider the Grammar G : E → E+E | E-E | E*E | E/E | (E) | id and the string
id+id*id. It has two distinct parse trees, one grouping the operands as id+(id*id) and the other
as (id+id)*id, so the grammar is ambiguous.

Elimination of ambiguity:
Ambiguity in a grammar that produces more than one parse tree (more than one leftmost or
rightmost derivation) can be eliminated by re-writing the grammar:
1. Starting with the production that contains the operator of lowest precedence, make it
unambiguous by introducing a new variable into the production.
2. Repeat step 1 until the production containing the operator of highest precedence has been
made unambiguous.

LEFT RECURSION:
A grammar is said to be left recursive if it has a production of the form A → Aα, where A is a
non-terminal and α is any combination of terminals and non-terminals.
Elimination of left recursion:
Top-down parsing methods cannot handle left-recursive grammars. Hence, left recursion
must be eliminated.
If there is a production A → Aα | β, it can be replaced with a sequence of two productions:
A → βA'
A' → αA' | ε
Example: the left-recursive grammar E → E+T | T, for instance, becomes E → TE' and E' → +TE' | ε.
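A minimal Python sketch of this transformation, assuming only immediate left recursion and using a list-of-symbols representation of the right-hand sides; the token name `epsilon` stands for ε and the function name is an illustrative choice.

```python
def eliminate_immediate_left_recursion(nonterminal, alternatives):
    """Apply the rule A -> A alpha | beta  ==>  A -> beta A',  A' -> alpha A' | epsilon.

    `alternatives` is a list of right-hand sides, each a list of symbols.
    Returns a dict mapping non-terminals to their new alternatives.
    A sketch only: it assumes immediate (not indirect) left recursion.
    """
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:                       # nothing to do
        return {nonterminal: alternatives}
    new_nt = nonterminal + "'"              # the new variable A'
    return {
        nonterminal: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [["epsilon"]],
    }

# Example: E -> E + T | T  becomes  E -> T E',  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```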

LEFT FACTORING:
Left factoring is performed when several alternatives of a non-terminal start with the same
symbol and it is not clear which of the alternative productions to use to expand the
non-terminal.
If there is any production A → αβ1 | αβ2, it can be rewritten as
A → αA'
A' → β1 | β2
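A rough sketch of one left-factoring round in Python; it factors only a shared first symbol, so a full pass would repeat it until no two alternatives of a non-terminal share a common prefix. The function name and the classic dangling-else-style grammar in the usage line are illustrative, not taken from the notes.

```python
from collections import defaultdict

def left_factor_once(nonterminal, alternatives):
    """One round of the rule A -> a b1 | a b2  ==>  A -> a A',  A' -> b1 | b2.

    Only a shared *first symbol* is factored here; repeating the round
    completes the left factoring described above.
    """
    groups = defaultdict(list)
    for alt in alternatives:
        groups[alt[0] if alt else "epsilon"].append(alt)

    result = {nonterminal: []}
    primes = 0
    for first_sym, group in groups.items():
        if len(group) == 1:                      # unique first symbol: keep as is
            result[nonterminal].append(group[0])
        else:                                    # shared prefix: introduce A', A'', ...
            primes += 1
            new_nt = nonterminal + "'" * primes
            result[nonterminal].append([first_sym, new_nt])
            result[new_nt] = [alt[1:] or ["epsilon"] for alt in group]
    return result

# Classic illustrative example: S -> iEtS | iEtSeS | a
print(left_factor_once("S", [["i", "E", "t", "S"],
                             ["i", "E", "t", "S", "e", "S"],
                             ["a"]]))
```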

CLASSIFICATION OF PARSING TECHNIQUES:


Parsers can be broadly classified into two categories:
1. Top down parsing
2. Bottom up parsing
TOP DOWN PARSING: When the parse tree is constructed from the root and expanded towards the
leaves, that kind of parsing is called top down parsing,
i.e., the parser starts with the starting symbol and derives the input string.
BOTTOM UP PARSING: When the parse tree is constructed from the leaves and developed towards the
root, that kind of parsing is called bottom up parsing,
i.e., the parser starts with the input string and derives the starting symbol.
TOP DOWN PARSERS:
A top down parser constructs the parse tree starting from the root and expanding towards the leaves.
It corresponds to a leftmost derivation, i.e., the parser starts with the starting variable and, at each
step, replaces the leftmost non-terminal with the right-hand side of one of its productions until the
input string is derived.
Types of top-down parsers :
Top down parsing can be classified into 2 types.
1. Recursive descent parsing
2. Predictive parsing.
Again predictive parsing can be classified into 2 types.
1. Recursive descent parser without backtracking
2. Predictive parser

1. RECURSIVE DESCENT PARSING:


Recursive descent parsing is one of the top-down parsing techniques that uses a set of
recursive procedures to scan its input. It is also known as backtracking parser (or) brute force
approach.
This parsing method may involve backtracking, that is, making repeated scans of the input,
to find the correct production to be applied, hence the name backtracking parser.

Example: Consider the grammar G :
S → cAd
A → ab | a
and the input string w = cad.
The parse tree can be constructed using the following top-down approach :
Step1:
Initially, create a tree with a single node labeled S. An input pointer points to c, the first
symbol of w. Expand the tree with the production of S.

Step2:
The leftmost leaf node c matches the first symbol of w, so advance the input pointer to the
second symbol of w, a, and consider the next leaf A. Expand A using its first alternative.
Step3:
The second symbol a of w also matches the second leaf of the tree. So advance the input
pointer to the third symbol of w, d. But the third leaf of the tree is b, which does not match the
input symbol d. Hence discard the chosen production and reset the pointer to the second position.
This is called backtracking.
Step4:
Now try the second alternative for A.

Now we can halt and announce the successful completion of parsing.
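A small Python recognizer that models the backtracking behaviour described above for the grammar S → cAd, A → ab | a; representing the outcome of each procedure as a set of reachable input positions is just one simple way to simulate the repeated scans, not the only implementation style.

```python
def parse(w):
    """Backtracking recursive-descent recognizer for S -> cAd, A -> ab | a.

    Each function returns the set of input positions reachable after
    consuming its non-terminal, which models trying every alternative.
    """
    def match(sym, pos):
        return {pos + 1} if pos < len(w) and w[pos] == sym else set()

    def A(pos):
        # try A -> ab first, then fall back to A -> a (the backtracking step)
        after_ab = {q for p in match("a", pos) for q in match("b", p)}
        return after_ab | match("a", pos)

    def S(pos):
        return {q for p in match("c", pos) for m in A(p) for q in match("d", m)}

    return len(w) in S(0)          # success iff the whole input is consumed

print(parse("cad"))   # True  (uses the alternative A -> a)
print(parse("cabd"))  # True  (uses the alternative A -> ab)
print(parse("cbd"))   # False
```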

2. RECURSIVE DESCENT PARSER WITHOUT BACKTRACKING:


Recursive descent parser without backtracking is one of the top-down parsing techniques that
uses a set of recursive procedures to scan its input. It involves no backtracking since it has the
capability to predict which production is to be used to replace the input string.
In a recursive descent parser without backtracking, a procedure is
constructed for each non-terminal. Initially the lookahead pointer points to the first symbol of the
given input string, and the parser is activated by calling the procedure that corresponds to the start variable.
The parser then reads the symbols in the body of the procedure. If a symbol is a non-terminal, it calls
the corresponding procedure; if the symbol is a terminal, it compares that terminal with the current
symbol of the given input string. If they match, the lookahead pointer is advanced to the next
symbol.
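The sketch below shows the one-procedure-per-non-terminal idea for a small hypothetical grammar (E → TE', E' → +TE' | ε, T → id); the class and method names are illustrative choices, not taken from the notes.

```python
class PredictiveRD:
    """Recursive-descent parser without backtracking for the hypothetical grammar
         E -> T E'      E' -> + T E' | epsilon      T -> id
    One procedure per non-terminal; the single lookahead token decides which
    alternative to expand, so no backtracking is ever needed.
    """
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]
        self.pos = 0

    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, terminal):
        if self.lookahead() == terminal:
            self.pos += 1
        else:
            raise SyntaxError(f"expected {terminal}, found {self.lookahead()}")

    def E(self):                 # E -> T E'
        self.T()
        self.E_prime()

    def E_prime(self):           # E' -> + T E' | epsilon
        if self.lookahead() == "+":
            self.match("+")
            self.T()
            self.E_prime()
        # otherwise choose E' -> epsilon and consume nothing

    def T(self):                 # T -> id
        self.match("id")

    def parse(self):
        self.E()
        self.match("$")          # all input must be consumed
        return True

print(PredictiveRD(["id", "+", "id"]).parse())   # True
```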
3. PREDICTIVE PARSER:
It is a special case of recursive descent parser where no backtracking is required. It has the
capability to predict which production is to be used to replace the input string. To accomplish
this task, predictive parser uses lookahead pointer which points to the next input symbol.
It is also known as Non-Recursive Descent Parser, since parser is built by
maintaining a stack explicitly rather than recursive procedures.
It is also known as Table Driven Parser, since the parser refers to the predictive
parsing table, while making parsing action.
It is also known as LL(1) Parser, where LL(1) stands for
L - Left to right scanning of input string
L - Leftmost derivation
1 - one lookahead symbol is used to predict the production during parsing.

Model of LL(1) Parser:


1. LL(1) parser consists of 4 components.
input buffer
stack
predictive parsing table
predictive parser program
2. Input buffer: It is used to store the input string that is to be parsed, followed by $ to
indicate the end of the input string.
3. Stack: It contains a sequence of grammar symbols preceded by $ to indicate the
bottom of the stack. Initially, the stack contains the start symbol on top of $.
4. Parsing table: It is a two-dimensional array M[A, a], where A is a non-terminal and
a is a terminal.
5. Predictive Parser program: It refers to the predictive parsing table for the input and
stack combination to get an appropriate production. It repeats until the input string is
derived or an error message is generated.

Steps required to construct predictive parser:


1. Eliminate ambiguity if the grammar is ambiguous, eliminate left recursion if there is
left recursion in the grammar and perform left factoring if possible.
2. Find first and follow sets for every non terminal.
3. Construct predictive parsing table by using first and follow sets.
4. Parse the given input string by using input buffer, stack and predictive parsing table.
5. Determine whether the string is accepted or not and also check whether the given
grammar is LL(1) grammar or not.

First Set:
First set of a non-terminal X is the set of all terminals with which the strings derived from X can begin.
Rules to compute FIRST set:
1. If X is a terminal, then FIRST(X) = {X}.
2. If X → ε, then add ε to FIRST(X).
3. If X is a non-terminal and X → aα, then add a to FIRST(X).
4. If X → Y1Y2...Yn and FIRST(Y1) does not contain ε, then add FIRST(Y1) to FIRST(X).
5. If X → Y1Y2...Yn and FIRST(Y1) contains ε, then add (FIRST(Y1) − {ε}) ∪ FIRST(Y2) to
FIRST(X), and continue with Y3, Y4, ... as long as ε keeps appearing; if all of Y1...Yn can
derive ε, add ε to FIRST(X).
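One possible fixed-point implementation of these rules in Python, using a small hypothetical grammar (E → TE', E' → +TE' | ε, T → id | (E)) purely for illustration; an empty list stands for an ε-alternative.

```python
EPS = "epsilon"

# Hypothetical grammar used only for illustration:
#   E -> T E'     E' -> + T E' | epsilon     T -> id | ( E )
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] stands for the epsilon alternative
    "T":  [["id"], ["(", "E", ")"]],
}

def first_of_seq(seq, first):
    """FIRST of a sequence of grammar symbols (rules 4 and 5 above)."""
    result = set()
    for sym in seq:
        sym_first = first[sym] if sym in first else {sym}   # rule 1: FIRST(a) = {a}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            break
    else:
        result.add(EPS)                  # every symbol in the sequence can vanish
    return result

def compute_first(grammar):
    """Apply the FIRST rules repeatedly until no set changes (a fixed point)."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            for alt in alts:
                new = first_of_seq(alt, first)
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
    return first

print(compute_first(GRAMMAR))   # FIRST(E) = FIRST(T) = {id, (},  FIRST(E') = {+, epsilon}
```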

Follow set:
Follow set of a non-terminal A is the set of all terminals that can appear immediately to the right of A in some sentential form.
Rules to compute FOLLOW set:
1. If S is the starting symbol, then FOLLOW(S) includes $.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in
FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) includes ε, then
everything in FOLLOW(A) is placed in FOLLOW(B).
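A matching sketch for FOLLOW, assuming the FIRST sets of the same hypothetical grammar have already been computed (for example by the sketch above); the hard-coded FIRST values are stated assumptions, not output of the notes.

```python
EPS = "epsilon"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],        # [] is the epsilon alternative
    "T":  [["id"], ["(", "E", ")"]],
}
# FIRST sets assumed already computed (e.g. by the previous sketch):
FIRST = {"E": {"id", "("}, "E'": {"+", EPS}, "T": {"id", "("}}

def first_of_seq(seq):
    result = set()
    for sym in seq:
        sym_first = FIRST.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    return result | {EPS}

def compute_follow(grammar, start):
    """Apply the three FOLLOW rules repeatedly until no set changes."""
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                               # rule 1
    changed = True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            for alt in alts:
                for i, B in enumerate(alt):
                    if B not in grammar:
                        continue                         # only non-terminals get FOLLOW sets
                    beta_first = first_of_seq(alt[i + 1:])
                    new = beta_first - {EPS}             # rule 2
                    if EPS in beta_first:                # rule 3: beta missing or nullable
                        new |= follow[lhs]
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

print(compute_follow(GRAMMAR, "E"))
# FOLLOW(E) = FOLLOW(E') = {), $},  FOLLOW(T) = {+, ), $}
```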
Steps required to construct predictive parsing table:
1. Consider the productions in the given grammar. Non-terminals are used for row
selection and terminals are used for column selection.
2. For each production A → α:
For each terminal a in FIRST(α), add A → α to M[A, a].
If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to
M[A, b].
If ε is in FIRST(α) and $ is in FOLLOW(A), then add A → α to M[A, $].
3. Make each undefined entry as syntax error.
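The table-construction rules can be sketched as below for the same hypothetical grammar, with FIRST and FOLLOW assumed precomputed; a cell holding more than one production would signal an LL(1) conflict.

```python
EPS = "epsilon"

# Hypothetical running example (same shape as the sketches above):
#   E -> T E'     E' -> + T E' | epsilon     T -> id | ( E )
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["id"], ["(", "E", ")"]],
}
# FIRST and FOLLOW sets, assumed precomputed:
FIRST  = {"E": {"id", "("}, "E'": {"+", EPS}, "T": {"id", "("}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"}}

def first_of_seq(seq):
    """FIRST of a right-hand side, using the precomputed FIRST sets."""
    result = set()
    for sym in seq:
        sym_first = FIRST.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    return result | {EPS}

def build_ll1_table(grammar):
    """Fill M[A, a] by the three rules above; more than one production in a
    cell means the grammar is not LL(1)."""
    table = {}
    for lhs, alts in grammar.items():
        for alt in alts:
            alpha_first = first_of_seq(alt)
            targets = set(alpha_first - {EPS})
            if EPS in alpha_first:
                targets |= FOLLOW[lhs]          # covers both the b- and the $-cases
            for a in targets:
                table.setdefault((lhs, a), []).append((lhs, alt))
    return table

M = build_ll1_table(GRAMMAR)
print(M[("E'", "+")])   # use E' -> + T E'
print(M[("E'", "$")])   # use E' -> epsilon
```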

Steps required to parse the given input string using predictive parser:
1. Initially predictive parser contains $ followed by starting variable S in stack, and input
string w followed by $ in input buffer.
Stack Input buffer
$S w$
2. Let X be the symbol on top of the stack and a be the current input symbol.
i. If X = a = $, the parser halts and announces successful completion of parsing.
ii. If X = a ≠ $, the parser pops X off the stack and advances the input pointer
to the next input symbol.
iii. If X is a non-terminal, the program consults entry M[X, a] of the parsing
table M. This entry will either be an X-production of the grammar or an error
entry.
If M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU (so that U is on top).
If M[X, a] = error, the parser calls an error recovery routine.
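A compact sketch of this stack-driven loop, using a hand-filled parsing table for the hypothetical grammar E → TE', E' → +TE' | ε, T → id | (E); the table entries below are stated assumptions corresponding to the construction sketched earlier.

```python
# Predictive parsing table for the hypothetical grammar
#   E -> T E'   E' -> + T E' | epsilon   T -> id | ( E )
TABLE = {
    ("E", "id"): ["T", "E'"],   ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): [],            ("E'", "$"): [],          # E' -> epsilon
    ("T", "id"): ["id"],        ("T", "("): ["(", "E", ")"],
}
NONTERMS = {"E", "E'", "T"}

def ll1_parse(tokens, start="E"):
    """Drive the stack/input loop described above; returns True on acceptance."""
    stack = ["$", start]                      # start symbol on top of $
    buf = tokens + ["$"]
    i = 0
    while True:
        X, a = stack[-1], buf[i]
        if X == a == "$":                     # case (i): success
            return True
        if X == a:                            # case (ii): match a terminal
            stack.pop(); i += 1
        elif X in NONTERMS and (X, a) in TABLE:
            stack.pop()                       # case (iii): expand by M[X, a]
            stack.extend(reversed(TABLE[(X, a)]))   # push RHS so its first symbol is on top
        else:
            return False                      # error entry

print(ll1_parse(["id", "+", "id"]))           # True
print(ll1_parse(["id", "+", "+"]))            # False
```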
LL(1) grammar:
Context free grammar G=(V,T,P,S) is said to be LL(1) grammar, if the associated LL(1)
parsing table has no multiple productions in the same entry.
A language is said to be LL(1) if it is generated by LL(1)grammar.
Properties of LL(1) grammar:
1. LL(1) grammar should not contain ambiguity.
2. LL(1) grammar should not contain left recursion.
3. There should not be multiple productions in the same entry of the LL(1) parsing table.
4. The grammar is said to be LL(1) if and only if the following conditions are
satisfied for every pair of distinct productions A → α and A → β:
i. α and β do not both derive strings beginning with the same terminal a, i.e.
FIRST(α) ∩ FIRST(β) contains no terminal a.
ii. At most one of α and β can derive the empty string ε.
Combining (i) and (ii): FIRST(α) ∩ FIRST(β) = ∅.
This allows the parser to make the correct choice with a lookahead of exactly one symbol.
5. If FIRST(β) includes ε, then α does not derive any string beginning with a
terminal in FOLLOW(A), i.e.
FIRST(α) ∩ FOLLOW(A) = ∅.
LL(K) grammars:
Grammars parsable with LL(k) parsing tables are called LL(k) grammars. In an LL(k) parsing
table, non-terminals are used for row selection and a column is provided for every sequence of
k terminals.
Properties of LL(k) grammar:
1. If G is LL(k), then G is also LL(k+1) for k ≥ 1.
2. If G is LL(k) for some k ≥ 1, then G is not an ambiguous grammar.
UNIT-3
SYNTAX ANALYSIS - BOTTOM UP PARSERS

BOTTOM UP PARSERS:
A bottom up parser constructs the parse tree starting from the leaves and developing towards the
root. It corresponds to a rightmost derivation in reverse order, i.e., the parser starts with the input
string and, at each step, identifies a right-hand side (a handle) and replaces it with its corresponding
LHS non-terminal until the starting variable is derived.
Handle:
A handle of a string is a substring that matches the right side of a production, and whose
reduction to the non-terminal on the left side of the production represents one step along the
reverse of a rightmost derivation.
Example:
Consider the grammar:              The rightmost derivation is:
E → E+E                            E ⇒ E+E
E → E*E                              ⇒ E+E*E
E → (E)                              ⇒ E+E*id3
E → id                               ⇒ E+id2*id3
and the input string id1+id2*id3     ⇒ id1+id2*id3
In the above derivation, the substring reduced at each step of the reverse derivation
(id1, then id2, then id3, then E*E, then E+E) is the handle at that step.
Handle pruning:
A rightmost derivation in reverse can be obtained by handle pruning, i.e., if w is a sentence
of the grammar at hand, then w = γn, where γn is the nth right sentential form of some
rightmost derivation.
Types of bottom-up parsers :
Bottom up parsing can be classified into 3 types.
1. Shift reduce parser
2. Operator Precedence parser
3. LR parsers.
Again LR parsers can be classified into 3 types.
1. Simple LR Parser (SLR)
2. Canonical LR Parser (CLR)
3. LookAhead LR Parser (LALR)
1. SHIFT REDUCE PARSER:
Shift Reduce parser is the basic bottom-up parsing technique in which shift and reduce are
the two actions performed with the help of stack.
The shift action is used to move symbols from the input buffer onto the stack.
The reduce action is used to replace RHS(handle) on the top of the stack with its equivalent
LHS.
Stack Implementation (or) Working of Shift Reduce Parser:
1. Shift reduce parser consists of two components:
Input buffer
Stack
where input buffer is used to store the input string that is to be parsed and stack is
used to either shift or reduce the symbols.
2. Shift reduce parser performs the following 4 actions:
1. Shift
2. Reduce
3. Accept
4. Reject
3. Initially shift reduce parser contains $ in stack, and input string w followed by $ in
input buffer.
Stack Input buffer
$ w$
4. Now the input string is scanned from left to right and zero or more symbols are shifted onto
the stack until an appropriate handle appears on top of the stack.
5. When parser identifies handle on top of the stack then it reduces handle by its
equivalent LHS.
6. Repeat steps 4 and 5 until the input string is accepted or an error message is generated.
7. A string is said to be accepted, if shift reduce parser contains $ followed by starting
symbol S in stack and $ in the input buffer.
Stack Input buffer
$S $
8. A string is said to be rejected, if shift reduce parser violates accept condition.
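The sketch below traces these shift and reduce actions for the grammar E → E+E | E*E | id on id+id*id. The shift/reduce decisions are resolved with a simple operator-precedence heuristic, which is only one possible way of settling the conflicts discussed next; the names and the trace format are illustrative.

```python
# Grammar (hypothetical running example): E -> E + E | E * E | id
PREC = {"+": 1, "*": 2}

def shift_reduce_parse(tokens):
    """Shift-reduce recognizer for E -> E + E | E * E | id.

    Reduce only when the lookahead operator does not bind tighter than the
    operator already on the stack (a precedence heuristic for the conflicts).
    """
    stack = ["$"]
    buf = tokens + ["$"]
    i = 0
    while True:
        a = buf[i]
        # reduce E -> id as soon as id is on top of the stack
        if stack[-1] == "id":
            stack[-1] = "E"
            print(f"{''.join(stack):12} {''.join(buf[i:]):12} reduce E -> id")
            continue
        # reduce E -> E op E when the handle is on top of the stack and the
        # lookahead operator does not have higher precedence
        if (len(stack) >= 4 and stack[-1] == "E" and stack[-2] in PREC
                and stack[-3] == "E" and PREC.get(a, 0) <= PREC[stack[-2]]):
            op = stack[-2]
            del stack[-3:]
            stack.append("E")
            print(f"{''.join(stack):12} {''.join(buf[i:]):12} reduce E -> E {op} E")
            continue
        if stack == ["$", "E"] and a == "$":
            print("accept")
            return True
        if a == "$":
            print("reject")
            return False
        stack.append(a)                       # shift the next input symbol
        i += 1
        print(f"{''.join(stack):12} {''.join(buf[i:]):12} shift")

shift_reduce_parse(["id", "+", "id", "*", "id"])
```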

Conflicts in shift reduce parsing:


Shift reduce conflict(SR-conflict) and Reduce reduce conflict(RR-conflict) are the two types
of conflicts that occur during shift reduce parsing.
1. Shift Reduce Conflict:
If the parser cannot decide whether to shift the next symbol or reduce the handle on top of
the stack, the problem is called a Shift Reduce Conflict.
2. Reduce Reduce Conflict:
If the parser cannot decide which reduction to apply for a particular handle, the problem is
called a Reduce Reduce Conflict.

2. OPERATOR PRECEDENCE PARSER:


Operator precedence parser is one of the bottom-up parsing techniques which can be used
only for operator grammars.
Operator Grammar:
Operator grammar is a context free grammar in which no production on right side contains
1. epsilon () or
2. adjacent non-terminals.
Operator precedence relations:
There are three disjoint precedence relations, namely
<·  -  less than
=·  -  equal to
·>  -  greater than
These relations are defined as:
a <· b   a yields precedence to b (a has lower precedence than b)
a =· b   a has the same precedence as b
a ·> b   a takes precedence over b (a has higher precedence than b)
Model of Operator Precedence Parser:
1. Operator Precedence parser consists of 4 components.
input buffer
stack
operator precedence parsing table
operator precedence parser program
2. Input buffer: It is used to store the input string that is to be parsed, followed by $ to
indicate the end of the input string.
3. Stack: It contains a sequence of grammar symbols preceded by $ to indicate the
bottom of the stack. Initially, the stack contains the start symbol on top of $.
4. Operator Precedence Parsing table: It is a two-dimensional array M[a, b], where a
and b are terminals.
5. Operator Precedence Parser program: It refers to the precedence parsing table for
the input and stack combination to get an appropriate action. It repeats until the input
string is derived or an error message is generated.
Steps required to construct operator precedence parser:
1. Check whether the given grammar is an operator grammar or not. If not, convert it
into an operator grammar.
2. The grammar may even be an ambiguous operator grammar (for example E → E+E | E*E | (E) | id);
the precedence and associativity rules of the operators are used to resolve such ambiguity
while constructing the parsing table.
3. Construct operator precedence parsing table by using operator precedence and
associativity rules.
4. Parse the given input string by using input buffer, stack and operator precedence
parsing table.
5. Determine whether the string is accepted or not.
Operator Precedence and Associativity rules to construct operator precedence table:
1. If operator θ1 has higher precedence than operator θ2, then make

   θ1 ·> θ2   and   θ2 <· θ1

2. If operators θ1 and θ2 are of equal precedence, then make

   θ1 ·> θ2   and   θ2 ·> θ1 , if they are left associative

   θ1 <· θ2   and   θ2 <· θ1 , if they are right associative

3. Make the following for every operator θ:

   θ <· id      id ·> θ
   θ <· (       ( <· θ
   ) ·> θ       θ ·> )
   θ ·> $       $ <· θ
   $ <· id      id ·> $
4. ( =· )
5. Precedence relations are not defined for the following pairs (these entries are errors):
   (id, id)     ((, $)
   (id, ()      (), ()
   (), id)      ($, ))
   ($, $)
Steps required to parse the given input string using operator precedence parser:
1. Operator Precedence parser performs the following 4 actions:
1. Shift
2. Reduce
3. Accept
4. Reject
2. Initially Operator Precedence parser contains $ in stack, and input string w followed
by $ in input buffer.
Stack Input buffer
$ w$
3. Let a be the topmost terminal on the stack and b be the current input symbol.
i. If a <· b or a =· b, then the shift action is performed.
ii. If a ·> b, then the reduce action is performed.
iii. If no precedence relation is defined between a and b, or the accept condition is
violated at the end of the input, then the reject action is performed.
iv. If the operator precedence parser contains $ followed by the starting symbol S in
the stack and $ in the input buffer, then the accept action is performed.
Stack          Input buffer
$S             $
4. Repeat 3rd step until the input string is accepted or an error message is generated.
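A skeleton of this driver in Python for the grammar E → E+E | E*E | (E) | id, with the relation table filled in by hand from the rules above. For simplicity the stack holds terminals only, and the sketch does not check that each popped handle actually matches a production right-hand side; it only demonstrates the shift/reduce/accept/reject decisions.

```python
# Operator precedence relations for the hypothetical grammar
#   E -> E + E | E * E | ( E ) | id     (terminals: id + * ( ) $)
# built by hand from the precedence/associativity rules listed above.
TABLE = {
    "id": {"+": ">", "*": ">", ")": ">", "$": ">"},
    "+":  {"id": "<", "+": ">", "*": "<", "(": "<", ")": ">", "$": ">"},
    "*":  {"id": "<", "+": ">", "*": ">", "(": "<", ")": ">", "$": ">"},
    "(":  {"id": "<", "+": "<", "*": "<", "(": "<", ")": "="},
    ")":  {"+": ">", "*": ">", ")": ">", "$": ">"},
    "$":  {"id": "<", "+": "<", "*": "<", "(": "<"},
}

def op_precedence_parse(tokens):
    """Skeleton of the operator precedence driver: shift on < or =, reduce
    (pop a handle) on >, accept when only $ and $ remain, reject otherwise."""
    stack = ["$"]                                    # terminals only in this sketch
    buf = tokens + ["$"]
    i = 0
    while True:
        a, b = stack[-1], buf[i]
        if a == "$" and b == "$":
            return True                              # accept
        rel = TABLE.get(a, {}).get(b)
        if rel in ("<", "="):
            stack.append(b)                          # shift
            i += 1
        elif rel == ">":
            # reduce: pop terminals until the new top yields precedence (<)
            # to the terminal most recently popped
            popped = stack.pop()
            while TABLE.get(stack[-1], {}).get(popped) != "<":
                popped = stack.pop()
        else:
            return False                             # no relation defined: error

print(op_precedence_parse(["id", "+", "id", "*", "id"]))   # True
print(op_precedence_parse(["id", "+", ")"]))               # False
```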
3. LR PARSER:
LR(k) Parser is a bottom up parser which can be used to parse a large class of
grammars, where LR(k) stands for
L - Left to right scanning of input string
R - Rightmost derivation in reverse order
k - number of lookahead symbols used during parsing

Model of LR Parser:
1. LR parser consists of 4 components.
input buffer
stack
LR parsing table
LR parser program
2. Input buffer: It is used to store the input string that is to be parsed, followed by $ to
indicate the end of the input string.
3. Stack: It contains a sequence of states si and grammar symbols Xi in the form
s0 X1 s1 X2 s2 ... Xm sm, preceded by $ to indicate the bottom of the stack. Initially,
the stack contains the starting state on top of $.
4. LR Parsing table: It consists of 2 parts.
i. Action table
ii. GOTO table.
Action table is a two-dimensional array M[s, a], where s is a state and a is a
terminal, which specifies the action to be performed, and
GOTO table is a two-dimensional array M[s, A], where s is a state and A is a non-
terminal, which specifies the next state.
5. LR Parser program: It refers to the LR parsing table for the input and stack
combination to get an appropriate action. It repeats until the input string is derived or
an error message is generated. It is the same for all three LR parsers; they differ only in
how the parsing table is constructed.
Types of LR Parsers:
1. SLR Parser
2. CLR Parser (or) LR Parser
3. LALR Parser
SLR Parser:
Simple LR parser is a bottom up parser in which no lookahead symbols are attached to the items
while constructing the canonical set of items; hence it is built from LR(0) items.
Steps required to construct SLR parser:
1. Check whether the given grammar is an augmented grammar or not. If not, convert it
into an augmented grammar.
2. Construct canonical set of LR(0) items.
3. Construct GOTO graph for canonical set of LR(0) items.
4. Construct FOLLOW sets for every non terminal in the augmented grammar.
5. Construct SLR parsing table by using canonical set of LR(0) items and FOLLOW sets.
6. Parse the given input string by using input buffer, stack and SLR parsing table.
7. Determine whether the string is accepted or not and also check whether the given
grammar is SLR grammar or not.
Augmented Grammar:
Augmented grammar G' is the grammar G with a new production rule E' → E, where E' is the new
starting variable and E is the starting variable of the given grammar, so that the starting production
has only a single variable on its right side.
LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of the
right side. For example, production A → XYZ yields the four LR(0) items:
A → · XYZ
A → X · YZ
A → XY · Z
A → XYZ ·
The dot indicates how much of the right side of the production has been recognized so far.

The LR(0) item for A → ε is A → · .


Steps required to construct canonical set of LR(0) items:
To construct canonical set of LR(0) items, we need two functions.
1. Closure function
2. GOTO function.
Closure function:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed
from I by the following rules:
1. Initially, every item in I is added to closure(I).
2. If A → α · Bβ is in closure(I) and B → γ is a production, then add the item B → · γ to
closure(I), if it is not already there.
3. Repeat step 2 until no more new items can be added to closure(I).
Goto function:
Goto(I, X) is defined to be the closure of the set of all items [A → αX · β] such
that [A → α · Xβ] is in I.
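The two functions can be sketched as follows, using the hypothetical augmented grammar S' → S, S → CC, C → cC | d purely for illustration.

```python
from collections import namedtuple

# An LR(0) item: production A -> alpha . beta, with `dot` marking the position.
Item = namedtuple("Item", "lhs rhs dot")

# Hypothetical augmented grammar used for illustration:
#   S' -> S      S -> C C      C -> c C | d
GRAMMAR = {
    "S'": [("S",)],
    "S":  [("C", "C")],
    "C":  [("c", "C"), ("d",)],
}

def closure(items):
    """Closure of a set of LR(0) items, following the two rules above."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for item in list(result):
            if item.dot < len(item.rhs):
                B = item.rhs[item.dot]               # symbol after the dot
                if B in GRAMMAR:                     # B is a non-terminal
                    for rhs in GRAMMAR[B]:
                        new = Item(B, rhs, 0)        # B -> . gamma
                        if new not in result:
                            result.add(new)
                            changed = True
    return frozenset(result)

def goto(items, X):
    """goto(I, X): move the dot over X in every item that allows it, then close."""
    moved = {Item(i.lhs, i.rhs, i.dot + 1)
             for i in items if i.dot < len(i.rhs) and i.rhs[i.dot] == X}
    return closure(moved)

I0 = closure({Item("S'", ("S",), 0)})
print(sorted(I0))                 # all items of the form A -> . gamma reachable from S' -> . S
print(sorted(goto(I0, "C")))      # the state reached on the grammar symbol C
```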
Steps required to construct SLR parsing table:
1. Consider canonical set of LR(0) items. States are used for row selection and terminals
are used for column selection in ACTION table and Non-terminals are used for
column selection in GOTO table.
2. ACTION table:
If [A → α · aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j", where a is
a terminal.
If [A → α ·] is in Ii, then set action[i, a] to "reduce A → α" for all a in FOLLOW(A).
If [S' → S ·] is in Ii, then set action[i, $] to "accept".
3. GOTO table:
If goto(Ii,A) = Ij, then goto[i,A] = j
4. Make each undefined entry as syntax error.
Steps required to parse the given input string using SLR, CLR, LALR parser:
1. Initially SLR parser contains $ followed by starting state 0 in stack, and input string w
followed by $ in input buffer.
Stack Input buffer
$0 w$
2. Let s be the state on top of the stack and a be the current input symbol.
i. if action[s, a] = shift s', then push a and then s' on top of the stack; advance ip to the next
input symbol.
ii. if action[s, a] = reduce A → β then
pop 2*|β| symbols from the stack;
let s' be the state now on top of the stack; push A and
then push goto[s', A] on top of the stack;
iii. if action[s, a] = accept, then the parser accepts the given input string and
generates the parse tree.
iv. if action[s, a] = error, then the parser calls the corresponding error recovery
procedure.
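The driver loop itself is small. The sketch below hard-codes an SLR(1) table for the hypothetical augmented grammar S' → S, S → (S) | x; since the loop only reads ACTION and GOTO, the same code would work unchanged with a CLR or LALR table.

```python
# SLR(1) table for the hypothetical augmented grammar
#   0: S' -> S     1: S -> ( S )     2: S -> x
# Shift entries are ('s', state), reduce entries ('r', production_number).
ACTION = {
    (0, "("): ("s", 2), (0, "x"): ("s", 3),
    (1, "$"): ("acc",),
    (2, "("): ("s", 2), (2, "x"): ("s", 3),
    (3, ")"): ("r", 2), (3, "$"): ("r", 2),
    (4, ")"): ("s", 5),
    (5, ")"): ("r", 1), (5, "$"): ("r", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}
PRODUCTIONS = {1: ("S", ["(", "S", ")"]), 2: ("S", ["x"])}

def lr_parse(tokens):
    """The LR driver loop described above; only the table construction differs
    between SLR, CLR and LALR parsers."""
    stack = ["$", 0]                           # states interleaved with symbols
    buf = tokens + ["$"]
    i = 0
    while True:
        s, a = stack[-1], buf[i]
        act = ACTION.get((s, a))
        if act is None:
            return False                       # error entry
        if act[0] == "s":                      # shift: push symbol and new state
            stack += [a, act[1]]
            i += 1
        elif act[0] == "r":                    # reduce by A -> beta
            A, beta = PRODUCTIONS[act[1]]
            del stack[-2 * len(beta):]         # pop 2*|beta| entries
            stack += [A, GOTO[(stack[-1], A)]]
        else:                                  # accept
            return True

print(lr_parse(["(", "(", "x", ")", ")"]))     # True
print(lr_parse(["(", "x"]))                    # False
```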
CLR Parser:
Canonical LR parser is a bottom up parser, in which one lookahead symbol is generated
while constructing canonical set of items. Hence it is also known as LR(1) parser.
Steps required to construct CLR parser:
1. Check whether the given grammar is an augmented grammar or not. If not, convert it
into an augmented grammar.
2. Construct canonical set of LR(1) items.
3. Construct GOTO graph for canonical set of LR(1) items.
4. Construct CLR parsing table by using canonical set of LR(1) items.
5. Parse the given input string by using input buffer, stack and CLR parsing table.
6. Determine whether the string is accepted or not and also check whether the given
grammar is CLR grammar or not.
Steps required to construct canonical set of LR(1) items:
To construct canonical set of LR(1) items, we need two functions.
1. Closure function
2. GOTO function.
Closure function:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed
from I by the following rules:
1. Initially, add [S' → · S, $] as the first item of I0.
2. If [A → α · Bβ, a] is in closure(I) and B → γ is a production and b is in FIRST(βa), then add
the item [B → · γ, b] to closure(I), if it is not already there.
3. Repeat step 2 until no more new items can be added to closure(I).
Goto function:
Goto(I, X) is defined to be the closure of the set of all items [A → αX · β, a] such that
[A → α · Xβ, a] is in I.
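A sketch of the LR(1) closure with the lookahead computed via FIRST(βa), again using the hypothetical grammar S' → S, S → CC, C → cC | d; this grammar has no ε-productions, so the FIRST helper can stay deliberately simple.

```python
from collections import namedtuple

# An LR(1) item: [A -> alpha . beta, lookahead]
Item = namedtuple("Item", "lhs rhs dot lookahead")

# Hypothetical grammar (augmented): S' -> S,  S -> C C,  C -> c C | d
GRAMMAR = {"S'": [("S",)], "S": [("C", "C")], "C": [("c", "C"), ("d",)]}
# FIRST sets for this grammar, assumed precomputed:
FIRST = {"S'": {"c", "d"}, "S": {"c", "d"}, "C": {"c", "d"}}

def first_of_seq(seq):
    """FIRST of a sequence (no epsilon-productions in this tiny grammar)."""
    if not seq:
        return set()
    sym = seq[0]
    return set(FIRST[sym]) if sym in GRAMMAR else {sym}

def closure_lr1(items):
    """LR(1) closure: for [A -> alpha . B beta, a], add [B -> . gamma, b]
    for every production B -> gamma and every b in FIRST(beta a)."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for item in list(result):
            if item.dot >= len(item.rhs):
                continue
            B = item.rhs[item.dot]
            if B not in GRAMMAR:
                continue
            beta = item.rhs[item.dot + 1:]
            lookaheads = first_of_seq(beta) or {item.lookahead}   # FIRST(beta a)
            for gamma in GRAMMAR[B]:
                for b in lookaheads:
                    new = Item(B, gamma, 0, b)
                    if new not in result:
                        result.add(new)
                        changed = True
    return frozenset(result)

I0 = closure_lr1({Item("S'", ("S",), 0, "$")})
for it in sorted(I0):
    print(it)   # e.g. [C -> . d, c], [C -> . d, d], [S -> . C C, $], ...
```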
Steps required to construct CLR, LALR parsing table:
1. Consider canonical set of LR(1) items. States are used for row selection and terminals
are used for column selection in ACTION table and Non-terminals are used for
column selection in GOTO table.
2. ACTION table:
If [A → α · aβ, b] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j", where a
is a terminal.
If [A → α ·, a] is in Ii, then set action[i, a] to "reduce A → α".
If [S' → S ·, $] is in Ii, then set action[i, $] to "accept".
3. GOTO table:
If goto(Ii,A) = Ij, then goto[i,A] = j
4. Make each undefined entry as syntax error.
LALR Parser:
LookAhead LR parser is a bottom up parser in which one lookahead symbol is generated
while constructing the canonical set of items. It is the same as the LR(1) parser but with a reduced
number of states; it ends up with the same number of states as the SLR parser.
Steps required to construct LALR parser:
1. Check whether the given grammar is an augmented grammar or not. If not, convert it
into an augmented grammar.
2. Construct canonical set of LR(1) items.
3. If any states have the same set of production rules (the same core) but different lookahead
symbols, then merge those states into a single state: Iij = Ii ∪ Ij.
4. Construct GOTO graph for canonical set of LR(1) items.
5. Construct LALR parsing table by using canonical set of LR(1) items.
6. Parse the given input string by using input buffer, stack and LALR parsing table.
7. Determine whether the string is accepted or not and also check whether the given
grammar is LALR grammar or not.
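Step 3, the merging of states with identical cores, might look like this in Python; the item tuples and the states labelled I4 and I7 are hypothetical illustrations of the kind of LR(1) states produced for a grammar such as S → CC, C → cC | d.

```python
from collections import defaultdict

# An LR(1) item is (lhs, rhs, dot, lookahead); a state is a frozenset of items.
def merge_lr1_states(states):
    """Merge LR(1) states that share the same core (the items without their
    lookaheads), taking the union of lookaheads -- the LALR construction step."""
    groups = defaultdict(set)
    for state in states:
        core = frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)
        groups[core] |= set(state)
    return [frozenset(items) for items in groups.values()]

# Two hypothetical LR(1) states with the same core but different lookaheads:
I4 = frozenset({("C", ("d",), 1, "c"), ("C", ("d",), 1, "d")})
I7 = frozenset({("C", ("d",), 1, "$")})
merged = merge_lr1_states([I4, I7])
print(len(merged))      # 1 -- the two states collapse into a single LALR state
print(merged[0])        # C -> d .  with lookaheads {c, d, $}
```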
