For the input stream of tokens, the parser finds a derivation sequence and verifies whether the
string can be generated by the grammar of the source language.
To accomplish this task, the parser determines the structure of the program and classifies
its statements into entities such as declaration statements, control statements, operational statements, etc.
If any statement in the program violates the syntax rules of the programming language,
the parser reports an error. To produce an appropriate error message, it interacts with the error
handling procedures and performs error recovery so that it can continue processing the remaining
program.
If the parser identifies no errors, it produces a syntax tree as output.
CONTEXT-FREE GRAMMAR:
A Context-Free Grammar is defined as a quadruple G=(V,T,P,S), where
V - Set of non-terminals
T - Set of terminals
P - Set of production rules of the form A → α, where A ∈ V and α ∈ (V ∪ T)*
S - Starting symbol, S ∈ V
Example: A grammar for arithmetic expressions is
E → E+E | E-E | E*E | E/E | (E) | id
where V = {E}
T = {+, -, *, /, (, ), id}
P = {E → E+E, E → E-E, E → E*E, E → E/E, E → (E), E → id}
S = E
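The quadruple can be written down directly as a data structure. The following is a minimal Python sketch, not a standard API: the names Grammar, is_well_formed and arith are illustrative, and each production is assumed to be stored as a (head, body) pair whose body is a tuple of symbols.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    """A context-free grammar G = (V, T, P, S)."""
    V: frozenset  # non-terminals
    T: frozenset  # terminals
    P: tuple      # productions as (head, body) pairs; body is a tuple of symbols
    S: str        # starting symbol

def is_well_formed(g: Grammar) -> bool:
    """V and T are disjoint, S is in V, every head is in V,
    and every body is a string over V ∪ T."""
    symbols = g.V | g.T
    return (
        g.V.isdisjoint(g.T)
        and g.S in g.V
        and all(head in g.V and all(sym in symbols for sym in body)
                for head, body in g.P)
    )

# The arithmetic expression grammar from the example above.
arith = Grammar(
    V=frozenset({"E"}),
    T=frozenset({"+", "-", "*", "/", "(", ")", "id"}),
    P=(
        ("E", ("E", "+", "E")),
        ("E", ("E", "-", "E")),
        ("E", ("E", "*", "E")),
        ("E", ("E", "/", "E")),
        ("E", ("(", "E", ")")),
        ("E", ("id",)),
    ),
    S="E",
)

print(is_well_formed(arith))
```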
DERIVATION:
Two basic requirements for a grammar are :
1. To generate a valid string.
2. To recognize a valid string.
DERIVATION is a process that generates a valid string with the help of the grammar by repeatedly
replacing a non-terminal (the left side of a production) with the string on the right side of that production.
TYPES OF DERIVATIONS:
There are two types of derivation. They are:
1. Leftmost derivation
2. Rightmost derivation
LEFTMOST DERIVATION: At each step of derivation process, the leftmost non-terminal is
replaced by its equivalent RHS.
RIGHTMOST DERIVATION: At each step of derivation process, the rightmost non-terminal is
replaced by its equivalent RHS.
Example: Consider the grammar G : E → E+E | E-E | E*E | E/E | (E) | id and let the string to be
derived be id+id*id.
LEFTMOST DERIVATION      RIGHTMOST DERIVATION
E ⇒ E+E                  E ⇒ E+E
  ⇒ id+E                   ⇒ E+E*E
  ⇒ id+E*E                 ⇒ E+E*id
  ⇒ id+id*E                ⇒ E+id*id
  ⇒ id+id*id               ⇒ id+id*id
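The two derivations can be replayed mechanically. The sketch below is only an illustration: the function derive and the way production choices are passed in are assumptions, and the caller must still decide which production to apply at each step.

```python
# Productions of the expression grammar G; each body is a list of symbols.
GRAMMAR = {
    "E": [["E", "+", "E"], ["E", "-", "E"], ["E", "*", "E"],
          ["E", "/", "E"], ["(", "E", ")"], ["id"]],
}

def derive(start, choices, leftmost=True):
    """Apply the chosen production bodies one by one, always expanding the
    leftmost (or rightmost) non-terminal. Returns every sentential form."""
    form = [start]
    steps = ["".join(form)]
    for body in choices:
        # Positions of all non-terminals in the current sentential form.
        positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
        i = positions[0] if leftmost else positions[-1]
        if body not in GRAMMAR[form[i]]:
            raise ValueError(f"{form[i]} -> {body} is not a production")
        form[i:i + 1] = body  # replace the non-terminal by the chosen RHS
        steps.append("".join(form))
    return steps

PLUS, TIMES, ID = ["E", "+", "E"], ["E", "*", "E"], ["id"]
left = derive("E", [PLUS, ID, TIMES, ID, ID], leftmost=True)
right = derive("E", [PLUS, TIMES, ID, ID, ID], leftmost=False)
print(" => ".join(left))
print(" => ".join(right))
```

Both runs end in id+id*id, but they pass through different sentential forms.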
Strings that appear in a leftmost derivation are called left sentential forms.
Strings that appear in a rightmost derivation are called right sentential forms.
SENTENTIAL FORM: Given a grammar G with start symbol S, if S ⇒* α, where α may contain
non-terminals or terminals, then α is called a sentential form of G.
DERIVATION TREE (OR) PARSE TREE (OR) SYNTAX TREE:
The graphical representation of the derivation of a particular string is called a derivation tree.
Each interior node of a parse tree is a non-terminal, and its children are the terminals
or non-terminals on the right side of the production applied at that node, read from left to right.
The string formed by the leaves of the parse tree, read from left to right, is called the yield or frontier of the tree.
AMBIGUOUS GRAMMARS:
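A grammar is said to be ambiguous if it produces more than one leftmost
derivation, more than one rightmost derivation, or more than one parse tree for some string.
Example: Consider the grammar G : E → E+E | E-E | E*E | E/E | (E) | id and let the string to be
derived be id+id*id. This string has two distinct parse trees, one that applies E → E+E at the root and one that applies E → E*E at the root, so G is ambiguous.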
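One way to check the ambiguity of this particular grammar is to count the parse trees of a token string. The following Python sketch is written only for the grammar G of the example (it is not a general ambiguity test, and the function name is illustrative): it tries every operator position as the root of an E → E op E step.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}

def count_parse_trees(tokens):
    """Number of distinct parse trees for tokens under
    E -> E+E | E-E | E*E | E/E | (E) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees that derive tokens[i:j] from E.
        n = 0
        if j - i == 1 and tokens[i] == "id":
            n += 1                                # E -> id
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            n += count(i + 1, j - 1)              # E -> (E)
        for k in range(i + 1, j - 1):
            if tokens[k] in OPERATORS:            # E -> E op E, op at position k
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(tokens))

print(count_parse_trees(["id", "+", "id", "*", "id"]))  # 2
```

A count greater than 1 for any string shows that the grammar is ambiguous.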
Elimination of ambiguity:
Ambiguity in a grammar that produces more than one parse tree for some string
can be eliminated by rewriting the grammar.
1. Starting with the production that contains the lowest-precedence operator, make it
unambiguous by introducing a new variable into the production.
2. Repeat step 1 until the production with the highest-precedence operator has been made
unambiguous.
LEFT RECURSION:
A grammar is said to be left recursive if it has a production of the form A → Aα, where A is a
non-terminal and α is any combination of terminals and non-terminals.
Elimination of left recursion:
Top-down parsing methods cannot handle left-recursive grammars. Hence, left recursion
must be eliminated.
If there is a production A → Aα | β, it can be replaced with the following two productions:
A → βA'
A' → αA' | ε
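A minimal Python sketch of this rewriting for immediate left recursion follows. It assumes the alternatives of a non-terminal are stored as a list of bodies, that ε is the empty list, and that the new non-terminal can be named by appending ' to the old one; the function name and the sample production E → E+T | T are illustrative and not taken from the notes.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A -> A a1 | ... | b1 | ... as
    A -> b1 A' | ...  and  A' -> a1 A' | ... | ε  (ε is the empty list)."""
    new_head = head + "'"
    # Alternatives of the form A -> A α contribute α; the rest are the β's.
    alphas = [body[1:] for body in bodies if body and body[0] == head]
    betas = [body for body in bodies if not body or body[0] != head]
    if not alphas:
        return {head: bodies}          # nothing to eliminate
    return {
        head: [beta + [new_head] for beta in betas],
        new_head: [alpha + [new_head] for alpha in alphas] + [[]],
    }

result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
for lhs, alternatives in result.items():
    print(lhs, "->", " | ".join(" ".join(b) or "ε" for b in alternatives))
```

For E → E+T | T this gives E → T E' and E' → + T E' | ε. The sketch handles only immediate left recursion; indirect left recursion needs an extra substitution pass first.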
Example: Consider the Grammar
LEFT FACTORING:
Left factoring is performed when several alternatives for a particular non-terminal start with the same
symbol, so that it is not clear which of the alternative
productions to use to expand the non-terminal.
If there is a production A → αβ1 | αβ2, it can be rewritten as
A → αA'
A' → β1 | β2
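The rewriting can be sketched in Python as one round of left factoring. The representation (a list of bodies per non-terminal, ε as the empty list), the function names and the sample production A → ab | ac | d are assumptions made for illustration.

```python
def common_prefix(bodies):
    """Longest sequence of symbols shared by the start of every body."""
    prefix = []
    for symbols in zip(*bodies):
        if any(sym != symbols[0] for sym in symbols):
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(head, bodies):
    """One round of left factoring: alternatives that start with the same
    symbol are replaced by  A -> α A'  and  A' -> β1 | β2 | ..."""
    groups = {}
    for body in bodies:
        key = body[0] if body else None       # group by first symbol
        groups.setdefault(key, []).append(body)
    result = {head: []}
    primes = 0
    for key, group in groups.items():
        if key is None or len(group) == 1:
            result[head].extend(group)        # nothing to factor here
            continue
        primes += 1
        new_head = head + "'" * primes        # hypothetical fresh name
        alpha = common_prefix(group)
        result[head].append(alpha + [new_head])
        result[new_head] = [body[len(alpha):] for body in group]
    return result

factored = left_factor("A", [["a", "b"], ["a", "c"], ["d"]])
print(factored)
```

The new alternatives may themselves share a prefix, so in general the function is applied repeatedly until nothing changes.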
Step 2:
The leftmost leaf node c matches the first symbol of w, so advance the input pointer to the
second symbol of w, a, and consider the next leaf, A. Expand A using the first alternative.
Step 3:
The second symbol of w, a, also matches the second leaf of the tree, so advance the input
pointer to the third symbol of w, d. But the third leaf of the tree is b, which does not match the
input symbol d. Hence discard the chosen production and reset the pointer to the second position.
This is called backtracking.
Step 4:
Now try the second alternative for A.
First Set:
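The FIRST set of a grammar symbol X is the set of all terminals that can begin a string derived from X.
Rules to compute FIRST set:
1. If X is a terminal then FIRST(X) is {X}.
2. If X → ε is a production then add ε to FIRST(X).
3. If X is a non-terminal and X → aα is a production then add a to FIRST(X).
4. If X is a non-terminal and X → Y1Y2...Yn is a production and FIRST(Y1) doesn't contain
ε then add FIRST(Y1) to FIRST(X).
5. If X → Y1Y2...Yn is a production and FIRST(Y1) contains ε then add (FIRST(Y1) - {ε}) ∪
FIRST(Y2) to FIRST(X), continuing with Y3, ..., Yn for as long as every preceding Yi derives ε. If every Yi derives ε, add ε to FIRST(X).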
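The rules can be applied repeatedly until no FIRST set changes. The following Python sketch assumes a grammar stored as a dictionary from non-terminal to a list of bodies, with ε written as the empty list; the sample grammar (the usual expression grammar without left recursion) and all names are illustrative assumptions, not part of the notes.

```python
EPS = "ε"

# Assumed sample grammar: E -> T E', E' -> + T E' | ε,
# T -> F T', T' -> * F T' | ε, F -> ( E ) | id
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
TERMINALS = {"+", "*", "(", ")", "id"}

def first_of_string(symbols, first):
    """FIRST of a sequence Y1 Y2 ... Yn (rules 4 and 5)."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)  # every Yi can derive ε, or the sequence is empty (rule 2)
    return result

def compute_first(grammar, terminals):
    first = {t: {t} for t in terminals}            # rule 1
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:                                 # iterate to a fixed point
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                new = first_of_string(body, first)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first

first = compute_first(GRAMMAR, TERMINALS)
for nt in GRAMMAR:
    print(f"FIRST({nt}) =", sorted(first[nt]))
```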
Follow set:
The FOLLOW set of a non-terminal A is the set of all terminals that can appear immediately after A in some sentential form.
Rules to compute FOLLOW set:
1. If S is the starting symbol then FOLLOW(S) includes $.
2. If there is a production A → αBβ then everything in FIRST(β) except ε is placed in
FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) includes ε, then
everything in FOLLOW(A) is placed in FOLLOW(B).
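The three rules can be sketched in Python as a fixed-point computation. To keep the example short, the FIRST sets are supplied by hand for an assumed sample grammar (E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id); the grammar, the names and the data layout are illustrative assumptions.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],      # [] stands for ε
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
# Hand-computed FIRST sets for the sample grammar (terminals map to themselves).
FIRST = {
    "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"},
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}

def first_of_string(symbols, first):
    """FIRST of a sequence of grammar symbols; contains ε if all can vanish."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result

def compute_follow(grammar, start, first):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                           # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:           # FOLLOW is for non-terminals
                        continue
                    beta_first = first_of_string(body[i + 1:], first)
                    new = beta_first - {EPS}         # rule 2
                    if EPS in beta_first:            # rule 3 (β empty or β ⇒* ε)
                        new |= follow[head]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

follow = compute_follow(GRAMMAR, "E", FIRST)
for nt in GRAMMAR:
    print(f"FOLLOW({nt}) =", sorted(follow[nt]))
```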
Steps required to construct predictive parsing table:
1. Consider the productions in the given grammar. Non-terminals are used for row
selection and terminals (together with $) are used for column selection.
2. For each production A → α,
   For each terminal a in FIRST(α), add A → α to M[A,a].
   If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to
   M[A,b].
   If ε is in FIRST(α) and $ is in FOLLOW(A), then add A → α to M[A,$].
3. Mark each undefined entry as a syntax error.
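The construction can be sketched in Python as follows. The FIRST and FOLLOW sets are written out by hand for an assumed sample grammar (E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id), the table is a dictionary keyed by (non-terminal, terminal), and a missing key plays the role of a syntax-error entry. All names are illustrative.

```python
EPS = "ε"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],      # [] stands for ε
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = {
    "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"},
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"*", "+", ")", "$"},
}

def first_of_string(symbols, first):
    """FIRST of the right side α of a production."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result

def build_table(grammar, first, follow):
    table = {}
    for head, bodies in grammar.items():
        for body in bodies:
            body_first = first_of_string(body, first)
            columns = body_first - {EPS}        # terminals in FIRST(α)
            if EPS in body_first:
                columns |= follow[head]         # FOLLOW(A), which may include $
            for a in columns:
                table.setdefault((head, a), []).append(body)
    return table

table = build_table(GRAMMAR, FIRST, FOLLOW)
# The grammar is LL(1) exactly when no entry holds more than one production.
is_ll1 = all(len(entry) == 1 for entry in table.values())
print(len(table), is_ll1)
```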
Steps required to parse the given input string using predictive parser:
1. Initially, the predictive parser contains $ followed by the starting variable S in the stack (S on top), and the input
string w followed by $ in the input buffer.
Stack    Input buffer
$S       w$
2. Let X be the symbol on top of the stack and a be the current input symbol.
i. If X = a = $, the parser halts and announces successful completion of parsing.
ii. If X = a ≠ $, the parser pops X off the stack and advances the input pointer
to the next input symbol.
iii. If X is a non-terminal, the program consults entry M[X, a] of the parsing
table M. This entry will either be an X-production of the grammar or an error
entry.
If M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top).
If M[X, a] = error, the parser calls an error recovery routine.
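The parsing loop can be sketched in Python as below. The table is written out by hand for an assumed LL(1) expression grammar (E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id), a missing table entry stands for the error entry, and the sketch simply returns False instead of calling an error recovery routine. A terminal on top of the stack that does not match the input is also treated as an error.

```python
# Hand-written predictive parsing table M[X, a] for the assumed grammar.
# An empty list is the ε-production; a missing key is a syntax error.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens, start="E"):
    stack = ["$", start]               # top of the stack is the end of the list
    buffer = list(tokens) + ["$"]
    pos = 0
    while True:
        X, a = stack[-1], buffer[pos]
        if X == a == "$":
            return True                # step i: successful completion
        if X in NONTERMINALS:
            body = TABLE.get((X, a))   # step iii: consult M[X, a]
            if body is None:
                return False           # error entry
            stack.pop()
            stack.extend(reversed(body))   # push W V U so that U ends on top
        elif X == a:
            stack.pop()                # step ii: match a terminal
            pos += 1
        else:
            return False               # terminal on the stack does not match

print(predictive_parse(["id", "+", "id", "*", "id"]))  # True
print(predictive_parse(["id", "+"]))                   # False
```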
LL(1) grammar:
A context-free grammar G=(V,T,P,S) is said to be an LL(1) grammar if the associated LL(1)
parsing table has no multiple productions in the same entry.
A language is said to be LL(1) if it is generated by an LL(1) grammar.
Properties of LL(1) grammar:
1. An LL(1) grammar should not contain ambiguity.
2. An LL(1) grammar should not contain left recursion.
3. There should not be multiple productions in the same entry of the LL(1) parsing table.
4. A grammar is an LL(1) grammar if and only if the following conditions are
satisfied for every two distinct productions A → α and A → β:
i. FIRST(α) ∩ FIRST(β) contains no terminal symbol a.
ii. FIRST(α) ∩ FIRST(β) does not contain ε, i.e. at most one of α and β derives ε.
Combining (i) and (ii): FIRST(α) ∩ FIRST(β) = ∅
This allows the parser to make the correct choice with a lookahead of exactly 1 symbol.
5. If FIRST(α) includes ε, then β does not derive any string beginning with a
terminal in FOLLOW(A):
FIRST(β) ∩ FOLLOW(A) = ∅
LL(K) grammars:
Grammars parsable with LL(k) parsing tables are called LL(k) grammars. In LL(k) parsing
tables, non-terminals are used for row selection and there is a column for every sequence of
k terminals.
Properties of LL(k) grammar:
1. If G is LL(k) then G is also LL(k+1), for k ≥ 1.
2. If G is LL(k) for some k ≥ 1 then it is not an ambiguous grammar.
UNIT-3
SYNTAX ANALYSIS - BOTTOM UP PARSERS
BOTTOM UP PARSERS:
A bottom-up parser constructs the parse tree starting from the leaves and working towards the
root. It corresponds to a rightmost derivation in reverse order, i.e. the parser starts with the input
string and, at each step, identifies the RHS of a production and replaces it with the corresponding LHS
non-terminal, until it derives the starting variable.
Handle:
A handle of a string is a substring that matches the right side of a production, and whose
reduction to the non-terminal on the left side of the production represents one step along the
reverse of a rightmost derivation.
Example:
Consider the grammar:              The rightmost derivation is:
E → E+E                            E ⇒ E+E
E → E*E                              ⇒ E+E*E
E → (E)                              ⇒ E+E*id3
E → id                               ⇒ E+id2*id3
and the input string id1+id2*id3     ⇒ id1+id2*id3
Reading the above derivation in reverse, the handles reduced at each step are id1, id2, id3, E*E and E+E.
Handle pruning:
A rightmost derivation in reverse can be obtained by handle pruning, i.e. if w is a sentence
or string of the grammar at hand, then w = γn, where γn is the nth right sentential form of some
rightmost derivation.
Types of bottom-up parsers :
Bottom up parsing can be classified into 3 types.
1. Shift reduce parser
2. Operator Precedence parser
3. LR parsers.
Again LR parsers can be classified into 3 types.
1. Simple LR Parser (SLR)
2. Canonical LR Parser (CLR)
3. LookAhead LR Parser (LALR)
1. SHIFT REDUCE PARSER:
Shift Reduce parser is the basic bottom-up parsing technique in which shift and reduce are
the two actions performed with the help of stack.
The shift action is used to move symbols from the input buffer onto the stack.
The reduce action is used to replace RHS(handle) on the top of the stack with its equivalent
LHS.
Stack Implementation (or) Working of Shift Reduce Parser:
1. Shift reduce parser consists of two components:
Input buffer
Stack
where input buffer is used to store the input string that is to be parsed and stack is
used to either shift or reduce the symbols.
2. Shift reduce parser performs the following 4 actions:
1. Shift
2. Reduce
3. Accept
4. Reject
3. Initially shift reduce parser contains $ in stack, and input string w followed by $ in
input buffer.
Stack Input buffer
$ w$
4. Now input string is scanned from left to right and shift zero or more symbols until an
appropriate handle is identified on the top of the stack.
5. When parser identifies handle on top of the stack then it reduces handle by its
equivalent LHS.
6. Repeat steps 4 and 5 until the input string is accepted or an error message is generated.
7. A string is said to be accepted, if shift reduce parser contains $ followed by starting
symbol S in stack and $ in the input buffer.
Stack Input buffer
$S $
8. A string is said to be rejected, if shift reduce parser violates accept condition.
θ <. id      id .> θ
θ <. (       ( <. θ
) .> θ       θ .> )
θ .> $       $ <. θ
$ <. id      id .> $
where θ denotes any operator.
4. ( =. )
5. Precedence relations cannot be defined for the pairs
id, id       (, $
id, (        ), (
), id        $, )
$, $
Steps required to parse the given input string using operator precedence parser:
1. Operator Precedence parser performs the following 4 actions:
1. Shift
2. Reduce
3. Accept
4. Reject
2. Initially Operator Precedence parser contains $ in stack, and input string w followed
by $ in input buffer.
Stack Input buffer
$ w$
3. Let a be the topmost terminal symbol on the stack and b be the current input symbol.
i. If a <. b or a =. b, then the shift action is performed.
ii. If a .> b, then the reduce action is performed.
iii. If no precedence relation is defined between a and b, or the accept condition is violated, then the reject
action is performed.
iv. If the Operator Precedence parser contains $ followed by the starting symbol S in the
stack and $ in the input buffer, then the accept action is performed.
Stack    Input buffer
$S       $
4. Repeat step 3 until the input string is accepted or an error message is generated.
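The steps above can be sketched in Python for a small operator grammar. The precedence table below is an assumption made for illustration (only id, + and *, with * binding tighter than + and both left-associative; parentheses are left out, so the =. case never arises), and E is used as the single non-terminal placed on the stack after each reduction.

```python
# Assumed precedence relations: key is (terminal on stack, input terminal).
PRECEDENCE = {
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
}
# Right sides that a reduction may replace by E.
HANDLES = [["id"], ["E", "+", "E"], ["E", "*", "E"]]

def top_terminal(stack):
    """Topmost terminal on the stack (the placeholder E is skipped)."""
    return next(sym for sym in reversed(stack) if sym != "E")

def operator_precedence_parse(tokens):
    stack, buffer, pos = ["$"], list(tokens) + ["$"], 0
    while True:
        a, b = top_terminal(stack), buffer[pos]
        if a == "$" and b == "$":
            return stack == ["$", "E"]            # accept condition
        relation = PRECEDENCE.get((a, b))
        if relation in ("<", "="):                # shift
            stack.append(b)
            pos += 1
        elif relation == ">":                     # reduce
            handle = []
            while True:
                sym = stack.pop()
                handle.insert(0, sym)
                # Stop once the terminal below is related by <. to the
                # terminal that was just popped.
                if sym != "E" and PRECEDENCE.get((top_terminal(stack), sym)) == "<":
                    break
            if stack[-1] == "E":                  # E to the left of the operator
                handle.insert(0, stack.pop())
            if handle not in HANDLES:
                return False                      # reject: not a valid right side
            stack.append("E")
        else:
            return False                          # reject: no relation defined

print(operator_precedence_parse(["id", "+", "id", "*", "id"]))  # True
print(operator_precedence_parse(["id", "id"]))                  # False
```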
3. LR PARSER:
An LR(k) parser is a bottom-up parser which can be used to parse a large class of
grammars, where LR(k) stands for
L - Left to right scanning of the input string
R - Rightmost derivation in reverse order
k - number of lookahead symbols used during parsing
Model of LR Parser:
1. An LR parser consists of 4 components:
input buffer
stack
LR parsing table
LR parser program
2. Input buffer: It is used to store the input string that is to be parsed, followed by $ to
indicate the end of the input string.
3. Stack: It contains a sequence of states si and grammar symbols Xi in the form
s0 X1 s1 X2 s2 ... Xm sm, preceded by $ to indicate the bottom of the stack. Initially,
the stack contains the starting state on top of $.
4. LR parsing table: It consists of 2 parts.
i. Action table
ii. GOTO table.
The Action table is a two-dimensional array M[s, a], where s is a state and a is a
terminal, which specifies the action to be performed, and
the GOTO table is a two-dimensional array M[s, A], where s is a state and A is a
non-terminal, which specifies the next state.
5. LR parser program: It refers to the LR parsing table for the current input and stack
combination to get the appropriate action. It repeats until the input string is accepted or
an error message is generated. The program is the same for all three LR parsers; they differ only in
how the parsing table is constructed.
Types of LR Parsers:
1. SLR Parser
2. CLR Parser (or) LR Parser
3. LALR Parser
SLR Parser:
Simple LR parser is a bottom-up parser in which no lookahead symbols are generated while
constructing the canonical set of items, i.e. it is built from LR(0) items. Lookahead is consulted only through the FOLLOW sets when the parsing table is constructed.
Steps required to construct SLR parser:
1. Check whether the given grammar is an augmented grammar or not. If not, convert it
into an augmented grammar.
2. Construct canonical set of LR(0) items.
3. Construct GOTO graph for canonical set of LR(0) items.
4. Construct FOLLOW sets for every non terminal in the augmented grammar.
5. Construct SLR parsing table by using canonical set of LR(0) items and FOLLOW sets.
6. Parse the given input string by using input buffer, stack and SLR parsing table.
7. Determine whether the string is accepted or not and also check whether the given
grammar is SLR grammar or not.
Augmented Grammar:
The augmented grammar G' is the grammar G with a new production rule E' → E, where E' is the new
starting variable and E is the starting variable of the given grammar, so that the starting production has
only a single variable on its right side.
LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of the
right side. For example, production A → XYZ yields the four LR(0) items:
A → . XYZ
A → X . YZ
A → XY . Z
A → XYZ .
The dot indicates how much of the right side of the production has been seen at that point in the parse.
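Generating the items of a production is a matter of placing the dot at every position of its right side. A minimal Python sketch, assuming an item is stored as a (head, symbols before the dot, symbols after the dot) triple; the function names are illustrative.

```python
def lr0_items(head, body):
    """All LR(0) items of the production head -> body.
    A body of n symbols yields n + 1 items, one per dot position."""
    return [(head, tuple(body[:i]), tuple(body[i:]))
            for i in range(len(body) + 1)]

def show(item):
    """Render an item with an explicit dot."""
    head, before, after = item
    return f"{head} -> {' '.join(before + ('.',) + after)}"

items = lr0_items("A", ["X", "Y", "Z"])
for item in items:
    print(show(item))
```

With this layout, the first symbol after the dot (if any) is the symbol the parser expects next, and an item with nothing after the dot marks a complete right side that can be reduced.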