
Syntax Analysis

(Parsing)
By,

Dr. Anisha P Rodrigues


Associate Professor
Dept of CSE
NMAMIT, Nitte

Reference:
Compilers: Principles, Techniques, and Tools, by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman, second edition.
The Role of the Parser
• The parser obtains a string of tokens from the lexical analyzer and verifies that the string of token names can be generated by the grammar for the source language.
• The parser should report any syntax errors in an intelligible fashion and recover from commonly occurring errors to continue processing the remainder of the program.

The Role of the Parser

• Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing.

• The syntax analyzer creates the syntactic structure of the given source program. This syntactic structure is mostly a parse tree.

• The syntax of a programming language is described by a context-free grammar (CFG). We will use BNF (Backus-Naur Form) notation in the description of CFGs.
The Role of the Parser

• The syntax analyzer (parser) checks whether a given source program satisfies the rules implied by a context-free grammar or not.

• If it does, the parser creates the parse tree of that program.

• Otherwise, the parser reports error messages.
The Role of the Parser

• The parser
1. should report the presence of errors clearly and accurately
2. should recover from errors quickly
3. should not significantly slow down the processing of correct programs
Types of Parsers
• There are three general types of parsers for grammars:
universal, top-down, and bottom-up.
• Universal parsing methods such as the Cocke-Younger-
Kasami algorithm and Earley's algorithm can parse any
grammar. These general methods are, however, too
inefficient to use in production compilers.
• Commonly used methods for parsers:
1. Top-down parsers
build the parse tree from the top (root) to the bottom (leaves)
2. Bottom-up parsers
build the parse tree starting from the leaves and work their way up to the root (bottom to top)
In either case, the input to the parser is scanned from left to right, one symbol at a time.
Error-Recovery Strategies in a Parser

1. Panic Mode
– Discard input symbols until a synchronizing token is found
2. Phrase Level
– Delete, insert, or replace symbols
3. Error Productions
– Use error productions that generate the erroneous constructs
4. Global Correction
– Use algorithms that choose a minimal sequence of changes to the input string
Panic-Mode Recovery

• With this method, on discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found.
• The synchronizing tokens are usually delimiters, such as the semicolon or }, whose role in the source program is clear and unambiguous.
• The compiler designer must select the synchronizing
tokens appropriate for the source language.
• While panic-mode correction often skips a considerable
amount of input without checking it for additional errors, it
has the advantage of simplicity, and, unlike some methods
to be considered later, is guaranteed not to go into an
infinite loop.
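The skip-to-synchronizing-token loop described above can be sketched in a few lines of Python. The token stream, the choice of ";" and "}" as synchronizing tokens, and the function name are illustrative assumptions, not something prescribed by the slides:

```python
# Hypothetical sketch of panic-mode recovery: on an error, discard
# input symbols one at a time until a synchronizing token is found.
SYNC_TOKENS = {";", "}"}  # delimiters chosen by the compiler designer

def panic_mode_recover(tokens, pos):
    """Return the position just past the next synchronizing token."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1  # discard one symbol
    return pos + 1  # resume parsing after the synchronizing token

# An error detected at position 1 skips "+", "*", "id" and the ";".
tokens = ["id", "+", "*", "id", ";", "id"]
print(panic_mode_recover(tokens, 1))  # -> 5
```

Because the loop only ever moves forward through the input, this sketch shares the textbook property that panic mode cannot go into an infinite loop.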
Phrase-Level Recovery

• On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue.
• A typical local correction is to replace a comma by a
semicolon, delete an extraneous semicolon, or insert a
missing semicolon.
• The choice of the local correction is left to the compiler
designer.
• Of course, we must be careful to choose replacements that
do not lead to infinite loops, as would be the case, for
example, if we always inserted something on the input
ahead of the current input symbol.
Error Productions

• By anticipating common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs.
• A parser constructed from a grammar augmented by these
error productions detects the anticipated errors when an
error production is used during parsing.
• The parser can then generate appropriate error
diagnostics about the erroneous construct that has been
recognized in the input.
Global Correction
• Ideally, we would like a compiler to make as few changes as
possible in processing an incorrect input string.
• There are algorithms for choosing a minimal sequence of
changes to obtain a globally least-cost correction.
• Given an incorrect input string x and grammar G, these
algorithms will find a parse tree for a related string y,
such that the number of insertions, deletions, and changes
of tokens required to transform x into y is as small as
possible.
• Unfortunately, these methods are in general too costly to
implement in terms of time and space, so these techniques
are currently only of theoretical interest.
Context-Free Grammars
• Context-free grammars describe the syntax of programming-language constructs like expressions and statements.
• Using a syntactic variable stmt to denote statements and variable expr to denote expressions, the production

stmt → if ( expr ) stmt else stmt

specifies the structure of this form of conditional statement.

• Other productions then define precisely what an expr is and what else a stmt can be.
The Formal Definition of a Context-Free
Grammar
• Context-free grammar (grammar for short) consists of
terminals, nonterminals, a start symbol, and productions.
• Terminals are the basic symbols from which strings are
formed.
• The term "token name" is a synonym for "terminal". We assume that the terminals are the first components of the tokens output by the lexical analyzer.
• Nonterminals are syntactic variables that denote sets of
strings. The sets of strings denoted by nonterminals help
define the language generated by the grammar.
• stmt and expr are nonterminals.
• In a grammar, one nonterminal is distinguished as the start
symbol, and the set of strings it denotes is the language
generated by the grammar.
• Conventionally, the productions for the start symbol are
listed first.
• The productions of a grammar specify the manner in which
the terminals and nonterminals can be combined to form
strings. Each production consists of:
a. A nonterminal called the head or left side of the
production.
b. The symbol →. Sometimes ::= has been used in place of the arrow.
c. A body or right side consisting of zero or more terminals
and nonterminals.
Example
• The grammar given below defines simple arithmetic expressions.

expression → expression + term | expression - term | term
term → term * factor | term / factor | factor
factor → ( expression ) | id

• In this grammar, the terminal symbols are: id + - * / ( )

• The nonterminal symbols are expression, term, and factor, and expression is the start symbol.
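The four components of this grammar can be written down directly as Python data. The dictionary layout below is one possible representation chosen for illustration, not something prescribed by the text:

```python
# The arithmetic-expression grammar as plain data: terminals,
# nonterminals, a start symbol, and productions (head -> bodies).
grammar = {
    "terminals": {"id", "+", "-", "*", "/", "(", ")"},
    "nonterminals": {"expression", "term", "factor"},
    "start": "expression",
    "productions": {
        "expression": [["expression", "+", "term"],
                       ["expression", "-", "term"], ["term"]],
        "term": [["term", "*", "factor"],
                 ["term", "/", "factor"], ["factor"]],
        "factor": [["(", "expression", ")"], ["id"]],
    },
}

# Sanity check: every symbol used in a body is a terminal or nonterminal.
symbols = grammar["terminals"] | grammar["nonterminals"]
ok = all(sym in symbols
         for bodies in grammar["productions"].values()
         for body in bodies for sym in body)
print(ok)  # -> True
```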
Notational Conventions

• Lowercase letters early in the alphabet (a, b, c), operator symbols such as + and *, punctuation symbols, and boldface strings such as id denote terminals.
• Uppercase letters early in the alphabet (A, B, C) denote nonterminals; S, when it appears, is usually the start symbol.
• Uppercase letters late in the alphabet (X, Y, Z) represent grammar symbols, that is, either terminals or nonterminals.
• Lowercase letters late in the alphabet (u, v, ..., z) represent strings of terminals.
• Lowercase Greek letters (α, β, γ) represent strings of grammar symbols.

For Example

E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id

E, T, and F are nonterminals, with E the start symbol. The remaining symbols are terminals.
Derivations
Consider the grammar

E → E + E | E * E | - E | ( E ) | id

We can take a single E and repeatedly apply productions in any order to get a sequence of replacements, for example:

E ⇒ -E ⇒ -(E) ⇒ -(id)

A sequence of replacements of nonterminal symbols is called a derivation of -(id) from E.
Derivations
• General Definition

αAβ ⇒ αγβ if there is a production rule A → γ in our grammar, where α and β are arbitrary strings of terminal and nonterminal symbols.

α1 ⇒ α2 ⇒ ... ⇒ αn (α1 derives αn, or αn derives from α1)

⇒ : derives in one step
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps
CFG - Terminology

• L(G) is the language of G, which is a set of sentences.

• If S ⇒* α, where S is the start symbol of a grammar G, we say that α is a sentential form of G.

• A sentential form may contain both terminals and nonterminals, and may be empty.
CFG - Terminology

• If G is a context-free grammar, L(G) is a context-free language.

• S ⇒* α
– If α contains nonterminals, it is called a sentential form of G.
– If α contains no nonterminals, it is called a sentence of G.

• A language that can be generated by a grammar is said to be a context-free language.

• A sentence of G is a string ω of terminals such that S ⇒+ ω.
Left-Most and Right-Most Derivations

• If we always choose the left-most nonterminal in each derivation step, the derivation is called a left-most derivation.

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

• If we always choose the right-most nonterminal in each derivation step, the derivation is called a right-most derivation.

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
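A left-most derivation can be replayed mechanically: at every step, rewrite the left-most nonterminal. The short Python sketch below does this for the derivation of -(id+id); the list-of-symbols representation and the hand-picked production sequence are illustrative assumptions:

```python
# Replay E => -E => -(E) => -(E+E) => -(id+E) => -(id+id) by always
# expanding the leftmost occurrence of the nonterminal E.
def apply_leftmost(form, head, body):
    """Replace the leftmost `head` in the sentential form with `body`."""
    i = form.index(head)
    return form[:i] + body + form[i + 1:]

# The sequence of productions used in the derivation above.
steps = [("E", ["-", "E"]), ("E", ["(", "E", ")"]),
         ("E", ["E", "+", "E"]), ("E", ["id"]), ("E", ["id"])]
form = ["E"]
for head, body in steps:
    form = apply_leftmost(form, head, body)
    print("=>", "".join(form))
# The last line printed is the sentence -(id+id).
```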
Parse Tree and Derivations
• A parse tree can be seen as a graphical representation of a derivation:

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

The tree grows one production at a time; the final parse tree is:

        E
       / \
      -   E
        / | \
       (  E  )
        / | \
       E  +  E
       |     |
       id    id
Parse Tree
• Get a parse tree for the derivations:

1. E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

        E
      / | \
     E  +  E
     |   / | \
     id E  *  E
        |     |
        id    id

2. E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

        E
      / | \
     E  *  E
   / | \   |
  E  +  E  id
  |     |
  id    id
Ambiguity
• A grammar that produces more than one parse tree for some sentence is called an ambiguous grammar.

For most parsers, the grammar must be unambiguous.
Ambiguity (cont.)

Note:
• We should eliminate the ambiguity in the grammar during the design phase of the compiler.

• An unambiguous grammar should be written to eliminate the ambiguity.

• We have to prefer one of the parse trees of a sentence (generated by an ambiguous grammar) and disambiguate the grammar to restrict it to this choice.
Left-Most and Right-Most Derivations

Exercise: Consider the context-free grammar

S → S S + | S S * | a

and the string aa+a*.
a) Give a leftmost derivation for the string.
b) Give a rightmost derivation for the string.
c) Give a parse tree for the string.
d) Is the grammar ambiguous or unambiguous?
e) Describe the language generated by this grammar.
Answer:
a) S ⇒lm SS* ⇒ SS+S* ⇒ aS+S* ⇒ aa+S* ⇒ aa+a*
b) S ⇒rm SS* ⇒ Sa* ⇒ SS+a* ⇒ Sa+a* ⇒ aa+a*
c) Parse tree:

        S
      / | \
     S  S  *
   / | \ |
  S  S + a
  |  |
  a  a

d) Unambiguous
e) L = {postfix expressions with operand a and the plus and times operators}.
Eliminating Ambiguity
Example 1:
Ambiguous grammar:

E → E + E | E * E | ( E ) | id

An unambiguous grammar for the above:

E → E + T | T
T → T * F | F
F → ( E ) | id

Because the only derivation for the string id+id*id is

E ⇒ E+T ⇒ T+T ⇒ F+T ⇒ id+T ⇒ id+T*F ⇒ id+F*F ⇒ id+id*F ⇒ id+id*id

you have only one parse tree for id+id*id.
Eliminating Ambiguity

The "dangling else" grammar

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

has two parse trees for the sentence

if E1 then if E2 then S1 else S2

1) the else matches the outer if:
   if E1 then ( if E2 then S1 ) else S2
2) the else matches the inner if:
   if E1 then ( if E2 then S1 else S2 )
Eliminating Ambiguity
• We prefer the second parse tree (else matches the closest if).
• We can rewrite the dangling-else grammar as the following unambiguous grammar:

stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
             | other
open_stmt → if expr then stmt
          | if expr then matched_stmt else open_stmt

The idea is that a statement appearing between a then and an else must be "matched".
Left Recursion

A grammar is left recursive if it has a nonterminal A such that there is a derivation

A ⇒+ Aα for some string α

• The left recursion may appear in a single step of the derivation (immediate left recursion), or may appear in more than one step of the derivation.
Left Recursion

Note
• Top-down parsing techniques cannot handle left-
recursive grammars.

• So, we have to convert our left-recursive grammar


into an equivalent grammar which is not left-
recursive.

Elimination of Immediate Left-Recursion

A → Aα | β   where β does not start with A

⇓

A → βA'
A' → αA' | ε   (an equivalent grammar)
Immediate Left-Recursion

In general,

A → Aα1 | ... | Aαm | β1 | ... | βn   where β1 ... βn do not start with A

⇓

A → β1A' | ... | βnA'
A' → α1A' | ... | αmA' | ε   (an equivalent grammar)
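This transformation is mechanical enough to code directly. The sketch below (the grammar representation and the primed-name convention are assumptions made for illustration) splits the A-bodies into left-recursive ones (A → Aα) and the rest (A → β), then builds A → βA' and A' → αA' | ε:

```python
# Eliminate immediate left recursion for one nonterminal.
EPS = "ε"  # stands for the empty string

def eliminate_immediate(head, bodies):
    alphas = [b[1:] for b in bodies if b and b[0] == head]  # A -> A alpha
    betas = [b for b in bodies if not b or b[0] != head]    # A -> beta
    if not alphas:
        return {head: bodies}  # no immediate left recursion
    new = head + "'"
    return {
        head: [b + [new] for b in betas],             # A  -> beta A'
        new: [a + [new] for a in alphas] + [[EPS]],   # A' -> alpha A' | eps
    }

# E -> E + T | T  becomes  E -> T E'  and  E' -> + T E' | ε
print(eliminate_immediate("E", [["E", "+", "T"], ["T"]]))
```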
Elimination of Immediate Left-Recursion

Exercise

Eliminate immediate left recursion from the expression grammar:

E → E + T | T
T → T * F | F
F → ( E ) | id
Left-Recursion -- Problem
• A grammar need not be immediately left-recursive, but it
still can be left-recursive. By just eliminating the
immediate left-recursion, we may not get a grammar which is
not left-recursive.

S → Aa | b
A → Sc | d

is not immediately left-recursive, but it is still left-recursive, since

S ⇒ Aa ⇒ Sca, or
A ⇒ Sc ⇒ Aac

leads to left recursion.

• So, we have to eliminate all left recursion from our grammar.
Eliminate Left-Recursion -- Algorithm
Arrange the nonterminals in some order: A1 ... An
for i from 1 to n do {
  for j from 1 to i-1 do {
    replace each production
      Ai → Aj γ
    by
      Ai → δ1 γ | ... | δk γ
    where Aj → δ1 | ... | δk are all the current Aj-productions
  }
  eliminate the immediate left recursion among the Ai-productions
}
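A direct transcription of this algorithm in Python, run on the grammar S → Aa | b, A → Ac | Sd | f with the order S, A (the dictionary representation, the primed-name convention, and the assumption that the grammar has no ε-productions or cycles are all illustrative):

```python
# Sketch of the general left-recursion elimination algorithm.
EPS = "ε"  # stands for the empty string

def eliminate_left_recursion(order, prods):
    prods = {a: [list(b) for b in bodies] for a, bodies in prods.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            # replace Ai -> Aj gamma by Ai -> delta1 gamma | ... | deltak gamma
            new_bodies = []
            for body in prods[ai]:
                if body and body[0] == aj:
                    new_bodies += [d + body[1:] for d in prods[aj]]
                else:
                    new_bodies.append(body)
            prods[ai] = new_bodies
        # eliminate immediate left recursion among the Ai-productions
        alphas = [b[1:] for b in prods[ai] if b and b[0] == ai]
        if alphas:
            betas = [b for b in prods[ai] if not b or b[0] != ai]
            new = ai + "'"
            prods[ai] = [b + [new] for b in betas]
            prods[new] = [a + [new] for a in alphas] + [[EPS]]
    return prods

result = eliminate_left_recursion(
    ["S", "A"],
    {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], ["f"]]})
for head, bodies in result.items():
    print(head, "->", " | ".join(" ".join(b) for b in bodies))
```

With the order S, A this reproduces the first worked solution on the next slides: S → Aa | b, A → bdA' | fA', A' → cA' | adA' | ε.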
Eliminate Left-Recursion -- Example

Exercise

1. Eliminate Left-Recursion from the grammar

S → Aa | b
A → Ac | Sd | f

Eliminate Left-Recursion – soln1
S → Aa | b
A → Ac | Sd | f
- Order of non-terminals: S, A
for S:
- we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A → Sd with A → Aad | bd
So, we will have A → Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A → bdA’ | fA’
A’ → cA’ | adA’ | ε

So, the resulting equivalent grammar which is not left-recursive is:


S → Aa | b
A → bdA’ | fA’
A’ → cA’ | adA’ | ε

Eliminate Left-Recursion – soln2
S → Aa | b
A → Ac | Sd | f

- Order of non-terminals: A, S

for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A
A → SdA’ | fA’
A’ → cA’ | ε

for S:
- Replace S → Aa with S → SdA’a | fA’a
So, we will have S → SdA’a | fA’a | b
- Eliminate the immediate left-recursion in S
S → fA’aS’ | bS’
S’ → dA’aS’ | ε

So, the resulting equivalent grammar which is not left-recursive is:


S → fA’aS’ | bS’
S’ → dA’aS’ | ε
A → SdA’ | fA’
A’ → cA’ | ε
Left-Factoring
• A predictive parser (a top-down parser without backtracking) insists that the grammar be left-factored.

Consider

stmt → if expr then stmt else stmt
     | if expr then stmt

• When we see if, we cannot tell immediately which production rule to choose to rewrite stmt in the derivation.
Left-Factoring (cont.)
• In general,

A → αβ1 | αβ2   where α is non-empty and the first symbols of β1 and β2 (if they have one) are different.

• When processing α we cannot know whether to expand A to αβ1 or to αβ2.

• But, if we re-write the grammar as follows

A → αA'
A' → β1 | β2

we can immediately expand A to αA'.
Left-Factoring -- Algorithm
Algorithm: Left factoring a grammar.

INPUT: Grammar G.

OUTPUT: An equivalent left-factored grammar.

METHOD: For each nonterminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε - i.e., there is a nontrivial common prefix - replace all of the A-productions

A → αβ1 | ... | αβn | γ

where γ represents all alternatives that do not begin with α, by

A → αA' | γ
A' → β1 | ... | βn

Here A' is a new nonterminal. Repeatedly apply this transformation until no two alternatives for a nonterminal have a common prefix.
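The repeated factoring step above can be sketched in Python. The representation and the primed-name scheme are assumptions for illustration, so the new nonterminal names may come out in a different order than in the worked solutions, but the resulting grammar has the same shape:

```python
# Sketch of left factoring: repeatedly pull the longest prefix shared
# by two or more alternatives out into a fresh primed nonterminal.
EPS = "ε"  # empty body

def common_prefix(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]

def left_factor(head, bodies):
    prods = {head: [list(b) for b in bodies]}
    changed = True
    while changed:
        changed = False
        for a in list(prods):
            alts = prods[a]
            # longest nontrivial prefix shared by at least two alternatives
            best = []
            for i in range(len(alts)):
                for j in range(i + 1, len(alts)):
                    p = common_prefix(alts[i], alts[j])
                    if len(p) > len(best):
                        best = p
            if not best:
                continue
            new = a + "'"
            while new in prods:  # make sure the primed name is fresh
                new += "'"
            factored = [b[len(best):] or [EPS] for b in alts
                        if common_prefix(b, best) == best]
            rest = [b for b in alts if common_prefix(b, best) != best]
            prods[a] = [best + [new]] + rest
            prods[new] = factored
            changed = True
    return prods

# A -> ad | a | ab | abc | b  factors in two rounds, as in solution 2.
result = left_factor("A", [["a", "d"], ["a"], ["a", "b"],
                           ["a", "b", "c"], ["b"]])
for h, bs in result.items():
    print(h, "->", " | ".join(" ".join(b) for b in bs))
```

On this input the sketch first factors the prefix ab, then the prefix a, arriving at A → aA'' | b, A'' → bA' | d | ε, A' → ε | c, which matches solution 2 up to the naming of the primed nonterminals.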
Left-Factoring

Exercise:

1. A → abB | aB | cdg | cdeB | cdfB


2. A → ad | a | ab | abc | b

Left-Factoring – soln1

A → abB | aB | cdg | cdeB | cdfB


A → aA’ | cdg | cdeB | cdfB
A’ → bB | B


A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB | fB

Left-Factoring – soln2

A → ad | a | ab | abc | b


A → aA’ | b
A’ → d | ε | b | bc


A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c

THANK YOU
