
Syntax Analysis

(Parsing)
By,

Dr. Anisha P Rodrigues


Associate Professor
Dept of CSE
NMAMIT, Nitte

Reference:
Compilers: Principles, Techniques, and Tools, by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman, second edition.
The Role of the Parser
• The parser obtains a string of tokens from the lexical analyzer and verifies that the string of token names can be generated by the grammar for the source language.
• The parser should report any syntax errors in an intelligible fashion and recover from commonly occurring errors to continue processing the remainder of the program.

The Role of the Parser

• Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing.

• The syntax analyzer creates the syntactic structure of the given source program. This syntactic structure is mostly a parse tree.

• The syntax of a programming language is described by a context-free grammar (CFG). We will use BNF (Backus-Naur Form) notation in the description of CFGs.
The Role of the Parser

• The syntax analyzer (parser) checks whether a given source program satisfies the rules implied by a context-free grammar or not.

• If it does, the parser creates the parse tree of that program.

• Otherwise, the parser reports error messages.
The Role of the Parser

• The parser
1. should report the presence of errors clearly and accurately
2. should recover from errors quickly
3. should not significantly slow down the processing of correct programs
Types of Parsers
• There are three general types of parsers for grammars:
universal, top-down, and bottom-up.
• Universal parsing methods such as the Cocke-Younger-
Kasami algorithm and Earley's algorithm can parse any
grammar. These general methods are, however, too
inefficient to use in production compilers.
• Commonly used methods for parsers:
1. Top-down parsers
build the parse tree from the top (root) to the bottom (leaves)
2. Bottom-up parsers
build the parse tree starting from the leaves and work their way up to the root (bottom to top)
In either case, the input to the parser is scanned from left to right, one symbol at a time.
Error-Recovery Strategies in a Parser

1. Panic Mode
– Discard input symbols until a synchronizing token is found
2. Phrase Level
– Delete, insert, or replace symbols
3. Error Productions
– Use error productions that generate the erroneous constructs
4. Global Correction
– Use algorithms that choose a minimal sequence of changes to the input string
Panic-Mode Recovery

• With this method, on discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found.
• The synchronizing tokens are usually delimiters, such as the semicolon or }, whose role in the source program is clear and unambiguous.
• The compiler designer must select the synchronizing
tokens appropriate for the source language.
• While panic-mode correction often skips a considerable
amount of input without checking it for additional errors, it
has the advantage of simplicity, and, unlike some methods
to be considered later, is guaranteed not to go into an
infinite loop.
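The skip-to-synchronizing-token loop described above can be sketched in a few lines of Python. The token stream, the choice of ";" and "}" as synchronizing tokens, and the function name are illustrative assumptions, not something prescribed by the slides:

```python
# Hypothetical sketch of panic-mode recovery: on an error, discard
# input symbols one at a time until a synchronizing token is found.
SYNC_TOKENS = {";", "}"}  # delimiters chosen by the compiler designer

def panic_mode_recover(tokens, pos):
    """Return the position just past the next synchronizing token."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1  # discard one symbol
    return pos + 1  # resume parsing after the synchronizing token

# An error detected at position 1 skips "+", "*", "id" and the ";".
tokens = ["id", "+", "*", "id", ";", "id"]
print(panic_mode_recover(tokens, 1))  # -> 5
```

Because the loop only ever moves forward through the input, this sketch shares the textbook property that panic mode cannot go into an infinite loop.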
Phrase-Level Recovery

• On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue.
• A typical local correction is to replace a comma by a
semicolon, delete an extraneous semicolon, or insert a
missing semicolon.
• The choice of the local correction is left to the compiler
designer.
• Of course, we must be careful to choose replacements that
do not lead to infinite loops, as would be the case, for
example, if we always inserted something on the input
ahead of the current input symbol.
Error Productions

• By anticipating common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs.
• A parser constructed from a grammar augmented by these
error productions detects the anticipated errors when an
error production is used during parsing.
• The parser can then generate appropriate error
diagnostics about the erroneous construct that has been
recognized in the input.
Global Correction
• Ideally, we would like a compiler to make as few changes as
possible in processing an incorrect input string.
• There are algorithms for choosing a minimal sequence of
changes to obtain a globally least-cost correction.
• Given an incorrect input string x and grammar G, these
algorithms will find a parse tree for a related string y,
such that the number of insertions, deletions, and changes
of tokens required to transform x into y is as small as
possible.
• Unfortunately, these methods are in general too costly to
implement in terms of time and space, so these techniques
are currently only of theoretical interest.
Context-Free Grammars
• Context-free grammars describe the syntax of programming-language constructs like expressions and statements.
• Using a syntactic variable stmt to denote statements and variable expr to denote expressions, the production

stmt → if ( expr ) stmt else stmt

specifies the structure of this form of conditional statement.

• Other productions then define precisely what an expr is and what else a stmt can be.
The Formal Definition of a Context-Free
Grammar
• Context-free grammar (grammar for short) consists of
terminals, nonterminals, a start symbol, and productions.
• Terminals are the basic symbols from which strings are
formed.
• The term "token name" is a synonym for "terminal". We assume that the terminals are the first components of the tokens output by the lexical analyzer.
• Nonterminals are syntactic variables that denote sets of
strings. The sets of strings denoted by nonterminals help
define the language generated by the grammar.
• stmt and expr are nonterminals.
• In a grammar, one nonterminal is distinguished as the start
symbol, and the set of strings it denotes is the language
generated by the grammar.
• Conventionally, the productions for the start symbol are
listed first.
• The productions of a grammar specify the manner in which
the terminals and nonterminals can be combined to form
strings. Each production consists of:
a. A nonterminal called the head or left side of the
production.
b. The symbol →. Sometimes ::= has been used in place of the arrow.
c. A body or right side consisting of zero or more terminals
and nonterminals.
Example
• The grammar given below defines simple arithmetic expressions.

expression → expression + term | expression - term | term
term → term * factor | term / factor | factor
factor → ( expression ) | id

• In this grammar, the terminal symbols are: id + - * / ( )

• The nonterminal symbols are expression, term, and factor, and expression is the start symbol.
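The four components of this grammar can be written down directly as Python data. The dictionary layout below is one possible representation chosen for illustration, not something prescribed by the text:

```python
# The arithmetic-expression grammar as plain data: terminals,
# nonterminals, a start symbol, and productions (head -> bodies).
grammar = {
    "terminals": {"id", "+", "-", "*", "/", "(", ")"},
    "nonterminals": {"expression", "term", "factor"},
    "start": "expression",
    "productions": {
        "expression": [["expression", "+", "term"],
                       ["expression", "-", "term"], ["term"]],
        "term": [["term", "*", "factor"],
                 ["term", "/", "factor"], ["factor"]],
        "factor": [["(", "expression", ")"], ["id"]],
    },
}

# Sanity check: every symbol used in a body is a terminal or nonterminal.
symbols = grammar["terminals"] | grammar["nonterminals"]
ok = all(sym in symbols
         for bodies in grammar["productions"].values()
         for body in bodies for sym in body)
print(ok)  # -> True
```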
Notational Conventions

• Lowercase letters early in the alphabet (a, b, c), operator symbols such as + and *, punctuation symbols, and boldface strings such as id denote terminals.
• Uppercase letters early in the alphabet (A, B, C) denote nonterminals; S, when it appears, is usually the start symbol.
• Uppercase letters late in the alphabet (X, Y, Z) represent grammar symbols, that is, either terminals or nonterminals.
• Lowercase letters late in the alphabet (u, v, ..., z) represent strings of terminals.
• Lowercase Greek letters (α, β, γ) represent strings of grammar symbols.

For Example

E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id

E, T, and F are nonterminals, with E the start symbol. The remaining symbols are terminals.
Derivations
Consider the grammar

E → E + E | E * E | - E | ( E ) | id

We can take a single E and repeatedly apply productions in any order to get a sequence of replacements, for example:

E ⇒ -E ⇒ -(E) ⇒ -(id)

A sequence of replacements of nonterminal symbols is called a derivation of -(id) from E.
Derivations
• General Definition

αAβ ⇒ αγβ if there is a production rule A → γ in our grammar, where α and β are arbitrary strings of terminal and nonterminal symbols.

α1 ⇒ α2 ⇒ ... ⇒ αn (α1 derives αn, or αn derives from α1)

⇒ : derives in one step
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps
CFG - Terminology

• L(G) is the language of G, which is a set of sentences.

• If S ⇒* α, where S is the start symbol of a grammar G, we say that α is a sentential form of G.

• A sentential form may contain both terminals and nonterminals, and may be empty.
CFG - Terminology

• If G is a context-free grammar, L(G) is a context-free language.

• S ⇒* α
– If α contains nonterminals, it is called a sentential form of G.
– If α contains no nonterminals, it is called a sentence of G.

• A language that can be generated by a grammar is said to be a context-free language.

• A sentence of G is a string ω of terminals such that S ⇒+ ω.
Left-Most and Right-Most Derivations

• If we always choose the left-most nonterminal in each derivation step, the derivation is called a left-most derivation.

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

• If we always choose the right-most nonterminal in each derivation step, the derivation is called a right-most derivation.

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
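A left-most derivation can be replayed mechanically: at every step, rewrite the left-most nonterminal. The short Python sketch below does this for the derivation of -(id+id); the list-of-symbols representation and the hand-picked production sequence are illustrative assumptions:

```python
# Replay E => -E => -(E) => -(E+E) => -(id+E) => -(id+id) by always
# expanding the leftmost occurrence of the nonterminal E.
def apply_leftmost(form, head, body):
    """Replace the leftmost `head` in the sentential form with `body`."""
    i = form.index(head)
    return form[:i] + body + form[i + 1:]

# The sequence of productions used in the derivation above.
steps = [("E", ["-", "E"]), ("E", ["(", "E", ")"]),
         ("E", ["E", "+", "E"]), ("E", ["id"]), ("E", ["id"])]
form = ["E"]
for head, body in steps:
    form = apply_leftmost(form, head, body)
    print("=>", "".join(form))
# The last line printed is the sentence -(id+id).
```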
Parse Tree and Derivations
• A parse tree can be seen as a graphical representation of a derivation:

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

The tree grows one production at a time; the final parse tree is:

        E
       / \
      -   E
        / | \
       (  E  )
        / | \
       E  +  E
       |     |
       id    id
Parse Tree
• Get a parse tree for the derivations:

1. E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

        E
      / | \
     E  +  E
     |   / | \
     id E  *  E
        |     |
        id    id

2. E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

        E
      / | \
     E  *  E
   / | \   |
  E  +  E  id
  |     |
  id    id
Ambiguity
• A grammar that produces more than one parse tree for some sentence is called an ambiguous grammar.

For most parsers, the grammar must be unambiguous.
Ambiguity (cont.)

Note:
• We should eliminate the ambiguity in the grammar during the design phase of the compiler.

• An unambiguous grammar should be written to eliminate the ambiguity.

• We have to prefer one of the parse trees of a sentence (generated by an ambiguous grammar) and disambiguate the grammar to restrict it to this choice.
Left-Most and Right-Most Derivations

Exercise: Consider the context-free grammar

S → S S + | S S * | a

and the string aa+a*.
a) Give a leftmost derivation for the string.
b) Give a rightmost derivation for the string.
c) Give a parse tree for the string.
d) Is the grammar ambiguous or unambiguous?
e) Describe the language generated by this grammar.
Answer:
a) S ⇒lm SS* ⇒ SS+S* ⇒ aS+S* ⇒ aa+S* ⇒ aa+a*
b) S ⇒rm SS* ⇒ Sa* ⇒ SS+a* ⇒ Sa+a* ⇒ aa+a*
c) Parse tree:

        S
      / | \
     S  S  *
   / | \ |
  S  S + a
  |  |
  a  a

d) Unambiguous
e) L = {postfix expressions with operand a and the plus and times operators}.
Eliminating Ambiguity
Example 1:
Ambiguous grammar:

E → E + E | E * E | ( E ) | id

An unambiguous grammar for the above:

E → E + T | T
T → T * F | F
F → ( E ) | id

Because the only derivation for the string id+id*id is

E ⇒ E+T ⇒ T+T ⇒ F+T ⇒ id+T ⇒ id+T*F ⇒ id+F*F ⇒ id+id*F ⇒ id+id*id

you have only one parse tree for id+id*id.
Eliminating Ambiguity

The "dangling else" grammar

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

has two parse trees for the sentence

if E1 then if E2 then S1 else S2

1) the else matches the outer if:
   if E1 then ( if E2 then S1 ) else S2
2) the else matches the inner if:
   if E1 then ( if E2 then S1 else S2 )
Eliminating Ambiguity
• We prefer the second parse tree (else matches the closest if).
• We can rewrite the dangling-else grammar as the following unambiguous grammar:

stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
             | other
open_stmt → if expr then stmt
          | if expr then matched_stmt else open_stmt

The idea is that a statement appearing between a then and an else must be "matched".
Left Recursion

A grammar is left recursive if it has a nonterminal A such that there is a derivation

A ⇒+ Aα for some string α

• The left recursion may appear in a single step of the derivation (immediate left recursion), or may appear in more than one step of the derivation.
Left Recursion

Note
• Top-down parsing techniques cannot handle left-
recursive grammars.

• So, we have to convert our left-recursive grammar


into an equivalent grammar which is not left-
recursive.

Elimination of Immediate Left-Recursion

A → Aα | β   where β does not start with A

⇓

A → βA'
A' → αA' | ε   (an equivalent grammar)
Immediate Left-Recursion

In general,

A → Aα1 | ... | Aαm | β1 | ... | βn   where β1 ... βn do not start with A

⇓

A → β1A' | ... | βnA'
A' → α1A' | ... | αmA' | ε   (an equivalent grammar)
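This transformation is mechanical enough to code directly. The sketch below (the grammar representation and the primed-name convention are assumptions made for illustration) splits the A-bodies into left-recursive ones (A → Aα) and the rest (A → β), then builds A → βA' and A' → αA' | ε:

```python
# Eliminate immediate left recursion for one nonterminal.
EPS = "ε"  # stands for the empty string

def eliminate_immediate(head, bodies):
    alphas = [b[1:] for b in bodies if b and b[0] == head]  # A -> A alpha
    betas = [b for b in bodies if not b or b[0] != head]    # A -> beta
    if not alphas:
        return {head: bodies}  # no immediate left recursion
    new = head + "'"
    return {
        head: [b + [new] for b in betas],             # A  -> beta A'
        new: [a + [new] for a in alphas] + [[EPS]],   # A' -> alpha A' | eps
    }

# E -> E + T | T  becomes  E -> T E'  and  E' -> + T E' | ε
print(eliminate_immediate("E", [["E", "+", "T"], ["T"]]))
```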
Elimination of Immediate Left-Recursion

Exercise

Eliminate immediate left recursion from the expression grammar:

E → E + T | T
T → T * F | F
F → ( E ) | id
Left-Recursion -- Problem
• A grammar need not be immediately left-recursive, but it
still can be left-recursive. By just eliminating the
immediate left-recursion, we may not get a grammar which is
not left-recursive.

S → Aa | b
A → Sc | d

is not immediately left-recursive, but it is still left-recursive, since

S ⇒ Aa ⇒ Sca, or
A ⇒ Sc ⇒ Aac

leads to left recursion.

• So, we have to eliminate all left recursion from our grammar.
Eliminate Left-Recursion -- Algorithm
Arrange the nonterminals in some order: A1 ... An
for i from 1 to n do {
  for j from 1 to i-1 do {
    replace each production
      Ai → Aj γ
    by
      Ai → δ1 γ | ... | δk γ
    where Aj → δ1 | ... | δk are all the current Aj-productions
  }
  eliminate the immediate left recursion among the Ai-productions
}
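A direct transcription of this algorithm in Python, run on the grammar S → Aa | b, A → Ac | Sd | f with the order S, A (the dictionary representation, the primed-name convention, and the assumption that the grammar has no ε-productions or cycles are all illustrative):

```python
# Sketch of the general left-recursion elimination algorithm.
EPS = "ε"  # stands for the empty string

def eliminate_left_recursion(order, prods):
    prods = {a: [list(b) for b in bodies] for a, bodies in prods.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            # replace Ai -> Aj gamma by Ai -> delta1 gamma | ... | deltak gamma
            new_bodies = []
            for body in prods[ai]:
                if body and body[0] == aj:
                    new_bodies += [d + body[1:] for d in prods[aj]]
                else:
                    new_bodies.append(body)
            prods[ai] = new_bodies
        # eliminate immediate left recursion among the Ai-productions
        alphas = [b[1:] for b in prods[ai] if b and b[0] == ai]
        if alphas:
            betas = [b for b in prods[ai] if not b or b[0] != ai]
            new = ai + "'"
            prods[ai] = [b + [new] for b in betas]
            prods[new] = [a + [new] for a in alphas] + [[EPS]]
    return prods

result = eliminate_left_recursion(
    ["S", "A"],
    {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], ["f"]]})
for head, bodies in result.items():
    print(head, "->", " | ".join(" ".join(b) for b in bodies))
```

With the order S, A this reproduces the first worked solution on the next slides: S → Aa | b, A → bdA' | fA', A' → cA' | adA' | ε.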
Eliminate Left-Recursion -- Example

Exercise

1. Eliminate Left-Recursion from the grammar

S → Aa | b
A → Ac | Sd | f

Eliminate Left-Recursion – soln1
S → Aa | b
A → Ac | Sd | f
- Order of non-terminals: S, A
for S:
- we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A → Sd with A → Aad | bd
So, we will have A → Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A → bdA’ | fA’
A’ → cA’ | adA’ | ε

So, the resulting equivalent grammar which is not left-recursive is:


S → Aa | b
A → bdA’ | fA’
A’ → cA’ | adA’ | ε

Eliminate Left-Recursion – soln2
S → Aa | b
A → Ac | Sd | f

- Order of non-terminals: A, S

for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A
A → SdA’ | fA’
A’ → cA’ | ε

for S:
- Replace S → Aa with S → SdA’a | fA’a
So, we will have S → SdA’a | fA’a | b
- Eliminate the immediate left-recursion in S
S → fA’aS’ | bS’
S’ → dA’aS’ | ε

So, the resulting equivalent grammar which is not left-recursive is:


S → fA’aS’ | bS’
S’ → dA’aS’ | ε
A → SdA’ | fA’
A’ → cA’ | ε
Left-Factoring
• A predictive parser (a top-down parser without backtracking) insists that the grammar be left-factored.

Consider

stmt → if expr then stmt else stmt
     | if expr then stmt

• When we see if, we cannot tell immediately which production rule to choose to rewrite stmt in the derivation.
Left-Factoring (cont.)
• In general,

A → αβ1 | αβ2   where α is non-empty and the first symbols of β1 and β2 (if they have one) are different.

• When processing α we cannot know whether to expand A to αβ1 or to αβ2.

• But, if we re-write the grammar as follows

A → αA'
A' → β1 | β2

we can immediately expand A to αA'.
Left-Factoring -- Algorithm
Algorithm: Left factoring a grammar.

INPUT: Grammar G.

OUTPUT: An equivalent left-factored grammar.

METHOD: For each nonterminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε - i.e., there is a nontrivial common prefix - replace all of the A-productions

A → αβ1 | ... | αβn | γ

where γ represents all alternatives that do not begin with α, by

A → αA' | γ
A' → β1 | ... | βn

Here A' is a new nonterminal. Repeatedly apply this transformation until no two alternatives for a nonterminal have a common prefix.
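The repeated factoring step above can be sketched in Python. The representation and the primed-name scheme are assumptions for illustration, so the new nonterminal names may come out in a different order than in the worked solutions, but the resulting grammar has the same shape:

```python
# Sketch of left factoring: repeatedly pull the longest prefix shared
# by two or more alternatives out into a fresh primed nonterminal.
EPS = "ε"  # empty body

def common_prefix(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]

def left_factor(head, bodies):
    prods = {head: [list(b) for b in bodies]}
    changed = True
    while changed:
        changed = False
        for a in list(prods):
            alts = prods[a]
            # longest nontrivial prefix shared by at least two alternatives
            best = []
            for i in range(len(alts)):
                for j in range(i + 1, len(alts)):
                    p = common_prefix(alts[i], alts[j])
                    if len(p) > len(best):
                        best = p
            if not best:
                continue
            new = a + "'"
            while new in prods:  # make sure the primed name is fresh
                new += "'"
            factored = [b[len(best):] or [EPS] for b in alts
                        if common_prefix(b, best) == best]
            rest = [b for b in alts if common_prefix(b, best) != best]
            prods[a] = [best + [new]] + rest
            prods[new] = factored
            changed = True
    return prods

# A -> ad | a | ab | abc | b  factors in two rounds, as in solution 2.
result = left_factor("A", [["a", "d"], ["a"], ["a", "b"],
                           ["a", "b", "c"], ["b"]])
for h, bs in result.items():
    print(h, "->", " | ".join(" ".join(b) for b in bs))
```

On this input the sketch first factors the prefix ab, then the prefix a, arriving at A → aA'' | b, A'' → bA' | d | ε, A' → ε | c, which matches solution 2 up to the naming of the primed nonterminals.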
Left-Factoring

Exercise:

1. A → abB | aB | cdg | cdeB | cdfB


2. A → ad | a | ab | abc | b

Left-Factoring – soln1

A → abB | aB | cdg | cdeB | cdfB


A → aA’ | cdg | cdeB | cdfB
A’ → bB | B


A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB | fB

Left-Factoring – soln2

A → ad | a | ab | abc | b


A → aA’ | b
A’ → d | ε | b | bc


A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c

THANK YOU
