03 Compiler Design Lecture - Syntax Analysis

Compiler Design Concepts
Syntax Analysis
Introduction
The first task is to break up the text into meaningful words called tokens.
Source Code (High Level): newval=oldval+12
Lexical Analysis
Token Stream: id = id + num
Symbol Table
Token Lexeme
Id newval
Id oldval
Num 12
The order of the tokens is not important at this stage. Example:
12 + oldval = newval
will also be accepted.
The Lexical Analyzer's purpose is simply to extract the tokens.
There should not be any combination which cannot pass as a token, e.g., 12oldval
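The token extraction described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the token classes, the regular expressions and the names tokenize and token_stream are hypothetical and are not part of the lecture.

```python
import re

# Hypothetical token classes for illustration: num, id and single-character operators.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[=+\-*/;]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (token, lexeme) pairs; raise ValueError on a lexical error."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        end = m.end()
        # A number immediately followed by a letter (e.g. 12oldval) is not a token.
        if kind == "num" and end < len(text) and (text[end].isalpha() or text[end] == "_"):
            raise ValueError(f"lexical error: malformed token at position {pos}")
        if kind != "skip":
            # Operators stand for themselves; identifiers and numbers keep their class name.
            tokens.append((lexeme if kind == "op" else kind, lexeme))
        pos = end
    return tokens


def token_stream(text):
    """Render only the token names, ignoring the lexemes."""
    return " ".join(token for token, _ in tokenize(text))
```

Note that the sketch accepts 12 + oldval = newval without complaint, because checking the order of tokens is left to syntax analysis.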
Syntax
After verifying that there is no lexical error, it is time to check the order of the tokens.
Token Stream: id = id + num
Syntax Analysis
The Syntax Analysis phase should be able to say whether id = id + num is a valid arrangement or not.
Observe that the actual lexemes are not used here. The Syntax Analysis phase is not interested to know whether it is
oldval = newval + 12 or newval = oldval + 12
Only the structure is important, just as Lexical Analysis was not interested in the order of the tokens.
Syntax
But the compiler process should not forget the lexemes. They will be used later.
Token Stream: id = id + num
Symbol Table
Token Lexeme
Id oldval
Id newval
Num 12
Tokens will carry the pointer to the symbol table entry with them.
Syntax
Okay, now, how do we check whether the syntax is correct or not? That is, in this case, whether id = id + num is a valid combination or not.
There must be some rules defined, which specify which combinations are valid. These rules are specified by means of formats called "productions".
Rules:
S → id = id + num
It means that if there is a combination id = id + num, it can be called a statement, which may be symbolized as S.
Now, it has to be seen whether S fits into the total scheme.

Syntax
Most constructions from programming languages are easily expressed by Context Free
Grammars (CFG)
According to CFG, a software program can be seen as made of syntactic categories, arranged in a proper order.
This is like natural languages, where we have parts of speech.
The syntactic categories are Expressions, Statements, Declarations, etc.
Each syntactic category is made up of a valid arrangement of tokens.
A syntactic category can be made of other syntactic categories and, finally, tokens.
Syntactic categories are designated as Non Terminals.
Recall that a non terminal can be derived into any combination of terminals and non terminals, but eventually it should be all tokens.
Syntax
The entire source program listing can be considered as a syntactic category, i.e., a non terminal, say P.
A statement (whatever type it may be) can also be considered as another syntactic category, i.e., a non terminal, say S.
So, as a rule, we can write
P → S;
S → S;S
Now, S, i.e., a statement, can have various expansions.
For example, an assignment statement can look like
S → id := id + id * number ;
Syntax
Let's take another string: myval = newval * 10
It will be converted to the token stream id = id * num
Rules:
S → id = id + num
S → id = id * num
If there is another production S → id = id * num, then the above combination will be considered valid.
Syntax
Source Code (High Level):
newval=oldval+12;
myval=newval*10;
Lexical Analysis
Token Stream: id = id + num ; id = id * num ;
Syntax Analysis
Rules:
S → id = id + num
S → id = id * num
Symbol Table
Token Lexeme
Id newval
Id oldval
Num 12
Id myval
Num 10
So, the stream will be converted to S;S;
We can also check later if S;S; is valid or not.
It will be valid if there is a production P → S;S;
But combinations like S+S or S*S will not be valid.
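The reduction of a token stream to S;S; can be sketched in Python as a lookup against the right hand sides of the productions. The function names and the way statements are split on ";" are assumptions made for this illustration only.

```python
# Right hand sides of the productions for S, written as token strings.
S_PRODUCTIONS = {"id = id + num", "id = id * num"}
# Right hand sides for the program P (only P -> S;S; is taken from the slide).
P_PRODUCTIONS = {"S;S;"}


def reduce_statements(token_stream):
    """Replace each ';'-terminated statement by S if it matches a production."""
    parts = [part.strip() for part in token_stream.split(";")]
    if parts[-1] != "":
        raise SyntaxError("missing ';' at end of statement")
    reduced = []
    for statement in parts[:-1]:
        if statement not in S_PRODUCTIONS:
            raise SyntaxError(f"no production matches: {statement!r}")
        reduced.append("S;")
    return "".join(reduced)


def is_valid_program(token_stream):
    """True if the stream reduces to a right hand side of a production for P."""
    try:
        return reduce_statements(token_stream) in P_PRODUCTIONS
    except SyntaxError:
        return False
```

A lookup table like this only works while every valid statement is listed explicitly, which is exactly the limitation the following slides address.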
Syntax
So, any combination of tokens that can be reduced, meaning that it exists on the right hand side of a production, is valid.
But there are infinite combinations that are valid, e.g.,
id = id - id
id = id * id
id = id + id - id
id = id + id - num
id = id * id - id
... It is impossible to list them all.
We have to have a limited set of rules using which we can generate all combinations.
Just like English grammar: a finite number of words but infinite combinations, that is, an infinite number of sentences.
Syntax
This is the house that Jack built
This is the malt that laid in the house that Jack built
This is the rat that ate the malt that laid in the house that Jack built
This is the cat that killed the rat that ate the malt that laid in the house
that Jack built
This is the dog that chased the cat that killed the rat that ate the malt that
laid in the house that Jack built
Syntax
There are limited types of tokens, but the combinations are infinite.
Take for example arithmetic expressions:
E → E + E
E → E - E
E → E * E
E → E / E
E → id
E → num
Using the above productions, we can validate any arithmetic expression containing variables, numbers, add, sub, mult & div.
This is a context free grammar.
E is a non terminal. It has to appear on the LHS of at least one production. It can also appear on the RHS of some productions.
id, num, +, -, *, /, = are terminals, which are tokens. They appear only on the RHS of productions.
Syntax
E ⇒ E + E ⇒ E + id ⇒ id + id
E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
E ⇒ E + E ⇒ E + E - E ⇒ E + E - id ⇒ E + id - id ⇒ id + id - id
E ⇒ E + E ⇒ E + E - E ⇒ E + E - num ⇒ E + id - num ⇒ id + id - num
E ⇒ E * E ⇒ E * E - E ⇒ E * E - id ⇒ E * id - id ⇒ id * id - id
E ⇒ E * E ⇒ E * E - E ⇒ E * E - E / E ⇒ id * E - E / E ⇒ id * id - E / E ⇒ id * id - id / E ⇒ id * id - id / id
(the non terminal being derived in each step has been highlighted)
One has to choose the appropriate production.

Syntax
Recursive usage of productions on terminals and non terminals results in valid statements.
Defining a grammar:
A Context Free Grammar consists of
1. A set of terminals (T)
2. A set of non terminals (V)
3. A set of productions (P)
4. A start symbol, which is a non terminal (S)
The start symbol is the non terminal from which the chain of derivations will start.
There can be only one.
In the example, E is the start symbol.
A production is of the form A → w,
where A is a non terminal (a member of V) and w is a string of terminals and non terminals.

Syntax
A derivation happens when a non terminal is replaced by a string of terminals and non terminals as defined in some production.
E ⇒ E + E ⇒ E + E - E ⇒ E + E - num ⇒ E + id - num ⇒ id + id - num
The combination of terminals and non terminals at each stage of derivation is called a Sentential Form.
Let's get a little cryptic:
N: Non terminal
α, β, γ: strings of terminals and non terminals
If there exists a production N → γ,
then in a sentential form, N can be replaced by γ.
So, αNβ can be rewritten as αγβ
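The rewriting of αNβ into αγβ can be sketched in Python as a single list operation. The representation of sentential forms as lists of symbols and the names derive, steps and history are assumptions made for illustration; the replayed derivation is the one shown on this slide.

```python
def derive(form, index, production):
    """Rewrite the non terminal at position `index` of `form` using `production`.

    `form` is a list of symbols (alpha N beta) and `production` is (N, gamma).
    The result is alpha + gamma + beta.
    """
    lhs, rhs = production
    if form[index] != lhs:
        raise ValueError(f"symbol at {index} is {form[index]!r}, not {lhs!r}")
    return form[:index] + list(rhs) + form[index + 1:]


E_PLUS = ("E", ["E", "+", "E"])
E_MINUS = ("E", ["E", "-", "E"])
E_ID = ("E", ["id"])
E_NUM = ("E", ["num"])

# (position of the non terminal to rewrite, production to apply) for each step.
steps = [(0, E_PLUS), (2, E_MINUS), (4, E_NUM), (2, E_ID), (0, E_ID)]

form = ["E"]
history = [" ".join(form)]
for index, production in steps:
    form = derive(form, index, production)
    history.append(" ".join(form))
```

Each entry of history is one sentential form, and the last one contains only terminals.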

Derivation
Definition: Given a context-free grammar G with start symbol S, terminal symbols T and productions P, the language L(G) that G generates is defined to be the set of strings of terminal symbols that can be obtained by derivation from S using the productions P, i.e., the set
L(G) = { w ∈ T* | S ⇒* w }
where ⇒* stands for zero or more derivation steps.
As an example, look at the grammar
T → R
T → aTc
R → ε
R → RbR
This grammar generates the string aabbbcc by the derivation shown.
We have, for clarity, in each sequence of symbols underlined the non terminal that is rewritten in the following step.
Derivation
Production applied    Derivation Step
                      T
1. T → aTc            ⇒ aTc
2. T → aTc            ⇒ aaTcc
3. T → R              ⇒ aaRcc
4. R → RbR            ⇒ aaRbRcc
5. R → ε              ⇒ aaRbcc
6. R → RbR            ⇒ aaRbRbcc
7. R → RbR            ⇒ aaRbRbRbcc
8. R → ε              ⇒ aaRbbRbcc
9. R → ε              ⇒ aabbRbcc
10. R → ε             ⇒ aabbbcc
Derivation of the string aabbbcc using the given grammar
In this derivation, we have applied derivation steps sometimes to the leftmost non terminal, sometimes to the rightmost and sometimes to a non terminal that was neither.
Derivation- Parsing
The Syntax Analysis phase checks the structure of the source code statements. This is called Parsing.
There are two common methods:
1. Trying to generate the statement from the start symbol by applying production rules. This is called top down parsing.
We have generated the string aabbbcc from the start symbol T:
T ⇒ ... ⇒ aabbbcc
2. Taking the string and applying productions in reverse to arrive at the start symbol. This is called bottom up parsing.
aabbbcc → ... → T
Derivation
However, since derivation steps are local, the order does not matter. So, we might as well decide to always rewrite the leftmost non terminal.
A derivation that always rewrites the leftmost non terminal is called a leftmost derivation. Similarly, a derivation that always rewrites the rightmost non terminal is called a rightmost derivation.
Production applied    Derivation Step
                      T
1. T → aTc            ⇒ aTc
2. T → aTc            ⇒ aaTcc
3. T → R              ⇒ aaRcc
4. R → RbR            ⇒ aaRbRcc
5. R → RbR            ⇒ aaRbRbRcc
6. R → ε              ⇒ aabRbRcc
7. R → RbR            ⇒ aabRbRbRcc
8. R → ε              ⇒ aabbRbRcc
9. R → ε              ⇒ aabbbRcc
10. R → ε             ⇒ aabbbcc
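The leftmost derivation can be replayed mechanically. The following Python sketch applies the ten productions in order, always to the leftmost non terminal; the string representation, with "" standing for ε, and the function names are assumptions made for illustration.

```python
NONTERMINALS = {"T", "R"}


def leftmost_step(form, lhs, rhs):
    """Rewrite the leftmost non terminal of `form`, which must be `lhs`."""
    for i, symbol in enumerate(form):
        if symbol in NONTERMINALS:
            if symbol != lhs:
                raise ValueError(f"leftmost non terminal is {symbol}, not {lhs}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non terminal left to rewrite")


# The ten productions in the order applied; "" stands for the empty string.
PRODUCTIONS = [
    ("T", "aTc"), ("T", "aTc"), ("T", "R"), ("R", "RbR"), ("R", "RbR"),
    ("R", ""), ("R", "RbR"), ("R", ""), ("R", ""), ("R", ""),
]


def leftmost_derivation(start="T"):
    """Return every sentential form, from the start symbol to the final string."""
    forms = [start]
    for lhs, rhs in PRODUCTIONS:
        forms.append(leftmost_step(forms[-1], lhs, rhs))
    return forms
```

Because the position to rewrite is always fixed, a leftmost derivation is fully determined by the sequence of productions chosen.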
Derivation - Trees
Drawing the tree from production rules
We can draw a derivation as a tree:
Root of the tree = Start symbol
For a derivation, the symbols on the RHS of the chosen production are added as children below the non terminal.
When applying T → aTc, a, T and c will be drawn as children below T:
      T
    / | \
   a  T  c
Read the leaves from left to right
The leaves of the tree are terminals which, when read from left to right, form the derived string. ε is ignored.
Derivation - Trees
Order of derivation does not matter: only choice of rule
[Figure: syntax tree for the string aabbbcc, irrespective of order of derivation, with the first, second and third "b" from the left marked]

Ambiguity
But we may have an alternate tree for the same string.
Choice of production matters.
[Figure: alternate syntax tree for aabbbcc, in which a different rule has been applied]
When a grammar permits several different syntax trees for some string, we call the grammar ambiguous.
Ambiguity
Ambiguity is not a problem for validating syntax.
Both parse trees show that aabbbcc is a valid string.
But the problem is elsewhere: it appears when we evaluate the string.
Let's take the example of an expression: 2 + 3 * 4
E → E + E
E → E * E
E → num
E ⇒ E + E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + 3 * 4
E ⇒ E * E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + 3 * 4
Ambiguity
E ⇒ E + E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + 3 * 4
      E
    / | \
   E  +  E
   |   / | \
   2  E  *  E
      |     |
      3     4
Evaluation:
3 * 4 = 12;
2 + 12 = 14
Sub trees are evaluated first.

E ⇒ E * E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + 3 * 4
         E
       / | \
      E  *  E
    / | \   |
   E  +  E  4
   |     |
   2     3
Evaluation:
2 + 3 = 5;
5 * 4 = 20
NOTE: THE SUBTREES ARE EVALUATED FIRST
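The effect of the two parse trees on evaluation can be reproduced with a small Python sketch. The tuple encoding of the trees and the names evaluate, tree_plus_at_root and tree_times_at_root are assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def evaluate(tree):
    """Evaluate a parse tree given as a number or a (left, op, right) tuple."""
    if isinstance(tree, tuple):
        left, op, right = tree
        # Both sub trees are evaluated before the operator at this node is applied.
        return OPS[op](evaluate(left), evaluate(right))
    return tree


# Tree obtained when E -> E + E is applied at the root.
tree_plus_at_root = (2, "+", (3, "*", 4))
# Tree obtained when E -> E * E is applied at the root.
tree_times_at_root = ((2, "+", 3), "*", 4)
```

Both trees have the same leaves 2 + 3 * 4 when read from left to right, yet evaluate gives 14 for the first and 20 for the second.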
Ambiguity Resolution
A parser cannot be built directly from an ambiguous grammar.
The parser must build exactly one tree while processing the token string.
So, ambiguity must be resolved:
1) Use disambiguating / precedence rules while parsing
2) Rewrite the grammar to make it unambiguous (with the language unchanged)
(i) Associativity
a - b - c will be processed as (a - b) - c: left associative
a ** b ** c will be processed as a ** (b ** c): right associative
a > b > c will be invalid: non associative
Note: + and * give the same result whether grouped from the left or from the right, so each could be made either left or right associative; for convenience, they are made left associative. (The parser has to follow one rule consistently.)
(ii) Precedence
a + b * c will be treated as a + (b * c)
Ambiguity Detection
Ambiguity exists in a grammar if there exists a string which can result in two distinct parse trees.
- Very hard to detect in general: there is no algorithm that decides ambiguity for every context free grammar
- In many cases, it is not difficult to spot by looking at the grammar, e.g., a production of the form
N → NαN
Note: Parsers can be built only from unambiguous grammars
Most of the ambiguity occurs in expression grammars:
E → E op E
E → num
(num is a numeric literal)
Rewriting ambiguous grammar
Expression Grammar
Rewrite as follows:
(a) For left associative operators (e.g., a-b-c)
Introduce a new non terminal E':
E → E op E'
E → E'
E' → num
op → + | - | * | /
Isolate the rightmost non terminal first, push it to a sub tree
Derivation example: E ⇒ E - E' ⇒ (E - E') - E' ⇒ (E' - E') - E' ⇒* (num - num) - num
There is an implicit parenthesis

(b) For right associative operators (e.g., a**b**c)
Introduce a new non terminal E':
E → E' op E
E → E'
E' → num
which is equivalent to
E → num op E
E → num
Derivation example: E ⇒ E' ^ E ⇒ num ^ E ⇒ num ^ (E' ^ E) ⇒ num ^ (num ^ E) ⇒ num ^ (num ^ E') ⇒ num ^ (num ^ num)
There is an implicit parenthesis

(c) For non associative operators (e.g., a<b)
E → E' op E'
E → E'
E' → num
which is equivalent to
E → num op num
E → num
e.g., a<b<c is not allowed
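The three cases above differ only in where the recursion is allowed. The following Python sketch groups a chain of operands joined by one operator according to the chosen associativity; the function name parse_chain and the flat token list input are assumptions made for illustration, not a parser generated from the grammars.

```python
def parse_chain(tokens, assoc):
    """Group `x op y op z ...` according to `assoc`: 'left', 'right' or 'none'.

    `tokens` alternates operands and operators, e.g. ["a", "-", "b", "-", "c"].
    The result is an operand or a nested (left, op, right) tuple.
    """
    operands, ops = tokens[0::2], tokens[1::2]
    if len(operands) != len(ops) + 1:
        raise SyntaxError("expected operands separated by operators")
    if assoc == "left":      # E -> E op E' : only the left operand may be a chain
        tree = operands[0]
        for op, operand in zip(ops, operands[1:]):
            tree = (tree, op, operand)
        return tree
    if assoc == "right":     # E -> E' op E : only the right operand may be a chain
        tree = operands[-1]
        for op, operand in zip(reversed(ops), reversed(operands[:-1])):
            tree = (operand, op, tree)
        return tree
    if assoc == "none":      # E -> E' op E' : neither operand may be a chain
        if len(ops) > 1:
            raise SyntaxError("non associative operator cannot be chained")
        return tuple(tokens) if ops else operands[0]
    raise ValueError(f"unknown associativity: {assoc}")
```

The nesting of the returned tuples plays the role of the implicit parenthesis mentioned above.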

So far, we have handled only the cases where an operator interacts with itself.
This is easily extended to the cases where several operators with the same precedence and associativity interact:
E → E + E'
E → E - E'
E → E'
E' → num
"+" and "-" are both left associative, hence a left recursive grammar is required.
But if we mix left recursive with right recursive, it will be ambiguous again:
E → E + E'
E → E' ^ E
E → E'
E' → num
As an example, we cannot represent 2 + 3 ^ 4 using this grammar.
Mixing operators with different precedence but equal associativity
We must know the precedence of the operators.
First, the higher precedence operator needs to be worked out.
Use different non terminals for different precedence levels:
E → E + E2
E → E - E2
E → E2
E2 → E2 * E3
E2 → E2 / E3
E2 → E3
E3 → num
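One way to see the precedence levels at work is a small hand written evaluator with one function per non terminal. This is a sketch under stated assumptions: the input is a list of numbers and operator strings, the function names are hypothetical, and the left recursive productions are realised as loops, which is a common substitute in recursive descent and not a literal transcription of the grammar.

```python
def parse_expression(tokens):
    """Evaluate a token list such as [2, "+", 3, "*", 4] following
    E -> E + E2 | E - E2 | E2,  E2 -> E2 * E3 | E2 / E3 | E3,  E3 -> num."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_e3():
        nonlocal pos
        tok = peek()
        if not isinstance(tok, (int, float)):
            raise SyntaxError(f"expected num, found {tok!r}")
        pos += 1
        return tok

    def parse_e2():
        nonlocal pos
        value = parse_e3()
        while peek() in ("*", "/"):      # loop gives left associativity
            op = tokens[pos]
            pos += 1
            right = parse_e3()
            value = value * right if op == "*" else value / right
        return value

    def parse_e():
        nonlocal pos
        value = parse_e2()               # higher precedence level is worked out first
        while peek() in ("+", "-"):
            op = tokens[pos]
            pos += 1
            right = parse_e2()
            value = value + right if op == "+" else value - right
        return value

    result = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result
```

Because parse_e only ever sees complete E2 values, 2 + 3 * 4 can only be grouped as 2 + (3 * 4).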
Other sources of ambiguity
Example:
if P then if Q then S1 else S2
The ambiguity is: which "if" is the "else" connected to?
It might mean
if P then ( if Q then S1 else S2 )
or
if P then ( if Q then S1 ) else S2
Note: The "else" clause is optional. Otherwise it would have been unambiguous.

Let's see.
The grammar is
stmt → <id> := <exp>
stmt → <stmt>.<stmt>
stmt → if <exp> then <stmt> else <stmt>
stmt → if <exp> then <stmt>
According to this grammar, the single "else" can equally match with either "if".
Two parse trees, indicating an ambiguous grammar


Usual convention: "else" matches with the closest "if".
We will enforce this rule by rewriting the grammar.
We introduce two new non terminals, matched and unmatched:
stmt → <matched>
stmt → <unmatched>
matched → if <exp> then <matched> else <matched>
matched → <id> := <exp>
For statements with an unmatched if:
unmatched → if <exp> then <matched> else <unmatched>
unmatched → if <exp> then <stmt>
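The closest "if" convention can also be illustrated with a small recursive parser in Python. This sketch does not transcribe the matched / unmatched grammar; it uses the common technique of letting the innermost open "if" consume an "else" as soon as it appears, which yields the same grouping. Conditions and simple statements are assumed to be single tokens, and the function names are hypothetical.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at `pos`; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1          # a simple statement such as S1
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("expected: if <exp> then <stmt>")
    condition = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 3)
    else_branch = None
    # The innermost unfinished "if" sees the "else" first, so it claims it.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ("if", condition, then_branch, else_branch), pos


def parse(tokens):
    """Parse a whole token list and reject trailing tokens."""
    tree, pos = parse_stmt(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For the example if P then if Q then S1 else S2, the else branch ends up inside the inner "if", and the outer "if" has no else branch.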
