
Compiler Design Concepts

Syntax Analysis
Introduction
The first task is to break up the text into meaningful words called tokens.

Source Code (High Level) → Lexical Analysis → Token Stream

Source code:  newval=oldval+12
Token stream: id = id + num

Symbol Table
Token   Lexeme
id      newval
id      oldval
num     12

The order of the tokens is not important at this stage. For example,
12 + oldval = newval
will also be accepted. The Lexical Analyzer's purpose is simply to extract the tokens.

There should not be any combination of characters which cannot pass as a token, e.g., 12oldval.
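The token extraction described above can be sketched in Python. This is a minimal illustration, not a full lexer: the token names (id, num), the operator set and the function name tokenize are assumptions chosen to match the slide's example.

```python
import re

# Hypothetical token categories for this sketch; a real language has many more.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[=+\-*/;]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs, or raise ValueError on a lexical error."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}: {text[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        end = m.end()
        # A number immediately followed by a letter (e.g. 12oldval) is not a token.
        if kind == "num" and end < len(text) and (text[end].isalpha() or text[end] == "_"):
            raise ValueError(f"lexical error: malformed token starting with {lexeme!r}")
        if kind != "skip":
            # Operators are their own token; identifiers and numbers get a category.
            tokens.append((lexeme if kind == "op" else kind, lexeme))
        pos = end
    return tokens
```

Calling tokenize("newval=oldval+12") yields the token stream id = id + num together with the lexemes, tokenize("12 + oldval = newval") is accepted because order is not checked here, and tokenize("12oldval") raises an error.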
Syntax
After verifying that there is no lexical error, it is time to check the order of the tokens.

Token Stream → Syntax Analysis

Token stream: id = id + num

The Syntax Analysis phase should be able to say whether id = id + num is a valid arrangement or not.

Observe that the actual lexemes are not used here. The Syntax Analysis phase is not interested to know if it is
oldval = newval + 12 or newval = oldval + 12

Only the structure is important, just as Lexical Analysis was not interested in the order of the tokens.
Syntax
But the compiler process should not forget the lexemes. They will be used later.

Token Stream → Syntax Analysis

Token stream: id = id + num

Symbol Table
Token   Lexeme
id      oldval
id      newval
num     12

Tokens will carry the pointer to the symbol table entry with them.
Syntax
Okay, now, how to check if the syntax is correct or not?

Token Stream → Syntax Analysis

Token stream: id = id + num

That is, in this case, whether id = id + num is a valid combination or not.

There must be some rules defined which specify which combinations are valid. Such a rule is specified by means of a format called a "production":

S → id = id + num

It means that if there is a combination id = id + num, it can be called a statement, which may be symbolized as S.

Now, it has to be seen whether S fits into the total scheme.


Syntax
Most constructions from programming languages are easily expressed by Context Free
Grammars (CFG)

According to CFG, a software program can be seen as made of syntactic categories arranged in a proper order.

This is like natural languages, where we have parts of speech.

In a programming language the syntactic categories are Expressions, Statements, Declarations, etc.

Each syntactic category is made up of a valid arrangement of tokens.

A syntactic category can be made of other syntactic categories and, finally, tokens.
Syntactic categories are designated as Non Terminals.

Recall that a non terminal can be derived into any combination of terminals and non terminals, but eventually the result should consist only of tokens.
Syntax
The entire source program listing can be considered as a syntactic category, i.e., a non terminal, say P.

A statement (whatever type it may be) can also be considered as another syntactic category, i.e., a non terminal, say S.

So, as a rule, we can write

P → S;
S → S;S

Now, S, i.e., a statement, can have various expansions.

For example, an assignment statement can look like

S → id := id + id * number ;
Syntax
Let's take another string: myval = newval * 10
It will be converted to the token stream id = id * num

Token Stream → Syntax Analysis

Rules:
S → id = id + num
S → id = id * num

If there is another production S → id = id * num, then the above combination will be considered valid.
Syntax

Source Code (High Level) → Lexical Analysis → Token Stream → Syntax Analysis

Source code:
newval=oldval+12;
myval=newval*10;

Token stream: id = id + num ; id = id * num ;

Productions:
S → id = id + num
S → id = id * num

Symbol Table
Token   Lexeme
id      newval
id      oldval
num     12
id      myval
num     10

So, the stream will be converted to S;S;
We can also check later if S;S; is valid or not.
It will be valid if there is a production P → S;S;
But combinations like S+S or S*S will not be valid.
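The reduction of the token stream to S;S; can be sketched as a lookup of right hand sides. This is only an illustration of the idea on this slide: the table names and the functions reduce_statements and is_valid_program are hypothetical, and the production P → S;S; is assumed to exist as the slide suggests.

```python
# Productions from the slide, stored as right-hand side -> left-hand side.
STATEMENT_RHS = {
    ("id", "=", "id", "+", "num"): "S",
    ("id", "=", "id", "*", "num"): "S",
}
# Assumed program-level production P -> S;S;
PROGRAM_RHS = {("S", ";", "S", ";"): "P"}


def reduce_statements(tokens):
    """Replace each ';'-terminated run of tokens by S if it matches a production."""
    result, current = [], []
    for tok in tokens:
        if tok == ";":
            lhs = STATEMENT_RHS.get(tuple(current))
            if lhs is None:
                raise SyntaxError(f"no production matches: {' '.join(current)}")
            result += [lhs, ";"]
            current = []
        else:
            current.append(tok)
    if current:
        raise SyntaxError("statement not terminated by ';'")
    return result


def is_valid_program(tokens):
    """True if the reduced stream is the right-hand side of a production for P."""
    return tuple(reduce_statements(tokens)) in PROGRAM_RHS
```

With the stream id = id + num ; id = id * num ; the first function returns S ; S ; and the second accepts it. A table like this only works for a fixed, finite list of right hand sides.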
Syntax
So, any combination of tokens that can be reduced, meaning that it exists on the right hand side of a production, is valid.

But there are infinitely many valid combinations, e.g.,

id = id - id
id = id * id
id = id + id - id
id = id + id - num
id = id * id - id
...

It is impossible to list them all. We need a limited set of rules from which all valid combinations can be generated.

This is just like English grammar: a finite number of words but infinite combinations, that is, an infinite number of sentences.
Syntax
This is the house that Jack built

This is the malt that laid in the house that Jack built

This is the rat that ate the malt that laid in the house that Jack built

This is the cat that killed the rat that ate the malt that laid in the house
that Jack built

This is the dog that chased the cat that killed the rat that ate the malt that
laid in the house that Jack built
Syntax
There are limited types of tokens but the number of combinations is infinite.

Take for example arithmetic expressions:

E → E + E
E → E - E
E → E * E
E → E / E
E → id
E → num

Using the above productions, we can validate any arithmetic expression containing variables, numbers, add, sub, mult & div.

This is a context free grammar.

E is a non terminal. It has to appear on the LHS of at least one production. It can also appear on the RHS of some productions.

id, num, +, -, *, /, = are terminals, which are tokens. They appear only on the RHS of productions.
Syntax

E ⇒ E + E ⇒ E + id ⇒ id + id

E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id

E ⇒ E + E ⇒ E + E - E ⇒ E + E - id ⇒ E + id - id ⇒ id + id - id

E ⇒ E + E ⇒ E + E - E ⇒ E + E - num ⇒ E + id - num ⇒ id + id - num

E ⇒ E * E ⇒ E * E - E ⇒ E * E - id ⇒ E * id - id ⇒ id * id - id

E ⇒ E * E ⇒ E * E - E ⇒ E * E - E / E ⇒ id * E - E / E ⇒ id * id - E / E ⇒ id * id - id / E ⇒ id * id - id / id

(In each step exactly one non terminal is replaced using a production.)

One has to choose the appropriate production.


Syntax
Recursive application of productions results in valid statements.

Defining a grammar:

A Context Free Grammar consists of

1. A set of terminals (T)
2. A set of non terminals (V)
3. A set of productions (P)
4. A start symbol, which is a non terminal (S)

The start symbol is the non terminal from which the chain of derivations starts. There can be only one.

In the example, E is the start symbol.

A production is of the form N → w

where N is a non terminal (a member of V) and w is a string of terminals and non terminals.
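The four parts of a context free grammar map naturally onto a small data structure. The following is a sketch under stated assumptions: the class name Grammar, its field names and the check method are hypothetical, and the example instance is the arithmetic expression grammar used earlier, with E as the start symbol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset
    nonterminals: frozenset
    productions: tuple  # pairs (lhs, rhs), where rhs is a tuple of symbols
    start: str

    def check(self):
        """Return a list of problems; an empty list means the grammar is well formed."""
        problems = []
        if self.terminals & self.nonterminals:
            problems.append("a symbol is both terminal and non terminal")
        if self.start not in self.nonterminals:
            problems.append("start symbol is not a non terminal")
        symbols = self.terminals | self.nonterminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                problems.append(f"left-hand side {lhs!r} is not a non terminal")
            problems += [f"unknown symbol {s!r}" for s in rhs if s not in symbols]
        # Every non terminal must be on the LHS of at least one production.
        defined = {lhs for lhs, _ in self.productions}
        problems += [f"{n!r} has no production" for n in sorted(self.nonterminals - defined)]
        return problems


EXPR = Grammar(
    terminals=frozenset({"id", "num", "+", "-", "*", "/"}),
    nonterminals=frozenset({"E"}),
    productions=(
        ("E", ("E", "+", "E")),
        ("E", ("E", "-", "E")),
        ("E", ("E", "*", "E")),
        ("E", ("E", "/", "E")),
        ("E", ("id",)),
        ("E", ("num",)),
    ),
    start="E",
)
```

EXPR.check() returns an empty list, while a grammar whose start symbol is not among its non terminals is reported as a problem.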


Syntax
A derivation step happens when a non terminal is replaced by a string of terminals and non terminals as defined in some production.

E ⇒ E + E ⇒ E + E - E ⇒ E + E - num ⇒ E + id - num ⇒ id + id - num

The combination of terminals and non terminals at each stage of derivation is called a Sentential Form.

Let's get a little more formal:

N: non terminal
α, β, γ: strings of terminals and non terminals

If there exists a production N → γ

then in a sentential form, N can be replaced by γ.

So, αNβ can be rewritten as αγβ, written αNβ ⇒ αγβ.
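A single derivation step can be sketched as a list operation. The function name derive_step and the list-of-symbols representation are assumptions made for this illustration.

```python
def derive_step(form, index, production):
    """Rewrite the non terminal at position `index` of a sentential form.

    `form` is a list of symbols (alpha + [N] + beta) and `production` is a
    pair (N, gamma). The result is alpha + gamma + beta.
    """
    lhs, rhs = production
    if form[index] != lhs:
        raise ValueError(f"symbol at position {index} is {form[index]!r}, not {lhs!r}")
    return form[:index] + list(rhs) + form[index + 1:]


# E => E + E => E + E - E => E + E - num => E + id - num => id + id - num
form = ["E"]
form = derive_step(form, 0, ("E", ["E", "+", "E"]))
form = derive_step(form, 2, ("E", ["E", "-", "E"]))
form = derive_step(form, 4, ("E", ["num"]))
form = derive_step(form, 2, ("E", ["id"]))
form = derive_step(form, 0, ("E", ["id"]))
```

After the five steps, form holds the symbols id + id - num, and every intermediate value of form is a sentential form.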


Derivation

Definition: Given a context-free grammar G with start symbol S, terminal symbols T and productions P, the language L(G) that G generates is defined to be the set of strings of terminal symbols that can be obtained by derivation from S using the productions P, i.e., the set

L(G) = { w ∈ T* | S ⇒* w }

where S ⇒* w means that w can be derived from S in zero or more derivation steps.

As an example, look at the grammar

T → R
T → aTc
R → ε
R → RbR

This grammar generates the string aabbbcc by the derivation shown next. In each step, one non terminal of the current sequence of symbols is rewritten.
Derivation
Productions applied, in order:

1. T → aTc
2. T → aTc
3. T → R
4. R → RbR
5. R → ε
6. R → RbR
7. R → RbR
8. R → ε
9. R → ε
10. R → ε

Derivation of the string aabbbcc using the given grammar

In this derivation, we have applied derivation steps sometimes to the leftmost non terminal, sometimes to the rightmost and sometimes to a non terminal that was neither.
Derivation- Parsing
The Syntax Analysis phase checks the structure of the source code statements. This is called Parsing.

There are two common methods:

1. Trying to generate the statement from the start symbol by applying production rules. This is called top down parsing.

We have generated the string aabbbcc from the start symbol T:

T ⇒ ... ⇒ aabbbcc

2. Taking the string and applying productions in reverse to arrive at the start symbol. This is called bottom up parsing.

aabbbcc → ... → T   (each arrow is a production applied in reverse)
Derivation
However, since derivation steps are local, the order in which non terminals are rewritten does not affect the string that is finally derived. So, we might as well decide to always rewrite the leftmost non terminal.

A derivation that always rewrites the leftmost non terminal is called a leftmost derivation. Similarly, a derivation that always rewrites the rightmost non terminal is called a rightmost derivation.

Productions applied, in order, in the leftmost derivation of aabbbcc:

1. T → aTc
2. T → aTc
3. T → R
4. R → RbR
5. R → RbR
6. R → ε
7. R → RbR
8. R → ε
9. R → ε
10. R → ε
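The leftmost derivation can be replayed mechanically: at every step the listed production is applied to the leftmost non terminal. This is a small sketch; the function name leftmost_derivation and the use of single-character symbols are assumptions for the illustration.

```python
NONTERMINALS = {"T", "R"}


def leftmost_derivation(start, productions):
    """Apply each (lhs, rhs) production to the leftmost non terminal, in order.

    Returns the list of sentential forms, beginning with the start symbol.
    An empty rhs string stands for epsilon.
    """
    form = [start]
    forms = ["".join(form)]
    for lhs, rhs in productions:
        i = next(k for k, sym in enumerate(form) if sym in NONTERMINALS)
        if form[i] != lhs:
            raise ValueError(f"leftmost non terminal is {form[i]!r}, not {lhs!r}")
        form = form[:i] + list(rhs) + form[i + 1:]
        forms.append("".join(form))
    return forms


# The ten productions of the leftmost derivation, in the order listed above.
STEPS = [
    ("T", "aTc"), ("T", "aTc"), ("T", "R"), ("R", "RbR"), ("R", "RbR"),
    ("R", ""), ("R", "RbR"), ("R", ""), ("R", ""), ("R", ""),
]
FORMS = leftmost_derivation("T", STEPS)
```

FORMS starts with T, aTc, aaTcc, aaRcc, aaRbRcc, aaRbRbRcc and ends with aabbbcc.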
Derivation - Trees

Drawing the tree from production rules

We can draw a derivation as a tree:

Root of the tree = Start symbol

For each derivation step, the symbols on the RHS of the chosen production are added as children below the non terminal.

When applying T → aTc, the symbols a, T and c will be drawn as children below T:

    T
  / | \
 a  T  c

Read the leaves from left to right: the leaves of the tree are terminals which, when read from left to right, form the derived string. ε is ignored.
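Reading the leaves of a derivation tree from left to right can be sketched as a short recursive function. The tuple representation (label, children) and the helper names are assumptions for this illustration; a non terminal with no children stands for an ε production.

```python
def leaves(node):
    """Concatenate the terminal leaves of a tree, left to right; epsilon adds nothing."""
    if isinstance(node, str):
        return node  # a terminal leaf
    _label, children = node
    return "".join(leaves(child) for child in children)


R_EPS = ("R", [])  # R -> epsilon


def r_b(left, right):
    """Tree node for R -> R b R."""
    return ("R", [left, "b", right])


# One possible tree for aabbbcc: T -> aTc twice, then T -> R, then R derives bbb.
R_BBB = r_b(r_b(R_EPS, R_EPS), r_b(R_EPS, R_EPS))
TREE = ("T", ["a", ("T", ["a", ("T", [R_BBB]), "c"]), "c"])
```

leaves(TREE) gives aabbbcc. A differently shaped subtree for R, such as r_b(R_EPS, r_b(R_EPS, r_b(R_EPS, R_EPS))), yields the same string.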
Derivation - Trees
Order of derivation does not matter: only the choice of rule matters.

[Figure: syntax tree for the string aabbbcc, irrespective of the order of derivation; the first, second and third "b" from the left are labelled]


Ambiguity
But we may have an alternate tree for the same string. The choice of production matters.

[Figure: alternate syntax tree for aabbbcc, in which a different rule has been applied]

When a grammar permits several different syntax trees for some strings, we call the grammar ambiguous.
Ambiguity
Ambiguity is not a problem for validating syntax.

Both parse trees show that aabbbcc is a valid string.

But the problem is elsewhere: it appears when we evaluate a string such as 2 + 3 * 4.

Let's take the example of an expression grammar:

E → E + E
E → E * E
E → num

E ⇒ E + E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + 3 * 4

E ⇒ E * E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + 3 * 4
Ambiguity
First derivation: E ⇒ E + E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + 3 * 4

      E
    / | \
   E  +  E
   |   / | \
   2  E  *  E
      |     |
      3     4

Evaluation:
3 * 4 = 12;
2 + 12 = 14

Second derivation: E ⇒ E * E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + 3 * 4

        E
      / | \
     E  *  E
   / | \   |
  E  +  E  4
  |     |
  2     3

Evaluation:
2 + 3 = 5;
5 * 4 = 20

NOTE: THE SUBTREES ARE EVALUATED FIRST
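The effect of the two trees on evaluation can be checked with a few lines of Python. The nested-tuple encoding of the trees and the function name evaluate are assumptions made for this sketch.

```python
def evaluate(tree):
    """Evaluate a parse tree given as an int or a tuple (left, op, right).

    Sub trees are evaluated first, then the operator at the root is applied.
    """
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b


# Same string 2 + 3 * 4, two different parse trees.
PLUS_AT_ROOT = (2, "+", (3, "*", 4))   # from E => E + E => E + E * E
TIMES_AT_ROOT = ((2, "+", 3), "*", 4)  # from E => E * E => E + E * E
```

evaluate(PLUS_AT_ROOT) gives 14 and evaluate(TIMES_AT_ROOT) gives 20, so the same valid string has two different meanings.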
Ambiguity Resolution
A parser cannot be built directly from an ambiguous grammar: the parser must make exactly one tree while processing the token string.

So, ambiguity must be resolved:

1) Use a disambiguating / precedence rule while parsing
2) Rewrite the grammar to make it unambiguous (with the language unchanged)

Two properties of operators guide both approaches:

(i) Associativity

a - b - c will be processed as (a - b) - c → left associative
a ** b ** c will be processed as a ** (b ** c) → right associative
a > b > c will be invalid → non associative

Note: Each of + and * could be treated as either right associative or left associative, but for convenience they are made left associative. (The parser has to follow one rule consistently.)

(ii) Precedence

a + b * c will be treated as a + (b * c)
Ambiguity Detection
Ambiguity exists in a grammar if there exists a string which can result in two distinct parse trees.

- Very hard to detect, and undecidable in general: no algorithm can decide ambiguity for every context free grammar
- In many cases, however, it is not difficult to spot by looking at the grammar, e.g., a production of the form

N → NαN

Note: Parsers can be built only from unambiguous grammars

Most of the ambiguity occurs in expression grammars:

E → E op E
E → num
(num is a numeric literal)
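For one given string, ambiguity can be demonstrated by counting its parse trees. The sketch below does this for the grammar E → E op E, E → num only; the function name count_parse_trees and the default operator set are assumptions, and this is not a general ambiguity test for arbitrary grammars.

```python
from functools import lru_cache


def count_parse_trees(tokens, operators=("+", "*")):
    """Count the distinct parse trees of `tokens` under E -> E op E | num."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways tokens[i:j] can be derived from E.
        if j - i == 1:
            return 1 if tokens[i] == "num" else 0
        total = 0
        # Try every operator position k as the root of the tree.
        for k in range(i + 1, j - 1):
            if tokens[k] in operators:
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))
```

count_parse_trees("num + num * num".split()) returns 2, which matches the two trees for 2 + 3 * 4. Any count above 1 for some string shows that the grammar is ambiguous.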
Rewriting ambiguous grammar
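Expression Grammar

Rewrite as follows:

(a) For left associative operators (e.g., a-b-c)

Introduce a new non terminal E':

E → E op E'
E → E'
E' → num
op → + | - | * | /

Isolate the rightmost operand E' first; the rest of the expression is pushed into the left sub tree.

Derivation example: E ⇒ E - E' ⇒ (E - E') - E' ⇒ (E' - E') - E' ⇒* (num - num) - num

There is an implicit parenthesis: the parentheses are not part of the derived string, they only show the grouping in the parse tree.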

Rewriting ambiguous grammar
(b) For right associative operators (e.g., a**b**c)

Introduce a new non terminal E':

E → E' op E
E → E'
E' → num

or, substituting num for E':

E → num op E
E → num

Derivation example (with ^ as the operator):
E ⇒ E' ^ E ⇒ num ^ E ⇒ num ^ (E' ^ E) ⇒ num ^ (num ^ E) ⇒ num ^ (num ^ E') ⇒ num ^ (num ^ num)

There is an implicit parenthesis
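A right recursive production translates directly into a recursive function, and the recursion produces the right associative grouping. The sketch below follows E → num ^ E and E → num; the function names and the tuple encoding of the tree are assumptions for this illustration.

```python
def parse_power(tokens, pos=0):
    """Parse E -> num ^ E | num starting at `pos`; return (tree, next position)."""
    if pos >= len(tokens) or not isinstance(tokens[pos], int):
        raise SyntaxError(f"number expected at position {pos}")
    left = tokens[pos]
    if pos + 1 < len(tokens) and tokens[pos + 1] == "^":
        # E -> num ^ E: everything to the right becomes the right sub tree.
        right, end = parse_power(tokens, pos + 2)
        return (left, "^", right), end
    return left, pos + 1  # E -> num


def evaluate_power(tree):
    """Evaluate a tree built by parse_power, sub trees first."""
    if isinstance(tree, int):
        return tree
    left, _op, right = tree
    return evaluate_power(left) ** evaluate_power(right)
```

For the tokens 2 ^ 3 ^ 2 the tree is (2, "^", (3, "^", 2)), which evaluates to 512; a left associative grouping would give 64.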


Rewriting ambiguous grammar
(c) For non associative operators (e.g., a < b)

E → E' op E'
E → E'
E' → num

or, substituting num for E':

E → num op num
E → num

e.g., a<b is allowed, but a<b<c is not allowed


Rewriting ambiguous grammar
So far, we have handled only the cases where an operator interacts with itself.

This is easily extended to the case where several operators with the same precedence and associativity interact:

E → E + E'
E → E - E'
E → E'
E' → num

"+" and "-" are both left associative, hence a left recursive grammar is required.
Rewriting ambiguous grammar
But if we mix left recursive with right recursive productions, the grammar will be ambiguous again:

E → E + E'
E → E' ^ E
E → E'
E' → num

As an example, we cannot even represent 2 + 3 ^ 4 using this grammar.
Rewriting ambiguous grammar
Mixing operators with different precedence but equal associativity

We must know the precedence of the operators.

The higher precedence operator needs to be worked out first, so it is placed deeper in the grammar.

Use different non terminals for different precedence levels:

E → E + E2
E → E - E2
E → E2
E2 → E2 * E3
E2 → E2 / E3
E2 → E3
E3 → num
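The layered grammar can be turned into a small evaluator with one function per precedence level. This is a sketch under stated assumptions: the left recursive productions are implemented as loops (a common hand-written substitute, not something the slides prescribe), tokens are Python ints and operator strings, and the function name parse_expression is hypothetical.

```python
def parse_expression(tokens):
    """Evaluate tokens (ints and operator strings) using the levels E, E2, E3."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def e3():  # E3 -> num
        nonlocal pos
        tok = peek()
        if not isinstance(tok, int):
            raise SyntaxError(f"number expected, got {tok!r}")
        pos += 1
        return tok

    def e2():  # E2 -> E2 * E3 | E2 / E3 | E3, as a left-to-right loop
        nonlocal pos
        value = e3()
        while peek() in ("*", "/"):
            op = tokens[pos]
            pos += 1
            rhs = e3()
            value = value * rhs if op == "*" else value / rhs
        return value

    def e():  # E -> E + E2 | E - E2 | E2, as a left-to-right loop
        nonlocal pos
        value = e2()
        while peek() in ("+", "-"):
            op = tokens[pos]
            pos += 1
            rhs = e2()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result
```

parse_expression([2, "+", 3, "*", 4]) gives 14 because * is handled one level deeper than +, and parse_expression([8, "-", 3, "-", 2]) gives 3 because the loop groups to the left.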
Other sources of ambiguity
Example:

if P then if Q then S1 else S2

The ambiguity is: which "if" is the "else" connected to?

It might mean

if P then ( if Q then S1 else S2 )

or

if P then ( if Q then S1 ) else S2

Note: the ambiguity arises because the "else" clause is optional. If it were mandatory, the statement would be unambiguous.


Other sources of ambiguity
Let’s see

The grammar is

stmt → <id> := <exp>
stmt → <stmt> . <stmt>
stmt → if <exp> then <stmt> else <stmt>
stmt → if <exp> then <stmt>

According to this grammar, the single "else" can equally match with either "if".
Other sources of ambiguity

[Figure: two parse trees for the same statement, indicating an ambiguous grammar]


Other sources of ambiguity
Usual convention: "else" matches with the closest "if".

We will enforce this rule by rewriting the grammar.

We introduce two new non terminals, matched and unmatched:

stmt → <matched>
stmt → <unmatched>
matched → if <exp> then <matched> else <matched>
matched → <id> := <exp>
unmatched → if <exp> then <matched> else <unmatched>
unmatched → if <exp> then <stmt>
Other sources of ambiguity
For statements with unmatched if and else

unmatched → if <exp> then <matched> else <unmatched>
unmatched → if <exp> then <stmt>

Between a "then" and its "else" only a matched statement may appear, so an "else" can never be captured by an outer "if". The "then" part of an if without an else may be any statement.
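The closest-if convention can also be illustrated with a small recursive parser that consumes an "else" as soon as it sees one. This is a sketch of that greedy technique, not an implementation of the matched/unmatched grammar itself. Conditions and simple statements are single tokens such as P or S1, which is a simplifying assumption, and the function name parse_stmt is hypothetical.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at `pos`; return (tree, next position).

    An if statement becomes ("if", cond, then_branch) or
    ("if", cond, then_branch, else_branch).
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1  # a simple statement such as S1
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("'then' expected")
    cond = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 3)
    # The innermost call sees the 'else' first, so it binds to the closest 'if'.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("if", cond, then_branch), pos
```

For if P then if Q then S1 else S2 the result is ("if", "P", ("if", "Q", "S1", "S2")), i.e., the else belongs to the inner if.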
