
COMPILER DESIGN STUDY MATERIAL

COMPILERS

Simply stated, a compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language).

    source program ──> [ Compiler ] ──> target program
                           │
                           └──> error messages

Evolution of the compiler:


Compilers came into existence because of the difficulty of writing large programs in machine language. High-level languages were therefore created, but machines cannot understand them directly, so compilers translate these languages into assembly language, and assemblers then translate the assembly program into machine language.

Compilers are, however, very difficult programs to write. For example, the first Fortran compiler took nearly 18 staff-years to implement during the 1950s. Over the years, good implementation languages, programming environments, and software tools have been developed that make the task easier.

The Analysis-Synthesis Model of compilation:


There are two parts to compilation:

1. Analysis
2. Synthesis

1. ANALYSIS OF THE SOURCE PROGRAM:


There are 3 phases in this part of the compilation:
i) Linear analysis (lexical analysis)
ii) Hierarchical analysis (syntax analysis)
iii) Semantic analysis


 

i) Lexical analysis:

The stream of characters making up the source program is read from left to right and grouped into tokens. In a compiler, linear analysis is called lexical analysis or scanning.

For example, in lexical analysis the characters in the assignment statement

position := initial + rate * 60

would be grouped into the following tokens:
1. The identifier1 → position
2. The assignment symbol → :=
3. The identifier2 → initial
4. The plus sign → +
5. The identifier3 → rate
6. The multiplication sign → *
7. The number → 60
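As an illustration of how such scanning can be implemented (a minimal sketch, not part of the original material; the token names are assumptions chosen for the example), a hand-written scanner for this statement might look like the following C code:

#include <ctype.h>
#include <stdio.h>

/* Token classes for the tiny assignment statement above
   (the names are illustrative assumptions). */
enum token { ID, ASSIGN, PLUS, STAR, NUMBER, END };

static const char *src = "position := initial + rate * 60";
static int pos = 0;

/* Return the next token, printing its lexeme as a side effect. */
enum token next_token(void) {
    while (isspace((unsigned char)src[pos])) pos++;   /* skip blanks */
    if (src[pos] == '\0') return END;
    if (isalpha((unsigned char)src[pos])) {           /* identifier  */
        int start = pos;
        while (isalnum((unsigned char)src[pos])) pos++;
        printf("identifier: %.*s\n", pos - start, src + start);
        return ID;
    }
    if (isdigit((unsigned char)src[pos])) {           /* number      */
        int start = pos;
        while (isdigit((unsigned char)src[pos])) pos++;
        printf("number: %.*s\n", pos - start, src + start);
        return NUMBER;
    }
    if (src[pos] == ':' && src[pos + 1] == '=') {     /* := symbol   */
        pos += 2; printf(":=\n"); return ASSIGN;
    }
    printf("%c\n", src[pos]);                         /* + or *      */
    return src[pos++] == '+' ? PLUS : STAR;
}

int main(void) {
    while (next_token() != END)
        ;                                             /* scan it all */
    return 0;
}

Each call skips white space, recognizes the longest possible lexeme, and returns its token class, which is exactly the grouping listed above.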

ii) Hierarchical analysis:


In this phase characters or tokens are grouped hierarchically into nested collections with collective meaning. Hierarchical analysis is also called parsing or syntax analysis. It involves grouping the tokens of the source program into parse trees that are used by the compiler to synthesize output.
Example:

    assignment statement
    ├── identifier1 (position)
    ├── :=
    └── expression
        ├── expression → identifier2 (initial)
        ├── +
        └── expression
            ├── expression → identifier3 (rate)
            ├── *
            └── expression → number (60)

Fig 1: Parse tree for position := initial + rate * 60

The hierarchical structure of a program is usually expressed by recursive rules.


 

The following rules are part of the definition of expressions:

1. Any identifier is an expression.
2. Any number is an expression.
3. If expression1 and expression2 are expressions, then so are
   expression1 + expression2
   expression1 * expression2
   ( expression1 )

Languages define statements recursively by rules such as:

1. If identifier1 is an identifier and expression2 is an expression, then
   identifier1 := expression2 is a statement.

2. If expression1 is an expression and statement2 is a statement, then
   while ( expression1 ) do statement2
   if ( expression1 ) then statement2 are statements.

iii) Semantic analysis:


The semantic analysis phase checks the source program for semantic errors. It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and operands of expressions and statements.

The most important component of semantic analysis is type checking.

Example:

    before type checking              after type checking

         :=                                :=
       /    \                            /    \
  position   +                      position   +
           /   \                             /   \
      initial   *                       initial   *
              /   \                             /   \
           rate    60                        rate   inttoreal
                                                       |
                                                       60

Fig 2: semantic analysis inserts a conversion from integer to real.


 

2) SYNTHESIS OF THE SOURCE PROGRAM:


This part consists of 3 phases:

i) Intermediate code generation
ii) Code optimization
iii) Code generation

i) Intermediate code generation:


Compilers generate an explicit intermediate representation of the source program. We think of this intermediate representation as a program for an abstract machine.

This intermediate code should have the following properties:

1. It should be easy to produce.
2. It should be easy to translate into the target program.

We use "3-address code" for the intermediate code.

Example: The source statement appears in 3-address code like this:

temp1:=inttoreal(60)

temp2:= id3 * temp1

temp3:= id2 + temp2

id1:= temp3

The intermediate code has the following properties:

1. Each 3-address instruction has at most one operator besides the assignment, so the compiler must decide the order in which the operations are performed.
2. The compiler must generate a temporary name to hold the value computed by each instruction.
3. Some 3-address instructions have fewer than three operands, e.g. the first and last instructions in the above example.


 

ii) Code optimization:


The code optimization phase attempts to improve the intermediate code so that the target program runs faster and/or takes less space.

Example: the above code can be reduced to:

temp1 := id3 * 60.0

id1 := id2 + temp1

Note:

1. The inttoreal operation can be performed at compile time, so it can be removed.
2. temp3 is used only to copy the computed value into id1, so the final instruction can also be removed.

iii) Target Code generation:


In this phase the target code is generated, usually assembly code.
Example:
MOVF R2, id3
MULF R2, #60.0
MOVF R1, id2
ADDF R1, R2
MOVF id1, R1

A crucial aspect is the assignment of variables to registers. Target code optimization is performed to generate effective target code.


 

The phases of the compiler:


All the phases discussed above can be represented together, as in the example that follows. In addition, two activities interact with every phase: symbol table management and error handling.

Symbol table management:


One of the essential functions of the compiler is to record the identifiers used in the source program and collect information about the various attributes of each identifier.

Identifier    Attributes

Variables     type, scope, and storage allocation

Procedures    number and types of its arguments, the method of passing each argument, and the type returned

All this information is stored in the symbol table, which is a data structure containing a record for each identifier, with fields for the attributes of the identifier.

The lexical analyzer enters the symbols, and the remaining phases use them in various ways; for example, the semantic analyzer uses the recorded attributes to check that each identifier is used validly in a statement.


 

Error detection and reporting:


Another important component of a compiler is error detection and reporting.

Every phase of the compiler may encounter errors, which need to be handled so that compilation can proceed and further errors can be detected.

A compiler that stops after the first error it encounters is of little use. Most errors are handled by the syntax-analysis and semantic-analysis phases.


 

Example:
The following shows the input and output of each phase of the compiler for the statement below; the identifiers are also recorded in the symbol table.

position := initial + rate * 60

        │ lexical analyzer
        ▼
id1 := id2 + id3 * 60

        │ syntax analyzer
        ▼
         :=
       /    \
     id1     +
           /   \
        id2     *
              /   \
           id3     60

        │ semantic analyzer
        ▼
         :=
       /    \
     id1     +
           /   \
        id2     *
              /   \
           id3    inttoreal
                     |
                     60


 

        │ intermediate code generator
        ▼
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

        │ code optimizer
        ▼
temp1 := id3 * 60.0
id1 := id2 + temp1

        │ code generator
        ▼
MOVF R2, id3
MULF R2, #60.0
MOVF R1, id2
ADDF R1, R2
MOVF id1, R1

Symbol Table

1  position  ………
2  initial   ………
3  rate      ………

Let us consider the following source program



 

int a,b;

float c,d;

char x,y;

c=a+b;

d=c-b;

x=x+y;

1) Lexical Analysis:-

int → data type (int)
a → id1 (identifier)
, → separator
b → id2 (identifier)
; → line separator
float → data type (float)
c → id3 (identifier)
, → separator
d → id4 (identifier)
char → data type (char)
x → id5 (identifier)
, → separator
y → id6 (identifier)
; → line separator
c → id7 (identifier)
= → operator (assignment)
a → id8 (identifier)
+ → operator (plus)


b → id9 (identifier)
; → line separator
d → id10 (identifier)
= → operator (assignment)
c → id11 (identifier)
- → operator (minus)
b → id12 (identifier)
; → line separator
x → id13 (identifier)
= → operator (assignment)
x → id14 (identifier)
+ → operator (plus)
y → id15 (identifier)
; → line separator

2) Syntax Analysis:- constructing parse trees for the given expressions:

a) c=a+b;    b) d=c-b;


c)x=x+y

3) Semantic Analysis:-


4)Intermediate Code generation:-

t1=a+b; t2=c-b; t3=x+y;

c=t1; d=t2; x=t3;

5) Intermediate Code optimization:- It deletes unnecessary, redundant instructions from the intermediate code. The main advantage of optimization is that the size of the code is reduced. Here each temporary is used only to copy a value into c, d or x, so the temporaries can be eliminated:

c=a+b; d=c-b; x=x+y;

6) Target code generation:-

For the source program

int a,b; float c,d;
char x,y; c=a+b;
d=c-b; x=x+y;

the equivalent target code is

ADD a,b

MOV c,a

SUB c,b

MOV d,c

ADD x,y


Symbol table management:

Let the source program be:

EX:  main()
     {  int x;
        add()
        {  int a=10, b=20, c;
           c = a + b;
        }
     }

Name   Type   Value   Scope

x      int    ------  main
a      int    10      add
b      int    20      add
c      int    ------  add

Note: A syntax tree is a compressed representation of the parse tree, in which the operators appear as interior nodes and the operands appear as the children of their operator node.

PASS: A group of phases is combined into one pass. In general, the first 3 phases are combined into the first pass and the remaining phases into another pass; therefore, in general, a compiler has two passes.

Tools used to perform the compiler phases are:

Scanner → lexical analysis

Parser → syntax analysis

Parser → semantic analysis

Intermediate code generator → intermediate code generation

Code optimizer → intermediate code optimization

Code generator → target code generation



Cousins of the compiler:

Preprocessor: includes header files in the source program.

Macro preprocessor: expands macro definitions in place of macro calls.

Linker: links library files and object modules into the program.

Loader: loads the program into main memory for execution.

Assembler: translates an assembly language program into machine language.

Compiler-generated target code is usually larger than hand-written target code, but writing target code by hand is very difficult, and correcting errors in it is harder still.

Types of compiler

Incremental compiler: recompiles only the modified part of the source program during recompilation.

Cross compiler: a compiler that runs on one machine and generates target code for another machine.

Bootstrapping of a compiler: a process in which a simple language is used to translate a more complicated program, which in turn may handle an even more complicated program, and so on. The output of one compiler is the input of the next compiler, whose output is given to the next compiler, and so on, until the required target code is generated.

Interpreter

An interpreter is a kind of translator which produces the result directly when the source program and its data are given to it as input.

    source program & data ──> [ interpreter ] ──> result


1. LEXICAL ANALYSIS
 Lexical analysis is the first phase of the compiler; it generates the set of "tokens" identified in the source program.
 Lexical analysis scans the source program from left to right to generate tokens.
 The request to scan the source program is given by the parser, and the scanning is carried out by the lexical analyzer, which supplies the next token.
 Lexical analysis is achieved by finite automata which accept all valid tokens of the source language.
 Lexical analysis interacts with the symbol table to store information about the symbols found during scanning.
 Lexical analysis also reports the errors present in the source program, including line numbers.

2. INPUT BUFFER:

To perform lexical analysis, the compiler needs to maintain a part of memory to store the program, called the "input buffer".

In a one-buffer scheme, if the lexeme is too long to fit into the input buffer, then the first part of the lexeme gets overwritten.

In the two-buffer scheme, the input buffer is two memory blocks in size and is divided into 2 partitions; this is called the "buffer pair" scheme.

The input buffer maintains 2 pointers, called:
1. Beginning pointer (bptr)
2. Forward pointer (fptr)

The beginning pointer points to the first character of the lexeme.

The forward pointer also starts at the first character and is moved forward until a token is recognized. After the token is identified, the beginning pointer is moved to the character following the last character of the token.

int i,j;

If this is the source program, then initially both pointers point to the first character:

    bptr
    ↓
    i  n  t     i  ,  j  ;
    ↑
    fptr

After the forward pointer is moved forward 3 times, the token "int" is identified and the input buffer is in the form:

(word separators and line separators are skipped)

    bptr              fptr
    ↓                 ↓
    i  n  t     i  ,  j  ;
    └─ token ─┘

In the buffer pair scheme, for each movement of the forward pointer we have to check whether the end of a partition has been reached.

If it is the end of the 1st partition, the 2nd partition is loaded.

If it is the end of the 2nd partition, the 1st partition is reloaded.

The major drawback of this method is that two checks are required for every movement of the forward pointer. The alternative solution is the "sentinel method".

In this method the end of each partition is marked with a special character, "EOF". Now, for every movement of the forward pointer, we only have to check whether the current character is "EOF", so one check is enough per movement. Only when the character is "EOF" do we check which partition end was reached, and then the other partition is loaded.
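The sentinel idea can be sketched in C as follows (an illustrative sketch, not from the original material; the partition size, the choice of '\0' as the EOF sentinel, and all names are assumptions). The common case of advancing the forward pointer costs a single comparison:

#include <stdio.h>

#define HALF 4096                 /* size of one buffer partition       */
#define SENTINEL '\0'             /* stands in for the "EOF" marker     */

static char buf[2 * (HALF + 1)];  /* two halves, each ends in a sentinel */
static char *fptr = buf;          /* forward pointer                    */

/* Load partition `half` (0 or 1) and plant the sentinel after it. */
static size_t load_half(FILE *fp, int half) {
    char *base = buf + half * (HALF + 1);
    size_t n = fread(base, 1, HALF, fp);
    base[n] = SENTINEL;
    return n;
}

/* Advance fptr by one character: one check per move in the fast path. */
static int advance(FILE *fp) {
    char c = *fptr++;
    if (c != SENTINEL) return c;               /* the usual case        */
    if (fptr == buf + HALF + 1) {              /* end of 1st partition  */
        if (load_half(fp, 1) == 0) return -1;  /* true end of input     */
        return *fptr++;
    }
    if (fptr == buf + 2 * (HALF + 1)) {        /* end of 2nd partition  */
        fptr = buf;
        if (load_half(fp, 0) == 0) return -1;
        return *fptr++;
    }
    return -1;   /* sentinel seen inside the input: treat as end        */
}

int main(void) {
    load_half(stdin, 0);                       /* prime the 1st half    */
    for (int c; (c = advance(stdin)) != -1; )
        putchar(c);
    return 0;
}

Only when the sentinel is seen does the code pay for the second check of which partition ended, exactly as described above.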

LEXEME: The string between the beginning pointer and the forward pointer is called a "LEXEME".

PATTERN: The rule describing the set of strings (lexemes) that yield the same token is called a pattern.
Ex: x1 and abc both match the identifier pattern, so both return the identifier token.

TOKEN: The lexical analyzer returns a token if the lexeme is valid for the source language. All the possible word classes of the source language are called tokens.


3. LEXICAL ERROR HANDLING MECHANISMS

Every phase of the compiler may find errors in the source program; the errors are reported to the error-reporting unit.

Most of the errors found in lexical analysis are typing errors. These errors can be handled by the lexical error handler.

Mechanisms to handle lexical errors:

1. Panic mode recovery: the content of the source program is skipped until a valid token can be generated. It is a simple technique, but some input is lost.
2. Insert a missing character:       FR → FOR
3. Delete an extra character:        IFX → IF
4. Transpose adjacent characters:    FI → IF
5. Replace an incorrect character:   FXR → FOR

FINITE AUTOMATA

Deterministic finite automaton (DFA):
A deterministic finite automaton is a five-tuple (Q, Σ, δ, q0, F), where
Q is the finite set of states,
Σ is the finite set of input symbols,
q0 ∈ Q is the initial state,
F ⊆ Q is the set of final states, and
δ : Q × Σ → Q is the transition function.

Non-deterministic finite automaton (NFA):

A non-deterministic finite automaton is a five-tuple (Q, Σ, δ, q0, F), where
Q is the finite set of states,
Σ is the finite set of input symbols,
q0 ∈ Q is the initial state,
F ⊆ Q is the set of final states, and
δ : Q × Σ → 2^Q is the transition function.

A) Construct an "NFA" for the input string "aabaaa"

The NFA (the original transition diagram, reproduced here as a transition list):

δ(q0, a) = {q0}      δ(q0, b) = {q0, q1}
δ(q1, a) = {q2}      δ(q1, b) = {q2}
δ(q2, a) = ∅         δ(q2, b) = ∅

i.e. q0 has a self-loop on a and b, q0 also goes to q1 on b, and q1 goes to q2 on a or b; q2 is the final state.


B) Construct the equivalent "DFA" for the above NFA, where

Q = {q0, q1, q2}
Σ = {a, b}
F = {q2}
δ : Q × Σ → 2^Q

Sol:
For the DFA, M' = (Q', Σ, δ', q0', F'). The subset construction gives:

      state              a           b
      [q0]              [q0]        [q0,q1]
      [q0,q1]           [q0,q2]     [q0,q1,q2]
    * [q0,q2]           [q0]        [q0,q1]
    * [q0,q1,q2]        [q0,q2]     [q0,q1,q2]

(states marked * are final, since they contain q2)

Finite automata

[Fig: transition diagram of the resulting DFA over the four subset states listed in the table above.]
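The resulting DFA is easy to simulate with a table-driven loop. The following C sketch is illustrative (not from the original material); numbering the subset states 0=[q0], 1=[q0,q1], 2=[q0,q2], 3=[q0,q1,q2] is an assumption made for the code:

#include <stdio.h>

/* States: 0=[q0] 1=[q0,q1] 2=[q0,q2] 3=[q0,q1,q2]; 2 and 3 are final
   because those subsets contain q2. */
static const int delta[4][2] = {  /* columns: 0 = 'a', 1 = 'b'        */
    {0, 1},                       /* [q0]                             */
    {2, 3},                       /* [q0,q1]                          */
    {0, 1},                       /* [q0,q2]                          */
    {2, 3},                       /* [q0,q1,q2]                       */
};

static int accepts(const char *s) {
    int state = 0;                        /* start in [q0]            */
    for (; *s; s++)
        state = delta[state][*s == 'b'];  /* one lookup per character */
    return state == 2 || state == 3;      /* final iff set contains q2 */
}

int main(void) {
    printf("%d\n", accepts("aaba"));      /* 1: accepted              */
    printf("%d\n", accepts("aabaaa"));    /* 0: rejected              */
    return 0;
}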


1. CONTEXT FREE GRAMMAR

A context free grammar is a 4-tuple G = (V, T, P, S), where

V → a set of non-terminals,

T → a set of terminal symbols,

P → a set of productions,

S → the start symbol of the grammar.

AMBIGUOUS GRAMMAR:

A grammar is said to be ambiguous if it generates multiple parse trees for one input string.

LEFT RECURSIVE GRAMMAR:

A grammar is said to be left recursive if it has productions of the following form:

A → Aα | β

The left recursion can be eliminated from the grammar by writing equivalent productions in non-recursive form:

A → βA'
A' → αA' | ε

LEFT FACTORING: The parser cannot decide which production to choose if the grammar has productions of the following form:

A → γx1 | γx2 | γx3 | γx4 | γx5 | … | γxn

By performing left factoring, the productions are converted into:

A → γZ
Z → x1 | x2 | x3 | x4 | x5 | … | xn


AMBIGUOUS GRAMMAR:

A grammar is said to be ambiguous if it generates multiple parse trees for the same input string.

An ambiguous grammar does not support top-down parsing. The ambiguity can be eliminated by eliminating left recursion from the grammar and then performing left factoring; the resulting grammar will be unambiguous.

EX: E → E+E | E*E | id


Ambiguity

stmt → if expr then stmt
     | if expr then stmt else stmt
     | otherstmts

The statement

if E1 then if E2 then S1 else S2

has two parse trees: in the first, the else associates with the inner if, i.e. if E1 then (if E2 then S1 else S2); in the second, it associates with the outer if, i.e. if E1 then (if E2 then S1) else S2. This is the classic "dangling else" ambiguity.



EX:-  E → E+T | T

      T → T*F | F

      F → (E) | id

Eliminate the left recursion in the above grammar. After elimination of left recursion the grammar will be:

E → TE'

E' → +TE' | ε

T → FT'

T' → *FT' | ε

F → (E) | id



SYNTAX ANALYSIS
 The output of lexical analysis is a set of tokens, i.e. the tokenized source program.
 The tokenized source program is given as input to the parser.
 The parser is the tool which performs syntax analysis.
 Construction of the parse tree is called "PARSING".
 Parsing is the process of deriving the input string from the given grammar.
 Parsing is performed by two methods:
1) Top-down parsing
2) Bottom-up parsing.

Top-down parsing: In top-down parsing a parse tree is constructed from root node to leaf
node.

Bottom-up parsing: In bottom-up parsing a parse tree is constructed from leaf nodes to root
node.

1) Top down parsing is classified into two types .They are,


a) Recursive parsing
 In recursive parsing a recursive call is used to check the acceptance of input string.
 Recursive parsing is classified into two types. They are,
i) Recursive parsing with back tracking.
ii) Recursive parsing without back tracking.
b) Non-recursive parsing
 Non-Recursive parsing uses a parse table to perform parsing.
 In non-recursive parsing a parse table is constructed by using FIRST and
FOLLOW of the symbol.
2) Bottom-up parsing is also called LR parsing (L means scanning the input string from left to right, and R means constructing a rightmost derivation in reverse).


Recursive parsing with backtracking:-

e.g. 1:- Let us take  S → cAd
                      A → ab | a

Derive the given input string "cad".

Ans:- The parser expands S to cAd and matches c. It then tries A → ab: the a matches, but b does not match the remaining d, so the parser backtracks.

After backtracking, the alternative A → a is tried; it succeeds, the final d is matched, and "cad" is accepted.

Recursive parsing without backtracking:-

1. A finite automaton is designed for each non-terminal of the grammar.

The parsing is performed by scanning the finite automata to generate the input string.

2. During scanning of a finite automaton:
a. If a non-terminal is encountered, scanning jumps to the finite automaton of that non-terminal symbol.

b. If scanning reaches the final state of that non-terminal's automaton, it jumps back to the previous finite automaton to continue processing.

c. To check the acceptance of the input string, a set of finite automata machines is used.


d. Each finite automaton maintains a set of input symbols for making transitions.
e. During the check for acceptance of the input string, we proceed through the input symbols to generate the required input string.

EX:-  E → E+T | T

      T → T*F | F

      F → (E) | id

Eliminate the left recursion in the above grammar, because top-down parsing cannot parse left recursive grammars. After elimination of left recursion the grammar will be:

E → TE'

E' → +TE' | ε

T → FT'

T' → *FT' | ε

F → (E) | id


Check the validation of the input string id+id.

First, control goes to the finite automaton of E. The first symbol there is the non-terminal T, so it goes to the finite automaton of T. The first symbol there is the non-terminal F, so it goes to the finite automaton of F. Its first symbol is id, and control reaches the final state of F, so it returns to the previous automaton, T. The next symbol in that automaton is T', so control goes to the automaton of T'. In that automaton the symbol


is ε, so control reaches the final state of the automaton of T' and returns to the previous automaton, E. The next symbol in that automaton is E', so control goes to the automaton of E'.

The first symbol in the automaton of E' is +; after that, the next symbol is T. Control goes to the automaton of T, then to the automaton of F, whose first symbol is id. The final state is reached, so control returns to the previous automaton T. Following the same steps, control reaches the final state of the automaton of E.

Roughly, the control transfer is as follows:

E → T → F → T → T' → T → E → E' → T → F → T → T' → T → E
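The same control transfer can be realized directly with one mutually recursive C function per non-terminal instead of explicit finite automata. The sketch below is illustrative (not from the original material) and abbreviates the token id to the single character 'i':

#include <stdio.h>

/* Recursive-descent recognizer for:
   E -> T E'   E' -> + T E' | eps
   T -> F T'   T' -> * F T' | eps
   F -> ( E ) | id            ('i' stands for the token id)   */
static const char *in;                 /* next unread input character */
static int ok = 1;                     /* becomes 0 on a syntax error */

static void match(char c) { if (*in == c) in++; else ok = 0; }
static void E(void);

static void F(void) {
    if (*in == '(') { match('('); E(); match(')'); }
    else match('i');                   /* id */
}
static void Tp(void) {                 /* T' -> * F T' | eps          */
    if (*in == '*') { match('*'); F(); Tp(); }
}
static void T(void) { F(); Tp(); }
static void Ep(void) {                 /* E' -> + T E' | eps          */
    if (*in == '+') { match('+'); T(); Ep(); }
}
static void E(void) { T(); Ep(); }

int main(void) {
    in = "i+i";                        /* id+id */
    E();
    puts(ok && *in == '\0' ? "accepted" : "rejected");
    return 0;
}

Calling E() on "i+i" walks through the functions in exactly the order E → T → F → … shown above and accepts the string.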

LEFT RECURSIVE GRAMMAR:

A grammar is said to be left recursive if it has a production of the form

A → Aα | β

A left recursive grammar sends the parser into an infinite loop during parsing. The left recursion can be eliminated by replacing the left recursive production with the following:

A → βA'

A' → αA' | ε


1. NON RECURSIVE PARSING

Non-recursive parsing uses a data structure called the "parse table" to perform parsing.
1. Parsing is performed by using a table called the "PARSE TABLE".
2. During derivation of an input string, 2 components are used:
   1. Input buffer
   2. Stack
3. The purpose of the stack is to maintain intermediate results during parsing.
4. Initially the stack holds the symbol "$" (dollar symbol).
5. The input buffer holds the input string, which ends with "$".
6. During parsing, symbols of the input string are matched and eliminated; if both the stack and the input reach the symbol "$", then the given input string is derived from the given grammar.

To perform non-recursive parsing, a parse table is used. Parse table construction requires 2 components: the "FIRST" and "FOLLOW" sets of the non-terminal symbols present in the given grammar.

RULES TO CALCULATE FIRST:

Let X be a symbol of the given grammar.

1. If X → ε is a production in the grammar, then add ε to FIRST[X].

2. If X → a… is a production in the grammar (a a terminal), then add a to FIRST[X].

3. If X → Y1Y2Y3…Yn is a production in the grammar, then add FIRST[Y1] − {ε} to FIRST[X]; if ε ∈ FIRST[Y1], also add FIRST[Y2] − {ε}, and so on. If ε ∈ FIRST[Yi] for all i, then add ε to FIRST[X].

RULES TO CALCULATE FOLLOW:

1. If A is the start symbol of the grammar, then add "$" to FOLLOW[A]:
   FOLLOW[A] = {$}
2. If the grammar has a production of the form A → αBβ, then FOLLOW[B] contains FIRST[β] − {ε}:
   FOLLOW[B] ⊇ FIRST[β] − {ε}
3. If the grammar has a production of the form A → αB, or A → αBβ where β ⇒ ε, then
   FOLLOW[B] ⊇ FOLLOW[A]

Consider the following grammar:

E --> TE'
E' --> +TE'
E' --> ε
T --> FT'
T' --> *FT'
T' --> ε
F --> (E)
F --> id


Calculate FIRST and FOLLOW.

Sol:
FIRST[E] = FIRST[T] = FIRST[F] = { (, id }
FIRST[E'] = { +, ε }
FIRST[T'] = { *, ε }

FOLLOW[E] = { $, ) }
FOLLOW[E'] = { $, ) }
FOLLOW[T] = { +, $, ) }
FOLLOW[T'] = { +, $, ) }
FOLLOW[F] = { *, +, $, ) }

CONSTRUCTION PROCEDURE FOR THE PARSE TABLE:

To perform non-recursive parsing, a parse table is required to derive the input string. Parse table construction follows 2 rules:

1. For each production X → α, place the production in the parse table entry [X, a] for every terminal a in FIRST(α).

2. Place the ε-production X → ε in the entries [X, b] for the symbols b that are present in FOLLOW[X], i.e. when FIRST of the non-terminal contains ε.


Check the validation of the input string id+id.

Initially the stack holds $E (the start symbol of the grammar on top of $), and the input buffer holds the input string ending with "$". If both the stack and the input buffer hold only "$", the string is accepted.

Stack        Input      Output
$E           id+id$     E → TE'
$E'T         id+id$     T → FT'
$E'T'F       id+id$     F → id
$E'T'id      id+id$     (match id)
$E'T'        +id$       T' → ε
$E'          +id$       E' → +TE'
$E'T+        +id$       (match +)
$E'T         id$        T → FT'
$E'T'F       id$        F → id
$E'T'id      id$        (match id)
$E'T'        $          T' → ε
$E'          $          E' → ε
$            $          accept
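The stack-and-table mechanism traced above can be written as a short table-driven driver. The following C sketch is illustrative (not from the original material); the single characters e and t stand for E' and T', and i stands for id:

#include <stdio.h>
#include <string.h>

/* Entry of the LL(1) parse table for the grammar above.
   Each entry is the production's right side; "" means epsilon,
   NULL means error. */
static const char *table(char nt, char a) {
    switch (nt) {
    case 'E': return (a=='i'||a=='(') ? "Te" : NULL;
    case 'e': return a=='+' ? "+Te" : (a==')'||a=='$') ? "" : NULL;
    case 'T': return (a=='i'||a=='(') ? "Ft" : NULL;
    case 't': return a=='*' ? "*Ft"
                   : (a=='+'||a==')'||a=='$') ? "" : NULL;
    case 'F': return a=='i' ? "i" : a=='(' ? "(E)" : NULL;
    }
    return NULL;
}

static int parse(const char *input) {
    char stack[100] = "$E";        /* $ at bottom, start symbol on top */
    int top = 1;
    for (;;) {
        char X = stack[top], a = *input;
        if (X == '$' && a == '$') return 1;          /* accept         */
        if (X == a) { top--; input++; continue; }    /* match terminal */
        if (strchr("EeTtF", X)) {                    /* expand nonterm */
            const char *rhs = table(X, a);
            if (!rhs) return 0;
            top--;                                   /* pop X          */
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                stack[++top] = rhs[i];               /* push RHS rev.  */
        } else return 0;                             /* error          */
    }
}

int main(void) {
    printf("%s\n", parse("i+i$") ? "accept" : "reject");
    return 0;
}

The driver either matches the top-of-stack terminal against the lookahead or pops a non-terminal and pushes the chosen production's right side in reverse, mirroring the trace above.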


1. SHIFT REDUCE PARSING

Parsing is the process of checking the validation of input string for the given grammar. There are
2 types of parsing mechanisms.

1. Top down parsing


2. Bottom up parsing

Shift reduce parsing is one of the bottom-up parsing techniques. A shift reduce parser performs 4 actions:

1. Shift
2. Reduce
3. Accept
4. Error

The shift reduce parser uses an input buffer and a stack. The input string, ending with $, is initially held in the input buffer; the stack initially holds $.

In a shift operation, a symbol of the input string is pushed onto the stack.

If some portion of the stack equals the right side of a production, that substring is called a handle. A handle is a substring that matches the right side of one of the productions of the given grammar.

In a reduce operation, a stack portion (the HANDLE) is replaced with the non-terminal on the left side of the corresponding production.

Accept is the situation when the stack holds the start symbol of the grammar and the input buffer holds only "$".

Error is the situation when no further parsing move is possible.

To perform shift reduce parsing, the parser uses:

1. Stack: holds "$" initially.

2. Input buffer: holds the input string to be derived, ending with "$".


EXAMPLE:

The grammar is  E → E + E

                E → id

The input string to derive is id+id.

STACK        INPUT BUFFER    ACTION

$            id + id $       shift

$ id         + id $          reduce E → id

$ E          + id $          shift

$ E +        id $            shift

$ E + id     $               reduce E → id

$ E + E      $               reduce E → E + E

$ E          $               accept

The above input string is valid for the given grammar because parsing of that string reaches the accept state.
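A naive C sketch of this shift/reduce loop (illustrative, not from the original material): shift one symbol at a time and reduce greedily whenever the stack top matches a handle. Greedy reduction happens to be safe for this tiny grammar; real parsers consult an LR table to choose between shifting and reducing:

#include <stdio.h>
#include <string.h>

static char stk[100] = "$";   /* parse stack as a string, '$' at bottom */

/* Reduce while the stack top matches a handle: "i" -> E, "E+E" -> E. */
static void reduce(void) {
    for (;;) {
        size_t n = strlen(stk);
        if (n >= 1 && stk[n-1] == 'i') { stk[n-1] = 'E'; continue; }
        if (n >= 3 && strcmp(stk + n - 3, "E+E") == 0) {
            stk[n-3] = 'E'; stk[n-2] = '\0'; continue;
        }
        return;
    }
}

int main(void) {
    const char *input = "i+i";          /* id+id, with 'i' for id */
    for (; *input; input++) {
        size_t n = strlen(stk);
        stk[n] = *input; stk[n+1] = '\0';   /* shift ...            */
        reduce();                           /* ... then reduce      */
    }
    puts(strcmp(stk, "$E") == 0 ? "accept" : "error");
    return 0;
}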


1. LR PARSERS

LR parsers: LR parsing is the most popular bottom-up parsing technique; these parsers are also called generalized non-backtracking shift-reduce parsers. They can be used for a large class of grammars.

Steps to perform LR parsing:

1. From the grammar, construct the canonical item sets. Using the canonical item sets, construct an LR parse table.
2. Use an input buffer and a stack to perform the shift-reduce parsing actions:
   1. Shift
   2. Reduce
   3. Accept
   4. Error

Construction of the canonical item set for an SLR PARSER:
Step 1: Add a new production to the given grammar: the production A' → A, where A is the start symbol of the grammar.
Step 2: Place a "dot" symbol at the start of the right side of every production.
Step 3: Move the dot forward over one symbol for every expansion.
Step 4: During expansion, if a non-terminal appears immediately after the dot, add the productions of that non-terminal to that item set.
Step 5: Continue steps 3 and 4 until no further expansion is possible.

Construction of the canonical item set for the given grammar:

E → E+T
E → T
T → T*F
T → F
F → (E)
F → id

STEP 1: Convert the given CFG into an augmented grammar by adding the production E' → E.


STEP 2: The initial canonical item set (I0) for the above grammar is obtained by placing the "." (dot) symbol at the start of the right side of every production:

E' → .E
E → .E+T
E → .T
T → .T*F
T → .F
F → .(E)
F → .id

The full canonical collection of item sets (the original diagram, reproduced here as lists):

I0:              E' → .E,  E → .E+T,  E → .T,  T → .T*F,  T → .F,  F → .(E),  F → .id
I1 = goto(I0,E): E' → E.,  E → E.+T
I2 = goto(I0,T): E → T.,  T → T.*F
I3 = goto(I0,F): T → F.
I4 = goto(I0,(): F → (.E),  E → .E+T,  E → .T,  T → .T*F,  T → .F,  F → .(E),  F → .id
I5 = goto(I0,id): F → id.
I6 = goto(I1,+): E → E+.T,  T → .T*F,  T → .F,  F → .(E),  F → .id
I7 = goto(I2,*): T → T*.F,  F → .(E),  F → .id
I8 = goto(I4,E): F → (E.),  E → E.+T
I9 = goto(I6,T): E → E+T.,  T → T.*F
I10 = goto(I7,F): T → T*F.
I11 = goto(I8,)): F → (E).

STEPS TO CONSTRUCT SLR PARSE TABLE:



STEP 1: The table contains two components:

1) Action
2) Goto
Step 2: The action component represents the parsing action for the terminal symbols.
Step 3: The action component performs two actions:
1) Shift
2) Reduce
Step 4: A shift entry is made when the dot moves over a terminal from one item set to another. A reduce entry is made when there is no further movement of the dot, i.e. the dot is at the end of a production A → α.; the reduce entry is placed in the columns of the symbols in FOLLOW[A], where A is the non-terminal on the left side of that production.
Step 5: The goto component relates to the set of non-terminals present in the grammar.
        id     +      *      (      )      $     |   E    T    F
 0      s5                   s4                  |   1    2    3
 1             s6                          acc   |
 2             r2     s7            r2     r2    |
 3             r4     r4            r4     r4    |
 4      s5                   s4                  |   8    2    3
 5             r6     r6            r6     r6    |
 6      s5                   s4                  |        9    3
 7      s5                   s4                  |            10
 8             s6                   s11          |
 9             r1     s7            r1     r1    |
10             r3     r3            r3     r3    |
11             r5     r5            r5     r5    |

(si = shift and go to state i; rj = reduce by production j, where the productions are numbered 1. E→E+T, 2. E→T, 3. T→T*F, 4. T→F, 5. F→(E), 6. F→id.)

Derivation of input String:


1. The derivation of the input string uses two components: a stack and an input buffer.
2. The input buffer holds the input string, ending with the '$' symbol.
3. The stack initially contains the initial state "0".
4. During a shift operation, a symbol of the input string and the next state are pushed onto the stack.
5. During a reduce operation, the handle's symbols and states are popped and replaced by the non-terminal on the left side of the production (together with the corresponding goto state).
Example: the input string is id+id$.

STACK            INPUT BUFFER    ACTION

0                id+id$          s5

0 id 5           +id$            r6 (F → id)

0 F 3            +id$            r4 (T → F)

0 T 2            +id$            r2 (E → T)

0 E 1            +id$            s6

0 E 1 + 6        id$             s5

0 E 1 + 6 id 5   $               r6 (F → id)

0 E 1 + 6 F 3    $               r4 (T → F)

0 E 1 + 6 T 9    $               r1 (E → E+T)

0 E 1            $               ACCEPTED

Hence the given input string is accepted

The major drawback of the "SLR" PARSER is that its table may have


1. Shift-Reduce conflicts
2. Reduce-Reduce conflicts

That is, the SLR parse table may have multiple entries in a single cell, which causes confusion in selecting the operation. This problem can be solved by the canonical LR parser.

Conflict Example

Grammar:  S → L=R,  S → R,  L → *R,  L → id,  R → L

I0: S' → .S,  S → .L=R,  S → .R,  L → .*R,  L → .id,  R → .L
I1: S' → S.
I2: S → L.=R,  R → L.
I3: S → R.
I4: L → *.R,  R → .L,  L → .*R,  L → .id
I5: L → id.
I6: S → L=.R,  R → .L,  L → .*R,  L → .id
I7: L → *R.
I8: R → L.
I9: S → L=R.

Problem: since S ⇒ L=R ⇒ *R=R, FOLLOW(R) contains =, so FOLLOW(R) = {=, $}. In state I2 on input = the table gets both

Action[2,=] = shift 6              (from S → L.=R)
Action[2,=] = reduce by R → L      (since = ∈ FOLLOW(R))

— a shift/reduce conflict.


Conflict Example 2

Grammar:  S → AaAb,  S → BbBa,  A → ε,  B → ε

I0: S' → .S,  S → .AaAb,  S → .BbBa,  A → .,  B → .

Problem: FOLLOW(A) = {a, b} and FOLLOW(B) = {a, b}, so in state I0 both inputs a and b call for

reduce by A → ε    and    reduce by B → ε

— a reduce/reduce conflict on both symbols.

The above problems can be solved by using the canonical LR parser. It is similar to the SLR parser, but the construction of the canonical item sets includes lookahead terminal symbols, as follows.

Canonical LR parser:

The canonical LR parser's item set construction differs from the SLR parser's item set construction. closure(I) is (where I is a set of LR(1) items):
– every LR(1) item in I is in closure(I)
– if A → α.Bβ, a is in closure(I) and B → γ is a production rule of G, then B → .γ, b will be in closure(I) for each terminal b in FIRST(βa).

CANONICAL LR PARSER ITEM SETS


Grammar (as read off the item sets): S → CC, C → cC, C → d, augmented with S' → S.

I0:              S' → .S, $    S → .CC, $    C → .cC, c/d    C → .d, c/d
I1 = goto(I0,S): S' → S., $
I2 = goto(I0,C): S → C.C, $    C → .cC, $    C → .d, $
I3 = goto(I0,c): C → c.C, c/d  C → .cC, c/d  C → .d, c/d
I4 = goto(I0,d): C → d., c/d
I5 = goto(I2,C): S → CC., $
I6 = goto(I2,c): C → c.C, $    C → .cC, $    C → .d, $
I7 = goto(I2,d): C → d., $
I8 = goto(I3,C): C → cC., c/d
I9 = goto(I6,C): C → cC., $
CANONICAL LR PARSE TABLE

state     c      d      $    |   S    C
0         s3     s4          |   1    2
1                       acc  |
2         s6     s7          |        5
3         s3     s4          |        8
4         r3     r3          |
5                       r1   |
6         s6     s7          |        9
7                       r3   |
8         r2     r2          |
9                       r2   |

Derivation of the input string dccd$:

STACK                 INPUT BUFFER    ACTION

0                     dccd$           s4
0 d 4                 ccd$            r3 (C → d)
0 C 2                 ccd$            s6
0 C 2 c 6             cd$             s6
0 C 2 c 6 c 6         d$              s7
0 C 2 c 6 c 6 d 7     $               r3 (C → d)
0 C 2 c 6 c 6 C 9     $               r2 (C → cC)
0 C 2 c 6 C 9         $               r2 (C → cC)
0 C 2 C 5             $               r1 (S → CC)
0 S 1                 $               accept

LALR PARSE TABLE:


The canonical LR parser has many states. In the LALR parser the number of states is reduced by merging the item sets that have the same productions but different lookahead symbols (here I3/I6, I4/I7 and I8/I9). The resulting LALR parse table for the above grammar is given below.

state     c       d       $    |   S    C
0         s36     s47          |   1    2
1                        acc   |
2         s36     s47          |        5
36        s36     s47          |        89
47        r3      r3      r3   |
5                         r1   |
89        r2      r2      r2   |

STACK                     INPUT BUFFER    ACTION

0                         dccd$           s47
0 d 47                    ccd$            r3 (C → d)
0 C 2                     ccd$            s36
0 C 2 c 36                cd$             s36
0 C 2 c 36 c 36           d$              s47
0 C 2 c 36 c 36 d 47      $               r3 (C → d)
0 C 2 c 36 c 36 C 89      $               r2 (C → cC)
0 C 2 c 36 C 89           $               r2 (C → cC)
0 C 2 C 5                 $               r1 (S → CC)
0 S 1                     $               accept


Syntax Directed Translation

 Syntax directed translation (SDT) is a framework for intermediate code generation.
 Syntax directed translation provides semantic actions for each production of the grammar.
 The semantic action is placed on the right side of the production, in braces.
 Superscripts are used in the semantic actions when the right side of a production has multiple instances of the same non-terminal symbol.
 The semantic action corresponding to a production A → xyz is performed:
   - in top-down parsing, when A is expanded to xyz;
   - in bottom-up parsing, when xyz is reduced to A.
A syntax directed translation scheme computes the values of the non-terminal symbols by performing semantic actions.
There are two types of syntax directed translation schemes:
 Synthesized translation.
 Inherited translation.
Synthesized translation:-
The value of the non-terminal on the left side of the production is calculated as a function of the values of the non-terminals on the right side.
A → B+C
A.value = B.value + C.value
Inherited translation:-
The value of a non-terminal on the right side of the production is calculated as a function of the value of the left-side non-terminal.
Ex:-
A → xYz
Y.value = 2 * A.value
Implementation of syntax directed translation:-
E → E+E
E → digit
Implement the syntax directed translation scheme for the given expression 23*5+4$.
Step 1:-
The lexical analyzer reads the source program (expression) from left to right.
Step 2:-
If the scanned symbols equal the right side of a production, they are replaced by the left-side symbol of that production (a reduction).
Step 3:-
The value of that non-terminal symbol is calculated by performing the corresponding semantic action of the production.
Step 4:-
Repeat the above procedure until the expression ends.
Step 5:-
The root node of the syntax tree contains the value of the expression.


Ex:- Design a syntax directed translation scheme for a desktop calculator.

The input grammar is:
E → E+E
E → E*E
E → I
I → I digit
I → digit
The expression to be evaluated is 23*5+4$.

(1) S → E$            { print E.VAL }

(2) E → E(1) + E(2)   { E.VAL := E(1).VAL + E(2).VAL }

(3) E → E(1) * E(2)   { E.VAL := E(1).VAL * E(2).VAL }

(4) E → ( E(1) )      { E.VAL := E(1).VAL }

(5) E → I             { E.VAL := I.VAL }

(6) I → I(1) digit    { I.VAL := 10 * I(1).VAL + LEXVAL }

(7) I → digit         { I.VAL := LEXVAL }

Syntax directed translation scheme for the desk calculator:

[Fig: parse tree of 23*5+4$ — S spans E and $; the subtree E → E*E covers 23 and 5, the subtree E → E+E covers their product and 4, and each integer I is built from its digits via I → I digit.]

Sequence of moves:

      INPUT       STATE (stack)    VAL             PRODUCTION USED

1     23*5+4$     -                -               -


2     3*5+4$      2                2               -
3     3*5+4$      I                2               I → digit
4     *5+4$       I 3              2, 3            -
5     *5+4$       I                23              I → I digit
6     *5+4$       E                23              E → I
7     5+4$        E *              23, -           -
8     +4$         E * 5            23, -, 5        -
9     +4$         E * I            23, -, 5        I → digit
10    +4$         E * E            23, -, 5        E → I
11    +4$         E                115             E → E*E
12    4$          E +              115, -          -
13    $           E + 4            115, -, 4       -
14    $           E + I            115, -, 4       I → digit
15    $           E + E            115, -, 4       E → I
16    $           E                119             E → E+E
17    -           E $              119, -          -
18    -           S                -               S → E$
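The same synthesized-attribute computation can be sketched as a small C evaluator (illustrative, not from the original material). Each function returns the VAL attribute of the non-terminal it recognizes, and each production's semantic action is the arithmetic performed at the corresponding point; the grammar is split into expr/term so the recursion is well-founded:

#include <ctype.h>
#include <stdio.h>

/* Desk calculator sketch: E -> E+E | E*E | I, I -> I digit | digit. */
static const char *p = "23*5+4$";

static int I(void) {                 /* I.VAL := 10*I.VAL + LEXVAL  */
    int val = 0;
    while (isdigit((unsigned char)*p))
        val = 10 * val + (*p++ - '0');
    return val;
}
static int term(void) {              /* E.VAL := E.VAL * E.VAL      */
    int val = I();
    while (*p == '*') { p++; val = val * I(); }
    return val;
}
static int expr(void) {              /* E.VAL := E.VAL + E.VAL      */
    int val = term();
    while (*p == '+') { p++; val = val + term(); }
    return val;
}

int main(void) {
    int v = expr();                  /* S -> E$ : print E.VAL       */
    if (*p == '$') printf("%d\n", v);    /* prints 119              */
    return 0;
}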


INTERMEDIATE CODE REPRESENTATION:

Intermediate code generation is the phase that follows semantic analysis. Generating target code from intermediate code is easier than generating target code directly. The generated intermediate code can be represented in several ways:

1. 3-address code representation

2. Abstract syntax tree & directed acyclic graph

3. Polish & reverse Polish notation

Let the source program contain the following expression:

x = (a+b) * (c*d) * (a+b)

[Fig: abstract syntax tree of the assignment — the root = has children x and *, whose subtrees are (a+b), (c*d) and (a+b); the subtree for (a+b) appears twice.]

1. The major drawback of this technique is that common sub-expressions are represented several times, which causes redundancy and consumes much memory. The redundancy problem can be solved by using the directed acyclic graph (DAG) representation.
2. In DAG representation, if the source program has a common sub-expression, the single node that represents it is reused in the expression evaluation.


DAG REPRESENTATION:
Let the expression in the source program be

x = (a*a) + (b*b)

[Fig: DAG — the + node (labelled x) has two * children, one over (a, a) and one over (b, b); each leaf a and b is stored only once.]

3-address code representation

Three-address code can be represented in 3 ways:

1. Quadruple
2. Triple
3. Indirect triples

X = (a+b) * (c+d)

The 3-address code for the given source statement is:

T1 = A + B
T2 = C + D
T3 = T1 * T2
X = T3

The above 3-address code is represented in the following ways.


Quadruple:
In quadruple representation there are 4 fields; the last field holds the result of the operation. In the representation of unary operations there is no 2nd argument.

address   operation   Arg1   Arg2   result

[0]       +           A      B      T1
[1]       +           C      D      T2
[2]       *           T1     T2     T3
[3]       =           T3            X

There is no Arg2 in an assignment statement, so the last entry's Arg2 field is empty.
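A quadruple maps directly onto a record with four fields. The following C sketch (illustrative, not from the original material) shows one possible layout, with the table above entered as data:

#include <stdio.h>

/* One three-address instruction in quadruple form (sketch). */
struct quad {
    const char *op;     /* operator                    */
    const char *arg1;   /* first operand               */
    const char *arg2;   /* second operand ("" if none) */
    const char *result; /* name holding the result     */
};

int main(void) {
    struct quad code[] = {            /* the table above as data */
        { "+", "A",  "B",  "T1" },
        { "+", "C",  "D",  "T2" },
        { "*", "T1", "T2", "T3" },
        { "=", "T3", "",   "X"  },
    };
    for (int i = 0; i < 4; i++)
        printf("[%d] %-2s %-3s %-3s %s\n", i,
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}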

Triple:
In triple representation there are 3 fields. The result of an operation is referred to by the address of the instruction that computes it, so no temporary variables are needed to store intermediate results.
The main advantage of triple representation is that it takes less memory.

address   operation   Arg1   Arg2

[0]       +           A      B
[1]       +           C      D
[2]       *           [0]    [1]
[3]       =           X      [2]

INDIRECT TRIPLES:

Indirect triple representation is similar to the indirect addressing mode of instructions: each instruction is stored at some address, and the instructions are referred to through those addresses. The representation is very convenient from the user's point of view, but during execution an extra memory fetch is required to access each instruction to be executed. This mechanism maintains 2 tables:
1. Instruction table
2. Address table
Only the address table is kept in main memory during compilation, which occupies less main memory.

Address table           Instruction table

address  statement      address  operation  Arg1  Arg2
10       [0]            [0]      +          A     B
11       [1]            [1]      +          C     D
12       [2]            [2]      *          [0]   [1]
13       [3]            [3]      =          X     [2]


3-address code generation for flow-control statements

IF-ELSE
Ex:  if a<b then a=a+5 else a=a+7

100  if a<b goto L1
101  goto L2
102  L1: a=a+5
103  goto LLast
104  L2: a=a+7
105  LLast: end

WHILE
Ex:  while (i<10) { x=0; i=i+1; }

100  L0: if i<10 goto L1
101  goto LLast
102  L1: x=0
103  i=i+1
104  goto L0
105  LLast: end

SWITCH
Ex:  switch(ch) { case 1: i=i+1; case 2: i=i+2; }

100  if (ch==1) goto L1
101  if (ch==2) goto L2
102  L1: t1=i+1
103  i=t1
104  goto LLast
105  L2: t2=i+2
106  i=t2
107  goto LLast
108  LLast: end

FOR
Ex:  for (i=1; i<10; i++) { y=x+5; }

100  i=1
101  L1: t1=x+5
102  y=t1
103  i=i+1
104  if (i<10) goto L1
105  end


Code optimization

Code optimization is the process of identifying patterns in the program that can be replaced by other patterns that are shorter in size or faster to execute.

 Code optimization is done on both the intermediate code and the target code.
 There are three types of code optimization techniques:

1) Loop optimization.
2) Straight-line optimization.
3) Peephole optimization.

Loop Optimization:-
The instructions outside a loop are executed only once, but the instructions inside a loop are executed many times; loop optimization therefore plays an important role in the code optimization process.

 The loop optimization has the following steps:

(1) Divide the program into basic blocks by determining the leaders in the source program.

(2) Construct a flow graph which represents the communication between the blocks (i.e. the order of execution of the blocks).

(3) Perform code motion.

(4) Perform constant folding.

(5) Reduction of induction variables.

(6) Reduction of strength of instructions.

(7) Replacing the common sub expressions.

(8) Loop unrolling.


Let us consider the 3-address code which calculates the dot product of two array vectors, given below:

1   PROD = 0
2   I = 1
3   T1 = 4*I
4   T2 = Add(A) - 4
5   T3 = T2[T1]
6   T4 = Add(B) - 4
7   T5 = T4[T1]
8   T6 = T3*T5
9   PROD = PROD + T6
10  I = I + 1
11  if (I <= 20) goto (3)

Step 1) Identifying the leaders:

First we have to identify the leaders in the source program. The rules to select the leaders are:
1) The first instruction of the program is a leader.
2) The target of a branch/jump instruction is a leader.
3) The instruction immediately following a branch/jump instruction is a leader.

In our example:
1  PROD = 0
3  T1 = 4*I
are the two leaders.

Step 2) Dividing the blocks:

The set of instructions from one leader up to (but not including) the next leader is called a basic block.

B1:
1  PROD = 0
2  I = 1


B2:
3   T1 = 4*I
4   T2 = Add(A) - 4
5   T3 = T2[T1]
6   T4 = Add(B) - 4
7   T5 = T4[T1]
8   T6 = T3*T5
9   PROD = PROD + T6
10  I = I + 1
11  if (I <= 20) goto (3)

Construction of the flow graph between the blocks for our example:

B1:
   PROD = 0
   I = 1
      │
      ▼
B2: ◄──────────────┐
   T1 = 4*I        │
   T2 = Add(A) - 4 │
   T3 = T2[T1]     │
   T4 = Add(B) - 4 │
   T5 = T4[T1]     │
   T6 = T3*T5      │
   PROD = PROD + T6│
   I = I + 1       │
   if (I <= 20) ───┘


Step 3) Code motion:

Perform code motion, i.e. identify the instructions which are independent of the loop and place them outside the loop in a separate block.

B1:
   PROD = 0
   I = 1

B3:
   T2 = Add(A) - 4
   T4 = Add(B) - 4

B2:
   T1 = 4*I
   T3 = T2[T1]
   T5 = T4[T1]
   T6 = T3*T5
   PROD = PROD + T6
   I = I + 1
   if (I <= 20) goto B2


Step 4) Constant folding:

If the program contains a variable whose value does not change at a particular instruction (i.e. it is a constant), the variable can be replaced by the constant.

Example:  i=1;   // the value of i is constant
          if (j < i)
              printf("compiler");

The above code can be replaced by

          if (j < 1)
              printf("compiler");

This reduces the number of symbols used in the symbol table.

Step 5) Reduction of induction variables:

If one variable is dependent on another variable, then we can use only one variable instead of two variables.

In the above example, I and T1 are induction variables (T1 = 4*I), so equivalent code can be constructed by using T1 only. The resulting code will be

T1 = T1 + 4
 .
 .
if (T1 <= 80) goto (3)


B1:
   PROD = 0
   T1 = 0

B3:
   T2 = Add(A) - 4
   T4 = Add(B) - 4

B2:
   T1 = T1 + 4
   T3 = T2[T1]
   T5 = T4[T1]
   T6 = T3*T5
   PROD = PROD + T6
   if (T1 <= 80) goto B2

Step 6) Reduce the strength of instructions:

If an instruction is costly, replace it with a less costly one.

Example:   i=2;           i=2;
           j=i*i;    →    j=i+i;

Step 7) Replacing the common sub-expressions:

For example, if we have

i = x+1;
a[i] = 'H';        we can write it as        a[x+1] = 'H';

Common sub-expressions in a basic block can be identified by constructing a "DAG".

Step 8) Loop unrolling:

If loop execution causes delay, we can remove unnecessary looping from the program:

j=1;
for (i=1; i<=j; i++)
    printf("x");

We can write it directly as    printf("x");


Peephole optimization

Peephole optimization is a technique used in many compilers to optimize the intermediate code or target code. Repeated passes are made over the code, applying the optimization techniques, to achieve the maximum benefit.

Peephole optimization can be performed on any of the instructions, without following any particular order during optimization of the code.

In peephole optimization, we use the following techniques:

1. Elimination of redundant loads and stores

2. Elimination of unreachable code in the program

3. Elimination of unnecessary multiple jumps

4. Perform algebraic simplifications

5. Use of machine idioms

6. Reduction in strength

1. Elimination of redundant loads and stores

Consider the 3-address code    A=B+C
                               D=A+E

mov B,R0
add C,R0
mov R0,A
mov A,R0
add E,R0
mov R0,D

We observe that the above code contains the instruction pair:


1. mov R0,A

2. mov A,R0

The 2nd instruction is a redundant load: R0 already holds the value of A after instruction (1), so executing it changes nothing.

So whenever such instruction pairs are present in our program, we can delete instruction (2), provided it does not have any label.
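This rule can be sketched as a two-instruction peephole window in C (illustrative, not from the original material; a real pass must also verify that the second instruction carries no label, as noted above):

#include <stdio.h>
#include <string.h>

/* One target instruction, already split into fields (sketch). */
struct ins { char op[8], src[8], dst[8]; };

/* Drop "mov X,R" that directly follows "mov R,X"; returns new count. */
static int peephole(struct ins *code, int n) {
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (out > 0 &&
            strcmp(code[i].op, "mov") == 0 &&
            strcmp(code[out-1].op, "mov") == 0 &&
            strcmp(code[i].src, code[out-1].dst) == 0 &&
            strcmp(code[i].dst, code[out-1].src) == 0)
            continue;                     /* redundant load: skip it */
        code[out++] = code[i];            /* keep this instruction   */
    }
    return out;
}

int main(void) {
    struct ins code[] = {
        {"mov","B","R0"}, {"add","C","R0"}, {"mov","R0","A"},
        {"mov","A","R0"}, {"add","E","R0"}, {"mov","R0","D"},
    };
    int n = peephole(code, 6);            /* removes mov A,R0        */
    for (int i = 0; i < n; i++)
        printf("%s %s,%s\n", code[i].op, code[i].src, code[i].dst);
    return 0;
}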

2. Elimination of unreachable code in the program

Debug = 1;
if Debug != 1 goto L2
L1: printf("compiler design");
goto L3
L2: printf("file structure");
L3: ...

We observe that Debug is a constant initialised to 1, so the condition Debug != 1 will never be true, and the statement at label L2 is never executed. We can eliminate such unreachable code. After elimination of the dead code, the resulting code is:

Debug = 1

L1: printf("compiler design");

3. Elimination of unnecessary multiple jumps

Consider the following 3-address statements:

goto L1

L1: goto L2

L2: printf("hai");

We can replace this case with the following code:

goto L2

L1: goto L2

L2: printf("hai");

Suppose there are no other jumps to L1; then we can delete the statement labelled L1.


It will become:

goto L2

L2: printf("hai");

Example:

01 jmp 03

03 jmp 05

05 jmp 07

07 add R1,R2

If a program contains multiple unnecessary jumps of this type, they can be eliminated, so the above program can be simplified to:

01 jmp 07

07 add R1,R2

4. Perform algebraic simplifications

Sometimes the code contains statements like

x = x*1   or   x = x+0

y = y*1   or   y = y+0

The execution of these instructions doesn't change any values.

Such statements are often produced by straightforward intermediate code generation algorithms, and they can be eliminated directly.
5. Use of machine idioms

In order to implement some operations efficiently, we can use special hardware instructions.

For example, to implement a stack push, the stack top has to be incremented. If the machine has an auto-increment addressing mode, then the push can be performed without explicitly manipulating the stack top, because the machine automatically increments the top as part of the instruction execution (push operation).

Similarly, to implement a stack pop, the stack top has to be decremented. If the machine has an auto-decrement addressing mode, then the pop can be performed without explicitly manipulating the stack


top, because the machine automatically decrements the top as part of the instruction execution (pop operation).

6. Reduction in strength

Replacing a high-cost operator with a low-cost operator is called reduction in strength.

Example 1:

Instead of implementing x^2 (square(x)), it is cheaper to implement x*x.

Example 2:

Instead of implementing 2*x, it is cheaper to implement x + x.

Example 3:

If s1 and s2 are two strings, their total length can be calculated as len(s1+s2).

It is cheaper to calculate the lengths of the strings s1 and s2 separately and add them: len(s1) + len(s2). Thus we reduce the strength.


SIMPLE CODE GENERATION ALGORITHM:

The code generator generates target code for a sequence of three-address statements. It considers each statement in turn, remembering whether any of the operands of the statement are currently in registers and taking advantage of that fact if possible. Code generation uses descriptors to keep track of register contents and addresses for names.

1. A register descriptor keeps track of what is currently in each register. It is consulted


whenever a new register is needed. We assume that initially the register descriptor shows
that all registers are empty. (If registers are assigned across blocks, this would not be the
case). As the code generation for the block progresses, each register will hold the value of
zero or more names at any given time.

2. An address descriptor keeps track of the location (or locations) where the current value of
the name can be found at run time. The location might be a register, a stack location, a
memory address, or some set of these, since when copied, a value also stays where it was.
This information can be stored in the symbol table and is used to determine the accessing
method for a name.

CODE GENERATION ALGORITHM

for each statement X = Y op Z do

• Invoke a function getreg to determine the location L where X must be stored. Usually L is a register.

• Consult the address descriptor of Y to determine Y'. Prefer a register for Y'. If the value of Y is not already in L, generate

MOV Y', L

• Generate

OP Z', L

again preferring a register for Z'. Update the address descriptor of X to indicate that X is in L. If L is a register, update its descriptor to indicate that it contains X, and remove X from all other register descriptors.

• If the current values of Y and/or Z have no next use, are dead on exit from the block, and are in registers, change the register descriptors to indicate that they no longer contain Y and/or Z.


The code generation algorithm takes as input a sequence of three-address statements constituting
a basic block. For each three-address statement of the form x := y op z we perform the following
actions:

1. Invoke a function getreg to determine the location L where the result of the computation
y op z should be stored. L will usually be a register, but it could also be a memory
location. We shall describe getreg shortly.

2. Consult the address descriptor for y to determine y', (one of) the current location(s) of y. Prefer the register for y' if the value of y is currently both in memory and a register. If the value of y is not already in L, generate the instruction MOV y', L to place a copy of y in L.

3. Generate the instruction OP z’, L where z’ is a current location of z. Again, prefer a


register to a memory location if z is in both. Update the address descriptor to indicate that
x is in location L. If L is a register, update its descriptor to indicate that it contains the
value of x, and remove x from all other register descriptors.

4. If the current values of y and/or y have no next uses, are not live on exit from the block,
and are in registers, alter the register descriptor to indicate that, after execution of x := y
op z, those registers no longer will contain y and/or z, respectively.

The function getreg returns the location L to hold the value of x for the assignment x := y op z.

1. If the name y is in a register that holds the value of no other names (recall that copy instructions such as x := y can cause a register to hold the value of two or more variables simultaneously), and y is not live and has no next use after execution of x := y op z, then return the register of y for L. Update the address descriptor of y to indicate that y is no longer in L.

2. Failing (1), return an empty register for L if there is one.

3. Failing (2), if x has a next use in the block, or op is an operator, such as indexing, that requires a register, find an occupied register R. Store the value of R into its proper memory location M (by MOV R, M) if it is not already there, update the address descriptor for M, and return R. If R holds the values of several variables, a MOV instruction must be generated for each variable that needs to be stored. A suitable occupied register might be one whose datum is referenced furthest in the future, or one whose value is also in memory.


4. If x is not used in the block, or no suitable occupied register can be found, select the
memory location of x as L.
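A compact C sketch of rules (1), (2) and (4) follows (rule (3)'s spill logic is omitted to keep it short; the struct fields are invented, and each register is assumed to hold at most one name):

#include <stdio.h>
#include <string.h>

#define NREGS 2

/* Per-name bookkeeping for this sketch. */
typedef struct {
    char name[8];
    int  reg;           /* register currently holding the name, or -1 */
    int  live_on_exit;  /* live on exit from the block? */
    int  has_next_use;  /* used again inside the block? */
} Name;

static char reg_holds[NREGS][8];   /* register descriptor: "" = empty */

static int getreg(const Name *y)
{
    /* Rule 1: reuse y's register if y dies here */
    if (y->reg >= 0 && !y->live_on_exit && !y->has_next_use)
        return y->reg;

    /* Rule 2: otherwise return an empty register if there is one */
    for (int r = 0; r < NREGS; r++)
        if (reg_holds[r][0] == '\0')
            return r;

    /* Rule 3 (spill) omitted; rule 4: fall back to x's memory location */
    return -1;   /* -1 stands for "use x's memory location as L" */
}

int main(void)
{
    strcpy(reg_holds[0], "t1");        /* R0 holds t1 */
    Name t1 = {"t1", 0, 0, 0};         /* t1 is dead after this use */
    printf("L = R%d\n", getreg(&t1));  /* rule 1: reuses R0 */
    return 0;
}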

For example, the assignment d := (a - b) + (a - c) + (a - c) might be translated into the following three-address code, with the target code and descriptor contents generated for each statement:

Stmt            Code          Register descriptor     Address descriptor

t1 = a - b      MOV a, R0     R0 contains t1          t1 in R0
                SUB b, R0

t2 = a - c      MOV a, R1     R0 contains t1          t1 in R0
                SUB c, R1     R1 contains t2          t2 in R1

t3 = t1 + t2    ADD R1, R0    R0 contains t3          t3 in R0
                              R1 contains t2          t2 in R1

d = t3 + t2     ADD R1, R0    R0 contains d           d in R0
                MOV R0, d                             d in R0 and memory

SYMBOL TABLE MANAGEMENT

 Symbol table is a data structure which maintains information about symbols.
 The symbol table is accessed by several phases of the compiler to retrieve symbol information.
 The symbol table contains two fields:
a) Name of the symbol
b) Information of the symbol
 Operations performed on the symbol table are:
a) Search, to identify whether a symbol is present or not
b) Retrieval of the information for a referenced symbol
c) Update of a symbol's information
d) Addition of an entry to the symbol table, when a symbol is referenced for the first time
e) Deletion of an entry from the symbol table, when there is no further reference to that symbol
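These five operations can be summarized as an interface. The C sketch below is illustrative (the type and function names are invented, not prescribed by the text); concrete implementations of search and insertion appear under the hash-table mechanism later:

/* a) is the symbol present? */
typedef struct SymbolInfo SymbolInfo;   /* whatever "information" holds */

int         st_search(const char *name);                   /* a) present?  */
SymbolInfo *st_retrieve(const char *name);                 /* b) retrieve  */
void        st_update(const char *name, SymbolInfo *info); /* c) update    */
void        st_insert(const char *name, SymbolInfo *info); /* d) add       */
void        st_delete(const char *name);                   /* e) delete    */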

SYMBOL TABLE IMPLEMENTATION

There are 3 implementation mechanisms

I) Maintain array of records

1. Maintain a set of records in an array.
2. The array contains one record for every symbol.
3. Each record is a set of memory words (addresses).
4. Each record contains two fields, they are:
a. Name b. Information

R1: | Name1 | Info1 |
R2: | Name2 | Info2 |
R3: | Name3 | Info3 |
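A minimal C sketch of this mechanism (field widths invented): every record reserves the full maximum name size, which is exactly the memory waste described in the disadvantages below.

#include <stdio.h>

typedef struct {
    char name[32];    /* maximum name size reserved for every symbol */
    char info[32];
} SymbolRecord;

int main(void)
{
    SymbolRecord table[3] = {
        {"Name1", "Info1"},
        {"Name2", "Info2"},
        {"Name3", "Info3"},
    };
    for (int i = 0; i < 3; i++)
        printf("R%d: %s -> %s\n", i + 1, table[i].name, table[i].info);
    return 0;
}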


Disadvantages:

i) The major drawback of the above implementation mechanism is its "searching complexity": to access the information of any symbol, the user has to perform a linear search on the symbol table array.

ii) In practice programmers rarely use the maximum permitted name size, but the symbol table reserves the maximum memory for every symbol name. This causes wastage of the memory reserved for the names of symbols.

This problem can be solved by the alternative implementation mechanism below.

II) MAINTAIN THE ARRAY OF NAMES SEPARATELY

1. In this mechanism all the names of the symbols are stored as a string of characters in a linear array.
2. The symbol table maintains one record for every symbol.
3. The name field of each symbol table record contains a pointer to the starting character of that symbol's name in the symbol name array.
4. The name field also contains the length of the symbol name, so that the exact name can be recovered: the user reads from the pointer location as many characters as the length field specifies.

R1: | ptr -> 'R' (position 1) | Length 4 | Info1 |
R2: | ptr -> 'R' (position 5) | Length 4 | Info2 |

Name array:  R A M A R A J A

it department srkr engg college 13/143/14 bhimavarm


compiler design study material

In the above example the Name1 field has a pointer to the first “R”. To get the symbol name, the user reads 4 characters from that position, because the length specified there is 4. So the name of the symbol is “RAMA”.
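A minimal C sketch of this mechanism (sizes invented): each record stores a start index and a length into one shared character array, so no per-record space is wasted on long name fields.

#include <stdio.h>

char name_pool[] = "RAMARAJA";   /* all symbol names, back to back */

typedef struct {
    int start;     /* index of the first character in name_pool */
    int length;    /* how many characters belong to this name */
    char info[16];
} SymbolRecord;

int main(void)
{
    SymbolRecord r1 = {0, 4, "info1"};   /* -> "RAMA" */
    SymbolRecord r2 = {4, 4, "info2"};   /* -> "RAJA" */
    printf("%.*s, %.*s\n",
           r1.length, name_pool + r1.start,
           r2.length, name_pool + r2.start);
    return 0;
}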
III) TWO TABLE MECHANISM:
1. In this mechanism the symbol names and the symbol information are stored in separate tables.
2. The name table contains all the symbols' names.
3. The information table contains all the symbols' information.
4. The association between the name table and the information table is achieved by maintaining an indexing mechanism while inserting the entries in the symbol table.

Name table        Info table
RAMA        ->    Info1
RAJA        ->    Info2

DATA STRUCTURES USED TO IMPLEMENT SYMBOL TABLE

i) linear list
ii) self organizing list
iii) search trees
iv) hash tables


i) linear list:
A list is maintained for the symbol table. The major drawback of this data structure is that a linear search is required to access the information about the symbols, and linear search has a high time complexity.

ii) self organizing list:

In this data structure each entry maintains a link to another entry in the list, and the links are rearranged as symbols are referenced: a referenced symbol is relinked so that it is reached earlier in a later search. The most frequently referenced entries therefore migrate toward the front of the list.

The advantage of this data structure is that search time is minimum for the most frequently accessed symbols.
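One common realization of a self-organizing list is the move-to-front rule sketched below (illustrative C): every successful lookup relinks the found entry to the head, so frequently referenced symbols are found first.

#include <stdio.h>
#include <string.h>

typedef struct Entry {
    const char *name;
    struct Entry *next;
} Entry;

Entry *lookup(Entry **head, const char *name)
{
    Entry **pp = head;
    for (Entry *e = *head; e; pp = &e->next, e = e->next)
        if (strcmp(e->name, name) == 0) {
            *pp = e->next;        /* unlink the found entry ... */
            e->next = *head;      /* ... and move it to the front */
            *head = e;
            return e;
        }
    return NULL;
}

int main(void)
{
    Entry c = {"c", NULL}, b = {"b", &c}, a = {"a", &b};
    Entry *head = &a;
    lookup(&head, "c");           /* "c" is now first */
    for (Entry *e = head; e; e = e->next)
        printf("%s ", e->name);   /* prints: c a b */
    printf("\n");
    return 0;
}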

iii) search trees:

The symbol table is organized as a binary search tree. To access the information about a symbol, a binary search is performed on the symbol table. Binary search is faster than linear search.
iv) Hash tables:

This mechanism maintains two tables:

a) hash table: maintains the names of the symbols

b) storage table: maintains the information of the symbols as records

The storage table maintains a set of linked lists. To access the information of a symbol, the name of the symbol is given to the hash function; it generates an address, which leads to the record containing the information of the symbol.
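A minimal C sketch of the two-table idea (the hash function, sizes, and field names are invented): the bucket array plays the role of the hash table, and the chained records play the role of the storage table.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211          /* a prime keeps the hash well spread */

typedef struct Record {
    char name[32];
    char info[32];            /* type, scope, etc., simplified to text */
    struct Record *next;      /* chain for names that collide */
} Record;

static Record *bucket[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

Record *lookup(const char *name)
{
    for (Record *r = bucket[hash(name)]; r; r = r->next)
        if (strcmp(r->name, name) == 0)
            return r;
    return NULL;
}

Record *insert(const char *name, const char *info)
{
    unsigned h = hash(name);
    Record *r = calloc(1, sizeof *r);
    if (!r) return NULL;
    snprintf(r->name, sizeof r->name, "%s", name);
    snprintf(r->info, sizeof r->info, "%s", info);
    r->next = bucket[h];      /* prepend to the collision chain */
    bucket[h] = r;
    return r;
}

int main(void)
{
    insert("RAMA", "int, local");
    insert("RAJA", "float, global");
    Record *r = lookup("RAMA");
    if (r) printf("%s : %s\n", r->name, r->info);
    return 0;
}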


DAG CONSTRUCTION FOR BASIC BLOCKS

DAG (directed acyclic graph) is a useful data structure for representing basic blocks. It has no cycles.

It gives a picture of how values are computed by the statements of the block.

It determines the common subexpressions within the block.

It determines the names of variables that are used in the block but whose values are computed outside the block.

A DAG is a directed acyclic graph with the following nodes:

1. Leaves are labeled with variables (or constants).
2. Each interior node represents an operator.
3. Interior nodes are also labeled with unique identifiers.
4. If the block computes the same subexpression more than once, the corresponding subgraph is labeled with all the identifiers assigned that value.

The DAG representation gives the following:

1. An optimized representation of the basic block.
2. The common subexpressions used in the basic block.
3. The symbols that are used in the basic block but defined in previous blocks.
4. The symbols whose values are computed in the basic block.

Construct a DAG for the given basic block:

S1 = 4 * I
S2 = address(A) - 4
S3 = S2 * S1
S4 = 4 * I
S5 = address(B) - 4
S6 = S5 * S4
S7 = S3 * S6
S8 = PROD + S7
PROD = S8
S9 = I + 1
IF (I <= 20) GOTO (1)


The resulting DAG, written as a table of nodes (each interior node is shown with the identifiers attached to it; note that S1 and S4, both 4 * I, share a single node, a common subexpression):

Node labels    Operator    Children
S8, PROD       +           S7 node, prod
S7             *           S3 node, S6 node
S3             *           S2 node, (S1,S4) node
S6             *           S5 node, (S1,S4) node
S2             -           Addr(A), 4
S5             -           Addr(B), 4
S1, S4         *           4, I0
S9, I          +           I0, 1
(1)            <=          (S9,I) node, 20

Leaves: prod, Addr(A), Addr(B), 4, I0, 1, 20
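A minimal C sketch of how such a DAG is built (the representation is invented): before creating a node for (op, left, right), the constructor searches for an existing node with the same triple, which is what makes S1 and S4 share the 4 * I node above.

#include <stdio.h>
#include <string.h>

#define MAXNODES 64

/* 'L' marks a leaf; interior nodes carry an operator and child
   indices; labels collects identifiers like "S1,S4". */
typedef struct {
    char op;
    int  left, right;      /* child node indices, -1 for leaves */
    char labels[32];
} Node;

static Node nodes[MAXNODES];
static int  nnodes;

static int mknode(char op, int l, int r, const char *label)
{
    for (int i = 0; i < nnodes; i++) {
        if (nodes[i].op != op || nodes[i].left != l || nodes[i].right != r)
            continue;
        if (op == 'L') {                 /* leaves match by label */
            if (strcmp(nodes[i].labels, label) == 0)
                return i;
            continue;
        }
        /* interior node reused: attach the extra identifier */
        size_t used = strlen(nodes[i].labels);
        snprintf(nodes[i].labels + used, sizeof nodes[i].labels - used,
                 ",%s", label);
        return i;
    }
    nodes[nnodes].op = op;
    nodes[nnodes].left = l;
    nodes[nnodes].right = r;
    snprintf(nodes[nnodes].labels, sizeof nodes[nnodes].labels, "%s", label);
    return nnodes++;
}

int main(void)
{
    int four = mknode('L', -1, -1, "4");
    int i0   = mknode('L', -1, -1, "I0");
    int s1   = mknode('*', four, i0, "S1");   /* new node */
    int s4   = mknode('*', four, i0, "S4");   /* reused: same (op,l,r) */
    printf("S1 -> node %d, S4 -> node %d, labels: %s\n",
           s1, s4, nodes[s1].labels);         /* labels: S1,S4 */
    return 0;
}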


Runtime Storage Management


• Runtime storage management deals with the runtime memory of the program.
• The runtime storage management reserves a part of the main memory for program
execution.
• Runtime storage management is of two types. They are:
1. Simple stack implementation
2. Block structured implementation
• Simple stack implementation:-
 In this scheme a set of memory rows (locations) is reserved to execute a program.
 Each procedure in a program maintains a record called an “Activation Record”.
 The activation record contains:
i. Values of parameters
ii. Count of arguments
iii. Return value
iv. Return address
v. A pointer to the first instruction of the procedure
 The runtime stack maintains all the activation records of the procedures involved in execution.
 A set of locations is reserved for every activation record.
 Some extra space is maintained between the activation records to store external data.
 The top of the runtime stack holds the main program and its data.
 The gap between the main program segment and the activation record segment is used to store temporary or intermediate values while the program is executing.
 The runtime storage management maintains a register called the “instruction pointer” (similar to a program counter) which contains the address of the next instruction to be executed.
 Example:- Let there be a program with a main function which calls procedure ‘P’. Procedure ‘P’ calls procedure ‘Q’, and procedure ‘Q’ calls procedure ‘R’. Draw the runtime stack for the given situation; a sketch follows.
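Since the figure for this example is not reproduced here, below is a hedged C sketch (all field names and return addresses are invented) of the activation records pushed as main -> P -> Q -> R call one another:

#include <stdio.h>

typedef struct {
    const char *proc;        /* procedure this record belongs to */
    int arg_count;           /* count of arguments */
    int params[4];           /* values of parameters (fixed size here) */
    int return_value;
    int return_address;      /* where execution resumes in the caller */
} ActivationRecord;

#define STACK_DEPTH 16
static ActivationRecord runtime_stack[STACK_DEPTH];
static int top;              /* next free slot */

static void call(const char *proc, int ret_addr)
{
    runtime_stack[top].proc = proc;
    runtime_stack[top].return_address = ret_addr;
    top++;
}

int main(void)
{
    call("main", 0);         /* return addresses are made up */
    call("P", 101);
    call("Q", 202);
    call("R", 303);
    for (int i = top - 1; i >= 0; i--)   /* print from stack top down */
        printf("%-4s (returns to %d)\n",
               runtime_stack[i].proc, runtime_stack[i].return_address);
    return 0;
}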


• Block structured implementation:-

 Some programming languages support block-structured programs.
 In a block-structured program a set of procedures can be integrated into a single block.
 It may also support the implementation of adjustable-length arrays.
 In the block structured implementation a separate runtime stack is maintained for every block.
 The main procedure maintains a display which contains pointers to the activation records, i.e. the starting address of each activation record.
 The activation record in the block structured implementation contains:
i. The values of parameters
ii. Count of arguments
iii. Return address
iv. Return values
v. Stack pointers
vi. Pointers to arrays
vii. The count of formal parameters
 Example:- Let us consider a program which maintains a main function that calls procedure ‘P’. Procedure ‘P’ calls procedure ‘Q’. Procedure ‘Q’ calls procedure ‘R’.

Let us consider that procedure ‘R’ is called by procedure ‘Q’ in the following manner:
