COMPILERS
Simply stated, a compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language).
source program → Compiler → target program (error messages are reported along the way)
But compilers are very difficult programs to write. For example, the first Fortran compiler took nearly 18 staff-years to implement during the 1950s. Over the years, however, good implementation languages, programming environments, and software tools have been developed.
There are two parts to compilation:
1. Analysis
2. Synthesis
it department srkr engg college 13/143/14 bhimavarm
compiler design study material
i) Lexical analysis:
The stream of characters making up the source program is read from left to right and grouped into tokens.
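As a concrete illustration, grouping characters into tokens can be sketched with a small scanner. This is an illustrative sketch, not the notes' own code; the token names and regular expressions are assumptions chosen for the running example.

```python
import re

# Illustrative token specification; the names and patterns are assumptions.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),           # e.g. 60
    ("ASSIGN", r":="),            # the assignment operator
    ("OP",     r"[+*]"),          # arithmetic operators
    ("ID",     r"[A-Za-z_]\w*"),  # identifiers such as position, rate
    ("SKIP",   r"\s+"),           # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Read the source left to right and group the characters into tokens."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

For the running example, `tokenize("position := initial + rate * 60")` yields the seven tokens id, :=, id, +, id, *, number.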
[Parse of the statement position := initial + rate * 60: the assignment operator has identifier1 (position) on the left; the right side is the expression identifier2 (initial) + identifier3 (rate) * number (60).]
Example:
Syntax analysis produces the tree := (position, + (initial, * (rate, 60))).
Semantic analysis inserts a type conversion for the integer constant, giving := (position, + (initial, * (rate, inttoreal(60)))).
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
1> Each three-address instruction has at most one operator besides the assignment operator, so the compiler must decide on the order in which the operations are performed.
2> The compiler must generate a temporary name to hold the value computed by each instruction.
3> Some three-address instructions have fewer than three operands, e.g., the first and last instructions in the example above.
Note:
1. The inttoreal operation can be performed at compile time, so it can be removed.
2. temp3 is used only to copy the computed value into id1, so the final instruction can also be removed.
After these optimizations the code becomes temp1 := id3 * 60.0 and id1 := id2 + temp1.
Identifier Attributes
All this information is stored in the symbol table, which is a data structure containing a
record for each identifier, with fields for the attributes of the identifier.
The lexical analyzer enters the symbols, and the remaining phases use them in various ways; for example, the semantic analyzer uses an identifier's attributes to check its validity in a statement.
Every phase of the compiler may encounter errors, which need to be handled so that compilation can proceed and further errors can be detected.
A compiler that stops after encountering a single error is not good. Most errors are handled by the syntax-analysis and semantic-analysis phases.
Example:
The following shows the input and output of each phase of the compiler for the statement position := initial + rate * 60; the identifiers are also recorded in the symbol table.
Lexical analyzer: id1 := id2 + id3 * 60
Syntax analyzer: builds the tree := (id1, + (id2, * (id3, 60)))
Semantic analyzer: inserts a conversion, giving := (id1, + (id2, * (id3, inttoreal(60))))
Intermediate code generator:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Code optimizer
Code generator
Symbol Table:
1 position ………
2 initial ………
3 rate ……...
int a,b;
float c,d;
char x,y;
c=a+b;
d=c-b;
x=x+y;
1) Lexical Analysis:-
a id1(identifier)
, separator
b id2(identifier)
; line separator
c id3(identifier)
, separator
d id4(identifier)
x id5(identifier)
, separator
y id6(identifier)
; line separator
c id7(identifier)
= operator(Assignment)
a id8(identifier)
+ operator(plus)
b id9(identifier)
; line separator
d id10
= operator(Assignment)
c id11
- operator(minus)
b id12
; line separator
x id13
= operator(Assignment)
x id14
+ operator(plus)
y id15
; line separator
2) Syntax Analysis:-
a) c=a+b; b) d=c-b;
c) x=x+y;
3) Semantic Analysis:-
d=c-b; x=x+y;
ADD a,b
MOV c,a
SUB c,b
MOV d,c
ADD x,y
EX: main( )
{ int x;
add( )
int a=10, b=20, c;
c = a + b;
Note: A syntax tree is a compressed representation of the parse tree in which the operators are represented as interior nodes and the operands are represented as child nodes of the operator node.
PASS: A group of phases is combined into one pass. In general, the first three phases are combined into a first pass and the remaining phases are combined into another pass; therefore, in general, the compiler has two passes.
Cousins of compiler:
The target code generated by a compiler is always larger than manually written target code, but manual target-code generation is very difficult, and correcting errors in it is also very difficult.
Types of compilers
Incremental compiler: compiles only the modified part of the source program during recompilation.
Cross compiler: a compiler that runs on one machine and generates target code for another machine.
The output of one compiler can also be the input of the next compiler, whose output is given to the next, and so on, until the required target code is generated.
Interpreter
An interpreter is a kind of translator that produces the result directly when the source program and its data are given to it as input.
source program & data → Interpreter → result
1.LEXICAL ANALYSIS
Lexical analysis is the first phase of the compiler; it generates the set of “tokens” identified in the source program.
Lexical analysis scans the source program from left to right to generate tokens.
Scanning is performed on request: the parser issues a request for the next token, which the lexical analyzer serves.
Lexical analysis is achieved with finite automata that recognize all valid tokens of the source language.
Lexical analysis interacts with the symbol table to store information about the symbols found during scanning.
Lexical analysis also reports errors present in the source program, including line numbers.
2. INPUT BUFFER:
To perform lexical analysis, the compiler maintains a part of memory to store the program, called the “input buffer”.
In a one-buffer scheme, if a lexeme is too long to fit into the input buffer, the first part of the lexeme gets overwritten.
In a two-buffer scheme the input buffer consists of two memory blocks and is divided into two partitions; this is called the “buffer pair” scheme.
The input buffer maintains two pointers:
1. Beginning pointer (bptr)
2. Forward pointer (fptr)
If int i,j; is the source program, then initially both bptr and fptr point to the first character of the buffer:
i n t   i , j ;
(bptr and fptr both at ‘i’)
After the forward pointer is advanced three times, the token int is identified; the lexeme lies between bptr and fptr:
i n t   i , j ;
(bptr at ‘i’, fptr just past ‘t’)
In the buffer-pair scheme, each move of the forward pointer must check whether the end of a partition has been reached.
The major drawback of this method is that every move of the forward pointer requires two checks. The alternative solution is the “sentinel” method.
In this method the end of each partition is marked by a special character, “EOF”.
Now, for every move of the forward pointer, it only has to check whether the character is “EOF”, so one check per move is enough.
If the character is “EOF”, a further check determines which partition has ended, and then the other partition is loaded.
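The one-check-per-move idea can be sketched as follows. This is an illustrative sketch, not the notes' own code; the partition size and the choice of '\0' as the sentinel are assumptions.

```python
EOF = "\0"  # sentinel; assumption: '\0' cannot appear in the source text

def make_buffer_pair(text, size=4):
    """Split the text into partitions of `size` characters, each ending in EOF."""
    return [text[i:i + size] + EOF for i in range(0, len(text), size)]

def read_all(partitions):
    """Advance the forward pointer with only ONE check (== EOF) per move."""
    chars, p, i = [], 0, 0
    while p < len(partitions):
        ch = partitions[p][i]
        if ch == EOF:          # the single check per forward-pointer move
            p, i = p + 1, 0    # a partition ended: switch to / load the next one
        else:
            chars.append(ch)
            i += 1
    return "".join(chars)
```

Only when the sentinel is actually seen does the extra "which partition ended?" logic run, instead of on every move.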
LEXEME: The string between the beginning pointer and the forward pointer is called a “LEXEME”.
PATTERN: A set of strings that return the same token is called a pattern.
Ex: x1 and abc both return the token identifier.
TOKEN: The lexical analyzer returns a token if the lexeme is valid in the source language. All possible word classes of the source language are called tokens.
Every phase of the compiler may find errors in the source program; the errors are reported to the error-reporting unit.
Most of the errors found during lexical analysis are typing errors. These errors can be handled by the lexical error handler.
FINITE AUTOMATA
Deterministic finite automata: (DFA):
A deterministic finite automaton is a five-tuple (Q, ∑, δ, q0, F)
where Q is the finite set of states,
∑ is the finite set of input symbols,
q0 ∈ Q is the initial state,
F ⊆ Q is the set of final states, and
δ: Q × ∑ → Q is the transition function.
[Transition diagram: q0 loops on a,b; q0 goes to q1 on b; q1 goes to q2 on a,b; q2 is the final state.]
where Q = {q0, q1, q2}
∑ = {a, b}
F = {q2}
δ: Q × ∑ → Q
Sol:
Now for DFA
M’ = (Q’, ∑, δ’, q0’, F’)
state | a | b
[q0] | [q0] | [q0, q1]
[q0, q1] | [q0, q2] | [q0, q1, q2]
*[q0, q2] | [q0] | [q0, q1]
*[q0, q1, q2] | [q0, q2] | [q0, q1, q2]
(* marks final states of the DFA)
[Transition diagram of the resulting DFA: [q0] goes to [q0] on a and to [q0,q1] on b; [q0,q1] goes to [q0,q2] on a and to [q0,q1,q2] on b; [q0,q2] goes to [q0] on a and to [q0,q1] on b; [q0,q1,q2] goes to [q0,q2] on a and to [q0,q1,q2] on b.]
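The table above can be reproduced mechanically with the subset construction. This is an illustrative sketch; the NFA transitions are the ones recovered from the example, and the dict/frozenset encoding is an assumption.

```python
# NFA from the example: q0 loops on a,b; q0 -b-> q1; q1 -a,b-> q2 (final).
NFA = {
    ("q0", "a"): {"q0"},
    ("q0", "b"): {"q0", "q1"},
    ("q1", "a"): {"q2"},
    ("q1", "b"): {"q2"},
}
ALPHABET = ("a", "b")

def subset_construction(start="q0", nfa_finals=frozenset({"q2"})):
    """Return the DFA transition table and the set of final DFA states."""
    table, work = {}, [frozenset({start})]
    while work:
        S = work.pop()
        if S in table:
            continue
        table[S] = {}
        for sym in ALPHABET:
            # union of the NFA moves of every state in S on sym
            T = frozenset().union(*(NFA.get((q, sym), set()) for q in S))
            table[S][sym] = T
            work.append(T)
    # a DFA state is final if it contains an NFA final state
    finals = {S for S in table if S & nfa_finals}
    return table, finals
```

Running it produces the four DFA states of the table, with [q0,q2] and [q0,q1,q2] final.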
P: the set of productions
AMBIGUOUS GRAMMAR:
A grammar is said to be ambiguous if it generates multiple parse trees for the same input string.
An ambiguous grammar does not support top-down parsing. The ambiguity can be eliminated by eliminating left recursion from the grammar.
EX: E→E+E / E→E*E / E→id
The left recursion can be eliminated from a grammar by writing equivalent productions in non-recursive form. A production A→Aα/β is replaced by:
A→βA'
A'→αA'/ε
EX:- E→E+T/T
T→T*F/F
F→(E)/id
Eliminate the left recursion in the above grammar. After elimination of left recursion the grammar will be:
E→TE'
E'→+TE'/ε
T→FT'
T'→*FT'/ε
F→(E)/id
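The rewriting rule above can be sketched as a small function. This is an illustrative sketch, not the notes' own code; representing productions as tuples and epsilon as the string 'eps' are assumptions.

```python
def eliminate_left_recursion(head, prods):
    """Rewrite A -> A a1 | ... | b1 | ... as A -> b A', A' -> a A' | eps.
    Productions are tuples of grammar symbols; 'eps' denotes epsilon."""
    alphas = [p[1:] for p in prods if p and p[0] == head]   # left-recursive tails
    betas  = [p for p in prods if not p or p[0] != head]    # the other right sides
    if not alphas:
        return {head: prods}          # nothing to eliminate
    new = head + "'"
    return {
        head: [b + (new,) for b in betas],
        new:  [a + (new,) for a in alphas] + [("eps",)],
    }
```

Applied to E→E+T/T it yields E→TE' and E'→+TE'/ε, matching the hand-derived grammar.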
LEFT FACTORING: The parser may get confused if the grammar has productions of the following form:
A→γx1/γx2/γx3/γx4/γx5…………/γxn
Such productions are left-factored as:
A→γZ
Z→x1/x2/x3/x4/x5/………/xn
SYNTAX ANALYSIS
The output of lexical analysis is a set of tokens, i.e., the tokenized source program.
The tokenized source program is given as input to the parser.
A parser is a tool that performs syntax analysis.
Construction of the parse tree is called “PARSING”.
Parsing is the process of deriving an input string from the given grammar.
Parsing is performed by two methods. They are,
1) Top-down parsing
2) Bottom-up parsing.
Top-down parsing: In top-down parsing a parse tree is constructed from root node to leaf
node.
Bottom-up parsing: In bottom-up parsing a parse tree is constructed from leaf nodes to root
node.
A→ab/a
Ans:-
After backtracking, the parser tries the alternative production.
b. Parsing is performed by scanning finite automata to generate the input string.
2. During scanning of the finite automata:
a. If a non-terminal is identified, scanning jumps to the automaton of that non-terminal symbol.
b. If scanning reaches the final state of that non-terminal's automaton, it jumps back to the previous automaton to continue processing.
c. To decide acceptance of the input string, a set of finite-automata machines is used.
d. Each finite-automaton machine maintains a set of input symbols for making transitions.
e. While checking acceptance of the input string, we proceed through the input symbols to generate the required string.
EX:- E→E+T/T
T→T*F/F
F→(E)/id
Eliminate the left recursion in the above grammar, because top-down parsing cannot parse left-recursive grammars. After elimination of left recursion the grammar will be:
E→TE'
E'→+TE'/ε
T→FT'
T'→*FT'/ε
F→(E)/id
First the parser enters the finite automaton of “E”. The first symbol there is the non-terminal “T”, so it enters the automaton of “T”; the first symbol there is the non-terminal “F”, so it enters the automaton of “F”. There the symbol “id” is matched and the final state of “F” is reached, so control returns to the previous automaton, “T”. The next symbol in that automaton is “T'”, so it enters the automaton of “T'”.
The first symbol in the automaton of “E'” is “+”; after that symbol comes “T”, so the parser enters the automaton of “T”, then “F”, whose first symbol is “id”. It reaches the final state, so control returns to the previous automaton “T”. Following the above steps, the final state of the automaton of “E” is reached.
ETFTT‟TEE‟TFTT‟TE
A→Aα/β
Left-recursive grammars make the parser go into an infinite loop during top-down parsing. The left recursion can be eliminated by replacing the left-recursive production with the following:
A→βA'
A'→αA'/ε
To perform non-recursive (table-driven) parsing, a parse table is used. Parse-table construction requires two components, the "FIRST" and "FOLLOW" sets of the non-terminal symbols present in the given grammar.
Calculate FIRST
sol:
FIRST[E] = FIRST[T] = FIRST[F] = {(, id}
FIRST[E'] = {+, ε}
FIRST[T'] = {*, ε}
so,
FOLLOW[E] = {$, )}
FOLLOW[E'] = {$, )}
FOLLOW[T] = {$, +, )}
FOLLOW[T'] = {$, +, )}
FOLLOW[F] = {$, +, ), *}
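The FIRST and FOLLOW sets above can be computed with the usual fixed-point iteration. This is an illustrative sketch, not the notes' own code; the grammar encoding and the 'eps' marker for epsilon are assumptions.

```python
# Grammar after left-recursion elimination; 'eps' marks an epsilon production.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["eps"]],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMS = set(GRAMMAR)

def first_of(seq, FIRST):
    """FIRST of a symbol sequence; an empty sequence yields {eps}."""
    out = set()
    for sym in seq:
        f = FIRST[sym] if sym in NONTERMS else {sym}
        out |= f - {"eps"}
        if "eps" not in f:
            return out
    out.add("eps")
    return out

def compute_first_follow(start="E"):
    FIRST = {n: set() for n in NONTERMS}
    changed = True
    while changed:                      # iterate FIRST to a fixed point
        changed = False
        for head, prods in GRAMMAR.items():
            for p in prods:
                f = first_of(p, FIRST)
                if not f <= FIRST[head]:
                    FIRST[head] |= f
                    changed = True
    FOLLOW = {n: set() for n in NONTERMS}
    FOLLOW[start].add("$")
    changed = True
    while changed:                      # iterate FOLLOW to a fixed point
        changed = False
        for head, prods in GRAMMAR.items():
            for p in prods:
                for i, sym in enumerate(p):
                    if sym not in NONTERMS:
                        continue
                    tail = first_of(p[i + 1:], FIRST)
                    add = (tail - {"eps"}) | (FOLLOW[head] if "eps" in tail else set())
                    if not add <= FOLLOW[sym]:
                        FOLLOW[sym] |= add
                        changed = True
    return FIRST, FOLLOW
```

The results agree with the sets listed above (with 'eps' standing for ε).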
1. Place each production in the parse-table entries for the symbols that are present in the FIRST set of its left-side non-terminal.
2. Place an ε-production [X→ε] in the entries for the symbols present in the FOLLOW of the non-terminal, i.e., when the non-terminal's FIRST set contains ε.
Parsing is the process of checking the validity of an input string for the given grammar. There are 2 types of parsing mechanisms.
Shift-reduce parsing is one of the bottom-up parsing techniques. Shift-reduce parsing performs 4 actions:
1. Shift
2. Reduce
3. Accept
4. Error
The shift-reduce parser uses an input buffer and a stack. The input string, ending with $, is initially kept in the input buffer. The stack initially contains $.
If some part of the stack is equal to the right side of a production, that substring is called a handle: a handle is a substring that matches the right side of some production of the given grammar.
In a reduce operation, a stack portion (the HANDLE) is replaced by the non-terminal on the left side of the corresponding production.
Accept is the situation in which the stack holds the start symbol of the grammar and the input buffer has "$" only.
EXAMPLE:
GRAMMAR IS EE + E
Eid
$ | id+id$ | shift
$id | +id$ | reduce E→id
$E | +id$ | shift
$E+ | id$ | shift
$E+id | $ | reduce E→id
$E+E | $ | reduce E→E+E
$E | $ | accept
The above input string is valid for the given grammar because parsing of that string reaches to
accept state.
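The shift-reduce loop above can be sketched for this tiny grammar. This is an illustrative sketch, not the notes' own code; always reducing as soon as a handle appears on top of the stack is an assumed strategy that happens to work for E→E+E / E→id.

```python
# Grammar E -> E + E | id, encoded as (head, right-side) pairs.
PRODUCTIONS = [("E", ("E", "+", "E")), ("E", ("id",))]

def shift_reduce(tokens):
    """Return the list of actions taken while parsing the token list."""
    stack, actions, buf = ["$"], [], tokens + ["$"]
    while True:
        for head, rhs in PRODUCTIONS:       # is there a handle on top?
            if tuple(stack[-len(rhs):]) == rhs:
                stack[-len(rhs):] = [head]  # reduce: replace handle by head
                actions.append("reduce " + head + "->" + "".join(rhs))
                break
        else:
            if stack == ["$", "E"] and buf == ["$"]:
                actions.append("accept")    # start symbol left, input exhausted
                return actions
            if buf == ["$"]:
                actions.append("error")
                return actions
            stack.append(buf.pop(0))        # shift the next input symbol
            actions.append("shift")
```

For the input id+id the action sequence matches the table above: three shifts, three reduces, then accept.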
1. LR PARSERS
LR Parsers: LR parsing is the most popular bottom-up parsing technique; these parsers are also called generalized non-backtracking shift-reduce parsers.
These parsers can be used for a very large class of grammars.
1. Take the grammar and construct a canonical item set. Using the canonical item set, construct an LR parse table.
2. Use an input buffer and a stack to perform the shift-reduce parsing actions:
1. Shift
2. Reduce
3. Accept
4. Error
Construction of the canonical item set for an SLR PARSER:
Step 1: Add a new production A'→A to the given grammar, where ‘A’ is the start symbol of the grammar.
Step 2: Place a “dot” symbol at the beginning of the right side of every production.
Step 3: Advance the dot over one symbol for every expansion.
Step 4: During expansion, if any non-terminal is immediately to the right of the dot, add that non-terminal's productions to the item set.
Step 5: Continue steps 3 and 4 until no further expansion is possible.
E->E+T
E->T
T->T*F
T->F
F->(E)
F->id
STEP 1: Convert the given CFG into an augmented grammar by adding the production E'→E.
STEP 2: The initial canonical item set I0 for the above grammar is obtained by placing a “.” (dot) at the beginning of the right side of every production:
E'→.E
E→.E+T
E→.T
T→.T*F
T→.F
F→.(E)
F→.id
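The expansion in Step 4 is the closure operation, which can be sketched directly. This is an illustrative sketch, not the notes' own code; representing an item as a (head, right-side, dot-position) triple is an assumption.

```python
# Augmented grammar from the example above.
GRAMMAR = [("E'", ("E",)),
           ("E", ("E", "+", "T")), ("E", ("T",)),
           ("T", ("T", "*", "F")), ("T", ("F",)),
           ("F", ("(", "E", ")")), ("F", ("id",))]
NONTERMS = {"E'", "E", "T", "F"}

def closure(items):
    """If a non-terminal follows the dot, add its productions with the dot at 0."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for h, r in GRAMMAR:
                    if h == rhs[dot] and (h, r, 0) not in items:
                        items.add((h, r, 0))
                        changed = True
    return items
```

Starting from the single item E'→.E, the closure is exactly the seven items of I0 listed above.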
I0: E'→.E, E→.E+T, E→.T, T→.T*F, T→.F, F→.(E), F→.id
I1 = goto(I0, E): E'→E., E→E.+T
I2 = goto(I0, T): E→T., T→T.*F
I3 = goto(I0, F): T→F.
I4 = goto(I0, ( ): F→(.E), E→.E+T, E→.T, T→.T*F, T→.F, F→.(E), F→.id
I5 = goto(I0, id): F→id.
I6 = goto(I1, +): E→E+.T, T→.T*F, T→.F, F→.(E), F→.id
I7 = goto(I2, *): T→T*.F, F→.(E), F→.id
I8 = goto(I4, E): F→(E.), E→E.+T
I9 = goto(I6, T): E→E+T., T→T.*F
I10 = goto(I7, F): T→T*F.
I11 = goto(I8, ) ): F→(E).
state | id | + | * | ( | ) | $ | E | T | F
0 | s5 | | | s4 | | | 1 | 2 | 3
1 | | s6 | | | | acc | | |
2 | | r2 | s7 | | r2 | r2 | | |
3 | | r4 | r4 | | r4 | r4 | | |
4 | s5 | | | s4 | | | 8 | 2 | 3
5 | | r6 | r6 | | r6 | r6 | | |
6 | s5 | | | s4 | | | | 9 | 3
7 | s5 | | | s4 | | | | | 10
8 | | s6 | | | s11 | | | |
9 | | r1 | s7 | | r1 | r1 | | |
10 | | r3 | r3 | | r3 | r3 | | |
11 | | r5 | r5 | | r5 | r5 | | |
(ri means reduce by production i, where 1: E→E+T, 2: E→T, 3: T→T*F, 4: T→F, 5: F→(E), 6: F→id)
1. The derivation of the input string uses two components: a stack and an input buffer.
2. The input buffer maintains the input string, ending with the ‘$’ symbol.
3. The stack contains the initial state “0” initially.
4. During a shift operation, a symbol of the input string and the next state are pushed onto the stack.
5. During a reduce operation, the handle's symbols and states are popped and replaced by the non-terminal on the left side of the corresponding production, together with its goto state.
Example: Input string is id+id$.
0 | id+id$ | s5
0id5 | +id$ | r6 (F→id)
0F3 | +id$ | r4 (T→F)
0T2 | +id$ | r2 (E→T)
0E1 | +id$ | s6
0E1+6 | id$ | s5
0E1+6id5 | $ | r6 (F→id)
0E1+6F3 | $ | r4 (T→F)
0E1+6T9 | $ | r1 (E→E+T)
0E1 | $ | ACCEPTED
The major drawback of the SLR parser is that it may have:
1. Shift-reduce conflicts
2. Reduce-reduce conflicts
The SLR parse table may then have multiple entries in a single cell, which causes confusion in operation selection. This problem can be solved in the canonical LR parser.
Conflict Example:
S → AaAb
S → BbBa
A → ε
B → ε
I0: S' → .S, S → .AaAb, S → .BbBa, A → ., B → .
Problem:
FOLLOW(A) = {a, b}
FOLLOW(B) = {a, b}
On input a: reduce by A → ε or reduce by B → ε; on input b: likewise.
So I0 has a reduce/reduce conflict on both a and b.
The above problems can be solved by using a canonical LR parser. It is similar to the SLR parser, but the construction of the canonical item set includes lookahead terminal symbols, as follows.
Canonical LR parser:
The canonical LR parser's item-set construction differs from the SLR parser's item-set construction.
closure(I) is (where I is a set of LR(1) items):
– every LR(1) item in I is in closure(I)
– if [A→α.Bβ, a] is in closure(I) and B→γ is a production rule of G, then [B→.γ, b] will be in closure(I) for each terminal b in FIRST(βa).
EX: S → CC, C → cC, C → d. The LR(1) item sets are:
I0: S' → .S, $ ; S → .CC, $ ; C → .cC, c/d ; C → .d, c/d
I1 = goto(I0, S): S' → S., $
I2 = goto(I0, C): S → C.C, $ ; C → .cC, $ ; C → .d, $
I3 = goto(I0, c): C → c.C, c/d ; C → .cC, c/d ; C → .d, c/d
I4 = goto(I0, d): C → d., c/d
I5 = goto(I2, C): S → CC., $
I6 = goto(I2, c): C → c.C, $ ; C → .cC, $ ; C → .d, $
I7 = goto(I2, d): C → d., $
I8 = goto(I3, C): C → cC., c/d
I9 = goto(I6, C): C → cC., $
(goto(I3, c) = I3, goto(I3, d) = I4, goto(I6, c) = I6, goto(I6, d) = I7)
Parse table (productions numbered 1: S→CC, 2: C→cC, 3: C→d):
state | c | d | $ | S | C
0 | s3 | s4 | | 1 | 2
1 | | | acc | |
2 | s6 | s7 | | | 5
3 | s3 | s4 | | | 8
4 | r3 | r3 | | |
5 | | | r1 | |
6 | s6 | s7 | | | 9
7 | | | r3 | |
8 | r2 | r2 | | |
9 | | | r2 | |
Trace for the input dccd$:
stack | input buffer | action
0 | dccd$ | s4
0d4 | ccd$ | r3 (C→d)
0C2 | ccd$ | s6
0C2c6 | cd$ | s6
0C2c6c6 | d$ | s7
0C2c6c6d7 | $ | r3 (C→d)
0C2c6c6C9 | $ | r2 (C→cC)
0C2c6C9 | $ | r2 (C→cC)
0C2C5 | $ | r1 (S→CC)
0S1 | $ | accept
The canonical LR parser has a large number of states. In the LALR parser the number of states is reduced by merging the item sets that have the same productions but different lookahead symbols (here I3/I6, I4/I7, and I8/I9). The resulting LALR parse trace for the same input is given below.
0 | dccd$ | s47
0d47 | ccd$ | r3 (C→d)
0C2 | ccd$ | s36
0C2c36 | cd$ | s36
0C2c36c36 | d$ | s47
0C2c36c36d47 | $ | r3 (C→d)
0C2c36c36C89 | $ | r2 (C→cC)
0C2c36C89 | $ | r2 (C→cC)
0C2C5 | $ | r1 (S→CC)
0S1 | $ | accept
Syntax-directed translation (SDT) is a framework for intermediate code generation.
Syntax-directed translation provides a semantic action for each production of the grammar.
The semantic action is placed at the right side of the production, in braces.
Superscripts are used in the semantic action when the right side of a production has multiple instances of the same non-terminal symbol.
The semantic action corresponding to a production A→xyz is performed:
in top-down parsing, when A is expanded to xyz;
in bottom-up parsing, when xyz is reduced to A.
A syntax-directed translation scheme computes the values of non-terminal symbols by performing semantic actions.
There are two types of syntax-directed translation schemes:
Synthesized translation.
Inherited translation.
Synthesized translation:-
The value of the non-terminal on the production's left side is calculated as a function of the values of the right-side non-terminals.
Ex:-
A→B+C { A.value = B.value + C.value }
Inherited translation:-
The value of a non-terminal on the production's right side is calculated as a function of the left-side non-terminal's value.
Ex:-
A→xyz { y.value = 2 * A.value }
Implementation of syntax-directed translation:-
S→E$
E→E+E / E→E*E / E→I
I→I digit / I→digit
Implement the syntax-directed translation scheme for the given expression:
23*5+4$
Step 1:-
The lexical analyzer reads the source program (expression) from left to right.
Step 2:-
If the lexical value is equal to right side of any production then that is replaced by left side
symbol of that production.
Step 3:-
The value of that non terminal symbol is calculated by performing corresponding action of that
production.
Step 4:-
Repeat the above procedure until the expression ends.
Step 5:-
The root node of the syntax tree contains value of the expression.
[Syntax tree for 23*5+4$: S at the root with children E and $; E expands to E + E; the left E expands to E * E, whose operands are I (the digits 2 and 3 forming 23) and I → digit (5); the right E is I → digit (4).]
Sequence of moves:
step | input | stack | value | production used
1 | 23*5+4$ | - | - |
2 | 3*5+4$ | 2 | - |
3 | 3*5+4$ | I | 2 | I→digit
4 | *5+4$ | I3 | 2 _ |
5 | *5+4$ | I | (23) | I→I digit
6 | *5+4$ | E | (23) | E→I
7 | 5+4$ | E* | (23) _ |
8 | +4$ | E*5 | (23) _ _ |
9 | +4$ | E*I | (23) _ 5 | I→digit
10 | +4$ | E*E | (23) _ 5 | E→I
11 | +4$ | E | (115) | E→E*E
12 | 4$ | E+ | (115) _ |
13 | $ | E+4 | (115) _ _ |
14 | $ | E+I | (115) _ 4 | I→digit
15 | $ | E+E | (115) _ 4 | E→I
16 | $ | E | (119) | E→E+E
17 | - | E$ | (119) _ |
18 | - | S | | S→E$
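The synthesized-attribute computation traced above can be sketched as a tiny evaluator in which each non-terminal's value is built from its children's values. This is an illustrative sketch, not the notes' own code; the recursive-descent structure (with * binding tighter than +) is an assumed implementation choice.

```python
def evaluate(expr):
    """Evaluate digits, + and *, propagating synthesized .value attributes."""
    pos = 0

    def peek():
        return expr[pos] if pos < len(expr) else "$"

    def number():                 # I -> I digit | digit : I.value synthesized
        nonlocal pos
        val = 0
        while peek().isdigit():
            val = val * 10 + int(expr[pos])
            pos += 1
        return val

    def term():                   # E.value = E1.value * E2.value
        nonlocal pos
        val = number()
        while peek() == "*":
            pos += 1
            val = val * number()
        return val

    def exprn():                  # E.value = E1.value + E2.value
        nonlocal pos
        val = term()
        while peek() == "+":
            pos += 1
            val = val + term()
        return val

    return exprn()
```

For the worked input, `evaluate("23*5+4")` produces 119, the value at the root of the syntax tree in the moves above.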
x=(a+b)*(c*d)*(a+b).
[Abstract syntax tree for the expression: the root = has x on the left; the right side is a * node combining the subtrees (a+b), (c*d), and (a+b).]
DAG REPRESENTATION:
Let the expression in the source program be
X=(a*a)+(b*b)
[DAG: a + node (also labeled x) whose children are two * nodes; each operand node is shared, so common subexpressions are represented only once.]
1. Quadruple
2. Triple
3. Indirect triples
x=(a+b)*(c+d)
t1=a+b
t2=c+d
t3=t1*t2
x=t3
The above 3-address code is represented in the following ways.
Quadruple:
In quadruple representation there are 4 fields: op, arg1, arg2, and result; the last field holds the result of the operation. In the representation of unary operations there is no second argument, and assignment statements likewise have no arg2 entry.
Triple:
In triple representation there are 3 fields: op, arg1, arg2. The result of an operation is referred to by the number of the instruction that computes it; no temporary variables are used to store intermediate results.
The main advantage of triple representation is that it takes less memory.
INDIRECT TRIPLES:
Indirect triple representation is similar to the indirect addressing mode of instructions: each instruction is stored at some address, and the instructions are referred to by those addresses. The representation is convenient from the user's point of view, but during execution an extra memory fetch is required to access each instruction. This mechanism maintains 2 tables:
1. Instruction table
2. Address table
Only the address table needs to be kept in main memory during compilation, so it occupies less main memory.
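The quadruple and triple forms described above can be sketched for x=(a+b)*(c+d). This is an illustrative sketch, not the notes' own code; the tuple layouts follow the field descriptions in the text.

```python
# Quadruple form: (op, arg1, arg2, result); arg2 is None for the assignment.
QUADS = [("+", "a", "b", "t1"),
         ("+", "c", "d", "t2"),
         ("*", "t1", "t2", "t3"),
         ("=", "t3", None, "x")]

def as_triples(quads):
    """Convert quadruples to triples: results are referred to by instruction
    number instead of temporary names."""
    index = {}        # temporary name -> number of the triple that computes it
    triples = []
    for op, a1, a2, res in quads:
        a1 = index.get(a1, a1)        # replace temporaries by triple numbers
        a2 = index.get(a2, a2)
        index[res] = len(triples)
        if op == "=":
            triples.append((op, res, a1))   # assignment stores into the target
        else:
            triples.append((op, a1, a2))
    return triples
```

The conversion yields (+, a, b), (+, c, d), (*, 0, 1), (=, x, 2): the * refers to the results of triples 0 and 1 by number, with no temporaries.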
[3] = x [2]
SWITCH:
switch(ch)
{ case 1: i=i+1;
case 2: i=i+2;
}
100 if(ch=1) goto L1
101 if(ch=2) goto L2
102 L1: t1=i+1
103 i=t1
104 goto Last
105 L2: t2=i+2
106 i=t2
107 goto Last
108 Last: end
FOR:
for(i=1;i<10;i++)
{
y=x+5;
}
100 i=1
101 L1: t1=x+5
102 y=t1
103 i=i+1
104 if(i<10) goto L1
105 end
Code optimization
Code optimization is the process of identifying patterns in the source program that can be replaced by other patterns that are shorter or faster.
Code optimization is done on both the intermediate code and the target code.
There are three types of code-optimization techniques. They are
1) Loop optimization.
2) Straight-line optimization.
3) Peephole optimization.
Loop Optimization:-
The instructions outside a loop are executed only once, but the instructions inside a loop are executed many times, so loop optimization plays an important role in the code-optimization process.
(1) Divide the program into basic blocks by determining the leaders in the source program.
(2) Construct a flow graph that represents the communication between the blocks (i.e., the order of execution of the blocks).
Let us consider the 3-address code below, which calculates the dot product of two vectors A and B:
1 PROD=0
2 I=1
3 T1=4*I
4 T2=Add(A)-4
5 T3=T2[T1]
6 T4=Add(B)-4
7 T5=T4[T1]
8 T6=T3*T5
9 PROD=PROD+T6
10 I=I+1
11 If(I<=20) goto(3)
First we have to identify the leaders in the program. The rules for selecting leaders are:
1) The starting instruction of the program is a leader.
2) The target of a branch/jump instruction is a leader.
3) The instruction immediately following a branch/jump instruction is a leader.
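The three leader rules can be sketched directly over numbered instructions. This is an illustrative sketch, not the notes' own code; passing the jumps as explicit (instruction, target) pairs is an assumption.

```python
def find_leaders(n_instrs, jumps):
    """jumps: list of (instr_no, target_no) pairs, one per branch/jump."""
    leaders = {1}                          # rule 1: the first instruction
    for at, target in jumps:
        leaders.add(target)                # rule 2: the jump target
        if at + 1 <= n_instrs:
            leaders.add(at + 1)            # rule 3: instruction after the jump
    return sorted(leaders)

def basic_blocks(n_instrs, jumps):
    """Each block runs from a leader up to (but not including) the next leader."""
    ls = find_leaders(n_instrs, jumps)
    return [(a, b - 1) for a, b in zip(ls, ls[1:] + [n_instrs + 1])]
```

For the dot-product code (11 instructions, one jump at 11 targeting 3) the leaders are 1 and 3, giving the blocks B1 = instructions 1-2 and B2 = instructions 3-11.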
In our example the leaders are instructions 1 and 3, which divide the code into two blocks:
B1:
1 PROD=0
2 I=1
B2:
3 T1=4*I
4 T2=Add(A)-4
5 T3=T2[T1]
6 T4=Add(B)-4
7 T5=T4[T1]
8 T6=T3*T5
9 PROD=PROD+T6
10 I=I+1
11 If(I<=20) goto(3)
The computations T2=Add(A)-4 and T4=Add(B)-4 do not change inside the loop, so they can be moved out of the loop body:
B1:
1 PROD=0
2 I=1
B2:
T2=Add(A)-4
T4=Add(B)-4
B3:
1 T1=4*I
2 T3=T2[T1]
3 T5=T4[T1]
4 T6=T3*T5
5 PROD=PROD+T6
6 I=I+1
7 If(I<=20) goto(3)
if(j<i)
printf("compiler");
if(j<1)
printf("compiler");
Induction-variable elimination minimizes the number of symbols used in the symbol table: if one variable is dependent on another variable, we can use only one variable instead of two.
In the above example I and T1 are induction variables, and equivalent code can be constructed using “T1” only. The resulting loop code will be
T1=T1+4
.
.
If(T1<=80) goto(5)
B1:
1 PROD=0
2 T1=0
B2:
T2=Add(A)-4
T4=Add(B)-4
B3:
1 T1=T1+4
2 T3=T2[T1]
3 T5=T4[T1]
4 T6=T3*T5
5 PROD=PROD+T6
6 If(T1<=80) goto(3)
i=x+1;
a[i]='H'; we can write it as a[x+1]='H';
If loop execution causes delay, we can remove unnecessary loops from the program. For example
j=1;
for(i=1;i<=j;i++)
printf("x");
executes its body only once, so the loop can be removed.
Peephole optimization
1. Elimination of redundant loads and stores
For the statements A=B+C; D=A+E the generated code is
mov B,R0
add C,R0
mov R0,A
mov A,R0
add E,R0
mov R0,D
Consider the pair
1. mov R0,A
2. mov A,R0
The 2nd instruction is a redundant load: no value changes after its execution.
So whenever such instructions are present in our program, we can delete instruction (2), provided it does not have any label.
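This deletion can be sketched as a one-pass peephole over the instruction list. This is an illustrative sketch, not the notes' own code; encoding an instruction as a (label, op, src, dst) tuple is an assumption.

```python
def peephole(instrs):
    """Delete 'mov X,R' immediately after 'mov R,X' (a redundant load),
    provided the second instruction carries no label."""
    out = []
    for lab, op, src, dst in instrs:
        if out and lab is None:
            _, pop, psrc, pdst = out[-1]
            if op == pop == "mov" and src == pdst and dst == psrc:
                continue                 # drop the redundant instruction
        out.append((lab, op, src, dst))
    return out
```

On the six-instruction sequence above, the pass removes only mov A,R0, leaving five instructions.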
2. Elimination of unreachable (dead) code
Debug = 1;
if Debug != 1 goto L2
L1: printf("compiler design");
L2: printf("file structure");
We observe that Debug is a constant initialised to 1, so the condition Debug != 1 will never be true and the jump to L2 is never taken; such unreachable code can be eliminated. After elimination of the dead code the resulting code is
Debug := 1
L1: printf("compiler design");
3. Flow-of-control optimization (elimination of jumps to jumps)
goto L1
L1: goto L2
L2: printf("hai");
Suppose there are no other jumps to ‘L1’; then we can delete the label ‘L1’ and jump directly.
It will become
goto L2
L2: printf("hai");
Example:
01 jmp 03
03 jmp 05
05 jmp 07
07 add R1,R2
If a program contains multiple unnecessary jumps of this type, they can be eliminated, so the above program can be simplified as
01 jmp 07
07 add R1,R2
4. Algebraic simplification
Statements such as y=y*1 or y=y+0 are produced by straightforward intermediate code-generation algorithms; these statements can be eliminated directly.
5. Use of machine idioms
In order to implement some instructions efficiently, we can use special hardware instructions.
For example, to implement a stack push the stack top has to be incremented. If the machine has an auto-increment addressing mode, push can be performed without manipulating the stack top explicitly, because the machine automatically increments the top as part of the instruction (push operation).
Similarly, to implement a stack pop the stack top has to be decremented. If the machine has an auto-decrement addressing mode, pop can be performed without manipulating the stack top explicitly, because the machine automatically decrements the top as part of the instruction (pop operation).
6. Reduction in strength
Replacing a high-cost operator with a low-cost operator is called reduction in strength.
Example 1:
Example 2:
Example 3:
Let s1 and s2 be two strings; the total length can be calculated by len(s1+s2).
This can be simplified by calculating the lengths of the strings s1 and s2 separately and adding them, len(s1)+len(s2); thus we reduce the strength (the costly string concatenation is avoided).
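Reduction in strength on three-address statements can be sketched as a rewrite rule. This is an illustrative sketch, not the notes' own code; the two rewrites shown (multiplication by 2 to addition, squaring to multiplication) are common textbook instances.

```python
def reduce_strength(stmt):
    """stmt is (dst, op, arg, const); replace a costly op by a cheaper one."""
    dst, op, a, b = stmt
    if op == "*" and b == 2:
        return (dst, "+", a, a)     # x = y * 2  ->  x = y + y
    if op == "**" and b == 2:
        return (dst, "*", a, a)     # x = y ** 2 ->  x = y * y
    return stmt                     # no rule applies: leave unchanged
```

Statements that match no rule are returned unchanged, so the pass is safe to run over a whole block.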
The code generator generates target code for a sequence of three-address statements. It considers each statement in turn, remembering whether any of the operands of the statement are currently in registers and taking advantage of that fact if possible. Code generation uses descriptors to keep track of register contents and addresses for names.
1. A register descriptor keeps track of what is currently in each register; it is consulted whenever a new register is needed.
2. An address descriptor keeps track of the location (or locations) where the current value of a name can be found at run time. The location might be a register, a stack location, a memory address, or some set of these, since when copied, a value also stays where it was. This information can be stored in the symbol table and is used to determine the accessing method for a name.
for each X = Y op Z do
• Consult the address descriptor of Y to determine Y'. Prefer a register for Y'. If the value of Y is not already in L, generate
MOV Y', L
• Generate
op Z', L
• If the current values of Y and/or Z have no next use, are dead on exit from the block, and are in registers, change the register descriptor to indicate that those registers no longer contain Y and/or Z.
The code generation algorithm takes as input a sequence of three-address statements constituting
a basic block. For each three-address statement of the form x := y op z we perform the following
actions:
1. Invoke a function getreg to determine the location L where the result of the computation
y op z should be stored. L will usually be a register, but it could also be a memory
location. We shall describe getreg shortly.
2. Consult the address descriptor for y to determine y', (one of) the current location(s) of y. Prefer the register for y' if the value of y is currently both in memory and a register. If the value of y is not already in L, generate the instruction MOV y', L to place a copy of y in L.
3. Generate the instruction op z', L, where z' is a current location of z (again preferring a register).
4. If the current values of y and/or z have no next uses, are not live on exit from the block, and are in registers, alter the register descriptor to indicate that, after execution of x := y op z, those registers will no longer contain y and/or z, respectively.
The function getreg returns the location L to hold the value of x for the assignment x := y op z.
1. If the name y is in a register that holds the value of no other names (recall that copy
instructions such as x := y could cause a register to hold the value of two or more
variables simultaneously), and y is not live and has no next use after execution of x := y
op z, then return the register of y for L. Update the address descriptor of y to indicate that
y is no longer in L.
2. Failing (1), return an empty register for L if there is one.
3. Failing (2), if x has a next use in the block, or op is an operator, such as indexing, that
requires a register, find an occupied register R. Store the value of R into its proper
memory location M (by MOV R, M) if it is not already there, update the address
descriptor of M, and return R. If R holds the values of several variables, a MOV
instruction must be generated for each variable that needs to be stored. A suitable
occupied register might be one whose datum is referenced furthest in the future, or one
whose value is also in memory.
4. If x is not used in the block, or no suitable occupied register can be found, select the
memory location of x as L.
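The four cases of getreg can be sketched as follows. This is a simplified, hypothetical version: the spilling logic of case (3) is only indicated in a comment, and the parameter names are assumptions for the example.

```python
def getreg(x, y, live_after, reg_desc, addr_desc, registers):
    """Simplified getreg for x := y op z: return a location L for the result."""
    # Case 1: y sits alone in a register and is dead after this statement.
    ry = addr_desc.get(y)
    if ry in registers and reg_desc.get(ry) == {y} and y not in live_after:
        return ry
    # Case 2: an empty register is available.
    for r in registers:
        if not reg_desc.get(r):
            return r
    # Case 3 (omitted here): spill an occupied register R -- preferably one
    # whose value is also in memory -- by emitting MOV R, M, then return R.
    # Case 4: fall back to x's own memory location.
    return x

# y alone in R0 and dead afterwards: its register is reused.
print(getreg("x", "y", set(), {"R0": {"y"}}, {"y": "R0"}, ["R0", "R1"]))
```

Case (1) is what makes statements like t := y op z cheap when y dies: no extra register is consumed and no MOV is needed.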
A symbol table is a data structure that maintains information about the symbols
(identifiers) appearing in the source program. It is accessed by several phases of the
compiler to retrieve information about symbols.
Each symbol table entry contains two fields:
a) Name of the symbol
b) Information of the symbol
The operations performed on the symbol table are:
a) Search, to determine whether a symbol is present or not
b) Retrieval of the information for a referenced symbol
c) Update of a symbol's information
d) Addition of an entry to the symbol table, when a new symbol is referenced
e) Deletion of an entry from the symbol table, when there is no further reference to that
symbol
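The five operations above can be sketched with a minimal dictionary-backed table. The class and field names here are illustrative assumptions, not part of any particular compiler.

```python
class SymbolTable:
    """Minimal sketch of the five symbol table operations (dict-backed)."""
    def __init__(self):
        self.entries = {}            # name -> info record

    def search(self, name):          # a) is the symbol present?
        return name in self.entries

    def retrieve(self, name):        # b) fetch the symbol's information
        return self.entries.get(name)

    def update(self, name, **info):  # c) modify existing information
        self.entries[name].update(info)

    def insert(self, name, **info):  # d) add a newly referenced symbol
        self.entries[name] = dict(info)

    def delete(self, name):          # e) remove when no longer referenced
        del self.entries[name]

st = SymbolTable()
st.insert("rate", type="real", size=8)
st.update("rate", offset=16)
print(st.search("rate"), st.retrieve("rate")["type"])  # True real
```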
Record R1: Name1 | Info1
Record R2: Name2 | Info2
Record R3: Name3 | Info3
Disadvantage:
In practice, programmers rarely use the maximum name length, yet the symbol
table reserves the maximum amount of memory for every symbol name. This
wastes the memory reserved for the names of symbols.
1. In this mechanism, all symbol names are stored as strings of characters in one
linear array.
2. The symbol table maintains one record for every symbol.
3. The name field of each record contains a pointer to the starting character of that
symbol's name in the character array.
4. The name field also contains the length of the symbol name, so the exact name
can be recovered: starting from the pointer location, read as many characters as
the stored length specifies.
Record 1: Name1 (length 4) | Info1
Record 2: Name2 (length 4) | Info2
Character array: R A M A R A J A
In the above example, the Name1 field holds a pointer to the first "R" in the character
array. To obtain the symbol name, four characters are read starting from that position,
because the stored length is 4. So the name of the symbol is "RAMA".
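This mechanism can be sketched as follows, using the document's own RAMA/RAJA example. The class name and method names are assumptions for illustration.

```python
class NameArray:
    """Sketch of the shared character array mechanism: each record stores
    a (start index, length) pair into one string, instead of reserving a
    fixed-size name field per symbol."""
    def __init__(self):
        self.chars = ""        # the linear character array
        self.records = []      # (start, length, info) per symbol

    def add(self, name, info):
        start = len(self.chars)
        self.chars += name     # append the name's characters to the array
        self.records.append((start, len(name), info))

    def name_of(self, i):
        start, length, _ = self.records[i]
        return self.chars[start:start + length]

t = NameArray()
t.add("RAMA", "Info1")
t.add("RAJA", "Info2")
print(t.chars)        # RAMARAJA
print(t.name_of(0))   # RAMA
```

Only as much storage as each name actually needs is consumed, which is exactly the wastage the fixed-size mechanism suffers from.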
III) TWO TABLE MECHANISM:
1. In this mechanism the symbol table names and information are stored in separate
tables
2. The name table contains all the symbol’s names
3. The information table contains all the symbol’s information
4. The association between the name table and the information table is achieved by
an indexing mechanism maintained while inserting entries into the symbol
table
Name table: RAMA, RAJA
Information table: Info1, Info2
(RAMA is associated with Info1 and RAJA with Info2 through their common index.)
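The two-table mechanism can be sketched with a pair of parallel arrays linked by the index assigned at insertion time. The helper names are assumptions for the example.

```python
# Two separate tables: one for names, one for information.
names = []
infos = []

def insert(name, info):
    """Add a symbol; both tables grow together, sharing one index."""
    names.append(name)
    infos.append(info)
    return len(names) - 1      # the common index associating the entries

def info_of(name):
    """Find the name in the name table, then use the same index
    to fetch the entry from the information table."""
    return infos[names.index(name)]

insert("RAMA", "Info1")
insert("RAJA", "Info2")
print(info_of("RAJA"))   # Info2
```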
The symbol table can be organized using one of the following data structures:
i) linear list
ii) self organizing list
iii) search trees
iv) hash tables
i) linear list:
A list is maintained for the symbol table. The major drawback of this data
structure is that a linear search is required to access the information about a symbol,
and linear search takes time proportional to the number of entries.
ii) self organizing list:
In this data structure, each entry maintains a link to another entry in the
list. When a symbol is referenced, its entry is relinked toward the front of the list,
so the most frequently referenced entries migrate to the first positions.
The advantage of this data structure is that the search time is minimal for the
most frequently accessed symbols.
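A move-to-front list is a simple way to realize this behaviour; the sketch below uses a Python list in place of explicit links, which is an assumption made for brevity.

```python
# Self-organizing list sketch: a successful lookup moves the entry to the
# front, so frequently referenced symbols are found fastest.
table = [("temp", "Info3"), ("rate", "Info2"), ("position", "Info1")]

def lookup(name):
    for i, (n, info) in enumerate(table):
        if n == name:
            table.insert(0, table.pop(i))   # move the found entry to front
            return info
    return None

lookup("position")
print(table[0][0])   # position
```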
iv) hash tables:
The storage table maintains a set of linked lists. To access the information
of a symbol, the name of the symbol is given to a hash function, which generates
the address of the list containing the symbol's information.
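A minimal sketch of this scheme, using separate chaining with Python lists standing in for the linked lists (the bucket count and helper names are assumptions):

```python
# Hash-table sketch with separate chaining: the name is hashed to a bucket,
# and each bucket is a list of (name, info) entries.
BUCKETS = 8
table = [[] for _ in range(BUCKETS)]

def insert(name, info):
    table[hash(name) % BUCKETS].append((name, info))

def lookup(name):
    # Only the entries in one bucket are scanned, not the whole table.
    for n, info in table[hash(name) % BUCKETS]:
        if n == name:
            return info
    return None

insert("rate", "real, offset 8")
print(lookup("rate"))   # real, offset 8
```

Because a lookup scans only one short chain, the average access time stays near constant, which is why hash tables are the most common symbol table organization in practice.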
A DAG (directed acyclic graph) is a useful data structure for representing basic
blocks. It has no cycles.
Its leaves identify the names whose values are used in the block but computed
outside the block.
S1 = 4 * I
S2 = address(A)-4
S3=S2*S1
S4=4*I
S5=address(B)-4
S6=S5*S4
S7=S3*S6
S8=PROD+S7
PROD=S8
S9=I+1
I=S9
IF (I<=20 ) GOTO (1)
The corresponding DAG (sketched in the original figure) has: one shared * node labelled
S1,S4 for the common subexpression 4*I; - nodes labelled S2 and S5 for the address
computations; * nodes labelled S3 and S6; a * node labelled S7 for their product; a + node
labelled S8,PROD; a + node labelled S9,I for the increment; and a node for the test
I<=20 with branch target (1).
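DAG construction for a basic block can be sketched as follows: identical (operator, operands) combinations share one node, which is how the common subexpression S1 = 4*I and S4 = 4*I above collapse into a single node. The function and tuple layout are assumptions for this illustration.

```python
def build_dag(stmts):
    """Build a DAG from statements of the form (target, op, left, right)."""
    nodes = {}    # (op, left_node_or_leaf, right_node_or_leaf) -> node id
    where = {}    # variable name -> node id currently holding its value
    for target, op, a, b in stmts:
        # Operands that were computed earlier in the block refer to their
        # node; anything else (constants, names set outside) is a leaf.
        key = (op, where.get(a, a), where.get(b, b))
        if key not in nodes:
            nodes[key] = len(nodes)      # create a new interior node
        where[target] = nodes[key]       # attach the target label to it
    return nodes, where

stmts = [("S1", "*", "4", "I"),
         ("S4", "*", "4", "I")]          # same computation as S1
nodes, where = build_dag(stmts)
print(len(nodes), where["S1"] == where["S4"])  # 1 True
```

Both S1 and S4 end up labelling the same node, so a code generator working from the DAG computes 4*I only once.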
Let us consider that procedure ‘R’ is called by procedure ‘Q’ in the following manner: