INTRODUCTION TO COMPILERS
1.1 Compilers & Translators
A translator is a program that translates a program written in one language into another language.
A compiler is software that translates a high-level language program into a low-level language (machine language).
Eg. A compiler translates a program written in FORTRAN or COBOL into machine language.
Executing a program
Needs 2 steps:
1. Compile the source program, translating it into an object program
2. Load the object program into memory and execute it
Source program -> COMPILER -> Object program
Object program -> Load into memory & execute -> Output
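The two steps can be imitated with Python's built-in compile() and exec(). This is only an analogy: Python produces a bytecode object rather than machine code, and the variable names here are chosen for illustration.

```python
# Step 1: "compile" the source text into a code object
# (it stands in for the object program).
source = "result = 2 + 3"
code_object = compile(source, "<example>", "exec")

# Step 2: "load and execute" the compiled object in a fresh namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["result"])
```

The code object can be executed again without repeating step 1, which mirrors why running an already compiled program is fast.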
An interpreter is software that translates a high-level language program into an intermediate code that can be directly executed.
Difference:
Compiler
Produces object code, which is saved in memory
Since the object code is saved, more memory space is needed
Compiles the entire program and then lists all the errors
Execution is fast (the object code is taken from memory and executed directly)
Interpreter
Produces intermediate code, which is not saved in memory
Less memory space is needed (no object code is saved)
Interprets line by line and lists a single error at a time; the next error is listed only after the current one is corrected
Execution is slow (interpretation and execution are repeated on every run)
1.2 Need for translators
Machine language program:
A program written using 0s and 1s is called a machine language program.
Eg: 0110 001110 010101
Assembly language program:
A program written using mnemonics is called an assembly language program.
Mnemonic names are used for specifying operation codes and data addresses.
Eg: ADD X, Y
ADD is the opcode; X and Y are the addresses of data.
An assembler is software that translates an assembly language program into a low-level language (machine language).
ADD2 A,B
When a macro is called, the statements in the macro are substituted at the point of the call.
Phase:
The compilation process is divided into a series of subprocesses called phases.
A phase is an operation that takes one representation of the source program as input and produces another representation of the program as output.
The structure of a compiler contains several phases.
One or more phases can be combined to form a pass.
A pass reads the source program or the output of the previous pass, makes the transformations specified by its phases and writes the output to an intermediate file.
2 types of compiler
Single pass compiler (compilation is done in one pass)
Multi pass compiler (compilation is done in several passes)
A multipass compiler is slower than a single pass compiler because each pass reads and writes an intermediate file.
A multipass compiler occupies less space because the space occupied by the compiler for one pass can be reused by the next pass.
Syntax analysis:
Second phase of the compiler
Performed by the syntax analyzer (parser)
Groups the tokens into syntactic structures
A parse tree is generated
The syntactic structure is represented as a tree whose leaves are tokens
Eg. A + B -- A+B is an expression.
Input: stream of tokens
Output: parse tree
A parse tree represents the syntactic structure. The parser has 2 functions:
Checks that the tokens in the input form a valid syntactic structure
Imposes a tree-like structure on the tokens
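[Figure: parse tree for A+B, with + at the root and two id leaves, A and B]
[Figure: parse tree for A/B*C, with expn at the root, interior expn nodes, and id leaves A, B and C]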
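A minimal sketch of how a parser can group tokens into a tree. The grammar used here is an assumption (single-letter identifiers, with * and / binding tighter than + and -, all left associative), and the function names are hypothetical. Trees are written as nested tuples.

```python
def tokenize(text):
    """Split an expression such as 'A/B*C' into single-character tokens."""
    return [ch for ch in text if not ch.isspace()]


def parse(tokens):
    """Parse tokens into a nested-tuple tree; raise SyntaxError on bad input."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():
        nonlocal pos
        tok = peek()
        if tok is None or not tok.isalpha():
            raise SyntaxError(f"identifier expected at position {pos}")
        pos += 1
        return ("id", tok)

    def term():
        nonlocal pos
        node = factor()
        while peek() in ("*", "/"):
            op = tokens[pos]
            pos += 1
            node = (op, node, factor())  # left associative
        return node

    def expn():
        nonlocal pos
        node = term()
        while peek() in ("+", "-"):
            op = tokens[pos]
            pos += 1
            node = (op, node, term())
        return node

    tree = expn()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For example, parse(tokenize("A+B")) gives a + node with two id leaves, and parse(tokenize("A/B*C")) groups A/B first because / and * are treated as left associative.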
for i= 1 to 50
{
j=5 // loop invariant, so it can be moved outside the loop
i=i*j
}
Can be changed as
j=5
for i= 1 to 50
{
i=i*j
}
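The transformation above can be sketched as a small function. This is a deliberately simplified test for invariance, not what a production optimizer does: the representation of statements as (target, operands) pairs and the name hoist_invariants are assumptions, and the sketch ignores questions such as whether the loop body runs at least once.

```python
def hoist_invariants(loop_var, body):
    """Split a loop body into (hoisted, remaining) statements.

    body is a list of (target, operands) pairs, e.g. ("j", []) for j=5
    and ("i", ["i", "j"]) for i=i*j.
    """
    # Every variable that changes somewhere inside the loop.
    assigned = {loop_var} | {target for target, _ in body}
    hoisted, remaining = [], []
    for target, operands in body:
        assignments_to_target = sum(1 for t, _ in body if t == target)
        invariant = (
            target != loop_var
            and assignments_to_target == 1
            and all(op not in assigned for op in operands)
        )
        (hoisted if invariant else remaining).append((target, operands))
    return hoisted, remaining
```

Calling hoist_invariants("i", [("j", []), ("i", ["i", "j"])]) moves j=5 in front of the loop and leaves i=i*j inside it.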
Code generation phase
Final phase
The actual object code is generated
Memory locations for data are decided
Registers are selected
Simple code generation for A:=B+C
LOAD B
ADD C
STORE A
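The same scheme can be written as a tiny generator for statements of the form target := left op right. Only ADD appears in the text; the mnemonics SUB, MUL and DIV and the function name generate are assumptions made for the sketch.

```python
# SUB, MUL and DIV are assumed mnemonics; only ADD appears in the example above.
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}


def generate(target, left, op, right):
    """Emit accumulator-style code for: target := left op right."""
    return [f"LOAD {left}", f"{OPCODES[op]} {right}", f"STORE {target}"]


for line in generate("A", "B", "+", "C"):
    print(line)
```

A real code generator would also choose registers and memory locations; this sketch assumes a single implicit accumulator.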
Some tools have been created for the automatic design of specific compiler components. These tools use specialized languages for specifying and implementing the component.
Tools available in existing compiler-compilers:
Scanner generator
Parser generator
Syntax directed translation engine
Automatic code generator
Dataflow analysis engine
Scanner generator - automatically generates a lexical analyzer from a specification based on regular expressions.
Parser generator - automatically produces a syntax analyzer from input based on a context-free grammar.
Syntax directed translation engine - generates intermediate code in three-address format from input that consists of a parse tree. These engines have routines that traverse the parse tree and produce the intermediate code.
Automatic code generator (facilities for code generation) - generates the machine language for a target machine. Each operation of the intermediate language is translated using a collection of rules that the code generator takes as input.
Dataflow analysis engine - used in code optimization. Data flow analysis is a key part of code optimization.
BOOTSTRAPPING
Any compiler is characterized by 3 languages:
Source language
Object language
The language in which the compiler is written
A compiler that runs on one machine and produces object code for the same machine is called a pure compiler.
A compiler that runs on one machine and produces object code for another machine is called a cross compiler.
L - new language
The language L is to be implemented on 2 machines, A and B.
First, design a compiler for language L on machine A.
1. Take a subset S of language L.
For this subset, write a small compiler for machine A. This compiler should be written in a language that is already available on A.
C_A^SA
2. Then write a compiler for language L in the simple language S.
C_S^LA
This compiler, C_S^LA, when run through C_A^SA, produces a complete compiler C_A^LA.
(In C_Z^XY, X is the source language, Y is the object language and Z is the language in which the compiler is written.)
The buffer is divided into 2 halves. If the lookahead pointer travels beyond the buffer half in which it began, the other half is loaded with the next characters from the source file.
Eg
DECLARE(ARG1, ARG2, ………ARGn)
^
Whether DECLARE is a keyword or an array name cannot be determined until the character after the closing parenthesis is read.
SOURCE-BUFFER ACTUAL BUFFER
When the characters are read from the source into the buffer, i.e. at the time of preliminary scanning, the following things are done:
Delete the comments
Ignore unneeded blanks
Combine runs of blanks
Count lines
The characters are preprocessed to avoid the trouble of moving the lookahead pointer back and forth over comments and blanks.
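The preliminary scanning steps can be sketched as one function. The comment syntax /* ... */ is an assumption (the text does not name a source language), and prescan is a hypothetical name.

```python
import re


def prescan(source):
    """Return (cleaned_text, line_count) for the raw source text."""
    # Count lines before anything is deleted.
    line_count = source.count("\n") + 1

    # Delete comments, keeping their newlines so line positions survive.
    def keep_newlines(match):
        return "\n" * match.group(0).count("\n")

    text = re.sub(r"/\*.*?\*/", keep_newlines, source, flags=re.DOTALL)
    # Combine each run of blanks or tabs into a single blank.
    text = re.sub(r"[ \t]+", " ", text)
    # Ignore unneeded blanks at the start and end of lines.
    text = re.sub(r" ?\n ?", "\n", text)
    return text.strip(" "), line_count
```

After this pass the lexical analyzer never has to move its pointers over comments or long runs of blanks.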
3.3 Regular expressions
A regular expression is a notation used to describe tokens.
The set of rules that the strings of a token must follow is written as a regular expression.
Regular expn for an identifier
id = letter (letter | digit)*
| represents or (union)
* indicates zero or more occurrences
Eg:
a,ab,abc,abcd,a1,a12,a123,ab12cd,a1b1c1,adcfre234 etc….
Regular expn for a constant
const = digit+ or digit (digit)*
+ indicates one or more occurrences
Eg:
4,456,4321,67890 etc….
Regular expn for a relational operator
relop = < | <= | > | >= | <> | =
Regular expn for a keyword
keyword = begin | end | if | then | else
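These definitions carry over almost directly to Python's re module. In the sketch below, letter and digit are assumed to mean ASCII letters and digits, keywords are tried before identifiers so that begin is not reported as an id, and classify is a hypothetical helper.

```python
import re

# Transcriptions of the definitions above into Python's re syntax.
TOKEN_PATTERNS = [
    ("keyword", r"begin|end|if|then|else"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("const", r"[0-9]+"),
    ("relop", r"<=|<>|<|>=|>|="),
]


def classify(lexeme):
    """Return the first token class whose pattern matches the whole lexeme."""
    for name, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return name
    return None
```

For instance, classify("ab12cd") reports an id and classify("456") a const, while classify("1a") matches none of the patterns because an identifier must start with a letter.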
Construction rules:
1. ℇ is a regular expression denoting {ℇ}, the set containing only the empty string.
2. a is a regular expression denoting {a}, the set containing the one-symbol string a.
3. If R and S are two regular expressions, then
(R)|(S) is a regular expression
(R).(S) is a regular expression
(R)* is a regular expression
Precedence: * has the highest precedence, then comes ., and | has the lowest precedence.
A regular expression is defined in terms of primitive REs and complex REs.
Properties of regular expressions
If R, S and T are regular expressions then
1. R|S=S|R (| is commutative)
2. R|(S|T)=(R|S)|T (| is associative)
3. R.(S.T)=(R.S).T (. is associative)
4. R.(S|T)=(R.S)|(R.T),
(S|T).R=(S.R)|(T.R) (. is distributive over |)
5. ℇ.R=R.ℇ=R (ℇ is the identity for .)
Example
Form regular expressions over the set {a,b}
1. R=ℇ the regular expression for the empty string
2. R=a a single character forms an RE
3. R=a|b a or b
4. R=a* zero or more occurrences of a: ℇ,a,aa,aaa,aaaa,….
5. R=a+ one or more occurrences of a (same as a.a*): a,aa,aaa,aaaa,….
6. R=(a|b)* zero or more occurrences of a or b: ℇ,a,aa,aaa,aaaa,b,bb,bbb,ababab,bababa,baaa,….
7. R=a|(ba*) a single a, or b followed by zero or more occurrences of a: a,b,ba,baa,baaa,baaaa,….
8. R=aa|ab|ba|bb denotes the strings of length 2; (aa|ab|ba|bb)* denotes the even-length strings
9. R=ℇ|a|b denotes the strings of length 0 or 1: ℇ,a,b
10. R=(a|b)(a|b)(a|b) denotes the strings of length 3: aaa,abb,aba,bbb,baa,…
11. R=(a|b)(a|b)(a|b)(a|b)* denotes the strings of length 3 or more
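One way to check examples like these is to enumerate every string over {a,b} up to some length and keep the ones a pattern matches. The sketch uses Python's re syntax, in which ℇ as an alternative is written as an empty alternative (for example "|a|b"); the function name language is hypothetical.

```python
import re
from itertools import product


def language(pattern, max_len, alphabet="ab"):
    """List every string over alphabet, up to max_len symbols, that pattern matches in full."""
    matched = []
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            word = "".join(symbols)
            if re.fullmatch(pattern, word):
                matched.append(word)
    return matched


print(language("a|(ba*)", 3))
print(language("aa|ab|ba|bb", 4))
```

Enumerating aa|ab|ba|bb up to length 4 yields only the four strings of length 2, and (a|b)(a|b)(a|b) yields the eight strings of length 3.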
Transition diagram
A valuable tool for designing a lexical analyser
Also called a state diagram
It is a flowchart for recognizing tokens
Circles represent states.
Arrows represent edges (transitions).
Labels on edges give the input characters that can appear after a state
3.4 Finite Automata
The transition diagram for a regular expression is called a finite automaton.
A recognizer for a language L takes an input string x and checks whether x is a sentence of L. If so, it returns yes; otherwise it returns no.
To convert a regular expression into a recognizer, a transition diagram is constructed from the expression. This transition diagram is called a finite automaton.
Finite automata types
Non deterministic Finite Automata (NFA)
Deterministic Finite automata (DFA)
NFA
1. Edges may be labelled ℇ
0 --ℇ--> 1
2. The same character can label 2 or more transitions out of one state
0 --a--> 1
0 --a--> 2
DFA
1. Edges cannot be labelled ℇ (no transition on ℇ)
2. The same character cannot label 2 or more transitions out of one state
(for each state s and input symbol a, there is at most 1 edge labelled a leaving s)
NFA
An NFA is a labelled directed graph. Nodes are called states. Labelled edges are called transitions.
One state is the start state and one or more states are accepting (final) states.
Transition table: The transitions of an NFA can be represented in a table called a transition table. There is a row for each state and a column for each admissible input symbol. The entry for state i and symbol a is the set of possible next states for state i on input symbol a.
An NFA accepts an input string x if and only if there is a path from the start state to some accepting state whose edge labels spell out x.
Transition table
State    Input symbol
         a       b
0        0,1     0
1        -       2
2        -       3
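The table can be run directly by tracking the set of states the NFA might be in after each input symbol. In this sketch, state 0 is taken as the start state and state 3 as the only accepting state; the second choice is an assumption, based on 3 having no row of its own in the table.

```python
# Transition table from the text: state -> {symbol: set of next states}.
TABLE = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
}
START = 0
ACCEPTING = {3}  # assumption: 3 has no outgoing row, so it is treated as accepting


def accepts(word):
    """Return True if some path labelled by word ends in an accepting state."""
    current = {START}
    for symbol in word:
        current = {
            nxt
            for state in current
            for nxt in TABLE.get(state, {}).get(symbol, set())
        }
    return bool(current & ACCEPTING)
```

Under these assumptions the automaton accepts exactly the strings over a and b that end in abb, for example accepts("aabb") holds while accepts("abba") does not.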
R3=R1|R2
0 --ℇ--> 1 --a--> 2 --ℇ--> 5
0 --ℇ--> 3 --b--> 4 --ℇ--> 5
R4=(R3)
R5=R4*
R6=R5.R1
R7=R6.R2
R8=R7.R2
R=a
0 --a--> 1
R=b
0 --b--> 1
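After constructing the components for the basic regular expressions, proceed to combine them. Compound regular expressions are thus built up from smaller regular expressions.
For the regular expression R1|R2, construct the NFA as follows.
Given: N1 is the NFA for R1 and N2 is the NFA for R2.
[Figure: NFA for R1|R2. A new start state i has ℇ edges to N1 and to N2, and N1 and N2 each have an ℇ edge to a new accepting state f.]
[Figure: NFA with N1 and N2 between the start state i and the accepting state f.]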
Input symbol   a          b
Transitions    2 -> 3     4 -> 5
               7 -> 8     8 -> 9
                          9 -> 10
ℇ-closure(0)
Add 0 to the ℇ-closure.
Add all states that are reachable from 0 along edges labelled ℇ.
ℇ-closure(0)={0,1,2,4,7}----------A
1. From the members of A, find the states having transitions on a. Among these states, 2 and 7 have transitions, to 3 and 8.
A-a: ℇ-closure(3,8)={1,2,3,4,6,7,8}----B
2. From the members of A, find the states having transitions on b. Among these states, only 4 has a transition, to 5.
A-b: ℇ-closure(5)={1,2,4,5,6,7}----C
3. From the members of B, find the states having transitions on a. Among these states, 2 and 7 have transitions, to 3 and 8.
B-a: ℇ-closure(3,8)={1,2,3,4,6,7,8}----B
4. From the members of B, find the states having transitions on b. Among these states, 4 and 8 have transitions, to 5 and 9.
B-b: ℇ-closure(5,9)={1,2,4,5,6,7,9}----D
5. Among the states in C, 2 and 7 have transitions on a, to 3 and 8.
C-a: ℇ-closure(3,8)={1,2,3,4,6,7,8}----B
Input : NFA
Output : DFA
Method :
Define a function ℇ-closure(s)
1. s is added to ℇ-closure(s)
2. If t is in ℇ-closure(s) and there is an edge labelled ℇ from t to u, then add u to ℇ-closure(s), if u is not already there.
3. Repeat rule 2 until no more states can be added.
(ℇ-closure(s) is just the set of states that can be reached from s with ℇ transitions alone)
Computation of ℇ-closure(T)
begin
  push all states in T onto STACK;
  ℇ-closure(T):=T;
  while STACK not empty do
  begin
    pop s, the top element of STACK, off the stack;
    for each state t with an edge from s to t labelled ℇ do
      if t is not in ℇ-closure(T) do
      begin
        add t to ℇ-closure(T);
        push t onto STACK;
      end
  end
end
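A sketch of the closure computation in Python, extended with the loop that builds the DFA states. The NFA is the usual eleven-state automaton for (a|b)*abb: its moves on a and b are the ones used in the worked example, while the ℇ edges are an assumption chosen so that the closures reproduce the sets A to D computed earlier. All names are hypothetical.

```python
# Assumed NFA for (a|b)*abb with states 0..10. The epsilon edges are an
# assumption; the moves on a and b follow the worked example.
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
MOVES = {(2, "a"): 3, (7, "a"): 8, (4, "b"): 5, (8, "b"): 9, (9, "b"): 10}


def eps_closure(states):
    """Stack-based closure, following the pseudocode step by step."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)  # push the newly found state t
    return frozenset(closure)


def move(states, symbol):
    """States reachable from any member of states on one input symbol."""
    return {MOVES[(s, symbol)] for s in states if (s, symbol) in MOVES}


def subset_construction(start=0, alphabet="ab"):
    """Return (dfa_states, transitions); each DFA state is a frozenset of NFA states."""
    first = eps_closure({start})
    dfa_states, pending, transitions = [first], [first], {}
    while pending:
        current = pending.pop(0)
        for symbol in alphabet:
            target = eps_closure(move(current, symbol))
            if not target:
                continue  # no NFA state reachable, so no DFA edge
            if target not in dfa_states:
                dfa_states.append(target)
                pending.append(target)
            transitions[(current, symbol)] = target
    return dfa_states, transitions
```

With this NFA, the first four DFA states returned are the sets A, B, C and D, and one further state containing NFA state 10 is found from D on b.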
Dead state: A non-accepting state is called a dead state when its transitions on all inputs are self loops.
Non reachable state - A state that cannot be reached from the start state.
Construction of Πnew
For each group G of Π do
begin
  partition G into subgroups such that 2 states s and t of G are in the same subgroup if and only if, for all input symbols a, the states s and t have transitions on a to states in the same group of Π;
  place all subgroups formed in Πnew
end
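The construction of Πnew can be sketched as follows. The starting partition (accepting states versus the rest), the repetition until nothing changes, and the five-state example DFA are assumptions added for illustration; refine and minimize_partition are hypothetical names.

```python
def refine(partition, delta, alphabet):
    """One construction of Pi_new from Pi.

    partition: list of sets of states; delta: dict (state, symbol) -> state.
    """
    def group_index(state):
        for index, group in enumerate(partition):
            if state in group:
                return index
        return None  # used when a transition is missing

    new_partition = []
    for group in partition:
        subgroups = {}
        for state in sorted(group):
            # Two states share a signature exactly when, on every symbol,
            # they move to states in the same group of the old partition.
            signature = tuple(group_index(delta.get((state, a))) for a in alphabet)
            subgroups.setdefault(signature, set()).add(state)
        new_partition.extend(subgroups.values())
    return new_partition


def minimize_partition(states, accepting, delta, alphabet):
    """Repeat refine() until the partition stops changing."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        new_partition = refine(partition, delta, alphabet)
        if len(new_partition) == len(partition):
            return new_partition
        partition = new_partition


# Example DFA (an assumption for illustration): five states, E accepting.
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
print(minimize_partition("ABCDE", {"E"}, DELTA, "ab"))
```

Because refinement only ever splits groups, an unchanged group count means the partition is final; in this example states A and C end up in the same group and can be merged.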
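[Figure: three NFAs on inputs a and b, with states 1 to 2, 3 to 6 and 7 to 8]
Convert to single NFA
[Figure: the single NFA with start state 0; its transitions on a lead to states 2, 4 and 7, and its transitions on b lead to states 5, 6 and 8]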
ℇ-closure(0)={0,1,3,7}-------A
Aa: ℇ-closure(2,4,7)={2,4,7}-------B
Ab: ℇ-closure(8)={8}-------C
Ba: ℇ-closure(7)={7}-------D
Bb: ℇ-closure(5,8)={5,8}-------E
Ca: ℇ-closure(ɸ)=ɸ
Cb: ℇ-closure(8)={8}-------C
Da: ℇ-closure(7)={7}-------D
Db: ℇ-closure(8)={8}-------C
Ea: ℇ-closure(ɸ)=ɸ
Eb: ℇ-closure(6,8)={6,8}-------F
Fa: ℇ-closure(ɸ)=ɸ
Fb: ℇ-closure(8)={8}-------C
Nonterminals
Special symbols that denote sets of strings
Syntactic variable and syntactic category are synonyms for nonterminal
Examples
Lower case names (expn, stmt, operator….)
Italic capital letters (E, A, ..)
Productions
Rewriting rules
Each production consists of a nonterminal, followed by an arrow, followed by a string of nonterminals and terminals
Examples
stmt → begin stmt-list end
expn → expn opr expn
expn → ( expn )
expn → id
opr → + | - | *
Start symbol
The symbol on the left side of the first production
Eg:
E → E+d In this eg, E is the start symbol
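[Figure: two parse trees for id + id * id. In the first, the root expands to E + E and the right E expands to E * E. In the second, the root expands to E * E and the left E expands to E + E.]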
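That a sentence can have more than one parse tree can be checked by counting. The sketch assumes the grammar E → E + E | E * E | id, which is the expn → expn opr expn pattern restricted to two operators; count_parse_trees is a hypothetical name.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees under the assumed grammar E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        # Try every operator position as the top-level split.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("+", "*"):
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["id", "+", "id", "*", "id"]))
```

A count above 1 means the grammar is ambiguous for that sentence; id + id * id has two trees, one for each choice of top-level operator.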
Parse trees
A graphical representation of a derivation can be created.
This representation is called a parse tree.
Each interior node of the parse tree is labelled by some nonterminal.
The children of the node are labelled by the symbols on the right side of the production.
Eg:
A → XYZ is a production
[Figure: node A with children X, Y and Z]
[Figure: parse tree for - ( id + id ), built up step by step: E, then - E, then - ( E ), then - ( E + E ), then - ( id + E ), then - ( id + id )]
Ambiguity
Precedence of operators
Unary -, ^ , * , / , + , -
Now, using the associativity and precedence rules, rewrite the grammar