Introduction
What is a Compiler?
A compiler translates the code written in one language to some other
language without changing the meaning of the program. It is also expected that a
compiler should make the target code efficient and optimized in terms of time and
space.
Compiler design principles provide an in-depth view of the translation and
optimization process. Compiler design covers the basic translation mechanism and
error detection & recovery. It includes lexical, syntax, and semantic analysis as
the front end, and code generation and optimization as the back end.
Computers are a balanced mix of software and hardware. Hardware is just a
mechanical device whose functions are controlled by compatible software.
Hardware understands instructions in the form of electronic charge, which is the
counterpart of binary language in software programming. Binary language has
only two symbols, 0 and 1. To instruct the hardware, codes must be written in
binary format, which is simply a series of 1s and 0s. Writing such codes would be
a difficult and cumbersome task for computer programmers, which is why we have
compilers to produce such codes.
Language Processing System
We have learnt that any computer system is made of hardware and software.
The hardware understands a language, which humans cannot understand. So we
write programs in high-level language, which is easier for us to understand and
remember. These programs are then fed into a series of tools and OS components
to get the desired code that can be used by the machine. This is known as
Language Processing System.
A compiler reads the whole source program before translating it, and its
translation may involve many passes. In contrast, an interpreter reads a statement
from the input, converts it to an intermediate code, executes it, then takes the next
statement in sequence. If an error occurs, an interpreter stops execution and
reports it, whereas a compiler reads the whole program even if it encounters
several errors.
Assembler
An assembler translates assembly language programs into machine code. The
output of an assembler is called an object file, which contains a combination of
machine instructions as well as the data required to place these instructions in
memory.
Linker
A linker is a computer program that links and merges various object files
together in order to make an executable file. All these files might have been
compiled by separate assemblers. The major task of a linker is to search and locate
referenced modules/routines in a program and to determine the memory locations
where these codes will be loaded, so that the program instructions have absolute
references.
Loader
A loader is a part of the operating system and is responsible for loading
executable files into memory and executing them. It calculates the size of a
program (instructions and data), creates memory space for it, and initializes
various registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable
code for platform (B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and
translates it into the source code of another programming language is called a
source-to-source compiler.
A compiler can broadly be divided into two phases based on the way it
compiles:
(i) Analysis Phase and
(ii) Synthesis Phase
Analysis Phase
Known as the front end of the compiler, the analysis phase reads the source
program, divides it into core parts, and then checks for lexical, grammar, and
syntax errors. The analysis phase generates an intermediate representation of the
source program and the symbol table, which are fed to the synthesis phase as
input.
Synthesis Phase
Known as the back end of the compiler, the synthesis phase generates the
target program with the help of the intermediate code representation and the
symbol table.
Code Generation
In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator
translates the intermediate code into a sequence of (generally) relocatable machine
code. This sequence of machine-code instructions performs the task just as the
intermediate code would.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All
the identifiers' names along with their types are stored here. The symbol table
makes it easier for the compiler to quickly search for an identifier record and
retrieve it. The symbol table is also used for scope management.
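The scope-management role described above can be sketched as a stack of dictionaries: entering a scope pushes a new table, and lookup searches from the innermost scope outward. The class and method names below are illustrative, not from any particular compiler.

```python
# A minimal sketch of a scoped symbol table; real compilers also store
# sizes, memory offsets, and other attributes per identifier.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]          # stack of scopes; index 0 is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def insert(self, name, type_):
        self.scopes[-1][name] = type_

    def lookup(self, name):
        # search the innermost scope first, then the enclosing scopes
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

table = SymbolTable()
table.insert("count", "int")
table.enter_scope()
table.insert("count", "float")   # shadows the outer declaration
print(table.lookup("count"))     # float
table.exit_scope()
print(table.lookup("count"))     # int
```

Shadowing falls out of the search order: the inner `count` hides the outer one until its scope is exited.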
Cousins of Compiler
1. Preprocessor
2. Assembler
3. Loader and Link-editor
Preprocessor
A preprocessor is a program that processes its input data to produce output
that is used as input to another program. The output is said to be a preprocessed
form of the input data, which is often used by some subsequent programs like
compilers. A preprocessor may perform the following functions:
1. Macro processing
2. File inclusion
3. Rational preprocessors
4. Language extension
1. Macro processing:
Department of Computer Science,
Sri Lakshmi College of Arts & Science,
Kallakurichi
COMPILER DESIGN UNIT - 1
A linker or link editor is a program that takes one or more objects generated
by a compiler and combines them into a single executable program. Three tasks of
the linker are
1. Searches the program to find library routines used by program, e.g. printf(),
math routines.
2. Determines the memory locations that code from each module will occupy and
relocates its instructions by adjusting absolute references
3. Resolves references among files.
A loader is the part of an operating system that is responsible for loading
programs in memory, one of the essential stages in the process of starting a
program.
Compiler Construction Tools
These are specialized tools that have been developed for helping implement
various phases of a compiler. The following are the compiler construction tools:
1) Parser Generators:
These produce syntax analyzers, normally from input that is based on a context-
free grammar.
Syntax analysis consumes a large fraction of the running time of a compiler.
Example: YACC (Yet Another Compiler-Compiler).
2) Scanner Generator:
These generate lexical analyzers, normally from a specification based on regular
expressions. The basic organization of the generated lexical analyzer is based on
finite automata.
3) Syntax-Directed Translation:
These produce routines that walk the parse tree and as a result generate
intermediate code. -Each translation is defined in terms of translations at its
neighbor nodes in the tree.
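The idea that each node's translation is built from the translations of the nodes around it can be sketched as a post-order walk of a tiny expression tree that emits three-address intermediate code. The node format (tuples) and temporary naming (`t1`, `t2`, …) are assumptions for this example, not a standard.

```python
import itertools

counter = itertools.count(1)   # generates temporary names t1, t2, ...

def gen(node, code):
    """Post-order walk: translate children first, then combine them."""
    if isinstance(node, str):          # leaf: a variable name
        return node
    op, left, right = node
    l = gen(left, code)
    r = gen(right, code)
    t = f"t{next(counter)}"
    code.append(f"{t} = {l} {op} {r}")  # this node's translation uses the
    return t                            # translations of its children

code = []
gen(("+", "a", ("*", "b", "c")), code)   # the expression a + b * c
print(code)   # ['t1 = b * c', 't2 = a + t1']
```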
4) Automatic Code Generators:
These take a collection of rules that translate the intermediate language into
machine language. The rules must include sufficient detail to handle the different
possible access methods for data.
5) Data-Flow Engines:
These perform code optimization using data-flow analysis, that is, the gathering
of information about how values are transmitted from one part of a program to
each other part.
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source
code from language preprocessors, written in the form of sentences. The lexical
analyzer breaks the source code into a series of tokens, removing any whitespace
and comments.
A simple way to build a lexical analyzer is to construct a diagram that
illustrates the structure of the tokens of the source language, and then to hand-
translate the diagram into a program for finding tokens. Efficient lexical analyzers
can be produced in this manner.
If the lexical analyzer finds an invalid token, it generates an error. The lexical
analyzer works closely with the syntax analyzer. It reads character streams from
the source code, checks for legal tokens, and passes the data to the syntax analyzer
on demand.
Tokens
Special Symbols
Language
A language is considered as a set of strings over some finite alphabet of
symbols. Computer languages are considered as sets, and set operations can be
performed on them mathematically. Finite languages can be described by means of
regular expressions.
Longest Match Rule
When the lexical analyzer reads the source code, it scans the code letter by
letter; when it encounters a whitespace, operator symbol, or special symbol, it
decides that a word is completed.
For example:
int intvalue;
While scanning the lexeme up to ‘int’, the lexical analyzer cannot determine
whether it is the keyword int or the prefix of the identifier intvalue.
The Longest Match Rule states that the lexeme scanned should be
determined based on the longest match among all the tokens available.
The lexical analyzer also follows rule priority, where a reserved word, e.g., a
keyword, of a language is given priority over user-defined identifiers. That is, if a
lexeme matches an existing reserved word, the lexical analyzer classifies it as that
reserved word; such a name cannot be used as an identifier.
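The two rules above can be sketched in a few lines: at each position we try every pattern, keep the longest match, and break ties in favor of the pattern listed first (so reserved words beat identifiers). The token names and the tiny pattern set are made up for this illustration.

```python
import re

# Illustrative token specification: KEYWORD is listed before ID so that
# reserved words win ties (rule priority).
TOKEN_SPECS = [
    ("KEYWORD", r"int|while|if|return"),
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",  r"[0-9]+"),
    ("SYMBOL",  r"[;=+\-*/]"),
]

def tokenize(code):
    tokens, pos = [], 0
    while pos < len(code):
        if code[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in TOKEN_SPECS:
            m = re.match(pattern, code[pos:])
            # Longest Match Rule: a strictly longer match replaces the
            # current best; an equal-length match keeps the earlier spec.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"invalid token at position {pos}")
        tokens.append((best[0], code[pos:pos + best[1]]))
        pos += best[1]
    return tokens

print(tokenize("int intvalue;"))
# 'intvalue' matches ID with 8 characters, longer than the KEYWORD match
# 'int' (3 characters), so it is classified as an identifier.
```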
The lexical analyzer needs to scan and identify only a finite set of valid
strings/tokens/lexemes that belong to the language in hand. It searches for the
patterns defined by the language rules.
Regular expressions have the capability to express finite languages by
defining a pattern for finite strings of symbols. The grammar defined by regular
expressions is known as regular grammar. The language defined by regular
grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each
pattern matches a set of strings, so regular expressions serve as names for a set of
strings. Programming language tokens can be described by regular languages. The
specification of regular expressions is an example of a recursive definition.
Regular languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions,
which can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
Union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
The Kleene Closure of a language L is written as
L* = Zero or more occurrence of language L.
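These three operations can be tried directly on small finite sets of strings. Since the Kleene closure is an infinite set, the sketch below enumerates it only up to a bounded string length; the helper name is ours.

```python
# Language operations on small finite sets of strings.
L = {"a", "b"}
M = {"c"}

union = L | M                                   # {s : s in L or s in M}
concatenation = {s + t for s in L for t in M}   # {st : s in L, t in M}

# L* is infinite, so enumerate only strings built from at most k pieces of L.
def kleene_up_to(lang, k):
    result, current = {""}, {""}
    for _ in range(k):
        current = {s + t for s in current for t in lang}
        result |= current
    return result

print(union)                        # {'a', 'b', 'c'} (in some order)
print(concatenation)                # {'ac', 'bc'} (in some order)
print(sorted(kleene_up_to(L, 2)))   # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```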
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) U L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
x* means zero or more occurrences of x,
i.e., it can generate { ε, x, xx, xxx, xxxx, … }
x+ means one or more occurrences of x,
i.e., it can generate { x, xx, xxx, xxxx, … }; that is, x+ = x.x*
x? means at most one occurrence of x,
i.e., it can generate either {x} or {ε}.
[a-z] denotes all lower-case alphabets of the English language.
[A-Z] denotes all upper-case alphabets of the English language.
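These operators map directly onto Python's `re` syntax, so each claim above can be checked with `re.fullmatch`, which tests whether the whole string belongs to the described language:

```python
import re

# fullmatch returns a match object when the entire string matches, else None.
assert re.fullmatch(r"x*", "") is not None        # x*: zero occurrences ...
assert re.fullmatch(r"x*", "xxx") is not None     # ... or more
assert re.fullmatch(r"x+", "") is None            # x+ needs at least one x
assert re.fullmatch(r"x?", "xx") is None          # x? allows at most one x
assert re.fullmatch(r"[a-z]+", "token") is not None
assert re.fullmatch(r"[A-Z][a-z]*", "Word") is not None
print("all regular-expression checks passed")
```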
The transition function (δ) maps a state and an input symbol to a state,
δ : Q × Σ ➔ Q, where Q is the finite set of states and Σ is the finite set of input
symbols.
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automata (FA).
States: States of FA are represented by circles. State names are written inside
circles.
Start state: The state from which the automaton starts is known as the start state.
The start state has an arrow pointing towards it.
Intermediate states: All intermediate states have at least two arrows; one pointing
to and another pointing out from them.
Final state: If the input string is successfully parsed, the automaton is expected to
end in this state. A final state is represented by double circles.
Transition: A transition from one state to another happens when the desired
symbol is found in the input. Upon a transition, the automaton can either move to
the next state or stay in the same state. Movement from one state to another is
shown as a directed arrow, where the arrow points to the destination state. If the
automaton stays in the same state, an arrow pointing from the state to itself is
drawn.
Example: We assume an FA that accepts any three-digit binary value ending in the
digit 1: FA = (Q, Σ, δ, q0, F), where Q = {q0, qf}, Σ = {0, 1}, q0 is the start state,
and F = {qf}.
Example:
Q = {q0, q1, q2}
∑ = {0, 1}
q0 = {q0}
F = {q2}
Solution:
Transition Diagram:
Transition Table:
Example:
DFA with ∑ = {0, 1} accepting all strings starting with 0.
Solution:
Explanation:
In the above diagram, we can see that on input 0 in state q0, the DFA
changes state to q1, and it always reaches the final state q1 when the input starts
with 0. It can accept 00, 01, 000, 001, etc. It cannot accept any string which starts
with 1, because it will never reach the final state on a string starting with 1.
Example:
DFA with ∑ = {0, 1} accepting all strings ending with 0.
Solution:
Explanation:
In the above diagram, we can see that on input 0 in state q0, the DFA
changes state to q1. It can accept any string which ends with 0, like 00, 10, 110,
100, etc. It cannot accept any string which ends with 1, because it will never reach
the final state q1 on input 1; so any string ending with 1 will be rejected.
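The "ends with 0" machine above is small enough to run directly. A minimal sketch of a DFA simulator, with the transition function encoded as a dictionary keyed by (state, symbol):

```python
# Transition table of the DFA that accepts strings ending with 0:
# q0 is the start state, q1 the (only) final state.
delta = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}

def accepts(string, start="q0", finals=("q1",)):
    state = start
    for symbol in string:
        state = delta[(state, symbol)]   # one move per input symbol
    return state in finals               # accept iff we end in a final state

print(accepts("110"))   # True  - ends with 0
print(accepts("01"))    # False - ends with 1
```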
A finite automaton is formally defined as a 5-tuple (Q, ∑, δ, q0, F), where,
Q: finite set of states
∑: finite set of input symbols
q0: initial state
F: set of final states
δ: transition function
Graphical Representation of an NFA
An NFA can be represented by a digraph called a state diagram, in which:
States are represented by vertices.
Arcs labeled with input characters show the transitions.
The initial state is marked with an arrow.
Final states are denoted by double circles.
Example:
Q = {q0, q1, q2}
∑ = {0, 1}
q0 = {q0}
F = {q2}
Solution:
Transition diagram:
Transition Table:
In the above diagram, we can see that when the current state is q0, on input
0, the next state will be q0 or q1, and on 1 input the next state will be q1. When the
current state is q1, on input 0 the next state will be q2 and on 1 input, the next state
will be q0. When the current state is q2, on 0 input the next state is q2, and on 1
input the next state will be q1 or q2.
Example:
NFA with ∑ = {0, 1} that accepts all strings containing 01.
Solution:
Transition Table:
Example:
Solution:
Transition Table:
We are going to prove that the DFA obtained from an NFA by the conversion
algorithm accepts the same language as the NFA.
The NFA that recognizes a language L is denoted by M1 = < Q1, Σ, q1,0, δ1, A1 >
and the DFA obtained by the conversion is denoted by M2 = < Q2, Σ, q2,0, δ2, A2 >.
First we prove by induction on strings that δ1*( q1,0 , w ) = δ2*( q2,0 , w ) for
any string w. Once that is proven, it obviously implies that the NFA M1 and the
DFA M2 accept the same strings.
Basis Step: For w = ε, δ1*( q1,0 , ε ) = { q1,0 } = q2,0 = δ2*( q2,0 , ε ), since by
construction the start state of the DFA is the set { q1,0 }.
Inductive Step: Assume that δ1*( q1,0 , w ) = δ2*( q2,0 , w ) for an arbitrary string
w. --- Induction Hypothesis
For the string w and an arbitrary symbol a in Σ,
δ1*( q1,0 , wa )
= δ2( δ1*( q1,0 , w ) , a ) [by the construction of δ2]
= δ2( δ2*( q2,0 , w ) , a ) [by the induction hypothesis]
= δ2*( q2,0 , wa ) [by the definition of δ2*]
Thus for any string w, δ1*( q1,0 , w ) = δ2*( q2,0 , w ) holds.
End of Proof
Algorithm
Given NFA N = (Σ, Q, q0, F, δ) we want to build DFA D= (Σ, Q’, q’0, F’,
δ’). Here’s how:
Initially Q’ = ∅.
Add q0 to Q’.
For each state in Q find the set of possible states for each input symbol using N’s
transition table, δ. Add this set of states to Q’, if it is not already there.
The set of final states of D, F’, will be all of the states in Q’ that contain in
them a state that is in F.
Working through an example may aid in understanding these steps.
Consider the following NFA N = (Σ, Q, q0, F, δ)
Σ = {a, b, c}
Q = {q0, q1, q2}
F = {q2}
We wish to construct DFA D= (Σ, Q’, q’0, F’, δ’). Following the steps in the
conversion algorithm:
Q’ = ∅
Q’ = {q0}
For every state in Q, find the set of states reached for each input symbol using
N’s transition table, δ. If a set is not already in Q’, add it. We can start building
the transition table δ’ for D by first examining q0.
To fill in the transitions for {q0, q1}, you need to determine per symbol how
each individual state in the set transitions and take the union of these states.
δ'({q0, q1}, a) = δ(q0,a) ∪ δ(q1,a) = {q0, q1} ∪ ∅ = {q0, q1}
δ'({q0, q1}, b) = δ(q0,b) ∪ δ(q1,b) = {q0} ∪ {q2} = {q0, q2}
δ'({q0, q1}, c) = δ(q0,c) ∪ δ(q1,c) ={q2} ∪ ∅ = {q2}
The only new state here is {q0, q2}. We add it to Q’:
Q’ = {q0, {q0, q1}, q2, {q0, q2}}
And update the transition table δ’:
There are no new states to add to Q’. The updated transition table looks like:
Since we’ve inspected all of the states in Q and no longer have any states to
add to Q’, we are finished. D’s set of final states F’ are those states that include
states in F. So, F’ = {q2, {q0, q2}}.
For our completed DFA D= (Σ, Q’, q’0, F’, δ’):
Σ = {a, b, c}
Q’ = {q0, {q0, q1}, q2, {q0, q2}}
q’0=q0
F’ = {q2, {q0, q2}}
δ’ is:
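The whole walkthrough can be checked in code. A sketch of the subset construction for this example NFA: δ(q2, ·) is not shown in the text, so it is assumed empty here, and the empty set (a dead state) is left implicit, as in the walkthrough.

```python
from collections import deque

SIGMA = ["a", "b", "c"]
# The example NFA's transition table; delta(q2, .) is assumed empty.
delta = {
    ("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q0", "c"): {"q2"},
    ("q1", "a"): set(),        ("q1", "b"): {"q2"}, ("q1", "c"): set(),
    ("q2", "a"): set(),        ("q2", "b"): set(),  ("q2", "c"): set(),
}
nfa_finals = {"q2"}

start = frozenset({"q0"})            # the DFA's start state is {q0}
dfa_delta, worklist, dfa_states = {}, deque([start]), {start}
while worklist:
    S = worklist.popleft()
    for sym in SIGMA:
        # union of the NFA moves of every state in S on this symbol
        T = frozenset(q for s in S for q in delta[(s, sym)])
        if T:                        # skip the implicit dead state
            dfa_delta[(S, sym)] = T
            if T not in dfa_states:
                dfa_states.add(T)
                worklist.append(T)

# final DFA states are those containing an NFA final state
dfa_finals = {S for S in dfa_states if S & nfa_finals}
print(sorted("".join(sorted(S)) for S in dfa_states))
# ['q0', 'q0q1', 'q0q2', 'q2']
```

This reproduces Q’ = {q0, {q0, q1}, {q2}, {q0, q2}} and F’ = {{q2}, {q0, q2}} from the walkthrough.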
Step 3 − We mark the remaining state pairs transitively (with a green check mark).
If we input 1 to states ‘a’ and ‘f’, they go to states ‘c’ and ‘f’ respectively. (c, f) is
already marked, hence we mark the pair (a, f). Now, if we input 1 to states ‘b’ and
‘f’, they go to states ‘d’ and ‘f’ respectively. (d, f) is already marked, hence we
mark the pair (b, f).
After step 3, the state combinations {a, b}, {c, d}, {c, e}, {d, e} remain unmarked.
We can recombine {c, d}, {c, e}, {d, e} into {c, d, e}.
Hence we get two combined states: {a, b} and {c, d, e}.
So the final minimized DFA will contain three states: {f}, {a, b} and {c, d, e}.
Example
Let us consider the following DFA −
There are three states in the reduced DFA. The reduced DFA is as follows −
What is a token?
A lexical token is a sequence of characters that can be treated as a unit in the
grammar of a programming language.
Example of tokens:
Type token (id, number, real, . . . )
Punctuation tokens (IF, void, return, . . . )
Alphabetic tokens (keywords)
Keywords; examples: for, while, if, etc.
Identifiers; examples: variable names, function names, etc.
Operators; examples: '+', '++', '-', etc.
Separators; examples: ',', ';', etc.
Example of Non-Tokens:
Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.
Lexeme: The sequence of characters matched by a pattern to form the
corresponding token, or a sequence of input characters that comprises a single
token, is called a lexeme, e.g., “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;”.
How Lexical Analyzer functions
1. Tokenization, i.e., dividing the program into valid tokens.
2. Remove white space characters.
3. Remove comments.
4. It also helps in generating error messages by providing the row number and
column number of the error.
The lexical analyzer identifies errors with the help of the automaton and the
grammar of the given language (e.g., C, C++) and gives the row number and
column number of the error.
Suppose we pass the statement a = b + c ; through the lexical analyzer. It will
generate a token sequence like this:
id = id + id ;
where each id refers to its variable in the symbol table, referencing all details.
For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens.
We can observe that we have omitted comments.
As another example, consider the printf statements in the exercises below.
Exercise 1:
Count number of tokens :
int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}
Answer: Total number of tokens: 27.
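The count can be checked with a rough regex-based tokenizer. The pattern below treats a string literal, a run of word characters, or any single punctuation character as one token; it is a sketch, not a full C lexer.

```python
import re

code = '''int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}'''

# One alternative per token shape: string literal, identifier/number,
# or a single punctuation/operator character.
tokens = re.findall(r'"[^"]*"|\w+|[^\s\w]', code)
print(len(tokens))   # 27
```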
Exercise 2:
Count number of tokens :
int max(int i);
The lexical analyzer first reads int and finds it to be valid, accepting it as a
token. max is then read and found to be a valid function name; after reading (,
int is also a token, then i is another token, and finally ; is read.
Answer: Total number of tokens: 7:
int, max, ( , int, i, ), ;