You are on page 1of 43

COMPILER DESIGN UNIT - 1

Introduction
What is Compiler?
A compiler translates the code written in one language to some other
language without changing the meaning of the program. It is also expected that a
compiler should make the target code efficient and optimized in terms of time and
space.
Compiler design principles provide an in-depth view of translation and
optimization process. Compiler design covers basic translation mechanism and
error detection & recovery. It includes lexical, syntax, and semantic analysis as
front end, and code generation and optimization as back-end.
Computers are a balanced mix of software and hardware. Hardware is just a
piece of mechanical device and its functions are being controlled by a compatible
software. Hardware understands instructions in the form of electronic charge,
which is the counterpart of binary language in software programming. Binary
language has only two alphabets, 0 and 1. To instruct, the hardware codes must be
written in binary format, which is simply a series of 1s and 0s. It would be a
difficult and cumbersome task for computer programmers to write such codes,
which is why we have compilers to write such codes.
Language Processing System
We have learnt that any computer system is made of hardware and software.
The hardware understands a language, which humans cannot understand. So we
write programs in high-level language, which is easier for us to understand and
remember. These programs are then fed into a series of tools and OS components
to get the desired code that can be used by the machine. This is known as
Language Processing System.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 1
COMPILER DESIGN UNIT - 1

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 2
COMPILER DESIGN UNIT - 1

The high-level language is converted into binary language in various phases.


A compiler is a program that converts high-level language to assembly language.
Similarly, an assembler is a program that converts the assembly language to
machine-level language.
Let us first understand how a program, using C compiler, is executed on a
host machine.
 User writes a program in C language (high-level language).
 The C compiler, compiles the program and translates it to assembly program
(low-level language).
 An assembler then translates the assembly program into machine code
(object).
 A linker tool is used to link all the parts of the program together for
execution (executable machine code).
 A loader loads all of them into memory and then the program is executed.
Before diving straight into the concepts of compilers, we should understand a
few other tools that work closely with compilers.
Preprocessor
A preprocessor, generally considered as a part of compiler, is a tool that
produces input for compilers. It deals with macro-processing, augmentation, file
inclusion, language extension, etc.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level
machine language. The difference lies in the way they read the source code or
input. A compiler reads the whole source code at once, creates tokens, checks
semantics, generates intermediate code, executes the whole program and may
Department of Computer Science,
Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 3
COMPILER DESIGN UNIT - 1

involve many passes. In contrast, an interpreter reads a statement from the input,
converts it to an intermediate code, executes it, then takes the next statement in
sequence. If an error occurs, an interpreter stops execution and reports it. whereas a
compiler reads the whole program even if it encounters several errors.
Assembler
An assembler translates assembly language programs into machine code.The
output of an assembler is called an object file, which contains a combination of
machine instructions as well as the data required to place these instructions in
memory.
Linker
Linker is a computer program that links and merges various object files
together in order to make an executable file. All these files might have been
compiled by separate assemblers. The major task of a linker is to search and locate
referenced module/routines in a program and to determine the memory location
where these codes will be loaded, making the program instruction to have absolute
references.
Loader
Loader is a part of operating system and is responsible for loading
executable files into memory and execute them. It calculates the size of a program
(instructions and data) and creates memory space for it. It initializes various
registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable
code for platform (B) is called a cross-compiler.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 4
COMPILER DESIGN UNIT - 1

Source-to-source Compiler
A compiler that takes the source code of one programming language and
translates it into the source code of another programming language is called a
source-to-source compiler.
A compiler can broadly be divided into two phases based on the way they
compile.
(i) Analysis Phase and
(ii) Synthesis Phase
Analysis Phase
Known as the front-end of the compiler, the analysis phase of the compiler
reads the source program, divides it into core parts and then checks for lexical,
grammar and syntax errors. The analysis phase generates an intermediate
representation of the source program and symbol table, which should be fed to the
Synthesis phase as input.

Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the
target program with the help of intermediate source code representation and
symbol table.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 5
COMPILER DESIGN UNIT - 1

A compiler can have many phases and passes.


Pass: A pass refers to the traversal of a compiler through the entire program.
Phase: A phase of a compiler is a distinguishable stage, which takes input from the
previous stage, processes and yields output that can be used as input for the next
stage. A pass can have more than one phase.
The compilation process is a sequence of various phases. Each phase takes
input from its previous stage, has its own representation of source program, and
feeds its output to the next phase of the compiler. Let us understand the phases of a
compiler.
Lexical Analysis
The first phase of scanner works as a text scanner. This phase scans the
source code as a stream of characters and converts it into meaningful lexemes.
Lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
Syntax Analysis
The next phase is called the syntax analysis or parsing. It takes the token
produced by lexical analysis as input and generates a parse tree (or syntax tree). In
this phase, token arrangements are checked against the source code grammar, i.e.
the parser checks if the expression made by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the
rules of language. For example, assignment of values is between compatible data
types, and adding string to an integer. Also, the semantic analyzer keeps track of
identifiers, their types and expressions; whether identifiers are declared before use
or not etc. The semantic analyzer produces an annotated syntax tree as an output.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 6
COMPILER DESIGN UNIT - 1

Intermediate Code Generation


After semantic analysis the compiler generates an intermediate code of the
source code for the target machine. It represents a program for some abstract
machine. It is in between the high-level language and the machine language. This
intermediate code should be generated in such a way that it makes it easier to be
translated into the target machine code.
Code Optimization
The next phase does code optimization of the intermediate code.
Optimization can be assumed as something that removes unnecessary code lines,
and arranges the sequence of statements in order to speed up the program
execution without wasting resources (CPU, memory).

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 7
COMPILER DESIGN UNIT - 1

Code Generation
In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator
translates the intermediate code into a sequence of (generally) re-locatable machine
code. Sequence of instructions of machine code performs the task as the
intermediate code would do.
Symbol Table
It is a data-structure maintained throughout all the phases of a compiler. All
the identifier's names along with their types are stored here. The symbol table
makes it easier for the compiler to quickly search the identifier record and retrieve
it. The symbol table is also used for scope management.
Cousins of Compiler
1. Preprocessor
2. Assembler
3. Loader and Link-editor
Preprocessor
A preprocessor is a program that processes its input data to produce output
that is used as input to another program. The output is said to be a preprocessed
form of the input data, which is often used by some subsequent programs like
compilers. They may perform the following functions:
1. Macro processing 3. Rational Preprocessors
2. File Inclusion 4. Language extension

1. Macro processing:
Department of Computer Science,
Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 8
COMPILER DESIGN UNIT - 1

A macro is a rule or pattern that specifies how a certain input sequence


should be mapped to an output sequence according to a defined procedure. The
mapping process that instantiates a macro into a specific output sequence is known
as macro expansion.
2. File Inclusion:
Preprocessor includes header files into the program text. When the
preprocessor finds an #include directive it replaces it by the entire content of the
specified file.
3. Rational Preprocessors:
These processors change older languages with more modern flow-of-control
and data-structuring facilities.
4. Language extension:
These processors attempt to add capabilities to the language by what
amounts to built-in macros. For example, the language Equel is a database query
language embedded in C.
Assembler
Assembler creates object code by translating assembly instruction
mnemonics into machine code. There are two types of assemblers:
One-pass assemblers go through the source code once and assume that all symbols
will be defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first
pass, and then use the table in a second pass to generate code

Linker and Loader

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 9
COMPILER DESIGN UNIT - 1

A linker or link editor is a program that takes one or more objects generated
by a compiler and combines them into a single executable program. Three tasks of
the linker are
1. Searches the program to find library routines used by program, e.g. printf(),
math routines.
2. Determines the memory locations that code from each module will occupy and
relocates its instructions by adjusting absolute references
3. Resolves references among files.
A loader is the part of an operating system that is responsible for loading
programs in memory, one of the essential stages in the process of starting a
program.
Compiler Construction Tools
These are specialized tools that have been developed for helping implement
various phases of a compiler. The following are the compiler construction tools:
1) Parser Generators:
These produce syntax analyzers, normally from input that is based on a context-
free grammar.
It consumes a large fraction of the running time of a compiler. -Example-YACC
(Yet Another Compiler-Compiler).
2) Scanner Generator:
These generate lexical analyzers, normally from a specification based on regular
expressions. -The basic organization of lexical analyzers is based on finite
automation.

3) Syntax-Directed Translation:

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 10
COMPILER DESIGN UNIT - 1

These produce routines that walk the parse tree and as a result generate
intermediate code. -Each translation is defined in terms of translations at its
neighbor nodes in the tree.
4) Automatic Code Generators:
It takes a collection of rules to translate intermediate language into machine
language. The rules must include sufficient details to handle different possible
access methods for data.
5) Data-Flow Engines:
It does code optimization using data-flow analysis, that is, the gathering of
information about how values are transmitted from one part of a program to each
other part.
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source
code from language preprocessors that are written in the form of sentences. The
lexical analyzer breaks these syntaxes into a series of tokens, by removing any
whitespace or comments in the source code.
A simple way to build lexical analyzer is to construct a diagram that
illustrates the structure of the tokens of the source language, and then to hand-
translate the diagram into a program for finding tokens. Efficient lexical analysers
can be produced in this manner.
If the lexical analyzer finds a token invalid, it generates an error. The lexical
analyzer works closely with the syntax analyzer. It reads character streams from
the source code, checks for legal tokens, and passes the data to the syntax analyzer
when it demands.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 11
COMPILER DESIGN UNIT - 1

Role of Lexical Analyser


The lexical analyzer is the first phase of compiler. Its main task is to read the
input characters and produces output a sequence of tokens that the parser uses for
syntax analysis. As in the figure, upon receiving a “get next token” command from
the parser the lexical analyzer reads input characters until it can identify the next
token.

Tokens
Department of Computer Science,
Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 12
COMPILER DESIGN UNIT - 1

Lexemes are said to be a sequence of characters (alphanumeric) in a token.


There are some predefined rules for every lexeme to be identified as a valid token.
These rules are defined by grammar rules, by means of a pattern. A pattern
explains what can be a token, and these patterns are defined by means of regular
expressions.
In programming language, keywords, constants, identifiers, strings,
numbers, operators and punctuations symbols can be considered as tokens.
For example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
Any finite set of symbols {0,1} is a set of binary alphabets,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets, {a-z, A-Z} is
a set of English language alphabets.
Strings
Any finite sequence of alphabets is called a string. Length of the string is the
total number of occurrence of alphabets, e.g., the length of the string
computerscience is 15 and is denoted by | computerscience | = 15. A string having
no alphabets, i.e. a string of zero length is known as an empty string and is denoted
by ε (epsilon).

Special Symbols

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 13
COMPILER DESIGN UNIT - 1

A typical high-level language contains the following symbols:-

Language
A language is considered as a finite set of strings over some finite set of
alphabets. Computer languages are considered as finite sets, and mathematically
set operations can be performed on them. Finite languages can be described by
means of regular expressions.
Longest Match Rule
When the lexical analyzer read the source-code, it scans the code letter by
letter; and when it encounters a whitespace, operator symbol, or special symbols, it
decides that a word is completed.
For example:
int intvalue;

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 14
COMPILER DESIGN UNIT - 1

While scanning both lexemes till ‘int’, the lexical analyzer cannot determine
whether it is a keyword int or the initials of identifier int value.
The Longest Match Rule states that the lexeme scanned should be
determined based on the longest match among all the tokens available.
The lexical analyzer also follows rule priority where a reserved word, e.g., a
keyword, of a language is given priority over user input. That is, if the lexical
analyzer finds a lexeme that matches with any existing reserved word, it should
generate an error.
The lexical analyzer needs to scan and identify only a finite set of valid
string/token/lexeme that belong to the language in hand. It searches for the pattern
defined by the language rules.
Regular expressions have the capability to express finite languages by
defining a pattern for finite strings of symbols. The grammar defined by regular
expressions is known as regular grammar. The language defined by regular
grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each
pattern matches a set of strings, so regular expressions serve as names for a set of
strings. Programming language tokens can be described by regular languages. The
specification of regular expressions is an example of a recursive definition.
Regular languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions,
which can be used to manipulate regular expressions into equivalent forms.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 15
COMPILER DESIGN UNIT - 1

Operations
The various operations on languages are:
 Union of two languages L and M is written as
 L U M = {s | s is in L or s is in M}
 Concatenation of two languages L and M is written as
 LM = {st | s is in L and t is in M}
 The Kleene Closure of a language L is written as
 L* = Zero or more occurrence of language L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
 Union : (r)|(s) is a regular expression denoting L(r) U L(s)
 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
 Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
 x* means zero or more occurrence of x.
 i.e., it can generate { e, x, xx, xxx, xxxx, … }
 x+ means one or more occurrence of x.
 i.e., it can generate { x, xx, xxx, xxxx … } or x.x*
 x? means at most one occurrence of x
 i.e., it can generate either {x} or {e}.
[a-z] is all lower-case alphabets of English language.
[A-Z] is all upper-case alphabets of English language.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 16
COMPILER DESIGN UNIT - 1

[0-9] is all natural digits used in mathematics.


Representing occurrence of symbols using regular expressions
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity
of a regular expression used in specifying the patterns of keywords of a language.
A well-accepted solution is to use finite automata for verification.
Finite automata
Finite automata is a state machine that takes a string of symbols as input and
changes its state accordingly. Finite automata is a recognizer for regular
expressions. When a regular expression string is fed into finite automata, it
changes its state for each literal. If the input string is successfully processed and
the automata reaches its final state, it is accepted, i.e., the string just fed was said to
be a valid token of the language in hand.
The mathematical model of finite automata consists of:
 Finite set of states (Q)
 Finite set of input symbols (Σ)
 One Start state (q0)
 Set of final states (qf)
 Transition function (δ)

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 17
COMPILER DESIGN UNIT - 1

The transition function (δ) maps the finite set of state (Q) to a finite set of input
symbols (Σ), Q × Σ ➔ Q
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automata (FA).
States: States of FA are represented by circles. State names are written inside
circles.
Start state: The state from where the automata starts, is known as the start state.
Start state has an arrow pointed towards it.
Intermediate states: All intermediate states have at least two arrows; one pointing
to and another pointing out from them.
Final state: If the input string is successfully parsed, the automata is expected to be
in this state. Final state is represented by double circles. It may have any odd
number of arrows pointing to it and even number of arrows pointing out from it.
The number of odd arrows are one greater than even, i.e. odd = even+1.
Transition: The transition from one state to another state happens when a desired
symbol in the input is found. Upon transition, automata can either move to the next
state or stay in the same state. Movement from one state to another is shown as a
directed arrow, where the arrows points to the destination state. If automata stays
on the same state, an arrow pointing from a state to itself is drawn.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 18
COMPILER DESIGN UNIT - 1

Example: We assume FA accepts any three digit binary value ending in digit 1. FA
= {Q(q0, qf), Σ(0,1), q0, qf, δ}

DFA (Deterministic finite automata)


DFA refers to deterministic finite automata. Deterministic refers to the
uniqueness of the computation. The finite automata are called deterministic finite
automata if the machine is read an input string one symbol at a time.
 In DFA, there is only one path for specific input from the current state to the
next state.
 DFA does not accept the null move, i.e., the DFA cannot change state
without any input character.
 DFA can contain multiple final states. It is used in Lexical Analysis in
Compiler.
 In the following diagram, we can see that from state q0 for input a, there is
only one path which is going to q1. Similarly, from q0, there is only one
path for input b going to q2.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 19
COMPILER DESIGN UNIT - 1

Formal Definition of DFA


A DFA is a collection of 5-tuples same as we described in the definition of
FA.
Q: finite set of states
∑: finite set of the input symbol
q0: initial state
F: final state
δ: Transition function
Transition function can be defined as: δ: Q x ∑→Q
Graphical Representation of DFA
A DFA can be represented by digraphs called state diagram. In which:
The state is represented by vertices.
The arc labeled with an input character show the transitions.
The initial state is marked with an arrow.
The final state is denoted by a double circle.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 20
COMPILER DESIGN UNIT - 1

Example:
Q = {q0, q1, q2}
∑ = {0, 1}
q0 = {q0}
F = {q2}
Solution:
Transition Diagram:

Transition Table:

Example:
DFA with ∑ = {0, 1} accepts all starting with 0.
Solution:

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 21
COMPILER DESIGN UNIT - 1

Explanation:
In the above diagram, we can see that on given 0 as input to DFA in state q0
the DFA changes state to q1 and always go to final state q1 on starting input 0. It
can accept 00, 01, 000, 001....etc. It can't accept any string which starts with 1,
because it will never go to final state on a string starting with 1.
Example:
DFA with ∑ = {0, 1} accepts all ending with 0.
Solution:

Explanation:
In the above diagram, we can see that on given 0 as input to DFA in state q0,
the DFA changes state to q1. It can accept any string which ends with 0 like 00, 10,
110, 100....etc. It can't accept any string which ends with 1, because it will never
go to the final state q1 on 1 input, so the string ending with 1, will not be accepted
or will be rejected.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 22
COMPILER DESIGN UNIT - 1

NFA (Non-Deterministic finite automata)


NFA stands for non-deterministic finite automata. It is easy to construct an
NFA than DFA for a given regular language.
The finite automata are called NFA when there exist many paths for specific
input from the current state to the next state.
Every NFA is not DFA, but each NFA can be translated into DFA.
NFA is defined in the same way as DFA but with the following two
exceptions, it contains multiple next states, and it contains ε transition.
In the following image, we can see that from state q0 for input a, there are
two next states q1 and q2, similarly, from q0 for input b, the next states are q0 and
q1. Thus it is not fixed or determined that with a particular input where to go next.
Hence this FA is called non-deterministic finite automata.

Formal definition of NFA:


NFA also has five states same as DFA, but with different transition function,
as shown follows:
δ: Q x ∑ →2Q

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 23
COMPILER DESIGN UNIT - 1

where,
Q: finite set of states
∑: finite set of the input symbol
q0: initial state
F: final state
δ: Transition function
Graphical Representation of an NFA
An NFA can be represented by digraphs called state diagram. In which:
 The state is represented by vertices.
 The arc labeled with an input character show the transitions.
 The initial state is marked with an arrow.
 The final state is denoted by the double circle.
Example:
Q = {q0, q1, q2}
∑ = {0, 1}
q0 = {q0}
F = {q2}
Solution:
Transition diagram:

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 24
COMPILER DESIGN UNIT - 1

Transition Table:

In the above diagram, we can see that when the current state is q0, on input
0, the next state will be q0 or q1, and on 1 input the next state will be q1. When the
current state is q1, on input 0 the next state will be q2 and on 1 input, the next state
will be q0. When the current state is q2, on 0 input the next state is q2, and on 1
input the next state will be q1 or q2.
Example:
NFA with ∑ = {0, 1} accepts all strings with 01.
Solution:

Transition Table:

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 25
COMPILER DESIGN UNIT - 1

Example:

NFA with ∑ = {0, 1} and accept all string of length atleast 2.

Solution:

Transition Table:

Equivalence of NFA and DFA

We are going to prove that the DFA obtained from NFA by the conversion
algorithm accepts the same language as the NFA.
NFA that recognizes a language L is denoted by M1 = < Q1 , , q1,0 , 1 , A1 > and
DFA obtained by the conversion is denoted by M2 = < Q2, , q2,0 , 2 , A2 >
First we are going to prove by induction on strings that 1*( q1,0 , w ) =
2*( q2,0 , w ) for any string w. When it is proven, it obviously implies that NFA
M1 and DFA M2 accept the same strings.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 26
COMPILER DESIGN UNIT - 1

Theorem: For any string w, 1*( q1,0 , w ) = 2*( q2,0 , w ) .


Proof: This is going to be proven by induction on w.
Basis Step: For w = ,
2*( q2,0 , ) = q2,0 by the definition of 2* .
= { q1,0 } by the construction of DFA M2 .
= 1*( q1,0 , ) by the definition of 1* .

Inductive Step: Assume that 1*( q1,0 , w ) = 2*( q2,0 , w ) for an arbitrary string
w. --- Induction Hypothesis
For the string w and an arbitrry symbol a in ,
1*( q1,0 , wa ) =
= 2( 1*( q1,0 , w ) , a )
= 2( 2*( q2,0 , w ) , a )
= 2*( q2,0 , wa )
Thus for any string w 1*( q1,0 , w ) = 2*( q2,0 , w ) holds.
End of Proof

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 27
COMPILER DESIGN UNIT - 1

NFA to DFA Conversion Example

Algorithm
Given NFA N = (Σ, Q, q0, F, δ) we want to build DFA D= (Σ, Q’, q’0, F’,
δ’). Here’s how:
Initially Q’ = ∅.
Add q0 to Q’.
For each state in Q find the set of possible states for each input symbol using N’s
transition table, δ. Add this set of states to Q’, if it is not already there.
The set of final states of D, F’, will be all of the the states in Q’ that contain in
them a state that is in F.
Working through an example may aid in understanding these steps.
Consider the following NFA N = (Σ, Q, q0, F, δ)

Σ = {a, b, c}
Q = {q0, q1, q2}
F = {q2}
We wish to construct DFA D= (Σ, Q’, q’0, F’, δ’). Following the steps in the
conversion algorithm:
Q’ = ∅
Q’ = {q0}

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 28
COMPILER DESIGN UNIT - 1

For every state in Q, find the set of states for each input symbol using N’s
transition table, δ. If the set is not Q, add it to Q’.We can start building the
transition table δ’ for D by first examining q0.

q0 is already in Q’ so we don’t add it. {q0, q1} is considered a single state,


so we add it and q2 to Q’.
Q’ = {q0, {q0, q1}, q2}
δ’ now looks like:

To fill in the transitions for {q0, q1}, you need to determine per symbol how
each individual state in the set transitions and take the union of these states.
δ'({q0, q1}, a) = δ(q0,a) ∪ δ(q1,a) = {q0, q1} ∪ ∅ = {q0, q1}
δ'({q0, q1}, b) = δ(q0,b) ∪ δ(q1,b) = c ∪ {q2} = {q0, q2}
δ'({q0, q1}, c) = δ(q0,c) ∪ δ(q1,c) ={q2} ∪ ∅ = {q2}
The only new state here is {q0, q2}. We add it to Q’:
Q’ = {q0, {q0, q1}, q2, {q0, q2}}
And update the transition table δ’:

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 29
COMPILER DESIGN UNIT - 1

q2 has no outgoing transitions so the transition table is simple to update.

Now calculate the transitions for {q0, q2}:


δ'({q0, q2}, a) = δ(q0,a) ∪ δ(q2,a) = {q0, q1} ∪ ∅ = {q0, q1}
δ'({q0, q2}, b) = δ(q0,b) ∪ δ(q2,b) ={q0} ∪ ∅ ={q0}
δ'({q0, q2}, c) = δ(q0,c) ∪ δ(q2,c) = {q2} ∪ ∅ = {q2}

There are no new states to add to Q’. The updated transition table looks like:

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 30
COMPILER DESIGN UNIT - 1

Since we’ve inspected all of the states in Q and no longer have any states to
add to Q’, we are finished. D’s set of final states F’ are those states that include
states in F. So, F’ = {q2, {q0, q2}}.
For our completed DFA D= (Σ, Q’, q’0, F’, δ’):
Σ = {a, b}
Q’ = {q0, {q0, q1}, q2, {q0, q2}}
q’0=q0
F’ = {q2, {q0, q2}}
δ’ is:

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 31
COMPILER DESIGN UNIT - 1

The transition graph for this DFA looks like:

Again for reference, the original NFA was:

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 32
COMPILER DESIGN UNIT - 1

Minimizing the states of DFA


DFA Minimization using Myphill-Nerode Theorem
Algorithm
Input − DFA
Output − Minimized DFA
Step 1 − Draw a table for all pairs of states (Qi, Qj) not necessarily connected
directly [All are unmarked initially]
Step 2 − Consider every state pair (Qi, Qj) in the DFA where Qi ∈ F and Qj ∉ F or
vice versa and mark them. [Here F is the set of final states]
Step 3 − Repeat this step until we cannot mark anymore states −
If there is an unmarked pair (Qi, Qj), mark it if the pair {δ (Qi, A), δ (Qi, A)} is
marked for some input alphabet.
Step 4 − Combine all the unmarked pair (Qi, Qj) and make them a single state in
the reduced DFA.
Example

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 33
COMPILER DESIGN UNIT - 1

Step 1 − We draw a table for all pair of states.

Step 2 − We mark the state pairs.

Step 3 − We will try to mark the state pairs, with green colored check mark,
transitively. If we input 1 to state ‘a’ and ‘f’, it will go to state ‘c’ and ‘f’
respectively. (c, f) is already marked, hence we will mark pair (a, f). Now, we input
1 to state ‘b’ and ‘f’; it will go to state ‘d’ and ‘f’ respectively. (d, f) is already
marked, hence we will mark pair (b, f).

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 34
COMPILER DESIGN UNIT - 1

After step 3, we have got state combinations {a, b} {c, d} {c, e} {d, e} that are
unmarked.
We can recombine {c, d} {c, e} {d, e} into {c, d, e}
Hence we got two combined states as − {a, b} and {c, d, e}
So the final minimized DFA will contain three states {f}, {a, b} and {c, d, e}

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 35
COMPILER DESIGN UNIT - 1

DFA Minimization using Equivalence Theorem


If X and Y are two states in a DFA, we can combine these two states into
{X, Y} if they are not distinguishable. Two states are distinguishable, if there is at
least one string S, such that one of δ (X, S) and δ (Y, S) is accepting and another is
not accepting. Hence, a DFA is minimal if and only if all the states are
distinguishable.
Algorithm
Step 1 − All the states Q are divided in two partitions − final states and non-final
states and are denoted by P0. All the states in a partition are 0th equivalent. Take a
counter k and initialize it with 0.
Step 2 − Increment k by 1. For each partition in Pk, divide the states in Pk into two
partitions if they are k-distinguishable. Two states within this partition X and Y are
k-distinguishable if there is an input S such that δ(X, S) and δ(Y, S) are (k-1)-
distinguishable.
Step 3 − If Pk ≠ Pk-1, repeat Step 2, otherwise go to Step 4.
Step 4 − Combine kth equivalent sets and make them the new states of the reduced
DFA.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 36
COMPILER DESIGN UNIT - 1

Example
Let us consider the following DFA −

Let us apply the above algorithm to the above DFA −


P0 = {(c,d,e), (a,b,f)}
P1 = {(c,d,e), (a,b),(f)}
P2 = {(c,d,e), (a,b),(f)}
Hence, P1 = P2.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 37
COMPILER DESIGN UNIT - 1

There are three states in the reduced DFA. The reduced DFA is as follows −

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 38
COMPILER DESIGN UNIT - 1

Implementation of lexical analyzer


Lexical Analysis is the first phase of compiler also known as scanner. It
converts the High level input program into a sequence of Tokens.
Lexical Analysis can be implemented with the Deterministic finite
Automata.
The output is a sequence of tokens that is sent to the parser for syntax analysis

What is a token?
A lexical token is a sequence of characters that can be treated as a unit in the
grammar of the programming languages.
Example of tokens:
 Type token (id, number, real, . . . )
 Punctuation tokens (IF, void, return, . . . )
 Alphabetic tokens (keywords)
Keywords; Examples-for, while, if etc.
Identifier; Examples-Variable name, function name etc.
Operators; Examples '+', '++', '-' etc.
Separators; Examples ',' ';' etc

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 39
COMPILER DESIGN UNIT - 1

Example of Non-Tokens:
Comments, preprocessor directive, macros, blanks, tabs, newline etc
Lexeme: The sequence of characters matched by a pattern to form the
corresponding token or a sequence of input characters that comprises a single
token is called a lexeme. eg- “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;” .
How Lexical Analyzer functions
1. Tokenization .i.e Dividing the program into valid tokens.
2. Remove white space characters.
3. Remove comments.
4. It also provides help in generating error message by providing row number and
column number.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 40
COMPILER DESIGN UNIT - 1

The lexical analyzer identifies the error with the help of automation machine
and the grammar of the given language on which it is based like C , C++ and gives
row number and column number of the error.
Suppose we pass a statement through lexical analyzer –
a = b + c ; It will generate token sequence like this:
id=id+id; Where each id reference to it’s variable in the symbol table referencing
all details
For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens.
We can observe that we have omitted comments.
As another example, consider below printf statement.

There are 5 valid token in this printf statement.

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 41
COMPILER DESIGN UNIT - 1

Exercise 1:
Count number of tokens :
int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}
Answer: Total number of token: 27.
Exercise 2:
Count number of tokens :
int max(int i);
 Lexical analyzer first read int and finds it to be valid and accepts as token
 max is read by it and found to be valid function name after reading (
 int is also a token , then again i as another token and finally ;
Answer: Total number of tokens 7:
int, max, ( ,int, i, ), ;

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 42
COMPILER DESIGN UNIT - 1

Department of Computer Science,


Sri Lakshmi College of Arts & Science,
Kallakurichi
Page | 43

You might also like