
Compiler Construction

III B.E. – VI Sem

UNIT - I

M Venkata Krishna Reddy


Assistant Professor
CSE Dept, CBIT
Outline
• Introduction
• Programs related to Compilers
• Phases of Compilers - Translation Process
• Major Data Structures
• Other Issues in Compiler Structure
• Bootstrapping and porting
• Lexical Analysis – Specification of Tokens
• Recognition of Tokens
• Role of Lexical Analyzer
• Input Buffering
• LEX – Lexical Analyzer Generator
Translators
• Translator : A translator converts a program written in
one language into an equivalent program in another language.
• Examples : Compiler, Assembler
[Figure: High Level Language Program → Translator → Machine Level Instructions → Processor; the processor takes the program's Input and produces its Output]
OVERVIEW OF LANGUAGE
PROCESSING SYSTEM
Preprocessor
• A preprocessor produces input for a compiler.
• It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to
define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files
into the program text.
3. Language extensions: Some preprocessors attempt to add
capabilities to the language by what amounts to built-in
macros.
ASSEMBLER
• Programmers found it difficult to write or read programs in
machine language.
• They began to use a mnemonic (symbol) for each machine
instruction, which they would subsequently translate into
machine language. Such a mnemonic machine language is now
called an assembly language.
• Programs known as assemblers were written to automate the
translation of assembly language into machine language. The
input to an assembler is called the source program; the
output is its machine language translation (the object program).
• A compiler may generate assembly language as
its target language, and an assembler then finishes
the translation into object code.

• Examples of Assemblers :

1. Macro Assembler – MASM


2. Turbo Assembler - TASM
Linkers
• Collect separately compiled object files into a directly
executable file
• Connect an object program to the code for
standard library functions and to resources
supplied by the OS
• Linking was originally one of the principal activities of a
compiler; the process depends heavily on the OS and processor
Loaders
• Resolve all relocatable addresses relative to a
given base (starting) address
• Make executable code more flexible, since it can be loaded anywhere in memory
• Often part of the operating environment,
rarely an actual separate program
Editors
• Compilers are often bundled together with an
editor and other programs into an interactive
development environment (IDE)
• Such an editor may be oriented toward the format or structure of the
programming language; it is then called structure-based
• It may include some operations of a compiler,
such as informing the programmer of some errors
Interpreters
• Execute the source program immediately rather
than generating object code
• Examples: BASIC, LISP, used often in
educational or development situations
• Speed of execution is slower than compiled
code by a factor of 10 or more
• Share many of their operations with compilers
Compiler
• Compiler : Compilers are computer programs that
translate one language into another.
• A compiler translates a High Level Language such
as C or C++ (Source Code – .C/.CPP) into a Low
Level Language (Target Code) – Object Code,
sometimes called machine code.
• Examples of Compilers :
• C compilers, C++ compilers, COBOL compilers
– Turbo C, Turbo C++ compilers
Why Compiler
• The processor understands and executes only Machine Level Language
• But the user writes the program in a High Level Language
• Writing machine language (numeric codes) is time consuming
and tedious
Example : C7 06 0000 0002 (machine code)
MOV x, 2 (assembly language)
x = 2 (high level language)
• Assembly language also has a number of defects
– Not easy to write
– Difficult to read and understand
• So a Compiler is required to translate the
source code into machine level code.
• A Compiler is a translator program that
takes a program written in a HLL (the
source program) and translates it into an
equivalent program in MLL (the target
program).
Translation Process –
Phases of Compiler –
Analysis – Synthesis Model of a
Compiler
Model of a Compiler
Lexical Analysis
• Lexical analysis breaks the input string into meaningful pieces called lexemes,
groups them into tokens, and passes these tokens to the parser.
Ex : x=y+z*10
x | = | y | + | z | * | 10

Lexeme    Token
x         Identifier – id1
=         Assignment Operator
y         Identifier – id2
+         Addition Operator
z         Identifier – id3
*         Multiplication Operator
10        Number – num

Tokens : id1, Assignment Operator, id2, Addition Operator, id3, Multiplication
Operator, num
Lexical Error :
in | t | ... Here "in" is a lexical error
• Input : High Level Language Program – Source
Code – Input String
• Output : Tokens
• Lexical Analysis is performed by the Lexical
Analyzer
• Lexical Analysis is also called Linear Analysis,
Sequential Analysis, or Scanning (the analyzer is then called a Scanner)
• It discards white space automatically.
• It reports Lexical Errors to the error handler.
• Token : A sequence of characters that
collectively has a meaning is called a Token
• Lexical Error : A sequence of characters that
cannot be grouped together into any token (has no
collective meaning) is called a Lexical Error.
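The scanning described above can be sketched in C as follows. This is a minimal sketch, not the deck's own scanner: the names `TokenKind`, `Token` and `next_token` are assumptions made for illustration, and only the token classes of the example x=y+z*10 are handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Token classes of the slide's example; the enum names are illustrative. */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_ADD, TOK_MUL,
               TOK_ERROR, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;  /* first character of the lexeme */
    size_t length;      /* number of characters in the lexeme */
} Token;

/* Scan one token starting at *pos and advance *pos past its lexeme. */
Token next_token(const char **pos)
{
    const char *p = *pos;
    while (*p == ' ' || *p == '\t' || *p == '\n') p++;   /* discard white space */

    Token t = { TOK_END, p, 0 };
    if (*p == '\0') { *pos = p; return t; }              /* end of input */

    if (isalpha((unsigned char)*p)) {                    /* letter(letter|digit)* */
        while (isalnum((unsigned char)*p)) p++;
        t.kind = TOK_ID;
    } else if (isdigit((unsigned char)*p)) {             /* digit(digit)* */
        while (isdigit((unsigned char)*p)) p++;
        t.kind = TOK_NUM;
    } else {
        switch (*p) {
        case '=': t.kind = TOK_ASSIGN; break;
        case '+': t.kind = TOK_ADD;    break;
        case '*': t.kind = TOK_MUL;    break;
        default:  t.kind = TOK_ERROR;  break;            /* lexical error */
        }
        p++;
    }
    t.length = (size_t)(p - t.start);
    *pos = p;
    return t;
}
```

Calling `next_token` repeatedly on the string "x=y+z*10" yields the kinds ID, ASSIGN, ID, ADD, ID, MUL, NUM and then END, which is the token stream handed to the parser.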
Syntax Analysis
• Example :
Tokens : id1, ass opr, id2, add opr, id3, mul opr,
num
Syntax tree :
      Assgn Opr
      /       \
   id1       add Opr
             /     \
          id2     mul Opr
                  /     \
               id3      num
• Syntax Analysis receives tokens from lexical
analysis as input and generates a Syntax Tree as
output.
• It constructs a nested structure called a syntax
tree (parse tree) from the tokens.
• It performs syntax checking while building the tree,
using either a
1. Top Down Approach (Root to Leaves)
2. Bottom Up Approach (Leaves to Root)
– It reports syntax errors to the Error handler
Syntax Tree (parse tree)
• A syntax tree is a collection of interior nodes and leaves
where interior nodes represent operators and leaves
represent operands.
• Syntax Analysis is also called parsing; the analyzer is called a
parser.
• Input : Tokens
Output : Syntax tree
• Syntax Analysis is performed by the Syntax
Analyzer
Semantic Analysis
• Example :
Input : Syntax tree :
      Assign Opr
      /        \
   id1        add Opr
              /     \
           id2     mul Opr
                   /     \
                id3      num
Output : Syntax tree (after performing semantic
checking) :
      Assign Opr
      /        \
   id1        add Opr
              /     \
           id2     mul Opr
                   /     \
                id3      num
• Semantic Analysis consumes the syntax tree from
Syntax Analysis as input and generates a checked syntax tree
as output.
• Semantic Analysis is performed by the Semantic
Analyzer.
• Semantic Analysis performs semantic checking
on the syntax tree.
• It performs type checking to find semantic errors.
• It reports semantic errors to the error handler.
• Type Checking involves
– Whether two operands fit together with the
operator meaningfully in an expression or not.
Ex : int a, b;
a+b - Correct
– Whether the data assigned to an operand matches its declared type
or not.
Ex : int a;
a=10.2 - Wrong
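The two checks above can be sketched in C. This sketch assumes the strict rule the slide uses (no implicit conversions, so assigning 10.2 to an int is rejected); the names `Type`, `check_binary` and `check_assign` are hypothetical.

```c
/* Types known to this sketch; the names are illustrative. */
typedef enum { TY_INT, TY_FLOAT, TY_ERROR } Type;

/* Type of "left op right" for an arithmetic operator:
   both operands must have the same type, otherwise it is an error. */
Type check_binary(Type left, Type right)
{
    if (left == TY_ERROR || right == TY_ERROR) return TY_ERROR;
    return (left == right) ? left : TY_ERROR;
}

/* An assignment is accepted only when the value's type equals the
   variable's declared type. Returns 1 if it type-checks, 0 otherwise. */
int check_assign(Type variable, Type value)
{
    return variable != TY_ERROR && variable == value;
}
```

With `int a, b;` the call `check_binary(TY_INT, TY_INT)` gives `TY_INT` (a+b is correct), while `check_assign(TY_INT, TY_FLOAT)` gives 0 (a=10.2 is reported as a semantic error). A real C compiler would instead insert a conversion here; the strict rule is only the simplification used on the slide.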
Intermediate Code
• Example :
Input : Syntax tree :
      Assign Opr
      /        \
   id1        add Opr
              /     \
           id2     mul Opr
                   /     \
                id3      num
Output : Intermediate Code / 3 Address Code
t1 = 10;
t2 = id3 * t1;
t3 = id2 + t2;
id1 = t3;
Where t1, t2, t3 are temporary variables
• Intermediate code generation is performed by the
Intermediate Code Generator.
• The Intermediate Code phase converts the syntax tree
into intermediate code, a form close to machine level
instructions, preferably 3 – Address code.
• 3 – Address Code :
The 3 possible cases are
1. Two operands – One Operator other than
Assignment Operator
Ex : a=b+c
2. One operand – One Operator other than
Assignment Operator
Ex : a= - b
3. One operand – No Operator other than
Assignment Operator
Ex : a=b
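One way to produce three-address code from a syntax tree is a post-order walk that creates a new temporary for every operator node. The sketch below assumes that approach; `Node`, `CodeBuf`, `gen_expr` and `gen_assign` are names invented for illustration, and unlike the slide it does not load the constant 10 into its own temporary.

```c
#include <stdio.h>
#include <string.h>

/* Syntax-tree node: a leaf holds a name, an interior node holds an operator. */
typedef struct Node {
    char op;                    /* '+', '*', '=' for interior nodes; 0 for a leaf */
    const char *name;           /* operand name for a leaf, e.g. "id2" or "10" */
    struct Node *left, *right;
} Node;

typedef struct {
    char text[256];             /* generated instructions, one per line */
    int temps;                  /* number of temporaries created so far */
} CodeBuf;

/* Append "dst = a op b" (or the copy "dst = a" when op is 0) to the buffer. */
static void emit(CodeBuf *cb, const char *dst, const char *a, char op, const char *b)
{
    size_t used = strlen(cb->text);
    if (op)
        snprintf(cb->text + used, sizeof cb->text - used, "%s = %s %c %s\n", dst, a, op, b);
    else
        snprintf(cb->text + used, sizeof cb->text - used, "%s = %s\n", dst, a);
}

/* Generate code for the subtree n; result receives the name holding its value. */
void gen_expr(const Node *n, CodeBuf *cb, char *result, size_t size)
{
    if (n->op == 0) {                           /* leaf: the operand names itself */
        snprintf(result, size, "%s", n->name);
        return;
    }
    char l[16], r[16];
    gen_expr(n->left, cb, l, sizeof l);         /* operands first (post-order) */
    gen_expr(n->right, cb, r, sizeof r);
    snprintf(result, size, "t%d", ++cb->temps); /* fresh temporary */
    emit(cb, result, l, n->op, r);              /* case 1: two operands, one operator */
}

/* Generate code for an assignment node: target = expression. */
void gen_assign(const Node *n, CodeBuf *cb)
{
    char r[16];
    gen_expr(n->right, cb, r, sizeof r);
    emit(cb, n->left->name, r, 0, "");          /* case 3: plain copy */
}
```

For the tree of id1 = id2 + id3 * 10, starting from a zero-initialized `CodeBuf`, `gen_assign` fills `text` with the three lines `t1 = id3 * 10`, `t2 = id2 + t1` and `id1 = t2`.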
Code Optimization
• Example :
Input : Intermediate Code / 3 Address Code
t1 = 10;
t2 = id3 * t1;
t3 = id2 + t2;
id1 = t3;
Where t1, t2, t3 are temporary variables
Output : Optimized Intermediate Code
t2 = id3 * 10;
id1 = id2 + t2;
Where t2 is a temporary variable
• The Code Optimization phase optimizes the
intermediate code received from the Intermediate
code generator and generates optimized
intermediate code as output.
• It applies appropriate code
optimization techniques to improve the code.
• The Code Optimizer performs code optimization.
• Code Optimization aims to reduce code size and execution
time.
Code Generation
• Example :
Input : Optimized Intermediate Code
t2 = id3 * 10;
id1 = id2 + t2;
Where t2 is a temporary variable
Output : Mnemonic Instructions / Assembly Level Language
Code / Target Code
mov id3, R1
mul #10, R1
mov id2, R2
add R1, R2
mov R2, id1
Where R1, R2 are registers (operands are written source first, destination second)
• The Code Generation phase generates assembly
language code from the optimized code received
from the Code Optimization phase.
• The Code Generation phase is performed by the code
generator.
• Further optimizations can be performed on the target code.
Example : x=y+z*10 (Input String / Source Program)

Lexical Analysis
Tokens : id1, ass opr, id2, add opr, id3, mul opr, num

Syntax Analysis
Syntax tree :
      Assgn
      /   \
   id1    add
          /  \
       id2   mul
             /  \
          id3   num

Semantic Analysis
Syntax tree (after semantic checking) :
      Assgn
      /   \
   id1    add
          /  \
       id2   mul
             /  \
          id3   num

Intermediate Code
Intermediate Code :
t1 = 10;
t2 = id3 * t1;
t3 = id2 + t2;
id1 = t3;

Code Optimization
Optimized Code :
t1 = id3 * 10;
id1 = id2 + t1;

Code Generation
Target Program, Mnemonic Instructions :
MOV id3, r1
MUL #10, r1
MOV id2, r2
ADD r1, r2
MOV r2, id1
Error Handler
• All phases of the Compiler report errors to the
error handler.
• Most errors are reported by the
Analysis phases.
1. Lexical Analysis – Lexical Errors
2. Syntax Analysis – Syntax Errors
3. Semantic Analysis – Semantic Errors
• The Error Handler displays an appropriate
error message to the user
Major Data structures used in Compilers
• THE SYMBOL TABLE
– Keeps information associated with identifiers: functions, variables,
constants, and data types
– Interacts with almost every phase of the compiler
– Access operations need to be efficient, preferably constant-time (e.g. a hash table)
– Used to store and retrieve data in the form of attribute values
– Organized as a collection of records and fields
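A hash table with chaining is one common way to get near constant-time access to symbol records. The sketch below assumes that design; the record layout (a name plus a single `type` attribute), the bucket count and the function names are illustrative choices, not taken from the slides.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211              /* number of buckets; an arbitrary prime */

typedef struct Symbol {
    char *name;                     /* the identifier's lexeme */
    const char *type;               /* one attribute value, e.g. "int" */
    struct Symbol *next;            /* next record in the same bucket */
} Symbol;

typedef struct { Symbol *buckets[TABLE_SIZE]; } SymbolTable;

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Return the record for name, or NULL if it has not been inserted. */
Symbol *lookup(SymbolTable *t, const char *name)
{
    for (Symbol *s = t->buckets[hash(name)]; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;
}

/* Insert name with the given type; return the existing record if present. */
Symbol *insert(SymbolTable *t, const char *name, const char *type)
{
    Symbol *s = lookup(t, name);
    if (s != NULL) return s;                    /* already in the table */
    s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->name = malloc(strlen(name) + 1);
    if (s->name == NULL) { free(s); return NULL; }
    strcpy(s->name, name);                      /* keep a private copy of the lexeme */
    s->type = type;
    unsigned h = hash(name);
    s->next = t->buckets[h];                    /* push onto the bucket's chain */
    t->buckets[h] = s;
    return s;
}
```

A table must start with all buckets set to NULL (for example `SymbolTable t = {{0}};`). The scanner or parser would call `insert` when it meets a declaration such as `int amt`, and later phases call `lookup` to retrieve the attributes.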

• THE LITERAL TABLE
– Stores constants and strings once each, reducing the size of the program in memory
– Quick insertion and lookup are essential
• TOKENS
– A scanner collects characters into a token, represented as a value of an
enumerated data type for tokens
– May also preserve the string of characters or other derived
information, such as the name of an identifier or the value of a number
token
– Held in a single global variable or an array of tokens
• THE SYNTAX TREE
– A standard pointer-based structure generated by the parser
– Each node holds information collected by the parser or by later phases;
the information may be dynamically allocated or stored in the symbol
table
– A node requires different attributes depending on the kind of
language structure it represents, so it may be implemented as a variant
record
• INTERMEDIATE CODE
– Kept as an array of text strings, a temporary text file, or a linked
list of structures, depending on the kind of intermediate code
(e.g. three-address code and p-code)
– Should be easy to reorganize
• TEMPORARY FILES
– Hold the products of intermediate steps during compiling
– Solve the problem of memory constraints, or allow addresses to be
backpatched during code generation
Other Issues in Compiler
Structure
Analysis and Synthesis
• The analysis part of the compiler analyzes the source program
to compute its properties
– Lexical analysis, syntax analysis and semantic analysis, as
well as optimization
– More mathematical and better understood
• The synthesis part of the compiler produces the translated
code
– Code generation, as well as optimization (optimization involves both analysis and synthesis)
– More specialized
• The two parts can be changed independently of each other
Front End and Back End
• The operations of the front end depend on the source language
– The scanner, parser, and semantic analyzer, as well as
intermediate code synthesis
• The operations of the back end depend on the target language
– Code generation, as well as some optimization analysis
• The intermediate representation is the medium of
communication between them
• This structure is important for compiler portability
Passes
• A compiler often processes the entire source program several times before
generating code; these repetitions are referred to as passes.
• Passes may or may not correspond to phases
– A pass often consists of several phases
– A compiler can be one-pass, which results in efficient
compilation but less efficient target code
– Most compilers with optimization use more than one pass, typically:
• One pass for scanning and parsing
• One pass for semantic analysis and source-level
optimization
• A third pass for code generation and target-level
optimization
Bootstrapping and Porting
The language in which a compiler is written is usually one of:

• Machine language
– The compiler can be executed immediately
• Another language with an existing compiler on the same target
machine : (First Scenario)
– Compile the new compiler with the existing compiler
• Another language with an existing compiler on a different machine
: (Second Scenario)
– Compilation produces a cross compiler
T-Diagram Describing Complex Situation

• A compiler written in language H that translates language S
into language T is drawn as a T-diagram :

+-----------+
| S       T |
+---+   +---+
    | H |
    +---+

• T-Diagrams can be combined in two basic ways.


Process of Bootstrapping
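• Write a compiler in the same language it compiles
[T-diagram: S → T, written in S]
• No compiler for the source language exists yet
• Porting to a new host machine
[T-diagrams: (A → H, written in A) is compiled by (A → H, written in H) to give (A → H, written in H)]
• "Quick and dirty" compiler written in machine
language H (A → H, written in H)
• Compiler written in its own language A (A → H, written in A)
• Result is a running but inefficient compiler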
Porting
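[T-diagrams: (A → K, written in A) is compiled by (A → H, written in H) to give (A → K, written in H)]
• Original compiler (A → H, written in H)
• Compiler source code retargeted to K (A → K, written in A)
• Result is a Cross Compiler (A → K, running on H)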
Lexical Analysis
• Example : input string/statement :
int amt = 10;
Then,
Lexeme    Token                  Pattern
int       Keyword                Data type
amt       Identifier             letter(letter|digit)*
=         Assignment Operator    Operator
10        Number                 digit(digit)*
• Lexeme : A sequence of characters in the source
program that is matched by the pattern for a token.
Ex : int, a, printf
• Token : A lexical token is a sequence of characters that
can be treated as a unit in the grammar of the
programming language.
• A Token is a pair consisting of a token name and an
optional attribute value.
Ex : identifier
• Pattern : A rule (regular
expression) describing the form that the lexemes of a token may take.
Ex : identifier – id = letter(letter|digit)*
• Regular Expressions are usually used to specify patterns.
Specification of Tokens
• The lexical analyzer produces tokens for the
strings in the input and passes them as output to syntax
analysis.
• Ex : if (EXP)
{
STMT
}
• Here, "if" is a string in the above input
code/program, and the token produced for
it is "Keyword".
• Alphabet Set : A finite non-empty set of symbols,
denoted by 'Σ'
Ex : Σ = {0,1} (the binary alphabet), Σ = {a,b}
• Symbol : The smallest abstract entity, which cannot be
divided further; it can be represented but cannot be
defined.
Ex : 0, 1, a, b
• String : A string is a finite sequence of symbols over an
alphabet set, denoted by 'w'.
Ex : Binary alphabet set – {0,1}
Strings : w = 001, 101, 110
Length of string - | w | = | 001 | = 3
• Prefix : Any sequence of leading symbols of a string. Ex : w
= 001 , then its prefixes are ε, 0, 00, 001
• Proper prefix : Any prefix of a string other than ε and the string itself is
called a Proper Prefix.
Ex : w = 001 , then its proper prefixes are 0, 00
• Suffix : Any sequence of trailing symbols of a string. Ex : w
= 001 , then its suffixes are ε, 1, 01, 001
• Proper suffix : Any suffix of a string other than ε and the string itself is
called a Proper Suffix.
Ex : w = 001 , then its proper suffixes are 1, 01
• Substring : A contiguous sequence of characters from a
string – part of a string.
Ex : w = 00101, some substrings are 0, 01, 101, 010
• Subsequence : Part of a string whose symbols keep their order but need not be
contiguous.
Ex : w = 00101, then 011 is a subsequence
• Language : A set of strings over some alphabet set.
Ex : L = { 0, 01, 001, 00001, ... } over { 0,1}
Operation on Languages :
• Union : If L1 and L2 are languages, then L1 ∪ L2 is also a language.
L1 = {00, 01}, L2 = {10, 11}, then L1 ∪ L2 = {00, 01, 10, 11}
• Concatenation : L1L2 = { xy | x in L1, y in L2 } = {0010, 0011, 0110, 0111}
• Closure (Kleene) : L* = ∪ L^i for all i ≥ 0, where L^0 = {ε} and L^i = L^(i-1) L
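For finite languages, concatenation can be computed directly by pairing every string of L1 with every string of L2. The sketch below assumes languages are given as arrays of C strings; `concat_languages` and `MAX_LEN` are hypothetical names.

```c
#include <stdio.h>
#include <stddef.h>

#define MAX_LEN 32   /* longest result string this sketch can hold, including '\0' */

/* Store L1L2 = { xy | x in L1, y in L2 } in out and return its size.
   out must have room for n1 * n2 strings; duplicates are not removed. */
size_t concat_languages(const char *l1[], size_t n1,
                        const char *l2[], size_t n2,
                        char out[][MAX_LEN])
{
    size_t k = 0;
    for (size_t i = 0; i < n1; i++)
        for (size_t j = 0; j < n2; j++)
            snprintf(out[k++], MAX_LEN, "%s%s", l1[i], l2[j]);
    return k;
}
```

With L1 = {00, 01} and L2 = {10, 11} the function produces 0010, 0011, 0110, 0111 in that order, matching the slide. The Kleene closure cannot be enumerated this way in full, since it is infinite for any language containing a non-empty string.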
• Regular Grammars describe the lexical structure (tokens) of programming
languages.
• Regular Grammars and Regular Languages are recognized by
finite automata.
• Regular Expression : It is used to describe/denote a Regular Language
(Regular Set).
• Regular Expressions are used to describe the tokens of a programming
language.
• The language denoted by a RE is called a Regular Set.
• A RE is an algebraic notation for describing such a language.
• RE is defined as –
1. ∅ is a RE denoting the empty language.
2. ε (Epsilon) is a RE denoting the language {ε}, which contains only the empty string.
3. a is a RE denoting the language {a}, for each symbol a in Σ.
4. If R and S are REs denoting the languages LR and LS, then
i. R+S (also written R|S) is a RE for the language LR ∪ LS .
ii. RS is a RE for the language LR LS .
iii. R* is a RE for the language (LR)* .
Algebraic Laws on Regular Expressions
Let R, S, T denote Regular Expressions.
• Commutative Law : R|S = S|R
• Associative Law : R|(S|T) = (R|S)|T
R(ST) = (RS)T
• Distributive Law : R(S|T) = RS|RT
(S|T)R=SR|TR
• Identity Law : εR = Rε = R
• R* = (R|ε)*
• Idempotent Law : R** = R*
• Regular Definition : For notational
convenience, names are given to Regular Expressions; a sequence of such
named definitions is called a 'Regular Definition'.
• Regular Expressions are used to specify Tokens
Token/Regular Definition    Regular expression
letter                      A|B|...|Z|a|b|...|z or [A-Z]|[a-z]
digit                       0|1|...|9 or [0-9]
id                          letter(letter|digit)*
Recognition of Tokens :
Ex : Piece of code in a High Level Language for an 'if - else'
statement.
if (EXP)
{
STMT;
}
else
{
STMT;
}
• Grammar behind the statement :
S -> iEtS | iEtSeS | ε
E -> T relop T | T
T -> id | num
Terminals are i, t, e, relop, id, num – strings.
• Strings represent tokens, and tokens are described by Regular
Expressions / Patterns.
• Regular Expressions are recognized by Finite Automata
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter|digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
We also need to handle whitespace:
ws -> (blank | tab | newline)+
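A regular definition such as number can be recognized by hand-written code that follows the pattern left to right. The sketch below assumes the pattern digits (. digits)? (E [+-]? digits)? and returns the length of the longest matching prefix; `match_number` is a hypothetical helper name.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the length of the longest prefix of s that matches
   number -> digits (. digits)? (E [+-]? digits)?
   or 0 if s does not start with a number. */
size_t match_number(const char *s)
{
    size_t i = 0;
    if (!isdigit((unsigned char)s[i])) return 0;
    while (isdigit((unsigned char)s[i])) i++;                /* digits */

    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {   /* optional fraction */
        i++;
        while (isdigit((unsigned char)s[i])) i++;
    }

    if (s[i] == 'E') {                                       /* optional exponent */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (isdigit((unsigned char)s[j])) {                  /* commit only if digits follow */
            while (isdigit((unsigned char)s[j])) j++;
            i = j;
        }
    }
    return i;
}
```

The optional parts are accepted only when they are complete, so on input `12E+` the function matches just `12` and leaves `E+` for the next token. This mirrors the retract step of a transition diagram.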
Transition diagram for token 'relop'
Transition diagram for reserved words and identifiers
Transition diagram for unsigned numbers
Transition diagram for whitespace
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0: c = nextchar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail(); /* lexeme is not a relop */
                break;
        case 1: …

        case 8: retract(); /* give back the extra character that was read */
                retToken.attribute = GT;
                return(retToken);
        }
    }
}
Role of Lexical Analyzer

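[Figure: Source program → Lexical Analyzer → (token) → Parser → to semantic analysis; the Parser requests each token with getNextToken; both the Lexical Analyzer and the Parser use the Symbol table]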
Tokens, Patterns, Lexemes
• A token is a pair consisting of a token name and an optional token value
(attribute value)
• A token is a sequence of characters that can be treated as a
single logical entity. Typical tokens are,
1) Identifiers 2) keywords 3) operators 4) special symbols
5) constants
• A pattern is a rule that describes the form the lexemes of a token may take.
• A lexeme is a sequence of characters in the source program
that matches the pattern for a token
Ex :
Token         Informal description                      Sample lexemes
if            Characters i, f                           if
else          Characters e, l, s, e                     else
comparison    < or > or <= or >= or == or !=            <=, !=
id            Letter followed by letters and digits     pi, score, D2
number        Any numeric constant                      3.14159, 0, 6.02e23
literal       Anything but ", surrounded by "s          "core dumped"

printf("total = %d\n", score);
Here printf and score are lexemes matching the pattern for token id, and
"total = %d\n" is a lexeme matching literal.


Input Buffering…

• The Lexical Analyzer scans the input using a
scheme that consists of
two buffers : 1. Input Buffer 2. Token Buffer
two pointers : 1. Lexeme Beginning Pointer (lb) 2. Forward Pointer (fp)
• Example :
C>=B; - input String

C>=B;     lb and fp both start at C
^
lb,fp

C>=B;     fp advances : text = C> - No Token Matched
^^
lb fp

C>=B;     fp retracts : text = C - Lexeme – Token : Variable - sent to Parser;
 ^        lb and fp then both move to >
 lb,fp
• Sometimes the lexical analyzer needs to look ahead
some symbols to decide which token to return
• In the C language : after reading -, = or <, we need to look at the next
character to decide what token to return
• We need to introduce a two-buffer scheme to handle
large look-aheads safely

[Figure: buffer pair holding E = M * C * * 2 eof]
Sentinels
[Figure: buffer pair with a sentinel eof at the end of each half : E = M eof * C * * 2 eof eof]
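The two-buffer scheme with sentinels can be sketched as below. This is a simplified illustration under stated assumptions: input comes from an in-memory string instead of a file, the character '\0' plays the role of the eof sentinel, the halves are tiny so that reloads actually happen, and the lexeme-beginning pointer is left out. All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define HALF 4              /* size of each buffer half; tiny so reloads are visible */
#define SENTINEL '\0'       /* stands in for the slide's eof marker */

typedef struct {
    const char *src;            /* unread input (assumed to contain no '\0') */
    char buf[2 * (HALF + 1)];   /* two halves, each followed by a sentinel slot */
    char *forward;              /* forward pointer */
} Scanner;

/* Fill one half with up to HALF characters and terminate it with a sentinel. */
static void reload(Scanner *sc, char *half)
{
    size_t n = strlen(sc->src);
    if (n > HALF) n = HALF;
    memcpy(half, sc->src, n);
    half[n] = SENTINEL;         /* end of half, or end of input when n < HALF */
    sc->src += n;
}

void scanner_init(Scanner *sc, const char *input)
{
    sc->src = input;
    reload(sc, sc->buf);
    sc->forward = sc->buf;
}

/* Return the next character, or SENTINEL at end of input.
   The common case needs only one comparison (against the sentinel). */
char next_char(Scanner *sc)
{
    char *first_end  = sc->buf + HALF;          /* sentinel slot of first half */
    char *second     = sc->buf + HALF + 1;      /* start of second half */
    char *second_end = sc->buf + 2 * HALF + 1;  /* sentinel slot of second half */

    for (;;) {
        char c = *sc->forward;
        if (c != SENTINEL) { sc->forward++; return c; }
        if (sc->forward == first_end) {         /* end of first half: load second */
            reload(sc, second);
            sc->forward = second;
        } else if (sc->forward == second_end) { /* end of second half: load first */
            reload(sc, sc->buf);
            sc->forward = sc->buf;
        } else {
            return SENTINEL;                    /* sentinel inside a half: end of input */
        }
    }
}
```

Without the sentinel, every advance of the forward pointer would need two tests: one for the end of a buffer half and one for the character itself. A production scanner would fill each half with a single read from the source file and would also keep the current lexeme intact across reloads.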
Lexical Analyzer Generator -
LEX
Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → Sequence of tokens
Syntax – Structure of LEX
Program
declarations
%%
translation rules (each of the form : Regular Expression / Pattern {Action})
%%
auxiliary functions
Example
%{
    /* definitions of manifest constants
    LT, LE, EQ, NE, GT, GE,
    IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}

%%

int installID() {/* function to install the lexeme, whose first
                    character is pointed to by yytext, and whose
                    length is yyleng, into the symbol table and
                    return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
                     constants into a separate table */
}

References
1. Kenneth C. Louden, "Compiler Construction:
Principles and Practice", Cengage Learning.
2. John R. Levine, "Lex & Yacc", O'Reilly Publishers.
3. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey
D. Ullman, "Compilers: Principles, Techniques
& Tools", 2nd Edition, Pearson Education, 2013.
