Unit 2

A compiler is software which converts a program written in a high-level language (the Source Language) into a low-level language (the Object/Target/Machine Language).

● A compiler is a translator that converts a high-level language into machine language.
● High-level language is written by a developer; machine language can be understood by the processor.
● A compiler is also used to show errors to the programmer.
● The main purpose of a compiler is to change code written in one language into another without changing the meaning of the program.
● When you execute a program written in an HLL programming language, it executes in two parts.
● In the first part, the source program is compiled and translated into the object program (low-level language).
● In the second part, the object program is translated into the target program through the assembler.

Phases of a Compiler

There are two major phases of compilation, the Analysis phase and the Synthesis phase, which in turn have many parts. Each part takes input from the output of the previous one, and they work in a coordinated way. The Analysis phase creates an intermediate representation from the given source code; the Synthesis phase creates an equivalent target program from that intermediate representation.

Analysis Phase – An intermediate representation is created from the given source code by:
Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Intermediate Code Generator

The Lexical analyzer divides the program into "tokens", the Syntax analyzer recognizes "sentences" in the program using the syntax of the language, and the Semantic analyzer checks the static semantics of each construct. The Intermediate Code Generator generates "abstract" code.

Synthesis Phase – An equivalent target program is created from the intermediate representation. It has two parts:
Code Optimizer
Code Generator

The Code Optimizer optimizes the abstract code, and the final Code Generator translates the abstract intermediate code into specific machine instructions.

The compiler thus has two modules, namely the front end and the back end. The front end consists of the Lexical analyzer, Syntax analyzer, Semantic analyzer and Intermediate code generator; the remaining phases are assembled to form the back end.

Symbol Table – It is a data structure used and maintained by the compiler that holds all the identifiers' names along with their types. It helps the compiler function smoothly by finding identifiers quickly.

1. Lexical Analyzer – It reads the program and converts it into tokens; that is, it converts a stream of lexemes into a stream of tokens. Tokens are defined by regular expressions which are understood by the lexical analyzer. It also removes white space and comments.
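As a concrete sketch of this first phase, the following small C program (our own illustration, not code from any real compiler) splits a hypothetical statement into identifier and operator tokens while stripping white space, exactly as described above:

#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *src = "sum = a + b ;";   /* hypothetical source fragment */
    const char *p = src;
    char lexeme[32];

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {         /* strip white space */
            p++;
        } else if (isalpha((unsigned char)*p)) {  /* identifier: letter(letter|digit)* */
            int n = 0;
            while (isalnum((unsigned char)*p) && n < 31)
                lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("<id, \"%s\">\n", lexeme);
        } else {                                  /* single-character operator/separator */
            printf("<op, '%c'>\n", *p);
            p++;
        }
    }
    return 0;
}

Running it prints the token stream <id, "sum"> <op, '='> <id, "a"> <op, '+'> <id, "b"> <op, ';'>, which is the kind of output the parser receives.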
2. Syntax Analyzer – It is sometimes called the parser. It constructs the parse tree: it takes all the tokens one by one and uses a context-free grammar to construct the parse tree.

Why Grammar?
The rules of a programming language can be entirely represented in a few productions. Using these productions we can represent what the program actually is. The input has to be checked to see whether it is in the desired format or not. A syntax error can be detected at this level if the input is not in accordance with the grammar.

3. Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful or not, and produces a verified parse tree. It also performs type checking, label checking and flow-control checking.

4. Intermediate Code Generator – It generates intermediate code, a form which can be readily executed by a machine. There are many popular intermediate codes; three-address code is one example. The intermediate code is converted to machine language by the last two phases, which are platform dependent. Up to the intermediate code the compilation is the same for every compiler; after that, it depends on the platform. To build a new compiler we do not need to start from scratch: we can take the intermediate code from an already existing compiler and build the last two parts.

5. Code Optimizer – It transforms the code so that it consumes fewer resources and runs faster. The meaning of the code being transformed is not altered. Optimization can be categorized into two types: machine dependent and machine independent.

6. Target Code Generator – The main purpose of the target code generator is to write code that the machine can understand; it also performs register allocation, instruction selection, etc. The output depends on the type of assembler. This is the final stage of compilation.

Symbol Table in Compiler

Prerequisite – Phases of a Compiler

A symbol table is an important data structure created and maintained by the compiler in order to keep track of the semantics of variables, i.e. it stores information about the scope and binding of names and about instances of various entities such as variable and function names, classes, objects, etc.

● It is built in the lexical and syntax analysis phases.
● The information is collected by the analysis phases of the compiler and used by the synthesis phases to generate code.
● It is used by the compiler to achieve compile-time efficiency.
● It is used by the various phases of the compiler as follows:
1. Lexical Analysis: creates new entries in the table, for example entries about tokens.
2. Syntax Analysis: adds information regarding attribute type, scope, dimension, line of reference, use, etc. to the table.
3. Semantic Analysis: uses the available information in the table to check semantics, i.e. to verify that expressions and assignments are semantically correct (type checking), and updates it accordingly.
4. Intermediate Code Generation: refers to the symbol table to know how much run-time storage is allocated and of what type; the table also helps in adding information about temporary variables.
5. Code Optimization: uses information present in the symbol table for machine-dependent optimization.
6. Target Code Generation: generates code by using the address information of the identifiers present in the table.

Symbol Table entries – Each entry in the symbol table is associated with attributes that support the compiler in different phases.

Items stored in the symbol table:
● Variable names and constants
● Procedure and function names
● Literal constants and strings
● Compiler-generated temporaries
● Labels in source languages

Information used by the compiler from the symbol table:
● Data type and name
● Declaring procedures
● Offset in storage
● If a structure or record, then a pointer to the structure table
● For parameters, whether passing is by value or by reference
● Number and type of arguments passed to a function
● Base address

Operations of the Symbol Table – The basic operations defined on a symbol table include insert(), which adds a new name together with its attributes, and lookup(), which searches the table for an existing name.
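A minimal sketch of these two operations in C, assuming a simple linear list of (name, type, offset) entries; this is our own simplification, since production compilers generally use hash tables and the much richer attribute records listed above:

#include <stdio.h>
#include <string.h>

#define MAX_SYMS 64

struct symbol {
    char name[32];   /* identifier name               */
    char type[16];   /* e.g. "int", "float"           */
    int  offset;     /* offset in storage (see above) */
};

static struct symbol table[MAX_SYMS];
static int nsyms = 0;

/* lookup(): returns the entry index, or -1 if the name is absent */
int lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

/* insert(): adds a new entry, rejecting duplicates and overflow */
int insert(const char *name, const char *type, int offset) {
    if (nsyms == MAX_SYMS || lookup(name) != -1)
        return -1;
    strcpy(table[nsyms].name, name);
    strcpy(table[nsyms].type, type);
    table[nsyms].offset = offset;
    return nsyms++;
}

int main(void) {
    insert("a", "int", 0);
    insert("b", "int", 4);
    int i = lookup("b");
    if (i != -1)
        printf("%s : %s at offset %d\n", table[i].name, table[i].type, table[i].offset);
    return 0;
}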
What is a Translator? Different types of translators

A program written in a high-level language is called source code. To convert the source code into machine code, translators are needed.

A translator takes a program written in a source language as input and converts it into a program in a target language as output. It also detects and reports errors during translation.

Roles of a translator are:
• Translating the high-level language program input into an equivalent machine language program.
• Providing diagnostic messages wherever the programmer violates the specification of the high-level language program.

Different types of translators
The different types of translators are as follows:

Compiler
A compiler is a translator which is used to convert programs in a high-level language to a low-level language. It translates the entire program and also reports the errors in the source program encountered during the translation.

Interpreter
An interpreter is a translator which is used to convert programs in a high-level language to a low-level language. An interpreter translates line by line and reports an error as soon as it is encountered during the translation process. It directly executes the operations specified in the source program when the input is given by the user. It gives better error diagnostics than a compiler.

Assembler
An assembler is a translator which is used to translate assembly language code into machine language code.

Differences between compiler and interpreter

S. No. | Compiler                                                | Interpreter
1      | Performs the translation of a program as a whole.       | Performs statement-by-statement translation.
2      | Execution is faster.                                    | Execution is slower.
3      | Requires more memory, as linking is needed for the generated intermediate object code. | Memory usage is efficient, as no intermediate object code is generated.
4      | Debugging is hard, as error messages are generated only after scanning the entire program. | It stops translation when the first error is met; hence debugging is easy.
5      | Programming languages like C and C++ use compilers.     | Programming languages like Python, BASIC and Ruby use interpreters.
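To make row 4 of the table concrete, here is a toy C sketch of statement-by-statement execution; the two-command "language" it interprets is invented purely for illustration. Each statement is executed as soon as it is read, and translation stops at the first erroneous one:

#include <stdio.h>

int main(void) {
    /* a hypothetical four-line program, fed to the "interpreter" */
    const char *program[] = { "print 1", "print 2", "oops", "print 3" };
    int n = sizeof program / sizeof program[0];

    for (int line = 0; line < n; line++) {
        int value;
        if (sscanf(program[line], "print %d", &value) == 1) {
            printf("%d\n", value);          /* execute the statement immediately */
        } else {
            /* first error met: report it and halt; "print 3" is never reached */
            fprintf(stderr, "error on line %d: %s\n", line + 1, program[line]);
            return 1;
        }
    }
    return 0;
}

A compiler, by contrast, would scan all four lines first and report the errors before any of them executed.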
Introduction of Lexical Analysis

Lexical analysis is the first phase of a compiler; the component that performs it is also known as the scanner. It converts the high-level input program into a sequence of tokens.

● Lexical analysis can be implemented with a deterministic finite automaton.
● The output is a sequence of tokens that is sent to the parser for syntax analysis.

What is a token?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming language.

Example of tokens:
● Type tokens (id, number, real, . . . )
● Punctuation tokens (IF, void, return, . . . )
● Alphabetic tokens (keywords)

Keywords; examples: for, while, if, etc.
Identifiers; examples: variable names, function names, etc.
Operators; examples: '+', '++', '-', etc.
Separators; examples: ',', ';', etc.

Example of non-tokens:
● Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.

Lexeme: The sequence of characters matched by a pattern to form the corresponding token, i.e. the sequence of input characters that comprises a single token, is called a lexeme, e.g. "float", "abs_zero_Kelvin", "=", "-", "273", ";".

How the Lexical Analyzer functions

1. Tokenization, i.e. dividing the program into valid tokens.
2. Removing white-space characters.
3. Removing comments.
4. It also helps in generating error messages by providing row and column numbers.

The lexical analyzer identifies errors with the help of the automaton and the grammar of the given language on which it is based (such as C or C++), and gives the row number and column number of the error.

Suppose we pass the statement a = b + c ; through the lexical analyzer. It will generate a token sequence like this: id = id + id ;, where each id refers to its variable in the symbol table, which holds all its details.

For example, consider the program

int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'

You can observe that the comment has been omitted.

As another example, consider a simple printf statement such as printf("Hello");. There are 5 valid tokens in this printf statement: 'printf', '(', '"Hello"', ')', ';'.

Exercise 1:
Count the number of tokens:

int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}

Answer: Total number of tokens: 27.
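One way to verify this count (our own worked breakdown; white space is discarded and the whole string literal counts as a single token):

'int' 'main' '(' ')' '{'                                – 5 tokens
'int' 'a' '=' '10' ',' 'b' '=' '20' ';'                 – 9 tokens
'printf' '(' '"sum is :%d"' ',' 'a' '+' 'b' ')' ';'     – 9 tokens
'return' '0' ';'                                        – 3 tokens
'}'                                                     – 1 token

5 + 9 + 9 + 3 + 1 = 27 tokens.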
Exercise 2:
Count the number of tokens:

int max(int i);

● The lexical analyzer first reads int, finds it to be valid, and accepts it as a token.
● max is read next and found to be a valid function name after reading (.
● int is also a token; then i is another token, and finally ;.

Answer: Total number of tokens is 7:
int, max, (, int, i, ), ;

Lexical Analysis – Compiler Design

Lexical analysis is the process of converting a sequence of characters from a source program into a sequence of tokens.

A program which performs lexical analysis is termed a lexical analyzer (lexer), tokenizer or scanner.

Lexical analysis consists of two stages of processing, which are as follows:
• Scanning
• Tokenization

Token, Pattern and Lexeme

Token
A token is a valid sequence of characters given by a lexeme. In a programming language,
• keywords,
• constants,
• identifiers,
• numbers,
• operators and
• punctuation symbols
are possible tokens to be identified.

Pattern
A pattern describes a rule that must be matched by a sequence of characters (lexemes) to form a token. It can be defined by regular expressions or grammar rules.

Lexeme
A lexeme is a sequence of characters that matches the pattern for a token, i.e. an instance of a token.

E.g. for c=a+b*5; the lexemes and tokens are:

Lexeme | Token
c      | identifier
=      | assignment symbol
a      | identifier
+      | + (addition symbol)
b      | identifier
*      | * (multiplication symbol)
5      | 5 (number)

The sequence of tokens produced by the lexical analyzer helps the parser in analyzing the syntax of the programming language.
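The table above can be reproduced by a small C sketch (again our own illustration) that classifies each single-character lexeme of c=a+b*5;. Real lexers of course handle multi-character lexemes; one character per lexeme is assumed here for brevity:

#include <stdio.h>
#include <ctype.h>

/* classify(): map a single-character lexeme to its token name */
const char *classify(char ch) {
    if (isalpha((unsigned char)ch)) return "identifier";
    if (isdigit((unsigned char)ch)) return "number";
    switch (ch) {
        case '=': return "assignment symbol";
        case '+': return "addition symbol";
        case '*': return "multiplication symbol";
        case ';': return "separator";
        default:  return "unknown";
    }
}

int main(void) {
    const char *stmt = "c=a+b*5;";
    for (const char *p = stmt; *p != '\0'; p++)
        printf("%c -> %s\n", *p, classify(*p));
    return 0;
}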
Role of the Lexical Analyzer

The lexical analyzer performs the following tasks:
• Reads the source program, scans the input characters, groups them into lexemes and produces tokens as output.
• Enters the identified tokens into the symbol table.
• Strips out white space and comments from the source program.
• Correlates error messages with the source program, i.e. displays each error message with its occurrence by specifying the line number.
• Expands macros if they are found in the source program.

The tasks of the lexical analyzer can be divided into two processes:
Scanning: performs the reading of input characters and the removal of white space and comments.
Lexical analysis: produces tokens as the output.

Need for a Lexical Analyzer
Simplicity of compiler design: the removal of white space and comments lets the syntax analyzer work with efficient syntactic constructs.
Improved compiler efficiency: specialized buffering techniques for reading characters speed up the compilation process.
Enhanced compiler portability.

Lexical Errors
• A character sequence that cannot be scanned into any valid token is a lexical error.
• Lexical errors are uncommon, but they still must be handled by a scanner.
• Misspellings of identifiers, keywords, or operators are considered lexical errors.
Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.

SPECIFICATION OF TOKENS

There are 3 ways to specify tokens:
1) Strings
2) Languages
3) Regular expressions

Strings and Languages
● An alphabet or character class is a finite set of symbols.
● A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
● A language is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.

Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.

Operations on languages
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure

The following example shows these operations on languages. Let L = {0, 1} and S = {a, b, c}.
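Applying the four operations to these two languages works out as follows:

1. Union: L ∪ S = {0, 1, a, b, c}
2. Concatenation: LS = {0a, 0b, 0c, 1a, 1b, 1c}
3. Kleene closure: L* = the set of all strings of 0's and 1's, including ε
4. Positive closure: L+ = the set of all strings of 0's and 1's, excluding ε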
Regular Expressions
· Each regular expression r denotes a language L(r).
· Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If 'a' is a symbol in Σ, then 'a' is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with 'a' in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has the highest precedence and is left associative.
5. Concatenation has the second-highest precedence and is left associative.
6. | has the lowest precedence and is left associative.

Regular set
A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.
There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms. For instance, r|s = s|r is commutative and r|(s|t) = (r|s)|t is associative.

Non-regular set
A language which cannot be described by any regular expression is a non-regular set. Example: the set of all strings of balanced parentheses and the set of repeating strings cannot be described by a regular expression; such sets can be specified by a context-free grammar.

Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.
Example: Identifiers form the set of strings of letters and digits beginning with a letter. A regular definition for this set:
letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9
id → letter ( letter | digit )*

Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
1. One or more instances (+):
- The unary postfix operator + means "one or more instances of".
- If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a's.
- The operator + has the same precedence and associativity as the operator *.
2. Zero or one instance (?):
- The unary postfix operator ? means "zero or one instance of".
- The notation r? is a shorthand for r | ε.
- If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
3. Character classes:
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a–z] denotes the regular expression a | b | c | d | …. | z.
- We can describe identifiers as strings generated by the regular expression [A–Za–z][A–Za–z0–9]*.
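As a small self-check of the identifier definition above, this C sketch (our own, using no regex library) tests whether a string matches letter ( letter | digit )*, i.e. the character-class pattern [A–Za–z][A–Za–z0–9]*:

#include <stdio.h>
#include <ctype.h>

/* is_identifier(): 1 if s matches letter(letter|digit)*, else 0 */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))      /* first symbol must be a letter */
        return 0;
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))  /* remainder: letters or digits */
            return 0;
    return 1;
}

int main(void) {
    printf("%d\n", is_identifier("rate60"));    /* 1: letter then letters/digits */
    printf("%d\n", is_identifier("9lives"));    /* 0: starts with a digit */
    printf("%d\n", is_identifier("abs_zero"));  /* 0: '_' is not in the class */
    return 0;
}

Note that this definition deliberately excludes the underscore; languages such as C extend the character classes to [A–Za–z_][A–Za–z0–9_]*.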