Assembler–
The assembler takes assembly code (the compiler's target code) as input and produces relocatable machine code as output.
Linker–
A linker or link editor is a program that takes a collection of object files (created by assemblers and compilers) and combines them into an executable program.
Loader–
The loader places the linked program into main memory for execution.
Executable code–
It is low-level, machine-specific code that the machine can understand directly. Once the linker and loader have done their jobs, the object code is finally converted into executable code.
Differences between Linker/Loader :
The differences between linker and loader are as follows.
• Linker: The linker is part of the library files.
• Loader: The loader is part of the operating system.
Functions of loader :
1. Allocation –
It allocates space in main memory for the object program. A translator cannot allocate space because there may be overlap or large wastage of memory.
2. Linking –
It combines two or more separate object programs and resolves the symbolic references between object decks. It also provides the information needed to allow references between them. Linking is of two types, as follows.
Static Linking :
It copies all the library routines used in the program into an
executable image. This requires more disk space and memory.
Dynamic Linking :
It resolves undefined symbols while the program is running. This means that the executable code still contains undefined symbols, plus a list of objects or libraries that will provide definitions for them (see the sketch following this list of functions).
3. Relocation –
It modifies the object program so that it can be loaded at an address different from the location originally specified, and it adjusts all address-dependent locations accordingly.
4. Loading –
Physically, it keeps machine instructions and data in memory for
execution.
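As a concrete illustration of dynamic linking, the following is a minimal C sketch using the POSIX dlopen/dlsym interface to resolve a symbol at run time. It assumes a Unix-like system where the math library is available as libm.so.6 (the exact file name varies by platform); compile with cc demo.c -ldl.

#include <stdio.h>
#include <dlfcn.h>

int main(void) {
    /* Resolve the library and the symbol while the program is running. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* Look up the address of cos() and call it through a pointer. */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (cosine) printf("cos(0.0) = %f\n", cosine(0.0));

    dlclose(handle);
    return 0;
}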
5. We basically have two phases of compilers, namely the Analysis phase and the Synthesis phase. The Analysis phase creates an intermediate representation from the given source code. The Synthesis phase creates an equivalent target program from the intermediate representation.
6. Symbol Table – It is a data structure used and maintained by the compiler; it contains all the identifiers' names along with their types. It helps the compiler function smoothly by letting it find identifiers quickly.
7. The analysis of a source program is divided into three main phases. They are:
1. Linear Analysis –
This involves a scanning phase in which the stream of characters is read from left to right and grouped into tokens having a collective meaning.
2. Hierarchical Analysis –
In this analysis phase, based on a collective meaning, the tokens are categorized hierarchically into nested groups.
3. Semantic Analysis –
This phase checks whether the components of the source program are meaningful or not.
The compiler has two modules, namely the front end and the back end. The front end consists of the lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator. The remaining phases are assembled to form the back end.
1. Lexical Analyzer –
It is also called a scanner. It takes the output of the preprocessor (which performs file inclusion and macro expansion) as its input, which is in pure high-level language. It reads the characters from the source program and groups them into lexemes (sequences of characters that "go together"). Each lexeme corresponds to a token. Tokens are defined by regular expressions, which are understood by the lexical analyzer. It also removes lexical errors (e.g., erroneous characters), comments and white space.
2. Syntax Analyzer – It is sometimes called the parser. It constructs the parse tree.
It takes all the tokens one by one and uses Context Free Grammar to construct the
parse tree.
Why Grammar?
The rules of programming can be entirely represented in a few productions.
Using these productions we can represent what the program actually is. The input
has to be checked whether it is in the desired format or not.
The parse tree is also called the derivation tree. Parse trees are generally constructed to check for ambiguity in the given grammar. There are certain rules associated with the derivation tree:
1. All leaf nodes are terminals.
2. All interior nodes are non-terminals.
3. In-order traversal gives the original input string.
A syntax error can be detected at this level if the input is not in accordance with the grammar.
3. Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful or not. It then produces a verified parse tree. It also performs type checking, label checking and flow-control checking.
4. Intermediate Code Generator – It generates intermediate code, i.e., a form that can be readily translated into target machine code. There are many popular intermediate codes; three-address code is one example. Intermediate code is converted to machine language by the last two phases, which are platform dependent.
Up to intermediate code, the compiler is the same for every platform; after that, it depends on the platform. To build a new compiler we don't need to build it from scratch: we can take the intermediate code from an existing compiler and build the last two parts for the new platform.
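For instance, the three-address code for the running example c = a + b * 5 might look as follows (a sketch; the temporary names t1 and t2 are illustrative):

t1 = b * 5
t2 = a + t1
c = t2

Each three-address instruction has at most one operator on the right-hand side, so the operator precedence chosen by the parser (the multiplication before the addition) is made explicit.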
Nearly all compilers separate the task of analyzing syntax into two distinct parts, lexical analysis and syntax analysis, mainly for the following reasons:
• Simplicity
o The syntax analyzer can be smaller and cleaner by removing the low-level details of lexical analysis.
• Portability
o Input-device-specific peculiarities can be confined to the lexical analyzer.
Initially both pointers, bp (lexeme begin) and fp (forward), point to the first character of the first buffer. Then fp moves towards the right in search of the end of the lexeme. As soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer character (eof) should be placed at the end of the first buffer.
Similarly, the end of the second buffer is recognized by the end-of-buffer mark present at its end. When fp encounters the first eof, the end of the first buffer is recognized and filling of the second buffer begins. In the same way, when the second eof is reached, it indicates the end of the second buffer. Alternately, both buffers can be refilled until the end of the input program is reached and the stream of tokens is identified. The eof character introduced at the end is called a sentinel and is used to identify the end of a buffer.
Tokens, patterns and lexemes
Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are keywords, identifiers, operators and punctuation symbols.
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Example:
Token      Sample lexeme           Pattern
IF         if                      if
LPAREN     (                       (
relation   <, <=, =, <>, >=, >     < or <= or = or <> or >= or >
A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
Analysis part
• The analysis part breaks the source program into constituent pieces and imposes a grammatical structure on them; this structure is then used to create an intermediate representation of the source program.
• It is also termed the front end of the compiler.
• Information about the source program is collected and stored in a data structure
called symbol table.
Synthesis part
• The synthesis part takes the intermediate representation as input and transforms it into the target program.
• It is also termed the back end of the compiler.
The design of compiler can be decomposed into several phases, each of which
converts one form of source program into another.
The different phases of compiler are as follows:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation
All of the aforementioned phases involve the following tasks:
• Symbol table management.
• Error handling.
Lexical Analysis
• Lexical analysis is the first phase of compiler which is also termed as scanning.
• The source program is scanned to read the stream of characters, and those characters are grouped into sequences called lexemes, for which tokens are produced as output.
• Token: Token is a sequence of characters that represent lexical unit, which matches
with the pattern, such as keywords, operators, identifiers etc.
• Lexeme: A lexeme is an instance of a token, i.e., a group of characters forming a token.
• Pattern: Pattern describes the rule that the lexemes of a token takes. It is the
structure that must be matched by strings.
• Once a token is generated the corresponding entry is made in the symbol table.
Input: stream of characters
Output: Token
Token Template: <token-name, attribute-value>
(eg.) c=a+b*5;
Lexemes and tokens
Lexemes    Tokens
c          identifier
=          assignment symbol
a          identifier
+          + (addition symbol)
b          identifier
*          * (multiplication symbol)
5          5 (number)
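As a minimal sketch of how a scanner might produce these pairs for c=a+b*5;, the following C program groups the characters into lexemes and prints <token, lexeme> pairs. The token names (id, number, op) and the emit helper are illustrative simplifications, not the output of a production lexer.

#include <stdio.h>
#include <ctype.h>

static void emit(const char *token, const char *lexeme) {
    printf("<%s, %s>\n", token, lexeme);
}

int main(void) {
    const char *src = "c=a+b*5;";
    char buf[32];
    for (int i = 0; src[i] != '\0'; ) {
        if (isalpha((unsigned char)src[i])) {        /* identifier */
            int j = 0;
            while (isalnum((unsigned char)src[i])) buf[j++] = src[i++];
            buf[j] = '\0';
            emit("id", buf);
        } else if (isdigit((unsigned char)src[i])) { /* number */
            int j = 0;
            while (isdigit((unsigned char)src[i])) buf[j++] = src[i++];
            buf[j] = '\0';
            emit("number", buf);
        } else {                                     /* single-character operator */
            buf[0] = src[i++]; buf[1] = '\0';
            emit("op", buf);
        }
    }
    return 0;
}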
• Syntax analysis is the second phase of compiler which is also called as parsing.
• Parser converts the tokens produced by lexical analyzer into a tree like representation
called parse tree.
• A parse tree describes the syntactic structure of the input.
• Syntax tree is a compressed representation of the parse tree in which the operators
appear as interior nodes and the operands of the operator are the children of the node
for that operator.
Input: Tokens
Output: Syntax tree
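To make this concrete, here is a minimal recursive-descent sketch in C for the small expression grammar E -> T { + T }, T -> F { * F }, F -> id | digit. It assumes single-character identifiers and digits and only checks whether input such as a+b*5 conforms to the grammar; building the actual tree is omitted.

#include <stdio.h>
#include <ctype.h>

static const char *p;                 /* current input position */

static int factor(void) {             /* F -> id | digit (single char) */
    if (isalnum((unsigned char)*p)) { p++; return 1; }
    return 0;
}
static int term(void) {               /* T -> F { '*' F } */
    if (!factor()) return 0;
    while (*p == '*') { p++; if (!factor()) return 0; }
    return 1;
}
static int expr(void) {               /* E -> T { '+' T } */
    if (!term()) return 0;
    while (*p == '+') { p++; if (!term()) return 0; }
    return 1;
}

int main(void) {
    p = "a+b*5";
    printf("%s\n", expr() && *p == '\0' ? "accepted" : "syntax error");
    return 0;
}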
Semantic Analysis
• Semantic analysis checks whether the syntax tree constructed by the parser is meaningful, using type information from the symbol table.
• It performs type checking and inserts implicit type conversions where the language allows them.
Input: Syntax tree
Output: Semantically verified syntax tree
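As a small illustration of type checking and conversion on the running example, the following C sketch shows the implicit int-to-float conversion (inttofloat) that the semantic analyzer inserts, which is why the intermediate code shown later uses 5.0 rather than 5.

#include <stdio.h>

int main(void) {
    float a = 1.5f, b = 2.0f, c;
    c = a + b * 5;          /* 5 is implicitly converted to 5.0f here */
    printf("%f\n", c);      /* prints 11.500000 */
    return 0;
}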
Code Optimization
• The code optimization phase gets the intermediate code as input and produces optimized intermediate code as output.
• It results in faster running machine code.
• It can be done by reducing the number of lines of code for a program.
• This phase reduces the redundant code and attempts to improve the intermediate
code so that faster-running machine code will result.
• During the code optimization, the result of the program is not affected.
• To improve the code generation, the optimization involves
o Detection and removal of dead code (unreachable code).
o Calculation of constants in expressions and terms (constant folding).
o Collapsing of repeated expressions into temporary variables.
o Loop unrolling.
o Moving code outside the loop (code motion).
o Removal of unwanted temporary variables.
(eg.) For the statement c = a + b * 5, the optimized intermediate code is:
t1 = id3 * 5.0
id1 = id2 + t1
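The following hypothetical before/after C sketch illustrates two of the optimizations listed above, constant folding and moving loop-invariant code outside the loop.

/* Before: 60 * 2 and the product x * 60 * 2 are recomputed on every
 * iteration, although neither depends on the loop variable. */
void fill_before(int *a, int n, int x) {
    for (int i = 0; i < n; i++)
        a[i] = x * 60 * 2;
}

/* After: 60 * 2 is folded to the constant 120 at compile time, and the
 * loop-invariant product x * 120 is hoisted out of the loop. */
void fill_after(int *a, int n, int x) {
    int t = x * 120;
    for (int i = 0; i < n; i++)
        a[i] = t;
}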
Code Generation
• Code generation is the final phase of the compiler. It takes the optimized intermediate code as input and maps it to the target machine language.
Input: Optimized intermediate code
Output: Target machine code
Symbol Table Management
• Symbol table is used to store all the information about identifiers used in the program.
• It is a data structure containing a record for each identifier, with fields for the attributes
of the identifier.
• It allows finding the record for each identifier quickly and to store or retrieve data from
that record.
• Whenever an identifier is detected in any of the phases, it is stored in the symbol
table.
Example
int a, b; float c; char z;
Identifier   Type    Address
a            int     1000
b            int     1002
c            float   1004
z            char    1008
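A minimal sketch of how such a symbol table might be implemented, using a flat array of records and a linear lookup; real compilers typically use hash tables, and the field and function names here are illustrative.

#include <stdio.h>
#include <string.h>

struct symbol {
    char name[32];     /* identifier's lexeme             */
    char type[16];     /* e.g. "int", "float", "char"     */
    int  address;      /* relative address assigned to it */
};

static struct symbol table[100];
static int count = 0;

static void insert(const char *name, const char *type, int address) {
    strcpy(table[count].name, name);
    strcpy(table[count].type, type);
    table[count].address = address;
    count++;
}

static struct symbol *lookup(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

int main(void) {
    /* int a, b; float c; char z;  -- as in the example above */
    insert("a", "int", 1000);
    insert("b", "int", 1002);
    insert("c", "float", 1004);
    insert("z", "char", 1008);

    struct symbol *s = lookup("c");
    if (s) printf("%s : %s @ %d\n", s->name, s->type, s->address);
    return 0;
}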
Example
1 extern double test (double x);
...
6     return sum;
7 }

Identifier   Type     Scope
x            double   function parameter
i            int      for-loop statement
Error Handling
• Each phase can encounter errors. After detecting an error, a phase must handle the
error so that compilation can proceed.
• In lexical analysis, errors occur in separation of tokens.
• In syntax analysis, errors occur during construction of syntax tree.
• In semantic analysis, errors may occur in the following cases:
(i) When the compiler detects constructs that have the right syntactic structure but no meaning.
(ii) During type conversion.
• In code optimization, errors occur when the result is affected by the optimization. In code generation, errors occur when, for example, required code is missing.
The figure illustrates the translation of source code through each phase, considering the statement c = a + b * 5.
Error Encountered in Different Phases
Each phase can encounter errors. After detecting an error, a phase must somehow deal with the error so that compilation can proceed.
A program may have the following kinds of errors at various stages:
Semantic Errors
These errors are a result of incompatible value assignments. The semantic errors that the semantic analyzer is expected to recognize are:
• Type mismatch.
• Undeclared variable.
• Reserved identifier misuse.
• Multiple declaration of variable in a scope.
• Accessing an out of scope variable.
• Actual and formal parameter mismatch.
Logical errors
These are errors in the logic of the program, such as infinite loops or use of a wrong algorithm; the program compiles and runs, but produces incorrect results. The compiler cannot detect logical errors.
Front end
• The front end comprises the phases that depend on the source language and are independent of the target machine.
• It includes lexical and syntax analysis, symbol table management, semantic analysis and the generation of intermediate code.
• Some code optimization can also be done by the front end.
• It also includes error handling for the phases concerned.
Back End
• The back end comprises the phases of the compiler that depend on the target machine and are independent of the source language.
• This includes code optimization and code generation.
• In addition, it also encompasses error handling and symbol table management operations.
Passes
• The phases of the compiler can be implemented in a single pass, whose primary actions are reading the input file and writing the output file.
• Several phases of the compiler are grouped into one pass in such a way that the operations of each phase are performed during that pass.
• (eg.) Lexical analysis, syntax analysis, semantic analysis and intermediate code
generation might be grouped into one pass. If so, the token stream after lexical
analysis may be translated directly into intermediate code.
Reducing the Number of Passes
• Minimizing the number of passes improves the time efficiency as reading from and
writing to intermediate files can be reduced.
• When grouping phases into one pass, the entire program may have to be kept in memory to ensure proper information flow to each phase, because one phase may need information in a different order than that in which a previous phase produces it.
• The internal representation of the source or target program differs from its external form, so the memory needed for the internal form may be larger than that needed for the input and output.
Syntax-Directed Translation Engines
Input: Parse tree.
Output: Intermediate code.
Syntax-directed translation engines produce collections of routines that walk a parse tree and generate intermediate code.
Automatic Code Generators
Input: Intermediate language.
Output: Machine language.
Code-generator takes a collection of rules that define the translation of each
operation of the intermediate language into the machine language for a target
machine.
Data-flow Analysis Engines
Data-flow analysis engine gathers the information, that is, the values transmitted
from one part of a program to each of the other parts. Data-flow analysis is a key
part of code optimization.
Compiler Construction Toolkits
Compiler construction toolkits provide an integrated set of routines for constructing the various phases of a compiler.
Token
A token is a sequence of characters that can be treated as a single logical entity, such as a keyword, identifier or operator.
Lexeme
A lexeme is a sequence of characters that matches the pattern for a token, i.e., an instance of a token.
(eg.) c=a+b*5;
Lexemes and tokens
Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)
The sequence of tokens produced by lexical analyzer helps the parser in analyzing
the syntax of programming languages.
Role of Lexical Analyzer
Lexical analyzer performs the following tasks:
• Reads the source program, scans the input characters, groups them into lexemes and produces tokens as output.
• Enters the identified tokens into the symbol table.
• Strips out white space and comments from the source program.
• Correlates error messages with the source program, i.e., displays an error message with its occurrence by specifying the line number.
• Expands macros if they are found in the source program.
Tasks of lexical analyzer can be divided into two processes:
Scanning: Performs reading of input characters, removal of white spaces and
comments.
Lexical Analysis: Produces tokens as the output.
Need of Lexical Analyzer
Lexical analysis is the process of producing tokens from the source program. It has
the following issues:
• Lookahead
• Ambiguities
Lookahead
Lookahead is required to decide when one token ends and the next token begins. Simple examples that raise lookahead issues are i vs. if and = vs. ==. Therefore a way to describe the lexemes of each token is required.
A way is needed to resolve the following ambiguities:
• Is if two variables i and f, or the keyword if?
• Is == two assignment symbols = =, or a single equality operator ==?
• Is arr(5, 4) an array reference or a function call like fn(5, 4)? (In Ada, array reference syntax and function call syntax are similar.)
Hence, the number of lookahead characters to be considered, and a way to describe the lexemes of each token, are both needed.
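A minimal C sketch of the = vs. == case: the scanner peeks one character ahead before deciding which token to emit. The token names EQ and ASSIGN are illustrative.

#include <stdio.h>

int main(void) {
    const char *src = "a==b=c";
    for (int i = 0; src[i] != '\0'; i++) {
        if (src[i] == '=') {
            if (src[i + 1] == '=') {   /* lookahead: another '=' follows */
                printf("<EQ, ==>\n");
                i++;                   /* consume the second '=' */
            } else {
                printf("<ASSIGN, =>\n");
            }
        } else {
            printf("<id, %c>\n", src[i]);
        }
    }
    return 0;
}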
Regular expressions are one of the most popular ways of representing tokens.
Ambiguities
The lexical analysis programs written with lex accept ambiguous specifications and
choose the longest match possible at each input point. Lex can handle ambiguous
specifications. When more than one expression can match the current input, lex
chooses as follows:
• The longest match is preferred.
• Among rules which matched the same number of characters, the rule given first is
preferred.
Lexical Errors
• A character sequence that cannot be scanned into any valid token is a lexical error.
• Lexical errors are uncommon, but they still must be handled by a scanner.
• Misspellings of identifiers, keywords, or operators are considered lexical errors.
Usually, a lexical error is caused by the appearance of some illegal character, mostly
at the beginning of a token.
Error Recovery Schemes
In panic mode recovery, unmatched patterns are deleted from the remaining input,
until the lexical analyzer can find a well-formed token at the beginning of what input
is left.
(eg.) For instance, the string fi is encountered for the first time in a C program in the context:
fi (a == f(x))
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.
Local correction
In local correction, the lexical analyzer repairs the remaining input with simple edits such as deleting an extraneous character, inserting a missing character, replacing an incorrect character, or transposing two adjacent characters.
Input Buffering
• To ensure that the right lexeme is found, one or more characters have to be looked ahead beyond the next lexeme.
• Hence a two-buffer scheme is introduced to handle large lookaheads safely.
• Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the buffer end, have been adopted.
There are three general approaches for the implementation of a lexical analyzer:
(i) By using a lexical-analyzer generator, such as lex compiler to produce the lexical
analyzer from a regular expression based specification. In this, the generator
provides routines for reading and buffering the input.
(ii) By writing the lexical analyzer in a conventional systems-programming language,
using I/O facilities of that language to read the input.
(iii) By writing the lexical analyzer in assembly language and explicitly managing the
reading of input.
We’ll be covering the following topics in this tutorial:
Buffer Pairs
Sentinels
Buffer Pairs
Scheme
• Consists of two buffers, each of N-character size, which are reloaded alternately.
• N is the number of characters in one disk block, e.g., 4096.
• N characters are read from the input file into a buffer using one system read command.
• eof is inserted at the end if the number of characters read is less than N.
Pointers
• Two pointers into the input are maintained: the lexeme-begin pointer marks the start of the current lexeme, while the forward pointer scans ahead until a match for a pattern is found.
• This scheme works well most of the time, but the amount of lookahead is limited.
• This limited lookahead may make it impossible to recognize tokens in situations where the distance that the forward pointer must travel is more than the length of the buffer.
(eg.) DECLARE (ARG1, ARG2, . . . , ARGn) in a PL/I program.
• The lexical analyzer cannot determine whether DECLARE is a keyword or an array name until it sees the character that follows the right parenthesis.
Sentinels
• In the previous scheme, each time the forward pointer is moved, a check must be done to ensure that it has not moved off one half of the buffer. If it has, the other half must be reloaded.
• Therefore the ends of the buffer halves require two tests for each advance of the
forward pointer.
Test 1: For end of buffer.
Test 2: To determine what character is read.
• The usage of sentinel reduces the two tests to one by extending each buffer half to
hold a sentinel character at the end.
• The sentinel is a special character that cannot be part of the source
program. (eof character is used as sentinel).
Advantages
• Most of the time, it performs only one test, namely whether the forward pointer points to an eof.
• Only when it reaches the end of the buffer half or eof, it performs more tests.
• Since N input characters are encountered between eofs, the average number of
tests per input character is very close to 1.
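The following is a minimal C sketch of the two-buffer scheme with sentinels described above; the buffer size N, the layout and the fill_half helper are simplified stand-ins for real I/O machinery, and '\0' plays the role of the eof sentinel.

#include <stdio.h>

#define N 4096                     /* characters per buffer half */
#define SENTINEL '\0'              /* eof character marking a buffer end */

static char buf[2 * N + 2];        /* two halves, each followed by a sentinel */
static char *forward;              /* the forward (fp) pointer */

/* (Re)fill one buffer half with up to N characters and terminate it
 * with the sentinel. */
static void fill_half(FILE *in, char *half) {
    size_t n = fread(half, 1, N, in);
    half[n] = SENTINEL;
}

/* Advance the forward pointer and return the next character, or EOF.
 * In the common case only a single test (against the sentinel) is done. */
static int next_char(FILE *in) {
    int c = (unsigned char)*forward++;
    if (c != SENTINEL)
        return c;                          /* the usual single test */
    if (forward == buf + N + 1) {          /* hit end of first half */
        fill_half(in, buf + N + 1);        /* reload second half */
        return next_char(in);
    }
    if (forward == buf + 2 * N + 2) {      /* hit end of second half */
        fill_half(in, buf);                /* reload first half */
        forward = buf;
        return next_char(in);
    }
    return EOF;                            /* sentinel inside a half: real end of input */
}

int main(void) {
    fill_half(stdin, buf);                 /* prime the first half */
    forward = buf;
    long count = 0;
    while (next_char(stdin) != EOF)
        count++;
    printf("%ld characters read\n", count);
    return 0;
}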
Union
The union of two languages L and M produces the set of strings which may be in language L, in language M, or in both. It can be denoted as
L ∪ M = {p | p is in L or p is in M}
Concatenation
The concatenation of two languages L and M produces the set of strings formed by appending the strings of M to the strings of L (each string in L is followed by a string in M). It can be represented as
LM = {pq | p is in L and q is in M}
Closure
Positive closure (L+)
Positive closure indicates one or more occurrences of input symbols in a string, i.e., it excludes the empty string Ɛ (the set of strings with 1 or more occurrences of input symbols).
Kleene closure (L*)
Kleene closure indicates zero or more occurrences of input symbols in a string, i.e., it includes the empty string Ɛ.
L3 – the set of strings each of length 3.
(eg.) Let Σ = {a, b}
L* = {Ɛ, a, b, aa, ab, ba, bb, aab, aba, aaba, …}
L+ = {a, b, aa, ab, ba, bb, aab, aaba, …}
L3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}
Precedence of operators
Regular expressions are a combination of input symbols and language operators such as union, concatenation and closure. Among these operators, the unary closure operator * has the highest precedence, concatenation has the next highest, and the union operator | has the lowest; all three are left associative.
Regular expressions can be used to describe the identifiers of a language. An identifier is a collection of letters, digits and underscores that must begin with a letter or an underscore. Hence, the regular expression for an identifier can be given by
letter_ (letter | digit)*
Note: The vertical bar ( | ) means 'or' (the union operator).
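As a quick check of this pattern, here is a small C sketch using the POSIX regex API (regcomp/regexec); the character-class form of letter_ (letter | digit)* assumes ASCII identifiers.

#include <stdio.h>
#include <regex.h>

int main(void) {
    regex_t re;
    /* ^[A-Za-z_][A-Za-z0-9_]*$  --  letter_ (letter | digit)* */
    regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED | REG_NOSUB);

    const char *samples[] = { "count", "_tmp1", "9lives" };
    for (int i = 0; i < 3; i++)
        printf("%-8s %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "identifier" : "no match");

    regfree(&re);
    return 0;
}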
The following describes the language denoted by a regular expression:
Regular set: the language defined by a regular expression.
Two regular expressions are equivalent if they represent the same regular set. For example,
(p | q) = (q | p)
Algebraic laws of regular expressions
Law                                        Description
r | s = s | r                              | is commutative
r | (s | t) = (r | s) | t                  | is associative
r(st) = (rs)t                              Concatenation is associative
r(s | t) = rs | rt;  (s | t)r = sr | tr    Concatenation distributes over |
Ɛr = rƐ = r                                Ɛ is the identity for concatenation
r** = r*                                   * is idempotent
Regular Definition
A regular definition gives names to regular expressions and uses those names in subsequent expressions, as if the names were themselves symbols. For example, the identifiers of a language can be defined as:
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ (letter_ | digit)*
Figure – Compiler-Process
2. Interpreter:
An interpreter is a program that translates a high-level programming language into a form the machine can understand, executing it directly.
It translates only one statement of the program at a time.
Interpreters are, more often than not, smaller than compilers.
Figure – Interpreter-Process
Let’s see the difference between Compiler and Interpreter:
Translation type:
Compiler – translates the complete high-level programming code into machine code at once.
Interpreter – translates one statement of programming code at a time into machine code.

Running time:
Compiler – takes an enormous time to analyze the source code; however, overall, compiled programming code runs faster in comparison to interpreted code.
Interpreter – takes less time to analyze the source code as compared to a compiler; however, overall, interpreted programming code runs slower in comparison to compiled code.

Memory requirement:
Compiler – a compiled program is generated into an intermediate object code, which further requires linking, so there is a requirement for more memory.
Interpreter – an interpreted program does not generate an intermediate code, so there is no requirement for extra memory.

Error handling:
Compiler – if the program contains no error, the compiler converts the source code program into machine code, links all the code files into a single runnable program (known as the exe file), and finally runs the program and generates output.
Interpreter – if an error is found at any specific statement, the interpreter stops further execution until the error gets removed.

Later execution:
Compiler – does not require the source code for later execution.
Interpreter – requires the source code for later execution.

Typical languages:
Compiler – programming languages like C, C++, Java use compilers.
Interpreter – programming languages like JavaScript, Python, Ruby use interpreters.