
Phases of Compiler

A Compiler is composed of several components called Phases.

 Each phase performs a specific task.


 The complete compilation process can be divided into six phases, and these phases can be grouped into two parts:
1. Analysis part
2. Synthesis part
1. Analysis part: In this part the source program is broken into its constituent pieces and an intermediate representation is created.
Analysis can be done in 3 phases.
i) Lexical Analysis
ii) Syntax Analysis
iii) Semantic Analysis
2. Synthesis part: The synthesis part constructs the desired target program from the intermediate representation. This can be done in the following 3 phases:
i) Intermediate Code Generation
ii) Code Optimization
iii) Code Generation

Sometimes, we divide the synthesis part as follows:

i) Intermediate code generation


ii) Storage Allocation
iii) Code Optimization
iv) Code Generation

Front ends and Back ends

 The front end of a compiler includes all analysis phases and the intermediate code generator phase.
 The back end includes the code optimization and the final code generation phases.
 The front end analyzes the source program and produces intermediate code,
while the back end synthesizes the target program from the intermediate code.

(1)Lexical Analyzer (Scanner)


This module has the task of separating the continuous stream of characters into groups that make sense; such a group is called a token.

The main objectives of the Lexical Analyzer are:

1. To specify the tokens of the language


2. To efficiently recognize the tokens

Example:
Find the output for the following expression after each phase of compilation
position = initial + rate * 60
SOLUTION: 1) Lexical Analyzer (Scanner)

Lexeme       Token

position     Identifier

=            Assignment Operator

initial      Identifier

+            Addition Operator

rate         Identifier

*            Multiplication Operator

60           Integer Constant
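For illustration only, the following small C sketch scans the expression above and prints the same lexeme/token table (the token class names and the classify() helper are assumptions made for this sketch, not part of any particular compiler):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Classify a single lexeme into an (illustrative) token class. */
static const char *classify(const char *lexeme) {
    if (strcmp(lexeme, "=") == 0) return "Assignment Operator";
    if (strcmp(lexeme, "+") == 0) return "Addition Operator";
    if (strcmp(lexeme, "*") == 0) return "Multiplication Operator";
    if (isdigit((unsigned char)lexeme[0])) return "Integer Constant";
    return "Identifier";
}

int main(void) {
    const char *src = "position = initial + rate * 60";
    char lexeme[32];
    int i = 0;

    while (src[i] != '\0') {
        int n = 0;
        if (isspace((unsigned char)src[i])) { i++; continue; }   /* skip blanks */
        if (isalpha((unsigned char)src[i])) {                    /* identifier  */
            while (isalnum((unsigned char)src[i])) lexeme[n++] = src[i++];
        } else if (isdigit((unsigned char)src[i])) {             /* integer constant */
            while (isdigit((unsigned char)src[i])) lexeme[n++] = src[i++];
        } else {                                                 /* single-character operator */
            lexeme[n++] = src[i++];
        }
        lexeme[n] = '\0';
        printf("%-10s %s\n", lexeme, classify(lexeme));
    }
    return 0;
}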

(2) Syntax Analyzer (parser)
 The syntactic level describes the way that program statements are constructed from tokens,
i.e. the syntax analyzer is the module in which the overall structure of the program is identified; this involves an understanding of the order in which the symbols in a program may appear.
 The main task of the parser is to group the tokens into sentences, i.e. to determine whether the sequence of
tokens that has been extracted by the lexical analyzer is in the correct order or not.
 For analyzing each sentence, the parser builds an abstract tree structure known as a syntax tree.
 The syntax tree facilitates transformations of the program that may lead to possible minimization of the
machine instructions.
Example:

                 =
               /   \
       position     +
                   /   \
            initial     *
                       /   \
                   rate     60
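As a rough illustration (the node layout and helper names are assumptions, not a fixed representation), a syntax tree node for such expressions might be declared and built in C as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A syntax tree node: an interior node holds an operator, a leaf holds a lexeme. */
struct node {
    char op[12];              /* "=", "+", "*", or "" for a leaf */
    char lexeme[16];          /* "position", "rate", "60", ...   */
    struct node *left, *right;
};

static struct node *leaf(const char *lex) {
    struct node *n = calloc(1, sizeof *n);
    strcpy(n->lexeme, lex);
    return n;
}

static struct node *interior(const char *op, struct node *l, struct node *r) {
    struct node *n = calloc(1, sizeof *n);
    strcpy(n->op, op);
    n->left = l;
    n->right = r;
    return n;
}

int main(void) {
    /* Tree for: position = initial + rate * 60 */
    struct node *t =
        interior("=", leaf("position"),
                 interior("+", leaf("initial"),
                          interior("*", leaf("rate"), leaf("60"))));
    printf("root operator: %s\n", t->op);   /* prints "=" */
    return 0;
}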

(3) Semantic Analyzer


The semantic analyzer gathers type information & checks the tree produced by the syntax analyzer for
semantic errors.

                 =
               /   \
       position     +
                   /   \
            initial     *
                       /   \
                   rate     inttoreal
                                |
                                60
(4) Intermediate code generator
After semantic analysis, the compiler generates an intermediate representation of the source program that is both easy to produce and easy to translate into the target program.

 There are a variety of forms used for intermediate code; these include two-address code, three-address code, and so on (i.e. it looks like code for a memory-to-memory machine where every operator reads its operands from memory and writes its result to memory).
 The intermediate code generator usually has to create temporary locations to hold intermediate results.

Ex: position = initial + rate * 60

temp1 = inttoreal(60)

temp2 = rate * temp1

temp3 = initial + temp2

position = temp3
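A minimal sketch of how this three-address code might be represented as quadruples (the struct layout and the temporary names are assumptions for illustration):

#include <stdio.h>

/* One three-address instruction in quadruple form: (op, arg1, arg2, result). */
struct quad {
    const char *op;
    const char *arg1;
    const char *arg2;
    const char *result;
};

int main(void) {
    /* Intermediate code for: position = initial + rate * 60 */
    struct quad code[] = {
        { "inttoreal", "60",      "",      "temp1"    },
        { "*",         "rate",    "temp1", "temp2"    },
        { "+",         "initial", "temp2", "temp3"    },
        { "=",         "temp3",   "",      "position" },
    };

    for (int i = 0; i < 4; i++)
        printf("(%s, %s, %s, %s)\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}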

(5) Code Optimization


The main objective of this phase is to improve the intermediate code so as to generate code that runs faster and/or occupies less space.

Ex: temp1 = inttoreal(60)

temp1 = rate * temp1

temp1 = initial + temp1

position = temp1

(6) Code Generator: The main objective of this phase is to allocate storage and generate machine/assembly code.

 Memory locations and registers are allocated for identifiers.


 The instructions in intermediate code format are converted into machine instructions.

Ex: position = initial + rate * 60

Id1 = Id2 + Id3 * 60

MOV R1, #60.0

MUL R1, R1, Id3

ADD R1, R1, Id2

MOV Id1, R1
(*) Storage Allocation

Every constant and variable appearing in the program must have storage space allocated for its value during the storage allocation phase. This storage space may be of three types:

a) Static Storage: If the lifetime of the variable is the lifetime of the program and the space for its value, once allocated, cannot later be released, this kind of allocation is called static storage allocation.
b) Dynamic Storage: If the lifetime is a particular block, function, or procedure, so that the space may be released when the block, function, or procedure in which it is allocated is left, it is called dynamic storage.
c) Global Storage: Its lifetime is unknown at compile time, and it has to be allocated and deallocated at runtime. The efficient control of such storage usually implies runtime overheads.

After space is allocated by the storage allocation phase, an address containing as much as is known about its location at compile time is passed to the code generator for its use.

(a) Symbol table manager

 It contains a symbol table, which is a data structure that stores/records the information of each identifier given by its lexeme.
 It has facilities to manipulate (add/delete) elements in it.
 It provides a "FIND function", which looks up the information (descriptor) of the identifier given by the lexeme.
 The FIND function returns a pointer to the identifier's information.
 If the FIND function returns a NULL pointer, the symbol table has no record of that identifier.
 It also has an "INSERT function" to insert an identifier, given by its lexeme, into the symbol table (a small sketch of FIND and INSERT is given below, after the descriptor definition).

Descriptor:- It is a record which stores the information of each identifier
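A minimal sketch, assuming a simple linear-list organization (real compilers usually use hash tables), of a symbol table with INSERT and FIND:

#include <stdio.h>
#include <string.h>

/* Descriptor: a record storing the information of one identifier. */
struct descriptor {
    char name[32];
    char type[16];            /* e.g. "int", "real" */
};

static struct descriptor table[100];   /* the symbol table itself (no overflow check in this sketch) */
static int count = 0;

/* FIND: return a pointer to the identifier's descriptor, or NULL if it has no record. */
struct descriptor *find(const char *lexeme) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, lexeme) == 0)
            return &table[i];
    return NULL;
}

/* INSERT: add a new identifier, given by its lexeme, to the symbol table. */
struct descriptor *insert(const char *lexeme, const char *type) {
    strcpy(table[count].name, lexeme);
    strcpy(table[count].type, type);
    return &table[count++];
}

int main(void) {
    insert("rate", "real");
    struct descriptor *d = find("rate");
    if (d != NULL)
        printf("%s : %s\n", d->name, d->type);      /* prints "rate : real" */
    if (find("position") == NULL)
        printf("position is not yet in the symbol table\n");
    return 0;
}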

(b) Error Handler

The compiler may detect errors in different phases; it reports errors such as:

1. Lexical Error
2. Syntax Error
3. Semantic Error
4. Intermediate code generator Error
5. Code generator Error
6. Code optimizer error
7. Symbol table entry error

Errors can be encountered by various phases of a compiler:

i. Lexical Analyzer may be unable to proceed - if the next token is misspelled


ii. Syntax Analyzer: may be unable to infer a structure for its input because a syntactic error, such as a missing parenthesis, has occurred.

iii. Semantic Analyzer: detects errors that are semantic in nature, i.e. some statements may be correct from the syntactic point of view but make no sense, and there is no code that can be generated to carry out the meaning of the statement.
iv. Intermediate code Generator: may detect an operator whose operands have incompatible types.
v. The code optimizer may detect that certain statements can never be reached.
vi. A code generator may find a compiler-created constant that is too large to fit in a word of the target machine.
vii. While entering information into the symbol table, the symbol table manager may discover a multiply-declared identifier with contradictory attributes.

Cross Compiler
A cross compiler is a compiler that runs on one machine and produces object code for another machine. Like any compiler, it is characterized by three languages:

(1) The Source Language

(2) The Object Language

(3) The Language in which it is written.

Bootstrap:- If a compiler has been implemented in its own language then this arrangement is called a
bootstrap arrangement.

Ex: a C compiler written in the C programming language.

Number of passes of compiler


On the basis of the regrouping of phases, a compiler may be a one-pass or a multi-pass compiler.

1. One pass compiler


 In a one-pass compiler, when a line of source is processed, it is scanned and the tokens are extracted.
 Then the syntax of the line is analyzed and a tree structure is built.
 Then it is semantically checked for correctness.
 The process is repeated for every line until the whole program has been compiled.
2. Multi pass Compiler
 In a multi-pass compiler, each function of the compiler is performed by one pass of the compiler.
 Here the compiler scans the input and produces a first modified form.
 It then scans the first modified form and produces a second modified form, and so on.
 This process is continued until the object code is produced.

Differentiate between one pass and multi pass compiler

One-pass compiler:
1. It is faster, because the source is loaded into memory only once.
2. It has some restrictions upon the program, i.e. constants, types, variables, and procedures must be defined before they are used.
3. The components of a one-pass compiler are inter-related.
   Ex. All the programmers working on the project must have knowledge about the entire project.

Multi-pass compiler:
1. It is slower than a one-pass compiler, because the output of each pass is stored in memory and must be read in each time the next pass starts.
2. It has no such restriction, because the program differs more and more from one modified form to another.
3. It can be decomposed into passes that can be relatively independent.
   Ex. A team of programmers can work on the project with little interaction among them.

Lexical Analysis (Scanning)


Lexical Analysis is the operation of reading the input program and breaking it up into a sequence of lexemes (tokens).

 The main task of the Lexical Analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.
 Then the stream of tokens is sent to the parser for syntax analysis.
 It is common for the Lexical analyzer to interact with the symbol table as well. When the
lexical Analyzer discovers a Lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table
 In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token that must be passed to the parser.

Interaction between lexical analyzer & the parser


Diagram:

[source code → Lexical Analyzer ⇄ (getNextToken / token) ⇄ Parser → Semantic Analysis; both the Lexical Analyzer and the Parser consult the Symbol Table]
Here the interaction is implemented by having the parser call the Lexical Analyzer.

 The call, suggested by the getNextToken command, causes the Lexical Analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
 i.e. the Lexical Analyzer is usually implemented as a subroutine of the parser.

Task of Lexical Analyzer

1. Separation of the input source code into tokens, such as keywords, identifiers, constants, and operators.
2. Stripping out unnecessary white space (blanks, newlines, tabs) and comments.
3. Keeping track of line numbers while scanning the newline characters. The line numbers are used by the error handler to print the error messages.
4. Detecting lexical errors.
5. The output of the lexical analysis phase is the input to the syntax analysis phase. (A small sketch of these tasks is given below.)
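A minimal sketch of such a scanner, implemented as a subroutine that the parser calls repeatedly (the token kinds, names, and the hard-coded input string are illustrative assumptions):

#include <ctype.h>
#include <stdio.h>

enum kind { TK_ID, TK_NUM, TK_OP, TK_EOF };

struct token {
    enum kind kind;
    char lexeme[32];
    int  line;                 /* line number, kept for error messages */
};

static const char *input = "count = count + 1\nlimit = 10\n";
static int pos = 0, line = 1;

/* get_next_token: skip white space, group characters into the next lexeme,
   and return the corresponding token; called repeatedly by the parser.    */
struct token get_next_token(void) {
    struct token t = { TK_EOF, "", line };
    while (isspace((unsigned char)input[pos])) {     /* strip blanks, tabs, newlines */
        if (input[pos] == '\n') line++;
        pos++;
    }
    t.line = line;
    if (input[pos] == '\0') return t;
    int n = 0;
    if (isalpha((unsigned char)input[pos])) {
        t.kind = TK_ID;
        while (isalnum((unsigned char)input[pos])) t.lexeme[n++] = input[pos++];
    } else if (isdigit((unsigned char)input[pos])) {
        t.kind = TK_NUM;
        while (isdigit((unsigned char)input[pos])) t.lexeme[n++] = input[pos++];
    } else {
        t.kind = TK_OP;
        t.lexeme[n++] = input[pos++];
    }
    t.lexeme[n] = '\0';
    return t;
}

int main(void) {
    /* Stand-in for the parser's loop: pull tokens until end of input. */
    for (struct token t = get_next_token(); t.kind != TK_EOF; t = get_next_token())
        printf("line %d: %s\n", t.line, t.lexeme);
    return 0;
}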

Errors in Lexical Analysis phase (Lexical Errors)

Lexical errors include:

I. Misspelled keyword
II. Numeric literals are too long
III. Input characters that are not in the source language
IV. Identifiers that are too long (after a warning is given)

Lexical errors & error recovery actions

1. Let us consider the statement "fi (a == f)". Here "fi" is a misspelled keyword. This error is not detected in the lexical analysis phase; "fi" is taken as an identifier. The error is then detected in the syntax analysis phase of compilation.
2. In other cases the lexical analyzer is not able to continue with the process of compilation; it then resorts to panic mode error recovery.

The Lexical Analyzer can perform the following actions to identify a token:

a) Deleting the successive characters from the remaining input until a token is detected.
b) Deleting extraneous characters.
c) Inserting missing characters
d) Replacing an incorrect character by a correct character
e) Transposing two adjacent characters

3. Minimum-distance error correction is the strategy generally followed by the lexical analyzer to correct errors in a lexeme.
It is nothing but the minimum number of corrections to be made to convert an invalid lexeme into a valid lexeme.

Input Buffering :

Storing a block of input data in a buffer, to avoid a costly access to secondary storage for every character, is called input buffering.

 The lexical analyzer uses two pointers to read tokens.


 They are the ‘lb’ (lexeme-beginning) pointer, which indicates the beginning of the lexeme, and the ‘sp’ (search pointer), which keeps track of the portion of the input string scanned so far.
lb,sp
↓
Begin I=I+1;J=J+1;…

(Initial position of the pointers ‘lb’ & ‘sp’)

 Initially both pointers point to the beginning of a lexeme


 The search pointer ’sp’ then starts scanning forward to search for the end of the
lexeme.
 The end of the lexeme in this case is indicated by the blank space after ‘begin’
 The lexeme is identified only when ‘sp’ scans the blank space after ‘begin’.

lb   sp
↓    ↓
Begin I=I+1;J=J+1;…

 When the end of the lexeme is identified, the token and the attribute corresponding to this lexeme are returned.
 ‘lb’ & ‘sp’ are then made to point to the beginning of the next token

      lb,sp
      ↓
Begin I=I+1;J=J+1;…

(Updating of the pointers for the next lexeme)

Buffering methods
Reading the input character by character from the secondary storage is costly. A block of data is
read first into a buffer & then scanned by the lexical analyzer

For this purpose we need buffering methods

They are of two types:

A. One-buffer scheme    B. Two-buffer scheme


a) One buffer scheme: In this scheme we read the input character by character from a single buffer.
 But there is a problem if a lexeme crosses the buffer boundary.
 To scan the rest of the lexeme, the buffer has to be refilled, thereby overwriting the first part of the lexeme.
b) Two buffers scheme:

    Buffer-1:  Begin I=I+1;J=…….J+1;EOF      (‘lb’ points into Buffer-1)
    Buffer-2:  End;                           (‘sp’ points into Buffer-2)

(Here, buffer 1 & buffer 2 are scanned alternately)

 When the end of the current buffer is reached, the other buffer is filled. Hence the
problem encountered in the previous method is solved
 In this scheme, the 2nd buffer is loaded when the first buffer becomes full.
 Similarly, the first buffer is filled when the end of the 2nd buffer is reached, and the ‘sp’ pointer is then incremented.
 Hence two tests have to be done each time the ‘sp’ pointer is incremented, i.e.
(1) one for the end of the buffer
(2) another to determine what character is read
 This can be reduced to one test per character if we include a sentinel character (i.e. a special character that is not part of the input program) at the end of each buffer.
Ex. Such a character is EOF (end of file).
 So only when the EOF character is encountered is a second check made as to which buffer has to be refilled, and the action is performed. Hence the average number of tests per input character is 1.

 Sentinel character: an extra character, other than the input characters, added at the end of each input buffer to reduce the number of buffer tests. Ex. the EOF character. (A sketch of this scheme is given below.)
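A rough sketch of the buffer-pair scheme with sentinels (the buffer size, the use of '\0' as the sentinel, and the string standing in for secondary storage are assumptions made for illustration):

#include <stdio.h>

#define N 8                      /* size of each buffer half (kept small for illustration) */
#define SENTINEL '\0'            /* stands in for the special EOF character                */

static const char *source = "position=initial+rate*60";   /* pretend secondary storage */
static int src_pos = 0;

static char buf1[N + 1], buf2[N + 1];   /* buffer pair, each ending in a sentinel */
static char *sp;                        /* search (forward) pointer               */

/* Fill one buffer half from the "input" and place the sentinel after the data. */
static void refill(char *buf) {
    int n = 0;
    while (n < N && source[src_pos] != '\0')
        buf[n++] = source[src_pos++];
    buf[n] = SENTINEL;
}

/* Advance sp by one character; only when the sentinel is seen is the extra
   check made: end of a buffer half (refill the other one) or real end of input. */
static int advance(void) {
    char c = *sp++;
    if (c != SENTINEL)
        return c;                        /* the common case: one test per character */
    if (sp == buf1 + N + 1) {            /* reached the end of buffer 1 */
        refill(buf2);
        sp = buf2;
        return advance();
    }
    if (sp == buf2 + N + 1) {            /* reached the end of buffer 2 */
        refill(buf1);
        sp = buf1;
        return advance();
    }
    return -1;                           /* sentinel inside a buffer: real end of input */
}

int main(void) {
    refill(buf1);
    sp = buf1;
    for (int c = advance(); c != -1; c = advance())
        putchar(c);                      /* echoes the whole input, buffer by buffer */
    putchar('\n');
    return 0;
}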

Why regular expression notation can be used for specification of tokens

Regular expressions can be used to specify a set of strings. A set of string that can be specified
by using regular expression notation is called a regular set.

The tokens of a programming language constitute a regular set. Hence this regular set can be specified by using regular expression notation. Therefore, we write regular expressions for things like operators, keywords, and identifiers.

Ex: The regular expressions specifying a subset of the tokens of a typical programming language are as follows:

Operator = + | - | * | mod

Keyword = if | while | do | then

Letter = a | b | c | … | z | A | B | … | Z

Digit = 0 | 1 | 2 | … | 9

Identifier = letter (letter | digit)* = letter | letter letter | letter digit | …

 The regular expression notation is compact and precise, and for every regular expression there is a DFA that accepts the language it specifies.
 The DFA is used to recognize the language specified by the R.E. notation, making the automatic construction of token recognizers possible. So we need both R.E.s and finite automata.

Issues involved in the design of a Lexical Analyzer

1. Identifying the tokens of the language for which the lexical analyzer is to be built, and specifying these tokens by using a suitable notation.
2. Constructing a suitable recognizer for these tokens.

Therefore the next step is the construction of a DFA from the R.E. But the DFA is only a flowchart-like (graphical) representation of the lexical analyzer.

Therefore, after constructing a DFA, the next step is to write a program in a suitable programming language that will simulate the DFA.

This program acts as the token recognizer of the lexical analyzer.
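For example, a C program that simulates the DFA for identifier = letter (letter | digit)* might look like this (the state numbering is an assumption):

#include <ctype.h>
#include <stdio.h>

/* Simulate the DFA for:  identifier = letter (letter | digit)*
   State 0: start (no input seen)
   State 1: accepting (a letter followed by letters/digits has been seen)
   State 2: dead (the string cannot be an identifier)                     */
static int is_identifier(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        switch (state) {
        case 0: state = isalpha((unsigned char)*s) ? 1 : 2; break;
        case 1: state = isalnum((unsigned char)*s) ? 1 : 2; break;
        case 2: return 0;              /* stay dead once a mismatch occurs */
        }
    }
    return state == 1;                 /* accept only in state 1 */
}

int main(void) {
    const char *samples[] = { "rate", "x1", "60", "9lives", "" };
    for (int i = 0; i < 5; i++)
        printf("%-8s -> %s\n", samples[i],
               is_identifier(samples[i]) ? "identifier" : "not an identifier");
    return 0;
}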

Therefore it is possible to automate the procedure of obtaining the lexical analyzer from the R.E
specifying the tokens.

For this purpose we use a tool, i.e. LEX.

 LEX is a compiler writing tool that facilitates writing the lexical analyzer
 Its input is the set of R.E.s specifying the tokens to be recognized, and it generates as output a C program that acts as a lexical analyzer for the tokens specified by the input R.E.s.

Syntax Analysis (parsing)


A parser for a grammar is a program (parsing is a technique) that takes as input a string W and produces as output either a parse tree for W, if W is a valid sentence of the grammar,

or an error message indicating that W is not a valid sentence of the given grammar.

 Syntax analysis verifies whether the tokens produced by the lexical analyzer are properly sequenced in accordance with the grammar of the source language.

[Diagram: source code → Lexical Analyzer ⇄ (getNextToken / tokens) ⇄ Parser → parse tree → Semantic Analysis; the Lexical Analyzer and the Parser both use the Symbol Table and report to the Error Handler]

 The parser should also report syntactic errors in a manner that is easily understood by the user. It should also have procedures to recover from these errors and to continue the parsing action.

Error detection

A compiler should detect and recover from errors. A simple compiler stops all activities except lexical and syntactic analysis after detection of the first error.

 Errors may occur in the design specification, in algorithms, in transcription, during compilation, etc.

The error handler has the following goals:

1. It should report on the presence of the errors clearly and accurately


2. It should recover from errors quickly, so as to be able to detect subsequent errors.
3. It should not slow down the process of compilation. Hence the error handler performs detection, recovery, repair, and correction of errors.
 For this, the reported error messages should possess the following characteristics:
1. They should give the location of the error in the source program
2. They should be easily understandable by the user
3. They should be specific & localize the problem
4. They should not be redundant
Errors in syntax phase
Syntactic errors are errors in structure, e.g. a missing operator or unbalanced parentheses.
1. Missing parenthesis:-
Let a = (( a + c ) d + e )
Here an ‘(‘ is missing
2. Extraneous insertion error
for ( i=0, ; i= 10; i++)
an extra ‘,’ is inserted
3. Replacement error
a ‘,’ or a ‘:’ can be used in place of a ‘;’
I =0:
j = 9;

PARSER
  1. Universal
  2. Top-down
       a) Backtracking      → Recursive descent
       b) Non-backtracking  → Table-driven
  3. Bottom-up
       a) Operator precedence
       b) LR parsers: SLR (LR(0)), LALR (LR(0)+LR(1)), CLR (LR(1))

There are three general types of parsers for grammars

1. Universal parsing
2. Top down
3. Bottom up

Universal parsing: methods such as the CYK (Cocke-Younger-Kasami) algorithm and Earley's algorithm can parse any grammar. These general methods are too inefficient to use in compilers.

The methods commonly used in compilers can be classified as either top-down or bottom-up.
Top-down methods build the parse tree from the top (root) to the bottom (leaves),
while bottom-up methods start from the leaves and work their way up to the root.
 In either case the input to the parser is scanned from left to right, i.e. one symbol at a time.
 In top-down parsers we use the left-most derivation, but in bottom-up parsers we use the right-most derivation (in reverse).
Top down parsing

Basically, top-down parsing attempts to find the left-most derivation for the input string W, since the string W is scanned by the parser from left to right, one symbol/token at a time, and the left-most derivation generates the leaves of the parse tree in left-to-right order, which matches the input scan order.

Ex: consider the grammar

Vn = {expr, term, rest}
Vt = {+, -, 0, 1, 2, 3, …, 8, 9}
expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | … | 8 | 9

Show the construction of the parse tree for the input 8-6+4

                 expr
               /      \
           term        rest
             |       /   |    \
             8      -   term   rest
                          |   /  |   \
                          6  +  term   rest
                                  |      |
                                  4      ε
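A minimal recursive-descent sketch in C for this grammar, run on the same input 8-6+4 (the global lookahead variable and the trace output are illustrative assumptions):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *input = "8-6+4";
static int lookahead;                  /* current input symbol */

static void error(void)  { printf("syntax error\n"); exit(1); }
static void match(int c) { if (lookahead == c) lookahead = *++input; else error(); }

/* term -> 0 | 1 | ... | 9 */
static void term(void) {
    if (isdigit(lookahead)) { printf("term: %c\n", lookahead); match(lookahead); }
    else error();
}

/* rest -> + term rest | - term rest | epsilon */
static void rest(void) {
    if (lookahead == '+' || lookahead == '-') {
        int op = lookahead;
        match(op);
        term();
        printf("apply: %c\n", op);
        rest();
    }
    /* epsilon: do nothing */
}

/* expr -> term rest */
static void expr(void) { term(); rest(); }

int main(void) {
    lookahead = *input;
    expr();
    if (lookahead == '\0') printf("accepted\n");
    else error();
    return 0;
}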

Backtracking parser

Basically, in the top-down mechanism, every terminal symbol generated by some production of the grammar (which is predicted) is matched with the input string symbol pointed to by the string marker (pointer).

 If the match is successful, the parse can continue


 If a mismatch occurs, then the predictions have gone wrong. At this stage it becomes necessary to reject some previous predictions.
 The prediction which led to the mismatching terminal symbol is rejected, and the string marker (pointer) is reset to the position it had when the rejected prediction was made. This is known as backtracking.
 Backtracking is one of the major drawbacks of top-down parsing.

Ex: consider the grammar

S → aAb

A → cd | c

Show the backtracking for the string w = acb

      S                  S                    S
    / | \              / | \                / | \
   a  A  b            a  A  b              a  A  b
                        / \                   |
                       c   d                  c
                           ↑
                    point of failure

   fig.(1)            fig.(2)              fig.(3)

(In fig.(2) the prediction A → cd fails at 'd'; the parser backtracks by resetting the pointer and tries A → c, shown in fig.(3).)
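A small C sketch of this backtracking behaviour, simplified to what is needed for this grammar and input (the helper names and the position-resetting scheme are assumptions):

#include <stdio.h>
#include <string.h>

static const char *w = "acb";   /* the input string */

/* Try to match string 's' starting at position 'pos'; return the new
   position on success or -1 on failure.                                */
static int match(int pos, const char *s) {
    size_t n = strlen(s);
    if (strncmp(w + pos, s, n) == 0) return pos + (int)n;
    return -1;
}

/* A -> cd | c : try the first alternative; if it fails, backtrack by
   retrying from the same saved position with the next alternative.     */
static int parse_A(int pos) {
    int p = match(pos, "cd");           /* fig.(2): predict A -> cd     */
    if (p != -1) return p;
    printf("mismatch on A -> cd, backtracking (pointer reset to %d)\n", pos);
    return match(pos, "c");             /* fig.(3): predict A -> c      */
}

/* S -> aAb */
static int parse_S(int pos) {
    int p = match(pos, "a");
    if (p == -1) return -1;
    p = parse_A(p);
    if (p == -1) return -1;
    return match(p, "b");
}

int main(void) {
    int p = parse_S(0);
    if (p != -1 && w[p] == '\0') printf("\"%s\" accepted\n", w);
    else                         printf("\"%s\" rejected\n", w);
    return 0;
}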

