
Panimalar Institute of Technology

CS6660 – COMPILER DESIGN

UNIT I - INTRODUCTION TO COMPILERS

Translators - Compilation and Interpretation - Language Processors - The Phases of Compiler -
Errors Encountered in Different Phases - The Grouping of Phases - Compiler Construction
Tools - Programming Language Basics.

COMPILERS:
Definition: A Compiler is a translator that reads a program written in one language (source
language) and translates it into an equivalent program in another language (target language).

Fig: Source program -> Compiler -> Target program (the compiler also outputs error messages)

During the translation process, the compiler reports to its users the presence of errors in the
source program.
• The machine-language target program produced by a compiler is usually much faster
than an interpreter at mapping inputs to outputs.

Language Processors
A compiler is a program that can read a program in one language, the source language, and
translate it into an equivalent program in another language, the target language, as seen in the
figure below.
An important role of the compiler is to report any errors in the source program that it detects
during the translation process.

C6660 –COMPILER DESIGN 1 VI Sem CSE



If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.

Fig: Running the target Program

Preprocessor
A preprocessor produces input to a compiler. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands
for longer constructs.
Macro: A named sequence of instructions that can be used repeatedly in the program.
Eg) #define
2. File inclusion: A preprocessor may include header files into the program text.
Eg) #include
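The two preprocessor functions above can be illustrated with a minimal sketch in Python. This is not how a real C preprocessor is implemented: the function names expand and preprocess and the headers dictionary (standing in for files on disk) are assumptions made for illustration, only object-like macros are handled, and macro bodies are expanded when they are defined rather than when they are used.

```python
import re

def expand(line, macros):
    # Replace every identifier that names a macro with the macro's body.
    return re.sub(r"[A-Za-z_]\w*",
                  lambda m: macros.get(m.group(0), m.group(0)), line)

def preprocess(text, headers, macros=None):
    """Return the program text as a list of lines with directives processed.

    headers is a hypothetical dict mapping header names to their text.
    """
    macros = {} if macros is None else macros
    out = []
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "#include":
            # File inclusion: splice the processed header text in place.
            name = parts[1].strip('"<>')
            out.extend(preprocess(headers[name], headers, macros))
        elif parts and parts[0] == "#define":
            # Macro definition: remember the (already expanded) body.
            macros[parts[1]] = expand(" ".join(parts[2:]), macros)
        else:
            out.append(expand(line, macros))
    return out
```

For example, with a header that defines MAX as 100 and a program that defines TWICE as (2 * MAX), the line "int n = TWICE;" reaches the compiler as "int n = (2 * 100);".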
Compiler
A compiler is a translator program that reads a program written in a high-level language (HLL),
the source program, and translates it into an equivalent program in machine-level language
(MLL), the target program. An important part of a compiler's job is reporting errors to the
programmer.
Assembler:
• Programmers found it difficult to write or read programs in machine language.
They began to use a mnemonic (symbol) for each machine instruction, which
they would subsequently translate into machine language. Such a mnemonic
machine language is now called an assembly language.
• Programs known as assemblers were written to automate the translation of
assembly language into machine language.
• The input to an assembler is called the source program; the output is a
machine language translation (the object program).


Loader and Link-editor:

• “A loader is a program that places programs into memory and prepares them
for execution.”
• It would be more efficient if subroutines could be translated into an object form
that the loader could “relocate” directly behind the user's program.
• The task of adjusting programs so that they may be placed in arbitrary core locations
is called relocation.
• The job of the link editor is to make a single program from several files of
relocatable machine code. If the code in one file refers to a location in another
file, the reference is called an external reference. The link editor also resolves
such external references.
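Relocation and the resolution of external references can be sketched in a few lines of Python. The module layout used here (a dict with "size", "defs" and "refs") is a hypothetical object-file format invented for this illustration; real object formats are far richer.

```python
def link(modules):
    """Combine relocatable modules into one address space.

    Each module is a dict with:
      "size": number of words it occupies,
      "defs": {symbol: offset inside the module},
      "refs": [(offset of the word to patch, external symbol name)].
    Returns (symbol addresses, {absolute address to patch: resolved address}).
    """
    base, bases, symbols = 0, [], {}
    for m in modules:                      # pass 1: relocation, assign load addresses
        bases.append(base)
        for name, offset in m["defs"].items():
            symbols[name] = base + offset
        base += m["size"]
    patches = {}
    for b, m in zip(bases, modules):       # pass 2: resolve external references
        for offset, name in m["refs"]:
            if name not in symbols:
                raise NameError("unresolved external reference: " + name)
            patches[b + offset] = symbols[name]
    return symbols, patches
```

If a 10-word main module refers to sqrt at its word 4, and a library module defines sqrt at offset 2, the library is placed at address 10, so sqrt resolves to address 12 and word 4 is patched with 12.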

INTERPRETER:
An interpreter is another common kind of language processor. Instead of producing a target
program as a translation, an interpreter appears to directly execute the operations specified in
the source program on inputs supplied by the user.

Fig: An Interpreter
Advantages:
• Modifications to the user program can be made easily and take effect as execution
proceeds.
• The type of an object may change dynamically.
• Debugging a program and finding errors is simpler for an interpreted program.
• The interpreter for the language makes it machine independent.
Disadvantages:
• The execution of the program is slower.
• Memory consumption is higher.
Example:
Java language processors combine compilation and interpretation, as shown below in Fig.
• A Java source program may first be compiled into an intermediate form called byte
codes.
• The byte codes are then interpreted by a virtual machine.


A benefit of this arrangement is that byte codes compiled on one machine can be interpreted
on another machine, perhaps across a network.

• In order to achieve faster processing of inputs to outputs, some Java compilers, called
just-in-time compilers, translate the byte codes into machine language immediately
before they run the intermediate program to process the input.

TRANSLATOR

A translator is a program that takes as input a program written in one language and produces
as output a program in another language. Besides program translation, the translator performs
another very important role: error detection. Any violation of the HLL specification
is detected and reported to the programmer. The important roles of a translator are:

1. Translating the HLL program input into an equivalent machine language program.
2. Providing diagnostic messages wherever the programmer violates the specification of the
HLL.

TYPES OF TRANSLATORS:
1. INTERPRETER
2. COMPILER
3. PREPROCESSOR
A preprocessor, generally considered a part of the compiler, is a tool that produces
input for compilers. It deals with macro processing, augmentation, file inclusion,
language extension, etc.


Comparison between Interpreter and Compiler:


Compiler                                                     | Interpreter
Takes the entire program as input                            | Takes a single instruction as input
Intermediate object code is generated                        | No intermediate object code is generated
Conditional control statements execute faster                | Conditional control statements execute slower
Memory requirement is more (since object code is generated)  | Memory requirement is less
Program need not be compiled every time                      | Higher-level program is converted into lower-level program every time
Errors are displayed after the entire program is checked     | Errors are displayed for every instruction interpreted (if any)
Example: C compiler                                          | Example: BASIC

THE GROUPING OF PHASES


Front end and Back end:
Several phases may be grouped together to form a front end and a back end, each of which
reads an input file and writes an output file.
Front End:
The front end consists of those phases that depend primarily on the source language and
are largely independent of the target language.
For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and
intermediate code generation might be grouped together into one pass.
Code optimization might be an optional pass.
Back End:
The back end consists of those phases that depend on the target language and
are independent of the source language.
For example, there could be a back-end pass consisting of code generation for a
particular target machine.
The front end and back end model of the compiler is advantageous for the
following reasons:


1. By keeping the same front end and attaching different back ends, one can produce
compilers for the same source language on different machines.
2. By keeping the same back end and attaching different front ends, one can compile
several languages on the same machine.

The Analysis – Synthesis Model of Compilation:


There are two parts to compilation.
1. Analysis (Machine Independent/Language Dependent)
2. Synthesis (Machine Dependent/Language independent)
Analysis: The analysis part breaks the source program into constituent pieces and creates an
intermediate representation of it. The analysis part also collects information about the
source program and stores it in a data structure called a symbol table.
The analysis part is often called the front end of the compiler.

Synthesis: The desired target program is constructed from the intermediate representation and
the information in the symbol table. The synthesis part requires the most specialized techniques.
The synthesis part is often called the back end of the compiler.

Fig: Source program -> Analysis -> Intermediate code -> Synthesis -> Target program

Analysis Of The Source Program


The analysis consists of three phases. They are:
1. Linear Analysis: The stream of characters in the source program is read from left to
right and grouped into tokens, which are sequences of characters having a collective meaning.
It is also called lexical analysis.
2. Hierarchical Analysis: The tokens are grouped hierarchically into nested collections.
It is also called syntax analysis.
3. Semantic Analysis: Checks are performed to ensure that the components of a
program fit together meaningfully.

THE PHASES OF A COMPILER:

• A compiler operates in phases, each of which transforms the source program from
one representation to another.
• The graphical representation of the phases of a compiler is given below:

• The first three phases form the bulk of the analysis portion of a compiler.
• Two other activities, symbol table management and error handling, interact with the six
phases of the compiler.

1. Lexical Analysis:
The first phase of a compiler is called lexical analysis or scanning. It reads the stream of
characters in the source program from left to right and groups the characters into meaningful
sequences called lexemes, each of which is mapped to a token. Consider the statement
position := initial + rate * 60
The following tokens are identified in the above statement.
• The identifier position
• The assignment symbol :=
• The identifier initial
• The plus sign
• The identifier rate
• The multiplication sign
• The number 60
• The blanks separating the characters would be discarded by the lexical analyzer.
• The character sequence forming a token is called the lexeme of the token.
• Certain tokens will be augmented by a lexical value.
• For each lexeme, the lexical analyzer produces as output a token of the form
<token-name, attribute-value>

After lexical analysis, the statement is represented as the sequence of tokens

<id,1> <:=> <id,2> <+> <id,3> <*> <60>
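A lexical analyzer for this one statement form can be sketched in Python with regular expressions. The token names (num, id, assign, op), the TOKEN_SPEC table and the tokenize function are assumptions made for this illustration; identifiers are numbered in order of first appearance, mirroring the <id,1>, <id,2>, <id,3> tokens of the statement.

```python
import re

# (token name, regular expression) pairs; order matters for the alternation.
TOKEN_SPEC = [
    ("num",    r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("assign", r":="),
    ("op",     r"[+*]"),
    ("skip",   r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(source):
    symbols = []   # toy symbol table: entry i holds the lexeme of <id, i+1>
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            # The remaining characters form no token of the language.
            raise SyntaxError("lexical error at %r" % source[pos])
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue                       # blanks are discarded
        if kind == "id":
            if lexeme not in symbols:
                symbols.append(lexeme)
            tokens.append(("id", symbols.index(lexeme) + 1))
        elif kind == "num":
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append((lexeme, None))  # operators carry no attribute value
    return tokens, symbols
```

Calling tokenize("position := initial + rate * 60") yields seven tokens and the symbol list position, initial, rate; a character such as ? that matches no pattern is reported as a lexical error.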

2. Syntax Analysis:
• Hierarchical analysis is called parsing or syntax analysis. It involves grouping the
tokens of the source program into grammatical phrases.
• A syntax tree is built based on this grouping of tokens.
• The hierarchical structure generated in this phase is called a syntax tree or parse tree, in
which each interior node represents an operation and the children of the node
represent the arguments of the operation.
• The parser checks whether the expression formed by the tokens is syntactically correct.
For the expression
position := initial + rate * 60 the syntax tree can be generated as follows:


A leaf is a record with two or more fields, one to identify the token at the leaf, and others to
record the information about the token.
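The grouping of tokens into a syntax tree can be sketched with a small recursive-descent parser in Python. This is only one possible technique, chosen here for brevity; the function name parse_assignment, the tuple representation (operator, left, right) and the use of plain strings as tokens are assumptions of this sketch. Multiplication binds tighter than addition, so rate * 60 becomes a subtree of the + node.

```python
def parse_assignment(tokens):
    """Parse a token list such as ["position", ":=", "initial", "+", "rate", "*", "60"]."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():                      # factor -> identifier | number
        tok = peek()
        if tok is None or tok in (":=", "+", "*"):
            raise SyntaxError("operand expected, found %r" % (tok,))
        return advance()

    def term():                        # term -> factor { * factor }
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expr():                        # expr -> term { + term }
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    target = factor()
    if peek() != ":=":
        raise SyntaxError("':=' expected")
    advance()
    tree = (":=", target, expr())
    if peek() is not None:
        raise SyntaxError("unexpected token %r" % peek())
    return tree
```

Interior nodes are tuples whose first element is the operation and whose remaining elements are its arguments; leaves are the identifier and number lexemes.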
3. Semantic Analysis:
• The semantic analyzer uses the syntax tree and the information in the symbol table
to check the source program for semantic errors and gathers type information for the
subsequent code-generation phase.
• It identifies the operators and operands of expressions and statements.
• An important component of semantic analysis is type checking, where the compiler
checks that each operator has matching operands.
• For example, when a binary arithmetic operator is applied to an integer and a real,
the compiler may need to convert the integer to a real value.
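The integer-to-real conversion described above can be sketched as a tree walk in Python. The function name check, the type names "int" and "real", and the inttoreal node label are assumptions of this sketch; the types dictionary stands in for the type information held in the symbol table.

```python
def check(node, types):
    """Return (typed_tree, type) for an expression tree.

    node is a leaf (identifier string or integer constant) or a tuple
    (op, left, right); types maps identifier names to "int" or "real".
    """
    if isinstance(node, tuple):
        op, left, right = node
        left, lt = check(left, types)
        right, rt = check(right, types)
        if lt != rt:
            # Mixed integer/real operands: widen the integer side.
            if lt == "int":
                left = ("inttoreal", left)
            else:
                right = ("inttoreal", right)
            lt = "real"
        return (op, left, right), lt
    if isinstance(node, int):
        return node, "int"
    if node not in types:
        raise TypeError("undeclared identifier %r" % node)
    return node, types[node]
```

With initial and rate declared as real, the subtree for rate * 60 becomes ("*", "rate", ("inttoreal", 60)), and the whole expression has type real.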

4. Intermediate code generation:

• Some compilers generate an explicit intermediate representation of the source
program after syntax and semantic analysis.
• The intermediate representation should have two properties. They are:
i. It should be easy to produce.
ii. It should be easy to translate into the target program.
• Intermediate code can be represented in a variety of forms, such as
three-address code, quadruples, triples, and postfix notation. We consider “three-address code” as the
intermediate code. It is like an assembly language.
• The source code given may appear in the three address code as

• This intermediate form has several properties. They are:

a. Each three-address instruction has at most one operator in addition to the
assignment.
b. The compiler must generate a temporary name to hold the value computed by
each instruction.
c. Some three-address instructions have fewer than three operands.
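The properties listed above can be seen in a minimal Python sketch that walks a syntax tree and emits three-address code. The function name gen_three_address, the tuple tree representation and the inttoreal conversion node are assumptions of this sketch, not part of the notes; each interior node gets one instruction and one fresh temporary.

```python
def gen_three_address(tree):
    """tree is (":=", target, expr); expr nodes are (op, left, right) or ("inttoreal", x)."""
    code = []
    counter = 0

    def new_temp():
        nonlocal counter
        counter += 1
        return "temp%d" % counter

    def walk(node):
        if not isinstance(node, tuple):
            return str(node)               # leaf: identifier or constant
        if node[0] == "inttoreal":
            arg = walk(node[1])
            temp = new_temp()
            code.append("%s := inttoreal(%s)" % (temp, arg))
            return temp
        op, left, right = node
        l, r = walk(left), walk(right)
        temp = new_temp()                  # one temporary per instruction
        code.append("%s := %s %s %s" % (temp, l, op, r))
        return temp

    _, target, expr = tree
    code.append("%s := %s" % (target, walk(expr)))
    return code
```

For the tree of id1 := id2 + id3 * inttoreal(60), this sketch produces four instructions: temp1 := inttoreal(60), temp2 := id3 * temp1, temp3 := id2 + temp2, and id1 := temp3. The first and last have fewer than three operands.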
5. Code Optimization:
• This phase attempts to improve the intermediate code so that
faster-running machine code results.
• The intermediate code generated above can be optimized as follows:
temp1 := id3 * 60.0
id1 := id2 + temp1
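Two simple improvements are enough to reach code of this shape: converting an integer constant to real once at compile time, and removing a final copy from a temporary. The Python sketch below assumes a hypothetical tuple encoding (dest, op, arg1, arg2) for three-address instructions and assumes that the copied temporary is not used anywhere else; a real optimizer would have to verify that.

```python
def optimize(code):
    """code is a list of (dest, op, arg1, arg2); op is "inttoreal", "copy", "+" or "*"."""
    # Pass 1: fold inttoreal(constant) at compile time and substitute the result.
    consts, folded = {}, []
    for dest, op, a, b in code:
        a, b = consts.get(a, a), consts.get(b, b)
        if op == "inttoreal" and isinstance(a, int):
            consts[dest] = float(a)        # no instruction is emitted
        else:
            folded.append((dest, op, a, b))
    # Pass 2: a trailing "x := temp" copy is absorbed into the instruction
    # that computes temp (assumes temp has no other use).
    if len(folded) >= 2 and folded[-1][1] == "copy" and folded[-2][0] == folded[-1][2]:
        dest = folded[-1][0]
        _, op, a, b = folded[-2]
        folded[-2:] = [(dest, op, a, b)]
    return folded
```

Applied to the four-instruction sequence temp1 := inttoreal(60), temp2 := id3 * temp1, temp3 := id2 + temp2, id1 := temp3, the sketch leaves two instructions, temp2 := id3 * 60.0 and id1 := id2 + temp2. They match the optimized code above except that the surviving temporary is not renumbered.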
6. Code Generation:
• The final phase of the compiler is the generation of target code, consisting normally
of relocatable machine code or assembly code.
• Memory locations are selected for each of the variables used by the program.
• Intermediate instructions are each translated into a sequence of machine instructions
that perform the same task.
• For example, using registers 1 and 2, the translation of the code might become

• The “F” in each instruction indicates that the instructions deal with floating point
numbers.
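A very simple code generator for the optimized three-address code can be sketched in Python as follows. The mnemonics MOVF, MULF and ADDF, the # prefix for constants and the register order (R2 first, then R1) are assumptions of this sketch, chosen to agree with the floating-point "F" suffix and the two registers mentioned above. Registers are never freed here, so the sketch only handles as many instructions as there are registers.

```python
OPCODES = {"+": "ADDF", "*": "MULF"}

def gen_code(tac, registers=("R2", "R1")):
    """tac is a list of (dest, op, arg1, arg2) with op "+" or "*"."""
    asm = []
    where = {}                 # temporary name -> register currently holding it
    free = list(registers)

    def operand(x):
        if x in where:
            return where[x]    # temporaries live in registers
        if isinstance(x, float):
            return "#%s" % x   # immediate constant
        return x               # variable in memory

    for dest, op, a, b in tac:
        reg = free.pop(0)
        asm.append("MOVF %s, %s" % (operand(a), reg))        # load first operand
        asm.append("%s %s, %s" % (OPCODES[op], operand(b), reg))
        if dest.startswith("temp"):
            where[dest] = reg                                # keep result in register
        else:
            asm.append("MOVF %s, %s" % (reg, dest))          # store into the variable
    return asm
```

For temp1 := id3 * 60.0 followed by id1 := id2 + temp1, the sketch emits five instructions: load id3 into R2, multiply R2 by 60.0, load id2 into R1, add R2 to R1, and store R1 into id1.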
The two supporting phases are:
a. Symbol Table Management
b. Error Detection and Reporting
a. Symbol Table Management:
• A symbol table is a data structure containing a record for each identifier with fields
for the attributes of the identifier.
• It allows us to find the record for each identifier quickly and to store or retrieve data
from that record quickly.


Example:

• When an identifier in the source program is detected by the lexical analyzer, the
identifier is entered into the symbol table. However, the attributes of an identifier normally
cannot be determined during lexical analysis.
• The remaining phases enter the information about identifiers into the symbol table and
then use this information in various ways.
• The code generator enters and uses detailed information about the storage assigned to
identifiers.
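A symbol table with these properties can be sketched in Python using a dictionary, which gives fast lookup by name. The class and method names (SymbolTable, insert, lookup, set_attribute) and the attribute keys "type" and "offset" are assumptions made for illustration.

```python
class SymbolTable:
    def __init__(self):
        self._records = {}     # hash table: identifier name -> record of attributes

    def insert(self, name):
        """Enter an identifier (as the lexical analyzer would); attributes come later."""
        return self._records.setdefault(name, {"name": name})

    def lookup(self, name):
        """Return the record for name, or None if it was never entered."""
        return self._records.get(name)

    def set_attribute(self, name, key, value):
        """Record an attribute discovered by a later phase (type, storage, ...)."""
        self._records[name][key] = value
```

A typical sequence is: the lexical analyzer calls insert("rate"), the semantic analyzer later sets its "type" to "real", and the code generator sets an "offset" for its storage. Inserting the same name twice returns the existing record.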
b. Error Detection and Reporting:
• Each phase can encounter errors.
• The syntax and semantic analysis phases handle a large fraction of the errors
detectable by the compiler.
• The lexical phase can detect errors where the characters remaining in the input do not
form any token of the language.
• Errors where the token stream violates the structure rules of the language are
detected by the syntax analysis phase.
• During semantic analysis, the compiler tries to detect constructs that have the right
syntactic structure but no meaning for the operation involved. (Example: adding two
identifiers, where one identifier is the name of an array and the other identifier is the
name of a procedure.)

Example: Translation of a Statement


ERRORS ENCOUNTERED IN DIFFERENT PHASES


During the different phases of a compiler (but mainly the analysis phases), errors
made by the programmer are detected and reported to the user in the form of
messages. This process of locating errors and reporting them to the user is called the error
handling process.
On detecting an error the compiler must:
• report the error in a helpful way,
• correct the error if possible, and
• continue processing (if possible) after the error to look for further errors.
Types of Error. Errors are either syntactic or semantic:
1. Syntax errors: Syntax errors are errors in the program text; they may be either lexical or
grammatical:
(a) Lexical errors include misspellings of identifiers, keywords or operators, for example
typing tehn instead of then, or omitting one of the quotes in a literal.
(b) A grammatical error is one that violates the (grammatical) rules of the language, for
example if x = 7 y := 4 (missing then).
2. Semantic errors: Semantic errors are mistakes concerning the meaning of a program
construct; they may be either type errors, logical errors or run-time errors:
(a) Type errors occur when an operator is applied to an argument of the wrong type,
or to the wrong number of arguments.
(b) Logical errors occur when a badly conceived program is executed, for example:
while x = y do ... when x and y initially have the same value and the body of the
loop does not change the value of either x or y.
(c) Run-time errors are errors that can be detected only when the program is executed,
for example:
var x : real;
readln(x);
writeln(1/x)
which would produce a run time error if the user input 0.
Syntax errors must be detected by a compiler and at least reported to the user (in a helpful
way). If possible, the compiler should make the appropriate correction(s).
Semantic errors are much harder and sometimes impossible for a computer to detect.


COUSINS OF THE COMPILER:

The input to a compiler may be produced by one or more preprocessors, and further
processing of the compiler’s output may be needed before running machine code is obtained.
Some of the software tools which are needed by the compiler are given below:
1. Preprocessors:
Preprocessors produce input to compilers. Some of the operations performed by them are:
Macroprocessing: A preprocessor may allow a user to define macros that are shorthands for
longer constructs.
File Inclusion: A preprocessor may include header files into the program text.
Example:
#include <stdio.h> in C language.
Rational Preprocessors: These preprocessors provide older languages with more modern flow
control statements and data structure facilities. Example: If a programming language does not
have “while” or “do…while” constructs, the preprocessor provides the users with built-in
macros for those functionalities.
Language Extensions: These preprocessors add capabilities to the language by means of
built-in macros.
Example: The language Equel is a database query language embedded in C.
Statements beginning with ## are taken by the preprocessor to be database access
statements, unrelated to C. These statements are translated into procedure calls to routines
that perform the database access.
2. Assemblers:
• Some compilers produce assembly code.
• The assembly code is given as input to an assembler to produce relocatable machine code.
• The relocatable machine code is given as input to the loader/linker for further processing.
• Assembly code is a mnemonic version of machine code. Here, names are used for
the operations and memory locations instead of numeric machine codes.
• A typical sequence of assembly instructions can be

The above instructions are executed as given below:

i. The content of memory location a is moved to register R1.
ii. The number 2 is added to the content of register R1.
iii. The resulting value in R1 is moved to memory location b.
Together, the above instructions perform the operation
b := a + 2
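The three steps can be traced with a tiny interpreter for a two-operand assembly language, sketched in Python below. The concrete syntax used here (MOV source, destination; ADD source, destination; # for a constant; registers named R followed by digits) is an assumption of this sketch, written to match the three steps described above.

```python
def run(program, memory):
    """Execute MOV/ADD instructions; memory maps location names to values."""
    regs = {}

    def is_reg(x):
        return x[0] == "R" and x[1:].isdigit()

    def read(x):
        if x.startswith("#"):
            return int(x[1:])              # immediate constant
        return regs[x] if is_reg(x) else memory[x]

    def write(x, value):
        (regs if is_reg(x) else memory)[x] = value

    for line in program:
        opcode, operands = line.split(None, 1)
        src, dst = [s.strip() for s in operands.split(",")]
        if opcode == "MOV":
            write(dst, read(src))          # copy source into destination
        elif opcode == "ADD":
            write(dst, read(dst) + read(src))
        else:
            raise ValueError("unknown opcode %r" % opcode)
    return memory
```

Running the program MOV a, R1; ADD #2, R1; MOV R1, b with a holding 5 leaves 7 in b, which is the effect of b := a + 2.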
3. Loaders and Link Editors:
A loader is a program which takes the relocatable machine code, alters the
relocatable addresses, and places the altered instructions and data in memory at the proper
locations.
The link editor allows us to make a single program from several files of relocatable
machine code. Some of the files may be library files of routines provided by the system and
available to any program that needs them.

COMPILER CONSTRUCTION TOOLS:

• Specialized tools have been developed to help implement various phases of a
compiler.
• These tools use specialized languages for specifying and implementing the
component.
Some of the compiler construction tools are:
a. Parser Generators:
• They automatically generate syntax analyzers from a grammatical description of a
programming language.
• The specification is given to these generators in the form of a context-free grammar.
• Many parser generators utilize powerful parsing algorithms that are too complex
to be carried out by hand.
• A typical example is YACC, a parser generator available on UNIX.
b. Scanner Generators:
• They produce lexical analyzers from regular expressions.
• The specification is given to these generators in the form of regular expressions.
• The basic organization of the resulting lexical analyzer is in effect a finite
automaton.
c. Syntax-directed translation engines:
• They produce collections of routines for walking a parse tree and generating
intermediate code.


d. Automatic Code Generator:

• These tools take a collection of rules that define the translation of each operation of
the intermediate language into the machine language for the target machine.
• The intermediate code statements are replaced by templates that represent the
sequence of machine instructions, in such a way that the assumptions about storage of
variables match from template to template.
e. Data flow analysis Engines:
• Data flow analysis is the gathering of information about how values are
transmitted from one part of the program to every other part.
• The data flow analysis is done by data flow engines.
• They help in good code optimization.
f. Compiler-construction toolkits:

• They provide an integrated set of routines for constructing various phases of a compiler.

PROGRAMMING LANGUAGE BASICS


The Static/Dynamic Distinction
Among the most important issues that we face when designing a compiler for a
language is what decisions the compiler can make about a program.
If a language uses a policy that allows the compiler to decide an issue, then we say
that the language uses a static policy or that the issue can be decided at compile time. On the
other hand, a policy that only allows a decision to be made when we execute the program is
said to be a dynamic policy or to require a decision at run time.
One issue on which we shall concentrate is the scope of declarations. The scope of a
declaration of x is the region of the program in which uses of x refer to this declaration. A
language uses static scope or lexical scope if it is possible to determine the scope of a
declaration by looking only at the program. Otherwise, the language uses dynamic scope.
With dynamic scope, as the program runs, the same use of x could refer to any of several
different declarations of x.
Environments and States
Another important distinction we must make when discussing programming
languages is whether changes occurring as the program runs affect the values of data
elements or affect the interpretation of names for that data. For example, the execution of an
assignment such as x=y+1 changes the value denoted by the name x. More specifically, the
assignment changes the value in whatever location is denoted by x.


The association of names with locations in memory (the store) and then with values can be
described by two mappings that change as the program runs:
1. The environment is a mapping from names to locations in the store. Since variables
refer to locations (“l-values” in the terminology of C), we could alternatively define an
environment as a mapping from names to variables.
2. The state is a mapping from locations in store to their values. That is, the state maps
l-values to their corresponding r-values, in the terminology of C.
Environments change according to the scope rules of a language.
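The two mappings can be made concrete with a short Python sketch. The names store, environment, declare, state and assign are assumptions of this illustration: the store is modelled as a list of cells, the environment as a dictionary from names to cell indices, and the state as the lookup from a cell index to its current value.

```python
store = []           # the store: a list of memory cells
environment = {}     # environment: name -> location (index into the store)

def declare(name, value):
    """Allocate a new location for name; this changes the environment."""
    store.append(value)
    environment[name] = len(store) - 1

def state(location):
    """State: location (l-value) -> current value (r-value)."""
    return store[location]

def assign(name, value):
    """An assignment changes the state, not the environment."""
    store[environment[name]] = value
```

After declaring y with value 4 and x with value 0, executing the assignment x = y + 1 as assign("x", state(environment["y"]) + 1) changes the value in the location denoted by x to 5, while the environment still maps x to the same location.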
Static Scope and Block Structure
• Most languages, including C and its family, use static scope. The scope rules for C are
based on program structure; the scope of a declaration is determined implicitly by
where the declaration appears in the program.
• Later languages, such as C++, Java, and C#, also provide explicit control over scopes
through the use of keywords like public, private, and protected.


• The C++ program in Fig. above has four blocks, with several definitions of variables
a and b. As a memory aid, each declaration initializes its variable to the number of the
block to which it belongs.

Explicit Access Control


• Classes and structures introduce a new scope for their members. If p is an object of a
class with a field (member) x, then the use of x in p.x refers to field x in the class
definition.
• Through the use of keywords like public, private, and protected, object-oriented
languages such as C++ or Java provide explicit control over access to member names
in a superclass.
• These keywords support encapsulation by restricting access. Thus, private names are
purposely given a scope that includes only the method declarations and definitions
associated with that class and any “friend” classes (the C++ term).
• Protected names are accessible to subclasses. Public names are accessible from
outside the class.
Dynamic Scope
• Technically, any scoping policy is dynamic if it is based on factor(s) that can be
known only when the program executes. The term dynamic scope, however, usually
refers to the following policy: a use of a name x refers to the declaration
of x in the most recently called, not-yet-terminated procedure with such a declaration.
• Dynamic scoping of this type appears only in special situations.
Consider this example,


• In this example, the function main first calls function b. As b executes, it prints the
value of the macro a. Since (x + 1) must be substituted for a, we resolve this use of x
to the declaration int x=1 in function b. The reason is that b has a declaration of x, so
the (x+1) in the printf in b refers to this x. Thus, the value printed is 2.
• After b finishes, and c is called, we again need to print the value of macro a.
However, the only x accessible to c is the global x. The printf statement in c thus
refers to this declaration of x, and value 3 is printed.
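The resolution rule described here (use the declaration in the most recently called, not-yet-terminated procedure) can be simulated in Python with an explicit stack of activation records. This is a sketch of the policy, not of the C program itself: the names call_stack and lookup are assumptions, the function a stands in for the macro a, and the global value x = 2 is inferred from the printed value 3.

```python
call_stack = [{"x": 2}]        # bottom frame: the global declaration of x

def lookup(name):
    """Dynamic scope: search the most recently called, still-active frames first."""
    for frame in reversed(call_stack):
        if name in frame:
            return frame[name]
    raise NameError(name)

def a():
    return lookup("x") + 1     # stands in for the macro a, which expands to (x + 1)

def b():
    call_stack.append({"x": 1})    # b has its own declaration int x = 1
    try:
        return a()
    finally:
        call_stack.pop()

def c():
    call_stack.append({})          # c declares no x of its own
    try:
        return a()
    finally:
        call_stack.pop()
```

Calling b() gives 2 because the x declared in b is found first, and calling c() afterwards gives 3 because only the global x is visible, matching the values described above.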

Parameter Passing Mechanisms


• All programming languages have a notion of a procedure, but they can differ in how
these procedures get their arguments. The great majority of languages use either
“call-by-value” or “call-by-reference”, or both.
Call-by-Value

• In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if
it is a variable).
• The value is placed in the location belonging to the corresponding formal parameter
of the called procedure.
• This method is used in C and Java, and is a common option in C++, as well as in most
other languages.
Call-by-Reference
• In call-by-reference, the address of the actual parameter is passed to the callee as the
value of the corresponding formal parameter.
• Uses of the formal parameter in the code of the callee are implemented by following
this pointer to the location indicated by the caller.
• Changes to the formal parameter thus appear as changes to the actual parameter.
Aliasing

• It is possible that two formal parameters can refer to the same location; such variables
are said to be aliases of one another.
• As a result, any two variables, which may appear to take their values from two
distinct formal parameters, can become aliases of each other, as well.
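The difference between the two mechanisms, and the effect of aliasing, can be sketched in Python by modelling memory locations explicitly. Python itself does not offer call-by-reference for variables, so the store dictionary, the alloc helper and the three procedures below are assumptions made purely to illustrate the ideas: a location number plays the role of an address.

```python
store = {}                      # location -> value

def alloc(value):
    """Reserve a fresh location holding value and return its address."""
    loc = len(store)
    store[loc] = value
    return loc

def increment_by_value(value):
    formal = alloc(value)       # the formal parameter gets its own location (a copy)
    store[formal] += 1
    return store[formal]        # the caller's variable is untouched

def increment_by_reference(loc):
    store[loc] += 1             # the callee follows the address to the caller's location

def add_twice(p, q):
    """Add the value at q to the value at p, twice; p and q may be aliases."""
    store[p] += store[q]
    store[p] += store[q]
```

With x allocated as 10, increment_by_value(store[x]) returns 11 but leaves x at 10, while increment_by_reference(x) changes x to 11. For add_twice, two distinct locations holding 1 leave the first at 3, but passing the same location for both parameters leaves it at 4, because each update through p is also visible through q.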
