
COMPILER DESIGN
UNIT I
Reference Books

• Compilers: Principles, Techniques, and Tools, by Aho, Sethi, and Ullman
  (http://dragonbook.stanford.edu/)
• Parsing Techniques, by Grune and Jacobs
  (http://www.cs.vu.nl/~dick/PT2Ed.html)
• Advanced Compiler Design and Implementation, by Muchnick
Basic Blocks of a Computing System
Program Execution: Layered Architecture

High-Level Language Program:
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

        ↓ Compiler

Assembly Language Program:
    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)

        ↓ Assembler + Linker + Loader

Machine Language Program:
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111

        ↓ Machine Interpretation

Control Signal Specification:
    ALUOP[0:3] <= InstReg[9:11] & MASK
What is a compiler?

• A program that translates an executable program in one language into an executable program in another language
Interpreter

• An interpreter is a computer program that converts each high-level program statement into machine code. Its input may include source code, pre-compiled code, and scripts.
• Both compilers and interpreters do the same job: converting a higher-level programming language to machine code. However, a compiler converts the code into machine code (creating an executable) before the program runs, while an interpreter converts code into machine code while the program is running.
Compiler design builds on….

• Programming languages
• Machine architecture
• Language theory
• Algorithms
• Software engineering
Why do we care about Compiler Design?

Compiler construction is a microcosm of computer science:

• artificial intelligence: greedy algorithms, learning algorithms
• algorithms: graph algorithms, union-find, dynamic programming
• theory: DFAs for scanning, parser generators, lattice theory for analysis
• systems: allocation and naming, locality, synchronization
• architecture: pipeline management, hierarchy management, instruction set use

Inside a compiler, all these things come together.
(Compiler Construction, © Oscar Nierstrasz)
Compiler Applications: Machine Code Generation

• Convert a source language program into a machine-understandable one
• Takes care of the semantics of the varied constructs of the source language
• Considers the limitations and specific features of the target machine
• Automata theory helps in syntactic checks, separating valid from invalid programs
• Compilation also generates code for syntactically correct programs
Compiler Applications: Format Converters

• Act as interfaces between two or more software packages
• Ensure compatibility of input-output formats between tools coming from different vendors
• Also used to convert heavily used programs written in some older languages (like COBOL) to newer languages (like C/C++)
Compiler Applications: Silicon Compilation

• Automatically synthesize a circuit from its behavioural description in languages like VHDL, Verilog, etc.
• The complexity of circuits is increasing while time-to-market shrinks
• Optimization criteria for silicon compilation are area, power, delay, etc.
Compiler Applications: Query Optimization

• In the domain of database query processing
• Optimize search time
• There is more than one evaluation sequence for each query
• Cost depends upon the relative sizes of tables and the availability of indexes
• Generate the sequence of operations suitable for the fastest query processing
Compiler Applications: Text Formatting

• Accepts an ordinary text file as input, with formatting commands embedded in it
• Generates formatted text
• Examples: troff, nroff, LaTeX, etc.
Isn’t it a solved problem?

• Machines are constantly changing
  • Changes in architecture → changes in compilers
  • New features pose new problems
  • Changing costs lead to different concerns
  • Old solutions need re-engineering
• Innovations in compilers should prompt changes in architecture
• New languages and features
What qualities are important in a compiler?

1. Correct code
2. Output runs fast
3. Compiler runs fast
4. Compile time proportional to program size
5. Support for separate compilation
6. Good diagnostics for syntax errors
7. Works well with the debugger
8. Good diagnostics for flow anomalies
9. Cross-language calls
10. Consistent, predictable optimization
A bit of history

• 1952: First compiler (linker/loader) written by Grace Hopper for the A-0 programming language
• 1957: First complete compiler, for FORTRAN, by John Backus and team
• 1960: COBOL compilers for multiple architectures
• 1962: First self-hosting compiler, for LISP
Compilers

• Compilers are sometimes classified as follows, depending on how they have been constructed or on what function they are supposed to perform:
  – single-pass
  – multi-pass
  – load-and-go
  – debugging
  – optimizing
The Analysis-Synthesis Model of Compilation

• There are two parts to compilation:
  – Analysis
  – Synthesis
The Analysis-Synthesis Model of Compilation

Analysis Part
• The analysis part breaks up the source program into constituent pieces
• It creates an intermediate representation of the source program

Synthesis Part
• The synthesis part constructs the desired target program from the intermediate representation
The Analysis-Synthesis Model of Compilation

• During analysis, the operations implied by the source program are determined and recorded in a hierarchical structure called a tree.
• Often, a special kind of tree called a syntax tree is used.
The Analysis-Synthesis Model of Compilation

• In a syntax tree, each node represents an operation and the children of the node represent the arguments of the operation.
• For example, a syntax tree of an assignment statement is shown below.
The Analysis-Synthesis Model of Compilation

(Figure: syntax tree of the assignment statement position := initial + rate * 60)
Examples of Software Tools that Perform Analysis

1. Structure editors
2. Pretty printers
3. Static checkers
4. Interpreters
Context of a Compiler

• In addition to a compiler, several other programs may be required to create an executable target program:
  • Preprocessor: expands macros
  • Assembler: translates assembly into machine code
  • Loader/link-editor: links library routines
Analysis of the Source Program

• In compiling, analysis consists of three phases:
  – Linear analysis
  – Hierarchical analysis
  – Semantic analysis
Analysis of the Source Program

• Linear Analysis:
  – Also called lexical analysis or scanning
  – The stream of characters making up the source program is read from left to right and grouped into tokens: sequences of characters having a collective meaning.
Scanning or Lexical Analysis (Linear Analysis)

• In a compiler, linear analysis is called lexical analysis or scanning.
• For example, in lexical analysis the characters in the assignment statement

    position := initial + rate * 60

  would be grouped into the following tokens:
Scanning or Lexical Analysis (Linear Analysis)

position := initial + rate * 60

• The identifier position
• The assignment symbol :=
• The identifier initial
• The plus sign
• The identifier rate
• The multiplication sign
• The number 60
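To make this concrete, here is a minimal scanner sketch in Python (an illustrative addition, not from the original slides; the token names and the scan helper are invented for the example):

    import re

    # Token specification: (token name, regular expression) pairs.
    TOKEN_SPEC = [
        ("NUMBER", r"\d+"),
        ("IDENT",  r"[A-Za-z_]\w*"),
        ("ASSIGN", r":="),
        ("PLUS",   r"\+"),
        ("TIMES",  r"\*"),
        ("SKIP",   r"\s+"),   # blanks are eliminated during lexical analysis
    ]
    MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})"
                                    for name, pattern in TOKEN_SPEC))

    def scan(source):
        """Read characters left to right and group them into tokens."""
        for match in MASTER_RE.finditer(source):
            if match.lastgroup != "SKIP":
                yield match.lastgroup, match.group()

    print(list(scan("position := initial + rate * 60")))
    # [('IDENT', 'position'), ('ASSIGN', ':='), ('IDENT', 'initial'),
    #  ('PLUS', '+'), ('IDENT', 'rate'), ('TIMES', '*'), ('NUMBER', '60')]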
Scanning or Lexical Analysis (Linear Analysis)

• The blanks separating the characters of these tokens would normally be eliminated during lexical analysis.
Analysis of the Source Program

• Hierarchical Analysis:
  – Characters or tokens are grouped hierarchically into nested collections with collective meaning.
  – Also called parsing or syntax analysis
Syntax Analysis or Hierarchical Analysis (Parsing)

• Hierarchical analysis is called parsing or syntax analysis.
• It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output.
Syntax Analysis or Hierarchical Analysis (Parsing)

• The grammatical phrases of the source program are represented by a parse tree.
Syntax Analysis or Hierarchical Analysis (Parsing)

Parse tree for position := initial + rate * 60:

    assignment statement
    ├── identifier: position
    ├── :=
    └── expression
        ├── expression
        │   └── identifier: initial
        ├── +
        └── expression
            ├── expression
            │   └── identifier: rate
            ├── *
            └── expression
                └── number: 60
Syntax Analysis or Hierarchical Analysis (Parsing)

• In the expression initial + rate * 60, the phrase rate * 60 is a logical unit because the usual conventions of arithmetic expressions tell us that multiplication is performed before addition.
• Because the expression initial + rate is followed by a *, it is not grouped into a single phrase by itself.
Syntax Analysis or Hierarchical Analysis (Parsing)

• The hierarchical structure of a program is usually expressed by recursive rules.
• For example, we might have the following rules as part of the definition of expressions:
Syntax Analysis or Hierarchical Analysis (Parsing)

• Any identifier is an expression.
• Any number is an expression.
• If expression1 and expression2 are expressions, then so are:
  – expression1 + expression2
  – expression1 * expression2
  – ( expression1 )
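These recursive rules translate directly into a recursive-descent parser. The sketch below is an illustrative addition (the token tuples match the scanner sketch given earlier); splitting the rules into expression/term/factor levels gives * higher precedence than +:

    def parse_expression(tokens, pos=0):
        """expression -> term { '+' term }; returns (tree, next position)."""
        left, pos = parse_term(tokens, pos)
        while pos < len(tokens) and tokens[pos] == ("PLUS", "+"):
            right, pos = parse_term(tokens, pos + 1)
            left = ("+", left, right)
        return left, pos

    def parse_term(tokens, pos):
        """term -> factor { '*' factor }; binds tighter than '+'."""
        left, pos = parse_factor(tokens, pos)
        while pos < len(tokens) and tokens[pos] == ("TIMES", "*"):
            right, pos = parse_factor(tokens, pos + 1)
            left = ("*", left, right)
        return left, pos

    def parse_factor(tokens, pos):
        """factor -> identifier | number (parenthesized case omitted)."""
        kind, lexeme = tokens[pos]
        if kind in ("IDENT", "NUMBER"):
            return (kind, lexeme), pos + 1
        raise SyntaxError(f"unexpected token {lexeme!r}")

    tokens = [("IDENT", "initial"), ("PLUS", "+"),
              ("IDENT", "rate"), ("TIMES", "*"), ("NUMBER", "60")]
    tree, _ = parse_expression(tokens)
    print(tree)
    # ('+', ('IDENT', 'initial'), ('*', ('IDENT', 'rate'), ('NUMBER', '60')))
    # rate * 60 is grouped first, matching the parse tree above.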
Analysis of the Source Program

• Semantic Analysis:
  – Certain checks are performed to ensure that the components of a program fit together meaningfully.
Semantic Analysis

• The semantic analysis phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase.
Semantic Analysis

• It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and operands of expressions and statements.
Semantic Analysis

• An important component of semantic analysis is type checking.
• Here the compiler checks that each operator has operands that are permitted by the source language specification.
Semantic Analysis

• For example, when a binary arithmetic operator is applied to an integer and a real, the compiler may need to convert the integer to a real, as shown in the figure below.
Semantic Analysis

(Figure: the syntax tree with the conversion inttoreal inserted before the number 60)
Semantic Analysis (Summarized)

• Check for semantic errors
• Gather type information for code generation
• Use the hierarchical structure to identify operators and operands
• Do type checking
  • e.g., using a real number to index an array is an error
• Do type conversion
  • e.g., inttoreal(60) when the integer 60 is combined with real operands (Fig. 1.5)
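A sketch of how this check might walk the syntax tree built during parsing (illustrative; the symbol-table contents are assumed, with all three identifiers declared real as in the running example):

    # Assumed symbol table: every identifier was declared real.
    SYMBOL_TYPES = {"position": "real", "initial": "real", "rate": "real"}

    def check_types(node):
        """Return (typed tree, type), inserting inttoreal where int meets real."""
        kind = node[0]
        if kind == "IDENT":
            return node, SYMBOL_TYPES[node[1]]
        if kind == "NUMBER":
            return node, "integer"
        op, left, right = node
        left, left_type = check_types(left)
        right, right_type = check_types(right)
        if left_type != right_type:      # mixed arithmetic: promote the integer side
            if left_type == "integer":
                left = ("inttoreal", left)
            else:
                right = ("inttoreal", right)
        result = "real" if "real" in (left_type, right_type) else left_type
        return (op, left, right), result

    tree = ("+", ("IDENT", "initial"), ("*", ("IDENT", "rate"), ("NUMBER", "60")))
    typed_tree, _ = check_types(tree)
    print(typed_tree)
    # ('+', ('IDENT', 'initial'),
    #       ('*', ('IDENT', 'rate'), ('inttoreal', ('NUMBER', '60'))))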
1.3 The Phases of a Compiler

(Figure: the phases of a compiler: lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator, code optimizer, and code generator, with symbol-table management and error handling interacting with every phase)
1.3 The Phases of a Compiler

• Symbol Table Management:
  – An essential function of a compiler is to record the identifiers used in the source program and collect information about various attributes of each identifier.
  – These attributes may provide information about the storage allocated for an identifier, its type, and its scope.
1.3 The Phases of a Compiler

• The symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
• When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table.
1.3 The Phases of a Compiler

• However, the attributes of an identifier cannot normally be determined during lexical analysis.
• For example, in a Pascal declaration like

    var position, initial, rate : real;

  the type real is not known when position, initial, and rate are seen by the lexical analyzer.
1.3 The Phases of a Compiler

• The remaining phases get information about identifiers into the symbol table and then use this information in various ways.
• For example, when doing semantic analysis and intermediate code generation, we need to know what the types of identifiers are, so we can check that the source program uses them in valid ways, and so that we can generate the proper operations on them.
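A minimal sketch of such a data structure (illustrative; the field names are invented): the lexical analyzer enters each name, and later phases fill in attributes as declarations are processed:

    class SymbolTable:
        """One record per identifier, with fields for its attributes."""
        def __init__(self):
            self.records = {}

        def enter(self, name):
            # Called by the lexical analyzer when an identifier is detected;
            # its type, storage, and scope are not yet known at this point.
            self.records.setdefault(name, {"type": None, "storage": None,
                                           "scope": None})

        def set_attribute(self, name, key, value):
            # Called by later phases, e.g. while processing
            # 'var position, initial, rate : real;'.
            self.records[name][key] = value

    table = SymbolTable()
    for name in ("position", "initial", "rate"):
        table.enter(name)                             # entered during scanning
    for name in ("position", "initial", "rate"):
        table.set_attribute(name, "type", "real")     # known only at the declaration
    print(table.records["rate"])   # {'type': 'real', 'storage': None, 'scope': None}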
Error Detection and Reporting

• Each phase can encounter errors.
• However, after detecting an error, a phase must somehow deal with that error so that compilation can proceed, allowing further errors in the source program to be detected.
Error Detection and Reporting

• A compiler that stops when it finds the first error is not as helpful as it could be.
• The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the compiler.
Error Detection and Reporting

• Errors where the token stream violates the structure rules (syntax) of the language are detected by the syntax analysis phase.
• The lexical phase can detect errors where the characters remaining in the input do not form any token of the language.
Intermediate Code Generation

• After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the source program.
• We can think of this intermediate representation as a program for an abstract machine.
Intermediate Code Generation

• This intermediate representation should have two important properties:
  – it should be easy to produce,
  – it should be easy to translate into the target program.
Intermediate Code Generation

• We consider an intermediate form called "three-address code,"
• which is like the assembly language for a machine in which every memory location can act like a register.
Intermediate Code Generation

• Three-address code consists of a sequence of instructions, each of which has at most three operands.
• The source program in (1.1) might appear in three-address code as:
Intermediate Code Generation

    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3                    (1.3)
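A sketch of how the typed syntax tree from the earlier example could be linearized into such instructions (illustrative; id1, id2, and id3 stand for position, initial, and rate, as in the slides):

    def generate(tree, target):
        """Flatten a syntax tree into three-address instructions."""
        code = []
        temp_count = 0
        def gen(node):
            nonlocal temp_count
            kind = node[0]
            if kind in ("IDENT", "NUMBER"):
                return node[1]              # leaves are already addresses
            if kind == "inttoreal":
                rhs = f"inttoreal({gen(node[1])})"
            else:                           # binary operator: at most 3 operands
                rhs = f"{gen(node[1])} {kind} {gen(node[2])}"
            temp_count += 1
            temp = f"temp{temp_count}"
            code.append(f"{temp} := {rhs}")
            return temp
        code.append(f"{target} := {gen(tree)}")
        return code

    typed_tree = ("+", ("IDENT", "id2"),
                       ("*", ("IDENT", "id3"), ("inttoreal", ("NUMBER", "60"))))
    print("\n".join(generate(typed_tree, "id1")))
    # temp1 := inttoreal(60)
    # temp2 := id3 * temp1
    # temp3 := id2 + temp2
    # id1 := temp3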
Code Optimization

• The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result.
Code Optimization

• Some optimizations are trivial.
• For example, a natural algorithm generates the intermediate code (1.3), using an instruction for each operator in the tree representation after semantic analysis, even though there is a better way to perform the same calculation, using only two instructions:
Code Optimization

    temp1 := id3 * 60.0
    id1 := id2 + temp1              (1.4)

• There is nothing wrong with this simple algorithm, since the problem can be fixed during the code-optimization phase.
Code Optimization:
• That is, the compiler can deduce that the
conversion of 60 from integer to real
representation can be done once and for all at
compile time, so the inttoreal operation can
be eliminated.
Code Optimization:
• Besides, temp3 is used only once, to transmit its
value to id1. It then becomes safe to substitute id1
for temp3, whereupon the last statement of 1.3 is
not needed and the code of 1.4 results.
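A sketch of these two transformations over the instruction list (1.3), written as simple string rewriting (illustrative only; a real optimizer works on a structured representation):

    import re

    CODE = ["temp1 := inttoreal(60)",
            "temp2 := id3 * temp1",
            "temp3 := id2 + temp2",
            "id1 := temp3"]

    def optimize(code):
        """Fold inttoreal(constant) at compile time, then drop a final copy."""
        folded, constants = [], {}
        for instr in code:
            m = re.fullmatch(r"(\w+) := inttoreal\((\d+)\)", instr)
            if m:   # the conversion can be done once and for all at compile time
                constants[m.group(1)] = str(float(m.group(2)))
                continue
            for temp, value in constants.items():
                instr = instr.replace(temp, value)
            folded.append(instr)
        # The last temporary is used only once, so substitute its target
        # and drop the copy instruction.
        m = re.fullmatch(r"(\w+) := (temp\d+)", folded[-1])
        if m and folded[-2].startswith(m.group(2) + " :="):
            folded[-2] = folded[-2].replace(m.group(2), m.group(1), 1)
            folded.pop()
        return folded

    print("\n".join(optimize(CODE)))
    # temp2 := id3 * 60.0
    # id1 := id2 + temp2    (the code of 1.4, up to the name of the temporary)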
Code Generation

• The final phase of the compiler is the generation of target code,
• consisting normally of relocatable machine code or assembly code.
Code Generation

• Memory locations are selected for each of the variables used by the program.
• Then, intermediate instructions are each translated into a sequence of machine instructions that perform the same task.
• A crucial aspect is the assignment of variables to registers.
Code Generation

• For example, using registers 1 and 2, the translation of the code of 1.4 might become:

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1
Code Generation

• The first and second operands of each instruction specify a source and a destination, respectively.
• The F in each instruction tells us that the instructions deal with floating-point numbers.
Code Generation

• This code moves the contents of the address id3 into register 2, and then multiplies it by the real constant 60.0.
• The # signifies that 60.0 is to be treated as a constant.
Code Generation

• The third instruction moves id2 into register 1 and adds to it the value previously computed in register 2.
• Finally, the value in register 1 is moved into the address of id1.
Cousins of the Compiler

• Preprocessors
• Assemblers
  • Two-pass assembly
• Loaders & link-editors
Cousins of the Compiler: Preprocessors

• Macro processing
• File inclusion
• "Rational" preprocessors: add modern built-in constructs to older languages
• Language extensions: embedded capabilities (e.g., embedded query languages)
Cousins of the Compiler: Assemblers

• Assembly code:

    MOV a, R1
    ADD #2, R1
    MOV R1, b
Cousins of the Compiler: Two-Pass Assembly

• First pass
  • Find all identifiers and their storage locations and store them in a symbol table:

        Identifier   Address
        a            0
        b            4

• Second pass
  • Translate each operation code into its sequence of bits
  • Produces relocatable machine code
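As a toy illustration of the two passes (the opcode encodings are invented; only the pass structure matters):

    INSTRUCTIONS = ["MOV a, R1", "ADD #2, R1", "MOV R1, b"]

    # Pass 1: find the identifiers and assign storage locations 4 bytes apart.
    symbol_table, next_address = {}, 0
    for instr in INSTRUCTIONS:
        for operand in instr.split(None, 1)[1].split(", "):
            is_identifier = operand.isalpha() and not operand.startswith("R")
            if is_identifier and operand not in symbol_table:
                symbol_table[operand] = next_address
                next_address += 4
    print(symbol_table)                       # {'a': 0, 'b': 4}

    # Pass 2: translate each operation code, replacing identifiers with
    # their addresses (a stand-in for emitting the actual bit sequences).
    OPCODES = {"MOV": "0001", "ADD": "0010"}  # hypothetical encodings
    for instr in INSTRUCTIONS:
        op, operands = instr.split(None, 1)
        fields = [str(symbol_table.get(o, o)) for o in operands.split(", ")]
        print(OPCODES[op], *fields)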
Cousins of the Compiler: Loaders and Link-Editors

• Loader
  • Takes relocatable machine code and alters its addresses
• Link-editor
  • Resolves external references
  • Links library files, routines provided by the system, and any other programs
Grouping of Phases

• Front & back ends
• Passes
• Reducing the number of passes
Front End

• The front end comprises the phases that depend on the source language and are independent of the target machine.
• It includes lexical and syntactic analysis, symbol table management, semantic analysis, and the generation of intermediate code.
• It also includes the error handling that goes along with these phases.
Back End

• The back end comprises the phases of the compiler that depend on the target machine and are independent of the source language.
• This includes code optimization and code generation.
• In addition, it also encompasses the error handling and symbol-table operations that go along with these phases.
Passes

• Several phases of the compiler may be implemented in a single pass, where a pass consists of reading an input file and writing an output file.
• Phases are grouped into one pass in such a way that the operations of each phase are carried out during that pass.
• e.g., lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.
Reducing the Number of Passes

• Minimizing the number of passes improves time efficiency, as reading from and writing to intermediate files can be reduced.
• When grouping phases into one pass, the entire program may have to be kept in memory to ensure proper information flow to each phase, because one phase may need information in a different order than the order in which a previous phase produces it.
• The source or target program differs from its internal representation, so the memory needed for the internal form may be larger than that needed for the input and output.
Evolution of Programming Languages

Classification of programming languages:
• Based on generation
• Imperative
• Declarative
• Von Neumann
• Object-oriented
• Scripting
Based on Generations

• 1st generation: machine language
• 2nd generation: assembly language
• 3rd generation: high-level languages like Fortran, Cobol, C, Lisp, C++, C#, Java
• 4th generation: languages for specific applications, like NOMAD (a relational database and fourth-generation language) for report generation, SQL for database queries, and PostScript for text formatting
• 5th generation: logic- and constraint-based languages like Prolog and OPS5
Imperative and Declarative Languages

• Imperative
  • The program specifies how a computation is to be done.
  • e.g., C, C++, C#, Java
• Declarative
  • The program specifies what computation is to be done.
  • e.g., functional languages like ML and Haskell, and constraint logic languages such as Prolog
Von Neumann Languages

• The computational model is based on the von Neumann computer architecture.
• e.g., Fortran and C
Object-Oriented Languages

• Support object-oriented programming
• A program consists of a collection of objects that interact with one another
• e.g., earlier: Simula 67, Smalltalk
• e.g., recent: C++, C#, Java, Python
Scripting Languages

• Interpreted languages with high-level operators, designed for gluing together computations
• These computations are called scripts
• e.g., JavaScript, Perl, PHP, Python, Ruby
The Science of Building a Compiler

Modeling in compiler design and implementation. Fundamental models:
• Finite state machines & regular expressions
  (useful for describing the lexical units of a program: keywords, numbers, identifiers, etc.)
• Context-free grammars
  (used to describe the syntactic structure of a programming language)
The Science of Code Optimization

Optimization must meet the following objectives:
1. The optimization must be correct, i.e., preserve the meaning of the compiled program.
2. The optimization must improve the performance of many programs.
3. The compilation time must be kept reasonable.
4. The engineering effort required must be manageable.
Programming Language Basics

1. The static/dynamic distinction
2. Environments and states
3. Static scope and block structure
4. Explicit access control
5. Dynamic scope
6. Parameter passing mechanisms
Programming Language Basics

The Static/Dynamic Distinction
• Dynamically typed languages perform type checking at run time.
• Statically typed languages perform type checking at compile time.
Environments & States

1. The environment is the mapping from names to locations in the store (i.e., from names to variables).
2. The state is the mapping from locations in the store to their values.
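A small sketch of this two-stage mapping, names → locations → values (the location numbers are invented):

    # Environment: name -> location.  State: location -> value.
    environment = {"i": 100}          # hypothetical store location
    state = {100: 42}

    def value_of(name):
        location = environment[name]  # stage 1: consult the environment
        return state[location]        # stage 2: consult the state

    print(value_of("i"))              # 42
    state[environment["i"]] = 7       # an assignment changes the state
    print(value_of("i"))              # 7
    environment["i"] = 104            # a new declaration of i changes the environment
    state[104] = 0
    print(value_of("i"))              # 0: the same name now denotes a new location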
Two Declarations of the Name i

(Figure: two declarations of the name i, each denoting a different storage location)
Static Scope & Block Structure

• Languages such as C++, Java, and C# provide explicit control over scopes through the use of keywords like public, private, and protected.
• C uses braces { } to delimit a block; Algol uses begin and end to delimit a block.
Blocks in a C++ Program

(Figure: nested blocks delimited by braces in a C++ program)
Explicit Access Control

• Through the keywords public, private, and protected, languages like C++ and Java provide explicit control over access to member names in a superclass.
• These keywords support encapsulation by restricting access.
Dynamic Scope

• A scoping policy is dynamic if it is based on factors that can be known only when the program executes.
• Dynamic scope: a use of a name x refers to the declaration of x in the most recently called, not yet terminated, procedure with such a declaration.
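A sketch contrasting the two policies (illustrative): Python itself is statically scoped, so the helper below emulates dynamic scope by searching the frames of the most recently called, not yet terminated functions:

    import inspect

    def dynamic_lookup(name):
        """Find `name` in the most recently called, not yet terminated frame."""
        frame = inspect.currentframe().f_back
        while frame is not None:
            if name in frame.f_locals:
                return frame.f_locals[name]
            frame = frame.f_back
        raise NameError(name)

    x = "global"

    def show():
        # Under static scope, x here would always mean the global x.
        print(dynamic_lookup("x"))

    def caller():
        x = "caller's binding"    # dynamically shadows the global x
        show()

    show()      # prints: global
    caller()    # prints: caller's binding (the most recent live declaration wins)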
Parameter Passing Mechanisms

• Call by value
• Call by reference
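A sketch of the difference (illustrative; Python passes object references, so call by value is emulated by local rebinding and call by reference by a shared mutable cell):

    def increment_by_value(n):
        # Call by value: the callee receives a copy; rebinding n is purely
        # local and the caller's variable is unchanged.
        n = n + 1
        return n

    def increment_by_reference(cell):
        # Call by reference (emulated): the callee updates the caller's
        # storage through the shared cell.
        cell[0] = cell[0] + 1

    a = 10
    increment_by_value(a)
    print(a)          # 10: the caller's a is unchanged

    b = [10]          # a one-element list acts as a reference cell
    increment_by_reference(b)
    print(b[0])       # 11: the caller observes the update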
