
Compiler design

 Text Book: Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman.
 Topics:
1. Compiler phases
2. Lexical analysis
3. Syntax analysis
4. Code generation
 Homework: there will be two major programming assignments. They must be done independently.
 Examinations: there will be two hourly exams and a final
exam.
 Grading: the total grade will be computed as follows:
 20% for each hourly exam
 15% for the home work
 50% for the final exam.
Overview of Compiler

 A compiler is a program (itself written in a high-level language) that converts / translates / compiles a source program written in a high-level language into equivalent machine code.
source program → compiler → machine code (object code)
What is a Compiler?
 Definition: A compiler is a program
that translates one language to
another
 Usually, the translation takes place
between a high-level language and
a low-level language
 Clearly, our first step is to discuss
some terminology…
Terminology
 Source language – the language that
is being translated
 Object language – the language into
which the translation is being done
 High-level language – a language that
is far removed from a computer; one
which is close to the problem area(s)
for which the language is designed
Terminology…
 Low-level language – a language that
is close to the machine (computer)
upon which the language will run
(execute)
 Object language – (sometimes called
machine code) the language of some
computer. This language usually is
not human readable (and is expressed
in bits or hex)
Terminology…
 Intermediate language – a language that
is used either:
 because it is a temporary step in the
translation process; or,
 because it is neither particularly high nor low, and is the output of a translation
 Assembly language – a language that
translates almost one-to-one to machine
language, but is in human readable form
What’s a Compiler?...
 Today, compilers are written using high-
level languages (such as Java, C++, etc.)
 The earliest compilers were written using
assembly language (e.g., FORTRAN and
COBOL around 1954)
 Sometimes a compiler is written in the same language that it compiles. This is done through bootstrapping.
Why Should I learn Compiler
Construction?
 How do compilers work?
 How do computers work? (instruction set,
registers, addressing modes, run time data
structures, …)
 What machine code is generated for certain
language constructs? (efficiency
considerations)
 Getting "a feeling" for good language design
Why Compilers? A Brief History
 The first computers were “hard-
wired”
 That is, they were collections of
physical devices that connected to
one-another, in an assemblage
designed to calculate particular kinds
of results
Why Compilers? A Brief
History…
 For example, Babbage’s Analytic
Engine and his Difference Engine
were assemblages of gears that
solved numeric problems
 The primary driving force was the
calculation of ballistics tables for
artillery
 Jacquard’s loom is another example
 And Hollerith’s work for the US Census Bureau is another
Why Compilers? A Brief
History…
 In the late 1940’s John von Neumann
“invented” the stored program
computer
 The “invention” is the observation that just as you can store data in the memory of a computer, that data can itself be machine instructions
 Then the computer can not only take
its instructions from memory…
Why Compilers? A Brief
History…
 But the computer can modify the
instructions in its memory…
 And, in fact, can write its own
programs, storing them in memory
 It quickly became apparent that the
simplest way to store information in a
computer was in the form of binary
numbers
Why Compilers? A Brief
History…
 So, to program a computer, you only
needed to enter a sequence of binary
numbers into memory, and then tell
the computer at which memory
address to start execution
 This was programming in machine
language
 Instructions (and data) were entered
from a console, one word (in binary) at
a time…
Why Compilers? A Brief
History…
 This form of coding (note the word!)
quickly was replaced by programming
in assembly language
 A program was written (in machine
language) which translated assembly
language to machine language (called
an assembler)
Why Compilers? A Brief
History…
 After the first assembler was written,
no one needed to code in machine
language any longer
 But, coding x = 3; can take many
instructions…
 So, the thought was – can we create a
program that translates something like
x = 3; into assembly language or into
machine language?
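As a rough illustration (using the same made-up register-and-pseudo-instruction notation that appears in the target-code example later in these notes, not any real machine), even the single statement x = 3; might become something like:

   r1 = 3
   store r1, x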
Why Compilers? A Brief History.
Formal Languages
 About the same time, in the mid-
1950’s, Noam Chomsky (M.I.T.)
began investigating the formal
structure of natural languages
 His work led to the Chomsky
hierarchy of type 0, 1, 2, 3 languages
and their associated grammars
Why Compilers? A Brief History.
Formal Languages…
 The type 2 (context-free) grammars
turned out to be very good at
describing computer languages
 And, efficient ways to recognize the structure of a source program using a type 2 grammar were developed
 Such recognition is called parsing
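For example, a tiny context-free grammar for arithmetic expressions (a generic illustration, not taken from the textbook) can be written as:

   expr   →  expr + term   |  term
   term   →  term * factor |  factor
   factor →  ( expr )  |  identifier  |  number

The parser uses rules of this kind to recognize the structure of the source program.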
Why Compilers? A Brief History.
Formal Languages…
 Very closely related to context-free
grammars are the type 3 grammars
 These are the regular grammars, equivalent in power to finite automata (and to regular expressions)
 An entire sub-branch of mathematics
studies automata; it’s called
automata theory
Why Compilers? A Brief History.
Formal Languages…
 It turns out that type 3 (regular)
grammars are very good at describing
the “atoms” used in computer
languages
 These “atoms” are the reserved words,
symbols, and user-defined words that
are used in a computer language
 Recognizing atoms is called scanning
(or lexing)
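As a minimal C sketch of scanning (the function name and output format are invented for illustration; real scanners are usually produced by a scanner generator), recognizing identifiers, integer constants, and single-character symbols might look like:

   #include <ctype.h>
   #include <stdio.h>

   /* Minimal scanner sketch: prints one token per line for a source string. */
   void scan(const char *p)
   {
       while (*p) {
           if (isspace((unsigned char)*p)) { p++; continue; }
           if (isalpha((unsigned char)*p) || *p == '_') {      /* identifier or keyword */
               const char *start = p;
               while (isalnum((unsigned char)*p) || *p == '_') p++;
               printf("IDENT   %.*s\n", (int)(p - start), start);
           } else if (isdigit((unsigned char)*p)) {            /* integer constant */
               const char *start = p;
               while (isdigit((unsigned char)*p)) p++;
               printf("NUMBER  %.*s\n", (int)(p - start), start);
           } else {                                            /* single-character symbol */
               printf("SYMBOL  %c\n", *p++);
           }
       }
   }

   int main(void) { scan("a = 100;"); return 0; }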
Why Compilers? A Brief
History…
 By far the most difficult and
complicated problem has been how
to generate object code that is
concise, and most importantly,
executes efficiently
 This is called “optimization”
Why Compilers? A Brief
History…
 Far simpler are the front-end issues of scanning and parsing – recognizing the source code
 This is due to the fact that we’ve
developed (semi-) automatic ways to
create scanners and parsers…
 using scanner generators and parser
generators
Programs Related to
Compilers…
 Interpreters – directly execute
the code upon recognition;
usually statement by statement
 Assemblers – translate
assembly language to machine
language
 Macro Assemblers – ditto, but
with (powerful) macro
capabilities
Programs Related to
Compilers…
 Linkers – combine object
modules to produce an
executable module
 Linkage Editors – manage the
linking process, and are able to
create/maintain object libraries
Programs Related to
Compilers…
 Loaders – load executable
modules into memory, and
launch execution
 Dynamic Loaders – loaders that
stay around during execution to
handle the loading of DLLs
(dynamically linked libraries)
Programs Related to
Compilers…
 Preprocessors – usually a
separate program whose input is
source code and whose output is
source code; perform macro
expansion, comment deletion,
etc. Sometimes the first phase
of a compiler
Programs Related to
Compilers…
 Editors – allow the user to create and
update source code
 Smart Editors – include syntax
coloring, parenthesis balancing, etc.
 Debuggers – a program that provides
an environment in which code may be
debugged; including single stepping,
symbol tables, etc.
Programs Related to
Compilers…
 IDEs – integrated development
environments; provide integrated
editor-debugger-execution
environments
 Profilers – collect statistics about
where programs spend their time
during execution; important for
optimizing at the source code level
Programs Related to
Compilers…
 Project Managers – programs that
help software managers deal with
hundreds or thousands of modules;
build reports, etc.
 SCCS – source code control systems;
provide for multiple access to shared
code in a controlled manner
The Translation Process
 The translation process consists of
a collection of phases, with the output
of one phase feeding the input of the
next
 The original source code is
transformed into a sequence of
intermediate representations (IRs)
during this process
The Translation Process… (diagram of the compiler phases)
Phases of Compiler
Parallel to all other phases are two
activities:
 Symbol table manipulation. The symbol table is one of the primary data structures that a compiler uses; it is used by all of the phases (a sketch of one entry follows below).
 Error detection and handling
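As a rough C sketch (the field names are invented for illustration; the textbook does not prescribe this layout), one entry of a symbol table might be declared as:

   /* One entry per declared name; every phase can look names up here. */
   struct symbol {
       char          *name;    /* identifier as it appears in the source      */
       int            type;    /* e.g. an enum value such as TYPE_INT         */
       int            offset;  /* storage location, filled in by later phases */
       struct symbol *next;    /* simple chaining, e.g. within a hash bucket  */
   };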
The Scanner
 The scanner reads the source
program, as a stream of characters,
and it performs lexical analysis –
collecting sequences of characters
into meaningful units called tokens
 The scanner also may create a
symbol table and a literal table
The Parser
 The parser reads the tokens produced
by the scanner and performs
syntactic analysis – creating an IR (a
parse tree or a syntax tree) showing
the structure of the program
 Syntax trees (abstract syntax trees)
are reduced representations of the
parse tree, with many irrelevant nodes
eliminated
The Semantic Analyzer
 The semantics of a program are its
“meaning” – what it is intended to
accomplish
 The semantic analyzer creates an
intermediate data structure that
contains this meaning – these are the
static semantics
 The dynamic semantics of a program
only can be determined by executing
the program
The Semantic Analyzer…
 An example of the static semantics of
a program is the data types of the
variables (and expressions)
 These static semantics usually are
represented in the intermediate
representations (IRs) as attributes
 The IR usually is a tree, “decorated”
with these attributes
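A minimal C sketch of such a decorated tree node (the type and field names are invented for illustration):

   /* An expression-tree node; 'type' is the static-semantic attribute
      that the semantic analyzer fills in, and 'value' holds a literal
      when kind == NUM. */
   struct ast_node {
       int              kind;   /* e.g. ASSIGN, PLUS, CALL, NUM, IDENT  */
       int              type;   /* attribute: e.g. TYPE_INT, TYPE_FLOAT */
       int              value;  /* literal value, used when kind == NUM */
       struct ast_node *left;
       struct ast_node *right;
   };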
(Source) Code Optimization
 Optimization may occur during
several phases
 Source code optimization rearranges
the source (or the IR of the source) in
order to produce better results
 E.g., x = 7 + 9; can become x = 16;
 This is called constant folding
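A hedged sketch of how a compiler might fold such a constant (assuming the illustrative ast_node layout sketched earlier, where PLUS and NUM are made-up kind codes):

   /* Fold 'lhs + rhs' when both operands are numeric literals. */
   void fold_add(struct ast_node *node)
   {
       if (node->kind == PLUS &&
           node->left->kind == NUM && node->right->kind == NUM) {
           node->kind  = NUM;                                     /* subtree becomes a literal */
           node->value = node->left->value + node->right->value;  /* computed at compile time  */
       }
   }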
(Source) Code Optimization…
 Duplicated computations can be
saved as temporaries and then
their values re-used
 Recursion can be converted to
iteration
 Repeated calculations can be moved
out of loops (see the sketch after this list)
 The possibilities are endless…
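For instance, a repeated calculation can be moved out of a loop (a generic C fragment; declarations omitted):

   /* before: a * b is recomputed on every iteration */
   for (i = 0; i < n; i++)
       x[i] = a * b + i;

   /* after: the loop-invariant product is computed once */
   t = a * b;
   for (i = 0; i < n; i++)
       x[i] = t + i;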
The Code Generator
 The code generator takes the IR and
generates code for the target machine
 Here the details of how various
numeric and non-numeric quantities
are represented become important
 E.g., word length, hardware stack,
hardware calling conventions, memory
access, etc.
The Target Code Optimizer
 The target code optimizer examines
the emitted target code to see if
further possibilities for optimization
are present and then capitalizes upon
them
 E.g., reuse of registers, using a shift
instruction to replace a multiplication
or division, etc.
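As a C-level illustration of replacing a multiplication with a shift (the optimizer actually performs this on the emitted instructions, but the effect is the same):

   y = x * 8;     /* original: multiply by a power of two    */
   y = x << 3;    /* optimized: shift left by 3, same value  */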
Phases of the compiler
Source Program
   ↓
Scanner (Lexical Analyzer)
   ↓  Tokens
Parser (Syntax Analyzer)
   ↓  Parse Tree
Semantic Analyzer
   ↓  Abstract Syntax Tree with attributes
Sample Program Compiled
 Consider the example:
   int a, b{
     a = 100;
     b = f (a) + 3}

Source Program → Lexical Analyzer → Token stream
Sample Program Compiled
Tokens are the entities of interest defined by the compiler writer. A sequence of characters with a collective meaning is grouped to form a token.
 Examples of Tokens:
Single-character operators: =  +  -  *  >  <
Multi-character operators: ++  --  ==  <=
 Numeric Constants: 1997 45.89 19.9e+7
 Key Words: int, while, for
 Identifiers: x, my_name, Your_Name, a
 Homework: Identify all token types in C programs.
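One hedged C sketch of how a compiler writer might enumerate these token types (the names are invented for illustration):

   enum token_type {
       TOK_KEYWORD,    /* int, while, for, ...          */
       TOK_IDENT,      /* x, my_name, Your_Name, a, ... */
       TOK_NUMBER,     /* 1997, 45.89, 19.9e+7          */
       TOK_OPERATOR,   /* =  +  -  *  ++  --  ==  <=    */
       TOK_PUNCT       /* ;  ,  {  }  (  )              */
   };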
Example Program Compiled-Continued
What are the tokens in the example?

#   Token   Type                #    Token   Type
1.  int     keyword             8.   =       equal
2.  a       identifier          9.   100     integer
3.  ,       comma               10.  ;       semicolon
4.  b       identifier          11.  b       identifier
5.  {       left brace          12.  f       identifier
6.  a       identifier          13.  +       plus
7.  (       left parenthesis    14.  3       integer
Example Continued
 The parser produces a parse tree: it is a heterogeneous tree (nodes have different data types)

root_node
 |-- stmt1: =
 |     |-- a
 |     `-- 100
 `-- stmt2: =
       |-- b
       `-- +
             |-- f ( a )
             `-- 3
Intermediate-Code Generation
 Using temporary locations to save values
 t1 = 100
 store t1, a
 load a, t2
 t3 = f(t2)
 t4 = t3 + 3
 store t4, b
Intermediate-Code Optimization
 Eliminate unnecessary code or
statements that won't be executed
 t1 = 100
 store t1, a
 t3 = f(t1)
 t4 = t3 + 3
 store t4, b
Target-code Generation
 Machine code generated for a
particular target machine
 r1 = 100
 store r1, 0x10
 jsr _f
 r2 = r0 + 3
 store r2, 0x16
Compiler Architecture
Single-pass vs. multi-pass architecture
 Single pass: all phases interleaved, driven by the parser
 Multi-pass:
   Each pass finishes before the next starts
   Saves main memory; passes communicate through files
   Used if the language is complex or portability is important
Front end & Back end
 Front end: the phases, or parts of phases,
that depend on the source language.
 Back end: the phases, or parts of phases,
that depend on the target machine.