
Compiler Construction

What is a compiler?

A program that reads a program written in one language (the source language) and translates it into another language (the target language).

Traditionally, compilers go from high-level languages to low-level languages.
What is a compiler?
• Programming problems are easier to solve in high-level languages
– Languages closer to the level of the problem domain, e.g.,
• Smalltalk: OO programming
• JavaScript: Web pages

• Solutions are usually more efficient (faster, smaller) when written in machine language
– A language that reflects the cycle-by-cycle working of a processor
Cont…

• Compilers are the bridges:
• Tools to translate programs written in high-level languages to efficient executable code
Assemblers
Source program
→ Compiler
→ Assembly program
→ Assembler
→ Relocatable machine code
→ Loader/link-editor
→ Absolute machine code


The compilation task is full of variety
• Thousands of source languages
– Fortran, Pascal, C, C++, Java, ……
• Target languages
– Some other lower-level language (assembly language), machine language
• Compilation process has similar variety
– Single pass, multi-pass, load-and-go, optimizing….
• Variety is overwhelming……

• Good news is:
– A few basic techniques are sufficient to cover all this variety
– Many efficient tools are available
Requirement
• In order to translate statements in a language, one
needs to understand both
– the structure of the language: the way “sentences” are constructed in the language, and
– the meaning of the language: what each “sentence” stands for.
• Terminology:
Structure ≡ Syntax
Meaning ≡ Semantics
Analysis-Synthesis model of compilation
• Two parts
– Analysis
• Breaks up the source program into constituents
– Synthesis
• Constructs the target program

Source Language → Front End (language specific) → Intermediate Language → Back End (machine specific) → Target Language
Compilation Steps/Phases

Source language
→ Scanner (lexical analysis) → tokens
→ Parser (syntax analysis) → syntactic structure
→ Semantic Analysis (IC generator) → intermediate language
→ Code Optimizer → intermediate language
→ Code Generator
→ Target language

The Symbol Table is shared by the phases.
Compilation Steps/Phases

• Lexical Analysis Phase: Generates the “tokens” in the source program

• Syntax Analysis Phase: Recognizes “sentences” in the program using the syntax of the language

• Semantic Analysis Phase: Infers information about the program using the semantics of the language
Cont…

• Intermediate Code Generation Phase: Generates “abstract” code based on the syntactic structure of the program and the semantic information from the semantic analysis phase

• Optimization Phase: Refines the generated code using a series of optimizing transformations

• Final Code Generation Phase: Translates the abstract intermediate code into specific machine instructions
Lexical Analysis

• Convert the stream of characters representing the input program into a sequence of tokens
• Tokens are the “words” of the programming language
• Lexeme
– The characters comprising a token
Cont..

• For instance, the sequence of characters “static int” is recognized as two tokens, representing the two words “static” and “int”
• The sequence of characters “*x++” is recognized as three tokens, representing “*”, “x” and “++”
• Removes the white spaces
• Removes the comments
Lexical Analysis

• Input: result = a + b * 10

• Tokens: ‘result’, ‘=’, ‘a’, ‘+’, ‘b’, ‘*’, ‘10’
– Identifiers: result, a, b
– Operators: =, +, *
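As a rough illustration of the lexical analysis step above, the following Python sketch splits an input string into (kind, lexeme) pairs using regular expressions. The token kinds (NUMBER, IDENTIFIER, OPERATOR) and the small operator set are assumptions made for this example; a real scanner would also handle comments and keywords.

```python
import re

# Token kinds and patterns are illustrative assumptions, not a fixed standard.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"\+\+|[=+\-*/]"),   # try '++' before the single '+'
    ("SKIP",       r"\s+"),             # white space is recognized, then dropped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (kind, lexeme) pairs, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

for kind, lexeme in tokenize("result = a + b * 10"):
    print(kind, lexeme)
```

Applied to the earlier example “*x++”, the same function yields the three lexemes “*”, “x” and “++”.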
Syntax Analysis (Parsing)
• Uncover the structure of a sentence in the program from a stream of tokens.

• For instance, the phrase “x = +y”, which is recognized as four tokens, representing “x”, “=”, “+” and “y”, has the structure =(x,+(y)), i.e., an assignment expression that operates on “x” and the expression “+(y)”.

• Build a tree called a parse tree that reflects the structure of the input sentence.
Syntax Analysis: Grammars
• Expression grammar
Assign ::= ID ‘=’ Exp
Exp    ::= Exp ‘+’ Exp
         | Exp ‘*’ Exp
         | ID
         | NUMBER
Syntax Tree

Input: result = a + b * 10

Assign
├─ result
└─ +
   ├─ a
   └─ *
      ├─ b
      └─ 10
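One way such a tree could be built is with a small recursive-descent parser, sketched below in Python. The expression grammar as written is ambiguous, so this sketch assumes the usual convention that ‘*’ binds tighter than ‘+’, which matches the tree shown. The function name parse_assign and the nested-tuple tree shape are choices made for this illustration only.

```python
def parse_assign(tokens):
    """Parse ID '=' Exp from a list of lexemes; return a nested-tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():                      # ID | NUMBER
        tok = peek()
        if tok is None or not (tok.isidentifier() or tok.isdigit()):
            raise SyntaxError(f"expected ID or NUMBER, got {tok!r}")
        return advance()

    def term():                         # handles '*', which binds tighter
        node = primary()
        while peek() == "*":
            advance()
            node = ("*", node, primary())
        return node

    def exp():                          # handles '+'
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    target = peek()
    if target is None or not target.isidentifier():
        raise SyntaxError("assignment must start with an ID")
    advance()
    if peek() != "=":
        raise SyntaxError("expected '='")
    advance()
    tree = ("Assign", target, exp())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse_assign(["result", "=", "a", "+", "b", "*", "10"]))
```

An incomplete sentence such as total = capital + is rejected with a SyntaxError, because primary() finds no operand after the ‘+’.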
Semantic Analysis

• Concerned with the semantics (meaning) of the program
• Performs type checking
– Operator-operand compatibility
Intermediate Code Generation
• Translate each hierarchical structure, represented as a decorated tree, into intermediate code
• The result is a program for an abstract machine
• Properties of intermediate codes
– Should be easy to generate
– Should be easy to translate
• Intermediate code hides many machine-level details,
but has instruction-level mapping to many assembly
languages
• Main motivation: portability
• One commonly used form is “Three-address Code”
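To make the idea of three-address code concrete, here is a minimal Python sketch that flattens an expression tree into instructions with at most one operator each. The nested-tuple tree shape, the temp1, temp2, … naming, and the function name gen_three_address are assumptions for this illustration.

```python
from itertools import count

def gen_three_address(tree):
    """Flatten ("Assign", target, expr) into a list of three-address lines."""
    code = []
    temps = count(1)                    # source of fresh temporary numbers

    def walk(node):
        if isinstance(node, str):       # leaf: identifier or constant
            return node
        op, left, right = node
        left_name = walk(left)          # operands are evaluated first
        right_name = walk(right)
        temp = f"temp{next(temps)}"
        code.append(f"{temp} := {left_name} {op} {right_name}")
        return temp

    _, target, expr = tree
    code.append(f"{target} := {walk(expr)}")
    return code

tree = ("Assign", "id1", ("+", "id2", ("*", "id3", "10")))
for line in gen_three_address(tree):
    print(line)
```

Each generated line names at most three addresses (two operands and one result), which is what makes the form easy both to generate and to translate.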
Code Optimization

• Apply a series of transformations to improve the time and space efficiency of the generated code.
• Peephole optimizations: generate new instructions by combining/expanding on a small number of consecutive instructions.
• Global optimizations: reorder, remove or add instructions to change the structure of generated code
• Consumes a significant fraction of the compilation time
• Optimization capability varies widely
• Simple optimization techniques can be very valuable
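The following Python sketch shows how two simple rewrites of this kind might look on three-address code held as (dest, op, arg1, arg2) tuples. The tuple layout, the COPY pseudo-operator, and the assumption that every temporary is used exactly once are all choices made for this example, not a description of any particular compiler.

```python
def optimize(code):
    """Apply two simple rewrites to (dest, op, arg1, arg2) instructions."""
    consts = {}                         # temporary name -> folded constant
    folded = []
    for dest, op, a, b in code:
        a = consts.get(a, a)            # substitute constants already folded
        b = consts.get(b, b)
        if op == "INTTOREAL" and a.isdigit():
            consts[dest] = a + ".0"     # do the conversion at compile time
            continue
        folded.append((dest, op, a, b))

    result = []
    for instr in folded:
        dest, op, a, b = instr
        # Copy elimination: "t := x op y; d := t" becomes "d := x op y".
        # Safe here only because each temporary is assumed single-use.
        if op == "COPY" and result and result[-1][0] == a and a.startswith("temp"):
            prev = result.pop()
            result.append((dest,) + prev[1:])
        else:
            result.append(instr)
    return result

code = [
    ("temp1", "INTTOREAL", "10", None),
    ("temp2", "*", "id3", "temp1"),
    ("temp3", "+", "id2", "temp2"),
    ("id1", "COPY", "temp3", None),
]
for instr in optimize(code):
    print(instr)
```

Four instructions shrink to two. The sketch does not renumber the surviving temporaries, which a real optimizer might do as a clean-up step.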
Code Generation

• Map instructions in the intermediate code to specific machine instructions.

• Memory management, register allocation, instruction selection, instruction scheduling, …

• Generates sufficient information to enable symbolic debugging.
Symbol Table
• Records the identifiers used in the source program
– Collects various associated information as attributes
• Variables: type, scope, storage allocation
• Procedures: number and types of arguments, method of argument passing
• It is a data structure holding a collection of records
– Different fields are collected and used at different phases of compilation
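A symbol table of this kind could be sketched in Python as a dictionary of records whose attributes are filled in by different phases. The class and method names below (SymbolRecord, SymbolTable, insert, lookup, set_attribute) are hypothetical, and the sketch deliberately ignores nested scopes.

```python
from dataclasses import dataclass, field

@dataclass
class SymbolRecord:
    name: str
    kind: str                                   # e.g. "variable" or "procedure"
    attributes: dict = field(default_factory=dict)

class SymbolTable:
    """A flat collection of records keyed by identifier name."""

    def __init__(self):
        self._records = {}

    def insert(self, name, kind):
        # setdefault keeps the existing record if the name was already entered
        return self._records.setdefault(name, SymbolRecord(name, kind))

    def lookup(self, name):
        return self._records.get(name)          # None if the name is unknown

    def set_attribute(self, name, key, value):
        self._records[name].attributes[key] = value

table = SymbolTable()
table.insert("result", "variable")              # entered during lexical analysis
table.set_attribute("result", "type", "real")   # filled in by semantic analysis
table.set_attribute("result", "address", 0)     # filled in during code generation
print(table.lookup("result"))
```

The usage lines mirror the point that different fields are collected at different phases: the name first, then the type, then the storage location.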
Error Detection, Recovery and Reporting

[Figure: the compilation phases (scanner, parser, semantic analysis and IC generator, code optimizer, code generator), each connected to the Error Handler and the Symbol Table]
Error Detection, Recovery and Reporting
• Each phase can encounter errors
• Specific types of error can be detected by specific phases
– Lexical Error: int abc, 1num;
– Syntax Error: total = capital + rate
– Semantic Error: year;
value = myarray [realIndex];
• The compiler should be able to proceed and process the rest of the program after an error is detected
• The compiler should be able to link the error with the source program
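The two requirements above, continuing after an error and linking each error to the source, can be illustrated with a small Python sketch that checks only for malformed words such as 1num. The rule that a word must be either an identifier or an integer literal is an assumption of this example, as is the function name lexical_errors.

```python
import re

WORD = re.compile(r"[A-Za-z_0-9]+")             # any run of word characters
VALID = re.compile(r"[A-Za-z_]\w*|\d+")         # identifier or integer literal

def lexical_errors(source):
    """Return (line, column, lexeme) for every malformed word in the source."""
    errors = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for m in WORD.finditer(line):
            if not VALID.fullmatch(m.group()):
                # Record the position and keep scanning instead of stopping.
                errors.append((lineno, m.start() + 1, m.group()))
    return errors

program = "int abc, 1num;\nint 2x;"
for lineno, column, lexeme in lexical_errors(program):
    print(f"line {lineno}, column {column}: malformed identifier {lexeme!r}")
```

Because the scan does not stop at the first problem, both bad identifiers are reported in a single run, each with its line and column.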
Translation of a statement
result = a + b * 10

↓ Lexical Analyzer

id1 = id2 + id3 * 10

Symbol Table
result …….
a …….
b …….

↓ Syntax Analyzer

Assign
├─ id1
└─ +
   ├─ id2
   └─ *
      ├─ id3
      └─ 10
Translation of a statement
Assign
├─ id1
└─ +
   ├─ id2
   └─ *
      ├─ id3
      └─ 10

↓ Semantic Analyzer

Assign
├─ id1
└─ +
   ├─ id2
   └─ *
      ├─ id3
      └─ INTTOREAL
         └─ 10
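The coercion inserted by the semantic analyzer can be sketched in Python as a type-checking walk over the tree. The sketch assumes only two types, int and real, a table mapping each identifier to its type, and a nested-tuple tree; the function names coerce and check_assign are hypothetical.

```python
def coerce(node, types):
    """Return (tree, type), inserting INTTOREAL where an int meets a real."""
    if isinstance(node, str):
        if node.isdigit():
            return node, "int"                  # integer literal
        return node, types[node]                # identifier: look up its type
    op, left, right = node
    left, left_type = coerce(left, types)
    right, right_type = coerce(right, types)
    if left_type != right_type:                 # mixed int/real operands
        if left_type == "int":
            left = ("INTTOREAL", left)
        else:
            right = ("INTTOREAL", right)
        return (op, left, right), "real"
    return (op, left, right), left_type

def check_assign(tree, types):
    """Type-check ("Assign", target, expr) and return the decorated tree."""
    _, target, expr = tree
    expr, expr_type = coerce(expr, types)
    target_type = types[target]
    if target_type == "real" and expr_type == "int":
        expr = ("INTTOREAL", expr)
    elif target_type != expr_type:
        raise TypeError(f"cannot assign {expr_type} to {target_type} {target}")
    return ("Assign", target, expr)

types = {"id1": "real", "id2": "real", "id3": "real"}
tree = ("Assign", "id1", ("+", "id2", ("*", "id3", "10")))
print(check_assign(tree, types))
```

With all three identifiers declared real, the only change is that the literal 10 is wrapped in INTTOREAL, as in the tree above.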
Translation of a statement

↓ Intermediate Code Generator

temp1 := INTTOREAL (10)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

↓ Code Optimizer

temp1 := id3 * 10.0
id1 := id2 + temp1

↓ Code Generator

MOVF id3, R2
MULF #10.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Syntax Analyzer versus Lexical Analyzer
• Which constructs of a program should be recognized by
the lexical analyzer, and which ones by the syntax
analyzer?
– Both do similar things, but the lexical analyzer deals with simple non-recursive constructs of the language.
– The syntax analyzer deals with recursive constructs of the
language.
– The lexical analyzer simplifies the job of the syntax
analyzer.
– The lexical analyzer recognizes the smallest meaningful
units (tokens) in a source program.
– The syntax analyzer works on the smallest meaningful units
(tokens) in a source program to recognize meaningful
structures in our programming language.
Cousins of the Compiler
• Preprocessor
– Macro preprocessing
• Define and use shorthand for longer constructs
– File inclusion
• Include header files
– “Rational” Preprocessors
• Augment older languages with modern flow-of-control or
data-structures
– Language Extension
• Add capabilities to a language
• Equel: query language embedded in C
Typical Compilation

Consider the source code of a C function:

int expr( int n )
{
    int d;
    d = 4*n*n*(n+1)*(n+1);
    return d;
}
Cont….

Expressing an algorithm or a task in C is optimized for human readability and comprehension. This presentation matches human notions of the grammar of a programming language. The function uses named constructs such as variables and procedures, which aid human readability.

Now consider the assembly code that the C compiler gcc generates for the Intel platform:

The figure above shows the structure of a two-pass compiler. The front end maps legal source code into an intermediate representation (IR). The back end maps IR into target machine code. An immediate advantage of this scheme is that it admits multiple front ends and multiple passes.

The front end recognizes legal and illegal programs presented to it. When it encounters errors, it attempts to report them in a useful way. For legal programs, the front end produces IR and a preliminary storage map for the various data structures declared in the program.

The front end consists of two modules:
1. Scanner
2. Parser
Scanner
The scanner takes a program as input and maps the character stream into “words” that are the basic unit of syntax. It produces pairs: a word and its part of speech. For example, the input
x=x+y
becomes
<id,x>
<assign,=>
<id,x>
<op,+>
<id,y>
We call the pair “<token type, word>” a token. Typical tokens are: number, identifier, +, -, new, while, if.
Parser
The parser takes in the stream of tokens, recognizes context-free syntax and reports errors. It guides context-sensitive (“semantic”) analysis for tasks like type checking. The parser builds IR for the source program.

The syntax of most programming languages is specified using Context-Free Grammars (CFG). Context-free syntax is specified with a grammar G=(S,N,T,P) where
· S is the start symbol
· N is a set of non-terminal symbols
· T is a set of terminal symbols or words
· P is a set of productions or rewrite rules

For example, the Context-Free Grammar for arithmetic expressions is
1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | -

For this CFG,
S = goal
T = { number, id, +, -}
N = { goal, expr, term, op}
P = { 1, 2, 3, 4, 5, 6, 7}
Syntax Tree
A parse can be represented by a tree: a parse tree or syntax tree. For example, here is the parse tree for the expression x+2-y
Cont….
The parse tree captures all rewrites during the derivation. The derivation can be extracted by starting at the root of the tree and working towards the leaf nodes.
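A derivation for x+2-y using the numbered productions of the grammar above can be replayed with a short Python sketch. Here x and y appear as the token type id and 2 as number. The choice of a leftmost derivation and the particular rule sequence are assumptions of this example; other derivation orders produce the same parse tree.

```python
# Productions numbered as in the grammar above: number -> (lhs, rhs).
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}
NONTERMINALS = {"goal", "expr", "term", "op"}

def leftmost_derivation(rule_numbers, start="goal"):
    """Apply each rule to the leftmost non-terminal; return every sentential form."""
    form = [start]
    steps = [list(form)]
    for n in rule_numbers:
        lhs, rhs = PRODUCTIONS[n]
        i = next((k for k, sym in enumerate(form) if sym in NONTERMINALS), None)
        if i is None or form[i] != lhs:
            raise ValueError(f"rule {n} does not apply to {' '.join(form)}")
        form[i:i + 1] = rhs             # rewrite that non-terminal in place
        steps.append(list(form))
    return steps

# Rule sequence chosen by hand for the token string: id + number - id
for form in leftmost_derivation([1, 2, 2, 3, 5, 6, 4, 7, 5]):
    print(" ".join(form))
```

Each printed line is one sentential form, so the output traces the rewrites from goal down to the final string of terminals.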
Abstract Syntax Trees
The parse tree contains a lot of unneeded information.
Compilers often use an abstract syntax tree (AST). For
example, the AST for the above parse tree is
The Back End

The back end of the compiler translates IR into target machine code. It chooses machine (assembly) instructions to implement each IR operation.

The back end ensures conformance with system interfaces. It decides which values to keep in registers in order to avoid memory access; memory access is far slower than register access.
Back End
The back end is responsible for instruction selection so as to produce fast and compact code.
Modern processors have a rich instruction set. The back end takes advantage of target features such as addressing modes.
Two-Pass Assembly
• Simplest form of assembler
• First pass
– All the identifiers are stored in a symbol table
– Storage is allocated
• Second pass
– Translates each operation code into machine language

Source
MOV a, R1
ADD #2, R1
MOV R1, b

After First Pass (symbol table)
Identifier  Address
a           0
b           4

After 2nd Pass
Instruction code  Register  Addressing mode  Address   Relocation bit
0001              01        00               00000000  *
0011              01        10               00000010
0010              01        00               00000100  *
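The two passes can be sketched in Python as follows. The opcode values (0001 for a load, 0010 for a store, 0011 for an add), the addressing-mode bits (00 for a memory address, 10 for an immediate), and the 4-unit spacing between identifiers are read off the example above and treated as assumptions; the function names are hypothetical.

```python
WORD_SIZE = 4       # assumption: identifiers are placed 4 address units apart

def is_register(operand):
    return operand[0] == "R" and operand[1:].isdigit()

def split_line(line):
    mnemonic, operands = line.split(None, 1)
    return mnemonic, [o.strip() for o in operands.split(",")]

def first_pass(lines):
    """Collect identifiers into a symbol table and allocate storage."""
    symbols = {}
    for line in lines:
        _, operands = split_line(line)
        for operand in operands:
            if (not is_register(operand) and not operand.startswith("#")
                    and operand not in symbols):
                symbols[operand] = len(symbols) * WORD_SIZE
    return symbols

def second_pass(lines, symbols):
    """Translate each instruction into bit fields, marking relocatable addresses."""
    words = []
    for line in lines:
        mnemonic, (src, dst) = split_line(line)
        if mnemonic == "MOV" and is_register(dst):   # load: memory -> register
            opcode, reg, operand = "0001", dst, src
        elif mnemonic == "MOV":                       # store: register -> memory
            opcode, reg, operand = "0010", src, dst
        elif mnemonic == "ADD":
            opcode, reg, operand = "0011", dst, src
        else:
            raise ValueError(f"unknown mnemonic {mnemonic}")
        if operand.startswith("#"):                   # immediate operand
            mode, value, reloc = "10", int(operand[1:]), ""
        else:                                         # address from symbol table
            mode, value, reloc = "00", symbols[operand], " *"
        words.append(f"{opcode} {int(reg[1:]):02b} {mode} {value:08b}{reloc}")
    return words

source = ["MOV a, R1", "ADD #2, R1", "MOV R1, b"]
symbols = first_pass(source)
for word in second_pass(source, symbols):
    print(word)
```

The first pass only builds the symbol table; the second pass can then resolve a and b without any forward-reference problem.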


Loaders and Link-Editors
• Convert the relocatable machine code into absolute
machine code
– Map the relocatable address
• Place altered instructions and data in memory

Address space of data to be loaded starting at location L=00001111

Relocatable machine code    Absolute machine code
0001 01 00 00000000 *       0001 01 00 00001111
0011 01 10 00000010         0011 01 10 00000010
0010 01 00 00000100 *       0010 01 00 00010011

• Make a single program from several files of relocatable machine code
– Several files of relocatable codes
– Library files
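The address mapping performed by the loader can be sketched in a few lines of Python: every word whose relocation bit is set has the load address L added to its address field. The textual word format ("opcode register mode address" with a trailing * for the relocation bit) follows the example above; the function name relocate is hypothetical.

```python
def relocate(words, load_address):
    """Turn relocatable words into absolute ones by adding the load address."""
    absolute = []
    for word in words:
        fields = word.split()
        if fields[-1] == "*":                       # relocation bit is set
            opcode, reg, mode, addr = fields[:4]
            addr = format(int(addr, 2) + load_address, "08b")
            absolute.append(" ".join([opcode, reg, mode, addr]))
        else:                                       # e.g. immediate operand
            absolute.append(" ".join(fields))
    return absolute

relocatable = [
    "0001 01 00 00000000 *",
    "0011 01 10 00000010",
    "0010 01 00 00000100 *",
]
for word in relocate(relocatable, 0b00001111):
    print(word)
```

The immediate operand in the second word is left untouched, since only addresses marked with the relocation bit depend on where the program is loaded.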
Multi Pass Compilers
• Passes
– Several phases of the compiler are grouped into passes
– Often passes generate an explicit output file
– In each pass the whole input file/source is processed

[Figure: phases grouped into one pass: lexical analyzer, syntax analyzer, semantic analysis, intermediate code generator]
Issues in Compilation
• The translation of code from some human-readable form to machine code must be “correct”, i.e., the generated machine code must execute precisely the same computation as the source code.

• In general, there is no unique translation from a source language to a destination language. No algorithm exists for an “ideal translation”.
Cont…
• Translation is a complex process. The source language and generated code are very different. To manage this complex process, the translation is carried out in multiple passes.
How many passes?
• Relatively few passes are desirable
– Reading and writing intermediate files takes time
– Using fewer passes may require keeping the entire program in memory
• One phase may generate information in a different order than the next phase needs it
• Memory space is not trivial in some cases
• Grouping phases into the same pass incurs some problems
– Intermediate code generation and code generation in the same pass is difficult
• e.g. the target of a ‘goto’ that jumps forward is not yet known
• ‘Backpatching’ can be a remedy
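Backpatching can be illustrated with a small Python sketch: a forward jump is emitted with a placeholder target, remembered, and filled in once the label's position becomes known. The Emitter class, its method names, and the "goto ?" placeholder are assumptions made for this example.

```python
class Emitter:
    """Emit code in one pass, patching forward jumps when labels are defined."""

    def __init__(self):
        self.code = []
        self.labels = {}        # label -> index of the instruction it marks
        self.pending = {}       # label -> indexes of jumps still waiting for it

    def emit(self, text):
        self.code.append(text)

    def emit_goto(self, label):
        if label in self.labels:                    # backward jump: target known
            self.code.append(f"goto {self.labels[label]}")
        else:                                       # forward jump: leave a hole
            self.pending.setdefault(label, []).append(len(self.code))
            self.code.append("goto ?")

    def define_label(self, label):
        target = len(self.code)                     # next instruction to be emitted
        self.labels[label] = target
        for index in self.pending.pop(label, []):   # backpatch the waiting jumps
            self.code[index] = f"goto {target}"

e = Emitter()
e.emit_goto("L1")           # target of L1 is not yet known
e.emit("x := 1")
e.define_label("L1")        # L1 now refers to instruction 2
e.emit("y := 2")
e.emit_goto("L1")           # backward jump, resolved immediately
print(e.code)
```

The first goto is emitted before L1 exists and is patched later, so the jump is resolved without a second pass over the code.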
Issues Driving Compiler Design
• Correctness
• Speed (runtime and compile time)
– Degrees of optimization
– Multiple passes
• Space
• Feedback to user
• Debugging
Other Applications
• In addition to the development of a compiler, the techniques used in compiler design are applicable to many problems in computer science.
– Techniques used in a lexical analyzer can be used in text editors, information retrieval systems, and pattern recognition programs.
– Techniques used in a parser can be used in a query processing system such as SQL.
– Many software systems with a complex front end may need techniques used in compiler design.
• For example, a symbolic equation solver that takes an equation as input must parse that equation.
– Most of the techniques used in compiler design can be used in Natural Language Processing (NLP) systems.
Software tools that perform analysis
• Structure Editors
– Give the program text a hierarchical structure
– Suggest keywords/structures automatically
• Pretty Printers
– Provide an organized and structured look to the program
– Use different colors, fonts, and indentation
• Static Checkers
– Try to discover potential bugs without running the program
– Identify logical errors, perform type checking, identify dead code, etc.
• Interpreters
– Do not produce a target program
– Execute the operations implied by the program
