
Ambo University

Hachalu Hundessa Campus


School of Informatics and Electrical Engineering
Department of Computer Science

Kenesa B. (getkennyo@gmail.com)
CHAPTER ONE
INTRODUCTION
O U T L I N E
q Introduction
q Language processors
§ Compilers
§ Interpreters
q Language processing system
q Phases of compilers
§ Lexical analysis, Syntax analysis, Semantic analysis, IC generator, IC
optimizer, Code generator, Code optimizer
q Error handling routine
q Compiler construction tools

Kenesa B. (Ambo University) Compiler Design 3


Introduction
What is a Compiler?

q Programming languages are notations for describing computations to people and to machines.
§ before a program can be run, it first must be translated into a form in
which it can be executed by a computer.
q A translator or language processor is a program that translates an input
program written in a programming language into an equivalent program in
another language.
§ A compiler is a program that reads a program written in one language (the
source language) and translates it into another language (the target language).
• Typically from high level source code to low level machine code or
object code
Kenesa B. (Ambo University) Compiler Design 4
Introduction...
q Compilers provide an essential interface between applications and architectures
q Since different platforms (hardware architectures along with operating
systems such as Windows, Mac, or Unix) require different machine code, you
must compile most programs separately for each platform.

[Diagram: one source program is fed to a separate compiler for each platform,
producing machine code for Windows, Unix, and Mac.]
Kenesa B. (Ambo University) Compiler Design 5


Language Processors
q Language processing is an important component of programming
q Compilers are fundamental to modern computing.
§ They act as translators, transforming human-oriented programming languages into computer-
oriented machine languages.
q Compilers normally translate conventional programming languages like Java, C
and C++ into executable machine-language instructions
§ the target language is machine code or "code" that a computer's processor
understands.
• Source code is normally optimized for human readability
– Expressive: matches our notion of languages (and application!)
– Redundant to help avoid programming errors
• Machine code is optimized for hardware
– Redundancy is reduced
– Information about the intent is lost
Kenesa B. (Ambo University) Compiler Design 6
Language Processors
q An important role of the compiler is to report any errors in the source program
that it detects during the translation process.
Source Program → Compiler → Target Program
(The source program in the figure is a fragment such as: if (…) { x := …; } else { y := …; } …;)

q If the target program is an executable machine-language program, it can then be
called by the user to process inputs and produce outputs.

Input → Target Program → Output
(Running the target program)


Kenesa B. (Ambo University) Compiler Design 7
Language Processors
q An interpreter is another common kind of language processor.
q An interpreter appears to directly execute the operations specified in the source
program on inputs supplied by the user
§ A program that reads source program and produces the results of executing
that program

Source Program + Input → Interpreter → Output

§ Works by analyzing and executing the source program commands one at a time
§ Does not translate the whole source program into object code

Kenesa B. (Ambo University) Compiler Design 8


Language Processors
q Interpretation is important when:
§ Programmer is working in interactive mode and needs to view and update
variables
§ Running speed is not important
§ Commands have simple formats, and thus can be quickly analyzed and
executed
§ Modification or addition to user programs is required as execution proceeds
q The machine-language target program produced by a compiler is usually much
faster than an interpreter at mapping inputs to outputs.
q An interpreter, however, can usually give better error diagnostics than a compiler
• because it executes the source program statement by statement.

Kenesa B. (Ambo University) Compiler Design 9


Compilers Vs. Interpreters
q Interpreters and compilers have much in common. They perform many of the same tasks.
§ Both analyze the input program and determine whether or not it is a valid program.
§ Both build an internal model of the structure and meaning of the program.
§ Both determine where to store values during execution.

§ Interpreter:
Ø An interpreter takes one statement, translates and executes it, and then takes the next statement.
Ø An interpreter stops the translation after it gets the first error.
Ø An interpreter takes less time to analyze the source code.
Ø Overall execution speed is slower.
• E.g. JavaScript, PERL, PHP, Python, Ruby, awk, sed, shells (bash), etc.
§ Compiler:
Ø A compiler translates the entire program in one go and then executes it.
Ø A compiler generates the error report after the translation of the entire program.
Ø A compiler takes a large amount of time in analyzing and processing the high-level language code.
Ø Overall execution time is faster.
• E.g. FORTRAN, C, C++, COBOL, VHDL, etc.
Kenesa B. (Ambo University) Compiler Design 10
Hybrid approaches
q Java language processors combine compilation and interpretation.
§ A Java source program may first be compiled into an intermediate form called
bytecodes.
• a compact representation intended to decrease download times for Java applications.
§ The bytecodes are then interpreted by a virtual machine.
§ Java applications execute by running the bytecode on the corresponding Java
Virtual Machine (JVM), an interpreter for bytecode.
q Benefit: bytecodes compiled on one machine can be interpreted on another
machine, perhaps across a network.
q Just-in-time (JIT) compilers translate the bytecodes into machine language
immediately before they run the intermediate program to process the input.
§ In order to achieve faster processing of inputs to outputs
§ Compile some or all byte codes to native code
Kenesa B. (Ambo University) Compiler Design 11
Hybrid approaches

[Diagram: Source Program → Translator → Intermediate Program; the Intermediate
Program together with the Input is run by a Virtual Machine, which produces the
Output. For Java: Java Program → compiler → Java bytecode → interpreter.]
A hybrid compiler 12
Language Processing System
q In addition to a compiler, several other programs may be required to create an
executable target program
q A source program may be divided into modules stored in separate files.
q The task of collecting the source program is sometimes entrusted to a separate
program, called a preprocessor.
§ A preprocessor program executes automatically before the compiler’s
translation phase begins.
§ Such a pre-processor:
• can delete comments
• macro processing (substitutions)
• include other files
• produce input to a compiler...
• E.g. In C (C++), the preprocessor replaces each #include <...> directive by file inclusion
and expands each #define macro, as sketched below
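A minimal C sketch of this behavior (the macro names PI and AREA are only for illustration):

#include <stdio.h>                 /* file inclusion: the header text is copied in here */

#define PI 3.14159                 /* simple macro definition      */
#define AREA(r) (PI * (r) * (r))   /* macro that takes an argument */

int main(void) {
    /* macro expansion: before the compiler proper runs, the preprocessor
       rewrites the next line as  double a = (3.14159 * (2.0) * (2.0));  */
    double a = AREA(2.0);
    printf("%f\n", a);
    return 0;
}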
Kenesa B. (Ambo University) Compiler Design 13
Language Processing System
q The modified source program is then fed to a compiler.
q The compiler may produce an assembly-language program as its output
§ because assembly language is easier to produce as output and is easier to
debug.
q The assembly language is then processed by a program called an assembler that
produces relocatable machine code as its output.
q Large programs are often compiled in pieces
§ so the relocatable machine code may have to be linked together with other
relocatable object files and library files
q The linker resolves external memory addresses, where the code in one file may
refer to a location in another file.
q The loader then puts together all of the executable object files into memory for
execution.
Kenesa B. (Ambo University) Compiler Design 14
Language Processing System
Source program (C or C++ program)
    ↓ Preprocessor
Modified source program (C or C++ program with macro substitutions and file inclusions)
    ↓ Compiler
Target assembly language (assembly code)
    ↓ Assembler
Relocatable machine code (relocatable object module)
    ↓ Linker ← other relocatable object modules or library modules
    ↓ Loader
Absolute machine code (executable code)
Kenesa B. (Ambo University) Compiler Design 15
Structure of a Compiler
q Many modern compilers have a common two-stage design.
q Front end: Analysis
§ The front end focuses on understanding the source-language program.
§ Read source program and understand its structure and meaning
§ Breaks up the source program into constituent pieces (an intermediate
representation) and imposes a grammatical structure on them.
§ Creates an intermediate representation of the source program.
§ During analysis, the operations implied by the source program are determined
and recorded in hierarchical structure called a tree.
§ Analysis consists of the following phases:
• Lexical analysis (Scanning) - Linear
• Syntax analysis (Parsing) - Hierarchical
• Semantic analysis
Kenesa B. (Ambo University) Compiler Design 16
Structure of a Compiler
q Back end: Synthesis
§ The back end focuses on mapping programs to the target machine.
§ Generate equivalent target language program from the intermediate
representation.
• The "back end" works with the internal representation to produce low level code.
§ It constructs the desired target program from the intermediate representation
and the information in the symbol table.
§ Consists of:
• The target code generator
• The target code optimizer
A compiler uses some set of data structures to represent the code that it
processes. That form is called an intermediate representation, or IR.

Source → Front End → IR → Back End → Target   (the compiler)

Kenesa B. (Ambo University) Compiler Design 17
Structure of a Compiler
q The front end must encode its knowledge of the source program in some structure
for later use by the back end.
• This intermediate representation (IR) becomes the compiler’s definitive representation for
the code it is translating.
q At each point in compilation, the compiler will have a definitive representation.
q It may, in fact, use several different IRs as compilation progresses, but at each
point, one representation will be the definitive IR.
• We think of the definitive IR as the version of the program passed between independent
phases of the compiler
q The two-phase structure of the compiler is advantageous for the following reasons:
§ By keeping the same front end and attaching different back ends, one can produce compilers
for the same source language on different machines.
§ By keeping different front ends and the same back end, one can compile several different
languages on the same machine.
Kenesa B. (Ambo University) Compiler Design 18
Structure of a Compiler

q Introducing an IR makes it possible to add more phases to compilation.


q The compiler writer can insert a third phase between the front end and the back
end.
• This middle section, or optimizer, takes an IR program as its input and produces a semantically
equivalent IR program as its output.
• This leads to the following compiler structure, termed a three-phase compiler.

Kenesa B. (Ambo University) Compiler Design 19


Structure of a Compiler
q If the analysis part detects that the source program is either syntactically ill-formed
or semantically unsound, then it must provide informative error messages
q The analysis part also collects information about the source program and stores it
in a data structure called a symbol table.
§ Symbol table is passed along with the intermediate representation to the synthesis part.
q The compilation process operates as a sequence of phases
§ each of which transforms one representation of the source program to another.
q The compilation process is partitioned into a number of sub-processes called phases.
§ the intermediate representations between the grouped phases need not be constructed
explicitly.
q The symbol table, which stores information about the entire source program, is
used by all phases of the compiler.
q Some compilers have a machine-independent optimization phase between the
front end and the back end.
Kenesa B. (Ambo University) Compiler Design 20
Phases of a Compiler
[Diagram: the phases of a compiler]
Source code → Scanner → Tokens → Parser → Syntax tree → Semantic analyzer →
Annotated tree → Intermediate code generator → Intermediate code →
Intermediate code optimizer → Intermediate code →
Target code generator (Target code optimizer) → Target code
The symbol table and the error handler are shared by all phases.
Kenesa B. (Ambo University) Compiler Design 21


Lexical Analysis - Scanning
q This is the first phase and it is also referred to as the Scanning
q The first step is recognizing/knowing alphabets of a language.
q For example:
• English text consists of lower and upper case alphabets, digits, punctuations
and white spaces
• Written programs consist of characters from the ASCII character set
q It separates characters of the source language into groups that logically belong
together; these groups are called tokens.
q Token stream: A token is a sequence of characters having a collective meaning.
• Operators & Punctuation: { } [ ] ! + - = * ; :...
• Keywords: if, while, return, goto,...
• Identifiers: id & actual name, Literals: kind & value; int, floating-point character,
string,...
Kenesa B. (Ambo University) Compiler Design 22
Lexical Analysis - Scanning
q Example: Input text: if (x >= y)
y = 42;
§ Token Stream (one possible choice of token names): IF LPAREN ID(x) GEQ ID(y) RPAREN ID(y) ASSIGN INT(42) SEMI

• Notes: tokens are atomic items, not character strings; comments & whitespace are not tokens
q The lexical analyzer reads the stream of characters making up the source program
and groups the characters into meaningful sequences called lexemes.
q For each lexeme, the lexical analyzer produces as output a token of the form
(token-name, attribute-value) that it passes on to the subsequent phase, syntax
analysis (one possible representation is sketched below)
q A scanner may perform other operations along with the recognition of tokens.
§ It may enter identifiers and literals into the symbol table
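As a minimal sketch (the type and field names are illustrative, not a fixed convention), tokens of the form (token-name, attribute-value) could be represented in C as:

#include <stdio.h>

/* token names; the attribute is a symbol-table index for identifiers
   and the literal value for numbers (illustrative layout) */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_TIMES } TokenName;

typedef struct {
    TokenName name;   /* abstract symbol used during syntax analysis    */
    int       attr;   /* symbol-table entry index, or the numeric value */
} Token;

int main(void) {
    /* token stream for: position = initial + rate * 60,
       with position, initial, rate at symbol-table slots 1..3 */
    Token stream[] = {
        {TOK_ID, 1}, {TOK_ASSIGN, 0}, {TOK_ID, 2}, {TOK_PLUS, 0},
        {TOK_ID, 3}, {TOK_TIMES, 0},  {TOK_NUM, 60}
    };
    for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("(%d, %d)\n", (int)stream[i].name, stream[i].attr);
    return 0;
}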
Kenesa B. (Ambo University) Compiler Design 23
Lexical Analysis - Scanning
q the first component token-name is an abstract symbol that is used during syntax
analysis
q the second component attribute-value points to an entry in the symbol table for
this token.
§ Information from the symbol-table entry is needed for semantic analysis and code generation
q Example: position = initial + rate * 60
q The characters in this assignment could be grouped into the following lexemes:
§ position is a lexeme that would be mapped into a token (id, 1)
– where id is an abstract symbol standing for identifier and 1 points to the
symbol table entry for position.
– the symbol-table entry for an identifier holds information about the
identifier, such as its name and type.

Kenesa B. (Ambo University) Compiler Design 24


Lexical Analysis - Scanning
§ The assignment symbol = is a lexeme that is mapped into the token (=).
– Since this token needs no attribute-value, we have omitted the second
component.
§ initial is a lexeme that is mapped into the token (id,2)
§ + is a lexeme that is mapped into the token (+).
§ rate is a lexeme that is mapped into the token (id, 3)
§ * is a lexeme that is mapped into the token (*).
§ 60 is a lexeme that is mapped into the token (60)
q Representation of the assignment statement after lexical analysis as the
sequence of tokens: (id, 1) (=) (id, 2) (+) (id, 3) (*) (60)
q There are tools available to assist in the writing of lexical analyzers:
• lex - produces C source code (UNIX/linux), flex - produces C source code (gnu),
JLex - produces Java source code.
Kenesa B. (Ambo University) Compiler Design 25
Syntax Analysis - Parsing
q The second phase of the compiler is syntax analysis or parsing.
q Once the words are understood, the next step is to understand the structure of the
sentence
§ the process is known as syntax checking or parsing
q The parser uses the first components of the tokens produced by the lexical
analyzer to create a tree-like intermediate representation
§ that depicts the grammatical structure of the token stream.
q A typical representation is a syntax tree in which each interior node represents an
operation and the children of the node represent the arguments of the operation.
q In this phase, expressions, statements, declarations, etc. are identified using
the results of lexical analysis.
§ This groups tokens together into syntactic structures.

Kenesa B. (Ambo University) Compiler Design 26


Syntax Analysis - Parsing
q For example, the three tokens representing A+B might be grouped into a syntactic
structure called an expression.
§ Expressions might further be combined to form statements.
q Often the syntactic structure can be regarded as a tree whose leaves are the
tokens.
q The interior nodes of the tree represent strings of tokens that logically belong
together.
q The parser has two functions:
• It checks that the tokens appearing in its input, which is the output of the lexical
analyser, occur in patterns that are permitted by the specification for the source
language.
• It also imposes on the tokens a tree-like structure that is used by the subsequent
phases of the compiler.

Kenesa B. (Ambo University) Compiler Design 27


Syntax Analysis - Parsing
q A syntax tree for the token stream: (id, 1) (=) (id, 2) (+) (id, 3) (*) (60)
§ has an interior node labeled * with (id, 3) as its left child and the integer 60 as its right child.
§ The node (id,3) represents the identifier rate.
§ The node labeled * makes it explicit that we must first multiply the value of rate by 60.
§ The node labeled + indicates that we must add the result of this multiplication to the value of
initial.
§ The root of the tree, labeled =, indicates that we must store the result of this addition into the
location for the identifier position (id,1).
=

(id, 1) +

(id, 2) *

(id, 3) 60
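A minimal C sketch of one way such a tree could be represented in memory (the node kinds and helper names are illustrative):

#include <stdlib.h>

typedef enum { N_OP, N_ID, N_NUM } Kind;  /* operator node, identifier leaf, number leaf */

typedef struct Node {
    Kind kind;
    char op;                     /* '=', '+', '*' for operator nodes              */
    int  value;                  /* symbol-table slot for N_ID, literal for N_NUM */
    struct Node *left, *right;
} Node;

static Node *op_node(char op, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->kind = N_OP; n->op = op; n->value = 0; n->left = l; n->right = r;
    return n;
}

static Node *leaf(Kind k, int v) {
    Node *n = malloc(sizeof *n);
    n->kind = k; n->op = 0; n->value = v; n->left = n->right = NULL;
    return n;
}

int main(void) {
    /* builds the tree shown above for (id,1) = (id,2) + (id,3) * 60 */
    Node *tree = op_node('=', leaf(N_ID, 1),
                   op_node('+', leaf(N_ID, 2),
                     op_node('*', leaf(N_ID, 3), leaf(N_NUM, 60))));
    (void)tree;   /* later phases would walk this structure */
    return 0;
}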

Kenesa B. (Ambo University) Compiler Design 28


Semantic Analysis
q Once the sentence structure is understood we try to understand the meaning of
the sentence (semantic analysis)
q Semantic analyzer uses the syntax tree and the information in the symbol table
§ to check the source program for semantic consistency with the language
definition.
§ type checking, array bound checking, the correctness of scope resolution,...
q It also gathers type information and saves it in either the syntax tree or the
symbol table, for subsequent use during intermediate-code generation.
q Static semantics include:
§ Declarations of variables and constants before use
§ Calling functions that exist (predefined in a library or defined by the user)
§ Passing parameters properly...

Kenesa B. (Ambo University) Compiler Design 29


Semantic Analysis
q The semantic analyzer does the following:
§ Checks the static semantics of the language
§ Annotates the syntax tree with type information
q An important part of semantic analysis is type checking, where the compiler
checks that each operator has matching operands.
q Suppose that position, initial, and rate have been declared to be floating-point numbers and that
the lexeme 60 by itself forms an integer.
Annotated syntax tree:
=

(id, 1) +

(id, 2) *

(id, 3) inttofloat

60
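A minimal C sketch of the type-checking rule that triggers this conversion, assuming only two types; the function name and coercion flags are illustrative:

typedef enum { T_INT, T_FLOAT } Type;

/* Type rule for a binary arithmetic node: if one operand is int and the
   other is float, mark the int operand for an inttofloat coercion and
   give the whole expression type float. */
Type check_binary(Type left, Type right, int *coerce_left, int *coerce_right) {
    *coerce_left = *coerce_right = 0;
    if (left == right) return left;
    if (left == T_INT) { *coerce_left  = 1; }   /* wrap left in inttofloat(...)  */
    else               { *coerce_right = 1; }   /* wrap right in inttofloat(...) */
    return T_FLOAT;
}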
Intermediate Code (IC) Generation
q During translation a compiler may construct one or more intermediate
representations
§ which can have a variety of forms: Three-address code (TAC), quadruples,
triple, P-code for an abstract machine, Tree or DAG representation
• e.g. Syntax trees are a form of intermediate representation (during syntax
and semantic analysis)
q After syntax and semantic analysis, many compilers generate an explicit low-level
or machine-like intermediate representation,
§ which we can think of as a program for an abstract machine.
q This intermediate representation should have two important properties:
§ it should be easy to produce and
§ it should be easy to translate into the target machine.

Kenesa B. (Ambo University) Compiler Design 31


Intermediate Code (IC) Generation
q E.g. The three-address code sequence
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
q three-address instructions:
§ each three-address assignment instruction has at most one operator on the
right side.
– these instructions fix the order in which operations are to be done; the
multiplication precedes the addition
§ the compiler must generate a temporary name to hold the value computed
§ some "three-address instructions" like the first and last line of the sequence
above, have fewer than three operands.
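A minimal sketch of how such a sequence is often stored internally as quadruples (op, arg1, arg2, result); the C field names are illustrative:

/* each quadruple holds at most one operator; a null field is unused */
typedef struct {
    const char *op;      /* "inttofloat", "*", "+", "="            */
    const char *arg1;    /* first operand (may be a temporary)     */
    const char *arg2;    /* second operand, or null                */
    const char *result;  /* name that receives the computed value  */
} Quad;

/* the sequence above for  position = initial + rate * 60 */
static const Quad code[] = {
    { "inttofloat", "60",  0,    "t1"  },
    { "*",          "id3", "t1", "t2"  },
    { "+",          "id2", "t2", "t3"  },
    { "=",          "t3",  0,    "id1" },
};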
Kenesa B. (Ambo University) Compiler Design 32
IC Optimization
q An IC optimizer reviews the code, looking for ways to reduce:
§ the number of operations and
§ the memory requirements.
q A program may be optimized for speed or for size.
§ This phase changes the IC so that the code generator produces a faster and
less memory consuming program.
§ The optimized code does the same thing as the original (non-optimized) code
but with less cost in terms of CPU time and memory space.
q Ex. Unnecessary lines of code in loops (i.e. code that could be executed outside of
the loop) are moved out of the loop.
Before:
for (i = 1; i < 10; i++) {
    x = y + 1;
    z = x + i;
}

After:
x = y + 1;
for (i = 1; i < 10; i++)
    z = x + i;
Kenesa B. (Ambo University) Compiler Design 33
IC Optimization
q For example: the optimizer can deduce that the conversion of 60 from integer to
floating point can be done once and for all at compile time
§ so the inttofloat operation can be eliminated by replacing the integer 60 by the
floating-point number 60.0.
§ t3 is used only once to transmit its value to id1, so the optimizer can transform
the code into the shorter sequence
t1 = id3 * 60.0
id1 = id2 + t1
q In optimizing compilers, a significant amount of time is spent on this phase.
q There are simple optimizations that significantly improve the running time of the
target program without slowing down compilation too much.

Kenesa B. (Ambo University) Compiler Design 34


Target Code Generator
q The machine code generator receives the (optimized) intermediate code, and then
it produces either:
• Machine code for a specific machine, or
• Assembly code for a specific machine and assembler.
q Code generator
• Selects appropriate machine instructions
• Allocates memory locations for variables
• Allocates registers for intermediate computations

Kenesa B. (Ambo University) Compiler Design 35


Target Code Generator
q The code generator takes the IR code and generates code for the target machine.
§ For example, using registers R1 and R2, the intermediate code from the previous
slide might get translated into the machine code below:
Intermediate code:
t1 = id3 * 60.0
id1 = id2 + t1

Generated machine code:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1

q The first operand of each instruction specifies a destination.


q The F in each instruction tells us that it deals with floating-point numbers.
q The # signifies that 60.0 is to be treated as an immediate constant.

Kenesa B. (Ambo University) Compiler Design 36


Target Code Optimizer
q In this phase, the compiler attempts to improve the target code generated by the
code generator.
q Such improvement includes:
• Choosing addressing modes to improve performance
• Replacing slow instruction by faster ones
• Eliminating redundant or unnecessary operations
q In the sample target code given, one improvement would be to use a shift instruction to
replace the multiplication in the second instruction (see the sketch below).
q Another is to use a more powerful addressing mode, such as indexed addressing
to perform the array store.
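As a small, hedged illustration of the shift idea (strength reduction applies when the multiplier is an integer power of two, not to the floating-point constant 60.0 above):

/* strength reduction: for unsigned x, x * 8 and x << 3 compute the same
   value, and the shift is cheaper on many machines */
unsigned scale_mul(unsigned x)   { return x * 8u; }
unsigned scale_shift(unsigned x) { return x << 3; }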

Kenesa B. (Ambo University) Compiler Design 37


Symbol Table
q To support these phases of the compiler, a symbol table is maintained.
q The task of symbol table is to store identifiers (variables) used in the program.
q The symbol table also stores information about attributes of each identifier.
§ The attributes of an identifier are usually its type, its scope, and information about the
storage allocated for it.
q The symbol table also stores information about the subroutines used in the
program.
§ the symbol table stores the name of the subroutine, the number of arguments passed to it, the
types of these arguments, the method of passing them (call by value or call by reference),
and the return type, if any.
q Basically, a symbol table is a data structure used to store information about
identifiers (one possible record layout is sketched below).
q The symbol table allows us to find the record for each identifier quickly and to
store or retrieve data from that record efficiently.
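A minimal sketch of one possible record layout for such a table (the C field names are illustrative; a real implementation would add a hash table keyed on the name for fast lookup):

typedef struct {
    char name[64];      /* identifier as written in the source           */
    char type[16];      /* e.g. "int", "float"                           */
    int  scope_level;   /* block nesting depth where it is declared      */
    int  offset;        /* storage allocated for it, e.g. a frame offset */
} SymbolEntry;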
Kenesa B. (Ambo University) Compiler Design 38
Error Handling Routine
q Error detection and reporting of errors are important functions of the compiler.
q Whenever an error is encountered during the compilation of the source program,
an error handler is invoked.
q Error handler generates a suitable error reporting message regarding the error
encountered.
§ The error reporting message allows the programmer to find out the exact location of the error.
q Errors can be encountered at any phase of the compiler during compilation of the
source program for several reasons such as:
• Lexical analyzer: due to misspelled tokens, unrecognized characters, etc.
• Syntax analyzer: due to the syntactic violation of the language e.g. Missing parenthesis
• Intermediate code generator: due to incompatibility of operands type for an operator
• Code Optimizer: during the control flow analysis due to some unreachable statements
• Code Generator: due to the incompatibility with the computer architecture
• Symbol tables: due to the multiple declaration of an identifier with ambiguous attributes
Kenesa B. (Ambo University) Compiler Design 39
Concept of Pass
q One complete scan of the source program is called a pass.
– It includes reading an input file and writing an output file.
q Many phases can be grouped to one pass.
– It is difficult to compile the source program into a single pass, because the program may have
some forward references.
– For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and
intermediate code generation might be grouped together into one pass.
q It is desirable to have relatively few passes, because it takes time to read and
write intermediate files.
q On the other hand, if we group several phases into one pass, we may be forced to
keep the entire program in the memory.
– Therefore, memory requirement may be large.
§ A typical arrangement for an optimizing compiler is: one pass for scanning and parsing, one
pass for semantic analysis, and a third pass for code generation and target code optimization.
Kenesa B. (Ambo University) Compiler Design 40
Concept of Pass
q Two types of pass:
§ Single pass (e.g. Pascal’s compiler)
• a compiler that passes through the source code of each compilation unit only once.
• a one-pass compiler does not "look back" at code it previously processed.
§ intermediate representation of source program is not created.
• a one-pass compiler is faster than a multi-pass compiler
• it is unable to generate efficient programs, due to the limited scope of information available.
§ Multi pass (e.g. C++ compiler)
• a type of compiler that processes the source code or abstract syntax tree of a program
several times.
• intermediate representation of source program is created.
• a collection of phases is done multiple times
• a multi-pass compiler takes less space than a single-pass compiler
§ because the space used by the compiler during one pass can be reused by the subsequent pass.

Kenesa B. (Ambo University) Compiler Design 41


Types of Compiler
q Native code Compiler:
§ A compiler may produce binary output to run/execute on the same computer
and operating system.
§ This type of compiler is called a native code compiler.
q JIT Compiler:
§ This type of compiler is used for the Java programming language and Microsoft .NET
q Source to source Compiler:
§ It is a type of compiler that takes a high-level language as input and produces
its output in another high-level language.
q Cross Compiler:
§ A compiler which may run on one machine and produce the target code for
another machine
§ the cross compilation technique facilitates platform independence.
Kenesa B. (Ambo University) Compiler Design 42
Types of Compiler
q Cross Compiler:
§ It consists of three symbols S, T and I, where:
• S is the source language in which the source program is written,
• T is the target language in which the compiler produces its output or target
program, and
• I is the implementation language in which compiler is written.
§ A cross compiler can be represented with the help of a T diagram as shown
below

Kenesa B. (Ambo University) Compiler Design 43


Types of Compiler
q Bootstrapping:
§ If a compiler has been implemented in its own language, it is called a self-hosting compiler.
§ This concept uses a simple language to translate more complicated programs, which
can in turn handle still more complicated programs.
§ Suppose we want to create a cross compiler for the new source language S
that generates a target code in language T, and the implementation language
of this compiler is A.

Kenesa B. (Ambo University) Compiler Design 44


Compiler Construction Tools
§ For the construction of a compiler, the compiler writer uses different types of
software tools that are known as compiler construction tools.
§ These tools make use of specialized languages for specifying and implementing
specific components, and most of them use sophisticated algorithms.
• The tools should hide the details of the algorithm
§ Some of the most commonly used compiler construction tools are:
q Scanner generators
§ Ex. Lex, flex, JLex
§ These tools generate a scanner /lexical analyzer/ if given a regular expression.
q Parser Generators
§ Ex. Yacc, Bison, CUP
§ These tools produce a parser /syntax analyzer/ if given a Context Free Grammar
(CFG) that describes the syntax of the source language.
Kenesa B. (Ambo University) Compiler Design 45
Compiler Construction Tools
q Syntax directed translation engines
§ Ex. Cornell Synthesizer Generator
§ It produces a collection of routines that walk the parse tree and execute some tasks.
q Code-generator generators
§ Take a collection of rules that define the translation of the IC to target code and
produce a code generator.
q Data-flow analysis engines
§ facilitate the gathering of information about how values are transmitted from one
part of a program to each other part.
• key part of code optimization.
q Compiler-construction toolkits
§ provide an integrated set of routines for constructing various phases of a compiler.

Kenesa B. (Ambo University) Compiler Design 46



T H E E N D !
47
