See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/332726973

Chapter 1: Introduction to Compiler
Preprint · April 2019 · DOI: 10.13140/RG.2.2.31394.89289
Author: Rajendra Kumar, Chandigarh University
Uploaded by the author on 29 April 2019.


Chapter 1
Introduction to Compiler
1.1 INTRODUCTION
Computer programs are formulated in a programming language and specify classes of computing processes.
Computers, however, interpret sequences of particular instructions, but not program texts. Therefore, the program
text must be translated into a suitable instruction sequence before it can be processed by a computer. This
translation can be automated, which implies that it can be formulated as a program itself. The translation program
is called a compiler, and the text to be translated is called source text (or sometimes source code).

THE COMPILER
A compiler is software that takes a program written in a high-level language and translates it into an equivalent
program in a target language. More specifically, a compiler takes a computer program and translates it into an
object program. Other tools associated with the compiler are responsible for turning the object program into
executable form.

[source program → COMPILER → target program; error messages are emitted when compilation fails]

Fig. 1.1 Major function of Compiler
Source program – It is normally a program written in a high-level programming language. It contains a set of
rules, symbols, and special words used to construct a computer program.
Target program – It is normally the equivalent program in machine code. It contains the binary representation
of the instructions that the hardware of the computer can perform.
Error Message – A message issued by the compiler due to detection of syntax errors in the source program.
Assembly Language – It is a middle-level programming language in which mnemonics are used to represent
each of the machine language instructions for a specific computer. Assembly language programs also allow the
user to use text names for data rather than having to remember their memory addresses. An Assembler is a
computer program that translates an assembly language program into machine code.
Table 1.1 The machine and assembly language codes
Machine Language Assembly Language
100101 ADD
011001 SUB
001101 MPY
100111 CMP
100011 JMP
110011 JNZ
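An assembler's use of such a table can be sketched in a few lines of Python (the encodings are the illustrative ones from Table 1.1, not a real instruction set):

```python
# Opcode table taken from Table 1.1 (illustrative encodings, not a real ISA).
OPCODES = {
    "ADD": "100101",
    "SUB": "011001",
    "MPY": "001101",
    "CMP": "100111",
    "JMP": "100011",
    "JNZ": "110011",
}

def assemble(lines):
    """Translate a list of assembly mnemonics into machine-code bit strings."""
    return [OPCODES[line.strip()] for line in lines]

print(assemble(["ADD", "JMP"]))  # ['100101', '100011']
```

A real assembler also resolves operands and symbolic names to addresses; this sketch shows only the mnemonic-to-opcode lookup.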
Typically, a compiler includes several functional parts. For example, a conventional compiler may include a
lexical analyzer that looks at the source program and identifies successive “tokens” in the source program.
A conventional compiler also includes a parser / syntactical analyzer, which takes as an input a grammar defining
the language being compiled and a series of actions associated with the grammar. The syntactical analyzer builds
a “parse tree” for the statements in the source program in accordance with the grammar productions and actions.
For each statement in the input source program, the syntactical analyzer generates a parse tree of the source input
in a recursive, “bottom-up” manner in accordance with relevant productions and actions in the grammar. Thus, the

© Rajendra Kumar Page 1 of 13


parse tree is formed of nodes corresponding to one or more grammar productions. Generation of the parse tree
allows the syntactical analyzer to determine whether the parts of the source program comply with the grammar. If
not, the syntactical analyzer generates an error. Thus, the syntactical analyzer performs syntactical checking, but
does not conventionally check the meaning (“the semantics”) of the source program.
More general idea of compiler’s functions
Most compilers are not a single tool. For example, the Turbo C/C++ compiler is a combination of five different
but necessary tools, namely
➢ Editor
➢ Debugger
➢ Compiler
➢ Linker
➢ Loader
But in this course our scope is limited to debugging and compiling only.
When the Turbo C/C++ compiler is activated it creates an IDE (Integrated Development Environment). This
environment provides an editor for writing a new program or editing an existing program (say xyz.c or xyz.cpp)
stored on some secondary storage like a hard disk, CD, pen drive, etc. When editing is over, the Turbo C/C++
compiler compiles the program by translating it into object code (xyz.obj) if there is no syntax error in the source
program. If there is any syntax error in the source program then a tool called the debugger is activated and produces a
list of syntax errors. The syntax errors in the source program can only be eliminated by re-editing the source
program; this is a manual process. The code obtained after successful compilation is called object code. It is code in
machine language, but it is not executable. The tool linker now links the object code with the library functions used
in the source program and also combines all the modules as required. When the linking process is over, the object
code is converted into executable code (xyz.exe). Finally, the tool loader loads this exe file into main memory
for execution. This overall process is shown by the following flow chart:



[Flow chart: source code (HLL) is entered or loaded from a storage device and edited (editor, xyz.c); the debugger checks for syntax errors — if any, the source program is re-edited; otherwise the compiler creates object code (xyz.obj); the linker checks the libraries — if there are linking errors, the source is re-edited; otherwise executable code (xyz.exe) is created, the loader loads it into main memory, and the program executes.]

Fig. 1.2 Life cycle of a computer program

Java, CC, and cc compilers do not have their own editors. The Java compiler compiles programs written using an
editor such as Notepad, and code written using the ed or vi editors can be compiled by CC (the C++ compiler in Unix)
and cc (the C compiler in Unix).



Processes during running a C/C++ Program

[Diagram: a source program such as xyz.c (e.g., a C++ program that reads the radius of a circle) is translated by the compiler into binary object code xyz.obj; the linker/loader combines it with library code into the executable xyz.exe; the computer then executes this machine language program, consuming input and producing output until it terminates.]

Fig. 1.3 Various stages in execution of a computer program


As a brief representation, the function of the compiler can be expressed by the following block diagram:

[Input → source program → Translator → object program → Linker, Loader, and Run-time System → Output]

Fig. 1.4 Input to the linker, loader and runtime system


The translation process is guided by the structure of the analyzed text. The text is decomposed into its components
according to the given syntax. For the most elementary components, their semantics is recognized, and the
meaning (semantics) of the composite parts is the result of the semantics of their components. Naturally, the
meaning of the source text is preserved by the translation. The translation process essentially consists of the
following parts:
1. The sequence of characters of a source text is translated into a corresponding sequence of symbols of the
vocabulary of the language. For instance, identifiers consisting of letters and digits, numbers consisting of
digits, delimiters and operators consisting of special characters are recognized in this phase, which is
called lexical analysis.



2. The sequence of symbols is transformed into a representation that directly mirrors the syntactic structure
of the source text and lets this structure easily be recognized. This phase is called syntax analysis or
parsing.
3. High-level languages are characterized by the fact that objects of programs, for example variables and
functions, are classified according to their type. Therefore, in addition to syntactic rules, compatibility
rules among types of operators and operands define the language. Hence, verification of whether these
compatibility rules are observed by a program is an additional duty of a compiler. This verification is
called type checking.
4. On the basis of the representation resulting from step 2, a sequence of instructions taken from the
instruction set of the target computer is generated. This phase is called code generation. In general it is
the most involved part, not least because the instruction sets of many computers lack the desirable
regularity.
A partitioning of the compilation process into as many parts as possible was the predominant technique until
about 1980, because until then the available store was too small to accommodate the entire compiler. Only
individual compiler parts would fit, and they could be loaded one after the other in sequence. The parts were
called passes, and the whole was called a multipass compiler. The number of passes was typically 4 - 6, but
reached 70 in a particular case. Typically, the output of pass k served as input of pass k + 1, and the disk served as
intermediate storage. The very frequent access to disk storage resulted in long compilation times.
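The pass-on-disk organization can be sketched as a toy pipeline in Python (the three transformations are stand-ins for real compiler phases, and the file layout is illustrative):

```python
# Toy multipass pipeline: each pass reads the previous pass's output from
# disk and writes its own, mimicking pre-1980 compilers whose passes could
# not all fit in memory together. File names here are illustrative.
import os
import tempfile

def run_pass(in_path, out_path, transform):
    """One 'pass': read intermediate file, transform, write next file."""
    with open(in_path) as f:
        data = f.read()
    with open(out_path, "w") as f:
        f.write(transform(data))

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "pass0.txt")
with open(path, "w") as f:
    f.write("a := 1 + 2")

passes = [
    lambda s: s.replace(":=", "ASSIGN"),   # stand-in for lexical analysis
    lambda s: "(" + s + ")",               # stand-in for syntax analysis
    lambda s: s.upper(),                   # stand-in for code generation
]
for k, p in enumerate(passes, 1):
    nxt = os.path.join(workdir, f"pass{k}.txt")
    run_pass(path, nxt, p)
    path = nxt

print(open(path).read())  # (A ASSIGN 1 + 2)
```

Every pass boundary costs a full disk write and read, which is exactly the overhead the single-pass organization described below removes.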

[Lexical Analysis → Syntax Analysis → Code Generation]

Fig. 1.5 Multipass Compilation


Modern computers with their apparently unlimited stores make it feasible to avoid intermediate storage on disk.
And with it, the complicated process of serializing a data structure for output, and its reconstruction on input can
be discarded as well. With single-pass compilers, increases in speed by factors of several thousands are therefore
possible. Instead of being tackled one after another in strictly sequential fashion, the various parts are interleaved.
For example, code generation is not delayed until all preparatory tasks are completed, but it starts already after the
recognition of the first sentential structure of the source text. A wise compromise exists in the form of a compiler
with two parts, namely a front end and a back end (discussed later in this chapter).
The first part comprises lexical and syntax analyses and type checking, and it generates a tree representing the
syntactic structure of the source text. This tree is held in main store and constitutes the interface to the second part
which handles code generation. The main advantage of this solution lies in the independence of the front end of
the target computer and its instruction set. This advantage is enormous if compilers for the same language and for
various computers must be constructed, because the same front end serves them all.
1.2.2 Functions of compiler
The main function of a compiler is to translate the (syntax-error-free) source program into machine (hardware)
understandable code. Compilation is a large process, so it is often broken into stages, and the theories of computer
science guide us in writing programs at each stage. We must understand what a program “means” if we are to
translate it correctly. Many phases of the compiler optimize by translating one form into a better (more
efficient) form. Much of compiling is about pattern matching, so languages and tools that support pattern
matching are very useful.
An efficient compiler must preserve the semantics of the source program and should create an efficient version of it in
the target language. In the beginning there was machine language coding, which produced hard-to-read code and time-
consuming debugging; then textual assembly languages came into use and are still used on DSPs. With the
introduction of high-level languages like FORTRAN, Pascal, C, and C++, machine structures became too complex
and software management too difficult to continue with low-level languages.



1.2.3 Compiler vs. Interpreter
A compiler takes the entire program and translates it into machine language code, while an interpreter takes one line
of source code at a time and executes it if there is no syntax error in that line. When the execution of the current line
of code is complete, the interpreter moves to the next line; this way a complete program is executed line by
line. Most compilers use a code optimization phase (multi-pass compilers), which results in faster execution
of the program, but interpreters have no code optimization option. The construction cost of
interpreters is much less than that of compilers, but compilers are much more effective than interpreters. In the early
days of computing, interpreters were common because of slow hardware and scarce memory. In the present
scenario, a faster processor and a large amount of memory are available at very low cost. Therefore, for most
high-level language translation, we prefer compilers to interpreters.
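The contrast can be sketched in a few lines of Python. The mini-language below (lines of the form NAME = NUMBER) and its one-line syntax check are purely illustrative:

```python
# An interpreter executes one line at a time, stopping at the first bad line;
# a compiler checks the whole program before any of it runs.
def interpret(lines):
    env = {}
    for lineno, line in enumerate(lines, 1):
        name, _, value = line.partition("=")
        if not value.strip().isdigit():
            raise SyntaxError(f"line {lineno}: {line!r}")
        env[name.strip()] = int(value)     # executed immediately
    return env

def compile_all(lines):
    # Whole-program check first: no line executes until every line passes.
    for lineno, line in enumerate(lines, 1):
        name, _, value = line.partition("=")
        if not value.strip().isdigit():
            raise SyntaxError(f"line {lineno}: {line!r}")
    # The "target program" here is just a callable that runs the whole thing.
    return lambda: {l.partition("=")[0].strip(): int(l.partition("=")[2])
                    for l in lines}

prog = ["x = 1", "y = 2"]
print(interpret(prog))        # {'x': 1, 'y': 2}
print(compile_all(prog)())    # {'x': 1, 'y': 2}
```

Note that the interpreter may execute several good lines before hitting an error, whereas the compiled form rejects the program before anything runs.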
1.3 Phases of compiler
As described earlier, a conventional compiler includes a lexical analyzer that identifies successive “tokens” in the
source program, and a parser (syntactical analyzer) that builds a parse tree from them in accordance with the grammar
of the language, checking syntax but not meaning. This section examines each phase of the compiler in turn.
Classification of Compiler Phases
Compiler phases fall into two major parts: analysis and synthesis. In the analysis part, an intermediate
representation is created from the given source program. This part contains:
➢ Lexical Analyzer,
➢ Syntax Analyzer and
➢ Semantic Analyzer
In the synthesis part, the equivalent target program is created from this intermediate representation. This part contains:
➢ Intermediate Code Generator,
➢ Code Optimizer, and
➢ Code Generator
The compiler has a number of phases plus symbol table manager and an error handler. The schematic diagram of
phases of compiler is shown in figure 1.6.
Symbol Table Manager: An essential function of a compiler is to record the identifiers used in the source
program and collect information about various attributes of each identifier. A symbol table is a data structure
containing a record for each identifier, with fields for the attributes of the identifier. The symbol table allows us to
find the record for each identifier quickly and to store or retrieve data from that record quickly. When an identifier
in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table.
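A minimal symbol table along these lines might look as follows (the attribute fields shown, type and line, are illustrative; real compilers store many more):

```python
# Minimal symbol table: one record per identifier, with attribute fields.
class SymbolTable:
    def __init__(self):
        self._records = {}                  # identifier -> attribute record

    def enter(self, name, **attrs):
        # Called when the lexical analyzer first detects an identifier;
        # later phases add attributes to the same record.
        self._records.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        """Return the record for an identifier, or None if absent."""
        return self._records.get(name)

table = SymbolTable()
table.enter("radius", type="float", line=5)
print(table.lookup("radius"))   # {'type': 'float', 'line': 5}
```

Hash-based storage gives the fast store/retrieve behavior the text calls for; production compilers additionally handle nested scopes.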
Error handler: If the source program is not written as per the syntax of the language then syntax errors are
detected by the tool debugger associated with the compiler. Each phase of the compiler can encounter errors. A
compiler that stops when it finds the first error is not as helpful as it could be. The syntax and semantic analysis
phases usually handle a large fraction of the errors detectable by the compiler. The lexical phase can detect errors
where the characters remaining in the input do not form any token of the language. Errors when the token stream
violates the syntax of the language are determined by the syntax analysis phase. During semantic analysis the
compiler tries to detect constructs that have the right syntactic structure but no meaning to the operation involved.



[Input Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program; the Symbol Table Manager and the Error Handler interact with every phase.]

Fig. 1.6 Phases of Compiler


The first three compiler phases (i.e., the lexical analyzer, syntax analyzer, and semantic analyzer) are called the
front-end phases of the compiler because the programmer interacts only with these phases while programming in
some high-level language; the remaining three phases are called the back-end phases of the compiler.

Lexical analyzer
Lexical analyzer takes the source program as an input and produces a long string of tokens. Lexical Analyzer
reads the source program character by character and returns the tokens of the source program. The process of
generation and returning the tokens is called lexical analysis. A token describes a pattern of characters having
same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on).
Tokens are the terminal symbols of the grammar. Modern lexical generators handle problems such as white space,
comments, and reserved-word identification; the modern lexical analyzer removes such non-grammatical
elements (spaces, comments) from the stream. A lexical analyzer is implemented with a Finite State Automaton
(FSA) that contains a finite set of states and transition functions to move between states. Let us
consider a high-level language assignment statement:
newval := old_val + 12



The tokens identified by the lexical analyzer are:
newval     identifier
:=         assignment operator
old_val    identifier
+          add operator
12         a number
Regular expressions are used to describe tokens (lexical constructs). A Finite State Automaton or simply a finite
automaton (deterministic) can be used in the implementation of a lexical analyzer.
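A regular-expression-based lexical analyzer for the statement above can be sketched as follows (the token names and patterns are illustrative, and characters matching no pattern are silently skipped in this sketch):

```python
import re

# Token patterns as regular expressions (names and patterns are illustrative).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ASSIGN", r":="),
    ("PLUS",   r"\+"),
    ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SKIP",   r"\s+"),                      # whitespace: recognized, dropped
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Return (token-kind, lexeme) pairs, dropping whitespace."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)
            if m.lastgroup != "SKIP"]

print(tokenize("newval := old_val + 12"))
# [('ID', 'newval'), ('ASSIGN', ':='), ('ID', 'old_val'), ('PLUS', '+'), ('NUMBER', '12')]
```

Each named alternative in the master pattern plays the role of one accepting state of the underlying deterministic automaton.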
Tokens, Patterns and Lexemes
Many strings from the input may produce the same token, e.g. identifiers, integer constants, floats. A pattern
describes a rule that determines which strings are assigned to a token. A lexeme is the exact sequence of input
characters matched by a pattern. For example,

Table 1.2 The lexemes, patterns and tokens

Lexeme    Pattern             Token
x         (alpha)(alpha)*     Id “x”
abc       (alpha)(alpha)*     Id “abc”
152       (digit)+            Constant(152)
then      then                Keyword then

Many lexemes map to the same token, e.g. “x” and “abc”. Note that some lexemes might match many patterns;
it is mandatory to resolve such ambiguity. Since tokens are terminals, they must be “produced” by the lexical
phase with synthesized attributes in place.
The output of a phase is the input to the next phase. For example, the output of lexical analyzer is the input to
syntax analyzer, the output of syntax analyzer is the input to semantic analyzer, and so on. Each phase transforms
the source program from one representation into another representation. They communicate with error handlers
and the symbol table.
The phases of a compiler are collected into front end and back end. The front end includes all analysis phases and
the intermediate code generator. The back end includes the code optimization phase and final code generation
phase. The front end analyzes the source program and produces intermediate code while the back end synthesizes
the target program from the intermediate code.
Syntax Analyzer
A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given program. In other words, a
syntax analyzer takes the output of the lexical analyzer (a list of tokens) and produces a parse tree. A syntax analyzer is
also called a parser. A parse tree describes the syntactic structure of the program. The syntax is defined as the
physical layout of the source program. Grammars describe precisely the syntax of a language. Two kinds of
grammars which compiler writers use a lot are regular and context-free grammars.

Following is a set of productions of a context free grammar


Sentence ::= Subject Verb Object
Subject ::= Proper-noun
Object ::= Article Adjective Noun
Verb ::= ate | saw | called
Noun ::= cat | ball | dish
Article ::= the | a
Adjective ::= big | bad | pretty
Proper-noun ::= tim | mary
where,
Start Symbol = Sentence

Example sentence: tim ate the big ball



The sentence “tim ate the big ball” is derived by the above grammar in the following manner:

Sentence ⇒ Subject Verb Object
         ⇒ Proper-noun Verb Object
         ⇒ Proper-noun Verb Article Adjective Noun
         ⇒ tim Verb Article Adjective Noun
         ⇒ tim ate Article Adjective Noun
         ⇒ tim ate the Adjective Noun
         ⇒ tim ate the big Noun
         ⇒ tim ate the big ball
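The derivation can also be checked mechanically. Below is a minimal recognizer for this toy grammar, with one set per grammar symbol (a sketch, not a general parser — it relies on every Sentence having the fixed five-word shape the grammar dictates):

```python
# Recognizer for the toy English grammar above. Each terminal category
# becomes a set; Sentence ::= Subject Verb Object fixes the word shape:
# Subject is a Proper-noun, Object is Article Adjective Noun.
VERBS = {"ate", "saw", "called"}
NOUNS = {"cat", "ball", "dish"}
ARTICLES = {"the", "a"}
ADJECTIVES = {"big", "bad", "pretty"}
PROPER_NOUNS = {"tim", "mary"}

def parse_sentence(words):
    """Return True iff the word list is derivable from Sentence."""
    if len(words) != 5:
        return False
    subject, verb, article, adjective, noun = words
    return (subject in PROPER_NOUNS and verb in VERBS and
            article in ARTICLES and adjective in ADJECTIVES and
            noun in NOUNS)

print(parse_sentence("tim ate the big ball".split()))   # True
print(parse_sentence("tim ate ball".split()))           # False
```

For a recursive grammar a fixed word count would not suffice; that is where a real recursive-descent parser is needed.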
When a parse tree deals with the syntax of a programming language, all terminals are at leaves. The root of the parse
tree and all intermediate nodes are non-terminals.

[Two parse trees: for newval := old_val + 20, the root assign_stmt has children identifier (newval), :=, and expression, where the expression expands to expression + expression over identifier (old_value) and number (20). For tim ate the big ball, the root sentence has children Subject (Proper-noun: tim), Verb (ate), and Object (Article: the, Adjective: big, Noun: ball). In a parse tree, all terminal nodes are leaves; all intermediate nodes and the root are non-terminals of the CFG.]

Fig. 1.7 Derivation tree for newval := old_val + 20 and tim ate the big ball
The syntax of a language is specified by a context free grammar (CFG). The rules in a CFG are mostly
recursive. A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not. If it
satisfies, the syntax analyzer creates a parse tree for the given program. For example, we use BNF (Backus Naur
Form) to specify a CFG
assign_stmt → identifier := expression
expression → identifier
expression → number
expression → expression + expression
A syntax directed translation traverses a syntax tree and builds a translation in the process.
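A recursive-descent parser for a simplified form of this BNF can be sketched in Python. The token format matches the lexical-analyzer output; the tree shape is illustrative; and the left-recursive production expression → expression + expression is replaced by an equivalent iterative loop, since recursive descent cannot handle left recursion directly:

```python
# Recursive-descent parser for: assign_stmt -> ID := operand (+ operand)*,
# producing a nested-tuple parse tree (the tuple shape is illustrative).
def parse_assign(tokens):
    pos = 0
    def expect(kind):
        nonlocal pos
        k, lexeme = tokens[pos]
        assert k == kind, f"expected {kind}, got {k}"
        pos += 1
        return lexeme

    def operand():
        if tokens[pos][0] == "ID":
            return ("ID", expect("ID"))
        return ("NUM", expect("NUMBER"))

    target = expect("ID")
    expect("ASSIGN")
    node = operand()
    while pos < len(tokens) and tokens[pos][0] == "PLUS":
        expect("PLUS")
        node = ("+", node, operand())      # left-associative chaining
    return ("assign", target, node)

tokens = [("ID", "newval"), ("ASSIGN", ":="), ("ID", "old_val"),
          ("PLUS", "+"), ("NUMBER", "12")]
print(parse_assign(tokens))
# ('assign', 'newval', ('+', ('ID', 'old_val'), ('NUM', '12')))
```

Each nonterminal of the grammar corresponds to one function, which is the defining trait of recursive descent.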



Syntax Analyzer versus Lexical Analyzer
Which constructs of a program should be recognized by the lexical analyzer, and which ones by the syntax
analyzer? Both of them do similar things, but the lexical analyzer deals with simple, non-recursive constructs of
the language, while the syntax analyzer deals with recursive constructs. The lexical analyzer simplifies
the job of the syntax analyzer: it recognizes the smallest meaningful units (tokens) in a source
program, and the syntax analyzer works on those tokens to recognize meaningful structures in our
programming language.
Semantic Analyzer
How do we know what to translate into the syntax tree, and how do we know whether it is correct? Basically this
is validated by the semantic analyzer. The semantic analyzer takes the output of the syntax analyzer and produces another
parse tree. Semantic analysis is very useful in writing compilers since it gives a reference when trying to decide
what the compiler should do in particular cases. A semantic analyzer checks the source program for semantic
errors and collects the type information for the code generation.
Semantic analyzer checks a source program for semantic consistency with the language definition. It also gathers
type information for use in intermediate-code generation. The major components of semantic analysis are:
❖ Creation of a symbol table
❖ Type checking
❖ Type coercions
The typical applications of the semantic analyzer are the analyses pertaining to definitions and uses of variables:
❖ Variable used without being declared
❖ Variable declared multiple times
❖ Variable declared but not used
Analysis allows detection of errors and the generation of warnings. These analyses involve the creation and
consultation of symbol tables.

Type-checking is an important part of semantic analyzer. Normally semantic information cannot be represented
by a context-free language used in syntax analyzers. Context-free grammars used in the syntax analysis are
integrated with attributes (semantic rules):
❖ the result is a syntax-directed translation,
❖ Attribute grammars
For example,
newvalue := old_value + 20
The type of the identifier newvalue must match the type of the expression old_value + 20.
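Such a check can be sketched over the parse tree. The symbol-table types and the nested-tuple tree shape below are illustrative assumptions:

```python
# Type-checking sketch for an assignment node: the type of the target must
# match the type of the expression (types here are illustrative).
SYMBOL_TYPES = {"newvalue": "int", "old_value": "int"}

def type_of(node):
    """Bottom-up type computation over a nested-tuple expression tree."""
    kind = node[0]
    if kind == "NUM":
        return "int"
    if kind == "ID":
        return SYMBOL_TYPES[node[1]]        # attribute lookup in symbol table
    if kind == "+":
        left, right = type_of(node[1]), type_of(node[2])
        assert left == right, f"type mismatch: {left} + {right}"
        return left
    raise ValueError(f"unknown node kind: {kind}")

def check_assign(node):
    _, target, expr = node
    assert SYMBOL_TYPES[target] == type_of(expr), "assignment type mismatch"
    return True

tree = ("assign", "newvalue", ("+", ("ID", "old_value"), ("NUM", "20")))
print(check_assign(tree))   # True
```

A fuller checker would also perform the type coercions mentioned above instead of rejecting every mismatch outright.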
Intermediate Code Generator
A compiler may produce explicit intermediate code representing the source program. The intermediate code
generator takes as input the tree produced by the semantic analyzer and produces intermediate code (in an
assembly-like form). The level of intermediate code is close to the level of machine code. For example,



newvalue := old_value * fact + 1

[Parse tree: := at the root, with newvalue on the left and the expression old_value * fact + 1 on the right; renaming the identifiers gives the abstract syntax tree for id1 := id2 * id3 + 1, from which the intermediate code (quadruples) is generated:]

MULT id2, id3, temp1
ADD temp1, #1, temp2
MOV temp2, , id1

Fig. 1.8 Intermediate code generation
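The flattening of the abstract syntax tree into quadruples can be sketched as follows (temporary names are generated on the fly; the nested-tuple tree shape is an illustrative assumption):

```python
# Flatten an expression tree into quadruples (op, arg1, arg2, result),
# ending with a MOV of the final value into the assignment target.
def gen_assign(target, tree):
    quads, counter = [], [0]

    def walk(node):
        kind = node[0]
        if kind in ("ID", "NUM"):
            return node[1]                   # leaves name themselves
        left, right = walk(node[1]), walk(node[2])
        counter[0] += 1
        dest = f"temp{counter[0]}"           # fresh temporary
        quads.append(({"*": "MULT", "+": "ADD"}[kind], left, right, dest))
        return dest

    quads.append(("MOV", walk(tree), "", target))
    return quads

tree = ("+", ("*", ("ID", "id2"), ("ID", "id3")), ("NUM", "#1"))
for q in gen_assign("id1", tree):
    print(q)
# ('MULT', 'id2', 'id3', 'temp1')
# ('ADD', 'temp1', '#1', 'temp2')
# ('MOV', 'temp2', '', 'id1')
```

The post-order traversal guarantees that an operation's operands are computed before the operation itself, matching the quadruple order in Fig. 1.8.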


Code Optimizer
The code optimizer takes the code produced by the intermediate code generator and reduces it (if the code is
not already optimal) without changing its meaning. The optimization of code is
in terms of time and space. For example,
MULT id2, id3, temp1
ADD temp1, #1, id1
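This reduction is a classic peephole optimization: an operation that writes a temporary which is immediately moved to a variable can write the variable directly. A sketch (which, for simplicity, does not perform a full liveness check on the temporary):

```python
# Peephole pass over quadruples (op, arg1, arg2, result): fuse an operation
# writing a temporary with an immediately following MOV of that temporary.
def peephole(quads):
    out, i = [], 0
    while i < len(quads):
        cur = quads[i]
        nxt = quads[i + 1] if i + 1 < len(quads) else None
        if (nxt and nxt[0] == "MOV" and nxt[1] == cur[3]
                and cur[3].startswith("temp")):
            out.append((cur[0], cur[1], cur[2], nxt[3]))  # op writes var directly
            i += 2                                         # MOV is consumed
        else:
            out.append(cur)
            i += 1
    return out

quads = [("MULT", "id2", "id3", "temp1"),
         ("ADD", "temp1", "#1", "temp2"),
         ("MOV", "temp2", "", "id1")]
for q in peephole(quads):
    print(q)
# ('MULT', 'id2', 'id3', 'temp1')
# ('ADD', 'temp1', '#1', 'id1')
```

A production optimizer would first confirm the temporary is dead after the MOV; here that holds because the MOV is its only later use.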
Code Generator
It produces the target language for a specific architecture. The target program is normally an object file containing
the machine code. Memory locations are selected for each of the variables used by the program. Then the
intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A
crucial aspect is the assignment of variables to registers. For example,
MOVE id2, R1
MULT id3, R1
ADD #1, R1
MOVE R1, id1
We are assuming an architecture with instructions in which at least one operand is a machine
register.
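This translation scheme can be sketched as follows (a single register R1 and the quadruple format from the previous section are illustrative assumptions):

```python
# Sketch code generator: each quadruple becomes a load / operate / store
# sequence on one register, in the two-address style shown above.
def codegen(quads):
    instrs = []
    for op, a1, a2, dest in quads:
        if op == "MOV":
            instrs.append(f"MOVE {a1}, {dest}")
            continue
        if not a1.startswith("temp"):        # load the first operand unless
            instrs.append(f"MOVE {a1}, R1")  # it is already held in R1
        instrs.append(f"{op} {a2}, R1")
        if not dest.startswith("temp"):      # store only named variables;
            instrs.append(f"MOVE R1, {dest}")  # temporaries stay in R1
    return instrs

quads = [("MULT", "id2", "id3", "temp1"),
         ("ADD", "temp1", "#1", "id1")]
for i in codegen(quads):
    print(i)
# MOVE id2, R1
# MULT id3, R1
# ADD #1, R1
# MOVE R1, id1
```

With only one register this works because each temporary is consumed by the very next quadruple; real register allocation handles the general case.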
1.4 CLASSIFICATION OF COMPILERS
Compilers are sometimes classified as single-pass, multi-pass, load-and-go, debugging, or optimizing, depending
on how they have been constructed or on what function they are supposed to perform. Despite this apparent
complexity, the basic tasks that any compiler must perform are essentially the same.
The Front End of Compiler

[Diagram: Source Code → Language Preprocessor (processing of #include, #define, #ifdef, etc.; trivial errors reported) → Preprocessed Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis (syntax errors reported) → Abstract Syntax Tree]

Fig. 1.9 The front-end of the compiler


1.4.1 Simple one-pass compiler
As noted in Section 1.1, modern computers with their apparently unlimited stores make it feasible to avoid
intermediate storage on disk, and with it the complicated serialization of data structures between passes; with
single-pass compilers, increases in speed by factors of several thousands are therefore possible, because the
various parts (tasks) are interleaved rather than tackled one after another in strictly sequential fashion.
A wise compromise exists in the form of a compiler with two parts, namely a front end and a back end. The first
part comprises lexical and syntax analysis and type checking, and it generates a tree representing the syntactic
structure of the source text. This tree is held in main store and constitutes the interface to the second part, which
handles code generation. The main advantage of this lies in the independence of the front end of the target
computer and its instruction set. This advantage is enormous if compilers for the same language and for various
computers must be constructed, because the same front end serves them all.

[Diagram: a Program (declarations and statements) enters the Front end, which produces the Symbol Table and the Syntax Tree; the Back end consumes both and produces Code.]

Fig. 1.10 Compiler consisting of front-end and back-end



We point out one significant advantage of this structure: the partitioning of a compiler into a target-independent
front end and a target-dependent back end. In the figure above, we focus on the interface between the two parts,
namely the structure of the syntax tree. Furthermore, this gives rise to the understandable desire to declare the
elements of the symbol table (objects) in such a fashion that it is possible to refer to them from the symbol table
itself as well as from the syntax tree. As a basic type we introduce the type Object, which may assume different
forms as appropriate to represent constants, variables, types, and procedures. Only the attribute type is common to
all.
1.4.2 Cross Compiler
High-level languages are also employed for the programming of microcontrollers used in embedded applications.
Such systems are primarily used for data acquisition and automatic control of machinery. In these cases, the store
is typically small and is insufficient to carry a compiler. Instead, software is generated with the aid of other
computers capable of compiling. A compiler which generates code for a computer different from the one
executing the compiler is called a cross compiler. The generated code is then transferred - downloaded - via a data
transmission line.

