

Tikrit University
College of Computer Sciences and Mathematics
Computer Science Department
3rd Class

Compiler techniques
Phases of compiler – part 1
Lexical analyzer phase
Assistant Prof. Dr. Qasim Mohammed Hussein

Reference:
Compilers: Principles, Techniques & Tools,
by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman.



Compiler techniques
Compiler definition
The compiler is a program that takes as input a program written in a high-level language (the source language) and translates it into an equivalent program in another language (the target language). The target language may be another programming language or the machine language of a computer. Figure 1 shows the compiler program.

Figure 1: A compiler

To execute a program, a two-step process is needed:

1. The source program must first be compiled into an object program.

2. The object program is then loaded into memory and executed.

An interpreter is a program that translates a programming language into an intermediate code that can be executed directly. An interpreter is smaller than a compiler.

An assembler is used when the source program is written in assembly language and the target language is machine code.

Why do we need a translator?

Machine language is a sequence of 1's and 0's that communicates directly with a computer. Programming in it is complex and tedious, with many opportunities for mistakes, and all operations and operands must be in numeric code. It is therefore difficult to write programs in machine language.



What are the differences between a compiler and an interpreter?

Compiler:
1. The compiler translates the entire program in one go and then executes it.
2. The compiler generates the error report after the translation of the entire program.
3. The overall execution time of the code is faster with a compiler than with an interpreter.
4. Its design is more complex than that of an interpreter.

Interpreter:
1. The interpreter takes one statement, translates and executes it, and then takes the next statement.
2. The interpreter stops the translation after it gets the first error.
3. It is slower than a compiler.
4. Its design is less complex.

A compiler that runs on one machine and produces target code for another machine is known as a cross compiler.

Assembler
An assembly language is a low-level programming language for a computer. It is the symbolic form of machine language, which uses mnemonics to represent each low-level machine operation or opcode (names are used instead of binary codes for operations and memory addresses), such as:
MOV a1, R1
MOV #2, R1
MOV R1, b
Each type of processor has its own unique assembly language. Assembly language programs are translated into machine language by a program called the assembler.

Some compilers produce assembly code that is passed to an assembler for further processing. Other compilers perform the job of the assembler themselves, producing relocatable machine code that is passed directly to the loader / link-editor.



An assembler is used to translate assembly language statements into the target computer's machine code.

Relocatable machine code: code that can be loaded at any location L in memory.

Linker: It is the program that takes as input one or more object programs that are compiled separately and links them together into a single executable program.

Loader: It is a program that is responsible for loading object programs from secondary storage, placing them into memory, and preparing them for execution. The process of loading consists of taking relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations. Other types of loaders are absolute loaders and direct-linking loaders.

In some systems, the loader and linker may be a single system program. The loader is responsible for loading and relocation; the linker is responsible for linking.

Structure of compiler
The compiler is constructed as a series of phases, as shown in figure (2).
1. Lexical analyzer: It separates the characters of the source language into groups of characters that logically belong together. These groups are called tokens, such as: do, if, <, =, +, 4, 234.
2. Syntax analyzer: It groups tokens together into syntactic structures. For example: A + B.
3. Semantic analyzer: It determines the meaning, if any, of a syntactically well-formed sentence. It checks the source program for semantic errors and checks that each operator has operands permitted by the source language specification.



Figure (2): Phases of compiler

4. Intermediate code generation: It uses the structure produced by the syntax analyzer to create simple instructions.
5. Code optimization: It is used to improve the intermediate code so that the object program runs faster and takes less memory.
6. Code generation: It produces the object code by deciding on the memory locations for data, selecting code to access each datum, and selecting registers.

Symbol table: The table-management portion of the compiler keeps track of the names used in the program and records essential information about each. The data structure that contains a record for each identifier is called the symbol table.

Error handler: It is invoked whenever any fault occurs in the compilation process of the source program. It warns the programmer when a mistake is detected.

More than one phase can be combined into a single pass, such as lexical and syntax analysis. Both the symbol table and the error handler are associated with all phases of the compiler.

It is desirable to have relatively few passes, since it takes time to read and write intermediate files. On the other hand, if we group several phases into one pass, we may be forced to keep the entire program in memory, because one phase may need information produced by another phase.

For example, suppose a source program contains the assignment statement

position = initial + rate * 60

The translation of this assignment statement is shown in figure (3).

Why are the lexical analyzer and the parser separated?

The lexical analyzer is separated from the parser to obtain:
1. Simpler design. It allows us to simplify one or the other of these phases.
2. Improved compiler efficiency.
3. Enhanced compiler portability.
4. Each phase can encounter errors. The syntax and semantic analysis phases handle a large fraction of the error detection done by the compiler.



Figure 3: Translation of an assignment statement



Lexical analyzer
The lexical analyzer is a program that reads the characters of the source program and produces as output a sequence of tokens that the parser uses for syntax analysis, as shown in figure 4.

Figure 4: Interaction of lexical analyzer with parser

Types of tokens
Tokens correspond to sets of strings.
1) Identifier: strings of letters or digits, starting with a letter
2) Integer: a non-empty string of digits
3) Keyword: “else” or “if” or “begin” or …
4) Whitespace: a non-empty sequence of blanks, newlines, and tabs

Lexical functions
The lexical analyzer may perform, besides token generation, the following secondary tasks (functions):
1. Stripping out comments and newline characters from the source program.
2. Correlating error messages from the compiler with the source program.
3. Keeping track of line numbers.
4. Skipping redundant spaces and tabs.
5. Producing the output listing.
6. Implementing macro processor functions.
7. Sometimes the lexical analyzer can recover from some errors by:
a) Deleting an extraneous character.
b) Inserting a missing character.
c) Replacing an incorrect character with a correct character.
d) Transposing two adjacent characters.
These error transformations may be tried to repair the input. There are other strategies, but they are more complex.

A simple way to build a lexical analyzer is to construct a diagram that illustrates the structure of the tokens of the source language and then translate the diagram into a program for finding tokens, as shown in figures 6, 7, 8, and 9.

Tokens, patterns and lexemes

When talking about lexical analysis, we use the terms "token", "pattern", and "lexeme".
Token: a sequence of characters that can be treated as a single logical entity, such as identifiers and keywords.
Pattern: a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule, called a pattern, associated with the token.
Lexeme: a sequence of characters in the source program that is matched by the pattern for a token.

Example of tokens

Token        Informal description                        Sample lexemes
if           characters i, f                             if
else         characters e, l, s, e                       else
comparison   one of  <  <=  =  <>  !=  >=  >             <, <=
id           letter followed by letters and/or digits    position, rate
number       any numeric constant                        3.14159, 0, 6.02e23
literal      anything but ", surrounded by "'s           "core dumped"

A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.



The lexical analyzer collects information about tokens into their associated attributes. In practice, a token usually has only a single attribute: a pointer to the symbol-table entry in which the information about the token is kept, such as the lexeme, the line number on which it was first seen, etc.

Example: Consider the instruction position = initial + rate * 60

1. "position" is a lexeme mapped into the token (id, 1), where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position.
2. "=" is a lexeme that is mapped into the token (=). Since this token needs no attribute value, we have omitted the second component.
3. "initial" is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table entry for initial.
4. "+" is a lexeme that is mapped into the token (+).
5. "rate" is a lexeme mapped into the token (id, 3), where 3 points to the symbol-table entry for rate.
6. "*" is a lexeme that is mapped into the token (*).
7. "60" is a lexeme that is mapped into the token (60).
Blanks separating the lexemes are discarded by the lexical analyzer.
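As a small illustration of how such a token stream could be produced, the following Python sketch scans the statement with regular expressions and a list-based symbol table. The token names, the TOKEN_SPEC table, and the helper tokenize are assumptions made here for demonstration, not the textbook's implementation.

import re

# Illustrative token specification: numbers, identifiers, operators, whitespace.
TOKEN_SPEC = [
    ("number", r"\d+(?:\.\d+)?"),           # numeric constants such as 60
    ("id",     r"[A-Za-z_][A-Za-z0-9_]*"),  # identifiers such as position, rate
    ("op",     r"[=+\-*/]"),                # each operator is its own token
    ("ws",     r"[ \t]+"),                  # blanks are discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    symbol_table = []                       # the entry index plays the role of the attribute pointer
    tokens = []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                        # blanks separating lexemes are discarded
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme) + 1))
        elif kind == "number":
            tokens.append(("number", lexeme))
        else:
            tokens.append((lexeme,))        # operator tokens need no attribute value
    return tokens, symbol_table

print(tokenize("position = initial + rate * 60"))
# ([('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('number', '60')],
#  ['position', 'initial', 'rate'])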

Input buffer
The lexical analyzer (LA) reads the source program character by character to find the tokens. The LA may need to look ahead several characters beyond the next token before the next token itself can be determined, and it spends a considerable amount of time in the lexical analysis phase. To reduce the overhead required, it is desirable for the lexical analyzer to read its input from an input buffer. The size of the buffer may be 1024 or 4096 bytes. There are many schemes that can be used to buffer input. The input buffer can be divided into two halves, as shown in figure 5, i.e. a two-buffer scheme in which each half has length N and the two halves are reloaded alternately.

Figure 5: Input buffer in two halves
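The following Python sketch illustrates the two-halves buffering idea under simplifying assumptions (forward-only scanning, an empty string used as the end-of-input sentinel); the class and method names are invented for illustration.

N = 4096                         # size of one half, e.g. one disk block

class TwoHalfBuffer:
    """Sketch of a buffer of 2*N characters reloaded one half at a time."""
    def __init__(self, stream):
        self.stream = stream
        self.buf = [""] * (2 * N)
        self.forward = 0
        self._load(0)            # fill the first half before scanning starts

    def _load(self, start):
        data = self.stream.read(N)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        if len(data) < N:                    # end of input reached:
            self.buf[start + len(data)] = "" # empty string acts as the sentinel

    def next_char(self):
        if self.forward == N:        # entering the second half: reload it
            self._load(N)
        elif self.forward == 2 * N:  # wrapped past the second half: reload the first
            self._load(0)
            self.forward = 0
        ch = self.buf[self.forward]
        self.forward += 1
        return ch                    # the caller stops when it sees ""

This sketch only moves forward; a real scanner additionally keeps a lexeme-begin pointer and retracts the forward pointer by a few positions when it reads past the end of a lexeme.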



Design and implementation of a lexical analyzer
To design and implement an LA, we need:
1. A method for describing the possible tokens. The best method is using regular expressions:
Letter → A | B | C | ... | Z | a | b | ... | z
Digit → 0 | 1 | 2 | 3 | ... | 9
Sign → + | – | ε
Relational operator → < | = | > | ≤ | <> | ≥
Identifier → letter (letter | digit)*
Constant → digit+
Sign-number → sign integer

2. A mechanism to recognize these tokens in the input string. We use either a transition diagram or a finite automaton.

String and language

Definitions
Alphabet: any finite set of symbols such as letters, digits, and punctuation, denoted by ∑. For example, the set {0, 1} is the binary alphabet.
String (word): a finite sequence of symbols drawn from the alphabet. The empty string, denoted by ε, is the string of length zero.
Language (L): a set of strings over a given alphabet. Languages vary widely in appearance, applicability, and complexity.
In language theory, the terms "sentence" and "word" are often used as synonyms for "string".

• If x and y are strings, then the concatenation of x and y, denoted xy, is also a string. For example, if x = dog and y = house, then xy = doghouse.
• The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.
Grammar G(L): a formal system that provides a generative finite description of L.



Language automaton: a finite machine for accepting a set of strings and rejecting all others.
A language can be specified by many different grammars and automata, while a given grammar or automaton specifies only one language.

Operations on languages
There are several operations on languages. For the lexical analyzer, we are interested in union, concatenation, and closure.

Operation                 Definition
Union of L and M          L ∪ M = {s | s is in L or s is in M}
Concatenation of L and M  LM = {st | s is in L and t is in M}
Kleene closure of L       L* = zero or more concatenations of L
Positive closure of L     L+ = one or more concatenations of L

Example: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. New languages can be constructed from L and D using the above operators:
1. L ∪ D is the set of letters and digits - strictly speaking, the language with 62 strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
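The counts in items 1-3 can be checked mechanically. The following Python sketch does so for the finite cases; the helper name concat is an assumption used only for illustration.

import string
from itertools import product

L = set(string.ascii_letters)          # the 52 letters A..Z and a..z
D = set(string.digits)                 # the 10 digits 0..9

def concat(A, B):
    """Concatenation of two languages: { st | s in A and t in B }."""
    return {s + t for s, t in product(A, B)}

print(len(L | D))            # union L U D -> 62 strings of length one
print(len(concat(L, D)))     # LD -> 520 strings of length two
print(52 ** 4)               # |L^4|, the number of all 4-letter strings
# L* and D+ are infinite; only bounded pieces can be enumerated, e.g. the
# strings of D+ of length at most two:
print(len(D | concat(D, D))) # 10 + 100 = 110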

Regular expressions (R.E.)

A regular expression represents a finite or infinite set of strings. It is used to describe all the languages that can be built by applying operators to the symbols of some alphabet. It uses rules that define exactly the set of words that are valid tokens in a formal language. The rules are built up from the following operations:
1. Concatenation: xy.
2. Alternation: x | y, meaning x or y.
3. Repetition: x*, meaning x repeated 0 or more times.
Their priorities are:
1. The unary operator * has the highest precedence and is left associative.
2. Concatenation has the second highest precedence and is left associative.
3. | has the lowest precedence and is left associative.

The standard notation for regular languages is regular expressions.

Larger regular expressions are built from smaller ones. Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's sub-expressions. Here are the rules that define the regular expressions over some alphabet ∑ and the languages that those expressions denote.
There are two basic rules:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}.
Let r and s be regular expressions denoting the languages L(r) and L(s), respectively. Then:
1) (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
2) (r)(s) is a regular expression denoting the language L(r)L(s).
3) (r)* is a regular expression denoting (L(r))*.
4) (r) is a regular expression denoting L(r). This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.

Example:
We may replace the regular expression (a) | ((b)*(c)) by a | b*c.



The identifiers are strings of letters, digits, and underscores. The regular definition for the language of C identifiers is:
• letter → A | B | C | ... | Z | a | b | ... | z | _
• digit → 0 | 1 | 2 | ... | 9
• id → letter ( letter | digit )*

Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition is:
digit → 0 | 1 | 2 | ... | 9
digits → digit digit*
fraction → . digits | ε
exponent → ( E ( + | - | ε ) digits ) | ε
number → digits fraction exponent
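As a sketch, this regular definition can be written as a single Python regular expression; the variable names simply mirror the definition, and the optional groups correspond to the ε alternatives.

import re

digits   = r"[0-9]+"
fraction = rf"(?:\.{digits})?"          # . digits | epsilon
exponent = rf"(?:E[+-]?{digits})?"      # ( E ( + | - | epsilon ) digits ) | epsilon
number   = re.compile(rf"{digits}{fraction}{exponent}$")

for s in ["5280", "0.01234", "6.336E4", "1.89E-4"]:
    print(s, bool(number.match(s)))     # all four examples from the text match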

Transition diagram (TD)

A transition diagram is a flowchart that is used as an intermediate step in the construction of a lexical analyzer. We use the TD to keep track of the characters that are seen as the input buffer positions are advanced. A transition diagram consists of:
1. Nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
2. Edges, directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. The label indicates the input characters that can appear next after the TD has reached that state.
3. One or more accepting, or final, states, which indicate that a lexeme has been found.
4. The start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere.
The transition diagram always begins in the start state before any input symbols have been read.
Example
The transition diagram that recognizes the lexemes matching the token relop
(relational operators) is shown in figure 6.

Figure 6: Transition diagram for relop
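A possible rendering of the relop transition diagram as code is sketched below. The attribute names LT, LE, EQ, NE, GT, GE and the convention of returning the number of characters consumed (instead of an explicit retract call) are assumptions made for illustration.

def relop(text, pos):
    """Try to recognize a relational operator starting at text[pos].
    Returns ((token, attribute), chars_consumed) or None."""
    c   = text[pos]     if pos     < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if c == "<":
        if nxt == "=":
            return ("relop", "LE"), 2
        if nxt == ">":
            return ("relop", "NE"), 2
        return ("relop", "LT"), 1        # other character: retract one position
    if c == "=":
        return ("relop", "EQ"), 1
    if c == ">":
        if nxt == "=":
            return ("relop", "GE"), 2
        return ("relop", "GT"), 1        # other character: retract one position
    return None                          # not a relational operator

print(relop("a<=b", 1))   # (('relop', 'LE'), 2)
print(relop("a<>b", 1))   # (('relop', 'NE'), 2)
print(relop("a<b", 1))    # (('relop', 'LT'), 1)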

Figure 7: A transition diagram for id's and keywords



The procedure for identifying an identifier (states 9, 10, and 11 of figure 7):
case 9: c = getchar( );
        if letter(c) then goto state 10
        else break;
case 10: c = getchar( );
         if letter(c) or digit(c) then goto state 10
         else if delimiter(c) then goto state 11
         else break;
case 11: retract(1); // identifier has been found
         return (id, install( ));

The transition diagram for the token number is shown in figure 8; it has accepting states for integers, floats, and numbers with an exponent part.

Figure 8: A transition diagram for unsigned numbers

Recognition of Reserved Words


Install the reserved words in the symbol table initially. A field of the
symbol-table entry indicates that these strings are never ordinary identifiers,
and tells which token they represent.

To specify the keywords, we can create a separate transition diagram for each keyword; the transition diagram for the reserved word "then" is shown in figure 9.

Figure 9: Transition diagram of the keyword "then"

RE with multiple accepting states

There are two ways to implement this:
1) Implement it as multiple regular expressions, each with its own start and accepting states. Start with the longest one first; if it fails, change the start state to that of a shorter RE and re-scan. See the examples of Fig. 3.15 and 3.16 in the textbook.
2) Implement it as a transition diagram with multiple accepting states. When the transition reaches the first accepting states, just remember them, but keep advancing until a failure occurs. Then back up the input to the position of the last accepting state, as sketched below.
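The second approach can be sketched as follows. The transition table DELTA (which accepts integers and simple floats) and the helper names are assumptions chosen only to illustrate remembering the last accepting state and backing up to it.

# Sketch of approach 2: remember the last accepting state seen and back up to
# it when the scan fails.
ACCEPTING = {"int", "float"}
DELTA = {
    ("start", "digit"): "int",
    ("int",   "digit"): "int",
    ("int",   "."):     "dot",
    ("dot",   "digit"): "float",
    ("float", "digit"): "float",
}

def classify(ch):
    return "digit" if ch.isdigit() else ch

def longest_number(text, pos):
    state, last_accept, i = "start", None, pos
    while i < len(text):
        nxt = DELTA.get((state, classify(text[i])))
        if nxt is None:
            break                       # failure: stop advancing
        state, i = nxt, i + 1
        if state in ACCEPTING:
            last_accept = (state, i)    # remember the last accepting position
    return last_accept                  # back up to the last accepting state

print(longest_number("3.14+x", 0))   # ('float', 4): consumed "3.14"
print(longest_number("42.abc", 0))   # ('int', 2): tried the '.', then backed up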

Finite Automata
A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.
We compile a regular expression into a recognizer by constructing a generalized transition diagram called a finite automaton.
A finite automaton can be deterministic or nondeterministic, where nondeterministic means that more than one transition out of a state may be possible on the same input symbol.
A DFA is a special case of an NFA in which:
1) No state has an ε-transition.
2) For each state s and input symbol a, there is at most one edge labeled a leaving s.

Nondeterministic Finite Automaton (NFA): A nondeterministic finite automaton (NFA) consists of:
• A finite set of states S.
• A finite set of input symbols ∑, the input alphabet.
• A transition function that maps state-symbol pairs to sets of states.
• A state s0 in S that is distinguished as the start state.
• A set of states F as accepting (final) states.

An NFA can be represented diagrammatically by a labeled directed graph, called a transition graph, in which the nodes are the states and the labeled edges represent the transition function.
For example, the NFA for (a|b)*abb has:
the set of states {0, 1, 2, 3},
the input symbols {a, b},
start state 0, and accepting state 3.

Transition function (NFA transition table)

The easiest implementation is a transition table in which there is a row for each state and a column for each input symbol and ε, if necessary.



Example of NFA

Simulation of an NFA

Given an NFA N and an input string x, determine whether N accepts x:

S := ε-closure({s0});
a := nextchar;
while a <> eof do begin
    S := ε-closure(move(S, a));
    a := nextchar;
end;
if S contains an accepting state then return "yes"
else return "no"



Computing ε-closure(T)
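A sketch of ε-closure(T), move(S, a), and the simulation loop above is given below, using the (a|b)*abb NFA from the earlier example as the transition table. This particular NFA happens to have no ε-transitions, so ε-closure(T) = T here, but the function follows the general work-list computation.

NFA_DELTA = {          # transition table: (state, symbol) -> set of states
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
EPS = {}               # state -> set of states reachable on epsilon (empty here)
START, ACCEPT = 0, {3}

def eps_closure(T):
    """All states reachable from states in T using only epsilon-transitions."""
    closure, stack = set(T), list(T)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):        # follow every epsilon edge not seen before
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(T, a):
    """All states reachable from states in T on input symbol a."""
    return {t for s in T for t in NFA_DELTA.get((s, a), ())}

def accepts(x):
    S = eps_closure({START})
    for a in x:
        S = eps_closure(move(S, a))
    return bool(S & ACCEPT)             # accept iff some accepting state is in S

print(accepts("aabb"), accepts("babb"), accepts("abab"))  # True True False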

Note
• An NFA accepts an input string x iff there is a path in the transition graph from the start state to some accepting (final) state.
• The language defined by an NFA is the set of strings it accepts.
• A Deterministic Finite Automaton (DFA) is a special case of an NFA in which each state has a unique successor state for each input symbol.

How to simulate a DFA

s := s0; c := nextchar;
while (c <> eof) do
    s := move(s, c);
    c := nextchar;
end;
if (s in F) then return "yes"
else return "no"
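A sketch of this loop in Python, using a DFA for (a|b)*abb (states 0-3, start state 0, accepting state 3) as the transition table; the table itself is supplied here only for illustration.

DFA_DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
F = {3}

def dfa_accepts(x, s0=0):
    s = s0
    for c in x:
        s = DFA_DELTA[(s, c)]           # exactly one successor per (state, symbol)
    return s in F

print(dfa_accepts("abb"), dfa_accepts("aabb"), dfa_accepts("abab"))  # True True False

Because each step is a single table lookup, the DFA simulation needs no set of states and no ε-closure, which is why it is faster than simulating the NFA directly.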



Conversion of an RE to an NFA
For each kind of RE there is a corresponding NFA that defines the same language.
There are many strategies for finding the FA, each with its strengths and weaknesses.
The construction of an NFA from an RE is done as follows:
Notes: the NFA has one final state, no edge enters the start state, and no edge leaves the final state.
First, parse the RE into sub-expressions, then apply the basis rules 1 and 2 and the inductive rules.
• For each sub-expression the algorithm constructs an NFA with a single accepting state.
• INPUT: A regular expression r over an alphabet ∑.
• OUTPUT: An NFA N accepting L(r).
• Method: Begin by parsing r into its constituent sub-expressions. The rules for constructing an NFA consist of basis rules for handling sub-expressions with no operators, and inductive rules for constructing larger NFAs from the NFAs for the immediate sub-expressions of a given expression.
1) For the expression ε, construct an NFA with a new start state, a new accepting state, and an ε-edge from the start state to the accepting state.
2) For any symbol a in ∑, construct an NFA with a new start state, a new accepting state, and an edge labeled a from the start state to the accepting state.
3) For the union of two regular expressions s and t, suppose N(s) and N(t) are their NFAs; add a new start state and a new accepting state and connect them to N(s) and N(t) by ε-transitions.



• Example: a | b
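As an illustration of the basis rule and the union rule, the following sketch hand-builds the NFA for a | b with a new start state 0 and a new accepting state 5 joined to N(a) and N(b) by ε-transitions; the state numbering is an arbitrary choice made here.

EPS_EDGES = {
    0: {1, 3},    # new start state -> start states of N(a) and N(b)
    2: {5},       # accepting state of N(a) -> new accepting state
    4: {5},       # accepting state of N(b) -> new accepting state
}
SYM_EDGES = {
    (1, "a"): {2},   # N(a): 1 --a--> 2
    (3, "b"): {4},   # N(b): 3 --b--> 4
}
START, ACCEPT = 0, 5

def eps_closure(T):
    closure, stack = set(T), list(T)
    while stack:
        for t in EPS_EDGES.get(stack.pop(), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def accepts(x):
    S = eps_closure({START})
    for a in x:
        S = eps_closure({t for s in S for t in SYM_EDGES.get((s, a), ())})
    return ACCEPT in S

print(accepts("a"), accepts("b"), accepts("ab"))  # True True False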

Example: Convert the regular expression a (b | c d)* e to an NFA.

Solution: [NFA diagram: states 0 through 10, with edges labeled a, b, c, d, and e and accepting state 10.]



Q1: Choose the suitable answer for each of the following:

1. A translator that takes as input a high-level language program and
translates it into machine language in one step is known as ________.
(a) Compiler
(b) Interpreter
(c) Preprocessor
(d) Assembler
2. ________ create a single program from several files of relocated machine
code.
(a) Loaders
(b) Assemblers
(c) Link editors
(d) Preprocessors
3. A group of logically related characters in the source program is known
as________.
(a) Token
(b) Lexeme
(c) Parse tree
(d) Buffer
4. The ________ uses the parse tree and the symbol table to check the semantic
consistency of the source program.
(a) Lexical analyzer
(b) Intermediate code generator
(c) Syntax translator
(d) Semantic analyzer
5. The ________ phase converts intermediate code into optimized
code that takes less space and less time to execute.
(a) Code optimization
(b) Syntax directed translation
(c) Code generation
(d) Intermediate code generation
6. ________ is invoked whenever any fault occurs in the compilation
process of source program.
(a) Syntax analyzer
(b) Code generator
(c) Error handler
(d) Lexical analyzer
7. In compiler, the activities of one or more phases are combined into a
single module known as a ________.
(a) Phase
(b) Pass
(c) Token
(d) Macro

8. A compiler that runs on one machine and produces the target code for
another machine is known as ________.
(a) Cross compiler
(b) Linker
(c) Preprocessor
(d) Assembler


