All content following this page was uploaded by Qasim Mohammed Hussein on 27 November 2015.
Compiler techniques
Phases of compiler – part 1
Lexical analyzer phase
Assistant Prof. Dr. Qasim Mohammed Hussein
Reference:
Compilers: Principles, Techniques & Tools.
By: Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman
Figure 1: A compiler
Compiler:
1. The compiler translates the entire program in one go and then executes it.
Interpreter:
1. The interpreter takes one statement, translates it and executes it, and then takes another statement.
A compiler that runs on one machine and produces the target code for
another machine is known as a cross compiler.
Assembler
An assembly language is a low-level programming language for a
computer. It is the symbolic form of machine language, which uses
mnemonics to represent each low-level machine operation or opcode
(names are used instead of binary codes for operations and memory
addresses), such as:
MOV a1, R1
MOV #2, R1
MOV R1, b
Each type of processor has its own unique assembly language. Assembly
language programs are translated into machine language by a program called
the assembler.
Relocatable machine code: The code can be loaded at any location L in
memory.
Linker: It is the program that takes as input one or more object programs
that are separately compiled, and links them together into a single
executable program.
Structure of compiler
The construction of a compiler comprises a series of phases, as shown in
figure (2).
1. Lexical analyzer: It separates the characters of the source language into
groups of characters that logically belong together. These groups are
called tokens, such as: do, if, <, =, +, 4, 234.
2. Syntax analyzer: It groups tokens together into syntactic
structures. For example: A + B.
3. Semantic analyzer: It determines the meaning, if any, of a
syntactically well-formed sentence. It checks the source program for
semantic errors, and ensures that each operator has operands permitted
by the source language specification.
We can combine more than one phase into a single pass, such as lexical and
syntax analysis. Both the symbol table and the error handler are associated
with all phases of the compiler.
It is desirable to have relatively few passes, since it takes time to read and
write intermediate files. On the other hand, if we group several phases into
one pass, we may be forced to keep the entire program in memory, because
one phase may need information produced by another.
Types of tokens
Tokens correspond to sets of strings.
1) Identifier: strings of letters and digits, starting with a letter
2) Integer: a non-empty string of digits
3) Keyword: “else” or “if” or “begin” or …
4) Whitespace: a non-empty sequence of blanks, newlines, and tabs
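These four token classes can be sketched as a small tokenizer (a minimal sketch, assuming the keyword set {if, else, begin}; `tokenize` and `TOKEN_SPEC` are illustrative names, not from the reference text):

```python
import re

KEYWORDS = {"if", "else", "begin"}           # assumed keyword set
TOKEN_SPEC = [
    ("INTEGER",    r"\d+"),                  # non-empty string of digits
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"), # letter, then letters/digits
    ("WHITESPACE", r"[ \t\n]+"),             # blanks, newlines, and tabs
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                lexeme = m.group(0)
                # A keyword looks like an identifier; reclassify it.
                if name == "IDENTIFIER" and lexeme in KEYWORDS:
                    name = "KEYWORD"
                if name != "WHITESPACE":     # whitespace is skipped
                    tokens.append((name, lexeme))
                pos += len(lexeme)
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r}")
    return tokens
```

For instance, `tokenize("if x1 42")` yields a keyword, an identifier, and an integer, with the whitespace between them discarded.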
Lexical functions
The lexical analyzer may perform, besides token generation, the
following secondary tasks (functions):
1. Stripping out comments and newline characters from the source
program.
2. Correlating error messages from the compiler with the source program.
3. Keeping track of line numbers.
4. Skipping redundant spaces and tabs.
5. Producing an output listing.
6. Implementing macro-processor functions.
7. Sometimes the lexical analyzer can recover from errors, for example by:
a) Deleting an extraneous character.
b) Inserting a missing character.
c) Replacing an incorrect character by a correct character.
d) Transposing two adjacent characters.
These error transformations may be tried in order to repair the input. There
are other strategies, but they are more complex.
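The four repair transformations above can be sketched as a candidate generator (an illustrative sketch; `repairs` and the lowercase alphabet are assumptions made for the example):

```python
def repairs(lexeme, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Generate candidate repairs of a bad lexeme using the four
    single-edit transformations: delete, insert, replace, transpose."""
    out = set()
    for i in range(len(lexeme)):
        out.add(lexeme[:i] + lexeme[i+1:])               # delete an extraneous character
    for i in range(len(lexeme) + 1):
        for c in alphabet:
            out.add(lexeme[:i] + c + lexeme[i:])         # insert a missing character
    for i in range(len(lexeme)):
        for c in alphabet:
            out.add(lexeme[:i] + c + lexeme[i+1:])       # replace an incorrect character
    for i in range(len(lexeme) - 1):
        out.add(lexeme[:i] + lexeme[i+1] + lexeme[i] + lexeme[i+2:])  # transpose adjacent characters
    return out
```

A repair strategy would then keep only the candidates that form a valid token, e.g. the misspelling "fi" yields "if" via the transposition rule.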
Example of tokens

Token | Informal description | Sample lexeme
if    | characters i, f      | if

A pattern is a rule describing the set of lexemes that can represent a
particular token in the source program.
Input buffer
The lexical analyzer (LA) reads the source program character by character
to find the tokens. The LA may need to look ahead several characters
beyond the next token before the next token itself can be determined. The
compiler spends a considerable amount of time in the lexical analysis
phase. To reduce this overhead, it is desirable for the lexical analyzer to
read its input from an input buffer. The size of the buffer may be 1024 or
4096 bytes. There are many schemes that can be used to buffer input. The
input buffer is divided into two halves, as shown in figure 5, or a
two-buffer scheme is used, each buffer of length N, with the two buffers
alternately reloaded.
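The alternating two-buffer scheme can be sketched as follows (a simplified sketch in Python; real lexers work on raw bytes and often add sentinel characters, and `read_chars` is an illustrative name):

```python
import io

N = 16  # buffer-half size; real lexers use 1024 or 4096 bytes

def read_chars(stream, n=N):
    """Simulate the two-buffer scheme: refill one half of length N
    while the lexer scans the other, so look-ahead past the end of
    the current half never has to wait for a fresh read."""
    buffers = ["", ""]                # the two halves
    active = 0
    buffers[active] = stream.read(n)
    while buffers[active]:
        other = 1 - active
        buffers[other] = stream.read(n)  # reload the idle half
        for ch in buffers[active]:
            yield ch                     # the lexer consumes one character at a time
        active = other                   # switch halves
```

Joining the yielded characters reproduces the input exactly, regardless of where the input length falls relative to the buffer boundaries.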
Operations on language
There are several operations on languages. For lexical analysis, we are
interested in union, concatenation, and closure.

Operation | Definition
Union of L and M | L ∪ M = {s | s is in L or s is in M}
Concatenation of L and M | LM = {st | s is in L and t is in M}
Kleene closure of L | L* = zero or more concatenations of L
Positive closure of L | L+ = one or more concatenations of L
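For finite languages, these operations can be sketched directly as set computations (a sketch; since L* is infinite, the closure here is truncated to a bounded number of concatenations):

```python
def union(L, M):
    return L | M                          # {s | s in L or s in M}

def concat(L, M):
    return {s + t for s in L for t in M}  # {st | s in L and t in M}

def closure(L, max_reps=3):
    """Kleene closure L*, truncated: all strings built from at most
    max_reps concatenations of members of L (includes the empty string)."""
    result = {""}
    current = {""}
    for _ in range(max_reps):
        current = concat(current, L)
        result |= current
    return result
```

With L = {"a"}, the truncated closure at two repetitions gives {"", "a", "aa"}, matching the "zero or more concatenations" reading of L*.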
Larger regular expressions are built from smaller ones. Each regular
expression r denotes a language L(r) , which is also defined recursively from
the languages denoted by r's sub-expressions. Here are the rules that define
the regular expressions over some alphabet ∑, and the languages that those
expressions denote.
There are two basic rules:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language
whose sole member is the empty string.
2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}.
Let r and s be regular expressions denoting the languages L(r) and L(s),
respectively.
1) (r) | (s) is a regular expression denoting the language L(r) U L(s).
2) (r) (s) is a regular expression denoting the language L(r) L(s) .
3) (r) * is a regular expression denoting (L (r)) * .
4) (r) is a regular expression denoting L(r). This last rule says that we can
add additional pairs of parentheses around expressions without
changing the language they denote.
Example:
We may replace the regular expression (a) | ((b) * (c)) by a| b*c.
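This equivalence can be checked with Python's `re` module, whose `|`, `*`, and parentheses carry the same meaning as in the notation above (a sketch; the sample strings are illustrative):

```python
import re

fully_parenthesized = "(a)|((b)*(c))"
simplified = "a|b*c"

# Both expressions denote the same language: they accept and reject
# exactly the same strings.
for s in ["a", "c", "bc", "bbbc", "b", "ac", ""]:
    full = re.fullmatch(fully_parenthesized, s) is not None
    simp = re.fullmatch(simplified, s) is not None
    assert full == simp
```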
Finite Automata
A recognizer for a language is a program that takes as input a string x and
answers "yes" if x is a sentence of the language and "no" otherwise.
We compile a regular expression into a recognizer by constructing a
generalized transition diagram called a finite automaton.
A finite automaton can be deterministic or nondeterministic, where
nondeterministic means that more than one transition out of a state may be
possible on the same input symbol.
A DFA is a special case of an NFA in which
1) No state has an ε-transition
2) For each state s and input symbol a, there is at most one edge
labeled a leaving s.
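Because of condition (2), a DFA can be simulated with a single table lookup per input symbol (a sketch; the transition table below is an illustrative DFA accepting strings over {a, b} that end in "ab", not taken from the reference text):

```python
def run_dfa(delta, start, accepting, s):
    """Simulate a DFA: at most one edge per (state, symbol) and no
    ε-moves, so each input symbol yields exactly one successor state."""
    state = start
    for ch in s:
        if (state, ch) not in delta:
            return False              # no edge on this symbol: reject
        state = delta[(state, ch)]
    return state in accepting

# Example DFA accepting strings over {a, b} that end in "ab".
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
```

For example, `run_dfa(delta, 0, {2}, "aab")` accepts, while `"aba"` is rejected because the run ends in state 1.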
Simulation of NFA
Note
Non-deterministic Finite Automata (NFA)
An NFA accepts an input string x iff there is a path in the
transition graph from the start state to some accepting (final)
state.
The language defined by an NFA is the set of strings it accepts.
A Deterministic Finite Automaton (DFA) is a special case of an NFA
that has unique successor states.
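The acceptance condition above can be simulated directly by tracking the set of states reachable so far (a sketch; ε is represented by the empty string `""` in the transition map, and the example NFA for (a|b)*ab is illustrative):

```python
def eps_closure(delta, states):
    """All states reachable from `states` via ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, ""), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def run_nfa(delta, start, accepting, s):
    """Accept iff some path from the start state reaches an
    accepting state while spelling out the input string s."""
    current = eps_closure(delta, {start})
    for ch in s:
        moved = set()
        for st in current:
            moved |= set(delta.get((st, ch), ()))
        current = eps_closure(delta, moved)
    return bool(current & accepting)

# Example NFA for (a|b)*ab: state 0 loops on a and b, and
# nondeterministically guesses when the final "ab" begins.
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
```

On input "aab" the simulation ends in the state set {0, 2}, which intersects the accepting set {2}, so the string is accepted.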
Example: Convert the regular expression a (b | c d)* e to an NFA.
Solution
[Transition diagram: NFA for a (b | c d)* e with states 0 through 10 and
edges labeled a, b, c, d, and e.]
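The language of this example can be sanity-checked against Python's `re` engine, which accepts the same regular-expression notation (a sketch; `re.fullmatch` plays the role of running the NFA on the whole string):

```python
import re

pattern = "a(b|cd)*e"

# Strings the NFA should accept and reject.
accepted = ["ae", "abe", "acde", "abcdbe"]
rejected = ["a", "abce", "bcd", "abede"]

for s in accepted:
    assert re.fullmatch(pattern, s)
for s in rejected:
    assert not re.fullmatch(pattern, s)
```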
8. A compiler that runs on one machine and produces the target code for
another machine is known as ________.
(a) Cross compiler
(b) Linker
(c) Preprocessor
(d) Assembler