Compiler
Simply stated, a compiler is a program that reads a program written in one language (the
source language) and translates it into an equivalent program in another language (the target
language); see Fig. 1. As an important part of this translation process, the compiler reports to its
user the presence of errors in the source program.
Conceptually, a compiler operates in phases, each of which transforms the source program from
one representation to another.
Fig.2. Phases of a compiler
The first three phases form the bulk of the analysis portion of a compiler. Symbol-table
management and error handling are shown interacting with all six phases.
An essential function of a compiler is to record the identifiers used in the source program
and collect information about various attributes of each identifier. A symbol table is a data
structure containing a record for each identifier, with fields for the attributes of the identifier.
The data structure allows us to find the record for each identifier quickly and to store or retrieve
data from that record quickly. When an identifier in the source program is detected by the lexical
analyzer, the identifier is entered into the symbol table.
Each phase can encounter errors. A compiler that stops when it finds the first error is not
as helpful as it could be.
The syntax and semantic analysis phases usually handle a large fraction of the errors detectable
by the compiler. The lexical phase can detect errors where the characters remaining in the input
do not form any token of the language. Errors when the token stream violates the syntax of the
language are determined by the syntax analysis phase. During semantic analysis the compiler
tries to detect constructs that have the right syntactic structure but no meaning to the operation
involved.
As translation progresses, the compiler’s internal representation of the source program changes.
Consider an assignment statement as it is transformed by each phase.
The lexical analysis phase reads the characters in the source program and groups them into a
stream of tokens in which each token represents a logically cohesive sequence of characters,
such as an identifier or a keyword. The character sequence forming a token is called the lexeme
for the token. Certain tokens are augmented by a 'lexical value'. For example, for any
identifier the lexical analyzer generates not only the token id but also enters the lexeme into the
symbol table, if it is not already present there. The lexical value associated with this occurrence of id
points to the symbol table entry for this lexeme. The representation of the statement given above
after lexical analysis would be:
Syntax analysis imposes a hierarchical structure on the token stream, which is shown by syntax
trees (fig 3).
After syntax and semantic analysis, some compilers generate an explicit intermediate
representation of the source program. This intermediate representation can have a variety of
forms; a common one is three-address code, in which each instruction has at most one operator
in addition to an assignment. The last such instruction for the statement above would be
id1 := temp3
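As an illustration (the original statement is not reproduced in these notes, so the names x, y, z and the temporaries t1, t2 below are hypothetical), a three-address code sequence for an assignment such as x := y + z * 10 might be:

```
t1 := z * 10
t2 := y + t1
x  := t2
```

Each line has at most one operator, which makes the order of evaluation explicit and gives the optimizer and code generator a simple, uniform structure to work on.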
Code Optimisation
The code optimization phase attempts to improve the intermediate code, so that faster-running
machine code will result. Some optimizations are trivial. There is great variation in
the amount of code optimization different compilers perform. In those that do the most, called
'optimising compilers', a significant fraction of the compiler's time is spent on this phase.
Code Generation
The final phase of the compiler is the generation of target code, consisting normally of
relocatable machine code or assembly code. Memory locations are selected for each of the
variables used by the program. Then, intermediate instructions are each translated into a
sequence of machine instructions that perform the same task. A crucial aspect is the assignment
of variables to registers.
[Figure: the statement after successive phases, from the lexical analyser through the semantic
analyser (which produces the syntax tree with := at the root over id1 and +, + over id2 and *,
and * over 10) down to the code generator]
Lexical Analysis
A simple way to build a lexical analyzer is to construct a diagram that illustrates the
structure of the tokens of the source language, and then to hand-translate the diagram into a
program for finding tokens. Efficient lexical analysers can be produced in this manner.
The lexical analyzer is the first phase of a compiler. Its main task is to read the input
characters and produce as output a sequence of tokens that the parser uses for syntax analysis. As
shown in the figure, upon receiving a "get next token" command from the parser, the lexical analyzer
reads input characters until it can identify the next token.
[Figure: the parser issues "get next token" to the lexical analyser, which reads the source
program and returns the next token; both components access the symbol table]
Since the lexical analyzer is the part of the compiler that reads the source text, it may also
perform certain secondary tasks at the user interface. One such task is stripping out from the
source program comments and white space in the form of blank, tab, and newline characters.
Another is correlating error messages from the compiler with the source program.
There are several reasons for separating the analysis phase of compiling into lexical analysis and
parsing.
1) Simpler design is the most important consideration. The separation of lexical analysis from
syntax analysis often allows us to simplify one or the other of these phases.
There is a set of strings in the input for which the same token is produced as output. This
set of strings is described by a rule called a pattern associated with the token. The pattern is said to
match each string in the set. A lexeme is a sequence of characters in the source program that is
matched by the pattern for the token. For example, in the Pascal statement const pi = 3.1416;
the substring pi is a lexeme for the token identifier.
In most programming languages, the following constructs are treated as tokens:
keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as
parentheses, commas, and semicolons.
TOKEN   SAMPLE LEXEME   PATTERN
const   const           const
if      if              if
In the example, when the character sequence pi appears in the source program, the token
representing an identifier is returned to the parser. The returning of a token is often implemented
by passing an integer corresponding to the token. It is this integer that is referred to as the boldface
id in the above table.
A pattern is a rule describing a set of lexemes that can represent a particular token in
source program. The pattern for the token const in the above table is just the single string const
that spells out the keyword.
Certain language conventions impact the difficulty of lexical analysis. Languages such as
FORTRAN require certain constructs in fixed positions on the input line. Thus the alignment of
a lexeme may be important in determining the correctness of a source program.
Attributes of Tokens
The lexical analyzer returns to the parser a representation for the token it has found. The
representation is an integer code if the token is a simple construct such as a left parenthesis,
comma, or colon. The representation is a pair consisting of an integer code and a pointer to a
table if the token is a more complex element such as an identifier or constant. The integer code
gives the token type; the pointer points to the value of that token. Pairs are also returned whenever
we wish to distinguish between instances of a token.
Input Buffering
The lexical analyzer scans the characters of the source program one at a time to discover
tokens. Often, however, many characters beyond the next token may have to be examined
before the next token itself can be determined. For this and other reasons, it is desirable for the
lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two
halves of, say, 100 characters each. One pointer marks the beginning of the token being
discovered. A lookahead pointer scans ahead of the beginning point, until the token is
discovered. We view the position of each pointer as being between the character last read and the
character next to be read. In practice, each buffering scheme adopts one convention: either a
pointer is at the symbol last read or at the symbol it is ready to read.
The distance the lookahead pointer may have to travel past the actual token can be
large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, …, ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the
character that follows the right parenthesis. In either case, the token itself ends at the second E.
If the lookahead pointer travels beyond the buffer half in which it began, the other half must be
loaded with the next characters from the source file.
Since the buffer shown in the above figure is of limited size, there is an implied constraint on
how much lookahead can be used before the next token is discovered. In the above example, if
the lookahead travelled to the left half and all the way through the left half to the middle, we
could not reload the right half, because we would lose characters that had not yet been grouped
into tokens. While we can make the buffer larger if we choose, or use another buffering scheme,
we cannot ignore the fact that lookahead is limited.
Specification of tokens
The term alphabet or character class denotes any finite set of symbols. Typical
examples of symbols are letters and digits. The set {0, 1} is the binary alphabet.
A String over some alphabet is a finite sequence of symbols drawn from that alphabet. In
Language theory, the terms sentence and word are often used as synonyms for the term “string”.
The term language denotes any set of strings over some fixed alphabet. This definition is
very broad. Abstract languages like the empty set, or {ε}, the set containing only the empty
string, are languages under this definition.
Certain terms for parts of a string are prefix, suffix, substring, and subsequence of a string.
There are several important operations, such as union, concatenation, and closure, that can be applied
to languages.
Regular Expressions
For example, the regular expression letter ( letter | digit )* denotes the set of strings
consisting of a letter followed by zero or more letters or digits. The vertical bar here means "or",
the parentheses are used to group subexpressions, the star means "zero or more instances of" the
parenthesized expression, and the juxtaposition of letter with the remainder of the expression
means concatenation.
A regular expression is built up out of simpler regular expressions using set of defining
rules. Each regular expression r denotes a language L(r). The defining rules specify how L(r) is
formed by combining in various ways the languages denoted by the subexpressions of r.
Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions that:
1. the unary operator * has the highest precedence and is left associative.
2. concatenation has the second highest precedence and is left associative.
3. | has the lowest precedence and is left associative.
Regular Definitions
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions
of the form
d1 → r1
d2 → r2
. . .
dn → rn
where each di is a distinct name, and each ri is a regular expression over the
symbols in Σ ∪ {d1, d2, …, di-1}, i.e., the basic symbols and the previously defined names.
Example: The set of Pascal identifiers is the set of strings of letters and digits beginning with a
letter. The regular definition for the set is
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
Unsigned numbers in Pascal are strings such as 5280, 56.77, 6.25E4, etc. The following
regular definition provides a precise specification for this class of strings:
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
This definition says that digit can be any digit from 0 to 9, while digits is a digit
followed by zero or more occurrences of a digit.
Notational Shorthands
1. One or more instances. The unary postfix operator + means "one or more instances of".
2. Zero or one instance. The unary postfix operator ? means "zero or one instance of".
The notation r? is a shorthand for r | ε.
3. Character classes. The notation [abc], where a, b, and c are alphabet symbols,
denotes the regular expression a|b|c. An abbreviated character class such as [a-z]
denotes the regular expression a|b|c|…|z.
Recognition of tokens
The question of how to recognize tokens is handled in this section. The language
generated by the following grammar is used as an example.
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the
following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
For this language fragment the lexical analyzer will recognize the keywords if, then, else,
as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords
are reserved; that is, they cannot be used as identifiers. Unsigned integer and real numbers of
Pascal are represented by num.
In addition, we assume lexemes are separated by white space, consisting of nonnull sequences of
blanks, tabs, and newlines. Our lexical analyzer will strip out white space. It will do so by
comparing a string against the regular definition ws below.
delim → blank | tab | newline
ws → delim+
If a match for ws is found, the lexical analyzer does not return a token to the parser.
Rather, it proceeds to find a token following the white space and returns that to the parser. Our
goal is to construct a lexical analyzer that will isolate the lexeme for the next token in the input
buffer and produce as output a pair consisting of the appropriate token and attribute value, using
the translation table given in the figure. The attribute values for the relational operators are given
by the symbolic constants LT,LE,EQ,NE,GT,GE.
REGULAR EXPRESSION   TOKEN   ATTRIBUTE VALUE
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE
Transition diagram
Positions in a transition diagram are drawn as circles and are called states. The states are
connected by arrows, called edges. Edges leaving state s have labels indicating the input
characters that can next appear after the transition diagram has reached state s. The label other
refers to any character that is not indicated by any of the other edges leaving s.
One state is labeled the start state; it is the initial state of the transition diagram, where
control resides when we begin to recognize a token. Certain states may have actions that are
executed when the flow of control reaches that state. On entering a state we read the next input
character; if there is an edge from the current state whose label matches this input character, we
go to the state pointed to by the edge. Otherwise, we indicate failure. A transition diagram
for >= is shown in the figure.
Lex
Lex helps write programs whose control flow is directed by instances of regular
expressions in the input stream. It is well suited for editor-script type transformations and for
segmenting input in preparation for a parsing routine.
Lex source is a table of regular expressions and corresponding program fragments. The
table is translated to a program which reads an input stream, copying it to an output stream and
partitioning the input into strings which match the given expressions. As each such string is
recognized the corresponding program fragment is executed. The recognition of the expressions
is performed by a deterministic finite automaton generated by Lex. The program fragments
written by the user are executed in the order in which the corresponding regular expressions
occur in the input stream.
The lexical analysis programs written with Lex accept ambiguous specifications and
choose the longest match possible at each input point. If necessary, substantial lookahead is
performed on the input, but the input stream will be backed up to the end of the current partition,
so that the user has general freedom to manipulate it.
Introduction.
Lex is a program generator designed for lexical processing of character input streams. It
accepts a high-level, problem oriented specification for character string matching, and
produces a program in a general purpose language which recognizes regular expressions. The
regular expressions are specified by the user in the source given to Lex. The Lex written code
recognizes these expressions in an input stream and partitions the input stream into strings
matching the expressions. At the boundaries between strings program sections provided by the
user are executed. The Lex source file associates the regular expressions and the program
fragments. As each expression appears in the input to the program written by Lex, the
corresponding fragment is executed.
Lex is not a complete language, but rather a generator representing a new language
feature which can be added to different programming languages, called ‘‘host languages.’’
Lex can write code in different host languages. The host language is used for the output
code generated by Lex and also for the program fragments added by the user. Compatible
run-time libraries for the different host languages are also provided. This makes Lex adaptable to
different environments and different users. Each application may be directed to the combination
of hardware and host language appropriate to the task, the user’s background, and properties of
local implementations.
Lex turns the user's expressions and actions (called source in this memo) into the host
general-purpose language; the generated program is named yylex. The yylex program will
recognize expressions in a stream (called input in this memo) and perform the specified actions
for each expression as it is detected.
An overview of Lex
For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of
lines.
%%
[ \t]+$ ;
is all that is required. The program contains a %% delimiter to mark the beginning of the rules,
and one rule. This rule contains a regular expression which matches one or more instances of the
characters blank or tab (written \t for visibility, in accordance with the C language convention) just
prior to the end of a line. The brackets indicate a character class made of blank and tab; the +
indicates ''one or more ...''; and the $ indicates ''end of line,'' as in QED. No action is specified,
so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied.
To change any remaining string of blanks or tabs to a single blank, add another rule:
%%
[ \t]+$ ;
[ \t]+ printf (" ");
The finite automaton generated for this source will scan for both rules at once, observing
at the termination of the string of blanks or tabs whether or not there is a newline character, and
executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of
lines, and the second rule all remaining strings of blanks or tabs. Lex can be used alone for
simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be
used with a parser generator to perform the lexical analysis phase; it is particularly easy to
interface Lex and Yacc. Lex programs recognize only regular expressions; Yacc writes parsers
that accept a large class of context-free grammars, but requires a lower-level analyzer to recognize
input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a
preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser
generator assigns structure to the resulting pieces. The flow of control in such a case (which
might be the first half of a compiler, for example) is shown below. Additional programs, written
by other generators or by hand, can be added easily to programs written by Lex.
Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be
named, so that the use of this name by Lex simplifies interfacing. Lex generates a deterministic
finite automaton from the regular expressions in the source. The automaton is interpreted, rather
than compiled, in order to save space. The result is still a fast analyzer. In particular, the time
taken by a Lex program to recognize and partition an input stream is proportional to the length of
the input. The number of Lex rules or the complexity of the rules is not important in determining
speed, unless rules which include forward context require a significant amount of rescanning.
What does increase with the number and complexity of rules is the size of the finite automaton,
and therefore the size of the program generated by Lex.
In the program written by Lex, the user's fragments (representing the actions to be performed as
each regular expression is found) are gathered as cases of a switch. The automaton interpreter
directs the control flow. Opportunity is provided for the user to insert either declarations or
additional statements in the routine containing the actions, or to add subroutines outside this
action routine. Lex is not limited to source which can be interpreted on the basis of one-character
lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and
the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd.
Such backup is more costly than the processing of simpler languages.
Lex Source.
{definitions}
%%
{rules}
%%
{user subroutines}
where the definitions and the user subroutines are often omitted. The second %% is optional, but
the first is required to mark the beginning of the rules. The absolute minimum Lex program is %%
(no definitions, no rules), which translates into a program which copies the input to the output
unchanged. In the outline of Lex programs shown, the rules represent the user's control decisions;
they are a table in which the left column contains regular expressions and the right column
contains actions to be executed when the expressions are recognized. Thus an individual rule
might appear
integer printf("found keyword INT");
to look for the string integer in the input stream and print the message ''found keyword INT''
whenever it appears. In this example the host procedural language is C and the C library function
printf is used to print the string. The end of the expression is indicated by the first blank or tab
character. If the action is merely a single C expression, it can just be given on the right side of
the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly
more useful example, suppose it is desired to change a number of words from British to
American spelling.
petrol printf("gas");
would be a start.
Metacharacter Matches
\n newline
Expression Matches
abc abc
a(bc)? a, abc
[abc] a, b, c
[a\-z] a, -, z
[-az] -, a, z
[ \t\n]+ whitespace
[a^b] a, ^, b
[a|b] a, |, b
a|b a or b
name function
int yylex(void) call to invoke lexer, returns token
Finite automata
A recognizer for a language is a program that takes as input a string x and answers 'yes' if x is a
sentence of the language and 'no' otherwise. We compile a regular expression into a recognizer by
constructing a transition diagram called a finite automaton. A finite automaton can be deterministic
or nondeterministic, where nondeterministic means that more than one transition out of a state may
be possible on the same input symbol.
DFAs are faster recognizers than NFAs, but can be much bigger than equivalent NFAs.
Transition table:
State   a       b
0       {0,1}   {0}
1       -       {2}
2       -       {3}
A deterministic finite automaton (DFA) is a special case of an NFA in which 1) no state has an
ε-transition, and 2) for each state s and input symbol a, there is at most one edge labeled a leaving s.
Subset construction (converting an NFA to an equivalent DFA):
Input: an NFA N.
Output: a DFA D accepting the same language.
Method: construct a transition table Dtran for D; each DFA state is a set of NFA states.
operation        description
ε-closure(s)     Set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)     Set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)       Set of NFA states to which there is a transition on input symbol a from some NFA state s in T

initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        Dtran[T, a] := U
    end
end
Thompson's Construction
To convert a regular expression r over an alphabet Σ into an NFA N accepting L(r):
For ε, construct an NFA with a new start state i, a new accepting state f, and an ε-edge from
i to f. This NFA recognizes {ε}.
For each symbol a in Σ, construct an NFA with a new start state i and a new accepting state f,
joined by an edge labeled a. This NFA accepts {a}.
If a occurs several times in r, then a separate NFA is constructed for each occurrence.
Keeping the syntactic structure of the regular expression in mind, combine these NFAs inductively
until the NFA for the entire expression is obtained. Each intermediate NFA produced during the
course of the construction corresponds to a subexpression of r and has several important properties:
it has exactly one final state, no edge enters the start state, and no edge leaves the final state.
Suppose N(s) and N(t) are NFAs for regular expressions s and t.
(a) For regular expression s|t, construct the following composite NFA N(s|t) :
(b) For the regular expression st, construct the composite NFA N(st) :
(c) For the regular expression s* , construct the composite NFA N(s*) :
(d) For the parenthesized regular expression (s), use N(s) itself as the NFA.
A lexical analyzer generator creates a lexical analyser from a set of specifications, usually in the
format
p1 { action1 }
p2 { action2 }
. . .
pn { actionn }
where each pi is a regular expression and each actioni is a program fragment that is to be
executed whenever a lexeme matched by pi is found in the input. If more than one pattern
matches, then the longest lexeme matched is chosen. If there are two or more patterns that match
the longest lexeme, the first listed matching pattern is chosen.
This is usually implemented using a finite automaton. There is an input buffer with
two pointers to it, a lexeme-beginning and a forward pointer. The lexical analyser generator
constructs a transition table for a finite automaton from the regular expression patterns in the
lexical analyser generator specification. The lexical analyser itself consists of a finite automaton
simulator that uses this transition table to look for the regular expression patterns in the input
buffer.
This can be implemented using an NFA or a DFA. The transition table for an NFA is
considerably smaller than that for a DFA, but the DFA recognises patterns faster than the NFA.
Using NFA
The transition table for an NFA N is constructed for the composite pattern p1 | p2 | … | pn.
The NFA recognises the longest prefix of the input that is matched by a pattern. In the final
NFA, there is an accepting state for each pattern pi. The sequence of sets of states the NFA can
be in after seeing each input character is constructed. The NFA is simulated until it reaches
termination, that is, a set of states from which there is no transition defined for the current
input symbol. The specification for the lexical analyser generator is such that a valid source
program cannot entirely fill the input buffer without having the NFA reach termination. To find a
correct match, two things are done. First, whenever an accepting state is added to the current set
of states, the current input position and the pattern pi corresponding to this accepting state are
recorded. If the current set of states already contains an accepting state, then only the pattern that
appears first in the specification is recorded. Second, the transitions are recorded until
termination is reached. Upon termination, the forward pointer is retracted to the position at which
the last match occurred. The pattern making this match identifies the token found, and the
lexeme matched is the string between the lexeme-beginning and forward pointers. If no pattern
matches, the lexical analyser should transfer control to some default recovery routine.
Using DFA
Here a DFA is used for pattern matching. This method is a modified version of the
method using NFA. The NFA is converted to a DFA using a subset construction algorithm. Here
there may be several accepting states in a given subset of nondeterministic states. The accepting
state corresponding to the pattern listed first in the lexical analyser generator specification has
priority. Here also state transitions are made until a state is reached which has no next state for
the current input symbol. The last input position at which the DFA entered an accepting state
gives the lexeme.