
Compilers

         Compiler

Simply stated, a compiler is a program that reads a program written in one language, the source language, and translates it into an equivalent program in another language, the target language (see Fig. 1). As an important part of this translation process, the compiler reports to its user the presence of errors in the source program.

Source program → Compiler → Target program
(error messages are reported to the user)

Fig.1. A compiler

Compilers are sometimes classified as single-pass, multi-pass, load-and-go, debugging, or optimizing, depending on how they have been constructed or on what function they are supposed to perform. Despite this apparent complexity, the basic tasks that any compiler must perform are essentially the same.

THE DIFFERENT PHASES OF A COMPILER

Conceptually, a compiler operates in phases, each of which transforms the source program from
one representation to another.

                                        
Lexical analyser → syntax analyser → semantic analyser → intermediate code generator → code optimiser → code generator
(the symbol-table manager and the error handler interact with all six phases)

Fig.2. Phases of a compiler

            

The first three phases form the bulk of the analysis portion of a compiler. Two other activities, symbol table management and error handling, are shown interacting with all six phases.

Symbol table management

An essential function of a compiler is to record the identifiers used in the source program and collect information about various attributes of each identifier. A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly. When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table.
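As a concrete illustration (not part of the original notes), a symbol table entry might be represented in C roughly as follows; the field names, the fixed-size chained hash table, and the helper name lookup_or_insert are assumptions made for this sketch:

#include <stdlib.h>
#include <string.h>

/* One record per identifier; the attribute fields shown are illustrative. */
struct symbol {
    char          *lexeme;    /* the identifier's spelling                 */
    int            type;      /* e.g. integer or real, filled in later     */
    struct symbol *next;      /* chaining for hash-table collisions        */
};

#define TABLE_SIZE 211
static struct symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Look the lexeme up; insert it if it is not already present. */
struct symbol *lookup_or_insert(const char *lexeme) {
    unsigned h = hash(lexeme);
    for (struct symbol *p = table[h]; p; p = p->next)
        if (strcmp(p->lexeme, lexeme) == 0)
            return p;
    struct symbol *p = malloc(sizeof *p);
    p->lexeme = malloc(strlen(lexeme) + 1);
    strcpy(p->lexeme, lexeme);
    p->type = 0;
    p->next = table[h];
    table[h] = p;
    return p;
}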

Error Detection and Reporting

Each phase can encounter errors. A compiler that stops when it finds the first error is not
as helpful as it could be.

The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the compiler. The lexical phase can detect errors where the characters remaining in the input do not form any token of the language. Errors in which the token stream violates the syntax of the language are detected by the syntax analysis phase. During semantic analysis the compiler tries to detect constructs that have the right syntactic structure but no meaning for the operation involved.

The Analysis phases

As translation progresses, the compiler’s internal representation of the source program changes.
Consider the statement,

position  :=  initial  +  rate  *  10


 

The lexical analysis phase reads the characters in the source program and groups them into a stream of tokens in which each token represents a logically cohesive sequence of characters, such as an identifier or a keyword. The character sequence forming a token is called the lexeme for the token. Certain tokens are augmented by a 'lexical value'. For example, for any identifier the lexical analyzer generates not only the token id but also enters the lexeme into the symbol table, if it is not already present there. The lexical value associated with this occurrence of id points to the symbol table entry for that lexeme. The representation of the statement given above after lexical analysis would be:

id1 := id2 + id3 * 10

Syntax analysis imposes a hierarchical structure on the token stream, which is shown by syntax trees (Fig. 3).

Intermediate Code Generation

After syntax and semantic analysis, some compilers generate an explicit intermediate
representation of the source program. This intermediate representation can have a variety of
forms.

In three-address code, the intermediate form of the source program might look like this:

temp1 := inttoreal(10)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
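One common in-memory representation for three-address statements is the quadruple. A minimal sketch in C follows; the enum members and field names are illustrative assumptions, not a prescribed layout:

/* A three-address statement: result := arg1 op arg2 */
enum tac_op { TAC_ADD, TAC_MUL, TAC_INTTOREAL, TAC_COPY };

struct quad {
    enum tac_op op;
    const char *arg1;    /* first operand, e.g. "id3"         */
    const char *arg2;    /* second operand, may be NULL       */
    const char *result;  /* destination, e.g. "temp2"         */
};

/* The four statements above, written as quadruples: */
struct quad code[] = {
    { TAC_INTTOREAL, "10",    NULL,    "temp1" },
    { TAC_MUL,       "id3",   "temp1", "temp2" },
    { TAC_ADD,       "id2",   "temp2", "temp3" },
    { TAC_COPY,      "temp3", NULL,    "id1"   },
};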

Code Optimisation
 

The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result. Some optimizations are trivial. There is great variation in the amount of code optimization different compilers perform. In those that do the most, called 'optimizing compilers', a significant fraction of the compiler's time is spent on this phase.

Code Generation

The final phase of the compiler is the generation of target code, consisting normally of
relocatable machine code or assembly code. Memory locations are selected for each of the
variables used by the program. Then, intermediate instructions are each translated into a
sequence of machine instructions that perform the same task. A crucial aspect is the assignment
of variables to registers.

position := initial + rate * 10

            | lexical analyser
            v

id1 := id2 + id3 * 10

            | semantic analyser
            v

(syntax tree, children indented under their parent)

:=
    id1
    +
        id2
        *
            id3
            inttoreal
                10

            | intermediate code generator
            v

temp1 := inttoreal(10)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

            | code optimiser
            v

temp1 := id3 * 10.0
id1 := id2 + temp1

            | code generator
            v

MOVF  id3, R2
MULF  #10.0, R2
MOVF  id2, R1
ADDF  R2, R1
MOVF  R1, id1

Fig.3. Translation of a statement

Lexical Analysis

A simple way to build a lexical analyzer is to construct a diagram that illustrates the structure of the tokens of the source language, and then to hand-translate the diagram into a program for finding tokens. Efficient lexical analyzers can be produced in this manner.

Role of a Lexical Analyser

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. As shown in the figure, upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.

Source program → Lexical analyser → Parser
(the parser issues "get next token" requests to the lexical analyser, which returns the next token; both consult the symbol table)

Fig 4. Interaction of lexical analyzer with parser.

Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab, and newline characters. Another is correlating error messages from the compiler with the source program.

Issues in Lexical Analysis

There are several reasons for separating the analysis phase of compiling into lexical analysis and
parsing.

1) Simpler design is the most important consideration. The separation of lexical analysis from
syntax analysis often allows us to simplify one or the other of these phases.

2) Compiler efficiency is improved.

3) Compiler portability is enhanced.

Tokens, Patterns, and Lexemes

There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. The pattern is said to match each string in the set. A lexeme is a sequence of characters in the source program that is matched by the pattern for the token. For example, in the Pascal statement const pi = 3.1416; the substring pi is a lexeme for the token identifier.
In most programming languages, the following constructs are treated as tokens:
keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as
parentheses, commas, and semicolons.

TOKEN      SAMPLE LEXEMES         INFORMAL DESCRIPTION OF PATTERN

const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or >= or >
id         pi, count, D2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    "core dumped"          any characters between " and " except "

In the example, when the character sequence pi appears in the source program, the token representing an identifier is returned to the parser. The returning of a token is often implemented by passing an integer corresponding to the token. It is this integer that is referred to as the boldface id in the above table.

A pattern is a rule describing a set of lexemes that can represent a particular token in
source program. The pattern for the token const in the above table is just the single string const
that spells out the keyword.

Certain language conventions impact the difficulty of lexical analysis. Languages such as FORTRAN require certain constructs to appear in fixed positions on the input line. Thus the alignment of a lexeme may be important in determining the correctness of a source program.
 

Attributes of Token

The lexical analyzer returns to the parser a representation for the token it has found. The representation is an integer code if the token is a simple construct such as a left parenthesis, comma, or colon. The representation is a pair consisting of an integer code and a pointer to a table if the token is a more complex element such as an identifier or constant. The integer code gives the token type; the pointer points to the value of that token. Pairs are also returned whenever we wish to distinguish between instances of a token.
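A sketch of what such a representation might look like in C; the names token_type, attribute, and symbol are assumptions made for illustration, not part of the notes:

/* Token codes; identifiers and numbers carry a pointer attribute. */
enum token_type { TOK_LPAREN, TOK_COMMA, TOK_COLON, TOK_ID, TOK_NUM };

struct symbol;                    /* symbol-table record, declared elsewhere */

struct token {
    enum token_type type;         /* the integer code for the token          */
    union {
        struct symbol *entry;     /* for id and num: pointer to table entry  */
        int            none;      /* simple tokens carry no attribute        */
    } attribute;
};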

Input Buffering

The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point, until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice each buffering scheme adopts one convention: either a pointer is at the symbol last read or at the symbol it is ready to read.

(two-half input buffer, with a token-beginning pointer and a lookahead pointer)

The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see

            DECLARE (ARG1, ARG2… ARG n)

without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file.

Since the buffer shown in the above figure is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.
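A C sketch of this two-half buffering scheme, using a sentinel character at the end of each half so that the common case of advancing the lookahead pointer needs only one test. The buffer size, the use of '\0' as the sentinel, and the function names are assumptions made for the sketch; it also assumes the source text contains no NUL bytes:

#include <stdio.h>

#define HALF 100                          /* characters per buffer half              */
static char  buf[2 * HALF + 2];           /* two halves, each followed by a sentinel */
static char *lexeme_begin;                /* the lexeme runs from here up to forward */
static char *forward;                     /* lookahead pointer                       */
static FILE *src;

/* Refill one half with source text and drop a sentinel after the data read. */
static void load_half(char *start) {
    size_t n = fread(start, 1, HALF, src);
    start[n] = '\0';
}

static void init_buffer(FILE *fp) {
    src = fp;
    load_half(buf);
    lexeme_begin = forward = buf;
}

/* Return the next character and advance the lookahead pointer,
   reloading the other half of the buffer when a sentinel is reached. */
static int next_char(void) {
    int c = (unsigned char)*forward;
    if (c == '\0') {                                  /* hit a sentinel          */
        if (forward == buf + HALF) {                  /* end of first half       */
            load_half(buf + HALF + 1);
            forward = buf + HALF + 1;
        } else if (forward == buf + 2 * HALF + 1) {   /* end of second half      */
            load_half(buf);
            forward = buf;
        } else {
            return EOF;                               /* true end of input       */
        }
        c = (unsigned char)*forward;
        if (c == '\0') return EOF;                    /* refill found no data    */
    }
    forward++;
    return c;
}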

Specification of tokens

Strings and Languages

The term alphabet or character class denotes any finite set of symbols. Typical examples of symbols are letters and characters. The set {0, 1} is the binary alphabet.

A string over some alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms sentence and word are often used as synonyms for the term "string".

The term language denotes any set of strings over some fixed alphabet. This definition is very broad. Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.

Certain terms for parts of a string are prefix, suffix, substring, and subsequence. There are several important operations, such as union, concatenation, and closure, that can be applied to languages.

 
Regular Expressions

In Pascal, an identifier is a letter followed by zero or more letters or digits. Regular expressions allow us to define precisely sets such as this. With this notation, Pascal identifiers may be defined as

            letter ( letter | digit )*

The vertical bar here means "or", the parentheses are used to group subexpressions, the star means "zero or more instances of" the parenthesized expression, and the juxtaposition of letter with the remainder of the expression means concatenation.

A regular expression is built up out of simpler regular expressions using set of defining
rules. Each regular expression r denotes a language L(r). The defining rules specify how L(r) is
formed by combining in various ways the languages denoted by the subexpressions of r.

Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions that:

1. the unary operator * has the highest precedence and is left associative,
2. concatenation has the second highest precedence and is left associative,
3. | has the lowest precedence and is left associative.

Under these conventions, for example, (a)|((b)*(c)) may be written as a|b*c; both denote the set of strings that are either a single a or zero or more b's followed by a c.

Regular Definitions

If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form d → r, d' → r', …, where d, d', … are distinct names and r, r', … are regular expressions over the symbols in Σ ∪ {d, d', …}, i.e., the basic symbols and the previously defined names.

Example: The set of Pascal identifiers is the set of strings of letters and digits beginning with a letter. The regular definition for the set is

            letter → A | B | … | Z | a | b | … | z

            digit → 0 | 1 | 2 | … | 9

            id → letter ( letter | digit )*

Unsigned numbers in Pascal are strings such as 5280, 56.77, 6.25E4, etc. The following regular definition provides the building blocks for a precise specification of this class of strings:

            digit → 0 | 1 | 2 | … | 9

            digits → digit digit*

This definition says that digit can be any single digit from 0 to 9, while digits is a digit followed by zero or more occurrences of a digit.
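For reference, the remaining definitions of the standard textbook treatment, which build num out of digits and cover strings such as 5280, 56.77 and 6.25E4, are sketched below (they are not part of the original notes):

            optional_fraction → . digits | ε

            optional_exponent → ( E ( + | - | ε ) digits ) | ε

            num → digits optional_fraction optional_exponent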

Notational Shorthands

Certain constructs occur so frequently in regular expressions that it is convenient to introduce


notational shorthands for them.

1.      One or more instances. The unary postfix operator + means "one or more instances of"; the notation r+ is shorthand for rr*.

2.      Zero or one instance. The unary postfix operator ? means "zero or one instance of". The notation r? is a shorthand for r | ε.

3.      Character classes. The notation [abc], where a, b, and c are alphabet symbols, denotes the regular expression a|b|c. An abbreviated character class such as [a-z] denotes the regular expression a|b|c|…|z.

Recognition of tokens

The question of how to recognize the tokens is handled in this section. The language
generated by the following grammar is used as an example.

Consider the following grammar fragment:

stmt if expr then stmt

            |if expr then stmt else stmt

            |

exprterm relop term

            |term
termid

            |num

where the terminals if, then, else, relop, id, and num generate sets of strings given by the following regular definitions:

if → if

then → then

else → else

relop → < | <= | = | <> | > | >=

id → letter ( letter | digit )*

num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

For this language fragment the lexical analyzer will recognize the keywords if, then, else,
as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords
are reserved; that is, they cannot be used as identifiers. Unsigned integer and real numbers of
Pascal are represented by num.

In addition, we assume lexemes are separated by white space, consisting of nonnull sequences of
blanks, tabs and newlines. Our lexical analyzer will strip out white space. It will do so by
comparing a string against the regular definition ws, below.

delim → blank | tab | newline

ws → delim+

If a match for ws is found, the lexical analyzer does not return a token to the parser.
Rather, it proceeds to find a token following the white space and returns that to the parser. Our
goal is to construct a lexical analyzer that will isolate the lexeme for the next token in the input
buffer and produce as output a pair consisting of the appropriate token and attribute value, using
the translation table given in the figure. The attribute values for the relational operators are given
by the symbolic constants LT,LE,EQ,NE,GT,GE.

 
REGULAR EXPRESSION    TOKEN    ATTRIBUTE VALUE

ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
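A sketch of how this translation table might be written as Lex rules. The header tokens.h, the helpers install_id() and install_num(), and the token and attribute constants are assumptions borrowed from the usual textbook presentation, not definitions made in these notes:

%{
/* Hypothetical header defining IF, THEN, ELSE, ID, NUM, RELOP, LT, LE, ...,
   yylval, install_id() and install_num(). */
#include "tokens.h"
%}

delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      { /* no token returned; white space is stripped */ }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = install_id();  return ID; }
{number}  { yylval = install_num(); return NUM; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
"="       { yylval = EQ; return RELOP; }
"<>"      { yylval = NE; return RELOP; }
">"       { yylval = GT; return RELOP; }
">="      { yylval = GE; return RELOP; }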

Transition diagram

A transition diagram is a stylized flowchart. A transition diagram is used to keep track of information about characters that are seen as the forward pointer scans the input. We do so by moving from position to position in the diagrams as characters are read.

Positions in a transition diagram are drawn as circles and are called states. The states are connected by arrows, called edges. Edges leaving state s have labels indicating the input characters that can next appear after the transition diagram has reached state s. The label other refers to any character that is not indicated by any of the other edges leaving s.

One state is labelled the start state; it is the initial state of the transition diagram where control resides when we begin to recognize a token. Certain states may have actions that are executed when the flow of control reaches those states. On entering a state we read the next input character. If there is an edge from the current state whose label matches this input character, we then go to the state pointed to by the edge. Otherwise, we indicate failure. A transition diagram for >= is shown in the figure.

start → ( > ) → ( = ) → accept >=
(the edge labelled other handles the case where > is not followed by =)

Fig 5. Transition diagram for >=
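A small C sketch of how this diagram might be turned into code by hand. The helpers nextchar() and retract(), and the token codes GT and GE, are illustrative assumptions, not functions defined in these notes:

int nextchar(void);       /* assumed: returns the next input character         */
void retract(int n);      /* assumed: pushes n characters back onto the input  */
enum { GT = 1, GE };      /* illustrative attribute codes for the relop token  */

/* Recognize '>' versus '>='.  Assumes the '>' has just been read. */
int recognize_greater(void) {
    int c = nextchar();           /* state reached after seeing '>'             */
    if (c == '=')
        return GE;                /* accepting state for ">="                   */
    retract(1);                   /* "other": push the character back           */
    return GT;                    /* accepting state for ">" alone              */
}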

Lexical analyzer generators

Lex: A Lexical Analyzer Generator

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.
 

Lex source is a table of regular expressions and corresponding program fragments.   The
table is translated to a program which reads an input stream, copying it to an output stream and
partitioning the input into strings which match the given expressions. As each such string is
recognized the corresponding program fragment is executed. The recognition of the expressions
is performed by a deterministic finite automaton generated by Lex.  The program fragments
written by the user are executed in the order in which the corresponding regular expressions
occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Introduction.

Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general-purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source given to Lex. The Lex-written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called "host languages."

Lex can write code in different host languages. The host language is used for the output
code generated by Lex and also for the program fragments added by the user. Compatible run-
time libraries for the different host languages are also provided. This makes Lex adaptable to
different environments and different users. Each application may be directed to the combination
of hardware and host language appropriate to the task, the user’s background, and properties of
local implementations.
Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected.

Source Lex yylex

            Inputyylex Output

An overview of Lex

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of
lines.

%%

[ \t]+$ ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate a character class made of blank and tab; the + indicates "one or more ..."; and the $ indicates "end of line," as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:

%%

[ \t]+$ ;
[ \t]+ printf (" ");

The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs. Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context-free grammars, but requires a lower-level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown below. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

lexical rules → Lex → yylex

grammar rules → Yacc → yyparse

Input → yylex → yyparse → Parsed input

Lex with Yacc

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so that the use of this name by Lex simplifies interfacing. Lex generates a deterministic finite automaton from the regular expressions in the source. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine. Lex is not limited to source which can be interpreted on the basis of one-character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. Such backup is more costly than the processing of simpler languages.

Lex Source.

General format of Lex source is:

{definitions}

%%

{rules}

%%

{user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is %% (no definitions, no rules), which translates into a program which copies the input to the output unchanged. In the outline of Lex programs shown, the rules represent the user's control decisions; they are a table, in which the left column contains regular expressions and the right column contains actions to be executed when the expressions are recognized. Thus an individual rule might appear as

integer    printf("found keyword INT");

to look for the string integer in the input stream and print the message "found keyword INT" whenever it appears. In this example the host procedural language is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling.

Lex rules such as

  colour printf ("color");

  Mechanise printf ("mechanize");

  petrol printf("gas");

would be a start.
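Putting the three sections together, a complete Lex source file for this spelling-change example could look roughly like the sketch below; the main and yywrap functions in the user-subroutines section are conventional boilerplate added here for illustration, not taken from the original notes:

%{
#include <stdio.h>
int yylex(void);
%}

%%

colour      printf("color");
mechanise   printf("mechanize");
petrol      printf("gas");

%%

/* User subroutines: a minimal driver that runs the generated analyzer. */
int main(void) {
    yylex();            /* copies input to output, applying the rules above */
    return 0;
}

int yywrap(void) {
    return 1;           /* no further input files */
}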

Lex Regular Expressions.

A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression integer matches the string integer wherever it appears, and the expression a57D looks for the string a57D.

Metacharacter Matches

.           any character except newline

\n         newline

*          zero or more copies of preceding expression

+          one or more copies of preceding expression

?          zero or one copy of preceding expression

^          beginning of line

$          end of line


a|b        a or b

(ab)+    one or more copies of ab (grouping)

"a+b"    literal “a+b” (C escapes still work)

[ ]         character class

Expression        Matches

abc               abc
abc*              ab, abc, abcc, abccc, …
abc+              abc, abcc, abccc, …
a(bc)+            abc, abcbc, abcbcbc, …
a(bc)?            a, abc
[abc]             a, b, c
[a-z]             any letter, a through z
[a\-z]            a, -, z
[-az]             -, a, z
[A-Za-z0-9]+      one or more alphanumeric characters
[ \t\n]+          whitespace
[^ab]             anything except a or b
[a^b]             a, ^, b
[a|b]             a, |, b
a|b               a or b

name                     function

int yylex(void)          call to invoke the lexer; returns a token
char *yytext             pointer to the matched string
yyleng                   length of the matched string
yylval                   value associated with the token
int yywrap(void)         wrap-up; return 1 if done, 0 if not done
FILE *yyout              output file
FILE *yyin               input file
INITIAL                  initial start condition
BEGIN condition          switch to the named start condition
ECHO                     write the matched string
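As an illustration of how these names are used from the host program, a hedged C sketch of a driver for a Lex-generated analyzer follows; it assumes the rules return nonzero token codes and that yylex returns 0 at end of input, which is the usual convention but is not spelled out in these notes:

#include <stdio.h>

extern FILE *yyin;            /* input file used by the generated analyzer */
extern char *yytext;          /* text of the most recently matched lexeme  */
extern int   yyleng;          /* its length                                */
int yylex(void);              /* the analyzer generated by Lex             */

int main(int argc, char **argv) {
    if (argc > 1)
        yyin = fopen(argv[1], "r");   /* read from a file instead of stdin */
    int tok;
    while ((tok = yylex()) != 0)      /* 0 conventionally signals end of input */
        printf("token %d, lexeme \"%s\" (%d chars)\n", tok, yytext, yyleng);
    return 0;
}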

Finite automata

A recognizer for a language is a program that takes as input a string x and answers 'yes' if x is a sentence of the language and 'no' otherwise. We compile a regular expression into a recognizer by constructing a transition diagram called a finite automaton. A finite automaton can be deterministic or nondeterministic, where nondeterministic means that more than one transition out of a state may be possible on the same input symbol.

DFAs are faster recognizers than NFAs, but they can be much bigger than equivalent NFAs.

Non deterministic finite automata

A mathematical model consisting of:

1) a set of states S

2) a set of input symbols (the input alphabet)

3) a transition function that maps state-symbol pairs to sets of states

4) an initial (start) state

5) a set of final (accepting) states

Transition table:

state Input symbols

  a b
0 {0,1} {0}

1 - {2}

 
2 - {3}

Deterministic finite automata

A special case of the NFA in which:

1) no state has an epsilon-transition, and

2) for each state s and input symbol a, there is at most one edge labeled a leaving s.

Conversion of nfa to dfa

Subset construction algorithm

Input: an NFA N

Output: an equivalent DFA D

Method:

Operations on NFA states:

operation            description

ε-closure(s)         set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)         set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)           set of NFA states to which there is a transition on input symbol a from some NFA state s in T

Subset construction:

Initially, ε-closure(s0) is the only state in D-states, and it is unmarked.

While there is an unmarked state T in D-states do begin

     mark T;

     for each input symbol a do begin

          U := ε-closure(move(T, a));

          if U is not in D-states then

               add U as an unmarked state to D-states;

          Dtran[T, a] := U

     end

end
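A compact C sketch of these operations, representing each set of NFA states as a 32-bit mask. The table sizes, the array names, and the two-symbol alphabet are assumptions made for the sketch, and the NFA tables eps[] and delta[][] are assumed to have been filled in elsewhere:

#include <stdint.h>

#define MAX_NFA   32                    /* NFA states fit in one 32-bit mask   */
#define MAX_DFA   256
#define NSYMS     2                     /* input symbols: 0 = 'a', 1 = 'b'     */

/* eps[s]      : bit mask of states reachable from s by a single ε-edge        */
/* delta[s][c] : bit mask of states reachable from s on input symbol c         */
static uint32_t eps[MAX_NFA];
static uint32_t delta[MAX_NFA][NSYMS];

/* ε-closure of a set T of NFA states. */
static uint32_t eps_closure(uint32_t T) {
    uint32_t closure = T, old;
    do {
        old = closure;
        for (int s = 0; s < MAX_NFA; s++)
            if (closure & (1u << s))
                closure |= eps[s];
    } while (closure != old);
    return closure;
}

/* move(T, c): states reachable from some state in T on symbol c. */
static uint32_t move_set(uint32_t T, int c) {
    uint32_t out = 0;
    for (int s = 0; s < MAX_NFA; s++)
        if (T & (1u << s))
            out |= delta[s][c];
    return out;
}

/* Subset construction: build dtran and return the number of DFA states. */
static uint32_t dstates[MAX_DFA];
static int      dtran[MAX_DFA][NSYMS];

int subset_construction(int start_state) {
    int ndfa = 0, unmarked = 0;
    dstates[ndfa++] = eps_closure(1u << start_state);
    while (unmarked < ndfa) {                       /* unmarked states remain    */
        uint32_t T = dstates[unmarked];
        for (int c = 0; c < NSYMS; c++) {
            uint32_t U = eps_closure(move_set(T, c));
            int j;
            for (j = 0; j < ndfa; j++)              /* is U already in D-states? */
                if (dstates[j] == U) break;
            if (j == ndfa) dstates[ndfa++] = U;     /* add U, unmarked           */
            dtran[unmarked][c] = j;                 /* Dtran[T, c] := U          */
        }
        unmarked++;                                 /* mark T                    */
    }
    return ndfa;
}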

CONSTRUCTION OF AN NFA FROM A REGULAR EXPRESSION

Thomson’s Construction

To convert a regular expression r over an alphabet Σ into an NFA N accepting L(r):

Parse r into its constituent sub-expressions.

Construct NFAs for each of the basic symbols in r.

For ε, construct the NFA

            i --ε--> f

Here i is a new start state and f is a new accepting state. This NFA recognizes {ε}.

For a in Σ, construct the NFA

            i --a--> f

Again i is a new start state and f is a new accepting state. This NFA accepts {a}.

If a occurs several times in r, then a separate NFA is constructed for each occurrence.

Keeping the syntactic structure of the regular expression in mind, combine these NFAs inductively until the NFA for the entire expression is obtained. Each intermediate NFA produced during the course of the construction corresponds to a sub-expression r and has several important properties: it has exactly one final state, no edge enters the start state, and no edge leaves the final state.

Suppose N(s) and N(t) are NFAs for regular expressions s and t.

            (a) For the regular expression s|t, construct the composite NFA N(s|t): a new start state i has ε-edges to the start states of N(s) and N(t), and the accepting states of N(s) and N(t) have ε-edges to a new accepting state f.

            (b) For the regular expression st, construct the composite NFA N(st): the accepting state of N(s) is identified with the start state of N(t).

            (c) For the regular expression s*, construct the composite NFA N(s*): a new start state i and a new accepting state f are added, with ε-edges from i to the start of N(s) and to f, from the accepting state of N(s) back to its start state, and from the accepting state of N(s) to f.

            (d) For the parenthesized regular expression (s), use N(s) itself as the NFA.
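A C sketch of how these fragment-combining rules might be coded. All type and function names are illustrative assumptions; also note that this sketch links fragments with extra ε-edges rather than merging states, which keeps the code short while still accepting the same language:

#include <stdlib.h>

/* A Thompson NFA state: at most one labelled edge, or up to two ε-edges. */
struct state {
    int c;                      /* transition label, EPS, or NONE            */
    struct state *out1, *out2;  /* successor states (out2 used only for ε)   */
};
#define EPS   (-1)              /* label meaning "ε-edge"                    */
#define NONE  (-2)              /* label meaning "accepting, no out edges"   */

/* An NFA fragment with one start state and exactly one accepting state. */
struct frag { struct state *start, *accept; };

static struct state *new_state(int c, struct state *o1, struct state *o2) {
    struct state *s = malloc(sizeof *s);
    s->c = c; s->out1 = o1; s->out2 = o2;
    return s;
}

/* Basic NFA for a single symbol a:   i --a--> f */
struct frag nfa_symbol(int a) {
    struct state *f = new_state(NONE, NULL, NULL);
    struct state *i = new_state(a, f, NULL);
    return (struct frag){ i, f };
}

/* N(s|t): new start with ε-edges to N(s) and N(t); their accepts ε-link to a new f. */
struct frag nfa_alt(struct frag s, struct frag t) {
    struct state *f = new_state(NONE, NULL, NULL);
    struct state *i = new_state(EPS, s.start, t.start);
    s.accept->c = EPS; s.accept->out1 = f;
    t.accept->c = EPS; t.accept->out1 = f;
    return (struct frag){ i, f };
}

/* N(st): the accepting state of N(s) is ε-linked to the start of N(t). */
struct frag nfa_concat(struct frag s, struct frag t) {
    s.accept->c = EPS; s.accept->out1 = t.start;
    return (struct frag){ s.start, t.accept };
}

/* N(s*): new start/accept states with ε-edges allowing zero or more passes through N(s). */
struct frag nfa_star(struct frag s) {
    struct state *f = new_state(NONE, NULL, NULL);
    struct state *i = new_state(EPS, s.start, f);
    s.accept->c = EPS; s.accept->out1 = s.start; s.accept->out2 = f;
    return (struct frag){ i, f };
}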

Design of a lexical analyser

A lexical analyzer generator creates a lexical analyser using a set of specifications usually in the
format

                        p1         {action 1}

                        p2         {action 2}

                        . . . . . . . . . . . .

                        pn         {action n}

where each pi is a regular expression and each actioni is a program fragment that is to be executed whenever a lexeme matched by pi is found in the input. If more than one pattern matches, then the longest lexeme matched is chosen. If there are two or more patterns that match the longest lexeme, the first listed matching pattern is chosen.

                 This is usually implemented using a finite automaton. There is an input buffer with
two pointers to it, a lexeme-beginning and a forward pointer. The lexical analyser generator
constructs a transition table for a finite automaton from the regular expression patterns in the
lexical analyser generator specification. The lexical analyser itself consists of a finite automaton
simulator that uses this transition table to look for the regular expression patterns in the input
buffer.
                 This can be implemented using an NFA or a DFA. The transition table for an NFA is
considerably smaller than that for a DFA, but the DFA recognises patterns faster than the NFA.

Using NFA

The transition table for the NFA N is constructed for the composite pattern p1|p2|. . .|pn. The NFA recognises the longest prefix of the input that is matched by a pattern. In the final NFA, there is an accepting state for each pattern pi. The sequence of sets of states that the NFA can be in after seeing each input character is constructed. The NFA is simulated until it reaches termination or it reaches a set of states from which there is no transition defined for the current input symbol. The specification for the lexical analyser generator is written so that a valid source program cannot entirely fill the input buffer without having the NFA reach termination. To find the correct match two things are done. Firstly, whenever an accepting state is added to the current set of states, the current input position and the pattern pi corresponding to this accepting state are recorded. If the current set of states already contains an accepting state, then only the pattern that appears first in the specification is recorded. Secondly, transitions are taken until termination is reached. Upon termination, the forward pointer is retracted to the position at which the last match occurred. The pattern making this match identifies the token found, and the lexeme matched is the string between the lexeme-beginning and forward pointers. If no pattern matches, the lexical analyser should transfer control to some default recovery routine.

Using DFA

                 Here a DFA is used for pattern matching. This method is a modified version of the
method using NFA. The NFA is converted to a DFA using a subset construction algorithm. Here
there may be several accepting states in a given subset of nondeterministic states. The accepting
state corresponding to the pattern listed first in the lexical analyser generator specification has
priority. Here also state transitions are made until a state is reached which has no next state for
the current input symbol. The last input position at which the DFA entered an accepting state
gives the lexeme.
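A hedged C sketch of the longest-match loop this paragraph describes; the transition table dtran, the accepting-pattern table accepting[], the sentinel-terminated buffer, and the pointer forward are assumed to have been set up elsewhere (for instance by the subset construction and buffering sketches shown earlier):

#define NO_STATE   (-1)
#define NO_PATTERN (-1)

extern int dtran[][256];        /* DFA transition table: state x input char        */
extern int accepting[];         /* pattern number accepted in a state, or -1       */
extern char *forward;           /* lookahead pointer into the input buffer         */

/* Scan the longest lexeme starting at *forward; return the pattern matched. */
int scan_longest_match(void) {
    int state = 0;                      /* DFA start state                          */
    int last_pattern = NO_PATTERN;
    char *last_accept_pos = NULL;

    while (state != NO_STATE && *forward != '\0') {   /* '\0' is the buffer sentinel */
        state = dtran[state][(unsigned char)*forward];
        if (state != NO_STATE) {
            forward++;
            if (accepting[state] != NO_PATTERN) {     /* remember the last accept    */
                last_pattern = accepting[state];
                last_accept_pos = forward;
            }
        }
    }
    if (last_pattern == NO_PATTERN)
        return NO_PATTERN;              /* caller invokes the recovery routine       */
    forward = last_accept_pos;          /* retract to the last accepting position    */
    return last_pattern;
}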
