
PANIMALAR ENGINEERING COLLEGE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


III YEAR – VI SEMESTER
CS 8602-COMPILER DESIGN

UNIT I - INTRODUCTION TO COMPILERS 9


Structure of a compiler – Lexical Analysis – Role of Lexical Analyzer – Input Buffering –
Specification of Tokens – Recognition of Tokens – Lex – Finite Automata – Regular
Expressions to Automata – Minimizing DFA.

INTRODUCTION TO COMPILERS
TRANSLATORS-COMPILATION AND INTERPRETATION
TRANSLATOR
A translator is a program that takes as input a program written in one
language and produces as output a program in another language. Besides
program translation, the translator performs another very important role,
error detection. Any violation of the HLL (High Level Language)
specification is detected and reported to the programmer.
Important roles of a translator are:
 Translating the HLL program input into an equivalent machine-language (ML) program.
 Providing diagnostic messages wherever the programmer violates the
specification of the HLL.
A translator or language processor is a program that translates an
input program written in a programming language into an equivalent
program in another language.
Fig 1.1.1: Execution in Translator — Source Code → Translator → Target Code

Types of Translators:
a. Interpreter
b. Assembler
c. Compiler

1.1.1.a INTERPRETER
An interpreter is a program that appears to execute a source program
as if it were machine language. It is one of the translators that translate
high level language to low level language.
Fig 1.1.2: Execution in Interpreter — Source Program + Data → Interpreter → Program Output
During execution, it checks line by line for errors. Languages such as
BASIC, SNOBOL, LISP can be translated using interpreters. JAVA also uses
interpreter. The process of interpretation can be carried out in following
phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Example: BASIC , Lower Version of Pascal, SNOBOL, LISP & JAVA
Advantages:
 Modification of user program can be easily made and implemented as
execution proceeds.
 The type of object that a variable denotes may change dynamically.
 Debugging a program and finding errors is a simpler task, since the
interpreter works directly on the source program.
 The interpreter for the language makes it machine independent.
Disadvantages:
 The execution of the program is slower.
 Memory consumption is more.
1.1.1.b. ASSEMBLER
Programmers found it difficult to write or read programs in machine
language. They begin to use a mnemonic (symbols) for each machine
instruction, which they would subsequently translate into machine
language. Such a mnemonic machine language is now called an assembly
language. Programs known as assembler were written to automate the
translation of assembly language in to machine language. The input to an
assembler program is called source program, the output is a machine
language translation (object program).
It translates assembly level language to machine code.
Fig 1.1.3: Execution in Assembler — Assembly Language → Assembler → Machine Code


Example: Microprocessor 8085, 8086.
Advantages:
 Debugging and verifying
 Making compilers->Understanding assembly coding techniques is
necessary for making compilers, debuggers and other development
tools.
 Optimizing code for size
 Optimizing code for speed.
Disadvantages:
 Development time. Writing code in assembly language takes much
longer than writing in a high-level language.
 Reliability and security. It is easy to make errors in assembly code.
 Debugging and verifying. Assembly code is more difficult to debug and
verify because there are more possibilities for errors than in high-level
code.
 Portability. Assembly code is platform-specific. Porting to a different
platform is difficult
1.1.1.c. COMPILER
A compiler is a translator program that translates a program written in
a high-level language (HLL), the source program, into an equivalent program
in machine-level language (MLL), the target program. An important role of a
compiler is reporting errors to the programmer.
Fig 1.1.4: Execution in Compiler — Source Code → Compiler → Target Code, with error messages reported to the programmer
Executing a program written in an HLL programming language
basically has two parts. The source program must first be compiled (translated)
into an object program. Then the resulting object program is loaded into
memory and executed.

Fig 1.1.5: Execution process of a source program in a compiler —
Source Code → Compiler → Object Code; then Object Program + Input → Output

Example: C, C++, COBOL, higher version of Pascal.


List of Compilers :

 Ada compilers
 ALGOL compilers
 BASIC compilers
 C# compilers
 C compilers
 C++ compilers
 COBOL compilers
 D compilers
 Common Lisp compilers
 Fortran compilers
 Java compilers
 Pascal compilers
 PL/I compilers
 Python
Difference between Compiler and Interpreter
1. Compiler: works on the complete program at once; it takes the entire
program as input.
Interpreter: works line-by-line; it takes one statement at a time as input.
2. Compiler: generates intermediate code, called the object code or machine code.
Interpreter: does not generate intermediate object code or machine code.
3. Compiler: executes conditional control statements (like if-else and
switch-case) and logical constructs faster than an interpreter.
Interpreter: executes conditional control statements at a much slower speed.
4. Compiler: compiled programs take more memory because the entire object
code has to reside in memory.
Interpreter: does not generate intermediate object code; as a result,
interpreted programs are more memory efficient.
5. Compiler: compile once and run anytime; a compiled program does not
need to be compiled every time.
Interpreter: interpreted programs are interpreted line-by-line every
time they are run.
6. Compiler: does not allow a program to run until it is completely error-free.
Interpreter: runs the program from the first line and stops execution
only if it encounters an error.
7. Compiler: compiled languages are more efficient but difficult to debug.
Interpreter: interpreted languages are less efficient but easier to debug;
this makes such languages an ideal choice for new students.
8. Compiler examples: C, C++, COBOL.
Interpreter examples: BASIC, Visual Basic, Python, Ruby, PHP, Perl,
MATLAB, Lisp.

Language processors (COUSINS OF THE COMPILER or Language Processing System):
1. Preprocessors :
It produces input to Compiler. They may perform the following
functions.
Macro Processing:
A preprocessor may allow a user to define macros that are
shorthands for longer constructs.
File inclusion:
A preprocessor may include header files into the program text. For
example, the C preprocessor causes the contents of the file <global.h> to replace
the statement #include<global.h> when it processes a file containing this
statement.
Rational preprocessors:
These preprocessors augment older languages with more modern flow-of-control
and data-structuring facilities. If a construct like the while-statement
does not exist in the programming language, then this preprocessor provides
it.
Language extensions :
These preprocessors attempt to add capabilities to the language by
what amounts to built-in macros. For example, Equel, a database query
language embedded in C: statements beginning with ## are taken by the
preprocessor to be database-access statements, unrelated to C, and are
translated into procedure calls on routines that perform the database access.

2. Compiler :
It converts the source program (HLL) into target program (LLL).
3. Assemblers :
It converts an assembly language (LLL) into machine code. Some
compilers produce assembly for further processing. Other compilers perform
the job of the assembler, producing relocatable machine code that can be
passed directly to the loader/link-editor.
Assembly code is a mnemonic version of the machine code, in which
names are used instead of binary codes for operation and names are also
given to memory addresses. A typical sequence of assembly instructions
might be
MOV a, R1
ADD #2, R1
MOV R1, b
4. Loader and Link Editors :
Loader :
The process of loading consists of taking relocatable machine code,
altering the relocatable addresses and placing the altered instructions and
data in memory at the proper locations. The Link-editor allows us to make a
single program from several files of relocatable machine code. These files
may have been the result of several different compilations, and one or more
may be library files of routines provided by the system and available to any
program that needs them.
Link Editor :
It allows us to make a single program from several files of relocatable
machine code.
Software Tools
Many software tools that manipulate source program first perform some
kind of analysis. Some examples of such tools include:
1. STRUCTURE EDITORS: A structure editor takes as input a sequence
of commands to build a source program. The structure editor not only
performs the text creation and modification functions of an ordinary
text editor but it also analyses the program text. For example, it can
check that the input is correctly formed, can supply keywords
automatically etc., The output of such an editor is often similar to the
output of the analyses phase of the compiler.
2. PRETTY PRINTERS: A Pretty Printer analyzes a program and prints
it in such a way that the structure of the program becomes clearly
visible. For example, comments may appear in a special font, and
statements may appear with the amount of indentation proportional
to the depth of the nesting.
3. STATIC CHECKERS: A Static Checker reads a program, analyzes it
and attempts to discover potential bugs without running the program.
The analysis portion is often similar to that found in optimizing
compilers. For example, a static checker may detect that parts of the
source program can never be executed or that a certain variable might
be used before being defined. In addition it can catch logical errors.
4. INTERPRETERS: Instead of producing a target program as a
translation, an interpreter performs the operation implied by the
source program. For an assignment statement, for example, an
interpreter builds a tree and then carry out the operations at the
nodes.

THE ANALYSIS-SYNTHESIS MODEL OF COMPILATION (Parts of compilation)
There are two parts to this mapping: analysis and synthesis.
The analysis part breaks up the source program into constituent
pieces and imposes a grammatical structure on them. It then uses this
structure to create an intermediate representation of the source program.
The synthesis part constructs the desired target program from the
intermediate representation and the information in the symbol table. The
analysis part is often called the front end of the compiler; the synthesis part
is the back end.
During analysis, the operations implied by the source program are
determined and recorded in a hierarchical structure called a TREE. Often a
special kind of tree called a SYNTAX TREE is used.
The syntax tree for the expression position := initial + rate * 60 is
shown below.

THE STRUCTURE OF A COMPILER:

The first three phases form the analysis portion of a compiler and the
last three phases form the synthesis portion of a compiler. Two other
activities, symbol-table management and error handling, are shown
interacting with the six phases of lexical analysis, syntax analysis,
semantic analysis, intermediate code generation, code optimization, and
code generation. Informally, we shall also call the symbol-table manager
and the error handler “phases”.

Symbol-Table Management
An essential function of a compiler is to record the identifiers used in
the source program and collect information about various attributes of each
identifier. These attributes may provide information about the storage
allocated for an identifier, its type, its scope(where in the program it is valid),
and in the case of procedure names, such things as the number and types of
its arguments, the method of passing each argument, and the type returned,
if any.

A symbol table is a data structure containing a record for each
identifier, with fields for the attributes of the identifier. The data structure
allows us to find the record for each identifier quickly.
Var position, initial, rate: real;
The type real is not known when position, initial, and rate are seen by
the lexical analyzer.
The remaining phases enter information about identifiers into the symbol
table and then use this information in various ways. For example, when
doing semantic analysis and intermediate code generation, we need to know
what the types of identifiers are, so we can check that the source program
uses them in valid ways, and so that we can generate the proper operations
on them. The code generator typically enters and uses detailed information
about the storage assigned to identifiers.
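As a small illustration of the idea, the following is a minimal sketch in C of one way such a record and lookup could be organized; the field names and sizes are illustrative assumptions, not a prescribed layout.

#include <string.h>

enum type { T_UNKNOWN, T_INT, T_REAL };

struct symbol {
    char name[32];    /* the identifier's lexeme                       */
    enum type type;   /* filled in later, during semantic analysis     */
    int scope_level;  /* where in the program the identifier is valid  */
    int offset;       /* storage assigned to it by the code generator  */
};

static struct symbol table[256];
static int nsyms = 0;

/* Return the index of name's record, inserting a new record if absent. */
int lookup_or_insert(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strncpy(table[nsyms].name, name, sizeof table[nsyms].name - 1);
    table[nsyms].type = T_UNKNOWN;  /* type not yet known at lexing time */
    return nsyms++;
}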

Error detection and Reporting


Each phase can encounter errors. However, after detecting an error, a
phase must somehow deal with that error, so that compilation can proceed,
allowing further errors in the source program to be detected.
The syntax and semantic analysis phases usually handle a large
fraction of the errors detectable by the compiler. The lexical phase can detect
errors where the characters remaining in the input do not form any token of
the language. During semantic analysis the compiler tries to detect
constructs that have the right syntactic structure but no meaning to the
operation involved. Example: b = a[1] + add, where add is a procedure
name.
A compiler consists of 6 phases.
1) Lexical analysis - converts the stream of characters into a sequence of
tokens. Input is the source program and the output is tokens.
2) Syntax analysis - input is tokens and the output is a parse tree.
3) Semantic analysis - input is the parse tree and the output is an expanded
version of the parse tree.
4) Intermediate code generation - here errors are checked and an
intermediate code is produced.
5) Code optimization - the intermediate code is optimized here to get a
better target program.
6) Code generation - this is the final step; here the target program code is
generated.

Fig: Phases of a compiler —
Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer →
Intermediate Code Generator → Code Optimizer → Code Generator → Target Program,
with the Symbol Table Manager and the Error Handler interacting with all six phases.
ANALYSIS OF THE SOURCE PROGRAM:
Analysis consists of three phases.
1. LINEAR ANALYSIS: The stream of characters making up the source
program is read from left-to-right and grouped into tokens that are
sequence of characters having a collective meaning.
2. HIERARCHICAL ANALYSIS: Characters or tokens are grouped
hierarchically into nested collections with collective meaning.
3. SEMANTIC ANALYSIS: Certain checks are performed to ensure that
the components of the program fit together meaningfully.
LEXICAL ANALYSIS:
In a compiler, Linear analysis is called lexical analysis or scanning.
For example, in lexical analysis the characters in the assignment statement
position: = initial + rate * 60
would be grouped in to the following tokens
1. The identifier position
2. The assignment symbol :=
3. The identifier initial
4. The plus sign
5. The identifier rate
6. The multiplication sign
7. The number 60
The blanks are usually eliminated during lexical analysis.

SYNTAX ANALYSIS:
Hierarchical analysis is called parsing or syntax analysis. It involves
grouping the tokens of the source program into grammatical phrases that
are used by the compiler to synthesize output.
The hierarchical structure of a program is usually expressed by
recursive rules. The rules are
1. Any identifier is an expression
2. Any number is an expression
3. If expression1 and expression2 are expressions, then so are
expression1 + expression2
expression1 * expression2
(expression1 )
Rules (1) and (2) are basis rules, while (3) defines expressions in terms
of operators applied to other expressions. Thus by rule (1), initial and rate
are expressions. By rule(2), 60 is an expression, while by rule (3), we can
first infer that rate*60 is an expression and finally that initial + rate * 60 is
an expression.
Similarly, many languages define statements recursively by rules
such as:
1. If identifier1 is an identifier, and expression1 is an expression, then
identifier1 := expression1 is a statement.
2. If expression1 is an expression and statement1 is a statement,
then
while (expression1) do statement1
if (expression1) then statement1
are statements.

SEMANTIC ANALYSIS:
The semantic analysis checks the source program for semantic errors
and gathers type information for the subsequent code-generation phase. It
uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking. Here
the compiler checks that each operator has operands that are permitted by
the source language specification. For example, many programming
language definitions require a compiler to report an error every time a real
number is used to index an array.

SYNTHESIS OF THE SOURCE PROGRAM


INTERMEDIATE CODE GENERATION:
After syntax and semantic analysis, some compilers generate an
explicit intermediate representation of the source program. This intermediate
representation can have a variety of forms.
In three-address code, the source program might look like this,
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
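As an illustration of one common storage form, the fragment below stores this intermediate code as “quadruples” (operator, two operands, result) in C; the field names are illustrative, not a fixed convention.

struct quad {
    const char *op;     /* e.g. "inttoreal", "*", "+", ":=" */
    const char *arg1;
    const char *arg2;   /* empty when the operator is unary */
    const char *result;
};

/* The four three-address instructions above, as quadruples: */
static const struct quad code[] = {
    { "inttoreal", "60",    "",      "temp1" },
    { "*",         "id3",   "temp1", "temp2" },
    { "+",         "id2",   "temp2", "temp3" },
    { ":=",        "temp3", "",      "id1"   },
};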

position := initial + rate * 60

lexical analyzer
↓ id1 := id2 + id3 * 60

syntax analyzer
↓ syntax tree: := ( id1 , + ( id2 , * ( id3 , 60 ) ) )

semantic analyzer
↓ := ( id1 , + ( id2 , * ( id3 , inttoreal(60) ) ) )

intermediate code generator
↓ temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3

code optimizer
↓ temp1 := id3 * 60.0
  id1 := id2 + temp1

code generator
↓ MOVF id3, R2
  MULF #60.0, R2
  MOVF id2, R1
  ADDF R2, R1
  MOVF R1, id1

Code Optimization
The code optimization phase attempts to improve the intermediate
code, so that faster-running machine code will result. Some optimizations
are trivial. There is great variation in the amount of code optimization
different compilers perform. In those that do the most, called ‘optimizing
compilers’, a significant fraction of the time of the compiler is spent on this
phase.
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation
The final phase of the compiler is the generation of target code,
consisting normally of relocatable machine code or assembly code. Memory
locations are selected for each of the variables used by the program. Then,
intermediate instructions are each translated into a sequence of machine
instructions that perform the same task. A crucial aspect is the assignment
of variables to registers.
For example, using registers 1 and 2, the translation of the code in
code optimizer becomes,
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The first and second operands of each instruction specify a source
and destination, respectively. The F in each instruction tells us that the
instruction deals with floating-point numbers.

ERRORS ENCOUNTERED IN DIFFERENT PHASES

One of the important tasks that a compiler must perform is the
detection of and recovery from errors. Recovery from errors is
important, because the compiler will be scanning and compiling the
entire program, perhaps in the presence of errors; so as many errors
as possible need to be detected.
Each of the six phases (but mainly the analysis phase) of a
compiler can encounter errors. On detecting an error the compiler
must:
 Report the error in a helpful way
 Correct the error if possible
 Continue processing (if possible) after the error to look for
further errors.
Types of Errors:
Errors are either syntax errors or semantic errors.
Syntax Errors:
When the rules of the programming language are not followed,
the compiler will show syntax errors.
For example, consider the statement: int a, b :
The above statement will produce a syntax error, as the statement is
terminated with : rather than ;
These are errors in the program text; they may be either lexical or
grammatical.
a. A lexical error is a mistake in a lexeme.
For example, typing tehn instead of then, or
missing one of the quotes in a literal.
b. A grammatical error is one that violates the grammatical
rules of the language.
For example, if x=7
y=4 (missing then keyword)
Semantic Errors:
These are mistakes concerning the meaning of a program construct;
they may be type errors, logical errors or runtime errors.
a. Type errors occur when an operator is applied to an argument
of the wrong type or to the wrong number of arguments.
b. Logical errors are errors in the output of the program. The
presence of logical errors leads to undesired or incorrect output;
they are caused by an error in the logic applied in the program
to produce the desired output.
Also, logical errors cannot be detected by the compiler, and thus a
programmer has to check the entire coding of a program line by line.
For example, while x = y do …
never terminates when x and y initially have the same value and the body
of the loop does not change the value of either x or y.
c. Run-time errors occur during the execution of a program
and generally arise from some illegal operation performed in
the program.
For examples,
 Dividing a number by zero
 Trying to open a file which is not created
 Lack of free memory space

THE GROUPING OF PHASES

Front end and back end:
The phases are collected into a front end and a back end.
o Front end : The front end consists of those phases that depend
primarily on the source language and are largely independent of
the target machine. The phases are
o Lexical analysis
o Syntax analysis
o Semantic analysis
o Intermediate code generation
o Some code optimization
o Back end : The back end includes those portions of the compiler
that depend on the target machine. The phases in the back end are the
code optimization and code generation phases, along with
error handling and symbol-table operations.
Passes:
Several phases of compilation are usually implemented in a
single pass consisting of reading an input file and writing an output
file. There is great variation in the way the phases of a compiler are
grouped into passes.
With the syntax analyzer “in charge”, it attempts to discover the
grammatical structure on the tokens it sees; it obtains tokens as it
needs them, by calling the lexical analyzer to find the next token. As
the grammatical structure is discovered, the parser calls the
intermediate code generator to perform semantic analysis and
generate a portion of the code.

Reducing the Number of passes


It is desirable to have few passes, since it takes time to read and
write intermediate files. On the other hand, if we group several phases
into one pass, we may be forced to keep the entire program in memory,
because one phase may need information in a different order than a
previous phase produces it. The internal form of the program may be
considerably larger than either the source program or the target
program.
For some phases, grouping into one pass presents few problems. For
example, the interface between the lexical and syntactic analyzers can
often be limited to a single token. On the other hand, it is often very
hard to perform code generation until the intermediate representation
has been completely generated.
In some cases, it is possible to leave a blank slot for missing
information, and fill in the slot when the information becomes
available. In particular intermediate and target code generation can
often be merged into one pass using a technique called
“backpatching”
We can combine the action of the passes as follows. On
encountering an assembly statement that is a forward reference, say
GOTO target
We generate the skeletal instruction, with the machine
operation code for GOTO and blanks for the address. All instructions
with blanks for the address of target are kept in a list associated with
the symbol-table entry for target. The blanks are filled in when we
finally encounter an instruction such as
target: MOV foobar, R1
and determine the value of target; it is the address of the current
instruction. We then “backpatch”, by going down the list for target of
all the instructions that need its address, substituting the address of
target for the blanks in the address fields of those instructions. This
approach is easy to implement if the instructions can be kept in
memory until target addresses can be determined.
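The following is a minimal sketch in C of the list-and-fill mechanism just described; the data-structure names and fixed sizes are illustrative assumptions, not the text's prescribed layout.

#define MAXCODE 1000
#define MAXREFS 32

static int address_field[MAXCODE];  /* -1 represents a blank address */

struct label_entry {     /* part of the symbol-table entry for target */
    int address;         /* filled in once "target:" is encountered    */
    int patch_list[MAXREFS];
    int npatches;        /* instructions still waiting for the address */
};

/* Emit a GOTO whose target is not yet known: leave the address blank
   and remember the instruction on the label's list. */
void emit_forward_goto(struct label_entry *lbl, int instr_index) {
    address_field[instr_index] = -1;
    lbl->patch_list[lbl->npatches++] = instr_index;
}

/* On encountering "target:", backpatch every waiting instruction. */
void define_label(struct label_entry *lbl, int address) {
    lbl->address = address;
    for (int i = 0; i < lbl->npatches; i++)
        address_field[lbl->patch_list[i]] = address;
    lbl->npatches = 0;
}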

CONSTRUCTION OF COMPILER TOOLS

The compiler writer, like any software developer, can profitably
use modern software development environments containing tools such
as language editors, debuggers, version managers, profilers, test
harnesses, and so on. In addition to these general software-
development tools, other more specialized tools have been created to
help implement various phases of a compiler. These tools use
specialized languages for specifying and implementing specific
components, and many use quite sophisticated algorithms. The most
successful tools are those that hide the details of the generation
algorithm and produce components that can be easily integrated into
the remainder of the compiler. Some commonly used compiler-
construction tools include

1. Parser generators that automatically produce syntax analyzers
from input that is based on a context-free grammar. In early compilers,
syntax analysis consumed not only a large fraction of the running
time of a compiler, but a large fraction of the effort of writing a
compiler.
2. Scanner generators that produce lexical analyzers from a regular-
expression description of the tokens of a language. The basic
organization of the resulting lexical analyzer is in effect a finite
automaton.
3. Syntax-directed translation engines that produce collections of
routines for walking a parse tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a
collection of rules for translating each operation of the intermediate
language into the machine language for a target machine. The rules
must include sufficient detail to handle the different possible
access methods for data; e.g., variables may be in registers, in a fixed
location in memory, or may be allocated a position on a stack. The
basic technique is “template matching”. The intermediate code
statements are replaced by “templates” that represent sequences of
machine instructions, in such a way that assumptions about storage
of variables match from template to template.
5. Data-flow analysis engines that facilitate the gathering of
information about how values are transmitted from one part of a
program to each other part. Data-flow analysis is a key part of code
optimization.

Exercise: Phases of Compiler
1. c = a + b * d - 4
2. c = (b + c) * (b + c) * 2
3. b = b^2 - 4ac
4. result = (height * width) + (rate * 2)
LEXICAL ANALYZER
1. NEED AND ROLE OF LEXICAL ANALYZER
◦ First phase of a compiler, also called the scanner.
◦ To identify the tokens we need some method of describing the possible tokens that can
appear in the input stream. For this purpose we introduce regular expressions, a notation that
can be used to describe essentially all the tokens of a programming language.
◦ Secondly, having decided what the tokens are, we need some mechanism to recognize these
in the input stream. This is done by the token recognizers, which are designed using
transition diagrams and finite automata.
Main Task:
 To read input characters and produce output as a sequence of tokens that the parser uses
for syntax analysis (token identification).

Interaction of Lexical Analyzer with Parser

 Upon receiving a getNextToken command from the parser, the lexical analyzer reads
input characters until it can identify the next token.
Secondary Task:
⚫ It produces the stream of tokens; every basic element of the language must become a token.
⚫ Stripping out comments and whitespace while creating the tokens.
⚫ It generates the symbol table, which stores information about identifiers and constants
encountered in the input.
⚫ It keeps track of line numbers.
⚫ It associates each error with the source file and line number, and reports
errors encountered while generating the tokens.
⚫ If the source language uses a macro preprocessor (e.g., #define pi 3.14), expansion of
macros may be performed by the lexical analyzer.
Some lexical analyzers are divided into a cascade of two phases:
1. Scanning - responsible for the simple task of scanning the source program to
recognize the tokens.
2. Lexical analysis - responsible for the more complex work, performing all the secondary tasks.
Issues in Lexical analysis: (Lexical Analysis vs. Parsing)
Reasons for separating the analysis phase of compiling into lexical analysis and parsing
are as follows:
 Simplicity of design
◦ Separating lexical from syntactic analysis simplifies at least one of the tasks.
◦ e.g., a parser that had to deal with white space would be more complex.
 Improved compiler efficiency
◦ Speedup reading input characters using specialized buffering techniques
 Enhanced compiler portability
Input device peculiarities are restricted to the lexical analyzer
Tokens, Patterns, Lexemes:
Token: a sequence of characters having a collective meaning.
Example: keywords, identifiers, operators, special characters, constants, etc.
Pattern: the rule describing the set of strings associated with a single token.
Example:
 for a keyword, the pattern is the character sequence forming that keyword;
 for identifiers, the pattern is a letter followed by any number of letters
or digits.
Lexeme: a sequence of characters in the source program matching a pattern for a token.
Examples of Tokens:
Token | Informal Description | Sample Lexemes
if | characters i, f | if
case | characters c, a, s, e | case
comparison | < or > or <= or >= or == or != | <=, !=
identifier | letter followed by letters and digits | area, result, m1
number | any constant number | 3.123, 0, 05e29
literal | anything enclosed by “ ” | “3CSE”

Example: consider a conditional statement in C language: while( a >= 10 )


Lexeme Token
while keyword
( parenthesis
a identifier
>= relational operator
10 number
) parenthesis
Attributes for Tokens
 When more than one lexeme can match a pattern, the lexical analyzer must provide the
additional information about the particular lexeme that matched to the subsequent phase
of the compiler.
 For example, the pattern for token number matches both 0 and 1, but it is extremely
important for the code generator to know which lexeme was found in the source program.
 For tokens corresponding to keywords attributes are not needed since the name of the
token tells everything
 Usually, the attribute of a token is a pointer to the symbol-table entry that keeps
information about the token.
Example of Attribute Values:
Example 1 : PE = M * G * H
◦ <id, pointer to symbol table entry for PE>
◦ <assign_op>
◦ <id, pointer to symbol-table entry for M>
◦ <mult_op>
◦ <id, pointer to symbol-table entry for G>
◦ <mult_op>
◦ <id, pointer to symbol-table entry for H>
Symbol Table (The Data will be stored in symbol table as follows)
Symbol Token Data type Initialized
PE id1 int Yes
M id2 int Yes
G id3 int Yes
H id4 int Yes

Example 2 : E = M * C ** 2
◦ <id, pointer to symbol table entry for E>
◦ <assign_op>
◦ <id, pointer to symbol-table entry for M>
◦ <mult_op>
◦ <id, pointer to symbol-table entry for C>
◦ <exp_op>
◦ <number, integer value 2>
Symbol Table (The Data will be stored in symbol table as follows)
Symbol Token Data type Initialized
E id1 int Yes
M id2 int Yes
C id3 int Yes

Lexical Errors
 Few errors are visible at the lexical level alone, because lexical analyzer has a localized
view of a source program
 For instance, if the string whlle is occurred in a source program
 Example : whlle ( a <= 7 ) a lexical analyzer cannot tell whether whlle is a misspelling
of the keyword while or an undeclared function identifier.
 Since whlle is a valid lexeme for the token id, the lexical analyzer must return the token
to the parser and some other phase of the compiler handle an error due to misspelling of
the letters. However, suppose a circumstance arises in which the lexical analyzer is
unable to proceed because none of the patterns for tokens matches any prefix of the
remaining input. The simplest recovery strategy is "panic mode" recovery.
 Actions are not followed by Lexical analyzer
o Misspelling of the keyword while
o An undeclared function identifier
Error-recovery actions
1. Delete successive characters from the remaining input, until the lexical analyzer can find a
well-formed token at the beginning of what input is left.
2. Delete one character from the remaining input.
3. Insert a missing character into the remaining input.
4. Replace a character by another character.
5. Transpose two adjacent characters.
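A minimal sketch in C of panic-mode recovery (action 1 above): characters are deleted until one that could begin a well-formed token appears. The predicate can_start_token is an assumed helper whose character set would depend on the language being scanned.

#include <ctype.h>

/* Assumed helper: which characters may begin a token of the language. */
static int can_start_token(int c) {
    return isalnum(c) || c == '_' ||
           c == '<' || c == '>' || c == '=' || c == '(' || c == ')';
}

/* Skip (delete) characters until a plausible token start is found. */
const char *panic_mode_recover(const char *input) {
    while (*input != '\0' && !can_start_token((unsigned char)*input))
        input++;
    return input;   /* scanning resumes here */
}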
INPUT BUFFERING
 We often have to look one or more characters beyond the next lexeme before we can be sure we
have the right lexeme. As characters are read from left to right, each character is stored in the
buffer to form a meaningful token.
 We introduce a two-buffer scheme that handles large lookaheads safely. We then consider an
improvement involving "sentinels" that saves time checking for the ends of buffers.
BUFFER PAIRS
 A buffer is divided into two N-character halves, as shown below
 Each buffer is of the same size N, and N is usually the number of characters on one disk block.
E.g., 1024 or 4096 bytes.
 Using one system read command we can read N characters into a buffer.
 If fewer than N characters remain in the input file, then a special character, represented
by eof, marks the end of the source file.

 Two pointers to the input are maintained:


1. Pointer lexeme_beginning, marks the beginning of the current lexeme,
whose extent we are attempting to determine.

2. Pointer forward scans ahead until a pattern match is found.


Once the next lexeme is determined, forward is set to the character at its right end.
 The string of characters between the two pointers is the current lexeme.
After the lexeme is recorded as an attribute value of a token returned to the parser,
lexeme_beginning is set to the character immediately after the lexeme just found.
Advancing forward pointer:
Advancing forward pointer requires that we first test whether we have reached the end of one of the
buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of
the newly loaded buffer. If the end of second buffer is reached, we must again reload the first buffer
with input and the pointer wraps to the beginning of the buffer.
Code to advance forward pointer:
if forward at end of first half then begin
reload second half;
forward := forward + 1
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half
end
else forward := forward + 1;
SENTINELS:

 For each character read, we make two tests: one for the end of the buffer, and one to determine
what character is read. We can combine the buffer-end test with the test for the current character
if we extend each buffer to hold a sentinel character at the end.
 The sentinel is a special character that cannot be part of the source program, and a natural choice
is the character eof.
 The sentinel arrangement is as shown below:

Sentinels at the end of each buffer

Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than
at the end of a buffer means that the input is at an end.
Code to advance forward pointer:
forward : = forward + 1;
if forward ↑ = eof then begin
if forward at end of first half then begin
reload second half;
forward := forward + 1
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half
end
else /* eof within a buffer signifying end of input */
terminate lexical analysis
end
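Below is a sketch in C of the buffer-pair scheme with sentinels described above; it is a minimal sketch, assuming '\0' can serve as the eof sentinel (i.e., it never occurs in the source program) and that the first half has been loaded before scanning starts. N would normally be a disk-block size such as 1024 or 4096.

#include <stdio.h>

#define N 4096
#define EOF_CHAR '\0'   /* assumed sentinel: cannot be part of the source */

static char buf[2 * N + 2];          /* two N-char halves + one sentinel slot each */
static char *lexeme_beginning = buf; /* start of the current lexeme                */
static char *forward = buf;          /* scanning pointer                           */

/* Read up to N characters into one half and plant the sentinel;
   on a short read the sentinel lands early, marking real end of input. */
static void reload(char *half, FILE *src) {
    size_t got = fread(half, 1, N, src);
    half[got] = EOF_CHAR;
}

/* Advance forward one character; returns 0 at the true end of input. */
static int advance(FILE *src) {
    forward++;
    if (*forward == EOF_CHAR) {
        if (forward == buf + N) {                 /* end of first half  */
            reload(buf + N + 1, src);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) {  /* end of second half */
            reload(buf, src);
            forward = buf;
        } else {
            return 0;     /* eof within a buffer signifies end of input */
        }
    }
    return 1;
}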

2. EXPRESSING TOKENS BY REGULAR EXPRESSIONS / SPECIFICATION OF TOKENS
 Regular expressions are notation for specifying patterns.
 Each pattern matches a set of strings.
 Regular expressions will serve as names for sets of strings.
Strings and Languages
A string is a finite sequence of symbols. For example,
computer (c, o, m, p, u, t, e, r)
CS6660 (C, S, 6, 6, 6, 0)
101001 (symbols 1 and 0)
• Symbols are given through alphabet. An alphabet is a finite set of symbols
• The term alphabet or character class denotes any finite set of symbols. e.g., set {0,1} is
the binary alphabet.
• The term sentence and word are often used as synonyms for the term string.
• The length of a string s is written as | s | - is the number of occurrences of symbols
in s. e.g., string “cs6660” is of length six.
• The empty string denoted by ε – length of empty string is zero.
 The term language denotes any set of strings over some fixed alphabet.
• e.g., {ε}, the set containing only the empty string, is a language, as is φ, the empty set.
Operations on string:
1. Concatenation of string
If x and y are strings, then the concatenation of x and y (written as xy) is the string
formed by appending y to x. x = hello and y = world; then xy is helloworld.
2. Exponentiation. Let s be a string; then
s^0 = ε, s^1 = s, s^2 = ss, s^3 = sss, …, s^n = ss…s (n times), and so on.
3. Identity element: sε = εs = s.

Terms for parts of a string

TERM | DEFINITION
Prefix of s | A string obtained by removing zero or more trailing symbols of string s; e.g., cs is a prefix of cs6660.
Suffix of s | A string formed by deleting zero or more of the leading symbols of s; e.g., 660 is a suffix of cs6660.
Substring of s | A string obtained by deleting a prefix and a suffix from s; e.g., 66 is a substring of cs6660.
Proper prefix, suffix, or substring of s | Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that x ≠ s.
Subsequence of s | Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., c60 is a subsequence of cs6660.

Operations on Languages :
There are several operations that can be applied to languages:
Definitions of operations on languages L and M:

OPERATION | DEFINITION
Union of L and M, written L ∪ M | L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M, written LM | LM = { st | s is in L and t is in M }
Kleene closure of L, written L* | L* denotes “zero or more concatenations of” L.
Positive closure of L, written L+ | L+ denotes “one or more concatenations of” L.
Example 1
Let L be the set of letters {A, B,.. . , Z, a, b,... , z} and let D be the set of digits {0,1,.. . 9}
Operations :

1. L U D is the set of letters and digits — strictly speaking the language with 62 strings of length
one, each of which strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L^5 is the set of all 5-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.

Example 2
Let W be the set of characters {c,o,m,p,i,l,e,r} and let N be the set of digits {1,2,3}
Operations:
1. W U N is the set of characters and digits —language with 11 strings of length one, each of
which strings is either one letter or one digit.
2. WN is the set of 24 strings of length two, each consisting of one characters followed by one
digit.
3. W^3 is the set of all 3-character strings.
4. W* is the set of all strings of characters, including ε, the empty string.
5. W(W U N)* is the set of all strings of characters and digits beginning with a character.
6. N+ is the set of all strings of one or more digits.

Regular Expressions
 It allows us to define the sets that form tokens.
 Example: a Pascal identifier is formed by a letter followed by zero or more
letters or digits, i.e., letter ( letter | digit )*.
 A regular expression is formed using a set of defining rules.
 Each regular expression r denotes a language L(r).
 The rules that define the regular expressions over alphabet ∑. Associated with each rule
is a specification of the language denoted by the regular expression being defined
BASIS
Rule 1 ε is a regular expression that denotes {ε}, i.e. the set containing the
empty string.
Rule2 If a is a symbol in ∑, then a is a regular expression that denotes {a},
i.e. the set containing the string a.
INDUCTION (Rule 3). Suppose r and s are regular expressions denoting the languages
L(r) and L(s). Then
 (r) | (s) is a regular expression denoting the languages L(r) U L(s).
 (r)(s) is a regular expression denoting the languages L(r)L(s).
 (r)* is a regular expression denoting the languages (L(r))*.
 (r) is a regular expression denoting the languages L(r).
• A language denoted by a regular expression is said to be a regular set.
• The specification of a regular expression is an example of a recursive definition.

Order of evaluating regular expressions:

As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop
certain pairs of parentheses if we adopt the conventions that:
 The unary operator * has highest precedence and is left associative.
 Concatenation has second highest precedence and is left associative.
 | has lowest precedence and is left associative.
Example 3 we may replace the regular expression (a)|((b)*(c)) by a|b*c.
Both expressions denote the set of strings that are either a single a or are zero or more b's
followed by one c.
Example 4 Let Σ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet
Σ. Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all
strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for the same
language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings
consisting of zero or more a's and ending in b.
Regular set
A language that can be defined by a regular expression is called a regular set. If two regular
expressions r and s denote the same regular set, we say they are equivalent and write r = s. For
instance, (a|b) = (b|a).
Below Table shows some of the algebraic laws that hold for arbitrary regular expressions r, s, and t.

Algebraic Properties of regular expressions


AXIOM | DESCRIPTION
r|s = s|r | | is commutative
r|(s|t) = (r|s)|t | | is associative
(rs)t = r(st) | concatenation is associative
r(s|t) = rs|rt, (s|t)r = sr|tr | concatenation distributes over |
εr = r, rε = r | ε is the identity element for concatenation
r* = (r|ε)* | relation between * and ε
r** = r* | * is idempotent

Regular Definition
For notational convenience, we need to give names for regular expressions and to define
regular expressions using these names as if they were symbols.
Identifiers are the string of letters and digits beginning with a letter. The following regular
definition provides clear specification for the string.
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the
form
d1 → r1
d2 → r2
…
dn → rn
Where each di is a distinct name, and each ri is a regular expression over the symbols in ∑ U {d1,
d2, … , di-1}, i.e., the basic symbols and the previously defined names.
Example 14 C identifiers are strings of letters, digits, and underscores. Here is a regular
definition for the language of C identifiers.
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*
Example: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234,
6.336E4, or 1.89E-4. The regular definition:
digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
Example:
Write a regular definition to represent a date in the following format: JAN 5th 2016
date → month day year
month → JAN | FEB | … | DEC
day → [0-3] [0-9] ( th | st | nd | rd )
year → [12] [0-9] [0-9] [0-9]
Extensions of Regular Expressions
• Notational Shorthand:
– This shorthand is used in certain constructs that occur frequently in regular expressions.
1. One or more instances. The unary, postfix operator + represents the positive closure of a
regular expression and its language. That is, if r is a regular expression, then (r) + denotes
the language (L(r)) + . The operator + has the same precedence and associativity as the
operator *. Two useful algebraic laws, r* = r + | ε and r + = rr* = r*r relate the Kleene
closure and positive closure.
2. Zero or one instance. The unary postfix operator ? means "zero or one occurrence." That
is, r? is equivalent to r|ε, or put another way, L(r?) = L(r) U { ε }. The ? operator has the
same precedence and associativity as * and +.
3. Character classes. A regular expression a1 | a2 | … | an, where the ai's are each symbols of
the alphabet, can be replaced by the shorthand [a1 a2 … an]. For consecutive uppercase
letters, lowercase letters, or digits, we can write a1-an, that is, just the first and
last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for
a|b|…|z.
Example: Using these shorthands, we can rewrite the regular definition of Example 3.4.1 as:
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*
and the regular definition of Example 3.4.2 as:
digit → [0-9]
digits → digit+
optionalFraction → ( . digits )?
optionalExponent → ( E [+-]? digits )?
number → digits ( . digits )? ( E [+-]? digits )?
Recognition of tokens:

We learn how to express pattern using regular expressions. Now, we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input
string and finds a prefix that is a lexeme matching one of the patterns

Grammar for branching statements:

stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | number

For relop ,we use the comparison operations of languages like Pascal or SQL where = is “equals”
and < > is “not equals” because it presents an interesting structure of lexemes.
The terminal of grammar, which are if, then , else, relop ,id and numbers are the names of tokens
as far as the lexical analyzer is concerned, the patterns for the tokens are described using regular
definitions.
Patterns for tokens of the above grammar:
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the “token” ws defined by:

ws → ( blank | tab | newline )+

Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of
the same names. Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character that
follows the white space. It is the following token that gets returned to the parser.

Lexeme | Token Name | Attribute Value
Any ws | - | -
if | if | -
then | then | -
else | else | -
Any id | id | pointer to table entry
Any number | number | pointer to table entry
< | relop | LT
<= | relop | LE
= | relop | EQ
<> | relop | NE

Transition Diagram
A transition diagram consists of a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of reading the input looking for a
lexeme that matches one of several patterns.

Edges are directed from one state to another. Each edge is labeled by a symbol or set of symbols.
If we are in state s, and the next input symbol is a, we look for an edge out of state s labeled
by a. If we find such an edge, we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.

Some important conventions about transition diagrams are


1. Certain states are said to be accepting or final. These states indicate that a lexeme has been
found, although the actual lexeme may not consist of all positions between the lexemeBegin and
forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we additionally
place a * near that accepting state.
3. One state is designated the start state, or initial state. It is indicated by an edge labeled “start”
entering from nowhere. The transition diagram always begins in the start state before any input
symbols have been used.
Transition Diagram for relop
Relational operators: < | > | <= | >= | = | <>
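Since the diagram itself is not reproduced here, the following is a sketch of how such a diagram is conventionally turned into C code: each diagram state becomes a point in a switch, and the edges marked * (retract) correspond to simply not consuming the lookahead character. The token codes are illustrative.

/* Sketch: returns a relop code and advances *pp past the lexeme,
   or returns -1 if no relop starts at *pp. */
enum relop { LT, LE, EQ, NE, GT, GE };

int recognize_relop(const char **pp) {
    const char *p = *pp;
    int tok;
    switch (*p++) {                             /* start state        */
    case '<':
        if (*p == '=')      { p++; tok = LE; }  /* lexeme <=          */
        else if (*p == '>') { p++; tok = NE; }  /* lexeme <>          */
        else                tok = LT;           /* retract: just <    */
        break;
    case '=':
        tok = EQ;                               /* lexeme =           */
        break;
    case '>':
        if (*p == '=') { p++; tok = GE; }       /* lexeme >=          */
        else           tok = GT;                /* retract: just >    */
        break;
    default:
        return -1;                              /* not a relop        */
    }
    *pp = p;
    return tok;
}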

Transition Diagram for identifier

Transition Diagram for unsigned number


4. FINITE AUTOMATA
 Finite automata are a mechanism to recognize a set of valid inputs before carrying out an
action.
 A finite automaton is a mathematical model of a system with inputs, outputs, a finite number
of states, and transitions from state to state on input symbols from Σ.
 A recognizer for a language is a program that takes a string x, and answers “yes” if x is
a sentence of that language, and “no” otherwise.

 We call the recognizer of the tokens a finite automaton. A finite automaton can be
deterministic (DFA) or non-deterministic (NFA). Both deterministic and non-deterministic
finite automata recognize regular sets.

Applications: 1) Tool in the design of lexical analyzer


2) Text editor
3) Pattern Matching
4) File Searching Program
5) Searching for keyword in a file.

Limitations: 1) It recognizes only languages described by regular expressions.

2) It is designed only for decision-making problems.
• Finite automata also called Finite State Machine (FSM)
– Abstract model of a computing entity.
– Decides whether to accept or reject a string.
– Every regular expression can be represented as a FA and vice versa
• Two types of Finite automata:
– Non-deterministic (NFA): It has more than one alternative action for the same
input symbol.
– Deterministic (DFA): It has at most one action for a given input symbol.
• Example: how do we write a program to recognize java keyword “int”?

4.1 Model of NFA:
• An NFA (Non-deterministic Finite Automaton) is a 5-tuple (S, Σ, δ, s0, F):
 S: a set of states;
 Σ: the symbols of the input alphabet;
 δ: a set of transition functions;
 move(state, symbol) → a set of states
 s0: s0 ∈ S, the start state;
 F: F ⊆ S, a set of final or accepting states.
• Non-deterministic -- a state and symbol pair can be mapped to a set of states.
• Finite—the number of states is finite.
• Finite automata can be represented using transition diagram.
• Corresponding to FA definition, a transition diagram has:
– States represented by circles;
– An Alphabet (Σ) represented by labels on edges;
– Transitions represented by labeled directed edges between states. The label is the
input symbol;
– One Start State shown as having an arrow head;
– One or more Final State(s) represented by double circles.
– Example transition diagram to recognize (a|b)*abb

Non-Deterministic Finite Automaton (NFA)

• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:

o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function move to map state-symbol pairs to sets of states.
o s0 - a start (initial) state
o F- a set of accepting states (final states)
• ε- transitions are allowed in NFAs. In other words, we can move from one state to
another one without consuming any symbol.
• An NFA accepts a string x, if and only if there is a path from the start state to one of the
accepting states such that the edge labels along this path spell out x.

Transition table
• A transition table is a good way to implement an FSA:
– one row for each state, S;
– one column for each symbol, A;
– the entry in cell (S, A) gives the state or set of states that can be reached from state S on
input A.

Example:
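As the example figure is not reproduced, here is a sketch of a transition table in C for a DFA that recognizes (a|b)*abb: one row per state, one column per symbol, and cell (S, A) holds the next state. The state numbering (0 start, 3 accepting) is illustrative, and the input is assumed to contain only a's and b's.

#include <stdio.h>

static const int dtran[4][2] = {   /* columns: 0 = 'a', 1 = 'b' */
    /* state 0 */ {1, 0},
    /* state 1 */ {1, 2},
    /* state 2 */ {1, 3},
    /* state 3 */ {1, 0},
};

int accepts(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++)
        state = dtran[state][*s == 'b'];   /* cell (S, A) lookup      */
    return state == 3;                     /* 3 is the accepting state */
}

int main(void) {
    printf("%d %d\n", accepts("abb"), accepts("aabab"));  /* prints 1 0 */
    return 0;
}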

Deterministic Finite Automaton (DFA)

• A Deterministic Finite Automaton (DFA) is a special form of a NFA.

• No state has ε- transition

• For each symbol a and state s, there is at most one edge labeled a leaving s; i.e., the transition
function is from a state-symbol pair to a state (not a set of states).

Example 4.2

4.2 From Regular Expressions to Automata
 Construction of an NFA from a Regular Expression
 Simulation of an NFA
 Conversion of a NFA to DFA
4.3. From Regular Expressions to Automata
 Regular expressions describe lexical analyzers and other pattern-processing software;
matching them implies simulation of a DFA or an NFA.
 NFA simulation is less straightforward. Techniques:
◦ convert the NFA to a DFA (the subset construction technique);
◦ simulate the NFA directly, when NFA-to-DFA conversion is too time consuming;
◦ convert the regular expression to an NFA and then to a DFA.

4.3.1.1 Construction of an NFA and DFA from a Regular Expression


 To convert a regular expression to an NFA we use the McNaughton-Yamada-Thompson
algorithm.
 It is syntax-directed:
◦ it works recursively up the parse tree of the regular expression;
◦ for each subexpression an NFA with a single accepting state is built.

Algorithm: 4.1
 Input
◦ regular expression r over an alphabet Σ
 Output
◦ An NFA accepting L(r)
 Method
◦ to parse r into constituent sub expressions
◦ basis rules for handling sub expressions with no operators
◦ inductive rules for creating larger NFAs from sub expressions NFAs union,
concatenation, closure

Basis Rules for Constructing NFA


• To recognize an empty string ε:

• To recognize a symbol a in the alphabet Σ:

• For regular expression r1 | r2:

• For regular expression r1 r2

Here, the final state of N(r1) is merged with the start state of N(r2), and the final state of N(r2) becomes the final state of N(r1r2).

 For regular expression r*
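The figures above give the construction rules; the following is a minimal sketch in C of how those rules can be coded. One simplification is assumed: for concatenation it links N(r1)'s accepting state to N(r2)'s start state by an ε-edge, where the construction in the text merges the two states.

#include <stdlib.h>

/* Each fragment has one start and one accepting state; a state carries
   at most one labeled edge (c, out) and up to two ε-edges. */
typedef struct state {
    int c;                     /* symbol on the labeled edge, 0 if none */
    struct state *out;         /* target of the labeled edge            */
    struct state *eps1, *eps2; /* ε-edges                               */
} State;

typedef struct { State *start, *accept; } Frag;

static State *new_state(void) { return calloc(1, sizeof(State)); }

Frag frag_sym(int c) {              /* basis: a single symbol a        */
    Frag f = { new_state(), new_state() };
    f.start->c = c;
    f.start->out = f.accept;
    return f;
}

Frag frag_union(Frag a, Frag b) {   /* r1 | r2                         */
    Frag f = { new_state(), new_state() };
    f.start->eps1 = a.start;   f.start->eps2 = b.start;
    a.accept->eps1 = f.accept; b.accept->eps1 = f.accept;
    return f;
}

Frag frag_concat(Frag a, Frag b) {  /* r1 r2 (ε-link; see note above)  */
    a.accept->eps1 = b.start;
    return (Frag){ a.start, b.accept };
}

Frag frag_star(Frag a) {            /* r*                              */
    Frag f = { new_state(), new_state() };
    f.start->eps1 = a.start;   f.start->eps2 = f.accept;
    a.accept->eps1 = a.start;  a.accept->eps2 = f.accept;
    return f;
}

The NFA of Example 4.3 below, for (a|b)*a, would then be built as
frag_concat(frag_star(frag_union(frag_sym('a'), frag_sym('b'))), frag_sym('a')).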

Example 4.3

For the RE (a|b)*a, the NFA construction is shown below (figures not reproduced):
1. a, b
2. a|b
3. (a|b)*
4. (a|b)*a
Further example REs whose constructions are shown in the figures: abb and (a|b)+bcd.
Example 4.4
Convert the R.E (0 + 1)* 1(0 + 1) to NFA
Step 1 : (0 + 1)

Step2: (0 + 1) *

Step 3: (0 + 1) *1

Step 4: (0 + 1) *1(0 + 1)

Example 4.5
Convert the R.E 01* to NFA
Step 1 : 1*

Step 2 : 01*

Example 4.6
Convert the R.E (0 + 1) 01 to NFA
Step 1 : (0 + 1)

Step 2 : 01

Step 3 : (0 + 1)01

Example 4.7
Convert the R.E aa (a | b)*to NFA
Step 1 : (a | b)*

Step 2 : aa (a | b)*

Example 4.8
Convert the R.E (a|b)* (aa | bb) to NFA
Step 1 : (a | b)*

Step 2 : (aa | bb)

Step 3 : (a|b)* (aa | bb)

4.3.1.2 Conversion of a NFA to a DFA


 subset construction
◦ each state of DFA corresponds to a set of NFA states
 DFA states may be exponential in number of NFA states

Every DFA defines a unique language but in general, there may be many DFAs for a given
language. These DFAs accept the same language. Minimization of a DFA obtains the DFA
with the minimal number of states; the added advantages of DFA minimization are as follows:
 Use less memory
 Use less hardware (flip-flops)

Example of minimization of DFA: before minimization vs. after minimization (figures not reproduced).

Subset construction of a DFA from an NFA


We merge together NFA states by looking at them from the point of view of the input characters:

• From the point of view of the input, any two states that are connected by an ε-
transition may as well be the same, since we can move from one to the other without
consuming any character. Thus states which are connected by an ε-transition will be
represented by the same state in the DFA.

• If it is possible to have multiple transitions based on the same symbol, then we can regard a
transition on a symbol as moving from a state to a set of states (i.e., the union of all those states
reachable by a transition on the current symbol). Thus these states will be combined into a single
DFA state.

To perform this operation, let us define two functions:

• The ε-closure function takes a state and returns the set of states reachable from it based on (one
or more) ε-transitions. Note that this will always include the state itself. We should be able to get
from a state to any state in its ε-closure without consuming any input.

• The function move takes a state and a character, and returns the set of states reachable by one
transition on this character.

We can generalise both these functions to apply to sets of states by taking the union of the
application to individual states.

For example, if A, B and C are states, move({A,B,C}, a) = move(A, a) ∪ move(B, a) ∪ move(C, a).

Operations on NFA states


Operation | Description
ε-closure(s) | set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T) | set of NFA states reachable from some NFA state s in set T on ε-transitions alone
move(T, a) | set of NFA states to which there is a transition on input symbol a from some state s in T

Transitions
 s0 is the start state; N can be in any state of ε-closure(s0).
 After reading an input string x, N can be in some set of states T.
 After reading the next input symbol a, N can be in any state of ε-closure(move(T, a)).
 The accepting states of D are all those sets of N's states that include at least one accepting state of N.

The subset construction algorithm of a DFA from an NFA is as follows:

Input
◦ an NFA N
Output

◦ DFA D accepting the same language as N

put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)

while (there is one unmarked S1 in DS) do


begin
mark S1
for each input symbol a do
begin
S2 ← ε-closure(move(S1,a))
if (S2 is not in DS) then
add S2 into DS as an unmarked state
transfunc[S1,a] ← S2
end
end
 a state S in DS is an accepting state of DFA if a state in S is an accepting state of NFA
the start state of DFA is ε-closure({s0})

Computing ε-closure(T)

push all states of T onto stack;
initialize ε-closure(T) to T;
while (stack is not empty)
{
pop t, the top element, off stack;
for (each state u with an edge from t to u labeled ε)
if (u is not in ε-closure(T))
{
add u to ε-closure(T);
push u onto stack;
}
}
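Putting ε-closure, move and the subset-construction loop together, the following C sketch represents each set of NFA states as a bit mask, so union and membership tests become bitwise operations. The hard-coded edges assume the usual Thompson NFA for (a|b)*abb with states 0–10 and accepting state 10, matching Example 4.9 below; this is an illustration, not LEX's implementation.

#include <stdio.h>

typedef unsigned int Set;              /* bit i set <=> NFA state i in the set */
#define BIT(i) (1u << (i))

/* eps[i] = states reachable from i on one epsilon edge */
static Set eps[11];
/* tr[i][c] = states reachable from i on symbol c (0 = 'a', 1 = 'b') */
static Set tr[11][2];

static Set eps_closure(Set T) {        /* the stack-based algorithm above */
    int stack[11], sp = 0, t;
    Set clo = T;
    for (t = 0; t <= 10; t++) if (T & BIT(t)) stack[sp++] = t;
    while (sp > 0) {
        t = stack[--sp];
        for (int u = 0; u <= 10; u++)
            if ((eps[t] & BIT(u)) && !(clo & BIT(u))) {
                clo |= BIT(u);
                stack[sp++] = u;
            }
    }
    return clo;
}

static Set move(Set T, int c) {
    Set m = 0;
    for (int s = 0; s <= 10; s++) if (T & BIT(s)) m |= tr[s][c];
    return m;
}

int main(void) {
    /* Thompson NFA for (a|b)*abb, states 0..10, accepting state 10 */
    eps[0] = BIT(1) | BIT(7);  eps[1] = BIT(2) | BIT(4);
    eps[3] = BIT(6);           eps[5] = BIT(6);
    eps[6] = BIT(1) | BIT(7);
    tr[2][0] = BIT(3);  tr[4][1] = BIT(5);
    tr[7][0] = BIT(8);  tr[8][1] = BIT(9);  tr[9][1] = BIT(10);

    Set dstates[32];                        /* the set DS of the algorithm */
    int n = 0, marked = 0;
    dstates[n++] = eps_closure(BIT(0));     /* start state A, unmarked     */
    while (marked < n) {                    /* an unmarked state remains   */
        Set S = dstates[marked++];          /* mark S                      */
        for (int c = 0; c < 2; c++) {
            Set U = eps_closure(move(S, c));
            int found = 0;
            for (int i = 0; i < n; i++) if (dstates[i] == U) found = 1;
            if (!found && U) dstates[n++] = U;   /* add as unmarked state  */
            printf("Dtran[%c,%c] -> state set %#x\n",
                   'A' + (marked - 1), 'a' + c, U);
        }
    }
    printf("%d DFA states\n", n);           /* expect 5: A..E */
    return 0;
}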

Example 4.9
Using the subset construction algorithm, convert the NFA for (a|b)*abb to a DFA.

We need to remove:

1. ε-transitions, which requires constructing ε-closure(s)

2. multiple transitions on an input symbol from some state s in T, handled by move(T,a)

First step: construct ε-closure(s)
State ε – closure (s)
{0}= {0,1,2,4,7}
{1}= {1,2,4}
{2}= {2}
{3}= {1,2,3,4,6,7}
{4}= {4}
{5}= {1,2,4,5,6,7}
{6}= {1,2,4,6,7}
{7}= {7}
{8}= {8}
{9}= {9}
{10}= {10}

Second step: Look for the start state of the DFA

The start state A of the equivalent DFA is ε-closure(0)

ε-closure(0) = {0, 1, 2, 4, 7}

Then the starting state A = {0, 1, 2, 4, 7}

Third step: Compute U = ε-closure(move(T,a))

• First determine the input alphabet: here the input alphabet = {a,b}

• Second compute:

1. Dtran[A,a] = ε-closure(move(A,a))

2. Dtran[A,b] = ε-closure(move(A,b))

1. Dtran[A,a] = ε-closure(move(A,a))

• Among the states 0, 1, 2, 4, 7, only 2 and 7 have transitions on a, to 3 and 8 respectively.

Thus move(A,a) = {3, 8}

Also, ε-closure({3,8}) = {1, 2, 3, 4, 6, 7, 8}

So we conclude:

Dtran[A,a] = ε-closure(move(A,a)) = ε-closure({3,8}) = {1, 2, 3, 4, 6, 7, 8} = B

2. Dtran[A,b] = ε-closure(move(A,b))

Among the states in A, only 4 has a transition on b, and it goes to 5. Thus:

Dtran[A,b] = ε-closure(move(A,b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C

• Third compute:

3. Dtran[B,a] = ε-closure(move(B,a))

Among the states in B, only 2 and 7 have transitions on a, to 3 and 8 respectively.

Thus:

Dtran[B,a] = ε-closure(move(B,a)) = ε-closure({3,8}) = {1, 2, 3, 4, 6, 7, 8} = B

4. Dtran[B,b] = ε-closure(move(B,b))

Among the states in B, only 4 and 8 have transitions on b, to 5 and 9 respectively.

Thus:

Dtran[B,b] = ε-closure(move(B,b)) = ε-closure({5,9}) = {1, 2, 4, 5, 6, 7, 9} = D

• Fourth compute:

5. Dtran[C,a] = ε-closure(move(C,a))

Among the states in C, only 2 and 7 have transitions on a, to 3 and 8 respectively.

Thus:

Dtran[C,a] = ε-closure(move(C,a)) = ε-closure({3,8}) = {1, 2, 3, 4, 6, 7, 8} = B

6. Dtran[C,b] = ε-closure(move(C,b))

Among the states in C, only 4 has a transition on b, and it goes to 5.

Thus:

Dtran[C,b] = ε-closure(move(C,b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C

7. Dtran[D,a] = ε-closure(move(D,a))

Among the states in D, only 2 and 7 have transitions on a, to 3 and 8 respectively.

Thus:

Dtran[D,a] = ε-closure(move(D,a)) = ε-closure({3,8}) = {1, 2, 3, 4, 6, 7, 8} = B

8. Dtran[D,b] = ε-closure(move(D,b))

Among the states in D, only 4 and 9 have transitions on b, to 5 and 10 respectively.

Thus:

Dtran[D,b] = ε-closure(move(D,b)) = ε-closure({5,10}) = {1, 2, 4, 5, 6, 7, 10} = E

9. Dtran[E,a] = ε-closure(move(E,a))

Among the states in E, only 2 and 7 have transitions on a, to 3 and 8 respectively.

Thus:

Dtran[E,a] = ε-closure(move(E,a)) = ε-closure({3,8}) = {1, 2, 3, 4, 6, 7, 8} = B

10. Dtran[E,b] = ε-closure(move(E,b))

Among the states in E, only 4 has a transition on b, and it goes to 5.

Thus:

Dtran[E,b] = ε-closure(move(E,b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C

Transition table Dtran for DFA D

NFA State             DFA State   a   b

{0,1,2,4,7}           A           B   C
{1,2,3,4,6,7,8}       B           B   D
{1,2,4,5,6,7}         C           B   C
{1,2,4,5,6,7,9}       D           B   E
{1,2,4,5,6,7,10}      E           B   C
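As a quick check of the table, the short C sketch below (illustrative only) simulates DFA D on a few strings; with states A–E encoded as 0–4 and E (index 4) as the accepting state, exactly the strings ending in abb are accepted.

#include <stdio.h>

int main(void) {
    /* rows: A,B,C,D,E; columns: input 'a' (0) and 'b' (1) */
    int Dtran[5][2] = {
        {1, 2},   /* A: a->B, b->C */
        {1, 3},   /* B: a->B, b->D */
        {1, 2},   /* C: a->B, b->C */
        {1, 4},   /* D: a->B, b->E */
        {1, 2},   /* E: a->B, b->C */
    };
    const char *tests[] = { "abb", "aabb", "babb", "ab", "abba" };
    for (int i = 0; i < 5; i++) {
        int s = 0;                            /* start in A */
        for (const char *p = tests[i]; *p; p++)
            s = Dtran[s][*p - 'a'];           /* 'a' -> 0, 'b' -> 1 */
        printf("%-5s : %s\n", tests[i], s == 4 ? "accept" : "reject");
    }
    return 0;
}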

Minimizing the Number of States of a DFA

Distinguishable States
 a string x distinguishes state s from state t if exactly one of the states reached from s and t
by following the path labeled x is an accepting state
 state s is distinguishable from state t if there exists some string that distinguishes them
 the empty string distinguishes any accepting state from any non-accepting state
Algorithm 4.3
 Input
◦ DFA D with set of states S, input alphabet Σ, start state s0, accepting states F
 Output
◦ DFA D’ accepting the same language as D and having as few states as possible
Method:
1 Start with an initial partition Π with two groups F and S-F
2 Apply the procedure
for(each group G of Π)
{
partition G into subgroups such that states s and t are in the same subgroup iff for all input
symbols a, states s and t have transitions on a to states in the same group of Π
}

3 if Πnew= Π let Πfinal= Π and continue with step 4, otherwise repeat step 2 with Πnew
instead of Π
4 choose one state in each group of Πfinal as the representative for that group

Minimum State DFA Construction
 the start state of D’ is the representative of the group containing the start state of D
 the accepting states of D’ are the representatives of those groups that contain an accepting
state of D
 if
◦ s is the representative of a group G of Πfinal
◦ there exists a transition from s on input a to a state t in group H
◦ r is the representative of H, then in D’ there is a transition from s to r on input a

Minimized DFA Transition table

State   a   b
A       B   A
B       B   D
D       B   E
E       B   A

Example: Minimization of DFA

{A,B,C,D}{E}
◦ on input a:
– A,B,C,D -> {A,B,C,D}
– E -> {A,B,C,D}
◦ on input b:
– A,B,C -> {A,B,C,D}
– D -> {E}
– E -> {A,B,C,D}
{A,B,C}{D}{E}
◦ on input a:
– A,B,C -> {A,B,C}
– D -> {A,B,C}
– E -> {A,B,C}
◦ on input b:
– A,C -> {A,B,C}
– B -> {D}
– D -> {E}
– E -> {A,B,C}
{A,C}{B}{D}{E}
◦ on input a:
– A,C -> {B}
– B -> {B}
– D -> {B}
– E -> {B}
◦ on input b:
– A,C -> {A,C}
– B -> {D}
– D -> {E}
– E -> {A,C}
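The refinement loop above can be coded compactly. The following C sketch is an illustration of the same idea, except that instead of splitting groups in place it recomputes a "signature" for every state (its own group plus the groups its transitions lead to) and regroups states with equal signatures; on the example DFA it reproduces the final partition {A,C}{B}{D}{E}.

#include <stdio.h>

#define N 5                                /* states A..E encoded as 0..4 */

int main(void) {
    /* Dtran for (a|b)*abb: columns are inputs a and b */
    int Dtran[N][2] = { {1,2}, {1,3}, {1,2}, {1,4}, {1,2} };
    int group[N] = { 0, 0, 0, 0, 1 };      /* initial partition: S-F = 0, F = 1 */
    int changed = 1;

    while (changed) {
        int sig[N][3], newgroup[N], ngroups = 0;
        changed = 0;
        for (int s = 0; s < N; s++) {      /* signature: own group + target groups */
            sig[s][0] = group[s];
            sig[s][1] = group[Dtran[s][0]];
            sig[s][2] = group[Dtran[s][1]];
        }
        for (int s = 0; s < N; s++) {      /* equal signatures share a group */
            int g = -1;
            for (int t = 0; t < s; t++)
                if (sig[t][0] == sig[s][0] && sig[t][1] == sig[s][1] &&
                    sig[t][2] == sig[s][2]) { g = newgroup[t]; break; }
            newgroup[s] = (g >= 0) ? g : ngroups++;
        }
        for (int s = 0; s < N; s++) {      /* repeat until the partition is stable */
            if (newgroup[s] != group[s]) changed = 1;
            group[s] = newgroup[s];
        }
    }
    for (int s = 0; s < N; s++)
        printf("state %c is in group %d\n", 'A' + s, group[s]);
    return 0;
}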

Example 4.11 Construct the minimized DFA for the regular expression (0+1)*(0+1)10
Step 1: Using the Thompson construction algorithm, construct an NFA from the regular expression
(0+1)*(0+1)10

Step 2: Convert the NFA into a DFA by eliminating ε-transitions


Find the ε-closure of all the states
State ε – closure (s)
{0}= {0,1,2,4,7,8,9,11}
{1}= {1,2,4}
{2}= {2}
{3}= {1,2,3,4,6,7,8,9,11}
{4}= {4}
{5}= {1,2,4,5,6,7,8,9,11}
{6}= {1,2,4,6,7,8,9,11}
{7}= {7,8,9,11}
{8}= {8,9,11}
{9}= {9}
{10}= {10,13,14}
{11}= {11}
{12}= {12,13,14}
{13}= {13,14}
{14}= {14}
{15}= {15}
{16}= {16}

Step 3: Let the initial state be the ε –closure(initial state)


ε – closure (0) = {0,1,2,4,7,8,9,11} A
Move (A,0) = ε – closure(δ(A,0))

= ε – closure(3,10)
= {1,2,3,4,6,7,8,9,10,11,13,14} B
Move (A,1) = ε – closure(δ(A,1))
= ε – closure(5,12)
= {1,2,4,5,6,7,8,9,11,12,13,14} C
Move (B,0) = ε – closure(δ(B,0))
= ε – closure(3,10)
= {1,2,3,4,6,7,8,9, 10,11,13,14} B
Move (B,1) = ε – closure(δ(B,1))
= ε – closure(5,12,15)
= {1,2,4,5,6,7,8,9,11,12,13,14,15} D
Move (C,0) = ε – closure(δ(C,0))
= ε – closure(3,10)
= {1,2,3,4,6,7,8,9,10,11,13,14} B
Move (C,1) = ε – closure(δ(C,1))
= ε – closure(5,12,15)
= {1,2,4,5,6,7,8,9,11,12,13,14,15} D
Move (D,0) = ε – closure(δ(D,0))
= ε – closure(3,10,16)
= {1,2,3,4,6,7,8,9,10,11,13,14,16} E
Move (D,1) = ε – closure(δ(D,1))
= ε – closure(5,12,15)
= {1,2,4,5,6,7,8,9,11,12,13,14,15} D
Move (E,0) = ε – closure(δ(E,0))
= ε – closure(3,10)
= {1,2,3,4,6,7,8,9, 10,11,13,14} B
Move (E,1) = ε – closure(δ(E,1))
= ε – closure(5,12,15)
= {1,2,4,5,6,7,8,9,11,12,13,14,15} D
Since there are no new states, we stop here.

Step 4: Transition table Dtran for DFA

NFA State                              DFA State   0   1

{0,1,2,4,7,8,9,11}                     A           B   C
{1,2,3,4,6,7,8,9,10,11,13,14}          B           B   D
{1,2,4,5,6,7,8,9,11,12,13,14}          C           B   D
{1,2,4,5,6,7,8,9,11,12,13,14,15}       D           E   D
{1,2,3,4,6,7,8,9,10,11,13,14,16}       E           B   D

Step 5: Transition Diagram for DFA

Step 6: Minimize the DFA using (i) the Table-Filling Algorithm

Initial table (X marks every pair of one accepting and one non-accepting state):

B
C
D
E   X   X   X   X
    A   B   C   D

1. δ(A,B)
δ(A,0) = B    δ(B,0) = B
δ(A,1) = C    δ(B,1) = D
(A,B) is distinguishable iff (C,D) is; see pair 6 below.
2. δ(A,C)
δ(A,0) = B    δ(C,0) = B
δ(A,1) = C    δ(C,1) = D
(A,C) is distinguishable iff (C,D) is; see pair 6 below.
3. δ(A,D)
δ(A,0) = B    δ(D,0) = E
δ(A,1) = C    δ(D,1) = D
(B,E) is already marked, so (A,D) is not equivalent.
4. δ(B,C)
δ(B,0) = B    δ(C,0) = B
δ(B,1) = D    δ(C,1) = D
On both inputs the pair goes to identical states, so (B,C) is equivalent.
5. δ(B,D)
δ(B,0) = B    δ(D,0) = E
δ(B,1) = D    δ(D,1) = D
(B,E) is already marked, so (B,D) is not equivalent.
6. δ(C,D)
δ(C,0) = B    δ(D,0) = E
δ(C,1) = D    δ(D,1) = D
(B,E) is already marked, so (C,D) is not equivalent; consequently (A,B) and (A,C) from pairs 1 and 2 are also marked.

Final table:

B   X
C   X
D   X   X   X
E   X   X   X   X
    A   B   C   D

Considering all the inputs, the only unmarked pair is (B,C), so B and C are equivalent states.

Minimized DFA using (ii) the Grouping Algorithm

Initial partition: [E] (final states) and [A B C D] (non-final states)

First refinement (on input 0, D moves to E while A, B, C stay): [E] [A B C] [D]

Second refinement (on input 1, A stays in the group while B and C move to [D]): [E] [A] [B C] [D]

Minimized DFA Transition table


States/Input   0     1
A              B,C   B,C
B,C            B,C   D
D              E     D
E              B,C   D
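The table-filling method used in step 6 also admits a direct implementation. The C sketch below is illustrative (states A–E encoded as 0–4, with E accepting): it first marks every accepting/non-accepting pair and then propagates marks until nothing changes; the only pair left unmarked is (B,C), matching the result above.

#include <stdio.h>

#define N 5                                 /* states A..E encoded as 0..4 */

int main(void) {
    /* Dtran for (0+1)*(0+1)10: columns are inputs 0 and 1 */
    int Dtran[N][2] = { {1,2}, {1,3}, {1,3}, {4,3}, {1,3} };
    int accepting[N] = { 0, 0, 0, 0, 1 };
    int marked[N][N] = {0}, changed = 1;

    /* basis: mark every (accepting, non-accepting) pair */
    for (int s = 0; s < N; s++)
        for (int t = 0; t < N; t++)
            if (accepting[s] != accepting[t]) marked[s][t] = 1;

    /* induction: mark (s,t) if some input leads to an already marked pair */
    while (changed) {
        changed = 0;
        for (int s = 0; s < N; s++)
            for (int t = 0; t < N; t++)
                for (int c = 0; c < 2 && !marked[s][t]; c++)
                    if (marked[Dtran[s][c]][Dtran[t][c]]) {
                        marked[s][t] = marked[t][s] = 1;
                        changed = 1;
                    }
    }
    for (int s = 0; s < N; s++)             /* report the unmarked pairs */
        for (int t = s + 1; t < N; t++)
            if (!marked[s][t])
                printf("%c and %c are equivalent\n", 'A' + s, 'A' + t);
    return 0;
}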

4.4 OPTIMIZATION OF DFA-BASED PATTERN MATCHERS (CONVERTING A REGULAR EXPRESSION DIRECTLY TO A DFA)

 Constructs a DFA directly from a regular expression, without constructing an
intermediate NFA; this construction, which yields fewer states, is used in LEX.

 Minimizes the number of states of a DFA: combines states having the same
future behavior.

 Produces a more compact representation of transition tables than the standard
two-dimensional ones.

 Augment the given regular expression by concatenating the special end-marker
symbol #; the result is the augmented regular expression (r)#.

 Create a syntax tree for the augmented regular expression.

o All leaves are alphabet symbols (plus # and the empty string)
o All inner nodes are operators.

Algorithm: Convert Regular Expression Directly To a DFA


 Input
◦ a regular expression r
 Output
◦ A DFA D that recognizes L(r)
 Method
1. Construct the syntax tree of (r) #
2. Compute nullable, firstpos, lastpos, followpos
3. Put firstpos(root) into the states of DFA as an unmarked state.
4. while (there is an unmarked state S in the states of DFA) do
a. mark S
b. for each input symbol a do
i. let s1,...,sn be the positions in S whose symbols are a
ii. S’ ← followpos(s1) ∪ ... ∪ followpos(sn)
iii. Dtran[S,a] ← S’
iv. if (S’ is not in the states of DFA)
1. put S’ into the states of DFA as an unmarked state
the start state of DFA is firstpos(root)
the accepting states of DFA are all states containing the position of #
Functions computed from the syntax tree
In order to construct a DFA directly from the regular expression we have to:
o Build the syntax tree
o Compute functions for finding the positions
 Firstpos, Lastpos, Followpos.
o Find Dtran
o Optimized DFA

Compute four functions on (r)#:

 nullable
 firstpos
 lastpos
 followpos
 nullable(n)
◦ true for a syntax-tree node n if the subexpression represented by n
 has ε in its language
 i.e. it can generate the empty string, even if it can also generate other strings
 firstpos(n)
◦ the set of positions in the subtree rooted at n that correspond to the first symbol of at
least one string in the language of the subexpression rooted at n
 lastpos(n)
◦ the set of positions in the subtree rooted at n that correspond to the last symbol of at
least one string in the language of the subexpression rooted at n
 followpos(p)
◦ for a position p, the set of positions q such that there is some string x = a1a2…an in
L((r)#) and some i for which there is a way to explain the membership of x in L((r)#)
by matching ai to position p of the syntax tree and ai+1 to position q

 Computing nullable, firstpos and lastpos

Example 4.12 (a|b)*abb# augmented regular expression

Syntax Tree Representation Rules


 syntax tree leaves are labeled by ε or by an alphabet symbol
 to each leaf which is not ε we attach a unique integer
◦ the position of the leaf
◦ the position of its symbol
 a symbol may have several positions
◦ symbol a has positions 1 and 3
 positions in the syntax tree correspond to the NFA's important states

Syntax tree of (a|b)*abb#


• each symbol is at a leaf
• each symbol is numbered (positions)
• inner nodes are operators
• cat-node – concatenation operator (dot)
• or-node – union operator |
• star-node – star operator *

Computing Followpos
 A position of a regular expression can follow another position in two ways:
◦ if n is a cat-node c1c2 (rule 1)
 for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i)
◦ if n is a star-node (rule 2)
 if i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i)

◦ Applying rule 1:
◦ followpos(1) includes {3}
◦ followpos(2) includes {3}
◦ followpos(3) includes {4}
◦ followpos(4) includes {5}
◦ followpos(5) includes {6}

◦ Applying rule 2:
◦ followpos(1) includes {1,2}
◦ followpos(2) includes {1,2}
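These two rules, together with the recurrences for nullable, firstpos and lastpos, amount to a single bottom-up tree walk. The C sketch below is an illustration: the syntax tree of (a|b)*abb# is built by hand, position sets are bit masks, and the program prints the same followpos table that is tabulated next.

#include <stdio.h>

typedef unsigned int Set;
#define BIT(p) (1u << (p))

enum { LEAF, CAT, OR, STAR };
typedef struct Node {
    int op, pos;                       /* pos used only for LEAF nodes */
    struct Node *l, *r;
    int nullable; Set first, last;
} Node;

static Set followpos[8];               /* positions 1..6 are used */

static void compute(Node *n) {
    if (n->op == LEAF) {               /* non-ε leaf: not nullable */
        n->nullable = 0;
        n->first = n->last = BIT(n->pos);
        return;
    }
    if (n->l) compute(n->l);
    if (n->r) compute(n->r);
    switch (n->op) {
    case OR:
        n->nullable = n->l->nullable || n->r->nullable;
        n->first = n->l->first | n->r->first;
        n->last  = n->l->last  | n->r->last;
        break;
    case CAT:
        n->nullable = n->l->nullable && n->r->nullable;
        n->first = n->l->nullable ? (n->l->first | n->r->first) : n->l->first;
        n->last  = n->r->nullable ? (n->l->last  | n->r->last)  : n->r->last;
        for (int p = 1; p <= 6; p++)   /* rule 1: cat-node */
            if (n->l->last & BIT(p)) followpos[p] |= n->r->first;
        break;
    case STAR:
        n->nullable = 1;
        n->first = n->l->first;
        n->last  = n->l->last;
        for (int p = 1; p <= 6; p++)   /* rule 2: star-node */
            if (n->l->last & BIT(p)) followpos[p] |= n->l->first;
        break;
    }
}

int main(void) {
    /* leaves: a=1, b=2 (inside the star), a=3, b=4, b=5, #=6 */
    Node a1={LEAF,1}, b2={LEAF,2}, a3={LEAF,3}, b4={LEAF,4}, b5={LEAF,5}, h6={LEAF,6};
    Node or12={OR,0,&a1,&b2}, st={STAR,0,&or12,0};
    Node c1={CAT,0,&st,&a3}, c2={CAT,0,&c1,&b4}, c3={CAT,0,&c2,&b5}, root={CAT,0,&c3,&h6};
    compute(&root);
    for (int p = 1; p <= 6; p++) {
        printf("followpos(%d) = {", p);
        for (int q = 1; q <= 6; q++)
            if (followpos[p] & BIT(q)) printf(" %d", q);
        printf(" }\n");
    }
    return 0;
}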

Compute Followpos

Position(Node) Followpos
1 {1,2,3}
2 {1,2,3}
3 {4}
4 {5}
5 {6}
6 {φ}
Find positions for a & b
a positions = 1, 3
b positions = 2, 4, 5
firstpos(n0) = {1,2,3} = A
 Dtran[A,a] = followpos(1) ∪ followpos(3) = {1,2,3,4} = B
 Dtran[A,b] = followpos(2) = {1,2,3} = A
 Dtran[B,a] = followpos(1) ∪ followpos(3) = B
 Dtran[B,b] = followpos(2) ∪ followpos(4) = {1,2,3,5} = C
 Dtran[C,a] = followpos(1) ∪ followpos(3) = B
 Dtran[C,b] = followpos(2) ∪ followpos(5) = {1,2,3,6} = D
 Dtran[D,a] = followpos(1) ∪ followpos(3) = B
 Dtran[D,b] = followpos(2) = {1,2,3} = A

Since position 6 (the position of #) appears only in D, the accepting state is D.

Optimized DFA Transition table

States/Input   a   b
A              B   A
B              B   C
C              B   D
D              B   A

Example 4.13 (a|b)*a#

Step 1:- Syntax tree

Step 2:- Compute Firstpos and Lastpos

Compute Followpos
Position(Node) Followpos
1 {1,2,3}
2 {1,2,3}
3 {4}
4 {φ}

After we calculate the follow positions and find Dtran, we are ready to create the DFA for the
regular expression.

Find Positions for a & b


a positions = 1,3
b position = 2
Step 3:- Find Dtran
firstpos(n0) = {1,2,3} = A
 Dtran[A,a] = followpos(1) ∪ followpos(3) = {1,2,3,4} = B
 Dtran[A,b] = followpos(2) = {1,2,3} = A
 Dtran[B,a] = followpos(1) ∪ followpos(3) = B
 Dtran[B,b] = followpos(2) = A

Step 4:- Optimized DFA Transition table

States/Input a b
A B A
B B A

Step 5:- Optimized DFA Transition Diagram

Example 4.14 (a*|b)*#

Step 1:- Syntax tree

Step 2:- Compute Firstpos and Lastpos

Compute Followpos

Position(Node) Followpos
1 {1,2,3}
2 {1,2,3}
3 {φ}

Find Positions for a & b


a position = 1
b position = 2
Step 3:- Find Dtran
firstpos(n0) = {1,2,3} = A
 Dtran[A,a] = followpos(1) = {1,2,3} = A
 Dtran[A,b] = followpos(2) = {1,2,3} = A
Step 4:- Optimized DFA Transition table
States/Input a b
A A A
Step 5:- Optimized DFA Transition Diagram

Example 4.15 ((ε|a)b*)*#


Step 1:- Syntax tree

Step 2:- Compute Firstpos and Lastpos

Compute Followpos
Position(Node) Followpos
1 {1,2,3}
2 {1,2,3}
3 {φ}

Find Positions for a & b


a position = 1
b position = 2

Step 3:- Find Dtran


firstpos(n0) = {1,2,3} = A
 Dtran[A,a] = followpos(1) = {1,2,3} = A
 Dtran[A,b] = followpos(2) = {1,2,3} = A

Step 4:- Optimized DFA Transition table

States/Input a b
A A A
Step 5:- Optimized DFA Transition Diagram

Example 4.16 abb(a|b)*#
Step 1:- Syntax tree

Step 2:- Compute Firstpos and Lastpos

Compute Followpos
Position(Node)   Followpos
1                {2}
2                {3}
3                {4,5,6}
4                {4,5,6}
5                {4,5,6}
6                {φ}

(Note that position 3 is not inside the starred subexpression, so followpos(3) does not contain 3 itself.)

Find positions for a & b
a positions = 1, 4
b positions = 2, 3, 5
Step 3:- Find Dtran
firstpos(n0) = {1} = A
 Dtran[A,a] = followpos(1) = {2} = B
 Dtran[B,b] = followpos(2) = {3} = C
 Dtran[C,b] = followpos(3) = {4,5,6} = D
 Dtran[D,a] = followpos(4) = {4,5,6} = D
 Dtran[D,b] = followpos(5) = {4,5,6} = D
Step 4:- Optimized DFA Transition table

States/Input   a   b
A              B   φ
B              φ   C
C              φ   D
D              D   D

Since position 6 appears in D, the accepting state is D.

Step 5:- Optimized DFA Transition Diagram

5. LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS – LEX

A LEX source program is a specification of a lexical analyzer, consisting of a set of
regular expressions together with an action for each regular expression. The action is a piece of
code which is to be executed when a token specified by the corresponding regular expression is
recognized. LEX is a tool generally used to specify lexical analyzers for a variety of
languages. The LEX tool requires a Lex compiler; the specification language for the LEX tool is the Lex language.

Using a Scanner Generator: Lex

 Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Lab,
written in C, running under UNIX.
 Lex produces an entire scanner module that can be compiled and linked with other
compiler modules.
 Lex associates regular expressions with arbitrary code fragments. When an expression is
matched, the code segment is executed.
 A typical lex program contains three sections separated by %% delimiters.

Role of LEX

Creating a lexical analyzer with Lex (or) Specification of LEX

A Lex program (the lex.l file ) consists of three parts:


%{
auxiliary declarations
%}

regular definitions
%%
translation rules
%%
auxiliary procedures
1. Declarations section
It includes declarations of variables, manifest constants (A manifest constant is an
identifier that is declared to represent a constant e.g. # define PIE 3.14), the files to be included
and definitions of regular expressions.

The auxiliary definitions are statements of the form:

D1 = R1
D2 = R2
...
Dn = Rn

where each Di is a distinct name, and each Ri is a regular expression whose symbols are chosen
from Σ ∪ {D1, D2, ..., Di-1}, i.e., characters or previously defined names. The Di's are shorthand
names for regular expressions. Σ is our input symbol alphabet.

2. Translation rules section


The translation rules of a Lex program are statements of the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
… …
where each pi is a regular expression called a pattern, over the alphabet consisting of Σ and the
auxiliary definition names. The patterns describe the form of the tokens. Each action i is a
program fragment describing what action the lexical analyzer should take when token pi is found.
The actions are written in a conventional programming language rather than any particular
language; we use a pseudo language. To create the lexical analyzer L, each of the actions must be
compiled into machine code.
3. Auxiliary procedures
The third section holds whatever auxiliary procedures are needed by the actions.
Alternatively these procedures can be compiled separately and loaded with the lexical analyzer.
The auxiliary procedures are written in C language.
Example: Let us consider the following collection of tokens; the LEX program is shown in the table below.
Auxiliary Definitions
letter = A|B|...|Z
digit = 0|1|...|9
Translation Rules

BEGIN {return 1}
END {return 2}
IF {return 3}
THEN {return 4}
ELSE {return 5}
letter(letter|digit)* {LEX VAL:= INSTALL( ); return 6}
digit+ {LEX VAL:= INSTALL( ); return 7}
< {LEX VAL := 1; return 8}
<= {LEX VAL := 2; return 8}
= {LEX VAL := 3; return 8}
<> {LEX VAL := 4; return 8}
> {LEX VAL := 5; return 8}
>= {LEX VAL := 6; return 8}
How does this Lexical analyzer work?
 The lexical analyzer created by Lex behaves in concert with a parser in the following
manner. When activated by the parser, the lexical analyzer begins reading its remaining
input , one character at a time, until it has found the longest prefix of the input that is
matched by one of the regular expressions p.
 Then it executes the corresponding action. Typically the action will return control to the
parser. However, if it does not, then the lexical analyzer proceeds to find more lexemes,
until an action causes control to return to the parser. The repeated search for lexemes
until an explicit return allows the lexical analyzer to process white space and comments
conveniently.
 The lexical analyzer returns a single quantity, the token, to the parser. To pass an attribute
value with information about the lexeme, we can set the global variable yylval.
•e.g. Suppose the lexical analyzer returns a single token for all the relational operators, in which
case the parser won’t be able to distinguish between ”<=”,”>=”,”<”,”>”,”==” etc. We can set
yylval appropriately to specify the nature of the operator.
LEX Actions
1. yytext is a variable that points to the first character of the lexeme.
2. yywrap() is called when the lexical analyzer reaches the end of a file. If yywrap returns 0,
the lexical analyzer continues scanning (more input is available); if it returns 1, the end of
the input is reported.
3. yyleng is an integer variable telling how long the lexeme is.
4. yyin is the input stream: the source program is read from the file attached to yyin.
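As a small illustrative sketch (separate from the sample program below), a single rule can use yytext and yyleng together:

%%
[a-zA-Z]+   { printf("word \"%s\" of length %d\n", yytext, yyleng); }
.|\n        ;
%%
int main() { yylex(); return 0; }
int yywrap() { return 1; }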

Sample LEX Program
1. Program to find the capital letters in a string using LEX

%{
/* declarations section */
%}

%%
[A-Z]  { printf("%s", yytext); }   /* translation rules section */
.      ;
\n     ;
%%

/* auxiliary procedures */
int main()
{
    printf("Enter some string\n");
    yylex();
    return 0;
}
int yywrap()
{
    return 1;
}

Input
Enter Some string
Panimalar Engineering College
Output
PEC
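On a typical UNIX system, a specification like the one above would be built and run roughly as follows (the file name capital.l is only illustrative; the library flag is -ll with classic lex and -lfl with flex):

lex capital.l                   # generates the scanner source lex.yy.c
cc lex.yy.c -o capital -ll      # compile and link with the lex library
./capital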

DESIGN OF LEXICAL ANALYZER FOR A SAMPLE LANGUAGE

An NFA for Lex program
• Create an NFA for each regular expression
• Combine all the NFAs into one
• Introduce a new start state
• Connect it with ε- transitions to the start states of the NFAs

Pattern Matching with NFA

1. The lexical analyzer reads its input and calculates the set of states it is in at each symbol.
Eventually, it reaches a point with no next state.
2. It looks backwards in the sequence of sets of states, until it finds a set including one or
more accepting states.
3. It picks the one associated with the earliest pattern in the list from the Lex program.
4. It performs the associated action of that pattern.

Pattern Matching with DFA
1. Convert the NFA for all the patterns into an equivalent DFA. For each DFA state containing
more than one accepting NFA state, choose the pattern that is defined earliest as the output of that
DFA state.
2. Simulate the DFA until there is no next state.
3. Trace back to the nearest accepting DFA state, and perform the associated action.
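The simulate-then-trace-back idea can be shown in a few lines of C. Everything in the sketch below is illustrative — the tiny hand-coded DFA, the token names, and the convention that the keyword pattern is listed before the identifier pattern are assumptions, not LEX internals — but it shows how remembering the last accepting position yields the longest match.

#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *input = "ifx", *p = input;
    const char *lastAcceptPos = NULL, *lastToken = NULL;
    int s = 0;                                   /* current DFA state */
    while (*p) {
        if (s == 0 && *p == 'i') s = 1;          /* saw 'i'            */
        else if (s == 1 && *p == 'f') s = 2;     /* "if" matched       */
        else if (isalpha((unsigned char)*p)) s = 3;  /* longer identifier */
        else break;                              /* stuck: no next state */
        p++;
        /* record the last accepting state seen (keyword listed first) */
        if (s == 2) { lastAcceptPos = p; lastToken = "KEYWORD_IF"; }
        if (s == 3) { lastAcceptPos = p; lastToken = "IDENTIFIER"; }
    }
    if (lastAcceptPos)                           /* trace back to last accept */
        printf("token %s, lexeme \"%.*s\"\n",
               lastToken, (int)(lastAcceptPos - input), input);
    return 0;
}

For the input "ifx", the keyword accept after "if" is overridden by the later identifier accept, so the whole lexeme "ifx" is returned as an identifier — longest match first, earliest pattern only as the tie-breaker.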

