
PANIMALAR ENGINEERING COLLEGE

DEPARTMENT OF INFORMATION TECHNOLOGY


21IT1501 – COMPILER ENGINEERING

UNIT II – LEXICAL ANALYSIS

Introduction – The Structure of Compiler – Evolution of Programming Languages – Application of Compiler Technology – Programming Languages Basics – Lexical Analysis – Role of Lexical Analyzer – Specification and Recognition of Tokens – Lexical Analyzer Generators.

INTRODUCTION TO COMPILERS
TRANSLATORS-COMPILATION AND INTERPRETATION
TRANSLATOR
A translator is a program that takes as input a program written in one language
and produces as output a program in another language. Besides program translation,
the translator performs another very important role: error detection. Any
violation of the HLL (High-Level Language) specification is detected and
reported to the programmer.
The important roles of a translator are:
 Translating the HLL program input into an equivalent machine-language (ML) program.
 Providing diagnostic messages wherever the programmer violates the
specification of the HLL.
A translator or language processor is thus a program that translates an input
program written in a programming language into an equivalent program in another
language.
Source Code → Translator → Target Code
Execution in a Translator

Types of Translators:
a. Interpreter
b. Assembler
c. Compiler

1.1.1.a INTERPRETER
An interpreter is a program that appears to execute a source program as if it
were machine language. It is one of the translators that translate a high-level
language to a low-level language.
Source Program (+ Data) → Interpreter → Program Output
Execution in an Interpreter

During execution, it checks the program line by line for errors. Languages such as BASIC,
SNOBOL and LISP can be translated using interpreters. Java also uses an interpreter. The
process of interpretation is carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Example: BASIC, lower versions of Pascal, SNOBOL, LISP and Java
Advantages:
Modification of the user program can easily be made and implemented as execution
proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simpler task for an interpreted
program.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is higher.
1.1.1.b. ASSEMBLER
Programmers found it difficult to write or read programs in machine
language. They began to use a mnemonic (symbolic) notation for each machine
instruction, which they would subsequently translate into machine language. Such a
mnemonic machine language is now called an assembly language. Programs
known as assemblers were written to automate the translation of assembly language
into machine language. The input to an assembler is called the source
program; the output is a machine-language translation (object program).
It translates assembly level language to machine code.
Assembly Language → Assembler → Machine Code

Fig 1.1.3: Execution in an Assembler


Example: assemblers for the 8085 and 8086 microprocessors.
Advantages:
Debugging and verifying.
Making compilers: understanding assembly coding techniques is necessary
for making compilers, debuggers and other development tools.
Optimizing code for size.
Optimizing code for speed.
Disadvantages:
Development time. Writing code in assembly language takes much longer
than writing in a high-level language.
Reliability and security. It is easy to make errors in assembly code.
Debugging and verifying. Assembly code is more difficult to debug and verify
because there are more possibilities for errors than in high-level code.
Portability. Assembly code is platform-specific, so porting to a different platform is
difficult.
1.1.1.c. COMPILER
A compiler is a translator program that takes a program written in a high-level
language (HLL), the source program, and translates it into an equivalent program
in machine-level language (MLL), the target program. An important role of the
compiler is to report errors in the source program to the programmer.
Source Code → Compiler → Target Code
          (Error Messages)
Fig 1.1.4: Execution in a Compiler

Executing a program written in an HLL basically involves two parts. The source
program must first be compiled (translated) into an object program. Then the
resulting object program is loaded into memory and executed.

Source Code → Compiler → Object Code
Input → Object Program → Output

Fig 1.1.5: Execution Process of a Source Program in a Compiler

Example: C, C++, COBOL, higher versions of Pascal.


List of Compilers :

 Ada compilers
 ALGOL compilers
 BASIC compilers
 C# compilers
 C compilers
 C++ compilers
 COBOL compilers
 D compilers
 Common Lisp compilers
 Fortran compilers
 Java compilers
 Pascal compilers
 PL/I compilers
 Python compilers

Difference between Compiler and Interpreter

Sl.No | Compiler | Interpreter
1 | Works on the complete program at once; it takes the entire program as input. | Works line-by-line; it takes one statement at a time as input.
2 | Generates intermediate code, called object code or machine code. | Does not generate intermediate object code or machine code.
3 | Executes conditional control statements (like if-else and switch-case) and logical constructs faster than an interpreter. | Executes conditional control statements at a much slower speed.
4 | Compiled programs take more memory because the entire object code has to reside in memory. | Does not generate intermediate object code; as a result, interpreted programs are more memory efficient.
5 | Compile once and run anytime; a compiled program does not need to be compiled every time. | Interpreted programs are interpreted line-by-line every time they are run.
6 | Does not allow a program to run until it is completely error-free. | Runs the program from the first line and stops execution only if it encounters an error.
7 | Compiled languages are more efficient but difficult to debug. | Interpreted languages are less efficient but easier to debug; this makes such languages an ideal choice for new students.
8 | Example: C, C++, COBOL | Example: BASIC, Visual Basic, Python, Ruby, PHP, Perl, MATLAB, Lisp

Language Processors (COUSINS OF THE COMPILER or Language Processing System):
1. Preprocessors:
A preprocessor produces input to the compiler. It may perform the
following functions.
Macro processing:
A preprocessor may allow a user to define macros that are
shorthands for longer constructs.
File inclusion:
A preprocessor may include header files into the program text. For
example, the C preprocessor causes the contents of the file <global.h> to replace
the statement #include <global.h> when it processes a file containing this
statement.
Rational preprocessors:
These preprocessors augment older languages with more modern flow-of-control
and data-structuring facilities. If constructs like while-statements do not exist
in the programming language, the preprocessor provides them.
Language extensions:
These preprocessors attempt to add capabilities to the language by
what amounts to built-in macros. For example, Equel is a database query
language embedded in C. Statements beginning with ## are taken by the
preprocessor to be database-access statements, unrelated to C, and are
translated into procedure calls on routines that perform the database access.

2. Compiler:
It converts the source program (HLL) into a target program (LLL).
3. Assemblers :
It converts an assembly language (LLL) into machine code. Some
compilers produce assembly for further processing. Other compilers
perform the job of the assembler, producing relocatable machine code that
can be passed directly to the loader/link-editor.
Assembly code is a mnemonic version of machine code, in which
names are used instead of binary codes for operations, and names are also
given to memory addresses. A typical sequence of assembly instructions
might be
MOV a, R1
ADD #2, R1
MOV R1, b
This sequence moves the contents of address a into register 1, adds the
constant 2 to it, and stores the result at the address of b (i.e., it computes b := a + 2).
4. Loader and Link Editors:
Loader:
The process of loading consists of taking relocatable machine code,
altering the relocatable addresses, and placing the altered instructions and
data in memory at the proper locations.
Link Editor:
The link editor allows us to make a single program from several files of
relocatable machine code. These files may have been the result of several
different compilations, and one or more may be library files of routines
provided by the system and available to any program that needs them.
Software Tools
Many software tools that manipulate source programs first perform
some kind of analysis. Some examples of such tools include:
1. STRUCTURE EDITORS: A structure editor takes as input a
sequence of commands to build a source program. The structure
editor not only performs the text creation and modification functions
of an ordinary text editor, but it also analyzes the program text. For
example, it can check that the input is correctly formed, can supply
keywords automatically, etc. The output of such an editor is often
similar to the output of the analysis phase of a compiler.
2. PRETTY PRINTERS: A Pretty Printer analyzes a program and
prints it in such a way that the structure of the program becomes
clearly visible. For example, comments may appear in a special font,
and statements may appear with the amount of indentation
proportional to the depth of the nesting.
3. STATIC CHECKERS: A Static Checker reads a program, analyzes
it and attempts to discover potential bugs without running the
program. The analysis portion is often similar to that found in
optimizing compilers. For example, a static checker may detect that
parts of the source program can never be executed or that a certain
variable might be used before being defined. In addition it can catch
logical errors.
4. INTERPRETERS: Instead of producing a target program as a
translation, an interpreter performs the operations implied by the
source program. For an assignment statement, for example, an
interpreter builds a tree and then carries out the operations at the nodes.

THE ANALYSIS-SYNTHESIS MODEL OF COMPILATION (Parts of compilation)
There are two parts to this mapping: analysis and synthesis.
The analysis part breaks up the source program into constituent pieces
and imposes a grammatical structure on them. It then uses this structure to
create an intermediate representation of the source program.
The synthesis part constructs the desired target program from the
intermediate representation and the information in the symbol table. The
analysis part is often called the front end of the compiler; the synthesis part
is the back end.
During analysis, the operations implied by the source program are
determined and recorded in a hierarchical structure called a TREE. Often a
special kind of tree called a SYNTAX TREE is used.

The syntax tree for the expression position := initial + rate * 60 is shown below.
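A sketch of that syntax tree (the * is nested below the + because of operator precedence):

                 :=
                /  \
        position    +
                   / \
            initial   *
                     / \
                  rate  60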

THE STRUCTURE OF A COMPILER:

The first three phases form the analysis portion of a compiler, and the
last three phases form the synthesis portion. Two other activities,
symbol-table management and error handling, are shown interacting with the
six phases of lexical analysis, syntax analysis, semantic analysis,
intermediate code generation, code optimization, and code generation.
Informally, we shall also call the symbol-table manager and the error
handler "phases".

Symbol-Table Management
An essential function of a compiler is to record the identifiers used in
the source program and collect information about various attributes of each
identifier. These attributes may provide information about the storage
allocated for an identifier, its type, its scope (where in the program it is valid),
and, in the case of procedure names, such things as the number and types of
its arguments, the method of passing each argument, and the type returned,
if any.

A symbol table is a data structure containing a record for each


identifier, with fields for the attributes of the identifier. The data structure
allows us to find the record for each identifier quickly.
Consider the declaration: var position, initial, rate: real;
The type real is not yet known when position, initial, and rate are first seen
by the lexical analyzer.
The remaining phases enter information about identifiers into the symbol
table and then use this information in various ways. For example, when
doing semantic analysis and intermediate code generation, we need to know
what the types of identifiers are, so we can check that the source program
uses them in valid ways, and so that we can generate the proper operations
on them. The code generator typically enters and uses detailed information
about the storage assigned to identifiers.
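As an illustration, here is a minimal symbol-table sketch in Python; the record fields and helper names are assumptions for illustration, not a prescribed layout:

from dataclasses import dataclass, field

@dataclass
class SymbolInfo:
    name: str
    type: str | None = None          # filled in once the declaration is processed
    scope: str = "global"
    attributes: dict = field(default_factory=dict)

symbol_table: dict[str, SymbolInfo] = {}

def install(name: str) -> SymbolInfo:
    # Enter an identifier the first time the lexical analyzer sees it.
    if name not in symbol_table:
        symbol_table[name] = SymbolInfo(name)
    return symbol_table[name]

# The lexical analyzer installs the names; the type is unknown at this point.
for ident in ("position", "initial", "rate"):
    install(ident)

# Later, while processing "var position, initial, rate: real;",
# a subsequent phase fills in the type attribute.
for ident in ("position", "initial", "rate"):
    symbol_table[ident].type = "real"

print(symbol_table["rate"])          # SymbolInfo(name='rate', type='real', ...)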

Error detection and Reporting


Each phase can encounter errors. However, after detecting an error, a
phase must somehow deal with that error, so that compilation can proceed,
allowing further errors in the source program to be detected.
The syntax and semantic analysis phases usually handle a large
fraction of the errors detectable by the compiler. The lexical phase can detect
errors where the characters remaining in the input do not form any token of
the language. During semantic analysis the compiler tries to detect
constructs that have the right syntactic structure but no meaning to the
operation involved. Example. b = a[1] + add, where add is the
procedure name.
A compiler consists of 6 phases:
1) Lexical analysis - groups the input characters into a sequence of tokens.
Input is the source program and output is a stream of tokens.
2) Syntax analysis - input is the token stream and output is a parse tree.
3) Semantic analysis - input is the parse tree and output is an expanded
(annotated) version of the parse tree.
4) Intermediate code generation - here errors are checked and an
intermediate code is produced.
5) Code optimization - the intermediate code is optimized here to get a
better target program.
6) Code generation - this is the final step; here the target program code
is generated.

Source Program
    ↓
Lexical Analyzer
    ↓
Syntax Analyzer
    ↓
Semantic Analyzer
    ↓
Intermediate Code Generator
    ↓
Code Optimizer
    ↓
Code Generator
    ↓
Target Program

(The Symbol-Table Manager and the Error Handler interact with all six phases.)
ANALYSIS OF THE SOURCE PROGRAM:
Analysis consists of three phases.
1. LINEAR ANALYSIS: The stream of characters making up the source
program is read from left to right and grouped into tokens, which are
sequences of characters having a collective meaning.
2. HIERARCHICAL ANALYSIS: Characters or tokens are grouped
hierarchically into nested collections with collective meaning.
3. SEMANTIC ANALYSIS: Certain checks are performed to ensure
that the components of the program fit together meaningfully.
LEXICAL ANALYSIS:
In a compiler, linear analysis is called lexical analysis or scanning.
For example, in lexical analysis the characters in the assignment statement
position := initial + rate * 60
would be grouped into the following tokens:
1. The identifier position
2. The assignment symbol :=
3. The identifier initial
4. The plus sign
5. The identifier rate
6. The multiplication sign
7. The number 60
The blanks are usually eliminated during lexical analysis.
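To make the grouping concrete, here is a minimal scanner sketch in Python covering just the token classes above; the token names and patterns are illustrative assumptions, not a fixed standard:

import re

# Token specification: each pair is (token name, regular expression).
TOKEN_SPEC = [
    ("NUMBER",  r"\d+(\.\d+)?"),
    ("ID",      r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN",  r":="),
    ("PLUS",    r"\+"),
    ("TIMES",   r"\*"),
    ("SKIP",    r"[ \t]+"),          # blanks are eliminated
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("position := initial + rate * 60")))
# [('ID', 'position'), ('ASSIGN', ':='), ('ID', 'initial'),
#  ('PLUS', '+'), ('ID', 'rate'), ('TIMES', '*'), ('NUMBER', '60')]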

SYNTAX ANALYSIS:
Hierarchical analysis is called parsing or syntax analysis. It involves
grouping the tokens of the source program into grammatical phrases that
are used by the compiler to synthesize output.
The hierarchical structure of a program is usually expressed by
recursive rules. The rules are
1. Any identifier is an expression
2. Any number is an expression
3. If expression1 and expression2 are expressions, then so are
expression1 + expression2
expression1 * expression2
( expression1 )
Rules (1) and (2) are basis rules, while (3) defines expressions in terms
of operators applied to other expressions. Thus by rule (1), initial and rate
are expressions. By rule (2), 60 is an expression, while by rule (3), we can
first infer that rate*60 is an expression and finally that initial + rate * 60
is an expression.
Similarly, many languages define statements recursively by rules
such as:
1. If identifier1 is an identifier and expression2 is an expression, then
identifier1 := expression2 is a statement.
2. If expression1 is an expression and statement2 is a statement,
then
while ( expression1 ) do statement2
if ( expression1 ) then statement2
are statements.
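The recursive flavor of these rules maps directly onto a recursive-descent recognizer. Below is a toy sketch in Python for the expression rules, rewritten in an equivalent non-left-recursive form (precedence of * over + is ignored to keep the sketch short):

# Recognizer for:  expr -> term { ('+' | '*') term }
#                  term -> id | number | '(' expr ')'
def accepts(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expr():
        nonlocal pos
        if not term():
            return False
        while peek() in ("+", "*"):      # rule 3: expr (+|*) expr
            pos += 1
            if not term():
                return False
        return True

    def term():
        nonlocal pos
        tok = peek()
        if tok == "(":                   # rule 3: ( expr )
            pos += 1
            if not expr() or peek() != ")":
                return False
            pos += 1
            return True
        if tok is not None and (tok.isidentifier() or tok.isdigit()):
            pos += 1                     # rules 1 and 2: id or number
            return True
        return False

    return expr() and pos == len(tokens)

print(accepts(["initial", "+", "rate", "*", "60"]))   # True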

SEMANTIC ANALYSIS:
The semantic analysis checks the source program for semantic errors
and gathers type information for the subsequent code-generation phase. It
uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking. Here
the compiler checks that each operator has operands that are permitted by
the source language specification. For example, many programming
language definitions require a compiler to report an error every time a real
number is used to index an array.

SYNTHESIS OF THE SOURCE PROGRAM

INTERMEDIATE CODE GENERATION:
After syntax and semantic analysis, some compilers generate an
explicit intermediate representation of the source program. This intermediate
representation can have a variety of forms.
In three-address code, the source program might look like this:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

The translation of the statement position := initial + rate * 60 through the phases:

position := initial + rate * 60
    ↓ lexical analyzer
id1 := id2 + id3 * 60
    ↓ syntax analyzer
(syntax tree: := at the root, with id1 on the left and + on the right;
 the + node has children id2 and a * node whose children are id3 and 60)
    ↓ semantic analyzer
(the same tree, with 60 wrapped in an inttoreal conversion node)
    ↓ intermediate code generator
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
    ↓ code optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1
    ↓ code generator
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Code Optimization
The code optimization phase attempts to improve the intermediate
code, so that faster-running machine code will result. Some optimizations
are trivial. There is great variation in the amount of code optimization
different compilers perform. In those that do the most, called ‘optimizing
compilers’, a significant fraction of the time of the compiler is spent on this
phase.
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation
The final phase of the compiler is the generation of target code,
consisting normally of relocatable machine code or assembly code. Memory
locations are selected for each of the variables used by the program. Then,
intermediate instructions are each translated into a sequence of machine
instructions that perform the same task. A crucial aspect is the assignment
of variables to registers.
For example, using registers 1 and 2, the translation of the code in
code optimizer becomes,
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The first and second operands of each instruction specify a source
and destination, respectively. The F in each instruction tells us that the
instruction deals with floating-point numbers.

ERRORS ENCOUNTERED IN DIFFERENT PHASES

One of the important tasks that a compiler must perform is the
detection of, and recovery from, errors. Recovery from errors is important,
because the compiler will be scanning and compiling the entire program,
perhaps in the presence of errors; so as many errors as possible need to be
detected.
Each of the six phases (but mainly the analysis phase) of a compiler
can encounter errors. On detecting an error the compiler must:
 Report the error in a helpful way
 Correct the error if possible
 Continue processing (if possible) after the error to look for further
errors.
Types of Errors:
Errors are either syntax errors or semantic errors.
Syntax Errors:
When the rules of the programming language are not followed, the
compiler reports syntax errors.
For example, consider the statement: int a, b :
The above statement produces a syntax error because the statement is
terminated with : rather than ;
Also, these are errors in the program text; they may be either lexical or grammatical.
a. A lexical error is a mistake in a lexeme.
For example, typing tehn instead of then, or missing one of the quotes in a literal.
b. A grammatical error is one that violates the grammatical rules of
the language.
For example: if x = 7
y := 4 (missing then keyword)
Semantic Errors:
These are mistakes concerning the meaning of a program construct; they
may be type errors, logical errors or runtime errors.
a. Type errors occur when an operator is applied to an argument of the
wrong type or to the wrong number of arguments.
b. Logical errors are errors in the output of the program. The presence
of logical errors leads to undesired or incorrect output; they are caused
by an error in the logic applied in the program to produce the desired
output.
Logical errors cannot be detected by the compiler, and thus a
programmer has to check the entire code of a program line by line.
For example, in while x = y do ...
when x and y initially have the same value and the body of the loop does not
change the value of either x or y, the loop never terminates.
c. Runtime errors occur during the execution of a program and
generally arise from some illegal operation performed in the
program.
Examples:
 Dividing a number by zero
 Trying to open a file which is not created
 Lack of free memory space

THE GROUPING OF PHASES

Front end and back end:
The phases are collected into a front end and a back end.
o Front end : The front end consists of those phases that depend primarily
on the source language and are largely independent of the target
machine. The phases are
o Lexical analysis
o Syntax analysis
o Semantic analysis
o Intermediate code generation
o Some code optimization
o Back end : The back end includes those portions of the compiler that
depend on the target machine. The phases in the back end are the code
optimization and code generation phases, along with error handling
and symbol-table operations.
Passes:
Several phases of compilation are usually implemented in a single pass
consisting of reading an input file and writing an output file. There is great
variation in the way the phases of a compiler are grouped into passes. With
the syntax analyzer "in charge", it attempts to discover the grammatical
structure of the tokens it sees; it obtains tokens as it needs them, by calling
the lexical analyzer to find the next token. As the grammatical structure is
discovered, the parser calls the intermediate code generator to perform
semantic analysis and generate a portion of the code.

Reducing the Number of passes


It is desirable to have few passes, since it takes time to read and
write intermediate files. On the other hand, if we group several phases into
one pass, we may be forced to keep the entire program in memory, because
one phase may need information in a different order than a previous phase
produces it. The internal form of the program may be considerably larger than
either the source program or the target program.
For some phases, grouping into one pass presents few problems. For example,
the interface between the lexical and syntactic analyzers can often be limited
to a single token. On the other hand, it is often very hard to perform code
generation until the intermediate representation has been completely generated.
In some cases, it is possible to leave a blank slot for missing
information, and fill in the slot when the information becomes available. In
particular intermediate and target code generation can often be merged into
one pass using a technique called “backpatching”
We can combine the action of the passes as follows. On encountering
an assembly statement that is a forward reference, say
GOTO target
we generate the skeletal instruction, with the machine operation
code for GOTO and blanks for the address. All instructions with blanks for
the address of target are kept in a list associated with the symbol-table entry
for target. The blanks are filled in when we finally encounter an instruction
such as
target: MOV foobar, R1
and determine the value of target; it is the address of the current instruction.
We then “backpatch”, by going down the list for target of all the instructions
that need its address, substituting the address of target for the blanks in the
address fields of those instructions. This approach is easy to implement if the
instructions can be kept in memory until target addresses can be determined.
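A minimal sketch of this bookkeeping in Python — a toy one-pass assembler over GOTO and ordinary instructions; the instruction format is an assumption for illustration:

# Toy backpatching: emit skeletal GOTOs with blank addresses, remember
# which slots refer to each label, and patch them when the label is seen.
def assemble(lines):
    code = []              # list of (opcode, operand) pairs
    unresolved = {}        # label -> list of code indices to backpatch
    labels = {}            # label -> resolved address

    for line in lines:
        if line.endswith(":"):             # a label definition, e.g. "target:"
            label = line[:-1]
            labels[label] = len(code)
            for slot in unresolved.pop(label, []):
                code[slot] = ("GOTO", labels[label])   # fill in the blank
        elif line.startswith("GOTO "):
            label = line.split()[1]
            if label in labels:
                code.append(("GOTO", labels[label]))
            else:                          # forward reference: leave a blank
                unresolved.setdefault(label, []).append(len(code))
                code.append(("GOTO", None))
        else:
            code.append(("OP", line))
    return code

print(assemble(["GOTO target", "MOV foobar, R1", "target:", "ADD #1, R1"]))
# [('GOTO', 2), ('OP', 'MOV foobar, R1'), ('OP', 'ADD #1, R1')]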

CONSTRUCTION OF COMPILER TOOLS

The compiler writer, like any software developer, can profitably use
modern software development environments containing tools such as
language editors, debuggers, version managers, profilers, test harnesses,
and so on. In addition to these general software- development tools,
other more specialized tools have been created to help implement various
phases of a compiler. These tools use specialized languages for
specifying and implementing specific components, and many use quite
sophisticated algorithms. The most successful tools are those that hide the
details of the generation algorithm and produce components that can be
easily integrated into the remainder of the compiler. Some commonly used
compiler-construction tools include:

1. Parser generators that automatically produce syntax analyzers from
input that is based on a context-free grammar. In early compilers, syntax
analysis consumed not only a large fraction of the running time of a
compiler, but also a large fraction of the effort of writing a compiler.
2. Scanner generators that produce lexical analyzers from a regular-
expression description of the tokens of a language. The basic
organization of the resulting lexical analyzer is in effect a finite automaton.
3. Syntax-directed translation engines that produce collections of
routines for walking a parse tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection
of rules for translating each operation of the intermediate language into the
machine language for a target machine. The rules must include sufficient
detail to handle the different possible access methods for data; e.g.,
variables may be in registers, in a fixed location in memory, or allocated
a position on a stack. The basic technique is "template matching": the
intermediate-code statements are replaced by "templates" that represent
sequences of machine instructions, in such a way that assumptions about
storage of variables match from template to template.
5. Data-flow analysis engines that facilitate the gathering of information
about how values are transmitted from one part of a program to each other part.
Data-flow analysis is a key part of code optimization.

Exercise: Phases of Compiler

1. c = a + b * d - 4
2. c = (b + c) * (b + c) * 2
3. b = b^2 - 4ac
4. result = (height * width) + (rate * 2)

LEXICAL ANALYZER
1. NEED AND ROLE OF THE LEXICAL ANALYZER
◦ It is the first phase of a compiler, also called the scanner.
◦ To identify the tokens, we need some method of describing the possible tokens that can
appear in the input stream. For this purpose we introduce regular expressions, a notation
that can be used to describe essentially all the tokens of a programming language.
◦ Secondly, having decided what the tokens are, we need some mechanism to recognize
these in the input stream. This is done by token recognizers, which are designed
using transition diagrams and finite automata.
Main Task:
 To read input characters and produce as output a sequence of tokens that the parser uses
for syntax analysis (token identification).

Interaction of Lexical Analyzer with Parser

 Upon receiving a getNextToken command from the parser, the lexical analyzer reads
input characters until it can identify the next token.
Secondary Tasks:
⚫ It produces the stream of tokens; every basic element of the language must be reduced to a token.
⚫ It strips out comments and whitespace while creating the tokens.
⚫ It generates the symbol table, which stores information about identifiers and
constants encountered in the input.
⚫ It keeps track of line numbers.
⚫ It correlates any error with the source file and line number, and reports errors
encountered while generating the tokens.
⚫ If the source language uses a macro preprocessor (e.g., #define pi 3.14), expansion
of macros may be performed by the lexical analyzer.
Some lexical analyzers are divided into a cascade of two phases:
1. Scanning – responsible for the simple task of scanning the source program
to recognize the tokens.
2. Lexical analysis – responsible for the more complex tasks, performing all the secondary tasks.
Issues in Lexical Analysis: (Lexical Analysis vs. Parsing)
Reasons for separating the analysis phase of compiling into lexical analysis and
parsing are as follows:
 Simplicity of design
◦ Separation of lexical analysis from syntactic analysis simplifies at least one of the tasks.
◦ e.g., a parser that had to deal with whitespace would be more complex.
 Improved compiler efficiency
◦ Reading of input characters can be sped up using specialized buffering techniques.
 Enhanced compiler portability
◦ Input-device peculiarities are restricted to the lexical analyzer.
Tokens, Patterns, Lexemes:
Token: a sequence of characters having a collective meaning.
Example: keywords, identifiers, operators, special characters, constants, etc.
Pattern: the set of rules by which a set of strings is associated with a single token.
Example:
 for a keyword, the pattern is the character sequence forming that keyword;
 for identifiers, the pattern is a letter followed by any number of letters or digits.
Lexeme: a sequence of characters in the source program matching the pattern for a token.
Examples of Tokens:

Token | Informal Description | Sample Lexemes
if | characters i, f | if
case | characters c, a, s, e | case
comparison | < or > or <= or >= or == or != | <=, !=
identifier | letter followed by letters and digits | area, result, m1
number | any numeric constant | 3.123, 0, 05e29
literal | anything enclosed by "" | "3CSE"

Example: consider a conditional statement in the C language: while( a >= 10 )

Lexeme | Token
while | keyword
( | parenthesis
a | identifier
>= | relational operator
10 | number
) | parenthesis
Attributes for Tokens
 When more than one lexeme can match a pattern, the lexical analyzer must provide
additional information about the particular lexeme that matched to the subsequent
phases of the compiler.
 For example, the pattern for token number matches both 0 and 1, but it is extremely
important for the code generator to know which lexeme was found in the source
program.
 For tokens corresponding to keywords, attributes are not needed, since the name of the
token tells everything.
 Usually, the attribute of a token is a pointer to the symbol-table entry that keeps
information about the token.
Example of Attribute Values:
Example 1 : PE = M * G * H
◦ <id, pointer to symbol table entry for PE>
◦ <assign_op>
◦ <id, pointer to symbol-table entry for M>
◦ <mult_op>

◦ <id, pointer to symbol-table entry for G>
◦ <mult_op>
◦ <id, pointer to symbol-table entry for H>
Symbol Table (the data will be stored in the symbol table as follows):

Symbol | Token | Data type | Initialized
PE | id1 | int | Yes
M | id2 | int | Yes
G | id3 | int | Yes
H | id4 | int | Yes

Example 2: E = M * C ** 2
◦ <id, pointer to symbol-table entry for E>
◦ <assign_op>
◦ <id, pointer to symbol-table entry for M>
◦ <mult_op>
◦ <id, pointer to symbol-table entry for C>
◦ <exp_op>
◦ <number, integer value 2>
Symbol Table (the data will be stored in the symbol table as follows):

Symbol | Token | Data type | Initialized
E | id1 | int | Yes
M | id2 | int | Yes
C | id3 | int | Yes

Lexical Errors
 Few errors are visible at the lexical level alone, because the lexical analyzer has a very
localized view of the source program.
 For instance, suppose the string whlle occurs in a source program.
 Example: in whlle ( a <= 7 ), a lexical analyzer cannot tell whether whlle is a misspelling
of the keyword while or an undeclared function identifier.
 Since whlle is a valid lexeme for the token id, the lexical analyzer must return that token
to the parser and let some other phase of the compiler handle the error due to the misspelling
of the letters. However, suppose a circumstance arises in which the lexical analyzer is
unable to proceed because none of the patterns for tokens matches any prefix of the
remaining input. The simplest recovery strategy is "panic mode" recovery.
 Errors the lexical analyzer cannot act on by itself:
o Misspelling of the keyword while
o An undeclared function identifier
Error-recovery actions
1. Delete successive characters from the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what input is left.
2. Delete one character from the remaining input.
3. Insert a missing character into the remaining input.
4. Replace a character by another character.
5. Transpose two adjacent characters.
17
INPUT
BUFFERING
 We often have to look one or more characters beyond the next lexeme before we can be
sure we have the right lexeme. As characters are read from left to right, each character is
stored in the buffer to form a meaningful token
 We introduce a two-buffer scheme that handles large look ahead’s safely. We then
consider animprovement involving "sentinels" that saves time checking for the ends of
buffers.
BUFFER PAIRS
 A buffer is divided into two N-character halves, as shown below
 Each buffer is of the same size N, and N is usually the number of characters on one disk
block. E.g., 1024 or 4096 bytes.
 Using one system read command we can read N characters into a buffer.
 If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file.

 Two pointers to the input are maintained:


1. Pointer lexeme_beginning, marks the beginning of the current
lexeme, whose extent we are attempting to determine.

2. Pointer forward scans ahead until a pattern match is found.


Once the next lexeme is determined, forward is set to the character at its right end.
 The string of characters between the two pointers is the current
lexeme. After the lexeme is recorded as an attribute value of a token returned
to the parser, lexeme_beginning is set to the character immediately after the
lexeme just found.
Advancing forward pointer:
Advancing forward pointer requires that we first test whether we have reached the end of one of the
buffers, and if so, we must reload the other buffer from the input, and move forward to the
beginning of the newly loaded buffer. If the end of the second buffer is reached, we must again reload
the first buffer with input, and the pointer wraps to the beginning of the buffer.
Code to advance forward pointer:
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
SENTINELS:

 For each character read, we make two tests: one for the end of the buffer, and one to
determine what character is read. We can combine the buffer-end test with the test for the
current character if we extend each buffer to hold a sentinel character at the end.
 The sentinel is a special character that cannot be part of the source program, and a natural
choice is the character eof.
 The sentinel arrangement is as shown below:

Sentinels at the end of each buffer

Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other
than at the end of a buffer means that the input is at an end.
Code to advance forward pointer:
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
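A runnable sketch of the buffer-pair-with-sentinel idea in Python; the half size N, the sentinel character, and the class layout are assumptions for illustration:

import io

N = 8                       # illustrative buffer-half size (e.g., a disk block)
EOF_SENTINEL = "\0"         # stand-in for the eof sentinel character

class DoubleBuffer:
    def __init__(self, stream):
        self.stream = stream
        self.halves = [self._load(), self._load()]   # preload both halves
        self.half = 0       # which half the forward pointer is in
        self.i = 0          # forward pointer within that half

    def _load(self):
        # One system read of up to N characters, plus the sentinel at the end.
        return self.stream.read(N) + EOF_SENTINEL

    def next_char(self):
        ch = self.halves[self.half][self.i]
        if ch != EOF_SENTINEL:
            self.i += 1
            return ch
        # Sentinel reached: either this half was short (true end of input) ...
        if len(self.halves[self.half]) < N + 1:
            return None
        # ... or the half is exhausted: reload it and move into the other half.
        self.halves[self.half] = self._load()
        self.half = 1 - self.half
        self.i = 0
        return self.next_char()

buf = DoubleBuffer(io.StringIO("position := initial + rate * 60"))
out = []
while (c := buf.next_char()) is not None:
    out.append(c)
print("".join(out))         # position := initial + rate * 60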

2. EXPRESSING TOKENS BY REGULAR EXPRESSIONS / SPECIFICATION OF TOKENS
 Regular expressions are a notation for specifying patterns.
 Each pattern matches a set of strings.
 Regular expressions will serve as names for sets of strings.
Strings and Languages
A string is a finite sequence of symbols. For example:
computer (c, o, m, p, u, t, e, r)
CS6660 (C, S, 6, 6, 6, 0)
101001 (over the symbols 1, 0)
• Symbols are drawn from an alphabet. An alphabet is a finite set of symbols.
• The term alphabet or character class denotes any finite set of symbols; e.g., the set
{0,1} is the binary alphabet.
• The terms sentence and word are often used as synonyms for the term string.
• The length of a string s, written |s|, is the number of occurrences of symbols
in s; e.g., the string "cs6660" is of length six.
• The empty string is denoted by ε; the length of the empty string is zero.
 The term language denotes any set of strings over some fixed alphabet.
• e.g., the empty set φ and {ε}, the set containing only the empty string, are both languages.
Operations on strings:
1. Concatenation: If x and y are strings, then the concatenation of x and y (written xy) is the
string formed by appending y to x. If x = hello and y = world, then xy = helloworld.
2. Exponentiation: Let s be a string; then
s^0 = ε, s^1 = s, s^2 = ss, s^3 = sss, ..., s^n = ss...s (n times), and so on.
3. Identity element: sε = εs = s.
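These operations map directly onto Python string operations, for instance:

# Concatenation and exponentiation of strings, as defined above.
x, y = "hello", "world"
print(x + y)          # helloworld  (concatenation xy)
s = "ab"
print(s * 3)          # ababab      (s^3 = sss)
print(s * 0 == "")    # True        (s^0 is the empty string ε)
print(s + "" == s)    # True        (ε is the identity for concatenation)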

Terms for parts of a string:

TERM | DEFINITION
Prefix of s | A string obtained by removing zero or more trailing symbols of string s; e.g., cs is a prefix of cs6660.
Suffix of s | A string formed by deleting zero or more of the leading symbols of s; e.g., 660 is a suffix of cs6660.
Substring of s | A string obtained by deleting a prefix and a suffix from s; e.g., 66 is a substring of cs6660.
Proper prefix, suffix, or substring of s | Any nonempty string x that is a prefix, suffix or substring of s such that s ≠ x.
Subsequence of s | Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., c60 is a subsequence of cs6660.

Operations on Languages:
There are several operations that can be applied to languages.
Definitions of operations on languages L and M:

OPERATION | DEFINITION
Union of L and M, written L ∪ M | L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M, written LM | LM = { st | s is in L and t is in M }
Kleene closure of L, written L* | L* denotes "zero or more concatenations of" L.
Positive closure of L, written L+ | L+ denotes "one or more concatenations of" L.
Example 1
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}.
Operations:

1. L ∪ D is the set of letters and digits — strictly speaking, the language with 62 strings of
length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L5 is the set of all 5-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.

Example 2
Let W be the set of characters {c, o, m, p, i, l, e, r} and let N be the set of digits {1, 2, 3}.
Operations:
1. W ∪ N is the set of characters and digits — the language with 11 strings of length one,
each of which is either one character or one digit.
2. WN is the set of 24 strings of length two, each consisting of one character followed by
one digit.
3. W3 is the set of all 3-character strings.
4. W* is the set of all strings of characters, including ε, the empty string.
5. W(W ∪ N)* is the set of all strings of characters and digits beginning with a character.
6. N+ is the set of all strings of one or more digits.
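These set operations are easy to check mechanically. A small sketch in Python over the sets of Example 2 (the closures L* and L+ are infinite, so only a bounded power is computed):

from itertools import product

W = set("compiler")            # {c, o, m, p, i, l, e, r}
N = {"1", "2", "3"}

def concat(L, M):
    # Concatenation LM = { st | s in L, t in M }.
    return {s + t for s, t in product(L, M)}

def power(L, k):
    # L^k: k-fold concatenation of L (L^0 = {ε}).
    result = {""}
    for _ in range(k):
        result = concat(result, L)
    return result

print(len(W | N))              # 11  (union W ∪ N)
print(len(concat(W, N)))       # 24  (WN)
print(len(power(W, 3)))        # 512 (all 3-character strings over W)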

Regular Expressions
 A regular expression allows us to define the sets that form tokens.
 For example, a Pascal identifier is formed by a letter followed by zero or
more letters or digits: letter ( letter | digit )*
 A regular expression is built up using a set of defining rules.
 Each regular expression r denotes a language L(r).
 The rules define the regular expressions over an alphabet ∑. Associated with each
rule is a specification of the language denoted by the regular expression being defined.
BASIS
Rule 1: ε is a regular expression that denotes {ε}, i.e. the set containing
the empty string.
Rule 2: If a is a symbol in ∑, then a is a regular expression that denotes {a},
i.e. the set containing the string a.
INDUCTION (Rule 3): Suppose r and s are regular expressions denoting the
languages L(r) and L(s). Then
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
(r)(s) is a regular expression denoting the language L(r)L(s).
(r)* is a regular expression denoting the language (L(r))*.
(r) is a regular expression denoting the language L(r).
• A language denoted by a regular expression is said to be a regular set.
• The specification of a regular expression is an example of a recursive definition.

Order of evaluation of regular expressions:
As defined, regular expressions often contain unnecessary pairs of parentheses. We may
drop certain pairs of parentheses if we adopt the conventions that:
The unary operator * has the highest precedence and is left associative.
Concatenation has the second-highest precedence and is left associative.
| has the lowest precedence and is left associative.
Example 3: We may replace the regular expression (a)|((b)*(c)) by a|b*c.
Both expressions denote the set of strings that are either a single a or are zero or more
b's followed by one c.
Example 4: Let ∑ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet
∑. Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all
strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for the same
language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings
consisting of zero or more a's and ending in b.
Regular set
A language that can be defined by a regular expression is called a regular set. If two regular
expressions r and s denote the same regular set, we say they are equivalent and write r = s. For
instance, (a|b) = (b|a).
The table below shows some of the algebraic laws that hold for arbitrary regular expressions r, s and t.

Algebraic properties of regular expressions:

AXIOM | DESCRIPTION
r|s = s|r | | is commutative
r|(s|t) = (r|s)|t | | is associative
(rs)t = r(st) | concatenation is associative
r(s|t) = rs|rt and (s|t)r = sr|tr | concatenation distributes over |
εr = r and rε = r | ε is the identity element for concatenation
r* = (r|ε)* | relation between * and ε
r** = r* | * is idempotent

Regular Definition
For notational convenience, we need to give names for regular expressions and to
define regular expressions using these names as if they were symbols.
Identifiers are strings of letters and digits beginning with a letter. The following
regular definition provides a clear specification for such strings.
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the
form
d1 → r1
d2 → r2
…
dn → rn
where each di is a distinct name, and each ri is a regular expression over the symbols in ∑ ∪ {d1,
d2, … , di-1}, i.e., the basic symbols and the previously defined names.
Example: C identifiers are strings of letters, digits, and underscores. Here is a
regular definition for the language of C identifiers:
letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
id → letter_ ( letter_ | digit )*
Example: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234,
6.336E4, or 1.89E-4. The regular definition is:
digit → 0 | 1 | ... | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
Example:
Write a regular definition to represent a date in the following format: JAN 5th 2016
Date format = Month Day Year
Month → JAN | FEB | ... | DEC
Day → ( [0-3] [0-9] | [0-9] ) ( st | nd | rd | th )
Year → ( 1 | 2 ) [0-9] [0-9] [0-9]
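As a quick check, the definition can be transcribed as a Python regular expression; the whitespace handling and the one-digit-day allowance are assumptions, since the sample date "JAN 5th 2016" uses a single-digit day:

import re

month = r"(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)"
day   = r"([0-3]?[0-9])(st|nd|rd|th)"
year  = r"([12][0-9]{3})"
date  = re.compile(month + r"\s*" + day + r"\s*" + year)

print(bool(date.fullmatch("JAN 5th 2016")))    # True
print(bool(date.fullmatch("DEC 31st 2999")))   # True
print(bool(date.fullmatch("FOO 5th 2016")))    # False (not a month)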
Extensions of Regular Expressions
• Notational Shorthand:
– This shorthand is used in certain constructs that occur frequently in regular expressions.
1. One or more instances. The unary, postfix operator + represents the positive closure of
a regular expression and its language. That is, if r is a regular expression, then (r)+ denotes
the language (L(r))+. The operator + has the same precedence and associativity as the
operator *. Two useful algebraic laws, r* = r+ | ε and r+ = rr* = r*r, relate the Kleene
closure and positive closure.
2. Zero or one instance. The unary postfix operator ? means "zero or one occurrence." That
is, r? is equivalent to r|ε; put another way, L(r?) = L(r) ∪ {ε}. The ? operator has the
same precedence and associativity as * and +.
3. Character classes. A regular expression a1|a2|…|an, where the ai's are each symbols of
the alphabet, can be replaced by the shorthand [a1a2…an]. Where the symbols are
consecutive — uppercase letters, lowercase letters, or digits — we can write just
the first and last separated by a hyphen, a1-an. Thus, [abc] is shorthand for a|b|c, and
[a-z] is shorthand for a|b|…|z.
Example: Using these shorthands, we can rewrite the regular definition for identifiers as:
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*
and the regular definition for unsigned numbers as:
digit → [0-9]
digits → digit+
optionalFraction → (. digits)?
optionalExponent → ( E [+-]? digits )?
number → digits (. digits)? ( E [+-]? digits )?
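The shorthand form carries over almost verbatim to practical regex engines. A sketch in Python (the anchoring via fullmatch is just for testing):

import re

# number -> digits (. digits)? ( E [+-]? digits )?
number = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

for s in ("5280", "0.01234", "6.336E4", "1.89E-4", "1.", "E4"):
    print(s, bool(number.fullmatch(s)))
# 5280 True, 0.01234 True, 6.336E4 True, 1.89E-4 True, 1. False, E4 False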

Recognition of Tokens:

We have learned how to express patterns using regular expressions. Now we must study how
to take the patterns for all the needed tokens and build a piece of code that examines the input
string and finds a prefix that is a lexeme matching one of the patterns.

Grammar for branching statements:

stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | number

For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals"
and <> is "not equals", because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of
tokens as far as the lexical analyzer is concerned; the patterns for these tokens are described using
regular definitions.
Patterns for tokens of the above grammar:

digit → [0-9]
digits → digit+
number → digits (. digits)? ( E [+-]? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>

In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing
the "token" ws defined by:

ws → ( blank | tab | newline )+

Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters
of the same names. Token ws is different from the other tokens in that, when we recognize it,
we do not return it to the parser, but rather restart the lexical analysis from the character
that follows the whitespace. It is the following token that gets returned to the parser.

Lexeme | Token Name | Attribute Value
Any ws | – | –
if | if | –
then | then | –
else | else | –
Any id | id | pointer to table entry
Any number | number | pointer to table entry
< | relop | LT
<= | relop | LE
= | relop | EQ
<> | relop | NE
> | relop | GT
>= | relop | GE
Transition Diagrams
A transition diagram consists of a collection of nodes or circles, called states.
Each state represents a condition that can occur during the process of reading the
input, looking for a lexeme that matches one of many patterns.

Edges are directed from one state to another. Each edge is labeled by a symbol or
set of symbols. If we are in a state s, and the next input symbol is a, we look
for an edge out of state s labeled by a. If we find such an edge, we advance the
forward pointer and enter the state of the transition diagram to which that edge
leads.

Some important conventions about transition diagrams are:

1. Certain states are said to be accepting, or final. These states indicate that a
lexeme has been found, although the actual lexeme may not consist of all positions
between the lexemeBegin and forward pointers. We always indicate an accepting state
by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we
additionally place a * near that accepting state.
3. One state is designated the start state, or initial state. It is indicated by an edge,
labeled "start", entering from nowhere. The transition diagram always begins in the
start state before any input symbols have been read.

Transition Diagram for relop

Relational operators: < | > | <= | >= | = | <>
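Since the diagram itself is not reproduced here, the same recognizer can be transcribed as code. A sketch in Python; the attribute names follow the token table above, and the state numbering of the diagram is left implicit:

# Transition-diagram recognizer for relop: < | > | <= | >= | = | <>
# Returns (token_attribute, chars_consumed); stopping after one character
# when no two-character operator matches corresponds to the * (retract)
# convention on accepting states.
def relop(s):
    if not s:
        return None
    if s[0] == "=":
        return ("EQ", 1)
    if s[0] == "<":
        if len(s) > 1 and s[1] == "=":
            return ("LE", 2)
        if len(s) > 1 and s[1] == ">":
            return ("NE", 2)
        return ("LT", 1)       # accepting with retraction: '<' alone
    if s[0] == ">":
        if len(s) > 1 and s[1] == "=":
            return ("GE", 2)
        return ("GT", 1)       # accepting with retraction: '>' alone
    return None                # no edge out of the start state: fail

print(relop("<= 10"))   # ('LE', 2)
print(relop("<>"))      # ('NE', 2)
print(relop("< 5"))     # ('LT', 1) -- retract after reading the blank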

Transition Diagram for identifier

Transition Diagram for unsigned numbers

Transition Diagram for if

Transition Diagram for integer

Transition Diagram for NFA