Lexical Analysis

8/5/2020
Compiler Construction
[SWE-310]
Lexical Analysis
Compiler Construction [SWE-310] Lexical Analysis

The Role of Lexical Analyzer
 1st phase of a compiler.

 The primary task is to read the input
characters and produce as output a
sequence of tokens that the parser uses for
syntax analysis.
 Works as a subroutine or co-routine of the
parser.
1
8/5/2020


 Its secondary tasks include:

Stripping out comments and white spaces (i.e.
blank, newline, tab, and perhaps other
characters that are used to separate tokens in
the input) from the source program.
Co-relating error messages from the compiler
with the source program.
For instance, the lexical analyzer may keep track
of the number of newline characters seen, so it
can associate a line number with each error
message.
2
8/5/2020

 Sometimes, lexical analyzers are divided

into a cascade of two processes:
a) Scanning consists of the simple processes
that do not require tokenization of the input,
such as deletion of comments and
compaction of consecutive whitespace
characters into one.
b) Lexical analysis is the more complex
portion, which produces tokens from the
output of the scanner.

Lexical Analysis vs. Syntax Analysis
 There are a number of reasons why the analysis portion
of a compiler is normally separated into lexical analysis
and parsing (syntax analysis) phases.
1. Simplicity of Design:
 Simplicity of design is the most important consideration.
 The separation of lexical and syntactic analysis often allows us
to simplify at least one of these tasks.
 E.g. A parser that had to deal with comments and whitespace
as syntactic units would be considerably more complex than
one that can assume comments and whitespace have already
been removed by the lexical analyzer.
 If we are designing a new language, separating lexical and
syntactic concerns can lead to a cleaner overall language
design.
3
8/5/2020

Lexical Analysis vs. Syntax Analysis
2. Compiler Efficiency:
 Compiler efficiency is improved. A separate lexical
analyzer allows us to apply specialized techniques
that serve only the lexical task, not the job of
parsing.
 In addition, specialized buffering techniques for
reading input characters can speed up the compiler
significantly.
3. Compiler Portability:
 Compiler portability is enhanced.
 Input-device-specific peculiarities can be restricted
to the lexical analyzer.

Tokens, Patterns, and Lexemes
 Token:
A token is a pair consisting of a token name and an
optional attribute value.
 The token name is an abstract symbol representing a
kind of lexical unit, e.g., a particular keyword, or a
sequence of input characters denoting an identifier.
 The token names are the input symbols that the
parser processes.
 In what follows, we shall generally write the name of
a token in boldface. We will often refer to a token by
its token name.
4
8/5/2020

 Pattern:
 A pattern is a description of the form that the lexemes of
a token may take.
 In the case of a keyword as a token, the pattern is just
the sequence of characters that form the keyword.
 For identifiers and some other tokens, the pattern is a
more complex structure that is matched by many strings.
 Lexeme:
 A lexeme is a sequence of characters in the source
program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of that
token.

5
8/5/2020

Classes of Tokens
 Classes of Tokens:
 In many programming languages, the following
classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a
keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in
classes
3. One token representing all identifiers.
4. One or more tokens representing constants, such as
numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and
right parentheses, comma, and semicolon.

Buffer Pair Scheme
 Buffer Pairs:
A large amount of time is taken to process characters
and large number of characters must be processed
during the compilation of a large source program.
 Specialized buffering techniques have been developed to
reduce the amount of overhead required to process a
single input character.
 An important scheme involves two buffers that are
alternately reloaded.
6
8/5/2020

Buffer Pair Scheme
 A buffer is divided into two N-character halves.

 Typically,N is the number of characters on one
disk block
 E.g. 1024 or 4096 characters.

Buffer Pair Scheme
 Using one system read command we can

read N characters into a buffer, rather than
using one system call per character.
 A special character, represented by eof,
marks the end of the source file and is
different from any possible character of the
source program.
 Two pointers to the input are maintained:
7
8/5/2020

Buffer Pair Scheme
 Two pointers to the input are maintained:

1.Pointer lexemeBegin, marks the beginning
of the current lexeme, whose extent we are
attempting to determine.
2.Pointer forward scans ahead until a pattern
match is found
 The string of characters between the two
pointers is the current lexeme.

Buffer Pair Scheme
 Initially both pointers point to the first

character of the next lexeme to be found.
 The forward pointer scans ahead until a match
for a pattern is found.
 When the next lexeme is determined, forward
pointer is set to the character at its right end.
 When the lexeme is processed both pointers
are set to the character immediately past the
lexeme.
8
8/5/2020

Buffer Pair Scheme
 Comments (i.e. /*……*/) and white

spaces(i.e. blanks, tabs or new line
characters) are patterns that yield no
token.
 Special provision must be taken for the
cases where forward pointer is at the end
of buffer but token recognition requires
look-ahead into the next buffer.

Buffer Pair Scheme
 The algorithm for simple buffer-pair scheme is

given as follows:
If forward at end of first half then begin
reload second half;
forward:= forward + 1
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half
end
else forward:= forward + 1;
9
8/5/2020

Sentinel Scheme
 Each buffer is extended to hold a sentinel (i.e.

eof) character at the end.
 The 1st scheme requires two test for each

advance of forward pointer.
 The Sentinel scheme reduces two tests to only
one.

Sentinel Scheme
 The algorithm for Sentinel scheme is given

as follows:
10
8/5/2020

Lexical Errors
 When the lexical analyzer (LA) is unable to proceed
because none of the patterns for tokens matches a
prefix of the remaining input, many recovery strategies
can be employed.
 The simplest recovery strategy is “panic mode"
recovery.
 i.e. Successive characters are deleted from the
remaining input, until the lexical analyzer can find a
well-formed token at the beginning of what input is left.
 This recovery technique may confuse the parser, but in
an interactive computing environment it may be quite
adequate.

Lexical Errors
 Other possible error-recovery actions are:

1. Delete one character from the remaining input.
(e.g. Change whilet  while)
2. Insert a missing character into the remaining
input.
(e.g. Change whil  while)
3. Replace a character by another character.
(e.g. Change whilf  while)
4. Transpose two adjacent characters.
(e.g. Change wh ile  while)
11
8/5/2020

Regular Expressions
 Regular Expressions:
Regular expressions is a method for
describing/expressing patterns of a language.
The regular expressions are built recursively out
of smaller regular expressions.
Each regular expression r denotes a language
L(r), which is also denoted recursively from the
languages denoted by r's sub expressions.

Extensions of Regular Expressions
 Many extensions have been added to

regular expressions to enhance their ability
to specify string patterns.
 A few notational extensions that were first
incorporated into Unix utilities such as Lex
are mentioned below that are particularly
useful in the specification of lexical
analyzers.
12
8/5/2020

1. One or more instances:

 The unary, postfix operator + represents the positive
closure of a regular expression and its language.
 i.e. If r is a regular expression, then (r)+ denotes the
language (L(r))+.
 The operator + has the same precedence and
associativity as the operator *.
 Two useful algebraic laws:
• r* = r+ | ε , and
• r+ = rr* = r*r relates to the Kleene closure and positive
closure.

2. Zero or one instance:

 The unary postfix operator ? means “zero
or one occurrence" .
 i.e. r? is equivalent to r | ε , or put another
way, L(r?) = L(r) U { ε } .
 The ? operator has the same precedence
and associativity as * and +.
13
8/5/2020

3. Character Classes:
 A regular expression a1 | a2 | --- |an, where the ai's are
each symbols of the alphabet, can be replaced by the
shorthand [a1a2 ---- an].
 More importantly, when a1, a2, ----, an form a logical
sequence, e.g., consecutive uppercase letters,
lowercase letters, or digits, we can replace them by
a1-an.
 i.e. Just the first and last character separated by a
hyphen.
 Thus, [ abc ] is shorthand for a | b |c, and
 [a-z] is shorthand for a | b | --- | z.

Regular Definitions
 Regular Definitions:
For notational convenience, names are
given to certain regular expressions and
those names are used in subsequent
expressions, as if the names were
themselves symbols.
14
8/5/2020

Regular Definitions
 E.g.#1 C identifiers are strings of letters,

digits, and underscores. A regular
definition for the language of C identifiers
is given below:

Regular Definitions
 E.g.#2 Unsigned numbers (integer or

floating point) are strings such as 5280,
0.01234, 6.336E4, or 1.89E-4. The regular
definition is given as follows:
15
8/5/2020

Recognition of Tokens
 Recognition of Tokens:
After learning how to express patterns
using regular expressions the next task is
to take the patterns for all the needed
tokens and build a piece of code that
examines the input string and find a prefix
that is a lexeme matching one of the
patterns.

 E.g.#3 A grammar for branching

statements.
16
8/5/2020

 The terminals of the grammar if, then,

else, relop, id, and number generate set
of strings given by following regular
definitions:

 In Addition, the lexical analyzer should also

recognize white spaces (ws), which may be
described in following two manners:
 E.g.#1
delim  blank | tab |newline
ws  delim+
OR
 E.g.#2
ws  ( blank | tab |newline)+
17
8/5/2020

 The lexical analyzer will produce following

sequence of tokens shown by transition
table:

Transition Diagrams
 Transition Diagrams (TD’s) help in building

Lexical Analyzers (LA’s).
 TD’s depict the action that take place
when a LA is called by parser to get the
next token.
 TD’s keep track of information about
characters that are seen as the forward
pointer scans the input.
18
8/5/2020

Transition Diagrams
 E.g.#1 Complete TD for relop

Transition Diagrams
 E.g.#2 TD for Reserved words and

Identifiers
19
8/5/2020

Transition Diagrams
 E.g.#3 TD for Keyword then

Transition Diagrams
 E.g.#4 TD for Unsigned numbers
20
8/5/2020

Transition Diagrams
 E.g.#5 TD for whitespace (ws) or delim+

Input Buffering Approaches for LA
 There are three (3) general approaches for

the implementation of LA.
1. Use a LA generator:
E.g. Lex Compiler, to produce the LA from
a regular expression based specification.
Therefore the generator provides routines
for reading and buffering input.
21
8/5/2020

2. Write the LA in a Conventional System

Programming Language:
Write the LA in a conventional system
programming language (i.e. C or Java etc.),
using the I/O facilities of that language to
read the input.

3. Write the LA in Assembly Language:

Write the LA in Assembly language and
explicitly manage the reading of input.
 The harder-to-implement approach yield
faster LA’s.
22
8/5/2020

A Sample C Program for LA
23

Lexical Analysis

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lexical Analysis

Uploaded by

Copyright:

Available Formats

8/5/2020

Compiler Construction [SWE-310] Lexical Analysis

 1st phase of a compiler.

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

 Its secondary tasks include:

Compiler Construction [SWE-310] Lexical Analysis

 Sometimes, lexical analyzers are divided

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

 A buffer is divided into two N-character halves.

Compiler Construction [SWE-310] Lexical Analysis

 Using one system read command we can

Compiler Construction [SWE-310] Lexical Analysis

 Two pointers to the input are maintained:

Compiler Construction [SWE-310] Lexical Analysis

 Initially both pointers point to the first

Compiler Construction [SWE-310] Lexical Analysis

 Comments (i.e. /*……*/) and white

Compiler Construction [SWE-310] Lexical Analysis

 The algorithm for simple buffer-pair scheme is

Compiler Construction [SWE-310] Lexical Analysis

 Each buffer is extended to hold a sentinel (i.e.

 The 1st scheme requires two test for each

Compiler Construction [SWE-310] Lexical Analysis

 The algorithm for Sentinel scheme is given

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

 Other possible error-recovery actions are:

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

 Many extensions have been added to

Compiler Construction [SWE-310] Lexical Analysis

1. One or more instances:

Compiler Construction [SWE-310] Lexical Analysis

2. Zero or one instance:

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

 E.g.#1 C identifiers are strings of letters,

Compiler Construction [SWE-310] Lexical Analysis

 E.g.#2 Unsigned numbers (integer or

Compiler Construction [SWE-310] Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis

 E.g.#3 A grammar for branching

Compiler Construction [SWE-310] Lexical Analysis

 The terminals of the grammar if, then,

Compiler Construction [SWE-310] Lexical Analysis

 In Addition, the lexical analyzer should also

Compiler Construction [SWE-310] Lexical Analysis

 The lexical analyzer will produce following

Compiler Construction [SWE-310] Lexical Analysis

 Transition Diagrams (TD’s) help in building

Compiler Construction [SWE-310] Lexical Analysis

 E.g.#1 Complete TD for relop

Compiler Construction [SWE-310] Lexical Analysis

 Comments (i.e. /……/) and white