You are on page 1of 23

8/5/2020

Compiler Construction
[SWE-310]
Lexical Analysis

Compiler Construction [SWE-310] Lexical Analysis


The Role of Lexical Analyzer

 1st phase of a compiler.


 The primary task is to read the input
characters and produce as output a
sequence of tokens that the parser uses for
syntax analysis.
 Works as a subroutine or co-routine of the
parser.

1
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


The Role of Lexical Analyzer

Compiler Construction [SWE-310] Lexical Analysis


The Role of Lexical Analyzer

 Its secondary tasks include:


Stripping out comments and white spaces (i.e.
blank, newline, tab, and perhaps other
characters that are used to separate tokens in
the input) from the source program.
Co-relating error messages from the compiler
with the source program.
For instance, the lexical analyzer may keep track
of the number of newline characters seen, so it
can associate a line number with each error
message.

2
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


The Role of Lexical Analyzer

 Sometimes, lexical analyzers are divided


into a cascade of two processes:
a) Scanning consists of the simple processes
that do not require tokenization of the input,
such as deletion of comments and
compaction of consecutive whitespace
characters into one.
b) Lexical analysis is the more complex
portion, which produces tokens from the
output of the scanner.

Compiler Construction [SWE-310] Lexical Analysis


Lexical Analysis vs. Syntax Analysis
 There are a number of reasons why the analysis portion
of a compiler is normally separated into lexical analysis
and parsing (syntax analysis) phases.
1. Simplicity of Design:
 Simplicity of design is the most important consideration.
 The separation of lexical and syntactic analysis often allows us
to simplify at least one of these tasks.
 E.g. A parser that had to deal with comments and whitespace
as syntactic units would be considerably more complex than
one that can assume comments and whitespace have already
been removed by the lexical analyzer.
 If we are designing a new language, separating lexical and
syntactic concerns can lead to a cleaner overall language
design.

3
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Lexical Analysis vs. Syntax Analysis
2. Compiler Efficiency:
 Compiler efficiency is improved. A separate lexical
analyzer allows us to apply specialized techniques
that serve only the lexical task, not the job of
parsing.
 In addition, specialized buffering techniques for
reading input characters can speed up the compiler
significantly.
3. Compiler Portability:
 Compiler portability is enhanced.
 Input-device-specific peculiarities can be restricted
to the lexical analyzer.

Compiler Construction [SWE-310] Lexical Analysis


Tokens, Patterns, and Lexemes

 Token:
A token is a pair consisting of a token name and an
optional attribute value.
 The token name is an abstract symbol representing a
kind of lexical unit, e.g., a particular keyword, or a
sequence of input characters denoting an identifier.
 The token names are the input symbols that the
parser processes.
 In what follows, we shall generally write the name of
a token in boldface. We will often refer to a token by
its token name.

4
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Tokens, Patterns, and Lexemes

 Pattern:
 A pattern is a description of the form that the lexemes of
a token may take.
 In the case of a keyword as a token, the pattern is just
the sequence of characters that form the keyword.
 For identifiers and some other tokens, the pattern is a
more complex structure that is matched by many strings.
 Lexeme:
 A lexeme is a sequence of characters in the source
program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of that
token.

Compiler Construction [SWE-310] Lexical Analysis


Tokens, Patterns, and Lexemes

5
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Classes of Tokens
 Classes of Tokens:
 In many programming languages, the following
classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a
keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in
classes
3. One token representing all identifiers.
4. One or more tokens representing constants, such as
numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and
right parentheses, comma, and semicolon.

Compiler Construction [SWE-310] Lexical Analysis


Buffer Pair Scheme

 Buffer Pairs:
A large amount of time is taken to process characters
and large number of characters must be processed
during the compilation of a large source program.
 Specialized buffering techniques have been developed to
reduce the amount of overhead required to process a
single input character.
 An important scheme involves two buffers that are
alternately reloaded.

6
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Buffer Pair Scheme

 A buffer is divided into two N-character halves.


 Typically,N is the number of characters on one
disk block
 E.g. 1024 or 4096 characters.

Compiler Construction [SWE-310] Lexical Analysis


Buffer Pair Scheme

 Using one system read command we can


read N characters into a buffer, rather than
using one system call per character.
 A special character, represented by eof,
marks the end of the source file and is
different from any possible character of the
source program.
 Two pointers to the input are maintained:

7
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Buffer Pair Scheme

 Two pointers to the input are maintained:


1.Pointer lexemeBegin, marks the beginning
of the current lexeme, whose extent we are
attempting to determine.
2.Pointer forward scans ahead until a pattern
match is found
 The string of characters between the two
pointers is the current lexeme.

Compiler Construction [SWE-310] Lexical Analysis


Buffer Pair Scheme

 Initially both pointers point to the first


character of the next lexeme to be found.
 The forward pointer scans ahead until a match
for a pattern is found.
 When the next lexeme is determined, forward
pointer is set to the character at its right end.
 When the lexeme is processed both pointers
are set to the character immediately past the
lexeme.

8
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Buffer Pair Scheme

 Comments (i.e. /*……*/) and white


spaces(i.e. blanks, tabs or new line
characters) are patterns that yield no
token.
 Special provision must be taken for the
cases where forward pointer is at the end
of buffer but token recognition requires
look-ahead into the next buffer.

Compiler Construction [SWE-310] Lexical Analysis


Buffer Pair Scheme

 The algorithm for simple buffer-pair scheme is


given as follows:
If forward at end of first half then begin
reload second half;
forward:= forward + 1
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half
end
else forward:= forward + 1;

9
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Sentinel Scheme

 Each buffer is extended to hold a sentinel (i.e.


eof) character at the end.

 The 1st scheme requires two test for each


advance of forward pointer.
 The Sentinel scheme reduces two tests to only
one.

Compiler Construction [SWE-310] Lexical Analysis


Sentinel Scheme

 The algorithm for Sentinel scheme is given


as follows:

10
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Lexical Errors
 When the lexical analyzer (LA) is unable to proceed
because none of the patterns for tokens matches a
prefix of the remaining input, many recovery strategies
can be employed.
 The simplest recovery strategy is “panic mode"
recovery.
 i.e. Successive characters are deleted from the
remaining input, until the lexical analyzer can find a
well-formed token at the beginning of what input is left.
 This recovery technique may confuse the parser, but in
an interactive computing environment it may be quite
adequate.

Compiler Construction [SWE-310] Lexical Analysis


Lexical Errors

 Other possible error-recovery actions are:


1. Delete one character from the remaining input.
(e.g. Change whilet  while)
2. Insert a missing character into the remaining
input.
(e.g. Change whil  while)
3. Replace a character by another character.
(e.g. Change whilf  while)
4. Transpose two adjacent characters.
(e.g. Change wh ile  while)

11
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Regular Expressions

 Regular Expressions:
Regular expressions is a method for
describing/expressing patterns of a language.
The regular expressions are built recursively out
of smaller regular expressions.
Each regular expression r denotes a language
L(r), which is also denoted recursively from the
languages denoted by r's sub expressions.

Compiler Construction [SWE-310] Lexical Analysis


Extensions of Regular Expressions

 Many extensions have been added to


regular expressions to enhance their ability
to specify string patterns.
 A few notational extensions that were first
incorporated into Unix utilities such as Lex
are mentioned below that are particularly
useful in the specification of lexical
analyzers.

12
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Extensions of Regular Expressions

1. One or more instances:


 The unary, postfix operator + represents the positive
closure of a regular expression and its language.
 i.e. If r is a regular expression, then (r)+ denotes the
language (L(r))+.
 The operator + has the same precedence and
associativity as the operator *.
 Two useful algebraic laws:
• r* = r+ | ε , and
• r+ = rr* = r*r relates to the Kleene closure and positive
closure.

Compiler Construction [SWE-310] Lexical Analysis


Extensions of Regular Expressions

2. Zero or one instance:


 The unary postfix operator ? means “zero
or one occurrence" .
 i.e. r? is equivalent to r | ε , or put another
way, L(r?) = L(r) U { ε } .
 The ? operator has the same precedence
and associativity as * and +.

13
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Extensions of Regular Expressions
3. Character Classes:
 A regular expression a1 | a2 | --- |an, where the ai's are
each symbols of the alphabet, can be replaced by the
shorthand [a1a2 ---- an].
 More importantly, when a1, a2, ----, an form a logical
sequence, e.g., consecutive uppercase letters,
lowercase letters, or digits, we can replace them by
a1-an.
 i.e. Just the first and last character separated by a
hyphen.
 Thus, [ abc ] is shorthand for a | b |c, and
 [a-z] is shorthand for a | b | --- | z.

Compiler Construction [SWE-310] Lexical Analysis


Regular Definitions

 Regular Definitions:
For notational convenience, names are
given to certain regular expressions and
those names are used in subsequent
expressions, as if the names were
themselves symbols.

14
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Regular Definitions

 E.g.#1 C identifiers are strings of letters,


digits, and underscores. A regular
definition for the language of C identifiers
is given below:

Compiler Construction [SWE-310] Lexical Analysis


Regular Definitions

 E.g.#2 Unsigned numbers (integer or


floating point) are strings such as 5280,
0.01234, 6.336E4, or 1.89E-4. The regular
definition is given as follows:

15
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Recognition of Tokens

 Recognition of Tokens:
After learning how to express patterns
using regular expressions the next task is
to take the patterns for all the needed
tokens and build a piece of code that
examines the input string and find a prefix
that is a lexeme matching one of the
patterns.

Compiler Construction [SWE-310] Lexical Analysis


Recognition of Tokens

 E.g.#3 A grammar for branching


statements.

16
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Recognition of Tokens

 The terminals of the grammar if, then,


else, relop, id, and number generate set
of strings given by following regular
definitions:

Compiler Construction [SWE-310] Lexical Analysis


Recognition of Tokens

 In Addition, the lexical analyzer should also


recognize white spaces (ws), which may be
described in following two manners:
 E.g.#1
delim  blank | tab |newline
ws  delim+
OR
 E.g.#2
ws  ( blank | tab |newline)+

17
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Recognition of Tokens

 The lexical analyzer will produce following


sequence of tokens shown by transition
table:

Compiler Construction [SWE-310] Lexical Analysis


Transition Diagrams

 Transition Diagrams (TD’s) help in building


Lexical Analyzers (LA’s).
 TD’s depict the action that take place
when a LA is called by parser to get the
next token.
 TD’s keep track of information about
characters that are seen as the forward
pointer scans the input.

18
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Transition Diagrams

 E.g.#1 Complete TD for relop

Compiler Construction [SWE-310] Lexical Analysis


Transition Diagrams

 E.g.#2 TD for Reserved words and


Identifiers

19
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Transition Diagrams

 E.g.#3 TD for Keyword then

Compiler Construction [SWE-310] Lexical Analysis


Transition Diagrams

 E.g.#4 TD for Unsigned numbers

20
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Transition Diagrams

 E.g.#5 TD for whitespace (ws) or delim+

Compiler Construction [SWE-310] Lexical Analysis


Input Buffering Approaches for LA

 There are three (3) general approaches for


the implementation of LA.
1. Use a LA generator:
E.g. Lex Compiler, to produce the LA from
a regular expression based specification.
Therefore the generator provides routines
for reading and buffering input.

21
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


Input Buffering Approaches for LA

2. Write the LA in a Conventional System


Programming Language:
Write the LA in a conventional system
programming language (i.e. C or Java etc.),
using the I/O facilities of that language to
read the input.

Compiler Construction [SWE-310] Lexical Analysis


Input Buffering Approaches for LA

3. Write the LA in Assembly Language:


Write the LA in Assembly language and
explicitly manage the reading of input.
 The harder-to-implement approach yield
faster LA’s.

22
8/5/2020

Compiler Construction [SWE-310] Lexical Analysis


A Sample C Program for LA

23

You might also like