
Lecture# 09

Compiler Construction
Lexical Analysis (Part I)

by Safdar Hussain

Topics
• Lexical analyzer as a separate phase, interaction of the lexical analyzer and parser, lexical errors
• Tokens, lexemes, patterns, attributes of tokens, specification of patterns for tokens
(terminology, string & language operations, regular expressions, regular definitions, notational shorthands)
Overview
This chapter contains comprehensive material on:
• Building a simple Lexical Analyzer
– Lexical Analysis (every perspective)
– Interaction of the Lexical Analyzer & the Parser
– Implementing Transition Diagrams
– Design of a Lexical Analyzer (RE → NFA → DFA)
– The Subset Construction Algorithm
– Minimization of the number of states of a DFA
– Converting an RE to a DFA directly (algorithm & tree annotation)



The Reason Why Lexical Analysis is a Separate Phase
• Simplifies the design of the compiler
– Parsing with a single token of lookahead (LL(1) or LR(1)) would hardly be practical if the parser also had to handle whitespace and comments.
– E.g., a parser that had to deal with comments & whitespace as syntactic units would be considerably more complex than one that can assume comments & whitespace have already been removed by the lexical analyzer.
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers by hand or
automatically
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character encodings can be more
easily translated


The Reason Why Lexical Analysis is a Separate Phase
• Division of the lexical analyzer into two processes
– Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments & compaction of consecutive whitespace characters into one.
– Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
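As a small illustration of the scanning half of this division, the Python sketch below (not part of the original slides; the function name scan is made up) deletes C-style comments and compacts consecutive whitespace, leaving tokenization to the lexical-analysis stage proper:

import re

def scan(source: str) -> str:
    """Scanning stage: delete comments and compact whitespace.

    Tokenization is left to the lexical-analysis stage that follows.
    """
    # Remove /* ... */ block comments and // line comments.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    # Compact consecutive whitespace characters into one blank.
    return re.sub(r"\s+", " ", source).strip()

print(scan("int a , b ;   // two variables\nint   sum ; /* their sum */"))
# -> "int a , b ; int sum ;"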



Scanning & Lexical Analysis
(Whitespace & comments are discarded; the remaining lexemes are classified into tokens.)

Lexemes                              Tokens
int                                  Keyword
a                                    Identifier
,                                    Punctuation
b                                    Identifier
,                                    Punctuation
sum                                  Identifier
;                                    Punctuation
cout                                 Identifier
<<                                   Operator
"Please Enter any two integers!"     String literal
<<                                   Operator
cin                                  Identifier
>>                                   Operator
a                                    Identifier
>>                                   Operator
b                                    Identifier
;                                    Punctuation
sum                                  Identifier
=                                    Operator
a                                    Identifier
+                                    Operator
b                                    Identifier
;                                    Punctuation
cout                                 Identifier
<<                                   Operator
"Sum of two integers ="              String literal
<<                                   Operator
sum                                  Identifier
;                                    Punctuation
return                               Keyword
0                                    Constant
;                                    Punctuation
Interaction of the Lexical Analyzer with the Parser

[Diagram: the Source Program feeds the Lexical Analyzer, which passes <token, tokenval> pairs to the Parser; the Parser requests the next token on demand ("get next token"); both components report errors and consult the Symbol Table.]
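A minimal sketch of this interaction, assuming hypothetical names (Lexer, get_next_token): the parser repeatedly asks the lexical analyzer for the next <token, tokenval> pair, and both phases share the symbol table.

EOF = "eof"

class Lexer:
    """Toy lexical analyzer that hands out <token, tokenval> pairs on demand."""

    def __init__(self, source, symbol_table):
        self.words = source.split()          # grossly simplified lexeme splitting
        self.symbol_table = symbol_table

    def get_next_token(self):
        if not self.words:
            return (EOF, None)
        lexeme = self.words.pop(0)
        if lexeme.isdigit():
            return ("num", int(lexeme))
        self.symbol_table.setdefault(lexeme, {"name": lexeme})
        return ("id", lexeme)

def parse(lexer):
    """Toy parser loop: pull tokens until end of input."""
    token, tokenval = lexer.get_next_token()
    while token != EOF:
        print(token, tokenval)
        token, tokenval = lexer.get_next_token()

symbols = {}                                 # symbol table shared by both phases
parse(Lexer("x 42 y", symbols))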


Attributes of Tokens

The input

y := 31 + 28*x

is turned by the lexical analyzer into the token stream

<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>

and each pair <token, tokenval> (token name, token attribute) is passed on to the parser.
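One way such <token, tokenval> pairs might be produced is sketched below; the token names and regular expressions are illustrative assumptions, not the slides' own definitions.

import re

# Token specification: name + pattern (order matters when patterns overlap).
TOKEN_SPEC = [
    ("assign", r":="),
    ("num",    r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("op",     r"[+*]"),
    ("ws",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # whitespace yields no token
        if kind == "num":
            yield ("num", int(lexeme))    # attribute: numeric value
        elif kind == "id":
            yield ("id", lexeme)          # attribute: lexeme (or symbol-table pointer)
        else:
            yield (lexeme if kind == "op" else kind, None)  # no attribute needed

print(list(tokenize("y := 31 + 28*x")))
# [('id', 'y'), ('assign', None), ('num', 31), ('+', None), ('num', 28), ('*', None), ('id', 'x')]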


Tokens, Patterns, and Lexemes

• A token is a classification of lexical units


– For example: id and num
• Lexemes are the specific character strings that make
up a token (values)
– For example: abc and 123
• Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and digits” and
“non-empty sequence of digits”
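A quick check of this relationship, assuming the two informal patterns above are written as the regular expressions below: each pattern describes exactly the set of lexemes belonging to its token.

import re

patterns = {
    "id":  re.compile(r"[A-Za-z][A-Za-z0-9]*"),   # letter followed by letters and digits
    "num": re.compile(r"[0-9]+"),                 # non-empty sequence of digits
}

for lexeme in ["abc", "123", "x1"]:
    token = next(t for t, p in patterns.items() if p.fullmatch(lexeme))
    print(lexeme, "->", token)
# abc -> id, 123 -> num, x1 -> id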



Example Tokens, Lexemes, and Patterns

[Table: example tokens such as if, comparison, id, number, and literal, with the informal description (pattern) and sample lexemes for each.]


Example (Token & Lexeme)
printf("Total = %d\n", score);   // Statement in the C language
• Both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal.

In many programming languages, the following classes cover most or all of the tokens (a small sketch follows the list):
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison mentioned on the previous slide.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
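The sketch below shows one way these five classes might be represented in a hand-written lexer; the class and keyword names are assumptions for illustration only.

from enum import Enum, auto

class TokenClass(Enum):
    KEYWORD     = auto()   # one token per keyword (pattern = the keyword itself)
    COMPARISON  = auto()   # operators grouped into classes, e.g. <, <=, ==, !=, >, >=
    ID          = auto()   # one token representing all identifiers
    NUMBER      = auto()   # constants: numbers
    LITERAL     = auto()   # constants: literal strings
    PUNCTUATION = auto()   # one token per punctuation symbol: ( ) , ; ...

KEYWORDS = {"if", "else", "while", "return", "int"}

def classify(lexeme: str) -> TokenClass:
    if lexeme in KEYWORDS:
        return TokenClass.KEYWORD
    if lexeme in {"<", "<=", "==", "!=", ">", ">="}:
        return TokenClass.COMPARISON
    if lexeme.isdigit():
        return TokenClass.NUMBER
    if lexeme.startswith('"'):
        return TokenClass.LITERAL
    if lexeme.isidentifier():
        return TokenClass.ID
    return TokenClass.PUNCTUATION

print([classify(x) for x in ["printf", "(", '"Total = %d\\n"', ",", "score", ")", ";"]])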
Example
Token names and associated attribute values for the Fortran
statement
E=M*C**2
are written below as a sequence of pairs.
– <id, pointer to symbol-table entry for E>
– <assign_op>
– <id, pointer to symbol-table entry for M>
– <mult_op>
– <id, pointer to symbol-table entry for C>
– <exp_op>
– <number, integer value 2>
• In certain pairs, especially operators, punctuation, &
keywords, there is no need for an attribute value.
Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
fi (a == f(x)) …
• A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle the error due to the transposition of the letters.
• NOTE: A situation may arise in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. (Errors are discussed further in Chapter 4.)
Example
• Tagged languages like HTML or XML differ from conventional programming languages in that the punctuation (tags) is either very numerous (as in HTML) or user-definable (as in XML). Further, tags can often have parameters. Suggest how to divide the following HTML document into appropriate lexemes:
Here is a photo of <B>my house</B>:
<P><IMG SRC = "house.gif"><BR>
See <A HREF = "morePix.html">More Pictures</A>
if you liked that one. <P>
• Which lexemes should get associated lexical values, and what should those values be?



Example Continued…
Here is a photo of <B>my house</B>:
<P><IMG SRC = "house.gif"><BR>
See <A HREF = "morePix.html">More Pictures</A>
if you liked that one. <P>

Answer:

<text, "Here is a photo of"> <nodestart, b> <text, "my house"> <nodeend, b>
<nodestart, p> <selfendnode, img> <selfendnode, br>
<text, "See"> <nodestart, a> <text, "More Pictures"> <nodeend, a>
<text, "if you liked that one."> <nodeend, p>
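A rough regex-based sketch of this division (a simplification: the token names follow the answer above, but the set of self-ending tags and the handling of the final <P> are assumptions):

import re

SELF_ENDING = {"img", "br"}                     # assumed set of tags with no end tag

def html_lexemes(doc):
    for part in re.split(r"(<[^>]*>)", doc):    # keep the tag delimiters
        part = part.strip()
        if not part:
            continue
        if part.startswith("</"):
            yield ("nodeend", part[2:-1].strip().lower())
        elif part.startswith("<"):
            name = part[1:-1].split()[0].lower()
            kind = "selfendnode" if name in SELF_ENDING else "nodestart"
            yield (kind, name)
        else:
            yield ("text", part)                # lexical value: the text itself

doc = ('Here is a photo of <B>my house</B>:\n'
       '<P><IMG SRC = "house.gif"><BR>\n'
       'See <A HREF = "morePix.html">More Pictures</A>\n'
       'if you liked that one. <P>')
for token in html_lexemes(doc):
    print(token)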



Next…
Specification of Patterns for Tokens
• Terminology
• String Operations
• Language Operations
• Terms for parts of string
• Regular Expressions
• Regular Definitions
• Notational Shorthands



Specification of Patterns for Tokens:
Terminology
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ
Example
• The language of all strings consisting of n 0's followed by n 1's, for some n ≥ 0, over Σ = {0, 1}:
– {ε, 01, 0011, 000111, …}
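For instance, the first few strings of this language can be enumerated directly (a tiny illustration, not from the slides):

# Enumerate {0^n 1^n : n >= 0} over the alphabet {0, 1} for small n.
language = ["0" * n + "1" * n for n in range(4)]
print(language)   # ['', '01', '0011', '000111']  ('' is the empty string ε)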



Specification of Patterns for Tokens:
String Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
s^0 = ε
s^i = s^(i-1) s for i > 0
(note that sε = εs = s)
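In Python terms, with + as string concatenation, the definition reads as follows (an illustrative sketch):

def power(s: str, i: int) -> str:
    """String exponentiation: s^0 = ε (empty string), s^i = s^(i-1) s."""
    return "" if i == 0 else power(s, i - 1) + s

print(power("ab", 0), power("ab", 3))   # '' and 'ababab'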



Specification of Patterns for Tokens:
String Operations
[Table: terms for parts of a string — prefix, suffix, substring, subsequence.]


Specification of Patterns for Tokens:
Language Operations
If L and M are languages then
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
L* = ∪_{i≥0} L^i
• Positive closure
L+ = ∪_{i≥1} L^i
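For finite languages these operations can be sketched directly as set computations; since the two closures are infinite unions, the sketch below truncates them at a bound k (an assumption made only for the example):

def concat(L, M):
    """Concatenation: LM = { xy | x in L and y in M }."""
    return {x + y for x in L for y in M}

def power(L, i):
    """Exponentiation: L^0 = {ε}, L^i = L^(i-1) L."""
    return {""} if i == 0 else concat(power(L, i - 1), L)

def kleene(L, k):
    """Kleene closure truncated at k: union of L^i for i = 0..k."""
    return set().union(*(power(L, i) for i in range(k + 1)))

def positive(L, k):
    """Positive closure truncated at k: union of L^i for i = 1..k."""
    return set().union(*(power(L, i) for i in range(1, k + 1)))

L, M = {"a", "b"}, {"0", "1"}
print(L | M)                   # union
print(concat(L, M))            # {'a0', 'a1', 'b0', 'b1'}
print(power(L, 2))             # {'aa', 'ab', 'ba', 'bb'}
print(sorted(kleene(L, 2)))    # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
print(sorted(positive(L, 2)))  # ['a', 'aa', 'ab', 'b', 'ba', 'bb']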
Specification of Patterns for Tokens:
Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages
L(r) and L(s) respectively, then
– r | s is a regular expression denoting L(r) ∪ L(s)
– rs is a regular expression denoting L(r)L(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a
regular set
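The three operators correspond directly to, for example, Python's re syntax; a small illustrative check (the particular expressions are assumptions, not from the slides):

import re

# r | s, rs, and r* over the alphabet {a, b}:
union  = re.compile(r"a|b")        # L(a) ∪ L(b) = {a, b}
concat = re.compile(r"ab")         # L(a)L(b)    = {ab}
star   = re.compile(r"a*")         # L(a)*       = {ε, a, aa, ...}

print(bool(union.fullmatch("b")))      # True
print(bool(concat.fullmatch("ab")))    # True
print(bool(star.fullmatch("")))        # True: ε is in L(a)*
print(bool(star.fullmatch("aaa")))     # True
print(bool(star.fullmatch("ab")))      # False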
Specification of Patterns for Tokens:
Regular Definitions
• Naming convention for regular expressions:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}
• Each dj appearing in ri can be textually substituted by the regular expression it names



Specification of Patterns for Tokens:
Regular Definitions
• Example:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

digits → digit digit*   (a regular definition may not be recursive, so the recursive form digit digits | digit is avoided)
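The "textual substitution" idea can be sketched by building each later definition out of the earlier names, here rendered in Python re syntax (an illustration, not the slides' notation):

import re

# Regular definitions, built up by textually substituting earlier names.
letter = "[A-Za-z]"                      # letter -> A | B | ... | Z | a | b | ... | z
digit  = "[0-9]"                         # digit  -> 0 | 1 | ... | 9
id_re  = f"{letter}({letter}|{digit})*"  # id     -> letter ( letter | digit )*
digits = f"{digit}{digit}*"              # digits -> digit digit*  (i.e., digit+)

print(bool(re.fullmatch(id_re, "sum2")))   # True
print(bool(re.fullmatch(id_re, "2sum")))   # False: must start with a letter
print(bool(re.fullmatch(digits, "007")))   # True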



Specification of Patterns for Tokens:
Notational Shorthands
• We frequently use the following shorthands:
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
• For example:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

• NOTE: The unary postfix operator ? means “zero or one occurrence,” and + means “one or more occurrences.”
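Written with these shorthands, the num definition translates almost character-for-character into, for example, Python re syntax (only the dot needs escaping); an illustrative sketch:

import re

digit = "[0-9]"
# num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
num = re.compile(rf"{digit}+(\.{digit}+)?(E[+-]?{digit}+)?")

for lexeme in ["2", "31", "3.14", "6.02E23", "1E-9", "abc"]:
    print(lexeme, bool(num.fullmatch(lexeme)))
# 2 True, 31 True, 3.14 True, 6.02E23 True, 1E-9 True, abc False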



The End

