
Lecture# 09

Compiler Construction
Lexical Analysis (Part I)

by Safdar Hussain

Topics
• Lexical analyzer as a separate phase, interaction of the lexical analyzer and parser, lexical errors
• Tokens, lexemes, patterns, attributes of tokens, specification of patterns for tokens
(terminology, string & language operations, regular expressions, regular definitions, notational shorthands)
Overview
This chapter contains comprehensive material on:
• Building a simple Lexical Analyzer
– Lexical Analysis (every perspective)
– Interaction of the Lexical Analyzer & the Parser
– Implementing Transition Diagrams
– Design of a Lexical Analyzer (RE → NFA → DFA)
– The Subset Construction Algorithm
– Minimization of the number of states of a DFA
– Converting an RE to a DFA directly (algorithm & tree annotation)



The Reason Why Lexical Analysis is a Separate Phase
• Simplifies the design of the compiler
– Parsing with a single token of lookahead (LL(1) or LR(1)) would hardly be practical if the parser also had to handle whitespace and comments.
– E.g., a parser that had to deal with comments & whitespace as syntactic units would be considerably more complex than one that can assume comments & whitespace have already been removed by the lexical analyzer.
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers by hand or
automatically
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character encodings can be more
easily translated


The Reason Why Lexical Analysis is a Separate Phase
• Division of the lexical analyzer into two processes
– Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments & compaction of consecutive whitespace characters into one.
– Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
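As a small illustration of the scanning half of this division, the Python sketch below (not part of the original slides; the function name scan is made up) deletes C-style comments and compacts consecutive whitespace, leaving tokenization to the lexical-analysis stage proper:

import re

def scan(source: str) -> str:
    """Scanning stage: delete comments and compact whitespace.

    Tokenization is left to the lexical-analysis stage that follows.
    """
    # Remove /* ... */ block comments and // line comments.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    # Compact consecutive whitespace characters into one blank.
    return re.sub(r"\s+", " ", source).strip()

print(scan("int a , b ;   // two variables\nint   sum ; /* their sum */"))
# -> "int a , b ; int sum ;"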



Scanning & Lexical Analysis
(Whitespace & comments are discarded; the remaining lexemes are classified into tokens.)

Lexemes                              Tokens
int                                  Keyword
a                                    Identifier
,                                    Punctuation
b                                    Identifier
,                                    Punctuation
sum                                  Identifier
;                                    Punctuation
cout                                 Identifier
<<                                   Operator
"Please Enter any two integers!"     String literal
<<                                   Operator
cin                                  Identifier
>>                                   Operator
a                                    Identifier
>>                                   Operator
b                                    Identifier
;                                    Punctuation
sum                                  Identifier
=                                    Operator
a                                    Identifier
+                                    Operator
b                                    Identifier
;                                    Punctuation
cout                                 Identifier
<<                                   Operator
"Sum of two integers ="              String literal
<<                                   Operator
sum                                  Identifier
;                                    Punctuation
return                               Keyword
0                                    Constant
;                                    Punctuation
Interaction of the Lexical Analyzer with the Parser

[Diagram: the Source Program feeds the Lexical Analyzer, which passes <token, tokenval> pairs to the Parser; the Parser requests the next token on demand ("get next token"); both components report errors and consult the Symbol Table.]
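A minimal sketch of this interaction, assuming hypothetical names (Lexer, get_next_token): the parser repeatedly asks the lexical analyzer for the next <token, tokenval> pair, and both phases share the symbol table.

EOF = "eof"

class Lexer:
    """Toy lexical analyzer that hands out <token, tokenval> pairs on demand."""

    def __init__(self, source, symbol_table):
        self.words = source.split()          # grossly simplified lexeme splitting
        self.symbol_table = symbol_table

    def get_next_token(self):
        if not self.words:
            return (EOF, None)
        lexeme = self.words.pop(0)
        if lexeme.isdigit():
            return ("num", int(lexeme))
        self.symbol_table.setdefault(lexeme, {"name": lexeme})
        return ("id", lexeme)

def parse(lexer):
    """Toy parser loop: pull tokens until end of input."""
    token, tokenval = lexer.get_next_token()
    while token != EOF:
        print(token, tokenval)
        token, tokenval = lexer.get_next_token()

symbols = {}                                 # symbol table shared by both phases
parse(Lexer("x 42 y", symbols))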


Attributes of Tokens

The input

y := 31 + 28*x

is turned by the lexical analyzer into the token stream

<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>

and each pair <token, tokenval> (token name, token attribute) is passed on to the parser.
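One way such <token, tokenval> pairs might be produced is sketched below; the token names and regular expressions are illustrative assumptions, not the slides' own definitions.

import re

# Token specification: name + pattern (order matters when patterns overlap).
TOKEN_SPEC = [
    ("assign", r":="),
    ("num",    r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("op",     r"[+*]"),
    ("ws",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # whitespace yields no token
        if kind == "num":
            yield ("num", int(lexeme))    # attribute: numeric value
        elif kind == "id":
            yield ("id", lexeme)          # attribute: lexeme (or symbol-table pointer)
        else:
            yield (lexeme if kind == "op" else kind, None)  # no attribute needed

print(list(tokenize("y := 31 + 28*x")))
# [('id', 'y'), ('assign', None), ('num', 31), ('+', None), ('num', 28), ('*', None), ('id', 'x')]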


Tokens, Patterns, and Lexemes

• A token is a classification of lexical units


– For example: id and num
• Lexemes are the specific character strings that make
up a token (values)
– For example: abc and 123
• Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and digits” and
“non-empty sequence of digits”
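A quick check of this relationship, assuming the two informal patterns above are written as the regular expressions below: each pattern describes exactly the set of lexemes belonging to its token.

import re

patterns = {
    "id":  re.compile(r"[A-Za-z][A-Za-z0-9]*"),   # letter followed by letters and digits
    "num": re.compile(r"[0-9]+"),                 # non-empty sequence of digits
}

for lexeme in ["abc", "123", "x1"]:
    token = next(t for t, p in patterns.items() if p.fullmatch(lexeme))
    print(lexeme, "->", token)
# abc -> id, 123 -> num, x1 -> id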



Example Tokens, Lexemes, and Patterns

[Table: example tokens such as if, comparison, id, number, and literal, with the informal description (pattern) and sample lexemes for each.]


Example (Token & Lexeme)
printf("Total = %d\n", score);   // Statement in the C language
• Both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal.

In many programming languages, the following classes cover most or all of the tokens (a small sketch follows the list):
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison mentioned on the previous slide.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
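The sketch below shows one way these five classes might be represented in a hand-written lexer; the class and keyword names are assumptions for illustration only.

from enum import Enum, auto

class TokenClass(Enum):
    KEYWORD     = auto()   # one token per keyword (pattern = the keyword itself)
    COMPARISON  = auto()   # operators grouped into classes, e.g. <, <=, ==, !=, >, >=
    ID          = auto()   # one token representing all identifiers
    NUMBER      = auto()   # constants: numbers
    LITERAL     = auto()   # constants: literal strings
    PUNCTUATION = auto()   # one token per punctuation symbol: ( ) , ; ...

KEYWORDS = {"if", "else", "while", "return", "int"}

def classify(lexeme: str) -> TokenClass:
    if lexeme in KEYWORDS:
        return TokenClass.KEYWORD
    if lexeme in {"<", "<=", "==", "!=", ">", ">="}:
        return TokenClass.COMPARISON
    if lexeme.isdigit():
        return TokenClass.NUMBER
    if lexeme.startswith('"'):
        return TokenClass.LITERAL
    if lexeme.isidentifier():
        return TokenClass.ID
    return TokenClass.PUNCTUATION

print([classify(x) for x in ["printf", "(", '"Total = %d\\n"', ",", "score", ")", ";"]])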
Example
Token names and associated attribute values for the Fortran
statement
E=M*C**2
are written below as a sequence of pairs.
– <id, pointer to symbol-table entry for E>
– <assign_op>
– <id, pointer to symbol-table entry for M>
– <mult_op>
– <id, pointer to symbol-table entry for C>
– <exp_op>
– <number, integer value 2>
• In certain pairs, especially operators, punctuation, &
keywords, there is no need for an attribute value.
Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
fi (a == f(x)) …
• A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle the error due to the transposition of the letters.
• NOTE: A situation may arise in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. (Errors are discussed further in Chapter 4.)
Example
• Tagged languages like HTML or XML differ from conventional programming languages in that the punctuation (tags) is either very numerous (as in HTML) or user-definable (as in XML). Further, tags can often have parameters. Suggest how to divide the following HTML document into appropriate lexemes:
Here is a photo of <B>my house</B>:
<P><IMG SRC = "house.gif"><BR>
See <A HREF = "morePix.html">More Pictures</A>
if you liked that one. <P>
• Which lexemes should get associated lexical values, and what should those values be?



Example Continued…
Here is a photo of <B>my house</B>:
<P><IMG SRC = "house.gif"><BR>
See <A HREF = "morePix.html">More Pictures</A>
if you liked that one. <P>

Answer:

<text, "Here is a photo of"> <nodestart, b> <text, "my house"> <nodeend, b>
<nodestart, p> <selfendnode, img> <selfendnode, br>
<text, "See"> <nodestart, a> <text, "More Pictures"> <nodeend, a>
<text, "if you liked that one."> <nodeend, p>
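A rough regex-based sketch of this division (a simplification: the token names follow the answer above, but the set of self-ending tags and the handling of the final <P> are assumptions):

import re

SELF_ENDING = {"img", "br"}                     # assumed set of tags with no end tag

def html_lexemes(doc):
    for part in re.split(r"(<[^>]*>)", doc):    # keep the tag delimiters
        part = part.strip()
        if not part:
            continue
        if part.startswith("</"):
            yield ("nodeend", part[2:-1].strip().lower())
        elif part.startswith("<"):
            name = part[1:-1].split()[0].lower()
            kind = "selfendnode" if name in SELF_ENDING else "nodestart"
            yield (kind, name)
        else:
            yield ("text", part)                # lexical value: the text itself

doc = ('Here is a photo of <B>my house</B>:\n'
       '<P><IMG SRC = "house.gif"><BR>\n'
       'See <A HREF = "morePix.html">More Pictures</A>\n'
       'if you liked that one. <P>')
for token in html_lexemes(doc):
    print(token)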



Next…
Specification of Patterns for Tokens
• Terminology
• String Operations
• Language Operations
• Terms for parts of string
• Regular Expressions
• Regular Definitions
• Notational Shorthands



Specification of Patterns for Tokens:
Terminology
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ
Example
• The language of all strings consisting of n 0's followed by n 1's, for some n ≥ 0, over Σ = {0, 1}:
– {ε, 01, 0011, 000111, …}
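For instance, the first few strings of this language can be enumerated directly (a tiny illustration, not from the slides):

# Enumerate {0^n 1^n : n >= 0} over the alphabet {0, 1} for small n.
language = ["0" * n + "1" * n for n in range(4)]
print(language)   # ['', '01', '0011', '000111']  ('' is the empty string ε)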



Specification of Patterns for Tokens:
String Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
s^0 = ε
s^i = s^(i-1) s for i > 0
(note that sε = εs = s)
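In Python terms, with + as string concatenation, the definition reads as follows (an illustrative sketch):

def power(s: str, i: int) -> str:
    """String exponentiation: s^0 = ε (empty string), s^i = s^(i-1) s."""
    return "" if i == 0 else power(s, i - 1) + s

print(power("ab", 0), power("ab", 3))   # '' and 'ababab'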



Specification of Patterns for Tokens:
String Operations
[Table: terms for parts of a string — prefix, suffix, substring, subsequence.]


Specification of Patterns for Tokens:
Language Operations
If L and M are languages then
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
L* = ∪_{i≥0} L^i
• Positive closure
L+ = ∪_{i≥1} L^i
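For finite languages these operations can be sketched directly as set computations; since the two closures are infinite unions, the sketch below truncates them at a bound k (an assumption made only for the example):

def concat(L, M):
    """Concatenation: LM = { xy | x in L and y in M }."""
    return {x + y for x in L for y in M}

def power(L, i):
    """Exponentiation: L^0 = {ε}, L^i = L^(i-1) L."""
    return {""} if i == 0 else concat(power(L, i - 1), L)

def kleene(L, k):
    """Kleene closure truncated at k: union of L^i for i = 0..k."""
    return set().union(*(power(L, i) for i in range(k + 1)))

def positive(L, k):
    """Positive closure truncated at k: union of L^i for i = 1..k."""
    return set().union(*(power(L, i) for i in range(1, k + 1)))

L, M = {"a", "b"}, {"0", "1"}
print(L | M)                   # union
print(concat(L, M))            # {'a0', 'a1', 'b0', 'b1'}
print(power(L, 2))             # {'aa', 'ab', 'ba', 'bb'}
print(sorted(kleene(L, 2)))    # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
print(sorted(positive(L, 2)))  # ['a', 'aa', 'ab', 'b', 'ba', 'bb']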
Specification of Patterns for Tokens:
Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages
L(r) and L(s) respectively, then
– r | s is a regular expression denoting L(r) ∪ L(s)
– rs is a regular expression denoting L(r)L(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a
regular set
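The three operators correspond directly to, for example, Python's re syntax; a small illustrative check (the particular expressions are assumptions, not from the slides):

import re

# r | s, rs, and r* over the alphabet {a, b}:
union  = re.compile(r"a|b")        # L(a) ∪ L(b) = {a, b}
concat = re.compile(r"ab")         # L(a)L(b)    = {ab}
star   = re.compile(r"a*")         # L(a)*       = {ε, a, aa, ...}

print(bool(union.fullmatch("b")))      # True
print(bool(concat.fullmatch("ab")))    # True
print(bool(star.fullmatch("")))        # True: ε is in L(a)*
print(bool(star.fullmatch("aaa")))     # True
print(bool(star.fullmatch("ab")))      # False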
Specification of Patterns for Tokens:
Regular Definitions
• Naming convention for regular expressions:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}
• Each dj appearing in ri can be textually substituted by the regular expression it names



Specification of Patterns for Tokens:
Regular Definitions
• Example:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

digits → digit digit*   (a regular definition may not be recursive, so the recursive form digit digits | digit is avoided)
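The "textual substitution" idea can be sketched by building each later definition out of the earlier names, here rendered in Python re syntax (an illustration, not the slides' notation):

import re

# Regular definitions, built up by textually substituting earlier names.
letter = "[A-Za-z]"                      # letter -> A | B | ... | Z | a | b | ... | z
digit  = "[0-9]"                         # digit  -> 0 | 1 | ... | 9
id_re  = f"{letter}({letter}|{digit})*"  # id     -> letter ( letter | digit )*
digits = f"{digit}{digit}*"              # digits -> digit digit*  (i.e., digit+)

print(bool(re.fullmatch(id_re, "sum2")))   # True
print(bool(re.fullmatch(id_re, "2sum")))   # False: must start with a letter
print(bool(re.fullmatch(digits, "007")))   # True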



Specification of Patterns for Tokens:
Notational Shorthands
• We frequently use the following shorthands:
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
• For example:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

• NOTE: The unary postfix operator ? means “zero or one occurrence,” and + means “one or more occurrences.”
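Written with these shorthands, the num definition translates almost character-for-character into, for example, Python re syntax (only the dot needs escaping); an illustrative sketch:

import re

digit = "[0-9]"
# num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
num = re.compile(rf"{digit}+(\.{digit}+)?(E[+-]?{digit}+)?")

for lexeme in ["2", "31", "3.14", "6.02E23", "1E-9", "abc"]:
    print(lexeme, bool(num.fullmatch(lexeme)))
# 2 True, 31 True, 3.14 True, 6.02E23 True, 1E-9 True, abc False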



The End

