Lecture 1 3

Lexical Analysis
Topics covered
● Interaction of lexical analyzer with parser

● Why lexical analysis and parsing?
● Tokens, Patterns and Lexemes
● Regular Languages
● Regular Expressions
● Questions?
Interaction of lexical analyzer with parser
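[Figure: the source program is read by the lexical analyzer, which returns a token to the parser each time the parser calls getNextToken. The parser's output goes on to semantic analysis. Both the lexical analyzer and the parser use the symbol table.]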
Lexical analyzer
● Sometimes divided into a cascade of two processes:
– Scanning: simple processing that does not require tokenization, such as deleting comments and compacting consecutive whitespace characters into one.
– Lexical analysis proper: the more complex portion, which produces tokens from the output of the scanner.
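The scanning stage described above can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: it assumes C-style /* ... */ comments and ignores string literals, neither of which the slides specify, and the function name scan is hypothetical.

```python
import re

def scan(source: str) -> str:
    """Delete comments and compact each run of whitespace into one blank."""
    # Assumption: comments are C-style /* ... */ and may span several lines.
    without_comments = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    # Compact every run of blanks, tabs and newlines into a single space.
    return re.sub(r"\s+", " ", without_comments).strip()

print(scan("x  =  1; /* set x */\n\ty = 2;"))
```

The output of scan is still plain text; producing tokens from it is left to the second stage.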
Why lexical analysis and parsing?
● There are a number of reasons for separating lexical analysis from parsing:
– Simplicity of design is the most important consideration

– Compiler efficiency is improved
– Compiler portability is improved
Token, Patterns and Lexemes
● Token: a pair consisting of a token name and an optional attribute value
● Pattern: a description of the form that the lexemes of a token may take
● Lexeme: a sequence of characters in the source program that matches the pattern for a token
● Example:
– printf("Average is %d", avg);
– printf and avg are lexemes matching the pattern for token id
– "Average is %d" is a lexeme matching the pattern for token literal
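The printf example can be reproduced with a small pattern table. This is a sketch under stated assumptions: the slides name only the tokens id and literal, so the punct and ws names, the TOKEN_PATTERNS table and the lexemes function are all hypothetical, and plain ASCII double quotes are assumed around the string.

```python
import re

# Each entry pairs a token name with the pattern its lexemes must match.
# Only `id` and `literal` come from the slides; `punct` and `ws` are assumed.
TOKEN_PATTERNS = [
    ("literal", r'"[^"]*"'),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("punct", r"[(),;]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def lexemes(source):
    """Return (token, lexeme) pairs in source order, skipping whitespace."""
    pairs = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"no pattern matches at position {pos}")
        if m.lastgroup != "ws":
            pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs

for token, lexeme in lexemes('printf("Average is %d",avg);'):
    print(token, lexeme)
```

Both printf and avg come back paired with id, and the quoted string comes back paired with literal, matching the bullets above.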
Classes for tokens
● One token for each keyword
● Tokens for the operators, either individually or in classes such as the comparison operators
● One token representing all identifiers
● One or more tokens representing constants (numbers, literals)
● One token for each punctuation symbol

Token Specification
● Regular expressions are used for specifying patterns
● Strings and Languages
– Alphabet: any finite set of symbols (a character class)
– The set {0,1} is the binary alphabet
– ASCII, Unicode and EBCDIC are computer alphabets
– String (word, sentence): a finite sequence of symbols drawn from an alphabet
– ε denotes the empty string
– Language: any set of strings over some fixed alphabet
Operation on languages
● Union, Concatenation and Closure.
● Let L be the set {A,B,...,Z,a,b,...,z} and D the set {0,1,2,3,4,5,6,7,8,9}
– L ∪ D is the set of letters and digits
– LD is the set of strings consisting of a letter followed by a digit
– L⁴ is the set of all four-letter strings
– L* is the set of all strings of letters, including ε
– L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter
– D+ is the set of all strings of one or more digits
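The operations above can be tried out on small finite languages represented as Python sets. This is a sketch only: the two-element sets stand in for the full letter and digit sets, the function names are hypothetical, and because the closure is an infinite set, closure computes only a finite approximation up to a chosen power.

```python
def union(a, b):
    """Strings that are in a or in b."""
    return a | b

def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}

def power(a, n):
    """a concatenated with itself n times; the zeroth power is {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, a)
    return result

def closure(a, max_n):
    """Finite approximation of a*: the union of powers 0 through max_n."""
    result = set()
    for n in range(max_n + 1):
        result |= power(a, n)
    return result

L = {"a", "b"}   # tiny stand-in for the set of letters
D = {"0", "1"}   # tiny stand-in for the set of digits

print(sorted(union(L, D)))      # letters and digits
print(sorted(concat(L, D)))     # a letter followed by a digit
print(len(power(L, 4)))         # number of four-letter strings over {a, b}
print(sorted(closure(D, 2)))    # "" plays the role of the empty string
```

Positive closure differs from closure only in starting the union at the first power instead of the zeroth.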
Regular Expression
● A notation for defining a language (a set of strings) precisely
● Regular expression for an identifier in Pascal
– letter ( letter | digit )*
● Each regular expression r denotes a language L(r)
Regular Expression
● Rules that define regular expressions over an alphabet Σ
– ε is a regular expression denoting {ε}
– If a is a symbol in Σ, then a is a regular expression denoting {a}
– If u and v are regular expressions denoting languages L(u) and L(v), then:
   ● (u)|(v) denotes L(u) ∪ L(v)
   ● (u)(v) denotes L(u)L(v)
   ● (u)* denotes (L(u))*
   ● (v)² denotes (L(v))²
● A language denoted by a regular expression is called a regular set.
Example
● Let Σ = {0,1}
● RE (0|1) denotes the set {0, 1}
● RE (0(1|0)) denotes the set {01, 00}
● RE (1*) denotes the set {ε, 1, 11, 111, 1111, ...}, that is, every string of zero or more 1s
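These three examples can be checked by enumerating short strings over {0,1} and keeping the ones each expression matches in full. This is a sketch, assuming Python's re syntax as a stand-in for the slide notation; the function name language and the length bound max_len are hypothetical, and the bound is what keeps the listing for 1* finite.

```python
import re
from itertools import product

def language(pattern, alphabet="01", max_len=4):
    """All strings over alphabet, up to max_len long, matched in full by pattern."""
    matched = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                matched.append(s)
    return matched

print(language(r"(0|1)"))
print(language(r"0(1|0)"))
print(language(r"1*"))   # the first entry, "", is the empty string
```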
Regular Definition
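● As a notational convenience, we may give names to regular expressions
● If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
– d1 → r1
– d2 → r2
– d3 → r3
– ...
– dn → rn
● Each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ {d1, ..., d(i-1)}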
Example
● letter → A | B | ... | Z | a | b | ... | z
● digit → 0 | 1 | 2 | 3 | ... | 9
● id → letter ( letter | digit )*
Example
● Unsigned numbers such as 245, 1345.456, 345.324E34 or 1.87543E-23 are described by the regular definition
– digit → 0 | 1 | 2 | 3 | ... | 9
– digits → digit digit*
– optional_fraction → ( . digits ) | ε
– optional_exponent → ( E ( + | - | ε ) digits ) | ε
– num → digits optional_fraction optional_exponent
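The regular definition for num translates almost name for name into Python's re syntax. This is a minimal sketch: the constant names mirror the definition, is_num is a hypothetical helper, and "| ε" is written with the ? operator.

```python
import re

DIGITS = r"[0-9]+"                           # digits → digit digit*
OPTIONAL_FRACTION = rf"(\.{DIGITS})?"        # ( . digits ) | ε
OPTIONAL_EXPONENT = rf"(E[+-]?{DIGITS})?"    # ( E ( + | - | ε ) digits ) | ε
NUM = re.compile(DIGITS + OPTIONAL_FRACTION + OPTIONAL_EXPONENT)

def is_num(text):
    """True when the whole of text is a lexeme of num."""
    return NUM.fullmatch(text) is not None

for sample in ["245", "1345.456", "1.87543E-23", "345.", ".5"]:
    print(sample, is_num(sample))
```

Under this definition a fraction needs at least one digit after the point, so a string such as 345. is rejected, as is .5, which lacks the leading digits.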
Tokens Recognition
● Consider the grammar fragment
– stmt → if expr then stmt | if expr then stmt else stmt | ε
– expr → term relop term | term
– term → id | num
● The terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions
– if → if
– then → then
– else → else
– relop → < | <= | = | <> | > | >=
– id → letter ( letter | digit )*
– num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
Tokens Recognition continued
– delim → blank | tab | newline
– ws → delim +
Translation Table
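A translation table of this kind pairs each lexeme with the token, and the attribute value where there is one, that the lexical analyzer hands to the parser. The following is a sketch under stated assumptions rather than the table from the slide: the relop attribute names (LT, LE, EQ, NE, GT, GE) and the function name tokens are hypothetical, keywords are recognized by looking up id lexemes in a keyword set, and the lexeme itself stands in for a symbol-table pointer.

```python
import re

# Assumed attribute names for the six relational operators.
RELOP_ATTR = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}
KEYWORDS = {"if", "then", "else"}

TOKEN_RE = re.compile(r"""
    (?P<ws>[ \t\n]+)                                   # ws: delim+
  | (?P<num>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)      # num
  | (?P<id>[A-Za-z][A-Za-z0-9]*)                       # id
  | (?P<relop><=|<>|>=|<|=|>)                          # two-character operators first
""", re.VERBOSE)

def tokens(source):
    """Yield (token, attribute) pairs; whitespace is stripped silently."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                      # no token is returned for whitespace
        if kind == "relop":
            yield ("relop", RELOP_ATTR[lexeme])
        elif kind == "id" and lexeme in KEYWORDS:
            yield (lexeme, None)          # a keyword is its own token, no attribute
        else:
            yield (kind, lexeme)          # lexeme stands in for a symbol-table pointer

print(list(tokens("if x1 <= 2.5E3 then y else z")))
```

Listing the two-character operators before the one-character ones makes the matcher prefer the longest lexeme, so <= is never split into < followed by =.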
References
● Compilers: Principles, Techniques, and Tools (Alfred V. Aho, Columbia University; Monica S. Lam, Stanford University; Ravi Sethi, Avaya; Jeffrey D. Ullman, Stanford University)
● Compiler course (Alex Aiken, Stanford University)
