COMPILER CONSTRUCTION(CS-462)
(LECTURE 7)
(RELATED TO ASSIGNMENT # 1)

Instructor: Gul Sher Ali


Overview

 The process of taking an input stream of characters and converting/grouping it into a sequence of distinct and recognizable words is called Lexical Analysis or Tokenizing.
 Any procedure/module/program which performs the task of lexical analysis is called a Lexical Analyzer, Scanner, or Tokenizer.
Overview

 The process of conversion/grouping of a stream of characters into distinct and recognizable words is driven by some rules. These rules are known as Lexical Rules.
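 For example, a typical lexical rule states that an identifier is a letter followed by letters and digits (this pattern appears in the token table later in this lecture).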
Frontend Structure

 [Figure: frontend pipeline]
 The Source Code first passes through the Preprocessor, which handles #include, #define, #ifdef, etc. and catches trivial errors. The preprocessed source code (foo.i) then flows through Lexical Analysis, Syntax Analysis, and Semantic Analysis, each of which may report errors, and the frontend finally produces an Abstract Syntax Tree.
Lexical Analysis Process

 [Figure: scanning example]
 The preprocessed source code, e.g. if (b == 0) a = b;, is read character by character by the Lexical Analyzer (Scanner), which produces the token stream: if ( b == 0 ) a = b ;

 Lexical analysis
 - Transforms the multi-character input stream into a token stream
 - Reduces the length of the program representation (removes spaces)
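The grouping step above can be sketched in C as a small loop. This is only an illustration: the next_token helper and its behavior are invented for this example, not taken from any particular compiler.

    #include <ctype.h>
    #include <stdio.h>

    /* Print the next token's spelling and advance the input pointer.
       Returns 0 at end of input. A real scanner would return a token
       code to the parser instead of printing. */
    static int next_token(const char **p) {
        while (isspace((unsigned char)**p)) (*p)++;      /* drop spaces      */
        if (**p == '\0') return 0;
        const char *start = *p;
        if (isalpha((unsigned char)**p)) {               /* word: id/keyword */
            while (isalnum((unsigned char)**p)) (*p)++;
        } else if (isdigit((unsigned char)**p)) {        /* number           */
            while (isdigit((unsigned char)**p)) (*p)++;
        } else if (**p == '=' && (*p)[1] == '=') {       /* two-char op ==   */
            *p += 2;
        } else {                                         /* one-char symbol  */
            (*p)++;
        }
        printf("token: %.*s\n", (int)(*p - start), start);
        return 1;
    }

    int main(void) {
        const char *input = "if (b == 0) a = b;";
        while (next_token(&input))
            ;                  /* emits: if ( b == 0 ) a = b ; */
        return 0;
    }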
Tokens, Patterns, Lexemes

 The set of strings denoted by a token is defined by a rule called a pattern associated with that token. The pattern is used to match each string in the set.
 A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
 For example, in the Pascal statement
 const pi = 3.1416;
 the substring pi is a lexeme for the token "identifier".
Tokens, Patterns, Lexemes

TOKEN      SAMPLE LEXEMES           INFORMAL DESCRIPTION OF PATTERN
const      const                    const
if         if                       if
relation   <, <=, =, <>, >, >=      < or <= or = or <> or > or >=
id         pi, count, D2            letter followed by letters and digits
num        3.1416, 0, 6.02E23       any numeric constant
literal    "core dumped"            any characters between " and " except "
Tokens

 Keywords: if else while for break
 Identifiers: x y11 elsex
 Numbers
 Integers: 2 1000 -20
 Floating-point: 2.0 -0.0010 .02 1e5
 Symbols: + * { } ++ << < <= [ ]
 Strings: "x" "He said, \"I luv EECS 483\""
Attributes for Tokens

 When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler.
 For example, the pattern num matches both the strings 0 and 1, but it is essential for the code generator to know which string was actually matched.
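One common way to pass this extra information along is to attach an attribute to each token. The struct below is a hedged sketch; the type and field names are illustrative, not from a specific compiler.

    /* A token is a (code, attribute) pair: the code gives the class
       (e.g. num), the attribute records what was actually matched. */
    enum token_code { TK_ID, TK_NUM, TK_IF, TK_RELOP };

    struct token {
        enum token_code code;   /* token class, e.g. TK_NUM             */
        const char *lexeme;     /* the exact characters matched         */
        double value;           /* numeric value, meaningful for TK_NUM */
    };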
General Approaches to the Implementation of Lexical Analyzer

 There are three general approaches to the implementation of a lexical analyzer:
 Use a lexical analyzer generator, such as Lex, to produce the lexical analyzer from a regular-expression-based specification. The generator provides routines for reading and buffering the input.
 Write the lexical analyzer in a conventional systems programming language.
 Write the lexical analyzer in assembly language and explicitly manage the reading of input.
General Approaches to the Implementation of Lexical Analyzer

 These choices are listed on the previous slide in order of increasing difficulty for the implementer.
 Unfortunately, the harder-to-implement approaches often yield faster lexical analyzers.

 The Lexical Analyzer is the only phase of the compiler that reads the source program character by character,
General Approaches to the Implementation of Lexical Analyzer

 so it is possible to spend a considerable amount of time in this phase, even though the later phases are conceptually more complex.

 Thus, the speed of the lexical analysis phase is a concern in compiler design.
Specification of Tokens

 Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions will serve as names for sets of strings.
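As an illustration, the identifier pattern letter (letter | digit)* can be checked directly in C. The helper below is a sketch invented for this example:

    #include <ctype.h>

    /* Return 1 if s belongs to the set of strings named by the
       pattern letter (letter | digit)*, i.e. if s is an identifier. */
    static int matches_identifier(const char *s) {
        if (!isalpha((unsigned char)*s)) return 0;  /* must start with a letter */
        for (s++; *s != '\0'; s++)
            if (!isalnum((unsigned char)*s)) return 0;
        return 1;
    }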
Design Of A Lexical Analyzer

 Let us see the working of a Lexical Analyzer.
 Our source code is in a file, usually called the 'SOURCE FILE'.
 First of all, the Lexical Analyzer should have the ability to read files from the computer system.
 The Source File should be read into a buffer.
Design Of A Lexical Analyzer

 Next, the Lexical Analyzer should be able to read characters from this buffer.
 The characters read are grouped into tokens according to the Lexical Rules.
 A Lexical Analyzer can generate all tokens at a time (multi-pass) or one token at a time (single-pass), as sketched below.
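A hedged sketch of the reading step in C: load the whole Source File into a buffer, after which the scanner hands out tokens from memory rather than from the file. Error handling is kept minimal for the illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Read the whole source file into a heap-allocated buffer so the
       scanner works on memory rather than on the file directly. */
    static char *load_source(const char *path, long *len) {
        FILE *f = fopen(path, "rb");
        if (!f) return NULL;
        fseek(f, 0, SEEK_END);
        *len = ftell(f);
        rewind(f);
        char *buf = malloc((size_t)*len + 1);
        if (buf && fread(buf, 1, (size_t)*len, f) == (size_t)*len)
            buf[*len] = '\0';           /* NUL-terminate for scanning */
        fclose(f);
        return buf;
    }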
Design Of A Lexical Analyzer

 [Figure: data flow through the Lexical Analyzer]
 A Source File containing, for example, the line int num1 = a+(b/c1); is read as a stream of characters:
 i n t n u m 1 = a + ( b / c 1 ) ;
 The Lexical Analyzer converts this stream into a sequence of tokens (words) for the Parser, reporting any Lexical Errors along the way.
What are the possible words recognized by a Lexical Analyzer?

 Keywords
 Identifiers
 Operators
 Numeric constants
 Character constants
 Punctuation / Special Symbols
Working of Lexical Analyzer

 We must understand how a Lexical Analyzer recognizes different tokens in a source code.
 We take the help of regular expressions (REs).
 REs alone cannot handle some ambiguities that can occur during the scanning process.
Ambiguities in Lexical Analysis

 Two types of ambiguities can occur while developing a Lexical Analyzer:
 Keywords can also be identifiers.
 How much of the string should form a single token?
 The language definition should provide 'DISAMBIGUATING RULES' to solve these problems.
Disambiguating Rules

 Two rules should be kept in mind:
 Priority Rule
 Longest Substring Principle
Priority Rule

 There are two ways to implement the priority rule:
 With the use of a DFA
 With the use of the Symbol Table
 If we use a DFA as the solution, then we must specify final states for keywords as well.
 But the DFA can get very large and complex using this approach.
Priority Rule

 According to the second solution, we store all keywords in the Symbol Table.
 Whenever we encounter an identifier, we match its lexeme with the entries in the Symbol Table.
 If the value is matched, then we treat the token as a keyword rather than an identifier, as sketched below.
 Keywords are also called Reserved Words.
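A minimal sketch of this second solution in C. The keyword list here is illustrative; a real compiler would consult its Symbol Table.

    #include <string.h>

    /* Reserved words pre-loaded into the (simplified) symbol table. */
    static const char *keywords[] = { "if", "else", "while", "for", "break", "const" };

    /* Return 1 if the lexeme is a keyword, 0 if it is an ordinary identifier. */
    static int is_keyword(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i]) == 0)
                return 1;
        return 0;
    }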
First.cpp

 [Figure: scanning the contents of First.cpp]
 The file contains the line int ab = 67.1; stored character by character in the input buffer (positions [0] through [11] in the slide).
 Two pointers, lexeme_start and forward, move over this buffer; a DRIVER routine consults a DFA / transition table (T.T) and the Symbol Table (which already holds the keyword INT) to recognize each lexeme.
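In code, the two pointers from the figure might look like this. It is only a sketch: the real DFA is reduced here to "stop at a blank or at the end of input".

    #include <stdio.h>

    int main(void) {
        const char *buffer = "int ab = 67.1;";  /* contents of First.cpp   */
        const char *lexeme_start = buffer;      /* start of current lexeme */
        const char *forward = buffer;           /* advanced by the driver  */

        while (*forward != ' ' && *forward != '\0')
            forward++;                          /* extend the candidate lexeme */

        printf("lexeme: %.*s\n", (int)(forward - lexeme_start), lexeme_start);
        /* Prints "int"; the driver would look it up in the Symbol Table,
           emit the keyword token, and restart both pointers after the blank. */
        return 0;
    }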
Principle of Longest Substring

 Also called the 'Principle of Maximal Munch'.
 When a string can be either a single token or a sequence of several tokens, the single-token interpretation is preferred.
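A hedged C sketch of maximal munch for the operators <, <= and <<: the scanner prefers the longest operator it can match at the current position. The helper is invented for this example.

    #include <stdio.h>
    #include <string.h>

    /* Return the length of the longest operator starting at s. */
    static int munch_operator(const char *s) {
        if (strncmp(s, "<=", 2) == 0 || strncmp(s, "<<", 2) == 0)
            return 2;               /* single two-char token preferred */
        if (s[0] == '<')
            return 1;               /* fall back to the one-char token */
        return 0;
    }

    int main(void) {
        printf("%d\n", munch_operator("<=x"));  /* 2: one token "<=" */
        printf("%d\n", munch_operator("<x"));   /* 1: one token "<"  */
        return 0;
    }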
Token Delimiters

 Characters that are unambiguously part of other tokens act as delimiters.
 For example:
 Xtemp=ytemp
 In the above example, the equal sign distinguishes the two different identifiers.
White Spaces As Delimiters

 Blanks, newlines, and tab characters are generally also assumed to be token delimiters.
 For example:
 int a
 If the blank were ignored, the above could be treated as a single identifier 'inta', but the Lexical Analyzer will generate two tokens for the above declaration.
Problem of Lookahead

 Delimiters end token strings, but they are not part of the token itself.
 So, the Scanner must deal with the problem of Lookahead.
 When the Scanner encounters a delimiter, it must arrange that the delimiter is not removed from the rest of the input, since it may begin the next token.
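In C, one standard way to handle this is to read one character too far and push the delimiter back with ungetc, so it remains available as the start of the next token. The following is a sketch under that assumption:

    #include <ctype.h>
    #include <stdio.h>

    /* Read an identifier from stream f into out. The loop necessarily
       reads the delimiter that ends the identifier; ungetc returns it
       to the input so the rest of the scan still sees it. */
    static void read_identifier(FILE *f, char *out, int max) {
        int c, n = 0;
        while ((c = fgetc(f)) != EOF && isalnum(c) && n < max - 1)
            out[n++] = (char)c;
        out[n] = '\0';
        if (c != EOF)
            ungetc(c, f);           /* put the delimiter back */
    }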
Operations on Languages

 There are several important operations that can be applied to languages.
 For lexical analysis, the important ones are union, concatenation, and closure.

OPERATION                               DEFINITION
Union of L and M, written L U M         L U M = { s | s is in L or s is in M }
Concatenation of L and M, written LM    LM = { st | s is in L and t is in M }
Kleene closure of L, written L*         L* = L^0 U L^1 U L^2 U ... ("zero or more concatenations of" L)
Positive closure of L, written L+       L+ = L^1 U L^2 U L^3 U ... ("one or more concatenations of" L)
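 A small worked example of these operations: if L = { a, b } and M = { 0, 1 }, then L U M = { a, b, 0, 1 }, LM = { a0, a1, b0, b1 }, L* = { ε, a, b, aa, ab, ba, bb, aaa, ... }, and L+ is L* without the empty string ε.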
