COMPILER CONSTRUCTION(CS-462)
(LECTURE 7)
(RELATED TO ASSIGNMENT # 1)

Instructor: Gul Sher Ali


Overview

 The process of taking an input stream of characters and converting/grouping it into a sequence of distinct and recognizable words is called Lexical Analysis or Tokenizing.
 Any procedure/module/program which performs the task of lexical analysis is called a Lexical Analyzer, Scanner, or Tokenizer.
Overview

 The process of conversion/grouping of a stream of characters into distinct and recognizable words is driven by some rules. These rules are known as Lexical Rules.
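 For example, a typical lexical rule states that an identifier is a letter followed by letters and digits (this pattern appears in the token table later in this lecture).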
Frontend Structure

 [Figure: frontend pipeline]
 The Source Code first passes through the Preprocessor, which handles #include, #define, #ifdef, etc. and catches trivial errors. The preprocessed source code (foo.i) then flows through Lexical Analysis, Syntax Analysis, and Semantic Analysis, each of which may report errors, and the frontend finally produces an Abstract Syntax Tree.
Lexical Analysis Process

 [Figure: scanning example]
 The preprocessed source code, e.g. if (b == 0) a = b;, is read character by character by the Lexical Analyzer (Scanner), which produces the token stream: if ( b == 0 ) a = b ;

 Lexical analysis
 - Transforms the multi-character input stream into a token stream
 - Reduces the length of the program representation (removes spaces)
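The grouping step above can be sketched in C as a small loop. This is only an illustration: the next_token helper and its behavior are invented for this example, not taken from any particular compiler.

    #include <ctype.h>
    #include <stdio.h>

    /* Print the next token's spelling and advance the input pointer.
       Returns 0 at end of input. A real scanner would return a token
       code to the parser instead of printing. */
    static int next_token(const char **p) {
        while (isspace((unsigned char)**p)) (*p)++;      /* drop spaces      */
        if (**p == '\0') return 0;
        const char *start = *p;
        if (isalpha((unsigned char)**p)) {               /* word: id/keyword */
            while (isalnum((unsigned char)**p)) (*p)++;
        } else if (isdigit((unsigned char)**p)) {        /* number           */
            while (isdigit((unsigned char)**p)) (*p)++;
        } else if (**p == '=' && (*p)[1] == '=') {       /* two-char op ==   */
            *p += 2;
        } else {                                         /* one-char symbol  */
            (*p)++;
        }
        printf("token: %.*s\n", (int)(*p - start), start);
        return 1;
    }

    int main(void) {
        const char *input = "if (b == 0) a = b;";
        while (next_token(&input))
            ;                  /* emits: if ( b == 0 ) a = b ; */
        return 0;
    }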
Tokens, Patterns, Lexemes

 The set of strings denoted by a token is defined by a rule called a pattern associated with that token. The pattern is used to match each string in the set.
 A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
 For example, in the Pascal statement
 const pi = 3.1416;
 the substring pi is a lexeme for the token "identifier".
Tokens, Patterns, Lexemes

TOKEN      SAMPLE LEXEMES           INFORMAL DESCRIPTION OF PATTERN
const      const                    const
if         if                       if
relation   <, <=, =, <>, >, >=      < or <= or = or <> or > or >=
id         pi, count, D2            letter followed by letters and digits
num        3.1416, 0, 6.02E23       any numeric constant
literal    "core dumped"            any characters between " and " except "
Tokens

 Keywords: if else while for break
 Identifiers: x y11 elsex
 Numbers
 Integers: 2 1000 -20
 Floating-point: 2.0 -0.0010 .02 1e5
 Symbols: + * { } ++ << < <= [ ]
 Strings: "x" "He said, \"I luv EECS 483\""
Attributes for Tokens

 When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler.
 For example, the pattern num matches both the strings 0 and 1, but it is essential for the code generator to know which string was actually matched.
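One common way to pass this extra information along is to attach an attribute to each token. The struct below is a hedged sketch; the type and field names are illustrative, not from a specific compiler.

    /* A token is a (code, attribute) pair: the code gives the class
       (e.g. num), the attribute records what was actually matched. */
    enum token_code { TK_ID, TK_NUM, TK_IF, TK_RELOP };

    struct token {
        enum token_code code;   /* token class, e.g. TK_NUM             */
        const char *lexeme;     /* the exact characters matched         */
        double value;           /* numeric value, meaningful for TK_NUM */
    };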
General Approaches to the Implementation of Lexical Analyzer

 There are three general approaches to the implementation of a lexical analyzer:
 Use a lexical analyzer generator, such as Lex, to produce the lexical analyzer from a regular-expression-based specification. The generator provides routines for reading and buffering the input.
 Write the lexical analyzer in a conventional systems programming language.
 Write the lexical analyzer in assembly language and explicitly manage the reading of input.
General Approaches to the Implementation of Lexical Analyzer

 These choices are listed on the previous slide in order of increasing difficulty for the implementer.
 Unfortunately, the harder-to-implement approaches often yield faster lexical analyzers.

 The Lexical Analyzer is the only phase of the compiler that reads the source program character by character,
General Approaches to the Implementation of Lexical Analyzer

 so it is possible to spend a considerable amount of time in this phase, even though the later phases are conceptually more complex.

 Thus, the speed of the lexical analysis phase is a concern in compiler design.
Specification of Tokens

 Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions will serve as names for sets of strings.
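As an illustration, the identifier pattern letter (letter | digit)* can be checked directly in C. The helper below is a sketch invented for this example:

    #include <ctype.h>

    /* Return 1 if s belongs to the set of strings named by the
       pattern letter (letter | digit)*, i.e. if s is an identifier. */
    static int matches_identifier(const char *s) {
        if (!isalpha((unsigned char)*s)) return 0;  /* must start with a letter */
        for (s++; *s != '\0'; s++)
            if (!isalnum((unsigned char)*s)) return 0;
        return 1;
    }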
Design Of A Lexical Analyzer

 Let us see the working of a Lexical Analyzer.
 Our source code is in a file, usually called the 'SOURCE FILE'.
 First of all, the Lexical Analyzer should have the ability to read files from the computer system.
 The Source File should be read into a buffer.
Design Of A Lexical Analyzer

 Next, the Lexical Analyzer should be able to read characters from this buffer.
 The characters read are grouped into tokens according to the Lexical Rules.
 A Lexical Analyzer can generate all tokens at a time (multi-pass) or one token at a time (single-pass), as sketched below.
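A hedged sketch of the reading step in C: load the whole Source File into a buffer, after which the scanner hands out tokens from memory rather than from the file. Error handling is kept minimal for the illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Read the whole source file into a heap-allocated buffer so the
       scanner works on memory rather than on the file directly. */
    static char *load_source(const char *path, long *len) {
        FILE *f = fopen(path, "rb");
        if (!f) return NULL;
        fseek(f, 0, SEEK_END);
        *len = ftell(f);
        rewind(f);
        char *buf = malloc((size_t)*len + 1);
        if (buf && fread(buf, 1, (size_t)*len, f) == (size_t)*len)
            buf[*len] = '\0';           /* NUL-terminate for scanning */
        fclose(f);
        return buf;
    }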
Design Of A Lexical Analyzer

 [Figure: data flow through the Lexical Analyzer]
 A Source File containing, for example, the line int num1 = a+(b/c1); is read as a stream of characters:
 i n t n u m 1 = a + ( b / c 1 ) ;
 The Lexical Analyzer converts this stream into a sequence of tokens (words) for the Parser, reporting any Lexical Errors along the way.
What are the possible words recognized by a Lexical Analyzer?

 Keywords
 Identifiers
 Operators
 Numeric constants
 Character constants
 Punctuation / Special Symbols
Working of Lexical Analyzer

 We must understand how a Lexical Analyzer recognizes different tokens in a source code.
 We take the help of regular expressions (REs).
 REs alone cannot handle some ambiguities that can occur during the scanning process.
Ambiguities in Lexical Analysis

 Two types of ambiguities can occur while developing a Lexical Analyzer:
 Keywords can also be identifiers.
 How much of the string should form a single token?
 The language definition should provide 'DISAMBIGUATING RULES' to solve these problems.
Disambiguating Rules

 Two rules should be kept in mind:
 Priority Rule
 Longest Substring Principle
Priority Rule

 There are two ways to implement the priority rule:
 With the use of a DFA
 With the use of the Symbol Table
 If we use a DFA as the solution, then we must specify final states for keywords as well.
 But the DFA can get very large and complex using this approach.
Priority Rule

 According to the second solution, we store all keywords in the Symbol Table.
 Whenever we encounter an identifier, we match its lexeme with the entries in the Symbol Table.
 If the value is matched, then we treat the token as a keyword rather than an identifier, as sketched below.
 Keywords are also called Reserved Words.
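A minimal sketch of this second solution in C. The keyword list here is illustrative; a real compiler would consult its Symbol Table.

    #include <string.h>

    /* Reserved words pre-loaded into the (simplified) symbol table. */
    static const char *keywords[] = { "if", "else", "while", "for", "break", "const" };

    /* Return 1 if the lexeme is a keyword, 0 if it is an ordinary identifier. */
    static int is_keyword(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i]) == 0)
                return 1;
        return 0;
    }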
First.cpp

 [Figure: scanning the contents of First.cpp]
 The file contains the line int ab = 67.1; stored character by character in the input buffer (positions [0] through [11] in the slide).
 Two pointers, lexeme_start and forward, move over this buffer; a DRIVER routine consults a DFA / transition table (T.T) and the Symbol Table (which already holds the keyword INT) to recognize each lexeme.
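In code, the two pointers from the figure might look like this. It is only a sketch: the real DFA is reduced here to "stop at a blank or at the end of input".

    #include <stdio.h>

    int main(void) {
        const char *buffer = "int ab = 67.1;";  /* contents of First.cpp   */
        const char *lexeme_start = buffer;      /* start of current lexeme */
        const char *forward = buffer;           /* advanced by the driver  */

        while (*forward != ' ' && *forward != '\0')
            forward++;                          /* extend the candidate lexeme */

        printf("lexeme: %.*s\n", (int)(forward - lexeme_start), lexeme_start);
        /* Prints "int"; the driver would look it up in the Symbol Table,
           emit the keyword token, and restart both pointers after the blank. */
        return 0;
    }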
Principle of Longest Substring

 Also called the 'Principle of Maximal Munch'.
 When a string can be either a single token or a sequence of several tokens, the single-token interpretation is preferred.
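A hedged C sketch of maximal munch for the operators <, <= and <<: the scanner prefers the longest operator it can match at the current position. The helper is invented for this example.

    #include <stdio.h>
    #include <string.h>

    /* Return the length of the longest operator starting at s. */
    static int munch_operator(const char *s) {
        if (strncmp(s, "<=", 2) == 0 || strncmp(s, "<<", 2) == 0)
            return 2;               /* single two-char token preferred */
        if (s[0] == '<')
            return 1;               /* fall back to the one-char token */
        return 0;
    }

    int main(void) {
        printf("%d\n", munch_operator("<=x"));  /* 2: one token "<=" */
        printf("%d\n", munch_operator("<x"));   /* 1: one token "<"  */
        return 0;
    }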
Token Delimiters

 Characters that are unambiguously part of other tokens act as delimiters.
 For example:
 Xtemp=ytemp
 In the above example, the equal sign distinguishes the two different identifiers.
White Spaces As Delimiters

 Blanks, newlines, and tab characters are generally also assumed to be token delimiters.
 For example:
 int a
 If the blank were ignored, the above could be treated as a single identifier 'inta', but the Lexical Analyzer will generate two tokens for the above declaration.
Problem of Lookahead

 Delimiters end token strings, but they are not part of the token itself.
 So, the Scanner must deal with the problem of Lookahead.
 When the Scanner encounters a delimiter, it must arrange that the delimiter is not removed from the rest of the input, since it may begin the next token.
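In C, one standard way to handle this is to read one character too far and push the delimiter back with ungetc, so it remains available as the start of the next token. The following is a sketch under that assumption:

    #include <ctype.h>
    #include <stdio.h>

    /* Read an identifier from stream f into out. The loop necessarily
       reads the delimiter that ends the identifier; ungetc returns it
       to the input so the rest of the scan still sees it. */
    static void read_identifier(FILE *f, char *out, int max) {
        int c, n = 0;
        while ((c = fgetc(f)) != EOF && isalnum(c) && n < max - 1)
            out[n++] = (char)c;
        out[n] = '\0';
        if (c != EOF)
            ungetc(c, f);           /* put the delimiter back */
    }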
Operations on Languages

 There are several important operations that can be applied to languages.
 For lexical analysis, the important ones are union, concatenation, and closure.

OPERATION                               DEFINITION
Union of L and M, written L U M         L U M = { s | s is in L or s is in M }
Concatenation of L and M, written LM    LM = { st | s is in L and t is in M }
Kleene closure of L, written L*         L* = L^0 U L^1 U L^2 U ... ("zero or more concatenations of" L)
Positive closure of L, written L+       L+ = L^1 U L^2 U L^3 U ... ("one or more concatenations of" L)
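 A small worked example of these operations: if L = { a, b } and M = { 0, 1 }, then L U M = { a, b, 0, 1 }, LM = { a0, a1, b0, b1 }, L* = { ε, a, b, aa, ab, ba, bb, aaa, ... }, and L+ is L* without the empty string ε.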
