
Chapter 2

Lexical Analysis

Lexical Analyzer :-

→ Lexical Analyzer is the first phase of compiler.

Its main task is to read the input source program from left to right, one character at a time, & generate a sequence of tokens.

→ Each token is a single cohesive unit such as identifier, keyword, operator & punctuation mark.

[Figure: Source program → Lexical Analyzer → (token) → Parser → Syntax Tree → Rest of compiler → target code. The Parser sends "get next token" to the Lexical Analyzer. A Symbol Table is also shown.]


As the lexical analyzer scans the program to recognize tokens, it is also called scanner.

Upon receiving "get next token" from the parser, the lexical analyzer reads input characters until it can identify the next token.

→ It may also perform secondary tasks at the user interface.

One such task is stripping out comments & white space (in the form of blanks & newline characters) from the source program.
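The stripping task described above can be sketched as below. This is only an illustration: it assumes C-style comments (`/* ... */` and `// ...`), the function name `strip_comments_and_blanks` is made up for these notes, and string literals that contain comment markers are not handled.

```python
import re


def strip_comments_and_blanks(source: str) -> str:
    """Remove C-style comments, then collapse blanks and newlines."""
    # Block comments /* ... */ may span several lines, hence re.DOTALL.
    no_block = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    # Line comments run from // to the end of the line.
    no_line = re.sub(r"//[^\n]*", " ", no_block)
    # Any run of blanks, tabs or newlines becomes a single blank.
    return re.sub(r"\s+", " ", no_line).strip()


cleaned = strip_comments_and_blanks("int i; /* counter */\n  i = 0; // init\n")
print(cleaned)
```

A real scanner usually skips comments & white space while it reads characters, instead of making a separate pass over the whole program.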

Functions of Lexical Analyzer :-

→ Produces stream of tokens.

→ Eliminates blanks & comments.

→ Generates symbol table which stores information about identifiers, constants, etc.

→ Keeps track of line numbers.

→ Reports errors while generating tokens.
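The functions listed above can be seen together in a small sketch. It is a simplification under stated assumptions: lexemes are separated by blanks, only `//` comments exist, keywords are not distinguished from identifiers, and the name `analyze` and the shape of the symbol table (name mapped to the line where it first appears) are chosen only for illustration.

```python
def analyze(source: str):
    """Return (tokens, symbol_table, errors) for blank-separated input."""
    tokens = []
    symbol_table = {}  # identifier -> line number where it was first seen
    errors = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        code = line.split("//", 1)[0]  # eliminate a trailing comment
        for lexeme in code.split():    # eliminate blanks
            if lexeme[0].isalpha() and lexeme.isalnum():
                tokens.append(("identifier", lexeme, line_no))
                symbol_table.setdefault(lexeme, line_no)
            elif lexeme.isdigit():
                tokens.append(("constant", lexeme, line_no))
            else:
                # Report the error together with the tracked line number.
                errors.append(
                    f"line {line_no}: cannot form a token from {lexeme!r}"
                )
    return tokens, symbol_table, errors


tokens, table, errors = analyze("count 10 // first\ntotal 9x\ncount")
print(tokens)
print(table)
print(errors)
```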

→ Lexical Analyzer works in 2 phases :-

① In the first phase, it scans the input.

② In the 2nd phase, it does lexical analysis, meaning it generates a series of tokens.
Tokens, Patterns & Lexemes

Tokens :- A sequence of characters having a collective meaning is called a token.

Typical tokens are :-

① Identifiers ② Keywords ③ Operators ④ Special symbols ⑤ Constants

Pattern :- The set of rules that describe a token is called its pattern.

Lexemes :- A sequence of characters in the program that is matched with the pattern of a token is called a lexeme.

For e.g. int, i, num, etc.

Example :-

if(a<b)

if - keyword
( - operator
a - identifier
< - operator
b - identifier
) - operator
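The classification in the example above can be reproduced with a small pattern-driven tokenizer. This is a sketch, not a full lexer: the keyword list, the operator set and the names `TOKEN_SPEC`, `MASTER` and `tokenize` are assumptions made for illustration.

```python
import re

# Each entry pairs a token name with the pattern (rule) that describes it.
# Order matters: keywords are tried before identifiers.
TOKEN_SPEC = [
    ("keyword", r"\b(?:if|else|int|void)\b"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("constant", r"\d+"),
    ("operator", r"[()<>=+\-*/,]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source: str):
    """Return a list of (lexeme, token name) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {source[pos]!r} at position {pos}"
            )
        if match.lastgroup != "skip":  # blanks produce no token
            tokens.append((match.group(), match.lastgroup))
        pos = match.end()
    return tokens


for lexeme, kind in tokenize("if(a< b)"):
    print(lexeme, "-", kind)
```

Here the matched text is the lexeme, the regular expression is the pattern, and the group name is the token.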
Identifier :-

→ Identifier is a collection of alphanumeric characters (letters & digits).

→ First character of an identifier must be a letter.
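The two rules above amount to the pattern "a letter followed by zero or more letters or digits". A minimal check, with `is_identifier` as an illustrative name and ASCII letters assumed:

```python
import re

# A letter first, then any number of letters or digits.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(lexeme: str) -> bool:
    """True if the whole lexeme fits the identifier rule."""
    return IDENTIFIER.fullmatch(lexeme) is not None


print(is_identifier("num1"))  # starts with a letter
print(is_identifier("1num"))  # starts with a digit
```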

Operator :-

→ Operator can be arithmetic, logical or relational.

→ Parentheses are considered as operators.

→ Comma is treated as a separation operator.

→ Assignment is denoted by an operator.

Keyword :-

→ Keywords are special words to which some meaning is associated.

→ For e.g. int, void are keywords denoting data types.
Input buffering :-

→ Lexical analyzer scans the input from left to right, one character at a time.

→ It uses 2 pointers to keep track of the portion of input scanned:

1) begin ptr (bp)

2) forward ptr (fp)

→ Initially, both the pointers are pointing towards the first character of the input string.

[Figure: input buffer holding the source string, with bp and fp both at its first character]
→ The forward pointer moves ahead to search for the end of the lexeme.

→ As soon as a blank space is encountered, it indicates the end of the lexeme.


[Figure: bp at the start of the lexeme, fp at the blank space that follows it]

→ When fp encounters white space, it is ignored & fp moves ahead.

→ Then both the forward pointer & the begin pointer are set at the next token.
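The movement of the two pointers can be traced with a short sketch. It follows the simplified rule of these notes, where only a blank ends a lexeme; a practical scanner would also stop at operators and punctuation. The function name `scan_lexemes` is illustrative.

```python
def scan_lexemes(text: str):
    """Split text into lexemes using a begin pointer and a forward pointer."""
    lexemes = []
    bp = 0  # begin pointer: start of the current lexeme
    fp = 0  # forward pointer: moves ahead to find the end of the lexeme
    n = len(text)
    while bp < n:
        # White space is ignored: fp moves ahead over it.
        while fp < n and text[fp].isspace():
            fp += 1
        bp = fp  # both pointers now sit at the start of the next lexeme
        # fp moves ahead until a blank (or the end of input) is found.
        while fp < n and not text[fp].isspace():
            fp += 1
        if fp > bp:
            lexemes.append(text[bp:fp])
        bp = fp
    return lexemes


print(scan_lexemes("int i, j;"))
```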

→ The input characters are read from secondary storage.

But reading from secondary storage is costly, so buffering techniques are used.

→ Two methods are :-

① One Buffer Scheme

② Two Buffer Scheme

① One Buffer Scheme :-

→ In one buffer scheme, only one buffer is used to store the input string.

→ The problem with the above approach is that if a lexeme is very long, it crosses the buffer boundary.

→ So to read the rest of the lexeme, the buffer needs to be refilled, which overwrites the first part of the lexeme.

② Two Buffer Scheme : -

→ To overcome the problem of one buffer, we use the two buffer scheme.

→ Two buffers are used to store the input string.

→ First buffer & second buffer are scanned alternately.

→ When the end of the first buffer is reached, the second buffer is filled, & vice versa.

→ To identify the boundary of the first buffer, we use a special string called EOF (End of file). This is also known as a sentinel.

→ Sentinels are strings/characters which indicate the end of a buffer; they are represented by eof.
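A rough sketch of the two buffer scheme with sentinels is given below. Everything named here is an assumption for illustration: the class `TwoBufferReader`, the buffer size of 8, and the use of the character `"\0"` as the eof sentinel (it must not occur in the source text). The sketch also assumes the stream returns a full buffer on every read except the last, which holds for `io.StringIO`.

```python
import io

EOF = "\0"  # assumed sentinel character marking the end of a buffer


class TwoBufferReader:
    def __init__(self, stream, size=8):
        self.stream = stream
        self.size = size
        self.buffers = ["", ""]
        self.active = 0  # index of the buffer being scanned
        self.fp = 0      # forward pointer inside the active buffer
        self._fill(0)

    def _fill(self, which):
        # Load up to `size` characters and append the sentinel.
        self.buffers[which] = self.stream.read(self.size) + EOF

    def next_char(self):
        ch = self.buffers[self.active][self.fp]
        if ch != EOF:
            self.fp += 1
            return ch
        if self.fp == self.size:
            # Sentinel at the end of a full buffer: fill the other
            # buffer and continue there. The old buffer is untouched.
            self.active = 1 - self.active
            self._fill(self.active)
            self.fp = 0
            return self.next_char()
        return None  # sentinel before the buffer end: end of input


reader = TwoBufferReader(io.StringIO("int i, j; i = i + 1;"), size=8)
chars = []
while (c := reader.next_char()) is not None:
    chars.append(c)
print("".join(chars))
```

Because the sentinel is an ordinary character in the buffer, the common case needs only one comparison per character; the buffer-boundary check happens only when the sentinel is seen.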
Specifications of Tokens

→ To specify tokens, regular expressions are used.

→ When a pattern is matched by some regular expression, then the token is recognized.

→ A string is a collection of a finite number of alphabets or letters.

→ The strings are also called as words.

① Length of string s is denoted by |s|.

② The empty string can be denoted by ε.

③ The empty set of strings is denoted by Φ.
Commonly used terms in strings :-

1) Prefix of a string :-

A string obtained by removing zero or more trailing symbols is called a prefix of the string.

2) Suffix of a string :-

A string obtained by removing zero or more leading symbols is called a suffix of the string.

3) Substring :-

A string obtained by removing a prefix & a suffix is called a substring of the string.

4) Subsequence of a string :-

A string obtained by removing zero or more characters (not necessarily contiguous) is called a subsequence of the string.
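The four terms above can be checked on a small string such as "abc". The helper names below are illustrative only.

```python
def prefixes(s: str):
    """Remove zero or more trailing symbols."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s: str):
    """Remove zero or more leading symbols."""
    return [s[i:] for i in range(len(s) + 1)]


def substrings(s: str):
    """Remove a prefix and a suffix; duplicates are dropped."""
    found = {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}
    return sorted(found, key=lambda t: (len(t), t))


def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by deleting characters."""
    remaining = iter(s)
    # `ch in remaining` consumes the iterator, so order is respected.
    return all(ch in remaining for ch in t)


print(prefixes("abc"))
print(suffixes("abc"))
print(substrings("abc"))
print(is_subsequence("ac", "abc"), is_subsequence("ca", "abc"))
```

Note that "ac" is a subsequence of "abc" but not a substring, since the removed character "b" lies in the middle.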
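Thompson's Construction :-

① r = ε
[Figure: start → q0, with an ε-transition from q0 to the final state qf]

② r = a
[Figure: start → q0, with a transition on a from q0 to the final state qf]

③ r = r1 + r2
[Figure: NFA for r1 + r2, built from the NFAs for r1 and r2 with ε-transitions]

④ r = r1.r2
[Figure: NFA for r1.r2, built from the NFA for r1 followed by the NFA for r2]

⑤ r = (r1)*
[Figure: NFA for (r1)*, built from the NFA for r1 with ε-transitions]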
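The five cases of Thompson's construction can be sketched in code as below. This is an illustrative sketch, not the notes' own figures: the names `Fragment`, `symbol`, `union`, `concat`, `star` and `accepts` are assumptions, and concatenation links the two machines with an ε-transition instead of merging the two states, which accepts the same language.

```python
from itertools import count

_ids = count()  # source of fresh state numbers
EPS = None      # label used for ε-transitions


class Fragment:
    """An NFA with one start state, one final state and a list of edges."""

    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges


def _pair():
    return next(_ids), next(_ids)


def epsilon():  # case 1: r = ε
    s, f = _pair()
    return Fragment(s, f, [(s, EPS, f)])


def symbol(a):  # case 2: r = a
    s, f = _pair()
    return Fragment(s, f, [(s, a, f)])


def union(n1, n2):  # case 3: r = r1 + r2
    s, f = _pair()
    extra = [(s, EPS, n1.start), (s, EPS, n2.start),
             (n1.final, EPS, f), (n2.final, EPS, f)]
    return Fragment(s, f, n1.edges + n2.edges + extra)


def concat(n1, n2):  # case 4: r = r1.r2 (ε-link instead of state merge)
    link = [(n1.final, EPS, n2.start)]
    return Fragment(n1.start, n2.final, n1.edges + n2.edges + link)


def star(n):  # case 5: r = (r1)*
    s, f = _pair()
    extra = [(s, EPS, n.start), (n.final, EPS, f),
             (n.final, EPS, n.start), (s, EPS, f)]
    return Fragment(s, f, n.edges + extra)


def _closure(states, edges):
    """All states reachable from `states` using only ε-transitions."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, text):
    current = _closure({nfa.start}, nfa.edges)
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = _closure(moved, nfa.edges)
    return nfa.final in current


ab_star = star(union(symbol("a"), symbol("b")))  # (a + b)*
print(accepts(ab_star, "abba"), accepts(ab_star, "abc"))
```

Each function adds at most two new states, so the NFA for a regular expression stays proportional in size to the expression itself.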