Chapter 2
Lexical Analysis

Lexical Analyzer :-
→ The lexical analyzer is the first phase of a compiler.
→ Its main task is to read the input source program from left to right, one character at a time, and generate a sequence of tokens.
→ Each token is a single cohesive unit, such as an identifier, keyword, operator or punctuation mark.
[Figure: Source program → Lexical Analyzer → (token) → Parser → (syntax tree) → Rest of compiler → Target code. The parser sends "get next token" to the lexical analyzer; a symbol table is drawn below the phases.]
→ As the lexical analyzer scans the program to recognize tokens, it is also called a scanner.
→ Upon receiving "get next token" from the parser, the lexical analyzer reads input characters until it can identify the next token.
→ It may also perform secondary tasks at the user interface.
→ One such task is stripping out comments & white space (blanks & newline characters) from the source program.
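The stripping task above can be sketched as follows. This is a minimal sketch, assuming C-style comments (`/* ... */` and `// ...`), since the notes do not name a source language. The function name `strip_comments_and_whitespace` is hypothetical, and comment markers inside string literals are not handled.

```python
import re

# Assumption: C-style comments; the notes do not fix a concrete language.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

def strip_comments_and_whitespace(source):
    """Replace each comment by a blank, then collapse runs of blanks and newlines."""
    without_comments = COMMENT.sub(" ", source)
    return " ".join(without_comments.split())

print(strip_comments_and_whitespace("int i; /* counter */\n\n  i = i + 1; // bump\n"))
```

Each comment is replaced by a blank rather than deleted, so the lexemes on either side of it do not run together.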
Functions of Lexical Analyzer :-
→ Produces a stream of tokens.
→ Eliminates blanks & comments.
→ Generates the symbol table, which stores information about identifiers, constants, etc.
→ Keeps track of line numbers.
→ Reports errors encountered while generating tokens.
→ The lexical analyzer works in 2 phases :-
① In the first phase it scans the input.
② In the 2nd phase it does lexical analysis, meaning it generates the series of tokens.
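The functions listed above can be put together in one small scanner. This is only a sketch under assumptions: the token names, the patterns in `TOKEN_SPEC`, the keyword subset and the function `scan` are all illustrative and do not come from the notes.

```python
import re

KEYWORDS = {"int", "void", "if"}           # illustrative subset
TOKEN_SPEC = [                              # hypothetical patterns, tried in this order
    ("COMMENT",    r"/\*.*?\*/"),
    ("NEWLINE",    r"\n"),
    ("BLANK",      r"[ \t]+"),
    ("CONSTANT",   r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("OPERATOR",   r"[-+*/=<>(),;]"),
    ("ERROR",      r"."),                   # any character no other pattern accepts
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

def scan(source):
    tokens, symbol_table, errors = [], {}, []
    line = 1
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        if kind in ("IDENTIFIER", "CONSTANT"):
            # symbol table: remember where each name or constant first appears
            symbol_table.setdefault(lexeme, {"kind": kind, "first_line": line})
        if kind == "ERROR":
            errors.append(f"line {line}: unexpected character {lexeme!r}")
        elif kind not in ("COMMENT", "NEWLINE", "BLANK"):
            tokens.append((kind, lexeme, line))     # blanks and comments are dropped
        line += lexeme.count("\n")                  # keep track of line numbers
    return tokens, symbol_table, errors

tokens, table, errors = scan("int a;\n/* note */ a = a + 5 @\n")
print(tokens)
print(table)
print(errors)
```

The line counter is updated from the lexeme itself, so a comment that spans several lines still advances it correctly.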
Tokens, Patterns & Lexemes
Tokens :-
A sequence of characters having a collective meaning is called a token.
Typical tokens are :-
① Identifiers ② Keywords ③ Operators ④ Special symbols ⑤ Constants
Pattern :-
The set of rules that describe a token is called its pattern.
Lexemes :-
A sequence of characters in the source program that is matched by the pattern of a token is called a lexeme.
For e.g. int, i, num, etc.
Example :-
if(a<b)
if - keyword
( - operator
a - identifier
< - operator
b - identifier
) - operator
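The relation between token, pattern and lexeme in this example can be sketched in code. The token names follow the notes; the concrete patterns, the keyword set and the function `classify` are assumptions made for illustration.

```python
import re

KEYWORDS = {"if", "int", "void"}            # illustrative subset
PATTERNS = [                                 # (token, pattern) pairs, all hypothetical
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("constant",   re.compile(r"[0-9]+")),
    ("operator",   re.compile(r"[-+*/<>=(),]")),
]

def classify(lexeme):
    """Return the token whose pattern matches the whole lexeme."""
    if lexeme in KEYWORDS:                   # keywords look like identifiers, so check first
        return "keyword"
    for token, pattern in PATTERNS:
        if pattern.fullmatch(lexeme):
            return token
    return "unknown"

for lexeme in ["if", "(", "a", "<", "b", ")"]:
    print(f"{lexeme!r:6} -> {classify(lexeme)}")
```

Parentheses are classified as operators here only because the notes treat them that way.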
Identifier :-
→ An identifier is a collection of alphanumeric characters (letters and digits).
→ The first character of an identifier must be a letter.
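The two rules above amount to a small state machine: one state before the first letter and one state inside the identifier. A minimal sketch, assuming ASCII letters and digits only (the notes do not say whether an underscore is allowed, so it is rejected here); `is_identifier` is a hypothetical name.

```python
def is_identifier(text):
    state = "start"
    for ch in text:
        if not ch.isascii():
            return False
        if state == "start" and ch.isalpha():
            state = "in_identifier"          # first character must be a letter
        elif state == "in_identifier" and ch.isalnum():
            state = "in_identifier"          # later characters: letters or digits
        else:
            return False
    return state == "in_identifier"          # the empty string is not an identifier

print(is_identifier("num"), is_identifier("i2"), is_identifier("2i"))
```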
Operator :-
→ Operators can be arithmetic, logical or relational.
→ Parentheses are considered as operators.
→ Comma is treated as a separation operator.
→ Assignment is denoted by an operator.
Keyword :-
→ Keywords are special words with which some meaning is associated.
→ For e.g. int, void are keywords denoting data types.
Input buffering :-
→ The lexical analyzer scans the input from left to right, one character at a time.
→ It uses 2 pointers to keep track of the portion of input scanned :-
1) begin ptr (bp)
2) forward ptr (fp)
→ Initially both the pointers point to the first character of the input string.
[Figure: input string int i, j; i = i + 1; j = j + 1; with bp and fp both at its first character]
→ The forward pointer moves ahead to search for the end of the lexeme.
→ As soon as a blank space is encountered, it indicates the end of the lexeme.

[Figure: bp at the start of the lexeme, fp at the blank space that follows it]
→ When fp encounters white space, it is ignored & fp moves ahead.
→ Then both the forward pointer & the begin pointer are set at the next token.
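The movement of bp and fp described above can be sketched as follows. This is a simplified illustration in which a lexeme ends only at white space, as in the notes; a real scanner would also end a lexeme at an operator. The function name `split_lexemes` is hypothetical.

```python
def split_lexemes(text):
    lexemes = []
    bp = fp = 0                          # both pointers start at the first character
    while fp < len(text):
        if text[fp].isspace():
            if fp > bp:                  # blank space marks the end of a lexeme
                lexemes.append(text[bp:fp])
            fp += 1                      # white space is ignored
            bp = fp                      # both pointers now sit at the next lexeme
        else:
            fp += 1                      # fp moves ahead looking for the end
    if fp > bp:                          # last lexeme, ended by end of input
        lexemes.append(text[bp:fp])
    return lexemes

print(split_lexemes("int i , j ;  i = i + 1 ;"))
```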
→ The input characters are read from secondary storage.
→ But reading from secondary storage is costly, so buffering techniques are used.
→ Two methods are :-
① One Buffer Scheme
② Two Buffer Scheme

① One Buffer Scheme :-
→ In the one buffer scheme, only one buffer is used to store the input string.
→ The problem with the above approach is that if a lexeme is very long, it crosses the buffer boundary.
→ So to read the rest of the lexeme, the buffer needs to be refilled, which overwrites the first part of the lexeme.
② Two Buffer Scheme :-
→ To overcome the problem of one buffer, we use the two buffer scheme.
→ Two buffers are used to store the input string.
→ The first buffer & second buffer are scanned alternately.
→ When the end of the first buffer is reached, the second buffer is filled, and vice versa.
→ To identify the boundary of the first buffer, we use a string called EOF (End of file). This is also known as a sentinel.
→ Sentinels are strings/characters which indicate the end of a buffer, and are represented by eof.
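The two buffer scheme with sentinels can be sketched as below. The class `TwoBufferReader`, the buffer size and the choice of `"\0"` as the eof sentinel are assumptions for illustration. The sketch models only the forward pointer; the begin pointer is left out.

```python
import io

EOF = "\0"   # sentinel character; assumed never to occur in the real input

class TwoBufferReader:
    def __init__(self, stream, half_size=8):
        self.stream = stream
        self.n = half_size
        # layout: [first buffer | EOF | second buffer | EOF]
        self.buf = [EOF] * (2 * half_size + 2)
        self.forward = 0
        self._load(0)

    def _load(self, half):
        start = half * (self.n + 1)
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            # unused slots get EOF, which later marks the real end of input
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        ch = self.buf[self.forward]
        if ch == EOF:
            if self.forward == self.n:              # end of first buffer: fill second
                self._load(1)
                self.forward = self.n + 1
                return self.next_char()
            if self.forward == 2 * self.n + 1:      # end of second buffer: fill first
                self._load(0)
                self.forward = 0
                return self.next_char()
            return None                             # eof inside a buffer: input is finished
        self.forward += 1
        return ch

def read_all(reader):
    chars = []
    ch = reader.next_char()
    while ch is not None:
        chars.append(ch)
        ch = reader.next_char()
    return "".join(chars)

print(read_all(TwoBufferReader(io.StringIO("int i = i + 1;"), half_size=4)))
```

Because of the sentinel, the common case needs a single comparison per character; the boundary checks run only when eof is actually seen.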
Specifications of Tokens
→ To specify tokens, regular expressions are used.
→ When a pattern is matched by some regular expression, the token is recognized.
→ A string is a finite sequence of symbols (letters) from an alphabet.
→ Strings are also called words.
① The length of a string s is denoted by |s|.
② The empty string is denoted by ε.
③ The empty set of strings is denoted by Φ.
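The difference between the empty string and the empty set of strings is easy to mix up, so a tiny illustration may help. The variable names are arbitrary.

```python
s = "banana"
epsilon = ""             # the empty string: a string of length 0
empty_set = set()        # the empty set of strings: contains no string at all

print(len(s))                            # length of s, written |s|
print(len(epsilon))                      # 0
print(epsilon + s == s)                  # concatenating the empty string changes nothing
print(len(empty_set), len({epsilon}))    # a set holding only the empty string is not empty
```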
Commonly used terms in strings :-
1) Prefix of a string :-
→ A string obtained by removing zero or more trailing symbols is called a prefix of the string.
2) Suffix of a string :-
→ A string obtained by removing zero or more leading symbols is called a suffix of the string.
3) Substring :-
→ A string obtained by removing a prefix & a suffix is called a substring of the string.
4) Subsequence of a string :-
→ A string obtained by removing zero or more characters (not necessarily contiguous) is called a subsequence of the string.
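The four terms above translate directly into slicing. The helper names below are hypothetical; the sketch enumerates prefixes, suffixes and substrings and tests the subsequence relation.

```python
def prefixes(s):
    """Remove zero or more trailing symbols."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """Remove zero or more leading symbols."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """Remove a prefix and a suffix; duplicates are dropped."""
    found = {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}
    return sorted(found, key=lambda t: (len(t), t))

def is_subsequence(t, s):
    """True if t can be obtained from s by deleting characters, keeping the order."""
    remaining = iter(s)
    return all(ch in remaining for ch in t)   # each search resumes where the last one stopped

print(prefixes("ban"))
print(suffixes("ban"))
print(substrings("ban"))
print(is_subsequence("bn", "ban"), is_subsequence("nb", "ban"))
```

Every prefix and every suffix is also a substring, and every substring is also a subsequence, but not the other way round: "bn" is a subsequence of "ban" without being a substring.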
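Thompson's Construction :-
① r = ε : start → q0 --ε--> qf
② r = a : start → q0 --a--> qf
③ r = r1 + r2 : [Figure: a new start state with ε-moves into the NFAs for r1 and r2, and ε-moves from both into a new final state]
④ r = r1.r2 : [Figure: the NFA for r1 followed by the NFA for r2]
⑤ r = (r1)* : [Figure: the NFA for r1 with ε-moves that allow it to be skipped or repeated]
[Figure: worked example with its transition table]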
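The five cases can be sketched as functions that build NFA fragments, each with one start and one final state. This is a sketch under assumptions: the names (`NFA`, `symbol`, `union`, `concat`, `star`, `accepts`) are hypothetical, and concatenation joins the two fragments with an ε-move, a common simplification, instead of merging states.

```python
from itertools import count
from typing import NamedTuple

_ids = count()                    # fresh state numbers

class NFA(NamedTuple):
    start: int
    final: int
    edges: list                   # (source, label, target); label None means an ε-move

def epsilon():                    # case ①: r = ε
    s, f = next(_ids), next(_ids)
    return NFA(s, f, [(s, None, f)])

def symbol(a):                    # case ②: r = a
    s, f = next(_ids), next(_ids)
    return NFA(s, f, [(s, a, f)])

def union(r1, r2):                # case ③: r = r1 + r2
    s, f = next(_ids), next(_ids)
    extra = [(s, None, r1.start), (s, None, r2.start),
             (r1.final, None, f), (r2.final, None, f)]
    return NFA(s, f, r1.edges + r2.edges + extra)

def concat(r1, r2):               # case ④: r = r1.r2
    return NFA(r1.start, r2.final, r1.edges + r2.edges + [(r1.final, None, r2.start)])

def star(r1):                     # case ⑤: r = (r1)*
    s, f = next(_ids), next(_ids)
    extra = [(s, None, r1.start), (s, None, f),
             (r1.final, None, r1.start), (r1.final, None, f)]
    return NFA(s, f, r1.edges + extra)

def eps_closure(states, edges):
    """All states reachable from `states` using ε-moves only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, text):
    current = eps_closure({nfa.start}, nfa.edges)
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges if src in current and label == ch}
        current = eps_closure(moved, nfa.edges)
    return nfa.final in current

# (a + b)* a : strings over {a, b} that end in a
ends_in_a = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))
print(accepts(ends_in_a, "bba"), accepts(ends_in_a, "ab"))
```

Each sub-expression should be built afresh; passing the same fragment object to two constructors would make them share states.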
