Lexical Analysis
Unit 2
Role of lexical analyzer
- Recognize tokens and ignore white spaces and comments
- Generate the token stream
- Error reporting
- Model tokens using regular expressions
- Recognize tokens using finite state automata (diagram omitted)
Design of lexical analyzer
- Allow white spaces, numbers, and arithmetic operators in an expression
- Return tokens and attributes to the syntax analyzer
- A global variable tokenval is set to the value of the number
- The design requires that:
  - a finite set of tokens be defined
  - the strings belonging to each token be described

We now construct a lexical analyzer for a language in which white spaces, numbers, and arithmetic operators are allowed in an expression. From the input stream, the lexical analyzer recognizes the tokens and their corresponding attributes and returns them to the syntax analyzer. To achieve this, the function returns the corresponding token for the lexeme and sets a global variable, say tokenval, to the value of that token. Thus, we must define a finite set of tokens and specify the strings belonging to each token. We must also keep a count of the line number for the purposes of error reporting and debugging.
Regular Expressions
We use regular expressions to describe the tokens of a programming language.
- A regular expression is built up from simpler regular expressions using defining rules.
- Each regular expression denotes a language.
- A language denoted by a regular expression is called a regular set.
Regular Expressions (Rules)
Regular expressions over an alphabet Σ:

  Reg. Expr      Language it denotes
  ε              {ε}
  a ∈ Σ          {a}
  (r1) | (r2)    L(r1) ∪ L(r2)
  (r1) (r2)      L(r1) L(r2)
  (r)*           (L(r))*
  (r)            L(r)

Shorthands:
  (r)+ = (r)(r)*
  (r)? = (r) | ε

We may remove parentheses by using precedence rules: * is highest, concatenation next, and | is lowest. Thus ab*|c means (a(b)*)|(c).

Examples over Σ = {0,1}:
  0|1        => {0,1}
  (0|1)(0|1) => {00,01,10,11}
  0*         => {ε,0,00,000,0000,...}
  (0|1)*     => all strings of 0s and 1s, including the empty string
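The examples above can be checked mechanically with Python's re module (a sketch for illustration; Python's regex syntax also writes alternation as |, and re.fullmatch tests whether a whole string belongs to the denoted language):

```python
import re

# (0|1)* denotes all strings over {0,1}, including the empty string
assert re.fullmatch(r"(0|1)*", "") is not None
assert re.fullmatch(r"(0|1)*", "0110") is not None
assert re.fullmatch(r"(0|1)*", "012") is None     # 2 is not in the alphabet

# 0* denotes {ε, 0, 00, 000, ...}
assert re.fullmatch(r"0*", "0000") is not None

# (0|1)(0|1) denotes exactly the four two-symbol strings {00, 01, 10, 11}
for s in ["00", "01", "10", "11"]:
    assert re.fullmatch(r"(0|1)(0|1)", s) is not None
```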

Specification and recognition of tokens
A token represents a set of strings described by a pattern.
- An identifier represents the set of strings that start with a letter and continue with letters and digits.
- The actual string (e.g. newval) is called a lexeme.
- Tokens: identifier, number, addop, delimiter, ...

Since a token can represent more than one lexeme, additional information must be held for the specific lexeme. This additional information is called the attribute of the token. For simplicity, a token may have a single attribute which holds the required information for that token. For identifiers, this attribute is a pointer into the symbol table, and the symbol table holds the actual attributes for that token.

Some attributes:
- <id, attr>, where attr is a pointer into the symbol table
- <assgop, _>: no attribute is needed (if there is only one assignment operator)
- <num, val>, where val is the actual value of the number

A token type together with its attribute uniquely identifies a lexeme.
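The <token, attribute> representation can be sketched like this. The function name install_id and the table layout are assumptions for illustration; the idea that an identifier's attribute is an index into the symbol table comes from the text above.

```python
# Tokens are (type, attribute) pairs; for identifiers the attribute is an
# index into a symbol table that holds the lexeme and its actual attributes.

symtable = []      # the symbol table: one entry per distinct identifier
index_of = {}      # lexeme -> symbol-table index

def install_id(lexeme):
    """Return the symbol-table index for lexeme, inserting it if new."""
    if lexeme not in index_of:
        index_of[lexeme] = len(symtable)
        symtable.append({"lexeme": lexeme})
    return index_of[lexeme]

# <id, attr>: attr points into the symbol table
tok_newval = ("id", install_id("newval"))
# <num, val>: the attribute is the actual value of the number
tok_five = ("num", 5)
# <assgop, _>: no attribute needed if there is only one assignment operator
tok_assign = ("assgop", None)
```

Two occurrences of the same lexeme map to the same symbol-table index, so the token type plus its attribute uniquely identifies the lexeme.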
Regular expressions are widely used to specify patterns. We now consider the following token definitions and try to construct an analyzer that will return <token, attribute> pairs:

  relop → < | <= | = | <> | >= | >
  id    → letter (letter | digit)*
  num   → digit+ ('.' digit+)? (E ('+' | '-')? digit+)?
  delim → blank | tab | newline
  ws    → delim+
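A sketch of such an analyzer using Python's re module (the patterns below translate the definitions above into Python regex syntax, e.g. [A-Za-z] for letter; the function name tokens is an assumption):

```python
import re

# The token definitions above, in Python regex syntax.
# Within relop, the two-character operators come first so that
# "<=" is recognized as one token rather than "<" followed by "=".
patterns = [
    ("ws",    r"[ \t\n]+"),                 # delim+
    ("relop", r"<=|>=|<>|<|=|>"),
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),     # letter (letter | digit)*
    ("num",   r"\d+(\.\d+)?(E[+-]?\d+)?"),  # digit+ ('.' digit+)? (E ('+'|'-')? digit+)?
]

def tokens(text):
    """Yield <token, attribute> pairs; white space is recognized but skipped."""
    pos = 0
    while pos < len(text):
        for name, pat in patterns:
            m = re.match(pat, text[pos:])
            if m:
                if name != "ws":            # ws matches but produces no token
                    yield (name, m.group())
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
```

For example, tokens("count <= 10.5E2") yields the pairs ("id", "count"), ("relop", "<="), ("num", "10.5E2").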