
Topic

Lexical Analysis – Role of Lexical Analyzer
Overview of Lexical Analysis
 Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences.
 The lexical analyzer breaks this input into a series of tokens, removing any whitespace or comments in the source code.
 If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer.
 It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when it demands.
 Programs that perform lexical analysis are called lexical analyzers or lexers. A lexer contains a tokenizer or scanner.
Role of Lexical Analyzer
 As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce as output a
sequence of tokens for each lexeme in the source program.
 The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well.
 The interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser (a sketch of this loop appears after this list). Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes.
 Sometimes, lexical analyzers are divided into a cascade of two processes:
a. Scanning consists of the simple processes that do not require tokenization of the input, such
as deletion of comments and compaction of consecutive whitespace characters into one.
b. Lexical analysis proper is the more complex portion, where the scanner produces the
sequence of tokens as output.
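Returning to the getNextToken interaction above, here is a minimal C sketch of the pull-style loop (the Token type, the token-name constants, and the function names are illustrative, not a fixed API):

    /* Sketch: the parser repeatedly demands tokens from the lexical analyzer. */
    typedef enum { ID, NUMBER, ASSIGN_OP, MULT_OP, EXP_OP, END_OF_INPUT } TokenName;

    typedef struct {
        TokenName name;       /* abstract symbol for the kind of lexical unit */
        int       attribute;  /* e.g., a symbol-table index or a numeric value */
    } Token;

    Token getNextToken(void); /* implemented by the lexical analyzer */

    void parse(void) {
        Token tok = getNextToken();      /* one call -> one lexeme -> one token */
        while (tok.name != END_OF_INPUT) {
            /* ... grammar-directed processing of tok ... */
            tok = getNextToken();        /* demand the next token */
        }
    }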
Lexical Analysis versus Parsing
 There are a number of reasons why the analysis portion of a compiler is normally separated
into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration.
2. Compiler efficiency is improved.
3. Compiler portability is enhanced.
Tokens, Patterns, and Lexemes
Token
A token is a pair consisting of a token name and an optional attribute value. The token name is an
abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input
characters denoting an identifier.
Pattern
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as
a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some
other tokens, the pattern is a more complex structure that is matched by many strings.
Lexeme
A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Attributes for Tokens
 When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
 For example, the pattern for token number matches both 0 and 1, but it is extremely
important for the code generator to know which lexeme was found in the source
program.
Example: The token names and associated attribute values for the Fortran statement
E = M * C ** 2
are written below as a sequence of pairs.
<id, pointer to symbol-table entry for E>
<assign-op>
<id, pointer to symbol-table entry for M>
<mult-op>
<id, pointer to symbol-table entry for C>
<exp-op>
<number, integer value 2>
Lexical Errors
 It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. Consider the input
fi(a == f(x)) . . .
 A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
 Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser. However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
 The simplest recovery strategy is "panic mode" recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
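A minimal C sketch of panic-mode deletion (matchesSomeTokenPrefix is a hypothetical helper standing in for the real pattern matcher):

    /* Panic-mode recovery sketch: discard characters until some token */
    /* pattern matches a prefix of the remaining input.                */
    int matchesSomeTokenPrefix(const char *s);  /* hypothetical helper */

    const char *panicModeRecover(const char *input) {
        while (*input != '\0' && !matchesSomeTokenPrefix(input))
            input++;                             /* delete one character */
        return input;                            /* resume scanning here */
    }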
Topic

Input Buffering
Sentinels
 We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
 The sentinel is a special character that cannot be part of the source program, and a
natural choice is the character eof.
 The sentinel arrangement is as shown below:
(figure omitted: two buffer halves, each terminated by an eof sentinel, with a final eof marking the end of the entire input)
 Note that eof retains its use as a marker for the end of the entire input. Any eof that
appears other than at the end of a buffer means that the input is at an end.
Code to advance forward pointer:

Procedure LookAheadwithSentinel
begin
    forward := forward + 1;
    if forward↑ = eof then
    begin
        if forward at end of first half then
        begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then
        begin
            reload first half;
            move forward to beginning of first half
        end
        else
            terminate lexical analysis  { eof within a buffer: end of input }
    end
end Procedure LookAheadwithSentinel
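A minimal C sketch of the same scheme, assuming a contiguous two-half buffer and '\0' as the sentinel byte (the buffer size, helper names, and the choice of sentinel are all assumptions):

    /* Sentinel-based buffering sketch. Before scanning, open src and call  */
    /* reload(buf) once; '\0' is assumed never to occur in the source text. */
    #include <stdio.h>

    #define HALF 4096
    #define SENTINEL '\0'

    static char buf[2 * HALF + 2];  /* two halves, each followed by a sentinel */
    static char *forward = buf;     /* the forward scanning pointer */
    static FILE *src;               /* the source file, assumed already open */

    /* Refill one half with input; plant the sentinel after the valid bytes. */
    static void reload(char *half) {
        size_t n = fread(half, 1, HALF, src);
        half[n] = SENTINEL;
    }

    /* Return the next character, or EOF at the real end of input. */
    static int next_char(void) {
        char c = *forward++;
        if (c != SENTINEL)
            return c;                         /* the common case: a single test */
        if (forward == buf + HALF + 1) {      /* sentinel ending the first half */
            reload(buf + HALF + 1);           /* forward already points there */
            return next_char();
        }
        if (forward == buf + 2 * HALF + 2) {  /* sentinel ending the second half */
            reload(buf);
            forward = buf;
            return next_char();
        }
        return EOF;                           /* eof inside a half: input is done */
    }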
Advantages
 Most of the time, it performs only one test, to see whether the forward pointer points to an eof.
 Only when the forward pointer reaches the end of a buffer half, or the real end of input, does it perform more tests.
 The average number of tests per input character is very close to 1.
Topic
Specification of Tokens
Introduction
 To specify tokens, regular expressions are used. When a pattern is matched by some regular expression, the token can be recognized.
 Regular expressions are used to specify the patterns. Each pattern matches a set of strings.
 There are three specifications of tokens: 1) Strings 2) Language 3) Regular expression
Strings and Languages
 An alphabet or character class is a finite set of symbols. Typical symbols are letters, digits, and punctuation characters.
 A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
 A language is any countable set of strings over some fixed alphabet.
 The length of a string S, usually written as |S|, is the number of occurrences of symbols in S. For example, banana is a string of length six.
 The empty string, denoted ε, is the string of length zero.
Operations on strings
The following string-related terms are commonly used:
A prefix of string S - Any string obtained by removing zero or more symbols from the end
of string S. For example, ban is a prefix of banana.
A suffix of string S - Any string obtained by removing zero or more symbols from the
beginning of S. For example, nana is a suffix of banana.
A substring of S - Obtained by deleting any prefix and any suffix from S. For example, nan is a substring of banana.
The proper prefixes, suffixes, and substrings of a string S are those prefixes, suffixes, and substrings, respectively, of S that are not ε and not equal to S itself.
A subsequence of S is any string formed by deleting zero or more not necessarily
consecutive positions of S. For example, baan is a subsequence of banana.
Operations On Languages
The following are the operations that can be applied to languages:
Union of L and M: L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M: LM = { st | s is in L and t is in M }
Kleene closure of L: L* = L^0 ∪ L^1 ∪ L^2 ∪ …, i.e., zero or more concatenations of L (with L^0 = {ε})
Positive closure of L: L+ = L^1 ∪ L^2 ∪ L^3 ∪ …, i.e., one or more concatenations of L
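A short worked example, taking L = {a, b} and M = {0, 1}:
L ∪ M = {a, b, 0, 1}
LM = {a0, a1, b0, b1}
L* = {ε, a, b, aa, ab, ba, bb, aaa, …}, i.e., all strings of a's and b's, including ε
L+ = {a, b, aa, ab, ba, bb, aaa, …}, i.e., all nonempty strings of a's and b's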
Regular Expressions
 Regular expressions have the capability to express finite languages by defining a pattern for finite strings of symbols.
 The grammar defined by regular expressions is known as a regular grammar, and the language defined by a regular grammar is known as a regular language.
Rules that define the regular expressions over some alphabet Σ:
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.
2. If 'a' is a symbol in Σ, then 'a' is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with 'a' in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
i) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
ii) (r).(s) is a regular expression denoting the language L(r).L(s).
iii) (r)* is a regular expression denoting (L(r))*.
Example: letter ( letter | digit ) *
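For example, with Σ = {a, b}:
a|b denotes the language {a, b}
(a|b)(a|b) denotes {aa, ab, ba, bb}
a* denotes {ε, a, aa, aaa, …}
(a|b)* denotes the set of all strings of a's and b's, including ε
Thus the pattern letter ( letter | digit )* above denotes every string that begins with a letter and continues with any mixture of letters and digits.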
Regular set
A language that can be defined by a regular expression is called a regular set. If two regular
expressions r and s denote the same regular set, we say they are equivalent and write r = s.
Algebraic Properties Of Regular Expression
There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms.
i) | is commutative: r|s = s|r
ii) | is associative: r|(s|t)=(r|s)|t
iii) Concatenation is associative: (rs)t=r(st)
iv) Concatenation is distributive: r(s|t)=rs|rt
v) ɛ is the identity element for concatenation: ɛ.r = r.ɛ = r
vi) Closure is idempotent: (r*)* = r*
Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
d3 → r3
………
dn → rn
where each di is a distinct name, and each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di−1}.
Example: Identifiers are strings of letters and digits beginning with a letter. A regular definition for this set is:
letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9
id → letter ( letter | digit ) *

Notations of Regular Expressions

Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
1. One or more instances (+): The unary postfix operator + means "one or more instances of" a regular expression. (r)+ is a regular expression that denotes (L(r))+.
2. Zero or more instances (*): The operator * means "zero or more instances of" a regular expression. (r)* is a regular expression that denotes (L(r))*.
3. Zero or one instance (?): The unary postfix operator ? means "zero or one instance of". (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
4. Character classes: A character class such as [a-z] denotes the regular expression a | b | c | … | z.
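Using these shorthands, the regular definition for unsigned numbers (a standard textbook example) becomes compact:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?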
Non-regular Set
 A language which cannot be described by any regular expression is a non-regular set.
 Example: The set of all strings of balanced parentheses and repeating strings cannot be described
by a regular expression.
 This set can be specified by a context-free grammar.
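For example, the context-free grammar S → ( S ) S | ε generates exactly the set of strings of balanced parentheses, which no regular expression can describe.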
Topic
Recognition of Tokens
Recognition of Tokens
Consider the following grammar fragment
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
The components G = {V, T, P, S} for the above grammar are:
Variables = {stmt, expr, term}
Terminals = {if, then, else, relop, id, num}
Start symbol = stmt
The terminals generate the following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ (. digit+)? (E (+|-)? digit+)?
For this language fragment, the lexical analyzer will recognize the keywords if, then, and else, as well as the lexemes denoted by relop, id, and num.
Transition diagrams
It is a diagrammatic representation to depict the action that will take place when a lexical
analyzer is called by the parser to get the next token. It is used to keep track of information
about the characters that are seen as the forward pointer scans the input.
Example: Transition diagram for identifiers (figure omitted)
Example: Transition diagram for relop (figure omitted)
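As a minimal C sketch of how such a diagram is simulated (nextChar, retract, and fail are hypothetical helpers; the state structure follows the usual textbook relop diagram):

    /* Simulating the relop transition diagram: each if/else arm plays the   */
    /* role of a state, and retract() gives back the one-character lookahead. */
    typedef enum { LT, LE, EQ, NE, GT, GE } Relop;

    int  nextChar(void);   /* hypothetical: next input character */
    void retract(void);    /* hypothetical: push one character back */
    void fail(void);       /* hypothetical: no relop pattern matched */

    Relop getRelop(void) {
        int c = nextChar();              /* start state */
        if (c == '=') return EQ;
        if (c == '<') {                  /* state after seeing '<' */
            c = nextChar();
            if (c == '=') return LE;     /* <= */
            if (c == '>') return NE;     /* <> */
            retract();                   /* the extra character belongs elsewhere */
            return LT;                   /* <  */
        }
        if (c == '>') {                  /* state after seeing '>' */
            c = nextChar();
            if (c == '=') return GE;     /* >= */
            retract();
            return GT;                   /* >  */
        }
        fail();                          /* input does not begin with a relop */
        return LT;                       /* not reached if fail() does not return */
    }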
Topic
Lex
The Lexical-Analyzer Generator Lex
A tool called Lex, or in a more recent implementation Flex, allows one to specify a lexical analyzer by writing regular expressions to describe patterns for tokens. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler.
The Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
Creating a lexical analyzer with Lex
An input file, which we call lex.l, is written in the Lex language and describes the lexical
analyzer to be generated.
 The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c.
The latter file is compiled by the C compiler into a file called a.out, as always.
The C compiler output is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens.
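Concretely, assuming the classic tools are installed (the library flag is -ll for lex and -lfl for flex, and may vary by system), the pipeline is:

    lex lex.l          (produces lex.yy.c)
    cc lex.yy.c -ll    (produces a.out)
    ./a.out < source   (runs the generated lexical analyzer on source)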
Structure of Lex Programs
A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
 The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier declared to represent a constant, e.g., #define PI 3.14), and regular definitions.
 The translation rules of a Lex program are statements of the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
……
where each p is a regular expression and each action is a program fragment describing what
action the lexical analyzer should take when a pattern p matches a lexeme. In Lex the actions are
written in C.
 The third section holds whatever auxiliary procedures are needed by the actions. Alternatively
these procedures can be compiled separately and loaded with the lexical analyzer.
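A minimal sketch of a complete Lex program in this three-part form (the token codes, patterns, and actions are illustrative, not taken from the text):

    %{
    /* Declarations: manifest constants used as token codes (illustrative). */
    #define NUMBER 300
    #define ID     301
    %}
    digit   [0-9]
    letter  [A-Za-z]
    %%
    [ \t\n]+                      { /* skip whitespace and newlines */ }
    {digit}+                      { return NUMBER; }
    {letter}({letter}|{digit})*   { return ID; }
    %%
    int yywrap(void) { return 1; }  /* auxiliary function: stop at end of input */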
Design of a Lexical-Analyzer Generator
The following figure gives an overview of the architecture of a lexical analyzer generated by Lex.
(figure omitted: an automaton simulator reads the input buffer and is driven by the components listed below)
These components are:
1. A transition table for the automaton.
2. Those functions that are passed directly through Lex to the output.
3. The actions from the input program, which appear as fragments of code to be invoked at the appropriate time by the automaton simulator.
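A minimal sketch of the table-driven idea these components support (the table size and contents are illustrative; a real generator fills them in from the token patterns):

    /* A transition table drives a generic automaton simulator.           */
    #define NSTATES 4
    #define NCHARS  128

    static int move[NSTATES][NCHARS];  /* move[s][c]: next state, or -1   */
    static int accepting[NSTATES];     /* nonzero if the state accepts    */

    /* Run the automaton, returning the length of the longest accepted    */
    /* prefix of input, or 0 if no prefix is accepted.                    */
    int simulate(const char *input) {
        int state = 0, lastAccept = 0;
        for (int i = 0; input[i] != '\0'; i++) {
            state = move[state][(unsigned char)input[i]];
            if (state < 0) break;          /* no transition: stop scanning */
            if (accepting[state])
                lastAccept = i + 1;        /* remember the longest match   */
        }
        return lastAccept;
    }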
