L4 - Lexical Analysis (Introduction)

LEXICAL ANALYSIS
Introduction
Lecture - 4
8/23/2020 Compiled by Mithun Roy, Department of CSE, SIT, Siliguri 1

LEXICAL ANALYSIS
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or
function which performs lexical analysis is called a lexical analyzer or scanner. A lexer often exists as a single
function which is called by a parser or another function.
The Role Of The Lexical Analyzer
1. The lexical analyzer is the first phase of a compiler.

2. Its main task is to read the input characters and produce as output a sequence of tokens that the parser
uses for syntax analysis.
3. Upon receiving a “get next token” command from the parser, the lexical analyzer reads input characters
until it can identify the next token.
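The "get next token" interaction described above can be sketched in C. This is only a minimal sketch under stated assumptions, not a complete scanner: the names Token, TokenKind, Lexer and next_token are hypothetical, and the sketch distinguishes only identifiers, unsigned integers and single-character symbols.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token categories, chosen for this sketch only. */
typedef enum { TOK_ID, TOK_NUM, TOK_SYMBOL, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;      /* token name */
    const char *start;   /* first character of the lexeme */
    size_t length;       /* number of characters in the lexeme */
} Token;

typedef struct {
    const char *pos;     /* next unread input character */
} Lexer;

/* Called by the parser each time it needs one more token. */
Token next_token(Lexer *lx)
{
    assert(lx != NULL && lx->pos != NULL);

    /* Whitespace separates tokens but is not a token itself. */
    while (isspace((unsigned char)*lx->pos))
        lx->pos++;

    Token t = { TOK_EOF, lx->pos, 0 };
    if (*lx->pos == '\0')
        return t;

    if (isalpha((unsigned char)*lx->pos) || *lx->pos == '_') {
        t.kind = TOK_ID;
        while (isalnum((unsigned char)*lx->pos) || *lx->pos == '_')
            lx->pos++;
    } else if (isdigit((unsigned char)*lx->pos)) {
        t.kind = TOK_NUM;
        while (isdigit((unsigned char)*lx->pos))
            lx->pos++;
    } else {
        t.kind = TOK_SYMBOL;   /* any other single character */
        lx->pos++;
    }
    t.length = (size_t)(lx->pos - t.start);
    return t;
}
```

A parser would typically initialise a Lexer with the start of the source text and call next_token in a loop until it receives TOK_EOF.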

Basic Terminologies
What's a lexeme?
A lexeme is a sequence of characters in the source program that matches the pattern of a token.
It is an instance of that token.
What's a token?
A token is a pair consisting of a token name and an optional attribute value. The token name
represents a kind of lexical unit in the source program, such as a keyword, an identifier or an operator.
What is a pattern?
A pattern is a description of the form that the lexemes of a token may take. For a keyword used as a token,
the pattern is simply the sequence of characters that forms the keyword.
Issues Of Lexical Analyzer
There are three reasons for separating lexical analysis from parsing:
1. To make the design simpler.
2. To improve the efficiency of the compiler.
3. To enhance compiler portability.

Example of Lexical Analysis, Tokens, Non-Tokens
Consider the following code that is fed to the lexical analyzer:

    #include <stdio.h>
    #define NUMS 8
    int maximum(int x, int y) {
        // This will compare 2 numbers
        if (x > y)
            return x;
        else {
            return y;
        }
    }

Examples of tokens created

    Lexeme     Token
    int        Keyword
    maximum    Identifier
    (          Operator
    int        Keyword
    x          Identifier
    ,          Operator
    int        Keyword
    y          Identifier
    )          Operator
    {          Operator
    }          Operator
    >          Operator
    if         Keyword
    return     Keyword
    else       Keyword
    ;          Operator

Examples of nontokens

    Type                       Examples
    Comment                    // This will compare 2 numbers
    Pre-processor directive    #include <stdio.h>
    Pre-processor directive    #define NUMS 8
    Macro                      NUMS
    Whitespace                 \n \b \t

Input Buffering
The lexical analyzer scans the input from left to right, one character at a time. It uses two pointers,
the begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the input scanned.
Initially both pointers point to the first character of the input string.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. In an input that begins with the keyword int, for example,
the lexeme “int” is identified as soon as fp encounters the blank space that follows it.
When fp encounters white space, it ignores it and moves ahead. Then both the begin pointer (bp) and
the forward pointer (fp) are set to the start of the next token.
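The two-pointer scan can be sketched in C as follows. This is a simplified illustration that follows the description above, in which only white space ends a lexeme; a real scanner would also stop at operators and punctuation. The function name scan_lexeme is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: finds the next whitespace-delimited lexeme.
 * On entry *bp points anywhere in the unread input. On return *bp marks
 * the first character of the lexeme and *fp the character just past it.
 * Returns the lexeme length, or 0 when the input is exhausted. */
size_t scan_lexeme(const char **bp, const char **fp)
{
    assert(bp != NULL && fp != NULL && *bp != NULL);

    /* Skip white space so that bp lands on the start of a lexeme. */
    while (isspace((unsigned char)**bp))
        (*bp)++;

    /* Both pointers start at the same character. */
    *fp = *bp;

    /* Only fp advances while it searches for the end of the lexeme. */
    while (**fp != '\0' && !isspace((unsigned char)**fp))
        (*fp)++;

    return (size_t)(*fp - *bp);
}
```

To scan the following lexeme, the caller sets bp to fp and calls scan_lexeme again; a return value of 0 means no input is left.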
Specifications of Tokens
Let us see how language theory defines the following terms:
Alphabets
An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal
alphabet, and {a-z, A-Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of symbol occurrences in it,
e.g., the length of the string “Mithun Roy” is 10 (counting the space) and is denoted by |Mithun Roy| = 10.
A string having no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).
Special Symbols
A typical high-level language contains the following symbols:

    Arithmetic symbols    Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)
    Punctuation           Comma(,), Semicolon(;), Dot(.), Arrow(->)
    Assignment            =
    Special assignment    +=, /=, *=, -=
    Comparison            ==, !=, <, <=, >, >=
    Preprocessor          #
    Logical               &, &&, |, ||, !
    Shift operator        >>, >>>, <<, <<<
Language
A language is a set of strings over some finite alphabet; the set may be finite or infinite. Because languages are
sets, set operations such as union can be performed on them, along with concatenation and closure. Regular languages,
which include all finite languages, can be described by means of regular expressions.
Recognition Of Tokens
Consider the following grammar fragment:
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the following
regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
num → digit+ (. digit+)? (E (+|-)? digit+)?
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as
the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved;
that is, they cannot be used as identifiers.
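As an illustration, the regular definitions for id and num and the reserved-keyword rule can be hand-coded in C roughly as below. This is a sketch under assumptions: the function names match_id, match_num and is_keyword are hypothetical, letter is taken to mean an ASCII letter and digit a decimal digit, and each matcher returns the length of the longest matching prefix.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix of s matching  id -> letter (letter | digit)*,
 * or 0 if s does not start with a letter. */
size_t match_id(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of the longest prefix of s matching
 *   num -> digit+ (. digit+)? (E (+|-)? digit+)?
 * or 0 if s does not start with a digit. */
size_t match_num(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    if (n == 0)
        return 0;

    /* Optional fraction: accepted only if a digit follows the dot. */
    if (s[n] == '.' && isdigit((unsigned char)s[n + 1])) {
        n++;
        while (isdigit((unsigned char)s[n]))
            n++;
    }

    /* Optional exponent: accepted only if digits follow E and the sign. */
    if (s[n] == 'E') {
        size_t m = n + 1;
        if (s[m] == '+' || s[m] == '-')
            m++;
        if (isdigit((unsigned char)s[m])) {
            while (isdigit((unsigned char)s[m]))
                m++;
            n = m;
        }
    }
    return n;
}

/* Keywords are reserved, so a lexeme that matches id is checked
 * against the keyword list before it is reported as an identifier. */
int is_keyword(const char *lexeme, size_t len)
{
    static const char *const keywords[] = { "if", "then", "else" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i]) == len && strncmp(lexeme, keywords[i], len) == 0)
            return 1;
    }
    return 0;
}
```

Matching the id pattern first and then consulting the keyword list is one common way to treat keywords as reserved; separate patterns per keyword would be an alternative design.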
Transition diagrams
A transition diagram is a diagrammatic representation of the actions that take place when the lexical analyzer
is called by the parser to get the next token. It is used to keep track of information about the
characters that are seen as the forward pointer scans the input.
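A transition diagram translates naturally into a small state machine. The sketch below encodes one plausible diagram for the relop definition (< | <= | = | <> | > | >=) in C; the enum values and the function name match_relop are hypothetical and are not taken from the lecture's figure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute values for the relop token. */
typedef enum { REL_NONE, REL_LT, REL_LE, REL_EQ, REL_NE, REL_GT, REL_GE } Relop;

/* Walks the relop transition diagram starting at s.
 * Stores the number of characters consumed in *len (0 if no relop). */
Relop match_relop(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    enum { START, SEEN_LT, SEEN_GT } state = START;
    size_t i = 0;

    for (;;) {
        char c = s[i];
        switch (state) {
        case START:
            if (c == '<')      { state = SEEN_LT; i++; }
            else if (c == '>') { state = SEEN_GT; i++; }
            else if (c == '=') { *len = 1; return REL_EQ; }
            else               { *len = 0; return REL_NONE; }
            break;
        case SEEN_LT:
            if (c == '=') { *len = 2; return REL_LE; }
            if (c == '>') { *len = 2; return REL_NE; }
            *len = 1;          /* retract: c belongs to the next token */
            return REL_LT;
        case SEEN_GT:
            if (c == '=') { *len = 2; return REL_GE; }
            *len = 1;          /* retract: c belongs to the next token */
            return REL_GT;
        }
    }
}
```

Each case corresponds to a state of the diagram and each test on c to an outgoing edge; the "retract" comments mark the accepting states where the lookahead character is not consumed.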

Lexical Errors
A character sequence which is not possible to scan into any valid token is a lexical error.
Important facts about the lexical error:
• Lexical errors are not very common, but they should be handled by the scanner.
• Misspellings of identifiers, operators and keywords are considered lexical errors.
• Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a
token.
Error Recovery in Lexical Analyzer
Here are a few of the most common error recovery techniques:
• Delete one character from the remaining input.
• In panic mode, successive characters are deleted from the remaining input until the lexical analyzer can
find a well-formed token.
• Insert a missing character into the remaining input.
• Replace a character with another character.
• Transpose two adjacent characters.
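Panic-mode recovery, mentioned in the list above, can be sketched in C as follows. The sketch rests on an assumption made purely for illustration: a token may begin only with a letter, a digit, or one of the characters <, = and >. The names can_start_token and panic_skip are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: a token can start with a letter, a digit,
 * or one of the relational operator characters <, =, >. */
static int can_start_token(char c)
{
    return isalnum((unsigned char)c) || (c != '\0' && strchr("<=>", c) != NULL);
}

/* Panic-mode recovery: delete successive characters from the remaining
 * input until white space, a character that can begin a well-formed
 * token, or the end of the input is reached.
 * Returns the number of characters skipped. */
size_t panic_skip(const char **input)
{
    assert(input != NULL && *input != NULL);
    size_t skipped = 0;
    while (**input != '\0' && !isspace((unsigned char)**input)
           && !can_start_token(**input)) {
        (*input)++;
        skipped++;
    }
    return skipped;
}
```

A scanner would normally report the error once, call a routine like this, and then resume ordinary token recognition at the new input position.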

Advantages of Lexical analysis
• Lexical analysis is used by programs such as compilers, which use the parsed data from a
programmer's code to create compiled binary executable code.
• It is used by web browsers to format and display a web page with the help of parsed data from
JavaScript, HTML and CSS.
Disadvantages of Lexical analysis
• Significant time is spent reading the source program and partitioning it into
tokens.
• Some regular expressions are quite difficult to understand.
• More effort is needed to develop and debug the lexer and its token descriptions.
• Additional runtime overhead is required to generate the lexer tables and construct the tokens.

Homework – III
Find out all the TOKENs and NONTOKENs from the following “C” programming code.
# include<stdio.h>
# define N 10
int main(){
int sum=0,i;
// this is the code where we add N natural numbers.
for (i=1;i<=N;i++){
sum=sum + i;
}
return(0);
}
