
LEXICAL ANALYSIS

Introduction
Lecture - 4

8/23/2020 Compiled by Mithun Roy, Department of CSE, SIT, Siliguri 1


LEXICAL ANALYSIS
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or
function which performs lexical analysis is called a lexical analyzer or scanner. A lexer often exists as a single
function which is called by a parser or another function.

The Role Of The Lexical Analyzer

1. The lexical analyzer is the first phase of a compiler.
2. Its main task is to read the input characters and produce as output a sequence of tokens that the parser
uses for syntax analysis.
3. Upon receiving a “get next token” command from the parser, the lexical analyzer reads input characters
until it can identify the next token.



Basic Terminologies
What's a lexeme?

A lexeme is a sequence of characters in the source program that matches the pattern of a token.
It is an instance of that token. For example, the characters maximum form a lexeme of the token Identifier.

What's a token?

A token is a category of lexemes, such as Keyword, Identifier, or Operator, that the parser treats as
a single unit of information in the source program.

What's a pattern?

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword
used as a token, the pattern is simply the sequence of characters that forms the keyword.
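
The relationship between the three terms can be sketched in C. This is only an illustration: the names TokenKind, Token, and classify_word are hypothetical, the keyword list is a small sample, and the identifier rule (a letter or underscore followed by letters, digits, or underscores) is an assumption modelled on C. The token is the category, the lexeme is the matched text, and each pattern is the rule tested by a helper function.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Token names (categories). Hypothetical set, for illustration only. */
typedef enum { TOK_KEYWORD, TOK_IDENTIFIER, TOK_UNKNOWN } TokenKind;

/* A token pairs a category with the lexeme that matched its pattern. */
typedef struct {
    TokenKind kind;
    const char *lexeme;
} Token;

/* Pattern for a keyword: the exact character sequence of the keyword. */
static int matches_keyword(const char *s) {
    static const char *keywords[] = { "int", "if", "else", "return" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0) return 1;
    return 0;
}

/* Pattern for an identifier (assumed rule): a letter or underscore,
 * followed by any number of letters, digits, or underscores. */
static int matches_identifier(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_')) return 0;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_')) return 0;
    return 1;
}

/* Classifies one complete lexeme. Keywords are tested first so that
 * a reserved word is never reported as an identifier. */
Token classify_word(const char *lexeme) {
    assert(lexeme != NULL);
    Token t = { TOK_UNKNOWN, lexeme };
    if (matches_keyword(lexeme)) t.kind = TOK_KEYWORD;
    else if (matches_identifier(lexeme)) t.kind = TOK_IDENTIFIER;
    return t;
}
```

With these definitions, classify_word("int") yields a Keyword token and classify_word("maximum") yields an Identifier token, while both keep the original lexeme.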

Issues Of Lexical Analyzer

There are three reasons for separating lexical analysis from syntax analysis:

1. To make the design simpler.
2. To improve the efficiency of the compiler.
3. To enhance compiler portability.



Example of Lexical Analysis, Tokens, Non-Tokens

Consider the following code that is fed to the lexical analyzer:

    #include <stdio.h>
    #define NUMS 8
    int maximum(int x, int y) {
        // This will compare 2 numbers
        if (x > y)
            return x;
        else {
            return y;
        }
    }

Examples of tokens created

Lexeme      Token
int         Keyword
maximum     Identifier
(           Operator
int         Keyword
x           Identifier
,           Operator
int         Keyword
y           Identifier
)           Operator
{           Operator
}           Operator
>           Operator
if          Keyword
return      Keyword
else        Keyword
;           Operator

Examples of non-tokens

Type                       Examples
Comment                    // This will compare 2 numbers
Pre-processor directive    #include <stdio.h>
Pre-processor directive    #define NUMS 8
Macro                      NUMS
Whitespace                 \n \b \t



Input Buffering
The lexical analyzer scans the input from left to right, one character at a time. It uses two pointers,
the begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the input scanned.

Initially both pointers point to the first character of the input string.

The forward pointer moves ahead to search for the end of the lexeme. A blank space indicates the end
of a lexeme. For example, if the input starts with int followed by a blank space, the lexeme “int”
is identified as soon as fp encounters that blank space.

When fp encounters white space, it ignores it and moves ahead. Both the begin pointer (bp) and the
forward pointer (fp) are then set to the start of the next token.
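
The two-pointer scan can be sketched as follows. This is a minimal illustration that assumes lexemes are separated only by white space, as in the description above. The function name next_lexeme and its interface are hypothetical, and a real lexer would also end a lexeme at operators and punctuation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Finds the next white-space-delimited lexeme at or after *bp.
 * On return, *bp points to the first character of the lexeme and the
 * return value is the lexeme's length. Returns 0 at end of input.
 */
size_t next_lexeme(const char **bp) {
    assert(bp != NULL && *bp != NULL);

    /* Skip white space so that bp sits at the start of a lexeme. */
    while (isspace((unsigned char)**bp))
        (*bp)++;

    /* fp starts at the same position as bp and moves ahead. */
    const char *fp = *bp;
    while (*fp != '\0' && !isspace((unsigned char)*fp))
        fp++;

    /* The lexeme is the text in the half-open range [bp, fp). */
    return (size_t)(fp - *bp);
}
```

After consuming a lexeme of length n, the caller advances the begin pointer by n and calls next_lexeme again, which corresponds to setting bp and fp to the next token.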
Specifications of Tokens
Let us understand how language theory defines the following terms:
Alphabets
An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of
symbol occurrences in it, e.g., the length of the string “Mithun Roy” is 10 (counting the space) and is denoted by
|Mithun Roy| = 10.
A string containing no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).

Special Symbols
A typical high-level language contains the following symbols:

Arithmetic Symbols    Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)
Punctuation           Comma(,), Semicolon(;), Dot(.), Arrow(->)
Assignment            =
Special Assignment    +=, /=, *=, -=
Comparison            ==, !=, <, <=, >, >=
Preprocessor          #
Logical               &, &&, |, ||, !
Shift Operator        >>, >>>, <<, <<<
Language
A language is a set of strings over some finite alphabet. Because languages are sets, mathematical set
operations can be performed on them. Every finite language, and more generally every regular language,
can be described by means of a regular expression.
Recognition Of Tokens
Consider the following grammar fragment:

stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num

where the terminals if, then, else, relop, id and num generate sets of strings given by the following
regular definitions:

if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
num → digit+ (. digit+)? (E (+ | -)? digit+)?

For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as
the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved;
that is, they cannot be used as identifiers.
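
The regular definitions for id and num can be matched by hand with a few loops. The sketch below is one possible way to do it, not the only one. The function names match_id, match_num, and is_keyword are hypothetical, and each matcher returns the length of the longest matching prefix so that the caller knows where the lexeme ends.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix of s matching  letter (letter | digit)*,
 * or 0 if s does not start with a letter. */
size_t match_id(const char *s) {
    assert(s != NULL);
    if (!isalpha((unsigned char)s[0])) return 0;
    size_t i = 1;
    while (isalnum((unsigned char)s[i])) i++;
    return i;
}

/* Keywords are reserved: a lexeme spelled if, then, or else is a keyword. */
int is_keyword(const char *s, size_t len) {
    static const char *kw[] = { "if", "then", "else" };
    assert(s != NULL);
    for (size_t k = 0; k < sizeof kw / sizeof kw[0]; k++)
        if (strlen(kw[k]) == len && strncmp(s, kw[k], len) == 0) return 1;
    return 0;
}

/* Skips digits starting at s[i]; returns the index after the last digit. */
static size_t skip_digits(const char *s, size_t i) {
    while (isdigit((unsigned char)s[i])) i++;
    return i;
}

/* Length of the longest prefix of s matching
 *   digit+ (. digit+)? (E (+ | -)? digit+)?
 * or 0 if s does not start with a digit. */
size_t match_num(const char *s) {
    assert(s != NULL);
    size_t i = skip_digits(s, 0);
    if (i == 0) return 0;                  /* digit+ is mandatory */

    if (s[i] == '.') {                     /* optional fraction */
        size_t j = skip_digits(s, i + 1);
        if (j > i + 1) i = j;              /* accept only if a digit follows '.' */
    }
    if (s[i] == 'E') {                     /* optional exponent */
        size_t k = i + 1;
        if (s[k] == '+' || s[k] == '-') k++;
        size_t j = skip_digits(s, k);
        if (j > k) i = j;                  /* accept only if a digit follows */
    }
    return i;
}
```

A lexer built on these helpers would call match_id first and then is_keyword on the matched lexeme, which is how the reserved-keyword assumption is enforced.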
Transition diagrams

A transition diagram is a diagrammatic representation of the actions that take place when the lexical
analyzer is called by the parser to get the next token. It is used to keep track of information about the
characters that are seen as the forward pointer scans the input.
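
A transition diagram is commonly simulated in code with a state variable and a switch statement. The sketch below does this for the relop definition < | <= | = | <> | > | >=. The state numbering, the Relop enumeration, and the function name scan_relop are assumptions made for this illustration, and the diagram is a simplified one with three states.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    RELOP_NONE, RELOP_LT, RELOP_LE, RELOP_EQ, RELOP_NE, RELOP_GT, RELOP_GE
} Relop;

/*
 * Simulates a transition diagram for relop on the input s.
 * Stores the lexeme length (0, 1, or 2) in *len and returns the
 * operator recognized, or RELOP_NONE if s does not start with one.
 */
Relop scan_relop(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    int state = 0;      /* current state of the diagram */
    size_t fp = 0;      /* forward pointer, as an index into s */

    for (;;) {
        char c = s[fp];
        switch (state) {
        case 0:         /* start state */
            if (c == '<')      { state = 1; fp++; }
            else if (c == '>') { state = 2; fp++; }
            else if (c == '=') { *len = 1; return RELOP_EQ; }
            else               { *len = 0; return RELOP_NONE; }
            break;
        case 1:         /* seen '<' */
            if (c == '=') { *len = 2; return RELOP_LE; }
            if (c == '>') { *len = 2; return RELOP_NE; }
            *len = 1;   /* retract: c is not part of the lexeme */
            return RELOP_LT;
        case 2:         /* seen '>' */
            if (c == '=') { *len = 2; return RELOP_GE; }
            *len = 1;   /* retract: c is not part of the lexeme */
            return RELOP_GT;
        }
    }
}
```

The two retract cases correspond to accepting states in which the forward pointer has read one character too many and must be moved back before the next token is scanned.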



Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error.
Important facts about lexical errors:

• Lexical errors are not very common, but they should be handled by the scanner.
• Misspellings of identifiers, operators, and keywords are considered lexical errors.
• Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a
token.

Error Recovery in Lexical Analyzer
Here are a few of the most common error recovery techniques:

• Remove one character from the remaining input.
• In panic mode, successive characters are ignored until a well-formed token is reached.
• Insert a missing character into the remaining input.
• Replace a character with another character.
• Transpose two adjacent characters.
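
Panic-mode recovery can be sketched as a loop that skips characters until scanning can resume. The rule used below for which characters may begin a token is an assumption made for this example, and the names can_start_token and panic_skip are hypothetical. A real lexer would derive the rule from its own token patterns.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed rule: a token may start with a letter, a digit, or one of
 * the symbols listed here. */
static int can_start_token(char c) {
    const char *symbols = "+-*/%=<>!&|(){};,.#";
    if (isalnum((unsigned char)c)) return 1;
    for (const char *p = symbols; *p != '\0'; p++)
        if (*p == c) return 1;
    return 0;
}

/*
 * Panic-mode recovery: starting at s, skips successive characters
 * until white space, a character that can begin a token, or the end
 * of input is reached. Returns the number of characters skipped.
 */
size_t panic_skip(const char *s) {
    assert(s != NULL);
    size_t skipped = 0;
    while (s[skipped] != '\0' &&
           !isspace((unsigned char)s[skipped]) &&
           !can_start_token(s[skipped]))
        skipped++;
    return skipped;
}
```

The caller would report the skipped characters as a lexical error and then continue scanning from the position just past them.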



Advantages of Lexical analysis
• Lexical analysis is used by programs such as compilers, which use the parsed data from a
programmer's code to create compiled binary executable code.

• It is used by web browsers to format and display a web page with the help of parsed data from
JavaScript, HTML, and CSS.

Disadvantages of Lexical analysis

• Significant time is spent reading the source program and partitioning it into tokens.

• Some regular expressions are quite difficult to understand.

• More effort is needed to develop and debug the lexer and its token descriptions.

• Additional runtime overhead is required to generate the lexer tables and construct the tokens.



Homework – III

Find all the tokens and non-tokens in the following C code.

    # include<stdio.h>
    # define N 10
    int main(){
        int sum=0,i;
        // this is the code where we add N natural numbers.
        for (i=1;i<=N;i++){
            sum=sum + i;
        }
        return(0);
    }
