
Chapter 2

Lexical Analysis

Lexical analysis, or scanning, is the process of reading the stream of characters that makes up
the source program from left to right and grouping them into tokens. The lexical analyzer takes
a source program as input and produces a stream of tokens as output. The particular character
sequences that it recognizes as instances of tokens are called lexemes. Each token is then passed
to the next phase of the compiler, i.e. syntax analysis. It is common for a lexical analyzer to interact
with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an
identifier, it needs to enter that lexeme into the symbol table. In some cases, the lexical analyzer
may read information about the kind of identifier from the symbol table to help it
determine the appropriate token to pass to the parser. Figure 2.1 shows the
role of a lexical analyzer.

Figure 2.1: Role of Lexical Analyzer

2.1 Constituents of Lexical Analysis

Token: A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or
a sequence of input characters denoting an identifier. The token names are the input symbols
that the parser processes.
Some examples of tokens in C are: Keywords (e.g. while), Identifiers (e.g. rate, total),
Constants (e.g. 10, 2.5), Strings (e.g. “total”, “hello”), Special symbols (e.g. ( ), { }),
Operators (e.g. +, /, -, *).
Pattern: A pattern is a description of the form that the lexemes of a token may take. For a
keyword token, the pattern is just the sequence of characters that forms the keyword.
For identifiers and some other tokens, the pattern is a more complex structure that is matched
by many strings.
Lexemes: A lexeme is a sequence of characters in the source program that matches the pattern
for a token and is identified by the lexical analyzer as an instance of that token. Table 2.1
shows examples of tokens, patterns and lexemes used in the C language.

Table 2.1: Example of Token, Pattern, Lexemes

Token      Lexeme              Pattern
ID         x  y  n0            letter followed by letters and digits
NUM        -123  1.456e-5      any numeric constant
IF         if                  if
LITERAL    “Hello”             any string of characters

For example, in the C statement printf(“Final = %d”, Number); both printf
and Number are lexemes matching the pattern for token ID, and “Final = %d” is a lexeme
matching LITERAL. “(” and “)” match the tokens LPAREN and RPAREN respectively.
When more than one lexeme matches a pattern, the lexical analyzer must provide the subsequent
compiler phases with additional information about the particular lexeme that was matched.
It therefore returns not only a token name, but also an attribute value that describes the
lexeme represented by the token. The token name influences parsing decisions, while the attribute
value influences translation of tokens after the parse.
For the C statement printf(“Final = %d”, Number); the tokens returned would be:


Here, more than one identifier is discovered, so to differentiate them, a numeric value is assigned to

2.2 Input Buffering

There are three general approaches to the implementation of a lexical analyzer:

1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer
from a regular-expression-based specification. In this case, the generator provides routines for
reading and buffering the input.

2. Write the lexical analyzer in a conventional systems-programming language, using the I/O
facilities of that language to read the input.

3. Write the lexical analyzer in assembly language and explicitly manage the reading of
input.
Because of the amount of time taken to process the large number of characters read
during the compilation of a large source program, specialized buffering techniques
have been developed to reduce the amount of overhead required to process a single input
character. Two important buffering techniques are described below:

2.2.1 Buffer Pairs

In this technique two pointers to the input are maintained. The first pointer, “Lexeme Begin”,
marks the beginning of the current lexeme, whose extent we are attempting to determine,
while the second pointer, “Forward”, scans ahead until a pattern match is found.
Once the next lexeme is determined, Forward is set to the character at its right end. Then,
after the lexeme is recorded as an attribute value of a token returned to the parser, Lexeme
Begin is set to the character immediately after the lexeme just found.

2.2.2 Sentinels
If we use buffer pairs, then each time we advance forward we must make sure that we
have not moved off one of the buffers; if we have, then we must also reload the other buffer.
Thus, for each character read, we make two tests: one for the end of the buffer, and one to
determine what character is read. We can combine the buffer-end test with the test for the
current character if we extend each buffer to hold a sentinel character at the end. The sentinel
is a special character that cannot be part of the source program, and a natural choice is the
character EOF.
Note that EOF retains its use as a marker for the end of the entire input. Any EOF that
appears other than at the end of a buffer means that the input is at an end.

2.3 Token Specification

The patterns corresponding to a token are generally specified using a compact notation known
as regular expressions. Regular expressions of a language are created by combining members of
its alphabet. A regular expression r corresponds to a set of strings L(r), where L(r) is called
a regular set or a regular language and may be infinite. A regular expression is defined as follows:

• A basic regular expression a denotes the set {a}, where a ∈ Σ; L(a) = {a}

• The regular expression ε denotes the set {ε}

Technically, the regular expression ε is different from the string ε. Here ε represents the empty (null) string.

If r and s are two regular expressions denoting the sets L(r) and L(s), then the following
rules apply:
R1. r|s is a regular expression denoting the union set: L(r) ∪ L(s)

R2. rs is a regular expression denoting the concatenation set: L(r)L(s)

R3. r∗ is a regular expression denoting the Kleene closure set: L(r)∗

R4. (r) is a regular expression denoting the set L(r)

Following are some examples of regular expressions:

• 0|1 denotes the set {0, 1} as per rule 1.

• 0∗ denotes the set {ε, 0, 00, 000, 0000, . . . } as per rule 3.

• (0|1)(0|1) denotes the set {00, 01, 10, 11} as per rules 1 and 2.

• (0|1)∗ denotes the set {ε, 0, 1, 00, 01, 10, 11, 000, 001, . . . } as per rules 1 and 3.

• 0|0∗1 denotes the set {0, 1, 01, 001, 0001, . . . } as per rules 1, 2 and 3.

2.3.1 Regular Definition

We may assign a name to a regular expression to use and reuse the name in other (more
complex) regular expressions and to enhance the readability of longer regular expressions.
Suppose the following regular definitions are given:

• digit = [0-9], which represents any digit in the range 0 through 9.

• letter = [A-Za-z], which represents any letter from capital A through Z or small a through z.

• eol = [\n]

• neol = [^\n]

We can use these regular definitions to write complex regular expressions, for example,

• Integer_Literal = digit+

• Fixed_Point_Literal = digit+ “.” digit+

• Floating_Point_Literal = digit+ “.” digit+(e|E)(+|-)?digit+

• Identifier = letter(letter|digit)*

2.4 Token Recognition

The previous section described how the tokens of a language are specified using a compact notation
called regular expressions. This section elaborates how to construct recognizers that can
identify the tokens occurring in the input stream. These recognizers are known as Finite Automata.
A Finite Automaton (FA) consists of:

• A finite set of states

• A set of transitions (or moves) between states: The transitions are labeled by characters
from the alphabet

• A special start state

• A set of final or accepting states

A finite automaton to represent

Identifier = letter(letter|digit)∗

is shown below in Figure 2.2.

Figure 2.2: A finite automaton for Identifier

2.4.1 Deterministic Finite Automata (DFA)

A Deterministic Finite Automaton (DFA) is a 5-tuple M = (Q, Σ, δ, S, F) consisting of:

1. A finite set of states Q

2. A finite set of input symbols Σ

3. A transition function δ : Q × Σ → Q

4. A start state S ∈ Q

5. A set of accepting states F ⊆ Q

A DFA takes an input string w over the alphabet Σ and either accepts or rejects the
string. Identifying acceptance with the value 1 and rejection with 0, one can think of a DFA
as a machine that takes a string w as input and outputs a single bit b ∈ {0, 1}. A DFA can be
represented by a transition table T which is indexed by state s and input character c. T[s][c]
is the next state to visit from state s if the input character is c.
T can also be described as a transition function T : Q × Σ → Q that maps the pair (s, c) to
the next state.
The DFA and transition table for a C comment are shown in Figure 2.3 and Table 2.2. Blank
entries in the table represent an error state. A full transition table would contain one column for
each character, which may waste space, so characters that are treated identically in a DFA
are combined into character classes.

Figure 2.3: DFA for C Comments

Table 2.2: Transition Table for C Comments

State    /    *    other
1        2
2             3
3        3    4    3
4        5    4    3

2.4.2 Non-Deterministic Finite Automata (NFA)
An NFA is a 5-tuple M = (Q, Σ, δ, S, F ) consisting of:

1. A finite set of states Q

2. A finite set of input symbols Σ

3. A transition function δ : Q × (Σ ∪ {ε}) → P(Q), where P(Q) is the power set of Q


4. A start state S ∈ Q

5. A set of accepting states F ⊆ Q

The only difference between a DFA and an NFA is in the transition function δ; the rest is exactly
the same as the definition of a DFA. Its computations are defined in the same style
as for DFAs.
An NFA is similar to a DFA except that multiple transitions labeled by the same character
from the same state are allowed, and ε-transitions are allowed. ε-transitions are spontaneous:
they occur without consuming any character. Figure 2.4 and Figure 2.5 show a DFA and an NFA
for relational operators.

Figure 2.4: DFA of Relational Operators

Figure 2.5: NFA for Relational Operators

2.5 Lexical Analyzer Generator
A Lexical Analyzer Generator, or Scanner Generator, generates lexical analyzers which can be used
to perform scanning of a file. Lex and Flex are two of the most popular scanner generators available
on UNIX and Linux platforms. They take as input a specification in the form of
regular expressions and generate C code that performs the lexical analysis of the file supplied as input,
i.e. they generate a lexical analyzer. Figure 2.6 shows the working of lex/flex and Figure 2.7
gives a general template for writing lex/flex specifications.

Figure 2.6: Working of Lex/Flex

Figure 2.7: Lex/Flex Specification Template

2.5.1 Definition Section

This section defines the header files to include, macros, and basic declarations of variables,
functions, keywords, special patterns, etc. C code placed here between %{ and %} is copied to the
generated C file. We include the following code in our definition section:

%{
int vowels=0;   /* number of vowels seen so far */
int cons=0;     /* number of consonants seen so far */
%}

2.5.2 Rule Section
This section pairs regular expression patterns with language statements (actions). When the
scanner matches a pattern in the input file with a declared pattern, it executes the code
associated with that pattern. Using the counters declared in the definition section, we define
the following rules:

[aeiouAEIOU] {vowels++;}

The above rule means that whenever a vowel is matched, the vowel count is incremented.

[a-zA-Z] {cons++;}

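The above rule means that whenever any other letter, i.e. a consonant, is matched, the consonant
count is incremented. A vowel never reaches this rule: both rules match a single character, and
the vowel rule is listed first.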

2.5.3 User Subroutines

This section contains the main function, the definitions of functions declared in the definition
section and other relevant C code. These statements are copied directly to the generated source file. The
actions written in the rule section are executed when main calls the scanner.

printf("Enter the string.. at end press ^d\n");
yylex();   /* run the generated scanner over the input */
printf("No of vowels=%d No of consonants=%d\n", vowels, cons);

When lex compiles the input specification, it generates the C file lex.yy.c, which contains
the routine yylex(). This routine reads the input and tries to match it with any of the token
patterns specified in the rules section. On a match, the associated action is executed. If there is
more than one match, the action associated with the pattern that matches the most text (including
any trailing context) is executed. If two or more patterns still match the same amount of
text, the action associated with the pattern listed first in the specification file is executed. If
no match is found, the default action is executed. The input text (lexeme) associated with
the recognized token is placed in the global variable yytext. A detailed description of using
the lex/flex compiler is given in Appendix A.

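Example: To count the number of vowels and consonants in a given string.

%{
#include <stdio.h>
int vowels=0;                    /* number of vowels seen */
int cons=0;                      /* number of consonants seen */
%}
%%
[aeiouAEIOU]   { vowels++; }     /* listed first, so vowels never reach the next rule */
[a-zA-Z]       { cons++; }       /* any remaining letter is a consonant */
%%
int yywrap(void) { return 1; }   /* no further input files */
int main(void) {
    printf("Enter the string.. at end press ^d\n");
    yylex();                     /* scan standard input until end of file */
    printf("No of vowels=%d No of consonants=%d\n", vowels, cons);
    return 0;
}
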
Using the approach described in this chapter, a lexical analyzer can be designed to perform
a specific lexical analysis task.

