CHAPTER 2 LEXICAL ANALYSIS

The Role of the Lexical Analyzer 
The lexical analyzer is the first phase of a compiler.

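[Figure: the source program feeds the lexical analyzer; the parser sends "get next token" requests and the lexical analyzer returns a token; the parser's output goes on to semantic analysis; both the lexical analyzer and the parser access the symbol table.]

Fig 2.1: Interaction between lexical analyzer and parser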

It performs the following tasks:
- Reading the input characters of the source program, grouping them into lexemes, and producing as output a sequence of tokens, one for each lexeme.
- Stripping out comments and whitespace.
- Keeping track of line numbers by counting newline characters during scanning. The error handler uses these line numbers when printing error messages.
- Preprocessing macros.

Reasons for implementing lexical analysis and parsing as separate phases:
- Simplicity of design.
- Compiler efficiency is improved.
- Compiler portability is enhanced.

Lexical analyzers are sometimes divided into a cascade of two processes:
- Scanning consists of the simple processes, such as deleting comments and compacting consecutive whitespace characters into one.
- Lexical analysis proper is the more complex portion, which produces the sequence of tokens from the output of the scanner.
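The scanning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the comment syntax (/* ... */ and // ...) and the function name prescan are hypothetical, and string literals are not treated specially.

```python
import re

# Assumed comment syntax (the text does not fix one): /* ... */ and // ...
_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
_BLANKS = re.compile(r"[ \t]+")


def _blank_out(match):
    # Replace a comment by one space, keeping its newlines so that
    # line numbers stay correct for later error messages.
    return " " + "\n" * match.group(0).count("\n")


def prescan(source):
    """Delete comments, then compact each run of blanks/tabs into one space."""
    without_comments = _COMMENT.sub(_blank_out, source)
    return _BLANKS.sub(" ", without_comments)
```

Newlines are deliberately left in place, since the lexical analyzer still needs them to keep track of line numbers.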

Tokens, Patterns and Lexemes 
Lexemes are the smallest logical units (words) of a program. Example: num1, c, 1, >, true
Tokens are classes of similar lexemes. Example: identifier, constant, operator.
A pattern is a rule describing the set of lexemes that can represent a particular token in the source program. It gives an informal or formal description of a token.

TOKEN        SAMPLE LEXEMES           INFORMAL DESCRIPTION OF PATTERN
const        const                    const
if           if                       if
comparison   <, <=, =, <>, >, >=      < or <= or = or <> or >= or >
identifier   pi, count, D2            letter followed by letters and digits
number       3.1416, 0, 6.02E23       any numeric constant
literal      "core dumped"            any characters between " and "
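To make the relation between tokens, patterns and lexemes concrete, the sketch below pairs each token name with a regular expression and classifies a lexeme by the first pattern that matches it entirely. The regular expressions and the name classify are illustrative assumptions; the text gives only informal descriptions of the patterns.

```python
import re
from typing import Optional

# Token name paired with a regex standing in for its pattern (assumed forms).
# Keywords come before "identifier" so that "if" is not classified as a name.
PATTERNS = [
    ("const", r"const"),
    ("if", r"if"),
    ("comparison", r"<=|<>|>=|<|=|>"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("number", r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?"),
    ("literal", r'"[^"]*"'),
]


def classify(lexeme: str) -> Optional[str]:
    """Return the first token name whose pattern matches the whole lexeme."""
    for token, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token
    return None
```

For instance, classify("D2") yields identifier and classify("6.02E23") yields number, while a lexeme that fits no pattern yields None.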

Token attributes 
Apart from the token itself, the lexical analyzer also passes on other information about the token. These items of information are called token attributes. When more than one lexeme can match a pattern, the lexical analyzer must give the subsequent phases of compilation additional information about the particular lexeme that matched.

The token names and associated attribute values for the statement a=b+c are written below as a sequence of pairs:
<id, pointer to symbol-table entry for a>
<assign_op>
<id, pointer to symbol-table entry for b>
<add_op>
<id, pointer to symbol-table entry for c>
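The pairs for a=b+c can be produced by a very small tokenizer. The sketch below is a simplification: it assumes single-letter identifiers and only the operators = and +, and it uses a list index in place of a pointer to the symbol-table entry. The name tokenize is hypothetical.

```python
def tokenize(statement):
    """Return (tokens, symbol_table) for statements such as a=b+c.

    Each token is a pair (token_name, attribute). For identifiers the
    attribute is the index of the symbol-table entry; operators carry None.
    """
    symbol_table = []  # one entry per distinct identifier
    tokens = []
    for ch in statement:
        if ch.isspace():
            continue
        if ch.isalpha():
            if ch not in symbol_table:
                symbol_table.append(ch)
            tokens.append(("id", symbol_table.index(ch)))
        elif ch == "=":
            tokens.append(("assign_op", None))
        elif ch == "+":
            tokens.append(("add_op", None))
        else:
            raise ValueError("unexpected character: %r" % ch)
    return tokens, symbol_table
```

The attribute is what lets later phases tell the three id tokens apart, since the token name alone is the same for a, b and c.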

Scanning 
The lexical analyzer scans the characters of the source program one at a time to discover tokens.
- Reading the input character by character is costly, so a block of data is first read into a buffer and then scanned by the lexical analyzer.
- The lexical analyzer uses two pointers to read tokens:
  - lb (lexeme_beginning) indicates the beginning of the lexeme.
  - sp (search pointer) keeps track of the portion of the input string scanned so far.
- Initially, both pointers point to the beginning of a lexeme.
- sp then scans forward to search for the end of the lexeme.
- When the end of the lexeme is identified, the token and attribute corresponding to the lexeme are returned.
- lb and sp are then made to point to the beginning of the next lexeme.
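The two-pointer procedure described above can be sketched as follows. The rule used here to find the end of a lexeme (a run of letters and digits, or else a single character) is an assumption made for the example, and the sketch returns the lexemes themselves rather than tokens and attributes.

```python
def scan(buffer):
    """Split a buffer into lexemes using the pointers lb and sp."""
    lexemes = []
    lb = 0  # lexeme_beginning
    n = len(buffer)
    while lb < n:
        if buffer[lb].isspace():
            lb += 1  # whitespace is skipped, not returned
            continue
        sp = lb  # search pointer starts at the lexeme beginning
        if buffer[sp].isalnum():
            # Scan forward to the end of a run of letters and digits.
            while sp < n and buffer[sp].isalnum():
                sp += 1
        else:
            sp += 1  # any other character is a one-character lexeme
        lexemes.append(buffer[lb:sp])
        lb = sp  # both pointers move to the start of the next lexeme
    return lexemes
```

With the sample input int a=12; this yields the lexemes int, a, =, 12 and ;.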

Commonly used buffering methods are:
- One-buffer scheme
- Two-buffer scheme
An extra character called the sentinel, different from every input character, is added at the end of the input buffer to reduce the number of buffer tests.
[Figure: input buffer holding "int a=12;", with the pointers lb and sp marked]
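The effect of a sentinel can be shown with a small sketch. It assumes the NUL character never occurs in the source text and uses hypothetical helper names; the point is that the inner loop tests only the current character and needs no separate end-of-buffer check.

```python
SENTINEL = "\0"  # assumed never to occur in the source text


def load_buffer(text):
    """Return the input with the sentinel appended at the end."""
    return text + SENTINEL


def scan_word(buffer, lb):
    """Advance sp from lb over letters and digits and return sp.

    There is no "sp < len(buffer)" test: the sentinel is not alphanumeric,
    so it stops the loop before sp can run past the end of the buffer.
    """
    sp = lb
    while buffer[sp].isalnum():
        sp += 1
    return sp
```

Without the sentinel, the same loop would need a second test on every character to avoid reading past the end of the buffer.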

Specification of tokens 
The patterns corresponding to a token are generally specified using a compact notation known as a regular expression. A regular expression r is defined by the set of strings that match it. This set of strings is called the language generated by the regular expression and is written L(r). The set of symbols in the language is called the alphabet of the language and is denoted by Σ.

Example: the set of letters is L = {A, ..., Z, a, ..., z} and the set of digits is D = {0, 1, ..., 9}. L ∪ D is then a language.

For a language L with alphabet Σ, the following rules define the regular expressions:
- ε is a regular expression denoting the language {ε}, that is, the set containing only the empty string.
- If a is a symbol in Σ, then a is also a regular expression, denoting the language {a}, that is, the set containing only the string a.

If r1 and r2 are regular expressions corresponding to the languages L1 and L2 respectively, then:
- r1|r2 is a regular expression corresponding to the language L1 ∪ L2, i.e., the set containing all the strings of L1 and L2.
- r1r2 is a regular expression corresponding to the language formed by concatenating a string of L1 with a string of L2, in that order.
- r1* is a regular expression corresponding to the language L1*, the set of strings formed by concatenating zero or more strings belonging to L1.
- (r1) is a regular expression corresponding to the language L1 itself.
The unary operator * has the highest precedence and is left associative. Concatenation has the next highest precedence, and | has the lowest.
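The three language operations behind |, concatenation and * can be tried out on small finite sets of strings. This is a sketch with hypothetical function names; since L1* is infinite whenever L1 contains a non-empty string, the star function takes an assumed bound max_reps on the number of repetitions.

```python
def union(l1, l2):
    """Language of r1|r2: every string of L1 and of L2."""
    return l1 | l2


def concat(l1, l2):
    """Language of r1r2: a string of L1 followed by a string of L2."""
    return {x + y for x in l1 for y in l2}


def star(l1, max_reps):
    """Finite part of L1*: concatenations of 0..max_reps strings of L1."""
    result = {""}  # zero repetitions give the empty string
    power = {""}
    for _ in range(max_reps):
        power = concat(power, l1)
        result |= power
    return result
```

For example, star({"0", "1"}, 2) contains the empty string and every binary string of length one or two.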

For Σ = {0,1}, consider the following regular expressions:
a) (0|1)* denotes all binary strings, including the empty string: {ε, 0, 1, 001, 011, ...}
b) (0|1)(0|1)* denotes all non-empty binary strings.
c) 0(0|1)*0 denotes all binary strings of length at least two that both start and end with 0: {00, 010, 000, ...}
d) (0|1)*0(0|1)(0|1) denotes all binary strings whose third symbol from the end is 0.
e) 0*10*10*10* denotes all binary strings containing exactly three 1s.
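These expressions can be checked against sample strings with Python's re module, whose syntax for |, * and parentheses agrees with the notation used here. The dictionary labels and the helper name matches are hypothetical; re.fullmatch is used so that the whole string, not just a prefix, must belong to the language.

```python
import re

# The five expressions over the alphabet {0, 1}, keyed by their labels.
EXPRESSIONS = {
    "a": r"(0|1)*",
    "b": r"(0|1)(0|1)*",
    "c": r"0(0|1)*0",
    "d": r"(0|1)*0(0|1)(0|1)",
    "e": r"0*10*10*10*",
}


def matches(label, string):
    """True if the whole string is in the language of the labeled expression."""
    return re.fullmatch(EXPRESSIONS[label], string) is not None
```

For example, matches("c", "010") is true, while matches("c", "0") is false because the expression requires at least two symbols.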

Recognition of Tokens 
The tokens obtained during lexical analysis are recognized using a finite automaton. A finite automaton, or finite-state machine, is a mathematical way of describing a regular expression and can be drawn as a transition diagram. It takes a string as input and decides whether the string belongs to the language or not.

Transition diagrams have a collection of nodes or circles called states. Each state represents a condition that could occur while scanning the input in search of a lexeme that matches one of several patterns. Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in some state s and the next input symbol is a, we look for an edge out of state s labeled by a. If such an edge exists, we advance the input and enter the state to which that edge leads.

Symbols used in Finite Automata

Symbol                  Meaning
(q0)                    A circle represents a state. Here, q0 is a state of the machine.
start -> (q0)           A circle with an incoming arrow that does not originate from any node represents the start state of the machine.
((q0))                  Two concentric circles represent a final state. Here, q0 is a final state.
(q0) --1--> (q1)        An arrow labeled 1 goes from state q0 to state q1. This indicates a transition from state q0 to state q1 on input symbol 1, written δ(q0, 1) = q1.
(q0) with a loop on 0   An arrow labeled 0 starts and ends at q0. This indicates that the machine, in state q0, remains in q0 on reading a 0, written δ(q0, 0) = q0.
(q0) --0,1--> (q1)      An arrow labeled 0, 1 goes from state q0 to state q1. This indicates that the machine, in state q0, enters state q1 on reading a 0 or a 1, written δ(q0, 0) = q1 and δ(q0, 1) = q1.

Types of Finite Automata 

- Deterministic Finite Automata (DFA)
- Non-deterministic Finite Automata (NFA)
- Non-deterministic Finite Automata with ε-moves

Deterministic Finite Automata 
In a DFA, the next state of the automaton is uniquely determined by the current state and the input symbol. The name DFA reflects the following facts:
- D (Deterministic): there is exactly one transition from each state for every input symbol, so the state the machine enters after consuming an input symbol can be determined exactly.
- F (Finite): it has a finite number of states and arcs.
- A (Automaton): an automaton is a machine that may accept or reject a string.
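[Figure: DFA with start state q0 carrying a self-loop on 0, an edge labeled 1 from q0 to the accepting state q1, and a self-loop on 0, 1 at q1]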

A DFA is a 5-tuple: M = (Q, Σ, δ, q0, F), where
- M is the name of the machine.
- Q is a non-empty, finite set of states.
- Σ is a non-empty, finite set of input symbols (the alphabet).
- δ : Q × Σ → Q is the transition function, a mapping from Q × Σ to Q.
- q0 ∈ Q is the start state.
- F ⊆ Q is the set of accepting or final states.

The two notations commonly used to represent DFAs are:
- Transition diagram (transition graph)
- Transition table

Transition diagram
The transition diagram for a DFA M = (Q, Σ, δ, q0, F) is a graph drawn with circles, double circles, arrows, and labeled arcs. It is formally defined as follows:
- Each state in Q corresponds to one node (vertex), drawn as a circle or a double circle.
- Symbols of the alphabet Σ appear as edge labels.
- A transition from one state to another is indicated by a directed edge. If δ(qi, a) = qj, there is a directed edge from qi to qj labeled a.
- The start state is the state with an incoming arrow that does not originate from any node. This arrow is labeled start.
- The final (accepting) states, which are in F, are drawn as double circles. The states not in F are drawn as single circles.

Transition Table
The transition table for a DFA M = (Q, Σ, δ, q0, F) is a conventional tabular representation of a function such as δ, which takes two arguments and returns a value:
- The rows of the table correspond to the states of the DFA, taken from Q.
- The columns correspond to the input symbols from Σ.
- If q is the current state of the DFA and a is the current input symbol, the value δ(q, a) is the next state of the DFA and is entered in row q and column a.
- The start state is marked with an arrow.
- Each final state is marked with a star.

          0     1
 → q0     q0    q1
  *q1     q1    q1
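A transition table translates directly into a lookup structure. The sketch below encodes the table above, read as δ(q0, 0) = q0, δ(q0, 1) = q1, δ(q1, 0) = q1 and δ(q1, 1) = q1, with start state q0 and final state q1. The dictionary layout and the name accepts are choices made for this example.

```python
# Transition function as a dictionary: (state, input symbol) -> next state.
DELTA = {
    ("q0", "0"): "q0",
    ("q0", "1"): "q1",
    ("q1", "0"): "q1",
    ("q1", "1"): "q1",
}
START = "q0"
FINAL = {"q1"}


def accepts(string):
    """Run the DFA over a string of 0s and 1s; True if it ends in a final state."""
    state = START
    for symbol in string:
        # Exactly one entry exists per (state, symbol), so the move is unique.
        state = DELTA[(state, symbol)]
    return state in FINAL
```

This particular machine accepts exactly the binary strings that contain at least one 1: once it reaches q1 it can never leave.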

Nondeterministic Finite Automata (NFA)
The basic feature of an NFA is its ability to have more than one possible transition from a state on the same input symbol. An NFA is a 5-tuple: M = (Q, Σ, δ, q0, F), where
- M is the name of the machine.
- Q is a non-empty, finite set of states.
- Σ is a non-empty, finite set of input symbols (the alphabet).
- δ : Q × Σ → 2^Q is the transition function, a mapping from Q × Σ to subsets of Q.
- q0 ∈ Q is the start state.
- F ⊆ Q is the set of accepting or final states.
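An NFA can be simulated by tracking the set of all states it could be in after each input symbol. The machine below is a hypothetical example, not taken from the text: it has states p0 to p3 and accepts binary strings whose third symbol from the end is 0. State p0 has two possible moves on input 0, which is what makes it nondeterministic.

```python
# Transition function: (state, input symbol) -> set of possible next states.
NFA_DELTA = {
    ("p0", "0"): {"p0", "p1"},  # two choices on the same symbol
    ("p0", "1"): {"p0"},
    ("p1", "0"): {"p2"},
    ("p1", "1"): {"p2"},
    ("p2", "0"): {"p3"},
    ("p2", "1"): {"p3"},
}
NFA_START = "p0"
NFA_FINAL = {"p3"}


def nfa_accepts(string):
    """True if some sequence of choices leads to a final state."""
    current = {NFA_START}
    for symbol in string:
        following = set()
        for state in current:
            # A missing entry means no move is possible from that state.
            following |= NFA_DELTA.get((state, symbol), set())
        current = following
    return bool(current & NFA_FINAL)
```

The string is accepted if at least one of the tracked states is final once the input is exhausted.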
