
Role of Lexical Analyzer

Groups input characters into tokens
An expensive phase of the compiler
Ways to construct:
  Scanner generator (flex)
  Hand written in a high-level language
  Hand written in assembly language

Lexical Analysis

CS 5300 - SJAllan

Features of Lexical Analyzer

Returns a token to the syntax analyzer
Strips white space
Keeps track of line numbers
Generates output listing with errors marked
Deletes comments
Expands macros, if the language has them
Converts numbers to internal form

Token Lexemes Patterns


Decomposition of Grammar
Determine what the lexical analyzer recognizes vs. what the syntax analyzer recognizes
Basic symbols
  Delimiters
  Identifiers
  Constants
Structure of basic symbols can generally be described with regular expressions

The Input
Sequence of characters


The Output
A series of tokens:
  Punctuation: ( ) ; , [ ]
  Operators: + - * :=
  Keywords: begin end if while try
  Identifiers: SquareRoot
  String literals: "press Enter to continue"
  Character literals: 'x'
  Numeric literals
    Integer: 123
    Floating point: 45.23e+2
    Based representation: 0xaa

Free Form vs Fixed Form

Free form languages (all modern ones)
White space does not matter. Ignore these:
Tabs, spaces, new lines, carriage returns

Only the ordering of tokens is important

Fixed format languages (historical)
  Layout is critical
    Fortran: statement label in cols 1-5, continuation mark in col 6
    COBOL: Area A and Area B
  Lexical analyzer must know about layout to find tokens

Punctuation: Separators
Typically individual special characters such as { } ;
  Sometimes double characters: lexical scanner looks for longest token:
    (*, /* -- comment openers in various languages
Returned just as identity (kind) of token
  And perhaps location for error messages and debugging purposes

Operators
Like punctuation
  No real difference for lexical analyzer
  Typically single or double special chars
    Operators: + - == <=
    Operations: := =>
Returned as kind of token
  And perhaps location
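
The longest-token rule for one- and two-character punctuation and operators can be sketched in C. This is only an illustration under stated assumptions: the OPERATORS table and the function name match_operator are invented for the example and are not part of any particular scanner.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operator table. Order does not matter, because the
   scan below always keeps the longest entry that matches. */
static const char *const OPERATORS[] = {
    ":", ":=", "=", "==", "=>", "<", "<=", "+", "-", "*"
};

/* Return the length of the longest operator that starts at s,
   or 0 if no operator starts there. */
size_t match_operator(const char *s)
{
    assert(s != NULL);
    size_t best = 0;
    for (size_t i = 0; i < sizeof OPERATORS / sizeof OPERATORS[0]; i++) {
        size_t len = strlen(OPERATORS[i]);
        if (len > best && strncmp(s, OPERATORS[i], len) == 0)
            best = len;
    }
    return best;
}
```

With this table, the input ":= 1" yields a two-character token rather than ":" followed by "=".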



Keywords
Reserved identifiers
  E.g. BEGIN END in Pascal, if in C, catch in C++
Returned as kind of token
  With possible location information

Identifiers
Rules differ
  Length
  Allowed characters
  Separators
Lexical analyzer returns
  Token kind
  Name of the identifier



String Literals
Text must be stored
Actual characters are important
  Not like identifiers: must preserve case
Table needed
  We will use a linked list
Lexical analyzer returns
  String constant token
  Actual string

Character Literals
Similar issues to string literals
Lexical analyzer returns
  Token kind
  Identity of character



Numeric Literals
Return the integer constant token
Return the value of the integer constant
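
Converting the lexeme of an integer constant to its value might look like the following sketch. The function name int_literal_value and the choice to report overflow as failure are assumptions made for the example; a real scanner may report the error differently.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Convert a lexeme made of decimal digits to its integer value.
   Returns false if the lexeme is empty, contains a non-digit,
   or does not fit in an int. */
bool int_literal_value(const char *lexeme, int *value)
{
    assert(lexeme != NULL && value != NULL);
    if (*lexeme == '\0')
        return false;
    int result = 0;
    for (const char *p = lexeme; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return false;
        int digit = *p - '0';
        if (result > (INT_MAX - digit) / 10)
            return false;              /* result * 10 + digit would overflow */
        result = result * 10 + digit;
    }
    *value = result;
    return true;
}
```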

Handling Comments
Comments have no effect on program
Are eliminated by scanner
Error detection issues
  E.g. unclosed comments

Scanner skips over comments and returns next meaningful token
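
A minimal sketch of comment skipping with detection of unclosed comments, assuming Pascal-style (* ... *) comments that do not nest. The function name skip_comment and the convention of returning NULL for an unclosed comment are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Skip a comment that opens at s with "(*".
   Returns a pointer just past the closing "*)", or NULL if the input
   ends first (unclosed comment). *line is incremented for every
   newline skipped, so line numbers stay correct. */
const char *skip_comment(const char *s, int *line)
{
    assert(s != NULL && line != NULL);
    assert(s[0] == '(' && s[1] == '*');
    for (const char *p = s + 2; *p != '\0'; p++) {
        if (p[0] == '*' && p[1] == ')')
            return p + 2;
        if (*p == '\n')
            (*line)++;
    }
    return NULL;
}
```

The scanner would call this when it sees the comment opener, then continue scanning from the returned position to find the next meaningful token.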



Case Equivalence
Some languages are case-insensitive
  Pascal, Ada
Some are not
  C, Java

Performance Issues
Lexical analysis can become a bottleneck
Minimize processing per character
  Skip blanks fast
  I/O is also an issue (read large blocks)
We compile frequently
  Compilation time is important
    Especially during development
Communicate with parser through global variables
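
For the case-equivalence point, one way a scanner could recognize keywords in either kind of language is sketched here. The KEYWORDS table, the fold_case flag, and the function names are assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword table, stored in lower case. */
static const char *const KEYWORDS[] = { "begin", "end", "if", "while" };

/* Compare two strings, ignoring ASCII letter case. */
static bool equal_ignoring_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings end together */
}

/* Return the index of lexeme in KEYWORDS, or -1 if it is not a keyword.
   fold_case selects case-insensitive matching (Pascal, Ada style). */
int keyword_index(const char *lexeme, bool fold_case)
{
    assert(lexeme != NULL);
    for (size_t i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++) {
        bool same = fold_case ? equal_ignoring_case(lexeme, KEYWORDS[i])
                              : strcmp(lexeme, KEYWORDS[i]) == 0;
        if (same)
            return (int)i;
    }
    return -1;
}
```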


General Approach
Define set of token kinds:
  An enumeration type
  Integers
Some tokens carry associated data
  Identifier - name of the identifier
  Constant - value of constant
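
A token representation along these lines could be sketched as an enumeration plus a union for the associated data. All type, field, and function names here (TokenKind, Token, make_ident, and so on) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* The set of token kinds as an enumeration type. */
typedef enum {
    TOK_IDENT, TOK_INT_CONST, TOK_ASSIGN, TOK_SEMICOLON, TOK_EOF
} TokenKind;

/* A token: its kind, a source line for error messages, and the
   associated data that only some kinds carry. */
typedef struct {
    TokenKind kind;
    int line;
    union {
        const char *name;   /* valid when kind == TOK_IDENT */
        int value;          /* valid when kind == TOK_INT_CONST */
    } data;
} Token;

Token make_ident(const char *name, int line)
{
    assert(name != NULL);
    Token t = { .kind = TOK_IDENT, .line = line, .data = { .name = name } };
    return t;
}

Token make_int_const(int value, int line)
{
    Token t = { .kind = TOK_INT_CONST, .line = line, .data = { .value = value } };
    return t;
}

/* Tokens such as := or ; carry no data beyond their kind. */
Token make_simple(TokenKind kind, int line)
{
    assert(kind != TOK_IDENT && kind != TOK_INT_CONST);
    Token t = { .kind = kind, .line = line, .data = { .name = NULL } };
    return t;
}
```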

Interface to Lexical Analyzer

Either: Convert entire file to a file of tokens
  Lexical analyzer is separate phase
Or: Parser calls lexical analyzer to supply next token
  This approach avoids extra I/O
  Parser builds tree incrementally, using successive tokens as tree nodes
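
The second interface, where the parser asks for one token at a time, might be sketched as follows. The Scanner and Tok types, the three token kinds, and count_tokens (a stand-in for a parser loop) are all assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { T_IDENT, T_NUMBER, T_SYMBOL, T_EOF } Kind;

typedef struct {
    Kind kind;
    const char *start;   /* first character of the lexeme */
    size_t length;       /* number of characters in the lexeme */
} Tok;

typedef struct {
    const char *pos;     /* next unread character of the input */
} Scanner;

/* Supply the next token on demand; no token file is ever written. */
Tok next_token(Scanner *sc)
{
    assert(sc != NULL && sc->pos != NULL);
    while (isspace((unsigned char)*sc->pos))
        sc->pos++;                           /* strip white space */
    Tok t = { T_EOF, sc->pos, 0 };
    if (*sc->pos == '\0')
        return t;
    if (isalpha((unsigned char)*sc->pos)) {
        t.kind = T_IDENT;
        while (isalnum((unsigned char)*sc->pos))
            sc->pos++;
    } else if (isdigit((unsigned char)*sc->pos)) {
        t.kind = T_NUMBER;
        while (isdigit((unsigned char)*sc->pos))
            sc->pos++;
    } else {
        t.kind = T_SYMBOL;                   /* any other single character */
        sc->pos++;
    }
    t.length = (size_t)(sc->pos - t.start);
    return t;
}

/* Stand-in for a parser: pulls tokens until end of input. */
size_t count_tokens(const char *source)
{
    Scanner sc = { source };
    size_t n = 0;
    while (next_token(&sc).kind != T_EOF)
        n++;
    return n;
}
```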



Automatic Generation of Lexical Analyzer


Regular Expressions
Regular expressions (RE) are defined by an alphabet (terminal symbols) and three operations:
  Alternation: RE1 | RE2
  Concatenation: RE1 RE2
  Repetition: RE*
    Also called Kleene closure



Specifying REs in Unix Tools

Single characters   a b c d \x
Alternation         [bcd] [b-z] ab|cd
Any character       . (period)
Sequence            x* y+
Concatenation       abc[d-q]
Optional RE         [0-9]+(\.[0-9]*)?
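
These patterns can be tried out with the POSIX regcomp and regexec functions from regex.h, which accept the same extended syntax. The helper name matches is an assumption made for the example; note that regexec looks for a match anywhere in the text, so the caller anchors the pattern with ^ and $ to test the whole string.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if text contains a match for the extended RE pattern.
   Anchor the pattern with ^ and $ to require a whole-string match. */
bool matches(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                    /* pattern did not compile */
    bool found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

For example, matches("^[0-9]+(\\.[0-9]*)?$", "45.23") tests the optional-RE pattern from the table against a numeric lexeme.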

Precedence in REs
Highest to lowest
  Kleene closure
    Left associative
  Concatenation
    Left associative
  Alternation
    Left associative



Examples of REs
a*
(a|b)*
(ε|a|b)(a|b)(a|b)(a|b)*
BEGIN | END | IF | THEN | ELSE
letter(letter|digit)*
(digit)(digit)*
A|B|C|...|Y|Z
0|1|...|9

Using flex
[Figure: flex source program (cpsl.l) -> Flex compiler -> lex.yy.c (Unix) or lexyy.c (Windows) -> C/C++ compiler -> compiled scanner; input stream -> compiled scanner -> sequence of tokens]



Format of flex File

{ definitions }
%%
{ rules }
%%
{ programmer subroutines }

Any combination of:
  Definitions: name space translation
  Included code: space code
  Included code: %{ code %}



Any number of rules of the form
  Expression { Action }
  Expression is a regular expression that describes the token (pattern for token)
  Action is C/C++ code to be executed when the pattern is matched
    If it is more than a single statement, it should be enclosed in braces

Special variables in flex

yytext
  Variable where the lexeme is kept. It is a character string and is reused for every token
yyleng
  Length of the string in yytext
yylval
  Variable in which the lexeme can be returned
yywrap
  Function called when EOF is encountered



Example Input to flex

%{
#include <string.h>
#include "utility.h"
#include ""
%}
letter [a-zA-Z]
digit  [0-9]
lord   [a-zA-Z0-9]
%%
BEGIN               {return(BEGINSY);}
END                 {return(ENDSY);}
WHILE               {return(WHILESY);}
...
{letter}({lord})*   {yylval.name_ptr = strdup(yytext); return(IDENTSY);}
({digit})+          {yylval.int_val = intnum(); return(CONSTANTSY);}
":="                {return(ASSIGNSY);}
":"                 {return(COLONSY);}
...
.                   {error("Illegal character");}
%%
int intnum ()
/* convert character string into an integer */
{
  ...
} /* intnum */

Finite Automata
[Figure: example DFA with states 0 to 5, and its transition table with input columns (, *, ) and Other]



Formal Definition
A deterministic finite-state automaton, or DFA, is a five-tuple $M = (Q, \Sigma, \delta, q_0, F)$
1. $Q$ is a finite set of states
2. $\Sigma$ is the alphabet of the machine
3. $\delta$ is the state transition function
4. $q_0 \in Q$ is the start state
5. $F \subseteq Q$ are the final states

Configuration for an FSM

A configuration is designated $(q, \omega)$ where $q$ is a state and $\omega$ is the string remaining
  $(q_0, \omega)$: initial configuration
  $(q, \varepsilon)$: final configuration if $q \in F$
  $\vdash$ indicates a move

A move is made such that the following is true:

  $(q, a\omega) \vdash (q', \omega)$ iff $a \in \Sigma$, $\omega \in \Sigma^*$, and $q' \in \delta(q, a)$

The language $L(M)$ for an FSM is described as follows:

  $L(M) = \{\omega \in \Sigma^* \mid (q_0, \omega) \vdash^* (q, \varepsilon) \text{ for some } q \in F\}$
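
The definition of a move maps directly onto a table-driven loop: each character consumed is one move, and the input is in L(M) if the state reached at the end of the string is final. The sketch below assumes a small hypothetical DFA over the alphabet {a, b} that accepts (ab|aba)*; the state names, the explicit DEAD state, and the function name dfa_accepts are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { A, B, C, D, DEAD, NUM_STATES };   /* Q, plus an explicit dead state */
enum { SYM_A, SYM_B, NUM_SYMBOLS };      /* Sigma = {a, b} */

/* delta[q][symbol] is the next state. */
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    [A]    = { B,    DEAD },
    [B]    = { DEAD, C    },
    [C]    = { D,    DEAD },
    [D]    = { B,    C    },
    [DEAD] = { DEAD, DEAD },
};

/* F = {A, C, D} */
static const bool is_final[NUM_STATES] = { [A] = true, [C] = true, [D] = true };

/* Start in q0 = A and make one move per input character.
   The input is accepted if the last state reached is in F. */
bool dfa_accepts(const char *input)
{
    assert(input != NULL);
    int q = A;
    for (const char *p = input; *p != '\0'; p++) {
        if (*p == 'a')
            q = delta[q][SYM_A];
        else if (*p == 'b')
            q = delta[q][SYM_B];
        else
            return false;                /* symbol outside the alphabet */
    }
    return is_final[q];
}
```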

Finite Automata
Consider the two FAs M2 and M3 shown
What are L(M2) and L(M3)?
What is important about M3?
Why is it important?

NFAs and DFAs

Difference between:
  NFA: arbitrary choices permitted in transitions
  DFA: no choice allowed on any move

Another difference. For NFA:
  Given a terminal symbol, there may be a choice of which state to go to
  There may be empty moves
    An empty move doesn't consume input



Algorithm to take an RE into an NFA

[Figure: one NFA fragment for each case]
For ε
For a
For A|B
For AB
For A*

Applying the Previous Algorithm

Construct the NFA for the RE (ab|aba)*




Applying Algorithm

State  Old States            Input  New State
A      {0,1,2,5,10}          a      B
B      {3,6}                 b      C
C      {1,2,4,5,7,9,10}      a      D
D      {1,2,3,5,6,8,9,10}    a      B
D      {1,2,3,5,6,8,9,10}    b      C
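
The state sets in this table can be reproduced with a short sketch of the two operations behind the NFA-to-DFA conversion: closure over empty moves, and a move on one input symbol. The NFA diagram is not reproduced here, so the edge lists below are an assumed NFA for (ab|aba)* chosen to be consistent with the state sets in the table (start state 0, final state 10); the type and function names are also assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_NFA_STATES = 11 };
typedef unsigned int StateSet;     /* bit i set <=> NFA state i is in the set */

#define S(i) (1u << (i))

/* Assumed NFA for (ab|aba)*: empty moves, moves on a, moves on b. */
static const StateSet eps_edges[NUM_NFA_STATES] = {
    [0] = S(1) | S(10), [1] = S(2) | S(5),
    [4] = S(9), [8] = S(9), [9] = S(1) | S(10),
};
static const StateSet a_edges[NUM_NFA_STATES] = { [2] = S(3), [5] = S(6), [7] = S(8) };
static const StateSet b_edges[NUM_NFA_STATES] = { [3] = S(4), [6] = S(7) };

/* Add every state reachable from set through empty moves alone. */
StateSet epsilon_closure(StateSet set)
{
    StateSet closure = set, previous;
    do {
        previous = closure;
        for (int q = 0; q < NUM_NFA_STATES; q++)
            if (closure & S(q))
                closure |= eps_edges[q];
    } while (closure != previous);
    return closure;
}

/* The DFA state reached from the DFA state `set` on `symbol`:
   follow the symbol from every member, then close over empty moves. */
StateSet dfa_move(StateSet set, char symbol)
{
    assert(symbol == 'a' || symbol == 'b');
    const StateSet *edges = (symbol == 'a') ? a_edges : b_edges;
    StateSet target = 0;
    for (int q = 0; q < NUM_NFA_STATES; q++)
        if (set & S(q))
            target |= edges[q];
    return epsilon_closure(target);
}
```

Starting from epsilon_closure(S(0)) and applying dfa_move repeatedly until no new set appears yields the DFA states; a DFA state is final when its set contains NFA state 10.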


Applying MFA Algorithm

Initial partitioning of states:
  F = {A, C, D}
  N = {B}
[Figure: final MFA]
