
Lexical Analysis

CS 5300 - SJAllan

Role of Lexical Analyzer

- Groups input characters into tokens
- An expensive phase of the compiler
- Ways to construct:
  - Scanner generator (flex)
  - Hand written in a high-level language
  - Hand written in assembly language

Features of Lexical Analyzer

- Returns a token to the syntax analyzer
- Strips white space
- Keeps track of line numbers
- Generates an output listing with errors marked
- Deletes comments
- Expands macros, if the language has them
- Converts numbers to internal form

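Terminology

- Token: a category of basic symbol, such as identifier or integer constant
- Lexeme: the actual character sequence in the input that forms one instance of a token
- Pattern: the rule describing the set of lexemes that belong to a token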

Decomposition of Grammar

- Determine what the lexical analyzer recognizes vs. what the syntax analyzer recognizes
- Basic symbols
  - Delimiters
  - Identifiers
  - Constants
- Structure of basic symbols can generally be described with regular expressions

The Input

- Sequence of characters

The Output

A series of tokens:
- Punctuation: ( ) ; , [ ]
- Operators: + - * :=
- Keywords: begin end if while try
- Identifiers: SquareRoot
- String literals: "press Enter to continue"
- Character literals: 'x'
- Numeric literals
  - Integer: 123
  - Floating point: 45.23e+2
  - Based representation: 0xaa

Free Form vs Fixed Form

- Free form languages (most modern ones)
  - White space does not matter. Ignore these:
    - Tabs, spaces, new lines, carriage returns
  - Only the ordering of tokens is important
- Fixed format languages (historical)
  - Layout is critical
    - Fortran: statement label in cols 1-5, continuation mark in col 6
    - COBOL: Area A and Area B
  - Lexical analyzer must know about layout to find tokens
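For a free form language, ignoring white space and keeping track of line numbers can be combined in one small routine. The following is a minimal sketch; the name `skip_white_space` and its signature are illustrative assumptions, not part of the course code.

```c
#include <stddef.h>

/* Hypothetical helper: advance past spaces, tabs, carriage returns, and
   new lines starting at src[pos], counting each '\n' in *line.
   Returns the index of the first character that is not white space
   (possibly the terminating '\0'). */
size_t skip_white_space(const char *src, size_t pos, int *line)
{
    for (;;) {
        char c = src[pos];
        if (c == '\n') {
            ++*line;                 /* keep line numbers for error messages */
        } else if (c != ' ' && c != '\t' && c != '\r') {
            return pos;              /* start of the next token, or end of input */
        }
        ++pos;
    }
}
```

A fixed format language could not use such a routine unchanged, because the column in which a character appears carries meaning there.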

Punctuation: Separators

- Typically individual special characters such as { } ;
- Sometimes double characters: the lexical scanner looks for the longest token
  - (*, /* (comment openers in various languages)
- Returned just as identity (kind) of token
  - And perhaps location, for error messages and debugging purposes

Operators

- Like punctuation
  - No real difference for the lexical analyzer
  - Typically single or double special chars
    - Operators: + - == <=
    - Operations: := =>
- Returned as kind of token
  - And perhaps location
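The longest-token rule for double-character symbols can be sketched by testing two-character candidates before their one-character prefixes. The enumeration and the function `scan_operator` below are hypothetical names chosen for illustration.

```c
#include <string.h>

enum op_kind { OP_NONE, OP_COLON, OP_ASSIGN, OP_LESS, OP_LESS_EQ };

/* Recognize an operator at the start of s using the longest-match rule.
   The length of the matched lexeme is stored in *len (0 if none). */
enum op_kind scan_operator(const char *s, size_t *len)
{
    /* Two-character tokens are tried first, so ":=" is never split
       into ":" followed by "=". */
    if (strncmp(s, ":=", 2) == 0) { *len = 2; return OP_ASSIGN; }
    if (strncmp(s, "<=", 2) == 0) { *len = 2; return OP_LESS_EQ; }
    if (s[0] == ':') { *len = 1; return OP_COLON; }
    if (s[0] == '<') { *len = 1; return OP_LESS; }
    *len = 0;
    return OP_NONE;
}
```

Only the kind and the length are produced here; a fuller scanner would typically also record the source location.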

Keywords

- Reserved identifiers
  - E.g. BEGIN END in Pascal, if in C, catch in C++
- Returned as kind of token
  - With possible location information

Identifiers

- Rules differ
  - Length
  - Allowed characters
  - Separators
- Lexical analyzer returns
  - Token kind
  - Name of the identifier
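One common way to separate keywords from identifiers is to scan every word as an identifier and then look it up in a table of reserved words. The sketch below assumes that approach; the token names follow the `...SY` style used in the flex example in these slides, but this enumeration (including `IFSY`) is a stand-in rather than the contents of any real header.

```c
#include <string.h>

enum token_kind { IDENTSY, BEGINSY, ENDSY, IFSY, WHILESY };

/* Illustrative reserved-word table. */
static const struct { const char *text; enum token_kind kind; } keywords[] = {
    { "BEGIN", BEGINSY }, { "END", ENDSY }, { "IF", IFSY }, { "WHILE", WHILESY },
};

/* Return the keyword's token kind, or IDENTSY if the lexeme is not reserved. */
enum token_kind classify_word(const char *lexeme)
{
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; ++i)
        if (strcmp(lexeme, keywords[i].text) == 0)
            return keywords[i].kind;
    return IDENTSY;
}
```

A linear search is enough for a handful of keywords; a larger language might use a hash table or a sorted table with binary search instead.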

String Literals

- Text must be stored
- Actual characters are important
  - Not like identifiers: must preserve case
- Table needed
  - We will use a linked list
- Returns
  - String constant token
  - Actual string

Character Literals

- Similar issues to string literals
- Lexical analyzer returns
  - Token kind
  - Identity of character
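A linked-list string table could look roughly like the following. This is only a sketch: the node layout, the name `intern_string`, and the choice to reuse an existing entry when the same text appears twice are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

struct string_node {
    char *text;                    /* private copy of the literal's characters */
    struct string_node *next;
};

/* Return the stored copy of text, adding a new node at the head of the
   list if the string has not been seen before. Returns NULL if memory
   allocation fails. Comparison is exact, so case is preserved. */
const char *intern_string(struct string_node **head, const char *text)
{
    struct string_node *n;
    for (n = *head; n != NULL; n = n->next)
        if (strcmp(n->text, text) == 0)
            return n->text;

    n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->text = malloc(strlen(text) + 1);
    if (n->text == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->text, text);
    n->next = *head;
    *head = n;
    return n->text;
}
```

The scanner would return the string constant token together with the pointer handed back by this routine.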

Numeric Literals

- Integer
  - Return the integer constant token
  - Return the value of the integer constant

Handling Comments

- Comments have no effect on the program
- Are eliminated by the scanner
- Error detection issues
  - E.g. unclosed comments
- Scanner skips over comments and returns the next meaningful token
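Skipping a comment while still detecting the unclosed-comment error can be sketched as follows, here for Pascal-style `(* ... *)` comments. The function name and the convention of returning -1 on error are assumptions for this example.

```c
#include <stddef.h>

/* Assumes src[pos] and src[pos + 1] are the opener "(*".
   Returns the index just past the closing "*)", or -1 if the input ends
   before the comment is closed. New lines inside the comment are counted
   in *line so that later error messages stay accurate. */
long skip_comment(const char *src, size_t pos, int *line)
{
    pos += 2;                              /* step over "(*" */
    while (src[pos] != '\0') {
        if (src[pos] == '*' && src[pos + 1] == ')')
            return (long)(pos + 2);
        if (src[pos] == '\n')
            ++*line;
        ++pos;
    }
    return -1;                             /* unclosed comment */
}
```

On success the scanner resumes at the returned index and goes on to produce the next meaningful token; on -1 it can report the error at the line where the comment opened.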

Case Equivalence

- Some languages are case-insensitive
  - Pascal, Ada
- Some are not
  - C, Java

Performance Issues

- Speed
  - Lexical analysis can become a bottleneck
  - Minimize processing per character
    - Skip blanks fast
    - I/O is also an issue (read large blocks)
- We compile frequently
  - Compilation time is important
    - Especially during development
- Communicate with parser through global variables
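For a case-insensitive language, one simple option is to fold each identifier or keyword lexeme to a single case before it is compared or stored. The helper below is a minimal sketch of that idea; `fold_case` is a hypothetical name. String literals would not be folded, since their case must be preserved.

```c
#include <ctype.h>

/* Fold a lexeme to upper case in place, so that "begin", "Begin", and
   "BEGIN" compare equal afterwards. */
void fold_case(char *lexeme)
{
    for (; *lexeme != '\0'; ++lexeme)
        *lexeme = (char)toupper((unsigned char)*lexeme);
}
```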

General Approach

- Define set of token kinds:
  - An enumeration type
  - Integers
- Some tokens carry associated data
  - Identifier: name of the identifier
  - Constant: value of the constant

Interface to Lexical Analyzer

- Either: convert the entire file to a file of tokens
  - Lexical analyzer is a separate phase
- Or: parser calls the lexical analyzer to supply the next token
  - This approach avoids extra I/O
  - Parser builds tree incrementally, using successive tokens as tree nodes
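The second interface, in which the parser asks for one token at a time, might be sketched like this. The token kinds, the `struct token` layout, and `next_token` are illustrative assumptions; only identifiers and unsigned integers are handled, and integer overflow is not checked.

```c
#include <ctype.h>
#include <stddef.h>

enum token_kind { TOK_EOF, TOK_IDENT, TOK_INT, TOK_ERROR };

struct token {
    enum token_kind kind;
    const char *start;     /* first character of the lexeme */
    size_t length;         /* lexeme length in characters */
    long value;            /* associated data: value of a TOK_INT */
};

/* Scan one token starting at *cursor and advance *cursor past it. */
struct token next_token(const char **cursor)
{
    struct token t = { TOK_EOF, NULL, 0, 0 };
    const char *p = *cursor;

    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
        ++p;                                   /* strip white space */
    t.start = p;
    if (*p == '\0') {
        t.kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p))
            ++p;
        t.kind = TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            t.value = t.value * 10 + (*p++ - '0');
        t.kind = TOK_INT;
    } else {
        ++p;                                   /* skip the offending character */
        t.kind = TOK_ERROR;
    }
    t.length = (size_t)(p - t.start);
    *cursor = p;
    return t;
}
```

A parser would call `next_token` repeatedly until it receives `TOK_EOF`, so no intermediate token file is written.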

Automatic Generation of Lexical Analyzer

RE -> NFA -> DFA -> MFA -> LA

Regular Expressions

Regular expressions (RE) are defined by an alphabet (terminal symbols) and three operations:
- Alternation: RE1 | RE2
- Concatenation: RE1 RE2
- Repetition: RE*
  - Also called Kleene closure

Specifying REs in Unix Tools

Single characters    a b c d \x
Alternation          [bcd] [b-z] ab|cd
Any character        . (period)
Sequence             x* y+
Concatenation        abc[d-q]
Optional RE          [0-9]+(\.[0-9]*)?

Precedence in REs

Highest to lowest:
- Kleene closure
  - Left associative
- Concatenation
  - Left associative
- Alternation
  - Left associative

Examples of REs

- a*
- (a|b)*
- (ε|a|b)(a|b)(a|b)(a|b)*
- BEGIN | END | IF | THEN | ELSE
- letter(letter|digit)*
- (digit)(digit)*
- A|B|C|...|Y|Z
- 0|1|...|9
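Patterns written in the Unix notation can be tried out with the POSIX `regcomp` and `regexec` functions from `<regex.h>`. The wrapper `matches` below is a hypothetical helper for experimentation and assumes a POSIX system; it is not how flex itself matches input. The patterns are anchored with `^` and `$` so that the whole string must match.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the POSIX extended regular expression,
   0 if it does not, and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}
```

For example, `matches("^[0-9]+(\\.[0-9]*)?$", "45.23")` tests the optional-fraction pattern, and `matches("^[A-Za-z][A-Za-z0-9]*$", "SquareRoot")` tests letter(letter|digit)* written with character classes.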

Using flex

Flex source program (cpsl.l) -> Flex Compiler -> lex.yy.c (Unix) or lexyy.c (Windows)
lex.yy.c or lexyy.c -> C/C++ Compiler -> a.out
Input stream -> a.out -> Sequence of tokens

Format of flex File

{ definitions }
%%
{ rules }
%%
{ programmer subroutines }

Definition

Any combination of:
- Definitions: name space translation
- Included code: space code
- Included code: %{ code %}

Rules

Any number of rules of the form
  Expression { Action }
- Expression is a regular expression that describes the token (pattern for token)
- Action is C/C++ code to be executed when the pattern is matched
  - If it is more than a single statement, it should be enclosed in braces

Special variables in flex

- yytext
  - Variable where the lexeme is kept. It is a character string and is reused for every token
- yyleng
  - Length of the string in yytext
- yylval
  - Variable in which the value associated with the token (for example, the lexeme) is passed to the parser
- yywrap
  - Function called when EOF is encountered; a nonzero return value tells the scanner there is no more input

Example Input to flex

%{
#include <string.h>
#include "utility.h"
#include "pascal.tab.h"
%}
letter  [a-zA-Z]
digit   [0-9]
lord    [a-zA-Z0-9]
%%
BEGIN               {return(BEGINSY);}
END                 {return(ENDSY);}
WHILE               {return(WHILESY);}
...
{letter}({lord})*   {yylval.name_ptr = strdup(yytext); return(IDENTSY);}
({digit})+          {yylval.int_val = intnum(); return(CONSTANTSY);}
":="                {return(ASSIGNSY);}
":"                 {return(COLONSY);}
...
.                   {error("Illegal character");}
%%
int intnum ()
/* convert character string into an integer */
{
...
} /* intnum */
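The body of `intnum` is elided in the example. As a rough idea of what such a conversion involves, here is a standalone sketch that converts a digit string passed as an argument. The name `digits_to_int`, its parameter, and the -1 error convention are assumptions; the real `intnum` takes no arguments and presumably reads `yytext`.

```c
#include <limits.h>

/* Hypothetical stand-in for the conversion step: turn a string of
   decimal digits, as matched by ({digit})+, into an int.
   Returns -1 if the string is empty, contains a non-digit, or the
   value would not fit in an int. */
int digits_to_int(const char *text)
{
    int value = 0;

    if (*text == '\0')
        return -1;
    for (; *text != '\0'; ++text) {
        int d;
        if (*text < '0' || *text > '9')
            return -1;
        d = *text - '0';
        if (value > (INT_MAX - d) / 10)
            return -1;                 /* value * 10 + d would overflow */
        value = value * 10 + d;
    }
    return value;
}
```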

Finite Automata

[Figure: example DFA with states 0-5 and its transition table over the inputs (, *, ), and Other]

Formal Definition

A deterministic finite-state automaton, or DFA, is a five-tuple $M = (Q, \Sigma, \delta, q_0, F)$
1. $Q$ is a finite set of states
2. $\Sigma$ is the alphabet of the machine
3. $\delta$ is the state transition function
4. $q_0 \in Q$ is the start state
5. $F \subseteq Q$ are the final states

Configuration for a FSM

A configuration is designated $(q, \omega)$, where $q$ is a state and $\omega$ is the string remaining
- $(q_0, \omega)$: initial configuration
- $(q, \epsilon)$: final configuration if $q \in F$
- $\vdash$ indicates a move

A move is made such that the following is true:
$(q, a\omega) \vdash (q', \omega)$ iff $a \in \Sigma$, $\omega \in \Sigma^*$, and $q' \in \delta(q, a)$

The language $L(M)$ for an FSM is described as follows:
$L(M) = \{\omega \in \Sigma^* \mid (q_0, \omega) \vdash^* (q, \epsilon) \text{ for some } q \in F\}$
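The five-tuple maps directly onto a table-driven recognizer: the states and alphabet become array indices, the transition function becomes a two-dimensional table, and the final states become a flag array. The sketch below assumes a small DFA for letter(letter|digit)*; the state names, the character classes, and the explicit dead state are choices made for this example.

```c
#include <ctype.h>

enum { N_STATES = 3, N_CLASSES = 3 };
enum { LETTER, DIGIT, OTHER };           /* input classes standing in for the alphabet */
enum { START, IN_IDENT, DEAD };          /* the states; START is the start state */

/* Transition function as a table: delta[state][class] is the next state. */
static const int delta[N_STATES][N_CLASSES] = {
    /* START    */ { IN_IDENT, DEAD,     DEAD },
    /* IN_IDENT */ { IN_IDENT, IN_IDENT, DEAD },
    /* DEAD     */ { DEAD,     DEAD,     DEAD },
};

/* Final states: only IN_IDENT accepts. */
static const int is_final[N_STATES] = { 0, 1, 0 };

static int classify(char c)
{
    if (isalpha((unsigned char)c)) return LETTER;
    if (isdigit((unsigned char)c)) return DIGIT;
    return OTHER;
}

/* Make one move per input character, then accept if the machine is in
   a final state once the remaining string is empty. */
int dfa_accepts(const char *w)
{
    int q = START;
    for (; *w != '\0'; ++w)
        q = delta[q][classify(*w)];
    return is_final[q];
}
```

Each loop iteration corresponds to one move between configurations, consuming the first character of the remaining string.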

Finite Automata

Consider the two FAs M2 and M3 shown
- What are L(M2) and L(M3)?
- What is important about M3? Why is it important?

[Figures: M2 and M3]

NFAs and DFAs

Difference between:
- NFA: arbitrary choices permitted in transitions
- DFA: no choice allowed on any move

Another difference. For an NFA:
- Given a terminal symbol, there may be a choice of which state to go to
- There may be empty moves
  - An empty move doesn't consume input

Algorithm to take an RE into a NFA

One construction for each case:
- For ε
- For a
- For A|B
- For AB
- For A*

[Figures: the NFA fragment built for each case]

Applying the Previous Algorithm

Construct the NFA for the RE (ab|aba)*

NFA to DFA

Applying Algorithm

State   Old States             Input   New State
A       {0,1,2,5,10}           a       B
B       {3,6}                  b       C
C       {1,2,4,5,7,9,10}       a       D
D       {1,2,3,5,6,8,9,10}     a       B
                               b       C
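The state sets A to D can be reproduced with a small subset construction. The NFA figure for (ab|aba)* is not available here, so the transitions below are an assumption: they were chosen to be consistent with the listed state sets (empty moves 0->1, 0->10, 1->2, 1->5, 4->9, 8->9, 9->1, 9->10; on a: 2->3, 5->6, 7->8; on b: 3->4, 6->7). Sets of NFA states are stored as bit masks.

```c
enum { N_NFA = 11, MAX_DFA = 16 };

/* Assumed NFA for (ab|aba)*: bit i of a mask stands for NFA state i. */
static const unsigned eps[N_NFA] = {
    /* 0 */ 1u << 1 | 1u << 10, /* 1 */ 1u << 2 | 1u << 5, 0, 0,
    /* 4 */ 1u << 9, 0, 0, 0, /* 8 */ 1u << 9, /* 9 */ 1u << 1 | 1u << 10, 0
};
static const int on_a[N_NFA] = { -1, -1,  3, -1, -1,  6, -1,  8, -1, -1, -1 };
static const int on_b[N_NFA] = { -1, -1, -1,  4, -1, -1,  7, -1, -1, -1, -1 };

/* Add every state reachable through empty moves until nothing changes. */
static unsigned closure(unsigned set)
{
    unsigned prev;
    int s;
    do {
        prev = set;
        for (s = 0; s < N_NFA; ++s)
            if (set & (1u << s))
                set |= eps[s];
    } while (set != prev);
    return set;
}

/* States reachable from set by consuming one input symbol. */
static unsigned move(unsigned set, const int *on_symbol)
{
    unsigned out = 0;
    int s;
    for (s = 0; s < N_NFA; ++s)
        if ((set & (1u << s)) && on_symbol[s] >= 0)
            out |= 1u << on_symbol[s];
    return out;
}

unsigned dfa_set[MAX_DFA];      /* NFA-state set behind each DFA state */
int dfa_next[MAX_DFA][2];       /* [state][0 = a, 1 = b]; -1 means no move */

/* Build the DFA; DFA state 0 is the closure of NFA state 0.
   Returns the number of DFA states found. */
int subset_construction(void)
{
    const int *symbols[2] = { on_a, on_b };
    int count = 1, done, k, j;

    dfa_set[0] = closure(1u << 0);
    for (done = 0; done < count; ++done) {
        for (k = 0; k < 2; ++k) {
            unsigned target = closure(move(dfa_set[done], symbols[k]));
            dfa_next[done][k] = -1;
            if (target == 0)
                continue;                      /* no transition on this symbol */
            for (j = 0; j < count && dfa_set[j] != target; ++j)
                ;                              /* look for an existing state */
            if (j == count)
                dfa_set[count++] = target;     /* new DFA state */
            dfa_next[done][k] = j;
        }
    }
    return count;
}
```

With these assumed transitions, DFA states 0, 1, 2, 3 correspond to A, B, C, D, and the moves found are the five rows of the table.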

DFA to MFA

Applying MFA Algorithm

Initial partitioning of states:
- F = {A, C, D}
- N = {B}

[Figure: final MFA]