Role of the Lexical Analyzer

Groups input characters into tokens
An expensive phase of the compiler
Ways to construct:

Scanner generator (flex)
Hand written in a high-level language
Hand written in assembly language

Lexical Analysis

CS 5300 - SJAllan

Returns a token to the syntax analyzer
Strips white space
Keeps track of line numbers
Generates output listing with errors marked
Deletes comments
Expands macros, if the language has them
Converts numbers to internal form

Terminology

Tokens
Lexemes
Patterns


Decomposition of Grammar

Determine what the lexical analyzer recognizes vs. what the syntax analyzer recognizes
Basic symbols

Delimiters
Identifiers
Constants

The Input

Sequence of characters

The Output

A series of tokens:

Punctuation: ( ) ; , [ ]
Operators: + - * :=
Keywords: begin end if while try
Identifiers: SquareRoot
String literals: "press Enter to continue"
Character literals: 'x'
Numeric literals:

Integer: 123
Floating point: 45.23e+2
Based representation: 0xaa

Free form languages (most modern ones)

White space does not matter. Ignore these:

Tabs, spaces, new lines, carriage returns

Languages where layout is critical

Fortran: label in columns 1-5, continuation mark in column 6
COBOL: areas A and B
Lexical analyzer must know about layout to find tokens

Punctuation: Separators

Typically individual special characters such as { } ;

Sometimes double characters: lexical scanner looks for longest token:

(*, /* -- comment openers in various languages
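The longest-token rule can be sketched in C as follows. This is a minimal illustration, not a complete scanner: the operator table and the function name match_operator are assumptions made for the example.

```c
#include <string.h>

/* Hypothetical operator table for illustration only. */
static const char *const OPERATORS[] = {
    ":=", "=>", "==", "<=", ":", "=", "<", "+", "-", "*"
};

/* Return the length of the longest operator that starts at s,
   or 0 if no operator starts there. */
size_t match_operator(const char *s)
{
    size_t best = 0;
    for (size_t i = 0; i < sizeof OPERATORS / sizeof OPERATORS[0]; i++) {
        size_t n = strlen(OPERATORS[i]);
        if (n > best && strncmp(s, OPERATORS[i], n) == 0)
            best = n;               /* keep the longest match seen so far */
    }
    return best;
}
```

With this table, the input ":= 1" yields a two-character token rather than ":" followed by "=".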

Operators

Like punctuation

No real difference for lexical analyzer
Typically single or double special chars

Operators: + - == <=
Operations: := =>

Returned as kind of token, and perhaps location for error messages and debugging purposes

Keywords

Reserved identifiers

E.g. BEGIN END in Pascal, if in C, catch in C++
Returned as kind of token

With possible location information
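One way to recognize keywords is to scan every word as an identifier and then look it up in a table of reserved words. The sketch below assumes a case-insensitive language such as Pascal; the token kinds and the name classify_word are made up for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for illustration. */
enum TokenKind { IDENT, KW_BEGIN, KW_END, KW_IF, KW_WHILE };

struct Keyword { const char *text; enum TokenKind kind; };

static const struct Keyword KEYWORDS[] = {
    { "BEGIN", KW_BEGIN }, { "END", KW_END },
    { "IF", KW_IF },       { "WHILE", KW_WHILE },
};

/* Compare two strings ignoring case; returns 1 when they are equal. */
static int equal_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;            /* equal only if both strings ended together */
}

/* Return the keyword's token kind, or IDENT if the word is not reserved. */
enum TokenKind classify_word(const char *lexeme)
{
    for (size_t i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++)
        if (equal_ignore_case(lexeme, KEYWORDS[i].text))
            return KEYWORDS[i].kind;
    return IDENT;
}
```

A linear search is enough for a handful of keywords; a larger table would usually be sorted or hashed.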

Identifiers

Rules differ

Length
Allowed characters
Separators

Lexical analyzer returns

Token kind
Name of the identifier

String Literals

Text must be stored
Actual characters are important

Not like identifiers: must preserve case
Table needed

We will use a linked list

Lexical analyzer returns

String constant token
Actual string

Character Literals

Similar issues to string literals
Lexical analyzer returns

Token kind
Identity of character

Numeric Literals

Integer

Return the integer constant token
Return the value of the integer constant
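Converting the lexeme of an integer constant to its internal form might look like the sketch below. The function name to_int and its return convention are assumptions for the example; it is not the intnum routine of the flex example, whose body is not shown in these notes.

```c
#include <ctype.h>
#include <limits.h>

/* Convert a string of decimal digits to an int.
   Returns 1 and stores the value in *out on success.
   Returns 0 if the text is empty, contains a non-digit,
   or does not fit in an int. */
int to_int(const char *text, int *out)
{
    int value = 0;
    if (*text == '\0')
        return 0;
    for (; *text; text++) {
        if (!isdigit((unsigned char)*text))
            return 0;
        int digit = *text - '0';
        if (value > (INT_MAX - digit) / 10)
            return 0;           /* value * 10 + digit would overflow */
        value = value * 10 + digit;
    }
    *out = value;
    return 1;
}
```

Checking for overflow before multiplying lets the scanner report an over-long constant as a lexical error instead of silently wrapping.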

Handling Comments

Comments have no effect on program
Are eliminated by scanner
Error detection issues

E.g. unclosed comments
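A sketch of comment elimination with detection of an unclosed comment, assuming Pascal-style (* ... *) comments held in a NUL-terminated buffer. The name skip_comment and the NULL return convention are choices made for this example.

```c
#include <stddef.h>

/* s points at the opening "(*".
   Returns a pointer just past the closing "*)", or NULL if the input
   ends first (an unclosed comment).
   *line is incremented for every newline inside the comment, so line
   numbers stay correct for later error messages. */
const char *skip_comment(const char *s, int *line)
{
    s += 2;                                 /* step over "(*" */
    while (*s != '\0') {
        if (s[0] == '*' && s[1] == ')')
            return s + 2;
        if (*s == '\n')
            (*line)++;
        s++;
    }
    return NULL;                            /* reached end of input */
}
```

On a NULL result the scanner can report the line where the comment started, which is usually more helpful than an error at end of file.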

Case Equivalence

Some languages are case-insensitive

Pascal, Ada

Some are not

C, Java

Performance Issues

Speed

Lexical analysis can become bottleneck
Minimize processing per character

Skip blanks fast
I/O is also an issue (read large blocks)

We compile frequently

Compilation time is important

Especially during development
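To keep the work per character small, white space can be skipped with a single table lookup per character over a buffer that was filled in one large read (for example with fread). The sketch below shows only the skipping step; the table and function names are assumptions for the example.

```c
#include <stddef.h>

/* 1 for space, tab, newline and carriage return, 0 otherwise. */
static unsigned char is_blank[256];

void init_blank_table(void)
{
    is_blank[' '] = is_blank['\t'] = is_blank['\n'] = is_blank['\r'] = 1;
}

/* Advance past white space in buf[pos..len) and return the new position. */
size_t skip_blanks(const unsigned char *buf, size_t pos, size_t len)
{
    while (pos < len && is_blank[buf[pos]])
        pos++;
    return pos;
}
```

The same table idea extends to other character classes (letters, digits), so the inner loop of the scanner avoids chains of comparisons.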

General Approach

Define set of token kinds:

An enumeration type
Integers
Some tokens carry associated data

Identifier: name of the identifier
Constant: value of the constant

Either: Convert entire file to a file of tokens

Lexical analyzer is separate phase

Or: Parser calls the lexical analyzer for each token as it is needed

This approach avoids extra I/O
Parser builds tree incrementally, using successive tokens as tree nodes
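A minimal sketch of token kinds with associated data, together with a scanner that hands out one token per call. All names here (TokenKind, Token, Scanner, next_token) and the 31-character limit on identifier names are assumptions for the example, not part of any particular compiler.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum TokenKind { TOK_EOF, TOK_IDENT, TOK_INT, TOK_ASSIGN, TOK_PUNCT };

struct Token {
    enum TokenKind kind;
    int line;                 /* location, for error messages */
    union {
        char name[32];        /* TOK_IDENT: name of the identifier */
        int value;            /* TOK_INT: value of the constant */
        char ch;              /* TOK_PUNCT: the character itself */
    } data;
};

struct Scanner { const char *pos; int line; };

/* Return the next token from the NUL-terminated input held by sc. */
struct Token next_token(struct Scanner *sc)
{
    struct Token t;
    memset(&t, 0, sizeof t);
    while (isspace((unsigned char)*sc->pos)) {      /* strip white space */
        if (*sc->pos == '\n')
            sc->line++;
        sc->pos++;
    }
    t.line = sc->line;
    unsigned char c = (unsigned char)*sc->pos;
    if (c == '\0') {
        t.kind = TOK_EOF;
    } else if (isalpha(c)) {                        /* letter(letter|digit)* */
        size_t n = 0;
        while (isalnum((unsigned char)*sc->pos)) {
            if (n < sizeof t.data.name - 1)         /* longer names are truncated */
                t.data.name[n++] = *sc->pos;
            sc->pos++;
        }
        t.data.name[n] = '\0';
        t.kind = TOK_IDENT;
    } else if (isdigit(c)) {                        /* digit digit* */
        char *end;
        t.kind = TOK_INT;
        t.data.value = (int)strtol(sc->pos, &end, 10);
        sc->pos = end;
    } else if (c == ':' && sc->pos[1] == '=') {     /* longest match for := */
        t.kind = TOK_ASSIGN;
        sc->pos += 2;
    } else {
        t.kind = TOK_PUNCT;
        t.data.ch = (char)c;
        sc->pos++;
    }
    return t;
}
```

A parser would call next_token repeatedly until it sees TOK_EOF, so no intermediate token file is written.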

RE -> NFA -> DFA -> MFA -> LA

Regular Expressions

Regular expressions (RE) defined by an alphabet (terminal symbols) and three operations:

Alternation: RE1 | RE2
Concatenation: RE1 RE2
Repetition: RE*

Also called Kleene closure

RE
Single characters: a b c d \x
Alternation: [bcd] [b-z] ab|cd
Any character: . (period)
Sequence: x* y+
Concatenation: abc[d-q]
Optional: [0-9]+(\.[0-9]*)?

Precedence in REs

Highest to lowest

Kleene closure

Left associative

Concatenation

Left associative

Alternation

Left associative

Examples of REs

a*
(a|b)*
(ε|a|b)(a|b)(a|b)(a|b)*
BEGIN | END | IF | THEN | ELSE
letter(letter|digit)*
(digit)(digit)*
letter: A|B|C|...|Y|Z
digit: 0|1|...|9
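REs like these can be tried out with the POSIX regex functions in regex.h. The helper below is a sketch for experimenting, not how flex works internally. The name matches is an assumption, character classes stand in for letter and digit, and the caller anchors the pattern with ^ and $ so that the whole string must match.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the POSIX extended RE in pattern,
   0 if it does not, and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int result = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return result;
}
```

For example, matches("^[A-Za-z][A-Za-z0-9]*$", "x25") checks a string against letter(letter|digit)*.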

Using flex

Flex source program (cpsl.l) -> Flex compiler -> lex.yy.c (Unix) or lexyy.c (Windows)

lex.yy.c or lexyy.c -> C/C++ compiler -> a.out

Input stream -> a.out -> sequence of tokens

{ definitions }
%%
{ rules }
%%
{ programmer subroutines }

Definition

Any combination of:

Definitions: name space translation
Included code: space code
Included code: %{ code %}

Rules

Any number of rules of the form

Expression { Action }
Expression is a regular expression that describes the token (pattern for token)
Action is C/C++ code to be executed when the pattern is matched

If it is more than a single statement, it should be enclosed in braces

yytext

Variable where the lexeme is kept. A character string and is reused for every token

yyleng

Length of the string in yytext

yylval

Variable in which the token's attribute (for example the lexeme or its converted value) is returned to the parser

yywrap

Function called when EOF is encountered

%{
#include <string.h>
#include "utility.h"
#include "pascal.tab.h"
%}
letter [a-zA-Z]
digit [0-9]
lord [a-zA-Z0-9]
%%
BEGIN {return(BEGINSY);}
END {return(ENDSY);}
WHILE {return(WHILESY);}
...
{letter}({lord})* {yylval.name_ptr = strdup(yytext); return(IDENTSY);}
({digit})+ {yylval.int_val = intnum(); return(CONSTANTSY);}
":=" {return(ASSIGNSY);}
":" {return(COLONSY);}
...
. {error("Illegal character");}
%%
int intnum () /* convert character string into an integer */
{
...
} /* intnum */

Finite Automata

Example DFA: state diagram with states 0 to 5
Transition table: one row per state, one column for each of the inputs ( * ) and Other

Formal Definition

A deterministic finite-state automaton, or DFA, is a five-tuple M = (Q, Σ, δ, q0, F)

1. Q is a finite set of states
2. Σ is the alphabet of the machine
3. δ is the state transition function
4. q0 ∈ Q is the start state
5. F ⊆ Q is the set of final states

A configuration is designated (q, w), where q is a state and w is the string remaining

(q0, w) is the initial configuration
(q, ε) is a final configuration if q ∈ F
⊢ indicates a move

(q, aw) ⊢ (q', w) iff a ∈ Σ, w ∈ Σ*, and q' = δ(q, a)
The language L(M) of an FSM is described as follows:
L(M) = {w ∈ Σ* | (q0, w) ⊢* (q, ε) for some q ∈ F}
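The five-tuple maps directly onto a table-driven recognizer. The sketch below is a hypothetical DFA for letter(letter|digit)*, with the alphabet grouped into three character classes and an explicit dead state. The state and function names are assumptions for the example.

```c
#include <ctype.h>

enum { START, IN_IDENT, DEAD, NUM_STATES };     /* Q, with q0 = START */
enum { LETTER, DIGIT, OTHER, NUM_CLASSES };     /* alphabet, as character classes */

/* The state transition function, stored as a table. */
static const int DELTA[NUM_STATES][NUM_CLASSES] = {
    /* START    */ { IN_IDENT, DEAD,     DEAD },
    /* IN_IDENT */ { IN_IDENT, IN_IDENT, DEAD },
    /* DEAD     */ { DEAD,     DEAD,     DEAD },
};

/* F = { IN_IDENT } */
static const int IS_FINAL[NUM_STATES] = { 0, 1, 0 };

static int char_class(unsigned char c)
{
    return isalpha(c) ? LETTER : isdigit(c) ? DIGIT : OTHER;
}

/* Run the DFA over w, one move per input character, and report
   whether the state reached when the input is used up is final. */
int dfa_accepts(const char *w)
{
    int q = START;
    for (; *w; w++)
        q = DELTA[q][char_class((unsigned char)*w)];
    return IS_FINAL[q];
}
```

Each loop iteration corresponds to one move between configurations, and the final test corresponds to reaching (q, ε) with q in F.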

Finite Automata

Consider the two FAs M2 and M3 shown

What are L(M2) and L(M3)?
What is important about M3?
Why is it important?

Difference between:

NFA: arbitrary choices permitted in transitions
DFA: no choice allowed on any move

M2

Given a terminal symbol, there may be a choice of which state to go to
There may be empty moves

An empty move doesn't consume input

M3

NFA construction for each form of RE:
For ε
For a
For A|B
For AB
For A*

Construct the NFA for the RE (ab|aba)*

NFA to DFA

Applying Algorithm

State   Input   New State
A       a       B
B       b       C
C       a       D
D       a       B
D       b       C
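The subset construction for (ab|aba)* can be sketched with bitmasks. The NFA used here is a small hand-built one with ε-moves, an assumption for this example and not necessarily the NFA drawn on the slide. Its states are 0 to 3, with 0 both the start state and the only final state.

```c
#define NFA_STATES 4
#define SYMBOLS 2          /* 0 = 'a', 1 = 'b' */
#define MAX_DFA 16         /* at most 2^NFA_STATES subsets */

/* Hypothetical NFA for (ab|aba)*:
   0 -a-> 1, 1 -b-> 2, 2 -a-> 3, 2 -eps-> 0, 3 -eps-> 0.
   A set of NFA states is a bitmask: bit i set means state i is in the set. */
static const unsigned MOVE[NFA_STATES][SYMBOLS] = {
    { 1u << 1, 0 }, { 0, 1u << 2 }, { 1u << 3, 0 }, { 0, 0 },
};
static const unsigned EPS[NFA_STATES] = { 0, 0, 1u << 0, 1u << 0 };

/* Add every state reachable by empty moves alone. */
static unsigned closure(unsigned set)
{
    unsigned prev;
    do {
        prev = set;
        for (int i = 0; i < NFA_STATES; i++)
            if (set & (1u << i))
                set |= EPS[i];
    } while (set != prev);
    return set;
}

/* States reachable from set by consuming one symbol. */
static unsigned move(unsigned set, int sym)
{
    unsigned out = 0;
    for (int i = 0; i < NFA_STATES; i++)
        if (set & (1u << i))
            out |= MOVE[i][sym];
    return out;
}

unsigned dfa_set[MAX_DFA];          /* NFA-state set behind each DFA state */
int dfa_next[MAX_DFA][SYMBOLS];     /* DFA transition table, -1 = no move */

/* Build the DFA and return its number of states; DFA state 0 is the start. */
int subset_construction(void)
{
    int count = 0;
    dfa_set[count++] = closure(1u << 0);
    for (int d = 0; d < count; d++) {           /* dfa_set doubles as work list */
        for (int sym = 0; sym < SYMBOLS; sym++) {
            unsigned target = closure(move(dfa_set[d], sym));
            int t = -1;
            if (target != 0) {
                for (t = 0; t < count && dfa_set[t] != target; t++)
                    ;                           /* look for an existing state */
                if (t == count)
                    dfa_set[count++] = target;  /* new DFA state */
            }
            dfa_next[d][sym] = t;
        }
    }
    return count;
}
```

For this NFA the construction yields four DFA states, numbered 0 to 3 in order of discovery, whose transitions correspond to states A, B, C and D in the table above. The DFA states whose set contains NFA state 0 are the final ones.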

DFA to MFA

F = {A, C, D}
N = {B}

Initial partitioning of states
Final MFA
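Minimization by partition refinement can be sketched as follows. The table encodes the DFA with states A to D for (ab|aba)*, plus an explicit dead state E added here so that every state has a move on every symbol. The refinement loop is a simple quadratic version chosen for clarity; all names are assumptions for the example.

```c
#define N 5                 /* A, B, C, D and an added dead state E */
#define SYMBOLS 2           /* 0 = 'a', 1 = 'b' */

static const int NEXT[N][SYMBOLS] = {
    /* A */ { 1, 4 },
    /* B */ { 4, 2 },
    /* C */ { 3, 4 },
    /* D */ { 1, 2 },
    /* E */ { 4, 4 },
};
static const int FINAL[N] = { 1, 0, 1, 1, 0 };

int block[N];               /* block[s] = partition block that holds state s */

/* Refine the partition until it stops changing; return the block count. */
int minimize(void)
{
    int count = 2, changed;
    for (int s = 0; s < N; s++)
        block[s] = FINAL[s];            /* initial partition: N versus F */
    do {
        int next_block[N];
        int new_count = 0;
        for (int s = 0; s < N; s++) {
            int t;
            /* Look for an earlier state in the same block whose moves
               lead to the same blocks on every symbol. */
            for (t = 0; t < s; t++) {
                int same = block[t] == block[s];
                for (int a = 0; same && a < SYMBOLS; a++)
                    same = block[NEXT[t][a]] == block[NEXT[s][a]];
                if (same)
                    break;
            }
            next_block[s] = (t < s) ? next_block[t] : new_count++;
        }
        changed = new_count != count;   /* refinement only ever splits blocks */
        count = new_count;
        for (int s = 0; s < N; s++)
            block[s] = next_block[s];
    } while (changed);
    return count;
}
```

States that end up in the same block are merged into one MFA state. For this particular table every state ends up in a block of its own, so under these assumptions the DFA is already minimal.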

