Professional Documents
Culture Documents
A motivating example
Create a program that counts the number of lines in a given input text file
Solution
int num_lines = 0; %% \n ++num_lines; . ; %% main() { yylex(); printf( "# of lines = %d\n", num_lines); }
Solution
int num_lines = 0; initial %% \n ++num_lines; . ; %% main() { yylex(); printf( "# of lines = %d\n", num_lines); } newline ;
Outline
Roles of lexical analysis What is a token Regular expressions and regular descriptions Lexical analysis Automatic Creation of Lexical Analysis Error Handling
Fin. Assembly
Example Tokens
Type ID NUM REAL IF COMMA NOTEQ LPAREN RPAREN Examples foo n_14 last 73 00 517 082 66.1 .5 10. 1e67 5.5e-10 if , != ( )
Example NonTokens
Type comment preprocessor directive macro whitespace Examples /* ignored */ #include <foo.h> #define NUMS 5, 6 NUMS \t \n \b
Example
void match0(char *s) /* find a zero */ { if (!strncmp(s, 0.0, 3)) return 0. ; } VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN LBRACE IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE EOF
What is a token?
Defined by the programming language Can be separated by spaces Smallest units Defined by regular expressions
Regular Expressions
Regular Descriptions
EBNF where non-terminals are fully defined before first use
letter p[a-zA-Z] digit p[0-9] underscore p_ letter_or_digit p letter|digit underscored_tail p underscore letter_or_digit+ identifier p letter letter_or_digit* underscored_tail token description
A token name A regular expression
Flex
Input
regular expressions and actions (C code)
Output
A scanner program that reads the input and applies actions when input regular expression is matched
regular expressions
flex
input program scanner tokens
Automatic Creation of Efficient Scanners Nave approach on regular expressions (dotted items) Construct non deterministic finite automaton over items Convert to a deterministic Minimize the resultant automaton Optimize (compress) representation
Dotted Items
Example
T p a+ b+ Input aab After parsing aa
T p a+ b+
Item Types
Shift item
In front of a basic pattern A p (ab)+ c (de|fe)*
Reduce item
At the end of rhs A p (ab)+ c (de|fe)*
Basic item
Shift or reduce items
Character Moves
For shift items character moves are simple
TpEcF Digit p [0-9] c 7 c 7 TpEcF T p [0-9]
I Moves
For non-shift items the situation is more complicated What character do we need to see? Where are we in the matching? T p a* T p (a*)
I1
Slide 27 I1
IBM, 11/11/2003
I2
I Moves
Slide 28 I2
IBM, 11/11/2003
I p [0-9]+ F p[0-9]*.[0-9]+
Input 3.1;
F p ([0-9])*.([0-9])+ F p ([0-9])*.([0-9])+ F p ( [0-9] )*.([0-9])+ F p ( [0-9])*.([0-9])+ F p ( [0-9])* .([0-9])+ F p ( [0-9])*. ([0-9])+ F p ( [0-9])*. ([0-9])+ F p ( [0-9])*. ( [0-9] )+ F p ( [0-9])*. ( [0-9])+ F p ( [0-9])*. ( [0-9])+ F p ( [0-9])* .([0-9])+
Concurrent Search
How to scan multiple token classes in a single run?
Input 3.1;
F p ([0-9])* .([0-9])+
F p ( [0-9])* .([0-9])+
F p ( [0-9])*. ([0-9])+
I3
I Moves
Slide 34 I3
IBM, 11/11/2003
Ip ([0-9]+)
Fp ([0-9]*).[0-9]+ Fp ( [0-9]*) .[0-9]+
[0-9]
Fp ( [0-9] *).[0-9]+
.
Fp [0-9]*. ([0-9]+)
Fp [0-9]*. ( [0-9] +)
Ip ( [0-9])+
Fp [0-9]*. ( [0-9] +)
Efficient Scanners
Construct Deterministic Finite Automaton
Every state is a set of items Every transition is followed by an I-closure When a set contains two reduce items select the one declared first
Compress representation
I p [0-9]+ F p[0-9]*.[0-9]+
[0-9] Ip ( [0-9])+ Fp ( [0-9] *).[0-9]+ Ip ( [0-9])+ Ip ([0-9])+ Fp ([0-9]*).[0-9]+ Fp ( [0-9]*) .[0-9]+ [0-9]
.|\n [^0-9.]
Sink
[^0-9]
Scanning 3.1;
input q3.1; 3 q.1; 3. q1; 3.1 q; state 1 2 3 4 next state 2 3 4 Sink last token I I F F
[0-9] [0-9] [^0-9.]
1
[0-9] [^0-9.] . .
Sink
2 I
3
[0-9] [^0-9]
4 F
Scanning aaa
[^a]
[.\n]
T1 p a+; T2 pa input qaaa$ a q aa$ a a q a$ aaaq$ state 1 2 4 4 next state 2 4 4 Sink last token T1 T1 T1
1
[a] [^a;]
Sink
[.\n] ; ;
T1
2
[a]
3 T1
[^a;]
[a]
T1
Error Handling
Illegal symbols Common errors
Missing
Creating a lexical analysis by hand Table compression Symbol Tables Handling Macros Start states Nested comments
Summary
For most programming languages lexical analyzers can be easily constructed automatically Exceptions:
Fortran PL/1