Lexical Analysis

Textbook:Modern Compiler Design Chapter 2.1

A motivating example
‡ Create a program that counts the number of lines in a given input text file

Solution
int num_lines = 0; %% \n ++num_lines; . ; %% main() { yylex(); printf( "# of lines = %d\n", num_lines); }

Solution
int num_lines = 0; initial %% \n ++num_lines; . ; %% main() { yylex(); printf( "# of lines = %d\n", num_lines); } newline ;

Outline ‡ ‡ ‡ ‡ ‡ ‡ Roles of lexical analysis What is a token Regular expressions and regular descriptions Lexical analysis Automatic Creation of Lexical Analysis Error Handling .

Assembly .Basic Compiler Phases Source program (string) Front-End lexical analysis Tokens syntax analysis Abstract syntax tree semantic analysis Annotated Abstract syntax tree Back-End Fin.

Example Tokens Type ID NUM REAL IF COMMA NOTEQ LPAREN RPAREN Examples foo n_14 last 73 00 517 082 66.5e-10 if . != ( ) .5 10. 1e67 5.1 .

Example NonTokens Type comment preprocessor directive macro whitespace Examples /* ignored */ #include <foo. 6 NUMS \t \n \b .h> #define NUMS 5.

} VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN LBRACE IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.Example void match0(char *s) /* find a zero */ { if (!strncmp(s.0) SEMI RBRACE EOF .0´. . 3)) return 0.0) COMMA NUM(3) RPAREN RPAREN RETURN REAL(0. ³0.

Lexical Analysis (Scanning) ‡ input ± program text (file) ‡ output ± sequence of tokens ‡ Read input file ‡ Identify language keywords and standard identifiers ‡ Handle include files and macros ‡ Count line numbers ‡ Remove whitespaces ‡ Report illegal symbols ‡ Produce symbol table .

Why Lexical Analysis ‡ Simplifies the syntax analysis ± And language definition ‡ Modularity ‡ Reusability ‡ Efficiency .

What is a token? ‡ ‡ ‡ ‡ Defined by the programming language Can be separated by spaces Smallest units Defined by regular expressions .

case '=¶ return PlusEqual. case `+`: c = ungetc() . loop: c = getchar(). case `. switch (c) { case `+': return PlusPlus . default: ungetc(c).`: return SemiColumn. switch (c){ case ` `:goto loop .A simplified scanner for C Token nextToken() { char c . case `<`: case `w`: } } . return Plus.

Regular Expressions .

Escape characters in regular expressions ‡ \ converts a single operator into text ± a ± (a\*)+ ‡ Double quotes surround text ± ³a+*´+ ‡ Esthetically ugly ‡ But standard .

Regular Descriptions ‡ EBNF where non-terminals are fully defined before first use letter p[a-zA-Z] digit p[0-9] underscore p_ letter_or_digit p letter|digit underscored_tail p underscore letter_or_digit+ identifier p letter letter_or_digit* underscored_tail ‡ token description ± A token name ± A regular expression .

value) ‡ Ambiguity resolution ± The longest matching token ± Between two equal length tokens select the first .The Lexical Analysis Problem ‡ Given ± A set of token descriptions ± An input string ‡ Partition the strings into tokens (class.

´ { return SemiColumn.} ³+´ { return Plus} ³while´ { return While .} ³++´ { return PlusPlus .} [\n] {line_count++.} ³<=´ { return LessOrEqual. } {Letter}({Letter}|{Digit})* { return Id .} ³.A Flex specification of C Scanner Letter [a-zA-Z_] Digit [0-9] %% [ \t] {.} ³+=´ { return PlusEqual .} .} ³<´ { return LessThen .

Flex ‡ Input ± regular expressions and actions (C code) ‡ Output ± A scanner program that reads the input and applies actions when input regular expression is matched regular expressions flex input program scanner tokens .

Naïve Lexical Analysis .

Automatic Creation of Efficient Scanners ‡ Naïve approach on regular expressions (dotted items) ‡ Construct non deterministic finite automaton over items ‡ Convert to a deterministic ‡ Minimize the resultant automaton ‡ Optimize (compress) representation .

Dotted Items .

Example ‡ T p a+ b+ ‡ Input µaab¶ ‡ After parsing aa ± T p a+ ™ b+ .

Item Types ‡ Shift item ± ™ In front of a basic pattern ± A p (ab)+ ™ c (de|fe)* ‡ Reduce item ± ™ At the end of rhs ± A p (ab)+ c (de|fe)* ™ ‡ Basic item ± Shift or reduce items .

Character Moves ‡ For shift items character moves are simple TpE™cF Digit p ™ [0-9] c 7     c 7 TpEc™F T p [0-9] ™ .

I Moves ‡ For non-shift items the situation is more complicated ‡ What character do we need to see? ‡ Where are we in the matching? T p ™ a* T p ™ (a*) .

I1 I Moves for Repetitions ‡ Where can we get from T p E™ (R)* F ‡ If R occurs zero times T p E (R)* ™ F ‡ If R occurs one or more times T p E (™ R)* F ± When R ends E ( R™ )* F ‡ E (R)* ™ F ‡ E (™ R)* F .

Slide 27 I1 IBM. 11/11/2003 .

I2 I Moves .

11/11/2003 .Slide 28 I2 IBM.

¶([0-9])+ F p (™ [0-9])*¶.¶([0-9])+ F p ( [0-9])*¶.¶ F p ™([0-9])*¶.¶ ( [0-9] ™)+ F p ( [0-9])*¶.¶([0-9])+ F p (™[0-9])*¶.¶ ( [0-9])+ ™ F p ( [0-9])* ™¶.I p [0-9]+ F p[0-9]*¶.¶ (™ [0-9])+ F p ( [0-9])*¶.¶ ™([0-9])+ F p ( [0-9])*¶.1.¶([0-9])+ .¶([0-9])+ F p ( [0-9])* ™¶.¶[0-9]+ Input µ3.¶([0-9])+ F p ( [0-9] ™)*¶.¶ (™[0-9])+ F p ( [0-9])*¶.

Concurrent Search ‡ How to scan multiple token classes in a single run? .

¶([0-9])+ F p ( [0-9])*¶.¶ ™([0-9])+ .¶([0-9])+ F p (™[0-9])*¶.¶[0-9]+ I p ™([0-9])+ I p (™[0-9])+ I p ( [0-9] ™)+ I p (™[0-9])+ I p ( [0-9])+ ™ Input µ3.¶([0-9])+ F p ( [0-9])* ™¶.¶([0-9])+ F p ( [0-9] ™)*¶.¶([0-9])+ F p ([0-9])* ™¶.¶ F p ™([0-9])*¶.I p [0-9]+ F p[0-9]*¶.1.¶([0-9])+ F p (™[0-9])*¶.

The Need for Backtracking ‡ A simple minded solution may require unbounded backtracking T1 p a+. T2 p a ‡ Quadratic behavior ‡ Does not occur in practice ‡ A linear solution exists .

a>   T p E c™ F ± For every I move construct an I transition ± The accepting states are the reduce items ± Accept the language defined by Ti .A Non-Deterministic Finite State Machine ‡ Add a production S¶ p T1 | T2 | « | Tn ‡ Construct NDFA over the items ± Initial state S¶ p ™ (T1 | T2 | « | Tn) ± For every character move. construct a character transition <T p E ™ c F.

I3 I Moves .

Slide 34 I3 IBM. 11/11/2003 .

¶ ™([0-9]+) Fp [0-9]*¶.´ Fp [0-9]*¶.¶ ( [0-9] ™ +) [0-9] Fp [0-9]*¶.¶[0-9]+ Ip™ ([0-9]+) Fp (™[0-9]*)¶.¶[0-9]+ ³.¶[0-9]+ Ip (™[0-9])+ [0-9] Ip ( [0-9]™)+ [0-9] Fp ( [0-9] ™ *)¶.¶ ( [0-9] +) ™ .I p [0-9]+ F p[0-9]*¶.¶[0-9]+ Sp™(I|F) Fp™ ([0-9]*)¶.¶[0-9]+ Fp ( [0-9]*) ™¶.¶ (™[0-9]+) Ip ( [0-9])+™ Fp [0-9]*¶.

Efficient Scanners ‡ Construct Deterministic Finite Automaton ± Every state is a set of items ± Every transition is followed by an I-closure ± When a set contains two reduce items select the one declared first ‡ Minimize the resultant automaton ± Rejecting states are initially indistinguishable ± Accepting states of the same token are indistinguishable ‡ Exponential worst case complexity ± Does not occur in practice ‡ Compress representation .

¶[0-9]+ Fp (™[0-9]*)¶.¶ [0-9]+ Fp ([0-9]*) ™¶.¶[0-9]+ [0-9] Sp™(I|F) Ip™ ([0-9]+) Ip (™[0-9])+ Fp™ ([0-9]*)¶.¶(™[0-9]+) Fp [0-9]*¶.] Sink ³.´ Fp [0-9] *¶.¶[0-9]+ [0-9] Ip ( [0-9]™)+ Fp ( [0-9] ™ *)¶.´ [0-9] [^0-9] [^0-9.¶( [0-9]+) ™ .¶ ™ ([0-9]+) Fp [0-9]*¶.¶(™[0-9]+) [^0-9] ³.I p [0-9]+ F p[0-9]*¶.] [0-9] Fp [0-9] *¶.|\n [^0-9.¶ [0-9]+ .¶[0-9]+ Ip ( [0-9])+ ™ Ip (™[0-9])+ Fp (™[0-9]*)¶.¶ ([0-9] ™ +) Fp [0-9]*¶.¶[0-9]+ Fp ( [0-9]*) ™¶.

. ch]. set Start of token to Read Index. Set Read Index To 1.repr to char[Start of token . set token . if accepting(state): set Class of last token to Class(state). set token .. set Read index to End last token + 1. Set state = H[state.A Linear-Time Lexical Analyzer IMPORT Input Char [1. End last token].. Procedure Get_Next_Token.class to Class of last token. set End of last token to Read Index set Read Index to Read Index + 1. set End of last token to uninitialized set Class of last token to uninitialized set State to Initial while state /= Sink: Set ch to Input Char[Read Index].].

3 q.1 q.Scanning ³3. q1. 3.1.1.] 1 [0-9] [^0-9. state 1 2 3 4 next state 2 3 4 Sink last token I I F F [0-9] [0-9] [^0-9.1. 3.´ input q3.´ Sink 2 I 3 [0-9] [^0-9] 4 F .] ³.´ ³.

Scanning ³aaa´ [^a] [.\n] ³.´ ³.] 4 [a] T1 .´ T2 pa input qaaa$ a q aa$ a a q a$ aaaq$ state 1 2 4 4 next state 2 4 4 Sink last token T1 T1 T1 1 [a] [^a.´ T1 2 [a] 3 T1 [^a.\n] T1 p a+´.] Sink [.

Error Handling ‡ Illegal symbols ‡ Common errors .

Missing ‡ ‡ ‡ ‡ ‡ ‡ Creating a lexical analysis by hand Table compression Symbol Tables Handling Macros Start states Nested comments .

Summary ‡ For most programming languages lexical analyzers can be easily constructed automatically ‡ Exceptions: ± Fortran ± PL/1 ‡ Lex/Flex/Jlex are useful beyond compilers .