
CS327 - Compilers

Lexical Analysis

Abhishek Bichhawat 12/01/2024


Lexical Analysis - Lexemes
● Divide code into lexical units
○ Partition input into lexemes (syntactic category)

if x == 0 then { y = 1 ; } else { z = 2 ; }
Lexical Analysis - Token Classes
● Divide code into lexical units
● Classify lexemes as per the role
○ Keywords, identifiers, numbers, parentheses, semicolons, whitespace, etc.
○ Classes correspond to sets of strings
■ E.g. Identifiers are alphanumeric strings starting with a letter
Numbers are strings consisting of digits
Keywords are specific words

if x0 == 0 then { x1 = 1; } else { x2 = 2; }
Lexical Analysis - Tokens
● Divide code into lexical units
● Classify lexemes as per the role
● Input tokens to the parser, which relies on this classification

x = 1

<identifier,“x”>,<operator,“=”>,<number,“1”>
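A minimal sketch of how the pairs above might be produced, assuming Python's `re` module. The names `TOKEN_SPEC`, `MASTER` and `tokenize` are hypothetical, and only the classes needed for `x = 1` are defined.

```python
import re

# Hypothetical token classes for this tiny example.
TOKEN_SPEC = [
    ("identifier", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("number", r"[0-9]+"),
    ("operator", r"="),
    ("whitespace", r"\s+"),
]
# One combined pattern with a named group per token class.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return (class, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "whitespace":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("x = 1"))
# [('identifier', 'x'), ('operator', '='), ('number', '1')]
```

The parser would then consume these pairs rather than raw characters.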
Lexical Analysis - Token Classes
● Divide code into lexical units
● Classify lexemes as per the role
● Input to the parser, which relies on this classification
● Number of tokens in each class for the following program?

if (x0==01) then {y1=10;} else {z2=20;}


keyword = identifier =
number = operator =
whitespace = other =
Lexical Analysis - Challenges
● Recognizing tokens
○ Example - FORTRAN
■ Disregards whitespace, so DO 5 I = 1.5 is the same as DO5I=1.5
■ DO 5 I = 1,5 is a loop, while DO 5 I = 1.5 is an ordinary assignment to the variable DO5I
○ May require reading ahead before deciding on the tokens
■ if (x==0) then {y=1;} else {z=2;}
○ Also a problem in modern languages like C++
■ A<B<C>>
■ Should >> be treated as the stream (shift) operator, or as two closing angle brackets of nested templates that make the snippet valid C++?
Regular Expressions
● To define what set of strings are in a token class, we use regular
expressions, and in turn, regular languages (sets of strings)
● Alphabet is a set of characters (e.g., ASCII)
● Expression R (over some alphabet):
○ R = 𝜖
| c
| R1R2
| R1|R2
| R*
Regular Languages
● To define what set of strings are in a token class, we use regular
expressions, and in turn, regular languages (sets of strings)
● Expression R (over some alphabet) denotes language L(R):
○ L(𝜖) = L(“”) = {“”}
○ L(c) = {“c”}
○ L(R1R2) = {x1x2 | x1 ∈ L(R1), x2 ∈ L(R2)}
○ L(R1|R2) = L(R1) ∪ L(R2)
○ L(R*) = L(𝜖) ∪ L(R) ∪ L(RR) ∪ …
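The five defining equations can be turned into a small enumerator. The sketch below is only an illustration: the tuple encoding of expressions and the name `language` are assumptions, and because L(R*) is usually infinite the function lists only strings up to a chosen length.

```python
from itertools import product

# Regular expressions as nested tuples (a hypothetical encoding for this sketch):
#   ("eps",)  ("chr", c)  ("cat", r1, r2)  ("alt", r1, r2)  ("star", r)

def language(r, max_len):
    """Return the strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "eps":                      # L(eps) = {""}
        return {""}
    if kind == "chr":                      # L(c) = {"c"}
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                      # L(R1|R2) = L(R1) U L(R2)
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":                      # L(R1R2) = {x1x2 | x1 in L(R1), x2 in L(R2)}
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {x + y for x, y in product(left, right) if len(x + y) <= max_len}
    if kind == "star":                     # L(R*) = L(eps) U L(R) U L(RR) U ...
        base = language(r[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:                    # stop once no new short strings appear
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node {kind!r}")

zero, one = ("chr", "0"), ("chr", "1")
print(sorted(language(("star", zero), 3)))                    # ['', '0', '00', '000']
print(sorted(language(("cat", ("alt", zero, one), one), 3)))  # ['01', '11']
```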
Regular Languages - Example
Consider the alphabet {0,1}
1. What is the language of 0*?

2. What is the language of (0|1)1?

3. What is the language of (0*|1*)?

4. What is the language of (0|1)*? (Is it same as 3?)


Regular Languages - Example
Consider the alphabet {0,1}
1. What is the language of 0*?
a. {“”, “0”, “00”, “000”, …}
2. What is the language of (0|1)1?
a. {“01”, “11”}
3. What is the language of (0*|1*)?
a. Strings of 0s or strings of 1s, and the empty string
4. What is the language of (0|1)*? (Is it same as 3?)
a. All strings of 0s and 1s, and the empty string
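These answers can be spot-checked by brute force. The sketch below assumes Python's `re` syntax, which coincides with the slide notation for these four expressions. The helper name `matching_strings` is hypothetical.

```python
import re
from itertools import product

def matching_strings(pattern, max_len, alphabet="01"):
    """Strings over alphabet, up to max_len long, that the pattern matches in full."""
    regex = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if regex.fullmatch(s):
                found.append(s)
    return found

print(matching_strings("0*", 3))       # ['', '0', '00', '000']
print(matching_strings("(0|1)1", 3))   # ['01', '11']
print(matching_strings("(0*|1*)", 2))  # ['', '0', '1', '00', '11']
print(matching_strings("(0|1)*", 2))   # ['', '0', '1', '00', '01', '10', '11']
```

The last two outputs differ on mixed strings such as "01" and "10", so expressions 3 and 4 do not denote the same language.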
Regular Languages
● Some other language-specific shorthand notations using ., ?, +, - and ^
○ Option: ‘a’|𝜖 ⇔ a?
○ One or more occurrences: ‘a’|‘aa’|‘aaa’|... ⇔ a+
○ Range: ‘a’|‘b’|‘c’|...|‘z’ ⇔ [a-z]
○ Excluded range: complement of [a-z] ⇔ [^a-z]
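As a quick illustration, the shorthands can be compared against their long forms on a handful of sample strings. This assumes Python's `re` dialect. The helper `agree` and the sample list are hypothetical, and agreement on samples is only a sanity check, not a proof.

```python
import re
import string

def agree(pattern_a, pattern_b, samples):
    """True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )

SAMPLES = ["", "a", "aa", "aaa", "b", "m", "z", "A", "0", "ab"]
LONG_RANGE = "|".join(string.ascii_lowercase)  # a|b|c|...|z

print(agree(r"a?", r"(a|)", SAMPLES))        # option: True
print(agree(r"a+", r"aa*", SAMPLES))         # one or more: True
print(agree(r"[a-z]", LONG_RANGE, SAMPLES))  # range: True
# Excluded range: single characters that are NOT lowercase letters.
print([c for c in "aZ0_z" if re.fullmatch(r"[^a-z]", c)])  # ['Z', '0', '_']
```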
Regular Languages - Example
Find simpler equivalent regular expressions for:

1. (0 | 1)*(10 | 11 | 1)(0 | 1)*

2. (01 | 11)*(0 | 1)*

3. (0 | 1)*(0 | 1)(0 | 1)*


Regular Languages - Example
Find simpler equivalent regular expressions for:

1. (0 | 1)*(10 | 11 | 1)(0 | 1)*
a. (0 | 1)*1(0 | 1)*
2. (01 | 11)*(0 | 1)*
a. (0 | 1)*
3. (0 | 1)*(0 | 1)(0 | 1)*
a. (0 | 1)+
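One way to gain confidence in such equivalences is to compare the two expressions on every binary string up to some length. The helper `agree_up_to` below is a hypothetical name and assumes Python's `re` syntax. A bounded check like this is evidence rather than a proof.

```python
import re
from itertools import product

def agree_up_to(pattern_a, pattern_b, max_len, alphabet="01"):
    """True if both patterns accept the same strings of length <= max_len."""
    a, b = re.compile(pattern_a), re.compile(pattern_b)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(a.fullmatch(s)) != bool(b.fullmatch(s)):
                return False
    return True

print(agree_up_to("(0|1)*(10|11|1)(0|1)*", "(0|1)*1(0|1)*", 8))  # True
print(agree_up_to("(01|11)*(0|1)*", "(0|1)*", 8))                # True
print(agree_up_to("(0|1)*(0|1)(0|1)*", "(0|1)+", 8))             # True
print(agree_up_to("(0*|1*)", "(0|1)*", 3))                       # False, e.g. "01"
```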
Regular Languages - Example
Describe in words the languages of:

1. (0|1)*0

2. b*(abb*)*(a|𝜖)

3. (a|b)*aa(a|b)*
Regular Languages - Example
Describe in words the languages of:

1. (0|1)*0
a. Even numbers in binary form (binary strings ending in 0)
2. b*(abb*)*(a|𝜖)
a. Strings of a’s and b’s with no consecutive a’s
3. (a|b)*aa(a|b)*
a. Strings of a’s and b’s containing at least one pair of consecutive a’s
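Each description can be tested against its expression by checking all short strings. In the sketch below the predicates are plain-Python restatements of the three descriptions. The names `CHECKS` and `description_holds` are hypothetical, and 𝜖 is written as an empty alternative in Python's `re` syntax.

```python
import re
from itertools import product

# (pattern, alphabet, predicate restating the description)
CHECKS = [
    ("(0|1)*0", "01", lambda s: s != "" and int(s, 2) % 2 == 0),  # even binary numbers
    ("b*(abb*)*(a|)", "ab", lambda s: "aa" not in s),             # no consecutive a's
    ("(a|b)*aa(a|b)*", "ab", lambda s: "aa" in s),                # some consecutive a's
]

def description_holds(pattern, alphabet, predicate, max_len=8):
    """True if pattern and predicate agree on every string of length <= max_len."""
    regex = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(regex.fullmatch(s)) != predicate(s):
                return False
    return True

for pattern, alphabet, predicate in CHECKS:
    print(pattern, description_holds(pattern, alphabet, predicate))
```

Each line should report True. Note that languages 2 and 3 are complements of each other over {a, b}.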
Regular Expressions
1. Keywords in Java?
2. Numbers in Java?
3. Identifiers in Java?
4. Whitespaces in Java?
Regular Expressions
1. Keywords in Java? ‘if’|‘else’|‘void’|...
2. Numbers in Java? 0 | [1-9][0-9]*
3. Identifiers in Java? [_a-zA-Z][_a-zA-Z0-9]*
4. Whitespace in Java? (‘ ’ | ‘\n’ | ‘\t’ | ‘\r’)+
(\s is the regex for whitespace in Java)
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
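The four patterns can be tried out directly. The sketch below writes them in Python's `re` syntax rather than Java's `Pattern` (an assumption made to keep the examples in one language; the simple constructs used here read the same in both). Only the three keywords listed above are included, and `in_class` is a hypothetical helper.

```python
import re

KEYWORD = re.compile(r"if|else|void")            # only the keywords listed on the slide
NUMBER = re.compile(r"0|[1-9][0-9]*")
IDENTIFIER = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")
WHITESPACE = re.compile(r"( |\n|\t|\r)+")

def in_class(regex, text):
    """True if the whole of text belongs to the token class."""
    return regex.fullmatch(text) is not None

print(in_class(NUMBER, "42"), in_class(NUMBER, "007"))              # True False
print(in_class(IDENTIFIER, "_tmp1"), in_class(IDENTIFIER, "1abc"))  # True False
print(in_class(WHITESPACE, " \t\n"))                                # True
print(in_class(KEYWORD, "if"), in_class(IDENTIFIER, "if"))          # True True
```

The last line shows that a keyword such as `if` also matches the identifier pattern, so a lexer needs some way to choose between the two classes.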
Lexical Specifications
● Given a string s, determine if the string is in the set of strings
constituting the language L(R)
● Break the input into tokens to pass on to the next phase
Lexical Specifications
1. Regex for all token classes
a. Number = 0|[1-9][0-9]*
b. Keywords = “if” | “else” | “then”
c. Identifiers = [a-zA-Z_][a-zA-Z_0-9]*
d. …
Lexical Specifications
1. Regex for all token classes
2. Construct R matching lexemes
a. R = Keyword | Identifier | Number | …
Lexical Specifications
1. Regex for all token classes
2. Construct R matching lexemes
3. Let x1..xn be the input
For 1 ≤ i ≤ n, check whether x1..xi ∈ L(R)
If yes for some i, emit x1..xi as a token, remove it from the input, and repeat step 3
Question?
● How do we resolve ambiguities?
○ x1..xi ∈ L(R) and x1..xj ∈ L(R) s.t. i ≠ j
○ Which one to choose?
Maximal Munch!
● How do we resolve ambiguities?
○ x1..xi ∈ L(R) and x1..xj ∈ L(R) s.t. i ≠ j
○ Always take the longer one!
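A minimal sketch of maximal munch, assuming Python's `re` and a small combined expression `R` made up for the example. It tries the longest prefix first, so it is a direct and deliberately inefficient reading of the rule. Real lexer generators typically use automata instead.

```python
import re

# Hypothetical combined expression R = Keyword | Identifier | Number | ...
R = re.compile(r"if|else|then|[a-zA-Z_][a-zA-Z_0-9]*|0|[1-9][0-9]*|==|=|\s+|[(){};]")

def longest_prefix(text):
    """Largest i such that text[:i] is in L(R), or 0 if no prefix matches."""
    for i in range(len(text), 0, -1):
        if R.fullmatch(text, 0, i):   # test the prefix text[0:i] only
            return i
    return 0

def lexemes(text):
    """Split text into lexemes, always taking the longest matching prefix."""
    out = []
    while text:
        i = longest_prefix(text)
        if i == 0:
            raise ValueError(f"no rule matches at {text[:10]!r}")
        out.append(text[:i])
        text = text[i:]
    return out

print(lexemes("ifx==10"))  # ['ifx', '==', '10']
```

Without maximal munch the same input could have been split as `if`, `x`, `=`, `=`, `1`, `0`.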
Question?
● How do we resolve ambiguities?
○ x1..xi ∈ L(Rj) and x1..xi ∈ L(Rk) s.t. j ≠ k
○ Which one to choose?
Priority Ordering
● How do we resolve ambiguities?
○ x1..xi ∈ L(Rj) and x1..xi ∈ L(Rk) s.t. j ≠ k
○ Priority to the one that appears earlier in the rule set!
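Priority ordering can be sketched as a list of rules scanned from top to bottom. The rule set and the name `classify` below are hypothetical. The only point illustrated is that the first rule matching the whole lexeme wins.

```python
import re

# Earlier entries have higher priority.
RULES = [
    ("keyword", re.compile(r"if|else|then")),
    ("identifier", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("number", re.compile(r"0|[1-9][0-9]*")),
]

def classify(lexeme):
    """Return the class of the first rule that matches the whole lexeme."""
    for name, rule in RULES:
        if rule.fullmatch(lexeme):
            return name
    return None

print(classify("if"))    # keyword (also matches identifier, but keyword comes first)
print(classify("iffy"))  # identifier
print(classify("42"))    # number
```

Swapping the first two rules would make `if` an identifier, which is why keywords are usually listed before identifiers.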
No Rule Match?
If x1..xi ∉ L(R) for every prefix x1..xi:
Include an ERROR rule, placed last in priority, s.t.
ERROR = {all strings not in the lexical specification}
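A sketch of an ERROR rule in practice, assuming Python's `re`. Here the catch-all matches a single arbitrary character and sits last in a hypothetical rule list. This is a common practical approximation of "all strings not in the specification", not the only possible design.

```python
import re

RULES = [
    ("number", re.compile(r"0|[1-9][0-9]*")),
    ("identifier", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("whitespace", re.compile(r"\s+")),
    ("ERROR", re.compile(r".", re.DOTALL)),  # catch-all, lowest priority
]

def tokenize(text):
    """Longest match wins; on ties the earlier rule wins; ERROR catches the rest."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rule in RULES:
            # For these particular rules, match() returns the longest prefix in the rule.
            m = rule.match(text, pos)
            if m and m.end() > best_end:  # strictly longer, so earlier rules keep ties
                best_name, best_end = name, m.end()
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

print(tokenize("x1 # 7"))
# [('identifier', 'x1'), ('whitespace', ' '), ('ERROR', '#'), ('whitespace', ' '), ('number', '7')]
```

Because the ERROR rule always matches one character, the loop always makes progress and the lexer can report the bad character instead of stopping silently.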
Language of Email Addresses
Alphabet = {letters, digits, “.”, “@”}

letter = [a-z]
digit = [0-9]
net = letter letter+
id_dom = letter (letter|digit)+
email = id_dom “@” id_dom “.” net
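The definitions above compose mechanically into one expression. The sketch below builds it with Python string formatting (the variable names mirror the slide; `is_email` is a hypothetical helper). This is the simplified language from the slide, not a validator for real-world email addresses.

```python
import re

letter = "[a-z]"
digit = "[0-9]"
net = f"{letter}{letter}+"                     # letter letter+
id_dom = f"{letter}({letter}|{digit})+"        # letter (letter|digit)+
email = re.compile(f"{id_dom}@{id_dom}\\.{net}")  # id_dom "@" id_dom "." net

def is_email(text):
    """True if text is in the slide's language of email addresses."""
    return email.fullmatch(text) is not None

print(is_email("user1@example.com"))  # True
print(is_email("u@example.com"))      # False: id_dom needs at least two characters
print(is_email("user@example.c"))     # False: net needs at least two letters
print(is_email("a.b@example.com"))    # False: "." is not allowed inside id_dom
```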
