Lecture 3

CS327 - Compilers
Lexical Analysis
Abhishek Bichhawat 12/01/2024

Lexical Analysis - Lexemes
● Divide code into lexical units
○ Partition input into lexemes (syntactic category)
if x == 0 then { y = 1 ; } else { z = 2 ; }
Lexical Analysis - Token Classes
● Classify lexemes according to their role
○ Keywords, identifiers, numbers, parentheses, semicolons, whitespace, etc.
○ Classes correspond to sets of strings
■ E.g. Identifiers are alphanumeric strings starting with a letter
Numbers are strings consisting of digits
Keywords are specific words
if x0 == 0 then { x1 = 1; } else { x2 = 2; }
Lexical Analysis - Tokens
● Input tokens to the parser, which relies on this classification
x = 1
<identifier,“x”>,<operator,“=”>,<number,“1”>
Lexical Analysis - Token Classes
● Input to the parser, which relies on this classification
● Number of tokens in each class for the following program?
if (x0==01) then {y1=10;} else {z2=20;}

keyword = identifier =
number = operator =
whitespace = other =
Lexical Analysis - Challenges
● Recognizing tokens
○ Example - FORTRAN
■ Whitespace is insignificant, so DO 5 I = 1.5 is the same as DO5I=1.5
■ DO 5 I = 1,5 is a loop, whereas DO 5 I = 1.5 is an ordinary assignment (to the variable DO5I)
○ May require reading ahead before deciding on the tokens
■ if (x==0) then {y=1;} else {z=2;}
○ Also, a problem in modern languages like C++
■ A<B<C>>
■ Should >> be lexed as the stream (shift) operator or as two closing angle brackets, i.e. is the above snippet valid C++?
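The FORTRAN case above can be sketched in Python to show why the lexer has to read ahead. This is only an illustration: the `DO_LOOP` pattern is an assumed, heavily simplified shape of a loop header, and `classify` is a hypothetical helper, not part of any real FORTRAN front end.

```python
import re

# Assumed, simplified shape of a DO loop header once blanks are removed:
#   DO <label> <variable> = <start> , <end>
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")

def classify(statement):
    """Decide between 'loop' and 'assignment' for a DO-like statement."""
    # Blanks carry no meaning, so drop them before looking at the text.
    compact = statement.replace(" ", "").upper()
    # Only the comma, far to the right of "DO", separates the two readings.
    if DO_LOOP.fullmatch(compact):
        return "loop"
    return "assignment"

print(classify("DO 5 I = 1,5"))  # loop
print(classify("DO 5 I = 1.5"))  # assignment
```

The decision about the very first characters (`DO` as a keyword or as part of the identifier `DO5I`) cannot be made until the comma or the end of the statement has been seen.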
Regular Expressions
● To define what set of strings are in a token class, we use regular
expressions, and in turn, regular languages (sets of strings)
● Alphabet is a set of characters (e.g., ASCII)
● Expression R (over some alphabet) :
○ R = 𝜖
| c
| R1R2
| R1|R2
| R*
Regular Languages
● Expression R (over some alphabet) denotes language L(R):
○ L(𝜖) = L(“”) = {“”}
○ L(c) = {“c”}
○ L(R1R2) = {x1x2 | x1 ∈ L(R1), x2 ∈ L(R2)}
○ L(R1|R2) = L(R1) ∪ L(R2)
○ L(R*) = L(𝜖) ∪ L(R) ∪ L(RR) ∪ …
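The five defining equations can be turned almost line by line into a small Python function. This is a sketch under assumptions: the nested-tuple encoding of expressions and the name `language` are made up for illustration, and since L(R*) is infinite the function only returns strings up to a chosen length `max_len`.

```python
def language(regex, max_len):
    """Strings of L(regex) whose length is at most max_len.

    regex is a nested tuple (an assumed encoding, not from the slides):
    ("eps",), ("char", c), ("cat", r1, r2), ("alt", r1, r2), ("star", r).
    """
    kind = regex[0]
    if kind == "eps":                       # L(eps) = {""}
        return {""}
    if kind == "char":                      # L(c) = {"c"}
        return {regex[1]} if max_len >= 1 else set()
    if kind == "cat":                       # L(R1R2) = {x1x2 | ...}
        left = language(regex[1], max_len)
        right = language(regex[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "alt":                       # L(R1|R2) = L(R1) U L(R2)
        return language(regex[1], max_len) | language(regex[2], max_len)
    if kind == "star":                      # L(R*) = L(eps) U L(R) U L(RR) U ...
        base = language(regex[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:                     # stop once no new string appears
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regex node: {kind}")

zero_star = ("star", ("char", "0"))
print(sorted(language(zero_star, 3), key=len))  # ['', '0', '00', '000']
```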
Regular Languages - Example
Consider the alphabet {0,1}
1. What is the language 0*?
a. {“”, “0”, “00”, “000”, …}
2. What is the language of (0|1)1?
a. {“01”, “11”}
3. What is the language of (0*|1*)?
a. Strings of 0s or strings of 1s, and the empty string
4. What is the language of (0|1)*? (Is it same as 3?)
a. All strings of 0s and 1s, and the empty string (not the same as 3: “01” is in this language but not in L(0*|1*))
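The difference between questions 3 and 4 can be checked quickly with Python's `re` module. This is a small sketch; the variable names are illustrative, and `fullmatch` is used so that the whole string has to belong to the language.

```python
import re

same_symbol = re.compile(r"0*|1*")   # question 3
any_binary = re.compile(r"(0|1)*")   # question 4

# Print, for each sample string, whether it is in each language.
for s in ["", "000", "11", "01"]:
    in_q3 = same_symbol.fullmatch(s) is not None
    in_q4 = any_binary.fullmatch(s) is not None
    print(repr(s), in_q3, in_q4)
```

Only the mixed string "01" separates the two: it is accepted by `(0|1)*` and rejected by `0*|1*`.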
Regular Languages
● Some other language-specific shorthand, such as ?, +, - and ^
○ Option: ‘a’|𝜖 ⇔ a?
○ One or more occurrences: ‘a’|‘aa’|‘aaa’|... ⇔ a+
○ Range: ‘a’|‘b’|‘c’|...|‘z’ ⇔ [a-z]
○ Excluded range: complement of [a-z] ⇔ [^a-z]
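These shorthands behave the same way in Python's `re` syntax, which makes them easy to try out. The helper `matches` below is a hypothetical convenience wrapper around `re.fullmatch`, introduced only for this sketch.

```python
import re

def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None

# Option: a? accepts the same strings as a|(empty)
print(matches(r"a?", ""), matches(r"a|", ""))
# One or more: a+ accepts the same strings as aa*
print(matches(r"a+", "aaa"), matches(r"aa*", "aaa"), matches(r"a+", ""))
# Range and excluded range
print(matches(r"[a-z]", "q"), matches(r"[^a-z]", "q"), matches(r"[^a-z]", "Q"))
```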
Equivalent regular languages of:
1. (0 | 1)*(10 | 11 | 1)(0 | 1)*
a. (0 | 1)*1(0 | 1)*
2. (01 | 11)*(0 | 1)*
a. (0 | 1)*
3. (0 | 1)*(0 | 1)(0 | 1)*
a. (0 | 1)+
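One way to gain confidence in such equivalences is to compare the two expressions on every string up to some length. The sketch below does that with Python's `re`; `same_language` is a hypothetical helper, and agreement up to `max_len` is evidence rather than a proof of equivalence.

```python
import re
from itertools import product

def same_language(pattern_a, pattern_b, alphabet="01", max_len=8):
    """Compare two patterns on all strings over alphabet up to max_len."""
    a, b = re.compile(pattern_a), re.compile(pattern_b)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            # A single disagreement shows the languages differ.
            if (a.fullmatch(s) is None) != (b.fullmatch(s) is None):
                return False
    return True

print(same_language(r"(0|1)*(10|11|1)(0|1)*", r"(0|1)*1(0|1)*"))
print(same_language(r"(01|11)*(0|1)*", r"(0|1)*"))
print(same_language(r"(0|1)*(0|1)(0|1)*", r"(0|1)+"))
```

A result of `False` is conclusive, since it means some concrete string is accepted by one expression and rejected by the other.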
Meaningful statement for the regular languages:
1. (0|1)*0
a. Even numbers in binary form (binary strings ending in 0)
2. b*(abb*)*(a|𝜖)
a. Strings of a's and b's with no two consecutive a's
3. (a|b)*aa(a|b)*
a. Strings of a's and b's containing at least one pair of consecutive a's
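The three descriptions can be cross-checked against the expressions by brute force. This is a sketch, not a proof: `strings` and `agrees` are hypothetical helpers, the check only covers strings up to length 8, and 𝜖 is written as an empty alternative in Python's `re` syntax.

```python
import re
from itertools import product

def strings(alphabet, max_len):
    """All strings over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees(pattern, alphabet, predicate, max_len=8):
    """True if pattern and predicate accept exactly the same strings."""
    regex = re.compile(pattern)
    return all((regex.fullmatch(s) is not None) == predicate(s)
               for s in strings(alphabet, max_len))

checks = [
    # 1. non-empty binary strings whose value is even
    agrees(r"(0|1)*0", "01", lambda s: s != "" and int(s, 2) % 2 == 0),
    # 2. no two consecutive a's
    agrees(r"b*(abb*)*(a|)", "ab", lambda s: "aa" not in s),
    # 3. at least one pair of consecutive a's
    agrees(r"(a|b)*aa(a|b)*", "ab", lambda s: "aa" in s),
]
print(checks)
```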
Regular Expressions
1. Keywords in Java? ‘if’|‘else’|‘void’|...
2. Numbers in Java? 0 | [1-9][0-9]*
3. Identifiers in Java? [_a-zA-Z][_a-zA-Z0-9]*
4. Whitespaces in Java? (‘ ’ | ‘\n’ | ‘\t’ | ‘\r’)+
(\s is the regex for whitespace in Java)
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
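The four patterns can be tried out directly. The sketch below uses Python's `re` rather than Java's `Pattern` (the syntax of these particular patterns is the same in both); the keyword list is deliberately partial, as on the slide, and `classes_of` is a hypothetical helper. The patterns are simplifications of real Java: for example, Java identifiers may also contain `$`.

```python
import re

# Patterns transcribed from the slide; the keyword list is incomplete.
KEYWORD = re.compile(r"if|else|void")
NUMBER = re.compile(r"0|[1-9][0-9]*")
IDENTIFIER = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")
WHITESPACE = re.compile(r"[ \n\t\r]+")

def classes_of(lexeme):
    """Names of all slide patterns that match the whole lexeme."""
    named = [("keyword", KEYWORD), ("number", NUMBER),
             ("identifier", IDENTIFIER), ("whitespace", WHITESPACE)]
    return [name for name, regex in named if regex.fullmatch(lexeme)]

print(classes_of("if"))    # ['keyword', 'identifier']
print(classes_of("x1"))    # ['identifier']
print(classes_of("10"))    # ['number']
print(classes_of("01"))    # []
print(classes_of(" \t"))   # ['whitespace']
```

Note that `if` matches both the keyword and the identifier pattern, so the patterns alone do not determine a unique token class for every lexeme.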
Lexical Specifications
● Given a string s, determine if the string is in the set of strings
constituting the language L(R)
● Break the input into tokens to pass on to the next phase
1. Regex for all token classes
a. Number = 0|[1-9][0-9]*
b. Keywords = “if” | “else” | “then”
c. Identifiers = [a-zA-Z_][a-zA-Z_0-9]*
d. …
2. Construct R matching lexemes
a. R = Keyword | Identifier | Number | …
3. Let x1..xn be the input
For 1 ≤ i ≤ n, check whether x1..xi ∈ L(R)
If yes, remove x1..xi from the input and repeat step 3
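Step 3 can be sketched in Python by listing every prefix length i for which x1..xi ∈ L(R). The combined expression `R` below is built from the three example classes in step 1, and `matching_prefix_lengths` is a hypothetical helper name.

```python
import re

# R = Keyword | Identifier | Number, using the patterns from step 1.
R = re.compile(r"if|else|then|[a-zA-Z_][a-zA-Z_0-9]*|0|[1-9][0-9]*")

def matching_prefix_lengths(text):
    """All i with 1 <= i <= len(text) such that text[:i] is in L(R)."""
    return [i for i in range(1, len(text) + 1) if R.fullmatch(text[:i])]

print(matching_prefix_lengths("ifx=1"))  # [1, 2, 3]
```

For the input `ifx=1` the prefixes `i`, `if` and `ifx` are all in L(R), so the procedure as stated does not yet say which prefix to remove.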
Question?
● How do we resolve ambiguities?
○ x1..xi ∈ L(R) and x1..xj ∈ L(R) s.t. i ≠ j
○ Which one to choose?
Maximal Munch!
○ x1..xi ∈ L(R) and x1..xj ∈ L(R) s.t. i ≠ j
○ Always take the longer one!
Question?
○ x1..xi ∈ L(Rj) and x1..xi ∈ L(Rk) s.t. j ≠ k
○ Which one to choose?
Priority Ordering
○ x1..xi ∈ L(Rj) and x1..xi ∈ L(Rk) s.t. j ≠ k
○ Priority to the one that appears earlier in the rule set!
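Both disambiguation rules fit into a short lexer loop. The following is a minimal sketch, assuming an illustrative rule set (`RULES`) and hypothetical helpers `longest_match` and `lex`; it favours clarity over speed and is not how generated lexers are implemented. It tests prefixes explicitly because Python's `|` picks the first alternative that works rather than the longest match.

```python
import re

# Rules in priority order: on a tie in length, the earlier rule wins.
RULES = [
    ("keyword", re.compile(r"if|else|then")),
    ("identifier", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("number", re.compile(r"0|[1-9][0-9]*")),
    ("operator", re.compile(r"==|=")),
    ("whitespace", re.compile(r"[ \t\r\n]+")),
]

def longest_match(regex, text, pos):
    """Length of the longest prefix of text[pos:] that is in the rule's language."""
    for end in range(len(text), pos, -1):
        if regex.fullmatch(text, pos, end):
            return end - pos
    return 0

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in RULES:
            length = longest_match(regex, text, pos)
            if length > best_len:   # maximal munch; strict '>' keeps the earlier rule on ties
                best_name, best_len = name, length
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "whitespace":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(lex("if ifx == 0 then"))
```

Here `ifx` becomes a single identifier and `==` a single operator because of maximal munch, while `if` and `then` are keywords because the keyword rule is listed before the identifier rule.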
No Rule Match?
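If x1..xi ∉ L(R) for every 1 ≤ i ≤ n:
Include an ERROR rule s.t.
ERROR = {all strings not in the lexical specification}
Give ERROR the lowest priority, so that it matches only when no other rule does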
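A common way to approximate the ERROR rule in practice is a catch-all rule that matches any single character and is listed last. The sketch below assumes an illustrative rule set (`SPEC`) and a hypothetical function `lex_with_errors`; it relies on Python's ordered alternation, so ERROR is only tried at positions where no earlier rule can start a match.

```python
import re

# ERROR is listed last and matches any single character.
# Rule names and patterns are illustrative.
SPEC = [
    ("number", r"0|[1-9][0-9]*"),
    ("identifier", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("whitespace", r"\s+"),
    ("ERROR", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in SPEC),
                    re.DOTALL)

def lex_with_errors(text):
    """Return (class, lexeme) pairs, reporting unknown characters as ERROR tokens."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)
            if m.lastgroup != "whitespace"]

print(lex_with_errors("x1 # 42"))
# [('identifier', 'x1'), ('ERROR', '#'), ('number', '42')]
```

Emitting an ERROR token instead of stopping lets the lexer keep going and report every offending character in one pass.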
Language of Email Addresses
Alphabet = {letters, digits, “.”, “@”}
letter = [a-z]
digit = [0-9]
net = letter letter+
id_dom = letter (letter|digit)+
email = id_dom “@” id_dom “.” net
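The named sub-expressions translate directly into Python by building the final pattern from smaller strings. This is a sketch of the slide's simplified language only, not a validator for real-world email addresses; the function name `is_email` is made up for illustration.

```python
import re

LETTER = r"[a-z]"
DIGIT = r"[0-9]"
NET = rf"{LETTER}{LETTER}+"                    # net = letter letter+
ID_DOM = rf"{LETTER}(?:{LETTER}|{DIGIT})+"     # id_dom = letter (letter|digit)+
EMAIL = re.compile(rf"{ID_DOM}@{ID_DOM}\.{NET}")

def is_email(s):
    """True if s is in the slide's language of email addresses."""
    return EMAIL.fullmatch(s) is not None

print(is_email("user1@example.org"))       # True
print(is_email("a@b.c"))                   # False: every part needs at least two characters
print(is_email("first.last@example.org"))  # False: "." is not allowed inside id_dom
```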
