You are on page 1of 26

Lexical Analysis

Dr. Nguyen Hua Phung

HCMC University of Technology, Viet Nam

08, 2016

Dr. Nguyen Hua Phung Lexical Analysis 1 / 33


Outline

1 Introduction

2 Roles

3 Implementation

4 Use ANTLR to generate Lexer

Dr. Nguyen Hua Phung Lexical Analysis 2 / 33


Compilation Phases

source program

lexical analyzer

syntax analyzer
front end
semantic analyzer

intermediate code generator

code optimizer
back end
code generator

target program

Dr. Nguyen Hua Phung Lexical Analysis 4 / 33


Lexical Analysis

Like a word extractor


in ⇒ i n ⇒ in
Like a spell checker
I og:::
og to socholsochol
:::::::

Like a classification
I am a student

pronoun verb article noun

Dr. Nguyen Hua Phung Lexical Analysis 6 / 33


Lexical Analysis Roles

Identify lexemes: substrings of the source program


that belong to a grammar unit
Return tokens: a lexical category of lexemes
Ignore spaces such as blank, newline, tab
Record the position of tokens that are used in next
phases

Dr. Nguyen Hua Phung Lexical Analysis 7 / 33


Example on Lexeme and Token

r e s u l t ’’ = ’’ o ldsum - value / 100;

Lexemes Kind of Tokens


result IDENT
= ASSIGN_OP
oldsum IDENT
- SUBSTRACT_OP
value IDENT
/ DIV_OP
100 INT_LIT
; SEMICOLON

Dr. Nguyen Hua Phung Lexical Analysis 8 / 33


How to build a lexical analyzer?

How to build a lexical analysis for English?


65000 words
Simply build a dictionary:
{(I,pronoun);(We,pronoun);(am,verb);...}
Extract, search, compare
But for a programming language?
How many words?
Identifiers: abc, cab, Abc, aBc, cAb, ...
Integers: 1, 10, 120, 20, 210, ...
...
Too many words to build a dictionary, so how?
Apply rules for each kind of word (token)

Dr. Nguyen Hua Phung Lexical Analysis 11 / 33


Rule Representations

Finite Automata
Deterministic Finite Automata
Nondeterministic Finite Automata
Regular Expressions

Dr. Nguyen Hua Phung Lexical Analysis 12 / 33


Finite Automata

... b b a a a a . . . Input Tape

Head (Read only)

q3 ...

q2 qn

q1 q0

Finite Control

Dr. Nguyen Hua Phung Lexical Analysis 13 / 33


State Diagram

a a
b

start q0 q1

Input: abaabb

Current state Read New State


q0 a q0
q0 b q1
q1 a q1
q1 a q1
q1 b q0
q0 b q1
Dr. Nguyen Hua Phung Lexical Analysis 14 / 33
Deterministic Finite Automata

Definition
Deterministic Finite Automaton(DFA) is a 5-tuple
M =(K,Σ,δ,s,F) where
K = a finite set of state
Σ = alphabet
s ∈ K = the initial state
F ⊆ K = the set of final states
δ = a transition function from K ×Σ to K

Dr. Nguyen Hua Phung Lexical Analysis 15 / 33


Example

M =(K,Σ,δ,s,F)
where K = {q0 , q1 } Σ = {a,b} s=q0 F={q1 }
and δ
K Σ δ(K , Σ)
q0 a q0
q0 b q1
q1 a q1
q1 b q0
a a
b

start q0 q1

Dr. Nguyen Hua Phung Lexical Analysis 16 / 33


Nondeterministic Finite Automata

Permit several possible “next states” for a given


combination of current state and input symbol
Accept the empty string ∈ in state diagram
Help simplifying the description of automata
Every NFA is equivalent to a DFA

Dr. Nguyen Hua Phung Lexical Analysis 17 / 33


Example

Language L = ({ab} ∪ {aba})*

start q0 q1
b

a b
q2

Dr. Nguyen Hua Phung Lexical Analysis 18 / 33


Example

Language L = ({ab} ∪ {aba})*

a
start q0 q1

b

a
q2

Dr. Nguyen Hua Phung Lexical Analysis 19 / 33


Regular Expression (regex)

Describe regular sets of strings


Symbols other than ( ) | * stand for themselves
Use ∈ for an empty string
Concatenation α β = First part matches α, second
part β
Union α | β = Match α or β
Kleene star α* = 0 or more matches of α
Use ( ) for grouping

Dr. Nguyen Hua Phung Lexical Analysis 20 / 33


Example

RE Language
0 => { 0 }
01 => { 01 }
0|1 => {0,1}
0(0|1) => {00,01}
(0|1)(0|1) => {00,01,10,11}
0* => {∈,0,00,000,0000,...}
(0|1)* => {∈,0,1,00,01,10,11,000,001,...}

Dr. Nguyen Hua Phung Lexical Analysis 21 / 33


Example

(i|I)(f|F)
Keyword if of language Pascal
if
IF
If
iF

E(0|1|2|3|4|5|6|7|8|9)*
An E followed by a (possibly empty) sequence of digits
E123
E9
E

Dr. Nguyen Hua Phung Lexical Analysis 22 / 33


Regular Expression and Finite Automata

a b
start ab
a
∈ ∈
start a|b
∈ b ∈

a
start a*
∈ ∈

Dr. Nguyen Hua Phung Lexical Analysis 24 / 33


Convenience Notation

α+ = one or more (i.e. αα∗)


α? = 0 or 1 (i.e. (α| ∈))
[xyz]= x|y|z
[x-y]= all characters from x to y, e.g. [0-9] = all ASCII
digits
[^x-y]= all characters other than [x-y]
. matches any character

Dr. Nguyen Hua Phung Lexical Analysis 25 / 33


Example

(0|1|2|3|4) => [0-4]


(a|g|h|m) => [aghm]
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)* => [0-9]+
(E|e)(+|-|∈)(0|1|2|3|4|5|6|7|8|9)+ => [Ee][+-]?[0-9]+

Dr. Nguyen Hua Phung Lexical Analysis 26 / 33


ANTLR [1]

ANother Tool for Language Recognition


Terence Parr, Professor of CS at the Uni. San
Francisco
powerful parser/lexer generator

Dr. Nguyen Hua Phung Lexical Analysis 29 / 33


Lexer

/∗∗
∗ Filename : H e l l o . g4
∗/
l e x e r grammar H e l l o ;

/ / match any d i g i t s
INT : [0 −9]+;

/ / Hexadecimal number
HEX: 0 [ Xx][0 −9A−Fa−f ] + ;

/ / match lower−case i d e n t i f i e r s
ID : [ a−z ] + ;

/ / s k i p spaces , tabs , n e w l i n e s
WS : [ \ t \ r \ n ] + −> s k i p ;
Dr. Nguyen Hua Phung Lexical Analysis 30 / 33
Lexical Analyzer

1 0 . 0 e 2 0 . . . Input Tape

Look ahead

r3 ...
Token
r2 rn
with longest prefix match
r1 r0

Lexical Analyzer

Dr. Nguyen Hua Phung Lexical Analysis 31 / 33


Summary

A lexical analyzer is a pattern matcher that isolates


small-scale parts of a program
Lexical rules are represented by Regular expressions
or Finite Automata.
How to write a lexical analyzer (lexer) in ANTLR

Dr. Nguyen Hua Phung Lexical Analysis 32 / 33


References I

[1] ANTLR, http:antlr.org, 19 08 2016.

Dr. Nguyen Hua Phung Lexical Analysis 33 / 33

You might also like