
Al-Sham Private University

Faculty of Informatics Engineering

Compiler Design
Lexical Analyzer

Dr. Bassel ALKHATIB

Lexical Analyzer

The lexical analyzer carries out the first phase of the compilation process.

Its main task is to group the input characters coming from the source program text into a set of words (tokens), which in turn form the sentences processed by the syntax analyzer.

The lexical analyzer also performs a number of secondary tasks, the most important of which is deleting characters that play no role, such as white space and comments.

Types of Tokens in Programming Languages

Tokens fall into several categories, including:

Reserved words, e.g. if, while, do
Special symbols:
    Single character, e.g. =
    Multiple characters, e.g. :=, !=, <>
Identifiers, e.g. x1, spd
Literals or constants:
    Numeric constants, e.g. 42 or 3.14159
    String literals, e.g. "hello, world"
    Characters, e.g. 'a', 'b'
The Input Buffer
Regular Expressions

Concatenation (sequencing), e.g. ab
Choice among alternatives, indicated by the metacharacter |, e.g. a|b
Repetition, indicated by the metacharacter *, e.g. a*
One or more repetitions, indicated by +, e.g. a+
Any character, indicated by the period ., e.g. ( .*b.* )
A range of characters, indicated by [-], e.g. [0-9], [a-zA-Z]
Any character not in a given set (negation), indicated by ~, e.g. ~(a|b|c)
Optional, indicated by ?, e.g. (+|-)?
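As a rough illustration, the operators listed above map almost one-to-one onto Python's re module. The sketch below assumes that the slide's ~ corresponds to a negated character class such as [^abc], and that + and - must be escaped or bracketed when they are literal characters; the sample strings are invented for the demonstration.

```python
import re

# Each entry: (slide notation, assumed Python pattern, matching string, non-matching string)
EXAMPLES = [
    ("ab",       r"ab",      "ab",  "ba"),   # concatenation
    ("a|b",      r"a|b",     "b",   "c"),    # choice
    ("a*",       r"a*",      "",    "b"),    # zero or more repetitions
    ("a+",       r"a+",      "aaa", ""),     # one or more repetitions
    (".*b.*",    r".*b.*",   "xbz", "xyz"),  # any character
    ("[0-9]",    r"[0-9]",   "7",   "x"),    # range
    ("~(a|b|c)", r"[^abc]",  "d",   "a"),    # negation (assumed mapping)
    ("(+|-)?",   r"(\+|-)?", "-",   "*"),    # optional
]

def matches(pattern: str, text: str) -> bool:
    """True when the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

for notation, pattern, good, bad in EXAMPLES:
    print(f"{notation:10} {pattern:10} {matches(pattern, good)} {matches(pattern, bad)}")
```

Every row prints True for the matching string and False for the non-matching one.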
Precedence of Operations in REs
Precedence of operations (from higher to lower):
Parentheses
Closure
Concatenation
Choice

Examples:
a|b* -> a | (b*)
ab|c*d -> (ab) | ( (c*) d )
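The grouping implied by these precedence rules can be checked experimentally. The sketch below relies on Python's re module, which uses the same ordering (closure binds tighter than concatenation, which binds tighter than choice); the helper name full is hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """True when pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# a|b* groups as a|(b*), not as (a|b)*
print(full(r"a|b*", "bbb"))    # True:  b* matches
print(full(r"a|b*", "ab"))     # False: neither a nor b* matches "ab"
print(full(r"(a|b)*", "ab"))   # True:  explicit parentheses change the meaning

# ab|c*d groups as (ab)|((c*)d)
print(full(r"ab|c*d", "ccd"))  # True:  c*d matches
print(full(r"ab|c*d", "abd"))  # False: it would match if the grouping were a(b|c*)d
```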


Regular Expressions for Programming Language Tokens
Numbers

Natural numbers (sequences of digits), e.g. 234


Decimal numbers, e.g. 1.434
Numbers with an exponent, e.g. 2.71E-2

nat = [0-9]+
signedNat = (+|-)? nat
number = signedNat ("." nat)? (E signedNat)?
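These three definitions can be transcribed into Python's re syntax almost literally. The following is a minimal sketch; the constant and function names are chosen here for illustration, and the literal characters +, - and . are escaped because they are metacharacters in Python.

```python
import re

NAT = r"[0-9]+"                                       # nat = [0-9]+
SIGNED_NAT = rf"(\+|-)?{NAT}"                         # signedNat = (+|-)? nat
NUMBER = rf"{SIGNED_NAT}(\.{NAT})?(E{SIGNED_NAT})?"   # number = signedNat ("." nat)? (E signedNat)?

def is_number(text: str) -> bool:
    """True when the whole string is a number according to the definitions above."""
    return re.fullmatch(NUMBER, text) is not None

for sample in ["234", "1.434", "2.71E-2", "1.", ".5", "2e3"]:
    print(sample, is_number(sample))
```

The last three samples are rejected: a "." must be followed by digits, a number must start with a (signed) natural number, and the definition only allows an upper-case E.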
Regular Expressions for Programming Language Tokens
Reserved Words and Identifiers

reserved = if | while | do | …

Typically, an identifier must begin with a letter and contain only letters and digits

letter = [a-zA-Z]
digit = [0-9]
identifier = letter (letter|digit)*
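A small sketch of how these definitions might be used together: a lexeme is first checked against the identifier pattern and then looked up in the set of reserved words. The function name classify and the category labels are hypothetical, and the reserved set contains only the three words named on the slide, since the rest of the list is elided.

```python
import re

# Only the reserved words named explicitly above; the full list is language-specific.
RESERVED = {"if", "while", "do"}

# identifier = letter (letter|digit)*
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

def classify(lexeme: str) -> str:
    """Classify a lexeme as RESERVED, IDENTIFIER, or NOT_AN_IDENTIFIER."""
    if IDENTIFIER.fullmatch(lexeme) is None:
        return "NOT_AN_IDENTIFIER"
    return "RESERVED" if lexeme in RESERVED else "IDENTIFIER"

for lexeme in ["x1", "spd", "while", "1x"]:
    print(lexeme, classify(lexeme))
```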
Regular Expressions for Programming Language Tokens
Comments

A scanner must recognize comments and discard them.
Comments are called pseudotokens.
There are two types of comments:

Free format, surrounded by delimiters, such as:
    { this is a Pascal comment }  ->  RE = { (~})* }
    /* this is a C comment */  ->  RE = !! (harder to write, because the closing delimiter is two characters long)
Beginning with a specified character or characters and continuing to the end of the line:
    ; this is a Scheme comment  ->  ; (~\n)*
    -- this is an Ada comment  ->  -- (~\n)*
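The three single-delimiter comment forms translate directly into Python patterns, with ~x written as a negated class. The sketch below replaces each comment with one space rather than deleting it outright, on the assumption that a comment should still separate the tokens on either side; the names are hypothetical.

```python
import re

PASCAL_COMMENT = re.compile(r"\{[^}]*\}")  # { (~})* }
SCHEME_COMMENT = re.compile(r";[^\n]*")    # ; (~\n)*
ADA_COMMENT = re.compile(r"--[^\n]*")      # -- (~\n)*

def strip_comments(source: str, comment: re.Pattern) -> str:
    """Replace every comment with a single space so neighbouring tokens stay apart."""
    return comment.sub(" ", source)

print(repr(strip_comments("a{ note }b", PASCAL_COMMENT)))
print(repr(strip_comments("x ; note\ny", SCHEME_COMMENT)))
print(repr(strip_comments("x -- note\ny", ADA_COMMENT)))
```

The line-comment patterns stop before the newline, so line structure is preserved.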
Ambiguity
When scanning a source file, some strings can be matched by different regular expressions, e.g.

Strings such as if and while can be either identifiers or keywords.
The string <> might be interpreted as representing either two tokens (less than and greater than) or a single token <>.

Regular expressions cannot determine which interpretation is to be observed. Hence, a language definition must give disambiguating rules such as:

When a string can be either an identifier or a keyword, the keyword interpretation is preferred -> reserved words.
When a string can be a single token or a sequence of several tokens, the single token interpretation is preferred -> principle of longest substring, or "maximal munch".
Token Delimiters
White space, comments, and characters that are unambiguously part of other tokens are used as delimiters by the scanner:

White space: while x -> KEYWORD ID
Comments: while/**/x -> KEYWORD ID
Characters that are not part of a token: xtemp=ytemp -> ID ASSIGN ID

Other than acting as a token delimiter, white space is usually ignored in free-format languages.
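The disambiguating rules and the delimiter behaviour described above can be tried out with a small regex-based tokenizer. This is only a sketch under several assumptions: the token names NE, LT and GT are invented here (the slides only name KEYWORD, ID and ASSIGN), the keyword set is limited to if, while and do, and comments are assumed to be C-like. Python's alternation picks the first alternative that matches, so the longest-substring rule is obtained by listing <> before <.

```python
import re

KEYWORDS = {"if", "while", "do"}

# Symbol order matters: "<>" is listed before "<" so the longer token wins.
TOKEN_SPEC = [
    ("SKIP",   r"\s+|/\*(?:[^*]|\*+[^*/])*\*+/"),  # white space and C-like comments
    ("ID",     r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NE",     r"<>"),
    ("LT",     r"<"),
    ("GT",     r">"),
    ("ASSIGN", r"="),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text: str) -> list:
    """Return the list of token kinds found in text."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind = m.lastgroup
        if kind == "ID" and m.group() in KEYWORDS:
            kind = "KEYWORD"      # keyword interpretation is preferred
        if kind != "SKIP":        # delimiters are discarded
            tokens.append(kind)
        pos = m.end()
    return tokens

print(tokenize("while x"))       # ['KEYWORD', 'ID']
print(tokenize("while/**/x"))    # ['KEYWORD', 'ID']
print(tokenize("xtemp=ytemp"))   # ['ID', 'ASSIGN', 'ID']
print(tokenize("a<>b"))          # ['ID', 'NE', 'ID']
print(tokenize("a< >b"))         # ['ID', 'LT', 'GT', 'ID']
```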
Finite Automata
Finite Automata, or finite state machines, can be used to describe the process of
recognizing patterns in input strings, and so can be used to construct scanners

identifier = letter (letter|digit)*

[Figure: finite automaton for identifiers, with the start state, the transitions, and the accepting state labeled]


Finite Automata
Notations

States can be given names rather than numbers


Transitions are labeled with names representing a set of characters
Finite Automata
Error States

A finite automaton for identifiers with error transitions


Finite Automata
Example : numbers
Draw a DFA equivalent to the following regular expressions:

digit = [0-9]
nat = digit+
signedNat = (+|-)? nat
number = signedNat ("." nat)? (E signedNat)?

[Figures: the DFA is built step by step, first for nat, then for signedNat, then for signedNat ("." nat)?, and finally for number]
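One possible DFA for number can be written down as a transition dictionary. The state names below are invented for this sketch, since the original diagrams are not reproduced here; the structure simply follows the regular definitions (an optional sign, digits, an optional fraction, an optional exponent).

```python
DIGITS = "0123456789"

def char_class(ch: str) -> str:
    """Map an input character to the label used on the transitions."""
    if ch in DIGITS:
        return "digit"
    if ch in "+-":
        return "sign"
    return ch  # "." and "E" label their own transitions; anything else has none

TRANSITIONS = {
    ("start", "sign"): "signed",
    ("start", "digit"): "int",
    ("signed", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "."): "dot",
    ("int", "E"): "exp",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
    ("frac", "E"): "exp",
    ("exp", "sign"): "exp_signed",
    ("exp", "digit"): "exp_int",
    ("exp_signed", "digit"): "exp_int",
    ("exp_int", "digit"): "exp_int",
}
ACCEPTING = {"int", "frac", "exp_int"}

def accepts(text: str) -> bool:
    """Run the DFA over the whole string; a missing transition means rejection."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING

for sample in ["234", "1.434", "2.71E-2", "+7E3", "1.", "E5", "1.5E"]:
    print(sample, accepts(sample))
```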
Finite Automata
Example : comments
A DFA for accepting comments surrounded by braces
Finite Automata
Example : C-like comments
A DFA for accepting C-like comments

"/" "*" ( ~"*" | "*"+ ~("*"|"/") )* "*"+ "/"


DFA for Scanners
The following DFA for an identifier does not exhibit the behavior we want from a scanner, because the error state is not really an error state. It represents one of two facts:

Either an identifier is not to be recognized (if we came from the start state), or
a delimiter has been seen and we must now accept and generate an identifier token.
DFA for Scanners
A Modified DFA
In this DFA:
Brackets surrounding "other" indicate that the delimiting character should be considered lookahead and should be returned to the input string.
The error state has become the accepting state.
The diagram expresses the principle of longest substring: the DFA continues to match letters and digits until a delimiter is found.

[Figure: finite automaton for an identifier with delimiter and return value; example input: xtemp=ytemp]
Implementation of Finite Automata in Code
Ad hoc Solution

if the next character is a letter then
    advance the input;
    while the next character is a letter or a digit do
        advance the input;
    end while;
    accept;
else
    { error or other cases }
end if;
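The ad hoc solution might look as follows in Python. This is a sketch under a few assumptions: the input is a string indexed by position, "the next character" is read with a hypothetical peek helper that returns None at end of input, and letters and digits are restricted to ASCII.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

def scan_identifier(text: str, pos: int = 0):
    """Return the index just past the identifier starting at pos, or None."""
    def peek(i: int):
        return text[i] if i < len(text) else None  # None marks end of input

    if peek(pos) in LETTERS:
        pos += 1                                    # advance the input
        while peek(pos) in LETTERS or peek(pos) in DIGITS:
            pos += 1                                # advance the input
        return pos                                  # accept
    return None                                     # error or other cases

print(scan_identifier("xtemp=ytemp"))  # 5: the "=" is left unconsumed
print(scan_identifier("1x"))           # None
```

The position of the DFA state is implicit in the position within the code, which is why this style is hard to generalize to larger automata.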
Implementation of Finite Automata in Code
Nested Case Statements
state := 1;
while state = 1 or 2 do
    case state of
    1: case input character of
       letter: advance the input;
               state := 2;
       else state := … {error or other state} ;
       end case;
    2: case input character of
       letter, digit: advance the input;
                      state := 2; { unnecessary }
       else state := 3;
       end case;
    end case;
end while;
if state = 3 then accept else error ;
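The same schema can be written in Python with an explicit state variable and nested conditionals in place of the case statements. In this sketch the elided error state is represented by the hypothetical value 0, and end of input is treated like any other delimiter.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

def scan(text: str):
    """Return the identifier at the start of text, or None if there is none."""
    pos, state = 0, 1
    while state in (1, 2):
        ch = text[pos] if pos < len(text) else None  # None marks end of input
        if state == 1:
            if ch in LETTERS:
                pos += 1          # advance the input
                state = 2
            else:
                state = 0         # hypothetical error state (left open in the pseudocode)
        else:                     # state == 2
            if ch in LETTERS or ch in DIGITS:
                pos += 1          # advance the input; the state stays 2
            else:
                state = 3         # delimiter seen, not consumed
    return text[:pos] if state == 3 else None

print(scan("xtemp=ytemp"))  # xtemp
print(scan("9x"))           # None
```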
Implementation of Finite Automata in Code
Table Driven Scanners

State / input char   letter   digit   other   Accepting
1                    2                        no
2                    2        2       [3]     no
3                                             yes

Transition Table for Identifiers

Brackets around a state number indicate that the transition should not consume the input.
Implementation of Finite Automata in Code
Transition Table for C-like Comments

State / input char   /    *    other   Accepting
1                    2                 no
2                         3            no
3                    3    4    3       no
4                    5    4    3       no
5                                      yes
Transition Table for C-like Comments
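The comment table can be stored as a nested dictionary and driven by a short loop. The sketch below assumes that every transition consumes its character and that a blank table entry means an error; the function and variable names are hypothetical.

```python
def column(ch: str) -> str:
    """Map an input character to a column of the table."""
    return ch if ch in ("/", "*") else "other"

# One dictionary per state; missing entries are the blank (error) cells.
T = {
    1: {"/": 2},
    2: {"*": 3},
    3: {"/": 3, "*": 4, "other": 3},
    4: {"/": 5, "*": 4, "other": 3},
    5: {},
}
ACCEPTING = {5}

def match_comment(text: str, pos: int = 0):
    """Return the index just past the comment starting at pos, or None."""
    state = 1
    while state not in ACCEPTING:
        if pos >= len(text):
            return None                      # input ended before the comment closed
        state = T[state].get(column(text[pos]))
        if state is None:
            return None                      # blank entry: not a comment
        pos += 1
    return pos

print(match_comment("/* a */ b */"))  # 7: stops at the first closing delimiter
print(match_comment("/* open"))       # None
```

Because state 5 has no outgoing transitions, the scan ends at the first */, so the text after it is left for the next token.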
Data Structures for Table Driven Scanners
State / input char   letter   digit   other   Accepting
1                    2                        no
2                    2        2       [3]     no
3                                             yes

The table above is stored as three arrays:

T (next state)
State / in char   letter   digit   other
1                 2
2                 2        2       3
3

Advance (consume the input character)
State / in char   letter   digit   other
1                 true
2                 true     true    false
3

Accept
State   Accept
1       false
2       false
3       true
Code Schema for a Table Driven Scanner

state := 1;
ch := next input character;
while not Accept[state] and not (T[state, ch] = error) do
    newstate := T[state, ch];
    if Advance[state, ch] then ch := next input char;
    state := newstate;
end while;
if Accept[state] then accept;
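A Python rendering of this schema for the identifier tables is sketched below. It assumes that T and Advance are dictionaries keyed by (state, character class), that a missing key plays the role of the error entry, and that end of input is classified as "other"; all names are illustrative.

```python
import string

def char_class(ch):
    """Map a character (or None at end of input) to a table column."""
    if ch is None:
        return "other"
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    return "other"

T = {
    (1, "letter"): 2,
    (2, "letter"): 2, (2, "digit"): 2, (2, "other"): 3,
}
ADVANCE = {
    (1, "letter"): True,
    (2, "letter"): True, (2, "digit"): True, (2, "other"): False,
}
ACCEPT = {1: False, 2: False, 3: True}

def scan(text: str):
    """Return (accepted, number of characters consumed)."""
    pos = 0

    def current():
        return text[pos] if pos < len(text) else None

    state = 1
    ch = char_class(current())
    while not ACCEPT[state] and (state, ch) in T:
        new_state = T[(state, ch)]
        if ADVANCE[(state, ch)]:
            pos += 1                      # consume the character
            ch = char_class(current())
        state = new_state
    return ACCEPT[state], pos

print(scan("xtemp=ytemp"))  # (True, 5)
print(scan("1x"))           # (False, 0)
```

The driver loop contains nothing specific to identifiers; scanning a different token class only requires different tables.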
