
SPECIFICATION OF TOKENS

Strings and Languages
• Regular Expressions are an important notation for specifying patterns.

• Alphabet – any finite set of symbols
– e.g. ASCII, binary alphabet, UNICODE, EBCDIC, LATIN-1

• String – a finite sequence of symbols drawn from an alphabet
– e.g. banana (over the ASCII alphabet)
– Length of a string s: |s|, e.g. |banana| = 6
– Empty string: ε, with |ε| = 0

• Other terms relating to strings: prefix; suffix; substring; proper prefix, suffix, or substring (non-empty, not the entire string); subsequence

• Language – A set of strings over a fixed alphabet

Languages
• A language, L, is simply any set of strings over a
fixed alphabet.

Alphabet                             Languages
{0,1}                                {0,10,100,1000,100000…}
                                     {0,1,00,11,000,111,…}
{a,b,c}                              {abc,aabbcc,aaabbbccc,…}
{A,…,Z}                              {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…,9,+,-,…,<,>,…}      {All legal PASCAL progs}

Special Languages:
– ∅, the EMPTY LANGUAGE (contains no strings)
– {ε}, the language that contains the empty string only

String operations
• Given String: banana
• Prefix : ban, banana
• Suffix : ana, banana
• Substring : nan, ban, ana, banana
• Subsequence: bnan, nn
• Proper prefix and suffix: a prefix or suffix that is non-empty and not the entire string, e.g. ban and ana, but not banana
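The terms above can be sketched in a few lines of Python. This is only an illustration: the function names (prefixes, suffixes, substrings, is_subsequence, proper) are made up for this example and do not come from the slides.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All substrings: what remains after deleting a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def is_subsequence(t, s):
    """True if t can be obtained from s by deleting zero or more symbols."""
    it = iter(s)
    # `ch in it` advances the iterator, so symbols must appear in order.
    return all(ch in it for ch in t)

def proper(parts, s):
    """Keep only the proper ones: non-empty and not the entire string."""
    return [p for p in parts if p != "" and p != s]
```

For banana, prefixes and suffixes each return seven strings (including ε and banana itself), and proper filters out the two extremes.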

String Operations
• Concatenation
– xy is x followed by y; sε = εs = s, so ε is the identity for concatenation
– s^0 = ε; for i > 0, s^i = s^(i-1) s

Operations on Languages

OPERATION                     DEFINITION
union of L and M,             L ∪ M = {s | s is in L or s is in M}
  written L ∪ M
concatenation of L and M,     LM = {st | s is in L and t is in M}
  written LM
Kleene closure of L,          L* = union of L^i over all i ≥ 0
  written L*                  L* denotes “zero or more concatenations of” L
positive closure of L,        L+ = union of L^i over all i ≥ 1
  written L+                  L+ denotes “one or more concatenations of” L
exponentiation                L^0 = {ε}, L^1 = L, L^2 = LL
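For finite languages, the operations in the table map directly onto Python sets of strings. The sketch below assumes that representation; because L* and L+ are infinite whenever L contains a non-empty string, the helpers kleene_upto and positive_upto (hypothetical names) only build the union of L^i up to a chosen bound n.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i, with L^0 = {ε} (the empty string is "" here)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, n):
    """Finite approximation of L*: the union of L^0 .. L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out

def positive_upto(L, n):
    """Finite approximation of L+: the union of L^1 .. L^n."""
    out = set()
    for i in range(1, n + 1):
        out |= power(L, i)
    return out
```

With L = {"a"}, kleene_upto(L, 3) contains the empty string while positive_upto(L, 3) does not, which is the only difference between the two closures when ε is not in L.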
Operations on Languages
• Let L be the set of letters and D the set of digits
• L ∪ D is the set of letters and digits
• LD is the set of strings consisting of a letter followed by a digit
• L^4 is the set of all four-letter strings
• L* is the set of all strings of letters, including ε
• D+ is the set of strings of one or more digits

Say What?
L = {A, B, C, D}, D = {1, 2, 3}
• L ∪ D
{A, B, C, D, 1, 2, 3}
• LD
{A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
• L^2
{AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}
• L*
{all possible strings over L, plus ε}
• L+
L* - {ε}
• L(L ∪ D)
Valid: {A1, AA, B3, CD}  Invalid: {321, 4A2}
• L(L ∪ D)*
Valid: {A, A1, A23, D3, DA3, …}  Invalid: {31}
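The valid and invalid examples above can be checked mechanically. The sketch below assumes Python's re module as the matcher and writes L as the character class [A-D] and L ∪ D as [A-D1-3]; the names two_symbols and identifier are chosen for this example only.

```python
import re
from itertools import product

L = {"A", "B", "C", "D"}
D = {"1", "2", "3"}

# LD and L^2 built directly from the definitions.
LD = {l + d for l, d in product(L, D)}
L2 = {x + y for x, y in product(L, L)}

# L(L ∪ D): one letter followed by exactly one letter or digit.
two_symbols = re.compile(r"[A-D][A-D1-3]")
# L(L ∪ D)*: one letter followed by zero or more letters or digits.
identifier = re.compile(r"[A-D][A-D1-3]*")

def member(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return pattern.fullmatch(s) is not None
```

Strings such as 31 are rejected by both patterns because they do not start with a symbol of L.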
Regular Expressions
• A regular expression is a set of rules for constructing sequences of symbols (strings) from an alphabet.

• Let Σ be an alphabet and r a regular expression over Σ. Then L(r) is the language characterized by the rules of r.

Regular Expressions
• Defined over an alphabet Σ

• ε represents {ε}, the set containing the empty string

• If a is a symbol in Σ, then a is a regular expression denoting {a}, the set containing the string a

• If r and s are regular expressions denoting the languages L(r) and L(s), then:
– (r)|(s) is a regular expression denoting L(r) ∪ L(s)
– (r)(s) is a regular expression denoting L(r)L(s)
– (r)* is a regular expression denoting (L(r))*
– (r) is a regular expression denoting L(r)

• Precedence: * (left associative), then concatenation (left associative), then | (left associative)
Regular Expressions
Alphabet = {a, b}
1. a|b denotes {a, b}
2. (a|b)(a|b) denotes {ab, aa, ba, bb}
3. a* denotes {ε, a, aa, …}
4. (a|b)* denotes all strings of a’s and b’s, including ε
5. a|a*b denotes the string a, together with all strings of zero or more a’s followed by b
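The inductive definition of L(r) and the five examples above can be reproduced with a small evaluator. This is a sketch under stated assumptions: regular expressions are encoded by hand as nested tuples ("eps", "sym", "alt", "cat", "star" are labels invented here), and because L(r) may be infinite, the function lang only returns the strings of length at most max_len.

```python
def lang(r, max_len):
    """Strings of L(r) with length <= max_len, for r given as a nested tuple."""
    kind = r[0]
    if kind == "eps":                      # ε denotes {ε}
        return {""}
    if kind == "sym":                      # a denotes {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                      # (r)|(s) denotes L(r) ∪ L(s)
        return lang(r[1], max_len) | lang(r[2], max_len)
    if kind == "cat":                      # (r)(s) denotes L(r)L(s)
        left, right = lang(r[1], max_len), lang(r[2], max_len)
        return {s + t for s in left for t in right if len(s + t) <= max_len}
    if kind == "star":                     # (r)* denotes (L(r))*
        base = lang(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:                    # stop when no new string appears
            frontier = {s + t for s in frontier for t in base
                        if len(s + t) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind}")

a, b = ("sym", "a"), ("sym", "b")
a_or_b = ("alt", a, b)
# Precedence (* first, then concatenation, then |) makes a|a*b parse as a | ((a*) b).
example5 = ("alt", a, ("cat", ("star", a), b))
```

Calling lang(example5, 3) returns a, b, ab and aab, which shows why example 5 is a union of two alternatives and not "a followed by a*b".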

Algebraic Properties of Regular Expressions

AXIOM                      DESCRIPTION
r|s = s|r                  | is commutative
r|(s|t) = (r|s)|t          | is associative
(rs)t = r(st)              concatenation is associative
r(s|t) = rs|rt             concatenation distributes over |
(s|t)r = sr|tr
εr = r                     ε is the identity element for concatenation
rε = r
r* = (r|ε)*                relation between * and ε
r** = r*                   * is idempotent
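These axioms can be spot-checked by comparing both sides on every short string. The sketch below assumes Python's re module, picks the sample instances r = a, s = b, t = ab, and writes ε as an empty alternative; agreement up to a length bound is supporting evidence for each law, not a proof. The names strings_upto, same_language and laws are hypothetical.

```python
import re
from itertools import product

def strings_upto(alphabet, n):
    """Every string over `alphabet` of length 0..n."""
    for k in range(n + 1):
        for tup in product(alphabet, repeat=k):
            yield "".join(tup)

def same_language(p, q, alphabet="ab", n=5):
    """True if patterns p and q accept exactly the same strings up to length n."""
    return all(
        (re.fullmatch(p, w) is None) == (re.fullmatch(q, w) is None)
        for w in strings_upto(alphabet, n)
    )

# Sample instances (an assumption of this sketch): r = a, s = b, t = ab.
laws = {
    "| is commutative":              ("a|b", "b|a"),
    "| is associative":              ("a|(?:b|ab)", "(?:a|b)|ab"),
    "concatenation is associative":  ("(?:ab)ab", "a(?:bab)"),
    "left distribution over |":      ("a(?:b|ab)", "ab|aab"),
    "right distribution over |":     ("(?:b|ab)a", "ba|aba"),
    "relation between * and ε":      ("a*", "(?:a|)*"),
    "* is idempotent":               ("(?:a*)*", "a*"),
}
```

A loop such as `for name, (p, q) in laws.items(): print(name, same_language(p, q))` reports each law; a pair that is not equivalent, for example a|a*b against (a|a)*b, makes same_language return False.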

Regular Definitions
• Names may be given to regular expressions; these names can be used like symbols
• Let Σ be an alphabet of basic symbols. A regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
...
dn → rn
where each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, …, d(i-1)}

Regular Definitions
• Example 1:
– letter → A|B|…|Z|a|b|…|z
– digit → 0|1|…|9
– id → letter (letter | digit)*
• Example 2:
– digit → 0 | 1 | 2 | … | 9
– digits → digit digit*
– optional_fraction → . digits | ε
– optional_exponent → ( E ( + | - | ε ) digits ) | ε
– num → digits optional_fraction optional_exponent
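The rule that each definition may use only earlier names can be mimicked by expanding the definitions in order. The sketch below is an assumption-laden illustration: it targets Python's re syntax, abbreviates letter and digit as the character classes [A-Za-z] and [0-9], writes ε as an empty alternative, and uses a hypothetical helper named expand.

```python
import re

# Each body may refer only to names defined earlier in the list (d1 ... dn).
definitions = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}(?:{letter}|{digit})*"),
    ("digits", "{digit}{digit}*"),
    ("optional_fraction", r"(?:\.{digits}|)"),
    ("optional_exponent", r"(?:E(?:\+|-|){digits}|)"),
    ("num", "{digits}{optional_fraction}{optional_exponent}"),
]

def expand(defs):
    """Substitute earlier names into later bodies, in definition order."""
    env = {}
    for name, body in defs:
        # A reference to a name not yet defined raises KeyError here.
        env[name] = "(?:" + body.format(**env) + ")"
    return env

patterns = expand(definitions)

def matches(name, s):
    """True if the whole string s is generated by the definition `name`."""
    return re.fullmatch(patterns[name], s) is not None
```

For instance, matches("num", "6.02E+23") holds, while matches("num", "1.") does not, because optional_fraction requires digits after the point.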

Regular Definitions
• Shorthand
– One or more instances: r+ denotes rr*
– Zero or one instance: r? denotes r|ε
– Character classes: [a-z] denotes a|b|…|z

Example
• digit → 0 | 1 | 2 | … | 9
• digits → digit+
• optional_fraction → ( . digits )?
• optional_exponent → ( E ( + | - )? digits )?
• num → digits optional_fraction optional_exponent
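As a quick check that the shorthand changes only the notation, the two spellings of num can be compared on sample inputs. The sketch assumes Python's re syntax; the patterns longhand and shorthand and the sample list are written for this example and are not part of the slides.

```python
import re

# Longhand: only |, concatenation and *, with an empty alternative standing for ε.
longhand = r"[0-9][0-9]*(?:\.[0-9][0-9]*|)(?:E(?:\+|-|)[0-9][0-9]*|)"
# Shorthand: + for one or more instances, ? for zero or one instance.
shorthand = r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"

samples = ["0", "42", "3.14", "6.02E+23", "1E-9", "7E5",
           "", ".5", "1.", "1E", "E5", "1.2.3"]

def accepts(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# Strings accepted by the shorthand pattern, in sample order.
accepted = [s for s in samples if accepts(shorthand, s)]
```

Both patterns accept the same six samples and reject the rest, so the shorter form can be used wherever the longer one was.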

Limitations of Regular Expressions
• Some languages cannot be described by any regular expression
• Cannot describe balanced or nested constructs
– Example: all valid strings of balanced parentheses
– This can be done with a context-free grammar (CFG)
• Cannot describe repeated strings
– Example: {wcw | w is a string of a’s and b’s}
– This cannot be done with a CFG either
• Can denote only a fixed number of repetitions or an unspecified number of repetitions of a given construct
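Both example languages are easy to recognize once the recognizer may count or remember without bound, which is the capability regular expressions lack. The two functions below are a minimal sketch with hypothetical names: balanced keeps an unbounded nesting counter, and is_wcw remembers all of w in order to compare it with the part after c.

```python
def balanced(s):
    """True if s is a string of balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ')' with no matching '('
                return False
        else:
            return False        # only parentheses belong to this language
    return depth == 0

def is_wcw(s):
    """True if s has the form w c w, where w is a string of a's and b's."""
    w, sep, rest = s.partition("c")
    return sep == "c" and w == rest and set(w) <= {"a", "b"}
```

The counter in balanced can grow as large as the input requires, so no fixed number of states is enough to replace it.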
