Professional Documents
Culture Documents
REGULAR LANGUAGES,
E X P R E S S I O N S A N D A P P L I C AT I O N S
D R . V A D I M Z AY T S E V A.K.A. @GRAMMARWARE
ROADMAP
• Advanced methods
languages
recursively
enumerable
context-
sensitive
context-
free
regular
finite
Noam Chomsky. On Certain Formal Properties of Grammars, Information & Control 2(2):137–167, 1959.
CHOMSKY: A U T O M ATA Tu r i n g m a c h i n e
linear bounded
automaton languages
recursively
enumerable
context-
sensitive
context-
free
regular
finite
finite state pushdown
automaton automaton
(too many to list)
CHOMSKY: TOOLS imaginary
computer
languages
recursively
enumerable
context-
sensitive
context-
free
regular
finite
regexp grammarware
CHOMSKY: REWRITING α → β
αXβ → αγβ
languages
recursively
enumerable
context-
sensitive
context-
free
regular
finite
X → a
X → γ
X → aB Axel Thue. Probleme über Veränderungen von Zeichenreihen
nach gegebenen Regeln, 1914. http://arxiv.org/abs/1308.5858
REGEXPS REVISITED
• ∅, ε, letters from Σ
• concatenation
• iteration
• alternation
C. E. Shannon, W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, 1949.
(finite state grammars and finite diagrams and finite state Markov processes)
TO WHICH CLASS DO LANGUAGES BELONG?
• ∅ FINITE
• {ε} FINITE
• {x, y, z} F I N I T E
all
• {0ⁿ | n > 1} R E G U L A R
recursively
enumerable
• {0ⁿ1 ² ⁿ | n > 1} C - F R E E
regular
finite
• {0ⁿ1ⁿ2ⁿ | n > 1} C - S E N S I T I V E
interactive
IS A T A S K S O LV A B L E B Y R E G U L A R M E A N S ?
✓ • Substring search
✓ • Substring replacement
✗ • Pretty-printing
interactive
IS A T A S K S O LV A B L E B Y R E G U L A R M E A N S ?
• wc -l, grep -c “”
✗ • Parsing HTML
• <BODY><TABLE><P><A HREF=…
✓ • Parsing a postcode
• 1098 XG, …
interactive
HOW TO PROVE WHICH CLASS
A LANGUAGE BELONGS TO
all
context-
free
regular
Jos C.M. Baeten, Models of Computation: Automata, Formal Languages and Communicating Processes, §2.9, p.58.
JOHN MYHILL AND ANIL NERODE
• Myhill-Nerode equivalence
• In simple terms
• bounded regular
• unbounded ?
correct!
memory is limited,
• Correct or not? alphabet is limited
prefixes are limited
• {0ⁱ1ⁿ…}
• 1 counter context-free
• n counters context-sensitive
• {0ⁿ1ⁿ | n > 1}
• Ā is regular
• Meaning…
J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
E. Stearns, J. Hartmanis, Regularity Preserving Modifications of Regular Expressions, Information & Control 6:55–69, 1963.
CLASS CLOSED UNDER SET UNION
• A⋃B is regular
• Meaning…
• [a-z]
J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
CLASS CLOSED UNDER INTERSECTION
• A⋂B is regular
• Meaning…
J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
CLASS CLOSED UNDER DIFFERENCE
• A∖B is regular
• Meaning…
J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
CLASS C L O S E D U N D E R I T E R AT I O N
• Meaning…
• [a]*
• [a]⁺
J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
CLASS C L O S E D U N D E R C O N C AT E N AT I O N
• AB is regular
• Meaning…
• [Bb][Oo][Dd][Yy]
J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
CLASS [SOMETIMES] CLOSED UNDER
DECOMPOSITION
E. Stearns, J. Hartmanis, Regularity Preserving Modifications of Regular Expressions, Information & Control 6:55–69, 1963.
CLASS CLOSED UNDER HOMOMORPHISM
• h : Σ → Σ*
• then
• h(A) is regular
J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
REGEXPS YOU NEED TO DEBUG
var whitelist =
@"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|
</?s>|</?strike>|</?blockquote>|</?sub>|</?super>|
</?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>|
</?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";
Jeff Atwood, If You Like Regular Expressions So Much, Why Don't You Marry Them?, 22 Mar 2005.
Jeff Atwood, Regular Expressions: Now You Have Two Problems, 27 Jun 2008.
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r
\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\
[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?
[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:
(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r
\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:
\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?
[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?
[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*
\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\
[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
of what it is
[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?
[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*
sensible
\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\
[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\
["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?
[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r
to do
\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)
(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r
\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^
\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?
Jeff Atwood, Regex use vs. Regex abuse, 16 Feb 2005. RFC822.
Paul Warren, Mail::RFC822::Address: regexp-based address validation, 17/09/2012.
TOOLS OVERVIEW
ALL TOOLS
ARE
DIFFERENT
• Built-in variables
• Built-in operators
A. V. Aho, B. W. Kernighan, P. J. Weinberger, AWK — A Pattern Scanning and Processing Language. SPE, 9(4): 267-279, 1979.
AW K IN ACTION
AW K EXAMPLES
• {
w += NF
c += length + 1
}
END { print NR, w, c }
• #!/usr/bin/awk -f
BEGIN { print "Hello, world!" }
https://en.wikipedia.org/wiki/AWK
AW K OUTPUT
LEX
• .|\n {}
int lineno=1;
!
letter ::= [a-zA-Z]
digit ::= [0-9]
id ::= {letter}({letter}|{digit})*
number ::= {digit}+
%%
printf("\nTokeniser running -- ^D to exit\n");
!
^{id} {line();printf("<id>");}
{id} printf("<id>");
^{number} {line();printf("<number>");}
{number} printf("<number>");
^[ \t]+ line();
[ \t]+ printf(" ");
[\n] ECHO;
^[^a-zA-Z0-9 \t\n]+ {line();printf("\\%s\\",yytext);}
[^a-zA-Z0-9 \t\n]+ printf("\\%s\\",yytext);
%%
line()
{
printf("%4d: ",lineno++);
}
• Or, in programs:
• $bar =~ /foo/
• Match
• Substitute
• $s =~ s/dog/cat/;
• Transliterate
• $uc =~ tr/a-z/A-Z/;
• <(\w+)>.*<\/\1>, \((?R)*\)
http://www.pcre.org
R A S C A L ( M E TA P R O G R A M M I N G )
• Java Regex
• /xyz/ := “xyz”
• if (/xyz/ := s) {…}
• if (/x<m:y+>z/ := s) println(m);
• Lexical grammars
• parse(#Number, file);
http://rascal-mpl.org/
CONCLUSION
• Drawbacks:
• homomorphism context-
free
• Tools regular
finite
• grep, perl, sed, awk, rsc
• To read: Jeffrey E. F. Friedl, Mastering Regular Expressions:
Understand Your Data and Be More Productive, O’Reilly, 2006.
THANKS F O R Y O U R AT T E N T I O N !
• grammarware.net, twitter.com/grammarware,
grammarware.github.com, …
• I usually teach at Master Software Engineering
• Affiliations