Professional Documents
Culture Documents
b
"In theory there is no difference between theory and
practice. In practice there is." -Yogi Berra
3 4
Pattern Matching Applications Regular Expressions: Basic Operations
Test if a string matches some pattern. Regular expression. Notation to specify a set of strings.
! Process natural language.
! Scan for virus signatures.
! Search for information using Google.
Operation Regular Expression Yes No
! Access information in digital libraries.
! Retrieve information from Lexis/Nexis. Concatenation aabaab aabaab every other string
! Search-and-replace in a word processors. cumulus succubus
Wildcard .u.u.u.
! Filter text (spam, NetNanny, Carnivore, malware). jugulum tumultuous
! Validate data-entry fields (dates, email, URL, credit card). aa
Union aa | baab every other string
! Search for markers in human genome using PROSITE patterns. baab
aa ab
Closure ab*a
abbba ababa
Parse text files.
! Compile a Java program. a(a|b)aab aaaab
every other string
abaab
! Crawl and index the Web. Parentheses
a !
! Read in data stored in TOY input file format. (ab)*a
ababababa abbbaa
! Automatically create Java documentation from Javadoc comments.
5 6
Regular expression. Notation is surprisingly expressive. Regular expressions are a standard programmer's tool.
! Built in to Java, Perl, Unix, Python, . . . .
! Additional operations typically added for convenience.
! Ex: [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*.
Regular Expression Yes No
7 8
Regular Expressions in Java Solving the Pattern Match Problem
Validity checking. Is input in the set described by the re? Regular expressions are a concise way to describe patterns.
! How would you implement String.matches ?
! Hardware: build a deterministic finite state automaton (DFA).
public class Validate {
public static void main(String[] args) { ! Software: simulate a DFA.
String re = args[0];
String input = args[1];
DFA: simple machine that solves the pattern match problem.
System.out.println(input.matches(re));
} !Different machine for each pattern.
} powerful string library method !Accepts or rejects string specified on input tape.
!Focus on true or false questions for simplicity.
need help solving crosswords?
9 10
Simple machine with N states. RE. Concise way to describe a set of strings.
! Begin in start state. DFA. Machine to recognize whether a given string is in a given set.
! Read first input symbol.
! Move to new state, depending on current state and input symbol. Duality: for any DFA, there exists a regular expression to describe
! Repeat until last input symbol read. the same set of strings; for any regular expression, there exists a
! Accept or reject string depending on last state. DFA that recognizes the same set.
a a a
a a a b b
Y N N a* | (a*ba*ba*ba*)*
b b b
DFA Y N N
multiple of 3 b's multiple of 3 b's
b
Practical consequence of duality proof: to match regular expression
Input b b a a b b a b b patterns, (i) build DFA and (ii) simulate DFA on input string.
11 12
Implementing a Pattern Matcher Application: Harvester
Problem: given a regular expression, create program that tests Harvest information from input stream.
whether given input is in set of strings described.
! Harvest patterns from DNA.
Step 1: build the DFA.
! A compiler! % java Harvester "gcg(cgg|agg)*ctg" chromosomeX.txt
gcgcggcggcggcggcggctg
! See COS 226 or COS 320. gcgctg
gcgctg
Step 2: simulate it with given input. Easy. gcgcggcggcggaggcggaggcggctg
13 14
Harvest information from input stream. Ex: parsing an NCBI genome data file.
! Use Pattern data type to compile regular expression to NFA.
Use Matcher data type to simulate NFA.
LOCUS AC146846 128142 bp DNA linear HTG 13-NOV-2003
! DEFINITION Ornithorhynchus anatinus clone CLM1-393H9,
ACCESSION AC146846
! (NFA is fancy but equivalent variety of DFA) VERSION AC146846.2 GI:38304214
KEYWORDS HTG; HTGS_PHASE2; HTGS_DRAFT.
SOURCE Ornithorhynchus anatinus (platypus)
import java.util.regex.Pattern; ORIGIN
1 tgtatttcat ttgaccgtgc tgttttttcc cggtttttca gtacggtgtt agggagccac
import java.util.regex.Matcher; 61 gtgattctgt ttgttttatg ctgccgaata gctgctcgat gaatctctgc atagacagct // a comment
public class Harvester { 121 gccgcaggga gaaatgacca gtttgtgatg acaaaatgta ggaaagctgt ttcttcataa
...
public static void main(String[] args) { 128101 ggaaatgcga cccccacgct aatgtacagc ttctttagat tg
String re = args[0]; //
In in = new In(args[1]);
String input = in.readAll(); String re = "[ ]*[0-9]+([actg ]*).*";
Pattern pattern = Pattern.compile(re); Pattern pattern = Pattern.compile(re);
Matcher matcher = pattern.matcher(input); In in = new In(filename);
while (matcher.find()) { String line;
System.out.println(matcher.group()); while ((line = in.readLine()) != null) {
} Matcher matcher = pattern.matcher(line);
} if (matcher.find()) { extract the RE part in parentheses
} String s = matcher.group(1).replaceAll(" ", "");
// do something with s
} replace this RE with this string
}
15 16
Limitations of DFA Fundamental Questions
No DFA can recognize the language of all bit strings with an equal Which languages CANNOT be described by any RE?
number of 0's and 1's. !Bit strings with equal number of 0s and 1s.
!Decimal strings that represent prime numbers.
! Suppose an N-state DFA can recognize this language. !Genomic strings that are Watson-Crick complemented palindromes.
! Consider following input: 0000000011111111 !Many more. . . .
N+1 0's N+1 1's
How can we extend REs to describe richer sets of strings?
DFA must accept this string.
Context free grammar (e.g., Java).
!
Some state x is revisited during first N+1 0's since only N states.
!
! Machine would accept same string without intervening 0's. Q. Are there any limits on what kinds of problems machines can solve?
000011111111
17 18
Summary
Programmer.
Turing Machines
! Regular expressions are a powerful pattern matching tool.
! Implement regular expressions with finite state machines.
Theoretician.
!Regular expression is a compact description of a set of strings.
!DFA is an abstract machine that solves pattern match problem for
Challenge: Design simplest machine that is
regular expressions.
"as powerful" as conventional computers.
!DFAs and regular expressions have limitations.
Variations
! Yes (accept) and No (reject) states sometimes drawn differently
! Terminology: Deterministic Finite State Automaton (DFA), Finite
State Machine (FSM), Finite State Automaton (FSA) are the same
Alan Turing (1912-1954)
! DFA’s can have output, specified on the arcs or in the states
– These may not have explicit Yes and No states
19 20
Turing Machine: Components Turing Machine: Fetch, Execute
Alan Turing sought the most primitive model of a computing device. States.
! Finite number of possible machine configurations.
Tape. ! Determines what machine does and which way tape head moves.
! Stores input, output, and intermediate results.
! One arbitrarily long strip, divided into cells. tape head
State transition diagram.
! Finite alphabet of symbols. ! Ex. if in state 2 and input symbol is 1 then: overwrite the 1 with x,
tape move to state 0, move tape head to left.
2
Tape head. R
Before … # # x x x 1 1 0 # # …
22 23
States. Initialization.
! Finite number of possible machine configurations. ! Set input on some portion of tape.
! Determines what machine does and which way tape head moves. ! Set tape head.
… # # 0 0 1 1 1 0 # # …
! Set initial state.
State transition diagram.
! Ex. if in state 2 and input symbol is 1 then: overwrite the 1 with x,
move to state 0, move tape head to left. Termination.
2 ! Stop if enter yes, no, or halt state.
R ! Infinite loop possible. 2
1:x R
#:#
0:x 1:x #:#
0 1 3 5 0:x
#:# #:# 0 1 3 5
L R Y N #:# #:#
L R Y N
1:x 1:x
0:x 4 #:# 0:x 4 #:#
R R
After … # # x x x 1
x 1 0 # # … … # # x x x x x x # # …
24 25
Example: Equal Number of 0's and 1's Turing Machine Summary
R
Variations
find 0 ! Instead of just recognizing strings, TM’s can produce output: the
contents of the tape
… # # 0 0 1 1 1 0 # # … ! Instead of Y and N states, TM’s can have a plain Halt state
26 27