

Pattern Matching Algorithms
Presentation by: Kamran Mahmoudi [kmahmoudi@ieee.org]
Under the supervision of Dr. Mahdavi

Imam Khomeini International University, April 2017


Pattern matching in Bioinformatics
 Certain nucleotide and/or amino acid sequences have properties known to biologists. For example, ATG is a string that must be present at the beginning of every protein-coding gene in a DNA sequence.

 Determining whether a DNA sequence contains a specific (candidate) primer is therefore essential for running PCR correctly.

 A conserved DNA sequence is a sequence of nucleotides that is found in the DNA of multiple species and/or multiple strains.

 Some sequences are conserved precisely; however, many sequences are conserved with some modifications. Finding such modified strings is an important step in mapping the DNA of a new organism.
Intro.
Needle in a haystack

The string matching problem consists of finding a (usually short) string, the pattern, as a substring in a given (usually very long) string, the text. [1]

1/56
Formal Definition

Let Σ be an arbitrary alphabet.


The (exact) string matching problem is the following problem:
Input: Two strings t = t1…tn and p = p1…pm over Σ.
Output: The set of all positions in the text t where an occurrence of the pattern p as a substring starts [1].

2/56
Classification
using preprocessing as the main criterion

Classes of string searching algorithms [2]


                              Text not preprocessed        Text preprocessed
Patterns not preprocessed     Primitive algorithms         Index methods
Patterns preprocessed         Constructed search engines   Signature methods

3/56
Basic classification
 Single Pattern Algorithms
✓ Naïve String Search
✓ Knuth-Morris-Pratt Algorithm
✓ Boyer-Moore Algorithm
✓ Rabin-Karp String Search Algorithm
✓ Finite State Automaton Based Search
 Bitap algorithm (shift-or, shift-and, Baeza–Yates–Gonnet)
 Two-way string-matching algorithm
 BNDM (Backward Non-Deterministic Dawg Matching)
✓ BOM (Backward Oracle Matching)
4/56
Basic classification

 Algorithms using a finite set of patterns


 Aho–Corasick string matching algorithm (extension of Knuth-
Morris-Pratt)
 Commentz-Walter algorithm (extension of Boyer-Moore)
 Set-BOM (extension of Backward Oracle Matching)
 Rabin–Karp string search algorithm

5/56
Basic classification

 Algorithms using an infinite number of patterns


 Naturally, the patterns cannot be enumerated finitely in this case. They are usually represented by a regular grammar or a regular expression.

6/56
Naïve string search
Input: a pattern p = p1…pm and a text t = t1…tn
I := ∅
For j := 0 to n-m do
    i := 1
    while pi = tj+i and i <= m do
        i := i+1
    if i = m+1 then {p1…pm = tj+1…tj+m}
        I := I ∪ {j+1}
Output: The set I of positions where an occurrence of p as a substring in t starts

7/56
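
For reference, the same naïve search can be written in Python roughly as follows (a minimal sketch; function and variable names are my own, and positions are reported 0-based rather than 1-based as above):

def naive_search(t, p):
    """Return all 0-based positions at which pattern p starts in text t."""
    n, m = len(t), len(p)
    positions = []
    for j in range(n - m + 1):              # candidate alignment of p against t[j:j+m]
        i = 0
        while i < m and p[i] == t[j + i]:   # compare left to right
            i += 1
        if i == m:                          # all m characters matched
            positions.append(j)
    return positions

print(naive_search("TAGACAATCG", "AAT"))    # [5]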
Knuth–Morris–Pratt algorithm
KMP-Prefix(P)
Begin
    m ← |P|
    T[1] ← 0
    i ← 0
    for j = 2 upto m step 1 do
        while i > 0 and P[i+1] != P[j] do
            i ← T[i]
        if P[i+1] = P[j] then
            i ← i+1
        T[j] ← i
    return T
end

8/56
Knuth–Morris–Pratt algorithm
KMP-Matcher(T,P)
Begin
    n ← |T|
    m ← |P|
    Table ← KMP-Prefix(P)
    i ← 0
    for j = 1 upto n step 1 do
        while i > 0 and P[i+1] != T[j] do
            i ← Table[i]
        end while
        if P[i+1] = T[j] then
            i ← i+1
        end if
        if i = m then
            output(j-m)
            i ← Table[i]
        end if
end
9/56
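
Both procedures translate to Python roughly as follows (a sketch with 0-indexed strings and my own names; occurrences are reported as 0-based start positions):

def kmp_prefix(p):
    """table[j] = length of the longest proper prefix of p[:j+1] that is also a suffix of it."""
    m = len(p)
    table = [0] * m
    i = 0
    for j in range(1, m):
        while i > 0 and p[i] != p[j]:
            i = table[i - 1]                 # fall back along the prefix table
        if p[i] == p[j]:
            i += 1
        table[j] = i
    return table

def kmp_match(t, p):
    """Return all 0-based positions at which p occurs in t."""
    table = kmp_prefix(p)
    positions = []
    i = 0                                    # number of pattern characters currently matched
    for j, c in enumerate(t):
        while i > 0 and p[i] != c:
            i = table[i - 1]
        if p[i] == c:
            i += 1
        if i == len(p):                      # full match ending at position j
            positions.append(j - len(p) + 1)
            i = table[i - 1]
    return positions

print(kmp_match("TAGACAATCG", "AAT"))        # [5]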
The Boyer-Moore algorithm

 The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments.
 Instead of a brute-force search of all alignments (of which there are n − m + 1), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible. [3]

10/56
The Bad Character Rule

 The bad character rule considers the character in T at which the comparison process failed. The next occurrence of that character to the left in P is found, and a shift which brings that occurrence in line with the mismatched occurrence in T is proposed. [3]

The good suffix rule

• If we match some characters, use knowledge of the matched characters to skip alignments. [4]

11/56
Ex.1: the bad character rule

[4] 12/56
Preprocessing for the bad character rule

Input: a pattern p = p1…pm over the alphabet Σ

For all a ∈ Σ do β(a) := 0
For i := 1 to m do β(pi) := i
Output: the function β.

13/56
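
A sketch of this preprocessing in Python, together with a simplified Boyer-Moore search that applies only the bad character rule (names are mine; a complete Boyer-Moore would combine this shift with the good suffix rule described next):

def bad_char_table(p):
    """beta[a] = 1-based position of the rightmost occurrence of character a in p."""
    beta = {}
    for i, a in enumerate(p, start=1):       # later occurrences overwrite earlier ones
        beta[a] = i
    return beta

def bm_bad_char_search(t, p):
    """Boyer-Moore search using only the bad character rule; 0-based positions."""
    n, m = len(t), len(p)
    beta = bad_char_table(p)
    positions = []
    s = 0                                    # current alignment of p against t[s:s+m]
    while s <= n - m:
        j = m - 1
        while j >= 0 and p[j] == t[s + j]:   # compare right to left
            j -= 1
        if j < 0:
            positions.append(s)
            s += 1
        else:
            # shift so the rightmost occurrence of t[s+j] in p lines up with the mismatch
            s += max(1, (j + 1) - beta.get(t[s + j], 0))
    return positions

print(bm_bad_char_search("TAGACAATCG", "AAT"))   # [5]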
Good suffix rule

Let t be the substring of T that matched a suffix of P. Skip alignments until
(a) t matches opposite characters in P, or
(b) a prefix of P matches a suffix of t, or
(c) P moves past t,
whichever happens first.

14/56
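
For completeness, one common border-based way to precompute good suffix shifts is sketched below (my own names, not code from the cited slides). shift[k] is how far the pattern may slide after its suffix P[k:] has matched and a mismatch occurred just before it; shift[0] is used after a full match. A complete Boyer-Moore then advances by the maximum of this value and the bad character shift.

def good_suffix_shifts(p):
    """shift[k] = shift after matching the suffix p[k:] (shift[0] after a full match)."""
    m = len(p)
    shift = [0] * (m + 1)
    border = [0] * (m + 1)
    # Case (a): another occurrence of the matched suffix, preceded by a
    # different character, exists inside p.
    i, j = m, m + 1
    border[i] = j
    while i > 0:
        while j <= m and p[i - 1] != p[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = border[j]
        i -= 1
        j -= 1
        border[i] = j
    # Cases (b)/(c): only a prefix of p matches a suffix of t, or p moves past t.
    j = border[0]
    for k in range(m + 1):
        if shift[k] == 0:
            shift[k] = j
        if k == j:
            j = border[j]
    return shift

print(good_suffix_shifts("ABAB"))            # [2, 2, 2, 4, 1]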
Bad character rule & good suffix rule

15/56
( https://www.youtube.com/watch?v=4Xyhb72LCX4 )
Rabin-Karp – the idea

 Compare the hash values of strings rather than the strings themselves.
 For efficiency, the hash value of the next position in the text is easily computed from the hash value of the current position. [5]

16/56
Example

Pattern = AAT
Text = TAACGGCATACAATCG
Character values: A = 1, T = 2, C = 3, G = 4
Prime number = 7
Calculate the new hash from the old hash:
1. X = oldHash - val(old char)
2. X = X / prime
3. newHash = X + prime^(m-1) * val(new char)

17/56
Example, Rabin-Karp algorithm
Pattern = AAT
H(AAT) = 1 + 1*7 + 2*49 = 106
▪ Text = TAGACAATCG   H(TAG) = 2 + 1*7 + 4*49 = 205 != 106
▪ Text = TAGACAATCG   H(AGA) = (205-2)/7 + 1*49 = 78 != 106
▪ Text = TAGACAATCG   H(GAC) = (78-1)/7 + 3*49 = 158 != 106
▪ Text = TAGACAATCG   H(ACA) = (158-4)/7 + 1*49 = 71 != 106
▪ Text = TAGACAATCG   H(CAA) = (71-1)/7 + 1*49 = 59 != 106
✓ Text = TAGACAATCG   H(AAT) = (59-3)/7 + 2*49 = 106 == 106

18/56
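
The rolling hash from this example in Python, as a sketch (the character values, the prime 7 and the update rule are taken from the slides; like the slides, it uses an exact hash without a modulus, which is adequate for short patterns; names are mine):

VAL = {'A': 1, 'T': 2, 'C': 3, 'G': 4}
PRIME = 7

def window_hash(s):
    """H(s) = val(s[0]) + val(s[1])*7 + val(s[2])*7^2 + ..."""
    return sum(VAL[c] * PRIME ** i for i, c in enumerate(s))

def rabin_karp(t, p):
    """Return all 0-based positions at which p occurs in t."""
    n, m = len(t), len(p)
    target = window_hash(p)
    h = window_hash(t[:m])
    positions = []
    for j in range(n - m + 1):
        # Only compare the strings themselves when the hash values agree.
        if h == target and t[j:j + m] == p:
            positions.append(j)
        if j < n - m:
            # Roll the hash: drop the old character, divide by the prime, add the new one.
            h = (h - VAL[t[j]]) // PRIME + PRIME ** (m - 1) * VAL[t[j + m]]
    return positions

print(rabin_karp("TAGACAATCG", "AAT"))       # [5]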
Finite state automaton

We will show that, after a clever preprocessing of the pattern, one scan of the text from left to right suffices to solve the string matching problem. Furthermore, we will see that the preprocessing can also be realized efficiently; it is possible in O(|p|·|Σ|) time. [1]

19/56
Informal definition of automata

 Informally speaking, a finite automaton can be described as a machine that reads a given text once from left to right. At each step, the automaton is in one of finitely many internal states, and this internal state can change after reading every single symbol of the text, depending only on the current state and the last symbol read.

20/56
Formal definition

 A finite automaton is a quintuple M = (Q, Σ, q0, δ, F), where
 Q is a finite set of states,
 Σ is an input alphabet,
 q0 ∈ Q is the initial state,
 F ⊆ Q is a set of accepting states, and
 δ : Q × Σ → Q is a transition function describing the transitions of the automaton from one state to another.
21/56
Why use a finite state machine?

 Complex pattern matching, such as patterns given by regular expressions rather than a finite set →
Finite State Machine (FSM), a.k.a. DFA

 Time complexity:
 Preprocessing: O(m³ · |Σ|)
 Matching: Θ(n)

22/56
String matching with FSM

23/56
( https://www.youtube.com/watch?v=nNb9lu5Hvio )
FSM Matching algorithm

FINITE-AUTOMATON-MATCHER(T, δ, m)
1. n ← length[T]
2. q ← 0
3. for i ← 1 to n
4.     do q ← δ(q, T[i])
5.         if q = m then
6.             print "Pattern occurs with shift" i-m

24/56
Transition-function construction
algorithm
1. m ← length[P]
2. for q ← 0 to m                      (for each state)
3.     do for each character a ∈ Σ     (|Σ|)
4.         do k ← min(m+1, q+2)
5.             repeat k ← k-1          (1 ≤ k ≤ m+1)
6.             until Pk ⊐ Pq a         (Σ k)
7.             δ(q,a) ← k
8. return δ
Here Pk ⊐ Pq a means that the prefix Pk of P is a suffix of the string Pq a (the prefix Pq followed by the character a).

25/56
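
Both procedures, sketched in Python (0-indexed, with δ stored as a dictionary; names are mine). build_transition follows the definition above: δ(q, a) is the length of the longest prefix of P that is a suffix of the already-matched prefix P[:q] extended by a.

def build_transition(p, alphabet):
    """delta[(q, a)] = length of the longest prefix of p that is a suffix of p[:q] + a."""
    m = len(p)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)
            while k > 0 and not (p[:q] + a).endswith(p[:k]):
                k -= 1                       # slide k down until p[:k] is a suffix of p[:q] + a
            delta[(q, a)] = k
    return delta

def fsm_match(t, p, alphabet):
    """Scan t once and return all 0-based positions at which p occurs."""
    delta = build_transition(p, alphabet)
    m = len(p)
    q = 0
    positions = []
    for i, c in enumerate(t):
        q = delta[(q, c)]
        if q == m:                           # accepting state reached
            positions.append(i - m + 1)
    return positions

print(fsm_match("TAGACAATCG", "AAT", "ACGT"))    # [5]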
Better solution: suffix trees

 Can solve the problem in O(m) time
 Conceptually related to keyword trees [7]

26/56
(Slides 27–51: figures reproduced from [8])
Weiner’s Algorithm I
 Definitions
 Wi: suffix tree for Si = S[i..n]$
 WHead(i): longest prefix of Si that is also a prefix of some Sj, j > i
 Procedure
 Build Wn+1 = edge (root, n+1) labelled $
 For i from n down to 1 do
 Find WHead(i) in Wi+1
 w = node labelled WHead(i) (a new node is created if necessary)
 Create a new leaf i and an edge (w, i) labelled S[i..n] - WHead(i)
52/56
[7]

(Slides 53–54: figures)
[9]
Ukkonen’s suffix tree

(https://www.youtube.com/watch?v=WbLKFzqvacg )
55/56
Suffix array

 In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among other applications, in full-text indices, data compression algorithms, and within the field of bioinformatics.

P.S. 1
Suffix array, example

P.S. 2
Suffix array, example (continue)

P.S. 3
Suffix array – pattern matching

def search(P):
    l = 0; r = n
    while l < r:                  # find the first suffix >= P
        mid = (l + r) // 2
        if P > suffixAt(A[mid]):
            l = mid + 1
        else:
            r = mid
    s = l; r = n
    while l < r:                  # find the first suffix that does not start with P
        mid = (l + r) // 2
        if suffixAt(A[mid]).startswith(P):
            l = mid + 1
        else:
            r = mid
    return (s, r)
P.S. 4
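
A small self-contained version of the same idea: the suffix array is built naïvely by sorting all suffixes (adequate for short strings; linear-time constructions exist), and the search uses the same two binary searches. Names are mine.

def build_suffix_array(s):
    """Start indices of all suffixes of s, sorted lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(s, sa, p):
    """Return (lo, hi) such that sa[lo:hi] holds all start positions of p in s."""
    n = len(s)
    lo, hi = 0, n
    while lo < hi:                           # leftmost suffix >= p
        mid = (lo + hi) // 2
        if p > s[sa[mid]:]:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, n
    while lo < hi:                           # leftmost suffix that does not start with p
        mid = (lo + hi) // 2
        if s[sa[mid]:].startswith(p):
            lo = mid + 1
        else:
            hi = mid
    return start, lo

text = "TAGACAATCG"
sa = build_suffix_array(text)
lo, hi = sa_search(text, sa, "AAT")
print(sorted(sa[lo:hi]))                     # [5]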
References
 [1]: Hans-Joachim Böckenhauer, Dirk Bongartz, “Algorithmic Aspects of Bioinformatics”, Natural Computing Series, Springer, 2007, ISSN 1619-7127
 [2]: https://en.wikipedia.org/wiki/String_searching_algorithm
 [3]: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
 [4]: http://www.cs.jhu.edu/~langmea/resources/lecture_notes/boyer_moore.pdf
 [5]: http://u.cs.biu.ac.il/~rosenfa5/Alg2/fingerpainting.ppt
 [6]: http://web.cs.mun.ca/~wang/courses/cs6783-13f/n2-string-1.pdf
 [7]: http://www.zbh.uni-hamburg.de/pubs/pdf/GieKur1997.pdf
 [8]:
http://bix.ucsd.edu/bioalgorithms/presentations/Ch09_CombinatorialPatternMatching.pdf
 [9]: http://wwwmayr.in.tum.de/konferenzen/Jass03/presentations/pentenrieder.pdf

56/56
