

Pattern Matching Algorithms
Presentation by: Kamran Mahmoudi [kmahmoudi@ieee.org]
Under the supervision of Dr. Mahdavi

Imam Khomeini International University, April 2017


Pattern matching in Bioinformatics
 Certain nucleotide and/or amino acid sequences have properties known to biologists. For example, ATG is a string that must be present at the beginning of every protein-coding gene in a DNA sequence.

 Determining whether a DNA sequence contains a specific (candidate) primer is therefore essential for running PCR correctly.

 A conserved DNA sequence is a sequence of nucleotides that is found in the DNA of multiple species and/or multiple strains.

 Some sequences are conserved precisely; however, many sequences are conserved with some modifications. Finding such modified strings is an important step in mapping the DNA of a new organism.
Intro.
Needle in a haystack

The string matching problem consists of finding a (usually short) string, the pattern, as a substring in a given (usually very long) string, the text. [1]

1/56
Formal Definition

Let Σ be an arbitrary alphabet.


The (exact) string matching problem is the following problem:
Input: Two strings t = t1…tn and p = p1…pm over Σ.
Output: The set of all positions in the text t where an occurrence of the pattern p as a substring starts [1].

2/56
Classification
using preprocessing as the main criterion

Classes of string searching algorithms [2]


                              Text not preprocessed        Text preprocessed
Patterns not preprocessed     Primitive algorithms         Index methods
Patterns preprocessed         Constructed search engines   Signature methods

3/56
Basic classification
 Single Pattern Algorithms
✓ Naïve String Search
✓ Knuth-Morris-Pratt Algorithm
✓ Boyer-Moore Algorithm
✓ Rabin-Karp String Search Algorithm
✓ Finite State Automaton Based Search
 Bitap algorithm (shift-or, shift-and, Baeza–Yates–Gonnet)
 Two-way string-matching algorithm
 BNDM (Backward Non-Deterministic Dawg Matching)
✓ BOM (Backward Oracle Matching)
4/56
Basic classification

 Algorithms using a finite set of patterns


 Aho–Corasick string matching algorithm (extension of Knuth-
Morris-Pratt)
 Commentz-Walter algorithm (extension of Boyer-Moore)
 Set-BOM (extension of Backward Oracle Matching)
 Rabin–Karp string search algorithm

5/56
Basic classification

 Algorithms using an infinite number of patterns


 Naturally, the patterns cannot be enumerated finitely in this case. They are usually represented by a regular grammar or a regular expression.

6/56
Naïve string search
Input: a pattern p = p1…pm and a text t = t1…tn
I := ∅
For j := 0 to n-m do
    i := 1
    while pi = tj+i and i <= m do
        i := i+1
    if i = m+1 then {p1…pm = tj+1…tj+m}
        I := I ∪ {j+1}
Output: The set I of positions where an occurrence of p as a substring in t starts

7/56
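
For reference, the same naïve search can be written in Python roughly as follows (a minimal sketch; function and variable names are my own, and positions are reported 0-based rather than 1-based as above):

def naive_search(t, p):
    """Return all 0-based positions at which pattern p starts in text t."""
    n, m = len(t), len(p)
    positions = []
    for j in range(n - m + 1):              # candidate alignment of p against t[j:j+m]
        i = 0
        while i < m and p[i] == t[j + i]:   # compare left to right
            i += 1
        if i == m:                          # all m characters matched
            positions.append(j)
    return positions

print(naive_search("TAGACAATCG", "AAT"))    # [5]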
Knuth–Morris–Pratt algorithm
KMP-Prefix(P)
Begin
    m ← |P|
    T[1] ← 0
    i ← 0
    for j = 2 upto m step 1 do
        while i > 0 and P[i+1] != P[j] do
            i ← T[i]
        if P[i+1] = P[j] then
            i ← i+1
        T[j] ← i
    return T
end

8/56
Knuth–Morris–Pratt algorithm
KMP-Matcher(T,P)
Begin
    n ← |T|
    m ← |P|
    Table ← KMP-Prefix(P)
    i ← 0
    for j = 1 upto n step 1 do
        while i > 0 and P[i+1] != T[j] do
            i ← Table[i]
        end while
        if P[i+1] = T[j] then
            i ← i+1
        end if
        if i = m then
            output(j-m)
            i ← Table[i]
        end if
end
9/56
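
Both procedures translate to Python roughly as follows (a sketch with 0-indexed strings and my own names; occurrences are reported as 0-based start positions):

def kmp_prefix(p):
    """table[j] = length of the longest proper prefix of p[:j+1] that is also a suffix of it."""
    m = len(p)
    table = [0] * m
    i = 0
    for j in range(1, m):
        while i > 0 and p[i] != p[j]:
            i = table[i - 1]                 # fall back along the prefix table
        if p[i] == p[j]:
            i += 1
        table[j] = i
    return table

def kmp_match(t, p):
    """Return all 0-based positions at which p occurs in t."""
    table = kmp_prefix(p)
    positions = []
    i = 0                                    # number of pattern characters currently matched
    for j, c in enumerate(t):
        while i > 0 and p[i] != c:
            i = table[i - 1]
        if p[i] == c:
            i += 1
        if i == len(p):                      # full match ending at position j
            positions.append(j - len(p) + 1)
            i = table[i - 1]
    return positions

print(kmp_match("TAGACAATCG", "AAT"))        # [5]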
The Boyer-Moore algorithm

 The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments.
 Instead of a brute-force search of all alignments (of which there are n − m + 1), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible. [3]

10/56
The Bad Character Rule

 The bad character rule considers the character in T at which the comparison process failed. The next occurrence of that character to the left in P is found, and a shift which brings that occurrence in line with the mismatched occurrence in T is proposed. [3]

The good suffix rule

• If we match some characters, use knowledge of the matched characters to skip alignments. [4]

11/56
Ex.1: the bad character rule

[4] 12/56
Preprocessing for the bad character rule

Input: a pattern p = p1…pm over the alphabet Σ

For all a ∈ Σ do β(a) := 0
For i := 1 to m do β(pi) := i
Output: the function β.

13/56
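
A sketch of this preprocessing in Python, together with a simplified Boyer-Moore search that applies only the bad character rule (names are mine; a complete Boyer-Moore would combine this shift with the good suffix rule described next):

def bad_char_table(p):
    """beta[a] = 1-based position of the rightmost occurrence of character a in p."""
    beta = {}
    for i, a in enumerate(p, start=1):       # later occurrences overwrite earlier ones
        beta[a] = i
    return beta

def bm_bad_char_search(t, p):
    """Boyer-Moore search using only the bad character rule; 0-based positions."""
    n, m = len(t), len(p)
    beta = bad_char_table(p)
    positions = []
    s = 0                                    # current alignment of p against t[s:s+m]
    while s <= n - m:
        j = m - 1
        while j >= 0 and p[j] == t[s + j]:   # compare right to left
            j -= 1
        if j < 0:
            positions.append(s)
            s += 1
        else:
            # shift so the rightmost occurrence of t[s+j] in p lines up with the mismatch
            s += max(1, (j + 1) - beta.get(t[s + j], 0))
    return positions

print(bm_bad_char_search("TAGACAATCG", "AAT"))   # [5]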
Good suffix rule

Let t be the substring of T that matched a suffix of P. Skip alignments until
(a) t matches opposite characters in P, or
(b) a prefix of P matches a suffix of t, or
(c) P moves past t,
whichever happens first.

14/56
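
For completeness, one common border-based way to precompute good suffix shifts is sketched below (my own names, not code from the cited slides). shift[k] is how far the pattern may slide after its suffix P[k:] has matched and a mismatch occurred just before it; shift[0] is used after a full match. A complete Boyer-Moore then advances by the maximum of this value and the bad character shift.

def good_suffix_shifts(p):
    """shift[k] = shift after matching the suffix p[k:] (shift[0] after a full match)."""
    m = len(p)
    shift = [0] * (m + 1)
    border = [0] * (m + 1)
    # Case (a): another occurrence of the matched suffix, preceded by a
    # different character, exists inside p.
    i, j = m, m + 1
    border[i] = j
    while i > 0:
        while j <= m and p[i - 1] != p[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = border[j]
        i -= 1
        j -= 1
        border[i] = j
    # Cases (b)/(c): only a prefix of p matches a suffix of t, or p moves past t.
    j = border[0]
    for k in range(m + 1):
        if shift[k] == 0:
            shift[k] = j
        if k == j:
            j = border[j]
    return shift

print(good_suffix_shifts("ABAB"))            # [2, 2, 2, 4, 1]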
Bad character rule & good suffix rule

15/56
( https://www.youtube.com/watch?v=4Xyhb72LCX4 )
Rabin-Karp – the idea

 Compare the hash values of strings rather than the strings themselves.
 For efficiency, the hash value of the next position in the text is easily computed from the hash value of the current position. [5]

16/56
Example

Pattern = AAT
Text = TAACGGCATACAATCG
Character values: A = 1, T = 2, C = 3, G = 4
Prime number = 7
Calculate the new hash from the old hash:
1. X = oldHash - val(old char)
2. X = X / prime
3. newHash = X + prime^(m-1) * val(new char)

17/56
Example, Rabin-Karp algorithm
Pattern = AAT
H(AAT) = 1 + 1*7 + 2*49 = 106
▪ Text = TAGACAATCG   H(TAG) = 2 + 1*7 + 4*49 = 205 != 106
▪ Text = TAGACAATCG   H(AGA) = (205-2)/7 + 1*49 = 78 != 106
▪ Text = TAGACAATCG   H(GAC) = (78-1)/7 + 3*49 = 158 != 106
▪ Text = TAGACAATCG   H(ACA) = (158-4)/7 + 1*49 = 71 != 106
▪ Text = TAGACAATCG   H(CAA) = (71-1)/7 + 1*49 = 59 != 106
✓ Text = TAGACAATCG   H(AAT) = (59-3)/7 + 2*49 = 106 == 106

18/56
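
The rolling hash from this example in Python, as a sketch (the character values, the prime 7 and the update rule are taken from the slides; like the slides, it uses an exact hash without a modulus, which is adequate for short patterns; names are mine):

VAL = {'A': 1, 'T': 2, 'C': 3, 'G': 4}
PRIME = 7

def window_hash(s):
    """H(s) = val(s[0]) + val(s[1])*7 + val(s[2])*7^2 + ..."""
    return sum(VAL[c] * PRIME ** i for i, c in enumerate(s))

def rabin_karp(t, p):
    """Return all 0-based positions at which p occurs in t."""
    n, m = len(t), len(p)
    target = window_hash(p)
    h = window_hash(t[:m])
    positions = []
    for j in range(n - m + 1):
        # Only compare the strings themselves when the hash values agree.
        if h == target and t[j:j + m] == p:
            positions.append(j)
        if j < n - m:
            # Roll the hash: drop the old character, divide by the prime, add the new one.
            h = (h - VAL[t[j]]) // PRIME + PRIME ** (m - 1) * VAL[t[j + m]]
    return positions

print(rabin_karp("TAGACAATCG", "AAT"))       # [5]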
Finite state automaton

We will show that, after a clever preprocessing of the pattern, one scan of the text from left to right suffices to solve the string matching problem. Furthermore, we will see that the preprocessing can also be realized efficiently; it is possible in O(|p|·|Σ|) time. [1]

19/56
Informal definition of automata

 Informally speaking, a finite automaton can be described as a machine that reads a given text once from left to right. At each step, the automaton is in one of finitely many internal states, and this internal state can change after reading every single symbol of the text, depending only on the current state and the last symbol read.

20/56
Formal definition

 A finite automaton is a quintuple M = (Q, Σ, q0, δ, F), where
 Q is a finite set of states,
 Σ is an input alphabet,
 q0 ∈ Q is the initial state,
 F ⊆ Q is a set of accepting states, and
 δ : Q × Σ → Q is a transition function describing the transitions of the automaton from one state to another.
21/56
Why use a finite state machine?

 Complex pattern matching, such as patterns given by regular expressions rather than a finite set →
Finite State Machine (FSM), a.k.a. DFA

 Time complexity:
 Preprocessing: O(m³ · |Σ|)
 Matching: Θ(n)

22/56
String matching with FSM

23/56
( https://www.youtube.com/watch?v=nNb9lu5Hvio )
FSM Matching algorithm

FINITE-AUTOMATON-MATCHER(T, δ, m)
1. n ← length[T]
2. q ← 0
3. for i ← 1 to n
4.     do q ← δ(q, T[i])
5.         if q = m then
6.             print "Pattern occurs with shift" i-m

24/56
Transition-function construction
algorithm
1. m ← length[P]
2. for q ← 0 to m                      (for each state)
3.     do for each character a ∈ Σ     (|Σ|)
4.         do k ← min(m+1, q+2)
5.             repeat k ← k-1          (1 ≤ k ≤ m+1)
6.             until Pk ⊐ Pq a         (Σ k)
7.             δ(q,a) ← k
8. return δ
Here Pk ⊐ Pq a means that the prefix Pk of P is a suffix of the string Pq a (the prefix Pq followed by the character a).

25/56
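
Both procedures, sketched in Python (0-indexed, with δ stored as a dictionary; names are mine). build_transition follows the definition above: δ(q, a) is the length of the longest prefix of P that is a suffix of the already-matched prefix P[:q] extended by a.

def build_transition(p, alphabet):
    """delta[(q, a)] = length of the longest prefix of p that is a suffix of p[:q] + a."""
    m = len(p)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)
            while k > 0 and not (p[:q] + a).endswith(p[:k]):
                k -= 1                       # slide k down until p[:k] is a suffix of p[:q] + a
            delta[(q, a)] = k
    return delta

def fsm_match(t, p, alphabet):
    """Scan t once and return all 0-based positions at which p occurs."""
    delta = build_transition(p, alphabet)
    m = len(p)
    q = 0
    positions = []
    for i, c in enumerate(t):
        q = delta[(q, c)]
        if q == m:                           # accepting state reached
            positions.append(i - m + 1)
    return positions

print(fsm_match("TAGACAATCG", "AAT", "ACGT"))    # [5]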
Better solution: suffix trees

 Can solve the problem in O(m) time
 Conceptually related to keyword trees [7]

26/56
(Slides 27–51: figures reproduced from [8])
Weiner’s Algorithm I
 Definitions
 Wi: suffix tree for Si = S[i..n]$
 WHead(i): longest prefix of Si that is also a prefix of some Sj, j > i
 Procedure
 Build Wn+1 = edge (root, n+1) labelled $
 For i from n down to 1 do
 Find WHead(i) in Wi+1
 w = node labelled WHead(i) (a new node is created if necessary)
 Create a new leaf i and an edge (w, i) labelled S[i..n] - WHead(i)
52/56
[7]

(Slides 53–54: figures)
[9]
Ukkonen’s suffix tree

(https://www.youtube.com/watch?v=WbLKFzqvacg )
55/56
Suffix array

 In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among other applications, in full-text indices, data compression algorithms, and within the field of bioinformatics.

P.S. 1
Suffix array, example

P.S. 2
Suffix array, example (continue)

P.S. 3
Suffix array – pattern matching

def search(P):
    l = 0; r = n
    while l < r:                  # find the first suffix >= P
        mid = (l + r) // 2
        if P > suffixAt(A[mid]):
            l = mid + 1
        else:
            r = mid
    s = l; r = n
    while l < r:                  # find the first suffix that does not start with P
        mid = (l + r) // 2
        if suffixAt(A[mid]).startswith(P):
            l = mid + 1
        else:
            r = mid
    return (s, r)
P.S. 4
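
A small self-contained version of the same idea: the suffix array is built naïvely by sorting all suffixes (adequate for short strings; linear-time constructions exist), and the search uses the same two binary searches. Names are mine.

def build_suffix_array(s):
    """Start indices of all suffixes of s, sorted lexicographically."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(s, sa, p):
    """Return (lo, hi) such that sa[lo:hi] holds all start positions of p in s."""
    n = len(s)
    lo, hi = 0, n
    while lo < hi:                           # leftmost suffix >= p
        mid = (lo + hi) // 2
        if p > s[sa[mid]:]:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, n
    while lo < hi:                           # leftmost suffix that does not start with p
        mid = (lo + hi) // 2
        if s[sa[mid]:].startswith(p):
            lo = mid + 1
        else:
            hi = mid
    return start, lo

text = "TAGACAATCG"
sa = build_suffix_array(text)
lo, hi = sa_search(text, sa, "AAT")
print(sorted(sa[lo:hi]))                     # [5]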
References
 [1]: Hans-Joachim Böckenhauer, Dirk Bongartz, “Algorithmic Aspects of Bioinformatics”, Natural Computing Series, Springer, 2007, ISSN 1619-7127
 [2]: https://en.wikipedia.org/wiki/String_searching_algorithm
 [3]: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
 [4]: http://www.cs.jhu.edu/~langmea/resources/lecture_notes/boyer_moore.pdf
 [5]: http://u.cs.biu.ac.il/~rosenfa5/Alg2/fingerpainting.ppt
 [6]: http://web.cs.mun.ca/~wang/courses/cs6783-13f/n2-string-1.pdf
 [7]: http://www.zbh.uni-hamburg.de/pubs/pdf/GieKur1997.pdf
 [8]:
http://bix.ucsd.edu/bioalgorithms/presentations/Ch09_CombinatorialPatternMatching.pdf
 [9]: http://wwwmayr.in.tum.de/konferenzen/Jass03/presentations/pentenrieder.pdf

56/56
