Survey Paper On String Matching

COMPARISON OF STRING MATCHING ALGORITHMS AND
CLASSIFICATION
Abstract
The volume of data available online has grown tremendously over the past two decades. The need
for dataset, signal and speech processing and for big data analytics has led to much research on string
matching and pattern matching algorithms. With so many algorithms developed, it becomes increasingly
difficult to choose one for use in applications or further research. This paper aims to present a
simple yet comprehensive picture of important string matching algorithms and their classification.
I. INTRODUCTION:
String matching is the process of identifying a pattern in a large volume of data. It is used in a wide
variety of fields such as natural language processing, pattern recognition, sentiment analysis and analytics.
Most of these applications are in turn used to solve more complex problems such as building chatbots
and labeling records. A string is defined as a sequence of characters, including spaces and digits. The
goal of string matching is to find one string within another, where the second string can be very large.
This requires considerable computing power and is time consuming. Many algorithms exist, but they are
scattered across dissimilar fields, which makes it hard to see what they have in common and how they
differ. The main idea of any such algorithm is to align the pattern with the beginning of the dataset,
compare the two, and then move the pattern forward until the end of the dataset is reached.
Motivation:
This paper does not propose any specific string matching algorithm. Instead, it briefly explains the
existing algorithms and discusses their merits and demerits. The objective is to help application
developers choose the right algorithm for their needs and to help fellow researchers find the algorithm
they want to improve further.
II. CATEGORIZATION OF STRING-MATCHING ALGORITHMS
Instead of listing all the available algorithms, we categorize them by their general strategy.
There are two main categories: exact string matching, which looks for an exact match of all the
characters in the pattern and provides no tolerance, and approximate string matching, which
provides a certain level of tolerance when matching the pattern.
A. Exact String Matching Algorithm
[27]. In this approach, the number of characters in the pattern equals the length of the matching window.
1) Hardware-Based
This approach requires dedicated hardware, such as a field-programmable gate array (FPGA) or a
graphics processing unit (GPU), together with parallel programming platforms such as CUDA. Because
it relies on such hardware devices, it has more overhead. It is faster than the software approach but
costly [1].
2) Software-Based
This approach is more adaptable and can be reused many times within an application, which makes
software-based algorithms more popular.
Using either of the two strategies, we can further divide the algorithms into:
a. Single-Pattern Matching
There is only one input pattern, which is matched against the entire dataset. It is further categorized
into hardware-based and software-based matching.
b. Multiple-Pattern Matching
This is the more complex and more advanced variant of single-pattern matching. Here several input
patterns are searched for in the given dataset at the same time. Its applications include, but are not
limited to, bioinformatics, e.g. comparison of DNA [2].
B. Approximate String Matching Algorithm
In contrast to exact algorithms, where the entire pattern must match, this approach looks for a
substring that is close, to a specified degree, to the required pattern. Closeness is defined as k or
fewer differences, as in the Wu and Manber approach. Approximate matching is used in situations where
spelling errors can be expected. It is further divided into two types:
a. Filtration-based algorithms: This is a two-step process. In the first step, the locations of all
candidate occurrences of the given pattern in the dataset are identified. In the next step, these
locations are fully verified.
b. Backtracking-based algorithms: These are extended versions of exact matching algorithms,
modified to use edit distance operations. Succinct and index-based data structures are commonly
used.
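One way to make the notion of "k or fewer differences" concrete is the classic dynamic-programming formulation of approximate matching, shown here as a minimal sketch. This is not the Wu and Manber method named above; the function name approx_find and its return convention (end positions of matches) are assumptions made for illustration.

```python
def approx_find(pattern, text, k):
    """Return the end positions (exclusive) in text at which some substring
    matches pattern with at most k edits (insert, delete, substitute)."""
    m = len(pattern)
    # prev[i] = fewest edits needed to match pattern[:i] against some
    # substring of the text that ends at the previous text position.
    prev = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        # curr[0] = 0 because a match is allowed to start anywhere in the text.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(prev[i - 1] + cost,  # match or substitution
                          prev[i] + 1,         # extra character in the text
                          curr[i - 1] + 1)     # pattern character missing in the text
        if curr[m] <= k:
            ends.append(j)
        prev = curr
    return ends
```

With k = 0 the sketch reduces to exact matching; larger values of k tolerate spelling errors at a cost of O(mn) time.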
III. PROMINENT SINGLE PATTERN MATCHING ALGORITHMS (SOFTWARE BASED)
A. Classical Method
This method is based on character comparisons between the pattern and the window of data. The
brute force algorithm is considered one of the simplest: at every position of the dataset it compares
the entire pattern against the window, going from left to right (L->R) [3].
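The brute force scan described above can be sketched as follows. The function name brute_force_search is hypothetical, and returning the list of all match positions is a choice made here for illustration.

```python
def brute_force_search(pattern, text):
    """Return every index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    matches = []
    # Try every alignment of the pattern against the text, left to right.
    for s in range(n - m + 1):
        j = 0
        # Compare characters left to right until a mismatch or a full match.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(s)
    return matches
```

Each of the n - m + 1 alignments can cost up to m comparisons, which gives the O(mn) bound.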
Knuth-Morris-Pratt (KMP) Algorithm of 1977: It searches for the pattern within the dataset from left
to right (L->R) and, when a mismatch occurs, takes advantage of the previously matched characters to
decide where to resume. The benefit is that the pointer into the dataset is never decremented [4].
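A minimal sketch of this idea is given below, assuming the usual failure-table formulation of KMP. The names build_failure and kmp_search are hypothetical. Note that the loop over the dataset only moves forward, matching the property stated above.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern, text):
    """Return every index at which pattern occurs in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):  # the text index i never moves backwards
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # reuse what is already known to match
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches
```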
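Boyer-Moore (BM) Algorithm of 1977: It is considered one of the best algorithms for single-pattern
matching. It compares the pattern against the window from right to left (R->L), starting at the pattern
suffix, and uses two heuristics, the bad-character rule and the good-suffix rule, to compute the shift on
a mismatch [5].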
Quick Search (QS) Algorithm of 1990: It is inspired by the Boyer-Moore algorithm and simplifies it [6] by
using only the bad-character shift [8]. It is best suited for large alphabets and short patterns [9].
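The bad-character shift used by Quick Search can be sketched as follows, assuming the common formulation in which the shift is taken from the dataset character immediately to the right of the current window. The function name quick_search is hypothetical.

```python
def quick_search(pattern, text):
    """Return every index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    # Shift for a character = m minus the index of its last occurrence in
    # the pattern; characters absent from the pattern shift by m + 1.
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    matches = []
    s = 0
    while s <= n - m:
        if text[s:s + m] == pattern:
            matches.append(s)
        if s + m >= n:
            break  # no character exists to the right of the window
        s += shift.get(text[s + m], m + 1)
    return matches
```

With a large alphabet most characters are absent from a short pattern, so the window often jumps by m + 1 positions.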
Boyer-Moore-Smith Algorithm of 1991: It takes the maximum of the shift values calculated from the
dataset characters. Its advantage shows when this maximum shift comes from the rightmost character of
the dataset window [10].
Raita Algorithm of 1992: It begins by comparing the rightmost character of the window with the
corresponding position in the pattern. If they match, it compares the leftmost character of the window
with its counterpart in the pattern. The remaining characters are then compared from right to left
(R->L) until the pattern is completely matched or a mismatch occurs [11].
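The comparison order described above can be sketched as follows. The shift table used here is the Horspool-style bad-character table, which is an assumption of this sketch, and the function name raita_search is hypothetical.

```python
def raita_search(pattern, text):
    """Return every index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Horspool-style shift: distance from the last occurrence of a character
    # in pattern[:-1] to the end of the pattern; m if it does not occur.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    first, last = pattern[0], pattern[-1]
    matches = []
    s = 0
    while s <= n - m:
        window_last = text[s + m - 1]
        # Check the rightmost character first, then the leftmost one.
        if window_last == last and text[s] == first:
            # Compare the remaining characters from right to left.
            j = m - 2
            while j >= 1 and text[s + j] == pattern[j]:
                j -= 1
            if j < 1:
                matches.append(s)
        s += shift.get(window_last, m)
    return matches
```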
Shift-Or (SO) Algorithm of 1992: It is based on bitwise operations. It is best suited for situations
where the pattern length is no longer than the memory word size [13].
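A minimal sketch of the bitwise technique is given below, with the hypothetical name shift_or_search. Python integers are not limited to a machine word, so the word-size restriction does not appear here; in a language with fixed-width integers the state would have to fit in one word for the update to be a single operation.

```python
def shift_or_search(pattern, text):
    """Return every index at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    # Bit i of state is 0 when pattern[:i+1] matches the text ending here.
    state = all_ones
    accept = 1 << (m - 1)
    matches = []
    for pos, ch in enumerate(text):
        # One shift and one OR update all prefix matches at once.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & accept == 0:
            matches.append(pos - m + 1)
    return matches
```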
Backward-Oracle-Matching (BOM) Algorithm of 1999: It is probably one of the most optimized algorithms
and is best suited for long patterns. It slides a window of length equal to the pattern length over the
dataset [14].
B. Hashing Method
This method uses hashing to avoid a quadratic number of character comparisons in most practical
situations.
Karp-Rabin (KR) Algorithm of 1987: It computes a hash value for every m-character substring of the
dataset and compares it with the hash value of the pattern [15].
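The hashing idea can be sketched with a polynomial rolling hash, which updates the hash of each m-character window in constant time. The function name rabin_karp_search and the parameters base and mod are assumptions chosen for illustration, not values taken from the cited work.

```python
def rabin_karp_search(pattern, text, base=256, mod=1_000_000_007):
    """Return every index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading window character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    matches = []
    for s in range(n - m + 1):
        # Equal hashes may be a collision, so confirm with a direct comparison.
        if p_hash == w_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # Roll the hash: drop text[s], append text[s + m].
            w_hash = ((w_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % mod
    return matches
```

The direct comparison after a hash match is what makes the worst case O(mn), for example when many windows collide with the pattern hash.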
IV. RESULTS:
Table 1 gives the complexity analysis of the above algorithms and of other well-known string matching
algorithms, where m and n are the pattern length and dataset length respectively, and k is the alphabet
size.
Algorithm                        Preprocessing     String Matching
Brute Force                      Not Applicable    O(mn)
Deterministic Finite Automaton   O(mk)             O(n)
Rabin-Karp                       O(n)              O(mn)
Morris-Pratt                     O(m)              O(m + n)
Colussi                          O(m)              O(n)
Boyer-Moore                      O(m + n)          O(mn)
Turbo-BM                         O(m + n)          O(mn)
Aho-Corasick                     O(m + n)          O(m + n)
Alpha Skip Search                O(m)              O(mn)
Reverse Colussi                  O(m^2)            O(n)
Apostolico-Giancarlo             O(m + n)          O(n)
Smith-Waterman                   O(m + n)          O(mn)
Needleman-Wunsch                 Not Applicable    O(mn)
Raita                            O(m + n)          O(mn)
Reverse Factor                   O(m)              O(mn)
Berry-Ravindran                  O(m + n^2)        O(m + n)
Table 1: Comparison of the time complexity of various string matching algorithms
V. CONCLUSION:
Several string-matching methods have been explained in brief, and their strengths, weaknesses and
complexity analyzed. This can help future researchers find suitable research topics and begin their
work. The prominent work on each of these algorithms has been cited so that it is easy to find the
required research paper and learn about the algorithms in detail.
VI. REFERENCES:
[1] “String matching algorithms,” http://www.cs.rit.edu/~lr/courses/alg/student/1/String.pdf
[2] String searching algorithm, Wikipedia, the free encyclopedia.
[2] M. O. Külekci, “Filter Based Fast Matching of Long Patterns by Using SIMD Instructions”, in Stringology,
(2009), pp. 118-128.
[3] S. Faro, T. Lecroq, “The Exact Online String Matching Problem: a Review of the Most Recent Results”, ACM
Computing Surveys (CSUR), Volume 45 Issue 2, Article No. 13, February 2013.
[4] D. Knuth, J. Morris, and V. Pratt, “Fast pattern matching in strings”, SIAM Journal on Computing, volume 6(1),
322–350. (1977).
[5] S. Wu and U. Manber, “A fast algorithm for multi-pattern searching”, (1994).
[6] R. Boyer, J. Moore, “A fast string searching algorithm”, Communication of the ACM
[7] R. Horspool, “Practical fast searching in strings”, Softw. Pract. Exp., Volume 10, 6,
[8] B. C. Walter, “A string matching algorithm fast on the average”, in International Colloquium on Automata,
Languages, and Programming, (1979), pp. 118-132.
[9] S. Wu and U. Manber, “A fast algorithm for multi-pattern searching”, (1994).
[10] “knuth morris pratt algorithm,”
http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/StringMatch/kuthMP.html
[11] L. Colussi, “Correctness and efficiency of the pattern matching algorithms”, Information and Computation,
Volume 95 Issue 2, Dec. 1991.
[12] T. Raita, “Tuning the Boyer-Moore-Horspool string searching algorithm”, Software: Practice and Experience,
Volume 22, No. 10, pp. 879-884, 1992.
[13] M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter, “Speeding
Up Two String Matching Algorithms”, Algorithmica
[14] C. Charras, and T. Lecroq, Handbook of exact string matching algorithms. King’s College Publications, 2004.
