Proceeding of the 3rd International Conference on Informatics and Technology, 2009
 
©Informatics '09, UM 2009
 
 RDT4 -
 
81
BM-KMP HYBRID ALGORITHM FOR EXACT AND SUBSEQUENCE STRING MATCHING
Ammar Waysi Mahmood 1, Nur'Aini binti Abdul Rashid 2, Atheer A. Abdul Rozaq 3

1 School of Computer Science, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia: ammar_wysi@yahoo.com
2 School of Computer Science, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia: nuraini@webmail.cs.usm.my
3 School of Computer Science, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia: athproof@yahoo.com
ABSTRACT 
This study focuses on hybridizing two well-known exact string matching algorithms, the Boyer-Moore and Knuth-Morris-Pratt algorithms. The hybrid algorithm employs the main ideas of the two phases of both algorithms: the good prefix idea from the Knuth-Morris-Pratt algorithm and bad character shifting from the Boyer-Moore algorithm. The proposed hybrid algorithm was then adapted for subsequence matching, with some modifications to the preprocessing and search phases. Both the hybrid and the enhanced algorithms were tested on four types of alphabet: binary, DNA, protein and English. The results show that these algorithms perform better than the Boyer-Moore and Knuth-Morris-Pratt algorithms in terms of time and number of characters compared across different alphabet sizes. The two algorithms outperformed the Knuth-Morris-Pratt algorithm on all types of alphabet, and outperformed the Boyer-Moore algorithm on the binary, DNA and protein alphabets. However, they performed worse than the Boyer-Moore algorithm on the English alphabet.
 
Keywords: Boyer-Moore, BM, Knuth-Morris-Pratt, KMP, exact string matching, subsequence matching, hybrid algorithm
1.0 INTRODUCTION
String matching, the process of finding patterns in a text, is one of the oldest and most often used operations in text processing, which is one of the fields of computer science. It is used in all word processing systems in the form of finding and replacing text. This operation is becoming more and more complex due to the drastic increase in the size of texts.

The current interest in applications of string matching algorithms in computational biology and information retrieval keeps research in this field alive and active. Although the research started long ago, there is still a lot of interest in it, especially in creating simpler ideas that work better in practice.

In this research we propose a new algorithm that is a hybrid of two existing string matching algorithms, the Boyer-Moore and Knuth-Morris-Pratt algorithms. The two algorithms are different in nature. The proposed hybrid algorithm takes advantage of both existing algorithms and is adjusted to suit a new kind of data, namely biological subsequences.
2.0 PREVIOUS STUDIES
In this section we discuss previous studies of hybrid algorithms for exact string matching and of algorithms for subsequence matching.
2.1 Hybrid Algorithms for Exact String Matching
Exact string matching algorithms differ in behavior and speed, and their speed changes depending on the size of the alphabet. It is therefore useful to combine two (or more) of these algorithms to gain the advantages of each, increasing the search speed, reducing the time needed to match strings, and making the new algorithm suitable for as many alphabet sizes as possible. [1]
2.1.1 Faster Hybrid Algorithm
The Boyer-Moore algorithm is one of the fastest and most important algorithms for exact string matching. It is therefore natural to see studies that try to hybridize it with other algorithms to obtain a new, fast algorithm. The hybrid of the Boyer-Moore algorithm with the Quick Search algorithm for exact string matching works by continuously skipping over the text, which is achieved through large shifts on the text. Like the pure Boyer-Moore algorithm, this algorithm has two phases: a preprocessing phase and a searching phase. It is based on the bad character and good suffix heuristics, trying to benefit from both, and the comparison of characters between text and pattern during the search phase is done from right to left. The results show that this algorithm is faster than other algorithms such as the Boyer-Moore and Quick Search algorithms, especially when the pattern length is small. [2]
 
2.1.2 Berry-Ravindran Fast Search hybrid algorithm
The BRFS hybrid algorithm combines two searching algorithms, the Fast Search algorithm and the Berry-Ravindran algorithm, to improve performance by increasing the shift distance. This algorithm has two phases: preprocessing and search. The preprocessing phase computes two functions: the first is the good suffix heuristic of the Boyer-Moore algorithm and the second is the bad character heuristic of the Berry-Ravindran algorithm. The search phase proceeds from right to left. This algorithm has high performance and is reported to be faster than any other algorithm for small alphabets and long patterns, which are exactly the properties of biological sequences. [4]
2.1.3 TVSBS Hybrid Algorithm
The TVSBS hybrid algorithm combines the Berry-Ravindran algorithm and the SSABS algorithm. It uses the Berry-Ravindran bad character shift to reduce the number of character comparisons, and it is designed for nucleotide and amino acid sequences. The preprocessing phase calculates the Berry-Ravindran bad character value for all characters of the alphabet, not only for one character. [5]
2.1.4 Franek-Jennings-Smyth Hybrid Algorithm
The Knuth-Morris-Pratt algorithm is also an important algorithm in exact string matching. It has been combined with the Sunday algorithm to produce a new hybrid, in which the comparison of characters between text and pattern is done from left to right. The hybrid depends on partial matching: if there is a partial match between pattern and text, the Knuth-Morris-Pratt algorithm is used to determine the comparison for the next step; otherwise, the shift of the Sunday algorithm is used to determine the next position of the comparison. It also has preprocessing and search phases. The algorithm produced better results (about 5-10% better) than the other exact string matching algorithms, especially with alphabet sizes above twenty characters. [6]
2.2 Subsequences Matching
In previous studies, many different methods and algorithms have been used to deal with the subsequence problem for different purposes.
2.2.1 Subsequence Automaton Algorithm
The subsequence automaton is a deterministic finite automaton, constructed over multiple texts, that accepts all subsequences of a set of texts. This method can be used to preprocess a given set of texts so that, for any query, it returns the number of texts that contain the query as a subsequence. [7]
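The idea of a subsequence automaton can be illustrated with a minimal Python sketch. It is a simplification, not the construction from [7]: it assumes one small automaton per text (a next-occurrence table) rather than a single automaton over the whole set, and the names build_subsequence_automaton, is_subsequence and count_texts_containing are hypothetical.

```python
def build_subsequence_automaton(text):
    """Build a next-occurrence table for one text (assumed simplification).

    State i means "the first i characters of text have been consumed".
    nxt[i][ch] is the state reached after reading ch from state i, i.e.
    one past the first position >= i where ch occurs in text.
    """
    n = len(text)
    nxt = [dict() for _ in range(n + 1)]
    # Fill from the right so each state inherits the transitions of the next one.
    for i in range(n - 1, -1, -1):
        nxt[i] = dict(nxt[i + 1])
        nxt[i][text[i]] = i + 1
    return nxt


def is_subsequence(nxt, query):
    """Return True if query is accepted, i.e. it is a subsequence of the text."""
    state = 0
    for ch in query:
        state = nxt[state].get(ch)
        if state is None:  # no further occurrence of ch: reject
            return False
    return True


def count_texts_containing(texts, query):
    """Count the texts that contain query as a subsequence."""
    return sum(
        is_subsequence(build_subsequence_automaton(t), query) for t in texts
    )


if __name__ == "__main__":
    print(count_texts_containing(["ACGT", "AGGT", "TTTT"], "AGT"))
```

In a real setting the automata would be built once during preprocessing and reused for every query; this sketch rebuilds them on each call only to stay short.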
2.2.2 Aligned Subsequence Matching Algorithm
A novel algorithm, called aligned subsequence matching, is based on a segmentation of sequences of real numbers that runs in linear time. After the segmentation, the similarity distance is calculated using a special function. The authors also presented an indexing method based on a suffix tree to speed up this algorithm, which is designed for the retrieval of similar subsequences. The algorithm is reported to be 6.5 times faster than other algorithms on data from the UC and KDD archive. [8]
2.2.3 Subsequence Matching for Communications
Subsequence matching, as mentioned, is used in natural language texts and bioinformatics sequences; an additional use is in communications. In a study by Mercier et al., it is used for error correction of insertions and deletions in packets. The study presents a framework for finding the number of subsequences obtained when deleting any number of symbols from a string. This framework applies only to strings over non-binary alphabets and uses a technique called the sandwich technique. [9]
 
2.2.4 WASP Algorithm
The WASP (Window-Accumulated Subsequence matching Problem) algorithm relies on accumulated windows to match a subsequence in a long text. A window slides over the text to discover every sequence, within the size of the window, of which the pattern is a subsequence. This is done by generalizing the KMP algorithm for use in subsequence matching. [10]
3.0 METHODOLOGY
In our methodology, we benefit from the bad character shift of the BM algorithm and the prefix shift of the KMP algorithm. As is well known, both algorithms have two phases, the preprocessing phase and the search phase; therefore, our algorithm also has two phases.
3.1.1 Preprocessing Phase for Hybrid Exact String Matching Algorithm
The preprocessing phase of the KMP algorithm produces only a single table, which contains the information that the KMP algorithm uses for prefixes of the pattern. Our algorithm uses the same table as one of the two tables it needs. This table is created by a preprocessing function in the KMP algorithm that is usually called "create next".

The preprocessing phase of the BM algorithm produces two tables: the first is for the bad character and the second is for the good suffix. We use the idea of the bad character in our algorithm. The bad character table contains the shift value needed to reach the rightmost position of each character in the pattern. For characters that do not belong to the pattern, the shift value is maximized and assigned the value "m", where "m" is the size of the pattern. Our proposed algorithm uses a bad character table not only for the rightmost characters, as in BM, but for each position in the pattern. Table 1 shows this idea for the pattern "character".

Table 1: Bad character table for all positions in the pattern

Index   c   h   a   r   t   e   Others
0       0   1   1   1   1   1   1
1       1   0   2   2   2   2   2
2       2   1   0   3   3   3   3
3       3   3   2   0   0   4   4
4       4   3   0   1   5   5   5
5       0   4   1   2   6   6   6
6       1   5   2   3   0   7   7
7       2   6   3   4   1   0   8
8       3   7   4   0   2   1   9

3.1.2 Searching Phase for Hybrid Exact String Matching Algorithm
In our algorithm, the first comparison between the characters of the pattern and the text begins at the rightmost character (the character with index m-1), in order to benefit as much as possible from the maximum shift, which equals m. This comparison at the rightmost character is done only once after each shift. The second comparison is made at the first element on the left (the character with index 0), and the comparison then continues up to the element with index m-2. This order of comparison takes advantage of both the prefix table of KMP and the shifts from the bad character table. Fig. 1 shows the arrangement of the character comparisons.

Fig. 1: The character comparison arrangements

Comparison   2   3   4   ...   1
Index        0   1   2   ...   m-1
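The two phases described in Sections 3.1.1 and 3.1.2 can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not spell out. The bad character entry for position i and character x is read as the distance from i back to the nearest occurrence of x at or before i, or i+1 if there is none; this reading agrees with the Table 1 rows checked in the tests, although the printed row for index 3 does not follow it. The shift after a mismatch is assumed to be the larger of the KMP prefix shift and the bad character shift, and comparisons restart after every shift. The function names are hypothetical.

```python
def prefix_table(pattern):
    """KMP prefix function: pi[k] is the length of the longest proper
    prefix of pattern[:k + 1] that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def bad_char_table(pattern):
    """Per-position bad character table (assumed reading of Table 1).

    table[i][ch] is the distance from position i back to the nearest
    occurrence of ch at or before i. Characters missing from table[i]
    are treated by the caller as having the value i + 1.
    """
    table = []
    last = {}  # last seen position of each character so far
    for i, ch in enumerate(pattern):
        last[ch] = i
        table.append({c: i - p for c, p in last.items()})
    return table


def hybrid_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    pi = prefix_table(pattern)
    bc = bad_char_table(pattern)
    hits = []
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        # Comparison 1: the rightmost character, checked once per alignment.
        right = text[s + m - 1]
        if right != pattern[m - 1]:
            s += bc[m - 1].get(right, m)
            continue
        # Comparisons 2..m: left to right over indexes 0..m-2.
        j = 0
        while j < m - 1 and text[s + j] == pattern[j]:
            j += 1
        if j == m - 1:
            hits.append(s)
            s += m - pi[m - 1]  # KMP shift after a full match
        else:
            kmp_shift = j - pi[j - 1] if j > 0 else 1
            bc_shift = bc[j].get(text[s + j], j + 1)
            s += max(kmp_shift, bc_shift)  # assumed combination rule
    return hits


if __name__ == "__main__":
    print(hybrid_search("abracadabra", "abra"))  # prints [0, 7]
```

Each of the two shifts on its own never skips an occurrence, so taking the larger one is also safe; that is why the sketch can combine them without extra bookkeeping.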