Proceeding of the 3rd International Conference on Informatics and Technology, 2009
©Informatics '09, UM 2009
BM-KMP HYBRID ALGORITHM FOR EXACT AND SUBSEQUENCE STRING MATCHING
Ammar Waysi Mahmood 1, Nur'Aini binti Abdul Rashid 2, Atheer A. Abdul Rozaq 3
1 School of Computer Science, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia: ammar_wysi@yahoo.com
2 School of Computer Science, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia: nuraini@webmail.cs.usm.my
3 School of Computer Science, Universiti Sains Malaysia, 11800, Pulau Penang, Malaysia: athproof@yahoo.com
ABSTRACT
This study hybridizes two well-known exact string matching algorithms, Boyer-Moore and Knuth-Morris-Pratt. The hybrid algorithm draws on the main ideas of the two phases of both algorithms: it takes the good prefix idea from Knuth-Morris-Pratt and the bad character shift from Boyer-Moore. The hybrid algorithm was then adapted for subsequence matching, with some modifications to the preprocessing phase and the search phase. Both the hybrid algorithm and the enhanced algorithm were tested on four types of alphabet: binary, DNA, protein and English. In terms of time and the number of characters compared, the two algorithms gave better results than Knuth-Morris-Pratt on all four alphabets, and better results than Boyer-Moore on the binary, DNA and protein alphabets. However, they performed worse than Boyer-Moore on the English alphabet.
Keywords: Boyer-Moore, BM, Knuth-Morris-Pratt, KMP, exact string matching, subsequence matching, hybrid algorithm
1.0 INTRODUCTION
String matching, the process of finding patterns in a text, is one of the oldest and most frequently used operations in text processing, which is one of the fields of computer science. It is used in all word processing systems in the form of finding and replacing text. This operation is becoming more and more complex due to the drastic increase in the size of texts.
The current interest in applications of string matching algorithms in computational biology and information retrieval keeps research in this field alive and active. Although the research started a long time ago, there is still a lot of interest in it, especially in creating simpler ideas that work better in practice.
In this research we propose a new algorithm that is a hybrid of two existing string matching algorithms, Boyer-Moore and Knuth-Morris-Pratt. The two algorithms are different in nature. The proposed hybrid algorithm takes advantage of both existing algorithms and is adjusted to suit a new kind of data, namely biological subsequences.
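To make the difference in nature between the two base algorithms concrete, the following is a minimal Python sketch of their textbook forms. It is not the hybrid algorithm proposed in this paper. The function names are illustrative assumptions, and the Boyer-Moore variant shown uses only the bad character rule, without the good suffix rule. Knuth-Morris-Pratt scans the text from left to right and reuses the matched prefix after a mismatch, while Boyer-Moore compares the pattern from right to left and shifts according to the mismatching text character.

```python
def kmp_prefix_table(pattern):
    """KMP preprocessing: table[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text, pattern):
    """Return the start index of every occurrence (left-to-right scan)."""
    if not pattern:
        return []
    table = kmp_prefix_table(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # reuse the already matched prefix
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]
    return hits


def bm_bad_character_table(pattern):
    """BM preprocessing: rightmost index of each character in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}


def bm_bad_character_search(text, pattern):
    """Simplified Boyer-Moore (assumption: bad character rule only)."""
    if not pattern:
        return []
    last = bm_bad_character_table(pattern)
    m, n = len(pattern), len(text)
    hits, s = [], 0  # s is the current alignment of the pattern in the text
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1  # compare right to left
        if j < 0:
            hits.append(s)
            s += 1
        else:
            # Align the mismatching text character with its rightmost
            # occurrence in the pattern, moving forward by at least one.
            s += max(1, j - last.get(text[s + j], -1))
    return hits


if __name__ == "__main__":
    text = "GCATCGCAGAGAGTATACAGTACG"
    print(kmp_search(text, "GCAGAGAG"))
    print(bm_bad_character_search(text, "GCAGAGAG"))
```

On a small alphabet such as DNA, the bad character shift tends to be short because every character is likely to occur in the pattern, which is one reason to combine it with prefix information of the kind Knuth-Morris-Pratt uses.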
2.0 PREVIOUS STUDIES
In this section we discuss previous studies of hybrid algorithms for exact string matching and of algorithms for subsequence matching.
2.1 Hybrid Algorithms for Exact String Matching
Exact string matching algorithms differ in behavior and speed, and their speed changes depending on the size of the alphabet. Therefore, it is useful to combine two (or more) of these algorithms to gain the advantages of each: faster searching, less time needed for matching strings, and a new algorithm that suits as many alphabet sizes as possible. [1]
2.1.1 Faster Hybrid Algorithm
The Boyer-Moore algorithm is one of the fastest and most important algorithms for exact string matching. Therefore, it is natural to see studies that hybridize it with other algorithms to obtain a new, fast algorithm. The hybrid of the Boyer-Moore algorithm with the Quick Search algorithm for exact string matching works by skipping continuously over the text, which it achieves through large shifts along the text. This algorithm, like the pure Boyer-Moore algorithm, has two phases: a preprocessing phase and a searching phase. It is based on the bad character and good suffix rules, trying to benefit from both of them, and comparison of the characters between text and pattern in this algorithm