
Algorithm Design and Analysis
Lecture 10: String Matching Algorithms
Problem Definition

• Looking for a substring in a longer piece of text.


• It is an important utility in text editors and word processors.
• The matching techniques are presented in terms of character strings.
• However, these techniques could be used to search for any string of bits or bytes
in a binary file.
• Virus checking is one example: it searches for the known patterns that appear in
computer viruses.

• Spell checkers of word processing programs not only identify words that appear
to be misspelled but also suggest possible correct spellings for the word.
• One process for spell checkers is to produce a sorted list of words in the
document.
• This list is then compared to the words stored in both the system dictionary and
the user’s dictionary, and words that do not appear are flagged as potentially
incorrect.
• The process of identifying suggested alternative spellings can involve
approximate string matching.
String matching: standard or naïve algorithm
• The problem is to find the first occurrence of a substring
within a larger piece of text.
• Finding later occurrences can use the same techniques
by just changing the starting point in the text.
• This problem is complex because the entire substring
has to match in order.
• In the standard algorithm, we begin by comparing the
first character of the text with the first character of the
substring.
• If they match, we move to the next character of each.
• This process continues until the entire substring
matches the text or the next characters do not match.
• In the first case we are done, but in the second, we move
the starting point in the text by one character and begin
matching with the substring again.
• In the lecture's example, 13 character comparisons are done
to find the match.
Standard (naïve) String matching Algorithm
subLoc = 1 // current match point in substring
textLoc = 1 // current match point in text
textStart = 1 // location where this match attempt starts
while textLoc ≤ length(text) and subLoc ≤ length(substring) do
    if text[ textLoc ] = substring[ subLoc ] then
        textLoc = textLoc + 1
        subLoc = subLoc + 1
    else
        textStart = textStart + 1 // begin again but move the start by 1
        textLoc = textStart
        subLoc = 1
    end if
end while
if (subLoc > length(substring)) then
    return textStart // found a match
else
    return 0 // indicates no match found
end if
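The pseudocode above can be sketched in Python roughly as follows. This is an illustrative translation, not part of the lecture: the function name `naive_find` is an assumption, and it follows Python convention by returning a 0-based index (or -1 when there is no match) instead of the 1-based position used on the slides.

```python
def naive_find(text: str, pattern: str) -> int:
    """Return the 0-based index of the first occurrence of pattern, or -1."""
    n, m = len(text), len(pattern)
    # Try every starting position where the pattern could still fit.
    for start in range(n - m + 1):
        offset = 0
        # Advance while the characters keep matching.
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            return start  # every pattern character matched
    return -1


print(naive_find("abababcbab", "ababcb"))  # 2 (position 3 when counting from 1)
```

Unlike the slide version, this sketch stops trying new starting points once the substring no longer fits in the remaining text.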
String matching: standard or naïve algorithm

• The basic operation is the character comparison, so that is what we will count.
• In the worst case, each time we compare the substring we match all of the
characters but fail on the last one. How many times could this happen? It could
happen once for each starting position in the text at which the substring still fits.
• If S is the length of the substring and T is the length of the text, there are
T - S + 1 such positions, so the worst case takes about S * (T - S + 1) comparisons.
• The problem with the standard algorithm is that it can waste a lot of effort.
• If we have matched the beginning part of the substring, we can use that
information to tell us how far to move in the text to start the next match.
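The worst-case count can be checked on a small input. The sketch below is an illustration under assumptions not in the lecture: the helper name `count_naive_comparisons`, the sizes T = 10 and S = 4, and a text of all "a" characters searched for "aaab", so that every attempt matches S - 1 characters and fails on the last one.

```python
def count_naive_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by a naive search that stops at the first match."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        offset = 0
        while offset < m:
            comparisons += 1
            if text[start + offset] != pattern[offset]:
                break  # mismatch: give up on this starting position
            offset += 1
        if offset == m:
            break  # full match found, stop searching
    return comparisons


T, S = 10, 4
text = "a" * T                  # "aaaaaaaaaa"
pattern = "a" * (S - 1) + "b"   # "aaab": fails only on its last character
print(count_naive_comparisons(text, pattern), S * (T - S + 1))  # 28 28
```

This sketch only tries starting positions where the substring still fits. A loop that keeps going until the text runs out performs a few extra comparisons on top of this count.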
Knuth-Morris-Pratt Algorithm

• The Knuth-Morris-Pratt algorithm is based on finite automata but uses a simpler
method of handling the situation when the characters don’t match.
• A finite automaton (the singular form of the word “automata”) is a simple
machine that has a current state and a transition function.
• The transition function examines the state and the next character of input and
then decides on a new state for the automaton.
• Finite automata can be used for string matching by setting up an automaton to
match just one word; when we get to the final state, we know we have found
the substring in the text.
• In Knuth-Morris-Pratt, we label the states with the symbol that should match at
that point.
• We then only need two links from each state—one for a successful match and the
other for a failure.
• The success link will take us to the next node in the chain, and the failure link will
take us back to a previous node based on the word pattern.
• A sample Knuth-Morris-Pratt automaton for the substring “ababcb” is given
below.
• Each success link of a Knuth-Morris-Pratt automaton causes the “fetch” of a new
character from the text.
• Failure links do not get a new character but reuse the last character fetched.
• If we reach the final state, we know that we found the substring.
• Consider the text “abababcbab” and the automaton for “ababcb”.
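The walk through this text can be simulated with a short Python sketch. It is only an illustration: the function name `trace_kmp` and the counters are assumptions, and the failure links in `FAIL_ABABCB` were worked out by hand for “ababcb” (1-based state numbers) and are not taken from the lecture's figure.

```python
def trace_kmp(text, pattern, fail):
    """Walk the automaton; return (1-based match start or 0, advances, failure steps)."""
    state = 1        # 1-based state number: the pattern character to match next
    text_loc = 1     # 1-based position of the current text character
    advances = 0     # steps that fetch a new character (success links, or a restart from state 0)
    fail_steps = 0   # failure links followed; each one reuses the current character
    while text_loc <= len(text) and state <= len(pattern):
        if state == 0 or text[text_loc - 1] == pattern[state - 1]:
            state += 1
            text_loc += 1
            advances += 1
        else:
            state = fail[state]
            fail_steps += 1
    if state > len(pattern):
        # text_loc is one past the last matched character
        return text_loc - len(pattern), advances, fail_steps
    return 0, advances, fail_steps


# Hand-derived failure links for "ababcb" (an assumption of this sketch).
FAIL_ABABCB = {1: 0, 2: 1, 3: 1, 4: 2, 5: 3, 6: 1}
print(trace_kmp("abababcbab", "ababcb", FAIL_ABABCB))  # (3, 8, 1)
```

On this input the automaton follows a single failure link, from the state labeled c back to state 3, and never moves backward in the text.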
• The full KMP algorithm is:
subLoc = 1 // current match point in substring
textLoc = 1 // current match point in text
while textLoc ≤ length(text) and subLoc ≤ length(substring) do
    if subLoc = 0 or text[ textLoc ] = substring[ subLoc ] then
        textLoc = textLoc + 1
        subLoc = subLoc + 1
    else // no match so follow fail link
        subLoc = fail[ subLoc ]
    end if
end while
if (subLoc > length(substring)) then
    return textLoc - length(substring) // found a match; textLoc is one past its last character
else
    return 0 // no match
end if

• Notice that we do not need to do anything special for the success links
because they just move us to the next successive location.
• The failure links, however, are calculated by looking at how the substring
relates to itself.
• For example, if we look at the substring “ababcb,” we see that if we fail
when matching the c, we shouldn’t back up all the way.
• If we got to character 5 of the substring, we know that the first four
characters matched.
• So, the “ab” in the text that matched substring characters 3 and 4 could instead
match substring characters 1 and 2, and the search resumes by comparing the
current text character with substring character 3.
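The overlap described above can be computed directly. The following Python sketch is an illustration only: the helper name `longest_border` is an assumption, and it uses a simple brute-force check, not the lecture's method for building the failure links.

```python
def longest_border(s: str) -> int:
    """Length of the longest proper prefix of s that is also a suffix of s."""
    for length in range(len(s) - 1, 0, -1):
        if s[:length] == s[-length:]:
            return length
    return 0


matched = "ababcb"[:4]   # "abab": the four characters matched before failing on 'c'
overlap = longest_border(matched)
print(overlap, overlap + 1)  # 2 3
```

The overlap of 2 means that two characters already count as matched after the failure, so matching resumes at substring character 3 (counting from 1).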

• The following algorithm determines these relationships in the substring:


fail[ 1 ] = 0
for i = 2 to length(substring) do
    temp = fail[ i - 1 ]
    while (temp > 0) and (substring[ temp ] ≠ substring[ i - 1 ]) do
        temp = fail[ temp ]
    end while
    fail[ i ] = temp + 1
end for
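A direct Python translation of this loop might look like the sketch below. The function name `build_fail` is an assumption. To keep the 1-based numbering of the slides, the list carries an unused entry at index 0, and each character access is shifted by one because Python strings are 0-based.

```python
def build_fail(pattern: str) -> list:
    """Return fail[1..len(pattern)] as on the slides; fail[0] is unused padding."""
    m = len(pattern)
    fail = [0] * (m + 1)          # fail[1] = 0 is already in place
    for i in range(2, m + 1):
        temp = fail[i - 1]
        # pattern[temp - 1] is the slides' substring[temp];
        # pattern[i - 2] is the slides' substring[i - 1].
        while temp > 0 and pattern[temp - 1] != pattern[i - 2]:
            temp = fail[temp]
        fail[i] = temp + 1
    return fail


print(build_fail("ababcb")[1:])  # [0, 1, 1, 2, 3, 1]
```

For “ababcb”, the entry fail[5] = 3 says that a failure on the c resumes matching at substring character 3.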
