Lecture 10: String Matching Algorithms
Problem Definition
• Spell checkers of word processing programs not only identify words that appear
to be misspelled but also suggest possible correct spellings for the word.
• One process for spell checkers is to produce a sorted list of words in the
document.
• This list is then compared to the words stored in both the system dictionary and
the user’s dictionary, and words that do not appear are flagged as potentially
incorrect.
• The process of identifying suggested alternative spellings can involve
approximate string matching.
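The flagging step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `flag_unknown_words` and `suggest` and the sample word lists are hypothetical, and the standard library's `difflib.get_close_matches` is used as a stand-in for approximate string matching, not as the method any particular word processor uses.

```python
import difflib


def flag_unknown_words(document_words, system_dictionary, user_dictionary):
    """Return the sorted document words that appear in neither dictionary."""
    known = set(system_dictionary) | set(user_dictionary)
    # Sorting the distinct words mirrors the "sorted list of words" step.
    return [word for word in sorted(set(document_words)) if word not in known]


def suggest(word, dictionary, limit=3):
    """Suggest close spellings; difflib is an assumed stand-in technique."""
    return difflib.get_close_matches(word, dictionary, n=limit)


system_words = ["the", "is", "wrong", "spelling"]  # hypothetical dictionary
user_words = ["kmp"]                               # hypothetical user entries
flagged = flag_unknown_words(
    ["the", "speling", "is", "wrong", "kmp"], system_words, user_words
)
suggestions = {word: suggest(word, system_words) for word in flagged}
```

Here `flagged` holds the words that would be marked as potentially incorrect, and `suggestions` maps each of them to candidate corrections.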
String matching: standard or naïve algorithm
• The problem is to find the first occurrence of a substring
within a larger piece of text.
• Finding later occurrences can use the same techniques
by just changing the starting point in the text.
• This problem is complex because the entire substring
has to match in order.
• In the standard algorithm, we begin by comparing the
first character of the text with the first character of the
substring.
• If they match, we move to the next character of each.
• This process continues until the entire substring
matches the text or the next characters do not match.
• In the first case we are done, but in the second, we move
the starting point in the text by one character and begin
matching with the substring again.
• In the accompanying example, 13 character comparisons are done
to find the match.
Standard (naïve) String matching Algorithm
subLoc = 1 // current match point in substring
textLoc = 1 // current match point in text
textStart = 1 // location where this match attempt starts
while textLoc ≤ length(text) and subLoc ≤ length(substring) do
    if text[ textLoc ] = substring[ subLoc ] then
        textLoc = textLoc + 1
        subLoc = subLoc + 1
    else
        textStart = textStart + 1 // begin again but move the start by 1
        textLoc = textStart
        subLoc = 1
    end if
end while
if subLoc > length(substring) then
    return textStart // found a match
else
    return 0 // indicates no match found
end if
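The pseudocode above can be transcribed almost line for line into Python. This is a minimal sketch: the function name `naive_find` is a hypothetical choice, and indices are 0-based internally, with 1 added on return so that the result keeps the pseudocode's convention of a 1-based location, or 0 for no match.

```python
def naive_find(text: str, substring: str) -> int:
    """Return the 1-based start of the first match, or 0 if there is none."""
    sub_loc = 0     # current match point in substring (0-based)
    text_loc = 0    # current match point in text (0-based)
    text_start = 0  # location where this match attempt starts
    while text_loc < len(text) and sub_loc < len(substring):
        if text[text_loc] == substring[sub_loc]:
            text_loc += 1
            sub_loc += 1
        else:
            text_start += 1  # begin again but move the start by 1
            text_loc = text_start
            sub_loc = 0
    if sub_loc >= len(substring):
        return text_start + 1  # found a match; convert to 1-based
    return 0  # indicates no match found
```

For example, `naive_find("there they are", "they")` returns 7, and `naive_find("abc", "abd")` returns 0. The sample strings are illustrative assumptions, not taken from the slides.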
String matching: standard or naïve algorithm
• It should be obvious that the important task is to compare characters, and that is
what we will count.
• In the worst case, each time we compare the substring we match all of the
characters but fail on the last one. How many times could this happen? It could
happen once for each starting position in the text where the whole substring still fits.
• If S is the length of the substring and T is the length of the text, there are
T - S + 1 such starting positions, so the worst case would seem to take
S * (T - S + 1) comparisons.
• The problem with the standard algorithm is that it can waste a lot of effort.
• If we have matched the beginning part of the substring, we can use that
information to tell us how far to move in the text to start the next match.
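The comparison count discussed above can be checked empirically. The sketch below is an assumption-laden illustration: `count_comparisons` is a hypothetical helper, it tries only the T - S + 1 starting positions where the substring still fits (a slightly tighter loop than the pseudocode), and both input pairs are illustrative choices rather than examples taken from the slides.

```python
def count_comparisons(text: str, substring: str) -> tuple[int, int]:
    """Return (1-based match location or 0, character comparisons made)."""
    comparisons = 0
    # Only T - S + 1 starting positions leave room for the whole substring.
    for start in range(len(text) - len(substring) + 1):
        matched = 0
        while matched < len(substring):
            comparisons += 1
            if text[start + matched] != substring[matched]:
                break  # mismatch: abandon this starting position
            matched += 1
        if matched == len(substring):
            return start + 1, comparisons
    return 0, comparisons


# Worst-case shape: every attempt fails on the last substring character.
worst = count_comparisons("aaaaaaaaaa", "aaab")     # (0, 28), with T=10, S=4
typical = count_comparisons("there they are", "they")  # (7, 13)
```

With T = 10 and S = 4, the first call makes 4 * (10 - 4 + 1) = 28 comparisons, which matches the S * (T - S + 1) bound, while ordinary English text usually fails on the first character of most attempts and needs far fewer.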
Knuth-Morris-Pratt Algorithm
• Notice that we do not need to do anything special for the success links
because they just move us to the next successive location.
• The failure links, however, are calculated by looking at how the substring
relates to itself.
• For example, if we look at the substring “ababcb,” we see that if we fail
when matching the c, we shouldn’t back up all the way.
• If we got to character 5 of the substring, we know that the first four
characters matched.
• So the “ab” in the text that matched substring characters 3 and 4 can be
reused as a match for substring characters 1 and 2, and matching resumes at
substring character 3.
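The way the substring "relates to itself" can be computed mechanically. The sketch below uses the common prefix-function formulation of the failure table, which is an assumption here: the lecture's failure links may be numbered differently (for instance 1-based), and the name `failure_table` is hypothetical. Entry `fail[i]` is the length of the longest proper prefix of the substring that is also a suffix of its first `i + 1` characters.

```python
def failure_table(substring: str) -> list[int]:
    """Prefix-function form of the KMP failure links (0-based indices)."""
    fail = [0] * len(substring)
    k = 0  # length of the prefix currently matched against a suffix
    for i in range(1, len(substring)):
        # Fall back through shorter prefixes until one can be extended.
        while k > 0 and substring[i] != substring[k]:
            k = fail[k - 1]
        if substring[i] == substring[k]:
            k += 1
        fail[i] = k
    return fail


table = failure_table("ababcb")  # [0, 0, 1, 2, 0, 0]
```

For "ababcb", the entry for the fourth character is 2. So when the match fails on the "c" after four matched characters, two of them ("ab") still count as matched, and comparison continues with the third character of the substring, as described in the bullets above.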
Knuth-Morris-Pratt Algorithm