Text Processing: Brute-Force Pattern Matching, The Boyer-Moore Algorithm, The Knuth-Morris-Pratt Algorithm, The Huffman Coding Algorithm, The Longest Common Subsequence Problem (LCS), Tries (Standard Tries, Compressed Tries, Suffix Tries)
String algorithms can typically be divided into several categories, one of which is string matching.
The most basic approach to string matching is known as brute force, which simply means checking every position of the text for a match against the pattern. In general we have a text and a pattern (usually shorter than the text), and we need to answer the question of whether this pattern appears in the text.
Overview
The principles of brute force string matching are quite simple. We check for a match between the first character of the pattern and the first character of the text, as in the picture below.
If they don’t match, we move forward to the second character of the text and compare the first character of the pattern with it. If they don’t match again, we keep moving forward until we get a match or until we reach the end of the text.
If they do match, we move forward to the second character of the pattern, comparing it with the “next” character of the text, as shown in the picture below.
Finding a match between the first character of the pattern and some character of the text doesn’t mean that the pattern appears in the text. We must keep moving forward to see whether the full pattern is contained in the text.
Implementation
Implementation of brute force string matching is easy, and here we can see a short PHP example. The bad news is that this algorithm is naturally quite slow.

function bruteForceMatch($pattern, $subject) {
    $n = strlen($subject);
    $m = strlen($pattern);
    // try every possible alignment of the pattern against the text
    for ($i = 0; $i <= $n - $m; $i++) {
        $j = 0;
        while ($j < $m && $subject[$i + $j] == $pattern[$j]) {
            $j++;
        }
        if ($j == $m) return $i; // full pattern matched at position $i
    }
    return -1; // no match
}
Complexity
As I said, this algorithm is slow. Practically every algorithm that contains “brute force” in its name is slow, but to show how slow string matching is, I can say that its complexity is O(n·m), where n is the length of the text and m is the length of the pattern.
In case we fix the length of the text and test against variable length of the pattern, again we get a
rapidly growing function.
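To make the O(n·m) bound concrete, here is a small Python sketch (the function name and the inputs are illustrative, not from the text) that counts character comparisons on a worst-case input, where every alignment matches almost the whole pattern before failing:

```python
def brute_force_count(text, pattern):
    """Brute-force matching that also counts character comparisons."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):       # every possible alignment
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                # mismatch: slide the pattern by one
            j += 1
        if j == m:
            return i, comparisons    # full match at position i
    return -1, comparisons

# Worst case: each of the 16 alignments compares all 5 characters.
pos, cmps = brute_force_count("a" * 20, "a" * 4 + "b")
print(pos, cmps)  # -1 80
```

With the text length fixed at 20, every extra pattern character adds a comparison at nearly every alignment, which is exactly the n·m growth described above.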
Application
Brute force string matching can be very ineffective, but it can also be very handy in some cases, just like sequential search.
1. Doesn’t require pre-processing of the text – Indeed, if we search the text only once, we don’t need to pre-process it. Most string matching algorithms need to build an index of the text in order to search quickly. That is great when you have to search a text more than once, but if you search only once, brute force matching can be a good choice, especially for short texts.
2. Doesn’t require additional space – Because brute force matching doesn’t need pre-processing, it also doesn’t require extra space, which is one nice feature of this algorithm.
3. Can be quite effective for short texts and patterns.
It can be ineffective…
1. If we search the text more than once – As noted in the previous section, if you perform the search more than once, it is probably better to use another string matching algorithm that builds an index and is faster.
2. It’s slow – In general, brute force algorithms are slow, and brute force matching isn’t an exception.
The Boyer-Moore Algorithm
Pattern searching is an important problem in computer science. When we search for a string in a notepad/word file, browser, or database, pattern searching algorithms are used to show the search results. A typical problem statement would be:
Given a text txt[0..n-1] and a pattern pat[0..m-1], where n is the length of the text and m is the length of the pattern, write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
pat[] = "TEST"
pat[] = "AABA"
Case 1 – The mismatched character occurs in the pattern
Explanation: In the above example, we got a mismatch at position 3. The mismatched character is “A”. Now we search for the last occurrence of “A” in the pattern. We find “A” at position 1 in the pattern (displayed in blue), and this is its last occurrence. Now we shift the pattern 2 positions so that the “A” in the pattern gets aligned with the “A” in the text.
Case 2 – The pattern moves past the mismatched character
We look up the position of the last occurrence of the mismatched character in the pattern, and if the character does not exist in the pattern, we shift the pattern past the mismatched character.
Explanation:
Here we have a mismatch at position 7. The mismatched character “C” does not exist in the pattern before position 7, so we shift the pattern past position 7, and eventually in the above example we get a perfect match of the pattern (displayed in green). We do this because “C” does not exist in the pattern, so at every shift before position 7 we would get a mismatch and our search would be fruitless.
In the following implementation, we pre-process the pattern and store the last occurrence of every possible character in an array of size equal to the alphabet size. If the character is not present at all, the mismatch results in a shift by m (the length of the pattern). Therefore, the bad character heuristic takes O(n/m) time in the best case.
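The bad character heuristic described above can be sketched in Python as follows (the names and the alphabet size are illustrative; a complete Boyer-Moore implementation would also use the good suffix heuristic):

```python
NO_OF_CHARS = 256  # assumed alphabet size

def bad_char_table(pattern):
    """Last occurrence of every character in the pattern; -1 if absent."""
    last = [-1] * NO_OF_CHARS
    for i, ch in enumerate(pattern):
        last[ord(ch)] = i
    return last

def boyer_moore_search(text, pattern):
    """Return the indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    last = bad_char_table(pattern)
    result = []
    s = 0  # shift of the pattern with respect to the text
    while s <= n - m:
        j = m - 1
        # compare the pattern right to left
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            result.append(s)
            s += 1  # naive restart after a match (good suffix rule omitted)
        else:
            # align the last occurrence of the bad character with the text,
            # always shifting by at least one position
            s += max(1, j - last[ord(text[s + j])])
    return result

print(boyer_moore_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```

The shift is never unsafe: aligning the last occurrence of the mismatched character can only skip alignments that are guaranteed to fail.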
The Knuth-Morris-Pratt Algorithm
Knuth, Morris, and Pratt introduced a linear time algorithm for the string matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of 'S' that have previously been involved in a comparison with some element of the pattern 'p' to be matched, i.e., backtracking on the string 'S' never occurs.
1. The Prefix Function (Π): The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern 'p'; in other words, it enables avoiding backtracking on the string 'S'.
2. The KMP Matcher: With string 'S', pattern 'p', and prefix function 'Π' as inputs, it finds the occurrences of 'p' in 'S' and returns the number of shifts of 'p' after which occurrences are found.
The Prefix Function (Π)
The following pseudocode computes the prefix function, Π:
COMPUTE- PREFIX- FUNCTION (P)
1. m ←length [P] //'p' pattern to be matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5. do while k > 0 and P [k + 1] ≠ P [q]
6. do k ← Π [k]
7. If P [k + 1] = P [q]
8. then k← k + 1
9. Π [q] ← k
10. Return Π
Running Time Analysis:
In the above pseudocode for calculating the prefix function, the for loop from step 4 to step 10 runs 'm' times. Steps 1 to 3 take constant time. Hence the running time of computing the prefix function is O(m).
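The pseudocode above can be rendered in Python; note that Python strings are 0-indexed, so this sketch stores in pi[q] the length of the longest proper prefix of pattern[:q+1] that is also a suffix of it:

```python
def compute_prefix_function(pattern):
    """0-indexed version of COMPUTE-PREFIX-FUNCTION from the pseudocode."""
    m = len(pattern)
    pi = [0] * m       # pi[0] is always 0
    k = 0              # length of the current matched prefix
    for q in range(1, m):
        # fall back through shorter prefixes until the next character matches
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

print(compute_prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```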
Example: Compute Π for the pattern 'p' below:
Solution:
Initially: m = length [p] = 7
Π [1] = 0
k=0
After 6 iterations, the prefix function computation is complete:
The KMP Matcher
The following pseudocode finds the occurrences of 'p' in 'S':
KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π ← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0 // number of characters matched
5. for i ← 1 to n // scan S from left to right
6. do while q > 0 and P [q + 1] ≠ T [i]
7. do q ← Π [q] // next character does not match
8. If P [q + 1] = T [i]
9. then q ← q + 1 // next character matches
10. If q = m // is all of P matched?
11. then print "Pattern occurs with shift" i - m
12. q ← Π [q] // look for the next match
Running Time Analysis:
The for loop beginning in step 5 runs 'n' times, i.e., as long as the length of the string 'S'. Since steps 1 to 4 take constant time, the running time is dominated by this for loop. Thus the running time of the matching function is O(n).
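The matcher can be sketched in Python as well (0-indexed, with the prefix computation included so the example runs on its own):

```python
def compute_prefix_function(pattern):
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(text, pattern):
    """Return all shifts at which pattern occurs in text, in O(n + m) time."""
    n, m = len(text), len(pattern)
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i in range(n):          # scan the text once; no backtracking on text
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]       # fall back in the pattern, not the text
        if pattern[q] == text[i]:
            q += 1
        if q == m:              # all of the pattern matched
            shifts.append(i - m + 1)
            q = pi[q - 1]       # continue, allowing overlapping matches
    return shifts
```

The fallback q = pi[q - 1] is exactly the useless-shift avoidance described above: instead of re-examining text characters, the pattern reuses what is known about its own prefixes.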
Let us execute the KMP Algorithm to find whether 'P' occurs in 'T.'
For 'p', the prefix function Π was computed previously and is as follows:
Solution:
Initially: n = size of T = 15
m = size of P = 7
Pattern 'P' has been found to occur in string 'T'. The total number of shifts that took place for the match to be found is i - m = 13 - 7 = 6 shifts.
The Huffman Coding Algorithm
There are mainly two parts: the first creates a Huffman tree, and the other traverses the tree to find the codes.
As an example, consider the string “YYYZXXYYX”: the frequency of the character Y is larger than that of X, and the character Z has the least frequency. So the length of the code for Y is smaller than that for X, and the code for X will be no longer than that for Z.
The complexity of assigning a code to each character according to its frequency is O(n log n).
Input:
A string with different characters, say “ACCEBFFFFAAXXBLKE”
Output:
Code for different characters:
Data: K, Frequency: 1, Code: 0000
Data: L, Frequency: 1, Code: 0001
Data: E, Frequency: 2, Code: 001
Data: F, Frequency: 4, Code: 01
Data: B, Frequency: 2, Code: 100
Data: C, Frequency: 2, Code: 101
Data: X, Frequency: 2, Code: 110
Data: A, Frequency: 3, Code: 111
Algorithm
huffmanCoding(string)
Input: A string with different characters.
Output: The codes for each individual character.
Begin
   define a node with character, frequency, left and right child of the node for Huffman tree.
   create a list ‘freq’ to store frequency of each character, initially, all are 0
   for each character c in the string do
      increase the frequency for character c in freq list.
   done
   for each character c whose frequency is non-zero do
      add c and its frequency as a node of priority queue Q.
   done
   while Q has more than one element do
      remove the two nodes with lowest frequency from Q
      create a new node with these two as left and right children, whose frequency is the sum of their frequencies
      insert the new node into Q
   done
   traverse the tree from the root, appending 0 for a left edge and 1 for a right edge, to assign a code to each character
End
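The algorithm above can be sketched in Python using heapq as the priority queue. The exact bit patterns depend on how ties between equal frequencies are broken, but the code lengths are determined by the frequencies:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman tree for text and return a {char: code} mapping."""
    freq = Counter(text)
    # heap items are (frequency, tie_breaker, tree); a tree is either a
    # single character or a (left, right) pair of subtrees
    heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def walk(tree, code):
        if isinstance(tree, str):
            codes[tree] = code or "0"        # single-character edge case
        else:
            walk(tree[0], code + "0")        # left edge appends 0
            walk(tree[1], code + "1")        # right edge appends 1
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("ACCEBFFFFAAXXBLKE")
```

For the sample string, any valid Huffman tree gives F a 2-bit code, A and the frequency-2 characters 3-bit codes, and K and L 4-bit codes, for a total of 49 encoded bits.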
The Longest Common Subsequence Problem (LCS)
Subsequence
Let S be a sequence. A sequence Z = <z1, z2, z3, z4, …, zm> is called a subsequence of S if and only if it can be derived from S by deleting some elements without changing the order of the remaining elements.
Common Subsequence
Suppose, X and Y are two sequences over a finite set of elements. We can say that Z is a
common subsequence of X and Y, if Z is a subsequence of both X and Y.
If a set of sequences are given, the longest common subsequence problem is to find a common
subsequence of all the sequences that is of maximal length.
The longest common subsequence problem is a classic computer science problem, the basis of
data comparison programs such as the diff-utility, and has applications in bioinformatics. It is
also widely used by revision control systems, such as SVN and Git, for reconciling multiple
changes made to a revision-controlled collection of files.
Naïve Method
The naïve method enumerates every subsequence of X and checks whether it is also a subsequence of Y, keeping the longest one found. Since X has 2^m subsequences and each check takes O(n) time, this takes O(n · 2^m) time.
Dynamic Programming
Let X = < x1, x2, x3, …, xm > and Y = < y1, y2, y3, …, yn > be the sequences. To compute the length of the LCS, the following algorithm is used.
In this procedure, table C[m, n] is computed in row-major order, and another table B[m, n] is computed to construct the optimal solution.
Analysis
To populate the tables, the outer for loop iterates m times and the inner for loop iterates n times. Hence, the complexity of the algorithm is O(mn), where m and n are the lengths of the two strings.
Example
In this example, we have two strings X = BACDB and Y = BDCB to find the longest common
subsequence.
In table B, instead of ‘D’, ‘L’ and ‘U’, we are using the diagonal arrow, left arrow and up arrow,
respectively. After generating table B, the LCS is determined by function LCS-Print. The result
is BCB.
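The dynamic programming procedure can be sketched in Python; table C holds the lengths, and the traceback plays the role of table B and LCS-Print:

```python
def lcs(X, Y):
    """Longest common subsequence of X and Y via dynamic programming."""
    m, n = len(X), len(Y)
    # C[i][j] = length of an LCS of X[:i] and Y[:j]
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                C[i][j] = C[i - 1][j - 1] + 1            # diagonal move
            else:
                C[i][j] = max(C[i - 1][j], C[i][j - 1])  # up or left
    # trace back from C[m][n] to recover one LCS
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i, j = i - 1, j - 1
        elif C[i - 1][j] >= C[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("BACDB", "BDCB"))  # BCB
```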
● Tries
A trie is a tree-like information retrieval data structure whose nodes store the letters of an alphabet. It is also known as a digital tree, radix tree, or prefix tree. Tries are classified into three categories:
1. Standard Trie
2. Compressed Trie
3. Suffix Trie
Standard Trie
A standard trie has the following properties:
1. A standard trie has the below node structure:
class Node {
    // one child reference per letter of the alphabet
    Node[] children = new Node[ALPHABET_SIZE];
    // true if a word ends at this node
    boolean isWordEnd;
}
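A runnable Python sketch of a standard trie, using a dictionary of children per node instead of a fixed-size array; the isWordEnd flag marks nodes where inserted words end:

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # letter -> child TrieNode
        self.is_word_end = False  # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Walk the letters of word, creating nodes as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word_end = True

    def search(self, word):
        """True only if word was inserted (not merely a prefix of one)."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word_end
```

Both insert and search take O(L) time for a word of length L, independent of how many words are stored.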