
UNIT - 4

Text Processing: Brute-Force Pattern Matching, The Boyer-Moore Algorithm, The Knuth-
Morris-Pratt Algorithm, The Huffman Coding Algorithm, The Longest Common Subsequence
Problem (LCS), Tries- Standard Tries, Compressed Tries, Suffix Tries

● Brute-Force Pattern Matching


String/pattern matching is crucial for database development and text
processing software. Fortunately, every modern programming language and library is full
of functions for string processing that help us in our everyday work. However, it is
important to understand their principles.

String algorithms can typically be divided into several categories. One of these categories
is string matching.

When it comes to string matching, the most basic approach is what is known as brute
force, which simply means checking every single character of the text against the
pattern. In general we have a text and a pattern (most commonly shorter than the
text), and what we need to do is answer the question of whether this pattern appears in
the text.

Overview
The principles of brute force string matching are quite simple. We must check for a match
between the first character of the pattern and the first character of the text.
If they don't match, we move forward to the second character of the text. Now we compare the
first character of the pattern with the second character of the text. If they don't match again, we
move forward until we get a match or until we reach the end of the text.
In case they match, we move forward to the second character of the pattern, comparing it with the
"next" character of the text.

Just because we have found a match between the first character of the pattern and some
character of the text doesn't mean that the pattern appears in the text. We must move forward to
see whether the full pattern is contained in the text.
Implementation
Implementation of brute force string matching is easy and here we can see a short PHP example.
The bad news is that this algorithm is naturally quite slow.

function sub_string($pattern, $subject)
{
    $n = strlen($subject);
    $m = strlen($pattern);

    // Try every possible starting position of the pattern in the text.
    for ($i = 0; $i <= $n - $m; $i++) {
        $j = 0;
        // Extend the match while the characters agree.
        while ($j < $m && $subject[$i + $j] == $pattern[$j]) {
            $j++;
        }
        // The whole pattern matched: return its starting index.
        if ($j == $m) {
            return $i;
        }
    }
    return -1; // pattern does not appear in the text
}

echo sub_string('o wo', 'hello world!'); // prints 4

Complexity
As I said, this algorithm is slow. Actually, every algorithm that contains "brute force" in its name
is slow, but to show how slow string matching is, I can say that its complexity is O(n·m).
Here n is the length of the text, while m is the length of the pattern. In the worst case, e.g.
searching for the pattern "aaab" in a text consisting only of "a"s, nearly m comparisons are made
at each of the roughly n possible shifts.
In case we fix the length of the text and test against a variable length of the pattern, again we get a
rapidly growing function.
Application

Brute force string matching can be very ineffective, but it can also be very handy in some cases,
just like sequential search.

It can be very useful because…

1. It doesn't require pre-processing of the text – Indeed, if we search the text only once, we
don't need to pre-process it. Most string matching algorithms need to build an
index of the text in order to search quickly. That is great when you have to search the same
text more than once, but if you search it only once, perhaps (for short texts) brute force
matching is great!
2. It doesn't require additional space – Because brute force matching doesn't need pre-
processing, it also doesn't require extra space, which is one cool feature of this algorithm.
3. It can be quite effective for short texts and patterns.

It can be ineffective…

1. If we search the text more than once – As said above, if you perform the search
more than once, it's probably better to use another string matching algorithm that
builds an index and is faster.
2. Because it's slow – In general, brute force algorithms are slow, and brute force matching isn't an
exception.

● The Boyer-Moore Algorithm

Pattern searching is an important problem in computer science. When we search for a string
in a notepad/word file, browser, or database, pattern searching algorithms are used to show the
search results. A typical problem statement would be:
Given a text txt[0..n-1] and a pattern pat[0..m-1], where n is the length of the text and m is the
length of the pattern, write a function search(char pat[], char txt[]) that prints all occurrences of
pat[] in txt[]. You may assume that n > m.
Examples: 

Input: txt[] = "THIS IS A TEST TEXT"

pat[] = "TEST"

Output: Pattern found at index 10

Input: txt[] = "AABAACAADAABAABA"

pat[] = "AABA"

Output: Pattern found at index 0

Pattern found at index 9

Pattern found at index 12


In this section, we will discuss the Boyer Moore pattern searching algorithm. Like the KMP and Finite
Automata algorithms, the Boyer Moore algorithm also pre-processes the pattern.
Boyer Moore is a combination of the following two approaches:
1. Bad Character Heuristic
2. Good Suffix Heuristic
Both of the above heuristics can also be used independently to search for a pattern in a text. Let us
first understand how the two independent approaches work together in the Boyer Moore algorithm.
If we take a look at the Naive algorithm, it slides the pattern over the text one position at a time. The KMP
algorithm pre-processes the pattern so that it can be shifted by more than one position. The Boyer
Moore algorithm does its pre-processing for the same reason: it processes the pattern and creates a
different array for each of the two heuristics. At every step, it slides the pattern by the maximum of
the slides suggested by the two heuristics, i.e., it uses the greatest offset suggested by the two
heuristics at every step.
Unlike the previous pattern searching algorithms, the Boyer Moore algorithm starts matching
from the last character of the pattern.
Here we discuss the bad character heuristic; the good suffix heuristic is treated separately.
Bad Character Heuristic
The idea of the bad character heuristic is simple. The character of the text which doesn't match
the current character of the pattern is called the bad character. Upon mismatch, we shift the
pattern until:
1. The mismatch becomes a match, or
2. The pattern P moves past the mismatched character.
Case 1 – Mismatch becomes a match
We look up the position of the last occurrence of the mismatched character in the pattern,
and if the mismatched character exists in the pattern, we shift the pattern such that it
becomes aligned with the mismatched character in the text T.
 

[Figure: case 1 – the mismatch becomes a match]

Explanation: In the above example, we got a mismatch at position 3. Here our mismatched
character is "A". Now we search for the last occurrence of "A" in the pattern. We find "A" at
position 1 in the pattern (shown in blue), and this is its last occurrence. Now we shift the
pattern 2 positions so that the "A" in the pattern is aligned with the "A" in the text.
Case 2 – Pattern moves past the mismatched character
We look up the position of the last occurrence of the mismatched character in the pattern, and if
the character does not exist in the pattern, we shift the pattern past the mismatched character.

[Figure: case 2 – the pattern moves past the mismatched character]

Explanation:
Here we have a mismatch at position 7. The mismatched character "C" does not exist in the pattern
before position 7, so we shift the pattern past position 7, and eventually, in the above example, we
get a perfect match of the pattern (shown in green). We do this because "C" does not exist in the
pattern, so at every shift before position 7 we would get a mismatch and our search would be
fruitless.

In the following implementation, we pre-process the pattern and store the last occurrence of
every possible character in an array of size equal to the alphabet size. If the character is not present
at all, the heuristic may result in a shift by m (the length of the pattern). Therefore, the bad character
heuristic takes O(n/m) time in the best case.
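As a concrete illustration, here is a minimal Java sketch of the bad character heuristic alone (class
and method names such as badCharTable and search are our own, not from a library; the good
suffix heuristic is omitted):

import java.util.Arrays;

public class BoyerMooreBadChar {
    static final int ALPHABET = 256;

    // last[c] = index of the last occurrence of character c in the pattern,
    // or -1 if c does not occur in the pattern at all
    static int[] badCharTable(String pat) {
        int[] last = new int[ALPHABET];
        Arrays.fill(last, -1);
        for (int i = 0; i < pat.length(); i++)
            last[pat.charAt(i)] = i;
        return last;
    }

    static void search(String txt, String pat) {
        int n = txt.length(), m = pat.length();
        int[] last = badCharTable(pat);
        int s = 0;                                   // current shift of the pattern over the text
        while (s <= n - m) {
            int j = m - 1;
            // match from the last character of the pattern backwards
            while (j >= 0 && pat.charAt(j) == txt.charAt(s + j))
                j--;
            if (j < 0) {                             // full match at shift s
                System.out.println("Pattern found at index " + s);
                s += (s + m < n) ? m - last[txt.charAt(s + m)] : 1;
            } else {
                // align the bad character with its last occurrence in the
                // pattern, shifting by at least 1
                s += Math.max(1, j - last[txt.charAt(s + j)]);
            }
        }
    }

    public static void main(String[] args) {
        search("AABAACAADAABAABA", "AABA");          // prints indices 0, 9 and 12
    }
}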

● The Knuth-Morris-Pratt Algorithm

Knuth, Morris and Pratt introduced a linear time algorithm for the string matching problem. A
matching time of O(n) is achieved by avoiding comparisons with elements of 'S' that have
previously been involved in a comparison with some element of the pattern 'p' to be matched,
i.e., backtracking on the string 'S' never occurs.

Components of KMP Algorithm:

1. The Prefix Function (Π): The prefix function Π for a pattern encapsulates knowledge about
how the pattern matches against shifts of itself. This information can be used to avoid
useless shifts of the pattern 'p'. In other words, it enables avoiding backtracking on the
string 'S'.
2. The KMP Matcher: With string 'S', pattern 'p' and prefix function 'Π' as inputs, it finds the
occurrences of 'p' in 'S' and returns the number of shifts of 'p' after which an occurrence is
found.
The Prefix Function (Π)
The following pseudo code computes the prefix function Π:
COMPUTE-PREFIX-FUNCTION (P)
1. m ← length [P]               // 'P' is the pattern to be matched
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5.     do while k > 0 and P [k + 1] ≠ P [q]
6.         do k ← Π [k]
7.     if P [k + 1] = P [q]
8.         then k ← k + 1
9.     Π [q] ← k
10. return Π
Running Time Analysis:
In the above pseudo code for calculating the prefix function, the for loop from step 4 to step 10
runs 'm' times. Steps 1 to 3 take constant time. Hence the running time of computing the
prefix function is O(m).
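For readers who prefer working code to pseudo code, here is a minimal 0-indexed Java sketch of
the same computation (the pseudo code above is 1-indexed; the method name
computePrefixFunction is ours):

// pi[q] = length of the longest proper prefix of P[0..q] that is also a suffix of it
static int[] computePrefixFunction(String p) {
    int m = p.length();
    int[] pi = new int[m];
    int k = 0;                          // length of the currently matched prefix
    for (int q = 1; q < m; q++) {
        while (k > 0 && p.charAt(k) != p.charAt(q))
            k = pi[k - 1];              // fall back to the next shorter border
        if (p.charAt(k) == p.charAt(q))
            k++;
        pi[q] = k;
    }
    return pi;
}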
Example: Compute Π for the pattern 'p' below:

[Figure: the pattern 'p' is given in a figure not reproduced here]

Solution:
Initially: m = length [p] = 7
Π [1] = 0
k = 0

After 6 iterations, the prefix function computation is complete:

[Figure: the completed prefix-function table is given in a figure not reproduced here]

The KMP Matcher:


The KMP matcher, with the pattern 'p', the string 'S' and the prefix function 'Π' as input, finds a
match of p in S. The following pseudo code computes the matching component of the KMP
algorithm:
KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π ← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0                        // number of characters matched
5. for i ← 1 to n               // scan S from left to right
6.     do while q > 0 and P [q + 1] ≠ T [i]
7.         do q ← Π [q]         // next character does not match
8.     if P [q + 1] = T [i]
9.         then q ← q + 1       // next character matches
10.    if q = m                 // is all of p matched?
11.        then print "Pattern occurs with shift" i - m
12.        q ← Π [q]            // look for the next match

Running Time Analysis:

The for loop beginning in step 5 runs 'n' times, i.e., as long as the length of the string 'S'.
Since steps 1 to 4 take constant time, the running time is dominated by this for loop.
Thus the running time of the matching function is O(n).
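A matching 0-indexed Java sketch of the matcher, using computePrefixFunction from above
(again an illustrative sketch, not a library routine; with 0-indexing the reported shift is i - m + 1):

static void kmpMatcher(String t, String p) {
    int n = t.length(), m = p.length();
    int[] pi = computePrefixFunction(p);
    int q = 0;                              // number of characters matched so far
    for (int i = 0; i < n; i++) {           // scan the text from left to right
        while (q > 0 && p.charAt(q) != t.charAt(i))
            q = pi[q - 1];                  // next character does not match: fall back
        if (p.charAt(q) == t.charAt(i))
            q++;                            // next character matches
        if (q == m) {                       // all of p matched
            System.out.println("Pattern occurs with shift " + (i - m + 1));
            q = pi[q - 1];                  // look for the next match
        }
    }
}

For example, kmpMatcher("AABAACAADAABAABA", "AABA") prints the shifts 0, 9 and 12.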

Example: Given a string 'T' and pattern 'P' as follows:

[Figure: the string T and the pattern P are given in a figure not reproduced here]

Let us execute the KMP algorithm to find whether 'P' occurs in 'T'.

For 'p', the prefix function Π was computed previously and is as follows:

[Figure: the prefix-function table for P is given in a figure not reproduced here]

Solution:

Initially: n = size of T = 15
m = size of P = 7
Pattern 'P' has been found to occur in the string 'T'. The total number of shifts that took
place for the match to be found is i - m = 13 - 7 = 6 shifts.

● The Huffman Coding Algorithm


Huffman coding is a lossless data compression algorithm. In this algorithm, a variable-
length code is assigned to each input character. The code length is related to how
frequently the character is used: the most frequent characters have the smallest codes,
and the least frequent characters have the longest codes.

There are mainly two parts: the first builds a Huffman tree, and the second traverses the
tree to find the codes.
For example, consider the string "YYYZXXYYX": the frequency of the character Y is
larger than that of X, and the character Z has the least frequency. So the length of the code for Y
is smaller than that for X, and the code for X will be shorter than the code for Z.

The complexity of assigning a code to each character according to its frequency is O(n
log n).

Input and Output

Input:
A string with different characters, say “ACCEBFFFFAAXXBLKE”
Output:
Code for different characters:
Data: K, Frequency: 1, Code: 0000
Data: L, Frequency: 1, Code: 0001
Data: E, Frequency: 2, Code: 001
Data: F, Frequency: 4, Code: 01
Data: B, Frequency: 2, Code: 100
Data: C, Frequency: 2, Code: 101
Data: X, Frequency: 2, Code: 110
Data: A, Frequency: 3, Code: 111
Algorithm
huffmanCoding(string)
Input: A string with different characters.
Output: The codes for each individual character.
Begin
   define a node with character, frequency, and left and right children, for the Huffman tree
   create a list 'freq' to store the frequency of each character, initially all 0
   for each character ch in the string do
      increase the frequency of character ch in the freq list
   done

   for every possible character ch do
      if the frequency of ch is non-zero then
         add ch and its frequency as a node of priority queue Q
   done

   while Q contains more than one node do
      remove the two nodes of least frequency from Q
      create a new internal node with these two as left and right children,
         whose frequency is the sum of the children's frequencies
      insert the new node back into Q
   done
   the single remaining node of Q is the root of the Huffman tree;
   traverse the tree to find the code assigned to each character
End
traverseNode(n: node, code)
Input: The node n of the Huffman tree, and the code assigned from the previous call.
Output: The code assigned to each character.

if the left child of node n ≠ φ then
   traverseNode(leftChild(n), code + '0')     // traverse through the left child
   traverseNode(rightChild(n), code + '1')    // traverse through the right child
else
   display the character and code of the current node
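Putting the two parts together, here is a minimal Java sketch using java.util.PriorityQueue
(class and method names are illustrative; because of tie-breaking among equal frequencies, the
exact codes may differ from the table above, but the code lengths are equivalent):

import java.util.Comparator;
import java.util.PriorityQueue;

class HuffmanNode {
    char ch; int freq;
    HuffmanNode left, right;
    HuffmanNode(char ch, int freq) { this.ch = ch; this.freq = freq; }
}

public class Huffman {
    // Walk the tree: '0' for a left edge, '1' for a right edge.
    static void printCodes(HuffmanNode n, String code) {
        if (n == null) return;
        if (n.left == null && n.right == null) {        // leaf: a real character
            System.out.println("Data: " + n.ch + ", Frequency: " + n.freq + ", Code: " + code);
            return;
        }
        printCodes(n.left, code + "0");
        printCodes(n.right, code + "1");
    }

    public static void main(String[] args) {
        String s = "ACCEBFFFFAAXXBLKE";
        int[] freq = new int[256];
        for (char c : s.toCharArray()) freq[c]++;       // first part: count frequencies

        // Priority queue ordered by frequency, smallest first.
        PriorityQueue<HuffmanNode> q =
            new PriorityQueue<>(Comparator.comparingInt((HuffmanNode n) -> n.freq));
        for (int c = 0; c < 256; c++)
            if (freq[c] > 0) q.add(new HuffmanNode((char) c, freq[c]));

        // Repeatedly merge the two least frequent nodes into one internal node.
        while (q.size() > 1) {
            HuffmanNode left = q.poll(), right = q.poll();
            HuffmanNode parent = new HuffmanNode('\0', left.freq + right.freq);
            parent.left = left; parent.right = right;
            q.add(parent);
        }
        printCodes(q.poll(), "");                       // second part: traverse for codes
    }
}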

● The Longest Common Subsequence Problem (LCS)

Subsequence

Let us consider a sequence S = <s1, s2, s3, s4, …, sn>.

A sequence Z = <z1, z2, z3, z4, …, zm> over S is called a subsequence of S if and only if it can be
derived from S by deleting some elements without changing the order of the remaining elements.
For example, <B, C, B> is a subsequence of <B, A, C, D, B>.

Common Subsequence
Suppose X and Y are two sequences over a finite set of elements. We say that Z is a
common subsequence of X and Y if Z is a subsequence of both X and Y.

Longest Common Subsequence

Given a set of sequences, the longest common subsequence problem is to find a common
subsequence of all the sequences that is of maximal length.

The longest common subsequence problem is a classic computer science problem, the basis of
data comparison programs such as the diff utility, and has applications in bioinformatics. It is
also widely used by revision control systems, such as SVN and Git, for reconciling multiple
changes made to a revision-controlled collection of files.

Naïve Method

Let X be a sequence of length m and Y a sequence of length n. Check for every subsequence
of X whether it is a subsequence of Y, and return the longest common subsequence found.

There are 2^m subsequences of X. Testing whether a sequence is a subsequence
of Y takes O(n) time. Thus, the naïve algorithm would take O(n·2^m) time.

Dynamic Programming

Let X = <x1, x2, x3, …, xm> and Y = <y1, y2, y3, …, yn> be the sequences. To compute the length
of the LCS, the following algorithm is used.

In this procedure, table C[m, n] is computed in row-major order, and another table B[m, n] is
computed to construct the optimal solution.

Algorithm: LCS-Length-Table-Formulation (X, Y)

m := length(X)
n := length(Y)
for i = 1 to m do
   C[i, 0] := 0
for j = 1 to n do
   C[0, j] := 0
for i = 1 to m do
   for j = 1 to n do
      if xi = yj
         C[i, j] := C[i - 1, j - 1] + 1
         B[i, j] := ‘D’
      else
         if C[i - 1, j] ≥ C[i, j - 1]
            C[i, j] := C[i - 1, j]
            B[i, j] := ‘U’
         else
            C[i, j] := C[i, j - 1]
            B[i, j] := ‘L’
return C and B
Algorithm: Print-LCS (B, X, i, j)
if i = 0 or j = 0
   return
if B[i, j] = ‘D’
   Print-LCS(B, X, i - 1, j - 1)
   Print(xi)
else if B[i, j] = ‘U’
   Print-LCS(B, X, i - 1, j)
else
   Print-LCS(B, X, i, j - 1)

This algorithm will print the longest common subsequence of X and Y.

Analysis

To populate the table, the outer for loop iterates m times and the inner for loop iterates n times.
Hence, the complexity of the algorithm is O(m·n), where m and n are the lengths of the two strings.

Example
In this example, we have two strings X = BACDB and Y = BDCB, and we want to find the longest
common subsequence.

Following the algorithm LCS-Length-Table-Formulation (as stated above), we calculate
table C and table B.

[Figure: table C (left) and table B (right) are shown in a figure not reproduced here]

In table B, instead of ‘D’, ‘L’ and ‘U’, we use the diagonal arrow, left arrow and up arrow,
respectively. After generating table B, the LCS is determined by the function Print-LCS. The result
is BCB.
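The same computation in a minimal Java sketch (the method name lcs is ours; instead of keeping
the B table we walk back through C, which reconstructs the same answer):

public class LCS {
    static String lcs(String x, String y) {
        int m = x.length(), n = y.length();
        // c[i][j] = length of the LCS of the first i characters of x
        // and the first j characters of y (row 0 and column 0 stay 0)
        int[][] c = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++)
                if (x.charAt(i - 1) == y.charAt(j - 1))
                    c[i][j] = c[i - 1][j - 1] + 1;                // the 'D' case
                else
                    c[i][j] = Math.max(c[i - 1][j], c[i][j - 1]); // 'U' or 'L'

        // Trace back from c[m][n] to recover one LCS.
        StringBuilder sb = new StringBuilder();
        int i = m, j = n;
        while (i > 0 && j > 0) {
            if (x.charAt(i - 1) == y.charAt(j - 1)) {
                sb.append(x.charAt(i - 1)); i--; j--;
            } else if (c[i - 1][j] >= c[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(lcs("BACDB", "BDCB"));   // prints BCB
    }
}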

● Tries
A trie is a tree-like information retrieval data structure whose nodes store the letters of an
alphabet. It is also known as a digital tree, radix tree or prefix tree. Tries are classified into
three categories:

1. Standard Trie
2. Compressed Trie
3. Suffix Trie
Standard Trie
A standard trie has the following properties:
1. A Standard Trie has the below structure:
class Node {

    // Array to store the children of this node, one slot per letter
    Node[] children = new Node[26];

    // To check for end of string
    boolean isWordEnd;
}

2. It is an ordered tree-like data structure.
3. Each node (except the root node) in a standard trie is labeled with a character.
4. The children of a node are in alphabetical order.
5. Each node or branch represents a possible character of a key or word.
6. Each node or branch may have multiple branches.
7. The last node of every key or word is marked to indicate the end of that word.
Below is the illustration of the Standard Trie:

[Figure: standard trie illustration not reproduced here]
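To make the structure concrete, here is a minimal sketch of insertion and search over the Node
class above, assuming lowercase keys 'a'-'z' (the method names insert and search are ours):

static void insert(Node root, String word) {
    Node cur = root;
    for (char ch : word.toCharArray()) {
        int idx = ch - 'a';
        if (cur.children[idx] == null)
            cur.children[idx] = new Node();   // create the missing branch
        cur = cur.children[idx];
    }
    cur.isWordEnd = true;                     // mark the last node of the key
}

static boolean search(Node root, String word) {
    Node cur = root;
    for (char ch : word.toCharArray()) {
        cur = cur.children[ch - 'a'];
        if (cur == null) return false;        // the path does not exist
    }
    return cur.isWordEnd;                     // a key must end exactly here
}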

Compressed Trie
A compressed trie has the following properties:
1. A Compressed Trie has the below structure:
class Node {

    // Array to store the children of this node
    Node[] children = new Node[26];

    // To store the edge labels (the compressed substrings)
    StringBuilder[] edgeLabel = new StringBuilder[26];

    // To check for end of string
    boolean isEnd;
}

2. A compressed trie is an advanced version of the standard trie.
3. Each node (except the leaf nodes) has at least 2 children.
4. It is used to achieve space optimization.
5. To derive a compressed trie from a standard trie, chains of redundant single-child
nodes are compressed into single edges.
6. It involves grouping, re-grouping and un-grouping of keys of characters.
7. While performing the insertion operation, it may be required to un-group already
grouped characters.
8. While performing the deletion operation, it may be required to re-group already
grouped characters.
9. A compressed trie T storing s strings (keys) has s external nodes and O(s) total
nodes (see the worked example below).
Below is the illustration of the Compressed Trie:

[Figure: compressed trie illustration not reproduced here]
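As a small worked example (our own, chosen for illustration): for the keys bear, bell, bid and
bull, a standard trie stores one character per node, while the compressed trie merges every chain
of single-child nodes into one edge label. The root then has a single edge labeled "b"; the node
below it has the edges "e", "id" and "ull"; and the "e" node has the edges "ar" and "ll". The four
external nodes correspond to the four keys, in line with property 9 above.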
Suffix Trie
A suffix trie has the following properties:
1. A Suffix Trie has the below structure:
struct SuffixTreeNode {

    // Array to store the children, one per possible character
    struct SuffixTreeNode *children[256];

    // Pointer to another node via a suffix link
    struct SuffixTreeNode *suffixLink;

    // The (start, end) interval specifies the edge
    // by which the node is connected to its parent node
    int start;
    int *end;

    // For leaf nodes, stores the index of the suffix
    // for the path from the root to the leaf
    int suffixIndex;
};

2. A suffix trie is an advanced version of the compressed trie.
3. The most common application of a suffix trie is pattern matching.
4. While performing the insertion operation, both the word and its suffixes are stored.
5. A suffix trie is also used in word matching and prefix matching.
6. To generate a suffix trie, all the suffixes of the given string are considered as individual
words (a small worked example follows the figure note below).
7. Using these suffixes, a compressed trie is built.
Below is the illustration of the Suffix Trie:

[Figure: suffix trie illustration not reproduced here]
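As a small worked example (our own): the suffixes of the string "banana" are banana, anana,
nana, ana, na and a. Treating these six suffixes as individual words and building a compressed
trie over them (commonly with a terminal symbol such as $ appended, so that no suffix is a
proper prefix of another) yields the suffix trie of "banana". A pattern such as "nan" can then be
matched by simply walking from the root along its characters.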
