6 Suffix-Tree

Suffix Tree
Trie
• A trie is a tree whose child branches are labeled
with distinct letters from the alphabet Σ. The branches
are ordered alphabetically. (We will append a
terminator, $ or \0, to the end of all strings.)
• Coalescing non-branching paths ⇒ Compact or
compressed Trie
Suffix tree
• A suffix tree for a given text is a compressed trie for all
suffixes of the given text.
• Example word set: {bear, bell, bid, bull, buy, sell, stock, stop}
[Figures: standard trie and compressed trie for this word set]
Suffix tree
• How to build a Suffix Tree for a given text?
1) Generate all suffixes of given text.
2) Consider all suffixes as individual words and
build a compressed trie
Example
• Let us consider an example text “banana\0”, where ‘\0’ is the
string termination character. The following are all suffixes of
“banana\0”:
• banana\0
• anana\0
• nana\0
• ana\0
• na\0
• a\0
• \0
Example (suffix Trie for “Banana”)
Suffix Tree
• If we join chains of single nodes, we get the
following compressed trie, which is the Suffix
Tree for given text “banana\0”
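The two steps (build a trie of all suffixes, then join chains of single nodes) can be sketched as follows. This is only an illustrative sketch: the nested-dict representation, the function names `build_suffix_trie` and `compress`, and the use of `$` in place of `\0` are assumptions made here, not part of the slides.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie, one character per edge."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Join each chain of single-child nodes into one edge with a string label."""
    compact = {}
    for ch, child in node.items():
        label = ch
        # Keep walking while the path does not branch.
        while len(child) == 1:
            (next_ch, next_child), = child.items()
            label += next_ch
            child = next_child
        compact[label] = compress(child)
    return compact


tree = compress(build_suffix_trie("banana$"))
for label, subtree in sorted(tree.items()):
    print(repr(label), subtree)
```

For “banana$” the root ends up with four edges, labeled `$`, `a`, `banana$` and `na`. Building the trie first takes quadratic time and space in the length of the text, so this only illustrates the definition and is not an efficient construction.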
How to search a pattern in the built suffix tree?
• We have discussed above how to build a Suffix Tree
which is needed as a preprocessing step in pattern
searching.
• The following are abstract steps to search for a pattern in the
built suffix tree.
– Starting from the first character of the pattern and the root of
the suffix tree, do the following for every character.
• For the current character of the pattern, if there is an edge from the
current node of the suffix tree (or a next character on the current
edge label) that matches it, follow it.
• If there is no match, print “Pattern doesn’t exist in text” and return.
– If all characters of the pattern have been processed, i.e., there is a
path from the root for the characters of the given pattern, then print
“Pattern found”.
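A minimal sketch of this search on a compressed tree, assuming the suffix tree for “banana$” is stored as nested dicts keyed by edge label. The tree is written out by hand here, and the names `BANANA_TREE` and `contains` are hypothetical.

```python
# Compressed suffix tree for "banana$", written out by hand as nested dicts.
BANANA_TREE = {
    "banana$": {},
    "a": {"na": {"na$": {}, "$": {}}, "$": {}},
    "na": {"na$": {}, "$": {}},
    "$": {},
}


def contains(tree, pattern):
    """Return True if pattern labels a path that starts at the root."""
    node, i = tree, 0
    while i < len(pattern):
        # Pick the edge whose label starts with the current pattern character.
        label = next((l for l in node if l[0] == pattern[i]), None)
        if label is None:
            return False
        # Compare the pattern against the edge label; the pattern may end mid-edge.
        chunk = pattern[i:i + len(label)]
        if not label.startswith(chunk):
            return False
        i += len(chunk)
        node = node[label]
    return True


print(contains(BANANA_TREE, "nan"))  # True
print(contains(BANANA_TREE, "nab"))  # False
```

Scanning the outgoing edges with `next(...)` keeps the sketch short; a real implementation would more likely index children by their first character.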
Applications of Suffix tree
• Suffix trees can be used for a wide range of
problems. The following are some well-known
problems for which suffix trees provide
solutions with optimal time complexity.
1) Pattern Searching
2) Finding the longest repeated substring
3) Finding the longest common substring
4) Finding the longest palindrome in a string
Exact String Matching
• Input
– Pattern P of length n
– Text T of length m
• Output
– Position of all occurrences of P in T
Pattern P in Text T
[Figure: P inside T, viewed as a prefix of a suffix of T and as a suffix of a prefix of T]
Exact String Matching Problem
• Given a string text T and a string pattern P, the
exact string matching problem is to find a
suffix of T such that P is a prefix of this suffix
or to find a prefix of T such that P is a suffix of
this prefix.
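Both formulations can be checked directly by brute force, without any suffix tree. The sketch below is only meant to make the definition concrete; the function names and the choice of 1-based positions are assumptions made here.

```python
def occurrences_via_suffixes(text, pattern):
    """1-based positions i such that pattern is a prefix of the suffix starting at i."""
    return [i + 1 for i in range(len(text)) if text.startswith(pattern, i)]


def occurrences_via_prefixes(text, pattern):
    """Same positions, found from prefixes of text that end with pattern."""
    return [
        end - len(pattern) + 1
        for end in range(len(pattern), len(text) + 1)
        if text[:end].endswith(pattern)
    ]


print(occurrences_via_suffixes("banana", "ana"))  # [2, 4]
print(occurrences_via_prefixes("banana", "ana"))  # [2, 4]
```

The two functions return the same positions, which is the point of the definition: every occurrence of P is both a prefix of some suffix and a suffix of some prefix of T.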
Example
• Let S = atgttatcat. The following are its suffixes:
SU1 = atgttatcat
SU2 = tgttatcat
SU3 = gttatcat
SU4 = ttatcat
SU5 = tatcat
SU6 = atcat
SU7 = tcat
SU8 = cat
SU9 = at
SU10 = t
Suffix tree for S
[Figure: suffix tree for S = atgttatcat$, with leaves labeled by the starting positions 1 to 10 of the suffixes]
Example
• It is easy to determine whether P appears in T.
• It is more complicated to determine the location
where P appears in T.
• Consider P = at, with the text S = atgttatcat from the
previous example.
• Then, as shown in the suffix tree for S, this
pattern ends on an edge whose ending node is
an ancestor of three leaf nodes, labeled 1, 6
and 9.
• Thus, P appears at locations 1, 6 and 9.
Determine the position of P in T
• In general, if a pattern P is found to exist in T, its
starting location can be determined as follows:
Let the searching of P in the suffix tree of T end
at edge E.
• Let the ending node of E be A. Let the leaf
nodes which are descendants of A be i1, i2, …, ia.
• Then the starting locations of P in T are i1, i2, …,
ia.
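The rule can be sketched in a few lines. To keep the example short it assumes an uncompressed suffix trie (one character per edge), so the search always ends at a node rather than inside an edge. The `"leaf"` key and all function names are hypothetical choices made here.

```python
def build_suffix_trie(text):
    """Suffix trie of text + '$'; the node reached after '$' stores the 1-based start."""
    s = text + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # multi-character key, so it cannot clash with an edge
    return root


def leaf_positions(node):
    """Collect the labels of all leaves that are descendants of node."""
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == "leaf":
                positions.append(value)
            else:
                stack.append(value)
    return sorted(positions)


def find_all(root, pattern):
    """Follow pattern from the root, then report every leaf below the end node."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return leaf_positions(node)


trie = build_suffix_trie("atgttatcat")
print(find_all(trie, "at"))  # [1, 6, 9]
```

On a compressed suffix tree the idea is the same: when the match ends inside an edge E, the leaves are collected below the ending node of E.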
Time complexity
• The worst-case time complexity of searching a suffix
tree is O(n), where n is the length of the pattern
P.
• Once the suffix tree of the text T is constructed, it
can be reused for all patterns.
• This is an advantage of this approach.
• However, we must admit that tree searching is not
easy to program, and the suffix tree occupies
a large amount of memory if the text is long.
Why is searching for a pattern efficient
when the suffix tree is used?
• Let us assume T = accgtccgttat and P = gtt.
• If a brute-force searching method is used, we
search from the beginning of the text T.
• But if the suffix tree approach is used, since
the first character of P is g, the search follows the
edge from the root that starts with g, so only the
locations in T where g occurs are considered.
• That is why this approach is so different from
many other methods.
Algorithm to Construct the Suffix Tree
• Input: A string S of length n (with the terminator $ appended, so that no suffix is a prefix of another)
• Output: A suffix tree of S
• Step 1: Create all suffixes of S.
Create a root node N.
Denote the set of all suffixes of S by GN.
Put (N, GN) into a queue Q.
• Step 2: Pop an element (x, Gx) from Q.
Divide all suffixes in Gx into groups such that, in each group, all suffixes start with the same
character.
Denote these groups as G1, G2, …, Gk for some k.
• Step 3: For i = 1 to k, do the following:
If Gi contains one suffix, create a leaf node labeled with the index of the suffix and a branch from
x to this leaf labeled with the suffix.
Otherwise, find the longest common prefix of all suffixes in Gi.
Create a node y and a branch from x to y labeled with the longest common prefix.
Delete this prefix from every suffix in the group.
Let this new group of suffixes be denoted as Gy.
Put (y, Gy) into Q.
• Step 4: If Q is empty, exit and report the tree as a suffix tree for S.
Otherwise, go to Step 2.
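A minimal Python sketch of this queue-based construction follows. It reads Step 3 as a loop over the groups G1, …, Gk and enqueues each new internal node together with its own group. The nested-dict representation (internal nodes are dicts, leaves are integer start positions), the function names, and the omission of the suffix consisting only of `$` are assumptions made here.

```python
from collections import deque


def longest_common_prefix(strings):
    """Longest string that is a prefix of every element of strings."""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def build_suffix_tree(text, terminator="$"):
    """Build a compressed suffix tree; edge labels are dict keys, leaves are 1-based indices."""
    s = text + terminator
    root = {}
    # Step 1: all suffixes, each paired with its 1-based starting position.
    queue = deque([(root, [(s[i:], i + 1) for i in range(len(text))])])
    while queue:  # Step 4: repeat until the queue is empty
        x, group = queue.popleft()
        # Step 2: group the suffixes by their first character.
        groups = {}
        for rest, index in group:
            groups.setdefault(rest[0], []).append((rest, index))
        # Step 3: one branch from x per group.
        for members in groups.values():
            if len(members) == 1:
                rest, index = members[0]
                x[rest] = index  # leaf labeled with the suffix index
            else:
                lcp = longest_common_prefix([rest for rest, _ in members])
                y = {}
                x[lcp] = y
                queue.append((y, [(rest[len(lcp):], index) for rest, index in members]))
    return root


tree = build_suffix_tree("atgttatcat")
print(sorted(tree["at"].items()))  # [('$', 9), ('cat$', 6), ('gttatcat$', 1)]
```

Because every suffix ends with the unique terminator, removing the longest common prefix never leaves an empty remainder, so `rest[0]` is always defined. The worst-case running time is quadratic in the length of the text, so this sketch illustrates the steps rather than an efficient algorithm.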
