
Suffix Trees

Suffix trees
Linearized suffix trees
Virtual suffix trees
Suffix arrays
Enhanced suffix arrays
Suffix cactus, suffix vectors, ...

Suffix Trees
String: any sequence of characters.
Substring of string S: the string composed of characters i through j of S, i <= j.
S = cater => ate is a substring; car is not a substring. The empty string is a substring of S.

Subsequence
Subsequence of string S: the string composed of characters i1 < i2 < ... < ik of S.
S = cater => ate is a subsequence. car is a subsequence. The empty string is a subsequence.
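The two definitions can be checked directly on the example S = cater. The sketch below is a brute-force illustration only; the function names `is_substring` and `is_subsequence` are chosen here and do not come from the slides.

```python
def is_substring(p: str, s: str) -> bool:
    """True if p equals a run of consecutive characters of s (or p is empty)."""
    return any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1))


def is_subsequence(p: str, s: str) -> bool:
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    remaining = iter(s)
    # "ch in remaining" consumes the iterator up to the first match,
    # so later characters of p must be found further to the right in s.
    return all(ch in remaining for ch in p)


S = "cater"
print(is_substring("ate", S), is_substring("car", S))      # True False
print(is_subsequence("ate", S), is_subsequence("car", S))  # True True
```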

String/Pattern Matching
You are given a source string S. Answer queries of the form: is the string pi a substring of S?
Knuth-Morris-Pratt (KMP) string matching:
O(|S| + |pi|) time per query.
O(n|S| + sum over i of |pi|) time for n queries.
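For reference, a minimal sketch of a KMP query is shown here. It preprocesses the pattern into a failure function and then scans S once, which is where the O(|S| + |pi|) bound per query comes from. The names `failure_function` and `kmp_contains` are illustrative, not taken from the slides.

```python
def failure_function(p: str) -> list[int]:
    """f[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    f = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f


def kmp_contains(p: str, s: str) -> bool:
    """True if p is a substring of s; one pass over p, then one pass over s."""
    if not p:
        return True
    f = failure_function(p)
    k = 0  # number of pattern characters currently matched
    for ch in s:
        while k > 0 and ch != p[k]:
            k = f[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False
```

Each of the n queries repeats the scan over S, which gives the n|S| term in the total.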

Suffix tree solution:
O(|S| + sum over i of |pi|) time for n queries.

String/Pattern Matching
KMP preprocesses the query string pi, whereas the suffix tree method preprocesses the source string S.
An application of string matching: the genome project.
Databank of strings (gene sequences).
Character set is ATGC.
Determine if a new sequence is a substring of a databank sequence.

Definition Of Suffix Tree


Compressed trie with edge information.
Keys are the nonempty suffixes of a given string S.
Nonempty suffixes of S = sleeper are:
sleeper, leeper, eeper, eper, per, er, and r.

String Matching & Suffixes


pi is a substring of S iff pi is a prefix of some suffix of S.
Nonempty suffixes of S = sleeper are:
sleeper, leeper, eeper, eper, per, er, and r.

Which of these are substrings of S?


leep, eepe, pe, leap, peel
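The exercise can be worked by applying the prefix-of-a-suffix property literally. This is a brute-force sketch for illustration (the helper names are assumptions), not the suffix tree search itself.

```python
def nonempty_suffixes(s: str) -> list[str]:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(p: str, s: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


S = "sleeper"
answers = {p: is_substring_via_suffixes(p, S)
           for p in ["leep", "eepe", "pe", "leap", "peel"]}
print(answers)
# {'leep': True, 'eepe': True, 'pe': True, 'leap': False, 'peel': False}
```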

Last Character Of S Repeats


When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.
S = creeper
creeper, reeper, eeper, eper, per, er, r

When the last character of S appears more than once in S, use an end of string character # to overcome this problem.
S = creeper#
creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
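A quick check of the effect of the end of string character, as a sketch. The function name `suffixes_that_prefix_another` is made up for this illustration.

```python
def suffixes_that_prefix_another(s: str) -> list[str]:
    """Suffixes of s that are a proper prefix of some other suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return [a for a in suffixes
            if any(b != a and b.startswith(a) for b in suffixes)]


print(suffixes_that_prefix_another("creeper"))   # ['r']  (r is a prefix of reeper)
print(suffixes_that_prefix_another("creeper#"))  # []
```

With # appended, every suffix ends in a character that occurs nowhere else, so no suffix can be a proper prefix of another.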

Suffix Tree For S = abbbabbbb#


[Figure: suffix tree for S = abbbabbbb#.]
Suffix Tree For S = abbbabbbb#


[Figure: suffix tree for S = abbbabbbb#. The characters of S are numbered 1 through 10.]

Suffix Tree For S = abbbabbbb#


[Figure: suffix tree for S = abbbabbbb#. The characters of S are numbered 1 through 10.]

Suffix Tree Construction


See Web write up for algorithm.
Time complexity:
|S| = n, alphabet size = r.
O(nr) using array nodes. This is O(n) when r is a constant (or r <= c for some constant c).
O(n) expected time using a hash table.
O(n) time algorithm for large r in reference cited in Web write up.

Suffix Array
Array that contains the start positions of the suffixes in lexicographic order.
S = abbbabbbb#
Assume # < a < b.
# < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
LCP = length of longest common prefix between adjacent entries of SA.
LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
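The SA and LCP values for abbbabbbb# can be reproduced with a naive construction. The sketch below simply sorts the suffixes, so it is far from linear time; it is meant only to make the definitions concrete. Python's default character order already has # < a < b. The function names are assumptions.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """LCP lengths between suffixes at adjacent entries of sa (len(sa) - 1 values)."""
    def lcp(a: str, b: str) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    return [lcp(s[sa[k] - 1:], s[sa[k + 1] - 1:]) for k in range(len(sa) - 1)]


S = "abbbabbbb#"
SA = suffix_array(S)
print(SA)                # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(lcp_array(S, SA))  # [0, 4, 0, 1, 1, 2, 2, 3, 3]
```

The last SA entry has no successor, so this sketch returns nine LCP values where the tenth position is left undefined.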

Suffix Array
Less space than a suffix tree.
Linear time construction.
Can be used to solve several of the problems solved by a suffix tree with the same asymptotic complexity.
Substring matching: binary search for p using SA. O(|p| log |S|).
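A minimal sketch of the binary search follows. Each step compares p against the first |p| characters of one suffix, which gives O(|p|) work per step and O(log |S|) steps. The suffix array is built naively here by sorting, and the names `build_sa` and `sa_contains` are illustrative.

```python
def build_sa(s: str) -> list[int]:
    """Naive suffix array: 1-based start positions in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_contains(p: str, s: str, sa: list[int]) -> bool:
    """Binary search for the first suffix whose length-|p| prefix is >= p."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if s[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    # p occurs in s iff that suffix exists and actually starts with p.
    return lo < len(sa) and s[sa[lo] - 1:].startswith(p)


S = "abbbabbbb#"
SA = build_sa(S)
print([sa_contains(p, S, SA) for p in ["babb", "abbba", "baba"]])
# [True, True, False]
```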

O(|pi|) Time Substring Matching


[Figure: suffix tree for S = abbbabbbb# (characters numbered 1 through 10), with example search patterns babb, abbba, and baba.]

Find All Occurrences Of pi


Search suffix tree for pi. Suppose the search for pi is successful. When search terminates at an element node, pi appears exactly once in the source string S.

Search Terminates At Element Node


[Figure: suffix tree for S = abbbabbbb# (characters numbered 1 through 10), with example search pattern abbbb#.]

Search Terminates At Branch Node


When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.
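The idea can be tried out on a simplified structure. The sketch below uses an uncompressed suffix trie with hash table (dict) nodes instead of the compressed suffix tree, and builds it by inserting every suffix, which takes quadratic time and space. It assumes S ends in a unique end of string character so that every suffix ends at its own element node. The class and function names are assumptions made for this illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.start = None   # element nodes only: 1-based start of the suffix


def build_suffix_trie(s: str) -> Node:
    """Insert every nonempty suffix of s; assumes s ends in a unique terminator."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1
    return root


def occurrences(root: Node, p: str) -> list[int]:
    """Start positions of all occurrences of p, in increasing order."""
    node = root
    for ch in p:              # walk down along p: O(|p|) steps
        node = node.children.get(ch)
        if node is None:
            return []         # search failed: p is not a substring
    found, stack = [], [node]
    while stack:              # every element node below gives one occurrence
        node = stack.pop()
        if node.start is not None:
            found.append(node.start)
        stack.extend(node.children.values())
    return sorted(found)


root = build_suffix_trie("abbbabbbb#")
print(occurrences(root, "ab"))      # [1, 5]
print(occurrences(root, "abbbb#"))  # [5]
print(occurrences(root, "baba"))    # []
```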

Search Terminates At Branch Node


[Figure: suffix tree for S = abbbabbbb# (characters numbered 1 through 10), with example search pattern ab.]

Find All Occurrences Of pi


To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment the suffix tree:
Link all element nodes into a chain in inorder.
Each branch node keeps a pointer to the leftmost and rightmost element nodes in its subtree.

Augmented Suffix Tree


[Figure: augmented suffix tree for S = abbbabbbb# (characters numbered 1 through 10).]

Longest Repeating Substring


Find the longest substring of S that occurs at least m > 1 times in S.
Label branch nodes with the number of element nodes in their subtree.
Find the branch node with label >= m and max char# field.
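The labeling idea can be sketched on an uncompressed suffix trie, where each node counts the suffixes that pass through it (equivalently, the element nodes below it) and the answer is the deepest node whose count is at least m. This is a quadratic-size stand-in for the suffix tree, and the function name is an assumption.

```python
def longest_substring_with_m_occurrences(s: str, m: int) -> str:
    """Longest substring of s occurring at least m times (occurrences may overlap)."""
    root = {}   # character -> [count, children dict]
    best = ""
    for i in range(len(s)):
        node = root
        for depth, ch in enumerate(s[i:], start=1):
            entry = node.setdefault(ch, [0, {}])
            entry[0] += 1                 # one more suffix passes through this node
            if entry[0] >= m and depth > len(best):
                best = s[i:i + depth]     # deepest node seen so far with count >= m
            node = entry[1]
    return best


S = "abbbabbbb#"
print(longest_substring_with_m_occurrences(S, 2))  # abbb
print(longest_substring_with_m_occurrences(S, 5))  # bb
```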

Longest Repeating Substring


[Figure: suffix tree for S = abbbabbbb# (characters numbered 1 through 10), with branch nodes labeled by the number of element nodes in their subtrees. The cases m = 2 and m = 5 are marked.]

Longest Common Substring


Given two strings S and T, find the longest common substring.
S = carport, T = airports
Longest common substring = rport
Longest common subsequence = arport

Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming. Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
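For comparison, the O(|S|*|T|) dynamic program for the longest common subsequence can be sketched as follows. This is the standard textbook recurrence, not something spelled out in the slides, and the function name is illustrative.

```python
def longest_common_subsequence(s: str, t: str) -> str:
    """One longest common subsequence of s and t, by dynamic programming."""
    # L[i][j] = length of a longest common subsequence of s[:i] and t[:j]
    L = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Trace back from the bottom-right corner to recover one answer.
    out = []
    i, j = len(s), len(t)
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence("carport", "airports"))  # arport
```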

Longest Common Substring


Let $ be a new symbol. Construct the suffix tree for the string U = S$T#.
U = carport$airports#
No repeating substring includes $.
Find the longest repeating substring that is both to the left and to the right of $.

Find branch node that has max char# and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
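A sketch of this search on an uncompressed suffix trie of U = S$T# is given below. Each node records whether a suffix beginning in S and a suffix beginning in T pass through it, and the deepest node with both marks spells the answer. The trie takes quadratic space, so this shows the idea rather than the O(|S|+|T|) suffix tree method. It assumes $ and # occur in neither S nor T, and the function name is made up.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Longest string that is a substring of both s and t."""
    u = s + "$" + t + "#"   # assumes "$" and "#" appear in neither s nor t
    root = {}               # character -> [seen_from_s, seen_from_t, children dict]
    best = ""
    for i in range(len(u)):
        from_s = i < len(s)                       # suffix begins inside S
        from_t = len(s) < i <= len(s) + len(t)    # suffix begins inside T
        node = root
        for depth, ch in enumerate(u[i:], start=1):
            entry = node.setdefault(ch, [False, False, {}])
            entry[0] = entry[0] or from_s
            entry[1] = entry[1] or from_t
            if entry[0] and entry[1] and depth > len(best):
                best = u[i:i + depth]   # deepest node reached from both sides so far
            node = entry[2]
    return best


print(longest_common_substring("carport", "airports"))  # rport
```

A node whose path contains $ is never marked from T, because no suffix that begins in T contains $, so the answer never spans the separator.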
