Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors,
Suffix Trees
String: any sequence of characters. Substring of string S: the string composed of characters i through j of S, for some i <= j.
S = cater => ate is a substring. car is not a substring. The empty string is a substring of S.
Subsequence
Subsequence of string S: the string composed of characters i1 < i2 < ... < ik of S.
S = cater => ate is a subsequence. car is a subsequence. The empty string is a subsequence.
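The two definitions can be checked directly on the slide's example. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and not part of the original material.

```python
def is_substring(s: str, p: str) -> bool:
    """True if p occurs in s as a contiguous block of characters."""
    return p in s


def is_subsequence(s: str, p: str) -> bool:
    """True if the characters of p occur in s in order, gaps allowed."""
    remaining = iter(s)
    # `ch in remaining` consumes the iterator up to and including the
    # first match, so later characters of p must match later in s.
    return all(ch in remaining for ch in p)


print(is_substring("cater", "ate"), is_substring("cater", "car"))
print(is_subsequence("cater", "ate"), is_subsequence("cater", "car"))
```

Every substring is also a subsequence, but not the other way around, which is why car passes only the second check.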
String/Pattern Matching
You are given a source string S. Answer queries of the form: is the string p_i a substring of S? One approach is Knuth-Morris-Pratt (KMP) string matching.
O(|S| + |p_i|) time per query. O(n|S| + Σ_i |p_i|) time for n queries.
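As a rough illustration of where the O(|S| + |p_i|) bound comes from, here is a sketch of KMP in its usual textbook form. The names `failure_function` and `kmp_contains` are illustrative. The failure table is built from the query string in O(|p|) time, and the scan over S never moves backwards in S.

```python
def failure_function(p: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of p[:i+1]
    that is also a suffix of p[:i+1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s: str, p: str) -> bool:
    """Return True if p is a substring of s."""
    if not p:
        return True  # the empty string is a substring of every string
    fail = failure_function(p)
    k = 0  # number of characters of p currently matched
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False


print(kmp_contains("abbbabbbb#", "babb"))
```

Because the table depends only on the query string, it has to be rebuilt for every new query, and S is rescanned each time. That is the n|S| term in the bound for n queries.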
KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string S. An application of string matching:
Genome project. Databank of strings (gene sequences). Character set is ATGC. Determine if a new sequence is a substring of a databank sequence.
When the last character of S appears more than once in S, some suffix of S is a proper prefix of another suffix. Append an end-of-string character # that occurs nowhere else in S to overcome this problem. S = creeper#
Suffixes: creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
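The effect of the end-of-string character can be checked by listing the suffixes with and without it. This is a small sketch with illustrative helper names (`suffixes`, `suffix_is_prefix_of_another`). Without #, the suffix r of creeper is a prefix of the suffix reeper. With # appended, no suffix is a prefix of another.

```python
def suffixes(s: str) -> list[str]:
    """All non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def suffix_is_prefix_of_another(s: str) -> bool:
    """True if some suffix of s is a proper prefix of another suffix."""
    sufs = suffixes(s)
    return any(a != b and b.startswith(a) for a in sufs for b in sufs)


print(suffixes("creeper#"))
print(suffix_is_prefix_of_another("creeper"))   # last character r repeats
print(suffix_is_prefix_of_another("creeper#"))  # # occurs only once
```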
[Figures: suffix tree for S = abbbabbbb#, character positions 1 through 10.]
Suffix Array
Array that contains the start positions of the suffixes in lexicographic order.
S = abbbabbbb#. Assume # < a < b.
# < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
LCP = length of longest common prefix between adjacent entries of SA.
LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
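The SA and LCP values for abbbabbbb# can be reproduced with a deliberately naive sketch that sorts the suffixes directly. This is not a linear-time construction. The function names are illustrative, and the code relies on Python's code-point ordering, in which # < a < b happens to hold.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order.
    Naive: sorts whole suffixes, so it is far from linear time."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Longest-common-prefix lengths between adjacent entries of sa.
    Has len(sa) - 1 entries; the last SA entry has no successor."""
    def common_prefix_len(a: str, b: str) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    return [common_prefix_len(s[sa[k] - 1:], s[sa[k + 1] - 1:])
            for k in range(len(sa) - 1)]


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)                # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(lcp_array(s, sa))  # [0, 4, 0, 1, 1, 2, 2, 3, 3]
```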
Less space than a suffix tree.
Linear time construction.
Can be used to solve several of the problems solved by a suffix tree with the same asymptotic complexity.
Substring matching: binary search for p using SA. Each of the O(log |S|) comparisons takes O(|p|) time, so a query takes O(|p| log |S|) time.
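The binary search can be sketched as follows, assuming the 1-based SA for abbbabbbb# given earlier. The name `sa_contains` is illustrative. The search compares p with the first |p| characters of each probed suffix, and these truncated prefixes appear in non-decreasing order along SA.

```python
S = "abbbabbbb#"
SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]  # 1-based start positions


def sa_contains(s: str, sa: list[int], p: str) -> bool:
    """Return True if p is a substring of s, using binary search on sa."""
    def prefix(k: int) -> str:
        start = sa[k] - 1  # convert to a 0-based index
        return s[start:start + len(p)]

    lo, hi = 0, len(sa)
    while lo < hi:  # find the first suffix whose prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and prefix(lo) == p


for query in ("babb", "abbba", "baba", "ab"):
    print(query, sa_contains(S, SA, query))
```

Unlike KMP, nothing is precomputed per query: the array is built once for S and reused for every p.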
[Figures: suffix tree for S = abbbabbbb#, character positions 1 through 10, shown with the example strings babb, abbba, baba, and ab, and with the labels m = 2 and m = 5.]
Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming. Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
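For the subsequence case, the O(|S|*|T|) bound comes from filling a table with one cell per pair of prefixes. The following is a sketch of the standard dynamic program, keeping only two rows of the table. The name `lcs_length` is illustrative.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence of s and t."""
    # prev[j] = LCS length of the processed prefix of s and t[:j]
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(lcs_length("cater", "car"))  # car is a subsequence of cater
```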
Find the branch node that has the maximum char# and has, in its subtree, at least one element node that represents a suffix beginning in S as well as at least one that represents a suffix beginning in T.
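The same idea can be illustrated without building a suffix tree. The sketch below swaps in a sorted list of suffixes (a naively built suffix array) of the combined string S$T#, so it does not achieve the O(|S|+|T|) bound. It assumes that $ and # occur in neither S nor T. The name `longest_common_substring` is illustrative. Adjacent suffixes in sorted order that begin in different strings play the role of the two kinds of element nodes under a common branch node.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Longest string that is a substring of both s and t.
    Assumes '$' and '#' occur in neither s nor t."""
    text = s + "$" + t + "#"
    # Naive suffix sort; a linear-time construction would replace this.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for i, j in zip(order, order[1:]):
        # Only compare neighbours that begin in different strings.
        if (i < len(s)) == (j < len(s)):
            continue
        k = 0
        while (i + k < len(text) and j + k < len(text)
               and text[i + k] == text[j + k]):
            k += 1
        if k > len(best):
            best = text[i:i + k]
    return best


print(longest_common_substring("cater", "crate"))
```

The separator $ stops a match from running across the boundary between S and T, because $ occurs only once in the combined string.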