
Suffix Trees

• String … any sequence of characters.


• Substring of string S … string composed of characters i through j, i <= j, of S.
  ▪ S = cater => ate is a substring.
  ▪ car is not a substring.
  ▪ The empty string is a substring of S.
Subsequence

• Subsequence of string S … string composed of characters i1 < i2 < … < ik of S.
  ▪ S = cater => ate is a subsequence.
  ▪ car is a subsequence.
  ▪ The empty string is a subsequence.
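The two definitions can be checked mechanically. The following is a minimal Python sketch; the function names is_substring and is_subsequence are illustrative and do not come from the slides.

```python
def is_substring(p, s):
    """True if p equals s[i:j] for some i <= j (the empty string qualifies)."""
    return any(s.startswith(p, i) for i in range(len(s) + 1))


def is_subsequence(p, s):
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    remaining = iter(s)
    # `ch in remaining` consumes the iterator up to and including the match,
    # so later characters of p can only match later positions of s.
    return all(ch in remaining for ch in p)


print(is_substring("ate", "cater"), is_substring("car", "cater"))      # True False
print(is_subsequence("ate", "cater"), is_subsequence("car", "cater"))  # True True
```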
String/Pattern Matching

• You are given a source string S.


• Answer queries of the form: is the string pi a
substring of S?
• Knuth-Morris-Pratt (KMP) string matching.
  ▪ O(|S| + |pi|) time per query.
  ▪ O(n|S| + Σi |pi|) time for n queries.
• Suffix tree solution.
  ▪ O(|S| + Σi |pi|) time for n queries.
String/Pattern Matching
• KMP preprocesses the query string pi,
whereas the suffix tree method preprocesses
the source string S.
• An application of string matching.
  ▪ Genome project.
  ▪ Databank of strings (gene sequences).
  ▪ Character set is ACGT.
  ▪ Determine if a “new” sequence is a substring of a databank sequence.
Definition Of Suffix Tree
• Compressed trie with edge information.
• Keys are the nonempty suffixes of a given
string S.
• Nonempty suffixes of S = sleeper are:
  ▪ sleeper
  ▪ leeper
  ▪ eeper
  ▪ eper
  ▪ per, er, and r.
String Matching & Suffixes
• pi is a substring of S iff pi is a prefix of some
suffix of S.
• Nonempty suffixes of S = sleeper are:
  ▪ sleeper
  ▪ leeper
  ▪ eeper
  ▪ eper
  ▪ per, er, and r.
• Which of these are substrings of S?
  ▪ leep, eepe, pe, leap, peel
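The prefix-of-a-suffix characterization can be tried directly on the candidates above. This is a brute-force Python sketch for illustration only (the helper names are made up here); it takes quadratic time and is not how a suffix tree answers the query.

```python
def nonempty_suffixes(s):
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(p, s):
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


s = "sleeper"
print(nonempty_suffixes(s))
for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_substring_via_suffixes(p, s))
```

Running it reports leep, eepe, and pe as substrings of sleeper, and leap and peel as not.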
Last Character Of S Repeats
• When the last character of S appears more
than once in S, S has at least one suffix that
is a proper prefix of another suffix.
• S = creeper
  ▪ creeper, reeper, eeper, eper, per, er, r
• When the last character of S appears more than once in S, use an end of string character # to overcome this problem.
• S = creeper#
  ▪ creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, showing the branch nodes and edge labels. The three edges leaving the root are labeled abbb, b, and #.]
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb# with the element nodes added. Each element node corresponds to one suffix and is identified by that suffix's starting position in S (positions 1 through 10).]
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb# (positions 1 through 10), with branch nodes, element nodes, and additional annotations at the branch nodes.]
Suffix Tree Construction

• See Web write up for algorithm.


• Time complexity
  ▪ |S| = n, alphabet size = r.
  ▪ O(nr) using array nodes.
  ▪ This is O(n) for r a constant (or r <= c).
  ▪ O(n) expected time using a hash table.
  ▪ O(n) time algorithm for large r in reference cited in Web write up.
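The linear-time algorithm itself is in the Web write up and is not reproduced here. To make the data structure concrete, the following Python sketch builds the same compressed trie by the naive method of inserting one suffix at a time, which takes O(n^2) time rather than the bounds listed above. The Node class, its fields, and the function names are assumptions made for this illustration, and edge labels are stored as strings for readability.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}   # first character of edge label -> (label, child)
        self.start = start   # 1-based suffix start, set only on element nodes


def common_prefix_len(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def build_suffix_tree(s):
    """Naive O(n^2) construction: insert each suffix into a compressed trie."""
    assert s and s.count(s[-1]) == 1, "last character must be a unique terminator"
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            if rest[0] not in node.children:
                # No edge starts with this character: hang a new element node here.
                node.children[rest[0]] = (rest, Node(start=i + 1))
                break
            label, child = node.children[rest[0]]
            k = common_prefix_len(label, rest)
            if k < len(label):
                # Mismatch inside the edge: split it at a new branch node.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


root = build_suffix_tree("abbbabbbb#")
print(sorted(label for label, _ in root.children.values()))  # ['#', 'abbb', 'b']
```

The unique terminator guarantees that the walk never runs out of characters before a new element node is created.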
O(|pi|) Time Substring Matching
[Figure: suffix tree for S = abbbabbbb# (positions 1 through 10), used to trace searches for the patterns babb, abbba, and baba.]
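The search walks down from the root, consuming one character of pi per step, so its cost depends on |pi| and not on |S|. The Python sketch below shows the walk on an uncompressed suffix trie built from nested dicts. That is a simplification assumed here to keep the example short: a real suffix tree compresses chains into labeled edges and uses O(n) space, while this stand-in uses O(n^2).

```python
def build_suffix_trie(s):
    """Uncompressed stand-in for a suffix tree: one dict level per character."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(root, p):
    """Follow p from the root; the cost is O(|p|) dictionary lookups."""
    node = root
    for ch in p:
        if ch not in node:
            return False   # fell off the trie: p is not a prefix of any suffix
        node = node[ch]
    return True


trie = build_suffix_trie("abbbabbbb#")
for p in ["babb", "abbba", "baba"]:
    print(p, contains(trie, p))
```

For the three example patterns, babb and abbba are found and baba is not.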


Find All Occurrences Of pi

• Search suffix tree for pi.


• Suppose the search for pi is successful.
• When search terminates at an element node, pi
appears exactly once in the source string S.
Search Terminates At Element Node

[Figure: suffix tree for S = abbbabbbb# (positions 1 through 10), with a search for pi = abbbb# that terminates at an element node.]
Search Terminates At Branch Node

• When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.
Search Terminates At Branch Node

[Figure: suffix tree for S = abbbabbbb# (positions 1 through 10), with a search for pi = ab that terminates at a branch node.]
Find All Occurrences Of pi

• To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment the suffix tree:
  ▪ Link all element nodes into a chain in inorder.
  ▪ Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
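One way to picture the augmentation: store the element nodes in a list in left-to-right order (the chain), and let every node remember the list positions of its leftmost and rightmost element node. The occurrences of pi are then one contiguous slice. The Python sketch below does this on an uncompressed trie, which is an assumption made for brevity; the class, field, and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.start = None    # 1-based suffix start, set only on element nodes
        self.first = None    # position in `leaves` of the leftmost element node below
        self.last = None     # position in `leaves` of the rightmost element node below


def build_augmented_trie(s):
    """s must end in a unique terminator so every suffix ends at its own leaf."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1
    leaves = []  # element nodes chained in left-to-right order

    def visit(node):
        node.first = len(leaves)
        if node.start is not None:
            leaves.append(node.start)
        for ch in sorted(node.children):
            visit(node.children[ch])
        node.last = len(leaves) - 1

    visit(root)
    return root, leaves


def occurrences(root, leaves, p):
    """Start positions of p: O(|p|) to descend, then one slice of the chain."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []
    return leaves[node.first:node.last + 1]


root, leaves = build_augmented_trie("abbbabbbb#")
print(occurrences(root, leaves, "ab"))      # [1, 5]
print(occurrences(root, leaves, "abbbb#"))  # [5]
```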
Augmented Suffix Tree

[Figure: augmented suffix tree for S = abbbabbbb# (positions 1 through 10), with example pattern pi = b.]
Longest Repeating Substring

• Find the longest substring of S that occurs at least m > 1 times in S.
• Label branch nodes with number of element
nodes in subtree.
• Find branch node with label >= m and max
char# field.
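The labeling idea can be sketched as follows: count the element nodes below every node, and keep the deepest node whose count is at least m. The Python version below uses an uncompressed dict-based trie, so depth in the trie plays the role that the char# field plays in the compressed tree. This is a simplification assumed for the example, and the function name is illustrative.

```python
def longest_repeated(s, m):
    """Longest substring of s with at least m occurrences.

    s must end in a unique terminator so that every suffix ends at a leaf.
    """
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})

    best = ""

    def count_leaves(node, path):
        nonlocal best
        if not node:          # empty dict: an element node, one occurrence
            return 1
        total = sum(count_leaves(child, path + ch) for ch, child in node.items())
        # `total` is the number of suffixes that start with `path`.
        if total >= m and len(path) > len(best):
            best = path
        return total

    count_leaves(root, "")
    return best


print(longest_repeated("abbbabbbb#", 2))  # abbb
print(longest_repeated("abbbabbbb#", 5))  # bb
```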
Longest Repeating Substring
[Figure: suffix tree for S = abbbabbbb# (positions 1 through 10) with each branch node labeled by the number of element nodes in its subtree; the root's label is 10. Examples shown for m = 2 and m = 5.]


Longest Common Substring
• Given two strings S and T.
• Find the longest common substring.
• S = carport, T = airports
  ▪ Longest common substring = rport
  ▪ Longest common subsequence = arport
• Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.
• Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
Longest Common Substring
• Let $ be a new symbol.
• Construct the suffix tree for the string U = S$T#.
  ▪ U = carport$airports#
  ▪ No repeating substring includes $.
  ▪ Find the longest repeating substring that occurs both to the left and to the right of $.
• Find branch node that has max char# and has at
least one element node in its subtree that
represents a suffix that begins in S as well as at
least one that begins in T.
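The node test described above can be sketched in Python as follows. Each node records whether some suffix passing through it begins in S and whether some suffix passing through it begins in T; the deepest node with both marks spells the answer. The sketch uses an uncompressed trie and therefore runs in O(|U|^2) time, not the O(|S|+|T|) of a real suffix tree. The class and function names are assumptions made for this illustration.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.in_s = False    # some suffix through this node begins in S
        self.in_t = False    # some suffix through this node begins in T


def longest_common_substring(s, t):
    assert "$" not in s + t and "#" not in s + t, "separators must be new symbols"
    u = s + "$" + t + "#"
    root = Node()
    best = ""
    for i in range(len(u)):
        from_s = i < len(s)                 # suffix starts inside S
        from_t = len(s) < i < len(u) - 1    # suffix starts inside T
        node = root
        for depth, ch in enumerate(u[i:], start=1):
            node = node.children.setdefault(ch, Node())
            node.in_s |= from_s
            node.in_t |= from_t
            # A path containing $ occurs only once in U, so it can never
            # carry both marks; no explicit check for $ is needed.
            if node.in_s and node.in_t and depth > len(best):
                best = u[i:i + depth]
    return best


print(longest_common_substring("carport", "airports"))  # rport
```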
