Professional Documents
Culture Documents
ttttttttttttttgagacggagtctcgctctg tcgcccaggctggagtgcagtggcggg atctcggctcactgcaagctccgcctcc cgggttcacgccattctcctgcctcagcc tcccaagtagctgggactacaggcgcc cgccactacgcccggctaattttttgtattt ttagtagagacggggtttcaccgttttagc cgggatggtctcgatctcctgacctcgtg atccgcccgcctcggcctcccaaagtgc tgggattacaggcgt
2
Pattern synthesis problem: Learn from T new patterns that occur surprisingly often What is a pattern? Exact substring, approximate 5 substring, with generalized symbols, with gaps,
1.
Suffix tree
2. Suffix array
3. Some applications 4. Finding motifs
6
Trie(abaab)
a b a a b b a b b a
Tree(abaab)
a baab ab baab a
Tree(hattivatti)
hattivatti
attivatti ttivatti tivatti ivatti vatti atti
vatti
atti hattivatti vatti vatti tti tivatti ti atti i t vatti
i
i vatti
vatti
hattivatti
ti
ivatti
tti
ti i
attivatti
ttivatti
11
Tree(hattivatti)
hattivatti
attivatti ttivatti tivatti ivatti vatti atti
vatti
atti hattivatti vatti vatti tti tivatti ti atti i t vatti
i
i vatti
vatti
hattivatti
ti
ivatti
tti
ti i
attivatti
ttivatti
hattivatti
Tree(hattivatti)
hattivatti
attivatti ttivatti tivatti ivatti vatti atti
6,10
7 1 6,10 vatti 8 4 2,5 vatti
3,3
4,5
i
10 i vatti
hattivatti
tti
ti i
2
hattivatti
13
31
hattivatti
atti
ti
ivatti
tti
ti i
attivatti
7 2
ttivatti
15
ttivatti
tivatti ivatti
vatti
atti tti ti
Weiner (1973),
algorithm of the year
McCreight (1976)
i
16
17
Trie(abaab)
Trie(a)
a b a
Trie(ab)
a b b
Trie(aba)
a b a
abaa baa aa a
18
Trie(abaab)
Trie(abaa)
a b a a a b b a b a a a b a b
19
Trie(abaab)
Trie(abaa)
a b a a a b b a b a a a b a b
20
Trie(abaab)
Trie(abaa)
a b a a a b b a b a a a b a b
21
Trie(abaab)
a b a a a b b a b a a a b a b
Trie(abaab)
b a a
a a
b a
b
b
22
From here on the ai-arc exists already => stop updating here
After
ai
ai
ai
ai
ai
24
4.
5. 6. 7. 8. 9.
v := vi-1; v := 0
while node v has no outgoing arc for ti do begin Create a new node v and an arc son(v,ti) = v if v 0 then slink(v) := v v := slink(v); v := v end for the node v such that v= son(v,ti) do if v = v then slink(v) := root else slink(v) := v
25
26
27
Implicit node
v (v, )
28
slink(v)
w
29
Big picture
suffix link path traversed: total work O(n)
30
son(v,ti) := v; v := v
if v 0 then slink(v) := v v := v; v := slink(v); (v, ) := Canonize(v, ) if v 0 then slink(v) := v (v, ) := Canonize(v, ti) /* (v, ) = start node of the next round */
32
1.
Suffix tree
2. Suffix array
3. Some applications 4. Finding motifs
34
11 7
atti
attivatti
hattivatti i ivatti ti tivatti tti
2
1 10 5 9 4 8
vatti
atti tti ti i
ttivatti
vatti
3
6
Suffix array
suffix array SA(T) = an array giving the lexicographic order of the suffixes of T space requirement: 5|T| practitioners like suffix arrays (simplicity, space efficiency) theoreticians like suffix trees (explicit structure)
36
11 7
atti
attivatti
hattivatti i ivatti ti tivatti tti
2
1 10 5 9 4 8
att
binary search
vatti
atti tti ti i
ttivatti
vatti
3
6
37
38
Krkkinen-Sanders algorithm
1. Construct the suffix array of the suffixes starting at positions i mod 3 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
2. Construct the suffix array of the remaining suffixes using the result of the first step.
3. Merge the two suffix arrays into one.
39
Notation
string T = T[0,n) = t0t1 tn-1 suffix Si = T[i,0) = titi+1 tn-1 for C \subset [0,n]: SC = {Si | i in C} suffix array SA[0,n] of T is a permutation of [0,n] satisfying SSA[0] < SSA[1] < < SSA[n]
40
Running example
0 1 2 3 4 5 6 7 8 9 10 11
T[0,n) = y a b b a d a b b a d o 0 0
SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)
41
Step 1 (cont.)
once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0 Example: R = [abb][ada][bba][do0][bba][dab][bad][o00] R = (1,2,4,6,4,5,3,7) SAR = (8,0,1,6,4,2,5,3,7) rank(Si) - 1 4 - 2 6 - 5 3 7 8 0 0
44
Step 3: Merge
merge the two sorted sets of suffixes using a standard comparison-based merging: to compare Si SC with Sj SB0, distinguish two cases:
i B1: Si Sj (ti,rank(Si+1)) (tj,rank(Sj+1)) i B2: Si Sj (ti,ti+1,rank(Si+2)) (tj,tj+1,rank(Sj+2)) note that the ranks are defined in all cases! S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7)
46
47
Implementation
about 50 lines of C++ code available e.g. via Juha Krkkinens home page
48
LCP table
Longest Common Prefix of successive elements of suffix array: LCP[i] = length of the longest common prefix of suffixes SSA[i] and SSA[i+1] build inverse array SA-1 from SA in linear time then LCP table from SA-1 in linear time (Kasai et al, CPM2001)
49
50
1.
Suffix tree
2. Suffix array
3. Some applications 4. Finding motifs
51
53
vatti
4
ti atti i
2 2
vatti
hattivatti
2 2
ti
vatti vatti tti tivatti
ivatti
vatti
atti hattivatti
attivatti
ttivatti
54
55
Longest repeat?
Occurrences at: 28395980, 28401554r Length: 2559
ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaat gctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaac atgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcata gtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatac gtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatca ccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgcca ttctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttga gaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtag gttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggctttt gttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgta agtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaag ggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcaga tagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatag tttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaatt ctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgt gtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattct ctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagacttt gctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctcttttcc taattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgcca gttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttatt gagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttat tgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattg aggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagtta 56 gg
Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggat ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcc caagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagt agagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccg cccgcctcggcctcccaaagtgctgggattacaggcgt
Length: 277
Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125 In the reversed complement at: 17858493, 41463059, 42431718, 42580925
57
58
59
60
B: TUKHOLMA minimum number of mutation steps: a -> b a -> -> b dID(A,B) = 5 dLevenshtein(A,B)= 4 mutation costs => probabilistic modeling evaluation by dynamic programming => alignment
61
di,j = min(if ai=bj then di-1,j-1 else , di-1,j + 1, di,j-1 + 1) = distance between i-prefix of A and j-prefix of B (substitution excluded) B
mxn table d
bj
Dynamic programming
di-1,j-1
di-1,j
+1
+1
A
ai di,j-1
di,j
dm,n
62
dID(A,B)
Search problem
find approximate occurrences of pattern P in text T: substrings P of T such that d(P,P) small dyn progr with small modification: O(mn) lots of (practical) improvement tricks
T P
64
65
1.
Suffix tree
2. Suffix array
3. Some applications 4. Finding motifs
66
Gapped motifs
a##bc
67
Gapped motifs
68
Gapped motifs of T
gapped pattern: P in (A U {#})* gap symbol # matches any symbol in A aa##bb#b L(P) = occurrence locations of P in T P is called a motif of T if |L(P)| > 1 and a motif with quorum q if |L(P)| q. Problem: find occurrence count |L(P)| for all gapped motifs P of T anban has exponentially many motifs!
69
Maximal motifs
motif P is the maximal version of motif P if P has the largest possible number of non-# symbols among motifs that have the same set of occurrence locations as P every motif has unique maximal version
71
Tree(T) Y X
73
O(n) O(n)
O(n)
75
Conclusion
suffix structures provide very efficient algorithmic tools for finding and learning potentially interesting patterns in strings
77
General case
Q1: Given q and W, has T a motif with at least W non-gap symbols and at least q occurrences? In k-block case, is O(n2k-1) (or even better) time possible? related work: H.Arimura&al, A. Apostolico &al, M-F. Sagot, L. Parida, N. Pisanti,
78