
Supplementary Information for Alignment-free detection of local similarity among viral and bacterial genomes

Mirjana Domazet-Lošo (1,2) and Bernhard Haubold (2)

(1) University of Zagreb, Faculty of Electrical Engineering and Computing, Department of Applied Computing, Zagreb, Croatia
(2) Max-Planck-Institute for Evolutionary Biology, Department of Evolutionary Genetics, 24306 Plön, Germany

March 11, 2011

Algorithm
In the main paper we described a simplified version of our algorithm for finding the most closely related subject sequence at every position in a query. In this Section we describe the method actually implemented. We assume familiarity with the terminology introduced in the Methods Section.
Node  unresolvedTerm  Interval tree: root; left child; right child
v1    {2}             ([2, 3], {S1, S3}, 2); null; null
v2    {4}             ([2, 3], {S1, S3}, 2); null; ([4, 5], {S2}, 2)
v3    {3}             ([2, 2], {S1, S3}, 2); null; ([3, 5], {S2}, 3)
v4    ∅               ([2, 2], {S1, S3}, 2); null; ([3, 5], {S2}, 3)
v5    {1}             ([1, 2], {S3}, 3); null; ([3, 5], {S2}, 3)
v6    {5}             ([1, 2], {S3}, 3); null; ([3, 5], {S2}, 3)

Figure S1: State of the interval tree I after the function formIntervalNode has been executed for each branch node during postorder traversal of the suffix tree shown in Figure 1B.

To discover the desired intervals of closest relatedness we construct an interval tree, I, in which each node z comprises the following six fields:
1. lb: the left border
2. rb: the right border
3. S: the set of subject sequences that induce the longest shustring: z.S = {S_i ∈ S | h_{i,lb} = H}, where S on the right-hand side denotes the set of all subject sequences
4. H: the length of the longest shustring, that is, H = max{h_{i,lb} | S_i ∈ S}
5. left: pointer to the left child of z
6. right: pointer to the right child of z
We say that z = ([lb, rb], S, H) covers the substring Q[z.lb, z.rb]. Further, we construct I such that for each of its nodes z the following two rules hold:

1. for every node x in the left subtree of z, x.rb < z.lb;
2. for every node x in the right subtree of z, z.rb < x.lb.

Let T be the generalized suffix tree holding the forward and reverse strands of all query and subject sequences. Each terminal node w of T comprises a list of suffix identifiers, which contain two fields:
1. seqId: the sequence referred to by w;
2. pos: the starting position of the suffix in that sequence.
We refer to a terminal node that corresponds to a suffix of a query sequence as a query terminal node, and to a terminal node that refers to a subject sequence as a subject terminal node.

Each branch node v of T comprises five fields:
1. subjectId: the set of subject identifiers referring to terminal nodes in the subtree rooted on v;
2. branchChildren: the set of branch nodes that are children of v;
3. termChildren: the set of terminal nodes that are children of v;
4. sd: string depth, that is, the length of the concatenated edge labels along the path from the root to v;
5. unresolvedTerm: the set of query terminal nodes in the subtree rooted on v for which H remains to be determined.

Algorithm S1 starts by calling the function traverse on the roots of T and I. During the subsequent traversal of T, each branch node v of T is visited once and v.subjectId and v.unresolvedTerm are determined. If v.subjectId is not empty (lines 7-14), then for each query terminal node of v and for each unresolved query terminal node in v.unresolvedTerm a new interval node new is created and added to I by calling addIntervalNode. After its addition, the left border of new will be greater than the right borders of all the nodes in its left subtree. Ultimately, its right border should have the corresponding property of being less than the left border of all nodes in its right subtree.
However, if new extends an existing interval z to the left, and the subtree rooted on z.left is not empty (lines 27 and 28), the right borders of the interval nodes in that subtree are updated after the suffix tree traversal. This is done by calling updateItree, which takes O(L) time. In contrast, if the right borders were adjusted after each call to addIntervalNode, border adjustment would take O(L log L) time.

Algorithm S1 is illustrated in Figure S1, where the construction of the interval tree for Q from the generalized suffix tree of Q and three subject sequences (Figure 1) is shown. During postorder traversal of T, branch node v1 is encountered first. It has two subject terminal nodes in its subtree, (S1, 2) and (S3, 2), and hence v1.subjectId = {S1, S3}. In addition, it has the query terminal node (Q, 2) in its subtree and we add the corresponding interval node, ([2, 3], {S1, S3}, 2), to I. Next, v2 is encountered and we add the interval node ([4, 5], {S2}, 2) to I. When we reach v3, the corresponding interval node, ([3, 5], {S2}, 3), overlaps the root node ([2, 3], {S1, S3}, 2), which is truncated to the left, resulting in ([2, 2], ...). It also overlaps node ([4, 5], {S2}, 2), which is extended to the left and becomes ([3, 5], ...). At node v4 the set of unresolved query terminal nodes, v4.unresolvedTerm, is empty and hence I remains unchanged. At v5 the interval node ([1, 3], {S3}, 3) is added, leading to the transformation of the root interval node to ([1, 2], {S3}, 3). Finally, at the root of the suffix tree, v6, the interval node corresponding to (Q, 5) is discarded, since it is a subinterval of the existing interval [3, 5].

Upon completion of the traversal of the suffix tree, I consists of two interval nodes: ([1, 2], {S3}, 3) and ([3, 5], {S2}, 3). Traversal of I produces the ordered interval list that would next be subjected to the sliding window analysis.
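As a rough illustration of the six node fields and the two ordering rules, the following Python sketch builds the final state of I from the worked example by hand and reads off the ordered interval list by in-order traversal. The names IntervalNode, in_order and rules_hold are illustrative assumptions and are not taken from the implementation described here.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class IntervalNode:
    """One node z of the interval tree I, with the six fields from the text."""
    lb: int                       # left border
    rb: int                       # right border
    S: frozenset                  # subject sequences inducing the longest shustring
    H: int                        # length of that shustring
    left: Optional["IntervalNode"] = None
    right: Optional["IntervalNode"] = None


def in_order(z: Optional[IntervalNode]) -> Iterator[IntervalNode]:
    """Yield the nodes of the tree rooted on z from left to right."""
    if z is not None:
        yield from in_order(z.left)
        yield z
        yield from in_order(z.right)


def rules_hold(z: Optional[IntervalNode]) -> bool:
    """Check the two ordering rules for every node of the tree rooted on z."""
    if z is None:
        return True
    left_ok = all(x.rb < z.lb for x in in_order(z.left))
    right_ok = all(z.rb < x.lb for x in in_order(z.right))
    return left_ok and right_ok and rules_hold(z.left) and rules_hold(z.right)


# Final state of I in the worked example (last row of Figure S1).
root = IntervalNode(1, 2, frozenset({"S3"}), 3)
root.right = IntervalNode(3, 5, frozenset({"S2"}), 3)

intervals = [(z.lb, z.rb, sorted(z.S), z.H) for z in in_order(root)]
print(intervals)  # [(1, 2, ['S3'], 3), (3, 5, ['S2'], 3)]
```

Because the two rules keep the intervals in every left subtree strictly to the left of their ancestor, and those in every right subtree strictly to the right, the in-order traversal returns the intervals sorted by position in Q without any further sorting step.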

Algorithm S1 Construct tree of closest neighbor intervals for query


Require: T {suffix tree of DNA sequences Q, S1, S2, ..., Sn}
Ensure: I {interval tree}
 1: root(I) ← null
 2: traverse(root(T), root(I))
 3: updateItree(root(I), -1)
 4: function traverse(v, z)
 5:   for all w ∈ v.branchChildren do
 6:     traverse(w, z)
 7:   if v.subjectId ≠ ∅ then
 8:     for all w ∈ v.termChildren do
 9:       if w is a query terminal node then
10:         new ← ([w.pos, w.pos + w.sd], {v.subjectId}, v.sd + 1)
11:         addIntervalNode(z, new)
12:     for all w ∈ v.unresolvedTerm do
13:       new ← ([w.pos, w.pos + w.sd], {v.subjectId}, v.sd + 1)
14:       addIntervalNode(z, new)
15: end function
16: function addIntervalNode(z, new)
17:   if z = null then
18:     z ← new
19:   else if new.lb < z.lb then
20:     {left subtree}
21:     if new.H = z.H + z.lb - new.lb then
22:       {new is superinterval of z → extend z to the left}
23:       z.lb ← new.lb
24:       z.rb ← min{z.rb, new.rb}
25:       z.H ← new.H
26:       z.S ← new.S
27:       if z.left ≠ null then
28:         z.left.rb ← min{z.left.rb, z.lb - 1}
29:     else
30:       {add new to the left subtree of z}
31:       new.rb ← min{z.lb - 1, new.rb}
32:       if z.left = null then
33:         z.left ← new
34:       else
35:         z.left ← addIntervalNode(z.left, new)
36:   else if new.lb > z.lb then
37:     {right subtree}
38:     if z.H ≠ new.H + new.lb - z.lb then
39:       {z is not superinterval of new}
40:       z.rb ← min{z.rb, new.lb - 1}
41:       if z.right = null then
42:         z.right ← new
43:       else
44:         z.right ← addIntervalNode(z.right, new)
45: end function
46: function updateItree(z, maxRB)
47:   if z ≠ null then
48:     updateItree(z.left, z.lb - 1)
49:     if maxRB ≠ -1 then
50:       z.rb ← min{z.rb, maxRB}
51:     updateItree(z.right, maxRB)
52: end function
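The functions addIntervalNode and updateItree can be sketched in Python as follows, and replayed on the interval nodes of the worked example. This is a minimal sketch under several assumptions, not the implementation itself: the function returns the subtree root instead of assigning to z in place, the sentinel -1 for maxRB is named NO_BOUND, and the candidate for v6 is taken to be ([5, 5], {S1, S2, S3}, 1), which the text does not state explicitly. The names Node, add_interval_node, update_itree and flatten are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

NO_BOUND = -1  # assumed sentinel: maxRB value of the initial updateItree call


@dataclass
class Node:
    lb: int
    rb: int
    S: frozenset
    H: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def add_interval_node(z: Optional[Node], new: Node) -> Node:
    """Insert `new` into the subtree rooted on z and return the subtree root."""
    if z is None:
        return new
    if new.lb < z.lb:                                # left subtree
        if new.H == z.H + z.lb - new.lb:             # new is a superinterval of z
            z.lb, z.rb = new.lb, min(z.rb, new.rb)   # extend z to the left
            z.H, z.S = new.H, new.S
            if z.left is not None:
                z.left.rb = min(z.left.rb, z.lb - 1)
        else:                                        # add new to the left of z
            new.rb = min(z.lb - 1, new.rb)
            z.left = add_interval_node(z.left, new)  # covers the null case too
    elif new.lb > z.lb:                              # right subtree
        if z.H != new.H + new.lb - z.lb:             # z is not a superinterval of new
            z.rb = min(z.rb, new.lb - 1)
            z.right = add_interval_node(z.right, new)
    return z


def update_itree(z: Optional[Node], max_rb: int = NO_BOUND) -> None:
    """Clip right borders of nodes lying in the left subtree of an ancestor."""
    if z is not None:
        update_itree(z.left, z.lb - 1)
        if max_rb != NO_BOUND:
            z.rb = min(z.rb, max_rb)
        update_itree(z.right, max_rb)


def flatten(z: Optional[Node]) -> list:
    """In-order list of (lb, rb, sorted S, H) tuples."""
    if z is None:
        return []
    return flatten(z.left) + [(z.lb, z.rb, sorted(z.S), z.H)] + flatten(z.right)


# Interval nodes generated at v1, v2, v3, v5 and v6 in the worked example.
# The last entry (for v6, the root of T) is an assumption, see the lead-in.
candidates = [
    (2, 3, {"S1", "S3"}, 2),
    (4, 5, {"S2"}, 2),
    (3, 5, {"S2"}, 3),
    (1, 3, {"S3"}, 3),
    (5, 5, {"S1", "S2", "S3"}, 1),
]

root = None
for lb, rb, subjects, h in candidates:
    root = add_interval_node(root, Node(lb, rb, frozenset(subjects), h))
update_itree(root)

print(flatten(root))  # [(1, 2, ['S3'], 3), (3, 5, ['S2'], 3)]
```

Under these assumptions the replay ends in the two interval nodes of the last row of Figure S1. The candidate for v6 is dropped because neither branch of the right-subtree case applies once the existing node ([3, 5], {S2}, 3) is found to be its superinterval.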
