Faster Tree Pattern Matching

MOSHE DUBINER
Tel-Aviv University, Tel-Aviv, Israel

ZVI GALIL
Tel-Aviv University, Tel-Aviv, Israel, and Columbia University, New York, New York

AND

EDITH MAGEN
Tel-Aviv University, Tel-Aviv, Israel
Abstract. Recently, R. Kosaraju gave an $O(nm^{0.75}\,\mathrm{polylog}(m))$-step algorithm for tree pattern matching. We improve this result by designing a simple $O(n\sqrt{m}\,\mathrm{polylog}(m))$ algorithm.

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems: computations on discrete structures.

General Terms: Algorithms, Theory.

Additional Key Words and Phrases: convolution, don't-care, pattern matching, period, string matching, tree.
1. Introduction

For brevity, we shall ignore the $\log(m)$ factors in the time complexity, using the notation $\tilde{O}(f(n, m)) = O(f(n, m)\,\mathrm{polylog}(m))$. We consider ordered, labeled trees, except that roots are unlabeled. This is equivalent to edge labeling. (Having unlabeled roots is for technical convenience only.) A pattern tree $P$ matches a target tree $T$ at node $v$ if there exists a one-to-one map from the nodes of $P$ into the nodes of $T$ such that
(1) the root of $P$ maps to $v$,
(2) if $x$ maps to $y$ and $x$ is not the root of $P$, then $x$ and $y$ have the same label, and
(3) if $x$ maps to $y$ and $x$ is not a leaf, then the $i$th child of $x$ maps to the $i$th child of $y$. (In particular, the degree of $y$ is no less than the degree of $x$.)
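The definition above can be turned directly into a naive checker. The following is a minimal sketch, not the algorithm of this paper. It assumes a hypothetical representation in which a tree is a nested dictionary mapping each child's label to that child's subtree, which relies on the assumption (made later in the text) that the children of a node carry distinct labels encoding their order. The names `matches_at` and `all_matches` are illustrative only.

```python
def matches_at(pattern, target):
    """Return True if `pattern` matches `target` with both roots aligned."""
    for label, p_child in pattern.items():
        t_child = target.get(label)
        # Every labeled child of the pattern node needs an equally labeled
        # child in the target, and the subtrees must match recursively.
        if t_child is None or not matches_at(p_child, t_child):
            return False
    return True


def all_matches(pattern, target, path=()):
    """Yield the label path (from the target root) of every matching node."""
    if matches_at(pattern, target):
        yield path
    for label, child in target.items():
        yield from all_matches(pattern, child, path + (label,))


# Small illustrative trees (not taken from the paper).
pattern = {"0": {"1": {}}, "1": {}}
target = {
    "0": {"0": {"1": {}}, "1": {"0": {}}},
    "1": {"0": {"1": {}}, "1": {}},
}
print(list(all_matches(pattern, target)))
```

Each target node is identified here by the tuple of labels on the path leading to it; the empty tuple stands for the root.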
Z. Galil was partially supported by National Science Foundation grant CCR 90-14625 and CISE Institutional Infrastructure grant CDA 90-24735. Authors' addresses: M. Dubiner, Department of Applied Mathematics, Tel-Aviv University, Tel-Aviv, Israel; Z. Galil, Department of Computer Science, 450 Computer Science Building, Columbia University, New York, NY 10027; E. Magen, Department of Computer Science, Tel-Aviv University, Tel-Aviv, Israel. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1994 ACM 0004-5411/94/0300-0205 $03.50
Journal of the Association for Computing Machinery, Vol. 41, No. 2, March 1994, pp. 205-213.
Given trees $P$ and $T$ of sizes $m$ and $n$ respectively ($m \le n$), we want to compute the set of nodes of $T$ at which $P$ matches. We assume that the order information of the children of a node is absorbed into the children's node labels. Consequently, the labels of the children of any node are distinct. For convenience, we assume that the alphabet is $\{0, 1\}$. If this is not the case, we encode every symbol with 0s and 1s. Consequently, our trees will be larger by a factor of at most $\log(m)$, which is absorbed in the $\tilde{O}$ notation.

Tree pattern matching is an important problem with many applications (see Hoffmann and O'Donnell [1982]). The obvious algorithm takes $O(mn)$ time. A classic open problem was whether this bound could be improved. Recently, Kosaraju [1989] broke the $O(nm)$ barrier for this problem with an $\tilde{O}(nm^{0.75})$ algorithm. We improve this result, giving an $\tilde{O}(n\sqrt{m})$ algorithm.

Kosaraju introduced three new techniques: (1) a suffix tree of a tree; (2) convolution of a tree and a string; (3) partitioning of trees into chains and anti-chains. Even though our algorithm was inspired by Kosaraju's algorithm, we do not use any of his techniques. A different version of our algorithm can use the suffix tree of trees. Instead, we use truncated suffix trees that can be constructed in an obvious way. The improvement is obtained by discovering and exploiting periodic strings appearing in $P$.

We make use of very simple properties of periods of strings. We denote by $|\alpha|$ the length of the string $\alpha$. A string $\alpha$ is a period of a string $\beta$ if $\beta$ is a prefix of $\alpha^k$ for some $k > 0$. The next two facts are well known and the third follows from the second:

Fact 1.1. $\alpha$ is a period of $\beta$ iff $\beta = \alpha\gamma = \gamma\delta$.

Fact 1.2. $\beta$ has a period of length $k$ iff $\beta_i = \beta_{i+k}$ for $1 \le i \le |\beta| - k$.

Fact 1.3. If $\alpha\beta$ and $\beta\gamma$ have a period of length $k$ and $|\beta| \ge k$, then $\alpha\beta\gamma$ has a period of length $k$.

Throughout the paper, there are details of the algorithm that could be performed more efficiently than described. However, when the total time is not affected, we prefer simplicity over efficiency.

2. The Tree Pattern Matching Algorithm

For every node $w$ of $P$, denote by $\sigma_w$ the labeled path from the root of $P$ to $w$. Let $w$ be a leaf of $P$. We say that $w$ matches $T$ at $v$ if $\sigma_w$, considered as a tree, matches $T$ at $v$. Using our assumption that the labels encode the order information, we see that $P$ matches $T$ at $v$ iff every leaf $w$ of $P$ matches $T$ at $v$. Therefore, we may partition the set of leaves of $P$ into subsets (as we shall often do), check where each subset matches, and in $O(n)$ time conclude where $P$ matches. The following lemma observes that the time of the naive algorithm (in which we try to match $P$ at every node of $T$ in a straightforward way) is actually better than $O(mn)$ in many cases.
LEMMA 2.1. If the height of $P$ is $h$, then the naive algorithm takes time $O(nh)$.

PROOF. A node $v$ in $T$ is compared with a node $u$ in $P$ at level $r$ only if the path consisting of the $r$ first ancestors of $v$ matches the path of the ancestors of $u$ in $P$. Since different children have different labels, the path of the $r$ ancestors of $v$ cannot match a path to another node in $P$. Thus, every node of $T$ is compared with at most one node of $P$ at every depth-level. □

The following is a well-known fact. The proof (following Knuth et al. [1977]) is given for completeness.
LEMMA 2.2. A string, $s$, of length $m$ can be matched with a tree, $T$, of size $n$ in time $O(m + n)$.

PROOF. Define a function, $f$, as follows: For every $1 \le i \le m$, let

$f(i)$ = the largest $j$ for which $s_1 \cdots s_j$ is a proper suffix of $s_1 \cdots s_i$.
Given $f$, matching $s$ and $T$ in linear time is very simple: Traverse $T$ by DFS, keeping at each node, $v$, a pointer to its highest ancestor, $w$, such that the path from $w$ to $v$ is a prefix of $s$. Consider a node, $v$, and suppose the pointer of its parent points to node $w$. Initially, set the pointer of $v$ to $w$, and let $i$ be the length of the path from the pointer of $v$ to $v$. If this path is a prefix of $s$ (this is checked just by comparing the label of $v$), continue with the DFS. Otherwise, reset the pointer of $v$ to its $f(i-1)$th ancestor (by starting at $w$ and following a prefix of $s$ of length $i - 1 - f(i-1)$) and compare again, until a (possibly empty) prefix of $s$ is found. Since the pointers can only advance with the DFS, the total time spent is linear. The function $f$ itself (the failure function of Knuth et al. [1977]) is constructed recursively in a similar way, only now the role of $T$ is played by the string $s_2 \cdots s_m$. □

Remark. One can prove in a similar way that a set of strings can be matched in linear time, provided there is no string in the set that is a suffix of another string in the set.

For every node $v$ in $P$, denote by $\sigma_{v,k}$ the string of the last $k$ symbols of $\sigma_v$, if they exist; $\sigma_{v,k} = \sigma_v$ in case $|\sigma_v| < k$. The $k$-truncated suffix tree of the tree $P$, $\Sigma_{P,k}$, is defined to be the trie of the set $\{\sigma_{v,k}^R \mid v \text{ is a node of } P\}$, where $R$ stands for reversal. This means that for any node $v$ in $P$ there is a corresponding node $\hat{v}$ in $\Sigma_{P,k}$, such that the path from the root of $\Sigma_{P,k}$ to $\hat{v}$ is the reversal of the path $\sigma_{v,k}$. Different nodes of $P$ may have equal corresponding nodes in $\Sigma_{P,k}$. $\Sigma_{P,k}$ may have additional nodes, but each of its leaves corresponds to a node, not necessarily a leaf, of $P$.
Example. Figure 1 shows a tree P (of height 4) with its 4-truncated suffix tree and its 2-truncated suffix tree. The label of each node is marked on the edge leading to that node.
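Returning briefly to Lemma 2.2, the failure-function traversal described in its proof can be sketched as follows. This is an illustrative sketch under stated assumptions: trees are nested dictionaries mapping child labels to subtrees (a hypothetical representation), the string `s` is nonempty, and the function names are invented here. The sketch tracks the length of the matched prefix instead of an explicit ancestor pointer, and it makes no attempt to establish the running-time bound of the lemma.

```python
def failure_function(s):
    """f[i] = length of the longest proper suffix of s[:i] that is a prefix of s."""
    f = [0] * (len(s) + 1)
    j = 0
    for i in range(1, len(s)):
        while j > 0 and s[i] != s[j]:
            j = f[j]
        if s[i] == s[j]:
            j += 1
        f[i + 1] = j
    return f


def match_string_in_tree(s, tree):
    """Return the label paths of the nodes at which a downward path spells s."""
    f = failure_function(s)
    starts = []
    # Each stack entry: (subtree, label path from the root, matched prefix length).
    stack = [(tree, (), 0)]
    while stack:
        node, path, state = stack.pop()
        for label, child in node.items():
            j = state
            if j == len(s):          # a full match ended at the parent
                j = f[j]
            while j > 0 and label != s[j]:
                j = f[j]             # fall back to a shorter matched prefix
            if label == s[j]:
                j += 1
            child_path = path + (label,)
            if j == len(s):
                # The occurrence starts len(s) edges above this child.
                starts.append(child_path[:-len(s)])
            stack.append((child, child_path, j))
    return starts
```

For instance, with the string "01" and a target tree in which the root and both of its children each have a downward path labeled 0 then 1, the function reports those three nodes (in some order).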
LEMMA 2.3. For a tree $P$ of size $m$, $\Sigma_{P,k}$ can be computed in $O(mk)$ steps.

PROOF. Simply insert the strings $\sigma_{v,k}^R$ into the trie one at a time. □

The following fact follows immediately from the definition of $\Sigma_{P,k}$.

Fact 2.4. $\hat{u}$ is an ancestor of $\hat{v}$ in $\Sigma_{P,k}$ iff $\sigma_{u,k}$ is a suffix of $\sigma_{v,k}$. In particular, if $\hat{u}$ and $\hat{v}$ are leaves of $\Sigma_{P,k}$, then $\sigma_u$ and $\sigma_v$ cannot be a suffix of one another.

Set $l = \lceil \sqrt{m} \rceil$.
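The construction in the proof of Lemma 2.3 can be sketched as below. This is a minimal illustration, assuming that trees are nested dictionaries mapping single-character labels to subtrees (a hypothetical representation) and that $k \ge 1$. The helper names are invented here, and the string copying in the traversal is not tuned to meet the $O(mk)$ bound exactly.

```python
def truncated_suffix_tree(pattern, k):
    """Trie of the reversed last-k-label strings of all root-to-node paths."""
    assert k >= 1
    trie = {}
    stack = [(pattern, "")]          # (subtree, label string from the root)
    while stack:
        node, path = stack.pop()
        cur = trie
        # Insert the reversal of the last k labels (the whole path if shorter).
        for label in reversed(path[-k:]):
            cur = cur.setdefault(label, {})
        for label, child in node.items():
            stack.append((child, path + label))
    return trie


def count_leaves(trie):
    """Number of leaves of a trie with at least one edge."""
    if not trie:
        return 1
    return sum(count_leaves(child) for child in trie.values())


# Illustrative pattern (not the tree of Figure 1).
pattern = {"0": {"1": {"0": {}}}, "1": {"0": {}}}
sigma = truncated_suffix_tree(pattern, 2)
print(sigma, count_leaves(sigma))
```

The leaf count is what the algorithm inspects next in order to choose between its two cases.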
[FIGURE 1. A tree $P$ with its 4-truncated suffix tree and its 2-truncated suffix tree; figure not reproduced.]
The first step in our algorithm computes the $3l$-truncated suffix tree of $P$, $\Sigma = \Sigma_{P,3l}$, in $O(ml) = O(m\sqrt{m})$ steps.

Case 1. $\Sigma$ has at least $l$ leaves.

Choose a set, $S$, of $l$ nodes in $P$, which correspond to leaves in $\Sigma$. For each node $v$ of $S$, mark the nodes of $T$ at which the string $\sigma_v$ matches in time $O(n)$ using Lemma 2.2. By Fact 2.4, any node of $T$ can be the end of at most one path to a node of $S$. Therefore, at most $n$ marks are made, and at most $n/l$ nodes of $T$ can be marked $l$ times (i.e., for all members of $S$) and be considered as possible roots of $P$ (i.e., nodes at which $P$ could match $T$). At each of the possible roots, we check in $O(m)$ time whether $P$ matches or not. The total time spent in this case is therefore $O(ln + (n/l)m) = O(n\sqrt{m})$.

Remark. By the remark of Lemma 2.2, we see that marking the possible roots can be done in linear time. However, so far we do not know how to use this to improve the total time of the algorithm.

Case 2. $\Sigma$ has at most $l$ leaves.

Let $P_0$ be the subtree of $P$ composed of the paths to the leaves of $P$ whose depths are at most $3l$. By Lemma 2.1, we can find all the nodes of $T$ at which these high leaves match in time $O(nl) = O(n\sqrt{m})$ using the naive algorithm for $P_0$ and $T$. So, without loss of generality, we may assume that the depth of every leaf of $P$ is at least $3l$. In this case, if $v$ is a leaf of $P$, then $|\sigma_{v,3l}| = 3l$ and $\hat{v}$ is a leaf of $\Sigma$.
LEMMA 2.5. Given $l + 1$ nodes in $P$: $v_0, \ldots, v_l$, there are $i \ne j$ such that $\sigma_{v_i,3l}$ is a suffix of $\sigma_{v_j,3l}$.

PROOF. Since there are at most $l$ leaves in $\Sigma$, there are $i \ne j$ for which $\hat{v}_i$ is equal to or an ancestor of $\hat{v}_j$ in $\Sigma$. By Fact 2.4, $\sigma_{v_i,3l}$ is a suffix of $\sigma_{v_j,3l}$. □

LEMMA 2.6. For each leaf $v$ of $P$, the string of all but the at most $l$ last nodes of $\sigma_v$ has a period of length at most $l$.

PROOF. Consider any node in $P$, $v = v_0$, with depth at least $l$, and let $v_i$, $i \le l$, be its $i$th ancestor. By Lemma 2.5, there are $0 \le i < j \le l$, such that $\sigma_{v_j,3l}$ is a suffix of $\sigma_{v_i,3l}$. (Note that $|\sigma_{v_j,3l}| \le |\sigma_{v_i,3l}|$.) Let $u$ be the ancestor of $v_i$ and $v_j$ such that $\sigma_{v_i,3l}$ labels the path from $u$ to $v_i$, and let $\tau$ be the label of the path from $u$ to $v_j$. $\tau$ is a prefix and a suffix of $\sigma_{v_i,3l}$ (since it is a suffix of $\sigma_{v_j,3l}$). By Fact 1.1, $\sigma_{v_i,3l}$ has a period of length $0 < j - i \le l$.

Now let $v$ be a leaf of $P$. By the consideration above, there is an $0 \le i < l$, such that for the $i$th ancestor of $v$, $v_i$, $\sigma_{v_i,3l}$ has a period of length at most $l$. However, this period must also be a period of all of $\sigma_{v_i}$. This is because (by the same consideration) there will be periods further up in that path, their lengths are at most $l$, and these periodic parts intersect in at least $l$ consecutive nodes, so by Fact 1.3 the periodicity must be the same. □

We conclude that for each leaf, $v$, of $P$, $\sigma_v$ has a period of length at most $l$, except for a tail of at most $l$ nodes. To each leaf $v$ of $P$, we attach the uniquely determined pair of its tail and minimal period, where the period is chosen in such a way that it ends exactly before the tail starts. (We distinguish among cyclic permutations of the period.) Formally, we associate the pair $(p, t)$ with $v$ if $\sigma_v = p_1 p^k t$, $|t| \le l$, $p_1$ a suffix of $p$, where $p_1 p^k$ is the longest periodic prefix of $\sigma_v$ with minimal period of length $|p| \le l$. (We must have $k \ge 2$.) How many pairs are there? The answer is given by the following lemma.

LEMMA 2.7. There are at most $l$ pairs of tail-period.

PROOF. If the leaves $v$ and $w$ have different pairs of tail-period, then in particular the paths $\sigma_{v,3l}$ and $\sigma_{w,3l}$ are not suffixes of each other, and by Fact 2.4, the corresponding leaves in $\Sigma$ cannot be equal. Since there are at most $l$ leaves in $\Sigma$, we conclude that there are at most $l$ pairs of tail-period. □

COROLLARY 2.8. All the tail-period pairs can be found in total time $\tilde{O}(ml)$.
PROOF. For every leaf of $P$, $v$, we can find its pair of tail-period in time $O(l)$ as follows: Let $s = \sigma_{v,3l}$, and denote by $s_1$ the first $l$ labels in $s$. Using a linear string matching algorithm (Lemma 2.2), match $s_1$ and $s$ in time $O(l)$. The position $i$ of the first nontrivial occurrence ($i > 1$) gives us the period of $v$ in some cyclic permutation. The exact period and tail can now be easily found in time $O(l)$ from $s$. In order to find the set of the at most $l$ pairs, we sort lexicographically the $O(m)$ pairs we found in time $\tilde{O}(ml)$. □
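To make the formal definition of the pair $(p, t)$ concrete, here is a brute-force sketch that reads the pair off a root-to-leaf label string. It is not the $O(l)$ procedure of the proof above: it simply tries every candidate period length up to `l`, and it assumes the input is nonempty and really has the structure guaranteed by Lemma 2.6. The function name `tail_period` is hypothetical.

```python
def tail_period(q, l):
    """Return (p, t): q = p1 + p*k + t, with p1 a suffix of p and |p| <= l.

    For each period length d <= l, find the longest prefix of q that has
    period d; keep the longest such prefix, preferring the smallest d on ties.
    """
    best_d, best_end = None, -1
    for d in range(1, min(l, len(q)) + 1):
        end = d
        while end < len(q) and q[end] == q[end - d]:
            end += 1                 # q[:end] still has period d
        if end > best_end:           # strict '>' keeps the smallest d on ties
            best_d, best_end = d, end
    p = q[best_end - best_d:best_end]   # the period copy ending before the tail
    t = q[best_end:]                    # whatever follows the periodic prefix
    return p, t


print(tail_period("01010101011", 3))
print(tail_period("1010101011", 3))
```

Both calls describe paths that end with the tail 1 after repetitions of 01; the second path starts in the middle of a period, which is the situation covered by the prefix $p_1$ in the definition.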
In the following, we show how to match a set of leaves with a given pair of tail-period in time $\tilde{O}(n)$. Hence, we will obtain an $\tilde{O}(ln) = \tilde{O}(n\sqrt{m})$ algorithm for matching $P$. Consider a fixed tail-period pair and denote the period by $p$. We call a path in a tree a maximal-periodic-path if
(1) the path has $p$ as a period (it may start at any place of $p$, but it ends at the end of a full period),
(2) the period is repeated at least twice in the path, and
(3) the path is maximal in the sense that it cannot be extended to a longer path with property (1).
LEMMA 2.9. If two maximal-periodic-paths intersect at a node $v$, then for one of the paths $v$ is among the first $|p|$ nodes.

PROOF. Otherwise, at least the first $|p|$ ancestors of $v$ are common to the two paths. Since both paths are periodic with the same period and maximal (and recall that different children must have different labels), they must be equal. □

COROLLARY 2.10. The total length of all maximal-periodic-paths in a tree $T$ of size $n$ is $O(n)$.

PROOF. For every maximal-periodic-path, we call its suffix obtained by deleting its prefix of length $|p|$ the main path. By Lemma 2.9, the main paths in $T$ are disjoint. Since every maximal-periodic-path has length at least $2|p|$, the total length is bounded above by twice the total length of the main paths, which is no more than $2n$. □

LEMMA 2.11. Finding all maximal-periodic-paths in a tree of size $n$ can be done in time $O(n)$.

PROOF. In the same manner as finding the occurrences of a string in a tree (Lemma 2.2), we can find in $T$ all sequences of as many (at least two) consecutive periods as possible in time $O(n)$. For every such sequence, we augment it as far as possible upwards. By Corollary 2.10, this takes $O(n)$ steps as well. □

Using Lemma 2.11 and Lemma 2.2, we find in time $O(n)$ all maximal-periodic-paths in $T$, as well as all the occurrences of the tail in $T$. For every maximal-periodic-path in $T$, we define the $\{0, 1\}$-sequence (starting with $i = 0$)

$$b_i = \begin{cases} 1 & \text{if after } i \text{ periods in the path the tail occurs,} \\ 0 & \text{otherwise.} \end{cases}$$
The $b$-sequence can be computed in time proportional to the length of the maximal-periodic-path. Hence, by Corollary 2.10, all the $b$-sequences can be computed in time $O(n)$. Note that the length of the $b$-sequence is at most the length of the maximal-periodic-path divided by $|p|$ plus 1.

Now we find in time $O(m)$ the maximal-periodic-paths in $P$ that start at the root of $P$, and from which tails lead on to leaves with the given tail and period. (Note the two distinctions from maximal-periodic-paths in $T$.) The occurrences of the tail in $P$ that start at an end of a period and end at a leaf are found as well. This is done simply by tracing in $P$ all paths from leaves with the given tail and period up to the root or to a previously visited node in a maximal-periodic-path. By Lemma 2.9, there are at most $|p|$ maximal-periodic-paths in $P$ that start at the root: at most one for every possible starting place in the period. For every such maximal-periodic-path in $P$, we define the $\{0, 1\}$-sequence

$$a_i = \begin{cases} 1 & \text{if after } i \text{ periods in the path the tail occurs and ends at a leaf,} \\ 0 & \text{otherwise.} \end{cases}$$

The $a$-sequence can be computed in time proportional to the length of the maximal-periodic-path. Hence, by Corollary 2.10 applied to $P$, all the $a$-sequences can be computed in time $O(m)$. Note that the length of the $a$-sequence is at most the length of the maximal-periodic-path divided by $|p|$ plus 1.

We show how to match the leaves of a fixed maximal-periodic-path of $P$ in time $\tilde{O}(n/|p|)$. Thus, matching the up to $|p|$ sets of leaves (corresponding to the up to $|p|$ maximal-periodic-paths in $P$) will be done in time $\tilde{O}(n)$ as promised. So, let us fix one such maximal-periodic-path in $P$ with the corresponding $a$-sequence. For our set of leaves to match at a node $v$ of $T$, there should be a maximal-periodic-path in $T$ that passes through $v$ such that
(1) in the path of $T$, $v$ is at the same place in the period as the root of $P$ is in the maximal-periodic-path of $P$, and
(2) whenever $a_i$ is 1, the corresponding $b_i$ of the path through $v$ should be 1 too.

This means that our problem amounts to solving a problem of string matching with don't-care for the $a$-sequence and any one of the $b$-sequences, with an alphabet of $\{0, 1\}$, where the 0 plays the role of the don't-care symbol in the $a$-sequence. Each such problem is solved via the convolution algorithm in time

$$\tilde{O}\!\left(\frac{\text{length of path in } T}{|p|}\right)$$

(see Fischer and Paterson [1974]). (We only perform the computation if the $a$-sequence is no longer than the $b$-sequence.) Finally, by Corollary 2.10, all these matching problems can be solved in total time $\tilde{O}(n/|p|)$. Every match between the $a$-sequence and a $b$-sequence is translated to at most one node in the maximal-periodic-path of $T$ at which our set of leaves matches, depending on the starting place in $p$ of the maximal-periodic-path of $P$, for which the $a$-sequence was built. (If the $a$-sequence occurs at position 0 of the $b$-sequence, and the corresponding maximal-periodic-path of $T$ starts at a later place in $p$ than the corresponding maximal-periodic-path of $P$, then the occurrence is not translated into a match of the set of leaves.) We summarize the two cases above as follows:
THEOREM 2.12. Matching a pattern tree $P$ of size $m$ and a target tree $T$ of size $n$ can be done in time $\tilde{O}(n\sqrt{m})$.
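As a check on this bound, the costs stated for the individual steps can be collected in one place. The tally below only restates figures given in the text, with $l = \lceil \sqrt{m} \rceil$ and $m \le n$; the grouping into three lines is ours.

```latex
\[
\begin{aligned}
\text{building } \Sigma_{P,3l}:\quad & O(ml) = O(m\sqrt{m}) \subseteq O(n\sqrt{m}),\\
\text{Case 1}:\quad & O\!\left(ln + \tfrac{n}{l}\,m\right) = O(n\sqrt{m}),\\
\text{Case 2}:\quad & O(nl) + \tilde{O}(ml) + l \cdot \tilde{O}(n) = \tilde{O}(n\sqrt{m}).
\end{aligned}
\]
```

In Case 2 the three terms are, in order, the naive algorithm on the shallow leaves, the computation of the tail-period pairs, and the matching of each of the at most $l$ pairs.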
Example. We take $l = \lceil \sqrt[3]{m} \rceil$ in order to have a small instructive example. Figure 2 shows a pattern tree, $P$, of size $m = 24$ ($l = 3$), with its $3l$-truncated suffix tree, $\Sigma$, and a target tree, $T$. Since $\Sigma$ has only 3 ($\le l$) leaves, this is Case 2 of the algorithm. Since all the leaves are of a depth that is at least 9 ($= 3l$), we do not use the naive algorithm. Three of the four leaves of $P$ have period 01 and tail 1; the remaining leaf has period 10 and a null tail.

[FIGURE 2. The pattern tree $P$, its $3l$-truncated suffix tree $\Sigma$, and the target tree $T$; figure not reproduced.]

The pair (01, 1). $P$ has two maximal-periodic-paths with period 01 that start at the root: one going left with an $a$-sequence 000011, and one going right (starting with the suffix 1 of the period) with an $a$-sequence 00001. $T$ has two maximal-periodic-paths with period 01. (They intersect only at a node that is the first node of one of them.) The two corresponding $b$-sequences are 1000111 and 000010. Since the $a$-sequence 000011 matches the $b$-sequence 1000111 at positions 0 and 1 and does not match the $b$-sequence 000010, we have two nodes in $T$ where the two leaves on the left path match. Since 00001 matches 1000111 at positions 0, 1, and 2 and also matches 000010 at position 0, we have four nodes in $T$ where the leaf on the right path matches.

The pair (10, null). $P$ has only one maximal-periodic-path with period 10 that starts at the root and from which there is a (null) tail leading to a leaf. The corresponding $a$-sequence for this path is 000001. $T$ has two maximal-periodic-paths with period 10. The corresponding $b$-sequences are 1111111 and 111111. Matching the $a$-sequence against the $b$-sequences as before gives three nodes of $T$ at which this leaf matches.

By traversing $T$ one final time, we find that $P$ matches $T$ at a single node, which is the only node of $T$ at which all four leaves of $P$ match.
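The sequence comparisons in this example can be reproduced with a direct sketch of the don't-care matching rule: a position matches when every 1 of the $a$-sequence is aligned with a 1 of the $b$-sequence. The code below counts violations position by position in quadratic time; it is meant only as an illustration and is not the convolution algorithm of Fischer and Paterson [1974], which evaluates the same sums for all positions at once. The function name `dont_care_matches` is hypothetical.

```python
def dont_care_matches(a, b):
    """Positions j at which every 1 in `a` faces a 1 in b[j:j+len(a)].

    `a` and `b` are strings over {'0', '1'}; a '0' in `a` is a don't-care.
    """
    positions = []
    for j in range(len(b) - len(a) + 1):   # empty range if a is longer than b
        # Sum of a_i * (1 - b_{i+j}): the number of 1s in `a` that face a 0.
        violations = sum(int(x) * (1 - int(y))
                         for x, y in zip(a, b[j:j + len(a)]))
        if violations == 0:
            positions.append(j)
    return positions


# The pair (01, 1) from the example.
print(dont_care_matches("000011", "1000111"))
print(dont_care_matches("00001", "1000111"))
print(dont_care_matches("00001", "000010"))
```

The sum computed for each position is a correlation of the two sequences, which is why a fast convolution can replace the inner loop.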

REFERENCES
FISCHER, M. J., AND PATERSON, M. S. 1974. String-matching and other products. In Proceedings of the SIAM-AMS Symposium on Complexity of Computation, R. M. Karp, ed. SIAM, New York, pp. 113-125.
HOFFMANN, C. M., AND O'DONNELL, M. J. 1982. Pattern matching in trees. JACM 29, 1 (Jan.), 68-95.
KNUTH, D. E., MORRIS, J. H., AND PRATT, V. R. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 323-350.
KOSARAJU, S. R. 1989. Efficient tree pattern matching. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science. IEEE, New York, pp. 178-183.
RECEIVED FEBRUARY 1991; REVISED OCTOBER 1992; ACCEPTED DECEMBER 1992
