An UpDown Directed Acyclic Graph Approach for Sequential Pattern Mining
Jinlin Chen, Member, IEEE (Computer Science Dept., Queens College, City Univ. of New York, Flushing, NY 11367; jchen@cs.qc.cuny.edu)
Abstract— Traditional pattern-growth based approaches for sequential pattern mining derive length-(k+1) patterns based on the projected databases of length-k patterns recursively. At each level of recursion, they uni-directionally grow the length of detected patterns by one along the suffix of detected patterns, which needs k levels of recursion to find a length-k pattern. In this paper a novel data structure, UpDown Directed Acyclic Graph (UDDAG), is invented for efficient sequential pattern mining. UDDAG allows bidirectional pattern growth along both ends of detected patterns. Thus a length-k pattern can be detected in log₂k + 1 levels of recursion at best, which results in fewer levels of recursion and faster pattern growth. When minSup is large such that the average pattern length is close to 1, UDDAG and PrefixSpan have similar performance because the problem degrades into a frequent item counting problem. However, UDDAG scales up much better: it often outperforms PrefixSpan by almost one order of magnitude in scalability tests. UDDAG is also considerably faster than Spade and LapinSpam. Except for extreme cases, UDDAG uses memory comparable to that of PrefixSpan and less memory than Spade and LapinSpam. Additionally, the special feature of UDDAG enables its extension toward applications involving searching in large spaces.

Index Terms— Data mining algorithm, directed acyclic graph, performance analysis, sequential pattern, transaction database.


1 INTRODUCTION
Sequential pattern mining is an important data mining problem which detects frequent subsequences in a sequence database. A major technique for sequential pattern mining is pattern growth. Traditional pattern-growth based approaches (e.g., PrefixSpan) recursively derive length-(k+1) patterns based on the projected databases of length-k patterns. At each level of recursion, the length of detected patterns is grown by 1, and patterns are grown uni-directionally along the suffix direction. Consequently, we need k levels of recursion to mine a length-k pattern, which is expensive due to the large number of recursive database projections.

In this paper a new approach based on the UpDown Directed Acyclic Graph (UDDAG) is proposed for fast pattern growth. UDDAG is a novel data structure which supports bidirectional pattern growth from both ends of detected patterns. With UDDAG, at level-i recursion we may grow the length of patterns by up to 2^(i-1). Thus a length-k pattern can be detected in log₂k + 1 levels of recursion at minimum, which results in a better scale-up property for UDDAG compared to PrefixSpan.

Our extensive experiments clearly demonstrate the strength of UDDAG and its bi-directional pattern growth strategy. When minSup is very large such that the average length of patterns is very small (close to 1), UDDAG and PrefixSpan have similar performance because in this case the problem degrades into a basic frequent item counting problem. However, UDDAG scales up much better compared to PrefixSpan: it often outperforms PrefixSpan by one order of magnitude in our scalability tests. UDDAG is also considerably faster than two other representative algorithms, Spade and LapinSpam. Except for some extreme cases, the memory usage of UDDAG is comparable to that of PrefixSpan, and UDDAG generally uses less memory than Spade and LapinSpam. UDDAG may also be extended to other areas where efficient searching in large search spaces is necessary.

The rest of the paper is organized as follows: Section 2 defines the problem and discusses related work. Section 3 presents the motivation of our approach. Section 4 defines UDDAG based pattern mining. Performance evaluation is presented in Section 5. Discussions on time and space complexity are presented in Section 6. Finally, we conclude the paper and discuss future work in Section 7.

2 PROBLEM STATEMENT AND RELATED WORK
2.1 Problem Statement


Let I = {i1, i2, …, in} be a set of items. An itemset is a subset of I, denoted as (x1, x2, …, xk), where xi ∈ I, i ∈ {1, …, k}. Without loss of generality, in this paper we use nonnegative integers to represent items and assume that the items in an itemset are sorted in ascending order. We omit the parentheses for an itemset with only one item. A sequence s is a list of itemsets, denoted as <s1 s2 … sm>, where si is an itemset, si ⊆ I, i ∈ {1, …, m}. The number of instances of itemsets in s is called the length of s. Given two sequences a = <a1 a2 … aj> and b = <b1 b2 … bk>, if j ≤ k and there exist integers 1 ≤ i1 < i2 < … < ij ≤ k such that a1 ⊆ bi1, a2 ⊆ bi2, …, aj ⊆ bij, then a is a subsequence of b and b is a super-sequence of a. In this case a is also said to be contained in b, denoted as a ⊑ b. A sequence database is a set of tuples <sid, s>, where sid is a sequence id and s is a sequence. A tuple <sid, s> is said to contain a sequence α if α ⊑ s.


The absolute support of a sequence α in a sequence database D is defined as SupD(α) = |{<sid, s> | (α ⊑ s) ∧ (<sid, s> ∈ D)}|, and the relative support of α is defined as SupD(α)/|D|. In this paper we will use absolute and relative supports interchangeably. Given a positive value minSup as the support threshold, a sequence α is called a sequential pattern in D if SupD(α) ≥ minSup. Given a sequence database D and the minimum support threshold, sequential pattern mining is to find the complete set of sequential patterns (denoted as P) in the database. (Note: in this paper we will always use D as a sequence database and P as the complete set of sequential patterns in D.)

TABLE 1
AN EXAMPLE SEQUENCE DATABASE

Seq. Id | Sequence
1 | <1 (1,2,3) (1,3) 4 (3,6)>
2 | <(1,4) 3 (2,3) (1,5)>
3 | <(5,6) (1,2) (4,6) 3 2>
4 | <5 7 (1,6) 3 2 3>

Example 1. Given D as shown in Table 1 and minSup = 2, the length of sequence 1 is 5. <1 (2,3) 1> is a subsequence of sequences 1 and 2. <(1,2) 3> is a pattern because it is contained in both sequences 1 and 3. Note that although <(1,3)> occurs twice in sequence 1, sequence 1 only contributes 1 to the support of <(1,3)>.
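To make these definitions concrete, the following is a minimal sketch (our own illustration, not code from the paper; all class and method names are ours) of the containment test and absolute support count. A greedy left-to-right match of each itemset of a against the itemsets of b is sufficient for the subsequence test.

import java.util.*;

// Illustrative sketch: sequences are lists of itemsets (sets of integer items).
public class SeqContains {
    // True if a is a subsequence of b: there exist positions i1 < i2 < ... in b
    // such that each itemset of a is a subset of the matched itemset of b.
    static boolean isSubsequence(List<Set<Integer>> a, List<Set<Integer>> b) {
        int j = 0; // current position in b
        for (Set<Integer> itemset : a) {
            while (j < b.size() && !b.get(j).containsAll(itemset)) j++;
            if (j == b.size()) return false;
            j++; // greedy earliest match never hurts for this containment test
        }
        return true;
    }

    // Absolute support of a in db: each sequence contributes at most 1,
    // no matter how many times it contains a.
    static int support(List<Set<Integer>> a, List<List<Set<Integer>>> db) {
        int sup = 0;
        for (List<Set<Integer>> s : db) if (isSubsequence(a, s)) sup++;
        return sup;
    }

    public static void main(String[] args) {
        // Sequence 1 of Table 1: <1 (1,2,3) (1,3) 4 (3,6)>
        List<Set<Integer>> s1 = List.of(Set.of(1), Set.of(1, 2, 3),
                Set.of(1, 3), Set.of(4), Set.of(3, 6));
        List<Set<Integer>> p = List.of(Set.of(1, 2), Set.of(3)); // <(1,2) 3>
        System.out.println(isSubsequence(p, s1)); // true
    }
}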
2.2 Related Work
The problem of sequential pattern mining was introduced by Agrawal and Srikant [1]. Among the many algorithms proposed to solve the problem, GSP [22] and PrefixSpan [18], [19] represent two major types of approaches: a priori-based and pattern-growth based.

The a priori principle states that any super-sequence of a non-frequent sequence must not be frequent. A priori-based approaches can be considered as breadth-first traversal algorithms because they construct all length-k patterns before constructing length-(k+1) patterns. The AprioriAll algorithm [1] is one of the earliest a priori-based approaches. It first finds all frequent itemsets, transforms the database so that each transaction is replaced by all the frequent itemsets it contains, and then finds patterns. The GSP algorithm [21] is an improvement over AprioriAll. To reduce candidates, GSP only creates a new length-k candidate when there are two frequent length-(k-1) sequences with the prefix of one equal to the suffix of the other. To test whether a candidate is a frequent length-k pattern, the support of each length-k candidate is counted by examining all the sequences. One major problem of a priori-based approaches is that a combinatorially explosive number of candidate sequences may be generated in a large sequence database. The PSP algorithm [17] is similar to GSP except that the placement of candidates is improved through a prefix tree arrangement to speed up pattern discovery. The SPaRSe algorithm [3] improves GSP by using both candidate generation and projected databases to achieve higher efficiency for high pattern density conditions. The SPIRIT algorithm [12] uses regular expressions as constraints and developed a family of algorithms for pattern mining under constraints based on the a priori rule.

The approaches above represent databases horizontally. In [5] and [24], databases are transformed into a vertical layout consisting of items' id-lists. The Spade algorithm [24] joins id-list pairs to form sequence lattices to group candidate sequences such that each group can be stored in memory, and then searches patterns across each sequence lattice. In Spade, candidates are generated and tested on-the-fly to avoid storing candidates, which costs a lot to merge the id-lists of frequent sequences for a large number of candidates. The SPAM algorithm [5] adopts the lattice concept but represents each id-list as a vertical bitmap. SPAM is more efficient than Spade for mining long patterns if all the bitmaps can be stored in memory; however, it generally consumes more memory. SPAM outperforms the basic PrefixSpan but is much slower than PrefixSpan with the pseudo-projection technique [22]. LapinSpam [25] improves SPAM by using the last position information of items to avoid the ANDing operation or comparison at each iteration in the support counting process.

Pattern-growth approaches can be considered as depth-first traversal algorithms, as they recursively generate the projected database for each length-k pattern to find length-(k+1) patterns. They focus the search on a restricted portion of the initial database to avoid the expensive candidate generation and test step. The FreeSpan algorithm [14] first projects a database into multiple smaller databases based on frequent items. Patterns are found by recursively growing subsequence fragments in each projected database. Based on a similar projection technique, the same authors proposed the PrefixSpan algorithm [18], [19], which outperforms FreeSpan by projecting only effective postfixes. One major concern of PrefixSpan is that it may generate multiple projected databases, which is expensive when long patterns exist. To reduce this cost, a pseudo-projection technique is used in PrefixSpan, which registers only the id and projection position of each sequence instead of physically copying postfixes when the database fits in memory. The MEMISP algorithm [15] uses memory indexes instead of projected databases to detect patterns. It uses the find-then-index technique to recursively find the items that constitute a frequent sequence and constructs a compact index set which indicates the set of data sequences for further exploration. As a result of effective index advancing, fewer and shorter data sequences need to be processed as the discovered patterns become longer. MEMISP is faster than the basic PrefixSpan algorithm but slower than PrefixSpan when the pseudo-projection technique is used; however, it consumes much more memory than PrefixSpan.

Some approaches may achieve better performance under special circumstances. LAPIN [26] is more efficient for dense data sets with long patterns but less efficient in other cases. FSPM [23] declares to be faster than PrefixSpan in many cases. However, the sequences that FSPM mines contain only a single item in each itemset. In this sense, FSPM is not a pattern mining algorithm as we discuss here. Among the various approaches, the overall performance of PrefixSpan is among the best in terms of both time and space, especially when long patterns exist.

3 MOTIVATION
Pattern-growth based approaches recursively grow the length of detected patterns. At each level of recursion, a projected database (or a variation of it, e.g., a memory index) is created, based on which a detection strategy (e.g., projection and support counting, frequent prefix counting, memory index counting, etc.) is applied to grow existing patterns. Projection and support counting are the two major costs for pattern-growth based approaches. In PrefixSpan, patterns are partitioned based on common prefixes and grown uni-directionally along the suffix direction of detected patterns. At each level of recursion the length of detected patterns is only grown by 1. If we can grow the patterns bi-directionally along both ends of detected patterns, we may grow patterns in parallel at each level of recursion. The motivation of this paper is to find suitable partitioning, projection, and detection strategies that allow for faster pattern growth.

Intuitively, instead of partitioning patterns based on common prefixes, we can partition them based on common root items. For a database with n different frequent items (without loss of generality, we assume these items are 1, …, n), its patterns can be divided into n disjoint subsets. The ith subset (1 ≤ i ≤ n) is the set of patterns that contain i (the root item of the subset) and items smaller than i. Since any pattern in subset i contains i, to detect the ith subset we need only check the subset of tuples whose sequences contain i in database D, i.e., the projected database of i, or iD. Since all items in the ith subset are no larger than i, we exclude items that are larger than i in iD.

Example 2. Given the following database:
1) <9 4 5 8 3 6>, 2) <3 9 4 5 8 3 1 5>, 3) <3 8 2 4 6 3 9>, 4) <2 8 4 3 6>, 5) <9 6 3>,
the projected database of 8, or 8D, is:
1) <4 5 8 3 6>, 2) <3 4 5 8 3 1 5>, 3) <3 8 2 4 6 3>, 4) <2 8 4 3 6>.
If minSup is 2, the 8th subset of patterns is {<8>, <3 8>, <4 8>, <5 8>, <4 5 8>, <8 3>, <8 4>, <8 6>, <8 3 6>, <8 4 3>, <8 4 6>, <3 8 3>, <4 8 3>, <5 8 3>, <4 5 8 3>}.

Observing the patterns in the 8th subset, each pattern can be divided into two parts, the prefix and suffix of 8, based on the position of 8 in the pattern. Except for <8>, which only contains 8 and can be derived directly, all other patterns can be clustered and derived as follows:
1) {<3 8>, <4 8>, <5 8>, <4 5 8>}, i.e., the patterns with 8 at the end. This cluster can be derived based on the prefix subsequences of 8 in 8D, i.e., the prefix projected database Pre(8D), which is: 1) <4 5>, 2) <3 4 5>, 3) <3>, 4) <2>. By concatenating a pattern of Pre(8D) (e.g., <3>, <4>, <5>, <4 5>) with 8, we can derive patterns in this cluster.
2) {<8 3>, <8 4>, <8 6>, <8 3 6>, <8 4 3>, <8 4 6>}, i.e., the patterns with 8 at the beginning. This cluster can be derived based on the suffix subsequences of 8 in 8D, i.e., the suffix projected database Suf(8D), which is: 1) <3 6>, 2) <3 1 5>, 3) <2 4 6 3>, 4) <4 3 6>. By concatenating 8 with the patterns (<3>, <4>, <6>, <3 6>, <4 3>, <4 6>) of Suf(8D), we can derive patterns in this cluster.
3) {<3 8 3>, <4 8 3>, <5 8 3>, <4 5 8 3>}, i.e., the patterns with 8 in between the beginning and end of each pattern. This cluster can be mined based on the patterns in Pre(8D) and Suf(8D): a pattern in this cluster (e.g., <4 8 3>) can be derived by concatenating a pattern of Pre(8D) (e.g., <4>) with the root item 8 and a pattern of Suf(8D) (e.g., <3>).

Note: in case a pattern belongs to more than one cluster, it can be derived separately in each cluster. Duplicated patterns can be eliminated by a set union operation.

Here the major difficulty is case 3. Intuitively, each pattern pair (one from Pre(8D) and one from Suf(8D)) is a possible candidate for case 3. In Example 2 we have 4 patterns from Pre(8D) and 6 from Suf(8D). Direct evaluation of every pair can be expensive (24 candidates in this example). If we can decrease the number of candidates for evaluation, we will be able to recursively detect patterns in cases 1 and 2 using similar strategies, and eventually find all the patterns in the 8th subset efficiently.

Based on the a priori rule, if the concatenation of a pattern from Pre(8D) (e.g., <4>) with the root item and a pattern from Suf(8D) (e.g., <6>), i.e., <4 8 6> (the root item 8 is added implicitly), is not a pattern, then the concatenation of any pattern in Pre(8D) that contains <4> (e.g., <4 5>) with any pattern in Suf(8D) that contains <6> (e.g., <3 6>) is also not a pattern. On the other hand, given a pattern s from Pre(8D) (e.g., <4 5>), the valid patterns from Suf(8D) for s should also be valid for any pattern from Pre(8D) that is contained in s (e.g., <4>, <5>). Therefore, to check the candidate patterns from Suf(8D) for s, we need only check the intersection of the valid pattern sets from Suf(8D) for the patterns in Pre(8D) that are contained in s. Here the valid pattern sets from Suf(8D) for <4> and <5> are both {<3>}, and the intersection of the two sets is {<3>}, which means we need only verify <4 5> with <3>.

The strategies above can effectively decrease the number of candidates for case 3. One challenging issue is how to efficiently find and represent the contain relationship between patterns. To solve this problem we can use a directed acyclic graph (DAG), which represents patterns as vertexes and contain relationships as directed edges between vertexes. In the DAGs, each vertex represents a pattern with occurrence information, i.e., the ids of the tuples containing the pattern. A directed edge means that the pattern of the destination vertex contains the pattern of the source vertex. Such a DAG can be recursively constructed in an efficient way to derive the contain relationship of patterns (see Section 4.3). By representing the contain relationship of patterns from Pre(8D) with one DAG (the Up DAG) and the contain relationship of patterns from Suf(8D) with another DAG (the Down DAG), we can decrease the number of candidates using these DAGs based on the strategies discussed above. Fig. 1 shows the Up and Down DAGs for the patterns in Pre(8D) and Suf(8D).

[Fig. 1. Example Up/Down DAGs of patterns from Pre(8D)/Suf(8D)]

To mine the patterns in the ith subset, first we perform a level 1 projection to get iD. At this stage the only length-1 pattern in the ith subset, <i>, is detected. We then perform level 2 projections on Pre(iD) and Suf(iD), based on which we can detect length-2 (cases 1 and 2) and length-3 patterns (case 3). We then perform level 3 projections to detect length 4, 5, 6, and 7 patterns, and continue this process to find all the patterns in the ith subset. If the maximal pattern length is k, then at worst we project k levels, but at best we only project log₂k + 1 levels, which is much less than those of previous approaches.
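As an illustration of how the Pre(iD) and Suf(iD) of Example 2 can be built, the sketch below (ours, not from the paper; for single-item sequences as in the example) keeps the prefix of the last occurrence and the suffix of the first occurrence of the root item, the refinement the paper justifies later in Theorem 2 (in Example 2 each sequence contains 8 only once, so the choice does not matter there).

import java.util.*;

// Sketch (ours): build the prefix/suffix projected databases of a root item x.
public class RootProjection {
    static List<int[]> pre(List<int[]> db, int x) {
        List<int[]> out = new ArrayList<>();
        for (int[] s : db) {
            int last = -1;
            for (int i = 0; i < s.length; i++) if (s[i] == x) last = i;
            if (last >= 0) out.add(Arrays.copyOfRange(s, 0, last)); // prefix of last occurrence
        }
        return out;
    }

    static List<int[]> suf(List<int[]> db, int x) {
        List<int[]> out = new ArrayList<>();
        for (int[] s : db)
            for (int i = 0; i < s.length; i++)
                if (s[i] == x) { out.add(Arrays.copyOfRange(s, i + 1, s.length)); break; } // suffix of first occurrence
        return out;
    }

    public static void main(String[] args) {
        // 8D from Example 2 (items larger than 8 already removed)
        List<int[]> d8 = List.of(new int[]{4, 5, 8, 3, 6}, new int[]{3, 4, 5, 8, 3, 1, 5},
                                 new int[]{3, 8, 2, 4, 6, 3}, new int[]{2, 8, 4, 3, 6});
        for (int[] p : pre(d8, 8)) System.out.println(Arrays.toString(p));
        // prints [4, 5], [3, 4, 5], [3], [2] -- the Pre(8D) of Example 2
    }
}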

4 UPDOWN DIRECTED ACYCLIC GRAPH BASED SEQUENTIAL PATTERN MINING
This section presents the UDDAG based pattern mining approach, which first transforms a database based on frequent itemsets, then partitions the problem, and finally detects each subset of patterns using UDDAG.

In our previous work [8], we presented an UpDown Tree data structure to detect contiguous sequential patterns (in which no gap is allowed for a sequence to contain the pattern). The UpDown Tree is substantially different from the UpDown DAG in this paper. In addition to the different internal data structures, a major difference is that the UpDown Tree is a compressed representation of the projected databases, while UDDAG represents the containing relationship of detected patterns.

4.1 Database Transformation
In the example above each itemset has exactly one item. Practically an itemset may have multiple items. Most previous approaches detect frequent itemsets with multiple items simultaneously when detecting sequential patterns. In our approach we first detect frequent itemsets and transform the database based on them, and then detect patterns on the transformed database using UDDAG. Our strategy of detecting frequent itemsets first is the same as that of AprioriAll. In Section 6.1 we will discuss in detail the impact of this strategy on performance.

Definition 1 (Frequent itemset). The absolute support of an itemset in a sequence database is the number of tuples whose sequences contain the itemset. An itemset with a support no less than minSup is called a frequent itemset (FI).

We assign a unique id to each FI in D. For example, for the database in Table 1, if minSup = 2, the FIs and their ids are: (1)-1, (1,2)-2, (2)-3, (2,3)-4, (3)-5, (4)-6, (5)-7, (6)-8. Based on the frequent itemsets, we transform each sequence in the database D into an alternative representation: we replace each itemset in each sequence with the ids of all the FIs contained in the itemset. Thus we can transform the database as shown in Table 2 (infrequent items are eliminated).

TABLE 2
TRANSFORMED DATABASE

Seq. Id | Sequence
1 | <1 (1,2,3,4,5) (1,5) 6 (5,8)>
2 | <(1,6) 5 (3,4,5) (1,7)>
3 | <(7,8) (1,2,3) (6,8) 5 3>
4 | <7 (1,8) 5 3 5>

Definition 2 (Item pattern). An item pattern is a sequential pattern with exactly 1 item in every itemset it contains.

Lemma 1 (Transformed database). Let D be a database and D' be its transformed database. Substituting the ids of each item pattern contained in D' with the corresponding itemsets, and denoting the resulting pattern set as P', we have P = P'.

Proof. Let p be a pattern in P, and ip be the item pattern derived by replacing each itemset in p with the corresponding id in D'. Since the id of an itemset i exists at the same position in D' as that of i in D, the support of ip in D' is the same as that of p in D. Thus ip is an item pattern in D'. Substituting each id in ip with the corresponding itemset, and denoting the resulting pattern as ip', we have ip' = p. Based on the definition of P', we have ip' ∈ P'. Thus p ∈ P', and P ⊆ P'. Similarly, P' ⊆ P. All together, we have P = P'.

Based on Lemma 1, mining patterns from D is equivalent to mining item patterns from D'. Below we focus on mining item patterns from D' and represent frequent itemsets with their ids. For brevity, we use D instead of D', use P instead of P', and use pattern instead of item pattern.

4.2 Problem Partitioning
Lemma 2 (Problem partitioning). Let {x1, x2, …, xt}, x1 < x2 < … < xt, be the frequent itemsets in a database D. The complete set of patterns (P) in D can be divided into t disjoint subsets, P = Px1 ∪ … ∪ Pxt, where the ith subset (denoted as Pxi, 1 ≤ i ≤ t) is the set of patterns that contain xi and FIs smaller than xi.

Proof. First we create t empty sets. Next we move the patterns that contain xt from P to Pxt, and in the remaining P we move all the patterns that contain xt-1 to Pxt-1. We continue this until moving all the patterns that contain x1 to Px1. Now P is empty because any pattern can only contain FIs in {x1, x2, …, xt}. Given two integers i and j, 1 ≤ i < j ≤ t, for ∀pk ∈ Pxi, pk ∉ Pxj because pk cannot contain xj, which is contained in every pattern in Pxj; similarly, for ∀pl ∈ Pxj, pl ∉ Pxi because pl contains xj, which is larger than the largest element contained in any pattern in Pxi. Thus Pxi ∩ Pxj = ∅, i.e., all the subsets of P are disjoint.

Based on Lemma 2, the problem of pattern mining can be partitioned into mining subsets of patterns.
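Before moving on, here is a minimal sketch (ours, not the paper's code) of the Section 4.1 transformation: each itemset is replaced by the sorted ids of all FIs it contains, and itemsets containing no FI (e.g., infrequent single items) disappear. A real implementation would walk a DAG over the FIs, as described with Algorithm 1 in Section 4.3, rather than scanning every FI per itemset.

import java.util.*;

// Sketch (ours) of the Section 4.1 transformation.
public class Transform {
    static List<List<Set<Integer>>> transform(List<List<Set<Integer>>> db,
                                              Map<Set<Integer>, Integer> fiIds) {
        List<List<Set<Integer>>> out = new ArrayList<>();
        for (List<Set<Integer>> seq : db) {
            List<Set<Integer>> t = new ArrayList<>();
            for (Set<Integer> itemset : seq) {
                Set<Integer> ids = new TreeSet<>();
                for (Map.Entry<Set<Integer>, Integer> e : fiIds.entrySet())
                    if (itemset.containsAll(e.getKey())) ids.add(e.getValue());
                if (!ids.isEmpty()) t.add(ids); // itemsets with no FI vanish
            }
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // FI ids for Table 1 with minSup = 2
        Map<Set<Integer>, Integer> ids = Map.of(Set.of(1), 1, Set.of(1, 2), 2,
                Set.of(2), 3, Set.of(2, 3), 4, Set.of(3), 5, Set.of(4), 6,
                Set.of(5), 7, Set.of(6), 8);
        List<Set<Integer>> s4 = List.of(Set.of(5), Set.of(7), Set.of(1, 6),
                Set.of(3), Set.of(2), Set.of(3)); // sequence 4 of Table 1
        System.out.println(transform(List.of(s4), ids).get(0));
        // prints [[7], [1, 8], [5], [3], [5]] -- <7 (1,8) 5 3 5>, as in Table 2
    }
}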

4.3 UDDAG Based Pattern Mining
Definition 3 (Projected database). The collection of all the tuples whose sequences contain an itemset x in a database D is called the x-projected database, denoted as xD.

Lemma 3 (Projected database). Let D be a database and x be an itemset, and let Px and P'x be the complete sets of patterns containing x in D and in xD, respectively. Then Px = P'x.

Proof. For ∀p ∈ Px, any tuple that contains p also contains x, and any tuple that does not contain x also does not contain p. Therefore p can only be detected from the collection of all the tuples that contain x, i.e., xD, and its support in xD equals its support in D. Thus Px ⊆ P'x. Since any tuple in xD also exists in D, we similarly have Px ⊇ P'x. All together, Px = P'x, i.e., Px can be mined from xD.

Definition 4 (Prefix/suffix subsequence/tuple). Given a frequent itemset x and a tuple <sid, s> in a database, s = <s1 s2 … sj>, if x ∈ si, 1 ≤ i ≤ j, then sp = <s1 s2 … si-1> is the prefix subsequence of x in s, and ss = <si+1 si+2 … sj> is the suffix subsequence of x in s. <sid, sp> is the prefix tuple of x, and <sid, ss> is the suffix tuple of x. If a FI x occurs multiple times in a sequence, each occurrence has its own prefix/suffix subsequence. For example, in sequence <3 5 6 3 5 6>, 3 has two suffix subsequences (<5 6 3 5 6> and <5 6>).

Definition 5 (Prefix/suffix projected database). The collection of all the prefix/suffix tuples of a frequent itemset x in xD is called the prefix/suffix projected database of x, denoted as Pre(xD)/Suf(xD).

Definition 6 (Sequence concatenation). Given two sequences a = <a1 … ai> and b = <b1 … bj>, the sequence concatenation of a and b, denoted as a.b, is defined as <a1 … ai b1 … bj>.

Theorem 1 (Pattern mining). Let x be a FI, and PPre and PSuf be the complete sets of sequential patterns in Pre(xD) and Suf(xD), respectively. Let Q1 = {pk.<x> | pk ∈ PPre}, Q2 = {<x>.pk | pk ∈ PSuf}, Q3 = {pk.<x>.pi | pk ∈ PPre, pi ∈ PSuf}, and Q = {<x>} ∪ Q1 ∪ Q2 ∪ Q3. Then Px ⊆ Q.

Proof. For ∀pj ∈ Px, pj = <a1 a2 … an>, consider the length of pj and the position of x in pj:
1) n = 1. Then x is the only itemset of pj, i.e., pj = <x>, and pj ∈ Q.
2) n > 1 and x only resides at the end of pj (an = x). Let pj' = <a1 a2 … an-1>. For each occurrence of pj in xD, there is a corresponding occurrence of pj' in a prefix subsequence of x, which is contained in Pre(xD). Thus pj' ∈ PPre, and pj = pj'.<x> ∈ Q1. If x resides at both the beginning and the end of pj, pj ∈ Q1 as well.
3) n > 1 and x only resides at the beginning of pj (a1 = x, an ≠ x). Let pj' = <a2 … an>. Each occurrence of pj' corresponds to a suffix subsequence of x, which is contained in Suf(xD). Thus pj' ∈ PSuf, and pj = <x>.pj' ∈ Q2.
4) n > 1 and x resides in between the beginning and end of pj, i.e., ∃m ∈ {2, …, n-1}, am = x. Let pj' = <a1 … am-1> and pj'' = <am+1 … an>. Then pj' ∈ PPre, pj'' ∈ PSuf, and pj = pj'.<x>.pj'' ∈ Q3.
All together, Px ⊆ Q. Note: if x occurs multiple times in a sequence, the occurrences only contribute 1 to the count of a pattern.

Based on Theorem 1 we can detect Px from PPre and PSuf. The pattern <x> is obvious. Cases 1 and 2 (Q1 and Q2) are directly based on PPre and PSuf, which can be recursively derived. Case 3 (Q3) is complicated due to a potentially large number of candidates. Below we define UDDAG to decrease the number of candidates.

Definition 7 (UpDown directed acyclic graph). Given a FI x and xD, an UpDown Directed Acyclic Graph based on Px, denoted as x-UDDAG, is derived as follows:
1) Each pattern in Px corresponds to a vertex in x-UDDAG; <x> corresponds to the root vertex, denoted as vx. For a vertex v, op(v) represents the pattern corresponding to v; for a pattern p ∈ Px, ov(p) represents the vertex corresponding to p.
2) Let PU be the set of length-2 patterns in Px ending with x. For ∀p ∈ PU, let vu = ov(p) and add a directed edge from vx to vu; vu is called an up root child of vx.
3) Let PD be the set of length-2 patterns in Px starting with x. For ∀p ∈ PD, let vd = ov(p) and add a directed edge from vx to vd; vd is called a down root child of vx.
4) Each up/down root child vu/vd of vx also corresponds to an UDDAG (defined recursively using rules 1)-4)).
For ∀v1 ∈ VU and ∀v2 ∈ VD, assume op(v1) = <i1 i2 … im x> and op(v2) = <x j1 j2 … jn>. If ∃p ∈ Px, p = <i1 i2 … im x j1 j2 … jn>, let v3 = ov(p), add a directed edge from v1 to v3, and add another directed edge from v2 to v3. Here v1/v2 is the Up/Down parent of v3, and v3 is the UpDown child of v1 and v2. Note: if v3 corresponds to multiple up and down parent pairs, only one pair (randomly selected) is linked.

The UDDAG for all the patterns in Pre(xD)/Suf(xD) is called the Up/Down DAG of x, denoted as xU-UDDAG and xD-UDDAG, respectively. The set of vertexes of an UDDAG (Up/Down DAG) is denoted as V (VU/VD). In an UDDAG, if there is a directed path from vertex v1 to v2, v2 is called reachable from v1.

The data structure of a vertex in UDDAG is as follows:

class UDVertex{
  UDVertex upParent, downParent;
  List upChildren, downChildren, upDownChildren;
  int[] pattern; //pattern sequence
  int[] occurs;  //occurrence set
}

Definition 8 (Occurrence set). The occurrence set of a vertex v in a database D (denoted as OSD(v)) is the set of sequence ids of the tuples in D that contain op(v).

Definition 9 (Valid down vertex set). Given a vertex v in the Up DAG of x, the valid down vertex set of v (VDVSv) is defined as VDVSv = {v' | (v' ∈ VD) ∧ (op(v).<x>.op(v') ∈ Px)}.

Definition 10 (Parent valid down vertex set). Given a vertex v in the Up DAG of x, the parent valid down vertex set of v (PVDVSv) is defined as follows:
1) If v has no parent (i.e., v is a root child), PVDVSv = VD;
2) If v has one parent, PVDVSv is the VDVS of the parent;
3) Otherwise, PVDVSv is the intersection of the VDVSs of the parents.
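To ground Theorem 1, here is a small sketch (ours; plain int arrays stand in for the UDVertex structure) that enumerates the candidate set Q = {<x>} ∪ Q1 ∪ Q2 ∪ Q3 by sequence concatenation (Definition 6). Each candidate must of course still be support-checked against xD; the point of UDDAG is to avoid checking most of them.

import java.util.*;

// Sketch (ours): enumerate the candidate set Q of Theorem 1.
public class Candidates {
    static int[] concat(int[] a, int[] b) { // Definition 6, for single-item itemsets
        int[] c = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, c, a.length, b.length);
        return c;
    }

    static List<int[]> candidates(int x, List<int[]> pPre, List<int[]> pSuf) {
        List<int[]> q = new ArrayList<>();
        int[] root = {x};
        q.add(root);                                   // <x>
        for (int[] p : pPre) q.add(concat(p, root));   // case 1: p.<x>
        for (int[] s : pSuf) q.add(concat(root, s));   // case 2: <x>.s
        for (int[] p : pPre)                           // case 3: p.<x>.s
            for (int[] s : pSuf) q.add(concat(concat(p, root), s));
        return q;
    }

    public static void main(String[] args) {
        // PPre and PSuf of Example 2: 1 + 4 + 6 + 4*6 = 35 candidates for the 8th subset
        List<int[]> pPre = List.of(new int[]{3}, new int[]{4}, new int[]{5}, new int[]{4, 5});
        List<int[]> pSuf = List.of(new int[]{3}, new int[]{4}, new int[]{6},
                new int[]{3, 6}, new int[]{4, 3}, new int[]{4, 6});
        System.out.println(candidates(8, pPre, pSuf).size()); // 35
    }
}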

Lemma 4. Given a vertex v in the Up DAG of x, VDVSv ⊆ PVDVSv.

Proof. If v has no parent, PVDVSv = VD; since VDVSv ⊆ VD, VDVSv ⊆ PVDVSv. If v has one or more parents, for ∀v' ∈ VDVSv, op(v).<x>.op(v') ∈ Px. For any parent v'' of v, op(v'') is contained in op(v), so based on the a priori rule op(v'').<x>.op(v') ∈ Px, i.e., v' belongs to the VDVS of every parent of v. Thus v' ∈ PVDVSv, and VDVSv ⊆ PVDVSv.

Based on Lemma 4, to detect VDVSv we need only examine the vertexes in PVDVSv.

Lemma 5. Given a vertex v in the Up DAG of x and a vertex v' in PVDVSv, if v' ∉ VDVSv, then none of the vertexes reachable from v' belongs to VDVSv.

Proof. Since v' ∉ VDVSv, op(v).<x>.op(v') ∉ Px, i.e., its support is below minSup. For any vertex v'' reachable from v', op(v'') contains op(v'). Therefore any sequence containing op(v).<x>.op(v'') also contains op(v).<x>.op(v'), so SupD(op(v).<x>.op(v'')) ≤ SupD(op(v).<x>.op(v')) < minSup. Thus v'' ∉ VDVSv.

Lemmas 4 and 5 help eliminate candidates for case 3. Lemma 6 further evaluates the remaining candidates.

Lemma 6. Given a vertex v in the Up DAG of x and a vertex v' in PVDVSv, let IS be the intersection of the occurrence sets of v and v'. If |IS| ≥ minSup, and x occurs exactly once in the sequence of every tuple whose id is contained in IS, then op(v).<x>.op(v') ∈ Px.

Proof. For a tuple <sid, s> with sid ∈ IS, both op(v) and op(v') occur in s. Since x occurs only once in s, op(v) occurs before x in s and op(v') occurs after x. Thus op(v).<x>.op(v') ⊑ s. Since |IS| ≥ minSup, at least minSup tuples contain op(v).<x>.op(v'), i.e., op(v).<x>.op(v') ∈ Px.

If x occurs more than once in a sequence, the prefix/suffix tuple of every occurrence of x is contained in Pre(xD)/Suf(xD), and a candidate may be miscounted. For example, in sequence <5 3 5 2 5>, <5 3> is the prefix subsequence of the second occurrence of 5, and <3 5 2 5> is the suffix subsequence of the first occurrence of 5; the candidate <5 3>.<5>.<3 5 2 5> = <5 3 5 3 5 2 5> may be mistakenly considered as being contained in the sequence. Thus in this case we need further verification of whether the sequence really contains op(v).<x>.op(v'). Note also that if multiple prefix/suffix tuples from the same sequence contain the same pattern, only one is counted for support.

To minimize the effort of pattern detection in this case, we build the prefix/suffix projected databases as follows: 1) if x occurs only once in a sequence, directly add its prefix/suffix tuple to Pre(xD)/Suf(xD); 2) if x occurs more than once in a sequence, add the prefix tuple of the last occurrence of x to Pre(xD), and the suffix tuple of the first occurrence of x to Suf(xD). By including only the last prefix/first suffix tuple, we can guarantee not miscounting the support of any pattern, because the sequences of all other prefix/suffix tuples are contained in the last prefix/first suffix tuple. Denoting the derived prefix/suffix projected databases as Pre''(xD)/Suf''(xD), we have Theorem 2.

Theorem 2. Let Ppre/Psuf be the complete sets of patterns in Pre''(xD)/Suf''(xD), R1 = {spk.<x> | spk ∈ Ppre}, R2 = {<x>.spk | spk ∈ Psuf}, R3 = {spk.<x>.spj | spk ∈ Ppre, spj ∈ Psuf}, and R = {<x>} ∪ R1 ∪ R2 ∪ R3. Then Px ⊆ R ⊆ Q (Q is defined in Theorem 1).

Proof. The proof of Px ⊆ R is similar to that of Px ⊆ Q in Theorem 1. Since every tuple in Pre''(xD)/Suf''(xD) also exists in Pre(xD)/Suf(xD), while in Pre''(xD)/Suf''(xD) only the last prefix/first suffix tuple is contained for multiple occurrences of x in the same sequence, we have R ⊆ Q.

Based on the lemmas and theorems above, we first give an example to illustrate UDDAG based pattern mining, and then present the algorithm in detail.

Example 3 (UDDAG). For the sample database in Table 1, if minSup = 2, its patterns can be mined as follows.
1) Database transformation: see Table 2 in Section 4.1.
2) Pattern partitioning: P is partitioned into 8 subsets: the one that contains 1 (P1), the one that contains 2 and smaller ids (P2), …, and the one that contains 8 and smaller ids (P8).
3) Finding subsets of patterns.
3.1) Finding P8. First we build Pre(8D) and Suf(8D) based on Theorem 2:
Pre(8D): 1) <1 (1,2,3,4,5) (1,5) 6>, 3) <(7,8) (1,2,3)>, 4) <7>.
Suf(8D): 1) <>, 3) <(1,2,3) (6,8) 5 3>, 4) <5 3 5>.
Let PP be all the patterns in Pre(8D). Since the FIs in Pre(8D) are (1), (2), (3), and (7), we can partition PP into 4 subsets: PP1, PP2, PP3, and PP7. We have PP1 = {<1>}, PP2 = {<2>}, and PP3 = {<3>}. For PP7, the prefix projected database of 7 in Pre(8D) is empty, and the suffix projected database of 7 is: 3) <(1,2,3)>, 4) <>, which has no frequent itemset; thus the only pattern in PP7 is <7>. All together, PP = {<1>, <2>, <3>, <7>}.
Let PS be all the patterns in Suf(8D). Since the FIs in Suf(8D) are (3) and (5), we can partition PS into 2 subsets, PS3 and PS5. First we detect PS5. The prefix projected database of 5 in Suf(8D) is: 3) <(1,2,3) (6,8)>, 4) <5 3>, which contains a pattern <3>. The suffix projected database of 5 in Suf(8D) is: 3) <3>, 4) <3 5>, which also contains a pattern <3>. Since both databases have patterns, we need to consider case 3, i.e., whether concatenating <3> with root 5 and <3> is also a pattern. Here the occurrence set of <3> in the prefix projected database of 5 is {3, 4}, and the occurrence set of <3> in the suffix projected database of 5 is also {3, 4}. Their intersection is {3, 4}, which means the support of <3 5 3> is at most 2. However, since 5 occurs twice in tuple 4, we need to check whether tuple 4 really contains <3 5 3>, which is not true by verification. Thus the support of <3 5 3> is 1, and it is not a pattern. Therefore PS5 = {<5>, <3 5>, <5 3>}. Similarly, PS3 = {<3>}. All together, PS = {<3>, <5>, <3 5>, <5 3>}.
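The PS5 computation above can be mirrored in code. Below is a small sketch (ours; the two Suf(8D) tuples are flattened onto the items {3, 5} purely for illustration) of the case 3 check used there: the occurrence-set intersection gives a cheap upper bound on support, and full containment verification is needed only because 5 occurs twice in tuple 4, so Lemma 6 does not apply.

import java.util.*;

// Sketch (ours): occurrence-set pre-check for the candidate <3>.<5>.<3>,
// followed by the full verification required when the root occurs more than
// once in an intersecting sequence.
public class Case3Check {
    static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
        Set<Integer> is = new TreeSet<>(a);
        is.retainAll(b);
        return is;
    }

    // Containment for single-item sequences (greedy match).
    static boolean contains(int[] s, int[] p) {
        int j = 0;
        for (int item : s) if (j < p.length && item == p[j]) j++;
        return j == p.length;
    }

    public static void main(String[] args) {
        int minSup = 2;
        // Suf(8D) tuples 3 and 4, flattened onto items {3, 5} for illustration:
        // <(1,2,3) (6,8) 5 3> -> [3, 5, 3], and <5 3 5> -> [5, 3, 5]
        Map<Integer, int[]> sufDb = Map.of(3, new int[]{3, 5, 3}, 4, new int[]{5, 3, 5});
        int[] candidate = {3, 5, 3}; // <3>.<5>.<3>
        // occurrence sets of the two <3> vertexes both equal {3, 4}
        Set<Integer> is = intersect(Set.of(3, 4), Set.of(3, 4));
        int sup = 0;
        for (int sid : is) if (contains(sufDb.get(sid), candidate)) sup++;
        System.out.println(sup >= minSup); // false: <3 5 3> has support 1
    }
}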

Next we detect P8 based on the Up and Down DAGs of 8 (Fig. 2(a) and (b)) by evaluating each candidate vertex pair. First we detect the VDVSs of the length-1 patterns in Pre(8D). For up vertex 1, we first check its combination with down vertex 3: the intersection of their occurrence sets is {3}, so the corresponding support is at most 1 and this is not a valid combination. Based on Lemma 5, all the children of down vertex 3 are also invalid for up vertex 1. Similarly, up vertex 1 and down vertex 5 are an invalid combination, so all the children of down vertex 5 are not valid for up vertex 1 either. Therefore VDVS1 = ∅. Similarly, VDVS2 = ∅ and VDVS3 = ∅. For up vertex 7, the intersections with down vertexes <3>, <5>, and <5 3> are all {3, 4} and pass verification, while the combination with <3 5> fails verification; thus VDVS7 = {ov(<3>), ov(<5>), ov(<5 3>)}. Since no length-2 pattern exists in Pre(8D), the detection stops. Eventually we have P8 = {<8>, <1 8>, <2 8>, <3 8>, <7 8>, <8 3>, <8 5>, <8 3 5>, <8 5 3>, <7 8 3>, <7 8 5>, <7 8 5 3>}. The 8-UDDAG based on the detected patterns in P8 is shown in Fig. 2(c).

[Fig. 2. UpDown DAG for P8]

3.2) Similarly, we detect the other subsets: P1 = {<1>, <1 1>}; P2 = {<2>}; P3 = {<3>, <1 3>, <3 1>, <1 3 1>}; P4 = {<4>, <1 4>, <4 1>, <1 4 1>}; P5 = {<5>, <1 5>, <2 5>, <3 5>, <5 1>, <5 3>, <5 5>, <1 3 5>, <1 5 1>, <1 5 3>, <1 5 5>}; P6 = {<6>, <1 6>, <2 6>, <3 6>, <6 3>, <6 5>, <6 5 3>, <1 6 5>, <2 6 5>, <3 6 5>}; P7 = {<7>, <7 1>, <7 3>, <7 5>, <7 1 3>, <7 1 5>, <7 3 5>, <7 5 3>, <7 1 5 3>}.
4) The complete set of patterns is the union of all the subsets of patterns detected above. (Note: based on Lemma 1, here we actually detect item patterns; the integers in the patterns are the ids of FIs.)

The complete algorithm is as follows.

Algorithm 1 (UDDAG based pattern mining).
Input: a database D and the minimum support minSup.
Output: P, the complete set of patterns in D.

Method: findP(D, minSup){
  P = ∅
  FISet = D.getAllFI(minSup)      //detect all FIs
  D.transform()                   //Section 4.1 transformation
  for each FI x in FISet{
    UDVertex rootVT = new UDVertex(x)
    findP(D.getPreD(x), up, rootVT, minSup)    //patterns in Pre(xD)
    findP(D.getSufD(x), down, rootVT, minSup)  //patterns in Suf(xD)
    findPUDDAG(rootVT)                         //case 3 patterns of Px
    P = P ∪ rootVT.getAllPatterns()
  }
}

The algorithm first calls subroutine getAllFI to detect all the FIs (an adapted version of the FP-growth* algorithm [11] is used to detect FIs in our implementation). The algorithm then transforms the database: a directed acyclic graph is built to represent the containing relationship of FIs; for each (sorted) itemset, we check all its FIs against the children in the DAG and verify whether the FI corresponding to each child is valid in the itemset; if so, we add the id of the child to the itemset and further check the children of that child. Based on the transformed database, for each FI x the algorithm creates a root vertex for <x>, detects all the patterns in the prefix and suffix projected databases of x, creates the x-UDDAG, detects Px using the x-UDDAG, and adds Px to P.

Subroutine: findP(PD, type, rootVT, minSup){
  FISet = PD.getAllFI(minSup)
  for each FI x in FISet{
    UDVertex curVT = new UDVertex(x)
    if(type == up) rootVT.addUpChild(curVT)
    else rootVT.addDownChild(curVT)
    findP(PD.getPreD(x), up, curVT, minSup)
    findP(PD.getSufD(x), down, curVT, minSup)
    findPUDDAG(curVT)
  }
}

The parameters are: 1) PD, the projected database; 2) type (up/down), which indicates a prefix/suffix PD; 3) rootVT, the vertex for the root item of PD; 4) minSup, the support threshold. This subroutine detects all the patterns whose ids are no larger than the root of the projected database. It first detects all the FIs in PD. For each FI x, it creates a new vertex as the Up/Down child (based on type) of the root vertex, and then recursively detects all the patterns in PD similarly to findP(D, minSup). This is a recursive process: for Pre(xD) and Suf(xD) we perform the same actions until reaching the base case, where the projected database has no frequent itemset.

Subroutine: findPUDDAG(rootVT){
  upQueue.enQueue(rootVT.upChildren)
  while(!upQueue.isEmpty()){
    UDVertex upVT = upQueue.deQueue()
    //enqueue PVDVS of upVT (Definition 10)
    if(upVT.upParent == rootVT) downQueue.enQueue(rootVT.downChildren)
    else if(upVT.downParent == null) downQueue.enQueue(upVT.upParent.VDVS)
    else downQueue.enQueue(upVT.upParent.VDVS ∩ upVT.downParent.VDVS)
    while(!downQueue.isEmpty()){
      UDVertex downVT = downQueue.deQueue()
      if(isValid(upVT, downVT)){     //verified case 3 pattern
        UDVertex curVT = new UDVertex(upVT, downVT)
        upVT.addVDVS(downVT)
        if(upVT.upParent == rootVT) downQueue.enQueue(downVT.children)
      }
    }
    if(upVT.VDVS.size > 0) upQueue.enQueue(upVT.children)
  }
}

Subroutine findPUDDAG detects all the case 3 patterns in a projected database using the UpDown DAG. The parameter rootVT is the root vertex of the recursively constructed UpDown DAG. The subroutine first enqueues all the up children of the root vertex into an upQueue. For each vertex upVT in the upQueue, it enqueues the PVDVS of upVT into a downQueue as follows: if upVT is a root child of rootVT, it enqueues all the down children of rootVT; else if upVT has only one parent, it enqueues the VDVS of the parent; else it enqueues the intersection of the VDVSs of the parents. For each vertex downVT in the downQueue of upVT, if upVT and downVT correspond to a valid pattern, it creates a new vertex whose parents are upVT and downVT, and adds downVT to the VDVS of upVT; it further enqueues all the children of downVT into the downQueue if upVT is an up root child of rootVT. Finally, if the size of the VDVS of upVT is not 0, the subroutine enqueues all the children of upVT into the upQueue for further examination.

Theorem 3 (UDDAG). A sequence is a sequential pattern if and only if UDDAG says so.

Proof sketch. Based on Theorem 2, Px ⊆ R. In Algorithm 1, every candidate in R is checked either directly or indirectly based on Lemmas 4, 5, and 6 (cases 1 and 2 are checked directly in subroutine findP, and case 3 is checked in subroutine findPUDDAG). Since all the candidates in R are verified, a sequence is a sequential pattern if UDDAG says so, and UDDAG identifies the complete set of patterns in D.

4.4 Detailed Implementation Strategies
The major costs of our approach are database projection and candidate pattern verification. Below we discuss the implementation strategies for these two issues.

1) Pseudo-projection. To reduce the number and size of projected databases, we adopt a pseudo-projection technique similar to that of PrefixSpan. One major difference is that we register the ids of sequences and both the starting and ending positions of the projected subsequences in the original sequences, because we project a sequence bi-directionally.

Example 4 (Pseudo-projection). Using pseudo-projection, Pre(8D) and Suf(8D) in Example 3 are shown in Table 3, where $ indicates that 8 has an occurrence in the current sequence but its projected prefix/suffix is empty. Note that for multiple occurrences of 8 in a sequence (e.g., sequence 3), we only register the last prefix and the first suffix, based on Theorem 2.

TABLE 3
Pre(8D)/Suf(8D) BASED ON PSEUDO-PROJECTION

Id | Sequence | Pre(8D) | Suf(8D)
1 | <1 (1,2,3,4,5) (1,5) 6 (5,8)> | Start: 0, End: 3 | $
3 | <(7,8) (1,2,3) (6,8) 5 3> | Start: 0, End: 1 | Start: 1, End: 4
4 | <7 (1,8) 5 3 5> | Start: 0, End: 0 | Start: 2, End: 4

2) Verification of candidate patterns. To verify whether an up vertex and a down vertex in an UDDAG correspond to a valid pattern, we derive the support of the candidate based on the size of the intersection of the up and down vertexes' occurrence sets. Two approaches are provided in our implementation.

The first approach is bit vector based. We represent each occurrence set with a bit vector and perform an ANDing operation on the two bit vectors. The size of the intersection set is derived by counting the number of 1s in the resulting bit vector. Several approaches exist for efficiently counting the 1s in a bit vector [6], [20]; in our implementation we use the arithmetic logic based approach [20]. For example, 8D in Example 3 has three sequences (1, 3, and 4). The up vertex 1 in Fig. 2 occurs in sequences 1 and 3, so the bit vector representation of its occurrence set is 110. The down vertex 5 occurs in sequences 3 and 4, and its bit vector is 011. The ANDing result of the two bit vectors is 010, which has only one bit 1. This means the support of <1 8 5> in 8D is at most 1, thus it is not a pattern.

The second approach is co-occurrence counting based. Given Pre(xD) and Suf(xD), we derive the co-occurrence count for each ordered pair of FIs (one from a prefix and the other from the corresponding suffix) by enumerating every ordered pair and increasing the corresponding co-occurrence count by 1. If the co-occurrence count of a pair is less than minSup, the pair is an invalid candidate. For example, given a 9-projected database with the following sequences: 1) <5 3 9 6 8>, 2) <3 5 9 8>, and 3) <6 9 8>, we have the co-occurring pairs (5 6), (5 8), (3 6), and (3 8) for sequence 1, (3 8) and (5 8) for sequence 2, and (6 8) for sequence 3. If minSup is 2, only (3 8) and (5 8) (both co-occur twice) will be considered as candidates. The other pairs are discarded because they occur only once.
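A minimal sketch (ours) of the bit vector based verification described above: occurrence sets become bit vectors, and the candidate's support upper bound is the popcount of the AND of the two vectors. For brevity we use Java's Long.bitCount where the paper's implementation uses the arithmetic logic based counting method of [20].

// Sketch (ours) of UDDAG-bv's candidate pre-check.
public class BitVectorCheck {
    static int intersectionSize(long[] up, long[] down) {
        int count = 0;
        for (int i = 0; i < up.length; i++)
            count += Long.bitCount(up[i] & down[i]); // popcount of the AND
        return count;
    }

    public static void main(String[] args) {
        // 8D of Example 3 has sequences 1, 3, 4, mapped here to bits 0, 1, 2.
        long[] upVertex1 = {0b011};   // <1> occurs in sequences 1 and 3 ("110" in the text's left-to-right notation)
        long[] downVertex5 = {0b110}; // <5> occurs in sequences 3 and 4 ("011" in the text)
        // The AND has a single 1, so the support of <1 8 5> is at most 1: pruned for minSup = 2.
        System.out.println(intersectionSize(upVertex1, downVertex5)); // 1
    }
}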
5 PERFORMANCE EVALUATION
We conducted an extensive set of experiments to compare our approach with other representative algorithms. The algorithms we compared are PrefixSpan, Spade, and LapinSpam, which were all implemented in C++ by their authors (minor changes have been made to adapt Spade to Windows). Two versions of UDDAG were tested: UDDAG-bv uses bit vectors to verify candidates, and UDDAG-co uses co-occurrences to verify candidates whenever possible. All the experiments were performed on a Windows Server 2003 machine with a 3.0 GHz Quad Core Intel Xeon processor and 16 GB memory.

We perform two studies using the same data generator as in [19]: 1) a comparative study, which uses similar datasets to those in [19]; and 2) a scalability study, in which the datasets were generated by keeping all except one of the parameters shown in Table 4 fixed and exploring different values for the remaining one.

(Note: the default value for T is 2.5; we use 1.8 for the scalability testing on I to allow higher values of I to be tested.)

TABLE 4
PARAMETERS FOR GENERATING DATASETS

Symbol | Name | Def. value
C | Number of sequences | 100,000
N | Number of different items | 10,000
S | Ave. number of transactions per sequence | 10
T | Ave. number of items per transaction | 2.5
L | Ave. number of transactions in a pattern | 5
I | Ave. number of items in a transaction of a pattern | 1.25

We present the experiment results in this section and give more discussion in Section 6.

5.1 Experiment Results for Comparative Study
First we tested data set C10S8T8I8 with 10k sequences and 1,000 different items. Fig. 3 shows the distribution of pattern lengths, and Fig. 4 and 5 show the time and memory usage of the algorithms at different minSup values.

[Fig. 3. Distribution of pattern lengths of data set C10S8T8I8]
[Fig. 4. Time usage on data set C10S8T8I8]
[Fig. 5. Memory usage on data set C10S8T8I8]

Fig. 4 shows that the UDDAG algorithms are the fastest while LapinSpam is the slowest. When minSup is large (e.g., 3%), the algorithms have similar running time. As minSup decreases, the processing times of PrefixSpan and Spade grow faster than those of UDDAG-bv and UDDAG-co. When minSup is 1%, UDDAG-bv (0.16s) and UDDAG-co (0.17s) are slightly faster than PrefixSpan (0.19s) and Spade (0.25s), and more than 10 times faster than LapinSpam. At the smallest minSup tested, UDDAG-bv (3s) and UDDAG-co (2.9s) are much faster than all the other algorithms. The UDDAG algorithms use less memory than PrefixSpan when minSup is large (≥ 1%); at smaller minSup they use more memory because of the extra memory usage for UDDAG, which increases as the number of patterns increases. The memory usages of the UDDAG based approaches are generally less than that of Spade and much less than that of LapinSpam.

Secondly we tested data set C200S10T2.5I1.25 with 200k sequences and 10,000 different items. Fig. 6 shows the distribution of pattern lengths, and Fig. 7 and 8 show the time and memory usage of the algorithms. Since LapinSpam crashed on the large data sets, in the following tests we only show the results of the other four algorithms.

[Fig. 6. Distribution of pattern lengths of data set C200S10T2.5I1.25]
[Fig. 7. Time usage on data set C200S10T2.5I1.25]
[Fig. 8. Memory usage on data set C200S10T2.5I1.25]

The processing times have a similar order as in the first test. When minSup is 1%, the algorithms have similar running time. When minSup is less than 1%, the processing times of PrefixSpan and Spade grow faster than those of UDDAG-bv and UDDAG-co. When minSup is 0.1%, UDDAG-bv (8.5s) and UDDAG-co (8.7s) are almost 4 times faster than PrefixSpan (32s) and 3 times faster than Spade (23s). UDDAG-bv and UDDAG-co use less memory than PrefixSpan except at the smallest minSup values, and the memory usage of Spade is the highest.

Next we tested a denser data set, C200S10T5I2.5, with 200k sequences and 10,000 different items. Fig. 9 shows the distribution of pattern lengths, and Fig. 10 and 11 show the time and memory usage of the algorithms.

[Fig. 9. Distribution of pattern lengths of data set C200S10T5I2.5]
[Fig. 10. Time usage on data set C200S10T5I2.5]
[Fig. 11. Memory usage on data set C200S10T5I2.5]

When minSup is large (> 0.375%), the algorithms have similar running time. As minSup decreases, the times of PrefixSpan and Spade grow faster than those of UDDAG-bv and UDDAG-co. When minSup is 0.375%, UDDAG-bv (49s) and UDDAG-co (50s) are 4 times faster than PrefixSpan (195s) and more than 2 times faster than Spade (118s). The memory usage of Spade is the highest when minSup is larger than 0.375%. UDDAG-bv and UDDAG-co have similar memory usage as PrefixSpan; however, at low minSup they use more memory due to the extremely large number of patterns in this dataset.

5.2 Experiment Results for Scalability Study
This section studies the impact of the different parameters of the datasets on the performance of each algorithm. The default absolute support threshold is 100.

First we examine the performance of the algorithms with different numbers of sequences (C) under two different minSup settings. Fig. 12 and 13 show the performance of the algorithms when minSup is 100, and Fig. 14 and 15 show the performance when minSup is 400.

[Fig. 12. Time usage on different sequence numbers (minSup=100)]
[Fig. 13. Memory usage on different sequence numbers (minSup=100)]

When minSup is 100, the UDDAG algorithms have similar performance as PrefixSpan for the small datasets (100K and 200K). However, as the datasets get larger, UDDAG outperforms PrefixSpan with growing margins. Spade is the slowest, and it consumes more memory than the other algorithms in most cases.

[Fig. 14. Time usage on different sequence numbers (minSup=400)]
[Fig. 15. Memory usage on different sequence numbers (minSup=400)]

When minSup is 400, UDDAG-bv and UDDAG-co are faster than PrefixSpan by about one order of magnitude, and they outperform Spade by about 3-4 times. UDDAG-bv and UDDAG-co use similar memory as PrefixSpan and less memory than Spade.

Fig. 16 and 17 show the performance of the algorithms on data sets with different numbers of items (N). The time usage of PrefixSpan and Spade grows as N increases; on the contrary, the time usage of the UDDAG approaches generally decreases as N increases. The UDDAG algorithms outperform PrefixSpan by about one order of magnitude on average, and they are 3-4 times faster than Spade. UDDAG-bv and UDDAG-co use similar memory as PrefixSpan and less memory than Spade.

[Fig. 16. Time usage on different number of items]
[Fig. 17. Memory usage on different number of items]

Fig. 18 and 19 show the performance of the algorithms on data sets with different average numbers of transactions in a sequence (S). The time usage of PrefixSpan increases faster than those of the UDDAG algorithms as S increases. The UDDAG algorithms outperform PrefixSpan by about an order of magnitude on average, and outperform Spade by about 2-4 times. UDDAG-bv and UDDAG-co use similar memory as PrefixSpan and less memory than Spade.

[Fig. 18. Time usage on different ave. number of trans. in a sequence]
[Fig. 19. Memory usage on different ave. number of trans. in a sequence]

Fig. 20 and 21 show the performance of the algorithms on data sets with different average numbers of items in a transaction (T). The UDDAG algorithms outperform PrefixSpan by about one order of magnitude and outperform Spade by about 3 times. They use similar memory as PrefixSpan and less memory than Spade when T is 2; however, they use more memory as T grows.

[Fig. 20. Time usage on different ave. number of items in a transaction]
[Fig. 21. Memory usage on different ave. number of items in a transaction]

Fig. 22 and 23 show the performance of the algorithms on data sets with different average numbers of transactions (L) in a sequential pattern. When L is 2, the UDDAG algorithms are slightly faster than PrefixSpan and 2 times faster than Spade. However, when L is 8, they are about an order of magnitude faster than PrefixSpan and 3.5 times faster than Spade. The UDDAG algorithms use similar memory as PrefixSpan and less memory than Spade.

[Fig. 22. Time usage on different ave. number of trans. in a pattern]
[Fig. 23. Memory usage on different ave. number of trans. in a pattern]

Fig. 24 and 25 show the performance of the algorithms on data sets with different average numbers of items (I) in a transaction of patterns. The UDDAG algorithms outperform PrefixSpan by about one order of magnitude on average. When I is small (e.g., < 1.4), they use similar memory as PrefixSpan and less memory than Spade; however, when I is larger, they use more memory due to the extremely large number of patterns.

[Fig. 24. Time usage on different average number of items in a transaction in sequential patterns]
[Fig. 25. Memory usage on different average number of items in a transaction in sequential patterns]

6 DISCUSSION
6.1 Multi-item Frequent Itemset Detection
UDDAG and PrefixSpan take different approaches to detecting FIs with multiple items. PrefixSpan detects multi-item FIs in each projected database while detecting sequential patterns. UDDAG detects all the FIs before pattern detection. Below we examine the impact of this strategy on UDDAG's performance. Table 5 shows the relative time (RT) of FI detection (as well as database transformation) with respect to the total time usage of UDDAG-bv for the tests in Section 5.
(i).5 17 2 14 1.2 Average number of items in the transactions of patterns Fig. 100 Memory (MB) Fig.
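To make the up-front detection strategy concrete, the sketch below first mines all frequent itemsets and then rewrites each transaction as the ids of the FIs it contains, so that the subsequent pattern-growth phase only ever manipulates FI ids. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the paper uses an adapted FP-growth miner, whereas the naive frequent_itemsets enumeration below (and every name in the sketch) is a hypothetical stand-in.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup, max_size=3):
    """Naive stand-in for the FP-growth step: return every itemset
    (up to max_size items) that occurs in at least min_sup sequences."""
    counts = {}
    for seq in db:
        seen = set()
        for trans in seq:
            for r in range(1, max_size + 1):
                for itemset in combinations(sorted(trans), r):
                    seen.add(itemset)
        for itemset in seen:               # count once per sequence
            counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, c in counts.items() if c >= min_sup}

def transform(db, fis):
    """Rewrite each transaction as the sorted ids of the FIs it
    contains (one plausible form of the transformed database)."""
    fi_id = {fi: i for i, fi in enumerate(sorted(fis))}
    transformed = []
    for seq in db:
        new_seq = []
        for trans in seq:
            ids = sorted(fi_id[fi] for fi in fis if set(fi) <= trans)
            if ids:
                new_seq.append(ids)
        transformed.append(new_seq)
    return transformed, fi_id

# A toy sequence database: each sequence is a list of transactions.
db = [[{1, 2}, {3}], [{1, 2}, {1, 3}], [{2}, {3}]]
tdb, fi_id = transform(db, frequent_itemsets(db, min_sup=2))
```

Counting each itemset at most once per sequence matches the sequence-level support definition used throughout the paper.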

AprioriAll [1] adopts a similar solution path, i.e., detecting FIs separately before pattern detection. However, AprioriAll is practically very slow. There are two major reasons:

1) The approach AprioriAll takes for FI detection is very inefficient. AprioriAll uses the Apriori algorithm [2] for FI detection, which is extremely slow compared to state of the art solutions. First, the Apriori approach adopted in [2] may be significantly slower than the FP-growth approach adopted in our algorithms: based on our tests [9] and the FIMI tests [11], Apriori based algorithms are considerably slower than FP-Growth* in many cases (often one or two orders of magnitude slower). Since these tests include the time for writing the detected patterns into a file (which may be significant when a large number of FIs are involved), the actual gain of FP-growth over Apriori may be even higher if file output is not needed (which is the case in this paper). Besides, the Apriori algorithms tested in [9] and [11] are state of the art implementations, which are themselves considerably faster than the original Apriori algorithm implemented in [2]. In addition, in our implementation we have made some adaptations to the existing state of the art FP-growth approach to make it even faster.

2) The original AprioriAll algorithm's candidate generation (by joining and pruning) and support counting (by checking all the sequences for supported patterns) strategies are extremely slow, especially for large databases with many patterns, which contributes to the inefficiency of AprioriAll.
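For readers unfamiliar with the second point, the sketch below gives the flavor of join-and-prune candidate generation over single-item transactions. It is schematic rather than a faithful AprioriAll implementation (which also handles itemsets within transactions); the function and variable names are illustrative only.

```python
def generate_candidates(freq_k):
    """Join two frequent length-k sequences that overlap on k-1
    elements, then prune any candidate that has an infrequent
    length-k subsequence (the Apriori property)."""
    freq_set = set(freq_k)
    candidates = []
    for a in freq_k:
        for b in freq_k:
            if a[1:] == b[:-1]:                      # join step
                cand = a + b[-1:]
                if all(cand[:i] + cand[i + 1:] in freq_set
                       for i in range(len(cand))):   # prune step
                    candidates.append(cand)
    return candidates

freq_2 = [(1, 2), (2, 3), (1, 3)]
print(generate_candidates(freq_2))    # [(1, 2, 3)]
```

Every surviving candidate must then be support-counted by scanning all sequences, and the generate-and-count cycle repeats once per pattern length, which is why the strategy degrades badly on large databases with long patterns.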
6.2 Time Complexity
Altogether, multi-item FI detection, database projection, and pattern detection account for the major time usage of our approach.

1) Multi-item FI detection. As discussed in Section 6.1, the time UDDAG uses for FI detection is around 10% of the total time, which is insignificant to the overall performance of UDDAG.

2) Database projection. The goal of database projection is to find the occurrence information of FIs in a projected database and to further derive a projected database for each FI. The time for database projection is proportional to the total time spent checking items in projected databases. Given a sequence, if the longest pattern in the sequence has length M, then with PrefixSpan the maximal level of projections we may have on the sequence is M; in fact, the total number of projection levels is always M when detecting a length-M pattern. This means each item in the sequence is checked at most M times (on average much less). Given a database, let L be the average length of detected patterns; then on average an item is checked at most L times. Since the total number of items is CST, the total number of item instances we check is at most LCST, and the projection time of PrefixSpan is O(LCST). The projection complexity of UDDAG is similar in the worst case. However, as discussed in Section 3, UDDAG has fewer levels of projections than PrefixSpan on average: the minimal number of projection levels needed to detect a length-M pattern is about log2M+1, so practically the projection time of UDDAG is close to O((logL)CST).

3) Pattern detection. Since PrefixSpan does not generate candidate patterns, the cost of pattern detection in PrefixSpan is limited to FI counting in each projected database, which is an advantage over UDDAG. The major cost of pattern detection in UDDAG is the evaluation of candidate patterns of case 3, and different approaches such as UDDAG-bv and UDDAG-co may have different efficiency here. As the average length of patterns becomes longer, the number of children vertex candidates in UDDAG also becomes larger. However, UDDAG practically performs better for the following reasons: a) The special data structure eliminates unnecessary candidates. Lemmas 4 and 5 state that the validity of children vertex candidates can be inferred from that of parent vertex candidates, which helps to eliminate unnecessary candidate checking; the longer L is, the more effective the evaluation of case 3 candidates becomes. b) Projected databases shrink much faster than in PrefixSpan, which leads to more efficient database projection and FI counting. With UDDAG, at each level of recursion we project a database into a prefix and a suffix projected database, and each sequence in these projected databases has, on average, half the length of the original sequence. Thus, given a level 1 projected database with C sequences of average length T, at level k the average sequence length is T/2^(k-1) in UDDAG, while in PrefixSpan it is T-(T/L)k. Therefore, the average number of instances of FIs in a level-k projected database is much smaller in UDDAG than in PrefixSpan.
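The two bounds above are easy to check numerically. The following sketch contrasts the projection depth and the per-level average sequence length of the two growth strategies under the idealized assumptions of the analysis (a length-M pattern, level 1 sequences of average length T); it illustrates the formulas, not either implementation.

```python
import math

def prefixspan_levels(M):
    # uni-directional growth adds one item per level of projection
    return M

def uddag_best_levels(M):
    # bidirectional growth can roughly double pattern length per
    # level, so a length-M pattern needs about log2(M) + 1 levels
    return math.floor(math.log2(M)) + 1

def avg_len_prefixspan(T, L, k):
    return T - (T / L) * k        # shrinks linearly per level

def avg_len_uddag(T, k):
    return T / 2 ** (k - 1)       # halves at every level

T, L, M = 64.0, 8, 8
print(prefixspan_levels(M), uddag_best_levels(M))   # 8 vs. 4
for k in range(1, uddag_best_levels(M) + 1):
    print(k, avg_len_prefixspan(T, L, k), avg_len_uddag(T, k))
```

For M = 8 the best case needs 4 levels of projection instead of 8, and by level 4 the bidirectionally projected sequences are already four times shorter than their uni-directional counterparts.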

The above analysis is verified by our experimental results. Figures 12, 14, 18, and 20 show that the processing times of both UDDAG and PrefixSpan scale up quasi-linearly when C, S, and T increase. However, when L increases, PrefixSpan scales up almost linearly with L, while the UDDAG based approaches scale up much more slowly (close to O(logL)); this is clearly shown in Fig. 22. In addition, when minSup is large enough that the average pattern length is close to 1, the problem of sequential pattern mining degrades into a frequent item counting problem, and PrefixSpan and UDDAG have similar performance; their time usages are very close, as shown in Figures 4, 10, 18, and 22 for large minSup values/small pattern lengths.

6.3 Space Complexity
In UDDAG, the problem of finding all the patterns in a database is partitioned into finding the subsets of patterns defined in Lemma 1. Thus the maximal memory usage for finding all the patterns is max(M1, M2, ..., Mt), where Mi is the maximal memory usage for detecting subset i. Mi is mainly used to store the projected databases and the UpDown DAG.

For projected databases, using pseudo-projection we store, at each level of projection, the beginning and ending positions as well as the sequence ids. If the length of the longest pattern is M, the maximal level of projections is M, and therefore the maximal number of integers we need to recursively store is 3CM. The actual memory usage may be much smaller because 1) the size of projected databases gets smaller as the recursion level increases, and 2) the total number of levels actually projected may be much smaller than M.
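To make the 3CM bound concrete, the sketch below represents each projected sequence by three integers (sequence id, begin, end) instead of a physical copy, so a chain of M recursive projections over C sequences stores at most 3CM integers. This is one plausible reading of the pseudo-projection description, simplified to single-item positions; the type and function names are illustrative, not the paper's code.

```python
from typing import List, NamedTuple

class Span(NamedTuple):
    seq_id: int   # index of the sequence in the database
    begin: int    # first position of the projected part
    end: int      # one past the last position

def pseudo_project(db: List[List[int]], spans: List[Span],
                   item: int) -> List[Span]:
    """Project on `item`: for each live span keep what follows the
    first occurrence of `item`, i.e., three integers per sequence,
    with no copying of the underlying data."""
    out = []
    for s in spans:
        seq = db[s.seq_id]
        for pos in range(s.begin, s.end):
            if seq[pos] == item:
                out.append(Span(s.seq_id, pos + 1, s.end))
                break
    return out

db = [[1, 2, 3, 2], [2, 3, 1]]
level1 = [Span(i, 0, len(s)) for i, s in enumerate(db)]
level2 = pseudo_project(db, level1, 2)  # [Span(0, 2, 4), Span(1, 1, 3)]
```

A prefix projection would symmetrically keep the part before the occurrence; either way, each additional level costs at most three integers per surviving sequence.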
The cost of storing an UpDown DAG is proportional to the maximal number of patterns in a subset, which determines the total number of vertexes. Generally this cost is relatively small compared to storing the databases. However, if the number of patterns is extremely large, this cost may also increase significantly, as shown in Fig. 11 (where the number and support of multi-item FIs increase dramatically as minSup decreases) and Fig. 25 (where the average length of FIs increases).

In addition, we also need to store the transformed database during the whole pattern mining process. The size of the transformed database is decided by the size of the original database as well as the characteristics of the FIs: if the average length of FIs is small, it is close to the size of the original database, and it increases as the average length, total number, and support of multi-item FIs increase. PrefixSpan does not need additional space for UDDAG, and it needs less space for storing the whole database because it stores the original database instead of the transformed one; however, it may need more memory for projected databases because of more levels of projection.

Overall, the memory usage of UDDAG is comparable to that of PrefixSpan, as shown in Figures 5, 13, 15, 17, 19, 21, and 23, and UDDAG generally uses less memory than Spade and LapinSpam. UDDAG may use more memory than PrefixSpan in extreme cases, when a significant number of patterns exist in a subset or when the average length of FIs is large and the number/support of multi-item FIs is extremely big, as shown in Figures 11 and 25.

One additional fact is that the partitioning feature of UDDAG may cause a jitter effect on memory usage in the scalability tests: because each testing dataset is generated independently, the largest subset of patterns in a dataset may be smaller or larger than those of its neighboring datasets, which results in smaller or larger memory consumption. Fig. 23, the memory usage for different average numbers of transactions in a pattern (L), shows an example of such an effect: when L=5, the memory usage of UDDAG-co (23.6 MB) is higher than for the datasets with L=4 (16.7 MB) and L=6 (20.9 MB). Similar observations can be found in the other scalability tests.

7 CONCLUSIONS AND FUTURE WORK
In this paper a novel data structure, UDDAG, is invented for efficient pattern mining. The new approach grows patterns from both ends (prefixes and suffixes) of detected patterns, which results in faster pattern growth because of fewer levels of database projection compared to traditional approaches.

Extensive comparative and scalability experiments have been performed to evaluate the proposed algorithm. In terms of time efficiency, when minSup is very large such that the average length of patterns is close to 1, UDDAG and PrefixSpan have similar performance, because in this case the problem becomes a simple frequent item counting problem (practically not interesting for sequential pattern mining); e.g., the time usages of PrefixSpan and UDDAG are very close when the average pattern lengths are 1.44 and 1.60 (minSup values of 3% and 2.5%, respectively). However, UDDAG scales up much better, often outperforming PrefixSpan by one order of magnitude in our scalability tests. Experiments also show that UDDAG is considerably faster than two other representative algorithms, Spade and LapinSpam, and that it demonstrates satisfactory scale-up properties with respect to various parameters such as the total number of sequences and the average length of sequences. The memory usage of the UDDAG based approaches is generally comparable to that of PrefixSpan and less than that of Spade and LapinSpam, although they may use more memory in extreme cases when a significant number of patterns exist in a subset or when the average length of FIs is large and the number/support of multi-item FIs is extremely big.

One major feature of UDDAG is that it supports efficient pruning of invalid candidates, which makes it a promising approach for applications that involve searching in large spaces. Thus it has great potential in related areas of data mining and artificial intelligence.

In the future we expect to further improve UDDAG based pattern mining as follows: 1) Currently FI detection is independent from pattern mining, although the knowledge gained from FI detection may be useful for pattern mining. In the future we will integrate the two solutions so that they can benefit from each other.

2) Different candidate verification strategies may have different impacts on the efficiency of the algorithm. In the future we will study more efficient verification strategies. 3) UDDAG has a big impact on memory usage when the number of patterns in a subset is extremely large. In the future we will find an efficient way to store UDDAG.

We also expect to extend the UDDAG based approach to other areas where large searching spaces are involved and pruning of the searching space is necessary, and to other types of sequential pattern mining problems, e.g., closed and maximal pattern mining, approximate pattern mining, mining with constraints, and domain-specific pattern mining.

ACKNOWLEDGMENT
The author is grateful for the insightful comments of the anonymous reviewers, and thanks Ping Zhong, Terry Cook, and Anne Moroney for their help on the draft. This work was supported in part by a PSC-CUNY Research Grant (PSCREG-38-892) and a Queens College Research Enhancement Grant.

REFERENCES
[1] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. ICDE '95, pp. 3-14, 1995.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.
[3] C. Antunes and A.L. Oliveira, "Generalization of Pattern-Growth Methods for Sequential Pattern Mining with Gap Constraints," Proc. Int'l Conf. Machine Learning and Data Mining, pp. 239-251, 2003.
[4] C. Antunes and A.L. Oliveira, "Sequential Pattern Mining Algorithms: Trade-offs between Speed and Memory," Proc. 2nd Int'l Workshop Mining Graphs, Trees and Sequences, 2004.
[5] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, "Sequential PAttern Mining Using a Bitmap Representation," Proc. Knowledge Discovery and Data Mining, pp. 429-435, 2002.
[6] S. Berkovich, G. Lapir, and M. Mack, "A Bit-Counting Algorithm Using the Frequency Division Principle," Software: Practice and Experience, vol. 30, no. 14, pp. 1531-1540, 2000.
[7] C. Bettini, X.S. Wang, and S. Jajodia, "Mining Temporal Relationships with Multiple Granularities in Time Sequences," Data Eng. Bull., vol. 21, pp. 32-38, 1998.
[8] J. Chen and T. Cook, "Mining Contiguous Sequential Patterns from Web Logs," Proc. WWW 2007 Poster Session, Banff, Alberta, Canada, May 8-12, 2007.
[9] J. Chen and K. Xiao, "BISC: A Binary Itemset Support Counting Approach towards Efficient Frequent Itemset Mining," submitted to ACM Trans. Knowledge Discovery from Data.
[10] D. Chiu, Y. Wu, and A. Chen, "An Efficient Algorithm for Mining Frequent Sequences by a New Strategy without Support Counting," Proc. ICDE 2004.
[11] G. Grahne and J. Zhu, "Efficiently Using Prefix-Trees in Mining Frequent Itemsets," Proc. ICDM 2003 Workshop on Frequent Itemset Mining Implementations (FIMI '03), 2003.
[12] M. Garofalakis, R. Rastogi, and K. Shim, "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints," Proc. VLDB '99, 1999.
[13] B. Goethals and M.J. Zaki, "FIMI '03: Workshop on Frequent Itemset Mining Implementations," 2003.
[14] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining," Proc. SIGKDD 2000, pp. 355-359, 2000.
[15] M.Y. Lin and S.Y. Lee, "Fast Discovery of Sequential Patterns through Memory Indexing and Database Partitioning," J. Information Science and Engineering, vol. 21, pp. 109-128, 2005.
[16] H. Mannila, H. Toivonen, and A.I. Verkamo, "Discovery of Frequent Episodes in Event Sequences," Data Mining and Knowledge Discovery, vol. 1, pp. 259-289, 1997.
[17] F. Masseglia, F. Cathala, and P. Poncelet, "The PSP Approach for Mining Sequential Patterns," Proc. European Symp. Principles of Data Mining and Knowledge Discovery, pp. 176-184, 1998.
[18] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Proc. 2001 Int'l Conf. Data Eng. (ICDE '01), pp. 215-224, 2001.
[19] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu, "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1424-1440, 2004.
[20] E.M. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall, 1977.
[21] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. Extending Database Technology, pp. 3-17, 1996.
[22] K. Wang, Y. Xu, and J.X. Yu, "Scalable Sequential Pattern Mining for Biological Sequences," Proc. 2004 ACM Int'l Conf. Information and Knowledge Management, pp. 178-187, 2004.
[23] Z. Yang and M. Kitsuregawa, "LAPIN-SPAM: An Improved Algorithm for Mining Sequential Pattern," Proc. SWOD '05, 2005.
[24] Z. Yang, Y. Wang, and M. Kitsuregawa, "Effective Sequential Pattern Mining Algorithms for Dense Database," Proc. Japanese National Data Engineering Workshop (DEWS '06), 2006.
[25] Asanuma, Kodama, and Takata, "Mining Sequential Patterns More Efficiently by Reducing the Cost of Scanning Sequence Databases," IPSJ Trans. Databases, vol. 47, 2006.
[26] M.J. Zaki, "Spade: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, vol. 42, pp. 31-60, 2001.

Jinlin Chen received the PhD degree in automatic control in 1999 and the Bachelor of Engineering and Bachelor of Economics degrees in 1994, all from Tsinghua University, China. He is a faculty member in the Computer Science Department, Queens College, the City University of New York. Previously, he was a visiting professor at the University of Pittsburgh and a researcher at Microsoft Research Asia. His research interests include web information modeling and processing, information retrieval, and data mining. He is a member of the IEEE and the ACM.