4 views

Uploaded by deepak1892

data mining

- Abstract Reasoning
- Ch 6 - Sequence & Series
- Zucker.games
- Neutrosophic Association Rule Mining Algorithm for Big Data Analysis
- Web Mining
- 1.Funtions
- 1st Summative Test Grade 10
- Reusable Test Case - Validation 1
- The Application of Computational Mechanics to the Analysis of Geomagnetic Data
- EXERCISE-1.docx
- Limit of Sequences 2.pdf
- dmkd07_frequentpattern
- 2nd PUC Mathematics March 2016
- Optimization of Association Rule Using Heuristic Approach
- DWM Question Bank
- 2018
- Section_8.2.pdf
- 16
- Bolzano-Weierstrass
- DM

You are on page 1of 17

The PrefixSpan Approach

Jian Pei, Member, IEEE Computer Society, Jiawei Han, Senior Member, IEEE,

Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen,

Umeshwar Dayal, Member, IEEE Computer Society, and Mei-Chun Hsu

AbstractSequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult

problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of

the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [1] to

reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases

having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for

efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected

databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an

initial study of the pattern growth-based sequential pattern mining, FreeSpan [8], we propose a more efficient method, called

PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection

technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a

priori-based algorithm GSP, FreeSpan, and SPADE [29] (a sequential pattern mining algorithm that adopts vertical data format), and

PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can

be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead

to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.

Index TermsData mining algorithm, sequential pattern, frequent pattern, transaction database, sequence database, scalability,

performance analysis.

1 INTRODUCTION

subsequences as patterns in a sequence database, is an

important data mining problem with broad applications,

Many previous studies contributed to the efficient

mining of sequential patterns or other frequent patterns in

time-related data [2], [23], [13], [24], [30], [14], [12], [5], [16],

including the analysis of customer purchase patterns or [22], [7]. Srikant and Agrawal [23] generalized their

Web access patterns, the analysis of sequencing or time- definition of sequential patterns in [2] to include time

related processes such as scientific experiments, natural constraints, sliding time window, and user-defined taxon-

disasters, and disease treatments, the analysis of DNA

omy, and presented an a priori-based, improved algorithm

sequences, etc.

GSP (i.e., generalized sequential patterns). Mannila et al. [13]

The sequential pattern mining problem was first intro-

duced by Agrawal and Srikant in [2]: Given a set of sequences, presented a problem of mining frequent episodes in a

where each sequence consists of a list of elements and each element sequence of events, where episodes are essentially acyclic

consists of a set of items, and given a user-specified min_support graphs of events whose edges specify the temporal

threshold, sequential pattern mining is to find all frequent precedent-subsequent relationship without restriction on

subsequences, i.e., the subsequences whose occurrence frequency interval. Bettini et al. [5] considered a generalization of

in the set of sequences is no less than min_support. intertransaction association rules. These are essentially rules

whose left-hand and right-hand sides are episodes with

time-interval restrictions. Lu et al. [12] proposed intertran-

. J. Pei is with the School of Computing Science, Simon Fraser University,

Canada. E-mail: jpei@cse.sfu.edu. saction association rules that are implication rules whose

. J. Han is with the Department of Computer Science, University of Illinois two sides are totally-ordered episodes with timing-interval

at Urbana-Champaign, Urbana, IL 61801. E-mail: hanj@cs.uiuc.edu. restrictions. Guha et al. [6] proposed the use of regular

. J. Wang is with the University of Minnesota, Minneapolis, MN 55455.

E-mail: jianyong@cs.umn.edu. expressions as a flexible constraint specification tool that

. B. Mortazavi-Asl and H. Pinto are with the School of Computing Science, enables user-controlled focus to be incorporated into the

Simon Fraser University, Canada. E-mail: {mortazav, hlpinto}@cs.sfu.ca. sequential pattern mining process. Some other studies

. Q. Chen is with PacketMotion Inc., San Mateo, CA 94403.

. E-mail: qchen@packetmotion.com. extended the scope from mining sequential patterns to

. U. Dayal is with Hewlett-Packard Labs., 1501 Page Mill Road, Palo Alto, mining partial periodic patterns. O zden et al. [16] intro-

CA 94303. E-mail: dayal@hpl.hp.com. duced cyclic association rules that are essentially partial

. M.-C. Hsu is with Commerce One Inc., San Francisco, CA 94105.

E-mail: meichun.hsu@commerceone.com.

periodic patterns with perfect periodicity in the sense that

each pattern reoccurs in every cycle, with 100 percent

Manuscript received 6 Feb. 2003; revised 30 July 2003; accepted 9 Oct. 2003.

For information on obtaining reprints of this article, please send e-mail to: confidence. Han et al. [7] developed a frequent pattern

tkde@computer.org, and reference IEEECS Log Number 118248. mining method for mining partial periodicity patterns that

1041-4347/04/$20.00 2004 IEEE Published by the IEEE Computer Society

PEI ET AL.: MINING SEQUENTIAL PATTERNS BY PATTERN-GROWTH: THE PREFIXSPAN APPROACH 1425

are frequent maximal patterns where each pattern appears database at least l times. This bears a nontrivial cost

in a fixed period with a fixed set of offsets and with when long patterns exist.

sufficient support. . The a priori-based method generates a combinato-

Almost all of the above proposed methods for mining rially explosive number of candidates when

sequential patterns and other time-related frequent patterns mining long sequential patterns. A long sequential

are a priori-like, i.e., based on the a priori principle, which pattern contains a combinatorial explosive number

states the fact that any super-pattern of an infrequent pattern of subsequences, and such subsequences must be

cannot be frequent, and based on a candidate generation-and- generated and tested in the a priori-based mining.

test paradigm proposed in association mining [1]. Thus, the number of candidate sequences is expo-

A typical a priori-like sequential pattern mining method, nential to the length of the sequential patterns to be

mined. For example, let the database contain only

such as GSP [23], adopts a multiple-pass, candidate

one single sequence of length 100, ha1 a2 . . . a100 i, and

generation-and-test approach outlined as follows: The first

the min_support threshold be 1 (i.e., every occurring

scan finds all of the frequent items that form the set of single

pattern is frequent). To (re)derive this length-100

item frequent sequences. Each subsequent pass starts with a sequential pattern, the a priori-based method has to

seed set of sequential patterns, which is the set of sequential generate 100 length-1 candidate sequences (i.e., ha1 i,

patterns found in the previous pass. This seed set is used to ha2 i; . . . ; ha100 i), 100 100 10099

2 14; 950 length-2

generate new potential patterns, called candidate sequences, 100

candidate sequences, 3 161; 700 length-3 candi-

based on the a priori principle. Each candidate sequence date sequences,1 and so on. Obviously, the total

contains one more item than a seed sequential pattern, number of candidate sequences to be generated is

100

where each element in the pattern may contain one item or 100

i1 i 2100 1 1030 .

multiple items. The number of items in a sequence is called In many applications, such as DNA analysis or stock

the length of the sequence. So, all the candidate sequences sequence analysis, the databases often contain a large

in a pass will have the same length. The scan of the database number of sequential patterns and many patterns are long.

in one pass finds the support for each candidate sequence. It is important to reexamine the sequential pattern mining

All the candidates with support no less than min_support in problem to explore more efficient and scalable methods.

the database form the set of the newly found sequential Based on our observation, both the thrust and the

patterns. This set is then used as the seed set for the next bottleneck of an a priori-based sequential pattern mining

pass. The algorithm terminates when no new sequential method come from its step-wise candidate sequence

pattern is found in a pass, or when no candidate sequence generation and test. Can we develop a method that may

can be generated. absorb the spirit of a priori, but avoid or substantially

The a priori-like sequential pattern mining method, reduce the expensive candidate generation and test? This is

though reducing search space, bears three nontrivial, the motivation of this study.

inherent costs that are independent of detailed implementa- In this paper, we systematically explore a pattern-growth

tion techniques. approach for efficient mining of sequential patterns in large

sequence database. The approach adopts a divide-and-

. A huge set of candidate sequences could be conquer, pattern-growth principle as follows: Sequence

generated in a large sequence database. Since the databases are recursively projected into a set of smaller projected

set of candidate sequences includes all the possible databases based on the current sequential pattern(s), and

permutations of the elements and repetition of items sequential patterns are grown in each projected databases by

in a sequence, the a priori-based method may exploring only locally frequent fragments. Based on this

generate a really large set of candidate sequences philosophy, we first proposed a straightforward pattern

growth method, FreeSpan (for Frequent pattern-projected

even for a moderate seed set. For example, two

Sequential pattern mining) [8], which reduces the efforts of

frequent sequences of length-1, hai and hbi, will

candidate subsequence generation. In this paper, we

generate five candidate sequences of length-2: haai,

introduce another and more efficient method, called

habi, hbai, hbbi, and habi, where habi represents that PrefixSpan (for Prefix-projected Sequential pattern mining),

two events, a and b, happen in the same time slot. If which offers ordered growth and reduced projected

there are 1,000 frequent sequences of length-1, such databases. To further improve the performance, a pseudo-

as ha1 i; ha2 i; . . . ; ha1000 i, an a priori-like algorithm will projection technique is developed in PrefixSpan. A compre-

generate 1; 000 1; 000 1;000999 2 1; 499; 500 candi- hensive performance study shows that PrefixSpan, in most

date sequences. Notice that the cost of candidate cases, outperforms the a priori-based algorithm GSP,

sequence generation, test, and support counting is FreeSpan, and SPADE [29] (a sequential pattern mining

inherent to the a priori-based method, no matter algorithm that adopts vertical data format) and PrefixSpan,

what technique is applied to optimize its detailed integrated with pseudoprojection, is the fastest among all

implementation. the tested algorithms. Furthermore, our experiments show

. Multiple scans of databases in mining. The length that PrefixSpan consumes a much smaller memory space in

of each candidate sequence grows by one at each

1. Notice that a priori does cut a substantial amount of search space.

database scan. In general, to find a sequential pattern Otherwise, the number of length-3 candidate sequences would have been

of length l, the a priori-based method must scan the 100 100 100 100 100 99 1009998

32 2; 151; 700.

1426 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

A Sequence Database A sequential pattern with length l is called an l-pattern.

Example 1 (Running Example). Let our running sequence

database be S given in Table 1 and min support 2. The

set of items in the database is fa; b; c; d; e; f; gg.

A sequence haabcacdcfi has five elements: a,

abc, ac, d, and cf, where items a and c appear

more than once, respectively, in different elements. It is a

9-sequence since there are nine instances appearing in that

sequence. Item a happens three times in this sequence, so

comparison with GSP and SPADE. This pattern-growth it contributes 3 to the length of the sequence. However,

the whole sequence haabcacdcfi contributes only 1

methodology can be further extended to mining multilevel,

to the support of hai. Also, sequence habcdfi is a

multidimensional sequential patterns, and mining other

subsequence of haabcacdcfi. Since both sequences 10

structured patterns.

and 30 contain subsequence s habci, s is a sequential

The remainder of the paper is organized as follows: In

pattern of length 3 (i.e., 3-pattern).

Section 2, the sequential pattern mining problem is defined,

and the a priori-based sequential pattern mining method,

Problem Statement. Given a sequence database and the

GSP, is illustrated. In Section 3, our approach, projection-

min_support threshold, sequential pattern mining is to find

based sequential pattern growth, is introduced, by first

summarizing FreeSpan, and then presenting PrefixSpan, the complete set of sequential patterns in the database.

associated with a pseudoprojection technique for perfor-

2.2 Algorithm GSP

mance improvement. Our experimental results and perfor-

As outlined in Section 1, a typical sequential pattern mining

mance analysis are reported in Section 4. Some extensions

of the method and future research issues are discussed in method, GSP [23], mines sequential patterns by adopting a

Section 5, and our study is concluded in Section 6. candidate subsequence generation-and-test approach,

based on the a priori principle. The method is illustrated

using the following example:

2 PROBLEM DEFINITION AND THE GSP ALGORITHM

Example 2 (GSP). Given the database S and min_support in

In this section, the problem of sequential pattern mining is

Example 1, GSP first scans S, collects the support for each

defined, and the most representative a priori-based sequen-

tial pattern mining method, GSP [23], is illustrated using an item, and finds the set of frequent items, i.e., frequent

example. length-1 subsequences (in the form of item:support):

hai : 4; hbi : 4; hci : 3; hdi : 3; hei : 3; hfi : 3; hgi : 1.

2.1 Problem Definition By filtering the infrequent item g, we obtain the first

Let I fi1 ; i2 ; . . . ; in g be a set of all items. An itemset is a seed set L1 fhai; hbi; hci; hdi; hei; hfig, each member in

subset of items. A sequence is an ordered list of itemsets. A the set representing a 1-element sequential pattern. Each

sequence s is denoted by hs1 s2 sl i, where sj is an itemset. subsequent pass starts with the seed set found in the

sj is also called an element of the sequence, and denoted as previous pass and uses it to generate new potential

x1 x2 xm , where xk is an item. For brevity, the brackets sequential patterns, called candidate sequences.

are omitted if an element has only one item, i.e., element x For L1 , a set of 6 length-1 sequential patterns

is written as x. An item can occur at most once in an generates a set of 6 6 65 2 51 candidate sequences,

element of a sequence, but can occur multiple times in

different elements of a sequence. The number of instances C2 fhaai; habi; . . . ; hafi; hbai; hbbi; . . . ; hffi;

of items in a sequence is called the length of the sequence. A habi; haci; . . . ; hefig:

sequence with length l is called an l-sequence. A sequence

The multiscan mining process is shown in Fig. 1. The

ha1 a2 an i is called a subsequence of another sequence

set of candidates is generated by a self-join of the

hb1 b2 bm i and a supersequence of , denoted as

sequential patterns found in the previous pass. In the

v , if there exist integers 1 j1 < j2 < < jn m such kth pass, a sequence is a candidate only if each of its

that a1 bj1 ; a2 bj2 ; . . . ; an bjn . length-k 1 subsequences is a sequential pattern found

A sequence database S is a set of tuples hsid; si, where at the k 1th pass. A new scan of the database collects

sid is a sequence_id and s a sequence. A tuple hsid; si is the support for each candidate sequence and finds the

said to contain a sequence , if is a subsequence of s. The new set of sequential patterns. This set becomes the seed

support of a sequence in a sequence database S is the for the next pass. The algorithm terminates when no

number of tuples in the database containing , i.e., sequential pattern is found in a pass, or when there is no

supportS j fhsid; sijhsid; si 2 S ^ v sg j : candidate sequence generated. Clearly, the number of

scans is at least the maximum length of sequential

It can be denoted as support if the sequence database is patterns. It needs one more scan if the sequential

clear from the context. Given a positive integer min_support as patterns obtained in the last scan still generate new

the support threshold, a sequence is called a sequential candidates.

PEI ET AL.: MINING SEQUENTIAL PATTERNS BY PATTERN-GROWTH: THE PREFIXSPAN APPROACH 1427

GSP, though benefits from the a priori pruning, still repeatedly scanning the entire database and generating and

generates a large number of candidates. In this example, testing large sets of candidate sequences, one can recursively

6 length-1 sequential patterns generate 51 length- project a sequence database into a set of smaller databases

2 candidates, 22 length-2 sequential patterns generate associated with the set of patterns mined so far and, then, mine

64 length-3 candidates, etc. Some candidates generated locally frequent patterns in each projected database.

by GSP may not appear in the database at all. For In this section, we first outline a projection-based sequen-

example, 13 out of 64 length-3 candidates do not appear tial pattern mining method, called FreeSpan [8], and then

in the database. systematically introduce an improved method PrefixSpan

[19]. Both methods generate projected databases, but they

differ at the criteria of database projection: FreeSpan creates

3 MINING SEQUENTIAL PATTERNS BY PATTERN

projected databases based on the current set of frequent

GROWTH patterns without a particular ordering (i.e., growth direction),

As analyzed in Section 1 and Example 2, the GSP algorithm whereas PrefixSpan projects databases by growing frequent

shares similar strengths and weaknesses as the a priori prefixes. Our study shows that, although both FreeSpan and

method. For frequent pattern mining, a frequent pattern PrefixSpan are efficient and scalable, PrefixSpan is substan-

growth method called FP-growth [9] has been developed for tially faster than FreeSpan in most sequence databases.

efficient mining of frequent patterns without candidate

3.1 FreeSpan: Frequent Pattern-Projected

generation. The method uses a data structure called FP-tree

Sequential Pattern Mining

to store compressed frequent patterns in transaction

For a sequence hs1 sl i, the itemset s1 [ [ sl is

database and recursively mines the projected conditional

called s projected itemset. FreeSpan is based on the

FP-trees to achieve high performance.

following property: If an itemset X is infrequent, any sequence

Can we mine sequential patterns by extension of the FP-tree

whose projected itemset is a superset of X cannot be a sequential

structure? Unfortunately, the answer cannot be so optimistic

pattern. FreeSpan mines sequential patterns by partitioning

because it is easy to explore the sharing among a set of

the search space and projecting the sequence subdatabases

unordered items, but it is difficult to explore the sharing of

recursively based on the projected itemsets.

common data structures among a set of ordered items. For Let f list hx1 ; . . . ; xn i be a list of all frequent items in

example, a set of frequent itemsets fabc; cbad; ebadc; cadbg sequence database S. Then, the complete set of sequential

share the same tree branch habcdei in the FP-tree. However, if patterns in S can be divided into n disjoint subsets: 1) the

they were a set of sequences, there is no common prefix set of sequential patterns containing only item x1 , 2) those

subtree structure that can be shared among them because one containing item x2 but no item in fx3 ; . . . ; xn g, and so on. In

cannot change the order of items to form sharable prefix general, the ith subset 1 i n is the set of sequential

subsequences. patterns containing item xi but no item in fxi1 ; . . . ; xn g.

Nevertheless, one can explore the spirit of FP-growth: Then, the database projection can be performed as

divide the sequential patterns to be mined based on the follows: At the time of deriving ps projected database from

subsequences obtained so far and project the sequence DB, the set of frequent items X of DB is already known.

database based on the partition of such patterns. Such a Only those items in X will need to be projected into ps

methodology is called sequential pattern mining by pattern- projected database. This effectively discards irrelevant

growth. The general idea is outlined as follows: Instead of information and keeps the size of the projected database

1428 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

minimal. By recursively doing so, one can mine the mining of the other three projected databases

projected databases and generate the complete set of terminates without generating any length-4 pat-

sequential patterns in the given partition without duplica- terns for the haci-projected database.

tion. The details are illustrated in the following example: . Mining other subsets of sequential patterns.

Other subsets of sequential patterns can be mined

Example 3 (FreeSpan). Given the database S and min support

similarly on their corresponding projected data-

in Example 1, FreeSpan first scans S, collects the support for

bases. This mining process proceeds recursively,

each item, and finds the set of frequent items. This step is

which derives the complete set of sequential

similar to GSP. Frequent items are listed in support

patterns.

descending order (in the form of item : support), that is,

f list a : 4; b : 4; c : 4; d : 3; e : 3; f : 3. They form six The detailed presentation of the FreeSpan algorithm, the

length-one sequential patterns: proof of its completeness and correctness, and the perfor-

mance study of the algorithm are in [8]. By the analysis of

hai : 4; hbi : 4; hci : 4; hdi : 3; hei : 3; hfi : 3: Example 3, we have the following observations on the

strength and weakness of FreeSpan, which are also verified

According to the f list, the complete set of sequential

by our later experimental study.

patterns in S can be divided into six disjoint subsets:

On the one hand, the strength of FreeSpan is that it searches

1. the ones containing only item a, a smaller projected database than GSP in each subsequent

2. the ones containing item b but no item after b in database projection. This is because FreeSpan projects a large

f list, sequence database recursively into a set of small projected

3. the ones containing item c but no item after c in sequence databases based on the currently mined frequent

f list, and so on, and, finally, item-patterns, and the subsequent mining is confined to

4. the ones containing item f. each projected database relevant to a smaller set of

The sequential patterns related to the six partitioned candidates.

subsets can be mined by constructing six projected On the other hand, the major overhead of FreeSpan is that it

databases (obtained by one additional scan of the original may have to generate many nontrivial projected databases. If a

database). Infrequent items, such as g in this example, are pattern appears in each sequence of a database, its projected

removed from the projected databases. The process for database does not shrink (except for the removal of some

mining each projected database is detailed as follows: infrequent items). For example, the ffg-projected database

in this example contains three of the same sequences as that

. Mining sequential patterns containing only in the original sequence database, except for the removal of

item a. By mining the hai-projected database: the infrequent item g in sequence 40. Moreover, since a

fhaaai; haai; hai; haig, only one additional sequen- length-k subsequence may grow at any position, the search

tial pattern containing only item a, i.e., haai : 2, is for length-k 1 candidate sequence will need to check

found. every possible combination, which is costly.

. Mining sequential patterns containing item b but

no item after b in the f list. By mining the hbi- 3.2 PrefixSpan: Prefix-Projected Sequential

projected database: fhaabai; habai; habbi; habig, Patterns Mining

four additional sequential patterns containing item Based on the analysis of the FreeSpan algorithm, one can see

b but no item after b in f list are found. They are that one may still have to pay high cost at handling

fhabi : 4; hbai : 2; habi : 2; habai : 2g. projected databases. Is it possible to reduce the size of projected

. Mining sequential patterns containing item c but database and the cost of checking at every possible position of a

no item after c in the f list. Mining the hci-projected potential candidate sequence? To avoid checking every

database: fhaabcacci; hacbcai; habcbi; hacbcig, possible combination of a potential candidate sequence,

proceeds as follows: one can first fix the order of items within each element. Since

One scan of the projected database generates items within an element of a sequence can be listed in any

the set of length-2 frequent sequences, which are order, without loss of generality, one can assume that they

fhaci : 4; hbci : 2; hbci : 3; hcci : 3; hcai : 2; hcbi : 3g. are always listed alphabetically. For example, the sequence

One additional scan of the hci-projected database in S with Sequence_id 10 in our running example is listed as

generates all of its projected databases. haabcacdcfi instead of habaccadfci. With such a

The mining of the haci-projected database: convention, the expression of a sequence is unique.

fhaabcacci; hacbcai; habcbi; hacbcig generates Then, we examine whether one can fix the order of item

the set of length-3 patterns as follows: projection in the generation of a projected database.

Intuitively, if one follows the order of the prefix of a

fhacbi : 3; hacci : 3; habci : 2; hacai : 2g: sequence and projects only the suffix of a sequence, one can

Four projected database will be generated from examine in an orderly manner all the possible subsequences

them. and their associated projected database. Thus, we first

The mining of the first one, the hacbi-projected introduce the concept of prefix and suffix.

database: fhacbcai; habcbi; hacbcig generates no Definition 1 (Prefix). Suppose all the items within an element

length-4 pattern. The mining along this line are listed alphabetically. Given a sequence he1 e2 en i

terminates. Similarly, we can show that the (where each ei corresponds to a frequent element in S), a

PEI ET AL.: MINING SEQUENTIAL PATTERNS BY PATTERN-GROWTH: THE PREFIXSPAN APPROACH 1429

sequence he01 e02 e0m i m n is called a prefix of if and Definition 3 (Projected database). Let be a sequential

only if 1) e0i ei for i m 1; 2) e0m em ; and 3) all the pattern in a sequence database S. The -projected database,

frequent items in em e0m are alphabetically after those in e0m . denoted as Sj , is the collection of suffixes of sequences in S

with regards to prefix .

For example, hai, haai, haabi, and haabci are prefixes of

sequence s haabcacdcfi, but neither habi nor habci To collect counts in projected databases, we have the

is considered as a prefix if every item in the prefix haabci following definition:

of sequence s is frequent in S. Definition 4 (Support count in projected database). Let

Definition 2 (Suffix). Given a sequence he1 e2 en i be a sequential pattern in sequence database S, and be a

(where each ei corresponds to a frequent element in S). Let sequence with prefix . The support count of in -projected

he1 e2 em1 e0m i m n be the prefix of . Sequence database Sj , denoted as supportSj , is the number of

he00m em1 en i is called the suffix of with regards to sequences in Sj such that v .

prefix , denoted as =, where e00m em e0m .2 We

also denote . Note, if is not a subsequence of , the We have the following lemma regarding to the projected

suffix of with regards to is empty. databases.

Lemma 3.2 (Projected database). Let and be two

For example, for the sequence s haabcacdcfi, sequential patterns in a sequence database S such that is a

habcacdcfi is the suffix with regards to the prefix hai, prefix of .

h bcacdcfi is the suffix with regards to the prefix haai,

and h cacdcfi is the suffix with regards to the prefix 1. Sj Sj j ,

2. for any sequence with prefix , supportS

haabi.

supportSj , and

Based on the concepts of prefix and suffix, the problem

3. the size of -projected database cannot exceed that of S.

of mining sequential patterns can be decomposed into a set

of subproblems as shown below. Proof sketch. The first part of the lemma follows the fact that,

for a sequence , the suffix of with regards to , =,

Lemma 3.1 (Problem partitioning).

equals to the sequence resulted from first doing projection

1. Let fhx1 i; hx2 i; . . . ; hxn ig be the complete set of length- of with regards to , i.e., =, and then doing projection

1 sequential patterns in a sequence database S. The = with regards to . That is = ==.

complete set of sequential patterns in S can be divided The second part of the lemma states that to collect

into n disjoint subsets. The ith subset 1 i n is support count of a sequence , only the sequences in the

the set of sequential patterns with prefix hxi i. database sharing the same prefix should be considered.

2. Let be a length-l sequential pattern and Furthermore, only those suffixes with the prefix being a

f1 ; 2 ; . . . ; m g be the set of all length-l 1 super-sequence of should be counted. The claim

follows the related definitions.

sequential patterns with prefix . The complete set of

The third part of the lemma is on the size of a

sequential patterns with prefix , except for itself,

projected database. Obviously, the -projected database

can be divided into m disjoint subsets. The jth subset

can have the same number of sequences as S only if

1 j m is the set of sequential patterns prefixed appears in every sequence in S. Otherwise, only those

with j . sequences in S which are super-sequences of appear in

Proof. We show the correctness of the second half of the the -projected database. So, the -projected database

lemma. The first half is a special case where hi. cannot contain more sequences than S. For every

For a sequential pattern with prefix , where is of sequence in S such that is a super-sequence of ,

length l, the length-l 1 prefix of must be a appears in the -projected database in whole only if is

sequential pattern, according to the a priori heuristic. a prefix of . Otherwise, only a subsequence of appears

Furthermore, the length-l 1 prefix of is also with in the -projected database. Therefore, the size of

prefix , according to the definition of prefix. Therefore, -projected database cannot exceed that of S. u

t

there exists some j 1 j m such that j is the length- Let us examine how to use the prefix-based projection

l 1 prefix of . Thus, is in the jth subset. On the approach for mining sequential patterns based on our

other hand, since the length-k prefix of a sequence is running example.

unique, belongs to only one determined subset. That is, Example 4 (PrefixSpan). For the same sequence database S

the subsets are disjoint. So, we have the lemma. u

t in Table 1 with min sup 2, sequential patterns in S can

be mined by a prefix-projection method in the following

Based on Lemma 3.1, the problem can be partitioned steps:

recursively. That is, each subset of sequential patterns can

be further divided when necessary. This forms a divide-and- 1. Find length-1 sequential patterns. Scan S once

to find all the frequent items in sequences.

conquer framework. To mine the subsets of sequential

Each of these frequent items is a length-1

patterns, the corresponding projected databases can be

sequential pattern. They are hai : 4, hbi : 4,

constructed.

hci : 4, hdi : 3, hei : 3, and hfi : 3, where the

2. If e00m is not empty, the suffix is also denoted as h items in e00m notation hpatterni : count represents the pat-

em1 en i. tern and its associated support count.

1430 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

TABLE 2

Projected Databases and Sequential Patterns

2. Divide search space. The complete set of sequen- subsets can be mined by constructing respec-

tial patterns can be partitioned into the following tive projected databases and mining each

six subsets according to the six prefixes: 1) the recursively as follows:

ones with prefix hai, 2) the ones with prefix

hbi; . . . ; and 3) the ones with prefix hfi. i. The haai-projected database consists of

3. Find subsets of sequential patterns. The subsets two nonempty (suffix) subsequences pre-

of sequential patterns can be mined by construct- fixed with haai: fh bcacdcfi, fh eig.

ing the corresponding set of projected databases Since there is no hope to generate any

and mining each recursively. The projected frequent subsequence from this projected

databases as well as sequential patterns found database, the processing of the haai-

in them are listed in Table 2, while the mining projected database terminates.

process is explained as follows: ii. The habi-projected database consists of

three suffix sequences: h cacdcfi,

a. Find sequential patterns with prefix hai. h cai, and hci. Recursively mining the

Only the sequences containing hai should be habi-projected database returns four se-

collected. Moreover, in a sequence containing quential patterns: h ci, h cai, hai, and

hai, only the subsequence prefixed with the hci (i.e., habci, habcai, habai, and habci.)

first occurrence of hai should be considered. They form the complete set of sequential

For example, in sequence hefabdfcbi, patterns prefixed with habi.

only the subsequence h bdfcbi should be iii. The habi-projected database contains

considered for mining sequential patterns only two sequences: h cacdcfi and

prefixed with hai. Notice that b means that hdfcbi, which leads to the finding of the

the last element in the prefix, which is a, following sequential patterns prefixed

together with b, form one element. with habi: hci, hdi, hfi, and hdci.

The sequences in S containing hai are iv. The haci, hadi, and hafi-projected data-

projected with regards to hai to form the hai- bases can be constructed and recursively

projected database, which consists of four suffix mined similarly. The sequential patterns

found are shown in Table 2.

sequences: habcacdcfi, h dcbcaei,

b. Find sequential patterns with prefix hbi, hci,

h bdfcbi, and h fcbci.

hdi, hei, and hfi, respectively. This can be

By scanning the hai-projected database

done by constructing the hbi, hci, hdi, hei, and

once, its locally frequent items are a : 2, hfi-projected databases and mining them,

b : 4, b : 2, c : 4, d : 2, and f : 2. Thus, all the respectively. The projected databases as well

length-2 sequential patterns prefixed with hai as the sequential patterns found are shown in

are found, and they are: haai : 2, habi : 4, Table 2.

habi : 2, haci : 4, hadi : 2, and hafi : 2. 4. The set of sequential patterns is the collection of

Recursively, all sequential patterns with patterns found in the above recursive mining

prefix hai can be partitioned into six subsets: process. One can verify that it returns exactly the

1) those prefixed with haai, 2) those with habi, same set of sequential patterns as what GSP and

. . . , and, finally, 3) those with hafi. These FreeSpan do.

PEI ET AL.: MINING SEQUENTIAL PATTERNS BY PATTERN-GROWTH: THE PREFIXSPAN APPROACH 1431

Based on the above discussion, the algorithm of Prefix- pattern. If there exist a good number of sequential

Span is presented as follows: patterns, the cost is nontrivial. Techniques for

reducing the number of projected databases will be

Algorithm 1 (PrefixSpan) Prefix-projected sequential pattern

discussed in the next subsection.

mining.

Input: A sequence database S, and the minimum support Theorem 3.1 (PrefixSpan). A sequence is a sequential pattern

threshold min_support. if and only if PrefixSpan says so.

Proof sketch (Direction if). A length-l sequence l 1 is

Output: The complete set of sequential patterns identified as a sequential pattern by PrefixSpan if and

only if is a sequential pattern in the projected database

Method: Call P refixSpanhi; 0; S. of its length-l 1 prefix . If l 1, the length-0 prefix

of is hi and the projected database is S itself. So,

Subroutine P refixSpan; l; Sj is a sequential pattern in S. If l > 1, according to

The parameters are 1) is a sequential pattern; 2) l is Lemma 3.2, Sj is exactly the -projected database,

the length of ; and 3) Sj is the -projected database if and supportS supportSj . Therefore, if is a

6 hi, otherwise, it is the sequence database S. sequential pattern in Sj , it is also a sequential pattern

in S. By this, we show that a sequence is a sequential

pattern if PrefixSpan says so.

Method:

(Direction only-if). Lemma 3.1 guarantees that Pre-

1. Scan Sj once, find each frequent item, b, such that

fixSpan identifies the complete set of sequential patterns

in S. So, we have the theorem. u

t

(a) b can be assembled to the last element of to

form a sequential pattern; or 3.3 Pseudoprojection

(b) hbi can be appended to to form a sequential The above analysis shows that the major cost of PrefixSpan is

pattern. database projection, i.e., forming projected databases

2. For each frequent item b, append it to to form a recursively. Usually, a large number of projected databases

sequential pattern 0 , and output 0 . will be generated in sequential pattern mining. If the

3. For each 0 , construct 0 -projected database Sj0 , and number and/or the size of projected databases can be

call P refixSpan0 ; l 1; Sj0 . reduced, the performance of sequential pattern mining can

be further improved.

Analysis. The correctness and completeness of the algo- One technique which may reduce the number and size of

rithm can be justified based on Lemma 3.1 and Lemma 3.2, projected databases is pseudoprojection. The idea is outlined

as shown in Theorem 3.1, later. Here, we analyze the as follows: Instead of performing physical projection, one

efficiency of the algorithm as follows: can register the index (or identifier) of the corresponding

sequence and the starting position of the projected suffix in

. No candidate sequence needs to be generated by the sequence. Then, a physical projection of a sequence is

PrefixSpan. Unlike a priori-like algorithms, Prefix- replaced by registering a sequence identifier and the

Span only grows longer sequential patterns from the projected position index point. Pseudoprojection reduces

shorter frequent ones. It neither generates nor tests the cost of projection substantially when the projected

any candidate sequence nonexistent in a projected database can fit in main memory.

database. Comparing with GSP, which generates This method is based on the following observation: For

and tests a substantial number of candidate se- any sequence s, each projection can be represented by a

quences, PrefixSpan searches a much smaller space. corresponding projection position (an index point) instead

. Projected databases keep shrinking. As indicated in of copying the whole suffix as a projected subsequence.

Lemma 3.2, a projected database is smaller than the Consider a sequence haabcacdcfi. Physical projections

original one because only the suffix subsequences of may lead to repeated copying of different suffixes of the

a frequent prefix are projected into a projected sequence. An index position pointer may save physical

database. In practice, the shrinking factors can be projection of the suffix and, thus, save both space and time

significant because 1) usually, only a small set of of generating numerous physical projected databases.

sequential patterns grow quite long in a sequence

database and, thus, the number of sequences in a Example 5 (Pseudoprojection). For the same sequence

projected database usually reduces substantially database S in Table 1 with min sup 2, sequential

when prefix grows; and 2) projection only takes the patterns in S can be mined by pseudoprojection method

suffix portion with respect to a prefix. Notice that as follows:

FreeSpan also employs the idea of projected data- Suppose the sequence database S in Table 1 can be

bases. However, the projection there often takes the held in main memory. Instead of constructing the

whole string (not just suffix) and, thus, the shrinking hai-projected database, one can represent the projected

factor is less than that of PrefixSpan. suffix sequences using pointer (sequence_id) and

. The major cost of PrefixSpan is the construction of offset(s). For example, the projection of sequence s1

projected databases. In the worst case, PrefixSpan haabcdaecfi with regard to the hai-projection

constructs a projected database for every sequential consists of two pieces of information: 1) a pointer to

1432 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

TABLE 3

A Sequence Database and Some of Its Pseudoprojected Databases

s1 which could be the string_id s1 and 2) the offset(s), 4.1 Experimental Results

which should be a single integer, such as 2, if there is To evaluate the effectiveness and efficiency of the PrefixSpan

a single projection point; and a set of integers, such as algorithm, we performed an extensive performance study

f2; 3; 6g, if there are multiple projection points. Each of four algorithms: PrefixSpan, FreeSpan, GSP, and SPADE,

offset indicates at which position the projection starts on both real and synthetic data sets, with various kinds of

in the sequence. sizes and data distributions.

The projected databases for prefixes hai, hbi, hci, hdi, All experiments were conducted on a 750 MHz AMD PC

hfi, and haai are shown in Table 3, where $ indicates the with 512 megabytes main memory, running Microsoft

prefix has an occurrence in the current sequence but its Windows 2000 Server. Three algorithms, GSP, FreeSpan,

projected suffix is empty, whereas ; indicates that there and PrefixSpan, were implemented by us using Microsoft

is no occurrence of the prefix in the corresponding Visual C++ 6.0. The implementation of the fourth algorithm,

sequence. From Table 3, one can see that the pseudo- SPADE, is obtained directly from the author of the

projected database usually takes much less space than its

algorithm [29]. Detailed algorithm implementation is

corresponding physically projected one.

described as follows:

Pseudoprojection avoids physically copying suffixes. 1. GSP. The GSP algorithm is implemented according

Thus, it is efficient in terms of both running time and to the description in [23].

space. However, it may not be efficient if the pseudoprojec- 2. SPADE. SPADE is tested with the implementation

tion is used for disk-based accessing since random access provided by the algorithm inventor [29].

disk space is costly. Based on this observation, the 3. FreeSpan. FreeSpan is implemented according to the

suggested approach is that if the original sequence database algorithm described in [8].

or the projected databases is too big to fit into main 4. PrefixSpan. PrefixSpan is implemented as described

memory, the physical projection should be applied, how- in this paper,3 with pseudoprojection turned on in

most cases. Only in the case when testing the role of

ever, the execution should be swapped to pseudoprojection

pseudoprojection, two options are adopted: one with

once the projected databases can fit in main memory. This

the pseudoprojection function turned on and the

methodology is adopted in our PrefixSpan implementation. other with it turned off.

Notice that the pseudoprojection works efficiently for

For the data sets used in our performance study, we use

PrefixSpan, but not so for FreeSpan. This is because for

two kinds of data sets: one real data set and a group of

PrefixSpan, an offset position clearly identifies the suffix and

synthetic data sets.

thus the projected subsequence. However, for FreeSpan, For real data set, we have obtained the Gazelle data set

since the next step pattern-growth can be in both forward from Blue Martini. This data set has been used in KDD-

and backward directions, one needs to register more CUP2000 and contains a total of 29,369 customers Web

information on the possible extension positions in order to click-stream data provided by Blue Martini Software

identify the remainder of the projected subsequences. company. For each customer, there may be several sessions

Therefore, we only explore the pseudoprojection technique of Web click-stream and each session can have multiple

for PrefixSpan. page views. Because each session is associated with both

starting and ending date/time, for each customer, we can

4 EXPERIMENTAL RESULTS AND PERFORMANCE sort its sessions of click-stream into a sequence of page

views according to the viewing date/time. This data set

ANALYSIS

contains 29,369 sequences (i.e., customers), 35,722 sessions

Since GSP [23] and SPADE [29] are the two most influential (i.e., transactions or events), and 87,546 page views (i.e.,

sequential pattern mining algorithms, we conduct an

extensive performance study to compare PrefixSpan with 3. Note that the previous version of the PrefixSpan algorithm published

in [19] introduces another optimization technique called bilevel projection

them. In this section, we first report our experimental which performs physical database projection at every two levels. Based on

results on the performance of PrefixSpan in comparison with our recent substantial performance study, the role of bilevel projection in

performance improvement is only marginal in certain cases, but can barely

GSP and SPADE and, then, present an indepth analysis on offset its overhead in many other cases. Thus, this technique is dropped

why PrefixSpan outperforms the other algorithms. from the PrefixSpan options and also from the performance study.

PEI ET AL.: MINING SEQUENTIAL PATTERNS BY PATTERN-GROWTH: THE PREFIXSPAN APPROACH 1433

Fig. 2. Distribution of frequent sequences of data set C10T 8S8I8.

products or items). There are in total 1,423 distinct page SPADE (116.35 seconds), while GSP never terminates on our

views. More detailed information about this data set can be machine.

found in [10]. The second test is performed on the data set

For synthetic data sets, we have also used a large set of C200T 2:5S10I1:25, which is much larger than the first one

synthetic sequence data generated by a data generator since it contains 200k customers (i.e., sequences) and the

similar in spirit to the IBM data generator [2] designed for number of items is 10,000. However, it is sparser than the

testing sequential pattern mining algorithms. Various kinds first data set since the average number of items in a

of sizes and data distributions of data sets are generated transaction (i.e., event) is 2.5 and the average number of

and tested in this performance study. The convention for transactions in a sequence is 10. On average, a frequent

the data sets is as follows: C200T 2:5S10I1:25 means that the sequential pattern consists of four transactions, and each

data set contains 200k customers (i.e., sequences) and the transaction is composed of 1.25 items. Fig. 4 shows the

number of items is 10,000. The average number of items in a distribution of frequent sequences of the data set, from

transaction (i.e., event) is 2.5 and the average number of which one can see that the number of longer frequent

transactions in a sequence is 10. On average, a frequent

sequences is almost always smaller than that of shorter ones

sequential pattern consists of four transactions, and each

(except when min support 0:25%). Fig. 5 shows the

transaction is composed of 1.25 items.

processing time of the four algorithms at different support

To make our experiments fair to all the algorithms, our

synthetic test data sets are similar to that used in the thresholds. The processing time maintains the same order

performance study in [29]. Additional data sets are used for PrefixSpan < SPADE < FreeSpan < GSP when min_sup-

scalability study and for testing the algorithm behavior with port is small. When min support 0.5-0.75 percent, GSP,

varied (and, sometimes, very low) support thresholds. SPADE, and FreeSpan have very similar running time (but

The first test of the four algorithms is on the data set PrefixSpan is still 2-3 times faster).

C10T 8S8I8, which contains 10k customers (i.e., sequences)

and the number of items is 1,000. Both the average number of

items in a transaction (i.e., event) and the average number of

transactions in a sequence are set to 8. On average, a frequent

sequential pattern consists of four transactions, and each

transaction is composed of eight items. Fig. 2 shows the

distribution of frequent sequences of data set C10T 8S8I8,

from which one can see that when min_support is no less than

1 percent, the length of frequent sequences is very short (only

2-3), and the maximum number of frequent patterns in total is

less than 10,000. Fig. 3 shows the processing time of the four

algorithms at different support thresholds. The processing

times are sorted in time ascending order as PrefixSpan <

SPADE < FreeSpan < GSP. When min support 1%,

PrefixSpan (runtime 6.8 seconds) is about two orders of

magnitude faster than GSP (runtime 772.72 seconds).

When min_support is reduced to 0.5 percent, the data set

contains a large number of frequent sequences, PrefixSpan

takes 32.56 seconds, which is more than 3.5 times faster than Fig. 4. Distribution of frequent sequences of data set C200T 2:5S10I1:25.

1434 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

Fig. 5. Performance of the four algorithms on data set C200T 2:5S10I1:25.

The third test is performed on the data set

figure).

C200T 5S10I2:5, which is substantially denser than the The performance study on the real data set Gazelle is

second data set, although it still contains 200k customers reported as follows: Fig. 8 shows the distribution of

(i.e., sequences) and the number of items is 10,000. This is frequent sequences of Gazelle data set for different support

because the average number of items in a transaction (i.e., thresholds. We can see that this data set is a very sparse

event) is five, the average number of transactions in a data set: Only when the support threshold is lower than

sequence is 10, and, on average, a frequent sequential 0.05 percent are there some long frequent sequences. Fig. 9

pattern consists of four transactions, and each transaction is shows the performance comparison among the four algo-

composed of 2.5 items. Fig. 6 shows the distribution of rithms for Gazelle data set. From Fig. 9, we can see that

frequent sequences of the data set, from which one can see PrefixSpan is much more efficient than SPADE, FreeSpan,

that the number of longer frequent sequences grows into and GSP. The SPADE algorithm is faster than both FreeSpan

nontrivial range. Fig. 7 shows the processing time of the four and GSP when the support threshold is no less than

algorithms at different support thresholds. The testing result 0.025 percent, but once the support threshold is no greater

makes clear distinction among the algorithms tested. It than 0.018 percent, it cannot stop running.

shows almost the same ordering of the algorithms for Figs. 10, 11, and 12 show the results of scalability tests of

running time, PrefixSpan < SPADE < FreeSpan < GSP, the four algorithms on data set T 2:5S10I1:25, with the

except SPADE is slightly faster than PrefixSpan when database size growing from 200K to 1,000K sequences, and

min support 0:33%. However, when min support with different min_support threshold settings. With

0.25 percent, the number of frequent sequences grows to min support 1 percent, all the algorithms terminate with-

more than 4 million, and in this case, only PrefixSpan can run in 26 seconds, even with the database of 1,000K sequences,

it (with running time = 3539.78 seconds) and all the other and PrefixSpan has the best performance overall but GSP is

actually marginally better when the database size is

between 600K and 800K sequences. However, when the

Fig. 6. Distribution of frequent sequences of data set C200T 5S10I2:5. Fig. 8. Distribution of frequent sequences of data set Gazelle.

PEI ET AL.: MINING SEQUENTIAL PATTERNS BY PATTERN-GROWTH: THE PREFIXSPAN APPROACH 1435

Fig. 11. Scalability test of the four algorithms on data set T 2:5S10I1:25,

min_support drops down to 0.25 percent, PrefixSpan has the with min_support 0.5 percent.

clear competitive edge. For the database of 1,000K

sequences, PrefixSpan is about nine times faster than GSP, PrefixSpan with pseudoprojection is consistently faster than

but SPADE cannot terminate. that without pseudoprojection although the gap is not big in

To test the effectiveness of pseudoprojection, a perfor- most cases, but is about 40 percent performance improvement

mance test was conducted to compare three algorithms: when min support 0:25%. Moreover, the performance of

SPADE is clearly behind PrefixSpan when min_support is very

1. PrefixSpan with the pseudoprojection function

low and the sequential patterns are dense.

turned on,

Finally, we compare the memory usage among the three

2. PrefixSpan with pseudoprojection turned off, and

3. SPADE. algorithms, PrefixSpan, SPADE, and GSP using both real

data set Gazelle and synthetic data set C200T5S10I2.5. Fig. 15

The test data is the same data set C10T 8S8I8, used in the

shows the results for Gazelle data set, from which we can

experiment of Fig. 3. However, the min_support threshold is

see that PrefixSpan is efficient in memory usage. It consumes

set low at the range of 0.25 percent to 0.5 percent. Notice that

almost one order of magnitude less memory than both

the previous test was between 0.5 percent and 3 percent,

SPADE and GSP. For example, at support 0.018 percent,

which was the range that FreeSpan and GSP can still run

GSP consumes about 40 MB memory and SPADE just

within a reasonable amount of time, although GSP cannot run

cannot stop running after it has used more than 22 MB

when min_support is reduced down to 0.5 percent. The memory while PrefixSpan only uses about 2.7 MB memory.

distribution of frequent sequences of data set C10T 8S8I8, Fig. 16 demonstrates the memory usage for data set

with very low min_support is shown in Fig. 13. It shows that C200T5S10I2.5, from which we can see that PrefixSpan is not

the total number of frequent patterns grows up to around only more efficient, but also more stable in memory usage

4 million frequent sequences when min support 0:25%. than both SPADE and GSP. At support 0.25 percent, GSP

The testing result is shown in Fig. 14, which demonstrates that cannot stop running after it has consumed about 362 MB

Fig. 10. Scalability test of the four algorithms on data set T 2:5S10I1:25, Fig. 12. Scalability test of the four algorithms on data set T 2:5S10I1:25,

with min_support 1 percent. with min_support 0.25 percent.

1436 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

GSP for data set Gazelle.

Fig. 13. Distribution of frequent sequences of data set C10T 8S8I8, with

very low min_support (ranging from 0.25 percent to 0.5 percent). In summary, our performance study shows that Prefix-

Span has the best overall performance among the four

memory and SPADE reported an error message, memory:: algorithms tested. SPADE, though weaker than PrefixSpan

Array: Not enough memory, when it tried to allocate another in most cases, outperforms GSP consistently, which is

bulk of memory after it has used about 262 MB memory, consistent with the performance study reported in [29]. GSP

while PrefixSpan only uses 108 MB memory. This also performs fairly well only when min_support is rather high,

explains why in several cases in our previous experiments with good scalability, which is consistent with the perfor-

when the support threshold becomes really low, only mance study reported in [23]. However, when there are a

PrefixSpan can finish running. large number of frequent sequences, its performance starts

Based on our analysis, PrefixSpan only needs memory deteriorating. Our memory usage analysis also shows part

space to hold the sequence data sets plus a set of header of the reason why some algorithms become really slow

tables and pseudoprojection tables. Since the data set because the huge number of candidate sets may consume a

C200T5S10I2.5 is about 46 MB, which is much bigger than tremendous amount of memory. Also, when there are a

Gazelle (less than 1MB), it consumes more memory space large number of frequent subsequences, all the algorithms

than Gazelle, but the memory usage is still quite stable run slow (as shown in Fig. 7). This leaves some room for

(from 65 MB to 108 MB for different thresholds in our performance improvement in the future.

testing). However, both SPADE and GSP need memory 4.2 Why Does PrefixSpan Have High Performance?

space to hold candidate sequence patterns as well as the

With the above comprehensive performance study, we are

sequence data sets. When the min_support threshold drops,

convinced that PrefixSpan is the clear winner among all the

the set of candidate subsequences grows up quickly, which

four tested algorithms. The question becomes why Prefix-

causes memory consumption upsurge and sometimes both

Span has such high performance: Is it because of some

GSP and SPADE cannot finish processing.

implementation tricks, or is it inherently in the algorithm

itself? We have the following analysis:

Fig. 14. Performance of PrefixSpan(with versus without pseudoprojec- Fig. 16. Memory usage: PrefixSpan, SPADE, and GSP for synthetic

tion) versus SPADE on data C10T 8S8I8, with very low min_support. data set C200T5S10I2.5.

PEI ET AL.: MINING SEQUENTIAL PATTERNS BY PATTERN-GROWTH: THE PREFIXSPAN APPROACH 1437

fixSpan is a pattern growth-based approach, similar pattern mining method, such as PrefixSpan, it is interesting to

to FP-growth [9]. Unlike traditional a priori-based explore how such a method can be extended to handle more

approach which performs candidate generation-and sophisticated cases. It is straightforward to extend this

test, PrefixSpan does not generate any useless approach to mining multidimensional, multilevel sequential

candidate and it only counts the frequency of local patterns [21]. In this section, we will discuss constraint-based

1-itemsets. Our performance shows that when mining of sequential patterns and a few research problems.

min_support drops, the number of frequent se-

quences grows up exponentially. As a candidate 5.1 Constraint-Based Mining of Sequential Patterns

generation-and-test approach, like GSP, handling For many sequential pattern mining applications, instead of

such exponential number of candidates is the must, finding all the possible sequential patterns in a database, a

independent of optimization tricks. It is no wonder user may often like to enforce certain constraints to find

that GSP costs an exponential growth amount of desired patterns. The mining process which incorporates

time to process a pretty small database (e.g., only user-specified constraints to reduce search space and derive

10K sequences) since it must take a great deal of only the user-interested patterns is called constraint-based

time to generate and test a huge number of mining.

sequential pattern candidates. Constraint-based mining has been studied extensively in

. Projection-based divide-and-conquer as an effec- frequent pattern mining, such as [15], [4], [18]. In general,

tive means for data reduction. PrefixSpan grows constraints can be characterized based on the notion of

longer patterns from shorter ones by dividing the monotonicity, antimonotonicity, succinctness, as well as

search space and focusing only on the subspace convertible and inconvertible constraints, respectively,

potentially supporting further pattern growth. The depending on whether a constraint can be transformed

search space of PrefixSpan is focused and is confined into one of these categories if it does not naturally belong to

to a set of projected databases. Since a projected one of them [18]. This has become a classical framework for

database for a sequential pattern contains all and constraint-based frequent pattern mining.

only the necessary information for mining the Interestingly, such a constraint-based mining framework

sequential patterns that can grow from , the size

can be extended to sequential pattern mining. Moreover,

of the projected databases usually reduces quickly as

with pattern-growth framework, some previously not-so-

mining proceeds to longer sequential patterns. In

easy-to-push constraints, such as regular expression con-

contrast, the a priori-based approach always

straints [6], can be handled elegantly. Let us examine one

searches the original database at each iteration.

such example.

Many irrelevant sequences have to be scanned and

checked, which adds to the overhead. This argument Example 6 (Constraint-based sequential pattern mining).

is also supported by our performance study. Suppose our task is to mine sequential patterns with a

. PrefixSpan consumes relatively stable memory regular expression constraint C ha fbbjbcdjddgi with

space because it generates no candidates and min support 2, in a sequence database S (Table 1).

explores the divide-and-conquer methodology. On Since a regular expression constraint, like C, is neither

the other hand, the candidate generation-and-test antimonotone, nor monotone, nor succinct, the classical

methods, including both GSP and SPADE, require a constraint-pushing framework [15] cannot push it deep.

substantial amount of memory when the support To overcome this difficulty, Guha et al. [6] develop a set

threshold goes low since it needs to hold a of four SPIRIT algorithms, each pushing a stronger

tremendous number of candidate sets. relaxation of regular expression constraint R than its

. PrefixSpan applies prefix-projected pattern growth predecessor in the pattern mining loop. However, the

which is more efficient than FreeSpan that uses basic evaluation framework for sequential patterns is still

frequent pattern-guided projection. Comparing based on GSP [23], a typical candidate generation-and-

with frequent pattern-guided projection, employed test approach.

in FreeSpan, prefix-projected pattern growth (Prefix- With the development of the pattern-growth metho-

Span) saves a lot of time and space because dology, such kinds of constraints can be pushed deep

PrefixSpan projects only on the frequent items in easily and elegantly into the sequential pattern mining

the suffixes, and the projected sequences shrink process [20]. This is because, in the context of PrefixSpan,

rapidly. When mining in dense databases where a regular expression constraint has a nice property called

FreeSpan cannot gain much from the projections, growth-based antimonotonic. A constraint is growth-

PrefixSpan can still reduce substantially both the based antimonotonic if it has the following property: If a

length of sequences and the number of sequences in sequence satisfies the constraint, must be reachable by

projected databases. growing from any component which matches part of the

regular expression.

5 EXTENSIONS AND DISCUSSIONS The constraint C ha fbbjbcdjddgi can be inte-

grated with the pattern-growth mining process as

Comparing with mining (unordered) frequent patterns, follows: First, only the hai-projected database needs to

mining sequential patterns is one step toward mining more be mined since the regular expression constraint C

sophisticated frequent patterns in large databases. With the starting with a, and only the sequences which contain

1438 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

frequent single item within the set of fb; c; dg should practically useful direction to pursue. A recent study on

retain in the hai-projected database. Second, the remain- mining long sequential patterns in a noisy environment [28]

ing mining can proceed from the suffix, which is is a good example in this direction.

essentially Suffix-Span, an algorithm symmetric to

PrefixSpan by growing suffixes from the end of the 5.2.3 Toward Mining Other Kinds of Structured Patterns

sequence forward. The growth should match the suffix Besides mining sequential patterns, another important task is

constraint hfbbjbcdjddgi. For the projected databases the mining of frequent substructures in a database composed

which matches these suffixes, one can grow sequential of structured or semistructured data sets. The substructures

patterns either in prefix or suffix-expansion manner to may consist of trees, directed-acyclic graphs (i.e., DAGs), or

find all the remaining sequential patterns. general graphs which may contain cycles. There are a lot of

applications related to mining frequent substructures since

Notice that the regular expression constraint C given in most human activities and natural processes may contain

Example 6 is in a special form hprefix suffixi out of many certain structures, and a huge amount of such data has been

possible general regular expressions. In this special case, an collected in large data/information repositories, such as

integration of PrefixSpan and Suffix-Span may achieve the best molecule or biochemical structures, Web connection struc-

performance. In general, a regular expression could be of the tures, and so on. It is important to develop scalable and flexible

form h1 2 3 i, where i is a set of instantiated methods for mining structured patterns in such databases.

regular expressions. In this case, FreeSpan should be applied There have been some recent work on mining frequent

to push the instantiated items by expansion first from the subtrees, such as [31], and frequent subgraphs, such as [11],

instantiated items. A detailed discussion of constraint-based [25] in structured databases, where [25] shows that the pattern

sequential pattern mining is in [20]. growth approach has clear performance edge over a candi-

date generation-and-test approach. Furthermore, as dis-

5.2 Problems for Further Research cussed above, is it is more desirable to mine closed frequent

Although the sequential pattern growth approach proposed subgraphs (a subgraph g is closed if there exists no supergraph

in this paper is efficient and scalable, there are still some of g carrying the same support as g) than mining explicitly the

challenging research issues with regards to sequential pattern complete set of frequent subgraphs because a large graph

mining, especially for certain large scale applications. Here, inherently contains an exponential number of subgraphs. A

we illustrate a few problems that need further research. recent study [26] has developed an efficient closed subgraph

pattern method, called CloseGraph, which is also based on the

5.2.1 Mining Closed and Maximal Sequential Patterns pattern-growth framework and influenced by this approach.

A frequent long sequence contains a combinatorial number

of frequent subsequences, as shown in Section 1. For a

sequential pattern of length 100, there exist 2100 1 none- 6 CONCLUSIONS

mpty subsequences. In such cases, it is prohibitively We have performed a systematic study on mining of

expensive to mine the complete set of patterns no matter sequential patterns in large databases and developed a

which method is to be applied. pattern-growth approach for efficient and scalable mining of

Similar to mining closed and maximal frequent patterns sequential patterns.

in transaction databases [17], [3], which mines only the Instead of refinement of the a priori-like, candidate

longest frequent patterns (in the case of max-pattern generation-and-test approach, such as GSP [23], we pro-

mining) or the longest one with the same support (in the mote a divide-and-conquer approach, called pattern-growth

case of closed-pattern mining), for sequential pattern approach, which is an extension of FP-growth [9], an efficient

mining, it is also desirable to mine only (frequent) maximal pattern-growth algorithm for mining frequent patterns

or closed sequential patterns, where a sequence s is maximal without candidate generation.

if there exists no frequent supersequence of s, while a An efficient pattern-growth method, PrefixSpan, is pro-

sequence s is closed if there exists no supersequence of s posed and studied in this paper. PrefixSpan recursively

with the same support as s. projects a sequence database into a set of smaller projected

The development of efficient algorithms for mining sequence databases and grows sequential patterns in each

closed and maximal sequential patterns in large databases projected database by exploring only locally frequent frag-

is an important research problem. A recent study in [27] ments. It mines the complete set of sequential patterns and

proposed an efficient closed sequential pattern method, substantially reduces the efforts of candidate subsequence

called CloSpan, as a further development of the PrefixSpan generation. Since PrefixSpan explores ordered growth by

mining framework, influenced by this approach. prefix-ordered expansion, it results in less growth points

and reduced projected databases in comparison with our

5.2.2 Mining Approximate Sequential Patterns previously proposed pattern-growth algorithm, FreeSpan.

In this study, we have assumed all the sequential patterns to Furthermore, a pseudoprojection technique is proposed for

be mined are exact matching patterns. In practice, there are PrefixSpan to reduce the number of physical projected

many applications that need approximate matches, such as databases to be generated. A comprehensive performance

DNA sequence analysis which allows limited insertions, study shows that PrefixSpan outperforms the a priori-based

deletions, and mutations in their sequential patterns. The GSP algorithm, FreeSpan, and SPADE in most cases, and

development of efficient and scalable algorithms for mining PrefixSpan integrated with pseudoprojection is the fastest

approximate sequential patterns is a challenging and among all the tested algorithms.

PEI ET AL.: MINING SEQUENTIAL PATTERNS BY PATTERN-GROWTH: THE PREFIXSPAN APPROACH 1439

Based on our view, the implication of this method is far [9] J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without

beyond yet another efficient sequential pattern mining Candidate Generation, Proc. 2000 ACM-SIGMOD Intl Conf.

Management of Data (SIGMOD 00), pp. 1-12, May 2000.

algorithm. It demonstrates the strength of the pattern-

[10] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng, KDD-

growth mining methodology since the methodology has Cup 2000 Organizers Report: Peeling the Onion, Proc. SIGKDD

achieved high performance in both frequent-pattern mining Explorations, vol. 2, pp. 86-98, 2000.

and sequential pattern mining. Moreover, our discussion [11] M. Kuramochi and G. Karypis, Frequent Subgraph Discovery,

shows that the methodology can be extended to mining Proc. 2001 Intl Conf. Data Mining (ICDM 01), pp. 313-320, Nov.

multilevel, multidimensional sequential patterns, mining 2001.

sequential patterns with user-specified constraints, and so [12] H. Lu, J. Han, and L. Feng, Stock Movement and n-Dimensional

on. Therefore, it represents a promising approach for the Inter-Transaction Association Rules, Proc. 1998 SIGMOD Work-

shop Research Issues on Data Mining and Knowledge Discovery

applications that rely on the discovery of frequent patterns (DMKD 98), vol. 12, pp. 1-7, June 1998.

and/or sequential patterns. [13] H. Mannila, H Toivonen, and A.I. Verkamo, Discovery of

There are many interesting issues that need to be studied Frequent Episodes in Event Sequences, Data Mining and Knowl-

further, such as mining closed and maximal sequential edge Discovery, vol. 1, pp. 259-289, 1997.

patterns, mining approximate sequential patterns, and [14] F. Masseglia, F. Cathala, and P. Poncelet, The PSP Approach for

extension of the method toward mining structured patterns. Mining Sequential Patterns, Proc. 1998 European Symp. Principle of

Especially, the developments of specialized sequential Data Mining and Knowledge Discovery (PKDD 98), pp. 176-184,

Sept. 1998.

pattern mining methods for particular applications, such as

[15] R. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang, Exploratory

DNA sequence mining that may admit faults, such as Mining and Pruning Optimizations of Constrained Associations

allowing insertions, deletions, and mutations in DNA Rules, Proc. 1998 ACM-SIGMOD Intl Conf. Management of Data

sequences, and handling industry/engineering sequential (SIGMOD 98), pp. 13-24, June 1998.

process analysis are interesting issues for future research. [16] B. O zden, S. Ramaswamy, and A. Silberschatz, Cyclic Associa-

tion Rules, Proc. 1998 Intl Conf. Data Eng. (ICDE 98), pp. 412-421,

Feb. 1998.

ACKNOWLEDGMENTS [17] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, Discovering

Frequent Closed Itemsets for Association Rules, Proc. Seventh Intl

The authors are grateful to Dr. Mohammed Zaki for providing Conf. Database Theory (ICDT 99), pp. 398-416, Jan. 1999.

them with the source code of SPADE and the vertical data [18] J. Pei, J. Han, and L.V.S. Lakshmanan, Mining Frequent Itemsets

conversion package, as well as promptly answering many of with Convertible Constraints, Proc. 2001 Intl Conf. Data Eng.

(ICDE 01), pp. 433-332, Apr. 2001.

their questions related to SPADE. Also, they would like to

[19] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and

express their thanks to Blue Martini Software Inc. for sending M.-C. Hsu, PrefixSpan: Mining Sequential Patterns Efficiently by

them the Gazelle data set and allowing them to publish their Prefix-Projected Pattern Growth, Proc. 2001 Intl Conf. Data Eng.

algorithm performance results using this data set. The work (ICDE 01), pp. 215-224, Apr. 2001.

[20] J. Pei, J. Han, and W. Wang, Constraint-Based Sequential Pattern

was supported in part by the Natural Sciences and Engineer-

Mining in Large Databases, Proc. 2002 Intl Conf. Information and

ing Research Council of Canada, the Networks of Centres of Knowledge Management (CIKM 02), pp. 18-25, Nov. 2002.

Excellence of Canada, the Hewlett-Packard Lab, the US [21] H. Pinto, J. Han, J. Pei, K. Wang, Q. Chen, and U. Dayal, Multi-

National Science Foundation NSF IIS-02-09199, NSF IIS-03- Dimensional Sequential Pattern Mining, Proc. 2001 Intl Conf.

Information and Knowledge Management (CIKM 01), pp. 81-88, Nov.

08001, and the University of Illinois. Any opinions, findings, 2001.

and conclusions or recommendations expressed in this paper [22] S. Ramaswamy, S. Mahajan, and A. Silberschatz, On the

are those of the authors and do not necessarily reflect the Discovery of Interesting Patterns in Association Rules, Proc.

views of the funding agencies. This paper is a major-value 1998 Intl Conf. Very Large Data Bases (VLDB 98), pp. 368-379, Aug.

1998.

added version of the following conference paper [19].

[23] R. Srikant and R. Agrawal, Mining Sequential Patterns: General-

izations and Performance Improvements, Proc. Fifth Intl Conf.

Extending Database Technology (EDBT 96), pp. 3-17, Mar. 1996.

REFERENCES [24] J. Wang, G. Chirn, T. Marr, B. Shapiro, D. Shasha, and K. Zhang,

[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Combinatiorial Pattern Discovery for Scientific Data: Some

Association Rules, Proc. 1994 Intl Conf. Very Large Data Bases Preliminary Results, Proc. 1994 ACM-SIGMOD Intl Conf.

(VLDB 94), pp. 487-499, Sept. 1994. Management of Data (SIGMOD 94), pp. 115-125, May 1994.

[2] R. Agrawal and R. Srikant, Mining Sequential Patterns, Proc. [25] X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern

1995 Intl Conf. Data Eng. (ICDE 95), pp. 3-14, Mar. 1995. Mining, Proc. 2002 Intl Conf. Data Mining (ICDM 02), pp. 721-

[3] R.J. Bayardo, Efficiently Mining Long Patterns from Databases, 724, Dec. 2002.

Proc. 1998 ACM-SIGMOD Intl Conf. Management of Data (SIGMOD [26] X. Yan and J. Han, CloseGraph: Mining Closed Frequent Graph

98), pp. 85-93, June 1998. Patterns, Proc. 2003 ACM SIGKDD Intl Conf. Knowledge Discovery

[4] R.J. Bayardo, R. Agrawal, and D. Gunopulos, Constraint-Based and Data Mining (KDD 03), Aug. 2003.

Rule Mining on Large, Dense Data Sets, Proc. 1999 Intl Conf. Data

[27] X. Yan, J. Han, and R. Afshar, CloSpan: Mining Closed

Eng. (ICDE 99), pp. 188-197, Apr. 1999.

Sequential Patterns in Large Datasets, Proc. 2003 SIAM Intl Conf.

[5] C. Bettini, X.S. Wang, and S. Jajodia, Mining Temporal Relation-

Data Mining (SDM 03), pp. 166-177, May 2003.

ships with Multiple Granularities in Time Sequences, Data Eng.

Bull., vol. 21, pp. 32-38, 1998. [28] J. Yang, W. Wang, P.S. Yu, and J. Han, Mining Long Sequential

[6] S. Guha, R. Rastogi, and K. Shim, Rock: A Robust Clustering Patterns in a Noisy Environment, Proc. 2002 ACM-SIGMOD Intl

Algorithm for Categorical Attributes, Proc. 1999 Intl Conf. Data Conf. Management of Data (SIGMOD 02), pp. 406-417, June 2002.

Eng. (ICDE 99), pp. 512-521, Mar. 1999. [29] M. Zaki, SPADE: An Efficient Algorithm for Mining Frequent

[7] J. Han, G. Dong, and Y. Yin, Efficient Mining of Partial Periodic Sequences, Machine Learning, vol. 40, pp. 31-60, 2001.

Patterns in Time Series Database, Proc. 1999 Intl Conf. Data Eng. [30] M.J. Zaki, Efficient Enumeration of Frequent Sequences, Proc.

(ICDE 99), pp. 106-115, Apr. 1999. Seventh Intl Conf. Information and Knowledge Management (CIKM

[8] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, 98), pp. 68-75, Nov. 1998.

FreeSpan: Frequent Pattern-Projected Sequential Pattern [31] M.J. Zaki, Efficiently Mining Frequent Trees in a Forest, Proc.

Mining, Proc. 2000 ACM SIGKDD Intl Conf. Knowledge Discovery 2002 ACM SIGKDD Intl Conf. Knowledge Discovery in Databases

in Databases (KDD 00), pp. 355-359, Aug. 2000. (KDD 02), pp. 71-80, July 2002.

1440 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004

Jian Pei received the BEng and MEng degrees, Helen Pinto received the MSc degree in

both in computer science from Shanghai Jiao computing science in 2001 from Simon Fraser

Tong University, China, in 1991 and 1993, University, Canada. She is currently with the

respectively, and the PhD degree in computing Alberta Research Council in Calgary, Alberta,

science from Simon Fraser University, Canada, Canada. Since 2001, she has been working in

in 2002. He was a PhD candidate at Peking the area of intelligent systems applications with

University from 1997-1999. He is currently an emphasis on knowledge discovery in databases.

assistant professor of computing science at

Simon Fraser University, Canada. His research

interests include data mining, data warehousing,

online analytical processing, database systems, and bioinformatics. His

research is currently supported in part by the US National Science

Foundation (NSF). He has served on the program committees of Qiming Chen is a senior technical staff in

international conferences and workshops, and has been a reviewer for PacketMotion Inc. Prior to that he worked at

some leading academic journals. He is a member of the ACM, the ACM HP Labs, Hewlett Packard Company as a senior

SIGMOD, the ACM SIGKDD, and the IEEE Computer Society. He is a computer scientist, and at Commerce One Labs,

guest area editor of the Journal of Computer Science and Technology. Commerce One Inc. as a principal architect. His

current research and development activities are

Jiawei Han received the PhD degree in computer in the areas of Grid services, Web services,

science from the University of Wisconsin in 1985. interbusiness process and conversation man-

agement, as well as data warehousing, OLAP,

He is a professor in the Department of Computer

and data mining. His early research interests

Science at the University of Illinois at Urbana-

Champaign. Previously, he was an Endowed include database, workflow, and distributed and agent systems. He

University Professor at Simon Fraser University, authored more than 80 technical publications and has a number of

Canada. He has been working on research into patents granted or filed. He has served in the program committees of

data mining, data warehousing, spatial and more than 20 international conferences.

multimedia databases, deductive and object-

oriented databases, and biomedical databases, Umeshwar Dayal received the PhD degree in

with more than 250 journal and conference publications. He has chaired applied mathematics from Harvard University.

or served in many program committees of international conferences and He is Director of the Intelligent Enterprise

workshops, including ACM SIGKDD conferences (2001 best paper award Technologies Laboratory at Hewlett-Packard

chair, 2002 student award chair, 1996 PC cochair), SIAM-Data Mining Laboratories. His research interests are data

conferences (2001 and 2002 PC cochair), ACM SIGMOD Conferences mining, business process management, distrib-

(2000 exhibit program chair), and International Conferences on Data uted information management, and decision

Engineering (2004 and 2002 PC vicechair). He also served or is serving support technologies, especially as applied to

on the editorial boards for Data Mining and Knowledge Discovery: An e-business. Dr. Dayal is a member of the ACM

International Journal, IEEE Transactions on Knowledge and Data and the IEEE Computer Society.

Engineering, and the Journal of Intelligent Information Systems. He is

currently serving on the Board of Directors for the Executive Committee of

ACM Special Interest Group on Knowledge Discovery and Data Mining Mei-Chun Hsu received the BA degree from the

(SIGKDD). Dr. Han has received IBM Faculty Award, the Outstanding National Taiwan University, the MS degree from

Contribution Award at the 2002 International Conference on Data Mining, the University of Massachusetts at Amherst, and

and the ACM Service Award. He is the first author of the textbook Data the PhD degree from the Massachusetts Insti-

Mining: Concepts and Techniques (Morgan Kaufmann, 2001). He is a tute of Technology. She is currently vice

senior member of the IEEE. president of engineering at Commerce One

Inc. Dr. Hsu has led the design and development

Behzad Mortazavi-Asl received the BSc and efforts for the Commerce One Conductor 6.0, a

MSc degrees in computing science from Simon web service-enabled application integration plat-

Fraser University in 1998 and 2001, respec- form. Prior to Commerce One, she held posi-

tively. His research interests include WebLog tions with Hewlett-Packard Laboratories, EDS, and Digital Equipment

mining, sequential pattern mining, and online Corporation. She was also a member of Computer Science Faculty at

educational systems. He is currently working as Harvard University. Her research interests span database systems,

a software developer. transaction processing, business process management systems, and

data mining technologies.

Jianyong Wang received the PhD degree in

our Digital Library at www.computer.org/publications/dlib.

computer science in 1999 from the Institute of

Computing Technology, the Chinese Academy

of Sciences. Since then, he has worked as an

assistant professor in the Department of Com-

puter Science and Technology, Peking (Beijing)

University in the areas of distributed systems

and Web search engines, and visited the School

of Computing Science at Simon Fraser Univer-

sity and the Department of Computer Science at

the University of Illinois at Urbana-Champaign as a postdoc research

fellow, mainly working in the area of data mining. He is currently a

research associate of the Digital Technology Center at the University of

Minnesota.

- Abstract ReasoningUploaded byAnonymous TlGnQZv5d7
- Ch 6 - Sequence & SeriesUploaded byKhuram Shehzad Jafri
- Zucker.gamesUploaded byKamal Lohia
- Neutrosophic Association Rule Mining Algorithm for Big Data AnalysisUploaded byAnonymous 0U9j6BLllB
- Web MiningUploaded byPraveen Chinthala
- 1.FuntionsUploaded byradh14
- 1st Summative Test Grade 10Uploaded byHanna Grace Honrade
- Reusable Test Case - Validation 1Uploaded byapi-3809615
- The Application of Computational Mechanics to the Analysis of Geomagnetic DataUploaded byahsbon
- EXERCISE-1.docxUploaded byrowena pedros
- Limit of Sequences 2.pdfUploaded byChai Usajai Usajai
- dmkd07_frequentpatternUploaded byJothimani Murugesan K
- 2nd PUC Mathematics March 2016Uploaded byANUSHA
- Optimization of Association Rule Using Heuristic ApproachUploaded byEditor IJRITCC
- DWM Question BankUploaded bySindhuja Vigneshwaran
- 2018Uploaded bySomanshu Kumar
- Section_8.2.pdfUploaded byVarun Singh
- 16Uploaded byapi-3814854
- Bolzano-WeierstrassUploaded byCarlos Antonio Martinez Jurado
- DMUploaded byLakshmanan Venkatachalam
- D1_litUploaded bykhushboo_khanna
- How Did a Slide Rule WorkUploaded byKuppu Karthi
- Ansys Commands and AcronymsUploaded bykiran_wakchaure
- Definition of a SignalUploaded byimran
- S0002-9939-1950-0038589-5Uploaded byKrijan Mali
- 1-s2.0-S0020025516304583-mainUploaded byamrit338
- Week 1Uploaded byvishal223
- IBM Sequence DiagramUploaded byMauizah Hasanah
- Series Arithmetic GeometricUploaded byada
- 3Uploaded byIbadul Qadeer

- machine learningUploaded bySamira Palipana
- NIITE Brochure 2015-16Uploaded bydeepak1892
- Graph DatabaseUploaded bydeepak1892
- _964b8d77dc0ee6fd42ac7d8a70c4ffa1_Lecture6.pdfUploaded bydeepak1892
- New Doc 5Uploaded bydeepak1892
- Frequent Item SetUploaded bydeepak1892
- 08_BIDEUploaded bydeepak1892
- 01_Mining Sequential PatternsUploaded bydeepak1892
- 06_CHARMUploaded bydeepak1892
- 07_DSM.pdfUploaded bydeepak1892
- 03 Freespan rUploaded bydeepak1892
- 04_prefixspanUploaded bydeepak1892
- spade.pdfUploaded bydeepak1892
- 05_MAFIAUploaded bydeepak1892
- Icremental Mining of Sequential PatternUploaded bydeepak1892
- Lesson 5 QuizUploaded bydeepak1892
- Lesson 1Uploaded bydeepak1892
- 06842585-BigDataAnalyticsUploaded bydeepak1892

- knvb+philosophyUploaded bydakar16
- Maximilian in Mexico. Sara Yorke Stevenson.Uploaded byGuillermo Jaimes Benítez
- Theoretical Framework That Support the Therapeutic Use of Music on Elderly Patients With Dementia 2Uploaded byangelo
- 0110eUploaded byHebzer
- post colonialism/ criticismUploaded bylulavmizuri
- SWRL Tutorial -Spring 2009Uploaded byapi-19981384
- AllatRaUploaded byOnyinyeOkpala
- Bhalessa . BIO DATAUploaded bySadaket Malik
- Relaxation Techniques During LaborUploaded byEdderlyn Lamarca
- The de-composition of the theatrical language in the artistic practice of Carmelo BeneUploaded byGiacinta Gandolfo
- What is the Relationship of Religion to Science Posited by DurkheimUploaded bySrijan Chakraborty
- Civil Law 2010 Vol1Uploaded byAnonymous uqBeAy8on
- Inner PeaceUploaded bynesko61
- Darren Ambrose - Gilles Deleuze & Sublime Cinema: Painting Time With LightUploaded byDarren Ambrose
- The shining (1980)Uploaded byAnim MA
- Full Paper Msceis Ragil FixUploaded bySeptian Jauhariansyah
- Sex, Group Size and Helping in Three Cities PresentationUploaded byD Henry
- Research Paper Outline Examples.docxUploaded byRezel Marjoise Perez Huel
- fitness for cricketUploaded byBharath Venkatesh
- Bible Study Methods—#2 chapter summaryUploaded byRichard Chamberlain
- Principles for Conducting Critical Realist Case Study Research in IsUploaded byprasenjit_baishya
- rws 1301- discourse community map luisa gonzalezUploaded byapi-302604773
- Piezoelectric Precipitation Sensor From VaisalaUploaded byStephany Bryan Diez Itao
- Renosa-Theory of Ethical CaringUploaded byJoseph TheThird
- The Timeline Letter From a Whistle BlowerUploaded byZORROzorroZORROzorro
- Sheldon Wolin, Hannah Arendt Democracy.pdfUploaded byPepa Fernández
- FingerpickingUploaded byNathaniel Llido Hamot
- Guidelines for synopsis preparationUploaded bybluegeese1
- Notes Ch.1 Intro 2sppUploaded bysandi123in
- Research Paper Annotated BibliographyUploaded byBrendan Gleason