
Available online at www.sciencedirect.com

Knowledge-Based Systems 21 (2008) 110–122

www.elsevier.com/locate/knosys

0950-7051/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.knosys.2007.04.002

A new framework for detecting weighted sequential patterns in large sequence databases

Unil Yun *

Computer Engineering Division, School of Electrical & Computer Engineering, Chungbuk National University, Korea

* Tel.: +1 979 693 0284. E-mail address: yunei@cs.tamu.edu

Received 29 August 2006; received in revised form 7 February 2007; accepted 9 April 2007
Available online 19 April 2007

Abstract

Sequential pattern mining, an essential research topic with broad applications, discovers the set of frequent subsequences satisfying a support threshold in a sequence database. The major problems in mining sequential patterns are that a huge number of sequential patterns are generated and the computation time is high. Although efficient algorithms have been developed to tackle these problems, their performance degrades dramatically when mining long sequential patterns in dense databases or when using low minimum supports. In addition, the algorithms may reduce the number of patterns, but unimportant patterns still appear in the results. It would be better if the unimportant patterns could be pruned first, resulting in fewer but more important patterns after mining. In this paper, we suggest a new framework for mining weighted frequent patterns in which weight constraints are pushed deeply into sequential pattern mining. Previous sequential mining algorithms treat sequential patterns uniformly, while real sequential patterns have different importance. In our approach, the weights of items are assigned according to their priority or importance. During the mining process, we consider not only the supports but also the weights of patterns. Based on the framework, we present a weighted sequential pattern mining algorithm (WSpan). To our knowledge, this is the first work to mine weighted sequential patterns. The experimental results show that WSpan detects fewer but more important weighted sequential patterns in large sequence databases, even with a low minimum threshold.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Data mining; Knowledge discovery; Weighted sequential pattern mining; Weighted support; Minimum weight

1. Introduction

Sequential pattern mining algorithms have been extensively developed for a wide range of applications such as biological sequence mining [7,20], incremental sequence mining [5], sequence indexing [6], multi-dimensional sequence pattern mining [15], approximate sequence mining in a noisy environment [10,24], constraint-based sequential pattern mining [8,11,13], closed sequential pattern mining [19,23], graph mining [22] and so on. Sequential pattern mining finds the complete set of sequential patterns in a sequence database given a minimum support threshold. The extensive growth of data motivates the search for meaningful patterns among these huge data. In sequential pattern mining, the anti-monotone property [1] has mainly been used to prune infrequent sequential patterns. That is, if a sequential pattern is infrequent, all super patterns of the sequential pattern must be infrequent. Using this characteristic, sequential pattern mining algorithms prune infrequent sequential patterns early. However, the following limitations exist in previous sequential pattern mining.

First, in the real world, some sequences are more important and others are less important. However, previous sequential pattern mining approaches do not consider this characteristic, and all items and sequences are treated uniformly. In the real world, there are several applications in which specific sequences are more important or have higher priority than other sequences. For example, when finding traversal patterns in the World Wide Web, each page has different importance. In DNA and biomedical data analysis, some genes are more
important than others in causing a particular disease, and some genes are more effective than others in fighting diseases. Second, sequential pattern mining algorithms [2,4,28] perform better when the minimum support is high, the database is sparse and/or the length of the maximal sequential pattern is short. However, the performance of the algorithms degrades dramatically when using a low support threshold or when mining dense databases with long sequences. The main problem of these algorithms is that they generate an exponentially large number of sequential patterns, and the runtime increases dramatically when the minimum support is lowered. In addition, irrespective of the minimum support, unimportant patterns with low weights can be detected. Even though efficient algorithms [4,14] have been developed, unimportant sequential patterns are still found in the result sets.

Currently, no weighted sequential pattern mining algorithm exists, although weights are important in the real world. Let us give a motivating example for this work in market basket data. In general sequential pattern mining, a sequential pattern {(bread, milk) (diaper, beer)} can easily be discovered with a support threshold because the support (frequency) of the sequential pattern is relatively high. Although we may want to find valuable (important) itemset lists or sequences such as {(gold ring, silver necklace) (car, cell phone) (computer, television)}, such a sequence cannot be mined with previous sequential pattern mining approaches because the items within the sequence are so expensive that they have low frequencies (supports). The sequence cannot be mined even if the minimum support is lowered. Instead, the following itemsets or sequences can be discovered by the previous approach: (gold ring, hair pin, silver necklace), (car, sunglasses, cell phone) and (computer, television, mouse pad). That is, irrespective of the minimum support, items such as "hair pin", "sunglasses", or "mouse pad" with low weights are included because these items have high supports (frequencies). However, these items are cheap and give low profits, so they are not important items in terms of prices (weights). In real business, marketing managers or trend analyzers want to find sequences with more emphasis on some particular (important) products (items) and less emphasis on other products. However, previous sequential pattern mining cannot discover important sequences with low supports.

From the above motivation, in this paper we propose a new framework for detecting weighted sequential patterns. It would be better if the unimportant patterns could be pruned first, resulting in fewer patterns after mining. In our approach, weights of items are assigned to reflect the importance of the sequential patterns, of the itemsets within the patterns, and of the items within the itemsets in the sequence database. The main concern in weight-based sequential pattern mining is that the anti-monotone property [1] is broken when weights are applied naively. That is, although a sequential pattern is weighted infrequent, super patterns of the sequential pattern may be weighted frequent, because a sequential pattern with a low weight can get a high weight after adding other items or itemsets with higher weights. Our main goal in this framework is to push weight constraints into sequential pattern mining efficiently.

In our approach, a weight range is used to restrict the weight values of items, and the weights (prices) of items are normalized within the weight range. During the mining process, the maximum weight among the items is used to prune weighted infrequent sequential patterns efficiently. Additionally, we use a minimum weight threshold to balance support and weight. An extensive performance study shows that WSpan is efficient and scalable in weighted sequential pattern mining.

The main contributions of this paper are: (1) introduction of the concept of weighted sequential patterns, (2) classification and incorporation of two key features, a weight and a support, (3) description of weighted sequential pattern mining using two pruning conditions, and (4) implementation of our algorithm, WSpan, and execution of an extensive experimental study.

The remainder of the paper is organized as follows. In Section 2, we describe previous work. In Section 3, we develop WSpan. Section 4 shows extensive experimental results. Finally, conclusions and future work are presented in Section 5.

2. Previous work

Let I = {i1, i2, ..., in} be a unique set of items. A sequence S is an ordered list of itemsets, denoted as <s1, s2, ..., sm>, where sj is an itemset, also called an element of the sequence, and sj ⊆ I. That is, S = <s1, s2, ..., sm> and sj is (xj1, xj2, ..., xjk), where xjt is an item in the itemset sj. The brackets are omitted if an itemset has only one item. An item can occur at most once in an itemset of a sequence, but it can occur multiple times in different itemsets of a sequence. The size, |S|, of a sequence is the number of itemsets in the sequence. The length, l(S), is the total number of items in the sequence. A sequence with length l is called an l-sequence. A sequence database, SDB = {S1, S2, ..., Sn}, is a set of tuples <sid, S>, where sid is a sequence identifier and S is an input sequence. A sequence α = <X1, X2, ..., Xn> is called a subsequence (α ⊑ β) of another sequence β = <Y1, Y2, ..., Ym> (n ≤ m), and β is called a super sequence of α, if there exist integers 1 ≤ i1 < ... < in ≤ m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xn ⊆ Yin. A tuple <sid, S> is said to contain a sequence α if S is a super sequence of α (α ⊑ S). The support of a sequence α in a sequence database SDB, support(α) = |{<sid, S> | (<sid, S> ∈ SDB) ∧ (α ⊑ S)}|, is the number of sequences in SDB that contain α. Given a support threshold, min_sup, a sequence α is called a frequent sequential pattern in the sequence database if the support of α is no less than the minimum support threshold (support(α) ≥ min_sup). The problem of sequential pattern mining is to find the complete set of frequent sequential patterns satisfying a minimum support in the sequence database. Table 1 shows the input sequence database SDB in our running example.
Table 1
A sequence database (SDB)

  Sequence ID   Sequence
  10            <a (abc) (ac) d (cf)>
  20            <(ad) c (bc) (ae) bc>
  30            <(ef) (ab) (df) cb>
  40            <eg (af) cbc>
  50            <a (ab) (cd) egh>
  60            <a (abd) bc>

Assume that the minimum support is 2. The SDB has eight unique items and six input sequences. The sequence <a (abc) (ac) d (cf)> in SDB has five itemsets: a, (abc), (ac), d, (cf), where the items "a" and "c" appear three times in different itemsets of the sequence. The size of <a (abc) (ac) d (cf)> is 5, and the length of this sequence is 9. The sequence <a (bc) d> is a subsequence of <a (abc) (ac) d (cf)> since a ⊆ a, (bc) ⊆ (abc) and d ⊆ d. Additionally, the sequence <a (bc)> is a frequent sequential pattern because sequences 10 and 20 contain the subsequence <a (bc)>, so its support (2) is no less than 2. Meanwhile, the sequential pattern <(ab)g> is not a frequent pattern since the support (1) of the pattern is less than the minimum support (2). Based on the anti-monotone property, we know that all super patterns of <(ab)g>, such as <a(ab)g>, <a(ab)cg>, and <a(ab)(cd)g>, are infrequent patterns.
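The containment and support definitions above translate directly into code. The following C++ sketch (an illustration of the definitions, not code from the paper; the type aliases and the hard-coded SDB are ours) tests subsequence containment with a greedy left-to-right scan and counts the support of <a (bc)> over the database of Table 1:

```cpp
#include <algorithm>
#include <iostream>
#include <set>
#include <vector>

using Itemset  = std::set<char>;        // an element of a sequence
using Sequence = std::vector<Itemset>;  // ordered list of itemsets

// True if every itemset of 'sub' is a subset of some itemset of 'seq',
// in order; matching greedily at the earliest position is sufficient.
bool contains(const Sequence& seq, const Sequence& sub) {
    std::size_t j = 0;
    for (const Itemset& e : seq) {
        if (j == sub.size()) break;
        if (std::includes(e.begin(), e.end(), sub[j].begin(), sub[j].end()))
            ++j;
    }
    return j == sub.size();
}

// support(sub) = number of sequences in the SDB that contain 'sub'.
int support(const std::vector<Sequence>& sdb, const Sequence& sub) {
    int n = 0;
    for (const Sequence& s : sdb)
        if (contains(s, sub)) ++n;
    return n;
}

int main() {
    std::vector<Sequence> sdb = {                                     // Table 1
        {{'a'}, {'a','b','c'}, {'a','c'}, {'d'}, {'c','f'}},          // 10
        {{'a','d'}, {'c'}, {'b','c'}, {'a','e'}, {'b'}, {'c'}},       // 20
        {{'e','f'}, {'a','b'}, {'d','f'}, {'c'}, {'b'}},              // 30
        {{'e'}, {'g'}, {'a','f'}, {'c'}, {'b'}, {'c'}},               // 40
        {{'a'}, {'a','b'}, {'c','d'}, {'e'}, {'g'}, {'h'}},           // 50
        {{'a'}, {'a','b','d'}, {'b'}, {'c'}}                          // 60
    };
    Sequence sub = {{'a'}, {'b','c'}};                  // <a (bc)>
    std::cout << support(sdb, sub) << '\n';             // prints 2
}
```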
2.1. Weight-based pattern mining

Weight-based frequent pattern mining algorithms [3,17,21,25–27] have been suggested. As shown in Table 2, MINWAL (Mining association rules with weighted items) [3] defined a weighted support, which is calculated by multiplying the support of a pattern with the average weight of the pattern. MINWAL used the k-support bound to maintain the anti-monotone property. WAR (Weighted Association Rules) [21] generates frequent items without considering weights and then does post-processing during the rule generation step, so the WAR algorithm does not consider weight values while mining frequent patterns. In WARM (Weighted Association Rule Mining) [17], the problem of breaking the anti-monotone property is solved by using a weighted support and developing a weighted downward closure property. However, the weighted support of a pattern AB in WARM is the ratio of the weight of the transactions containing both A and B to the weight of all transactions, so WARM does not consider the support measure. Recently, in weighted frequent pattern mining, the weight constraint has been combined with other constraints [25,27] or measures [26] such as length-decreasing support constraints, the correlation measure or the closure property. Although there have been many studies to find weighted frequent patterns, no sequential pattern mining algorithm has considered weighted sequences. Instead, all the sequential pattern mining algorithms suggested so far have given the same importance to the sequences and the itemsets in a sequence. However, it is essential to distinguish significant sequences from a large number of sequence patterns.

Table 2
Weight-based pattern mining algorithms

  WSpan: the first sequential pattern mining algorithm to detect weighted sequential patterns; applies a minimum weight threshold to sequential pattern mining; uses the prefix-projected sequential pattern growth method.
  MINWAL [3]: frequent pattern/association rule mining algorithm; uses the k-support bound to maintain the anti-monotone property.
  WAR [21]: frequent pattern/association rule mining algorithm; generates frequent items without considering weights and then uses weight values to find association rules (post-processing approach).
  WARM [17]: frequent pattern/association rule mining algorithm; solves the downward closure property problem by developing a weighted downward closure property; uses weight values but does not consider the supports of patterns.
  WLPMiner [25]/WIP [26]/WCloset [27]: frequent pattern mining algorithms; use the FP-tree structure in weight-based frequent pattern mining; combine the weight constraint with other constraints or measures such as length-decreasing support constraints, the correlation measure or the closure property.

2.2. Sequential pattern mining

Efficient sequential pattern mining algorithms have been developed, such as constraint-based sequential pattern mining [8,11,13], sequential pattern mining without support thresholds [18] and closed sequential pattern mining [19,23]. These approaches may reduce the number of patterns, but unimportant patterns are still mined because they do not consider the weights or priorities of the patterns. Among previous sequential pattern mining algorithms, GSP [16] mines sequential patterns based on an apriori-like approach [1] by generating all candidate sequences. To overcome this limitation, the database-projection growth-based approach FreeSpan [9] was developed. Although FreeSpan outperforms the apriori-based GSP algorithm, FreeSpan may generate any substring combination in a
sequence. The projection in FreeSpan must keep all sequences in the original sequence database without length reduction. A more efficient pattern growth algorithm, PrefixSpan [12], was proposed which improves the mining process. The main idea of PrefixSpan is to examine only the prefix subsequences and project only their corresponding suffix subsequences into projected databases. In each projected database, sequential patterns are grown by exploring only locally frequent patterns. In SPADE [28], a vertical id-list data format was presented, and frequent sequence enumeration was performed by simple joins on id-lists. SPADE can be considered an extension of vertical format-based sequential pattern mining. SPAM [2] utilizes depth-first traversal of the search space combined with a vertical bitmap representation of each sequence. Before SPAM, SPADE and PrefixSpan were two of the fastest algorithms. According to performance evaluations [2], SPAM outperforms SPADE on most datasets, and PrefixSpan outperforms SPAM only slightly on very small datasets. Except for this case, SPAM outperforms PrefixSpan in all cases. Therefore, WSpan is compared with SPAM for the performance evaluation.

3. A framework for detecting weighted sequential patterns

In this section, we propose a new framework for detecting weighted sequential patterns in which the weight constraint is pushed deeply into sequential pattern mining. In our approach, we define weighted sequential patterns, and two pruning methods are suggested to detect weighted sequential patterns. Based on the framework, we show a mining example and an algorithm for detecting weighted sequential patterns.

3.1. Basic concepts

To set up the weights of items, an attribute value of the retail items can be used. For example, the prices (profits) of items can be used as a weight factor in market basket data. However, the raw values of items are not suitable as weight values because of their large variation. Table 3 shows an example of a retail database. As shown in Table 3, the variation of the items' prices is so large that the prices cannot be used directly as weights. Therefore, a normalization process is needed which adjusts for differences among data from varying sources in order to create a common basis for comparison. Weights of items are assigned with aw ≤ weight ≤ bw according to the items' importance or priority. The weights with aw ≤ weight ≤ bw are normalized as minw ≤ weight ≤ maxw, and the normalized weights can be used in the mining process. As shown in Table 3, attribute values such as the prices of items in market basket data can be used as a weight factor, and the prices of items can be normalized within a specific weight range. Based on the definition, items, itemsets and a sequence have their own weights. In this example, the weights of items lie between 0.01 and 1.2, and the maximum weight of items is the weight (1.2) of the item "laptop computer". In our approach, the weight range is used for efficient mining.

Table 3
An example of a retail database

  Bar code   Item               Price ($)   Support (frequency)   Normalized weight
  1          Laptop computer    1200        5000                  1.2
  2          Desktop computer   700         3000                  0.7
  3          Memory stick       200         20,000                0.2
  4          Memory card        150         10,000                0.15
  5          Hard disk          100         5000                  0.1
  6          Mouse              40          80,000                0.04
  7          Mouse pad          10          100,000               0.01

Definition 3.1 (Weight of a sequential pattern (sequence)). A weight of an item is a non-negative real number that shows the importance of each item. Given a sequence S = <s1, s2, ..., sm> in which sj is (xj1, xj2, ..., xjk), the weight of the sequential pattern (sequence) is formally defined as

Weight(S) = \frac{\sum_{i=1}^{length(s_1)} Weight(x_{1i}) + \sum_{i=1}^{length(s_2)} Weight(x_{2i}) + \cdots + \sum_{i=1}^{length(s_m)} Weight(x_{mi})}{length(s_1) + length(s_2) + \cdots + length(s_m)}

Table 4 shows example sets of items with different weights.

Table 4
Example sets of items with different normalized weights

  Item                        <a>    <b>    <c>    <d>    <e>    <f>    <g>    <h>
  Support                     6      6      6      5      4      3      2      1
  WR1 (0.7 ≤ weight ≤ 1.3)    1.2    1.0    0.9    1.0    0.7    0.9    1.3    1.1
  WR2 (0.7 ≤ weight ≤ 0.9)    0.9    0.75   0.8    0.85   0.75   0.7    0.85   0.8
  WR3 (0.2 ≤ weight ≤ 0.6)    0.5    0.2    0.6    0.4    0.6    0.3    0.5    0.3

Given the SDB in Table 1, WR2 in Table 4, and a minimum support of 2, the set of items in the database, i.e., the length-1 subsequences in the form "<item>:support", is {<a>:6, <b>:6, <c>:6, <d>:5, <e>:4, <f>:3, <g>:2, <h>:1}, and the weight list is {a:0.9, b:0.75, c:0.8, d:0.85, e:0.75, f:0.7, g:0.85, h:0.8}. When WR2 is used as the normalized weights of the items within a sequence, the weight of the sequence <a (bc) (ac) d (cf)> is 0.8125 ((0.9 + 0.75 + 0.8 + 0.9 + 0.8 + 0.85 + 0.8 + 0.7)/8). Meanwhile, when WR1 and WR3 are applied, the weights of the sequence <a (bc) (ac) d (cf)> are 1.0 ((1.2 + 1.0 + 0.9 + 1.2 + 0.9 + 1.0 + 0.9 + 0.9)/8) and 0.4625 ((0.5 + 0.2 + 0.6 + 0.5 + 0.6 + 0.4 + 0.6 + 0.3)/8), respectively. Additionally, the maximum weights (MaxW) of WR1, WR2, and WR3 are 1.3, 0.9 and 0.6 at this step.
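To make Definition 3.1 and the normalization step concrete, the following sketch is illustrative only: min-max scaling is one plausible normalization rule, since the paper does not fix a formula, and the helper names are ours. It maps a raw price into a weight range and computes Weight(S) as the average item weight, reproducing the value 0.8125 for <a (bc) (ac) d (cf)> under WR2:

```cpp
#include <iostream>
#include <map>
#include <vector>

// Min-max scaling of a raw attribute value (e.g., a price) into the
// weight range [minw, maxw]; an assumed normalization rule.
double normalize(double v, double vmin, double vmax,
                 double minw, double maxw) {
    return minw + (v - vmin) / (vmax - vmin) * (maxw - minw);
}

// Definition 3.1: Weight(S) = (sum of the weights of all items in S)
//                           / (total number of items in S).
double weightOfSequence(const std::vector<std::vector<char>>& s,
                        const std::map<char, double>& w) {
    double sum = 0.0;
    int len = 0;
    for (const auto& itemset : s)
        for (char x : itemset) { sum += w.at(x); ++len; }
    return sum / len;
}

int main() {
    std::map<char, double> wr2 = {                      // WR2, Table 4
        {'a', 0.9}, {'b', 0.75}, {'c', 0.8}, {'d', 0.85},
        {'e', 0.75}, {'f', 0.7}, {'g', 0.85}, {'h', 0.8}};
    std::vector<std::vector<char>> s =                  // <a (bc) (ac) d (cf)>
        {{'a'}, {'b','c'}, {'a','c'}, {'d'}, {'c','f'}};
    std::cout << weightOfSequence(s, wr2) << '\n';      // prints 0.8125
}
```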
Definition 3.2 (Weighted support (WSupport) and minimum weight threshold (min_weight)). A weighted support of a sequential pattern is defined as the value obtained by multiplying the pattern's support with the weight of the pattern, in which the weight of the sequence (sequential pattern) is obtained by calculating the average value of the weights of the items in the sequence. A weighted support is used to prune weighted infrequent sequential patterns. Previous sequential pattern mining has used only the supports of patterns. However, in our approach, the support and weight of a pattern are both considered simultaneously, so the weighted support is defined as the value of multiplying a sequential pattern's support by the weight of the pattern, and the weighted support (Weight(S) * support(S)) of the pattern is compared with the minimum threshold. In addition, in WSpan, the two measures of weight and support are balanced by defining a minimum weight threshold (min_weight), analogous to the minimum support (min_sup), to prune items with lower weights.

3.2. Weighted sequential pattern

Now, we define the weighted frequent sequential pattern and analyze the two pruning conditions used in weighted sequential pattern mining.

Definition 3.3 (Weighted frequent sequential pattern). A pattern is called a weighted frequent sequential pattern if and only if (1) the weighted support of the sequential pattern is no less than a minimum support, and (2) the support of the pattern is no less than a minimum support or the weight of the pattern is no less than a minimum weight.

In this definition of the weighted frequent sequential pattern, the first condition is to calculate the weighted support by multiplying the weight with the support of a pattern, and to compare the weighted support with a minimum threshold. Meanwhile, the second condition is that the weight and the support of the pattern are compared with the minimum weight and the minimum support, respectively. We show that the two conditions (1) and (2) in Definition 3.3 have their own pruning portions of weighted infrequent sequential patterns.
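Definition 3.3 amounts to two boolean tests on a candidate pattern. A minimal sketch (a hypothetical helper of ours, with MaxW standing in for the exact pattern weight in the first test during mining, as described in Section 3.2.1 below):

```cpp
// Definition 3.3: a pattern is weighted frequent iff
//  (1) its (approximate) weighted support reaches min_sup, and
//  (2) its support reaches min_sup or its weight reaches min_weight.
bool isWeightedFrequent(int support, double weight, double maxW,
                        int minSup, double minWeight) {
    bool cond1 = support * maxW >= minSup;                  // pruning condition 1
    bool cond2 = support >= minSup || weight >= minWeight;  // pruning condition 2
    return cond1 && cond2;
}
```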

3.2.1. Pruning condition 1: weighted support (support * MaxW ≥ min_sup)

The weighted support of a sequential pattern must be no less than the minimum support. In WSpan, the weighted support obtained by multiplying the pattern's support with the MaxW of the sequence database must be no less than the minimum support. (A maximum weight, MaxW, is defined as the maximum weight of the items in a sequence database or sub sequence database during the mining process.)

Although a sequential pattern is weighted infrequent, super patterns of the sequential pattern may be weighted frequent, because a sequential pattern which has a low weight can get a high weight after adding another item with a higher weight. To prune weighted infrequent patterns early while maintaining the anti-monotone property, MaxW (instead of the real weight values) is used as an approximate maximum weight when the weighted support of a sequential pattern is calculated. If the maximum weighted support (support(S) * MaxW) of a sequential pattern S is less than the minimum support, no super pattern can be a weighted frequent sequential pattern, so the pattern can be pruned immediately. During the mining process, weighted infrequent items are pruned, and the weights of the pruned items are no longer considered for MaxW in the next step, even if those weights are high. By doing so, the MaxW value is reduced and the approximate maximum weighted support (support(S) * MaxW) becomes more accurate. However, in the final step, we need to check whether a sequential pattern is really a weighted frequent sequential pattern (support(S) * weight(S) ≥ min_sup), because the maximum weighted support is an approximate value.

The columns in Table 5 show the sets of sequences after applying different weights.

Table 5
Weighted sequences (sequential patterns) with different WRs

  SID   Weighted sequences          Weighted sequences          Weighted sequences
        (WR1: 0.7 ≤ weight ≤ 1.3)   (WR2: 0.7 ≤ weight ≤ 0.9)   (WR3: 0.2 ≤ weight ≤ 0.6)
  10    <a(abc)(ac)d(cf)>           <a(abc)(ac)d(cf)>           <a(abc)(ac)dc>
  20    <(ad)c(bc)(ae)bc>           <(ad)c(bc)(ae)bc>           <(ad)c(bc)(ae)bc>
  30    <(ef)(ab)(df)cb>            <(ef)(ab)(df)cb>            <e(ab)dcb>
  40    <eg(af)cbc>                 <e(af)cbc>                  <eacbc>
  50    <a(ab)(cd)egh>              <a(ab)(cd)e>                <a(ab)(cd)e>
  60    <a(abd)bc>                  <a(abd)bc>                  <a(abd)bc>

For example, when WR3 is applied as the normalized weights of items and the minimum support is 2, the support of the item f is 3, and the weighted support (1.8) obtained by multiplying the item's support (3) with MaxW (0.6) in the SDB is less than the minimum support (2), so the item "f" in each sequence of the
sequence database is removed. Meanwhile, the number of weighted sequential patterns can increase when WR1 is used as the normalized weights of items. The support of the item "g" in the sequence database is 2, but the weighted support (2.6) obtained by multiplying the item's support (2) with MaxW (1.3) is greater than the minimum support (2), so the item "g" is not pruned from the sequences.

3.2.2. Pruning condition 2 (support ≥ min_sup or weight ≥ min_weight)

The support of the pattern must be no less than the minimum support, or the weight of the pattern must be no less than the minimum weight.

If the support of a sequential pattern is less than the minimum support (min_sup) and its weight is also less than the minimum weight (min_weight), it is a useless sequential pattern because the pattern has low frequency and low importance (priority). To find and remove useless sequential patterns, we apply the minimum weight threshold. When a weight and a support are considered separately, there are four cases for sequences: sequences with a high support and a high weight, a high support and a low weight, a low support and a high weight, and a low support and a low weight. Sequences with a low support and a low weight are useless sequences, so these sequences are pruned. However, sequences in the other cases cannot be pruned, since those sequences may have higher priorities although their supports are low, or they may have higher frequencies even if their weights are low. Note that pruning condition 2 can prune items which have low support and weight in sequential patterns, so the pruning condition can be used to reduce the scope of the result patterns. Let us show an example of the effect of the minimum weight. Given the sequence database in Table 1, a minimum support of 3, and WR3 as the weights of items, pruning condition 2 is applied as follows. If the minimum weight is 0.6, the items "g" and "h" in each sequence are pruned because the supports of the items are less than the minimum support and the weights of the items are less than the minimum weight. If the minimum weight is 0.4, only the item "h" in each sequence is pruned. Meanwhile, no item in any sequence is pruned if the minimum weight is no greater than 0.3, the weight of the item "h". In a similar way, the number of weighted sequential patterns can be adjusted by using the minimum weight.
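The two item-level pruning effects just illustrated (condition 1 removing "f" under WR3, and condition 2 removing "g" and "h" at a minimum weight of 0.6) can be written as a single filtering pass over the item tables. This is an illustrative sketch with hypothetical helper names, not the paper's code:

```cpp
#include <map>
#include <set>

// Return the items that fail Definition 3.3's tests and are removed
// from every sequence (yielding the weighted sequences of Table 5).
std::set<char> prunedItems(const std::map<char, int>& support,
                           const std::map<char, double>& weight,
                           int minSup, double minWeight, double maxW) {
    std::set<char> pruned;
    for (const auto& [item, sup] : support) {
        bool cond1 = sup * maxW >= minSup;                           // condition 1
        bool cond2 = sup >= minSup || weight.at(item) >= minWeight;  // condition 2
        if (!(cond1 && cond2)) pruned.insert(item);
    }
    return pruned;
}
// Under WR3 (maxW = 0.6) with min_sup = 2, "f" (support 3) is pruned
// since 3 * 0.6 = 1.8 < 2; with min_sup = 3 and min_weight = 0.6,
// "g" (support 2, weight 0.5) and "h" (support 1, weight 0.3) fail
// condition 2 and are pruned.
```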
Now, three lemmas are presented to analyze the two pruning conditions in Definition 3.3.

3.3. Analysis of two pruning conditions for detecting weighted sequential patterns

Lemma 3.1. When the two pruning conditions in Definition 3.3 are applied to prune weighted infrequent sequential patterns, the case in which a sequential pattern is pruned by only pruning condition 2 but not by pruning condition 1 requires, as a minimum, that the MaxW of the sequence database be greater than one.

Proof. In this case, pruning condition 1 is satisfied but pruning condition 2 is not. In other words, the weighted support obtained by multiplying the pattern's support with MaxW is no less than the minimum support in pruning condition 1, but the support of the sequential pattern is less than min_sup and the weight of the sequential pattern is less than min_weight in pruning condition 2. Therefore, we can see that the following two constraints must be satisfied.

Constraint 1: (weighted support (support * MaxW) ≥ min_sup)
Constraint 2: (support < min_sup)

In the two constraints, the support of the sequential pattern and the minimum support (min_sup) are fixed. Therefore, it is certain that the MaxW of the sequence database must be greater than one to satisfy the above two constraints. □

Example 1. Suppose that the sequence database in Table 1 is used, WR1 in Table 4 is utilized as the weight list, the minimum support is 5 and the minimum weight is 1.1. Then, the support of the sequential pattern "(ab)c" is 4, and the weight of the sequential pattern is 1.03 ((1.2 + 1.0 + 0.9)/3). With a MaxW of 1.3, the maximum weighted support 5.2 (4 * 1.3) is greater than the minimum support (5), so the sequential pattern satisfies pruning condition 1. However, the support (4) of the sequential pattern is less than min_sup (5) and the weight (1.03) of the sequential pattern is less than min_weight (1.1). As a result, pruning condition 1 is satisfied but pruning condition 2 is not. Therefore, the sequential pattern is pruned by pruning condition 2 but not by pruning condition 1 of Definition 3.3 when MaxW is greater than one. Note that if the weight of a sequential pattern is no less than min_weight, the sequential pattern cannot be pruned by pruning condition 2, irrespective of the MaxW value. Additionally, even when MaxW is greater than one, the weighted support (support * MaxW) of a sequential pattern may be less than the minimum support. That is, a sequential pattern may also be pruned by pruning condition 1 even though MaxW is greater than one.

Lemma 3.2. The two pruning conditions 1 and 2 in Definition 3.3 have different pruning portions, and they can be used together, although pruning condition 1 can be applied more broadly than pruning condition 2.

Proof. We need to consider two cases: (1) a sequential pattern is pruned by pruning condition 1 but not by pruning condition 2, and (2) a sequential pattern is pruned by both pruning conditions 1 and 2.
Case (1): a sequential pattern is pruned only by pruning condition 1. In this case, the pattern must satisfy the following constraints.
Constraint 3: (weighted support (support * MaxW) < min_sup)
Constraint 4: (support ≥ min_sup or weight ≥ min_weight)

If the support of the sequential pattern is greater than or equal to the minimum support in constraint 4, MaxW should be less than one to satisfy constraint 3. From constraints 3 and 4, if MaxW is less than one, the sequential pattern can satisfy constraint 3. Recall that the weighted support of the pattern can still be greater than or equal to the minimum support even though MaxW is less than one (if the support of the sequential pattern is relatively high). However, in that case, the support of the sequential pattern is no less than the minimum support, so the pattern cannot be pruned by pruning condition 2 either; hence, we do not need to consider that case. If the weight of the sequential pattern is no less than the minimum weight but the support of the sequential pattern is less than the minimum support, the pattern cannot be pruned by pruning condition 2. Meanwhile, the sequential pattern may be pruned by pruning condition 1 when MaxW is greater than one. Therefore, we can see that in case (1), pruning condition 1 is used more broadly to prune weighted infrequent sequential patterns.
Case (2): a sequential pattern is pruned by both pruning conditions 1 and 2. In case (2), the following constraints should be satisfied.

Constraint 5: (weighted support (support * MaxW) < min_sup)
Constraint 6: (support < min_sup and weight < min_weight)

Consider the case in which the support of the sequential pattern is less than the minimum support in constraint 6. If MaxW is less than or equal to one, constraint 5 is certainly also satisfied; therefore, the sequential pattern is pruned by pruning condition 1. Meanwhile, even if MaxW is greater than one, the weighted support (support * MaxW) can be less than the minimum support when the support of the pattern is relatively low. In other words, if the support of the sequential pattern is less than (min_sup/MaxW), the sequential pattern is pruned by pruning condition 1. In this case, an additional requirement is needed to prune the sequential pattern: the second part (weight < min_weight) of constraint 6 must also be satisfied for the pattern to be pruned by pruning condition 2, even though the support of the sequential pattern is less than the minimum support. Finally, when MaxW is less than or equal to one, we can say that if a sequential pattern is pruned by pruning condition 2, the sequential pattern must always be pruned by pruning condition 1 as well. However, if MaxW is greater than one, the sequential pattern is pruned by pruning condition 1 when the support of the pattern is less than (min_sup/MaxW), whereas both parts (support < min_sup and weight < min_weight) of constraint 6 must be satisfied for the sequential pattern to be pruned by pruning condition 2. As a result, the two pruning conditions have different pruning portions and they can be used together, but pruning condition 1 can be applied more broadly than pruning condition 2. □

Example 2. In Lemma 3.2, we proved that the two pruning conditions have different pruning portions and that pruning condition 1 can be used more broadly. In Example 1, we already checked an example of a sequential pattern pruned only by pruning condition 2; in this example, the reverse case is shown. As in Example 1, given the SDB in Table 1, WR1 in Table 4, a min_sup of 5 and a min_weight of 1.1, the support of the sequential pattern "a(ab)" is 3 and the weight of the sequential pattern is 1.13. With a MaxW of 1.3, the maximum weighted support 3.9 (3 * 1.3) is less than the minimum support (5), so the sequential pattern is pruned by pruning condition 1. Note that the support (3) of the sequential pattern "a(ab)" is less than min_sup (5), but the weight (1.13) of the sequential pattern is greater than min_weight (1.1). Thus, the pattern "a(ab)" satisfies pruning condition 2. Finally, the pattern "a(ab)" is pruned by pruning condition 1 but not by pruning condition 2. From this example, we can see that a sequential pattern may be pruned by only one of the two pruning conditions and that the pruning conditions have their own pruning ranges.
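Plugging the numbers of Examples 1 and 2 into the two tests shows the disjoint pruning regions that Lemma 3.2 describes (a usage sketch of the isWeightedFrequent checks given after Definition 3.3):

```cpp
#include <cstdio>

int main() {
    const int    minSup    = 5;
    const double minWeight = 1.1, maxW = 1.3;

    // Example 1: "(ab)c", support 4, weight 1.03:
    //   condition 1 holds (4 * 1.3 = 5.2 >= 5), condition 2 fails.
    std::printf("%d %d\n", 4 * maxW >= minSup, 4 >= minSup || 1.03 >= minWeight);

    // Example 2: "a(ab)", support 3, weight 1.13:
    //   condition 1 fails (3 * 1.3 = 3.9 < 5), condition 2 holds.
    std::printf("%d %d\n", 3 * maxW >= minSup, 3 >= minSup || 1.13 >= minWeight);
}
```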
Lemma 3.3. Based on the pruning conditions, WSpan prunes more patterns than the approach using only a minimum support when the MaxW of the sequence database is less than one.

Proof. In normal sequential pattern mining, every item, itemset and sequence has the same importance; that is, their weights are all 1.0. If pruning condition 1 in Definition 3.3 is considered, the approximate weighted supports of sequential patterns are calculated by multiplying the supports of the sequential patterns with MaxW, and these are compared with the support threshold. Given a minimum support threshold, a sequential pattern can be weighted frequent or weighted infrequent according to the MaxW. Although the support of a sequential pattern is no less than the support threshold and the sequential pattern is therefore frequent, the pattern may be a weighted infrequent sequential pattern if MaxW is less than one. More items or sequential patterns are pruned when the weights of items and patterns are set to less than one. Note that the reverse case does not occur. In other words, given a MaxW of less than one, if a pattern is a weighted frequent sequential pattern, the pattern is definitely a frequent sequential pattern, because the weighted support of the pattern is then no greater than the support of the pattern. □
Example 3. Given the SDB in Table 1, WR2 as the weights of items from Table 4, and a min_sup of 4, the weight list is <a:0.9, b:0.75, c:0.8, d:0.85, e:0.75, f:0.7, g:0.85, h:0.8> and MaxW is 0.9. In normal sequential pattern mining, the sequential pattern "(ab)" is a frequent sequential pattern and is not pruned, since the support (4) of the pattern is equal to the minimum support (4). However, the sequential pattern is weighted infrequent because the approximate weighted support (3.6 = 4 * 0.9) is less than the minimum threshold (4). In weighted sequential pattern mining, the pattern is relatively unimportant and is pruned.

3.4. Mining weighted sequential patterns

On this framework, as a mining example, we develop the WSpan algorithm to detect weighted sequential patterns. In this mining example, based on the framework, the weight constraint is pushed into the prefix-projected sequential pattern growth approach. A sequence database is recursively projected into a set of smaller weighted projected databases, and weighted sequential patterns are grown in each weighted projected database. Given a sequence α = <e1, e2, ..., en> (in which each ei is a frequent element in S), a sequence β = <e'1, e'2, ..., e'm> (m ≤ n) is called a prefix of the sequence α if (1) ei = e'i for i ≤ m − 1, (2) e'm ⊆ em, and (3) all the weighted frequent items in (em − e'm) are alphabetically listed after those in e'm. Given a sequential pattern α in a sequence database, the α-projected database (S|α) is the collection of suffixes of the sequences in S with regard to the prefix α. The support (support_{S|α}(β)) of a sequential pattern β in the α-projected database (S|α) is the number of sequences γ in S|α that contain β.

WSpan algorithm: Weighted Sequential pattern mining in large sequence databases.
Input: (1) A sequence database: SDB,
(2) The minimum support threshold: min_sup,
(3) The minimum weight threshold: min_weight.
Output: The complete set of weighted sequential patterns.
Begin
1. Let WSP be the set of Weighted Sequential Patterns. Initialize WSP ← {};
2. Scan SDB once, count the support of each item, check the weight of each item and find each weighted frequent item, b, in the sequences: b is a weighted sequential item if the following pruning conditions are satisfied.
   Condition 1: (support * MaxW ≥ min_sup)
   Condition 2: (support ≥ min_sup or weight ≥ min_weight)
3. For each weighted frequent item, b, in SDB
   Call WSpan (WSP, <b>, 1, SDB)
   End for
End

Procedure WSpan (WSP, α, L, S|α)
Parameters:
(1) α is a weighted sequential pattern;
(2) L is the length of α;
(3) S|α is the sequence database SDB if α is null; otherwise, it is the α-projected database.
1. Scan S|α once, count the support of each item, and find each weighted frequent item, b, in the sequences: b is a weighted sequential item if the following pruning conditions are satisfied.
   Condition 1: (weighted support (support * MaxW) ≥ min_sup)
   Condition 2: (support ≥ min_sup or weight ≥ min_weight)
   (a) b can be assembled to the last itemset of α to form a weighted sequential pattern, or
   (b) <b> can be appended to α to form a weighted sequential pattern.
2. For each weighted frequent item b,
   Add it to α to form a sequential pattern α', and output α'.
   End for
3. For each α',
   Construct the α'-weighted projected database S|α';
   Call WSpan (α', L + 1, S|α')
   End for

After the WSpan algorithm calls the procedure WSpan (WSP, <b>, 1, SDB), WSpan (α', L + 1, S|α') is called recursively after the α'-projected database S|α' is constructed. Recall that the approximate maximum weighted support (support(S) * MaxW) is used during mining. Therefore, in the final step, we should prune the weighted infrequent sequential patterns that do not satisfy the exact condition (support(S) * weight(S) ≥ min_sup), because the maximum weighted support is an approximate value.
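A compact C++ rendering of the control flow above is sketched below, under simplifying assumptions: items are single characters, projection keeps whole physical suffixes, only sequence extension (appending <b> as a new itemset) is modeled, and MaxW is held fixed rather than shrunk as items are pruned. It illustrates the recursion, not the author's implementation; as the text notes, a full implementation must also re-check support(S) * weight(S) ≥ min_sup for each output pattern.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Itemset  = std::set<char>;
using Sequence = std::vector<Itemset>;
using SDB      = std::vector<Sequence>;

struct Params {
    int minSup;
    double minWeight;
    std::map<char, double> weight;  // normalized item weights
};

// Physical suffix of s after the first itemset containing b
// (itemset-extension items "_x" are omitted in this sketch).
static bool project(const Sequence& s, char b, Sequence& suffix) {
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i].count(b)) {
            suffix.assign(s.begin() + i + 1, s.end());
            return true;
        }
    return false;
}

static void wspan(const std::string& alpha, const SDB& db,
                  const Params& p, double maxW,
                  std::vector<std::string>& out) {
    // Step 1: scan the (projected) database once, counting supports.
    std::map<char, int> sup;
    for (const Sequence& s : db) {
        std::set<char> seen;
        for (const Itemset& e : s) seen.insert(e.begin(), e.end());
        for (char c : seen) ++sup[c];
    }
    for (const auto& [b, n] : sup) {
        bool cond1 = n * maxW >= p.minSup;                            // condition 1
        bool cond2 = n >= p.minSup || p.weight.at(b) >= p.minWeight;  // condition 2
        if (!(cond1 && cond2)) continue;       // b is weighted infrequent
        std::string alpha2 = alpha + b;        // Step 2: grow and output
        out.push_back(alpha2);
        SDB db2;                               // Step 3: build the projected DB
        for (const Sequence& s : db) {
            Sequence suf;
            if (project(s, b, suf) && !suf.empty()) db2.push_back(suf);
        }
        if (!db2.empty()) wspan(alpha2, db2, p, maxW, out);  // recurse
    }
}
```

Called as wspan("", sdb, params, maxW, result) on the database of Table 1, the sketch enumerates the sequence-extension part of the pattern space explored by the pseudocode above.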

Example 4 (Mining weighted sequential patterns). We show how to mine weighted sequential patterns by using the prefix-based projection approach. Given the SDB in Table 1, a minimum support of 2, and WR2 as the normalized weights of items in Table 4, weighted sequential patterns are mined by the prefix projection approach as follows. (1) Scan the sequence database once and find all the weighted frequent items in the sequences. The weighted infrequent items can be removed according to the pruning conditions. For instance, the items "g" and "h" in each itemset within the sequences are pruned; for "g", for example, the weighted support (1.8) obtained by multiplying its support (2) with MaxW (0.9) is less than the minimum support (2). The length-1 weighted frequent sequential patterns are <a>:6, <b>:6, <c>:6, <d>:5, <e>:4, <f>:3, and the weight list is <a:0.9, b:0.75, c:0.8, d:0.85, e:0.75, f:0.7>. The complete set of weighted sequential patterns can be partitioned into the following six subsets according to the six prefixes: <a>, <b>, <c>, <d>, <e>, and <f>. The subsets of weighted sequential patterns can be mined by constructing the corresponding set of weighted projected databases and mining each of them recursively. We only collect the sequences which have the prefix <a>. Additionally, in a sequence containing the prefix <a>, only the
subsequence prefixed with the first occurrence of the prefix <a> should be considered. The sequences in the sequence database containing the prefix <a> are projected with regard to the prefix <a> to form the <a>-projected database, which consists of six suffix sequences: <(abc) (ac) d (cf)>, <(_d) c (bc) (ae) bc>, <(_b) (df) cb>, <(_f) cbc>, <(ab) (cd) e> and <(abd) bc>. By scanning the <a>-projected database once, its local items are found to be a:4, b:6, c:6, d:4, e:2, f:2, (_b):4, (_d):1, (_e):1 and (_f):1. The weighted infrequent local items are pruned, and the length-2 weighted sequential patterns prefixed with <a> are: <aa>:4, <ab>:6, <ac>:6, <ad>:4, and <(ab)>:4. Before constructing the next projected database, the pruning conditions are applied in order to check whether each pattern is a weighted sequential pattern. Recursively, all the weighted sequential patterns with the prefix <a> can be partitioned into five subsets: those prefixed with (1) <aa>, (2) <ab>, (3) <ac>, (4) <ad>, and (5) <(ab)>. For example, the <aa>-projected database consists of four suffix subsequences prefixed with <aa>: <(_bc) (ac) dc>, <bc>, <(_b) (cd)>, and <(_bd) bc>. By scanning the <aa>-projected database once, its local items are found to be a:1, b:2, c:4, d:2, (_b):3, and (_c):1, and the weighted infrequent items a:1, b:2, d:2 and (_c):1 are pruned. Hence, the <aa>-projected database returns two weighted sequential patterns: <aac>:4 and <a(ab)>:3. As another example, the <(ab)>-projected database consists of four suffix subsequences prefixed with <(ab)>: <(_c) (ac) dc>, <dcb>, <(cd)> and <(_d) bc>. By scanning the <(ab)>-projected database once, its local items are found to be a:1, b:2, c:4, d:3, (_c):1 and (_d):1. The weighted infrequent items a:1, b:2, (_c):1 and (_d):1 are pruned by the pruning conditions. Finally, the <(ab)>-projected database returns the weighted sequential patterns <(ab)c>:4 and <(ab)d>:3. In this way, WSpan finds the weighted sequential patterns with the prefixes <b>, <c>, <d>, <e>, and <f>, respectively.
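The projection step of Example 4 can be sketched as follows, using PrefixSpan's convention of marking with "_x" those items that occur in the same itemset as the matched prefix item. This single-item projector is a simplified illustration (hypothetical helper names), not the paper's code:

```cpp
#include <iostream>
#include <set>
#include <vector>

using Itemset  = std::set<char>;
using Sequence = std::vector<Itemset>;

// Suffix of s w.r.t. the single-item prefix 'b': the remainder of the
// itemset holding the first occurrence of b, then all later itemsets.
struct Suffix {
    Itemset  rest;   // the "_x" items, alphabetically after b
    Sequence tail;   // the itemsets following the matched one
    bool     ok = false;
};

Suffix projectOne(const Sequence& s, char b) {
    Suffix r;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i].count(b)) {
            for (char x : s[i]) if (x > b) r.rest.insert(x);
            r.tail.assign(s.begin() + i + 1, s.end());
            r.ok = true;
            break;
        }
    return r;
}

int main() {
    // Sequence 20 of Table 1: <(ad) c (bc) (ae) bc>.
    Sequence s20 = {{'a','d'}, {'c'}, {'b','c'}, {'a','e'}, {'b'}, {'c'}};
    Suffix suf = projectOne(s20, 'a');
    for (char x : suf.rest) std::cout << '_' << x << ' ';  // prints "_d"
    for (const Itemset& e : suf.tail) {                    // then "c bc ae b c"
        for (char x : e) std::cout << x;
        std::cout << ' ';
    }
    std::cout << '\n';                      // i.e., suffix <(_d) c (bc) (ae) bc>
}
```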
4. Performance evaluation

In this section, we present our performance study over various datasets. WSpan is the first sequential pattern mining algorithm to consider the weights of sequences, of the itemsets of the sequences, and of the items of the itemsets. We report our experimental results on the performance of WSpan in comparison with a recently developed algorithm, SPAM [2], which is currently the fastest algorithm for mining sequential patterns. First, we show how the number of weighted sequential patterns can be adjusted, the efficiency in terms of runtime of the WSpan algorithm, and the quality of the weighted sequential patterns. Second, we show that WSpan has good scalability with respect to the number of sequences. The IBM dataset generator (http://www.almaden.ibm.com/software/projects/hdb/resources.shtml) is used to generate the sequence datasets in this test. It accepts essential parameters such as the number of sequences (customers), the average number of itemsets (transactions) in each sequence, the average number of items (products) in each itemset, and the number of different items in the dataset. Table 6 shows the parameters and their meanings for this sequential dataset generation; more detailed information can be found in [1]. WSpan was written in C++, and experiments were performed on a sparcv9 processor operating at 1062 MHz with 2048 MB of memory. All experiments were performed on a Unix machine. In our experiments, a random generation function was used to generate normalized weights for each item.

Table 6
Parameters for the IBM Quest data generator

  Symbol   Meaning
  D        Number of customers in the dataset
  C        Average number of transactions per customer
  T        Average number of items per transaction
  S        Average length of maximal sequences
  I        Average length of transactions within the maximal sequences
  N        Number of different items

4.1. Comparison of WSpan and SPAM

Our experiments show that in most cases, WSpan outperforms SPAM. In this test, different weight ranges with normalized weights are applied, but the minimum weight is fixed at the minimum value within each weight range in order to test the effect of the weight range.

Figs. 1 and 2 show that WSpan generates fewer sequential patterns and runs faster than SPAM. Specifically, far fewer sequential patterns are generated as the weight range is decreased. Note that SPAM generates a huge number of sequential patterns and becomes slower as the minimum support is decreased. In SPAM, the number of sequential patterns increases quickly when the minimum support is less than 10%. Moreover, the runtime becomes much slower when the minimum support is less than 6%. Meanwhile, WSpan generates fewer patterns than SPAM by adjusting the weight range. We can see that WSpan is faster than SPAM. In addition, the number of patterns discovered by WSpan is
several orders of magnitude smaller than the number of sequential patterns found by SPAM with the same minimum supports.

[Figure omitted] Fig. 1. Number of patterns (SPAM vs. WSpan with WR 0.4-0.5, 0.3-0.4 and 0.2-0.3; x-axis: minimum support).
[Figure omitted] Fig. 2. Runtime in seconds (SPAM vs. WSpan with WR 0.4-0.5, 0.3-0.4 and 0.2-0.3; x-axis: minimum support, in %).

Figs. 3 and 4 demonstrate the results of the performance test using the D7C7T7S7I7 dataset, with the normalized weights of items set from 0.2 to 0.5. On this dataset, WSpan shows better performance than SPAM, and the performance difference becomes larger when the support threshold is lowered. In Fig. 3, the number of sequential patterns increases as the minimum support is decreased, but the number of sequential patterns in SPAM increases substantially as the minimum support becomes lower. Although SPAM can reduce the number of patterns by increasing the minimum support, unimportant patterns are still discovered in the result sets. Meanwhile, in WSpan, the unimportant patterns are pruned first during the mining process, resulting in concise but significant patterns. In Fig. 4, WSpan is faster than SPAM, and the difference becomes larger as the minimum support is lowered.

[Figure omitted] Fig. 3. Number of patterns (SPAM vs. WSpan with WR 0.4-0.5 and 0.3-0.4; x-axis: minimum support, in %).
[Figure omitted] Fig. 4. Runtime in seconds (SPAM vs. WSpan with WR 0.4-0.5 and 0.3-0.4; x-axis: minimum support, in %).

In Figs. 5 and 6, we report the evaluation results for the D15C15T15S15I15 dataset. The main performance difference between the WSpan and SPAM algorithms results from using a weight range. As the support threshold is decreased, the number of sequential patterns of WSpan increases, but the number for SPAM increases extremely. In Fig. 5, we could not show the number of sequential patterns of SPAM, because the number of sequential patterns mined by SPAM is huge with minimum supports of less than 55%. For example, the numbers of sequential patterns in SPAM are 449,403 with a minimum support of 55%, 1,365,328 with a minimum support of 50%, 44,062,294 with a minimum support of 45%, and so on. In Fig. 6, WSpan is faster than SPAM.

[Figure omitted] Fig. 5. Number of patterns (SPAM vs. WSpan with WR 0.8-0.85, 0.75-0.8, 0.7-0.75, 0.65-0.7 and 0.6-0.65; x-axis: minimum support, in %).
[Figure omitted] Fig. 6. Runtime in seconds (SPAM vs. WSpan with WR 0.8-0.85, 0.75-0.8, 0.7-0.75, 0.65-0.7 and 0.6-0.65; x-axis: minimum support, in %).

The above experiments show that WSpan can generate fewer but important frequent sequential patterns with various weight ranges in several datasets. It may not be surprising that the number of patterns and the runtime are reduced; the number of patterns can also be decreased by increasing the minimum support in previous sequential pattern mining. However, by increasing the minimum support in previous mining algorithms, important patterns with high weights can be removed while unimportant patterns (with low weights) are still found. It is difficult to detect important sequential patterns by using the minimum support alone. Constraint-based sequential pattern
mining [8,11,13] can be applied to reduce the number of patterns, but previous constraints do not provide a way to find important patterns while pruning unimportant ones. WSpan discovers important patterns by using the weights of items, of the itemsets of the items, and of the sequences of the itemsets.

4.2. Effect of the minimum weight

We perform this test to show the effect of the minimum weight. In this test, the D7C7T7S7I7 dataset is used. Table 7 lists the number of Weighted Sequential Patterns (WSP) found with various minimum weights by WSpan and the number of Sequential Patterns (SP) generated by SPAM.

From Table 7, WSpan generates fewer WSP by using different minimum weight thresholds. For example, the number of sequential patterns found by SPAM at a minimum support of 6% is 63,207. Meanwhile, the number of WSP at a minimum support of 6%, a WR of 0.8-1.2 and a minimum weight of 0.8 is 48,827; the number of WSP can be reduced to 36,156 with a minimum weight of 1.0, and further reduced to 27,826 with a minimum weight of 1.2. By increasing the minimum weight, the patterns with relatively low weights are pruned first. In this way, the proper number of weighted sequential patterns can be found by adjusting the minimum weight.

Table 7
The effect of the minimum weight threshold (min_weight)

  Minimum support (%)   W.S.P (min_weight: 1.2)   W.S.P (min_weight: 1.0)   W.S.P (min_weight: 0.8)   S.P
  5                     56,897                    79,320                    113,712                   124,728
  5.5                   36,475                    48,356                    66,825                    81,792
  6                     27,826                    36,156                    48,827                    63,207
  6.5                   19,394                    26,204                    36,475                    44,262

  (All W.S.P columns use WR: 0.8-1.2; S.P is the number of sequential patterns found by SPAM.)

4.3. Quality of weighted sequential patterns

In the previous tests, we showed the efficiency in terms of the number of patterns and the runtime. In all datasets, items are expressed as integer values. In this test, the D7C7T5S4I2.5 dataset is used to illustrate the quality of the weighted frequent sequential patterns. We compare the patterns mined by WSpan with those found by SPAM (general sequential pattern mining) and show that the result sets of the two approaches are different in spite of using the same minimum support. For example, the sequential patterns (<pattern>:support) <(17, 45) (91) (70) (91)>:22 and <(45) (27, 91) (70)>:21 are mined by SPAM with a minimum support of 3%. However, these sequential patterns are pruned by WSpan with the weight range 0.3-0.4 because they are weighted infrequent sequential patterns. Second, although the minimum support is increased from 3% to 5%, weighted infrequent sequential patterns such as <(17, 45) (91) (70) (91)>:22 and <(45) (27, 91) (70)>:21 are still found by SPAM. Meanwhile, the patterns are pruned by WSpan. In other words, the supports of these sequential patterns are more than the minimum support, but the weights of the patterns are relatively low. The result patterns of WSpan are different from those of general sequential pattern mining, even though the number of patterns of both WSpan and SPAM can be reduced by using the minimum support.

4.4. Scalability test

The DxC2.5T5S4I2.5 dataset is used to test scalability with the number of sequences in the sequence database. From the performance test, both WSpan and SPAM show linear scalability with the number of sequences from 20K to 100K, but WSpan is much more scalable than SPAM.

In Figs. 7 and 8, the difference between WSpan and SPAM becomes clear. WSpan has much better scalability
in terms of the number of sequences in the database and becomes faster as the weight range is decreased. In SPAM, the runtime increases dramatically as the number of sequences becomes larger.

[Figure omitted] Fig. 7. Scalability test (Min_sup = 0.4%; SPAM vs. WSpan with WR 0.4-0.5, 0.3-0.4 and 0.2-0.3; x-axis: database size in sequences, 20K-100K; y-axis: runtime in seconds).
[Figure omitted] Fig. 8. Scalability test (Min_sup = 0.5%; SPAM vs. WSpan with WR 0.5-0.6, 0.4-0.5 and 0.3-0.4; x-axis: database size in sequences, 20K-100K; y-axis: runtime in seconds).

5. Conclusions and future work

In this paper, we described a framework for mining weighted sequential patterns and developed the WSpan algorithm based on the prefix-projected sequential pattern growth approach. We suggested a new approach to detect more important sequential patterns. The extensive performance analysis shows that WSpan is efficient and scalable in weighted sequential pattern mining. This framework for mining weighted sequential patterns can be applied in several areas. In application domains such as financial data analysis, the retail industry and the telecommunication industry, weighted sequential pattern mining can be used to detect unusual access, such as sequences related to financial crimes, fraudulent telecommunication activities, and purchases of expensive items within a short time. In this case, high weights are given to previously found fraudulent patterns (and to the items within those patterns) in order to discover suspicious patterns. As future research, prices of items can be used as a weight factor to set up the weights of items in market basket data; however, we should think of ways to assign weights to items in other types of datasets, such as web log data, biomedical data, DNA data and data used in other applications. Second, more research and experiments are needed to give guidance on how to set the thresholds efficiently; effective settings of the thresholds are essential, although this is a common problem of all threshold-based mining algorithms. As a further extension, based on the framework, the weighted sequential pattern mining algorithm can be implemented using a vertical bitmap representation. In addition, WSpan can be extended and optimized by mining weighted closed sequential patterns.

References

[1] R. Agrawal, R. Srikant, Mining sequential patterns, in: Proc. the Eleventh International Conference on Data Engineering, March 1995, pp. 3-14.
[2] J. Ayres, J. Gehrke, T. Yiu, J. Flannick, Sequential pattern mining using a bitmap representation, in: Proc. the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002, pp. 429-435.
[3] C.H. Cai, A.W. Fu, C.H. Cheng, W.W. Kwong, Mining association rules with weighted items, in: Proc. International Database Engineering and Applications Symposium (IDEAS 98), Cardiff, Wales, UK, 1998, pp. 68-77.
[4] D. Chiu, Y. Wu, A.L. Chen, An efficient algorithm for mining frequent sequences by a new strategy without support counting, in: Proc. the Twentieth International Conference on Data Engineering, March/April 2004, pp. 375-386.
[5] H. Cheng, X. Yan, J. Han, IncSpan: incremental mining of sequential patterns in large databases, in: Proc. the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2004, pp. 527-532.
[6] H. Chung, X. Yan, J. Han, SeqIndex: indexing sequences by sequential pattern analysis, in: Proc. the Fifth SIAM International Conference on Data Mining, April 2005, pp. 601-605.
[7] M. Ester, A top-down method for mining most specific frequent patterns in biological sequence data, in: Proc. the Fourth SIAM International Conference on Data Mining, April 2004, pp. 90-101.
[8] M. Garofalakis, R. Rastogi, K. Shim, SPIRIT: sequential pattern mining with regular expression constraints, in: Proc. the Twenty-fifth International Conference on Very Large Data Bases (VLDB'99), September 1999, pp. 223-234.
[9] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.C. Hsu, FreeSpan: frequent pattern-projected sequential pattern mining, in: Proc. the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2000, pp. 355-359.
[10] H.C. Kum, J. Pei, W. Wang, D. Duncan, ApproxMAP: approximate mining of consensus sequential patterns, in: Proc. the Third SIAM International Conference on Data Mining, May 2003, pp. 311-315.
[11] H. Albert-Lorincz, J.F. Boulicaut, Mining frequent sequential patterns under regular expressions: a highly adaptive strategy for pushing constraints, in: Proc. the Third SIAM International Conference on Data Mining, May 2003, pp. 316-320.
[12] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, in: Proc. the Seventeenth International Conference on Data Engineering, April 2001, pp. 215-224.
[13] J. Pei, J. Han, W. Wang, Mining sequential patterns with constraints in large databases, in: Proc. the 2002 ACM CIKM International Conference on Information and Knowledge Management, November 2002, pp. 18-25.
[14] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, Q. Chen, Mining sequential patterns by pattern-growth: the PrefixSpan approach, IEEE Transactions on Knowledge and Data Engineering 16 (2004) 1424-1440.
[15] H. Pinto, J. Han, J. Pei, K. Wang, Multi-dimensional sequence pattern mining, in: Proc. the 2001 ACM CIKM International Conference on Information and Knowledge Management, November 2001, pp. 474-481.
[16] R. Srikant, R. Agrawal, Mining sequential patterns: generalizations and performance improvements, in: Proc. the Fifth International Conference on Extending Database Technology, March 1996, pp. 3-17.
[17] F. Tao, Weighted association rule mining using weighted support and significance framework, in: Proc. the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2003, pp. 661-666.
[18] P. Tzvetkov, X. Yan, J. Han, TSP: mining top-k closed sequential patterns, in: Proc. the Third IEEE International Conference on Data Mining (ICDM 2003), December 2003, pp. 347-354.
[19] J. Wang, J. Han, BIDE: efficient mining of frequent closed sequences, in: Proc. the Twentieth International Conference on Data Engineering, March/April 2004, pp. 79-90.
[20] K. Wang, Y. Xu, J.X. Yu, Scalable sequential pattern mining for biological sequences, in: Proc. the 2004 ACM CIKM International Conference on Information and Knowledge Management, November 2004, pp. 178-187.
[21] W. Wang, J. Yang, P.S. Yu, Efficient mining of weighted association rules (WAR), in: Proc. the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2000, pp. 270-274.
[22] X. Yan, J. Han, gSpan: graph-based substructure pattern mining, in: Proc. the 2002 IEEE International Conference on Data Mining (ICDM 2002), December 2002, pp. 721-724.
[23] X. Yan, J. Han, R. Afshar, CloSpan: mining closed sequential patterns in large datasets, in: Proc. the Third SIAM International Conference on Data Mining, May 2003.
[24] J. Yang, P.S. Yu, W. Wang, J. Han, Mining long sequential patterns in a noisy environment, in: Proc. the 2002 ACM SIGMOD International Conference on Management of Data, June 2002, pp. 406-417.
[25] U. Yun, J.J. Leggett, WLPMiner: weighted frequent pattern mining with length decreasing support constraints, in: Proc. the Ninth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hanoi, Vietnam, May 2005, pp. 555-567.
[26] U. Yun, Efficient mining of weighted interesting patterns with a strong weight and/or support affinity, Information Sciences (2007).
[27] U. Yun, Mining lossless closed frequent patterns with weight constraints, Knowledge-Based Systems 20 (2007) 86-97.
[28] M.J. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Machine Learning 42 (2001) 31-60.