Fast Utility Mining On Sequence Data
Abstract—High-utility sequential pattern mining is an emerging topic in the field of Knowledge Discovery in Databases. It consists of discovering subsequences having a high utility (importance) in sequences, which are referred to as high-utility sequential patterns (HUSPs). HUSPs can be applied to many real-life applications, such as market basket analysis, E-commerce recommendation, click-stream analysis and route planning. Several algorithms have been proposed to address this problem by efficiently mining utility-based useful sequential patterns. Nevertheless, the performance of these algorithms can be unsatisfactory in terms of runtime and memory usage due to the combinatorial explosion of the search space for low utility thresholds and large-scale data. Hence, this paper proposes an efficient algorithm for the task of high-utility sequential pattern mining, called HUSP-ULL. It utilizes a lexicographic q-sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper-bounds on the utility of candidate sequences, and reduce the search space by pruning unpromising candidates early. Substantial experiments on both real-life and synthetic datasets show that HUSP-ULL can effectively and efficiently discover the complete set of HUSPs and outperforms the state-of-the-art algorithms.

[...] way. (E-mail: jerrylin@ieee.org)
Jiexiong Zhang is with the Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China.
Philippe Fournier-Viger is with the Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China.
Han-Chieh Chao is with the Department of Electrical Engineering, National Dong Hwa University, Taiwan, R.O.C.
Philip S. Yu is with the Department of Computer Science, University of Illinois at Chicago, IL, USA.

Sequential pattern mining (SPM) [1], [2], [3], [4] is an interesting and critical research area in Knowledge Discovery in Databases (KDD) [5], [6], which plays a key role in various applications such as DNA sequence analysis, consumer behavior analysis, and natural disaster analysis [4]. The main objective of SPM is to discover a set of frequent sequences [...]

[...] the data/information quality on the Weblog data, it is important to take this attribute into account for providing more precise assessment of data/information quality.

SPM is similar to frequent itemset mining (FIM) [7], [8], as it is designed to discover patterns that frequently occur in data. The implicit assumption of FIM and SPM is that frequent patterns are useful and interesting. For example, it is interesting information for a business manager that beer and diapers are purchased together in the supermarket. The main difference between SPM and FIM is that SPM generalizes FIM by considering the sequential ordering of sequences. Therefore, mining interesting patterns in a sequential database using SPM is more challenging than FIM [3]. One significant shortcoming of traditional sequential pattern mining is that all objects (items, events, sequences, movements, etc.) are treated equally. In fact, the most frequently occurring patterns can be, quite typically, the least interesting ones. In general, criteria such as the interestingness, utility, and importance of patterns are not taken into account in traditional SPM and FIM. [...]

[...] analysis, the diamond may not be considered as a frequent pattern if its sale frequency is relatively low compared to the sale amount of the eggs. However, some infrequent patterns such as diamonds may yield higher profit than that of the eggs. To address this issue, FIM was generalized to obtain the problem of high-utility itemset mining (HUIM) [9], [10], [11], [...] high-utility sequential pattern mining (HUSPM) [14], [15], [16], [17]. Different from SPM, HUSPM considers not only the sequential ordering of items but also their utility values. Hence, HUSPM is more difficult than traditional SPM and HUIM. As shown in Fig. 1, a customer wants to purchase a mountain trail Bicycle, LED Headlight, and the UShake Bike lock, and each item has its own unit price. In this
2 IEEE TRANSACTIONS ON CYBERNETICS, 2019
case, the consumers' purchase behavior consists of a series of utility-oriented sequential events/processes within different timestamps. Since high-utility sequential pattern mining has many applications, many researchers have focused on this issue and several algorithms have been developed to discover the complete set of high-utility sequential patterns. However, there are still several challenges in HUSPM. First, the utility of a pattern is neither monotonic nor anti-monotonic. Therefore, the downward closure property of support (aka the Apriori property [7]) does not hold in HUSPM and the search space is difficult to reduce. Second, previous approaches have been proposed for determining upper bounds (i.e., sequence-weighted utilization (SWU) [14], sequence-utility upper-bound (SUUB) [16], sequence extension utility (SEU) [18]) on the utility of the potential sequential patterns. However, these algorithms often consume a large amount of memory and have long execution times due to the combinatorial explosion of the search space. Third, in the era of big data, the data that needs to be analyzed grows quickly. How to design more efficient HUSPM algorithms that scale well on very large datasets is also an important topic.

To address these challenges, this paper designs a novel utility-linked list (UL-list) based algorithm called HUSP-ULL (mining High-Utility Sequential Patterns more efficiently with UL-list). The major contributions of this paper are as follows:

1) Insightful patterns. A novel fast algorithm is proposed to efficiently identify meaningful and profitable HUSPs. It employs a utility-linked list structure and two pruning strategies to improve its mining performance.
2) Novel index structures. A compressed utility-linked (UL)-list structure is designed to store information about patterns instead of processing the original database. UL-list is quite compact and different from the existing data structures for utility mining.
3) Effective pruning. Utilizing UL-list, two pruning strategies, named Look Ahead Removing (LAR) and Irrelevant Item Pruning (IIP), are integrated in the designed algorithm to reduce the search space and improve its performance in discovering HUSPs.
4) Fast and better scalability. Experimental results show that the proposed algorithm can efficiently discover HUSPs and outperforms the existing state-of-the-art HUSPM algorithms in terms of runtime, memory usage, unpromising pattern filtering, and scalability.

[...] with the UL-list and two pruning strategies are presented in Section IV. An experimental evaluation of the designed algorithms is provided in Section V. Finally, conclusions are described in Section VI.

[...]

We structure the related work around two main elements that this paper addresses: high-utility itemset mining and high-utility sequential pattern mining.

A. High-Utility Itemset Mining

The problem of high-utility itemset mining (HUIM) [9], [10] was designed to find the set of high-utility itemsets (HUIs), i.e., the itemsets whose utility values are greater than or equal to a minimum utility threshold. Since HUIM does not provide a downward closure property to reduce the search space, unlike association rule mining (ARM) [7], it is necessary to find other strategies for reducing the search space. To obtain a downward closure property that can be used in HUIM, Liu et al. [13] introduced the transaction-weighted downward closure (TWDC) property and defined a set of candidates called the high transaction-weighted utilization itemsets (HTWUIs). Based on the HTWUIs, the Two-Phase [13] algorithm can find HUIs with the downward closure property. It first discovers the set of HTWUIs using a breadth-first search and then selects HUIs among the discovered HTWUIs. To achieve better performance for mining HUIs, some tree-based HUIM algorithms were introduced such as IHUP [19], UP-Growth [20] and UP-Growth+ [21]. To reduce the number of candidates, Liu et al. [12] proposed the HUI-Miner algorithm, which efficiently discovers HUIs using a vertical structure called utility-list. This procedure identifies HUIs without generating candidates or performing multiple database scans.

Up to now, HUIM has been extensively studied, and many algorithms have been proposed to mine different kinds of HUIs in many real-life applications. Many utility mining algorithms focus on mining efficiency, such as FHM [22], EFIM [23] and d2HUP [24]. On the other hand, several models and algorithms address the effectiveness of utility-oriented mining by discovering various kinds of HUIs, such as mining HUIs in uncertain databases [11], mining the top-k HUIs without setting the minimum utility threshold [25], exploiting non-redundant correlated utility patterns [26], [27], extracting the up-to-date HUIs to show the sale trends [28], mining temporal on-shelf HUIs [29], and handling the big data issue of HUIM [30]. Yun et al. [31] proposed a damped window to extract high average utility patterns over data streams. Gan et al. [32] recently proposed a new utility measure, namely utility occupancy, for pattern mining. In contrast to static data, dynamic data is more complex and desirable in many real-life applications. Several dynamic utility mining models [33], [34] have been proposed to deal with dynamic databases.

[...] important as it considers the sequential ordering of itemsets, which is significant for many applications such as behavior analysis, DNA sequence analysis, and weblog mining [4]. SPM was proposed by Agrawal and Srikant [1] and has been [...] for SPM have also been extensively studied, such as inter-sequence patterns [38], [39]. Several recent literature surveys of the development of SPM can be found in [4],
GAN et al.: FAST UTILITY MINING ON SEQUENCE DATA 3
[40]. SPM algorithms rely on the frequency/support framework to discover frequent sequences, which does not take business interests into account. High-utility sequential pattern mining (HUSPM) [14], [17], [41] was developed to address utility-driven mining on sequence data. It has been used for mining high-utility path traversal patterns of web pages [42], high-utility web access sequences [43], high-utility mobile sequences [44], [45], and HUSPs in Bioinformatics (i.e., gene regulation) [46]. Ahmed et al. [47] designed a level-wise approach called UL and a pattern-growth approach named US for HUSPM. HUSPM takes ordered sequences as input and reveals sequential patterns having high utilities, which has been a challenging and important issue in recent decades. Hence, Yin et al. [14] proposed a formal framework for HUSPM and introduced an efficient USpan algorithm to discover high-utility sequential patterns (HUSPs). Information about the utility of each node in the tree is stored in a utility-matrix for mining HUSPs without performing multiple database scans. Two pruning strategies based on the sequential-weighted downward closure property and on the remaining utility model were designed to reduce the search space. However, USpan may fail to discover the complete set of HUSPs due to its over-estimated upper bound on the potential pattern [18].

Lan et al. [16] then proposed a projection-based approach with a sequence-utility upper-bound (SUUB) to discover high-utility sequential patterns. A novel indexing strategy and the maximum utility measure were developed to improve the mining performance. Then, Alkan et al. [41] proposed the [...]

A comprehensive review of utility-oriented pattern mining can be found in [34], [50], [51].

III. PRELIMINARIES AND PROBLEM STATEMENT

In this section, we introduce notations and concepts used in the paper. Then, we give the formal problem definition.

[...]

Let I = {i_1, i_2, ..., i_m} be a finite set of distinct items (symbols). A quantitative itemset, denoted as v = [(i_1:q_1) (i_2:q_2) ... (i_c:q_c)], is a subset of I in which each item is associated with a quantity (internal utility). An itemset, denoted as w = [i_1, i_2, ..., i_c], is a subset of I without quantities. Without loss of generality, we assume that items in an itemset (quantitative itemset) are listed in alphabetical order since items are unordered in an itemset (quantitative itemset). A quantitative sequence is an ordered list of one or more quantitative itemsets, which is denoted as s = <v_1, v_2, ..., v_d>. A sequence is an ordered list of one or more itemsets without quantities, which is denoted as t = <w_1, w_2, ..., w_d>.

For convenience, in the following "quantitative" will be abbreviated as "q-". Thus, the term "q-sequence" will be used to refer to a sequence with quantities, and "sequence" to refer to sequences without quantities. Similarly, a "q-itemset" is an itemset having quantities, while "itemset" refers to an itemset that does not have quantities. For example, <[(a:2) (b:1)], [(c:3)]> is a q-sequence while <[ab], [c]> is a sequence. [(a:2) (b:1)] is a q-itemset and [ab] is an itemset. A quantitative sequential database is a set of transactions D = {S_1, S_2, ..., S_n}, where each transaction S_q ∈ D is a q-sequence and has a unique identifier q called its SID. In addition, each item i_j in D is associated with a profit (external utility), which is denoted as pr(i_j).

Consider the following running example. A quantitative sequential database is shown in Table I. This database has 6 transactions and 6 items. Table II is a utility table that provides a unit profit for each item in Table I. In the running example, [(a:2) (c:3)] is the first q-itemset of transaction S1. The quantity of the item (a) in this q-itemset is 2, and its utility is calculated as 2 × $5 = $10.

[...] denoted as u(i_j, v), and defined as u(i_j, v) = q(i_j, v) × pr(i_j), where q(i_j, v) is the quantity of (i_j) in v, and pr(i_j) is the [...] it can be defined as u(v) = Σ_{i_j ∈ v} u(i_j, v).

For instance, the utility of item (c) in the first q-itemset of S1 in Table I is calculated as: u(c, [(a:2) (c:3)]) = q(c, [(a:2) (c:3)]) × pr(c) = 3 × $4 = $12. And u([(a:2) (c:3)]) = u(a, [(a:2) [...]

[...] ..., v_d> is defined as u(s) = Σ_{v ∈ s} u(v). The utility of a quantitative sequential database D is the sum of the utility of each of its q-sequences: u(D) = Σ_{s ∈ D} u(s).

For instance, consider Table I. We have that u(S1) = u([(a:2) (c:3)]) + u([(a:3) (b:1) (c:2)]) + u([(a:4) (b:5) (d:4)])
+ u([(e:3)]) = $22 + $26 + $43 + $3 = $94. For example, u(D) = u(S1) + u(S2) + u(S3) + u(S4) + u(S5) + u(S6) = $94 + $67 + $56 + $67 + $76 + $81 = $441, as shown in Table I.

Definition 3: Given a q-sequence s = <v_1, v_2, ..., v_d> and a sequence t = <w_1, w_2, ..., w_d′>, if d = d′ and the items in v_k are the same as the items in w_k for 1 ≤ k ≤ d, t matches s, which is denoted as t ∼ s.

For instance, in Table I, <[ac], [abc], [abd], [e]> matches S1. Note that it is possible that a sequence has more than one match in a q-sequence. For instance, <[a], [b]> has three matches in S1: <[a:2], [b:1]>, <[a:2], [b:5]> and <[a:3], [b:5]>. Thus, HUSPM is generally considered more challenging than SPM and HUIM.

Definition 4: Let there be some itemsets w and w′. The itemset w is contained in w′ (denoted as w ⊆ w′) if w is a subset of w′ or w is the same as w′. Given two q-itemsets v and v′, v is said to be contained in v′ if for any item in v, there exists the same item having the same quantity in v′. This is denoted as v ⊆ v′. Thus, q-itemset containment is different from itemset containment.

For example, the itemset [ac] is contained in the itemset [abc] in Table I. The q-itemset [(a:2) (c:3)] is contained in [(a:2) (b:1) (c:3)] and [(a:2) (c:3) (e:2)], but [(a:2) (c:3)] is not contained in [(a:2) (b:3) (c:1)] or [(a:4) (c:3) (d:4)].

Definition 5: Let there be some sequences t = <w_1, w_2, ..., w_d> and t′ = <w′_1, w′_2, ..., w′_d′>. The sequence t is contained in t′ (denoted as t ⊆ t′) if there exists an integer sequence 1 ≤ k_1 ≤ k_2 ≤ ··· ≤ d′ such that w_j ⊆ w′_{k_j} for 1 ≤ j ≤ d. Let there be two q-sequences s = <v_1, v_2, ..., v_d> and s′ = <v′_1, v′_2, ..., v′_d′>. s is said to be contained in s′ (denoted as s ⊆ s′) if there exists an integer sequence 1 ≤ k_1 ≤ k_2 ≤ ··· ≤ d′ such that v_j ⊆ v′_{k_j} for 1 ≤ j ≤ d. In the rest of this paper, t ⊆ s will be used to indicate that t ∼ s_k ∧ s_k ⊆ s for convenience.

For example, <[(a:2)], [(e:3)]> and <[(a:4)], [(e:3)]> are [...]

For instance, for the sequential database of Table I, u(<[a], [b]>, S1) = max{u(<[a:2], [b:1]>), u(<[a:2], [b:5]>), u(<[a:3], [b:5]>)} = max{$13, $25, $30} = $30. In this example, it can be seen that several utility values can be associated with a pattern in the same q-sequence. This is different from traditional SPM and HUIM. In Table I, u(<[a], [b]>) = u(<[a], [b]>, S1) + u(<[a], [b]>, S2) + u(<[a], [b]>, S3) + u(<[a], [b]>, S4) + u(<[a], [b]>, S5) = $30 + $31 + $27 + $37 + $35 = $160.

B. Problem Definition

Definition 7 (High-Utility Sequential Pattern, HUSP): A sequence t in a quantitative sequential database D is defined as a high-utility sequential pattern (denoted as HUSP) if its total utility is no less than the minimum utility threshold δ:

HUSP ← {t | u(t) ≥ δ × u(D)}. (1)

For example in Table I, u(<[a], [b]>) = $160. If δ = 0.1, then <[a], [b]> is a HUSP since u(<[a], [b]>) = $160 > δ × u(D) (= $44.1). Based on the above concepts, the problem studied in this work is formally defined below.

Problem Statement: Let there be a quantitative sequential database and a user-defined minimum utility threshold. High-utility sequential pattern mining (HUSPM) consists of enumerating all HUSPs whose total utility value in this database is no less than the minimum utility threshold.

Therefore, the objective of high-utility sequential pattern mining is to identify the sequential patterns whose utility in a sequence database meets or exceeds a pre-specified minimum utility threshold. These insightful and profitable sequential patterns can be used in some specific applications, such as market basket analysis [18], E-commerce recommendation with personalized promotion [44], [45], click-stream analysis [43], and Bioinformatics [46]. More explorations can be reviewed and studied in [50].

IV. THE PROPOSED HUSP-ULL ALGORITHM

This section presents a novel algorithm named HUSP-ULL for the problem of high-utility sequential pattern mining (HUSPM). The HUSP-ULL algorithm first scans the database to find 1-sequences for spanning a lexicographic q-sequence (LQS)-tree, which is a variant of the lexicographic tree [37]. The utility-based LQS-tree is a representation of the search space used for mining HUSPs. Details of the LQS-tree, utility-linked (UL)-list, pruning strategies, and the main procedure of the HUSP-ULL algorithm are respectively explained in this section.

[...] Concatenation of t with i_j consists of appending i_j to the last itemset of t, denoted as <t ⊕ i_j>_{I-Concatenation}. The S-Concatenation of t with an item i_j consists of adding i_j to a new itemset appended after the last itemset of t, denoted as <t ⊕ i_j>_{S-Concatenation}.

For example, given a sequence t = <[a], [b]> and a new item (c), <t ⊕ c>_{I-Concatenation} = <[a], [bc]> and <t ⊕ c>_{S-Concatenation} = <[a], [b], [c]>. It follows that the number of itemsets in t does not change after performing
an I-Concatenation, while performing an S-Concatenation increases the number of itemsets in t by one. The search process of the proposed algorithm can be viewed as the process of building an LQS-tree step-by-step, which is similar to the original lexicographic-sequence tree [37]. Each node in the tree represents a sequence. Based on the two operations, all candidates of the search space can be generated for the purpose of mining HUSPs. An illustrated partial LQS-tree can be found in [17], [14]. For example, 1-sequences such as <a>, <b>, and <c> are children of the root.

To ensure the completeness and correctness of mining HUSPs, an order is defined for processing sequences. Let there be two sequences t_a and t_b. It is said that t_a ≺ t_b if one of the following holds: 1) the length of t_a is less than that of t_b; 2) t_a is obtained by an I-Concatenation on a sequence t while t_b is obtained by an S-Concatenation on a sequence t; or 3) t_a and t_b are both obtained by respectively performing an I-Concatenation or S-Concatenation on a sequence t, and the item added to t_a is lexicographically smaller than the one added to t_b. This order on sequences is also applied to q-sequences. For example, <[a]> ≺ <[ab]> ≺ <[a], [a]> ≺ <[a], [c]>.

B. The Utility-Linked List Structure

To calculate the utility and upper-bound values of candidates, the designed algorithm could scan the original database. However, this process would result in long execution times because there are often multiple matches in a sequence. To handle this situation, the compact utility-linked (UL)-list structure is introduced to store information about the utility of each sequence. UL-list is used to efficiently generate the utility of sequences obtained by I-Concatenations and S-Concatenations to continue the search for patterns. Table III is the constructed UL-list of the sequence S1 in Table I.

TABLE III
THE UTILITY-LINKED (UL)-LIST STRUCTURE OF S1.

UP Information: <[(a, $10, $84, 3) (c, $12, $72, 5)], [(a, $15, $57, 6) (b, $3, $54, 7) (c, $8, $46, -)], [(a, $20, $26, -) (b, $15, $11, -) (d, $8, $3, -)], [(e, $3, $0, -)]>
Header Table: (a, 1) (b, 4) (c, 2) (d, 8) (e, 9)

The UL-list structure contains two parts, Header Table and UP (utility and position) Information. Details are described below.

1) Header Table. It stores the set of distinct items with the positions of their first occurrences in the transformed sequence. For example in Table III, the distinct items of S1 are (a), (b), (c), (d), and (e), and their first positions in S1 are respectively 1, 4, 2, 8 and 9.

2) UP Information. For each sequence, each element of the UP (utility and position) information respectively stores the item name, the utility of this item, the remaining utility of this item w.r.t. this element, and the next position of this item. Consider the first element (a, $10, $84, 3) in S1. It means that the utility of the item (a) in the first element is $10; the remaining utility [14] of item (a) in the first element in S1 (the overall utility after the first a in S1) is calculated to be $84; and the next position of the item (a) in S1 is 3.

Note that UL-list is quite different from the previous structures (i.e., utility-matrix [14], data-matrix [41], utility-chain [17]) that were developed for HUSPM. For each node in the LQS-tree, sequences containing this node (sequence) are transformed into a UL-list and attached to the projected set of this node. Therefore, the utilities and upper-bound values of the candidates can be easily calculated from the projected UL-lists. The designed HUSP-ULL algorithm stores only one copy of the original database as UL-lists, and then constructs a series of projected UL-lists, rather than projected sub-databases, throughout the execution. This is different from HuspExt and HUS-Span. Besides, as a compact structure, the UL-list does not consume a large amount of memory.

As mentioned, a sequence may have multiple matches in a q-sequence, and hence a sequence may have multiple utilities in a q-sequence. Thus, it is necessary to find the positions of the matches to calculate the utilities and the upper-bound values of the processed node (sequence). For convenience, the position of the last item within each match is defined [...] is called the start point. For example, consider the database of Table I. The sequence t = <[a], [b]> has three matches in S1, that is <[a:2], [b:1]>, <[a:2], [b:5]> and <[a:3], [b:5]>. The concatenation points of t in S1 are 4, 7 and 7, respectively, and the start point is 4. By definition, an I-Concatenation appends an item to the last itemset of a sequence. Thus, the candidate items for I-Concatenation are the items appearing in the itemsets containing concatenation points. An S-Concatenation adds an item to a new itemset, appended at the end of a sequence. Thus, in each sequence, the items in the itemsets appearing after the start point are candidate items for S-Concatenation. In the above example, [...] The start point (= 4) is in the second itemset, so the items appearing after the second itemset are candidates for S-Concatenation, that is {(a:4), (b:5), (d:4), (e:3)}. Since there can be multiple matches of the sequence t in a q-sequence, the utility of t in that q-sequence is defined as the largest utility value of t in that sequence.

C. The Downward Closure Property of Upper Bound

Based on UL-list, the proposed HUSP-ULL algorithm can successfully identify the complete set of HUSPs using a depth-first search that applies the two concatenation operations. However, this process can lead to exploring a very large number of candidates in the LQS-tree, since there is a combinatorial explosion of the number of candidates in the mining process of HUSPs. Since the downward closure property, also called the Apriori property [7], does not hold in high-utility sequential pattern mining, a new downward closure property must be introduced to be able to reduce the search space and efficiently find all HUSPs. To speed up the mining process and maintain the downward closure property, a sequence-weighted utilization (SWU) [14] upper-bound was proposed for mining HUSPs. This upper-bound can be used to greatly reduce the search space and eliminate unpromising candidates early.
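The Header Table and UP Information of Table III can be reproduced with a short script. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the names `Element` and `build_ul_list` are hypothetical, itemset boundaries are dropped for brevity, and the unit profits are inferred from the worked examples in the text (for instance 2 × $5 = $10 for item a) because Table II is not reproduced in this section.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

QSequence = List[List[Tuple[str, int]]]  # q-itemsets of (item, quantity) pairs

# Assumption: unit profits inferred from the worked examples (Table II is not shown here).
PROFIT: Dict[str, int] = {"a": 5, "b": 3, "c": 4, "d": 2, "e": 1}

# The q-sequence S1 of the running example.
S1: QSequence = [
    [("a", 2), ("c", 3)],
    [("a", 3), ("b", 1), ("c", 2)],
    [("a", 4), ("b", 5), ("d", 4)],
    [("e", 3)],
]


@dataclass
class Element:
    item: str
    utility: int             # quantity x unit profit at this position
    remaining: int           # total utility of all items after this position
    next_pos: Optional[int]  # 1-based position of the next occurrence, None for "-"


def build_ul_list(
    qseq: QSequence, profit: Dict[str, int]
) -> Tuple[Dict[str, int], List[Element]]:
    """Return (header table, UP information) for one q-sequence."""
    flat = [(item, qty * profit[item]) for itemset in qseq for item, qty in itemset]
    remaining = sum(u for _, u in flat)
    header: Dict[str, int] = {}
    elements: List[Element] = []
    for pos, (item, u) in enumerate(flat, start=1):
        header.setdefault(item, pos)  # keep only the first occurrence
        remaining -= u
        elements.append(Element(item, u, remaining, None))
    # Link every element to the next occurrence of the same item (right-to-left scan).
    last_seen: Dict[str, int] = {}
    for pos in range(len(elements), 0, -1):
        elem = elements[pos - 1]
        elem.next_pos = last_seen.get(elem.item)
        last_seen[elem.item] = pos
    return header, elements


header, elements = build_ul_list(S1, PROFIT)
print(header)
for elem in elements:
    print(elem)
```

Under these assumptions the first element comes out as (a, 10, 84, 3) and the header maps a, b, c, d, e to positions 1, 4, 2, 8, 9, which agrees with Table III. A single left-to-right pass yields the utilities and remaining utilities, and one right-to-left pass fills in the next-position links.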
498 Definition 9: The sequence-weighted utilization (SWU) [14] Theorem 3: Given a quantitative sequential database D and
499 of a sequence t in a quantitative
P sequential database D is two sequences t and t0 . If t ⊆ t0 , we can obtain that:
500 defined as: SW U (t) = s∈D {u(s)|t ⊆ s}.
501 For example in Table I, SW U (<a>) = u(S1 ) + u(S2 ) + SEU (t0 ) ≤ SEU (t). (4)
502 u(S3 ) + u(S4 ) + u(S5 ) + u(S6 ) = $94 + $67 + $56 + $67 + Theorem 4: Given a quantitative sequential database D and
503 $76 + $81 = $441, and SW U (<f >) = u(S6 ) = $81. a sequence t, it follows that:
Theorem 1 (Sequence-weighted downward closure prop-
erty, SWDC property [14]): Given a quantitative sequential u(t) ≤ SEU (t). (5)
database D and two sequences t and t0 . If t ⊆ t0 , then:
Proof of Theorems 3 and 4 can be referred to [18]. In 546
SW U (t0 ) ≤ SW U (t). (2) summary, they indicate that for a sequence t, if SEU(t) is less 547
0 0
P 0
than the minimum utility value (δ×u(D)) and the utility of t is 548
504
P Proof: Since t ⊆ t , SW U (t ) = s∈D {u(s)|t ⊆ s} ≤ less than that value, the utilities of the super-sequences of t are 549
505
s∈D {u(s)|t ⊆ s} = SW U (t). less than that value. If the SEU or SW U of t is less than the 550
Theorem 2: Given a quantitative sequential database D and minimum utility value, the utility of t and the utilities of the 551
a sequence t, it can be obtained that: super-sequences of t are less than this value, which indicates 552
u(t) ≤ SW U (t). (3) that t and the super-sequences of t are not HUSPs. However, 553
it may still explore a large search space since the SW U and 554
506
P Proof: Since u(t, s)P ≤ u(s), we can obtain that u(t) = SEU upper-bounds are the overestimations of utility values 555
507
s∈D {u(t, s)|t ⊆ s} ≤ s∈D {u(s)|t ⊆ s} = SW U (t).
for patterns. To improve the mining performance and reduce 556
508 Thus, numerous unpromising candidates can be pruned the search space by pruning a large number of candidates, we 557
509 using the SWU. However, the SW U of a sequence t is introduce a tighter upper-bound for mining HUSPs, which is 558
510 usually much larger than the actual utilities of t and its based on the PEU model [17]. Details are given below. 559
super-sequences. To improve the performance of the designed algorithm, the remaining utility model [14] is proposed in the USpan algorithm [14]. However, it is not a real upper bound and cannot provide the complete mining results of utility mining, as reported in [18]. Thus, the concept of sequence extension utility (SEU) [18] was proposed in the projection-based ProUM algorithm. To explain the concepts of remaining utility and sequence extension utility, several concepts related to sequences and q-sequences are introduced first.

Definition 10: Given two q-sequences s and s', if s ⊆ s', the extension of s in s' is said to be the rest of s' after s, and is denoted as <s' - s>rest. Given a sequence t and a q-sequence s, if t ∼ s_k ∧ s_k ⊆ s (t ⊆ s), the extension of t in s is the rest of s after s_k, which is denoted as <s - t>rest, where s_k is the first match of t in s.

For example, given two q-sequences s = <[a:2], [b:5]> and S1 in Table I, the extension of s in S1 is <S1 - s>rest = <[(d:4)], [(e:3)]>. Consider a sequence t = <[a], [b]>. There exist three matches of t in S1. The first one is <[a:2], [b:1]>. Thus, <S1 - t>rest = <[(c:2)], [(a:4) (b:5) (d:4)], [(e:3)]>.

Definition 11: The set of extension items of a sequence t in a quantitative sequential database D is denoted as I(t)rest and defined as

\[ I(t)_{rest} = \{ i_j \mid i_j \in \langle s - t \rangle_{rest} \wedge t \subseteq s \wedge s \in D \}. \]

In the above example, I(<[a], [b]>)rest = {a, b, c, d, e}.

Definition 12: The sequence extension utility (SEU) [18] of a sequence t in a quantitative sequential database D is denoted as SEU(t) and defined as

\[ SEU(t) = \sum_{s \in D} \{ u(t, s) + u(\langle s - t \rangle_{rest}) \mid t \subseteq s \}. \]

Notice that u(<s - t>rest) is the remaining utility of t in s, which is stored in the designed UL-list. For example, in Table I, consider the sequence t = <[a], [b]>. Then, SEU(t) = u(t, S1) + u(<S1 - t>rest) + u(t, S2) + u(<S2 - t>rest) + u(t, S3) + u(<S3 - t>rest) + u(t, S4) + u(<S4 - t>rest) + u(t, S5) + u(<S5 - t>rest) = $30 + $54 + $31 + $25 + $27 + $10 + $37 + $11 + $35 + $19 = $279.

Definition 13: The prefix extension utility of a sequence t in a q-sequence s is denoted as PEU(t, s) and defined as

\[ PEU(t, s) = \max\{ u(s_k) + u(\langle s - s_k \rangle_{rest}) \mid t \sim s_k \wedge s_k \subseteq s \}. \]

The prefix extension utility of a sequence t in D is defined as

\[ PEU(t) = \sum_{s \in D} \{ PEU(t, s) \mid t \subseteq s \}. \]

For example, consider Table I and a sequence t = <[a], [b]>. This sequence has 3 matches in S2, which are <[a:1], [b:3]>, <[a:1], [b:2]> and <[a:5], [b:2]>. Thus, u(<S2 - <[a:1], [b:3]>>rest) = u(<[(d:2)], [(b:2) (c:1) (d:4) (e:3)]>) = $25, u(<S2 - <[a:1], [b:2]>>rest) = u(<[(c:1) (d:4) (e:3)]>) = $15 and u(<S2 - <[a:5], [b:2]>>rest) = u(<[(c:1) (d:4) (e:3)]>) = $15. The utilities of the three matches are $14, $11 and $31, respectively. Thus, PEU(<[a], [b]>, S2) = max{$14 + $25, $11 + $15, $31 + $15} = $46. PEU(<[a], [b]>) is calculated as $67 + $46 + $37 + $48 + $54 = $252, which is smaller than SEU(t) = $279.

Theorem 5: Given a quantitative sequential database D and two sequences t and t', if t ⊆ t', we can obtain that:

\[ PEU(t') \le PEU(t). \qquad (6) \]

Proof: Suppose that s is a transaction in D, which contains t and t'. Let s_q be a q-sequence satisfying {u(s_q) + u(<s - s_q>rest)} = PEU(t, s), where t ∼ s_q ∧ s_q ⊆ s. Let s_q' be a q-sequence satisfying {u(s_q') + u(<s - s_q'>rest)} = PEU(t', s), where t' ∼ s_q' ∧ s_q' ⊆ s. Since t ⊆ t', we can divide t' into two parts as the prefix t and the extension e such that t + e = t'. Similarly, s_q' can be divided into two parts as the prefix s_qt' matching t and the extension s_qe' matching e such that s_qt' + s_qe' = s_q'. Thus,

\[
\begin{aligned}
PEU(t', s) &= \{ u(s_{q'}) + u(\langle s - s_{q'} \rangle_{rest}) \} \\
&= \{ u(s_{qt'}) + u(s_{qe'}) + u(\langle s - s_{q'} \rangle_{rest}) \} \\
&\le \{ u(s_{qt'}) + u(\langle s - s_{qt'} \rangle_{rest}) \} \\
&\le \{ u(s_{q}) + u(\langle s - s_{q} \rangle_{rest}) \} = PEU(t, s).
\end{aligned}
\]
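The arithmetic of the running example can be checked with a short script. This is only an illustrative sketch: the numbers are copied from the example for t = <[a], [b]> above, while the function and variable names (`peu_in_sequence`, `matches_in_s2`, and so on) are hypothetical and not part of the HUSP-ULL implementation.

```python
# Worked check of the running example for t = <[a], [b]> (values from the text).

def peu_in_sequence(matches):
    """PEU(t, s): the best 'match utility + remaining utility' over all matches."""
    return max(u_match + u_rest for u_match, u_rest in matches)

# The three matches of t in S2 as (utility of the match, utility of the rest).
matches_in_s2 = [(14, 25), (11, 15), (31, 15)]
peu_s2 = peu_in_sequence(matches_in_s2)

# PEU(t, s) for S1..S5 as reported in the example.
peu_per_sequence = [67, 46, 37, 48, 54]
peu_total = sum(peu_per_sequence)

# (u(t, s), u(<s - t>rest)) pairs for S1..S5 used in the SEU example.
seu_terms = [(30, 54), (31, 25), (27, 10), (37, 11), (35, 19)]
seu_total = sum(u + rest for u, rest in seu_terms)

print(peu_s2, peu_total, seu_total)
```

On these values the PEU total does not exceed the SEU total, which matches the comparison stated in the example.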
GAN et al.: FAST UTILITY MINING ON SEQUENCE DATA 7
Theorem 5 indicates that if the PEU value of a sequence t is less than the minimum utility value, the PEU values of the [...]

Theorem 6: Given a quantitative sequential database D and a sequence t, we can obtain that [...]

Proof: Since u(t, s) = max{u(s_k) | t ∼ s_k ∧ s_k ⊆ s} ≤ [...] PEU(t).

Theorems 5 and 6 ensure that the complete set of HUSPs can be discovered. If the PEU of a sequence t is less than the minimum utility value (δ × u(D)), then the utility of t is less than the minimum utility value, and the utilities of the super-sequences of t are also less than the minimum utility [...]

Theorem 7: For any quantitative sequential database D and a sequence t, the following relationship holds: [...]

Theorem 7 indicates that the PEU model is a tighter upper-bound compared to the SEU and SWU upper-bounds. Based on the PEU model, the designed algorithm can prune more candidates than using the SEU and SWU models. The PEU model can be used to estimate the utility values of candidate sequences and their super-sequences. Thus, the candidate sequences having PEU values that are less than the minimum utility value (δ × u(D)) are discarded from the candidate set by the proposed algorithm so that their child nodes (super-sequences) are not generated and explored in the [...]

A large number of candidates may be generated from a [...] Concatenations with items. To reduce the number of candidate sequences, we propose a look ahead removing strategy (LAR) to eliminate unpromising candidate items early.

Theorem 8: Given a sequence t and a quantitative sequential database D, two situations are considered to generate a super- [...]

1) if i_j is an I-Concatenation candidate item of t, the maximal utility of <t ⊕ i_j>I-Concatenation is no more than

\[ \sum_{s \in D} \{ PEU(t, s) \mid \langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s \}. \]

[...]

Proof: For 1), let t' = <t ⊕ i_j>I-Concatenation for convenience. By Theorem 5, PEU(t', s) ≤ PEU(t, s). Based [...] the same way, 2) holds.

Look Ahead Removing strategy (LAR): Given a sequence [...]

1) If i_j is an I-Concatenation candidate item for t and [...] moved from C^I (the set of candidate items for I-Concatenation with t);

2) If i_j is an S-Concatenation candidate item for t and

\[ \sum_{s \in D} \{ PEU(t, s) \mid \langle t \oplus i_j \rangle_{S\text{-}Concatenation} \subseteq s \} \]

is less than the minimum utility value (δ × u(D)), i_j should be removed from C^S (the set of candidate items for S- [...]

The LAR strategy can be used to quickly remove unpromising candidate items so that they are not considered for I- [...] provides a tight upper-bound to reduce the search space for mining HUSPs. However, several useless items appear in the extensions of sequences in each sequence, which may lead to loose upper-bound values. To further reduce the search space, an irrelevant item pruning strategy (IIP) is designed as follows.

Theorem 9: For a sequence t and any item i_j ∈ I(t)rest, the maximal utility of the concatenation <t ⊕ i_j>I-Concatenation or <t ⊕ i_j>S-Concatenation is no more than

\[ \sum_{s \in D} \{ PEU(t, s) \mid (\langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s) \vee (\langle t \oplus i_j \rangle_{S\text{-}Concatenation} \subseteq s) \}. \]

[...] u(<t ⊕ i_j>I-Concatenation). A similar proof can be done for [...]

Irrelevant Item Pruning strategy (IIP): Given a sequence t and any item i_j ∈ I(t)rest, if

\[ \sum_{s \in D} \{ PEU(t, s) \mid (\langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s) \vee (\langle t \oplus i_j \rangle_{S\text{-}Concatenation} \subseteq s) \} \]

is less than the minimum utility value (δ × u(D)), i_j is called an irrelevant item of t and should be removed from the utility [...]

With the help of the IIP strategy, the remaining utility values of candidate sequences in each sequence decrease, since many irrelevant items can be ignored. As a result, the PEU values of
8 IEEE TRANSACTIONS ON CYBERNETICS, 2019
candidate sequences can greatly decrease, and more candidates may be removed.

Using the LAR and IIP pruning strategies, the designed algorithm can eliminate a large number of candidates. Consider a sequence t that is processed by the algorithm. First, the candidate items for I-Concatenation and S-Concatenation with t are pruned by the IIP strategy, and the UL-lists of t are recalculated. Then, the candidate items for I-Concatenation and S-Concatenation of the processed sequence t are assessed using the LAR instead of their SWU values.

Then, the designed algorithm generates new sequences by concatenating the processed sequence t with the candidate items. If the utility of the newly explored candidate sequence is no less than the minimum utility value (δ × u(D)), it is a HUSP. By applying the downward closure property, the PEU of the new sequence is then checked to decide whether its super-sequences should be explored.

E. The HUSP-ULL Algorithm

Based on the designed utility-linked (UL)-list structure (Section IV-B), the downward closure property (Section IV-C), and the above pruning strategies (Section IV-D), the designed algorithm named HUSP-ULL (High-Utility Sequential Pattern mining with UL-list) is proposed below.

Algorithm 1 HUSP-ULL

Footnote 1: In the proposed HUSP-ULL algorithm, the projected database refers to the UL-lists, not to the real database.

[...] (Lines 5-9). Thus, those 1-sequences with low SWU values are deemed unpromising for I-Concatenation or S-Concatenation, and they are removed in this step. The 1-sequences having utilities that are no less than the minimum utility value (δ × u(D)) are output as HUSPs (Lines 7-9). Using the special set of candidate HUSPs that were eliminated before, the HUSP-ULL algorithm can begin the depth-first search with the built projected database PD(<i_j>) w.r.t. UL-lists. Next, the candidate HUSPs are considered as prefixes by the PGrowth procedure for mining larger HUSPs (Line 10).

The PGrowth procedure (Algorithm 2) performs a depth-first search to enumerate sequences by following the sequence-ascending order. Sequences are enumerated by applying the I-Concatenation and S-Concatenation operations. The algorithm first removes irrelevant items and then recalculates the UL-list, by applying the proposed IIP pruning strategy (Line 1). Then, the algorithm scans the reduced projected database PD(prefix) to obtain C^I (the set of candidate items that will be used for I-Concatenation) (Line 2). To reduce the number of candidate items for I-Concatenation with the sequence prefix, the upper-bound values of the candidate items are calculated using PD(prefix). Based on the proposed LAR pruning strategy, a candidate item i_j is discarded if its upper-bound is less than the minimum utility value. The reduced set of candidate items for S-Concatenation with the sequence prefix, denoted as C^S, is obtained in the same way (Line 6). After the concatenation [...] database PD(prefix') from PD(prefix) (Line 1). The PEU and utility of prefix' are then calculated from PD(prefix') (Line 2).
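The candidate filtering and the per-sequence decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lar_filter` and `decide`, the dictionary layout, the item-to-sequence occurrences, and the values of δ and u(D) are all hypothetical. Only the per-sequence PEU values are borrowed from the running example for t = <[a], [b]>.

```python
def lar_filter(candidates, peu_by_sequence, min_util):
    """LAR-style filter: keep items whose summed PEU(t, s) reaches min_util.

    candidates: item -> set of sequence ids whose extension contains the item
    peu_by_sequence: sequence id -> PEU(t, s) for the current prefix t
    """
    kept = {}
    for item, seq_ids in candidates.items():
        upper_bound = sum(peu_by_sequence[sid] for sid in seq_ids)
        if upper_bound >= min_util:
            kept[item] = upper_bound
    return kept


def decide(utility, peu, min_util):
    """Return (is_husp, explore_children) for a newly generated sequence."""
    return utility >= min_util, peu >= min_util


# PEU(t, s) values taken from the running example; everything else is made up.
peu_by_sequence = {"S1": 67, "S2": 46, "S3": 37, "S4": 48, "S5": 54}
candidates = {"c": {"S1", "S2", "S4"}, "e": {"S3"}}  # hypothetical occurrences
delta, db_utility = 0.25, 400                        # hypothetical δ and u(D)
min_util = delta * db_utility                        # δ × u(D)

kept = lar_filter(candidates, peu_by_sequence, min_util)
print(kept)                        # items that survive the filter
print(decide(120, 161, min_util))  # a HUSP whose children are explored
print(decide(80, 90, min_util))    # neither output nor extended
```

Keeping the output test (utility) separate from the exploration test (PEU) mirrors the text: a sequence may fail to be a HUSP itself while its super-sequences still deserve exploration.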
If the utility of prefix' is no less than the minimum utility value δ × u(D), prefix' is identified as a HUSP (Lines 4-6). If the PEU of prefix' is no less than the minimum utility threshold, the PGrowth procedure is then applied with prefix' to discover HUSPs by considering the super-sequences of prefix' (Lines 3-8). The algorithm terminates if no candidates are generated. Finally, the designed algorithm returns the set of discovered [...]

V. EXPERIMENTAL RESULTS

In this section, we conduct experiments on several real datasets to show the advantage of HUSP-ULL in the task of high-utility sequential pattern mining. In particular, we aim to answer the following research questions via the experiments:
• How effectively can HUSP-ULL discover the useful high-utility sequential patterns with observed timestamps from the quantitative sequential datasets?
• How does HUSP-ULL benefit from each component of the proposed structure and the developed pruning strategies for mining HUSPs?
• How efficiently can HUSP-ULL be applied when handling large data with different sizes?

A. Datasets

In total, five real-life datasets [52] and one synthetic dataset were used in the experiments to evaluate the performance of the proposed algorithm. Detailed characteristics of these datasets are shown in Table IV. Note that #|D| is the number of sequences, #|I| is the number of different symbols/items, #Seq is the average number of elements per sequence, #Ele is the average number of items per element/itemset, [...]

TABLE IV
CHARACTERISTICS OF THE DATASETS.

Dataset           #|D|      #|I|     #Seq   #Ele   MaxLen
Sign              730       267      52.0   1      94
Bible             36,369    13,905   21.6   1      100
Kosarak10k        10,000    10,094   8.14   1      608
Leviathan         5,834     9,025    33.8   1      100
yoochoose-buys    234,300   16,004   1.13   1.97   21
SynDataset-10k    10,000    7,312    6.22   4.35   18
SynDataset-80k    79,718    7,584    6.19   4.32   18
SynDataset-160k   159,501   7,609    6.19   4.32   20
SynDataset-240k   239,211   7,617    6.19   4.32   20
SynDataset-320k   318,889   7,620    6.19   4.32   20
SynDataset-400k   398,716   7,621    6.18   4.32   20

• Sign is a real-life dataset of sequences of sign language utterance, created by the National Center for Sign Language and Gesture Resources at Boston University. Each utterance in the dataset is associated with a segment of video with a [...]
• Bible is a real-life dataset obtained by converting the Bible into a set of sequences of items (words).
• SynDataset-160K is a synthetic dataset that was generated by IBM Quest Dataset Generator [53]. It contains 159,501 [...] 400K) sequences, named C8S6T4I3D|X|K) was also used to evaluate the scalability of the compared approaches.
• Kosarak10k is a real-life dataset of click-stream data from a Hungarian news portal, which is a subset of the original Kosarak dataset [54].
• Leviathan is a conversion of Thomas Hobbes' Leviathan (1651) to a sequence of items (words).
[...] retailer, where each session encapsulates the click events. The total number of item IDs and category IDs is 54,287 and 347 correspondingly, with an interval of 6 months. #Seq (= 1.13) indicates that the number of elements per sequence in yoochoose-buys is small.

The characteristic of #Ele (the average number of items per element/itemset) indicates that Sign, Bible, Kosarak10k and Leviathan are all item-based datasets, while the others are sequence-based datasets. To make the experiments more convincing, both the sequence-based and item-based datasets were used. In fact, the item-based datasets can be efficiently processed by the state-of-the-art HUIM algorithms, while the task of HUSPM aims at dealing with sequence-based datasets.

In the field of utility mining, a simulation model [13] was widely used in the previous studies [17], [21], [23] to generate the quantities and unit profit values of items in the sequential datasets. In order to achieve a fair comparison, this simulation model [13] was adopted in our experiments. Note that the quantity of each item is randomly generated in the [1, 5] interval. A log-normal distribution was used to randomly assign profit values of items in the [0.01, 10.00] interval. The above datasets can be downloaded from [52].

[...]

All the compared algorithms were implemented in Java. The experiments were carried out on a personal computer equipped with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz 2.81 GHz, 32 GB of RAM, running the 64-bit Microsoft Windows 10 operating system. We conduct our experiments against the following state-of-the-art HUSPM methods.
• HuspExt [41]: It introduced the Cumulated Rest of Match (CRoM), but it was not compared since its mining results are incomplete and incorrect, as reported in [18], [55].
• USpan [14]: It is a well-known baseline that uses a utility matrix and two upper-bounds on utility for width and depth [...] and replaced its upper bound by SEU [18].
• HUS-Span [17]: This SWU-based method combines two quantitative metrics, called prefix extension utility (PEU) and reduced sequence utility (RSU), to prune low utility [...]
• ProUM [18]: This projection-based model utilizes the sequence extension utility (SEU) to present the maximum utility of the possible extensions that are based on the prefix sequences. Besides, it applies the project mechanism during
[Fig. 2: runtime (sec.) versus δ. Panels: (d) Kosarak10k, (e) Leviathan, (f) yoochoose-buys. Compared algorithms: USpan, HUS-Span, ProUM, HUSP-ULL.]
the construction of the utility-array, which can achieve better performance than the previous HUSPM algorithms, e.g., USpan, HUS-Span.

[...] list does not help to obtain a tight upper bound for pruning the search space. Generally, HUSP-ULL is faster than the other algorithms by at least one order of magnitude. For example, in Figs. 2 (c), (d), and (e), the HUS-Span and USpan algorithms spent more than 1000 seconds, and in some cases, could not even terminate in a reasonable time. In contrast, HUSP-ULL spent less than 100 seconds to output the results under varied threshold values. As δ is decreased, the compared approaches become slower. The runtime of ProUM and HUSP-ULL increases smoothly, while the runtime of the compared USpan and HUS-Span algorithms increases more rapidly. For example, in Fig. 2 (c), we can see that the runtime of HUS-Span and USpan increases dramatically while the threshold values are only slightly changed. Thus, the runtime performance of HUS-Span and USpan is very sensitive with respect to the parameter settings. Generally, when δ is set to a small value, the runtime of HUS-Span and USpan sharply increases due to their actual search space and the large number of candidates that they generated. Thus, it demonstrates that the designed UL-list-based HUSP-ULL algorithm is able to [...]

In order to evaluate the effectiveness of pruning strategies, the number of generated candidates of all compared algorithms and the number of discovered high-utility sequential patterns (#HUSPs) under different parameter settings are compared in this section. The results are shown in Fig. 3. Note that #P1, #P2, #P3, and #P4 denote the number of the candidate patterns generated by USpan, HUS-Span, ProUM, and HUSP-ULL, respectively, and #HUSPs denotes the number of final HUSPs discovered by the compared algorithms. Note that an algorithm is terminated if its runtime exceeds 10,000 seconds or it runs out of memory (a maximum of 4,096 MB (4 GB) in the memory setting), and it is marked as “-” in our experiments.

It can be seen in Fig. 3 that the number of candidates generated by the HUSP-ULL algorithm is much less than that of the other algorithms. This shows that the designed HUSP-ULL algorithm and pruning strategies can greatly reduce the number of unpromising candidates for mining the HUSPs
[Fig. 3 panels: (d) Kosarak10k, (e) Leviathan, (f) yoochoose-buys. Axes: # Patterns versus δ. Series: #P1, #P2, #P3, #P4, #HUSPs.]
Fig. 3. Number of patterns (candidates and final results) under various δ values.
and hence reduces the requirements in terms of runtime and memory. In all test datasets, the number #P3 is close to the number #P2. It indicates that the upper bound named SEU used in ProUM has similar overestimation effects compared to the PEU upper bound used in HUS-Span.

[...]

Even though the LQS-tree may theoretically grow very large, in practice, it stays relatively small in the proposed HUSP-ULL framework. We only consider a small part of the candidate space. That is, we only perform the I-Concatenation and S-Concatenation by combining the potential candidate patterns that may be the promising high-utility patterns. In practice, as shown in the Bible dataset, we can find that HUSP-ULL only has to cache up to a few thousand candidates. Using the proposed pruning strategies in the LQS-tree, HUSP-ULL can speed up the computation by up to an order of magnitude, while the memory consumption is also reduced.

[...]

It is also interesting to observe that HUS-Span sometimes consumes more memory than the utility matrix based USpan algorithm, as shown in the Sign dataset. To summarize, in most cases on the test datasets, the proposed HUSP-ULL algorithm significantly outperforms the state-of-the-art HUSPM algorithms in terms of memory consumption. The
[Figure: memory usage (MB) versus δ. Panels: (d) Kosarak10k, (e) Leviathan, (f) yoochoose-buys. Compared algorithms: USpan, HUS-Span, ProUM, HUSP-ULL.]
[Figure: scalability on C8S6T4I3D|X|K (δ: 0.001) versus dataset size |D| from 10K to 400K. Panels: (a) runtime (sec.), (b) memory (MB), (c) number of patterns (#P1, #P2, #P4, HUSPs). Algorithms shown: USpan, HUS-Span, HUSP-ULL.]
reason is that HUSP-ULL utilizes the compact UL-list and two pruning strategies to reduce the space complexity.

[...] obtain any results in some cases because it ran out of memory when the minimum utility threshold is set to a small value or when the dataset is very large. This is because the HUS-Span algorithm utilizes the projected databases and utility- [...]

Utility-based sequence mining is a significant problem due [...]
triggered by complex real-life situations. This paper has proposed a novel HUSP-ULL algorithm to discover high-utility sequential patterns (HUSPs) more efficiently. Specifically, the concept of utility-linked (UL)-list was developed and used to calculate the utilities and the upper-bound values of candidates for deriving all HUSPs. By utilizing the designed UL-list structure, the HUSP-ULL algorithm can quickly discover the complete set of HUSPs. To further improve the performance of HUSP-ULL, two pruning strategies were introduced to reduce the upper-bounds on utility and thus prune the search space to find HUSPs. Substantial experiments on real-world and synthetic datasets show that the designed algorithm can effectively and efficiently identify all HUSPs and outperforms the state-of-the-art HUSPM algorithms. The proposed pruning strategies also improve the efficiency of mining HUSPs by reducing the number of unpromising candidates early.

VII. ACKNOWLEDGMENT

We thank the editors and anonymous reviewers for their constructive suggestions that helped to improve the quality of this paper. We would like to thank Dr. Jun-Zhe Wang for providing the original C++ code of the HUS-Span algorithm, and Dr. Oznur Kirmemis Alkan for sharing the Java executable file of the HuspExt algorithm.

REFERENCES

[1] R. Agrawal and R. Srikant, “Mining sequential patterns,” in The International Conference on Data Engineering. IEEE, 1995, pp. 3–14.
[2] R. Srikant and R. Agrawal, “Mining sequential patterns: generalizations and performance improvements,” in Proceedings of International Conference on Extending Database Technology. Springer, 1996, pp. 1–17.
[3] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth,” in The International Conference on Data Engineering. IEEE, 2001, pp. 215–224.
[4] P. Fournier-Viger, J. C. W. Lin, R. U. Kiran, and Y. S. Koh, “A survey of sequential pattern mining,” Data Science and Pattern Recognition, vol. 1, no. 1, pp. 54–77, 2017.
[5] R. Agrawal, T. Imielinski, and A. Swami, “Database mining: A performance perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914–925, 1993.
[6] M. S. Chen, J. Han, and P. S. Yu, “Data mining: an overview from a database perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866–883, 1996.
[7] R. Agrawal, R. Srikant et al., “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Data Bases, vol. 1215, 1994, pp. 487–499.
[8] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent-pattern tree approach,” Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53–87, 2004.
[9] R. Chan, Q. Yang, and Y. D. Shen, “Mining high utility itemsets,” in Proceedings of the third IEEE International Conference on Data Mining. IEEE, 2003, pp. 19–26.
[10] H. Yao, H. J. Hamilton, and C. J. Butz, “A foundational approach to mining itemset utilities from databases,” in Proceedings of the SIAM International Conference on Data Mining. SIAM, 2004, pp. 482–486.
[11] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Efficient algorithms for mining high-utility itemsets in uncertain databases,” Knowledge-Based Systems, vol. 96, pp. 171–187, 2016.
[12] M. Liu and J. Qu, “Mining high utility itemsets without candidate generation,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 2012, pp. 55–64.
[13] Y. Liu, W. K. Liao, and A. Choudhary, “A two-phase algorithm for fast discovery of high utility itemsets,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2005, pp. 689–695.
[14] J. Yin, Z. Zheng, and L. Cao, “USpan: an efficient algorithm for mining high utility sequential patterns,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 660–668.
[15] J. Yin, Z. Zheng, L. Cao, Y. Song, and W. Wei, “Efficiently mining top-k high utility sequential patterns,” in Proceedings of the IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 1259–1264.
[16] G. C. Lan, T. P. Hong, V. S. Tseng, and S. L. Wang, “Applying the maximum utility measure in high utility sequential pattern mining,” Expert Systems with Applications, vol. 41, no. 11, pp. 5071–5081, 2014.
[17] J. Z. Wang, J. L. Huang, and Y. C. Chen, “On efficiently mining high utility sequential patterns,” Knowledge and Information Systems, vol. 49, no. 2, pp. 597–627, 2016.
[18] W. Gan, J. C. W. Lin, Z. Jiexiong, H. C. Chao, H. Fujita, and P. S. Yu, “ProUM: High utility sequential pattern mining,” in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2019, pp. 1000–1007.
[19] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 12, pp. 1708–1721, 2009.
[20] V. S. Tseng, C. W. Wu, B. E. Shie, and P. S. Yu, “UP-Growth: an efficient algorithm for high utility itemset mining,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 253–262.
[21] V. S. Tseng, B. E. Shie, C. W. Wu, and P. S. Yu, “Efficient algorithms for mining high utility itemsets from transactional databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1772–1786, 2013.
[22] P. Fournier-Viger, C. W. Wu, S. Zida, and V. S. Tseng, “FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning,” in International Symposium on Methodologies for Intelligent Systems. Springer, 2014, pp. 83–92.
[23] S. Zida, P. Fournier-Viger, J. C. W. Lin, C. W. Wu, and V. S. Tseng, “EFIM: a highly efficient algorithm for high-utility itemset mining,” in Mexican International Conference on Artificial Intelligence. Springer, 2015, pp. 530–546.
[24] J. Liu, K. Wang, and B. C. Fung, “Direct discovery of high utility itemsets without candidate generation,” in Proceedings of the IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 984–989.
[25] V. S. Tseng, C. W. Wu, P. Fournier-Viger, and P. S. Yu, “Efficient algorithms for mining top-k high utility itemsets,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 1, pp. 54–67, 2016.
[26] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and H. C. Chao, “FDHUP: Fast algorithm for mining discriminative high utility patterns,” Knowledge and Information Systems, vol. 51, no. 3, pp. 873–909, 2017.
[27] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and H. Fujita, “Extracting non-redundant correlated purchase behaviors by utility measure,” Knowledge-Based Systems, vol. 143, pp. 30–41, 2018.
[28] J. C. W. Lin, W. Gan, T. P. Hong, and V. S. Tseng, “Efficient algorithms for mining up-to-date high-utility patterns,” Advanced Engineering Informatics, vol. 29, no. 3, pp. 648–661, 2015.
[29] G. C. Lan, T. P. Hong, and V. S. Tseng, “Discovery of high utility itemsets from on-shelf time periods of products,” Expert Systems with Applications, vol. 38, no. 5, pp. 5851–5857, 2011.
[30] Y. C. Lin, C. W. Wu, and V. S. Tseng, “Mining high utility itemsets in big data,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2015, pp. 649–661.
[31] U. Yun, D. Kim, E. Yoon, and H. Fujita, “Damped window based high average utility pattern mining over data streams,” Knowledge-Based Systems, vol. 144, pp. 188–205, 2018.
[32] W. Gan, J. C.-W. Lin, P. Fournier-Viger, H.-C. Chao, and P. S. Yu, “HUOPM: High-utility occupancy pattern mining,” IEEE Transactions on Cybernetics. DOI: 10.1109/TCYB.2019.2896267, pp. 1–14, 2019.
[33] U. Yun, H. Ryang, G. Lee, and H. Fujita, “An efficient algorithm for mining high utility patterns from incremental databases with one database scan,” Knowledge-Based Systems, vol. 124, pp. 188–206, 2017.
[34] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, T. P. Hong, and H. Fujita, “A survey of incremental high-utility itemset mining,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 2, p. e1242, 2018.
[35] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, “FreeSpan: frequent pattern-projected sequential pattern mining,” in Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2000, pp. 355–359.
[36] M. J. Zaki, “SPADE: an efficient algorithm for mining frequent sequences,” Machine Learning, vol. 42, no. 1-2, pp. 31–60, 2001.
[37] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential pattern mining using a bitmap representation,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002, pp. 429–435.
[38] B. Le, M. T. Tran, and B. Vo, “Mining frequent closed inter-sequence patterns efficiently using dynamic bit vectors,” Applied Intelligence, vol. 43, no. 1, pp. 74–84, 2015.
[39] T. Le, A. Nguyen, B. Huynh, B. Vo, and W. Pedrycz, “Mining constrained inter-sequence patterns: a novel approach to cope with item constraints,” Applied Intelligence, vol. 48, no. 5, pp. 1327–1343, 2018.
[40] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “A survey of parallel sequential pattern mining,” ACM Transactions on Knowledge Discovery from Data, vol. 13, no. 3, p. 25, 2019.
[41] O. K. Alkan and P. Karagoz, “CRoM and HuspExt: Improving efficiency of high utility sequential pattern extraction,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 10, pp. 2645–2657, 2015.
[42] L. Zhou, Y. Liu, J. Wang, and Y. Shi, “Utility-based web path traversal pattern mining,” in Seventh IEEE International Conference on Data Mining Workshops. IEEE, 2007, pp. 373–380.
[43] C. F. Ahmed, S. K. Tanbeer, and B. S. Jeong, “Mining high utility web access sequences in dynamic web log data,” in 11th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing. IEEE, 2010, pp. 76–81.
[44] B. E. Shie, H. F. Hsiao, and V. S. Tseng, “Efficient algorithms for discovering high utility user behavior patterns in mobile commerce environments,” Knowledge and Information Systems, vol. 37, no. 2, pp. 363–387, 2013.
[45] B. E. Shie, H. F. Hsiao, V. S. Tseng, and P. S. Yu, “Mining high utility mobile sequential patterns in mobile commerce environments,” in Proceedings of International Conference on Database Systems for Advanced Applications. Springer, 2011, pp. 224–238.
[46] M. Zihayat, H. Davoudi, and A. An, “Mining significant high utility gene regulation sequential patterns,” BMC Systems Biology, vol. 11, no. 6, p. 109, 2017.

Jerry Chun-Wei Lin (SM’19) is an associate professor at Western Norway University of Applied Sciences, Bergen, Norway. He received the Ph.D. in Computer Science and Information Engineering from National Cheng Kung University, Tainan, Taiwan in 2010. His research interests include data mining, big data analytics, soft computing, and privacy. He has published more than 300 research papers in peer-reviewed international conferences and journals. He is the co-leader of the popular SPMF open-source data mining library, the project leader of the PPSF open-source privacy and security library, the Editor-in-Chief (EiC) of the Data Science and Pattern Recognition (DSPR) journal, and Associate Editor of Journal of Internet Technology and IEEE Access. He is a Senior Member of both IEEE and ACM.

Jiexiong Zhang is currently a senior software engineer at Didi Chuxing, Beijing, China. He received the M.S. degree in Computer Science from Harbin Institute of Technology (Shenzhen), Guangdong, China in 2017. His research interests include data mining, artificial intelligence, and big data analytics.

Philippe Fournier-Viger is full professor and Youth 1000 scholar at the Harbin Institute of Technology (Shenzhen), Shenzhen, China. He received a Ph.D. in Computer Science at the University of Quebec in Montreal in 2010. His research interests include pattern mining, sequence analysis and prediction,
1185 [47] C. F. Ahmed, S. K. Tanbeer, and B. S. Jeong, “A novel approach for and social network mining. He has published more 1253
1186 mining high-utility sequential patterns in sequence databases,” ETRI than 300 research papers in refereed international 1254
1187 journal, vol. 32, no. 5, pp. 676–686, 2010. conferences and journals. He is the founder of the 1255
1188 [48] C. W. Wu, Y. F. Lin, P. S. Yu, and V. S. Tseng, “Mining high utility popular SPMF open-source data mining library. He 1256
1189 episodes in complex event sequences,” in Proceedings of the 19th ACM is Editor-in-Chief (EiC) of the Data Mining and 1257
1190 SIGKDD International Conference on Knowledge Discovery and Data Pattern Recognition (DSPR) journal. 1258