
IEEE TRANSACTIONS ON CYBERNETICS, 2019 1

Fast Utility Mining on Sequence Data


Wensheng Gan, Member, IEEE, Jerry Chun-Wei Lin*, Senior Member, IEEE, Jiexiong Zhang,
Philippe Fournier-Viger, Han-Chieh Chao, Senior Member, IEEE, and Philip S. Yu, Fellow, IEEE

Manuscript received XX 2019; revised XX 2019; accepted XX 2019. This research was supported in part by the Shenzhen Technical Project under Grant No. KQJSCX 20170726103424709 and No. JCYJ 20170307151733005, and a grant from China Scholarship Council. (Corresponding author: Jerry Chun-Wei Lin)
Wensheng Gan is with the Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; and the Department of Computer Science, University of Illinois at Chicago, IL, USA.
Jerry Chun-Wei Lin is with the Department of Computing, Mathematics, and Physics, Western Norway University of Applied Sciences, Bergen, Norway. (E-mail: jerrylin@ieee.org)
Jiexiong Zhang is with the Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China.
Philippe Fournier-Viger is with the Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China.
Han-Chieh Chao is with the Department of Electrical Engineering, National Dong Hwa University, Taiwan, R.O.C.
Philip S. Yu is with the Department of Computer Science, University of Illinois at Chicago, IL, USA.

Abstract—High-utility sequential pattern mining is an emerging topic in the field of Knowledge Discovery in Databases. It consists of discovering subsequences having a high utility (importance) in sequences, which are referred to as high-utility sequential patterns (HUSPs). HUSPs can be applied to many real-life applications, such as market basket analysis, E-commerce recommendation, click-stream analysis and route planning. Several algorithms have been proposed to address this problem by efficiently mining utility-based useful sequential patterns. Nevertheless, the performance of these algorithms can be unsatisfactory in terms of runtime and memory usage due to the combinatorial explosion of the search space for low utility thresholds and large-scale data. Hence, this paper proposes an efficient algorithm for the task of high-utility sequential pattern mining, called HUSP-ULL. It utilizes a lexicographic q-sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper-bounds on the utility of candidate sequences, and to reduce the search space by pruning unpromising candidates early. Substantial experiments on both real-life and synthetic datasets show that HUSP-ULL can effectively and efficiently discover the complete set of HUSPs and outperforms the state-of-the-art algorithms.

Index Terms—Economic behavior, utility theory, utility mining, sequence, linked-list structure.

I. INTRODUCTION

Sequential pattern mining (SPM) [1], [2], [3], [4] is an interesting and critical research area in Knowledge Discovery in Databases (KDD) [5], [6], which plays a key role in various applications such as DNA sequence analysis, consumer behavior analysis, and natural disaster analysis [4]. The main objective of SPM is to discover a set of frequent sequences in a sequence database, selected with respect to a user-specified minimum support threshold, where the frequency of each sequence is defined as its occurrence count in the database. Since data/information quality may be influenced by the sequential ordering of the events, for example, when assessing the data/information quality of Weblog data, it is important to take this attribute into account to provide a more precise assessment of data/information quality.

SPM is similar to frequent itemset mining (FIM) [7], [8], as it is designed to discover patterns that frequently occur in data. The implicit assumption of FIM and SPM is that frequent patterns are useful and interesting. For example, it is interesting information for a business manager if beer and diapers are purchased together in the supermarket. The main difference between SPM and FIM is that SPM generalizes FIM by considering the sequential ordering of sequences. Therefore, mining interesting patterns in a sequential database using SPM is more challenging than FIM [3]. One significant shortcoming of traditional sequential pattern mining is that all objects (items, events, sequences, movements, etc.) are treated equally. In fact, the most frequently occurring patterns can be, quite typically, the least interesting ones. In general, criteria such as the interestingness, utility, and importance of patterns are not taken into account in traditional SPM and FIM. Consequently, these frameworks can reveal many patterns that are frequent but less interesting to decision makers. To better measure the importance of patterns for decision-making, other criteria such as the amount of profit (utility) that each pattern yields can be considered. For example, in market basket analysis, diamonds may not be considered a frequent pattern if their sale frequency is relatively low compared to that of eggs. However, some infrequent patterns such as diamonds may yield a higher profit than eggs. To address this issue, FIM was generalized to obtain the problem of high-utility itemset mining (HUIM) [9], [10], [11], [12], [13].

Fig. 1. A shopping example in Amazon.

Recently, to extract more informative patterns from ordered data (sequences), SPM has been generalized as the task of high-utility sequential pattern mining (HUSPM) [14], [15], [16], [17]. Different from SPM, HUSPM considers not only the sequential ordering of items but also their utility values. Hence, HUSPM is more difficult than traditional SPM and HUIM. As shown in Fig. 1, a customer wants to purchase a mountain trail Bicycle, LED Headlight, and the UShake Bike lock, and each item has its own unit price. In this

case, the consumers' purchase behavior consists of a series of utility-oriented sequential events/processes within different timestamps. Since high-utility sequential pattern mining has many applications, many researchers have focused on this issue, and several algorithms have been developed to discover the complete set of high-utility sequential patterns. However, there are still several challenges in HUSPM. First, the utility of a pattern is neither monotonic nor anti-monotonic. Therefore, the downward closure property of support (also known as the Apriori property [7]) does not hold in HUSPM, and the search space is difficult to reduce. Second, previous approaches have been proposed for determining upper bounds (i.e., sequence-weighted utilization (SWU) [14], sequence-utility upper-bound (SUUB) [16], sequence extension utility (SEU) [18]) on the utility of the potential sequential patterns. However, these algorithms often consume a large amount of memory and have long execution times due to the combinatorial explosion of the search space. Third, in the era of big data, the data that needs to be analyzed grows quickly. How to design more efficient HUSPM algorithms that scale well to very large datasets is also an important topic.

To address these challenges, this paper designs a novel utility-linked list (UL-list) based algorithm called HUSP-ULL (mining High-Utility Sequential Patterns more efficiently with UL-list). The major contributions of this paper are as follows:

1) Insightful patterns. A novel fast algorithm is proposed to efficiently identify meaningful and profitable HUSPs. It employs a utility-linked list structure and two pruning strategies to improve its mining performance.
2) Novel index structures. A compressed utility-linked (UL)-list structure is designed to store information about patterns instead of processing the original database. The UL-list is quite compact and differs from the existing data structures for utility mining.
3) Effective pruning. Utilizing the UL-list, two pruning strategies, named Look Ahead Removing (LAR) and Irrelevant Item Pruning (IIP), are integrated in the designed algorithm to reduce the search space and improve its performance in discovering HUSPs.
4) Fast and better scalability. Experimental results show that the proposed algorithm can efficiently discover HUSPs and outperforms the existing state-of-the-art HUSPM algorithms in terms of runtime, memory usage, unpromising pattern filtering, and scalability.

The rest of this paper is organized as follows. Related work is briefly reviewed in Section II. Preliminaries and the problem statement of high-utility sequential pattern mining are presented in Section III. The proposed HUSP-ULL algorithm with the UL-list and two pruning strategies is presented in Section IV. An experimental evaluation of the designed algorithm is provided in Section V. Finally, conclusions are given in Section VI.

II. RELATED WORK

We structure the related work around two main elements that this paper addresses: high-utility itemset mining and high-utility sequential pattern mining.

A. High-Utility Itemset Mining

The problem of high-utility itemset mining (HUIM) [9], [10] was designed to find the set of high-utility itemsets (HUIs), i.e., the itemsets whose utility values are greater than or equal to a minimum utility threshold. Since HUIM does not provide a downward closure property to reduce the search space, unlike association rule mining (ARM) [7], it is necessary to find other strategies for reducing the search space. To obtain a downward closure property that can be used in HUIM, Liu et al. [13] introduced the transaction-weighted downward closure (TWDC) property and defined a set of candidates called the high transaction-weighted utilization itemsets (HTWUIs). Based on the HTWUIs, the Two-Phase [13] algorithm can find HUIs with the downward closure property. It first discovers the set of HTWUIs using a breadth-first search and then selects HUIs among the discovered HTWUIs. To achieve better performance for mining HUIs, some tree-based HUIM algorithms were introduced, such as IHUP [19], UP-Growth [20] and UP-Growth+ [21]. To reduce the number of candidates, Liu et al. [12] proposed the HUI-Miner algorithm, which efficiently discovers HUIs using a vertical structure called the utility-list. This procedure identifies HUIs without generating candidates or performing multiple database scans.

Up to now, HUIM algorithms have been extensively studied, and many algorithms have been investigated to mine different kinds of HUIs in many real-life applications. Many utility mining algorithms focus on mining efficiency, such as FHM [22], EFIM [23] and d2HUP [24]. On the other hand, several models and algorithms focus on the effectiveness of utility-oriented mining, for example, discovering various kinds of HUIs such as mining HUIs in uncertain databases [11], mining the top-k HUIs without setting the minimum utility threshold [25], exploiting non-redundant correlated utility patterns [26], [27], extracting the up-to-date HUIs to show the sale trends [28], mining temporal on-shelf HUIs [29], and the big data issue of HUIM [30]. Yun et al. [31] proposed a damped window to extract high average utility patterns over data streams. Gan et al. [32] recently proposed a new utility measure, namely utility occupancy, for pattern mining. In contrast to static data, dynamic data is more complex and desirable in many real-life applications. Several dynamic utility mining models [33], [34] have been proposed to deal with dynamic databases.

B. High-Utility Sequential Pattern Mining

Sequential pattern mining (SPM) [1], [2], [3], [35] is important as it considers the sequential ordering of itemsets, which is significant for many applications such as behavior analysis, DNA sequence analysis, and weblog mining [4]. SPM was proposed by Agrawal and Srikant [1] and has been extensively studied. Many efficient SPM algorithms have been developed, such as GSP [2], FreeSpan [35], PrefixSpan [3], SPADE [36] and SPAM [37]. Other interesting issues for SPM have also been extensively studied, such as inter-sequence patterns [38], [39]. Several recent literature surveys of the development of SPM can be found in [4],

[40]. SPM algorithms rely on the frequency/support framework to discover frequent sequences, which does not take business interests into account. High-utility sequential pattern mining (HUSPM) [14], [17], [41] was developed to address utility-driven mining on sequence data. It has been used for mining high-utility path traversal patterns of web pages [42], high-utility web access sequences [43], high-utility mobile sequences [44], [45], and HUSPs in Bioinformatics (i.e., gene regulation) [46]. Ahmed et al. [47] designed a level-wise approach called UL and a pattern-growth approach named US for HUSPM. HUSPM takes ordered sequences as input and reveals sequential patterns having high utilities, which has been a challenging and important issue in recent decades. Hence, Yin et al. [14] proposed a formal framework for HUSPM and introduced an efficient USpan algorithm to discover high-utility sequential patterns (HUSPs). Information about the utility of each node in the tree is stored in a utility-matrix for mining HUSPs without performing multiple database scans. Two pruning strategies, based on the sequential-weighted downward closure property and on the remaining utility model, were designed to reduce the search space. However, USpan may fail to discover the complete set of HUSPs due to its over-estimated upper bound on the potential pattern [18].

Lan et al. [16] then proposed a projection-based approach with a sequence-utility upper-bound (SUUB) to discover high-utility sequential patterns. A novel indexing strategy and the maximum utility measure were developed to improve the mining performance. Then, Alkan et al. [41] proposed the HuspExt algorithm, which calculates a Cumulated Rest of Match (CRoM) to obtain an upper-bound on utility. It uses CRoM to prune unpromising candidates early. To facilitate parameter setting for HUSP mining, Yin et al. [15] proposed the TUS algorithm, which discovers the top-k HUSPs. Recently, Wang et al. [17] developed two tight utility upper-bounds in the HUS-Span algorithm, named prefix extension utility (PEU) and reduced sequence utility (RSU), to speed up the discovery of HUSPs. The TKHUS-Span algorithm was also developed to identify the top-k HUSPs [17]. Gan et al. [18] proposed an efficient projection-based utility mining approach named ProUM to discover high-utility sequences by using an upper bound named sequence extension utility (SEU) and the utility-array structure. Wu et al. [48] studied the problem of mining high-utility episodes in complex event sequences. Recently, an incremental model for HUSP mining was introduced in [49]. Comprehensive reviews of utility-oriented pattern mining can be found in [34], [50], [51].

III. PRELIMINARIES AND PROBLEM STATEMENT

In this section, we introduce the notations and concepts used in the paper. Then, we give the formal problem definition.

A. Notations and Concepts

Let I = {i1, i2, ..., im} be a finite set of distinct items (symbols). A quantitative itemset, denoted as v = [(i1:q1) (i2:q2), ..., (ic:qc)], is a subset of I in which each item is associated with a quantity (internal utility). An itemset, denoted as w = [i1, i2, ..., ic], is a subset of I without quantities. Without loss of generality, we assume that items in an itemset (quantitative itemset) are listed in alphabetical order, since items are unordered in an itemset (quantitative itemset). A quantitative sequence is an ordered list of one or more quantitative itemsets, which is denoted as s = <v1, v2, ..., vd>. A sequence is an ordered list of one or more itemsets without quantities, which is denoted as t = <w1, w2, ..., wd>.

For convenience, in the following "quantitative" will be abbreviated as "q-". Thus, the term "q-sequence" will be used to refer to a sequence with quantities, and "sequence" to refer to a sequence without quantities. Similarly, a "q-itemset" is an itemset having quantities, while "itemset" refers to an itemset that does not have quantities. For example, <[(a, 2) (b, 1)], [(c, 3)]> is a q-sequence while <[ab], [c]> is a sequence. [(a, 2) (b, 1)] is a q-itemset and [ab] is an itemset. A quantitative sequential database is a set of transactions D = {S1, S2, ..., Sn}, where each transaction Sq ∈ D is a q-sequence and has a unique identifier q called its SID. In addition, each item i_j in D is associated with a profit (external utility), which is denoted as pr(i_j).

Consider the following running example. A quantitative sequential database is shown in Table I. This database has 6 transactions and 6 items. Table II is a utility table that provides a unit profit for each item in Table I. In the running example, [(a:2) (c:3)] is the first q-itemset of transaction S1. The quantity of item (a) in this q-itemset is 2, and its utility is calculated as 2 × $5 = $10.

TABLE I
A QUANTITATIVE SEQUENTIAL DATABASE.

SID   Q-sequence
S1    <[(a:2) (c:3)], [(a:3) (b:1) (c:2)], [(a:4) (b:5) (d:4)], [(e:3)]>
S2    <[(a:1) (e:3)], [(a:5) (b:3) (d:2)], [(b:2) (c:1) (d:4) (e:3)]>
S3    <[(e:2)], [(c:2) (d:3)], [(a:3) (e:3)], [(b:4) (d:5)]>
S4    <[(b:2) (c:3)], [(a:5) (e:1)], [(b:4) (d:3) (e:5)]>
S5    <[(a:4) (c:3)], [(a:2) (b:5) (c:2) (d:4) (e:3)]>
S6    <[(f:4)], [(a:5) (b:3)], [(a:3) (d:4)]>

TABLE II
A UTILITY TABLE.

Item         a   b   c   d   e   f
Profit ($)   5   3   4   2   1   6

Definition 1: The utility of an item (i_j) in a q-itemset v is denoted as u(i_j, v), and defined as \(u(i_j, v) = q(i_j, v) \times pr(i_j)\), where q(i_j, v) is the quantity of (i_j) in v, and pr(i_j) is the profit of (i_j). Let u(v) denote the utility of a q-itemset v; then it can be defined as \(u(v) = \sum_{i_j \in v} u(i_j, v)\).

For instance, the utility of item (c) in the first q-itemset of S1 in Table I is calculated as: u(c, [(a:2) (c:3)]) = q(c, [(a:2) (c:3)]) × pr(c) = 3 × $4 = $12. And u([(a:2) (c:3)]) = u(a, [(a:2) (c:3)]) + u(c, [(a:2) (c:3)]) = 2 × $5 + 3 × $4 = $22.

Definition 2: The utility of a q-sequence s = <v1, v2, ..., vd> is defined as \(u(s) = \sum_{v \in s} u(v)\). The utility of a quantitative sequential database D is the sum of the utility of each of its q-sequences: \(u(D) = \sum_{s \in D} u(s)\).

For instance, consider Table I. We have that u(S1) = u([(a:2) (c:3)]) + u([(a:3) (b:1) (c:2)]) + u([(a:4) (b:5) (d:4)])

+ u([(e:3)]) = $22 + $26 + $43 + $3 = $94. Likewise, u(D) = u(S1) + u(S2) + u(S3) + u(S4) + u(S5) + u(S6) = $94 + $67 + $56 + $67 + $76 + $81 = $441 for the database of Table I.

Definition 3: Given a q-sequence s = <v1, v2, ..., vd> and a sequence t = <w1, w2, ..., wd'>, if d = d' and the items in v_k are the same as the items in w_k for 1 ≤ k ≤ d, then t matches s, which is denoted as t ∼ s.

For instance, in Table I, <[ac], [abc], [abd], [e]> matches S1. Note that a sequence may have more than one match in a q-sequence. For instance, <[a], [b]> has three matches in S1: <[a:2], [b:1]>, <[a:2], [b:5]> and <[a:3], [b:5]>. Thus, HUSPM is generally considered more challenging than SPM and HUIM.

Definition 4: Let there be some itemsets w and w'. The itemset w is contained in w' (denoted as w ⊆ w') if w is a subset of w' or w is the same as w'. Given two q-itemsets v and v', v is said to be contained in v' if for any item in v, there exists the same item having the same quantity in v'. This is denoted as v ⊆ v'. Thus, q-itemset containment is different from itemset containment.

For example, the itemset [ac] is contained in the itemset [abc] in Table I. The q-itemset [(a:2) (c:3)] is contained in [(a:2) (b:1) (c:3)] and [(a:2) (c:3) (e:2)], but [(a:2) (c:3)] is not contained in [(a:2) (b:3) (c:1)] and [(a:4) (c:3) (d:4)].

Definition 5: Let there be some sequences t = <w1, w2, ..., wd> and t' = <w'1, w'2, ..., w'd'>. The sequence t is contained in t' (denoted as t ⊆ t') if there exists an integer sequence \(1 \le k_1 \le k_2 \le \cdots \le d'\) such that \(w_j \subseteq w'_{k_j}\) for 1 ≤ j ≤ d. Let there be two q-sequences s = <v1, v2, ..., vd> and s' = <v'1, v'2, ..., v'd'>. s is said to be contained in s' (denoted as s ⊆ s') if there exists an integer sequence \(1 \le k_1 \le k_2 \le \cdots \le d'\) such that \(v_j \subseteq v'_{k_j}\) for 1 ≤ j ≤ d. In the rest of this paper, t ⊆ s will be used to indicate that \(t \sim s_k \wedge s_k \subseteq s\) for convenience.

For example, <[(a:2)], [(e:3)]> and <[(a:4)], [(e:3)]> are contained in S1, but <[(a:1)], [(e:3)]> and <[(a:4)], [(e:4)]> are not contained in S1. A k-itemset (also called a k-q-itemset when it has quantities) is an itemset that contains exactly k items. A k-sequence (k-q-sequence) is a sequence having k items. Considering the database in Table I, the q-sequence S1 is a 9-q-sequence, and its first q-itemset is a 2-q-itemset.

Definition 6: Let there be a sequence t and a q-sequence s. The utility of t in s is defined as: \(u(t, s) = \max\{u(s_k) \mid t \sim s_k \wedge s_k \subseteq s\}\). The utility of a sequence t in a quantitative sequential database D is denoted as u(t) and defined as: \(u(t) = \sum_{s \in D \wedge t \subseteq s} u(t, s)\).

For instance, for the sequential database of Table I, u(<[a], [b]>, S1) = max{u(<[a:2], [b:1]>), u(<[a:2], [b:5]>), u(<[a:3], [b:5]>)} = max{$13, $25, $30} = $30. In this example, it can be seen that several utility values can be associated with a pattern in the same q-sequence. This is different from traditional SPM and HUIM. In Table I, u(<[a], [b]>) = u(<[a], [b]>, S1) + u(<[a], [b]>, S2) + u(<[a], [b]>, S3) + u(<[a], [b]>, S4) + u(<[a], [b]>, S5) = $30 + $31 + $27 + $37 + $35 = $160.

B. Problem Definition

Definition 7 (High-Utility Sequential Pattern, HUSP): A sequence t in a quantitative sequential database D is defined as a high-utility sequential pattern (denoted as HUSP) if its total utility is no less than the minimum utility threshold δ:

\[ HUSP \leftarrow \{ t \mid u(t) \ge \delta \times u(D) \}. \tag{1} \]

For example in Table I, u(<[a], [b]>) = $160. If δ = 0.1, then <[a], [b]> is a HUSP since u(<[a], [b]>) = $160 > δ × u(D) (= $44.1). Based on the above concepts, the problem studied in this work is formally defined below.

Problem Statement: Let there be a quantitative sequential database and a user-defined minimum utility threshold. High-utility sequential pattern mining (HUSPM) consists of enumerating all HUSPs whose total utility value in this database is no less than the minimum utility threshold.

Therefore, the objective of high-utility sequential pattern mining is to identify the sequential patterns whose utility in a sequence database meets or exceeds a pre-specified minimum utility threshold. These insightful and profitable sequential patterns can be used in some specific applications, such as market basket analysis [18], E-commerce recommendation with personalized promotion [44], [45], click-stream analysis [43], and Bioinformatics [46]. More explorations can be reviewed and studied in [50].

IV. THE PROPOSED HUSP-ULL ALGORITHM

This section presents a novel algorithm named HUSP-ULL for the problem of high-utility sequential pattern mining (HUSPM). The HUSP-ULL algorithm first scans the database to find 1-sequences for spanning a lexicographic q-sequence (LQS)-tree, which is a variant of the lexicographic tree [37]. The utility-based LQS-tree is a representation of the search space used for mining HUSPs. Details of the LQS-tree, utility-linked (UL)-list, pruning strategies, and the main procedure of the HUSP-ULL algorithm are respectively explained in this section.

A. Concatenations and Lexicographic Sequence Tree

Each node in a LQS-tree represents a candidate HUSP, whose utility can be compared with the minimum utility threshold to determine whether the candidate is a HUSP. To generate new sequences (child nodes) of a node in the LQS-tree, the designed algorithm performs two common operations [3], [14], [17] of sequence mining, called I-Concatenation and S-Concatenation, respectively.

Definition 8 (I-Concatenation and S-Concatenation [3], [14], [17]): Given a sequence t and an item i_j, the I-Concatenation of t with i_j consists of appending i_j to the last itemset of t, denoted as \(\langle t \oplus i_j \rangle_{I\text{-}Concatenation}\). The S-Concatenation of t with an item i_j consists of adding i_j to a new itemset appended after the last itemset of t, denoted as \(\langle t \oplus i_j \rangle_{S\text{-}Concatenation}\).

For example, given a sequence t = <[a], [b]> and a new item (c), \(\langle t \oplus c \rangle_{I\text{-}Concatenation}\) = <[a], [bc]> and \(\langle t \oplus c \rangle_{S\text{-}Concatenation}\) = <[a], [b], [c]>. It follows that the number of itemsets in t does not change after performing

393 an I-Concatenation, while performing an S-Concatenation first a in S1 ) is calculated to be $84, and the next position of 441

394 increases the number of itemsets in t by one. The search the item (a) in S1 is 3. 442

395 process of the proposed algorithm can be viewed as the process Note that UL-list is quite different from the previous struc- 443

396 of building a LQS-tree step-by-step, which is similar to the tures (i.e., utility-matrix [14], data-matrix [41], utility-chain 444

397 original lexicographic-sequence tree [37]. Each node in the [17]) that were developed for HUSPM. For each node in 445

398 tree represents a sequence. Based on the two operations, all the LQS-tree, sequences containing this node (sequence) are 446

399 candidates of the search space can be generated for the purpose transformed into a UL-list and attached to the projected set 447

400 of mining HUSPs. An illustrated partial LQS-tree can be of this node. Therefore, the utilities and upper-bound values 448

401 referred to [17], [14]. For example, 1-sequences such as <a>, of the candidates can be easily calculated from the projected 449

402 <b>, and <c>, are children of the root. UL-lists. The designed HUSP-ULL algorithm stores only one 450

403 To ensure the completeness and correctness for mining copy of the original database as UL-lists, and then constructs a 451

404 HUSPs, an order is defined for processing sequences. Let there series of projected UL-lists but not the projected sub-databases 452

405 be two sequences ta and tb . It is said that ta ≺ tb if 1) the throughout the execution. This is different from HuspExt and 453

406 length of ta is less than that of tb ; 2) ta is obtained by an HUS-Span. Besides, as a compact structure, the UL-list does 454

407 I-Concatenation on a sequence t while tb is obtained by an not consume a large amount of memory. 455

408 S-Concatenation on a sequence t; and 3) ta and tb are both As mentioned, a sequence may have multiple matches in a 456

409 obtained by respectively performing an I-Concatenation or S- q-sequence, and hence a sequence may have multiple utilities 457

410 Concatenation on a sequence t, and the item added to ta is in a q-sequence. Thus, it is necessary to find the positions 458

411 lexicographically smaller than the one added to tb . This order of the matches to calculate the utilities and the upper-bound 459

412 on sequences is also applied to q-sequences. For example, values of the processed node (sequence). For convenience, 460

413 <[a]> ≺ <[ab]> ≺ <[a], [a]> ≺ <[a], [c]>. the position of the last item within each match is defined 461

as the concatenation point, and the first concatenation point 462

is called the start point. For example, consider the database 463
414 B. The Utility-Linked List Structure of Table I. The sequence t = <[a], [b]> has three matches 464

415 To calculate the utility and upper-bound values of candi- in S1 , that is <[a:2], [b:1]>, <[a:2], [b:5]> and <[a:3], 465

416 dates, the designed algorithm could scan the original database. [b:5]>. The concatenation points of t in S1 are 4, 7 and 466

417 However, this process would result in long execution time 7, respectively, and the start point is 4. By definition, an 467

418 because there are often multiple matches in a sequence. I-Concatenation appends an item to the last itemset of a 468

419 To handle this situation, the compact utility-linked (UL)-list sequence. Thus, the candidate items for I-Concatenation are 469

structure is introduced to store information about the utility of each sequence. The UL-list is used to efficiently generate the utility of sequences obtained by I-Concatenations and S-Concatenations to continue the search for patterns. Table III is the constructed UL-list of the sequence \(S_1\) in Table I.

TABLE III
THE UTILITY-LINKED (UL)-LIST STRUCTURE OF S1.

UP Information: <[(a, $10, $84, 3) (c, $12, $72, 5)], [(a, $15, $57, 6) (b, $3, $54, 7) (c, $8, $46, -)], [(a, $20, $26, -) (b, $15, $11, -) (d, $8, $3, -)], [(e, $3, $0, -)]>
Header Table: (a, 1) (b, 4) (c, 2) (d, 8) (e, 9)

The UL-list structure contains two parts, the Header Table and the UP (utility and position) Information. Details are described below.

1) Header Table. It stores the set of distinct items together with the position of their first occurrence in the transformed sequence. For example, in Table III, the distinct items of \(S_1\) are (a), (b), (c), (d), and (e), and their first positions in \(S_1\) are 1, 4, 2, 8 and 9, respectively.

2) UP Information. For the UP (utility and position) information of each sequence, each element stores the item name, the utility of this item, the remaining utility of this item w.r.t. this element, and the next position of this item. Consider the entry (a, $10, $84, 3) of the first element in \(S_1\). It means that the utility of the item (a) in the first element is $10; the remaining utility [14] of item (a) in the first element in \(S_1\) (the overall utilities after [...]

[...] the items appearing in the itemsets containing concatenation points. An S-Concatenation adds an item to a new itemset, appended at the end of a sequence. Thus, in each sequence, the items in the itemsets appearing after the start point are candidate items for S-Concatenation. In the above example, the candidate items for I-Concatenation are {(c:2), (d:4)}. Since the start point (= 4) is in the second itemset, the items appearing after the second itemset are candidates for S-Concatenation, that is, {(a:4), (b:5), (d:4), (e:3)}. Since there can be multiple matches of the sequence \(t\) in a q-sequence, the utility of \(t\) in that q-sequence is defined as the largest utility value of \(t\) in that sequence.

C. The Downward Closure Property of Upper Bound

Based on the UL-list, the proposed HUSP-ULL algorithm can identify the complete set of HUSPs using a depth-first search that applies the two concatenation operations. However, this process can lead to exploring a very large number of candidates in the LQS-tree, since there is a combinatorial explosion of the number of candidates in the mining process of HUSPs. Since the downward closure property, also called the Apriori property [7], does not hold in high-utility sequential pattern mining, a new downward closure property must be introduced to reduce the search space and efficiently find all HUSPs. To speed up the mining process and maintain the downward closure property, a sequence-weighted utilization (SWU) [14] upper-bound was proposed for mining HUSPs. This upper-bound can be used to greatly reduce the search space and eliminate unpromising candidates early.
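The two parts of the UL-list in Table III can be reproduced with a short Python sketch. This is only an illustration of the data layout under stated assumptions, not the authors' Java implementation: the names `Entry` and `build_ul_list` are hypothetical, itemset boundaries are dropped for brevity, and `None` stands for the "-" marker of Table III.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    item: str
    utility: int
    remaining: int            # total utility of all entries after this one
    next_pos: Optional[int]   # 1-based position of the next occurrence, if any


def build_ul_list(qseq):
    """qseq: list of itemsets, each a list of (item, utility) pairs.

    Returns (header, entries): header maps each item to the position of its
    first occurrence; entries is the flattened UP information.
    """
    flat = [pair for itemset in qseq for pair in itemset]
    total = sum(util for _, util in flat)
    header, entries, seen = {}, [], 0
    for pos, (item, util) in enumerate(flat, start=1):
        seen += util
        header.setdefault(item, pos)          # keep only the first position
        entries.append(Entry(item, util, total - seen, None))
    # Fill next_pos by scanning from right to left.
    nxt = {}
    for pos in range(len(entries), 0, -1):
        entry = entries[pos - 1]
        entry.next_pos = nxt.get(entry.item)
        nxt[entry.item] = pos
    return header, entries


# Item utilities of S1 as listed in Table III.
S1 = [
    [("a", 10), ("c", 12)],
    [("a", 15), ("b", 3), ("c", 8)],
    [("a", 20), ("b", 15), ("d", 8)],
    [("e", 3)],
]
header, entries = build_ul_list(S1)
```

With this input, the first entry carries the tuple (a, $10, $84, 3) and the header maps a, b, c, d, e to positions 1, 4, 2, 8, 9, as in Table III.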

Definition 9: The sequence-weighted utilization (SWU) [14] of a sequence \(t\) in a quantitative sequential database \(D\) is defined as \(SWU(t) = \sum_{s \in D} \{u(s) \mid t \subseteq s\}\).

For example, in Table I, \(SWU(\langle a \rangle) = u(S_1) + u(S_2) + u(S_3) + u(S_4) + u(S_5) + u(S_6)\) = $94 + $67 + $56 + $67 + $76 + $81 = $441, and \(SWU(\langle f \rangle) = u(S_6)\) = $81.

Theorem 1 (Sequence-weighted downward closure property, SWDC property [14]): Given a quantitative sequential database \(D\) and two sequences \(t\) and \(t'\), if \(t \subseteq t'\), then:

\[ SWU(t') \le SWU(t). \tag{2} \]

Proof: Since \(t \subseteq t'\), \(SWU(t') = \sum_{s \in D} \{u(s) \mid t' \subseteq s\} \le \sum_{s \in D} \{u(s) \mid t \subseteq s\} = SWU(t)\).

Theorem 2: Given a quantitative sequential database \(D\) and a sequence \(t\), it can be obtained that:

\[ u(t) \le SWU(t). \tag{3} \]

Proof: Since \(u(t, s) \le u(s)\), we can obtain that \(u(t) = \sum_{s \in D} \{u(t, s) \mid t \subseteq s\} \le \sum_{s \in D} \{u(s) \mid t \subseteq s\} = SWU(t)\).

Thus, numerous unpromising candidates can be pruned using the SWU. However, the SWU of a sequence \(t\) is usually much larger than the actual utilities of \(t\) and its super-sequences. To improve performance, the remaining utility model [14] was proposed in the USpan algorithm [14]. However, it is not a real upper bound and cannot provide the complete mining results of utility mining, as reported in [18]. Thus, the concept of sequence extension utility (SEU) [18] was proposed in the projection-based ProUM algorithm. To explain the concepts of remaining utility and sequence extension utility, several concepts related to sequences and q-sequences are introduced first.

Definition 10: Given two q-sequences \(s\) and \(s'\), if \(s \subseteq s'\), the extension of \(s\) in \(s'\) is said to be the rest of \(s'\) after \(s\), and is denoted as \(\langle s' - s \rangle_{rest}\). Given a sequence \(t\) and a q-sequence \(s\), if \(t \sim s_k \wedge s_k \subseteq s\) (\(t \subseteq s\)), the extension of \(t\) in \(s\) is the rest of \(s\) after \(s_k\), which is denoted as \(\langle s - t \rangle_{rest}\), where \(s_k\) is the first match of \(t\) in \(s\).

For example, given two q-sequences \(s\) = <[a:2], [b:5]> and \(S_1\) in Table I, the extension of \(s\) in \(S_1\) is \(\langle S_1 - s \rangle_{rest}\) = <[(d:4)], [(e:3)]>. Consider a sequence \(t\) = <[a], [b]>. There exist three matches of \(t\) in \(S_1\). The first one is <[a:2], [b:1]>. Thus, \(\langle S_1 - t \rangle_{rest}\) = <[(c:2)], [(a:4) (b:5) (d:4)], [(e:3)]>.

Definition 11: The set of extension items of a sequence \(t\) in a quantitative sequential database \(D\) is denoted as \(I(t)_{rest}\) and defined as \(I(t)_{rest} = \{i_j \mid i_j \in \langle s - t \rangle_{rest} \wedge t \subseteq s \wedge s \in D\}\).

In the above example, \(I(\langle [a], [b] \rangle)_{rest}\) = {a, b, c, d, e}.

Definition 12: The sequence extension utility (SEU) [18] of a sequence \(t\) in a quantitative sequential database \(D\) is denoted as \(SEU(t)\) and defined as \(SEU(t) = \sum_{s \in D} \{u(t, s) + u(\langle s - t \rangle_{rest}) \mid t \subseteq s\}\).

Notice that \(u(\langle s - t \rangle_{rest})\) is the remaining utility of \(t\) in \(s\), which is stored in the designed UL-list. For example, in Table I, consider the sequence \(t\) = <[a], [b]>. Then, \(SEU(t) = u(t, S_1) + u(\langle S_1 - t \rangle_{rest}) + u(t, S_2) + u(\langle S_2 - t \rangle_{rest}) + u(t, S_3) + u(\langle S_3 - t \rangle_{rest}) + u(t, S_4) + u(\langle S_4 - t \rangle_{rest}) + u(t, S_5) + u(\langle S_5 - t \rangle_{rest})\) = $30 + $54 + $31 + $25 + $27 + $10 + $37 + $11 + $35 + $19 = $279.

Theorem 3: Given a quantitative sequential database \(D\) and two sequences \(t\) and \(t'\), if \(t \subseteq t'\), we can obtain that:

\[ SEU(t') \le SEU(t). \tag{4} \]

Theorem 4: Given a quantitative sequential database \(D\) and a sequence \(t\), it follows that:

\[ u(t) \le SEU(t). \tag{5} \]

The proofs of Theorems 3 and 4 can be found in [18]. In summary, they indicate that for a sequence \(t\), if \(SEU(t)\) is less than the minimum utility value (\(\delta \times u(D)\)) and the utility of \(t\) is less than that value, the utilities of the super-sequences of \(t\) are less than that value. If the SEU or SWU of \(t\) is less than the minimum utility value, the utility of \(t\) and the utilities of the super-sequences of \(t\) are less than this value, which indicates that \(t\) and the super-sequences of \(t\) are not HUSPs. However, a large search space may still be explored, since the SWU and SEU upper-bounds overestimate the utility values of patterns. To improve the mining performance and reduce the search space by pruning a large number of candidates, we introduce a tighter upper-bound for mining HUSPs, which is based on the PEU model [17]. Details are given below.

Definition 13: The prefix extension utility of a sequence \(t\) in a q-sequence \(s\) is denoted as \(PEU(t, s)\) and defined as \(PEU(t, s) = \max\{u(s_k) + u(\langle s - s_k \rangle_{rest}) \mid t \sim s_k \wedge s_k \subseteq s\}\). The prefix extension utility of a sequence \(t\) in \(D\) is defined as \(PEU(t) = \sum_{s \in D} \{PEU(t, s) \mid t \subseteq s\}\).

For example, consider Table I and a sequence \(t\) = <[a], [b]>. This sequence has 3 matches in \(S_2\), which are <[a:1], [b:3]>, <[a:1], [b:2]> and <[a:5], [b:2]>. Thus, \(u(\langle S_2 - \langle [a{:}1], [b{:}3] \rangle \rangle_{rest})\) = u(<[(d:2)], [(b:2) (c:1) (d:4) (e:3)]>) = $25, \(u(\langle S_2 - \langle [a{:}1], [b{:}2] \rangle \rangle_{rest})\) = u(<[(c:1) (d:4) (e:3)]>) = $15 and \(u(\langle S_2 - \langle [a{:}5], [b{:}2] \rangle \rangle_{rest})\) = u(<[(c:1) (d:4) (e:3)]>) = $15. The utilities of the three matches are $14, $11 and $31, respectively. Thus, \(PEU(\langle [a], [b] \rangle, S_2)\) = max{$14 + $25, $11 + $15, $31 + $15} = $46. \(PEU(\langle [a], [b] \rangle)\) is calculated as $67 + $46 + $37 + $48 + $54 = $252, which is smaller than \(SEU(t)\) = $279.

Theorem 5: Given a quantitative sequential database \(D\) and two sequences \(t\) and \(t'\), if \(t \subseteq t'\), we can obtain that:

\[ PEU(t') \le PEU(t). \tag{6} \]

Proof: Suppose that \(s\) is a transaction in \(D\) which contains \(t\) and \(t'\). Let \(s_q\) be a q-sequence satisfying \(u(s_q) + u(\langle s - s_q \rangle_{rest}) = PEU(t, s)\), where \(t \sim s_q \wedge s_q \subseteq s\). Let \(s_{q'}\) be a q-sequence satisfying \(u(s_{q'}) + u(\langle s - s_{q'} \rangle_{rest}) = PEU(t', s)\), where \(t' \sim s_{q'} \wedge s_{q'} \subseteq s\). Since \(t \subseteq t'\), we can divide \(t'\) into two parts, the prefix \(t\) and the extension \(e\), such that \(t + e = t'\). Similarly, \(s_{q'}\) can be divided into two parts, the prefix \(s_{q'_t}\) matching \(t\) and the extension \(s_{q'_e}\) matching \(e\), such that \(s_{q'_t} + s_{q'_e} = s_{q'}\). Thus,

\[
\begin{aligned}
PEU(t', s) &= u(s_{q'}) + u(\langle s - s_{q'} \rangle_{rest}) \\
&= u(s_{q'_t}) + u(s_{q'_e}) + u(\langle s - s_{q'} \rangle_{rest}) \\
&\le u(s_{q'_t}) + u(\langle s - s_{q'_t} \rangle_{rest}) \\
&\le u(s_q) + u(\langle s - s_q \rangle_{rest}) = PEU(t, s).
\end{aligned}
\]
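The worked values for Definitions 12 and 13 can be checked with a few lines of Python. This is a minimal sketch: the helper names `u_ts`, `seu_ts` and `peu_ts` are hypothetical, each match is represented only by the pair (utility of the match, utility of the rest after it) quoted in the text, and the matches are assumed to be listed in order of occurrence so that the first pair is the first match.

```python
def u_ts(matches):
    """u(t, s): the largest utility among all matches of t in s."""
    return max(u for u, _ in matches)


def seu_ts(matches):
    """u(t, s) + u(<s - t>rest); the rest is taken after the FIRST match."""
    return u_ts(matches) + matches[0][1]


def peu_ts(matches):
    """PEU(t, s): best (match utility + rest after that same match)."""
    return max(u + rest for u, rest in matches)


# The three matches of t = <[a], [b]> in S2, as (utility, rest utility).
s2_matches = [(14, 25), (11, 15), (31, 15)]

# Per-sequence terms quoted in the text for S1..S5.
seu_terms = [(30, 54), (31, 25), (27, 10), (37, 11), (35, 19)]
peu_terms = [67, 46, 37, 48, 54]

seu_total = sum(u + rest for u, rest in seu_terms)   # SEU(t)
peu_total = sum(peu_terms)                           # PEU(t)
```

For \(S_2\), the SEU term pairs the best match utility ($31) with the rest after the first match ($25), whereas PEU pairs each match with its own rest, which is why PEU is never larger than the SEU term.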

We can thus obtain that \(PEU(t') = \sum_{s \in D} \{PEU(t', s) \mid t' \subseteq s\} \le \sum_{s \in D} \{PEU(t, s) \mid t' \subseteq s\} \le \sum_{s \in D} \{PEU(t, s) \mid t \subseteq s\} = PEU(t)\).

Theorem 5 indicates that if the PEU value of a sequence \(t\) is less than the minimum utility value, the PEU values of the super-sequences of \(t\) are also less than the minimum utility value.

Theorem 6: Given a quantitative sequential database \(D\) and a sequence \(t\), we can obtain that:

\[ u(t) \le PEU(t). \tag{7} \]

Proof: We have \(u(t, s) = \max\{u(s_k) \mid t \sim s_k \wedge s_k \subseteq s\} \le \max\{u(s_k) + u(\langle s - s_k \rangle_{rest}) \mid t \sim s_k \wedge s_k \subseteq s\} = PEU(t, s)\). Thus, \(u(t) = \sum_{s \in D} u(t, s) \le \sum_{s \in D} \{PEU(t, s) \mid t \subseteq s\} = PEU(t)\).

Theorems 5 and 6 ensure that the complete set of HUSPs can be discovered. If the PEU of a sequence \(t\) is less than the minimum utility value (\(\delta \times u(D)\)), then the utility of \(t\) is less than the minimum utility value, and the utilities of the super-sequences of \(t\) are also less than the minimum utility value.

Theorem 7: For any quantitative sequential database \(D\) and a sequence \(t\), the following relationship holds:

\[ PEU(t) \le SEU(t) \le SWU(t). \tag{8} \]

Proof: Since \(u(s_k) \le \max\{u(s_k) \mid t \sim s_k \wedge s_k \subseteq s\} = u(t, s)\) and \(u(\langle s - s_k \rangle_{rest}) \le u(\langle s - t \rangle_{rest})\), \(PEU(t, s) = \max\{u(s_k) + u(\langle s - s_k \rangle_{rest}) \mid t \sim s_k \wedge s_k \subseteq s\} \le u(t, s) + u(\langle s - t \rangle_{rest}) \le u(s)\). Thus, \(PEU(t) = \sum_{s \in D} \{PEU(t, s) \mid t \subseteq s\} \le \sum_{s \in D} \{u(t, s) + u(\langle s - t \rangle_{rest}) \mid t \subseteq s\} = SEU(t) \le \sum_{s \in D} \{u(s) \mid t \subseteq s\} = SWU(t)\).

Theorem 7 indicates that the PEU model is a tighter upper-bound than the SEU and SWU upper-bounds. Based on the PEU model, the designed algorithm can prune more candidates than with the SEU and SWU models. The PEU model can be used to estimate the utility values of candidate sequences and their super-sequences. Thus, the candidate sequences having PEU values that are less than the minimum utility value (\(\delta \times u(D)\)) are discarded from the candidate set by the proposed algorithm, so that their child nodes (super-sequences) are not generated and explored in the LQS-tree.

D. Pruning Strategies

A large number of candidates may be generated from a candidate sequence \(t\) by performing I-Concatenations and S-Concatenations with items. To reduce the number of candidate sequences, we propose a look ahead removing strategy (LAR) to eliminate unpromising candidate items early.

Theorem 8: Given a sequence \(t\) and a quantitative sequential database \(D\), two situations are considered to generate a super-sequence:

1) if \(i_j\) is an I-Concatenation candidate item of \(t\), the maximal utility of \(\langle t \oplus i_j \rangle_{I\text{-}Concatenation}\) is no more than \(\sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s\}\).

2) if \(i_j\) is an S-Concatenation candidate item of \(t\), the maximal utility of \(\langle t \oplus i_j \rangle_{S\text{-}Concatenation}\) is no more than \(\sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{S\text{-}Concatenation} \subseteq s\}\).

Proof: For 1), let \(t' = \langle t \oplus i_j \rangle_{I\text{-}Concatenation}\) for convenience. By Theorem 5, \(PEU(t', s) \le PEU(t, s)\). Based on Theorem 6, \(u(t') \le PEU(t')\). Thus \(u(t') \le PEU(t') = \sum_{s \in D} \{PEU(t', s) \mid t' \subseteq s\} \le \sum_{s \in D} \{PEU(t, s) \mid t' \subseteq s\}\). In the same way, 2) holds.

Look Ahead Removing strategy (LAR): Given a sequence \(t\) and a quantitative sequential database \(D\), two situations are considered:

1) If \(i_j\) is an I-Concatenation candidate item for \(t\) and \(\sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s\}\) is less than the minimum utility value (\(\delta \times u(D)\)), \(i_j\) should be removed from \(C^I\) (the set of candidate items for I-Concatenation with \(t\));

2) If \(i_j\) is an S-Concatenation candidate item for \(t\) and \(\sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{S\text{-}Concatenation} \subseteq s\}\) is less than the minimum utility value (\(\delta \times u(D)\)), \(i_j\) should be removed from \(C^S\) (the set of candidate items for S-Concatenation with \(t\)).

The LAR strategy can be used to quickly remove unpromising candidate items so that they are not considered for I-Concatenation and S-Concatenation of a sequence \(t\). Thus, this strategy avoids calculating the PEU values of \(\langle t \oplus i_j \rangle_{I\text{-}Concatenation}\) and \(\langle t \oplus i_j \rangle_{S\text{-}Concatenation}\) for each removed item \(i_j\). Since the upper-bound can be calculated from the utility-linked lists of \(t\), LAR can remove unpromising candidate items in advance. As a result, the execution time of the algorithm can be reduced, since a smaller set of candidate items is considered for concatenations with \(t\).

The downward closure property based on the PEU model provides a tight upper-bound to reduce the search space for mining HUSPs. However, several useless items appear in the extensions of sequences in each sequence, which may lead to loose upper-bound values. To further reduce the search space, an irrelevant item pruning strategy (IIP) is designed as follows.

Theorem 9: For a sequence \(t\) and any item \(i_j \in I(t)_{rest}\), the maximal utility of the concatenation \(\langle t \oplus i_j \rangle_{I\text{-}Concatenation}\) or \(\langle t \oplus i_j \rangle_{S\text{-}Concatenation}\) is no more than \(\sum_{s \in D} \{PEU(t, s) \mid (\langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s) \vee (\langle t \oplus i_j \rangle_{S\text{-}Concatenation} \subseteq s)\}\).

Proof: For an I-Concatenation, based on Theorem 8, we have \(\sum_{s \in D} \{PEU(t, s) \mid (\langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s) \vee (\langle t \oplus i_j \rangle_{S\text{-}Concatenation} \subseteq s)\} \ge \sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s\} \ge u(\langle t \oplus i_j \rangle_{I\text{-}Concatenation})\). A similar proof can be done for \(\langle t \oplus i_j \rangle_{S\text{-}Concatenation}\).

Irrelevant Item Pruning strategy (IIP): Given a sequence \(t\) and any item \(i_j \in I(t)_{rest}\), if \(\sum_{s \in D} \{PEU(t, s) \mid (\langle t \oplus i_j \rangle_{I\text{-}Concatenation} \subseteq s) \vee (\langle t \oplus i_j \rangle_{S\text{-}Concatenation} \subseteq s)\}\) is less than the minimum utility value (\(\delta \times u(D)\)), \(i_j\) is called an irrelevant item of \(t\) and should be removed from the utility-linked lists of \(t\) and \(t\)'s supersets.

With the help of the IIP strategy, the remaining utility values of candidate sequences in each sequence decrease, since many irrelevant items can be ignored. As a result, the PEU values of

candidate sequences can greatly decrease, and more candidates may be removed.

Using the LAR and IIP pruning strategies, the designed algorithm can eliminate a large number of candidates. Consider a sequence \(t\) that is processed by the algorithm. First, the candidate items for I-Concatenation and S-Concatenation with \(t\) are pruned by the IIP strategy, and the UL-lists of \(t\) are recalculated. Then, the candidate items for I-Concatenation and S-Concatenation of the processed sequence \(t\) are assessed using the LAR bound instead of their SWU values.

Then, the designed algorithm generates new sequences by concatenating the processed sequence \(t\) with the candidate items. If the utility of the newly explored candidate sequence is no less than the minimum utility value (\(\delta \times u(D)\)), it is a HUSP. By applying the downward closure property, the PEU of the new sequence is then checked to decide whether its super-sequences should be explored.

E. The HUSP-ULL Algorithm

Based on the designed utility-linked (UL)-list structure (Section IV-B), the downward closure property (Section IV-C), and the above pruning strategies (Section IV-D), the designed algorithm named HUSP-ULL (High-Utility Sequential Pattern mining with UL-list) is proposed below.

Algorithm 1 HUSP-ULL
Input: D, a quantitative sequential database; utable, a utility table containing the unit profit of each item; δ, the minimum utility threshold.
Output: The set of HUSPs.
1: scan D to: 1) calculate u(s) for each s ∈ D and calculate u(D); 2) build the UL-list of each s ∈ D;
2: HUSPs ← ∅;
3: for each ij ∈ D do
4:     PD(<ij>) ← {UL-list of s | <ij> ⊆ s ∧ s ∈ D};
5:     calculate SWU(<ij>) and u(<ij>);
6:     if SWU(<ij>) ≥ δ × u(D) then
7:         if u(<ij>) ≥ δ × u(D) then
8:             HUSPs ← HUSPs ∪ <ij>;
9:         end if
10:        call PGrowth(<ij>, PD(<ij>), HUSPs);
11:    end if
12: end for
13: return HUSPs

The pseudo-code of the main procedure of the HUSP-ULL algorithm is given in Algorithm 1. It first scans the quantitative sequential database \(D\) to calculate \(u(s)\) and build the UL-list of each q-sequence \(s \in D\) (Line 1). For each item \(i_j \in D\), the algorithm builds the projected database¹ PD(<ij>) to store the UL-lists of the transformed sequences (Line 4). The utility and SWU of each 1-sequence are calculated using the corresponding projected database (Line 5). The 1-sequences having SWU values that are no less than the minimum utility value (\(\delta \times u(D)\)) are considered as candidate HUSPs (Lines 5-9). Thus, the 1-sequences with low SWU values are deemed unpromising for I-Concatenation or S-Concatenation, and they are removed in this step. The 1-sequences having utilities that are no less than the minimum utility value (\(\delta \times u(D)\)) are output as HUSPs (Lines 7-9). Using the set of candidate HUSPs that remains after this elimination, the HUSP-ULL algorithm can begin the depth-first search with the built projected database PD(<ij>) w.r.t. UL-lists. Next, the candidate HUSPs are considered as prefixes by the PGrowth procedure for mining larger HUSPs (Line 10).

The PGrowth procedure (Algorithm 2) performs a depth-first search to enumerate sequences by following the sequence-ascending order. Sequences are enumerated by applying the I-Concatenation and S-Concatenation operations. The algorithm first removes irrelevant items and then recalculates the UL-list by applying the proposed IIP pruning strategy (Line 1). Then, the algorithm scans the reduced projected database PD(prefix) to obtain \(C^I\) (the set of candidate items that will be used for I-Concatenation) (Line 2). To reduce the number of candidate items for I-Concatenation with the sequence prefix, the upper-bound values of the candidate items are calculated using PD(prefix). Based on the proposed LAR pruning strategy, a candidate item \(i_j\) is discarded if its upper-bound is less than the minimum utility value. The reduced set of candidate items for S-Concatenation with the sequence prefix, denoted as \(C^S\), is obtained in the same way (Line 6). After the concatenation operations are performed, the newly generated sequences are evaluated by applying the Judge procedure (Line 4 and Line 8), which is explained next.

Algorithm 2 PGrowth(prefix, PD(prefix), HUSPs)
1: PD(prefix) ← IIP(PD(prefix)); // pruned by IIP
2: scan PD(prefix) to get C^I; // filtered by LAR
3: for each ij ∈ C^I do
4:     call Judge(<prefix ⊕ ij>I-Concatenation, PD(prefix), HUSPs);
5: end for
6: scan PD(prefix) to get C^S; // filtered by LAR
7: for each ij ∈ C^S do
8:     call Judge(<prefix ⊕ ij>S-Concatenation, PD(prefix), HUSPs);
9: end for

Algorithm 3 Judge(prefix', PD(prefix), HUSPs)
1: construct PD(prefix') ← {the UL-list of s | prefix' ⊆ s ∧ s ∈ PD(prefix)};
2: calculate u(prefix') and PEU(prefix');
3: if PEU(prefix') ≥ δ × u(D) then
4:     if u(prefix') ≥ δ × u(D) then
5:         HUSPs ← HUSPs ∪ prefix';
6:     end if
7:     call PGrowth(prefix', PD(prefix'), HUSPs);
8: end if

The Judge procedure (Algorithm 3) first builds the projected database PD(prefix') from PD(prefix) (Line 1). The PEU and utility of prefix' are then calculated from PD(prefix') (Line 2).

¹ In the proposed HUSP-ULL algorithm, the projected database refers to the UL-lists, not the real database.
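The control flow of Algorithms 1 to 3 can be illustrated with a deliberately simplified Python sketch. It is not the HUSP-ULL implementation: it assumes every itemset holds a single item (so only S-Concatenation is needed), recomputes matches by brute force instead of using UL-lists and projected databases, and omits IIP. The toy database, the threshold 0.35, and all function names are hypothetical. For the empty prefix, the look-ahead bound reduces to the SWU of the 1-sequence, which mirrors Line 6 of Algorithm 1.

```python
from itertools import product


def matches(pattern, seq, start=0):
    """Yield (utility, end index) for every match of pattern in seq[start:]."""
    if not pattern:
        yield 0, start - 1
        return
    for i in range(start, len(seq)):
        if seq[i][0] == pattern[0]:
            for util, end in matches(pattern[1:], seq, i + 1):
                yield seq[i][1] + util, end


def u_and_peu(pattern, seq):
    """Return (u(t, s), PEU(t, s)), or None if seq does not contain pattern."""
    found = list(matches(pattern, seq))
    if not found:
        return None
    u = max(util for util, _ in found)
    peu = max(util + sum(x for _, x in seq[end + 1:]) for util, end in found)
    return u, peu


def mine(db, min_util):
    """Depth-first search with an LAR-style bound and PEU-based pruning."""
    if min_util <= 0:
        raise ValueError("min_util must be positive")
    items = sorted({item for seq in db for item, _ in seq})
    husps = {}

    def grow(prefix):
        for item in items:
            cand = prefix + (item,)
            stats, bound = [], 0
            for seq in db:
                stat = u_and_peu(cand, seq)
                if stat is not None:
                    stats.append(stat)
                    bound += u_and_peu(prefix, seq)[1]  # PEU(prefix, s)
            if bound < min_util:        # look-ahead removal of the item
                continue
            u = sum(s[0] for s in stats)
            peu = sum(s[1] for s in stats)
            if peu >= min_util:         # Judge: PEU decides whether to go deeper
                if u >= min_util:
                    husps[cand] = u
                grow(cand)

    grow(())
    return husps


def brute_force(db, min_util):
    """Exhaustive enumeration without pruning, used only as a cross-check."""
    items = sorted({item for seq in db for item, _ in seq})
    out = {}
    for n in range(1, max(len(seq) for seq in db) + 1):
        for cand in product(items, repeat=n):
            stats = [s for s in (u_and_peu(cand, seq) for seq in db) if s]
            u = sum(s[0] for s in stats)
            if u >= min_util:
                out[cand] = u
    return out


# Hypothetical toy database: each q-sequence is a list of (item, utility).
db = [
    [("a", 5), ("b", 9), ("a", 25), ("c", 4)],
    [("b", 6), ("a", 10), ("c", 8)],
    [("a", 5), ("c", 12), ("b", 3)],
]
min_util = 0.35 * sum(x for seq in db for _, x in seq)   # delta * u(D)
husps = mine(db, min_util)
```

In this sketch the look-ahead bound is computed alongside the candidate statistics, so it only shows where the LAR test sits in the search; the saving described in the text comes from reading the bound off the UL-lists of the prefix before the candidate is built.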

If the utility of prefix' is no less than the minimum utility value \(\delta \times u(D)\), prefix' is identified as a HUSP (Lines 4-6). If the PEU of prefix' is no less than the minimum utility value, the PGrowth procedure is then applied with prefix' to discover HUSPs by considering the super-sequences of prefix' (Lines 3-8). The algorithm terminates when no candidates are generated. Finally, the designed algorithm returns the set of discovered HUSPs.

V. EXPERIMENTAL RESULTS

In this section, we conduct experiments on several real datasets to show the advantage of HUSP-ULL in the task of high-utility sequential pattern mining. In particular, we aim to answer the following research questions via the experiments:

• How effectively can HUSP-ULL discover the useful high-utility sequential patterns with observed timestamps from the quantitative sequential datasets?
• How does HUSP-ULL benefit from each component of the proposed structure and the developed pruning strategies for mining HUSPs?
• How efficiently can HUSP-ULL be applied when handling large data with different sizes?

A. Datasets

In total, five real-life datasets [52] and one synthetic dataset were used in the experiments to evaluate the performance of the proposed algorithm. Detailed characteristics of these datasets are shown in Table IV. Note that #|D| is the number of sequences, #|I| is the number of different symbols/items, #Seq is the average number of elements per sequence, #Ele is the average number of items per element/itemset, and MaxLen is the maximum number of items per sequence.

TABLE IV
CHARACTERISTICS OF THE DATASETS.

Dataset            #|D|      #|I|     #Seq    #Ele    MaxLen
Sign               730       267      52.0    1       94
Bible              36,369    13,905   21.6    1       100
Kosarak10k         10,000    10,094   8.14    1       608
Leviathan          5,834     9,025    33.8    1       100
yoochoose-buys     234,300   16,004   1.13    1.97    21
SynDataset-10k     10,000    7,312    6.22    4.35    18
SynDataset-80k     79,718    7,584    6.19    4.32    18
SynDataset-160k    159,501   7,609    6.19    4.32    20
SynDataset-240k    239,211   7,617    6.19    4.32    20
SynDataset-320k    318,889   7,620    6.19    4.32    20
SynDataset-400k    398,716   7,621    6.18    4.32    20

• Sign is a real-life dataset of sequences of sign language utterances, created by the National Center for Sign Language and Gesture Resources at Boston University. Each utterance in the dataset is associated with a segment of video with a detailed transcription.
• Bible is a real-life dataset obtained by converting the Bible into a set of sequences of items (words).
• SynDataset-160K is a synthetic dataset generated by the IBM Quest Dataset Generator [53]. It contains 159,501 (≈ 160K) sequences. This synthetic sequential dataset with different data sizes (from 10,000 sequences to 398,716 (≈ 400K) sequences, named C8S6T4I3D|X|K) was also used to evaluate the scalability of the compared approaches.
• Kosarak10k is a real-life dataset of click-stream data from a Hungarian news portal, which is a subset of the original Kosarak dataset [54].
• Leviathan is a conversion of Thomas Hobbes' Leviathan (1651) to a sequence of items (words).
• yoochoose-buys is a commercial dataset from yoochoose². It contains a collection of 1,150,753 sessions from a retailer, where each session encapsulates the click events. The total numbers of item IDs and category IDs are 54,287 and 347, respectively, with an interval of 6 months. #Seq (= 1.13) indicates that the number of elements per sequence in yoochoose-buys is small.

The characteristic #Ele (the average number of items per element/itemset) indicates that Sign, Bible, Kosarak10k and Leviathan are all item-based datasets, while the others are sequence-based datasets. To make the experiments more convincing, both sequence-based and item-based datasets were used. In fact, the item-based datasets can be efficiently processed by the state-of-the-art HUIM algorithms, while the task of HUSPM aims at dealing with sequence-based datasets.

In the field of utility mining, a simulation model [13] was widely used in previous studies [17], [21], [23] to generate the quantities and unit profit values of items in the sequential datasets. In order to achieve a fair comparison, this simulation model [13] was adopted in our experiments. Note that the quantity of each item is randomly generated in the [1, 5] interval. A log-normal distribution was used to randomly assign profit values of items in the [0.01, 10.00] interval. The above datasets can be downloaded from [52].

B. Experimental Settings

All the compared algorithms were implemented in Java. The experiments were carried out on a personal computer equipped with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz 2.81 GHz, 32 GB of RAM, running the 64-bit Microsoft Windows 10 operating system. We conduct our experiments against the following state-of-the-art HUSPM methods.

• HuspExt [41]: It introduced the Cumulated Rest of Match (CRoM), but it was not compared since its mining results are incomplete and incorrect, as reported in [18], [55].
• USpan [14]: It is a well-known baseline that uses a utility matrix and two upper-bounds on utility for width and depth pruning. In our experiments, the USpan algorithm was fixed by replacing its upper bound with SEU [18].
• HUS-Span [17]: This SWU-based method combines two quantitative metrics, called prefix extension utility (PEU) and reduced sequence utility (RSU), to prune low utility sequences. It outperforms USpan in most cases.
• ProUM [18]: This projection-based model utilizes the sequence extension utility (SEU) to present the maximum utility of the possible extensions that are based on the prefix sequences. Besides, it applies the projection mechanism during

² https://recsys.acm.org/recsys15/challenge/
785 different data sizes (from 10,000 sequences to 398,716 (≈ 2 https://recsys.acm.org/recsys15/challenge/

[Figure 2: six panels plotting Runtime (sec., log scale) against δ for USpan, HUS-Span, ProUM, and HUSP-ULL. Panels and δ ranges: (a) Sign, 1.2%-1.7%; (b) Bible, 0.5%-1.0%; (c) SynDataset-160K, 0.065%-0.090%; (d) Kosarak10k, 1.69%-1.74%; (e) Leviathan, 1.00%-1.25%; (f) yoochoose-buys, 0.024%-0.034%.]

Fig. 2. Runtime for various δ values.

the construction of the utility-array, which can achieve better performance than the previous HUSPM algorithms, e.g., USpan and HUS-Span.

C. Efficiency

In the first experiment, the runtime of the designed algorithm was compared with the existing state-of-the-art algorithms. The runtime was measured by considering both the time used by the CPU and the time required for disk I/O accesses. Fig. 2 shows the runtime of the compared algorithms for various minimum utility thresholds (denoted as δ values).

We now discuss the results concerning the efficiency of HUSP-ULL. It can be seen in Fig. 2 that the proposed HUSP-ULL algorithm outperforms the other approaches for all datasets except yoochoose-buys under various threshold values. The reason is that, since the average number of elements per sequence in yoochoose-buys is 1.13, the construction of the UL-list does not help to obtain a tight upper bound for pruning the search space. Generally, HUSP-ULL is faster than the other algorithms by at least one order of magnitude. For example, in Figs. 2 (c), (d), and (e), the HUS-Span and USpan algorithms spent more than 1000 seconds and, in some cases, could not even terminate in a reasonable time. In contrast, HUSP-ULL spent less than 100 seconds to output the results under varied threshold values. As δ is decreased, the compared approaches become slower. The runtime of ProUM and HUSP-ULL increases smoothly, while the runtime of the compared USpan and HUS-Span algorithms increases more rapidly. For example, in Fig. 2 (c), we can see that the runtime of HUS-Span and USpan increases dramatically while the threshold values are only slightly changed. Thus, the runtime performance of HUS-Span and USpan is very sensitive to the parameter settings. Generally, when δ is set to a small value, the runtime of HUS-Span and USpan sharply increases due to their actual search space and the large number of candidates that they generate. This demonstrates that the designed UL-list-based HUSP-ULL algorithm is able to significantly improve the performance in terms of running time.

It is important to note that the USpan algorithm outperforms the HUS-Span algorithm in Figs. 2 (a), (c), and (d), while HUS-Span outperforms USpan in Figs. 2 (b), (e), and (f). In most cases, the projection-based ProUM algorithm performs better than USpan and HUS-Span. Besides, it can be seen that the USpan algorithm cannot obtain results in Fig. 2 (e) since it ran out of memory. The USpan algorithm builds a series of utility matrices to store utility information about patterns, but this requires additional processing time. Thus, the runtime of USpan is larger than that of ProUM in many cases.

D. Effectiveness of Pruning Strategies

In order to evaluate the effectiveness of the pruning strategies, the number of generated candidates of all compared algorithms and the number of discovered high-utility sequential patterns (#HUSPs) under different parameter settings are compared in this section. The results are shown in Fig. 3. Note that #P1, #P2, #P3, and #P4 denote the number of candidate patterns generated by USpan, HUS-Span, ProUM, and HUSP-ULL, respectively, and #HUSPs denotes the number of final HUSPs discovered by the compared algorithms. Note that an algorithm is terminated if its runtime exceeds 10,000 seconds or it runs out of memory (a maximum of 4,096 MB (4 GB) in the memory setting), and this case is marked as "-" in our experiments.

It can be seen in Fig. 3 that the number of candidates generated by the HUSP-ULL algorithm is much smaller than that of the other algorithms. This shows that the designed HUSP-ULL algorithm and pruning strategies can greatly reduce the number of unpromising candidates for mining the HUSPs

[Figure 3: six panels plotting # Patterns against δ, with series #P1, #P2, #P3, #P4, and #HUSPs. Panels and δ ranges: (a) Sign, 1.2%-1.7%; (b) Bible, 0.5%-1.0%; (c) SynDataset-160K, 0.065%-0.090%; (d) Kosarak10k, 1.69%-1.74%; (e) Leviathan, 1.00%-1.25%; (f) yoochoose-buys, 0.024%-0.034%.]

Fig. 3. Number of patterns (candidates and final results) under various δ values.

and hence reduces the requirements in terms of runtime and memory. In all test datasets, #P3 is close to #P2. This indicates that the SEU upper bound used in ProUM has an overestimation effect similar to that of the PEU upper bound used in HUS-Span.

As the minimum utility threshold δ is decreased, the number of candidates increases for the HUS-Span, USpan, and ProUM algorithms. In contrast, that number increases much more slowly for the HUSP-ULL algorithm. When δ is set lower, the HUSP-ULL algorithm clearly generates far fewer candidates than the other algorithms. In particular, the USpan algorithm produces no results in Fig. 3(d) because of its very large number of candidates.

It can also be observed that the number of candidates generated by the USpan algorithm (with the SEU upper bound) is close to that of the HUS-Span algorithm in most cases. For all compared HUSPM algorithms, the number of final HUSPs is much smaller than the number of generated candidates; that is, #HUSPs is smaller than #P1, #P2, #P3, and #P4, as shown in Fig. 3(a) to (e). Notice that in Fig. 3(c) and (d), the results of #P4 are hard to see because the values of #P4 are too small. Although ProUM uses a structure named utility-array to store sequences and utility information in memory, it still generates a huge number of candidate patterns for discovering high-utility sequential patterns. The proposed HUSP-ULL algorithm employs the UL-list structure to speed up the mining process and uses a projection mechanism to reduce memory consumption.

Even though the LQS-tree may theoretically grow very large, in practice it stays relatively small in the proposed HUSP-ULL framework. We only consider a small part of the candidate space. That is, we only perform I-Concatenation and S-Concatenation on the potential candidate patterns that may be promising high-utility patterns. In practice, as shown on the Bible dataset, HUSP-ULL only has to cache up to a few thousand candidates. Using the proposed pruning strategies in the LQS-tree, HUSP-ULL can speed up computation by up to an order of magnitude, while memory consumption is also reduced.

E. Memory Usage Evaluation

For applications of data mining and analytics, the memory usage of a data mining algorithm is one of the key evaluation criteria. Therefore, a performance evaluation should also test memory usage. In this subsection, we further evaluate the memory usage of all the compared algorithms. With the same parameter settings as in Fig. 2, the memory usage of each algorithm under various δ values is shown in Fig. 4.

As mentioned earlier, the maximum memory is set to 4,096 MB, and USpan ran out of memory on the Leviathan dataset. It is clear that the proposed HUSP-ULL algorithm consumes the least memory among the compared algorithms with all parameter settings on all datasets, except for SynDataset-160K. Among the compared algorithms, the memory usage of HUSP-ULL is always very stable. For example, under the six values of δ, it consumes around 200 MB on Kosarak10k, and from 600 MB to 500 MB on yoochoose-buys. However, it can be observed that there is a sharp decrease for USpan and HUS-Span in some cases, as shown on Kosarak10k, Leviathan, and yoochoose-buys. For example, USpan ran out of memory when δ is set below 1.20% on Leviathan.

It is also interesting to observe that HUS-Span sometimes consumes more memory than the utility-matrix-based USpan algorithm, as shown on the Sign dataset. To summarize, in most cases on the test datasets, the proposed HUSP-ULL algorithm significantly outperforms the state-of-the-art HUSPM algorithms in terms of memory consumption. The

[Figure 4: six panels, each plotting memory usage (MB) against the minimum utility threshold δ for USpan, HUS-Span, ProUM, and HUSP-ULL. Panels and δ ranges: (a) Sign, 1.2% to 1.7%; (b) Bible, 0.5% to 1.0%; (c) SynDataset-160K, 0.065% to 0.090%; (d) Kosarak10k, 1.69% to 1.74%; (e) Leviathan, 1.00% to 1.25%; (f) yoochoose-buys, 0.024% to 0.034%.]

Fig. 4. Memory usage under various δ values.

[Figure 5: three panels on C8S6T4I3D|X|K with δ = 0.001, plotted against dataset size |D| from 10K to 400K sequences: (a) runtime (sec.) and (b) memory (MB) for USpan, HUS-Span, ProUM, and HUSP-ULL; (c) number of patterns for #P1, #P2, #P3, #P4, and HUSPs.]

Fig. 5. Scalability of the compared approaches.
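The scalability test in Fig. 5 keeps δ fixed while |D| grows, and the absolute minimum utility is δ × u(D). A minimal sketch of that arithmetic follows, assuming the usual utility-mining convention that an item occurrence contributes quantity × unit profit and that u(D) sums these over all sequences. The profit table, the tiny database, and the function names are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple

# A quantitative sequence: a list of elements, each a list of (item, quantity) pairs.
QSequence = List[List[Tuple[str, int]]]

# Hypothetical unit profits and database (not taken from the paper).
PROFIT: Dict[str, int] = {"a": 2, "b": 5, "c": 1}
DATABASE: List[QSequence] = [
    [[("a", 2), ("b", 1)], [("c", 3)]],
    [[("b", 2)], [("a", 1), ("c", 4)]],
]


def sequence_utility(seq: QSequence, profit: Dict[str, int]) -> int:
    """Sum quantity * unit profit over every item occurrence in one sequence."""
    return sum(qty * profit[item] for element in seq for item, qty in element)


def database_utility(db: List[QSequence], profit: Dict[str, int]) -> int:
    """u(D): total utility of all sequences in the database."""
    return sum(sequence_utility(seq, profit) for seq in db)


def minimum_utility(db: List[QSequence], profit: Dict[str, int], delta: float) -> float:
    """Absolute threshold delta * u(D) for a relative threshold delta."""
    return delta * database_utility(db, profit)


if __name__ == "__main__":
    delta = 0.25
    print(database_utility(DATABASE, PROFIT), minimum_utility(DATABASE, PROFIT, delta))
    # Doubling the database doubles u(D), so the absolute threshold doubles too.
    print(minimum_utility(DATABASE * 2, PROFIT, delta))
```

Because the absolute threshold scales with u(D), a pattern needs proportionally more utility to qualify in a larger dataset, which is consistent with the candidate counts in Fig. 5(c) not growing with |D|.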

reason is that HUSP-ULL utilizes the compact UL-list and two pruning strategies to reduce the space complexity.

F. Scalability

We further evaluated the scalability of the compared approaches on the synthetic sequence-based dataset C8S6T4I3D|X|K [53] (recall that |D| is the size of the dataset D). Note that the chain-store dataset from NU-MineBench³ is item-based data, so it was not used here. The results in terms of runtime and number of candidates under different parameter settings are shown in Fig. 5. The size of this synthetic sequence-based dataset is varied from 10K to 400K sequences, with the threshold δ fixed at 0.001 in each test.

In Fig. 5, it can be observed that HUSP-ULL has better scalability than the compared state-of-the-art algorithms on large datasets. As the dataset size is increased, the runtime of each of HUS-Span, USpan, ProUM, and HUSP-ULL always increases. Note that the HUS-Span algorithm does not obtain any results in some cases because it ran out of memory when the minimum utility threshold is set to a small value or when the dataset is very large. This is because the HUS-Span algorithm utilizes projected databases and the utility-chain structure to store utility information, which requires a large amount of memory to speed up the mining process. If patterns match many transactions, this structure can consume a large amount of memory. In Fig. 5(c), it can also be seen that the number of candidates does not increase when the dataset size is increased. This is reasonable since the minimum utility value (i.e., the value of δ × u(D)) increases as the dataset size is increased. Hence, fewer candidates are HUSPs, but the algorithms still spend time evaluating candidates. Thus, the runtime increases with the dataset size.

To summarize, HUSP-ULL is more suitable for processing large sequences, especially datasets having a large average number of elements per sequence (#Seq) or a large average number of items per element (#Ele).

³ http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html

VI. CONCLUSION

Utility-based sequence mining is a significant problem due to the subtle interesting patterns among different factors (e.g., timestamp, quantity, profit) and the meaningful knowledge

triggered by complex real-life situations. This paper has proposed a novel HUSP-ULL algorithm to discover high-utility sequential patterns (HUSPs) more efficiently. Specifically, the concept of the utility-linked (UL)-list was developed and used to calculate the utilities and the upper-bound values of candidates for deriving all HUSPs. By utilizing the designed UL-list structure, the HUSP-ULL algorithm can quickly discover the complete set of HUSPs. To further improve the performance of HUSP-ULL, two pruning strategies were introduced to reduce the upper bounds on utility and thus prune the search space to find HUSPs. Substantial experiments on some real-world and synthetic datasets show that the designed algorithm can effectively and efficiently identify all HUSPs and outperforms the state-of-the-art HUSPM algorithms. The proposed pruning strategies also improve the efficiency of mining HUSPs by reducing the number of unpromising candidates early.

VII. ACKNOWLEDGMENT

We thank the editors and anonymous reviewers for their constructive suggestions, which helped to improve the quality of this paper. We would like to thank Dr. Jun-Zhe Wang for providing the original C++ code of the HUS-Span algorithm, and Dr. Oznur Kirmemis Alkan for sharing the Java executable file of the HuspExt algorithm.

REFERENCES

[1] R. Agrawal and R. Srikant, “Mining sequential patterns,” in The International Conference on Data Engineering. IEEE, 1995, pp. 3–14.
[2] R. Srikant and R. Agrawal, “Mining sequential patterns: generalizations and performance improvements,” in Proceedings of International Conference on Extending Database Technology. Springer, 1996, pp. 1–17.
[3] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth,” in The International Conference on Data Engineering. IEEE, 2001, pp. 215–224.
[4] P. Fournier-Viger, J. C. W. Lin, R. U. Kiran, and Y. S. Koh, “A survey of sequential pattern mining,” Data Science and Pattern Recognition, vol. 1, no. 1, pp. 54–77, 2017.
[5] R. Agrawal, T. Imielinski, and A. Swami, “Database mining: A performance perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914–925, 1993.
[6] M. S. Chen, J. Han, and P. S. Yu, “Data mining: an overview from a database perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866–883, 1996.
[7] R. Agrawal, R. Srikant et al., “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Data Bases, vol. 1215, 1994, pp. 487–499.
[8] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent-pattern tree approach,” Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53–87, 2004.
[9] R. Chan, Q. Yang, and Y. D. Shen, “Mining high utility itemsets,” in Proceedings of the third IEEE International Conference on Data Mining. IEEE, 2003, pp. 19–26.
[10] H. Yao, H. J. Hamilton, and C. J. Butz, “A foundational approach to mining itemset utilities from databases,” in Proceedings of the SIAM International Conference on Data Mining. SIAM, 2004, pp. 482–486.
[11] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Efficient algorithms for mining high-utility itemsets in uncertain databases,” Knowledge-Based Systems, vol. 96, pp. 171–187, 2016.
[12] M. Liu and J. Qu, “Mining high utility itemsets without candidate generation,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 2012, pp. 55–64.
[13] Y. Liu, W. K. Liao, and A. Choudhary, “A two-phase algorithm for fast discovery of high utility itemsets,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2005, pp. 689–695.
[14] J. Yin, Z. Zheng, and L. Cao, “USpan: an efficient algorithm for mining high utility sequential patterns,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 660–668.
[15] J. Yin, Z. Zheng, L. Cao, Y. Song, and W. Wei, “Efficiently mining top-k high utility sequential patterns,” in Proceedings of the IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 1259–1264.
[16] G. C. Lan, T. P. Hong, V. S. Tseng, and S. L. Wang, “Applying the maximum utility measure in high utility sequential pattern mining,” Expert Systems with Applications, vol. 41, no. 11, pp. 5071–5081, 2014.
[17] J. Z. Wang, J. L. Huang, and Y. C. Chen, “On efficiently mining high utility sequential patterns,” Knowledge and Information Systems, vol. 49, no. 2, pp. 597–627, 2016.
[18] W. Gan, J. C. W. Lin, J. Zhang, H. C. Chao, H. Fujita, and P. S. Yu, “ProUM: High utility sequential pattern mining,” in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2019, pp. 1000–1007.
[19] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 12, pp. 1708–1721, 2009.
[20] V. S. Tseng, C. W. Wu, B. E. Shie, and P. S. Yu, “UP-Growth: an efficient algorithm for high utility itemset mining,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 253–262.
[21] V. S. Tseng, B. E. Shie, C. W. Wu, and P. S. Yu, “Efficient algorithms for mining high utility itemsets from transactional databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1772–1786, 2013.
[22] P. Fournier-Viger, C. W. Wu, S. Zida, and V. S. Tseng, “FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning,” in International Symposium on Methodologies for Intelligent Systems. Springer, 2014, pp. 83–92.
[23] S. Zida, P. Fournier-Viger, J. C. W. Lin, C. W. Wu, and V. S. Tseng, “EFIM: a highly efficient algorithm for high-utility itemset mining,” in Mexican International Conference on Artificial Intelligence. Springer, 2015, pp. 530–546.
[24] J. Liu, K. Wang, and B. C. Fung, “Direct discovery of high utility itemsets without candidate generation,” in Proceedings of the IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 984–989.
[25] V. S. Tseng, C. W. Wu, P. Fournier-Viger, and P. S. Yu, “Efficient algorithms for mining top-k high utility itemsets,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 1, pp. 54–67, 2016.
[26] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and H. C. Chao, “FDHUP: Fast algorithm for mining discriminative high utility patterns,” Knowledge and Information Systems, vol. 51, no. 3, pp. 873–909, 2017.
[27] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and H. Fujita, “Extracting non-redundant correlated purchase behaviors by utility measure,” Knowledge-Based Systems, vol. 143, pp. 30–41, 2018.
[28] J. C. W. Lin, W. Gan, T. P. Hong, and V. S. Tseng, “Efficient algorithms for mining up-to-date high-utility patterns,” Advanced Engineering Informatics, vol. 29, no. 3, pp. 648–661, 2015.
[29] G. C. Lan, T. P. Hong, and V. S. Tseng, “Discovery of high utility itemsets from on-shelf time periods of products,” Expert Systems with Applications, vol. 38, no. 5, pp. 5851–5857, 2011.
[30] Y. C. Lin, C. W. Wu, and V. S. Tseng, “Mining high utility itemsets in big data,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2015, pp. 649–661.
[31] U. Yun, D. Kim, E. Yoon, and H. Fujita, “Damped window based high average utility pattern mining over data streams,” Knowledge-Based Systems, vol. 144, pp. 188–205, 2018.
[32] W. Gan, J. C.-W. Lin, P. Fournier-Viger, H.-C. Chao, and P. S. Yu, “HUOPM: High-utility occupancy pattern mining,” IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2019.2896267, pp. 1–14, 2019.
[33] U. Yun, H. Ryang, G. Lee, and H. Fujita, “An efficient algorithm for mining high utility patterns from incremental databases with one database scan,” Knowledge-Based Systems, vol. 124, pp. 188–206, 2017.
[34] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, T. P. Hong, and H. Fujita, “A survey of incremental high-utility itemset mining,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 2, p. e1242, 2018.
[35] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, “FreeSpan: frequent pattern-projected sequential pattern mining,” in Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2000, pp. 355–359.
[36] M. J. Zaki, “SPADE: an efficient algorithm for mining frequent sequences,” Machine Learning, vol. 42, no. 1-2, pp. 31–60, 2001.

[37] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential pattern mining using a bitmap representation,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002, pp. 429–435.
[38] B. Le, M. T. Tran, and B. Vo, “Mining frequent closed inter-sequence patterns efficiently using dynamic bit vectors,” Applied Intelligence, vol. 43, no. 1, pp. 74–84, 2015.
[39] T. Le, A. Nguyen, B. Huynh, B. Vo, and W. Pedrycz, “Mining constrained inter-sequence patterns: a novel approach to cope with item constraints,” Applied Intelligence, vol. 48, no. 5, pp. 1327–1343, 2018.
[40] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “A survey of parallel sequential pattern mining,” ACM Transactions on Knowledge Discovery from Data, vol. 13, no. 3, p. 25, 2019.
[41] O. K. Alkan and P. Karagoz, “CRoM and HuspExt: Improving efficiency of high utility sequential pattern extraction,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 10, pp. 2645–2657, 2015.
[42] L. Zhou, Y. Liu, J. Wang, and Y. Shi, “Utility-based web path traversal pattern mining,” in Seventh IEEE International Conference on Data Mining Workshops. IEEE, 2007, pp. 373–380.
[43] C. F. Ahmed, S. K. Tanbeer, and B. S. Jeong, “Mining high utility web access sequences in dynamic web log data,” in 11th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing. IEEE, 2010, pp. 76–81.
[44] B. E. Shie, H. F. Hsiao, and V. S. Tseng, “Efficient algorithms for discovering high utility user behavior patterns in mobile commerce environments,” Knowledge and Information Systems, vol. 37, no. 2, pp. 363–387, 2013.
[45] B. E. Shie, H. F. Hsiao, V. S. Tseng, and P. S. Yu, “Mining high utility mobile sequential patterns in mobile commerce environments,” in Proceedings of International Conference on Database Systems for Advanced Applications. Springer, 2011, pp. 224–238.
[46] M. Zihayat, H. Davoudi, and A. An, “Mining significant high utility gene regulation sequential patterns,” BMC Systems Biology, vol. 11, no. 6, p. 109, 2017.
[47] C. F. Ahmed, S. K. Tanbeer, and B. S. Jeong, “A novel approach for mining high-utility sequential patterns in sequence databases,” ETRI Journal, vol. 32, no. 5, pp. 676–686, 2010.
[48] C. W. Wu, Y. F. Lin, P. S. Yu, and V. S. Tseng, “Mining high utility episodes in complex event sequences,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 536–544.
[49] J. Z. Wang and J. L. Huang, “On incremental high utility sequential pattern mining,” ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 5, p. 55, 2018.
[50] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, V. S. Tseng, and P. S. Yu, “A survey of utility-oriented pattern mining,” IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2019.2942594, pp. 1–20, 2019.
[51] W. Gan, J. C. W. Lin, H. C. Chao, S. L. Wang, and P. S. Yu, “Privacy preserving utility mining: a survey,” in IEEE International Conference on Big Data. IEEE, 2018, pp. 2617–2626.
[52] P. Fournier-Viger, J. C. W. Lin, A. Gomariz, T. Gueniche, A. Soltani, Z. Deng, and H. T. Lam, “The SPMF open-source data mining library version 2,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 36–40.
[53] R. Agrawal and R. Srikant, “Quest synthetic data generator,” http://www.Almaden.ibm.com/cs/quest/syndata.html, 1994.
[54] “Frequent itemset mining dataset repository,” http://fimi.ua.ac.be/data/, 2012.
[55] T. Truong-Chi and P. Fournier-Viger, “A survey of high utility sequential pattern mining,” in High-Utility Pattern Mining. Springer, 2019, pp. 97–129.

Wensheng Gan received the Ph.D. in Computer Science and Technology from Harbin Institute of Technology (Shenzhen), Shenzhen, China in 2020. He was a joint PhD student at the University of Illinois at Chicago (UIC), IL, USA, from 2017 to 2019. He received the B.S. degree in Computer Science from South China Normal University, Guangdong, China in 2013. His research interests include data mining, utility computing, and big data analytics. He has published more than 50 research papers in peer-reviewed journals (e.g., TKDE, TKDD, TCYB, AEI, KBS) and conferences, which have received more than 600 citations.

Jerry Chun-Wei Lin (SM’19) is an associate professor at Western Norway University of Applied Sciences, Bergen, Norway. He received the Ph.D. in Computer Science and Information Engineering from National Cheng Kung University, Tainan, Taiwan in 2010. His research interests include data mining, big data analytics, soft computing, and privacy. He has published more than 300 research papers in peer-reviewed international conferences and journals. He is the co-leader of the popular SPMF open-source data mining library, the project leader of the PPSF open-source privacy and security library, the Editor-in-Chief (EiC) of the Data Science and Pattern Recognition (DSPR) journal, and an Associate Editor of Journal of Internet Technology and IEEE Access. He is a Senior Member of both IEEE and ACM.

Jiexiong Zhang is currently a senior software engineer at Didi Chuxing, Beijing, China. He received the M.S. degree in Computer Science from Harbin Institute of Technology (Shenzhen), Guangdong, China in 2017. His research interests include data mining, artificial intelligence, and big data analytics.

Philippe Fournier-Viger is a full professor and Youth 1000 scholar at the Harbin Institute of Technology (Shenzhen), Shenzhen, China. He received a Ph.D. in Computer Science at the University of Quebec in Montreal in 2010. His research interests include pattern mining, sequence analysis and prediction, and social network mining. He has published more than 300 research papers in refereed international conferences and journals. He is the founder of the popular SPMF open-source data mining library. He is Editor-in-Chief (EiC) of the Data Science and Pattern Recognition (DSPR) journal.

Han-Chieh Chao (SM’04) has been the president of National Dong Hwa University since February 2016. He received M.S. and Ph.D. degrees in Electrical Engineering from Purdue University in 1989 and 1993, respectively. His research interests include high-speed networks, wireless networks, IPv6-based networks, and artificial intelligence. He has published nearly 500 peer-reviewed professional research papers. He is the Editor-in-Chief (EiC) of IET Networks and Journal of Internet Technology. Dr. Chao has served as a guest editor for ACM MONET, IEEE JSAC, IEEE Communications Magazine, IEEE Systems Journal, Computer Communications, IEEE Proceedings Communications, Wireless Personal Communications, and Wireless Communications & Mobile Computing. Dr. Chao is an IEEE Senior Member and a fellow of IET.

Philip S. Yu (F’93) received the B.S. degree in electrical engineering from National Taiwan University, M.S. and Ph.D. degrees in electrical engineering from Stanford University, and an MBA from New York University. He is a distinguished professor of computer science with the University of Illinois at Chicago (UIC) and also holds the Wexler Chair in Information Technology at UIC. Before joining UIC, he was with IBM, where he was manager of the Software Tools and Techniques Department at the Thomas J. Watson Research Center. His research interests include data mining, data streams, databases, and privacy. He has published more than 1,300 papers in peer-reviewed journals (e.g., TKDE, TPDS, TKDD, VLDBJ) and conferences (e.g., SIGMOD, KDD, ICDE, WWW, AAAI, SIGIR, ICML). He holds or has applied for more than 300 U.S. patents. Dr. Yu was the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data. He received the ACM SIGKDD 2016 Innovation Award and the IEEE Computer Society 2013 Technical Achievement Award. Dr. Yu is a fellow of the ACM and the IEEE.
