
High-Utility Itemset Mining with Effective Pruning Strategies

JIMMY MING-TAI WU, Shandong University of Science and Technology


JERRY CHUN-WEI LIN, Western Norway University of Applied Sciences
ASHISH TAMRAKAR, University of Nevada

High-utility itemset mining is a popular data mining problem that considers utility factors, such as the quantity and unit profit of items, in addition to their frequency in a transactional database. It helps to find the most valuable and profitable products/items, which are difficult to track by using only frequent itemsets. An item that is rare in the transactional database may still have a high profit value and therefore tremendous importance. While many existing algorithms for finding high-utility itemsets (HUIs) generate comparatively large candidate sets, our main focus is on significantly reducing the computation time by introducing new pruning strategies. The designed pruning strategies reduce the number of unnecessary nodes visited in the search space, which reduces the time required by the algorithm. In this article, two new stricter upper bounds are designed to reduce the computation time by refraining from visiting unnecessary nodes of an itemset. Thus, the search space of the potential HUIs can be greatly reduced, and the execution time of the mining procedure can be improved. The proposed strategies can also significantly reduce the size of the transaction database generated on each node. Experimental results showed that the designed algorithm with the two pruning strategies outperforms the state-of-the-art algorithms for mining the required HUIs in terms of runtime and number of revised candidates. The memory usage of the designed algorithm is also lower than that of the state-of-the-art approach. Moreover, a multi-thread concept is also discussed to further handle the problem of big datasets.
CCS Concepts: • Information systems → Association rules; Data analytics; • Computing methodologies → Knowledge representation and reasoning;
Additional Key Words and Phrases: HUIM, high-utility itemset, pruning strategy, multiple threads
ACM Reference format:
Jimmy Ming-Tai Wu, Jerry Chun-Wei Lin, and Ashish Tamrakar. 2019. High-Utility Itemset Mining with
Effective Pruning Strategies. ACM Trans. Knowl. Discov. Data 13, 6, Article 58 (October 2019), 22 pages.
https://doi.org/10.1145/3363571

1 INTRODUCTION
The main challenge in data mining has always been finding the meaningful information from huge
datasets. A data mining technique to discover interesting, unexpected, and useful patterns of data

Authors’ addresses: J. M.-T. Wu, College of Computer Science and Engineering, Shandong University of Science and
Technology, 579 Qianwangang Rd, Qingdao, Shandong 266590, China; email: wmt@wmt35.idv.tw; J. C.-W. Lin (corre-
sponding author), Department of Computer Science, Electrical Engineering and Mathematical Sciences, Western Nor-
way University of Applied Sciences, Inndalsveien 28, Bergen 5063, Norway; email: jerrylin@ieee.org; A. Tamrakar, De-
partment of Computer Science, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154; email:
ashish.tamrakar@unlv.edu.
© 2019 Association for Computing Machinery.

ACM Transactions on Knowledge Discovery from Data, Vol. 13, No. 6, Article 58. Publication date: October 2019.

from a huge database is called pattern mining. Previously, most research focused on Frequent Itemset Mining (FIM) and Association Rule Mining (ARM). These traditional pattern mining algorithms locate the itemsets whose frequency is no less than the minimum support count (threshold) [7]. Apriori was first proposed for FIM; it scans the database multiple times to identify the frequent individual items/patterns and highlight the actual information and rules [1]. FP-Growth was then proposed to overcome the limitation of the Apriori algorithm. It discovers all the frequent patterns with only two database scans [19]. FP-Growth is based on a compressed FP-tree structure and mines the set of frequent itemsets recursively. Thus, the number of unpromising candidates and the computational cost can be reduced significantly.
Although FIM/ARM discovers frequently occurring itemsets, it is likely to miss itemsets that are unexpectedly important in terms of profit rather than quantity. For example, the sale of milk and bread occurs frequently among the transactions in the dataset, while the sale of caviar is very rare and might not be reflected in the outcome of FIM/ARM. Therefore, it is necessary to consider both the profit and quantity of the itemsets for mining the useful and profitable itemsets. Consequently, the concept of high-utility itemset mining (HUIM) was developed.
To discover the useful and profitable itemsets from huge transactional datasets, HUIM [6, 47–49, 54] has become one of the most significant research topics. HUIM considers both internal and external utilities [47] to obtain the set of profitable itemsets. Internal utility is represented by the quantity of an item and external utility is represented by the profit of an item. A minimum high-utility count (threshold) is used to decide whether an itemset is a high-utility itemset (HUI). Recently, a lot of research has been carried out in the field of HUIM [17, 29, 30]. Liu et al. proposed a two-phase model that computes the Transaction Weighted Utility (TWU) of an itemset and uses the Transaction Weighted Downward Closure (TWDC) property to find the set of HUIs [33]. This algorithm generates a large number of candidates in a level-wise manner to find the required HUIs, and therefore requires a lot of computation time and memory to process a large number of unpromising candidates. Several improved methods have been proposed to reduce the unpromising candidate sets [41, 42].
Liu et al. and Zida et al. respectively proposed two approaches to find the HUIs without candidate generation [32, 56]. In [32], an efficient utility-list (UL) structure was presented to keep the potential itemsets in a linked-list structure. It uses a simple join operation to generate the k-itemsets and outperforms the traditional Apriori-like and tree-based approaches. Liu et al. then presented the D2HUP algorithm [31], which uses a single phase without generating candidates. A novel data structure is used to compute a tighter upper bound for pruning the unpromising candidates and identifying the actual HUIs. Zida et al. [56] then proposed the EFIM approach, an effective algorithm that quickly finds all the HUIs in a dataset. In EFIM, a search tree is generated based on upper bounds, such as sub-tree utility and local utility, to reduce the number of itemsets that need to be evaluated. In this article, two new upper bounds are proposed to reduce the search space in EFIM. Although EFIM outperforms the previous studies of HUIM, it still does not fully address the mining performance problem, especially for sparse datasets. Thus, it is necessary to present efficient pruning strategies for further improvements. The major contributions of this article are described below:

(1) Two well-defined upper bounds are proposed to reduce the number of unpromising candidates. Thus, the proposed approach can reduce the runtime and improve the mining performance as compared to EFIM.
(2) The proposed approach generates fewer branches of the search space than EFIM, which effectively reduces memory usage on a huge dataset.


(3) A multi-thread framework is also discussed, which can be used to handle large-scale datasets and can be applied in the MapReduce environment.
(4) Experimental results show that the proposed method evaluates fewer candidate itemsets than the state-of-the-art D2HUP and EFIM algorithms.
The remainder of this article is organized as follows: In Section 2, the related work and pre-
liminaries are described. In Section 3, the proposed algorithm is described in detail. An illustrated
example is shown in Section 4. The experimental results and discussion are given in Section 5.
Finally, conclusions of this article are given in Section 6.

2 BACKGROUND
2.1 Related Work
Data mining technology [5, 7, 11, 14, 15, 18–20, 24] is used to find potential and implicit information in very large datasets, and the most commonly used approaches were FIM and ARM. The initial breakthrough came when Agrawal and Srikant [1] proposed a method named Apriori. To further improve the mining performance, Han et al. [19] proposed the FP-Growth algorithm with a tree structure named FP-tree. However, FIM does not consider the importance of items or their quantities. Therefore, there was a need for weighted FIM (WTI-FWI) [44, 52, 55]. These methods assign weights to items to reflect their importance.
HUIM [3, 10, 34, 36, 41, 45, 47–49] considers the item quantities (internal utility) and profit values (external utility). Many applications of HUIs have already been proposed, which shows that the field of HUIM has important commercial value. These applications include website click-stream analysis [4, 23, 39], cross-marketing in retail stores [9, 25, 26], mobile commerce environments [37, 38], gene regulation [57], and bio-medical applications. The initial concept of HUIM was proposed by Yao et al. [48], but it was computationally impractical since the “combinatorial explosion” problem can easily occur. Liu et al. [33] proposed a two-phase algorithm, based on an Apriori-like approach, to find the set of HUIs using multiple database scans. However, this algorithm generated a large number of candidates on each level, which caused a high computational cost for mining the promising HUIs. Erwin et al. [10] then presented the CTU-Mine algorithm to discover the HUIs based on the pattern-growth approach.
HUC-Prune [2] was designed to extract the high-utility patterns, but it still needed multiple database scans to find the required HUIs. The Incremental High-Utility Patterns (IHUP) algorithm [4] was also designed to mine the HUIs incrementally and interactively. Although IHUP avoids the generate-and-test approach, it still produces a large number of candidates in Phase 1. Lin et al. [27] developed the HUP-tree structure to keep the necessary information by integrating the FP-tree-like structure and the TWU methodology. Tseng et al. then presented the UP-Growth [43] and UP-Growth+ [41] algorithms to efficiently prune the unpromising candidates early and mine the HUIs based on the designed UP-tree. The above algorithms are mostly based on Apriori-like and tree-based structures for mining the set of HUIs. Many studies have focused on developing efficient pruning strategies for mining HUIs, and some of them are still in progress [8, 17, 22, 30, 35, 40].
Recently, Liu et al. [32] proposed the mining of HUIs without candidate generation. A utility-list was used to store the utility information of itemsets. These utility-lists also helped to prune the unnecessary candidates. However, this algorithm used a large amount of memory for the utility-list of each itemset. Liu et al. developed the D2HUP algorithm [31], which relies on a single-phase approach without generating candidates. A novel data structure is also designed to estimate a tighter upper bound for pruning the unpromising candidates and identifying the actual HUIs. FHM [12] is an enhanced version of HUI-Miner, which utilizes a novel pruning strategy


named EUCP (Estimated Utility Co-occurrence Pruning) to reduce the costly join operations of utility-lists. Krishnamoorthy [21] employed several pruning strategies to improve the performance of the HUI-Miner algorithm. A partitioned utility list (PUL), which maintains utility information at the partition level, is also designed to keep the necessary information for improving the mining performance. For further pruning the search space, Zida et al. [56] also used the concept of utility lists and proposed two upper bounds called sub-tree utility and local utility. These bounds are described in the following sections. This method also uses the fast utility counting technique to reduce the memory usage. Additionally, Yun et al. [53] proposed the mining of HUIs for an incremental dataset environment. This method can effectively update the set of HUIs when new transactions are added to the dataset. Ryang et al. [36] then presented the SIQ algorithm with two pruning strategies for efficiently mining the HUIs. Ryang and Yun [34] then presented an effective algorithm called SHU-Growth, which is used to mine the HUIs from the stream environment. A summary of the well-known HUIM algorithms is shown in Table 1.
Several extensions of HUIM have also been studied. For example, average-utility itemset mining [28, 50] takes the length of the itemset into account when evaluating the average utility of the itemset. Yun et al. also designed the MPM method [50] to mine high average-utility itemsets from data streams. Mining HUIs from stream data [52] is also a challenging topic since it needs to keep enough information in a time window for mining the complete HUIs. How to efficiently update the discovered HUIs in an incremental database is also an important task [16, 53]. Moreover, top-k HUIM [45] has emerged as an interesting topic in recent years; it finds the top-k HUIs instead of mining the complete set of HUIs.

2.2 Preliminaries
To efficiently mine the set of HUIs, it is necessary to estimate the lower- and upper-bound values of an itemset and reduce the number of candidates for obtaining the actual HUIs. Hence, reducing the number of candidate sets by applying an efficient pruning rule plays a vital role in improving the performance of the algorithm. Therefore, this study aims to construct a novel tree structure to generate candidate sets efficiently, and to apply the proposed pruning strategies to significantly reduce the unnecessary candidate sets in large datasets. Moreover, a multi-thread approach is considered to speed up the mining performance for handling large-scale datasets. Preliminaries are then defined as follows:
Suppose the finite set of $m$ unique items is $I = \{i_1, i_2, \ldots, i_m\}$, and the quantitative database with a set of transactions is $D = \{T_1, T_2, \ldots, T_n\}$. Each transaction $T_q \in D$, where $1 \le q \le n$, has a unique identifier called its TID. Each item $i_j$ is associated with a purchase quantity, which is the internal utility, and with a unit profit, which is the external utility. Internal and external utilities are denoted by $q(i_j, T_q)$ and $pft(i_j)$, respectively. A set of $k$ unique items $X = \{i_1, i_2, \ldots, i_k\}$, where $X \subseteq I$, is said to be a $k$-itemset, where $k$ is the length of the itemset. An itemset $X$ is in transaction $T_q$ if $X \subseteq T_q$. A minimum high-utility threshold ratio $\delta$ is also defined.

Definition 2.1. The utility of an item $i_j$ in a transaction $T_q$, denoted by $u(i_j, T_q)$, is defined as follows:

$$u(i_j, T_q) = q(i_j, T_q) \times pft(i_j). \tag{1}$$

Definition 2.2. The utility of an itemset $X$ in a transaction $T_q$, denoted by $u(X, T_q)$, is defined as follows:

$$u(X, T_q) = \sum_{i_j \in X \,\wedge\, X \subseteq T_q} u(i_j, T_q). \tag{2}$$


Table 1. A Summary of the HUIM Algorithms

| Algorithm | Content | Problem | Year |
|---|---|---|---|
| UMining [48] | The fundamental HUIM model with the mathematical properties of the utility measure. | A large amount of candidate patterns are generated and the scalability is poor. | 2004 |
| Two-Phase [33] | It first holds the TWDC to reduce the candidate size. | It requires multiple database scans to level-wisely mine the required HUIs. | 2005 |
| CTU-Mine [10] | A compact tree structure is designed to keep the information and use the pattern-growth method to mine HUIs. | The CTU-tree is complex and consumes huge memory. | 2007 |
| HUC-Prune [2] | Mining HUIs without level-wisely candidate-generation-and-test approach. | The over-estimated value of upper bound is high and the constructed tree is huge. | 2009 |
| IHUP [4] | A tree-based algorithm to incrementally and interactively mine the HUIs. | It still maintains the TWU model to generate many HTWUIs in phase 1. | 2009 |
| UP-Growth [43] | The utility pattern growth algorithm with a compressed utility pattern tree (UP-tree). | It takes recursive method to process all conditional prefix trees, which is time consuming. | 2010 |
| HUP-tree [27] | The high-utility pattern tree structure to keep the potential HUIs. | The method is not suitable for a long length transaction. | 2011 |
| HUI-Miner [32] | The first one-phase model to mine HUIs. | It is time consuming to perform the join operations. | 2012 |
| D2HUP [31] | An algorithm to directly mine the HUIs without maintaining candidates. | The tree structure and CAUL have huge memory usage. | 2012 |
| UP-Growth+ [41] | An improved UP-Growth algorithm with two pruning strategies. | It is still time consuming to recursively process all conditional prefix trees to discover candidates. | 2013 |
| FHM [12] | It is extended by HUI-Miner with a EUCS matrix to further prune the unpromising 2-itemsets. | It requires extra memory to keep the related information and shows poor results on a very dense database. | 2014 |
| HUP-Miner [21] | An extension of the HUI-Miner algorithm with pruning strategies. | It still overestimates the upper bound of the itemsets for mining HUIs. | 2015 |
| EFIM [56] | The state-of-the-art algorithm to mine the HUIs by using the projection method. | The recursive projection is sometimes time consuming. | 2015 |
| SIQ-tree [36] | An algorithm only scans the database one time and decrease the number of candidates. | The process for conditional prefix trees is time consuming. | 2016 |
| SHU-Growth [34] | A sliding window algorithm to mine HUIs from the stream environment. | The algorithm is strictly for the stream environment. | 2016 |
| IHMUP [35] | An algorithm is based on utility-list without any candidate. | The upper bound should be further reduced. | 2017 |
| HUIM-ACS [46] | It is an ant-base algorithm to mine HUIs. | It cannot ensure mining all of HUIs in a database. | 2017 |

Definition 2.3. The utility of an itemset $X$ in database $D$, denoted by $u(X)$, is defined as follows:

$$u(X) = \sum_{X \subseteq T_q \,\wedge\, T_q \in D} u(X, T_q). \tag{3}$$

Definition 2.4. The transaction utility (TU) of a transaction $T_q$, denoted by $TU(T_q)$, is the sum of the utilities of all items in $T_q$ and is defined as follows:

$$TU(T_q) = \sum_{i_j \in T_q} u(i_j, T_q). \tag{4}$$


Table 2. A Quantitative Database D

| TID | Transaction (item:quantity) | TU |
|---|---|---|
| T1 | A:3, B:3, D:1 | 28 |
| T2 | A:3, B:7, C:1, E:3 | 49 |
| T3 | A:4, C:3, F:1 | 47 |
| T4 | C:1, D:4, E:10 | 58 |
| T5 | A:4, B:4, D:2, E:6 | 54 |
| T6 | A:6, B:2, D:2, E:1 | 46 |

Table 3. A Profit Table

| Item | Profit Value |
|---|---|
| A | 4 |
| B | 3 |
| C | 10 |
| D | 7 |
| E | 2 |
| F | 1 |

Definition 2.5. The total utility in database $D$, denoted by $TU$, is defined as follows:

$$TU = \sum_{T_q \in D} TU(T_q). \tag{5}$$

Definition 2.6. The TWU of an itemset $X$ in database $D$, denoted by $TWU(X)$, is defined as follows:

$$TWU(X) = \sum_{X \subseteq T_q \in D} TU(T_q). \tag{6}$$

Definition 2.7. An itemset $X$ in a database $D$ is a High Transaction Weighted Utility Itemset (HTWUI) if its $TWU$ is greater than or equal to the user-specified minimum threshold, where the minimum high-utility threshold is $TU$ multiplied by the threshold ratio $\delta$, as follows:

$$HTWUI \leftarrow \{X \mid TWU(X) \ge TU \times \delta\}. \tag{7}$$

Definition 2.8. An itemset $X$ in a database $D$ is an HUI if its utility is greater than or equal to the user-specified minimum high-utility threshold, where the minimum high-utility threshold is $TU$ multiplied by the threshold ratio $\delta$, as follows:

$$HUI \leftarrow \{X \mid u(X) \ge TU \times \delta\}. \tag{8}$$
An illustrative example is shown in Table 2, which represents a quantitative database with six transactions and six distinct items. Table 3 represents a profit table in which there is a profit value for each item. Assume a user specifies the threshold ratio $\delta$ as 0.17, which gives a threshold value of $TU \times \delta = 47.94$.
The utilities of items C and D in transaction $T_4$ are calculated using Equation (1) as $u(C, T_4) = q(C, T_4) \times pft(C) = 1 \times 10 = 10$ and $u(D, T_4) = q(D, T_4) \times pft(D) = 4 \times 7 = 28$. The utility of the itemset {C, D} in transaction $T_4$ is calculated from Equation (2) as $u(\{C, D\}, T_4) = u(C, T_4) + u(D, T_4) = 10 + 28 = 38$. The utility of the itemset {C, D} in database D is calculated from Equation (3) as $u(\{C, D\}) = u(\{C, D\}, T_4) = 38$, since $T_4$ is the only transaction that contains both C and D. The TU of transaction $T_4$ is calculated from Equation (4)


Table 4. Transaction Weighted Utility (TWU) of Items

| Itemset | {A} | {B} | {C} | {D} | {E} | {F} |
|---|---|---|---|---|---|---|
| TWU | 224 | 177 | 154 | 186 | 207 | 47 |

Table 5. The Revised Database

| TID | Transaction (item:quantity) |
|---|---|
| T1 | B:3, D:1, A:3 |
| T2 | C:1, B:7, E:3, A:3 |
| T3 | C:3, A:4 |
| T4 | C:1, D:4, E:10 |
| T5 | B:4, D:2, E:6, A:4 |
| T6 | B:2, D:2, E:1, A:6 |

as $TU(T_4) = u(C, T_4) + u(D, T_4) + u(E, T_4) = 10 + 28 + 20 = 58$. Similarly, the TU values for the other transactions are $TU(T_1) = 28$, $TU(T_2) = 49$, $TU(T_3) = 47$, $TU(T_5) = 54$, and $TU(T_6) = 46$. The total utility is calculated from Equation (5) as $TU = 28 + 49 + 47 + 58 + 54 + 46 = 282$. The TWU for the itemset {A, B} is calculated from Equation (6) as $TWU(\{A, B\}) = TU(T_1) + TU(T_2) + TU(T_5) + TU(T_6) = 28 + 49 + 54 + 46 = 177$. The itemset {A, B} has $TWU(\{A, B\}) = 177 \ge TU \times \delta = 47.94$, and it is therefore an HTWUI. Similarly, the itemset {C, D} has $u(\{C, D\}) = 38 < TU \times \delta$, and it is not an HUI.
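The worked example above can be checked with a short script. The following is a minimal sketch in Python, not the authors' implementation; the function names (`item_utility`, `itemset_utility`, `utility`, `transaction_utility`, `twu`) are hypothetical and only mirror Equations (1) to (6) on the data of Tables 2 and 3.

```python
# Table 2 (item -> purchase quantity) and Table 3 (item -> unit profit).
DB = {
    "T1": {"A": 3, "B": 3, "D": 1},
    "T2": {"A": 3, "B": 7, "C": 1, "E": 3},
    "T3": {"A": 4, "C": 3, "F": 1},
    "T4": {"C": 1, "D": 4, "E": 10},
    "T5": {"A": 4, "B": 4, "D": 2, "E": 6},
    "T6": {"A": 6, "B": 2, "D": 2, "E": 1},
}
PROFIT = {"A": 4, "B": 3, "C": 10, "D": 7, "E": 2, "F": 1}


def item_utility(item, trans):
    """Equation (1): purchase quantity times unit profit."""
    return trans[item] * PROFIT[item]


def itemset_utility(itemset, trans):
    """Equation (2): utility of an itemset in one transaction (0 if absent)."""
    if not set(itemset).issubset(trans):
        return 0
    return sum(item_utility(i, trans) for i in itemset)


def utility(itemset, db=DB):
    """Equation (3): utility of an itemset over the whole database."""
    return sum(itemset_utility(itemset, t) for t in db.values())


def transaction_utility(trans):
    """Equation (4): sum of the utilities of all items in the transaction."""
    return sum(item_utility(i, trans) for i in trans)


def twu(itemset, db=DB):
    """Equation (6): sum of TU over the transactions containing the itemset."""
    return sum(transaction_utility(t) for t in db.values()
               if set(itemset).issubset(t))


total_utility = sum(transaction_utility(t) for t in DB.values())  # Equation (5)
threshold = total_utility * 0.17                                  # TU x delta

# Prints: 38 282 177 47.94
print(utility({"C", "D"}), total_utility, twu({"A", "B"}), round(threshold, 2))
```

With these values, {A, B} passes the HTWUI test of Equation (7) because 177 is at least 47.94, while {C, D} fails the HUI test of Equation (8) because 38 is below 47.94.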
Definition 2.9. The total ordering, denoted by $\prec$, is the ordering of items in the increasing order of $TWU$ in the transaction. For example, the TWU for each item in D is shown in Table 4. The increasing order of items in terms of $TWU$ is as follows: $F \prec C \prec B \prec D \prec E \prec A$.
Definition 2.10. The revised transaction, denoted by $RT$, is a transaction in which all items that have $TWU < TU \times \delta$ are removed and the remaining items are sorted in increasing order of $TWU$. The items that are removed from the transactions are considered unpromising items.
From the given illustrative example, after removing the unpromising items and arranging the remaining items in increasing order of $TWU$, the revised database D is shown in Table 5.
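The construction of the revised database can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the helper name `revise` is hypothetical, and ties in TWU (which do not occur in this example) would be left in the order produced by Python's stable sort.

```python
# Table 2 (item -> purchase quantity) and Table 3 (item -> unit profit).
DB = {
    "T1": {"A": 3, "B": 3, "D": 1},
    "T2": {"A": 3, "B": 7, "C": 1, "E": 3},
    "T3": {"A": 4, "C": 3, "F": 1},
    "T4": {"C": 1, "D": 4, "E": 10},
    "T5": {"A": 4, "B": 4, "D": 2, "E": 6},
    "T6": {"A": 6, "B": 2, "D": 2, "E": 1},
}
PROFIT = {"A": 4, "B": 3, "C": 10, "D": 7, "E": 2, "F": 1}


def revise(db, profit, ratio):
    """Drop items with TWU < TU x ratio and sort the rest by increasing TWU."""
    # Transaction utility of every transaction.
    tu = {tid: sum(q * profit[i] for i, q in t.items()) for tid, t in db.items()}
    threshold = sum(tu.values()) * ratio
    # TWU of every single item: sum of TU over the transactions containing it.
    twu = {}
    for tid, t in db.items():
        for item in t:
            twu[item] = twu.get(item, 0) + tu[tid]
    # Total ordering of the promising items (Definition 2.9).
    order = sorted((i for i in twu if twu[i] >= threshold), key=lambda i: twu[i])
    # Revised transactions (Definition 2.10); empty transactions are dropped.
    revised = {}
    for tid, t in db.items():
        kept = [(i, t[i]) for i in order if i in t]
        if kept:
            revised[tid] = kept
    return order, revised


order, revised = revise(DB, PROFIT, 0.17)
print(order)          # ['C', 'B', 'D', 'E', 'A']
print(revised["T3"])  # [('C', 3), ('A', 4)]  (item F was removed)
```

Storing each revised transaction as an ordered list of (item, quantity) pairs keeps the total ordering explicit, which is what the later bounds rely on.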
Definition 2.11. The remaining utility of an itemset $X$ in transaction $T$, denoted by $rem(X, T)$, is defined as follows:

$$rem(X, T) = \sum_{i_j \in T \,\wedge\, z \prec i_j\ \forall z \in X} u(i_j, T). \tag{9}$$

As shown in Table 5, the remaining utility for the itemset {B, D} in transaction $T_1$ is $rem(\{B, D\}, T_1) = u(A, T_1) = 3 \times 4 = 12$.
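The remaining utility can be computed directly from the revised database. The sketch below is a hedged Python illustration; `remaining_utility`, `ORDER`, and `RANK` are hypothetical names, and the data are the quantities of Table 5 with the profits of Table 3.

```python
PROFIT = {"A": 4, "B": 3, "C": 10, "D": 7, "E": 2}
ORDER = ["C", "B", "D", "E", "A"]                  # increasing TWU (Definition 2.9)
RANK = {item: pos for pos, item in enumerate(ORDER)}

# Table 5: revised transactions (item -> purchase quantity).
REVISED = {
    "T1": {"B": 3, "D": 1, "A": 3},
    "T2": {"C": 1, "B": 7, "E": 3, "A": 3},
    "T3": {"C": 3, "A": 4},
    "T4": {"C": 1, "D": 4, "E": 10},
    "T5": {"B": 4, "D": 2, "E": 6, "A": 4},
    "T6": {"B": 2, "D": 2, "E": 1, "A": 6},
}


def remaining_utility(itemset, trans):
    """Equation (9): utility of the items in trans that follow every item of itemset."""
    last = max(RANK[z] for z in itemset)
    return sum(q * PROFIT[i] for i, q in trans.items() if RANK[i] > last)


rem_values = {tid: remaining_utility({"B", "D"}, REVISED[tid])
              for tid in ("T1", "T5", "T6")}
print(rem_values)  # {'T1': 12, 'T5': 28, 'T6': 26}
```

In $T_1$ only item A follows D, so the result is 3 × 4 = 12; in $T_5$ and $T_6$ both E and A follow D.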
Definition 2.12. The extension of an itemset γ , denoted by Ex (γ ), is the possible following items
for the given itemset γ .
From Figure 1, the extension of an itemset {C} is {B, D, E, A}, and similarly for an itemset {D},
it is {E, A}.
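Definition 2.13. The projected dataset of a database $D$ for an itemset $\gamma$, denoted by $\gamma\text{-}D$, is as follows:
$\gamma\text{-}D = \{\gamma\text{-}T \mid T \in D \wedge \gamma\text{-}T \neq \emptyset\}$, where
$\gamma\text{-}T = \{i_j \mid i_j \in T \wedge i_j \in Ex(\gamma)\}$ is the projection of transaction $T$ for the itemset $\gamma$.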


Fig. 1. Construction of tree structure.

The projected transaction merging is the method of merging identical projected transactions ($\gamma\text{-}T$), together with the utility from each of them, into one transaction, where $k$ is the number of identical projected transactions.
From the illustrative example in Table 5, considering $\gamma = \{B\}$, the projected database $\gamma\text{-}D$ obtains the projected transactions (D, A) from $T_1$, (E, A) from $T_2$, (D, E, A) from $T_5$, and (D, E, A) from $T_6$. The identical projected transactions from $T_5$ and $T_6$ are merged to form a single projected transaction in the $\gamma\text{-}D$ database. That is, the new projected database will have (D, A), (E, A), and (D, E, A).
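Projection and merging can be sketched as below. This Python illustration rests on two assumptions that are not spelled out in the text: only transactions that contain $\gamma$ are projected (as in the example, where $T_3$ and $T_4$ do not appear), and the merged transaction stores, for each item, the sum of that item's utilities over the identical projected transactions. The name `project_and_merge` is hypothetical.

```python
PROFIT = {"A": 4, "B": 3, "C": 10, "D": 7, "E": 2}
ORDER = ["C", "B", "D", "E", "A"]                  # increasing TWU
RANK = {item: pos for pos, item in enumerate(ORDER)}

# Table 5: revised transactions (item -> purchase quantity), already sorted.
REVISED = {
    "T1": {"B": 3, "D": 1, "A": 3},
    "T2": {"C": 1, "B": 7, "E": 3, "A": 3},
    "T3": {"C": 3, "A": 4},
    "T4": {"C": 1, "D": 4, "E": 10},
    "T5": {"B": 4, "D": 2, "E": 6, "A": 4},
    "T6": {"B": 2, "D": 2, "E": 1, "A": 6},
}


def project_and_merge(gamma, db):
    """Project db on gamma and merge identical projections (assumed: sum utilities)."""
    last = max(RANK[z] for z in gamma)
    merged = {}
    for trans in db.values():
        if not set(gamma).issubset(trans):
            continue  # assumption: only transactions containing gamma are projected
        # Keep the items of Ex(gamma), stored as utilities (quantity x profit).
        proj = {i: q * PROFIT[i] for i, q in trans.items() if RANK[i] > last}
        if not proj:
            continue  # empty projections are not part of the projected database
        key = tuple(proj)  # identical projections share the same item sequence
        bucket = merged.setdefault(key, dict.fromkeys(key, 0))
        for item, util in proj.items():
            bucket[item] += util
    return merged


for items, utils in project_and_merge({"B"}, REVISED).items():
    print(items, utils)
# ('D', 'A') {'D': 7, 'A': 12}
# ('E', 'A') {'E': 6, 'A': 12}
# ('D', 'E', 'A') {'D': 28, 'E': 14, 'A': 40}
```

A full miner would also carry the utility of the prefix $\gamma$ along with each merged transaction; that bookkeeping is omitted here to keep the sketch short.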
Definition 2.14. The sub-tree utility of an itemset $\gamma$ and an item $x$ that can extend $\gamma$, denoted by $subU(\gamma, x)$, is as follows:

$$subU(\gamma, x) = \sum_{(\gamma \cup \{x\}) \subseteq T} \left[ u(\gamma, T) + u(x, T) + \sum_{i_j \in T \,\wedge\, i_j \in Ex(\gamma \cup \{x\})} u(i_j, T) \right]. \tag{10}$$
This sub-tree utility is one of the pruning strategies used to reduce the search space. If subU (γ , x ) <
TU × δ , then the itemset γ ∪ {x } and the following nodes (itemsets) can be pruned. As shown in
the illustrative example in Figure 1, if subU (∅, D) is less than TU × δ , then the following itemsets
{D, E}, {D, A}, and {D, E, A} can be pruned.
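As a hedged illustration of Equation (10), the following Python sketch evaluates the sub-tree utility on the revised database of Table 5. The name `sub_tree_utility` is hypothetical, and the sketch assumes that $x$ follows every item of $\gamma$ in the total order.

```python
PROFIT = {"A": 4, "B": 3, "C": 10, "D": 7, "E": 2}
ORDER = ["C", "B", "D", "E", "A"]                  # increasing TWU
RANK = {item: pos for pos, item in enumerate(ORDER)}

# Table 5: revised transactions (item -> purchase quantity).
REVISED = {
    "T1": {"B": 3, "D": 1, "A": 3},
    "T2": {"C": 1, "B": 7, "E": 3, "A": 3},
    "T3": {"C": 3, "A": 4},
    "T4": {"C": 1, "D": 4, "E": 10},
    "T5": {"B": 4, "D": 2, "E": 6, "A": 4},
    "T6": {"B": 2, "D": 2, "E": 1, "A": 6},
}


def sub_tree_utility(gamma, x, db=REVISED):
    """Equation (10): u(gamma) + u(x) + utility of Ex(gamma U {x}) per transaction."""
    itemset = set(gamma) | {x}
    last = max(RANK[i] for i in itemset)
    total = 0
    for trans in db.values():
        if itemset.issubset(trans):
            total += sum(q * PROFIT[i] for i, q in trans.items()
                         if i in itemset or RANK[i] > last)
    return total


threshold = 282 * 0.17                     # TU x delta = 47.94
print(sub_tree_utility(set(), "D"))        # 149 = 19 + 48 + 42 + 40 (T1, T4, T5, T6)
print(sub_tree_utility(set(), "D") < threshold)  # False
```

Since 149 is not below the threshold of 47.94, the sub-tree rooted at {D} is not pruned in this example.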
Definition 2.15. The local utility of an itemset $\gamma$ and an item $x$, denoted by $locU(\gamma, x)$, is as follows:

$$locU(\gamma, x) = \sum_{(\gamma \cup \{x\}) \subseteq T} \left[ u(\gamma, T) + rem(\gamma, T) \right]. \tag{11}$$


The local utility is never smaller than the sub-tree utility for an itemset. However, it provides another pruning strategy that works in a more aggressive way. If $locU(\gamma, x) < TU \times \delta$, then all of the branches following itemset $\gamma$ that contain item $x$ can be pruned. For example, if $locU(\{C\}, E)$ is less than $TU \times \delta$, then the nodes after itemsets {C, B, E}, {C, D, E}, and {C, E} do not need to be evaluated, as shown in the illustrative example in Figure 1.
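The local utility of Equation (11) can be evaluated in the same way. The Python sketch below is illustrative only; `local_utility` is a hypothetical name and the data are those of Table 5 and Table 3.

```python
PROFIT = {"A": 4, "B": 3, "C": 10, "D": 7, "E": 2}
ORDER = ["C", "B", "D", "E", "A"]                  # increasing TWU
RANK = {item: pos for pos, item in enumerate(ORDER)}

# Table 5: revised transactions (item -> purchase quantity).
REVISED = {
    "T1": {"B": 3, "D": 1, "A": 3},
    "T2": {"C": 1, "B": 7, "E": 3, "A": 3},
    "T3": {"C": 3, "A": 4},
    "T4": {"C": 1, "D": 4, "E": 10},
    "T5": {"B": 4, "D": 2, "E": 6, "A": 4},
    "T6": {"B": 2, "D": 2, "E": 1, "A": 6},
}


def local_utility(gamma, x, db=REVISED):
    """Equation (11): u(gamma, T) + rem(gamma, T) over transactions with gamma U {x}."""
    gamma = set(gamma)
    last = max((RANK[z] for z in gamma), default=-1)
    total = 0
    for trans in db.values():
        if (gamma | {x}).issubset(trans):
            u_gamma = sum(trans[i] * PROFIT[i] for i in gamma)
            rem = sum(q * PROFIT[i] for i, q in trans.items() if RANK[i] > last)
            total += u_gamma + rem
    return total


threshold = 282 * 0.17                 # TU x delta = 47.94
print(local_utility({"C"}, "E"))       # 107 = (10 + 39) from T2 + (10 + 48) from T4
print(local_utility({"C"}, "E") < threshold)  # False
```

Here $locU(\{C\}, E) = 107$ is above the threshold, so item E stays among the candidate extensions of {C} under the traditional local utility.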

3 PROPOSED ALGORITHM
This section describes the proposed algorithm i.e., HUIM with Pruning Strategies (HUI-PR) for
determining the set of HUIs. It consists of the following: (1) the construction of itemsets in a tree
structure, (2) the node selection rule, (3) the pruning strategies, and (4) the main algorithm to find
the set of HUIs. The proposed method is an extended approach for EFIM, which is the state-of-the-
art approach. We then propose a stricter upper bound for HUIs and enhance the abilities of the
pruning strategies. In addition, the proposed method also provides a multiple thread framework
to run in a parallel environment.
For the construction of itemsets in a tree structure, the high-transaction-weighted-utilization itemsets that contain a single item (1-HTWUIs) are prepared. The TWU of each item, which must be no less than the threshold, is used to find the 1-HTWUIs. This helps to reduce the number of unnecessary branches to traverse in the tree. The unpromising items are removed from each transaction by scanning the database. After the removal of unpromising items, any empty transactions are removed from the database and the items in each transaction are sorted based on the total ordering described in Definition 2.9. The node selection rule explains in detail how the nodes are traversed to find the itemsets. A new strict sub-tree is constructed recursively and the nodes are visited based on the node selection rule. The pruning of the nodes using strict local utility is then explained. These pruning rules help to avoid unnecessary traversal once an itemset is no longer feasible. Details are described below:

3.1 Construction of 1-HTWUIs Tree of Items


A tree-structured graph is constructed for the 1-HTWUIs. Only those items whose $TWU$ is no less than the given threshold are considered, based on the transaction weighted downward closure property [33], which reduces the search space at the beginning. These items are sorted in the tree in increasing order of $TWU$. For example, there are six items in a transactional database D with $TWU_A = 224$, $TWU_B = 177$, $TWU_C = 154$, $TWU_D = 186$, $TWU_E = 207$, and $TWU_F = 47$. Assuming a threshold ratio of $\delta = 0.17$, the threshold value is 47.94. Then, the item {F} will not be in the tree structure since its $TWU$ is less than the threshold. The items whose $TWU$ is no less than the threshold are arranged in the following order: {C}, {B}, {D}, {E}, and {A}, as shown in Figure 2.
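The selection and ordering of the 1-HTWUIs can be sketched in a few lines of Python. The function name `one_htwui_order` is hypothetical, and breaking TWU ties by item name is an assumption made only for determinism (no ties occur in this example).

```python
# TWU of each single item (Table 4) and the total utility of the database.
TWU = {"A": 224, "B": 177, "C": 154, "D": 186, "E": 207, "F": 47}
TOTAL_UTILITY = 282


def one_htwui_order(twu, total_utility, ratio):
    """Keep the items with TWU >= TU x ratio and sort them by increasing TWU."""
    threshold = total_utility * ratio
    promising = [item for item in twu if twu[item] >= threshold]
    # Assumption: ties in TWU are broken by item name.
    return sorted(promising, key=lambda item: (twu[item], item))


print(one_htwui_order(TWU, TOTAL_UTILITY, 0.17))  # ['C', 'B', 'D', 'E', 'A']
```

Item F is dropped because 47 is below 47.94, and the remaining order matches the first level of the tree in Figure 2.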

3.2 Strict Local Utility


In this section, a stricter upper bound than the previous local utility is proposed. It is called the strict local utility and is denoted as $slocU$. First, to define strict local utility, a set of items named the following items ($fi$) is defined. Assuming there is an itemset $\gamma$, the following items for itemset $\gamma$, $fi(\gamma)$, are defined as follows:

$$fi(\gamma) = \{\, x \mid slocU(\gamma, x) \ge TU \times \delta \,\}. \tag{12}$$

The following items record the items that can still be attached to the current itemset to generate new candidate itemsets for evaluation. The definition of strict local utility ($slocU$) is given here. The concept of strict local utility is very similar to traditional local utility, but it prunes some of the overestimated utilities. It includes the remaining utility of the evaluated itemset $X$. A strict remaining utility $srem$ is described below:

$$srem(X, T) = \sum_{i_j \in fi(X) \cap T \,\wedge\, z \prec i_j\ \forall z \in X} u(i_j, T). \tag{13}$$

Fig. 2. 1-HTWUI tree.

Both remaining utility and strict remaining utility are used to estimate the potential utility of the itemsets that follow the current itemset in the search tree. However, if an item does not exist in $fi(X)$, it cannot contribute any utility to the following candidate itemsets. Strict remaining utility resolves this problem by removing the overestimated utility from the remaining utility. Therefore, it can obtain a smaller upper bound for HUI mining and effectively reduce the number of candidate itemsets. As an illustrative example, consider Table 5 and assume that the current itemset is {B, D}. It is easy to calculate the remaining utility of itemset {B, D} from $T_5$ and $T_6$. In $T_5$, there are $6 \times 2$ from E and $4 \times 4$ from A. In the same way, there are $1 \times 2$ from E and $6 \times 4$ from A in $T_6$. However, if $fi(\{B, D\})$ is {A}, the process will not consider the item E combined with the itemset {B, D} as a new candidate itemset. Therefore, the utility from E does not need to be accumulated in the remaining utility. Thus, the strict remaining utility of itemset {B, D} is $4 \times 4 = 16$ in $T_5$ and $6 \times 4 = 24$ in $T_6$. The strict remaining utility is then used to define a new upper bound, which is called strict local utility and is defined below:

$$slocU(\gamma, x) = \sum_{(\gamma \cup \{x\}) \subseteq T} \begin{cases} 0, & \nexists \text{ item } y \in T \text{ such that } \forall i \in \gamma,\ i \prec y \text{ and } y \prec x \\ u(\gamma, T) + srem(\gamma, T), & \text{otherwise} \end{cases} \tag{14}$$

The concept of strict local utility is very similar to local utility. There are two improvements that further reduce the estimated utility for candidate HUIs. The first is using strict remaining utility to replace remaining utility in each related transaction. The second is that if a transaction contains no item between $\gamma$ and $x$ in the increasing order of $TWU$, then this transaction can be ignored. A simple example explains the second improvement. Assume $\gamma$ is {C, B} and $x$ is E in Equation (14). $slocU(\gamma, x)$ is used to decide whether to evaluate the itemset {C, B, D} and the following itemsets. Consider $T_2$ in Table 5, which is {C, B, E, A} and includes the itemset {C, B} and the item E. However, it cannot provide any utility to the itemset {C, B, D} and its following itemsets, because $T_2$ contains no item between the itemset {C, B} and E. Strict local utility therefore ignores transaction $T_2$ directly. The general theorem and its proof are described below:
Theorem 3.1. Assume that the increasing order O of TWU is $i_1 \prec i_2 \prec \cdots \prec i_n$ and
that $slocU(\gamma, x)$ is less than the utility threshold. If $\exists z \in fi(\gamma)$ such that
$I = \{i_{p_1}, i_{p_2}, \ldots, i_{p_n}, z, x\}$, where $\gamma = \{i_{p_1}, i_{p_2}, \ldots, i_{p_n}\}$
and the items in $I$ are sorted by O, then $I$ and its following itemsets are definitely not
high-utility itemsets.

ACM Transactions on Knowledge Discovery from Data, Vol. 13, No. 6, Article 58. Publication date: October 2019.

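Proof.
\[
\begin{aligned}
u(I) &= \sum_{I \subseteq T} u(I, T)
      = \sum_{\gamma \cup \{z\} \cup \{x\} \subseteq T} \bigl( u(\gamma, T) + u(z, T) + u(x, T) \bigr) \\
     &\le \sum_{i_{p_n} \prec y \prec x} \; \sum_{\gamma \cup \{y\} \cup \{x\} \subseteq T} \bigl( u(\gamma, T) + u(y, T) + u(x, T) \bigr) \\
     &= \begin{cases}
          0, & \nexists \text{ item } y \text{ such that } \forall i \in \gamma,\ i \prec y \text{ and } y \prec x, \\
          \sum_{(\gamma \cup \{x\}) \subseteq T} \left[ u(\gamma, T) + srem(\gamma, T) \right], & \text{otherwise}
        \end{cases} \\
     &= slocU(\gamma, x).
\end{aligned}
\]
Therefore, if slocU(γ, x) is less than the threshold, u(I) is also less than the threshold. It should
be noted that the expression in the second line is also an upper bound on the utility of the
itemsets that follow I. □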
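
The strict remaining utility and the transaction-skipping rule of Equation (14) can be sketched as follows. This is a minimal Python illustration rather than the authors' implementation: the function names `srem` and `slocu` and the dictionary-based transaction layout are assumptions made here, the "no item between γ and x" condition is applied per transaction as the prose describes, and the item utilities are copied from the pruned database listed in Section 4.

```python
# Items in increasing TWU order, C < B < D < E < A, as stated in the text.
ORDER = ["C", "B", "D", "E", "A"]
RANK = {item: pos for pos, item in enumerate(ORDER)}

# Item utilities per transaction (item -> utility), from the pruned database of Section 4.
DB = [
    {"B": 12, "D": 14, "E": 12, "A": 16},  # T5 in the text
    {"B": 6, "D": 14, "E": 2, "A": 24},    # T6 in the text
    {"C": 10, "B": 21, "E": 6, "A": 12},   # T2 in the text
    {"B": 9, "D": 7, "A": 12},
    {"C": 30, "A": 16},
    {"C": 10, "D": 28, "E": 20},
]


def srem(itemset, transaction, following):
    """Utility of the items after `itemset`, counting only items in `following`."""
    last = max(RANK[i] for i in itemset)
    return sum(u for item, u in transaction.items()
               if RANK[item] > last and item in following)


def slocu(itemset, x, db, following):
    """Strict local utility of `itemset` with item `x` (hypothetical helper)."""
    last = max(RANK[i] for i in itemset)
    total = 0
    for t in db:
        if x not in t or any(i not in t for i in itemset):
            continue  # the transaction does not contain itemset ∪ {x}
        if not any(last < RANK[y] < RANK[x] for y in t):
            continue  # no item between the itemset and x: ignore the transaction
        total += sum(t[i] for i in itemset) + srem(itemset, t, following)
    return total
```

With these definitions, `srem({"B", "D"}, DB[0], {"A"})` and `srem({"B", "D"}, DB[1], {"A"})` evaluate to 16 and 24, the two values derived above, and `slocu(["C", "B"], "E", DB, set(ORDER))` is 0 because T2 is skipped.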

3.3 Strict Sub-tree Utility


In EFIM, the original sub-tree utility accumulates the utilities of the itemset γ, the item x, and
the extension of γ ∪ {x}, Ex(γ ∪ {x}), in each related transaction. The extension of an itemset is
defined as the set of items that follow the itemset in the pre-defined total order. Moreover,
strict local utility prevents the proposed process from visiting some unpromising itemsets.
Sub-tree utility therefore accumulates utilities from some unnecessary items. An illustrative
example is given from Table 5. Assume that the current itemset is {C}, the increasing order of TWU
is C ≺ B ≺ D ≺ E ≺ A, and fi({C}) is {B, D, A}. Consider {C} ∪ {B} = {C, B}; only T2 includes
{C, B} in Table 5. The value of subU({C}, B) can be calculated as
1 × 10 + 7 × 3 + 3 × 2 + 3 × 4 = 49. However, the process will not estimate the itemsets that
include both C and E in the following steps, because E ∉ fi({C}). Therefore, accumulating utility
from E is meaningless. Strict sub-tree utility prunes the utilities of the items that are not
included in the following items. Thus, the strict sub-tree utility value of {C} ∪ {B}, which is
denoted as ssubU({C}, B), is 1 × 10 + 7 × 3 + 3 × 4 = 43. This is a tighter upper bound than
sub-tree utility. The general theorem and its proof are described below:

Theorem 3.2. Assume that the increasing order O of TWU is $i_1 \prec i_2 \prec \cdots \prec i_n$ and
that ssubU(γ, x) is less than the utility threshold. Then, the itemset I = γ ∪ {x} and its following
itemsets by order O are all not high-utility itemsets.

Proof. By Theorem 3.1, an item that does not exist in fi(γ) will not be combined into any
candidate itemset. ssubU(γ, x) accumulates the utilities of γ, of x, and of the items that come
after x in order O and exist in fi(γ). Therefore, ssubU(γ, x) is an upper bound of u(I). In other
words, if ssubU(γ, x) is less than the threshold, I and its following itemsets are not high-utility
itemsets. □
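
The difference between sub-tree utility and strict sub-tree utility in the {C} ∪ {B} example can be reproduced with a short sketch. The helper name `sub_tree_utility`, its optional `following` argument, and the dictionary layout of T2 are assumptions made for illustration; the utilities of T2 and the set fi({C}) = {B, D, A} are the ones used in the example above.

```python
ORDER = ["C", "B", "D", "E", "A"]
RANK = {item: pos for pos, item in enumerate(ORDER)}

# Transaction T2 from the example, as item -> utility.
T2 = {"C": 10, "B": 21, "E": 6, "A": 12}


def sub_tree_utility(itemset, x, db, following=None):
    """Sum u(itemset) + u(x) + utilities of the items after x, over matching transactions.

    following=None counts every later item (ordinary sub-tree utility);
    otherwise only later items contained in `following` count (strict variant).
    """
    total = 0
    for t in db:
        if x not in t or any(i not in t for i in itemset):
            continue
        total += sum(t[i] for i in itemset) + t[x]
        total += sum(u for item, u in t.items()
                     if RANK[item] > RANK[x]
                     and (following is None or item in following))
    return total


sub_u = sub_tree_utility({"C"}, "B", [T2])                               # 10 + 21 + 6 + 12
ssub_u = sub_tree_utility({"C"}, "B", [T2], following={"B", "D", "A"})   # E is excluded
```

Here `sub_u` is 49 and `ssub_u` is 43, matching the two values computed in the example.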

3.4 Pseudo-Code for HUI-PR


In this section, the detailed pseudo-code of HUI-PR is described in ALGORITHM 1. The main
process of HUI-PR is to use the strict sub-tree utility and the strict local utility to estimate the
candidate itemsets. In the beginning, the proposed process scans dataset D twice and calculates
the original local utility and sub-tree utility. Because the local utility of ∅ ∪ {x} is the TWU
of item x, the proposed algorithm does not need to calculate the TWU separately. The proposed
process calls the RecursiveSearch function to perform a depth-first search of the search tree,
and this function also scans the database twice. The proposed strict sub-tree utility and strict
local utility replace the original sub-tree utility and local utility to estimate the candidate
itemsets. Because their values are smaller than those of the previous upper bounds, the proposed
algorithm can effectively reduce the number of candidate itemsets. The RecursiveSearch function
reveals all the HUIs that are extended from the input itemset. Therefore, the main algorithm
obtains all the HUIs when the recursive process terminates. Details of the algorithm are given below.


ALGORITHM 1: HUI-PR Algorithm

Input: A transaction database, D; an item list, I; and a user-specified minimum high-utility
threshold, minutil.
Output: The set of high-utility itemsets.
1 Calculate locU(∅, i) ∀ items i ∈ I by scanning D;
2 The following items for ∅, fi(∅) = {i | i ∈ I ∧ locU(∅, i) ≥ minutil};
3 Set ≺ be the total order of TWU ascending values on fi(∅);
4 Scan D to remove each item i ∉ fi(∅) from D and delete empty transactions;
5 Sort each transaction in D according to ≺;
6 Calculate the sub-tree utility subU(∅, i) for each item i ∈ fi(∅) by scanning D;
7 The next items for ∅, ni(∅) = {i | i ∈ fi(∅) ∧ subU(∅, i) ≥ minutil};
8 return RecursiveSearch(∅, D, ni(∅), fi(∅), minutil);
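
Lines 1 to 7 of ALGORITHM 1 can be sketched in Python as below. This is an illustrative sketch under assumptions, not the authors' Java code: the function name `initial_stage` and the data layout are hypothetical, and the input is the already-pruned item-utility listing from Section 4. On that listing the sub-tree utilities match the values reported in Section 4, while the TWU values of C and A come out as 153 and 223, one lower than the values listed there, presumably because the listing no longer contains the removed items.

```python
from collections import defaultdict

# Item utilities per transaction (item -> utility), from the pruned database of Section 4.
DB = [
    {"B": 12, "D": 14, "E": 12, "A": 16},
    {"B": 6, "D": 14, "E": 2, "A": 24},
    {"C": 10, "B": 21, "E": 6, "A": 12},
    {"B": 9, "D": 7, "A": 12},
    {"C": 30, "A": 16},
    {"C": 10, "D": 28, "E": 20},
]


def initial_stage(db, minutil):
    # Line 1: locU(∅, i) equals TWU(i), so add each transaction utility to all of its items.
    loc_u = defaultdict(int)
    for t in db:
        tu = sum(t.values())
        for item in t:
            loc_u[item] += tu
    # Line 2: following items of the empty itemset.
    following = {i for i, v in loc_u.items() if v >= minutil}
    # Line 3: total order by ascending TWU.
    order = sorted(following, key=lambda i: loc_u[i])
    rank = {item: pos for pos, item in enumerate(order)}
    # Lines 4-5: drop other items, delete empty transactions, sort each transaction.
    pruned = []
    for t in db:
        kept = sorted(((i, u) for i, u in t.items() if i in following),
                      key=lambda pair: rank[pair[0]])
        if kept:
            pruned.append(kept)
    # Line 6: subU(∅, i) adds u(i) plus the utilities of the items after i.
    sub_u = defaultdict(int)
    for t in pruned:
        suffix = sum(u for _, u in t)
        for item, u in t:
            sub_u[item] += suffix
            suffix -= u
    # Line 7: next items of the empty itemset.
    next_items = [i for i in order if sub_u[i] >= minutil]
    return order, dict(loc_u), dict(sub_u), next_items


order, loc_u, sub_u, next_items = initial_stage(DB, minutil=48)
```

With minutil = 48, `order` is C, B, D, E, A and every item survives in `next_items`, in line with fi(∅) and ni(∅) in Section 4.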

ALGORITHM 2: RecursiveSearch
Input: An itemset α; the projected database of α, α–D; the next items of α, ni(α); the following
items of α, fi(α); and the minimal threshold, minutil.
Output: The set of high-utility itemsets that are extended from α.
1 Set HUIα = ∅;
2 for each item i ∈ ni(α) do
3     β = α ∪ {i};
4     Scan α–D to calculate u(β) and create the projected database of β, β–D;
5     if u(β) ≥ minutil then
6         HUIα = HUIα ∪ {β};
7     end
8     if β–D ≠ ∅ then
9         Calculate ssubU(β, z) and slocU(β, z) ∀ item z ∈ fi(α) by scanning β–D;
10        ni(β) = {z ∈ fi(α) | ssubU(β, z) ≥ minutil};
11        fi(β) = {z ∈ fi(α) | slocU(β, z) ≥ minutil};
12        HUIα = HUIα ∪ RecursiveSearch(β, β–D, ni(β), fi(β), minutil);
13    end
14 end
15 return HUIα;
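
The control flow of ALGORITHM 2 can be illustrated with a compact, runnable sketch. To keep it short and verifiable, it swaps in the ordinary local utility and sub-tree utility (in the style of EFIM) instead of the strict variants, and it recomputes both bounds at the start of each call rather than receiving ni(α) and fi(α) as arguments. The names `bounds`, `recursive_search`, and `mine`, the tuple-based projected database, and the threshold of 60 in the usage line are all assumptions chosen for illustration (Section 4 uses 48).

```python
from collections import defaultdict

ORDER = ["C", "B", "D", "E", "A"]          # increasing TWU order from the text
RANK = {item: pos for pos, item in enumerate(ORDER)}

DB = [                                      # pruned database of Section 4 (item -> utility)
    {"B": 12, "D": 14, "E": 12, "A": 16},
    {"B": 6, "D": 14, "E": 2, "A": 24},
    {"C": 10, "B": 21, "E": 6, "A": 12},
    {"B": 9, "D": 7, "A": 12},
    {"C": 30, "A": 16},
    {"C": 10, "D": 28, "E": 20},
]


def bounds(proj_db):
    """Local utility and sub-tree utility of every item in a projected database."""
    loc_u, sub_u = defaultdict(int), defaultdict(int)
    for prefix, tail in proj_db:
        suffix = sum(u for _, u in tail)
        total = prefix + suffix
        for item, u in tail:
            loc_u[item] += total            # prefix + all remaining items
            sub_u[item] += prefix + suffix  # prefix + item + items after it
            suffix -= u
    return loc_u, sub_u


def recursive_search(alpha, proj_db, minutil):
    """Return {itemset: utility} for all high-utility extensions of alpha."""
    loc_u, sub_u = bounds(proj_db)
    following = {z for z, v in loc_u.items() if v >= minutil}
    next_items = [z for z in ORDER if z in following and sub_u[z] >= minutil]
    found = {}
    for i in next_items:
        beta = alpha + (i,)
        u_beta, beta_db = 0, []
        for prefix, tail in proj_db:        # project the database on item i
            for pos, (item, u) in enumerate(tail):
                if item == i:
                    u_beta += prefix + u
                    kept = [(z, uz) for z, uz in tail[pos + 1:] if z in following]
                    beta_db.append((prefix + u, kept))
                    break
        if u_beta >= minutil:
            found[beta] = u_beta
        found.update(recursive_search(beta, beta_db, minutil))
    return found


def mine(db, minutil):
    root = [(0, sorted(t.items(), key=lambda pair: RANK[pair[0]])) for t in db]
    return recursive_search((), root, minutil)


huis = mine(DB, minutil=60)                 # 60 is an illustrative threshold
```

Because both bounds never underestimate the utility of an extension, the sketch returns the same itemsets as an exhaustive enumeration; the strict bounds of this section would tighten the pruning further.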

3.5 Multiple Threads Process for HUI-PR


To further handle the large-scale problem of big data, a multiple-threads framework is also
implemented and discussed. The proposed algorithm can be applied in a parallel framework or a
cloud computing environment to handle a large-scale dataset. The designed multiple threads are
used each time the database is scanned, and the process can also be set up in a parallel
environment. The multiple-threads process can be applied at Lines 1 to 6 in ALGORITHM 1 and
Lines 4 to 9 in ALGORITHM 2. The pseudo-code is shown in ALGORITHM 3.
In ALGORITHM 3, all threads accumulate the value of local utility for each item. Therefore,
locU is stored in a thread-safe structure, which guarantees the correctness of locU. The function
round is used to divide the database among the threads. In line 10, t is the jth transaction in D,
and locU is increased by the TU of t in line 12. In fact, locU(∅, x) = TWU(x). Based on this
multiple-threads framework, a large-scale database can be easily divided among several threads,
and the parallel mining concept can be easily implemented for the big-data issue of HUIM.
Experiments will then be analyzed and discussed in Section 5.


ALGORITHM 3: Multiple Threads Process for Calculating Local Utilities

Input: A transaction database, D; the number of the total threads, n; an index for the current
thread, i (0 to n − 1).
Output: Accumulate the local utility locU for each item in a thread-safe structure.
1 Set size to the size of D;
2 interval = size/n;
3 startTID = round(interval × i);
4 if i ≠ n − 1 then
5     stopTID = round(interval × (i + 1)) − 1;
6 else
7     stopTID = size − 1;
8 end
9 for (j = startTID; j ≤ stopTID; ++j) do
10    t = D.get(j);
11    for each item m in t do
12        locU(m) = locU(m) + transU(t);
13    end
14 end
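
A minimal Python sketch of ALGORITHM 3 is given below. The names `tid_range` and `parallel_local_utilities` are hypothetical, and one design choice differs from the pseudo-code and is an assumption of this sketch: each thread first accumulates into a private counter and merges it into the shared structure once, under a lock, instead of updating the shared structure for every item. In CPython the global interpreter lock limits the speed-up of such CPU-bound threads, so the sketch illustrates the partitioning and the thread-safe accumulation rather than the performance.

```python
import threading
from collections import Counter

# Item utilities per transaction (item -> utility), from the pruned database of Section 4.
DB = [
    {"B": 12, "D": 14, "E": 12, "A": 16},
    {"B": 6, "D": 14, "E": 2, "A": 24},
    {"C": 10, "B": 21, "E": 6, "A": 12},
    {"B": 9, "D": 7, "A": 12},
    {"C": 30, "A": 16},
    {"C": 10, "D": 28, "E": 20},
]


def tid_range(size, n, i):
    """Lines 1-8: first and last transaction index handled by thread i."""
    interval = size / n
    start = round(interval * i)
    stop = round(interval * (i + 1)) - 1 if i != n - 1 else size - 1
    return start, stop


def parallel_local_utilities(db, n):
    """Accumulate locU(∅, item) = TWU(item) with n threads."""
    loc_u = Counter()
    lock = threading.Lock()

    def worker(i):
        start, stop = tid_range(len(db), n, i)
        partial = Counter()
        for j in range(start, stop + 1):      # lines 9-14
            t = db[j]
            tu = sum(t.values())              # transaction utility of t
            for item in t:
                partial[item] += tu
        with lock:                            # thread-safe merge into the shared result
            loc_u.update(partial)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return dict(loc_u)


loc_u = parallel_local_utilities(DB, n=2)
```

Adjacent threads share the same rounded boundary expression, so the index ranges cover every transaction exactly once regardless of n.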

4 AN ILLUSTRATED EXAMPLE
In this section, an example for revealing HUIs from the quantitative database D in Table 2 and the
profit table in Table 3 is given. We assume the user-specified threshold minutil = 48. The process
is described below.
(1) Input file
In the beginning, the proposed HUI-PR performs a pre-processing step to load the input
database into memory and obtain a profile of this database. Then, the process can obtain
the table of transaction-weighted utilities in Table 4 (each value is also the local utility
of ∅ with the corresponding item). Next, the increasing order of TWU (without the unpromising
items), C ≺ B ≺ D ≺ E ≺ A, and the database in Table 5, in which the unpromising items are
removed and the remaining items are arranged in increasing order of TWU, can be obtained.
(2) Initial stage
(ALGORITHM 1 lines 1–7)
locU(∅, C) = 154, locU(∅, B) = 177, locU(∅, D) = 186, locU(∅, E) = 207, locU(∅, A) = 224.
subU(∅, C) = 153, subU(∅, B) = 167, subU(∅, D) = 149, subU(∅, E) = 92, subU(∅, A) = 80.
fi(∅) = {C, B, D, E, A}, ni(∅) = {C, B, D, E, A}.
The projected database of ∅, ∅–D (item : utility), is as follows:
B : 12, D : 14, E : 12, A : 16
B : 6, D : 14, E : 2, A : 24
C : 10, B : 21, E : 6, A : 12
B : 9, D : 7, A : 12
C : 30, A : 16
C : 10, D : 28, E : 20
(ALGORITHM 1 line 8)
Then, perform RecursiveSearch(∅, ∅–D, ni(∅), fi(∅), minutil) to search for HUIs.
(3) Estimate ∅ ∪ {C} (perform RecursiveSearch(∅, ∅–D, ni(∅), fi(∅), minutil); C is the first
item in ni(∅); ALGORITHM 2 line 2)


(ALGORITHM 2 line 4)
The projected database of {C}, {C}-D (item : utility) is as follows:
B : 21, E : 6, A : 12
A : 16
D : 28, E : 20
(ALGORITHM 2 lines 5–7)
u(C) = 48 is not an HUI.
slocU ({C}, B) = 0, slocU ({C}, D) = 0,
slocU ({C}, E) = 107, slocU ({C}, A) = 49,
ssubU ({C}, B) = 49, ssubU ({C}, D) = 58,
ssubU ({C}, E) = 58, ssubU ({C}, A) = 68.
(ALGORITHM 2 lines 10–11)
f i ({C}) = {E, A}, ni ({C}) = {B, D, E, A}.
(4) Estimate {C} ∪ {B} (ALGORITHM 2 line 2)
(ALGORITHM 2 line 4)
The projected database of {C, B}, {C, B}-D (item : utility) is as follows:
E : 6, A : 12
(ALGORITHM 2 lines 5–7)
u(C, B) = 31 is not an HUI.
slocU ({C, B}, E) = 0, slocU ({C, B}, A) = 49,
ssubU ({C, B}, E) = 49, ssubU ({C, B}, A) = 43.
(ALGORITHM 2 lines 10–11)
f i ({C, B}) = {A}, ni ({C, B}) = {E}.
(5) Estimate {C, B} ∪ {E} (ALGORITHM 2 line 2)
(ALGORITHM 2 line 4)
The projected database of {C, B, E}, {C, B, E}-D (item : utility) is:
A : 12
(ALGORITHM 2 lines 5–7)
u(C, B, E) = 37 is not an HUI.
slocU ({C, B, E}, A) = 0, ssubU ({C, B, E}, A) = 49.
(ALGORITHM 2 lines 10–11)
f i ({C, B, E}) = ∅, ni ({C, B, E}) = {A}.
(6) Estimate {C, B, E} ∪ {A} (ALGORITHM 2 line 2)
(ALGORITHM 2 lines 5–7)
u(C, B, E, A) = 49 is an HUI.
HUIs ← {C, B, E, A}.
(7) Estimate {C} ∪ {D} (ALGORITHM 2 line 2)
(ALGORITHM 2 line 4)
The projected database of {C, D}, {C, D}-D (item : utility) is:
E : 20
(ALGORITHM 2 lines 5–7)
u(C, D) = 38 is not an HUI.
slocU ({C, D}, E) = 0, slocU ({C, D}, A) = 0,
ssubU ({C, D}, E) = 58, ssubU ({C, D}, A) = 0.
(ALGORITHM 2 lines 10–11)
fi({C, D}) = ∅, ni({C, D}) = {E}.
(8) Estimate {C, D} ∪ {E} (ALGORITHM 2 line 2)
(ALGORITHM 2 lines 5–7)
u(C, D, E) = 58 is an HUI.
HUIs ← {C, D, E}.

Table 6. Dataset Characteristics

Dataset     #|D|      #|I|     AvgLen   MaxLen   Type
Chess       3,196     76       37       37       Dense
Mushroom    8,124     120      23       23       Dense
Connect     67,557    129      43       43       Dense
Accidents   100,000   469      34       46       Dense
Footmart    4,141     1,559    4        11       Sparse
Retail      88,162    16,470   10       76       Sparse

Fig. 3. The runtimes on the six real datasets.

(9) Estimate {C} ∪ {E} (ALGORITHM 2 line 2)
(ALGORITHM 2 line 4)
The projected database of {C, E}, {C, E}-D (item : utility) is:
A : 12
(ALGORITHM 2 line 5–7)
u(C, E) = 46 is not an HUI.
slocU ({C, E}, A) = 0, ssubU ({C, E}, A) = 28.
(ALGORITHM 2 lines 10–11)
f i ({C, E}) = ∅, ni ({C, E}) = ∅.
(10) Estimate {C} ∪ {A} (ALGORITHM 2 line 2)
(ALGORITHM 2 lines 5–7)
u(C, A) = 68 is an HUI.
HUIs ← {C, A}.
At this point, the process has finished all branches from ∅ to C in the search tree. The following
process performs the same steps to find all HUIs in D.
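
The utilities of the itemsets visited in this walk-through can be recomputed directly from the projected database listed in step (2). The helper name `utility` and the dictionary layout are assumptions of this small sketch; the item utilities are copied from that listing.

```python
# Projected database of the empty itemset from step (2), as item -> utility.
DB = [
    {"B": 12, "D": 14, "E": 12, "A": 16},
    {"B": 6, "D": 14, "E": 2, "A": 24},
    {"C": 10, "B": 21, "E": 6, "A": 12},
    {"B": 9, "D": 7, "A": 12},
    {"C": 30, "A": 16},
    {"C": 10, "D": 28, "E": 20},
]


def utility(itemset, db):
    """Sum the utilities of `itemset` over the transactions that contain all of its items."""
    return sum(sum(t[i] for i in itemset)
               for t in db if all(i in t for i in itemset))


for itemset in (("C", "B", "E", "A"), ("C", "D", "E"), ("C", "A")):
    print(itemset, utility(itemset, DB))
```

The three printed utilities are 49, 58, and 68, which agree with steps (6), (8), and (10).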

5 EXPERIMENTAL RESULTS
Experiments for the proposed HUI-PR, the state-of-the-art D2HUP [31], and EFIM [56] algorithms
were performed to find high-utility itemsets from several datasets. To compare the algorithms,
the experimental results include the runtime, the number of candidate itemsets, the number of
visited transactions, and the number of upper-bound calculations. The experiments were
implemented in Java and executed on a personal computer with 8 GB of 1,867 MHz DDR3 memory,
a 2.7 GHz Intel Core i5 CPU, and macOS High Sierra.

Table 7. The Number of Candidates on the Six Datasets

Dataset     Threshold   EFIM     D2HUP     HUI-PR   HUI-PR∗
Chess       0.25        1,390    11,279    1,390    1,390
Chess       0.255       966      8,844     965      964
Chess       0.26        704      7,027     703      701
Chess       0.265       518      5,610     518      515
Chess       0.27        392      4,520     392      390
Mushroom    0.14        395      1,819     395      395
Mushroom    0.1425      296      1,231     296      296
Mushroom    0.145       213      1,055     213      213
Mushroom    0.1475      148      1,044     148      148
Mushroom    0.15        90       984       89       88
Connect     0.289       3,026    32,334    3,026    3,021
Connect     0.291       2,378    27,463    2,378    2,375
Connect     0.293       1,889    23,223    1,889    1,889
Connect     0.295       1,535    19,925    1,535    1,533
Connect     0.297       1,307    17,507    1,307    1,305
Accidents   0.131       1,387    7,009     1,386    1,386
Accidents   0.134       1,096    5,969     1,095    1,095
Accidents   0.137       868      5,073     868      868
Accidents   0.14        670      4,274     670      670
Accidents   0.143       548      3,691     548      548
Retail      0.003       425      2,388     425      425
Retail      0.004       174      1,198     174      174
Retail      0.005       95       819       95       95
Retail      0.006       59       576       59       59
Retail      0.007       47       419       47       47
Footmart    0.0011      1,542    5,185     1,542    1,542
Footmart    0.0012      1,524    2,945     1,524    1,524
Footmart    0.0013      1,496    1,980     1,495    1,495
Footmart    0.0014      1,455    1,671     1,455    1,455
Footmart    0.0015      1,383    1,580     1,383    1,383

The real-world datasets [13] were used in the experiments to compare the designed HUI-PR
with the D2HUP and EFIM algorithms. Table 6 shows the characteristics of the datasets, where
#|D|, #|I|, AvgLen, MaxLen, and Type represent the total number of transactions, the number of
distinct items, the average size of transactions, the maximum size of transactions, and the type
of dataset, respectively. For each dataset, the experiment was conducted 100 times and the
average was taken.

5.1 Comparison of Computation Time


In this part, the proposed algorithm (HUI-PR) is compared with the D2HUP [31] and EFIM [56]
algorithms using six real datasets (chess, mushroom, connect, accidents, retail, and footmart)
[13]. Experiments were conducted to show the effectiveness of HUI-PR on the real databases and
of the approach taken to improve performance. The tighter upper bounds proposed in HUI-PR help
to improve the computation time significantly for the datasets with a large number of
transactions. HUI-PR is able to reduce the number of candidate itemsets, the number of visited
transactions, and the number of upper-bound calculations (the results are shown in the following
section). HUI-PR also reduces the runtime of the local utility calculation by ignoring some
transactions. However, this also increases the complexity of the newly proposed upper bounds.
In the experimental results, two different versions of HUI-PR were evaluated. HUI-PR is the
proposed algorithm without strict remaining utility, and HUI-PR∗ applies strict remaining
utility to obtain new strict sub-tree utilities.


Table 8. The Number of Visited Transactions on the Six Real Datasets

Dataset     Threshold   EFIM         HUI-PR       HUI-PR∗
Chess       0.25        424,179      424,179      424,179
Chess       0.255       390,178      390,070      389,681
Chess       0.26        353,896      353,794      353,339
Chess       0.265       315,785      315,785      315,587
Chess       0.27        290,950      290,950      290,818
Mushroom    0.14        385,009      385,009      385,009
Mushroom    0.1425      383,678      383,678      383,678
Mushroom    0.145       382,946      382,946      382,946
Mushroom    0.1475      364,647      364,647      364,647
Mushroom    0.15        347,977      347,973      347,956
Connect     0.289       2,348,342    2,348,342    2,348,270
Connect     0.291       2,340,646    2,340,646    2,340,610
Connect     0.293       2,330,537    2,330,537    2,330,537
Connect     0.295       2,322,990    2,322,990    2,322,960
Connect     0.297       2,318,502    2,318,502    2,318,472
Accidents   0.131       9,896,963    9,870,560    9,870,560
Accidents   0.134       9,559,676    9,533,273    9,533,273
Accidents   0.137       9,132,652    9,132,652    9,132,652
Accidents   0.14        8,749,291    8,749,291    8,749,291
Accidents   0.143       8,021,358    8,021,358    8,021,358
Retail      0.003       28,100,820   28,100,820   28,100,820
Retail      0.004       9,953,459    9,953,459    9,953,459
Retail      0.005       4,894,110    4,894,110    4,894,108
Retail      0.006       2,680,304    2,680,304    2,680,304
Retail      0.007       1,999,318    1,999,106    1,999,105
Footmart    0.0011      33,345,469   33,345,469   33,345,469
Footmart    0.0012      32,957,227   32,957,227   32,957,227
Footmart    0.0013      32,331,571   32,331,571   32,331,571
Footmart    0.0014      31,467,202   31,467,202   31,467,202
Footmart    0.0015      29,913,698   29,913,698   29,913,698

In Figure 3, HUI-PR showed a much better performance than the EFIM and D2HUP algorithms.
This shows that the proposed improvements were effective in reducing the search time for
mining HUIs in a dataset. Generally, D2HUP performed better on the sparse datasets. This is
because, in these kinds of datasets, the pruning strategies of EFIM and HUI-PR could not
prune candidate itemsets effectively and wasted a lot of time scanning the dataset. HUI-PR was
also more sensitive than EFIM to the subtle improvements. It is worth noting that the runtimes
of HUI-PR∗ were much longer than those of HUI-PR, and sometimes also longer than those of EFIM.
This is because strict remaining utility needs to perform an intersection operation to reduce
the utility value, which requires more time than it saves. However, it does obtain more
accurate upper bounds for HUIs. HUI-PR∗ is discussed further in the following sections.

5.2 Comparison of Quantitative Experimental Results


In this section, some quantitative experimental results are shown and discussed. These results
estimate the usage of memory and also evaluate the processing time of the compared algorithms.
Table 7 shows the number of candidates for EFIM, D2HUP, and the designed HUI-PR and HUI-PR∗.
Table 8 shows the number of visited transactions, and Table 9 shows the number of upper-bound
calculations. Due to different architectures, only EFIM and the designed HUI-PR and HUI-PR∗
are compared in terms of the number of visited transactions and the number of upper-bound
calculations.
Table 9. The Number of Upper-Bound Calculations on the Six Real Datasets

Dataset     Threshold   EFIM        HUI-PR      HUI-PR∗
Chess       0.25        17,963      16,178      16,165
Chess       0.255       14,533      13,263      13,241
Chess       0.26        11,925      10,993      10,951
Chess       0.265       9,746       9,068       9,007
Chess       0.27        8,028       7,499       7,464
Mushroom    0.14        3,840       3,481       3,476
Mushroom    0.1425      3,491       3,236       3,231
Mushroom    0.145       3,196       3,016       3,015
Mushroom    0.1475      2,721       2,609       2,606
Mushroom    0.15        2,312       2,256       2,240
Connect     0.289       41,479      37,837      37,723
Connect     0.291       35,633      32,691      32,585
Connect     0.293       30,387      27,989      27,905
Connect     0.295       26,193      24,178      24,078
Connect     0.297       23,117      21,447      21,377
Accidents   0.131       14,773      12,946      12,939
Accidents   0.134       12,939      11,449      11,435
Accidents   0.137       11,430      10,260      10,251
Accidents   0.14        10,013      9,092       9,088
Accidents   0.143       8,686       7,938       7,933
Retail      0.003       387,534     386,224     386,222
Retail      0.004       93,399      92,517      92,516
Retail      0.005       33,855      33,204      33,202
Retail      0.006       13,025      12,555      12,555
Retail      0.007       7,403       7,034       7,033
Footmart    0.0011      2,402,436   2,400,878   2,400,878
Footmart    0.0012      2,374,392   2,372,834   2,372,834
Footmart    0.0013      2,329,210   2,327,651   2,327,651
Footmart    0.0014      2,263,980   2,263,980   2,263,980
Footmart    0.0015      2,151,948   2,151,948   2,151,948

Owing to the strict upper bounds and new pruning strategies, HUI-PR∗ had the smallest number of
candidates, the smallest number of visited transactions, and the smallest number of upper-bound
calculations in all cases. Logically, it could also use less memory than HUI-PR and EFIM (D2HUP
always estimates more candidates during its process) and spend less time to find all the HUIs in
a dataset. Figure 4 shows the memory usage of EFIM and HUI-PR in the experiments. It shows that
the proposed HUI-PR uses less memory in a real running environment. The proposed HUI-PR
algorithm is clearly better suited to dense datasets. In a dense dataset, a strict upper bound
avoids evaluating a larger number of overestimated itemsets. In a sparse dataset, the influence
on memory usage of the proposed HUI-PR is not evident, because the differences among the
itemsets' utilities are large and the itemsets are easy to separate and classify. In the
experimental results, the number of upper-bound calculations can be reduced by more than a
thousand, and the reduction can exceed 8 percent on chess, connect, and accidents compared with
the traditional EFIM method. Thus, the proposed method can save a lot of memory on the dense
datasets. However, in the previous section, HUI-PR∗ did not show the advantage of accurate upper
bounds in the runtime experiments. This is because EFIM and HUI-PR used the projected database
method to compress and reformat the input dataset. A smaller dataset reduces the advantage of
HUI-PR∗ while exposing the computational cost of the strict upper bound. The same issue appears
in the multi-thread version of HUI-PR.
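
The reduction in upper-bound calculations can be checked with a few lines of arithmetic. The sketch below uses the EFIM and HUI-PR counts at the lowest threshold of chess, connect, and accidents in Table 9; the choice of these rows, the dictionary name `TABLE9_LOWEST`, and the helper `reduction` are assumptions made for illustration.

```python
# (EFIM, HUI-PR) upper-bound calculation counts at the lowest threshold in Table 9.
TABLE9_LOWEST = {
    "chess": (17963, 16178),      # threshold 0.25
    "connect": (41479, 37837),    # threshold 0.289
    "accidents": (14773, 12946),  # threshold 0.131
}


def reduction(efim, hui_pr):
    """Return the absolute and the relative (percent) reduction with respect to EFIM."""
    saved = efim - hui_pr
    return saved, 100.0 * saved / efim


for name, (efim, hui_pr) in TABLE9_LOWEST.items():
    saved, pct = reduction(efim, hui_pr)
    print(f"{name}: {saved} fewer calculations ({pct:.2f}%)")
```

For these rows the savings are 1,785, 3,642, and 1,827 calculations, or roughly 9.9, 8.8, and 12.4 percent; at higher thresholds the relative reduction can be smaller.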

5.3 Comparison of Multiple Threads Version


In this section, a specific multi-thread version of HUI-PR is compared with the original on
datasets of two different sizes. The first, “chess”, includes 3,196 transactions, and the second,
“connect”, includes more than 67 thousand transactions. HUI-PR(2) follows the original algorithm
but uses two threads to import the dataset. The experimental results are shown in Figure 5.
Fig. 4. The memory usage of EFIM and HUI-PR.

Fig. 5. The runtimes under multiple threads.

Theoretically, two threads should take less computation time to mine all the HUIs in a dataset.
Paradoxically, the original version had better performance on the first dataset. There are several
reasons for this situation. First, creating a thread is expensive, because the computer needs to
assign more resources to the new thread. Second, a multi-threaded program has to handle the
critical-section problem, and keeping the shared data correct always costs a lot of time. If the
dataset is not large-scale, multi-threading cannot show the benefit of parallel computation.
In addition, in the experiments, D2HUP, EFIM, and HUI-PR further performed the projected-dataset
process to decrease the size of the dataset. On the “connect” dataset, HUI-PR(2) finally presented
better performance than HUI-PR, since more transactions and items were processed and the effect
of the projected-dataset process is not obvious on this dataset. We therefore conclude that the
multi-thread program can help to speed up mining, especially for a large-scale dataset without
the projection operation.

6 CONCLUSION
In this article, we have proposed a novel HUI mining approach called HUI-PR to reduce the search
space while finding the set of high-utility itemsets. HUI-PR introduces two new upper bounds and
reduces the number of candidate itemsets. This avoids spending computation time on unnecessary
itemsets compared to the previous works. Mathematically, the proposed method provides more
accurate upper bounds and helps the process find HUIs effectively. However, we also found that
the proposed upper bounds and parallel computation could not show their advantages on a
small-scale dataset. The traditional EFIM loads all of the data into memory and duplicates
several modified versions of the original dataset. This can indeed increase the performance of
EFIM and HUI-PR, but it is impractical in a real application. Real-world datasets are often very
large; they usually cannot be loaded into memory and are not allowed to be modified. In this
case, performing the projected-dataset process is impossible. Nevertheless, we have shown that
the proposed new upper bounds and parallel computation for finding HUIs are useful. In the
future, we will extend the designed multi-thread approach to a cloud computing model (such as a
MapReduce framework) to reveal HUIs in a large-scale dataset.

REFERENCES
[1] Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules. In International Con-
ference on Very Large Data Bases, Vol. 1215. 487–499.
[2] Chowdhury Farhan Ahmed, Syed Khairuzzaman Tanbeer, Byeong-Soo Jeong, and Young-Koo Lee. 2009. An efficient
candidate pruning technique for high utility pattern mining. In The Pacific-Asia Conference on Knowledge Discovery
and Data Mining. ACM, 749–756.
[3] Chowdhury Farhan Ahmed, Syed Khairuzzaman Tanbeer, Byeong-Soo Jeong, and Young-Koo Lee. 2009. Efficient
tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data
Engineering 21, 12 (2009), 1708–1721.
[4] Chowdhury Farhan Ahmed, Syed Khairuzzaman Tanbeer, Byeong-Soo Jeong, and Young-Koo Lee. 2009. Efficient
tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data
Engineering 21, 12 (2009), 1708–1721.
[5] Brock Barber and Howard J. Hamilton. 2003. Extracting share frequent itemsets with infrequent subsets. Data Mining
and Knowledge Discovery 7, 2 (2003), 153–185.
[6] Raymond Chan, Qiang Yang, and Yi-Dong Shen. 2003. Mining high utility itemsets. In International Conference on
Data Mining. IEEE, 19–26.
[7] Ming-Syan Chen, Jiawei Han, and Philip S. Yu. 1996. Data mining: An overview from a database perspective. IEEE
Transactions on Knowledge and Data Engineering 8, 6 (1996), 866–883.
[8] Chun-Jung Chu, Vincent S. Tseng, and Tyne Liang. 2009. An efficient algorithm for mining high utility itemsets with
negative item values in large databases. Applied Mathematics and Computation 215, 2 (2009), 767–778.
[9] Alva Erwin, Raj P. Gopalan, and N. R. Achuthan. 2008. Efficient mining of high utility itemsets from large datasets.
In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 554–561.
[10] Alva Erwin, Raj P. Gopalan, and N. R. Achuthan. 2007. CTU-Mine: An efficient high utility itemset mining algorithm
using the pattern growth approach. In The International Conference on Computer and Information Technology. 71–76.
[11] Philippe Fournier-Viger, Jerry Chun-Wei Lin, Rage Uday Kiran, Yun-Sing Koh, and Rincy Thomas. 2017. A survey of
sequential pattern mining. Data Science and Pattern Recognition 1, 1 (2017), 54–77.
[12] Philippe Fournier-Viger, Cheng-Wei Wu, Souleymane Zida, and Vincent S. Tseng. 2014. FHM: Faster high-utility item-
set mining using estimated utility co-occurrence pruning. In International Symposium on Methodologies for Intelligent
Systems. Troels Andreasen, Henning Christiansen, Juan-Carlos Cubero, and Zbigniew W. Raś (Eds.), Springer, 83–92.
[13] Bart Goethals. 2012. Frequent itemset mining dataset repository. Retrieved from http://fimi.ua.ac.be/data.
[14] Wensheng Gan, Jerry Chun-Wei Lin, Han-Chieh Chao, and Justin Zhan. 2017. Data mining in distributed environ-
ment: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7, 6 (2017), e1216.
[15] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, Tzung-Pei Hong, and Hamido Fujita.
2018. A survey of incremental high-utility itemset mining. Wiley Interdisciplinary Reviews: Data Mining and Knowl-
edge Discovery 8, 2 (2018), e1242.
[16] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, Tzung-Pei Hong, and Hamido Fujita.
2018. A survey of incremental high-utility itemset mining. WIRES Data Mining and Knowledge Discovery 8, 2 (2018),
e1242.
[17] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, and Philip S. Yu. 2019. HUOPM:
High-utility occupancy pattern mining. IEEE Transactions on Cybernetics (2019), 1–14.
[18] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, and Philip S. Yu. 2019. A survey of
parallel sequential pattern mining. ACM Transactions on Knowledge Discovery from Data 13, 3 (2019), 25.
[19] Jiawei Han, Jian Pei, and Yiwen Yin. 2000. Mining frequent patterns without candidate generation. In ACM Sigmod
Record, Vol. 29. ACM, 1–12.


[20] Tzung-Pei Hong, Jimmy Ming-Tai Wu, Yan-Kang Li, and Chun-Hao Chen. 2018. Generalizing concept-drift patterns
for fuzzy association rules. Journal of Network Intelligence 3, 2 (2018), 126–137.
[21] Srikumar Krishnamoorthy. 2015. Pruning strategies for mining high utility itemsets. Expert Systems with Applications
42, 5 (2015), 2371–2381.
[22] Guo-Cheng Lan, Tzung-Pei Hong, and Vincent S. Tseng. 2014. An efficient projection-based indexing approach for
mining high utility itemsets. Knowledge and Information Systems 38, 1 (2014), 85–107.
[23] Hua-Fu Li, Hsin-Yun Huang, Yi-Cheng Chen, Yu-Jiun Liu, and Suh-Yin Lee. 2008. Fast and memory efficient mining
of high utility itemsets in data streams. In IEEE International Conference on Data Mining. IEEE, 881–886.
[24] Yu-Chiang Li, Jieh-Shan Yeh, and Chin-Chen Chang. 2005. Direct candidates generation: A novel algorithm for discov-
ering complete share-frequent itemsets. In The International Conference on Fuzzy Systems and Knowledge Discovery,
Lipo Wang and Yaochu Jin (Eds.). Springer, 551–560.
[25] Yu-Chiang Li, Jieh-Shan Yeh, and Chin-Chen Chang. 2005. Direct candidates generation: A novel algorithm for dis-
covering complete share-frequent itemsets. In International Conference on Fuzzy Systems and Knowledge Discovery.
Springer, 551–560.
[26] Yu-Chiang Li, Jieh-Shan Yeh, and Chin-Chen Chang. 2008. Isolated items discarding strategy for discovering high
utility itemsets. Data & Knowledge Engineering 64, 1 (2008), 198–217.
[27] Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu. 2011. An effective tree structure for mining high utility itemsets.
Expert Systems with Applications 38, 6 (2011), 7419–7424.
[28] Jerry Chun-Wei Lin, Shifeng Ren, Philippe Fournier-Viger, Tzung-Pei Hong, Ja-Hwung Su, and Bay Vo. 2017. A fast
algorithm for mining high average-utility itemsets. Applied Intelligence 47, 2 (2017), 331–346.
[29] Jerry Chun-Wei Lin, Shifeng Ren, Philippe Fournier-Viger, Jeng-Shyan Pan, and Tzung-Pei Hong. 2019. Efficiently
updating the discovered high average-utility itemsets with transaction insertion. Engineering Applications of Artificial
Intelligence 72, C (2019), 136–149.
[30] Jerry Chun-Wei Lin, Lu Yang, Philippe Fournier-Viger, and Tzung-Pei Hong. 2019. Mining of skyline patterns by
considering both frequent and utility constraints. Engineering Applications of Artificial Intelligence 77 (2019), 229–
238.
[31] Junqiang Liu, Ke Wang, and Benjamin C. M. Fung. 2012. Direct discovery of high utility itemsets without candidate
generation. In The International Conference on Data Mining. IEEE, 984–989.
[32] Mengchi Liu and Junfeng Qu. 2012. Mining high utility itemsets without candidate generation. In The International
Conference on Information and Knowledge Management. ACM, 55–64.
[33] Ying Liu, Wei-keng Liao, and Alok Choudhary. 2005. A two-phase algorithm for fast discovery of high utility itemsets.
In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 689–695.
[34] Heungmo Ryang and Unil Yun. 2016. High utility pattern mining over data streams with sliding window technique.
Expert Systems with Applications 57 (2016), 214–231.
[35] Heungmo Ryang and Unil Yun. 2017. Indexed list-based high utility pattern mining with utility upper-bound reduction
and pattern combination techniques. Knowledge and Information Systems 51, 2 (2017), 627–659.
[36] Heungmo Ryang, Unil Yun, and Keun Ho Ryu. 2016. Fast algorithm for high utility pattern mining with the sum of
item quantities. Intelligent Data Analysis 20, 2 (2016), 395–415.
[37] Bai-En Shie, Hui-Fang Hsiao, and Vincent S. Tseng. 2013. Efficient algorithms for discovering high utility user behavior
patterns in mobile commerce environments. Knowledge and Information Systems 37, 2 (2013), 363–387.
[38] Bai-En Shie, Hui-Fang Hsiao, Vincent S. Tseng, and Philip S. Yu. 2011. Mining high utility mobile sequential patterns in
mobile commerce environments. In International Conference on Database Systems for Advanced Applications. Springer,
224–238.
[39] Bai-En Shie, Vincent S. Tseng, and Philip S. Yu. 2010. Online mining of temporal maximal utility itemsets from data
streams. In ACM Symposium on Applied Computing. ACM, 1622–1626.
[40] Wei Song, Yu Liu, and Jinhong Li. 2014. BAHUI: Fast and memory efficient mining of high utility itemsets based on
bitmap. International Journal of Data Warehousing and Mining 10, 1 (2014), 1–15.
[41] Vincent S. Tseng, Bai-En Shie, Cheng-Wei Wu, and Philip S. Yu. 2012. Efficient algorithms for mining high utility
itemsets from transactional databases. IEEE Transactions on Knowledge and Data Engineering 25, 8 (2012), 1772–1786.
[42] Vincent S. Tseng, Cheng-Wei Wu, Bai-En Shie, and Philip S. Yu. 2010. UP-Growth: An efficient algorithm for high
utility itemset mining. In ACM International Conference on Knowledge Discovery and Data Mining. ACM, 253–262.
[43] Vincent S. Tseng, Cheng-Wei Wu, Bai-En Shie, and Philip S. Yu. 2010. UP-Growth: An efficient algorithm for high
utility itemset mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM,
253–262.
[44] Bay Vo, Frans Coenen, and Bac Le. 2013. A new method for mining frequent weighted itemsets based on WIT-trees.
Expert Systems with Applications 40, 4 (2013), 1256–1264.

ACM Transactions on Knowledge Discovery from Data, Vol. 13, No. 6, Article 58. Publication date: October 2019.

[45] Cheng-Wei Wu, Bai-En Shie, Vincent S. Tseng, and Philip S. Yu. 2012. Mining top-k high utility itemsets. In The
International Conference on Knowledge Discovery and Data Mining. ACM, 78–86.
[46] Jimmy Ming-Tai Wu, Justin Zhan, and Jerry Chun-Wei Lin. 2017. An ACO-based approach to mine high-utility itemsets.
Knowledge-Based Systems 116 (2017), 102–113.
[47] Hong Yao and Howard J. Hamilton. 2006. Mining itemset utilities from transaction databases. Data & Knowledge
Engineering 59, 3 (2006), 603–626.
[48] Hong Yao, Howard J. Hamilton, and Cory J. Butz. 2004. A foundational approach to mining itemset utilities from
databases. In SIAM International Conference on Data Mining. SIAM, 482–486.
[49] Show-Jane Yen and Yue-Shi Lee. 2007. Mining high utility quantitative association rules. In International Conference
on Data Warehousing and Knowledge Discovery. Springer, 283–292.
[50] Unil Yun, Donggyu Kim, Eunchul Yoon, and Hamido Fujita. 2018. Damped window based high average utility pattern
mining over data streams. Knowledge-Based Systems 144 (2018), 188–205.
[51] Unil Yun, Gangin Lee, and Keun Ho Ryu. 2014. Mining maximal frequent patterns by considering weight conditions
over data streams. Knowledge-Based Systems 55 (2014), 49–65.
[52] Unil Yun, Gangin Lee, and Eunchul Yoon. 2017. Efficient high utility pattern mining for establishing manufacturing
plans with sliding window control. IEEE Transactions on Industrial Electronics 64, 9 (2017), 7239–7249.
[53] Unil Yun, Heungmo Ryang, Gangin Lee, and Hamido Fujita. 2017. An efficient algorithm for mining high utility
patterns from incremental databases with one database scan. Knowledge-Based Systems 124 (2017), 188–206.
[54] Unil Yun, Heungmo Ryang, and Keun Ho Ryu. 2014. High utility itemset mining with techniques for reducing overestimated
utilities and pruning candidates. Expert Systems with Applications 41, 8 (2014), 3861–3878.
[55] Unil Yun and Keun Ho Ryu. 2013. Efficient mining of maximal correlated weight frequent patterns. Intelligent Data
Analysis 17, 5 (2013), 917–939.
[56] Souleymane Zida, Philippe Fournier-Viger, Jerry Chun-Wei Lin, Cheng-Wei Wu, and Vincent S. Tseng. 2015. EFIM:
A highly efficient algorithm for high-utility itemset mining. In The International Conference on Artificial Intelligence,
Grigori Sidorov and Sofia N. Galicia-Haro (Eds.). Springer, 530–546.
[57] Morteza Zihayat, Heidar Davoudi, and Aijun An. 2017. Mining significant high utility gene regulation sequential
patterns. BMC Systems Biology 11, 6 (2017), 109.

Received February 2017; revised February 2019; accepted August 2019
