PEFIM: A Parallel Efficient Algorithm for High Utility Itemset Mining
Authors:
Juan Francisco Figueredo
Alexis Eduardo Ojeda Davalos
Advisors:
Diego Pinto
Wilfrido Inchausti
2018
Dedication
Juan Figueredo
Alexis Ojeda
Acknowledgements
Special thanks to thesis advisors Wilfrido Inchausti and Diego Pinto for their guidance
and teaching.
To our professors, colleagues and friends who were part of this important experience.
Abstract
Data mining can be defined as the activity of extracting new, non-trivial information contained in large databases. Traditional data mining techniques have focused largely on detecting statistical correlations between the items that appear most frequently in transaction databases. These techniques, also termed frequent itemset mining, are based on the rationale that itemsets which appear more frequently must be more important to the user from a business perspective.
High utility itemset mining is an important data mining problem that considers not only the frequency of the itemsets but also the utility associated with them. The term utility refers to the importance or usefulness of the appearance of an itemset in transactions, quantified in terms such as profit, sales or other user preferences. Existing research focuses on reducing the computational time through the introduction of pruning strategies. Another aspect of high utility itemset mining is the processing of large datasets, an aspect that is rarely explored.
This work presents a distributed approach that divides the search space among different worker nodes, which compute high utility itemsets that are then aggregated to obtain the result. The experimental results show a significant improvement in execution time when computing high utility itemsets on large datasets.
Contents
Dedication
Acknowledgements
Abstract
Contents
List of Figures
1 Introduction
1.1 Thesis Objectives
1.2 Thesis Organization
2 Problem statement
2.1 High-Utility Itemset Mining
3 Related Work
3.1 State-of-the-art algorithms
3.1.1 High Utility Itemset Mining
3.2 Discussion
4 Efficient High-Utility Itemset Mining
5 Parallel Computing
5.1 Apache Hadoop
5.2 Apache Spark
5.2.1 Programming Model
5.2.2 Applications
6 PEFIM
6.1 Main Procedure
6.2 The Search procedure
6.3 Overall Flow of the PEFIM Algorithm
7 Experimental Results
7.1 Datasets
7.2 PEFIM vs. EFIM
7.2.1 Comparison of Computational Time
7.2.2 Comparison of Computational Resources (Physical Memory)
8 Conclusions
9 Future work
References
List of Figures
5.1 Lineage graph for the third query in our example; boxes represent RDDs and arrows represent transformations
7.1 Time to find HUIs in a dataset with 75 distinct items and up to 20 items per transaction
7.2 Time to find HUIs in a dataset with 120 distinct items and up to 50 items per transaction
7.3 The dataset for these figures has 400 distinct items and a maximum transaction length of 70
7.4 The dataset for these figures has 500 distinct items and a maximum transaction length of 10
7.5 The dataset for these figures has 1500 distinct items and a maximum transaction length of 15
7.6 The dataset for these figures has 40000 distinct items and a maximum transaction length of 20
7.7 The dataset for these figures has 45000 distinct items and a maximum transaction length of 15
7.8 The dataset for these graphs has 75 distinct items and a maximum transaction length of 20
7.9 The dataset for these graphs has 120 distinct items and a maximum transaction length of 50
7.10 The dataset for these graphs has 400 distinct items and a maximum transaction length of 70
7.11 The dataset for these graphs has 500 distinct items and a maximum transaction length of 10
7.12 The dataset for these graphs has 1500 distinct items and a maximum transaction length of 10
7.13 The dataset for these graphs has 2000 distinct items and a maximum transaction length of 150
7.14 The dataset for these graphs has 40000 distinct items and a maximum transaction length of 15
7.15 The dataset for these graphs has 45000 distinct items and a maximum transaction length of 15
List of Tables
Acronyms and Symbols
Chapter 1
Introduction
Data mining and knowledge discovery is the process of discovering and extracting information or patterns, revealing potentially useful information from large databases. Among the many ways of discovering knowledge in databases, association rule mining is a form of data mining that extracts interesting correlations, frequent patterns, associations or causal structures among sets of items in databases. Discovering useful patterns hidden in a database plays an essential role in several data mining tasks, such as frequent pattern mining, weighted frequent pattern mining and high utility pattern mining. Among them, frequent pattern mining is a fundamental research topic that has been applied to different kinds of databases, such as transactional databases, streaming databases and time series databases, with various application domains, including decision support, market strategy, financial forecasting, bio-informatics, web click-stream analysis, and mobile environments [UKM14].
Frequent itemset mining (FIM) [Agr94] is a popular data mining task that is essential to a wide range of applications. Given a transaction database, FIM consists of discovering frequent itemsets, i.e. groups of items (itemsets) appearing frequently in transactions [Agr94, Uno04]. Many popular algorithms have been proposed for this problem, such as Apriori, FPGrowth, LCM and Eclat. These algorithms take as input a transaction database and a parameter called the minimum support threshold, "minsup". They then return all sets of items (itemsets) that appear in at least minsup transactions.
The problem of frequent itemset mining is popular, but it has some important limitations when it comes to analyzing customer transactions. An important limitation is that purchase quantities are not taken into account: an item may only appear once or not at all in a transaction. Thus, whether a customer has bought five loaves, ten loaves or twenty loaves, it is viewed as the same.
A second important limitation is that all items are viewed as having the same importance, utility or weight. For example, whether a customer buys a very expensive bottle of wine or a cheap piece of bread, it is viewed as equally important.
Thus, frequent itemset mining may find many frequent itemsets that are not interesting.
For example, one may find that {bread, milk} is a frequent itemset. However, from
a business perspective, this pattern may be uninteresting because it does not generate
much profit. Moreover, frequent itemset mining algorithms may miss the rare patterns
that generate a high profit such as perhaps {caviar, wine}.
To address this issue, the problem of FIM has been redefined as the problem of High-
Utility Itemset Mining (HUIM) [FV14, Lan14, Liu05, Son14, Tse13]. In this problem,
a transaction database contains transactions where purchase quantities are taken into
account as well as the unit profit of each item.
The goal of high-utility itemset mining is to find the itemsets (groups of items) that generate a high profit in a database when their items are sold together. The user has to provide a value for a threshold called "minutil" (the minimum utility threshold). A high-utility itemset mining algorithm outputs all the high-utility itemsets, that is, the itemsets that generate at least "minutil" profit.
Different pruning approaches have been introduced so far to reduce the number of candidate sets generated. However, these state-of-the-art algorithms perform well only when the dataset is small: when the size of the dataset increases, preliminary experiments show that their performance degrades. Therefore, in the current era of big data, there is a need to process datasets across multiple machines, which is possible through distributed computing.
The Hadoop framework follows a disk-based paradigm and is heavily dependent on its Hadoop Distributed File System (HDFS). Another framework, named Spark [spaa], was introduced to overcome this heavy dependency on HDFS by allowing in-memory computation. The Spark framework can perform up to 100 times faster than Hadoop. Spark uses Resilient Distributed Datasets (RDDs), immutable data structures that allow efficient reuse of data through in-memory computation.
To deal with the research challenges associated with the HUIM problem mentioned above, the following objectives have been delineated:
General Objective:
Specific Objectives:
Chapter 3 presents related work and the main motivations of this work. Chapter 2 summarizes the presented HUIM problem formulation. Chapter 4 describes the fastest state-of-the-art algorithm for solving the formulated problem, while Chapter 5 explains how the state-of-the-art framework for parallel computing works and Chapter 6 shows how the EFIM algorithm is combined with the Spark framework. Chapter 7 presents the obtained results and the main findings of the experimental comparison. Finally, conclusions and future work are left to Chapters 8 and 9.
Chapter 2
Problem statement
The problem of high-utility itemset mining is defined as follows. Let I be a finite set of items (symbols). An itemset X is a finite set of items such that X ⊆ I. A transaction database is a multiset of transactions D = {T1, T2, ..., Tn} such that for each transaction Tc, Tc ⊆ I and Tc has a unique identifier c called its TID (Transaction ID). Each item i ∈ I is associated with a positive number p(i), called its external utility (e.g. unit profit). Every item i appearing in a transaction Tc has a positive number q(i, Tc), called its internal utility (e.g. purchase quantity). For example, consider the database in Table 2.1, which will be used as the running example. It contains five transactions (T1, T2, ..., T5). Transaction T2 indicates that items a, c, e and g appear in this transaction with internal utilities of 2, 6, 2 and 5, respectively. Table 2.2 indicates that the external utilities of these items are 5, 1, 3 and 1, respectively.
Table 2.1: A transaction database

TID | Transaction
T1  | (a, 1) (c, 1) (d, 1)
T2  | (a, 2) (c, 6) (e, 2) (g, 5)
T3  | (a, 1) (b, 2) (c, 1) (d, 6) (e, 1) (f, 5)
T4  | (b, 4) (c, 3) (d, 3) (e, 1)
T5  | (b, 2) (c, 2) (e, 1) (g, 2)
Table 2.2: External utilities of the items

Item   | a | b | c | d | e | f | g
Profit | 5 | 2 | 1 | 2 | 3 | 1 | 1
Definition (Utility of an item/itemset). The utility of an item i in a transaction Tc is denoted as u(i, Tc) and defined as u(i, Tc) = p(i) × q(i, Tc) if i ∈ Tc. The utility of an itemset X in a transaction Tc is denoted as u(X, Tc) and defined as u(X, Tc) = Σ_{i∈X} u(i, Tc) if X ⊆ Tc. The utility of an itemset X is denoted as u(X) and defined as u(X) = Σ_{Tc∈g(X)} u(X, Tc), where g(X) is the set of transactions containing X.
For example, the utility of item a in T2 is u(a, T2 ) = 5×2 = 10. The utility of the itemset
{a, c} in T2 is u({a, c}, T2 ) = u(a, T2 ) + u(c, T2 ) = 5 × 2 + 1 × 6 = 16. Furthermore, the
utility of the itemset {a, c} is u({a, c}) = u({a, c}, T1 ) + u({a, c}, T2 ) + u({a, c}, T3 ) =
u(a, T1 ) + u(c, T1 ) + u(a, T2 ) + u(c, T2 ) + u(a, T3 ) + u(c, T3 ) = 5 + 1 + 10 + 6 + 5 + 1 = 28.
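These definitions can be checked directly on the running example. The sketch below is plain Python with the database and profit table transcribed from Tables 2.1 and 2.2; the function names (`u_item`, `u_in_transaction`, `u`) are illustrative choices, not notation from the thesis:

```python
# Transaction database of Table 2.1: TID -> {item: internal utility q(i, Tc)}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
# External utilities (unit profits) of Table 2.2
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}

def u_item(i, t):
    """Utility of item i in transaction t: p(i) x q(i, t)."""
    return PROFIT[i] * t[i]

def u_in_transaction(X, t):
    """Utility of itemset X in transaction t (0 if X is not contained in t)."""
    if not set(X) <= set(t):
        return 0
    return sum(u_item(i, t) for i in X)

def u(X):
    """Utility of itemset X over the whole database."""
    return sum(u_in_transaction(X, t) for t in DB.values())

print(u_item("a", DB["T2"]))                   # 10
print(u_in_transaction({"a", "c"}, DB["T2"]))  # 16
print(u({"a", "c"}))                           # 28, as computed above
```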
An itemset X is a high-utility itemset if its utility u(X) is no less than a minimum utility threshold "minutil" set by the user. For example, if minutil = 30, the high-utility itemsets in the database of the running example are {b, d}, {a, c, e}, {b, c, d}, {b, c, e}, {b, d, e}, {b, c, d, e} and {a, b, c, d, e, f}, with utilities of 30, 31, 34, 31, 36, 40 and 30, respectively.
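This result can be verified with a brute-force enumeration over the running example. The sketch below is an illustration only (no real HUIM algorithm enumerates all 2^|I| itemsets); the database and profits are transcribed from Tables 2.1 and 2.2:

```python
from itertools import combinations

# Running example: Table 2.1 (database) and Table 2.2 (unit profits)
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}

def utility(X):
    """Utility of itemset X, summed over the transactions containing X."""
    return sum(
        sum(PROFIT[i] * t[i] for i in X)
        for t in DB.values()
        if set(X) <= set(t)
    )

def high_utility_itemsets(minutil):
    """Naive enumeration of every non-empty itemset (feasible only for tiny I)."""
    items = sorted(PROFIT)
    huis = {}
    for k in range(1, len(items) + 1):
        for X in combinations(items, k):
            v = utility(X)
            if v >= minutil:
                huis[frozenset(X)] = v
    return huis

huis = high_utility_itemsets(30)
print(len(huis))                 # 7 high-utility itemsets, as listed above
print(huis[frozenset("bd")])     # 30
```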
Chapter 3
Related Work
Several research works have been proposed in the high utility itemset mining literature, studying the problem in both parallel and non-parallel environments. The sections below detail the main existing approaches described in relevant articles, focusing on the motivation of the presented work.
This section presents a brief overview of the various algorithms, concepts and approaches that have been defined in various research publications.
Yao et al. in [HY04] define the problem of utility mining formally. The work defines the
terms transaction utility and external utility of an itemset. The mathematical model of
utility mining was then defined based on the two properties of utility bound and support
bound.
The utility bound property provides an upper bound on the utility value of any itemset. This property can be used as a heuristic measure for pruning, at early stages, itemsets that are not expected to qualify as high utility itemsets.
Yao et al. in [HY06] define the utility mining problem as a case of constraint mining. This work shows that the downward closure property used in the standard Apriori algorithm and the convertible constraint property are not directly applicable to the utility mining problem. The authors also present two pruning strategies to reduce the cost of finding high utility itemsets. By exploiting these pruning strategies, they develop two algorithms, UMining and UMining H. UMining finds all itemsets with utility values higher than minutil in a database. UMining H finds most itemsets with utility values higher than minutil based on a heuristic pruning strategy. The effectiveness of the algorithms was demonstrated by applying them to synthetic and real-world databases. UMining may be preferable to UMining H because it guarantees the discovery of all high utility itemsets.
Yao et al. in [HY07] classify utility measures into three categories, namely item level, transaction level and cell level. A unified utility function is defined to represent all existing utility-based measures.
High utility frequent itemsets contribute the most to a predefined utility, objective function or performance metric [J.H07]. Hu et al. in [J.H07] present the HYP (High-Yield Partition trees) algorithm for frequent itemset mining, which identifies high utility item combinations. The algorithm is designed to find segments of data, defined through combinations of a few items (rules), that satisfy certain conditions as a group and maximize a predefined objective function. The authors formulate the task as an optimization problem, present an efficient approximation to solve it through specialized partition trees called high-yield partition trees, and investigate the performance of various splitting techniques. The algorithm has been tested on real-world data with promising results.
Li et al. in [YL08] propose the Isolated Items Discarding Strategy (IIDS), which can be applied to any level-wise utility mining method to further reduce the number of redundant candidates. In each pass, a utility mining method with IIDS scans a database that is smaller than the original by skipping isolated items, efficiently improving performance. Their study focuses on the task of efficiently discovering all high utility itemsets.
Liu et al. in [YL05] propose a two-phase algorithm for finding high utility itemsets that efficiently prunes the number of candidates and precisely obtains the complete set of high utility itemsets. In the first phase, they propose a model that applies the "transaction-weighted downward closure property" to the search space to expedite the identification of candidates. In the second phase, one extra database scan is performed to identify the high utility itemsets. They verify their algorithm by applying it to both synthetic and real databases.
Ahmed et al. in [CA09] propose three novel tree structures to efficiently perform incremental and interactive high utility pattern mining. The first tree structure, the Incremental HUP (High Utility Pattern) Lexicographic Tree (IHUPL-Tree), is arranged according to an item's lexicographic order. It can capture incremental data without any restructuring operation. The second tree structure, the IHUP Transaction Frequency Tree (IHUPTF-Tree), obtains a compact size by arranging items according to their transaction frequency (descending order). To reduce the mining time, the third tree, the IHUP Transaction-Weighted Utilization Tree (IHUPTWU-Tree), is designed based on the TWU value of items in descending order.
Shankar et al. in [SS09] present a novel algorithm, Fast Utility Mining (FUM), which finds all high utility itemsets within the given utility constraint threshold. The FUM algorithm is built upon a novel approach that helps rectify and avoid the pitfalls that usually occur when using the UMining algorithm. FUM demonstrates appreciable semantic intelligence by considering only the distinct itemsets involved or defined in a transaction and not the entire set of available itemsets.
Moreover, the authors suggest a technique to generate different types of itemsets, such as High Utility and High Frequency (HUHF), High Utility and Low Frequency (HULF), Low Utility and High Frequency (LUHF) and Low Utility and Low Frequency (LULF).
Pillai et al. in [JP10] present a new foundational approach to temporal weighted itemset
mining where item utility value is allowed to be dynamic within a specified period of
time, unlike traditional approaches where value is static within those times. The authors
incorporate a fuzzy model where item utilities can be assumed to be fuzzy values.
Ramaraju et al. in [CR11] present a novel algorithm, CHUT (Conditional High Utility Tree), to mine high utility itemsets in two steps. The first step compresses the transaction database to reduce the search space. The second step uses a newly proposed algorithm, HU-Mine, to mine the complete set of high utility itemsets. The proposed algorithm needs only two database scans, in contrast to the many scans required by existing algorithms.
Liu et al. in [JLF12] observed that state-of-the-art works on utility mining all employ a two-phase, candidate-generation approach, which suffers from scalability issues due to the large number of candidates. Therefore, they propose an algorithm called d2HUP (Direct Discovery of High Utility Patterns) that works in a single phase without generating candidates. Their basic approach is to enumerate itemsets by prefix extensions, to prune the search space by utility upper bounding, and to maintain original utility information in the mining process through a novel data structure.
Tseng et al. in [VTY10] observed that the huge number of potential high utility itemsets poses a challenge to mining performance, since a higher processing cost is incurred when more potential high utility itemsets are generated. To address this issue, they propose an efficient algorithm called Utility Pattern Growth (UP-Growth) for mining high utility itemsets, together with a set of techniques for pruning candidate itemsets. The information on high utility itemsets is maintained in a special data structure named the Utility Pattern Tree (UP-Tree), such that candidates can be generated with only two scans of the database.
are computed from all items and considering multiple of the minimum utility as a user
specified threshold value.
To identify high utility itemsets, most existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms suffer from the problem that a very large number of candidates are generated, but most of the candidates are found not to be high utility after their exact utilities are computed. Liu et al. in [ML12] propose an algorithm called HUI-Miner (High Utility Itemset Miner) for high utility itemset mining. HUI-Miner uses a novel structure, called the utility-list, to store both the utility information about an itemset and the heuristic information for pruning the search space of HUI-Miner. By avoiding the costly generation and utility computation of numerous candidate itemsets, HUI-Miner can efficiently mine high utility itemsets from the utility-lists constructed from a mined database.
Tseng et al. in [Tse13] propose a new algorithm called UP-Growth+ (Utility Pattern Growth+) for reducing overestimated utilities more effectively and enhancing the performance of utility mining. In UP-Growth, the minimum item utility is used to reduce the overestimated utilities. In UP-Growth+, the minimal node utilities in each path are used to make the estimated pruning values closer to the real utility values of the pruned items in the database.
Many algorithms suffer from generating a large number of candidate itemsets, which degrades mining performance, so Yun et al. propose in [UY14] an algorithm named MU-Growth (Maximum Utility Growth) with two techniques for effectively pruning candidates during the mining process. Moreover, they suggest a tree structure, named the MIQ-Tree (Maximum Item Quantity Tree), which captures database information in a single pass. The proposed data structure is restructured to reduce overestimated utilities.
implement dynamic CHUI-Tree pruning, and discuss the rationality thereof. The CHUI-
Mine algorithm makes use of a concurrent strategy, enabling simultaneous construction
of a CHUI-Tree and the discovery of high utility itemsets. Their algorithm reduces
the problem of huge memory usage for tree construction and traversal in tree-based
algorithms for mining high utility itemsets.
Most of the existing approaches are based on the principle of level-wise processing, as in the traditional two-phase utility mining algorithm, to find high utility itemsets. Lan et al. propose in [GL14] an efficient utility mining approach, namely the projection-based (PB) mining approach, which adopts an indexing mechanism to speed up the execution and reduce the memory requirement of the mining process. The indexing mechanism can imitate traditional projection algorithms to achieve the aim of projecting sub-databases for mining. In addition, a pruning strategy is also applied to reduce the number of unpromising itemsets in mining.
Zida et al. in [SZ15] propose a novel algorithm named EFIM (EFficient high-utility Itemset Mining), which introduces several new ideas to discover high-utility itemsets more efficiently, both in terms of execution time and memory. EFIM relies on two upper-bounds, named sub-tree utility and local utility, to more effectively prune the search space. It also introduces a novel array-based utility counting technique, named Fast Utility Counting, to calculate these upper-bounds in linear time and space. Moreover, to reduce the cost of database scans, EFIM proposes efficient database projection and transaction merging techniques.
for transactions with the same items to reduce the size of database, and directly calcu-
lates the utilities of itemsets based on HUITWU-Tree and conditional HUITWU-Trees
without scanning original database.
Lin et al. in [JL16] present an efficient algorithm named FHN (Faster High-utility itemset miner with Negative unit profits). It relies on a novel PNU-list (Positive-and-Negative Utility-list) structure to efficiently mine high utility itemsets while considering both positive and negative unit profits.
3.2 Discussion
As seen in the previous section, there are many algorithms for finding high utility itemsets. Over the years, improvements have emerged to mine HUIs more efficiently, but mostly by reducing computational time through the introduction of pruning strategies, not by targeting large datasets. Therefore, this work extends the most efficient algorithm of the state-of-the-art, EFIM [SZ15], to propose a distributed approach that divides the search space among different worker nodes, which compute high utility itemsets that are aggregated to find the result.
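The idea of dividing the search space can be illustrated with a toy, purely sequential simulation in plain Python. This is a deliberate simplification, not PEFIM itself: each "partition" holds the itemsets that share a given smallest item, and in the real algorithm such partitions would be distributed as Spark tasks rather than processed in a loop. The database is the running example of Chapter 2:

```python
from itertools import combinations

# Running example: Table 2.1 (database) and Table 2.2 (unit profits)
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}
ITEMS = sorted(PROFIT)

def utility(X):
    """Utility of itemset X, summed over the transactions containing X."""
    return sum(
        sum(PROFIT[i] * t[i] for i in X)
        for t in DB.values()
        if set(X) <= set(t)
    )

def mine_partition(first, minutil):
    """One 'worker' explores only the itemsets whose smallest item is `first`."""
    later = [i for i in ITEMS if i > first]
    found = {}
    for k in range(len(later) + 1):
        for rest in combinations(later, k):
            X = (first,) + rest
            v = utility(X)
            if v >= minutil:
                found[X] = v
    return found

def mine(minutil):
    # In PEFIM the partitions run on different worker nodes and the partial
    # results are aggregated; here they are simply processed one by one.
    result = {}
    for first in ITEMS:
        result.update(mine_partition(first, minutil))
    return result

huis = mine(30)
print(len(huis))  # 7, matching the running example with minutil = 30
```

Because every itemset has exactly one smallest item, the partitions cover the search space without overlap, so the aggregated result equals the sequential result.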
Chapter 4
Efficient High-Utility Itemset Mining
This chapter presents EFIM, a one-phase algorithm that introduces several novel ideas to reduce the time and memory required for HUIM. It is also the basis of the proposed PEFIM algorithm. The content of this chapter is taken from [SZ15].
Let ≻ be any total order on the items of I. According to this order, the search space of all itemsets 2^I can be represented as a set-enumeration tree. For example, the set-enumeration tree of I = {a, b, c, d} for the lexicographical order is shown in Fig. 4.1. EFIM explores this search space using a depth-first search starting from the root. During this depth-first search, for any itemset α, EFIM recursively appends one item at a time to α according to the ≻ order to generate larger itemsets. The ≻ order is defined as the order of increasing TWU because it generally reduces the search space for HUIM.
Definition (Items that can extend an itemset). Let α be an itemset. Let E(α) denote the set of all items that can be used to extend α according to the depth-first search, that is E(α) = {z | z ∈ I ∧ z ≻ x, ∀x ∈ α}.
EFIM performs database scans to calculate the utility of itemsets and upper-bounds on their utility. To reduce the cost of database scans, it is desirable to reduce the database size. When an itemset is considered during the depth-first search, all items that do not belong to that itemset can be ignored when scanning the database to calculate the utility of itemsets in its sub-tree, or upper-bounds on their utility. A database without these items is called a projected database.
For example, consider database D (see Table 2.1) of the running example and α = {b}.
The projected database α-D contains three transactions: α-T3 = {c, d, e, f }, α-T4 =
{c, d, e} and α-T5 = {c, e, g}.
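The projection step can be sketched as follows. This is a simplified, copy-based version: EFIM actually uses offset-based pseudo-projections and orders items by increasing TWU, whereas the lexicographical order is used here because it reproduces the example above:

```python
# Running example database (Table 2.1): TID -> {item: internal utility}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}

def project(db, item):
    """Projected database for alpha = {item}: for every transaction containing
    `item`, keep only the items that come after it in the order."""
    projected = {}
    for tid, t in db.items():
        if item in t:
            rest = {i: q for i, q in t.items() if i > item}
            if rest:
                projected[tid] = rest
    return projected

pdb = project(DB, "b")
print(sorted(pdb))        # ['T3', 'T4', 'T5']
print(sorted(pdb["T3"]))  # ['c', 'd', 'e', 'f'], matching alpha-T3 above
```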
Database projections generally greatly reduce the cost of database scans, since transactions become smaller as larger itemsets are explored. EFIM sorts each transaction in the original database according to the total order ≻ beforehand. A projection is then performed as a pseudo-projection, that is, each projected transaction is represented by an offset pointer into the corresponding original transaction. The complexity of calculating the projection of a database is linear in time and space with respect to the number of transactions. The proposed database projection technique generalizes the concept of database projection from pattern mining to the case of transactions with internal/external utility values.
To further reduce the cost of database scans, EFIM also introduces an efficient transaction merging technique. It is based on the observation that transaction databases often contain identical transactions. The technique consists of identifying these transactions and replacing them with single transactions. In this context, a transaction Ta is said to be identical to a transaction Tb if it contains the same items as Tb (i.e. Ta = Tb), but not necessarily the same internal utility values.
To achieve a higher database reduction, the algorithm also merges transactions in projected databases. This generally achieves a much higher reduction, because projected transactions are smaller than original transactions and thus are more likely to be identical.
For example, consider database D of our running example and α = {c}. The projected
database α-D contains transactions α-T1 = {d}, α-T2 = {e, g}, α-T3 = {d, e, f }, α-
T4 = {d, e} and α-T5 = {e, g}. Transactions α-T2 and α-T5 can be replaced by a new
transaction TM = {e, g} where q(e, TM ) = 3 and q(g, TM ) = 7.
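A sketch of the merging step, again copy-based and using the lexicographical order for readability (EFIM's actual implementation is the linear-time, sort-based scheme described below):

```python
# Running example database (Table 2.1): TID -> {item: internal utility}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}

def project(db, item):
    """Projected database for alpha = {item} (items after `item` in the order)."""
    projected = {}
    for tid, t in db.items():
        if item in t:
            rest = {i: q for i, q in t.items() if i > item}
            if rest:
                projected[tid] = rest
    return projected

def merge_identical(transactions):
    """Replace transactions containing exactly the same items by a single
    transaction whose internal utilities are the sums of the originals."""
    merged = {}
    for t in transactions:
        key = tuple(sorted(t))
        if key in merged:
            for i, q in t.items():
                merged[key][i] += q
        else:
            merged[key] = dict(t)
    return merged

pdb = project(DB, "c")
merged = merge_identical(pdb.values())
print(merged[("e", "g")])  # {'e': 3, 'g': 7}: alpha-T2 and alpha-T5 merged
```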
The fourth case is that otherwise Ta ≻_T Tb. For example, let the transactions Tx = {b, c}, Ty = {a, b, c} and Tz = {a, b, e}. We have that Tz ≻_T Ty ≻_T Tx. A database sorted according to the ≻_T order provides the following property.
Using the above property, all identical transactions in a (projected) database can be
identified by only comparing each transaction with the next transaction in the database.
Thus, using this scheme, transaction merging can be done very efficiently by scanning a
(projected) database only once (linear time). It is interesting to note that transaction
merging as proposed in EFIM cannot be implemented efficiently in utility-list based algorithms (e.g. FHM and HUP-Miner) and hyperlink-based algorithms (e.g. d2HUP)
due to their database representations.
EFIM introduces an effective pruning mechanism with two new upper-bounds on the utility of itemsets, named sub-tree utility and local utility. The key difference with previous upper-bounds is that they are defined w.r.t. the sub-tree of an itemset α in the set-enumeration tree.
Definition (Sub-tree utility). Let α be an itemset and z an item that can extend α according to the depth-first search (z ∈ E(α)). The sub-tree utility of z w.r.t. α is su(α, z) = Σ_{T∈g(α∪{z})} [u(α, T) + u(z, T) + Σ_{i∈T ∧ i∈E(α∪{z})} u(i, T)].
Example 2. Consider the running example and α = {a}. We have that su(α, c) =
(5 + 1 + 2) + (10 + 6 + 11) + (5 + 1 + 20) = 61, su(α, d) = 25 and su(α, e) = 34.
Theorem 1 (Pruning a sub-tree using the sub-tree utility). Let α be an itemset and z ∈ E(α) an item. If su(α, z) < minutil, then the itemset α ∪ {z} and all the itemsets in its sub-tree are low-utility. Using Theorem 1 we can prune some sub-trees of an itemset, which reduces the number of sub-trees to be explored. To further reduce the search space, we also identify items that should not be explored in any sub-tree.
Definition (Local utility). Let α be an itemset and z ∈ E(α) an item. The local utility of z w.r.t. α is lu(α, z) = Σ_{T∈g(α∪{z})} [u(α, T) + re(α, T)], where re(α, T) denotes the remaining utility of α in T, i.e. the sum of the utilities of the items of T that can extend α.
Example 3. Consider the running example and α = {a}. We have that lu(α, c) =
(8 + 27 + 30) = 65, lu(α, d) = 30 and lu(α, e) = 57.
Property 3 (Overestimation using the local utility). Let α be an itemset and z ∈ E(α) an item. Let Z be an extension of α such that z ∈ Z. The relationship lu(α, z) ≥ u(Z) holds.
Theorem 2 (Pruning an item from all sub-trees using the local utility). Let
α be an itemset and z ∈ E(α) an item. If lu(α, z) < minutil, then all extensions of α
containing z are low-utility. In other words, item z can be ignored when exploring all
sub-trees of α. The relationship between the proposed upper-bounds and the main ones
used in previous work is the following: u(Y) ≤ su(α, z) = reu(Y) ≤ lu(α, z) ≤ TWU(z).
Given the above relationship, it can be seen that the proposed local utility upper-bound
is a tighter upper-bound on the utility of Y and its extensions compared to the TWU,
which is commonly used in two-phase HUIM algorithms. Thus, the local utility can be
more effective for pruning the search space. Besides, one may ask what the difference
is between the proposed su upper-bound and the reu upper-bound of HUI-Miner and
FHM, since they are mathematically equivalent. The major difference is that su is calculated
when the depth-first search is at itemset α in the search tree rather than at the child
itemset Y . Thus, if su(α, z) < minutil, EFIM prunes the whole sub-tree of z including
node Y rather than only pruning the descendant nodes of Y. Thus, using su instead of
reu is more effective for pruning the search space. In the rest of this work, for a given
itemset α, we respectively refer to items having a sub-tree utility and a local utility no
less than minutil as primary and secondary items.
The primary items of α are defined as Primary(α) = {z | z ∈ E(α) ∧ su(α, z) ≥ minutil},
and the secondary items of α as Secondary(α) = {z | z ∈ E(α) ∧ lu(α, z) ≥ minutil}.
Because lu(α, z) ≥ su(α, z), Primary(α) ⊆ Secondary(α).
For example, consider the running example and α = {a}: Primary(α) = {c, e} and
Secondary(α) = {c, d, e}.
Calculating the TWU of all items: for each transaction T of the database, the utility-bin
U[z] for each item z ∈ T is updated as U[z] = U[z] + TU(T). At the end of the database
scan, for each item k ∈ I, the utility-bin U[k] contains TWU(k).
Calculating the sub-tree utility w.r.t. an itemset α: for each transaction T of the
database, the utility-bin U[z] for each item z ∈ T ∩ E(α) is updated as
U[z] = U[z] + u(α, T) + u(z, T) + Σ_{i ∈ T ∧ i ≻ z} u(i, T). Thereafter, U[k] = su(α, k) for all k ∈ I.
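The utility-bin technique amounts to a single database pass that accumulates one value per item. A minimal Python sketch of the TWU scan is shown below, on an assumed toy database (the sub-tree utility scan follows the same single-pass pattern with a different increment per item):

```python
# Sketch of the utility-bin scan on an assumed toy database (not the
# thesis' running example).
DB = [{"a": 5, "b": 2, "c": 1},
      {"a": 10, "c": 6, "e": 3},
      {"b": 4, "c": 2, "e": 1}]

def compute_twu(db):
    """One database scan: U[z] accumulates TU(T) for every transaction T
    containing z, so U[z] = TWU(z) when the scan ends."""
    U = {}
    for T in db:
        tu = sum(T.values())          # TU(T): transaction utility
        for z in T:
            U[z] = U.get(z, 0) + tu   # utility-bin update
    return U

U = compute_twu(DB)   # {'a': 27, 'b': 15, 'c': 34, 'e': 26}
```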
The main procedure (Algorithm 3) takes as input a transaction database and the minutil
threshold. The algorithm initially considers that the current itemset α is the empty set.
The algorithm then scans the database once to calculate the local utility of each item
w.r.t. α, using a utility-bin array. Note that in the case where α = ∅, the local utility
of an item is its TWU. Then, the local utility of each item is compared with minutil
to obtain the secondary items w.r.t to α, that is items that should be considered in
extensions of α. Then, these items are sorted by ascending order of TWU and that
order is thereafter used as the order (as suggested in [2,7]). The database is then
scanned once to remove all items that are not secondary items w.r.t to α since they
cannot be part of any high-utility itemsets by Theorem 2. If a transaction becomes
empty, it is removed from the database. Then, the database is scanned again to sort
transactions by the T order to allow O(n) transaction merging, thereafter. Then, the
algorithm scans the database again to calculate the sub-tree utility of each secondary
item w.r.t. α, using a utility-bin array. Thereafter, the algorithm calls the recursive
procedure Search to perform the depth-first search starting from α.
The Search procedure considers each single-item extension of α of the form β = α ∪ {i},
where i is a primary item w.r.t. α. For each such extension β, a database scan is
performed to calculate the utility of β and, at the same time, construct the β projected
database. Note that transaction merging is performed whilst the β projected database
is constructed. If the utility of β is no less than minutil, β is output as a high-utility
itemset. Then, the database is scanned again to calculate the sub-tree and local utility
w.r.t. β of each item z that could extend β (the secondary items w.r.t. α), using two
utility-bin arrays. This allows determining the primary and secondary items of β. Then,
the Search procedure is recursively called with β to continue the depth-first search by
extending β. Based on the properties and theorems presented in the previous sections,
it can be seen that when EFIM terminates, all and only the high-utility itemsets have
been output.
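The overall depth-first enumeration can be illustrated with a heavily simplified sequential sketch in Python. It omits EFIM's database projection, transaction merging and su/lu pruning, so it is exhaustive rather than efficient; the toy database and the minutil value are assumptions for illustration:

```python
# Hypothetical toy database (not the thesis' running example).
DB = [{"a": 5, "b": 2, "c": 1},
      {"a": 10, "c": 6, "e": 3},
      {"b": 4, "c": 2, "e": 1}]
ORDER = "abcde"  # assumed total order on items

def utility(itemset, db):
    """Total utility of itemset over the transactions containing it."""
    return sum(sum(T[i] for i in itemset)
               for T in db if all(i in T for i in itemset))

def search(alpha, db, minutil, out):
    """Depth-first search: extend alpha with every item after its last item."""
    start = max((ORDER.index(i) for i in alpha), default=-1) + 1
    for z in ORDER[start:]:
        beta = alpha | {z}
        if utility(beta, db) >= minutil:
            out.append(frozenset(beta))   # beta is a high-utility itemset
        search(beta, db, minutil, out)    # recurse into beta's sub-tree
    return out
```

With minutil = 15 this toy run outputs {a}, {a, c} and {a, c, e}.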
Chapter 5
Parallel Computing
With the advent of big data, there is a need for large-scale parallel computation to find
solutions in a short time, taking advantage of cheap commodity resources. Parallel
computation can be categorized in different ways, but at the level of hardware
parallelism there are generally two types: shared memory and non-shared memory (distributed
systems) [DG08]. In shared-memory computation, multiple processors concurrently
access a common shared memory. This model is efficient and easy to develop for, but
it requires a large shared memory and does not scale beyond a single machine. In
non-shared-memory computation, each processor has its own local memory and
communicates with the others by passing messages through an interconnection network.
This model is usually more scalable and more efficient than the shared-memory model.
In the field of big data mining, there is a need to analyze, process and extract
information from very large datasets. However, a single machine imposes computational
limits on how much data can be handled, which affects the scalability of the
implemented algorithm. Therefore, to process huge amounts of data and extract
meaningful information, distributed systems are used. Different distributed computing
frameworks are available to take advantage of this scalability.
The most commonly known distributed computing framework is Apache Hadoop [had].
Apache Hadoop provides a reliable, scalable, distributed computing solution, which
is used by many companies, including Yahoo! and Facebook.
There are two main parts in the core of Apache Hadoop, the storage part and the pro-
cessing part. The storage part, also known as Hadoop Distributed File System (HDFS),
stores data by splitting them into blocks and distributing them amongst different nodes
in the cluster. Each block of a file is usually replicated, and stored in several different
nodes, so that data loss in HDFS is very rare in case of hardware failure. The process-
ing part, also known as MapReduce, uses two procedures, map and reduce, for parallel
processing of data. The map and reduce procedures are called mappers and reducers
respectively. In mappers, a set of data is converted into tuples (key-value pairs), while
reducers take output from mappers and combine tuples into smaller sets of tuples, by
aggregating tuples with the same key into a single tuple. HDFS and MapReduce are
inspired by the ideas proposed by Google based on the Google File System (GFS) and
their proprietary MapReduce technology.
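The map/reduce flow described above can be imitated in a few lines of plain Python (no Hadoop involved), using the classic word-count example: mappers emit key-value tuples, and reducers aggregate all values sharing a key after a grouping ("shuffle") step:

```python
from itertools import groupby

# Mappers emit (key, value) tuples; reducers aggregate values per key.
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    return (key, sum(values))

def map_reduce(lines):
    pairs = [kv for line in lines for kv in mapper(line)]  # map phase
    pairs.sort(key=lambda kv: kv[0])                       # "shuffle": group by key
    return dict(reducer(k, (v for _, v in grp))            # reduce phase
                for k, grp in groupby(pairs, key=lambda kv: kv[0]))
```

For instance, map_reduce(["a b a", "b c"]) yields {"a": 2, "b": 2, "c": 1}.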
However, Apache Hadoop has some drawbacks which limit the performance
and flexibility of the algorithms implemented on it. On the one hand, the MapReduce
paradigm requires that each mapper is followed by a reducer, and they must be pro-
grammed in a strictly predefined way. On the other hand, each pair of mapper and
reducer in Apache Hadoop has to read data from disks, and write results back to disks,
which results in a bottleneck in its performance.
In order to deal with these two drawbacks of Apache Hadoop, another distributed com-
puting framework, Apache Spark [spaa], was developed. Instead of the two-stage disk-
based MapReduce paradigm introduced in Apache Hadoop, Apache Spark uses a data
abstraction known as Resilient Distributed Datasets (RDD). RDDs are read-only,
partitioned collections of records, which are created by reading from data storage or by
transforming other RDDs [MZ12]. An RDD holds references to Partition objects, where
each Partition object is a subset of the dataset represented by this RDD. RDDs are usu-
ally not in materialized form. Instead, if an RDD A is transformed from another RDD
B, we only need the information of the transformation and the RDD B, in order to derive
the RDD A. As a result, RDDs are only materialized when they are asked to perform a
reduce operation, which aggregates data in different nodes to a single machine. Apache
Spark loads data into the memories of machines in a cluster as RDDs, and uses them
repeatedly for data processing tasks. Apache Spark also allows programmers to have
arbitrary mappers and reducers in any order, providing a much more flexible API for its
users. In an Apache Spark cluster, there is one Master node and several Worker nodes.
The Master node is responsible for allocating resources and assigning tasks to Worker
nodes. Apache Spark, however, is only an alternative to the MapReduce component of
Apache Hadoop; HDFS remains a state-of-the-art open-source distributed data storage
framework. Apache
Spark has interfaces with different types of data storage, including HDFS, Cassandra
[cas], OpenStack Swift [swi], etc. Apache Spark is able to read from these types of data
storage for data processing, which makes Spark more popular. Therefore, in this work,
Apache Spark is used as the main platform for the proposed algorithm.
The key programming abstraction in Spark is RDDs, which are fault-tolerant collec-
tions of objects partitioned across a cluster that can be manipulated in parallel. Users
create RDDs by applying operations called "transformations" (such as map, filter, and
groupBy) to their data.
Spark exposes RDDs through a functional programming API in Scala, Java, Python,
and R, where users can simply pass local functions to run on the cluster. For example,
the following Scala code creates an RDD representing the error messages in a log file, by
searching for lines that start with ERROR, and then prints the total number of errors:
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())
The first line defines an RDD backed by a file in the Hadoop Distributed File System
(HDFS) as a collection of lines of text. The second line calls the filter transformation to
derive a new RDD from lines. Its argument is a Scala function literal or closure. Finally,
the last line calls count, another type of RDD operation called an "action" that returns
a result to the program (here, the number of elements in the RDD) instead of defining
a new RDD.
Spark evaluates RDDs lazily, allowing it to find an efficient plan for the user’s compu-
tation. In particular, transformations return a new RDD object representing the result
of a computation but do not immediately compute it. When an action is called, Spark
looks at the whole graph of transformations used to create an execution plan. For ex-
ample, if there were multiple filter or map operations in a row, Spark can fuse them into
one pass, or, if it knows that data is partitioned, it can avoid moving it over the net-
work for groupBy [FA14]. Users can thus build up programs modularly without losing
performance.
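The effect of lazy evaluation can be imitated with Python generators standing in for RDD transformations: nothing is computed when the pipeline is defined, and the chained filters are effectively fused into a single pass when the action consumes it. A sketch, with hypothetical log lines:

```python
# Hypothetical log lines; generators stand in for lazy RDD transformations.
log = ["ERROR MySQL down", "INFO ok", "ERROR disk full"]

errors = (line for line in log if line.startswith("ERROR"))  # lazy, no work yet
mysql = (line for line in errors if "MySQL" in line)         # still lazy

# The "action" triggers a single fused pass over the data.
count = sum(1 for _ in mysql)
```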
Finally, RDDs provide explicit support for data sharing among computations. By default,
RDDs are "ephemeral" in that they get recomputed each time they are used in an
action (such as count). However, users can also persist selected RDDs in memory for
rapid reuse. (If the data does not fit in memory, Spark will also spill it to disk.) For
example, a user searching through a large set of log files in HDFS to debug a problem might
load just the error messages into memory across the cluster by calling errors.persist().
After this, the user can run a variety of queries on the in-memory data:
// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL")).count()
// Fetch back the time fields of errors that mention PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP")).map(line => line.split('\t')(3)).collect()
This data sharing is the main difference between Spark and previous computing models
like MapReduce; otherwise, the individual operations (such as map and groupBy) are
similar. Data sharing provides large speedups, often as much as 100x, for interactive
queries and iterative algorithms [XS13]. It is also the key to Spark’s generality, as we
discuss later.
Fault tolerance. Apart from providing data sharing and a variety of parallel operations,
RDDs also automatically recover from failures. Traditionally, distributed computing
systems have provided fault tolerance through data replication or checkpointing. Spark
uses a different approach called "lineage" [MZ12]. Each RDD tracks the graph of
transformations that was used to build it and reruns these operations on base data to
Figure 5.1: Lineage graph for the third query in our example; boxes represent RDDs,
and arrows represent transformations
reconstruct any lost partitions. For example, Figure 5.1 shows the RDDs in our previous
query, where we obtain the time fields of errors mentioning PHP by applying two filters
and a map. If any partition of an RDD is lost (for example, if a node holding an
in-memory partition of errors fails), Spark will rebuild it by applying the filter on the
corresponding block of the HDFS file. For ”shuffle” operations that send data from all
nodes to all other nodes (such as reduceByKey), senders persist their output data locally
in case a receiver fails.
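The lineage idea can be illustrated with a toy Python class (an assumed API, not Spark's): each dataset remembers its parent and the transformation that produced it, so a lost partition can be recomputed from base data by replaying the transformation:

```python
# Toy illustration of lineage-based recovery (an assumed API, not Spark's).
class Dataset:
    def __init__(self, base=None, parent=None, fn=None):
        self.base, self.parent, self.fn = base, parent, fn
        self.cache = None          # materialized data; may be "lost"

    def filter(self, fn):
        """Record the transformation and parent instead of computing now."""
        return Dataset(parent=self, fn=fn)

    def compute(self):
        """Materialize, replaying the lineage if the cache was lost."""
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.base)
            else:
                self.cache = [x for x in self.parent.compute() if self.fn(x)]
        return self.cache

lines = Dataset(base=["ERROR a", "INFO b", "ERROR c"])
errors = lines.filter(lambda s: s.startswith("ERROR"))
errors.compute()               # materialize the partition
errors.cache = None            # simulate a node failure losing it
recovered = errors.compute()   # rebuilt by replaying the filter on base data
```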
Integration with storage systems. Much like Google’s MapReduce, Spark is designed to
be used with multiple external systems for persistent storage. Spark is most commonly
used with cluster file systems like HDFS and key-value stores like S3 and Cassandra.
It can also connect with Apache Hive as a data catalog. RDDs usually store only tem-
porary data within an application, though some applications (such as the Spark SQL
JDBC server) also share RDDs across multiple users [AXL+15]. Spark's design as a
storage-system-agnostic engine makes it easy for users to run computations against ex-
isting data and join diverse data sources.
5.2.2 Applications
Apache Spark is used in a wide range of applications. Surveys of Spark users
have identified more than 1,000 companies using Spark, in areas from Web services to
biotechnology to finance. In academia, we have also seen applications in several scientific
domains. Across these workloads, we find users take advantage of Spark’s generality and
often combine multiple of its libraries. Here, we cover a few top use cases. Presentations
on many use cases are also available on the Spark Summit conference website [spab].
Batch processing. Spark’s most common applications are for batch processing on
large datasets, including Extract-Transform-Load workloads to convert data from a raw
format (such as log files) to a more structured format and offline training of machine
learning models. Published examples of these workloads include page personalization
and recommendation at Yahoo!; managing a data lake at Goldman Sachs; graph mining
at Alibaba; financial Value at Risk calculation; and text mining of customer feedback at
Toyota. The largest published use case we are aware of is an 8,000-node cluster at Chi-
nese social network Tencent that ingests 1PB of data per day. [Xin, R. and Zaharia, M.
Lessons from running largescale Spark workloads; http://tinyurl.com/largescale-spark]
While Spark can process data in memory, many of the applications in this category run
only on disk. In such cases, Spark can still improve performance over MapReduce due
to its support for more complex operator graphs.
Interactive queries. Interactive use of Spark falls into three main classes. First,
organizations use Spark SQL for relational queries, often through business intelligence
tools like Tableau. Examples include eBay and Baidu. Second, developers and data
scientists can use Spark’s Scala, Python, and R interfaces interactively through shells or
visual notebook environments. Such interactive use is crucial for asking more advanced
questions and for designing models that eventually lead to production applications and
is common in all deployments. Third, several vendors have developed domain-specific
interactive applications that run on Spark. Examples include Tresata (anti-money laun-
dering), Trifacta (data cleaning), and PanTera (largescale visualization).
Stream processing. Real-time processing is also a popular use case, both in analytics
and in real-time decision making applications. Published use cases for Spark Streaming
include network security monitoring at Cisco, prescriptive analytics at Samsung SDS,
and log mining at Netflix. Many of these applications also combine streaming with batch
and interactive queries. For example, video company Conviva uses Spark to continuously
maintain a model of content distribution server performance, querying it automatically
when it moves clients across servers, in an application that requires substantial parallel
work for both model maintenance and queries.
Scientific applications. Spark has also been used in several scientific domains, in-
cluding large-scale spam detection [TS11], image processing [ZP15], and genomic data
processing [NP15]. One example that combines batch, interactive, and stream processing
is the Thunder platform for neuroscience at Howard Hughes Medical Institute, Janelia
Farm [FA14]. It is designed to process brain-imaging data from experiments in real time,
scaling up to 1TB/hour of whole-brain imaging data from organisms (such as zebrafish
and mice). Using Thunder, researchers can apply machine learning algorithms (such as
clustering and Principal Component Analysis) to identify neurons involved in specific
behaviors. The same code can be run in batch jobs on data from previous runs or in
interactive queries during live experiments.
Deployment environments. We also see growing diversity in where Apache Spark appli-
cations run and what data sources they connect to. While the first Spark deployments
were generally in Hadoop environments, only 40 percent of deployments in our July
2015 Spark survey were on the Hadoop YARN cluster manager. In addition, 52% of
respondents ran Spark on a public cloud.
Chapter 6
PEFIM
In this chapter, we present a parallel high utility itemset mining algorithm, named
PEFIM (Parallel EFficient high-utility Itemset Mining), which parallelizes the
state-of-the-art high utility itemset mining algorithm EFIM [SZ15]. The EFIM algorithm is
divided into two procedures: the main procedure and the search procedure.
The main procedure remains essentially the same as in EFIM. It (Algorithm 1)
takes as input a transaction database and the minutil threshold. The algorithm initially
considers that the current itemset α is the empty set (line 1). The algorithm then scans
the database once to calculate the local utility of each item w.r.t. α, using a utility-bin
array (line 2). Note that in the case where α = ∅, the local utility of an item is its TWU.
Then, the local utility of each item is compared with minutil to obtain the secondary
items w.r.t. α, that is, the items that should be considered in extensions of α (line 3).
Then, these items are sorted by ascending order of TWU, and that order is thereafter
used as the total order ≻ on items (line 4). The database is then scanned once to remove
all items that are not secondary items w.r.t. α, since they cannot be part of any
high-utility itemset by Theorem 4.1 (line 5). At the same time, items in each transaction are
sorted according to ≻, and if a transaction becomes empty, it is removed from the database.
Then, the database is scanned again to sort transactions by the ≻_T order to allow O(nw)
transaction merging, thereafter (line 6). Then, the algorithm scans the database again
to calculate the sub-tree utility of each secondary item w.r.t. α, using a utility-bin array
(lines 7 and 8). Thereafter, the algorithm divides the search space and assigns a part of
it to each node in a cluster. Each node is only responsible for mining its assigned search
space; in other words, the workload is split among the different nodes in the cluster.
In PEFIM, each node is assigned one or more sub-tasks. For example, in Figure 6.1, if
there are 2 nodes in the cluster, the items are divided into 2 groups and assigned to different
nodes. Assuming items a, e, d are assigned to node 1, and b, c are assigned to node 2,
node 1 will be responsible for mining all the itemsets containing item a, the itemsets
containing item e but not a, and the itemsets containing item d but neither a nor e, while
node 2 will be responsible for mining the itemsets containing item b but none of a, e, d,
and the itemsets containing item c but none of a, e, d, b. The algorithm performs this
partitioning with the map function. Each node of the cluster runs the search procedure.
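The split described above can be sketched in Python (hypothetical helper functions, following the two-node example; nodes are numbered from 0 here rather than from 1):

```python
ORDER = ["a", "e", "d", "b", "c"]   # assumed ascending-TWU processing order

def assign(order, n_nodes):
    """Split the ordered items into contiguous groups, one per node."""
    size = -(-len(order) // n_nodes)  # ceiling division
    return {n: order[n * size:(n + 1) * size] for n in range(n_nodes)}

def owner(itemset, order, groups):
    """A node mines exactly the itemsets whose earliest item it owns."""
    first = min(itemset, key=order.index)
    return next(n for n, items in groups.items() if first in items)

groups = assign(ORDER, 2)   # node 0 gets a, e, d; node 1 gets b, c
```

For instance, owner({"e", "c"}, ORDER, groups) is 0: the itemset contains e but not a, so it belongs to the first node, matching the example above.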
The search procedure performs a loop to consider each single-item extension of α of the
form β = α ∪ {i}, where i is a primary item w.r.t. α (since only these single-item extensions
of α should be explored, according to Theorem 4.2) (lines 1 to 9). For each such extension β,
a database scan is performed to calculate the utility of β and, at the same time, construct
the β projected database (line 3). Note that transaction merging is performed whilst the
β projected database is constructed. If the utility of β is no less than minutil, β is output
as a high-utility itemset (line 4). Then, the database is scanned again to calculate the
sub-tree and local utility w.r.t. β of each item z that could extend β (the secondary items
w.r.t. α), using two utility-bin arrays (line 5). This allows determining the primary
and secondary items of β (lines 6 and 7). Then, the Search procedure is recursively called
with β to continue the depth-first search by extending β (line 8). Based on the properties
and theorems presented in the previous sections, it can be seen that when the search
procedure terminates, all and only the high-utility itemsets have been output. When all
the nodes in the cluster finish their tasks, the algorithm collects all the HUIs.
The overall flow diagram of the PEFIM parallel algorithm is shown in Figure 6.2. It starts
by reading the dataset from the file, after which the algorithm calculates the local utility
of each item. The items are sorted by ascending order of TWU. After that, the algorithm
scans the database to calculate the sub-tree utility of each secondary item.
The dataset is divided into different blocks to be distributed among the worker nodes.
The worker nodes work on their blocks of the file, using the map operation to run the
search procedure. Each worker node computes its high utility itemsets. Finally, the
results obtained from the worker nodes are combined to give the aggregated high utility
itemsets.
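This flow can be imitated in miniature with a thread pool standing in for Spark's map over worker nodes. The following is an assumed toy miner on a hypothetical database, not the actual implementation: each "node" exhaustively mines only the sub-trees rooted at its assigned items, and the driver unions the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy database and parameters (not the thesis' datasets).
DB = [{"a": 5, "b": 2, "c": 1},
      {"a": 10, "c": 6, "e": 3},
      {"b": 4, "c": 2, "e": 1}]
ORDER = "abce"   # assumed processing order
MINUTIL = 15

def utility(itemset):
    """Total utility of itemset over all transactions containing it."""
    return sum(sum(T[i] for i in itemset)
               for T in DB if all(i in T for i in itemset))

def mine_subtree(root):
    """Exhaustively mine all itemsets whose first item (in ORDER) is root."""
    found = []
    def rec(alpha):
        if utility(alpha) >= MINUTIL:
            found.append(frozenset(alpha))
        last = max(ORDER.index(i) for i in alpha)
        for z in ORDER[last + 1:]:
            rec(alpha | {z})
    rec({root})
    return found

with ThreadPoolExecutor(max_workers=2) as pool:   # stand-in for worker nodes
    partials = pool.map(mine_subtree, ORDER)      # "map" the search procedure
    huis = set().union(*partials)                 # aggregate the partial HUIs
```

With MINUTIL = 15 this toy run aggregates {a}, {a, c} and {a, c, e}.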
Experimental Results
The experiments were performed with our proposed algorithm (PEFIM) and the EFIM
algorithm to find high utility itemsets on a machine with 16 GB of main memory and an
Intel Core i7-6700HQ CPU @ 3.40 GHz (4 cores), running the Ubuntu 16.04 Linux
operating system. The Spark application was written in Oracle Java 1.8 with Spark
framework version 2.0.2.
Both algorithms first read the database into main memory, then search for high-utility
itemsets, and write the run-time results to disk. Since the input and output are the
same for both algorithms, the cost of disk accesses has no influence on the results of the
experiments. The algorithms were implemented in Java, and memory measurements were
done using the standard Java API. Experiments were performed using a set of synthetic
datasets, chosen because they have varied characteristics.
7.1 Datasets
The experiments were performed on multiple real-world-based datasets [fim12], generated
using the database generator of SPMF [FVGG+14]. Our experiments were conducted on
relatively small datasets of 1,000 and 10,000 transactions and on relatively large datasets
of 100,000 and 1,000,000 transactions. The characteristics of the datasets are shown in
Table 7.1. For each threshold ratio of a dataset, the experiment was executed 30 times
and the average was taken.
The PEFIM algorithm was compared with the EFIM algorithm in terms of computational
time and the quantity of physical memory used to find HUIs.
In this section, we compare our algorithm (PEFIM) with the EFIM algorithm on all
datasets. Experiments were conducted to show the effectiveness of our algorithm and
of the approach taken to improve performance. We first compare execution times by
running both algorithms on each dataset while progressively decreasing the minutil
threshold until the algorithms became too slow, ran out of memory, or a clear winner
was observed. Execution times are shown in the following figures.
Figure 7.1: Time to find HUI having 75 distinct items and up to 20 items per trans-
action
From Figure 7.1a, we see that EFIM performs better than the PEFIM algorithm.
But when the number of transactions is increased (Figures 7.1b, 7.1c and 7.1d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is dense.
Figure 7.2: Time to find HUI having 120 distinct items and up to 50 items per transaction
From Figure 7.2a, we see that EFIM performs better than the PEFIM algorithm.
But when the number of transactions is increased (Figures 7.2b, 7.2c and 7.2d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is dense.
Figure 7.3: The dataset for these figures has the following characteristics: 400 distinct
items and the max length per transaction is 70
From Figure 7.3a, we see that EFIM performs better than the PEFIM algorithm.
But when the number of transactions is increased (Figures 7.3b, 7.3c and 7.3d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is dense.
Figure 7.4: The dataset for these figures has the following characteristics: 500 distinct
items and the max length per transaction is 10
From Figures 7.4a, 7.4b and 7.4c, we see that EFIM performs better than the PEFIM
algorithm. But when the number of transactions is increased (Figure 7.4d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is sparse.
Figure 7.5: The dataset for these figures has the following characteristics: 1500 dis-
tinct items and the max length per transaction is 15
From Figures 7.5a, 7.5b and 7.5c, we see that EFIM performs better than the PEFIM
algorithm. Even when the number of transactions is increased (Figure 7.5d) for the
same dataset, the proposed algorithm (PEFIM) does not show significant improvement
over EFIM when the dataset type is sparse.
Figure 7.6: The dataset for these figures has the following characteristics: 40000
distinct items and the max length per transaction is 20
From Figures 7.6a, 7.6b and 7.6c, we see that EFIM performs better than the PEFIM
algorithm. But when the number of transactions is increased (Figure 7.6d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is sparse.
Figure 7.7: The dataset for these figures has the following characteristics: 45000
distinct items and the max length per transaction is 15
From Figures 7.7a, 7.7b and 7.7c, we see that EFIM performs better than the PEFIM
algorithm. But when the number of transactions is increased (Figure 7.7d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is sparse.
A first observation is that PEFIM on average outperforms EFIM when the minutil
threshold is very low (0.01) and either the number of transactions is in the order of ten
thousands (10^4) and the dataset type is dense, or the number of transactions is in the
order of millions (10^6) and the dataset type is sparse. PEFIM is in general about two
to three times faster than EFIM in those scenarios.
We also compared PEFIM with EFIM in terms of the computational resources needed,
specifically physical memory. Experiments were conducted to show how much memory
PEFIM needs compared to EFIM to reach the same results. We compared memory
consumption by running both algorithms on each dataset while progressively decreasing
the minutil threshold until the algorithms became too slow, ran out of memory, or a
clear winner was observed. Memory consumption is shown in the following figures.
Figure 7.8: The dataset for these graphs has the following characteristics: 75 distinct
items and the max length per transaction is 20
Figure 7.9: The dataset for these graphs has the following characteristics: 120 distinct
items and the max length per transaction is 50
Figure 7.10: The dataset for these graphs has the following characteristics: 400
distinct items and the max length per transaction is 70
Figure 7.11: The dataset for these graphs has the following characteristics: 500
distinct items and the max length per transaction is 10
Figure 7.12: The dataset for these graphs has the following characteristics: 1500
distinct items and the max length per transaction is 10
Figure 7.13: The dataset for these graphs has the following characteristics: 2000
distinct items and the max length per transaction is 150
Figure 7.14: The dataset for these graphs has the following characteristics: 40000
distinct items and the max length per transaction is 15
Figure 7.15: The dataset for these graphs has the following characteristics: 45000
distinct items and the max length per transaction is 15
Looking at the figures, it is not obvious that EFIM outperforms PEFIM in terms
of memory efficiency on all datasets. Their behavior varies considerably and depends
on the nature of the dataset and on the threshold ratio values. We therefore cannot
say for sure that EFIM will always outperform PEFIM, but the larger the threshold
ratio, the more efficiently EFIM behaves. The reason why PEFIM consumes more
physical memory when the threshold ratio is large is that each node has its own local
tree to explore and keeps all its data in memory to be more time-efficient.
Chapter 8
Conclusions
In this work, PEFIM was proposed. PEFIM is a novel distributed approach to mining
high utility itemsets. The Spark framework was used for the distributed computing
because of its advantages over the Hadoop framework: Spark uses in-memory
computation, which is much faster than the disk-dependent Hadoop framework. PEFIM
divides the search space among different worker nodes, which compute high utility
itemsets that are then aggregated to produce the result.
Extensive experiments were conducted to evaluate the proposed algorithm. The
experimental results on different datasets show that, with transactions in the order of
millions or more on sparse datasets, or transactions in the order of ten thousands on
dense datasets, PEFIM gains a significant performance improvement in terms of
computational time. After comparing PEFIM against EFIM, the experimental results
show that PEFIM is a promising algorithm for processing large volumes of data.
Although PEFIM performs much better than EFIM for large datasets under the above
conditions, PEFIM does not perform very well against EFIM when the datasets are
small.
Chapter 9
Future work
Several future works were identified: the use of cloud computing infrastructure for
testing both algorithms with bigger datasets, and the implementation of load balancing
techniques to improve the assignment of the search space partitions to the different
worker nodes. Additionally, exploring other approaches to parallelizing EFIM is also
proposed as future work.
Bibliography
[AE08] A. Erwin, R.P. Gopalan, and N.R. Achuthan. Efficient mining of high utility
itemsets from large datasets. Advances in Knowledge Discovery, Springer
Lecture Notes in Computer Science, 5012:554–561, 2008.
[Agr94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large
databases. In Proc. Int. Conf. Very Large Databases, pages 487–499. IEEE
Computer Society, 1994.
[AXL+ 15] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu,
Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin,
Ali Ghodsi, and Matei Zaharia. Spark SQL: Relational data processing in
Spark. In Proceedings of the 2015 ACM SIGMOD International Conference
on Management of Data, SIGMOD ’15, pages 1383–1394, New York, NY,
USA, 2015. ACM.
[BLV09] B. Le, H. Nguyen, T.A. Cao, and B. Vo. A novel algorithm for mining
high utility itemsets. First Asian Conference on Intelligent Information
and Database Systems, pages 13–17, 2009.
[cas] http://cassandra.apache.org/.
[DG08] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data process-
ing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[FVGG+ 14] Philippe Fournier-Viger, Antonio Gomariz, Ted Gueniche, Azadeh Soltani,
Cheng-Wei Wu, and Vincent S. Tseng. SPMF: A Java open-source pattern
mining library. J. Mach. Learn. Res., 15(1):3389–3393, January 2014.
[GL14] G.C. Lan, T.P. Hong, J.P. Huang, and V.S. Tseng. An efficient projection-based
indexing approach for mining high utility itemsets. Knowledge and Infor-
mation Systems, 38:85–107, 2014.
[had] http://hadoop.apache.org/.
[HY04] H. Yao, H.J. Hamilton, and C.J. Butz. A foundational approach to mining itemset
utilities from databases. pages 482–486, 2004.
[HY06] H. Yao and H.J. Hamilton. Mining itemset utilities from transaction databases.
Data and Knowledge Engineering, 59:603–626, 2006.
[HY07] H. Yao, H.J. Hamilton, and L. Geng. A unified framework for utility-based mea-
sures for mining itemsets. ACM international conference on utility-based
Data Mining Workshop (UBDM), pages 28–37, 2007.
[JLF12] J. Liu, K. Wang, and B.C.M. Fung. Direct discovery of high utility itemsets
without candidate generation. 2012 IEEE 12th International Conference
on Data Mining, pages 984–989, 2012.
[PFV14] P. Fournier-Viger, C.W. Wu, S. Zida, and V.S. Tseng. FHM: Faster high-utility
itemset mining using estimated utility co-occurrence pruning. In Andreasen
T., Christiansen H., Cubero J.C., Raś Z.W. (eds), Foundations of Intelligent
Systems, 8502, 2014.
[SG16] S.M. Guo and H. Gao. HUITWU: An efficient algorithm for high-utility itemset
mining in transaction databases. Journal of Computer Science and Technol-
ogy, 31(4):776–786, 2016.
[Son14] W. Song, Y. Liu, and J. Li. BAHUI: Fast and memory efficient mining of high
utility itemsets based on bitmap. Intern. Journal of Data Warehousing and
Mining, 10(1):1–15, 2014.
[spaa] https://spark.apache.org/.
[spab] http://www.sparksummit.org.
[swi] http://www.openstack.org/.
[TS11] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and evaluation of
a real-time URL spam filtering service. In Proceedings of the IEEE Symposium
on Security and Privacy, pages 22–25, 2011.
[UY14] U. Yun, H. Ryang, and K.H. Ryu. High utility itemset mining with techniques
for reducing overestimated utilities and pruning candidates. Expert Systems
with Applications, 41, 2014.
[VTY10] V.S. Tseng, C. Wu, B. Shie, and P.S. Yu. UP-Growth: An efficient algorithm
for high utility itemset mining. pages 253–262, 2010.
[WSL14] W. Song, Y. Liu, and J. Li. Mining high utility itemsets by dynamically
pruning the tree structure. Applied Intelligence, 40:29–43, 2014.
[XS13] R.S. Xin, J. Rosen, M. Zaharia, M.J. Franklin, S. Shenker, and I. Stoica. Shark:
SQL and rich analytics at scale. In Proceedings of the ACM SIGMOD-
/PODS Conference, pages 22–27. ACM, 2013.
[YL05] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algo-
rithm. In Proceedings of the Utility-Based Data Mining Workshop, 2005.
[YL08] Y. Li, J. Yeh, and C. Chang. Isolated items discarding strategy for discovering
high utility itemsets. Data & Knowledge Engineering, 64:198–217, 2008.
[ZP15] Z. Zhang, K. Barbary, N.A. Nothaft, E. Sparks, O. Zahn, M.J. Franklin,
D.A. Patterson, and S. Perlmutter. Scientific computing meets big data tech-
nology: An astronomy use case. In Proceedings of IEEE International
Conference on Big Data, 2015.