PEFIM: A Parallel Efficient Algorithm for High Utility Itemset Mining
Authors:
Juan Francisco Figueredo
Alexis Eduardo Ojeda Davalos
Advisors:
Diego Pinto
Wilfrido Inchausti
2018
Dedication
Juan Figueredo
Alexis Ojeda
Acknowledgements
Special thanks to thesis advisors Wilfrido Inchausti and Diego Pinto for their guidance
and teaching.
To our professors, colleagues and friends who were part of this important experience.
Abstract
Data mining can be defined as the activity of extracting new, non-trivial information contained in large databases. Traditional data mining techniques have focused largely on detecting statistical correlations between the items that appear most frequently in transaction databases. These techniques, also termed frequent itemset mining, are based on the rationale that itemsets which appear more frequently must be more important to the user from a business perspective.
High utility itemset mining is an important data mining problem that considers not only the frequency of the itemsets but also the utility associated with them. The term utility refers to the importance or usefulness of the appearance of an itemset in transactions, quantified in terms such as profit, sales or other user preferences. Existing research focuses on reducing the computational time through the introduction of pruning strategies. Another aspect of high utility itemset mining is the processing of large datasets, an aspect that is rarely explored.
This work presents a distributed approach that divides the search space among different worker nodes, which compute high utility itemsets that are then aggregated to obtain the result. The experimental results show a significant improvement in execution time when computing high utility itemsets on large datasets.
Contents
Dedication
Acknowledgements
Abstract
Contents
List of Figures
1 Introduction
1.1 Thesis Objectives
1.2 Thesis Organization
2 Problem statement
2.1 High-Utility Itemset Mining
3 Related Work
3.1 State-of-the-art algorithms
3.1.1 High Utility Itemset Mining
3.2 Discussion
4 Efficient High-Utility Itemset Mining
5 Parallel Computing
5.1 Apache Hadoop
5.2 Apache Spark
5.2.1 Programming Model
5.2.2 Applications
6 PEFIM
6.1 Main Procedure
6.2 The Search procedure
6.3 Overall Flow of the PEFIM Algorithm
7 Experimental Results
7.1 Datasets
7.2 PEFIM vs. EFIM
7.2.1 Comparison of Computational Time
7.2.2 Comparison of Computational Resources (Physical Memory)
8 Conclusions
9 Future work
References
List of Figures
5.1 Lineage graph for the third query in our example; boxes represent RDDs and arrows represent transformations
7.1 Time to find HUIs in a dataset with 75 distinct items and up to 20 items per transaction
7.2 Time to find HUIs in a dataset with 120 distinct items and up to 50 items per transaction
7.3 The dataset for these figures has 400 distinct items and a maximum transaction length of 70
7.4 The dataset for these figures has 500 distinct items and a maximum transaction length of 10
7.5 The dataset for these figures has 1500 distinct items and a maximum transaction length of 15
7.6 The dataset for these figures has 40000 distinct items and a maximum transaction length of 20
7.7 The dataset for these figures has 45000 distinct items and a maximum transaction length of 15
7.8 The dataset for these graphs has 75 distinct items and a maximum transaction length of 20
7.9 The dataset for these graphs has 120 distinct items and a maximum transaction length of 50
7.10 The dataset for these graphs has 400 distinct items and a maximum transaction length of 70
7.11 The dataset for these graphs has 500 distinct items and a maximum transaction length of 10
7.12 The dataset for these graphs has 1500 distinct items and a maximum transaction length of 10
7.13 The dataset for these graphs has 2000 distinct items and a maximum transaction length of 150
7.14 The dataset for these graphs has 40000 distinct items and a maximum transaction length of 15
7.15 The dataset for these graphs has 45000 distinct items and a maximum transaction length of 15
List of Tables
Acronyms and Symbols
Chapter 1
Introduction
Data mining and knowledge discovery is the process of discovering and extracting information or patterns, revealing potentially useful information from large databases. Among the many ways of discovering knowledge in databases, association rule mining is a form of data mining that extracts interesting correlations, frequent patterns, associations or causal structures among sets of items in databases. Discovering useful patterns hidden in a database plays an essential role in several data mining tasks, such as frequent pattern mining, weighted frequent pattern mining and high utility pattern mining. Among them, frequent pattern mining is a fundamental research topic that has been applied to different kinds of databases, such as transactional databases, streaming databases and time series databases, with various application domains, including decision support, market strategy, financial forecasting, bio-informatics, web click-stream analysis, and mobile environments [UKM14].
Frequent itemset mining (FIM) [Agr94] is a popular data mining task that is essential to a wide range of applications. Given a transaction database, FIM consists of discovering frequent itemsets, i.e. groups of items (itemsets) appearing frequently in transactions [Agr94, Uno04]. Many popular algorithms have been proposed for this problem, such as Apriori, FPGrowth, LCM and Eclat. These algorithms take as input a transaction database and a parameter called the minimum support threshold, "minsup". They then return all sets of items (itemsets) that appear in at least minsup transactions.
The problem of frequent itemset mining is popular, but it has some important limitations when it comes to analyzing customer transactions. An important limitation is that purchase quantities are not taken into account: an item may only appear once or not at all in a transaction. Thus, whether a customer has bought five loaves, ten loaves or twenty loaves, it is viewed as the same.
A second important limitation is that all items are viewed as having the same importance, utility or weight. For example, whether a customer buys a very expensive bottle of wine or a cheap piece of bread, it is viewed as equally important.
Thus, frequent itemset mining may find many frequent itemsets that are not interesting.
For example, one may find that {bread, milk} is a frequent itemset. However, from
a business perspective, this pattern may be uninteresting because it does not generate
much profit. Moreover, frequent itemset mining algorithms may miss the rare patterns
that generate a high profit such as perhaps {caviar, wine}.
To address this issue, the problem of FIM has been redefined as the problem of High-
Utility Itemset Mining (HUIM) [FV14, Lan14, Liu05, Son14, Tse13]. In this problem,
a transaction database contains transactions where purchase quantities are taken into
account as well as the unit profit of each item.
The goal of high-utility itemset mining is to find the itemsets (groups of items) that generate a high profit in a database when their items are sold together. The user has to provide a value for a threshold called "minutil" (the minimum utility threshold). A high-utility itemset mining algorithm outputs all the high-utility itemsets, that is, the itemsets that generate at least "minutil" profit.
Different pruning approaches have been introduced so far to reduce the number of candidate sets generated. However, these state-of-the-art algorithms perform well only when the dataset is small: when the size of the dataset increases, preliminary experiments show that their performance degrades. Therefore, in the current era of big data, there is a need to process datasets across multiple machines, which is possible through distributed computing.
The Hadoop framework follows a disk-based paradigm and is heavily dependent on its Hadoop Distributed File System (HDFS). Another framework, named Spark [spaa], was introduced to overcome this heavy dependency on HDFS by allowing in-memory computation. The Spark framework can perform up to 100 times faster than Hadoop. Spark uses Resilient Distributed Datasets (RDDs), immutable data structures that allow efficient reuse of data through in-memory computation.
To deal with the research challenges associated with the HUIM problem mentioned above, the following objectives have been delineated:
General Objective:
Specific Objectives:
Chapter 3 presents related work and the main motivations of this work. Chapter 2 summarizes the presented HUIM problem formulation. Chapter 4 describes the fastest state-of-the-art algorithm for solving the formulated problem, while Chapter 5 explains how the state-of-the-art framework for parallel computing works and Chapter 6 shows how the EFIM algorithm is combined with the Spark framework. Chapter 7 presents the obtained results and the main findings of the experimental comparison. Finally, conclusions and future work are left to Chapters 8 and 9.
Chapter 2
Problem statement
The problem of high-utility itemset mining is defined as follows. Let I be a finite set of items (symbols). An itemset X is a finite set of items such that X ⊆ I. A transaction database is a multiset of transactions D = {T1, T2, ..., Tn} such that for each transaction Tc, Tc ⊆ I and Tc has a unique identifier c called its TID (Transaction ID). Each item i ∈ I is associated with a positive number p(i), called its external utility (e.g. unit profit). Every item i appearing in a transaction Tc has a positive number q(i, Tc), called its internal utility (e.g. purchase quantity). For example, consider the database in Table 2.1, which will be used as the running example. It contains five transactions (T1, T2, ..., T5). Transaction T2 indicates that items a, c, e and g appear in this transaction with internal utilities of 2, 6, 2 and 5, respectively. Table 2.2 indicates that the external utilities of these items are 5, 1, 3 and 1, respectively.
Table 2.1: A transaction database

TID | Transaction
T1  | (a, 1) (c, 1) (d, 1)
T2  | (a, 2) (c, 6) (e, 2) (g, 5)
T3  | (a, 1) (b, 2) (c, 1) (d, 6) (e, 1) (f, 5)
T4  | (b, 4) (c, 3) (d, 3) (e, 1)
T5  | (b, 2) (c, 2) (e, 1) (g, 2)
Table 2.2: External utilities of the items

Item   | a | b | c | d | e | f | g
Profit | 5 | 2 | 1 | 2 | 3 | 1 | 1
Definition (Utility of an item/itemset). The utility of an item i in a transaction Tc is denoted as u(i, Tc) and defined as u(i, Tc) = p(i) × q(i, Tc) if i ∈ Tc. The utility of an itemset X in a transaction Tc is denoted as u(X, Tc) and defined as u(X, Tc) = Σ_{i∈X} u(i, Tc) if X ⊆ Tc. The utility of an itemset X is denoted as u(X) and defined as u(X) = Σ_{Tc∈g(X)} u(X, Tc), where g(X) is the set of transactions containing X.
For example, the utility of item a in T2 is u(a, T2 ) = 5×2 = 10. The utility of the itemset
{a, c} in T2 is u({a, c}, T2 ) = u(a, T2 ) + u(c, T2 ) = 5 × 2 + 1 × 6 = 16. Furthermore, the
utility of the itemset {a, c} is u({a, c}) = u({a, c}, T1 ) + u({a, c}, T2 ) + u({a, c}, T3 ) =
u(a, T1 ) + u(c, T1 ) + u(a, T2 ) + u(c, T2 ) + u(a, T3 ) + u(c, T3 ) = 5 + 1 + 10 + 6 + 5 + 1 = 28.
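These definitions can be checked directly on the running example. The sketch below is plain Python with the database and profit table transcribed from Tables 2.1 and 2.2; the function names (`u_item`, `u_in_transaction`, `u`) are illustrative choices, not notation from the thesis:

```python
# Transaction database of Table 2.1: TID -> {item: internal utility q(i, Tc)}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
# External utilities (unit profits) of Table 2.2
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}

def u_item(i, t):
    """Utility of item i in transaction t: p(i) x q(i, t)."""
    return PROFIT[i] * t[i]

def u_in_transaction(X, t):
    """Utility of itemset X in transaction t (0 if X is not contained in t)."""
    if not set(X) <= set(t):
        return 0
    return sum(u_item(i, t) for i in X)

def u(X):
    """Utility of itemset X over the whole database."""
    return sum(u_in_transaction(X, t) for t in DB.values())

print(u_item("a", DB["T2"]))                   # 10
print(u_in_transaction({"a", "c"}, DB["T2"]))  # 16
print(u({"a", "c"}))                           # 28, as computed above
```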
An itemset X is a high-utility itemset if its utility u(X) is no less than a minimum utility threshold "minutil" set by the user. For example, if minutil = 30, the high-utility itemsets in the database of the running example are {b, d}, {a, c, e}, {b, c, d}, {b, c, e}, {b, d, e}, {b, c, d, e} and {a, b, c, d, e, f}, with utilities of 30, 31, 34, 31, 36, 40 and 30, respectively.
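This result can be verified with a brute-force enumeration over the running example. The sketch below is an illustration only (no real HUIM algorithm enumerates all 2^|I| itemsets); the database and profits are transcribed from Tables 2.1 and 2.2:

```python
from itertools import combinations

# Running example: Table 2.1 (database) and Table 2.2 (unit profits)
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}

def utility(X):
    """Utility of itemset X, summed over the transactions containing X."""
    return sum(
        sum(PROFIT[i] * t[i] for i in X)
        for t in DB.values()
        if set(X) <= set(t)
    )

def high_utility_itemsets(minutil):
    """Naive enumeration of every non-empty itemset (feasible only for tiny I)."""
    items = sorted(PROFIT)
    huis = {}
    for k in range(1, len(items) + 1):
        for X in combinations(items, k):
            v = utility(X)
            if v >= minutil:
                huis[frozenset(X)] = v
    return huis

huis = high_utility_itemsets(30)
print(len(huis))                 # 7 high-utility itemsets, as listed above
print(huis[frozenset("bd")])     # 30
```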
Chapter 3
Related Work
Several research works have been proposed in the high utility itemset mining literature, studying the problem in both parallel and non-parallel environments. The sections below detail the main existing approaches described in relevant articles, focusing on the motivation of the presented work.
This section presents a brief overview of the various algorithms, concepts and approaches that have been defined in various research publications.
Yao et al. in [HY04] define the problem of utility mining formally. The work defines the
terms transaction utility and external utility of an itemset. The mathematical model of
utility mining was then defined based on the two properties of utility bound and support
bound.
The utility bound property provides an upper bound on the utility value of any itemset. This property can be used as a heuristic measure for pruning, at early stages, itemsets that are not expected to qualify as high utility itemsets.
Yao et al. in [HY06] define the utility mining problem as a case of constraint mining. This work shows that the downward closure property used in the standard Apriori algorithm and the convertible constraint property are not directly applicable to the utility mining problem. The authors also present two pruning strategies to reduce the cost of finding high utility itemsets. By exploiting these pruning strategies, they develop two algorithms, UMining and UMining H. UMining finds all itemsets with utility values higher than minutil in a database. UMining H finds most itemsets with utility values higher than minutil based on a heuristic pruning strategy. The effectiveness of the algorithms was demonstrated by applying them to synthetic and real-world databases. UMining may be preferable to UMining H because it guarantees the discovery of all high utility itemsets.
Yao et al. in [HY07] classify utility measures into three categories, namely item level, transaction level and cell level. A unified utility function is defined to represent all existing utility-based measures.
High utility frequent itemsets contribute the most to a predefined utility, objective function or performance metric [J.H07]. Hu et al. in [J.H07] present the HYP (High-Yield Partition trees) algorithm for frequent itemset mining, which identifies high utility item combinations. The algorithm is designed to find segments of data, defined through combinations of a few items (rules), that satisfy certain conditions as a group and maximize a predefined objective function. The authors formulate the task as an optimization problem, present an efficient approximation to solve it through specialized partition trees called high-yield partition trees, and investigate the performance of various splitting techniques. The algorithm has been tested on real-world data with promising results.
Li et al. in [YL08] propose the Isolated Items Discarding Strategy (IIDS), which can be applied to any level-wise utility mining method to further reduce the number of redundant candidates. In each pass, a utility mining method with IIDS scans a database that is smaller than the original by skipping isolated items, efficiently improving performance. Their study focuses on the task of efficiently discovering all high utility itemsets.
Liu et al. in [YL05] propose a two-phase algorithm for finding high utility itemsets that efficiently prunes the number of candidates and precisely obtains the complete set of high utility itemsets. In the first phase, they propose a model that applies the "transaction-weighted downward closure property" to the search space to expedite the identification of candidates. In the second phase, one extra database scan is performed to identify the high utility itemsets. They verify their algorithm by applying it to both synthetic and real databases.
Ahmed et al. in [CA09] propose three novel tree structures to efficiently perform incremental and interactive high utility pattern mining. The first tree structure, the Incremental HUP (High Utility Pattern) Lexicographic Tree (IHUPL-Tree), is arranged according to an item's lexicographic order. It can capture incremental data without any restructuring operation. The second tree structure, the IHUP Transaction Frequency Tree (IHUPTF-Tree), obtains a compact size by arranging items according to their transaction frequency (descending order). To reduce the mining time, the third tree, the IHUP Transaction-Weighted Utilization Tree (IHUPTWU-Tree), is designed based on the TWU value of items in descending order.
Shankar et al. in [SS09] present a novel algorithm, Fast Utility Mining (FUM), which finds all high utility itemsets within the given utility constraint threshold. The FUM algorithm is built upon a novel approach that helps rectify and avoid the pitfalls that usually occur when using the UMining algorithm. FUM demonstrates appreciable semantic intelligence by considering only the distinct itemsets involved or defined in a transaction and not the entire set of available itemsets.
Moreover, the authors suggest a technique to generate different types of itemsets, such as High Utility and High Frequency (HUHF), High Utility and Low Frequency (HULF), Low Utility and High Frequency (LUHF) and Low Utility and Low Frequency (LULF).
Pillai et al. in [JP10] present a new foundational approach to temporal weighted itemset
mining where item utility value is allowed to be dynamic within a specified period of
time, unlike traditional approaches where value is static within those times. The authors
incorporate a fuzzy model where item utilities can be assumed to be fuzzy values.
Ramaraju et al. in [CR11] present a novel algorithm, CHUT (Conditional High Utility Tree), to mine high utility itemsets in two steps. The first step compresses the transaction database to reduce the search space. The second step uses a newly proposed algorithm, HU-Mine, to mine the complete set of high utility itemsets. The proposed algorithm needs only two database scans, in contrast to the many scans required by existing algorithms.
Liu et al. in [JLF12] observed that state-of-the-art works on utility mining all employ a two-phase, candidate-generation approach, which suffers from scalability issues due to the large number of candidates. Therefore, they propose an algorithm called d2HUP (Direct Discovery of High Utility Patterns) that works in a single phase without generating candidates. Their basic approach is to enumerate itemsets by prefix extensions, to prune the search space by utility upper bounding, and to maintain original utility information in the mining process through a novel data structure.
Tseng et al. in [VTY10] observed that the huge number of potential high utility itemsets poses a challenge to mining performance, since a higher processing cost is incurred when more potential high utility itemsets are generated. To address this issue, they propose an efficient algorithm called Utility Pattern Growth (UP-Growth) for mining high utility itemsets, together with a set of techniques for pruning candidate itemsets. The information on high utility itemsets is maintained in a special data structure named the Utility Pattern Tree (UP-Tree), such that candidates can be generated with only two scans of the database.
are computed from all items and considering multiple of the minimum utility as a user
specified threshold value.
To identify high utility itemsets, most existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms suffer from the problem that a very large number of candidates are generated, but most of the candidates are found not to be high utility after their exact utilities are computed. Liu et al. in [ML12] propose an algorithm called HUI-Miner (High Utility Itemset Miner) for high utility itemset mining. HUI-Miner uses a novel structure, called the utility-list, to store both the utility information about an itemset and the heuristic information for pruning the search space of HUI-Miner. By avoiding the costly generation and utility computation of numerous candidate itemsets, HUI-Miner can efficiently mine high utility itemsets from the utility-lists constructed from a mined database.
Tseng et al. in [Tse13] propose a new algorithm called UP-Growth+ (Utility Pattern Growth+) for reducing overestimated utilities more effectively and enhancing the performance of utility mining. In UP-Growth, the minimum item utility is used to reduce the overestimated utilities. In UP-Growth+, the minimal node utilities in each path are used to make the estimated pruning values closer to the real utility values of the pruned items in the database.
Many algorithms suffer from generating a large number of candidate itemsets, which degrades mining performance, so Yun et al. propose in [UY14] an algorithm named MU-Growth (Maximum Utility Growth) with two techniques for effectively pruning candidates during the mining process. Moreover, they suggest a tree structure, named the MIQ-Tree (Maximum Item Quantity Tree), which captures database information in a single pass. The proposed data structure is restructured to reduce overestimated utilities.
implement dynamic CHUI-Tree pruning, and discuss the rationality thereof. The CHUI-
Mine algorithm makes use of a concurrent strategy, enabling simultaneous construction
of a CHUI-Tree and the discovery of high utility itemsets. Their algorithm reduces
the problem of huge memory usage for tree construction and traversal in tree-based
algorithms for mining high utility itemsets.
Most of the existing approaches are based on the principle of level-wise processing, as in the traditional two-phase utility mining algorithm, to find high utility itemsets. Lan et al. propose in [GL14] an efficient utility mining approach, namely the projection-based (PB) mining approach, which adopts an indexing mechanism to speed up the execution and reduce the memory requirement of the mining process. The indexing mechanism can imitate traditional projection algorithms to achieve the aim of projecting sub-databases for mining. In addition, a pruning strategy is also applied to reduce the number of unpromising itemsets in mining.
Zida et al. in [SZ15] propose a novel algorithm named EFIM (EFficient high-utility Itemset Mining), which introduces several new ideas to discover high-utility itemsets more efficiently, both in terms of execution time and memory. EFIM relies on two upper-bounds, named sub-tree utility and local utility, to more effectively prune the search space. It also introduces a novel array-based utility counting technique, named Fast Utility Counting, to calculate these upper-bounds in linear time and space. Moreover, to reduce the cost of database scans, EFIM proposes efficient database projection and transaction merging techniques.
for transactions with the same items to reduce the size of database, and directly calcu-
lates the utilities of itemsets based on HUITWU-Tree and conditional HUITWU-Trees
without scanning original database.
Lin et al. in [JL16] present an efficient algorithm named FHN (Faster High-utility itemset miner with Negative unit profits). It relies on a novel PNU-list (Positive-and-Negative Utility-list) structure to efficiently mine high utility itemsets while considering both positive and negative unit profits.
3.2 Discussion
As seen in the previous section, there are many algorithms for finding high utility itemsets. Over the years, improvements have emerged to mine HUIs more efficiently, but mostly by reducing computational time through the introduction of pruning strategies, not by targeting large datasets. Therefore, this work extends the most efficient algorithm of the state-of-the-art, EFIM [SZ15], to propose a distributed approach that divides the search space among different worker nodes, which compute high utility itemsets that are aggregated to find the result.
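The idea of dividing the search space can be illustrated with a toy, purely sequential simulation in plain Python. This is a deliberate simplification, not PEFIM itself: each "partition" holds the itemsets that share a given smallest item, and in the real algorithm such partitions would be distributed as Spark tasks rather than processed in a loop. The database is the running example of Chapter 2:

```python
from itertools import combinations

# Running example: Table 2.1 (database) and Table 2.2 (unit profits)
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}
ITEMS = sorted(PROFIT)

def utility(X):
    """Utility of itemset X, summed over the transactions containing X."""
    return sum(
        sum(PROFIT[i] * t[i] for i in X)
        for t in DB.values()
        if set(X) <= set(t)
    )

def mine_partition(first, minutil):
    """One 'worker' explores only the itemsets whose smallest item is `first`."""
    later = [i for i in ITEMS if i > first]
    found = {}
    for k in range(len(later) + 1):
        for rest in combinations(later, k):
            X = (first,) + rest
            v = utility(X)
            if v >= minutil:
                found[X] = v
    return found

def mine(minutil):
    # In PEFIM the partitions run on different worker nodes and the partial
    # results are aggregated; here they are simply processed one by one.
    result = {}
    for first in ITEMS:
        result.update(mine_partition(first, minutil))
    return result

huis = mine(30)
print(len(huis))  # 7, matching the running example with minutil = 30
```

Because every itemset has exactly one smallest item, the partitions cover the search space without overlap, so the aggregated result equals the sequential result.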
Chapter 4
Efficient High-Utility Itemset Mining
This chapter presents EFIM, a one-phase algorithm that introduces several novel ideas to reduce the time and memory required for HUIM. It is also the basis of the proposed PEFIM algorithm. The content of this chapter is taken from [SZ15].
Let ≻ be any total order on the items of I. According to this order, the search space of all itemsets 2^I can be represented as a set-enumeration tree. For example, the set-enumeration tree of I = {a, b, c, d} for the lexicographical order is shown in Fig. 4.1. EFIM explores this search space using a depth-first search starting from the root. During this depth-first search, for any itemset α, EFIM recursively appends one item at a time to α according to the ≻ order to generate larger itemsets. The ≻ order is defined as the order of increasing TWU because it generally reduces the search space for HUIM.
Definition (Items that can extend an itemset). Let α be an itemset. Let E(α) denote the set of all items that can be used to extend α according to the depth-first search, that is E(α) = {z | z ∈ I ∧ z ≻ x, ∀x ∈ α}.
EFIM performs database scans to calculate the utility of itemsets and upper-bounds on their utility. To reduce the cost of database scans, it is desirable to reduce the database size. When an itemset is considered during the depth-first search, all items that do not belong to that itemset can be ignored when scanning the database to calculate the utility of itemsets in its sub-tree, or upper-bounds on their utility. A database without these items is called a projected database.
For example, consider database D (see Table 2.1) of the running example and α = {b}.
The projected database α-D contains three transactions: α-T3 = {c, d, e, f }, α-T4 =
{c, d, e} and α-T5 = {c, e, g}.
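The projection step can be sketched as follows. This is a simplified, copy-based version: EFIM actually uses offset-based pseudo-projections and orders items by increasing TWU, whereas the lexicographical order is used here because it reproduces the example above:

```python
# Running example database (Table 2.1): TID -> {item: internal utility}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}

def project(db, item):
    """Projected database for alpha = {item}: for every transaction containing
    `item`, keep only the items that come after it in the order."""
    projected = {}
    for tid, t in db.items():
        if item in t:
            rest = {i: q for i, q in t.items() if i > item}
            if rest:
                projected[tid] = rest
    return projected

pdb = project(DB, "b")
print(sorted(pdb))        # ['T3', 'T4', 'T5']
print(sorted(pdb["T3"]))  # ['c', 'd', 'e', 'f'], matching alpha-T3 above
```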
Database projections generally greatly reduce the cost of database scans, since transactions become smaller as larger itemsets are explored. EFIM sorts each transaction in the original database according to the total order ≻ beforehand. A projection is then performed as a pseudo-projection, that is, each projected transaction is represented by an offset pointer into the corresponding original transaction. The complexity of calculating the projection of a database is linear in time and space with respect to the number of transactions. The proposed database projection technique generalizes the concept of database projection from pattern mining to the case of transactions with internal/external utility values.
To further reduce the cost of database scans, EFIM also introduces an efficient transaction merging technique. It is based on the observation that transaction databases often contain identical transactions. The technique consists of identifying these transactions and replacing them with single transactions. In this context, a transaction Ta is said to be identical to a transaction Tb if it contains the same items as Tb (i.e. Ta = Tb), but not necessarily the same internal utility values.
To achieve a higher database reduction, the algorithm also merges transactions in projected databases. This generally achieves a much higher reduction, because projected transactions are smaller than original transactions and thus are more likely to be identical.
For example, consider database D of our running example and α = {c}. The projected
database α-D contains transactions α-T1 = {d}, α-T2 = {e, g}, α-T3 = {d, e, f }, α-
T4 = {d, e} and α-T5 = {e, g}. Transactions α-T2 and α-T5 can be replaced by a new
transaction TM = {e, g} where q(e, TM ) = 3 and q(g, TM ) = 7.
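A sketch of the merging step, again copy-based and using the lexicographical order for readability (EFIM's actual implementation is the linear-time, sort-based scheme described below):

```python
# Running example database (Table 2.1): TID -> {item: internal utility}
DB = {
    "T1": {"a": 1, "c": 1, "d": 1},
    "T2": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T3": {"a": 1, "b": 2, "c": 1, "d": 6, "e": 1, "f": 5},
    "T4": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}

def project(db, item):
    """Projected database for alpha = {item} (items after `item` in the order)."""
    projected = {}
    for tid, t in db.items():
        if item in t:
            rest = {i: q for i, q in t.items() if i > item}
            if rest:
                projected[tid] = rest
    return projected

def merge_identical(transactions):
    """Replace transactions containing exactly the same items by a single
    transaction whose internal utilities are the sums of the originals."""
    merged = {}
    for t in transactions:
        key = tuple(sorted(t))
        if key in merged:
            for i, q in t.items():
                merged[key][i] += q
        else:
            merged[key] = dict(t)
    return merged

pdb = project(DB, "c")
merged = merge_identical(pdb.values())
print(merged[("e", "g")])  # {'e': 3, 'g': 7}: alpha-T2 and alpha-T5 merged
```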
The fourth case is that otherwise Ta ≻_T Tb. For example, let the transactions Tx = {b, c}, Ty = {a, b, c} and Tz = {a, b, e}. We have that Tz ≻_T Ty ≻_T Tx. A database sorted according to the ≻_T order provides the following property.
Using the above property, all identical transactions in a (projected) database can be
identified by only comparing each transaction with the next transaction in the database.
Thus, using this scheme, transaction merging can be done very efficiently by scanning a
(projected) database only once (linear time). It is interesting to note that transaction
merging as proposed in EFIM cannot be implemented efficiently in utility-list based algorithms (e.g. FHM and HUP-Miner) and hyperlink-based algorithms (e.g. d2HUP)
due to their database representations.
EFIM introduces an effective pruning mechanism with two new upper-bounds on the utility of itemsets, named sub-tree utility and local utility. The key difference with previous upper-bounds is that they are defined w.r.t. the sub-tree of an itemset α in the set-enumeration tree.
Definition (Sub-tree utility). Let α be an itemset and z an item that can extend α according to the depth-first search (z ∈ E(α)). The sub-tree utility of z w.r.t. α is su(α, z) = Σ_{T∈g(α∪{z})} [u(α, T) + u(z, T) + Σ_{i∈T ∧ i∈E(α∪{z})} u(i, T)].
Example 2. Consider the running example and α = {a}. We have that su(α, c) =
(5 + 1 + 2) + (10 + 6 + 11) + (5 + 1 + 20) = 61, su(α, d) = 25 and su(α, e) = 34.
Theorem 1 (Pruning a sub-tree using the sub-tree utility). Let α be an itemset and z ∈ E(α) an item. If su(α, z) < minutil, then the itemset α ∪ {z} and all the itemsets in its sub-tree are low-utility. Using Theorem 1 we can prune some sub-trees of an itemset, which reduces the number of sub-trees to be explored. To further reduce the search space, we also identify items that should not be explored in any sub-tree.
Definition (Local utility). Let α be an itemset and z ∈ E(α) an item. The local utility of z w.r.t. α is lu(α, z) = Σ_{T∈g(α∪{z})} [u(α, T) + re(α, T)], where re(α, T) denotes the remaining utility of α in T, i.e. the sum of the utilities of the items of T that can extend α.
Example 3. Consider the running example and α = {a}. We have that lu(α, c) =
(8 + 27 + 30) = 65, lu(α, d) = 30 and lu(α, e) = 57.
Property 3 (Overestimation using the local utility). Let α be an itemset and z ∈ E(α) an item. Let Z be an extension of α such that z ∈ Z. The relationship lu(α, z) ≥ u(Z) holds.
Theorem 2 (Pruning an item from all sub-trees using the local utility). Let
α be an itemset and z ∈ E(α) an item. If lu(α, z) < minutil, then all extensions of α
containing z are low-utility. In other words, item z can be ignored when exploring all
sub-trees of α. The relationship between the proposed upper-bounds and the main ones
used in previous work is the following: u(Y) ≤ su(α, z) = reu(Y) ≤ lu(α, z) ≤ TWU(z).
Given the above relationship, it can be seen that the proposed local utility upper-bound
is a tighter upper-bound on the utility of Y and its extensions compared to the TWU,
which is commonly used in two-phase HUIM algorithms. Thus, the local utility can be
more effective for pruning the search space. Besides, one may ask what the difference
is between the proposed su upper-bound and the reu upper-bound of HUI-Miner and
FHM, since they are mathematically equivalent. The major difference is that su is calculated
when the depth-first search is at itemset α in the search tree rather than at the child
itemset Y . Thus, if su(α, z) < minutil, EFIM prunes the whole sub-tree of z including
node Y rather than only pruning the descendant nodes of Y. Thus, using su instead of
reu is more effective for pruning the search space. In the rest of this work, for a given
itemset α, we respectively refer to items having a sub-tree utility and a local utility no
less than minutil as primary and secondary items.
The primary items of α are defined as Primary(α) = {z | z ∈ E(α) ∧ su(α, z) ≥ minutil},
and the secondary items of α as Secondary(α) = {z | z ∈ E(α) ∧ lu(α, z) ≥ minutil}.
Because lu(α, z) ≥ su(α, z), Primary(α) ⊆ Secondary(α).
For example, consider the running example and α = {a}: Primary(α) = {c, e} and
Secondary(α) = {c, d, e}.
Calculating the TWU of all items: for each transaction T of the database, the utility-bin
U[z] for each item z ∈ T is updated as U[z] = U[z] + TU(T). At the end of the database
scan, for each item k ∈ I, the utility-bin U[k] contains TWU(k).
Calculating the sub-tree utility w.r.t. an itemset α: for each transaction T of the
database, the utility-bin U[z] for each item z ∈ T ∩ E(α) is updated as
U[z] = U[z] + u(α, T) + u(z, T) + Σ_{i ∈ T ∧ i ≻ z} u(i, T). Thereafter, U[k] = su(α, k) for all k ∈ I.
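The utility-bin technique amounts to a single database pass that accumulates one value per item. A minimal Python sketch of the TWU scan is shown below, on an assumed toy database (the sub-tree utility scan follows the same single-pass pattern with a different increment per item):

```python
# Sketch of the utility-bin scan on an assumed toy database (not the
# thesis' running example).
DB = [{"a": 5, "b": 2, "c": 1},
      {"a": 10, "c": 6, "e": 3},
      {"b": 4, "c": 2, "e": 1}]

def compute_twu(db):
    """One database scan: U[z] accumulates TU(T) for every transaction T
    containing z, so U[z] = TWU(z) when the scan ends."""
    U = {}
    for T in db:
        tu = sum(T.values())          # TU(T): transaction utility
        for z in T:
            U[z] = U.get(z, 0) + tu   # utility-bin update
    return U

U = compute_twu(DB)   # {'a': 27, 'b': 15, 'c': 34, 'e': 26}
```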
The main procedure (Algorithm 3) takes as input a transaction database and the minutil
threshold. The algorithm initially considers that the current itemset α is the empty set.
The algorithm then scans the database once to calculate the local utility of each item
w.r.t. α, using a utility-bin array. Note that in the case where α = ∅, the local utility
of an item is its TWU. Then, the local utility of each item is compared with minutil
to obtain the secondary items w.r.t to α, that is items that should be considered in
extensions of α. Then, these items are sorted by ascending order of TWU and that
order is thereafter used as the order (as suggested in [2,7]). The database is then
scanned once to remove all items that are not secondary items w.r.t to α since they
cannot be part of any high-utility itemsets by Theorem 2. If a transaction becomes
empty, it is removed from the database. Then, the database is scanned again to sort
transactions by the T order to allow O(n) transaction merging, thereafter. Then, the
algorithm scans the database again to calculate the sub-tree utility of each secondary
item w.r.t. α, using a utility-bin array. Thereafter, the algorithm calls the recursive
procedure Search to perform the depth-first search starting from α.
The Search procedure considers each single-item extension of α of the form β = α ∪ {i},
where i is a primary item w.r.t. α. For each such extension β, a database scan is
performed to calculate the utility of β and, at the same time, construct the β projected
database. Note that transaction merging is performed whilst the β projected database
is constructed. If the utility of β is no less than minutil, β is output as a high-utility
itemset. Then, the database is scanned again to calculate the sub-tree and local utility
w.r.t. β of each item z that could extend β (the secondary items w.r.t. α), using two
utility-bin arrays. This allows determining the primary and secondary items of β. Then,
the Search procedure is recursively called with β to continue the depth-first search by
extending β. Based on the properties and theorems presented in the previous sections,
it can be seen that when EFIM terminates, all and only the high-utility itemsets have
been output.
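The overall depth-first enumeration can be illustrated with a heavily simplified sequential sketch in Python. It omits EFIM's database projection, transaction merging and su/lu pruning, so it is exhaustive rather than efficient; the toy database and the minutil value are assumptions for illustration:

```python
# Hypothetical toy database (not the thesis' running example).
DB = [{"a": 5, "b": 2, "c": 1},
      {"a": 10, "c": 6, "e": 3},
      {"b": 4, "c": 2, "e": 1}]
ORDER = "abcde"  # assumed total order on items

def utility(itemset, db):
    """Total utility of itemset over the transactions containing it."""
    return sum(sum(T[i] for i in itemset)
               for T in db if all(i in T for i in itemset))

def search(alpha, db, minutil, out):
    """Depth-first search: extend alpha with every item after its last item."""
    start = max((ORDER.index(i) for i in alpha), default=-1) + 1
    for z in ORDER[start:]:
        beta = alpha | {z}
        if utility(beta, db) >= minutil:
            out.append(frozenset(beta))   # beta is a high-utility itemset
        search(beta, db, minutil, out)    # recurse into beta's sub-tree
    return out
```

With minutil = 15 this toy run outputs {a}, {a, c} and {a, c, e}.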
Chapter 5
Parallel Computing
With the advent of big data, there is a need for large-scale parallel computation to find
solutions in a short time, taking advantage of cheap commodity resources. Parallel
computation can be categorized in different ways, but at the level of hardware
parallelism there are generally two types: shared memory and non-shared memory (distributed
systems) [DG08]. In shared-memory computation, multiple processors concurrently
access a common shared memory. This model is efficient and easy to develop for, but
it requires a large shared memory and does not scale beyond a single machine. In
non-shared-memory computation, each processor has its own local memory and
communicates with the others by passing messages through an interconnection network.
This model is usually more scalable and more efficient than the shared-memory model.
In the field of big data mining, there is a need to analyze, process and extract
information from very large datasets. However, a single machine imposes computational
limits on how much data can be handled, which affects the scalability of the
implemented algorithm. Therefore, to process huge amounts of data and extract
meaningful information, distributed systems are used. Different distributed computing
frameworks are available to take advantage of this scalability.
The most commonly known distributed computing framework is Apache Hadoop [had].
Apache Hadoop provides a reliable, scalable, distributed computing solution, which
is used by many companies, including Yahoo! and Facebook.
There are two main parts in the core of Apache Hadoop, the storage part and the pro-
cessing part. The storage part, also known as Hadoop Distributed File System (HDFS),
stores data by splitting them into blocks and distributing them amongst different nodes
in the cluster. Each block of a file is usually replicated, and stored in several different
nodes, so that data loss in HDFS is very rare in case of hardware failure. The process-
ing part, also known as MapReduce, uses two procedures, map and reduce, for parallel
processing of data. The map and reduce procedures are called mappers and reducers
respectively. In mappers, a set of data is converted into tuples (key-value pairs), while
reducers take output from mappers and combine tuples into smaller sets of tuples, by
aggregating tuples with the same key into a single tuple. HDFS and MapReduce are
inspired by the ideas proposed by Google based on the Google File System (GFS) and
their proprietary MapReduce technology.
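The map/reduce flow described above can be imitated in a few lines of plain Python (no Hadoop involved), using the classic word-count example: mappers emit key-value tuples, and reducers aggregate all values sharing a key after a grouping ("shuffle") step:

```python
from itertools import groupby

# Mappers emit (key, value) tuples; reducers aggregate values per key.
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    return (key, sum(values))

def map_reduce(lines):
    pairs = [kv for line in lines for kv in mapper(line)]  # map phase
    pairs.sort(key=lambda kv: kv[0])                       # "shuffle": group by key
    return dict(reducer(k, (v for _, v in grp))            # reduce phase
                for k, grp in groupby(pairs, key=lambda kv: kv[0]))
```

For instance, map_reduce(["a b a", "b c"]) yields {"a": 2, "b": 2, "c": 1}.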
However, Apache Hadoop has some drawbacks which limit the performance
and flexibility of the algorithms implemented on it. On the one hand, the MapReduce
paradigm requires that each mapper is followed by a reducer, and they must be pro-
grammed in a strictly predefined way. On the other hand, each pair of mapper and
reducer in Apache Hadoop has to read data from disks, and write results back to disks,
which results in a bottleneck in its performance.
In order to deal with these two drawbacks of Apache Hadoop, another distributed com-
puting framework, Apache Spark [spaa], was developed. Instead of the two-stage disk-
based MapReduce paradigm introduced in Apache Hadoop, Apache Spark uses a data
abstraction known as Resilient Distributed Datasets (RDD). RDDs are read-only,
partitioned collections of records, which are created by reading from data storage or by
transforming other RDDs [MZ12]. An RDD holds references to Partition objects, where
each Partition object is a subset of the dataset represented by this RDD. RDDs are usu-
ally not in materialized form. Instead, if an RDD A is transformed from another RDD
B, we only need the information of the transformation and the RDD B, in order to derive
the RDD A. As a result, RDDs are only materialized when they are asked to perform a
reduce operation, which aggregates data in different nodes to a single machine. Apache
Spark loads data into the memories of machines in a cluster as RDDs, and uses them
repeatedly for data processing tasks. Apache Spark also allows programmers to have
arbitrary mappers and reducers in any order, providing a much more flexible API for its
users. In an Apache Spark cluster, there is one Master node and several Worker nodes.
The Master node is responsible for allocating resources and assigning tasks to Worker
nodes. Apache Spark, however, is only an alternative to the MapReduce component of
Apache Hadoop; HDFS remains a state-of-the-art open-source distributed data storage
framework. Apache
Spark has interfaces with different types of data storage, including HDFS, Cassandra
[cas], OpenStack Swift [swi], etc. Apache Spark is able to read from these types of data
storage for data processing, which makes Spark more popular. Therefore, in this work,
Apache Spark is used as the main platform for the proposed algorithm.
The key programming abstraction in Spark is RDDs, which are fault-tolerant collec-
tions of objects partitioned across a cluster that can be manipulated in parallel. Users
create RDDs by applying operations called "transformations" (such as map, filter, and
groupBy) to their data.
Spark exposes RDDs through a functional programming API in Scala, Java, Python,
and R, where users can simply pass local functions to run on the cluster. For example,
the following Scala code creates an RDD representing the error messages in a log file, by
searching for lines that start with ERROR, and then prints the total number of errors:
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())
The first line defines an RDD backed by a file in the Hadoop Distributed File System
(HDFS) as a collection of lines of text. The second line calls the filter transformation to
derive a new RDD from lines. Its argument is a Scala function literal or closure. Finally,
the last line calls count, another type of RDD operation called an "action" that returns
a result to the program (here, the number of elements in the RDD) instead of defining
a new RDD.
Spark evaluates RDDs lazily, allowing it to find an efficient plan for the user’s compu-
tation. In particular, transformations return a new RDD object representing the result
of a computation but do not immediately compute it. When an action is called, Spark
looks at the whole graph of transformations used to create an execution plan. For ex-
ample, if there were multiple filter or map operations in a row, Spark can fuse them into
one pass, or, if it knows that data is partitioned, it can avoid moving it over the net-
work for groupBy [FA14]. Users can thus build up programs modularly without losing
performance.
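The effect of lazy evaluation can be imitated with Python generators standing in for RDD transformations: nothing is computed when the pipeline is defined, and the chained filters are effectively fused into a single pass when the action consumes it. A sketch, with hypothetical log lines:

```python
# Hypothetical log lines; generators stand in for lazy RDD transformations.
log = ["ERROR MySQL down", "INFO ok", "ERROR disk full"]

errors = (line for line in log if line.startswith("ERROR"))  # lazy, no work yet
mysql = (line for line in errors if "MySQL" in line)         # still lazy

# The "action" triggers a single fused pass over the data.
count = sum(1 for _ in mysql)
```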
Finally, RDDs provide explicit support for data sharing among computations. By default,
RDDs are "ephemeral" in that they get recomputed each time they are used in an
action (such as count). However, users can also persist selected RDDs in memory for
rapid reuse. (If the data does not fit in memory, Spark will also spill it to disk.) For
example, a user searching through a large set of log files in HDFS to debug a problem might
load just the error messages into memory across the cluster by calling errors.persist().
After this, the user can run a variety of queries on the in-memory data:
// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL")).count()
// Fetch back the time fields of errors that mention PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP")).map(line => line.split('\t')(3)).collect()
This data sharing is the main difference between Spark and previous computing models
like MapReduce; otherwise, the individual operations (such as map and groupBy) are
similar. Data sharing provides large speedups, often as much as 100x, for interactive
queries and iterative algorithms [XS13]. It is also the key to Spark’s generality, as we
discuss later.
Fault tolerance. Apart from providing data sharing and a variety of parallel operations,
RDDs also automatically recover from failures. Traditionally, distributed computing
systems have provided fault tolerance through data replication or checkpointing. Spark
uses a different approach called "lineage" [MZ12]. Each RDD tracks the graph of
transformations that was used to build it and reruns these operations on base data to
Figure 5.1: Lineage graph for the third query in our example; boxes represent RDDs,
and arrows represent transformations
reconstruct any lost partitions. For example, Figure 5.1 shows the RDDs in our previous
query, where we obtain the time fields of errors mentioning PHP by applying two filters
and a map. If any partition of an RDD is lost (for example, if a node holding an
in-memory partition of errors fails), Spark will rebuild it by applying the filter on the
corresponding block of the HDFS file. For ”shuffle” operations that send data from all
nodes to all other nodes (such as reduceByKey), senders persist their output data locally
in case a receiver fails.
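The lineage idea can be illustrated with a toy Python class (an assumed API, not Spark's): each dataset remembers its parent and the transformation that produced it, so a lost partition can be recomputed from base data by replaying the transformation:

```python
# Toy illustration of lineage-based recovery (an assumed API, not Spark's).
class Dataset:
    def __init__(self, base=None, parent=None, fn=None):
        self.base, self.parent, self.fn = base, parent, fn
        self.cache = None          # materialized data; may be "lost"

    def filter(self, fn):
        """Record the transformation and parent instead of computing now."""
        return Dataset(parent=self, fn=fn)

    def compute(self):
        """Materialize, replaying the lineage if the cache was lost."""
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.base)
            else:
                self.cache = [x for x in self.parent.compute() if self.fn(x)]
        return self.cache

lines = Dataset(base=["ERROR a", "INFO b", "ERROR c"])
errors = lines.filter(lambda s: s.startswith("ERROR"))
errors.compute()               # materialize the partition
errors.cache = None            # simulate a node failure losing it
recovered = errors.compute()   # rebuilt by replaying the filter on base data
```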
Integration with storage systems. Much like Google’s MapReduce, Spark is designed to
be used with multiple external systems for persistent storage. Spark is most commonly
used with cluster file systems like HDFS and key-value stores like S3 and Cassandra.
It can also connect with Apache Hive as a data catalog. RDDs usually store only tem-
porary data within an application, though some applications (such as the Spark SQL
JDBC server) also share RDDs across multiple users [AXL+15]. Spark's design as a
storage-system-agnostic engine makes it easy for users to run computations against ex-
isting data and join diverse data sources.
5.2.2 Applications
Apache Spark is used in a wide range of applications. Surveys of Spark users
have identified more than 1,000 companies using Spark, in areas from Web services to
biotechnology to finance. In academia, we have also seen applications in several scientific
domains. Across these workloads, we find users take advantage of Spark’s generality and
often combine multiple of its libraries. Here, we cover a few top use cases. Presentations
on many use cases are also available on the Spark Summit conference website [spab].
Batch processing. Spark’s most common applications are for batch processing on
large datasets, including Extract-Transform-Load workloads to convert data from a raw
format (such as log files) to a more structured format and offline training of machine
learning models. Published examples of these workloads include page personalization
and recommendation at Yahoo!; managing a data lake at Goldman Sachs; graph mining
at Alibaba; financial Value at Risk calculation; and text mining of customer feedback at
Toyota. The largest published use case we are aware of is an 8,000-node cluster at Chi-
nese social network Tencent that ingests 1PB of data per day. [Xin, R. and Zaharia, M.
Lessons from running largescale Spark workloads; http://tinyurl.com/largescale-spark]
While Spark can process data in memory, many of the applications in this category run
only on disk. In such cases, Spark can still improve performance over MapReduce due
to its support for more complex operator graphs.
Interactive queries. Interactive use of Spark falls into three main classes. First,
organizations use Spark SQL for relational queries, often through business intelligence
tools like Tableau. Examples include eBay and Baidu. Second, developers and data
scientists can use Spark’s Scala, Python, and R interfaces interactively through shells or
visual notebook environments. Such interactive use is crucial for asking more advanced
questions and for designing models that eventually lead to production applications and
is common in all deployments. Third, several vendors have developed domain-specific
interactive applications that run on Spark. Examples include Tresata (anti-money laun-
dering), Trifacta (data cleaning), and PanTera (largescale visualization).
Stream processing. Real-time processing is also a popular use case, both in analytics
and in real-time decision making applications. Published use cases for Spark Streaming
include network security monitoring at Cisco, prescriptive analytics at Samsung SDS,
and log mining at Netflix. Many of these applications also combine streaming with batch
and interactive queries. For example, video company Conviva uses Spark to continuously
maintain a model of content distribution server performance, querying it automatically
when it moves clients across servers, in an application that requires substantial parallel
work for both model maintenance and queries.
Scientific applications. Spark has also been used in several scientific domains, in-
cluding large-scale spam detection [TS11], image processing [ZP15], and genomic data
processing [NP15]. One example that combines batch, interactive, and stream processing
is the Thunder platform for neuroscience at Howard Hughes Medical Institute, Janelia
Farm [FA14]. It is designed to process brain-imaging data from experiments in real time,
scaling up to 1TB/hour of whole-brain imaging data from organisms (such as zebrafish
and mice). Using Thunder, researchers can apply machine learning algorithms (such as
clustering and Principal Component Analysis) to identify neurons involved in specific
behaviors. The same code can be run in batch jobs on data from previous runs or in
interactive queries during live experiments.
Deployment environments. We also see growing diversity in where Apache Spark appli-
cations run and what data sources they connect to. While the first Spark deployments
were generally in Hadoop environments, only 40 percent of deployments in our July
2015 Spark survey were on the Hadoop YARN cluster manager. In addition, 52% of
respondents ran Spark on a public cloud.
Chapter 6
PEFIM
In this chapter, we present a parallel high utility itemset mining algorithm, named
PEFIM (Parallel EFficient high-utility Itemset Mining), which parallelizes the
state-of-the-art high utility itemset mining algorithm EFIM [SZ15]. The EFIM algorithm is
divided into two procedures: the main procedure and the search procedure.
The main procedure remains essentially the same as in EFIM. It (Algorithm 1)
takes as input a transaction database and the minutil threshold. The algorithm initially
considers that the current itemset α is the empty set (line 1). The algorithm then scans
the database once to calculate the local utility of each item w.r.t. α, using a utility-bin
array (line 2). Note that in the case where α = ∅, the local utility of an item is its TWU.
Then, the local utility of each item is compared with minutil to obtain the secondary
items w.r.t. α, that is, the items that should be considered in extensions of α (line 3).
Then, these items are sorted by ascending order of TWU, and that order is thereafter
used as the total order ≻ on items (line 4). The database is then scanned once to remove
all items that are not secondary items w.r.t. α, since they cannot be part of any
high-utility itemset by Theorem 4.1 (line 5). At the same time, items in each transaction are
sorted according to ≻, and if a transaction becomes empty, it is removed from the database.
Then, the database is scanned again to sort transactions by the ≻_T order to allow O(nw)
transaction merging, thereafter (line 6). Then, the algorithm scans the database again
to calculate the sub-tree utility of each secondary item w.r.t. α, using a utility-bin array
(lines 7 and 8). Thereafter, the algorithm divides the search space and assigns a part of
it to each node in a cluster. Each node is only responsible for mining its assigned search
space; in other words, the workload is split among the different nodes in the cluster.
In PEFIM, each node is assigned one or more sub-tasks. For example, in Figure 6.1, if
there are 2 nodes in the cluster, the items are divided into 2 groups and assigned to different
nodes. Assuming items a, e, d are assigned to node 1, and b, c are assigned to node 2,
node 1 will be responsible for mining all the itemsets containing item a, the itemsets
containing item e but not a, and the itemsets containing item d but neither a nor e, while
node 2 will be responsible for mining the itemsets containing item b but none of a, e, d,
and the itemsets containing item c but none of a, e, d, b. The algorithm performs this
partitioning with the map function. Each node of the cluster runs the search procedure.
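The split described above can be sketched in Python (hypothetical helper functions, following the two-node example; nodes are numbered from 0 here rather than from 1):

```python
ORDER = ["a", "e", "d", "b", "c"]   # assumed ascending-TWU processing order

def assign(order, n_nodes):
    """Split the ordered items into contiguous groups, one per node."""
    size = -(-len(order) // n_nodes)  # ceiling division
    return {n: order[n * size:(n + 1) * size] for n in range(n_nodes)}

def owner(itemset, order, groups):
    """A node mines exactly the itemsets whose earliest item it owns."""
    first = min(itemset, key=order.index)
    return next(n for n, items in groups.items() if first in items)

groups = assign(ORDER, 2)   # node 0 gets a, e, d; node 1 gets b, c
```

For instance, owner({"e", "c"}, ORDER, groups) is 0: the itemset contains e but not a, so it belongs to the first node, matching the example above.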
The search procedure performs a loop to consider each single-item extension of α of the
form β = α ∪ {i}, where i is a primary item w.r.t. α (since only these single-item extensions
of α should be explored, according to Theorem 4.2) (lines 1 to 9). For each such extension β,
a database scan is performed to calculate the utility of β and, at the same time, construct
the β projected database (line 3). Note that transaction merging is performed whilst the
β projected database is constructed. If the utility of β is no less than minutil, β is output
as a high-utility itemset (line 4). Then, the database is scanned again to calculate the
sub-tree and local utility w.r.t. β of each item z that could extend β (the secondary items
w.r.t. α), using two utility-bin arrays (line 5). This allows determining the primary
and secondary items of β (lines 6 and 7). Then, the Search procedure is recursively called
with β to continue the depth-first search by extending β (line 8). Based on the properties
and theorems presented in the previous sections, it can be seen that when the search
procedure terminates, all and only the high-utility itemsets have been output. When all
the nodes in the cluster finish their tasks, the algorithm collects all the HUIs.
The overall flow diagram of the PEFIM parallel algorithm is shown in Figure 6.2. It starts
by reading the dataset from the file, after which the algorithm calculates the local utility
of each item. The items are sorted by ascending order of TWU. After that, the algorithm
scans the database to calculate the sub-tree utility of each secondary item.
The dataset is divided into different blocks to be distributed among the worker nodes.
The worker nodes work on their blocks of the file, using the map operation to run the
search procedure. Each worker node computes its high utility itemsets. Finally, the
results obtained from the worker nodes are combined to give the aggregated high utility
itemsets.
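This flow can be imitated in miniature with a thread pool standing in for Spark's map over worker nodes. The following is an assumed toy miner on a hypothetical database, not the actual implementation: each "node" exhaustively mines only the sub-trees rooted at its assigned items, and the driver unions the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy database and parameters (not the thesis' datasets).
DB = [{"a": 5, "b": 2, "c": 1},
      {"a": 10, "c": 6, "e": 3},
      {"b": 4, "c": 2, "e": 1}]
ORDER = "abce"   # assumed processing order
MINUTIL = 15

def utility(itemset):
    """Total utility of itemset over all transactions containing it."""
    return sum(sum(T[i] for i in itemset)
               for T in DB if all(i in T for i in itemset))

def mine_subtree(root):
    """Exhaustively mine all itemsets whose first item (in ORDER) is root."""
    found = []
    def rec(alpha):
        if utility(alpha) >= MINUTIL:
            found.append(frozenset(alpha))
        last = max(ORDER.index(i) for i in alpha)
        for z in ORDER[last + 1:]:
            rec(alpha | {z})
    rec({root})
    return found

with ThreadPoolExecutor(max_workers=2) as pool:   # stand-in for worker nodes
    partials = pool.map(mine_subtree, ORDER)      # "map" the search procedure
    huis = set().union(*partials)                 # aggregate the partial HUIs
```

With MINUTIL = 15 this toy run aggregates {a}, {a, c} and {a, c, e}.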
Experimental Results
The experiments were performed with our proposed algorithm (PEFIM) and the EFIM
algorithm to find high utility itemsets on a machine with 16 GB of main memory and an
Intel Core i7-6700HQ CPU @ 3.40 GHz (4 cores), running the Ubuntu 16.04 Linux
operating system. The Spark application was written in Oracle Java 1.8 with Spark
framework version 2.0.2.
Both algorithms first read the database into main memory, then search for high-utility
itemsets, and write the run-time results to disk. Since the input and output are the
same for both algorithms, the cost of disk accesses has no influence on the results of the
experiments. The algorithms were implemented in Java, and memory measurements were
done using the standard Java API. Experiments were performed using a set of synthetic
datasets, chosen because they have varied characteristics.
7.1 Datasets
The experiments were performed on multiple real-world-based datasets [fim12], generated
using the database generator of SPMF [FVGG+14]. Our experiments were conducted on
relatively small datasets of 1,000 and 10,000 transactions and on relatively large datasets
of 100,000 and 1,000,000 transactions. The characteristics of the datasets are shown in
Table 7.1. For each threshold ratio of a dataset, the experiment was executed 30 times
and the average was taken.
The PEFIM algorithm was compared with the EFIM algorithm in terms of computational
time and the quantity of physical memory used to find HUIs.
In this section, we compare our algorithm (PEFIM) with the EFIM algorithm on all
datasets. Experiments were conducted to show the effectiveness of our algorithm and
of the approach taken to improve performance. We first compare execution times by
running both algorithms on each dataset while progressively decreasing the minutil
threshold until the algorithms became too slow, ran out of memory, or a clear winner
was observed. Execution times are shown in the following figures.
Figure 7.1: Time to find HUI having 75 distinct items and up to 20 items per trans-
action
From Figure 7.1a, we see that EFIM performs better than the PEFIM algorithm.
But when the number of transactions is increased (Figures 7.1b, 7.1c and 7.1d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is dense.
Figure 7.2: Time to find HUI having 120 distinct items and up to 50 items per transaction
From Figure 7.2a, we see that EFIM performs better than the PEFIM algorithm.
But when the number of transactions is increased (Figures 7.2b, 7.2c and 7.2d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is dense.
Figure 7.3: The dataset for these figures has the following characteristics: 400 distinct
items and the max length per transaction is 70
From Figure 7.3a, we see that EFIM performs better than the PEFIM algorithm.
But when the number of transactions is increased (Figures 7.3b, 7.3c and 7.3d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is dense.
Figure 7.4: The dataset for these figures has the following characteristics: 500 distinct
items and the max length per transaction is 10
From Figures 7.4a, 7.4b and 7.4c, we see that EFIM performs better than the PEFIM
algorithm. But when the number of transactions is increased (Figure 7.4d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is sparse.
Figure 7.5: The dataset for these figures has the following characteristics: 1500 dis-
tinct items and the max length per transaction is 15
From Figures 7.5a, 7.5b and 7.5c, we see that EFIM performs better than the PEFIM
algorithm. Even when the number of transactions is increased (Figure 7.5d) for the
same dataset, the proposed algorithm (PEFIM) does not show significant improvement
over EFIM when the dataset type is sparse.
Figure 7.6: The dataset for these figures has the following characteristics: 40000
distinct items and the max length per transaction is 20
From Figures 7.6a, 7.6b and 7.6c, we see that EFIM performs better than the PEFIM
algorithm. But when the number of transactions is increased (Figure 7.6d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is sparse.
Figure 7.7: The dataset for these figures has the following characteristics: 45000
distinct items and the max length per transaction is 15
From Figures 7.7a, 7.7b and 7.7c, we see that EFIM performs better than the PEFIM
algorithm. But when the number of transactions is increased (Figure 7.7d) for the
same dataset, the proposed algorithm (PEFIM) shows significant improvement over
EFIM when the dataset type is sparse.
A first observation is that PEFIM on average outperforms EFIM when the minutil
threshold is very low (0.01) and either the number of transactions is in the order of ten
thousands (10^4) and the dataset type is dense, or the number of transactions is in the
order of millions (10^6) and the dataset type is sparse. PEFIM is in general about two
to three times faster than EFIM in those scenarios.
We also compared PEFIM with EFIM in terms of the computational resources needed,
specifically physical memory. Experiments were conducted to show how much memory
PEFIM needs compared to EFIM to reach the same results. We compared memory
consumption by running both algorithms on each dataset while progressively decreasing
the minutil threshold until the algorithms became too slow, ran out of memory, or a
clear winner was observed. Memory consumption is shown in the following figures.
Figure 7.8: The dataset for these graphs has the following characteristics: 75 distinct
items and the max length per transaction is 20
Figure 7.9: The dataset for these graphs has the following characteristics: 120 distinct
items and the max length per transaction is 50
Figure 7.10: The dataset for these graphs has the following characteristics: 400
distinct items and the max length per transaction is 70
Figure 7.11: The dataset for these graphs has the following characteristics: 500
distinct items and the max length per transaction is 10
Figure 7.12: The dataset for these graphs has the following characteristics: 1500
distinct items and the max length per transaction is 10
Figure 7.13: The dataset for these graphs has the following characteristics: 2000
distinct items and the max length per transaction is 150
Figure 7.14: The dataset for these graphs has the following characteristics: 40000
distinct items and the max length per transaction is 15
Figure 7.15: The dataset for these graphs has the following characteristics: 45000
distinct items and the max length per transaction is 15
Looking at the figures, it is not obvious that EFIM outperforms PEFIM in terms
of memory efficiency on all datasets. Their behavior varies considerably and depends
on the nature of the dataset and on the threshold ratio values. We therefore cannot
say for sure that EFIM will always outperform PEFIM, but the larger the threshold
ratio, the more efficiently EFIM behaves. The reason why PEFIM consumes more
physical memory when the threshold ratio is large is that each node has its own local
tree to explore and keeps all its data in memory to be more time-efficient.
Chapter 8
Conclusions
In this work, PEFIM was proposed. PEFIM is a novel distributed approach to mining
high utility itemsets. The Spark framework was used for the distributed computing
because of its advantages over the Hadoop framework: Spark uses in-memory
computation, which is much faster than the disk-dependent Hadoop framework. PEFIM
divides the search space among different worker nodes, which compute high utility
itemsets that are then aggregated to produce the result.
Extensive experiments were conducted to evaluate the proposed algorithm. The
experimental results on different datasets show that, with transactions in the order of
millions or more on sparse datasets, or transactions in the order of ten thousands on
dense datasets, PEFIM gains a significant performance improvement in terms of
computational time. After comparing PEFIM against EFIM, the experimental results
show that PEFIM is a promising algorithm for processing large volumes of data.
Although PEFIM performs much better than EFIM for large datasets under the above
conditions, PEFIM does not perform very well against EFIM when the datasets are
small.
Chapter 9
Future work
Several future works were identified: the use of cloud computing infrastructure for
testing both algorithms with bigger datasets, and the implementation of load balancing
techniques to improve the assignment of the search space partitions to the different
worker nodes. Additionally, exploring other approaches to parallelizing EFIM is also
proposed as future work.
Bibliography
[AE08] A. Erwin, R.P. Gopalan, and N.R. Achuthan. Efficient mining of high utility
itemsets from large datasets. Advances in Knowledge Discovery, Springer
Lecture Notes in Computer Science, 5012:554–561, 2008.
[Agr94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large
databases. In Proc. Int. Conf. Very Large Databases, pages 487–499. IEEE
Computer Society, 1994.
[AXL+ 15] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu,
Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin,
Ali Ghodsi, and Matei Zaharia. Spark SQL: Relational data processing in
Spark. In Proceedings of the 2015 ACM SIGMOD International Conference
on Management of Data, SIGMOD ’15, pages 1383–1394, New York, NY,
USA, 2015. ACM.
[BLV09] B. Le, H. Nguyen, T.A. Cao, and B. Vo. A novel algorithm for mining
high utility itemsets. First Asian Conference on Intelligent Information
and Database Systems, pages 13–17, 2009.
[cas] http://cassandra.apache.org/.
[DG08] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data process-
ing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[FVGG+ 14] Philippe Fournier-Viger, Antonio Gomariz, Ted Gueniche, Azadeh Soltani,
Cheng-Wei Wu, and Vincent S. Tseng. SPMF: A Java open-source pattern
mining library. J. Mach. Learn. Res., 15(1):3389–3393, January 2014.
[GL14] G.C. Lan, T.P. Hong, J.P. Huang, and V.S. Tseng. An efficient projection-based
indexing approach for mining high utility itemsets. Knowledge and Infor-
mation Systems, 38:85–107, 2014.
[had] http://hadoop.apache.org/.
[HY04] H. Yao, H.J. Hamilton, and C.J. Butz. A foundational approach to mining itemset
utilities from databases. pages 482–486, 2004.
[HY06] H. Yao and H.J. Hamilton. Mining itemset utilities from transaction databases.
Data and Knowledge Engineering, 59:603–626, 2006.
[HY07] H. Yao, H.J. Hamilton, and L. Geng. A unified framework for utility-based mea-
sures for mining itemsets. ACM international conference on utility-based
Data Mining Workshop (UBDM), pages 28–37, 2007.
[JLF12] J. Liu, K. Wang, and B.C.M. Fung. Direct discovery of high utility itemsets
without candidate generation. 2012 IEEE 12th International Conference
on Data Mining, pages 984–989, 2012.
[PFV14] P. Fournier-Viger, C.W. Wu, S. Zida, and V.S. Tseng. FHM: Faster high-utility
itemset mining using estimated utility co-occurrence pruning. In Andreasen
T., Christiansen H., Cubero J.C., Raś Z.W. (eds), Foundations of Intelligent
Systems, 8502, 2014.
[SG16] S.M. Guo and H. Gao. HUITWU: An efficient algorithm for high-utility itemset
mining in transaction databases. Journal of Computer Science and Technol-
ogy, 31(4):776–786, 2016.
[Son14] W. Song, Y. Liu, and J. Li. BAHUI: Fast and memory efficient mining of high
utility itemsets based on bitmap. Intern. Journal of Data Warehousing and
Mining, 10(1):1–15, 2014.
[spaa] https://spark.apache.org/.
[spab] http://www.sparksummit.org.
[swi] http://www.openstack.org/.
[TS11] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and evaluation of
a real-time URL spam filtering service. In Proceedings of the IEEE Symposium
on Security and Privacy, pages 22–25, 2011.
[UY14] U. Yun, H. Ryang, and K.H. Ryu. High utility itemset mining with techniques
for reducing overestimated utilities and pruning candidates. Expert Systems
with Applications, 41, 2014.
[VTY10] V.S. Tseng, C. Wu, B. Shie, and P.S. Yu. UP-Growth: An efficient algorithm
for high utility itemset mining. pages 253–262, 2010.
[WSL14] W. Song, Y. Liu, and J. Li. Mining high utility itemsets by dynamically
pruning the tree structure. Applied Intelligence, 40:29–43, 2014.
[XS13] R.S. Xin, J. Rosen, M. Zaharia, M.J. Franklin, S. Shenker, and I. Stoica. Shark:
SQL and rich analytics at scale. In Proceedings of the ACM SIGMOD-
/PODS Conference, pages 22–27. ACM, 2013.
[YL05] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algo-
rithm. In Proceedings of the Utility-Based Data Mining Workshop, 2005.
[YL08] Y. Li, J. Yeh, and C. Chang. Isolated items discarding strategy for discovering
high utility itemsets. Data & Knowledge Engineering, 64:198–217, 2008.
[ZP15] Z. Zhang, K. Barbary, N.A. Nothaft, E. Sparks, O. Zahn, M.J. Franklin,
D.A. Patterson, and S. Perlmutter. Scientific computing meets big data tech-
nology: An astronomy use case. In Proceedings of IEEE International
Conference on Big Data, 2015.