
Heuristic algorithm for mining frequent patterns in Big Data using Apache Spark

Frequent itemset mining (FIM) is considered one of the most successful techniques in the field of data mining. It
consists of extracting frequent patterns from a transactional database by calculating their support, that is, the number
of transactions that contain them. An itemset is frequent if and only if its support is greater than or equal to minSup,
where minSup is a threshold given by the user [1].
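
The definition above can be made concrete with a minimal sketch. It assumes absolute support (a transaction count) and uses a small made-up database; the names `support`, `is_frequent` and `TOY_DB` are illustrative and do not come from the cited works.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Absolute support: number of transactions containing every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def is_frequent(itemset: Iterable[str], transactions: List[Transaction], min_sup: int) -> bool:
    """An itemset is frequent iff its support is at least min_sup."""
    return support(itemset, transactions) >= min_sup


# Hypothetical toy database, for illustration only.
TOY_DB: List[Transaction] = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"b", "d"}),
    frozenset({"a", "b", "c", "d"}),
]
```

With minSup = 3 on this toy database, {a, c} is frequent (it appears in three transactions) while {b, d} is not (it appears in two).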

There are two families of methods for solving the FIM problem: exact methods and metaheuristic-based methods. Exact
methods, such as the Apriori algorithm [1], are extremely effective on small to medium datasets. However, on large
datasets these methods suffer from high time complexity. Metaheuristic-based methods are faster, but the majority of
them are still insufficiently precise. These methods combine the Apriori algorithm with metaheuristic algorithms such
as the Genetic Algorithm (GA). These combinations gave rise to two approaches, including GA-Apriori [2, 3].
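
For orientation, the level-wise Apriori search that these hybrid methods build on can be sketched as follows. This is a textbook-style sketch on a made-up database, not the GA-Apriori procedure of [2, 3]; the function name `apriori` and the data in `TOY_DB` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def apriori(transactions: List[Itemset], min_sup: int) -> Dict[Itemset, int]:
    """Return every frequent itemset together with its absolute support."""

    def count(candidates: Set[Itemset]) -> Dict[Itemset, int]:
        # One full scan of the database per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent single items.
    singletons = {frozenset([i]) for t in transactions for i in t}
    level = {c: s for c, s in count(singletons).items() if s >= min_sup}
    frequent = dict(level)

    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = {c: s for c, s in count(candidates).items() if s >= min_sup}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy database, for illustration only.
TOY_DB: List[Itemset] = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"b", "d"}),
    frozenset({"a", "b", "c", "d"}),
]

RESULT = apriori(TOY_DB, min_sup=2)
```

Each level requires a full pass over the database, which is the cost that grows quickly on large datasets.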

In recent years, several research works have been designed to work with big data technologies such as:
1. Apache Hadoop
2. MapReduce
3. Apache Spark.
We propose to distribute the GA-Apriori approach using Apache Spark.
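
One way to picture the distribution step is to split the transactions into partitions, count candidate itemsets inside each partition, and then sum the partial counts. The sketch below emulates this in plain Python so that it runs without a cluster; in PySpark the same two phases would typically be expressed with the RDD operations `mapPartitions` and `reduce`. The partitioning scheme, the function names and the toy data are assumptions, and the candidates are fixed by hand rather than generated by a genetic algorithm.

```python
from collections import Counter
from functools import reduce
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def split(transactions: List[Itemset], n_parts: int) -> List[List[Itemset]]:
    """Round-robin partitioning (a stand-in for how a cluster would shard the data)."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def local_counts(partition: List[Itemset], candidates: List[Itemset]) -> Counter:
    """Map phase: count each candidate itemset inside a single partition."""
    counts: Counter = Counter()
    for t in partition:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts


def global_support(transactions: List[Itemset], candidates: List[Itemset],
                   n_parts: int = 2) -> Counter:
    """Reduce phase: sum the per-partition counters into global supports."""
    partials = (local_counts(p, candidates) for p in split(transactions, n_parts))
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical toy database and hand-picked candidates, for illustration only.
TOY_DB: List[Itemset] = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"b", "d"}),
    frozenset({"a", "b", "c", "d"}),
]
CANDIDATES: List[Itemset] = [frozenset("ac"), frozenset("bd"), frozenset("ad")]

SUPPORTS = global_support(TOY_DB, CANDIDATES, n_parts=2)
```

Because support is a plain sum over transactions, the result does not depend on how the database is partitioned, which is what makes this step straightforward to parallelize.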

Datasets
The datasets used in experimental analysis with big data technologies are characterized by the number of transactions,
the number of items and the average number of items per transaction [4, 5, 6]:
Name of instance    N. of transactions    N. of items    Avg. items per transaction
mushrooms           8 416                 119            23
pumsb               49 046                2 113          74
chess               3 196                 75             37
connect             67 557                129            43
accidents           340 183               468            33.8
Kddcup99            1 000 000             135            16
PAMP                1 000 000             141            23.93
RecordLink          574 913               29             10
PowerC              1 040 000             140            7
c20d10k             10 000                192            20
c73d10k             10 000                1 592          73

References

[1] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," in
Proceedings of the 1993 ACM SIGMOD international conference on Management of data, 1993, pp. 207-216.
[2] Y. Djenouri and M. Comuzzi, "GA-Apriori: Combining Apriori heuristic and genetic algorithms for solving the frequent
itemsets mining problem," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2017, pp. 138-148.
[3] Y. Djenouri and M. Comuzzi, "Combining Apriori heuristic and bio-inspired algorithms for solving the frequent
itemsets mining problem," Information Sciences, vol. 420, pp. 1-15, 2017.
[4] https://archive.ics.uci.edu/ml/datasets.html
[5] http://fimi.ua.ac.be/data/
[6] https://sourceforge.net/projects/ibmquestdatagen/
