Abstract—Pattern mining is one of the most important tasks to extract meaningful and useful information from raw data. This task aims to extract item-sets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed in this regard, the growing interest in data has caused the performance of existing pattern mining techniques to drop. The goal of this paper is to propose new efficient pattern mining algorithms to work on big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed. The proposed algorithms can be divided into three main groups. First, two algorithms [Apriori MapReduce (AprioriMR) and iterative AprioriMR] with no pruning strategy are proposed, which extract any existing item-set in data. Second, two algorithms (space pruning AprioriMR and top AprioriMR) that prune the search space by means of the well-known anti-monotone property are proposed. Finally, a last algorithm (maximal AprioriMR) is also proposed for mining condensed representations of frequent patterns. To test the performance of the proposed algorithms, a varied collection of big data datasets has been considered, comprising up to 3·10^18 transactions and more than 5 million distinct single-items. The experimental stage includes comparisons against highly efficient and well-known pattern mining algorithms. Results reveal the interest of applying the MapReduce versions when complex problems are considered, and also the unsuitability of this paradigm when dealing with small data.

Index Terms—Big data, Hadoop, MapReduce, pattern mining.

I. INTRODUCTION

DATA analysis has a growing interest in many fields like business intelligence, which includes a set of techniques to transform raw data into meaningful and useful information for business analysis purposes. With the increasing importance of data in every application, the amount of data to deal with has become unmanageable, and the performance of such techniques could drop. The term big data [1] is more and more used to refer to the challenges and advances derived from the processing of such high-dimensional datasets in an efficient way.

Pattern mining [2] is considered an essential part of data analysis and data mining. Its aim is to extract subsequences, substructures, or item-sets that represent any type of homogeneity and regularity in data, denoting intrinsic and important properties [3], [4]. This problem was originally proposed in the context of market basket analysis in order to find frequent groups [5] of products that are bought together [6]. Since its formal definition in the early nineties, a high number of algorithms has been described [7]–[9]. Most of these algorithms are based on Apriori-like methods [10], producing a list of candidate item-sets or patterns formed by any combination of single items. However, when the number of these single items to be combined increases, the pattern mining problem turns into an arduous task and more efficient strategies are required [8], [11]. To understand the complexity, let us consider a dataset comprising n single items or singletons. From them, the number of item-sets that can be generated is equal to 2^n − 1, so the problem becomes extremely complex with the increasing number of singletons. All of this leads to the logical outcome that the whole space of solutions cannot always be analyzed.

In many application fields [12], however, it is not required to produce any existing item-set but only those considered of interest, e.g., those which are included in a high number of transactions. In order to do so, new methods have been proposed, some of them based on the anti-monotone property as a pruning strategy. This property determines that any subpattern of a frequent pattern is also frequent, and that any super-pattern of an infrequent pattern will never be frequent. This pruning strategy enables the search space to be reduced since, once a pattern is marked as infrequent, no new pattern needs to be generated from it. Despite this, the huge volumes of data in many application fields have caused a decrease in the performance of existing methods. Traditional pattern mining algorithms are not suitable for truly big data [13], presenting two main challenges to be solved: 1) computational complexity and 2) main memory requirements [14]. In this scenario, sequential pattern mining algorithms on a single machine may not handle the whole procedure, and an adaptation of them to emerging technologies [15] might be fundamental to complete the process.

Manuscript received September 23, 2016; revised April 7, 2017; accepted August 29, 2017. Date of publication September 27, 2017; date of current version September 14, 2018. This work was supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund under Project TIN-2014-55252-P. This paper was recommended by Associate Editor P. Chen. (Corresponding author: Sebastián Ventura.)
J. M. Luna is with the Department of Computer Science, University of Jaén, 23071 Jaén, Spain (e-mail: jmluna@uco.es).
F. Padillo is with the Department of Computer Science and Numerical Analysis, University of Córdoba, 14071 Córdoba, Spain (e-mail: fpadillo@uco.es).
M. Pechenizkiy is with the Department of Computer Science, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands (e-mail: m.pechenizkiy@tue.nl).
S. Ventura is with the Department of Computer Science and Numerical Analysis, University of Córdoba, 14071 Córdoba, Spain, and also with the Department of Information Systems, King Abdulaziz University, Jeddah 21589, Saudi Arabia (e-mail: sventura@uco.es).
Digital Object Identifier 10.1109/TCYB.2017.2751081
2168-2267 c 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
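The combinatorial growth described in the introduction (2^n − 1 possible item-sets over n singletons) can be made concrete with a short sketch. This is an illustrative snippet rather than code from the paper, and the name `all_itemsets` is chosen here for the example:

```python
from itertools import chain, combinations

def all_itemsets(singletons):
    """Enumerate every non-empty item-set over the given singletons.

    There are 2**n - 1 of them, which is why exhaustively analyzing
    the whole space of solutions quickly becomes intractable.
    """
    n = len(singletons)
    return list(chain.from_iterable(
        combinations(singletons, k) for k in range(1, n + 1)))

# Four singletons already yield 2**4 - 1 = 15 candidate item-sets;
# twenty singletons would yield 2**20 - 1 = 1 048 575.
```

Doubling the number of singletons squares the candidate space, which is the motivation for the pruning strategies discussed later.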
Authorized licensed use limited to: Sheffield Hallam University. Downloaded on July 11,2022 at 11:44:17 UTC from IEEE Xplore. Restrictions apply.
2852 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 48, NO. 10, OCTOBER 2018
LUNA et al.: APRIORI VERSIONS BASED ON MAPREDUCE FOR MINING FREQUENT PATTERNS ON BIG DATA 2853
data (see Algorithm 1). In this well-known algorithm, Agrawal et al. [10] proposed to generate all the feasible item-sets in each transaction and to assign a support of one to them (see lines 4 and 5 in Algorithm 1). Then, the algorithm checks whether each of the new item-sets was already generated by previous transactions and, if so, its support is increased by one (see lines 6–8 in Algorithm 1). The same process is repeated for each transaction (see lines 2–11 in Algorithm 1), giving rise to a list L of patterns including their support or frequency of occurrence. As shown, the higher the number of

the result is a ⟨P, supp(P)⟩ pair, in such a way that supp(P) = Σ_{l=1}^{m} supp(P)_l (see AprioriReducer in lines 2–4 of Algorithm 2).

Fig. 2 illustrates how the AprioriMR algorithm works. In this example, the input database is divided into four sub-databases, and the AprioriMR algorithm includes four mappers and three reducers. As shown, each mapper mines item-sets (patterns) for its subdataset, iterating transaction by transaction and producing a set of ⟨P, supp(P)_l⟩ pairs for each transaction t_l ∈ T. Then, in an internal MapReduce grouping procedure, the ⟨P, supp(P)_l⟩ pairs are grouped by the key P, producing ⟨P, {supp(P)_1, supp(P)_2, ..., supp(P)_m}⟩ pairs.
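The map-and-group behavior just described can be imitated in memory in a few lines. This is a simplified sketch rather than the paper's Hadoop code, and the names `apriori_mapper` and `shuffle` are chosen here for the example:

```python
from collections import defaultdict
from itertools import combinations

def apriori_mapper(transactions):
    """Map phase: for each transaction t_l, emit a <P, supp(P)_l> pair
    (with partial support 1) for every item-set P contained in t_l."""
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for pattern in combinations(items, k):
                yield pattern, 1

def shuffle(pairs):
    """Internal MapReduce grouping: gather all partial supports under
    their key P, yielding <P, {supp(P)_1, ..., supp(P)_m}>."""
    grouped = defaultdict(list)
    for pattern, supp in pairs:
        grouped[pattern].append(supp)
    return grouped

# Example: three transactions over the singletons i1 and i3.
pairs = apriori_mapper([['i1', 'i3'], ['i1', 'i3'], ['i3']])
grouped = shuffle(pairs)
# grouped[('i3',)] is [1, 1, 1]; grouped[('i1', 'i3')] is [1, 1]
```

In Hadoop the grouping step is performed by the framework's shuffle between the map and reduce phases; the dictionary above only stands in for it.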
1: support = 0
2: for all supp ∈ {supp(P)_1, supp(P)_2, ..., supp(P)_m} do
3:   support += supp
4: end for
5: emit ⟨P, support⟩

Fig. 3. Diagram of the proposed IAprioriMR algorithm when the second iteration is run, i.e., {∀P, |P| = 2}.

Algorithm 4 Apriori Algorithm With Anti-Monotone Property
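The reducer fragment above amounts to the sum supp(P) = supp(P)_1 + ... + supp(P)_m. A minimal Python rendering of those numbered lines (illustrative only; the name `apriori_reducer` is chosen here):

```python
def apriori_reducer(pattern, partial_supports):
    """Accumulate the partial supports produced by the mappers for a
    given key P and emit the final <P, supp(P)> pair."""
    support = 0
    for supp in partial_supports:
        support += supp
    return pattern, support

# Merging three partial supports of the pattern {i1, i3}:
# apriori_reducer(('i1', 'i3'), [1, 1, 1]) -> (('i1', 'i3'), 3)
```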
1: support = 0
2: for all supp ∈ {supp(P)_1, supp(P)_2, ..., supp(P)_m} do
3:   support += supp
4: end for
5: if support ≥ threshold then
6:   emit ⟨P, support⟩
7: else
8:   keep ⟨P, support⟩ into the list of infrequent patterns
9: end if
end procedure

of ⟨P, support(P)_l⟩ pairs. As shown in Fig. 4, the algorithm stops once the maximum size for the frequent patterns is reached; no additional pattern can be obtained from the set {{i1}, {i3}, {i1 i3}}. SPAprioriMR considers the monotonic property, since an infrequent pattern P will not be considered any longer.

begin procedure TopAprioriReducer(⟨P, {supp(P)_1, ..., supp(P)_m}⟩)
1: support = 0
2: for all supp ∈ {supp(P)_1, supp(P)_2, ..., supp(P)_m} do
3:   support += supp
4: end for
5: if support ≥ threshold then
6:   emit ⟨P, support⟩
7: else
8:   keep ⟨P, support⟩ into the list of infrequent patterns
9: end if
end procedure
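Both reducers above differ from the plain support-summing reducer only in the threshold test of lines 5–9. A compact sketch of that step (illustrative; `pruning_reducer` is a name chosen here, and the list of infrequent patterns is modeled as a returned tag):

```python
def pruning_reducer(pattern, partial_supports, threshold):
    """Reducer with the pruning step of lines 5-9: emit the pattern when
    its accumulated support reaches the threshold; otherwise keep it
    aside as infrequent so that, by the anti-monotone property, no
    super-pattern is generated from it in later iterations."""
    support = sum(partial_supports)
    if support >= threshold:
        return ('emit', pattern, support)
    return ('infrequent', pattern, support)

# With threshold 4: a pattern seen four times is emitted,
# while one seen only twice is kept in the infrequent list.
```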
pairs. In this example, taking the singleton {i3} as a key, the following pair is obtained: ⟨{i3}, {1, 1, 1, 1, 1}⟩. Finally, the reducer phase is responsible for merging all the values associated with the same key k and producing the final support values, i.e., ⟨{i3}, 5⟩. It is noteworthy to mention that, as described in Section III-A, each reducer computes different keys k. Hence, the first reducer computes those patterns comprising i1; the second reducer computes those patterns including i2; and, finally, i3 is computed by the third reducer. Whereas all the frequent patterns are emitted by the reducer (TopAprioriReducer procedure, lines 5–7 of Algorithm 6), infrequent patterns are stored in an external file (TopAprioriReducer procedure, lines 7–9 of Algorithm 6) in order to update the database in the next iteration.

Following with the second iteration, and similarly to the previous iteration, the TopAprioriMR algorithm distributes the

by obtaining a pattern comprising all the singletons in data (line 1 in Algorithm 7). From this, the algorithm generates subpatterns, looking for those that are frequent according to a minimum support threshold thr. If a new subpattern comprises items that are maximals, then this subpattern is discarded (line 9 in Algorithm 7). The algorithm continues in an iterative way till no subpattern can be obtained from C, i.e., C = ∅.

In order to reduce the computational time required by the Apriori version, we present a novel proposal (see Algorithm 8) based on the MapReduce paradigm. The proposed algorithm, called MaxAprioriMR, starts by mining the set of patterns P comprising patterns P of size |P| = s, in such a way that no super-pattern of size s + 1 could be mined. For a better understanding, Fig. 6 shows the algorithm iterations for a sample dataset. Here, the minimum support threshold value is fixed
begin procedure MaxAprioriReducer(⟨P, {supp(P)_1, ..., supp(P)_m}⟩)
1: support = 0
2: for all supp ∈ {supp(P)_1, supp(P)_2, ..., supp(P)_m} do
3:   support += supp
4: end for
5: if support ≥ threshold then
6:   emit ⟨P, support⟩
7: else
8:   keep ⟨P, support⟩ into the list of infrequent item-sets
9: end if
end procedure

Fig. 6. Iterations run by the MaxAprioriMR algorithm on a sample dataset. The minimum support threshold was fixed to a value of 4.
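MaxAprioriMR ultimately keeps only maximal frequent patterns, i.e., frequent patterns with no frequent super-pattern (the condensed representation). A small in-memory sketch of that final filter (illustrative only; `maximal_patterns` is a name chosen here):

```python
def maximal_patterns(frequent):
    """Keep the condensed representation: a frequent pattern is maximal
    when no strict frequent super-pattern of it exists."""
    pats = [frozenset(p) for p in frequent]
    # p < q on frozensets tests for a strict subset relation.
    return [p for p in pats if not any(p < q for q in pats)]

# Over the frequent set {{i1}, {i3}, {i1, i3}}, only {i1, i3} is maximal:
# {i1} and {i3} are strict subsets of the frequent pattern {i1, i3}.
```

In the distributed algorithm this check is interleaved with the iterations over pattern sizes; here it is applied once over an already-mined frequent set only to show the maximality condition.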
Fig. 7. Runtime of different algorithms when they are run on datasets comprising eight singletons and a number of instances that varies from 2·10^6 to 2·10^8. The computational complexity is up to O((2^8 − 1) × 2·10^8 × 8).

Fig. 8. Runtime of different algorithms when they are run on datasets comprising 20 singletons and a number of instances that varies from 2·10^6 to 2·10^8. The computational complexity is up to O((2^20 − 1) × 2·10^8 × 20).

size in the search space up to 2^18 is considered. Here, the number of distinct instances varies from 3·10^7 to 3·10^12. Analyzing the file size, it is noteworthy that the smallest dataset has a size of 86 MB, whereas the biggest one has a file size of 814 GB. Finally, we have conducted experiments on real-world datasets considering the Webdocs dataset,¹ which is the one comprising the highest number of singletons; it comprises a search space of 2^(5 267 646). This dataset has a file size of 1.5 GB.

¹ This dataset is considered by the pattern mining community as the biggest one, and it is electronically available at: http://fimi.ua.ac.be/data/.

B. Sequential Mining Versus MapReduce

In this experimental study, the performance of a set of algorithms when the number of both attributes and instances increases has been analyzed. In this paper, a set of synthetic datasets has been created to analyze how the number of both instances and attributes affects the algorithms' performance. Single-items generated for these datasets are distributed throughout the whole set of instances by following a Gaussian distribution. The number of single-items varies from 8 to 20, whereas the number of instances varies from 2·10^6 to 2·10^8. In this regard, the computational cost varies from O((2^8 − 1) × 2·10^6 × 8) to O((2^20 − 1) × 2·10^8 × 20).

In a first analysis, the runtime of the different proposals when no pruning strategy is applied is compared. In this analysis (see Fig. 7), a dataset comprising a search space of size 2^8 − 1 is considered. Results illustrate that sequential algorithms like Apriori, FP-Growth, or Eclat behave really well even when datasets include a large number of instances (up to 20 million). In situations where the number of instances continues growing, the proposed Apriori versions based on MapReduce (AprioriMR and IAprioriMR) behave better than the others. Results reveal that, for a dataset comprising more than 80 million instances and 8 single-items, the use of the MapReduce versions is appropriate, outperforming the sequential algorithms (FP-Growth, Apriori, and Eclat). Continuing the analysis, and focusing on the runtime of the MapReduce versions, it is obtained that the number of instances barely affects the runtime of the proposed algorithms. This behavior implies that, for huge datasets, the use of MapReduce is appealing. Finally, it is interesting to note that Eclat does not appear in the analysis for datasets comprising more than 80 million instances due to its low performance. To understand this issue, it is worth noting that Eclat keeps the whole tid-list for each pattern, so extremely large datasets (according to the number of instances) cannot be handled in memory (hardware limitations).

Following with the experimental analysis, the number of singletons is increased from 8 to 20, so the search space grows from 2^8 − 1 = 255 to 2^20 − 1 = 1 048 575. Fig. 8 illustrates the runtime for the set of algorithms considered in this paper, denoting the good performance of the MapReduce versions. Results reveal the importance of using the MapReduce paradigm for mining patterns on huge search spaces rather than using sequential pattern mining algorithms. This demonstrates that sequential mining algorithms are appropriate for low-complexity problems, whereas algorithms based on MapReduce are required to handle tough tasks. In this regard, and considering datasets with more than 80 million instances, none of the sequential versions considered in this paper is able to be run (hardware limitations).

The second part of this experimental analysis is related to the number of single-items, so the aim is to demonstrate how the algorithms behave when this number increases. In this sense, Fig. 9 illustrates the performance of the algorithms when datasets comprising 2 million instances are considered and the number of single-items varies from 8 to 20. Results demonstrate that small problems should be solved by sequential pattern mining algorithms (Apriori, FP-Growth, and Eclat), which require a runtime lower than the MapReduce versions. Following with this analysis, and considering now 20 million instances (see Fig. 10), the MapReduce versions tend to perform better when a high number of both instances and singletons is considered. Finally, and considering now 200 million instances, it is obtained that sequential pattern mining algorithms cannot be run due to hardware limitations (see Fig. 11). In this final analysis, it is obtained that
the MapReduce algorithms behave quite similarly, and they are really appropriate to handle huge datasets where an extremely large number of comparisons is required.

Fig. 9. Runtime of different algorithms when they are run on datasets comprising 2 million instances and a number of singletons that varies from 8 to 20. The computational complexity is up to O((2^20 − 1) × 2·10^6 × 20).

Fig. 10. Runtime of different algorithms when they are run on datasets comprising 20 million instances and a number of singletons that varies from 8 to 20. The computational complexity is up to O((2^20 − 1) × 2·10^7 × 20).

Fig. 11. Runtime of different algorithms when they are run on datasets comprising 200 million instances and a number of singletons that varies from 8 to 20. The computational complexity is up to O((2^20 − 1) × 2·10^8 × 20).

TABLE I. Orders of magnitude in the difference of the runtime when each pair of algorithms (no pruning strategy) is compared. The average, worst (the most unfavorable), and best (the most favorable) cases have been considered.

The final part of this experimental study is related to the orders of magnitude in the difference of the runtime (see Table I). Let us consider AprioriMR as the baseline. Here, more than 0.5 orders of magnitude are obtained in the average difference (mean value when considering different numbers of both instances and single-items) when AprioriMR is compared to sequential pattern mining algorithms. Taking the most favorable case, AprioriMR obtains two orders of magnitude of difference with regard to Apriori, FP-Growth, and Eclat. Continuing the analysis, let us compare the runtime of the two MapReduce Apriori versions for mining patterns without pruning the search space. In this paper, results reveal that both algorithms behave similarly, obtaining an average difference in order of magnitude of 0.231 (mean value when considering different numbers of both instances and single-items). It is noteworthy to mention that negative orders of magnitude mean that, in some datasets (small datasets comprising a small search space), the proposed MapReduce algorithms behave worse than sequential pattern mining algorithms. Finally, considering the IAprioriMR algorithm, the values obtained for the average difference in orders of magnitude are lower than the ones obtained for AprioriMR. Nevertheless, to be fair, it should be noted that IAprioriMR performs better than AprioriMR for extremely complex problems (see Fig. 11 when using more than 18 singletons).

C. MapReduce Versions With Space Pruning

In this second experimental study, our aim is to analyze the behavior of the proposed MapReduce Apriori versions when a pruning strategy is considered to reduce the search space. The aim of these algorithms is not the discovery of any existing pattern but of a subset of interesting patterns according to a minimum predefined value of frequency. It is noteworthy to mention that this experimental analysis includes comparisons with regard to the two existing proposals (BigFIM and DistEclat) based on MapReduce for mining patterns of interest.

First, it is interesting to study the performance of the MapReduce algorithms when increasing the size |P| = s of the discovered patterns. The aim is to determine whether the patterns' size is detrimental to the algorithms' performance. Fig. 12 illustrates the resulting runtime required for mining patterns of different sizes: s = 1, s = 2, s = 3, and s = 4. Results reveal that the size of the patterns does not affect
Fig. 12. Runtime required for mining all the patterns of size s = 1, s = 2, s = 3, and s = 4.

Fig. 14. Runtime of different algorithms when they are run on datasets comprising 30 million instances and a number of singletons that varies from 4 to 18. The computational complexity is up to O((2^18 − 1) × 3·10^7 × 18).
TABLE II. Runtime in minutes of the TopAprioriMR algorithm when different datasets comprising a high number of both instances and single-items are considered.

TABLE III. Runtime in minutes of the TopAprioriMR algorithm when it is run on the well-known Webdocs dataset.

Fig. 17. Runtime of different algorithms when they are run on datasets comprising 16 singletons and up to 30 000 million instances. The computational complexity is up to O((2^18 − 1) × 3·10^10 × 18).
Fig. 18. Runtime of different algorithms when they are run on datasets comprising 12 singletons (left) and 18 singletons (right). The computational complexity is up to O((2^18 − 1) × 3·10^18 × 18).

TABLE IV. Number of patterns discovered (percentage with regard to the baseline) for different search spaces and 30 000 million instances. A minimum support threshold of 0.0001 was applied to the algorithms that prune the search space.

in many cases, it is possible to mine the whole set of frequent patterns faster than the set of maximal frequent patterns. To demonstrate this assertion, Fig. 18 shows a comparison between the MaxAprioriMR algorithm and TopAprioriMR, which is the MapReduce proposal that achieved the best results. Results demonstrate that the number of single-items has a high impact on the performance. Let us consider a dataset comprising 12 single-items [see Fig. 18 (left)]; in this scenario, both TopAprioriMR and MaxAprioriMR behave equally. Analyzing the results illustrated in Fig. 18, it is obtained that the differences between both algorithms grow with the increment of the single-items. Thus, for a total of 18 single-items [see Fig. 18 (right)], it is obtained that TopAprioriMR outperforms MaxAprioriMR, demonstrating that algorithms for mining maximal frequent patterns are time-consuming.
TABLE V. Features of the three proposed algorithms for mining patterns by using the monotonic property.
2) Pruning the search space by means of the anti-monotone property. Two additional algorithms (SPAprioriMR and TopAprioriMR) have been proposed with the aim of discovering any frequent pattern available in data.
3) Maximal frequent patterns. A last algorithm (MaxAprioriMR) has also been proposed for mining condensed representations of frequent patterns.

The features of all these algorithms are summarized in Table V.

To test the performance of the proposed algorithms, we have conducted a series of experiments on a collection of big datasets comprising up to 3·10^18 instances and more than 5 million single-items. All the algorithms have been compared to highly efficient algorithms in the pattern mining field. The experimental stage has revealed that our proposals perform really well for huge search spaces. Results have also revealed the unsuitability of MapReduce frameworks when small datasets are considered.

REFERENCES

[1] T.-M. Choi, H. K. Chan, and X. Yue, "Recent development in big data analytics for business operations and risk management," IEEE Trans. Cybern., vol. 47, no. 1, pp. 81–92, Jan. 2017.
[2] J. M. Luna, "Pattern mining: Current status and emerging topics," Progr. Artif. Intell., vol. 5, no. 3, pp. 165–170, 2016.
[3] C. C. Aggarwal and J. Han, Frequent Pattern Mining, 1st ed. Cham, Switzerland: Springer, 2014.
[4] J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: Current status and future directions," Data Min. Knowl. Disc., vol. 15, no. 1, pp. 55–86, 2007.
[5] J. M. Luna, J. R. Romero, C. Romero, and S. Ventura, "On the use of genetic programming for mining comprehensible rules in subgroup discovery," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2329–2341, Dec. 2014.
[6] R. Agrawal, T. Imielinski, and A. Swami, "Database mining: A performance perspective," IEEE Trans. Knowl. Data Eng., vol. 5, no. 6, pp. 914–925, Dec. 1993.
[7] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Min. Knowl. Disc., vol. 8, no. 1, pp. 53–87, 2004.
[8] S. Zhang, Z. Du, and J. T. L. Wang, "New techniques for mining frequent patterns in unordered trees," IEEE Trans. Cybern., vol. 45, no. 6, pp. 1113–1125, Jun. 2015.
[9] J. M. Luna, J. R. Romero, and S. Ventura, "Design and behavior study of a grammar-guided genetic programming algorithm for mining association rules," Knowl. Inf. Syst., vol. 32, no. 1, pp. 53–76, 2012.
[10] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. ACM SIGMOD Int. Conf. Manag. Data (SIGMOD), Washington, DC, USA, 1993, pp. 207–216.
[11] J. Liu, K. Wang, and B. C. M. Fung, "Mining high utility patterns in one phase without generating candidates," IEEE Trans. Knowl. Data Eng., vol. 28, no. 5, pp. 1245–1257, May 2016.
[12] S. Ventura and J. M. Luna, Pattern Mining With Evolutionary Algorithms, 1st ed. Cham, Switzerland: Springer, 2016.
[13] S. Moens, E. Aksehirli, and B. Goethals, "Frequent itemset mining for big data," in Proc. IEEE Int. Conf. Big Data (IEEE BigData), Silicon Valley, CA, USA, 2013, pp. 111–118.
[14] J. M. Luna, A. Cano, M. Pechenizkiy, and S. Ventura, "Speeding-up association rule mining with inverted index compression," IEEE Trans. Cybern., vol. 46, no. 12, pp. 3059–3072, Dec. 2016.
[15] S. Sakr, A. Liu, D. M. Batista, and M. Alomari, "A survey of large scale data management approaches in cloud environments," IEEE Commun. Surveys Tuts., vol. 13, no. 3, pp. 311–336, 3rd Quart., 2011.
[16] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[17] I. Triguero, D. Peralta, J. Bacardit, S. García, and F. Herrera, "MRPR: A MapReduce solution for prototype reduction in big data classification," Neurocomputing, vol. 150, pp. 331–345, Feb. 2015.
[18] C. Lam, Hadoop in Action, 1st ed. Stamford, CT, USA: Manning, 2010.
[19] B. Ziani and Y. Ouinten, "Mining maximal frequent itemsets: A Java implementation of FPMAX algorithm," in Proc. 6th Int. Conf. Innov. Inf. Technol. (IIT), Al Ain, UAE, 2009, pp. 330–334.
[20] M. J. Zaki, "Scalable algorithms for association mining," IEEE Trans. Knowl. Data Eng., vol. 12, no. 3, pp. 372–390, May/Jun. 2000.
[21] J. Pei et al., "Mining sequential patterns by pattern-growth: The PrefixSpan approach," IEEE Trans. Knowl. Data Eng., vol. 16, no. 11, pp. 1424–1440, Nov. 2004.
[22] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, 1st ed. Boston, MA, USA: Addison-Wesley, 2005.
[23] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: A MapReduce framework on graphics processors," in Proc. 17th Int. Conf. Parallel Archit. Compilation Tech. (PACT), Toronto, ON, Canada, 2008, pp. 260–269.
[24] C. Borgelt, "Efficient implementations of Apriori and Eclat," in Proc. 1st IEEE ICDM Workshop Frequent Item Set Min. Implement. (FIMI), Melbourne, FL, USA, 2003, pp. 90–99.

José María Luna (M'10) received the Ph.D. degree in computer science from the University of Granada, Granada, Spain, in 2014. He is a member of the Knowledge Discovery and Intelligent Systems Laboratory, University of Córdoba, Córdoba, Spain. His research was first granted by the Ministry of Education. He has authored the book entitled Pattern Mining with Evolutionary Algorithms (Springer) and two book chapters, and published over 30 papers in top-ranked journals and international scientific conferences. He has also been engaged in four national and regional research projects, and has contributed to three international projects. His current research interests include evolutionary computation, pattern mining, and association rule mining and its applications. Dr. Luna was a recipient of a JdC Training Post-Doctoral grant.
Francisco Padillo received the M.Sc. degree in computer science from the University of Córdoba, Córdoba, Spain, in 2015. He is a member of the Knowledge Discovery and Intelligent Systems Laboratory, University of Córdoba. His current research interests include pattern mining, big data, and the MapReduce paradigm.

Mykola Pechenizkiy (M'07) received the Ph.D. degree in computer science and information systems from the University of Jyväskylä (JYU), Jyväskylä, Finland, in 2005. He is a Full Professor of predictive analytics with the Department of Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands. He is also an Adjunct Professor of data mining for industrial applications with JYU. He has co-authored over 70 peer-reviewed publications and has co-organized several workshops and conferences, such as the IEEE CBMS 2008 and 2012 and EDM 2011 and 2015, as well as tutorials at ECML/PKDD 2010 and 2012, PAKDD 2011, and the IEEE CBMS 2010. He has co-edited special issues in SIGKDD Explorations, Elsevier's DKE and AIIM, Springer's Evolving Systems and the Journal of Learning Analytics, and the book entitled Handbook of Educational Data Mining. Dr. Pechenizkiy serves as the President of the International Educational Data Mining Society.

Sebastián Ventura (M'07–SM'09) received the B.Sc. and Ph.D. degrees in sciences from the University of Córdoba, Córdoba, Spain, in 1989 and 1996, respectively. He is currently a Full Professor with the Department of Computer Science and Numerical Analysis, University of Córdoba, where he heads the Knowledge Discovery and Intelligent Systems Research Laboratory. He has published over 150 papers in journals and scientific conferences, and edited three books and several special issues in international journals. He has also been engaged in 12 research projects (being the coordinator of four of them) supported by the Spanish and Andalusian governments and the European Union. His current research interests include soft computing, machine learning, data mining, and their applications. Dr. Ventura is a Senior Member of the IEEE Computer, IEEE Computational Intelligence, and IEEE Systems, Man, and Cybernetics Societies, and a member of the Association for Computing Machinery.