Professional Documents
Culture Documents
Fast and Memory Efficient Mining of Periodic Frequent Patterns
Fast and Memory Efficient Mining of Periodic Frequent Patterns
Abstract Periodic frequent pattern mining, the process of finding frequent patterns
which occur periodically in databases, is an important data mining task for various
decision making. Though several algorithms have been proposed for their discovery,
most employ a two stage process to evaluate the periodicity of patterns. That is, by
firstly deriving the set of periods of a pattern from its coverset, and subsequently
evaluating the periodicity from the derived set of periods. This two step process thus
make algorithms for discovering periodic frequent patterns both time and memory
inefficient in the discovery process. In this paper, we present solutions to reduce both
runtime and memory consumption in periodic frequent pattern mining. We achieve
this by evaluating the periodicity of patterns without deriving the set of periods from
their coversets. Our experimental results show that our proposed solutions are effi-
cient both in reducing the runtime and memory consumption in the discovery of
periodic frequent patterns.
1 Introduction
Frequent pattern mining (the process of identifying patterns which occur frequently
in databases) over the past years has been widely studied for knowledge discovery in
databases by works such as [1, 4, 17]. Though algorithms for mining frequent pat-
terns are helpful in revealing frequently occurring patterns in databases (for decision
making), they often fail in revealing the occurrence shapes of patterns. For instance,
in crime data analysis, frequent pattern mining algorithms will fail to report the peri-
odic occurrence shapes of crime patterns, which are vital in decision making to curb
crime. This limitation in frequent pattern mining, and the importance of patterns’
V. M. Nofong (✉)
Univeristy of Mines and Technology, P. O. Box 237, Tarkwa, Ghana
e-mail: vnofong@umat.edu.gh
shapes in decision making resulted in the start of research on periodic frequent pat-
tern mining.
Mining periodic frequent patterns from transactional databases has been widely
researched on, in works such as [3, 5, 7, 11, 14, 15]. Though many algorithms have
been proposed for mining periodic frequent patterns in transactional databases, they
have the following limitations:
∙ Periodic frequent pattern mining algorithms such as [5–7, 14] which mine periodic
frequent patterns (PFPs) based on the maximum periodicity threshold (proposed
in [15]) will often miss some relevant periodic patterns if such patterns have just
one periodic interval being greater than the maximum periodicity threshold.
∙ Periodic frequent pattern mining algorithms such as [9, 12, 13] which mine peri-
odic frequent patterns based on the maximum variance threshold (proposed in
[13]) will often report a set of PFPs with distinct periods for decision making.
∙ As mentioned previously, to test a pattern as either periodic or non-periodic, exist-
ing algorithms firstly derive the set of periods from its coverset, and subsequently
evaluate its periodicity from the derived set of periods. This two stage process
thus make existing algorithms inefficient in both runtime and memory usage in
discovering PFPs.
Though some of these challenges have been addressed, the case of time and mem-
ory inefficiency in the discovery of periodic frequent patterns to the best of our
knowledge is yet to be addressed. In this paper, we address this issue by propos-
ing to evaluate the periodicity of patterns without deriving the set of periods. This
in turn reduces both runtime and memory consumption in the discovery of periodic
frequent patterns.
This paper makes the following contributions in the discovery of periodic frequent
patterns:
∙ Effective and efficient techniques are introduced for evaluating the periodicity of
patterns without deriving their set of periods from their coversets.
∙ The proposed techniques are incorporated on existing periodic frequent pattern
mining frameworks which showed a drastic reduction in both runtime and memory
usage in the discovery of PFPs.
2 Related Work
|cov𝐃 (S)|
sup𝐃 (S) = (2)
|𝐃|
where n = |𝐃|. The last time, n, will be duplicated if it is already in cov𝐃 (S). For
instance, given |𝐃| = 6 and cov𝐃 (S) = {1, 4, 6}, then, e.cov𝐃 (S) = {0} ∪ {1, 4, 6} ∪
{6} = {0, 1, 4, 6, 6}.
Let (mj , mj+1 ) ∈ e.cov𝐃 (S) be two consecutive occurrence times (TIDs) of S in
𝐃, then pSj = mj+1 − mj is the jth period of S in 𝐃. The set of all periods of S, PS ,
obtained from its extended cover is denoted as:
Definition 1 [15] Given a database 𝐃, a pattern S and its set of periods PS in 𝐃, the
periodicity of S, Per(S), is defined as, Per(S) = max{p|p ∈ PS }.
Based on the proposed periodicity measure in Definition 1, Tanbeer et al. [15] sub-
sequently defined a periodic frequent pattern as frequent pattern whose periodicity
is not greater than a user defined maximum periodicity threshold, maxPer.
For a pattern S and its set of periods PS , the proposition in Tanbeer et al. [15]
returns S as periodic if the maximal occurring period (maximal time interval between
226 V. M. Nofong
any consecutive occurrence times) of S is not greater than the maximum periodicity
threshold, maxPer. This idea of mining periodic frequent patterns based on the max-
imal occuring period proposed in [15] have been used in mining periodic frequent
patterns in transaction-like datasets in works such as [6–8, 10, 14].
Rashid et al. in [13] however argued that mining periodic frequent patterns based
on the proposed periodicity measure in [15] is inappropriate since it returns the max-
imum period for which a pattern does not appear in a database as its periodicity. To
avoid reporting the maximal occuring period as a patterns’ periodicity, Rashid et al.
[13] defined the periodicity of a pattern under the name regularity as follows.
Definition 2 [13] Given a database 𝐃, a pattern S and its set of periods PS in 𝐃, the
regularity of S, Reg(S), is defined as Reg(S) = var(PS ), where var(PS ) is the variance
of PS .
With the regularity (periodicity) measure in Definition 2, Rashid et al. [13]
defined a regular (periodic) frequent pattern as a frequent pattern whose variance
among its set of periods is not greater than a user desired maximum regularity thresh-
old, maxReg. The concept of mining regular (periodic) frequent patterns proposed
in [13] has also been used in mining regular (periodic) frequent patterns in works
such as [9, 12].
Nofong in [11] however argued that though the concept proposed [13] will not dis-
card interesting periodic frequent patterns as in [15], PFP mining algorithms based
on the propositions in both [13, 15] will always report PFPs with totally distinct
periods. To ensure only PFPs with similar periods are reported, Nofong [11] defined
a periodic frequent pattern as follows.
Definition 3 [11] Given a database 𝐃, minimum support threshold 𝜀, periodicity
threshold p, difference factor p1 , a pattern S and PS , S is a periodic frequent pattern
if sup𝐃 (S) ≥ 𝜀, (p − p1 ) ≤ Prd(S) − std(PS ) and Prd(S) + std(PS ) ≤ (p + p1 ).
where, Prd(S) (the mean of PS , that is, x̄ (PS )) is the periodicity of S and std(PS ) the
standard deviation in PS .
Nofong [11] observed that with Definition 3, though PFPs with similar periods
will be reported, some may be periodic due to random chance without inherent item
relationship. To address this issue, the productiveness measure (proposed in [16])
was incorporated in defining the productive periodic frequent patterns as the set of
PFPs with inherent item relationship.
Philippe et al. in [3] introduced PFPM, an efficient algorithm with novel pruning
techniques for mining periodic frequent patterns. Unlike the methods proposed in
[11, 13, 15], Philippe et al. [3] introduced three periodicity measures (the minimum,
maximum and average periodicity measures) for mining user desired periodic fre-
quent patterns. Employing the three measures proposed in [3] gives users the advan-
tages of more flexibility in PFP mining.
As mentioned previously, the propositions in [3, 11, 13, 15] and works based on
these propositions ([6–9, 12, 14]) are faced with the challenges of time and mem-
ory inefficiency, and, difficulty in finding early termination mechanisms in the PFP
mining process.
Fast and Memory Efficient Mining of Periodic Frequent Patterns 227
We adopt the PFP definition (Definition 3) proposed in [11]. To enable us mine PFPs
based on Definition 3 while addressing the time and memory inefficiencies in PFP
discovery, we show that the periodicity of a pattern can be evaluated from its support
and the size of the database as follows:
Proof
( Let covD (S) = (T1 , T2 , T3 , … , Tm−1 , Tm ), then PS can) be expressed as, PS =
(T1 − 0), (T2 − T1 ), (T3 − T2 ), … , (Tm − Tm−1 ), (|𝐃| − Tm ) . Then, PrdD (S) (which
∑|PS |
PSi
is the mean of PS ) can thus be expressed as PrdD (S) = i=1
|P |
S
. Hence, PrdD (S) =
(T1 − 0) + (T2 − T1 ) + (T3 − T2 ) + ⋯ + (Tm − Tm − 1 ) + (|𝐃| − Tm ) |𝐃|
|PS |
. This simplifies to PrdD (S) = |P S|
.
However, since |P | = |covD (S)| + 1, the periodicity of S, PrdD (S) can thus be
S
With the Big-O notation, the runtime complexity of Function 1 (based Lemma 1)
turns out to be O(n) while that of Function 2 is O(n2 ). This shows that employing
Lemma 1 in evaluating the periodicity of patterns will result in a significant reduction
in runtime for discovering PFPs.
Though Lemma 1 will evaluate the periodicity of a pattern, it will not be able to
detect the standard deviation (which will be required to identify PFPs with similar
periodicities—see Definition 3) among the set of periods of patterns. In exiting works
228 V. M. Nofong
on periodic frequent pattern mining, the set of periods for each pattern have to be
obtained (from their coversets) to derive the standard deviation among the set of
periods. Obtaining the sets of periods and subsequently evaluating the periodicity is
both time and memory expensive.
To avoid this situation, we show that the standard deviation among the set of
periods can be directly evaluated from the coverset without necessarily obtaining
the set of periods as follows.
Lemma 2 Given covD (S) = {n1 , n2 , n3 , … , nm−1 , nm },√the standard deviation among
XS + YS + ZS
the set of periods of S can be evaluated as std(PS ) = |covD (S)| + 1
where:
∑
m−1
ZS = (n2j + n2j−1 + Prd(S)2 − 2nj nj−1 ) (7)
j=2
Proof Let x̄ = Prd(S). If covD (S) = {n1 , n2 , n3 , … , nm−1 , nm }, then set of periods of
S will be PS = {n1 − 0, n2 − n1 , n3 − n2 , … , nm − nm−1 , |D| − nm }. Hence, the vari-
∑|PS | (PS − x̄ )2
ance among the periods of S will be Var(PS ) = i=1 ( i|PS | ). This expands to
((n − 0) − x̄ )2 + ((n − n ) − x̄ )2 + ⋯ + ((n − n ) − x̄ )2 + ((|D| − n ) − x̄ )2
Var(PS ) = 1 2 1 m
|covD (S)| + 1
m−1 m
.
Expanding the expressions in the numerator, we get:
((n1 − 0) − x̄ )2 = n21 + x̄ 2 − 2n1 x̄ .
((n2 − n1 ) − x̄ )2 = n21 + n22 + x̄ 2 − 2n1 n2 − 2n2 x̄ + 2n1 x̄ .
((n3 − n2 ) − x̄ )2 = n22 + n23 + x̄ 2 − 2n2 n3 − 2n3 x̄ + 2n2 x̄ .
⋮
((nm − nm−1 ) − x̄ )2 = n2m−1 + n2m + x̄ 2 − 2nm−1 nm − 2nm x̄ + 2nm−1 x̄ .
((|D| − nm ) − x̄ )2 = n2m + |D|2 + x̄ 2 − 2nm |D| − 2|D|̄x + 2nm x̄ .
Since expansions are being summed, we get the following:
YS = n2m + |D|2 + Prd(S)2 − 2nm |D| − 2|D|Prd(S) (for the last expansion) … 2
∑
m−1
ZS = (n2j + n2j−1 + Prd(S)2 − 2nj nj−1 (for any other period in PS ) … 3
j=2
Fast and Memory Efficient Mining of Periodic Frequent Patterns 229
where, ZS is the variance value for any other period in PS which is not the first or last
period.
As such with only the coverset of a pattern, the variance and standard devia-
XS + YS + ZS
tions among its periods can be obtained respectively as: Var(PS ) = |cov , and
√ D (S)| + 1
XS + YS + ZS
std(PS ) = |cov (S)| + 1
. ⊔
⊓
D
Our proposed measures for evaluating the periodicity, standard deviation and vari-
ance among the set of periods unlike the existing measures have the advantage of
reducing both runtime and memory consumption in PFP discovery.
We incorporate our proposed periodicity measures on existing PFP discovery
algorithms to experimentally check their effectiveness.
4 Experimental Results
Table 1 Datasets
Dataset Origin Characteristics
Kosarak10K SPMF [2] 10,000 Transactions—partly
dense
Accident FIMI repository 7593 Transactions—very
dense
Tafeng Nov. 2000 AIIA lab 31,807 Transactions—very
sparse
230 V. M. Nofong
For time performance and scalability, we compare the runtimes of the above men-
tioned implementations on the datasets shown in Table 1.
Figures 1, 2 and 3 show the runtime comparison of the above mentioned imple-
mentations on the Kosarak10K, Accident and Tafeng datasets respectively. As can
be seen in all three Figures, employing our proposed techniques significantly reduces
the runtime required in discovering PFPs. For instance, in Fig. 1, PFP+, an imple-
mentation based on our proposed techniques is almost twice as efficient as PFP* in
periodic frequent pattern discovery. Also, as can be seen in Figs. 2 and 3, PFP+ and
PPFP+ (based on our proposed techniques) are also slightly more efficient compared
to than PFP* and PPFP in periodic frequent pattern discovery.
For memory usage, we compare the memory used by the above mentioned imple-
mentations on the datasets shown in Table 1 on PFP discovery.
Figures 4 and 5a, b show the memory usage comparison of the above mentioned
implementations on the Kosarak10K, Tafeng and Accident datasets respectively. As
can be seen in all three Figures, employing our proposed techniques significantly
reduces the memory used in discovering PFPs. In Fig. 5b for instance, both PFP+
and PPFP+ (implementations based on our proposed techniques) are almost twice
as efficient as PFP* and PPFP with regards to memory usage in periodic frequent
pattern discovery.
5 Conclusion
Existing PFP mining algorithms often employ a two stage process for discovering
PFPs hence making them both time and memory inefficient in the discovery process.
In this paper, we propose techniques to reduce both runtime and memory usage in
PFP discovery. We have incorporated these techniques on existing PFP mining algo-
rithms and show experimentally that our proposed techniques are efficient in reduc-
ing both the runtime and memory used in discovering periodic frequent patterns.
232 V. M. Nofong
References
1. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large
databases. ACM SIGMOD Rec. 22(2), 207–216 (1993)
2. Fournier-Viger, P., et al.: The SPMF open-source data mining library version 2. In: Proceedings
of European Conference on Machine Learning and Knowledge Discovery in Databases 2016.
Springer International Publishing (2016)
3. Fournier-Viger, P., Lin, C.W., Duong, Q.H., Dam, T.L., Ševčík, L., Uhrin, D., Voznak, M.:
PFPM: discovering periodic frequent patterns with novel periodicity measures. In: Proceedings
of the 2nd Czech-China Scientific Conference 2017. InTech (2017)
4. Jian, P., Jiawei, H., Hongjun, L., Shojiro, N., Shiwei, T., Dongqing, Y.: H-Mine: fast and space-
preserving frequent pattern mining in large databases. IIE Trans. 39(6), 593–605 (2007)
5. Kiran, R.U., Kitsuregawa, M.: Discovering quasi-periodic-frequent patterns in transactional
databases. In: Bhatnagar, V., Srinivasa, S. (eds.) BDA 2013. LNCS, vol. 8302, pp. 97–115.
Springer International Publishing (2013)
6. Kiran, R.U., Kitsuregawa, M.: Novel techniques to reduce search space in periodic-frequent
pattern mining. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A.,
Thalheim, B. (eds.) DASFAA 2014. LNCS, vol. 8422, pp. 377–391. Springer International
Publishing (2014)
7. Kiran, R.U., Reddy, P.K.: Towards efficient mining of periodic-frequent patterns in transac-
tional databases. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds.) DASFAA 2010.
LNCS, vol. 6262, pp. 194–208. Springer, Heidelberg (2010)
8. Kiran, R.U., Reddy, P.K.: An alternative interestingness measure for mining periodic-frequent
patterns. In: Yu, J.X., Kim, M.H., Unland, R. (eds.) DASFAA 2011. LNCS, vol. 6587, pp.
183–192. Springer, Heidelberg (2011)
9. Kumar, V., Valli Kumari, V.: Incremental mining for regular frequent patterns in vertical for-
mat. Int. J. Eng. Tech. 5(2), 1506–1511 (2013)
10. Lin, J.C.W., Zhang, J., Fournier-Viger, P., Hong, T.P., Zhang, J.: A two-phase approach to mine
short-period high-utility itemsets in transactional databases. Adv. Eng. Inf. 33, 29–43 (2017)
11. Nofong, V.M.: Discovering productive periodic frequent patterns in transactional databases.
Ann. Data Sci. 3(3), 235–249 (2016)
12. Rashid, M.M., Gondal, I., Kamruzzaman, J.: Regularly frequent patterns mining from sensor
data stream. In: Lee, M., Hirose, A., Hou, Z.G., Kil, R. (eds.) NIP 2013. LNCS, vol. 8227, pp.
417–424. Springer, Heidelberg (2013)
13. Rashid, M.M., Karim, M.R., Jeong, B.S., Choi, H.J.: Efficient mining regularly frequent pat-
terns in transactional databases. In: Lee, S., Peng, Z., Zhou, X., Moon, Y., Unland, R., Yoo, J.
(eds.) DASFAA 2012. LNCS, vol. 7238, pp. 258–271. Springer, Heidelberg (2012)
14. Surana, A., Kiran, R.U., Reddy, P.K.: An efficient approach to mine periodic-frequent patterns
in transactional databases. In: Cao, L., Huang, J.Z., Bailey, J., Koh, Y.S., Luo, J. (eds.) PAKDD
2011 Workshops. LNAI, vol. 7104, pp. 254–266. Springer, Heidelberg (2012)
15. Tanbeer, S.K., Ahmed, C.F., Jeong, B.S., Lee, Y.K.: Discovering periodic-frequent patterns
in transactional databases. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T. (eds.)
PAKDD 2009. LNAI, vol. 5476, pp. 242–253. Springer, Heidelberg (2009)
16. Webb, G.I.: Self-sufficient itemsets: an approach to screening potentially interesting associa-
tions between items. ACM Trans. Knowl. Discov. Data 4(1), 3:1–3:20 (2010)
17. Zaki, M.J., Gouda, K.: Fast vertical mining using diffsets. In: Proceedings of the 9th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 326–335
(2003)