Mining probabilistic delegate recurrent samples from uncertain data
Amit Batra, Anshu Prashar, Sunil Kumar Kaushik and Anjali Batra

Department of CSE, HCTM Technical Campus, Kaithal, India
Probabilistic recurrent sample mining over uncertain data has received a great deal of attention recently due to the wide
applications of uncertain data. Similar to its counterpart in deterministic databases, however, probabilistic recurrent sample
mining suffers from the same problem of generating an exponential number of result samples. The large number of discovered
samples hinders further evaluation and analysis, and calls for a small number of representative patterns that
approximate all other patterns. This paper formally defines the problem of probabilistic delegate recurrent sample (P-DRS)
mining, which aims to find the minimal set of patterns with sufficiently high probability to represent all other patterns. The
problem's bottleneck turns out to be checking whether a pattern can probabilistically represent another, which involves the
computation of a joint probability of the supports of two patterns. To address the problem, we propose a novel and efficient
dynamic programming-based approach. Moreover, we have devised a set of effective optimization strategies to further
improve the computation efficiency. Our experimental results demonstrate that the proposed P-DRS mining effectively
reduces the size of the set of probabilistic frequent patterns. Our proposed approach not only discovers the set of P-DRSs efficiently, but
also restores the frequency probability information of patterns with an error guarantee.
Keywords: P-DRSs, Mining, FIMI, IIP
1. Introduction
Uncertainty is inherent in data from many different domains, including sensor network monitoring, moving
object tracking, and protein-protein interaction data [6]. A survey of state-of-the-art uncertain data mining
techniques can be found in [1]. As one of the most fundamental data mining tasks, frequent pattern mining over
uncertain data has also received a great deal of research attention since it was first introduced in [3]. Currently,
there exist two different definitions of frequent patterns in the context of uncertain data: expected support-based
frequent patterns [3, 11] and probabilistic frequent patterns [4, 5]. Some initial research work has been
undertaken to find a small set of representative patterns. For example, mining probabilistic frequent closed
patterns over uncertain data has been studied in [7, 8, 9]. In the context of deterministic data, a pattern is closed if
it is the longest pattern that appears in the same set of transactions supporting its sub-patterns. As a generalization
of the concept of frequent closed patterns, Xin et al. [20] proposed the notion of an $\epsilon$-covered relationship
between patterns.
2. Related Work
In this section, we review related research from two subareas: recurrent sample mining over uncertain data and
recurrent sample summarization.
2.1. Recurrent sample mining over uncertain data.
Based on the definition of a frequent pattern, existing work on mining frequent patterns over uncertain data
falls into two categories: expected support-based frequent pattern mining [3, 10, 11] and probabilistic frequent
pattern mining [4, 5]. For mining expected support-based frequent patterns, there are three representative
algorithms: UApriori [3], UFP-growth [10] and UH-Mine [11]. For mining probabilistic frequent patterns, two
representative algorithms are the dynamic programming-based Apriori algorithm (DP) [4] and the divide-and-conquer-based
Apriori algorithm (DC) [5]. The works in [12] and [13] use the Normal distribution and the Poisson distribution, respectively, to
approximate the frequency probability of patterns.
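To make the two notions of frequency concrete, the following is a minimal Python sketch, not the algorithm of [4] itself. It assumes that each transaction contains the pattern independently with a known probability, so the support follows a Poisson-binomial distribution. The function names and the example probabilities are hypothetical.

```python
def support_distribution(probs):
    """Return dist where dist[s] = Pr(supp = s), given independent
    per-transaction containment probabilities (an assumption of this sketch)."""
    dist = [1.0]  # before any transaction, the support is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            nxt[s] += mass * (1.0 - p)   # transaction does not contain the pattern
            nxt[s + 1] += mass * p       # transaction contains the pattern
        dist = nxt
    return dist


def expected_support(probs):
    """Expected support: the sum of the containment probabilities."""
    return sum(probs)


def frequentness_probability(probs, minsup):
    """Pr(supp >= minsup), obtained from the exact support distribution."""
    return sum(support_distribution(probs)[minsup:])


# Hypothetical containment probabilities of one pattern in three transactions.
probs = [0.9, 0.5, 0.8]
print(expected_support(probs))
print(frequentness_probability(probs, 2))
```

An expected support-based miner would compare `expected_support` with a threshold, while a probabilistic miner would compare `frequentness_probability` with minprob.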
2.2. Recurrent sample summarization.
Various concepts have been proposed, such as maximal patterns [14], frequent closed patterns [15], and non-derivable
patterns [16]. There are several generalizations of closed patterns, such as the pattern profiling-based
approaches [17, 18, 19] and the support distance-based approaches [20, 21]. It was observed in [21] that the
profile-based approaches [17, 18] have some drawbacks, such as no error guarantee on restored support. Tang
and Peterson [8] proposed mining probabilistic frequent closed patterns, based on the concept called probabilistic
support. Tong et al. [9] pointed out that frequent closed patterns defined on probabilistic support cannot guarantee
that the patterns are closed in the possible worlds which contribute to their probabilistic supports.
3. Problem Definition
This section first introduces preliminary definitions and then formulates the problem of probabilistic delegate
recurrent sample (P-DRS) mining. Xin et al. [20] defined a robust distance measure between patterns in
deterministic data.
DEFINITION 3.1 (distance measure) Given two samples $X_1$ and $X_2$, the distance between them, denoted as
$d(X_1, X_2)$, is defined as $1 - |T(X_1) \cap T(X_2)| / |T(X_1) \cup T(X_2)|$, where $T(X_i)$ is the set of transactions
supporting sample $X_i$. Then, an $\epsilon$-covered relationship is defined on two patterns where one subsumes another.
DEFINITION 3.2 ($\epsilon$-covered) Given a real number $\epsilon \in [0, 1]$ and two samples $X_1$ and $X_2$, we say $X_1$ is
$\epsilon$-covered by $X_2$ if $X_1 \subseteq X_2$ and $d(X_1, X_2) \le \epsilon$.
As is common in recurrent sample mining, $X_1 \subseteq X_2$ denotes that $X_1$ is a subset of $X_2$ (e.g. $\{a\} \subseteq \{a, b\}$). It can
be proved easily that, if $X_2$ $\epsilon$-covers $X_1$, then $(supp(X_1) - supp(X_2)) / supp(X_1) \le \epsilon$. The goal of representative recurrent
sample mining then becomes finding the minimal set of samples that $\epsilon$-cover all recurrent samples.
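The deterministic distance and the $\epsilon$-covered test can be sketched in a few lines of Python. This is an illustration only: the toy database `db` and the function names are hypothetical, and returning a distance of 0 when neither sample is supported by any transaction is an assumption of the sketch, not part of Definition 3.1.

```python
def supporting_transactions(db, pattern):
    """Return the IDs of the transactions that contain every item of `pattern`."""
    return {tid for tid, items in db.items() if pattern <= items}


def distance(db, x1, x2):
    """Distance of Definition 3.1: 1 - |T(X1) & T(X2)| / |T(X1) | T(X2)|."""
    t1 = supporting_transactions(db, x1)
    t2 = supporting_transactions(db, x2)
    union = t1 | t2
    if not union:
        return 0.0  # assumption: unsupported samples are treated as identical
    return 1.0 - len(t1 & t2) / len(union)


def is_eps_covered(db, x1, x2, eps):
    """True if x1 is eps-covered by x2 (Definition 3.2)."""
    return x1 <= x2 and distance(db, x1, x2) <= eps


# Hypothetical deterministic database with four transactions.
db = {
    "T1": frozenset("ab"),
    "T2": frozenset("a"),
    "T3": frozenset("abc"),
    "T4": frozenset("ab"),
}
x1, x2 = frozenset("a"), frozenset("ab")
print(distance(db, x1, x2))             # supp(x1) = 4, supp(x2) = 3
print(is_eps_covered(db, x1, x2, 0.3))
```

Because `x1` is a subset of `x2`, the distance equals (supp(x1) - supp(x2)) / supp(x1), which matches the bound stated above.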
TABLE I An uncertain database with attribute uncertainty
ID    Transactions
T1    a:0.7  b:0.2
T2    a:1.0  c:0.5
TABLE II An example of possible worlds
ID    Possible World               Prob.
w1    {T1: {}, T2: {a}}            0.12
w2    {T1: {a}, T2: {a}}           0.28
w3    {T1: {b}, T2: {a}}           0.03
w4    {T1: {a, b}, T2: {a}}        0.07
w5    {T1: {}, T2: {a, c}}         0.12
w6    {T1: {a}, T2: {a, c}}        0.28
w7    {T1: {b}, T2: {a, c}}        0.03
w8    {T1: {a, b}, T2: {a, c}}     0.07
In the context of uncertain data, the support of a pattern, $supp(X_i)$, becomes a discrete random variable.
Therefore, we cannot directly apply the $\epsilon$-cover relationship to probabilistic frequent patterns. Before explaining
how to extend the concept of $\epsilon$-covered to the context of uncertain data, we examine an uncertain database where
attributes are associated with existential probabilities. Table I shows an uncertain transaction database where
each transaction consists of a set of probabilistic items.
For example, the probability that item a appears in the first transaction T1 is 0.7. Possible world semantics are
commonly used to explain the existence of data in an uncertain database. For example, the database in Table I has
eight possible worlds, which are listed in Table II. Each possible world is associated with an existential
probability. For instance, the probability that the first possible world w1 exists is
$(1 - 0.7) \times (1 - 0.2) \times 1 \times (1 - 0.5) = 0.12$.
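The possible worlds of Table II can be reproduced by enumerating, for every probabilistic item, whether it is present or absent. The sketch below assumes that items exist independently of each other; the names `TABLE_I` and `possible_worlds` are hypothetical.

```python
from itertools import product

# The uncertain database of Table I: transaction ID -> {item: existential probability}.
TABLE_I = {
    "T1": {"a": 0.7, "b": 0.2},
    "T2": {"a": 1.0, "c": 0.5},
}


def possible_worlds(udb):
    """Yield (world, probability) pairs, skipping worlds of probability zero.

    Assumes that all items exist independently of each other."""
    slots = [(tid, item, p) for tid, items in udb.items() for item, p in items.items()]
    for choices in product([True, False], repeat=len(slots)):
        prob = 1.0
        world = {tid: set() for tid in udb}
        for (tid, item, p), present in zip(slots, choices):
            prob *= p if present else 1.0 - p
            if present:
                world[tid].add(item)
        if prob > 0.0:
            yield {tid: frozenset(items) for tid, items in world.items()}, prob


worlds = list(possible_worlds(TABLE_I))
for world, prob in worlds:
    print({tid: sorted(items) for tid, items in world.items()}, round(prob, 2))
```

Item a in T2 has probability 1.0, so every world in which it is absent has probability zero and is skipped, which leaves the eight worlds of Table II.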
DEFINITION 3.3. (probabilistic distance measure) Given an uncertain database D and two samples $X_1$ and
$X_2$, let $PW = \{w_1, \ldots, w_m\}$ be the set of possible worlds derived from D. The distance between $X_1$ and $X_2$ in a
possible world $w_j \in PW$ is $dist(X_1, X_2; w_j) = 1 - |T(X_1; w_j) \cap T(X_2; w_j)| / |T(X_1; w_j) \cup T(X_2; w_j)|$, where $T(X_i; w_j)$ is
the set of transactions containing sample $X_i$ in the possible world $w_j$. Then, the probabilistic distance between $X_1$
and $X_2$, denoted by $dist(X_1, X_2)$, is a random variable.
DEFINITION 3.4. ($\epsilon$-cover probability) Given an uncertain database D, two patterns $X_1$ and $X_2$, and a
distance threshold $\epsilon$, the $\epsilon$-cover probability of $X_1$ and $X_2$ is $Pr_{cover}(X_1, X_2; \epsilon) = Pr(dist(X_1, X_2) \le \epsilon)$.
DEFINITION 3.5. (($\epsilon$, $\delta$)-covered) Given an uncertain database D, two samples $X_1$ and $X_2$, a distance
threshold $\epsilon$ and a cover probability threshold $\delta$, we say $X_2$ ($\epsilon$, $\delta$)-covers $X_1$ if $X_1 \subseteq X_2$ and
$Pr_{cover}(X_1, X_2; \epsilon) \ge \delta$.
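Definitions 3.3 to 3.5 can be checked by brute force on the example of Table II: compute the distance in each possible world and add up the probabilities of the worlds whose distance is at most $\epsilon$. The sketch below is illustrative only. The function names are hypothetical, patterns are written as strings of item names, and treating the distance as 0 in a world where neither sample occurs is an assumption.

```python
# The eight possible worlds of Table II: (items of T1, items of T2, probability).
TABLE_II = [
    ("", "a", 0.12), ("a", "a", 0.28), ("b", "a", 0.03), ("ab", "a", 0.07),
    ("", "ac", 0.12), ("a", "ac", 0.28), ("b", "ac", 0.03), ("ab", "ac", 0.07),
]


def world_distance(transactions, x1, x2):
    """dist(X1, X2; w) of Definition 3.3 for one possible world."""
    t1 = {i for i, items in enumerate(transactions) if set(x1) <= set(items)}
    t2 = {i for i, items in enumerate(transactions) if set(x2) <= set(items)}
    union = t1 | t2
    if not union:
        return 0.0  # assumption: neither sample occurs in this world
    return 1.0 - len(t1 & t2) / len(union)


def cover_probability(worlds, x1, x2, eps):
    """Pr(dist(X1, X2) <= eps) of Definition 3.4, summed over possible worlds."""
    return sum(prob for t1_items, t2_items, prob in worlds
               if world_distance((t1_items, t2_items), x1, x2) <= eps)


def covers(worlds, x1, x2, eps, delta):
    """True if x2 (eps, delta)-covers x1 (Definition 3.5)."""
    return set(x1) <= set(x2) and cover_probability(worlds, x1, x2, eps) >= delta


print(cover_probability(TABLE_II, "a", "ab", 0.5))   # worlds w4 and w8 qualify
print(covers(TABLE_II, "a", "ab", 0.5, 0.1))
```

Enumerating worlds is exponential in the number of uncertain items, which is why a polynomial-time scheme is needed for realistic databases.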
4. P-DRS Mining Algorithm
The overall framework of our P-DRS mining algorithm is shown in Algorithm 1. In lines 3-9, we find the
cover set for each pattern $X_2$ in the set of pseudo probabilistic frequent patterns $\hat{F}$. The most important step is to check
whether $X_2$ covers $X_1 \in F$ (line 6). The details of the function isCover are given in Algorithm 2, where lines 1-3
implement the optimization. Finally, in lines 10-14, we use the dynamic programming-based scheme to
compute the $\epsilon$-cover probability. As mentioned before, the function setCover in Algorithm 1 is solved using the
greedy algorithm in [22].
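Algorithm 1 P-DRS mining
Input: D, F, $\hat{F}$, $\epsilon$ and $\delta$
Output: Minimal P-DRS set R
1: $R \leftarrow \emptyset$
2: $CoverSets \leftarrow \emptyset$
3: for all $X_2 \in \hat{F}$ do
4:     $NoCoverSet \leftarrow \emptyset$
5:     for all $X_1 \in F$ such that $X_1 \subseteq X_2$ do
6:         if isCover($X_1$, $X_2$) = True then
7:             CoverSets[$X_2$].add($X_1$)
8:         else
9:             NoCoverSet.add($X_1$)
10: $R \leftarrow$ setCover(CoverSets, F)
11: return R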
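The setCover step in line 10 can be illustrated with the textbook greedy heuristic for set cover: repeatedly pick the candidate that covers the most patterns that are still uncovered. The sketch below is a generic version of that heuristic, not necessarily the exact procedure of [22]. The function name and the example patterns are hypothetical, and the greedy choice yields an approximation of the minimal cover rather than a guaranteed minimum.

```python
def greedy_set_cover(cover_sets, universe):
    """Pick representatives greedily until every pattern in `universe` is covered.

    cover_sets maps a candidate representative to the set of patterns it covers."""
    uncovered = set(universe)
    chosen = []
    while uncovered and cover_sets:
        # Candidate that covers the largest number of still-uncovered patterns.
        best = max(cover_sets, key=lambda rep: len(cover_sets[rep] & uncovered))
        gained = cover_sets[best] & uncovered
        if not gained:
            break  # the remaining patterns are not covered by any candidate
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical cover sets over five frequent patterns.
a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
ab, ac = frozenset("ab"), frozenset("ac")
cover_sets = {ab: {a, b, ab}, ac: {a, c, ac}, c: {c}}
representatives = greedy_set_cover(cover_sets, [a, b, c, ab, ac])
print([sorted(r) for r in representatives])
```

In this toy input two representatives, {a, b} and {a, c}, are enough to cover all five patterns.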
Algorithm 2 Function isCover
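Input: $X_1$, $X_2$, $\epsilon$ and $\delta$
Output: If $X_2$ ($\epsilon$, $\delta$)-covers $X_1$, then return True, else return False
1: for all $X \in$ CoverSets[$X_2$] do
2:     if $X \subseteq X_1$ then
3:         return True
4: for all $X \in$ NoCoverSet[$X_2$] do
5:     if $X \supseteq X_1$ then
6:         return False
7: $Q(|D(X_1)|) \leftarrow$ [expression garbled in the source]
8: if $Q(|D(X_1)|) \ge \delta$ then
9:     return True
10: for $l \leftarrow 0$ to $|D(X_1)|$ do
11:     for $k \leftarrow (1 - \epsilon) l$ to $l$ do
12:         $P_{cover} \leftarrow P_{cover} + Pr(supp(X_1) = l, supp(X_2) = k)$
13:         if $P_{cover} \ge \delta$ then
14:             return True
15: return False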
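The joint probability Pr(supp(X1) = l, supp(X2) = k) used in lines 10-14 can be computed without enumerating possible worlds. The following Python sketch shows one way to do this with dynamic programming over transactions. It is an illustration under stated assumptions, not the authors' exact scheme: it assumes attribute-level uncertainty with independent items and transactions and that X1 is a subset of X2, it omits the early termination and the optimizations of Algorithm 2, and all function names are hypothetical.

```python
from math import prod


def containment_prob(transaction, pattern):
    """Probability that a transaction contains every item of `pattern`."""
    return prod(transaction.get(item, 0.0) for item in pattern)


def joint_support_distribution(udb, x1, x2):
    """Return joint where joint[l][k] = Pr(supp(x1) = l, supp(x2) = k).

    Assumes x1 is a subset of x2, so a transaction that contains x2 contains x1."""
    n = len(udb)
    joint = [[0.0] * (n + 1) for _ in range(n + 1)]
    joint[0][0] = 1.0
    for t_index, transaction in enumerate(udb, start=1):
        p1 = containment_prob(transaction, x1)            # contains x1
        p2 = p1 * containment_prob(transaction, x2 - x1)  # contains x2 (and x1)
        nxt = [[0.0] * (n + 1) for _ in range(n + 1)]
        for l in range(t_index):
            for k in range(l + 1):
                mass = joint[l][k]
                if mass == 0.0:
                    continue
                nxt[l][k] += mass * (1.0 - p1)        # contains neither sample
                nxt[l + 1][k] += mass * (p1 - p2)     # contains x1 but not x2
                nxt[l + 1][k + 1] += mass * p2        # contains both samples
        joint = nxt
    return joint


def cover_probability(udb, x1, x2, eps):
    """Sum the joint probabilities over all (l, k) with (l - k) <= eps * l."""
    joint = joint_support_distribution(udb, x1, x2)
    return sum(row[k] for l, row in enumerate(joint)
               for k in range(l + 1) if l - k <= eps * l)


# The uncertain database of Table I.
table_i = [{"a": 0.7, "b": 0.2}, {"a": 1.0, "c": 0.5}]
print(cover_probability(table_i, frozenset("a"), frozenset("ab"), 0.5))
```

The table has O(n^2) cells and each transaction updates each cell once, so the sketch runs in O(n^3) time for n transactions, compared with the exponential cost of enumerating possible worlds.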
5. Performance Study
This section evaluates the effectiveness of P-DRSs, the performance of our approach for P-DRS mining, and
the optimization strategies.
5.1. Data sets
Two datasets have been used in our experiments. The first is the Retail dataset from the Frequent Itemset
Mining (FIMI) Dataset Repository. This is one of the standard datasets used in recurrent sample mining in
deterministic databases. In order to bring uncertainty into the dataset, we synthesize an existential probability for
each item based on a Gaussian distribution with a mean of 0.9 and a variance of 0.125. This dataset is an
uncertain database that associates uncertainty with attributes. The second is the iceberg sighting record from
1993 to 1997 on the North Atlantic from the International Ice Patrol (IIP) Iceberg Sightings Database. The IIP
Iceberg Sightings Database collects information on iceberg activities in the North Atlantic. Each transaction in the
database contains the date, location, size, shape, reporting source and a confidence level. The
confidence level has six possible values, R/V (radar and visual), R (radar only), V (visual), MEA (measured),
EST (estimated) and GBL (garbled), which indicate different reliabilities of that tuple. We translate these confidence
levels to the probabilities 0.8, 0.7, 0.6, 0.5, 0.4 and 0.3, respectively. This dataset is an uncertain database that
associates uncertainty with tuples.
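The two ways of attaching uncertainty can be sketched as follows. The Gaussian parameters and the mapping from confidence levels to probabilities are taken from the description above. Clipping the sampled value to the interval [0.01, 1.0], the fixed random seed, the function names and the example transaction are assumptions made for illustration.

```python
import random
from math import sqrt

# Tuple-level uncertainty: IIP confidence level -> existential probability.
CONFIDENCE_TO_PROB = {
    "R/V": 0.8, "R": 0.7, "V": 0.6, "MEA": 0.5, "EST": 0.4, "GBL": 0.3,
}


def synthesize_item_probability(rng, mean=0.9, variance=0.125):
    """Draw an existential probability from a Gaussian distribution.

    Clipping to [0.01, 1.0] is an assumption; the handling of out-of-range
    samples is not specified in the text."""
    p = rng.gauss(mean, sqrt(variance))
    return min(1.0, max(0.01, p))


def add_attribute_uncertainty(transactions, rng):
    """Attach a synthesized probability to every item of every transaction."""
    return [{item: synthesize_item_probability(rng) for item in items}
            for items in transactions]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
uncertain_retail = add_attribute_uncertainty([["32", "41", "59"], ["41"]], rng)
print(uncertain_retail)
print(CONFIDENCE_TO_PROB["R/V"])
```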
5.2. Result analysis
We first evaluate the compression rate of the P-DRSs with respect to the variation of the parameters. We
randomly select 1000 transactions from each of the two datasets to conduct the experiment. The sizes of R,
the set of P-DRSs, and F, the set of probabilistic frequent patterns, with respect to the variations of minsup,
minprob, $\epsilon$ and $\delta$ on the two datasets are shown in Figures 1 and 2, respectively. The default values of the four
parameters are set to 0.5, 0.8, 0.2 and 0.2, respectively. It can be observed from the results on both datasets that,
when minsup and minprob are low, the compression rate of P-DRSs is high because there are more probabilistic
frequent patterns. For the variations of $\epsilon$ and $\delta$, a high compression rate can be achieved if the
probabilistic distance threshold $\epsilon$ is high and/or the cover probability threshold $\delta$ is low. We then examine the
runtime of the proposed algorithm for P-DRS mining. Figures 3 and 4 show the runtime vs. minsup, minprob, $\epsilon$
and $\delta$ curves on 1000 transactions randomly selected from each of the two datasets. The default values of
the four parameters are the same as in the first experiment. It is intuitive that, when $\epsilon$ is increasing or minsup,
minprob and $\delta$ are decreasing, the runtime will increase because more pattern pairs are engaged in cover
probability checking. We find that the growth of both $\epsilon$ and $\delta$ leads to a tradeoff between the number of P-DRSs
and runtime. We also evaluate the effectiveness of the optimization strategies. We randomly select 500
transactions from each of the two datasets to carry out this experiment. The default values for the
experiments on the Retail dataset are: minsup = 4, minprob = 0.8, $\epsilon$ = 0.1 and $\delta$ = 0.2. On the IIP dataset, the
four parameters are set to 10, 0.8, 0.1 and 0.2 by default, respectively. Figure 5 shows the runtime of the basic
version of our algorithm and the runtime of the algorithm integrated with the optimization strategies, with respect to
the variation of $\epsilon$ and $\delta$ on the two datasets. The results clearly reveal the effectiveness of the
optimization strategies by demonstrating that the optimized algorithm significantly reduces the runtime.
Fig. 1 The number of P-DRSs on Retail
Fig. 2 The number of P-DRSs on IIP
Fig. 3 Runtime on Retail
Fig. 4 Runtime on IIP
Fig. 5 Effect of optimization
6. Conclusions
Due to the downward closure property, the number of probabilistic frequent patterns mined over uncertain
data can be so large that they hinder further analysis and exploitation.
This paper proposes P-DRS mining, which aims to find a small set of samples to represent the complete set
of probabilistic recurrent samples. To address the data uncertainty issue, we define the concept of probabilistic
distance, as well as an ($\epsilon$, $\delta$)-cover relationship between two samples. P-DRSs are the minimal set of patterns that
($\epsilon$, $\delta$)-cover the complete set of probabilistic recurrent samples. We develop a P-DRS mining algorithm that uses
a dynamic programming-based scheme to efficiently check whether one pattern ($\epsilon$, $\delta$)-covers another. We also
exploit effective optimization strategies to further improve the computation
efficiency. Our experimental results demonstrate that the devised data mining algorithm effectively and
efficiently discovers the set of P-DRSs, which can substantially reduce the size of the set of probabilistic recurrent samples.
This work extends the measure defined in deterministic databases to quantify the distance between two
samples in terms of their supporting transactions. Since the supports of samples are random variables in the
context of uncertain data, other distance measures, such as Kullback-Leibler divergence, might be applicable. As
ongoing work, we will study the effectiveness of probabilistic delegate recurrent samples defined on different
distance measures.
7. Acknowledgement
I would like to thank Lovely Professional University, Phagwara, Punjab, for giving me the opportunity to
present this paper at the International Conference of Computing Sciences and to have it subsequently published by
Elsevier in the proceedings. I also thank the Hon'ble Director, HCTM Technical Campus, Kaithal, for helping
to make resources available.
