Proceedings of International Conference on Computing Sciences
WILKES100 ICCS 2013
ISBN: 978-93-5107-172-3

Mining probabilistic delegate recurrent samples as of uncertain data

Amit Batra 1*, Anshu Prashar 2, Sunil Kumar Kaushik 3 and Anjali Batra 4
1,2,3,4 Department of CSE, HCTM Technical Campus, Kaithal, India

Abstract

Probabilistic recurrent sample mining over uncertain data has received a great deal of attention recently due to the wide applications of uncertain data. Similar to its counterpart in deterministic databases, however, probabilistic recurrent sample mining suffers from the same problem of generating an exponential number of result samples. The large number of discovered patterns hinders further evaluation and analysis, and calls for the need to find a small number of representative patterns to approximate all other patterns. This paper formally defines the problem of probabilistic delegate recurrent sample (P-DRS) mining, which aims to find the minimal set of patterns with sufficiently high probability to represent all other patterns. The problem's bottleneck turns out to be checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of the supports of two patterns. To address the problem, we propose a novel and efficient dynamic programming-based approach. Moreover, we have devised a set of effective optimization strategies to further improve the computation efficiency. Our experimental results demonstrate that the proposed P-DRS mining effectively reduces the size of the set of probabilistic frequent patterns. Our proposed approach not only discovers the set of P-DRSs efficiently, but also restores the frequency probability information of patterns with an error guarantee.

2013 Elsevier Science. All rights reserved.

Keywords: P-DRSs, Mining, FIMI, IIP

1. Introduction

Uncertainty is inherent in data from many different domains, including sensor network monitoring, moving object tracking, and protein-protein interaction data [6]. A survey of state-of-the-art uncertain data mining techniques can be found in [1].
As one of the most fundamental data mining tasks, frequent pattern mining over uncertain data has also received a great deal of research attention, since it was first introduced in [3]. Currently, there exist two different definitions of frequent patterns in the context of uncertain data: expected support-based frequent patterns [3, 11], and probabilistic frequent patterns [4, 5]. Some initial research work has been undertaken to find a small set of representative patterns. For example, mining probabilistic frequent closed patterns over uncertain data has been studied in [7, 8, 9]. In the context of deterministic data, a pattern is closed if it is the longest pattern that appears in the same set of transactions supporting its sub-patterns. As a generalization of the concept of frequent closed patterns, Xin et al. [20] proposed the notion of an ε-covered relationship between patterns.

2. Related Work

In this section, we review related research from two subareas: recurrent sample mining over uncertain data and recurrent sample summarization.

* Corresponding author: Amit Batra
2.1. Recurrent sample mining over uncertain data

Based on the definition of a frequent pattern, existing work on mining frequent patterns over uncertain data falls into two categories: expected support-based frequent pattern mining [3, 10, 11] and probabilistic frequent pattern mining [4, 5]. For mining expected support-based frequent patterns, there are three representative algorithms: UApriori [3], UFP-growth [10], and UH-Mine [11]. For mining probabilistic frequent patterns, two representative algorithms are DP, a dynamic programming-based Apriori algorithm [4], and DC, a divide-and-conquer-based Apriori algorithm [5]. [12, 13] use the normal distribution and the Poisson distribution, respectively, to approximate the frequency probability of patterns.

2.2. Recurrent sample summarization

Various concepts have been proposed, such as maximal patterns [14], frequent closed patterns [15], and non-derivable patterns [16]. There are several generalizations of closed patterns, such as the pattern profiling based approaches [17, 18, 19] and the support distance based approaches [20, 21]. It was observed in [21] that the profile-based approaches [17, 18] have some drawbacks, such as no error guarantee on restored support. Tang and Peterson [8] proposed mining probabilistic frequent closed patterns, based on a concept called probabilistic support. Tong et al. [9] pointed out that frequent closed patterns defined on probabilistic support cannot guarantee that the patterns are closed in the possible worlds which contribute to their probabilistic supports.

3. Problem Definition

This section first introduces preliminary definitions and then formulates the problem of probabilistic delegate recurrent sample (P-DRS) mining. Xin et al. [20] defined a robust distance measure between patterns in deterministic data.
DEFINITION 3.1 (distance measure). Given two samples X1 and X2, the distance between them, denoted as d(X1, X2), is defined as 1 - |T(X1) ∩ T(X2)| / |T(X1) ∪ T(X2)|, where T(Xi) is the set of transactions supporting sample Xi.

Then, an ε-covered relationship is defined on two patterns where one subsumes another.

DEFINITION 3.2 (ε-covered). Given a real number ε ∈ [0, 1] and two samples X1 and X2, we say X1 is ε-covered by X2 if X1 ⊆ X2 and d(X1, X2) ≤ ε. As commonly used in recurrent sample mining, X1 ⊆ X2 denotes that X1 is a subset of X2 (e.g. {a} ⊆ {a, b}). It can easily be proved that, if X2 ε-covers X1, then supp(X2)/supp(X1) ≥ 1 - ε. The goal of representative recurrent sample mining then becomes finding the minimal set of samples that ε-cover all recurrent samples.

TABLE I. An uncertain database with attribute uncertainty

ID   Transaction
T1   a:0.7  b:0.2
T2   a:1.0  c:0.5

TABLE II. The possible worlds of Table I

ID   Possible World                Prob.
w1   {T1: {},     T2: {a}}        0.12
w2   {T1: {a},    T2: {a}}        0.28
w3   {T1: {b},    T2: {a}}        0.03
w4   {T1: {a, b}, T2: {a}}        0.07
w5   {T1: {},     T2: {a, c}}     0.12
w6   {T1: {a},    T2: {a, c}}     0.28
w7   {T1: {b},    T2: {a, c}}     0.03
w8   {T1: {a, b}, T2: {a, c}}     0.07

In the context of uncertain data, the support of a pattern, supp(Xi), becomes a discrete random variable. Therefore, we cannot directly apply the ε-cover relationship to probabilistic frequent patterns. Before explaining
how to extend the concept of ε-covered in the context of uncertain data, we examine an uncertain database where attributes are associated with existential probabilities. Table I shows an uncertain transaction database where each transaction consists of a set of probabilistic items. For example, the probability that item a appears in the first transaction T1 is 0.7. Possible world semantics are commonly used to explain the existence of data in an uncertain database. For example, the database in Table I has eight possible worlds, which are listed in Table II. Each possible world is associated with an existential probability. For instance, the probability that the first possible world w1 exists is (1 - 0.7) × (1 - 0.2) × 1.0 × (1 - 0.5) = 0.12.

DEFINITION 3.3 (probabilistic distance measure). Given an uncertain database D and two samples X1 and X2, let PW = {w1, . . . , wm} be the set of possible worlds derived from D. The distance between X1 and X2 in a possible world wj ∈ PW is dist(X1, X2; wj) = 1 - |T(X1; wj) ∩ T(X2; wj)| / |T(X1; wj) ∪ T(X2; wj)|, where T(Xi; wj) is the set of transactions containing sample Xi in the possible world wj. Then, the probabilistic distance between X1 and X2, denoted by dist(X1, X2), is a random variable.

DEFINITION 3.4 (ε-cover probability). Given an uncertain database D, two patterns X1 and X2, and a distance threshold ε, the ε-cover probability of X1 and X2 is Prcover(X1, X2; ε) = Pr(dist(X1, X2) ≤ ε).

DEFINITION 3.5 ((ε, δ)-covered). Given an uncertain database D, two samples X1 and X2, a distance threshold ε and a cover probability threshold δ, we say X2 (ε, δ)-covers X1 if X1 ⊆ X2 and Prcover(X1, X2; ε) ≥ δ.

4. P-DRS Mining Algorithm

The overall framework of our P-DRS mining algorithm is shown in Algorithm 1. In lines 3-9, we find the cover set for each pattern X2 in the set of pseudo probabilistic frequent patterns F̃. The most important step is to check whether X2 covers X1 ∈ F (line 6).
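For a database as small as Table I, Definitions 3.3 and 3.4 can be checked directly by enumerating possible worlds. The following sketch (the function names `possible_worlds` and `pr_cover` are ours, chosen for illustration; the Table I database is hard-coded) computes the probability of world w1 and a cover probability by brute force:

```python
from itertools import product

# Table I: each transaction maps item -> existential probability.
db = {"T1": {"a": 0.7, "b": 0.2}, "T2": {"a": 1.0, "c": 0.5}}

def possible_worlds(db):
    """Yield (world, probability) pairs, one per possible world of db."""
    slots = [(tid, item, p) for tid, items in db.items() for item, p in items.items()]
    for bits in product([True, False], repeat=len(slots)):
        world = {tid: set() for tid in db}
        prob = 1.0
        for present, (tid, item, p) in zip(bits, slots):
            prob *= p if present else 1.0 - p
            if present:
                world[tid].add(item)
        if prob > 0.0:
            yield world, prob

def pr_cover(db, x1, x2, eps):
    """Definition 3.4: Pr(dist(x1, x2) <= eps), summed over possible worlds."""
    total = 0.0
    for world, prob in possible_worlds(db):
        t1 = {tid for tid, items in world.items() if x1 <= items}
        t2 = {tid for tid, items in world.items() if x2 <= items}
        union = t1 | t2
        # Convention: the distance is 0 when neither pattern occurs anywhere.
        dist = (1.0 - len(t1 & t2) / len(union)) if union else 0.0
        if dist <= eps:
            total += prob
    return total

# Probability of possible world w1 = {T1: {}, T2: {a}} from Table II:
p_w1 = sum(p for w, p in possible_worlds(db)
           if w["T1"] == set() and w["T2"] == {"a"})
```

Here `p_w1` evaluates to 0.12, matching Table II, and `pr_cover(db, {"a"}, {"a", "c"}, 0.5)` sums the worlds w5-w8, in which at least half of the transactions supporting {a} also support {a, c}. Enumeration is exponential in the number of probabilistic items, which is exactly why the paper resorts to dynamic programming.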
The details of the function isCover are illustrated in Algorithm 2, where lines 1-3 implement the optimization. Finally, in lines 10-14, we use the dynamic programming-based scheme to compute the ε-cover probability. As mentioned before, the function setCover in Algorithm 1 is solved using the greedy algorithm in [22].

------------------------------------------------------------------------------
Algorithm 1 P-DRS MINING FRAMEWORK
Input: D, F, F̃, ε and δ
Output: Minimal P-DRS set R
 1: R ← ∅
 2: CoverSets ← ∅
 3: for all X2 ∈ F̃ do
 4:     NoCoverSet ← ∅
 5:     for all X1 ∈ F such that X1 ⊆ X2 do
 6:         if isCover(X1, X2) = True then
 7:             CoverSets[X2].add(X1)
 8:         else
 9:             NoCoverSet.add(X1)
10: R ← setCover(CoverSets, F)
11: return R

Algorithm 2 Function isCover
Input: X1, X2, ε, δ
Output: If X2 (ε, δ)-covers X1, then return True, else False
 1: for all X ∈ CoverSets[X2] do
 2:     if X ⊆ X1 then
 3:         return True
 4: for all X ∈ NoCoverSet[X2] do
 5:     if X1 ⊆ X then
 6:         return False
 7: Q ← Π_{m ∈ D(X1)} (1 - p_m^X1 + p_m^X2)
 8: if Q ≥ δ then
 9:     return True
10: for l ← 0 to |D(X1)| do
11:     for k ← ⌈(1 - ε)·l⌉ to l do
12:         Pcover ← Pcover + Pr(supp(X1) = l, supp(X2) = k)
13:         if Pcover ≥ δ then
14:             return True
15: return False
------------------------------------------------------------------------------

5. Performance Study

This section evaluates the effectiveness of P-DRSs, the performance of our approach for P-DRS mining, and the optimization strategies.

5.1. Data sets

Two datasets have been used in our experiments. The first is the Retail dataset from the Frequent Itemset Mining (FIMI) Dataset Repository. This is one of the standard datasets used in recurrent sample mining in deterministic databases. In order to bring uncertainty into the dataset, we synthesize an existential probability for each item based on a Gaussian distribution with a mean of 0.9 and a variance of 0.125. This dataset is an uncertain database that associates uncertainty with attributes. The second is the iceberg sighting record from 1993 to 1997 on the North Atlantic from the International Ice Patrol (IIP) Iceberg Sightings Database. The IIP Iceberg Sightings Database collects information on iceberg activity in the North Atlantic. Each transaction in the database contains the date, location, size, shape, reporting source and a confidence level. The confidence level has six possible values, R/V (radar and visual), R (radar only), V (visual), MEA (measured), EST (estimated) and GBL (garbled), which indicate different reliabilities of that tuple. We translate the confidence levels to probabilities 0.8, 0.7, 0.6, 0.5, 0.4 and 0.3, respectively. This dataset is an uncertain database that associates uncertainty with tuples.

5.2. Result analysis

We first evaluate the compression rate of the P-DRSs with respect to the variation of parameters. We randomly select 1000 transactions from the two datasets respectively to conduct the experiment.
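The dynamic programming scheme invoked in lines 10-14 of Algorithm 2 relies on the joint distribution Pr(supp(X1) = l, supp(X2) = k), which can be built one transaction at a time: since X1 ⊆ X2, a transaction supporting X2 necessarily supports X1, leaving three outcomes per transaction. The sketch below is our illustration of this idea, not the paper's exact implementation; the function names are ours, the per-transaction support probabilities are assumed to be precomputed, and the pruning of lines 1-9 is omitted.

```python
import math

def joint_support_dist(probs):
    """probs[m] = (p1, p2): probabilities that transaction m supports X1 and
    X2 respectively. X1 ⊆ X2 implies p2 <= p1 and that supporting X2
    entails supporting X1. Returns f with
    f[l][k] = Pr(supp(X1) = l, supp(X2) = k)."""
    n = len(probs)
    f = [[0.0] * (n + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for p1, p2 in probs:
        g = [[0.0] * (n + 1) for _ in range(n + 1)]
        for l in range(n):            # supp so far is < n while a transaction remains
            for k in range(l + 1):
                v = f[l][k]
                if v == 0.0:
                    continue
                g[l + 1][k + 1] += v * p2        # supports both X1 and X2
                g[l + 1][k] += v * (p1 - p2)     # supports X1 but not X2
                g[l][k] += v * (1.0 - p1)        # supports neither
        f = g
    return f

def cover_probability(probs, eps):
    """Pr(dist(X1, X2) <= eps): since T(X2) ⊆ T(X1) in every possible world,
    dist <= eps exactly when supp(X2) >= (1 - eps) * supp(X1), which is the
    summation carried out in lines 10-14 of Algorithm 2."""
    f = joint_support_dist(probs)
    n = len(probs)
    p_cover = 0.0
    for l in range(n + 1):
        for k in range(math.ceil((1.0 - eps) * l), l + 1):
            p_cover += f[l][k]
    return p_cover
```

On Table I with X1 = {a} and X2 = {a, c}, the per-transaction probabilities are probs = [(0.7, 0.0), (1.0, 0.5)], and `cover_probability(probs, 0.5)` evaluates to 0.5, matching a direct enumeration of the possible worlds in Table II. The DP runs in O(n³) per pattern pair rather than the exponential cost of world enumeration.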
The sizes of R, the set of P-DRSs, and F, the set of probabilistic frequent patterns, with respect to the variations of minsup, minprob, ε, and δ on the two datasets are shown in Figures 1 and 2, respectively. The default values of the four parameters are set to 0.5, 0.8, 0.2 and 0.2, respectively. It can be observed from the results on both datasets that, when minsup and minprob are low, the compression rate of P-DRSs is high because there are more probabilistic frequent patterns. For the variations of ε and δ, obviously, a high compression rate can be achieved if the probabilistic distance threshold ε is high and/or the cover probability threshold δ is low.

We then examine the runtime of the proposed algorithm for P-DRS mining. Figures 3 and 4 show the runtime vs. minsup, minprob, ε, and δ curves on 1000 transactions randomly selected from the two datasets, respectively. The default values of the four parameters are the same as in the first experiment. It is intuitive that, when ε is increasing or minsup, minprob and δ are decreasing, the runtime will increase because more pattern pairs are engaged in cover probability checking. We find that the growth of both ε and δ leads to a trade-off between the number of P-DRSs and runtime.

We also evaluate the effectiveness of the optimization strategies. We randomly select 500 transactions from the two datasets, respectively, to carry out this experiment. The default values for the experiments on the Retail dataset are: minsup = 4, minprob = 0.8, ε = 0.1 and δ = 0.2. On the IIP dataset, the four parameters are set to 10, 0.8, 0.1 and 0.2 by default, respectively. Figure 5 shows the runtime of the basic version of our algorithm, and the runtime of the algorithm integrated with the optimization strategies, with respect to the variation of ε and δ on the two datasets, respectively. The results clearly reveal the effectiveness of the optimization strategies by demonstrating that the optimized algorithm significantly reduces the runtime.
Fig. 1 The number of P-DRSs on Retail
Fig. 2 The number of P-DRSs on IIP
Fig. 3 Runtime on Retail
6. Conclusions

Due to the downward closure property, the number of probabilistic frequent patterns mined over uncertain data can be so large that it hinders further analysis and exploitation.

Fig. 4 Runtime on IIP
Fig. 5 Effect of optimization

This paper proposes P-DRS mining, which aims to find a small set of samples to represent the complete set of probabilistic recurrent samples. To address the data uncertainty issue, we define the concept of probabilistic distance, as well as an (ε, δ)-cover relationship between two samples. P-DRSs are the minimal set of patterns that (ε, δ)-cover the complete set of probabilistic recurrent samples. We develop a P-DRS mining algorithm that uses a dynamic programming-based scheme to efficiently check whether one pattern (ε, δ)-covers another. We also exploit effective optimization strategies to further improve the computation efficiency. Our experimental results demonstrate that the devised data mining algorithm effectively and efficiently discovers the set of P-DRSs, which can substantially reduce the size of the set of probabilistic recurrent samples.

This work extends the measure defined in deterministic databases to quantify the distance between two samples in terms of their supporting transactions. Since the supports of samples are random variables in the context of uncertain data, other distance measures, such as Kullback-Leibler divergence, might be applicable. As
ongoing work, we will study the effectiveness of probabilistic delegate recurrent samples defined on different distance measures.

7. Acknowledgement

I would like to thank Lovely Professional University, Phagwara, Punjab for giving me the opportunity to present this paper at the International Conference on Computing Sciences and to have it subsequently published by Elsevier as proceedings. I also thank the Hon'ble Director, HCTM Technical Campus, Kaithal for making resources available.

References

[1] Aggarwal, C.C., Yu, P.S.: A survey of uncertain data algorithms and applications. IEEE Transactions on Knowledge and Data Engineering 21(5) (2009) 609-623.
[2] Aggarwal, C.C.: Managing and mining uncertain data. Springer (2009).
[3] Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. PAKDD (2007) 47-58.
[4] Bernecker, T., Kriegel, H.P., Renz, M., Verhein, F., Zuefle, A.: Probabilistic frequent itemset mining in uncertain databases. SIGKDD (2009) 119-128.
[5] Sun, L., Cheng, R., Cheung, D.W., Cheng, J.: Mining uncertain data with probabilistic guarantees. SIGKDD (2010) 273-282.
[6] Tong, Y., Chen, L., Cheng, Y., Yu, P.S.: Mining frequent itemsets over uncertain databases. VLDB Endowment 5(11) (2012) 1650-1661.
[7] Peterson, E.A., Tang, P.: Fast approximation of probabilistic frequent closed itemsets. ASRC (2012) 214-219.
[8] Tang, P., Peterson, E.A.: Mining probabilistic frequent closed itemsets in uncertain databases. ASRC (2011) 86-91.
[9] Tong, Y., Chen, L., Ding, B.: Discovering threshold-based frequent closed itemsets over probabilistic data. ICDE (2012) 270-281.
[10] Leung, C., Mateo, M., Brajczuk, D.: A tree-based approach for frequent pattern mining from uncertain data. Advances in Knowledge Discovery and Data Mining (2008) 653-661.
[11] Aggarwal, C.C., Li, Y., Wang, J.: Frequent pattern mining with uncertain data. SIGKDD (2009) 29-38.
[12] Calders, T., Garboni, C., Goethals, B.: Approximation of frequentness probability of itemsets in uncertain data. ICDE (2010) 749-754.