
International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 7, July 2013

PFI Testing On Uncertain Databases


Resma K.S.
M.Tech Student, Department of ISE, PESIT, Bangalore, India

Abstract: The data involved in many applications is inexact in nature. For example, sensor readings and satellite images are noisy, and the analysis of a customer's transactions yields only the probability that the customer will purchase each item from a given set of items. This paper studies the important problem of mining frequent item sets from an uncertain database using what is called the PFI (Probabilistic Frequent Itemset) Testing technique.

Keywords: Uncertain Databases, Probabilistic Frequent Itemsets, PFI Mining.

The probability attached to each item represents the chance that the customer is likely to buy that product on a future visit; it can be obtained by analysing browsing histories. For example, consider the customer Ramu and assume that Ramu visited the shop 20 times. Out of those 20 visits he clicked on the toys section 10 times, which implies that there is a 50% chance of his buying a toy when he visits the shop again. The same happens in the case of a Global Positioning System, where the attribute uncertainty model is used to attach confidence values to the attribute values [4][5].

A. Possible World Semantics

TABLE 2: Possible worlds for Table 1

World   Tuples
W1      {Video}; {Books}
W2      {Video}; {Books, Toys}
W3      {Video}; {Books, Pen}
W4      {Video}; {Books, Pen, Toys}
W5      {Video, Toys}; {Books}
W6      {Video, Toys}; {Books, Toys}
W7      {Video, Toys}; {Books, Pen}
W8      {Video, Toys}; {Books, Pen, Toys}

I. INTRODUCTION

As the number of applications involving uncertain databases grows, a tremendous amount of research is being done in this field [1]. This paper is mainly concerned with extracting frequent item sets from a large uncertain database, interpreted under the Possible World Semantics [2]. The problem is difficult because an uncertain database contains a very large number of possible worlds. By observing that the mining process can be modelled as a Poisson binomial distribution, an approximate algorithm that can efficiently and accurately discover frequent item sets in a large uncertain database is implemented. It is also examined how an existing algorithm that extracts exact item sets can be extended to do so. The approach supports both the tuple and attribute uncertainty models, which are commonly used for uncertain databases. Extensive evaluation is done on real and synthetic datasets to validate the implemented approaches.
TABLE 1: Uncertain database illustration

Probability of worlds W1-W8 (Table 2): 1/9, 1/18, 2/9, 1/9, 1/9, 1/18, 2/9, 1/9.

Customer   Items purchased
Ramu       (Toys: 1/2), (Video: 1)
Shamu      (Books: 1), (Toys: 1/3), (Pen: 2/3)

II. UNCERTAIN DATABASE

An uncertain database [1][3] is a database consisting of uncertain transactions; an uncertain transaction is one that contains uncertain data items. If the presence of an item is represented by an existential probability, we call it an uncertain item. Table 1 illustrates an uncertain database: it represents an online market-basket database [6] recording the purchase behaviour of two customers, Ramu and Shamu. The value associated with each attribute shows the probability of occurrence of that item.

An uncertain database is usually characterized using the possible world semantics [2]. Under this semantics, the database corresponds to a set of deterministic instances (possible worlds), each consisting of ordinary tuples. Consider the possible world w from Table 1 that consists of the two tuples {Video} for Ramu and {Books} for Shamu. {Video} occurs with probability (1 - 1/2) × 1 = 1/2, and {Books} occurs with probability 1 × (1 - 1/3) × (1 - 2/3) = 2/9; hence the probability that w exists is (1/2) × (2/9) = 1/9. This is the most accurate way of evaluating queries on an uncertain database, and any algorithm working on uncertain databases has to give the same result as query evaluation under the possible world semantics. However, such evaluation is very costly, since an uncertain database contains a large number of possible worlds: Table 1 alone yields 2^3 = 8 possible worlds, and every added tuple multiplies their number, forcing all mining results from the previous possible worlds to be re-evaluated. Table 2 gives the set of possible worlds that can be generated from Table 1. The column for probability gives
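The world-probability computation above can be sketched in code. The dictionary encoding of Table 1 and the function name below are illustrative choices, not from the paper:

```python
from itertools import product

# Hypothetical encoding of Table 1: item -> existential probability.
db = {
    "Ramu":  {"Toys": 0.5, "Video": 1.0},
    "Shamu": {"Books": 1.0, "Toys": 1/3, "Pen": 2/3},
}

def possible_worlds(db):
    """Enumerate every deterministic instance and its probability."""
    customers = list(db)
    per_customer = []
    for c in customers:
        items = list(db[c])
        alts = []
        # Each subset of a customer's uncertain items is one alternative tuple.
        for mask in product([0, 1], repeat=len(items)):
            p, chosen = 1.0, []
            for bit, it in zip(mask, items):
                if bit:
                    p *= db[c][it]
                    chosen.append(it)
                else:
                    p *= 1 - db[c][it]
            if p > 0:                      # drop impossible tuples (e.g. omitting Video)
                alts.append((frozenset(chosen), p))
        per_customer.append(alts)
    worlds = []
    # A world picks one alternative per customer; probabilities multiply.
    for combo in product(*per_customer):
        prob, tuples = 1.0, {}
        for c, (itemset, p) in zip(customers, combo):
            prob *= p
            tuples[c] = itemset
        worlds.append((tuples, prob))
    return worlds

worlds = possible_worlds(db)   # 2 alternatives for Ramu x 4 for Shamu = 8 worlds
```

The eight worlds and their probabilities match Table 2; in particular the world ({Video}; {Books}) comes out with probability 1/9, as derived above.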

ISSN: 2231-2803

http://www.ijcttjournal.org

Page 2035

the probability of occurrence of each possible world, calculated using the laws of probability.

III. PROBABILISTIC FREQUENT ITEMSETS

In a certain database, the support of an item I, written S(I), is defined as the number of transactions in the database that contain I. If the support of an item I is greater than a user-defined threshold, that item is called a frequent item. In a certain database the support is a single value, but that is not the case in an uncertain database, where each possible world carries a different support value for each item I. Probabilistic frequent itemsets (PFIs) can be defined as the sets of attribute values that occur with a sufficiently high probability. Once the probability of occurrence of each possible world has been calculated for each item from Table 1, the support probability mass function (s-pmf) has to be calculated: the s-pmf is the probability mass function of the support count, i.e. the number of tuples containing the itemset. Under the possible world semantics a database induces a set of possible worlds, each possible world giving one support count for the given itemset; hence the support of an itemset is described by this pmf.

Given the values of minimum support (min_sup) and minimum probability (min_prob), we can check whether an item I is a PFI in three steps:

STEP 1: Find a real number R [8] that satisfies min_prob = 1 - F(MSC(D) - 1, R), where D is the database being mined, F is the cumulative distribution function of the Poisson distribution, and MSC(D) [7] is the minimum support count, obtained as min_sup × n, where min_sup is the threshold value and n is the number of transactions in the database.

STEP 2: Calculate M, the expected number of times the item I occurs in the database. It is obtained by scanning the database once.

STEP 3: If M >= R, conclude that the item is a PFI.
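The three steps above can be sketched as follows. The Poisson CDF F follows the Le Cam approximation [8]; the function names and the binary search for R are illustrative choices, not the paper's exact procedure:

```python
import math

def poisson_cdf(k, lam):
    """F(k, lam) = P(X <= k) for X ~ Poisson(lam), computed term by term."""
    term = total = math.exp(-lam)
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def pfi_test(item_probs, min_sup, min_prob):
    """Sketch of the three-step PFI test.
    item_probs: existential probability of the item in each transaction."""
    n = len(item_probs)
    msc = max(1, math.ceil(min_sup * n))     # minimum support count MSC(D)
    # STEP 1: find R with min_prob = 1 - F(msc - 1, R).
    # 1 - F(msc - 1, R) grows with R, so bracket and bisect.
    lo, hi = 0.0, 1.0
    while 1 - poisson_cdf(msc - 1, hi) < min_prob:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if 1 - poisson_cdf(msc - 1, mid) < min_prob:
            lo = mid
        else:
            hi = mid
    R = (lo + hi) / 2
    # STEP 2: expected support M, obtained in one scan of the database.
    M = sum(item_probs)
    # STEP 3: declare the item a PFI if M >= R.
    return M >= R
```

For the item {Toys} of Table 1 (probabilities 1/2 and 1/3 in the two transactions), min_sup = 0.5 and min_prob = 0.4 give R = -ln 0.6 ≈ 0.51 and M ≈ 0.83, so the test reports a PFI.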
V. CASE STUDY: THE MODIFIED APRIORI

The PFI testing technique just described is not tied to any particular mining algorithm; because it supports both tuple and attribute uncertainty, it can be adopted by any existing algorithm. Let us now see how it can be incorporated into the Apriori algorithm [3], an important PFI mining algorithm. The resulting procedure (Algorithm 1) uses the bottom-up framework of Apriori: starting from k = 1, size-k PFIs (called k-PFIs) are first generated. Then, using Theorem 1, size-(k+1) candidate item sets are derived from the k-PFIs, from which the (k+1)-PFIs are found. The process continues with larger k until no larger candidate item sets can be discovered.

Figure 1: s-pmf of the PFI {Toys} from Table 1

An easy way of finding PFIs would be to derive the frequent patterns in each possible world and then record the probabilities of occurrence of these derived patterns. This is impractical, due to the large number of possible worlds. As a solution, algorithms have been developed that successfully retrieve PFIs without deriving all possible worlds.

IV. PROPOSED SYSTEM

Is it possible to find the PFIs in a faster way? The answer is the PFI Testing technique [1], in which an item I is tested to see whether it is a PFI without actually calculating its frequentness probability. This technique can be incorporated into any existing mining algorithm.

A. PFI Testing
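The s-pmf shown in Figure 1 is a Poisson binomial distribution and can be computed exactly by a simple dynamic program over the transactions; the function name below is an illustrative choice, not from the paper:

```python
def s_pmf(probs):
    """Support pmf of an itemset, where probs[i] is the probability that
    transaction i contains the itemset (a Poisson binomial distribution)."""
    pmf = [1.0]                            # before any transaction, support is 0
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            nxt[k] += q * (1 - p)          # itemset absent from this transaction
            nxt[k + 1] += q * p            # itemset present in this transaction
        pmf = nxt
    return pmf

# {Toys} is present in Ramu's transaction w.p. 1/2 and in Shamu's w.p. 1/3,
# giving an s-pmf of approximately [1/3, 1/2, 1/6] for support counts 0, 1, 2.
print(s_pmf([1/2, 1/3]))
```

Each pass folds one transaction into the distribution, so the cost is O(n^2) for n transactions; this is exactly the exact computation that the Poisson-based PFI test avoids.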

Figure 2: Modified Apriori architecture

The PFI algorithm works as follows:
1. It creates singleton items.
2. It performs the PFI test on those items.
3. It creates size-2 item sets from the PFIs derived in step 2.
4. The process continues until no more item sets can be formed; the final set of PFIs, where no more


combinations can be made, forms the probabilistic frequent itemsets for the given set of uncertain data items.
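Under an assumed attribute-independence model, the loop described above can be sketched as follows; `modified_apriori` and the pluggable `is_pfi` test are illustrative names, not the paper's exact code:

```python
import math
from itertools import combinations

def modified_apriori(db, min_sup, min_prob, is_pfi):
    """Bottom-up sketch of Algorithm 1.
    db: list of transactions, each a dict item -> existential probability.
    is_pfi: pluggable PFI test taking (per-transaction probabilities,
    min_sup, min_prob), e.g. the Poisson-based test of Section IV."""
    items = sorted({i for t in db for i in t})
    # k = 1: create singleton items and PFI-test them (steps 1-2)
    pfis = [frozenset([i]) for i in items
            if is_pfi([t.get(i, 0.0) for t in db], min_sup, min_prob)]
    result, k = list(pfis), 1
    while pfis:
        # derive size-(k+1) candidates by joining k-PFIs (steps 3-4)
        cands = {a | b for a, b in combinations(pfis, 2) if len(a | b) == k + 1}
        pfis = []
        for c in cands:
            # assuming item independence, P(c in t) is the product of the
            # member items' existential probabilities in transaction t
            probs = [math.prod(t.get(i, 0.0) for i in c) for t in db]
            if is_pfi(probs, min_sup, min_prob):
                pfis.append(c)
        result += pfis
        k += 1
    return result
```

The loop terminates once a level yields no PFIs, so the candidate space never grows past the last non-empty level, mirroring the Apriori pruning property.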

Algorithm 1 differs from Apriori in that every step requiring the computation of a frequentness probability is replaced by the PFI testing technique.
Complexity: In Algorithm 1, each candidate itemset needs O(n) time to test whether it is a PFI. This is much faster than Apriori, which verifies a PFI in O(n^2) time. Moreover, since D is scanned once for all k-PFI candidates Ck, at most n tuples in total are retrieved for each Ck (instead of |Ck| × n). The space complexity is O(|Ck|) for each candidate set Ck, in order to maintain the expected support of each candidate.

(a) Number of PFIs vs. min_sup

B. RUN TIME VS MINIMUM PROBABILITY

VI. RESULTS

The experiment was conducted using the Accidents dataset, taken from the FIMI (Frequent Itemset Mining Implementations) repository. The data were collected by the National Institute of Statistics for the Flanders region of Belgium over the period 1990-2000, filled in by police officers at the time of each accident. The dataset consists of more than 340,000 accident records with more than 500 attribute values.

A. NUMBER OF PFIS VS MINIMUM SUPPORT

The graph shows that as the threshold value (min_sup) increases, the number of PFIs decreases; the number of PFIs is inversely proportional to the threshold value.

From the above graph we observe that the running time of the algorithm decreases for higher values of minimum probability.

C. RUN TIME VS MINIMUM SUPPORT

Here the time taken by the algorithm decreases as the value of minimum support increases: from graph A we have the number of PFIs getting reduced when the minimum support value increases, and thus by comparison we can


say that the time required to calculate the frequentness probability of those items reduces.

VII. CONCLUSION AND FUTURE WORK

In this paper an attempt is made to develop a model-based approach to mine frequent items that are probabilistic in nature from an uncertain database. It uses the PFI testing technique to determine the PFIs more accurately, faster, and in a much more cost-effective way; the technique is more accurate and faster than evaluation under the possible world semantics, and it supports both attribute and tuple uncertainty. Future work on this topic includes incorporating the PFI testing technique into other existing mining algorithms. It can also be applied to the incremental mining of uncertain databases, and to different operations on uncertain databases such as clustering and classification. The study can further be extended to tuple updates as well as tuple deletions.
REFERENCES

[1] L. Wang, D. W. Cheung, R. Cheng, S. D. Lee, and X. S. Yang, "Efficient Mining of Frequent Item Sets on Large Uncertain Databases."
[2] T. Bernecker, H.-P. Kriegel, M. Renz, F. Verhein, and A. Zuefle, "Probabilistic Frequent Itemset Mining in Uncertain Databases."
[3] W. Zhang, X. Lin, J. Pei, and Y. Zhang, "Managing Uncertain Data: Probabilistic Approaches."
[4] C. Aggarwal, Y. Li, J. Wang, and J. Wang, "Frequent Pattern Mining with Uncertain Data," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), 2009.
[5] W. Cheung and O. R. Zaïane, "Incremental Mining of Frequent Patterns without Candidate Generation or Support Constraint," Proc. Seventh Int'l Database Eng. and Applications Symp. (IDEAS), 2003.
[6] C. Aggarwal and P. Yu, "A Survey of Uncertain Data Algorithms and Applications," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, pp. 609-623, May 2009.
[7] R. Agrawal, T. Imieliński, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, 1993.
[8] L. Le Cam, "An Approximation Theorem for the Poisson Binomial Distribution," Pacific J. Math., vol. 10, pp. 1181-1197, 1960.

