
Interesting Measures for Mining Association Rules

Liaquat Majeed Sheikh, Basit Tanveer, Syed Mustafa Ali Hamdani FAST-NUCES, Lahore liaquat.majeed@nu.edu.pk, basit.tanveer@gmail.com, mustafa.hamdani@gmail.com

Abstract
Discovering association rules is one of the most important tasks in data mining, and many efficient algorithms have been proposed in the literature. However, the number of discovered rules is often so large that the user cannot analyze all of them. To overcome this problem, several methods for mining only interesting rules have been proposed, along with many measures for determining the interestingness of a rule. In this paper we select eight different measures, compare them on a data set, and make some recommendations about the use of these measures for discovering the most interesting rules.

1. Introduction
In the past few years a lot of work has been done in the field of data mining, especially in finding associations between items in a database of customer transactions. Association rules identify items that are most often bought along with certain other items by a significant fraction of the customers. For example, we may find that "95 percent of the customers who bought bread also bought milk." A rule may contain more than one item in the antecedent and the consequent of the rule. Every rule must satisfy two user-specified constraints: one is a measure of statistical significance called support and the other a measure of goodness of the rule called confidence. In this paper we have identified a set of measures proposed in the literature, and we try to show that a single measure alone cannot determine the interestingness of a rule. The paper is divided into three sections. The first section gives the formal definition (as presented in the literature) and some explanation of each measure. The measures we have chosen are Support, Confidence, Conviction, Lift, Piatetsky-Shapiro (Leverage), Coverage, Correlation, and Odds Ratio. The second section gives the calculation of each measure on our sample data (customer transactions), and the last section contains our recommendations on which measures to use for discovering interesting rules.

2. Description of Different Measures
To make the measures comparable, all measures are defined using probabilities. The probability of encountering itemset X is given by

$$P(X) = \frac{\mathrm{count}(X)}{|D|}$$

where count(X) is the number of transactions that contain the itemset X and |D| is the size (number of transactions) of the database.
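The definition of P(X) can be made concrete with a minimal sketch. The toy database and the helper names `count` and `probability` below are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy database, invented for illustration (not the paper's data).
# Each transaction is the set of items in one market basket.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def count(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def probability(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """P(X) = count(X) / |D| for a non-empty database D."""
    if not transactions:
        raise ValueError("the database must contain at least one transaction")
    return count(itemset, transactions) / len(transactions)


if __name__ == "__main__":
    print(probability({"bread"}, TRANSACTIONS))
    print(probability({"bread", "milk"}, TRANSACTIONS))
```

Representing each transaction as a set keeps the containment test (`items <= t`) a direct translation of "transactions that contain the itemset X".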

2.1. Support [1]
Introduced by R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, Washington D.C., May 1993.

$$\mathrm{support}(X) = P(X)$$

Support is defined on itemsets and gives the proportion of transactions that contain X; it is therefore used as a measure of significance (importance) of an itemset. Since it basically uses the count of transactions it is often called a frequency constraint. An itemset with a support greater than a set minimum support threshold is called a frequent or large itemset. Support's main feature is that it possesses the downward closure property (anti-monotonicity), which means that all subsets of a frequent set are also frequent. This property (actually, the fact that no superset of an infrequent set can be frequent) is used to prune the search space (usually thought of as a lattice or tree of itemsets with increasing size) in level-wise algorithms (e.g., the Apriori algorithm).

The disadvantage of support is the rare item problem. Items that occur very infrequently in the data set are pruned although they would still produce interesting and potentially valuable rules. The rare item problem is important for transaction data, which usually have a very uneven distribution of support for the individual items (few items are used all the time and most items are rarely used). Its values are in range [0, 1]. If antecedent and consequent never occur together in a transaction it is equal to 0, and if they occur together in all transactions its value is equal to 1.

2.2. Confidence [1]
Introduced by R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, Washington D.C., May 1993.

$$\mathrm{confidence}(X \rightarrow Y) = \frac{P(X \text{ and } Y)}{P(X)}$$

Confidence is defined as the probability of seeing the rule's consequent under the condition that the transactions also contain the antecedent. Confidence is directed and gives different values for the rules X → Y and Y → X. Confidence is not downward closed and was developed together with support by Agrawal et al. (the so-called support-confidence framework). Support is first used to find frequent (significant) itemsets, exploiting its downward closure property to prune the search space. Then confidence is used in a second step to produce rules from the frequent itemsets that exceed a minimum confidence threshold. Its values are in range [0, 1]. If antecedent and consequent are independent it is equal to P(Y). For implications occurring in all cases the measure's value is equal to 1.

A problem with confidence is that it is sensitive to the frequency of the consequent (Y) in the database. Because of the way confidence is calculated, consequents with higher support will automatically produce higher confidence values even if there exists no association between the items.

2.3. Lift [1]
Introduced by S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 255-264, Tucson, Arizona, USA, May 1997.

$$\mathrm{lift}(X \rightarrow Y) = \frac{P(X \text{ and } Y)}{P(X)\,P(Y)}$$

Lift measures how many times more often X and Y occur together than expected if they were statistically independent. Lift is not downward closed and does not suffer from the rare item problem. Its values are in range [0, +∞). If antecedent and consequent are independent then lift is equal to 1. Values lower than 1 mean that satisfying the condition of the antecedent decreases the probability of the consequent in comparison to its unconditional probability. Consequently, values higher than 1 mean that satisfying the condition of the antecedent increases the probability of the consequent in comparison to its unconditional probability.

2.4. Conviction [1]
Introduced by Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (ACM SIGMOD '97), pages 255-264, 1997.

$$\mathrm{conviction}(X \rightarrow Y) = \frac{P(X)\,P(\overline{Y})}{P(X \text{ and } \overline{Y})}$$

Here $\overline{Y}$ denotes the absence of Y, so $P(\overline{Y}) = 1 - P(Y)$. Conviction was developed as an alternative to confidence, which was found not to capture the direction of associations adequately. Conviction compares the probability that X appears without Y if they were independent with the actual frequency of the appearance of X without Y. In that respect it is similar to lift (see Section 2.3); however, in contrast to lift it is a directed measure since it also uses the information of the absence of the consequent. An interesting fact is that conviction is monotone in confidence and lift. Its values are in range [0, +∞]. If antecedent and consequent are independent it is equal to 1. For implications occurring in all cases the measure's value is equal to +∞.
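The rule measures confidence, lift, and conviction can be sketched in a few lines. This is only an illustration under stated assumptions: the toy baskets and the function names are hypothetical, the conviction formula used is the standard one from Brin et al., P(X)P(not Y) / P(X and not Y), and every itemset passed in is assumed to occur at least once so that no denominator is zero.

```python
import math
from typing import FrozenSet, Iterable, List

# Hypothetical toy baskets, invented for illustration (not the paper's data).
BASKETS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def p(itemset: Iterable[str], db: List[FrozenSet[str]]) -> float:
    """P(X): fraction of transactions containing every item of X."""
    items = frozenset(itemset)
    return sum(1 for t in db if items <= t) / len(db)


def p_without(x: Iterable[str], y: Iterable[str], db: List[FrozenSet[str]]) -> float:
    """P(X and not Y): transactions that contain X but do not contain all of Y."""
    xs, ys = frozenset(x), frozenset(y)
    return sum(1 for t in db if xs <= t and not ys <= t) / len(db)


def confidence(x, y, db) -> float:
    """P(X and Y) / P(X); assumes X occurs at least once."""
    return p(frozenset(x) | frozenset(y), db) / p(x, db)


def lift(x, y, db) -> float:
    """P(X and Y) / (P(X) P(Y)); assumes X and Y each occur at least once."""
    return p(frozenset(x) | frozenset(y), db) / (p(x, db) * p(y, db))


def conviction(x, y, db) -> float:
    """P(X) P(not Y) / P(X and not Y); +inf when X never occurs without Y."""
    violations = p_without(x, y, db)
    if violations == 0:
        return math.inf
    return p(x, db) * (1.0 - p(y, db)) / violations


if __name__ == "__main__":
    for x, y in [({"bread"}, {"butter"}), ({"butter"}, {"bread"})]:
        print(x, "->", y, confidence(x, y, BASKETS), lift(x, y, BASKETS),
              conviction(x, y, BASKETS))
```

Running both directions of the same pair shows the point made in the text: lift is identical for bread → butter and butter → bread, while confidence and conviction differ.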

2.5. Leverage [1]
Introduced by Piatetsky-Shapiro, G. Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991: p. 229-248.

$$\mathrm{leverage}(X \rightarrow Y) = P(X \text{ and } Y) - P(X)\,P(Y)$$

Leverage measures the difference between how often X and Y appear together in the data set and what would be expected if X and Y were statistically independent. The rationale in a sales setting is to find out how many more units (items X and Y together) are sold than expected from the independent sales. Using a minimum leverage threshold at the same time incorporates an implicit frequency constraint. E.g., for setting a minimum leverage threshold of 0.01% (corresponding to 10 occurrences in a data set with 100,000 transactions) one can first use an algorithm to find all itemsets with a minimum support of 0.01% and then filter the found itemsets using the leverage constraint. Because of this property leverage can also suffer from the rare item problem. If antecedent and consequent are independent it is equal to 0.

2.6. Coverage [1]

$$\mathrm{coverage}(X \rightarrow Y) = \frac{P(X \text{ and } Y)}{P(Y)}$$

It shows what part of the itemsets from the consequent is covered by a rule. Its values are in range [0, 1].

2.7. Correlation [4]
Correlation is a statistical technique which can show whether and how strongly pairs of variables/itemsets are related.

$$\mathrm{corr}(X \rightarrow Y) = \frac{P(X \text{ and } Y) - P(X)\,P(Y)}{\sqrt{P(X)\,P(Y)\,(1 - P(X))\,(1 - P(Y))}}$$

Correlation is a bivariate measure of association (strength) of the relationship between two variables/itemsets. It varies from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), and in between them 0 means no relationship. To the extent that there is a nonlinear relationship between the two variables being correlated, correlation will understate the relationship.

2.8. Odds Ratio [3]
The odds ratio is a statistical measure which is defined as the ratio of the odds of an event occurring in one group to the odds of it occurring in another group, or to a data-based estimate of that ratio.

$$\mathrm{odds}(X \rightarrow Y) = \frac{P(X \text{ and } Y)\,P(\overline{X} \text{ and } \overline{Y})}{P(X \text{ and } \overline{Y})\,P(\overline{X} \text{ and } Y)}$$

Its values are in range [0, +∞]. For the strongest associations, where X never occurs without Y or Y never occurs without X, its value is equal to +∞.

3. Comparison of Measures
This section compares all the measures discussed in the first section. We have chosen a data set on which we have performed the Apriori algorithm to find the frequent itemsets. All the measures are applied to each of the frequent itemsets, and at the end recommendations on choosing a measure to decide which rule is interesting are given.

3.1. Sample Data
The sample data for the analysis is taken from a store database of customer transactions. There are six different types of items and a total of ten transactions. In each transaction a 1 represents the presence of an item while a 0 represents the absence of an item from the market basket.

Table 1: Sample Transactions
TID      A     B     C     D     E     F
1        1     1     0     1     0     1
2        1     0     1     1     0     1
3        1     0     1     1     0     1
4        0     1     1     1     0     0
5        0     1     0     1     1     0
6        1     0     0     0     1     1
7        1     0     1     0     1     1
8        0     0     1     0     0     0
9        0     1     1     1     0     0
10       1     1     0     1     1     0
Total    6     5     6     7     4     5
P(X)     0.6   0.5   0.6   0.7   0.4   0.5
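The measures of Sections 2.5 to 2.8 can be evaluated directly on the ten sample transactions of Table 1. The sketch below is an illustration, not the authors' implementation: the string encoding of each basket and all function names are assumptions, and the odds ratio uses the standard four-cell form with absences of X and Y.

```python
import math
from typing import Callable, Dict, FrozenSet

# Table 1 encoded as TID -> items present (the encoding is an assumption).
TABLE_1: Dict[int, FrozenSet[str]] = {
    tid: frozenset(items)
    for tid, items in {
        1: "ABDF", 2: "ACDF", 3: "ACDF", 4: "BCD", 5: "BDE",
        6: "AEF", 7: "ACEF", 8: "C", 9: "BCD", 10: "ABDE",
    }.items()
}


def frac(pred: Callable[[FrozenSet[str]], bool]) -> float:
    """Fraction of transactions in TABLE_1 that satisfy `pred`."""
    return sum(1 for t in TABLE_1.values() if pred(t)) / len(TABLE_1)


def leverage(x: str, y: str) -> float:
    """P(X and Y) - P(X) P(Y) for single items x and y."""
    return frac(lambda t: x in t and y in t) - frac(lambda t: x in t) * frac(lambda t: y in t)


def coverage(x: str, y: str) -> float:
    """P(X and Y) / P(Y), as defined in Section 2.6; assumes y occurs."""
    return frac(lambda t: x in t and y in t) / frac(lambda t: y in t)


def correlation(x: str, y: str) -> float:
    """Leverage normalised by the standard deviations of the two indicators."""
    px, py = frac(lambda t: x in t), frac(lambda t: y in t)
    return leverage(x, y) / math.sqrt(px * py * (1 - px) * (1 - py))


def odds_ratio(x: str, y: str) -> float:
    """P(X,Y) P(not X, not Y) / (P(X, not Y) P(not X, Y)); +inf if a denominator cell is empty."""
    both = frac(lambda t: x in t and y in t)
    neither = frac(lambda t: x not in t and y not in t)
    only_x = frac(lambda t: x in t and y not in t)
    only_y = frac(lambda t: x not in t and y in t)
    if only_x * only_y == 0:
        return math.inf if both * neither > 0 else math.nan
    return both * neither / (only_x * only_y)


if __name__ == "__main__":
    for x, y in [("A", "D"), ("A", "F"), ("F", "A")]:
        print(f"{x}->{y}", round(leverage(x, y), 2), round(coverage(x, y), 2),
              round(correlation(x, y), 2), odds_ratio(x, y))
```

Leverage, correlation, and the odds ratio give the same value for A → F and F → A, whereas coverage depends on the direction of the rule.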

3.2. Generating Frequent Itemsets
The frequent itemsets generated from the sample data using the Apriori algorithm are shown in the following table:

Table 2: Frequent Itemsets
Itemset    Support
{A, D}     40%
{A, F}     50%
{B, D}     50%
{C, D}     40%

The minimum support used for the generation of the frequent itemsets is 40%.

3.3. Calculations
All the measures discussed in the first section are calculated for each rule. The results are shown in Table 3.

Table 3: Calculation of Different Measures on Sample Datasets
Measure        A→D     D→A     A→F     F→A     B→D     D→B     C→D     D→C
Support        0.40    0.40    0.50    0.50    0.50    0.50    0.40    0.40
Confidence     0.67    0.57    0.83    1.00    1.00    0.71    0.67    0.57
Conviction     0.90    0.93    3.00    +∞      +∞      1.75    0.90    0.93
Lift           0.95    0.95    1.67    1.67    1.43    1.43    0.95    0.95
Leverage      -0.02   -0.02    0.20    0.20    0.15    0.15   -0.02   -0.02
Coverage       0.57    0.67    1.00    0.83    0.71    1.00    0.57    0.67
Correlation   -0.09   -0.09    0.82    0.82    0.65    0.65   -0.09   -0.09
Odds Ratio     0.67    0.67    +∞      +∞      +∞      +∞      0.67    0.67

Table 4: Subset of Sample Dataset
Rule    Confidence    Correlation    Odds Ratio
A→F     0.83          0.82           +∞
F→A     1.00          0.82           +∞
B→D     1.00          0.65           +∞
D→B     0.71          0.65           +∞

The table shown above contains the subset of measures and rules taken from Table 3, which is computed on the output of the Apriori algorithm. The Odds Ratio in this table suggests that all the rules are interesting, but if we look at the Correlation along with the Odds Ratio we see that A→F and F→A are more strongly related to each other. On combining another measure, i.e. Confidence, with these two measures we see that only the rule F→A is more interesting.

There are two types of measures: symmetric measures and asymmetric measures. If we look at a symmetric measure, e.g. Odds Ratio, we can conclude that both A→F and F→A are interesting, but the Confidence value of F→A suggests that it is more interesting than A→F. Hence we cannot draw a conclusion from a symmetric measure alone; we also have to look at an asymmetric measure in order to judge the interestingness of such pairs of rules (A→B, B→A).

4. Conclusion
No single measure alone can determine the interestingness of the rules. We have to look at a combination of different measures in order to find the rule that is really interesting.

5. References
[1] http://wwwai.wu-wien.ac.at/~hahsler/research/association_rules/measures.html
[2] www.cise.ufl.edu/class/cis6930fa03dm/notes/dm4part2.pdf
[3] http://en.wikipedia.org/wiki/Odds-ratio
[4] http://www2.chass.ncsu.edu/garson/pa765/correl.htm
[5] Przemysław Sołdacki. Discovering interesting rules from financial data. Institute of Computer Science, Warsaw University of Technology, Ul. Andersa 13, 00-159 Warszawa.
[6] Edward R. Omiecinski, Member, IEEE Computer Society. Alternative Interest Measures for Mining Associations in Databases.