
# Chapter 6: Mining Association Rules in Large Databases

- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

October 16, 2008, Data Mining: Concepts and Techniques

## What Is Association Mining?

- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, cross-marketing, catalog classification, etc.
- Rule form: "Body → Head [support, confidence]".

## Association Rule: Basic Concepts

- Given: (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in one visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * ⇒ Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  - Home Electronics ⇒ * (what other products should the store stock up on?)
  - Attached mailing in direct marketing

## Rule Measures: Support and Confidence

- Find all the rules X ∧ Y ⇒ Z with minimum support and confidence
  - support, s: probability that a transaction contains {X, Y, Z}
  - confidence, c: conditional probability that a transaction having {X, Y} also contains Z
- Example transactions:

  | Transaction ID | Items Bought |
  |---|---|
  | 2000 | A, B, C |
  | 1000 | A, C |
  | 4000 | A, D |
  | 5000 | B, E, F |

- With minimum support 50% and minimum confidence 50%:
  - A ⇒ C (50%, 66.6%)
  - C ⇒ A (50%, 100%)
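Both measures can be computed directly from the transaction list; a minimal sketch in Python over the slide's four transactions (the function names are mine, not the book's):

```python
# Transactions from the slide (minimum support 50%, minimum confidence 50%).
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head):
    """conf(body => head) = support(body union head) / support(body)."""
    return support(set(body) | set(head)) / support(body)

print(support({"A", "C"}))        # 0.5: A => C has 50% support
print(confidence({"A"}, {"C"}))   # 0.666..., i.e. 66.6% confidence
print(confidence({"C"}, {"A"}))   # 1.0: C => A has 100% confidence
```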

## Association Rule Mining: A Road Map

- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
- Single-dimension vs. multiple-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
  - What brands of milk are associated with what brands of biscuit?
- Various extensions
  - Correlation, causality analysis: association does not necessarily imply correlation or causality
  - Maxpatterns and closed itemsets


## Mining Association Rules—An Example

| Transaction ID | Items Bought |
|---|---|
| 2000 | A, B, C |
| 1000 | A, C |
| 4000 | A, D |
| 5000 | B, E, F |

Min. support 50%, min. confidence 50%:

| Frequent Itemset | Support |
|---|---|
| {A} | 75% |
| {B} | 50% |
| {C} | 50% |
| {A, C} | 50% |

For rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

## Mining Frequent Itemsets: the Key Step

- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules

## The Apriori Algorithm

- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
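The pseudo-code above can be sketched as a compact level-wise implementation in Python (the set representation and names are mine; this is an illustration, not the book's reference code):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search L1 -> C2 -> L2 -> ...; returns {itemset: count}."""
    min_count = min_support * len(transactions)
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-candidates;
        # prune step: every k-subset of a candidate must itself be frequent.
        Ck = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in Lk
                                           for s in combinations(u, k)):
                    Ck.add(u)
        # One database scan counts all surviving candidates.
        counts = {c: sum(c <= set(t) for t in transactions) for c in Ck}
        Lk = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(Lk)
        k += 1
    return frequent
```

On the four-transaction example that follows (TIDs 100-400), `apriori(..., 0.5)` recovers {2, 3, 5} with count 2 and never keeps the infrequent {1, 2}.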

## The Apriori Algorithm — Example

Database D:

| TID | Items |
|---|---|
| 100 | 1 3 4 |
| 200 | 2 3 5 |
| 300 | 1 2 3 5 |
| 400 | 2 5 |

Scan D for C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3 → L1 = {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 = {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}; scan D: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2 → L2 = {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 = {2 3 5}; scan D → L3 = {2 3 5}: 2

## How to Generate Candidates?

- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
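The same join-and-prune can be sketched in Python, with each frequent (k−1)-itemset kept as a sorted tuple so the SQL condition "first k−2 items equal, last item of p smaller than last item of q" becomes a prefix comparison (the function name is mine):

```python
from itertools import combinations

def gen_candidates(Lk_1):
    """Self-join L_{k-1} with itself, then prune candidates that have an
    infrequent (k-1)-subset. Each itemset is a sorted tuple."""
    Lk_1 = sorted(Lk_1)
    frequent = set(Lk_1)
    k_1 = len(Lk_1[0])
    Ck = []
    for i, p in enumerate(Lk_1):
        for q in Lk_1[i + 1:]:
            # Join: shared (k-2)-prefix; q's last item is the larger one.
            if p[:-1] == q[:-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must be in L_{k-1}.
                if all(s in frequent for s in combinations(c, k_1)):
                    Ck.append(c)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3))   # [('a', 'b', 'c', 'd')]; acde pruned (ade not in L3)
```

This reproduces the "Example of Generating Candidates" slide: abcd survives, acde is pruned.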

## How to Count Supports of Candidates?

- Why is counting supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - A subset function finds all the candidates contained in a transaction

## Example of Generating Candidates

- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}

## Methods to Improve Apriori's Efficiency

- Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine the completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

## Is Apriori Fast Enough? — Performance Bottlenecks

- The core of the Apriori algorithm:
  - Use frequent (k − 1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets: 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database

## Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - highly condensed, but complete for frequent pattern mining
  - avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones
  - Avoid candidate generation: sub-database test only!

## Construct FP-tree from a Transaction DB

min_support = 0.5

| TID | Items bought | (ordered) frequent items |
|---|---|---|
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |

Steps:
1. Scan the DB once to find the frequent 1-itemsets (single-item patterns); header table: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
2. Order the frequent items in each transaction in frequency-descending order
3. Scan the DB again and construct the FP-tree by inserting each ordered transaction as a path from the root, sharing common prefixes (here: f:4–c:3–a:3 with children m:2–p:2 and b:1–m:1; f:4–b:1; and c:1–b:1–p:1)
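The two-scan construction above can be sketched as follows (class and function names are mine; ties among equal-count items are broken by first occurrence, which may order them differently from the slide but yields an equally valid FP-tree):

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, min_count):
    """Two scans: (1) count items and fix a frequency-descending order,
    (2) insert each transaction's frequent items in that order,
    sharing common prefixes."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Stable sort: ties keep first-occurrence order.
    rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: -freq[i]))}
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        path = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])   # the node-links
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [
    list("facdgimp"), list("abcflmo"), list("bfhjo"),
    list("bcksp"), list("afcelpmn"),
]
# min_support = 0.5 over 5 transactions -> min_count = 3.
root, header = build_fp_tree(transactions, min_count=3)
print(root.children["f"].count)   # 4: the shared f-prefix of four transactions
```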

## Benefits of the FP-tree Structure

- Completeness:
  - never breaks a long pattern of any transaction
  - preserves complete information for frequent pattern mining
- Compactness:
  - reduces irrelevant information (infrequent items are gone)
  - frequency-descending ordering: more frequent items are more likely to be shared
  - never larger than the original database (not counting node-links and counts)
  - Example: for the Connect-4 DB, the compression ratio can be over 100

## Mining Frequent Patterns Using FP-tree

- General idea (divide-and-conquer): recursively grow frequent patterns using the FP-tree
- Method
  - For each item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)

## Major Steps to Mine FP-tree

1. Construct the conditional pattern base for each node in the FP-tree
2. Construct the conditional FP-tree from each conditional pattern base
3. Recursively mine conditional FP-trees and grow the frequent patterns obtained so far
   - If the conditional FP-tree contains a single path, simply enumerate all the patterns

## Step 1: From FP-tree to Conditional Pattern Base

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item
- Accumulate all transformed prefix paths of that item to form its conditional pattern base

| item | conditional pattern base |
|---|---|
| c | f:3 |
| a | fc:3 |
| b | fca:1, f:1, c:1 |
| m | fca:2, fcab:1 |
| p | fcam:2, cb:1 |

## Properties of FP-tree for Conditional Pattern Base Construction

- Node-link property: for any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's entry in the FP-tree header
- Prefix path property: to calculate the frequent patterns for a node ai on a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai

## Step 2: Construct Conditional FP-tree

- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- Example: the m-conditional pattern base is fca:2, fcab:1, giving the m-conditional FP-tree {} → f:3 → c:3 → a:3
- All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

## Mining Frequent Patterns by Creating Conditional Pattern Bases

| Item | Conditional pattern base | Conditional FP-tree |
|---|---|---|
| p | {(fcam:2), (cb:1)} | {(c:3)} \| p |
| m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)} \| m |
| b | {(fca:1), (f:1), (c:1)} | Empty |
| a | {(fc:3)} | {(f:3, c:3)} \| a |
| c | {(f:3)} | {(f:3)} \| c |
| f | Empty | Empty |
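The pattern-base column of this table can be reproduced without walking node-links: each item's conditional pattern base is just the multiset of frequency-ordered prefixes that precede it in the reduced transactions. A sketch (the item order is copied from the slide's header table; names are mine):

```python
from collections import Counter, defaultdict

# Order per the slide's header table (f:4, c:4, a:3, b:3, m:3, p:3).
order = ["f", "c", "a", "b", "m", "p"]
rank = {i: r for r, i in enumerate(order)}

transactions = [
    set("facdgimp"), set("abcflmo"), set("bfhjo"),
    set("bcksp"), set("afcelpmn"),
]

def conditional_pattern_bases(transactions):
    """Each frequent item's pattern base = the prefix paths preceding it
    in the frequency-ordered transactions (what accumulating the FP-tree's
    transformed prefix paths would give)."""
    bases = defaultdict(Counter)
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.get)
        for pos, item in enumerate(path):
            if pos:                      # an empty prefix carries no information
                bases[item]["".join(path[:pos])] += 1
    return bases

cpb = conditional_pattern_bases(transactions)
print(dict(cpb["m"]))   # {'fca': 2, 'fcab': 1}
print(dict(cpb["p"]))   # {'fcam': 2, 'cb': 1}
```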

## Step 3: Recursively Mine the Conditional FP-tree

- Conditional pattern base of "am": (fc:3) → am-conditional FP-tree {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree {} → f:3
- Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree {} → f:3

## Single FP-tree Path Generation

- Suppose an FP-tree T has a single path P
- The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P
- Example: the m-conditional FP-tree {} → f:3 → c:3 → a:3 yields all frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
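Enumerating the combinations of a single path is a few lines with itertools; a sketch (each pattern's support is the minimum count along the nodes it uses, all 3 in this example):

```python
from itertools import chain, combinations

def single_path_patterns(path):
    """All non-empty item subsets of a single-path FP-tree, each with
    support = minimum count among the nodes it draws from."""
    subsets = chain.from_iterable(
        combinations(path, r) for r in range(1, len(path) + 1))
    return {tuple(i for i, _ in s): min(c for _, c in s) for s in subsets}

# m-conditional FP-tree from the slide: {} -> f:3 -> c:3 -> a:3
pats = single_path_patterns([("f", 3), ("c", 3), ("a", 3)])
print(len(pats))   # 7 sub-path combinations; appending m to each (plus m
                   # itself) gives fm, cm, am, fcm, fam, cam, fcam
```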

## Principles of Frequent Pattern Growth

- Pattern growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
- Example: "abcdef" is a frequent pattern if and only if "abcde" is a frequent pattern and "f" is frequent in the set of transactions containing "abcde"

## Why Is Frequent Pattern Growth Fast?

- Our performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
- Reasoning:
  - No candidate generation, no candidate testing
  - Uses a compact data structure
  - Eliminates repeated database scans
  - The basic operations are counting and FP-tree construction

[Figure: FP-growth vs. Apriori, scalability with the support threshold. Data set T25I20D10K; run time (sec.) vs. support threshold (%).]

[Figure: FP-growth vs. Tree-Projection, scalability with the support threshold. Data set T25I20D100K; runtime (sec.) vs. support threshold (%).]

## Presentation of Association Rules (Table Form)

## Visualization of Association Rules Using a Plane Graph

## Visualization of Association Rules Using a Rule Graph

## Iceberg Queries

- Iceberg query: compute aggregates over one attribute or a set of attributes only for those whose aggregate values are above a certain threshold
- Example:

  select P.custID, P.itemID, sum(P.qty)
  from purchase P
  group by P.custID, P.itemID
  having sum(P.qty) >= 10

- Compute iceberg queries efficiently by Apriori:
  - First compute the lower dimensions
  - Then compute the higher dimensions only when all the lower ones are above the threshold


## Multiple-Level Association Rules

- Items often form a hierarchy (e.g., food → milk, bread; milk → skim, 2%; bread → wheat, white)
- Items at the lower levels are expected to have lower support
- Rules regarding itemsets at appropriate levels can be quite useful
- The transaction database can be encoded based on dimensions and levels:

  | TID | Items |
  |---|---|
  | T1 | {111, 121, 211, 221} |
  | T2 | {111, 211, 222, 323} |
  | T3 | {112, 122, 221, 411} |
  | T4 | {111, 121} |
  | T5 | {111, 122, 211, 221, 413} |

- We can explore shared multi-level mining

## Mining Multi-Level Associations

- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%]
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%]
- Variations at mining multiple-level association rules:
  - Level-crossed association rules: 2% milk → Wonder wheat bread
  - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread

## Multi-level Association: Uniform Support vs. Reduced Support

- Uniform support: the same minimum support for all levels
  - + One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
  - − Lower-level items do not occur as frequently; if the support threshold is too high, we miss low-level associations, and if it is too low, we generate too many high-level associations
- Reduced support: reduced minimum support at lower levels
  - There are 4 search strategies:
    - Level-by-level independent
    - Level-cross filtering by k-itemset
    - Level-cross filtering by single item
    - Controlled level-cross filtering by single item

## Uniform Support

Multi-level mining with uniform support:

- Level 1 (min_sup = 5%): Milk [support = 10%]
- Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

## Reduced Support

Multi-level mining with reduced support:

- Level 1 (min_sup = 5%): Milk [support = 10%]
- Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]

## Multi-level Association: Redundancy Filtering

- Some rules may be redundant due to "ancestor" relationships between items.
- Example:
  - milk ⇒ wheat bread [support = 8%, confidence = 70%]
  - 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

## Multi-Level Mining: Progressive Deepening

- A top-down, progressive deepening approach:
  - First mine high-level frequent items
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across multiple levels lead to different mining strategies:
  - If adopting the same min_support across multi-levels, then toss t if any of t's ancestors is infrequent.
  - If adopting reduced min_support at lower levels, then examine only those descendants whose ancestor's support is frequent/non-negligible.

## Progressive Refinement of Data Mining Quality

- Why progressive refinement?
  - A mining operator can be expensive or cheap, fine or rough
  - Trade speed for quality: step-by-step refinement
- Superset coverage property:
  - Preserve all the positive answers: allow a false positive test but not a false negative test
- Two- or multi-step mining:
  - First apply the rough/cheap operator (superset coverage)

## Mining of Spatial Association Rules

- Hierarchy of spatial relationships:
  - "g_close_to": near_by, touch, intersect, contain, etc.
  - First search for rough relationships and then refine them
- Two-step mining of spatial associations:
  - Step 1: rough spatial computation (as a filter), using the MBR or R-tree for rough estimation
  - Step 2: detailed spatial algorithm (as refinement), applied only to those objects which have passed the rough spatial association test (no less than min_support)


## Multi-Dimensional Association: Concepts

- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values

## Techniques for Mining MD Associations

- Search for frequent k-predicate sets:
  - Example: {age, occupation, buys} is a 3-predicate set
- Techniques can be categorized by how quantitative attributes, such as age, are treated:
  1. Using static discretization of quantitative attributes: quantitative attributes are statically discretized using predefined concept hierarchies
  2. Quantitative association rules: quantitative attributes are dynamically discretized into "bins" based on the distribution of the data
  3. Distance-based association rules

## Static Discretization of Quantitative Attributes

- Discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges
- In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, e.g., (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)

## Quantitative Association Rules

- Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
- 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
- Cluster "adjacent" association rules to form general rules using a 2-D grid
- Example: age(X, "30-34") ∧ income(X, "24K-48K") ⇒ buys(X, "high resolution TV")

## ARCS (Association Rule Clustering System)

How does ARCS work?

1. Binning
2. Find frequent predicate sets
3. Clustering
4. Optimize

## Limitations of ARCS

- Only quantitative attributes on the LHS of rules
- Only 2 attributes on the LHS (2-D limitation)
- An alternative to ARCS ("Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal):
  - Non-grid-based
  - Equi-depth binning
  - Clustering based on a measure of partial completeness

## Mining Distance-based Association Rules

- Binning methods do not capture the semantics of interval data:

  | Price ($) | Equi-width (width $10) | Equi-depth (depth 2) | Distance-based |
  |---|---|---|---|
  | 7 | [0, 10] | [7, 20] | [7, 7] |
  | 20 | [11, 20] | [22, 50] | [20, 22] |
  | 22 | [21, 30] | [51, 53] | [20, 22] |
  | 50 | [31, 40] | | [50, 53] |
  | 51 | [41, 50] | | [50, 53] |
  | 53 | [51, 60] | | [50, 53] |

- Distance-based partitioning is a more meaningful discretization, considering:
  - density/number of points in an interval
  - "closeness" of points in an interval
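The three partitionings of the slide's price list can be reproduced in a few lines; a sketch (the fixed `gap` in `distance_based` is a crude stand-in for the density criterion in the text, not the published algorithm):

```python
from collections import defaultdict

prices = [7, 20, 22, 50, 51, 53]

def equi_width(values, width=10):
    """Fixed-width bins ((k-1)*width, k*width]; ignores where data clusters."""
    bins = defaultdict(list)
    for v in sorted(values):
        bins[(v - 1) // width].append(v)
    return list(bins.values())

def equi_depth(values, depth=2):
    """Equal-count bins; may split close values or join distant ones."""
    vs = sorted(values)
    return [vs[i:i + depth] for i in range(0, len(vs), depth)]

def distance_based(values, gap=10):
    """Start a new group whenever consecutive sorted values are farther
    apart than `gap`, so each group stays dense and close."""
    vs = sorted(values)
    groups = [[vs[0]]]
    for v in vs[1:]:
        if v - groups[-1][-1] <= gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups

print(distance_based(prices))   # [[7], [20, 22], [50, 51, 53]]
```

Note how equi-width splits the near-identical 50 and 51 into different bins, while the distance-based grouping keeps 50, 51, 53 together, matching the table.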

## Clusters and Distance Measurements

- S[X] is a set of N tuples t1, t2, …, tN, projected on the attribute set X
- The diameter of S[X]:

  d(S[X]) = ( Σ_{i=1..N} Σ_{j=1..N} dist_X(t_i[X], t_j[X]) ) / ( N (N − 1) )

- dist_X is a distance metric, e.g., Euclidean distance or Manhattan distance
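The diameter formula transcribes directly; note the double sum runs over ordered pairs, so each unordered pair is counted twice, matching the N(N−1) denominator (function names are mine):

```python
def diameter(tuples, dist):
    """d(S[X]) = sum_{i=1..N} sum_{j=1..N} dist(t_i, t_j) / (N * (N - 1))."""
    n = len(tuples)
    total = sum(dist(p, q) for p in tuples for q in tuples)  # dist(p, p) = 0
    return total / (n * (n - 1))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Diameter of the price cluster {7, 20, 22} under Manhattan distance:
print(diameter([(7,), (20,), (22,)], manhattan))   # 10.0
```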

## Clusters and Distance Measurements (Cont.)

- The diameter d assesses the density of a cluster CX, where d(CX) ≤ d0^X and |CX| ≥ s0^X
- Finding clusters and distance-based rules:
  - the density threshold d0 replaces the notion of support
  - a modified version of the BIRCH clustering algorithm is used


## Interestingness Measurements

- Objective measures: two popular measurements are support and confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD95): a rule (pattern) is interesting if
  - it is unexpected (surprising to the user), and/or
  - it is actionable (the user can do something with it)

## Criticism of Support and Confidence

- Example 1 (Aggarwal & Yu, PODS98): among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal
  - play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
  - play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

  | | basketball | not basketball | sum (row) |
  |---|---|---|---|
  | cereal | 2000 | 1750 | 3750 |
  | not cereal | 1000 | 250 | 1250 |
  | sum (col.) | 3000 | 2000 | 5000 |

## Criticism of Support and Confidence (Cont.)

- Example 2: X and Y are positively correlated, while X and Z are negatively related, yet the support and confidence of X ⇒ Z dominate

  | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
  |---|---|---|---|---|---|---|---|---|
  | X | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
  | Y | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
  | Z | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

  | Rule | Support | Confidence |
  |---|---|---|
  | X ⇒ Y | 25% | 50% |
  | X ⇒ Z | 37.5% | 75% |

- We need a measure of dependent or correlated events: corr(A, B) = P(A ∧ B) / (P(A) P(B))

## Other Interestingness Measures: Interest

- Interest (correlation, lift): P(A ∧ B) / (P(A) P(B))
  - takes both P(A) and P(B) into consideration
  - P(A ∧ B) = P(A) P(B) if A and B are independent events
  - A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
- For the X, Y, Z rows of the previous example:

  | Itemset | Support | Interest |
  |---|---|---|
  | X, Y | 25% | 2 |
  | X, Z | 37.5% | 0.9 |
  | Y, Z | 12.5% | 0.57 |
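The interest values in the table follow directly from the definition; a quick check in Python over the X, Y, Z rows (0.857 rounds to the table's 0.9):

```python
n = 8
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def support(*cols):
    """P(all the given columns are 1 together)."""
    return sum(all(c[i] for c in cols) for i in range(n)) / n

def lift(a, b):
    """Interest/lift = P(A and B) / (P(A) * P(B)); 1 means independent,
    below 1 negatively correlated, above 1 positively correlated."""
    return support(a, b) / (support(a) * support(b))

print(lift(X, Y))   # 2.0: positively correlated
print(lift(X, Z))   # 0.857...: negatively correlated despite higher support
```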


## Constraint-Based Mining

- Interactive, exploratory mining of giga-bytes of data? Could it be real? Making good use of constraints!
- What kinds of constraints can be used in mining?
  - Knowledge type constraint: classification, association, etc.
  - Data constraint: SQL-like queries, e.g., find product pairs sold together in Vancouver in Dec. '98
  - Dimension/level constraints: in relevance to region, price, brand, customer category
  - Rule constraints: e.g., small sales (price < $10) triggers big sales (sum > $200)
  - Interestingness constraints

## Rule Constraints in Association Mining

- Two kinds of rule constraints:
  - Rule form constraints: meta-rule guided mining, e.g., P(x, y) ∧ Q(x, w) → takes(x, "database systems")
  - Rule (content) constraints: constraint-based query optimization (Ng, et al., SIGMOD'98), e.g., sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
  - 1-var: a constraint confining only one side (L/R) of the rule, as shown above
  - 2-var: a constraint confining both sides (L and R), e.g., sum(LHS) < min(RHS) ∧ max(RHS) < 5 * sum(LHS)

## Constraint-Based Association Query

- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
- A constrained association query (CAQ) is of the form {(S1, S2) | C}, where C is a set of constraints on S1, S2, including a frequency constraint
- A classification of (single-variable) constraints:
  - Class constraint: S ⊂ A, e.g., S ⊂ Item
  - Domain constraint:
    - S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
    - v θ S, θ is ∈ or ∉, e.g., snacks ∉ S.Type
    - V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g., {snacks, sodas} ⊆ S.Type
  - Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count, avg}, and θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., count(S1.Type) = 1, avg(S2.Price) < 100

## Constrained Association Query Optimization Problem

- Given a CAQ = {(S1, S2) | C}, the algorithm should be:
  - sound: it only finds frequent sets that satisfy the given constraints C
  - complete: all frequent sets that satisfy the given constraints C are found
- A naïve solution: apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one
- Our approach: comprehensively analyze the properties of constraints and try to push them as deeply as possible inside the frequent set computation

## Anti-monotone and Monotone Constraints

- A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca
- A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also satisfies it

## Succinct Constraint

- A subset of items Is is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is a selection operator
- SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such that SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
- A constraint C is succinct provided the set of itemsets satisfying C is a succinct power set

## Convertible Constraint

- Suppose all items in patterns are listed in a total order R
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
- A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C

## Relationships Among Categories of Constraints

[Diagram: succinctness, anti-monotonicity, and monotonicity shown as overlapping categories within the convertible constraints; inconvertible constraints lie outside.]

## Property of Constraints: Anti-Monotone

- Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint
- Examples:
  - sum(S.Price) ≤ v is anti-monotone
  - sum(S.Price) ≥ v is not anti-monotone
  - sum(S.Price) = v is partly anti-monotone
- Application: push "sum(S.price) ≤ 1000" deeply into the iterative frequent set computation
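A sketch of the pruning that anti-monotonicity licenses (the item names and prices are made-up; `budget` plays the role of the slide's 1000):

```python
from itertools import combinations

# Hypothetical per-item prices; sum(S.Price) <= budget is anti-monotone,
# so one violating itemset rules out every superset without counting it.
price = {"tv": 800, "camera": 500, "pen": 5, "book": 30}
budget = 1000

def violates(itemset):
    return sum(price[i] for i in itemset) > budget

items = sorted(price)
survivors = [set(c) for r in (1, 2) for c in combinations(items, r)
             if not violates(c)]
print({"camera", "tv"} in survivors)   # False: 1300 > 1000, pruned early;
# by anti-monotonicity no 3-itemset containing both ever needs generating.
```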

## Characterization of Anti-Monotonicity Constraints

| Constraint | Anti-monotone? |
|---|---|
| S θ v, θ ∈ {=, ≤, ≥} | yes |
| v ∈ S | no |
| S ⊇ V | no |
| S ⊆ V | yes |
| S = V | partly |
| min(S) ≤ v | no |
| min(S) ≥ v | yes |
| min(S) = v | partly |
| max(S) ≤ v | yes |
| max(S) ≥ v | no |
| max(S) = v | partly |
| count(S) ≤ v | yes |
| count(S) ≥ v | no |
| count(S) = v | partly |
| sum(S) ≤ v | yes |
| sum(S) ≥ v | no |
| sum(S) = v | partly |
| avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible |
| (frequency constraint) | (yes) |

R  If S is a suffix of S . I={9.g. avg(S ) ≥ avg(S) 1 1  {8. 3}  avg({9. 3} is a suffix of {9. 4. 3})=6 ≥ avg({8. 4. 2008 does {9. 4. 6. 1}  Avg(S) ≥ v is convertible monotone w. 3})=5  If S satisfies avg(S) ≥v. 4.t. 8. 4. 3} Data Mining: Concepts and Techniques 71 . 4. 8. Example of Convertible Constraints: Avg(S) θ V  Let R be the value descending order over the set of items  E. 3. 4. 3} satisfies constraint avg(S) ≥ 4. so does S1  {8. 8. so October 16. 8.r.

## Property of Constraints: Succinctness

- Succinctness:
  - For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
  - Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Examples:
  - sum(S.Price) ≥ v is not succinct
  - min(S.Price) ≤ v is succinct
- Optimization: if C is succinct, then C is pre-counting prunable; the satisfaction of the constraint alone is not affected by the iterative support counting

## Characterization of Constraints by Succinctness

| Constraint | Succinct? |
|---|---|
| S θ v, θ ∈ {=, ≤, ≥} | yes |
| v ∈ S | yes |
| S ⊇ V | yes |
| S ⊆ V | yes |
| S = V | yes |
| min(S) ≤ v | yes |
| min(S) ≥ v | yes |
| min(S) = v | yes |
| max(S) ≤ v | yes |
| max(S) ≥ v | yes |
| max(S) = v | yes |
| count(S) ≤ v | weakly |
| count(S) ≥ v | weakly |
| count(S) = v | weakly |
| sum(S) ≤ v | no |
| sum(S) ≥ v | no |
| sum(S) = v | no |
| avg(S) θ v, θ ∈ {=, ≤, ≥} | no |
| (frequency constraint) | (no) |


## Why Is the Big Pie Still There?

- More on constraint-based mining of associations
  - Boolean vs. quantitative associations: association on discrete vs. continuous data
- From association to correlation and causal structure analysis
  - Association does not necessarily imply correlation or causal relationships
- From intra-transaction associations to inter-transaction associations
  - E.g., break the barriers of transactions (Lu, et al., TOIS'99)
- From association analysis to classification and clustering analysis
  - E.g., clustering association rules


## Summary

- Association rule mining is probably the most significant contribution from the database community in KDD
- A large number of papers have been published, and many interesting issues have been explored
- An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, time series data, etc.