From association mining to correlation analysis

Typical application questions:
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
- A closed pattern is a lossless compression of frequent patterns: it reduces the number of patterns and rules

Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1. (A brute-force sketch of these definitions follows below.)
- What is the set of closed itemsets?
  - <a1, …, a100>: 1
  - <a1, …, a50>: 2
- What is the set of max-patterns?
  - <a1, …, a100>: 1
- What is the set of all frequent patterns?
  - All 2^100 − 1 non-empty subsets of {a1, …, a100}: far too many to enumerate!
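The two definitions and the exercise can be checked mechanically. Below is a minimal brute-force sketch (not from the slides; all names are illustrative) that enumerates the frequent itemsets of a toy database and then filters the closed and maximal ones. It is only feasible for tiny databases; the 2^100 blow-up above is exactly why practical miners never enumerate all sub-patterns.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Enumerate every frequent itemset by brute force (toy databases only)."""
    items = sorted({i for t in db for i in t})
    freq, k = {}, 1
    while True:
        level = {frozenset(c): sum(1 for t in db if set(c) <= t)
                 for c in combinations(items, k)}
        level = {x: s for x, s in level.items() if s >= min_sup}
        if not level:        # downward closure: nothing longer can be frequent
            return freq
        freq.update(level)
        k += 1

def closed_and_max(freq):
    """Closed: no proper superset with the same support.
       Max: no frequent proper superset at all."""
    closed = {x: s for x, s in freq.items()
              if not any(x < y and s == freq[y] for y in freq)}
    maximal = {x: s for x, s in freq.items()
               if not any(x < y for y in freq)}
    return closed, maximal

db = [{'a', 'b', 'c'}, {'a', 'b'}]   # a 2-transaction stand-in for the exercise
freq = frequent_itemsets(db, min_sup=1)
closed, maximal = closed_and_max(freq)
print(closed)    # {a,b,c}: 1 and {a,b}: 2, mirroring <a1..a100>:1 and <a1..a50>:2
print(maximal)   # {a,b,c}: 1
```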
Outline
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

Scalable Methods for Mining Frequent Patterns
- Downward closure: if {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori
  - Frequent-pattern growth (FPgrowth)
  - Vertical data format approach
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if there is any itemset that is infrequent, its superset should not be generated/tested!
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm—An Example (min_sup = 2)

Database TDB:

  Tid  Items
  10   A, C, D
  20   B, C, E
  30   A, B, C, E
  40   B, E

- 1st scan, count C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
- L1 (candidates meeting min_sup): {A}:2, {B}:3, {C}:3, {E}:3
- C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
- 2nd scan, count C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
- L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

(A runnable sketch of this loop follows the example.)
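As a hedged illustration of the method just described (an instructive sketch, not an efficient or reference implementation), the loop below generates, prunes, and counts candidates level by level, and reproduces the TDB example.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori: count 1-itemsets, then repeatedly generate
    length-(k+1) candidates from length-k frequent itemsets, prune by the
    Apriori principle, count against the DB, and stop when nothing survives."""
    db = [frozenset(t) for t in db]
    counts = {}
    for t in db:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    Lk = {x: c for x, c in counts.items() if c >= min_sup}   # L1
    result, k = dict(Lk), 1
    while Lk:
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        cands = {c for c in cands                             # Apriori pruning
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c: n for c in cands
              if (n := sum(1 for t in db if c <= t)) >= min_sup}
        result.update(Lk)
        k += 1
    return result

tdb = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]
freq = apriori(tdb, min_sup=2)
print({tuple(sorted(x)): n for x, n in freq.items()})
# L1: A:2 B:3 C:3 E:3; L2: AC:2 BC:2 BE:3 CE:2; a 3rd pass then finds BCE:2
```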
Another Version of Writing the Apriori Algorithm
- Suppose the items in L(k-1) are listed in an order
- Step 1: self-joining L(k-1)

    insert into Ck
    select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
    from L(k-1) p, L(k-1) q
    where p.item1 = q.item1, …, p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)

- Step 2: pruning

    forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
        if (s is not in L(k-1)) then delete c from Ck

(Both steps are rendered in Python after this slide.)

Important Details of Apriori
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
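A Python rendering of the two steps above, assuming itemsets are kept as sorted tuples; it mirrors the SQL self-join and the subset-pruning loop, using the classic L3 = {abc, abd, acd, ace, bcd} example.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Candidate generation C_k from L_(k-1): self-join on the first k-2
    items, then prune any candidate that has an infrequent (k-1)-subset."""
    L_prev = sorted(tuple(sorted(x)) for x in L_prev)  # items listed in an order
    joined = {p + (q[-1],)                              # step 1: self-join
              for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    L_set = {frozenset(x) for x in L_prev}
    return [c for c in joined                           # step 2: pruning
            if all(frozenset(s) in L_set for s in combinations(c, k - 1))]

# L3 = {abc, abd, acd, ace, bcd}: the join yields abcd and acde;
# acde is pruned because its subset ade is not in L3.
L3 = [('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')]
print(apriori_gen(L3, 4))   # [('a', 'b', 'c', 'd')]
```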
Challenges of Frequent Pattern Mining
- Huge number of candidates
- Tedious workload of support counting for candidates

DHP: Reduce the Number of Candidates (hash-based)
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Example: candidates a, b, c, d, e; frequent 1-itemsets a, b, d, e; ab is not a candidate 2-itemset if the count of its hash bucket (holding {ab, ad, ae}) is below the support threshold (see the bucket-counting sketch below)
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95

Partition: Scan Database Only Twice
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate the global frequent patterns

Sampling for Frequent Patterns
- Select a sample of the original database, and mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96
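A rough sketch of the DHP idea under stated assumptions: the hash function below is an arbitrary stand-in (the SIGMOD'95 paper uses a specific one), and only the candidate-2-itemset stage is shown.

```python
from itertools import combinations

def dhp_candidate_2_itemsets(db, min_sup, n_buckets=7):
    """DHP sketch: while counting 1-itemsets on the first scan, hash every
    2-itemset of each transaction into a bucket. A pair whose bucket count
    is below min_sup cannot be frequent, so it is never generated as a
    candidate. Which pairs share a bucket is arbitrary here."""
    bucket_of = lambda pair: hash(frozenset(pair)) % n_buckets
    item_count, buckets = {}, [0] * n_buckets
    for t in db:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair)] += 1
    L1 = [i for i, c in item_count.items() if c >= min_sup]
    # keep only pairs of frequent items whose bucket count passes the threshold
    return [(a, b) for a, b in combinations(sorted(L1), 2)
            if buckets[bucket_of((a, b))] >= min_sup]
```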
DIC: Reduce Number of Scans

Bottleneck of Frequent-pattern Mining
Benefits of the FP-tree Structure
- Compactness
  - Reduces irrelevant information: infrequent items are gone

Partition Patterns and Databases
- Patterns containing p
- Patterns having m but no p
- …

Find Patterns Having P From P-conditional Database
- Start at the frequent-item header table in the FP-tree
- Traverse the FP-tree by following the link of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

From Conditional Pattern-bases to Conditional FP-trees
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base

[Figure: the example FP-tree. Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3. The root {} branches into f:4 → {c:3 → a:3 → {m:2 → p:2, b:1 → m:1}, b:1} and c:1 → b:1 → p:1.]

Conditional pattern bases read off the tree:

  item  conditional pattern base
  c     f:3
  a     fc:3
  b     fca:1, f:1, c:1
  m     fca:2, fcab:1
  p     fcam:2, cb:1

- m-conditional pattern base: fca:2, fcab:1
- m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

(A compact FP-tree and conditional-pattern-base sketch in Python follows.)
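A compact FP-tree sketch (illustrative only, not production FP-growth). One assumption to note: ties between equally frequent items are broken alphabetically, so the item order differs from the slides (c before f), and the m-base prints as cfa:2, cfab:1, the same itemsets as fca:2, fcab:1.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, min_sup):
    """Order each transaction's frequent items by descending support and
    insert them as a shared prefix path; the header table links every node
    of each item (ties broken alphabetically, unlike the slides)."""
    freq = {}
    for t in db:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root, header = Node(None, None), {}
    for t in db:
        node = root
        for i in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header.setdefault(i, []).append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

def conditional_pattern_base(header, item):
    """Follow the node-links of `item`; each prefix path, weighted by the
    node's count, is one entry of the conditional pattern base."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:   # walk up to (but not including) the root
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
_, header = build_fp_tree(db, min_sup=3)
print(conditional_pattern_base(header, 'm'))
# [(['c','f','a'], 2), (['c','f','a','b'], 1)], i.e. the slide's fca:2, fcab:1
```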
Recursion: Mining Each Conditional FP-tree
- Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
- Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
- Repeat on each newly created conditional FP-tree, until the resulting FP-tree is empty or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

A Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts:
  - Reduction of the single prefix path into one node
  - Concatenation of the mining results of the two parts
- [Figure: a tree whose single prefix path a1:n1 → a2:n2 → a3:n3 sits above a branching part (b1:m1, C1:k1, C2:k2, C3:k3) is decomposed into that prefix path plus the branching subtree r1, and the mining results are concatenated.]
Partition-based Projection
- Parallel projection needs a lot of disk space
- Partition projection saves it (see the sketch below)

Transaction DB: fcamp, fcabm, fb, cbp, fcamp

Projected databases:

  p-proj DB:  fcam, cb, fcam
  m-proj DB:  fcab, fca, fca
  b-proj DB:  f, cb, …
  a-proj DB:  fc, …
  c-proj DB:  f, …
  f-proj DB:  …
  am-proj DB: fc, fc, fc
  cm-proj DB: f, f, f

FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K; the D1 FP-growth runtime stays nearly flat while the D1 Apriori runtime grows steeply as the support threshold drops below about 1%.]
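A sketch of building item-projected databases, assuming the parallel-projection variant (every transaction prefix is written to each item's projected DB); the docstring notes how partition projection differs. The f-list and example DB come from the slide above.

```python
def projected_dbs(db, f_list):
    """Parallel projection: for each transaction (items in f-list order),
    the prefix before item x is appended to x's projected DB. Partition
    projection would instead write each transaction only to the projected
    DB of its last item and project onward step by step, saving disk space."""
    rank = {i: r for r, i in enumerate(f_list)}
    proj = {i: [] for i in f_list}
    for t in db:
        t = sorted((i for i in t if i in rank), key=lambda i: rank[i])
        for k, item in enumerate(t):
            if k:                      # non-empty prefixes only
                proj[item].append(t[:k])
    return proj

tran_db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
proj = projected_dbs(tran_db, f_list=list("fcabmp"))
print(proj['p'])   # [['f','c','a','m'], ['c','b'], ['f','c','a','m']]
print(proj['m'])   # [['f','c','a'], ['f','c','a','b'], ['f','c','a']]
```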
Mining Frequent Closed Patterns: CLOSET
- F-list: list of all frequent items in support-ascending order
  - Here F-list = d-a-f-e-c, with min_sup = 2
- TDB:

    TID  Items
    10   a, c, d, e, f
    20   a, b, e
    30   c, e, f
    40   a, c, d, f
    50   c, e, f

- Divide the search space:
  - Patterns having d
  - Patterns having d but no a, etc.
- Find frequent closed patterns recursively
  - Every transaction having d also has cfa, so cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00

CLOSET+: Mining Closed Itemsets by Pattern-Growth
- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned
- Hybrid tree projection:
  - Bottom-up physical tree-projection
  - Top-down pseudo tree-projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at the higher levels
- Efficient subset checking
Mining Various Kinds of Association Rules

Mining Multiple-Level Association Rules
Mining Quantitative Associations
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
  1. Static discretization based on predefined concept hierarchies (data cube methods)
  2. Dynamic discretization based on data distribution
  3. Clustering: distance-based association (one-dimensional clustering, then association)

Static Discretization of Quantitative Attributes
- Attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges (a small example follows this slide)
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, so mining from data cubes can be much faster

[Figure: the lattice of cuboids over (age, income, buys), from the apex () through (age), (income), (buys) and (age, income), (age, buys), (income, buys) down to (age, income, buys).]
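A minimal sketch of static discretization; the bin edges, labels, and customer records below are invented for illustration. The point is only that numeric values become range items that an itemset miner can consume.

```python
def discretize(value, bins):
    """Replace a numeric value by the label of its range, per a predefined
    concept hierarchy (these bin edges are illustrative assumptions)."""
    for low, high, label in bins:
        if low <= value < high:
            return label
    return "other"

age_bins = [(0, 21, "age<21"), (21, 31, "age:21..30"), (31, 200, "age>30")]
income_bins = [(0, 30_000, "low"), (30_000, 70_000, "medium"),
               (70_000, 10**9, "high")]

customers = [(25, 42_000, "laptop"), (34, 95_000, "laptop"), (19, 12_000, "phone")]
transactions = [
    {discretize(age, age_bins), discretize(inc, income_bins), f"buys={item}"}
    for age, inc, item in customers
]
print(transactions)
# itemsets such as {'age:21..30', 'medium', 'buys=laptop'} can now feed Apriori
```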
Interestingness Measure: Correlations (Lift)
- "play basketball ⇒ eat cereal [40%, 66.7%]" is misleading: the overall share of students eating cereal is 75%, which is higher than 66.7%
- "play basketball ⇒ not eat cereal [20%, 33.3%]" is more accurate, although it has lower support and confidence
- Measure of dependent/correlated events:

    lift = P(A ∪ B) / (P(A) P(B))

  Basketball/cereal contingency table:

                Basketball  Not basketball  Sum (row)
    Cereal         2000          1750          3750
    Not cereal     1000           250          1250
    Sum (col.)     3000          2000          5000

    lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
    lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

Are lift and χ² Good Measures of Correlation?
- "buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- Many other interestingness measures have been proposed, e.g.:

    lift(A, B)  = P(A ∪ B) / (P(A) P(B))
    all_conf(X) = sup(X) / max_item_sup(X)
    coh(X)      = sup(X) / |universe(X)|

(The contingency-table computation is shown in code below.)
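The lift computation on the slide's contingency table, as a small self-check:

```python
def lift(n_ab, n_a, n_b, n):
    """lift = P(A and B) / (P(A) * P(B)): > 1 means positive correlation,
    < 1 negative correlation, = 1 independence."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# counts from the slide's table (n = 5000)
n, basketball, cereal, both = 5000, 3000, 3750, 2000
print(round(lift(both, basketball, cereal, n), 2))                   # 0.89
print(round(lift(basketball - both, basketball, n - cereal, n), 2))  # 1.33
```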
Constraint-based (Query-Directed) Mining
- Finding all the patterns in a database autonomously?
- Constrained mining vs. constraint-based search in AI:
  - Both are aimed at reducing the search space
  - Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
  - Constraint pushing vs. heuristic search
  - It is an interesting research problem how to integrate them
- Constrained mining vs. query processing in DBMS:
  - Database query processing requires finding all answers
  - Constrained pattern mining shares a similar philosophy, as in pushing selections deeply into query processing

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.

Anti-Monotonicity in Constraint Pushing
- When an itemset S violates the constraint, so does any of its supersets
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
- Example. C: range(S.profit) ≤ 15 is anti-monotone
  - Itemset ab violates C (its profit range is 40 − 0 = 40 > 15)
  - So does every superset of ab
- (A sketch of pushing an anti-monotone constraint into Apriori follows.)

TDB (min_sup = 2):

  TID  Transaction
  10   a, b, c, d, f
  20   b, c, d, f, g, h
  30   a, c, d, e, f
  40   c, e, f, g

  Item  Profit
  a      40
  b       0
  c     -20
  d      10
  e     -30
  f      30
  g      20
  h     -10
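A sketch of constraint pushing, assuming an anti-monotone constraint: itemsets that violate it are dropped at the level where they first appear, so none of their supersets are ever generated or counted. This is illustrative Apriori-style code, not the course's algorithm.

```python
def mine_with_antimonotone(db, min_sup, constraint):
    """Apriori-style mining with an anti-monotone constraint pushed inside:
    an itemset that violates the constraint is dropped immediately, which
    also prunes every superset it could have generated."""
    db = [frozenset(t) for t in db]
    level = [frozenset([i]) for i in sorted({i for t in db for i in t})]
    result = []
    while level:
        # keep only itemsets that are frequent AND satisfy the constraint
        level = [x for x in level
                 if sum(1 for t in db if x <= t) >= min_sup and constraint(x)]
        result += level
        # generate the next level from the survivors only
        level = list({a | b for a in level for b in level
                      if len(a | b) == len(a) + 1})
    return result

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}
# C: range(S.profit) <= 15, the slide's anti-monotone example
C = lambda s: max(profit[i] for i in s) - min(profit[i] for i in s) <= 15
tdb = [set("abcdf"), set("bcdfgh"), set("acdef"), set("cefg")]
print(sorted("".join(sorted(x)) for x in mine_with_antimonotone(tdb, 2, C)))
```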
Monotonicity for Constraint Pushing
- min(S.Price) ≤ v is monotone

Succinctness
- But sum(S.Price) ≥ v is not succinct

Strongly Convertible Constraints (example: avg(X) ≥ 25, with the item-profit table below)
- In item-value descending order R = <a, f, g, d, b, h, c, e>:
  - If an itemset afb violates C, so do afbh, afb*, …
  - The constraint becomes anti-monotone!
- In item-value ascending order R-1 = <e, c, h, b, d, g, f, a>:
  - If an itemset d satisfies constraint C, so do the itemsets df and dfa, which have d as a prefix
  - The constraint becomes monotone
- Thus, avg(X) ≥ 25 is strongly convertible (the ordering trick is sketched below)

TDB (min_sup = 2), item profits:

  Item  Profit
  a      40
  b       0
  c     -20
  d      10
  e     -30
  f      30
  g      20
  h     -10
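A tiny demonstration of the convertibility argument, using the slide's profit table: sorting by descending value yields the order R, and a prefix whose average drops below the bound can only get worse when extended along that order.

```python
profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}

def avg_profit(itemset):
    """Average profit of an itemset (iterating a string yields its items)."""
    return sum(profit[i] for i in itemset) / len(itemset)

# Value-descending order R: later items only lower the average, so once a
# prefix such as 'afb' violates avg(X) >= 25, every extension (afbh, ...)
# violates it too; along R the constraint behaves anti-monotonically.
R = sorted(profit, key=lambda i: -profit[i])
print(R)                 # ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']
print(avg_profit("af"))  # 35.0, satisfies avg >= 25
print(avg_profit("afb")) # ~23.33, violates; so do afbh, afbc, ...
```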
Constraint-Based Mining—A General Picture

A Classification of Constraints