From association mining to correlation analysis

Typical application questions:
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
- A closed pattern is a lossless compression of frequent patterns: it reduces the number of patterns and rules

Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1. (A brute-force sketch of these definitions follows below.)
- What is the set of closed itemsets?
  - <a1, …, a100>: 1
  - <a1, …, a50>: 2
- What is the set of max-patterns?
  - <a1, …, a100>: 1
- What is the set of all frequent patterns?
  - All 2^100 − 1 non-empty subsets of {a1, …, a100}: far too many to enumerate!
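The two definitions and the exercise can be checked mechanically. Below is a minimal brute-force sketch (not from the slides; all names are illustrative) that enumerates the frequent itemsets of a toy database and then filters the closed and maximal ones. It is only feasible for tiny databases; the 2^100 blow-up above is exactly why practical miners never enumerate all sub-patterns.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Enumerate every frequent itemset by brute force (toy databases only)."""
    items = sorted({i for t in db for i in t})
    freq, k = {}, 1
    while True:
        level = {frozenset(c): sum(1 for t in db if set(c) <= t)
                 for c in combinations(items, k)}
        level = {x: s for x, s in level.items() if s >= min_sup}
        if not level:        # downward closure: nothing longer can be frequent
            return freq
        freq.update(level)
        k += 1

def closed_and_max(freq):
    """Closed: no proper superset with the same support.
       Max: no frequent proper superset at all."""
    closed = {x: s for x, s in freq.items()
              if not any(x < y and s == freq[y] for y in freq)}
    maximal = {x: s for x, s in freq.items()
               if not any(x < y for y in freq)}
    return closed, maximal

db = [{'a', 'b', 'c'}, {'a', 'b'}]   # a 2-transaction stand-in for the exercise
freq = frequent_itemsets(db, min_sup=1)
closed, maximal = closed_and_max(freq)
print(closed)    # {a,b,c}: 1 and {a,b}: 2, mirroring <a1..a100>:1 and <a1..a50>:2
print(maximal)   # {a,b,c}: 1
```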
Outline
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

Scalable Methods for Mining Frequent Patterns
- Downward closure: if {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori
  - Frequent-pattern growth (FPgrowth)
  - Vertical data format approach
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if there is any itemset that is infrequent, its superset should not be generated/tested!
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm—An Example (min_sup = 2)

Database TDB:

  Tid  Items
  10   A, C, D
  20   B, C, E
  30   A, B, C, E
  40   B, E

- 1st scan, count C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
- L1 (candidates meeting min_sup): {A}:2, {B}:3, {C}:3, {E}:3
- C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
- 2nd scan, count C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
- L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

(A runnable sketch of this loop follows the example.)
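As a hedged illustration of the method just described (an instructive sketch, not an efficient or reference implementation), the loop below generates, prunes, and counts candidates level by level, and reproduces the TDB example.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori: count 1-itemsets, then repeatedly generate
    length-(k+1) candidates from length-k frequent itemsets, prune by the
    Apriori principle, count against the DB, and stop when nothing survives."""
    db = [frozenset(t) for t in db]
    counts = {}
    for t in db:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    Lk = {x: c for x, c in counts.items() if c >= min_sup}   # L1
    result, k = dict(Lk), 1
    while Lk:
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        cands = {c for c in cands                             # Apriori pruning
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c: n for c in cands
              if (n := sum(1 for t in db if c <= t)) >= min_sup}
        result.update(Lk)
        k += 1
    return result

tdb = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]
freq = apriori(tdb, min_sup=2)
print({tuple(sorted(x)): n for x, n in freq.items()})
# L1: A:2 B:3 C:3 E:3; L2: AC:2 BC:2 BE:3 CE:2; a 3rd pass then finds BCE:2
```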
Another Version of Writing the Apriori Algorithm
- Suppose the items in L(k-1) are listed in an order
- Step 1: self-joining L(k-1)

    insert into Ck
    select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
    from L(k-1) p, L(k-1) q
    where p.item1 = q.item1, …, p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)

- Step 2: pruning

    forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
        if (s is not in L(k-1)) then delete c from Ck

(Both steps are rendered in Python after this slide.)

Important Details of Apriori
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
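A Python rendering of the two steps above, assuming itemsets are kept as sorted tuples; it mirrors the SQL self-join and the subset-pruning loop, using the classic L3 = {abc, abd, acd, ace, bcd} example.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Candidate generation C_k from L_(k-1): self-join on the first k-2
    items, then prune any candidate that has an infrequent (k-1)-subset."""
    L_prev = sorted(tuple(sorted(x)) for x in L_prev)  # items listed in an order
    joined = {p + (q[-1],)                              # step 1: self-join
              for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    L_set = {frozenset(x) for x in L_prev}
    return [c for c in joined                           # step 2: pruning
            if all(frozenset(s) in L_set for s in combinations(c, k - 1))]

# L3 = {abc, abd, acd, ace, bcd}: the join yields abcd and acde;
# acde is pruned because its subset ade is not in L3.
L3 = [('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')]
print(apriori_gen(L3, 4))   # [('a', 'b', 'c', 'd')]
```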
Challenges of Frequent Pattern Mining
- Huge number of candidates
- Tedious workload of support counting for candidates

DHP: Reduce the Number of Candidates (hash-based)
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Example: candidates a, b, c, d, e; frequent 1-itemsets a, b, d, e; ab is not a candidate 2-itemset if the count of its hash bucket (holding {ab, ad, ae}) is below the support threshold (see the bucket-counting sketch below)
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95

Partition: Scan Database Only Twice
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate the global frequent patterns

Sampling for Frequent Patterns
- Select a sample of the original database, and mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96
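A rough sketch of the DHP idea under stated assumptions: the hash function below is an arbitrary stand-in (the SIGMOD'95 paper uses a specific one), and only the candidate-2-itemset stage is shown.

```python
from itertools import combinations

def dhp_candidate_2_itemsets(db, min_sup, n_buckets=7):
    """DHP sketch: while counting 1-itemsets on the first scan, hash every
    2-itemset of each transaction into a bucket. A pair whose bucket count
    is below min_sup cannot be frequent, so it is never generated as a
    candidate. Which pairs share a bucket is arbitrary here."""
    bucket_of = lambda pair: hash(frozenset(pair)) % n_buckets
    item_count, buckets = {}, [0] * n_buckets
    for t in db:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair)] += 1
    L1 = [i for i, c in item_count.items() if c >= min_sup]
    # keep only pairs of frequent items whose bucket count passes the threshold
    return [(a, b) for a, b in combinations(sorted(L1), 2)
            if buckets[bucket_of((a, b))] >= min_sup]
```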
DIC: Reduce Number of Scans

Bottleneck of Frequent-pattern Mining
Benefits of the FP-tree Structure
- Compactness
  - Reduces irrelevant information: infrequent items are gone

Partition Patterns and Databases
- Patterns containing p
- Patterns having m but no p
- …

Find Patterns Having P From P-conditional Database
- Start at the frequent-item header table in the FP-tree
- Traverse the FP-tree by following the link of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

From Conditional Pattern-bases to Conditional FP-trees
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base

[Figure: the example FP-tree. Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3. The root {} branches into f:4 → {c:3 → a:3 → {m:2 → p:2, b:1 → m:1}, b:1} and c:1 → b:1 → p:1.]

Conditional pattern bases read off the tree:

  item  conditional pattern base
  c     f:3
  a     fc:3
  b     fca:1, f:1, c:1
  m     fca:2, fcab:1
  p     fcam:2, cb:1

- m-conditional pattern base: fca:2, fcab:1
- m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

(A compact FP-tree and conditional-pattern-base sketch in Python follows.)
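A compact FP-tree sketch (illustrative only, not production FP-growth). One assumption to note: ties between equally frequent items are broken alphabetically, so the item order differs from the slides (c before f), and the m-base prints as cfa:2, cfab:1, the same itemsets as fca:2, fcab:1.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, min_sup):
    """Order each transaction's frequent items by descending support and
    insert them as a shared prefix path; the header table links every node
    of each item (ties broken alphabetically, unlike the slides)."""
    freq = {}
    for t in db:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root, header = Node(None, None), {}
    for t in db:
        node = root
        for i in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header.setdefault(i, []).append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

def conditional_pattern_base(header, item):
    """Follow the node-links of `item`; each prefix path, weighted by the
    node's count, is one entry of the conditional pattern base."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:   # walk up to (but not including) the root
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
_, header = build_fp_tree(db, min_sup=3)
print(conditional_pattern_base(header, 'm'))
# [(['c','f','a'], 2), (['c','f','a','b'], 1)], i.e. the slide's fca:2, fcab:1
```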
Recursion: Mining Each Conditional FP-tree
- Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
- Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
- Repeat on each newly created conditional FP-tree, until the resulting FP-tree is empty or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

A Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts:
  - Reduction of the single prefix path into one node
  - Concatenation of the mining results of the two parts
- [Figure: a tree whose single prefix path a1:n1 → a2:n2 → a3:n3 sits above a branching part (b1:m1, C1:k1, C2:k2, C3:k3) is decomposed into that prefix path plus the branching subtree r1, and the mining results are concatenated.]
Partition-based Projection
- Parallel projection needs a lot of disk space
- Partition projection saves it (see the sketch below)

Transaction DB: fcamp, fcabm, fb, cbp, fcamp

Projected databases:

  p-proj DB:  fcam, cb, fcam
  m-proj DB:  fcab, fca, fca
  b-proj DB:  f, cb, …
  a-proj DB:  fc, …
  c-proj DB:  f, …
  f-proj DB:  …
  am-proj DB: fc, fc, fc
  cm-proj DB: f, f, f

FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K; the D1 FP-growth runtime stays nearly flat while the D1 Apriori runtime grows steeply as the support threshold drops below about 1%.]
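A sketch of building item-projected databases, assuming the parallel-projection variant (every transaction prefix is written to each item's projected DB); the docstring notes how partition projection differs. The f-list and example DB come from the slide above.

```python
def projected_dbs(db, f_list):
    """Parallel projection: for each transaction (items in f-list order),
    the prefix before item x is appended to x's projected DB. Partition
    projection would instead write each transaction only to the projected
    DB of its last item and project onward step by step, saving disk space."""
    rank = {i: r for r, i in enumerate(f_list)}
    proj = {i: [] for i in f_list}
    for t in db:
        t = sorted((i for i in t if i in rank), key=lambda i: rank[i])
        for k, item in enumerate(t):
            if k:                      # non-empty prefixes only
                proj[item].append(t[:k])
    return proj

tran_db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
proj = projected_dbs(tran_db, f_list=list("fcabmp"))
print(proj['p'])   # [['f','c','a','m'], ['c','b'], ['f','c','a','m']]
print(proj['m'])   # [['f','c','a'], ['f','c','a','b'], ['f','c','a']]
```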
Mining Frequent Closed Patterns: CLOSET
- F-list: list of all frequent items in support-ascending order
  - Here F-list = d-a-f-e-c, with min_sup = 2
- TDB:

    TID  Items
    10   a, c, d, e, f
    20   a, b, e
    30   c, e, f
    40   a, c, d, f
    50   c, e, f

- Divide the search space:
  - Patterns having d
  - Patterns having d but no a, etc.
- Find frequent closed patterns recursively
  - Every transaction having d also has cfa, so cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00

CLOSET+: Mining Closed Itemsets by Pattern-Growth
- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned
- Hybrid tree projection:
  - Bottom-up physical tree-projection
  - Top-down pseudo tree-projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at the higher levels
- Efficient subset checking
Mining Various Kinds of Association Rules

Mining Multiple-Level Association Rules
Mining Quantitative Associations
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
  1. Static discretization based on predefined concept hierarchies (data cube methods)
  2. Dynamic discretization based on data distribution
  3. Clustering: distance-based association (one-dimensional clustering, then association)

Static Discretization of Quantitative Attributes
- Attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges (a small example follows this slide)
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, so mining from data cubes can be much faster

[Figure: the lattice of cuboids over (age, income, buys), from the apex () through (age), (income), (buys) and (age, income), (age, buys), (income, buys) down to (age, income, buys).]
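A minimal sketch of static discretization; the bin edges, labels, and customer records below are invented for illustration. The point is only that numeric values become range items that an itemset miner can consume.

```python
def discretize(value, bins):
    """Replace a numeric value by the label of its range, per a predefined
    concept hierarchy (these bin edges are illustrative assumptions)."""
    for low, high, label in bins:
        if low <= value < high:
            return label
    return "other"

age_bins = [(0, 21, "age<21"), (21, 31, "age:21..30"), (31, 200, "age>30")]
income_bins = [(0, 30_000, "low"), (30_000, 70_000, "medium"),
               (70_000, 10**9, "high")]

customers = [(25, 42_000, "laptop"), (34, 95_000, "laptop"), (19, 12_000, "phone")]
transactions = [
    {discretize(age, age_bins), discretize(inc, income_bins), f"buys={item}"}
    for age, inc, item in customers
]
print(transactions)
# itemsets such as {'age:21..30', 'medium', 'buys=laptop'} can now feed Apriori
```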
Interestingness Measure: Correlations (Lift)
- "play basketball ⇒ eat cereal [40%, 66.7%]" is misleading: the overall share of students eating cereal is 75%, which is higher than 66.7%
- "play basketball ⇒ not eat cereal [20%, 33.3%]" is more accurate, although it has lower support and confidence
- Measure of dependent/correlated events:

    lift = P(A ∪ B) / (P(A) P(B))

  Basketball/cereal contingency table:

                Basketball  Not basketball  Sum (row)
    Cereal         2000          1750          3750
    Not cereal     1000           250          1250
    Sum (col.)     3000          2000          5000

    lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
    lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

Are lift and χ² Good Measures of Correlation?
- "buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- Many other interestingness measures have been proposed, e.g.:

    lift(A, B)  = P(A ∪ B) / (P(A) P(B))
    all_conf(X) = sup(X) / max_item_sup(X)
    coh(X)      = sup(X) / |universe(X)|

(The contingency-table computation is shown in code below.)
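The lift computation on the slide's contingency table, as a small self-check:

```python
def lift(n_ab, n_a, n_b, n):
    """lift = P(A and B) / (P(A) * P(B)): > 1 means positive correlation,
    < 1 negative correlation, = 1 independence."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# counts from the slide's table (n = 5000)
n, basketball, cereal, both = 5000, 3000, 3750, 2000
print(round(lift(both, basketball, cereal, n), 2))                   # 0.89
print(round(lift(basketball - both, basketball, n - cereal, n), 2))  # 1.33
```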
Constraint-based (Query-Directed) Mining
- Finding all the patterns in a database autonomously?
- Constrained mining vs. constraint-based search in AI:
  - Both are aimed at reducing the search space
  - Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
  - Constraint pushing vs. heuristic search
  - It is an interesting research problem how to integrate them
- Constrained mining vs. query processing in DBMS:
  - Database query processing requires finding all answers
  - Constrained pattern mining shares a similar philosophy, as in pushing selections deeply into query processing

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.

Anti-Monotonicity in Constraint Pushing
- When an itemset S violates the constraint, so does any of its supersets
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
- Example. C: range(S.profit) ≤ 15 is anti-monotone
  - Itemset ab violates C (its profit range is 40 − 0 = 40 > 15)
  - So does every superset of ab
- (A sketch of pushing an anti-monotone constraint into Apriori follows.)

TDB (min_sup = 2):

  TID  Transaction
  10   a, b, c, d, f
  20   b, c, d, f, g, h
  30   a, c, d, e, f
  40   c, e, f, g

  Item  Profit
  a      40
  b       0
  c     -20
  d      10
  e     -30
  f      30
  g      20
  h     -10
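A sketch of constraint pushing, assuming an anti-monotone constraint: itemsets that violate it are dropped at the level where they first appear, so none of their supersets are ever generated or counted. This is illustrative Apriori-style code, not the course's algorithm.

```python
def mine_with_antimonotone(db, min_sup, constraint):
    """Apriori-style mining with an anti-monotone constraint pushed inside:
    an itemset that violates the constraint is dropped immediately, which
    also prunes every superset it could have generated."""
    db = [frozenset(t) for t in db]
    level = [frozenset([i]) for i in sorted({i for t in db for i in t})]
    result = []
    while level:
        # keep only itemsets that are frequent AND satisfy the constraint
        level = [x for x in level
                 if sum(1 for t in db if x <= t) >= min_sup and constraint(x)]
        result += level
        # generate the next level from the survivors only
        level = list({a | b for a in level for b in level
                      if len(a | b) == len(a) + 1})
    return result

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}
# C: range(S.profit) <= 15, the slide's anti-monotone example
C = lambda s: max(profit[i] for i in s) - min(profit[i] for i in s) <= 15
tdb = [set("abcdf"), set("bcdfgh"), set("acdef"), set("cefg")]
print(sorted("".join(sorted(x)) for x in mine_with_antimonotone(tdb, 2, C)))
```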
Monotonicity for Constraint Pushing
- min(S.Price) ≤ v is monotone

Succinctness
- But sum(S.Price) ≥ v is not succinct

Strongly Convertible Constraints (example: avg(X) ≥ 25, with the item-profit table below)
- In item-value descending order R = <a, f, g, d, b, h, c, e>:
  - If an itemset afb violates C, so do afbh, afb*, …
  - The constraint becomes anti-monotone!
- In item-value ascending order R-1 = <e, c, h, b, d, g, f, a>:
  - If an itemset d satisfies constraint C, so do the itemsets df and dfa, which have d as a prefix
  - The constraint becomes monotone
- Thus, avg(X) ≥ 25 is strongly convertible (the ordering trick is sketched below)

TDB (min_sup = 2), item profits:

  Item  Profit
  a      40
  b       0
  c     -20
  d      10
  e     -30
  f      30
  g      20
  h     -10
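A tiny demonstration of the convertibility argument, using the slide's profit table: sorting by descending value yields the order R, and a prefix whose average drops below the bound can only get worse when extended along that order.

```python
profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}

def avg_profit(itemset):
    """Average profit of an itemset (iterating a string yields its items)."""
    return sum(profit[i] for i in itemset) / len(itemset)

# Value-descending order R: later items only lower the average, so once a
# prefix such as 'afb' violates avg(X) >= 25, every extension (afbh, ...)
# violates it too; along R the constraint behaves anti-monotonically.
R = sorted(profit, key=lambda i: -profit[i])
print(R)                 # ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']
print(avg_profit("af"))  # 35.0, satisfies avg >= 25
print(avg_profit("afb")) # ~23.33, violates; so do afbh, afbc, ...
```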
Constraint-Based Mining—A General Picture

A Classification of Constraints