Large Databases
By Group 10
Sadler Divers 103315414
Beili Wang 104522400
Xiang Xu 106067660
Xiaoxiang Zhang 105635826
Sadler Divers
References
[1] J. Han and M. Kamber, "Data Mining: Concepts and
Techniques", 2nd Edition, Morgan Kaufmann Publishers,
August 2006.
• Why?
To gain Information, Knowledge, Money, etc.
Applications
Market Basket Analysis
Cross-Marketing
Catalog Design
• Apriori Algorithm
• Vertical Format
Concepts and Definitions
• Let I = {I1, I2, …, Im} be a set of items
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
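A minimal runnable sketch of this loop in Python (function and variable names are my own; min_support is an absolute count):

from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise frequent-itemset mining, following the pseudo-code above
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    frequent = set(current)
    k = 1
    while current:
        # Join step: combine frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # One scan of the database counts all surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_support}
        frequent |= current
        k += 1
    return frequent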
• Calculate Confidence: confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
• B ∧ C ⇒ E: Confidence = 2/2 = 100%
• B ∧ E ⇒ C: Confidence = 2/3 = 66%
• C ∧ E ⇒ B: Confidence = 2/2 = 100%
• B ⇒ C ∧ E: Confidence = 2/3 = 66%
• C ⇒ B ∧ E: Confidence = 2/3 = 66%
• E ⇒ B ∧ C: Confidence = 2/3 = 66%
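These ratios are consistent with the classic four-transaction example database {A,C,D}, {B,C,E}, {A,B,C,E}, {B,E}; assuming that database (the table itself is not reproduced in these slides), a quick check in Python:

db = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]  # assumed example DB

def support(itemset):
    # number of transactions containing every item in itemset
    return sum(set(itemset) <= t for t in db)

# confidence(X ⇒ Y) = support(X ∪ Y) / support(X); here X ∪ Y = {B, C, E}
for antecedent in (['B','C'], ['B','E'], ['C','E'], ['B'], ['C'], ['E']):
    consequent = sorted(set('BCE') - set(antecedent))
    conf = support(set('BCE')) / support(antecedent)
    print(antecedent, '=>', consequent, f'{conf:.1%}')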
Improving the Efficiency of Apriori
Beili Wang
References
[1] A. Savasere, E. Omiecinski, and S. Navathe. "An Efficient Algorithm for Mining Association Rules in Large Databases". VLDB'95, 432-444, Zurich, Switzerland. <http://www.informatik.uni-trier.de/~ley/db/conf/vldb/SavasereON95.html>
[2] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan
Kaufmann Publishers. March 2006. Chapter 5, Section 5.2.3, Page 240.
Source: Presentation Slides of Prof. Anita Wasilewska, 07. Association Analysis, page 51
Partition Algorithm: Basics
• Definition:
A partition p ⊆ D of the database refers to any subset of the
transactions contained in the database D. Any two different
partitions are non-overlapping, i.e., p_i ∩ p_j = ∅ for i ≠ j.
• Ideas:
Any itemset that is potentially frequent in DB must
be frequent in at least one of the partitions of DB.
Partition scans DB only twice:
Scan 1: partition database and find local frequent
patterns.
Scan 2: consolidate global frequent patterns.
Partition Algorithm
Initially the database D is logically partitioned into n partitions.
Merge phase:
input: local large itemsets of the same length from all n partitions
output: the combined global candidate itemsets. The set of global
candidate itemsets of length j is computed as C_j^G = ⋃_{i=1,…,n} L_j^i
Algorithm output: itemsets that have at least the minimum global support, along with
their supports. The algorithm reads the entire database twice.
Partition Algorithm: Pseudo code
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 7, <http://citeseer.ist.psu.edu/sarasere95efficient.html>
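The original pseudo code is in the figure cited above; a condensed Python sketch of the same two-scan scheme (helper names are my own, and it reuses the Apriori sketch from earlier in these slides as the local miner):

def partition_mine(transactions, n_partitions, min_support_ratio):
    # Scan 1: mine each partition with a proportionally scaled local
    # minimum support, then union the local results into global candidates
    size = len(transactions)
    step = -(-size // n_partitions)          # ceiling division
    parts = [transactions[i:i + step] for i in range(0, size, step)]
    candidates = set()
    for part in parts:
        local_min = max(1, int(min_support_ratio * len(part)))
        candidates |= apriori(part, local_min)
    # Scan 2: one pass over the whole database counts the global candidates
    txns = [frozenset(t) for t in transactions]
    counts = {c: sum(c <= t for t in txns) for c in candidates}
    global_min = min_support_ratio * size
    return {c: n for c, n in counts.items() if n >= global_min}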
Partition Algorithm: Example
Consider a small database with four items I={Bread, Butter, Eggs, Milk}
and four transactions as shown in Table 1. Table 2 shows all itemsets for I.
Suppose that the minimum support and minimum confidence of an
association rule are 40% and 60%, respectively.
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 22, <http://citeseer.ist.psu.edu/sarasere95efficient.html>
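Tables 1 and 2 themselves are in the cited figure; with hypothetical transactions over the same four items, the sketch above would run as:

txns = [{'Bread','Butter'}, {'Butter','Eggs'},
        {'Bread','Butter','Milk'}, {'Eggs','Milk'}]   # hypothetical data
frequent = partition_mine(txns, n_partitions=2, min_support_ratio=0.4)
# 40% of 4 transactions means a global support count of at least 2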
Performance Comparison – Disk IO
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 23, <http://citeseer.ist.psu.edu/sarasere95efficient.html>
Performance Comparison – Scale-up
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 24, <http://citeseer.ist.psu.edu/sarasere95efficient.html>
Parallelization in Parallel Database
The structure of the Partition algorithm means that the per-partition
processing can essentially be done in parallel.
4. The local counts at each node are sent to all other nodes. The
global support is the sum of all local supports.
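A minimal sketch of that exchange step, assuming each node has already produced a dictionary of local candidate counts:

from collections import Counter

def merge_local_counts(local_counts_per_node):
    # Global support of each itemset = sum of its local supports
    total = Counter()
    for node_counts in local_counts_per_node:
        total.update(node_counts)
    return total

node_a = {frozenset({'Bread', 'Butter'}): 40, frozenset({'Milk'}): 55}  # hypothetical
node_b = {frozenset({'Bread', 'Butter'}): 25, frozenset({'Milk'}): 60}  # hypothetical
print(merge_local_counts([node_a, node_b]))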
Conclusion
• The Partition algorithm achieves both CPU and I/O improvements
over the Apriori algorithm
[4] J. Han and Y. Fu, ''Discovery of Multiple-Level Association Rules from Large
Databases'', In Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• Encoded transaction: T1 {111, 121, 211, 221} (each digit encodes one level of the item's concept hierarchy, from the most general level down)
Source: J. Han and Y. Fu, ''Discovery of Multiple-Level Association Rules from Large
Databases'', Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).
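A sketch of how such an encoding can be read, assuming each digit indexes one level of the hierarchy; the lookup tables below are hypothetical stand-ins for the paper's concept hierarchy:

CATEGORY = {'1': 'milk', '2': 'bread'}                         # level 1 (hypothetical)
CONTENT  = {('1', '1'): '2%',    ('1', '2'): 'chocolate',      # level 2 (hypothetical)
            ('2', '1'): 'white', ('2', '2'): 'wheat'}
BRAND    = {('1', '1'): 'Dairyland', ('1', '2'): 'Foremost',   # level 3 (hypothetical)
            ('2', '1'): 'Wonder',    ('2', '2'): 'Homestyle'}

def decode(code):
    # Expand an encoded item such as '111' into its hierarchy path
    c, s, b = code          # one digit per hierarchy level
    return (CATEGORY[c], CONTENT[(c, s)], BRAND[(c, b)])

print([decode(item) for item in ['111', '121', '211', '221']])
# [('milk', '2%', 'Dairyland'), ('milk', 'chocolate', 'Dairyland'),
#  ('bread', 'white', 'Wonder'), ('bread', 'wheat', 'Wonder')]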
Multilevel Association:
Uniform vs Reduced Support
• Uniform support
• Reduced support
Uniform Support
• Same minimum support threshold for all levels
Level 1 (min support = 5%): Milk [support = 10%]
• Quantitative Attributes
– Numeric, implicit ordering among values
Mining Quantitative Associations
• Static discretization based on predefined concept
hierarchies
Source: J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers,
August 2000.
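As a brief illustration of static discretization, a numeric attribute such as age can be replaced by its predefined interval before mining (the bin boundaries here are hypothetical):

AGE_BINS = [(0, 20, 'young'), (20, 40, 'middle'), (40, 120, 'senior')]  # hypothetical hierarchy

def discretize_age(age):
    # Map a raw age onto its predefined interval label
    for lo, hi, label in AGE_BINS:
        if lo <= age < hi:
            return label
    raise ValueError(f"age out of range: {age}")

print([discretize_age(a) for a in (17, 33, 64)])   # ['young', 'middle', 'senior']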
Dynamic Discretization of
Quantitative Association Rules
• Numeric attributes are dynamically discretized
– The confidence of the rules mined is maximized
d(S[X]) = [ Σ_{i=1}^{N} Σ_{j=1}^{N} dist_X(t_i[X], t_j[X]) ] / [ N(N − 1) ]
• dist_X: a distance metric on the values for the attribute set X
(e.g., Euclidean distance or Manhattan distance)
Clusters and Distance
Measurements (Cont.)
• Cluster C_X
– Density threshold d_0^X: d(C_X) ≤ d_0^X
– Frequency threshold s_0: |C_X| ≥ s_0
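A direct transcription of d(S[X]) and the two cluster tests into Python (names are my own; dist can be any metric, e.g. Manhattan):

def diameter(points, dist):
    # d(S[X]): sum of all pairwise distances, divided by N(N − 1)
    n = len(points)
    total = sum(dist(p, q) for p in points for q in points)   # i = j terms are 0
    return total / (n * (n - 1))

def is_cluster(points, dist, d0, s0):
    # C_X qualifies if it is both dense enough and frequent enough
    return diameter(points, dist) <= d0 and len(points) >= s0

manhattan = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
pts = [(30, 50), (31, 52), (29, 49)]          # hypothetical tuples t_i[X]
print(diameter(pts, manhattan), is_cluster(pts, manhattan, d0=5, s0=3))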
Xiaoxiang Zhang
Sources/References:
[1] J. Han and M. Kamber. "Data Mining: Concepts and
Techniques". Morgan Kaufmann Publishers.
• However, tea ⇒ coffee is misleading, since the overall
probability of purchasing coffee is 90%, which is larger
than the rule's 80% confidence.
• The above example illustrates that the
confidence of a rule A ⇒ B can be deceiving:
it is only an estimate of the conditional
probability of itemset B given itemset A.
Measuring Correlation
• One way of measuring correlation is
corr_{A,B} = p(A ∪ B) / (p(A) · p(B))
• If the resulting value is equal to 1, then A and B are
independent; if it is greater than 1, then A and B are positively
correlated; otherwise A and B are negatively correlated.
For the above example,
p(t ∧ c) / (p(t) · p(c)) = 0.2 / (0.25 × 0.9) ≈ 0.89,
which is less than 1, indicating a negative correlation
between buying tea and buying coffee.
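The same arithmetic in Python:

p_t, p_c, p_tc = 0.25, 0.90, 0.20   # P(tea), P(coffee), P(tea and coffee)
corr = p_tc / (p_t * p_c)
print(round(corr, 2))                # 0.89 < 1: negative correlation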
• Is the above way of measuring the correlation
good enough?
• So, we introduce:
The chi-squared test for independence
The chi-squared test for independence
• Let R = {i1, ¬i1} × … × {ik, ¬ik} and r = r1…rk ∈ R,
where ¬ij denotes the absence of item ij.
• Here R is the set of all possible basket values, and r is
a single basket value. Each value of r denotes a cell; this
terminology comes from the view that R is a k-dimensional
contingency table.
Let O(r) denote the number of baskets falling into cell r.
• The chi-squared statistic is defined as:
χ² = Σ_r (O(r) − E[r])² / E[r]
What does the chi-squared statistic mean?
E[r] is the number of baskets expected to fall into cell r under the
assumption that the items occur independently. A χ² value close to 0
is consistent with independence; the larger the value, the stronger
the evidence that the items are correlated.
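A sketch of the statistic for the tea/coffee example, assuming a hypothetical total of 1000 baskets so the probabilities above become counts:

N = 1000                                       # hypothetical basket total
p_t, p_c, p_tc = 0.25, 0.90, 0.20

# Observed counts for the four cells of the 2x2 contingency table
observed = {('t', 'c'):   N * p_tc,                    # 200
            ('t', '~c'):  N * (p_t - p_tc),            # 50
            ('~t', 'c'):  N * (p_c - p_tc),            # 700
            ('~t', '~c'): N * (1 - p_t - p_c + p_tc)}  # 50

# Expected counts under independence: E[r] = N * P(row) * P(column)
marg_t = {'t': p_t, '~t': 1 - p_t}
marg_c = {'c': p_c, '~c': 1 - p_c}
expected = {(a, b): N * marg_t[a] * marg_c[b] for a in marg_t for b in marg_c}

chi2 = sum((observed[r] - expected[r]) ** 2 / expected[r] for r in observed)
print(round(chi2, 1))   # about 37.0: far from 0, so tea and coffee are not independent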