
Mining Association Rules

Data Mining Overview

Data Mining
- Data warehouses and OLAP (On-Line Analytical Processing)
- Association Rules Mining
- Clustering: hierarchical and partitional approaches
- Classification: decision trees and Bayesian classifiers
- Sequential Patterns Mining
- Advanced topics: outlier detection, web mining

Association Rules:
Background

Given: (1) a database of transactions; (2) each transaction is
a list of items (purchased by a customer in a visit)
Find: all association rules that satisfy user-specified
minimum support and minimum confidence
Example: 30% of transactions that contain beer also
contain diapers; 5% of transactions contain both items
- 30%: confidence of the rule
- 5%: support of the rule
We are interested in finding all rules rather than verifying
whether a given rule holds

Rule Measures: Support and Confidence

[Figure: Venn diagram of customers who buy beer, customers who
buy diapers, and customers who buy both]

Find all the rules X & Y ⇒ Z with
minimum confidence and support
- support, s: probability that a
  transaction contains {X, Y, Z}
- confidence, c: conditional probability
  that a transaction having {X, Y}
  also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support = 50% and minimum
confidence = 50%; then we have
  A ⇒ C (50%, 66.6%)
  C ⇒ A (50%, 100%)
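
As a concrete illustration, here is a minimal Python sketch of these
two measures over the transaction table above (the function and
variable names are ours, not from the slides):

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = sup(X u Y) / sup(X)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"A", "C"}, transactions))        # 0.5, the rule support
print(confidence({"A"}, {"C"}, transactions))   # 0.666..., for A => C
print(confidence({"C"}, {"A"}, transactions))   # 1.0, for C => A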

Application Examples

Market Basket Analysis
- * ⇒ Maintenance Agreement (what should the store do to
  boost Maintenance Agreement sales?)
- Home Electronics ⇒ * (what other products should the store
  stock up on if it has a sale on Home Electronics?)
- Attached mailing in direct marketing
Detecting ping-ponging of patients
- Transaction: patient
- Item: doctor/clinic visited by the patient
- Support of the rule: number of common patients
- HIC Australia success story

Problem Statement

I = {i1, i2, ..., im}: a set of literals, called items
Transaction T: a set of items s.t. T ⊆ I
Database D: a set of transactions
A transaction T contains X, a set of items in I, if X ⊆ T
An association rule is an implication of the form X ⇒ Y,
where X, Y ⊂ I

The rule X ⇒ Y holds in the transaction set D with confidence
c if c% of transactions in D that contain X also contain Y
The rule X ⇒ Y has support s in the transaction set D if s% of
transactions in D contain X ∪ Y
Find all rules that have support and confidence greater than
user-specified minimum support and minimum confidence

Association Rule Mining: A Road Map

Boolean vs. quantitative associations (based on the types
of values handled)
  buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner")
  [0.2%, 60%]
  age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC")
  [1%, 75%]
Single-dimensional vs. multi-dimensional associations
(see the examples above)
Single-level vs. multiple-level analysis
  What brands of beer are associated with what brands of diapers?
Various extensions
- Correlation, causality analysis
  (association does not necessarily imply correlation or causality)
- Constraints enforced
  (e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?)

Problem Decomposition
1. Find all sets of items that have minimum
support (frequent itemsets)
2. Use the frequent itemsets to generate the
desired rules

Problem Decomposition:
Example

Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt

Frequent Itemset    Support
{Shoes}             75%
{Shirt}             50%
{Jacket}            50%
{Shoes, Jacket}     50%

For min support = 50% (= 2 transactions)
and min confidence = 50%:

For the rule Shoes ⇒ Jacket:
  Support = sup({Shoes, Jacket}) = 50%
  Confidence = 50/75 = 66.6%

Jacket ⇒ Shoes has 50% support and 100% confidence

Discovering Rules

Naïve algorithm:
for each frequent itemset l do
  for each non-empty proper subset c of l do
    if (support(l) / support(l - c) >= minconf) then
      output the rule (l - c) ⇒ c,
      with confidence = support(l) / support(l - c)
      and support = support(l)
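
A minimal Python rendering of this naive algorithm, assuming `freq`
maps every frequent itemset (as a frozenset) to its support, exactly
what Apriori produces; all names are illustrative:

from itertools import combinations

def generate_rules(freq, min_conf):
    """Naive rule generation: for each frequent itemset l and each
    non-empty proper subset c of l, emit (l - c) => c whenever
    support(l) / support(l - c) >= min_conf."""
    rules = []
    for l, sup_l in freq.items():
        for r in range(1, len(l)):
            for c in map(frozenset, combinations(l, r)):
                conf = sup_l / freq[l - c]   # l - c is frequent too
                if conf >= min_conf:
                    rules.append((l - c, c, sup_l, conf))
    return rules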

Discovering Rules (2)

Lemma. If consequent c generates a valid rule, so do all
non-empty subsets of c (e.g. if X ⇒ YZ holds, then XY ⇒ Z
and XZ ⇒ Y hold as well).

Example: Consider a frequent itemset ABCDE.
If ACDE ⇒ B and ABCE ⇒ D are the only one-consequent
rules with minimum confidence, then
ACE ⇒ BD is the only other rule that needs to be tested

Mining Frequent Itemsets: the Key Step

Find the frequent itemsets: the sets of items
that have minimum support
- A subset of a frequent itemset must also be a frequent
  itemset; i.e., if {A, B} is a frequent itemset, both {A}
  and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from
  1 to k (k-itemsets)
Use the frequent itemsets to generate
association rules.

The Apriori Algorithm

Lk: set of frequent itemsets of size k (those with
min support)
Ck: set of candidate itemsets of size k (potentially
frequent itemsets)

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1
    that are contained in t
  Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
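
The following Python sketch mirrors this loop under one stated
simplification: candidates are generated as all (k+1)-item
combinations whose k-subsets are all frequent, which yields the same
output as the join-and-prune step described on a later slide, just
less efficiently. All names are ours:

from itertools import combinations

def apriori(transactions, min_support):
    """Sketch of the Apriori loop. `transactions` is a list of sets,
    `min_support` an absolute count. Returns {frozenset: count}."""
    counts = {}
    for t in transactions:                     # first scan: count items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_support}   # L1
    frequent = dict(L)
    k = 1
    while L:
        # Ck+1: (k+1)-sets of surviving items with all k-subsets frequent
        items = sorted({i for s in L for i in s})
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in L for s in combinations(c, k))]
        # one database scan counts every candidate contained in each t
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        L = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(L)
        k += 1
    return frequent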

The Apriori Algorithm: Example

Min support = 50% = 2 transactions

Database D:
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D -> C1 with counts: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (candidates with min support): {1}:2, {2}:3, {3}:3, {5}:3

C2, generated from L1: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3, generated from L2: {2 3 5}
Scan D -> L3: {2 3 5}:2
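
Running the `apriori` sketch from the previous slide on this database
(absolute min support of 2) reproduces L1, L2 and L3:

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, 2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
# [1] 2 / [2] 3 / [3] 3 / [5] 3
# [1, 3] 2 / [2, 3] 2 / [2, 5] 3 / [3, 5] 2 / [2, 3, 5] 2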

How to Generate Candidates?

Suppose the items in Lk-1 are listed in order.

Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1

Step 2: pruning

forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
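
A Python sketch of both steps, with itemsets represented as sorted
tuples; `apriori_gen` is our illustrative name, not a function from
the slides:

from itertools import combinations

def apriori_gen(L_prev, k):
    """Join + prune. `L_prev` is the set of frequent (k-1)-itemsets,
    each a sorted tuple of items; returns the candidate k-itemsets."""
    # Step 1: self-join -- two (k-1)-itemsets agreeing on their first
    # k-2 items produce one k-itemset candidate
    Ck = set()
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    # Step 2: prune any candidate with an infrequent (k-1)-subset
    return {c for c in Ck
            if all(s in L_prev for s in combinations(c, k - 1))}

On the example of the next slide:

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"),
      ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3, 4))
# {('a', 'b', 'c', 'd')}: acde was pruned since ade is not in L3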

Example of Generating
Candidates

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}

How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates

Method:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets
  and counts
- An interior node contains a hash table
- Subset function: finds all the candidates
  contained in a transaction

Hash-tree: Search

Given a transaction T and a candidate set Ck, find all
members of Ck contained in T
- Assume an ordering on the items
- Start from the root; use every item in T to go to the next
  node
- If you are at an interior node and you just used item i, then
  use each item that comes after i in T
- If you are at a leaf node, check the itemsets
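
As a simplified sketch of this search, the candidates can be stored
in a plain prefix trie instead of a full hash-tree (interior hash
tables become dictionaries keyed on the next item); the descent logic
is the one described above. All names are ours:

class TrieNode:
    """Simplified stand-in for a hash-tree node: interior levels
    branch on the next item; count is not None where a candidate ends."""
    def __init__(self):
        self.children = {}
        self.count = None

def insert(root, itemset):
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, TrieNode())
    node.count = 0

def count_subsets(node, t, start):
    """Descend with every item of sorted transaction t from position
    `start`; each reached leaf is a candidate contained in t."""
    if node.count is not None:
        node.count += 1
    for i in range(start, len(t)):
        child = node.children.get(t[i])
        if child is not None:
            count_subsets(child, t, i + 1)

# usage: build the trie from C2, then scan the database once
root = TrieNode()
for c in [(1, 3), (2, 3), (2, 5), (3, 5)]:
    insert(root, c)
for t in [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]:
    count_subsets(root, sorted(t), 0)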

Methods to Improve Apriori's Efficiency

- Transaction reduction: a transaction that does not contain
  any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB
  must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lower support
  threshold, plus a method to determine the completeness
- Dynamic itemset counting: add new candidate itemsets
  only when all of their subsets are estimated to be frequent

Is Apriori Fast Enough?
Performance Bottlenecks

The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent
  k-itemsets
- Use database scans and pattern matching to collect counts for
  the candidate itemsets

The bottleneck of Apriori: candidate generation
- Huge candidate sets:
  10^4 frequent 1-itemsets will generate 10^7 candidate
  2-itemsets
  To discover a frequent pattern of size 100, e.g.,
  {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30
  candidates
- Multiple scans of the database:
  needs (n+1) scans, where n is the length of the longest pattern

Max-Miner

Max-Miner finds long patterns efficiently:
the maximal frequent patterns
- Instead of checking all subsets of a long
  pattern, it tries to detect long patterns early
- Scales linearly with the length of the longest pattern

Max-Miner: the Idea

[Figure: set-enumeration tree of the ordered set {1,2,3,4}:
root {}; then 1, 2, 3, 4; then 1,2 / 1,3 / 1,4 / 2,3 / 2,4 / 3,4;
then 1,2,3 / 1,2,4 / 1,3,4 / 2,3,4; then 1,2,3,4]

Pruning: (1) subset infrequency
         (2) superset frequency

Each node is a candidate group g:
- h(g) is the head: the itemset of the node
- t(g) is the tail: an ordered set that contains
  all items that can appear in the subnodes

Example: h({1}) = {1} and t({1}) = {2,3,4}

Max-Miner Pruning

When we count the support of a candidate group g,
we also compute the support of
h(g), h(g) ∪ t(g), and h(g) ∪ {i} for each i in t(g)
- If h(g) ∪ t(g) is frequent, then stop expanding
  the node g and report the union as a frequent
  itemset
- If h(g) ∪ {i} is infrequent, then remove i from all
  subnodes (just remove i from the tail of any group
  after g)
- Expand the node g by one item and do the same

The Algorithm
Max-Miner:
  set of candidate groups C <- {}
  set of itemsets F <- {Gen-Initial-Groups(T, C)}
  while C is not empty do
    scan T to count the support of all candidate groups in C
    for each g in C s.t. h(g) ∪ t(g) is frequent do
      F <- F ∪ {h(g) ∪ t(g)}
    set of candidate groups Cnew <- {}
    for each g in C such that h(g) ∪ t(g) is infrequent do
      F <- F ∪ {Gen-Sub-Nodes(g, Cnew)}
    C <- Cnew
    remove from F any itemset with a proper superset in F
    remove from C any group g s.t. h(g) ∪ t(g) has a superset in F
  return F

The Algorithm (2)

Gen-Initial-Groups(T, C):
  scan T to obtain F1, the set of frequent 1-itemsets
  impose an ordering on the items in F1
  for each item i in F1 other than the greatest item do
    let g be a new candidate with h(g) = {i}
    and t(g) = {j | j follows i in the ordering}
    C <- C ∪ {g}
  return the itemset F1 (and the C, of course)

Gen-Sub-Nodes(g, C):  /* generation of new groups at the next level */
  remove any item i from t(g) if h(g) ∪ {i} is infrequent
  reorder the items in t(g)
  for each i in t(g) other than the greatest do
    let g' be a new candidate with h(g') = h(g) ∪ {i}
    and t(g') = {j | j in t(g) and j is after i in t(g)}
    C <- C ∪ {g'}
  return h(g) ∪ {m}, where m is the greatest item in t(g),
  or h(g) if t(g) is empty
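
Putting the pieces together, here is a compact Python sketch of
Max-Miner under the representation above: a candidate group is a
(head, tail) pair, and supports are recomputed per group for brevity
rather than collected in one counting pass as the pseudocode implies.
All helper names are ours:

def max_miner(T, min_sup):
    """Sketch of Max-Miner. T: list of item sets, min_sup: absolute
    count. Returns the maximal frequent itemsets."""
    def sup(itemset):
        return sum(1 for trans in T if itemset <= trans)

    # frequent items, ordered by increasing support (see next slide)
    items = [i for i in sorted({i for trans in T for i in trans},
                               key=lambda i: sup(frozenset([i])))
             if sup(frozenset([i])) >= min_sup]
    if not items:
        return []

    F = [frozenset(items[-1:])]   # greatest item, as in Gen-Initial-Groups
    C = [(frozenset([items[k]]), items[k + 1:])
         for k in range(len(items) - 1)]
    while C:
        Cnew = []
        for h, tail in C:
            if sup(h | frozenset(tail)) >= min_sup:
                F.append(h | frozenset(tail))   # superset-frequency
                continue                        # pruning: stop expanding
            # subset-infrequency pruning: drop i when h u {i} infrequent
            tail = [i for i in tail if sup(h | frozenset([i])) >= min_sup]
            for k in range(len(tail) - 1):      # Gen-Sub-Nodes
                Cnew.append((h | frozenset([tail[k]]), tail[k + 1:]))
            # fallback itemset returned by Gen-Sub-Nodes
            F.append(h | frozenset(tail[-1:]) if tail else h)
        C = Cnew
    return [f for f in F if not any(f < g for g in F)]  # maximal only

On the example database D used for Apriori, max_miner(D, 2) returns
the maximal frequent itemsets {1, 3} and {2, 3, 5}.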

Item Ordering

By re-ordering items we try to increase the
effectiveness of superset-frequency pruning:
- Very frequent items have a higher
  probability of being contained in long
  patterns
- Put these items at the end of the ordering,
  so they appear in many tails
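
A one-function sketch of this ordering (illustrative names; ties
among equally frequent items may come out in any order):

from collections import Counter

def order_items(transactions, min_sup):
    """Order frequent items by increasing support, so the most
    frequent items come last and end up in many tails."""
    counts = Counter(i for t in transactions for i in t)
    return sorted((i for i, c in counts.items() if c >= min_sup),
                  key=lambda i: counts[i])

print(order_items([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2))
# item 1 first (lowest support, 2); items 2, 3, 5 follow (support 3)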
