
CH-IV

Mining Association Rules

Association Rule Mining (ARM)
• ARM was introduced by Agrawal et al. (1993).
• Given a set of records, each of which contains some number
of items from a given collection,
– ARM produces dependency rules that predict the
occurrence of an item based on the occurrences of other items.

• Motivation of ARM:
– Finding inherent regularities in data
– What products are often purchased together? Pasta and tea?
– What are the subsequent purchases after buying a PC?
• ARM aims to extract interesting correlations, frequent
patterns, and associations among sets of items in
transaction databases or other data repositories
Association Rule Mining Cont’d…
• Association rules are widely used in various areas
such as
– market and risk management
– inventory control
– medical diagnosis
– Web usage mining
– intrusion detection
– catalog design and
– customer shopping behavior analysis, etc.
• The goal of ARM is to find the association rules that satisfy
predefined minimum support and confidence thresholds in a
given database
Association Rule Mining Cont’d…
• Based on the concept of strong rules, Agrawal et al.
(1993) introduced association rules for discovering
regularities between products in large-scale
transaction data recorded by point-of-sale (POS)
systems in supermarkets
• For example, the rule {onions, potatoes} → {burger}
found in the sales data of a supermarket would
indicate that if a customer buys onions and potatoes
together, he or she is likely to also buy hamburger
meat
– Such information can be used as the basis for decisions about
marketing activities such as promotional pricing or product
placements
Association Rule Mining Cont’d…
• In general, ARM can be viewed as a two-step
process
– Finding all frequent (large) itemsets
• find those itemsets whose occurrences exceed a
predefined threshold in the database; those
itemsets are called frequent or large itemsets.
– Generating association rules from these
itemsets
• generate association rules from those large
itemsets, subject to a minimum confidence
constraint
Association Rule Mining Cont’d…
• The problem of ARM is defined as follows: Let I = {i1, i2, …,
in} be a set of n attributes called items. Let D = {t1, t2,
…, tm} be a set of transactions called the database.
Each transaction in D has a unique transaction ID and
contains a subset of the items in I. A rule is defined
as an implication of the form X → Y (which means that
Y is likely to be present in a transaction if X is in the
transaction), where X, Y ⊆ I and X ∩ Y = ∅
• The sets of items (itemsets for short) X and Y are
called the antecedent (left-hand side or LHS) and the
consequent (right-hand side or RHS) of the rule,
respectively
Frequent Patterns
• are patterns (such as itemsets) that appear in a data set
frequently
– For example, a set of items, such as milk and bread, that
appear frequently together in a transaction data set is a
frequent itemset
• Mining frequent patterns leads to the discovery of
associations and correlations among items in large
transactional or relational data sets
• can help in many business decision-making processes,
such as
– catalog design,
– store layout, and
– customer shopping behavior analysis
Frequent Patterns Cont’d…
• A typical example of frequent itemset mining is
market basket analysis
– analyzes customer buying habits by finding associations
between the different items that customers place in their
“shopping baskets”
• For example, if customers are buying milk, how likely
are they to also buy bread on the same trip to the
supermarket?
– Such information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space
• Support and confidence are the two measures of
association rule interestingness
– They respectively reflect the usefulness and certainty of
discovered rules
Frequent Patterns Cont’d…
• Since the database is large and users are concerned
only with the frequently purchased items, thresholds
of support and confidence are usually predefined
by users to drop those rules that are not interesting
or useful
• The two thresholds are called minimum support and
minimum confidence, respectively
– Support (s) of an association rule X → Y is defined as the
percentage/fraction of records that contain X ∪ Y out of the
total number of records in the database
– Confidence of an association rule X → Y is defined as the
percentage/fraction of the number of transactions that
contain X ∪ Y out of the total number of records that contain X
Frequent Pattern Analysis
• Basic concepts:
– itemset: a set of one or more items
– k-itemset X = {x1, …, xk}: an itemset that contains k items
– support, s: the fraction of transactions that contain X (i.e., the
probability that a transaction contains X)
• a rule X → Y is retained only if the support of X ∪ Y exceeds a
user-defined threshold s

Support(X → Y) = P(X ∪ Y) = support_count(X ∪ Y) / |D|

• P(X ∪ Y) is the probability that a transaction contains the
union of itemsets X and Y
• support_count of an itemset is the number of transactions
that contain the itemset; |D| is the total number of transactions
Frequent Pattern Analysis
• Confidence: the probability of finding Y in a transaction
that contains all the items of X
– confidence, c: the conditional probability that a transaction containing X
also contains Y; a rule is retained only if the confidence of Y given X
exceeds a user-defined threshold c

Confidence(X → Y) = P(Y|X) = support(X ∪ Y) / support(X)
                            = support_count(X ∪ Y) / support_count(X)

– P(Y|X) is the conditional probability that a transaction contains
Y given that it contains X
• An itemset X is frequent if X’s support is no less than a
minimum support (min_sup) threshold
• Rules that satisfy both a minimum support threshold (min_sup)
and a minimum confidence threshold (min_conf) are
called strong
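Both measures can be computed directly from a list of transactions. Below is a minimal Python sketch of the two definitions; the function names and the tiny example database are illustrative, not taken from the slides.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

def support(itemset, transactions):
    """Fraction of transactions containing `itemset`, i.e. P(X U Y)."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """P(Y|X) = support_count(X U Y) / support_count(X)."""
    return support_count(set(X) | set(Y), transactions) / support_count(X, transactions)

# Tiny illustrative database (not from the slides)
D = [{"milk", "bread", "butter"},
     {"milk", "bread"},
     {"bread", "beer"},
     {"milk", "beer"}]

print(support({"milk", "bread"}, D))       # 0.5
print(confidence({"milk"}, {"bread"}, D))  # 0.666...  (2 of the 3 milk baskets contain bread)
```

A rule would then be reported as strong only if both numbers clear the user-supplied min_sup and min_conf thresholds.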
Example: Finding frequent itemsets
• Given a support threshold S, sets of items X that
appear in at least S baskets are called
frequent itemsets
• Example: Frequent Itemsets
– Items available = {milk (m), coke (c), pepsi (p), biscuit (b), juice (j)}.
– Support threshold S = 4 baskets.
B1 = {m, c, b}     B2 = {m, p, j}
B3 = {m, b}        B4 = {c, j}
B5 = {m, p, b}     B6 = {m, c, b, j}
B7 = {c, b, j}     B8 = {b, c}
– Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}
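A brute-force check of this example (not the Apriori algorithm yet, just counting every 1- and 2-itemset) can be written in a few lines; the variable names are illustrative:

```python
from itertools import combinations

# Baskets from the slide, using the one-letter item codes
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
S = 4  # support threshold, in number of baskets

items = sorted(set().union(*baskets))
for k in (1, 2):
    for cand in combinations(items, k):
        count = sum(1 for basket in baskets if set(cand) <= basket)
        if count >= S:
            print(set(cand), count)
# prints {b}:6, {c}:5, {j}:4, {m}:5, {b,c}:4, {b,m}:4 -- the six frequent itemsets listed above
```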
Association Rules
• Find all rules of the form X → Y on itemsets, with minimum
support and confidence
– If-then rules about the contents of baskets.
• {i1, i2, …, ik} → j means: “if a basket contains all of i1, …, ik then
it is likely to contain j.”
• A typical question: “find all association rules with support ≥ s
and confidence ≥ c.” Note: the “support” of an association rule is
the support of the set of items it mentions.
– The confidence of this association rule is the probability of j given
i1, …, ik, i.e. the fraction of the baskets containing i1, …, ik that
also contain j
– Example: confidence of the rule {m, b} → c
B1 = {m, c, b}     B2 = {m, p, j}     B3 = {m, b}
B4 = {c, j}        B5 = {m, p, b}     B6 = {m, c, b, j}
B7 = {c, b, j}     B8 = {b, c}
• An association rule: {m, b} → c (with confidence = 2/4 = 50%)
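A quick check of that confidence value with the same baskets (again just direct counting; the names are illustrative):

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

X, j = {"m", "b"}, "c"
n_X  = sum(1 for b in baskets if X <= b)           # baskets containing m and b: 4
n_Xj = sum(1 for b in baskets if (X | {j}) <= b)   # baskets also containing c: 2

print(n_Xj, "/", n_X, "=", n_Xj / n_X)             # 2 / 4 = 0.5, i.e. 50% confidence
```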
Example: Association Rules
• Let’s say min_support = 50% and min_confidence = 50%; identify
the frequent item pairs and derive the association rules
(with 5 transactions, 50% support means a count of at least 3)

Tid   Items bought
10    Coke, Nuts, Tea
20    Coke, Coffee, Tea
30    Coke, Tea, Eggs
40    Nuts, Eggs, Milk
50    Coffee, Tea, Eggs, Milk

• Frequent patterns:
– Coke:3, Tea:4, Eggs:3, {Coke, Tea}:3
• Association rules (support, confidence):
– Coke → Tea (60%, 100%)
– Tea → Coke (60%, 75%)
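The same kind of direct count verifies these numbers (illustrative sketch):

```python
D = [
    {"Coke", "Nuts", "Tea"},
    {"Coke", "Coffee", "Tea"},
    {"Coke", "Tea", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Coffee", "Tea", "Eggs", "Milk"},
]

def count(items):
    return sum(1 for t in D if items <= t)

# Coke -> Tea: support 3/5 = 60%, confidence 3/3 = 100%
# Tea -> Coke: support 3/5 = 60%, confidence 3/4 = 75%
print(count({"Coke", "Tea"}) / len(D),
      count({"Coke", "Tea"}) / count({"Coke"}),
      count({"Coke", "Tea"}) / count({"Tea"}))
```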
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
– A two-pass approach (when mining frequent pairs) that limits the
need for main memory
– Key idea: if a set of items appears at least s times, so does
every subset.
• Contrapositive for pairs: if item i does not appear in s baskets,
then no pair including i can appear in s baskets.
• FP-Growth: A Frequent Pattern-Growth Approach
– Mines frequent patterns without candidate generation
– Uses the Apriori pruning principle
– Scans the DB only twice!
• Once to find the frequent 1-itemsets (single-item patterns)
• Once more to construct the FP-tree, the data structure of
FP-Growth
A-Priori Algorithm
• Apriori is a seminal algorithm proposed by R.
Agrawal and R. Srikant in 1994 for mining frequent
itemsets for Boolean association rules
– The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset
properties
• Apriori employs an iterative approach known as a
level-wise search, where k-itemsets are used to
explore (k + 1)-itemsets
– First, the set of frequent 1-itemsets is found by scanning
the database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The
resulting set is denoted L1
A-Priori Algorithm Cont’d…
– Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent
k-itemsets can be found
– Finding each Lk requires one full scan of the
database
• To improve the efficiency of the level-wise
generation of frequent itemsets, an important property
called the Apriori property is used to reduce the
search space
• Apriori property: all nonempty subsets of a frequent
itemset must also be frequent
– E.g., if {Coke, Tea, Nuts} is frequent, so is {Coke, Tea};
i.e., every transaction having {Coke, Tea, Nuts} also
contains {Coke, Tea}
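In code, this property is exactly what allows a candidate (k+1)-itemset to be pruned without scanning the database: if any of its k-item subsets is missing from Lk, the candidate cannot be frequent. A minimal sketch (illustrative names, not from the slides):

```python
from itertools import combinations

def has_infrequent_subset(candidate, L_k):
    """Return True if some k-subset of the (k+1)-item candidate is not in L_k."""
    k = len(candidate) - 1
    return any(frozenset(sub) not in L_k for sub in combinations(candidate, k))

# {Coke, Tea} is the only frequent 2-itemset, so {Coke, Tea, Nuts} gets pruned
L2 = {frozenset({"Coke", "Tea"})}
print(has_infrequent_subset(frozenset({"Coke", "Tea", "Nuts"}), L2))  # True -> prune
```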
A-Priori Algorithm Cont’d…
• Pass 1: Read baskets and count in main memory the
occurrences of each item
– Requires only memory proportional to the number of items
• Pass 2: Read baskets again and count in main memory only
those pairs both of whose items were found frequent in Pass 1
– Requires memory proportional to the square of the number of
frequent items only
[Figure: main-memory layout in the two passes: item counts and
frequent items in Pass 1; counts of candidate pairs in Pass 2]
Apriori: A Candidate Generation & Test
Approach
• Apriori pruning principle: if any itemset is
infrequent, its supersets should not be generated/tested!
• Method:
– Initially, scan the DB once to get the frequent 1-itemsets
– Generate length-(k+1) candidate itemsets from the length-k
frequent itemsets. For each k, we construct two sets of
k-itemsets:
• Ck = candidate k-itemsets = those that might be frequent
(support ≥ s) based on information from the pass for k-1
• Lk = the set of truly frequent k-itemsets
– Test the candidates against the DB
– Terminate when no frequent or candidate set can be
generated
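Putting the steps together, a compact level-wise Apriori loop might look like the sketch below. It is a brute-force teaching version (self-join of Lk for candidate generation, Apriori pruning, then a counting pass); the function name `apriori` and the variable names are my own, not from the slides.

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Return {frozenset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    L = {c: n for c, n in count(items).items() if n >= min_sup_count}
    frequent = dict(L)

    k = 1
    while L:
        # Candidate generation: self-join of Lk, keeping only (k+1)-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must itself be in Lk
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        # Counting pass: test the surviving candidates against the database
        L = {c: n for c, n in count(candidates).items() if n >= min_sup_count}
        frequent.update(L)
        k += 1
    return frequent

# The TDB example used on the following slides, with minimum support count 2
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, n in sorted(apriori(TDB, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), n)
```

Running it on that TDB example reproduces the L1, L2, and L3 shown in the worked example below.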
A-Priori for All Frequent Itemsets
• C1 = all items; L1 = the items counted as frequent on the first pass;
C2 = all pairs both of whose items are in L1; L2 = the members of C2
with support ≥ s. In general, Ck = the k-itemsets each of whose
(k-1)-subsets is in Lk-1, and Lk = the members of Ck with support ≥ s
[Figure: alternating construct/filter pipeline:
C1 → filter → L1 → construct → C2 → filter → L2 → construct → C3 → …]


The Apriori Algorithm: An Example
Assume that min_support = 2 and min_confidence = 80%; identify the
frequent itemsets and construct the association rules.

Database TDB
Tid   Items bought
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1 (candidate 1-itemsets with counts):
  {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets):
  {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidate 2-itemsets generated from L1):
  {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, C2 with counts:
  {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets):
  {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidate 3-itemsets): {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E};
Apriori pruning leaves only {B,C,E}, since the other candidates each
contain an infrequent 2-subset
3rd scan, L3 (frequent 3-itemsets):
  {B,C,E}:2
Which of the above rules satisfy a confidence level of at least 80%?

Rule           Support   Confidence
A → C          50%       100%
B → C          50%       66.67%
B → E          75%       100%
C → E          50%       66.67%
(B,C) → E      50%       100%
(B,E) → C      50%       66.67%

Results:
A → C (with support 50%, confidence 100%)
B → E (with support 75%, confidence 100%)
(B,C) → E (with support 50%, confidence 100%)
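The second step, rule generation, can be sketched the same way: for every frequent itemset with at least two items, split it into a non-empty antecedent X and consequent Y, and keep the rule when support_count(itemset) / support_count(X) ≥ min_conf. The function below (illustrative names) can consume the output of the `apriori` sketch shown earlier.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """`frequent` maps frozenset -> support_count; yield (X, Y, confidence)."""
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for X in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[X]            # support(X U Y) / support(X)
                if conf >= min_conf:
                    yield X, itemset - X, conf

# Frequent itemsets of the TDB example with their support counts
frequent = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
for X, Y, conf in generate_rules(frequent, 0.8):
    print(set(X), "->", set(Y), round(conf, 2))
```

Besides the three rules listed on the slide, an exhaustive enumeration also reports E → B and (C,E) → B at 100% confidence; the slide's table only examined a subset of the candidate rules.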
Bottlenecks of the Apriori Approach
• The Apriori algorithm reduces the number of candidate
frequent itemsets by using the Apriori property. However, it
still involves two nontrivial, computationally expensive
processes
i. It may generate a huge number of candidate sets that will
be discarded later in the test stage
ii. It requires as many database scans as the size of the
largest frequent itemset
- In order to find the frequent k-itemsets, the Apriori
algorithm needs to scan the database k times
- It is costly to go over each transaction in the database
to determine the support of the candidate itemsets
Pattern-Growth Approach
• The FP-Growth Approach
– FP-growth was first proposed by Han et al. (2000)
– Depth-first search: patterns are grown depth-wise by
extending combinations that start from a given single
item or pair of items
– Avoids explicit candidate generation
• FP-growth adopts a divide-and-conquer strategy
– First, it compresses the database representing the frequent
items into a frequent-pattern tree, or FP-tree, which
retains the itemset association information
Pattern-Growth Approach
– It then divides the compressed database into a set of
conditional databases (a special kind of projected
database), each associated with one frequent item or
“pattern fragment,” and mines each such database
separately
• An FP-Tree is constructed by first creating the
root of the tree, labeled with “null.”
• The algorithm generates frequent itemsets
from the FP-tree by traversing it in a
bottom-up fashion
Construct FP-tree from a Transaction Database
Assume min_support = 3 and min_confidence = 80%

TID   Items bought                  (ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o, w}            {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets
   (single-item patterns)
2. Sort the frequent items in frequency-descending order
   to obtain the f-list: F-list = f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
[Figure: the resulting FP-tree, rooted at {}, with the main branch
f:4 - c:3 - a:3 - m:2 - p:2, side branches b:1 - m:1 under a:3 and
b:1 under f:4, and a second branch c:1 - b:1 - p:1 under the root]
Construct FP-tree from a Transaction Database
• One may construct a frequent-pattern tree as
follows:
– First, a scan of DB derives a list of frequent items,
⟨(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)⟩ (the
number after “:” indicates the support), in which
items are ordered in frequency-descending order.
– Second, the root of a tree is created and labeled
with “null”
• The FP-tree is then constructed as follows by
scanning the transaction database DB a
second time
Construct FP-tree from a Transaction Database
• The scan of the first transaction leads to the
construction of the first branch of the tree: ⟨(f:1),
(c:1), (a:1), (m:1), (p:1)⟩
– Notice that the frequent items in the transaction are listed
according to the order in the list of frequent items.
• For the second transaction, since its (ordered)
frequent item list ⟨f, c, a, b, m⟩ shares a common
prefix ⟨f, c, a⟩ with the existing path ⟨f, c, a, m, p⟩,
the count of each node along the prefix is
incremented by 1, one new node (b:1) is created
and linked as a child of (a:2), and another new node
(m:1) is created and linked as the child of (b:1).
Construct FP-tree from a Transaction Database
• For the third transaction, since its frequent item list
⟨f, b⟩ shares only the node ⟨f⟩ with the f-prefix
subtree, f’s count is incremented by 1, and a new
node (b:1) is created and linked as a child of (f:3)
• The scan of the fourth transaction leads to the
construction of the second branch of the tree, ⟨(c:1),
(b:1), (p:1)⟩
• For the last transaction, since its frequent item list
⟨f, c, a, m, p⟩ is identical to the first one, the path is
shared, with the count of each node along the path
incremented by 1
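The construction just described fits in a short sketch: each node stores an item, a count, its parent and its children, and a header table keeps, per item, the list of nodes carrying that item (the node-links). This is an illustrative two-scan implementation under the stated min_support, not the original authors' code; the class and function names are assumptions.

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                              # item -> FPNode

def build_fptree(transactions, min_sup_count):
    # Scan 1: frequent single items, ordered by descending frequency (the f-list)
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, n in freq.most_common() if n >= min_sup_count]
    rank = {item: r for r, item in enumerate(flist)}

    root, header = FPNode(None), defaultdict(list)      # header: item -> node-links
    # Scan 2: insert each transaction's frequent items in f-list order
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, parent=node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

DB = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fptree(DB, 3)
print(flist)                                     # ['f', 'c', 'a', 'b', 'm', 'p'] (ties may reorder)
print([(n.item, n.count) for n in header["p"]])  # [('p', 2), ('p', 1)] -- the two p nodes of the tree
```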
Pattern-Growth Approach
• For each frequent item, construct its
conditional pattern base, and then its
conditional FP-tree
– Starting at the bottom of the frequent-item header table of the
FP-tree
• How do we find all frequent patterns from
the FP-tree? Intuitively:
1. Find all frequent patterns containing one of the items
2. Then find all frequent patterns containing the next item but
NOT containing the previous one
3. Repeat step 2 until we are out of items
FP-Growth Example
• Construct the conditional pattern base, which consists of the set of prefix
paths in the FP-tree co-occurring with the suffix pattern, and then
construct its conditional FP-tree.

Item   Conditional pattern base       Conditional FP-tree
p      {(fcam:2), (cb:1)}             {(c:3)} | p
m      {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)} | m
b      {(fca:1), (f:1), (c:1)}        --
a      {(fc:3)}                       {(f:3, c:3)} | a
c      {(f:3)}                        {(f:3)} | c
f      --                             --
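As a concrete check for the suffix item p: summing the item counts in its conditional pattern base {(f,c,a,m):2, (c,b):1} gives f:2, c:3, a:2, m:2, b:1, so with min_support = 3 only c survives. That is exactly the conditional FP-tree {(c:3)} | p, and it yields the frequent pattern {c, p}:3. A tiny illustrative snippet of that counting step:

```python
from collections import Counter

# Conditional pattern base of p: (prefix path, count) pairs read from the FP-tree
cond_base_p = [(("f", "c", "a", "m"), 2), (("c", "b"), 1)]

counts = Counter()
for path, n in cond_base_p:
    for item in path:
        counts[item] += n

min_sup = 3
print({i: n for i, n in counts.items() if n >= min_sup})   # {'c': 3}  ->  {(c:3)} | p
```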
Which of the above rules satisfy a confidence level of at least 80%?

Rule             Support count   Confidence
c → p            3               75%
(f,c,a) → m      3               100%
(c,a) → m        3               100%
f → a            3               75%
c → a            3               75%
f → c            3               75%

Results: generated association rules
(f,c,a) → m (with support count 3, confidence 100%)
(c,a) → m (with support count 3, confidence 100%)
Benefits of the FP-tree Structure
• Completeness
– Preserve complete information for frequent pattern
mining
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Never be larger than the original database

Benefits of FP-growth over Apriori

• FP-growth is faster than Apriori because:


– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building
(no pattern matching)
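For completeness, both algorithms are also available in off-the-shelf libraries. The sketch below uses the third-party mlxtend package (not mentioned in these slides) to run FP-growth and derive strong rules on the earlier TDB example; the calls are written from memory, so treat the exact API as an assumption and check the mlxtend documentation.

```python
# pip install mlxtend pandas
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules  # apriori() is also available

transactions = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support = 2 of 4 transactions = 0.5
freq = fpgrowth(onehot, min_support=0.5, use_colnames=True)

# Keep only strong rules: confidence >= 80%
rules = association_rules(freq, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```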

