Association Rule Mining (ARM)
• ARM was introduced by Agrawal et al. (1993).
• Given a set of records, each of which contains some number
of items from a given collection,
– ARM produces dependency rules that predict the
occurrence of an item based on the occurrences of other items.
• Motivation of ARM: finding inherent regularities in data
– What products are often purchased together? Pasta & Tea?
– What are the subsequent purchases after buying a PC?
• ARM aims to extract interesting correlations, frequent
patterns, and associations among sets of items in
transaction databases or other data repositories
Association Rule Mining Con’t…
• Association rules are widely used in various areas
such as
– market and risk management
– inventory control
– medical diagnosis
– Web usage mining
– intrusion detection
– catalog design and
– customer shopping behavior analysis, etc.
• The goal of ARM is to find association rules that satisfy
predefined minimum support and confidence thresholds in a
given database
Association Rule Mining Con’t…
• Based on the concept of strong rules, Agrawal et al.
(1993) introduced association rules for discovering
regularities between products in large scale
transaction data recorded by point-of-sale (POS)
systems in supermarkets
• For example, the rule {onions, potatoes} → burger
found in the sales data of a supermarket would
indicate that if a customer buys onions and potatoes
together, he or she is likely to also buy hamburger
meat
– Such information can be used as the basis for decisions about
marketing activities, e.g., promotional pricing or product
placements
Association Rule Mining Con’t…
• In general, ARM can be viewed as a two-step
process
– Finding all frequent itemsets
• find those itemsets whose occurrences exceed a
predefined threshold in the database; these
itemsets are called frequent or large itemsets
– Generating association rules from these
itemsets
• generate association rules from the large
itemsets subject to a minimum-confidence
constraint
Association Rule Mining Con’t…
• The problem of ARM is defined as follows: Let I = {i1, i2, …,
in} be a set of n attributes called items. Let D = {t1, t2,
…, tm} be a set of transactions called the database.
Each transaction in D has a unique transaction ID and
contains a subset of the items in I. A rule is defined
as an implication of the form X → Y (meaning that
Y is likely to be present in a transaction if X is in the
transaction), where X, Y ⊆ I and X ∩ Y = Ø
• The sets of items (itemsets, for short) X and Y are
called the antecedent (left-hand side, LHS) and
consequent (right-hand side, RHS) of the rule,
respectively
Frequent Patterns
• are patterns (such as itemsets) that appear in a data set
frequently
– For example, a set of items, such as milk and bread, that
appear frequently together in a transaction data set is a
frequent itemset
• Mining frequent patterns leads to the discovery of
associations and correlations among items in large
transactional or relational data sets
• can help in many business decision-making processes,
such as
– catalog design,
– store layout,
– and customer shopping behavior analysis
Frequent Patterns Con’t…
• A typical example of frequent itemset mining is
market basket analysis
– analyzes customer buying habits by finding associations
between the different items that customers place in their
“shopping baskets”
• For example, if customers are buying milk, how likely
are they to also buy bread on the same trip to the
supermarket?
– Such information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space
• Support and confidence are the two measures of
association rule interestingness
– They respectively reflect the usefulness and certainty of
discovered rules
Frequent Patterns Con’t…
• Since the database is large and users care only about
frequently purchased items, thresholds of support and
confidence are usually predefined by users to drop those
rules that are not interesting or useful
• The two thresholds are called minimal support and
minimal confidence respectively
– Support (s) of an association rule X → Y is defined as the
percentage/fraction of records that contain X ∪ Y out of the
total number of records in the database
– Confidence of an association rule X → Y is defined as the
percentage/fraction of transactions that contain X ∪ Y out
of the number of records that contain X
Frequent Pattern Analysis
• Basic concepts:
– itemset: a set of one or more items
– k-itemset X = {x1, …, xk}: an itemset that contains k items
– support, s: the fraction of transactions that contain X (i.e., the
probability that a transaction contains X); a rule requires the
support of X ∪ Y to exceed a user-defined threshold s

Support (X → Y) = P(X ∪ Y) = Support_count(X ∪ Y) / |D|

• P(X ∪ Y) indicates the probability that a transaction contains
the union of set X and set Y
• Support_count of an itemset is the number of transactions
that contain the itemset; |D| is the total number of transactions
Frequent Pattern Analysis
• Confidence: the probability of finding Y in a transaction
that contains all of X
– confidence, c, is the conditional probability that a transaction having
X also contains Y; a rule requires this conditional probability of Y
given X to exceed a user threshold c

Confidence (X → Y) = P(Y|X) = Support(X ∪ Y) / Support(X)
= Support_count(X ∪ Y) / Support_count(X)

– P(Y|X) indicates the conditional probability that a transaction
contains Y given that it contains X
• An itemset X is frequent if X’s support is no less than a
minsup threshold
• Rules that satisfy both a minimum support threshold (min
sup) and a minimum confidence threshold (min conf) are
called strong
Example: Finding frequent itemsets
• Given a support threshold (S), itemsets X that appear in
at least S baskets are called frequent itemsets
• Example: Frequent Itemsets
– Items bought = {milk, coke, pepsi, biscuit, juice}.
– Support threshold S = 4 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
– Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}
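The frequent itemsets above can be checked mechanically. This brute-force sketch simply counts every 1- and 2-itemset in the eight baskets against the threshold of 4:

```python
from itertools import combinations
from collections import Counter

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
s = 4  # support threshold, in number of baskets

# Count every 1- and 2-itemset occurring in some basket
counts = Counter()
for basket in baskets:
    for k in (1, 2):
        for itemset in combinations(sorted(basket), k):
            counts[frozenset(itemset)] += 1

frequent = {itemset for itemset, n in counts.items() if n >= s}
print(sorted("".join(sorted(i)) for i in frequent))
# ['b', 'bc', 'bm', 'c', 'j', 'm']
```

Note that p (Pepsi) appears in only 2 baskets, so neither {p} nor any pair containing it is frequent.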
Association Rules
• Find all rules on itemsets of the form X → Y with minimum
support and confidence
– If-then rules about the contents of baskets.
• {i1, i2, …, ik} → j means: “if a basket contains all of i1, …, ik,
then it is likely to contain j.”
• A typical question: “find all association rules with support ≥ s
and confidence ≥ c.” Note: the “support” of an association rule is
the support of the set of items it mentions.
– The confidence of this association rule is the probability of j given
i1, …, ik: the fraction of the baskets containing i1, …, ik that also
contain item j
– Example: Confidence of the rule {m, b} → c
B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c}
• An association rule: {m, b} → c (with confidence = 2/4 = 50%)
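Verifying the 50% confidence figure directly on the same eight baskets:

```python
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]

X, Y = {"m", "b"}, {"c"}
n_X = sum(1 for b in baskets if X <= b)       # baskets with both m and b
n_XY = sum(1 for b in baskets if X | Y <= b)  # baskets with m, b and c
print(f"confidence = {n_XY}/{n_X} = {n_XY / n_X:.0%}")
# confidence = 2/4 = 50%
```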
Example: Association Rules
• Let’s say min_support = 50% and min_confidence = 50%;
identify frequent item pairs and derive association rules

Tid   Items bought
10    Coke, Nuts, Tea
20    Coke, Coffee, Tea
30    Coke, Tea, Eggs
40    Nuts, Eggs, Milk
50    Coffee, Tea, Eggs, Milk

[Figure: Venn diagram of customers who buy Coke, customers who
buy Tea, and customers who buy both]

• Frequent patterns:
– Coke:3, Tea:4, Eggs:3, {Coke, Tea}:3
• Association rules (support, confidence):
– Coke → Tea (60%, 100%)
– Tea → Coke (60%, 75%)
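The support and confidence percentages for the two rules can be verified on the five-transaction table:

```python
db = [{"Coke", "Nuts", "Tea"},           # Tid 10
      {"Coke", "Coffee", "Tea"},         # Tid 20
      {"Coke", "Tea", "Eggs"},           # Tid 30
      {"Nuts", "Eggs", "Milk"},          # Tid 40
      {"Coffee", "Tea", "Eggs", "Milk"}] # Tid 50

def support(X):
    return sum(1 for t in db if X <= t) / len(db)

def confidence(X, Y):
    return sum(1 for t in db if X | Y <= t) / sum(1 for t in db if X <= t)

print(support({"Coke", "Tea"}))       # 3/5 = 0.6  -> 60%
print(confidence({"Coke"}, {"Tea"}))  # 3/3 = 1.0  -> 100%
print(confidence({"Tea"}, {"Coke"}))  # 3/4 = 0.75 -> 75%
```

Both rules share the same support (the fraction of transactions containing {Coke, Tea}) but differ in confidence because their antecedents have different supports.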
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
–A two-pass approach called a-priori limits the need for main
memory
–Key idea: if a set of items appears at least s times, so does
every subset.
• Contrapositive for pairs: if item i does not appear in s baskets,
then no pair including i can appear in s baskets.
• FPGrowth: A Frequent Pattern-Growth Approach
–Mining Frequent Patterns Without Candidate Generation
–Uses the Apriori Pruning Principle
–Scan DB only twice!
• Once to find frequent 1-itemset (single item pattern)
• The other to construct the FP-tree, the data structure of
FPGrowth
A-Priori Algorithm
• Apriori is a seminal algorithm proposed by R.
Agrawal and R. Srikant in 1994 for mining frequent
itemsets for Boolean association rules
– The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset
properties
• Apriori employs an iterative approach known as a
level-wise search, where k-itemsets are used to
explore (k + 1)-itemsets
– First, the set of frequent 1-itemsets is found by scanning
the database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The
resulting set is denoted L1
A-Priori Algorithm Con’t….
– Next, L1 is used to find L2 , the set of frequent 2-itemsets,
which is used to find L3 , and so on, until no more frequent
k-itemsets can be found
– The finding of each Lk requires one full scan of the
database
• To improve the efficiency of the level-wise
generation of frequent itemsets, an important property
called the Apriori property is used to reduce the
search space
• Apriori property: All nonempty subsets of a frequent
itemset must also be frequent
– E.g., if {Coke, Tea, Nuts} is frequent, so is {Coke, Tea};
i.e., every transaction having {Coke, Tea, Nuts} also
contains {Coke, Tea}
A-Priori Algorithm Con’t…
• Pass 1: Read baskets and count in main memory the
occurrences of each item
– Requires only memory proportional to the number of items
• Pass 2: Read baskets again and count in main memory only
those pairs both of whose items were found to be frequent in
Pass 1
– Requires memory proportional only to the square of the
number of frequent items

[Figure: main-memory layout — Pass 1 holds item counts and the
resulting frequent items; Pass 2 holds counts of candidate pairs]
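The two passes can be sketched as follows, reusing the eight-basket example from earlier; Pass 2 counts only pairs both of whose members survived Pass 1:

```python
from collections import Counter
from itertools import combinations

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
s = 4

# Pass 1: one counter per distinct item
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {i for i, n in item_counts.items() if n >= s}

# Pass 2: count only candidate pairs drawn from the frequent items
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket & frequent_items), 2):
        pair_counts[pair] += 1
frequent_pairs = {p for p, n in pair_counts.items() if n >= s}
print(frequent_pairs)  # the frequent pairs ('b', 'c') and ('b', 'm')
```

Because p is infrequent, no pair containing it is ever counted in Pass 2, which is exactly the memory saving the slide describes.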
Apriori: A Candidate Generation & Test
Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length-(k+1) candidate itemsets from length-k
frequent itemsets. For each k, we construct two sets of
k-tuples:
• Ck = candidate k-tuples = those that might be frequent sets
(support ≥ s) based on information from the pass for k−1
• Lk = the set of truly frequent k-tuples
– Test the candidates against the DB
– Terminate when no frequent or candidate set can be
generated
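The method above can be sketched end to end. This is a minimal illustration, not a tuned implementation: candidate generation joins Lk−1 with itself, and the pruning step applies the Apriori property before any database counting:

```python
from itertools import combinations

def apriori(db, min_count):
    """Sketch of level-wise frequent-itemset mining with Apriori pruning."""
    db = [frozenset(t) for t in db]
    # L1: frequent 1-itemsets from one DB scan
    items = {i for t in db for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= min_count}
    frequent, k = set(Lk), 2
    while Lk:
        # Generate candidate k-itemsets by joining L(k-1) with itself ...
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # ... and prune any candidate with an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(sub) in Lk for sub in combinations(c, k - 1))}
        # Test the surviving candidates against the database
        Lk = {c for c in Ck if sum(1 for t in db if c <= t) >= min_count}
        frequent |= Lk
        k += 1
    return frequent
```

On the eight-basket example used earlier, `apriori(baskets, 4)` returns the six itemsets {m}, {c}, {b}, {j}, {m, b}, {b, c}; the candidate {m, b, c} is pruned at k = 3 because its subset {m, c} is infrequent.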
A-Priori for All Frequent Itemsets
• C1 = all items; L1 = those items counted as frequent on the
first pass; C2 = pairs with both members in L1; in general,
Ck = k-tuples each (k−1)-subset of which is in Lk−1; Lk = the
members of Ck with support ≥ s

[Figure: pipeline — count all the items → frequent items (L1) →
count all pairs of items from L1 → frequent pairs (L2) → …]

Example (1st scan, minimum support count = 2):

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 counts: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3 ({D} is infrequent and dropped)
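The counts in this example can be verified by brute force; this enumerates every candidate itemset level by level rather than following the Apriori passes, which is fine at this tiny scale:

```python
from itertools import combinations

db = [{"A", "C", "D"},       # Tid 10
      {"B", "C", "E"},       # Tid 20
      {"A", "B", "C", "E"},  # Tid 30
      {"B", "E"}]            # Tid 40
min_count = 2
items = sorted({i for t in db for i in t})

frequent = {}
for k in range(1, len(items) + 1):
    level = {}
    for cand in combinations(items, k):
        n = sum(1 for t in db if set(cand) <= t)
        if n >= min_count:
            level[cand] = n
    if not level:  # no frequent k-itemsets -> none larger can exist
        break
    frequent.update(level)
print(frequent)
```

The output reproduces L1 = {A}:2, {B}:3, {C}:3, {E}:3 and continues to L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2 and L3 = {B,C,E}:2.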
Benefits of FP-growth over Apriori