Association Rules (Market Basket Analysis)
Association rules:
Unsupervised learning
Used for pattern discovery
Each rule has the form A -> B, or Left -> Right
For example: 70% of customers who purchase 2% milk will also purchase whole wheat bread.
Data mining using association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e. most frequent combinations of items)
Most frequently used algorithm: Apriori algorithm.
2. Generate association rules for the above itemsets.
Support/confidence
Support shows the frequency of the pattern in the rule; it is the percentage of transactions that contain both A and B, i.e.
Support(A -> B) = (number of transactions containing both A and B) / (total number of transactions)
Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B if they contain A, i.e.
Confidence(A -> B) = Support(A U B) / Support(A) = (number of transactions containing both A and B) / (number of transactions containing A)
Example:

Customer   Item purchased   Item purchased
1          pizza            beer
2          salad            soda
3          pizza            soda
4          salad            tea

For the rule pizza -> beer: support = 1/4 = 25% (one of the four transactions contains both items), and confidence = 1/2 = 50% (one of the two transactions containing pizza also contains beer).
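A minimal Python sketch that computes these two measures for the table above (the function and variable names are illustrative):

transactions = [
    {"pizza", "beer"},
    {"salad", "soda"},
    {"pizza", "soda"},
    {"salad", "tea"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B):
    # Support of the combined itemset relative to the antecedent's support.
    return support(A | B) / support(A)

print(support({"pizza", "beer"}))        # 0.25 -> 25%
print(confidence({"pizza"}, {"beer"}))   # 0.5  -> 50%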
Confidence does not measure whether the association between A and B is random or not.
For example, if milk occurs in 30% of all baskets, information that milk occurs in 30% of all
baskets with bread is useless. But if milk is present in 50% of all baskets that contain coffee, that is
significant information.
Support allows us to weed out the most infrequent combinations, but sometimes we should not ignore them: for example, when the transaction is valuable and generates large revenue, or when the products repel each other.
Dependence framework
If items are statistically dependent, the presence of one of the items in the basket gives us a lot of
information about the other items. How to determine the threshold of statistical dependence? Use:
Chi-square
Impact
Lift
Chi_square = sum over all cells of (ActualCooccurrence - ExpectedCooccurrence)^2 / ExpectedCooccurrence
Compare the statistic against a chi-square table: pick a small alpha (e.g. 5% or 10%). The number of degrees of freedom equals the number of items minus 1.
Impact = ActualCooccurrence/ExpectedCooccurrence
-1 <= Lift <= 1
Lift is similar to correlation: it is 0 if A and B are independent, and +1 or -1 if they are dependent.
+1 indicates attraction, and -1 indicates repulsion.
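A small sketch of the first two measures for a pair of items, following the formulas above; only the co-occurrence cell of the chi-square sum is computed, and the basket counts are illustrative:

def dependence_measures(count_a, count_b, count_ab, n):
    """Expected vs. actual co-occurrence of items A and B over n baskets."""
    expected = n * (count_a / n) * (count_b / n)        # what independence predicts
    impact = count_ab / expected                        # ActualCooccurrence / ExpectedCooccurrence
    chi_term = (count_ab - expected) ** 2 / expected    # one cell of the chi-square sum
    return expected, impact, chi_term

# Illustrative counts: A in 30 of 100 baskets, B in 20, both together in 10.
print(dependence_measures(30, 20, 10, 100))  # expected 6.0, impact ~1.67, chi term ~2.67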
Why do two items repel or attract? Are they substitutes? Are they complementary, with a third product needed? Or do they address different market segments?
The product triangulation strategy examines cross-purchase skews to answer these questions. If the most significant skew occurs when triangulating with respect to promotion or pricing, the products are substitutes.
Example: orange juice and soda repel each other (so are they substitutes?). They each exhibit a different profile when compared with whole wheat bread and potato chips, so they are not substitutes; they address two different market segments.
Definitions:
Itemset: a set of items
k-itemset: an itemset which consists of k items
Frequent itemset (i.e. large itemset): an itemset with sufficient support
Lk or Fk: a set of large (frequent) k-itemsets
Ck: a set of candidate k-itemsets
Apriori property: if an itemset X is joined with an itemset Y,
Support(X U Y) <= min(Support(X), Support(Y))
Negative border: an itemset is in the negative border if it is infrequent but all of its subsets among the candidate itemsets are frequent.
Interesting rules: strong rules for which antecedent and consequent are dependent
Apriori algorithm:
The set Fk (also written Lk) is defined as the set containing the frequent k-itemsets, i.e. those that satisfy support >= min_support.

F1 = {frequent 1-itemsets}
k = 2
while Fk-1 is not empty {
    Ck = AprioriGeneration(Fk-1)
    for each transaction t in the database {
        increment the count of every candidate c in Ck contained in t
    }
    Fk = {c in Ck such that count(c) >= min_support}
    k++
}
F = union of Fk for all k >= 1
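A minimal Python sketch of the algorithm above, assuming transactions are given as sets and min_support is an absolute count (all names are illustrative):

from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its count."""
    # F1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Ck = AprioriGeneration(F(k-1)): self-join, then prune.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                        frozenset(sub) in frequent
                        for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # One scan over the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

The pruning inside candidate generation uses the Apriori property: a k-itemset can only be frequent if every one of its (k-1)-subsets is frequent.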
K=3:
C3 = AprioriGeneration(F2):
Insert into C3: 2 3 4, 2 3 5, 2 4 5
Delete from C3: 2 3 5 (because 3 5 is not in F2)
F3 = {2 3 4} (because 2 4 5 shows up only once)
K=4:
C4 = AprioriGeneration(F3)
Insert into C4: none
Since we cannot generate any more candidate sets by self-joining, the algorithm stops here. The frequent itemsets are F1, F2, and F3. The negative border contains all pairs deleted from C2, plus 2 3 5.
For all pairs of frequent itemsets (assume we call them A and B) such that A U B is also frequent,
calculate c, the confidence of the rule:
c = support(A U B) / support(A).
Example: continuing the previous example: we can generate the rules involving any combination
of:
1, 2, 3, 4, 5, 1 2, 2 3, 2 4, 2 5, 3 4, 4 5, 2 3 4.
For example, rule 1 2 -> 2 5 is not a strong rule, because 1 2 5 is not a frequent itemset.
Rule 2 3 -> 4 could be a strong rule, because 2 3 4 and 2 3 are both frequent itemsets, and c = support(2 3 4) / support(2 3) = 2/4 = 50%.
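A matching sketch of this rule-generation step, reusing the dictionary returned by the apriori() sketch above (again, names are illustrative; the sketch generates rules with disjoint antecedent and consequent):

from itertools import combinations

def generate_rules(all_frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for all strong rules."""
    rules = []
    for itemset, count in all_frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent (Apriori property),
                # so the antecedent's count is always available.
                conf = count / all_frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules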
Applications
A: sales of item A
B: sales of item B
The next step: sequential pattern discovery (i.e. association rules in time). For example:
college_degree -> professional_job -> high_salary.
Example: http://www.icaen.uiowa.edu/~comp/Public/Apriori.pdf
Assume min_support = 40% = 2/5, min_confidence = 70%. Five transactions are recorded in a
supermarket:
#   Transaction                                   Code
1   Beer, diaper, baby powder, bread, umbrella    BDPRU
2   Diaper, baby powder                           DP
3   Beer, diaper, milk                            BDM
4   Diaper, beer, detergent                       DBG
5   Beer, milk, cola                              BMC
Rule     Support(A U B)                    Confidence    Strong (>= 70%)?
P -> B   1/5 (infrequent pair, discarded)  -             -
P -> D   2/5                               2/2 = 100%    yes
D -> P   2/5                               2/4 = 50%     no
B -> D   3/5                               3/4 = 75%     yes
D -> B   3/5                               3/4 = 75%     yes
B -> M   2/5                               2/4 = 50%     no
M -> B   2/5                               2/2 = 100%    yes
Interesting! What the rules are saying is that it is very likely that a customer who buys diapers or
milk will also buy beer. Does that rule make sense?
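As a sanity check, the two sketches above reproduce this table when run on the five transactions (a usage example under the same assumptions):

transactions = [
    {"B", "D", "P", "R", "U"},   # beer, diaper, baby powder, bread, umbrella
    {"D", "P"},
    {"B", "D", "M"},
    {"D", "B", "G"},
    {"B", "M", "C"},
]
freq = apriori(transactions, min_support=2)   # 40% of 5 transactions
for antecedent, consequent, conf in generate_rules(freq, min_confidence=0.7):
    print(set(antecedent), "->", set(consequent), f"{conf:.0%}")
# Prints the strong rules: B -> D (75%), D -> B (75%), M -> B (100%), P -> D (100%).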
Example p.170:
Assume min_support = 0.4, min_confidence = 0.6. Contingency table for 5000 high-school students (the totals were shown in italics in the original):

                          Eat cereal
                          Yes     No      Total
Play basketball   Yes     2000    1000    3000
                  No      1750    250     2000
Total                     3750    1250    5000

The rule play_basketball -> eat_cereal has support 2000/5000 = 40% and confidence 2000/3000 = 66.7%, so it passes both thresholds. It is nevertheless misleading: 3750/5000 = 75% of all students eat cereal, so playing basketball actually lowers the chance of eating cereal.
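A short sketch that applies the chi-square formula from the dependence framework to this table (pure Python; the numbers come directly from the table above):

observed = [[2000, 1000],   # play basketball: yes
            [1750, 250]]    # play basketball: no
n = 5000
row_totals = [sum(row) for row in observed]          # [3000, 2000]
col_totals = [sum(col) for col in zip(*observed)]    # [3750, 1250]

chi_square = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n   # expected under independence
        chi_square += (observed[i][j] - expected) ** 2 / expected
print(chi_square)  # about 277.8 -> the two attributes are strongly dependent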
Improving the efficiency of Apriori:
Group items into higher conceptual groups, e.g. white and brown bread become bread.
Reduce the number of scans of the entire database (Apriori needs n+1 scans, where n is the length of the longest pattern):
o Partition-based Apriori
o Sampling: take a subset of the database, generate candidate frequent itemsets, then confirm the hypothesis on the entire database.
FP-tree (frequent-pattern tree): used to find the frequent itemsets using only two scans of the database.
Algorithm:
1. Scan the database and find all items with frequency greater than or equal to a threshold T
2. Order the frequent items in decreasing order of frequency
3. Construct a tree which has only the root
4. Scan database again; for each sample:
a. add the items from the sample to the existing tree, using only the frequent items (i.e.
items discovered in step 1.)
b. repeat a. until all samples have been processed
5. Enumerate all frequent itemsets by examining the tree: the frequent itemsets are present in
those paths for which every node has frequency >= T.
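A compact Python sketch of steps 1-4 (tree construction only; mining the tree in step 5 is a separate pass). Class and function names are illustrative:

from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, threshold):
    # Step 1: one scan to find the frequent items.
    freq = Counter(item for t in transactions for item in t)
    freq = {item: c for item, c in freq.items() if c >= threshold}
    # Step 2: fix a frequency-descending order over the frequent items.
    order = sorted(freq, key=freq.get, reverse=True)
    # Step 3: start from a tree that has only the root.
    root = Node(None)
    # Step 4: second scan; insert each transaction's frequent items, in order.
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root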
Example p.173: Assume threshold T = 3.
Steps 3 and 4: Construct the tree by inserting the ordered frequent items of each transaction (the last column of the original table), one transaction at a time. The original figure shows five snapshots of the growing tree, one per transaction; the final tree, with each node written as item:count, is:

root
+-- f:4
|   +-- c:3
|   |   +-- a:3
|   |       +-- m:2
|   |       |   +-- p:2
|   |       +-- b:1
|   |           +-- m:1
|   +-- b:1
+-- c:1
    +-- b:1
        +-- p:1
Step 5: The frequent itemsets are contained in those paths, starting from the root, which have
frequency >= T. Therefore, at each branching, check whether any item from the branches can be added to the
frequent path, i.e. whether the totals over the branches give frequency >= T for that item.
HITS algorithm
Searches for authorities and hubs.
Algorithm:
Use search engines to search for a given term and collect a root set of pages
Expand the root set by including all the pages that the root set links to, up to a cutoff (e.g.
1000-5000 pages, links included). Assume that we now have n pages in total.
Construct the adjacency matrix A such that aij = 1 if page i links to page j, and aij = 0 otherwise.
Associate an authority weight ap and a hub weight hp with each page, and set them initially to a uniform constant.
a = Transpose{a1, a2, ..., an}
h = Transpose{h1, h2, ..., hn}
Then update a and h iteratively:
a = Transpose(A)*h = (Transpose(A)*A)*a
h = A*a = (A * Transpose(A))*h
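A small numpy sketch of the iteration; the normalization after each update is an assumption needed to keep the weights bounded (standard for HITS, though not written out above):

import numpy as np

def hits(A, iterations=50):
    """A is the n x n adjacency matrix: A[i, j] = 1 if page i links to page j."""
    n = A.shape[0]
    a = np.full(n, 1.0 / n)   # authority weights, uniform at the start
    h = np.full(n, 1.0 / n)   # hub weights, uniform at the start
    for _ in range(iterations):
        a = A.T @ h           # good authorities are pointed to by good hubs
        h = A @ a             # good hubs point to good authorities
        a = a / np.linalg.norm(a)
        h = h / np.linalg.norm(h)
    return a, h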
Example p.181:
Assume a graph with six pages, where page 1 links to pages 4, 5, and 6; page 2 links to 4 and 5; page 3 links to 5; and page 6 links to 3. The adjacency matrix is:

A = | 0 0 0 1 1 1 |
    | 0 0 0 1 1 0 |
    | 0 0 0 0 1 0 |
    | 0 0 0 0 0 0 |
    | 0 0 0 0 0 0 |
    | 0 0 1 0 0 0 |

Initially:
a = Transpose{.1, .1, .1, .1, .1, .1}
h = Transpose{.1, .1, .1, .1, .1, .1}
Seems like document 5 is the best authority and document 1 is the best hub.
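Running the hits() sketch above on this matrix confirms the claim (a usage example; the exact scores depend on the normalization):

import numpy as np

A = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
])
a, h = hits(A)
print(a.argmax() + 1)  # 5 -> page 5 is the best authority
print(h.argmax() + 1)  # 1 -> page 1 is the best hub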
LOGSOM Algorithm
For finding users' navigation behavior, i.e. which pages they visit the most.
For a given set of URLs urli, i = 1, ..., n, and a set of user transactions tj, j = 1, ..., m, assign 1 to entry (j, i) if
transaction j involved visiting urli. Make a table of all transactions (rows are transactions, columns are URLs):

      url1  url2  ...  urln
t1    ...
...
tm    0     0     ...  1
Use K-means clustering to group the users into k transaction groups, and then record the number of
hits of each group on each URL. For example:

         url1  url2  ...  urln
...
groupk   20    10    ...  0
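A minimal sketch of the grouping step using scikit-learn's KMeans (the visit matrix, the URL count, and the number of groups are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Rows = transactions, columns = URLs; 1 means the page was visited.
visits = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
])
k = 2
labels = KMeans(n_clusters=k, n_init=10).fit_predict(visits)

# Record the number of hits of each group on each URL.
for g in range(k):
    print(f"group{g}:", visits[labels == g].sum(axis=0))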
Example p.186
Path = A B C D C B E G H G W A O U O V; assume a threshold of 40% (2/5).
Breaking the path at each backward move gives five maximal forward references: ABCD, ABEGH, ABEGW, AOU, AOV. Inserting them into a tree:

A
+-- B
|   +-- C
|   |   +-- D
|   +-- E
|       +-- G
|           +-- H
|           +-- W
+-- O
    +-- U
    +-- V
Text Mining
Document = a vector of tokens (each word is a token)
Calculate the Hamming distance between the token vectors of two documents to measure how different they are
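A tiny sketch of that distance computation, assuming each document is encoded as a binary vector over a shared vocabulary (the vocabulary and vectors are illustrative):

# Vocabulary positions: ["data", "mining", "rules", "text"]
doc1 = [1, 1, 1, 0]   # contains "data", "mining", "rules"
doc2 = [1, 1, 0, 1]   # contains "data", "mining", "text"

# Hamming distance = number of positions at which the two vectors differ.
distance = sum(x != y for x, y in zip(doc1, doc2))
print(distance)  # 2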