Professional Documents
Culture Documents
Many business enterprises accumulate large quantities of data from their day to
day operations. For example, huge amounts of customer purchase data are collected daily
at the checkout counters of grocery stores. Table 6.1 illustrates an example of such data,
commonly known as market basket transactions. Each row in this table corresponds to a
transaction, which contains a unique identifier labeled TID and a set of items bought by a
given customer. Retailers are interested in analyzing the data to learn about the
purchasing behavior of their customers
Problem Definition
Binary Representation Market basket data can be represented in a binary format
as shown in Table 6.2, where each row corresponds to a transaction and each column
corresponds to an item. An item can be treated as a binary variable whose value is one if
the item is present in a transaction and zero otherwise. Because the presence of an item in
a transaction is often considered more important than its absence, an item is an
asymmetric binary variable.
This representation is perhaps a very simplistic view of real market basket data
because it ignores certain important aspects of the data such as the quantity of items sold
or the price paid to purchase them.
Itemset and Support Count:
In association analysis, a collection of zero or more items is termed an itemset. If
an itemset contains /c items, it is called a k-itemset. For instance, {Beer, Diapers, MiIk}
is an example of a 3-itemset. An important property of an itemset is its support count,
which refers to the number of transactions that contain a particular itemset.
Mathematically, the support count, (X), for an itemset X can be stated as follows:
Association Rule :
An association rule is an implication expression of the form X ->Y, where X and
Y are disjoint itemsets. The strength of an association rule can be measured in terms of its
support and confidence. Support determines how often a rule is applicable to a given data
set, while confidence determines how frequently items in Y appear inransactions that
contain X. The formal definitions of these metrics are
Figure 6.5 provides a high-level illustration of the frequent itemset generation part of the
Apriori algorithm for the transactions shown in 6.1. We assume that the support threshold
is60% , which is equivalent to a minimum support count equal to 3.
Confidence-Based Pruning
FP-growth Algorithm
• Use a compressed representation of the database using an FP-tree
TID Items
A:1
1 {A,B}
2 {B,C,D}
B:1
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C} After reading TID=2:
6 {A,B,C,D} null
7 {B,C}
8 {A,B,C} A:1 B:1
9 {A,B,D}
10 {B,C,E}
B:1 C:1
D:1
FP-Tree Construction
TID Items Transaction
1 {A,B} Database
2 {B,C,D} null
3 {A,C,D,E}
4 {A,D,E}
A:7 B:3
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
B:5 C:3
C:1 D:1
9 {A,B,D}
10 {B,C,E}
Header table D:1
C:3 E:1
Item Pointer D:1 E:1
A D:1
B E:1
D:1
C
Pointers are used to assist
D
frequent itemset generation
E
FP-growth
Conditional Pattern base
null
for D:
P = {(A:1,B:1,C:1),
A:7 B:1 (A:1,B:1),
(A:1,C:1),
(A:1),
B:5 C:1 (B:1,C:1)}
C:1 D:1 Recursively apply FP-
growth on P
C:3 D:1 Frequent Itemsets found
D:1 (with sup > 1):
D:1
AD, BD, CD, ACD, BCD
D:1
Tree Projection
Set enumeration tree: null
Possible Extension:
E(A) = {B,C,D,E} A B C D E
AB AC AD AE BC BD BE CD CE DE
Possible Extension:
E(ABC) = {D,E}
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCDE
• FP-growth will not generate any duplicate item sets. In addition, the counts
associated with the nodes allow the algorithm to perform support counting while
generating the common suffix item sets.
• FP-growth is an interesting algorithm because it illustrates how a compact
representation of the transaction data set helps to efficiently generate frequent
item sets.
• In addition, for certain transaction data sets, FP-growth outperforms the standard
Apriori, algorithm by several orders of magnitude.