
Data Structures

Notes for Lecture 15


Techniques of Data Mining
By
Samaher Hussein Ali
2007-2008

Association Rules: Park, Chen, and Yu (PCY) Algorithm


1. PCY Algorithm
Park, Chen, and Yu proposed using a hash table to determine, on the first pass (while L1 is being
determined), that many pairs cannot possibly be frequent. The technique takes advantage of the fact that
main memory is usually much bigger than the number of items, so the memory not needed for the item
counts can hold a large hash table. During the two passes to find L2, main memory is laid out as in Fig. 1.
• Hash-based improvement to A-Priori.
• During Pass 1 of A-Priori, most of main memory is idle (only the individual item counts need to be stored).
• Use that idle memory to keep counts of buckets into which pairs of items are hashed.
- Just the count, not the pairs themselves.
• This gives an extra condition that candidate pairs must satisfy on Pass 2.

Figure 1: Two passes of the PCY algorithm


2. Finding Association Rules
• A typical question is “find all association rules with support >= minsup and confidence >= minconf.”
• The hard part is finding the high-support itemsets.
– Once you have those, checking the confidence of association rules involving those sets is relatively
easy, as the small sketch below illustrates.
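For instance, the confidence of a rule i -> j is just support({i, j}) divided by support({i}). A one-line
check in Python, using counts taken from the worked example near the end of these notes ({m, b}
appears in 4 baskets, m in 5):

support_mb = 4                  # baskets containing both m and b
support_m = 5                   # baskets containing m
print(support_mb / support_m)   # conf(m -> b) = 0.8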
3. Hash Table
means a data structure that associates keys with values. The primary operation it supports efficiently is
lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone
number). It works by transforming the key with a hash function into a hash, a number that is used to
index into an array to locate the desired bucket where the value should be.
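A minimal Python sketch of such a lookup (illustrative only; the slot count and chaining scheme are
arbitrary choices, not part of these notes):

NUM_SLOTS = 8
table = [[] for _ in range(NUM_SLOTS)]      # each slot holds (key, value) pairs

def put(key, value):
    slot = table[hash(key) % NUM_SLOTS]     # hash the key to a slot index
    for n, (k, _) in enumerate(slot):
        if k == key:                        # key already present: replace value
            slot[n] = (key, value)
            return
    slot.append((key, value))

def get(key):
    for k, v in table[hash(key) % NUM_SLOTS]:
        if k == key:
            return v
    return None                             # key not present

put("Alice", "555-1234")
print(get("Alice"))                         # -> 555-1234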
4. Hash Function
is a reproducible method of turning some kind of data into a (relatively) small number that may serve as
a digital "fingerprint" of the data. The algorithm "chops and mixes" (i.e., substitutes or transposes) the
data to create such fingerprints, called hash values. These are commonly used as indices into hash tables
or hash files.
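In PCY, the data being fingerprinted is a pair of items. A sketch of such a hash function (the table size
is an illustrative assumption):

NUM_BUCKETS = 1000                          # illustrative table size

def hash_pair(i, j):
    # Sort so that {i, j} and {j, i} produce the same bucket number.
    return hash(tuple(sorted((i, j)))) % NUM_BUCKETS

print(hash_pair("milk", "juice"))           # some bucket number in [0, 999]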
5. Observations About Buckets
1. If a bucket contains a frequent pair, then the bucket is surely frequent.
• We cannot use the hash table to eliminate any pair that hashes to this bucket.
2. Even without any frequent pair, a bucket can be frequent.
• Again, no pair that hashes to this bucket can be eliminated.
3. But in the best case, the count for a bucket is less than minsup.
• Now, all pairs that hash to this bucket can be eliminated as candidates, even if the pair consists of
two frequent items.
6. PCY Algorithm (cont.)
Assume that the data is stored in a flat file, with records consisting of a transaction ID and a list of its items.

• PCY Pass 1:
(a) Count occurrences of all items.
(b) For each basket, consisting of items {i1, …, ik}, hash each pair of its items to a bucket of the hash
table, and increment the count of that bucket by 1.
(c) At the end of the pass, determine L1, the items with counts at least minsup.
(d) Also at the end, determine which buckets have counts at least minsup; these are the frequent buckets.
• Key point: a pair (i, j) cannot be frequent unless it hashes to a frequent bucket, so pairs that hash to
other buckets need not be candidates in C2.
(e) Replace the hash table by a bitmap, with one bit per bucket: 1 if the bucket was frequent (count
>= minsup), 0 if not.
PCY Algorithm ---Pass 1
FOR (each basket) {
    FOR (each item)
        add 1 to the item’s count
    FOR (each pair of items) {
        hash the pair to a bucket
        add 1 to the count for that bucket
    }
}
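The same pass as a runnable Python sketch. The basket data is the worked example from later in these
notes; the bucket count and minsup are illustrative assumptions:

from collections import defaultdict
from itertools import combinations

NUM_BUCKETS = 100                           # illustrative; real tables are far larger
minsup = 3

baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","p","b","j"}, {"c","b","j"}, {"b","p"}]

def hash_pair(i, j):
    # Sort so {i, j} and {j, i} land in the same bucket.
    return hash(tuple(sorted((i, j)))) % NUM_BUCKETS

item_counts = defaultdict(int)
bucket_counts = [0] * NUM_BUCKETS

# Pass 1: count every item, and hash every pair in every basket to a bucket.
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for i, j in combinations(sorted(basket), 2):
        bucket_counts[hash_pair(i, j)] += 1

L1 = {item for item, c in item_counts.items() if c >= minsup}
print(L1)                                   # the frequent items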
PCY Algorithm ---Between Passes
(a) Replace the bucket counts by a bit-vector:
• 0 means the bucket did not have a count of at least minsup; 1 means it did.
(b) Note that a (say) 32-bit integer count is replaced by 1 bit, so the bit-vector requires little second-pass space.
(c) Also, decide which items are frequent and list them for the second pass.
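Continuing the Pass-1 sketch above, the between-passes compression can look like this (the bit-packing
layout is one reasonable choice, not prescribed by these notes):

# Compress bucket counts to a bitmap: bit k is 1 iff bucket k reached minsup.
bitmap = bytearray((NUM_BUCKETS + 7) // 8)
for k, count in enumerate(bucket_counts):
    if count >= minsup:
        bitmap[k // 8] |= 1 << (k % 8)

def bucket_is_frequent(k):
    return bool(bitmap[k // 8] & (1 << (k % 8)))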
• PCY Pass 2:
(a) Main memory holds a list of all the frequent items, i.e. L1.
(b) Main memory also holds the bitmap summarizing the results of the hashing from pass 1.
• Key point: The buckets must use 16 or 32 bits for a count, but each is compressed to 1 bit. Thus,
even if the hash table occupied almost the entire main memory on pass 1, its bitmap occupies no
more than 1/16 of main memory on pass 2.
(c) Finally, main memory also holds a table with all the candidate pairs and their counts. A pair (i, j) can
be a candidate in C2 only if all of the following are true:
• i is in L1.
• j is in L1.
• (i, j) hashes to a frequent bucket.
It is this last condition that distinguishes PCY from straight A-Priori and reduces the memory
requirements in pass 2.
(d) During pass 2, we consider each basket, and each pair of its items, making the test outlined above. If
a pair meets all three conditions, add to its count in memory, or create an entry for it if one does not yet
exist.
PCY Algorithm ---Pass 2
• Count all pairs {i, j} that meet the conditions:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the bit-vector is 1.
• Notice that both conditions are necessary for the pair to have a chance of being frequent.
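A matching Python sketch of Pass 2, continuing the variables from the sketches above:

# Pass 2: count only pairs of frequent items that hash to a frequent bucket.
pair_counts = defaultdict(int)
for basket in baskets:
    for i, j in combinations(sorted(basket), 2):
        if i in L1 and j in L1 and bucket_is_frequent(hash_pair(i, j)):
            pair_counts[(i, j)] += 1

L2 = {pair for pair, c in pair_counts.items() if c >= minsup}
print(L2)   # for this data: {('b', 'm'), ('b', 'p'), ('m', 'p')}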

When does PCY beat A-Priori?
When there are too many pairs of items from L1 to fit a table of candidate pairs and their counts in main
memory, yet the number of frequent buckets in the PCY algorithm is sufficiently small that it reduces
the size of C2 below what can fit in memory (even with 1/16 of memory given over to the bitmap).
When will most of the buckets be infrequent in PCY?
When there are a few frequent pairs, but most pairs are so infrequent that, even when the counts of all the
pairs that hash to a given bucket are added, they are still unlikely to sum to minsup or more.
Example:
• Items = {milk, coke, pepsi, beer, juice}.
• minsup = 3 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, p, b, j}
B7 = {c, b, j} B8 = {b, p}
• Frequent itemsets: {m}, {c}, {b}, {p}, {j}, {m, b}, {m, p}, {b, p}.

Matrix Representation
• Columns = items.
• Rows = baskets.
• Entry (r, c) = 1 if item c is in basket r; = 0 if not.
• Assume the matrix is almost all 0’s.
In Matrix Form
            m  c  p  b  j
{m,c,b}     1  1  0  1  0
{m,p,j}     1  0  1  0  1
{m,b}       1  0  0  1  0
{c,j}       0  1  0  0  1
{m,p,b}     1  0  1  1  0
{m,p,b,j}   1  0  1  1  1
{c,b,j}     0  1  0  1  1
{b,p}       0  0  1  1  0
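Building this matrix takes one line in Python (reusing the baskets list from the Pass-1 sketch; column
order follows the table above):

items = ["m", "c", "p", "b", "j"]
matrix = [[1 if item in basket else 0 for item in items] for basket in baskets]
for row in matrix:
    print(row)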
Similarity of Columns
• Think of a column as the set of rows in which it has a 1.
• The similarity of columns C1 and C2, sim(C1, C2), is the ratio of the size of the intersection to the
size of the union of C1 and C2 (the Jaccard similarity).
• Our goal of finding correlated columns becomes that of finding similar columns, as sketched below.
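A sketch of this similarity measure, computed directly from the matrix built above:

def column_similarity(matrix, c1, c2):
    # A column is viewed as the set of row numbers where it has a 1.
    rows1 = {r for r, row in enumerate(matrix) if row[c1] == 1}
    rows2 = {r for r, row in enumerate(matrix) if row[c2] == 1}
    return len(rows1 & rows2) / len(rows1 | rows2)

# Columns: 0 = m, 1 = c, 2 = p, 3 = b, 4 = j
print(column_similarity(matrix, 0, 3))      # sim(m, b) = 4/7 ≈ 0.57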
