
Mining Frequent Patterns, Associations and Correlations

Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1].

• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies

• Proximity refers to a similarity or dissimilarity


Similarity vs. Dissimilarity: Simple Attributes
Suppose p and q are the attribute values for two data objects.

Figure 1: Similarity and dissimilarity for two simple attributes
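The figure itself did not survive extraction. As a hedged sketch of the kind of per-attribute definitions such a figure typically lists (an assumption, since the original table is not recoverable), for a single nominal attribute one common choice is:

```latex
s = \begin{cases} 1 & \text{if } p = q \\ 0 & \text{if } p \ne q \end{cases}
\qquad
d = \begin{cases} 0 & \text{if } p = q \\ 1 & \text{if } p \ne q \end{cases}
```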


Euclidean Distance

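The formula on this slide is an image that did not survive extraction; the standard Euclidean distance it refers to is:

```latex
d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
```

where n is the number of dimensions (attributes) and p_k, q_k are the k-th attributes (components) of data objects p and q.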
Euclidean Distance: Example

[Figure: example points and the corresponding pairwise distance matrix]
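A minimal Python sketch of building such a distance matrix; the 2-D points below are hypothetical stand-ins for the ones shown on the slide:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-D points standing in for the slide's example.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Pairwise (symmetric) distance matrix, one row per point.
for name_i, pi in points.items():
    row = [round(euclidean(pi, pj), 3) for pj in points.values()]
    print(name_i, row)
```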
Minkowski Distance

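The formula on this slide is also an image; the standard Minkowski distance it refers to is:

```latex
d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r}
```

where r is a parameter, n is the number of dimensions, and p_k, q_k are the k-th attributes of p and q.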
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which
is just the number of bits that are different between two
binary vectors

• r = 2. Euclidean distance

• r → ∞. “supremum” (Lmax norm, L∞ norm) distance.


– This is the maximum difference between any component of
the vectors

• Do not confuse r with n, i.e., all these distances are defined for
all numbers of dimensions.
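A minimal Python sketch of these three special cases; the two example points are hypothetical (not from the slides):

```python
def minkowski(p, q, r):
    """Minkowski distance of order r: r=1 is city block (L1), r=2 is Euclidean (L2)."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

def supremum(p, q):
    """The r -> infinity limit: the maximum difference over all components (Lmax)."""
    return max(abs(a - b) for a, b in zip(p, q))

# Hypothetical 2-D points.
p, q = (0, 2), (3, 1)
print(minkowski(p, q, 1))   # 4.0     = |0-3| + |2-1|
print(minkowski(p, q, 2))   # ~3.162  = sqrt(9 + 1)
print(supremum(p, q))       # 3       = max(3, 1)
```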
Minkowski Distance: Examples

Common Properties of a Distance
• Distances, such as the Euclidean distance,
have some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points
(data objects), p and q.

• A distance that satisfies these properties is a metric.

Common Properties of a Similarity
• Similarities also have some well known
properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects), p and q.

Similarity Between Binary Vectors

• A common situation is that objects p and q have only binary attributes.
• Compute similarities using the following quantities

M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
  = (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
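A minimal Python sketch that reproduces the SMC and Jaccard values above (the function name is illustrative):

```python
def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for two binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 1))
    m10 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 0))
    m00 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 0))
    m11 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 1))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))   # (0.7, 0.0)
```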

Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

• Other vector objects: gene features in micro-arrays, …


• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors),
then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.

Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 ≈ 2.449

cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.315
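A minimal Python sketch reproducing this calculation (the function name is illustrative):

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    len1 = math.sqrt(sum(a * a for a in d1))
    len2 = math.sqrt(sum(b * b for b in d2))
    return dot / (len1 * len2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))   # 0.315
```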


Correlation
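The content of this slide did not survive extraction. Assuming it covered the standard Pearson correlation between two numeric vectors (an assumption, stated here only as a hedged sketch), the usual definition is:

```latex
\operatorname{corr}(p, q) =
  \frac{\sum_{k=1}^{n} (p_k - \bar{p})(q_k - \bar{q})}
       {\sqrt{\sum_{k=1}^{n} (p_k - \bar{p})^2}\,\sqrt{\sum_{k=1}^{n} (q_k - \bar{q})^2}}
```

Its value lies in [-1, 1]: 1 indicates perfect positive correlation, -1 perfect negative correlation, and 0 no linear relationship.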

What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set

• Motivation: Finding inherent regularities in data


– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?

• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Freq. Pattern Mining Important?
• Freq. pattern: An intrinsic and important property of datasets
• It is the foundation of many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications

Frequent patterns: subsequence
• A subsequence, such as buying first a PC, then a digital camera, and then a
memory card, is a (frequent) sequential pattern.

• A substructure can refer to different structural forms, such as subgraphs,
subtrees, or sublattices, which may be combined with itemsets or subsequences.

• Finding frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.

• Moreover, it helps in data classification, clustering, and other data mining tasks.
Frequent patterns
• Frequent pattern mining searches for recurring relationships in a given
data set.
• Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational data sets.
• The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business
decision-making processes such as:
catalog design,
cross-marketing, and
customer shopping behavior analysis.

• A typical example of frequent itemset mining is market basket analysis.


• Analyzes customer buying habits by finding associations between the
different items that customers place in their “shopping baskets.”
Example: Market basket analysis

Market basket analysis

• Market basket analysis may help you design different store layouts.
• In one strategy, items that are frequently purchased together can be
placed in proximity to further encourage the combined sale of such
items.
– e.g., the purchase of a computer is often followed by buying antivirus software or a printer

• Each basket can be represented by a Boolean vector of values, one per item,
indicating whether that item is present in the basket (a minimal encoding sketch
follows after this list).
• The Boolean vectors can be analyzed for buying patterns that reflect
items that are frequently associated or purchased together.
• These patterns can be represented in the form of association rules.
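A minimal Python sketch of the Boolean-vector encoding described above; the item names and baskets are hypothetical, not from the slides:

```python
# Hypothetical item universe and baskets.
items = ["computer", "antivirus_software", "printer", "camera"]

baskets = [
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"camera"},
]

# One Boolean value per item: 1 if the basket contains it, 0 otherwise.
vectors = [[int(item in basket) for item in items] for basket in baskets]
for v in vectors:
    print(v)
# [1, 1, 0, 0]
# [1, 0, 1, 0]
# [0, 0, 0, 1]
```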

Basic Concepts: Frequent Patterns
Tid    Items bought
10     Beer, Nuts, Diaper
20     Beer, Coffee, Diaper
30     Beer, Diaper, Eggs
40     Nuts, Eggs, Milk
50     Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold

[Figure: Venn diagram of customers buying beer, diapers, or both]
Association Rules

• Example of an association rule: customers who purchase computers tend to buy
antivirus software at the same time:

      computer ⇒ antivirus software [support = 2%, confidence = 60%]

• Rule support and confidence are two measures of rule interestingness.


• They respectively reflect the usefulness and certainty of discovered
rules.
• Support of 2% means that 2% of all the transactions under analysis
show that computer and antivirus software are purchased together.
• A confidence of 60% means that 60% of the customers who purchased
a computer also bought the software.
• Typically, association rules are considered interesting if they satisfy
both a minimum support threshold and a minimum confidence
threshold.
Association Rules: Example
Tid    Items bought
10     Beer, Nuts, Diaper
20     Beer, Coffee, Diaper
30     Beer, Diaper, Eggs
40     Nuts, Eggs, Milk
50     Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X ⇒ Y with minimum support and confidence
  – support, s: the probability that a transaction contains X ∪ Y
  – confidence, c: the conditional probability that a transaction having X also contains Y

• Let minsup = 50%, minconf = 50%
  Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
  Association rules:
    Beer ⇒ Diaper (60%, 100%)
    Diaper ⇒ Beer (60%, 75%)
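A minimal Python sketch that reproduces the support and confidence values above (function names are illustrative):

```python
# Transactions from the example above.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> Beer ⇒ Diaper (60%, 100%)
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> Diaper ⇒ Beer (60%, 75%)
```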
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset

• Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation
[Figure: the lattice of all itemsets over items A, B, C, D, E, from the null set at the top down to ABCDE at the bottom]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database

Transactions (N = number of transactions, w = maximum transaction width):

TID    Items
1      Bread, Milk
2      Bread, Diaper, Beer, Eggs
3      Milk, Diaper, Beer, Coke
4      Bread, Milk, Diaper, Beer
5      Bread, Milk, Diaper, Coke

– Match each transaction against every candidate in the list of M candidates
– Complexity ~ O(NMw) => expensive, since M = 2^d !!!
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1

If d = 6, R = 602 rules.
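A short Python check of this count (using math.comb; the closed form 3^d − 2^(d+1) + 1 gives the same value):

```python
from math import comb

def rule_count(d):
    """Count association rules over d items: choose k items for the antecedent,
    then any non-empty subset of the remaining d-k items for the consequent."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(rule_count(6))        # 602
print(3**6 - 2**7 + 1)      # 602, via the closed form 3^d - 2^(d+1) + 1
```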

Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)


– Complete search: M = 2^d
– Use pruning techniques to reduce M

• Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms

• Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100}
contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!

• Solution: Mine closed patterns and max-patterns instead


• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X
with the same support as X.

• An itemset X is a max-pattern if X is frequent and there exists no frequent
super-pattern Y ⊃ X.

• Closed patterns are a lossless compression of frequent patterns
  – reducing the number of patterns and rules (see the sketch below)
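A minimal Python sketch of the closed/maximal distinction on a hypothetical set of frequent itemsets with toy support counts (not from the slides):

```python
# Hypothetical frequent itemsets with consistent toy support counts.
freq = {
    frozenset("a"): 4, frozenset("b"): 4, frozenset("c"): 3,
    frozenset("ab"): 4, frozenset("ac"): 3, frozenset("bc"): 3,
    frozenset("abc"): 3,
}

def closed_patterns(freq):
    """Closed: frequent, and no frequent proper superset has the same support."""
    return {x for x, s in freq.items()
            if not any(x < y and freq[y] == s for y in freq)}

def max_patterns(freq):
    """Maximal: frequent, and no proper superset is frequent at all."""
    return {x for x in freq if not any(x < y for y in freq)}

print([sorted(x) for x in closed_patterns(freq)])  # [['a', 'b'], ['a', 'b', 'c']] (order may vary)
print([sorted(x) for x in max_patterns(freq)])     # [['a', 'b', 'c']]
```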
Closed Patterns and Max-Patterns
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent.

• If {beer, diaper, nuts} is frequent, so is {beer, diaper}


– i.e., every transaction having {beer, diaper, nuts} also contains {beer,
diaper}

• Scalable mining methods: Three major approaches


1. Apriori

2. Freq. pattern growth (FPgrowth)

3. Vertical data format approach


Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: If there is any itemset
which is infrequent, its superset should not be
generated/tested!
• Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from
length k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can
be generated
The Apriori Algorithm—An Example
Sup_min = 2

Database TDB:
Tid    Items
10     A, C, D
20     B, C, E
30     A, B, C, E
40     B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
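A minimal, runnable Python sketch of this level-wise loop, applied to the example database TDB from the earlier slide with min_sup = 2 (function and variable names are illustrative, not from the slides):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: returns {frequent itemset: support count}."""
    def counts(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Frequent 1-itemsets (L1).
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {c: s for c, s in counts(items).items() if s >= min_sup}
    result, k = dict(freq), 1
    while freq:
        # Self-join Lk to build (k+1)-item candidates, then prune any
        # candidate with an infrequent k-subset (Apriori pruning principle).
        joined = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        pruned = {c for c in joined
                  if all(frozenset(s) in freq for s in combinations(c, k))}
        freq = {c: s for c, s in counts(pruned).items() if s >= min_sup}
        result.update(freq)
        k += 1
    return result

# The example database TDB with min_sup = 2.
tdb = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
for itemset, sup in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Running it reproduces L1 = {A, B, C, E}, L2 = {AC, BC, BE, CE}, and L3 = {BCE} with the support counts shown in the example.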
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
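A minimal Python sketch of the self-join and prune steps, reproducing the example above (names are illustrative):

```python
from itertools import combinations

def gen_candidates(Lk, k):
    """Build C(k+1) from Lk: self-join itemsets sharing their first k-1 items,
    then prune candidates that have a k-subset not in Lk."""
    Lk = sorted(tuple(sorted(x)) for x in Lk)
    lk_set = set(Lk)
    joined = set()
    for a in Lk:
        for b in Lk:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:      # self-join step
                joined.add(a + (b[-1],))
    return {c for c in joined                           # prune step
            if all(s in lk_set for s in combinations(c, k))}

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(gen_candidates(L3, 3))   # {('a', 'b', 'c', 'd')} -- acde is pruned since ade is not in L3
```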
Further Improvement of the Apriori Method
• Major computational challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates

• Improving Apriori: general ideas


– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates

