
DATA WAREHOUSE & DATA MINING


FREQUENT PATTERN ANALYSIS

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? Pattern Evaluation Methods
• Summary

WHAT IS FREQUENT PATTERN ANALYSIS?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Motivation: finding inherent regularities in data
  - What products are often purchased together? Milk and diapers?!
  - What are the subsequent purchases after buying a PC?
  - Can we automatically classify web documents?
• Applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis

WHAT IS FREQUENT PATTERN ANALYSIS?
• Frequent pattern (set of items): a set of items that occurs frequently in a data set
  - Example: milk and bread
• Frequent pattern (subsequence): a subsequence that occurs frequently in a data set
  - Example: buying a PC first and then a digital camera
• Frequent pattern (substructure): a substructure that occurs frequently in a data set
  - Example: subgraphs

BASIC CONCEPTS: FREQUENT PATTERNS

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
  - E.g., a 2-itemset: X = {x1, x2}

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

THE PROBLEM
When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a distinctive list, depending on their needs and preferences. A housewife might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these buying patterns can help to increase sales in several ways. If a pair of items, X and Y, is frequently bought together:

• Both X and Y can be placed on the same shelf, so that buyers of one
item would be prompted to buy the other.
• Promotional discounts could be applied to just one out of the two
items.
• Advertisements on X could be targeted at buyers who purchase Y.
• X and Y could be combined into a new product, such as having Y in
flavors of X.
THE PROBLEM
• While we may know that certain items are frequently bought together, the question is: how do we uncover these associations?
• Besides increasing sales profits, association rules can also be used in other fields. In medical diagnosis, for instance, understanding which symptoms tend to be comorbid can help to improve patient care and medicine prescription.
ASSOCIATION RULES ANALYSIS: DEFINITION
• Association rules analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association.
• Measure 1: Support. This says how popular an itemset is, as measured by the proportion of transactions in which the itemset appears. In Table 1, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%.
ASSOCIATION RULES ANALYSIS: DEFINITION
• If you discover that sales of items beyond a certain proportion tend to have a significant impact on your profits, you might consider using that proportion as your support threshold. You may then identify itemsets with support values above this threshold as significant itemsets.
• Measure 2: Confidence. This says how likely item Y is to be purchased when item X is purchased, expressed as {X -> Y}. It is measured by the proportion of transactions containing item X in which item Y also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.
• One drawback of the confidence measure is that it might misrepresent the importance of an association, because it accounts only for how popular apples are, not beers. If beers are also very popular in general, a transaction containing apples is more likely to contain beers as well, which inflates the confidence measure. To account for the base popularity of both constituent items, we use a third measure called lift.
ASSOCIATION RULES ANALYSIS: DEFINITION
• Measure 3: Lift. This says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. In Table 1, the lift of {apple -> beer} is 1, which implies no association between the items. A lift value greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought.
BASIC CONCEPTS: FREQUENT PATTERNS

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• (absolute) support, or support count, of X: the frequency (number of occurrences) of an itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold
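
The two notions of support and the frequency test can be sketched in a few lines of Python. This is a minimal illustration, not a mining algorithm: the function names `support_count`, `relative_support`, and `is_frequent` are made up for this example, and the transactions are copied from the table on this slide.

```python
# Transactions copied from the slide's table (Tid -> items bought).
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset, db):
    """Absolute support: number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for bought in db.values() if items <= bought)

def relative_support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, db) / len(db)

def is_frequent(itemset, db, minsup):
    """An itemset is frequent if its relative support is no less than minsup."""
    return relative_support(itemset, db) >= minsup

print(support_count({"Beer", "Diaper"}, transactions))     # 3
print(relative_support({"Beer", "Diaper"}, transactions))  # 0.6
print(is_frequent({"Milk"}, transactions, minsup=0.5))     # False
```
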
BASIC CONCEPTS: ASSOCIATION RULES

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X -> Y with minimum support and confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y

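BASIC CONCEPTS: ASSOCIATION RULES

• Absolute support (X -> Y) = count(X ∪ Y), the number of transactions that contain every item of X and Y
• Relative support (X -> Y) = count(X ∪ Y) / N = P(X ∪ Y), where N is the total number of transactions
• Confidence (X -> Y) = P(Y | X) = Support(X ∪ Y) / Support(X)
• Lift (X -> Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
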
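
As a rough sketch, the measures defined above translate into Python as follows. The helper names (`count`, `support`, `confidence`, `lift`) are chosen for this illustration only, and the sample database is the five-transaction Beer/Diaper table used throughout these slides. Lift is computed from raw counts, which is algebraically the same as Support(X ∪ Y) / (Support(X) × Support(Y)) but limits floating-point rounding.

```python
db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def count(itemset, db):
    """Absolute support: transactions containing every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)

def support(x, y, db):
    """Relative support of the rule X -> Y."""
    return count(set(x) | set(y), db) / len(db)

def confidence(x, y, db):
    """P(Y | X). Raises ZeroDivisionError if X never occurs."""
    return count(set(x) | set(y), db) / count(x, db)

def lift(x, y, db):
    """Support(X u Y) / (Support(X) * Support(Y)), expressed with counts."""
    return count(set(x) | set(y), db) * len(db) / (count(x, db) * count(y, db))

print(support({"Diaper"}, {"Beer"}, db))     # 0.6
print(confidence({"Diaper"}, {"Beer"}, db))  # 0.75
print(lift({"Diaper"}, {"Beer"}, db))        # 1.25
```
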
BASIC CONCEPTS: ASSOCIATION RULES

Let minsup = 50%, minconf = 50%

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3

• Association rules:
  - Beer -> Diaper (60%, 100%)
  - Diaper -> Beer (60%, 75%)

Rule 1: Beer -> Diaper
Support = count(Beer ∪ Diaper) / N = 3/5 = 0.6 = 60%
Confidence = Support(Beer ∪ Diaper) / Support(Beer) = (3/5) / (3/5) = 1 = 100%

Rule 2: Diaper -> Beer
Support = count(Diaper ∪ Beer) / N = 3/5 = 0.6 = 60%
Confidence = Support(Diaper ∪ Beer) / Support(Diaper) = (3/5) / (4/5) = 3/4 = 0.75 = 75%
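
The frequent patterns and rules on this slide can be reproduced with a brute-force enumeration. The sketch below is only meant for tiny examples: it tests every possible itemset, which is exponential in the number of items, and it is not one of the mining methods (such as Apriori) covered later. The function names are illustrative.

```python
from itertools import combinations

db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def frequent_itemsets(minsup):
    """Map each frequent itemset to its support count (exhaustive search)."""
    items = sorted(set().union(*db))
    found = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            c = sum(1 for t in db if itemset <= t)
            if c / len(db) >= minsup:
                found[itemset] = c
    return found

def rules(minsup, minconf):
    """List (X, Y, support, confidence) for every rule meeting both thresholds."""
    freq = frequent_itemsets(minsup)
    out = []
    for itemset, c in freq.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so freq[lhs] exists.
                conf = c / freq[lhs]
                if conf >= minconf:
                    out.append((set(lhs), set(itemset - lhs), c / len(db), conf))
    return out

for x, y, s, c in rules(0.5, 0.5):
    print(x, "->", y, f"support={s:.0%}", f"confidence={c:.0%}")
# {'Beer'} -> {'Diaper'} support=60% confidence=100%
# {'Diaper'} -> {'Beer'} support=60% confidence=75%
```
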
EXAMPLE 1
TID Milk Bread Butter
1 1 1 0
2 0 1 1
3 0 0 0
4 1 1 1
5 0 1 0

Rule 1: {Milk, Bread} -> {Butter}

Support = count(X ∪ Y) / N = count({Milk, Bread, Butter}) / N = 1/5 = 0.2 = 20%
Confidence = Support(X ∪ Y) / Support(X) = (1/5) / (2/5) = 1/2 = 0.5 = 50%
Lift = Support(X ∪ Y) / (Support(X) × Support(Y)) = (1/5) / ((2/5) × (2/5)) = 1.25 > 1

So X and Y are positively correlated. Because the lift is greater than 1, Y (butter) is likely to be bought if X (milk, bread) is bought.
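
When the data arrives as a 0/1 matrix like the one in this example, the same three measures can be read straight off the columns. The following is a small sketch under that assumption; `columns`, `rows`, `count`, and `rule_metrics` are names invented for the illustration.

```python
columns = ["Milk", "Bread", "Butter"]
rows = [          # one tuple per TID, in the column order above
    (1, 1, 0),
    (0, 1, 1),
    (0, 0, 0),
    (1, 1, 1),
    (0, 1, 0),
]

def count(items):
    """Number of rows in which every listed item has a 1."""
    idx = [columns.index(name) for name in items]
    return sum(1 for row in rows if all(row[j] == 1 for j in idx))

def rule_metrics(x, y):
    """Support, confidence, and lift of the rule X -> Y (x, y are lists)."""
    n = len(rows)
    c_xy, c_x, c_y = count(x + y), count(x), count(y)
    return {
        "support": c_xy / n,
        "confidence": c_xy / c_x,
        "lift": c_xy * n / (c_x * c_y),
    }

print(rule_metrics(["Milk", "Bread"], ["Butter"]))
# {'support': 0.2, 'confidence': 0.5, 'lift': 1.25}
```
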
EXAMPLE
TID Bread Milk Diaper Beer Egg Cola
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1

Rule: {Milk, Diaper} -> {Beer}

Support = count(X ∪ Y) / N = count({Milk, Diaper, Beer}) / N = 2/5 = 0.4 = 40%
Confidence = Support(X ∪ Y) / Support(X) = (2/5) / (3/5) = 2/3 ≈ 0.667 = 66.7%
Lift = Support(X ∪ Y) / (Support(X) × Support(Y)) = (2/5) / ((3/5) × (3/5)) ≈ 1.11 > 1

So X and Y are positively correlated. Because the lift is greater than 1, Y (beer) is likely to be bought if X (milk, diaper) is bought.
EXAMPLE
TID Bread Milk Diaper Beer Egg Cola
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1

Rules: 1. {Beer} -> {Diaper}   2. {Diaper} -> {Beer}
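
One way to check the two rules against the table is to compute the measures with exact fractions. This is a sketch rather than an official solution; the helper names `count` and `metrics` are illustrative. It also shows that lift is symmetric, so both rules share the same support and lift and differ only in confidence.

```python
from fractions import Fraction

columns = ["Bread", "Milk", "Diaper", "Beer", "Egg", "Cola"]
rows = [          # one tuple per TID, in the column order above
    (1, 1, 0, 0, 0, 0),
    (1, 0, 1, 1, 1, 0),
    (0, 1, 1, 1, 0, 1),
    (1, 1, 1, 1, 0, 0),
    (1, 1, 1, 0, 0, 1),
]

def count(*items):
    """Number of rows in which every named item has a 1."""
    idx = [columns.index(name) for name in items]
    return sum(all(row[j] for j in idx) for row in rows)

def metrics(x, y):
    """Exact (support, confidence, lift) of the single-item rule {x} -> {y}."""
    n, both = len(rows), count(x, y)
    return (
        Fraction(both, n),
        Fraction(both, count(x)),
        Fraction(both * n, count(x) * count(y)),
    )

for x, y in [("Beer", "Diaper"), ("Diaper", "Beer")]:
    s, c, l = metrics(x, y)
    print(f"{{{x}}} -> {{{y}}}: support={s}, confidence={c}, lift={l}")
# {Beer} -> {Diaper}: support=3/5, confidence=1, lift=5/4
# {Diaper} -> {Beer}: support=3/5, confidence=3/4, lift=5/4
```
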


COMPUTATIONAL COMPLEXITY OF FREQUENT ITEMSET MINING

• How many itemsets may potentially be generated in the worst case?
  - The number of frequent itemsets to be generated is sensitive to the minsup threshold
  - When minsup is low, there can be an exponential number of frequent itemsets
  - The worst case: M^N, where M is the number of distinct items and N is the maximum length of a transaction
• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains
  C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
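
The count of sub-patterns can be confirmed numerically. The short check below sums the binomial coefficients and compares the result with 2^100 − 1; it assumes Python 3.8 or later for `math.comb`, and `subpattern_count` is a name made up for this sketch.

```python
from math import comb

def subpattern_count(n):
    """Number of non-empty sub-patterns of an n-item pattern."""
    return sum(comb(n, k) for k in range(1, n + 1))

total = subpattern_count(100)
print(total == 2**100 - 1)  # True
print(f"{total:.2e}")       # 1.27e+30
```
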
CLOSED PATTERNS AND MAX-PATTERNS

• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X

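
The two definitions can be tested directly on a small database. The sketch below reuses the five-transaction Beer/Diaper table from the earlier slides with a minimum support count of 3, which is this example's assumption. It enumerates all itemsets by brute force, so it only suits toy data, and the function names are illustrative.

```python
from itertools import combinations

db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def frequent_itemsets(db, min_count):
    """Map each frequent itemset to its support count (exhaustive search)."""
    items = sorted(set().union(*db))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            c = sum(1 for t in db if s <= t)
            if c >= min_count:
                freq[s] = c
    return freq

def closed_patterns(freq):
    # Closed: no proper superset has the same support. A superset with
    # equal support is itself frequent, so searching `freq` is enough.
    return {x for x, c in freq.items()
            if not any(x < y and freq[y] == c for y in freq)}

def max_patterns(freq):
    # Max-pattern: no proper superset is frequent at all.
    return {x for x in freq if not any(x < y for y in freq)}

freq = frequent_itemsets(db, min_count=3)
print(sorted(sorted(x) for x in closed_patterns(freq)))
# [['Beer', 'Diaper'], ['Diaper'], ['Eggs'], ['Nuts']]
print(sorted(sorted(x) for x in max_patterns(freq)))
# [['Beer', 'Diaper'], ['Eggs'], ['Nuts']]
```

Here {Beer} is frequent but not closed, because {Beer, Diaper} has the same support count of 3, while {Diaper} is closed but not maximal, because its frequent superset {Beer, Diaper} has a lower count.
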
EXAMPLE
Suppose a transactional database contains the itemsets below, each listed with its support count. Let the minimum support count be 2. Identify the closed and maximal patterns.

Itemset                     Support count  Closed  Maximal
{a1, a2, a3, a4}            1              No      No
{a2, a3, a4}                4              Yes     No
{a2, a3, a4, a5, a6}        2              Yes     Yes
{a2, a3, a4, a5, a6, a7}    1              No      No
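
The Closed and Maximal columns can be re-derived from the support counts alone. The sketch below assumes that the four listed itemsets are the only candidates, so an itemset's supersets are looked up among those four rows. `supports`, `MIN_COUNT`, and `classify` are illustrative names.

```python
# Support counts as given in the example.
supports = {
    frozenset({"a1", "a2", "a3", "a4"}): 1,
    frozenset({"a2", "a3", "a4"}): 4,
    frozenset({"a2", "a3", "a4", "a5", "a6"}): 2,
    frozenset({"a2", "a3", "a4", "a5", "a6", "a7"}): 1,
}
MIN_COUNT = 2

def classify(x):
    """Return (closed, maximal) for itemset x among the listed itemsets."""
    if supports[x] < MIN_COUNT:
        return (False, False)  # infrequent itemsets are neither
    supersets = [y for y in supports if x < y]
    closed = all(supports[y] != supports[x] for y in supersets)
    maximal = all(supports[y] < MIN_COUNT for y in supersets)
    return (closed, maximal)

for itemset in supports:
    print(sorted(itemset), classify(itemset))
```
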
CLOSED PATTERNS AND MAX-PATTERNS

• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
  - Min_sup = 1
• What is the set of closed itemsets?
  - <a1, …, a100>: 1
  - <a1, …, a50>: 2
• What is the set of max-patterns?
  - <a1, …, a100>: 1

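
Enumerating all 2^100 − 1 sub-patterns to verify this answer is infeasible, but the closed itemsets can be found another way. The sketch below relies on a standard property: an itemset is closed exactly when it equals the intersection of all transactions that contain it, so every closed itemset is an intersection of some group of transactions. The code enumerates groups of transactions, which is exponential in the number of transactions and therefore only practical for a tiny database like this one. All names are illustrative.

```python
from itertools import combinations

t1 = frozenset(f"a{i}" for i in range(1, 101))  # <a1, ..., a100>
t2 = frozenset(f"a{i}" for i in range(1, 51))   # <a1, ..., a50>
db = [t1, t2]

def closed_itemsets(db, min_sup):
    """Map each closed frequent itemset to its support count."""
    closed = {}
    for r in range(1, len(db) + 1):
        for group in combinations(db, r):
            common = frozenset.intersection(*group)
            if common:
                sup = sum(1 for t in db if common <= t)
                if sup >= min_sup:
                    closed[common] = sup
    return closed

def max_patterns(closed):
    # Every max-pattern is closed, so it suffices to keep the closed
    # itemsets that have no closed frequent proper superset.
    return [x for x in closed if not any(x < y for y in closed)]

closed = closed_itemsets(db, min_sup=1)
for itemset, sup in closed.items():
    print(len(itemset), "items, support", sup)
# 100 items, support 1
# 50 items, support 2
print([len(x) for x in max_patterns(closed)])  # [100]
```
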
