Frequent Pattern Analysis

DATA WAREHOUSE & DATA MINING

FREQUENT PATTERN ANALYSIS
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
WHAT IS FREQUENT PATTERN ANALYSIS?
 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
 Motivation: finding inherent regularities in data
 What products are often purchased together? Milk and diapers?!
 What are the subsequent purchases after buying a PC?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sales campaign analysis,
Web log (click stream) analysis
WHAT IS FREQUENT PATTERN ANALYSIS?
 Frequent pattern (itemset): a set of items that occurs frequently in a data set
 Example: milk and bread
 Frequent pattern (subsequence): a subsequence that occurs frequently in a data set
 Example: buying a PC first and then a digital camera
 Frequent pattern (substructure): a substructure that occurs frequently in a data set
 Example: sub-graphs
BASIC CONCEPTS: FREQUENT PATTERNS
 itemset: a set of one or more items
 k-itemset: X = {x1, …, xk}
 E.g. 2-itemset: X = {x1, x2}

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
THE PROBLEM
 When we go grocery shopping, we often have a standard list of
things to buy. Each shopper has a distinctive list, depending on
one’s needs and preferences. A housewife might buy healthy
ingredients for a family dinner, while a bachelor might buy beer
and chips. Understanding these buying patterns can help to
increase sales in several ways. If there is a pair of items, X and Y,
that are frequently bought together:
• Both X and Y can be placed on the same shelf, so that buyers of one
item would be prompted to buy the other.
• Promotional discounts could be applied to just one out of the two
items.
• Advertisements on X could be targeted at buyers who purchase Y.
• X and Y could be combined into a new product, such as having Y in
flavors of X.
THE PROBLEM
 While we may know that certain items are frequently bought together, the
question is, how do we uncover these associations?
 Besides increasing sales profits, association rules can also be used in
other fields. In medical diagnosis, for instance, understanding which
symptoms tend to be comorbid can help to improve patient care and
medicine prescription.
ASSOCIATION RULES ANALYSIS: DEFINITION
 Association rules analysis is a technique for uncovering how items are
associated with each other. There are three common ways to measure
association.
 Measure 1: Support. This says how popular an itemset is, as measured
by the proportion of transactions in which an itemset appears. In Table 1
below, the support of {apple} is 4 out of 8, or 50%. Itemsets can also
contain multiple items. For instance, the support of {apple, beer, rice} is 2
out of 8, or 25%.
DEFINITION
 If you discover that sales of items beyond a certain proportion tend to have a
significant impact on your profits, you might consider using that proportion as
your support threshold. You may then identify itemsets with support values above
this threshold as significant itemsets.
 Measure 2: Confidence. This says how likely item Y is to be purchased when item X is
purchased, expressed as {X -> Y}. It is measured by the proportion of transactions
containing item X in which item Y also appears. In Table 1, the confidence of {apple -> beer}
is 3 out of 4, or 75%.
 One drawback of the confidence measure is that it might misrepresent the
importance of an association, because it accounts only for how popular apples
are, not beers. If beers are also very popular in general, there will be a higher
chance that a transaction containing apples will also contain beers, thus inflating the
confidence measure. To account for the base popularity of both constituent items, we
use a third measure called lift.
DEFINITION
 Measure 3: Lift. This says how likely item Y is to be purchased when item X is
purchased, while controlling for how popular item Y is. In Table 1, the lift
of {apple -> beer} is 1, which implies no association between the items. A lift
value greater than 1 means that item Y is likely to be bought if item X is
bought, while a value less than 1 means that item Y is unlikely to be
bought if item X is bought.
BASIC CONCEPTS: FREQUENT PATTERNS
 (absolute) support, or support count, of X: the frequency (number of
occurrences) of an itemset X
 (relative) support, s: the fraction of transactions that contain X (i.e., the
probability that a transaction contains X)
 An itemset X is frequent if X’s support is no less than a minsup threshold

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
BASIC CONCEPTS: ASSOCIATION RULES
 Find all the rules X → Y with minimum support and confidence
 support, s: the probability that a transaction contains X ∪ Y
 confidence, c: the conditional probability that a transaction containing X
also contains Y

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
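 Absolute support (X → Y) = count(X ∪ Y), the number of transactions that contain every item of X and Y
 Relative support (X → Y) = count(X ∪ Y) / N = P(X ∪ Y), where N is the total number of transactions
 Confidence (X → Y) = P(Y | X) = Support(X ∪ Y) / Support(X)
 Lift (X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))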
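As a minimal sketch, the measures above can be written as small Python functions over the five-transaction "Tid / Items bought" table used in this deck. The function names are illustrative choices, not part of the slides.

```python
# Transactions copied from the "Tid / Items bought" table (Tid 10 to 50).
TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def support_count(itemset, transactions):
    """Absolute support: number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of X -> Y, computed from counts: count(X u Y) / count(X)."""
    both = support_count(set(x) | set(y), transactions)
    return both / support_count(x, transactions)


def lift(x, y, transactions):
    """Lift of X -> Y: confidence(X -> Y) divided by support(Y)."""
    return confidence(x, y, transactions) / support(y, transactions)


if __name__ == "__main__":
    print(support({"Beer", "Diaper"}, TRANSACTIONS))
    print(confidence({"Beer"}, {"Diaper"}, TRANSACTIONS))
    print(confidence({"Diaper"}, {"Beer"}, TRANSACTIONS))
    print(lift({"Beer"}, {"Diaper"}, TRANSACTIONS))
```

Dividing confidence by Support(Y) is algebraically the same as Support(X ∪ Y) / (Support(X) × Support(Y)); it is used here only because it reuses the function already defined.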
Let minsup = 50%, minconf = 50%

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

 Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
 Association rules (at lower thresholds there are many more):
 Beer → Diaper (60%, 100%)
 Diaper → Beer (60%, 75%)

Rule 1: Beer → Diaper
Support = count(Beer ∪ Diaper) / N = 3/5 = 0.6 = 60%
Confidence = Support(Beer ∪ Diaper) / Support(Beer) = (3/5) / (3/5) = 3/3 = 1 = 100%
Rule 2: Diaper → Beer
Support = count(Diaper ∪ Beer) / N = 3/5 = 0.6 = 60%
Confidence = Support(Diaper ∪ Beer) / Support(Diaper) = (3/5) / (4/5) = 3/4 = 0.75 = 75%
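The frequent patterns and rules in this example can be checked with a brute-force enumeration. The sketch below is only an illustration for a table this small; it is not one of the mining methods named in the outline, and the helper names `frequent_itemsets` and `association_rules` are hypothetical.

```python
from itertools import combinations

# Transactions from the "Tid / Items bought" table (Tid 10 to 50).
TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def frequent_itemsets(transactions, minsup):
    """Try every candidate itemset; return {itemset: support count}."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= minsup:
                result[frozenset(combo)] = count
                found = True
        if not found:
            break  # no frequent k-itemset, so no larger one can be frequent
    return result


def association_rules(freq, n, minconf):
    """Split each frequent itemset into LHS -> RHS and keep confident rules."""
    rules = []
    for itemset, count in freq.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / freq[lhs]  # every subset of a frequent set is in freq
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, count / n, conf))
    return rules


freq = frequent_itemsets(TRANSACTIONS, minsup=0.5)
rules = association_rules(freq, len(TRANSACTIONS), minconf=0.5)

if __name__ == "__main__":
    for lhs, rhs, sup, conf in rules:
        print(sorted(lhs), "->", sorted(rhs), f"({sup:.0%}, {conf:.0%})")
```

Brute force examines up to 2^M − 1 candidates for M distinct items, which is acceptable for six items but is exactly the blow-up that dedicated mining methods are designed to avoid.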
EXAMPLE 1
TID   Milk   Bread   Butter
1     1      1       0
2     0      1       1
3     0      0       0
4     1      1       1
5     0      1       0
Rule 1: {Milk, Bread} → {Butter}
Support = count(X ∪ Y) / N = count({Milk, Bread, Butter}) / N = 1/5 = 0.2 = 20%
Confidence = Support(X ∪ Y) / Support(X) = Support({Milk, Bread, Butter}) / Support({Milk, Bread})
= (1/5) / (2/5) = 1/2 = 0.5 = 50%
Lift = Support(X ∪ Y) / (Support(X) × Support(Y)) = Support({Milk, Bread, Butter}) / (Support({Milk, Bread}) × Support({Butter}))
= (1/5) / ((2/5) × (2/5)) = 1.25 > 1
So X and Y are positively correlated: because the lift is greater than 1, Y (Butter) is likely to be
bought if X (Milk, Bread) is bought.
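The arithmetic in Example 1 can be reproduced exactly from the 0/1 table. This is a hedged sketch that uses Python's `fractions.Fraction` so that 1/5, 2/5 and 1.25 appear without floating-point rounding; the helper `count_rows` is an illustrative name.

```python
from fractions import Fraction

# Binary purchase matrix from Example 1: one row per TID, columns as listed.
COLUMNS = ["Milk", "Bread", "Butter"]
ROWS = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]


def count_rows(items):
    """Number of rows in which every named column equals 1."""
    idx = [COLUMNS.index(name) for name in items]
    return sum(1 for row in ROWS if all(row[j] == 1 for j in idx))


n = len(ROWS)
x, y = ["Milk", "Bread"], ["Butter"]

sup_xy = Fraction(count_rows(x + y), n)   # Support(X u Y)
sup_x = Fraction(count_rows(x), n)        # Support(X)
sup_y = Fraction(count_rows(y), n)        # Support(Y)

conf = sup_xy / sup_x                     # Confidence(X -> Y)
lift = sup_xy / (sup_x * sup_y)           # Lift(X -> Y)

if __name__ == "__main__":
    print(sup_xy, conf, lift)
```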
EXAMPLE
TID   Bread   Milk   Diaper   Beer   Egg   Cola
1     1       1      0        0      0     0
2     1       0      1        1      1     0
3     0       1      1        1      0     1
4     1       1      1        1      0     0
5     1       1      1        0      0     1
Rule: {Milk, Diaper} → {Beer}
Support = count(X ∪ Y) / N = count({Milk, Diaper, Beer}) / N = 2/5 = 0.4 = 40%
Confidence = Support(X ∪ Y) / Support(X) = Support({Milk, Diaper, Beer}) / Support({Milk, Diaper})
= (2/5) / (3/5) = 2/3 ≈ 0.667 = 66.7%
Lift = Support(X ∪ Y) / (Support(X) × Support(Y)) = Support({Milk, Diaper, Beer}) / (Support({Milk, Diaper}) × Support({Beer}))
= (2/5) / ((3/5) × (3/5)) ≈ 1.11 > 1
So X and Y are positively correlated: because the lift is greater than 1, Y (Beer) is likely to be
bought if X (Milk, Diaper) is bought.
EXAMPLE
TID Bread Milk Diaper Beer Egg Cola
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1
Rules: 1. {Beer} → {Diaper}   2. {Diaper} → {Beer}

COMPUTATIONAL COMPLEXITY OF FREQUENT ITEMSET MINING
 How many itemsets may potentially be generated in the worst case?
 The number of frequent itemsets to be generated is sensitive to the minsup threshold
 When minsup is low, there may exist an exponential number of frequent itemsets
 The worst case: M^N, where M is the number of distinct items and N is the maximum transaction length
 A long pattern contains a combinatorial number of sub-patterns,
e.g., {a1, …, a100} contains
$\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
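The sub-pattern count for a 100-item pattern can be confirmed with exact integer arithmetic. A short sketch using the standard-library `math.comb`:

```python
import math

n = 100

# Number of non-empty sub-patterns: choose k items out of n, for k = 1..n.
total = sum(math.comb(n, k) for k in range(1, n + 1))

# The binomial identity says this equals 2^n - 1.
matches_identity = total == 2**n - 1

if __name__ == "__main__":
    print(matches_identity)
    print(f"{total:.2e}")
```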
CLOSED PATTERNS AND MAX-PATTERNS
 Solution: mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern
Y ⊃ X with the same support as X
 An itemset X is a max-pattern if X is frequent and there exists no frequent
super-pattern Y ⊃ X
EXAMPLE
Suppose a transactional database contains the itemsets below with the support counts shown.
Let the minimum support count be 2. Find the closed and maximal patterns.

Itemset                      Support count   Closed   Maximal
{a1, a2, a3, a4}             1               No       No
{a2, a3, a4}                 4               Yes      No
{a2, a3, a4, a5, a6}         2               Yes      Yes
{a2, a3, a4, a5, a6, a7}     1               No       No
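The Closed and Maximal columns follow mechanically from the two definitions. The sketch below applies them to the four itemsets in the table; it assumes these four are the only itemsets to compare against, and the function name `classify` is illustrative.

```python
# Support counts copied from the example table.
SUPPORT = {
    frozenset({"a1", "a2", "a3", "a4"}): 1,
    frozenset({"a2", "a3", "a4"}): 4,
    frozenset({"a2", "a3", "a4", "a5", "a6"}): 2,
    frozenset({"a2", "a3", "a4", "a5", "a6", "a7"}): 1,
}
MIN_COUNT = 2  # minimum support count from the example


def classify(support, min_count):
    """Return {itemset: (is_closed, is_maximal)} for the listed itemsets."""
    result = {}
    for x, sx in support.items():
        frequent = sx >= min_count
        # Support counts of the listed proper supersets of x.
        super_counts = [sy for y, sy in support.items() if x < y]
        # Closed: frequent, and no proper superset has the same support.
        closed = frequent and all(sy != sx for sy in super_counts)
        # Maximal: frequent, and no proper superset is itself frequent.
        maximal = frequent and all(sy < min_count for sy in super_counts)
        result[x] = (closed, maximal)
    return result


labels = classify(SUPPORT, MIN_COUNT)

if __name__ == "__main__":
    for itemset, (closed, maximal) in labels.items():
        print(sorted(itemset), "closed" if closed else "-", "maximal" if maximal else "-")
```

Every max-pattern is also closed, since a superset with the same support would itself be frequent; the reverse does not hold, as {a2, a3, a4} shows.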
 Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 <a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
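Enumerating all 2^100 − 1 frequent itemsets of this exercise is infeasible, but the same reasoning can be checked on a scaled-down analogue. The sketch below assumes a hypothetical database {<a1, …, a4>, <a1, a2>} with Min_sup = 1, chosen only because it has the same shape as the exercise.

```python
from itertools import combinations

# Scaled-down stand-in for the exercise (an assumption, not the slide's data).
DB = [{"a1", "a2", "a3", "a4"}, {"a1", "a2"}]
MIN_SUP = 1  # minimum support count

items = sorted(set().union(*DB))

# Brute force: support count of every non-empty itemset that is frequent.
support = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        count = sum(1 for t in DB if set(combo) <= t)
        if count >= MIN_SUP:
            support[frozenset(combo)] = count

# Closed: no proper superset has the same support.
closed = {
    x: s for x, s in support.items()
    if not any(x < y and support[y] == s for y in support)
}

# Maximal: no proper superset is frequent.
maximal = {x for x in support if not any(x < y for y in support)}

if __name__ == "__main__":
    print(len(support), "frequent itemsets")
    print({tuple(sorted(x)): s for x, s in closed.items()})
    print([tuple(sorted(x)) for x in maximal])
```

In this small analogue, two closed itemsets and one max-pattern summarize fifteen frequent itemsets, mirroring the answer to the exercise.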
