
FACULTY: COMPUTING & SWE

Introduction to Data Mining and Warehousing
Amin T. (Asst. Prof.), 2022/23
CHAPTER FOUR
BASIC CONCEPTS AND METHODS OF MINING FREQUENT
PATTERNS, ASSOCIATIONS AND CORRELATIONS

Garbage In, Garbage Out (GIGO)
05/03/2023 2
Basic Concepts
• Frequent patterns are patterns (such as itemsets or subsequences) that
appear in a data set frequently.
– For example,
– a set of items, such as milk and bread, that appears frequently together in a
transaction data set is a frequent itemset.
– A subsequence, such as buying first a PC, then a digital camera, and then a
memory card, is a (frequent) sequential pattern if it occurs frequently in a
shopping history database.
• Finding such frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.
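The notion of a frequent itemset can be illustrated with a short sketch. The transactions and the 50% threshold below are made-up examples, not data from the slides:

```python
# Counting how often an itemset appears in a transaction data set.
# The transactions and the threshold below are made-up illustrations.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

min_support = 0.5
print(support({"milk", "bread"}, transactions))  # 0.75 -> frequent
print(support({"milk", "eggs"}, transactions))   # 0.25 -> not frequent
```

Here {milk, bread} is a frequent itemset because it appears in 3 of the 4 transactions, above the assumed 50% threshold.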

Frequent Pattern Analysis
• Motivation: finding inherent regularities in data
– What products were often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log
(clickstream) analysis, and DNA sequence analysis.

Why Is Frequent Pattern Mining Important?
• Frequent pattern: an intrinsic (fundamental) and important property of data sets
• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential and structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
– Classification: discriminative frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cubes and cube gradients
– Semantic data compression: fascicles
– Broad applications

Association Rule Mining
• Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets.
• Association rules are used to express the frequent patterns that are mined
• Initially used for Market Basket Analysis to analyze customer buying habits by finding
associations between the different items that customers place in their “shopping
baskets”

Association Rule Mining
• The results can be used in
– making various marketing or advertising strategies
– shelf arrangement
– designing different shop layouts

Association rules can be used for

Association Rule Mining
• Association Rule:
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Rule Evaluation Metrics
– Support (s): the probability that a transaction contains A ∪ B
  s = support(A ⇒ B) = P(A ∪ B)
– Confidence (c): the conditional probability that a transaction containing A also contains B
  c = confidence(A ⇒ B) = P(B|A)
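Both metrics can be computed directly from their definitions. The sketch below uses the five transactions from the table above; the function names are illustrative, not a library API:

```python
# Support and confidence of a rule X -> Y, computed from their definitions
# (s = P(X ∪ Y), c = P(Y | X)) on the slide's five transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

# Rule {Milk, Diaper} -> {Beer}
print(support({"Milk", "Diaper"} | {"Beer"}))    # 0.4 (2 of 5 transactions)
print(confidence({"Milk", "Diaper"}, {"Beer"}))  # about 0.667 (2/3)
```

Of the three transactions containing both Milk and Diaper, two also contain Beer, hence the confidence of 2/3.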
Basic Concepts: Association Rules

Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F

– Let min_support = 50%, min_conf = 50%:
– A ⇒ C (support 50%, confidence 66.7%)
– C ⇒ A (support 50%, confidence 100%)

(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.)
Basic Concepts: Frequent Patterns and Association Rules

• Association rule mining is a two-step process:
– Find all frequent itemsets
– Generate strong association rules from the frequent itemsets
– For every frequent itemset L, find all non-empty subsets of L. For every such subset A, output a
rule of the form A ⇒ (L - A) if the ratio of support(L) to support(A) is at least the minimum
confidence

• The overall performance of mining association rules is determined by the first step
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
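The rule-generation step above can be sketched as follows. The support values are made-up illustrations, and the function name is my own:

```python
from itertools import combinations

# Step 2 sketch: for a frequent itemset L, emit A -> (L - A) whenever
# support(L) / support(A) meets the minimum confidence.
# The support values below are made-up for illustration.
supports = {
    frozenset("A"): 0.75,
    frozenset("C"): 0.5,
    frozenset("AC"): 0.5,
}

def rules_from(L, min_conf):
    L = frozenset(L)
    out = []
    for r in range(1, len(L)):                    # non-empty proper subsets
        for A in map(frozenset, combinations(L, r)):
            conf = supports[L] / supports[A]
            if conf >= min_conf:
                out.append((set(A), set(L - A), conf))
    return out

for A, B, conf in rules_from("AC", min_conf=0.5):
    print(A, "->", B, round(conf, 3))
# Two rules: {'A'} -> {'C'} with conf 0.667 and {'C'} -> {'A'} with
# conf 1.0 (printed order may vary)
```

Because the supports of all frequent itemsets are already known after step 1, this step needs no further database scans, which is why step 1 dominates the overall cost.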

Mining Association Rules: An Example

Transaction-id  Items bought     Min. support = 50%
10              A, B, C          Min. confidence = 50%
20              A, C
30              A, D
40              B, E, F

Frequent pattern  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For the rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.7%

The Apriori Algorithm
• The name, Apriori, is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties
• Apriori employs an iterative approach known as a level-wise search, where k-
itemsets are used to explore (k+1)-itemsets
– The first pass determines the frequent 1-itemsets, denoted L1
– Each subsequent pass k consists of two phases
– First, the frequent itemsets Lk-1 are used to generate the candidate itemsets Ck
– Next, the database is scanned, the support of the candidates in Ck is counted, and
the frequent itemsets Lk are determined

Apriori Property
• Apriori property: any subset of a frequent (large) itemset must also be frequent
– If {beer, diaper, nuts} is frequent, so is {beer, diaper}
– Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
• Anti-monotonicity: if a set cannot pass a test, all of its supersets will fail the same test as
well

Apriori: A Candidate Generation-and-test Approach
• Apriori pruning principle: If there is any itemset which is infrequent, its superset should
not be generated/tested!
• Method: join and prune steps
– Generate candidate (k+1)-itemsets Ck+1 from frequent k-itemsets Lk
– If any k-subset of a candidate (k+1)-itemset is not in Lk, then the candidate cannot
be frequent either and so can be removed from Ck
– Test the candidates against DB to obtain Lk+1

The Apriori Algorithm—Example
• Let the minimum support be 20%

The Apriori Algorithm
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are
        contained in t
    end
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning

• How to count supports of candidates?

• Example of candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
– Pruning:
– acde is removed because ade is not in L3

– C4={abcd}
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order

• Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck


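The join and prune steps above translate into a short sketch. Itemsets are kept as sorted tuples so the join condition can be checked positionally; the function name is illustrative:

```python
from itertools import combinations

# Candidate generation as in the join/prune pseudo-code: itemsets are
# sorted tuples, so "share the first k-2 items, last item of p smaller
# than last item of q" can be tested by position.
def generate_candidates(Lk_1, k):
    prev = set(Lk_1)
    Ck = []
    for p in Lk_1:
        for q in Lk_1:
            # Self-join condition
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must be in Lk-1
                if all(s in prev for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 4))  # [('a', 'b', 'c', 'd')]; acde is pruned
```

This reproduces the earlier example: the join yields abcd and acde, and acde is pruned because its subset ade is not in L3, leaving C4 = {abcd}.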
Generating Association Rules from Frequent Itemsets
• Once the frequent itemsets from transactions in a database D have been found,
it is straightforward to generate strong association rules from them (where
strong association rules satisfy both minimum support and minimum
confidence).

• The association rules that can be generated from L3 = {I1, I2, I5}:

– The nonempty proper subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}
– The resulting association rules, each listed with its confidence, are
  I1 ∧ I2 ⇒ I5,  I1 ∧ I5 ⇒ I2,  I2 ∧ I5 ⇒ I1,  I1 ⇒ I2 ∧ I5,  I2 ⇒ I1 ∧ I5,  I5 ⇒ I1 ∧ I2

• If the minimum confidence threshold is, say, 70%, then only the second, third,
and last rules above are the output, because these are the only ones generated
that are strong.

Mining Frequent Itemsets without Candidate Generation

• FP-growth: Mining frequent patterns with FP-tree by pattern fragment growth:


– Input: an FP-tree constructed using DB and a minimum support threshold.
– Output: The complete set of frequent patterns.
– Method: Call FP-growth(FP-tree, null).

Algorithm of FP-growth

Construct an FP-tree from a Transaction Database
 Let the minimum support be 20%
1. Scan DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in descending order of frequency, giving the f-list
3. Scan DB again and construct the FP-tree

Frequent 1-itemset  Support count
I1                  6
I2                  7
I3                  6
I4                  2
I5                  2
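The three construction steps can be sketched as follows. The transactions, the count threshold, and the dictionary-based tree are illustrative assumptions; a real FP-tree also maintains header-table links between nodes of the same item, which are omitted here:

```python
from collections import Counter

# Sketch of the three steps: count item frequencies (scan 1), build the
# f-list, then reorder each transaction by the f-list and insert it into
# a counted prefix tree (scan 2). Transactions are made-up; f-list ties
# are broken alphabetically for determinism.
transactions = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I2", "I3"],
    ["I1", "I2", "I4"],
    ["I1", "I3"],
]
min_count = 2

# Step 1: one scan to count single items
counts = Counter(i for t in transactions for i in t)
# Step 2: frequent items in descending frequency order (the f-list)
f_list = sorted((i for i in counts if counts[i] >= min_count),
                key=lambda i: (-counts[i], i))

def insert(tree, items):
    """Insert an ordered transaction as a path; each node keeps a count."""
    for item in items:
        node = tree.setdefault(item, {"count": 0, "children": {}})
        node["count"] += 1
        tree = node["children"]

# Step 3: second scan to build the tree
root = {}
for t in transactions:
    insert(root, [i for i in f_list if i in t])  # frequent items, f-list order

print(f_list)  # ['I2', 'I1', 'I3', 'I4']
```

Because transactions sharing a frequent prefix share a path in the tree, the whole database is compressed into a structure that FP-growth can mine without candidate generation.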
Additional references
• https://www.youtube.com/watch?v=yCbankIouUU
• http://home.etf.rs/~vm/os/dmsw/FP-Growth%20algorithm.ppt

Thank You
