
Chapter 3

MINING FREQUENT
PATTERNS, ASSOCIATIONS AND CORRELATIONS



What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami in the context
of frequent itemsets and association rule mining.
 It leads to the discovery of associations and correlations among
items in large transactional and relational data sets
 Applications
 Basket data analysis, cross-marketing, catalog design, sales
campaign analysis, Web log (click-stream) analysis, and DNA
sequence analysis.
 Set of items: milk, bread.
 Subsequence: buying first a mobile phone, then a SIM card,
and then a memory card.
 Substructure: different structural forms such as subtrees
and subgraphs.



Frequent pattern mining

Frequent pattern mining can be categorized in different ways:

 Based on the completeness of the patterns to be mined.
 Based on the levels of abstraction in the rule set.
 Based on the number of dimensions (attributes) in the rules.
 Based on the types of values handled in the rules (Boolean
values, quantitative values).
 Based on the kinds of rules to be mined (association rules,
correlation rules).
 Based on the kinds of patterns to be mined (itemsets,
sequences, structures).
Generating association rules
 Support(A ⇒ B) = P(A ∪ B)
- the probability that a transaction contains both A and B.

 Confidence(A ⇒ B) = P(B | A)
- computed as sup_count(A ∪ B) / sup_count(A).
- measures how often transactions that contain A also
contain B.
- These are the two interestingness measures used to
generate association rules.



Example: support and confidence

Rule: {milk, diaper} ⇒ beer

TID ITEMS
1 Bread, milk
2 Bread, diaper, beer, eggs
3 Milk, diaper, beer, coke
4 Bread, milk, diaper, beer
5 Bread, milk, diaper, coke

Support = sup_count(A ∪ B) / |T|
= 2/5 = 0.40 = 40%
Confidence = sup_count(A ∪ B) / sup_count(A)
= 2/3 = 0.67 = 67%
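
As a quick check on these numbers, here is a minimal Python sketch (the transaction encoding and function name are illustrative, not from the text) that computes the support and confidence of {milk, diaper} ⇒ beer over the five transactions above:

```python
# Transactions from the example, encoded as sets of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def sup_count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

A, B = {"milk", "diaper"}, {"beer"}
support = sup_count(A | B, transactions) / len(transactions)             # 2/5
confidence = sup_count(A | B, transactions) / sup_count(A, transactions) # 2/3
print(f"support = {support:.0%}, confidence = {confidence:.0%}")  # 40%, 67%
```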



Apriori: A Candidate Generation-and-Test Approach

 Apriori pruning principle: if any itemset is infrequent, its
supersets should not be generated or tested.
 Proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets.
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length k
frequent itemsets
Cont….

 Test the candidates against the DB
 Terminate when no frequent or candidate set can be
generated

Main Steps:
 Join step
 Pruning step



 Apriori employs an iterative approach known as level-
wise search.
 Frequent k-itemsets are used to explore (k+1)-itemsets.
 The frequent 1-itemsets are found by scanning the database
to accumulate the count for each item and collecting those
items that satisfy minimum support. The resulting set is
denoted L1.
- L1 is used to find L2.
- L2 is used to find L3, and so on.



 The iteration continues until no more frequent k-itemsets
can be found.
 Apriori property: all nonempty subsets of a frequent itemset
must also be frequent.
 The algorithm terminates when the frequent or candidate set
of the current level is empty. A sketch of the full procedure
follows.
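
The level-wise procedure can be written compactly. The following Python sketch is illustrative only (it assumes transactions are given as sets); the join and prune steps named on the previous slide appear as comments:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search sketch; returns {frozenset: support count}."""
    # Scan the DB once to get the frequent 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent, k = dict(Lk), 1
    while Lk:
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        prev = list(Lk)
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step (Apriori property): drop any candidate that has
        # an infrequent k-subset.
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        counts = {c: sum(1 for t in transactions if c <= t) for c in cands}
        Lk = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent
```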



The Apriori Algorithm—An Example (min_sup = 2)

Database TDB:
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (after pruning {D}): {A}:2, {B}:3, {C}:3, {E}:3

C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → L3: {B,C,E}:2
Cont...
 Calculate confidence for the final frequent itemset {B, C, E},
whose support count is 2.
Nonempty proper subsets: {B}, {C}, {E}, {B,C}, {B,E}, {C,E}.
B ⇒ C ∧ E : confidence = 2/3 = 0.66 = 66%
C ⇒ B ∧ E : confidence = 2/3 = 0.66 = 66%
E ⇒ B ∧ C : confidence = 2/3 = 0.66 = 66%
B ∧ C ⇒ E : confidence = 2/2 = 1 = 100%
B ∧ E ⇒ C : confidence = 2/3 = 0.66 = 66%
C ∧ E ⇒ B : confidence = 2/2 = 1 = 100%
Average confidence = (66 + 66 + 66 + 100 + 66 + 100) / 6 ≈ 77.3%
(given minimum support = 2)
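
These six confidences can be reproduced mechanically: every nonempty proper subset A of the frequent itemset F defines a candidate rule A ⇒ F − A with confidence sup_count(F) / sup_count(A). A small sketch using the support counts found above:

```python
from itertools import combinations

# Support counts taken from the worked example (min_sup = 2).
sup = {frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
       frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
       frozenset("BCE"): 2}

F = frozenset("BCE")
for r in range(1, len(F)):                    # all nonempty proper subsets
    for A in map(frozenset, combinations(sorted(F), r)):
        conf = sup[F] / sup[A]
        print(f"{set(A)} => {set(F - A)}: {conf:.0%}")
```

Running it prints the same four 66% and two 100% confidences listed above.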



Improving the Efficiency of Apriori
 Hash-based technique
 Transaction reduction (a small sketch follows below)
 Partitioning
 Sampling
 Dynamic itemset counting
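
As an illustration of one of these techniques, here is a hedged sketch of transaction reduction: a transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be dropped from later scans. The function name is illustrative:

```python
def reduce_transactions(transactions, Lk):
    """Keep only transactions containing at least one frequent k-itemset;
    the rest cannot support any frequent (k+1)-itemset."""
    return [t for t in transactions if any(s <= t for s in Lk)]
```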



Construct FP-tree from a Transaction Database (min_support = 3)

TID  Items bought                (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the DB once to find the frequent 1-itemsets (single-item
patterns): f:4, c:4, a:3, b:3, m:3, p:3.
2. Sort the frequent items in frequency-descending order to obtain
the F-list: f-c-a-b-m-p.
3. Scan the DB again and insert each transaction's ordered frequent
items into the FP-tree, sharing common prefixes; a header table
links each item to its occurrences in the tree.

[FP-tree figure: root {} with branches f:4 → c:3 → a:3 → (m:2 → p:2
and b:1 → m:1), f:4 → b:1, and c:1 → b:1 → p:1.]
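
A minimal Python sketch of the three construction steps, assuming transactions are given as item collections (the node layout and names are illustrative, not the book's pseudocode):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, min_support):
    # Step 1: scan the DB once to count single items.
    freq = Counter(i for t in transactions for i in t)
    # Step 2: F-list = frequent items in frequency-descending order.
    flist = [i for i, c in sorted(freq.items(), key=lambda x: -x[1])
             if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    # Step 3: scan the DB again; insert each transaction's ordered
    # frequent items, sharing common prefixes.
    root, header = Node(None, None), {i: [] for i in flist}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist
```

Applied to the five transactions above with min_support = 3, this reproduces the F-list f-c-a-b-m-p (up to ties between equally frequent items) and the shared-prefix tree sketched in the figure.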
Benefits of the FP-tree Structure

 Completeness
 Preserves complete information for frequent pattern
mining
 Never breaks a long pattern of any transaction
 Compactness
 Reduces irrelevant information: infrequent items are gone
 Items are stored in frequency-descending order: the more
frequently an item occurs, the more likely its node is shared
 The tree is never larger than the original database.



From Conditional Pattern Bases to Conditional FP-trees

 For each pattern base
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the
pattern base

Example for item m (min_sup = 3):
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns relating to m:
m, fm, cm, am, fcm, fam, cam, fcam
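
For instance, accumulating the counts of m's pattern base is straightforward once the prefix paths have been read off the node-links. A small sketch with the values from this slide:

```python
from collections import Counter

# m's conditional pattern base: (prefix path, count of the m node below it).
m_base = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]

counts = Counter()
for path, n in m_base:
    for item in path:
        counts[item] += n

# b (count 1) falls below min_sup = 3 and is dropped; f, c, a remain,
# giving the m-conditional FP-tree f:3 -> c:3 -> a:3.
print({i: c for i, c in counts.items() if c >= 3})  # {'f': 3, 'c': 3, 'a': 3}
```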
Why Is FP-Growth the Winner?

 Divide-and-conquer:
 leads to focused search of smaller databases
 Other factors
 no candidate generation, no candidate test
 compressed database: FP-tree structure
 no repeated scan of entire database
 basic operations: counting local frequent items and building
sub-FP-trees.



Mining Various Kinds of Association Rules

 Mining multilevel association

 Mining multidimensional association

 Mining quantitative association

 Mining interesting correlation patterns



Kinds of mining rules

1. Multilevel association rules
- involve different levels of abstraction.
2. Multidimensional association rules
- involve more than one dimension or predicate,
e.g., what a customer buys as well as the
customer's age.
3. Quantitative association rules
- involve numeric attributes with an implicit order
among values, e.g., age.



Mining Multiple-Level Association Rules
 Items often form hierarchies
 Flexible support settings
 Items at a lower level are expected to have lower
support.
 Exploration of shared multi-level mining.

Uniform support vs. reduced support:

Item hierarchy: Milk [support = 10%] at level 1, with Butter Milk
[support = 6%] and Skim Milk [support = 4%] at level 2.

 Uniform support: min_sup = 5% at level 1 and at level 2.
 Reduced support: min_sup = 5% at level 1, min_sup = 3% at level 2.


 Uniform minimum support
- With min_sup = 5% at every level, milk and butter milk
are frequent, while skim milk (4%) is not.
 Reduced minimum support
- With min_sup = 5% at level 1 and 3% at level 2, milk,
butter milk, and skim milk are all considered frequent.
A small sketch of the two settings follows.
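
A small Python sketch using the numbers from the milk example (the function name and encoding are illustrative):

```python
# Observed support and hierarchy level of each item.
support = {"milk": 0.10, "butter milk": 0.06, "skim milk": 0.04}
level = {"milk": 1, "butter milk": 2, "skim milk": 2}

def frequent_items(min_sup_at_level):
    return [i for i in support if support[i] >= min_sup_at_level[level[i]]]

print(frequent_items({1: 0.05, 2: 0.05}))  # uniform: skim milk is pruned
print(frequent_items({1: 0.05, 2: 0.03}))  # reduced: all three are frequent
```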



Mining Multi-Dimensional Association
 Single-dimensional (intradimensional) rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
 Multi-dimensional rules: ≥ 2 dimensions or predicates
 Inter-dimension assoc. rules (no repeated predicates):
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
 Hybrid-dimension assoc. rules (repeated predicates):
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
 Categorical attributes: finite number of possible values, no
ordering among values (data cube approach)
 Quantitative attributes: numeric, implicit ordering among
values (discretization, clustering, and gradient approaches)



Mining Quantitative Associations

 Techniques can be categorized by how numerical attributes,
such as age or salary, are treated
1. Static discretization based on predefined concept hierarchies
(data cube methods)
2. Dynamic discretization based on data distribution
(quantitative association rules)
3. Clustering based on data characteristics or features (by
clustering methods).



Static Discretization of Quantitative Attributes

 Quantitative attributes are discretized prior to mining using
concept hierarchies.
 Numeric values are replaced by ranges.
 In a relational database, finding all frequent k-predicate sets
will require k+1 table scans.
 A data cube is well suited for mining; mining from data cubes
can be much faster.

[Cuboid lattice over the dimensions age, income, buys: apex
cuboid (); 1-D cuboids (age), (income), (buys); 2-D cuboids
(age, income), (age, buys), (income, buys); base cuboid
(age, income, buys).]
 The cells of the cuboids store aggregate counts, which are
essential for computing support and confidence.
 Dimensions: age, income, buys.
 The base cuboid aggregates the task-relevant data by
age, income, and buys.
 2-D cuboids: (age, income), (age, buys), (income, buys).
 1-D cuboids: (age), (income), (buys).
 The 0-D (apex) cuboid contains the total number of
transactions in the task-relevant data.



Quantitative Association Rules
 2-D quantitative association rules:
A_quan1 ∧ A_quan2 ⇒ A_cat
 A_quan1 and A_quan2 are tests on quantitative attribute
intervals; A_cat tests a categorical attribute.
 Intervals are chosen dynamically so that the confidence or
compactness of the rules is maximized.
 Because the numeric attributes are discretized dynamically
during mining, this is called dynamic discretization.
Example:
age(X, “34-35”) ∧ income(X, “30-50K”)
⇒ buys(X, “high resolution TV”)
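
A hedged sketch of the binning idea behind dynamic discretization: numeric values are mapped to intervals, and each interval then behaves like an ordinary item. The bin boundaries and labels here are illustrative only:

```python
def discretize(value, bins):
    """Return the label of the interval [lo, hi) containing value."""
    for lo, hi, label in bins:
        if lo <= value < hi:
            return label
    return None

age_bins = [(30, 34, "age:30-33"), (34, 36, "age:34-35")]
income_bins = [(0, 30, "income:<30K"), (30, 51, "income:30-50K")]

# A customer record becomes a set of interval items plus categorical items.
record = {"age": 34, "income": 42, "buys": "high resolution TV"}
items = {discretize(record["age"], age_bins),
         discretize(record["income"], income_bins),
         "buys:" + record["buys"]}
print(items)  # {'age:34-35', 'income:30-50K', 'buys:high resolution TV'}
```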



Association to correlation analysis

 Support and confidence measures are insufficient for filtering
out uninteresting association rules.
 A correlation measure can be used to augment the
support-confidence framework.
P ( A B )
lift 
P ( A) P ( B )

sup( X )
all _ conf 
max_ item _ sup( X )
sup( X )
coh 
| universe( X ) |
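
These measures are simple to compute once the probabilities or supports are known; a minimal sketch (the function names are illustrative):

```python
def lift(p_ab, p_a, p_b):
    """lift = P(A U B) / (P(A) * P(B)); 1 means A and B are independent."""
    return p_ab / (p_a * p_b)

def all_confidence(sup_x, max_item_sup_x):
    """all_conf(X) = sup(X) / max_item_sup(X)."""
    return sup_x / max_item_sup_x

# Numbers from the computer game / video example on the next slide:
print(round(lift(0.40, 0.60, 0.75), 2))  # 0.89 -> negatively correlated
```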



Interpreting the lift value:
1. lift < 1: the occurrence of A is negatively correlated with
the occurrence of B.
2. lift > 1: A and B are positively correlated.
3. lift = 1: no correlation; A and B are independent.



Example: correlation using lift
 1. The probability of purchasing a computer game is
P(game) = 0.60.
 2. The probability of purchasing a video is
P(video) = 0.75.
 3. The probability of purchasing both is
P(game, video) = 0.40.
Lift = P(A ∪ B) / (P(A) · P(B))
= 0.40 / (0.60 × 0.75) = 0.89
The value is less than 1, so purchasing a game and purchasing
a video are negatively correlated.



Constraints in Data Mining

 Users specify their intuition or expectations as
constraints to confine the search space.

 Knowledge type constraint:
 describes the type of knowledge to be mined,
e.g., association, correlation.

 Data constraint:
 specifies the set of task-relevant data.
 expressed using queries/tools.



Cont..

 Dimension/level constraint
- in relevance to region, price, brand, customer category.

 Rule (or pattern) constraint
- relationships among attributes or attribute values, and the
maximum or minimum number of predicates that may occur
in the rule antecedent and consequent.

 Interestingness constraint
- support, confidence, correlation;
specifies thresholds on these statistical measures.
Rule constraints
 Specify the syntactic form of the rules to be mined.
 Improve the efficiency of the mining process.
 Restrict mining to the set of rules the user expects.
 Help analyze relationships between variables.
 Often expressed as metarules (rule templates).



Another example: Apriori
TID LIST OF ITEMS
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3

Consider a minimum support count of 2.

Find the confidence of the generated rules.
(A check using the apriori sketch from earlier in the chapter follows.)
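
```python
# Assumes the apriori() sketch defined earlier in this chapter.
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]

freq = apriori(T, min_sup=2)
# Largest frequent itemsets: {I1,I2,I3}:2 and {I1,I2,I5}:2, so, e.g.,
# confidence(I1 ^ I5 => I2) = sup{I1,I2,I5} / sup{I1,I5} = 2/2 = 100%.
print({tuple(sorted(s)): c for s, c in freq.items() if len(s) == 3})
```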

