
Chapter 3

MINING FREQUENT
PATTERNS, ASSOCIATIONS AND CORRELATIONS



What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami in the context
of frequent itemsets and association rule mining.
 It leads to the discovery of associations and correlations among
items in large transactional and relational data sets
 Applications
 Basket data analysis, cross-marketing, catalog design, sales
campaign analysis, Web log (click-stream) analysis, and DNA
sequence analysis.
 Set of items: milk, bread.
 Subsequence: buying first a mobile phone, then a SIM card,
and then a memory card.
 Substructure: different structural forms such as subtrees
and subgraphs.



Frequent pattern mining

Frequent pattern mining can be categorized in different ways:

 Based on the completeness of the patterns to be mined.
 Based on the levels of abstraction in the rule set.
 Based on the number of dimensions (attributes) in the rules.
 Based on the types of values handled in the rules (Boolean
values, quantitative values).
 Based on the kinds of rules to be mined (association rules,
correlation rules).
 Based on the kinds of patterns to be mined (itemsets,
sequences, structures).
Generating association rules
 Support(A ⇒ B) = P(A ∪ B)
- the probability that a transaction contains both A and B.

 Confidence(A ⇒ B) = P(B | A)
- computed as sup_count(A ∪ B) / sup_count(A).
- measures how often transactions that contain A also
contain B.
- These are the two interestingness measures used to
generate association rules.



Example: support and confidence

Rule: {milk, diaper} ⇒ beer

TID ITEMS
1 Bread, milk
2 Bread, diaper, beer, eggs
3 Milk, diaper, beer, coke
4 Bread, milk, diaper, beer
5 Bread, milk, diaper, coke

Support = sup_count(A ∪ B) / |T|
= 2/5 = 0.40 = 40%
Confidence = sup_count(A ∪ B) / sup_count(A)
= 2/3 = 0.67 = 67%
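
As a quick check on these numbers, here is a minimal Python sketch (the transaction encoding and function name are illustrative, not from the text) that computes the support and confidence of {milk, diaper} ⇒ beer over the five transactions above:

```python
# Transactions from the example, encoded as sets of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def sup_count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

A, B = {"milk", "diaper"}, {"beer"}
support = sup_count(A | B, transactions) / len(transactions)             # 2/5
confidence = sup_count(A | B, transactions) / sup_count(A, transactions) # 2/3
print(f"support = {support:.0%}, confidence = {confidence:.0%}")  # 40%, 67%
```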



Apriori: A Candidate Generation-and-Test Approach

 Apriori pruning principle: if any itemset is infrequent, its
supersets should not be generated or tested.
 Proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets.
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length k
frequent itemsets
Cont….

 Test the candidates against the DB
 Terminate when no frequent or candidate set can be
generated

Main Steps:
 Join step
 Pruning step



 Apriori employs an iterative approach known as level-
wise search.
 Frequent k-itemsets are used to explore (k+1)-itemsets.
 The frequent 1-itemsets are found by scanning the database
to accumulate the count for each item and collecting those
items that satisfy minimum support. The resulting set is
denoted L1.
- L1 is used to find L2.
- L2 is used to find L3, and so on.



 The iteration continues until no more frequent k-itemsets
can be found.
 Apriori property: all nonempty subsets of a frequent itemset
must also be frequent.
 The algorithm terminates when the frequent or candidate set
of the current level is empty. A sketch of the full procedure
follows.
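
The level-wise procedure can be written compactly. The following Python sketch is illustrative only (it assumes transactions are given as sets); the join and prune steps named on the previous slide appear as comments:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search sketch; returns {frozenset: support count}."""
    # Scan the DB once to get the frequent 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent, k = dict(Lk), 1
    while Lk:
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        prev = list(Lk)
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step (Apriori property): drop any candidate that has
        # an infrequent k-subset.
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        counts = {c: sum(1 for t in transactions if c <= t) for c in cands}
        Lk = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent
```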



The Apriori Algorithm—An Example (min_sup = 2)

Database TDB:
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (after pruning {D}): {A}:2, {B}:3, {C}:3, {E}:3

C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → L3: {B,C,E}:2
Cont...
 Calculate confidence for the final frequent itemset {B, C, E},
whose support count is 2.
Nonempty proper subsets: {B}, {C}, {E}, {B,C}, {B,E}, {C,E}.
B ⇒ C ∧ E : confidence = 2/3 = 0.66 = 66%
C ⇒ B ∧ E : confidence = 2/3 = 0.66 = 66%
E ⇒ B ∧ C : confidence = 2/3 = 0.66 = 66%
B ∧ C ⇒ E : confidence = 2/2 = 1 = 100%
B ∧ E ⇒ C : confidence = 2/3 = 0.66 = 66%
C ∧ E ⇒ B : confidence = 2/2 = 1 = 100%
Average confidence = (66 + 66 + 66 + 100 + 66 + 100) / 6 ≈ 77.3%
(given minimum support = 2)
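
These six confidences can be reproduced mechanically: every nonempty proper subset A of the frequent itemset F defines a candidate rule A ⇒ F − A with confidence sup_count(F) / sup_count(A). A small sketch using the support counts found above:

```python
from itertools import combinations

# Support counts taken from the worked example (min_sup = 2).
sup = {frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
       frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
       frozenset("BCE"): 2}

F = frozenset("BCE")
for r in range(1, len(F)):                    # all nonempty proper subsets
    for A in map(frozenset, combinations(sorted(F), r)):
        conf = sup[F] / sup[A]
        print(f"{set(A)} => {set(F - A)}: {conf:.0%}")
```

Running it prints the same four 66% and two 100% confidences listed above.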



Improving the Efficiency of Apriori
 Hash-based technique
 Transaction reduction (a small sketch follows below)
 Partitioning
 Sampling
 Dynamic itemset counting
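
As an illustration of one of these techniques, here is a hedged sketch of transaction reduction: a transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be dropped from later scans. The function name is illustrative:

```python
def reduce_transactions(transactions, Lk):
    """Keep only transactions containing at least one frequent k-itemset;
    the rest cannot support any frequent (k+1)-itemset."""
    return [t for t in transactions if any(s <= t for s in Lk)]
```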



Construct FP-tree from a Transaction Database (min_support = 3)

TID  Items bought                (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the DB once to find the frequent 1-itemsets (single-item
patterns): f:4, c:4, a:3, b:3, m:3, p:3.
2. Sort the frequent items in frequency-descending order to obtain
the F-list: f-c-a-b-m-p.
3. Scan the DB again and insert each transaction's ordered frequent
items into the FP-tree, sharing common prefixes; a header table
links each item to its occurrences in the tree.

[FP-tree figure: root {} with branches f:4 → c:3 → a:3 → (m:2 → p:2
and b:1 → m:1), f:4 → b:1, and c:1 → b:1 → p:1.]
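
A minimal Python sketch of the three construction steps, assuming transactions are given as item collections (the node layout and names are illustrative, not the book's pseudocode):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, min_support):
    # Step 1: scan the DB once to count single items.
    freq = Counter(i for t in transactions for i in t)
    # Step 2: F-list = frequent items in frequency-descending order.
    flist = [i for i, c in sorted(freq.items(), key=lambda x: -x[1])
             if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    # Step 3: scan the DB again; insert each transaction's ordered
    # frequent items, sharing common prefixes.
    root, header = Node(None, None), {i: [] for i in flist}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist
```

Applied to the five transactions above with min_support = 3, this reproduces the F-list f-c-a-b-m-p (up to ties between equally frequent items) and the shared-prefix tree sketched in the figure.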
Benefits of the FP-tree Structure

 Completeness
 Preserves complete information for frequent pattern
mining
 Never breaks a long pattern of any transaction
 Compactness
 Reduces irrelevant information: infrequent items are gone
 Items are stored in frequency-descending order: the more
frequently an item occurs, the more likely its node is shared
 The tree is never larger than the original database.



From Conditional Pattern Bases to Conditional FP-trees

 For each pattern base
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the
pattern base

Example for item m (min_sup = 3):
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns relating to m:
m, fm, cm, am, fcm, fam, cam, fcam
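
For instance, accumulating the counts of m's pattern base is straightforward once the prefix paths have been read off the node-links. A small sketch with the values from this slide:

```python
from collections import Counter

# m's conditional pattern base: (prefix path, count of the m node below it).
m_base = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]

counts = Counter()
for path, n in m_base:
    for item in path:
        counts[item] += n

# b (count 1) falls below min_sup = 3 and is dropped; f, c, a remain,
# giving the m-conditional FP-tree f:3 -> c:3 -> a:3.
print({i: c for i, c in counts.items() if c >= 3})  # {'f': 3, 'c': 3, 'a': 3}
```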
Why Is FP-Growth the Winner?

 Divide-and-conquer:
 leads to focused search of smaller databases
 Other factors
 no candidate generation, no candidate test
 compressed database: FP-tree structure
 no repeated scan of entire database
 basic operations: counting local frequent items and building
sub-FP-trees.



Mining Various Kinds of Association Rules

 Mining multilevel association

 Mining multidimensional association

 Mining quantitative association

 Mining interesting correlation patterns



Kinds of mining rules

1. Multilevel association rules
- involve different levels of abstraction.
2. Multidimensional association rules
- involve more than one dimension or predicate,
e.g., what a customer buys as well as the
customer's age.
3. Quantitative association rules
- involve numeric attributes with an implicit order
among values, e.g., age.



Mining Multiple-Level Association Rules
 Items often form hierarchies
 Flexible support settings
 Items at a lower level are expected to have lower
support.
 Exploration of shared multi-level mining.

Uniform support vs. reduced support:

Item hierarchy: Milk [support = 10%] at level 1, with Butter Milk
[support = 6%] and Skim Milk [support = 4%] at level 2.

 Uniform support: min_sup = 5% at level 1 and at level 2.
 Reduced support: min_sup = 5% at level 1, min_sup = 3% at level 2.


 Uniform minimum support
- With min_sup = 5% at every level, milk and butter milk
are frequent, while skim milk (4%) is not.
 Reduced minimum support
- With min_sup = 5% at level 1 and 3% at level 2, milk,
butter milk, and skim milk are all considered frequent.
A small sketch of the two settings follows.
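
A small Python sketch using the numbers from the milk example (the function name and encoding are illustrative):

```python
# Observed support and hierarchy level of each item.
support = {"milk": 0.10, "butter milk": 0.06, "skim milk": 0.04}
level = {"milk": 1, "butter milk": 2, "skim milk": 2}

def frequent_items(min_sup_at_level):
    return [i for i in support if support[i] >= min_sup_at_level[level[i]]]

print(frequent_items({1: 0.05, 2: 0.05}))  # uniform: skim milk is pruned
print(frequent_items({1: 0.05, 2: 0.03}))  # reduced: all three are frequent
```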



Mining Multi-Dimensional Association
 Single-dimensional (intradimensional) rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
 Multi-dimensional rules: ≥ 2 dimensions or predicates
 Inter-dimension assoc. rules (no repeated predicates):
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
 Hybrid-dimension assoc. rules (repeated predicates):
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
 Categorical attributes: finite number of possible values, no
ordering among values (data cube approach)
 Quantitative attributes: numeric, implicit ordering among
values (discretization, clustering, and gradient approaches)



Mining Quantitative Associations

 Techniques can be categorized by how numerical attributes,
such as age or salary, are treated
1. Static discretization based on predefined concept hierarchies
(data cube methods)
2. Dynamic discretization based on data distribution
(quantitative association rules)
3. Clustering based on data characteristics or features (by
clustering methods).



Static Discretization of Quantitative Attributes

 Quantitative attributes are discretized prior to mining using
concept hierarchies.
 Numeric values are replaced by ranges.
 In a relational database, finding all frequent k-predicate sets
will require k+1 table scans.
 A data cube is well suited for mining; mining from data cubes
can be much faster.

[Cuboid lattice over the dimensions age, income, buys: apex
cuboid (); 1-D cuboids (age), (income), (buys); 2-D cuboids
(age, income), (age, buys), (income, buys); base cuboid
(age, income, buys).]
 The cells of the cuboids store aggregate counts, which are
essential for computing support and confidence.
 Dimensions: age, income, buys.
 The base cuboid aggregates the task-relevant data by
age, income, and buys.
 2-D cuboids: (age, income), (age, buys), (income, buys).
 1-D cuboids: (age), (income), (buys).
 The 0-D (apex) cuboid contains the total number of
transactions in the task-relevant data.



Quantitative Association Rules
 2-D quantitative association rules:
A_quan1 ∧ A_quan2 ⇒ A_cat
 A_quan1 and A_quan2 are tests on quantitative attribute
intervals; A_cat tests a categorical attribute.
 Intervals are chosen dynamically so that the confidence or
compactness of the rules is maximized.
 Because the numeric attributes are discretized dynamically
during mining, this is called dynamic discretization.
Example:
age(X, “34-35”) ∧ income(X, “30-50K”)
⇒ buys(X, “high resolution TV”)
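
A hedged sketch of the binning idea behind dynamic discretization: numeric values are mapped to intervals, and each interval then behaves like an ordinary item. The bin boundaries and labels here are illustrative only:

```python
def discretize(value, bins):
    """Return the label of the interval [lo, hi) containing value."""
    for lo, hi, label in bins:
        if lo <= value < hi:
            return label
    return None

age_bins = [(30, 34, "age:30-33"), (34, 36, "age:34-35")]
income_bins = [(0, 30, "income:<30K"), (30, 51, "income:30-50K")]

# A customer record becomes a set of interval items plus categorical items.
record = {"age": 34, "income": 42, "buys": "high resolution TV"}
items = {discretize(record["age"], age_bins),
         discretize(record["income"], income_bins),
         "buys:" + record["buys"]}
print(items)  # {'age:34-35', 'income:30-50K', 'buys:high resolution TV'}
```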



Association to correlation analysis

 Support and confidence measures are insufficient for filtering
out uninteresting association rules.
 A correlation measure can be used to augment the
support-confidence framework.
P ( A B )
lift 
P ( A) P ( B )

sup( X )
all _ conf 
max_ item _ sup( X )
sup( X )
coh 
| universe( X ) |
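
These measures are simple to compute once the probabilities or supports are known; a minimal sketch (the function names are illustrative):

```python
def lift(p_ab, p_a, p_b):
    """lift = P(A U B) / (P(A) * P(B)); 1 means A and B are independent."""
    return p_ab / (p_a * p_b)

def all_confidence(sup_x, max_item_sup_x):
    """all_conf(X) = sup(X) / max_item_sup(X)."""
    return sup_x / max_item_sup_x

# Numbers from the computer game / video example on the next slide:
print(round(lift(0.40, 0.60, 0.75), 2))  # 0.89 -> negatively correlated
```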



Interpreting the lift value:
1. lift < 1: the occurrence of A is negatively correlated with
the occurrence of B.
2. lift > 1: A and B are positively correlated.
3. lift = 1: no correlation; A and B are independent.



Example: correlation using lift
 1. The probability of purchasing a computer game is
P(game) = 0.60.
 2. The probability of purchasing a video is
P(video) = 0.75.
 3. The probability of purchasing both is
P(game, video) = 0.40.
Lift = P(A ∪ B) / (P(A) · P(B))
= 0.40 / (0.60 × 0.75) = 0.89
The value is less than 1, so purchasing a game and purchasing
a video are negatively correlated.



Constraints in Data Mining

 Users specify their intuition or expectations as
constraints to confine the search space.

 Knowledge type constraint:
 describes the type of knowledge to be mined,
e.g., association, correlation.

 Data constraint:
 specifies the set of task-relevant data.
 expressed using queries/tools.



Cont..

 Dimension/level constraint
- in relevance to region, price, brand, customer category.

 Rule (or pattern) constraint
- relationships among attributes or attribute values, and the
maximum or minimum number of predicates that may occur
in the rule antecedent and consequent.

 Interestingness constraint
- support, confidence, correlation;
specifies thresholds on these statistical measures.
Rule constraints
 Specify the syntactic form of the rules to be mined.
 Improve the efficiency of the mining process.
 Restrict mining to the set of rules the user expects.
 Help analyze relationships between variables.
 Often expressed as metarules (rule templates).



Another example: Apriori
TID LIST OF ITEMS
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3

Consider a minimum support count of 2.

Find the confidence of the generated rules.
(A check using the apriori sketch from earlier in the chapter follows.)
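
```python
# Assumes the apriori() sketch defined earlier in this chapter.
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]

freq = apriori(T, min_sup=2)
# Largest frequent itemsets: {I1,I2,I3}:2 and {I1,I2,I5}:2, so, e.g.,
# confidence(I1 ^ I5 => I2) = sup{I1,I2,I5} / sup{I1,I5} = 2/2 = 100%.
print({tuple(sorted(s)): c for s, c in freq.items() if len(s) == 3})
```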

