Frequent Patterns 3

Other issues
 Multi-level patterns
 Sequential patterns
 Negative correlated patterns
 Fault tolerance patterns
 Fix
 Proportion
 High utility patterns
 Data Sanitization
Defining Negative Correlated Patterns (I)
 Definition 1 (association rule-based)
 Itemsets X and Y are both frequent. Moreover
and .
 Problem
 In a large retail store, customers often buy several types
of things in a single transaction. However, there might
be tens of thousands of items that have not been
bought.
 Produces too many negative association rules, most of
which are likely to be of very little interest.
Defining Negative Correlated Patterns (II)
 Definition 2 (support-based)
 If itemsets X and Y are both frequent but rarely occur together, i.e.,
sup(X ∪ Y) < sup(X) * sup(Y)
 Then X and Y are negatively correlated
 Problem:
 Support count A: 100, B: 100, AB: 1.
 When there are 200 transactions in total, we have
s(A ∪ B) = 0.005, s(A) * s(B) = 0.25, so s(A ∪ B) < s(A) * s(B)
 When there are 10^5 transactions, we have
s(A ∪ B) = 1/10^5, s(A) * s(B) = 1/10^3 * 1/10^3, so s(A ∪ B) > s(A) * s(B)
 Where is the problem? Null transactions: the support-based
definition is not null-invariant!
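The arithmetic in this example can be checked with a short Python sketch. The function name is hypothetical; the counts (A: 100, B: 100, AB: 1) are the ones from the example, evaluated for a database of 200 transactions and for one of 100,000 (10^5) transactions.

```python
def negatively_correlated_by_support(count_a, count_b, count_ab, n_transactions):
    """Support-based test: sup(A u B) < sup(A) * sup(B)."""
    s_a = count_a / n_transactions
    s_b = count_b / n_transactions
    s_ab = count_ab / n_transactions
    return s_ab < s_a * s_b


# Same support counts, different numbers of null transactions.
small_db = negatively_correlated_by_support(100, 100, 1, 200)
large_db = negatively_correlated_by_support(100, 100, 1, 10**5)
print(small_db, large_db)
```

The verdict flips only because transactions containing neither A nor B were added, which is what "not null-invariant" means here.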
Defining Negative Correlated Patterns (III)
 Definition 3 (Kulczynski measure-based) If itemsets X and Y
are frequent, but (P(X|Y) + P(Y|X))/2 < ε, where ε is a
negative pattern threshold, then X and Y are negatively
correlated.
 Ex. For the same problem (support count A: 100, B: 100, AB: 1),
whether there are 200 or 10^5 transactions, if ε = 0.02, we have
(P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 < ε
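A minimal Python sketch of the Kulczynski-based test, using the same support counts. The function name is hypothetical. The total number of transactions never enters the computation, which is why the measure is null-invariant.

```python
def kulczynski(count_a, count_b, count_ab):
    """Average of the two conditional probabilities P(A|B) and P(B|A)."""
    p_a_given_b = count_ab / count_b
    p_b_given_a = count_ab / count_a
    return (p_a_given_b + p_b_given_a) / 2


epsilon = 0.02                      # negative pattern threshold from the example
kulc = kulczynski(100, 100, 1)      # support counts A: 100, B: 100, AB: 1
is_negative = kulc < epsilon
print(kulc, is_negative)
```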
Other issues
 Fix
 Proportion
Fault-tolerant patterns
 Traditional association rules mining:
 Extracting exactly matching patterns
Case: fever over 38℃, throat hurt, headache
Head cold:
Symptoms: coughing, nose tearing, headache, throat hurt, fever, palpitations, vomiting…
Treatment: Vit-C,…
Introduction
 Fault-tolerant mining:
 Allowing limited inexactitude
Case: fever over 38℃, throat hurt, headache
Head cold:
Symptoms: coughing, nose tearing, headache, throat hurt, fever, palpitations, vomiting…
Treatment: Vit-C,…
Fault tolerance patterns
Initial idea
 [YFB01]: Discovery of groups of similar
transactions that share most items.
 Focusing on transactions, not items
For example, for p = 0.15, d = 0.8, N = 1,000,000, D = 500,
the probability of finding an ETI (error-tolerant itemset) with 5 items by chance is
approximately 10^-9300
 Sparse pattern problem (δ = 0.8, min_sup = 4):
tid  1 2 3 4 5 6  (items)
010  1 1 1 1 0 0
020  1 1 1 1 0 0
030  1 1 1 1 0 0
040  1 1 1 1 0 1
050  0 0 0 0 1 0
060  0 0 0 0 1 0
Problem description
-definition
 A transaction t FT-contains pattern X iff t contains
x, where x is a sub-pattern of X and |X|-|x| <= δ
 supFT(X) = # of transactions that FT-contain X.
 supitemB(X)(x) = # of transactions that contain item x among
the transactions which FT-contain X.
Example: X = {abcd}, δ=1
tid items
010 cde
020 bde
030 ade
040 abc
050 abc
supFT(X) = |{040, 050}| = 2
supitemB(X)(a) = |{040, 050}| = 2
supitemB(X)(d) = 0
Fix fault tolerance pattern
 A pattern X is an FT-pattern iff:
 1. supFT(X) >= min_supFT
 2. For each item x in X,
supitemB(X)(x) >= min_supitem
Example: X = {abcd}, δ=1, min_supFT=0.6, min_supitem=0.4
tid items
010 cde
020 abde
030 abde
040 abc
050 abc
supFT(X) = |{020, 030, 040, 050}| = 4
supitemB(X)(a) = |{020,030,040,050}| = 4
supitemB(X)(b) = |{020,030,040,050}| = 4
supitemB(X)(c) = |{040,050}| = 2
supitemB(X)(d) = |{020,030}| = 2
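The two conditions for a fixed fault tolerance pattern can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that min_supFT and min_supitem are given as fractions of the database size, which is how the values 0.6 and 0.4 are read here.

```python
def ft_supports(transactions, pattern, delta):
    """Return supFT(X) and the item supports for a fixed fault budget delta."""
    pattern = set(pattern)
    # t FT-contains X when at most delta items of X are missing from t.
    containing = [t for t in transactions if len(pattern - set(t)) <= delta]
    item_support = {x: sum(x in t for t in containing) for x in sorted(pattern)}
    return len(containing), item_support


def is_ft_pattern(transactions, pattern, delta, min_sup_ft, min_sup_item):
    """Check both conditions, with thresholds taken as fractions (assumption)."""
    n = len(transactions)
    sup_ft, item_support = ft_supports(transactions, pattern, delta)
    return (sup_ft / n >= min_sup_ft
            and all(count / n >= min_sup_item for count in item_support.values()))


db = ["cde", "abde", "abde", "abc", "abc"]   # transactions 010..050
sup_ft, item_support = ft_supports(db, "abcd", delta=1)
result = is_ft_pattern(db, "abcd", delta=1, min_sup_ft=0.6, min_sup_item=0.4)
print(sup_ft, item_support, result)
```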
Practice
X = {bcde}, δ=1, min_supFT=0.6, min_supitem=0.4
Is X an FT-pattern?
tid items
010 bcde
020 bde
030 abde
040 ace
050 abc
Proportional fault tolerance patterns
 Proportional fault-tolerant pattern mining:
Finding patterns X such that the items in each
sub-pattern of X with length |X|*δ frequently
occur together.
 For example:
X = {a b c d}, δ=0.75,
X is an FT-pattern => {a b c}, {a b d}, {a c d}, {b c d}
frequently occur
Proportional fault tolerance patterns
 A transaction t FT-contains pattern X iff t contains
x, where x is a sub-pattern of X and (|X| - |x|) / |X| <= 1 - δ,
i.e. |x| >= δ*|X| (δ is a fault-tolerant parameter)
 supFT(X) = # of transactions that FT-contain X.
 supitemB(X)(x) = # of transactions that contain item x among the
transactions which FT-contain X.
Example: X = {abcd}, δ=0.75
tid items
010 cde
020 bde
030 ade
040 abc
050 abc
supFT(X) = |{040, 050}| = 2
supitemB(X)(a) = |{040, 050}| = 2
supitemB(X)(d) = 0
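A minimal Python sketch of the proportional variant. It assumes the number of tolerated missing items is the floor of (1 - δ)·|X|, which agrees with the values #fault(3)=1, #fault(4)=1, #fault(5)=2 quoted for δ=0.6 in the exercise that follows. The function names are hypothetical, and δ is passed as a string so that it is converted to an exact fraction.

```python
from fractions import Fraction
from math import floor


def max_faults(pattern_len, delta):
    """Tolerated number of missing items: floor((1 - delta) * |X|)."""
    return floor((1 - Fraction(delta)) * pattern_len)


def proportional_ft_supports(transactions, pattern, delta):
    """Return supFT(X) and the item supports under a proportional budget."""
    pattern = set(pattern)
    allowed = max_faults(len(pattern), delta)
    containing = [t for t in transactions if len(pattern - set(t)) <= allowed]
    item_support = {x: sum(x in t for t in containing) for x in sorted(pattern)}
    return len(containing), item_support


db = ["cde", "bde", "ade", "abc", "abc"]   # transactions 010..050
sup_ft, item_support = proportional_ft_supports(db, "abcd", "0.75")
print(sup_ft, item_support)
```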
Problem description
-definition
 A pattern X is an FT-pattern iff:
 1. supFT(X) >= min_supFT
 2. For each item x in X,
supitemB(X)(x) >= min_supitem

Exercise
δ=0.6, min_supFT=5, min_supitem=2
Transactions:
cde
bde
ade
abc
abc
Find supFT(X) and the item supports for the following patterns:
abcd ?
abcde ?
#fault(3)=1
#fault(4)=1
#fault(5)=2
abcd: 2, (2, 2, 2, 0)
abcde: 5, (3, 3, 3, 3, 3)
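Problem description
-observation
 # fault(|X|) = ⌊(1 - δ) * |X|⌋
 Gap: consider the case δ=0.5, min_supFT=5, min_supitem=2
ab
ab
ab
cd
cd
abcd is an FT-pattern! Every transaction holds 2 of its 4 items, so supFT = 5,
and the item supports are (3, 3, 2, 2), although no transaction mixes {ab} with {cd}.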
Example
Property
 Lemma 2.1 If an item y is at a distance greater than 2
from x in the FT-association graph, then a pattern P
which contains both x and y cannot be a frequent FT-pattern.
P: x………… + y…………
|Px| ≤ 0.5 × |P| or |Py| ≤ 0.5 × |P|
 The transactions which contain Py will never FT-contain P
=> supitemB(P)(y) = 0
Data Preprocessing
 In order to avoid scanning the whole database
when checking candidates, the original database
is transformed into a bitmap
Candidate generation and pruning
 The data structure of FT-association graph
Checking candidates
 Extract bitmap(P) for a candidate P
 Calculate the supFT of P and the supitemB(P)(i) of
each item i of P
 Let candidate P = abcde, the bitmap(P) is
shown below
If the number of 1s in the row of a transaction T is at least
δ × |P|, T FT-contains P
If the number of 1s for an item of P is less than
min_supitem, P is not a frequent
FT-pattern
Other issues
 Fix
 Proportion
High Utility Itemset Mining
 Mining high utility itemsets from the databases
refers to finding the itemsets with high utilities.
Problem definition
 Transaction database and profit table.
(Cont.)
 Definition 1. u(ip, Td) = p(ip)*q(ip, Td), where p(ip) is the unit profit
of item ip and q(ip, Td) is its quantity in transaction Td
 E.g. u({A}, T1) = 5*1 = 5
(Cont.)
 Definition 2. u(X, Td) = Σ_{ip ∈ X} u(ip, Td)
 E.g. u({AD}, T1) = u({A}, T1) + u({D}, T1)
= 5*1 + 2*1 = 7
(Cont.)
 Definition 3. u(X) = Σ_{Td ⊇ X} u(X, Td)
 E.g. u({AD}) = u({AD}, T1) + u({AD}, T3) = (5*1+2*1) +
(5*1+2*6) = 24
(Cont.)
 Definition 4. X is a high utility itemset if u(X) ≥ min_utility;
otherwise X is a low utility itemset.
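Definitions 1 to 4 can be sketched in Python as follows. The transaction database and profit table are not reproduced in the text, so the tables below are an assumed reconstruction chosen to agree with every worked figure in these slides; the entries for items B, E, F and G in particular are hypothetical. The function names are also hypothetical.

```python
# Assumed unit profits and quantities (not given explicitly in the text).
PROFIT = {"A": 5, "B": 2, "C": 1, "D": 2, "E": 3, "F": 1, "G": 1}
DB = {
    "T1": {"A": 1, "C": 1, "D": 1},
    "T2": {"A": 2, "C": 6, "E": 2, "G": 5},
    "T3": {"A": 1, "B": 2, "C": 1, "D": 6, "E": 1, "F": 5},
    "T4": {"B": 4, "C": 3, "D": 3, "E": 1},
    "T5": {"B": 2, "C": 2, "E": 1, "G": 2},
}


def item_utility(item, tid):
    """Definition 1: unit profit times quantity in the transaction."""
    return PROFIT[item] * DB[tid][item]


def itemset_utility_in(itemset, tid):
    """Definition 2: sum of the item utilities of X inside one transaction."""
    return sum(item_utility(item, tid) for item in itemset)


def itemset_utility(itemset):
    """Definition 3: sum over all transactions that contain every item of X."""
    return sum(itemset_utility_in(itemset, tid)
               for tid, row in DB.items()
               if all(item in row for item in itemset))


def is_high_utility(itemset, min_utility):
    """Definition 4: X is high utility when u(X) >= min_utility."""
    return itemset_utility(itemset) >= min_utility


print(itemset_utility("AD"), itemset_utility("CD"), itemset_utility("ACD"))
```

With min_utility = 26 this reproduces the contrast used in the slides: {CD} is low utility while its superset {ACD} is high utility.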
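Goal
 Find all high utility itemsets
 However, utility is not downward closed: a superset can be high utility
while its subset is not. With min_utility=26:
u({CD}) = (1*1+2*1) + (1*1+2*6) + (1*3+2*3) = 25
u({ACD}) = (5*1+1*1+2*1) + (5*1+1*1+2*6) = 26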
02/20/2023
(Cont.)
 Definition 5. TU(Td) = u(Td, Td)
 E.g. TU(T2) = (5*2) + (1*6) + (3*2) + (1*5) = 27
2013/01/03
(Cont.)
 Definition 6. TWU(X) = Σ_{Td ⊇ X} TU(Td)
 E.g. TWU({AD}) = TU(T1) + TU(T3) = 8 + 30 = 38
 If min_utility ≤ TWU(X), X is called a high
transaction-weighted utilization itemset (HTWUI).
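(Cont.)
 Definition 7. Transaction-weighted downward
closure (TWDC): if X is not an HTWUI, then every superset of X
is a low utility itemset.
 E.g. (min_utility = 40)
 Any superset of {AD} is a low utility
itemset since {AD} is not an HTWUI; for example,
u({ACD}) = (5+1+2)+(5+1+12) = 26 < 40.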
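Transaction utility, TWU and the pruning test can be sketched in Python as follows. As before, the profit and quantity tables are an assumed reconstruction that matches the worked figures (TU(T1) = 8, TU(T2) = 27, TU(T3) = 30); items B, E, F and G and all function names are hypothetical. The threshold of 40 is the one implied by the "26 < 40" comparison.

```python
# Assumed unit profits and quantities (not given explicitly in the text).
PROFIT = {"A": 5, "B": 2, "C": 1, "D": 2, "E": 3, "F": 1, "G": 1}
DB = {
    "T1": {"A": 1, "C": 1, "D": 1},
    "T2": {"A": 2, "C": 6, "E": 2, "G": 5},
    "T3": {"A": 1, "B": 2, "C": 1, "D": 6, "E": 1, "F": 5},
    "T4": {"B": 4, "C": 3, "D": 3, "E": 1},
    "T5": {"B": 2, "C": 2, "E": 1, "G": 2},
}


def transaction_utility(tid):
    """TU(Td): utility of all items bought in the transaction."""
    return sum(PROFIT[item] * qty for item, qty in DB[tid].items())


def twu(itemset):
    """TWU(X): sum of TU over the transactions containing every item of X."""
    return sum(transaction_utility(tid)
               for tid, row in DB.items()
               if all(item in row for item in itemset))


def is_htwui(itemset, min_utility):
    """X is an HTWUI when min_utility <= TWU(X)."""
    return twu(itemset) >= min_utility


min_utility = 40
print(transaction_utility("T2"), twu("AD"), is_htwui("AD", min_utility))
```

Because TWU(X) is an upper bound on the utility of X and of every superset of X, an itemset that fails this test can be pruned together with all of its supersets.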
Other issues
 Fix
 Proportion
Sensitive itemsets hiding
Consider a supermarket and two beer suppliers A and B.
 The supermarket releases its database to the suppliers in
exchange for lower prices on goods, so supplier A can mine
the association rules related to its products for the
purpose of sales promotion.
 Suppose supplier A finds that most customers who buy diapers
also buy B’s beers.
 A runs a coupon giving a 10 percent discount when A’s beers are
bought together with diapers. As a result, sales of B’s beers
decrease and B can no longer give the supermarket as low a price
as before. Finally, supplier A monopolizes the beer market and is
unwilling to give a low price to the supermarket any more.
 From this aspect, releasing the database is bad for the
supermarket.
Problem definition
 Sensitive itemsets: frequent itemsets that raise privacy or
security concerns and have to be hidden in the database
 Non-sensitive itemsets: frequent itemsets that raise no privacy
or security concern and do not have to be hidden in the database
Problem definition
 The problem of hiding sensitive itemsets
 Sensitive itemsets: the frequencies of sensitive itemsets have
to be decreased in the database
 How to choose the transactions with sensitive itemsets?
 Non-sensitive itemsets: the non-sensitive itemsets have to
be protected in the database
 Side effects: frequent itemsets become non-frequent itemsets
after the hiding process
 How to reduce side effects?
 Quality of database: the difference between the original
database and the sanitized one has to be minimized
 How to measure the quality of database?
Example
Min_sup=3
Sensitive itemsets: {AB, AE}

Tid  items
1    ABC
2    ABCE
3    ABE
4    BCE
5    ADE
6    AE

Frequent itemset  support
A                 5
B                 4
C                 3
E                 5
AB                3
AE                4
BC                3
BE                3
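Example
Min_sup=3
Sensitive itemsets: {AB, AE}
Supports after one choice of hiding modifications:

Frequent itemset  support  after hiding
A                 5
B                 4
C                 3
E                 5
AB                3        2
AE                4        2
BC                3
BE                3        2 !!! (side effect: a non-sensitive itemset is no longer frequent)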
Example
Min_sup=3
Sensitive itemsets: {AB, AE}
Supports after a better choice of hiding modifications:

Frequent itemset  support  after hiding
A                 5
B                 4
C                 3
E                 5
AB                3        2
AE                4        2
BC                3
BE                3        (unchanged, no side effect)
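The effect of a hiding choice on the supports can be sketched in Python as follows. The example does not state which items were deleted, so the two deletion lists below are hypothetical choices that reproduce the reported supports (AB = 2, AE = 2, and BE = 2 or 3); the function names are also hypothetical.

```python
FREQUENT = ["A", "B", "C", "E", "AB", "AE", "BC", "BE"]
SENSITIVE = {"AB", "AE"}
MIN_SUP = 3

original = {1: set("ABC"), 2: set("ABCE"), 3: set("ABE"),
            4: set("BCE"), 5: set("ADE"), 6: set("AE")}


def support(db, itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(set(itemset) <= items for items in db.values())


def sanitize(db, deletions):
    """Return a copy of db with the given (tid, item) pairs removed."""
    out = {tid: set(items) for tid, items in db.items()}
    for tid, item in deletions:
        out[tid].discard(item)
    return out


def side_effects(db):
    """Non-sensitive itemsets that were frequent but are now infrequent."""
    return [x for x in FREQUENT
            if x not in SENSITIVE and support(db, x) < MIN_SUP]


# Hypothetical deletion choices, not taken from the slides.
careless = sanitize(original, [(3, "A"), (2, "E")])
careful = sanitize(original, [(2, "A"), (6, "A")])

print(support(careless, "AB"), support(careless, "AE"), side_effects(careless))
print(support(careful, "AB"), support(careful, "AE"), side_effects(careful))
```

Both choices hide AB and AE, but only the first one also pushes the non-sensitive itemset BE below Min_sup.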
Four classes
 Heuristic based approaches
 Border based approaches
 Constraint-satisfaction problem approaches
 Database reconstruction approaches
Constraint Satisfaction Problem Model
Preliminaries
 F: frequent itemsets (the frequency in the database is larger
than or equal to msup)
 Ex: F={A,B,C,D,AB,AC,AD,CD,ACD}
 SI: sensitive itemsets
 Ex: SI={AB}
 SS: supersets of sensitive itemsets in F
 Ex: SS={AB}
 F’: frequent itemsets in the sanitized database (F’=F-SS)
 Ex: F’={A,B,C,D,AC,AD,CD,ACD}
Transaction database:
    A B C D
T1  1 0 1 0
T2  1 0 1 1
T3  0 0 1 1
T4  0 1 0 0
T5  1 1 1 1
T6  0 0 0 1
T7  0 0 1 0
T8  1 1 0 0
Preliminaries (cont.)
 The concept of border in [DV06]:
 B+(F) (positive border): set of all maximally frequent itemsets
ex: B+(F)={AD,BD,ABC}
Itemset lattice (the border separates the frequent itemsets from the infrequent ones):
null
A B C D
AB AC AD BC BD CD
ABC ABD ACD BCD
ABCD
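A minimal Python sketch of the positive border as the set of maximal itemsets: an itemset belongs to B+ when no other itemset in the collection is a proper superset of it. The function name is hypothetical. The two inputs are the sets F’ used in the CSP examples of these notes.

```python
def positive_border(frequent):
    """Return the maximal itemsets of a collection, as sorted strings."""
    sets = [frozenset(x) for x in frequent]
    # Keep an itemset only if no other itemset strictly contains it.
    maximal = [s for s in sets if not any(s < other for other in sets)]
    return sorted("".join(sorted(s)) for s in maximal)


border_1 = positive_border(["A", "B", "C", "D", "BC", "BD", "CD", "BCD"])
border_2 = positive_border(["A", "B", "C", "D", "BC", "BD"])
print(border_1, border_2)
```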
CSP
 Constraint Satisfaction Problem (CSP): A
solution of a CSP is a complete assignment
of values to the variables that satisfies all the
constraints
 Ex:
minsup=0.2, msup count=6×0.2=1.2
SI={AB}, SS={AB},
F’=F-SS={A,B,C,D,BC,BD,CD,BCD}
B+(F’)={A,BCD}
Transaction database (entries that may be changed are shown with their variable):
    A        B        C  D
T1  0        1        1  1
T2  0        0        1  0
T3  0        0        0  1
T4  1 (u41)  1 (u42)  1  1
T5  1 (u51)  1 (u52)  0  0
T6  1        0        0  0
Constraints:
1 + u41 + u51 ≥ 1.2          A: frequent
1 + u42 ≥ 1.2                BCD: frequent
u41*u42 + u51*u52 < 1.2      AB: infrequent
Maximize(u41+u42+u51+u52)
CSP
 Distance: the number of bits that differ between the original and the
sanitized database
 A lower distance means better hiding performance
Original database:
    A B C D
T1  0 1 1 1
T2  0 0 1 0
T3  0 0 0 1
T4  1 1 1 1
T5  1 1 0 0
T6  1 0 0 0
Solutions:
solution  u41 u42 u51 u52
l1        0   1   1   1
l2        1   1   0   1
l3        1   1   1   0
Distance=1
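Since the four variables are binary, this small CSP can be solved by enumerating all 16 assignments. The Python sketch below is a brute-force illustration under that assumption, not the solver used by the CSP approaches; the function and variable names are hypothetical.

```python
from itertools import product

MSUP = 6 * 0.2  # minimum support count for 6 transactions, about 1.2


def feasible(u41, u42, u51, u52):
    """Check the three constraints of the example."""
    a_frequent = 1 + u41 + u51 >= MSUP           # A stays frequent
    bcd_frequent = 1 + u42 >= MSUP               # BCD stays frequent
    ab_infrequent = u41 * u42 + u51 * u52 < MSUP  # AB becomes infrequent
    return a_frequent and bcd_frequent and ab_infrequent


candidates = [u for u in product((0, 1), repeat=4) if feasible(*u)]
best = max(sum(u) for u in candidates)           # keep as many 1s as possible
solutions = sorted(u for u in candidates if sum(u) == best)
distance = 4 - best                              # bits flipped from 1 to 0

print(solutions, distance)
```

The enumeration returns the three assignments l1, l2 and l3 listed in the solution table, each at distance 1 from the original database.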
Practice
minf=0.2, msup=6×0.2=1.2
SI={AB,CD}, SS={AB,CD,BCD},
F’=F-SS={A,B,C,D,BC,BD}
Transaction database (entries that may be changed are shown with their variable):
    A        B        C        D
T1  0        1        1 (u13)  1 (u14)
T2  0        0        1        0
T3  0        0        0        1
T4  1 (u41)  1 (u42)  1 (u43)  1 (u44)
T5  1 (u51)  1 (u52)  0        0
T6  1        0        0        0
CSP
 CSP is infeasible: remove the itemsets of maximal size and minimum
support from B+(F’) until the CSP is feasible
Ex:
minf=0.2, msup=6×0.2=1.2
SI={AB,CD}, SS={AB,CD,BCD},
F’=F-SS={A,B,C,D,BC,BD}
B+(F’)={A,BC,BD}
Transaction database:
    A    B    C    D
T1  0    1    u13  u14
T2  0    0    1    0
T3  0    0    0    1
T4  u41  u42  u43  u44
T5  u51  u52  0    0
T6  1    0    0    0
Constraints:
1 + u41 + u51 ≥ 1.2          A: frequent
u13 + u42*u43 ≥ 1.2          BC: frequent
u14 + u42*u44 ≥ 1.2          BD: frequent
u41*u42 + u51*u52 < 1.2      AB: infrequent
u13*u14 + u43*u44 < 1.2      CD: infrequent
Maximize(u13+u14+u41+u42+u51+u52)
Keeping BC and BD frequent forces u13 = u14 = u42 = u43 = u44 = 1, which gives
CD a count of 2 ≥ 1.2, so the constraints cannot all hold.
CSP
Solution  u41 u42 u51 u52 u13 u14 u43 u44
BC: infrequent
l1        0   1   1   1   1   1   0   1
l2        1   1   0   1   1   1   0   1
l3        1   1   1   0   1   1   0   1
l4        0   1   1   1   0   1   1   1
l5        1   1   0   1   0   1   1   1
l6        1   1   1   0   0   1   1   1
BD: infrequent
l7        0   1   1   1   1   0   1   1
l8        1   1   0   1   1   0   1   1
l9        1   1   1   0   1   0   1   1
l10       0   1   1   1   1   1   1   0
l11       1   1   0   1   1   1   1   0
l12       1   1   1   0   1   1   1   0
Approach | Execution time | Scalability | Hiding failure | Information loss | Modification degree
Heuristic approaches: sensitive transactions identification methods | moderate | low | low | moderate | moderate
Heuristic approaches: sensitive associations clustering methods | moderate | moderate | none | moderate | moderate
Heuristic approaches: sanitization matrix multiplication methods | fast | good | very low | bad | bad
Border based approaches | moderate | moderate | none | good | good
Constraint-satisfaction problem approaches | slow | low | none | none (for the case that CSP is feasible) | very good
Database reconstruction approaches | slow | low | none | good | moderate
Summary
 Basic concepts: association rules, support-confidence
framework, closed and max-patterns
 Scalable frequent pattern mining methods
 Apriori (Candidate generation & test)
 Projection-based (FPgrowth, CLOSET+, ...)
 Vertical format approach (ECLAT, CHARM, ...)
Frequent-Pattern Mining: Research Problems
 Multilevel
 Negative
 Fix
 Proportion
 Data sanitization
