
Association pattern mining

Subhasis Ray

2023-04-19



Which patterns are interesting?

Not all frequent patterns are interesting.


A major problem is that an association rule of the form
A => B (support, confidence) can have high support and
confidence even when A and B are not strongly correlated.
We therefore extend it to a correlation rule:
A => B (support, confidence, correlation)
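Pearson's correlation coefficient (with binary encoding):

$$\rho_{x,y} = \frac{\mathrm{support}(x \cup y) - \mathrm{support}(x)\,\mathrm{support}(y)}{\sqrt{\mathrm{support}(x)\,\mathrm{support}(y)\,(1 - \mathrm{support}(x))\,(1 - \mathrm{support}(y))}}$$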
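
As a minimal sketch of the coefficient above, the following computes it from itemset supports. The data are the seven two-item transactions from the lift example later in these slides. The helper names `support` and `pearson` are illustrative, not part of any library. The sketch assumes neither item has support 0 or 1, since the denominator would then be zero.

```python
from math import sqrt

# The seven two-item transactions from the lift example in these slides.
transactions = [
    {"A", "C"}, {"A", "C"}, {"A", "D"}, {"A", "C"},
    {"B", "D"}, {"B", "C"}, {"B", "C"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def pearson(x, y, db):
    """Pearson correlation of the 0/1 indicator columns of items x and y."""
    sx, sy = support({x}, db), support({y}, db)
    sxy = support({x, y}, db)
    return (sxy - sx * sy) / sqrt(sx * sy * (1 - sx) * (1 - sy))

rho_ac = pearson("A", "C", transactions)
print(round(rho_ac, 4))  # 0.0913, a very weak positive correlation
```

Here support(A) = 4/7, support(C) = 5/7 and support(A∪C) = 3/7, so the numerator is 1/49 and the coefficient is 1/√120.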



Lift measures how the occurrence of one item lifts
that of the other.

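$$\mathrm{lift}(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}$$

Here P(A∪B) is the probability that a transaction contains both A and B.
If A and B are independent: lift(A, B) = 1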
If A and B are negatively correlated: lift(A, B) < 1
If A and B are positively correlated: lift(A, B) > 1
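
The three cases can be read off by rewriting lift as a ratio of a conditional to an unconditional probability. The link to confidence shown here is a standard identity rather than something stated on the slide, and P(A∪B) denotes the probability that a transaction contains both A and B.

```latex
\mathrm{lift}(A, B)
  = \frac{P(A \cup B)}{P(A)\,P(B)}
  = \frac{P(B \mid A)}{P(B)}
  = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}
% Under independence, P(A \cup B) = P(A)\,P(B), so the ratio equals 1.
% If seeing A makes B more likely, P(B \mid A) > P(B) and lift > 1;
% if it makes B less likely, P(B \mid A) < P(B) and lift < 1.
```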



Lift measures how the occurrence of one item lifts
that of the other.

Transaction database

ID  item1  item2
1   A      C
2   A      C
3   A      D
4   A      C
5   B      D
6   B      C
7   B      C

Lift for A => C

$$\mathrm{lift}(A, C) = \frac{P(A \cup C)}{P(A)\,P(C)} = \frac{3/7}{(4/7)(5/7)} = \frac{21}{20}$$
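
The same calculation can be sketched in code by counting directly from the seven listed transactions (A appears in four of them, C in five, and both together in three). The helper names `prob` and `lift` are illustrative assumptions. Exact fractions are used so the result can be compared with a hand calculation.

```python
from fractions import Fraction

# The seven transactions listed in the table above.
transactions = [
    {"A", "C"}, {"A", "C"}, {"A", "D"}, {"A", "C"},
    {"B", "D"}, {"B", "C"}, {"B", "C"},
]

def prob(itemset, db):
    """Exact fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in db), len(db))

def lift(a, b, db):
    """lift(a, b) = P(a and b together) / (P(a) * P(b))."""
    return prob({a, b}, db) / (prob({a}, db) * prob({b}, db))

lift_ac = lift("A", "C", transactions)
lift_ad = lift("A", "D", transactions)
print(lift_ac, float(lift_ac))  # 21/20 1.05
print(lift_ad, float(lift_ad))  # 7/8 0.875
```

On this toy database A and C are only slightly positively associated (lift just above 1), while A and D are slightly negatively associated (lift below 1).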



Recap: χ² test of correlation of nominal data
To see if two nominal variables A and B are independent, we
build a contingency table:

Contingency table

          A = a1   A = a2   Total
B = b1    o11      o12      Tb1
B = b2    o21      o22      Tb2
Total     Ta1      Ta2      n

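If A and B were independent, we'd expect A = a1 and B = b1 to
occur together e11 times, where

$$e_{11} = n \times p(A = a_1) \times p(B = b_1) = n \times \frac{T_{a_1}}{n} \times \frac{T_{b_1}}{n} = \frac{T_{a_1} \times T_{b_1}}{n}$$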
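
A short sketch of the expected-count calculation, applied to every cell of a table rather than only e11. The observed counts are a hypothetical example, not data from these slides, and `expected_counts` is an illustrative helper name.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2x2 table: rows are B = b1, b2 and columns are A = a1, a2.
observed = [[30, 10],
            [20, 40]]
expected = expected_counts(observed)
print(expected)  # [[20.0, 20.0], [30.0, 30.0]]
```

With row totals 40 and 60, column totals 50 and 50, and n = 100, the top-left cell is 40 × 50 / 100 = 20.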


Recap: χ² test of correlation of nominal data

Compute
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

Degrees of freedom = (r − 1) × (c − 1), where r is the
number of rows and c is the number of columns.
Look up the significance of this χ² value for this many
degrees of freedom in a table or with software.
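
The steps above can be sketched as follows for a table of any size. The observed counts are hypothetical, and `chi_square` is an illustrative helper name. The p-value line uses a closed form that is valid only for 1 degree of freedom; for larger tables a statistics table or library would be needed instead.

```python
from math import erfc, sqrt

def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical 2x2 table of observed counts.
observed = [[30, 10],
            [20, 40]]
chi2, dof = chi_square(observed)

# Upper-tail probability; this closed form holds for 1 degree of freedom only.
p_value = erfc(sqrt(chi2 / 2)) if dof == 1 else None
print(round(chi2, 3), dof)  # 16.667 1
```

For this table the p-value is far below 0.001, so independence would be rejected at any conventional significance level.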



For two items a and b, use χ² to check
for correlation

Contingency table for two items

          a      not a   Total
b         o11    o12     Tb1
not b     o21    o22     Tb2
Total     Ta1    Ta2     n

For any itemset X:

$$\chi^2(X) = \sum_{i=1}^{2^{|X|}} \frac{(o_i - e_i)^2}{e_i}$$

where the observed counts o_i and expected counts e_i are summed over
all possible binary encodings of X. For example, if X =
{a, b, c}, then we have to sum over 000, 001,
010, 011, 100, 101, 110, 111.
χ² close to 0 indicates statistical independence of the
items, and larger values indicate stronger dependence.
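
A minimal sketch of this sum over all presence/absence patterns of an itemset. It assumes that the expected count of a pattern is n times the product of the individual item probabilities (or their complements), that is, the count expected under independence. The function name `itemset_chi_square` is illustrative, and the data are the seven transactions from the lift example in these slides. Every item is assumed to have support strictly between 0 and 1, otherwise an expected count would be zero.

```python
from itertools import product

# The seven two-item transactions from the lift example in these slides.
transactions = [
    {"A", "C"}, {"A", "C"}, {"A", "D"}, {"A", "C"},
    {"B", "D"}, {"B", "C"}, {"B", "C"},
]

def itemset_chi_square(items, db):
    """Chi-square of an itemset, summed over all 2^|X| binary encodings."""
    items = list(items)
    n = len(db)
    p = [sum(item in t for t in db) / n for item in items]  # item supports
    stat = 0.0
    for bits in product([0, 1], repeat=len(items)):
        # Observed: transactions whose presence/absence pattern equals `bits`.
        observed = sum(
            all((item in t) == bool(b) for item, b in zip(items, bits))
            for t in db
        )
        # Expected under independence: n * product of p_k or (1 - p_k).
        expected = float(n)
        for pk, b in zip(p, bits):
            expected *= pk if b else 1 - pk
        stat += (observed - expected) ** 2 / expected
    return stat

chi2_ac = itemset_chi_square(["A", "C"], transactions)
print(round(chi2_ac, 4))  # 0.0583, close to 0
```

For X = {A, C} the four patterns 00, 01, 10, 11 have observed counts 1, 2, 1, 3 and expected counts 6/7, 15/7, 8/7, 20/7, giving χ² = 7/120. The same function handles three or more items by enumerating 8, 16, ... patterns.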


Thank you!

