5 views

Uploaded by phani

irs

irs

© All Rights Reserved

- vistextechnicalbywww-150819142955-lva1-app6892.pdf
- IRJET-Comparison of Different Techniques of High Utility Mining for Representation and Graphical Analysis
- Learning Image-Text Associations
- 07. Closet - An Efficient Algorithm for Mining Frequent
- Bh Win 03 Security Friday
- The Recommender Problem Revisited
- R Refcard Data Mining
- Bank Marketing Startegy Using Clustering
- Mining Medical Specialist Billing Patterns For
- Linking Behavioral Patterns to Personal Attributes through Data Re-Mining
- Hash-Close: A Hash-Based Algorithm for Mining Frequent Closed Itemsets
- V3I4-IJERTV3IS041942
- A Systematic Review on Mining Web Navigation Pattern Using Graph Based Techniques
- 1Intro2
- 19-BDT-MarketBasketAnalysis.pdf
- Data Mining Quiz
- Generate frequent item set in serial and parallel approach
- Arul Britto
- [IJCST-V4I3P21]:Shivakeshi.C , Mahantesh.H.M, Naveen Kumar.B, Pampapathi.B.M
- IRJET-Effecient Support Itemset Mining using Parallel Map Reducing

You are on page 1of 65

Concepts and

Techniques

(3rd ed.)

Chapter 6

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &

Simon Fraser University

2011 Han, Kamber & Pei. All rights reserved.

1

Association and Correlations: Basic

Concepts and Methods

Basic Concepts

Evaluation Methods

Summary

Analysis?

etc.) that occurs frequently in a data set

of frequent itemsets and association rule mining

Applications

analysis, Web log (click stream) analysis, and DNA sequence analysis.

Important?

datasets

Foundation for many essential data mining tasks

Association, correlation, and causality analysis

Sequential, structural (e.g., sub-graph) patterns

Pattern analysis in spatiotemporal, multimedia, timeseries, and stream data

Classification: discriminative, frequent pattern

analysis

Cluster analysis: frequent pattern-based clustering

Data warehousing: iceberg cube and cube-gradient

Semantic data compression: fascicles

Broad applications

4

Patterns

Tid

Items bought

10

20

30

40

50

Milk

Customer

buys both

Customer

buys diaper

Customer

buys beer

items

k-itemset X = {x1, , xk}

(absolute) support, or,

support count of X: Frequency

or occurrence of an itemset X

(relative) support, s, is the

fraction of transactions that

contains X (i.e., the

probability that a transaction

contains X)

An itemset X is frequent if Xs

support is no less than a

minsup threshold

5

Ti

d

Items bought

10

20

30

40

50

Nuts, Coffee, Diaper, Eggs,

Milk

Customer

Customer

buys both

buys

diaper

Customer

buys beer

minimum support and

confidence

a transaction contains X

Y

confidence, c, conditional

probability that a

transaction having X also

contains Y

Let

minsup = 50%,

minconf

= 50%

Association

rules:

(many

Freq.more!)

Pat.: Beer:3, Nuts:3, Diaper:4,

Eggs:3,

{Beer,

Diaper}:3

Beer

Diaper

(60%,

100%)

A long pattern contains a combinatorial number of subpatterns, e.g., {a1, , a100} contains (1001) + (1002) + +

(110000) = 2100 1 = 1.27*1030 sub-patterns!

super-pattern Y X, with the same support as X

(proposed by Pasquier, et al. @ ICDT99)

exists no frequent super-pattern Y X (proposed by

Bayardo @ SIGMOD98)

7

<a1, , a100>: 1

Min_sup = 1.

<a1, , a100>: 1

!!

8

Itemset Mining

case?

the minsup threshold

number of frequent itemsets

of transactions

frequent 103 times in 109 transactions?

9

Association and Correlations: Basic

Concepts and Methods

Basic Concepts

Evaluation Methods

Summary

10

Methods

Approach

Approach

Data Format

11

Scalable Mining Methods

Any subset of a frequent itemset must be frequent

If {beer, diaper, nuts} is frequent, so is {beer,

diaper}

i.e., every transaction having {beer, diaper, nuts}

also contains {beer, diaper}

Scalable mining methods: Three major approaches

Apriori (Agrawal & Srikant@VLDB94)

Freq. pattern growth (FPgrowthHan, Pei & Yin

@SIGMOD00)

Vertical data format approach (CharmZaki &

Hsiao @SDM02)

12

Approach

which is infrequent, its superset should not be

generated/tested! (Agrawal & Srikant @VLDB94,

Mannila, et al. @ KDD 94)

Method:

length k frequent itemsets

be generated

13

Database TDB

Tid

Items

10

A, C, D

20

B, C, E

30

A, B, C, E

40

B, E

L2

Supmin = 2 Itemset

C1

1st scan

C2

sup

{A}

{B}

{C}

{D}

{E}

Itemset

sup

1

Itemset

sup

{A, B}

{A, C}

{A, C}

{B, C}

{A, E}

{B, E}

{B, C}

{C, E}

{B, E}

{C, E}

C3

Itemset

{B, C, E}

3rd scan

L3

L1

Itemset

sup

{A}

{B}

{C}

{E}

C2

2nd scan

Itemset

{A, B}

{A, C}

{A, E}

{B, C}

{B, E}

{C, E}

Itemset

sup

{B, C, E}

2

14

Ck: Candidate itemset of size k

Lk : frequent itemset of size k

L1 = {frequent items};

for (k = 1; Lk !=; k++) do begin

Ck+1 = candidates generated from Lk;

for each transaction t in database do

increment the count of all candidates in Ck+1

that are contained in t

Lk+1 = candidates in Ck+1 with min_support

end

return k Lk;

15

Implementation of Apriori

Step 1: self-joining Lk

Step 2: pruning

Example of Candidate-generation

Self-joining: L3*L3

Pruning:

C4 = {abcd}

16

One transaction may contain many candidates

Method:

and counts

contained in a transaction

17

Tree

Subset function

3,6,9

1,4,7

Transaction: 1 2 3 5 6

2,5,8

1+2356

234

567

13+56

145

136

12+356

124

457

125

458

345

356

357

689

367

368

159

18

Implementation

insert into Ck

select p.item1, p.item2, , p.itemk-1, q.itemk-1

from Lk-1 p, Lk-1 q

where p.item1=q.item1, , p.itemk-2=q.itemk-2, p.itemk-1 <

q.itemk-1

Step 2: pruning

forall itemsets c in Ck do

forall (k-1)-subsets s of c do

if (s is not in Lk-1) then delete c from Ck

for efficient implementation [See: S. Sarawagi, S. Thomas, and R.

Agrawal. Integrating association rule mining with relational database

systems: Alternatives and implications. SIGMOD98]

19

Methods

Format

20

candidates

21

Twice

be frequent in at least one of the partitions of DB

Scan 1: partition database and find local frequent

patterns

Scan 2: consolidate global frequent patterns

A. Savasere, E. Omiecinski and S. Navathe, VLDB95

DB1

sup1(i) <

DB1

DB2

sup2(i) <

DB2

DBk

supk(i) <

DBk

DB

sup(i) < DB

Candidates

A k-itemset whose corresponding hashing bucket count is below

Candidates: a, b, c, d, e

Hash entries

count itemset

s ad,

{ab,

35

88

.

.

.

ae}

{bd, be,

de}

.

.

.

10

{yz, qs,

2 Hash

wt}

Table

Frequent 1-itemset: a, b, d, e

ad, ae} is below support threshold

mining association rules. SIGMOD95

23

frequent patterns within sample using Apriori

found in sample, only borders of closure of

frequent patterns are checked

patterns

association rules. In VLDB96

24

ABCD

AB

AC

BC

AD

BD

CD

frequent, the counting of AD begins

Once all length-2 subsets of BCD are

determined frequent, the counting of

BCD begins

Transactions

A

Apriori

{}

Itemset lattice

S. Brin R. Motwani, J. Ullman,

and S. Tsur. Dynamic itemset DIC

counting and implication

rules for market basket data.

SIGMOD97

1-itemsets

2-itemsets

1-itemsets

2-items

3-items

25

Methods

Format

26

Patterns Without Candidate Generation

The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD 00)

Depth-first search

Major philosophy: Grow long patterns from short ones using local

frequent items only

27

Database

TID

100

200

300

400

500

Items bought

(ordered) frequent items

{f, a, c, d, g, i, m, p}

{f, c, a, m, p}

{a, b, c, f, l, m, o}

{f, c, a, b, m}

min_support = 3

{b, f, h, j, o, w}

{f, b}

{b, c, k, s, p}

{c, b, p}

{}

{a, f, c, e, l, p, m, n}

{f, c, a, m, p}

Header Table

1. Scan DB once, find

f:4

c:1

Item frequency head

frequent 1-itemset

f

4

(single item pattern)

c:3 b:1 b:1

c

4

2. Sort frequent items in

a

3

frequency descending

b

3

a:3

p:1

m

3

order, f-list

p

3

m:2 b:1

3. Scan DB again,

construct FP-tree

F-list = f-c-a-b-m-p p:2 m:1

28

subsets according to f-list

F-list = f-c-a-b-m-p

Patterns containing p

Patterns having m but no p

Patterns having c but no a nor b, m, p

Pattern f

Completeness and non-redundency

29

Database

Traverse the FP-tree by following the link of each frequent

item p

Accumulate all of transformed prefix paths of item p to form

ps conditional pattern base

{}

Header Table

Item frequency head

f

4

c

4

a

3

b

3

m

3

p

3

f:4

c:3

c:1

b:1

a:3

b:1

p:1

item

f:3

fc:3

m:2

b:1

fca:2, fcab:1

p:2

m:1

fcam:2, cb:1

30

Accumulate the count for each item in the base

Construct the FP-tree for the frequent items of

the pattern base

Header Table

Item frequency head

f

4

c

4

a

3

b

3

m

3

p

3

fca:2, fcab:1

{}

f:4

c:3

c:1

b:1

a:3

b:1

p:1

{}

f:3

m:2

b:1

c:3

p:2

m:1

a:3

All frequent

patterns relate to m

m,

fm, cm, am,

fcm, fam, cam,

fcam

m-conditional FP-tree

31

{}

{}

f:3

Cond. pattern base of am: (fc:3)

c:3

f:3

c:3

a:3

am-conditional FP-tree

{}

f:3

m-conditional FP-tree

cm-conditional FP-tree

{}

cam-conditional FP-tree

32

single prefix-path P

{}

a1:n1

a2:n2

parts

a3:n3

b1:m1

C2:k2

r1

{}

C1:k1

C3:k3

r1

a1:n1

a2:n2

a3:n3

b1:m1

C2:k2

C1:k1

C3:k3

33

Completeness

pattern mining

Compactness

frequently occurring, the more likely to be shared

count node-links and the count field)

34

Method

Recursively grow frequent patterns by pattern

and database partition

Method

For each frequent item, construct its conditional

pattern-base, and then its conditional FP-tree

Repeat the process on each newly created

conditional FP-tree

Until the resulting FP-tree is empty, or it contains

only one pathsingle path will generate all the

combinations of its sub-paths, each of which is a

frequent pattern

35

Projection

DB projection

Parallel projection

Partition projection

Passing the unprocessed parts to the subsequent

partitions

36

Partition-Based Projection

lot of disk space

Tran. DB

fcamp

fcabm

fb

cbp

fcamp

fcam

fcab

f

fc

f

cb

fca

cb

fcam

fca

am-proj DB cm-proj DB

fc

f

fc

f

fc

f

37

Datasets

100

D2 FP-growth

120

D1 Apriori runtime

80

70

D2 TreeProjection

100

Runtime (sec.)

Run time(sec.)

140

D1 FP-grow th runtime

90

60

50

40

30

20

80

60

40

20

10

0

0

0.5

1

1.5

2

Support threshold(%)

2.5

0.5

1.5

38

Approach

Divide-and-conquer:

frequent patterns obtained so far

Other factors

Basic ops: counting local freq items and building sub FPtree, no pattern search and matching

FPGrowth

39

Methods

pattern (CFP) tree

Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining

Implementations (FIMI'03), Melbourne, FL, Nov. 2003

40

Methodology

CLOSET (DMKD00), FPclose, and FPMax (Grahne & Zhu, Fimi03)

Mining sequential patterns

PrefixSpan (ICDE01), CloSpan (SDM03), BIDE (ICDE04)

Mining graph patterns

gSpan (ICDM02), CloseGraph (KDD03)

Constraint-based mining of frequent patterns

Convertible constraints (ICDE01), gPrune (PAKDD03)

Computing iceberg data cubes with complex measures

H-tree, H-cubing, and Star-cubing (SIGMOD01, VLDB03)

Pattern-growth-based Clustering

MaPle (Pei, et al., ICDM03)

Pattern-Growth-Based Classification

Mining frequent and discriminative patterns (Cheng, et al,

ICDE07)

41

Methods

Format

42

Format

Hsiao@SDM02)

43

Methods

Format

44

CLOSET

TID

Items

Patterns having d

10

a, c, d, e, f

20

a, b, e

30

c, e, f

40

a, c, d, f

50

c, e, f

Min_sup=2

Flist: d-a-f-e-c

frequent closed pattern

for Mining Frequent Closed Itemsets", DMKD'00.

then Y is merged with X

all of Xs descendants in the set enumeration tree can be

pruned

support in several header tables at different levels, one

can prune it from the header table at higher levels

A, B, C, D, E

Tid

Items

10

A, B, C, D, E

20

B, C, D, E,

30

A, C, D, F

Potential

CD, CE, CDE, DE

maxSince BCDE is a max-pattern, no need

to check

patterns

BCD, BDE, CDE in later scan

databases. SIGMOD98

Format

et al.@SIGMOD00), CHARM (Zaki & Hsiao@SDM02)

Graph

49

Graph

50

Visualization of Association

Rules

(SGI/MineSet 3.0)

51

Association and Correlations: Basic

Concepts and Methods

Basic Concepts

Evaluation Methods

Summary

52

(Lift)

accurate, although with lower support and confidence

P( A B)

lift

P ( A) P( B)

2000 / 5000

lift ( B, C )

0.89

3000 / 5000 * 3750 / 5000

lift ( B, C )

Basketba

ll

Not

basketball

Sum

(row)

Cereal

2000

1750

3750

Not

cereal

1000

250

1250

Sum(col.)

3000

2000

5000

1000 / 5000

1.33

3000 / 5000 *1250 / 5000

53

milk [1%, 80%] is

misleading if 85% of

customers buy milk

are not good to indicate

correlations

Over 20 interestingness

measures have been

proposed (see Tan,

Kumar, Sritastava

@KDD02)

54

Null-Invariant Measures

55

Comparison of Interestingness

Measures

Lift and 2 are not null-invariant

5 null-invariant measures

Milk

No Milk

Sum

(row)

Coffee

m, c

~m, c

No

Coffee

m, ~c

~m, ~c

~c

Sum(col. m

~m

)

Null-transactions

w.r.t. m and c

August 5, 2015

Kulczynski

measure (1927)

Null-invariant

Recent DB conferences, removing balanced associations, low sup, etc.

coherence: low, cosine: middle

Association Mining in Large Databases: A Re-Examination of It

s Measures

, Proc. 2007 Int. Conf. Principles and Practice of Knowledge

Discovery in Databases (PKDD'07), Sept. 2007

57

itemsets A and B in rule implications

clear picture for all the three datasets D 4 through D6

Association and Correlations: Basic

Concepts and Methods

Basic Concepts

Evaluation Methods

Summary

59

Summary

60

Pattern Mining

Mining association rules between sets of items in large

databases. SIGMOD'93

from databases. SIGMOD'98

Lakhal. Discovering frequent closed itemsets for association

rules. ICDT'99

sequential patterns. ICDE'95

61

VLDB'94

discovering association rules. KDD'94

mining association rules in large databases. VLDB'95

for mining association rules. SIGMOD'95

counting and implication rules for market basket analysis. SIGMOD'97

mining with relational database systems: Alternatives and

implications. SIGMOD'98

62

Mining

generation of frequent itemsets. J. Parallel and Distributed Computing,

2002.

Itemsets, Proc. FIMI'03

mining implementations. Proc. ICDM03 Int. Workshop on Frequent Itemset

Mining Implementations (FIMI03), Melbourne, FL, Nov. 2003

generation. SIGMOD 00

Opportunistic Projection. KDD'02

J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns

without Minimum Support. ICDM'02

J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for

Mining Frequent Closed Itemsets. KDD'03

63

Enumeration Methods

algorithm for discovery of association rules. DAMI:97.

Closed Itemset Mining, SDM'02.

C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A DualPruning Algorithm for Itemsets with Constraints. KDD02.

CARPENTER: Finding Closed Patterns in Long Biological

Datasets. KDD'03.

from Very High Dimensional Data: A Top-Down Row

Enumeration Approach, SDM'06.

64

Interesting Rules

Generalizing association rules to correlations. SIGMOD'97.

Verkamo. Finding interesting rules from large sets of discovered

association rules. CIKM'94.

of Interest. Kluwer Academic, 2001.

for mining causal structures. VLDB'98.

Interestingness Measure for Association Patterns. KDD'02.

TKDE03.

Measures in Pattern Mining: A Unified Framework", Data Mining and

Knowledge Discovery, 21(3):371-397, 2010

65

- vistextechnicalbywww-150819142955-lva1-app6892.pdfUploaded byRajan S Prasad
- IRJET-Comparison of Different Techniques of High Utility Mining for Representation and Graphical AnalysisUploaded byIRJET Journal
- Learning Image-Text AssociationsUploaded byiosrose
- 07. Closet - An Efficient Algorithm for Mining FrequentUploaded byapi-19799369
- Bh Win 03 Security FridayUploaded byAnonymous Tvpppp
- The Recommender Problem RevisitedUploaded byYourTechGuy
- R Refcard Data MiningUploaded byGustavo Ansaldi Oliva
- Bank Marketing Startegy Using ClusteringUploaded byAlok Kumar Singh
- Mining Medical Specialist Billing Patterns ForUploaded bylilmo sule
- Linking Behavioral Patterns to Personal Attributes through Data Re-MiningUploaded byertekg
- Hash-Close: A Hash-Based Algorithm for Mining Frequent Closed ItemsetsUploaded byIJEC_Editor
- V3I4-IJERTV3IS041942Uploaded bySeminarsPunters
- A Systematic Review on Mining Web Navigation Pattern Using Graph Based TechniquesUploaded byIJSTR Research Publication
- 1Intro2Uploaded byHiểu Nguyễn
- 19-BDT-MarketBasketAnalysis.pdfUploaded bySiddharth Singhal
- Data Mining QuizUploaded byJohn Ed
- Generate frequent item set in serial and parallel approachUploaded byInternational Journal for Scientific Research and Development - IJSRD
- Arul BrittoUploaded byapi-3807268
- [IJCST-V4I3P21]:Shivakeshi.C , Mahantesh.H.M, Naveen Kumar.B, Pampapathi.B.MUploaded byEighthSenseGroup
- IRJET-Effecient Support Itemset Mining using Parallel Map ReducingUploaded byIRJET Journal
- Text Categorization Using Frequent ItemsetsUploaded byZahoor Mhd
- Data Mining FunctionalitiesUploaded byKamalakar Sreevatasala
- ClosedPatternMining_ESWA_AK_AL_JA_AG_2014.pdfUploaded byAbonyi Janos
- titleUploaded byLuke Johnson
- PPT for Academic Council Meeting04052018Uploaded byAnonymous PcPkRpAKD5
- Uploaded to ScribdUploaded byDibyo Ban
- First Esaimen 6 Mac 2014Uploaded byecahencot
- Rank Co OccurUploaded byTao-Yi Chang
- IMP_Unit1Uploaded bySonal ShethiaDalvi
- AdministrationUploaded bysamparkcomputers

- 01_servletUploaded byMukesh Kumar
- SAP UI5_Netweaver Gateway_Fiori Syllabus SheetUploaded byphani
- 05Uploaded byphani
- 04 InheritanceUploaded byphani
- 03_interface.docUploaded byphani
- 02 OverloadingUploaded byphani
- 01_JLCUploaded byphani
- JDBC.pdfUploaded byKashif Rizvee
- 02_New Microsoft Word DocumentUploaded byphani
- 03_PROUploaded byphani
- 04_p1Uploaded byphani
- 05 NothingUploaded byphani
- 04_Public Class JLcValidatorUploaded byphani
- 03 MethodsUploaded byHemant Insan Insan
- 02_Type of ApplicationsUploaded byphani
- ModelUploaded byphani
- ListenerUploaded byphani
- FiltersUploaded byphani
- Filter dsUploaded byphani
- FilterUploaded byphani
- CookiesUploaded byphani
- MvcUploaded byphani
- HelloUploaded byphani
- cartssUploaded byphani
- CartUploaded byphani
- 06Uploaded byphani
- 05Uploaded byphani
- 04Uploaded byphani
- serv 03Uploaded byphani

- ch 3 review sheetUploaded byapi-233611431
- Depopulation and HIV by Jon RappoportUploaded byfilresist
- Fiber TechnologyUploaded byPv Khoi
- c-languagenotesUploaded byMichelle Anne Constantino
- TBG Chair ResponsibilitiesUploaded bytheberkeleygroup
- awscloudtrail-ugUploaded byDurga Sainath
- 1-s2.0-S0263822313000950-mainUploaded bymuki10
- Perfect Herbals & Oils Chhattisgarh IndiaUploaded byExportersindia.com
- Foodborne IllnessesUploaded byNicoel
- Lg 37lc2d Lcd Tv Service Manual[1]Uploaded byKJINTF
- Black PantherUploaded byPrussianCommand
- cubo-octahedronessayUploaded byapi-282302245
- 6%5B1%5D.Nida.pdfUploaded byR.v. Naveenan
- Info Mtech PhdUploaded byamec3264
- 3 Alloy Zn NiUploaded byDante Castillo Cisneros
- Cytological study of Allium cepa and Allium sativumUploaded bystringangel
- tunkuUploaded byapi-276599815
- CementingUploaded byb4rf
- Mtech CS Semester VIIIUploaded byVikas Kumar
- 54-462-1-PB.pdfUploaded byPearl Sky
- fulltextUploaded byوائل عبده
- Structural Design Brief.docUploaded bytaz_taz3
- Ps6x Basics User GuideUploaded bySaad Raja
- 5-607Uploaded byRobson Custódio de Souza
- 7.0 Manual_Updated 07.08.09(2)Uploaded byhildamaggie
- Induction Motor Braking Regenerative Dynamic Braking of Induction Motor _ Electrical4u.pdfUploaded bybalaji
- research action about homework policyUploaded byGubee Kulit Gubstahh
- 495-PTOMANUAL2008Uploaded bytigerpower
- GPS and GISUploaded byAremacho Erkazet
- EAMON RYAN Newsletter Summer 2008Uploaded byEamon Ryan