
Apriori Algorithm

Outline:
- Association rule mining
- Apriori Algorithm
- Additional Measures of rule interestingness
- Advanced Techniques

Association rule mining

Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases.

Rule form: Antecedent => Consequent [support, confidence]
(support and confidence are user-defined measures of interestingness)

Examples: correlations between the different items that customers place in their shopping basket
- buys(x, computer) => buys(x, financial management software) [0.5%, 60%]
- age(x, 30..39) ^ income(x, 42..48K) => buys(x, car) [1%, 75%]

Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection (supervisor -> examiner)

How can Association Rules be used?

The classic diapers-and-beer example: probably mom was calling dad at work to buy diapers on the way home, and he decided to buy a six-pack as well. The retailer could move diapers and beer to separate places and position high-profit items of interest to young fathers along the path.

Let the rule discovered be {Bagels, ...} => {Potato Chips}
- Potato chips as consequent => can be used to determine what should be done to boost their sales
- Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels
- Bagels in the antecedent and Potato chips in the consequent => can be used to see what products should be sold with bagels to promote the sale of potato chips

Association Rule: Basic Concepts

Given:
(1) a database of transactions,
(2) each transaction is a list of items purchased by a customer in a visit.

Find: all rules that correlate the presence of one set of items (itemset) with that of another set of items.
E.g., 98% of people who purchase tires and auto accessories also get automotive services done.

A => B [s, c]

Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a great part of the database.
support(A => B [s, c]) = p(A ∪ B)

Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability.
confidence(A => B [s, c]) = p(B|A) = sup(A,B) / sup(A)

Example

Trans. Id   Purchased Items
1           A, D
2           A, C
3           A, B, C
4           B, E, F

Itemset: e.g. {A,B} or {B,E,F}
Support of an itemset: Sup(A,B) = 1, Sup(A,C) = 2
Frequent pattern: given min. sup = 2, {A,C} is a frequent pattern
From this frequent pattern we can mine, e.g., the following rule:
C => A with 50% support and 100% confidence

Mining Association Rules

Outline:
- What is Association rule mining
- Apriori Algorithm
- Additional Measures of rule interestingness
- Advanced Techniques

Example (min. support 50%, min. confidence 50%)

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Each transaction is represented by a Boolean vector.

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A => C:
support = support({A,C}) = 50%
confidence = support({A,C}) / support({A}) = 66.6%
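As a minimal sketch of these definitions (plain Python; the set-based representation and function names are assumptions, not from the slides):

```python
# Hedged sketch: support and confidence for rule A => C on the
# four-transaction example above (transactions modelled as Python sets).
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated conditional probability p(consequent | antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"A", "C"}))        # 0.5 (support of A => C: 50%)
print(confidence({"A"}, {"C"}))   # 0.666... (confidence of A => C: 66.6%)
```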

The Apriori principle

Any subset of a frequent itemset must be frequent. Conversely, no superset of any infrequent itemset should be generated or tested, so many item combinations can be pruned.

Example: if {beer, diaper} is frequent, then both {beer} and {diaper} must be frequent.

[Figure: the itemset lattice over items A..E, from the null set up to ABCDE. Once an itemset is found to be infrequent, all of its supersets are infrequent and the corresponding part of the lattice is pruned.]

Mining Frequent Itemsets (the Key Step)

Find the frequent itemsets: the sets of items that have minimum support.
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets
- Test the candidates against the DB to determine which are in fact frequent

Use the frequent itemsets to generate association rules (generation is straightforward).

The Apriori Algorithm: Example (min. support: 2 transactions)

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D -> L3: {2 3 5}:2
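The level-wise search above can be sketched in Python (an illustrative implementation reproducing the example's L1..L3, not the original code; the names are my own):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: returns {k: {frequent k-itemset: count}}."""
    L = {}
    # C1: all single items
    current = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    k = 1
    while current:
        # scan the DB to count each candidate
        counts = {c: sum(c <= t for t in transactions) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        if not frequent:
            break
        L[k] = frequent
        # join Lk with itself, keeping only (k+1)-sized unions whose
        # k-subsets are all frequent (the Apriori pruning step)
        current = []
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if (len(union) == k + 1 and union not in current and
                    all(frozenset(s) in frequent for s in combinations(union, k))):
                current.append(union)
        k += 1
    return L

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
L = apriori(D, min_sup=2)
print(L[3])   # {frozenset({2, 3, 5}): 2}
```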

How to Generate Candidates

The items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, ..., p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

for all itemsets c in Ck do
    for all (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

Example: A D E joins with A D F (equal on the first k-2 items, E < F on the last) to produce the candidate A D E F.

The Apriori Algorithm

Ck: candidate itemset of size k; Lk: frequent itemset of size k

Join step: Ck is generated by joining Lk-1 with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Algorithm:
L1 = {frequent items};
for (k = 1; Lk != empty; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return L = union of all Lk;

Example of Generating Candidates

L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3*L3
- abcd from abc and abd
- acde from acd and ace
Pruning (before counting its support):
- acde is removed because ade is not in L3
C4 = {abcd}
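The join and prune steps can be sketched separately (an illustrative translation of the SQL-style join above; the tuple representation and function names are assumptions):

```python
from itertools import combinations

def self_join(L_prev):
    """Step 1: join p, q in L(k-1) that agree on all but the last item."""
    out = []
    for p, q in combinations(sorted(L_prev), 2):
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            out.append(p + (q[-1],))
    return out

def prune(candidates, L_prev):
    """Step 2: delete c if some (k-1)-subset of c is not in L(k-1)."""
    prev = set(L_prev)
    return [c for c in candidates
            if all(s in prev for s in combinations(c, len(c) - 1))]

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
C4 = self_join(L3)     # [('a','b','c','d'), ('a','c','d','e')]
C4 = prune(C4, L3)     # acde removed: ('a','d','e') is not in L3
print(C4)              # [('a', 'b', 'c', 'd')]
```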

Counting supports of candidates is a problem because:
- the total number of candidates can be very huge
- one transaction may contain many candidates

Method:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction

Generating rules from frequent itemsets:
For every frequent itemset x, generate all non-empty subsets of x.
For every non-empty subset s of x, output the rule s => (x - s) if

support_count({x}) / support_count({s}) >= min_conf

From Frequent Itemsets to Association Rules

Q: Given frequent set {A,B,E}, what are the possible association rules?
- A => B, E
- A, B => E
- A, E => B
- B => A, E
- B, E => A
- E => A, B
- __ => A, B, E (empty rule), or true => A, B, E

Generating Rules: example (min_support: 60%, min_confidence: 75%)

Trans-ID   Items
1          A C D
2          B C E
3          A B C E
4          B E
5          A B C E

Frequent Itemset        Support
{B,C,E}, {A,C}          60%
{B,C}, {C,E}, {A}       60%
{B,E}, {B}, {C}, {E}    80%

Rule            Conf.
{B,C} => {E}    100%
{B,E} => {C}    75%
{C,E} => {B}    100%
{B} => {C,E}    75%
{C} => {B,E}    75%
{E} => {B,C}    75%
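The rule table above can be reproduced with a short sketch (assuming the support counts are already known; the helper name is illustrative):

```python
from itertools import chain, combinations

def rules_from_itemset(x, support_count, min_conf):
    """Emit s => (x - s) for every non-empty proper subset s of x whose
    confidence sup(x)/sup(s) reaches min_conf."""
    x = frozenset(x)
    rules = []
    subsets = chain.from_iterable(combinations(sorted(x), r) for r in range(1, len(x)))
    for s in map(frozenset, subsets):
        conf = support_count[x] / support_count[s]
        if conf >= min_conf:
            rules.append((s, x - s, conf))
    return rules

# support counts (out of 5 transactions) for the example above
sup = {
    frozenset("BCE"): 3, frozenset("BC"): 3, frozenset("BE"): 4,
    frozenset("CE"): 3, frozenset("B"): 4, frozenset("C"): 4, frozenset("E"): 4,
}
for s, c, conf in rules_from_itemset("BCE", sup, min_conf=0.75):
    print("".join(sorted(s)), "=>", "".join(sorted(c)), f"{conf:.0%}")
# all six rules from the table qualify, e.g. BC => E 100%, B => CE 75%
```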

Exercise

TID   Items
1     Bread, Milk, Chips, Mustard
2     Beer, Diaper, Bread, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk, Chips
5     Coke, Bread, Diaper, Milk
6     Beer, Bread, Diaper, Milk, Mustard
7     Coke, Bread, Diaper, Milk

Convert the data to Boolean format and, for a support of 40%, apply the Apriori algorithm.

Bread Milk Chips Mustard Beer Diaper Eggs Coke
1     1    1     1       0    0      0    0
1     0    0     0       1    1      1    0
0     1    0     0       1    1      0    1
1     1    1     0       1    1      0    0
1     1    0     0       0    1      0    1
1     1    0     1       1    1      0    0
1     1    0     0       0    1      0    1

C1: Bread 6, Milk 6, Chips 2, Mustard 2, Beer 4, Diaper 6, Eggs 1, Coke 3
L1: Bread 6, Milk 6, Beer 4, Diaper 6, Coke 3

C2: Bread,Milk 5; Bread,Beer 3; Bread,Diaper 5; Bread,Coke 2; Milk,Beer 3; Milk,Diaper 5; Milk,Coke 3; Beer,Diaper 4; Beer,Coke 1; Diaper,Coke 3
L2: Bread,Milk 5; Bread,Beer 3; Bread,Diaper 5; Milk,Beer 3; Milk,Diaper 5; Milk,Coke 3; Beer,Diaper 4; Diaper,Coke 3

C3: Bread,Milk,Beer 2; Bread,Milk,Diaper 4; Bread,Beer,Diaper 3; Milk,Beer,Diaper 3; Milk,Beer,Coke (pruned); Milk,Diaper,Coke 3
L3: Bread,Milk,Diaper 4; Bread,Beer,Diaper 3; Milk,Beer,Diaper 3; Milk,Diaper,Coke 3

Without the Apriori pruning, C(8,1) + C(8,2) + C(8,3) = 8 + 28 + 56 = 92 candidates would be tested, versus the 24 above.
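As a sketch, the Boolean encoding asked for in the exercise can be built like this (illustrative Python, not part of the exercise):

```python
# Encode the exercise's transactions as Boolean (0/1) vectors.
transactions = [
    ["Bread", "Milk", "Chips", "Mustard"],
    ["Beer", "Diaper", "Bread", "Eggs"],
    ["Beer", "Coke", "Diaper", "Milk"],
    ["Beer", "Bread", "Diaper", "Milk", "Chips"],
    ["Coke", "Bread", "Diaper", "Milk"],
    ["Beer", "Bread", "Diaper", "Milk", "Mustard"],
    ["Coke", "Bread", "Diaper", "Milk"],
]
items = ["Bread", "Milk", "Chips", "Mustard", "Beer", "Diaper", "Eggs", "Coke"]

boolean = [[int(i in t) for i in items] for t in transactions]
print(boolean[0])   # [1, 1, 1, 1, 0, 0, 0, 0]

# Column sums give the C1 supports; min. support 40% of 7 transactions = 3.
counts = {item: sum(row[j] for row in boolean) for j, item in enumerate(items)}
print(counts)       # Bread: 6, Eggs: 1, ... matching C1 above
```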

Challenges of Frequent Pattern Mining

Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates

Improving Apriori: general ideas
- Reduce the number of transaction database scans
- Shrink the number of candidates
- Facilitate support counting of candidates

Improving Apriori's Efficiency

Problem with Apriori: every pass goes over the whole data.

AprioriTID: generates candidates as Apriori does, but the DB is used for counting support only on the first pass.
- Builds a storage set C^k that stores in memory the frequent sets per transaction
- Needs much more memory than Apriori

AprioriHybrid: use Apriori in the initial passes; estimate the size of C^k; switch to AprioriTID when C^k is expected to fit in memory.
- The switch takes time, but it is still better in most cases.

AprioriTID example (min. support = 2):

Database:           C^1:
TID   Items         TID   Set-of-itemsets
100   1 3 4         100   { {1},{3},{4} }
200   2 3 5         200   { {2},{3},{5} }
300   1 2 3 5       300   { {1},{2},{3},{5} }
400   2 5           400   { {2},{5} }

L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C^2:
TID   Set-of-itemsets
100   { {1 3} }
200   { {2 3},{2 5},{3 5} }
300   { {1 2},{1 3},{1 5},{2 3},{2 5},{3 5} }
400   { {2 5} }

L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}

C^3:
TID   Set-of-itemsets
200   { {2 3 5} }
300   { {2 3 5} }

L3: {2 3 5}:2

Improving Apriori's Efficiency

Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.

Sampling: mining on a subset of the given data.
- The sample should fit in memory
- Use a lower support threshold to reduce the probability of missing some itemsets
- The rest of the DB is used to determine the actual itemset counts
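Transaction reduction can be sketched as follows (hypothetical data: the third transaction contains no frequent 2-itemset, so it is dropped from later scans; the names are illustrative):

```python
from itertools import combinations

def reduce_db(transactions, frequent_k, k):
    """Keep only transactions containing at least one frequent k-itemset;
    the rest cannot support any (k+1)-itemset in later scans."""
    return [t for t in transactions
            if any(frozenset(c) in frequent_k for c in combinations(sorted(t), k))]

# hypothetical DB and frequent 2-itemsets
D = [{1, 3, 4}, {2, 3, 5}, {1, 4, 6}]
L2 = {frozenset({1, 3}), frozenset({2, 3})}
print(reduce_db(D, L2, k=2))   # [{1, 3, 4}, {2, 3, 5}] -- {1, 4, 6} dropped
```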

Improving Apriori's Efficiency

Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (2 DB scans).
(Support in a partition is lowered, proportional to the number of elements in the partition.)

Phase I: divide D into n non-overlapping partitions; find the itemsets that are locally frequent in each partition (parallel alg.).
Phase II: combine the local results to form a global set of candidate itemsets; find the globally frequent itemsets among these candidates in D.

Dynamic itemset counting (DIC): partitions the DB into several blocks, each marked by a start point.
- At each start point, DIC estimates the support of all itemsets that are currently counted, and adds new itemsets to the set of candidate itemsets if all their subsets are estimated to be frequent.
- If DIC adds all frequent itemsets to the set of candidate itemsets during the first scan, it will have counted each itemset's exact support at some point during the second scan; thus DIC can complete in two scans.

Query-driven tools support hypothesis verification about a relationship specified in advance by the user. Data mining methods, in contrast, automatically discover significant association rules from data:
- They find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven).
- They therefore allow finding unexpected correlations.

Interestingness Measurements

Are all of the strong association rules discovered interesting enough to present to the user?

A rule (pattern) is interesting if:
- it is unexpected (surprising to the user); and/or
- it is actionable (the user can do something with it)
(Only the user can judge the interestingness of a rule.)

Objective measures of rule interest:
- Support
- Lift (also called Interest or Correlation)
- Leverage (or Piatetsky-Shapiro)
- Coverage

Example 1 (Aggarwal & Yu, PODS98)

Among 5000 students:
- 3000 (60%) play basketball
- 3750 (75%) eat cereal
- 2000 (40%) both play basketball and eat cereal

              basketball   not basketball   sum(row)
cereal        2000         1750             3750 (75%)
not cereal    1000         250              1250 (25%)

The rule "play basketball => eat cereal [40%, 66.7%]" is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

The rule "play basketball => not eat cereal [20%, 33.3%]" is more accurate, although it has lower support and confidence.

Lift (Correlation, Interest)

Lift(A => B) = sup(A,B) / (sup(A) * sup(B)) = P(B|A) / P(B)

A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated.

Example (each column is one transaction):
X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

rule     Support   Lift
X => Y   25%       2.00
X => Z   37.50%    0.86
Y => Z   12.50%    0.57
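The lift column above can be checked with a few lines (treating each Boolean column as a transaction; a sketch, not library code):

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def sup(*cols):
    """Fraction of transactions in which all the given items occur."""
    return sum(all(v) for v in zip(*cols)) / len(cols[0])

def lift(a, b):
    return sup(a, b) / (sup(a) * sup(b))

print(round(lift(X, Y), 2), round(lift(X, Z), 2), round(lift(Y, Z), 2))
# 2.0 0.86 0.57 -- matching the table
```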

Lift of a Rule

Example 1 (cont.):

              basketball   not basketball   sum(row)
cereal        2000         1750             3750
not cereal    1000         250              1250
sum(col.)     3000         2000             5000

play basketball => eat cereal [40%, 66.7%]:
Lift = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89

play basketball => not eat cereal [20%, 33.3%]:
Lift = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33

Problems With Lift

Rules that hold 100% of the time may not have the highest possible lift. For example, if 5% of people are Vietnam veterans and 90% of the people are more than 5 years old, we get a lift of 0.05/(0.05*0.9) = 1.11, which is only slightly above 1, for the rule Vietnam veterans => more than 5 years old.

And lift is symmetric:
not eat cereal => play basketball [20%, 80%]:
Lift = (1000/5000) / ((1250/5000) * (3000/5000)) = 1.33

Conviction of a Rule

Note that A => B can be rewritten as not(A, not B).

Conv(A => B) = (P(A) * P(not B)) / P(A, not B) = (1 - sup(B)) / (1 - conf(A => B))

Conviction is 1 when A and B are unrelated (independent), and grows as the rule is violated less often than independence would predict.

play basketball => eat cereal [40%, 66.7%]:
Conv = (1 - 3750/5000) / (1 - 2000/3000) = 0.75
eat cereal => play basketball: conv 0.85

play basketball => not eat cereal [20%, 33.3%]:
Conv = (1 - 1250/5000) / (1 - 1000/3000) = 1.125
not eat cereal => play basketball: conv 2.0

Leverage (Piatetsky-Shapiro)

PS(A => B) = sup(A,B) - sup(A) * sup(B)

PS (or leverage) is the proportion of additional elements covered by both the premise and the consequence, above that expected if they were independent.

Coverage of a Rule

coverage(A => B) = sup(A)
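The measures above can be compared on the Example 1 contingency table (a sketch; the variable names are mine):

```python
# play basketball => eat cereal on the 5000-student table
n = 5000
sup_A = 3000 / n       # antecedent: plays basketball
sup_B = 3750 / n       # consequent: eats cereal
sup_AB = 2000 / n      # both

lift = sup_AB / (sup_A * sup_B)                       # 0.889: negative correlation
leverage = sup_AB - sup_A * sup_B                     # -0.05: fewer co-occurrences than independence predicts
conviction = sup_A * (1 - sup_B) / (sup_A - sup_AB)   # 0.75
coverage = sup_A                                      # 0.6: fraction of transactions the rule applies to
print(lift, leverage, conviction, coverage)
```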

Association Rules Visualization

Different icon colours are used to show different metadata values of the association rule.

Association Rules Visualization

[Figure: 3-D rule visualization in which bar height equates to confidence.]

Association Rules Visualization - Ball Graph

A ball graph consists of a set of nodes and arrows. All the nodes are yellow, green or blue. The blue nodes are active nodes, representing the items in the rule in which the user is interested. The yellow nodes are passive, representing items related to the active nodes in some way. The green nodes merely assist in visualizing two or more items in either the head or the body of the rule.

A circular node represents a frequent (large) data item. The volume of the ball represents the support of the item. Only those items which occur sufficiently frequently are shown.

An arrow between two nodes represents the rule implication between the two items. An arrow will be drawn only when the support of a rule is no less than the minimum support.

Outline:
- Apriori Algorithm
- FP-tree Algorithm
- Additional Measures of rule interestingness
- Advanced Techniques

Multiple-Level Association Rules

Concept hierarchy: FoodStuff > (Frozen, Refrigerated, Fresh, Bakery, Etc...); Fresh > (Banana, Apple, Orange, Etc...)

- Fresh => Bakery [20%, 60%]
- Dairy => Bread [6%, 50%]
- Fruit => Bread [1%, 50%] is not valid

Flexible support settings: items at the lower level are expected to have lower support.
The transaction database can be encoded based on dimensions and levels; explore shared multi-level mining.

Multi-Dimensional Association Rules

Single-dimensional rules:
buys(X, milk) => buys(X, bread)

Multi-dimensional rules: 2 or more dimensions or predicates
- Inter-dimension association rules (no repeated predicates):
  age(X,19-25) ^ occupation(X,student) => buys(X,coke)
- Hybrid-dimension association rules (repeated predicates):
  age(X,19-25) ^ buys(X, popcorn) => buys(X, coke)

Quantitative rule example: age(X,30-34) ^ income(X,24K-48K) => buys(X, high resolution TV)

Mining Sequential Patterns

Given:
- A database of customer transactions ordered by increasing transaction time
- Each transaction is a set of items
- A sequence is an ordered list of itemsets

Example: 10% of customers bought "Foundation" and "Ringworld" in one transaction, followed by "Ringworld Engineers" in another transaction.
(A transaction may contain more books than those in the pattern.)

Problem: find all sequential patterns supported by more than a user-specified percentage of data sequences.

Application Difficulties

Wal-Mart knows that customers who buy Barbie dolls (it sells one every 20 seconds) have a 60% likelihood of buying one of three types of candy bars. What does Wal-Mart do with information like that?

'I don't have a clue,' says Wal-Mart's chief of merchandising, Lee Scott.

See KDnuggets 98:01 for many ideas: www.kdnuggets.com/news/98/n01.html

Some Suggestions

- By increasing the price of the Barbie doll and giving the candy bar free, Wal-Mart can reinforce the buying habits of that particular type of buyer.
- Place the highest-margin candy near the dolls.
- Run special promotions for Barbie dolls with candy at a slightly higher margin.
- Take a poorly selling product X and attach to it an offer based on buying Barbie and candy. If the customer is likely to buy these two products anyway, then why not try to increase sales of X?
- They can not only bundle candy of type A with Barbie dolls, but also introduce a new candy of type N in this bundle while offering a discount on the whole bundle. Since the bundle will sell because of the Barbie dolls and candy of type A, candy of type N gets a free ride into customers' houses; and since people tend to like what they see often, candy of type N can become popular.

References

- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2000
- Performance Data Mining, 1999
- Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules, Proc. VLDB, 1994 (http://www.cs.tau.ac.il/~fiat/dmsem03/Fast%20Algorithms%20for%20Mining%20Association%20Rules.ppt)
- Course notes on association rules and meta queries (http://www.liacc.up.pt/~amjorge/Aulas/madsad/ecd2/ecd2_Aulas_AR_3_2003.pdf)

Thank you !!!
