
# Association Rule Mining with R∗

Yanchang Zhao
http://www.RDataMining.com

R and Data Mining Workshop
for the Master of Business Analytics course, Deakin University, Melbourne

28 May 2015

Presented at Australian Customs (Canberra, Australia) in Oct 2014, at AusDM 2014 (QUT, Brisbane) in Nov
2014, at Twitter (US) in Oct 2014, at UJAT (Mexico) in Sept 2014, and at University of Canberra in Sept 2013

## Outline

- Introduction
- Association Rule Mining
- Removing Redundancy
- Interpreting Rules
- Visualizing Association Rules
- Further Readings and Online Resources

## Association Rule Mining with R

- basic concepts of association rules
- association rule mining with R
- pruning redundant rules
- interpreting and visualizing association rules
- Chapter 9: Association Rules, *R and Data Mining: Examples and Case Studies*. http://www.rdatamining.com/docs/RDataMining.pdf

## Association Rules

Association rules are rules presenting association or correlation between itemsets.

support(A ⇒ B) = P(A ∪ B)

confidence(A ⇒ B) = P(B|A) = P(A ∪ B) / P(A)

lift(A ⇒ B) = confidence(A ⇒ B) / P(B) = P(A ∪ B) / (P(A) P(B))

where P(A) is the percentage (or probability) of cases containing A.
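The three measures can be checked numerically. Below is a small Python illustration (not part of the original R slides; the transactions and item names are made up for the example):

```python
# Toy transactions; each set is one case. Rule under test: {a} => {b}.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "d"},
]

def rule_measures(lhs, rhs, transactions):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    n = len(transactions)
    n_both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_rhs = sum(1 for t in transactions if rhs <= t)
    support = n_both / n             # P(A u B): both sides occur together
    confidence = n_both / n_lhs      # P(B | A)
    lift = confidence / (n_rhs / n)  # confidence / P(B)
    return support, confidence, lift

support, confidence, lift = rule_measures({"a"}, {"b"}, transactions)
print(support, confidence, lift)
```

A lift below 1 (as here) means the two itemsets occur together less often than expected if they were independent.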

## Association Rule Mining Algorithms in R

- APRIORI: a level-wise, breadth-first algorithm which counts transactions to find frequent itemsets and then derives association rules from them; `apriori()` in package arules
- ECLAT: finds frequent itemsets with equivalence classes, depth-first search and set intersection instead of counting; `eclat()` in the same package
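As a rough sketch of the level-wise idea behind APRIORI (written in Python rather than R, and greatly simplified compared to the arules implementation):

```python
from itertools import combinations

def apriori_frequent(transactions, minsup):
    """Level-wise (breadth-first) search: count candidate itemsets against
    the transactions at each level, keep those meeting minsup, then join
    survivors into candidates one item larger."""
    n = len(transactions)
    frequent = {}
    # level 1: single items
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    k = 1
    while level:
        survivors = {}
        for cand in level:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= minsup:
                survivors[cand] = sup
        frequent.update(survivors)
        # join step: merge frequent k-itemsets; prune any candidate with an
        # infrequent k-subset (the Apriori property)
        next_level = set()
        for a, b in combinations(survivors, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(s) in frequent for s in combinations(cand, k)
            ):
                next_level.add(cand)
        level = list(next_level)
        k += 1
    return frequent

freq = apriori_frequent(
    [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}], minsup=0.5
)
print(len(freq))  # 6: three single items and three pairs
```

Deriving rules from the frequent itemsets (and ECLAT's depth-first set-intersection strategy) are left out of this sketch.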

## Association Rule Mining

### The Titanic Dataset

- The Titanic dataset in the *datasets* package is a 4-dimensional table with summarized information on the fate of passengers on the Titanic according to social class, sex, age and survival.
- To make it suitable for association rule mining, we reconstruct the raw data as titanic.raw, where each row represents a person.
- The reconstructed raw data can also be downloaded at http://www.rdatamining.com/data/titanic.raw.rdata.

```r
load("./data/titanic.raw.rdata")
## draw a sample of 5 records
idx <- sample(1:nrow(titanic.raw), 5)
titanic.raw[idx, ]
##      Class Sex  Age   Survived
## 1203 Crew  Male Adult No
## 1218 Crew  Male Adult No
## 1674 3rd   Male Adult Yes
## 941  Crew  Male Adult No
## 820  Crew  Male Adult No

summary(titanic.raw)
##   Class         Sex           Age          Survived
##  1st :325   Female: 470   Adult:2092   No :1490
##  2nd :285   Male  :1731   Child: 109   Yes: 711
##  3rd :706
##  Crew:885
```

### Function apriori()

Mine frequent itemsets, association rules or association hyperedges using the Apriori algorithm. The Apriori algorithm employs level-wise search for frequent itemsets.

Default settings:

- minimum support: supp=0.1
- minimum confidence: conf=0.8
- maximum length of rules: maxlen=10

```r
library(arules)
rules.all <- apriori(titanic.raw)
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target ext
##         0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)    (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
```

```r
# rules with rhs containing "Survived" only
rules <- apriori(titanic.raw, control = list(verbose=F),
                 parameter = list(minlen=2, supp=0.005, conf=0.8),
                 appearance = list(rhs=c("Survived=No", "Survived=Yes"),
                                   default="lhs"))
## keep three decimal places
quality(rules) <- round(quality(rules), digits=3)
## order rules by lift
rules.sorted <- sort(rules, by="lift")
```

```r
inspect(rules.sorted)
##   lhs                                    rhs            support confidence  lift
## 1 {Class=2nd, Age=Child}              => {Survived=Yes}   0.011      1.000 3.096
## 2 {Class=2nd, Sex=Female, Age=Child}  => {Survived=Yes}   0.006      1.000 3.096
## 3 {Class=1st, Sex=Female}             => {Survived=Yes}   0.064      0.972 3.010
## 4 {Class=1st, Sex=Female, Age=Adult}  => {Survived=Yes}   0.064      0.972 3.010
## 5 {Class=2nd, Sex=Female}             => {Survived=Yes}   0.042      0.877 2.716
## 6 {Class=Crew, Sex=Female}            => {Survived=Yes}   0.009      0.870 2.692
## 7 {Class=Crew, Sex=Female, Age=Adult} => {Survived=Yes}   0.009      0.870 2.692
## 8 {Class=2nd, Sex=Female, Age=Adult}  => {Survived=Yes}   0.036      0.860 2.663
```

## Removing Redundancy

### Redundant Rules

```r
inspect(rules.sorted[1:2])
##   lhs                                   rhs            support confidence  lift
## 1 {Class=2nd, Age=Child}             => {Survived=Yes}   0.011          1 3.096
## 2 {Class=2nd, Sex=Female, Age=Child} => {Survived=Yes}   0.006          1 3.096
```

- Rule #2 provides no extra knowledge in addition to rule #1, since rule #1 tells us that all 2nd-class children survived.
- When a rule (such as #2) is a super rule of another rule (#1) and the former has the same or a lower lift, the former rule (#2) is considered to be redundant.
- Other redundant rules in the above result are rules #4, #7 and #8, compared respectively with #3, #6 and #5.

### Remove Redundant Rules

```r
## find redundant rules
subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag = T)] <- NA
redundant <- colSums(subset.matrix, na.rm = T) >= 1
## which rules are redundant
which(redundant)
## [1] 2 4 7 8
## remove redundant rules
rules.pruned <- rules.sorted[!redundant]
```
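The same pruning idea can be sketched in Python for clarity (hypothetical rule tuples, not the arules data structures): scanning rules in decreasing-lift order, a rule is dropped when an already-kept rule has the same rhs, a proper subset of its lhs, and at least the same lift.

```python
# Rules as (lhs, rhs, lift), sorted by lift in decreasing order; the three
# example rules and their lifts are taken from the sorted output above.
rules = [
    (frozenset({"Class=2nd", "Age=Child"}), "Survived=Yes", 3.096),
    (frozenset({"Class=2nd", "Sex=Female", "Age=Child"}), "Survived=Yes", 3.096),
    (frozenset({"Class=1st", "Sex=Female"}), "Survived=Yes", 3.010),
]

def prune_redundant(sorted_rules):
    kept = []
    for lhs, rhs, lift in sorted_rules:
        # redundant: some kept rule is more general (lhs proper subset,
        # same rhs) and does not have a lower lift
        if not any(l2 < lhs and r2 == rhs and lift2 >= lift
                   for l2, r2, lift2 in kept):
            kept.append((lhs, rhs, lift))
    return kept

pruned = prune_redundant(rules)
print(len(pruned))  # 2: the super rule with equal lift is removed
```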

### Remaining Rules

```r
inspect(rules.pruned)
##   lhs                                 rhs           support confidence  lift
## 1 {Class=2nd, Age=Child}           => {Survived=Yes}  0.011      1.000 3.096
## 2 {Class=1st, Sex=Female}          => {Survived=Yes}  0.064      0.972 3.010
## 3 {Class=2nd, Sex=Female}          => {Survived=Yes}  0.042      0.877 2.716
## 4 {Class=Crew, Sex=Female}         => {Survived=Yes}  0.009      0.870 2.692
## 5 {Class=2nd, Sex=Male, Age=Adult} => {Survived=No}   0.070      0.917 1.354
## 6 {Class=2nd, Sex=Male}            => {Survived=No}   0.070      0.860 1.271
## 7 {Class=3rd, Sex=Male, Age=Adult} => {Survived=No}   0.176      0.838 1.237
## 8 {Class=3rd, Sex=Male}            => {Survived=No}   0.192      0.827 1.222
```

## Interpreting Rules

```r
inspect(rules.pruned[1])
##   lhs                       rhs            support confidence  lift
## 1 {Class=2nd, Age=Child} => {Survived=Yes}   0.011          1 3.096
```

Did children of the 2nd class have a higher survival rate than other children?

The rule states only that all children of class 2 survived, but provides no information at all to compare the survival rates of different classes.

### Rules about Children

```r
rules <- apriori(titanic.raw, control = list(verbose=F),
                 parameter = list(minlen=3, supp=0.002, conf=0.2),
                 appearance = list(default="none", rhs=c("Survived=Yes"),
                                   lhs=c("Class=1st", "Class=2nd", "Class=3rd",
                                         "Age=Child", "Age=Adult")))
rules.sorted <- sort(rules, by="confidence")
inspect(rules.sorted)
##   lhs                       rhs                support confidence      lift
## 1 {Class=2nd, Age=Child} => {Survived=Yes} 0.010904134  1.0000000 3.0956399
## 2 {Class=1st, Age=Child} => {Survived=Yes} 0.002726034  1.0000000 3.0956399
## 3 {Class=1st, Age=Adult} => {Survived=Yes} 0.089504771  0.6175549 1.9117275
## 4 {Class=2nd, Age=Adult} => {Survived=Yes} 0.042707860  0.3601533 1.1149048
## 5 {Class=3rd, Age=Child} => {Survived=Yes} 0.012267151  0.3417722 1.0580035
## 6 {Class=3rd, Age=Adult} => {Survived=Yes} 0.068605179  0.2408293 0.7455209
```
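As a quick arithmetic check (in Python, outside the slides): lift equals confidence divided by P(Survived=Yes), and with P(Survived=Yes) = 711/2201 from the earlier `summary()` output, this reproduces the lift column above.

```python
# Overall survival probability from summary(titanic.raw): 711 of 2201 survived.
p_yes = 711 / 2201

# Confidence values of rules 1, 3 and 6 above; dividing by p_yes should give
# back their lift values (to 4 decimal places).
lifts = [round(conf / p_yes, 4) for conf in (1.0, 0.6175549, 0.2408293)]
print(lifts)
```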

## Visualizing Association Rules

```r
library(arulesViz)
plot(rules.all)
```

(Scatter plot for 27 rules: support on the x-axis, confidence on the y-axis, points shaded by lift.)

```r
plot(rules.all, method = "grouped")
```

(Grouped matrix for 27 rules: LHS item groups against RHS items; circle size shows support, color shows lift.)

```r
plot(rules.all, method = "graph")
```

(Graph for 27 rules: vertex size shows support, 0.119 to 0.95; color shows lift, 0.934 to 1.266.)

```r
plot(rules.all, method = "graph", control = list(type = "items"))
```

(Graph for 27 rules with items as vertices: size shows support, 0.119 to 0.95; color shows lift, 0.934 to 1.266.)

```r
plot(rules.all, method = "paracoord", control = list(reorder = TRUE))
```

(Parallel coordinates plot for 27 rules: items on the y-axis, lhs positions 3, 2, 1 and the rhs on the x-axis.)

## Further Readings and Online Resources

### Further Readings

- More than 20 interestingness measures, such as chi-square, conviction, gini and leverage: Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In Proc. of KDD '02, pages 32-41, New York, NY, USA. ACM Press.
- Post-mining of association rules, such as selecting interesting association rules, visualization of association rules and using association rules for classification: Yanchang Zhao, Chengqi Zhang and Longbing Cao (Eds.). "Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction". Information Science Reference, May 2009. ISBN 978-1-60566-404-0.
- Package arulesSequences: mining sequential patterns. http://cran.r-project.org/web/packages/arulesSequences/

### Online Resources

- Chapter 9: Association Rules, in book *R and Data Mining: Examples and Case Studies*. http://www.rdatamining.com/docs/RDataMining.pdf
- R Reference Card for Data Mining. http://www.rdatamining.com/docs/R-refcard-data-mining.pdf
- Free online courses and documents. http://www.rdatamining.com/resources/
- RDataMining Group on LinkedIn (12,000+ members). http://group.rdatamining.com
- RDataMining on Twitter (2,000+ followers): @RDataMining

## The End

Thanks!

Email: yanchang(at)rdatamining.com