
RMBI1020 – Data Analytics for Business

– Association Rule Mining

Dr. Jean Wang


RMBI@IPO
HKUST
Topics
RMBI1020 Data Analytics for Business
• Introduction to BI
• Data Visualization
• Data Modeling and Storage
• Big Data and BI Analytics Technologies Enables BI
• Excel Basics
• Time Series Forecast
• Regression Analysis
• Market Basket Analysis
• Clustering
• Optimization
• Classification
• Association Rule Mining
• Collaborative Filtering
RMBI1020@JeanWang, UST
Introduction

90%

Outline
✓Introduction to Market Basket Analysis
✓ Data Source
✓ Different Types of Analytics
✓ Business Applications

✓Association Rule Mining


✓Apriori Algorithm

RMBI1020 - Market Basket Analysis
• Source of Market Basket Data
• Different Types of Analytics
• Business Applications

Market Basket Data
› Data source: POS (point-of-sale) transactions in OLTP system
› For retail giants such as Walmart and Amazon, the number of products could be up to millions, and the number of transactions could be billions.
› Entities and attributes in the transaction data
– Customer: Customer ID, Name, Address, Contact Number
– Order: Customer ID, Store ID, Item ID, Amount, Taxes, Shipping, # of Items, Payment Method, …
– Item: Item ID, Price, Quantity, Tax, Discounts, Product ID
– Store: Store ID, Size, Geolocation, # of Staffs
– Product: Product ID, Product Category, Suppliers, Cost
Market Basket Analysis (I)
› By descriptive analytics, we can track
– On customer level: # orders per customer, # items per order, $ sales per order
– On product level: products, brands, category
› By differential analysis, we can compare results
– By store, by weekday, by season, by customer group
› All can be done by OLAP

Market Basket Analysis (II)
› Market basket data is not just about the contents of shopping carts. It could also tell us
– How characteristics of customers affect their purchases
– Whether a marketing intervention is effective or not
– Which products are often purchased together
– Interest and preference of individual customers

Related Data Analytics Techniques
• Regression
• Classification
• Association Rule Mining
• Clustering
• Time Series
• Collaborative Filtering

Applications of Association Rule Mining (I)
› Understanding basket-level dynamics allows retailers to generate new revenue by
– Cross Selling
– Content Placement
– Store Layout Design
– Inventory Management

Applications of Association Rule Mining (II)
› Association Rule Mining is applicable wherever a one-to-many relationship exists
– Fraud Claim Detection in Insurance
– Healthcare

RMBI1020 Market Basket Analysis
– Association Rules
• Association Rule and Mining
• Interestingness Measure of the Rules
• Apriori Algorithm

What is an Association Rule?
› Association rule is of the form X => Y
– Read as: IF X THEN Y
– Meaning: transactions having X are also likely to have Y
– X / Y: one single item or one itemset
– X is the antecedent (LHS); Y is the consequent (RHS)
› Association Rule Mining is the process of generating association rules

How Good is an Association Rule?
› Support: supp(X ⇒ Y) = count(X ∪ Y) / (# of transactions)
– The occurring frequency of the association (i.e., the number of transactions containing both X and Y over the total number of transactions)
– count(X ∪ Y) is the co-occurrence count of the union of items in X and Y
› supp ( Milk => Cereal ) = 3 / 10 = 0.3
› supp ( Milk, Bread => Eggs ) = 2 / 10 = 0.2
› Confidence: conf(X ⇒ Y) = count(X ∪ Y) / count(X)
– The strength of the association (i.e., a measure of how often items in Y appear in transactions that contain X)
› conf ( Milk => Cereal ) = 3 / 6 = 0.5
› conf ( Milk, Bread => Eggs ) = 2 / 4 = 0.5

TID
1   1 1 0 0 1
2   0 1 0 1 0
3   0 1 1 1 0
4   1 1 0 1 0
5   1 0 1 0 0
6   0 1 1 0 0
7   1 0 0 0 0
8   1 1 1 0 1
9   1 1 1 0 0
10  0 1 1 1 0
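
The support and confidence figures on this slide can be re-derived from the 0/1 table with a few lines of Python. This is only a sketch: the item names of the columns did not survive in the text, so the column order Milk, Bread, Cereal, (unnamed), Eggs is an assumption inferred from the worked numbers, and "Item4" is a placeholder name.

```python
# Transactions from the slide's 0/1 table, one tuple per TID.
# ASSUMPTION: columns are Milk, Bread, Cereal, Item4, Eggs (order inferred
# from the worked examples); "Item4" is a placeholder for the unnamed column.
ITEMS = ["Milk", "Bread", "Cereal", "Item4", "Eggs"]
ROWS = [
    (1, 1, 0, 0, 1),
    (0, 1, 0, 1, 0),
    (0, 1, 1, 1, 0),
    (1, 1, 0, 1, 0),
    (1, 0, 1, 0, 0),
    (0, 1, 1, 0, 0),
    (1, 0, 0, 0, 0),
    (1, 1, 1, 0, 1),
    (1, 1, 1, 0, 0),
    (0, 1, 1, 1, 0),
]
# Convert each 0/1 row into the set of items bought in that transaction.
TRANSACTIONS = [{item for item, flag in zip(ITEMS, row) if flag} for row in ROWS]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def supp(x, y):
    """supp(X => Y) = count(X ∪ Y) / (# of transactions)."""
    return count(set(x) | set(y)) / len(TRANSACTIONS)


def conf(x, y):
    """conf(X => Y) = count(X ∪ Y) / count(X); X must occur at least once."""
    return count(set(x) | set(y)) / count(x)


print(supp({"Milk"}, {"Cereal"}))          # 0.3
print(supp({"Milk", "Bread"}, {"Eggs"}))   # 0.2
print(conf({"Milk"}, {"Cereal"}))          # 0.5
print(conf({"Milk", "Bread"}, {"Eggs"}))   # 0.5
```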

How Good is an Association Rule? (II)
› Lift: lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) * supp(Y))
– Support of an itemset is the occurring frequency of the set
– The ratio of the observed support to the expected support if X and Y are independent
› lift ( Milk => Cereal ) = 0.3 / (0.6 * 0.6) = 0.83
› lift ( Milk, Bread => Eggs ) = 0.2 / (0.4 * 0.2) = 2.5
– Can be considered as a “lift” that X provides to the probability of having Y in the transaction
– High lift (lift > 1) suggests the presence of X increases the chances that Y occurs, which might be worth investigating
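
The two lift values can be checked the same way. The sketch below again assumes the column order Milk, Bread, Cereal, (unnamed), Eggs, with "Item4" as a placeholder name. It computes lift from raw counts, which is algebraically the same as the support formula but avoids small floating-point rounding in the product of two supports.

```python
# ASSUMPTION: columns are Milk, Bread, Cereal, Item4, Eggs (order inferred
# from the worked examples); "Item4" is a placeholder for the unnamed column.
ITEMS = ["Milk", "Bread", "Cereal", "Item4", "Eggs"]
ROWS = [
    (1, 1, 0, 0, 1), (0, 1, 0, 1, 0), (0, 1, 1, 1, 0), (1, 1, 0, 1, 0),
    (1, 0, 1, 0, 0), (0, 1, 1, 0, 0), (1, 0, 0, 0, 0), (1, 1, 1, 0, 1),
    (1, 1, 1, 0, 0), (0, 1, 1, 1, 0),
]
TRANSACTIONS = [{item for item, flag in zip(ITEMS, row) if flag} for row in ROWS]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def lift(x, y):
    """lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y)).

    With n transactions this equals n * count(X ∪ Y) / (count(X) * count(Y)),
    which is the form evaluated here.
    """
    n = len(TRANSACTIONS)
    x, y = set(x), set(y)
    return n * count(x | y) / (count(x) * count(y))


print(round(lift({"Milk"}, {"Cereal"}), 2))   # 0.83 (below 1)
print(lift({"Milk", "Bread"}, {"Eggs"}))      # 2.5  (above 1)
```

A lift below 1, as for Milk => Cereal, means the two sides co-occur less often than independence would predict, even though the rule's confidence is 0.5.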

Association Rule Mining
› Association rule mining is the task of finding all association rules whose support and confidence are above the user-specified thresholds suppmin and confmin
› Brute-force approach: list every possible rule and check its support and confidence against the thresholds, e.g., suppmin = 0.6 and confmin = 0.7
› BUT it is computationally expensive or even impossible!

Rule ID   Rule Description                           Supp   Conf
1         {X1} => {X3}                               0.65   0.8
2         {X1, X2} => {X3}                           0.3    0.5
3         {X2, X3} => {X1}                           0.7    0.3
4         {X2, X3} => {X1, X4}                       0.61   0.8
5         {X2, X3, X4} => {X1, X5}                   0.2    0.3
6         {X1, X2, X3} => {X4, X5}                   0.67   0.9
…         …                                          …      …
1000000   {X1, X2, …, X10} => {X11, X12, …, X20}     0.2    0.1
Example: Itemset Lattice for 5 Items
› Given d items, there will be 2^d possible itemsets
› In a rule of X => Y, X and Y could be one single item or a set of multiple items
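
To get a feel for why brute force breaks down, the sketch below counts itemsets and candidate rules for d items. It assumes a rule X => Y needs X and Y to be non-empty and disjoint; under that assumption the number of rules is 3^d - 2^(d+1) + 1, which the code cross-checks by direct enumeration for small d. The function names are illustrative only.

```python
from itertools import combinations


def count_itemsets(d):
    """All subsets of d items, including the empty set."""
    return 2 ** d


def count_rules_enumerated(d):
    """Count rules X => Y (X, Y non-empty and disjoint) by listing them."""
    items = range(d)
    total = 0
    for k in range(2, d + 1):                    # size of X ∪ Y
        for itemset in combinations(items, k):
            for r in range(1, k):                # size of the antecedent X
                for _antecedent in combinations(itemset, r):
                    total += 1                   # Y is the rest of the itemset
    return total


def count_rules_formula(d):
    """Closed form for the same count."""
    return 3 ** d - 2 ** (d + 1) + 1


print(count_itemsets(5), count_rules_enumerated(5), count_rules_formula(5))
# 32 180 180
print(count_itemsets(20), count_rules_formula(20))
# 1048576 3484687250
```

Even 20 items give billions of candidate rules, each needing a pass over the transactions to score.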

Apriori Algorithm to Discover Frequent Itemsets
› The major challenge of Association Rule Mining is to find the high-support itemsets, also referred to as frequent itemsets
– A frequent itemset is a set of items whose support is greater than or equal to the given support threshold (suppmin)

› Apriori Principle
– All subsets of a frequent itemset must be frequent too, because freq(subset) >= freq(superset)
– The supersets of an infrequent itemset will not be frequent either

› Apriori algorithm to discover frequent itemsets

1. Find frequent 1-item sets (k = 1)
2. Generate candidate (k+1)-item itemsets based on the frequent k-item itemsets
3. Count the support for each itemset candidate, prune the infrequent candidates, and set k = k + 1
4. Repeat from step 2; exit if no frequent itemset is found
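
The loop above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `apriori` and the sample data layout are choices made here, the item names assume the column order Milk, Bread, Cereal, (unnamed), Eggs with "Item4" as a placeholder, and the candidate step also applies the Apriori principle up front by discarding any candidate that has an infrequent k-item subset.

```python
from itertools import combinations


def apriori(transactions, supp_min):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent 1-item sets (k = 1).
    all_items = sorted({item for t in transactions for item in t})
    current = {}
    for item in all_items:
        s = support(frozenset([item]))
        if s >= supp_min:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 1
    while current:  # exit when the last level produced no frequent itemset
        # Step 2: join frequent k-itemsets into (k+1)-item candidates.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Apriori principle: every k-item subset must itself be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k)):
                candidates.add(union)
        # Step 3: count support, prune infrequent candidates, move to k + 1.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= supp_min:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# ASSUMPTION: column order Milk, Bread, Cereal, Item4, Eggs ("Item4" is a
# placeholder for the unnamed fourth column of the slide's table).
ITEMS = ["Milk", "Bread", "Cereal", "Item4", "Eggs"]
ROWS = [
    (1, 1, 0, 0, 1), (0, 1, 0, 1, 0), (0, 1, 1, 1, 0), (1, 1, 0, 1, 0),
    (1, 0, 1, 0, 0), (0, 1, 1, 0, 0), (1, 0, 0, 0, 0), (1, 1, 1, 0, 1),
    (1, 1, 1, 0, 0), (0, 1, 1, 1, 0),
]
TRANSACTIONS = [{i for i, flag in zip(ITEMS, row) if flag} for row in ROWS]

result = apriori(TRANSACTIONS, supp_min=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With suppmin = 0.4 this yields four frequent single items and three frequent pairs; no 3-item candidate survives the subset check, so the loop stops after k = 2.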
Example: Apriori Algorithm - Lattice Pruning
Let suppmin = 0.4

TID
1   1 1 0 0 1
2   0 1 0 1 0
3   0 1 1 1 0
4   1 1 0 1 0
5   1 0 1 0 0
6   0 1 1 0 0
7   1 0 0 0 0
8   1 1 1 0 1
9   1 1 1 0 0
10  0 1 1 1 0

[Figure: itemset lattice annotated with supports. 1-item sets: 0.6 0.8 0.6 0.4 0.2; 2-item sets: 0.4 0.3 0.1 0.5 0.4 0.2; 3-item sets: 0.2 0.1 0.2]
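
The support values attached to the lattice can be reproduced level by level from the table. The sketch below follows the slide's flow (generate candidates from the frequent sets of the previous level, then count); the item names are an assumption, with the column order Milk, Bread, Cereal, (unnamed), Eggs inferred from the earlier worked examples and "Item4" used as a placeholder.

```python
from itertools import combinations

# ASSUMPTION: column order Milk, Bread, Cereal, Item4, Eggs ("Item4" is a
# placeholder for the unnamed fourth column of the slide's table).
ITEMS = ["Milk", "Bread", "Cereal", "Item4", "Eggs"]
ROWS = [
    (1, 1, 0, 0, 1), (0, 1, 0, 1, 0), (0, 1, 1, 1, 0), (1, 1, 0, 1, 0),
    (1, 0, 1, 0, 0), (0, 1, 1, 0, 0), (1, 0, 0, 0, 0), (1, 1, 1, 0, 1),
    (1, 1, 1, 0, 0), (0, 1, 1, 1, 0),
]
TRANSACTIONS = [{i for i, flag in zip(ITEMS, row) if flag} for row in ROWS]
SUPP_MIN = 0.4


def support(itemset):
    itemset = set(itemset)
    return sum(1 for t in TRANSACTIONS if itemset <= t) / len(TRANSACTIONS)


# Level 1: every single item.
level1 = {item: support({item}) for item in ITEMS}
frequent1 = [item for item, s in level1.items() if s >= SUPP_MIN]

# Level 2: pairs built only from frequent single items.
level2 = {pair: support(pair) for pair in combinations(frequent1, 2)}
frequent2 = [pair for pair, s in level2.items() if s >= SUPP_MIN]

# Level 3: unions of two frequent pairs that share one item.
level3 = {}
for a, b in combinations(frequent2, 2):
    union = set(a) | set(b)
    if len(union) == 3:
        key = tuple(sorted(union, key=ITEMS.index))
        level3[key] = support(union)

print(list(level1.values()))   # [0.6, 0.8, 0.6, 0.4, 0.2]
print(list(level2.values()))   # [0.4, 0.3, 0.1, 0.5, 0.4, 0.2]
print(list(level3.values()))   # [0.2, 0.1, 0.2]
```

Only 5 + 6 + 3 = 14 itemsets are counted here, compared with the 31 non-empty itemsets in the full lattice for 5 items.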

From Frequent Itemset to Association Rules
› The Apriori algorithm returns a set of frequent itemsets, but we still need to generate association rules from the frequent itemsets
› For each frequent itemset (with > 1 items)
– Split the frequent set in all possible combinations of X∪Y
– Test if supp(X∪Y) ≥ suppmin and supp(X∪Y)/supp(X) ≥ confmin
› Choosing thresholds
– No magic thresholds work for all
– Needs domain knowledge to set the thresholds and interpret the rules

Let suppmin = 0.4 and confmin = 0.8
[Figure: frequent itemsets on the left and the association rules derived from them on the right, each rule labeled with its support and confidence]
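
The split-and-test step can be sketched as below. The supports are hard-coded from the 10-transaction example at suppmin = 0.4, under the assumption that the table columns are Milk, Bread, Cereal, (unnamed), Eggs, with "Item4" as a placeholder name. Because every input itemset is already frequent, only the confidence test remains.

```python
from itertools import combinations


def generate_rules(frequent, conf_min):
    """Split each frequent itemset into X => Y and keep confident rules.

    frequent maps frozenset -> support and must contain every subset of
    each itemset (true for the output of an Apriori run).
    """
    rules = []
    for itemset, s in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                x = frozenset(antecedent)
                y = itemset - x
                confidence = s / frequent[x]      # supp(X ∪ Y) / supp(X)
                if confidence >= conf_min:
                    rules.append((x, y, s, confidence))
    return rules


# Frequent itemsets at suppmin = 0.4, taken from the example table.
# ASSUMPTION: "Item4" stands for the unnamed fourth column.
FREQUENT = {
    frozenset({"Milk"}): 0.6,
    frozenset({"Bread"}): 0.8,
    frozenset({"Cereal"}): 0.6,
    frozenset({"Item4"}): 0.4,
    frozenset({"Milk", "Bread"}): 0.4,
    frozenset({"Bread", "Cereal"}): 0.5,
    frozenset({"Bread", "Item4"}): 0.4,
}

rules = generate_rules(FREQUENT, conf_min=0.8)
for x, y, s, c in rules:
    print(f"{sorted(x)} => {sorted(y)}  supp={s}  conf={c:.2f}")
```

Under these assumptions two of the six possible rules pass confmin = 0.8: Cereal => Bread (confidence about 0.83) and Item4 => Bread (confidence 1.0).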


RMBI1020 Data Analytics for Business –
Association Rule Mining
Case Demo #7: Wine Recommendation

Introduction

Data Specification
› lec09_wine.xlsx – “WineInfo” Worksheet
› lec09_wine.xlsx – “Transactions” Worksheet
› One customer -> one basket
› Data pre-processing is needed to transform transactional data to basket data

Transactional Data to Basket Data in Excel
lec09_wine.xlsx – “Baskets” Worksheet
› Create a Pivot Table based on the transaction data
› In Pivot Table Field Setting,
– Set customers as the ROWS
– Set wine as the COLUMNS
– Set count of wine as the VALUES
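
For readers who want to see what the Pivot Table does, the same reshaping can be sketched in plain Python. The sample rows below are hypothetical and are not taken from lec09_wine.xlsx; the function name `to_baskets` is likewise illustrative.

```python
from collections import defaultdict

# HYPOTHETICAL transaction rows (customer, wine); the real data is in the
# "Transactions" worksheet and is not reproduced here.
rows = [
    ("C1", "Wine A"), ("C1", "Wine B"),
    ("C2", "Wine A"), ("C2", "Wine C"), ("C2", "Wine A"),
    ("C3", "Wine B"),
]


def to_baskets(rows):
    """Pivot (customer, wine) rows into a customer-by-wine count table."""
    counts = defaultdict(lambda: defaultdict(int))
    for customer, wine in rows:
        counts[customer][wine] += 1           # VALUES: count of wine
    customers = sorted(counts)                # ROWS: customers
    wines = sorted({wine for _, wine in rows})  # COLUMNS: wines
    table = [[counts[c].get(w, 0) for w in wines] for c in customers]
    return customers, wines, table


customers, wines, table = to_baskets(rows)
print(wines)
for customer, row in zip(customers, table):
    print(customer, row)
# ['Wine A', 'Wine B', 'Wine C']
# C1 [1, 1, 0]
# C2 [2, 0, 1]
# C3 [0, 1, 0]
```

For association rules only presence matters, so a count of 2 would typically be treated the same as 1 when building the basket.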

Count 2-itemset Frequency in Excel
lec09_wine.xlsx – “AssociationRules” Worksheet
1. Count the co-occurrences of every pair of wines
2. Use INDEX() function to replace direct column referencing
3. Use Conditional Formatting to highlight frequent counts
4. Cancel highlighting on the diagonal cells of the matrix
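
The pair-count matrix built in the worksheet can be sketched in Python as follows. The baskets are hypothetical, and the minimum count of 2 is an arbitrary threshold chosen for illustration. In this sketch the diagonal holds the count of each single wine, which is presumably why the diagonal cells are excluded from the highlighting of frequent pairs.

```python
from itertools import combinations

# HYPOTHETICAL baskets (customer -> set of wines bought).
baskets = {
    "C1": {"Wine A", "Wine B"},
    "C2": {"Wine A", "Wine C"},
    "C3": {"Wine A", "Wine B", "Wine C"},
    "C4": {"Wine B"},
}


def cooccurrence(baskets):
    """Symmetric matrix of pair counts; diagonal = single-wine counts."""
    wines = sorted(set().union(*baskets.values()))
    index = {w: i for i, w in enumerate(wines)}
    matrix = [[0] * len(wines) for _ in wines]
    for basket in baskets.values():
        for w in basket:
            matrix[index[w]][index[w]] += 1
        for a, b in combinations(sorted(basket), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return wines, matrix


wines, matrix = cooccurrence(baskets)
MIN_COUNT = 2  # arbitrary threshold for this illustration
frequent_pairs = [
    (wines[i], wines[j])
    for i in range(len(wines))
    for j in range(i + 1, len(wines))      # upper triangle, diagonal skipped
    if matrix[i][j] >= MIN_COUNT
]
print(matrix)           # [[3, 2, 2], [2, 3, 1], [2, 1, 2]]
print(frequent_pairs)   # [('Wine A', 'Wine B'), ('Wine A', 'Wine C')]
```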

Generate Association Rules in Excel
lec09_wine.xlsx – “AssociationRules” Worksheet
1. Display frequent 2-itemsets and their counts
2. Generate rules by string concatenation
3. Calculate support, confidence and lift of the rules
4. Highlight “interesting” rules using Conditional Formatting
5. Interpret the selected rules

From 2-item Sets to 3-item Sets
lec09_wine.xlsx – “AssociationRules” Worksheet
1. Generate 3-item sets by combining two 2-item sets with a shared item (with the help of search box and conditional highlighting)
2. Count the frequency of the 3-item sets
3. Generate rules if any itemset is frequent
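
The join in step 1 and the count in step 2 can be sketched as below. The baskets and the single-letter wine labels are hypothetical, and the minimum count of 2 is an arbitrary threshold for illustration.

```python
from itertools import combinations

# HYPOTHETICAL baskets; letters stand in for wine names.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"B", "D"},
    {"A", "C"},
    {"A", "B", "D"},
]
MIN_COUNT = 2  # arbitrary threshold for this illustration


def count(itemset):
    """Number of baskets containing every item in itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b)


def join_pairs(pairs):
    """Combine two 2-item sets that share one item into a 3-item set."""
    candidates = set()
    for a, b in combinations(pairs, 2):
        union = frozenset(a) | frozenset(b)
        if len(union) == 3:               # exactly one shared item
            candidates.add(union)
    return candidates


items = sorted(set().union(*baskets))
frequent_pairs = [p for p in combinations(items, 2) if count(p) >= MIN_COUNT]
candidates = join_pairs(frequent_pairs)
frequent_triples = sorted(sorted(c) for c in candidates if count(c) >= MIN_COUNT)

print(frequent_pairs)
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
print(sorted(sorted(c) for c in candidates))
# [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'C', 'D'], ['B', 'C', 'D']]
print(frequent_triples)
# [['A', 'B', 'C'], ['A', 'B', 'D']]
```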

Summary
✓Market Basket Analysis
✓Association Rule Mining
✓Apriori Algorithm

Readings
› [1] Market Basket Analysis Using Big Data Analytics
– https://www.linkedin.com/pulse/gain-consumer-insight-market-basket-analysis-birendra-kumar-sahu
› [2] Association Rule Mining – Not Your Typical Data Science Algorithm
– https://www.mapr.com/blog/association-rule-mining-not-your-typical-data-science-algorithm
› [3] Association Rules and the Apriori Algorithm: A Tutorial
– http://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html
› [4] Mining Massive Datasets – Chapter 9 Recommendation Systems
– http://mmds.org/#book
