
Market Basket Analysis and Advanced Data Mining


Professor Amit Basu
abasu@smu.edu
What is Market Basket Analysis?
Understanding behavior of shoppers
What items are bought together
What's in each shopping cart/basket?
Basket data consists of a collection of transactions, each recording the
transaction date and the items bought
Itemset
How does this data differ from a transaction database?
Pivoting
Retail organizations are interested in making informed decisions and setting
strategy based on analysis of transaction data
what to put on sale, how to place merchandise on shelves to maximize profit,
customer segmentation based on buying patterns
Examples
Rule form: LHS → RHS
IF a customer buys diapers, THEN they also buy beer
diapers → beer
Transactions that purchase bread and butter also
purchase milk
bread, butter → milk
Customers who purchase maintenance agreements
are very likely to purchase large appliances
When a new hardware store opens, one of the most
commonly sold items is toilet bowl cleaners

Representations
What's the difference between these patterns?

(a) Risk = 0.3 * sin(numcards * dem1^0.25) +
    0.83 * (pastdef - dem2) * cos(employed + dem1)^2

(b) Risk = 0.93 * priordefault + 0.23 * num_cards - 1.3 * employed^0.734

(c) IF person has a good credit rating THEN they have
fewer accidents
Evaluation
Support: a measure of how often the collection of items
in an association occurs together, as a percentage of all
transactions
In 2% of the purchases at a hardware store, both a pick and a shovel
were bought
support = #tuples(LHS, RHS) / N

Confidence: the confidence of rule A → B is a
measure of how much more likely it is that B occurs
when A has occurred
100% means that B always occurs if A has occurred
confidence = #tuples(LHS, RHS) / #tuples(LHS)
Example: bread, butter → milk [90%, 1%]
Rules originating from the same itemset have identical
support but can have different confidence



The Association Rule Mining Problem
Generate all association rules from the
given dataset that have
support greater than a specified minimum
and
confidence greater than a specified minimum

Examples
Rule form:
LHS → RHS [confidence, support]
diapers → beer [60%, 0.5%]
90% of transactions that purchase bread and
butter also purchase milk
bread, butter → milk [90%, 1%]

Example
Tr# Items
T1 Beer, Milk
T2 Bread, Butter
T3 Bread, Butter, Jelly
T4 Bread, Butter, Milk
T5 Beer, Bread
Itemset Support (%)
Bread 80
Butter 60
Milk 40
Beer 40
Bread, Butter 60
Large Itemsets with minsup=30%
Consider the itemset
{Bread, Butter}, and the two
possible rules
Bread → Butter
Butter → Bread
Support({Bread, Butter}) / Support({Bread}) = 0.75
i.e., Confidence(Bread → Butter) = 75%
Support({Bread, Butter}) / Support({Butter}) = 1
i.e., Confidence(Butter → Bread) = 100%
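As a minimal sketch (plain Python; the transactions list and the support/confidence helpers are illustrative, not part of the slides), these figures can be reproduced directly:

transactions = [
    {"Beer", "Milk"},              # T1
    {"Bread", "Butter"},           # T2
    {"Bread", "Butter", "Jelly"},  # T3
    {"Bread", "Butter", "Milk"},   # T4
    {"Beer", "Bread"},             # T5
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # support(LHS and RHS together) divided by support(LHS)
    return support(lhs | rhs) / support(lhs)

print(support({"Bread", "Butter"}))       # 0.6  -> 60% support
print(confidence({"Bread"}, {"Butter"}))  # 0.75 -> Bread => Butter, 75%
print(confidence({"Butter"}, {"Bread"}))  # 1.0  -> Butter => Bread, 100%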
How Good is an Association Rule?
Are support and confidence enough?
Lift (improvement) tells us how much better a rule is at
predicting the result than just assuming the result in the
first place
Lift = P(LHS ∧ RHS) / (P(LHS) · P(RHS))
When lift > 1, the rule is better at predicting the
result than guessing
When lift < 1, the rule does worse than informed
guessing; negating the rule (the Negative Rule) gives a better
predictor than guessing
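Reusing the illustrative support() helper from the sketch above, the lift of Bread → Butter on the five-transaction example is 1.25, so the rule beats simply guessing that Butter is bought:

p_lhs  = support({"Bread"})            # 0.8
p_rhs  = support({"Butter"})           # 0.6
p_both = support({"Bread", "Butter"})  # 0.6

lift = p_both / (p_lhs * p_rhs)        # 0.6 / 0.48 = 1.25
print(lift)                            # 1.25 > 1: better than guessing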
Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:
R = sum over k = 1..d-1 of [ C(d, k) * sum over j = 1..d-k of C(d-k, j) ]
  = 3^d - 2^(d+1) + 1
If d=6, R = 602 rules
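As a quick sanity check (a small Python sketch, not from the slides), the count can be verified by brute force: each rule puts every item in the LHS, the RHS, or neither, with both sides non-empty.

from itertools import combinations

def rule_count(d):
    items = range(d)
    count = 0
    for k in range(1, d):                        # size of the LHS
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            count += 2 ** len(rest) - 1          # any non-empty RHS from the rest
    return count

d = 6
print(rule_count(d))          # 602, by enumeration
print(3**d - 2**(d+1) + 1)    # 602, from the closed-form expression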
The Problem of Lots of Data
A fast food restaurant could have 100 items on
its menu
How many combinations are there with 3 different
menu items? 161,700!
A supermarket: 10,000 or more unique items
About 50 million 2-item combinations
Well over 100 billion 3-item combinations
Use of product hierarchies (groupings) helps
address this common issue
Also, the number of transactions in a given time
period can be huge (hence expensive to
analyze)
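These counts are just binomial coefficients; for example (Python):

import math

print(math.comb(100, 3))     # 161,700 three-item menu combinations
print(math.comb(10_000, 2))  # 49,995,000  (about 50 million pairs)
print(math.comb(10_000, 3))  # 166,616,670,000  (well over 100 billion triples)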
Preparing Data for MBA
Determining scope of the dataset (one or
many stores, what time period, etc.)
Converting transaction data to itemsets
Generalizing items to appropriate level
Depends on objective of model
Rolling up rare items to get adequate support

Search Approach
Two sub-problems in discovering all association rules:

Find all sets of items (itemsets) that have transaction
support above minimum support
Itemsets that qualify are called large itemsets, and all others
small itemsets.

Generate, from each large itemset, rules that use items
from that itemset, as sketched below.
Given a large itemset Y and a subset X of Y
Take the support of Y and divide it by the support of X
If the ratio c is at least minconf, then X → (Y - X) is satisfied with
confidence factor c
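A sketch of this second step (the rules_from_itemset helper and the support() function are assumed, illustrative names, not a reference implementation):

from itertools import combinations

def rules_from_itemset(Y, support, minconf):
    # Emit every rule X -> (Y - X), with X a proper non-empty subset of Y,
    # whose confidence support(Y) / support(X) is at least minconf.
    Y = frozenset(Y)
    rules = []
    for k in range(1, len(Y)):
        for X in map(frozenset, combinations(Y, k)):
            c = support(Y) / support(X)
            if c >= minconf:
                rules.append((set(X), set(Y - X), c))
    return rules

# With the five-transaction example and minconf = 0.8:
# rules_from_itemset({"Bread", "Butter"}, support, 0.8)
#   -> [({'Butter'}, {'Bread'}, 1.0)]
# Bread -> Butter (confidence 0.75) falls below minconf and is dropped.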

Reducing the Number of Candidates
Apriori principle:
If an itemset is large, then all of its subsets must
also be large

Support of an itemset never exceeds the
support of its subsets
The Apriori Algorithm
Progressively identifies large itemsets of different sizes
Exploits the property that any subset of a large itemset is also
a large itemset
Also, any superset of a small itemset is also small

(Itemset lattice over items A, B, C, D: the singletons A, B, C, D; the pairs
AB, AC, AD, BC, BD, CD; the triples ABC, ABD, ACD, BCD; and ABCD)
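A compact, unoptimized Python sketch of this level-wise idea (names are illustrative; transactions is the list of item sets used earlier):

from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    sup = lambda s: sum(s <= t for t in transactions) / n
    # Level 1: large single-item itemsets
    singletons = {frozenset([i]) for t in transactions for i in t}
    large = {s for s in singletons if sup(s) >= minsup}
    all_large = set(large)
    k = 2
    while large:
        # Join: combine (k-1)-itemsets to form k-item candidates
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be large
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k - 1))}
        large = {c for c in candidates if sup(c) >= minsup}
        all_large |= large
        k += 1
    return all_large

# On the five transactions above with minsup = 0.3, this returns the four
# large single items plus {Bread, Butter}, matching the earlier table.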
Extending MBA
Dissociation rules
Combining transaction data with complementary
data
Shopper characteristics
Store characteristics
Seasonal factors
Analyzing patterns over time
Patterns that span multiple occasions
Need to sessionize data
Need to recognize shoppers across sessions
Usability of Association Rules
Explainability: High - intuitive explanations
Accuracy: Moderate - depends on rule quality
Scalability: Moderate/Low - performance of rule systems depends on
both the number and complexity of rules
Embeddability: Moderate/High - can be compiled in many cases
Tolerance for sparse data: Low - support and confidence are both
affected
Tolerance for noisy data: Moderate - how do you use outliers?
Development Speed: Low/Moderate - needs a lot of filtering
Dependence on Experts: Moderate/High - domain experts needed to filter rules
Advanced Data Mining
Text Mining
Mining non-textual data
Image and video data (Multimedia)
Spatial data
GIS
Temporal data
Time series
Behavioral patterns
Web Mining
Web usage
Web content
Mining Image Data
Traditional pattern recognition
Neural networks
Supervised learning
Discovering patterns
Unsupervised learning
Clustering
Mining Spatial Data
Spatial databases typically use special data
structures
Extensions of tree-structured indexes
Quad trees, R-trees, k-D trees, etc.
Relationships based on spatial descriptors
Overlapping, disjoint, contains, etc.
Distance-based clustering
Feature extraction
Association rules
If location is near a lake, pollution is low
Web Mining
Mining data that is obtained from the Web
Web Content mining
Web Usage mining
Web Content Mining
Search engines
Spiders and Crawlers
Metacrawlers
A major challenge is the unstructured form
of the data
Lack of high-level standards
Abuse of descriptors (meta-information)
Web Usage Mining
Mining Web logs
Data is relatively structured
Data is highly dynamic
Problems with identification and location
The inherently non-linear aspects of Web
usage behavior
Tracking both forward and backward links
Dynamic personalization
Issues and Trends
Mining across multiple data sources and
sets
Online mining: what are the patterns right
now?
Concerns about privacy and other ethical
questions
