
Market Basket Analysis and

Advanced Data Mining


Professor Amit Basu
abasu@smu.edu

What is Market Basket Analysis?

Understanding behavior of shoppers

What items are bought together


What's in each shopping cart/basket?

Basket data consist of a collection of transaction dates and the items bought in each transaction
Itemset

How does this data differ from a transaction database?

Pivoting

Retail organizations are interested in making informed decisions and setting strategy based on analysis of transaction data
What to put on sale, how to place merchandise on shelves to maximize profit, customer segmentation based on buying patterns

Examples

Rule form: LHS → RHS

IF a customer buys diapers, THEN they also buy beer

diapers → beer

Transactions that purchase bread and butter also purchase milk
bread ∧ butter → milk

Customers who purchase maintenance agreements are very likely to purchase large appliances

When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners

Representations
What's the difference between these patterns?
(a) Risk = 0.3 * sin(numcards * dem1^0.25) + 0.83 * (pastdef - dem2) * cos(employed + dem1)^2
(b) Risk = 0.93 * priordefault + 0.23 * num_cards 1.3
* employed 0.734
(c) IF person has a good credit rating THEN they have
fewer accidents

Evaluation

Support: a measure of how often the collection of items in an association occurs together, as a percentage of all the transactions
In 2% of the purchases at a hardware store, both a pick and a shovel were bought

support = #tuples(LHS, RHS)/N

Confidence: the confidence of rule B given A is a measure of how likely it is that B occurs when A has occurred
100% means that B always occurs if A has occurred
confidence = #tuples(LHS, RHS) / #tuples(LHS)

Example: bread and butter → milk [90%, 1%]

Rules originating from the same itemset have identical support but can have different confidence

The association rules mining problem
Generate all association rules from the given dataset that have

support greater than a specified minimum

and

confidence greater than a specified minimum



Examples

Rule form:

LHS → RHS [confidence, support]

diapers → beer [60%, 0.5%]

90% of transactions that purchase bread and butter also purchase milk
bread and butter → milk [90%, 1%]

Example
Tr#   Items
T1    Beer, Milk
T2    Bread, Butter
T3    Bread, Butter, Jelly
T4    Bread, Butter, Milk
T5    Beer, Bread

Large itemsets with minsup = 30%

Itemset          Support (%)
Bread            80
Butter           60
Milk             40
Beer             40
Bread, Butter    60

Consider the itemset {Bread, Butter} and the two possible rules
Bread → Butter
Butter → Bread

Support({Bread, Butter}) / Support({Bread}) = 0.75
i.e., Confidence(Bread → Butter) = 75%
Support({Bread, Butter}) / Support({Butter}) = 1
i.e., Confidence(Butter → Bread) = 100%
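
The numbers in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names `count`, `support`, and `confidence` are made up for illustration, and confidence is computed from raw transaction counts (equivalent to dividing the two supports) to avoid floating-point rounding.

```python
# The five transactions from the example table.
transactions = [
    {"Beer", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Butter", "Jelly"},
    {"Bread", "Butter", "Milk"},
    {"Beer", "Bread"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """#tuples(LHS, RHS) / #tuples(LHS)."""
    return count(set(lhs) | set(rhs)) / count(lhs)

print(support({"Bread", "Butter"}))       # 0.6
print(confidence({"Bread"}, {"Butter"}))  # 0.75
print(confidence({"Butter"}, {"Bread"}))  # 1.0
```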

How Good is an Association Rule?

Are support and confidence enough?

Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place

Lift = P(LHS ∧ RHS) / (P(LHS) · P(RHS))

When lift > 1, the rule is better at predicting the result than guessing

When lift < 1, the rule does worse than informed guessing, and the negative rule (LHS → NOT RHS) predicts better than guessing
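
As an illustration, lift can be computed on the five-transaction bread and butter example used earlier. This is a sketch, not a library routine: `count` and `lift` are hypothetical helper names, and the formula is rearranged to use raw counts so the results are exact.

```python
# Same five transactions as in the support/confidence example.
transactions = [
    {"Beer", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Butter", "Jelly"},
    {"Bread", "Butter", "Milk"},
    {"Beer", "Bread"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def lift(lhs, rhs):
    """P(LHS and RHS) / (P(LHS) * P(RHS)), written in terms of counts."""
    n = len(transactions)
    both = count(set(lhs) | set(rhs))
    return (both * n) / (count(lhs) * count(rhs))

print(lift({"Bread"}, {"Butter"}))  # 1.25  -> better than guessing
print(lift({"Beer"}, {"Bread"}))    # 0.625 -> worse than guessing
```

Here Bread → Butter has lift above 1, while Beer → Bread has lift below 1 even though its confidence is 50%, because Bread already appears in 80% of all baskets.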

Computational Complexity

Given d unique items:

Total number of itemsets = 2^d

Total number of possible association rules:


\[ R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1 \]
If d=6, R = 602 rules
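
The rule count can be checked numerically. The sketch below (function names are illustrative) counts rules by choosing a k-item left-hand side and then a non-empty right-hand side from the remaining d - k items, and compares the result with the closed form 3^d - 2^(d+1) + 1.

```python
from math import comb

def rule_count(d):
    """Count rules by summing over LHS size k and RHS size j."""
    total = 0
    for k in range(1, d):  # LHS has k items, k = 1 .. d-1
        rhs_choices = sum(comb(d - k, j) for j in range(1, d - k + 1))
        total += comb(d, k) * rhs_choices
    return total

def rule_count_closed_form(d):
    """Closed form of the same sum."""
    return 3**d - 2**(d + 1) + 1

print(rule_count(6))              # 602
print(rule_count_closed_form(6))  # 602
```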

The Problem of Lots of Data

Fast food restaurant: could have 100 items on its menu

How many combinations are there with 3 different menu items? 161,700!

Supermarket: 10,000 or more unique items

50 million 2-item combinations

More than 100 billion 3-item combinations (about 167 billion for exactly 10,000 items)

Use of product hierarchies (groupings) helps address this common issue

Also, the number of transactions in a given time period could be huge (hence expensive to analyze)
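
The combination counts on this slide are binomial coefficients, so they can be computed exactly with Python's standard `math.comb`. The variable names are illustrative only, and the supermarket figures assume exactly 10,000 items.

```python
from math import comb

menu_triples = comb(100, 3)            # 3-item combos from a 100-item menu
supermarket_pairs = comb(10_000, 2)    # 2-item combos from 10,000 items
supermarket_triples = comb(10_000, 3)  # 3-item combos from 10,000 items

print(f"{menu_triples:,}")         # 161,700
print(f"{supermarket_pairs:,}")    # 49,995,000
print(f"{supermarket_triples:,}")  # 166,616,670,000
```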

Preparing Data for MBA

Determining scope of dataset (one or many stores, what period, etc.)

Converting transaction data to itemsets

Generalizing items to appropriate level

Depends on objective of model

Rolling up rare items to get adequate support
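
A rough sketch of two of these preparation steps: grouping line-item rows into one itemset per transaction, and optionally generalizing items to a category. The rows, the product hierarchy, and the function name `to_itemsets` are all hypothetical, made up for illustration.

```python
from collections import defaultdict

# Hypothetical line-item rows: (transaction id, item).
rows = [
    ("T1", "cola 12oz"),
    ("T1", "potato chips"),
    ("T2", "lemon soda 12oz"),
    ("T2", "potato chips"),
    ("T3", "cola 2L"),
]

# Hypothetical product hierarchy used to roll specific items up to a category.
category = {
    "cola 12oz": "soda",
    "cola 2L": "soda",
    "lemon soda 12oz": "soda",
    "potato chips": "snacks",
}

def to_itemsets(rows, hierarchy=None):
    """Group rows by transaction id; map items through hierarchy if given."""
    baskets = defaultdict(set)
    for tid, item in rows:
        if hierarchy is not None:
            item = hierarchy.get(item, item)  # unmapped items stay as they are
        baskets[tid].add(item)
    return dict(baskets)

print(to_itemsets(rows))            # item-level baskets
print(to_itemsets(rows, category))  # category-level baskets
```

At item level no soda appears in more than one basket; after the roll-up, "soda" appears in all three, which is how generalizing rare items raises their support.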



Search Approach
Two sub-problems in discovering all association rules:

Find all sets of items (itemsets) that have transaction support above minimum support

Itemsets that qualify are called large itemsets, and all others small itemsets.

Generate from each large itemset, rules that use items from the large itemset.
Given a large itemset Y and a non-empty proper subset X of Y:
Take the support of Y and divide it by the support of X
If the ratio c is at least minconf, then the rule X → (Y − X) holds with confidence c
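
The second sub-problem can be sketched as follows, assuming the supports of the large itemsets are already known. The function name `rules_from_itemset` is hypothetical, and the table holds raw transaction counts from the earlier five-transaction bread and butter example (a ratio of counts equals the ratio of supports).

```python
from itertools import combinations

# Transaction counts for the large itemsets of the bread/butter example.
support_count = {
    frozenset({"Bread"}): 4,
    frozenset({"Butter"}): 3,
    frozenset({"Bread", "Butter"}): 3,
}

def rules_from_itemset(y, support_count, minconf):
    """Return (X, Y - X, c) for every non-empty proper subset X of Y with c >= minconf."""
    y = frozenset(y)
    rules = []
    for size in range(1, len(y)):
        for x in combinations(sorted(y), size):
            x = frozenset(x)
            c = support_count[y] / support_count[x]  # support(Y) / support(X)
            if c >= minconf:
                rules.append((set(x), set(y - x), c))
    return rules

for lhs, rhs, c in rules_from_itemset({"Bread", "Butter"}, support_count, 0.9):
    print(lhs, "->", rhs, c)  # {'Butter'} -> {'Bread'} 1.0
```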

Reducing Number of
Candidates

Apriori principle:

If an itemset is large, then all of its subsets must also be large

Support of an itemset never exceeds the support of its subsets

The Apriori Algorithm

Progressively identifies large itemsets of different sizes

Exploits the property that any subset of a large itemset is also a large itemset

Also, any superset of a small itemset is also small
Itemset lattice for items A, B, C, D:
A  B  C  D
AB  AC  AD  BC  BD  CD
ABC  ABD  ACD  BCD
ABCD
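
A minimal level-wise sketch in the spirit of Apriori, run on the five-transaction bread and butter example. It is a simplified illustration under stated assumptions, not an optimized or canonical implementation: the function name `apriori` and its interface are made up here, and an itemset is treated as large when its support is at least `minsup`.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for all large itemsets, built level by level."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    large = {}
    k = 1
    while current:
        # Count each candidate and keep those meeting minimum support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: m / n for c, m in counts.items() if m / n >= minsup}
        large.update(level)
        # Join step: merge large k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate that has a small (k-1)-item subset.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return large

transactions = [
    {"Beer", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Butter", "Jelly"},
    {"Bread", "Butter", "Milk"},
    {"Beer", "Bread"},
]

for itemset, s in apriori(transactions, 0.3).items():
    print(sorted(itemset), s)
```

With minsup = 30%, Jelly is dropped at the first level, so no candidate containing it is ever counted; only {Bread, Butter} survives at the second level.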

Extending MBA

Dissociation rules

Combining transaction data with complementary data

Shopper characteristics

Store characteristics

Seasonal factors

Analyzing patterns over time

Patterns that span multiple occasions

Need to sessionize data

Need to recognize shoppers across sessions
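
One possible way to sessionize is to split each shopper's events wherever the gap between consecutive events exceeds a threshold. The sketch below assumes events already carry a shopper identifier and uses an arbitrary 30-minute gap; the event data, the threshold, and the function name `sessionize` are all illustrative assumptions.

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Group (shopper, timestamp, item) events into per-shopper sessions."""
    sessions = []   # list of (shopper, [items])
    last_seen = {}  # shopper -> (time of last event, index of open session)
    for shopper, ts, item in sorted(events, key=lambda e: (e[0], e[1])):
        prev = last_seen.get(shopper)
        if prev is None or ts - prev[0] > gap:
            sessions.append((shopper, [item]))  # start a new session
            idx = len(sessions) - 1
        else:
            idx = prev[1]
            sessions[idx][1].append(item)       # continue the open session
        last_seen[shopper] = (ts, idx)
    return sessions

# Hypothetical events.
events = [
    ("s1", datetime(2024, 1, 1, 10, 0), "bread"),
    ("s2", datetime(2024, 1, 1, 10, 5), "beer"),
    ("s1", datetime(2024, 1, 1, 10, 10), "butter"),
    ("s1", datetime(2024, 1, 1, 15, 0), "milk"),
]

print(sessionize(events))
# [('s1', ['bread', 'butter']), ('s1', ['milk']), ('s2', ['beer'])]
```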



Usability of Association Rules
Explainability               High            Intuitive explanations
Accuracy                     Moderate        Depends on rule quality
Scalability                  Moderate/Low    Performance of rule systems depends on both number and complexity of rules
Embeddability                Moderate/High   Can be compiled in many cases
Tolerance for sparse data    Low             Support and confidence are both affected
Tolerance for noisy data     Moderate        How do you use outliers?
Development speed            Low/Moderate    Needs a lot of filtering
Dependence on experts        Moderate/High   Domain experts to filter rules

Advanced Data Mining

Text Mining

Mining non-textual data

Image and video data (Multimedia)

Spatial data

GIS

Temporal data

Time series

Behavioral patterns

Web Mining

Web usage

Web content

Mining Image Data

Traditional pattern recognition

Neural networks

Supervised learning

Discovering patterns

Unsupervised learning

Clustering

Mining Spatial Data

Spatial databases typically use special data structures

Extensions of tree-structured indexes

Quad trees, R-trees, k-D trees, etc.

Relationships based on spatial descriptors

Overlapping, disjoint, contains, etc.

Distance-based clustering

Feature extraction

Association rules

If location is near lake, pollution is low



Web Mining
Mining data that is obtained from the Web

Web Content mining

Web Usage mining



Web Content Mining

Search engines

Spiders and Crawlers

Metacrawlers

A major challenge is the unstructured form of the data

Lack of high-level standards

Abuse of descriptors (meta-information)



Web Usage Mining

Mining Web logs

Data is relatively structured

Data is highly dynamic

Problems with identification and location

The inherently non-linear aspects of Web usage behavior

Tracking both forward and backward links

Dynamic personalization

Issues and Trends

Mining across multiple data sources and sets

Online mining: what are the patterns right now?

Concerns about privacy and other ethical questions

Property

Accuracy
