Amit Basu - Data Mining III

Market Basket Analysis and
Advanced Data Mining

Professor Amit Basu
abasu@smu.edu

What is Market Basket Analysis?
Understanding behavior of shoppers
What items are bought together

Whats in each shopping cart/basket?
Basket data consist of collection of transaction date and

items bought in a transaction
Itemset
How does this data differ from a transaction database?
Pivoting
Retail organizations interested in generating qualified

decisions and strategy based on analysis of transaction
data
what to put on sale, how to place merchandise on shelves for
maximizing profit, customer segmentation based on buying pattern

Examples
Rule form: LHS RHS
IF a customer buys diapers, THEN they also buy beer
diapers beer
Transactions that purchase bread and butter also

purchase milk
bread butter milk
Customers who purchase maintenance agreements

are very likely to purchase large appliances
When a new hardware store opens, one of the most

commonly sold items is toilet bowl cleaners

Representations
Whats the difference between these patterns?
(a) Risk = 0.3 * sin(numcards * dem
1
0.25
) +
0.83 * (pastdef - dem2) * cos(employed+dem1)
2
(b) Risk = 0.93 * priordefault + 0.23 * num_cards 1.3
* employed 0.734
(c) IF person has a good credit rating THEN they have
fewer accidents

Evaluation
Support : measure of how often the collection of items in

an association occur together as a percentage of all the
transactions
In 2% of the purchases at hardware store, both pick and shovel
were bought
support = #tuples(LHS, RHS)/N
Confidence : confidence of rule B given A is a measure

of how much more likely it is that B occurs when A has
occurred
100% meaning that B always occurs if A has occurred
confidence = #tuples(LHS, RHS) / #tuples(LHS)
Example: bread and butter milk [90%, 1%]
Rules originating from the same itemset have identical

support but can have different confidence

The association rules mining
problem
Generate all association rules from the
given dataset that have
support greater than a specified minimum

and
confidence greater than a specified minimum

Examples
Rule form:
LHS RHS [confidence, support]
diapers beer [60%, 0.5%]
90% of transactions that purchase bread and

butter also purchase milk
bread and butter milk [90%, 1%]

Example
Tr# Items
T1 Beer, Milk
T2 Bread, Butter
T3 Bread, Butter, Jelly
T4 Bread, Butter, Milk
T5 Beer, Bread
Itemset Support
Bread 80
Butter 60
Milk 40
Beer 40
Bread, Butter 60
Large Itemsets with minsup=30%
Consider the itemset
{Bread, Butter}, and the two
possible rules
Bread Butter
Butter Bread
Support({Bread, Butter})/support({Bread} = .
75
i.e., Confidence(Bread Butter) = 75%
Support({Bread, Butter})/support({Butter} = 1
i.e. Confidence(Butter Bread) = 100%

How Good is an Association Rule?
Is support and confidence enough?
Lift (improvement) tells us how much better a rule is at

predicting the result than just assuming the result in the
first place
Lift = P(LHS^RHS) / (P(LHS).P(RHS)
When lift > 1 then the rule is better at predicting the result
than guessing
When lift < 1, the rule is doing worse than informed

guessing and using the Negative Rule produces a better
rule than guessing

Computational Complexity
Given d unique items:
Total number of itemsets = 2

d
Total number of possible association rules:

1 2 3
1
1
1 1
+
1
]
1
,
_

,
_

d d
d
k
k d
j
j
k d
k
d
R
If d=6, R = 602 rules

The Problem of Lots of Data
Fast Food Restaurantcould have 100 items on

its menu
How many combinations are there with 3 different

menu items? 161,700 !
Supermarket10,000 or more unique items
50 million 2-item combinations
100 billion 3-item combinations
Use of product hierarchies (groupings) helps

address this common issue
Also, the number of transactions in a given time-

period could also be huge (hence expensive to
analyze)

Preparing Data for MBA
Determining scope of dataset (one or

many stores, what period, etc)
Converting transaction data to itemsets
Generalizing items to appropriate level
Depends on objective of model
Rolling up rare items to get adequate support

Search Approach
Two sub-problems in discovering all association rules:
Find all sets of items (itemsets) that have transaction

support above minimum support
Itemsets that qualify are called large itemsets, and all others
small itemsets.
Generate from each large itemset, rules that use items

from the large itemset.
Given a large itemset Y, and X is a subset of Y
Take the support of Y and divide it by the support of X
If the ratio c is at least minconf, then X (Y - X) is satisfied with
confidence factor c

Reducing Number of
Candidates
Apriori principle:
If an itemset is large, then all of its subsets must

also be large
Support of an itemset never exceeds the

support of its subsets

The Apriori Algorithm
Progressively
identifies large
itemsets of different
sizes
Exploits the property

that any subset of a
large itemset is also a
large itemset
Also, any superset of a

small itemset is also
small
A
C D B
AB AC AD BC BD CD
ABC ABD ACD BCD
ABCD

Extending MBA
Dissociation rules
Combining transaction data with complementary

data
Shopper characteristics
Store characteristics
Seasonal factors
Analyzing patterns over time
Patterns that span multiple occasions
Need to sessionize data
Need to recognize shoppers across sessions

Usability of Association Rules
Explainability High Intuitive explanations
Accuracy Moderate Depends on rule quality
Scalability Moderate/Low Performance of rule systems depends on
both no. and complexity of rules
Embeddability Moderate/high Can be compiled in many cases
Tolerance for sparse data Low Support and confidence are both affected
Tolerance for noisy data Moderate How do you use outliers?
Development Speed Low/Moderate Needs lot of filtering
Dependence on Experts Moderate/high Domain experts to filter rules

Advanced Data Mining
Text Mining
Mining non-textual data
Image and video data (Multimedia)
Spatial data
GIS
Temporal data
Time series
Behavioral patterns
Web Mining
Web usage
Web content

Mining Image Data
Traditional pattern recognition
Neural networks
Supervised learning
Discovering patterns
Unsupervised learning
Clustering

Mining Spatial Data
Spatial databases typically use special data

structures
Extensions of tree-structured indexes
Quad trees, R-trees, k-D trees, etc.
Relationships based on spatial descriptors
Overlapping, disjoint, contains, etc.
Distance-based clustering
Feature extraction
Association rules
If location is near lake, pollution is low

Web Mining
Mining data that is obtained from the Web
Web Content mining
Web Usage mining

Web Content Mining
Search engines
Spiders and Crawlers
Metacrawlers
A major challenge is the unstructured form

of the data
Lack of high-level standards
Abuse of descriptors (meta-information)

Web Usage Mining
Mining Web logs
Data is relatively structured
Data is highly dynamic
Problems with identification and location
The inherently non-linear aspects of Web

usage behavior
Tracking both forward and backward links
Dynamic personalization

Issues and Trends
Mining across multiple data sources and

sets
Online mining what are the patterns right

now?
Concerns about privacy and other ethical

questions
Property
Accuracy

Amit Basu - Data Mining III

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Amit Basu - Data Mining III

Uploaded by

Copyright:

Available Formats

Market Basket Analysis and

Advanced Data Mining

Understanding behavior of shoppers

What items are bought together

Basket data consist of collection of transaction date and

How does this data differ from a transaction database?

Retail organizations interested in generating qualified

Rule form: LHS RHS

IF a customer buys diapers, THEN they also buy beer

Transactions that purchase bread and butter also

Customers who purchase maintenance agreements

When a new hardware store opens, one of the most

Support : measure of how often the collection of items in

support = #tuples(LHS, RHS)/N

Confidence : confidence of rule B given A is a measure

Example: bread and butter milk [90%, 1%]

Rules originating from the same itemset have identical

support greater than a specified minimum

confidence greater than a specified minimum

LHS RHS [confidence, support]

diapers beer [60%, 0.5%]

90% of transactions that purchase bread and

Is support and confidence enough?

Lift (improvement) tells us how much better a rule is at

Lift = P(LHS^RHS) / (P(LHS).P(RHS)

When lift < 1, the rule is doing worse than informed

Given d unique items:

Total number of itemsets = 2

Total number of possible association rules:

Fast Food Restaurantcould have 100 items on

How many combinations are there with 3 different

Supermarket10,000 or more unique items

50 million 2-item combinations

100 billion 3-item combinations

Use of product hierarchies (groupings) helps

Also, the number of transactions in a given time-

Determining scope of dataset (one or

Converting transaction data to itemsets

Generalizing items to appropriate level

Depends on objective of model

Rolling up rare items to get adequate support

Find all sets of items (itemsets) that have transaction

Generate from each large itemset, rules that use items

If an itemset is large, then all of its subsets must

Support of an itemset never exceeds the

Exploits the property

Also, any superset of a

Combining transaction data with complementary

Analyzing patterns over time

Patterns that span multiple occasions

Need to sessionize data

Need to recognize shoppers across sessions

Mining non-textual data

Image and video data (Multimedia)

Traditional pattern recognition

Spatial databases typically use special data

Extensions of tree-structured indexes

Quad trees, R-trees, k-D trees, etc.

Relationships based on spatial descriptors

Overlapping, disjoint, contains, etc.

If location is near lake, pollution is low

Web Content mining

Web Usage mining

Spiders and Crawlers