\[
R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^{d} - 2^{d+1} + 1
\]
If d=6, R = 602 rules
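The figure for d = 6 can be checked by brute force. The sketch below enumerates every rule whose left-hand and right-hand sides are non-empty, disjoint subsets of d items and compares the count with the closed form 3^d - 2^(d+1) + 1. The function names are illustrative and do not come from the slides.

```python
from itertools import combinations


def count_rules_brute_force(d):
    """Count rules X -> Y where X and Y are non-empty, disjoint subsets of d items."""
    items = range(d)
    total = 0
    for k in range(1, d):  # k = size of the left-hand side
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            for j in range(1, len(rest) + 1):  # j = size of the right-hand side
                total += sum(1 for _ in combinations(rest, j))
    return total


def count_rules_closed_form(d):
    """Closed form for the same count: 3^d - 2^(d+1) + 1."""
    return 3**d - 2 ** (d + 1) + 1


print(count_rules_brute_force(6))   # 602
print(count_rules_closed_form(6))   # 602
```

The brute-force version is only practical for small d, which is itself an illustration of how quickly the rule space grows.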
The Problem of Lots of Data
A fast food restaurant could have 100 items on its menu
How many combinations are there with 3 different menu items? 161,700!
A supermarket could have 10,000 or more unique items
About 50 million 2-item combinations
About 167 billion 3-item combinations
Use of product hierarchies (groupings) helps
address this common issue
Also, the number of transactions in a given time period can be huge (hence expensive to analyze)
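These combination counts are binomial coefficients ("n choose k") and are quick to verify. The snippet below is a minimal check using Python's standard library; the variable names are illustrative.

```python
from math import comb

menu_items = 100            # fast food menu size from the example
supermarket_items = 10_000  # unique supermarket items from the example

print(comb(menu_items, 3))         # 161700
print(comb(supermarket_items, 2))  # 49995000
print(comb(supermarket_items, 3))  # 166616670000
```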
Preparing Data for MBA
Determining scope of dataset (one or many stores, what period, etc.)
Converting transaction data to itemsets
Generalizing items to appropriate level
Depends on objective of model
Rolling up rare items to get adequate support
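As a rough illustration of these preparation steps, the sketch below groups transaction rows into itemsets and optionally rolls items up to a coarser level through a product hierarchy. The sample rows, the hierarchy mapping, and the function name `to_itemsets` are all hypothetical, chosen only to show the idea.

```python
from collections import defaultdict

# Hypothetical point-of-sale rows: (transaction_id, item)
rows = [
    (1, "cola 330ml"), (1, "potato chips"),
    (2, "diet cola 500ml"), (2, "tortilla chips"), (2, "salsa"),
    (3, "cola 330ml"), (3, "salsa"),
]

# Hypothetical product hierarchy: item -> product group
hierarchy = {
    "cola 330ml": "soft drinks",
    "diet cola 500ml": "soft drinks",
    "potato chips": "snacks",
    "tortilla chips": "snacks",
}


def to_itemsets(rows, hierarchy=None):
    """Group rows by transaction; optionally generalize items via the hierarchy."""
    baskets = defaultdict(set)
    for tid, item in rows:
        if hierarchy is not None:
            # Items without a parent group are kept at their own level.
            item = hierarchy.get(item, item)
        baskets[tid].add(item)
    return dict(baskets)


print(to_itemsets(rows))             # item-level itemsets
print(to_itemsets(rows, hierarchy))  # group-level itemsets
```

Rolling up trades detail for support: "soft drinks" appears in all three baskets here, while no single cola product does.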
Search Approach
Two sub-problems in discovering all association rules:
Find all sets of items (itemsets) that have transaction
support above minimum support
Itemsets that qualify are called large itemsets, and all others
small itemsets.
From each large itemset, generate rules that use items from that itemset.
Given a large itemset Y and a non-empty proper subset X of Y
Take the support of Y and divide it by the support of X
If the ratio c is at least minconf, then the rule X → (Y - X) holds with
confidence c
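The rule-generation step can be sketched as follows. This is a minimal illustration that assumes Y is already known to be large (so every subset has non-zero support); the helper names `support` and `rules_from_itemset` and the sample transactions are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(Y, transactions, minconf):
    """Return (X, Y - X, confidence) for each proper subset X meeting minconf."""
    Y = frozenset(Y)
    supp_Y = support(Y, transactions)
    rules = []
    for size in range(1, len(Y)):  # non-empty proper subsets only
        for X in combinations(sorted(Y), size):
            X = frozenset(X)
            c = supp_Y / support(X, transactions)  # confidence of X -> (Y - X)
            if c >= minconf:
                rules.append((X, Y - X, c))
    return rules


# Hypothetical transactions, for illustration only
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

for X, rest, c in rules_from_itemset({"diapers", "beer"}, transactions, minconf=0.8):
    print(set(X), "->", set(rest), round(c, 2))
```

With these sample transactions, beer → diapers passes the 0.8 threshold while diapers → beer (confidence 0.75) does not, showing that confidence is not symmetric.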
Reducing Number of Candidates
Apriori principle:
If an itemset is large, then all of its subsets must
also be large
Support of an itemset never exceeds the
support of its subsets
The Apriori Algorithm
Progressively identifies large itemsets of different sizes
Exploits the property that any subset of a large itemset is also a large itemset
Also, any superset of a small itemset is also small
Itemset lattice for items A, B, C, D (one level per itemset size):
A B C D
AB AC AD BC BD CD
ABC ABD ACD BCD
ABCD
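A compact sketch of the level-wise search is shown below. It is a simplified illustration rather than a tuned implementation, and it assumes minimum support is given as an absolute transaction count. The function name and sample data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for all large itemsets.

    minsup is assumed to be an absolute count of transactions.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= minsup}
    large = dict(current)

    k = 2
    while current:
        # Join: combine large (k-1)-itemsets whose union has exactly k items.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k:
                # Prune: every (k-1)-subset must itself be large.
                if all(frozenset(s) in current
                       for s in combinations(union, k - 1)):
                    candidates.add(union)
        # Count: scan transactions for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= minsup}
        large.update(current)
        k += 1
    return large


# Hypothetical transactions over the items A, B, C, D
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"},
]
for itemset, count in sorted(apriori(transactions, minsup=3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Because D is small at level 1, no candidate containing D is ever generated or counted, which is where the savings come from.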
Extending MBA
Dissociation rules
Combining transaction data with complementary
data
Shopper characteristics
Store characteristics
Seasonal factors
Analyzing patterns over time
Patterns that span multiple occasions
Need to sessionize data
Need to recognize shoppers across sessions
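One simple way to sessionize is to split each shopper's event stream wherever the gap between consecutive events exceeds a timeout. The sketch below assumes a shopper identifier is already available on every event and uses a 30-minute timeout. Both are assumptions for illustration, as are the function name and sample events.

```python
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Group (shopper_id, timestamp, item) events into per-shopper sessions.

    A new session starts when a shopper's next event is more than `gap`
    after their previous one. The 30-minute default is an assumption.
    """
    sessions = {}
    last_seen = {}
    for shopper, ts, item in sorted(events, key=lambda e: (e[0], e[1])):
        if shopper not in last_seen or ts - last_seen[shopper] > gap:
            sessions.setdefault(shopper, []).append([])  # open a new session
        sessions[shopper][-1].append(item)
        last_seen[shopper] = ts
    return sessions


# Hypothetical events, for illustration only
events = [
    ("s1", datetime(2024, 1, 1, 10, 0), "milk"),
    ("s1", datetime(2024, 1, 1, 10, 10), "bread"),
    ("s1", datetime(2024, 1, 1, 12, 0), "beer"),
    ("s2", datetime(2024, 1, 1, 9, 0), "chips"),
]
print(sessionize(events))
```

Each resulting session can then be treated as one itemset, or consecutive sessions for the same shopper can be compared to look for patterns that span occasions.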
Usability of Association Rules
Property                    Rating          Comments
Explainability              High            Intuitive explanations
Accuracy                    Moderate        Depends on rule quality
Scalability                 Moderate/Low    Performance of rule systems depends on both number and complexity of rules
Embeddability               Moderate/High   Can be compiled in many cases
Tolerance for sparse data   Low             Support and confidence are both affected
Tolerance for noisy data    Moderate        How do you use outliers?
Development speed           Low/Moderate    Needs a lot of filtering
Dependence on experts       Moderate/High   Domain experts to filter rules
Advanced Data Mining
Text Mining
Mining non-textual data
Image and video data (Multimedia)
Spatial data
GIS
Temporal data
Time series
Behavioral patterns
Web Mining
Web usage
Web content
Mining Image Data
Traditional pattern recognition
Neural networks
Supervised learning
Discovering patterns
Unsupervised learning
Clustering
Mining Spatial Data
Spatial databases typically use special data
structures
Extensions of tree-structured indexes
Quad trees, R-trees, k-D trees, etc.
Relationships based on spatial descriptors
Overlapping, disjoint, contains, etc.
Distance-based clustering
Feature extraction
Association rules
If location is near lake, pollution is low
Web Mining
Mining data that is obtained from the Web
Web Content mining
Web Usage mining
Web Content Mining
Search engines
Spiders and Crawlers
Metacrawlers
A major challenge is the unstructured form
of the data
Lack of high-level standards
Abuse of descriptors (meta-information)
Web Usage Mining
Mining Web logs
Data is relatively structured
Data is highly dynamic
Problems with identification and location
The inherently non-linear aspects of Web
usage behavior
Tracking both forward and backward links
Dynamic personalization
Issues and Trends
Mining across multiple data sources and
sets
Online mining: what are the patterns right now?
Concerns about privacy and other ethical
questions