Data Mining
Raghu Ramakrishnan
Yahoo! Research
University of Wisconsin–Madison (on leave)
[Figure: customer base partitioned into clusters, labeled Group 1–Group 4]
4. Evaluate results:
– Many “uninteresting” clusters
– One interesting cluster! Customers with both business and personal accounts; unusually high percentage of likely respondents
Action:
• New marketing campaign
Result:
• Acceptance rate for home equity offers more than doubled
Examples:
• Clustering
• Linear regression model
• Classification model
• Frequent itemsets and association rules
• Support Vector Machines
[Figure: decision tree for the Minivan example, with the regions it induces in the (Age, Car Type) plane]

            Age
        <30 /  \ >=30
    Car Type    YES
  Minivan / \ Sports, Truck
      YES     NO

[Region plot: Age axis split at 30; Age >= 30 → YES for all car types; Age < 30 → YES for Minivan, NO for Sports and Truck]
TECS 2007, Data Mining
Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma — Yahoo! Research
Decision Trees
• A decision tree T encodes d (a classifier or regression function) in the form of a tree.
• A node t in T without children is called a leaf node. Otherwise t is called an internal node.
[Figure: the Age / Car Type decision tree from above]

Encoded classifier:
If (age < 30 and carType = Minivan) Then YES
If (age < 30 and (carType = Sports or carType = Truck)) Then NO
If (age >= 30) Then YES
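The encoded classifier above can be sketched directly as code; a minimal illustration (attribute names assumed), not a general tree learner:

```python
def classify(age, car_type):
    """Evaluate the decision tree: split on Age first, then Car Type."""
    if age >= 30:
        return "YES"
    # age < 30: split on Car Type
    if car_type == "Minivan":
        return "YES"
    return "NO"  # Sports or Truck

print(classify(25, "Minivan"))
```

Each root-to-leaf path corresponds to one If/Then rule in the encoded classifier.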
[Figure: choosing a split point for Age between the attribute values 30 and 35, using the Yes/No class counts on each side]
[Figure: RainForest — the training database stays on disk; only the AVC-sets for the current node (here the root split Age <30 / >=30) are built in main memory]
TECS 2007, Data Mining
Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, R. Ramakrishnan,
Pradeep Tamma Yahoo! Research
RainForest Algorithms: RF-Hybrid
[Figure: after splitting the root on Age < 30, AVC-sets for the child nodes are built in main memory from another scan of the database]
TECS 2007, Data Mining
Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, R. Ramakrishnan,
Pradeep Tamma Yahoo! Research
RainForest Algorithms: RF-Hybrid
Third scan: as we expand the tree, we run out of memory and have to “spill” partitions to disk, then recursively read and process them later.
[Figure: the tree expanded with splits Age < 30, Sal < 20k, Car == S; spilled partitions are written back to the database and processed recursively]
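The AVC-sets that RainForest keeps in main memory are just (Attribute Value, Class label) count tables, one per attribute per node. A minimal sketch (toy records; not the RainForest implementation itself):

```python
from collections import defaultdict

def avc_set(records, attr, label):
    """Count (attribute value, class label) pairs for one attribute.
    records: list of dicts; attr/label: column names."""
    counts = defaultdict(int)
    for r in records:
        counts[(r[attr], r[label])] += 1
    return dict(counts)

records = [
    {"age": 25, "car": "Minivan", "buy": "YES"},
    {"age": 25, "car": "Sports",  "buy": "NO"},
    {"age": 40, "car": "Truck",   "buy": "YES"},
]
print(avc_set(records, "car", "buy"))
```

The point of the data structure: these counts are all a node needs to evaluate candidate splits, and they fit in memory even when the training database does not.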
• CLARANS
• DBSCAN
• BIRCH
• CLIQUE
• CURE
• …
• The above algorithms can be used to cluster documents after reducing their dimensionality using SVD
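The SVD reduction step can be sketched with NumPy; a toy term-document matrix is assumed (real pipelines would start from TF-IDF vectors and then hand the reduced vectors to any of the clustering algorithms above):

```python
import numpy as np

# Toy matrix: rows = documents, columns = terms.
# Documents 0/1 share vocabulary, as do documents 2/3.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

# Truncated SVD: keep the top-k singular directions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_2d = U[:, :k] * s[:k]   # each document as a k-dim vector

# Documents 0/1 and 2/3 land close together in the reduced
# space and can then be clustered.
print(docs_2d.shape)
```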
• Imposing constraints
– Only find rules involving the dairy department
– Only find rules involving expensive products
– Only find rules with “whiskey” on the right-hand side
– Only find rules with “milk” on the left-hand side
– Hierarchies on the items
– Calendars (every Sunday, every 1st of the month)
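The item constraints above can be illustrated by filtering mined rules; a toy sketch with rules as (lhs, rhs) itemset pairs (item names illustrative — real constraint-based miners push these predicates into the mining loop to prune candidates early, rather than filtering afterwards):

```python
def filter_rules(rules, lhs_has=None, rhs_has=None):
    """Keep only rules with the required item on the given side."""
    out = []
    for lhs, rhs in rules:
        if lhs_has is not None and lhs_has not in lhs:
            continue
        if rhs_has is not None and rhs_has not in rhs:
            continue
        out.append((lhs, rhs))
    return out

rules = [
    ({"milk", "bread"}, {"butter"}),
    ({"beer"}, {"whiskey"}),
    ({"milk"}, {"cereal"}),
]
print(filter_rules(rules, lhs_has="milk"))
```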
• Sample Applications
– Direct marketing
– Fraud detection for medical insurance
– Floor/shelf planning
– Web site layout
– Cross-selling
Extract — Data consistency?

[Figure: extracting a case table from source data; sample case: customer (ID 1, Age 35, Gender M, Income 380,000) with purchases TV (qty 1, Appliance), Coke (qty 6, Drink), Ham (qty 3, Food)]
• Keys: Columns that uniquely identify a case
• Attributes: Columns that describe a case
– Value: A state associated with the attribute in a specific case
– Attribute Property: Columns that describe an attribute; unique for a specific attribute value (TV is always an appliance)
– Attribute Modifier: Columns that represent additional “meta” information for an attribute (e.g., weight of a case, certainty of prediction)
-- MyDMM: the mining model (its algorithm is named when the model is created)
SELECT [Customers].[ID],
       MyDMM.[Age],
       PredictProbability(MyDMM.[Age])
FROM MyDMM
PREDICTION JOIN [Customers]
  ON MyDMM.[Gender] = [Customers].[Gender]
 AND MyDMM.[Hair Color] = [Customers].[Hair Color]
[Figure: two dimension hierarchies and a fact table]

Automobile dimension: ALL (level 3) → Category {Sedan, Truck} (level 2) → Model {Civic, Camry, F150, Sierra} (level 1)
Location dimension:   ALL (level 3) → Region {East, West} (level 2) → State {NY, MA, TX, CA} (level 1)

Fact table:
FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
[Figure: the same hierarchies, now with an imprecise fact p5 recorded at the Category level]

Fact table:
FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100   ← Auto known only at the Category level (Truck), not the Model level
Query: Auto = F150, Loc = MA, SUM(Repair) = ??? How do we treat p5?
[Figure: the imprecise fact p5 (Truck, MA) could belong to either cell (F150, MA) or (Sierra, MA)]

Fact table:
FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

One option: allocate p5 across the candidate cells with a weight per cell, e.g.:
FactID  Auto    Loc  Repair  Weight
p5      F150    MA   100     0.5
p5      Sierra  MA   100     0.5
Count-based allocation of p5 between candidate cells c1 and c2:

p(c1, p5) = Count(c1) / (Count(c1) + Count(c2)) = 2 / (2 + 1)
p(c2, p5) = Count(c2) / (Count(c1) + Count(c2)) = 1 / (2 + 1)

[Figure: p5 is ambiguous between cells c1 and c2 in the MA/East row; c1 contains two precise facts and c2 contains one, while p1, p2 fall in NY]
[Figure: the same cells, now with a measure attribute per fact]

ID  Sales
p1  100
p2  150
p3  300
p4  200
p5  250
p6  400
Truck
Q (c ) Q (c ) F150 Sierra
pc,r
Q(c ') Qsum(r ) r
MA
c 'region ( r ) c1 c2
East
NY
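Both allocation policies share the same proportional form and can be sketched in a few lines (cell names and quantities illustrative):

```python
def allocation_weights(cells, q):
    """Split an imprecise fact over its candidate cells in proportion
    to a per-cell quantity q: count-based (q = Count) or
    measure-based (q = Q)."""
    total = sum(q[c] for c in cells)
    return {c: q[c] / total for c in cells}

# Count-based example: c1 holds 2 precise facts, c2 holds 1,
# so p5 is allocated 2/3 to c1 and 1/3 to c2.
weights = allocation_weights(["c1", "c2"], {"c1": 2, "c2": 1})
print(weights)
```

The weights always sum to 1, so the imprecise fact contributes its full measure exactly once across the cells it could belong to.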
What is a Good Allocation Policy?
[Figure: the Truck/MA example with the imprecise fact p5]

We propose desiderata that enable an appropriate definition of query semantics:
• Consistency: specifies the relationship between answers to related queries on a fixed data set
[Figure: a fixed data set gives rise to multiple possible completions of its imprecise facts — possible worlds [Kripke63, …]]
[Figure: four possible worlds w1–w4 for the example, one per assignment of the imprecise fact p5 to F150 or Sierra within MA]
Query Semantics
Goal: look for patterns of unusually high numbers of applications.

Fact table D — Z: Dimensions, Y: Measure:
Location  Time     # of App.
…         …        …
AL, USA   Dec, 04  2
…         …        …
WY, USA   Dec, 04  3

Dimension hierarchies:
Location: All → State (AL, …, WY)
Time:     All → Month (Jan 86, …, Dec 86)
Fact table D — Z: Dimensions, X: Predictors, Y: Class:
Location  Time     Race   Sex  …  Approval
AL, USA   Dec, 04  White  M    …  Y
…         …        …      …    …  …
WY, USA   Dec, 04  Black  F    …  N

Dimension hierarchies:
Location: All → State (AL, …, WY)
Time:     All → Month (Jan 86, …, Dec 86)

For each cube subset, e.g. [USA, Dec 04](D), build a model h(X, [USA, Dec 04](D)), e.g. a decision tree. Measure in a cell:
• Accuracy of the model
• Predictiveness of Race measured based on that model
• Similarity between that model and a given model
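The per-cell modeling idea can be sketched with a deliberately trivial stand-in model (a majority-class predictor instead of a decision tree; column values illustrative): group the fact table by (Location, Time) cell, fit one model per cell, and use a model-derived number as the cell's measure.

```python
from collections import Counter, defaultdict

def cell_measures(rows):
    """One trivial majority-class model per (Location, Time) cell;
    the cell's measure is that model's training accuracy."""
    by_cell = defaultdict(list)
    for loc, time, label in rows:
        by_cell[(loc, time)].append(label)
    measures = {}
    for cell, labels in by_cell.items():
        _, count = Counter(labels).most_common(1)[0]
        measures[cell] = count / len(labels)
    return measures

rows = [
    ("AL", "Dec04", "Y"), ("AL", "Dec04", "Y"), ("AL", "Dec04", "N"),
    ("WY", "Dec04", "N"),
]
print(cell_measures(rows))
```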
[Figure: cell values = similarity between the cell’s model and a given model h0(X), evaluated on a test set; e.g. row CA with monthly values 0.4, 0.2, 0.3, 0.6, 0.5, … across 2004 and 2003]

Interpretation: the loan decision process in the USA during Dec 04 was similar to a discriminatory decision model h0(X).
Example (6/7): Predictiveness
[Figure: cell values = predictiveness of Race, computed by building a model per cell and evaluating predictions on a test set; e.g. row CA with values 0.4, 0.2, 0.3, 0.6, 0.5, … and row USA with values 0.2, 0.3, 0.9, …]

Interpretation: Race was an important predictor of the loan approval decision in the USA during Dec 04.
Model Accuracy
Comparing models h1 and h2 on a test set T:

• Agreement: (1/|T|) Σ_{x∈T} I(h1(x) = h2(x))
• KL-divergence-based: Σ_x Σ_y p_{h1}(y|x) log( p_{h1}(y|x) / p_{h2}(y|x) )
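Both comparison measures above are straightforward to compute; a minimal sketch with toy threshold classifiers (the functions h1, h2 and the label set are illustrative):

```python
import math

def agreement(h1, h2, test_set):
    """(1/|T|) * sum over x of I(h1(x) == h2(x))."""
    return sum(h1(x) == h2(x) for x in test_set) / len(test_set)

def kl_distance(p1, p2, test_set, labels):
    """Sum over x, y of p1(y|x) * log(p1(y|x) / p2(y|x))."""
    total = 0.0
    for x in test_set:
        for y in labels:
            total += p1(x, y) * math.log(p1(x, y) / p2(x, y))
    return total

h1 = lambda x: "Y" if x >= 0 else "N"
h2 = lambda x: "Y" if x >= 1 else "N"
print(agreement(h1, h2, [-2, -1, 0, 1, 2]))  # disagree only at x = 0
```

The KL-based measure is zero exactly when the two models assign identical conditional distributions on the test set.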
[Table: cell value = predictiveness of Race per (state, month) cell across 2004 and 2003; e.g. row AB: 0.4, 0.2, 0.1, 0.1, 0.2, …]
• Naïve Bayes:
– Scoring function: algebraic
• Kernel-density-based classifier:
– Scoring function: distributive
• Decision tree, random forest:
– Neither distributive nor algebraic
• PBE: Probability-based ensemble (new)
– Makes any machine-learning model distributive (an approximation)
[Figure: execution time (sec, 0–2500) vs. number of records (40K–200K). The exhaustive methods (RFex, KDCex, NBex, J48ex) grow steeply; bottom-up score computation (NB, KDC, RF-PBE, J48-PBE) stays far lower.]
Bellwether Analysis:
Global Aggregates from Local Regions
[Figure: RMSE (Root Mean Square Error, 0–15000) vs. budget (5–85) for randomly sampled (non-cube) regions with costs under a given budget; the bellwether region [1-8 month, MD] is marked]
[Figure: y-axis 0–0.7, x-axis budget 5–85 — the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region; [1-8 month, MD] is marked]

• We have 99% confidence that [1-8 month, MD] is quite an unusual bellwether region
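The bellwether search itself reduces to a constrained minimization; a minimal sketch with hypothetical region names, costs, and errors (the real analysis enumerates cube regions and estimates RMSE against the global aggregate):

```python
def find_bellwether(regions, budget):
    """Among candidate regions whose cost fits the budget, pick the
    one whose local aggregate best predicts the global aggregate
    (lowest RMSE). Region tuples: (name, cost, rmse)."""
    feasible = [r for r in regions if r[1] <= budget]
    return min(feasible, key=lambda r: r[2]) if feasible else None

regions = [
    ("[1-8 month, MD]",  25, 3.1),
    ("[1-3 month, CA]",  40, 9.7),
    ("[1-12 month, NY]", 80, 4.2),
]
print(find_bellwether(regions, budget=45))
```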
Conclusions