Data Mining Tutorial
D. A. Dickey
NCSU
[Figure: toy decision tree. The root splits on X1 = Debt-to-Income
Ratio and a branch splits again on X2 = Age; the leaves, labeled
"Default" or "No default," carry Pr{default} = 0.0001, 0.003, and 0.006.]
Some Actual Data
• Framingham Heart Study
[Figure: tree fit to the Framingham data. Split #1 is on Age;
later splits use Systolic BP; each final box is a "terminal node."]
How to make splits?
• Which variable to use?
• Where to split?
– Cholesterol > ____
– Systolic BP > _____
• Goal: Pure “leaves” or “terminal nodes”
• Ideal split: Everyone with BP>x has
problems, nobody with BP<x has
problems
Where to Split?
• First review Chi-square tests
• Contingency tables
              Heart Disease
              No     Yes    Total    (Expected if independent)
Low BP        95      5      100        75     25
High BP       55     45      100        75     25
Total        150     50      200
Chi-Square Test Statistic
• Expect 100(150/200) = 75 in the upper left cell if
independent (etc., e.g. 100(50/200) = 25)
• Chi-square = Σ over all cells of (observed − expected)² / expected
• Here: (95−75)²/75 + (5−25)²/25 + (55−75)²/75 + (45−25)²/25 = 42.67
• p-value = Pr{falsely rejecting the hypothesis of independence}
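The same test can be reproduced in SAS with PROC FREQ (a minimal
sketch; the data set name BP_CHD and the variable names are assumptions):

DATA bp_chd;
  INPUT bp $ chd $ count;
CARDS;
Low  No  95
Low  Yes  5
High No  55
High Yes 45
;
PROC FREQ DATA=bp_chd;
  WEIGHT count;                   /* rows are cell counts, not raw records */
  TABLES bp*chd / CHISQ EXPECTED; /* chi-square test plus expected counts */
RUN;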
Lift Chart
• Go from the leaf with the most response to the leaf with the least.
• Lift is the cumulative proportion responding.
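A sketch of computing the cumulative proportion responding, assuming
a scored data set SCORED with actual response Y and predicted
probability PHAT (hypothetical names):

PROC SORT DATA=scored;            /* best-scoring cases first */
  BY DESCENDING phat;
RUN;
DATA lift;
  SET scored;
  cum_resp + y;                   /* running count of responders */
  cum_prop = cum_resp / _N_;      /* cumulative proportion responding */
RUN;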
Regression Trees
• Continuous response (not just class)
• Predicted response constant in regions
[Figure: the (X1, X2) plane partitioned into rectangles with
constant predictions 130, 80, 100, 50, and 20.]
• Predict Pi in cell i.
• Yij = jth response in cell i.
• Split to minimize Σi Σj (Yij − Pi)²
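As a sketch, the within-cell sum of squares for one candidate split
can be computed with PROC MEANS (the data set TRAIN and the cut
point 10 are assumptions):

DATA split;
  SET train;
  region = (x1 > 10);      /* hypothetical candidate split on X1 */
RUN;
PROC MEANS DATA=split NOPRINT NWAY;
  CLASS region;
  VAR y;
  OUTPUT OUT=sse CSS=css;  /* CSS = Σ (Yij - cell mean)² within each region */
RUN;
/* Add the two CSS values; the best split has the smallest total. */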
Logistic Regression
• “Trees” seem to be main tool.
• Logistic – another classifier
• Older – “tried & true” method
• Predict probability of response from input
variables (“Features”)
• Linear regression gives an infinite range of
predictions
• But 0 < probability < 1, so linear regression
won't do for probabilities.
• Logistic idea: Map p in (0,1) to L in whole
real line
• Use L = ln(p/(1-p))
• Model L as linear in temperature
• Predicted L = a + b(temperature)
• Given temperature X, compute a+bX then p
= eL/(1+eL)
• p(i) = ea+bXi/(1+ea+bXi)
• For each trial write p(i) if it responded, 1-p(i) if not
• Multiply all n of these together; find a, b that
maximize the product (the likelihood)
Example: Ignition
• Flame exposure time = X
• Ignited Y=1, did not ignite Y=0
– Y=0, X = 3, 5, 9, 10, 13, 16
– Y=1, X = 7, 11, 12, 14, 15, 17, 25, 30
• Q=(1-p)(1-p)p(1-p)(1-p)pp(1-p)pp(1-p)ppp
• The p's are all different: p = f(exposure time)
• Find a, b to maximize Q(a,b)
Generate Q for array of (a,b) values
DATA LIKELIHOOD;
  ARRAY Y(14) Y1-Y14; ARRAY X(14) X1-X14;
  DO I=1 TO 14; INPUT X(I) Y(I) @@; END;     /* read all 14 (X,Y) pairs */
  DO A = -3 TO -2 BY .025;                   /* grid of intercepts */
    DO B = 0.2 TO 0.3 BY .0025;              /* grid of slopes */
      Q = 1;
      DO I=1 TO 14;
        L = A + B*X(I);                      /* logit for case I */
        P = EXP(L)/(1+EXP(L));               /* Pr{Y=1} */
        IF Y(I)=1 THEN Q=Q*P; ELSE Q=Q*(1-P);  /* likelihood product */
      END;
      IF Q < 0.0006 THEN Q = 0.0006;         /* floor Q for plotting */
      OUTPUT;                                /* one (A,B,Q) point per grid cell */
    END;
  END;
CARDS;
3 0 5 0 7 1 9 0 10 0 11 1 12 1 13 0 14 1 15 1 16 0 17 1
25 1 30 1
;
[Figure: surface plot of the likelihood function Q(a,b),
maximized near a = -2.6, b = 0.23.]
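The same estimates come straight from PROC LOGISTIC (a minimal
sketch; the data set name IGNITION and variable names TIME and Y
are assumptions):

PROC LOGISTIC DATA=ignition DESCENDING;  /* DESCENDING models Pr{Y=1} */
  MODEL y = time;
RUN;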
IGNITION DATA
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

                            Standard    Wald
Parameter  DF  Estimate     Error       Chi-Square  Pr > ChiSq
Intercept   1   -2.5879     1.8469      1.9633      0.1612
TIME        1    0.2346     0.1502      2.4388      0.1184
Neural Networks
[Figure: network diagram. The inputs feed a hidden layer of
logistic functions of the data; the output is a logistic function
of those. Arrows represent linear combinations of "basis
functions," e.g. logistics.]
Example:
Y = a + b1 p1 + b2 p2 + b3 p3
Y = 4 + p1+ 2 p2 - 4 p3
• Should always use a holdout sample
• Perturb coefficients to optimize the fit (to the training data)
– Nonlinear search algorithms
• Eliminate unnecessary arrows using the
holdout data.
• Other basis sets
– Radial Basis Functions
– Just normal densities (bell shaped) with
adjustable means and variances.
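For example, a one-dimensional radial basis function centered at μ
with spread σ is the bell curve exp(−(x − μ)²/(2σ²)).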
Terms
• Train: estimate coefficients
• Bias: intercept a in Neural Nets
• Weights: coefficients b
• Radial Basis Function: Normal density
• Score: Predict (usually Y from new Xs)
• Activation Function: transformation to target
• Supervised Learning: Training data has
response.
Hidden Layer
L1 = -1.87 - 0.27*Age - 0.20*SBP22
H11 = exp(L1)/(1+exp(L1))
L2 = -20.76 - 21.38*H11
Pr{first_chd} = exp(L2)/(1+exp(L2))
The logistic transform exp(L)/(1+exp(L)) is the "Activation Function."
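Scoring this small network by hand might look like the sketch below
(a data set PATIENTS with variables AGE and SBP22 is an assumption):

DATA scored;
  SET patients;
  /* hidden layer: logistic activation of a linear combination */
  L1  = -1.87 - 0.27*age - 0.20*sbp22;
  H11 = EXP(L1)/(1+EXP(L1));
  /* output layer: logistic activation again */
  L2  = -20.76 - 21.38*H11;
  p_first_chd = EXP(L2)/(1+EXP(L2));
RUN;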
Demo (optional)
• Compare several methods using SAS
Enterprise Miner
– Decision Tree
– Nearest Neighbor
– Neural Network
Unsupervised Learning
• We have the “features” (predictors)
• We do NOT have the response even on a
training data set (UNsupervised)
• Clustering
– Agglomerative
• Start with each point separated
– Divisive
• Start with all points in one cluster, then split
EM (Enterprise Miner): PROC FASTCLUS
• Step 1 – find “seeds” as separated as
possible
• Step 2 – cluster points to nearest seed
– Drift: As points are added, change seed
(centroid) to average of each coordinate
– Alternatively: Make full pass then recompute
seed and iterate.
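A minimal PROC FASTCLUS call (the data set POINTS with coordinates
X and Y is an assumption):

PROC FASTCLUS DATA=points MAXCLUSTERS=4 MAXITER=10 OUT=clustered;
  VAR x y;     /* MAXITER > 1 recomputes seeds and reassigns, as above */
RUN;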
[Figures: scatter plots of the points "as created" vs. "as clustered."]
Cubic Clustering Criterion
(to decide # of clusters)
• Divide a purely random scatter of (X,Y) points into
4 quadrants
• Pooled within-cluster variation is still much less
than the overall variation
• Large variance reduction
• Big R-square despite no real clusters
• CCC compares the R-square you got to the R-square
expected from random scatter, to decide # of clusters
• CCC chooses 3 clusters for the "macaroni" data.
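One way to get CCC values in SAS (a sketch; POINTS, X, and Y are
assumed as before):

PROC CLUSTER DATA=points METHOD=WARD CCC;
  VAR x y;     /* CCC is printed for each number of clusters */
RUN;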
Association Analysis
• Market basket analysis
– What they’re doing when they scan your “VIP”
card at the grocery
– People who buy diapers tend to also buy
_________ (beer?)
– Just a matter of accounting, but with new
terminology (of course)
– Examples from SAS Appl. DM Techniques, by
Sue Walsh:
Terminology
• Baskets: ABC ACD BCD ADE BCE

Rule       Support        Confidence
X=>Y       Pr{X and Y}    Pr{Y|X}
A=>D       2/5            2/3
C=>A       2/5            2/4
B&C=>D     1/5            1/3
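For example, A=>D: A and D appear together in 2 of the 5 baskets
(ACD, ADE), so support = 2/5; A appears in 3 baskets, so
confidence = 2/3.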
Don't be Fooled!
• Lift = Confidence / Expected Confidence if Independent

               Checking
               No        Yes       Total
Saving   No    500       3500      4000
Total         (1500)    (8500)    (10000)
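For the rule Saving=No => Checking=Yes: confidence = 3500/4000 =
0.875, expected confidence = 8500/10000 = 0.85, so lift =
0.875/0.85 ≈ 1.03 — barely better than chance despite the high
confidence.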