
Business Intelligence and Data Mining
Session 07-08
SUPERVISED
UNSUPERVISED
• MULTIPLE LINEAR REGRESSION
• Explanatory vs. Predictive Modeling
Explanatory Modeling
Goal: Explain the relationship between predictors (explanatory variables) and the target

• The familiar use of regression in data analysis
• Model goal: fit the data well and understand the contribution of explanatory variables to the model
• Assessed by “goodness-of-fit”: R², residual analysis, p-values
Predictive Modeling
Goal: predict target values in other data where we have predictor
values, but not target values
• Classic data mining context
• Model Goal: Optimize predictive accuracy
• Train model on training data
• Assess performance on validation (hold-out) data
• Explaining role of predictors is not primary purpose (but useful)

• HOW TO ASSESS THE PERFORMANCE OF A MODEL?
Prediction Accuracy Measure
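As a concrete illustration of the two ideas above (train/validate split plus a numerical accuracy measure), here is a minimal Python sketch that fits a linear regression on training data and reports RMSE on the hold-out validation data. The data frame df and the target column "price" are hypothetical placeholders.

    # Minimal sketch: fit on training data, judge accuracy on validation data.
    # Assumes a pandas DataFrame `df` with numeric predictors and a numeric
    # target column "price" (hypothetical names).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    X = df.drop(columns="price")
    y = df["price"]

    # 60% training, 40% validation (hold-out)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

    model = LinearRegression().fit(X_train, y_train)          # fit on training data only

    # Prediction accuracy measure (RMSE) computed on the validation set
    rmse_valid = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))
    print("Validation RMSE:", rmse_valid)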
Selecting Subsets of Predictors
Goal: Find a parsimonious model (the simplest model that performs sufficiently well)
• More robust
• Higher predictive accuracy

Exhaustive Search

Partial Search Algorithms
• Forward
• Backward
• Stepwise
Exhaustive Search
• All possible subsets of predictors assessed (single,
pairs, triplets, etc.)
• Computationally intensive
• Judge by “adjusted R²”, which penalizes the number of predictors:

  R²_adj = 1 − (1 − R²) × (n − 1) / (n − p − 1)

  where n is the number of records and p is the number of predictors
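A minimal sketch of exhaustive search, reusing the hypothetical X and y from the earlier sketch: it fits every subset of predictors and keeps the one with the highest adjusted R². Feasible only when p is small, since there are 2^p − 1 subsets.

    # Exhaustive subset search judged by adjusted R-squared (sketch).
    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    def adjusted_r2(r2, n, p):
        # Penalizes the number of predictors p, given n records
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    best_subset, best_adj = None, float("-inf")
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(X.columns, k):
            cols = list(subset)
            r2 = LinearRegression().fit(X[cols], y).score(X[cols], y)
            adj = adjusted_r2(r2, len(y), k)
            if adj > best_adj:
                best_subset, best_adj = subset, adj

    print("Best subset:", best_subset, "adjusted R2:", round(best_adj, 3))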
Forward Selection
• Start with no predictors
• Add them one by one (at each step, add the one with the largest contribution)
• Stop when the addition is not statistically
significant

Backward Elimination
• Start with all predictors
• Successively eliminate least useful predictors
one by one
• Stop when all remaining predictors have
statistically significant contribution

Stepwise
• Like Forward Selection
• Except at each step, also consider dropping non-
significant predictors

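For the partial search algorithms, scikit-learn's SequentialFeatureSelector is one readily available implementation; note it adds or drops predictors based on a cross-validated score rather than statistical significance, so its stopping rule differs slightly from the p-value description above. A sketch, reusing the hypothetical X and y:

    # Forward selection (set direction="backward" for backward elimination).
    # Stops when no addition improves the cross-validated score by more than `tol`.
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    selector = SequentialFeatureSelector(LinearRegression(), direction="forward",
                                         n_features_to_select="auto", tol=1e-4, cv=5)
    selector.fit(X, y)
    print("Selected predictors:", list(X.columns[selector.get_support()]))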
Summary
• Linear regression models are very popular tools, not only for
explanatory modeling, but also for prediction
• A good predictive model has high predictive accuracy (to a useful
practical level)
• Predictive models are built using a training data set, and evaluated
on a separate validation data set
• Removing redundant predictors is key to achieving predictive
accuracy and robustness
• Subset selection methods help find “good” candidate models.
These should then be run and assessed.

Supervised Learning – Possible Outcomes
• Predicted numerical value: when the outcome variable is numerical (e.g., house price)

• Propensity: the probability of class membership, when the outcome variable is categorical (e.g., the propensity to default)

• Predicted class membership: when the outcome variable is categorical (e.g., buyer/nonbuyer)
Naive Benchmark: The Average or Majority Class
• The benchmark criterion in prediction is using the average outcome value (thereby ignoring all predictor information).
• The analogous benchmark in classification is the majority class.
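A quick sketch of the naive benchmark for prediction, reusing the hypothetical train/validation split from the earlier sketch: predict the training-set average for every validation record and compare its RMSE with the model's.

    # Naive benchmark: ignore all predictors and always predict the training mean.
    import numpy as np
    from sklearn.metrics import mean_squared_error

    naive_pred = np.full(len(y_valid), y_train.mean())
    naive_rmse = np.sqrt(mean_squared_error(y_valid, naive_pred))
    print("Naive (average) benchmark RMSE:", naive_rmse)

    # For classification, the analogous benchmark is the majority class, e.g.:
    # majority_class = y_train.mode()[0]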
SUPERVISED
UNSUPERVISED
K-Nearest-Neighbor
Characteristics
• Data-driven, not model-driven
• Makes no assumptions about the data

Basic Idea
• For a given record to be classified, identify nearby records
• “Near” means records with similar predictor values X1, X2, …, Xp
• Classify the record as whatever the predominant class is among the nearby records (the “neighbors”)
How to measure “nearby”?
• The most popular distance measure is Euclidean distance
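A minimal k-NN classification sketch with scikit-learn, whose default metric is Euclidean distance. The data frame df2 with a binary "owner" column (1 = owner, 0 = nonowner) is a hypothetical stand-in; predictors are standardized so that no single variable dominates the distance.

    # k-NN classification sketch (Euclidean distance is the default metric).
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    X = df2.drop(columns="owner")          # hypothetical predictors X1, X2, ..., Xp
    y = df2["owner"]                       # hypothetical binary class label

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

    scaler = StandardScaler().fit(X_train)       # put predictors on a common scale
    knn = KNeighborsClassifier(n_neighbors=5)    # classify by majority vote of 5 neighbors
    knn.fit(scaler.transform(X_train), y_train)

    print("First validation record classified as:", knn.predict(scaler.transform(X_valid[:1])))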
Choosing k
• k is the number of nearby neighbors used to classify the new record
  k = 1 means use the single nearest record
  k = 5 means use the 5 nearest records
• Typically, choose the value of k that has the lowest error rate on the validation data
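A sketch of choosing k by validation error, continuing the previous example: try a range of k values and keep the one with the lowest error rate on the validation set.

    # Choose k with the lowest validation error rate.
    from sklearn.metrics import accuracy_score
    from sklearn.neighbors import KNeighborsClassifier

    errors = {}
    for k in range(1, 16):
        knn_k = KNeighborsClassifier(n_neighbors=k).fit(scaler.transform(X_train), y_train)
        preds = knn_k.predict(scaler.transform(X_valid))
        errors[k] = 1 - accuracy_score(y_valid, preds)     # validation error rate

    best_k = min(errors, key=errors.get)
    print("Best k:", best_k, "with validation error:", round(errors[best_k], 3))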
Low k vs. High k
Low values of k (1, 3, …) capture local structure in the data (but also noise)

High values of k provide more smoothing and less noise, but may miss local structure

Note: the extreme case of k = n (i.e., the entire data set) is the same as the “naïve rule” (classify all records according to the majority class)
• Converting Categorical Variables to Binary Dummies
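A one-line sketch of the dummy-coding step with pandas; the column name "fuel_type" is a hypothetical example of a categorical predictor.

    # Convert a categorical variable into 0/1 dummy columns so that distance-based
    # methods such as k-NN can use it like any numeric predictor.
    import pandas as pd

    df2 = pd.get_dummies(df2, columns=["fuel_type"], dtype=int)   # hypothetical column name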
Using K-NN for Prediction
(for Numerical Outcome)

• Instead of “majority vote determines class,” use the average of the neighbors’ response values

• May be a weighted average, with weight decreasing with distance
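A sketch of k-NN for a numerical outcome: KNeighborsRegressor averages the neighbors' response values, and weights="distance" makes it a weighted average with weight decreasing with distance. Here y_train_num stands for a hypothetical numeric target (e.g., price) in place of the class label.

    # k-NN prediction for a numeric outcome (average of neighbors' responses).
    from sklearn.neighbors import KNeighborsRegressor

    knn_reg = KNeighborsRegressor(n_neighbors=5, weights="distance")  # or weights="uniform"
    knn_reg.fit(scaler.transform(X_train), y_train_num)               # y_train_num: numeric target (assumed)
    predictions = knn_reg.predict(scaler.transform(X_valid))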
Advantages
• Simple
• No assumptions required about Normal distribution, etc.
• Effective at capturing complex interactions among variables without
having to define a statistical model
Shortcomings
• Required size of training set increases exponentially with # of
predictors, p
This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up “far away” from each other)
• In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s)
• Together, these problems constitute the “curse of dimensionality”
Dealing with the Curse

• Reduce the dimension of the predictors (e.g., with PCA)

• Use computational shortcuts that settle for “almost nearest neighbors”
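One way to reduce the predictor dimension before k-NN is principal components analysis; a sketch with a scikit-learn pipeline (the choice of 5 components is an arbitrary illustration and assumes at least 5 predictors):

    # Standardize, project onto a few principal components, then run k-NN.
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = make_pipeline(StandardScaler(),
                         PCA(n_components=5),               # reduced predictor space
                         KNeighborsClassifier(n_neighbors=5))
    pipe.fit(X_train, y_train)
    print("Validation accuracy:", pipe.score(X_valid, y_valid))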
• HOW TO ASSESS THE PERFORMANCE OF A MODEL?
• Errors that are based on the training set tell us about model fit,
whereas those that are based on the validation set (called “prediction
errors”) measure the model’s ability to predict new data (predictive
performance).
Cutoff for classification
Most DM algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly

• Default cutoff value is 0.50:
  If >= 0.50, classify as “1”
  If < 0.50, classify as “0”
• Can use different cutoff values
• Typically, error rate is lowest for cutoff = 0.50

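A sketch of the two-step process with an explicit cutoff, assuming the fitted classifier knn from the k-NN sketches above (any model with predict_proba would do):

    # Step 1: probability of belonging to class "1"; Step 2: compare to a cutoff.
    import numpy as np

    prob_of_1 = knn.predict_proba(scaler.transform(X_valid))[:, 1]
    cutoff = 0.50                                         # default cutoff
    predicted_class = np.where(prob_of_1 >= cutoff, 1, 0)

    # A different cutoff changes which records are classified as "1"
    print("Records classified as 1 at cutoff 0.80:", int((prob_of_1 >= 0.80).sum()))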
Cutoff Table
Actual Class   Prob. of “1”        Actual Class   Prob. of “1”
1              0.996               1              0.506
1              0.988               0              0.471
1              0.984               0              0.337
1              0.980               1              0.218
1              0.948               0              0.199
1              0.889               0              0.149
1              0.848               0              0.048
0              0.762               0              0.038
1              0.707               0              0.025
1              0.681               0              0.022
1              0.656               0              0.016
0              0.622               0              0.004
(24 records sorted by predicted probability of “1”; the right-hand pair of columns continues the left-hand pair)

• If the cutoff is 0.50: 13 records are classified as “1”
• If the cutoff is 0.80: seven records are classified as “1”
Confusion Matrix for Different Cutoffs
Lift

When One Class is More Important
In many cases it is more important to identify members of one class:
• Tax fraud
• Credit default
• Response to a promotional offer
• Detecting electronic network intrusion
• Predicting delayed flights

In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention.
Lift and Decile Charts: Goal
Useful for assessing performance in terms of identifying the most important class

Helps evaluate, e.g.:
• How many tax records to examine
• How many loans to grant
• How many customers to mail an offer to
Lift and Decile Charts – Cont.
Compare the performance of the DM model to “no model, pick randomly”

Measures the ability of the DM model to identify the important class, relative to its average prevalence

Charts give an explicit assessment of results over a large number of cutoffs
Lift and Decile Charts: How to Use

• Compare lift to the “no model” baseline
• In the lift chart: compare the step function to the straight line
• In the decile chart: compare to a ratio of 1
Lift Chart – cumulative performance
Actual Class   Prob. of “1”        Actual Class   Prob. of “1”
1              0.996               1              0.506
1              0.988               0              0.471
1              0.984               0              0.337
1              0.980               1              0.218
1              0.948               0              0.199
1              0.889               0              0.149
1              0.848               0              0.048
0              0.762               0              0.038
1              0.707               0              0.025
1              0.681               0              0.022
1              0.656               0              0.016
0              0.622               0              0.004

After examining (e.g.) 10 cases (x-axis), 9 owners (y-axis) have been correctly identified.
Decile Chart
[Decile-wise lift chart (training dataset): bars show the ratio of decile mean to global mean for deciles 1–10, based on the same sorted records as above.]

In the “most probable” (top) decile, the model is twice as likely to identify the important class, compared to its average prevalence.
The y-axis is the ratio of the decile mean to the global mean:
• The numerator is the number of records in the respective 10% of records for which the class of interest is predicted correctly.
• The denominator is the average number of class-of-interest records expected in that 10% of records (i.e., under the overall prevalence).
Lift Charts: How to Compute
• Using the model’s predicted probabilities, sort records from most likely to least likely members of the important class

• Compute lift: accumulate the correctly classified “important class” records (y-axis) and compare to the number of total records examined (x-axis), as in the sketch below
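A sketch of this computation using the 24 sorted records from the table above; it reproduces the “9 owners after 10 cases” point and the “no model” baseline.

    # Cumulative lift from the sorted records shown above.
    import numpy as np
    import pandas as pd

    table = pd.DataFrame({
        "actual":    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
                      1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        "prob_of_1": [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
                      0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                      0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004],
    }).sort_values("prob_of_1", ascending=False)

    cum_ones = table["actual"].cumsum().to_numpy()          # y-axis of the lift chart
    print("Important-class records found after 10 cases:", cum_ones[9])   # -> 9

    # "No model" baseline: overall prevalence times the number of records examined
    baseline = table["actual"].mean() * np.arange(1, len(table) + 1)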
Lift vs. Decile Charts
Both embody the concept of “moving down” through the records, starting with the most probable

• Decile chart: does this in decile chunks of data; the y-axis shows the ratio of the decile mean to the overall mean

• Lift chart: shows continuous cumulative results; the y-axis shows the number of important-class records identified
