
Business Intelligence and Data Mining
Session 07-08
SUPERVISED
UNSUPERVISED
• MULTIPLE LINEAR REGRESSION
• Explanatory vs. Predictive Modeling
Explanatory Modeling
Goal: Explain the relationship between predictors (explanatory variables) and the target

• The familiar use of regression in data analysis
• Model goal: fit the data well and understand the contribution of explanatory variables to the model
• Assessed by “goodness-of-fit”: R², residual analysis, p-values
Predictive Modeling
Goal: predict target values in other data where we have predictor
values, but not target values
• Classic data mining context
• Model Goal: Optimize predictive accuracy
• Train model on training data
• Assess performance on validation (hold-out) data
• Explaining role of predictors is not primary purpose (but useful)

• HOW TO ASSESS THE PERFORMANCE OF A MODEL?
Prediction Accuracy Measure
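As a concrete illustration of the two ideas above (train/validate split plus a numerical accuracy measure), here is a minimal Python sketch that fits a linear regression on training data and reports RMSE on the hold-out validation data. The data frame df and the target column "price" are hypothetical placeholders.

    # Minimal sketch: fit on training data, judge accuracy on validation data.
    # Assumes a pandas DataFrame `df` with numeric predictors and a numeric
    # target column "price" (hypothetical names).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    X = df.drop(columns="price")
    y = df["price"]

    # 60% training, 40% validation (hold-out)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

    model = LinearRegression().fit(X_train, y_train)          # fit on training data only

    # Prediction accuracy measure (RMSE) computed on the validation set
    rmse_valid = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))
    print("Validation RMSE:", rmse_valid)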
Selecting Subsets of Predictors
Goal: Find a parsimonious model (the simplest model that performs sufficiently well)
• More robust
• Higher predictive accuracy

Exhaustive Search

Partial Search Algorithms
• Forward
• Backward
• Stepwise
Exhaustive Search
• All possible subsets of predictors assessed (single,
pairs, triplets, etc.)
• Computationally intensive
• Judge by “adjusted R²”, which penalizes the number of predictors:

  R²_adj = 1 − (1 − R²) × (n − 1) / (n − p − 1)

  where n is the number of records and p is the number of predictors
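A minimal sketch of exhaustive search, reusing the hypothetical X and y from the earlier sketch: it fits every subset of predictors and keeps the one with the highest adjusted R². Feasible only when p is small, since there are 2^p − 1 subsets.

    # Exhaustive subset search judged by adjusted R-squared (sketch).
    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    def adjusted_r2(r2, n, p):
        # Penalizes the number of predictors p, given n records
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    best_subset, best_adj = None, float("-inf")
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(X.columns, k):
            cols = list(subset)
            r2 = LinearRegression().fit(X[cols], y).score(X[cols], y)
            adj = adjusted_r2(r2, len(y), k)
            if adj > best_adj:
                best_subset, best_adj = subset, adj

    print("Best subset:", best_subset, "adjusted R2:", round(best_adj, 3))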
Forward Selection
• Start with no predictors
• Add them one by one (at each step, add the one with the largest contribution)
• Stop when the addition is not statistically
significant

Backward Elimination
• Start with all predictors
• Successively eliminate least useful predictors
one by one
• Stop when all remaining predictors have
statistically significant contribution

Stepwise
• Like Forward Selection
• Except at each step, also consider dropping non-
significant predictors

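For the partial search algorithms, scikit-learn's SequentialFeatureSelector is one readily available implementation; note it adds or drops predictors based on a cross-validated score rather than statistical significance, so its stopping rule differs slightly from the p-value description above. A sketch, reusing the hypothetical X and y:

    # Forward selection (set direction="backward" for backward elimination).
    # Stops when no addition improves the cross-validated score by more than `tol`.
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    selector = SequentialFeatureSelector(LinearRegression(), direction="forward",
                                         n_features_to_select="auto", tol=1e-4, cv=5)
    selector.fit(X, y)
    print("Selected predictors:", list(X.columns[selector.get_support()]))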
Summary
• Linear regression models are very popular tools, not only for
explanatory modeling, but also for prediction
• A good predictive model has high predictive accuracy (to a useful
practical level)
• Predictive models are built using a training data set, and evaluated
on a separate validation data set
• Removing redundant predictors is key to achieving predictive
accuracy and robustness
• Subset selection methods help find “good” candidate models.
These should then be run and assessed.

Supervised Learning – Possible Outcomes
• Predicted numerical value: when the outcome variable is numerical (e.g., house price)

• Propensity: the probability of class membership, when the outcome variable is categorical (e.g., the propensity to default)

• Predicted class membership: when the outcome variable is categorical (e.g., buyer/nonbuyer)
Naive Benchmark: The Average or Majority Class
• The benchmark criterion in prediction is using the average outcome value (thereby ignoring all predictor information).
• The analogous benchmark in classification is the majority class.
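A quick sketch of the naive benchmark for prediction, reusing the hypothetical train/validation split from the earlier sketch: predict the training-set average for every validation record and compare its RMSE with the model's.

    # Naive benchmark: ignore all predictors and always predict the training mean.
    import numpy as np
    from sklearn.metrics import mean_squared_error

    naive_pred = np.full(len(y_valid), y_train.mean())
    naive_rmse = np.sqrt(mean_squared_error(y_valid, naive_pred))
    print("Naive (average) benchmark RMSE:", naive_rmse)

    # For classification, the analogous benchmark is the majority class, e.g.:
    # majority_class = y_train.mode()[0]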
SUPERVISED
UNSUPERVISED
K-Nearest-Neighbor
Characteristics
• Data-driven, not model-driven
• Makes no assumptions about the data

Basic Idea
• For a given record to be classified, identify nearby records
• “Near” means records with similar predictor values X1, X2, …, Xp
• Classify the record as whatever the predominant class is among the nearby records (the “neighbors”)
How to measure “nearby”?
• The most popular distance measure is Euclidean distance
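A minimal k-NN classification sketch with scikit-learn, whose default metric is Euclidean distance. The data frame df2 with a binary "owner" column (1 = owner, 0 = nonowner) is a hypothetical stand-in; predictors are standardized so that no single variable dominates the distance.

    # k-NN classification sketch (Euclidean distance is the default metric).
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    X = df2.drop(columns="owner")          # hypothetical predictors X1, X2, ..., Xp
    y = df2["owner"]                       # hypothetical binary class label

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

    scaler = StandardScaler().fit(X_train)       # put predictors on a common scale
    knn = KNeighborsClassifier(n_neighbors=5)    # classify by majority vote of 5 neighbors
    knn.fit(scaler.transform(X_train), y_train)

    print("First validation record classified as:", knn.predict(scaler.transform(X_valid[:1])))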
Choosing k
• k is the number of nearby neighbors used to classify the new record
  k = 1 means use the single nearest record
  k = 5 means use the 5 nearest records
• Typically, choose the value of k that has the lowest error rate on the validation data
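A sketch of choosing k by validation error, continuing the previous example: try a range of k values and keep the one with the lowest error rate on the validation set.

    # Choose k with the lowest validation error rate.
    from sklearn.metrics import accuracy_score
    from sklearn.neighbors import KNeighborsClassifier

    errors = {}
    for k in range(1, 16):
        knn_k = KNeighborsClassifier(n_neighbors=k).fit(scaler.transform(X_train), y_train)
        preds = knn_k.predict(scaler.transform(X_valid))
        errors[k] = 1 - accuracy_score(y_valid, preds)     # validation error rate

    best_k = min(errors, key=errors.get)
    print("Best k:", best_k, "with validation error:", round(errors[best_k], 3))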
Low k vs. High k
Low values of k (1, 3, …) capture local structure in the data (but also noise)

High values of k provide more smoothing and less noise, but may miss local structure

Note: the extreme case of k = n (i.e., the entire data set) is the same as the “naïve rule” (classify all records according to the majority class)
• Converting Categorical Variables to Binary Dummies
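A one-line sketch of the dummy-coding step with pandas; the column name "fuel_type" is a hypothetical example of a categorical predictor.

    # Convert a categorical variable into 0/1 dummy columns so that distance-based
    # methods such as k-NN can use it like any numeric predictor.
    import pandas as pd

    df2 = pd.get_dummies(df2, columns=["fuel_type"], dtype=int)   # hypothetical column name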
Using K-NN for Prediction
(for Numerical Outcome)

• Instead of “majority vote determines class,” use the average of the neighbors’ response values

• May be a weighted average, with weight decreasing with distance
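A sketch of k-NN for a numerical outcome: KNeighborsRegressor averages the neighbors' response values, and weights="distance" makes it a weighted average with weight decreasing with distance. Here y_train_num stands for a hypothetical numeric target (e.g., price) in place of the class label.

    # k-NN prediction for a numeric outcome (average of neighbors' responses).
    from sklearn.neighbors import KNeighborsRegressor

    knn_reg = KNeighborsRegressor(n_neighbors=5, weights="distance")  # or weights="uniform"
    knn_reg.fit(scaler.transform(X_train), y_train_num)               # y_train_num: numeric target (assumed)
    predictions = knn_reg.predict(scaler.transform(X_valid))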
Advantages
• Simple
• No assumptions required about Normal distribution, etc.
• Effective at capturing complex interactions among variables without
having to define a statistical model
Shortcomings
• Required size of training set increases exponentially with # of
predictors, p
This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up “far away” from each other)
• In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s)
• Together, these problems constitute the “curse of dimensionality”
Dealing with the Curse

• Reduce the dimension of the predictors (e.g., with PCA)

• Use computational shortcuts that settle for “almost nearest neighbors”
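One way to reduce the predictor dimension before k-NN is principal components analysis; a sketch with a scikit-learn pipeline (the choice of 5 components is an arbitrary illustration and assumes at least 5 predictors):

    # Standardize, project onto a few principal components, then run k-NN.
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = make_pipeline(StandardScaler(),
                         PCA(n_components=5),               # reduced predictor space
                         KNeighborsClassifier(n_neighbors=5))
    pipe.fit(X_train, y_train)
    print("Validation accuracy:", pipe.score(X_valid, y_valid))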
• HOW TO ASSESS THE PERFORMANCE OF A MODEL?
• Errors that are based on the training set tell us about model fit,
whereas those that are based on the validation set (called “prediction
errors”) measure the model’s ability to predict new data (predictive
performance).
Cutoff for classification
Most DM algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly

• Default cutoff value is 0.50:
  If >= 0.50, classify as “1”
  If < 0.50, classify as “0”
• Can use different cutoff values
• Typically, error rate is lowest for cutoff = 0.50

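A sketch of the two-step process with an explicit cutoff, assuming the fitted classifier knn from the k-NN sketches above (any model with predict_proba would do):

    # Step 1: probability of belonging to class "1"; Step 2: compare to a cutoff.
    import numpy as np

    prob_of_1 = knn.predict_proba(scaler.transform(X_valid))[:, 1]
    cutoff = 0.50                                         # default cutoff
    predicted_class = np.where(prob_of_1 >= cutoff, 1, 0)

    # A different cutoff changes which records are classified as "1"
    print("Records classified as 1 at cutoff 0.80:", int((prob_of_1 >= 0.80).sum()))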
Cutoff Table
Actual Class   Prob. of “1”        Actual Class   Prob. of “1”
1              0.996               1              0.506
1              0.988               0              0.471
1              0.984               0              0.337
1              0.980               1              0.218
1              0.948               0              0.199
1              0.889               0              0.149
1              0.848               0              0.048
0              0.762               0              0.038
1              0.707               0              0.025
1              0.681               0              0.022
1              0.656               0              0.016
0              0.622               0              0.004
(24 records sorted by predicted probability of “1”; the right-hand pair of columns continues the left-hand pair)

• If the cutoff is 0.50: 13 records are classified as “1”
• If the cutoff is 0.80: seven records are classified as “1”
Confusion Matrix for Different Cutoffs
Lift

When One Class is More Important
In many cases it is more important to identify members of one class:
• Tax fraud
• Credit default
• Response to a promotional offer
• Detecting electronic network intrusion
• Predicting delayed flights

In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention.
Lift and Decile Charts: Goal
Useful for assessing performance in terms of identifying the most important class

Helps evaluate, e.g.:
• How many tax records to examine
• How many loans to grant
• How many customers to mail an offer to
Lift and Decile Charts – Cont.
Compare the performance of the DM model to “no model, pick randomly”

Measures the ability of the DM model to identify the important class, relative to its average prevalence

Charts give an explicit assessment of results over a large number of cutoffs
Lift and Decile Charts: How to Use

• Compare lift to the “no model” baseline
• In the lift chart: compare the step function to the straight line
• In the decile chart: compare to a ratio of 1
Lift Chart – cumulative performance
Actual Class   Prob. of “1”        Actual Class   Prob. of “1”
1              0.996               1              0.506
1              0.988               0              0.471
1              0.984               0              0.337
1              0.980               1              0.218
1              0.948               0              0.199
1              0.889               0              0.149
1              0.848               0              0.048
0              0.762               0              0.038
1              0.707               0              0.025
1              0.681               0              0.022
1              0.656               0              0.016
0              0.622               0              0.004

After examining (e.g.) 10 cases (x-axis), 9 owners (y-axis) have been correctly identified.
Decile Chart
[Decile-wise lift chart (training dataset): bars show the ratio of decile mean to global mean for deciles 1–10, based on the same sorted records as above.]

In the “most probable” (top) decile, the model is twice as likely to identify the important class, compared to its average prevalence.
The y-axis is the ratio of the decile mean to the global mean:
• The numerator is the number of records in the respective 10% of records for which the class of interest is predicted correctly.
• The denominator is the average number of class-of-interest records expected in that 10% of records (i.e., under the overall prevalence).
Lift Charts: How to Compute
• Using the model’s predicted probabilities, sort records from most likely to least likely members of the important class

• Compute lift: accumulate the correctly classified “important class” records (y-axis) and compare to the number of total records examined (x-axis), as in the sketch below
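A sketch of this computation using the 24 sorted records from the table above; it reproduces the “9 owners after 10 cases” point and the “no model” baseline.

    # Cumulative lift from the sorted records shown above.
    import numpy as np
    import pandas as pd

    table = pd.DataFrame({
        "actual":    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
                      1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        "prob_of_1": [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
                      0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                      0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004],
    }).sort_values("prob_of_1", ascending=False)

    cum_ones = table["actual"].cumsum().to_numpy()          # y-axis of the lift chart
    print("Important-class records found after 10 cases:", cum_ones[9])   # -> 9

    # "No model" baseline: overall prevalence times the number of records examined
    baseline = table["actual"].mean() * np.arange(1, len(table) + 1)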
Lift vs. Decile Charts
Both embody the concept of “moving down” through the records, starting with the most probable

• Decile chart: does this in decile chunks of data; the y-axis shows the ratio of the decile mean to the overall mean

• Lift chart: shows continuous cumulative results; the y-axis shows the number of important-class records identified
