Spreadsheet Modeling & Decision Analysis: A Practical Introduction To Business Analytics

8th edition
Cliff T. Ragsdale
© 2017 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Chapter 10
Data Mining

The Digital World
• The digital world runs on data
• Businesses produce and collect lots of it via
– Sales and returns transactions
– Bar code scans
– Credit card transactions
– GPS and RFID tracking
– Clicks on a webpage (searches, saved searches, successful searches, prints, etc.)
• Data can be a valuable strategic asset

Data Mining
• Data mining is the process of finding and extracting useful information and insights from large datasets
• Like geological mining
– It is often hard, dirty work
– It takes the right tools
• XLMiner provides tools for data mining in Excel

The Data Mining Process
Identify Opportunity → Collect Data → Understand, Explore & Prepare Data → Identify Task & Tools → Partition Data → Build & Evaluate Models → Deploy Models

• Identify Opportunity
– Don’t dig randomly
– Begin with the end in mind
– What is the business problem/opportunity?

• Collect Data
– Decide where to dig
– Get the right data, internally or externally. This could be primary data or secondary data.
– Millions of records aren’t required; use samples
– 10p to 15p records is OK (where p = number of variables)

• Understand, Explore & Prepare the Data
– Know what the data represents. You need to understand the variables in the data.
– Make sure it is clean & complete. This is the process of cleaning the data to get rid of outliers and empty cells.
– Eliminate unneeded/redundant variables. Redundant variables can introduce multicollinearity.
– Transform variables as needed. For example, a variable can be standardized to z-scores.
– You might spend most of your data mining time here! It takes a lot of time to clean and prepare data.
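
As one illustration of the "transform variables" step, the sketch below standardizes a variable to z-scores. The function name `z_scores` and the income figures are hypothetical, and the population standard deviation is an assumed choice; a sample standard deviation would work equally well.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values so they have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumed choice)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]

# Hypothetical incomes in $000s, used only for illustration.
incomes = [40, 50, 60, 70, 80]
standardized = z_scores(incomes)
print([round(z, 3) for z in standardized])
```

After standardizing, variables measured on very different scales contribute comparably to distance-based techniques.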

• Identify Task & Tools
• First identify what is required and sought from the mining.
– Classification (supervised): the classes are already defined.
– Prediction (supervised).
– Segmentation/Clustering (unsupervised): no classes are defined, so clusters/segments need to be created.


• Partition Data
– Training. Used to build the model.
– Validation. Used to determine the parameters of the model.
– Testing (optional). Used to evaluate the model's performance on data that was not used to build or tune it.
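
A minimal sketch of partitioning follows. The 50/30/20 split, the seed, and the function name `partition` are assumptions chosen for illustration; the slides do not prescribe particular proportions.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=12345):
    """Randomly split records into training, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

# 100 hypothetical record IDs.
train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before slicing keeps each partition a random sample, so no partition inherits an ordering present in the source data.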


• Build & Evaluate Models
– Try different models
– Try different parameter settings
– Avoid overfitting: "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably".


• Deploy Models
– Integrate models in operational systems
– Train users
– Monitor results
– Look for opportunities for continuous improvement

Classification
Into which of m mutually exclusive groups does an observation of unknown origin belong?
• Character/target recognition
• Oil/gold exploration
• Loan approval/credit history check
• Diagnose diseases (cancer patients vs. non-cancer patients)
• Identify defects
• Predict bond ratings
• Fraud detection (credit card, tax, trading, etc.)
• Predict winners of sports events
• Etc.
Types of Classification Problems
• 2 Group Problems...
• m Group Problem (where m >= 2)...
• Most m-group problems have one group of primary interest and can be reduced to a 2 group problem

Example
• Universal Bank
– Wants to improve profitability of marketing efforts on personal loans
– One group of primary interest: Who will respond to loan solicitations?

Descriptive Statistics…

Transforming Variables…

Correlations…
Age and Work Experience are highly correlated.
Which one should you use? Including both invites multicollinearity.
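
To make the correlation check concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The `ages` and `experience` values are made up for illustration and are not the Universal Bank data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical values, for illustration only.
ages = [25, 32, 41, 48, 56]
experience = [1, 7, 16, 23, 30]
r = pearson_r(ages, experience)
print(round(r, 4))
```

A coefficient close to 1 or -1 suggests the two variables carry nearly the same information, so keeping only one of them is usually enough.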

Plotting the data…

Exploring relationships…
Insight!

Classification Techniques…
• Discriminant Analysis: a statistical tool whose objective is to assess the adequacy of a classification, given the group memberships, or to assign objects to one group among a number of groups.
• Logistic Regression: used to describe data and to explain the relationship between one binary dependent variable (0, 1) and one or more nominal, ordinal, interval or ratio-level independent variables.
• k-Nearest Neighbor: a method used for classification and regression. An object is assigned to the class most common among its k nearest neighbors; when k = 1, it is simply assigned to the class of its single nearest neighbor.

Classification Techniques…
• Classification Trees: one of the predictive modeling approaches used in statistics, data mining and machine learning. Trees whose target variable takes a discrete set of values are called classification trees; decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
• Neural Networks: a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns.
• Naïve Bayes: a classification technique based on Bayes' Theorem (in statistics) with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Discriminant Analysis
[Figure: scatter plot of Verbal Aptitude (y-axis) against Mechanical Aptitude (x-axis) for Satisfactory Employees and Unsatisfactory Employees, with the Group 1 centroid (C1) and Group 2 centroid (C2) marked.]
Distance Measures
• Euclidean Distance
$$\text{Distance} = \sqrt{(A_1 - A_2)^2 + (B_1 - B_2)^2}$$
• This does not account for possible differences in variances.
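
The sketch below computes the Euclidean distance from the formula above. As an assumed illustration of why variances matter, it also shows a variant that divides each squared difference by that variable's variance. The coordinates and variances are hypothetical, and this variant is not necessarily the measure used in the text.

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def variance_scaled(p, q, variances):
    """Distance with each squared difference divided by that variable's variance."""
    return sqrt(sum((a - b) ** 2 / v for a, b, v in zip(p, q, variances)))

# Hypothetical observation and centroid: (mechanical aptitude, verbal aptitude).
observation = (38.0, 36.0)
centroid = (35.0, 32.0)

d_plain = euclidean(observation, centroid)
d_scaled = variance_scaled(observation, centroid, variances=(9.0, 4.0))
print(d_plain, round(d_scaled, 3))
```

With unequal variances, the same raw difference counts for more on the low-variance variable, which plain Euclidean distance ignores.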
99% Contours of Two Groups
[Figure: contours for the two groups plotted on axes X1 and X2, with centroids C1 and C2 and a point P1.]
Fisher’s Linear Discriminant Function
• Identifies a linear function for each group
• Each function returns a classification score
for each observation
• An observation is classified into the group
whose function returns the largest
classification score
• (Classification scores may also be converted
to probabilities of group membership)
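
The classification rule described above can be sketched as follows. The coefficients in `group_coefs` are invented purely for illustration; in practice they would be estimated from training data by the software.

```python
def classification_score(coefs, x):
    """Linear score: intercept plus the sum of coefficient * value."""
    intercept, *slopes = coefs
    return intercept + sum(b * v for b, v in zip(slopes, x))

def classify(group_coefs, x):
    """Return the group whose linear function gives the largest score."""
    scores = {g: classification_score(c, x) for g, c in group_coefs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical (intercept, mechanical, verbal) coefficients for each group.
group_coefs = {
    1: (-50.0, 1.2, 1.0),
    2: (-35.0, 0.9, 0.9),
}

group, scores = classify(group_coefs, (40.0, 38.0))
print(group, scores)
```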
Accuracy Measures for Classifiers
Confusion Matrix

                Predicted Class 1      Predicted Class 0
Actual Class 1  TP (true positive)     FN (false negative)
Actual Class 0  FP (false positive)    TN (true negative)

The confusion matrix summarizes how accurately a classifier assigns observations to classes.
Precision = TP / (TP + FP)
(model accuracy on positive predictions)
Recall (Sensitivity) = TP / (TP + FN)
(how good a model is at detecting the actual positives)
Specificity = TN / (TN + FP)
(how good a model is at detecting the actual negatives)
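
As a worked example of these three measures, the sketch below computes them from confusion-matrix counts. The counts passed in are hypothetical.

```python
def classifier_metrics(tp, fn, fp, tn):
    """Precision, recall (sensitivity), and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 50 actual positives and 150 actual negatives.
m = classifier_metrics(tp=40, fn=10, fp=20, tn=130)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Here recall is 40/50 = 0.8, yet precision is only 40/60, so a third of the records flagged as positive are false alarms.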
Logistic Regression
• Computes a function that maps the independent
variables into a probability of membership in group 1
$$P_1(i) = \frac{1}{1 + e^{-(b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip})}}$$
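
The logistic function can be evaluated directly, as in this sketch. The coefficients `b` and the input values are hypothetical; in practice the coefficients would be estimated from training data.

```python
from math import exp

def prob_group1(b, x):
    """Probability of membership in group 1 for one observation."""
    b0, *slopes = b
    z = b0 + sum(bj * xj for bj, xj in zip(slopes, x))
    return 1.0 / (1.0 + exp(-z))

# Hypothetical coefficients (b0, b1, b2), for illustration only.
b = (-3.0, 0.05, 1.0)

p = prob_group1(b, (40.0, 1.0))  # linear part: -3 + 0.05*40 + 1*1 = 0
print(round(p, 3))
```

When the linear part is 0 the probability is 0.5; larger values push it toward 1 and smaller values push it toward 0.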

k-Nearest Neighbors
• To classify an observation:
1. Identify its k-nearest neighbors
2. Assign observation to the most frequently occurring group among those k neighbors
• Challenge: What should k be?
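
The two steps above can be sketched in a few lines. The training points are made-up (mechanical aptitude, verbal aptitude) pairs, and Euclidean distance is an assumed choice of distance measure.

```python
from collections import Counter
from math import dist

def knn_classify(training, new_point, k=3):
    """training is a list of (features, group) pairs."""
    # Step 1: identify the k nearest neighbors by Euclidean distance.
    neighbors = sorted(training, key=lambda rec: dist(rec[0], new_point))[:k]
    # Step 2: assign the most frequently occurring group among them.
    votes = Counter(group for _, group in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical employees, for illustration only.
training = [
    ((45, 42), "satisfactory"),
    ((42, 40), "satisfactory"),
    ((40, 44), "satisfactory"),
    ((30, 31), "unsatisfactory"),
    ((33, 29), "unsatisfactory"),
    ((35, 33), "unsatisfactory"),
]

label = knn_classify(training, (41, 41), k=3)
print(label)
```

One common way to choose k is to try several values and keep the one that classifies the validation data most accurately. An odd k avoids tied votes in two-group problems.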

k-Nearest Neighbors Example
[Figure: scatter plot of Verbal Aptitude (y-axis) against Mechanical Aptitude (x-axis) for Satisfactory Employees and Unsatisfactory Employees.]
Classification Trees
• Trees are prone to overfitting: "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably"
• Overfitting is mitigated by
– Pruning a fully grown tree, or
– Requiring a minimum number of observations per terminal node

Classification Trees
[Figure: classification tree. Cut-off points for different variables decide whether to go left or right; terminal nodes are labeled 0 (not likely to respond) or 1 (likely to respond).]
Neural Networks:
Brain Basics…
• Neural networks “mimic” (crudely) the operation of the human brain
• Brains:
– Receive stimuli
– Process the stimuli via massively interconnected sets of neurons
– Determine a response
Neural Networks:
A Computational Model…
[Figure: network diagram with inputs x_i1, x_i2, x_i3, …, x_iP in the Input Layer, one or more Hidden Layer(s), and the output y_i in the Output Layer.]
Avoiding Overfitting:
Concurrent Descent…
[Figure: error rate (y-axis) against training trials (x-axis), with one curve for the training data and one for the testing data.]
Full Bayes Classifier…
• To classify a new record
– Find all matching records
– Put new record in most frequently occurring matching group
• Problem
– Continuous variables are unlikely to match exactly
– Even with nominal variables, there might not be a match
– Eight variables with 4 levels result in 4^8 = 65,536 possible records
• Solution
– “Naïvely” assume variables are independent
  • Requires categorical independent (X) variables
  • “Binning” continuous variables results in lost information!
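
A minimal sketch of the naïve approach follows. It scores each group as the group's share of the records times the product of per-variable value frequencies within that group. The records, the variable meanings, and the function names are hypothetical, and no smoothing for unseen values is applied.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Count group sizes and per-group frequencies of each variable's values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (group, variable index) -> value counts
    for rec, y in zip(records, labels):
        for j, v in enumerate(rec):
            value_counts[(y, j)][v] += 1
    return class_counts, value_counts

def classify(model, record):
    """Score each group as P(group) times the product of P(value | group)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / total
        for j, v in enumerate(record):
            score *= value_counts[(y, j)][v] / n_y
        scores[y] = score
    return max(scores, key=scores.get), scores

# Hypothetical categorical records: (income bin, has CD account, uses online banking).
records = [
    ("high", "yes", "yes"), ("high", "no", "yes"),
    ("low", "yes", "no"), ("high", "yes", "no"),
    ("low", "no", "no"), ("low", "no", "yes"),
    ("high", "no", "no"), ("low", "yes", "no"),
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = responded, 0 = did not respond

model = train_naive_bayes(records, labels)
new_record = ("low", "yes", "yes")  # has no exact match in the training records
label, scores = classify(model, new_record)
print(label, scores)
```

The new record matches no training record exactly, so the full Bayes approach has nothing to count, while the naïve version still produces a score for each group.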

