
Spreadsheet Modeling & Decision Analysis

A Practical Introduction to Business Analytics
8th edition

Cliff T. Ragsdale

© 2017 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Chapter 10

Data Mining

The Digital World
• The digital world runs on data

• Businesses produce and collect lots of it via
– Sales and returns transactions
– Bar code scans
– Credit card transactions
– GPS and RFID tracking
– Clicks on a webpage (searches, saved searches, successful searches, prints, etc.)

• Data can be a valuable strategic asset

Data Mining
• Data mining is the process of finding and extracting useful information and insights from large datasets

• Like geological mining
– It is often hard, dirty work
– It takes the right tools

• XLMiner provides tools for data mining in Excel

The Data Mining Process

Identify Opportunity → Collect Data → Explore, Understand & Prepare Data → Identify Task & Tools → Partition Data → Build & Evaluate Models → Deploy Models

• Identify Opportunity
– Don’t dig randomly
– Begin with the end in mind
– What is the business problem/opportunity?

The Data Mining Process


• Collect Data
– Decide where to dig
– Get the right data, internally or externally. This could be primary data or secondary data.
– Millions of records aren’t required; use samples
– 10p to 15p records are enough (where p = number of variables)
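The 10p-to-15p rule of thumb can be sketched in a few lines. This is only an illustration: the function names, the default multiplier of 15, and the fixed random seed are assumptions, not part of the slides.

```python
import random

def sample_size_range(num_variables):
    """Return the (low, high) record counts implied by the 10p-to-15p rule."""
    if num_variables < 1:
        raise ValueError("num_variables must be at least 1")
    return 10 * num_variables, 15 * num_variables

def sample_records(records, num_variables, multiplier=15, seed=0):
    """Draw a simple random sample of multiplier * p records (hypothetical helper)."""
    target = multiplier * num_variables
    if len(records) <= target:
        return list(records)  # not enough data to sample from, keep everything
    return random.Random(seed).sample(records, target)

low, high = sample_size_range(12)
sample = sample_records(list(range(10_000)), num_variables=12)
print(f"12 variables: {low} to {high} records; sampled {len(sample)}")
```

With 12 variables the rule suggests 120 to 180 records, far fewer than the 10,000 available in this made-up dataset.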

The Data Mining Process


• Understand, Explore & Prepare the Data
– Know what the data represents. You need to understand the variables in the data.
– Make sure it is clean and complete. Cleaning the data means dealing with outliers and empty cells.
– Eliminate unneeded/redundant variables. Redundant variables can cause multicollinearity.
– Transform variables as needed. For example, a variable can be standardized to z-scores.
– You might spend most of your data mining time here! It takes a lot of time to clean and prepare data.
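As one example of a transformation, standardizing to z-scores subtracts the mean and divides by the standard deviation. The sketch below assumes the sample standard deviation is wanted; the income values are made up for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]

incomes = [40, 50, 60, 70, 80]  # hypothetical incomes, in $1,000s
standardized = z_scores(incomes)
print([round(z, 3) for z in standardized])
```

After the transformation, variables measured on very different scales become directly comparable.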
The Data Mining Process


• Identify Task & Tools
Identify first what is required and sought from the mining.
– Classification (supervised): the classes are already defined.
– Prediction (supervised).
– Segmentation/Clustering (unsupervised): there are no predefined classes, so the clusters/segments need to be created.

The Data Mining Process


• Partition Data
– Training: used to build (fit) the model.
– Validation: used to determine the parameters of the model.
– Testing (optional): used to evaluate how the model performs on data it has not seen, as it would in the real world.
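A random three-way partition can be sketched as follows. The 60/30/10 split, the seed, and the function name are illustrative assumptions; the slides do not prescribe particular proportions.

```python
import random

def partition(records, train=0.6, validation=0.3, seed=42):
    """Shuffle records and split them into training, validation and test sets."""
    if train <= 0 or validation < 0 or train + validation > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])  # whatever is left becomes the test set

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Shuffling before splitting matters when the records are stored in a meaningful order, such as by date or by customer segment.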

The Data Mining Process


• Build & Evaluate Models
– Try different models
– Try different parameter settings
– Avoid overfitting: "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably".

The Data Mining Process


• Deploy Models
– Integrate models in operational systems
– Train users
– Monitor results
– Look for opportunities for continuous improvement

Classification
Into which of m mutually exclusive groups does an observation of unknown origin belong?

• Character/target recognition
• Oil/gold exploration
• Loan approval/credit history check
• Diagnose diseases (e.g., cancer patients vs. non-cancer patients)
• Identify defects
• Predict bond ratings
• Fraud detection (credit card, tax, trading, etc.)
• Predict winners of sports events
• Etc.
Types of Classification Problems

• 2 Group Problems...

• m Group Problems (where m >= 2)...

• Most m-group problems have one group of primary interest and can be reduced to a 2 group problem

Example

• Universal Bank
– Wants to improve profitability of marketing efforts on personal loans
– One group of primary interest: Who will respond to loan solicitations?

Descriptive Statistics…

Transforming Variables…

Correlations…

Age and Work Experience are highly correlated. Which one should you use? Including both risks multicollinearity.
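A quick way to spot such redundant pairs is to compute their correlation coefficient. The sketch below implements Pearson's r directly; the age and experience values are made up and are not taken from the Universal Bank data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

age = [25, 32, 41, 47, 53, 60]          # hypothetical values
experience = [1, 7, 16, 22, 27, 35]     # hypothetical values
r = pearson_r(age, experience)
print(f"r = {r:.3f}")
```

A value of r close to 1 or -1 suggests the two variables carry nearly the same information, so one of them is a candidate for removal.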

Plotting the data…

Exploring relationships…

Insight!

Classification Techniques…
• Discriminant Analysis: a statistical tool used to assess the adequacy of a classification, given the group memberships, or to assign objects to one group among a number of groups.
• Logistic Regression: used to describe data and to explain the relationship between one binary dependent variable (0, 1) and one or more nominal, ordinal, interval or ratio-level independent variables.
• k-Nearest Neighbor: a method used for classification and regression. An object is assigned to the most frequently occurring class among its k nearest neighbors; with k = 1, it is simply assigned to the class of its single nearest neighbor.

Classification Techniques…
• Classification Trees: one of the predictive modeling approaches used in statistics, data mining and machine learning. Trees whose target variable is categorical are called classification trees; decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
• Neural Networks: a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns.
• Naïve Bayes: a classification technique based on Bayes' Theorem (in statistics) with an assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Discriminant Analysis
[Figure: scatter plot of Verbal Aptitude (vertical axis) against Mechanical Aptitude (horizontal axis) for Satisfactory and Unsatisfactory Employees, with the Group 1 centroid (C1) and the Group 2 centroid (C2) marked.]
Distance Measures

• Euclidean Distance

$$\text{Distance} = \sqrt{(A_1 - A_2)^2 + (B_1 - B_2)^2}$$

• This does not account for possible differences in variances.
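The effect of unequal variances can be seen in a small sketch. The first function is the Euclidean distance above; the second divides each difference by that variable's standard deviation, which is one simple adjustment assumed here for illustration and not necessarily the measure the slides go on to use.

```python
from math import sqrt

def euclidean(p, q):
    """Plain Euclidean distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardized_euclidean(p, q, std_devs):
    """Euclidean distance after scaling each variable by its standard deviation.

    Illustrative assumption: this ignores correlations between variables.
    """
    return sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(p, q, std_devs)))

point, centroid = (3.0, 4.0), (0.0, 0.0)
print(euclidean(point, centroid))                           # 5.0
print(standardized_euclidean(point, centroid, (3.0, 4.0)))  # about 1.414
```

In the raw distance the second variable dominates; once each variable is scaled by its spread, both contribute equally.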
99% Contours of Two Groups
[Figure: 99% contours of two groups plotted on axes X1 and X2, with centroids C1 and C2 and an observation P1.]
Fisher’s Linear Discriminant Function
• Identifies a linear function for each group
• Each function returns a classification score
for each observation
• An observation is classified into the group
whose function returns the largest
classification score
• (Classification scores may also be converted
to probabilities of group membership)
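The scoring-and-classifying steps can be sketched for two predictors. The slides do not give the formula, so this assumes one standard formulation: a pooled within-group covariance matrix S, group means m_k, and a score of m_k' S^-1 x - 0.5 m_k' S^-1 m_k + ln(prior_k). All names and the aptitude scores are hypothetical.

```python
from math import log

def fit_classification_functions(points, labels):
    """Fit one linear classification function per group (two predictors only)."""
    groups = sorted(set(labels))
    n = len(points)
    means, counts = {}, {}
    for g in groups:
        members = [p for p, lab in zip(points, labels) if lab == g]
        counts[g] = len(members)
        means[g] = (sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members))
    # Pooled within-group covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sxy = syy = 0.0
    for p, lab in zip(points, labels):
        dx, dy = p[0] - means[lab][0], p[1] - means[lab][1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    dof = n - len(groups)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    functions = {}
    for g in groups:
        mx, my = means[g]
        w1 = (syy * mx - sxy * my) / det   # first row of S^-1 times the mean
        w2 = (sxx * my - sxy * mx) / det   # second row of S^-1 times the mean
        const = -0.5 * (w1 * mx + w2 * my) + log(counts[g] / n)
        functions[g] = (const, w1, w2)
    return functions

def classify(functions, x):
    """Return the group with the largest classification score, plus all scores."""
    scores = {g: c + w1 * x[0] + w2 * x[1] for g, (c, w1, w2) in functions.items()}
    return max(scores, key=scores.get), scores

# Made-up (mechanical, verbal) aptitude scores; 1 = satisfactory, 2 = unsatisfactory
points = [(42, 40), (45, 38), (40, 42), (44, 41), (32, 33), (35, 30), (30, 34), (34, 31)]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
functions = fit_classification_functions(points, labels)
print(classify(functions, (43, 39))[0], classify(functions, (31, 32))[0])
```

The prior term uses each group's share of the training data; with equal group sizes, as here, it does not affect which group wins.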
Accuracy Measures for Classifiers

Confusion Matrix
                 Predicted Class 1      Predicted Class 0
Actual Class 1   TP (true positive)     FN (false negative)
Actual Class 0   FP (false positive)    TN (true negative)

The confusion matrix summarizes a classifier's accuracy by comparing predicted classes with actual classes.

Precision = TP / (TP + FP)
(model accuracy on positive predictions)
Recall (Sensitivity) = TP / (TP + FN)
(how good a model is at detecting the actual positives)
Specificity = TN / (TN + FP)
(how good a model is at detecting the actual negatives)
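These measures follow directly from the four cell counts. The sketch below tallies the cells from lists of actual and predicted 0/1 classes; the function name and the ten example records are made up for illustration.

```python
def classification_metrics(actual, predicted):
    """Count confusion-matrix cells for 0/1 classes and derive the three measures."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)

    def ratio(numerator, denominator):
        # Undefined when the denominator is zero (e.g. no positive predictions)
        return numerator / denominator if denominator else float("nan")

    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical outcomes
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # hypothetical model output
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Here TP = 3, FN = 1, FP = 1 and TN = 5, so precision and recall are both 0.75 while specificity is 5/6.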
Logistic Regression
• Computes a function that maps the independent variables into a probability of membership in group 1

$$P_1(i) = \frac{1}{1 + e^{-(b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip})}}$$
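The logistic function above can be evaluated directly once the coefficients are known. In this sketch the coefficients and the observation are hypothetical; estimating the b values from data is a separate step not shown here.

```python
from math import exp

def logistic_probability(intercept, coefficients, x):
    """Probability of membership in group 1 for one observation x."""
    z = intercept + sum(b * value for b, value in zip(coefficients, x))
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    # Equivalent form that avoids overflow when z is a large negative number
    ez = exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients b0, (b1, b2) and one observation (x_i1, x_i2)
p = logistic_probability(-4.0, [0.05, 1.2], [60.0, 1.0])
print(round(p, 3))
```

Whatever the inputs, the result lies between 0 and 1, which is what lets it be read as a probability.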

k-Nearest Neighbors
• To classify an observation:
1. Identify its k nearest neighbors
2. Assign the observation to the most frequently occurring group among those k neighbors

• Challenge: What should k be?
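The two steps can be sketched as below, assuming Euclidean distance and a simple majority vote. The function name and the aptitude scores are made up for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_classify(training_points, training_labels, new_point, k):
    """Assign new_point to the most frequent group among its k nearest neighbors."""
    by_distance = sorted(zip(training_points, training_labels),
                         key=lambda pair: dist(pair[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up (mechanical, verbal) aptitude scores
points = [(42, 40), (45, 38), (40, 42), (44, 41), (32, 33), (35, 30), (30, 34), (34, 31)]
labels = ["Satisfactory"] * 4 + ["Unsatisfactory"] * 4

print(knn_classify(points, labels, (41, 39), k=3))
```

With two groups, an odd k avoids tied votes. One common way to choose k, assumed here rather than stated in the slides, is to try several values and keep the one with the lowest error on the validation data.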

k-Nearest Neighbors Example
[Figure: scatter plot of Verbal Aptitude (vertical axis) against Mechanical Aptitude (horizontal axis) for Satisfactory and Unsatisfactory Employees.]
Classification Trees
• Trees are prone to overfitting: "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably"
• Overfitting is mitigated by
– Pruning a fully grown tree, or
– Requiring a minimum number of observations per terminal node

Classification Trees
Cut-off points for different variables decide whether to go Left or Right

0: not likely to respond
1: likely to respond
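How a single cut-off might be chosen can be sketched for one variable and a 0/1 target. The slides do not name a splitting criterion, so this assumes the Gini impurity and enforces a minimum number of observations on each side of the split; the income and response values are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2 * p1 * (1 - p1)

def best_cutoff(values, labels, min_leaf=2):
    """Find the cut-off on one variable that minimizes weighted Gini impurity.

    Returns (cutoff, impurity), or None if no split leaves at least
    min_leaf observations on each side.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        if best is None or impurity < best[1]:
            best = (cutoff, impurity)
    return best

income    = [30, 45, 60, 80, 95, 110, 130, 150]  # hypothetical, in $1,000s
responded = [0, 0, 0, 0, 1, 0, 1, 1]             # 1 = likely to respond
print(best_cutoff(income, responded, min_leaf=2))
```

Records below the chosen cut-off go Left and the rest go Right; a full tree repeats the search within each branch. Raising min_leaf is one way to require a minimum number of observations per terminal node.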
Neural Networks: Brain Basics…
• Neural networks “mimic” (crudely) the operation of the human brain
• Brains:
– Receive stimuli
– Process the stimuli via massively interconnected sets of neurons
– Determine a response
Neural Networks: A Computational Model…
[Figure: a network with an Input Layer (x_i1, x_i2, x_i3, …, x_iP), one or more Hidden Layers, and an Output Layer producing y_i.]
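The flow from inputs through a hidden layer to the output can be sketched as a single forward pass. This assumes one hidden layer and a sigmoid activation at every node; the weights are arbitrary illustrative numbers, and training them is not shown.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> hidden layer -> single output y."""
    hidden = [
        sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(output_bias + sum(w * h for w, h in zip(output_weights, hidden)))

# Three inputs, two hidden nodes, one output (all values hypothetical)
y = forward(
    inputs=[0.5, -1.0, 2.0],
    hidden_weights=[[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2]],
    hidden_biases=[0.0, 0.1],
    output_weights=[0.7, -0.4],
    output_bias=0.05,
)
print(round(y, 3))
```

Each hidden node computes a weighted sum of the inputs and passes it through the activation; the output node does the same with the hidden values.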
Avoiding Overfitting: Concurrent Descent…
[Figure: Error Rate (vertical axis) against Training trials (horizontal axis), with one curve for the Testing data and one for the Training data.]
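One common reading of such a chart, assumed here since the slide text does not spell it out, is to stop training at the trial where the error on the held-out data is lowest, even though the training error keeps falling. The error rates below are made up for illustration.

```python
def best_stopping_trial(held_out_errors):
    """Index of the training trial with the lowest held-out error rate."""
    return min(range(len(held_out_errors)), key=held_out_errors.__getitem__)

# Hypothetical error rates recorded after each training trial
training_errors = [0.40, 0.30, 0.22, 0.16, 0.11, 0.08, 0.05]
testing_errors  = [0.42, 0.33, 0.27, 0.24, 0.25, 0.28, 0.32]

stop_at = best_stopping_trial(testing_errors)
print(stop_at, training_errors[stop_at], testing_errors[stop_at])
```

Beyond that trial the model keeps improving on the training data but gets worse on data it has not seen, which is the signature of overfitting.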
Full Bayes Classifier…
• To classify a new record
– Find all matching records
– Put new record in most frequently occurring matching group
• Problem
– Continuous variables are unlikely to match exactly
– Even with nominal variables, there might not be a match
– Eight variables with 4 levels result in 4^8 = 65,536 possible records
• Solution
– “Naïvely” assume variables are independent
• Requires categorical independent (X) variables
• “Binning” continuous variables results in lost information!
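The naïve solution can be sketched for categorical variables: instead of looking for exact matches, multiply each group's prior by the per-variable conditional frequencies. The function names and the small dataset are made up, and no smoothing is applied, so a value never seen in a group gives that group a score of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Count how often each value of each variable occurs within each group."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (group, variable index) -> value counts
    for record, label in zip(records, labels):
        for j, value in enumerate(record):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def classify(model, record):
    """Score each group as prior times the product of conditional frequencies."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the group
        for j, value in enumerate(record):
            score *= value_counts[(label, j)][value] / count  # P(value | group)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical records: (education, uses online banking); label 1 = responded
records = [("Grad", "Yes"), ("Grad", "No"), ("Grad", "Yes"), ("Undergrad", "Yes"),
           ("Undergrad", "No"), ("Undergrad", "No"), ("Undergrad", "Yes"), ("Grad", "No")]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

model = train_naive_bayes(records, labels)
predicted, scores = classify(model, ("Grad", "Yes"))
print(predicted, scores)
```

For ("Grad", "Yes") the score for group 1 is 3/8 × 3/3 × 2/3 = 0.25 and for group 0 is 5/8 × 1/5 × 2/5 = 0.05, so the record is assigned to group 1. The scores are proportional to the group probabilities; dividing each by their sum gives probabilities that add to 1.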

