Cliff T. Ragsdale
Data Mining
[Figure: data mining process: Identify Opportunity → Collect Data → Explore, Understand & Prepare Data → Identify Task & Tools → Partition Data → Build & Evaluate Models → Deploy Models]
Identify Opportunity
– Don’t dig randomly
– Begin with the end in mind
– What is the business problem/opportunity?
Collect Data
– Decide where to dig
– Get the right data – internally or externally. This could be
primary data or secondary data.
– Millions of records aren’t required – use samples
– 10p to 15p records is OK (where p = # of variables)
Partition Data
– Training: used to build (fit) the model.
– Validation: used to tune the model's parameters.
– Testing (optional): used to evaluate the performance of
the final model on data it has not seen, as a stand-in for
real-world data.
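The three partitions can be sketched in a few lines of Python. This is a minimal illustration only: the function name `partition`, the 60/20/20 proportions, and the fixed seed are assumptions, not values given on the slide.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle records and split them into training, validation, and test lists.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation partitions becomes the test partition.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Example: 100 record IDs split 60 / 20 / 20.
train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before slicing keeps any ordering in the source data (for example, by date) from leaking into one partition.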
Deploy Models
– Integrate models in operational systems
– Train users
– Monitor results
– Look for opportunities for continuous improvement
Two-Group Problems...
Universal Bank
– Wants to improve profitability of marketing
efforts on personal loans
– one group of primary interest: Who will
respond to loan solicitations?
Insight!
[Figure: scatter plot of Verbal Aptitude (vertical axis) against Mechanical Aptitude (horizontal axis) for Satisfactory Employees and Unsatisfactory Employees, with the Group 1 centroid (C1) and Group 2 centroid (C2) marked]
© 2017 Cengage Learning. All Rights Reserved. May not be
scanned, copied or duplicated, or posted to a publicly
accessible website, in whole or in part.
Distance Measures
• Euclidean Distance
$$\text{Distance} = \sqrt{(A_1 - A_2)^2 + (B_1 - B_2)^2}$$
[Figure: observation P1 and the group centroids C1 and C2]
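As a rough illustration of how the Euclidean distance can be used for classification, the sketch below assigns an observation to the group with the nearest centroid. The centroid coordinates and the observation are made-up numbers, and the helper names `euclidean` and `nearest_centroid` are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_centroid(point, centroids):
    """Return the name of the group whose centroid is closest to the point."""
    return min(centroids, key=lambda group: euclidean(point, centroids[group]))

# Hypothetical centroids: (mechanical aptitude, verbal aptitude).
centroids = {"group 1": (40.0, 38.0), "group 2": (33.0, 30.0)}
p1 = (36.0, 35.0)

print(euclidean(p1, centroids["group 1"]))  # distance from P1 to C1
print(nearest_centroid(p1, centroids))
```

The same function works for more than two variables, since the sum runs over however many coordinates the points have.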
Fisher’s Linear Discriminant Function
• Identifies a linear function for each group
• Each function returns a classification score
for each observation
• An observation is classified into the group
whose function returns the largest
classification score
• (Classification scores may also be converted
to probabilities of group membership)
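The classification rule above can be sketched as follows. The coefficients are hypothetical, chosen only so the example runs; in practice they are estimated from the training data. The conversion to probabilities assumes the scores behave like log-posteriors up to a common constant, which is the usual case for linear discriminant classification functions but is an assumption here.

```python
import math

# Hypothetical classification functions: (intercept, mechanical coef, verbal coef).
functions = {
    "satisfactory": (-60.0, 0.9, 0.9),
    "unsatisfactory": (-30.0, 0.5, 0.5),
}

def scores(x, functions):
    """Evaluate each group's linear function for observation x."""
    return {
        group: coefs[0] + sum(w * v for w, v in zip(coefs[1:], x))
        for group, coefs in functions.items()
    }

def classify(x, functions):
    """Assign x to the group whose function returns the largest score."""
    s = scores(x, functions)
    return max(s, key=s.get)

def probabilities(x, functions):
    """Convert scores to group-membership probabilities (assumed softmax form)."""
    s = scores(x, functions)
    top = max(s.values())  # subtract the max score for numerical stability
    exps = {group: math.exp(v - top) for group, v in s.items()}
    total = sum(exps.values())
    return {group: e / total for group, e in exps.items()}

print(classify((40, 38), functions))
print(probabilities((40, 38), functions))
```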
Accuracy Measures
for Classifiers
Confusion Matrix

                    Predicted Class
                    1                      0
Actual Class   1    TP (true positive)     FN (false negative)
               0    FP (false positive)    TN (true negative)
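The four cells of the confusion matrix lead directly to the usual accuracy measures. The sketch below uses the standard definitions of accuracy, sensitivity, specificity, and precision; the slide does not say which measures it has in mind, so this selection is an assumption, and the actual and predicted labels are made-up data.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fn, fp, tn

def accuracy_measures(tp, fn, fp, tn):
    """Standard measures; assumes every denominator is nonzero."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),  # share of actual 1s predicted as 1
        "specificity": tn / (tn + fp),  # share of actual 0s predicted as 0
        "precision": tp / (tp + fp),    # share of predicted 1s that are actual 1s
    }

# Made-up labels for eight observations.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(actual, predicted)
measures = accuracy_measures(tp, fn, fp, tn)
print(tp, fn, fp, tn)
print(measures)
```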
[Figure: scatter plot of Verbal Aptitude (vertical axis) against Mechanical Aptitude (horizontal axis) for Satisfactory Employees and Unsatisfactory Employees]
Classification Trees
• Trees are prone to overfitting: "the
production of an analysis that corresponds too
closely or exactly to a particular set of data, and
may therefore fail to fit additional data"
• Overfitting is mitigated by
– Pruning a fully grown tree, or
– Requiring a minimum number of observations
per terminal node
0: not likely to respond
1: likely to respond
Neural Networks:
Brain Basics…
• Neural networks “mimic” (crudely)
the operation of the human brain
• Brains:
– Receive stimuli
– Process the stimuli via massively
interconnected sets of neurons
– Determine a response
Neural Networks:
A Computational Model…
[Figure: network diagram with inputs x_i1, x_i2, x_i3, …, x_iP in the Input Layer, one or more Hidden Layer(s), and output y_i in the Output Layer]
Avoiding Overfitting:
Concurrent Descent…
[Figure: Error Rate (vertical axis) against Training trials (horizontal axis), with one curve for Testing data and one for Training data]
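One common way to use the two error curves is to stop training at the trial where the error on the held-out data is lowest, even though the training error may keep falling. The sketch below shows that selection step only. The error values are invented for illustration, the function name is hypothetical, and reading the slide as this stopping rule is an assumption.

```python
def best_stopping_trial(holdout_errors):
    """Return the index of the training trial with the lowest held-out error."""
    return min(range(len(holdout_errors)), key=lambda i: holdout_errors[i])

# Invented error rates per training trial.
training_errors = [0.40, 0.30, 0.22, 0.16, 0.12, 0.09]  # keeps decreasing
holdout_errors = [0.42, 0.33, 0.27, 0.25, 0.28, 0.31]   # falls, then rises

stop_at = best_stopping_trial(holdout_errors)
print(stop_at, training_errors[stop_at], holdout_errors[stop_at])
```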
Full Bayes Classifier…
To classify a new record
– Find all matching records
– Put new record in most frequently occurring matching group
Problem
– Continuous variables are unlikely to match exactly
– Even with nominal variables, there might not be a match
– Eight variables with 4 levels each result in 4^8 = 65,536 possible
records
Solution
– “Naïvely” assume variables are independent
– Requires categorical independent (X) variables
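A minimal sketch of the naive approach for categorical variables: instead of looking for records that match on every variable, it multiplies each class's prior by the per-variable conditional frequencies. The records, variable meanings, and function names are all hypothetical, and no smoothing is applied, so a value never seen within a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Count class frequencies and per-class value frequencies for each variable."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, variable index) -> value counts
    for x, y in zip(records, labels):
        for j, value in enumerate(x):
            value_counts[(y, j)][value] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    """Score each class as P(class) times the product of P(value | class)."""
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n
        for j, value in enumerate(x):
            score *= value_counts[(c, j)][value] / n_c  # independence assumption
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical records: (income level, has credit card); label 1 = responded.
records = [("high", "yes"), ("medium", "no"), ("high", "yes"),
           ("low", "no"), ("medium", "no"), ("low", "yes")]
labels = [1, 1, 1, 0, 0, 0]

model = train_naive_bayes(records, labels)
# ("medium", "yes") matches no training record exactly, yet it can still be scored.
predicted, class_scores = predict(("medium", "yes"), *model)
print(predicted, class_scores)
```

The scores are proportional to the class probabilities; dividing each by their sum gives probabilities that add to one.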