A Managerial Perspective on
Analytics (3rd Edition)
Chapter 4:
Data Mining
Learning Objectives
Define data mining as an enabling technology for
business intelligence
Understand the objectives and benefits of
business analytics and data mining
Recognize the wide range of applications of data
mining
Learn the standardized data mining processes
CRISP-DM
SEMMA
KDD
(Continued…)
Copyright © 2014 Pearson Education Limited Slide 4- 2
Learning Objectives
Understand the steps involved in data
preprocessing for data mining
Learn different methods and algorithms of data
mining
Build awareness of the existing data mining
software tools
Commercial versus free/open source
Understand the pitfalls and myths of data mining
[Figure: Data mining shown at the center of related disciplines: pattern recognition, machine learning, mathematical modeling, and databases]
Types of patterns
Association
Prediction
Cluster (segmentation)
Sequential (or time series) relationships
Application Case 4.2
Harnessing Analytics to Combat Crime:
Predictive Analytics Helps Memphis Police
Department Pinpoint Crime and Focus
Police Resources
Questions for Discussion
1. How did the Memphis Police Department use
data mining to better combat crime?
2. What were the challenges, the proposed
solution, and the obtained results?
A Taxonomy for Data Mining Tasks
[Table columns: Data Mining, Learning Method, Popular Algorithms]
Types of DM
Hypothesis-driven data mining
Discovery-driven data mining
Data Mining Applications
Customer Relationship Management
Maximize return on marketing campaigns
Improve customer retention (churn analysis)
Maximize customer value (cross-, up-selling)
Identify and treat most valued customers
Insurance
Forecast claim costs for better business planning
Determine optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities
Data Mining Applications
Computer hardware and software
Science and engineering
Government and defense
Homeland security and law enforcement
Travel industry
Healthcare
Medicine
(Callout: highly popular application areas for data mining)
Entertainment industry
Sports
Etc.
Source: KDNuggets.com
Data Mining Process: CRISP-DM
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment
[Figure: cyclic diagram of the six steps, with Data Sources]
Data Consolidation
· Collect data
· Select data
· Integrate data
Data Transformation
· Normalize data
· Discretize/aggregate data
· Construct new attributes
Well-formed Data
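The three data transformation steps listed above can be illustrated with a small sketch. This is only a sketch under stated assumptions: the records, the field names (`income`, `debt`, `age`), the age cut points, and the helper names are hypothetical and do not come from the slides.

```python
def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1] (normalize data)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Map a numeric value to a category (discretize data).

    labels must contain one more entry than cut_points.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Hypothetical consolidated records (illustration only).
records = [
    {"income": 30000, "debt": 6000, "age": 23},
    {"income": 60000, "debt": 3000, "age": 41},
    {"income": 90000, "debt": 45000, "age": 67},
]

incomes_scaled = min_max_normalize([r["income"] for r in records])
for record, scaled in zip(records, incomes_scaled):
    record["income_scaled"] = scaled
    record["age_group"] = discretize(
        record["age"], [30, 60], ["young", "middle", "senior"]
    )
    # Construct a new attribute from two existing ones.
    record["debt_to_income"] = record["debt"] / record["income"]
```

Each record ends up with a normalized attribute, a discretized attribute, and a constructed attribute, which is one way to arrive at the well-formed data named at the end of the diagram.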
SEMMA
Sample
Explore (Visualization and basic description of the data)
Modify (Select variables, transform variable representations)
Model (Use a variety of statistical and machine learning models)
Assess (Evaluate the accuracy and usefulness of the models)
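                          True Class
                          Positive                     Negative
Predicted    Positive     True Positive Count (TP)     False Positive Count (FP)
Class        Negative     False Negative Count (FN)    True Negative Count (TN)

\[
\text{True Positive Rate} = \frac{TP}{TP + FN}, \qquad \text{True Negative Rate} = \frac{TN}{TN + FP}
\]
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
\]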
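As a worked illustration, the rates above can be computed directly from the four confusion matrix counts. The counts used here are hypothetical, and the function name is an assumption made for this sketch. It also assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute rates from confusion matrix counts.

    Assumes each denominator is nonzero.
    """
    return {
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts (illustration only).
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that recall and the true positive rate share the same formula, so they always agree.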
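[Figure: Simple split. Preprocessed Data is divided into Training Data (2/3) and Testing Data (1/3). Training Data feeds Model Development, which produces the Classifier. Testing Data feeds Model Assessment (scoring), which yields Prediction Accuracy.]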
For ANN, the data is split into three sub-sets (training [~60%],
validation [~20%], testing [~20%])
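A minimal sketch of the split described above, covering both the 2/3 and 1/3 split and the three-way split mentioned for ANN. The function name, the fixed random seed, and the toy data are assumptions made for illustration.

```python
import random


def split_by_fractions(records, fractions, seed=42):
    """Shuffle the records and cut them into consecutive parts.

    The last part takes whatever remains, so no record is dropped.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, fraction in enumerate(fractions):
        if i == len(fractions) - 1:
            end = len(shuffled)
        else:
            end = start + round(len(shuffled) * fraction)
        parts.append(shuffled[start:end])
        start = end
    return parts


data = list(range(30))  # stand-in for preprocessed records

# Simple split: 2/3 training, 1/3 testing.
train, test = split_by_fractions(data, [2 / 3, 1 / 3])

# ANN-style split: training, validation, testing.
ann_train, ann_val, ann_test = split_by_fractions(data, [0.6, 0.2, 0.2])
```

Shuffling before cutting keeps any ordering in the source data from leaking into one of the subsets.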
Estimation Methodologies for
Classification
k-Fold Cross Validation (rotation estimation)
Split the data into k mutually exclusive subsets
Use each subset in turn as the test set while training on the
remaining k-1 subsets
Repeat the experiment k times
Aggregate the test results for a true estimate of prediction
accuracy
Other estimation methodologies
Leave-one-out, bootstrapping, jackknifing
Area under the ROC curve
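The k-fold procedure above can be sketched as follows. The majority-class "classifier", the toy labels, and all function names are hypothetical stand-ins chosen to keep the example self-contained. The sketch also assumes the records were shuffled beforehand.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k mutually exclusive folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(records, labels, k, train_fn, predict_fn):
    """Return the mean test accuracy over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(records), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        model = train_fn(
            [records[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct = sum(predict_fn(model, records[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_majority(train_records, train_labels):
    """Toy model: remember the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, record):
    """Toy prediction: always answer with the remembered label."""
    return model


records = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
estimate = cross_validate(records, labels, 5, train_majority, predict_majority)
```

Every record is used for testing exactly once, which is what separates this estimate from a single simple split.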
[Figure: ROC curves for three classifiers labeled A, B, and C; vertical axis: True Positive Rate (Sensitivity); both axes scaled from 0 to 1]
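One common way to turn an ROC curve into a single number is to approximate the area under it with the trapezoidal rule. The sketch below assumes the usual ROC convention, with false positive rate on the horizontal axis and true positive rate on the vertical axis. The curve points are hypothetical and are not read from the figure.

```python
def auc_trapezoid(points):
    """Approximate the area under a curve given (x, y) points.

    x is assumed to be the false positive rate and y the true
    positive rate, both between 0 and 1.
    """
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ROC points (illustration only).
chance_line = [(0.0, 0.0), (1.0, 1.0)]
better_model = [(0.0, 0.0), (0.2, 0.6), (0.5, 0.9), (1.0, 1.0)]

auc_chance = auc_trapezoid(chance_line)
auc_better = auc_trapezoid(better_model)
```

A curve that bows further toward the upper-left corner encloses more area, so a larger value indicates a better classifier.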
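Raw Transaction Data    One-item Itemsets    Two-item Itemsets    Three-item Itemsets
Items                   Itemset  Support     Itemset  Support     Itemset   Support
1, 2, 3, 4              1        3           1, 2     3           1, 2, 4   3
2, 3, 4                 2        6           1, 3     2           2, 3, 4   3
2, 3                    3        4           1, 4     3
1, 2, 4                 4        5           2, 3     4
1, 2, 3, 4                                   2, 4     5
2, 4                                         3, 4     3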
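The itemset counts in the table above can be reproduced with a short level-wise (Apriori-style) sketch. The six transactions are taken from the table. The minimum support of 3 is an assumption inferred from which itemsets appear at the three-item level, and the function names are hypothetical.

```python
from itertools import combinations

# The six transactions from the table (item numbers per transaction).
transactions = [
    {1, 2, 3, 4},
    {2, 3, 4},
    {2, 3},
    {1, 2, 4},
    {1, 2, 3, 4},
    {2, 4},
]


def support(itemset, transactions):
    """Count the transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def apriori_levels(transactions, min_support):
    """Return one {itemset: support} dict per itemset size."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset([item]) for item in items]
    levels, size = [], 1
    while candidates:
        counts = {c: support(c, transactions) for c in candidates}
        levels.append(counts)
        frequent = {c for c, n in counts.items() if n >= min_support}
        size += 1
        pool = sorted(set().union(*frequent)) if frequent else []
        # Keep a larger candidate only if all of its subsets one item
        # smaller were frequent (the Apriori pruning step).
        candidates = [
            frozenset(combo)
            for combo in combinations(pool, size)
            if all(frozenset(s) in frequent for s in combinations(combo, size - 1))
        ]
    return levels


levels = apriori_levels(transactions, min_support=3)  # assumed threshold
for level in levels:
    for itemset, count in level.items():
        print(sorted(itemset), count)
```

Because the pair 1, 3 has a support of only 2, any three-item set containing it is pruned without being counted, which is why only two three-item itemsets remain.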
Biological versus Artificial NN
[Figure: biological neurons, with dendrites and axons labeled]
Artificial Neural Networks
[Figure: Processing Element (PE). Inputs x1, x2, ..., xn are multiplied by weights w1, w2, ..., wn, combined by a summation, and passed through a transfer function to produce outputs Y1, Y2, ..., Yn]

\[
S = \sum_{i=1}^{n} X_i W_i, \qquad Y = f(S)
\]

Biological      Artificial
Neuron          Node (or PE)
Dendrites       Input
Axon            Output
Synapse         Weight
Slow            Fast
Many (10^9)     Few (10^2)
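The summation and transfer function of a single processing element can be sketched in a few lines. The slide does not say which transfer function f is used, so the sigmoid here is an assumption, as are the function names and the example inputs and weights.

```python
import math


def sigmoid(s):
    """One possible transfer function (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-s))


def processing_element(inputs, weights, transfer=sigmoid):
    """Compute Y = f(S), where S is the weighted sum of the inputs."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    s = sum(x * w for x, w in zip(inputs, weights))
    return transfer(s)


# Hypothetical inputs and weights (illustration only).
y = processing_element([1.0, 0.5, -1.0], [0.4, 0.2, 0.5])
print(y)
```

Passing a different callable as `transfer` changes f without touching the summation step.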
Elements/Concepts of ANN
Processing element (PE)
Information processing
Network structure
Feedforward vs. recurrent vs. multi-layer…
Learning parameters
Supervised/unsupervised, backpropagation,
learning rate, momentum
ANN Software – NN shells, integrated modules in
comprehensive DM software, …
Software
Commercial
Miner
… many more
[Chart: data mining software tools, with counts in parentheses]
R (245)
Excel (238)
Rapid-I RapidMiner (213)
SAS (101)
Rapid-I RapidAnalytics (83)
MATLAB (80)
IBM SPSS Statistics (62)
IBM SPSS Modeler (54)
SAS Enterprise Miner (46)
Orange (42)
Zementis (14)
KXEN (14)
Bayesia (14)
C4.5/C5.0/See5 (13)
Revolution Computing (11)
Source: KDNuggets.com
[Chart: programming languages, with counts in parentheses]
SQL (185)
Java (138)
Python (119)
C/C++ (66)
Other languages (57)
Perl (37)
Awk/Gawk/Shell (31)
F# (5)
Dependent Variable
Range (in $Millions)
< 1 (Flop)
> 1 and < 10
> 10 and < 20
> 20 and < 40
> 40 and < 65
> 65 and < 100
> 100 and < 150
> 150 and < 200
> 200 (Blockbuster)

Independent Variables
Independent Variable    Number of Values    Possible Values
MPAA Rating             5                   G, PG, PG-13, R, NR
Competition             3                   High, Medium, Low
Star value              3                   High, Medium, Low
Genre                   10                  Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary

A Typical Classification Process Map in IBM SPSS Modeler
[Figure: process map, including the Model Assessment process]
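The nine dollar ranges listed for the dependent variable amount to a discretization of a numeric value into classes. A minimal sketch follows. The class numbering from 1 to 9 and the handling of values that fall exactly on a boundary are assumptions, since the ranges are given only with strict inequalities. The function name is hypothetical.

```python
import bisect

# Upper bounds of the first eight ranges, in $ millions.
UPPER_BOUNDS = [1, 10, 20, 40, 65, 100, 150, 200]


def box_office_class(gross_millions):
    """Map a gross (in $ millions) to a class from 1 (Flop) to 9 (Blockbuster).

    Assumption: a value exactly on a boundary goes to the higher class.
    """
    return bisect.bisect_right(UPPER_BOUNDS, gross_millions) + 1


for gross in (0.5, 5, 70, 250):
    print(gross, box_office_class(gross))
```

Turning the numeric target into classes like this is what makes the prediction task a classification problem rather than a regression problem.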
Questions, comments