Professional Documents
Culture Documents
Chapter 4
Predictive Analytics I: Data
Mining Process, Methods,
and Algorithms
Information
Visualization
and Algorithms
Expectation Maximization, Apriory
Link analysis Unsupervised
Algorithm, Graph-based Matching
Segmentation
DM Data Mining
3
Process Data
Preparation
6
4
• The process is
Deployment
Model
Data
Building
highly repetitive
and experimental
(DM: art versus Testing and
5
science?) Evaluation
SEMMA Data
Mining Process
Assess Explore
(Evaluate the accuracy and (Visualization and basic
• Developed by
usefulness of the models) description of the data)
Feedback
SAS Institute
Model Modify
(Use variety of statistical and (Select variables, transform
machine learning models ) variable representations)
“Actionable
PHASE 1 PHASE 2 PHASE 3 PHASE 4 PHASE 5
DEPT 1
DEPT 2
DEPT 3
Insight”
DEPT 4
3 4 5
Data 1 2
Discovery in Transformation
Extracted
Databases) Patterns
Data
Process Cleaning Transformed
Data
Data
Selection Preprocessed
Data
Target
Data
Feedback
Sources for
Raw Data
CRISP-DM
My own
SEMMA
KDD Process
My organization's
Domain-specific methodology
None
0 10 20 30 40 50 60 70
in contributing to medical
and biological research Tabulated Model
Testing Results
Tabulated
Relative Variable
endeavors? Importance
(Accuracy, Sensitivity
and Specificity) Results
TP
True Positive Rate
Positive
True False
TP FN
Predicted Class
Positive Positive
Count (TP) Count (FP)
TN
True Negative Rate
TN FP
Negative
False True
Negative Negative
Count (FN) Count (TN) TP TP
Precision Recall
TP FP TP FN
Model
Training Data Development
2/3
Trained Prediction
Preprocessed Classifier Accuracy
Data
1/3 Model TP FP
Assessment
Testing Data (scoring) FN TN
• FIGURE 4.10
A Graphical Depiction of k-Fold Cross-Validation
Slide 4-32 Copyright © 2018 Pearson Education Ltd.
Additional Estimation Methodologies for
Classification
• Leave-one-out
– Similar to k-fold where k = number of samples
• Bootstrapping
– Random sampling with replacement
• Jackknifing
– Similar to leave-one-out
• Area Under the ROC Curve (AUC)
– ROC: receiver operating characteristics (a term
borrowed from radar image processing)
• FIGURE 4.11 A
Sample ROC Curve
0.8
distributions too! 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
• FIGURE 4.12
Graphical
Illustration of a
Heterogeneous
Ensemble
1001234 1, 2, 3, 4 1 3 1, 2 3 1, 2, 4 3
1001235 2, 3, 4 2 6 1, 3 2 2, 3, 4 3
1001236 2, 3 3 4 1, 4 3
1001237 1, 2, 4 4 5 2, 3 4
1001238 1, 2, 3, 4 2, 4 5
1001239 2, 4 3, 4 3
SciKit-Learn 497
• Commercial Java
Anaconda
487
462
Hive 359
– IBM SPSS Modeler Mllib 337
Weka 315
(formerly Clementine) Microsoft SQL Server
Unix shell/awk/gawk
314
301
– … many more
SQL on Hadoop tools 211
C/C++ 210
Other free analytics/data mining tools 198
Other programming and data languages 197
–
QlikView 153 Legend:
RapidMiner Microsoft Azure Machine Learning 147 [Orange] Free/Open Source tools
Other Hadoop/HDFS-based tools 141
[Green] Commercial tools
– Weka Apache Pig
IBM Watson
132
121 [Blue] Hadoop/Big Data tools
Rattle 103
– R, … Salford SPM/CART/RF/MARS/TreeNet
Gnu Octave
100
89
Orange 89
0 200 400 600 800 1000 1200 1400 1600
Range <1 >1 > 10 > 20 > 40 > 65 > 100 > 150 > 200
(in $Millions) (Flop) < 10 < 20 < 40 < 65 < 100 < 150 < 200 (Blockbuster)
Number of
Independent Variable Possible Values
Values
Dependent
Variable MPAA Rating 5 G, PG, PG-13, R, NR
Independent Competition 3 High, Medium, Low
Variables Star value 3 High, Medium, Low
Sci-Fi, Historic Epic Drama,
Modern Drama, Politically
A Typical Genre 10 Related, Thriller, Horror,
Comedy, Cartoon, Action,
Classification Documentary
SPSS
Modeler
• Questions / Comments