
Index

Note: Page numbers followed by f indicate figures and t indicate tables.

A Application fraud, 294


Academic analytics, 261–267 Application programming interfaces (APIs), 776
Access tools, data, 100–101 data insights, 777
Accountable Care Organizations (ACOs), 237 language, 776–777
Account-centric database vs. customer-centric database, speech, 777
25–27 vision, 777
Accuracy, 759 Area under the curve (AUC), 223
global, 217–218 Area under the ROC curve (AUROC), 759
weighted, 759 Aristotle, 11–12, 12f
Advisor perceptrons, 706 Artificial neural network (ANN), 15, 15f, 126–129, 132,
Aetna’s health-plan pilot program, 255 181, 743–744
Affordable Care Act (ACA), 235–236 advantages, 131
Agile modeling, 723–726 disadvantages, 131
timeliness and sufficiency in, 725–726 implementations, 131
Akaike’s information criterion, 229 multilayer perceptron, 744f
Algorithm. See also Classification algorithm processing of, 745–746
AdaBoost, 229, 230f two-layer neural network, 744f
Adding a Cost Matrix, 230 Artistic steps, in data mining, 53
advanced data mining, 151–166 Art of data mining, 52–53
advanced general-purpose machine-learning, 149 Association of Certified Fraud Examiners, 289
association rules, 124–125, 125t, 126–127f AT&T Bell Labs, 292
data mining, 121–136 AutoDiscovery from ButlerScientifics, 17
decision tree, 207 Automated
inadequate experimental design improper, 224–225 analytics, 186
Kernel learning, 211–212 business modeler, 17
machine-learning, 87–88, 94, 201 modeling interface, DMRecipe, 94–96, 95f
multivariate adaptive regression splines, 159 neural net, 134f, 209
parameter adjustment, 229–230 predictive analytics applications, 17
parametric modeling, 272 statistician project, 17
prediction, 94 Automobile fraud, 294
regression tree, 565, 573
SANN, 209, 210t B
systematic error assessment improper, 224–225 Back-propagation operation, 129–131, 130–131f
Alpha error, 14 Bagging technique, 229–230, 339–342, 344f,
Analytical model 706–708
errors in, 216–227 Bat signal modeling, 34, 34f
evaluation Bayes, Thomas, 6, 7f
classification errors, 216–220 approach, 7, 10, 244
on predictive power, 216 averaging model, 706–708
on random error, 221–224 theorem, 6–7, 10–11
systematic errors, 224–225 Behaviorism, 270
ANN. See Artificial neural network (ANN) Behavior pattern, static measures vs. evolutionary
Antinomy, 705–706 measures, 282
APIs. See Application programming interfaces (APIs) Benford’s law, 293
Apple education ecosystem, 271, 274 Bias, 759


Big Data in education, 259–273 CHAID. See Chi-square automatic interaction detection
data analytics, 274–275 (CHAID)
donor Charge-back fraud, 293
development, 265–266, 266f Check fraud, 294
recruitment, 266 Chi-square automatic interaction detection (CHAID),
retention, 266–267 144, 177–184
drivers for innovation, 261–263 data reduction, 74
future scenarios, 262–263 Claim fraud, 294
industry vendors, 263 Classical statistical sensitivity, 131
machine learning techniques, 272–273 Classification, 169
math and statistical analysis, 271–272 accuracy vs. generality, 171
student algorithm, 185–186
achievement, 267–268 advantages and disadvantages, 174–177
recruitment, 264–265 CHAID, 177–184
retention, 265 decision trees, 174–175
Bionomics: Economy As Ecosystem (Rothschild), 721 logistic regression, 179–181, 180–181f
Black box modeling, 706 Naïve Bayesian classifiers, 182–184
Boosted tree model, 614–625, 648–650 nearest-neighbor classifiers, 178–179
Boosting technique, 228–229, 339–342, 343f, 706–708 neural networks, 181–182
Bootstrap phases in operation of, 172–174
method, 225–226 random forests, 175
sampling test, 763–764 rule induction, 176–177, 177t
Bumping, 710 assumptions, 171–172
Business initial operations in, 169–170
customer relationship management issues in, issues with, 170–171
279–280 Classification and regression trees (CART), 138–144, 175
ecosystem advantages, 205–207
customer relationship management in, 281–285 data mining tool packages, 204–205
for data mining, 42–43 decision tree, 138–144, 139–141f
transforming corporations into, 280–281 issues, 206–207
network intrusion, 296–297 numerical prediction with, 202–204, 204t
objectives pruning trees, 142–144, 143f
application, 628 recursive partitioning, 139–142
data file, 628 Cleaning of data, 471–496, 471–496f
of data mining model, 42 Clinical psychology, 675, 676–691f, 678, 680, 682, 687,
description of variables, 628, 629t 692–693t, 694–698f, 699t, 700–701f, 701
marketing impacts, 627–628 Closing the information loop, 52
performance, 628 Cloud computing, 19
predictive analytics impact, 628 Clustering, 169
organism Cognitive era, 780
complex system, 722–723 Cognitivism, 270
decision-making activities in, 721–723 Coincidence (confusion) matrix, 217
muscles in, 721–722 Collaborative Leader Profile, 651
understanding, 42–44 Collinearity problem, 8–9
analytic project goals, 43–44 Column Selection tab, 452, 452f
Complex model, 228, 228f
C Conditional probability, 6
Capital One, 292 Confusion matrix, 759
CAR. See Customer analytic record (CAR) Constant
CART. See Classification and regression trees (CART) filling missing values with, 529–538, 529–538f
Categorical variables variance, 9
dummy coding, 497–514, 497f, 499–513f Constructivism, 270–271
frequencies of, 636t Continuous data regressions, 759
Central limit theorem, 191 Convolutional neural network (CNN), 747–750
Covariance inflation criterion (CIC), 710–711 Customer relationship management (CRM), 26–27
Credit card fraud, 293, 295 in business ecosystem, 281–285
CRISP-DM. See Cross industry standard process for issues in business, 279–280
data mining (CRISP-DM) model, 216, 727–728
CRM. See Customer relationship management (CRM)
Cross industry standard process for data mining D
(CRISP-DM), 40–41, 41f, 727–728 Data
acceptance criteria, 729 abstractions, 70–73, 72f, 283
access data, 730 access tools, 100–101
analytical goals of project, 728 acquisition, 57–58, 443–458, 443f, 445–458f
business high-level query languages, 58
goals, 728 low-level and ODBC database connections, 58
stakeholders, 728 query-based data extracts, 57
understanding phase, 728–729 analysis
characterization and description of data, 730–732 DMRecipe process, 123
data exploratory, 28
cleaning, 733 analytics in healthcare, 240
preparation phase, 732–735 descriptive analytics, 241–242, 241f
reduction, 733–734 predictive analytics, 241
understanding phase, 729–732 prescriptive analytics, 241
derived variables, 735 assessment, 62
enhancing and enriching data, 730 cleaning, 62–63, 733
filtering, 734 and recoding, 471–496, 471–496f
handling cluster in clustering problem, 145–146, 145f
of outliers, 735 conditioning
of temporal data, 735 data set balancing, 80–81
missing value imputation, 734–735 segmentation, 81
modeling standardization, 79–80
activities, 48–51, 48f DataRobot, 17
algorithms, 48 derivation
architecture, 48 assignment/derivation of target variable, 78
assumptions, 48–49 attribute-oriented induction of generalization
phase, 736 variables, 79
predictive analytics projects, 736–739 derivation of new predictor variables, 78–79
recoding, 734 description, 60–62, 459–469, 459–469f
sampling regimes, 732 discretization, 77–78
service level agreement, 729 exploration tools, 102–107, 104t, 105–108f
standardization, 734 extraction, 58–60
target variable, 729 filtering, 69
timeline, 729 removal of outliers, 69, 515–528, 515–528f
working relationships, 728 time-series filtering, 70
Cross sell modeling, 279, 284 health check, beta procedure, 652, 652–665f, 654–655,
Cross tabulation, 637 657–658, 661–663, 666, 666t, 667–673f, 669, 671
internet-dependent service categories, 638–639, 638t imputation, 65–69, 68t, 529
by model, 649t filling missing values with constants, 529–538,
phone service vs. multiple lines, 637–638, 637t 529–538f
STATISTICA Data Miner workspace, 644, 645f filling missing values with formulas, 539–564,
Customer 539–545f, 547–564f
centric database vs. account-centric database, 25–27 filling missing values with model, 565–596, 566–596f
response modeling, 282–283 maximum likelihood imputation, 67
retention modeling, 279 missing at random assumption, 65
Customer analytic record (CAR), 27, 730 missing completely at random assumption, 65
creation, 28 multiple imputation, 67–69
decision-making activities in, 721–723 techniques for imputing data, 66–67

Data (Continued) artistic steps in, 53


integration, 443–458, 443f, 445–458f challenges, 30–31
mart definitions, 22–24
physical, 26 historical development, 30t
virtual, 26–27 issues in, 31–33
paradigm shift, 27 and knowledge discovery, 23f
preparation, 47, 55–56, 630–640 model
batch statistics, 632–633, 633f business objectives of, 42
DMRecipe process, 123 creation, 34
issues, 55–56 model-theoretic for, 24
Process Missing Data configuration, 641f project
recoding missing data, 639–640 example, 33–34
sorting configuration screen, 639f requirements for success, 33
STATISTICA Data Miner, data set to, 630–639 science of, 39
profiling, 62 strengths, 25
recoding configuration screen, 640f theoretical framework, 24–25
reduction, 733–734 tools for pattern discovery, 35–36
chi-square automatic interaction detection, 74 workspace method, statistica, 319, 319–333f, 321–333
correlation coefficients, 74 Data mining recipes (DMR), 646–647
DMRecipe process, 123 evaluation, 648–650
Gini index, 75 lift charts, 650, 650f
graphical methods, 75–76 model performance, 648t
principal components analysis, 74–75 module
reduction of dimensionality, 73 Data Mining tab, 306f
sampling, 76–77 Open Midwest Data with Statistica, 305, 306–317f
science, 22 Decision-making activities, in business organism, 721–723
segmentation, 81 Decision tree, 138–144, 139f
set balancing, 80–81 models, 94
over-sampling, 81 Deductive method, 21, 40
prior probabilities, 81 Deep learning (DL), 18
under-sampling, 80 ANNs, 743–746
weights, 81 development, 741
set partitioning, 569 human cognition, 742–743
set to STATISTICA Data Miner workspace, 630–639 IBM Watson, 776
source multiple input data sets, 750
merging, 443–458, 443f, 445–458f Deep learning neural networks (DLNNs), 746–750, 749f
selection screen, 643f Definitional data abstraction, 73, 283
transformation, 497 Demographic data, 291
accuracy vs. precision, 64–65, 64f Deontological-normative perspective, 769
categorical variables, 63–64 Deployment, 52
numerical variables, 63 Descriptive data analytics, 241–242, 241f
understanding, 55–81 Descriptive modeling, 28–29
issues, 56 Descriptive statistics, 459
understanding activities, 44–47 Differentiation of Self Inventory (DSI), 651
data acquisition, 45, 46f Dimensionality reduction, DMRecipe process, 124
data description, 45 Discovering patterns/rules, data mining activity, 29
data integration, 45, 46f DL. See Deep learning (DL)
data quality assessment, 47 DMRecipe. See Data Miner Recipe (DMRecipe) process
Data Miner Recipe (DMRecipe) process, 122–124, 123f DMWay, 17
automated modeling interface, 94–96, 95–96f Domain knowledge, 35
Data Miner Workspace, 90 Donor engagement index, 266
templates, 108–109 Drill down tool, 106f, 107, 108f
Data mining, 4–6, 599–606 DSI. See Differentiation of Self Inventory (DSI)
activities, 28–30 Duality, 6–11
application, examples, 31 Dummy variables, 497
E FeatureLab, 17
Educational data mining analytics, 261 Feature ranking methods
vs. learning analytics, 260f bivariate methods, 86
student achievement, 267–268 complex methods, 88
Educational psychology, 268–270 Gini index, 84–86
industrial integration of, 274–275 multivariate methods, 86–88
paradigms in, 270–271 Feature selection, 94, 110
eFalcon, 295 types, 84
Efficiency paradigm, 720–726 feature ranking methods, 84–88
Efficient business, 280 subset selection methods, 88–96, 89–93f
Efficient solution, 720 Feed-forward design, 744
80:20 rule, 723–724 FICA. See Fast independent component analysis (FICA)
EIM. See Enterprise information management (EIM) Filling missing values, 529
EM cluster analysis, 146 with constants, 529–538, 529–538f
Ensemble modeling, 49, 705–706 with formulas, 539–564, 539–545f, 547–564f
building, 706–708 with model, 565–596, 566–596f
complexity, 709–711 Filtering of data, 734
decision tree surface with noise, 712–715 Financial fraud system, 296
estimation surfaces of, 708f Firmographic data, 291
generalized degrees of freedom, 711 Fisher, Ronald, 8, 10–11, 10f, 285
GMDH, 708 Flat file format, 443
median error, 709f Formulas, filling missing values with, 539–564,
out-of-sample errors, 709f 539–545f, 547–564f
overlinear case, 710 Fraud detection, 289
relative out-of-sample error, 707f approach, 292–293
underlinear case, 710 building profiles, 301–302
Enterprise information management (EIM), 4–5 deployment of, 302
Errors intrusion, 296–297
in analytical models, 216–227 issues with, 289–292
Type I and type II, 759 modeling, 294–295, 294f
Ethics supervised classification of, 293–294
academic secular ethics, 768–769 system building, 295–296
and data science, 769–770 time-based features, 297–301
example for, 767–768 Frequency tables, 103–104
existential-motivational perspective, 768–769 F-value, 218
normative-deontological perspective, 768
situational-teleological perspective, 768 G
ETL. See Extract-transform-load (ETL) Gain
Existential-motivational perspective, 768–770 chart, 760
Experimental bias, 732 curve, 219–220
Exploration tools, data, 102–107, 104t, 105–108f Galton, Francis, 7, 192
Exploratory data analysis, 28 GAM. See Generalized additive models (GAM)
Exponential distributions, 200–201 General CHAID models, 144
Extended markup language (XML), 25 Generalization data abstraction, 73, 283
External data, 114, 114f Generalized additive models (GAM), 136–137
Extract-transform-load (ETL) interpreting results, 137, 137f
capabilities, 100–101, 101f outputs, 137
process, 45, 60f Generalized degrees of freedom (GDF), 706,
Extreme programming software development 711–716
(XP), 724 Generalized regression neural networks (GRNN),
32, 133
F General linear model (GLM), 13–14, 136, 195–197
Fair Isaac fraud detection systems Falcon Fraud Geometric progression, 188
Manager, 295 Gini index, 75, 84–86
Fast independent component analysis (FICA), 165–166 GLM. See General linear model (GLM)

Global In-place data processing (IDP), 113–115


accuracy, 217–218 STATISTICA Data Miner, 114, 114–115f
minimum error, 378, 378f Intentional bias, 732
GRNN. See Generalized regression neural networks Interactive Drill Down tool, 106f
(GRNN) Interactive menus interface, 94
Group method of data handling (GMDH), 708 Interactive Trees (I-Trees), 151–155, 152f
advantages, 153–154
H building trees interactively, 154
Hancock, 292 combining techniques, 155
Healthcare manually building the tree, 153, 153f
data analytics in, 240–241 tree browser, 153, 154f
fraud, 294 Intrinsically linear regression models, 198–200
future of, 235–246
IBM Watson, 778–779 J
predictive analytics in, 244–245, 248 Jackknife method, 225
transformation, 245–246 Java Snippet code, 539
transformational waves, 235, 236f Joined table, 453–454, 453f
consumer retail revolution, 238, 239f Joiner node, 449, 449f
health system devolution, 239–240, 240f Just barely good enough (JBGE), 724–726
provider value evolution, 236–237, 237f
Histogram in KNIME, 359, 360–376f, 362, 367, 370–371 K
Homoscedasticity, 9 KDD. See Knowledge discovery in databases (KDD)
Householded database, 27 Kernel function, 160–161
Human K-fold cross validation, 226–227
behavioral modeling, 280 k-means clustering, 145–146
nature, 282 KNIME
neuron histogram in, 359, 360–376f, 362, 367, 370–371
learning process of, 129 local minimum error, 378–379
structure, 127f Occam’s razor, 377–378
Hypothesis space, 16 predictive analytics in, 379–391
select features, 377
I strategies in, 379–391
IBM Watson, 773–774 Knowledge discovery, 22
APIs (see Application programming interfaces and data mining, 23f, 710
(APIs)) Knowledge discovery in databases (KDD), 4, 23–24, 39
cognitive era, 780 Kohonen networks, 133, 166
deep learning, 776 Kolmogorov-Smirnov (KS)
healthcare, 778–779 caveats, 224
IBM Watson Analytics project, 628 statistic, 223
internal features of, 774–776
Jeopardy, 774 L
natural language processing, 775 Lag variables, 283–285
transportation, 779–780 Learning analytics, 260–261
unstructured text and data, statistical analysis, 775 vs. educational data mining analytics, 260f, 261
ICA. See Independent components analysis (ICA) education psychology, 268–270
IDP. See In-place data processing (IDP) student performance, 260, 267, 271
Immersive learning, 274–275 Learning management system (LMS), 262, 274
Importance plots of variables, 110–113, 112f Learning surface topology, 132f
Imputation process, 47 Life insurance fraud, 294
Independent components analysis (ICA), 164–166, 165f Lift
Indexed sequential access method (ISAM) database, 25 chart, 650, 650f, 760
Inductive database approach, 21, 24–25, 40 curve, 285, 286–287f
Industrial revolution, 280 Likert scale, 200
Inexact (“fuzzy”) matching, 31 Linearity assumption, 191–192
Linear regression (LR), 192–194, 708 Mixed models, application to, 207
collinearity among variables in, 193 MLP. See Multilayer perceptron (MLP)
piecewise, 201 MLR. See Multiple linear regression (MLR)
response surface, 193–194, 195f Model (ing)
stepwise, 86–87 activities, 47–51
variable interactions in, 193 CRISP-DM, 48–51, 48f
Linear response analysis, 188 algorithm, 497
Link analysis, 162, 292–293 analysis tools, 110–113, 110–113f, 111–112t
employing visualization, 164 building, DMRecipe process, 124
Link discovery (LD), 292–293 deployment, DMRecipe process, 124
LiquidCredit Fraud Solution, 295 enhancement
Local minimum error, KNIME, 378–379 checklist, 231–232
Local nonparametric method, 159 techniques, 227–231
Logistic regression, 16, 179–181, 180–181f filling missing values with, 565–596, 566–596f
Logit management tools, 108–109, 109f
model, 13 monitors, 117
regression, 200 process, evaluation and enhancement, 215–216,
Long short-term memory (LSTM) approach, 750 216f
Loom Systems, 17 theoretic for data mining, 24
Modern statistics, 6–11
M analysis, second generation, 13–15
Machine-learning (ML) MOOC. See Massive open online course (MOOC)
program, analyzing imbalanced data sets with, MSE. See Mean squared error (MSE)
172, 173f Multicollinearity problem, 8–9
technology, 272–273, 285, 746 Multilayer perceptron (MLP), 134
third generation, 15–16 linearly separable, 133–134
tools, 273 nonseparable classes, 134f
MAE. See Mean absolute error (MAE) Multiple linear regression (MLR), 136
MAR. See Missing at random (MAR) assumption Multivariate Adaptive Regression Splines
Marketing impacts, 627–628 (MARSplines), 88, 155–159
MARSplines. See Multivariate Adaptive Regression advantage, 159
Splines (MARSplines) applications, 158
Massively parallel processor (MPP) technology, 775 basis functions, 156, 157f
Massive open online course (MOOC), 274 categorical predictors, 157
MCAR. See Missing completely at random (MCAR) and classification problems, 158
assumption model, 156–157
Mean absolute error (MAE), 221–222, 759 selection and pruning, 158
Mean absolute percentage error (MAPE), 759 multiple dependent (outcome) variables,
Mean squared error (MSE), 221, 759 157–158
Measurement bias, 732 as predictor (feature) selection method, 158
MECE. See Mutually exclusive and categorically Mutually exclusive and categorically exhaustive
exhaustive (MECE) (MECE), 171–172
Medical/business tutorial, 393–394, 394–401f, 397–402,
402t, 403–417f, 405–410, 407–408t, 416t, 418t, N
419–422, 419–422f Naïve Bayesian classifiers, 182–184
Medicare, 393 Natural language processing (NLP), 775
Merchant fraud, 294 Nautical Almanac Office, Newcomb, 293
Metabolic syndrome analysis, 255–256 NCLEX examination, 335–336
Metamodeling technique, 227–228 case study, 337
Microeconomic approach, 24 dataset and expected strength of predictors,
MidWest Company Personality Data, 359, 360–376f, 338–339, 338t
362, 367, 370–371 decision management, 336
Missing at random (MAR) assumption, 65 literature review, 337–338
Missing completely at random (MCAR) assumption, 65 research question, 337

Nearest-neighbor method, 178–179 Piecewise linear regression, 201


Neural networks, 125–129, 127–130f, 133–136, 181–182, Plato, 12–13, 13f
746–750 PMML. See Predictive modeling markup language
architecture, 129f (PMML)
automated, 134f, 209 PNN. See Probabilistic neural networks (PNN)
with back propagation, 129, 130f Poisson regression, 200
Gray Boxes, 209 Pooling, 747
Kohonen, 133 Population health, predictive analytics and, 246–253
manual/automated operation, 208 Positive semidefinite matrix, 720
network structuring, 208–209 Precision medicine, predictive analytics and, 253–256
for numerical prediction, 208–209 Prediction. See also Numerical prediction
training, 132, 132f data mining, 104
NLEs. See Nonlinear events (NLEs) implications for, 11
Nonlinear estimation techniques, 199–200 Predictive analytics (PA), 4–6
logit regression, 200 applications, 17
piecewise linear regression, 201 data, 241
Poisson regression, 200 development in, current trends of, 18–19
probit regression, 200 fits, 235–246
Nonlinear events (NLEs), 15–16, 282 in healthcare, 244–245
Nonlinear regression, 198–201, 199f consumer engagement, 248–250
Nonlinear relationships analyzing methods, 198 consumer segmentation, 250
Nonnormality, 191 micro-segmentation pilot, 250–253
Normality, 191 history, 6
assumption, 190–191 impact, business objectives, 628
Normative-deontological perspective, 768 and population health, 246–253
Numerical prediction, 201–205 and precision medicine, 253–256
with CART, 202–204, 204t SAP, 17–18
neural nets for, 208–209 science of, 39
Predictive model, 29
O for client defection
Occam’s razor, 705–706, 710 business objectives, 627–628
KNIME, 377–378 creating new work space, 642–644, 642f
Open Database Connectivity (ODBC) driver, 113 data mining recipes, 646–647
Open Midwest Data with Statistica, 305, 306–317f data preparation, 630–640
Operations research (OR), 150 feature selection, 642–645
Optimal binning, 105f model evaluation, 648–650
Outlier handling, 515–528, 515–528f procedure selection dialog screen, 643f
Over-training, 746 rapid deployment of, 115–117, 116f
Predictive modeling markup language (PMML),
P 25, 115
PA. See Predictive analytics (PA) Predictor, different sets of, 231
Parametric model Prescriptive data analytics, 241
assumption, 8–9, 188–192 Principal components analysis (PCA), 151
independency, 189–190 data reduction, 74–75
linearity, 191–192 Probabilistic neural networks (PNN), 133
normality, 190–191 Probit
Parametric predictive system, 8–9 model, 13
Parametric statistical analysis, 188–189, 272 regression, 200
Pareto’s principle, 723–724 Problem solving, 40
Partial least squares regression, 87 Property fraud, 294
PCA. See Principal components analysis (PCA) Psychographic data, 291
Pearson, Karl, 7 P-value
Percent correct classification (PCC), 759 approach, 243–244
Physical data mart, 26 statistical analysis, 753–754
bootstrap sampling tests, 763–764 Sequence analysis, 162
performance measures, 759–760 applications, 164
predictive analytics, 760–761 Serial autocorrelation, 129
problem of significance, 754–759 Service level agreement (SLA), 729
software packages, 760 Single-split binary trees, 709–710
target shuffling, 762 Situational-teleological perspective, 768–769
Skytree machine-learning software, 18
Q Slicing/dicing, 104–107
Qualitative data abstraction, 73, 283 SOFM. See Self-organizing feature map (SOFM)
Quality control data mining, 149, 166 Sparsely connected neural network (SCNN), 748–749
Question and answer machine (QAM), 773–774, 780 SPSS
Clementine, 285
modeler
R bagging, 339–342, 344f
Radial basis function (RBF) networks, 48–49, 134–136, 211 boosting, 339–342, 343f
RAE. See Relative absolute error (RAE) interpreting model output, 345–348
Random error modeling workflow, 339–342
assessment of, 221–224 selection procedure and evaluation, 344–345
evaluation of models, 221 SQL. See Structured query language (SQL)
RapidMiner, 89, 89f Standardization of data, 734
RBF networks. See Radial basis function (RBF) Star-schema database structure, 26, 26f
networks State-Trait Anxiety Scale, 651
Receiver operating curve (ROC), 218–219, 219f, 760 Statistica
Recoding of data, 471–496, 471–496f, 734 data mining
Recurrent neural networks (RNNs), 749–750 recipe, 348–352
Regularization, 746 workspace method, 319, 319–333f, 321–333
Reinforced learning, 18 model output and evaluation, 352–353
Relational database management systems (RDBMS), rules creation, 353–354
25–27 Statistica data miner (SDM), 89–90, 90f, 151
Relative absolute error (RAE), 222 cross-tabulation and feature selection nodes, 645f
Relative squared error (RSE), 222 data mining recipe, 646–647
Representational learning (RL), 747 data set to, 630–639
Response surface, linear regression, 193–194, 195f data source node ready for further operations, 644f
Retrieval by content, data mining activity, 29 file
Return on investment (ROI), 295 import configuration screen, 631f
RMSE. See Root mean squared error (RMSE) selection screen, 630f
ROC. See Receiver operating curve (ROC) recipe, 122–124, 123f
Root cause analysis, 149, 166 three cross-tabulation nodes, 644, 645f
Root mean squared error (RMSE), 221 Statistical learning theory, 16–18, 159–161
RSE. See Relative squared error (RSE) Statistical modeling, 22
Rule induction, 176–177, 177t Statistics
history, 6
S modern, 6–11
Sampling error, 225 node, 517
assessment technique Stepwise linear regression, 86–87
bootstrap method, 225–226 Structured query language (SQL), 31–32, 100, 100f, 115
jackknife, 225 Stumps, 709–710
K-fold cross validation, 226–227 Subset selection methods, 88–96, 89–93f
SAP predictive analytics, 17–18 Supervised classification, 169
Science of data mining/predictive analytics, 39 of fraud, 290, 292
Scientific method, 21–22 Support vector machine (SVM), 159–161, 211–212, 648
Secular ethics, 768–769 Surrogate variables, 231
Self-organizing feature map (SOFM), 166 Suspicion scores, 290
Sensitivity analysis, 87–88, 131 SVM. See Support vector machine (SVM)

Systematic error assessment, 224–225 KNIME, 377–391, 423, 426–429, 432–434, 438
inadequate experimental design, 224–225 medical/business tutorial, 393–394, 394–401f,
sampling errors, 225 397–402, 402t, 403–417f, 405–410, 407–408t, 416t,
418t, 419–422, 419–422f
T MidWest Company Personality Data, 359, 360–376f,
Target shuffling, 762 362, 367, 370–371
Tautology, 565, 569 Open Midwest Data with Statistica, 305,
Teaching/learning situation, 267 306–317f
Temporal data, 749 removal of outliers, 515–528, 515–528f
abstraction, 72, 72f, 283, 284f Statistica data mining workspace method, 319,
handling of, 735 319–333f, 321–333
Text mining, 606–613 text mining, 606–613
Time-delayed neural networks (TDNNs), 748
Time-series analyses, 287–288 U
TPOT. See Tree-based Pipeline Optimization Tool (TPOT) Understanding and problem solving, 40
Traditional data mining, 293 Unsupervised classification, 169
Traditional statistical analysis, 21 of fraud, 290, 292–293
in clinical medicine, 242–244 Up-sell modeling, 279, 284
Transformational waves, health-care system, 235, 236f
consumer retail revolution, 238, 239f
V
health system devolution, 239–240, 240f
Variables
provider value evolution, 236–237, 237f
description of, 628, 629t
Transportation, IBM Watson, 779–780
as features, 83–84
Tree-based Pipeline Optimization Tool (TPOT), 18
importance plots of, 110–113, 112f
Tutorials
interaction in linear regression, 193
boosted trees, 614–625
and MonthlyCharges, 634f
cleaning and recoding of data, 471–496, 471–496f
selection screen, 646f
client defection, predictive model (see Predictive
specification
model for client defection)
dialog screen, 631f, 641f
clinical psychology, 675, 676–691f, 678, 680, 682, 687,
editor dialog screen, 632f
692–693t, 694–698f, 699t, 700–701f, 701
and tenure, 634f
data description, 459–469, 459–469f
and TotalCharges, 635f
data health check, beta procedure, 652, 652–665f,
V-fold cross-validation, 146
654–655, 657–658, 661–663, 666, 666t, 667–673f,
Virtual data mart, 26–27
669, 671
VitalSource (CourseSmart), 267–268, 269f
data mining, 599–606
data sources, merging, 443–458, 443f, 445–458f
dummy coding category variables, 497–514, 497f, W
499–513f Weighted accuracy, 759
filling missing values with
constants, 529–538, 529–538f X
formulas, 539–564, 539–545f, 547–564f XML. See Extended markup language (XML)
model, 565–596, 566–596f Xpanse Analytics, 18
