784 INDEX
Big Data in education, 259–273
  data analytics, 274–275
  donor
    development, 265–266, 266f
    recruitment, 266
    retention, 266–267
  drivers for innovation, 261–263
  future scenarios, 262–263
  industry vendors, 263
  machine learning techniques, 272–273
  math and statistical analysis, 271–272
  student
    achievement, 267–268
    recruitment, 264–265
    retention, 265
Bionomics: Economy As Ecosystem (Rothschild), 721
Black box modeling, 706
Boosted tree model, 614–625, 648–650
Boosting technique, 228–229, 339–342, 343f, 706–708
Bootstrap
  method, 225–226
  sampling test, 763–764
Bumping, 710
Business
  customer relationship management issues in, 279–280
  ecosystem
    customer relationship management in, 281–285
    for data mining, 42–43
    transforming corporations into, 280–281
  network intrusion, 296–297
  objectives
    application, 628
    data file, 628
    of data mining model, 42
    description of variables, 628, 629t
    marketing impacts, 627–628
    performance, 628
    predictive analytics impact, 628
  organism
    complex system, 722–723
    decision-making activities in, 721–723
    muscles in, 721–722
  understanding, 42–44
    analytic project goals, 43–44

C
Capital One, 292
CAR. See Customer analytic record (CAR)
CART. See Classification and regression trees (CART)
Categorical variables
  dummy coding, 497–514, 497f, 499–513f
  frequencies of, 636t
Central limit theorem, 191
CHAID. See Chi-square automatic interaction detection (CHAID)
Charge-back fraud, 293
Check fraud, 294
Chi-square automatic interaction detection (CHAID), 144, 177–184
  data reduction, 74
Claim fraud, 294
Classical statistical sensitivity, 131
Classification, 169
  accuracy vs. generality, 171
  algorithm, 185–186
    advantages and disadvantages, 174–177
    CHAID, 177–184
    decision trees, 174–175
    logistic regression, 179–181, 180–181f
    Naïve Bayesian classifiers, 182–184
    nearest-neighbor classifiers, 178–179
    neural networks, 181–182
    phases in operation of, 172–174
    random forests, 175
    rule induction, 176–177, 177t
  assumptions, 171–172
  initial operations in, 169–170
  issues with, 170–171
Classification and regression trees (CART), 138–144, 175
  advantages, 205–207
  data mining tool packages, 204–205
  decision tree, 138–144, 139–141f
  issues, 206–207
  numerical prediction with, 202–204, 204t
  pruning trees, 142–144, 143f
  recursive partitioning, 139–142
Cleaning of data, 471–496, 471–496f
Clinical psychology, 675, 676–691f, 678, 680, 682, 687, 692–693t, 694–698f, 699t, 700–701f, 701
Closing the information loop, 52
Cloud computing, 19
Clustering, 169
Cognitive era, 780
Cognitivism, 270
Coincidence (confusion) matrix, 217
Collaborative Leader Profile, 651
Collinearity problem, 8–9
Column Selection tab, 452, 452f
Complex model, 228, 228f
Conditional probability, 6
Confusion matrix, 759
Constant
  filling missing values with, 529–538, 529–538f
  variance, 9
Constructivism, 270–271
Continuous data regressions, 759
Convolutional neural network (CNN), 747–750
Covariance inflation criterion (CCI), 710–711
Credit card fraud, 293, 295
CRISP-DM. See Cross industry standard process for data mining (CRISP-DM)
CRM. See Customer relationship management (CRM)
Cross industry standard process for data mining (CRISP-DM), 40–41, 41f, 727–728
  acceptance criteria, 729
  access data, 730
  analytical goals of project, 728
  business
    goals, 728
    stakeholders, 728
    understanding phase, 728–729
  characterization and description of data, 730–732
  data
    cleaning, 733
    preparation phase, 732–735
    reduction, 733–734
    understanding phase, 729–732
  derived variables, 735
  enhancing and enriching data, 730
  filtering, 734
  handling
    of outliers, 735
    of temporal data, 735
  missing value imputation, 734–735
  modeling
    activities, 48–51, 48f
    algorithms, 48
    architecture, 48
    assumptions, 48–49
    phase, 736
  predictive analytics projects, 736–739
  recoding, 734
  sampling regimes, 732
  service level agreement, 729
  standardization, 734
  target variable, 729
  timeline, 729
  working relationships, 728
Cross sell modeling, 279, 284
Cross tabulation, 637
  internet-dependent service categories, 638–639, 638t
  by model, 649t
  phone service vs. multiple lines, 637–638, 637t
  STATISTICA Data Miner workspace, 644, 645f
Customer
  centric database vs. account-centric database, 25–27
  response modeling, 282–283
  retention modeling, 279
Customer analytic record (CAR), 27, 730
  creation, 28
  decision-making activities in, 721–723
Customer relationship management (CRM), 26–27
  in business ecosystem, 281–285
  issues in business, 279–280
  model, 216, 727–728

D
Data
  abstractions, 70–73, 72f, 283
  access tools, 100–101
  acquisition, 57–58, 443–458, 443f, 445–458f
    high-level query languages, 58
    low-level and ODBC database connections, 58
    query-based data extracts, 57
  analysis
    DMRecipe process, 123
    exploratory, 28
  analytics in healthcare, 240
    descriptive analytics, 241–242, 241f
    predictive analytics, 241
    prescriptive analytics, 241
  assessment, 62
  cleaning, 62–63, 733
    and recoding, 471–496, 471–496f
  cluster in clustering problem, 145–146, 145f
  conditioning
    data set balancing, 80–81
    segmentation, 81
    standardization, 79–80
  DataRobot, 17
  derivation
    assignment/derivation of target variable, 78
    attribute-oriented induction of generalization variables, 79
    derivation of new predictor variables, 78–79
  description, 60–62, 459–469, 459–469f
  discretization, 77–78
  exploration tools, 102–107, 104t, 105–108f
  extraction, 58–60
  filtering, 69
    removal of outliers, 69, 515–528, 515–528f
    time-series filtering, 70
  health check, beta procedure, 652, 652–665f, 654–655, 657–658, 661–663, 666, 666t, 667–673f, 669, 671
  imputation, 65–69, 68t, 529
    filling missing values with constants, 529–538, 529–538f
    filling missing values with formulas, 539–564, 539–545f, 547–564f
    filling missing values with model, 565–596, 566–596f
    maximum likelihood imputation, 67
    missing at random assumption, 65
    missing completely at random assumption, 65
    multiple imputation, 67–69
    techniques for imputing data, 66–67
Systematic error assessment, 224–225
  inadequate experimental design, 224–225
  sampling errors, 225

T
Target shuffling, 762
Tautology, 565, 569
Teaching/learning situation, 267
Temporal data, 749
  abstraction, 72, 72f, 283, 284f
  handling of, 735
Text mining, 606–613
Time-delayed neural networks (TDNNs), 748
Time-series analyses, 287–288
TPOT. See Tree-based Pipeline Optimization Tool (TPOT)
Traditional data mining, 293
Traditional statistical analysis, 21
  in clinical medicine, 242–244
Transformational waves, health-care system, 235, 236f
  consumer retail revolution, 238, 239f
  health system devolution, 239–240, 240f
  provider value evolution, 236–237, 237f
Transportation, IBM Watson, 779–780
Tree-based Pipeline Optimization Tool (TPOT), 18
Tutorials
  boosted trees, 614–625
  cleaning and recoding of data, 471–496, 471–496f
  client defection, predictive model (see Predictive model for client defection)
  clinical psychology, 675, 676–691f, 678, 680, 682, 687, 692–693t, 694–698f, 699t, 700–701f, 701
  data description, 459–469, 459–469f
  data health check, beta procedure, 652, 652–665f, 654–655, 657–658, 661–663, 666, 666t, 667–673f, 669, 671
  data mining, 599–606
  data sources, merging, 443–458, 443f, 445–458f
  dummy coding category variables, 497–514, 497f, 499–513f
  filling missing values with
    constants, 529–538, 529–538f
    formulas, 539–564, 539–545f, 547–564f
    model, 565–596, 566–596f
  KNIME, 377–391, 423, 426–429, 432–434, 438
  medical/business tutorial, 393–394, 394–401f, 397–402, 402t, 403–417f, 405–410, 407–408t, 416t, 418t, 419–422, 419–422f
  MidWest Company Personality Data, 359, 360–376f, 362, 367, 370–371
  Open Midwest Data with Statistica, 305, 306–317f
  removal of outliers, 515–528, 515–528f
  Statistica data mining workspace method, 319, 319–333f, 321–333
  text mining, 606–613

U
Understanding and problem solving, 40
Unsupervised classification, 169
  of fraud, 290, 292–293
Up-sell modeling, 279, 284

V
Variables
  description of, 628, 629t
  as features, 83–84
  importance plots of, 110–113, 112f
  interaction in linear regression, 193
  and MonthlyCharges, 634f
  selection screen, 646f
  specification
    dialog screen, 631f, 641f
    editor dialog screen, 632f
  and tenure, 634f
  and TotalCharges, 635f
V-fold cross-validation, 146
Virtual data mart, 26–27
VitalSource (CourseSmart), 267–268, 269f

W
Weighted accuracy, 759

X
XML. See Extended markup language (XML)
Xpanse Analytics, 18