
MACHINE LEARNING
FOR BUSINESS ANALYTICS
Concepts, Techniques and Applications in RapidMiner

Galit Shmueli
National Tsing Hua University

Peter C. Bruce
statistics.com

Amit V. Deokar
University of Massachusetts Lowell

Nitin R. Patel
Cytel, Inc.
This edition first published 2023

© 2023 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Galit Shmueli, Peter C. Bruce, Amit V. Deokar, and Nitin R. Patel to be identified as the authors of this work has been asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products, visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty

In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data Applied for:

Hardback: 9781119828792

Cover Design: Wiley

Cover Image: © Eakarat Buanoi/Getty Images

Set in 11.5/14.5pt BemboStd by Straive, Chennai, India


To our families

Boaz and Noa


Liz, Lisa, and Allison
Aparna, Aditi, Anuja, Ajit, Aai, and Baba
Tehmi, Arjun, and in memory of Aneesh
Contents

Foreword by Ravi Bapna xxi


Preface to the RapidMiner Edition xxiii
Acknowledgments xxvii

PART I PRELIMINARIES
CHAPTER 1 Introduction 3

1.1 What Is Business Analytics? . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


1.2 What Is Machine Learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Machine Learning, AI, and Related Terms . . . . . . . . . . . . . . . . . . . . 5
Statistical Modeling vs. Machine Learning . . . . . . . . . . . . . . . . . . . . 6
1.4 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Why Are There So Many Different Methods? . . . . . . . . . . . . . . . . . . . 9
1.7 Terminology and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.8 Road Maps to This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Order of Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.9 Using RapidMiner Studio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Importing and Loading Data in RapidMiner . . . . . . . . . . . . . . . . . . . 16
RapidMiner Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

CHAPTER 2 Overview of the Machine Learning Process 19

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Core Ideas in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 20
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Association Rules and Recommendation Systems . . . . . . . . . . . . . . . . . 20
Predictive Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Data Reduction and Dimension Reduction . . . . . . . . . . . . . . . . . . . . 21
Data Exploration and Visualization . . . . . . . . . . . . . . . . . . . . . . . . 21
Supervised and Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . 22
2.3 The Steps in a Machine Learning Project . . . . . . . . . . . . . . . . . . . . . 23


2.4 Preliminary Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


Organization of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Sampling from a Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Oversampling Rare Events in Classification Tasks . . . . . . . . . . . . . . . . . 26
Preprocessing and Cleaning the Data . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Predictive Power and Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . 32
Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Creation and Use of Data Partitions . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Building a Predictive Model with RapidMiner . . . . . . . . . . . . . . . . . . . 37
Predicting Home Values in the West Roxbury Neighborhood . . . . . . . . . . . 39
Modeling Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Using RapidMiner for Machine Learning . . . . . . . . . . . . . . . . . . . . . 45
2.8 Automating Machine Learning Solutions . . . . . . . . . . . . . . . . . . . . . 47
Predicting Power Generator Failure . . . . . . . . . . . . . . . . . . . . . . . . 48
Uber’s Michelangelo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.9 Ethical Practice in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 52
Machine Learning Software Tools: The State of the Market by Herb Edelstein . . . 53
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

PART II DATA EXPLORATION AND DIMENSION REDUCTION


CHAPTER 3 Data Visualization 63

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Data Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Example 1: Boston Housing Data . . . . . . . . . . . . . . . . . . . . . . . . 65
Example 2: Ridership on Amtrak Trains . . . . . . . . . . . . . . . . . . . . . . 66
3.3 Basic Charts: Bar Charts, Line Charts, and Scatter Plots . . . . . . . . . . . . . 66
Distribution Plots: Boxplots and Histograms . . . . . . . . . . . . . . . . . . . 69
Heatmaps: Visualizing Correlations and Missing Values . . . . . . . . . . . . . . 72
3.4 Multidimensional Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Adding Attributes: Color, Size, Shape, Multiple Panels, and Animation . . . . . . 75
Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, and Filtering . . 78
Reference: Trend Lines and Labels . . . . . . . . . . . . . . . . . . . . . . . . 81
Scaling Up to Large Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Multivariate Plot: Parallel Coordinates Plot . . . . . . . . . . . . . . . . . . . . 83
Interactive Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5 Specialized Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Visualizing Networked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Visualizing Hierarchical Data: Treemaps . . . . . . . . . . . . . . . . . . . . . 89
Visualizing Geographical Data: Map Charts . . . . . . . . . . . . . . . . . . . . 90
3.6 Summary: Major Visualizations and Operations, by Machine Learning Goal . . . . 92
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Time Series Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

CHAPTER 4 Dimension Reduction 97

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.2 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Example 1: House Prices in Boston . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 Data Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Aggregation and Pivot Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.6 Reducing the Number of Categories in Categorical Attributes . . . . . . . . . . . 105
4.7 Converting a Categorical Attribute to a Numerical Attribute . . . . . . . . . . . 107
4.8 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Example 2: Breakfast Cereals . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Normalizing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Using Principal Components for Classification and Prediction . . . . . . . . . . . 117
4.9 Dimension Reduction Using Regression Models . . . . . . . . . . . . . . . . . . 117
4.10 Dimension Reduction Using Classification and Regression Trees . . . . . . . . . . 119
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

PART III PERFORMANCE EVALUATION


CHAPTER 5 Evaluating Predictive Performance 125

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125


5.2 Evaluating Predictive Performance . . . . . . . . . . . . . . . . . . . . . . . . 126
Naive Benchmark: The Average . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Prediction Accuracy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Comparing Training and Holdout Performance . . . . . . . . . . . . . . . . . . 130
Lift Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.3 Judging Classifier Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Benchmark: The Naive Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Class Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
The Confusion (Classification) Matrix . . . . . . . . . . . . . . . . . . . . . . . 133
Using the Holdout Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Accuracy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Propensities and Threshold for Classification . . . . . . . . . . . . . . . . . . . 136
Performance in Case of Unequal Importance of Classes . . . . . . . . . . . . . . 139
Asymmetric Misclassification Costs . . . . . . . . . . . . . . . . . . . . . . . . 143
Generalization to More Than Two Classes . . . . . . . . . . . . . . . . . . . . . 146
5.4 Judging Ranking Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Lift Charts for Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Decile Lift Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Beyond Two Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Lift Charts Incorporating Costs and Benefits . . . . . . . . . . . . . . . . . . . 150
Lift as a Function of Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . 150

5.5 Oversampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151


Creating an Over-sampled Training Set . . . . . . . . . . . . . . . . . . . . . . 154
Evaluating Model Performance Using a Non-oversampled Holdout Set . . . . . . . 155
Evaluating Model Performance if Only Oversampled Holdout Set Exists . . . . . . 155
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

PART IV PREDICTION AND CLASSIFICATION METHODS


CHAPTER 6 Multiple Linear Regression 163

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163


6.2 Explanatory vs. Predictive Modeling . . . . . . . . . . . . . . . . . . . . . . . 164
6.3 Estimating the Regression Equation and Prediction . . . . . . . . . . . . . . . . 166
Example: Predicting the Price of Used Toyota Corolla Cars . . . . . . . . . . . . 167
6.4 Variable Selection in Linear Regression . . . . . . . . . . . . . . . . . . . . . 171
Reducing the Number of Predictors . . . . . . . . . . . . . . . . . . . . . . . 171
How to Reduce the Number of Predictors . . . . . . . . . . . . . . . . . . . . . 174
Regularization (Shrinkage Models) . . . . . . . . . . . . . . . . . . . . . . . . 180
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

CHAPTER 7 k-Nearest Neighbors (k-NN) 189

7.1 The k-NN Classifier (Categorical Label) . . . . . . . . . . . . . . . . . . . . . 189


Determining Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Classification Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Example: Riding Mowers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Choosing Parameter k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Setting the Threshold Value . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Weighted k-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
k-NN with More Than Two Classes . . . . . . . . . . . . . . . . . . . . . . . . 199
Working with Categorical Attributes . . . . . . . . . . . . . . . . . . . . . . . 199
7.2 k-NN for a Numerical Label . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
7.3 Advantages and Shortcomings of k-NN Algorithms . . . . . . . . . . . . . . . . 202
Appendix: Computing Distances Between Records in RapidMiner . . . . . . . . . . . . 203
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

CHAPTER 8 The Naive Bayes Classifier 209

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209


Threshold Probability Method . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Example 1: Predicting Fraudulent Financial Reporting . . . . . . . . . . . . . . 210
8.2 Applying the Full (Exact) Bayesian Classifier . . . . . . . . . . . . . . . . . . . 211
Using the “Assign to the Most Probable Class” Method . . . . . . . . . . . . . . 212
Using the Threshold Probability Method . . . . . . . . . . . . . . . . . . . . . 212
Practical Difficulty with the Complete (Exact) Bayes Procedure . . . . . . . . . . 212

8.3 Solution: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213


The Naive Bayes Assumption of Conditional Independence . . . . . . . . . . . . 214
Using the Threshold Probability Method . . . . . . . . . . . . . . . . . . . . . 214
Example 2: Predicting Fraudulent Financial Reports, Two Predictors . . . . . . . 215
Example 3: Predicting Delayed Flights . . . . . . . . . . . . . . . . . . . . . . 216
Working with Continuous Attributes . . . . . . . . . . . . . . . . . . . . . . . 222
8.4 Advantages and Shortcomings of the Naive Bayes Classifier . . . . . . . . . . . 224
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

CHAPTER 9 Classification and Regression Trees 229

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229


Tree Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Decision Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Classifying a New Record . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
9.2 Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Recursive Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Example 1: Riding Mowers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Measures of Impurity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.3 Evaluating the Performance of a Classification Tree . . . . . . . . . . . . . . . . 240
Example 2: Acceptance of Personal Loan . . . . . . . . . . . . . . . . . . . . . 240
Sensitivity Analysis Using Cross Validation . . . . . . . . . . . . . . . . . . . . 243
9.4 Avoiding Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Stopping Tree Growth: Grid Search for Parameter Tuning . . . . . . . . . . . . . 247
Stopping Tree Growth: CHAID . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Pruning the Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.5 Classification Rules from Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 255
9.6 Classification Trees for More Than Two Classes . . . . . . . . . . . . . . . . . . 256
9.7 Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Measuring Impurity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Evaluating Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
9.8 Improving Prediction: Random Forests and Boosted Trees . . . . . . . . . . . . 259
Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Boosted Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
9.9 Advantages and Weaknesses of a Tree . . . . . . . . . . . . . . . . . . . . . . 261
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

CHAPTER 10 Logistic Regression 269

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269


10.2 The Logistic Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.3 Example: Acceptance of Personal Loan . . . . . . . . . . . . . . . . . . . . . . 272
Model with a Single Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Estimating the Logistic Model from Data: Computing Parameter Estimates . . . . 275
Interpreting Results in Terms of Odds (for a Profiling Goal) . . . . . . . . . . . . 278

Evaluating Classification Performance . . . . . . . . . . . . . . . . . . . . . . 280


Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
10.4 Logistic Regression for Multi-class Classification . . . . . . . . . . . . . . . . . 283
Example: Accidents Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
10.5 Example of Complete Analysis: Predicting Delayed Flights . . . . . . . . . . . . 286
Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Model Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Appendix: Logistic Regression for Ordinal Classes . . . . . . . . . . . . . . . . . . . . 299
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

CHAPTER 11 Neural Networks 305

11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306


11.2 Concept and Structure of a Neural Network . . . . . . . . . . . . . . . . . . . . 306
11.3 Fitting a Network to Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Example 1: Tiny Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
Computing Output of Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Preprocessing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Training the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Example 2: Classifying Accident Severity . . . . . . . . . . . . . . . . . . . . . 316
Avoiding Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Using the Output for Prediction and Classification . . . . . . . . . . . . . . . . 320
11.4 Required User Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
11.5 Exploring the Relationship Between Predictors and Target Attribute . . . . . . . 322
11.6 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Convolutional Neural Networks (CNNs) . . . . . . . . . . . . . . . . . . . . . . 324
Local Feature Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
A Hierarchy of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
The Learning Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Example: Classification of Fashion Images . . . . . . . . . . . . . . . . . . . . 327
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
11.7 Advantages and Weaknesses of Neural Networks . . . . . . . . . . . . . . . . . 334
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

CHAPTER 12 Discriminant Analysis 337

12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337


Example 1: Riding Mowers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Example 2: Personal Loan Acceptance . . . . . . . . . . . . . . . . . . . . . . 338
12.2 Distance of a Record from a Class . . . . . . . . . . . . . . . . . . . . . . . . 340
12.3 Fisher’s Linear Classification Functions . . . . . . . . . . . . . . . . . . . . . . 341
12.4 Classification Performance of Discriminant Analysis . . . . . . . . . . . . . . . 346

12.5 Prior Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348


12.6 Unequal Misclassification Costs . . . . . . . . . . . . . . . . . . . . . . . . . 348
12.7 Classifying More Than Two Classes . . . . . . . . . . . . . . . . . . . . . . . . 349
Example 3: Medical Dispatch to Accident Scenes . . . . . . . . . . . . . . . . . 349
12.8 Advantages and Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355

CHAPTER 13 Generating, Comparing, and Combining Multiple Models 359

13.1 Automated Machine Learning (AutoML) . . . . . . . . . . . . . . . . . . . . . 359
AutoML: Explore and Clean Data . . . . . . . . . . . . . . . . . . . . . . . . . 360
AutoML: Determine Machine Learning Task . . . . . . . . . . . . . . . . . . . . 360
AutoML: Choose Attributes and Machine Learning Methods . . . . . . . . . . . . 361
AutoML: Evaluate Model Performance . . . . . . . . . . . . . . . . . . . . . . 363
AutoML: Model Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Advantages and Weaknesses of Automated Machine Learning . . . . . . . . . . . 365
13.2 Explaining Model Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . 367
Explaining Model Predictions: LIME . . . . . . . . . . . . . . . . . . . . . . . 368
Counterfactual Explanations of Predictions: What-If Scenarios . . . . . . . . . . 369
13.3 Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Why Ensembles Can Improve Predictive Power . . . . . . . . . . . . . . . . . . 373
Simple Averaging or Voting . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
Bagging and Boosting in RapidMiner . . . . . . . . . . . . . . . . . . . . . . . 378
Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
Advantages and Weaknesses of Ensembles . . . . . . . . . . . . . . . . . . . . 381
13.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383

PART V INTERVENTION AND USER FEEDBACK


CHAPTER 14 Interventions: Experiments, Uplift Models, and Reinforcement Learning 387

14.1 A/B Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Example: Testing a New Feature in a Photo Sharing App . . . . . . . . . . . . . 389
The Statistical Test for Comparing Two Groups (T-Test) . . . . . . . . . . . . . . 389
Multiple Treatment Groups: A/B/n Tests . . . . . . . . . . . . . . . . . . . . . 392
Multiple A/B Tests and the Danger of Multiple Testing . . . . . . . . . . . . . . 392
14.2 Uplift (Persuasion) Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
Gathering the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
A Simple Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Modeling Individual Uplift . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Computing Uplift with RapidMiner . . . . . . . . . . . . . . . . . . . . . . . . 398
Using the Results of an Uplift Model . . . . . . . . . . . . . . . . . . . . . . . 398

14.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400


Explore-Exploit: Multi-Armed Bandits . . . . . . . . . . . . . . . . . . . . . . 400
Markov Decision Process (MDP) . . . . . . . . . . . . . . . . . . . . . . . . . 402
14.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406

PART VI MINING RELATIONSHIPS AMONG RECORDS


CHAPTER 15 Association Rules and Collaborative Filtering 409

15.1 Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409


Discovering Association Rules in Transaction Databases . . . . . . . . . . . . . 410
Example 1: Synthetic Data on Purchases of Phone Faceplates . . . . . . . . . . 410
Data Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Generating Candidate Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
The Apriori Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
FP-Growth Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
Selecting Strong Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
The Process of Rule Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 418
Interpreting the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
Rules and Chance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
Example 2: Rules for Similar Book Purchases . . . . . . . . . . . . . . . . . . . 424
15.2 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
Data Type and Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
Example 3: Netflix Prize Contest . . . . . . . . . . . . . . . . . . . . . . . . . 427
User-Based Collaborative Filtering: “People Like You” . . . . . . . . . . . . . . 428
Item-Based Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . 430
Evaluating Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
Example 4: Predicting Movie Ratings with MovieLens Data . . . . . . . . . . . . 432
Advantages and Weaknesses of Collaborative Filtering . . . . . . . . . . . . . . 434
Collaborative Filtering vs. Association Rules . . . . . . . . . . . . . . . . . . . 437
15.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440

CHAPTER 16 Cluster Analysis 445

16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445


Example: Public Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
16.2 Measuring Distance Between Two Records . . . . . . . . . . . . . . . . . . . . 449
Euclidean Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
Normalizing Numerical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
Other Distance Measures for Numerical Data . . . . . . . . . . . . . . . . . . . 451
Distance Measures for Categorical Data . . . . . . . . . . . . . . . . . . . . . . 454
Distance Measures for Mixed Data . . . . . . . . . . . . . . . . . . . . . . . . 454
16.3 Measuring Distance Between Two Clusters . . . . . . . . . . . . . . . . . . . . 455
Minimum Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455

Maximum Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455


Average Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
Centroid Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
16.4 Hierarchical (Agglomerative) Clustering . . . . . . . . . . . . . . . . . . . . . 457
Single Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
Complete Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
Average Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
Centroid Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
Ward’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
Dendrograms: Displaying Clustering Process and Results . . . . . . . . . . . . . 460
Validating Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
Limitations of Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . 464
16.5 Non-Hierarchical Clustering: The k-Means Algorithm . . . . . . . . . . . . . 466
Choosing the Number of Clusters (k) . . . . . . . . . . . . . . . . . . . . . . . 467
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473

PART VII FORECASTING TIME SERIES


CHAPTER 17 Handling Time Series 479

17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480


17.2 Descriptive vs. Predictive Modeling . . . . . . . . . . . . . . . . . . . . . . . 481
17.3 Popular Forecasting Methods in Business . . . . . . . . . . . . . . . . . . . . . 481
Combining Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
17.4 Time Series Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
Example: Ridership on Amtrak Trains . . . . . . . . . . . . . . . . . . . . . . . 483
17.5 Data Partitioning and Performance Evaluation . . . . . . . . . . . . . . . . . . 486
Benchmark Performance: Naive Forecasts . . . . . . . . . . . . . . . . . . . . . 488
Generating Future Forecasts . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493

CHAPTER 18 Regression-Based Forecasting 497

18.1 A Model with Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498


Linear Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
Exponential Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
Polynomial Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
18.2 A Model with Seasonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Additive vs. Multiplicative Seasonality . . . . . . . . . . . . . . . . . . . . . . 507
18.3 A Model with Trend and Seasonality . . . . . . . . . . . . . . . . . . . . . . . 508
18.4 Autocorrelation and ARIMA Models . . . . . . . . . . . . . . . . . . . . . . . . 509
Computing Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Improving Forecasts by Integrating Autocorrelation Information . . . . . . . . . 514
Evaluating Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521

CHAPTER 19 Smoothing and Deep Learning Methods for Forecasting 533

19.1 Smoothing Methods: Introduction . . . . . . . . . . . . . . . . . . . . . . . . 534
19.2 Moving Average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Centered Moving Average for Visualization . . . . . . . . . . . . . . . . . . . . 534
Trailing Moving Average for Forecasting . . . . . . . . . . . . . . . . . . . . . 536
Choosing Window Width (w) . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
19.3 Simple Exponential Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . 540
Choosing Smoothing Parameter α . . . . . . . . . . . . . . . . . . . . . . . . 542
Relation Between Moving Average and Simple Exponential Smoothing . . . . . . 543
19.4 Advanced Exponential Smoothing . . . . . . . . . . . . . . . . . . . . . . . . 545
Series with a Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Series with a Trend and Seasonality . . . . . . . . . . . . . . . . . . . . . . . 546
Series with Seasonality (No Trend) . . . . . . . . . . . . . . . . . . . . . . . . 547
19.5 Deep Learning for Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . 549
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553

PART VIII DATA ANALYTICS


CHAPTER 20 Social Network Analytics 563

20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563


20.2 Directed vs. Undirected Networks . . . . . . . . . . . . . . . . . . . . . . . . 564
20.3 Visualizing and Analyzing Networks . . . . . . . . . . . . . . . . . . . . . . . 567
Plot Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Edge List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
Adjacency Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Using Network Data in Classification and Prediction . . . . . . . . . . . . . . . 571
20.4 Social Data Metrics and Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . 571
Node-Level Centrality Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 572
Egocentric Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
Network Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
20.5 Using Network Metrics in Prediction and Classification . . . . . . . . . . . . . . 576
Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
20.6 Collecting Social Network Data with RapidMiner . . . . . . . . . . . . . . . . . 584
20.7 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . 584
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587

CHAPTER 21 Text Mining 589

21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589


21.2 The Tabular Representation of Text: Term–Document Matrix and "Bag-of-Words" . . 590
21.3 Bag-of-Words vs. Meaning Extraction at Document Level . . . . . . . . . . . . . 592
21.4 Preprocessing the Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593

Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
Text Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
Presence/Absence vs. Frequency (Occurrences) . . . . . . . . . . . . . . . . . . 597
Term Frequency–Inverse Document Frequency (TF-IDF) . . . . . . . . . . . . . . 598
From Terms to Concepts: Latent Semantic Indexing . . . . . . . . . . . . . . . . 600
Extracting Meaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
From Terms to High-Dimensional Word Vectors: Word2Vec . . . . . . . . . . . . 601
21.5 Implementing Machine Learning Methods . . . . . . . . . . . . . . . . . . . . 602
21.6 Example: Online Discussions on Autos and Electronics . . . . . . . . . . . . . . 602
Importing the Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
Data Preparation and Labeling the Records . . . . . . . . . . . . . . . . . . . . 603
Text Preprocessing in RapidMiner . . . . . . . . . . . . . . . . . . . . . . . . 605
Producing a Concept Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Fitting a Predictive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
21.7 Example: Sentiment Analysis of Movie Reviews . . . . . . . . . . . . . . . . . . 607
Data Loading, Preparation, and Partitioning . . . . . . . . . . . . . . . . . . . 607
Generating and Applying Word2vec Model . . . . . . . . . . . . . . . . . . . . 609
Fitting a Predictive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
Using a Pretrained Word2vec Model . . . . . . . . . . . . . . . . . . . . . . . 611
21.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615

CHAPTER 22 Responsible Data Science 617

22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617


Example: Predicting Recidivism . . . . . . . . . . . . . . . . . . . . . . . . . 618
22.2 Unintentional Harm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
22.3 Legal Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
The General Data Protection Regulation (GDPR) . . . . . . . . . . . . . . . . . 620
Protected Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
22.4 Principles of Responsible Data Science . . . . . . . . . . . . . . . . . . . . . . 621
Non-maleficence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
Transparency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Accountability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
Data Privacy and Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
22.5 A Responsible Data Science Framework . . . . . . . . . . . . . . . . . . . . . . 624
Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
Auditing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
22.6 Documentation Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
Impact Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
Model Cards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629

Datasheets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
Audit Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
22.7 Example: Applying the RDS Framework to the COMPAS Example . . . . . . . . . . 631
Unanticipated Uses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Ethical Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Protected Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Data Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Fitting the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Auditing the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
Bias Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
22.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643

PART IX CASES
CHAPTER 23 Cases 647

23.1 Charles Book Club . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647


The Book Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
Database Marketing at Charles . . . . . . . . . . . . . . . . . . . . . . . . . . 648
Machine Learning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 650
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
23.2 German Credit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
23.3 Tayko Software Cataloger . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
The Mailing Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
23.4 Political Persuasion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
Predictive Analytics Arrives in US Politics . . . . . . . . . . . . . . . . . . . . 663
Political Targeting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
Uplift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
23.5 Taxi Cancellations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
Business Situation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
23.6 Segmenting Consumers of Bath Soap . . . . . . . . . . . . . . . . . . . . . . . 669
Business Situation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
Key Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
Measuring Brand Loyalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670

23.7 Direct-Mail Fundraising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673


Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
23.8 Catalog Cross-Selling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
23.9 Time Series Case: Forecasting Public Transportation Demand . . . . . . . . . . . 678
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
Available Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
Assignment Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
Tips and Suggested Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
23.10 Loan Approval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
Regulatory Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681

References 683

Data Files Used in the Book 687


Index 689
Foreword by Ravi Bapna

Converting data into an asset is the new business imperative facing modern managers. Each day the gap between what analytics capabilities make possible and companies' absorptive capacity for creating value from such capabilities increases. In many ways, data is the new gold—and mining this gold to create business value in today's context of a highly networked and digital society requires a skillset that we haven't traditionally delivered in business or statistics or engineering programs on their own. For those businesses and organizations that feel overwhelmed by today's big data, the phrase you ain't seen nothing yet comes to mind. Yesterday's three major sources of big data—the 20+ years of investment in enterprise systems (ERP, CRM, SCM, etc.), the 3 billion plus people on the online social grid, and the close to 5 billion people carrying increasingly sophisticated mobile devices—are going to be dwarfed by tomorrow's smarter physical ecosystems fueled by the Internet of Things (IoT) movement.

The idea that we can use sensors to connect physical objects such as homes, automobiles, roads, and even garbage bins and streetlights to digitally optimized systems of governance goes hand in glove with bigger data and the need for deeper analytical capabilities. We are not far away from a smart refrigerator sensing that you are short on, say, eggs, populating your grocery store's mobile app's shopping list, and arranging a Task Rabbit to do a grocery run for you. Or the refrigerator negotiating a deal with an Uber driver to deliver an evening meal to you. Nor are we far away from sensors embedded in roads and vehicles that can compute traffic congestion, track roadway wear and tear, record vehicle use, and factor these into dynamic usage-based pricing, insurance rates, and even taxation. This brave new world is going to be fueled by analytics and the ability to harness data for competitive advantage.

Business Analytics is an emerging discipline that is going to help us ride this new wave. This new Business Analytics discipline requires individuals who are grounded in the fundamentals of business such that they know the right questions to ask; who have the ability to harness, store, and optimally process vast datasets from a variety of structured and unstructured sources; and who can then use an array of techniques from machine learning and statistics to uncover new insights for decision-making. Such individuals are a rare commodity today, but their creation has been the focus of this book for a decade now. This book's forte is that it relies on explaining the core set of concepts required for today's business analytics professionals using real-world data-rich cases in a hands-on manner, without sacrificing academic rigor. It provides a modern-day foundation for Business Analytics, the notion of linking the x's to the y's of interest in a predictive sense. I say this with the confidence of someone who was probably the first adopter of the zeroth edition of this book (Spring 2006 at the Indian School of Business).

After the publication of the R and Python editions, the new RapidMiner edition is an important addition. RapidMiner is gaining in popularity among analytics professionals as it is a non-programming environment that lowers the barriers for managers to adopt analytics. The new edition also covers causal analytics as experimentation (often called A/B testing in the industry), which is now becoming mainstream in tech companies. Further, the authors have added a new chapter on Responsible Data Science, a new part on AutoML, and more on deep learning, beefing up the deep learning examples in the text mining and forecasting chapters. These updates make this new edition "state of the art" with respect to modern business analytics and AI.

I look forward to using the book in multiple fora: in executive education, in MBA classrooms, in MS-Business Analytics programs, and in Data Science bootcamps. I trust you will too!

RAVI BAPNA

Carlson School of Management, University of Minnesota, 2022


Preface to the RapidMiner Edition

This textbook first appeared in early 2007 and has been used by numerous students and practitioners and in many courses, including our own experience teaching this material both online and in person for more than 15 years. The first edition, based on the Excel add-in Analytic Solver Data Mining (previously XLMiner), was followed by two more Analytic Solver editions, a JMP edition, an R edition, a Python edition, and now this RapidMiner edition, with its companion website, www.dataminingbook.com.

This new RapidMiner edition relies on the open source machine learning platform RapidMiner Studio (generally called RapidMiner), which offers both free and commercially licensed versions. We present output from RapidMiner, as well as the processes used to produce that output. We show the specification of the appropriate operators from RapidMiner and some of its key extensions. Unlike computer science- or statistics-oriented textbooks, the focus in this book is on machine learning concepts and how to implement the associated algorithms in RapidMiner. We assume a familiarity with the fundamentals of data analysis and statistics. Basic knowledge of Python can be helpful for a few chapters.

For this RapidMiner edition, a new co-author, Amit Deokar, comes on board, bringing both expertise in teaching business analytics courses using RapidMiner and extensive data science experience from working with businesses on research and consulting projects. In addition to providing RapidMiner guidance, and RapidMiner process and output screenshots, this edition also incorporates updates and new material based on feedback from instructors teaching MBA, MS, undergraduate, diploma, and executive courses, and from their students.

Importantly, this edition includes several new topics:

• A dedicated section on deep learning in Chapter 11, with additional deep learning examples in text mining (Chapter 21) and time series forecasting (Chapter 19).

• A new chapter on Responsible Data Science (Chapter 22), covering fairness, transparency, model cards and datasheets, legal considerations, and more, with an illustrative example.

• An expanded Performance Evaluation exposition in Chapter 5, now including further metrics (precision and recall, F1, AUC) and the SMOTE oversampling method.

• A new chapter on Generating, Comparing, and Combining Multiple Models (Chapter 13) that covers AutoML, explaining model predictions, and ensembles.

• A new chapter dedicated to Interventions and User Feedback (Chapter 14) that covers A/B tests, uplift modeling, and reinforcement learning.

• A new case (Loan Approval) that touches on regulatory and ethical issues.

A note about the book's title: The first two editions of the book used the title Data Mining for Business Intelligence. Business intelligence today refers mainly to reporting and data visualization ("what is happening now"), while business analytics has taken over the "advanced analytics," which include predictive analytics and data mining. Later editions were therefore renamed Data Mining for Business Analytics. However, the recent AI transformation has made the term machine learning more popularly associated with the methods in this textbook. In this new edition, we therefore use the updated terms Machine Learning and Business Analytics.

Since the appearance of the (Analytic Solver-based) second edition, the landscape of the courses using the textbook has greatly expanded: whereas initially the book was used mainly in semester-long elective MBA-level courses, it is now used in a variety of courses in business analytics degree and certificate programs, ranging from undergraduate programs to postgraduate and executive education programs. Courses in such programs also vary in their duration and coverage. In many cases, this textbook is used across multiple courses. The book is designed to continue supporting the general "predictive analytics," "data mining," or "machine learning" course as well as a set of courses in dedicated business analytics programs.

A general "business analytics," "predictive analytics," or "machine learning" course, common in MBA and undergraduate programs as a one-semester elective, would cover Parts I–III and choose a subset of methods from Parts IV and V. Instructors can choose to use cases as team assignments, class discussions, or projects. For a two-semester course, Part VII might be considered, and we recommend introducing Part VIII (Data Analytics).

For a set of courses in a dedicated business analytics program, here are a few courses that have been using our book:

Predictive Analytics—Supervised Learning: In a dedicated business analytics program, the topic of predictive analytics is typically instructed across a set of courses. The first course would cover Parts I–III, and instructors typically choose a subset of methods from Part IV according to the course length. We recommend including Part VIII: Data Analytics.

Predictive Analytics—Unsupervised Learning: This course introduces data exploration and visualization, dimension reduction, mining relationships, and clustering (Parts II and VI). If this course follows the Predictive Analytics—Supervised Learning course, then it is useful to examine examples and approaches that integrate unsupervised and supervised learning, such as Part VIII on Data Analytics.

Forecasting analytics: A dedicated course on time series forecasting would rely on Part VII.

Advanced analytics: A course that integrates the learnings from predictive ana-

lytics (supervised and unsupervised learning) can focus on Part VIII: Data

Analytics, where social network analytics and text mining are introduced,

and responsible data science is discussed. Such a course might also include

Chapter 13, Generating, Comparing, and Combining Multiple Models and

AutoML from Part IV, as well as Part V, which covers experiments, uplift

modeling, and reinforcement learning. Some instructors choose to use the

cases (Chapter 23) in such a course.

In all courses, we strongly recommend including a project component,

where data are either collected by students according to their interest or provided

by the instructor (e.g., from the many machine learning competition datasets

available). From our experience and other instructors’ experience, such projects

enhance the learning and provide students with an excellent opportunity to

understand the strengths of machine learning and the challenges that arise in the

process.

GALIT SHMUELI, PETER C. BRUCE, AMIT V. DEOKAR, AND NITIN R. PATEL

2022
Acknowledgments

We thank the many people who assisted us in improving the book from its inception as Data Mining for Business Intelligence in 2006 (using XLMiner, now Analytic Solver), its reincarnation as Data Mining for Business Analytics, and

now Machine Learning for Business Analytics, including translations in Chinese and

Korean and versions supporting Analytic Solver Data Mining, R, Python, SAS

JMP, and now RapidMiner.

Anthony Babinec, who has been using earlier editions of this book for years

in his data mining courses at Statistics.com, provided us with detailed and expert

corrections. Dan Toy and John Elder IV greeted our project with early enthu-

siasm and provided detailed and useful comments on initial drafts. Ravi Bapna,

who used an early draft in a data mining course at the Indian School of Busi-

ness and later at University of Minnesota, has provided invaluable comments and

helpful suggestions since the book’s start.

Many of the instructors, teaching assistants, and students using earlier edi-

tions of the book have contributed invaluable feedback both directly and indi-

rectly, through fruitful discussions, learning journeys, and interesting machine

learning projects that have helped shape and improve the book. These include

MBA students from the University of Maryland, MIT, the Indian School of

Business, National Tsing Hua University, University of Massachusetts Lowell,

and Statistics.com. Instructors from many universities and teaching programs,

too numerous to list, have supported and helped improve the book since its

inception. Scott Nestler has been a helpful friend of this book project from the

beginning.

Kuber Deokar, instructional operations supervisor at Statistics.com, has been

unstinting in his assistance, support, and detailed attention. We also thank Anuja

Kulkarni, assistant teacher at Statistics.com. Valerie Troiano has shepherded

many instructors and students through the Statistics.com courses that have helped

nurture the development of these books.

Colleagues and family members have been providing ongoing feedback and

assistance with this book project. Vijay Kamble at UIC and Travis Greene

at NTHU have provided valuable help with the section on reinforcement

learning. Boaz Shmueli and Raquelle Azran gave detailed editorial comments


and suggestions on the first two editions; Bruce McCullough and Adam Hughes

did the same for the first edition. Noa Shmueli provided careful proofs of the

third edition. Ran Shenberger offered design tips. Che Lin and Boaz Shmueli

provided feedback on deep learning. Ken Strasma, founder of the microtarget-

ing firm HaystaqDNA and director of targeting for the 2004 Kerry campaign

and the 2008 Obama campaign, provided the scenario and data for the section

on uplift modeling. We also thank Jen Golbeck, director of the Social Intelli-

gence Lab at the University of Maryland and author of Analyzing the Social Web,
whose book inspired our presentation in the chapter on social network analytics.

Inbal Yahav and Peter Gedeck, co-authors of the R and Python editions, helped

improve the social network analytics and text mining chapters. Randall Pruim

contributed extensively to the chapter on visualization.

Marietta Tretter at Texas A&M shared comments and thoughts on the time

series chapters, and Stephen Few and Ben Shneiderman provided feedback and

suggestions on the data visualization chapter and overall design tips.

Susan Palocsay and Mia Stephens have provided suggestions and feedback

on numerous occasions, as have Margret Bjarnadottir and Mohammad Salehan.

We also thank Catherine Plaisant at the University of Maryland’s Human–

Computer Interaction Lab, who helped out in a major way by contributing

exercises and illustrations to the data visualization chapter. Gregory Piatetsky-

Shapiro, founder of KDNuggets.com, has been generous with his time and

counsel in the early years of this project.

We thank colleagues at the MIT Sloan School of Management for their sup-

port during the formative stage of this book—Dimitris Bertsimas, James Orlin,

Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teach-

ing assistants for the data mining course at Sloan, Adam Mersereau gave detailed

comments on the notes and cases that were the genesis of this book, Romy

Shioda helped with the preparation of several cases and exercises used here, and

Mahesh Kumar helped with the material on clustering.

Colleagues at the University of Maryland’s Smith School of Business:

Shrivardhan Lele, Wolfgang Jank, and Paul Zantek provided practical advice and

comments. We thank Robert Windle and University of Maryland MBA stu-

dents Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable

datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the

heatmap and map charts.

We are grateful to colleagues at UMass Lowell’s Manning School of Business

for their encouragement and support in developing data analytics courses at the

undergraduate and graduate levels that led to the development of this edition:

Luvai Motiwalla, Harry Zhu, Thomas Sloan, Bob Li, and Sandra Richtermeyer.

We also thank Michael Goul (late), Dan Power (late), Ramesh Sharda, Babita

Gupta, Ashish Gupta, Uday Kulkarni, and Haya Ajjan from the Association for

Information System’s Decision Support and Analytics (SIGDSA) community for

ideas and advice that helped the development of the book.

Anand Bodapati provided both data and advice. Jake Hofman from Micro-

soft Research and Sharad Borle assisted with data access. Suresh Ankolekar and

Mayank Shah helped develop several cases and provided valuable pedagogical

comments. Vinni Bhandari helped write the Charles Book Club case.

We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at

Harvard, as well as Anil Gore at Pune University, for thought-provoking discus-

sions on the relationship between statistics and machine learning. Our thanks to

Richard Larson of the Engineering Systems Division, MIT, for sparking many

stimulating ideas on the role of machine learning in modeling complex systems.

Over two decades ago, they helped us develop a balanced philosophical perspec-

tive on the emerging field of machine learning.

Our thanks to Scott Genzer at RapidMiner, Inc., for insightful discussions

and help. We are appreciative of the vibrant online RapidMiner Community for

being an extremely helpful resource. We also thank Ingo Mierswa, the Founder

of RapidMiner, for his encouragement and support.

Lastly, we thank the folks at Wiley for the 15-year successful journey of

this book. Steve Quigley at Wiley showed confidence in this book from the

beginning and helped us navigate through the publishing process with great

speed. Curt Hinrichs’ vision, tips, and encouragement helped bring this book

to the starting gate. Brett Kurzman has taken over the reins and is now

shepherding the project with the help of Kavya Ramu and Sarah Lemore.

Sarah Keegan, Mindy Okura-Marszycki, Jon Gurstelle, Kathleen Santoloci, and

Katrina Maceda greatly assisted us in pushing ahead and finalizing earlier editions.

We are also especially grateful to Amy Hendrickson, who assisted with typeset-

ting and making this book beautiful.


Part I

Preliminaries
CHAPTER 1
Introduction

1.1 What Is Business Analytics?


Business analytics (BA) is the practice and art of bringing quantitative data to bear
on decision making. The term means different things to different organizations.

Consider the role of analytics in helping newspapers survive the transition

to a digital world. One tabloid newspaper with a working-class readership in

Britain had launched a web version of the paper and did tests on its home page

to determine which images produced more hits: cats, dogs, or monkeys. This

simple application, for this company, was considered analytics. By contrast, the

Washington Post has a highly influential audience that is of interest to big defense
contractors: it is perhaps the only newspaper where you routinely see adver-

tisements for aircraft carriers. In the digital environment, the Post can track

readers by time of day, location, and user subscription information. In this fash-

ion, the display of the aircraft carrier advertisement in the online paper may be

focused on a very small group of individuals---say, the members of the House

and Senate Armed Services Committees who will be voting on the Pentagon’s

budget.

Business analytics, or more generically, analytics, includes a range of data

analysis methods. Many powerful applications involve little more than counting,

rule checking, and basic arithmetic. For some organizations, this is what is meant

by analytics.

The next level of business analytics, now termed business intelligence (BI),

refers to data visualization and reporting for understanding “what happened and

what is happening.” This is done by use of charts, tables, and dashboards to

display, examine, and explore data. BI, which earlier consisted mainly of gener-

ating static reports, has evolved into more user-friendly and effective tools and


practices, such as creating interactive dashboards that allow the user not only

to access real-time data but also to directly interact with it. Effective dash-

boards are those that tie directly into company data and give managers a tool

to quickly see what might not readily be apparent in a large complex database.

One such tool for industrial operations managers displays customer orders in a

single two-dimensional display, using color and bubble size as added variables,

showing customer name, type of product, size of order, and length of time to

produce.

Business analytics now typically includes BI as well as sophisticated data anal-

ysis methods, such as statistical models and machine learning algorithms used for

exploring data, quantifying and explaining relationships between measurements,

and predicting new records. Methods like regression models are used to describe

and quantify “on average” relationships (e.g., between advertising and sales), to

predict new records (e.g., whether a new patient will react positively to a med-

ication), and to forecast future values (e.g., next week’s web traffic).

Readers familiar with earlier editions of this book might have noticed that

the book title changed from Data Mining for Business Intelligence to Data Mining
for Business Analytics and, finally, in this edition to Machine Learning for Business

Analytics. The change reflects the more recent term BA, which overtook the

earlier term BI to denote advanced analytics. Today, BI is used to refer to data

visualization and reporting. The change from data mining to machine learning
reflects today’s common use of machine learning to refer to algorithms that learn
from data. This book uses primarily the term machine learning .

WHO USES PREDICTIVE ANALYTICS?

The widespread adoption of predictive analytics, coupled with the accelerating avail-
ability of data, has increased organizations’ capabilities throughout the economy.
A few examples are as follows:

Credit scoring: One long-established use of predictive modeling techniques for


business prediction is credit scoring. A credit score is not some arbitrary judgment
of credit-worthiness; it is based mainly on a predictive model that uses prior data
to predict repayment behavior.
Future purchases: A controversial example is Target’s use of predictive modeling
to classify sales prospects as “pregnant” or “not-pregnant.” Those classified as
pregnant could then be sent sales promotions at an early stage of pregnancy,
giving Target a head start on a significant purchase stream.
Tax evasion: The US Internal Revenue Service found it was 25 times more likely
to find tax evasion when enforcement activity was based on predictive models,
allowing agents to focus on the most likely tax cheats (Siegel, 2013).

The business analytics toolkit also includes statistical experiments, the most

common of which is known to marketers as A/B testing. These are often used

for pricing decisions:

• Orbitz, the travel site, found that it could price hotel options higher for

Mac users than Windows users.

• Staples online store found it could charge more for staplers if a customer

lived far from a Staples store.
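Underlying such experiments is a simple comparison of outcome rates between randomly assigned groups. As a rough illustration (the counts below are invented), a two-proportion z-test in Python could look like this:

# A minimal A/B test sketch: did pricing variant B convert better than A?
from statsmodels.stats.proportion import proportions_ztest

conversions = [200, 240]   # purchases under variant A and variant B
visitors = [10000, 10000]  # visitors randomly assigned to each variant

# two-sample z-test for equality of the two conversion proportions
z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# a small p-value suggests the difference (2.0% vs. 2.4%) is unlikely
# to be due to chance alone

Chapter 14 discusses such experiments in depth.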

Beware the organizational setting where analytics is a solution in search of

a problem: a manager, knowing that business analytics and machine learning

are hot areas, decides that her organization must deploy them too, to capture

that hidden value that must be lurking somewhere. Successful use of analytics

and machine learning requires both an understanding of the business context

where value is to be captured and an understanding of exactly what the machine

learning methods do.

1.2 What Is Machine Learning?


In this book, machine learning (or data mining) refers to business analytics meth-

ods that go beyond counts, descriptive techniques, reporting, and methods based

on business rules. While we do introduce data visualization, which is com-

monly the first step into more advanced analytics, the book focuses mostly on

the more advanced data analytics tools. Specifically, it includes statistical and

machine learning methods that inform decision making, often in automated

fashion. Prediction is typically an important component, often at the individual

level. Rather than “what is the relationship between advertising and sales,” we

might be interested in “what specific advertisement, or recommended product,

should be shown to a given online shopper at this moment?” Or, we might be

interested in clustering customers into different “personas” that receive differ-

ent marketing treatment and then assigning each new prospect to one of these

personas.

The era of big data has accelerated the use of machine learning. Machine

learning algorithms, with their power and automaticity, have the ability to cope

with huge amounts of data and extract value.

1.3 Machine Learning, AI, and Related Terms


The field of analytics is growing rapidly, both in terms of the breadth of appli-

cations and in terms of the number of organizations using advanced analytics.

As a result, there is considerable overlap and inconsistency of definitions. Terms

have also changed over time.



The older term data mining means different things to different people. To

the general public, it may have a general, somewhat hazy and pejorative meaning

of digging through vast stores of (often personal) data in search of something

interesting. Data mining, as it refers to analytic techniques, has largely been


superseded by the term machine learning . Other terms that organizations use

are predictive analytics , predictive modeling , and most recently machine learning and

artificial intelligence (AI).


Many practitioners, particularly those from the IT and computer science

communities, use the term AI to refer to all the methods discussed in this book.
AI originally referred to the general capability of a machine to act like a human,

and, in its earlier days, existed mainly in the realm of science fiction and the unre-

alized ambitions of computer scientists. More recently, it has come to encompass

the methods of statistical and machine learning discussed in this book, as the pri-

mary enablers of that grand vision, and sometimes the term is used loosely to

mean the same thing as machine learning. More broadly, it includes generative

capabilities such as the creation of images, audio, and video.

Statistical Modeling vs. Machine Learning


A variety of techniques for exploring data and building models have been around

for a long time in the world of statistics: linear regression, logistic regression,

discriminant analysis, and principal components analysis, for example. How-

ever, the core tenets of classical statistics---computing is difficult and data are

scarce---do not apply in machine learning applications where both data and com-

puting power are plentiful.

This gives rise to Daryl Pregibon’s description of “data mining” (in the

sense of machine learning) as “statistics at scale and speed” (Pregibon, 1999).

Another major difference between the fields of statistics and machine learning

is the focus in statistics on inference from a sample to the population regard-

ing an “average effect”---for example, “a $1 price increase will reduce average

demand by 2 boxes.” In contrast, the focus in machine learning is on predict-

ing individual records---“the predicted demand for person i given a $1 price

increase is 1 box, while for person j it is 3 boxes.” The emphasis that classi-

cal statistics places on inference (determining whether a pattern or interesting

result might have happened by chance in our sample) is absent from machine

learning. Note also that the term inference is often used in the machine learn-

ing community to refer to the process of using a model to make predictions

for new data, also called scoring, in contrast to its meaning in the statistical

community.

In comparison with statistics, machine learning deals with large datasets in

an open-ended fashion, making it impossible to put the strict limits around

the question being addressed that classical statistical inference would require.

As a result, the general approach to machine learning is vulnerable to the danger

of overfitting, where a model is fit so closely to the available sample of data that it


describes not merely structural characteristics of the data but random peculiar-

ities as well. In engineering terms, the model is fitting the noise, not just the

signal.
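A small simulation makes this concrete. In the sketch below (the data are randomly generated, so this is illustrative only), a very flexible model fits the training sample, noise included, better than a simple model, but does worse on records held out from fitting:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 0.3, 60)  # linear signal plus noise

X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=1)

for degree in (1, 12):  # simple model vs. highly flexible model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(X_train)), 3),
          round(mean_squared_error(y_hold, model.predict(X_hold)), 3))

The degree-12 polynomial typically shows a lower training error but a higher holdout error than the simple line: it has fit the noise, not just the signal.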

In this book, we use the term machine learning algorithm to refer to methods
that learn directly from data, especially local patterns, often in layered or itera-

tive fashion. In contrast, we use statistical models to refer to methods that apply

global structure to the data that can be written as a simple mathematical equation.

A simple example is a linear regression model (statistical) vs. a k-nearest neighbors algorithm (machine learning). A given record would be treated by linear regression in accord with an overall linear equation that applies to all the records. In k-nearest neighbors, that record would be classified in accord with the values of a small number of nearby records.
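For readers who like to see this contrast in code, here is a minimal sketch (the income and spending values are invented): the regression applies one global equation to a new record, while k-nearest neighbors consults only the records closest to it:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[30], [40], [50], [60], [70], [80]])  # income ($000s)
y = np.array([18, 25, 29, 38, 41, 52])              # spending ($00s)
new_record = [[55]]

# linear regression: one equation, estimated from ALL records,
# is applied to the new record
lm = LinearRegression().fit(X, y)
print("linear regression:", lm.predict(new_record))

# k-nearest neighbors: the prediction is simply the average of the
# k = 3 records nearest to the new record
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print("3-nearest neighbors:", knn.predict(new_record))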

1.4 Big Data


Machine learning and big data go hand in hand. Big data is a relative term---data
today are big by reference to the past and to the methods and devices available

to deal with them. The challenge big data presents is often characterized by the

four V’s---volume, velocity, variety, and veracity. Volume refers to the amount of
data. Velocity refers to the flow rate---the speed at which it is being generated

and changed. Variety refers to the different types of data being generated (time

stamps, location, numbers, text, images, etc.). Veracity refers to the fact that

data is being generated by organic distributed processes (e.g., millions of people

signing up for services or free downloads) and not subject to the controls or

quality checks that apply to data collected for a study.

Most large organizations face both the challenge and the opportunity of

big data because most routine data processes now generate data that can be

stored and, possibly, analyzed. The scale can be visualized by comparing the

data in a traditional statistical analysis (e.g., 15 variables and 5000 records) to the

Walmart database. If you consider the traditional statistical study to be the size

of a period at the end of a sentence, then the Walmart database is the size of a

football field. Moreover, that probably does not include other data associated

with Walmart---social media data, for example, which comes in the form of

unstructured text.

If the analytical challenge is substantial, so can be the reward:

• OKCupid, the online dating site, uses statistical models with their data

to predict what forms of message content are most likely to produce a

response.

• Telenor, a Norwegian mobile phone service company, was able to reduce

subscriber turnover 37% by using models to predict which customers were

most likely to leave and then lavishing attention on them.

• Allstate, the insurance company, tripled the accuracy of predicting injury

liability in auto claims by incorporating more information about vehicle

type.

The examples above are from Eric Siegel’s book Predictive Analytics (2013, Wiley).
Some extremely valuable tasks were not even feasible before the era of big

data. Consider web searches, the technology on which Google was built. In early

days, a search for “Ricky Ricardo Little Red Riding Hood” would have yielded

various links to the I Love Lucy TV show, other links to Ricardo’s career as a band
leader, and links to the children’s story of Little Red Riding Hood. Only once

the Google database had accumulated sufficient data (including records of what

users clicked on) would the search yield, in the top position, links to the specific

I Love Lucy episode in which Ricky enacts, in a comic mixture of Spanish and

English, Little Red Riding Hood for his infant son.

1.5 Data Science


The ubiquity, size, value, and importance of big data has given rise to a new

profession: the data scientist. Data science is a mix of skills in the areas of statistics,
machine learning, math, programming, business, and IT. The term itself is thus

broader than the other concepts we discussed above, and it is a rare individual

who combines deep skills in all the constituent areas. In their book Analyzing
the Analyzers (Harris et al., 2013), the authors describe the skillsets of most data
scientists as resembling a “T”--deep in one area (the vertical bar of the T) and

shallower in other areas (the top of the T).

At a large data science conference session (Strata-Hadoop World, October

2014), most attendees felt that programming was an essential skill, though there

was a sizable minority who felt otherwise. Also, although big data is the motivat-

ing power behind the growth of data science, most data scientists do not actually

spend most of their time working with terabyte-size or larger data.

Data of the terabyte or larger size would be involved at the deployment stage

of a model. There are manifold challenges at that stage, most of them IT and

programming issues related to data handling and tying together different compo-

nents of a system. Much work must precede that phase. It is that earlier piloting

and prototyping phase on which this book focuses--developing the statistical and

machine learning models that will eventually be plugged into a deployed system.

What methods do you use with what sorts of data and problems? How do the
methods work? What are their requirements, their strengths, their weaknesses? How do you assess their performance?

1.6 Why Are There So Many Different Methods?

As can be seen in this book or any other resource on machine learning, there are many different methods for prediction and classification. You might ask yourself why they coexist and whether some are better than others. The answer is that each method has advantages and disadvantages. The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis. A small illustration is shown in Figure 1.1, where the goal is to find a combination of household income level and household lot size that separates owners (solid circles) from nonowners (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate owners from nonowners; the second method (right panel) looks for a single diagonal line.

FIGURE 1.1 TWO METHODS FOR SEPARATING OWNERS FROM NONOWNERS

Different methods can lead to different results, and their performance can vary. It is therefore customary in machine learning to apply several different methods (and perhaps their combination) and select the one that appears most useful for the goal at hand.
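To make the figure concrete, here is a hedged sketch of the two kinds of methods (the riding-mower data below are invented for illustration): a classification tree, which splits only along horizontal and vertical lines, and linear discriminant analysis, which draws a single straight boundary:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

# columns: household income ($000s), lot size (000s sq. ft.)
X = np.array([[60, 18], [85, 17], [65, 21], [110, 20],
              [50, 15], [75, 14], [55, 17], [95, 16]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = owner, 0 = nonowner

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)  # axis-parallel splits
lda = LinearDiscriminantAnalysis().fit(X, y)          # one straight boundary

new_household = [[70, 19]]
print("tree:", tree.predict(new_household), "lda:", lda.predict(new_household))

The two classifiers may well agree on this household, but their decision boundaries, and hence their errors on other data, differ.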

1.7 Terminology and Notation

Because of the hybrid parentage of data science, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning and artificial intelligence fields, the variable being predicted is the output variable or target variable. A categorical target variable is often called a label. To a statistician or social scientist, the variable being predicted is the dependent variable or the response. Here is a summary of terms used:

Algorithm A specific procedure used to implement a particular machine

learning technique: classification tree, discriminant analysis, and the like.

Attribute Any measurement on the records, including both the input (X) variables and the output (Y) variable.

Binary attribute A variable that takes on two discrete values (e.g., fraud and
non-fraud transactions); also called binominal attribute.
Case See Record.
Categorical attribute See Nominal attribute.
Class label See Label.
Confidence A performance measure in association rules of the type “IF A and
B are purchased, THEN C is also purchased.” Confidence is the conditional

probability that C will be purchased IF A and B are purchased.

Confidence In classification problems, it is used to denote the probability of

belonging to a certain class.

Confidence Also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.

Dependent variable See Target attribute.


Estimation See Prediction.
Example See Record.
Feature See Attribute. The term feature is also used in the context of selecting attributes (feature selection) or generating new attributes (feature generation) through some mechanism. More broadly, this process is called feature engineering. This usage is adopted in certain RapidMiner operator names.


Holdout data (or holdout set) The portion of the data used only at the end
of the model building and selection process to assess how well the final model

might perform on new data; also called test set.


Inference In statistics, the process of accounting for chance variation when

making estimates or drawing conclusions based on samples; in machine

learning, the term often refers to the process of using a model to make

predictions for new data (see Score).


Input variable See Attribute.
Label A nominal (categorical) attribute being predicted in supervised learning.
Model An algorithm as applied to a dataset, complete with its settings (many
of the algorithms have parameters that the user can adjust).

Nominal attribute A variable that takes on one of several fixed values, for

example, a flight could be on-time, delayed, or canceled; also called categorical


variable or polynominal attribute.
Numerical attribute A variable that takes on numeric (integer or real) values;
also called numerical variable.

Observation See Record .


Outcome variable See Target attribute.
Output variable See Target attribute.
P (A|B) The conditional probability of event A occurring given that event

B has occurred. Read as “the probability that A will occur given that B has

occurred.”

Positive class The class of interest in a binary target attribute (e.g., purchasers
in the target attribute purchase/no purchase); the positive class need not be

favorable.

Prediction The prediction of the numerical value of a continuous output vari-

able; also called estimation.


Predictor A variable, usually denoted by X, used as an input into a predictive model; also called a feature, input variable, independent variable, or, from a database perspective, a field.

Record The unit of analysis on which the measurements are taken (a cus-
tomer, a transaction, etc.); also called example, instance, case, observation, sample, pattern, or row. In spreadsheets, each row typically represents a record;


each column, a variable. Note that the use of the term “sample” here is

different from its usual meaning in statistics, where it refers to a collection

of observations.

Profile A set of measurements on an observation (e.g., the height, weight, and

age of a person).

Response See Target attribute.


Sample In the statistical community, “sample” means a collection of observa-

tions. In the machine learning community, “sample” means a single obser-

vation.

Score A predicted value or class. Scoring new data means using a model devel-
oped with training data to predict output values in new data.

Success class See Positive class.


Supervised learning The process of providing an algorithm (logistic regres-
sion, classification tree, etc.) with records in which an output variable of

interest is known and the algorithm “learns” how to predict this value for

new records where the output is unknown.



Target attribute A variable, usually denoted by Y , which is the variable being

predicted in supervised learning; also called dependent variable, output variable,


target variable, or outcome variable.
Test data (or test set) A sample of data not used in fitting a model, but instead
used to assess the performance of that model; also called holdout data. This

book uses the term holdout set instead and uses the term testing to refer to cer-

tain validation checks (e.g., in the Split-Validation and Cross-Validation nested

operators in RapidMiner) during the model-tuning phase.

Training data (or training set) The portion of the data used to fit a model.
Unsupervised learning An analysis in which one attempts to learn patterns

in the data other than predicting an output value of interest.

Validation data (or validation set) The portion of the data used to assess

how well the model fits, to adjust models, and to select the best model from

among those that have been tried.

Variable See Attribute. In RapidMiner, the term attribute is used.
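As a minimal illustration of the training/validation/holdout terminology above, the following sketch partitions a dataset three ways (the stand-in data and the 60/20/20 proportions are arbitrary choices for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "y": range(100)})  # stand-in dataset

train, temp = train_test_split(df, test_size=0.4, random_state=1)
valid, holdout = train_test_split(temp, test_size=0.5, random_state=1)
print(len(train), len(valid), len(holdout))  # 60, 20, 20

In RapidMiner, the same partitioning is accomplished with operators such as Split Data rather than with code.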

1.8 Road Maps to This Book


The book covers many of the widely used predictive and classification methods as

well as other machine learning tools. Figure 1.2 outlines machine learning from

a process perspective and where the topics in this book fit in. Chapter numbers

are indicated beside the topic. Table 1.1 provides a different perspective: it

organizes supervised and unsupervised machine learning procedures according

to the type and structure of the data.

Order of Topics
The book is divided into nine parts: Part I (Chapters 1–2) gives a general

overview of machine learning and its components. Part II (Chapters 3–4) focuses

on the early stages of data exploration and dimension reduction.

Part III (Chapter 5) discusses performance evaluation. Although it con-

tains only one chapter, we discuss a variety of topics, from predictive perfor-

mance metrics to misclassification costs. The principles covered in this part

are crucial for the proper evaluation and comparison of supervised learning

methods.

Part IV includes eight chapters (Chapters 6–13), covering a variety of popular

supervised learning methods (for classification and/or prediction). Within this

part, the topics are generally organized according to the level of sophistication

of the algorithms, their popularity, and ease of understanding. The final chapter

introduces ensembles and combinations of methods.



[Figure 1.2 is a flow diagram; its stages and method groupings are summarized here.

Data preparation, exploration, and reduction: Data preparation (2, 17); Data visualization (3); Dimension reduction (4)

Supervised learning
  Prediction: Linear regression (6); k-Nearest neighbors (7); Regression trees (9); Neural networks (11); Ensembles, AutoML (13)
  Classification: k-Nearest neighbors (7); Naïve Bayes (8); Classification trees (9); Logistic regression (10); Neural networks (11); Discriminant analysis (12); Ensembles, AutoML (13)
  Time series forecasting: Regression-based (18); Smoothing methods (19)

Unsupervised learning
  What goes together: Association rules (15); Collaborative filtering (15)
  Segmentation: Cluster analysis (16)

Intervention: Experiments (14): A/B testing; Uplift modeling; Reinforcement learning

Model evaluation and selection: Performance evaluation (5)
Model deployment and deriving insight: Score new data; Responsible DS (22)]

FIGURE 1.2 MACHINE LEARNING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHESES INDICATE CHAPTER NUMBERS

TABLE 1.1 ORGANIZATION OF MACHINE LEARNING METHODS IN THIS BOOK, ACCORDING TO THE NATURE OF THE DATA(a)

                          Supervised                                              Unsupervised
                          Continuous response        Categorical response         No response

Continuous predictors     Linear regression (6)      Logistic regression (10)     Principal components (4)
                          Neural nets (11)           Neural nets (11)             Cluster analysis (16)
                          k-Nearest neighbors (7)    Discriminant analysis (12)   Collaborative filtering (15)
                          Ensembles (13)             k-Nearest neighbors (7)
                                                     Ensembles (13)

Categorical predictors    Linear regression (6)      Neural nets (11)             Association rules (15)
                          Neural nets (11)           Classification trees (9)     Collaborative filtering (15)
                          Regression trees (9)       Logistic regression (10)
                          Ensembles (13)             Naive Bayes (8)
                                                     Ensembles (13)

(a) Numbers in parentheses indicate the chapter number.

Part V (Chapter 14) introduces the notions of experiments, intervention,

and user feedback. This single chapter starts with A/B testing and then its use in

uplift modeling and finally expands into reinforcement learning, explaining the

basic ideas and formulations that utilize user feedback for learning best treatment

assignments.

Part VI focuses on unsupervised learning of relationships. It presents

association rules and collaborative filtering (Chapter 15) and cluster analysis

(Chapter 16).

Part VII includes three chapters (Chapters 17–19), with the focus on fore-

casting time series. The first chapter covers general issues related to handling

and understanding time series. The next two chapters present two popular fore-

casting approaches: regression-based forecasting and smoothing methods.

Part VIII presents two broad data analytics topics: social network analy-

sis (Chapter 20) and text mining (Chapter 21). These methods apply machine

learning to specialized data structures: social networks and text. The final chap-

ter on responsible data science (Chapter 22) introduces key issues to consider for

when carrying out a machine learning project in a responsible way.

Finally, Part IX includes a set of cases.

Although the topics in the book can be covered in the order of the chapters,

each chapter stands alone. We advise, however, to read parts I–III before pro-

ceeding to chapters in Parts IV–VI. Similarly, Chapter 17 should precede other

chapters in Part VII.

1.9 Using RapidMiner Studio


To facilitate hands-on machine learning experience, this book uses RapidMiner
Studio (commonly referred to in this book as RapidMiner), a commercial machine learning software, available for free (limited to data with 10,000 records) and, particularly for educational purposes, with no data limitations. It provides an interactive

environment with a drag-and-drop interface in which you can easily perform

machine learning tasks by creating and executing processes based on relevant

operators.

RapidMiner helps you get started quickly on machine learning and offers

a variety of operators for analyzing data. The illustrations, exercises, and cases

in this book are written in relation to this software. RapidMiner has extensive

coverage of machine learning techniques for classification, prediction, mining

associations and text, forecasting, and data exploration and reduction. It offers

a variety of operators supporting supervised learning algorithms: neural nets,

classification and regression trees, k -nearest-neighbor classification, naive Bayes,

logistic regression, linear regression, and discriminant analysis, all for predictive

modeling. It provides for automatic partitioning of data into training, validation,

and holdout samples and for the deployment of the model to new data. It

also offers unsupervised algorithms: association rules, principal components

analysis, k -means clustering, and hierarchical clustering, as well as visualization

tools and data-handling utilities. With its short learning curve, free educational

license program, it is an ideal companion to a book on machine learning for

the business student.

To download RapidMiner Studio, visit www.rapidminer.com and create

an account. Follow the instructions there to download and install the latest

version of RapidMiner Studio for your operating system. For information

on RapidMiner’s Educational License Program, visit https://rapidminer.com/

educational-program/.

Once RapidMiner Studio is downloaded and installed with a license, upon

launching, the primary interface looks similar to that shown in Figure 1.3. The

RapidMiner interface provides five different views, highlighted in blue. Clicking

on these views allows one to toggle between them. The Design and Process

views are the most frequently used. We will also introduce Turbo Prep and

Auto Model views later in the book. The Deployments view is needed only

when putting chosen models into production use.

The Design view is where a data scientist can create machine learning

processes by combining and configuring various operators. One such pro-

cess is shown in the figure (the process is elaborated later in Section 2.6).

In RapidMiner, an operator is a container (internally containing code) perform-


ing a specific machine learning task (e.g., selecting attributes), each with its

own parameters. The Design view includes multiple display units called panels,

which can be moved and resized. The default panels include Repository, Oper-

ators, Process, Parameters, and Help. Additional panels can be activated from the

View menu. The View > Restore Default View option from the main menu is

often useful for restoring the default location and size of these panels.

To keep different artifacts organized, RapidMiner uses the notion of repos-


itory, which is a folder-like structure for organizing data, processes, and models

(for deployment). After creating a repository, it is best practice to create subfold-

ers data and processes. The Repository panel is useful for managing the reposito-

ries and related artifacts. Upon designing a process in the Process panel, it can

processes subfolder by right-clicking and selecting the Store Process


be stored in the

Here option. For example, the process 02-02-West Roxbury Preprocessing is stored
in the processes subfolder in the MLBA repository. To execute a process, the

blue “play” button is used, shown among the icons above the Repository view.

If the process executes properly and the output ports are connected to some

modeling artifacts (e.g., performance metrics), the view switches automatically

to the Results view for the user to analyze the results. The Operators panel is

useful for quickly searching or navigating through a plethora of operators avail-

able for different tasks. The Parameters panel is used for configuring operators,

and the Help panel is self-explanatory.

For more details on the RapidMiner interface, see https://academy.

rapidminer.com/courses/rapidminer-studio-gui-intro.

FIGURE 1.3 PANELS IN THE DESIGN VIEW OF RAPIDMINER STUDIO

Importing and Loading Data in RapidMiner


RapidMiner allows importing data directly into a process through a variety of

Read… operators (such as Read CSV, Read Excel, and more). With this approach,

a local link to the source data is created, and any updates to the source data are

immediately reflected upon rerunning the process. Another approach that is

designed to be portable across computers but lacks data updates is to import

the data into RapidMiner by selecting the data subfolder and then using the

Import Data wizard to store the data there in native RapidMiner format called
ExampleSet. This ExampleSet (data) can then be used in any process with the
Retrieve operator. We use the latter approach in this book.
For more details on importing and loading data in RapidMiner, see

https://academy.rapidminer.com/learning-paths/get-started-with-rapidminer-

and-machine-learning.

RapidMiner Extensions
RapidMiner offers a variety of special-purpose extensions (e.g., Deep Learning)

that provide additional operators for performing specific machine learning tasks.

These extensions can be searched, installed, and updated from the Marketplace
(Updates and Extensions) option in RapidMiner Extensions menu. We will use

some such extensions in the book, and we will mention them at appropriate

places.

One such extension worth mentioning here is the Python Scripting extension.
It includes operators such as Python Transformer, Python Learner, Execute Python,
and Python Forecaster that allow Python packages to be used from within Rapid-
Miner. To use the Python extension, you will need an installation of Python and

supporting packages. The easiest way to do that is by installing anaconda, which is


a packaged distribution of Python and most commonly used data science pack-

ages. To install anaconda and integrate it with RapidMiner, first close Rapid-

Miner Studio, and visit https://www.anaconda.com. We recommend installing

anaconda’s Individual Edition. Once anaconda is installed and RapidMiner is

launched, RapidMiner automatically detects the path for Python, and all installed

packages are available to be used from Python operators in RapidMiner.
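For illustration, here is a minimal sketch of the kind of script the Execute Python operator runs. The extension's convention is a function named rm_main, whose arguments and return values correspond to the operator's input and output ports as pandas DataFrames; the column names below are hypothetical:

import numpy as np

def rm_main(data):
    # 'data' arrives on the operator's input port as a pandas DataFrame
    # derive a new attribute (column names here are invented)
    data["log_income"] = np.log(data["income"])
    return data  # sent back out on the operator's output port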

We will demonstrate the use of Python extensions in Chapters 12 and 20.

Further, Python’s pandas and statsmodels packages are particularly powerful for

handling time series data and time series forecasting. While outside the scope of

this book, we encourage interested readers to explore techniques described in

Chapters 17–19 using these packages.


CHAPTER 2
Overview of the Machine Learning
Process

In this chapter, we give an overview of the steps involved in machine learning

(ML), starting from a clear goal definition and ending with model deployment.

The general steps are shown schematically in Figure 2.1. We also discuss issues

related to data collection, cleaning, and preprocessing. We introduce the notion

of data partitioning, where methods are trained on a set of training data and then
their performance is evaluated on a separate set of holdout data, as well as explain

how this practice helps avoid overfitting. Finally, we illustrate the steps of model

building by applying them to data.

Define purpose → Obtain data → Explore & clean data → Determine ML task → Choose ML methods → Apply methods, select final model → Evaluate performance → Deploy

FIGURE 2.1 SCHEMATIC OF THE DATA MODELING PROCESS

2.1 Introduction
In Chapter 1, we saw some very general definitions of business analytics and

machine learning. In this chapter, we introduce a variety of machine learning

methods. The core of this book focuses on what has come to be called predictive
analytics, the tasks of classification and prediction as well as pattern discovery,

which have become key elements of a “business analytics” function in most

large firms. These terms are described next.


2.2 Core Ideas in Machine Learning


Classification
Classification is perhaps the most basic form of predictive analytics. The recipient

of an offer can respond or not respond. An applicant for a loan can repay on time,

repay late, or declare bankruptcy. A credit card transaction can be normal or

fraudulent. A packet of data traveling on a network can be benign or threatening.

A bus in a fleet can be available for service or unavailable. The victim of an illness

can be recovered, still be ill, or be deceased.

A common task in machine learning is to examine data where the classifi-

cation is unknown or will occur in the future, with the goal of predicting what

that classification is or will be. Similar data where the classification is known

are used to develop rules, which are then applied to the data with the unknown

classification.
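In code form, this workflow amounts to fitting a model on records whose class is known and then applying it to records whose class is unknown. A hedged sketch, with invented transaction data:

from sklearn.tree import DecisionTreeClassifier

# labeled records: [amount, hour of day]; class: 0 = normal, 1 = fraudulent
X_known = [[20, 10], [15, 14], [950, 3], [700, 2], [30, 12], [800, 4]]
y_known = [0, 0, 1, 1, 0, 1]

clf = DecisionTreeClassifier().fit(X_known, y_known)  # "develop rules"

# new transactions whose classification is unknown
X_new = [[25, 11], [900, 3]]
print(clf.predict(X_new))  # predicted classes for the new records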

Prediction
Prediction is similar to classification, except that we are trying to predict the

value of a numerical attribute (e.g., amount of purchase) rather than a class

(e.g., purchaser or nonpurchaser). Of course, in classification we are trying to

predict a class, but the term prediction in this book refers to the prediction of the
value of a continuous numerical attribute.

Sometimes in the machine learning literature, the terms estimation and regression
are used to refer to the prediction of the value of a continuous attribute, and
prediction may be used for both continuous and categorical data.

Association Rules and Recommendation Systems


Large databases of customer transactions lend themselves naturally to the analysis

of associations among items purchased, or “what goes with what.” Association


rules, or affinity analysis, is designed to find such general associations patterns

between items in large databases. The rules can then be used in a variety of

ways. For example, grocery stores can use such information for product place-

ment. They can use the rules for weekly promotional offers or for bundling

products. Association rules derived from a hospital database on patients’ symp-

toms during consecutive hospitalizations can help find “which symptom is fol-

lowed by what other symptom” to help predict future symptoms for returning

patients.
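As a small worked example (the baskets are invented), the confidence of a rule such as "IF A and B are purchased, THEN C is also purchased" is the share of baskets containing A and B that also contain C:

baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
]

n_ab = sum(1 for b in baskets if {"A", "B"} <= b)        # A and B together
n_abc = sum(1 for b in baskets if {"A", "B", "C"} <= b)  # ...and also C

print("confidence =", n_abc / n_ab)  # 2/3 in this toy example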

Online recommendation systems, such as those used on Amazon and Netflix, use collaborative filtering, a method that uses individual users' preferences and tastes given their historic purchase, rating, browsing, or any other measurable behavior

indicative of preference, as well as other users’ history. In contrast to association


rules that generate rules general to an entire population, collaborative filtering

generates “what goes with what” at the individual user level. Hence, collab-

orative filtering is used in many recommendation systems that aim to deliver

personalized recommendations to users with a wide range of preferences.
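A bare-bones sketch of the individual-level idea (the ratings are invented): find the user most similar to a target user, then recommend items that the similar user rated highly:

import numpy as np

# rows = users, columns = items; 0 means not yet rated
ratings = np.array([
    [5, 4, 0, 1],   # user 0 (the target)
    [4, 5, 1, 0],   # user 1, whose tastes resemble user 0's
    [1, 0, 5, 4],   # user 2
])

target = ratings[0]
# cosine similarity of the target user with every user
sims = ratings @ target / (np.linalg.norm(ratings, axis=1) * np.linalg.norm(target))
sims[0] = -1  # exclude the target itself
print("most similar user:", int(np.argmax(sims)))  # user 1 here

Real recommender systems refine this idea considerably; Chapter 15 covers collaborative filtering in detail.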

Predictive Analytics
Classification, prediction, and, to some extent, association rules and collaborative

filtering constitute the analytical methods employed in predictive analytics. The

term predictive analytics is sometimes used to also include data pattern identifi-

cation methods such as clustering.

Data Reduction and Dimension Reduction


The performance of some machine learning algorithms is often improved when

the number of attributes is limited, and when large numbers of records can

be grouped into homogeneous groups. For example, rather than dealing with

thousands of product types, an analyst might wish to group them into a smaller

number of groups and build separate models for each group. Or a marketer might

want to classify customers into different “personas” and must therefore group

customers into homogeneous groups to define the personas. This process of consolidating a large number of records (or cases) into a smaller set is termed data reduction. Methods for reducing the number of cases are often called clustering.
Reducing the number of attributes is typically called dimension reduction.

Dimension reduction is a common initial step before deploying supervised

learning methods, intended to improve predictive power, manageability, and

interpretability.
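Both kinds of reduction can be sketched in a few lines (the data here are random stand-ins): k-means clustering groups records into a small number of clusters, while principal components analysis compresses many attributes into a few components:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 15))  # 200 records, 15 attributes

personas = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_reduced = PCA(n_components=3).fit_transform(X)

print(personas[:10])    # each record assigned to one of 4 groups
print(X_reduced.shape)  # (200, 3): 15 attributes reduced to 3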

Data Exploration and Visualization


One of the earliest stages of engaging with data is exploring it. Exploration is

aimed at understanding the global landscape of the data and detecting unusual

values. Exploration is used for data cleaning and manipulation as well as for

visual discovery and “hypothesis generation.”

Methods for exploring data include looking at various data aggregations

and summaries, both numerically and graphically. This includes looking at

each attribute separately as well as looking at relationships between attributes.

The purpose is to discover patterns and exceptions. Exploration by creating

charts and dashboards is called data visualization or visual analytics. For numerical

attributes, we use histograms and boxplots to learn about the distribution of their

values, to detect outliers (extreme observations), and to find other information relevant to the analysis goal.


Another random document with
no related content on Scribd:
But the inspection was not too encouraging. We were distinctly
short of water. Qway thought we should have just enough to take us
out to his “Valley of the Mist,” and back again to Dakhla, if all went
well, but he pointed out that we had one lame camel and another
limping slightly, and that at that season it was quite possible that we
might get some hot days with a simum blowing, and he consequently
thought that it would be far better to be on the safe side and go
straight back to Dakhla, rest the camels, and then come out and go
on to the valley on the next journey.
As this was obviously sound advice, we struck camp, packed up
and prepared to set off at once towards Dakhla, leaving several
sacks of grain behind us, which greatly eased the burden of the
camels and allowed us to leave the two limping beasts unloaded.
The wady in which the camp had been pitched evidently lay on
the southern fringe of the plateau, and opened out on its eastern
side down a sandy slope on to the lower ground beyond. The
plateau, I knew, did not extend much farther to the east, so with two
damaged camels in the caravan, I thought it best to avoid a return
over the very rough road we had followed on our outward journey,
and to strike instead in an easterly direction, round the south-east
corner of the tableland, over the smooth sandy desert lying at the
foot of the scarp of the plateau.
This road, though somewhat longer than the one we had followed
on our outward journey, proved to be excellent going; it lay almost
entirely over smooth hard sand. We continued to follow an easterly
course till the middle of the next morning, when, on reaching the
edge of the dune belt that runs along the western boundary of
Dakhla, we turned up north towards Mut, and coasted along it.
The road was almost featureless. A few low rocky hills were seen
on the lower ground for a while after leaving the “Valley of the Rat,”
but even these soon ceased. From this point onwards we saw
nothing of interest, with the exception of some pieces of petrified
wood, lying on a greenish clay, until we reached our destination at
Mut, in Dakhla Oasis. In the desert round about Kharga and Dakhla
we several times came across the petrified remains of trees, though
they never occurred in large patches.
Qway proved to be right in his forebodings of hot weather, and we
had two days of fairly warm simum wind. We, however, managed to
get in without suffering unduly from thirst—but I felt rather glad that
we had not tried to reach that valley.
The state of my caravan necessitated my giving them some days’
rest, to enable them to recover their condition, and to allow their feet
to get right again after the hard usage they had received on the
sharp rocks of the plateau, before setting out again into the desert.
In the meantime I conducted an experiment to try and locate the
position of the place from which the palm doves—the kimri sifi—were
said to come. Their migration was just at its height, and several
times, while on the plateau, we put them up from the rocks on which
they had alighted to rest during their flight.
The kimri sifi always arrived in the oasis just before sunset, and
as they generally made for a particular well to the south-west of Mut,
I went there one evening with a compass and gun to wait for them. I
took the bearing with my compass to the direction in which a number
of them came. These bearings tallied very closely, the average of
them being 217° mag.
I then shot a few of them just as they were alighting, and cut them
open. They had all been feeding on seeds—grass seeds apparently
—and olives. The seeds were in an almost perfect condition, but the
olives were in such an advanced state of digestion as to be hardly
recognisable.
I next bought some doves of the ordinary kind kept in the oasis
from the villagers, and confined them in a cage. At sunrise the
following morning I fed them on olives and then, towards midday,
took them out one by one, at intervals of an hour, killed them, and
cut them open to see the state of the olives. Those of the one killed
at three o’clock seemed in the state most resembling those taken
from the kimri sifi I had shot, showing that it required about nine
hours’ digestion to reduce them to that condition.
The kimri sifi is a weak-flighted bird, and, judging from the
numbers we put up in the desert from places where they had settled
down to rest, spends a considerable part of the day during the flight
to Mut from the oasis where the olives grow, resting upon rocks in
the desert. I consequently concluded that its average speed,
including the rests, during its journey from the olive oasis, would be
about twenty-five miles an hour.
Applying the principles of Sherlock Holmes to the case I deduced
—I believe that to be the correct word—that the oasis the kimri came
from lay in the direction of the mean of the bearings I had taken, viz.
217° mag., at a distance of nine times twenty-five, or two hundred
and twenty-five miles, and that it contained olive trees. Some years
later an Arab told me that there was an oasis off there that contained
large quantities of olive trees. Boy scouts will, I trust, copy!
CHAPTER VIII

H AVING given my caravan sufficient time to recover from their


previous journey, I set out again into the desert. On this
occasion the camels were much more heavily loaded, as I had
determined to cover as much ground as possible.
But we had not proceeded for more than four hours from Mut
when one of the camels fell dead lame again. As it was obviously
hopeless to think of taking him along with us, and we had proceeded
such a short distance, I decided to turn back and make a fresh start.
On reaching Mut we fired the camel and then the poor brute was
cast loose. He hobbled painfully about for a few minutes, and then
with a grunt knelt down on the ground. Musa, with the idea perhaps
of relieving his sufferings, squatted on his heels in front of him, and
proceeded to warble to him on his flute.
This was an expedient to which he often resorted in order to
soothe the beasts under his charge. Frequently, after an unusually
heavy day in the desert, when the camels had been fed, he would
squat down among them and discourse wild music from his reed
flute to them, till far into the night. As this generally had the effect of
keeping me awake, I rather objected to the proceeding.
On this occasion his musical efforts seemed curiously to take
effect. The camel for some time remained shuffling uneasily on the
ground, probably in considerable pain. But after a time he became
quieter, and before long he stretched his long neck out upon the
ground and apparently went to sleep.
The day after our operation on the camel we started off again for
the “Valley of the Mist” and Qway’s high black mountain.
The weather at the beginning of April is always variable. A strong
northerly wind sprang up towards evening, on the third day out, and
made things rather uncomfortable. The sky at dusk had a curious
silvery appearance that I had noticed often preceded and followed a
sand storm. It was presumably caused by fine sand particles in the
upper reaches of the atmosphere. The wind dropped after dark, as it
frequently does in the desert, but it sprang up again in the morning
with increased strength. During the night it worked round from north
towards the east, and by morning had got round still farther, and was
blowing a gale from the south, right into our teeth.
Soon after our start, we found considerable difficulty in making
any headway against it, and before long we were marching into a
furious gale. One of the beasts, which was perhaps rather
overloaded, was several times brought to a standstill by a violent
gust. An unusually powerful one that struck him fairly brought him
down on his knees. We got him on his feet again, but had gone but a
short way when another camel followed his example. Then the first
one came down again and this time threw his load.
It was obviously useless to attempt to proceed, so having
reloaded the camel, we retraced our steps to a hill at the foot of
which we had camped. It was, of course, quite out of the question to
pitch the tent, so it was left tied up in a bale, together with the other
baggage, while we climbed up on to a ledge that ran round the hill,
about twenty feet above its base. Here we were above the thickest of
the clouds of sand that swept over the surface of the ground so
densely that it was hardly possible to see more than a few yards in
any direction.
Towards the afternoon the wind increased if anything in force, and
small stones could be heard rattling about among the rocks on the
hill. It veered round once more till it was blowing again from the
north. The gale had considerably fallen off by sunset. I accordingly,
rather to my subsequent regret, decided to spend the night at the
bottom of the hill.
When I got out my bedding, I picked up a woollen burnus and
shook it to get rid of the sand. It blazed all over with sparks. I put the
end of my finger near my blankets, and drew from them a spark
strong enough to be faintly felt. When I took off the hat I
was wearing I found that my hair was standing on end—this I hasten
to state was only due to electricity.
The wind died out towards morning. I had, however, to get up
several times before midnight to shake off the sand that had
accumulated on my blankets, to prevent being buried alive, for it
drifted to an extraordinary extent round the flanks of the hill.
We had started off some time the following morning before it
struck me that there was something wrong with the baggage, and I
found that the tent had been left behind. We found it at the foot of the
hill completely buried by the sand that must have banked up during
that gale to the height of two or three feet against the hill.
The horrors of a sand storm have been greatly overrated. An
ordinary sand storm is hardly even troublesome, if one covers up
one’s mouth and nose in the native fashion and keeps out of the
sand. A certain amount of it gets into one’s eyes, which is
unpleasant, but otherwise there is not much to complain about. On
the other hand, there is an extraordinarily invigorating feeling in the
air while a sand storm is blowing—due perhaps to the electrified
condition of the sand grains, which, from some experiments I once
made on the sand blown off a dune, carry a fairly high charge of
positive electricity.
The storm I have described was certainly unpleasant, but it had
one compensation—Musa left his reed flute lying on the sand, and
my hagin promptly ate it! That camel seemed to be omnivorous.
Feathers, tent pegs and gun stocks all figured at various times in his
bill of fare. But bones were his favourite delicacy; a camel’s skeleton
or skull by the roadside invariably drew him off the track to
investigate, and he seldom returned to his place without taking a
mouthful. In consequence, among the numerous names by which he
was known in the caravan—they were all abusive, for his habits were
vile—was that of the ghul, or cannibal.
We got off at five in the morning the day following the sand storm,
and, after a six hours’ march, reached the sacks of grain in the
“Valley of the Rat.” As the day was rather warm, we rested the
camels here for four hours and then pushed on for Qway’s “high
black mountain” and the “Valley of the Mist.”
I had hoped great things from Qway’s description of them, but
unfortunately I had not taken into account the want of proportion of
the bedawin Arabs. The “high black mountain” was certainly black,
but it was only seventy feet high!
From the top of this “mountain” we were able to look down into the
“Valley of the Mist.” Here, too, great disappointment met me. The
wady was there all right—it was an enormous depression, about two
hundred and fifty feet lower than the plateau. But the vegetation and
the huge oasis, that I had been expecting from Qway’s account of
the “mist,” were only conspicuous by their absence. The wady was
as bare as the plateau; and considering the porous nature of the
sand that covered its floor, and the height above sea-level as
compared with the other oases, it could hardly have been otherwise.
It was clearly, however, of enormous size, for it stretched as far as
we could see south of an east and west line, as a vast expanse of
smooth sand, studded towards the south and east by a few low
rocky hills, but absolutely featureless to the south-west and west.
The “mist,” upon which Qway laid such stress, I found was not
due to moisture at all, but to refraction, or rather to the absence of it.
The hot sun blazing down on to a flat stony desert, such as the
plateau over which we had been travelling, causes a hazy
appearance in the nature of a mirage on the distant horizon. But,
when looking from the top of a tableland over a deep depression
some distance away, this hazy appearance is absent, as the line of
sight of the spectator lies the height of the cliff above the floor of the
depression, instead of being only a few feet above it. Though the
“Valley of the Mist” was invisible from the point where Qway had first
seen his “high black mountain,” his experienced eye had seen that a
depression lay beyond it, owing to the absence of this haze, which,
however, is only to be seen under certain conditions.
With some difficulty we managed to get the caravan down from
the plateau on to the lower ground, and then coasted along towards
the west, under the cliff, in order to survey it. This scarp ran
practically due east and west, without a break or indentation until we
came to a belt of dunes which poured over it, forming an easy ascent
on to the plateau, up which we proceeded to climb.
At the top the sand belt passed between two black sandstone
hills, from the summit of one of which a very extensive view over the
depression was obtainable. It was at once clear that there was no
prospect of finding water—still less an oasis—for at least two days’
journey farther to the south, for there was nothing whatever to break
the monotony of the sand-covered plain below us. As the water
supply was insufficient to warrant any further advance from Mut, we
had to return—always a depressing performance.
We found, however, one hopeful sign. The pass that led over the
dune belt on to the plateau—the “Bab es Sabah,” or “gate of the
morning,” as the poetical Khalil called it, because we first sighted it
soon after dawn—had at its foot an ’alem. When I plotted our route
on the map, I found that this ’alem lay almost exactly in line with the
old road we had followed on our first journey out from Mut, showing
that the pass had been the point for which it had been making. The
place to which this road led would consequently be sure to lie near,
or on the continuation of the bearing from the pass to the place
where we had seen the two first ’alems. This was a point of
considerable importance, as there seemed to be little chance of
finding any remains of the road itself on the sandy soil of the
depression, unless we should happen to land on another ’alem. The
stretch of road from which our earlier bearing had been taken was
such a short one that there was always the risk that, owing to some
natural feature obstructing the direct road, that short section was
not running directly towards its ultimate destination.
While hunting round about the camp, I found embedded in the
sand two pieces of dried grass, much frayed and battered. So on
leaving the camp next day, we followed the line of the sand belt to
the north, as showing the direction of the prevailing wind, in hopes of
finding the place from which the dried grass embedded in the dune
had come.
[Plate: View near Rashida. Note the wooded height in the background and the scrub-lined stream in foreground from the well under the large tree on the right. (p. 49).]

[Plate: A Conspicuous Road—to an Arab. Two small piles of stone, or ’alems, can, with difficulty, be seen. Arabs can march for hundreds of miles through a waterless desert, relying on landmarks such as these. (p. 86).]

[Plate: Battikh. A type of sand erosion, known as battikh or “watermelon” desert. (p. 308).]
We left the camp about half-past seven. Soon after four we entered
what is known as a redir—that is to say, a place where water
will collect after one of the rare desert rains. It was a very shallow
saucer-like hollow, a few feet in depth, the floor of which consisted of
clay. The farther side of this was covered with sand, and here we
found the grass for which we had been searching.
It was very thinly scattered over an area a few hundred yards in
diameter. It was quite shrivelled and to all appearances completely
dead. But it was the first vegetation we had seen on the plateau to
the south-west of Dakhla. This redir showed a noticeable number of
tracks of the desert rats, and was probably one of their favourite
feeding grounds.
Having solved the problem of the grass, as our water supply was
getting low, we turned off in a north-easterly direction, making for
Dakhla. The plateau surface changed for the worse, and a
considerable amount of sofut had to be crossed; but fortunately the
camels held out. We crossed two old roads running up north,
apparently to Bu Mungar and Iddaila. Here and there along these old
disused roads we saw circles, four or five feet in diameter, sparsely
covered with stones about the size of a hen’s egg, scattered on the
sandy surface, that obviously had been placed there by human
agency. Qway explained that these were the places where the old
slave traders, who used these roads, had been in the habit of laying
their water-skins. A gurba, raised slightly off the ground in this way,
so that the air can circulate round it, keeps the water much cooler
than when laid with a large part of its surface in contact with the
ground.
Other evidence of the old users of these roads was to be seen in
an occasional specimen of an oval, slightly dished stone about two
feet long, known as a markaka, on which they used to grind, or
rather crush, their grain with the help of a smaller hand stone, and
also in the quantities of broken ostrich shells that were frequently
seen. These shells can be found in many parts of the desert, and are
said to be the remains of fresh eggs brought by old travellers from
the Sudan to act as food on the journey. It has been argued, from
their existence, that ostriches ran wild in these deserts. But it is
difficult to see upon what food such a large bird could have
subsisted.
On the second day after leaving the redir, we got on to another
old road, and continued to follow it all day. This road eventually took
us to a clump of four or five green terfa bushes, and a second one of
about the same size was reached soon afterwards. These little
clusters of bushes proved afterwards to be of the greatest assistance
to us, as they not only afforded the camels a bite of green food, but
were the source from which came most of the firewood that we used
in the desert. Evidently others had found them useful too in the past,
for no less than four old roads converged on to them—a striking
instance of the value of green food and firewood in the desert. Some
broken red pottery was found amongst these bushes.
Shortly after leaving them we found the track of a single camel
going to the west—obviously to Kufara. But beyond this single track,
and that of the five camels we had seen on our first journey from
Mut, we never saw any modern traces of human beings on the
plateau.
The weather, which had been very hot, fortunately grew suddenly
cool, and once or twice a few drops of rain fell. This change in the
temperature was most welcome, as the camels were becoming
exhausted with their long journey away from water, and showing
unmistakable signs of distress. The change to colder weather,
however, revived them wonderfully.
The road, unluckily, became much worse, and we got on to a part
of the plateau thickly covered by loose slabs of purplish-black
sandstone, many of which tinkled like a bell when kicked.
On the day before we reached Dakhla there was a slight shower
in the morning just after we started, and the weather remained cool,
with a cold north wind and overcast sky all day. We were
consequently able to make good progress, and by the evening had
reached the north-east corner of the plateau and were within a day’s
journey of Mut.
Just before camping there was a sharp shower accompanied by
thunder and lightning, enough rain falling during the few minutes it
lasted to make my clothing feel thoroughly damp.
The tent was pitched on a sandy patch, and had hardly been
erected before the rain, for about a quarter of an hour, came down in
torrents, with repeated flashes of vivid lightning, which had a very
grand effect over the darkened desert.
I was just going to turn in about an hour afterwards when my
attention was attracted by a queer droning sound occurring at
intervals. At first I thought little of it, attributing it to the wind blowing
in the tent ropes, which the heavy rain had shrunk till they were as
taut as harp strings. The sound died away, and for a few minutes I
did not hear it.
Then again it swelled up much louder than before and with a
different note. At first it sounded like the wind blowing in a telegraph
wire; but this time it was a much deeper tone, rather resembling the
after reverberation of a great bell.
I stepped out of the tent to try and discover the cause. It was at
once clear that it could not be due to the wind in the tent ropes, for it
was a perfectly calm night. The thunder still growled occasionally in
the distance and the lightning flickered in the sky to the north. After
the hot scorching weather we had experienced, the air felt damp and
chilly enough to make one shiver.
The sound was not quite so distinctly audible outside the tent as
inside it, presumably owing to the fact that the rain had so tightened
the ropes and canvas that the tent acted as a sounding board. At
times it died away altogether, then it would swell up again into a
weird musical note.
Thinking that possibly it might be due to a singing in my ears, I
called out to my men to ask if they could hear anything.
Abd er Rahman, whose hearing was not so keen as his eyesight,
declared that he could hear nothing at all. But Khalil and Qway both
said they could hear the sound, Qway adding that it was only the
wind in the mountain. It then flashed across me that I must be
listening to the “song of the sands,” that, though I had often read of, I
had never actually heard.
This “song of the sands” was singularly difficult to locate. It
appeared to come from about half a mile away to the west, where
the sand came over a cliff. It was a rather eerie experience
altogether.
Musical sands are not very uncommon. The sound they emit is
sometimes attributed, by the natives, to the beating of drums by a
class of subterranean spirits that inhabit the dunes. In addition to
those sands that give out a sound of their own accord, there is
another kind that rings like a bell when struck. A patch of sand of this
kind is said to exist on the plateau to the north of Dakhla Oasis. I
never personally came across any sand of this description, but much
of the Nubian sandstone we found on the plateau to the south-west
of Dakhla Oasis gave out a distinctly musical sound when kicked,
and in the gully that leads up to the plateau at the Dakhla end of the
’Ain Amur road, I passed a shoulder of rock that emitted a slight
humming sound as a strong south wind blew round it.
The following day we reached Mut without any further incident.
We, however, only just got in in time as our water-tanks were
completely empty, after our journey of eleven days in the desert.
Knowing that many of the natives in Dakhla suspected me of
being engaged on a treasure hunt, and of looking for the oasis of
Zerzura, I had played up to the theory by continually asking for
information on the subject. On our return from such a long journey
into the desert several natives, assuming that we must have found
something, came round to enquire whether I had actually found the
oasis.
Khalil, who had heard the account in the “Book of Treasure,”
called my attention to the fact that the road we had followed on our
return journey, until it lost itself in the sand dunes on the outskirts of
Dakhla, at that time was leading straight for the Der el Seba’a Banat,
and gave it as his opinion that, if we only followed the road far
enough in the opposite direction, it would be bound to lead us to
Zerzura. For the benefit of any treasure seekers who wish to look for
that oasis, I will mention another and
still more significant fact—that road exactly follows the line of the
great bird immigration in the spring—showing that it leads to a fertile
district, and moreover—most significant fact of all—many of those
birds are wild geese!
CHAPTER IX

IN the journey from which we had just returned, we had been a
rather long time away from water for that time of year, and the
camels were in a very exhausted condition from the hard travelling in
the heat on a short allowance of water. It was then May, and March
is usually considered in Egypt as being the last month for field work,
so I decided to give them a rest to recover their condition, and then
go back to Kharga Oasis and the Nile Valley.
The men, with the exception of Khalil, had all settled down to the
routine of desert travelling, and were working well. The mainstay of
the caravan was Qway. He was a magnificent man in the desert, and
was hardly ever at fault.
Finding that the caravan was rather overloaded at our start for our
third journey, I left, on our second day out, a tank of water and two
sacks of grain in the desert, to be picked up on our way back to Mut.
From that point we had gone three days to the south. We had then
gone two days south-west; then two days west; another day towards
the north-west, and then three days north-east. All but the first four
days of this journey had been over ground which was quite unknown
to him; but when at the end of this roundabout route I asked him to
point out to me where our tank and sacks had been laid, he was able
to indicate its position without the slightest uncertainty.
At first sight the faculty that a good desert guide has of finding his
way about a trackless desert seems little short of miraculous. But he
has only developed to an unusual degree the powers that even the
most civilised individual possesses in a rudimentary state.
Anyone, for instance, can go into a room that he knows in the
dark, walk straight across from the door to a table, say, from there to
the mantelpiece, and back again to the door without any difficulty at
all, thus showing the same sense of angles and distances that
enabled Qway, after a circuitous journey of a hundred and sixty
miles, to find his way straight back to his starting-point. The Arabs,
however, have so developed this faculty that they can use it on a
much larger scale.
The bedawin, accustomed to travelling over the wide desert
plains, from one landmark to another, keep their eyes largely fixed
on the horizon. You can always tell a desert man when you see him
in a town. He is looking towards the end of the street, and appears to
be oblivious of his immediate surroundings. This gives him that “far-
away” look that is so much admired by lady novelists.
It would be rash, however, to assume that a desert guide does not
also notice what is going on around him, for there is very little indeed
that he does not see. He may be looking to the horizon to find his
next landmark during a great part of his time, but he also scans most
closely the ground over which he is travelling, and will not pass the
faintest sign or footprint, without noticing it and drawing his own
conclusions as to who has passed that way and where they were
going. He may say nothing about them at the time; but he does not
forget them.
Nor will he forget his landmarks, or fail to identify them when he
sees them a second time; a good guide will remember his landmarks
sufficiently well to be able to follow, without hesitation, a road that he
has been over many years before, and has not seen in the interval.
Frequently, after passing a conspicuous hill, I have seen Qway
glance over his shoulder for a second or two, to see what it would
look like when he approached it again on the return journey, and to
note any small peculiarities that it possessed.
In addition to this sense of angles and distances, these desert
men have in many cases a wonderfully accurate knowledge of the
cardinal points of the compass. This seems at first sight to amount
almost to an instinct. It is, however, probably produced by a
recollection of the changes of direction in a day’s march which has,
through long practice, become so habitual as to be almost
subconscious.
A good guide can not only steer by the stars and sun, but is able
to get on almost equally well without them. On the darkest and most
overcast night, Qway never had the slightest doubt as to the
direction in which our road lay—and this too in a part of the desert
which he had previously never visited.
I often tested the sense of direction possessed by my men when
we got into camp, by resting a rifle on the top of a sack of grain and
telling them to aim it towards the north, afterwards testing their
sighting by means of my compass.
Qway and Abd er Rahman were surprisingly consistent in their
accuracy, and there was very little indeed to choose between them.
There was considerable rivalry between them on this point in
consequence. They were very seldom more than two degrees wrong
on one side or the other of the true north.
Qway was an unusually intelligent specimen of the bedawin Arabs
—a race who are by no means so stupid as they are sometimes
represented. There was little that he did not know about the desert
and its ways, and he was extraordinarily quick to pick up any little
European dodges, such as map-making to scale, that I showed him;
but on questions connected with irrigation, cultivation, building, or
anything that had a bearing on the life of the fellahin, he was—or
professed to be—entirely ignorant. He regarded them as an inferior
race, and evidently considered it beneath his dignity to take any
interest at all in them or their ways. He seldom alluded to them to me
without adding some contemptuous remark. He never felt at home in
the crowded life of the Nile Valley, declared that he got lost whenever
he went into a town—this I believe to be the case with most bedawin
—that the towns were filthy, the inhabitants all thieves, liars,
“women” and worse, and that the drinking water was foul, and even
the air was damp, impure, and not to be compared with that of his
beloved desert.
The opinion of the Egyptians of the Nile Valley is equally
unfavourable to the Arabs. They regard them as an overbearing,
lawless, ignorant set of ruffians whom they pretend to despise—but
they stand all the same very much in awe of them. After all, their
views of each other are only natural; their characters have practically
nothing in common, and criticism usually takes the form of “this man
is different from me, so he must be wrong.”
Qway, in the caravan, was invariably treated with great respect.
He was usually addressed as “khal (uncle) Qway,” and he was not
the man to allow any lapses from this attitude, which he considered
his due as an Arab and as the head-man of the caravan. Any falling
off in this respect was immediately followed by some caustic
reference on his part to the inferiority of slaves, “black men,” or
fellahin, as the case required.
Abd er Rahman and the camel men all did their work well, and the
difficulties due to the sand and the attitude of the natives that I had
been warned that I should have to face, all appeared to be greatly
exaggerated. With Qway as my guide, I hoped, with the experience I
had already gained, to make an attempt the next year, with a
reasonable prospect of success, to cross the desert, or at any rate to
penetrate much farther into it than I had already done, and reach
some portion that was inhabited.
But just when I was preparing to return to Egypt, an event
happened that put an entirely new complexion upon things, and
upset the whole of my plans.
During our absence in the desert, a new mamur arrived in Dakhla
Oasis and came round to call on me. He was rather a smart-looking
fellow, dressed in a suit considerably too tight for him, of that peculiar
shade of ginger so much affected by the Europeanised Egyptians.
He had the noisy boisterous manner common to his class, but he
spoke excellent English and was evidently prepared to make himself
pleasant.
Before he left, he informed me that the postman had just come in,
and that news had arrived by the mail of the revolution in Turkey.
This revolution had long been simmering, with the usual result that
the scum—in the form of Tala’at and the Germanised Enver—had
come up to the top. The Sultan had been deposed, and it was
considered likely that he would be replaced by some sort of republic.
The whole Moslem community was in a very excited state in
consequence.
A day or two later the Coptic doctor dropped in. He told me that
he had just seen Sheykh Ahmed, from the zawia at Qasr Dakhl—
whose guest I had been at his ezba—who had told him that if the
revolution in Turkey succeeded and the Sultan really were deposed,
the Senussi Mahdi would reappear and invade Egypt. The Mahdi, it
may be mentioned, is the great Moslem prophet, who according to
Mohammedan prophecies, is to arise shortly before the end of the
world, to convert the whole of mankind to the faith of Islam.
This, if it were true, was important news. The position was one
fraught with considerable possibilities. In order to understand the
situation some explanation may perhaps be useful to those
unacquainted with Mohammedan politics.
Egypt at that time was a part of the Turkish Empire—our position
in the country being, at any rate in theory, merely that of an
occupation, with the support of a small military force. The Sultan of
Turkey was consequently, nominally, still the ruler of the country.
But in addition to being Sultan of Turkey, Abdul Hamid was also
the Khalif of Islam—an office that made him a sort of Emperor-Pope
of the whole of the Mohammedans. His claim to be the holder of this
title was in reality of a somewhat flimsy character; but whatever his
rights to it may have been according to the strict letter of the Moslem
law, he was almost universally regarded by the Sunni
Mohammedans as their Khalif, that is to say, as the direct
successor, as the head of Islam, of the Prophet Mohammed himself,
in the same way that the Pope is regarded as the direct successor to
St. Peter.
A revolution always loosens the hold that the central Government
has over the outlying parts of a country, and in a widespread and
uncivilised empire like that subject to the Sultan of Turkey, where
centuries of misgovernment have produced a spirit—it might almost
be said a habit—of revolt, serious trouble was bound to follow, if the
Sultan should be deposed and his place be taken by a republic. Not
only would Egypt and Tripoli be deprived of the ruler to whom they
owed their allegiance, but the whole native population of North
Africa, with the exception of an almost negligible minority, would be
left without a spiritual head. This would have been clearly a situation
that opened endless possibilities to such an enterprising sect as the
Senussia, whose widespread influence through North Africa is
shown by the numerous zawias they have planted in all the countries
along the south of the Mediterranean and far into the interior of the
continent.
Egypt, as the richest of these countries, was likely to offer the
most promising prize. The fellahin of Egypt, when left to themselves,
are far too much taken up in cultivating their land to trouble
themselves about politics, and though of a religious turn of mind, are
not fanatical. But, as recent events have shown, they are capable of
being stirred up by agitators to a dangerous extent.
I several times heard the Senussi question discussed in Egypt.
Opinions on its seriousness varied greatly. Some loudly and
positively asserted that the threat of a Senussi invasion was only a
bugbear, and, like every bugbear, more like its first syllable than its
second. But there were others who relapsed into silence or changed
the subject whenever it was mentioned. It was, however, certain that
with the small force we at that time possessed in the country, an
attempt to invade Egypt by the Senussi accompanied, as it was
almost certain it would have been, by a rising engineered by them
among the natives of the Nile Valley, would have caused a
considerable amount of trouble.
The appearance of a Mahdi—if he is not scotched in time—may
set a whole country in a ferment. Not infrequently some local
religious celebrity will proclaim himself the Mahdi and gain perhaps a
few followers; but his career is usually shortlived. Occasionally,
however, one arrives on the scene, who presents a serious problem
—such, for instance, as the well-known Mahdi of the Sudan, and the
lesser known, but more formidable, Mahdi of the Senussi sect.
The latter, though he seems to have been a capable fellow, was a
theatrical mountebank, who preferred to surround himself with an
