Galit Shmueli
National Tsing Hua University
Peter C. Bruce
statistics.com
Amit V. Deokar
University of Massachusetts Lowell
Nitin R. Patel
Cytel, Inc.
This edition first published 2023
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
law. Advice on how to obtain permission to reuse material from this title is available at
http://www.wiley.com/go/permissions.
The right of Galit Shmueli, Peter C. Bruce, Amit V. Deokar, and Nitin R. Patel to be identified as the authors of
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at
www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that
appears in standard print versions of this book may not be available in other formats.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or
its affiliates in the United States and other countries and may not be used without written permission. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product
flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to
review and evaluate the information provided in the package insert or instructions for each chemical, piece of
equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and
for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this
work, they make no representations or warranties with respect to the accuracy or completeness of the contents of
this work and specifically disclaim all warranties, including without limitation any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives, written sales materials or promotional statements for this work. The fact that an organization,
website, or product is referred to in this work as a citation and/or potential source of further information does not
mean that the publisher and authors endorse the information or services the organization, website, or product
may provide or recommendations it may make. This work is sold with the understanding that the publisher is not
engaged in rendering professional services. The advice and strategies contained herein may not be suitable for
your situation. You should consult with a specialist where appropriate. Further, readers should be aware that
websites listed in this work may have changed or disappeared between when this work was written and when it is
read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages,
Hardback: 9781119828792
PART I PRELIMINARIES
CHAPTER 1 Introduction 3
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Core Ideas in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 20
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Association Rules and Recommendation Systems . . . . . . . . . . . . . . . . . 20
Predictive Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Data Reduction and Dimension Reduction . . . . . . . . . . . . . . . . . . . . 21
Data Exploration and Visualization . . . . . . . . . . . . . . . . . . . . . . . . 21
Supervised and Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . 22
2.3 The Steps in a Machine Learning Project . . . . . . . . . . . . . . . . . . . . . 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Data Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Example 1: Boston Housing Data . . . . . . . . . . . . . . . . . . . . . . . . 65
Example 2: Ridership on Amtrak Trains . . . . . . . . . . . . . . . . . . . . . . 66
3.3 Basic Charts: Bar Charts, Line Charts, and Scatter Plots . . . . . . . . . . . . . 66
Distribution Plots: Boxplots and Histograms . . . . . . . . . . . . . . . . . . . 69
Heatmaps: Visualizing Correlations and Missing Values . . . . . . . . . . . . . . 72
3.4 Multidimensional Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Adding Attributes: Color, Size, Shape, Multiple Panels, and Animation . . . . . . 75
Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, and Filtering . . 78
Reference: Trend Lines and Labels . . . . . . . . . . . . . . . . . . . . . . . . 81
Scaling Up to Large Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Multivariate Plot: Parallel Coordinates Plot . . . . . . . . . . . . . . . . . . . . 83
Interactive Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5 Specialized Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Visualizing Networked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Visualizing Hierarchical Data: Treemaps . . . . . . . . . . . . . . . . . . . . . 89
Visualizing Geographical Data: Map Charts . . . . . . . . . . . . . . . . . . . . 90
3.6 Summary: Major Visualizations and Operations, by Machine Learning Goal . . . . 92
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Time Series Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.2 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Example 1: House Prices in Boston . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 Data Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Aggregation and Pivot Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.6 Reducing the Number of Categories in Categorical Attributes . . . . . . . . . . . 105
4.7 Converting a Categorical Attribute to a Numerical Attribute . . . . . . . . . . . 107
4.8 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Example 2: Breakfast Cereals . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Normalizing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Using Principal Components for Classification and Prediction . . . . . . . . . . . 117
4.9 Dimension Reduction Using Regression Models . . . . . . . . . . . . . . . . . . 117
4.10 Dimension Reduction Using Classification and Regression Trees . . . . . . . . . . 119
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
Text Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
Presence/Absence vs. Frequency (Occurrences) . . . . . . . . . . . . . . . . . . 597
Term Frequency–Inverse Document Frequency (TF-IDF) . . . . . . . . . . . . . . 598
From Terms to Concepts: Latent Semantic Indexing . . . . . . . . . . . . . . . . 600
Extracting Meaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
From Terms to High-Dimensional Word Vectors: Word2Vec . . . . . . . . . . . . 601
21.5 Implementing Machine Learning Methods . . . . . . . . . . . . . . . . . . . . 602
21.6 Example: Online Discussions on Autos and Electronics . . . . . . . . . . . . . . 602
Importing the Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
Data Preparation and Labeling the Records . . . . . . . . . . . . . . . . . . . . 603
Text Preprocessing in RapidMiner . . . . . . . . . . . . . . . . . . . . . . . . 605
Producing a Concept Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Fitting a Predictive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
21.7 Example: Sentiment Analysis of Movie Reviews . . . . . . . . . . . . . . . . . . 607
Data Loading, Preparation, and Partitioning . . . . . . . . . . . . . . . . . . . 607
Generating and Applying Word2vec Model . . . . . . . . . . . . . . . . . . . . 609
Fitting a Predictive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
Using a Pretrained Word2vec Model . . . . . . . . . . . . . . . . . . . . . . . 611
21.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
Datasheets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
Audit Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
22.7 Example: Applying the RDS Framework to the COMPAS Example . . . . . . . . . . 631
Unanticipated Uses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Ethical Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Protected Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Data Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Fitting the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Auditing the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
Bias Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
22.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
PART IX CASES
CHAPTER 23 Cases 647
References 683
Converting data into an asset is the new business imperative facing modern managers. Each day the gap between what analytics capabilities make
possible and companies’ absorptive capacity of creating value from such capa-
bilities increases. In many ways, data is the new gold—and mining this gold to
create business value in today’s context of a highly networked and digital society
engineering programs on their own. For those businesses and organizations that
feel overwhelmed by today’s big data, the phrase you ain’t seen nothing yet comes
to mind. Yesterday’s three major sources of big data—the 20+ years of invest-
ment in enterprise systems (ERP, CRM, SCM, etc.), the 3 billion plus people
on the online social grid, and the close to 5 billion people carrying increasingly
The idea that we can use sensors to connect physical objects such as homes,
automobiles, roads, and even garbage bins and streetlights to digitally optimized
systems of governance goes hand in glove with bigger data and the need for
deeper analytical capabilities. We are not far away from a smart refrigerator
sensing that you are short on, say, eggs, populating your grocery store’s mobile
app’s shopping list, and arranging a Task Rabbit to do a grocery run for you.
meal to you. Nor are we far away from sensors embedded in roads and vehicles
that can compute traffic congestion, track roadway wear and tear, record vehicle
use, and factor these into dynamic usage-based pricing, insurance rates, and even
taxation. This brave new world is going to be fueled by analytics and the ability
this new wave. This new Business Analytics discipline requires individuals who
are grounded in the fundamentals of business such that they know the right
questions to ask; who have the ability to harness, store, and optimally process
vast datasets from a variety of structured and unstructured sources; and who can
then use an array of techniques from machine learning and statistics to uncover
new insights for decision-making. Such individuals are a rare commodity today,
xxii FOREWORD BY RAVI BAPNA
but their creation has been the focus of this book for a decade now. This book’s
forte is that it relies on explaining the core set of concepts required for today’s
tion for Business Analytics, the notion of linking the x’s to the y’s of interest in
a predictive sense. I say this with the confidence of someone who was probably
the first adopter of the zeroth edition of this book (Spring 2006 at the Indian
School of Business).
After the publication of the R and Python editions, the new RapidMiner
barriers for managers to adopt analytics. The new edition also covers causal
is now becoming mainstream in the tech companies. Further, the authors have
more on deep learning and beefed up deep learning examples in the text mining
and forecasting chapters. These updates make this new edition “state of the art”
RAVI BAPNA
This textbook first appeared in early 2007 and has been used by numerous students and practitioners and in many courses, including our own experience teaching this material both online and in person for more than 15 years.
The first edition, based on the Excel add-in Analytic Solver Data Mining
JMP edition, an R edition, a Python edition, and now this RapidMiner edition,
This new RapidMiner edition relies on the open source machine learning
platform RapidMiner Studio (generally called RapidMiner), which offers both free
and commercially licensed versions. We present output from RapidMiner, as well
as the processes used to produce that output. We show the specification of the
appropriate operators from RapidMiner and some of its key extensions. Unlike
ance, and RapidMiner process and output screenshots, this edition also incor-
porates updates and new material based on feedback from instructors teaching
MBA, MS, undergraduate, diploma, and executive courses, and from their
students.
(Chapter 19).
xxiv PREFACE TO THE RAPIDMINER EDITION
include further metrics (precision and recall, F1, AUC) and the SMOTE
oversampling method.
ensembles.
• A new case (Loan Approval) that touches on regulatory and ethical issues.
A note about the book’s title: The first two editions of the book used
the title Data Mining for Business Intelligence. Business intelligence today refers
business analytics has taken over “advanced analytics,” which includes predictive analytics and data mining. Later editions were therefore renamed
has made the term machine learning more popularly associated with the meth-
ods in this textbook. In this new edition, we therefore use the updated
scape of the courses using the textbook has greatly expanded: whereas initially
tive education programs. Courses in such programs also vary in their dura-
tion and coverage. In many cases, this textbook is used across multiple courses.
elective, would cover Parts I–III, and choose a subset of methods from Parts IV
and V. Instructors can choose to use cases as team assignments, class discussions,
For a set of courses in a dedicated business analytics program, here are a few
of courses. The first course would cover Parts I–III, and instructors typically
and clustering (Parts II and VI). If this course follows the Predictive
on Part VI.
Advanced analytics: A course that integrates the learnings from predictive ana-
lytics (supervised and unsupervised learning) can focus on Part VIII: Data
Analytics, where social network analytics and text mining are introduced,
and responsible data science is discussed. Such a course might also include
AutoML from Part IV, as well as Part V, which covers experiments, uplift
where data are either collected by students according to their interest or provided
by the instructor (e.g., from the many machine learning competition datasets
available). From our experience and other instructors’ experience, such projects
understand the strengths of machine learning and the challenges that arise in the
process.
2022
Acknowledgments
We thank the many people who assisted us in improving the book from its
now Analytic Solver), its reincarnation as Data Mining for Business Analytics, and
now Machine Learning for Business Analytics, including translations in Chinese and
Korean and versions supporting Analytic Solver Data Mining, R, Python, SAS
Anthony Babinec, who has been using earlier editions of this book for years
in his data mining courses at Statistics.com, provided us with detailed and expert
corrections. Dan Toy and John Elder IV greeted our project with early enthu-
siasm and provided detailed and useful comments on initial drafts. Ravi Bapna,
who used an early draft in a data mining course at the Indian School of Busi-
ness and later at University of Minnesota, has provided invaluable comments and
Many of the instructors, teaching assistants, and students using earlier edi-
tions of the book have contributed invaluable feedback both directly and indi-
learning projects that have helped shape and improve the book. These include
MBA students from the University of Maryland, MIT, the Indian School of
too numerous to list, have supported and helped improve the book since its
inception. Scott Nestler has been a helpful friend of this book project from the
beginning.
unstinting in his assistance, support, and detailed attention. We also thank Anuja
many instructors and students through the Statistics.com courses that have helped
Colleagues and family members have been providing ongoing feedback and
assistance with this book project. Vijay Kamble at UIC and Travis Greene
learning. Boaz Shmueli and Raquelle Azran gave detailed editorial comments
and suggestions on the first two editions; Bruce McCullough and Adam Hughes
did the same for the first edition. Noa Shmueli provided careful proofs of the
third edition. Ran Shenberger offered design tips. Che Lin and Boaz Shmueli
ing firm HaystaqDNA and director of targeting for the 2004 Kerry campaign
and the 2008 Obama campaign, provided the scenario and data for the section
on uplift modeling. We also thank Jen Golbeck, director of the Social Intelli-
gence Lab at the University of Maryland and author of Analyzing the Social Web,
whose book inspired our presentation in the chapter on social network analytics.
Inbal Yahav and Peter Gedeck, co-authors of the R and Python editions, helped
improve the social network analytics and text mining chapters. Randall Pruim
of the R edition, helped improve the social network analytics and text mining
chapters.
Marietta Tretter at Texas A&M shared comments and thoughts on the time
series chapters, and Stephen Few and Ben Shneiderman provided feedback and
Susan Palocsay and Mia Stephens have provided suggestions and feedback
Shapiro, founder of KDNuggets.com, has been generous with his time and
We thank colleagues at the MIT Sloan School of Management for their sup-
port during the formative stage of this book—Dimitris Bertsimas, James Orlin,
Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teach-
ing assistants for the data mining course at Sloan, Adam Mersereau gave detailed
comments on the notes and cases that were the genesis of this book, Romy
Shioda helped with the preparation of several cases and exercises used here, and
Shrivardhan Lele, Wolfgang Jank, and Paul Zantek provided practical advice and
dents Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable
datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the
for their encouragement and support in developing data analytics courses at the
undergraduate and graduate levels that led to the development of this edition:
Luvai Motiwalla, Harry Zhu, Thomas Sloan, Bob Li, and Sandra Richtermeyer.
We also thank Michael Goul (late), Dan Power (late), Ramesh Sharda, Babita
Gupta, Ashish Gupta, Uday Kulkarni, and Haya Ajjan from the Association for
Anand Bodapati provided both data and advice. Jake Hofman from Micro-
soft Research and Sharad Borle assisted with data access. Suresh Ankolekar and
Mayank Shah helped develop several cases and provided valuable pedagogical
comments. Vinni Bhandari helped write the Charles Book Club case.
sions on the relationship between statistics and machine learning. Our thanks to
Richard Larson of the Engineering Systems Division, MIT, for sparking many
Over two decades ago, they helped us develop a balanced philosophical perspec-
and help. We are appreciative of the vibrant online RapidMiner Community for
being an extremely helpful resource. We also thank Ingo Mierswa, the Founder
Lastly, we thank the folks at Wiley for the 15-year successful journey of
this book. Steve Quigley at Wiley showed confidence in this book from the
beginning and helped us navigate through the publishing process with great
speed. Curt Hinrichs’ vision, tips, and encouragement helped bring this book
to the starting gate. Brett Kurzman has taken over the reins and is now
shepherding the project with the help of Kavya Ramu and Sarah Lemore.
Katrina Maceda greatly assisted us in pushing ahead and finalizing earlier editions.
We are also especially grateful to Amy Hendrickson, who assisted with typeset-
Preliminaries
CHAPTER 1
Introduction
Britain had launched a web version of the paper and did tests on its home page
to determine which images produced more hits: cats, dogs, or monkeys. This
simple application, for this company, was considered analytics. By contrast, the
Washington Post has a highly influential audience that is of interest to big defense
contractors: it is perhaps the only newspaper where you routinely see adver-
tisements for aircraft carriers. In the digital environment, the Post can track
readers by time of day, location, and user subscription information. In this fash-
ion, the display of the aircraft carrier advertisement in the online paper may be
and Senate Armed Services Committees who will be voting on the Pentagon’s
budget.
analysis methods. Many powerful applications involve little more than counting,
rule checking, and basic arithmetic. For some organizations, this is what is meant
by analytics.
The next level of business analytics, now termed business intelligence (BI),
refers to data visualization and reporting for understanding “what happened and
display, examine, and explore data. BI, which earlier consisted mainly of gener-
ating static reports, has evolved into more user-friendly and effective tools and
Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner, First Edition.
© 2023 John Wiley & Sons, Inc. Published 2023 by John Wiley & Sons, Inc.
practices, such as creating interactive dashboards that allow the user not only
to access real-time data but also to directly interact with it. Effective dash-
boards are those that tie directly into company data and give managers a tool
to quickly see what might not readily be apparent in a large complex database.
One such tool for industrial operations managers displays customer orders in a
single two-dimensional display, using color and bubble size as added variables,
showing customer name, type of product, size of order, and length of time to
produce.
ysis methods, such as statistical models and machine learning algorithms used for
and predicting new records. Methods like regression models are used to describe
and quantify “on average” relationships (e.g., between advertising and sales), to
predict new records (e.g., whether a new patient will react positively to a med-
ication), and to forecast future values (e.g., next week’s web traffic).
Readers familiar with earlier editions of this book might have noticed that
the book title changed from Data Mining for Business Intelligence to Data Mining
for Business Analytics and, finally, in this edition to Machine Learning for Business
Analytics. The change reflects the more recent term BA, which overtook the
visualization and reporting. The change from data mining to machine learning
reflects today’s common use of machine learning to refer to algorithms that learn
from data. This book uses primarily the term machine learning.
The widespread adoption of predictive analytics, coupled with the accelerating avail-
ability of data, has increased organizations’ capabilities throughout the economy.
A few examples are as follows:
The business analytics toolkit also includes statistical experiments, the most
common of which is known to marketers as A/B testing. These are often used
• Orbitz, the travel site, found that it could price hotel options higher for
• Staples online store found it could charge more for staplers if a customer
are hot areas, decides that her organization must deploy them too, to capture
that hidden value that must be lurking somewhere. Successful use of analytics
ods that go beyond counts, descriptive techniques, reporting, and methods based
monly the first step into more advanced analytics, the book focuses mostly on
the more advanced data analytics tools. Specifically, it includes statistical and
level. Rather than “what is the relationship between advertising and sales,” we
ent marketing treatment and then assigning each new prospect to one of these
personas.
The era of big data has accelerated the use of machine learning. Machine
learning algorithms, with their power and automaticity, have the ability to cope
The older term data mining means different things to different people. To
the general public, it may have a general, somewhat hazy and pejorative meaning
are predictive analytics , predictive modeling , and most recently machine learning and
communities, use the term AI to refer to all the methods discussed in this book.
AI originally referred to the general capability of a machine to act like a human,
and, in its earlier days, existed mainly in the realm of science fiction and the unre-
the methods of statistical and machine learning discussed in this book, as the pri-
mary enablers of that grand vision, and sometimes the term is used loosely to
mean the same thing as machine learning. More broadly, it includes generative
for a long time in the world of statistics: linear regression, logistic regression,
ever, the core tenets of classical statistics---computing is difficult and data are
scarce---do not apply in machine learning applications where both data and com-
This gives rise to Daryl Pregibon’s description of “data mining” (in the
Another major difference between the fields of statistics and machine learning
increase is 1 box, while for person j it is 3 boxes.” The emphasis that classi-
result might have happened by chance in our sample) is absent from machine
learning. Note also that the term inference is often used in the machine learn-
for new data, also called scoring, in contrast to its meaning in the statistical
community.
the question being addressed that classical statistical inference would require.
1.4 Big Data
ities as well. In engineering terms, the model is fitting the noise, not just the
signal.
In this book, we use the term machine learning algorithm to refer to methods
that learn directly from data, especially local patterns, often in layered or itera-
tive fashion. In contrast, we use statistical models to refer to methods that apply
global structure to the data that can be written as a simple mathematical equation.
regression in accord with an overall linear equation that applies to all the records.
In k-nearest neighbors, that record would be classified in accord with the values of the records nearest to it.
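This contrast can be sketched in a few lines of code. The sketch below uses Python with scikit-learn and made-up data, purely as an illustration (the book's own examples use RapidMiner): linear regression commits to one global equation fitted to all records, while k-nearest neighbors lets the closest records drive each prediction.

```python
# Global structure (linear regression) vs. local patterns (k-nearest neighbors)
# on toy data that rises and then flattens out after x = 5.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60).reshape(-1, 1)
y = np.where(x.ravel() < 5, x.ravel(), 5.0) + rng.normal(0, 0.1, 60)

lin = LinearRegression().fit(x, y)                  # one equation for all records
knn = KNeighborsRegressor(n_neighbors=5).fit(x, y)  # each prediction uses 5 neighbors

x_new = np.array([[9.0]])
print("linear regression:", lin.predict(x_new)[0])    # pulled up by the global slope
print("5-nearest neighbors:", knn.predict(x_new)[0])  # stays near the local level
```

On the flat stretch of the data, the global line overshoots while the local method tracks the nearby records; neither approach is universally better, which is why both families of methods appear in this book.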
to deal with them. The challenge big data presents is often characterized by the
four V’s---volume, velocity, variety, and veracity. Volume refers to the amount of
data. Velocity refers to the flow rate---the speed at which it is being generated
and changed. Variety refers to the different types of data being generated (time
stamps, location, numbers, text, images, etc.). Veracity refers to the fact that
signing up for services or free downloads) and not subject to the controls or
Most large organizations face both the challenge and the opportunity of
big data because most routine data processes now generate data that can be
stored and, possibly, analyzed. The scale can be visualized by comparing the
data in a traditional statistical analysis (e.g., 15 variables and 5000 records) to the
Walmart database. If you consider the traditional statistical study to be the size
of a period at the end of a sentence, then the Walmart database is the size of a
football field. Moreover, that probably does not include other data associated
with Walmart---social media data, for example, which comes in the form of
unstructured text.
• OKCupid, the online dating site, uses statistical models with their data
response.
type.
The examples above are from Eric Siegel’s book Predictive Analytics (2013, Wiley).
Some extremely valuable tasks were not even feasible before the era of big
data. Consider web searches, the technology on which Google was built. In early
days, a search for “Ricky Ricardo Little Red Riding Hood” would have yielded
various links to the I Love Lucy TV show, other links to Ricardo’s career as a band
leader, and links to the children’s story of Little Red Riding Hood. Only once
the Google database had accumulated sufficient data (including records of what
users clicked on) would the search yield, in the top position, links to the specific
I Love Lucy episode in which Ricky enacts, in a comic mixture of Spanish and
profession: the data scientist. Data science is a mix of skills in the areas of statistics,
machine learning, math, programming, business, and IT. The term itself is thus
broader than the other concepts we discussed above, and it is a rare individual
who combines deep skills in all the constituent areas. In their book Analyzing
the Analyzers (Harris et al., 2013), the authors describe the skillsets of most data
scientists as resembling a “T”---deep in one area (the vertical bar of the T) and
2014), most attendees felt that programming was an essential skill, though there
was a sizable minority who felt otherwise. Also, although big data is the motivat-
ing power behind the growth of data science, most data scientists do not actually
Data of the terabyte or larger size would be involved at the deployment stage
of a model. There are manifold challenges at that stage, most of them IT and
programming issues related to data handling and tying together different compo-
nents of a system. Much work must precede that phase. It is that earlier piloting
and prototyping phase on which this book focuses---developing the statistical and
machine learning models that will eventually be plugged into a deployed system.
What methods do you use with what sorts of data and problems? How do the
methods work? What are their requirements, their strengths, their weaknesses? How do you assess their performance?
1.6 Why Are There So Many Different Methods?
As can be seen in this book or any other resource on machine learning, there are many different methods for prediction and classification. You might ask yourself why they coexist and whether some are better than others. The answer is that
is shown in Figure 1.1, where the goal is to find a combination of household income level and household lot size that separates owners (solid circles) from nonowners (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate owners from nonowners; the second
FIGURE 1.1 TWO METHODS FOR SEPARATING OWNERS FROM NONOWNERS
Different methods can lead to different results, and their performance can
vary. It is therefore customary in machine learning to apply several different
methods and select the one that is most useful for the goal at hand.
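The contrast between the two panels of Figure 1.1 can be mimicked in a few lines of code. Below is a minimal sketch, with made-up data points rather than the book's actual riding-mower dataset, that applies two rules (one using only a single income threshold, the other a linear combination of income and lot size) and compares their accuracy on the same labeled records.

```python
# Toy illustration of why multiple methods coexist: two classifiers,
# applied to the same made-up data, can perform differently.
# Each record: (income in $000s, lot size in 000s sqft, owner?)
data = [
    (60.0, 18.4, True), (85.5, 16.8, True), (64.8, 21.6, True),
    (51.0, 14.0, False), (49.2, 17.6, False), (66.0, 13.2, False),
]

def rule_threshold(income, lot_size):
    # Method 1: only vertical/horizontal splits -- here a single
    # vertical line at income = 60.
    return income >= 60.0

def rule_linear(income, lot_size):
    # Method 2: a diagonal boundary combining both attributes.
    return 0.5 * income + 2.0 * lot_size >= 64.0

def accuracy(rule):
    hits = sum(rule(inc, lot) == owner for inc, lot, owner in data)
    return hits / len(data)

print(f"threshold rule accuracy: {accuracy(rule_threshold):.2f}")
print(f"linear rule accuracy:    {accuracy(rule_linear):.2f}")
```

On these particular points the diagonal rule separates the classes perfectly while the threshold rule misclassifies one record; on a different dataset the ranking could reverse, which is why several methods are usually tried.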
1.7 Terminology and Notation
Because of the hybrid parentry of data science, its practitioners often use multiple
terms to refer to the same thing. For example, in the machine learning and
artificial intelligence fields, the variable being predicted is the output variable or
10 1 INTRODUCTION
Binary attribute A variable that takes on two discrete values (e.g., fraud and
non-fraud transactions); also called binominal attribute.
Case See Record.
Categorical attribute See Nominal attribute.
Class label See Label.
Confidence A performance measure in association rules of the type “IF A and
B are purchased, THEN C is also purchased.” Confidence is the conditional
probability that C is purchased given that A and B are purchased.
through some mechanism. More broadly, this process is called feature engi-
learning, the term often refers to the process of using a model to make
Nominal attribute A variable that takes on one of several fixed values, for
B has occurred. Read as “the probability that A will occur given that B has
occurred.”
Positive class The class of interest in a binary target attribute (e.g., purchasers
in the target attribute purchase/no purchase); the positive class need not be
favorable.
model. Also called a feature, input variable, independent variable, or, from a
database perspective, a field.
Record The unit of analysis on which the measurements are taken (a customer,
a transaction, etc.); also called example, instance, case, observation, or sample
of observations.
age of a person).
vation.
Score A predicted value or class. Scoring new data means using a model devel-
oped with training data to predict output values in new data.
interest is known and the algorithm “learns” how to predict this value for
book uses the term holdout set instead and uses the term testing to refer to cer-
Training data (or training set) The portion of the data used to fit a model.
Unsupervised learning An analysis in which one attempts to learn patterns
Validation data (or validation set) The portion of the data used to assess
how well the model fits, to adjust models, and to select the best model from
among several models.
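The training/validation/holdout distinction can be made concrete with a small sketch; the 50/30/20 proportions, record count, and seed below are arbitrary choices for illustration, not a prescription.

```python
import random

def partition(n_records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split record indices into training, validation,
    and holdout sets; the holdout set gets the remainder."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    n_train = int(n_records * train_frac)
    n_valid = int(n_records * valid_frac)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train, valid, holdout = partition(10)
print(len(train), len(valid), len(holdout))  # 5 3 2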
well as other machine learning tools. Figure 1.2 outlines machine learning from
a process perspective and where the topics in this book fit in. Chapter numbers
are indicated beside the topic. Table 1.1 provides a different perspective: it
organizes the methods according to the type of response and predictor attributes.
Order of Topics
The book is divided into nine parts: Part I (Chapters 1–2) gives a general
overview of machine learning and its components. Part II (Chapters 3–4) focuses
Although Part III contains only one chapter, we discuss a variety of topics, including predictive performance measures, which
are crucial for the proper evaluation and comparison of supervised learning
methods.
part, the topics are generally organized according to the level of sophistication
of the algorithms, their popularity, and ease of understanding. The final chapter
FIGURE 1.2 MACHINE LEARNING FROM A PROCESS PERSPECTIVE (CHAPTER NUMBERS IN PARENTHESES)
[Figure content: Data preparation, exploration, and reduction: Data preparation (2, 17), Data visualization (3), Dimension reduction (4). Supervised prediction: Linear regression (6), k-Nearest neighbors (7), Regression trees (9), Neural networks (11), Ensembles/AutoML (13). Supervised classification: k-Nearest neighbors (7), Naïve Bayes (8), Classification trees (9), Logistic regression (10), Neural networks (11), Discriminant analysis (12), Ensembles/AutoML (13). Experiments (14): A/B testing, Uplift modeling, Reinforcement learning. Model evaluation and selection: Performance evaluations (5). Model deployment: Score new data, Responsible DS (22).]
TABLE 1.1 MACHINE LEARNING METHODS, BY TYPE OF RESPONSE AND PREDICTOR ATTRIBUTES (CHAPTER NUMBERS IN PARENTHESES)

Supervised, continuous response:
  Continuous predictors: Linear regression (6), Neural nets (11), k-Nearest neighbors (7)
  Categorical predictors: Linear regression (6), Neural nets (11)
Supervised, categorical response:
  Continuous predictors: Logistic regression (10), Neural nets (11), Discriminant analysis (12)
  Categorical predictors: Neural nets (11), Classification trees (9)
Unsupervised, no response:
  Continuous predictors: Principal components (4), Cluster analysis (16), Collaborative filtering (15)
  Categorical predictors: Association rules (15), Collaborative filtering (15)
and user feedback. This single chapter starts with A/B testing, then covers its use in
uplift modeling, and finally expands into reinforcement learning, explaining the
basic ideas and formulations that utilize user feedback for learning best treatment
assignments.
association rules and collaborative filtering (Chapter 15) and cluster analysis
(Chapter 16).
Part VII includes three chapters (Chapters 17–19), with the focus on forecasting
time series. The first chapter covers general issues related to handling
and understanding time series. The next two chapters present two popular forecasting approaches.
Part VIII presents two broad data analytics topics: social network analy-
sis (Chapter 20) and text mining (Chapter 21). These methods apply machine
learning to specialized data structures: social networks and text. The final chapter
on responsible data science (Chapter 22) introduces key issues to consider.
Although the topics in the book can be covered in the order of the chapters,
each chapter stands alone. We advise, however, reading Parts I–III before proceeding.
operators.
RapidMiner helps you get started quickly on machine learning and offers
a variety of operators for analyzing data. The illustrations, exercises, and cases
in this book are written in relation to this software. RapidMiner has extensive
associations and text, forecasting, and data exploration and reduction. It offers
logistic regression, linear regression, and discriminant analysis, all for predictive
and holdout samples and for the deployment of the model to new data. It
tools and data-handling utilities. With its short learning curve, free educational
1.9 USING RAPIDMINER STUDIO 15
an account. Follow the instructions there to download and install the latest
educational-program/.
launching, the primary interface looks similar to that shown in Figure 1.3. The
on these views allows one to toggle between them. The Design and Process
views are the most frequently used. We will also introduce Turbo Prep and
Auto Model views later in the book. The Deployments view is needed only
The Design view is where a data scientist can create machine learning
cess is shown in the figure (the process is elaborated later in Section 2.6).
own parameters. The Design view includes multiple display units called panels,
which can be moved and resized. The default panels include Repository, Oper-
ators, Process, Parameters, and Help. Additional panels can be activated from the
View menu. The View > Restore Default View option from the main menu is
often useful for restoring the default location and size of these panels.
ers data and processes. The Repository panel is useful for managing the repositories
and related artifacts. Upon designing a process in the Process panel, it can
Here option. For example, the process 02-02-West Roxbury Preprocessing is stored
in the processes subfolder in the MLBA repository. To execute a process, the
blue “play” button is used, shown among the icons above the Repository view.
If the process executes properly and the output ports are connected to some
to the Results view for the user to analyze the results. The Operators panel is
able for different tasks. The Parameters panel is used for configuring operators,
rapidminer.com/courses/rapidminer-studio-gui-intro.
Read… operators (such as Read CSV, Read Excel, and more). With this approach,
a local link to the source data is created, and any updates to the source data are
the data into RapidMiner by selecting the data subfolder and then using the
Import Data wizard to store the data there in native RapidMiner format called
ExampleSet. This ExampleSet (data) can then be used in any process with the
Retrieve operator. We use the latter approach in this book.
For more details on importing and loading data in RapidMiner, see
https://academy.rapidminer.com/learning-paths/get-started-with-rapidminer-
and-machine-learning.
RapidMiner Extensions
RapidMiner offers a variety of special-purpose extensions (e.g., Deep Learning)
that provide additional operators for performing specific machine learning tasks.
These extensions can be searched for, installed, and updated from the Marketplace
(Updates and Extensions) option in the Extensions menu. We will use
some such extensions in the book, and we will mention them at appropriate
places.
One such extension worth mentioning here is the Python Scripting extension.
It includes operators such as Python Transformer, Python Learner, Execute Python,
and Python Forecaster that allow Python packages to be used from within Rapid-
Miner. To use the Python extension, you will need an installation of Python and
the required packages. To install Anaconda and integrate it with RapidMiner, first close Rapid-
launched, RapidMiner automatically detects the path for Python, and all installed
Further, Python’s pandas and statsmodels packages are particularly powerful for
handling time series data and time series forecasting. While outside the scope of
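As a taste of how the Execute Python operator is used: it conventionally calls a function named rm_main, passing the ExampleSet wired to its input port as a pandas DataFrame and sending the returned DataFrame to its output port as an ExampleSet. The column names below are invented for illustration.

```python
import pandas as pd

def rm_main(data):
    # `data` arrives from the operator's input port as a pandas
    # DataFrame; the returned DataFrame flows to the output port.
    data = data.copy()
    data["PriceK"] = data["Price"] / 1000  # derived attribute; names are illustrative
    return data
```

Inside RapidMiner this script is pasted into the operator; outside RapidMiner the function can be exercised directly by calling it with any DataFrame that has a Price column.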
(ML), starting from a clear goal definition and ending with model deployment.
The general steps are shown schematically in Figure 2.1. We also discuss issues
of data partitioning, where methods are trained on a set of training data and then
their performance is evaluated on a separate set of holdout data; we also explain
how this practice helps avoid overfitting. Finally, we illustrate the steps of model
2.1 Introduction
In Chapter 1, we saw some very general definitions of business analytics and
methods. The core of this book focuses on what has come to be called predictive
analytics, the tasks of classification and prediction as well as pattern discovery,
Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner, First Edition.
© 2023 John Wiley & Sons, Inc. Published 2023 by John Wiley & Sons, Inc.
19
20 2 OVERVIEW OF THE MACHINE LEARNING PROCESS
of an offer can respond or not respond. An applicant for a loan can repay on time,
A bus in a fleet can be available for service or unavailable. The victim of an illness
The classification is unknown or will occur in the future, with the goal of predicting what
that classification is or will be. Similar data where the classification is known
are used to develop rules, which are then applied to the data with the unknown
classification.
Prediction
Prediction is similar to classification, except that we are trying to predict the
predict a class, but the term prediction in this book refers to the prediction of the
value of a continuous numerical attribute.
Sometimes in the machine learning literature, the terms estimation and regression
are used to refer to the prediction of the value of a continuous attribute, and
prediction may be used for both continuous and categorical data.
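A minimal sketch of predicting a continuous numerical attribute: a model is fit on training data where the target is known, then used to score new records. The house sizes and prices below are made up, and the one-predictor least-squares fit is written out by hand to keep the example self-contained.

```python
# Training data: house size (000s sqft) -> price ($000s); values are made up.
train_size  = [1.0, 1.5, 2.0, 2.5, 3.0]
train_price = [200.0, 250.0, 300.0, 350.0, 400.0]

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# "Training" = fitting the model on records where the target is known.
slope, intercept = fit_line(train_size, train_price)

# "Prediction" (scoring) = applying the fitted model to new records
# whose target value is unknown.
new_sizes = [1.8, 2.2]
predictions = [slope * s + intercept for s in new_sizes]
print(predictions)
```

Classification would look the same except that the scored value is a class label rather than a number.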
between items in large databases. The rules can then be used in a variety of
ways. For example, grocery stores can use such information for product placement.
They can use the rules for weekly promotional offers or for bundling
products. Analyzing symptoms during consecutive hospitalizations can help find
“which symptom is followed by what other symptom” to help predict future symptoms for returning
patients.
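A rule of the form “IF A and B are purchased, THEN C is also purchased” can be scored by its confidence, the conditional probability of the consequent given the antecedent. A minimal sketch, with invented baskets:

```python
# Each transaction is the set of items in one market basket (made-up data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "jam", "milk"},
    {"milk", "jam"},
    {"bread", "milk"},
]

def confidence(antecedent, consequent):
    """Confidence of the rule IF antecedent THEN consequent: the share of
    transactions containing the antecedent that also contain the consequent."""
    has_antecedent = [t for t in transactions if antecedent <= t]
    if not has_antecedent:
        return 0.0
    return sum(consequent <= t for t in has_antecedent) / len(has_antecedent)

# IF bread and butter are purchased, THEN milk is also purchased:
print(confidence({"bread", "butter"}, {"milk"}))
```

Here three baskets contain bread and butter, and two of those also contain milk, so the rule's confidence is 2/3.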
use collaborative filtering, a method that uses individual users’ preferences and tastes
2.2 CORE IDEAS IN MACHINE LEARNING 21
given their historic purchase, rating, browsing, or any other measurable behavior
generates “what goes with what” at the individual user level. Hence, collaborative filtering is a personalized approach to recommendation.
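A bare-bones sketch of the idea, with invented ratings: an unseen item is scored for a user by weighting other users' ratings by how similar their rating histories are (cosine similarity is used here as one common choice, not the only one).

```python
from math import sqrt

# User -> {item: rating}; data are made up for illustration.
ratings = {
    "ann":  {"m1": 5.0, "m2": 4.0, "m3": 1.0},
    "bob":  {"m1": 4.0, "m2": 5.0, "m3": 2.0, "m4": 5.0},
    "cara": {"m1": 1.0, "m2": 2.0, "m3": 5.0, "m4": 1.0},
}

def similarity(u, v):
    """Cosine similarity over the items both users rated."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    others = [u for u in ratings if u != user and item in ratings[u]]
    weights = [similarity(user, u) for u in others]
    if not any(weights):
        return None
    return sum(w * ratings[u][item] for w, u in zip(weights, others)) / sum(weights)

print(predict("ann", "m4"))
```

Because Ann's history resembles Bob's (who liked m4) far more than Cara's (who did not), the predicted rating for m4 lands much closer to Bob's rating.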
Predictive Analytics
Classification, prediction, and, to some extent, association rules and collaborative
filtering are the methods underlying predictive analytics. The term predictive analytics is sometimes used to also include data pattern identification.
the number of attributes is limited, and when large numbers of records can
be grouped into homogeneous groups. For example, rather than dealing with
thousands of product types, an analyst might wish to group them into a smaller
number of groups and build separate models for each group. Or a marketer might
want to classify customers into different “personas” and must therefore group
them. Consolidating a large number of records (or cases) into a smaller set is termed data
reduction. Methods for reducing the number of cases are often called clustering.
Reducing the number of attributes is typically called dimension reduction.
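A tiny k-means-style sketch of clustering records into homogeneous groups (the points, starting centers, and iteration count are arbitrary illustrative choices):

```python
# Minimal k-means sketch: group 2-D records into k homogeneous clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.3)]

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Assign each point to its nearest center (squared distance).
        clusters = [[] for _ in centers]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # Move each center to the mean of its assigned points.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

The six records collapse into two groups, and the two final centers can then stand in for the original records, which is the data-reduction idea in miniature.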
interpretability.
aimed at understanding the global landscape of the data and detecting unusual
values. Exploration is used for data cleaning and manipulation as well as for
charts and dashboards is called data visualization or visual analytics. For numerical
attributes, we use histograms and boxplots to learn about the distribution of their values.