Real-World Machine Learning v8 MEAP
©Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and
other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders.
https://forums.manning.com/forums/real-world-machine-learning
Welcome
Thank you for purchasing the MEAP for Real-World Machine Learning. We’re very excited
about this release and we’re looking forward to getting your feedback in the development of
this book. The aim of this book is to teach you practical aspects of modern machine learning in
an engaging and visual way, from the initial data munging through model building and
advanced optimization techniques.
This book will focus on how you can use and implement machine learning in your projects
while keeping the mathematical and algorithmic details to a minimum. We will introduce the
advantages of particular methods with real-world examples and highlight common pitfalls
based on extensive experience.
We’re initially releasing the first two chapters and Appendix A. Chapter 1 provides an
introduction to machine learning and motivation for the remaining chapters of the book. We
will introduce the essential machine learning workflow that will be the topic of the first part of
the book. The second part of the book will introduce some more advanced techniques that can
boost the accuracy and scalability of your models even further.
Chapter 2 will start our venture into the essential machine learning workflow by talking
about real-world data and how to explore and process various types of data for machine
learning modeling. We will introduce powerful visualization techniques that drive our pre-
processing requirements and later in chapter 3 will influence the choice of modeling algorithm.
The appendix introduces a table of commonly used machine learning algorithms. This table
is meant as a reference for choosing the algorithm that best applies to the data at hand. The
book in general will not limit your machine learning understanding to a few currently popular
algorithms, and the table will serve as a handy guide when choosing among them.
Looking ahead, chapters 3-5 will finish the discussion of the essential machine learning
workflow and introduce model building, evaluation and optimization. Chapters 6-9 will
highlight advanced techniques that can improve your model in various ways, including feature
engineering, clustering and scalability aspects. We will conclude the book with chapter 10 on
interesting current and future developments in the rapidly growing field of machine learning.
As you're reading, we hope you’ll take advantage of the Author Online forum. We’ll be
reading your comments and responding. We appreciate any feedback, as it is very helpful in
the development process.
brief contents
1 What is Machine Learning?
1
What is Machine Learning?
Machine learning (ML) is the science of building systems that automatically learn from data.
ML is a data-driven approach to problem solving, in which the hidden patterns and trends
present within data are first detected and then leveraged to help make better decisions. This
process of automatically learning from data and in turn using that acquired knowledge to
inform future decisions is extremely powerful. Indeed, machine learning is rapidly becoming
the engine which powers the modern data-driven economy.
At the core of machine learning lies a set of complicated algorithms which have been
developed over the course of the past few decades by academics in a diverse set of
disciplines such as statistics, computer science, robotics, computer vision, physics, and
applied mathematics. The aim of this book is not to describe the gory mathematical details
of the underlying algorithms (although we will peel back a few layers of the onion to provide
the reader some insight into the inner workings of the most common ML algorithms).
Rather, the book’s primary purpose is to instruct non-experts on how to properly employ
machine learning in real-world systems and how to integrate ML into computing pipelines.
The book details the end-to-end ML workflow and provides detailed descriptions of each of
the steps required to build and integrate an ML system. Wherever possible, the book uses
real business problems to demonstrate the role of machine learning in solving practical
problems.
In this first chapter, we will use a real business problem–reviewing loan applications–to
demonstrate the advantages of using machine learning over some of the most common
alternatives. We will then introduce the essentials of the basic end-to-end ML workflow–from
initial data ingest to employment of the ML output–which makes up the content of the first
half the book (chapters 2-5). Finally, we introduce the more advanced material in the
second half of the book (chapters 6-9), which teaches how to boost the performance of ML in
practice.
Figure 1.1: Schematic of the loan approval process for the microlending example. For each applicant, a
variety of input data–including application text, applicant metadata, and applicant credit history–all
contribute to your decision of whether or not to approve the loan.
Finally, each new hire is less experienced and slower at processing applications than the last,
and the added stress of managing a team of individuals is wearing on you.
EMPLOY SIMPLE BUSINESS RULES
After your failed experiment of hiring additional analysts, you decide that to grow your
business you need to change your application examination process by automating some of
the decision-making process. Since you have already been in business for a few months,
you now have seen the outcomes of many of the loans that were approved (Figure 1.2).
Imagine that of the 1000 loans whose repayment date has passed, 70% were repaid on
time. This set of training data–loans for which the ultimate outcome of interest is known–is
extremely valuable for beginning to build automation into your loan approval system.
Using the 1000 loans whose repayment status is known, you are now in the position to
begin looking for trends. In particular, you perform a manual search for a set of filtering
rules that produce a subset of “good” loans that were primarily paid on time. Then, when
new applications come in, you can quickly filter them through these hard-coded business
rules to reduce the number of applications which must be vetted by hand.
So how do we come up with these decision rules? Luckily, through the process of
manually analyzing hundreds of applications, you’ve gained extensive knowledge about what
makes each application good or bad.¹ Through some introspection and back-testing against
loan repayment status, you’ve noticed a few trends in the credit background check data²:
• most borrowers with a credit line of more than $7500 defaulted on their loan, and
• most borrowers that had no checking account repaid their loan on time.
Using those two rules, you can now design a filtering mechanism to pare down the number
of applications that you need to process manually.
¹ Note that you could also use statistical correlation techniques to determine which input data attributes are most
strongly associated with the outcome event of loan repayment.
² In this example, we use the German credit data set as example credit-check data. These data are available for
download at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
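As a sketch (not the book's code; the application field names here are invented), those two
rules could be hard-coded like this:

def triage(application):
    """Apply the two hand-coded business rules to a new application."""
    if application["credit_line"] > 7500:
        return "reject"           # rule 1: high credit line, likely to default
    if not application["has_checking_account"]:
        return "accept"           # rule 2: no checking account, likely to repay
    return "manual review"        # everything else still goes to an analyst

print(triage({"credit_line": 9000, "has_checking_account": True}))   # reject
print(triage({"credit_line": 3000, "has_checking_account": False}))  # accept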
Figure 1.2: After a few months of business, you now have observed the outcome of 1000 loans that were
approved. This information is critical to start building automation into your loan approval process.
Your first filter is to automatically reject any applicant that has a credit line of more than
$7500. Looking through your historical data, you find that 44 of the 86 applicants with a
credit line of more than $7500 defaulted on their loan. So, roughly 51% of these high-credit
line applicants defaulted, compared to 28% of the rest. This filter seems like a good way to
exclude high-risk applicants, but you realize that only 8.6% of your accepted applicants had
a credit line that was so high, meaning that you’ll still need to manually process more than
90% of applications. You need to do some more filtering to get that number down to
something more manageable.
Your second filter is to automatically accept any applicant that does not have a checking
account. This seems to be an excellent filter, as 348 of the 394 applicants (88%) who did
not have a checking account repaid their loans on time. Including this second filter
brings the percentage of applications that are automatically accepted or rejected up to 45%.
Thus, you only need to manually analyze roughly half of the new incoming applications.
These filtering rules are demonstrated in Figure 1.3.
With these two business rules, you can now scale your business to twice the volume
without having to hire a second analyst, since you now only need to manually
accept or reject 52% of new applications. Additionally, based on the training set of 1000
applications with known outcome, you expect your filtering mechanism to erroneously reject
42 out of every 1000 applications (4.2%) and to erroneously accept 46 of every 1000
applications (4.6%).
These error rates might be acceptable for now, but perhaps in the future (as demand
grows) we will require a smaller rate of erroneously accepting applications (this is called the
false positive rate). Under a manual filtering process there is no way of lowering these error
rates besides adding additional business rules. This can be a debilitating drawback.
Figure 1.3: Example filtering of new applications through two business rules. Use of the filters allows us to
manually analyze only 52% of the incoming applications.
Further, as business grows, you will require your system to automatically accept or reject
a larger and larger percentage of applications. To do this, you again need to add more
business rules. You will soon find that this kind of system is not a sustainable business
process, as a number of problems will arise:
1. Manually finding effective filters becomes harder and harder–if not impossible–as the
filtering system grows in complexity.
2. The business rules become so complicated and opaque that debugging them and
ripping out old, irrelevant rules becomes virtually impossible.
3. There is no statistical rigor in the construction of your rules. You are pretty sure that
better “rules” can be found by better exploration of the data.
4. As the patterns of loan repayment change over time–perhaps due to changes in the
population of applicants–the system does not adapt to those changes. To stay up to
date, you would have to manually adjust the rules or build a new system from
scratch.
All of these drawbacks can be traced to a single debilitating weakness in a business rules
approach: The system does not automatically learn from data.
Data-driven systems, from simple statistical models to more sophisticated machine
learning workflows, can overcome these problems.
The completely automated nature of the process will allow your operation to keep pace with
the increasing inflow of applications. Further, unlike business rules, ML learns the optimal
decisions directly from the data without having to arbitrarily hard-code decision rules. This
graduation from rules-based to ML-based decision making means that your decisions will be
more accurate and will improve over time as more loans are made. In short, you can be
assured that your ML system produces optimized decisions with minimal hand-holding.
In machine learning, the data provides the foundation for deriving insights about the
problem at hand. To determine whether or not to accept each new loan application, ML
utilizes historical training data to predict the best course of action for each new application.
To get started with ML for loan approval, you begin by assembling the training data for the
1000 loans that have been granted. This training data consists of the input data for each
loan application along with the known outcome of whether or not each loan was repaid on
time. The input data in turn consist of a set of features–numerical or categorical metrics that
capture the relevant aspects of each application–such as the credit score of the applicant,
their gender, and their current occupation.
Figure 1.4: Basic ML workflow, as applied to the microloan example. Here, historical data are used to train
the machine learning model. Then, as new applications come in, probabilistic predictions of future
repayment are generated instantaneously from the application data.
ML modeling, then, determines how the input data for each applicant can be used to best
predict the loan outcome. By finding and utilizing patterns in the training set, ML produces a
model (you can think of this as a black box, for now) that produces a prediction of the
outcome for each new applicant, based on that applicant’s data.
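For a loan application with three input features x1, x2, x3, a standard logistic regression
model takes the form

$$ P(\text{repaid} \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}} $$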
The optimal values of each of the coefficients of the equation (in this case, β0, β1, β2, and
β3) are learned from the 1000 training data examples.
Parametric models such as logistic regression are quite restricted in how much they can
learn from data. Take the data set in the left-hand panel of Figure 1.5, for example. Here,
there are two input features, and the task is to classify each data point into one of two
classes (red circle or green square). The two classes are separated, in the two-dimensional
feature space, by a non-linear curve, the so-called decision boundary. In the center panel,
we see the result of fitting a logistic regression model to this data set. Clearly, the predicted
decision boundary between the reds and the greens is not close to the true boundary
between the classes (denoted by the blue curve). As a result, the simple statistical model
makes many classification errors (greens incorrectly predicted as reds and vice versa).
The problem here is that the parametric model is attempting to explain a complicated,
non-linear phenomenon with a simple linear model, much like jamming a square peg in a
round hole. What we need are more flexible models that can automatically discover complex
trends and structure in data without being told what the patterns look like. This is where so-
called non-parametric machine learning approaches come to the rescue. In the right-hand
panel of Figure 1.5, we see the result of applying a non-parametric learning algorithm (in
this case, a random forest classifier) to the red/green problem. Clearly, the predicted
decision boundary is much closer to the true boundary and as a result, the classification
accuracy is much higher than that of the parametric model.
Because they attain such high levels of accuracy on complicated, high-dimensional, real-
world data sets, non-parametric ML models are the approach of choice for data-driven
problems. Examples of non-parametric approaches include some of the most widely used
methods in machine learning, such as k-nearest neighbors, kernel smoothing, support vector
machines, decision trees, and ensemble methods. We will describe all of these approaches
later in the book.
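As a minimal sketch of this comparison (using scikit-learn, which the book has not
introduced at this point, and a synthetic data set rather than the one in Figure 1.5):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# two interleaved half-moons: two features, two classes, non-linear boundary
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))

The random forest typically scores noticeably higher here because its decision boundary can
bend with the data, while logistic regression is restricted to a straight line in this
two-feature space.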
Returning to the microlending problem, the best choice for scaling up your business is to
employ a non-parametric ML model. In addition to providing an automated workflow, you’ll
also attain higher accuracy, which directly translates to higher business value. For instance,
imagine that a non-parametric ML model yields a 25% higher accuracy than a logistic
regression approach. In this case, that means that your ML model will make fewer mistakes
on new applications: accepting fewer applicants who will not repay their loan and rejecting
fewer applicants who would have repaid their loan. Overall, this means a higher average
return on the loans that you do make, enabling you to make more loans overall and to
generate higher revenues for your business.
5. Scalable - As your business grows, ML easily scales to handle increased data rates.
Most ML processes are parallelizable, so they can scale out across additional machines
as data volumes grow.
Figure 1.6: The workflow of real-world machine learning systems. Historical input data feed the model
building > evaluation > optimization loop. With the final model we can then make predictions on new data.
Figure 1.7: An example tabular dataset. Rows are instances of the dataset; each column holds a feature of
those instances.
The first thing to notice about tabular data is that individual columns usually include the
same type of data, while each row can mix columns of various types. In figure 1.7 we can
already identify four different types of data: Name is a string variable, Age is an integer
variable, Income is a floating-point variable, and Marital status is a categorical variable
(taking on a discrete number of categories). Such a dataset is called heterogeneous (in
contrast to homogeneous), and in chapter 2 we will explain how and why we coerce some of
these types of data into other types, depending on the particular machine-learning
algorithm at hand.
Real-world data can be “messy” in a variety of other ways. Suppose that a particular
measurement was unavailable for an instance in the data-gathering phase, and there was no
way of going back to find the missing piece of information. In this case our table will contain
a missing value in one or more cells, and this can complicate both model building and
subsequent predictions. In some cases humans are involved in the data-gathering phase,
and we all know how easy it is to make mistakes in repetitive tasks such as data recording.
This can lead to some of the data being flat-out wrong and we will have to be able to handle
such scenarios, or at least know how well a particular algorithm behaves in the presence of
misleading data. We will look closer at methods for dealing with missing and misleading
data in chapter 2, along with other challenging properties of real-world data such as large
dataset sizes and continuously streaming data.
At this point, we will simply regard the ML algorithm as a magic box that performs the
mapping from input features to output data. To build a useful model, we would of course
need more than two rows. One of the advantages of machine learning algorithms, compared
with other widely used methods, is the ability to handle a large number of features. In figure
1.7 there are only four features, of which Person ID and Name are probably not very useful in
predicting the marital status of the person. Some algorithms will be relatively immune to
uninformative features, while others may yield higher accuracy if you leave those features
out. In chapter 3 we will look closer at the various types of algorithms and their performance
on different kinds of problems and datasets.
It is worth noting, however, that valuable information can sometimes be extracted from
seemingly uninformative features. A location feature may not be very informative by itself,
for example, but can lead to informative features such as population density. This type of
data enhancement, called feature extraction, is very important in real-world ML projects and
will be the topic of chapter 6.
With our ML model in hand, we can now make predictions on new data: data where the
target variable is unknown. Figure 1.8 shows this process using the magic box model that we
built in figure 1.7.
The target predictions are returned in the same form as the target appeared in the original
data used to learn the model – also known as the training data. Using the model to make
predictions can be seen as filling out the blank target column of the new data. Some ML
algorithms can also output the probabilities associated with each class. In our married/single
example, a probabilistic ML model would output two values for each new person: the
probability of this person being married and the probability of the person being single.
We left out a few details on the way here, but in principle we have just built our first ML
system. Every machine learning system is about building models and using those models to
make predictions. Let’s take a look at the basic machine learning workflow in pseudo-code,
just to get another view of how simple it actually is:
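(A reconstruction of the listing; only compare_predictions is named elsewhere in the text, so
the other function names here are placeholders.)

def train(features, target):
    # learn a model from the training data (chapter 3)
    ...

def predict(model, new_features):
    # fill out the blank target column for new data
    ...

def compare_predictions(predictions, known_targets):
    # measure how close the predictions come to the truth (chapter 4)
    ...

# model = train(training_features, training_target)
# predictions = predict(model, new_features)
# compare_predictions(predictions, known_targets)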
Of course, we have not filled out any of the functions yet, but the basic structure is in place.
By chapter 3 we will understand most of these functions, and the rest of the book–chapters
4 through 9–is about making sure we are building the very best model for the problem at
hand.
To test the model, we simply take out some of the training data and pretend that we don’t
know the target variable. We then build a model on the remaining data and use the held-out
data (the testing data) to make predictions. Simple, right? Figure 1.9 visualizes this model
testing process.
Figure 1.9: Using a testing set for model performance evaluation. We “pretend” that the target variable is
unknown and compare the predictions with the true values.
We can now compare the predicted results with the known “true” values to get a feeling for
how accurate the model is. In the pseudo-code this functionality is hidden behind the
compare_predictions function, and most of chapter 4 is dedicated to understanding how
this function looks for different types of machine learning problems.
With the machine learning essentials in place, we will look briefly at more advanced
techniques in the next section, before going into more detail on the main components
covered in this section.
• Digital media data, such as text, documents, images, and video. None of these are
directly usable by ML algorithms, but they can be made usable through feature
engineering.
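As a small taste of what such feature engineering can look like (a sketch with a made-up
vocabulary, not a technique from chapter 6 specifically), free text can be mapped to
numerical features by counting word occurrences:

import re
from collections import Counter

def text_features(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

vocab = ["loan", "urgent", "repay"]
print(text_features("Urgent: repay the loan. The loan is urgent!", vocab))
# -> [2, 2, 1]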
Hopefully it’s clear that feature engineering can be important for real-world ML projects. We
will go into much more detail in chapter 6, where we will introduce specific feature
engineering techniques, like the ones listed above, and also talk about how it all feeds into
the ML workflow so that model performance improves without the model becoming too
complex and prone to over-fitting. Figure 1.10 illustrates how feature engineering integrates
into the larger ML workflow introduced in section 1.2.
Although computers are not bound to our three dimensions in a data-handling sense, the
same underlying limitation applies to ML algorithms: the larger the dimensionality, the
harder it is to find meaningful signals. Dimensionality reduction attempts to solve this
problem by reducing many dimensions to a few meaningful ones, providing both a way to
visualize high-dimensional data and a set of new features that make it easier for the ML
model to find the signal, thus increasing performance. Simple, right? Don’t worry if your
head hurts by now; we will go more carefully through these ideas in chapter 7.
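As a quick sketch of the idea, here is principal component analysis (PCA), one common
dimensionality reduction method, applied to synthetic data with scikit-learn (an assumed
dependency):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(size=(500, 3))          # three "meaningful" directions
X = signal @ rng.normal(size=(3, 100))      # embedded in 100 noisy dimensions
X += 0.1 * rng.normal(size=X.shape)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)            # back down to 500 x 3
print(pca.explained_variance_ratio_.sum())  # close to 1: little signal lost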
Another related technique is known as clustering. Clustering can connect rows of the
data into groups based on the raw columns or features. We can use this to discover natural
groupings of the data in the same way as humans often do. Imagine being presented with a
list of 100 movies and asked to group them into at most 5 piles. You’d probably identify the
main genres and other factors that are important to you while you sift through the list and
build out the piles. Now what if you had a million movies? It would take you a long time to
do the sorting, and this is where clustering algorithms can help. By identifying the features
that should be used as input data and running them through a clustering algorithm, the
computer may be able to identify some or all of the piles, and may even discover some that
you hadn’t. While clustering can be used for much more diverse problems, this example is a
good one because it is something many of us have already seen at work. Netflix is a company
that makes money by streaming a huge library of movies, and they have used clustering to
identify interesting piles for their movies. They also ran the Netflix Prize, a competition to
improve their movie recommendations. Figure 1.11 shows an example of connected movies
from one of the participants.
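As a toy sketch of the movie-pile idea (with invented feature values, using k-means from
scikit-learn):

import numpy as np
from sklearn.cluster import KMeans

# one row per movie: [action score, romance score, average rating]
movies = np.array([
    [0.9, 0.1, 4.1],
    [0.8, 0.2, 3.9],
    [0.1, 0.9, 4.5],
    [0.2, 0.8, 4.2],
    [0.5, 0.5, 2.0],
])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(movies)
print(kmeans.labels_)  # the "pile" assigned to each movie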
Figure 1.11: Machine-learned connections between 5000 Netflix movies. The goal is to predict Netflix user
ratings from a training set of over 100 million examples. Image by chef_ele on Flickr
(http://www.flickr.com/photos/chef_ele)
Methods like these are collectively known as unsupervised learning. They can be used
stand-alone, in essence as data exploration processes that reveal structure in the data and
have value in themselves, but they can also be used successfully within the predictive
modeling workflow that we introduced in the previous sections. Netflix is again a good
example: the identified clusters are used stand-alone to improve the user experience, but
they are also used to improve movie recommendations for users, i.e. predicting whether or
not a user would be interested in a movie.
Systems that continuously update their models as new data arrive are said to use online
learning, and we will introduce these techniques and their potential pitfalls in chapter 8.
Figure 1.12 shows how continuous re-learning would be integrated into the ML workflow.
Figure 1.12: Flow of an online ML system where predictions are fed back to the model for iterative
improvements.
1.4 Summary
In this chapter, we introduced machine learning as a better, more data-driven approach to
making decisions. The main points to take away from this chapter are:
• With a real-world example of deciding which loan applicants should be approved, we
demonstrated the advantages of machine learning over both manual analysis and
hard-coded business rules.
To conclude the chapter, we provide the reader with a set of example machine learning
use-cases in Table 1.1. This list is not meant to be exhaustive, but rather is intended to
show the broad applicability of supervised machine learning. Indeed, real-world machine
learning is a driving force in today’s economy.
Table 1.1: Use cases for supervised machine learning, organized by the type of problem.

Classification: Determine the discrete class to which each individual belongs, based on input
data. Examples: spam filtering, sentiment analysis, fraud detection, customer ad targeting,
churn prediction, support case flagging, content personalization, detection of manufacturing
defects, customer segmentation, event discovery, genomics, drug efficacy.

Regression: Predict the real-valued output for each individual, based on input data.
Examples: stock market prediction, demand forecasting, price estimation, ad bid
optimization, risk management, asset management, weather forecasting, sports prediction.

Imputation: Infer the values of missing input data. Examples: incomplete patient medical
records, missing customer data, census data.
2
Real World Data
In supervised machine learning, we use data to teach automated systems how to make
accurate decisions. ML algorithms are designed to discover patterns and associations in
historical training data (definitions of all bolded terminology in Chapter 2 are found in Table
2.2 near the end of the chapter) to learn how to accurately predict a quantity of interest for
new data. Training data, therefore, are fundamental in the pursuit of machine learning.
With high quality data, subtle nuances and correlations can be accurately captured and high-
fidelity predictive systems can be built. However, if training data are of poor quality, then
the efforts of even the best ML algorithms may be rendered useless.
This chapter serves as a guide to the collection and compilation of training data for
supervised machine learning. We will give general guidelines for preparing training data for
ML modeling and warn of some of the common pitfalls. Much of the art of machine learning
is in exploring and visualizing training data to assess data quality and guide the learning
process. To that end, we provide an overview of some of the most useful data visualization
techniques. Finally, we discuss how to prepare a training data set for ML model building,
which is the subject of Chapter 3.
Throughout the chapter, we will use a real-world machine-learning example: churn
prediction. In business, churn refers to the act of a customer canceling or unsubscribing
from a paid service. An important, high-value problem is to predict which customers are
likely to “churn” in the near future. If a company has an accurate idea of which customers
may unsubscribe from their service, then they may intervene by sending a message or
offering a discount. This intervention can save companies millions of dollars, as the typical
cost of new customer acquisition far outpaces the cost of intervention on churners.
Therefore, a machine learning solution to churn prediction–whereby those users who are
likely to churn are predicted weeks in advance–can be extremely valuable.
You’ll notice a few commonalities in these questions. First, they all deal with making
assessments on one or several instances of interest. These instances can be people (such
as in the churn question), events (such as the tweet sentiment question), or even periods of
time (such as in the product demand question). Second, each of these problems has a well-
defined target of interest, which in some cases is binary (e.g., churn vs. not churn, fraud vs.
not fraud), in some cases takes on multiple classes (e.g., negative vs. positive vs. neutral)
and in others takes on numerical values (e.g., product demand). Note that in statistics and
computer science, the target is also commonly referred to as the response or dependent
variable. Those terms may be used interchangeably.
Third, each of these problems can potentially have sets of historical data where the target
is known. For instance, over weeks or months of data collection, you can completely
determine which of your subscribers churned and which people clicked on your ads. With
some manual effort, you can even assess the sentiment of different tweets. In addition to
known target values, your historical data files will contain information about each instance
that is knowable at the time of prediction. These are the so-called input features (also
commonly referred to as the explanatory or independent variables). For example, the
product usage history of each customer along with their demographics and account
information would be appropriate input features for churn prediction. The input features
together with the known values of the target variable compose the training set.
Finally, each of these questions comes with an implied action if the target were knowable.
For example, if you knew that a user would click on your ad, you would bid on that user and
serve them an ad. Likewise, if you knew precisely what your product demand would be for
the upcoming month, you would position your supply chain to match that demand. The role
of the ML algorithm is to use the training set to determine how the set of input features can
most accurately predict the target variable. The result of this ‘learning’ is encoded in a
machine-learning model. When new instances (with unknown target) are observed, their
features are fed into the ML model, which generates predictions on those instances.
Ultimately, those predictions enable the end user to take smarter (and faster) actions. In
addition to producing predictions, the ML model also allows the user to draw inferences about
the relationships between the input features and the target variable.
Let’s put all of this in the context of the churn prediction problem. Imagine that we worked
for a telecom company and that the question of interest was “which of my current cellphone
subscribers will unsubscribe in the next month?” Here, each instance is a current subscriber.
Likewise, the target variable is the binary outcome of whether or not each subscriber
cancelled their service during that month. The input features can consist of any information
about each customer that is knowable at the beginning of the month, such as the current
duration of the account, details on the subscription plan, and usage information like total
number of calls made and minutes used in the previous month. In Figure 2.2, we show the
first four rows of an example training set for telecom churn prediction.
Figure 2.2: Example training data for four instances for the telecom churn problem.
The aim of this section is to give a basic guide to collecting training data properly for
machine learning. Data collection can differ tremendously from industry to industry, but
there are a number of common questions and pain points that arise when assembling
training data. In the following subsections, we give a practical guide to addressing four of
the most common data-collection questions: Which input features should I include? How do
I obtain known values of my target variable? How much training data do I need? And, how
do I know if my training data is good enough?
Before we jump into these other aspects of compiling training data, let’s first take a look
at selecting input features.
More unstructured data, such as ‘calling history’ data streams, can be processed into a
set of numerical and/or categorical features by computing summary statistics on the data,
such as total minutes used, ratio of day/night minutes used, ratio of week/weekend minutes
used, proportion of minutes used in network, etc.
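For instance (a sketch assuming pandas, with invented column names), a raw call log could
be summarized into per-customer features like this:

import pandas as pd

calls = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "minutes":     [12.0, 3.5, 40.0, 2.0, 7.5],
    "daytime":     [True, False, True, True, False],
})

# one row of numerical features per customer
features = calls.groupby("customer_id").agg(
    total_minutes=("minutes", "sum"),
    n_calls=("minutes", "count"),
    day_ratio=("daytime", "mean"),  # fraction of calls made during the day
)
print(features)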
Given such a broad array of possible features, which should we use? As a simple rule of
thumb, features should only be included if they are suspected to be related to the target
variable. Insofar as the goal of supervised ML is to predict the target, features that
obviously have nothing to do with the target should be excluded. For example, if a
distinguishing identification number was available for each customer, it should not be used
as an input feature to predict whether the customer will unsubscribe. Such useless features
make it more difficult to detect the true relationships (signal) from the random perturbations
in the data (noise). The more uninformative features there are, the lower the signal-to-noise
ratio, and thus the less accurate (on average) the ML model will be.
Likewise, excluding an input feature because it was not previously known to be related
to the target can also hurt the accuracy of your ML model. Indeed, it is the role of ML to
discover new patterns and relationships in data! Suppose, for instance, that a feature
counting the number of current unopened voicemail messages was excluded from the feature
set. Yet, some small subset of the population had ceased to check their voicemail because
they decided to change carriers in the following month. This signal would express itself in
the data as a slightly increased conditional probability of churn for customers with a large
number of unopened voicemails. Exclusion of that input feature would deprive the ML
algorithm of important information and therefore would result in an ML system of lower
predictive accuracy. Because ML algorithms are able to discover subtle, non-linear
relationships, features beyond the known ‘first-order’ effects can have a substantial impact
on the accuracy of the model.
In selecting a set of input features to use, we face a trade-off. On one hand, throwing
every possible feature that comes to mind (“the kitchen sink”) into the model can drown out
the handful of features that contain any signal with an overwhelming amount of noise.
Result: the accuracy of the ML model suffers because it cannot distinguish the true patterns
from the random noise. On the other extreme, hand-selecting a small subset of features
which we already know are related to the target variable can cause us to omit other highly
predictive features. Result: the accuracy of the ML model suffers because the model does
not know about the neglected features, which are predictive of the target.
Faced with this tradeoff, the most practical approach is the following:
1. Include all of the features that you suspect to be predictive of the target variable. Fit
a ML model. If the accuracy of the model is sufficient, stop.
2. Else, expand the feature set by including other features that are less obviously related
to the target. Fit another model and assess the accuracy. If performance is
sufficient, stop.
3. Else, starting from the expanded feature set, run a ML feature selection algorithm to
choose the best, most predictive subset of your expanded feature set.
We will further discuss feature selection algorithms in Chapter 5. These approaches seek the
most accurate model built on a subset of the feature set. In essence, they retain the signal
in the feature set while discarding the noise. Though computationally expensive, they can
yield a tremendous boost in model performance.
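To give a flavor of what such an algorithm does, here is a sketch using a simple univariate
filter from scikit-learn (our choice of tool; chapter 5 covers more sophisticated approaches):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 50 features, only 5 of which actually carry signal
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)
selector = SelectKBest(f_classif, k=5).fit(X, y)
X_small = selector.transform(X)             # 500 x 5 matrix
print(selector.get_support(indices=True))   # indices of the kept features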
To finish off this subsection, it is important to note that in order to use an input feature,
it is not required that the feature be present for each instance. For example, if the ages of
your customers were known for only 75% of your client base, you could still use age as an
input feature. We will discuss ways to handle such missing data later in the chapter.
Obtaining ground-truth for the target variable
One of the most difficult hurdles in getting started with supervised machine learning is the
aggregation of training instances with known target variable. This process often requires
running an existing, sub-optimal system for some period of time until enough training data is
collected. For example, in building out a ML solution for telecom churn, we first need to sit
on our hands and watch over several weeks or months as some of our customers
unsubscribe and others renew. Once we have enough training instances to build an accurate
ML model, we can flip the switch and start using ML in production.
Each use case will have a different process by which ground truth can be collected or
estimated. For example, consider the following training data collection processes for a few
selected ML use cases:
• For ad targeting: you can run a campaign for a few days to determine which users did
/ did not click on your ad and which users converted.
• For fraud detection: you can pore over your past data to figure out which users were
fraudulent and which were legitimate.
• For demand forecasting: you can go into your historical supply chain management data
logs to determine the demand over the past months or years.
• For Twitter sentiment: getting information on the true intended sentiment is
considerably harder. You can perform manual analysis on a set of tweets by having
people (or a crowdsourcing service) read and opine on them.
Although the collection of instances of known target variables can be painful, both in
terms of time and money, the benefits of migrating to a ML solution are likely to more than
make up for those losses. There are a number of other ways of obtaining ground truth
values of the target variable, including:
1. Dedicating analysts to manually look through past or current data to determine or
estimate the ground-truth values of the target.
2. Using crowdsourcing to leverage the ‘wisdom of crowds’ in order to attain estimates of
the target.
Each of these strategies is labor-intensive, but there are ways to accelerate the learning
process and shorten the time required to collect training data by collecting only target
variables for the highest-leverage instances. One example of this is a method called active
learning. Given an existing (small) training set and a (large) set of data with unknown
response variable, active learning identifies the subset of instances from the latter set whose
inclusion in the training set would yield the most accurate ML model. In this sense, active
learning can accelerate the production of an accurate ML model by focusing manual
resources. We will return to active learning and related methods in Chapter 8.
• The dimensionality of the feature space. If only 2 input features are available, then
less training data will be required than if there were 2000 features.
One guiding principle to remember is that, as the training set grows the models will (on
average) get more accurate. (This assumes that the data remain representative of the
ongoing data generating process, more on this in next section.) More training data results in
higher accuracy due to the data-driven nature of ML models. Because the relationship
between the features and the target is learned entirely from the training data, the more data
we have, the better we can recognize and capture subtle patterns and relationships.
Using the telecom data from earlier in the chapter, we can demonstrate how the ML
model gets better with more training data and also offer up a strategy to assess whether
more training data are required. The telecom training data set consists of 3,333 instances,
each containing 19 features plus the binary outcome of unsubscribed vs. renewed. Using
this data, it is straightforward to assess if we need to collect more data. Simply do the
following:
1. Using the current training set, choose a grid of subsample sizes to try. For example,
with our telecom training set of 3,333 instances of training data, our grid could be
500, 1000, 1500, 2000, 2500, 3000.
2. For each sample size, randomly draw that many instances (without replacement) from
the training set.
3. With each subsample of training data, build a ML model and assess the accuracy of
that model (we’ll talk about ML evaluation metrics in Chapter 4).
4. Assess how the accuracy changes as a function of sample size. If it seems to level off
at the higher sample sizes, then the existing training set is probably sufficient.
However, if the accuracy continues to rise for the larger samples, then the inclusion of
more training instances would likely boost the accuracy.
Alternatively, if there is a clear accuracy target, you can use this strategy to assess whether
that target has been met by your current ML model built on the existing training data; if so,
it is not necessary to amass more training data.
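Here is a sketch of that subsampling routine in code (it assumes a feature matrix X and a
target array y are already loaded; the choice of classifier is ours for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def accuracy_by_sample_size(X, y, sizes, repeats=10):
    """Average cross-validated accuracy at each training-set size."""
    rng = np.random.default_rng(0)
    results = {}
    for n in sizes:
        scores = []
        for _ in range(repeats):
            idx = rng.choice(len(y), size=n, replace=False)   # step 2
            model = RandomForestClassifier(random_state=0)    # step 3
            scores.append(cross_val_score(model, X[idx], y[idx], cv=5).mean())
        results[n] = (np.mean(scores), np.std(scores))        # inputs to step 4
    return results

# e.g.: accuracy_by_sample_size(X, y, [500, 1000, 1500, 2000, 2500, 3000])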
Figure 2.3 demonstrates how the accuracy of the fitted ML model changes as a function
of the number of training instances, using the telecom data set. In this case, it is clear
that the ML model improves as we add training data: moving from 250 to 500 to 750 training
examples produces significant improvements in the accuracy level. Yet, as we increase the
number of training instances beyond 2,000, the accuracy levels off. This is evidence that the
ML model will not improve substantially if we add more training instances (Note: this does
not mean that significant improvements could not be made by using more features).
Figure 2.3. Testing whether or not our existing sample of 3,333 training instances is enough data to build
an accurate telecom churn ML model. The black line represents the average accuracy over 10 repetitions
of the assessment routine, while the blue bands represent the error bands.
In each of these cases, a ML model fit to the training data may not extrapolate well to new
data. Thus, the accuracy of the model can suffer when applied to incoming data.
To avoid these problems, it is important to attempt to make the training set as
representative of future data as possible. This entails structuring your training data
collection process in such a way that biases are removed. As we mention in the following
section, visualization can also help ensure that the training data are representative. We will
touch again on this topic in Chapter 8, when we discuss active learning as a means to deal
with sample selection biases.
Now that we have an idea of how to collect training data, our next task is to structure
and assemble that data to get ready for ML model building. The next section discusses how
to take your training data and pre-process it so that you can start building models, which will
be the topic of Chapter 3.
Many machine-learning algorithms work only on numerical data: integers and
real-valued numbers. The simplest ML datasets come in this format, but many include other
types of features, such as categorical features, and some might include missing values or be
in need of other kinds of data processing before being ready for modeling. In this section we
will look at a few of the most common data pre-processing steps needed for real-world
machine learning.
Figure 2.4: Identifying categorical features. At the top is our simple person dataset. At the bottom we show
a dataset with information about Titanic passengers. The features identified as categorical here are:
Survived (whether the passenger survived or not), Pclass (the class the passenger was traveling in),
Sex (male or female), and Embarked (the city from which the passenger embarked).
Some machine learning algorithms deal with categorical features natively, but generally
they need data in numerical form. We can encode categorical features as numbers–one
number per category–but we cannot use this encoded data as a true categorical feature, as
we’ve then introduced an (arbitrary) ordering of the categories. Recall that one of the
properties of categorical features is that they are not ordered. What we can do instead is to
convert each category in the categorical feature to a binary feature that is 1 wherever the
category appeared and 0 otherwise. In essence, we convert the categorical feature into a
set of binary features that can
be used by ML algorithms that only support numerical features. Figure 2.5 illustrates this
concept further.
The pseudo-code for converting categorical features to binary features, as illustrated in figure
2.5, would look something like the following.
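A minimal sketch might look like this (the function name cat_to_num is illustrative, and unique
comes from the pylab import described in the note below):

from pylab import *  # brings in numpy functions such as asarray() and unique()

def cat_to_num(data):
    # data: a list or 1-D array holding a single categorical feature
    data = asarray(data)
    categories = unique(data)
    features = []
    for cat in categories:
        binary = (data == cat)                 # boolean mask: True where the category matches
        features.append(binary.astype("int"))  # 1 where the category appeared, 0 elsewhere
    return features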
NOTE: CODE EXAMPLES Readers familiar with the Python programming language may
have noticed that the above example is not just pseudo-code but also valid Python. You
will see this a lot throughout the book: we will introduce a code snippet as pseudo-code,
but unless otherwise noted it will be actual working code. To keep the code simple, we
implicitly import a few helper libraries, such as numpy and scipy. A lot of functionality
from those libraries can be imported easily with the pylab package. If nothing
else is noted, code examples should work after importing everything from pylab: from pylab import *
While this categorical-to-numerical conversion works for most ML algorithms, a few
algorithms, such as certain types of decision trees and related methods like Random Forests,
can deal with categorical features natively. This will often yield better results for highly
categorical datasets, and we will discuss it further in the next chapter, where we look more
closely at matching the algorithm to the data and the performance requirements of the
problem at hand. Our simple person dataset, after conversion of the categorical feature to
binary features, is shown in figure 2.6.
Figure 2.6: The simple person dataset after conversion of categorical features to binary numerical features.
Original dataset shown in figure 2.1.
Figure 2.7: An example of missing values in the Age and Cabin columns of the Titanic passenger dataset.
The passenger information was extracted from various historical sources, so in this case the missing
values stem from information that simply couldn't be found in the sources.
3. To install pylab and many other scientific Python packages, we recommend the Anaconda distribution
from Continuum Analytics. We realize that cluttering the namespace by importing everything from a
package is not ideal, but in these examples we value convenience and readability.
There are two main types of missing data, and we need to handle them in different ways.
First, data can be missing in such a way that the fact that it is missing carries information in
itself, which could be useful for the ML algorithm. The other possibility is that the data is
missing simply because the measurement was impossible, and there is no information in the
reason for the unavailability. In the Titanic passenger dataset, for example, missing values in
the Cabin column may tell us that those passengers were in a different social or economic
class, while missing values in the Age column carry no information; the age of a particular
passenger at the time simply couldn't be found.
Let us first consider the case of meaningful missing data. When we believe that there is
information in the fact that data is missing, we usually want the ML algorithm to be able to
use this information to potentially improve the prediction accuracy. To achieve this, we
convert the missing values into the same format as the rest of the column. For numerical
columns, this can be done by setting missing values to -1 or -999, depending on the typical
range of the non-missing values. Simply pick a number at one end of the numerical spectrum
to denote missing values, and remember that order matters for numerical columns: you don't
want to pick a value in the middle of the distribution.
For a categorical column with meaningful missing data, we can simply create a new
category called "missing", "None", or similar, and then handle the categorical feature in the
usual way, for example using the technique described in the previous section. Figure 2.8
shows a simple diagram of what to do with meaningful missing data.
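As a minimal sketch, assuming a pandas DataFrame and using the Titanic Cabin column (the
numerical column income is hypothetical, included only to show the sentinel approach):

import pandas as pd

def fill_meaningful_missing(data):
    # 'income' is a hypothetical numerical column whose missingness is informative
    data['income'] = data['income'].fillna(-999)     # sentinel at one end of the value range
    # 'Cabin' is categorical: missing simply becomes its own category
    data['Cabin'] = data['Cabin'].fillna('Missing')
    return data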
Missing data where the absence carries no information must be handled differently. In
this case, we cannot simply introduce a special number or category, because we might
introduce data that is flat-out wrong. For example, if we were to change missing values in
the Age column of the Titanic passenger dataset to -1, we would probably hurt the model by
distorting the age distribution for no good reason. Some ML algorithms are able to deal with
these truly missing values by simply ignoring them. If
not, we need to pre-process the data and replace missing values by guessing the true value.
This concept of replacing missing data is called imputation.
There are many ways to impute missing data, but unfortunately no one-size-fits-all
solution. The easiest, and least desirable, way is to simply remove all instances that have
missing values. This not only decreases the predictive power of the model, but also
introduces biases if the missing data are not randomly distributed. Another simple approach
is to assume some temporal order to the data instances and replace missing values with the
column value of the preceding row; with no other information, we are guessing that a
measurement hasn't changed from one instance to the next. Needless to say, this
assumption will often be wrong. For extremely big data, however, more sophisticated
methods are not always feasible, and these simple methods can be useful.
It is usually better to use a larger portion of the existing data to guess the missing
values. To avoid biasing the model, we may replace missing column values with the mean
value of the column; with no other information, this is the guess that on average will be
closest to the truth. The mean is sensitive to outliers, though, so depending on the
distribution of column values we may want to use the median instead. These approaches are
widely used in machine learning today and work well in many cases. But when we set all
missing values to a single new value, we lose any potential correlation with the other
variables, which may make it harder for the algorithm to detect the relevant patterns in the data.
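A sketch of mean and median imputation, assuming a pandas DataFrame with the Titanic
Age column:

import pandas as pd

def impute_age(data):
    # Replace truly missing ages with the column mean...
    data['Age'] = data['Age'].fillna(data['Age'].mean())
    # ...or use the median instead, which is more robust to outliers:
    # data['Age'] = data['Age'].fillna(data['Age'].median())
    return data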
What we really want is to predict the value of the missing variable based on all of the
data and variables available. Does this sound familiar? This is exactly what machine learning
is about, so we are essentially building ML models in order to be able to build ML models. In
practice, you will often use a simple algorithm, such as linear or logistic regression (see
chapter 3), to impute the missing data; it need not be the same as the main ML algorithm.
Either way, you are creating a pipeline of ML algorithms, which introduces more knobs to
turn when optimizing the model in the end.
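A hedged sketch of model-based imputation with scikit-learn's linear regression; the choice of
predictor columns is illustrative, and we assume they contain no missing values themselves:

import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_with_model(data, predictors):
    known = data[data['Age'].notnull()]   # rows where the value to impute is known
    missing = data['Age'].isnull()        # boolean mask of rows to fill
    model = LinearRegression()
    model.fit(known[predictors], known['Age'])
    # Fill only the missing entries with the model's predictions
    data.loc[missing, 'Age'] = model.predict(data.loc[missing, predictors])
    return data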
Again, it is important to realize that there is no single best way to deal with truly missing
data. We've discussed a few different approaches in this section, and figure 2.9 summarizes
the possibilities.
Figure 2.9: Full decision diagram for handling missing values when preparing data for ML modeling.
Figure 2.10: Various Cabin values in the Titanic dataset. Some entries include multiple cabins, while others
are missing, and the cabin identifiers themselves may not be good categorical features.
Living in a particular region of the ship, though, could potentially be important for
survival. For single cabin IDs, we could extract the letter as a categorical feature and the
number as a numerical feature, assuming they denote different parts of the ship. This
doesn't handle multiple cabin IDs, but since all multiple cabins appear to be close to each
other, it will probably be fine to extract only the first cabin ID. We could also include the
number of cabins in a new feature, which could be relevant as well. All in all, we will create
three new features from the Cabin feature. The code for this simple extraction could look
like the following.
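A minimal sketch of such an extraction (the function name and the 'Missing'/-1 sentinels are
illustrative, and we assume Cabin values look like 'C22 C26' or 'E12'):

import pandas as pd

def cabin_features(cabin):
    # Returns (deck letter, cabin number, number of cabins) for one Cabin value
    if pd.isnull(cabin):
        return 'Missing', -1, 0
    cabins = cabin.split()
    first = cabins[0]
    deck = first[0]                                    # the letter: a categorical feature
    number = int(first[1:]) if len(first) > 1 else -1  # the cabin number: a numerical feature
    return deck, number, len(cabins)                   # plus the count of cabins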
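Some algorithms also work best when numerical features are normalized to a common scale.
A minimal sketch of min-max normalization, rescaling a feature to the range [0, 1] (the
function name is illustrative):

from pylab import *  # provides asarray() and other numpy functions

def normalize_feature(data):
    data = asarray(data, dtype=float)
    minimum = data.min()
    factor = data.max() - minimum        # the span used for rescaling
    normalized = (data - minimum) / factor
    # Return the factor and minimum too: new data must be normalized
    # with exactly the same values to yield comparable results
    return normalized, factor, minimum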
Note how we return both the normalized data and the factor with which the data was
normalized. We do this because any new data, for example data we later predict on, will
have to be normalized in the same way in order to yield meaningful results. This also means
that the ML modeler has to remember how a particular feature was normalized and save the
relevant values (the factor and the minimum value).
We leave it to the reader to implement a function that takes new data, the normalization
factor, and the minimum value, and re-applies the normalization.
In this section we focus on methods for visualizing the association between the target
variable and the input features. There are four visualization techniques that we recommend:
mosaic plots, boxplots, density plots, and scatterplots. Each of these techniques is
appropriate for a different type (numerical or categorical) of input feature and target
variable, as set out in figure 2.11.
Figure 2.11. Four different visualization techniques, arranged by the type of input feature and response
variable to be plotted.
Next, each of the vertical rectangles is split by horizontal lines into sub-rectangles whose
relative areas are proportional to the percentage of instances belonging to each category of
the response variable. For example, of the Titanic passengers who were female, 74%
survived (this is the conditional probability of survival given that the passenger was female).
Therefore, the "female" rectangle is split by a horizontal line into two sub-rectangles
containing 74% and 26% of the rectangle's area. The same is repeated for the "male"
rectangle (for males the breakdown is 19%/81%).
What results is a quick visualization of the relationship between sex and survival. If there
were no relationship, the horizontal splits would occur at similar locations on the y-axis; if
there is a strong relationship, the horizontal splits will be far apart. To enhance the
visualization, the rectangles are color-coded to indicate the statistical significance of the
relationship, compared to what would be expected if the input feature and response variable
were independent (see figure 2.12).
Figure 2.12. Mosaic plot showing the relationship between sex and survival on the Titanic. The
visualization shows that a much higher proportion of females (and much smaller proportion of males)
survived than would have been expected if survival were independent of sex. “Women and children first.”
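A mosaic plot like figure 2.12 can be produced, for example, with the mosaic function from
the statsmodels package (a sketch, assuming data is the Titanic DataFrame with 'Sex' and
'Survived' columns):

import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

mosaic(data, ['Sex', 'Survived'])  # split first by sex, then by survival
plt.show()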
This tells us that when building a machine-learning model to predict survival on the
Titanic, sex will be an important factor to include in the model. It also allows us to perform a
sanity check on the data: it is common knowledge that a higher proportion of women
survived the disaster, which gives us an extra layer of assurance that our data are
legitimate. Such visualizations can also help us interpret and validate our machine-learning
models once they have been built.
Figure 2.13 shows another mosaic plot, of survival versus passenger class (1st/2nd/3rd).
As expected, a higher proportion of first-class passengers (and a lower proportion of third-
class passengers) survived the sinking. Clearly, passenger class will also be an important
factor in an ML model to predict survival, and the relationship is exactly as we would expect:
higher-class passengers had a higher probability of survival.
Figure 2.13. Mosaic plot showing the relationship between passenger class and survival on the Titanic.
2.3.2 Boxplots
Boxplots are a standard statistical plotting technique for visualizing the distribution of a
numerical variable. For a single variable, a boxplot depicts the five-number summary of its
distribution: the minimum, 25th percentile, median, 75th percentile, and maximum of the
values. Boxplot
visualization of a single variable gives insight into the center, spread, and skewness of its
distribution of values, plus the existence of any outliers.
Boxplots can also be used to compare different distributions when plotted in parallel. In
particular, they can visualize differences in the distribution of a numerical feature across the
categories of a categorical response variable. Returning to the Titanic example, we can
visualize the difference in ages between survivors and non-survivors using parallel boxplots,
as in figure 2.14. In this case, it is not clear that any differences exist in the age distributions
of survivors versus non-survivors, as the two boxplots look fairly similar in shape and
location.
Figure 2.14. Boxplots showing the relationship between passenger age and survival on the Titanic. There
are no noticeable differences between the age distributions for survivors versus non-survivors. Note: this
alone should not be a reason to exclude age from the ML model, as it may still be a predictive factor.
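Parallel boxplots like these can be drawn with matplotlib (a sketch, assuming data is the
Titanic DataFrame):

import matplotlib.pyplot as plt

# One age distribution per survival outcome; drop missing ages before plotting
ages = [data[data['Survived'] == s]['Age'].dropna() for s in (0, 1)]
plt.boxplot(ages)
plt.xticks([1, 2], ['Died', 'Survived'])
plt.ylabel('Age')
plt.show()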
A feature may also be predictive only in combination with other input features. For
example, although age does not show a clear relationship with survival on its own, it could be
that for 3rd class passengers age is an important predictor (perhaps the younger and
stronger 3rd class passengers could make their way to the deck of the ship more readily than
older passengers). A good ML model will discover and expose such a relationship, so the
visualization alone is not grounds to exclude age as a feature.
We also display boxplots exploring the relationship between the fare a passenger paid
and survival in figure 2.15. In the left panel, it is clear that the fare distributions are highly
skewed (many small values and a few very large outliers), making the differences difficult to
see. This is remedied by a simple transformation of the fare (square root, in the right panel),
which makes the differences easy to spot. Fare has an obvious relationship with survival
status: those paying higher fares were more likely to survive, as expected. Thus, the fare
amount should be included in the model, as we expect the ML model to find and exploit this
positive association.
Figure 2.15. Boxplots showing the relationship between fare paid and survival on the Titanic. The square-
root transformation makes it obvious that passengers who survived paid higher fares, on average.
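The transformation itself is a one-liner (a sketch, assuming data is the Titanic DataFrame):

from pylab import *  # provides sqrt()

# Square-root transform to reduce the skew of the Fare column
data['sqrt_Fare'] = sqrt(data['Fare'])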
Density plots display the distribution of a single variable in more detail than a boxplot.
First, a smoothed estimate of the probability distribution of the variable is computed
(typically using a technique called kernel density estimation). Next, that distribution is
plotted as a curve depicting the values the variable is likely to take. By plotting one density
curve of the response variable for each category of a categorical input feature, we can easily
visualize any discrepancies in the values of the response variable across categories. Density
plots are similar to histograms, but their smooth nature makes it much simpler to visualize
multiple distributions in a single figure.
For example, we will use the Auto MPG data set. This data set contains the miles per
gallon (MPG) attained by each of a large collection of automobiles from 1970 to 1982, plus
attributes of each auto, including horsepower, weight, location of origin, and model year.
In figure 2.16, the density of MPG is plotted for each location of origin (USA, Europe, or
Asia). It is quite clear from the plot that Asian cars tend to have higher MPG, followed by
European and then American cars. This means that origin should be an important predictor
in our model. Further, a few secondary 'bumps' occur in each density curve, which may
correspond to different types of automobile (e.g., truck versus sedan versus hybrid). Extra
exploration of these secondary bumps is warranted to understand their nature and to guide
further feature engineering.
Figure 2.16. Density plot for the MPG data set, showing the distribution of vehicle MPG for each
manufacturer region. It is obvious from the plot that Asian cars tend to have the highest MPG and that
cars made in the USA have the lowest. Region is clearly a strong indicator of MPG.
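One density curve per category can be drawn, for example, with pandas' built-in kernel
density plot (a sketch; the DataFrame name auto and its column names are assumptions):

import matplotlib.pyplot as plt

for region in auto['origin'].unique():
    # One smoothed MPG distribution per region of origin
    auto.loc[auto['origin'] == region, 'mpg'].plot(kind='density', label=str(region))
plt.xlabel('MPG')
plt.legend()
plt.show()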
2.3.4 Scatterplots
Scatterplots are a simple visualization of the relationship between two numerical variables
and are one of the most popular plotting tools in existence. In a scatterplot, the value of the
input feature is plotted against the value of the response variable, with each instance
represented as a dot. Though simple, scatterplots can reveal both linear and nonlinear
relationships between the input and response variables.
In figure 2.17, we plot two scatterplots: one of car weight versus MPG and one of car
model year versus MPG. In both cases, clear relationships exist between the input features
and the MPG of the car, and hence both should be used in modeling. In the left panel, there
is a clear 'banana' shape in the data showing a nonlinear decrease in MPG with increasing
vehicle weight. In the right panel, there is an increasing, linear relationship between MPG
and model year. Both plots clearly indicate that these input features are useful in predicting
MPG, and both show the expected relationship.
Figure 2.17. Scatterplots for the relationship of vehicle miles per gallon versus vehicle weight (left) and
vehicle model year (right).
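Scatterplots like these take only a few lines with matplotlib (a sketch; the DataFrame name
auto and its column names are assumptions):

import matplotlib.pyplot as plt

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
left.scatter(auto['weight'], auto['mpg'])
left.set_xlabel('Vehicle weight')
left.set_ylabel('MPG')
right.scatter(auto['model_year'], auto['mpg'])
right.set_xlabel('Model year')
plt.show()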
Now that we have learned about data collection, preparation and visualization, we are
ready to start building machine learning models! Chapter 3 introduces the concepts and
mechanics of ML model building.
2.4 Summary
In this chapter, we've looked at important aspects of data in the context of real-world
machine learning. A number of new vocabulary terms were introduced along the way; to
help you keep the machine-learning terminology straight, definitions of each of the bolded
words from the chapter are collected in Table 2.3.
We began the chapter by introducing the concept of training data and providing a guide
to collecting it. In particular, the four most important things to keep in mind when compiling
your training data are: deciding which input features to include, figuring out how to obtain
ground-truth values for the target variable, determining when you have collected enough
training data, and keeping an eye out for biased or non-representative training data.
Once your training data are collected, the next step is to prepare them for ML modeling.
The type of preparation needed depends on both the characteristics of your data and the
machine-learning model you plan to use. We described the pre-processing steps you'll need
to handle categorical features, deal with missing data, do simple feature engineering, and
normalize features.
Finally, we described four go-to visualization techniques for examining the association
between a single input feature and the target variable. Each technique is appropriate for a
different type of feature or target. These techniques are helpful for last-minute sanity
checks on your data and for gaining insight into the relationships in your data before
building ML models. They should be used to assess whether the different features in your
data are likely to be important or unimportant in your model, but they should not be used to
exclude features. We'll get to feature selection once we learn how to build models.
With our data ready for modeling, let’s now start building machine-learning models!
3
Modeling and prediction
In the previous chapter, we covered guidelines and principles for data collection, pre-
processing, and visualization. The next step in the machine-learning workflow is to use that
data to begin exploring and uncovering the relationships that exist between the input
features and the target. In machine learning, this is done by building statistical models on
the data. This chapter covers the basics required to understand ML modeling and to start
building your own models. In contrast to most machine-learning textbooks, we spend little
time discussing the various approaches to ML modeling, focusing instead on the big-picture
concepts. This will help you gain a broad understanding of machine-learning model building
and quickly get up to speed on building your own models to solve real-world problems. For
more information about specific ML modeling techniques, see Appendix A.
We begin the chapter with a high-level overview of statistical modeling in section 3.1.
This discussion focuses on the big-picture concepts of ML modeling, such as the purpose of
models, the ways in which models are used in practice, and a succinct look at the types of
modeling techniques in existence and their relative strengths and weaknesses. From there,
we dive into the two most common machine-learning tasks, classification and regression, in
sections 3.2 and 3.3, respectively. In these sections, we give more details about how to
build models on your data. We also call attention to a few of the most common algorithms
used in practice in the grey "Algorithm highlight" boxes scattered through the chapter.
Y = f(X) + ε.
In this equation, f represents the unknown function that relates the input variables to the
target, Y. The goal of ML modeling is to accurately estimate f using data. The
symbol ε represents random noise in the data that is unrelated to the function f. The
function f is commonly referred to as the signal whereas the random variable ε is called the
noise. The challenge of machine learning is to use data to determine what the true signal is
while ignoring the noise.
In the Auto MPG example, the function f describes what the true MPG rating for an
automobile is as a function of that car’s many input features. If we knew that function
perfectly, then we could know what the MPG rating would be for any car, real or fictional.
However, there could be a number of sources of noise, ε, including (and certainly not limited
to):
1. Imperfect measurement of each vehicle’s MPG rating caused by noise in the
measuring devices.
2. Statistical noise in the manufacturing process, causing each car in the fleet to have
a slightly different MPG rating.
3. Noise in the measurement of the input features, such as weight and horsepower.
4. Lack of access to the broader set of features that would exactly determine MPG.
Using the noisy data that we have from hundreds of vehicles, the ML approach is to use
modeling techniques to find a good estimate of f. The resulting estimate is referred to as an
ML model.
In Sections 3.2 and 3.3 we will describe in further detail how these ML modeling
techniques work. Indeed, the bulk of the academic literature on machine learning deals with
how to best estimate f.
Y_pred = f_est(X_new).
These predictions can then be used to make decisions about the new data or may be fed
into an automated workflow.
Going back to the Auto MPG example, suppose that we have an ML model, f_est, that
describes the relationship between MPG and the input metrics of an automobile. Prediction
allows us to ask the question: "What would the MPG of a certain automobile with known
input metrics be?" Such predictive ability would be useful for designing automobiles, since it
would allow engineers to assess the MPG rating of different design concepts and to ensure
that the individual concepts meet MPG requirements.
Prediction is the most common use of machine learning systems. Prediction is central to
a large number of ML use cases, including:
• Deciphering hand-written digits / voice recordings
• Predicting the stock market
• Forecasting
• Predicting which users are most likely to click, convert, or buy
• Predicting which users will need product support and which are likely to unsubscribe
• Determining which transactions are fraudulent
• Recommendations
These inferences can tell us a lot about the data-generating process and give clues about the
factors driving the relationships in the data. Returning to the Auto MPG example, we can use
inference to answer questions such as: does manufacturer region have an effect on MPG?
Which of the inputs are most strongly related to MPG? And are they negatively or positively
related? Answers to these questions give us an idea of the driving factors in automobile MPG
and clues about how to engineer vehicles with higher MPG.
In this equation (for a linear model, f(X) = β0 + β1*X1 + … + βp*Xp), the unknown parameters
β0, β1, … can be interpreted as the intercept and slope parameters with respect to each of the
inputs. When we fit a parametric model to some data, we are simply estimating the best
values of each of the unknown parameters. We can then plug those estimates into the
formula for f(X), along with new data, to generate predictions.
Other examples of commonly used parametric models include logistic regression,
polynomial regression, linear discriminant analysis, quadratic discriminant analysis,
(parametric) mixture models, and Naïve Bayes (when parametric density estimation is used).
Approaches that are often used in conjunction with parametric models for model selection
purposes include ridge regression, lasso, and principal components regression. Further
details about some of these methods will be given later in this chapter, and a description of
each approach is given in Appendix A.
The drawback of parametric approaches is that they make strong assumptions about the
true form of the function f. In most real-world problems, f does not assume such a simple
form, especially when there are many input variables X. In these situations, parametric
approaches will fit the data poorly, leading to inaccurate predictions. For this reason, most
real-world approaches to machine learning depend on non-parametric machine-learning
methods.
NON-PARAMETRIC METHODS
In non-parametric models, f does not take a simple, fixed functional form. Instead, the form
and complexity of f adapt to the complexity of the data. For example, if the relationship
between X and Y is quite wiggly, a non-parametric approach will choose a function f that
matches the curvy patterns; likewise, if the relationship between the input and output
variable is smooth, a simple function f will be chosen.
A simple example of a non-parametric model is a classification tree: a series of recursive
binary decisions on the input features. The classification-tree learning algorithm uses the
target variable to learn the optimal series of splits such that the terminal leaf nodes of the
tree contain instances with similar values of the target.
Take, for example, the Titanic data set. The classification-tree algorithm first seeks the
best input feature to split on such that the resulting leaf nodes contain passengers that
either mostly lived or mostly died. In this case, the best split is on the sex (male / female)
of the passenger. The algorithm continues splitting on other input features in each of the
sub-nodes until it can no longer detect any good subsequent splits.
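In scikit-learn, fitting such a tree is a few lines (a sketch, assuming features and target hold
the processed Titanic data from chapter 2):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()   # depth and complexity are learned from the data
model.fit(features, target)
predictions = model.predict(features)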
Classification trees are non-parametric because the depth and complexity of the tree are
not fixed in advance but learned from the data. If the relationship between the target
variable and the input features is complex and there is a sufficient amount of data, the tree
will grow deeper, uncovering more nuanced patterns. Figure 3.2 shows two classification
trees learned from different subsets of the Titanic data set. In the left panel is a tree learned
from only 400 passengers: the resulting model is simple, consisting of a single split. In the
right panel is a tree learned from 891 passengers: the larger amount of data enables the
model to grow in complexity and find more detailed patterns in the data.
Figure 3.2: A decision tree is an example of a non-parametric ML algorithm because its functional form is
not fixed. Rather, the tree model can grow in complexity with larger amounts of data to capture more
complicated patterns.
hierarchical clustering.
• Dimensionality Reduction – Transform the input features into a small number of
coordinates that capture most of the variability of the data. Methods: principal
components analysis (PCA), multi-dimensional scaling, manifold learning.
Both clustering and dimensionality reduction are widely popular (particularly K-means and
PCA), yet they are often abused and applied in situations where a supervised approach is
warranted.
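For reference, a minimal sketch of both techniques with scikit-learn, assuming X is a
numerical feature matrix (rows are instances):

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

labels = KMeans(n_clusters=3).fit_predict(X)    # one cluster label per instance
coords = PCA(n_components=2).fit_transform(X)   # 2-D coordinates for plotting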
Unsupervised methods do, however, play a significant role in machine learning, often in
support of supervised problems, either by helping to compile training data to learn from or
by deriving new input features on which to learn. We will return to the topic of unsupervised
learning in Chapter 8.
Now, we’ll transition to the more practical aspects of ML modeling, describing the steps
needed to start building models on your own data and the practical considerations of
choosing which algorithm to use. We break up the rest of the chapter into two sections
corresponding to the two most common problems in machine learning: classification and
regression.
We begin with the topic of classification.
Figure 3.3: Illustration of a classification process. Rectangles and circles are divided by a classifier into
classes A and B. This is a case of binary classification where there are only two classes.
Let's again use an example. In chapter 2 we looked at the Titanic dataset for predicting
the survival of passengers aboard the ill-fated ship. Figure 3.4 shows a subset of this data.
As discussed in chapter 2, typically the best way to start an ML project is to get a feel for
the data by visualizing it. For example, it is common knowledge that more women than men
survived the Titanic, and we can see that this is in fact the case from the mosaic plot in
figure 3.5 (if you've forgotten about mosaic plots, look back at section 2.3.1).
Figure 3.5: Mosaic plot showing overwhelming support for the idea that more women than men survived
the disaster.
By using the visualization techniques from section 2.3, we can get a feel for the
predictive power of each of the features in the Titanic dataset, but it's important to realize
that a single feature looking good or bad in isolation says little about how it performs in
combination with one or more other features. Perhaps age together with gender and social
status divides the passengers much better than any single feature. In fact, this is one of the
main reasons to use machine-learning algorithms in the first place: to find signals in many
dimensions that humans cannot discover.
In the following subsections, we will introduce the methodology for building classification
models and making predictions, look at a few specific algorithms, and examine the difference
between linear and nonlinear algorithms.
The final dataset that we will use for modeling is shown in figure 3.6.
4. The "regression" in logistic regression does not mean it is a regression algorithm; logistic
regression extends linear regression with a logistic function to make it suitable for classification.
Figure 3.6: The first five rows of the Titanic dataset after processing the categorical features and missing
values, and transforming the Fare variable by taking the square root. All features are now numerical, which
is the preferred format for most ML algorithms.
We can now go ahead and build the model by running the data through the logistic
regression algorithm. This algorithm is implemented in the scikit-learn Python package, and
the model-building and prediction code might look like the following.
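Here is a minimal sketch; the function names are illustrative, and features/target are
assumed to hold the processed data from figure 3.6:

from sklearn.linear_model import LogisticRegression

def train(features, target):
    model = LogisticRegression()
    model.fit(features, target)          # learn the decision boundary from the training data
    return model

def predict(model, new_features):
    preds = model.predict(new_features)  # 1 = predicted to survive, 0 = predicted to die
    return preds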
After building the model, we can predict the survival of previously unseen passengers
based on their features. The model expects features in the format given in figure 3.6, so any
new passengers have to be run through exactly the same processing steps as the training
data. The output of the predict function is 1 if the passenger is predicted to survive, and 0
otherwise.
It is useful to visualize the classifier by plotting the so-called decision boundary. Given
two of the features in the dataset, we can plot the boundary that separates surviving
passengers from the dead, according to the model. Figure 3.7 shows this for the Age and
square root Fare features.
Figure 3.7: The decision boundary for the Age and sqrt(Fare) features. Blue diamonds show passengers
who survived, while red circles denote passengers who died. The light background denotes the
combinations of Age and Fare that are predicted to yield survival. Notice how a few instances fall on the
wrong side of the boundary; our classifier is certainly not perfect, but here we are only looking at two
dimensions. On the full dataset, the algorithm finds this decision boundary in 10 dimensions, which is much
harder to visualize.
1. Start with an initial guess of the parameters that define the separating line.
2. Measure how well this line separates the two classes. In logistic regression, we use the
statistical deviance as the goodness-of-fit measure.
3. Guess new values of the parameters, and measure the separation power again.
4. Repeat until there are no better guesses. This is an optimization procedure that can be
carried out with a range of optimization algorithms; gradient descent is a popular choice
for a simple optimization algorithm.
Of course, this approach extends to more dimensions, so we are not limited to two features
in our model. If you are interested in the details, we strongly encourage you to research
further and try to implement this algorithm in your programming language of choice, then
look at an implementation in a widely used ML package. There are plenty of details left out
here, but the above steps form the basis of the algorithm.
Some properties of logistic regression include:
• The algorithm is relatively simple to understand compared to more complex
algorithms. It is also computationally cheap, making it very scalable to large
datasets.
• Performance will degrade if the decision boundary that separates the classes needs
to be highly nonlinear. See section 3.2.2 below.
• Logistic regression can sometimes over-fit the data, and it is often necessary to use
a technique called regularization to limit this danger. See section 3.2.2 below for an
example of over-fitting.
Figure 3.8: Nonlinear decision boundary of the Titanic survival support vector machine classifier with a
nonlinear kernel. The light background denotes the combinations of Age and Fare that are predicted to
yield survival.
We see that the decision boundary in figure 3.8 is very different from the linear one in
figure 3.7. What we see here is a good example of a very important concept in machine
learning: over-fitting. The algorithm is capable of fitting the training data very closely,
almost at the level of single records, and we risk losing the ability to make good predictions
on new data that was not included in the training set; the more complex we allow the model
to become, the higher the risk of over-fitting.
Nonlinear algorithms usually have built-in ways to avoid over-fitting, exposed in the form
of model parameters. By tweaking the parameters of the model, keeping the data
unchanged, we can obtain a better decision boundary. Note that we are currently using our
intuition to determine when something is over-fitting; in chapter 4 we will learn how to use
data and statistics to quantify this intuition. For now, we will rely on the authors' experience
and tweak a parameter called gamma; we don't need to know what gamma is at this point,
only that it helps us control the risk of over-fitting. In chapter 5 we will look at how to
optimize model parameters without simply guessing at better values. Setting gamma=0.1 in
the SVM classifier, we obtain the much improved decision boundary shown in figure 3.9.
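In scikit-learn terms, this might look like the following sketch (features and target are the
processed Titanic data):

from sklearn.svm import SVC

# An RBF-kernel SVM; lowering gamma smooths the decision boundary
model = SVC(kernel='rbf', gamma=0.1)
model.fit(features, target)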
Although the basic SVM algorithm is linear in the sense that the separation boundary is
linear, SVMs are also capable of fitting nonlinear data, as we've seen earlier in this
section. SVMs utilize a clever technique to fit nonlinear data: the kernel trick. A kernel is a
mathematical construct that can "warp" the space where the data lives. The algorithm can
then find a linear boundary in this warped space, which corresponds to a nonlinear boundary
in the original space.
Figure 3.10: Four randomly chosen handwritten digits from the MNIST database.
The images are 28x28 pixels each, but we convert each image into 28² = 784 features,
one feature per pixel. In addition to being a multi-class problem, this is also a very high-
dimensional problem. The pattern that the algorithm needs to find is a complex combination
of many of these features, and the problem is highly nonlinear in nature.
To build the classifier, we first choose an algorithm from Appendix A. The first nonlinear
algorithm on the list that natively supports multi-class problems is the k-nearest neighbors
classifier, another simple but powerful algorithm for nonlinear ML modeling. We only need to
change one line in listing 3.1 to use the new algorithm, but we'll also include a function for
getting the full prediction probabilities instead of just the final prediction.
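In scikit-learn, the changed line might look like this sketch:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()   # replaces the LogisticRegression() line from listing 3.1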
5. http://yann.lecun.com/exdb/mnist/
def predict_probabilities(model, new_features):
    preds = model.predict_proba(new_features)  # one probability per digit class
    return preds
Building the k-nearest neighbors classifier and making predictions on the four digits shown
in figure 3.10, we obtain a table of probabilities for each digit. Most of the predictions are
spot-on, with only a few uncertain. Looking at the second digit (a 3), it is not surprising that
it is hard to classify perfectly. This is the main reason to get the full probabilities in the first
place: to be able to take action on predictions that are not certain. This is easy to understand
in the case of a post-office robot routing letters: if the robot is sufficiently uncertain about
some digits, perhaps a human should look at the letter before it is sent to the wrong address.
The k-nearest neighbors algorithm is a simple yet very powerful nonlinear ML method. It
is often used in cases where model training should be quick, while predictions can be slower.
We will see below why this is the case.
The basic idea is that we can classify a new data record by comparing it with similar
records from the training set. If a dataset record consists of a set of numbers (x1, x2, …, xn),
we can find the distance between two records x and y with the usual Euclidean distance formula:
d(x, y) = sqrt((x1 - y1)² + (x2 - y2)² + … + (xn - yn)²)
When making predictions on new records, we simply find the closest known record and
assign its class to the new record. This would be a 1-nearest neighbor classifier, as we are
using only the single closest neighbor. Usually we'd use 3, 5, or 9 neighbors and pick the
class that is most common among them (we use odd numbers to avoid ties).
The training phase is relatively quick because we simply index the known records for fast
distance calculations against new data. The prediction phase is where most of the work is
done: finding the closest neighbors in the entire dataset.
In the simple example above, we used the usual Euclidean distance metric, but we can
also use more advanced distance metrics, depending on the dataset at hand.
K-nearest neighbors is useful not only for classification but for regression as well. Instead
of taking the most common class among the neighbors, we simply take the average or
median of the neighbors' target values. We will go further into regression in section 3.3.
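A sketch of the regression variant in scikit-learn:

from sklearn.neighbors import KNeighborsRegressor

# Predictions are the average target value of the k closest training records
model = KNeighborsRegressor(n_neighbors=5)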
Figure 3.13: Illustration of a regression process. The regressor is predicting the numerical value of a
record.
As an example of regression analysis, we will use the Auto MPG dataset introduced in
section 2.3.3. The goal is to build a model that can predict the average miles per gallon of a
car given various properties of the car, such as horsepower, weight, location of origin, and
model year. Figure 3.14 shows a small subset of this data.
In section 2.3.4 we discovered useful relationships between the MPG rating, the car
weight, and the model year. These relationships are shown in figure 3.15.
Figure 3.15: Using scatterplots we can see that car weight and model year are useful features for
predicting MPG. See section 2.3.4 for more details.
In the next section, we will look at how to build a basic linear regression model to predict
the miles-per-gallon values of this dataset of vehicles. After successfully building a basic
model, we will look at more advanced algorithms for modeling nonlinear data in the last
section.
Figure 3.16: The Auto MPG data after expanding the categorical “origin” column.
We can now use the algorithm to build the model. Again we can use the code structure
defined in listing 3.2 and change one line:
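A sketch of the changed line, assuming scikit-learn's linear regression:

from sklearn.linear_model import LinearRegression

model = LinearRegression()   # replaces the classifier used in listing 3.2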
With the model in hand, we can now make predictions. In this example, however, we split
the dataset into a training set and a testing set before building the model. Chapter 4 covers
model evaluation in much more detail, but we use some simple techniques already in this
section. By training a model on only some of the data while holding out a testing set, we can
subsequently make predictions on the testing set and see how close our predictions come to
the actual values. If we trained on all the data and made predictions on some of the training
data, we would essentially be cheating, as the model is more likely to make good predictions
on data it has seen during training. In figure 3.17 we show the results of making predictions
on a held-out testing set and how they compare to the known values. In this example we
train the model on 80% of the data and use the remaining 20% for testing.
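A sketch of such a split with scikit-learn (features and target are the processed Auto MPG data):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)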
Figure 3.17: Comparing MPG predictions on a held-out testing set to actual values.
A useful way to compare more than a few rows of predictions is to use our good friend
the scatterplot once again. For regression problems, both the actual target values and the
predicted values are numerical. By plotting the predicted values against the actual values in
a scatterplot (introduced in section 2.3.4), we can visualize how well the predictions follow
the actual values. This is shown for the held-out Auto MPG test set in figure 3.18. The figure
shows good prediction performance, as the predictions all fall close to the optimal diagonal
line. Looking at this figure gives us a sense of how our ML model might perform on new
data. In this case, a few of the predictions for higher MPG values seem to be
underestimated, which may be useful information: if we want to get better at estimating
high MPG values, we may need to find more examples of high-MPG vehicles, or we may need
to obtain higher-quality data in that regime.
Figure 3.18: A scatter plot of the actual versus predicted values on the held-out test set. The diagonal line
shows the perfect regressor. The closer all of the predictions are to this line, the better the model.
Like logistic regression for classification, linear regression is arguably the simplest and most
widely used algorithm for building regression models. Its main strengths are linear scalability
and a high level of interpretability.
The idea behind the algorithm is to plot the dataset records as points, with the target
variable on the y-axis, and fit a straight line (or a plane, or hyperplane, when there are two
or more features) to these points. The figure below illustrates the process of optimizing the
distance from the points to the straight line of the model.
A straight line in two dimensions can be described by two parameters; we know this from
the a and b in y = a*x + b from basic math. These parameters are fitted to the data, and
once optimized they completely describe the model and can be used to make predictions on
new data.
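As a minimal sketch, a least-squares line fit can be done with polyfit from the pylab import
(x, y, and x_new are assumed arrays):

from pylab import *  # provides polyfit()

a, b = polyfit(x, y, 1)        # fit y = a*x + b: a degree-1 polynomial
predictions = a * x_new + b    # apply the fitted line to new data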
Figure 3.19: Table of actual versus predicted MPG values for the nonlinear random forest regression
model.
In this case the results are not much different from the linear algorithm, at least visually,
and it is not actually clear which of the algorithms performs better in terms of accuracy. In
the next chapter, we will learn how to quantify performance so we can make meaningful
measurements of how good the prediction accuracy actually is.
Figure 3.20: Comparison of MPG data versus predicted values for the nonlinear random forest regression
model.
For the last algorithm highlight of this chapter, we will introduce the Random Forest
algorithm. This is a highly accurate nonlinear algorithm that is widely used in real-world
classification and regression problems.
The basis of the RF algorithm is the decision tree. Imagine that you need to make a
decision about something, such as what to work on next. There are some variables that can
help you decide the best course of action, and some variables weigh more heavily than others. In
this case, you might first ask: “how much money will this make me?”. If the answer is less
than $10, you can choose not to go ahead with the task. If the answer is more than $10,
you might ask the next question in the decision tree: “will working on this make me
happy?” and answer with a yes/no. You can continue to build out this tree until you’ve
reached a conclusion and chosen a task to work on.
The decision tree algorithm lets the computer figure out, based on the training set, which
variables are the most important and put them at the top of the tree, gradually using
less important variables further down. This allows it to combine variables and say: “if the amount > $10
and it makes me happy and the amount of work < 1 hour, then yes”.
A problem with decision trees is that the top levels of the tree have a huge impact on the
answer, and if the new data doesn’t follow exactly the same distribution as the training set,
generalizability might suffer. This is where the random forest method comes in: by
building a collection of decision trees, we mitigate this risk. When making a prediction, we
simply take the majority vote in the case of classification, or the mean in the case of
regression. Because we use votes or means, we can also return full class probabilities in a
very natural way that not many algorithms share.
Random forests are also known for other advantages, such as their robustness to
unimportant features and to noisy datasets with missing values and mislabeled records.
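A minimal sketch of these ideas with scikit-learn, on a made-up dataset; note how the forest returns class probabilities directly from the votes of its trees:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# made-up dataset: two features per row, binary target
X = np.array([[12, 0.5], [3, 1.2], [25, 0.1], [8, 0.9], [30, 0.3], [1, 1.5]])
y = np.array([1, 0, 1, 0, 1, 0])

forest = RandomForestClassifier(n_estimators=100).fit(X, y)
votes = forest.predict(X)        # the majority vote across all trees
probs = forest.predict_proba(X)  # fraction of trees voting for each class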
Word Definition
(un)supervised: Supervised models, such as classification and regression, find the mapping between the input features and the target variable. Unsupervised models are used to find patterns in the data without a specified target variable.
Clustering: A form of unsupervised learning that puts data into self-defined clusters.
Dimensionality reduction: Another form of unsupervised learning that can map high-dimensional datasets to a lower-dimensional representation, usually for plotting in 2 or 3 dimensions.
Classification: A supervised learning method that predicts which of a set of discrete classes (buckets) a record falls into.
Regression: A supervised learning method that predicts numerical target values.
3.5 Summary
In this chapter, we introduced machine learning modeling. Here we list the main take-aways
from this chapter:
variable in regression models. Again, we showed example uses of linear and nonlinear
methods and how to visualize the predictions of these models.
Next, we will take a look at how to rigorously validate a model to accurately measure how
good its predictions will be on new data. In the next chapter we will introduce various
validation methods and metrics, as well as common visualization methods for visual model
inspection.
4
Model Evaluation & Optimization
Therefore, when we evaluate the performance of a model, we want to determine how well
that model will perform on new data. This seemingly simple task is fraught with complications
and pitfalls that can befuddle even the most experienced ML users. In this section, we describe
the difficulties that arise when evaluating ML models and propose a simple workflow to
overcome those menacing issues and achieve unbiased estimates of model performance.
Figure 4.1: Training data for the corn production regression problem. The data contain clear signal and
noise.
modeled as the average of the target variable for only the training data whose feature value is
close to the feature value of the new data point. A single parameter, called the bandwidth
parameter, controls the size of the window for the local averaging.
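One common choice of kernel is the Gaussian. A minimal sketch of such a kernel smoother, on synthetic data:

import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth):
    # weight each training point by a Gaussian kernel centered on x_new;
    # the bandwidth controls the width of the averaging window
    weights = np.exp(-0.5 * ((x_train - x_new) / bandwidth) ** 2)
    # the prediction is the locally weighted average of the training targets
    return np.sum(weights * y_train) / np.sum(weights)

x_train = np.linspace(0, 10, 50)
y_train = np.sin(x_train) + 0.3 * np.random.randn(50)  # signal plus noise
y_hat = kernel_smooth(x_train, y_train, x_new=5.0, bandwidth=0.5)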
Figure 4.2 demonstrates what happens for different values of the kernel smoothing
bandwidth parameter. For large values of the bandwidth, almost all of the training data are
averaged together to predict the target at each value of the input feature. This causes the
model to be very flat and to under-fit the obvious trend in the training data. Conversely, for
very small values of the bandwidth, only one or two training instances are used to determine
the model output at each feature value. Therefore, the model effectively traces every bump
and wiggle in the data. This tendency to model the intrinsic noise in the data instead of the
true signal is called over-fitting. Where we want to be is somewhere in the Goldilocks zone:
not too under-fit and not too over-fit.
Figure 4.2: Three fits of a kernel smoothing regression model to the corn production training set. For small
values of the bandwidth parameter, the model is over-fit, resulting in an overly bumpy model. For large
values of the bandwidth parameter, the model is under-fit, resulting in a model that is too flat. A good choice
of the tuning parameter results in a fit that looks just right.
Now, let’s get back to the problem at hand: determining how well our ML model will generalize
to predict the corn output from data on different farms. The first step in this process is to
select an evaluation metric that captures how good our predictions are. For regression, the
standard evaluation metric is mean squared error (MSE), which is the average squared
difference between the true value of the target variable and the model-predicted value (later in
this chapter, we will cover other evaluation metrics for regression and classification).
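In code, MSE is a one-line calculation; a quick sketch with made-up values:

import numpy as np

y_true = np.array([3.2, 4.8, 7.1])     # hypothetical target values
y_pred = np.array([3.0, 5.5, 6.4])     # hypothetical model predictions
mse = np.mean((y_true - y_pred) ** 2)  # average squared difference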
This is where things get tricky. Evaluated on the training set, the error (measured by MSE)
of our model predictions gets ever smaller as the bandwidth parameter decreases. This should
not be unexpected: the more flexibility that we allow the model, the better it will do at tracing
the patterns (both the signal and the noise) in the training data. However, the models with
smallest bandwidth are severely over-fit to the training data because they trace every random
fluctuation in the training set. Using these models to predict on new data will result in poor
predictive accuracy because the new data will have their own unique random noise signatures
that are different from those in the training set.
Thus, there is a divergence between the training set error and the generalization error of
an ML model. This divergence is exemplified in figure 4.3 on the corn production data. For
small values of the bandwidth parameter, the MSE evaluated on the training set is extremely small,
while the MSE evaluated on new data (in this case, 10,000 new instances) is much larger.
Simply put, the performance of a model's predictions evaluated on the training set is not
indicative of the performance of that model on new data. Therefore, it is extremely dangerous
to evaluate the performance of a model on the same data that were used to train it.
CAUTION ABOUT DOUBLE-DIPPING OF THE TRAINING DATA Using the training data
for both model fitting and evaluation purposes can lead you to be overly optimistic about the
performance of the model. This can cause you to ultimately choose a sub-optimal model
that performs poorly when predicting on new data.
As we see on the corn production data, choosing the model with the smallest training set MSE
leads to the selection of the model with the smallest bandwidth. On the training set, this model
yields an MSE of 0.08. However, when applied to new data, the same model yields an MSE of
0.50, which is much worse than the optimal model (bandwidth = 0.12 and MSE = 0.27).
We need an evaluation metric that better approximates the performance of the model on
new data. This way, we can be confident about the accuracy of our model when deployed to
make predictions on new data. This is the topic of the next subsection.
Figure 4.3: Comparison of the training set error to the error on new data for the corn production regression
problem. The training set error is an overly optimistic measure of the performance of the model for new data,
particularly for small values of the bandwidth parameter. Obviously, using the training set error as a
surrogate for the prediction error on new data will get us into a lot of trouble.
Figure 4.4: Flowchart of the holdout method of cross-validation. Here, the dark green boxes denote the
target variable.
N = features.shape[0]  # total number of rows in the dataset
N_train = int(np.floor(0.7 * N))  # 70% of the rows are used for training
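A sketch of the remaining holdout steps in the same pseudo-code style, with train and predict standing in as placeholders for whatever learning algorithm is being evaluated:

import numpy as np

idx = np.random.permutation(N)  # shuffle the row indices
train_idx, test_idx = idx[:N_train], idx[N_train:]

# fit on the 70% training split only, then predict on the held-out 30%
# (train and predict are placeholders for the chosen learning algorithm)
model = train(features[train_idx], target[train_idx])
preds = predict(model, features[test_idx])
mse_holdout = np.mean((target[test_idx] - preds) ** 2)  # holdout estimate of the MSE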
Now, let’s apply the holdout method to the corn production data. For each value of the
bandwidth parameter, we apply the holdout method (using a 70% / 30% split) and compute
the MSE on the predictions for the held-out 30% of data. Figure 4.5 demonstrates how the
holdout method estimates of MSE stack up to the MSE of the model when applied to new data.
Two main things stand out:
1. The error estimates computed by the holdout method are very close to the ‘new
data’ error of the model. They are certainly much closer than the training set error
estimates (Figure 4.3), particularly for small bandwidth values.
2. The holdout error estimates are very noisy. They bounce around wildly compared
to the smooth curve that represents the error on new data.
We could beat down the noise by doing repeated random training-testing splits and
averaging the result. However, over multiple iterations, each data point will be assigned to the
testing set a different number of times, which could bias our result.
A better approach is to do k-fold cross-validation.
Figure 4.5: Comparison of the holdout error MSE to the MSE on new data, using the corn production data
set. The holdout error is an unbiased estimate of the error of each model on new data. However, it is a very
noisy estimator that fluctuates wildly between 0.14 and 0.40 for bandwidths in the neighborhood of the
optimal model (bandwidth = 0.12).
K-FOLD CROSS-VALIDATION
A better, but more computationally intensive approach to cross-validation is K-fold cross-
validation. Like the holdout method, K-fold cross-validation relies on quarantining subsets of
the training data during the learning process. The primary difference is that K-fold cross-
validation begins by randomly splitting the data into K disjoint subsets, called folds (typical
choices for K are 5, 10, or 20). For each fold, a model is trained on all of the data except the
data from that fold and is subsequently used to generate predictions for the data from that
fold.
Once all K folds are cycled through, the predictions for each fold are aggregated and
compared to the true target variable to assess accuracy. A pictorial representation of K-fold
cross-validation is shown in Figure 4.6 and pseudo-code is given in Listing 4.2.
Finally, let’s apply K-fold cross-validation to the corn production data. For each value of the
bandwidth parameter, we apply K-fold cross-validation with K=10 and compute the cross-
validated MSE on the predictions. Figure 4.7 demonstrates how the K-fold cross-validation
MSE estimates stack up to the MSE of the model when applied to new data. Clearly, the K-fold
cross-validation error estimate is very close to the error of the model on future data.
import numpy as np

N = features.shape[0]
K = 10  # number of folds
folds = np.random.randint(0, K, size=N)  # assign each row to one of K random folds
preds_kfold = np.empty(N)
for ii in np.arange(K):
    # break your data into training and testing subsets
    features_train = features[folds != ii, :]
    target_train = target[folds != ii]
    features_test = features[folds == ii, :]
    # train on the K-1 remaining folds and predict on the held-out fold
    # (train and predict are placeholders for the chosen learning algorithm)
    model = train(features_train, target_train)
    preds_kfold[folds == ii] = predict(model, features_test)
Figure 4.7: Comparison of the K-fold cross-validation error MSE to the MSE on new data, using the corn
production data set. The K-fold CV error is a very good estimate for how the model will perform on new data,
allowing us to use it confidently to forecast the error of the model and to select the best model.
Figure 4.8: The first 5 rows of the Titanic passenger dataset. The target column indicates whether a
passenger survived the sinking of the ship, or died.
In order to build our classifier, we feed this dataset to a classification algorithm. Because
the dataset consists of different types of data, we have to make sure the algorithm knows how
to deal with these types. As we discussed in the previous chapters, we may need to process the
data prior to training the model, but for this chapter we will simply view the classifier as a black
box that has learned the mapping from the input variables to the target variable. The goal of
this section is to evaluate the model in order to optimize the prediction accuracy and compare
with other models.
With the data ready, we move to step 2 of the basic model evaluation workflow: cross-
validation. We will divide the full dataset into training and testing sets and use the holdout
method of cross-validation. The model will be built on the training set and evaluated on the held-
out testing set. It is important to reiterate that our goal is not necessarily to obtain the
maximum model accuracy on the training data, but to obtain the highest predictive accuracy on
unseen data. In the model-building phase we are, by definition, not yet in possession of such
data, so we pretend that some of the training data is hidden from the learning algorithm. Figure
4.9 illustrates the dataset splitting step in this particular example.
Figure 4.9: Splitting the full dataset into training and testing sets allows us to evaluate the model.
With the training set ready, we can go ahead and build the classifier and make predictions
on the testing set. Following the holdout method of figure 4.4, we will obtain a list of prediction
values: 0 (died) or 1 (survived) for all rows in the test set. We now go to step 3 of the
evaluation workflow and compare these predictions to the actual survival values to obtain a
performance metric that we can optimize.
The simplest performance measure of a classification model is to calculate the
fraction of correct answers; if 3 out of 4 rows were correctly predicted, we’d say the accuracy
of the model on this particular validation set is 3/4 = 0.75, or 75%. This is illustrated in figure
4.10. In the following sections we will introduce more sophisticated ways of performing this
comparison.
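In code, this accuracy calculation is a single line; a quick sketch with the 4 rows from the example:

import numpy as np

actual = np.array([0, 1, 1, 0])      # true survival values for 4 test rows
preds = np.array([0, 1, 0, 0])       # model predictions
accuracy = np.mean(preds == actual)  # fraction of correct answers: 0.75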
Figure 4.10: Comparing the testing set predictions with the actual values gives us the accuracy of the model.
Figure 4.11: Counting the class-wise accuracy and error rate gives us more information on the model
accuracy.
Each element in the matrix shows the class-wise accuracy or confusion between the positive
and the negative class. In figure 4.13 we relate the specific confusion matrix in figure 4.12 to
the general concept of receiver operator characteristics (ROCs) that we will employ widely
throughout the rest of this book. Although these terms can be a bit confusing at first, they will
become very important when talking to other people about the performance of your model.
Figure 4.13: The confusion matrix for our binary classifier tested on only 4 rows. The ROC metrics pointed
out in the figure are chopped up and explained in the bottom box.
The confusion matrix is a powerful tool for model evaluation. In the next chapter, we will
build on what we’ve learned here and explore even more powerful visualization techniques and
the associated evaluation metrics.
Figure 4.14: A subset of probabilistic predictions from our Titanic test set. After sorting the full table by
decreasing survival probability, we can set a threshold and consider all rows above this threshold as
survived. Note how the indices are maintained so we know which original row the instance refers to.
In figure 4.14, we show the process of sorting the probability vector and setting a
threshold of 0.7. All rows above the line are now predicted to survive, and we can compare with
the actual labels to get the confusion matrix and the ROC metrics at this particular threshold. If
we follow this process for all thresholds from 0 to 1, we trace out what is known as the ROC
curve. This is shown in figure 4.15.
Figure 4.15: The ROC curve defined by calculating the confusion matrix and ROC metrics at 100 different
threshold points from 0 to 1. By convention we plot the false-positive rate on the x-axis and the true-positive
rate on the y-axis.
From figure 4.15 we can visually read out the confusion matrix at all thresholds, making the
ROC curve a very powerful visualization when evaluating classifier performance. Given the true
and predicted labels from any cross-validation process, the ROC curve is calculated as follows:
def roc_curve(true_labels, predicted_probs, n_points=100):  #1
    thr = np.linspace(0, 1, n_points)  #2
    tpr = np.zeros(n_points)  #2
    fpr = np.zeros(n_points)  #2
    pos = true_labels == 1  #3
    neg = true_labels == 0  #3
    for i in range(n_points):  #4
        tpr[i] = np.mean(predicted_probs[pos] >= thr[i])  #4
        fpr[i] = np.mean(predicted_probs[neg] >= thr[i])  #4
    return fpr, tpr

#1 Returns the false-positive and true-positive rate at n_points thresholds for the given true and
predicted labels.
#2 Allocate the threshold and ROC lists
#3 Pre-calculate values for the positive and negative cases, used in the loop
#4 For each threshold, calculate the rate of true and false positives
If we follow the ROC curve, we see that a higher true-positive rate comes at the cost of a
higher false-positive rate. This trade-off is the “no free lunch” of machine learning: depending
on our choice of the probability threshold parameter, we can sacrifice some of the instances
that we classify correctly for more certainty that the instances we do flag are actually correct,
and vice versa.
In real-world scenarios this trade-off can be extremely important to evaluate. If we are
classifying whether or not a patient has cancer, it is much better to classify a few extra healthy
patients as sick than to classify any sick patients as healthy. We would therefore select the
threshold that minimizes the false-negative rate, and hence maximizes the true-positive
rate, placing us as high as possible on the ROC plot at the cost of a larger false-positive
rate.
Another good example is spam filters, where you must choose between unwanted emails
being shown in your inbox or wanted emails being dumped in the spam folder. Or consider credit card
companies detecting fraudulent activity: would you rather call your customers often with false
alarms or risk missing potentially fraudulent transactions?
Figure 4.16: The ROC curve visualizes the overall model performance. We can quantify this by defining the
AUC metric: the area under the ROC curve.
In addition to the trade-off information, the ROC curve itself also gives us a view of the
overall performance of the classifier. A perfect classifier will have no false positives and no
missed detections, so the curve would be pushed to the top left corner, as illustrated in figure
4.16. This leads us naturally to a widely used evaluation metric: the area under the ROC curve
(AUC). The larger this area, the better the classification performance. The AUC is a widely used
choice for evaluating and comparing models, although in most cases it is important to inspect
the full ROC curve in order to understand the performance trade-offs. We will use the ROC
curve and the AUC evaluation metric to validate classification models throughout the rest of
this book.
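Given the outputs of the roc_curve function from the listing above, a minimal sketch of the AUC calculation using numerical integration:

import numpy as np

# fpr, tpr as returned by the roc_curve function from the listing above
fpr, tpr = roc_curve(true_labels, predicted_probs)
order = np.argsort(fpr)                 # sort by increasing false-positive rate
auc = np.trapz(tpr[order], fpr[order])  # area under the ROC curve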
Figure 4.17: Examples of handwritten digits in the MNIST dataset. The entire dataset consists of 70,000 such
digits, each in a 28×28-pixel image. Without any image processing, each row of our dataset then consists of a
known label (0 to 9) and 784 features (one for each of the 28×28 pixels).
We use the Random Forest algorithm to build a classifier from the training set, and we
generate the confusion matrix from the held-out testing set. Recall that so far we have only worked
with the confusion matrix for binary classification. Luckily, we can easily define it for multiple
classes: every element in the matrix simply counts the instances of the class on the row that
were predicted as the class on the column. For the MNIST classifier, we see in figure 4.18 that most of the weight is located on the
matrix diagonal, as it should be, because the diagonal shows the number of instances that are correctly
classified for each digit. The largest off-diagonal items show where the classifier is most
confused. Inspecting the figure, we see that the greatest confusion occurs between digits 4 and
9, 3 and 5, and 7 and 9, which makes sense given what we know about the shapes of the digits.
Figure 4.18: The confusion matrix for the 10-class MNIST handwritten digit classification problem.
The reason for displaying the class-wise accuracy in the form of a matrix is to distill the
information in the text on the left-hand side of the figure down to something more visually
digestible. It allows us to take advantage of our excellent visual abilities to process more
information. In figure 4.18 we clearly see how colorizing the confusion matrix helps us take
advantage of this ability.
So how do we generate the ROC curve for multi-class classifiers? The ROC curve is in
principle only applicable to binary classification problems, because we divide the predictions into
positive and negative classes in order to get ROC metrics such as the true-positive rate and
false-positive rate used on the ROC curve axes. To simulate binary classification in a
multi-class problem, we make use of the one-versus-all trick: for each class, we denote that
particular class as the positive class and everything else as the negative class, and we draw the
ROC curve as usual. The 10 ROC curves from running this process on the MNIST classifier are
shown in figure 4.19. The most accurately classified digits are 0 and 1, consistent with the
confusion matrix in figure 4.18. The confusion matrix, however, is generated simply from the
most probable class predictions, while the ROC curve shows the performance of the class at all
probability thresholds.
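A minimal sketch of the one-versus-all trick, reusing the binary roc_curve function from earlier in this chapter (labels and probs are assumed to hold the true digit classes and the matrix of per-class predicted probabilities):

import numpy as np

for digit in range(10):
    # treat the current digit as the positive class and all other digits as negative
    binary_labels = (labels == digit).astype(int)
    fpr, tpr = roc_curve(binary_labels, probs[:, digit])
    # fpr and tpr now trace the ROC curve for this digit, as in figure 4.19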
Figure 4.19: The ROC curves for each class of the MNIST 10-class classifier, using the one-vs-all method for
simulating a binary classification problem. Note that because the classifier is so good, we’ve zoomed
closely into the top corner of the ROC curve in order to see any differences in model performance
between the classes. The AUC is calculated for each class and also shows a very well-performing model
overall.
Keep in mind, however, that the multi-class ROC curves do not show the entire confusion
matrix on the curve. In principle, there is a full 10×10 confusion matrix at every point on the
ROC curve, but we cannot visualize this in a sufficiently simple way. In the multi-class case, it
is therefore important to look at both the confusion matrix and the ROC curve.
Using the basic model evaluation workflow introduced in the beginning of this chapter, we
use our data and choice of algorithm to build a cross-validated regression model. Potential
model performance metrics used in this process are introduced in the following sections, but in
figure 4.21 we show the most basic visualization of regression performance, on which those
metrics are based: the scatter plot of predicted versus actual values.
Figure 4.21: Scatter plot of the predicted MPG versus actual values from the testing set. The diagonal line
shows the optimal model.
correct value, because we usually consider numerical measurements to be drawn from some
distribution with a degree of uncertainty known as the error. In this section we will introduce
two simple metrics for measuring regression performance: the root-mean-square error and
the R-squared value.
The simplest form of performance measurement of a regression model is the root-mean-
square error, or RMSE. This estimator looks at the difference between each predicted value
and the corresponding known value, and by squaring the differences before averaging, it treats
over- and under-predictions the same. Figure 4.22 illustrates the concept
of the RMSE calculation.
Figure 4.22: An illustration of the concept of the RMSE calculation. In the equation, yi and xi are the i-th target
and feature vector, respectively, and f(x) denotes the application of the model to the feature vector, returning
the predicted target value.
To get a better understanding of the details of the RMSE calculation, a code snippet is
shown in the listing below:
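(A minimal version of such a snippet; the values here are made up for illustration.)

import numpy as np

y_true = np.array([18.0, 26.0, 33.5, 41.0])      # hypothetical actual values
y_pred = np.array([17.2, 27.1, 31.0, 36.5])      # hypothetical predictions
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # root-mean-square error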
An advantage of RMSE is that the result is in the same units as the values themselves, but
this is also a disadvantage in the sense that the RMSE value depends on the scale of the problem.
If the predicted or actual values are larger numbers, the RMSE will be correspondingly higher.
While this is not a problem when comparing models in the same project, it can be a challenge
to understand the overall model performance and compare it to other models in general.
Another simple metric is the R-squared, or R², metric, whose value is relative and falls in the 0
to 1 range: the better the model predicts the data, the closer the R² value is to 1. In the
following listing we show some more details of the R-squared calculation:
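(A minimal version of the calculation, with the same made-up values.)

import numpy as np

y_true = np.array([18.0, 26.0, 33.5, 41.0])
y_pred = np.array([17.2, 27.1, 31.0, 36.5])
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot                 # closer to 1 is better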
While the R-squared metric is a good way to estimate the performance of regression
models, it has a few pitfalls that are important to be aware of as well. For parametric models,
for example, R² will increase slightly with the number of independent variables, even if the
actual model performance does not improve. In such cases R² should be used only in
conjunction with other metrics, such as the RMSE or the residual analysis we will look at in
the next section. There are also adjusted R² values, which the curious reader can look into, that
seek to avoid some of these pitfalls.
Figure 4.24: The residual plot from predictions on the MPG dataset. The horizontal line shows the 0-line, where
the residual is 0.
Figure 4.24 shows an example residual plot for our MPG dataset. This is basically the same
information as in the scatter plot of figure 4.21, but zoomed in on the scale of the residuals. In
the ideal case, we expect the residuals to be distributed randomly around the 0-line. In the lower
end of the figure, MPG values from 10 to 35, the residuals do look randomly distributed
around the 0-line, perhaps with a slight bias towards overestimating the values. At values of 35-
45, however, we see a clear bias towards underestimating the values, resulting in larger
residuals. We can use this information to improve the model, either by tweaking model
parameters or by processing or amending the data. If we can acquire additional
data, we could try to obtain labels for a few more high-MPG examples; in this case, we could
find a few more high-MPG cars and add them to the dataset in order to improve the
predictions in that part of the scale.
As an example, think back to Section 3.2.2, where we applied a kernel SVM to the Titanic data
set. Here, we showed the model fits for two different choices of the kernel coefficient
parameter (called gamma) in Figures 3.8 and 3.9. Note how different the fits are: setting
gamma=0.01 produces a very complex, segmented decision boundary between the two classes
while setting gamma=0.1 creates a smoother model. In this case, the fitted model is highly
sensitive to the choice of the tuning parameter gamma.
What makes this hard is that the appropriate choice of each tuning parameter for a given
algorithm is entirely dependent on the problem and data at hand. What works well for one
problem is not necessarily appropriate for the next. This means that relying on
heuristics and rule-of-thumb default tuning parameter settings may lead to poor predictive
performance; rigorous selection of tuning parameters is therefore critical to ensuring
that our models are as accurate as they can be with the given data.
that we can, in principle, also use grid search to select the best kernel. Indeed, there’s no
reason we can’t use grid search to select between different algorithms!
Next, we select which tuning parameters to optimize over. For kernel SVM with an RBF
kernel, we have two standard tuning parameters: the kernel coefficient, gamma, and the
penalty parameter, C. The code in Listing 4.6 shows how to run a grid search over those
two parameters for this problem.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# grid of candidate values for gamma and C (this setup is a reconstruction);
# cv_probs is a placeholder for a routine returning K-fold cross-validated probabilities
gam_vec, cost_vec = np.meshgrid(np.linspace(0.01, 0.10, 10), np.arange(1, 11))
AUC_all = np.empty(gam_vec.size)
for i in range(gam_vec.size):
    model = SVC(gamma=gam_vec.ravel()[i], C=cost_vec.ravel()[i], probability=True)
    AUC_all[i] = roc_auc_score(target, cv_probs(model, features, target))

indmax = np.argmax(AUC_all)
print "Maximum = %.3f" % (np.max(AUC_all))
print "Tuning Parameters: (gamma = %f, C = %f)" % (gam_vec.ravel()[indmax],
                                                   cost_vec.ravel()[indmax])
We find with the Titanic data set that the maximum cross-validated AUC is 0.670, and it
occurs at the tuning parameter vector (gamma = 0.01, C = 6). A contour plot showing the
AUC evaluated over the grid, as in figure 4.25, can be very informative. There are a few things
that jump out from this plot:
• The maximum occurs at the boundary of the grid (gamma = 0.01), meaning that
we would want to re-run the grid search on an expanded grid.
• There is a high amount of sensitivity to the gamma parameter, meaning that we
need to increase the sampling of that parameter.
• The maximum value occurs near gamma = 0, so expressing the gamma grid on a log scale
(e.g., 10^-4, 10^-3, 10^-2, 10^-1, etc.) is sensible; see the one-line sketch after this list.
• There is not much sensitivity of the AUC as a function of C, so we can use a coarse
sampling of that parameter.
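For instance, such a log-scaled grid for gamma might be built as follows (a one-line sketch):

import numpy as np

gam_grid = np.logspace(-4, 1, num=6)  # 10^-4, 10^-3, ..., 10^1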
Re-running the grid search on a modified grid, we find that the maximum AUC is 0.690 and
it occurs at (gamma = 0.08, C = 20). The value of optimizing over the tuning parameters is
clear: a single model with arbitrary choice of tuning parameter could have attained a result as
poor as AUC=0.5 (no better than random guessing); the grid-search optimized model boosts
the accuracy up to AUC=0.69.
Figure 4.25: Contour plot showing the cross-validated AUC as a function of the two tuning parameters,
gamma and C. The maximum occurs way off to the upper left, meaning that we need to expand the search
and focus on that region.
Note that grid search does not absolutely ensure that we have chosen the best set of tuning
parameters. Due to the limitations of choosing a finite grid of possible values to try, it
is possible that the actual best value lies somewhere between the points of our grid.
Readers with some familiarity with optimization might be wondering why more sophisticated
optimization routines aren’t traditionally used for tuning parameter selection. The short answer
is that the world of derivative-free, non-convex optimization has not yet become part of the
standard ML tool-kit. The longer answer is that ML researchers on the cutting edge of the field
are beginning to incorporate these methods into tuning parameter optimization strategies.
4.5 Summary
In this chapter we learned the basics of evaluating the ML model performance. Here’s a quick
rundown of the main takeaways:
• We started by discussing how to evaluate models in general terms. It was clear
that we can’t double-dip the training data and use it for evaluation as well as
training.
• Instead, we introduced cross-validation as a more robust method of model
evaluation.
• Holdout cross-validation is the simplest form of cross-validation, where a testing
set is held out for prediction in order to better estimate the generalizability of the
model.
• K-fold cross-validation, where K folds are held out one at a time, was introduced
to give less noisy estimates of model performance. This improvement comes at a
higher computational cost. If resources allow, the best estimate is
obtained when K = number of samples, also known as leave-one-out cross-validation.
• The basic model evaluation workflow was introduced. In condensed form:
o Acquire and pre-process the dataset for modeling (chapter 2) and
determine the appropriate ML method and algorithm (chapter 3).
o Build models and make predictions using either the holdout or k-fold
cross-validation method, depending on the computing resources
available.
o Evaluate the predictions with the performance metric of choice: for
classification, the metrics introduced in section 4.2; for regression,
the metrics introduced in section 4.3.
o Tweak the data and model until the desired model performance is obtained.
In chapters 5-8 we will look at various methods for increasing model
performance in common real-world scenarios.
• For classification models, we introduced several model performance metrics to be
used in step 3 of the above workflow: simple counting
accuracy, the confusion matrix, receiver-operator characteristics, the ROC curve,
and the area under the ROC curve (AUC).
Word Definition
Under / over-fitting: Using too simple / too complex a model for the problem at hand.
Cross-validation: The method of splitting the training set into two or more separate training/testing sets in order to better assess the accuracy.
Holdout method: A form of cross-validation where a single test set is held out of the model fitting routine for testing purposes.
K-fold cross-validation: A form of cross-validation where the data are split into K random disjoint sets (folds). The folds are held out one at a time and cross-validated on models built on the remainder of the data.
Confusion matrix: A matrix showing, for each class, the number of predicted values that were correctly classified or not.
ROC (receiver operator characteristic) metrics: Numbers characterizing the rates of true positives, false positives, true negatives, and false negatives.
AUC (area under the ROC curve): An evaluation metric for classification tasks, defined as the area under the ROC curve of false positives versus true positives.
Grid search: A brute-force strategy for selecting the best values of the tuning parameters to optimize an ML model.
In the next chapter we will start looking at improving our models by working more closely
with the features of the model. In addition to some basic feature engineering techniques, we
will look at advanced methods for extracting information from text, images, and time-series
data. We will also look at how to select the best features to optimize the performance of the
model and avoid over-fitting.
5
Basic feature engineering
• dividing total dollar amount by total number of payments to get a ratio of dollars per
payment,
• counting the appearance of a particular word across a text document,
• Web conversions are always higher on Tuesday (include the Boolean feature ‘is it
Tuesday?’).
as a feature).
• Spam emails typically come from free email accounts (engineer the Boolean feature ‘is
from free email account?’ or the email domain).
• Loan applicants with recently opened credit cards default more often (use the feature
‘days since last credit card opened’).
• Customers often switch their cell phone provider after others in their network also switch
providers (engineer a feature that counts the number of people in a subscriber’s
network who recently switched).
Clearly, the list of potential domain expertise tidbits could go on and on. Indeed, the
standard operating procedure for many companies is to use long lists of these ad hoc rules to
make decisions and predictions. These so-called business rules are a perfect set of engineered
features on which one can start building ML models!
Turned on its head, feature engineering can also be a way to test the preconceived notions
held by domain experts. If there is any question as to whether a particular hypothesis
holds merit, it can be codified and used as a feature in an ML model. Then, the accuracy of
the model can be tested with and without that feature to assess its conditional importance in
predicting the target variable. If the gains in accuracy are negligible, this is
evidence that the idea adds little value.
Next, we will take a look at a few examples of simple feature engineering to show how
these processes work in practice. We will describe how feature engineering fits into the overall
ML workflow and demonstrate how the predictive accuracy of ML models can be improved by
employing some relatively straightforward feature engineering processes.
Figure 5.1: How feature engineering fits into the basic ML workflow. We’re extending the training data with
features before building the model. When making predictions we need to push new data through the same
feature engineering pipeline to ensure the answers make sense.
The feature engineering extension of the workflow allows us to expand upon the training
data to increase the accuracy of the ML algorithm. To use feature engineering properly, we
need to run the prediction data through the same feature engineering pipeline that was
applied to the training data, so that predictions are generated using exactly the same process
as the training data.
Figure 5.2: A sample of the datasets used for training the event recommendations model.
• invited – a Boolean indicating whether or not the individual was invited to the event
• birthyear – the year the person was born
We begin this section by building and evaluating a model on only these six features. Clearly,
this data set is limited in terms of the patterns that can identify whether or not
each user will be interested in the event. As we continue this section, we will use a few
straightforward feature engineering transformations to extend the feature set and layer on new
information.
We build an initial binary classification model to predict the target variable, interested, from
the 6 input features. Following the ML workflow from chapters 1-4, we: (1) perform initial
data processing (convert categorical columns to numerical, impute missing values),
(2) train the model (using the Random Forest algorithm), and (3) evaluate the model
(using 10-fold cross-validation and ROC curves). Our cross-validated ROC curve is shown in
figure 5.3. It attains an AUC score of 0.81.
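A minimal sketch of steps 2 and 3 with scikit-learn, assuming a processed feature matrix X and binary target vector y (names assumed, not from the book's listing):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

# X: the processed feature matrix; y: the binary "interested" target
model = RandomForestClassifier(n_estimators=100)
# 10-fold cross-validated probability predictions for the positive class
probs = cross_val_predict(model, X, y, cv=10, method='predict_proba')[:, 1]
auc = roc_auc_score(y, probs)  # the cross-validated AUC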
Figure 5.3: Cross-validated ROC curve and AUC metric for the simple event recommendation model.
of the datetime elements into numerical features that capture the information encoded within
the datetime string. This simple yet powerful feature engineering concept enables us to
transform each datetime string into a smattering of features, such as:
• Hour of day
• Day of the week
• Etc.
In Figure 5.4, we show the first 5 rows of data that result from converting our single
start_time feature into 10 different date/time features.
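A minimal sketch of such an extraction with pandas, using two made-up timestamps:

import pandas as pd

df = pd.DataFrame({'start_time': ['2012-10-02 15:53:05', '2012-10-27 21:31:08']})
dt = pd.to_datetime(df['start_time'])
df['hour_of_day'] = dt.dt.hour       # e.g., 15 and 21
df['day_of_week'] = dt.dt.dayofweek  # Monday=0 through Sunday=6
df['month'] = dt.dt.month
df['year'] = dt.dt.year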
Figure 5.4: Additional date-time columns extracted from the timestamp column for the event recommendation
data set.
Next, we build a Random Forest model on this new, 16-feature data set. Our cross-
validated ROC curve is shown in Figure 5.5.
Figure 5.5: Cross-validated ROC curve for model including date-time features.
The AUC of the model has increased from 0.81 to 0.85. Clearly, there was hidden value in
the start_time information that, when imbued into the ML model via feature engineering,
helped to improve the accuracy of the model. Most likely, events that occur on particular days
of the week and at certain times of day are more popular than others.
rest of the words, which in principle determines the length of the text outside of the selected
top words.
Now, let’s say we are selecting the top 100 words to get count columns in our dataset.
We will get a bunch of columns with counts for very common but not very useful words, such
as “is”, “and”, “the”, and so on. In the field of natural language processing, these words are
known as stop-words, and they are usually purged from the text before performing the bag-of-words
counting.
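A minimal sketch of bag-of-words counting with stop-word removal, using scikit-learn's CountVectorizer on two made-up documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the festival is loud and fun",
        "the conference is quiet and the talks are long"]
# keep only the 100 most frequent words and purge English stop-words
vectorizer = CountVectorizer(max_features=100, stop_words='english')
counts = vectorizer.fit_transform(docs)  # a sparse matrix of word counts
vocab = vectorizer.vocabulary_           # maps each kept word to its column index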
We will introduce more advanced text feature concepts in the next chapter, but the last
complicating factor we will mention here is that the bag-of-words dataset quickly becomes
large and sparse: we have a lot of features mostly filled with zeros, because a particular word is
usually unlikely to appear in a random passage of text. The English vocabulary is very
large (~200k words in use), and only a small fraction of the words are used in most texts. Of
course, some ML problems have a much narrower space, where a class of words is more
represented than in general. For example, figure 5.6 shows a few instances of the count
features for the top words in our event recommendation example, where the sparsity of the data
is clear. Some ML algorithms, such as Naïve Bayes classifiers, handle sparse data very well (e.g.,
by requiring no extra memory for the zeros), while most others do not.
Figure 5.6: A slice of the bag-of-words data for the event recommendation example. These numbers are the
counts of the top occurring words in the event descriptions. A large fraction of the cells are 0 and we call the
dataset sparse.
In events.csv of our event recommendation example, there are 100 features representing
the bag of words for the top 100 occurring words. We want to use these as features in our
model because particular events might be more popular than others. We add these features to
our model and we show the resulting ROC curve in figure 5.7.
Figure 5.7: Cross-validated ROC curve for full model including datetime and text features.
The AUC metric in figure 5.7 does not increase over our previous model, which included only
basic and datetime features. This tells us that the text of an event description alone does not
make that event more likely to interest users. In this model we are not addressing the
interests of individual users, but those of the user base in general. In a real recommendation engine, we
could build a model for each user, or each class of users. Other popular methods for
recommendation engines use connections between events, users, and user friends to generate
recommendations. In later chapters, we will look at unsupervised learning and collaborative
filtering for recommendation systems.
Figure 5.8: The decision boundary of two different models fit on the same training data. In one of the models,
the decision boundary is too detailed so the classification performance on new data will be affected.
5.3 Feature selection
In this section we will look at methods for selecting a subset of the features in order to avoid over-fitting, and thus to increase the accuracy of the model when applied to new data. Algorithms differ in how susceptible they are to over-fitting, but it might be worth the effort to perform some of these optimizations whenever model accuracy is of particular importance.
Another advantage of a smaller number of features, and thus smaller models, is that the computational cost of training and prediction usually scales with the number of features. By spending some time in the model-development phase, we can save time later when re-training the model or when making predictions.
Lastly, feature selection and the related concept of feature importance can be used to gain insight into the model, and therefore into the data used to build it. In some cases the goal of building the model might not even be to make predictions, but simply to get a view of the most significant features, using them to discover patterns such as credit status being correlated with certain demographic or social factors. There might also be a cost associated with obtaining the data for specific features, a cost there is no need to incur if the feature is unimportant for the model at hand. And the importance of particular features can reveal valuable insights into the predictions returned by the model: in many real-world use cases it's important to understand something about why a certain prediction was made, and not just the answer itself.
With that, we should be well motivated to look more deeply into the concept of feature
selection and the most common methods used.
The simplest way to select the optimal subset of features is to try out all combinations, i.e. to build a model for every possible subset of features and use our knowledge from chapter 4 to measure its performance. Unfortunately, with N features there are 2^N - 1 non-empty subsets, so even with a small number of features this approach quickly becomes infeasible. Because of this, we have to
use techniques that can approximate the optimal subset of features. In the next few subsections we will investigate some of these methods. One of the most widely used classes of methods is forward selection / backward elimination, which we will cover in the next subsection; a number of other heuristic methods used in practice are discussed in the subsection after that.
Examples of built-in feature selection methods are the weights assigned to features by linear and logistic regression algorithms, and the feature importances in decision trees and ensemble variants such as Random Forests. For example, we can inspect the top feature importances of our Random Forest event recommendation model from the previous section.
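A minimal sketch of such an inspection, assuming a fitted Scikit-Learn RandomForestClassifier named rf and the list of feature names feats (names as used in the chapter's listings):

import pandas

# Pair each feature with its importance and sort, most important first
fi = sorted(zip(feats, rf.feature_importances_), key=lambda x: -x[1])
print(pandas.DataFrame(fi[:10], columns=["Feature", "Importance"]))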
The process of forward feature selection is shown in figure 5.9. Depending on the number of features, a large number of models will potentially need to be built. If the algorithm is run to the end, you will need N + (N-1) + (N-2) + ... + 2 + 1 = N(N+1)/2 model builds. For 20, 100, 500 and 1000 features this is 210, 5050, 125250 and 500500 model builds, respectively. In addition, each cross-validation iteration requires k models to be built, so if a single model build takes any significant amount of time, this quickly becomes unmanageable. For smaller sets of features, or when running a smaller number of iterations (for example, because the increase in accuracy quickly levels off), this approach is very effective in practice. Figure 5.10 shows the
equivalent process of backward elimination. The computational requirements are the same as for forward selection, so the choice between forward and backward methods is usually a question of the problem at hand and the choice of algorithm. Some algorithms, for example, perform worse on a very small set of features, in which case backward elimination is the better approach.
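To make the procedure concrete, here is a minimal sketch of forward selection; the Random Forest base model and all helper names are illustrative assumptions, not the book's own listing:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, feats, n_keep):
    """Greedily grow the feature set, adding the best candidate each round."""
    selected = []
    while len(selected) < n_keep:
        # Cross-validated score for each candidate added to the current set
        scores = {f: cross_val_score(RandomForestClassifier(n_estimators=50),
                                     X[selected + [f]], y, cv=5).mean()
                  for f in feats if f not in selected}
        selected.append(max(scores, key=scores.get))  # keep the best feature
    return selected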
A useful way to visualize an iterative feature selection procedure is by plotting a vertical bar
chart like the one seen in figure 5.11.
Figure 5.11: The iterative feature selection bar chart, showing the evolution of accuracy during the feature selection procedure, in this case a forward selection algorithm. Any measure of accuracy (see chapter 4) that makes sense for the problem can be used here.
As we mentioned in the callout in the previous sections, some ML algorithms have built-in methods for feature selection. Instead of basing the feature selection on the built-in feature rankings alone, we can use a hybrid approach where the built-in feature importance is used to find the best or worst feature in each iteration of the forward selection or backward elimination process. This can significantly reduce the computation time when there are many features, but will likely yield less accurate approximations of the optimal feature subset.
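A minimal sketch of this hybrid idea, assuming a model that exposes built-in importances, such as Scikit-Learn's RandomForestClassifier; all names here are for illustration only:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def hybrid_backward_elimination(X, y, feats, n_drop):
    """Each round, drop the feature the model itself ranks least important."""
    active, history = list(feats), []
    for _ in range(n_drop):
        rf = RandomForestClassifier(n_estimators=50).fit(X[active], y)
        worst = active[rf.feature_importances_.argmin()]  # least important
        active.remove(worst)
        score = cross_val_score(rf, X[active], y, cv=5).mean()
        history.append((worst, score))  # data for a plot like figure 5.11
    return active, history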
Figure 5.12: An example showing the decision boundary for classifying red vs. blue dots with only two features. The model was built on many more features, but the sqrt(Fare) and Age features were found by our feature selection algorithm to be important for this particular problem (Titanic survival prediction). This plot was first introduced in section 3.2.
In the next section we will look at an example of how feature selection can be very useful in
real-world problems.
Figure 5.13: Examples of real supernova images are shown in the panel on the left; examples of bogus candidate events are shown in the panel on the right. The job of the classifier is to learn the difference between these two types of candidates from features extracted from the images (these are obvious examples; many are hard to classify even for trained humans).
The real/bogus classifier is built by first processing the raw image data into a set of features, some of which we'll discuss in the image-features section of the next chapter. We then run the featurized data through a Random Forest algorithm to build the classifier, and perform various model optimizations such as the ones outlined in chapters 3-5. The last step before putting this model into the live stream from the telescope is to determine the best features, avoid over-fitting, and make the model as small as possible in order to support the real-time requirements of the project. The feature selection plot, a slightly more advanced version of figure 5.11, is shown in figure 5.14.
Figure 5.14: The feature selection plot showing a backward elimination process, taken directly from the scientific publication. Each feature, from the bottom up, was selected for removal as the algorithm progressed. The gray bars show the performance metric obtained at each step (smaller is better in this case) by removing the feature, with standard deviation from cross-validation. After removing 23 features (out of 44), the cross-validated performance gain levels off, and it eventually becomes much worse when too many features have been removed. In the end, a significant 5 percentage points were gained in model performance by removing noisy features.
Now, knowing which features are most important for the model, we can plot those features against real and bogus events in order to visualize how a particular feature helps solve the problem. Figure 5.15 shows the performance of four of the best features.
Figure 5.15: Visualization of the performance of four individual features chosen by our feature selection
algorithm to be among the best features for our model. The histograms show how many real (purple) or
bogus (cyan) events take on a particular value of the feature. We see that the distributions of real vs. bogus
events are quite different in the amp and flux_ratio features, and they are indeed selected as the top
performing features in our feature selection procedure.
5.4 Summary
In this chapter we introduced the concept of feature engineering: transforming raw data in order to improve the accuracy of ML models. The primary takeaways from this chapter are:
• Feature engineering is valuable for several reasons:
o It can create features that are more closely related to the target variable,
o it enables you to bring in external data sources,
o it allows you to use unstructured data,
o it can enable you to create features that are more interpretable, and
o it gives you the freedom to create lots of features and then choose the best subset via feature selection.
• Feature selection was introduced as a rigorous way to select the most predictive subset of features from a dataset.
Word: Definition
Feature engineering: Transforming input data to extract more value and improve the predictive accuracy of ML models.
Feature selection: The process of choosing the most predictive subset of features out of a larger set.
Forward selection: A version of feature selection that adds the next most important feature, one at a time, conditional on the current active feature set.
Backward elimination: A version of feature selection that removes the next least important feature, one at a time, conditional on the current active feature set.
Bag of words: A method for turning arbitrary text into numerical features for use by an ML algorithm.
In the next chapter we expand on the simple feature engineering approaches presented in
this chapter to perform more advanced feature engineering on data such as text, images and
time-series.
6
Example: NYC Taxi Data
1. Initially released in a blog post by Chris Wong: http://chriswhong.com/open-data/foil_nyc_taxi/
Figure 6.1: The first 5 rows of the NYC taxi trip and fare record data. Most of the columns are rather self-explanatory, but we will introduce some of them in more detail in the body text.
The medallion and hack_license columns look like simple ID columns that are less useful
from an ML perspective. A few of the columns look like categorical data, like vendor_id,
rate_code, store_and_fwd_flag and payment_type. Figure 6.2 shows the distribution of
values in these categorical columns.
Figure 6.2: The distribution of values across some of the categorical-looking columns in our dataset.
Let us look at some of the numerical columns in the dataset. It is interesting to validate, for
example, that there are correlations between things like the duration (trip_time_in_secs),
distance, and total cost of a trip. Figure 6.3 shows scatter plots of some of these plotted
against each other.
Figure 6.3: Scatter plots of trip time in seconds versus the total cost and the trip distance, respectively. We see a certain amount of correlation, as expected, but the scatter is still relatively high. There are also some less sensible clusters, such as a lot of zero-time trips, some of them very expensive.
Lastly, Figure 6.4 visualizes the pickup locations in the latitude/longitude space, defining a
map of NYC taxi trips.
Figure 6.4: The latitude/longitude of pickup locations. Note that the X-axis is flipped compared to a regular map. We see a huge number of pickups in Manhattan, dropping off as we move away from the city center.
With a fresh perspective on the data we’re dealing with, let’s go ahead and define how we
want to use machine learning to solve a problem on this dataset.
Figure 6.5: The distribution of tip amounts. Around half the trips yield $0 tips, which might be more than we’d
expect intuitively.
So, the idea for our model is to build a binary classifier that separates the no-tip trips from the rest. The point is to help the taxi driver predict a no-tip situation, and to gain an understanding of why such a situation might arise; that is, what are the driving factors in the dataset, if any, behind trips yielding no tips?
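As a minimal sketch, the binary target could be derived from the tip_amount column; the file name here is a hypothetical stand-in for however the trip/fare data was loaded:

import pandas

data = pandas.read_csv("trip_fare.csv")                # hypothetical file name
data['tipped'] = (data['tip_amount'] > 0).astype(int)  # 1 = tipped, 0 = no tip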
It is very important to watch out for two pitfalls: too-good-to-be-true scenarios, and premature assumptions that are not rooted in the data.
If the accuracy is higher than you would have expected, chances are your model is cheating somewhere. The real world is creative in making life hard for a data scientist. When building our initial tip/no-tip classification models, we quickly obtained very high predictive accuracy. Because you are excited about your new dataset and model, and you feel you just nailed it, you tend to ignore the warnings of a cheating model. Having been bitten by such things a few times before, we let the results lead us to investigate further. One of the things to look at is the importance of the input features (as we will see in more detail in later sections). In our case, a certain feature totally dominated the performance of the model: payment type.
From our own taxi experience, that could make sense: people paying with credit cards may have a lower probability of tipping, while if you pay with cash, you almost always round up to whatever you have the bills for. So we started segmenting the tip vs. no-tip counts for people paying by credit card vs. cash. It turned out that the vast majority (95+%) of the millions of passengers paying by credit card did actually tip. So much for that theory. And how many of the people paying with cash tipped? All of them, it seemed.
Because of the assumptions I had already made, I hadn't looked properly: actually, none of the passengers paying with cash had tipped. Then it quickly became obvious. When a passenger pays with cash and tips, the driver does not register the tip in whatever system feeds the dataset in our possession. That part of the data can't be trusted. This is a setback.
In a situation like this, you've discovered a problem in the generation of the data, and there is no way to trust that part of the data for building an ML model. If the answers are not correct, how is the model supposed to learn?
What we did to fix it was simply to remove from the dataset all of the trips paid in cash, as sketched below. It always feels wrong to throw away data, but in this case we had no choice. Of course, there is no guarantee that the remaining tip records are all correct, but we can at least check the new distribution of tip amounts. Figure 6.6 shows the histogram of tip amounts after filtering out the cash-paid trips.
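A one-line sketch of the filter, assuming cash payments are coded as "CSH" in the payment_type column (the actual code may differ in your copy of the data):

data = data[data['payment_type'] != 'CSH']  # drop all cash-paid trips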
Figure 6.6: The distribution of tip amount after cleaning out cash payments, where we’ve discovered that tips
are never recorded in the system.
With the bad data removed, the distribution is looking much better. There are only around
5% no-tip trips. Our job in the next section is to find out why.
6.2 Modeling
With the data prepared for modeling, we can easily use our knowledge from chapter 3 to setup
and evaluate models. In the following subsections, we will build different versions of the
models, trying to improve the performance in each iteration.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_curve
from pylab import plot, xlabel, ylabel

# train_idx / test_idx: row indices from a train/test split made earlier
sc = StandardScaler()                          #a scale features to zero mean
data_scaled = sc.fit_transform(data[feats])    #a and unit variance
sgd = SGDClassifier(loss="modified_huber")     #b linear classifier that can
                                               #b output probability estimates
sgd.fit(                                       #c train on the scaled
    data_scaled[train_idx],                    #c training rows
    data['tipped'].iloc[train_idx]             #c
)                                              #c
preds = sgd.predict_proba(                     #d predict probabilities on
    data_scaled[test_idx]                      #d the held-out rows
)                                              #d
fpr, tpr, thr = roc_curve(                     #e compute and plot
    data['tipped'].iloc[test_idx], preds[:, 1])  #e the ROC curve
plot(fpr, tpr)                                 #e
plot(fpr, fpr)                                 #e diagonal baseline
xlabel("False positive rate")                  #e
ylabel("True positive rate")                   #e
The last part of Listing 6.1 plots the ROC curve for our first classifier. It is shown in Figure 6.7.
Figure 6.7: The ROC curve of the linear logistic regression tip/no-tip classifier. With an AUC of 0.5, the model
seems to perform no better than random guessing. Not a good sign for our problem.
There is no way around it: the performance of this classifier is not good. With an AUC
around 0.5, the model is basically no better than random guessing, which is obviously not very
useful. Luckily, we started out simple, and have a few ways of trying to improve the
performance of this model.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
import pandas

rf = RandomForestClassifier(n_estimators=100)  # 100-tree random forest
rf.fit(data[feats].iloc[train_idx], data['tipped'].iloc[train_idx])
preds = rf.predict_proba(data[feats].iloc[test_idx])
fpr, tpr, thr = roc_curve(data['tipped'].iloc[test_idx], preds[:, 1])  #a
plot(fpr, tpr)                                 #a plot the ROC curve
plot(fpr, fpr)                                 #a diagonal baseline
xlabel("False positive rate")                  #a
ylabel("True positive rate")                   #a
fi = sorted(zip(feats, rf.feature_importances_),  #b rank features by
            key=lambda x: -x[1])                  #b importance, highest first
fi = pandas.DataFrame(fi, columns=["Feature","Importance"])  #b
The results of running the code in Listing 6.2 are shown in Figure 6.8. We see a very significant increase in accuracy, and it is clear that there is signal in the dataset: some combinations of the input features are capable of predicting whether a taxi trip will yield any tip from the passenger.
Figure 6.8: The ROC curve of the nonlinear random forest model. The AUC is significantly better: at 0.64 it is
likely that there is real signal in the dataset.
We can also use the model to gain some insight into which features were important in creating this lift over the random baseline and successfully predicting tip vs. no-tip. Figure 6.9 (also generated by the code in Listing 6.2) shows the list of features and their relative importance for the Random Forest model. From this figure we can see that the location features are important, along with the time, distance and fare amount. Could it be that some parts of the city are less patient with slow, expensive rides, for example? We will look closer at the potential insights gained in section 7.2.5.
Figure 6.9: The important features of the random forest model. The drop-off and pickup location features
seem to dominate the model.
Now that we’ve chosen the algorithm, let us make sure we’re actually using all of the raw
features, including categorical columns, and not just plain numerical columns.
# cat_to_num: the categorical-to-dummy-columns helper introduced earlier in the book
payment_type_cats = cat_to_num(data['payment_type'])              #b convert each
vendor_id_cats = cat_to_num(data['vendor_id'])                    #b categorical column
store_and_fwd_flag_cats = cat_to_num(data['store_and_fwd_flag'])  #b to boolean (0/1)
rate_code_cats = cat_to_num(data['rate_code'])                    #b columns
data = data.join(payment_type_cats)                               #c join the new
data = data.join(vendor_id_cats)                                  #c columns back onto
data = data.join(store_and_fwd_flag_cats)                         #c the dataset
data = data.join(rate_code_cats)                                  #c
After creating the booleanized columns, we run the data through Listing 6.2 again, and
obtain the ROC curve and feature importance list shown in Figure 6.10.
Figure 6.10: The ROC curve and feature importance list of the Random Forest model with all categorical
variables converted to boolean (0/1) columns, one per category per feature. The new features are bringing
new useful information to the table, as the AUC is seen to increase from the previous model without
categorical features.
As the model performance keeps increasing, we can think about what else to try. We have not done any real feature engineering yet, of course: the data transformations applied so far count as basic data pre-processing.
import pandas

pickup = pandas.to_datetime(data['pickup_datetime'])        #a parse the pickup and
dropoff = pandas.to_datetime(data['dropoff_datetime'])      #a drop-off timestamps
data['pickup_hour'] = pickup.apply(lambda e: e.hour)        #b hour of day (0-23)
data['pickup_day'] = pickup.apply(lambda e: e.dayofweek)    #b day of week (0-6)
data['pickup_week'] = pickup.apply(lambda e: e.week)        #b week of year
data['dropoff_hour'] = dropoff.apply(lambda e: e.hour)      #c the same features
data['dropoff_day'] = dropoff.apply(lambda e: e.dayofweek)  #c for the drop-off
data['dropoff_week'] = dropoff.apply(lambda e: e.week)      #c
With our datetime features in place, we can go ahead and build our new model. We run the data through the code in Listing 6.2 once again, and obtain the ROC curve and feature importances shown in Figure 6.11.
Figure 6.11: The ROC curve and feature importance list for the Random Forest model including all
categorical features and additional datetime features.
We've seen an interesting evolution in the accuracy of the model with additional data pre-processing and feature engineering: we are able to predict whether a passenger will tip the driver with an accuracy above random. At this point, we've only looked at improving the data in order to improve the model. There are two other ways we could try to improve upon this model: vary the model parameters, in case the default values are not optimal, and increase the dataset size. Throughout this chapter, we've been heavily subsampling the dataset
so the algorithms could handle it, even on a machine with 16GB of memory. We will talk more about the scalability of methods in chapters 9 and 10; in the meantime, we leave it to you to work with this data and try to improve the accuracy even further.
Figure 6.12: The geographical distribution of drop-offs colored by tip (blue) and no-tip (red).
Figure 6.12 shows an interesting trend: passengers dropped off closer to the center of the city are less likely to tip. Why is that? One possibility is that the traffic situation there creates more "slow" trips, leaving the passenger unhappy with the ride. As a non-US person, I have another theory. This particular area of the city has a high volume of both financial workers and tourists. We would expect the financial group to be distributed further south on Manhattan as well, so to my mind tourists are the more likely cause of this discrepancy. Many countries have vastly different rules for tipping
than the US: in some Asian countries people almost never tip, and in many northern European countries people tip much less, and rarely in taxis.
There are many other interesting investigations to make with this dataset. The point, of course, is that real-world data can often be used to say something interesting about the real world and the people generating the data.
6.3 Summary
In this chapter we’ve introduced a dataset from the real world, defined a problem suitable for
the machine learning knowledge that have been built up over the previous five chapters. We’ve
gone through the entire ML workflow including initial data preparation, feature engineering, and
multiple iterations of model building, evaluation and optimization. The main take-aways from
the chapter are:
• With more and more organizations producing vast amounts of data, more and more of it is becoming available within organizations, if not publicly.
• Records of all taxi trips from NYC in 2013 have been released publicly. There are a lot of taxi trips in NYC in one year!
• Real-world data can be messy. Visualization and knowledge about the domain help. Don't get caught in too-good-to-be-true scenarios, and don't make premature assumptions about the data.
• Start iterating from the simplest possible model. Don't spend time on premature optimization. Gradually increase complexity.
• Make choices and move on: for example, decide on an algorithm early on. In an ideal world we would try all combinations at all steps of the iterative model-building process, but some things have to be fixed in order to make progress.
• Gain insights into the model and the data in order to learn about the domain, potentially improve the model further, or present interesting insights.
Word: Definition
Open data: The term for data made available publicly by institutions and organizations.
FOIL: Freedom of Information Law; also known as FOIA (Freedom of Information Act).
Too-good-to-be-true scenario: If a model is extremely accurate compared to what you would have expected, chances are there are some features in the model, or some data peculiarities, causing the model to be "cheating".
Premature assumptions: Assuming something about the data without validation, risking biasing your views of the results.
7
Advanced feature engineering
Figure 7.1: The initial steps of the bag-of-words extraction algorithm. We use this to convert any text to numbers in figure 7.2.
representations, and you want to make sure to use an ML algorithm that handles sparse data in
case you do. We’ll discuss that further in the Vectorization subsection below.
The next step in our BoW algorithm is to apply any necessary transformations to the tokens extracted from the text. A good example of a transformation is converting all words to lower case, so that we don't produce features for both "fox" and "Fox", which would only add noise to the model. In some cases, however, you may want to preserve the case, if that makes sense for your project. Other transformations could include handling hyphens, special characters, etc.
Finally, we can define the dictionary from which we will generate our text features. For machine learning projects it's common to set a limit on the number of features, and hence on the number of words in our dictionary. This is usually done by sorting the words by occurrence count and using only the top-N words, as sketched below.
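A minimal sketch of these steps in plain Python: tokenize, lower-case, count, and keep the top-N words. The input texts are hypothetical:

from collections import Counter

texts = ["The quick brown fox jumps over the lazy dog",
         "The dog barks at the fox"]

counts = Counter()
for text in texts:
    counts.update(text.lower().split())  # tokenize on whitespace, lower-case

dictionary = [word for word, n in counts.most_common(10)]  # top-N dictionary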
VECTORIZATION
We can use our bag-of-words dictionary to generate features to use in our ML models. After
defining the dictionary, we can convert any text to a set of numbers corresponding to the
occurrences of each dictionary word in the text. Figure 7.2 shows this process.
Figure 7.2: Using the vocabulary, we can now represent each text as a list of numbers. The rows show the counts for the two small texts in figure 7.1, and the counts for the Wikipedia page about the sentence "The quick brown fox jumps over the lazy dog", which is an English pangram: it includes all the letters of the English alphabet.
Now, there is a problem we haven't discussed yet. In most natural-language texts there are many words that are not very important for understanding the topics of the text, but are simply "filler" in the language. This includes words such as "the", "is", "and", etc. In NLP research these words are called stop-words, and they are usually removed from the dictionary, as they may take space from actually important words. With our words already sorted by occurrence, the usual way to remove stop-words is simply to cut off at some excess word count. In figure 7.2 we see an example where a larger text has a much higher count of the
word "the" than of any other, more meaningful, word. The problem then becomes defining the threshold at which a particular word is a stop-word rather than a meaningful word. Most NLP libraries, such as the NLTK Python library, include pre-built stop-word lists for a range of languages, so you don't have to build one every time. In some cases, though, the list of stop-words will be specific to your project.
While not very apparent in figure 7.2, any realistic dictionary will have a large number of words, and usually only a small subset of them is present in the texts we generate features for. This combination means our text features include a large number of zeros: only a small number of the dictionary words are actually found in a given text, and we call such bag-of-words features sparse. If you have a large number of sparse features – it's not uncommon to have 1000 features with only a few percent non-zero elements – it's a good idea to choose an ML algorithm that handles sparse features natively, or one that can deal with many low-significance features without sacrificing accuracy. The Naïve Bayes algorithms in the Scikit-Learn Python library handle sparse data natively and are therefore well suited for text classification problems (see the sketch below). Algorithms such as Random Forest are known to handle a large number of low-significance features well, although your mileage may vary, and you should always test using the evaluation and optimization techniques discussed in chapter 4.
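As an illustration, here is a minimal sketch of a sparse text classifier in Scikit-Learn; the texts and labels are hypothetical:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great concert downtown", "boring lecture", "fantastic jazz show"]
labels = [1, 0, 1]  # hypothetical interesting / not-interesting labels

vecs = CountVectorizer().fit_transform(texts)  # sparse matrix of word counts
clf = MultinomialNB().fit(vecs, labels)        # Naive Bayes consumes it natively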
tf-idf can be very useful for generating good ML features from any corpus of text. It can also be useful in other areas, such as search: because we generate a vector of numbers for every document, we can compute "distances" between documents simply as distances between their tf-idf vector representations. If we treat a user's search query as a document, we can compute its distance to every document in our index, and hence return a ranked list of documents based on the query. Code listing 7.1 in the next section shows how to use the Scikit-Learn Python library to generate tf-idf vectors from documents, along with a more advanced technique called latent semantic indexing.
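Here is a minimal sketch of such a tf-idf search using cosine similarity; the documents and query are hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["machine learning in the real world",
        "cooking recipes for busy people",
        "feature engineering for ML models"]
tfidf = TfidfVectorizer().fit(docs)
doc_vecs = tfidf.transform(docs)

query_vec = tfidf.transform(["learning features for machine learning"])
scores = cosine_similarity(query_vec, doc_vecs)[0]  # similarity to each document
ranking = scores.argsort()[::-1]                    # best matches first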
While it is interesting to understand the principles of LSA, not everyone knows enough linear algebra to do these calculations by hand. Luckily, there are plenty of implementations that can readily be used in your ML project. The Scikit-Learn Python library includes the functionality needed to run LSA by (1) using tf-idf to generate the term-document matrix, (2) performing the matrix decomposition and (3) transforming the documents to vectors:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def latent_semantic_analysis(docs):
    tfidf = TfidfVectorizer()             # Using default parameters
    tfidf.fit(docs)                       # Creating the dictionary
    vecs = tfidf.transform(docs)          # Using the dictionary to vectorize documents
    svd = TruncatedSVD(n_components=100)  # Generating the 100 top components
    svd.fit(vecs)                         # Creating the SVD matrices
    return svd.transform(vecs)            # Finally use LSA to vectorize the documents
In the next section, we will look at a few advanced extensions of LSA that have recently become quite popular in the field of topic modeling.
PROBABILISTIC METHODS
LSA is based on linear algebra (math with vectors and matrices), but an equivalent analysis can be done using probabilistic methods, where each document is modeled as a mixture of topic distributions. These concepts are all relatively advanced, and we won't go into the mathematical details here, but the probabilistic approach can achieve better model accuracy on some datasets.
The probabilistic analogue of LSA is known as pLSA (probabilistic LSA). A more widely used version of this is latent Dirichlet allocation (LDA), where specific assumptions are made about the distribution of topics: basically, we build in the assumption that a document can be described by a small set of topics and that any term (word) can be attributed to a topic. In practice, LDA seems to perform very well on diverse datasets. In the following code listing we highlight how LDA can be used in Python with the Gensim library; note that Gensim's LdaModel works on a bag-of-words corpus and a dictionary, which we build from the tokenized documents first:

from gensim import corpora, models

def lda_model(docs):
    # docs: a list of tokenized documents, each a list of words
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    # Build the LDA model, setting the number of topics to extract
    return models.LdaModel(corpus, num_topics=20, id2word=dictionary)
The number of topics used in the LDA model is a parameter that needs to be tuned to the data and problem at hand; we encourage you to define your performance metric and use the techniques from chapter 4 to optimize your model. It's also worth noting that the LDA implementation in Gensim can be updated on the fly with new documents if new training data is coming in continuously. There are many other interesting natural-language and topic modeling algorithms in the Gensim library that we encourage you to check out if you are interested in the field. In chapter 10 we will use some of these advanced text feature extraction techniques to solve a real-world machine-learning problem. In the next section we introduce a completely different method for text feature extraction: expanding the content of the text.
FOLLOW LINKS
If you are looking to build an ML classifier by extracting text features from tweets – e.g. for a
Twitter sentiment analysis where you classify the post as positive or negative in sentiment –
you will often find the 140-character limit problematic as there may not be enough information
to obtain the desired accuracy of the model.
One thing you might realize is that many tweets contain links to external webpages that
can hold much more text and that you could expand the tweet with the text from the link in
order to improve the quality of the data. You could even follow links deeper on the webpage to
build a larger corpus of text.
TEXT META-FEATURES
Another technique for extending the text features with valuable data is to analyze the text for what we call meta-features. Unlike the previously discussed techniques, these types of features are very much problem-dependent. Let us take the example of tweets again. In a tweet there are all sorts of valuable data special to tweets, such as hashtags and mentions, as well as meta-information from Twitter such as retweet and favorite counts.
Other examples for web-based texts could be extracting information from links, such as the top-level domain. In general texts you could extract the count of words, characters or special characters in different languages. Detecting the language could itself be an ML classifier that provides its answer as a feature to another classifier.
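A minimal sketch of a few such tweet meta-features using regular expressions; the tweet text is hypothetical:

import re

tweet = "Loving #MachineLearning with @manning! http://example.com"

features = {
    "n_chars":    len(tweet),
    "n_words":    len(tweet.split()),
    "n_hashtags": len(re.findall(r"#\w+", tweet)),
    "n_mentions": len(re.findall(r"@\w+", tweet)),
    "n_links":    len(re.findall(r"http\S+", tweet)),
}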
When choosing text meta-features, use your imagination and knowledge of the problem at hand. Remember that the ML workflow is an iterative process: you can develop a new feature, go back through the pipeline, and analyze how the accuracy improves.
We can use the text to get at other types of data as well. The text might include dates and times that could be useful for the ML model to understand, or there may be time information in the meta-data of the text. The date-time feature extractions we discussed in chapter 5 can be used in this context as well.
If you are analyzing a webpage, or there’s a URL in the text, you may have access to
images or videos that are important for understanding the context of the text. Extracting
features from images and videos requires even more advanced techniques which we will
investigate in the next section.
COLOR FEATURES
Let's say you are trying to classify images into categories based on the landscape of the image; categories could be sky, mountain or grass, for example. In this case it seems useful to represent the images by their constituent colors. We can calculate simple color statistics for each color channel of the image, such as the mean, median, mode, standard deviation, skewness and kurtosis. This yields 6x3 = 18 features for common RGB (red-green-blue channel) images, as in the sketch below.
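A minimal sketch of those 18 statistics with NumPy and SciPy, assuming img is an RGB image array of shape (height, width, 3) with integer pixel values:

import numpy
from scipy import stats

def color_statistics(img):
    features = []
    for channel in range(3):                           # R, G, B
        values = img[:, :, channel].ravel()
        features += [values.mean(),
                     numpy.median(values),
                     numpy.bincount(values).argmax(),  # mode of the integer pixels
                     values.std(),
                     stats.skew(values),
                     stats.kurtosis(values)]
    return features                                    # 6 statistics x 3 channels = 18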
Another set of features representing the colors in an image are color ranges. In table 7.1 we show a list of possible color ranges that will cover much of the color space:

Red range: max value in red channel - min value in red channel
Red to blue range: red range / (max value in blue channel - min value in blue channel + 1)
Blue to green range: (min value in blue channel - max value in blue channel) / (min value in green channel - max value in green channel + 1)
Red to green range: red range / (max value in green channel - min value in green channel + 1)
Table 7.1: Examples of color range features. We add 1 to the divisors here to avoid producing
missing values by dividing by zero.
Table 7.2: Image meta-data features that can be included in the ML pipeline.
With these simple features we may already be able to solve quite a few machine learning problems where images are part of the data. Of course, we have not yet represented any of the shapes or actual objects in the image, which will be important for many image classification problems. In the next section we will introduce some more advanced computer-vision techniques commonly used to represent objects and shapes.
EDGE DETECTION
Probably the simplest way to represent shapes in images is to find their edges and build
features on those. In figure 7.3 we show an example of edge detection in an image.
Figure 7.3: The infamous Lena image on which we’ve run the Sobel edge detection algorithm.
There are a number of well-known algorithms for finding edges in an image; among the most commonly used are the Sobel and Canny edge-detection algorithms. In figure 7.3 we've used the Sobel algorithm to find the edges in the photograph. A simple example of edge detection with an image-processing library such as Scikit-Image could look like the following sketch (using one of its built-in grayscale test images):
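import skimage.data, skimage.filters

image = skimage.data.camera()          # a built-in grayscale test image
edges = skimage.filters.sobel(image)   # 2-D array of edge magnitudes, one per pixel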
Now that we've extracted edges from our images, there are different ways of building features on them. The simplest is to calculate a single number representing the total amount of edge in an image: if edges is our edge image and res is the resolution of the image (its total number of pixels), the feature is simply sum(edges) / res.
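In code, continuing from the edge image in the sketch above, that feature is a one-liner:

edge_score = edges.sum() / edges.size  # total edge magnitude, normalized per pixel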
Together with other features, this may be useful for determining objects of interest. We can define other edge-based features depending on our use case; for example, we could calculate the above edge score separately for multiple parts of the image, laid out in a grid.
import skimage.data, skimage.color, skimage.feature

image = skimage.color.rgb2gray(skimage.data.lena())
hog_feats, hog_image = skimage.feature.hog(image, orientations=9,
    pixels_per_cell=(8,8), cells_per_block=(3,3),
    normalise=True, visualise=True)  # returns the feature vector and a visualization
In listing 7.3 you can see how we calculate the HOG features very easily, while defining the number of orientations to consider, the size of the cells in pixels, the size of the blocks in cells, and whether to normalize and visualize the result. The visualization is shown in figure 7.4.
Figure 7.4: Example of applying the HOG transformation on the Lena image. This image is from the HOG
example page on Scikit-Image documentation.
With HOG features we have a powerful way to find objects in images. As with everything, there are certain cases – e.g. when the object changes orientation significantly – where HOG does not work very well. As usual, you should test the ML system properly to determine the usefulness of these features for the problem at hand.
DIMENSIONALITY REDUCTION
We’re almost always in the game of dimensionality reduction when performing feature
extraction, perhaps except for the content expansion methods in the previous section.
However, there are a few techniques that are commonly known and used for dimensionality
reduction in general, and the most widely used is called principal component analysis
(PCA).
PCA allows us to take a set of images and find "typical" images that can be used as building blocks for representing the original images. The combination of the first couple of principal components allows you to rebuild a large fraction of the training images, while subsequent components cover ever more infrequent patterns in the images. Features for a new image are generated by finding the "distance" from each principal image, thus representing the new image by a single number per principal image. You can use as many principal components as make sense in your ML problem.
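A minimal sketch with Scikit-Learn, where images is a hypothetical array with one flattened image per row:

from sklearn.decomposition import PCA

# images: numpy array of shape (n_images, height * width)
pca = PCA(n_components=10)          # keep the 10 leading principal components
pca.fit(images)
features = pca.transform(images)    # 10 numbers per image, usable as ML features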
PCA is a linear algorithm – i.e., it cannot represent inherently non-linear data. There are several extensions of PCA, as well as other types of non-linear dimensionality reduction; an example we've had good experience with is diffusion maps.
computationally scalable, they ended up as last resorts when real-world problems needed solving. Thanks to several breakthroughs in machine-learning research, however, these issues have largely been addressed with the advent of deep neural nets (DNNs), which are now considered the state of the art for many ML problems, especially those dealing with images, video or voice. Figure 7.6 shows the layout of a simple neural net.
Figure 7.6: A simple artificial neural network. Deep neural nets are made of many layers of these simple
networks. Image from Wikipedia.
Each layer in a deep neural net is capable of defining a set of new features that are useful for the problem at hand. The weights between nodes then define the importance of those features for the next layer, and so forth. This approach was traditionally prone to over-fitting, but recently developed techniques allow node connections to be removed in a way that maintains accuracy while decreasing the risk of over-fitting. Deep learning (DNNs are also known as deep belief networks) is still a relatively new field, so we encourage the reader to follow developments in this area if they are interested in the state of the art.
TIME-SERIES DATA
There are two main types of time-series data, classical time series and point processes.
Classical time series consist of numerical measurements that are taken over time. Classical
time series typically have measurements that are evenly spaced over time (e.g., hourly, daily,
weekly, etc.) but can also consist of irregularly sampled data. Examples of classical time series
data include:
• The value of the stock market, in $B (e.g., measured hourly, daily or weekly)
• The day-to-day energy consumption of a commercial building or residential home
• The value, in dollars, of a client’s bank account over time
• Sets of diagnostics monitored in an industrial manufacturing plant, e.g., physical
performance measurements of different parts, measurements of plant output over
time, etc.
Point processes, on the other hand, are collections of events that occur over time. As
opposed to measuring numerical quantities over time, point processes consist of a time stamp
for each discrete event that happens, plus (optionally) other meta-data about the event such
as category or value. For this reason, point processes are also commonly referred to as event
streams. Examples of point processes include:
• The activity of a web user, measuring the time and type of each click (this is also
called clickstream data)
• Worldwide occurrences of earthquakes, hurricanes, disease outbreak, etc.
• The individual purchases made by a customer throughout the history of their account
• Event logs in a manufacturing plant, recording every time an employee touches the
system and every time a step in the manufacturing process is completed
An astute reader may note that for some time series, there is a one-to-one mapping
between the classical time-series representation and the underlying point process. For
example, a customer’s bank account can easily be viewed either as the value of the account
over time (classical time series) or as a list of the individual transactions (point process). This
correspondence can be useful in creating different types of time-series features on a single
data set. However, the conversion is not always possible (e.g., it is difficult to imagine what a
classical time series related to simple web clicks would be).
To make this more concrete, let us look at a data set that can just as easily be viewed as a point process or as a classical time series. In Table 7.3 we show the first few rows of a crime data set from the city of San Francisco, collected between 2003 and 2014 (the data set is publicly available at https://data.sfgov.org). In all, the data set covers more than 1.5 million crimes that occurred in the city. The data for each crime include the exact date and time of the crime, the type of crime, and location information.
Table 7.3: San Francisco crime data in its raw form, as a sequence of events.
With this raw data, there is a multitude of ways we can aggregate it into classical time-series data: by year, by month, by day of week, by district, etc. In Listing 7.4 we demonstrate how to aggregate the raw event data into a time series of the monthly number of crimes in San Francisco. The resulting time series of integer crime counts by month is plotted in Figure 7.7. The data show a marked decline from the rate of 13,000 crimes per month in 2003, with a recent up-tick in crime activity.
import pandas as pd
from datetime import datetime

df = pd.read_csv("sfpd_incident_all.csv")
df['Month'] = df['Date'].map(lambda x:                         # keep only month/year
    datetime.strptime("/".join(x.split("/")[0::2]), "%m/%Y"))  # from "MM/DD/YYYY"
crimes_by_month = df.groupby('Month').size()                   # monthly crime counts
Figure 7.7: Classical time series of monthly crime count in San Francisco. These data were processed from
the raw event data. For ML modeling, we can derive features from the event data, the classical time series,
or both.
# windowed differences: compare event counts between two seasonal windows
# window 2 = spring 2013
window2 = (datetime(2013,3,22), datetime(2013,6,21))
dates = pd.to_datetime(df['Date'])
count2 = dates.between(window2[0], window2[1]).sum()
Figure 7.8: Left: Correlation of the original time series (black line) and 12-month lagged time series (blue line)
defines the 12-month autocorrelation. Right: The autocorrelation function for the SF crime data. The
autocorrelation is high for short timescales, showing high dependence of any month’s crime on the previous
months’ values.
Fourier analysis is one of the most commonly used tools for time-series feature
engineering. The goal of Fourier analysis is to decompose a time series into a sum of sine and
cosine functions over a range of frequencies. This decomposition is achieved using the
discrete Fourier transform, which computes the spectral density of the time series (how
strongly it correlates with a sinusoidal function) at each frequency. The resulting
decomposition of a time series into its component spectral densities is called the
periodogram. Figure 7.9 shows the periodogram of the San Francisco crime data, computed
using the scipy.signal.periodogram function (several Python modules have methods for
periodogram estimation). From the periodogram, a number of ML features can be computed,
such as the spectral density at specified frequencies, the sum of the spectral densities
within frequency bands, or the location of the highest spectral density (which describes the
fundamental frequency of oscillation of the time series). See Listing 7.6 for example code
for periodogram computation and features.
from scipy import signal

freqs, psd = signal.periodogram(monthly_counts)
# Features:
# period of highest psd peak (skipping the zero-frequency term):
period_peak = 1. / freqs[1:][psd[1:].argmax()]
Figure 7.9: Left: Periodogram of the San Francisco crime data, showing the spectral density as a function of
frequency. Right: The same periodogram with the x axis transformed from frequency to period.
To conclude this section, we mention a number of classical time-series models that are
commonly used in the literature. The purpose of these models is to describe each value of the
time series as a function of its past values. These models have themselves been widely used
for prediction for decades. Now that machine learning has become a mainstay in time-series
data analysis, they are often used in conjunction with more sophisticated ML models such as
SVMs, neural nets and random forests for prediction. Examples of time-series models include:
• GARCH model – a model commonly used in financial analysis, which describes the
random noise terms of a time series using an ARMA model
• Hidden Markov Model (HMM) – a probabilistic model that describes the observed values
of the time series as being drawn from a series of hidden states which themselves follow
a Markov process.
These models can be used to compute time-series features in a number of different ways,
including:
1. Using the predicted values from each of the different models (and the differences
between the predictions) as features themselves.
2. Using the best-fit parameters of the models (e.g., the values of p and q in an
ARMA(p,q) model) as features.
3. Calculating the statistical goodness-of-fit (e.g., mean squared error) of a model and
using it as a feature (see the sketch below).
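As a rough sketch of all three feature types with the statsmodels library (the ARMA order (2, 1) is an illustrative assumption, and monthly_counts is again the series from Listing 7.4):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Fit an ARMA(2,1) model (an ARIMA model with d=0)
fit = ARIMA(monthly_counts, order=(2, 0, 1)).fit()
predictions = fit.predict()   # (1) model predictions as features
coef_features = fit.params    # (2) best-fit parameters as features
mse_feature = np.mean((monthly_counts - predictions) ** 2)  # (3) goodness of fit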
In this way, a blend of classical time-series models and state-of-the-art machine-learning
methodologies can achieve the best of both worlds: if an ARMA model is already highly
predictive for a certain time series, then the ML model that uses its predictions will also be
quite successful; but if the ARMA model does not fit well (as for most real-world data sets),
the flexibility that the ML model provides can still produce highly accurate predictions.
7.4 Summary
In this chapter we've looked at a host of methods for generating features from text, images
and time-series data for use in our ML algorithms, enabling models that are capable of
"reading" or "seeing" with human-level perception. The main take-aways were:
• For text-based datasets, we need to transform variable-length documents into a
fixed number of features. We looked at different methods for this, including:
o Simple bag-of-words methods, where particular words are simply counted
for each document.
o The tf-idf algorithm, which takes into account the frequency of words in
the entire corpus to avoid biasing the dictionary towards unimportant-but-
common words.
o More advanced algorithms for topic modeling, such as latent semantic
analysis and latent Dirichlet allocation.
o Topic-modeling techniques describe documents by a set of topics and
topics by a set of words. This allows for sophisticated semantic
understanding of documents and can help build advanced search engines,
for example, in addition to being useful for ML features.
o We were able to use the Scikit-Learn and Gensim Python libraries for
many interesting experiments in the field of text feature extraction.
• For images, we need to be able to represent different characteristics of the image
with numeric features. Some of our findings were:
o We can extract information about the colors in the image by defining color
ranges and color statistics.
o We can extract potentially valuable image meta-data from the image file
itself, for example by tapping into the EXIF meta-data available in most
image files.
o In some cases, we need to be able to extract shapes and objects from
images. Methods for doing this include:
Simple edge-detection algorithms using Sobel or Canny filters.
Sophisticated shape-extraction algorithms such as histograms of
oriented gradients (HOG).
Dimensionality-reduction techniques such as PCA.
Automated feature extraction using deep neural nets.
• Time-series data comes in two flavors: classical time series and point processes.
A plethora of ML features can be estimated from these data.
o The two principal machine-learning tasks that we perform on time-series
data are:
Forecasting the value of a single time series
Word | Definition
Feature engineering | Transforming input data to extract more value and improve the predictive accuracy of ML models.
Natural language processing | The field that aims to make computers understand natural language.
Bag of words | A method for transforming text into numbers: simply counting the number of occurrences of a particular word in a document.
Stop words | Words that are very common but not very useful as features, e.g., "the", "is", "and".
Sparse data | When data consist mostly of 0's and very few actual data cells, we call the data sparse. Most NLP algorithms produce sparse data, which we need to use directly or transform for our ML algorithms.
tf-idf | Term frequency-inverse document frequency. A bag-of-words method that is normalized by word frequencies in the entire corpus.
Latent semantic analysis | A method for finding topics of interest in documents and connecting them to a set of words.
Latent Dirichlet allocation | An extension of the idea behind LSA that works very well for many text problems in practice.
Content expansion | The process of expanding the original content into more data, e.g., by following links in a document.
Meta-features | Features that are not extracted from the content itself but from connected meta-data.
9
Scaling Machine Learning
Workflows
machines and sometimes to hundreds of machines in the cloud. This chapter is about deciding
on a scaling strategy and learning about the technologies involved.
In the first section below we will introduce the different dimensions to consider when facing
a large dataset or a requirement for high-volume predictions. We will talk about ways to
avoid investing heavily in a scaling approach, and some technologies to consider if you must.
The following section goes more deeply into the process of scaling up the ML workflow for
training models on large datasets, while the last section before the summary focuses on
scaling the prediction workflow to higher volumes or lower latency.
In the following chapter we will get to use everything we’ve learned in order to solve a real-
world big-data example, so hang on as we get through the basics here.
MODEL TRAINING
Problem | Consequence
1. Training data set is too large to fit a model | No model is fitted, so no predictions can be made
2. Training data set is so large that model fitting is very slow | Model optimization is infeasible (or impractical), so a sub-optimal model is used, sacrificing accuracy
Table 9.1. Some potential problems in Model Building that can occur due to lack of scalability, plus their
ultimate consequences.
During model building, the scalability issues that we face stem from very large training sets.
At one extreme, if our training data is so large that we cannot even fit a model (e.g., the data
does not fit in memory), then this is a problem that we must find a way around. In a nutshell,
there are three approaches you can take: (1) find a smaller subset of the data that you
can learn on, (2) use a machine with more RAM, or (3) use a more memory-efficient learning
algorithm.
In the next subsection we describe a few quick ways to reduce your data set size without
significantly impacting model quality. Then in subsection 9.1.3, we address the second option
via scalable data systems. In Section 9.2, we address the third option by introducing scalable
learning algorithms.
For slightly less sizable data sets, it may be possible to fit only relatively simple models
(like linear or logistic regression) in lieu of more sophisticated ones (like boosting), due to
the extra computational complexity and memory footprint of the latter. We may be sacrificing
accuracy by not fitting more sophisticated learning algorithms, but at least we are able to fit
a model. Here again, the same three options are viable approaches for us to try.
A related scenario is when the massive size of our training data set causes model fitting, and
in turn model optimization, to be very slow. Like the previous scenario, this can cause us to
use a less accurate model because we are forced to employ a coarse tuning-parameter
optimization strategy or to forgo tuning altogether. However, unlike the above situation, this
predicament can be solved by spinning up more nodes (horizontal scaling) and fitting models
(with different tuning-parameter choices) on the different machines, as sketched below. We
touch more on horizontal scaling in section 9.1.3.
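Here is a minimal local sketch of that pattern (X and y are assumed to exist; on a real cluster, each candidate would be dispatched to a separate machine rather than a local process):

from concurrent.futures import ProcessPoolExecutor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def evaluate(params):
    # Fit and score a single tuning-parameter candidate
    model = GradientBoostingClassifier(**params)
    return params, cross_val_score(model, X, y, cv=5).mean()

candidates = [{'n_estimators': n, 'max_depth': d}
              for n in (100, 300) for d in (2, 4)]
with ProcessPoolExecutor() as pool:
    best_params, best_score = max(pool.map(evaluate, candidates),
                                  key=lambda result: result[1])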
PREDICTION
Problem | Consequence
1. Data rates (streaming) are too fast for the ML system to keep up | The backlog of data to predict on grows and grows until ultimately breaking
2. Feature engineering code and/or prediction processes are too slow to generate timely predictions | Potential value of the predictions is lost, particularly in real-time use cases
3. Data sizes (batch) are too large to process with the model | Prediction system breaks and no predictions are made
Table 9.2. Some potential problems in ML Prediction that can occur due to lack of scalability, plus their
ultimate consequences.
In the prediction workflow, the scalability issues we face stem from data that come in very fast,
prediction or feature engineering processes that are CPU intensive, or prediction data batches
that are very large.
Luckily, all three of these challenges can be resolved with the same strategy: spinning up
more machines. The advantage of prediction, as opposed to model training, is that predictions
can be made independently for each data instance, one at a time.¹ This means that to
generate predictions, at any one time we only need to hold in memory the features of a single
instance (and the ML model that we've built). Contrast that with model training, where
typically the entire training set needs to be loaded into memory. Thus, unlike our scalability
problems during model training, prediction scalability issues do not require larger machines;
they just require more of them and, of course, an efficient data management system to
control them (more on this later).

¹ Note that there are a handful of ML use cases for which predictions cannot be made on separate instances independently. For example, a time-series forecasting model may rely on the predictions from multiple time stamps to generate a single forecast. However, for the vast majority of ML applications, individual instances can be predicted independently.
Whether we need to generate predictions more quickly, handle a higher volume of
instances, or deal with slow feature engineering or prediction processes, the solution is to spin
up more machines and send out different subsets of instances on each node for processing.
Then, assuming that the fitted model is distributed onto all of the nodes, we can generate
predictions in parallel across all machines and return them to a central database.
In section 9.3 we will dive deeply into prediction systems. There, we will explore a few
different approaches to building computational systems for fast and scalable ML prediction.
FEATURE SELECTION
Often, the breadth of a data set (its number of features) is the computational bottleneck. For
example, in genome data, a training set may contain data for millions of genes (features) but
only hundreds of patients (instances). Likewise, for text analysis the featurization of data
into n-grams can result in training sets containing millions of features. In these cases, we
can make our model training scale by first eliminating unimportant features, in a process
called feature selection.
For massive training sets, our recommended method of feature selection is LASSO. LASSO
is a linear learning algorithm that automatically searches for the most predictive subset of
features, using an efficient greedy algorithm. Computing the entire trace of the algorithm is
very efficient, giving the user insight into the complete ordering of the features in terms of
their predictive power in the linear model. Moreover, the best subset of features, in terms of
the linear model's predictions, is provided.
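As a rough sketch of LASSO feature selection with scikit-learn (the alpha value is an illustrative assumption, and X and y stand in for your feature matrix and target):

import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)  # a larger alpha drops more features
lasso.fit(X, y)
# Features with non-zero coefficients form the selected subset
selected = np.where(lasso.coef_ != 0)[0]
X_reduced = X[:, selected]

To recover the full ordering of features discussed above, you can sweep over a range of alpha values (or use sklearn.linear_model.lars_path) and record the order in which coefficients become non-zero.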
If your data are too large even to fit a LASSO model, you can consider fitting the model to
subsets of the instances in your training set (and potentially averaging across different runs of
the algorithm).
The obvious downside to LASSO feature selection is that it uses a linear model to gauge the
importance of each feature; a feature dropped by LASSO could indeed have a nonlinear
relationship with the target variable. That said, there are statistical advantages to discarding
features that have no relationship with the target variable: indeed, the overall accuracy of a
machine-learning model can increase significantly when feature selection is performed first.
INSTANCE CLUSTERING
If, after feature selection, your training data are still too large to fit a model on, you may
consider sub-selecting instances. As an absolute method of last resort, you can use statistical
clustering algorithms to identify and remove redundancies among your training instances.
For this type of data reduction, we recommend using an agglomerative hierarchical
clustering algorithm. This approach initializes with each training instance as the sole member
of its own cluster. The two closest clusters (by some pre-defined distance measure) are then
joined, and this joining of nearby clusters continues until a stopping criterion (such as the
number of clusters) is reached. We recommend stopping this process as early as possible, so
as not to reduce the information content of your data too dramatically. The final reduced
training set consists of a single instance for each of the resulting clusters, as in the sketch
below.
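A minimal sketch with scikit-learn (the cluster count and linkage choice are illustrative assumptions, and X and y are as above; note that hierarchical clustering itself can become expensive for very large instance counts):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=10000, linkage='average')
labels = agg.fit_predict(X)
# Keep one representative instance per cluster (here, the first member)
keep = np.array([np.flatnonzero(labels == c)[0] for c in np.unique(labels)])
X_reduced, y_reduced = X[keep], y[keep]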
Figure 9.1: Horizontal vs. vertical scalability for Big Data systems. In horizontally scaling systems, we simply
add new nodes (machines) to our infrastructure to handle more data or computation, as the load is
distributed relatively evenly among nodes. An example of such a system is Apache Hadoop, which we will
discuss further in the text. In vertically scaling systems, we add more resources to our existing machines in
order to handle higher loads. This approach is usually more efficient initially, but there is a limit to the amount
of resources we can add. Examples of databases that work well with this approach are SQL servers such as
PostgreSQL.
Sometimes, and perhaps more often than you might think, simply upgrading your machines
will be enough for scaling up your machine learning workflow. As we talked about in the
previous sections, after the raw data has been processed and readied for your classification or
regression problem, the data may not be big enough to warrant the complexity of a true Big
Data system. But in some cases, when dealing with data from popular websites, mobile apps,
games, or a large number of physical sensors, it’s necessary to use a horizontally scalable
system. From now on, this is what we will assume.
There are two main layers in horizontally scalable Big Data systems: storage and
computation. The storage layer is where data is stored and passed on to the computational
layer, where data is processed. One of the most popular Big Data software projects is Apache
Hadoop, which is still widely used in science and industry and is based on ideas that produced
a previously unseen level of scalability at Google and other web-scale companies in the early
2000s.
The storage layer in Hadoop is called HDFS (Hadoop Distributed File System). The basic
idea is that every piece of data is stored on multiple machines, so that it is unlikely to be lost
or become inaccessible in the case of hardware or software failure.
The computing layer of Hadoop uses a simple algorithm called MapReduce to distribute
computation among the nodes in the cluster. In the MapReduce framework, the "map" step
distributes data from HDFS onto a number of workers, which transform the data in some way,
usually keeping the number of data rows the same. This is similar to our feature engineering
processes in earlier chapters, where we simply add new columns to each row of input data. In
the "reduce" step, the mapped data is filtered and aggregated into its final form. Many data-
processing algorithms can be expressed as MapReduce jobs. By transforming algorithms to
this framework, systems like Hadoop will take care of the distribution of work among any
number of machines in your cluster.
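To make the two steps concrete, here is a minimal MapReduce-style sketch in plain Python, counting events per month as in chapter 7 (records is an assumed list of event dictionaries; a real Hadoop job would express the same logic as mapper and reducer tasks):

from itertools import groupby

def map_step(record):
    # "Map": emit a (key, value) pair for each input row
    yield (record['Month'], 1)

def reduce_step(key, values):
    # "Reduce": aggregate all values emitted for the same key
    return (key, sum(values))

pairs = sorted(kv for record in records for kv in map_step(record))
counts = [reduce_step(key, [v for _, v in group])
          for key, group in groupby(pairs, key=lambda kv: kv[0])]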
In principle, the storage and computational layers need not be integrated. Many organizations
use a storage system from a cloud provider, such as the S3 service in the Amazon Web
Services (AWS) cloud infrastructure, coupled with the Hadoop MapReduce framework for
computation. This has the benefit that AWS manages your large volumes of data, but you lose
one of the main points of the tight integration between HDFS and MapReduce: data locality.
With data locality your system becomes more efficient as computational tasks are performed
on subsets of the data close to where that data is actually stored.
The Hadoop community has developed a machine learning library called Mahout, which
implements a range of popular ML algorithms that work with HDFS and MapReduce in the
Hadoop framework. If your data is in Hadoop, Mahout may be worth looking into for your
machine learning needs. Mahout is currently moving away from the simplistic MapReduce
framework towards more advanced distributed computing approaches based on Apache Spark.
Apache Spark is a more recent and widely popular framework that is based on the ideas of
Hadoop but strives to achieve better performance by working on data in memory. Spark has
its own library of machine learning algorithms, MLlib, which is included with the framework.
Figure 9.2 shows a simple diagram of the Apache Spark ecosystem.
Figure 9.2: The Apache Spark ecosystem, built on the Spark core for distributed computation. Spark SQL
allows us to work with data in tables as we know them from the Pandas Python library or R dataframes.
Spark Streaming allows us to process data in real time as they arrive, in contrast with the batch-processing
nature of Hadoop and classic Spark. MLlib is the machine learning library that includes a range of ML
algorithms optimized for the Spark engine, and GraphX is a library allowing efficient computation on large
graphs such as the social graph of a social network.
Scalable ML algorithms are often linear, for very natural reasons. This also means that both
Mahout and MLlib include mostly linear ML algorithms or approximations to nonlinear
algorithms. In the next section, we will look at how to approach scaling for both of these
types of algorithms.
Figure 9.3: The modeling part of our familiar ML workflow diagram with scalable components.
In section 9.1.3 we introduced a few big-data-capable systems that can be used to manage
and process data of almost any size. Because the feature engineering processes we've
discussed in this book so far work on an instance-by-instance basis, they can be implemented
with simple map calls in any of those systems. In the subsection below, we look at how some
popular linear and nonlinear ML algorithms scale in the face of big data.
single computer with a fixed amount of RAM). Yet, data volumes are still outpacing the gains in
ML software and computer hardware. For some training sets, the only option is distributed
learning.
The most commonly used distributed learning algorithm is linear (and logistic) regression.
The Vowpal Wabbit (VW) library popularized this approach and has been a mainstay of
scalable linear learning across multiple machines. Distributed linear regression works roughly
as follows: first, subsets of the training data (split by instance) are sent to the various
machines in the cluster. Then, in an iterative manner, each machine solves an optimization
problem on its subset of the data and sends the result back to a central node, where the
results are combined into the best overall solution. After a small number of iterations of this
procedure, the final model is guaranteed to be close to the model that would be obtained if a
single model were fit to all of the data at once. Hence, linear models can be fit in a
distributed way to terabytes or more of data! A toy sketch of one round of this procedure
follows.
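Here is a toy single-machine illustration of the parameter-averaging idea (shards is an assumed list of (X, y) subsets, one per worker; production systems such as VW use more refined combination schemes than a plain average):

import numpy as np
from sklearn.linear_model import SGDRegressor

def fit_on_shard(X_shard, y_shard):
    # Each "worker" fits a linear model on its own subset of instances
    model = SGDRegressor(max_iter=5, tol=None)
    model.fit(X_shard, y_shard)
    return model.coef_

coefs = [fit_on_shard(X_shard, y_shard) for X_shard, y_shard in shards]
global_coef = np.mean(coefs, axis=0)  # the central node combines the results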
As we’ve discussed at numerous occasions in this book, linear algorithms are not
necessarily adequate for modeling the nuances of the data for accuracy predictions, in which
case it can be helpful to turn to nonlinear models. Nonlinear models usually require more
computational resources, and horizontal scalability isn’t always possible with nonlinear models.
This can be understood loosely by thinking of nonlinear models as also considering complex
interactions between features thus requiring a larger portion of the dataset at any given node.
In many cases it’s actually more feasible to upgrade your hardware or find more efficient
algorithms or more efficient implementations of the algorithms you have chosen. There are of
course situations where scaling nonlinear model is actually needed, and in this section we will
discuss a few ways to approach this.
POLYNOMIAL FEATURES
One of the most widely used tricks for modeling nonlinear feature interactions is to create
new features that are combinations of the existing features and then train a linear model that
includes these nonlinear features. We usually generate polynomial features, creating features
of the form feat1 * feat2, feat1^2, and so on. Listing 9.1 shows how this can be achieved
with the Scikit-Learn Python library:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

iris = datasets.load_iris()  #1
linear_classifier = LogisticRegression()  #2
linear_scores = cross_val_score(linear_classifier,  #2
    iris.data, iris.target, cv=10)  #2
print("Accuracy (linear):\t%0.2f (+/- %0.2f)" %  #2
    (linear_scores.mean(), linear_scores.std() * 2))  #2
pol = PolynomialFeatures(degree=2)  #3
nonlinear_data = pol.fit_transform(iris.data)  #3
nonlinear_classifier = LogisticRegression()  #4
nonlinear_scores = cross_val_score(nonlinear_classifier,  #4
    nonlinear_data, iris.target, cv=10)  #4
print("Accuracy (nonlinear):\t%0.2f (+/- %0.2f)" %  #4
    (nonlinear_scores.mean(), nonlinear_scores.std() * 2))  #4
Another machine learning toolkit with integrated polynomial feature extraction is the
Vowpal Wabbit (VW) library. VW can be used to build models on large datasets on a single
machine because all computation is done iteratively and out-of-core, meaning that only the
data used in the current iteration needs to be kept in memory. VW uses stochastic gradient
descent and feature hashing to deal with unstructured and sparse data in a scalable fashion.
VW can generate nonlinear models when supplied the -q and --cubic flags, which generate
quadratic or cubic features.
the loss in accuracy can again potentially be mitigated by increasing the number of trees
built.
Other algorithms that have natural approximations include k-nearest neighbors, where
special approximate tree structures can be used to increase the scalability of the model.
Support vector machines have seen multiple approximation methods that make the nonlinear
versions more scalable, including Budgeted Stochastic Gradient Descent (BSGD) and Adaptive
Multi-hyperplane Machines (AMM).
Figure 9.4: Scaling the prediction part of the ML workflow to high-volume or high-velocity predictions.
In the first section, we will look at infrastructure architectures for scaling with the volume of
predictions, so we can handle the large user base of our email client, for example. In the
subsequent section we will look at how to scale the velocity of predictions and guarantee an
answer within a given timeframe. This is important when your ML models are used in a real-
time feedback loop with, for example, humans on web or mobile devices.
Figure 9.5: A possible infrastructure for a scalable prediction service. Prediction requests are sent from the
consumer to a queue, which delegates each job to a prediction worker. The worker stores the prediction in a
database, or delivers it back to the consumer when done. If the queue clogs up because more predictions
are streaming in than the workers can handle, more workers can be spun up automatically.
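A bare-bones, single-process sketch of this queue-and-worker pattern (model is an assumed pre-fitted estimator with a predict method; a production system would use a distributed queue and separate worker machines rather than threads):

import queue
import threading

request_queue = queue.Queue()
predictions = {}  # stand-in for the results database

def worker():
    while True:
        request_id, features = request_queue.get()
        predictions[request_id] = model.predict([features])[0]
        request_queue.task_done()

for _ in range(4):  # spin up more workers if the queue keeps growing
    threading.Thread(target=worker, daemon=True).start()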
This approach requires that the model can be loaded on all worker nodes, of course, and
that there are enough workers (or some auto-scaling solution in place) to handle the volume
of predictions coming in. We can easily calculate the number of workers needed if we know
the mean prediction time per worker and the rate of requests coming in:

workers needed = request rate × mean prediction time

For example, if you have 10 prediction requests coming in per second, and your workers take
2 seconds each to finish, you need at least 10 × 2 = 20 workers to keep up. The natural
auto-scaling strategy here is to spawn new workers based on the number of requests waiting
in the queue over some period of time.
Figure 9.6: A possible architecture for a prediction pipeline with low-latency requirements. A prediction
dispatcher sends the prediction job to multiple workers, hoping that at least one will return a prediction in
time. It returns the first prediction that comes back in time to the client, and afterwards records it in a log or
database for later inspection and analytics.
Another approach is to make predictions in parts that can be combined into a full prediction,
whose accuracy usually grows with the number of parts used. A class of algorithms that lend
themselves well to this approach are the ensemble methods. One such method is random
forests, a highly accurate nonlinear algorithm that we've mentioned several times before.
Random forest models consist of an ensemble of decision trees; predictions are made by
asking each tree for its prediction and (in the case of classification) counting the votes from
the trees into the final probability. For example, if there are 10 trees and 6 of them vote
"yes" for a particular prediction instance, the forest returns 6/10 = 3/5 as the answer.
Usually, the larger the total number of trees queried, the more accurate and confident the
results are.
This can be used in a real-time prediction system to trade accuracy for speed. If each
prediction node is responsible for one tree or a list of trees from the forest, we ask each node
for its predictions. Whenever a node finishes predicting on its own trees, the result is returned
to a collector service that collects results from all nodes and makes the final prediction. The
collector can observe a time limit, and at any time return the prediction in its current state if
necessary. For example, if only 20 of the 1,000 trees have returned anything, the user still
gets an answer, but it is not as accurate as it would have been had all 1,000 trees had time
to return their answers.
Figure 9.7 shows a diagram of this architecture in action.
Figure 9.7: Suggested architecture for a prediction pipeline that is guaranteed to return within a certain time,
potentially sacrificing prediction accuracy and confidence if some of the partial predictions have not returned
yet. Partial predictions are shipped to workers by the producer, while a consumer service collects the partial
predictions, ready to return to the client when time is up.
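A minimal sketch of the collector's time-limited vote counting (the names and queue-based interface are assumptions; each worker is assumed to put a (trees_answered, yes_votes) tuple on the result queue as it finishes):

import queue
import time

def collect_votes(result_queue, n_trees_total, time_limit):
    deadline = time.time() + time_limit
    yes_votes, answered = 0, 0
    while answered < n_trees_total:
        remaining = deadline - time.time()
        if remaining <= 0:
            break  # time is up: return the best current estimate
        try:
            n_trees, n_yes = result_queue.get(timeout=remaining)
        except queue.Empty:
            break
        answered += n_trees
        yes_votes += n_yes
    return yes_votes / answered if answered else None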
To conclude this section, we will briefly mention some systems that show promise for
supporting these scalable, real-time prediction pipelines. One of them is part of the
previously mentioned Apache Spark ecosystem: Spark Streaming. With Spark Streaming you
get a set of tools and libraries that make it easier to build real-time, stream-oriented data-
processing pipelines. Don't forget that any prediction being made has to go through the same
feature engineering processes that the training data went through at model build time.
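As a hedged sketch of what a Spark Streaming scoring pipeline might look like (parse_features and model are assumed to exist; the socket source and one-second batch interval are illustrative choices):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="prediction-stream")
ssc = StreamingContext(sc, batchDuration=1)  # one-second micro-batches

# Read raw events, featurize them, and score them with the fitted model
events = ssc.socketTextStream("localhost", 9999)
scores = events.map(parse_features).map(lambda x: model.predict([x])[0])
scores.pprint()

ssc.start()
ssc.awaitTermination()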
Other projects include Apache Storm and AWS Kinesis. Each of these projects has pros and
cons for particular use cases, so we encourage the reader to investigate the appropriate tool
for their needs.
9.4 Summary
We list here the table of important terms from this chapter:
Word | Definition
Big Data | A broad term usually used to denote data management and processing problems that cannot fit on single machines.
Horizontal / vertical scalability | Scaling out horizontally means adding more machines to handle more data. Scaling up vertically means upgrading the hardware of your machines.
Hadoop, HDFS, MapReduce, Mahout | The Hadoop ecosystem is widely used in science and industry for handling and processing large amounts of data.
Apache Spark, MLlib | Apache Spark is a newer project that tries to keep data in memory to make it much more efficient than the disk-based Hadoop.
Data locality | Doing computation on the data where it resides. Data transfer is often the bottleneck in big-data projects, so avoiding data transfer can be a big gain in resource requirements.
Polynomial features | A trick to extend linear models with nonlinear polynomial feature-interaction terms without losing the scalability of linear learning algorithms.
Vowpal Wabbit | An ML tool for building models efficiently on large datasets without necessarily using a full big-data system like Hadoop.
Out-of-core | Computations are done out-of-core if we only need to keep the current iteration's data in memory.
Histogram approximations | Approximations to the training process that convert all columns to histograms for the learning process.
Feature selection | The process of reducing the size of training data by selecting and retaining the best (most predictive) subset of features.
LASSO | A linear algorithm that selects the most predictive subset of features. Very useful for feature selection.
Deep neural nets | An evolution of neural nets that scales to larger datasets and achieves state-of-the-art accuracy.
Prediction volume / velocity | Scaling prediction volume means being able to handle a lot of data. Scaling velocity means being able to do it fast enough for a specific real-time use case.
Accuracy vs. speed | For real-time predictions we can sometimes trade the accuracy of a prediction for the speed with which it is made.
Spark Streaming, Apache Storm, AWS Kinesis | Upcoming technologies for building real-time streaming systems.
Appendix A
Popular Machine Learning Algorithms
Name | Type | Nonlinear algorithm | Handles categorical features | Missing value handling | Requires normalization | Accuracy¹

¹ Note that the accuracy estimate is based on experience with the particular algorithm for complex, real-world projects. Your mileage may vary.